This is an automated email from the ASF dual-hosted git repository.
apitrou pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/main by this push:
new 8bcdc0f384 GH-41186: [C++][Parquet][Doc] Denote PARQUET:field_id in parquet.rst (#41187)
8bcdc0f384 is described below
commit 8bcdc0f3849e616dc09b8c19bbc1387c1773639b
Author: mwish <[email protected]>
AuthorDate: Fri May 24 00:34:37 2024 +0800
GH-41186: [C++][Parquet][Doc] Denote PARQUET:field_id in parquet.rst (#41187)
### Rationale for this change
Denote PARQUET:field_id in parquet.rst
### What changes are included in this PR?
Just a doc improvement
### Are these changes tested?
No
### Are there any user-facing changes?
No
* GitHub Issue: #41186
Lead-authored-by: mwish <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: mwish <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
---
docs/source/cpp/parquet.rst | 22 ++++++++++++++++++----
1 file changed, 18 insertions(+), 4 deletions(-)
diff --git a/docs/source/cpp/parquet.rst b/docs/source/cpp/parquet.rst
index 96897d139b..9d2a5d791f 100644
--- a/docs/source/cpp/parquet.rst
+++ b/docs/source/cpp/parquet.rst
@@ -522,8 +522,8 @@ An Arrow Dictionary type is written out as its value type. It can still
be recreated at read time using Parquet metadata (see "Roundtripping Arrow
types" below).
-Roundtripping Arrow types
-~~~~~~~~~~~~~~~~~~~~~~~~~
+Roundtripping Arrow types and schema
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
While there is no bijection between Arrow types and Parquet types, it is
possible to serialize the Arrow schema as part of the Parquet file metadata.
@@ -531,8 +531,7 @@ This is enabled using :func:`ArrowWriterProperties::store_schema`.
On the read path, the serialized schema will be automatically recognized
and will recreate the original Arrow data, converting the Parquet data as
-required (for example, a LargeList will be recreated from the Parquet LIST
-type).
+required.
As an example, when serializing an Arrow LargeList to Parquet:
@@ -542,6 +541,20 @@ As an example, when serializing an Arrow LargeList to Parquet:
:func:`ArrowWriterProperties::store_schema` was enabled when writing the file;
otherwise, it is decoded as an Arrow List.
+Parquet field id
+""""""""""""""""
+
+The Parquet format supports an optional integer *field id* which can be assigned
+to a given field. This is used for example in the
+`Apache Iceberg specification <https://github.com/apache/iceberg/blob/main/format/spec.md#column-projection>`__.
+
+On the writer side, if ``PARQUET:field_id`` is present as a metadata key on an
+Arrow field, then its value is parsed as a non-negative integer and is used as
+the field id for the corresponding Parquet field.
+
+On the reader side, Arrow will convert such a field id to a metadata key named
+``PARQUET:field_id`` on the corresponding Arrow field.
+
Serialization details
"""""""""""""""""""""
@@ -549,6 +562,7 @@ The Arrow schema is serialized as a :ref:`Arrow IPC <format-ipc>` schema message
then base64-encoded and stored under the ``ARROW:schema`` metadata key in
the Parquet file metadata.
+
Limitations
~~~~~~~~~~~
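
Illustration (not part of the diff above): the "Parquet field id" section added by this commit describes a mapping between the ``PARQUET:field_id`` metadata key and an integer field id. The sketch below models that mapping with standard-library types only. It is not Arrow's implementation: ``FieldMetadata``, ``FieldIdFromMetadata`` and ``MetadataFromFieldId`` are hypothetical names, and treating a missing, malformed, or negative value as "no field id" is an assumption made for the example rather than documented behaviour.

```cpp
#include <charconv>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <system_error>

// Hypothetical stand-in for the key/value metadata attached to an Arrow field.
using FieldMetadata = std::map<std::string, std::string>;

// Writer-side sketch: look up "PARQUET:field_id" and parse its value as a
// non-negative integer. Returns std::nullopt if the key is absent or the value
// is not a non-negative integer that fits in 32 bits (assumed handling).
std::optional<std::int32_t> FieldIdFromMetadata(const FieldMetadata& metadata) {
  auto it = metadata.find("PARQUET:field_id");
  if (it == metadata.end()) return std::nullopt;

  const std::string& text = it->second;
  const char* begin = text.data();
  const char* end = begin + text.size();
  std::int32_t value = 0;
  auto [ptr, ec] = std::from_chars(begin, end, value);

  // Reject parse errors, trailing characters, and negative values.
  if (ec != std::errc() || ptr != end || value < 0) return std::nullopt;
  return value;
}

// Reader-side sketch: expose a field id as a "PARQUET:field_id" metadata entry.
FieldMetadata MetadataFromFieldId(std::int32_t field_id) {
  return {{"PARQUET:field_id", std::to_string(field_id)}};
}
```

In this model a field id survives a write/read round trip because the reader-side helper produces exactly the key that the writer-side helper looks up.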