[
https://issues.apache.org/jira/browse/ARROW-16339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534227#comment-17534227
]
Joris Van den Bossche commented on ARROW-16339:
-----------------------------------------------
[~emkornfield] Thanks for the input!
bq. Q4: Not sure on this one, it is a little surprising they aren't already covered with the Arrow Schema?
Yes, for data written using the standard Arrow mechanisms, that typically won't
happen. But this is exactly the case I ran into while reading Parquet files
created by GDAL (using the Arrow C++ APIs,
https://github.com/OSGeo/gdal/pull/5477). By using the Arrow C++ APIs, GDAL
does create a Parquet file with an {{ARROW:schema}} metadata field, but it also
adds additional (geospatial) metadata directly to the Parquet file metadata,
instead of adding it to the Arrow schema that is getting written.
From GDAL's point of view, I think it makes sense to add it directly to the
Parquet file metadata (since this metadata is also meant to be read by other,
non-Arrow readers).
For this case, I think that if the serialized "ARROW:schema" does not contain
custom metadata fields but the Parquet file metadata does, we can clearly
preserve the ones from the Parquet file metadata.
It is only when both contain key/value metadata that there might be conflicts,
and then it is less clear what to do.
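To make that concrete, below is a minimal sketch of the read-side merge rule I
have in mind. It is plain Python over plain dicts, not an existing pyarrow API,
and the conflict resolution shown is just one possible choice:
{code:python}
# file_meta: the Parquet FileMetaData::key_value_metadata (minus the
# "ARROW:schema" entry itself); schema_meta: the metadata of the schema
# deserialized from "ARROW:schema". Hypothetical helper, for illustration.
def merge_key_value_metadata(file_meta: dict, schema_meta: dict) -> dict:
    if not schema_meta:
        # The GDAL case: "ARROW:schema" carries no custom metadata, so we
        # can clearly preserve the Parquet file-level keys.
        return dict(file_meta)
    # Both sides carry metadata: keys unique to either side are kept; for
    # conflicting keys some resolution is needed (here the Parquet
    # file-level value wins, but that is exactly the open question).
    merged = dict(schema_meta)
    merged.update(file_meta)
    return merged
{code}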
That also raises yet another question, related to writing:
Q5: If we are writing data with schema-level metadata (and we decided to map
this to Parquet file-level metadata, Q1 above), do we then drop these metadata
fields from the original schema as it is serialized into "ARROW:schema"?
Writing both would give exactly such a case of duplicated metadata keys, while
writing only the actual Parquet file metadata fields (and dropping them from
the "ARROW:schema" field) would result in files where existing versions of
Arrow don't read any metadata at all.
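For the "dropping" option in Q5, a rough sketch of what the writer side could
do with existing pyarrow schema APIs (how the stripped metadata then gets
attached to the Parquet file-level key/value pairs is left out, since that is
exactly the part under discussion):
{code:python}
import pyarrow as pa

schema = pa.schema([pa.field("a", pa.int64())], metadata={"key": "value"})

# Keep the schema-level metadata aside, destined for the Parquet
# FileMetaData key/value pairs (the Q1 mapping).
file_level_metadata = schema.metadata  # {b'key': b'value'}

# Serialize the schema *without* its metadata into the "ARROW:schema"
# value, so the same keys are not duplicated in both places.
arrow_schema_payload = schema.remove_metadata().serialize()
{code}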
> [C++][Parquet] Parquet FileMetaData key_value_metadata not always mapped to
> Arrow Schema metadata
> -------------------------------------------------------------------------------------------------
>
> Key: ARROW-16339
> URL: https://issues.apache.org/jira/browse/ARROW-16339
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Parquet, Python
> Reporter: Joris Van den Bossche
> Priority: Major
>
> Context: I ran into this issue when reading Parquet files created by GDAL
> (using the Arrow C++ APIs, [https://github.com/OSGeo/gdal/pull/5477]), which
> writes files that have custom key_value_metadata, but without storing
> ARROW:schema in those metadata (cc [~paleolimbot])
> —
> Both in reading and writing files, I expected that we would map Arrow
> {{Schema::metadata}} with Parquet {{FileMetaData::key_value_metadata}}.
> But apparently this doesn't (always) happen out of the box, and only happens
> through the "ARROW:schema" field (which stores the original Arrow schema, and
> thus the metadata stored in this schema).
> For example, when writing a Table with schema metadata, this is not stored
> directly in the Parquet FileMetaData (code below is using branch from
> ARROW-16337 to have the {{store_schema}} keyword):
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table({'a': [1, 2, 3]}, metadata={"key": "value"})
> pq.write_table(table, "test_metadata_with_arrow_schema.parquet")
> pq.write_table(table, "test_metadata_without_arrow_schema.parquet",
> store_schema=False)
> # original schema has metadata
> >>> table.schema
> a: int64
> -- schema metadata --
> key: 'value'
> # reading back only has the metadata in case we stored ARROW:schema
> >>> pq.read_table("test_metadata_with_arrow_schema.parquet").schema
> a: int64
> -- schema metadata --
> key: 'value'
> # and not if ARROW:schema is absent
> >>> pq.read_table("test_metadata_without_arrow_schema.parquet").schema
> a: int64
> {code}
> It seems that if we store the ARROW:schema, we _also_ store the schema
> metadata separately. But if {{store_schema}} is False, we also stop writing
> that metadata (not fully sure if this is the intended behaviour, but it is
> the reason for the above output):
> {code:python}
> # when storing the ARROW:schema, we ALSO store key:value metadata
> >>> pq.read_metadata("test_metadata_with_arrow_schema.parquet").metadata
> {b'ARROW:schema': b'/////7AAAAAQAAAAAAAKAA4ABgAFAA...',
> b'key': b'value'}
> # when not storing the schema, we also don't store the key:value
> >>> pq.read_metadata("test_metadata_without_arrow_schema.parquet").metadata
> >>> is None
> True
> {code}
> On the reading side, it seems that we generally do read custom key/value
> metadata into schema metadata. We don't have the pyarrow APIs at the moment
> to create such a file (given the above), but with a small patch I could
> create such a file:
> {code:python}
> # a Parquet file with ParquetFileMetaData::metadata that ONLY has a custom key
> >>> pq.read_metadata("test_metadata_without_arrow_schema2.parquet").metadata
> {b'key': b'value'}
> # this metadata is now correctly mapped to the Arrow schema metadata
> >>> pq.read_schema("test_metadata_without_arrow_schema2.parquet")
> a: int64
> -- schema metadata --
> key: 'value'
> {code}
> But if you have a file that has both custom key/value metadata and an
> "ARROW:schema" key, we actually ignore the custom keys, and only look at the
> "ARROW:schema" one.
> This was the case that I ran into with GDAL, where I have a file with both
> keys, but where the custom "geo" key is not included in the serialized
> Arrow schema in the "ARROW:schema" key:
> {code:python}
> # includes both keys in the Parquet file
> >>> pq.read_metadata("test_gdal.parquet").metadata
> {b'geo': b'{"version":"0.1.0","...',
> b'ARROW:schema': b'/////3gBAAAQ...'}
> # the "geo" key is lost in the Arrow schema
> >>> pq.read_table("test_gdal.parquet").schema.metadata is None
> True
> {code}