[jira] [Commented] (ARROW-16339) [C++][Parquet] Parquet FileMetaData key_value_metadata not always mapped to Arrow Schema metadata

Joris Van den Bossche (Jira) Tue, 26 Apr 2022 08:43:06 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-16339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528253#comment-17528253
 ]


Joris Van den Bossche commented on ARROW-16339:
-----------------------------------------------

Note: I recently created an issue about _field-level_ metadata (ARROW-15548), 
whereas this issue is about schema-level (for Arrow) / file-level (for Parquet) 
metadata.

The issue description is long and includes examples, so to summarize the 
findings and the questions to answer:

- Do we generally want to map the schema-level metadata from Arrow to Parquet 
file-level metadata? (I think the answer is yes?)
- When _reading_, and the file metadata does not contain an "ARROW:schema" key, 
we already map the Parquet file metadata to the resulting Arrow schema 
metadata (this is OK)
- When _writing_, the {{store_schema}} flag seems to also influence whether we 
store schema metadata key/values in the Parquet file. This might be a bug? (or 
at least unintended behaviour?)
- When _reading_, and the file metadata contains both an "ARROW:schema" key 
and other keys, we ignore the other keys. Should we merge the keys from the 
metadata in the serialized "ARROW:schema" schema with the other keys in the 
Parquet FileMetaData key_value_metadata? (those could of course be duplicated 
or conflicting)
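
To make the last question a bit more concrete, here is a rough user-side sketch of what such a merge could look like. This is not what Arrow C++ currently does. The helper names {{merge_metadata}} and {{read_table_with_file_metadata}} are made up for illustration, and letting the keys from the serialized schema win on conflicts is just one possible policy (assumption):

```python
from typing import Dict, Optional

Metadata = Optional[Dict[bytes, bytes]]

# Key under which the serialized Arrow schema is stored in the Parquet file
ARROW_SCHEMA_KEY = b"ARROW:schema"


def merge_metadata(schema_md: Metadata, file_md: Metadata,
                   prefer_schema: bool = True) -> Metadata:
    """Merge Arrow schema metadata with Parquet file-level key/value metadata.

    Hypothetical helper: the "ARROW:schema" entry itself is dropped, and on
    conflicting keys the schema metadata wins unless prefer_schema=False.
    """
    file_part = {k: v for k, v in (file_md or {}).items()
                 if k != ARROW_SCHEMA_KEY}
    schema_part = dict(schema_md or {})
    merged: Dict[bytes, bytes] = {}
    if prefer_schema:
        merged.update(file_part)
        merged.update(schema_part)  # schema metadata overrides file keys
    else:
        merged.update(schema_part)
        merged.update(file_part)    # file keys override schema metadata
    return merged or None


def read_table_with_file_metadata(path: str):
    """Read a Parquet file and attach the merged metadata to the schema."""
    import pyarrow.parquet as pq  # imported lazily; only needed here

    table = pq.read_table(path)
    file_md = pq.read_metadata(path).metadata
    merged = merge_metadata(table.schema.metadata, file_md)
    return table.replace_schema_metadata(merged)
```

With something like this, a file that has both a custom "geo" key and an "ARROW:schema" key would keep the "geo" key in the resulting schema metadata. Whether the schema or the file-level keys should take precedence is exactly the open question.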

cc [~apitrou] [~emkornfield] 
  

> [C++][Parquet] Parquet FileMetaData key_value_metadata not always mapped to 
> Arrow Schema metadata
> -------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-16339
>                 URL: https://issues.apache.org/jira/browse/ARROW-16339
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Parquet, Python
>            Reporter: Joris Van den Bossche
>            Priority: Major
>
> Context: I ran into this issue when reading Parquet files created by GDAL 
> (using the Arrow C++ APIs, [https://github.com/OSGeo/gdal/pull/5477]), which 
> writes files that have custom key_value_metadata, but without storing 
> ARROW:schema in those metadata (cc [~paleolimbot]).
> —
> Both in reading and writing files, I expected that we would map Arrow 
> {{Schema::metadata}} with Parquet {{FileMetaData::key_value_metadata}}. 
> But apparently this doesn't (always) happen out of the box, and only happens 
> through the "ARROW:schema" field (which stores the original Arrow schema, and 
> thus the metadata stored in this schema).
> For example, when writing a Table with schema metadata, this is not stored 
> directly in the Parquet FileMetaData (code below is using branch from 
> ARROW-16337 to have the {{store_schema}} keyword):
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table({'a': [1, 2, 3]}, metadata={"key": "value"})
> pq.write_table(table, "test_metadata_with_arrow_schema.parquet")
> pq.write_table(table, "test_metadata_without_arrow_schema.parquet", store_schema=False)
> # original schema has metadata
> >>> table.schema
> a: int64
> -- schema metadata --
> key: 'value'
> # reading back only has the metadata in case we stored ARROW:schema
> >>> pq.read_table("test_metadata_with_arrow_schema.parquet").schema
> a: int64
> -- schema metadata --
> key: 'value'
> # and not if ARROW:schema is absent
> >>> pq.read_table("test_metadata_without_arrow_schema.parquet").schema
> a: int64
> {code}
> It seems that if we store the ARROW:schema, we _also_ store the schema 
> metadata separately. But if {{store_schema}} is False, we also stop writing 
> those metadata (not fully sure if this is the intended behaviour, and that's 
> the reason for the above output):
> {code:python}
> # when storing the ARROW:schema, we ALSO store key:value metadata
> >>> pq.read_metadata("test_metadata_with_arrow_schema.parquet").metadata
> {b'ARROW:schema': b'/////7AAAAAQAAAAAAAKAA4ABgAFAA...',
>  b'key': b'value'}
> # when not storing the schema, we also don't store the key:value
> >>> pq.read_metadata("test_metadata_without_arrow_schema.parquet").metadata is None
> True
> {code}
> On the reading side, it seems that we generally do read custom key/value 
> metadata into schema metadata. We don't have the pyarrow APIs at the moment 
> to create such a file (given the above), but with a small patch I could 
> create such a file:
> {code:python}
> # a Parquet file with ParquetFileMetaData::metadata that ONLY has a custom key
> >>> pq.read_metadata("test_metadata_without_arrow_schema2.parquet").metadata
> {b'key': b'value'}
> # this metadata is now correctly mapped to the Arrow schema metadata
> >>> pq.read_schema("test_metadata_without_arrow_schema2.parquet")
> a: int64
> -- schema metadata --
> key: 'value'
> {code}
> But if you have a file that has both custom key/value metadata and an 
> "ARROW:schema" key, we actually ignore the custom keys, and only look at the 
> "ARROW:schema" one. 
> This was the case that I ran into with GDAL, where I have a file with both 
> keys, but where the custom "geo" key is not also included in the serialized 
> arrow schema in the "ARROW:schema" key:
> {code:python}
> # includes both keys in the Parquet file
> >>> pq.read_metadata("test_gdal.parquet").metadata
> {b'geo': b'{"version":"0.1.0","...',
>  b'ARROW:schema': b'/////3gBAAAQ...'}
> # the "geo" key is lost in the Arrow schema
> >>> pq.read_table("test_gdal.parquet").schema.metadata is None
> True
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
