[ 
https://issues.apache.org/jira/browse/ARROW-15548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486403#comment-17486403
 ] 

Joris Van den Bossche edited comment on ARROW-15548 at 2/3/22, 11:46 AM:
-------------------------------------------------------------------------

> I think this is the same for schema-level metadata, or am I mistaken?

That's actually stored in the Parquet FileMetaData.key_value_metadata 
(separately from the ARROW:schema field). For example:

{code}
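import pyarrow as pa
import pyarrow.parquet as pq

# One field with field-level metadata, plus separate schema-level metadata
schema = pa.schema(
    [pa.field("column_name", pa.int64(), metadata={"key": "value"})],
    metadata={"schema_level_key": "value"},
)
table = pa.table({'column_name': [0, 1, 2]}, schema=schema)
pq.write_table(table, "test_field_metadata.parquet")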

>>> pq.read_metadata("test_field_metadata.parquet").metadata
{b'ARROW:schema': 
b'//////gAAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAAEAAAAAEAAAAAQAAAAQAAACA////CAAAABwAAAAQAAAAc2NoZW1hX2xldmVsX2tleQAAAAAFAAAAdmFsdWUAAAABAAAAGAAAAAAAEgAYAAgABgAHAAwAAAAQABQAEgAAAAAAAQIUAAAAWAAAAAgAAAAYAAAAAAAAAAsAAABjb2x1bW5fbmFtZQABAAAADAAAAAgADAAEAAgACAAAAAgAAAAMAAAAAwAAAGtleQAFAAAAdmFsdWUAAAAIAAwACAAHAAgAAAAAAAABQAAAAA==',
 b'schema_level_key': b'value'}
{code}

So this "schema_level_key" metadata is present in the Parquet metadata as is (I 
don't know whether we actually deduplicate it from the ARROW:schema).



> [C++][Parquet] Field-level metadata are not supported? 
> (ColumnMetadata.key_value_metadata)
> ------------------------------------------------------------------------------------------
>
>                 Key: ARROW-15548
>                 URL: https://issues.apache.org/jira/browse/ARROW-15548
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Parquet
>            Reporter: Joris Van den Bossche
>            Priority: Major
>
> For an application where we are considering using field-level metadata 
> (so not schema-level metadata) and also want to be able to save this data to 
> Parquet, I was looking into "field-level metadata" for Parquet, which I 
> assumed we supported. 
> We can roundtrip Arrow's field-level metadata to/from Parquet, as shown with 
> this example:
> {code:python}
> schema = pa.schema([pa.field("column_name", pa.int64(), metadata={"key": 
> "value"})])
> table = pa.table({'column_name': [0, 1, 2]}, schema=schema)
> pq.write_table(table, "test_field_metadata.parquet")
> >>> pq.read_table("test_field_metadata.parquet").schema
> column_name: int64
>   -- field metadata --
>   key: 'value'
> {code}
> However, this is restored only because it is stored 
> in the Arrow schema that we (by default) store in the {{ARROW:schema}} 
> metadata in the Parquet FileMetaData.key_value_metadata.
> With a small patch that allows turning this off (currently it is 
> hardcoded to be turned on in the Python bindings), it is clear that this 
> field-level metadata is not restored on roundtrip without the stored Arrow 
> schema:
> {code:python}
> pq.write_table(table, "test_field_metadata_without_schema.parquet", 
> store_arrow_schema=False)
> >>> pq.read_table("test_field_metadata_without_schema.parquet").schema
> column_name: int64
> {code}
> So there is currently no mapping from Arrow's field-level metadata to 
> Parquet's column-level metadata ({{ColumnMetaData.key_value_metadata}} in 
> Parquet's thrift structures). 
> This also means that roundtripping field-level metadata through Parquet 
> only works as long as you use Arrow for both writing and reading, but not if you 
> want to exchange such data with non-Arrow Parquet 
> implementations.
> In addition, it seems we don't even expose this field in our C++ or 
> Python bindings, so you cannot access that data if you have a Parquet file 
> (written by another implementation) that has key_value_metadata in the 
> ColumnMetaData.
> cc [~emkornfield] 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
