Joris Van den Bossche created ARROW-15548:
---------------------------------------------

             Summary: [C++][Parquet] Field-level metadata are not supported? 
(ColumnMetadata.key_value_metadata)
                 Key: ARROW-15548
                 URL: https://issues.apache.org/jira/browse/ARROW-15548
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++, Parquet
            Reporter: Joris Van den Bossche


For an application where we are considering using field-level metadata (so 
not schema-level metadata) and also want to be able to save this data to 
Parquet, I was looking into "field-level metadata" for Parquet, which I had 
assumed we supported. 

We can roundtrip Arrow's field-level metadata to/from Parquet, as shown with 
this example:

{code:python}
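import pyarrow as pa
import pyarrow.parquet as pq

# Attach metadata to the field itself, not to the schema as a whole
schema = pa.schema([pa.field("column_name", pa.int64(), metadata={"key": 
"value"})])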
table = pa.table({'column_name': [0, 1, 2]}, schema=schema)
pq.write_table(table, "test_field_metadata.parquet")

>>> pq.read_table("test_field_metadata.parquet").schema
column_name: int64
  -- field metadata --
  key: 'value'
{code}

However, the metadata is actually restored only because it is part of the 
Arrow schema that we (by default) store under the {{ARROW:schema}} key in the 
Parquet FileMetaData.key_value_metadata.

With a small patch that makes it possible to turn this off (currently it is 
hardcoded to be turned on in the Python bindings), it becomes clear that the 
field-level metadata is not restored on roundtrip without the stored Arrow 
schema:

{code:python}
pq.write_table(table, "test_field_metadata_without_schema.parquet", 
store_arrow_schema=False)

>>> pq.read_table("test_field_metadata_without_schema.parquet").schema
column_name: int64
{code}

So there is currently no mapping from Arrow's field-level metadata to Parquet's 
column-level metadata ({{ColumnMetaData.key_value_metadata}} in Parquet's 
thrift structures). 

This also means that roundtripping field-level metadata through Parquet only 
works as long as you use Arrow for both writing and reading, not if you also 
want to exchange such data with non-Arrow Parquet implementations.

In addition, it seems we don't even expose this field in our C++ or Python 
bindings, so there is no way to access that data if you have a Parquet file 
(written by another implementation) that has key_value_metadata in the 
ColumnMetaData.

cc [~emkornfield] 

--
This message was sent by Atlassian Jira
(v8.20.1#820001)