jorisvandenbossche commented on PR #41769:
URL: https://github.com/apache/arrow/pull/41769#issuecomment-2125436248

   Regarding the issue I mentioned, https://github.com/apache/arrow/issues/31723 
has some prior discussion about this.
   
   I agree with the general idea here: writing the Arrow schema metadata to the 
Parquet FileMetaData `key_value_metadata` should not depend on the 
`store_schema` flag. If we think we should generally map our schema metadata 
to Parquet, that should be done independently of writing an `ARROW:schema` 
key in the Parquet metadata.
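
   For concreteness, a minimal pyarrow sketch of the coupling being discussed 
(this reflects the current behavior as I understand it; the file names are 
placeholders and the exact output can vary by pyarrow version):

   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq

   # A table carrying custom schema-level metadata
   table = pa.table({"x": [1, 2, 3]}).replace_schema_metadata({"my_key": "my_value"})

   # With store_schema=False, the custom metadata currently does not end up
   # in the Parquet key_value_metadata at all
   pq.write_table(table, "no_arrow_schema.parquet", store_schema=False)
   print(pq.read_metadata("no_arrow_schema.parquet").metadata)
   # -> None: the metadata is dropped together with the ARROW:schema key

   # With store_schema=True (the default), the metadata is written twice:
   # as plain key_value_metadata entries *and* inside the serialized
   # ARROW:schema value
   pq.write_table(table, "with_arrow_schema.parquet")
   print(pq.read_metadata("with_arrow_schema.parquet").metadata)
   # -> {b'my_key': b'my_value', b'ARROW:schema': b'...'}
   ```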
   
   
   
   > It's also weird that, when `store_schema` is true, we copy the Arrow 
metadata directly into the Parquet metadata, but AFAIU we _also_ serialize it 
as part of the Arrow schema. So it will end up essentially duplicated,
   
   That is indeed an issue, and it is actually causing problems on the _read_ 
side (which is what https://github.com/apache/arrow/issues/31723 is originally 
about): on the read side we ignore all keys except `ARROW:schema`, and only 
restore the metadata included in that serialized schema.
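
   To illustrate that read-side behavior, a hedged sketch (it uses 
`ParquetWriter.add_key_value_metadata` just to get an extra key into the 
footer; the exact round-trip behavior depends on the pyarrow version):

   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq

   table = pa.table({"x": [1, 2, 3]})

   # Put a key into the Parquet key_value_metadata that is *not* part of
   # the serialized ARROW:schema
   with pq.ParquetWriter("extra_key.parquet", table.schema) as writer:
       writer.write_table(table)
       writer.add_key_value_metadata({"external_key": "added after writing"})

   # The key is present in the raw Parquet footer ...
   print(pq.read_metadata("extra_key.parquet").metadata)

   # ... but when reading back as an Arrow table, the schema metadata is
   # reconstructed from ARROW:schema, so the extra key is ignored there
   print(pq.read_table("extra_key.parquet").schema.metadata)
   ```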
   
   The main problem is that simply dropping the custom metadata from 
`ARROW:schema` would cause compatibility issues: reading such a file with 
Arrow 16 would then not restore any metadata.

