jorisvandenbossche commented on PR #41769: URL: https://github.com/apache/arrow/pull/41769#issuecomment-2125436248
On the issue side, I mentioned https://github.com/apache/arrow/issues/31723, which has some prior discussion about this. I agree with the general idea here that writing the Arrow schema metadata to the Parquet FileMetaData `key_value_metadata` should not depend on the `store_schema` flag. If we think we should map our schema metadata to Parquet in general, that should be done independently of writing an `ARROW:schema` key in the Parquet metadata.

> It's also weird that, when `store_schema` is true, we copy the Arrow metadata directly into the Parquet metadata, but AFAIU we _also_ serialize it as part of the Arrow schema. So it will end up essentially duplicated,

That is indeed an issue, and it is actually causing problems on the _read_ side (which is what https://github.com/apache/arrow/issues/31723 is originally about), because on the read side we then ignore all keys except `ARROW:schema` and the metadata included in that serialized schema. The main problem is that simply dropping the custom metadata from `ARROW:schema` would break compatibility (reading such a file with Arrow 16 would then not restore any metadata).
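For illustration, here is a minimal pyarrow sketch of the behavior being discussed. The file names and the `my_key` metadata key are made up for the example, and it assumes a pyarrow version where `pq.write_table` accepts the `store_schema` keyword; the exact printed output will depend on the version you run.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A table carrying custom Arrow schema-level metadata.
table = pa.table({"x": [1, 2, 3]}).replace_schema_metadata({"my_key": "my_value"})

pq.write_table(table, "with_schema.parquet", store_schema=True)
pq.write_table(table, "without_schema.parquet", store_schema=False)

for path in ("with_schema.parquet", "without_schema.parquet"):
    # Raw Parquet footer: with store_schema=True the key_value_metadata holds
    # both b"my_key" and b"ARROW:schema", and the serialized schema embeds
    # my_key a second time (the duplication described above).
    print(path, pq.read_metadata(path).metadata)
    # Round trip: on read, custom metadata is restored from the serialized
    # ARROW:schema when present; other footer keys are ignored.
    print(path, pq.read_table(path).schema.metadata)
```

Comparing the two files shows the coupling the comment argues against: whether `my_key` ends up in the Parquet `key_value_metadata` at all currently hinges on the same `store_schema` flag that controls the `ARROW:schema` key.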
