rmnskb commented on PR #47253: URL: https://github.com/apache/arrow/pull/47253#issuecomment-3431331666
> Looking at the original PR implementing the Spark schema sanitizer (https://github.com/apache/arrow/pull/1076/files), I would actually agree that the proposed fix would fit in PyArrow, if we see it needs to happen on our side. I would still like it to be a bit less hacky, as already mentioned 😊
>
> If I have time, I will try to make the example work with the use of metadata as explained here: https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files

Thank you for the update! I also took a look at the original issue, and at the way Pandas uses the Parquet writer: it leverages both `write_table` and `write_to_dataset`, so we would have to cover both options. If we continue with the approach I proposed, we would have to inject the updated metadata at write time, and to do that we would also have to map the Arrow data types to their Spark equivalents (a rough sketch is below). I still have to check whether such a mapping already exists, either in this repo or in Spark's, but I cannot see another way to guarantee schema compatibility between the two frameworks.

I will test the compatibility, and how forgiving Spark is when it comes to schema ingestion. I will most probably come back with a more concrete implementation later this week; I will keep you updated :)
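To make the idea concrete, here is a minimal sketch, assuming Spark reads its schema back from the `org.apache.spark.sql.parquet.row.metadata` key-value entry of the Parquet file; the `spark_schema_json` helper and the (deliberately incomplete) type mapping below are my own illustration, not an existing PyArrow or Spark API:

```python
import json

import pyarrow as pa
import pyarrow.parquet as pq

# Deliberately incomplete Arrow -> Spark type mapping; a real
# implementation would also need to handle nested and parametric
# types (lists, structs, decimals, timestamps, ...).
_ARROW_TO_SPARK = {
    pa.bool_(): "boolean",
    pa.int32(): "integer",
    pa.int64(): "long",
    pa.float64(): "double",
    pa.string(): "string",
}


def spark_schema_json(schema: pa.Schema) -> str:
    """Serialize an Arrow schema in Spark's StructType JSON format."""
    fields = [
        {
            "name": field.name,
            "type": _ARROW_TO_SPARK[field.type],
            "nullable": field.nullable,
            "metadata": {},
        }
        for field in schema
    ]
    return json.dumps({"type": "struct", "fields": fields})


def write_with_spark_metadata(table: pa.Table, path: str) -> None:
    # Merge the Spark schema into the existing key-value metadata so
    # that nothing already attached to the schema gets dropped.
    metadata = dict(table.schema.metadata or {})
    metadata[b"org.apache.spark.sql.parquet.row.metadata"] = (
        spark_schema_json(table.schema).encode()
    )
    pq.write_table(table.replace_schema_metadata(metadata), path)
```

The same injection would have to happen in the `write_to_dataset` path as well, which is part of why I want to verify first how forgiving Spark is when the embedded schema is missing or only partially mapped.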
