rmnskb commented on PR #47253:
URL: https://github.com/apache/arrow/pull/47253#issuecomment-3431331666

   > Looking at the original PR implementing spark schema sanitizer 
(https://github.com/apache/arrow/pull/1076/files) I would actually agree the 
proposed fix would fit in PyArrow, if we see it needs to happen on our side. I 
would still like it to be a bit less hacky, as already mentioned 😊 
   > 
   > If I have time, I will try to make the example work with the use of 
metadata as explained here: 
https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files
   
   Thank you for the update! 
   I also took a look at the original issue, and at the way Pandas uses the 
parquet writer: it leverages both `write_table` and `write_to_dataset`, so 
we'd have to cover both entry points.
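   To make "both entry points" concrete, here is a minimal sketch of the two 
pandas code paths any fix would have to hook into; `_inject_spark_schema` is a 
hypothetical placeholder for the actual injection, sketched further down:
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq

   def _inject_spark_schema(table: pa.Table) -> pa.Table:
       # Hypothetical hook: would attach the Spark schema JSON to the Arrow
       # schema metadata (see the sketch further down). Pass-through here so
       # the example stands on its own.
       return table

   def write_parquet(table: pa.Table, path: str, partition_cols=None) -> None:
       table = _inject_spark_schema(table)
       if partition_cols:
           # pandas routes partitioned writes through write_to_dataset ...
           pq.write_to_dataset(table, path, partition_cols=partition_cols)
       else:
           # ... and unpartitioned ones through write_table.
           pq.write_table(table, path)
   ```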
   If we're to continue with the approach I've proposed, we'd have to inject 
the updated metadata at write time, and to do that, we'd also have to map the 
Arrow data types to Spark ones. I still have to check whether an existing 
mapping is available, either in this repo or in Spark's, but either way, I 
cannot see another way to ensure compatibility between the two frameworks. I 
will test the compatibility, and how forgiving Spark is when it comes to 
schema ingestion.
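   For illustration, a rough sketch of what I have in mind, assuming Spark's 
StructType JSON layout and a hand-rolled (deliberately incomplete) type table; 
if an existing mapping turns up in either repo, it should replace the dict 
below:
   ```python
   import json
   import pyarrow as pa

   # Hand-rolled subset of an Arrow -> Spark type mapping; parametric and
   # nested types (decimals, timestamps, lists, structs) would need
   # dedicated handling.
   _ARROW_TO_SPARK = {
       pa.bool_(): "boolean",
       pa.int32(): "integer",
       pa.int64(): "long",
       pa.float32(): "float",
       pa.float64(): "double",
       pa.string(): "string",
   }

   def _spark_schema_json(schema: pa.Schema) -> str:
       # Mirror the StructType JSON layout Spark serializes into the
       # Parquet footer.
       fields = [
           {
               "name": field.name,
               "type": _ARROW_TO_SPARK[field.type],
               "nullable": field.nullable,
               "metadata": {},
           }
           for field in schema
       ]
       return json.dumps({"type": "struct", "fields": fields})

   def _inject_spark_schema(table: pa.Table) -> pa.Table:
       # Spark reads this footer key when loading Parquet files it did not
       # write itself.
       meta = dict(table.schema.metadata or {})
       meta[b"org.apache.spark.sql.parquet.row.metadata"] = (
           _spark_schema_json(table.schema).encode()
       )
       return table.replace_schema_metadata(meta)
   ```
   Plugging this into the wrapper above would cover both write paths; the open 
question is how forgiving Spark is when the mapping misses a type.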
   I will most probably come back with a more concrete implementation later 
this week; I'll keep you updated :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
