BryanCutler commented on pull request #34509:
URL: https://github.com/apache/spark/pull/34509#issuecomment-993166320


   The difference with #28743 is that it was trying to deal with pyarrow 
extension types. For a pandas extension type, the `__arrow_array__` interface 
will return an Arrow array, which could be either a standard Arrow type or an 
extension type. For this PR, we are talking about a standard string array, 
which PySpark can work with. If it were a pyarrow extension type instead, the 
storage type, which is a standard Arrow type, would need to be checked. 
PySpark could then work with the storage type, but that might not be very 
useful because all of the extension information would be stripped out.
   
   This PR is a step in the right direction, so I think it's ok to merge. It 
will add support for any pandas extension type that is backed by a standard 
Arrow array, although I don't think it will be able to convert the data back 
to pandas as the original extension type. To fully support pandas/pyarrow 
extension types, we would need to propagate the extension type information 
through Spark so that when the data is worked on again in Python, the 
extension part can be loaded back up. I'm not exactly sure how difficult that 
would be to do.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


