[GitHub] [spark] Kimahriman commented on pull request #41569: [SPARK-39979][SQL][FOLLOW-UP] Support large variable types in pandas UDF, createDataFrame and toPandas with Arrow

via GitHub Sat, 15 Jul 2023 05:26:06 -0700


Kimahriman commented on PR #41569:
URL: https://github.com/apache/spark/pull/41569#issuecomment-1636752962


   I opened a PR for the Arrow issue: 
https://github.com/apache/arrow/pull/36701. After some digging, though, I 
think that issue was only causing one test to fail, a weird case that tries to 
convert a double to a string as part of the Arrow conversion. Arrow already 
supports converting a pandas Series of strings to the large_string type (when 
the NumPy dtype is object), but not a NumPy string array (when the NumPy dtype 
is utf8). The former goes through 
https://github.com/apache/arrow/blob/main/python/pyarrow/src/arrow/python/numpy_to_arrow.cc#L324C9-L324C26
 instead of the other `Visit` paths.
   
   The other test failures were just due to Arrow lacking large-type support 
when looking up the NumPy type for an Arrow type (I also added that to the 
above PR). That can be fixed on the Spark side by explicitly using NumPy's 
object dtype (`np.object_`) for string and binary types, but I'm now hitting a 
weird new test issue that I'm trying to figure out.

