jpivarski commented on pull request #26783:
URL: https://github.com/apache/spark/pull/26783#issuecomment-951848597


   For Arrow-based access, I agree that a developer-level API is appropriate. In fact, if it had data-analyst-oriented features, developers might end up fighting against those features when building their backends.
   
   I think both "map" in chunks, like `mapPartitions` and `mapInPandas`, and the first step of a "reduce" tree are likely applications. The first is pretty direct; the second could be built from map-in-chunks by having the process on Arrow buffers return single-row Arrow buffers (one partial result per chunk), which are then flattened and further reduced on the Spark side. Or flattened, repartitioned on the Spark side, and sent back to the Arrow-based process for further reduction. As long as the Arrow-based process can return a different number of rows than it is given (the same number of chunks, but an arbitrary number of rows per chunk), all of these become possibilities. A sketch of what I mean follows.
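
   To make the reduce-tree idea concrete, here is a minimal sketch written against a `DataFrame.mapInArrow(func, schema)` shape, where `func` maps an iterator of `pyarrow.RecordBatch` to an iterator of `pyarrow.RecordBatch`. The exact name and signature are what this PR is settling, so treat them as illustrative, not final; `partial_sum` is a hypothetical user function.

```python
import pyarrow as pa
import pyarrow.compute as pc
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "x")

def partial_sum(batches):
    """Reduce all Arrow record batches for one partition to a single-row batch."""
    total = 0
    for batch in batches:
        # column 0 is "x"; pc.sum returns an Arrow scalar (None on empty input)
        s = pc.sum(batch.column(0)).as_py()
        total += s if s is not None else 0
    # Arbitrary number of rows out per chunk in; here, exactly one.
    yield pa.RecordBatch.from_pydict({"partial": [total]})

# One single-row batch comes back per partition; Spark flattens them
# into an ordinary DataFrame of partial results.
partials = df.mapInArrow(partial_sum, "partial long")

# Variant 1: finish the reduction on the Spark side.
total = partials.groupBy().sum("partial").collect()

# Variant 2: repartition the partials and send them back through the
# same Arrow-based process for another level of the reduce tree.
second_level = (partials.withColumnRenamed("partial", "x")
                        .repartition(4)
                        .mapInArrow(partial_sum, "partial long"))
```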


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


