HyukjinKwon edited a comment on issue #23746: [SPARK-26761][SQL][R] Vectorized 
R gapply() implementation
URL: https://github.com/apache/spark/pull/23746#issuecomment-462671978
 
 
   > what about 
[e8982ca#diff-b11442485f6b77bf47b58b4747321638R267](https://github.com/apache/spark/commit/e8982ca7ad94e98d907babf2d6f1068b7cd064c6#diff-b11442485f6b77bf47b58b4747321638R267)
 can this be changed from file to stream also?
   
   Yes. It used a file on purpose, actually. If I use a file to transfer the 
data (instead of a socket), I can send it batch by batch in a streaming manner, 
but if I use a socket, I have to buffer all the batches and then send them at 
once due to an Arrow R API limitation (ARROW-4512).
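
   To illustrate the difference, here is a toy sketch in Python. It is not the 
SparkR or Arrow code: `make_batches`, `write_framed`, `send_streaming` and 
`send_buffered` are hypothetical names, and the length-prefixed framing is 
assumed only for this example. Both paths write the same bytes; they differ 
only in how many batches are held in memory before writing.

```python
import io
import struct
from typing import BinaryIO, Iterable, Iterator


def make_batches(n: int) -> Iterator[bytes]:
    """Hypothetical stand-in for serialized Arrow record batches."""
    for i in range(n):
        yield f"batch-{i}".encode("utf-8")


def write_framed(sink: BinaryIO, payload: bytes) -> None:
    # Assumed framing for this sketch: 4-byte big-endian length, then payload.
    sink.write(struct.pack(">i", len(payload)))
    sink.write(payload)


def send_streaming(batches: Iterable[bytes], sink: BinaryIO) -> int:
    """Write each batch as soon as it is produced.

    Returns the peak number of batches held in memory at once.
    """
    peak = 0
    for batch in batches:
        peak = max(peak, 1)  # only the current batch is held
        write_framed(sink, batch)
    return peak


def send_buffered(batches: Iterable[bytes], sink: BinaryIO) -> int:
    """Collect every batch first, then write them all at once.

    Returns the peak number of batches held in memory at once.
    """
    held = list(batches)  # everything is materialized before any write
    for batch in held:
        write_framed(sink, batch)
    return len(held)


streaming_sink = io.BytesIO()
buffered_sink = io.BytesIO()
peak_streaming = send_streaming(make_batches(3), streaming_sink)
peak_buffered = send_buffered(make_batches(3), buffered_sink)
```

The receiver sees identical data either way, so switching from the buffered 
path to the streaming one later should not need a protocol change.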
   
   I used a file there because the existing protocol for R DataFrame -> Spark 
DataFrame was already using a file, but it looks like all the other protocols 
use sockets. So I had to use a socket here to match the protocol, although I 
have to buffer all the batches due to the Arrow R API limitation.
   
   BTW, it looks like this limitation is going to be fixed in Arrow 0.14.0. As 
soon as it's fixed, we can easily match it to the Python side's vectorization, 
because I tried to follow the Python side as closely as I could.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
