HyukjinKwon edited a comment on issue #23746: [SPARK-26761][SQL][R] Vectorized R gapply() implementation
URL: https://github.com/apache/spark/pull/23746#issuecomment-462671978

> what about [e8982ca#diff-b11442485f6b77bf47b58b4747321638R267](https://github.com/apache/spark/commit/e8982ca7ad94e98d907babf2d6f1068b7cd064c6#diff-b11442485f6b77bf47b58b4747321638R267) can this be changed from file to stream also?

Yes. It actually uses a file on purpose. If I use a file to transfer the data (instead of a socket), I can send it batch by batch in a streaming manner, but if I use a socket, I have to buffer all the batches and then send them at once because of an Arrow R API limitation (ARROW-4512).

I used a file there because the existing protocol for R DataFrame -> Spark DataFrame was already using a file, but it looks like all the other protocols use sockets. So I had to use a socket here to match the protocol, even though that means buffering all the batches because of the Arrow R API limitation.

BTW, it looks like this limitation is going to be fixed in Arrow 0.14.0. As soon as it's fixed, we can easily match it to the Python side's vectorization, because I tried to follow the Python side as closely as I could.
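
To make the trade-off above concrete, here is a minimal sketch of the two transfer strategies in plain Python. It is not the SparkR or Arrow code: the function names, the 4-byte length-prefix framing, and the use of opaque byte strings as "batches" are all assumptions made for illustration only.

```python
import io
import struct
from typing import BinaryIO, Iterable, List


def send_streaming(batches: Iterable[bytes], sink: BinaryIO) -> int:
    """Write each batch as soon as it is produced (hypothetical helper).

    Returns the largest number of payload bytes held at once,
    which is the size of the biggest single batch.
    """
    peak = 0
    for batch in batches:
        # Assumed framing: 4-byte big-endian length, then the payload.
        sink.write(struct.pack(">I", len(batch)))
        sink.write(batch)
        peak = max(peak, len(batch))
    return peak


def send_buffered(batches: Iterable[bytes], sink: BinaryIO) -> int:
    """Collect every batch first, then write them in one go (hypothetical helper).

    Returns the number of payload bytes held at once,
    which is the total size of all batches.
    """
    buffered = list(batches)  # everything must fit in memory
    payload = b"".join(struct.pack(">I", len(b)) + b for b in buffered)
    sink.write(payload)
    return sum(len(b) for b in buffered)


def read_batches(source: BinaryIO) -> List[bytes]:
    """Read length-prefixed batches back until the source is exhausted."""
    out = []
    while True:
        header = source.read(4)
        if len(header) < 4:
            break
        (size,) = struct.unpack(">I", header)
        out.append(source.read(size))
    return out
```

Both helpers put the same bytes on the wire, so the receiver cannot tell them apart. The only difference is on the sending side: the streaming variant holds one batch at a time, while the buffered variant holds all of them before the first byte is sent.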
