[GitHub] [spark] jpivarski commented on pull request #26783: [SPARK-30153][PYTHON][WIP] Extend data exchange options for vectorized UDF functions with vanilla Arrow serialization

GitBox Sun, 26 Jul 2020 06:07:30 -0700


jpivarski commented on pull request #26783:
URL: https://github.com/apache/spark/pull/26783#issuecomment-663985775



   Well, the cuda-kernels in Awkward Array are currently _in development_, so 
we can't actually do that test right now. (One function, `ak.num`, has been 
executed on a GPU, and we're not even looking at performance yet.)
   
   However, this is exactly the reason why we wanted a low-level Arrow UDF, to 
interpret the Arrow buffers as Awkward Arrays for fast processing in Python. 
Now that we have a little more experience with it, we find that the speedups 
from NumPy-like functions alone are modest (8x in the full analysis I showed in 
https://youtu.be/WlnUF3LRBj4 and 30x in the simpler example I used as an 
advertisement in that talk), but the speedups from Awkward Array combined with 
Numba are significant (250x for the full analysis in the same talk). The 
reasons are the same as for NumPy: doing a full analysis with many passes over 
the same data gives you some order-of-magnitude speedup just because you avoid 
the Python VM with all of its dynamic type checking, but making many passes 
over the same arrays is not cache-efficient. Doing the analysis in one pass 
with NumExpr or Numba fixes this second-order problem.
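   
   As a rough illustration of the many-passes versus one-pass distinction (a 
minimal sketch, not code from the analysis or from this PR): the function 
names and the `px`, `py`, `pz` inputs are hypothetical, and Numba is assumed 
to be installed for any real speedup. Without it, the fallback decorator runs 
the loop in plain Python, which gives the same result but is slow.

```python
import numpy as np

try:
    from numba import njit  # assumption: Numba is available
except ImportError:
    def njit(func):
        # Fallback so the sketch still runs without Numba (no speedup).
        return func


def multi_pass(px, py, pz):
    # Each array operation is a separate pass over the data and
    # allocates a temporary array for its result.
    pt2 = px**2 + py**2
    p2 = pt2 + pz**2
    return np.sqrt(p2)


@njit
def single_pass(px, py, pz):
    # One compiled loop: each element is read once and the result is
    # written once, with no intermediate arrays.
    out = np.empty(len(px), dtype=np.float64)
    for i in range(len(px)):
        out[i] = np.sqrt(px[i]**2 + py[i]**2 + pz[i]**2)
    return out
```

   Both functions compute the same quantity; only the memory access pattern 
differs, and how much that matters depends on array size and hardware.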
   
   This PR, by comparison, is ancient. It was written before Awkward Array was 
rewritten in C++ and soon it will be outdated because of the cuda-kernels as 
well. That's why we wanted to point out the potential rather than focus too 
much on the 3x speedup in the example that was available at the time. Some 
analyses can't be written in an efficient way in Pandas (particularly those 
that would involve many DataFrames with distinct MultiIndexes that have to be 
joined in every step, which is true of most particle physics analyses), yet we 
can write efficient analyses for them in Python using other tools that start 
from Arrow. In these cases, just creating the Pandas DataFrames that we don't 
use is an expensive bottleneck. The argument for providing a more generic Arrow 
UDF for other tools to take advantage of was not really based on the 3x speedup 
in the original example, but on the general consideration that it opens doors 
to more significant speedups in the future.


