jpivarski commented on pull request #26783: URL: https://github.com/apache/spark/pull/26783#issuecomment-663985775
Well, the cuda-kernels in Awkward Array are currently _in development_, so we can't actually do that test right now. (One function, `ak.num`, has been executed on a GPU, and we're not even looking at performance yet.) However, this is exactly the reason why we wanted a low-level Arrow UDF: to interpret the Arrow buffers as Awkward Arrays for fast processing in Python.
Now that we have a little more experience with it, we find that the speedups from NumPy-like functions are modest (8x in the full analysis I showed in https://youtu.be/WlnUF3LRBj4 and 30x in the simpler example I used as an advertisement in this talk), but the speedups from Awkward Array combined with Numba are significant (250x for the full analysis in the same talk). This is for the same reasons as in NumPy: doing a full analysis with many passes over the same data gives you some order-of-magnitude speedup just because you avoid the Python VM with all of its dynamic type checking, but making many passes over the same arrays is not cache-efficient. Doing the analysis in one pass with NumExpr or Numba fixes this second-order problem.
This PR, by comparison, is ancient. It was written before Awkward Array was rewritten in C++, and soon it will be outdated because of the cuda-kernels as well. That's why we didn't want to focus too much on the 3x speedup in the example that was available at the time, but rather to point out the potential. Some analyses can't be written efficiently in Pandas (particularly those that would involve many DataFrames with distinct MultiIndexes that have to be joined in every step, which is true of most particle physics analyses), yet we can write efficient analyses for them in Python using other tools that start from Arrow. In these cases, just creating Pandas DataFrames that we don't use is an expensive bottleneck.
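To make the multi-pass versus single-pass point concrete, here is a minimal sketch. It is not code from the talk or from this PR: the function names and the quantity being computed are hypothetical, and it uses plain NumPy arrays rather than Awkward Arrays. It falls back to plain Python if Numba is not installed, in which case it still runs but shows no speedup.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption: Numba may be absent; fall back to an identity decorator
    # so the sketch still runs (slowly) as ordinary Python.
    def njit(func):
        return func


def sum_large_magnitudes_multipass(x, y, cut):
    # NumPy style: every operation is at least one full pass over the data
    # and allocates a temporary array of the same length.
    mag2 = x * x + y * y
    mag = np.sqrt(mag2)
    mask = mag > cut
    return mag[mask].sum()


@njit
def sum_large_magnitudes_onepass(x, y, cut):
    # Compiled-loop style: each element is read once and no temporaries
    # are allocated, so the working set stays in cache.
    total = 0.0
    for i in range(len(x)):
        mag = np.sqrt(x[i] * x[i] + y[i] * y[i])
        if mag > cut:
            total += mag
    return total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=10_000)
    y = rng.normal(size=10_000)
    a = sum_large_magnitudes_multipass(x, y, 1.0)
    b = sum_large_magnitudes_onepass(x, y, 1.0)
    print(np.isclose(a, b))
```

Both functions compute the same result; the difference is only in how many times the arrays are traversed and how many temporaries are created, which is the second-order effect described above.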
The argument for providing a more generic Arrow UDF for other tools to take advantage of was not really based on the 3x speedup in the original example, but on the general consideration that it opens doors to more significant speedups in the future.
