spenczar opened a new issue, #37318:
URL: https://github.com/apache/arrow/issues/37318
### Describe the enhancement requested
Chunked arrays are common in PyArrow, since Tables are built on chunked
arrays. Working with the underlying data requires combining chunks, and
combining chunks is surprisingly slow. In my application I need to work
with plain Arrays, and my code spends a significant amount of its time calling
combine_chunks even when there is only one chunk.
Some informal benchmarking I did in an IPython session shows an improvement
from ~150 microseconds to ~300 nanoseconds for an array of 1 million floats:
```python
In [1]: import pyarrow as pa
In [2]: import numpy as np
In [3]: data = pa.array(np.random.random(1_000_000))
In [4]: chunked_array = pa.chunked_array([data])
In [5]: %timeit chunked_array.combine_chunks()
154 µs ± 1.66 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [6]: def fast_combine_chunks(chunked_array):
...: if chunked_array.num_chunks == 1:
...: return chunked_array.chunk(0)
...: return chunked_array.combine_chunks()
...:
In [7]: %timeit fast_combine_chunks(chunked_array)
290 ns ± 2.24 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
```
### Component(s)
Python
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]