spenczar opened a new issue, #37318:
URL: https://github.com/apache/arrow/issues/37318
### Describe the enhancement requested
Chunked arrays are common in PyArrow, since Tables are built on chunked
arrays. Working with the underlying data requires combining chunks, and
combining chunks is surprisingly slow. In my application I need to work
with plain Arrays, and my code spends a significant amount of its time calling
combine_chunks even when there is only one chunk.
Some informal benchmarking I did in an IPython session shows an improvement
from ~150 microseconds to ~300 nanoseconds for an array of 1 million floats:
```python
In [1]: import pyarrow as pa
In [2]: import numpy as np
In [3]: data = pa.array(np.random.random(1_000_000))
In [4]: chunked_array = pa.chunked_array([data])
In [5]: %timeit chunked_array.combine_chunks()
154 µs ± 1.66 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [6]: def fast_combine_chunks(chunked_array):
...: if chunked_array.num_chunks == 1:
...: return chunked_array.chunk(0)
...: return chunked_array.combine_chunks()
...:
In [7]: %timeit fast_combine_chunks(chunked_array)
290 ns ± 2.24 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
```
### Component(s)
Python
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]