paleolimbot opened a new pull request, #433:
URL: https://github.com/apache/arrow-nanoarrow/pull/433
The basic idea here is to make it easy/possible to work with arrays that
come in chunks (e.g., a stream or `nanoarrow.Array`) without leaking the
`CArrayView` internals. I'm still playing with the best way to do this (it
might be just to commit to the `CArrayView` in the iterator instead of
trying to avoid leaking it).
```python
import nanoarrow as na
from nanoarrow import iterator
array = na.c_array([1, 2, 3], na.int32())
for offsets, lengths, buffers in iterator.iter_buffers_recursive(array):
    print((offsets, lengths, buffers))
#> ([0], [3], (nanoarrow.c_lib.CBufferView(bool[0 b] ),
#>  nanoarrow.c_lib.CBufferView(int32[12 b] 1 2 3)))

for offset, length, buffers, children, dictionary in iterator.iter_buffers(array):
    print((offset, length, buffers, children, dictionary))
#> (0, 3, (nanoarrow.c_lib.CBufferView(bool[0 b] ),
#>  nanoarrow.c_lib.CBufferView(int32[12 b] 1 2 3)), (), None)
```
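
To make the chunked case concrete, here's a sketch that iterates over a
two-chunk stream built with `c_lib.CArrayStream.from_array_list()` (the same
helper the benchmark below uses). The printed output is extrapolated from the
single-array example above and may not match the final reprs exactly:

```python
from nanoarrow import c_lib

# Two int32 chunks wrapped in a stream; iter_buffers_recursive() should yield
# one (offsets, lengths, buffers) tuple per chunk without materializing the
# whole stream up front.
chunks = [na.c_array([1, 2], na.int32()), na.c_array([3], na.int32())]
stream = c_lib.CArrayStream.from_array_list(chunks, na.c_schema(na.int32()))
for offsets, lengths, buffers in iterator.iter_buffers_recursive(na.Array(stream)):
    print((offsets, lengths))
#> ([0], [2])
#> ([0], [1])
```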
This benchmark is engineered to find the point where a pure Python iterator
would be slower than `pa.ChunkedArray.to_numpy()` (for a million doubles in
this specific example, PyArrow becomes faster between 100 and 1000 chunks).
```python
import nanoarrow as na
from nanoarrow import c_lib, iterator
import pyarrow as pa
import numpy as np
n = int(1e6)
chunk_size = int(1e4)
num_chunks = n // chunk_size
n = chunk_size * num_chunks
chunks = [na.c_array(np.random.random(chunk_size)) for i in range(num_chunks)]
array = na.Array(
    c_lib.CArrayStream.from_array_list(chunks, na.c_schema(na.float64()))
)
def make():
    out = np.empty(len(array), dtype=np.float64)
    cursor = 0
    for offsets, lengths, buffers in iterator.iter_buffers_recursive(array):
        # For a non-nested type there is exactly one offset/length per chunk
        offset, = offsets
        length, = lengths
        # buffers[1] is the data buffer; take a zero-copy slice of this chunk
        data = np.array(buffers[1], copy=False)[offset:(offset + length)]
        out[cursor:(cursor + length)] = np.array(data, copy=False)
        cursor += length
    return out
def make2():
    out = np.empty(len(array), dtype=np.float64)
    cursor = 0
    for offset, length, buffers, children, dictionary in iterator.iter_buffers(array):
        data = np.array(buffers[1], copy=False)[offset:(offset + length)]
        out[cursor:(cursor + length)] = np.array(data, copy=False)
        cursor += length
    return out
%timeit make()
#> 960 µs ± 62.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit make2()
#> 830 µs ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
chunked = pa.chunked_array([pa.array(item) for item in chunks])
%timeit chunked.to_numpy()
#> 2.07 ms ± 17.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
np.testing.assert_equal(make(), chunked.to_numpy())
np.testing.assert_equal(make2(), chunked.to_numpy())
```
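
For comparison, here's a simpler variant (a sketch, not part of this PR) that
collects the zero-copy per-chunk views in a list and concatenates at the end,
trading the preallocated output buffer of `make()` for readability:

```python
def make_concat():
    # Hypothetical variant: np.concatenate does the single allocation + copy
    # that make() performs incrementally into a preallocated array.
    parts = []
    for offsets, lengths, buffers in iterator.iter_buffers_recursive(array):
        offset, = offsets
        length, = lengths
        parts.append(np.array(buffers[1], copy=False)[offset:(offset + length)])
    return np.concatenate(parts)

np.testing.assert_equal(make_concat(), chunked.to_numpy())
```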