paleolimbot opened a new pull request, #433:
URL: https://github.com/apache/arrow-nanoarrow/pull/433
The basic idea here is to make it easy/possible to work with arrays that
come in chunks (e.g., a stream or `nanoarrow.Array`) without leaking the
`CArrayView` internals. I'm still playing with the best way to do this (it
might be just to commit to the `CArrayView` in the iterator instead of
trying to avoid leaking it).
```python
import nanoarrow as na
from nanoarrow import iterator
array = na.c_array([1, 2, 3], na.int32())
for offsets, lengths, buffers in iterator.iter_buffers_recursive(array):
    print((offsets, lengths, buffers))
#> ([0], [3], (nanoarrow.c_lib.CBufferView(bool[0 b] ),
#>  nanoarrow.c_lib.CBufferView(int32[12 b] 1 2 3)))

for offset, length, buffers, children, dictionary in iterator.iter_buffers(array):
    print((offset, length, buffers, children, dictionary))
#> (0, 3, (nanoarrow.c_lib.CBufferView(bool[0 b] ),
#>  nanoarrow.c_lib.CBufferView(int32[12 b] 1 2 3)), (), None)
```
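
To make the chunked case concrete, here's a sketch that iterates over a
two-chunk stream built with `c_lib.CArrayStream.from_array_list()` (the same
helper the benchmark below uses). The printed output is extrapolated from the
single-array example above and may not match the final reprs exactly:

```python
from nanoarrow import c_lib

# Two int32 chunks wrapped in a stream; iter_buffers_recursive() should yield
# one (offsets, lengths, buffers) tuple per chunk without materializing the
# whole stream up front.
chunks = [na.c_array([1, 2], na.int32()), na.c_array([3], na.int32())]
stream = c_lib.CArrayStream.from_array_list(chunks, na.c_schema(na.int32()))
for offsets, lengths, buffers in iterator.iter_buffers_recursive(na.Array(stream)):
    print((offsets, lengths))
#> ([0], [2])
#> ([0], [1])
```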
This benchmark is engineered to find the point where a pure Python iterator
would be slower than `pa.ChunkedArray.to_numpy()` (for a million doubles in
this specific example, PyArrow becomes faster between 100 and 1000 chunks).
```python
import nanoarrow as na
from nanoarrow import c_lib, iterator
import pyarrow as pa
import numpy as np
n = int(1e6)
chunk_size = int(1e4)
num_chunks = n // chunk_size
n = chunk_size * num_chunks
chunks = [na.c_array(np.random.random(chunk_size)) for i in range(num_chunks)]
array = na.Array(
    c_lib.CArrayStream.from_array_list(chunks, na.c_schema(na.float64()))
)
def make():
    out = np.empty(len(array), dtype=np.float64)
    cursor = 0
    for offsets, lengths, buffers in iterator.iter_buffers_recursive(array):
        # For a non-nested type there is exactly one offset/length per chunk
        offset, = offsets
        length, = lengths
        # buffers[1] is the data buffer; take a zero-copy slice of this chunk
        data = np.array(buffers[1], copy=False)[offset:(offset + length)]
        out[cursor:(cursor + length)] = np.array(data, copy=False)
        cursor += length
    return out
def make2():
    out = np.empty(len(array), dtype=np.float64)
    cursor = 0
    for offset, length, buffers, children, dictionary in iterator.iter_buffers(array):
        data = np.array(buffers[1], copy=False)[offset:(offset + length)]
        out[cursor:(cursor + length)] = np.array(data, copy=False)
        cursor += length
    return out
%timeit make()
#> 960 µs ± 62.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit make2()
#> 830 µs ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
chunked = pa.chunked_array([pa.array(item) for item in chunks])
%timeit chunked.to_numpy()
#> 2.07 ms ± 17.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
np.testing.assert_equal(make(), chunked.to_numpy())
np.testing.assert_equal(make2(), chunked.to_numpy())
```
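
For comparison, here's a simpler variant (a sketch, not part of this PR) that
collects the zero-copy per-chunk views in a list and concatenates at the end,
trading the preallocated output buffer of `make()` for readability:

```python
def make_concat():
    # Hypothetical variant: np.concatenate does the single allocation + copy
    # that make() performs incrementally into a preallocated array.
    parts = []
    for offsets, lengths, buffers in iterator.iter_buffers_recursive(array):
        offset, = offsets
        length, = lengths
        parts.append(np.array(buffers[1], copy=False)[offset:(offset + length)])
    return np.concatenate(parts)

np.testing.assert_equal(make_concat(), chunked.to_numpy())
```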