amol- commented on pull request #11185:
URL: https://github.com/apache/arrow/pull/11185#issuecomment-924990970


   > > That seemed a more suitable strategy because converting from `numpy.array` to `BooleanArray` should usually be zero copy and thus have little overhead
   > 
   > That's not actually the case, unfortunately: numpy uses bytes for boolean arrays (like int8), while arrow uses bits. Now, whether this additional conversion step (instead of iterating over the numpy array directly) matters and has a noticeable impact, I don't know (you could time it).
   
   Ah, that's unfortunate. I thought we were using the same memory layout for all PrimitiveArrays.
   I very roughly tested the difference using this snippet:
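   ```
   import timeit

   import numpy as np
   import pyarrow as pa

   NUM_ENTRIES = 100000000
   arr = np.arange(NUM_ENTRIES)
   # Alternating mask; numpy stores each boolean as one byte.
   mask = np.array([False, True] * (NUM_ENTRIES // 2))

   # Time a single masked conversion to a pyarrow array.
   print(timeit.timeit(lambda: pa.array(arr, mask=mask), number=1))
   ```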
   
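   On master (thus without any conversion from numpy to pyarrow) I get 2.76 seconds.
   On my branch (thus converting the numpy mask to a pyarrow mask) I get 3.69 seconds, so the extra conversion costs roughly 0.9 seconds (about a third more) for 100 million entries.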


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

