amol- commented on pull request #11185: URL: https://github.com/apache/arrow/pull/11185#issuecomment-924990970
> > That seemed a more suitable strategy because converting from `numpy.array` to `BooleanArray` should usually be zero copy and thus have little overhead
>
> That's not actually the case, unfortunately: numpy uses bytes for boolean arrays (like int8), while arrow uses bits. Now, whether this additional conversion step (instead of iterating over the numpy array directly) matters and has a noticeable impact, I don't know (you could time it)

Ah, that's unfortunate, I thought we were using the same memory layout for all PrimitiveArrays.

I very roughly tested the difference using this snippet:

```
import timeit

import numpy as np
import pyarrow as pa

NUM_ENTRIES = 100000000

arr = np.arange(NUM_ENTRIES)
# Alternating mask: every second entry is masked out
mask = np.array([False, True]*(NUM_ENTRIES//2))

timeit.timeit(lambda: pa.array(arr, mask=mask), number=1)
```

On master (thus without any conversion from numpy to pyarrow) I get 2.76 seconds.
On my branch (thus converting numpy mask to pyarrow mask) I get 3.69 seconds.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
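
For reference, the byte-versus-bit difference mentioned in the quoted reply can be seen with a rough NumPy-only sketch. Note the assumptions: `np.packbits` with `bitorder="little"` is used here only to mimic a least-significant-bit-first bit-packed layout like the one Arrow uses for booleans; it is not the conversion pyarrow itself performs, and the 16-element mask is just an illustrative size.

```python
import numpy as np

# NumPy stores one full byte per boolean value.
mask = np.array([False, True] * 8)  # 16 values, illustrative size only
print(mask.dtype, mask.nbytes)      # 16 values occupy 16 bytes

# Assumption for illustration: emulate a bit-packed boolean buffer
# (least-significant bit first) with np.packbits. This is a stand-in,
# not pyarrow's internal conversion.
packed = np.packbits(mask, bitorder="little")
print(packed.nbytes)                # 16 values fit in 2 bytes

# Going back to the byte-per-value layout requires unpacking every bit,
# which is why the two layouts cannot share memory (no zero copy).
unpacked = np.unpackbits(packed, bitorder="little", count=mask.size).astype(bool)
```

Because every value has to be repacked, the conversion cost grows with the length of the mask, which is consistent with the extra time measured on the branch.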
