jorisvandenbossche commented on issue #38640: URL: https://github.com/apache/arrow/issues/38640#issuecomment-1804032397
Answered on the mailing list as well, copying my answer here:

The reason for the big difference between using a numpy vs a Python scalar with `pc.equal` is that pyarrow does not do any smart casting of Python scalars based on the other types in the operation. The Python scalar gets converted to a pyarrow scalar, defaulting to int64. As a result, we end up comparing uint8 and int64 values, so the uint8 array is first cast to the common type, i.e. int64. When passing a numpy scalar, on the other hand, pyarrow preserves its type and converts it to a uint8 pyarrow scalar, so the comparison is done between uint8 and uint8, requiring no cast. You can see this difference as well when explicitly creating a pyarrow scalar:

```
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

data_np = np.random.randint(0, 100, 10_000_000, dtype="uint8")
data_pa = pa.array(data_np)

In [12]: %timeit pc.equal(data_pa, pa.scalar(115, pa.uint8()))
3.38 ms ± 29.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [13]: %timeit pc.equal(data_pa, pa.scalar(115, pa.int64()))
35.6 ms ± 700 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

(my numbers are generally a bit bigger, but the relative difference of around x10 without vs with casting is similar to your timings)
