jorisvandenbossche commented on issue #38640: URL: https://github.com/apache/arrow/issues/38640#issuecomment-1804032397
Answered on the mailing list as well, copying my answer here:

The reason for the big difference between using a numpy vs a Python scalar with `pc.equal` is that pyarrow does not do any smart casting of Python scalars based on the other types in the operation. The Python scalar gets converted to a pyarrow scalar, defaulting to int64. As a result, we end up comparing uint8 and int64 values, so the uint8 array is first cast to the common type, i.e. int64. When passing a numpy scalar, on the other hand, pyarrow preserves its type and converts it to a uint8 pyarrow scalar, so the comparison is done between uint8 and uint8, requiring no cast. You can see this difference as well when explicitly creating a pyarrow scalar:

```
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

data_np = np.random.randint(0, 100, 10_000_000, dtype="uint8")
data_pa = pa.array(data_np)

In [12]: %timeit pc.equal(data_pa, pa.scalar(115, pa.uint8()))
3.38 ms ± 29.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [13]: %timeit pc.equal(data_pa, pa.scalar(115, pa.int64()))
35.6 ms ± 700 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

(my numbers are generally a bit bigger, but the relative difference of around x10 without vs with casting is similar to your timings)
