kszucs commented on PR #45001: URL: https://github.com/apache/arrow/pull/45001#issuecomment-2541734844
> Seems like we generate the same hash for both `NULL` and `0`, which is not ideal.
>
> ```python
> In [1]: import pyarrow as pa
>
> In [2]: import pyarrow.compute as pc
>
> In [3]: pc.hash_64([None])
> Out[3]:
> <pyarrow.lib.UInt64Array object at 0x124247be0>
> [
>   0
> ]
>
> In [4]: pc.hash_64([0])
> Out[4]:
> <pyarrow.lib.UInt64Array object at 0x1033027a0>
> [
>   0
> ]
> ```

@pitrou `NULL`s are explicitly hashed as `0`, but `0::int` also hashes to `0` due to https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/key_hash_internal.cc#L304. Any idea how we could overcome this to generate distinct hashes for `NULL` and `0` without a performance regression?
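
To make the collision concrete without depending on the kernel from this PR, here is a minimal pure-Python sketch. Everything in it is an assumption for illustration: `toy_hash` is a stand-in multiplicative hash chosen only because it maps `0` to `0`, not Arrow's actual hashing code, and `NULL_SENTINEL` is a hypothetical non-zero constant. It shows one possible direction (hashing nulls to a non-zero sentinel), not a proposed implementation.

```python
MASK64 = (1 << 64) - 1

# Hypothetical non-zero constant used as the hash of a null slot.
NULL_SENTINEL = 0x9E3779B97F4A7C15


def toy_hash(value: int) -> int:
    """Stand-in 64-bit multiplicative hash (NOT Arrow's): 0 always maps to 0."""
    return (value * 0x9E3779B185EBCA87) & MASK64


def hash_column(values, null_hash=0):
    """Hash each slot; None (null) slots get the fixed `null_hash`."""
    return [null_hash if v is None else toy_hash(v) for v in values]


# Nulls hashed as 0 collide with the integer 0.
colliding = hash_column([None, 0])

# Nulls hashed as a non-zero sentinel no longer collide with 0.
separated = hash_column([None, 0], null_hash=NULL_SENTINEL)
```

In this sketch the sentinel only moves the collision from `0` to whichever integer happens to hash to `NULL_SENTINEL`, and it says nothing about the cost of such a change in the vectorized C++ code path, which would need to be benchmarked.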
