jorisvandenbossche commented on issue #35360: URL: https://github.com/apache/arrow/issues/35360#issuecomment-1527112161
Thanks for the report! I can confirm this on the latest development version as well. Looking into our `Scalar::Hash` implementation, it seems that when the scalar is backed by an array (as is the case for a ListArray), our hashing implementation is a bit too simple: https://github.com/apache/arrow/blob/05a61d6fd4fee0c998ae650fa0e67881681e4a5a/cpp/src/arrow/scalar.cc#L154-L165

First, this just loops through the child data and hashes those. But I _think_ this then doesn't take the correct length / offset into account in case your array is sliced (in which case the child data correspond to the full, un-sliced data). For example, for a StructArray, getting a field does not just access the child_data, but slices the child data: https://github.com/apache/arrow/blob/05a61d6fd4fee0c998ae650fa0e67881681e4a5a/cpp/src/arrow/array/array_nested.cc#L584-L598

In addition, you can also see from the first snippet that we actually _only_ check the length and null count of an array, and not the values inside it. So that means we actually ignore the content and give the same hash for different scalars:

```
In [69]: hash(pa.scalar([{'a': 1}]))
Out[69]: -285312971393311483

In [70]: hash(pa.scalar([{'a': 2}]))
Out[70]: -285312971393311483
```
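In practice this means every list scalar of the same length and null count lands in the same hash bucket; assuming scalar equality still compares the actual values, dict/set lookups remain correct but degrade to comparisons within that one bucket. A small sketch of the consequence (hypothetical usage, not a confirmed measurement):

```python
import pyarrow as pa

s1 = pa.scalar([{'a': 1}])
s2 = pa.scalar([{'a': 2}])

# Same hash, but not equal: a dict still keeps the entries apart,
# it just has to fall back on equality checks within one bucket.
print(hash(s1) == hash(s2))         # True with the current implementation
print(s1 == s2)                     # False
print(len({s1: 'one', s2: 'two'}))  # 2
```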

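As a possible workaround until the C++ hashing is fixed, one could hash the Python-level value of the scalar instead. The `scalar_hash` / `freeze` helpers below are not part of pyarrow, just a sketch:

```python
import pyarrow as pa

def scalar_hash(scalar):
    """Hash a (possibly nested) pyarrow scalar via its Python value."""
    def freeze(obj):
        # Convert the lists/dicts returned by as_py() into hashable tuples.
        if isinstance(obj, dict):
            return tuple(sorted((k, freeze(v)) for k, v in obj.items()))
        if isinstance(obj, list):
            return tuple(freeze(v) for v in obj)
        return obj
    return hash(freeze(scalar.as_py()))

print(scalar_hash(pa.scalar([{'a': 1}])) == scalar_hash(pa.scalar([{'a': 2}])))  # False
```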