jorisvandenbossche commented on issue #35360:
URL: https://github.com/apache/arrow/issues/35360#issuecomment-1527112161

   Thanks for the report!
   
   I can confirm this on the last development version as well, and looking into 
our `Scalar::Hash` implementation, it seems that when the scalar is backed by 
an array (as is the case for a ListArray), our hashing implementation is a bit 
too simple:
   
   
https://github.com/apache/arrow/blob/05a61d6fd4fee0c998ae650fa0e67881681e4a5a/cpp/src/arrow/scalar.cc#L154-L165
   
   First, this will just loop through the child data and hash those. But I 
_think_ this doesn't take the correct length / offset into account when the 
array is sliced (in which case the child data still correspond to the full, 
un-sliced data). For example, for a StructArray, getting a field does not just 
access the child_data, but also slices it:
   
   
https://github.com/apache/arrow/blob/05a61d6fd4fee0c998ae650fa0e67881681e4a5a/cpp/src/arrow/array/array_nested.cc#L584-L598
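   
   To poke at the slicing concern from Python (a hypothetical probe, not a 
confirmed reproduction): take the same logical list value once from a plain 
array and once from a sliced one, and compare the hashes. If the hash indeed 
walks the un-sliced child data, the two could disagree:
   
   ```
   import pyarrow as pa
   
   # Hypothetical probe for the slicing concern: the same logical value [3, 4],
   # once from a plain ListArray and once from a sliced one. A hash that walks
   # the full child_data without applying offset/length could treat them
   # differently.
   plain = pa.array([[3, 4]])
   sliced = pa.array([[1, 2], [3, 4], [5, 6]]).slice(1, 1)
   
   s_plain = plain[0]     # ListScalar backed by un-sliced data
   s_sliced = sliced[0]   # same logical value, backed by sliced data
   print(s_plain == s_sliced)            # True: the values compare equal
   print(hash(s_plain), hash(s_sliced))  # these should match, but may not
   ```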
   
   Now, in addition, you can also see from the first snippet that we _only_ 
check the length and null count of the array, and not the values inside it. 
That means we effectively ignore the content and give the same hash to 
different scalars:
   
   ```
   In [69]: hash(pa.scalar([{'a': 1}]))
   Out[69]: -285312971393311483
   
   In [70]: hash(pa.scalar([{'a': 2}]))
   Out[70]: -285312971393311483
   ```
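   
   As a hedged side note (assuming equality goes through `Scalar::Equals` and 
compares the actual values): the collision above should only affect hashing 
performance, not correctness, since containers still fall back to equality 
checks. A small sketch:
   
   ```
   import pyarrow as pa
   
   # Hedged follow-up: equality still compares the actual values, so a set
   # distinguishes the two scalars above even though their hashes collide.
   s = {pa.scalar([{'a': 1}]), pa.scalar([{'a': 2}])}
   len(s)  # expected to be 2; only the hash bucket is shared
   ```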

