drin opened a new pull request, #13487:
URL: https://github.com/apache/arrow/pull/13487

   We would like to expose hashing functions via new compute functions. This PR 
would add a scalar compute function, `FastHash32`, which uses 
`Hashing32::HashMultiColumn` to compute a 32-bit hash for each row of the input 
array (or scalar).
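   To make the intended semantics concrete, here is a rough, hypothetical sketch of row-wise multi-column hashing in Python. It is *not* Arrow's `Hashing32` implementation; it uses `zlib.crc32` as a stand-in hash purely to illustrate "one 32-bit hash per row, folding in every column's value for that row".

```python
import zlib

def fast_hash32_sketch(columns):
    """Illustrative stand-in for row-wise 32-bit hashing.

    `columns` is a list of equal-length sequences (one per column).
    Returns one 32-bit integer per row, combining all column values
    for that row. zlib.crc32 is a placeholder, NOT Arrow's Hashing32.
    """
    num_rows = len(columns[0])
    hashes = []
    for row in range(num_rows):
        h = 0
        for col in columns:
            # Fold this column's value into the running hash for the row.
            h = zlib.crc32(repr(col[row]).encode(), h)
        hashes.append(h & 0xFFFFFFFF)  # keep the result in 32 bits
    return hashes

# Two "columns" of equal length -> one uint32 per row.
print(fast_hash32_sketch([[1, 2, 3], ["a", "b", "c"]]))
```

   The real kernel would of course operate on Arrow arrays and dispatch on their types rather than on Python objects; the sketch only captures the per-row combination behavior.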
   
   This PR focuses on 32-bit hashing, but I would like to use it to settle on a design before adding 64-bit hashing. I am also unsure what the unit tests should cover beyond code coverage.
   
   Potential design changes:
   - Using an `Options` argument.
   - Using an init function for the kernel (but I don't know the input length 
until run time).
   - Changing how hashing is used.
   
   Open questions about unit tests:
   - I'm not sure how to validate that the hash outputs are "as expected"; the existing unit tests for the hashing functions don't seem to validate hash outputs either.
   - I'm not sure whether validating the output data types is necessary.
   - Code coverage of the various input types seems most important, but I need help figuring out the best way to approach it.
   
   
   Edit (8/9/22):
   
   This PR has been open a while but seems to have reached a good level of functionality. This PR does not:
   
   * address whether the hash functions are "acceptable" statistically
   * comprehensively benchmark, or test, the various data types
   
   I think these can be addressed in future improvements:
   * [ARROW-16017](https://issues.apache.org/jira/browse/ARROW-16017)
   * Closes: #17211


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to