JerAguilon opened a new pull request, #39804:
URL: https://github.com/apache/arrow/pull/39804

   ### Rationale for this change
   
   Issue is described visually in https://github.com/apache/arrow/issues/39803.
   
   The key hasher works by inspecting the [column metadata](https://github.com/apache/arrow/blob/main/cpp/src/arrow/acero/asof_join_node.cc#L412) for the asof-join key fields. Among other things, this metadata says whether each column is fixed width.
   
   The issue is that we are passing the `output_schema` rather than the input's schema.
   
   If an input looks like 
   
   ```
   key_string_type,ts_int32_type,val
   ```
   
   But our expected output schema looks like:
   
   ```
   ts_int32,key_string_type,...
   ```
   Then the hasher will think that `key_string_type` is an int32. This completely throws off the hashes. The existing tests get away with it because they use ints across the board.
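
The mix-up can be sketched without Arrow at all. The block below is a minimal illustration, not the code in `asof_join_node.cc`: `TypeId`, `Field`, `Schema`, `ColumnMetadata`, `MetadataFor`, and `KeyMetadata` are hypothetical stand-ins, and the `float64` type of `val` is an assumption (the description above does not give it). It only shows that a key index valid for the input schema selects a different field when it is resolved against a reordered output schema.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for schema fields and per-column metadata.
// These are illustrative names, not Arrow's actual classes.
enum class TypeId { kInt32, kString, kFloat64 };

struct Field {
  std::string name;
  TypeId type;
};

using Schema = std::vector<Field>;

struct ColumnMetadata {
  bool is_fixed_length;
  std::size_t fixed_length_bytes;  // 0 for variable-length types
};

// Derives the metadata a hasher would care about from a column type.
ColumnMetadata MetadataFor(TypeId type) {
  switch (type) {
    case TypeId::kInt32:
      return {true, 4};
    case TypeId::kFloat64:
      return {true, 8};
    case TypeId::kString:
      return {false, 0};
  }
  return {false, 0};
}

// key_index refers to a column position in the *input* batch, so the
// schema passed here must be the input's schema.
ColumnMetadata KeyMetadata(const Schema& schema, std::size_t key_index) {
  return MetadataFor(schema.at(key_index).type);
}

// Input layout from the description: key (string), ts (int32), val.
// The type of val is assumed for the sake of the example.
Schema InputSchema() {
  return {{"key", TypeId::kString},
          {"ts", TypeId::kInt32},
          {"val", TypeId::kFloat64}};
}

// Output layout from the description: ts first, then key.
Schema OutputSchema() {
  return {{"ts", TypeId::kInt32},
          {"key", TypeId::kString},
          {"val", TypeId::kFloat64}};
}
```

With key index 0, `KeyMetadata(InputSchema(), 0)` describes a variable-length string column, while `KeyMetadata(OutputSchema(), 0)` describes a 4-byte fixed-width column, which is the mismatch described above.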
   
   ### What changes are included in this PR?
   
   A one-line fix and a test with string types.
   
   ### Are these changes tested?
   
   Yes. The test runs before and after the change can be seen here: https://gist.github.com/JerAguilon/953d82ed288d58f9ce24d1a925def2cc
   
   Before the change, notice that inputs 0 and 1 have mismatched hashes:
   
   ```
   AsofjoinNode(0x16cf9e2d8): key hasher 1 got hashes [0, 9784892099856512926, 
1050982531982388796, 10763536662319179482, 2029627098739957112, 
11814237723602982167, 3080328155728858293, 12792882290360550483, 
4058972722486426609, 13771526852823217039]
   ...
   AsofjoinNode(0x16cf9dd18): key hasher 0 got hashes [17528465654998409509, 
12047706865972860560, 18017664240540048750, 12358837084497432044, 
8151160321586084686, 8691136767698756332, 15973065724125580046, 
9654919479117127288, 618127929167745505, 3403805303373270709]
   
   ```
   
   And after, they do match:
   
   ```
   AsofjoinNode(0x16f2ea2d8): key hasher 1 got hashes [17528465654998409509, 
12047706865972860560, 18017664240540048750, 12358837084497432044, 
8151160321586084686, 8691136767698756332, 15973065724125580046, 
9654919479117127288, 618127929167745505, 3403805303373270709]
   ...
   AsofjoinNode(0x16f2e9d18): key hasher 0 got hashes [17528465654998409509, 
12047706865972860560, 18017664240540048750, 12358837084497432044, 
8151160321586084686, 8691136767698756332, 15973065724125580046, 
9654919479117127288, 618127929167745505, 3403805303373270709]
   ```
   
   ...which is exactly what we want, since the `key` column in both tables looks like `["0", "1", ..., "9"]`.
   
   ### Are there any user-facing changes?
   

