[GitHub] [arrow-datafusion] Dandandan commented on issue #235: Failing tests in master: left_join_using and left_join

GitBox Sun, 02 May 2021 03:08:34 -0700


Dandandan commented on issue #235:
URL: https://github.com/apache/arrow-datafusion/issues/235#issuecomment-830783405



   I reproduced the bug by explicitly setting the concurrency for those tests 
to `24`.
   
   So here is my hypothesis about what happens:
   
   * The left join implementation is wrong: it emits an unmatched left row when 
the row has no match in the current right batch, instead of when it has no match 
in the entire right partition.
   * The referenced commit changes the (re)hashing of single columns, 
which means that partitions could end up without any right-side rows.
   * We also use the same hashing code in hash-repartition, which means that the 
`33` row could end up in its "own" partition. In that case, no right batch is 
processed, so no row is generated for `33`.
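
To make the hypothesis concrete, here is a minimal, self-contained sketch of the failure mode. It is not the DataFusion code: `repartition` and `buggy_partitioned_left_join` are hypothetical names, `key % n` stands in for the real hash function, join keys are assumed unique, and the input values other than `33` and the partition count `24` are made up for illustration.

```rust
/// Hypothetical stand-in for hash repartitioning: route each key to
/// partition `key % n` (the real code hashes the column values).
fn repartition(keys: &[u32], n: usize) -> Vec<Vec<u32>> {
    let mut parts: Vec<Vec<u32>> = vec![Vec::new(); n];
    for &k in keys {
        parts[(k as usize) % n].push(k);
    }
    parts
}

/// Buggy partitioned left join: the "no match" decision is made while
/// processing each right batch, so a partition that receives no right
/// batch never emits its unmatched left rows.
fn buggy_partitioned_left_join(
    left: &[u32],
    right: &[u32],
    n: usize,
) -> Vec<(u32, Option<u32>)> {
    let left_parts = repartition(left, n);
    let right_parts = repartition(right, n);
    let mut out = Vec::new();
    for (l_part, r_part) in left_parts.iter().zip(right_parts.iter()) {
        // An empty right partition yields no batches at all.
        let right_batches: Vec<&[u32]> = if r_part.is_empty() {
            vec![]
        } else {
            vec![r_part.as_slice()]
        };
        for batch in right_batches {
            for &l in l_part {
                if batch.contains(&l) {
                    out.push((l, Some(l)));
                } else {
                    // Decided per batch, not per partition.
                    out.push((l, None));
                }
            }
        }
    }
    out
}
```

With left keys `[11, 22, 33, 44]`, right keys `[11, 22, 44, 55]` and `24` partitions, `33` lands alone in partition 9, no right batch arrives there, and the result contains no row for `33` at all.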
   
   I have a feeling that to fix this in the general case it would be best to 
"just" fix the left join implementation.
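
One possible shape of such a fix, as a rough sketch rather than the actual patch: remember which left rows matched across all right batches of a partition, and emit the unmatched ones only once the right side is exhausted. The function name `left_join_partition` and the plain `u32` keys are assumptions made for illustration.

```rust
/// Left join over a single partition. Unmatched left rows are emitted once,
/// after every right batch has been processed, so the result is also correct
/// when the partition receives zero right batches.
fn left_join_partition(
    left: &[u32],
    right_batches: &[Vec<u32>],
) -> Vec<(u32, Option<u32>)> {
    // One flag per left row, shared across all right batches.
    let mut matched = vec![false; left.len()];
    let mut out = Vec::new();
    for batch in right_batches {
        for (i, &l) in left.iter().enumerate() {
            for &r in batch {
                if l == r {
                    matched[i] = true;
                    out.push((l, Some(r)));
                }
            }
        }
    }
    // Only now do we know which left rows never found a match.
    for (i, &l) in left.iter().enumerate() {
        if !matched[i] {
            out.push((l, None));
        }
    }
    out
}
```

Under these assumptions, a partition holding only the left key `33` and no right batches produces the single row `(33, None)`, and a left row that matches in a later batch is no longer reported as unmatched by an earlier one.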
   
   Another option might be to cherry-pick this change from PR #55, which would 
fix just this test:
   
   
https://github.com/apache/arrow-datafusion/pull/55/files#diff-44d49c7778aa0c300afacdd7d89b0729ffaedd932d1ac34f3ef8db6b6cdfd73aR904

