tobixdev commented on issue #19067:
URL: https://github.com/apache/datafusion/issues/19067#issuecomment-3608986130

   I thought this would be a great excuse to look into the `HashJoin` code, 
as I had a bit of time at the end of my day. 
   
   First and foremost, thanks @HawaiianSpork for the reproducer! It made 
finding the issue way easier.
   
   Some findings:
   - The issue is not a left join problem. The optimizer switches the two 
join sides because the right side is smaller than the left one, so the query 
actually runs as a right join. The issue disappears if you add two additional 
rows to the right side (see the adjusted code below).
   - I don't think the HashJoin is the problem. When the right join stream 
processes a batch, it calls `adjust_indices_by_join_type` to add the extra 
indices required by the join type. For a right join, this adds an index for 
each unmatched row, and the left indices of those unmatched rows are correctly 
set to `NULL`. However, the `take` kernel for `FixedSizeBinaryArray` seems to 
ignore the validity of the indices array 
(https://github.com/apache/arrow-rs/issues/8947). 
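
To make the expected behavior concrete, here is a minimal, std-only sketch of the two steps described above. It is a simplified model, not the DataFusion or arrow-rs code: `FixedBinaryColumn`, `append_unmatched_right`, and `take_nullable` are hypothetical names made up for illustration, and plain `Vec<Option<..>>` values stand in for Arrow arrays and their validity bitmaps. It shows what a `take` that respects index validity would presumably return: a `NULL` index produces a `NULL` output slot.

```rust
/// Simplified stand-in for a fixed-size binary column: each slot is either
/// NULL (`None`) or exactly four bytes. (Hypothetical type, not an Arrow array.)
type FixedBinaryColumn = Vec<Option<[u8; 4]>>;

/// Hypothetical model of the right-join index adjustment: keep the matched
/// (left, right) index pairs and append one pair per unmatched right row,
/// with a NULL left index.
fn append_unmatched_right(
    matched: &[(usize, usize)],
    right_len: usize,
) -> (Vec<Option<usize>>, Vec<usize>) {
    let mut left: Vec<Option<usize>> = matched.iter().map(|&(l, _)| Some(l)).collect();
    let mut right: Vec<usize> = matched.iter().map(|&(_, r)| r).collect();
    for r in 0..right_len {
        // A right row that appears in no matched pair gets a NULL left index.
        if !matched.iter().any(|&(_, m)| m == r) {
            left.push(None);
            right.push(r);
        }
    }
    (left, right)
}

/// A `take` that respects index validity: a NULL index yields a NULL output
/// slot instead of reading any row from `values`.
fn take_nullable(values: &FixedBinaryColumn, indices: &[Option<usize>]) -> FixedBinaryColumn {
    indices
        .iter()
        .map(|&idx| idx.and_then(|i| values[i]))
        .collect()
}

fn main() {
    let left_keys: FixedBinaryColumn = vec![Some([0xAA; 4]), Some([0xBB; 4])];
    // Right row 0 matches left row 0; right row 1 has no match.
    let (left_idx, right_idx) = append_unmatched_right(&[(0, 0)], 2);
    let taken = take_nullable(&left_keys, &left_idx);
    println!("left indices:  {:?}", left_idx);
    println!("right indices: {:?}", right_idx);
    println!("taken keys:    {:?}", taken);
}
```

In this model, the unmatched right row ends up with a `None` left key. A `take` that ignores index validity would instead emit a non-null key for that row, which matches the symptom described above.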
   
   I'll run your reproducer against my fix in 
https://github.com/apache/arrow-rs/pull/8948 to see whether this is really the 
root cause. @HawaiianSpork @Jefffrey I hope you haven't already been working 
on this issue. I just got an itch to look at the HashJoin code and didn't want 
to commit to solving the issue in case I couldn't locate it.
   
   **Larger right table that does not exercise the bug**
   
   ```rust
       let right_join_key = Arc::new(
           FixedSizeBinaryArray::try_from_sparse_iter_with_size(
               vec![
                   Some(vec![0xAA, 0xAA, 0xAA, 0xAA]),
                   Some(vec![0xBB, 0xBB, 0xBB, 0xBB]),
                   Some(vec![0xDD, 0xBB, 0xBB, 0xBB]),
                   Some(vec![0xEE, 0xBB, 0xBB, 0xBB]),
               ]
                   .into_iter(),
               4,
           )
               .unwrap(),
       ) as ArrayRef;
       let right_value =
           Arc::new(Int64Array::from(vec![1000, 2000, 3000, 4000])) as ArrayRef;
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

