2010YOUY01 commented on issue #18070:
URL: https://github.com/apache/datafusion/issues/18070#issuecomment-3419730649

   > > Could you check the size of the inputs for the NLJoin? If this is NLJoin related it could be similar to [#17547](https://github.com/apache/datafusion/issues/17547) and [#17488](https://github.com/apache/datafusion/issues/17488).
   > 
   > I ran `EXPLAIN ANALYZE`
   > 
   > The relevant part of the plan I think is like this (tiny left input - 1 row, giant right input 22M rows)
   > 
   > ```
   > NestedLoopJoinExec
   >    ProjectionExec (1 row)
   >    CoalesceBatchesExec (21917655 rows)
   > ```
   > 
   > The full plan is here:
   > 
   > Full Output
   
   This single row on the left side is a lengthy aggregated array.
   
   The NLJ implementation is:
   ```
   for each right_batch:
       for each left_row:
           join(left_row, right_batch)
   ```
   and this `join(left_row, right_batch)` step first repeats `left_row` to the same length as `right_batch`, then applies the join filter. It is easier to implement this way because of a limitation in the current filter API.
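
To make the loop above concrete, here is a minimal sketch that uses plain `Vec<i64>` columns instead of Arrow arrays. The names `filter_columns` and `nested_loop_join` and the `l < r` predicate are made up for illustration; this is not the DataFusion code, only the same repeat-then-filter shape.

```rust
/// Hypothetical stand-in for a join filter that only accepts two
/// equal-length columns (a batch-at-a-time, columnar style API).
fn filter_columns(left: &[i64], right: &[i64]) -> Vec<bool> {
    left.iter().zip(right).map(|(l, r)| l < r).collect()
}

/// For each right batch and each left row: repeat the left row to the
/// batch length, evaluate the filter, and keep the matching pairs.
fn nested_loop_join(left_rows: &[i64], right_batches: &[Vec<i64>]) -> Vec<(i64, i64)> {
    let mut out = Vec::new();
    for right_batch in right_batches {
        for &left_row in left_rows {
            // Materialize the left value once per right row so that both
            // filter inputs have the same length.
            let repeated = vec![left_row; right_batch.len()];
            let mask = filter_columns(&repeated, right_batch);
            for (keep, &r) in mask.iter().zip(right_batch) {
                if *keep {
                    out.push((left_row, r));
                }
            }
        }
    }
    out
}
```

With a scalar left value the `repeated` column is cheap. When the left value is itself a long list, the cost of that repetition depends on how each repeated slot is produced.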
   
   This repeating/copying step is done through `to_array_of_size()` in `50.0`:
   
https://github.com/apache/datafusion/blob/f199b000861360aca01d4f1b9104bf73e9d831cc/datafusion/physical-plan/src/joins/nested_loop_join.rs#L1668
   Note that `49.0` also copies the row, but with a different API. I suspect the reason for the slowdown is that the `50.0` version performs a deep copy while `49.0` does a shallow copy of this lengthy array in the one-row left input.
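
The difference between the two kinds of copy can be illustrated with the standard library alone. This sketch models the lengthy left value as a `Vec<i64>` and uses `Arc` for sharing; it does not reflect Arrow's actual buffer layout, and `repeat_deep` / `repeat_shallow` are hypothetical names.

```rust
use std::sync::Arc;

/// Deep copy: every repeated slot owns its own copy of the long list,
/// so the work is proportional to `n * row.len()`.
fn repeat_deep(row: &[i64], n: usize) -> Vec<Vec<i64>> {
    vec![row.to_vec(); n]
}

/// Shallow copy: every repeated slot is a reference-counted pointer to
/// the same list, so the work is proportional to `n` only.
fn repeat_shallow(row: &Arc<Vec<i64>>, n: usize) -> Vec<Arc<Vec<i64>>> {
    vec![Arc::clone(row); n]
}
```

If the suspicion is right, repeating one long row across a 22M-row right input would behave like `repeat_deep` in `50.0` and like `repeat_shallow` in `49.0`.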
   
   If that's the case, an easier fix would be to make `to_array_of_size()` do a shallow copy of the inner array/list. An alternative fix would be to evaluate the filter directly on `row X batch`, without repeating/copying the row to the same size as the batch.
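
As a rough sketch of the second option, again with plain `Vec<i64>` data and a made-up `l < r` predicate: the filter takes one left value plus the whole right batch, so no repeated left column is ever built. The function names are hypothetical and do not correspond to an existing DataFusion API.

```rust
/// Hypothetical filter that compares a single left value against every
/// value in the right batch, with no repeated left column.
fn filter_row_vs_batch(left_row: i64, right_batch: &[i64]) -> Vec<bool> {
    right_batch.iter().map(|&r| left_row < r).collect()
}

/// Join one left row with one right batch using the row-vs-batch filter.
fn join_row_with_batch(left_row: i64, right_batch: &[i64]) -> Vec<(i64, i64)> {
    let mask = filter_row_vs_batch(left_row, right_batch);
    right_batch
        .iter()
        .zip(mask)
        .filter(|(_, keep)| *keep)
        .map(|(&r, _)| (left_row, r))
        .collect()
}
```

This avoids the copy entirely, at the price of needing a filter entry point that accepts a row and a batch rather than two equal-length columns.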


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

