Re: [PR] Rewrite Nested Loop Join executor for 3.5× speed and 1% memory usage [datafusion]

via GitHub Mon, 11 Aug 2025 02:08:30 -0700


2010YOUY01 commented on PR #16996:
URL: https://github.com/apache/datafusion/pull/16996#issuecomment-3173870087


   > > I really appreciate your detailed review. I have addressed them in 
[4c111c3](https://github.com/apache/datafusion/pull/16996/commits/4c111c38e6b05553d78bf9c46346f4a47cb7e710)
   > > BTW, I take code understandability very seriously, and I’m happy to make 
any small changes that improve it. I encourage others to do similar reviews -- 
point out anything that doesn’t make sense and to share even small nitpicks.
   > > > Thanks @2010YOUY01 I'll continue later, I'm out of my mental capacity 
:) Appreciate for the comments it was much easier to navigate.
   > > > Good sign we got fuzz testing working with the new implementation. For 
the post filtering it would be probably possible to split filter stage 
evalulation for certain types of join. Inner, outer joins filter can be 
evaluated early whereas SEMI, ANTI on late stage like now.
   > > 
   > > 
   > > I don't get this changing filter evaluation time idea, could you 
elaborate?
   > 
   > As far as I understood the filter evaluation happens in post phase, after 
tuples joined by key and this makes a lot of sense especially for SEMI, ANTI 
joins where you have to track filter eval results for the same key across all 
batches. However for Inner join and outer joins it maybe beneficial to evaluate 
record batches early and avoid join calculation and save some ticks
   > 
   > For example
   > 
   > ```
   > > with t as (select 1 a, 2 b), t1 as (select 2 a, 3 b) select * from t 
left join t1 on (t.a < t1.b and t1.b = 4);
   > +---+---+------+------+
   > | a | b | a    | b    |
   > +---+---+------+------+
   > | 1 | 2 | NULL | NULL |
   > +---+---+------+------+
   > ```
   > 
   > So the right record batch you can evaluate filter `t1.b = 4` and figure 
out that indices in the batch doesn't fit to join condition even without having 
actual join
   
   I see, but this step should be handled below the join operator by filter 
pushdown:
   See the `FilterExec` in the left input in the below example
   ```
   > explain SELECT *
           FROM range(10000) AS t1
           JOIN range(10000) AS t2
           ON ((t1.value + t2.value) % 1000 = 0) AND (t1.value > 5000);
   
+---------------+------------------------------------------------------------+
   | plan_type     | plan                                                       
|
   
+---------------+------------------------------------------------------------+
   | physical_plan | ┌───────────────────────────┐                              
|
   |               | │     NestedLoopJoinExec    ├──────────────┐               
|
   |               | └─────────────┬─────────────┘              │               
|
   |               | ┌─────────────┴─────────────┐┌─────────────┴─────────────┐ 
|
   |               | │   CoalescePartitionsExec  ││      RepartitionExec      │ 
|
   |               | │                           ││    --------------------   │ 
|
   |               | │                           ││ partition_count(in->out): │ 
|
   |               | │                           ││          1 -> 14          │ 
|
   |               | │                           ││                           │ 
|
   |               | │                           ││    partitioning_scheme:   │ 
|
   |               | │                           ││    RoundRobinBatch(14)    │ 
|
   |               | └─────────────┬─────────────┘└─────────────┬─────────────┘ 
|
   |               | ┌─────────────┴─────────────┐┌─────────────┴─────────────┐ 
|
   |               | │    CoalesceBatchesExec    ││       LazyMemoryExec      │ 
|
   |               | │    --------------------   ││    --------------------   │ 
|
   |               | │     target_batch_size:    ││     batch_generators:     │ 
|
   |               | │            8192           ││ range: start=0, end=10000,│ 
|
   |               | │                           ││       batch_size=8192     │ 
|
   |               | └─────────────┬─────────────┘└───────────────────────────┘ 
|
   |               | ┌─────────────┴─────────────┐                              
|
   |               | │         FilterExec        │                              
|
   |               | │    --------------------   │                              
|
   |               | │         predicate:        │                              
|
   |               | │        value > 5000       │                              
|
   |               | └─────────────┬─────────────┘                              
|
   |               | ┌─────────────┴─────────────┐                              
|
   |               | │      RepartitionExec      │                              
|
   |               | │    --------------------   │                              
|
   |               | │ partition_count(in->out): │                              
|
   |               | │          1 -> 14          │                              
|
   |               | │                           │                              
|
   |               | │    partitioning_scheme:   │                              
|
   |               | │    RoundRobinBatch(14)    │                              
|
   |               | └─────────────┬─────────────┘                              
|
   |               | ┌─────────────┴─────────────┐                              
|
   |               | │       LazyMemoryExec      │                              
|
   |               | │    --------------------   │                              
|
   |               | │     batch_generators:     │                              
|
   |               | │ range: start=0, end=10000,│                              
|
   |               | │       batch_size=8192     │                              
|
   |               | └───────────────────────────┘                              
|
   |               |                                                            
|
   
+---------------+------------------------------------------------------------+
   ```
   
   Perhaps it's better to rename the variable name from `filter` to 
`join_predicate` to make it more clear.
   
   I'm not sure if such filter pushdowns would be incorrect for SEMI/ANTI 
joins, I'll double check that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Rewrite Nested Loop Join executor for 3.5× speed and 1% memory usage [datafusion]

Reply via email to