[I] Join Ranking [datafusion]

via GitHub Thu, 23 Oct 2025 16:22:02 -0700


NGA-TRAN opened a new issue, #18257:
URL: https://github.com/apache/datafusion/issues/18257


   This task is part of  feature #18249 illustrating a join ranking system that 
prioritizes joins based on their own properties and those of their inputs.
   
   ## Interesting Properties
   
   Let’s highlight several useful properties that can influence join ranking:
   
   - **Partitioning on filter columns**: If the input table is partitioned by 
the query’s filter column(s), filtering can be performed more efficiently.
   - **Partitioning on join columns**: Partitioning by join key(s) enables 
non-overlapping partitions or streams, facilitating parallel join execution.
   - **Sorting on join columns**: If the input is sorted on join key(s), merge 
join becomes a viable and efficient option.
   - **Sorting on filter columns**: Sorting by filter column(s) can accelerate 
filtering operations.
   - **High selectivity (filtered input)**: When a table is heavily filtered, 
less data needs to be read and joined, improving performance.
   - **Many-to-one (m:1) join relationships**: These reduce the risk of output 
size explosion.
   - **Small input size**: Smaller inputs typically lead to faster join 
execution.
   
   Some of these properties may carry more weight than others. The following 
table illustrates how we might assign relative importance to each.
   
   <img width="586" height="285" alt="Image" 
src="https://github.com/user-attachments/assets/bb19e375-f252-4802-8a95-354e16607a37";
 />
   
   ## Ranking the Joins
   
   Using properties, weights and selection criteria in Table 1, let us compute 
join ranks for Figure 2 in previous section
   
   <img width="326" height="340" alt="Image" 
src="https://github.com/user-attachments/assets/cae54dfa-6cbc-4705-8a1d-31600755c84a";
 />
   
   Figure 2 (repeated): Join Graph Annotated with Join Ranked
   
   ### Join Rank Calculation for L–O:
   
   - O is partitioned on the filter column: 1 × 1 = 1
   - L is sorted on the join column: 0.5 × 1 = 0.5
   - O is sorted on the join column: 0.5 × 1 = 0.5
   - O is filtered on some date: 1 × 1 = 1
   - L–O has a many-to-one (m:1) relationship: 1 × 1 = 1
   
   **Total join rank for L–O = 1 + 0.5 + 0.5 + 1 + 1 = 4**
   
   Using the same approach, we can compute the ranks for the remaining joins.
   
   
   ## Re-ranking after each round
   
   After a join is performed, certain input properties may be lost, requiring 
join ranks to be recomputed. As shown in Figure 12, the join rank between O and 
C drops to 2 after O is joined with L, whereas it was previously 4.
   
   <img width="252" height="282" alt="Image" 
src="https://github.com/user-attachments/assets/6184f5a0-d798-4865-b0c9-155e68cf69f0";
 />
   
   Figure 12: Level-1 Partial Plan (1)
   
   
   ## Join Ranking Summary
   
   As shown, a join ranking system can vary based on:
   
   - The properties it considers
   - The weights assigned to those properties
   - The criteria used for selecting or prioritizing joins
   
   This suggests we can design a flexible join ranking framework—one that 
allows customization of properties, weights, and selection logic.
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] Join Ranking [datafusion]

Reply via email to