NGA-TRAN opened a new issue, #18257: URL: https://github.com/apache/datafusion/issues/18257
This task is part of feature #18249 illustrating a join ranking system that prioritizes joins based on their own properties and those of their inputs. ## Interesting Properties Let’s highlight several useful properties that can influence join ranking: - **Partitioning on filter columns**: If the input table is partitioned by the query’s filter column(s), filtering can be performed more efficiently. - **Partitioning on join columns**: Partitioning by join key(s) enables non-overlapping partitions or streams, facilitating parallel join execution. - **Sorting on join columns**: If the input is sorted on join key(s), merge join becomes a viable and efficient option. - **Sorting on filter columns**: Sorting by filter column(s) can accelerate filtering operations. - **High selectivity (filtered input)**: When a table is heavily filtered, less data needs to be read and joined, improving performance. - **Many-to-one (m:1) join relationships**: These reduce the risk of output size explosion. - **Small input size**: Smaller inputs typically lead to faster join execution. Some of these properties may carry more weight than others. The following table illustrates how we might assign relative importance to each. <img width="586" height="285" alt="Image" src="https://github.com/user-attachments/assets/bb19e375-f252-4802-8a95-354e16607a37" /> ## Ranking the Joins Using properties, weights and selection criteria in Table 1, let us compute join ranks for Figure 2 in previous section <img width="326" height="340" alt="Image" src="https://github.com/user-attachments/assets/cae54dfa-6cbc-4705-8a1d-31600755c84a" /> Figure 2 (repeated): Join Graph Annotated with Join Ranked ### Join Rank Calculation for L–O: - O is partitioned on the filter column: 1 × 1 = 1 - L is sorted on the join column: 0.5 × 1 = 0.5 - O is sorted on the join column: 0.5 × 1 = 0.5 - O is filtered on some date: 1 × 1 = 1 - L–O has a many-to-one (m:1) relationship: 1 × 1 = 1 **Total join rank for L–O = 1 + 0.5 + 0.5 + 1 + 1 = 4** Using the same approach, we can compute the ranks for the remaining joins. ## Re-ranking after each round After a join is performed, certain input properties may be lost, requiring join ranks to be recomputed. As shown in Figure 12, the join rank between O and C drops to 2 after O is joined with L, whereas it was previously 4. <img width="252" height="282" alt="Image" src="https://github.com/user-attachments/assets/6184f5a0-d798-4865-b0c9-155e68cf69f0" /> Figure 12: Level-1 Partial Plan (1) ## Join Ranking Summary As shown, a join ranking system can vary based on: - The properties it considers - The weights assigned to those properties - The criteria used for selecting or prioritizing joins This suggests we can design a flexible join ranking framework—one that allows customization of properties, weights, and selection logic. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
