GitHub user nsyca commented on the issue:

    https://github.com/apache/spark/pull/17240
  
    In the PR description:
    *"Usually cardinality is more important than size, we can simplify cost 
evaluation by using only cardinality. Note that this also enables us to not 
care about column pru(n)ing during reordering. Because otherwise, project will 
influence the output size of intermediate joins."*
    
    I do not quite agree with this statement. Suppose we have two candidate orders in the join reordering:
    
    T1 join T2_100_columns T2 on p(T1,T2) join T3_1_column T3 on p(T1,T3) ... [1]
    and
    T1 join T3_1_column T3 on p(T1,T3) join T2_100_columns T2 on p(T1,T2) ... [2]
    
    Assume all columns are of equal width (say, all are integers), T2_100_columns and T3_1_column carry 100 columns and 1 column respectively along to the input of the next operator above the three-table join, and the estimated cardinalities of the first joins in [1] and [2] are 10 and 20 rows respectively. Without taking the average row widths of the tables into account, the join reordering algorithm will favour [1] over [2] for the first join, which carries the payload of the extra 100 columns from T2 along to the second join.
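
    To make the gap concrete, here is a minimal sketch in plain Scala. This is not Spark's actual CBO code; the class, the cost functions, and the 4-byte column width are all illustrative assumptions. A cardinality-only cost prefers [1], while a cardinality-times-row-width cost prefers [2]:

```scala
// Sketch only: compares a cardinality-only cost against a
// cardinality * row-width cost for the two first-join candidates
// above. All names and numbers are illustrative assumptions,
// not values taken from the PR.
object JoinCostSketch {
  // Intermediate result of the first join: row count and columns carried along.
  case class Intermediate(rows: Long, columns: Int)

  val intWidth = 4 // assumption: every column is a 4-byte integer

  // Plan [1]: T1 join T2_100_columns first -> 10 rows, 100 columns carried
  val plan1 = Intermediate(rows = 10, columns = 100)
  // Plan [2]: T1 join T3_1_column first -> 20 rows, 1 column carried
  val plan2 = Intermediate(rows = 20, columns = 1)

  // Cost model from the PR description: cardinality only.
  def cardinalityOnlyCost(i: Intermediate): Long = i.rows

  // Size-aware cost: bytes carried to the next operator.
  def sizeAwareCost(i: Intermediate): Long = i.rows * i.columns * intWidth

  def main(args: Array[String]): Unit = {
    // Cardinality-only: plan [1] looks cheaper (10 < 20) ...
    println(s"card-only:  plan1=${cardinalityOnlyCost(plan1)} plan2=${cardinalityOnlyCost(plan2)}")
    // ... but the size-aware cost flips the choice (4000 > 80).
    println(s"size-aware: plan1=${sizeAwareCost(plan1)} plan2=${sizeAwareCost(plan2)}")
  }
}
```

    With the widths folded in, plan [1]'s intermediate result costs 10 * 100 * 4 = 4000 bytes against plan [2]'s 20 * 1 * 4 = 80 bytes, so the order that looks cheaper under cardinality alone is the more expensive one to carry into the second join.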

