GitHub user nsyca commented on the issue:
https://github.com/apache/spark/pull/17240
In the PR description:
*"Usually cardinality is more important than size, we can simplify cost
evaluation by using only cardinality. Note that this also enables us to not
care about column pru(n)ing during reordering. Because otherwise, project will
influence the output size of intermediate joins."*
I do not quite agree with this statement. Suppose we have two candidates in the
join reordering:

T1 join T2_100_columns T2 on p(T1,T2) join T3_1_column T3 on p(T1,T3) ... [1]

and

T1 join T3_1_column T3 on p(T1,T3) join T2_100_columns T2 on p(T1,T2) ... [2]

Assume all columns are of equal width (say, all integers), T2_100_columns and
T3_1_column have 100 columns and 1 column respectively carried along to the
input of the next operator above the 3-table join, and the estimated
cardinalities of [1] and [2] are 10 and 20 rows respectively. Without taking
the average row widths of the tables into account, the join reordering
algorithm will favour [1] over [2] for the first join, which then carries the
payload of T2's extra 100 columns into the second join.
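
To put numbers on this, here is a minimal Scala sketch. It is not Spark's
actual cost model; PlanStats, the column counts, and the 4-byte width are
assumptions taken from the example above. It shows how a cardinality-only
cost picks [1] while a size-aware cost (cardinality times average row width)
picks [2]:

    // Hypothetical per-candidate statistics for the first join's output.
    case class PlanStats(rows: Long, columns: Int, bytesPerColumn: Int = 4)

    // Cardinality-only cost, as proposed in the PR description.
    def costByCardinality(s: PlanStats): Long = s.rows

    // Size-aware cost: cardinality times average row width in bytes.
    def costBySize(s: PlanStats): Long = s.rows * s.columns * s.bytesPerColumn

    // Candidate [1]: T1 join T2 first -- 10 rows, but T2's 100 columns
    // (plus, say, one from T1) are carried along to the second join.
    val candidate1 = PlanStats(rows = 10, columns = 101)

    // Candidate [2]: T1 join T3 first -- 20 rows, only 2 columns carried.
    val candidate2 = PlanStats(rows = 20, columns = 2)

    costByCardinality(candidate1)  // 10 -> [1] wins under cardinality only
    costByCardinality(candidate2)  // 20

    costBySize(candidate1)         // 4040 bytes
    costBySize(candidate2)         // 160 bytes -> [2] is ~25x cheaper

Under the cardinality-only cost the reorderer commits to [1] and pays for the
wide intermediate result at every operator above it.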