Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/22008#discussion_r208379788
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala
---
@@ -152,3 +153,45 @@ object EliminateOuterJoin extends Rule[LogicalPlan]
with PredicateHelper {
if (j.joinType == newJoinType) f else Filter(condition,
j.copy(joinType = newJoinType))
}
}
+
+/**
+ * Swaps right and left logical plans of a join when left is bigger than
right. This is useful
+ * because underlying cartesian product performs a nested loop, thus if
the outer table is
+ * smaller there are less iterator initialization.
--- End diff --
This is indeed an interesting point. I am not sure how/if we can measure
the cost in the creation of the involved iterator and the cost of creating it.
Anyway, actually this will optimize not only the initialization cost for
the iterator, but also the overall number of record read/processed. Let's take
an example. Imagine that we have a table A with 10M record and a table B with
100 records. The total number of record retrieved is:
- if A is the left table, we process: 10M (all the records from A) + 100 *
10M (all the records from B for every record from A) = 101 * 10M
- if B is the left table, we process: 100 (all the records from B) + 100 *
10M (all the records from A for each record from B) = ~ 100 * 10M
So in the second case we process size of A - size B less records (same
applies to number of bytes read).
And there is another good point for the second option: ie. Spark is much
better at computing/reading 10 times 10M records that 10M times 2 records as it
can exploits its parallelism.
That said, your comment still applies, ie. there may be cases in which one
side is very onerous despite is the one with less records involved. Do you have
any suggestion about how to estimate this? Thanks.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]