GitHub user mgaido91 opened a pull request:
https://github.com/apache/spark/pull/22008
[SPARK-24928][SQL] Optimize cross join according to stats
## What changes were proposed in this pull request?
The cartesian product of 2 RDDs perform a nested loop. This means that the
iterator for the inner RDD is built as many times as the number of rows of the
outer one. If the two RDDs have a very different size, the performance
difference can be huge.
As there is no way to know which is the best RDD to choose as outer one
(since we don't know the sizes), this cannot be addressed at RDD level. Only a
comment has been added to warn/help the user to be careful about how they write
their code.
The PR proposed to add an optimizer rule which uses statistics collected on
tables in order to change the sides of the cartesian product so that the outer
table is the smaller one.
## How was this patch tested?
added test suite
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mgaido91/spark SPARK-24928
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22008.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22008
----
commit 1caf2567694c56cca019e6608609b81ac70deefa
Author: Marco Gaido <marcogaido91@...>
Date: 2018-08-06T15:51:58Z
[SPARK-24928][SQL] Optimize cross join according to stats
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]