GitHub user mgaido91 opened a pull request:

    https://github.com/apache/spark/pull/22008

    [SPARK-24928][SQL] Optimize cross join according to stats

    ## What changes were proposed in this pull request?
    
    The cartesian product of 2 RDDs perform a nested loop. This means that the 
iterator for the inner RDD is built as many times as the number of rows of the 
outer one. If the two RDDs have a very different size, the performance 
difference can be huge.
    
    As there is no way to know which is the best RDD to choose as outer one 
(since we don't know the sizes), this cannot be addressed at RDD level. Only a 
comment has been added to warn/help the user to be careful about how they write 
their code.
    
    The PR proposed to add an optimizer rule which uses statistics collected on 
tables in order to change the sides of the cartesian product so that the outer 
table is the smaller one.
    
    ## How was this patch tested?
    
    added test suite

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-24928

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22008.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22008
    
----
commit 1caf2567694c56cca019e6608609b81ac70deefa
Author: Marco Gaido <marcogaido91@...>
Date:   2018-08-06T15:51:58Z

    [SPARK-24928][SQL] Optimize cross join according to stats

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to