[GitHub] spark pull request #22502: [SPARK-25474][SQL]Size in bytes of the query is c...

shahidki31 Thu, 20 Sep 2018 12:10:26 -0700

GitHub user shahidki31 opened a pull request:

    https://github.com/apache/spark/pull/22502


    [SPARK-25474][SQL]Size in bytes of the query is coming in EB in case of 
parquet datasource

    ## What changes were proposed in this pull request?
    In case of CatalogFileIndex datasource table, sizeInBytes is always coming 
as default size in bytes, which is  8.0EB. So, the datasource table which has 
CatalogFileIndex, always prefer SortMergeJoin, instead of BroadcastJoin, even 
if the size is below broadcast join threshold.
    In this PR, In case of CatalogFileIndex table, if we enable 
"fallBackToHdfsForStatsEnabled=true", then the computeStatistics  get the 
sizeInBytes from the hdfs and we get the actual size of the table. Hence, 
during join operation, when the table size is below broadcast threshold, it 
will prefer broadCastHashJoin instead of SortMergeJoin.
    
    ## How was this patch tested?
    Added UT


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/shahidki31/spark SPARK-25474

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22502.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22502
    
----
commit 79d0794d098ed15030dde3a7fea8b65952fa0d72
Author: Shahid <shahidki31@...>
Date:   2018-09-20T18:58:22Z

    [SPARK-25474][SQL]Size in bytes of the query is coming in EB in case of 
parquet datasource

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request #22502: [SPARK-25474][SQL]Size in bytes of the query is c...

Reply via email to