[GitHub] spark pull request #22152: [SPARK-25159][SQL] json schema inference should o...

cloud-fan Mon, 20 Aug 2018 01:40:42 -0700

GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/22152


    [SPARK-25159][SQL] json schema inference should only trigger one job

    ## What changes were proposed in this pull request?
    
    This fixes a performance regression caused by
    https://github.com/apache/spark/pull/21376.
    
    We should not use `RDD#toLocalIterator`, which triggers one Spark job per
    RDD partition. This is costly for RDDs with many small partitions, because
    each partition pays the overhead of launching a separate job.
    
    To fix it, this PR introduces a way to access SQLConf in the scheduler
    event loop thread, so that `JsonInferSchema` no longer needs to use
    `RDD#toLocalIterator`.
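
To illustrate the cost difference described above, here is a minimal toy model in plain Python. It is not Spark code and not this PR's implementation: `ToyRDD`, `to_local_iterator`, `fold_partitions`, and the "schema as a set of field names" representation are all hypothetical stand-ins, assumed only to count how many jobs each strategy would launch.

```python
# Toy model (NOT Spark code): counts "jobs" for two ways of merging
# per-row schemas. All names here are hypothetical illustrations.


class ToyRDD:
    """Stand-in for an RDD: a list of partitions plus a job counter."""

    def __init__(self, partitions):
        self.partitions = partitions
        self.jobs_run = 0

    def to_local_iterator(self):
        # Models the behavior described in the PR text: one job per
        # partition, each bringing that partition's rows to the driver.
        for part in self.partitions:
            self.jobs_run += 1
            yield from part

    def fold_partitions(self, zero, merge):
        # Models a single job: merge rows within each partition, then
        # combine the per-partition results.
        self.jobs_run += 1
        result = zero
        for part in self.partitions:
            partial = zero
            for row in part:
                partial = merge(partial, row)
            result = merge(result, partial)
        return result


# Each "row" is the set of field names seen in one JSON record (assumption).
rows = [{"a"}, {"a", "b"}, {"c"}, {"b"}]
partitions = [[row] for row in rows]  # many small partitions

# Strategy 1: iterate locally on the driver, one job per partition.
rdd1 = ToyRDD(partitions)
schema1 = frozenset()
for row in rdd1.to_local_iterator():
    schema1 = schema1 | row

# Strategy 2: merge inside a single job.
rdd2 = ToyRDD(partitions)
schema2 = rdd2.fold_partitions(frozenset(), lambda a, b: a | b)

print(rdd1.jobs_run, rdd2.jobs_run, schema1 == schema2)
```

In this model both strategies produce the same merged schema, but the first launches as many jobs as there are partitions while the second launches one, which is the gap the PR title refers to.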
    
    ## How was this patch tested?
    
    Added a new test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark conf

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22152.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22152
    
----
commit cf13d71cb1b23ad6e5ad4644df8c591bfb7a00f9
Author: Wenchen Fan <wenchen@...>
Date:   2018-08-17T04:30:31Z

    allow accessing SQLConf in the scheduler event loop thread

----

