GitHub user cloud-fan opened a pull request:
https://github.com/apache/spark/pull/22152
[SPARK-25159][SQL] json schema inference should only trigger one job
## What changes were proposed in this pull request?
This fixes a performance regression introduced by
https://github.com/apache/spark/pull/21376 .
`JsonInferSchema` should not use `RDD#toLocalIterator`, which triggers one
Spark job per RDD partition. That is very expensive for RDDs with many small
partitions.
To fix it, this PR introduces a way to access SQLConf in the scheduler
event loop thread, so that `JsonInferSchema` no longer needs
`RDD#toLocalIterator`.
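
To illustrate why the job count matters, here is a minimal toy model in plain Python. It is not Spark code and not the code in this PR: `ToyRDD`, `run_job`, `fold_partitions`, `infer_row`, and `merge_schemas` are hypothetical names made up for this sketch, and the "schema" is just a dict from field name to Python type name. It only shows that iterating partition by partition costs one job per partition, while a distributed fold produces the same merged schema in a single job.

```python
from functools import reduce


class ToyRDD:
    """Toy stand-in for a partitioned dataset. Hypothetical; not Spark's API."""

    def __init__(self, partitions):
        self.partitions = partitions
        self.jobs_run = 0

    def run_job(self, func, partition_ids):
        # One call models one scheduler job over the given partitions.
        self.jobs_run += 1
        return [func(self.partitions[i]) for i in partition_ids]

    def to_local_iterator(self):
        # Models the per-partition behavior: one job for each partition.
        for i in range(len(self.partitions)):
            (rows,) = self.run_job(list, [i])
            yield from rows

    def fold_partitions(self, zero, seq_op, comb_op):
        # Models a single job: fold inside each partition, then merge partials.
        partials = self.run_job(
            lambda rows: reduce(seq_op, rows, zero),
            range(len(self.partitions)),
        )
        return reduce(comb_op, partials, zero)


def infer_row(row):
    # Toy per-record "schema": field name -> Python type name.
    return {name: type(value).__name__ for name, value in row.items()}


def merge_schemas(left, right):
    # Union of fields; conflicting types fall back to "str" (an assumption
    # of this sketch, chosen only to keep the merge associative).
    merged = dict(left)
    for name, typ in right.items():
        merged[name] = typ if merged.get(name, typ) == typ else "str"
    return merged


def fold_row(schema, row):
    return merge_schemas(schema, infer_row(row))


data = [[{"a": 1}], [{"a": "x", "b": 2.5}], [], [{"c": True}]]

# Driver-side merge over a local iterator: one job per partition.
slow = ToyRDD(data)
schema_slow = reduce(fold_row, slow.to_local_iterator(), {})

# Distributed fold: one job for the whole dataset.
fast = ToyRDD(data)
schema_fast = fold_row and fast.fold_partitions({}, fold_row, merge_schemas)

print(schema_slow == schema_fast, slow.jobs_run, fast.jobs_run)
```

In this model the per-partition path grows linearly in the number of partitions, which is the overhead the description above refers to; the single-job path stays at one job regardless of partition count.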
## How was this patch tested?
A new test.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/cloud-fan/spark conf
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22152.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22152
----
commit cf13d71cb1b23ad6e5ad4644df8c591bfb7a00f9
Author: Wenchen Fan <wenchen@...>
Date: 2018-08-17T04:30:31Z
allow accessing SQLConf in the scheduler event loop thread
----