GitHub user habren opened a pull request:
https://github.com/apache/spark/pull/22018
[SPARK-25038][SQL] Accelerate Spark Plan generation when Spark SQL re…
https://issues.apache.org/jira/browse/SPARK-25038
When Spark SQL reads a large amount of data, it takes a long time (more than 10
minutes) to generate the physical plan and then the ActiveJob.
Example:
There is a table partitioned by date and hour. It holds more than 13 TB of
data per hour and 185 TB per day. Even a very simple SQL query against it
takes a long time to generate the ActiveJob.
The SQL statement is:
select count(device_id) from test_tbl where date=20180731 and hour='21';
Before the optimization, it takes 2 minutes and 9 seconds to generate the job.
The SQL is issued at 2018-08-07 09:07:41.
However, the job is submitted at 2018-08-07 09:09:53, which is 2 minutes and
9 seconds later than the SQL issue time.
After the optimization, it takes only 4 seconds to generate the job.
The SQL is issued at 2018-08-07 09:20:15.
The job is submitted at 2018-08-07 09:20:19, which is 4 seconds later
than the SQL issue time.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/habren/spark SPARK-25038
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22018.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22018
----
commit 2bb5924e04eba5accfe58a4fbae094d46cc36488
Author: Jason Guo <jason.guo.vip@...>
Date: 2018-08-07T03:13:03Z
[SPARK-25038][SQL] Accelerate Spark Plan generation when Spark SQL read
large amount of data
----