[ https://issues.apache.org/jira/browse/SPARK-25038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Guo updated SPARK-25038:
------------------------------
    Description:
When Spark SQL reads a large amount of data, it takes a long time (more than 10 minutes) to generate the physical plan and then the ActiveJob.

Example:
There is a table that is partitioned by date and hour. It holds more than 13 TB of data per hour and 185 TB per day. Even a very simple SQL statement takes a long time to produce an ActiveJob.

The SQL statement is
{code:java}
select count(device_id) from test_tbl where date=20180731 and hour='21';
{code}

Before optimization, it takes 2 minutes and 9 seconds to generate the job.

The SQL is issued at 2018-08-07 09:07:41
!issue sql original.png!

However, the job is submitted at 2018-08-07 09:09:53, which is 2 minutes and 9 seconds after the SQL was issued
!job start original.png!

After the optimization, it takes only 4 seconds to generate the job.

The SQL is issued at 2018-08-07 09:20:15
!issue sql optimized.png!

And the job is submitted at 2018-08-07 09:20:19, which is 4 seconds after the SQL was issued
!job start optimized.png!

  was:
When Spark SQL reads a large amount of data, it takes a long time (more than 10 minutes) to generate the physical plan and then the ActiveJob.

Example:
There is a table that is partitioned by date and hour. It holds more than 13 TB of data per hour and 185 TB per day. Even a very simple SQL statement takes a long time to produce an ActiveJob.

The SQL statement is
{code:java}
select count(device_id) from test_tbl where date=20180731 and hour='21';
{code}

The SQL is issued at 2018-08-07 08:43:48
!issue sql original.png!

However, the job is submitted at 2018-08-07 08:46:05, which is 2 minutes and 17 seconds after the SQL was issued
!job start original.png!
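The delays quoted in the description can be re-derived from the issue and submit timestamps. The following is a minimal Python sketch of that arithmetic; the helper name `seconds_between` and the run labels are illustrative and do not come from the issue. Computed from the timestamps as written, the 08:43:48 to 08:46:05 run comes to 137 seconds, the 09:07:41 to 09:09:53 run to 132 seconds (2 minutes 12 seconds, a few seconds more than the 2 minutes 9 seconds stated), and the optimized run to 4 seconds.

```python
from datetime import datetime

# Timestamp layout used in the issue description, e.g. "2018-08-07 09:07:41".
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def seconds_between(issued: str, submitted: str) -> int:
    """Return whole seconds from SQL issue time to job submit time."""
    start = datetime.strptime(issued, TIMESTAMP_FORMAT)
    end = datetime.strptime(submitted, TIMESTAMP_FORMAT)
    return int((end - start).total_seconds())


# (issue time, job submit time) pairs copied from the description.
# The labels are illustrative only.
RUNS = {
    "earlier report": ("2018-08-07 08:43:48", "2018-08-07 08:46:05"),
    "before optimization": ("2018-08-07 09:07:41", "2018-08-07 09:09:53"),
    "after optimization": ("2018-08-07 09:20:15", "2018-08-07 09:20:19"),
}

for label, (issued, submitted) in RUNS.items():
    minutes, seconds = divmod(seconds_between(issued, submitted), 60)
    print(f"{label}: {minutes} min {seconds} s")
```

Taking the timestamps at face value, the optimized run shortens the delay between issuing the SQL and submitting the job from roughly two minutes to a few seconds.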
> Accelerate Spark Plan generation when Spark SQL read large amount of data
> -------------------------------------------------------------------------
>
>                 Key: SPARK-25038
>                 URL: https://issues.apache.org/jira/browse/SPARK-25038
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Jason Guo
>            Priority: Critical
>         Attachments: issue sql optimized.png, issue sql original.png, job start optimized.png, job start original.png
>
>
> When Spark SQL reads a large amount of data, it takes a long time (more than 10 minutes) to generate the physical plan and then the ActiveJob.
>
> Example:
> There is a table that is partitioned by date and hour. It holds more than 13 TB of data per hour and 185 TB per day. Even a very simple SQL statement takes a long time to produce an ActiveJob.
>
> The SQL statement is
> {code:java}
> select count(device_id) from test_tbl where date=20180731 and hour='21';
> {code}
>
> Before optimization, it takes 2 minutes and 9 seconds to generate the job.
>
> The SQL is issued at 2018-08-07 09:07:41
> !issue sql original.png!
>
> However, the job is submitted at 2018-08-07 09:09:53, which is 2 minutes and 9 seconds after the SQL was issued
> !job start original.png!
>
> After the optimization, it takes only 4 seconds to generate the job.
>
> The SQL is issued at 2018-08-07 09:20:15
> !issue sql optimized.png!
>
> And the job is submitted at 2018-08-07 09:20:19, which is 4 seconds after the SQL was issued
> !job start optimized.png!

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)