[
https://issues.apache.org/jira/browse/SPARK-20187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yuming Wang resolved SPARK-20187.
---------------------------------
Resolution: Duplicate
> Replace loadTable with moveFile to speed up load table for many output files
> ----------------------------------------------------------------------------
>
> Key: SPARK-20187
> URL: https://issues.apache.org/jira/browse/SPARK-20187
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Yuming Wang
> Attachments: spark.loadTable.log.tar.gz, spark.moveFile.log.tar.gz
>
>
> [HiveClientImpl.loadTable|https://github.com/apache/spark/blob/v2.1.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L667]
> loads files one by one, so this step takes a long time if a job generates
> many files. The [Hive.moveFile
> API|https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L2567]
> can speed up this step for {{create table tableName as select ...}} and
> {{insert overwrite table tableName select ...}}.
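
To illustrate the difference between moving staged output file by file and moving the staging directory as a whole, here is a minimal Python sketch on a local filesystem. It is not Spark or Hive code: the helper names (`move_files_one_by_one`, `move_directory`, `make_staging`, `demo`) are hypothetical, local `os.rename` only stands in for whatever filesystem calls the two Hive APIs issue, and the directory case assumes the destination does not exist yet.

```python
import os
import tempfile


def move_files_one_by_one(staging_dir, dest_dir):
    """Move each staged file separately; return the number of rename calls."""
    os.makedirs(dest_dir, exist_ok=True)
    renames = 0
    for name in sorted(os.listdir(staging_dir)):
        os.rename(os.path.join(staging_dir, name), os.path.join(dest_dir, name))
        renames += 1
    return renames


def move_directory(staging_dir, dest_dir):
    """Move the whole staging directory with a single rename call."""
    os.rename(staging_dir, dest_dir)
    return 1


def make_staging(root, name, n_files):
    """Create a hypothetical staging directory holding n_files small part files."""
    staging = os.path.join(root, name)
    os.makedirs(staging)
    for i in range(n_files):
        with open(os.path.join(staging, f"part-{i:05d}"), "w") as f:
            f.write("row\n")
    return staging


def demo(n_files=1000):
    """Run both strategies and report rename counts and files that arrived."""
    with tempfile.TemporaryDirectory() as root:
        s1 = make_staging(root, "staging_a", n_files)
        s2 = make_staging(root, "staging_b", n_files)
        per_file = move_files_one_by_one(s1, os.path.join(root, "table_a"))
        whole_dir = move_directory(s2, os.path.join(root, "table_b"))
        moved_a = len(os.listdir(os.path.join(root, "table_a")))
        moved_b = len(os.listdir(os.path.join(root, "table_b")))
    return per_file, whole_dir, moved_a, moved_b
```

The number of calls grows with the file count in the first strategy and stays at one in the second. The sketch does not model the remaining cost of a directory-level move on a real cluster, which was still several minutes in the run reported here.
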
> Here is a comparison of the two APIs:
> {noformat:align=left|title=loadTable API: it took about 26 minutes (10:50:14 -
> 11:16:18) to load the table}
> 17/04/01 10:50:04 INFO TaskSetManager: Finished task 207165.0 in stage 0.0
> (TID 216796) in 5952 ms on jqhadoop-test28-8.int.yihaodian.com (executor 54)
> (216869/216869)
> 17/04/01 10:50:04 INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have
> all completed, from pool
> 17/04/01 10:50:04 INFO DAGScheduler: ResultStage 0 (processCmd at
> CliDriver.java:376) finished in 541.797 s
> 17/04/01 10:50:04 INFO DAGScheduler: Job 0 finished: processCmd at
> CliDriver.java:376, took 551.208919 s
> 17/04/01 10:50:04 INFO FileFormatWriter: Job null committed.
> 17/04/01 10:50:14 INFO Hive: Replacing
> src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-00000-9335c5f3-60fa-418b-a466-2d76a5e84537-c000,
> dest:
> viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-00000-9335c5f3-60fa-418b-a466-2d76a5e84537-c000,
> Status:true
> 17/04/01 10:50:14 INFO Hive: Replacing
> src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-00001-9335c5f3-60fa-418b-a466-2d76a5e84537-c000,
> dest:
> viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-00001-9335c5f3-60fa-418b-a466-2d76a5e84537-c000,
> Status:true
> ...
> 17/04/01 11:16:11 INFO Hive: Replacing
> src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-99999-9335c5f3-60fa-418b-a466-2d76a5e84537-c000,
> dest:
> viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-99999-9335c5f3-60fa-418b-a466-2d76a5e84537-c000,
> Status:true
> 17/04/01 11:16:18 INFO SparkSqlParser: Parsing command:
> `tmp`.`spark_load_slow`
> 17/04/01 11:16:18 INFO CatalystSqlParser: Parsing command: string
> 17/04/01 11:16:18 INFO CatalystSqlParser: Parsing command: string
> 17/04/01 11:16:18 INFO CatalystSqlParser: Parsing command: string
> 17/04/01 11:16:18 INFO CatalystSqlParser: Parsing command: string
> 17/04/01 11:16:18 INFO CatalystSqlParser: Parsing command: string
> Time taken: 2178.736 seconds
> 17/04/01 11:16:18 INFO CliDriver: Time taken: 2178.736 seconds
> {noformat}
> {noformat:align=left|title=moveFile API: it took about 9 minutes (13:24:39 -
> 13:33:46) to load the table}
> 17/04/01 13:24:38 INFO TaskSetManager: Finished task 210610.0 in stage 0.0
> (TID 216829) in 5888 ms on jqhadoop-test28-28.int.yihaodian.com (executor 59)
> (216869/216869)
> 17/04/01 13:24:38 INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have
> all completed, from pool
> 17/04/01 13:24:38 INFO DAGScheduler: ResultStage 0 (processCmd at
> CliDriver.java:376) finished in 532.409 s
> 17/04/01 13:24:38 INFO DAGScheduler: Job 0 finished: processCmd at
> CliDriver.java:376, took 539.337610 s
> 17/04/01 13:24:39 INFO FileFormatWriter: Job null committed.
> 17/04/01 13:24:39 INFO Hive: Replacing
> src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_13-14-46_099_8962745596360417817-1/-ext-10000,
> dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow_movefile,
> Status:true
> 17/04/01 13:33:46 INFO SparkSqlParser: Parsing command:
> `tmp`.`spark_load_slow_movefile`
> 17/04/01 13:33:46 INFO CatalystSqlParser: Parsing command: string
> 17/04/01 13:33:46 INFO CatalystSqlParser: Parsing command: string
> 17/04/01 13:33:46 INFO CatalystSqlParser: Parsing command: string
> 17/04/01 13:33:46 INFO CatalystSqlParser: Parsing command: string
> 17/04/01 13:33:46 INFO CatalystSqlParser: Parsing command: string
> Time taken: 1142.671 seconds
> 17/04/01 13:33:46 INFO CliDriver: Time taken: 1142.671 seconds
> {noformat}
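
The load durations quoted in the two block titles can be rechecked from the log timestamps. The short Python sketch below does that arithmetic; `elapsed_seconds` is a hypothetical helper, and the four clock times are the ones given in the titles above.

```python
from datetime import datetime


def elapsed_seconds(start, end, fmt="%H:%M:%S"):
    """Seconds between two clock times on the same day."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# Load phase with loadTable: 10:50:14 to 11:16:18
load_table = elapsed_seconds("10:50:14", "11:16:18")
# Load phase with moveFile: 13:24:39 to 13:33:46
move_file = elapsed_seconds("13:24:39", "13:33:46")
speedup = load_table / move_file

print(load_table, move_file, round(speedup, 2))  # prints: 1564 547 2.86
```

That is 26 min 4 s against 9 min 7 s for the load phase alone, a bit under a 3x difference in this single run.
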
> More logs can be found in the attachments.