[ 
https://issues.apache.org/jira/browse/SPARK-20187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-20187.
---------------------------------
    Resolution: Duplicate

> Replace loadTable with moveFile to speed up load table for many output files
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-20187
>                 URL: https://issues.apache.org/jira/browse/SPARK-20187
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Yuming Wang
>         Attachments: spark.loadTable.log.tar.gz, spark.moveFile.log.tar.gz
>
>
> [HiveClientImpl.loadTable|https://github.com/apache/spark/blob/v2.1.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L667]
>  loads files one by one, so this step takes a long time when a job generates 
> many files. There is a [Hive.moveFile 
> API|https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L2567]
>  that can speed up this step for {{create table tableName as select ...}} and 
> {{insert overwrite table tableName select ...}}.
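> The difference is essentially per-file renames versus a single directory 
> move. The following Scala sketch is hypothetical (not the actual 
> HiveClientImpl code; the object and method names are made up for 
> illustration) and uses the Hadoop FileSystem API to show the two strategies:
> {code:scala}
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> 
> object LoadTableSketch {
>   // Per-file strategy (what loadTable effectively does): rename every
>   // part file from the staging directory into the table directory,
>   // one filesystem call per file.
>   def movePerFile(conf: Configuration, staging: Path, tableDir: Path): Unit = {
>     val fs = FileSystem.get(staging.toUri, conf)
>     fs.listStatus(staging).foreach { status =>
>       fs.rename(status.getPath, new Path(tableDir, status.getPath.getName))
>     }
>   }
> 
>   // Whole-directory strategy (the moveFile idea): replace the table
>   // directory with the staging directory in a single rename, one call
>   // regardless of how many files were written.
>   def moveWholeDir(conf: Configuration, staging: Path, tableDir: Path): Unit = {
>     val fs = FileSystem.get(staging.toUri, conf)
>     if (fs.exists(tableDir)) fs.delete(tableDir, true) // overwrite semantics
>     fs.rename(staging, tableDir)
>   }
> }
> {code}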
> Here is a comparison of the two APIs:
> {noformat:align=left|title=loadTable API: it took about 26 minutes (10:50:14 - 11:16:18) to load the table}
> 17/04/01 10:50:04 INFO TaskSetManager: Finished task 207165.0 in stage 0.0 
> (TID 216796) in 5952 ms on jqhadoop-test28-8.int.yihaodian.com (executor 54) 
> (216869/216869)
> 17/04/01 10:50:04 INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have 
> all completed, from pool 
> 17/04/01 10:50:04 INFO DAGScheduler: ResultStage 0 (processCmd at 
> CliDriver.java:376) finished in 541.797 s
> 17/04/01 10:50:04 INFO DAGScheduler: Job 0 finished: processCmd at 
> CliDriver.java:376, took 551.208919 s
> 17/04/01 10:50:04 INFO FileFormatWriter: Job null committed.
> 17/04/01 10:50:14 INFO Hive: Replacing 
> src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-00000-9335c5f3-60fa-418b-a466-2d76a5e84537-c000,
>  dest: 
> viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-00000-9335c5f3-60fa-418b-a466-2d76a5e84537-c000,
>  Status:true
> 17/04/01 10:50:14 INFO Hive: Replacing 
> src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-00001-9335c5f3-60fa-418b-a466-2d76a5e84537-c000,
>  dest: 
> viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-00001-9335c5f3-60fa-418b-a466-2d76a5e84537-c000,
>  Status:true
> ...
> 17/04/01 11:16:11 INFO Hive: Replacing 
> src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-99999-9335c5f3-60fa-418b-a466-2d76a5e84537-c000,
>  dest: 
> viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-99999-9335c5f3-60fa-418b-a466-2d76a5e84537-c000,
>  Status:true
> 17/04/01 11:16:18 INFO SparkSqlParser: Parsing command: 
> `tmp`.`spark_load_slow`
> 17/04/01 11:16:18 INFO CatalystSqlParser: Parsing command: string
> 17/04/01 11:16:18 INFO CatalystSqlParser: Parsing command: string
> 17/04/01 11:16:18 INFO CatalystSqlParser: Parsing command: string
> 17/04/01 11:16:18 INFO CatalystSqlParser: Parsing command: string
> 17/04/01 11:16:18 INFO CatalystSqlParser: Parsing command: string
> Time taken: 2178.736 seconds
> 17/04/01 11:16:18 INFO CliDriver: Time taken: 2178.736 seconds
> {noformat}
> {noformat:align=left|title=moveFile API: it took about 9 minutes (13:24:39 - 13:33:46) to load the table}
> 17/04/01 13:24:38 INFO TaskSetManager: Finished task 210610.0 in stage 0.0 
> (TID 216829) in 5888 ms on jqhadoop-test28-28.int.yihaodian.com (executor 59) 
> (216869/216869)
> 17/04/01 13:24:38 INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have 
> all completed, from pool 
> 17/04/01 13:24:38 INFO DAGScheduler: ResultStage 0 (processCmd at 
> CliDriver.java:376) finished in 532.409 s
> 17/04/01 13:24:38 INFO DAGScheduler: Job 0 finished: processCmd at 
> CliDriver.java:376, took 539.337610 s
> 17/04/01 13:24:39 INFO FileFormatWriter: Job null committed.
> 17/04/01 13:24:39 INFO Hive: Replacing 
> src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_13-14-46_099_8962745596360417817-1/-ext-10000,
>  dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow_movefile, 
> Status:true
> 17/04/01 13:33:46 INFO SparkSqlParser: Parsing command: 
> `tmp`.`spark_load_slow_movefile`
> 17/04/01 13:33:46 INFO CatalystSqlParser: Parsing command: string
> 17/04/01 13:33:46 INFO CatalystSqlParser: Parsing command: string
> 17/04/01 13:33:46 INFO CatalystSqlParser: Parsing command: string
> 17/04/01 13:33:46 INFO CatalystSqlParser: Parsing command: string
> 17/04/01 13:33:46 INFO CatalystSqlParser: Parsing command: string
> Time taken: 1142.671 seconds
> 17/04/01 13:33:46 INFO CliDriver: Time taken: 1142.671 seconds
> {noformat}
> More logs can be found in the attachments.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
