[ https://issues.apache.org/jira/browse/SPARK-20107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yuming Wang updated SPARK-20107:
--------------------------------
    Description:
Set {{mapreduce.fileoutputcommitter.algorithm.version=2}} to speed up [HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121] for many output files.

It can save {{11 minutes}} for 216869 output files:
{code:sql}
CREATE TABLE tmp.spark_20107 AS SELECT
  category_id,
  product_id,
  track_id,
  concat(
    substr(ds, 3, 2),
    substr(ds, 6, 2),
    substr(ds, 9, 2)
  ) shortDate,
  CASE WHEN actiontype = '0' THEN 'browse'
       WHEN actiontype = '1' THEN 'fav'
       WHEN actiontype = '2' THEN 'cart'
       WHEN actiontype = '3' THEN 'order'
       ELSE 'invalid actio' END AS type
FROM
  tmp.user_action
WHERE
  ds > date_sub('2017-01-23', 730)
AND actiontype IN ('0','1','2','3');
{code}
{code}
$ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l
216870
{code}
This improvement applies to Cloudera's Hadoop cdh5-2.6.0_5.4.0 and higher versions (see [cloudera/hadoop-common@1c12361|https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433] and [cloudera/hadoop-common@16b2de2|https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0]) and to Apache Hadoop 2.7.0 and higher versions.

  was:
It can speed up {{11 minutes}} for 216869 output files.
This improvement can effect all cloudera's hadoop cdh5-2.6.0_5.4.0 higher versions,(see: https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433 and https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0) and apache's hadoop 2.7.0 higher versions.
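To illustrate why the setting can matter for a job with this many output files, here is a minimal sketch that only counts rename calls under the two commit algorithms. It is a simplified toy model, not Hadoop's or Spark's implementation: the names {{CommitModel}}, {{simulate}}, {{task_side_renames}} and {{job_side_renames}} are hypothetical, and it assumes that with version 1 the job commit merges every committed file into the output directory one by one, while with version 2 each task moves its own files there at task commit.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CommitModel:
    """Toy model (hypothetical): counts renames, touches no real file system."""
    algorithm_version: int
    task_side_renames: int = 0  # issued by tasks, spread across the cluster
    job_side_renames: int = 0   # issued one after another during commitJob
    pending_task_files: List[int] = field(default_factory=list)

    def commit_task(self, num_files: int) -> None:
        if self.algorithm_version == 1:
            # Assumed v1 behaviour: one directory rename per task; the files
            # wait in a temporary location until the job commit.
            self.task_side_renames += 1
            self.pending_task_files.append(num_files)
        elif self.algorithm_version == 2:
            # Assumed v2 behaviour: the task moves its files straight into
            # the final output directory.
            self.task_side_renames += num_files
        else:
            raise ValueError("algorithm_version must be 1 or 2")

    def commit_job(self) -> None:
        # Under v1 the pending files are merged serially; under v2 the
        # pending list is empty, so nothing is left to rename here.
        for num_files in self.pending_task_files:
            self.job_side_renames += num_files
        self.pending_task_files.clear()


def simulate(algorithm_version: int, num_tasks: int, files_per_task: int) -> CommitModel:
    """Run every task commit, then the job commit, and return the counters."""
    model = CommitModel(algorithm_version)
    for _ in range(num_tasks):
        model.commit_task(files_per_task)
    model.commit_job()
    return model


if __name__ == "__main__":
    for version in (1, 2):
        result = simulate(version, num_tasks=1000, files_per_task=3)
        print(version, result.task_side_renames, result.job_side_renames)
```

In this model the serial work in the job commit grows with the number of output files under version 1 and disappears under version 2, which matches the kind of saving reported above. In Spark, Hadoop properties are commonly passed with the {{spark.hadoop.}} prefix (for example {{spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2}}); treat the exact invocation as an assumption to verify against your Spark and Hadoop versions.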
> Speed up FileOutputCommitter#commitJob for many output files
> ------------------------------------------------------------
>
>                 Key: SPARK-20107
>                 URL: https://issues.apache.org/jira/browse/SPARK-20107
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Yuming Wang
>
> Set {{mapreduce.fileoutputcommitter.algorithm.version=2}} to speed up [HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121] for many output files.
> It can save {{11 minutes}} for 216869 output files:
> {code:sql}
> CREATE TABLE tmp.spark_20107 AS SELECT
>   category_id,
>   product_id,
>   track_id,
>   concat(
>     substr(ds, 3, 2),
>     substr(ds, 6, 2),
>     substr(ds, 9, 2)
>   ) shortDate,
>   CASE WHEN actiontype = '0' THEN 'browse'
>        WHEN actiontype = '1' THEN 'fav'
>        WHEN actiontype = '2' THEN 'cart'
>        WHEN actiontype = '3' THEN 'order'
>        ELSE 'invalid actio' END AS type
> FROM
>   tmp.user_action
> WHERE
>   ds > date_sub('2017-01-23', 730)
> AND actiontype IN ('0','1','2','3');
> {code}
> {code}
> $ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l
> 216870
> {code}
> This improvement applies to Cloudera's Hadoop cdh5-2.6.0_5.4.0 and higher versions (see [cloudera/hadoop-common@1c12361|https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433] and [cloudera/hadoop-common@16b2de2|https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0]) and to Apache Hadoop 2.7.0 and higher versions.