[
https://issues.apache.org/jira/browse/SPARK-20107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yuming Wang updated SPARK-20107:
--------------------------------
Description:
Set {{mapreduce.fileoutputcommitter.algorithm.version=2}} to speed up
[HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121]
for jobs that write many output files.
It saves {{11 minutes}} on the following job, which writes 216869 output files:
{code:sql}
CREATE TABLE tmp.spark_20107 AS SELECT
  category_id,
  product_id,
  track_id,
  concat(
    substr(ds, 3, 2),
    substr(ds, 6, 2),
    substr(ds, 9, 2)
  ) shortDate,
  CASE
    WHEN actiontype = '0' THEN 'browse'
    WHEN actiontype = '1' THEN 'fav'
    WHEN actiontype = '2' THEN 'cart'
    WHEN actiontype = '3' THEN 'order'
    ELSE 'invalid actio'
  END AS type
FROM
  tmp.user_action
WHERE
  ds > date_sub('2017-01-23', 730)
  AND actiontype IN ('0','1','2','3');
{code}
{code}
$ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l
216870
{code}
This improvement applies to Cloudera's Hadoop cdh5-2.6.0_5.4.0 and later
versions (see:
[cloudera/hadoop-common@1c12361|https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433]
and
[cloudera/hadoop-common@16b2de2|https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0])
and to Apache Hadoop 2.7.0 and later versions.
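Until the default changes, one way to try the setting on a single job is to pass it as a Hadoop property through Spark's {{spark.hadoop.*}} configuration prefix. This is only a sketch: the script name {{spark_20107.sql}} is hypothetical and stands for a file containing the CTAS statement above.
{code}
# spark_20107.sql is a hypothetical file holding the CREATE TABLE ... AS SELECT statement
$ spark-sql \
    --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
    -f spark_20107.sql
{code}
The same key and value could instead go into {{spark-defaults.conf}} to cover every job submitted from that installation.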
was:
It can speed up {{11 minutes}} for 216869 output files.
This improvement can effect all cloudera's hadoop cdh5-2.6.0_5.4.0 higher
versions,(see:
https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433
and
https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0)
and apache's hadoop 2.7.0 higher versions.
> Speed up FileOutputCommitter#commitJob for many output files
> ------------------------------------------------------------
>
> Key: SPARK-20107
> URL: https://issues.apache.org/jira/browse/SPARK-20107
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Yuming Wang