[ https://issues.apache.org/jira/browse/MAPREDUCE-7185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Igor Dvorzhak updated MAPREDUCE-7185:
-------------------------------------
Description:
If a map task outputs multiple files, it can be slow to move them from the
temp directory to the output directory in object stores (GCS, S3, etc.).
To improve performance, we need to parallelize the move of multiple files in
FileOutputCommitter.
Repro:
Start spark-shell:
{code:bash}
spark-shell --num-executors 2 --executor-memory 10G --executor-cores 4 \
  --conf spark.dynamicAllocation.maxExecutors=2
{code}
From spark-shell:
{code:scala}
val df = (1 to 10000).toList.toDF("value")
  .withColumn("p", $"value" % 10)
  .repartition(50)
df.write.partitionBy("p").mode("overwrite").format("parquet")
  .options(Map("path" -> s"gs://some/path"))
  .saveAsTable("parquet_partitioned_bench")
{code}
With the fix, execution time drops from 130 seconds to 50 seconds.
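For illustration, a minimal sketch of the idea (not the attached patch: the class name ParallelCommitSketch, the method parallelMove, the flat directory layout, and the thread count are assumptions for exposition). It issues the per-file renames from a thread pool instead of a sequential loop, which overlaps the per-object copy-plus-delete latency that renames incur in stores like GCS and S3:
{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelCommitSketch {

  // Moves every file directly under taskAttemptPath into outputPath,
  // issuing the renames concurrently. The real FileOutputCommitter merge
  // also recurses into subdirectories and handles name conflicts; this
  // sketch keeps only the parallelization idea.
  static void parallelMove(FileSystem fs, Path taskAttemptPath,
      Path outputPath, int numThreads) throws IOException {
    FileStatus[] files = fs.listStatus(taskAttemptPath);
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    try {
      List<Future<Boolean>> renames = new ArrayList<>();
      for (FileStatus file : files) {
        Path source = file.getPath();
        Path dest = new Path(outputPath, source.getName());
        // Each rename is an independent task; in object stores a rename
        // is typically a server-side copy plus delete, so overlapping
        // them amortizes the round-trip latency.
        renames.add(pool.submit(() -> fs.rename(source, dest)));
      }
      for (Future<Boolean> rename : renames) {
        if (!rename.get()) {
          throw new IOException("Failed to rename a part file");
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IOException("Interrupted while committing task output", e);
    } catch (ExecutionException e) {
      throw new IOException("Rename failed", e.getCause());
    } finally {
      pool.shutdown();
    }
  }
}
{code}
The attached MAPREDUCE-7185.patch is the authoritative change; the sketch only illustrates why the benchmark above speeds up.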
was:
If a map task outputs multiple files, it can be slow to move them from the
temp directory to the output directory in object stores.
To improve performance, we need to parallelize the move of multiple files in
FileOutputCommitter.
Repro:
Start spark-shell:
{code:bash}
spark-shell --num-executors 2 --executor-memory 10G --executor-cores 4 \
  --conf spark.dynamicAllocation.maxExecutors=2
{code}
From spark-shell:
{code:scala}
val df = (1 to 10000).toList.toDF("value")
  .withColumn("p", $"value" % 10)
  .repartition(50)
df.write.partitionBy("p").mode("overwrite").format("parquet")
  .options(Map("path" -> s"gs://some/path"))
  .saveAsTable("parquet_partitioned_bench")
{code}
With the fix, execution time drops from 130 seconds to 50 seconds.
> Parallelize part files move in FileOutputCommitter
> --------------------------------------------------
>
> Key: MAPREDUCE-7185
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7185
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Affects Versions: 3.2.0, 2.9.2
> Reporter: Igor Dvorzhak
> Priority: Major
> Attachments: MAPREDUCE-7185.patch
>
>
> If a map task outputs multiple files, it can be slow to move them from the
> temp directory to the output directory in object stores (GCS, S3, etc.).
> To improve performance, we need to parallelize the move of multiple files in
> FileOutputCommitter.
> Repro:
> Start spark-shell:
> {code:bash}
> spark-shell --num-executors 2 --executor-memory 10G --executor-cores 4 \
>   --conf spark.dynamicAllocation.maxExecutors=2
> {code}
> From spark-shell:
> {code:scala}
> val df = (1 to 10000).toList.toDF("value")
>   .withColumn("p", $"value" % 10)
>   .repartition(50)
> df.write.partitionBy("p").mode("overwrite").format("parquet")
>   .options(Map("path" -> s"gs://some/path"))
>   .saveAsTable("parquet_partitioned_bench")
> {code}
> With the fix, execution time drops from 130 seconds to 50 seconds.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]