[jira] [Updated] (MAPREDUCE-7465) performance problem in FileOutputCommiter for big list processed by single thread

Arnaud Nauwynck (Jira) Sat, 23 Dec 2023 02:44:05 -0800


     [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Arnaud Nauwynck updated MAPREDUCE-7465:
---------------------------------------
    Description: 
When committing a large Hadoop job (for example via Spark) with many partitions,
the class FileOutputCommitter renames thousands of directories and files using a 
single thread. This is a performance issue, caused by a lot of waiting on FileSystem 
storage operations.

I propose that, above a configurable threshold (default=3, set via the 
property 'mapreduce.fileoutputcommitter.parallel.threshold'), the class 
FileOutputCommitter processes the list of files to rename using parallel threads, 
via the default JVM ExecutorService (ForkJoinPool.commonPool()).

See Pull-Request: 
[https://github.com/apache/hadoop/pull/6378|https://github.com/apache/hadoop/pull/6378]
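
The threshold-based switch proposed above can be sketched roughly as follows. 
This is only an illustration, not the code from the pull request: the class 
name ParallelRenameSketch, the helper processAll and the constant 
DEFAULT_PARALLEL_THRESHOLD are hypothetical, and a plain Consumer stands in for 
the FileSystem rename call. It relies on the fact that Java parallel streams 
run on ForkJoinPool.commonPool() by default.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical sketch: not the actual FileOutputCommitter change.
final class ParallelRenameSketch {

    // Assumed to mirror the proposed default of
    // 'mapreduce.fileoutputcommitter.parallel.threshold'.
    static final int DEFAULT_PARALLEL_THRESHOLD = 3;

    // Applies 'action' to every item. Lists no larger than 'threshold' are
    // processed sequentially on the calling thread; larger lists are
    // processed with a parallel stream, which by default uses
    // ForkJoinPool.commonPool() together with the calling thread.
    static <T> void processAll(List<T> items, int threshold,
                               Consumer<? super T> action) {
        if (items.size() <= threshold) {
            for (T item : items) {
                action.accept(item);
            }
            return;
        }
        // forEach returns only after all items have been processed.
        items.parallelStream().forEach(action);
    }

    public static void main(String[] args) {
        // Stand-in for the list of paths to rename.
        List<String> paths =
            List.of("part-0", "part-1", "part-2", "part-3", "part-4");
        AtomicInteger processed = new AtomicInteger();
        // The action must be thread-safe, since it may run concurrently.
        processAll(paths, DEFAULT_PARALLEL_THRESHOLD,
                   p -> processed.incrementAndGet());
        System.out.println("processed " + processed.get() + " entries");
    }
}
```

A real rename throws IOException, which a Consumer cannot propagate as a 
checked exception, so an actual implementation would need to wrap and rethrow 
it. The common pool is also shared by the whole JVM, so blocking storage calls 
there can delay other parallel work.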

Note that subclass instances of FileOutputCommitter are supposed to be 
created at runtime depending on a configurable property 
([https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/PathOutputCommitterFactory.java|PathOutputCommitterFactory.java]).

But in Parquet + Spark, for example, this mechanism is buggy and the committer 
cannot be changed at runtime. 
There is an ongoing Jira and PR to fix it in Parquet + Spark: 
[https://issues.apache.org/jira/browse/PARQUET-2416|https://issues.apache.org/jira/browse/PARQUET-2416]



  was:
When committing a large Hadoop job (for example via Spark) with many partitions,
the class FileOutputCommitter renames thousands of directories and files using a 
single thread. This is a performance issue, caused by a lot of waiting on FileSystem 
storage operations.


Note that subclass instances of FileOutputCommitter are supposed to be 
created at runtime depending on a configurable property 
([https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/PathOutputCommitterFactory.java|PathOutputCommitterFactory.java]).

But in Parquet + Spark, for example, this mechanism is buggy and the committer 
cannot be changed at runtime. 
There is an ongoing Jira and PR to fix it in Parquet + Spark: 
[https://issues.apache.org/jira/browse/PARQUET-2416|https://issues.apache.org/jira/browse/PARQUET-2416]




> performance problem in FileOutputCommiter for big list processed  by single 
> thread
> ----------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-7465
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7465
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: performance
>    Affects Versions: 3.2.3, 3.3.2, 3.2.4, 3.3.5, 3.3.3, 3.3.4, 3.3.6
>            Reporter: Arnaud Nauwynck
>            Priority: Minor
>              Labels: pull-request-available



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
