[jira] [Commented] (MAPREDUCE-7465) performance problem in FileOutputCommiter for big list processed by single thread

Steve Loughran (Jira) Wed, 03 Jan 2024 07:13:04 -0800


    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802213#comment-17802213
 ]


Steve Loughran commented on MAPREDUCE-7465:
-------------------------------------------

bq. As of now, I have seen that the class "ParquetOutputformat" is hardcoded in 
spark, and is hard-coding the use of (extends) class "FileOutputCommiter". I 
did not see any way to make it configurable. It requires more PR in parquet 
library.

spark's hadoop cloud library has the workaround for this; subclass of 
ParquetOutputCommitter which dynamically binds to the underlying target config. 
It's exactly the same as needed for the s3a committers, so any docs there will 
help.

 email me at stevel@apache and I'll help you get set up. 

> performance problem in FileOutputCommiter for big list processed  by single 
> thread
> ----------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-7465
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7465
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: performance
>    Affects Versions: 3.2.3, 3.3.2, 3.2.4, 3.3.5, 3.3.3, 3.3.4, 3.3.6
>            Reporter: Arnaud Nauwynck
>            Priority: Minor
>              Labels: pull-request-available
>
> when commiting a big hadoop job (for example via Spark) having many 
> partitions,
> the class FileOutputCommiter process thousands of dirs/files to rename with a 
> single Thread. This is performance issue, caused by lot of waits on 
> FileStystem storage operations.
> I propose that above a configurable threshold (default=3, configurable via 
> property 'mapreduce.fileoutputcommitter.parallel.threshold'), the class 
> FileOutputCommiter process the list of files to rename using parallel 
> threads, using the default jvm ExecutorService (ForkJoinPool.commonPool())
> See Pull-Request: 
> [https://github.com/apache/hadoop/pull/6378|https://github.com/apache/hadoop/pull/6378]
> Notice that sub-class instances of FileOutputCommiter are supposed to be 
> created at runtime dependending of a configurable property 
> ([https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/PathOutputCommitterFactory.java|PathOutputCommitterFactory.java]).
> But for example in Parquet + Spark, this is buggy and can not be changed at 
> runtime. 
> There is an ongoing Jira and PR to fix it in Parquet + Spark: 
> [https://issues.apache.org/jira/browse/PARQUET-2416|https://issues.apache.org/jira/browse/PARQUET-2416]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org

[jira] [Commented] (MAPREDUCE-7465) performance problem in FileOutputCommiter for big list processed by single thread

Reply via email to