[jira] [Commented] (MAPREDUCE-7465) performance problem in FileOutputCommiter for big list processed by single thread

ASF GitHub Bot (Jira) Mon, 01 Jan 2024 12:04:15 -0800


    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801586#comment-17801586
 ]


ASF GitHub Bot commented on MAPREDUCE-7465:
-------------------------------------------

steveloughran commented on PR #6378:
URL: https://github.com/apache/hadoop/pull/6378#issuecomment-1873460631

   @Arnaud-Nauwynck just stuck up #6399 which is rajesh's impl with my reviews 
in too. I'm not going to merge that into hadoop either because it's still got 
so many problems on cloud storage, especially abfs throttling. best to embrace 
the manifest committer and complain if you hit problems.




> performance problem in FileOutputCommiter for big list processed  by single 
> thread
> ----------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-7465
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7465
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: performance
>    Affects Versions: 3.2.3, 3.3.2, 3.2.4, 3.3.5, 3.3.3, 3.3.4, 3.3.6
>            Reporter: Arnaud Nauwynck
>            Priority: Minor
>              Labels: pull-request-available
>
> when commiting a big hadoop job (for example via Spark) having many 
> partitions,
> the class FileOutputCommiter process thousands of dirs/files to rename with a 
> single Thread. This is performance issue, caused by lot of waits on 
> FileStystem storage operations.
> I propose that above a configurable threshold (default=3, configurable via 
> property 'mapreduce.fileoutputcommitter.parallel.threshold'), the class 
> FileOutputCommiter process the list of files to rename using parallel 
> threads, using the default jvm ExecutorService (ForkJoinPool.commonPool())
> See Pull-Request: 
> [https://github.com/apache/hadoop/pull/6378|https://github.com/apache/hadoop/pull/6378]
> Notice that sub-class instances of FileOutputCommiter are supposed to be 
> created at runtime dependending of a configurable property 
> ([https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/PathOutputCommitterFactory.java|PathOutputCommitterFactory.java]).
> But for example in Parquet + Spark, this is buggy and can not be changed at 
> runtime. 
> There is an ongoing Jira and PR to fix it in Parquet + Spark: 
> [https://issues.apache.org/jira/browse/PARQUET-2416|https://issues.apache.org/jira/browse/PARQUET-2416]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (MAPREDUCE-7465) performance problem in FileOutputCommiter for big list processed by single thread

Reply via email to