[
https://issues.apache.org/jira/browse/MAPREDUCE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18024724#comment-18024724
]
ASF GitHub Bot commented on MAPREDUCE-7465:
-------------------------------------------
github-actions[bot] commented on PR #6399:
URL: https://github.com/apache/hadoop/pull/6399#issuecomment-3368627101
We're closing this stale PR because it has been open for 100 days with no
activity. This isn't a judgement on the merit of the PR in any way. It's just a
way of keeping the PR queue manageable.
If you feel like this was a mistake, or you would like to continue working
on it, please feel free to re-open it and ask for a committer to remove the
stale tag and review again.
Thanks all for your contribution.
> performance problem in FileOutputCommitter for big list processed by single
> thread
> -----------------------------------------------------------------------------------
>
> Key: MAPREDUCE-7465
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7465
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: performance
> Affects Versions: 3.2.3, 3.3.2, 3.2.4, 3.3.5, 3.3.3, 3.3.4, 3.3.6
> Reporter: Arnaud Nauwynck
> Priority: Minor
> Labels: pull-request-available
>
> When committing a big Hadoop job (for example via Spark) that has many
> partitions,
> the class FileOutputCommitter processes thousands of dirs/files to rename with a
> single thread. This is a performance issue, caused by many waits on
> FileSystem storage operations.
> I propose that above a configurable threshold (default=3, configurable via
> property 'mapreduce.fileoutputcommitter.parallel.threshold'), the class
> FileOutputCommitter process the list of files to rename using parallel
> threads, using the default JVM ExecutorService (ForkJoinPool.commonPool()).
> See Pull-Request:
> https://github.com/apache/hadoop/pull/6378
> Notice that sub-class instances of FileOutputCommitter are supposed to be
> created at runtime depending on a configurable property
> ([https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/PathOutputCommitterFactory.java|PathOutputCommitterFactory.java]).
> But for example in Parquet + Spark, this is buggy and cannot be changed at
> runtime.
> There is an ongoing Jira and PR to fix it in Parquet + Spark:
> https://issues.apache.org/jira/browse/PARQUET-2416
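
For illustration only, the threshold-based parallel rename proposed in the quoted description might look roughly like the sketch below. This is not the code from the pull request: the class name, method names, and constant name are hypothetical, it uses java.nio.file instead of the Hadoop FileSystem API so that it stays self-contained, and it reads "above a configurable threshold" as "strictly more entries than the threshold".

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ForkJoinPool;

// Hypothetical sketch; not the FileOutputCommitter implementation.
final class ParallelRenameSketch {

    // Default value taken from the issue description; the constant name is an assumption.
    static final int DEFAULT_PARALLEL_THRESHOLD = 3;

    /**
     * Moves every source file into destDir. Lists larger than the threshold
     * are processed on ForkJoinPool.commonPool(); smaller lists stay on the
     * calling thread.
     */
    static void renameAll(List<Path> sources, Path destDir, int threshold)
            throws IOException {
        if (sources.size() <= threshold) {
            // Small list: a plain sequential loop avoids any scheduling overhead.
            for (Path src : sources) {
                renameOne(src, destDir);
            }
            return;
        }
        // Large list: submit one task per file to the JVM-wide common pool.
        CompletableFuture<?>[] tasks = sources.stream()
            .map(src -> CompletableFuture.runAsync(() -> {
                try {
                    renameOne(src, destDir);
                } catch (IOException e) {
                    // Runnable cannot throw checked exceptions, so wrap it.
                    throw new UncheckedIOException(e);
                }
            }, ForkJoinPool.commonPool()))
            .toArray(CompletableFuture[]::new);
        try {
            // Block until every rename has finished or one has failed.
            CompletableFuture.allOf(tasks).join();
        } catch (CompletionException e) {
            if (e.getCause() instanceof UncheckedIOException) {
                // Unwrap so callers see the original IOException.
                throw ((UncheckedIOException) e.getCause()).getCause();
            }
            throw e;
        }
    }

    private static void renameOne(Path src, Path destDir) throws IOException {
        Files.move(src, destDir.resolve(src.getFileName()));
    }
}
```

A caller would invoke something like ParallelRenameSketch.renameAll(files, targetDir, ParallelRenameSketch.DEFAULT_PARALLEL_THRESHOLD). Because the renames are dominated by waits on storage, running them concurrently is where the expected gain comes from; the common pool is shared JVM-wide, so blocking I/O on it can delay other users of that pool.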