[ https://issues.apache.org/jira/browse/MAPREDUCE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arnaud Nauwynck updated MAPREDUCE-7465:
---------------------------------------
    Description:
When committing a big Hadoop job (for example via Spark) with many partitions, the class FileOutputCommitter renames thousands of directories and files using a single thread. This is a performance issue, caused by the many waits on FileSystem storage operations.

I propose that, above a configurable threshold (default=3, configurable via the property 'mapreduce.fileoutputcommitter.parallel.threshold'), the class FileOutputCommitter process the list of files to rename using parallel threads, using the default JVM ExecutorService (ForkJoinPool.commonPool()).

See Pull-Request: https://github.com/apache/hadoop/pull/6378

Note that sub-class instances of FileOutputCommitter are supposed to be created at runtime depending on a configurable property ([https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/PathOutputCommitterFactory.java|PathOutputCommitterFactory.java]). However, in Parquet + Spark, for example, this is buggy and the committer cannot be changed at runtime. There is an ongoing Jira and PR to fix it in Parquet + Spark: https://issues.apache.org/jira/browse/PARQUET-2416

  was:
When committing a big Hadoop job (for example via Spark) with many partitions, the class FileOutputCommitter renames thousands of directories and files using a single thread. This is a performance issue, caused by the many waits on FileSystem storage operations.

Note that sub-class instances of FileOutputCommitter are supposed to be created at runtime depending on a configurable property ([https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/PathOutputCommitterFactory.java|PathOutputCommitterFactory.java]). However, in Parquet + Spark, for example, this is buggy and the committer cannot be changed at runtime. There is an ongoing Jira and PR to fix it in Parquet + Spark: https://issues.apache.org/jira/browse/PARQUET-2416

> performance problem in FileOutputCommitter for big list processed by single thread
> ----------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-7465
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7465
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: performance
>    Affects Versions: 3.2.3, 3.3.2, 3.2.4, 3.3.5, 3.3.3, 3.3.4, 3.3.6
>            Reporter: Arnaud Nauwynck
>            Priority: Minor
>              Labels: pull-request-available
>
> When committing a big Hadoop job (for example via Spark) with many
> partitions, the class FileOutputCommitter renames thousands of directories
> and files using a single thread. This is a performance issue, caused by the
> many waits on FileSystem storage operations.
> I propose that, above a configurable threshold (default=3, configurable via
> the property 'mapreduce.fileoutputcommitter.parallel.threshold'), the class
> FileOutputCommitter process the list of files to rename using parallel
> threads, using the default JVM ExecutorService (ForkJoinPool.commonPool()).
> See Pull-Request: https://github.com/apache/hadoop/pull/6378
> Note that sub-class instances of FileOutputCommitter are supposed to be
> created at runtime depending on a configurable property
> ([https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/PathOutputCommitterFactory.java|PathOutputCommitterFactory.java]).
> However, in Parquet + Spark, for example, this is buggy and the committer
> cannot be changed at runtime.
> There is an ongoing Jira and PR to fix it in Parquet + Spark:
> https://issues.apache.org/jira/browse/PARQUET-2416
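
For illustration, the threshold idea in the description could look roughly like the sketch below. It is not the code from the pull request. The class name ParallelRenameSketch, the method renameAll, and the constant DEFAULT_PARALLEL_THRESHOLD are hypothetical, and java.nio.file.Files.move stands in for the Hadoop FileSystem rename so that the example runs with the JDK alone. A parallel stream is used because, by default, it executes on ForkJoinPool.commonPool().

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

/** Illustrative sketch only; not the implementation from the pull request. */
class ParallelRenameSketch {

    // Hypothetical constant mirroring the "default=3" threshold from the issue.
    static final int DEFAULT_PARALLEL_THRESHOLD = 3;

    /**
     * Renames every source path to its target path. Lists larger than the
     * threshold are processed with a parallel stream (which by default runs
     * on ForkJoinPool.commonPool()); smaller lists stay on the calling thread.
     */
    static void renameAll(List<Map.Entry<Path, Path>> renames, int threshold) {
        Stream<Map.Entry<Path, Path>> stream =
                renames.size() > threshold ? renames.parallelStream() : renames.stream();
        stream.forEach(entry -> {
            try {
                // Stand-in for the Hadoop FileSystem rename call.
                Files.move(entry.getKey(), entry.getValue());
            } catch (IOException ex) {
                // Lambdas cannot throw checked exceptions, so wrap and rethrow.
                throw new UncheckedIOException(ex);
            }
        });
    }

    public static void main(String[] args) throws IOException {
        Path pending = Files.createTempDirectory("pending");
        Path committed = Files.createTempDirectory("committed");
        List<Map.Entry<Path, Path>> renames = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            Path file = Files.createFile(pending.resolve("part-" + i));
            renames.add(Map.entry(file, committed.resolve("part-" + i)));
        }
        renameAll(renames, DEFAULT_PARALLEL_THRESHOLD);
        try (Stream<Path> moved = Files.list(committed)) {
            System.out.println("renamed files: " + moved.count());
        }
    }
}
```

The common pool is sized for CPU-bound work, so for renames that mostly wait on remote storage a dedicated, larger executor might be a better fit; the sketch keeps the common pool only because that is what the description proposes.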