enhance FileInputFormat.setInputPaths()
---------------------------------------
                 Key: MAPREDUCE-1946
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1946
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: job submission
    Affects Versions: 0.20.2
            Reporter: Ted Yu


FileInputFormat.setInputPaths(Job job, Path... inputPaths) can be enhanced in the following 3 ways:

1) When the input paths are known only at runtime, we need another form which accepts Collection<> as the second parameter, e.g. Set<Path> inputPaths.

2) Use StringBuilder instead of StringBuffer, because StringBuilder doesn't incur synchronization cost.

3) The biggest performance boost comes from calling the following constructor of StringBuilder:
    public StringBuilder(int capacity)
capacity can be a 3rd parameter to setInputPaths().
This would avoid excessive calls to Arrays.copyOf().

The following stack trace was observed when our code called FileInputFormat.addInputPath() many times because a lot of files were eligible for processing:

   java.lang.Thread.State: RUNNABLE
        at java.util.Arrays.copyOf(Arrays.java:2882)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
        at java.lang.StringBuilder.append(StringBuilder.java:119)
        at org.apache.hadoop.mapred.FileInputFormat.addInputPath(FileInputFormat.java:330)
        at com.carrieriq.m2m.platform.mmp2.input.PackageInput.configureJobConf(PackageInput.java:336)

After incorporating all three optimizations, the total time taken in our customized setInputPaths(JobConf conf, Set<Path> inputPaths) was 2 seconds.
The combined time spent calling FileInputFormat.addInputPath() was over 80 minutes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
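
For illustration, the three suggestions in the description (a Collection-based form, StringBuilder, and a presized buffer) could be combined roughly as follows. This is only a sketch under stated assumptions, not the proposed patch: the class InputPathJoiner and its methods are hypothetical names, paths are plain Strings rather than org.apache.hadoop.fs.Path so that the example runs without Hadoop on the classpath, the capacity is computed from the paths instead of being passed as a third parameter, and the escaping is a simplified stand-in for whatever Hadoop applies. The assumption is that the joined string would then be stored in the job configuration once, instead of being rebuilt on every addInputPath() call.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of Hadoop: joins all input paths in one pass.
class InputPathJoiner {

    // Prefixes backslashes and commas with a backslash so that a comma inside
    // a path is not mistaken for the separator (simplified, assumed escaping).
    static String escape(String path) {
        StringBuilder sb = new StringBuilder(path.length() + 4);
        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);
            if (c == '\\' || c == ',') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Builds the comma-separated list with a StringBuilder sized up front,
    // so the backing array rarely (if ever) has to be copied while appending.
    static String joinPaths(Collection<String> paths) {
        int capacity = 0;
        for (String p : paths) {
            capacity += p.length() + 1; // path plus one separator
        }
        StringBuilder sb = new StringBuilder(capacity);
        boolean first = true;
        for (String p : paths) {
            if (!first) {
                sb.append(',');
            }
            sb.append(escape(p));
            first = false;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Paths known only at runtime, collected into a Set as in the description.
        Set<String> inputPaths = new LinkedHashSet<>(
                Arrays.asList("/data/2010/07/a", "/data/2010/07/b"));
        System.out.println(joinPaths(inputPaths));
    }
}
```

The capacity estimate ignores any extra escape characters, so it is a lower bound rather than an exact size; StringBuilder still grows on its own if the estimate falls short.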