[jira] [Created] (HADOOP-14137) Allow DistCp to take a file list within a src directory
Zheng Shao created HADOOP-14137:
---

Summary: Allow DistCp to take a file list within a src directory
Key: HADOOP-14137
URL: https://issues.apache.org/jira/browse/HADOOP-14137
Project: Hadoop Common
Issue Type: New Feature
Components: tools/distcp
Reporter: Zheng Shao

DistCp is very slow to start when the src directory has a huge number of subdirectories. In our case, we already have the directory listing (via "hdfs oiv -i fsimage" or via nightly "hdfs dfs -ls -R /" dumps), and we would like to use that instead of doing a real-time listing on the NameNode.

The "-f" option doesn't help in this case because it would try to put everything into a single flat target directory.

We'd like to introduce a new option "-list" for distcp. The argument to "-list" contains the result of listing the src directory.

In order to achieve this, we plan to:
1. Add a new CopyListing class, PregeneratedCopyListing, similar to SimpleCopyListing, which does not run "-ls -R" on the directory but takes the listing via "-list".
2. Add an option "-list" which will automatically make distcp use the new PregeneratedCopyListing class.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
-
To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-dev-h...@hadoop.apache.org
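To illustrate the difference from the flat "-f" behaviour, here is a minimal sketch of how a pregenerated listing could be mapped onto a target directory while keeping the relative structure. It is plain Java with no Hadoop dependency; the class name PregeneratedListingSketch, the method toTargets, and the example paths are all hypothetical and are not part of DistCp or of the proposed patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PregeneratedListingSketch {

    /**
     * Maps every path in a pregenerated listing to the same relative
     * location under dstRoot. Blank lines are skipped. A path that is
     * not under srcRoot is rejected rather than silently flattened.
     */
    public static List<String> toTargets(List<String> listing, String srcRoot, String dstRoot) {
        String src = stripTrailingSlashes(srcRoot);
        String dst = stripTrailingSlashes(dstRoot);
        List<String> targets = new ArrayList<>();
        for (String line : listing) {
            String path = line.trim();
            if (path.isEmpty()) {
                continue;
            }
            if (!path.startsWith(src + "/")) {
                throw new IllegalArgumentException("Not under " + srcRoot + ": " + path);
            }
            // Keep everything after the source root, including the leading "/".
            targets.add(dst + path.substring(src.length()));
        }
        return targets;
    }

    /** Removes all trailing slashes, so "/" becomes the empty string. */
    private static String stripTrailingSlashes(String p) {
        String result = p;
        while (result.endsWith("/")) {
            result = result.substring(0, result.length() - 1);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical listing, as might be extracted from an fsimage dump.
        List<String> listing = Arrays.asList(
                "/data/logs/2017/01/a.log",
                "/data/logs/2017/02/b.log");
        // prints [/backup/logs/2017/01/a.log, /backup/logs/2017/02/b.log]
        System.out.println(toTargets(listing, "/data/logs", "/backup/logs"));
    }
}
```

The point of the sketch is only that no call to the NameNode is needed to build the copy list: the source-to-target mapping is derived entirely from the listing already on hand.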
[jira] [Created] (HADOOP-14086) Improve DistCp Speed for small files
Zheng Shao created HADOOP-14086:
---

Summary: Improve DistCp Speed for small files
Key: HADOOP-14086
URL: https://issues.apache.org/jira/browse/HADOOP-14086
Project: Hadoop Common
Issue Type: Improvement
Components: tools/distcp
Affects Versions: 2.6.5
Reporter: Zheng Shao
Assignee: Zheng Shao
Priority: Minor

When using distcp to copy lots of small files, the NameNode naturally becomes a bottleneck.

The current distcp code is *not* optimized to reduce the number of NameNode calls. We should restructure the code to make as few NameNode calls as possible, to speed up the copying of small files.
[jira] [Created] (HADOOP-13975) Allow DistCp to use MultiThreadedMapper
Zheng Shao created HADOOP-13975:
---

Summary: Allow DistCp to use MultiThreadedMapper
Key: HADOOP-13975
URL: https://issues.apache.org/jira/browse/HADOOP-13975
Project: Hadoop Common
Issue Type: New Feature
Components: tools/distcp
Affects Versions: 3.0.0-alpha1
Reporter: Zheng Shao
Assignee: Zheng Shao
Priority: Minor

Although distcp allows users to control the parallelism via the number of mappers, sometimes it's desirable to run fewer mappers but more threads per mapper. Since distcp is network bound (either by throughput or, more frequently, by the latency of creating connections, opening files, reading/writing files, and closing files), this can make each mapper much more efficient.

That way, many resources can be shared within a mapper, so we can save memory and connections to the NameNode.
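The idea of "fewer mappers, more threads per mapper" can be sketched with a plain thread pool inside one worker. This is only an illustration using java.util.concurrent, not Hadoop's MultithreadedMapper; the class name ThreadedCopySketch, the method copyAll, and the copier callback (which stands in for the real copy of one file and returns the bytes copied) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToLongFunction;

public class ThreadedCopySketch {

    /**
     * Runs the copier for every file on a fixed pool of threads and
     * returns the sum of the values it reports. While one thread waits
     * on a latency-bound call, the others keep making progress.
     */
    public static long copyAll(List<String> files, int threads, ToLongFunction<String> copier) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> pending = new ArrayList<>();
            for (String file : files) {
                Callable<Long> task = () -> copier.applyAsLong(file);
                pending.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> result : pending) {
                total += result.get(); // blocks until that copy has finished
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while copying", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Copy failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

A call such as copyAll(files, 8, f -> copyOneFile(f)), where copyOneFile is whatever performs the actual transfer, would share a single process (and whatever client state that process holds) across eight concurrent copies instead of starting eight separate mappers.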
[jira] Created: (HADOOP-6499) Standardize when we return false and when we throw IOException in FileSystem API
Standardize when we return false and when we throw IOException in FileSystem API

Key: HADOOP-6499
URL: https://issues.apache.org/jira/browse/HADOOP-6499
Project: Hadoop Common
Issue Type: Improvement
Reporter: Zheng Shao

Currently most of the methods in Hadoop FileSystem have two ways of reporting errors:
1. Return false
2. Throw an IOException

We should standardize which one is used in which case, so that the caller can retry or fail accordingly.

The standard can be added to the javadoc of FileSystem; then we need to verify that all FileSystem implementations follow the same standard.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Created: (HADOOP-6490) Path.normalize should use StringUtils.replace in favor of String.replace
Path.normalize should use StringUtils.replace in favor of String.replace

Key: HADOOP-6490
URL: https://issues.apache.org/jira/browse/HADOOP-6490
Project: Hadoop Common
Issue Type: Bug
Affects Versions: 0.20.1
Reporter: Zheng Shao

In our environment, we are seeing the JobClient run out of memory because Path.normalizePath(String) is called several tens of thousands of times, and each call invokes String.replace twice.

java.lang.String.replace compiles a regex to do the job, which is very costly. We should use org.apache.commons.lang.StringUtils.replace, which is much faster and consumes almost no extra memory.
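For illustration, a literal replacement can be done with indexOf and a StringBuilder, with no regex involved. The sketch below shows that general technique; it is not the commons-lang source, and the class name LiteralReplaceSketch and the sample inputs are hypothetical.

```java
public class LiteralReplaceSketch {

    /**
     * Replaces every occurrence of search in text with replacement,
     * treating search as a literal string. When there is no match the
     * original instance is returned, so nothing is allocated.
     */
    public static String replace(String text, String search, String replacement) {
        if (text == null || search == null || search.isEmpty() || replacement == null) {
            return text;
        }
        int start = 0;
        int end = text.indexOf(search, start);
        if (end < 0) {
            return text;
        }
        StringBuilder out = new StringBuilder(text.length());
        while (end >= 0) {
            // Copy the unmatched stretch, then the replacement.
            out.append(text, start, end).append(replacement);
            start = end + search.length();
            end = text.indexOf(search, start);
        }
        out.append(text, start, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replace("a//b//c", "//", "/")); // prints a/b/c
    }
}
```

The early return on "no match" matters for a hot path like path normalization, where most inputs are presumably already clean and need no new string at all.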
[jira] Created: (HADOOP-6481) FileStatus should have a field isUnderConstruction
FileStatus should have a field isUnderConstruction

Key: HADOOP-6481
URL: https://issues.apache.org/jira/browse/HADOOP-6481
Project: Hadoop Common
Issue Type: New Feature
Reporter: Zheng Shao
Assignee: Zheng Shao

This is for HDFS, so that the name node can tell clients whether a file is under construction or not.