[jira] [Created] (HADOOP-14137) Allow DistCp to take a file list within a src directory

2017-03-01 Thread Zheng Shao (JIRA)
Zheng Shao created HADOOP-14137:
---

 Summary: Allow DistCp to take a file list within a src directory
 Key: HADOOP-14137
 URL: https://issues.apache.org/jira/browse/HADOOP-14137
 Project: Hadoop Common
  Issue Type: New Feature
  Components: tools/distcp
Reporter: Zheng Shao


DistCp is very slow to start when the src directory has a huge number of 
subdirectories.  In our case, we already have the directory listing (via "hdfs 
oiv -i fsimage" or via nightly "hdfs dfs -ls -R /" dumps), and we would like to 
use that instead of doing a real-time listing on the NameNode.

The "-f" option doesn't help in this case because it would try to put 
everything into a single flat target directory.

We'd like to introduce a new option "-list" for distcp.  Its argument is a 
file that contains the result of listing the src directory.


In order to achieve this, we plan to:
1. Add a new CopyListing class, PregeneratedCopyListing, similar to 
SimpleCopyListing, which doesn't run a recursive "-ls -R" on the directory 
but takes the listing via "-list".
2. Add an option "-list" which will automatically make distcp use the 
new PregeneratedCopyListing class.
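
To illustrate the idea behind the plan above, here is a minimal sketch of how a 
pregenerated listing could be turned into copy tasks that keep the directory 
structure under the target, rather than flattening it.  It is not DistCp code: 
the class name, the Entry type, and the listing format (one absolute path per 
line) are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of DistCp. */
final class PregeneratedListingSketch {

    /** One copy task: a source path and the target path it maps to. */
    static final class Entry {
        final String source;
        final String target;

        Entry(String source, String target) {
            this.source = source;
            this.target = target;
        }
    }

    /**
     * Maps each listed path under srcRoot to the same relative path under
     * dstRoot. Assumes one absolute path per line; blank lines are skipped.
     */
    static List<Entry> fromListing(String srcRoot, String dstRoot, List<String> lines) {
        String src = stripTrailingSlashes(srcRoot);
        String dst = stripTrailingSlashes(dstRoot);
        List<Entry> entries = new ArrayList<>();
        for (String raw : lines) {
            String path = raw.trim();
            if (path.isEmpty()) {
                continue;
            }
            if (!path.startsWith(src + "/")) {
                throw new IllegalArgumentException("Not under " + srcRoot + ": " + path);
            }
            // The relative part keeps its leading "/", so subdirectories survive.
            String relative = path.substring(src.length());
            entries.add(new Entry(path, dst + relative));
        }
        return entries;
    }

    /** Removes trailing slashes, so "/" becomes the empty string. */
    private static String stripTrailingSlashes(String p) {
        while (p.endsWith("/")) {
            p = p.substring(0, p.length() - 1);
        }
        return p;
    }
}
```

No listing call is made against the file system here; everything is derived 
from the lines already on hand, which is the point of the proposal.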




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-dev-h...@hadoop.apache.org



[jira] [Created] (HADOOP-14086) Improve DistCp Speed for small files

2017-02-15 Thread Zheng Shao (JIRA)
Zheng Shao created HADOOP-14086:
---

 Summary: Improve DistCp Speed for small files
 Key: HADOOP-14086
 URL: https://issues.apache.org/jira/browse/HADOOP-14086
 Project: Hadoop Common
  Issue Type: Improvement
  Components: tools/distcp
Affects Versions: 2.6.5
Reporter: Zheng Shao
Assignee: Zheng Shao
Priority: Minor


When using distcp to copy lots of small files, the NameNode naturally becomes 
a bottleneck.

The current distcp code is *not* optimized to reduce NameNode calls.  We 
should restructure the code to reduce the number of NameNode calls as much as 
possible to speed up the copy of small files.







[jira] [Created] (HADOOP-13975) Allow DistCp to use MultiThreadedMapper

2017-01-11 Thread Zheng Shao (JIRA)
Zheng Shao created HADOOP-13975:
---

 Summary: Allow DistCp to use MultiThreadedMapper
 Key: HADOOP-13975
 URL: https://issues.apache.org/jira/browse/HADOOP-13975
 Project: Hadoop Common
  Issue Type: New Feature
  Components: tools/distcp
Affects Versions: 3.0.0-alpha1
Reporter: Zheng Shao
Assignee: Zheng Shao
Priority: Minor


Although distcp allows users to control the parallelism via the number of 
mappers, sometimes it's desirable to run fewer mappers but more threads per 
mapper.  Since distcp is network bound (either by throughput or, more 
frequently, by the latency of creating connections, opening files, 
reading/writing files, and closing files), this can make each mapper much 
more efficient.

That way, a lot of resources can be shared, so we can save memory and 
connections to the NameNode.
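
As a rough illustration of "more threads per mapper", the sketch below runs a 
batch of latency-bound tasks on a fixed thread pool inside a single worker.  
It uses only java.util.concurrent and is not DistCp or MapReduce code; the 
class and method names are hypothetical, and each task stands in for one file 
copy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper for illustration only; not part of DistCp. */
final class ThreadedCopySketch {

    /**
     * Runs all tasks on a pool of `threads` workers and returns their
     * results in the same order as the input list.
     */
    static <T> List<T> runAll(List<Callable<T>> tasks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<T> results = new ArrayList<>();
            // invokeAll blocks until every task has completed.
            for (Future<T> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while copying", e);
        } catch (ExecutionException e) {
            // Surface the first task failure to the caller.
            throw new IllegalStateException("Copy task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the threads live in one process, anything the tasks share (a client 
object, buffers, connections) is allocated once per worker instead of once per 
mapper.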







[jira] Created: (HADOOP-6499) Standardize when we return false and when we throw IOException in FileSystem API

2010-01-19 Thread Zheng Shao (JIRA)
Standardize when we return false and when we throw IOException in FileSystem API


 Key: HADOOP-6499
 URL: https://issues.apache.org/jira/browse/HADOOP-6499
 Project: Hadoop Common
  Issue Type: Improvement
Reporter: Zheng Shao


Currently, most of the methods in the Hadoop FileSystem API have 2 ways of 
reporting errors:
1. Return false
2. Throw an IOException

We should standardize which one is used in which case, so that the caller can 
retry or fail accordingly.

The standard can be added to the javadoc of FileSystem; then we need to verify 
that all FileSystem implementations follow the same standard.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HADOOP-6490) Path.normalize should use StringUtils.replace in favor of String.replace

2010-01-12 Thread Zheng Shao (JIRA)
Path.normalize should use StringUtils.replace in favor of String.replace


 Key: HADOOP-6490
 URL: https://issues.apache.org/jira/browse/HADOOP-6490
 Project: Hadoop Common
  Issue Type: Bug
Affects Versions: 0.20.1
Reporter: Zheng Shao


In our environment, we are seeing the JobClient run out of memory because 
Path.normalizePath(String) is called several tens of thousands of times, and 
each time it calls String.replace twice.

java.lang.String.replace compiles a regex to do the job, which is very costly.
We should use org.apache.commons.lang.StringUtils.replace, which is much 
faster and consumes almost no extra memory.
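
For reference, a regex-free literal replace can be sketched in a few lines 
with indexOf and a StringBuilder.  This is a hypothetical stand-alone helper 
written for illustration, not the Commons Lang implementation and not the 
actual patch.

```java
/** Hypothetical helper for illustration only. */
final class LiteralReplaceSketch {

    /** Replaces every occurrence of `search` in `text` without using a regex. */
    static String replace(String text, String search, String replacement) {
        if (text == null || search == null || search.isEmpty() || replacement == null) {
            return text;
        }
        int start = 0;
        int end = text.indexOf(search, start);
        if (end < 0) {
            return text; // nothing to replace, so nothing is allocated
        }
        StringBuilder out = new StringBuilder(text.length());
        while (end >= 0) {
            // Copy the text before the match, then the replacement.
            out.append(text, start, end).append(replacement);
            start = end + search.length();
            end = text.indexOf(search, start);
        }
        out.append(text, start, text.length());
        return out.toString();
    }
}
```

When the search string does not occur, the input is returned as-is, which is 
the common case for paths that are already normalized.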





[jira] Created: (HADOOP-6481) FileStatus should have a field isUnderConstruction

2010-01-07 Thread Zheng Shao (JIRA)
FileStatus should have a field isUnderConstruction


 Key: HADOOP-6481
 URL: https://issues.apache.org/jira/browse/HADOOP-6481
 Project: Hadoop Common
  Issue Type: New Feature
Reporter: Zheng Shao
Assignee: Zheng Shao


This is for HDFS, so that the name node can tell clients whether a file is 
under construction or not.

