[
https://issues.apache.org/jira/browse/HADOOP-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529421#comment-14529421
]
Jing Zhao commented on HADOOP-1540:
-----------------------------------
Thanks again for working on this, [~rhaase]! The patch looks good to me
overall. Some comments and thoughts:
# The current patch expects the user to define only regular expressions for
exclusions. However, since the patch lets the user put all the patterns in a
file, the user may put a long list of file names into the exclusion file
(e.g., by generating it with a script). This can cause problems because the
Mapper compiles every line into a regex pattern. One thing we can do here is
limit the total number of regexes. Another option is to have an
ExclusionListing, similar to the current CopyListing, and handle exclusions
while generating the sequence file. For now we could even define only the
ExclusionListing interface and provide a simple implementation, just like
SimpleCopyListing, and leave its extensions to separate jiras.
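To make the two options above concrete, here is a rough sketch of what an ExclusionListing with a capped number of patterns could look like. Everything in it is an assumption for illustration: the {{shouldExclude}} method, the {{SimpleExclusionListing}} class name, and the {{maxPatterns}} parameter are hypothetical and do not exist in the patch or in DistCp.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical interface: the method name is an assumption, not part of DistCp.
interface ExclusionListing {
  // Returns true if the given path should be skipped during the copy.
  boolean shouldExclude(String path);
}

// Hypothetical simple implementation backed by regular expressions.
final class SimpleExclusionListing implements ExclusionListing {
  private final List<Pattern> patterns = new ArrayList<>();

  // maxPatterns is an assumed, caller-supplied cap on the number of regexes.
  SimpleExclusionListing(List<String> lines, int maxPatterns) {
    if (lines.size() > maxPatterns) {
      throw new IllegalArgumentException("Too many exclusion patterns: "
          + lines.size() + " (limit " + maxPatterns + ")");
    }
    // Compile each line once, up front, instead of on every lookup.
    for (String line : lines) {
      patterns.add(Pattern.compile(line));
    }
  }

  @Override
  public boolean shouldExclude(String path) {
    // A path is excluded if any pattern matches the whole path.
    for (Pattern pattern : patterns) {
      if (pattern.matcher(path).matches()) {
        return true;
      }
    }
    return false;
  }
}
```

With an interface like this, the check could run once while the sequence file is generated rather than in every Mapper, and other implementations (for example an exact-match set of file names) could be added in follow-up jiras.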
Also some minor comments on the code:
# We can use this chance to remove the following unnecessary unboxing.
{code}
if (mapBandwidth.intValue() <= 0) {
  throw new IllegalArgumentException("Bandwidth specified is not " +
      "positive: " + mapBandwidth);
}
{code}
# Any reason to delete the original {{testParseNumListstatusThreads}} test?
# In {{initializeExclusionPatterns}}, we can use {{IOUtils#cleanup}} instead of
{{reader.close}} and put the following in a "try-finally" block.
{code}
InputStream is = new FileInputStream(new File(exclusionsPath.getName()));
BufferedReader reader = new BufferedReader(new InputStreamReader(is,
    Charset.forName("UTF-8")));
String line;
while ((line = reader.readLine()) != null) {
  exclusionPatterns.add(Pattern.compile(line));
}
reader.close();
{code}
# Maybe we can remove "map" from "distcp.map.exclusions.file"?
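For the reader cleanup suggested above, a minimal sketch of the intended shape follows. To keep it self-contained it uses plain JDK try-with-resources rather than Hadoop's {{IOUtils#cleanup}}, so it is a stand-in for the suggestion, not the exact change. The {{ExclusionPatternReader}} class and {{readPatterns}} method are hypothetical names, and the method takes an {{InputStream}} instead of opening the exclusions file itself.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are assumptions.
final class ExclusionPatternReader {
  static List<Pattern> readPatterns(InputStream is) throws IOException {
    List<Pattern> patterns = new ArrayList<>();
    // try-with-resources closes the reader (and the wrapped stream) even if
    // readLine() or Pattern.compile() throws.
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(is, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        patterns.add(Pattern.compile(line));
      }
    }
    return patterns;
  }
}
```

In the patch itself the same effect would come from wrapping the loop in try-finally and calling {{IOUtils#cleanup}} on the reader in the finally block.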
> distcp should support an exclude list
> -------------------------------------
>
> Key: HADOOP-1540
> URL: https://issues.apache.org/jira/browse/HADOOP-1540
> Project: Hadoop Common
> Issue Type: Improvement
> Components: util
> Affects Versions: 2.6.0
> Reporter: Senthil Subramanian
> Assignee: Rich Haase
> Priority: Minor
> Labels: patch
> Attachments: HADOOP-1540.003.patch, HADOOP-1540.004.patch
>
>
> There should be a way to ignore specific paths (e.g., those that have already
> been copied over under the current srcPath).