[
https://issues.apache.org/jira/browse/HADOOP-13169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15488082#comment-15488082
]
Chris Nauroth commented on HADOOP-13169:
----------------------------------------
[~rajesh.balamohan], thank you for this patch, and thank you also for the
performance testing on both S3A and HDFS. I have a few comments.
{code}
private int fileStatusLimit = 1000;
private boolean randomizeFileListing;
{code}
Please make both of these {{final}}, because they only need to be assigned
inside the constructor.
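A minimal sketch of what that could look like (the configuration keys and defaults below are placeholders for illustration, not the patch's actual constants):
{code}
// Sketch only: final fields, assigned exactly once in the constructor.
private final int fileStatusLimit;
private final boolean randomizeFileListing;

protected SimpleCopyListing(Configuration configuration, Credentials credentials) {
  super(configuration, credentials);
  this.fileStatusLimit = configuration.getInt(
      "distcp.listing.file.status.limit", 1000);        // placeholder key/default
  this.randomizeFileListing = configuration.getBoolean(
      "distcp.listing.randomize.files", true);          // placeholder key/default
}
{code}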
{code}
LOG.info("numListstatusThreads=" + numListstatusThreads
+ ", fileStatusLimit=" + fileStatusLimit
+ ", randomizeFileListing=" + randomizeFileListing);
{code}
{code}
LOG.info("Number of paths written to fileListing=" +
fileStatusInfoList.size());
{code}
Would you please switch these to debug level? These messages are more in line
with the debug-level logging that this class already does.
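In other words, something like:
{code}
// Sketch: same messages, logged at debug level instead of info.
LOG.debug("numListstatusThreads=" + numListstatusThreads
    + ", fileStatusLimit=" + fileStatusLimit
    + ", randomizeFileListing=" + randomizeFileListing);

LOG.debug("Number of paths written to fileListing=" + fileStatusInfoList.size());
{code}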
{code}
List<FileStatusInfo> fileStatuses =
    Collections.synchronizedList(new ArrayList<FileStatusInfo>());
{code}
{code}
List<FileStatusInfo> statusList =
    Collections.synchronizedList(new ArrayList<FileStatusInfo>());
{code}
Is the locking on these lists necessary? If I am reading this code correctly,
concurrency happens through the {{ProducerConsumer}} executing
{{FileStatusProcessor}} in parallel, but I don't see the
{{FileStatusProcessor}} threads accessing these lists. Instead, it appears
that completed work is pulled back out of the {{ProducerConsumer}} and added to
the lists on a single thread. (If I am misreading this code, please let me
know.)
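If that single-threaded reading is correct, a plain list would be enough here, e.g.:
{code}
// Sketch, assuming only one thread ever mutates these lists.
List<FileStatusInfo> fileStatuses = new ArrayList<FileStatusInfo>();
{code}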
{code}
List<Path> srcPaths = new ArrayList<Path>();
{code}
Lines like this can use the Java 7 diamond operator, so {{new ArrayList<>()}}.
This patch is targeted to 2.8.0, and we require at least Java 7 on that branch,
so using the diamond operator is fine.
{code}
Path p = new Path("/tmp/", String.valueOf(i));
{code}
Minor nit: please remove the trailing slash from "/tmp/", as the {{Path}}
constructor will normalize and strip trailing slashes anyway.
{code}
OutputStream out = fs.create(fileName);
out.write(i);
out.close();
{code}
Please use try-finally or try-with-resources to guarantee close of each stream.
It's unlikely that this will ever be a problem in practice, but some platforms
like Windows are very picky about leaked file handles.
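A sketch of the try-with-resources form, assuming the same {{fs}}, {{fileName}},
and {{i}} variables from the test:
{code}
// Sketch: try-with-resources closes the stream even if write() throws.
try (OutputStream out = fs.create(fileName)) {
  out.write(i);
}
{code}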
{code}
Assert.fail("Should have failed as file listing should be randomized");
{code}
Is there a random chance that this assertion could fail if the call to
{{Collections#shuffle}} results in the same order that was input?
{code}
} catch (IOException e) {
  LOG.error("Exception encountered ", e);
  Assert.fail("Test build listing failed");
{code}
I suggest not catching the exception and simply letting it be thrown out of the
test method. If the test fails due to an exception, letting it propagate makes
the problem more visible in the JUnit report on Jenkins.
I know the other tests in this suite are coded to catch the exception. No need
to go back and clean up all of those as part of this patch. (I'd still +1 it
though if you feel like doing that clean-up.)
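Concretely, the test method can just declare the exception (the method name here
is a placeholder, not necessarily the one in the patch):
{code}
// Sketch: no try/catch wrapper; JUnit reports the exception directly.
@Test(timeout = 60000)
public void testRandomizeFileListing() throws Exception {
  // ... test body, without the catch/Assert.fail wrapper ...
}
{code}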
{code}
Assert.assertEquals(fs.makeQualified(srcFiles.get(idx)),
    currentVal.getPath());
{code}
For this assertion, I suggest passing a descriptive message including the value
of {{idx}} as the first argument. That way, if we ever see a failure, we'll
have more information about which iteration of the loop failed.
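For example, something along these lines:
{code}
// Sketch: include the loop index in the assertion message.
Assert.assertEquals("Mismatch at index " + idx,
    fs.makeQualified(srcFiles.get(idx)), currentVal.getPath());
{code}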
The Checkstyle warnings are for exceeding the maximum line length. Please fix
those in the next patch revision.
> Randomize file list in SimpleCopyListing
> ----------------------------------------
>
> Key: HADOOP-13169
> URL: https://issues.apache.org/jira/browse/HADOOP-13169
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Priority: Minor
> Attachments: HADOOP-13169-branch-2-001.patch,
> HADOOP-13169-branch-2-002.patch, HADOOP-13169-branch-2-003.patch,
> HADOOP-13169-branch-2-004.patch, HADOOP-13169-branch-2-005.patch
>
>
> When copying files to S3, the order of the file listing can cause some mappers
> to hit S3 partition hotspots. This is more visible when data is copied from a
> Hive warehouse with many partitions (e.g. date partitions). In such cases, some
> tasks tend to be much slower than others. It would be good to randomize the
> file paths written out in SimpleCopyListing to avoid this issue.