[jira] [Commented] (HADOOP-13169) Randomize file list in SimpleCopyListing

Chris Nauroth (JIRA) Thu, 15 Sep 2016 13:58:04 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-13169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15494503#comment-15494503
 ]


Chris Nauroth commented on HADOOP-13169:
----------------------------------------

I think {{ArrayList}} would be a better choice for this algorithm than 
{{LinkedList}}.  Here is a quote from the JavaDocs of 
[Collections#shuffle|http://docs.oracle.com/javase/7/docs/api/java/util/Collections.html#shuffle(java.util.List)].

{quote}
This method runs in linear time. If the specified list does not implement the 
[{{RandomAccess}}|http://docs.oracle.com/javase/7/docs/api/java/util/RandomAccess.html]
 interface and is large, this implementation dumps the specified list into an 
array before shuffling it, and dumps the shuffled array back into the list. 
This avoids the quadratic behavior that would result from shuffling a 
"sequential access" list in place.
{quote}

Since a linked list cannot offer constant-time indexing, the shuffle must copy 
it into a separate array to satisfy its own linear-time guarantee.  With an 
array, we'd have constant-time indexing, so we'd avoid the extra memory 
footprint of copying the list.

bq. If its smaller list, there are higher chances of getting the same result. 
With more items (increased to 100 now), it might not be the case.

I don't think we can commit a test with any non-zero chance of intermittent 
failure if the shuffle results in the same input list.  We have a problem with 
managing intermittently failing tests in Hadoop, so I'd prefer not to add more.

The only solution I can think of is to generate a random seed in the 
constructor, save it as a member variable, and then use it to build a 
{{Random}} to pass to [{{Collections#shuffle(List, 
Random)}}|http://docs.oracle.com/javase/7/docs/api/java/util/Collections.html#shuffle(java.util.List,%20java.util.Random)].
  The test could retrieve the seed by calling a {{VisibleForTesting}} method, 
make its own {{Random}}, and then make its own call to 
[{{Collections#shuffle(List, 
Random)}}|http://docs.oracle.com/javase/7/docs/api/java/util/Collections.html#shuffle(java.util.List,%20java.util.Random)].
  Since both product code and test code would use the same seed, the 2 shuffles 
would yield the same results.

I realize that makes the code more awkward though.  Let me know if you have a 
better idea.


> Randomize file list in SimpleCopyListing
> ----------------------------------------
>
>                 Key: HADOOP-13169
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13169
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>            Priority: Minor
>         Attachments: HADOOP-13169-branch-2-001.patch, 
> HADOOP-13169-branch-2-002.patch, HADOOP-13169-branch-2-003.patch, 
> HADOOP-13169-branch-2-004.patch, HADOOP-13169-branch-2-005.patch, 
> HADOOP-13169-branch-2-006.patch, HADOOP-13169-branch-2-007.patch, 
> HADOOP-13169-branch-2-008.patch
>
>
> When copying files to S3, based on file listing some mappers can get into S3 
> partition hotspots. This would be more visible, when data is copied from hive 
> warehouse with lots of partitions (e.g date partitions). In such cases, some 
> of the tasks would tend to be a lot more slower than others. It would be good 
> to randomize the file paths which are written out in SimpleCopyListing to 
> avoid this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (HADOOP-13169) Randomize file list in SimpleCopyListing

Reply via email to