[ 
https://issues.apache.org/jira/browse/HADOOP-13169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated HADOOP-13169:
--------------------------------------
    Attachment: HADOOP-13169-branch-2-006.patch

Thank you very much for the review [~cnauroth].

Changes:
1. Made {{fileStatusLimit}} and {{randomizeFileListing}} final fields.
2. Changed the logging in the {{SimpleCopyListing}} related change to debug 
level.
3. You are correct about the {{synchronizedList}} related change. The list is 
not accessed from multiple threads, so it is now a plain {{LinkedList}}.
4. Used the diamond operator instead of {{new ArrayList<Path>}}.
5. Used "try-with-resources" in the test case.
6. Removed the IOException handling in the test case so that the exception 
propagates.
7. Fixed "/tmp/" in Path.
8. Improved the error message in the test case by including {{idx}}.
9. Regarding "Collections.shuffle()": the test was shuffling only 10 items 
(1, 2, ..., 10). With a smaller list there is a higher chance of the shuffle 
returning the items in the same order. With more items (now increased to 100) 
that should be much less likely. Please correct me if I am wrong.
10. Fixed the checkstyle issues.
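
To put a rough number on point 9, here is a minimal sketch, assuming the test 
compares the full order of the list before and after the shuffle. For a 
uniform shuffle of n distinct items the chance of getting the original order 
back is 1/n!, which is already about 1 in 3,628,800 for n = 10 and far smaller 
for n = 100. The class and method names below are hypothetical and are not 
taken from the patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of the patch.
class ShuffleCheck {

  /** Builds the list 1..n. */
  static List<Integer> range(int n) {
    List<Integer> items = new ArrayList<>();
    for (int i = 1; i <= n; i++) {
      items.add(i);
    }
    return items;
  }

  /** Counts how many of the given number of shuffles keep the original order. */
  static int countUnchanged(int n, int trials, long seed) {
    Random rnd = new Random(seed);
    List<Integer> original = range(n);
    int unchanged = 0;
    for (int t = 0; t < trials; t++) {
      List<Integer> copy = new ArrayList<>(original);
      Collections.shuffle(copy, rnd);
      if (copy.equals(original)) {
        unchanged++;
      }
    }
    return unchanged;
  }

  /** Probability that a uniform shuffle of n distinct items is a no-op: 1/n!. */
  static double identityProbability(int n) {
    double p = 1.0;
    for (int i = 2; i <= n; i++) {
      p /= i;
    }
    return p;
  }
}
```

Passing a seeded {{Random}} to {{Collections.shuffle(list, rnd)}} is another 
way to keep such a test deterministic, if that fits the test case.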



> Randomize file list in SimpleCopyListing
> ----------------------------------------
>
>                 Key: HADOOP-13169
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13169
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>            Priority: Minor
>         Attachments: HADOOP-13169-branch-2-001.patch, 
> HADOOP-13169-branch-2-002.patch, HADOOP-13169-branch-2-003.patch, 
> HADOOP-13169-branch-2-004.patch, HADOOP-13169-branch-2-005.patch, 
> HADOOP-13169-branch-2-006.patch
>
>
> When copying files to S3, some mappers can run into S3 partition hotspots, 
> depending on the order of the file listing. This is more visible when data 
> is copied from a Hive warehouse with lots of partitions (e.g. date 
> partitions). In such cases, some of the tasks tend to be a lot slower than 
> others. It would be good to randomize the file paths which are written out 
> in SimpleCopyListing to avoid this issue.
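
The idea in the description can be sketched as follows. This is only an 
illustration under assumptions, not the code in the attached patch: paths are 
plain strings rather than Hadoop {{Path}} objects, the class and method names 
are hypothetical, and {{limit}} stands in for a bound on how many entries are 
buffered before each shuffle.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch, not the SimpleCopyListing implementation.
class ShuffledListing {

  /**
   * Returns the paths in randomized order. Entries are shuffled in chunks of
   * at most `limit`, so neighbouring paths (e.g. one date partition) are
   * spread out within a chunk while the shuffle buffer stays bounded.
   */
  static List<String> randomize(List<String> paths, int limit, Random rnd) {
    if (limit <= 0) {
      throw new IllegalArgumentException("limit must be positive");
    }
    List<String> out = new ArrayList<>(paths.size());
    List<String> buffer = new ArrayList<>();
    for (String p : paths) {
      buffer.add(p);
      if (buffer.size() >= limit) {
        // Buffer is full: shuffle it and flush it to the output.
        Collections.shuffle(buffer, rnd);
        out.addAll(buffer);
        buffer.clear();
      }
    }
    // Flush whatever is left in the last, possibly partial, chunk.
    Collections.shuffle(buffer, rnd);
    out.addAll(buffer);
    return out;
  }
}
```

With a large enough {{limit}} the whole listing is shuffled at once. A smaller 
{{limit}} trades some randomness for lower memory use.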



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
