[jira] [Resolved] (HADOOP-15268) Back port HADOOP-13972 to 2.8.1 and 2.8.3
[ https://issues.apache.org/jira/browse/HADOOP-15268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Omkar Aradhya K S resolved HADOOP-15268.
Resolution: Invalid
Target Version/s: 2.8.3, 2.8.1 (was: 2.8.1, 2.8.3)

This is not required.

> Back port HADOOP-13972 to 2.8.1 and 2.8.3
>
> Key: HADOOP-15268
> URL: https://issues.apache.org/jira/browse/HADOOP-15268
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/adl
> Affects Versions: 2.8.1, 2.8.3
> Reporter: Omkar Aradhya K S
> Assignee: Omkar Aradhya K S
> Priority: Major
>
> Back port the HADOOP-13972 to branch-2.8.1 and branch-2.8.3

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-15268) Back port HADOOP-13972 to 2.8.1 and 2.8.3
[ https://issues.apache.org/jira/browse/HADOOP-15268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378131#comment-16378131 ]

Omkar Aradhya K S commented on HADOOP-15268:

Thanks [~jojochuang]. I will delete this sub-task; it is not required.

> Back port HADOOP-13972 to 2.8.1 and 2.8.3
>
> Key: HADOOP-15268
> URL: https://issues.apache.org/jira/browse/HADOOP-15268
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/adl
> Affects Versions: 2.8.1, 2.8.3
> Reporter: Omkar Aradhya K S
> Assignee: Omkar Aradhya K S
> Priority: Major
>
> Back port the HADOOP-13972 to branch-2.8.1 and branch-2.8.3
[jira] [Updated] (HADOOP-15268) Back port HADOOP-13972 to 2.8.1 and 2.8.3
[ https://issues.apache.org/jira/browse/HADOOP-15268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Omkar Aradhya K S updated HADOOP-15268:
Summary: Back port HADOOP-13972 to 2.8.1 and 2.8.3 (was: Back port to 2.8.1 and 2.8.3)

> Back port HADOOP-13972 to 2.8.1 and 2.8.3
>
> Key: HADOOP-15268
> URL: https://issues.apache.org/jira/browse/HADOOP-15268
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/adl
> Affects Versions: 2.8.1, 2.8.3
> Reporter: Omkar Aradhya K S
> Assignee: Omkar Aradhya K S
> Priority: Major
>
> Back port the HADOOP-13972 to branch-2.8.1 and branch-2.8.3
[jira] [Created] (HADOOP-15268) Back port to 2.8.1 and 2.8.3
Omkar Aradhya K S created HADOOP-15268:

Summary: Back port to 2.8.1 and 2.8.3
Key: HADOOP-15268
URL: https://issues.apache.org/jira/browse/HADOOP-15268
Project: Hadoop Common
Issue Type: Sub-task
Components: fs/adl
Affects Versions: 2.8.3, 2.8.1
Reporter: Omkar Aradhya K S
Assignee: Omkar Aradhya K S
Fix For: 2.8.3, 2.8.1

Back port the HADOOP-13972 to branch-2.8.1 and branch-2.8.3
[jira] [Commented] (HADOOP-13972) ADLS to support per-store configuration
[ https://issues.apache.org/jira/browse/HADOOP-13972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362074#comment-16362074 ]

Omkar Aradhya K S commented on HADOOP-13972:

Hi [~jzhuge], do you have any further information on where you left off with this feature?

> ADLS to support per-store configuration
>
> Key: HADOOP-13972
> URL: https://issues.apache.org/jira/browse/HADOOP-13972
> Project: Hadoop Common
> Issue Type: Improvement
> Components: fs/adl
> Affects Versions: 3.0.0-alpha2
> Reporter: John Zhuge
> Priority: Major
>
> Useful when distcp needs to access 2 Data Lake stores with different SPIs.
> Of course, a workaround is to grant the same SPI access permission to both
> stores, but sometimes it might not be feasible.
> One idea is to embed the store name in the configuration property names,
> e.g., {{dfs.adls.oauth2..client.id}}. Per-store keys will be consulted
> first, then fall back to the global keys.
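The lookup order described in the issue (per-store key first, then the global key) can be sketched as below. This is only an illustration in Python, not the Hadoop implementation: `lookup_with_fallback` and the `example.*` key names are hypothetical, and the sketch deliberately takes both keys as arguments rather than guessing the real ADLS property layout.

```python
from typing import Mapping, Optional


def lookup_with_fallback(conf: Mapping[str, str],
                         per_store_key: str,
                         global_key: str) -> Optional[str]:
    """Return the per-store value if it is set, otherwise the global value.

    Hypothetical helper: both key names are supplied by the caller, so no
    particular property naming scheme is assumed here.
    """
    value = conf.get(per_store_key)
    if value is not None:
        return value
    # No per-store override: fall back to the global setting (may be None).
    return conf.get(global_key)


# Hypothetical configuration: storeA overrides the client id, storeB does not.
conf = {
    "example.global.client.id": "global-id",
    "example.storeA.client.id": "store-a-id",
}

store_a_id = lookup_with_fallback(
    conf, "example.storeA.client.id", "example.global.client.id")
store_b_id = lookup_with_fallback(
    conf, "example.storeB.client.id", "example.global.client.id")
print(store_a_id)
print(store_b_id)
```

With this ordering, a job that touches two stores can give one of them its own credentials while the other keeps using the global ones.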
[jira] [Commented] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16024138#comment-16024138 ]

Omkar Aradhya K S commented on HADOOP-14407:

Thanks [~yzhangal].

> DistCp - Introduce a configurable copy buffer size
>
> Key: HADOOP-14407
> URL: https://issues.apache.org/jira/browse/HADOOP-14407
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 2.9.0
> Reporter: Omkar Aradhya K S
> Assignee: Omkar Aradhya K S
> Fix For: 2.9.0, 3.0.0-alpha3
>
> Attachments: HADOOP-14407.001.patch, HADOOP-14407.002.patch,
> HADOOP-14407.002.patch, HADOOP-14407.003.patch,
> HADOOP-14407.004.branch2.patch, HADOOP-14407.004.patch,
> HADOOP-14407.004.patch, HADOOP-14407.branch2.002.patch,
> TotalTime-vs-CopyBufferSize.jpg
>
> Currently, the RetriableFileCopyCommand has a fixed copy buffer size of just
> 8KB. We have noticed in our performance tests that with bigger buffer sizes
> we saw up to ~3x performance boost. Hence, making the copy buffer size a
> configurable setting via the new parameter .

--
This message was sent by Atlassian JIRA (v6.3.15#6346)

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022643#comment-16022643 ]

Omkar Aradhya K S edited comment on HADOOP-14407 at 5/24/17 10:06 AM:

[~yzhangal] Thanks for pointing this out! I have a couple of questions:
1. My tests are present in "/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestOptionsParser.java" --> testToString(), testParseCopyBufferSize().
2. The "/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestDistCpOptions.java" doesn't have "blocksPerChunk=0", so how was it passing earlier?
3. Why are there two redundant "testToString()" methods in two different classes?
4. Can I re-open this issue and add the patch here itself?
5. What is the procedure to test branch-2 patches? I see a "-1" from Hadoop QA for the branch-2 patch above!

was (Author: omkarksa):
[~yzhangal] Thanks for pointing this out! I have couple of doubts:
1. The "/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestDistCpOptions.java" doesn't have "blocksPerChunk=0" so how was it passing earlier?
2. Can I re-open this issue and add the patch here itself?
3. What is the procedure to test branch-2 patches? I see a "-1" from Hadoop QA for the branch-2 patch above!

> DistCp - Introduce a configurable copy buffer size
>
> Key: HADOOP-14407
> URL: https://issues.apache.org/jira/browse/HADOOP-14407
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 2.9.0
> Reporter: Omkar Aradhya K S
> Assignee: Omkar Aradhya K S
> Fix For: 2.9.0, 3.0.0-alpha3
>
> Attachments: HADOOP-14407.001.patch, HADOOP-14407.002.patch,
> HADOOP-14407.002.patch, HADOOP-14407.003.patch,
> HADOOP-14407.004.branch2.patch, HADOOP-14407.004.patch,
> HADOOP-14407.004.patch, TotalTime-vs-CopyBufferSize.jpg
>
> Currently, the RetriableFileCopyCommand has a fixed copy buffer size of just
> 8KB. We have noticed in our performance tests that with bigger buffer sizes
> we saw up to ~3x performance boost. Hence, making the copy buffer size a
> configurable setting via the new parameter .
[jira] [Commented] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022643#comment-16022643 ]

Omkar Aradhya K S commented on HADOOP-14407:

[~yzhangal] Thanks for pointing this out! I have a couple of questions:
1. The "/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestDistCpOptions.java" doesn't have "blocksPerChunk=0", so how was it passing earlier?
2. Can I re-open this issue and add the patch here itself?
3. What is the procedure to test branch-2 patches? I see a "-1" from Hadoop QA for the branch-2 patch above!

> DistCp - Introduce a configurable copy buffer size
>
> Key: HADOOP-14407
> URL: https://issues.apache.org/jira/browse/HADOOP-14407
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 2.9.0
> Reporter: Omkar Aradhya K S
> Assignee: Omkar Aradhya K S
> Fix For: 2.9.0, 3.0.0-alpha3
>
> Attachments: HADOOP-14407.001.patch, HADOOP-14407.002.patch,
> HADOOP-14407.002.patch, HADOOP-14407.003.patch,
> HADOOP-14407.004.branch2.patch, HADOOP-14407.004.patch,
> HADOOP-14407.004.patch, TotalTime-vs-CopyBufferSize.jpg
>
> Currently, the RetriableFileCopyCommand has a fixed copy buffer size of just
> 8KB. We have noticed in our performance tests that with bigger buffer sizes
> we saw up to ~3x performance boost. Hence, making the copy buffer size a
> configurable setting via the new parameter .
[jira] [Commented] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16019155#comment-16019155 ]

Omkar Aradhya K S commented on HADOOP-14407:

Thanks [~yzhangal] for the quick reviews and commits.

> DistCp - Introduce a configurable copy buffer size
>
> Key: HADOOP-14407
> URL: https://issues.apache.org/jira/browse/HADOOP-14407
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 2.9.0
> Reporter: Omkar Aradhya K S
> Assignee: Omkar Aradhya K S
> Fix For: 2.9.0, 3.0.0-alpha3
>
> Attachments: HADOOP-14407.001.patch, HADOOP-14407.002.patch,
> HADOOP-14407.002.patch, HADOOP-14407.003.patch,
> HADOOP-14407.004.branch2.patch, HADOOP-14407.004.patch,
> HADOOP-14407.004.patch, TotalTime-vs-CopyBufferSize.jpg
>
> Currently, the RetriableFileCopyCommand has a fixed copy buffer size of just
> 8KB. We have noticed in our performance tests that with bigger buffer sizes
> we saw up to ~3x performance boost. Hence, making the copy buffer size a
> configurable setting via the new parameter .
[jira] [Commented] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16017770#comment-16017770 ]

Omkar Aradhya K S commented on HADOOP-14407:

[~yzhangal] I have backported the feature to branch-2 and uploaded the patch (HADOOP-14407.004.branch2.patch). Please review it when you get a chance.

> DistCp - Introduce a configurable copy buffer size
>
> Key: HADOOP-14407
> URL: https://issues.apache.org/jira/browse/HADOOP-14407
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 2.9.0
> Reporter: Omkar Aradhya K S
> Assignee: Omkar Aradhya K S
> Fix For: 2.9.0, 3.0.0-alpha3
>
> Attachments: HADOOP-14407.001.patch, HADOOP-14407.002.patch,
> HADOOP-14407.002.patch, HADOOP-14407.003.patch,
> HADOOP-14407.004.branch2.patch, HADOOP-14407.004.patch,
> HADOOP-14407.004.patch, TotalTime-vs-CopyBufferSize.jpg
>
> Currently, the RetriableFileCopyCommand has a fixed copy buffer size of just
> 8KB. We have noticed in our performance tests that with bigger buffer sizes
> we saw up to ~3x performance boost. Hence, making the copy buffer size a
> configurable setting via the new parameter .
[jira] [Updated] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Omkar Aradhya K S updated HADOOP-14407:
Attachment: HADOOP-14407.004.branch2.patch

Feature patch backported to branch-2

> DistCp - Introduce a configurable copy buffer size
>
> Key: HADOOP-14407
> URL: https://issues.apache.org/jira/browse/HADOOP-14407
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 2.9.0
> Reporter: Omkar Aradhya K S
> Assignee: Omkar Aradhya K S
> Fix For: 2.9.0, 3.0.0-alpha3
>
> Attachments: HADOOP-14407.001.patch, HADOOP-14407.002.patch,
> HADOOP-14407.002.patch, HADOOP-14407.003.patch,
> HADOOP-14407.004.branch2.patch, HADOOP-14407.004.patch,
> HADOOP-14407.004.patch, TotalTime-vs-CopyBufferSize.jpg
>
> Currently, the RetriableFileCopyCommand has a fixed copy buffer size of just
> 8KB. We have noticed in our performance tests that with bigger buffer sizes
> we saw up to ~3x performance boost. Hence, making the copy buffer size a
> configurable setting via the new parameter .
[jira] [Commented] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16017038#comment-16017038 ]

Omkar Aradhya K S commented on HADOOP-14407:

{quote}
I committed to trunk. When trying to backport to branch-2, saw quite some conflicts. Would you please help doing branch-2 version and other ones you prefer?
{quote}
[~yzhangal] Thanks. OK, I will work on the branch-2 patch today.

> DistCp - Introduce a configurable copy buffer size
>
> Key: HADOOP-14407
> URL: https://issues.apache.org/jira/browse/HADOOP-14407
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 2.9.0
> Reporter: Omkar Aradhya K S
> Assignee: Omkar Aradhya K S
> Fix For: 2.9.0, 3.0.0-alpha3
>
> Attachments: HADOOP-14407.001.patch, HADOOP-14407.002.patch,
> HADOOP-14407.002.patch, HADOOP-14407.003.patch, HADOOP-14407.004.patch,
> HADOOP-14407.004.patch, TotalTime-vs-CopyBufferSize.jpg
>
> Currently, the RetriableFileCopyCommand has a fixed copy buffer size of just
> 8KB. We have noticed in our performance tests that with bigger buffer sizes
> we saw up to ~3x performance boost. Hence, making the copy buffer size a
> configurable setting via the new parameter .
[jira] [Updated] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Omkar Aradhya K S updated HADOOP-14407:
Status: Patch Available (was: Open)

> DistCp - Introduce a configurable copy buffer size
>
> Key: HADOOP-14407
> URL: https://issues.apache.org/jira/browse/HADOOP-14407
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 2.9.0
> Reporter: Omkar Aradhya K S
> Assignee: Omkar Aradhya K S
> Fix For: 2.9.0, 3.0.0-alpha3
>
> Attachments: HADOOP-14407.001.patch, HADOOP-14407.002.patch,
> HADOOP-14407.002.patch, HADOOP-14407.003.patch, HADOOP-14407.004.patch,
> HADOOP-14407.004.patch, TotalTime-vs-CopyBufferSize.jpg
>
> Currently, the RetriableFileCopyCommand has a fixed copy buffer size of just
> 8KB. We have noticed in our performance tests that with bigger buffer sizes
> we saw up to ~3x performance boost. Hence, making the copy buffer size a
> configurable setting via the new parameter .
[jira] [Updated] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Omkar Aradhya K S updated HADOOP-14407:
Status: Open (was: Patch Available)

> DistCp - Introduce a configurable copy buffer size
>
> Key: HADOOP-14407
> URL: https://issues.apache.org/jira/browse/HADOOP-14407
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 2.9.0
> Reporter: Omkar Aradhya K S
> Assignee: Omkar Aradhya K S
> Fix For: 2.9.0, 3.0.0-alpha3
>
> Attachments: HADOOP-14407.001.patch, HADOOP-14407.002.patch,
> HADOOP-14407.002.patch, HADOOP-14407.003.patch, HADOOP-14407.004.patch,
> HADOOP-14407.004.patch, TotalTime-vs-CopyBufferSize.jpg
>
> Currently, the RetriableFileCopyCommand has a fixed copy buffer size of just
> 8KB. We have noticed in our performance tests that with bigger buffer sizes
> we saw up to ~3x performance boost. Hence, making the copy buffer size a
> configurable setting via the new parameter .
[jira] [Updated] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Omkar Aradhya K S updated HADOOP-14407:
Attachment: HADOOP-14407.004.patch

> DistCp - Introduce a configurable copy buffer size
>
> Key: HADOOP-14407
> URL: https://issues.apache.org/jira/browse/HADOOP-14407
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 2.9.0
> Reporter: Omkar Aradhya K S
> Assignee: Omkar Aradhya K S
> Fix For: 2.9.0, 3.0.0-alpha3
>
> Attachments: HADOOP-14407.001.patch, HADOOP-14407.002.patch,
> HADOOP-14407.002.patch, HADOOP-14407.003.patch, HADOOP-14407.004.patch,
> HADOOP-14407.004.patch, TotalTime-vs-CopyBufferSize.jpg
>
> Currently, the RetriableFileCopyCommand has a fixed copy buffer size of just
> 8KB. We have noticed in our performance tests that with bigger buffer sizes
> we saw up to ~3x performance boost. Hence, making the copy buffer size a
> configurable setting via the new parameter .
[jira] [Updated] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Omkar Aradhya K S updated HADOOP-14407:
Status: Patch Available (was: Open)

Submitting new patch - removed "long" variable type for copybuffersize

> DistCp - Introduce a configurable copy buffer size
>
> Key: HADOOP-14407
> URL: https://issues.apache.org/jira/browse/HADOOP-14407
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 2.9.0
> Reporter: Omkar Aradhya K S
> Assignee: Omkar Aradhya K S
> Fix For: 2.9.0, 3.0.0-alpha3
>
> Attachments: HADOOP-14407.001.patch, HADOOP-14407.002.patch,
> HADOOP-14407.002.patch, HADOOP-14407.003.patch, HADOOP-14407.004.patch,
> TotalTime-vs-CopyBufferSize.jpg
>
> Currently, the RetriableFileCopyCommand has a fixed copy buffer size of just
> 8KB. We have noticed in our performance tests that with bigger buffer sizes
> we saw up to ~3x performance boost. Hence, making the copy buffer size a
> configurable setting via the new parameter .
[jira] [Updated] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Omkar Aradhya K S updated HADOOP-14407:
Attachment: HADOOP-14407.004.patch

Uploading new patch - removed "long" variable type for copybuffersize

> DistCp - Introduce a configurable copy buffer size
>
> Key: HADOOP-14407
> URL: https://issues.apache.org/jira/browse/HADOOP-14407
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 2.9.0
> Reporter: Omkar Aradhya K S
> Assignee: Omkar Aradhya K S
> Fix For: 2.9.0, 3.0.0-alpha3
>
> Attachments: HADOOP-14407.001.patch, HADOOP-14407.002.patch,
> HADOOP-14407.002.patch, HADOOP-14407.003.patch, HADOOP-14407.004.patch,
> TotalTime-vs-CopyBufferSize.jpg
>
> Currently, the RetriableFileCopyCommand has a fixed copy buffer size of just
> 8KB. We have noticed in our performance tests that with bigger buffer sizes
> we saw up to ~3x performance boost. Hence, making the copy buffer size a
> configurable setting via the new parameter .
[jira] [Updated] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Omkar Aradhya K S updated HADOOP-14407:
Status: Open (was: Patch Available)

> DistCp - Introduce a configurable copy buffer size
>
> Key: HADOOP-14407
> URL: https://issues.apache.org/jira/browse/HADOOP-14407
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 2.9.0
> Reporter: Omkar Aradhya K S
> Assignee: Omkar Aradhya K S
> Fix For: 2.9.0, 3.0.0-alpha3
>
> Attachments: HADOOP-14407.001.patch, HADOOP-14407.002.patch,
> HADOOP-14407.002.patch, HADOOP-14407.003.patch,
> TotalTime-vs-CopyBufferSize.jpg
>
> Currently, the RetriableFileCopyCommand has a fixed copy buffer size of just
> 8KB. We have noticed in our performance tests that with bigger buffer sizes
> we saw up to ~3x performance boost. Hence, making the copy buffer size a
> configurable setting via the new parameter .
[jira] [Updated] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Omkar Aradhya K S updated HADOOP-14407:
Status: Patch Available (was: Open)

Submitting new patch with cosmetic changes

> DistCp - Introduce a configurable copy buffer size
>
> Key: HADOOP-14407
> URL: https://issues.apache.org/jira/browse/HADOOP-14407
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 2.9.0
> Reporter: Omkar Aradhya K S
> Assignee: Omkar Aradhya K S
> Fix For: 2.9.0, 3.0.0-alpha3
>
> Attachments: HADOOP-14407.001.patch, HADOOP-14407.002.patch,
> HADOOP-14407.002.patch, HADOOP-14407.003.patch,
> TotalTime-vs-CopyBufferSize.jpg
>
> Currently, the RetriableFileCopyCommand has a fixed copy buffer size of just
> 8KB. We have noticed in our performance tests that with bigger buffer sizes
> we saw up to ~3x performance boost. Hence, making the copy buffer size a
> configurable setting via the new parameter .
[jira] [Updated] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Omkar Aradhya K S updated HADOOP-14407:
Status: Open (was: Patch Available)

> DistCp - Introduce a configurable copy buffer size
>
> Key: HADOOP-14407
> URL: https://issues.apache.org/jira/browse/HADOOP-14407
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 2.9.0
> Reporter: Omkar Aradhya K S
> Assignee: Omkar Aradhya K S
> Fix For: 2.9.0, 3.0.0-alpha3
>
> Attachments: HADOOP-14407.001.patch, HADOOP-14407.002.patch,
> HADOOP-14407.002.patch, HADOOP-14407.003.patch,
> TotalTime-vs-CopyBufferSize.jpg
>
> Currently, the RetriableFileCopyCommand has a fixed copy buffer size of just
> 8KB. We have noticed in our performance tests that with bigger buffer sizes
> we saw up to ~3x performance boost. Hence, making the copy buffer size a
> configurable setting via the new parameter .
[jira] [Updated] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Omkar Aradhya K S updated HADOOP-14407:
Attachment: HADOOP-14407.003.patch

New patch with cosmetic changes

> DistCp - Introduce a configurable copy buffer size
>
> Key: HADOOP-14407
> URL: https://issues.apache.org/jira/browse/HADOOP-14407
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 2.9.0
> Reporter: Omkar Aradhya K S
> Assignee: Omkar Aradhya K S
> Fix For: 2.9.0, 3.0.0-alpha3
>
> Attachments: HADOOP-14407.001.patch, HADOOP-14407.002.patch,
> HADOOP-14407.002.patch, HADOOP-14407.003.patch,
> TotalTime-vs-CopyBufferSize.jpg
>
> Currently, the RetriableFileCopyCommand has a fixed copy buffer size of just
> 8KB. We have noticed in our performance tests that with bigger buffer sizes
> we saw up to ~3x performance boost. Hence, making the copy buffer size a
> configurable setting via the new parameter .
[jira] [Commented] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16015158#comment-16015158 ]

Omkar Aradhya K S commented on HADOOP-14407:

{quote}
Thanks for the updated patch Omkar Aradhya K S, are you guys still looking into setting input and output buffer to different size? Or any chance we need to do that in the future?
{quote}
[~yzhangal] Thanks for checking the patch. As explained in the previous comment, we don't need this change, since even a small copybuffersize can give a huge boost in performance.
{quote}
Somehow your submitting the patch did not trigger a jenkins test, maybe there is an infra issue.
{quote}
Thanks for pointing this out. Yes, I also waited for quite some time, but there was no result! Do I need any additional permissions for this? Also, can you point out how exactly you triggered the build? Maybe I missed something?

> DistCp - Introduce a configurable copy buffer size
>
> Key: HADOOP-14407
> URL: https://issues.apache.org/jira/browse/HADOOP-14407
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 2.9.0
> Reporter: Omkar Aradhya K S
> Assignee: Omkar Aradhya K S
> Fix For: 2.9.0, 3.0.0-alpha3
>
> Attachments: HADOOP-14407.001.patch, HADOOP-14407.002.patch,
> HADOOP-14407.002.patch, TotalTime-vs-CopyBufferSize.jpg
>
> Currently, the RetriableFileCopyCommand has a fixed copy buffer size of just
> 8KB. We have noticed in our performance tests that with bigger buffer sizes
> we saw up to ~3x performance boost. Hence, making the copy buffer size a
> configurable setting via the new parameter .
[jira] [Commented] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16014375#comment-16014375 ]

Omkar Aradhya K S commented on HADOOP-14407:

{quote}
Also, we will do more investigations into introducing both input and output copybuffersize configurations.
{quote}
We did a small set of runs to see at what copybuffersize the performance gain levels off:
!TotalTime-vs-CopyBufferSize.jpg!
In this case, with copybuffersize set to just 128KB, we get >3x performance!
{quote}
If there is benefit of doing this I will submit a new patch with the changes or else we will go ahead with this patch.
{quote}
Since even a small increase in the copybuffersize gives the desired performance, we will not need two separate copybuffersize settings for input and output.

> DistCp - Introduce a configurable copy buffer size
>
> Key: HADOOP-14407
> URL: https://issues.apache.org/jira/browse/HADOOP-14407
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 2.9.0
> Reporter: Omkar Aradhya K S
> Assignee: Omkar Aradhya K S
> Fix For: 2.9.0, 3.0.0-alpha3
>
> Attachments: HADOOP-14407.001.patch, HADOOP-14407.002.patch,
> TotalTime-vs-CopyBufferSize.jpg
>
> Currently, the RetriableFileCopyCommand has a fixed copy buffer size of just
> 8KB. We have noticed in our performance tests that with bigger buffer sizes
> we saw up to ~3x performance boost. Hence, making the copy buffer size a
> configurable setting via the new parameter .
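To make the idea in this thread concrete, here is a minimal Python sketch of a stream copy whose buffer size is a parameter instead of a fixed 8KB. It is not the DistCp code: `copy_stream` and `DEFAULT_COPY_BUFFER_SIZE` are hypothetical names, and in-memory streams stand in for the real source and target files. It only shows why a larger buffer means fewer read/write round trips for the same data.

```python
import io
from typing import BinaryIO

# 8KB default, mirroring the fixed size mentioned in the issue description.
DEFAULT_COPY_BUFFER_SIZE = 8 * 1024


def copy_stream(src: BinaryIO, dst: BinaryIO,
                buffer_size: int = DEFAULT_COPY_BUFFER_SIZE) -> int:
    """Copy src to dst in chunks of at most buffer_size bytes.

    Returns the number of bytes copied. Hypothetical helper for illustration.
    """
    if buffer_size <= 0:
        raise ValueError("buffer_size must be positive")
    total = 0
    while True:
        chunk = src.read(buffer_size)
        if not chunk:  # empty bytes object signals end of stream
            break
        dst.write(chunk)
        total += len(chunk)
    return total


# Copy 300KB with a 128KB buffer, the size the thread reports as sufficient.
payload = b"x" * (300 * 1024)
out = io.BytesIO()
copied = copy_stream(io.BytesIO(payload), out, buffer_size=128 * 1024)
print(copied)
```

A single buffer serves both the read and the write side here, which matches the conclusion above that separate input and output sizes are not needed.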
[jira] [Updated] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Omkar Aradhya K S updated HADOOP-14407:
Attachment: TotalTime-vs-CopyBufferSize.jpg

> DistCp - Introduce a configurable copy buffer size
>
> Key: HADOOP-14407
> URL: https://issues.apache.org/jira/browse/HADOOP-14407
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 2.9.0
> Reporter: Omkar Aradhya K S
> Assignee: Omkar Aradhya K S
> Fix For: 2.9.0, 3.0.0-alpha3
>
> Attachments: HADOOP-14407.001.patch, HADOOP-14407.002.patch,
> TotalTime-vs-CopyBufferSize.jpg
>
> Currently, the RetriableFileCopyCommand has a fixed copy buffer size of just
> 8KB. We have noticed in our performance tests that with bigger buffer sizes
> we saw up to ~3x performance boost. Hence, making the copy buffer size a
> configurable setting via the new parameter .
[jira] [Updated] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Aradhya K S updated HADOOP-14407: --- Status: Open (was: Patch Available)
[jira] [Updated] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Aradhya K S updated HADOOP-14407: --- Status: Patch Available (was: Open)
[jira] [Commented] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16013785#comment-16013785 ] Omkar Aradhya K S commented on HADOOP-14407: [~yzhangal] Thanks for reviewing. Please find attached the new patch with the fixes.
[jira] [Updated] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Aradhya K S updated HADOOP-14407: --- Attachment: HADOOP-14407.002.patch Attached new patch (HADOOP-14407.002.patch) with all required fixes.
[jira] [Commented] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16007968#comment-16007968 ] Omkar Aradhya K S commented on HADOOP-14407: [~yzhangal] I have submitted the patch. Could you please check this (HADOOP-14407.001.patch)? Also, we will do more investigations into introducing both input and output copybuffersize configurations. If there is a benefit in doing this, I will submit a new patch with the changes; otherwise we will go ahead with this patch.
[jira] [Updated] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Aradhya K S updated HADOOP-14407: --- Attachment: HADOOP-14407.001.patch Initial patch with the required changes - HADOOP-14407.001.patch
[jira] [Updated] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Aradhya K S updated HADOOP-14407: --- Status: Patch Available (was: In Progress)
[jira] [Updated] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Aradhya K S updated HADOOP-14407: --- Status: Open (was: Patch Available)
[jira] [Work started] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HADOOP-14407 started by Omkar Aradhya K S.
[jira] [Commented] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16007606#comment-16007606 ] Omkar Aradhya K S commented on HADOOP-14407: Thanks [~yzhangal]. We found that we don't need 2 buffers. Just making the existing copy buffer size configurable should do. I will submit the patch soon.
[jira] [Updated] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Aradhya K S updated HADOOP-14407: --- Description: Currently, the RetriableFileCopyCommand has a fixed copy buffer size of just 8KB. We have noticed in our performance tests that bigger buffer sizes give up to a ~3x performance boost. Hence, making the copy buffer size a configurable setting via the new parameter . (was: The minimum unit of work for a distcp task is a file. We have files that are greater than 1 TB with a block size of 1 GB. If we use distcp to copy these files, the tasks either take a very long time or eventually fail. A better way for distcp would be to copy all the source blocks in parallel, and then stitch the blocks back to files at the destination via the HDFS Concat API (HDFS-222))
[jira] [Updated] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Aradhya K S updated HADOOP-14407: --- Release Note: The copy buffer size can be configured via the new parameter . By default, the copy buffer size is set to 8KB. (was: If a positive value is passed to command line switch -blocksperchunk, files with more blocks than this value will be split into chunks of `` blocks to be transferred in parallel, and reassembled on the destination. By default, `` is 0 and the files will be transmitted in their entirety without splitting. This switch is applicable only when the source file system supports getBlockLocations and the target supports concat.)
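The superseded release-note text above describes how -blocksperchunk splits a file: files with more blocks than the given value are cut into chunks of that many blocks, and a value of 0 means no splitting. The arithmetic can be sketched as follows. This is a hedged illustration, not DistCp's implementation; the class `ChunkPlanSketch`, the nested `Chunk` type, and the `plan` method are hypothetical names, and the sketch only computes byte ranges without touching any file system.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the splitting rule; not DistCp code.
class ChunkPlanSketch {
    // One contiguous byte range of the source file.
    static final class Chunk {
        final long offset;
        final long length;

        Chunk(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    // Returns the byte ranges to copy in parallel. A file is split only when
    // blocksPerChunk is positive and the file has more blocks than that value.
    static List<Chunk> plan(long fileLength, long blockSize, int blocksPerChunk) {
        if (fileLength < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("invalid file length or block size");
        }
        List<Chunk> chunks = new ArrayList<>();
        long numBlocks = (fileLength + blockSize - 1) / blockSize; // round up
        if (blocksPerChunk <= 0 || numBlocks <= blocksPerChunk) {
            // Default behaviour: transmit the file in its entirety.
            chunks.add(new Chunk(0, fileLength));
            return chunks;
        }
        long chunkBytes = blockSize * blocksPerChunk;
        for (long offset = 0; offset < fileLength; offset += chunkBytes) {
            // The last chunk may be shorter than the others.
            chunks.add(new Chunk(offset, Math.min(chunkBytes, fileLength - offset)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        long gb = 1L << 30;
        // The scenario from the issue description: a 1 TB file with 1 GB blocks.
        // The chunk size of 128 blocks is an arbitrary value for the example.
        for (Chunk c : plan(1024 * gb, gb, 128)) {
            System.out.println("offset=" + c.offset + " length=" + c.length);
        }
    }
}
```

Reassembling the copied chunks on the destination is the part that relies on the target's concat support, which this sketch leaves out.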
[jira] [Updated] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Aradhya K S updated HADOOP-14407: --- Affects Version/s: (was: 0.21.0) 2.9.0
[jira] [Updated] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
[ https://issues.apache.org/jira/browse/HADOOP-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Aradhya K S updated HADOOP-14407: --- Hadoop Flags: (was: Reviewed)
[jira] [Created] (HADOOP-14407) DistCp - Introduce a configurable copy buffer size
Omkar Aradhya K S created HADOOP-14407: -- Summary: DistCp - Introduce a configurable copy buffer size Key: HADOOP-14407 URL: https://issues.apache.org/jira/browse/HADOOP-14407 Project: Hadoop Common Issue Type: Improvement Components: tools/distcp Affects Versions: 0.21.0 Reporter: Omkar Aradhya K S Assignee: Yongjun Zhang Fix For: 2.9.0, 3.0.0-alpha3 The minimum unit of work for a distcp task is a file. We have files that are greater than 1 TB with a block size of 1 GB. If we use distcp to copy these files, the tasks either take a very long time or eventually fail. A better way for distcp would be to copy all the source blocks in parallel, and then stitch the blocks back to files at the destination via the HDFS Concat API (HDFS-222)
[jira] [Commented] (HADOOP-11794) Enable distcp to copy blocks in parallel
[ https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965371#comment-15965371 ] Omkar Aradhya K S commented on HADOOP-11794: [~steve_l] I was able to test the bits with HDI 3.3, which is *2.7.1*. However, I was wondering if we can go back as far as *2.5.x*/*2.2.x*? > Enable distcp to copy blocks in parallel > > > Key: HADOOP-11794 > URL: https://issues.apache.org/jira/browse/HADOOP-11794 > Project: Hadoop Common > Issue Type: Improvement > Components: tools/distcp > Affects Versions: 0.21.0 > Reporter: dhruba borthakur > Assignee: Yongjun Zhang > Attachments: HADOOP-11794.001.patch, HADOOP-11794.002.patch, HADOOP-11794.003.patch, HADOOP-11794.004.patch, HADOOP-11794.005.patch, HADOOP-11794.006.patch, HADOOP-11794.007.patch, HADOOP-11794.008.patch, HADOOP-11794.009.patch, HADOOP-11794.010.branch2.patch, HADOOP-11794.010.patch, MAPREDUCE-2257.patch > > > The minimum unit of work for a distcp task is a file. We have files that are greater than 1 TB with a block size of 1 GB. If we use distcp to copy these files, the tasks either take a very long time or eventually fail. A better way for distcp would be to copy all the source blocks in parallel, and then stitch the blocks back to files at the destination via the HDFS Concat API (HDFS-222)
[jira] [Commented] (HADOOP-11794) Enable distcp to copy blocks in parallel
[ https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964044#comment-15964044 ] Omkar Aradhya K S commented on HADOOP-11794: {quote} Yongjun Zhang Sure, I will finish testing this by early next week. {quote} [~yzhangal] I was able to do some basic tests and it works! Thanks for the patch. The branch-2 is *2.9.0*. However, will this patch work on older versions like *2.2.x*?
[jira] [Commented] (HADOOP-11794) Enable distcp to copy blocks in parallel
[ https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960541#comment-15960541 ] Omkar Aradhya K S commented on HADOOP-11794: {quote} Hi Omkar Aradhya K S, wonder if you could help run this branch-2 patch on ADLS too if possible? {quote} [~yzhangal] Sure, I will finish testing this by early next week.
[jira] [Commented] (HADOOP-11794) Enable distcp to copy blocks in parallel
[ https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15956675#comment-15956675 ] Omkar Aradhya K S commented on HADOOP-11794: {quote} BTW, Steve still has an item for you to follow-up here https://issues.apache.org/jira/browse/HADOOP-11794?focusedCommentId=15938217=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15938217 {quote} [~yzhangal] Sorry for the late reply. Thanks for pointing this out. I almost missed this! {quote} Omkar: if ADL doesn't implement the distcp contract test, you might want to follow up this patch with a distcp test that forces the use of the concat operation. {quote} [~steve_l] I will look into this.
[jira] [Commented] (HADOOP-11794) Enable distcp to copy blocks in parallel
[ https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950360#comment-15950360 ] Omkar Aradhya K S commented on HADOOP-11794: [~yzhangal] Thanks for re-considering the suggestions and re-doing the patch to accommodate all FileSystem implementations. {quote} I just committed to trunk. Will work on branch-2 version asap (tried and see quite some conflicts). {quote} Could you please elaborate on how you plan to proceed with backporting?
[jira] [Commented] (HADOOP-11794) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948924#comment-15948924 ] Omkar Aradhya K S commented on HADOOP-11794: [~yzhangal] I have tested the patch with ADLS and it works without any changes. Thanks.
[jira] [Updated] (HADOOP-11794) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Aradhya K S updated HADOOP-11794: --- Hi Yongjun, I am on vacation till tomorrow. Would it be too late if I review it tomorrow? If you are held up because of me, I can reach home and try it today itself. Please let me know. Regards, Omkar
[jira] [Commented] (HADOOP-11794) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940043#comment-15940043 ] Omkar Aradhya K S commented on HADOOP-11794: {quote} This is usually handled in the same ticket, and by cherry-picking the patch. Backporting doesn't usually warrant a new JIRA unless the implementation is significantly different. {quote} [~chris.douglas], [~yzhangal] Thanks for all the clarifications.
[jira] [Commented] (HADOOP-11794) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937918#comment-15937918 ] Omkar Aradhya K S commented on HADOOP-11794: [~yzhangal] Thanks for reconsidering the comments and making the required changes for FileSystem compatibility. [~steve_l], [~jzhuge], [~atm], [~chris.douglas] Thanks for providing clarity and a way ahead. {quote} Omkar Aradhya K S, can you post your patch? {quote} I made only the following changes in my patch to get it working with ADLS on hadoop 2.7:
# Remove all the checks for *DistributedFileSystem*
# Use {code}fs.concat{code} instead of {code}dstdistfs.concat{code}
# Use {code}fs.getFileBlockLocations{code} instead of {code}dfs.getBlockLocations{code}
# Use (now deprecated) {code}final DFSClient dfs = new DFSClient(conf);{code} instead of {code}final DFSClient dfs = new DFSClient(DFSUtilClient.getNNAddress(conf), conf);{code}
[~yzhangal] Once the new patch with all of the above changes is checked in, we need to back port it to older versions of hadoop. Will that be addressed by new JIRAs? > distcp can copy blocks in parallel > -- > > Key: HADOOP-11794 > URL: https://issues.apache.org/jira/browse/HADOOP-11794 > Project: Hadoop Common > Issue Type: Improvement > Components: tools/distcp >Affects Versions: 0.21.0 >Reporter: dhruba borthakur >Assignee: Yongjun Zhang > Attachments: HADOOP-11794.001.patch, HADOOP-11794.002.patch, > HADOOP-11794.003.patch, HADOOP-11794.004.patch, HADOOP-11794.005.patch, > HADOOP-11794.006.patch, HADOOP-11794.007.patch, HADOOP-11794.008.patch, > HADOOP-11794.009.patch, MAPREDUCE-2257.patch
[jira] [Commented] (HADOOP-11794) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936826#comment-15936826 ] Omkar Aradhya K S commented on HADOOP-11794: Thanks [~steve_l], [~jzhuge]. Yes, this could be one way to do it. Let me see if there is any other way. {quote} There is always the option of doing that: sending in an invalid concat() request and differentiating between: UnsupportedException and any other response, then assuming that the "any other response" exception means that it is implemented, but that the arguments were invalid. concat("/", new Path[0]) should be enough. {quote} That's right: ADLS supports both *concat* and *getFileBlockLocations*, and as commented earlier in this JIRA, I was able to get this patch working with ADLS with these changes and one more change. > distcp can copy blocks in parallel > -- > > Key: HADOOP-11794 > URL: https://issues.apache.org/jira/browse/HADOOP-11794 > Project: Hadoop Common > Issue Type: Improvement > Components: tools/distcp >Affects Versions: 0.21.0 >Reporter: dhruba borthakur >Assignee: Yongjun Zhang > Attachments: HADOOP-11794.001.patch, HADOOP-11794.002.patch, > HADOOP-11794.003.patch, HADOOP-11794.004.patch, HADOOP-11794.005.patch, > HADOOP-11794.006.patch, HADOOP-11794.007.patch, HADOOP-11794.008.patch, > MAPREDUCE-2257.patch
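The probing idea quoted in the comment above can be sketched as follows. This is only an illustration under stated assumptions: ConcatTarget is a hypothetical stand-in for the concat method of org.apache.hadoop.fs.FileSystem, with String used in place of Path so that the example runs without Hadoop on the classpath, and ConcatProbe and supportsConcat are made-up names. It assumes that an unimplemented concat signals this with java.lang.UnsupportedOperationException.

```java
import java.io.IOException;

/** Hypothetical probe; the real types would be org.apache.hadoop.fs.FileSystem and Path. */
final class ConcatProbe {

    /** Minimal stand-in for the single FileSystem method used here (assumption). */
    interface ConcatTarget {
        void concat(String target, String[] sources) throws IOException;
    }

    /**
     * Sends a deliberately invalid concat request. An
     * UnsupportedOperationException is read as "not implemented"; any other
     * outcome is read as "implemented, but the arguments were rejected".
     */
    static boolean supportsConcat(ConcatTarget fs) {
        try {
            fs.concat("/", new String[0]);
            return true; // the call was accepted, so concat exists
        } catch (UnsupportedOperationException e) {
            return false; // the implementation does not provide concat
        } catch (IOException | RuntimeException e) {
            return true; // concat exists and rejected the invalid arguments
        }
    }

    public static void main(String[] args) {
        ConcatTarget unsupported = (target, sources) -> {
            throw new UnsupportedOperationException("Not implemented");
        };
        ConcatTarget rejecting = (target, sources) -> {
            throw new IOException("invalid arguments");
        };
        System.out.println("unsupported: " + supportsConcat(unsupported));
        System.out.println("rejecting: " + supportsConcat(rejecting));
    }
}
```

A probe like this relies on exception types rather than on an instanceof check for a specific class, which is why it would work for any implementation that overrides concat. Its weakness is that an implementation which throws UnsupportedOperationException for invalid arguments would be misclassified.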
[jira] [Commented] (HADOOP-11794) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935814#comment-15935814 ] Omkar Aradhya K S commented on HADOOP-11794: Thanks for the clarification [~yzhangal]. About this: {quote} About other file systems, I did not get to test various file systems with this patch (except for DistributedFileSystem), we could follow-up with new jira to relax the file system requirement, and add corresponding tests for the corresponding file system? {quote} Is there any reason *not* to use *FileSystem.concat* and *FileSystem.getFileBlockLocations*? > distcp can copy blocks in parallel > -- > > Key: HADOOP-11794 > URL: https://issues.apache.org/jira/browse/HADOOP-11794 > Project: Hadoop Common > Issue Type: Improvement > Components: tools/distcp >Affects Versions: 0.21.0 >Reporter: dhruba borthakur >Assignee: Yongjun Zhang > Attachments: HADOOP-11794.001.patch, HADOOP-11794.002.patch, > HADOOP-11794.003.patch, HADOOP-11794.004.patch, HADOOP-11794.005.patch, > HADOOP-11794.006.patch, HADOOP-11794.007.patch, HADOOP-11794.008.patch, > MAPREDUCE-2257.patch
[jira] [Commented] (HADOOP-11794) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935743#comment-15935743 ] Omkar Aradhya K S commented on HADOOP-11794: {quote} The main reason of checking DistributedFileSystem is the support of getBlockLocations, and concat feature. I'm not sure whether we can assume other File System support that. {quote} The *getFileBlockLocations* and *concat* are APIs that are part of *FileSystem.java* from [hadoop v1.2.1|https://hadoop.apache.org/docs/r1.2.1/api/index.html] {quote} The current patch is for trunk where client and server code are separated. When we backport this change to other version of hadoop, we can make the change accordingly, for example, to use DFSUtil. {quote} You could just use the constructor that takes only the configuration, which would get the NNAddress internally: {code} final DFSClient dfs = new DFSClient(conf); {code} > distcp can copy blocks in parallel > -- > > Key: HADOOP-11794 > URL: https://issues.apache.org/jira/browse/HADOOP-11794 > Project: Hadoop Common > Issue Type: Improvement > Components: tools/distcp >Affects Versions: 0.21.0 >Reporter: dhruba borthakur >Assignee: Yongjun Zhang > Attachments: HADOOP-11794.001.patch, HADOOP-11794.002.patch, > HADOOP-11794.003.patch, HADOOP-11794.004.patch, HADOOP-11794.005.patch, > HADOOP-11794.006.patch, HADOOP-11794.007.patch, HADOOP-11794.008.patch, > MAPREDUCE-2257.patch
[jira] [Comment Edited] (HADOOP-11794) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15934045#comment-15934045 ] Omkar Aradhya K S edited comment on HADOOP-11794 at 3/21/17 3:46 AM: - I was trying to evaluate your patch with ADLS. I tried the bits on a HDInsight 3.5 cluster (this comes with hadoop 2.7) and observed the following compatibility issues:
a. You are checking for an instance of *DistributedFileSystem* in many places, and all other *FileSystem* implementations don’t extend *DistributedFileSystem*
i. Could this be changed to something more compatible with other *FileSystem* implementations?
b. You are using the new *DFSUtilClient*, which makes DistCp incompatible with older versions of Hadoop
i. Can this be changed to be backward compatible?
If the compatibility issues are addressed, DistCp with your feature would be available for other *FileSystem* implementations and would also be backward compatible. was (Author: omkarksa): I was trying to evaluate your patch with ADLS. I tried the bits on a HDInsight 3.5 cluster (this comes with hadoop 2.7) and observed the following compatibility issues:
a. You are checking for an instance of {code}DistributedFileSystem{code} in many places, and all other {code}FileSystem{code} implementations don’t extend {code}DistributedFileSystem{code}
i. Could this be changed to something more compatible with other {code}FileSystem{code} implementations?
b. You are using the new {code}DFSUtilClient{code}, which makes DistCp incompatible with older versions of Hadoop
i. Can this be changed to be backward compatible?
If the compatibility issues are addressed, DistCp with your feature would be available for other {code}FileSystem{code} implementations and would also be backward compatible. > distcp can copy blocks in parallel > -- > > Key: HADOOP-11794 > URL: https://issues.apache.org/jira/browse/HADOOP-11794 > Project: Hadoop Common > Issue Type: Improvement > Components: tools/distcp >Affects Versions: 0.21.0 >Reporter: dhruba borthakur >Assignee: Yongjun Zhang > Attachments: HADOOP-11794.001.patch, HADOOP-11794.002.patch, > HADOOP-11794.003.patch, HADOOP-11794.004.patch, HADOOP-11794.005.patch, > HADOOP-11794.006.patch, HADOOP-11794.007.patch, HADOOP-11794.008.patch, > MAPREDUCE-2257.patch
[jira] [Commented] (HADOOP-11794) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15934045#comment-15934045 ] Omkar Aradhya K S commented on HADOOP-11794: I was trying to evaluate your patch with ADLS. I tried the bits on a HDInsight 3.5 cluster (this comes with hadoop 2.7) and observed the following compatibility issues:
a. You are checking for an instance of {code}DistributedFileSystem{code} in many places, and all other {code}FileSystem{code} implementations don’t extend {code}DistributedFileSystem{code}
i. Could this be changed to something more compatible with other {code}FileSystem{code} implementations?
b. You are using the new {code}DFSUtilClient{code}, which makes DistCp incompatible with older versions of Hadoop
i. Can this be changed to be backward compatible?
If the compatibility issues are addressed, DistCp with your feature would be available for other {code}FileSystem{code} implementations and would also be backward compatible. > distcp can copy blocks in parallel > -- > > Key: HADOOP-11794 > URL: https://issues.apache.org/jira/browse/HADOOP-11794 > Project: Hadoop Common > Issue Type: Improvement > Components: tools/distcp >Affects Versions: 0.21.0 >Reporter: dhruba borthakur >Assignee: Yongjun Zhang > Attachments: HADOOP-11794.001.patch, HADOOP-11794.002.patch, > HADOOP-11794.003.patch, HADOOP-11794.004.patch, HADOOP-11794.005.patch, > HADOOP-11794.006.patch, HADOOP-11794.007.patch, HADOOP-11794.008.patch, > MAPREDUCE-2257.patch