[jira] [Commented] (HADOOP-13114) DistCp should have option to compress data on write
[ https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333079#comment-15333079 ] Suraj Nayak commented on HADOOP-13114: -- [~raviprak] : Any improvements/suggestions/review on this patch ? > DistCp should have option to compress data on write > --- > > Key: HADOOP-13114 > URL: https://issues.apache.org/jira/browse/HADOOP-13114 > Project: Hadoop Common > Issue Type: Improvement > Components: tools/distcp >Affects Versions: 2.8.0, 2.7.3, 3.0.0-alpha1 >Reporter: Suraj Nayak >Assignee: Suraj Nayak >Priority: Minor > Labels: distcp > Attachments: HADOOP-13114-trunk_2016-05-07-1.patch, > HADOOP-13114-trunk_2016-05-08-1.patch, HADOOP-13114-trunk_2016-05-10-1.patch, > HADOOP-13114-trunk_2016-05-12-1.patch > > Original Estimate: 48h > Remaining Estimate: 48h > > DistCp utility should have capability to store data in user specified > compression format. This avoids one hop of compressing data after transfer. > Backup strategies to different cluster also get benefit of saving one IO > operation to and from HDFS, thus saving resources, time and effort. > * Create an option -compressOutput defaulting to > {{org.apache.hadoop.io.compress.BZip2Codec}}. > * Users will be able to change codec with {{-D > mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec}} > * If distcp compression is enabled, suffix the filenames with default codec > extension to indicate the file is compressed. Thus users can be aware of what > codec was used to compress the data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
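The codec selection and file suffixing proposed in the description can be sketched roughly as below. This is only an illustration, not code from any attached patch: it uses plain {{java.util.Properties}} in place of a Hadoop {{Configuration}}, and the {{EXTENSIONS}} table is a hypothetical stand-in for asking the codec for its default extension via {{CompressionCodec#getDefaultExtension()}}. The property name and the BZip2 default come from the description; the class and method names are made up for the example.

```java
import java.util.Map;
import java.util.Properties;

/** Sketch: pick the output codec from configuration and derive the target file name. */
final class CodecSuffixSketch {
    // Property name and default codec as given in the issue description.
    static final String CODEC_KEY = "mapreduce.output.fileoutputformat.compress.codec";
    static final String DEFAULT_CODEC = "org.apache.hadoop.io.compress.BZip2Codec";

    // Hypothetical lookup table; a real implementation would ask the codec itself.
    static final Map<String, String> EXTENSIONS = Map.of(
            "org.apache.hadoop.io.compress.BZip2Codec", ".bz2",
            "org.apache.hadoop.io.compress.GzipCodec", ".gz");

    /** Returns the configured codec class name, falling back to the proposed default. */
    static String codecClass(Properties conf) {
        return conf.getProperty(CODEC_KEY, DEFAULT_CODEC);
    }

    /** Appends the codec's extension so readers can tell how the file was compressed. */
    static String targetName(String sourceName, Properties conf) {
        String codec = codecClass(conf);
        String extension = EXTENSIONS.get(codec);
        if (extension == null) {
            throw new IllegalArgumentException("No known extension for codec " + codec);
        }
        return sourceName + extension;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(targetName("part-00000", conf));
        // Equivalent of passing -D mapreduce.output.fileoutputformat.compress.codec=...GzipCodec
        conf.setProperty(CODEC_KEY, "org.apache.hadoop.io.compress.GzipCodec");
        System.out.println(targetName("part-00000", conf));
    }
}
```

Keeping the default in one place means a user who passes no {{-D}} override still gets a predictable suffix on every copied file.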
[jira] [Updated] (HADOOP-13114) DistCp should have option to compress data on write
[ https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suraj Nayak updated HADOOP-13114: - Component/s: tools/distcp
[jira] [Updated] (HADOOP-13114) DistCp should have option to compress data on write
[ https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suraj Nayak updated HADOOP-13114: - Affects Version/s: 2.7.3 2.8.0 Target Version/s: (was: ) Fix Version/s: (was: 3.0.0-alpha1)
[jira] [Commented] (HADOOP-13114) DistCp should have option to compress data on write
[ https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15283079#comment-15283079 ]

Suraj Nayak commented on HADOOP-13114:
--

JIRA was not accepting comments when I uploaded the latest patch with {{CodecPool}} changes. Adding the details of Jenkins build here with this comment :

Jenkins Console output Link : [https://builds.apache.org/job/PreCommit-HADOOP-Build/9414/console]

Jenkins output :

+1 overall

| Vote | Subsystem  | Runtime | Comment |
| 0    | reexec     | 0m 13s  | Docker mode activated. |
| +1   | @author    | 0m 0s   | The patch does not contain any @author tags. |
| +1   | test4tests | 0m 0s   | The patch appears to include 2 new or modified test files. |
| +1   | mvninstall | 7m 1s   | trunk passed |
| +1   | compile    | 0m 14s  | trunk passed with JDK v1.8.0_91 |
| +1   | compile    | 0m 17s  | trunk passed with JDK v1.7.0_95 |
| +1   | checkstyle | 0m 17s  | trunk passed |
| +1   | mvnsite    | 0m 22s  | trunk passed |
| +1   | mvneclipse | 0m 15s  | trunk passed |
| +1   | findbugs   | 0m 28s  | trunk passed |
| +1   | javadoc    | 0m 12s  | trunk passed with JDK v1.8.0_91 |
| +1   | javadoc    | 0m 15s  | trunk passed with JDK v1.7.0_95 |
| +1   | mvninstall | 0m 17s  | the patch passed |
| +1   | compile    | 0m 13s  | the patch passed with JDK v1.8.0_91 |
| +1   | javac      | 0m 13s  | the patch passed |
| +1   | compile    | 0m 15s  | the patch passed with JDK v1.7.0_95 |
| +1   | javac      | 0m 15s  | the patch passed |
| +1   | checkstyle | 0m 14s  | the patch passed |
| +1   | mvnsite    | 0m 20s  | the patch passed |
| +1   | mvneclipse | 0m 11s  | the patch passed |
| +1   | whitespace | 0m 0s   | The patch has no whitespace issues. |
| +1   | findbugs   | 0m 36s  | the patch passed |
| +1   | javadoc    | 0m 10s  | the patch passed with JDK v1.8.0_91 |
| +1   | javadoc    | 0m 12s  | the patch passed with JDK v1.7.0_95 |
| +1   | unit       | 8m 40s  | hadoop-distcp in the patch passed with JDK v1.8.0_91. |
| +1   | unit       | 7m 55s  | hadoop-distcp in the patch passed with JDK v1.7.0_95. |
| +1   | asflicense | 0m 17s  | The patch does not generate ASF License warnings. |
|      |            | 29m 51s | |

|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:cf2ee45 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12803827/HADOOP-13114-trunk_2016-05-12-1.patch |
| JIRA Issue | HADOOP-13114 |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
| uname | Linux 62e2be2ea3c4 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / fa440a3 |
| Default Java | 1.7.0_95 |
| Multi-JDK versions | /usr/lib/jvm/java-8-oracle:1.8.0_91 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95 |
| findbugs | v3.0.0 |
| JDK v1.7.0_95 Test Results | https://builds.apache.org/job/PreCommit-HADOOP-Build/9414/testReport/ |
| modules | C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp |
| Console output | https://builds.apache.org/job/PreCommit-HADOOP-Build/9414/console |
| Powered by | Apache Yetus 0.3.0-SNAPSHOT http://yetus.apache.org |

> DistCp should have option to compress data on write
> ---
>
> Key: HADOOP-13114
> URL: https://issues.apache.org/jira/browse/HADOOP-13114
> Project: Hadoop Common
> Issue Type: Improvement
> Affects Versions: 3.0.0-alpha1
> Reporter: Suraj Nayak
> Assignee: Suraj Nayak
> Priority: Minor
> Labels: distcp
> Fix For: 3.0.0-alpha1
>
> Attachments: HADOOP-13114-trunk_2016-05-07-1.patch,
> HADOOP-13114-trunk_2016-05-08-1.patch, HADOOP-13114-trunk_2016-05-10-1.patch,
> HADOOP-13114-trunk_2016-05-12-1.patch
>
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> DistCp utility should have capability to store data in user specified
> compression format. This avoids one hop of compressing data after transfer.
> Backup strategies to different cluster also get benefit of saving one IO
> operation to and from HDFS, thus saving resources, time and effort.
> * Create an option -compressOutput
[jira] [Updated] (HADOOP-13114) DistCp should have option to compress data on write
[ https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suraj Nayak updated HADOOP-13114: - Attachment: HADOOP-13114-trunk_2016-05-12-1.patch
[jira] [Updated] (HADOOP-13114) DistCp should have option to compress data on write
[ https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suraj Nayak updated HADOOP-13114: - Attachment: (was: HADOOP-13114-trunk_2016-05-12-1.patch)
[jira] [Updated] (HADOOP-13114) DistCp should have option to compress data on write
[ https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suraj Nayak updated HADOOP-13114: - Attachment: HADOOP-13114-trunk_2016-05-12-1.patch
[jira] [Updated] (HADOOP-13114) DistCp should have option to compress data on write
[ https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suraj Nayak updated HADOOP-13114: - Attachment: HADOOP-13114-trunk_2016-05-10-1.patch Fixed the directory rename issue. Only the files will be renamed with compression suffix. Attaching new patch. > DistCp should have option to compress data on write > --- > > Key: HADOOP-13114 > URL: https://issues.apache.org/jira/browse/HADOOP-13114 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 3.0.0 >Reporter: Suraj Nayak >Assignee: Suraj Nayak >Priority: Minor > Labels: distcp > Fix For: 3.0.0 > > Attachments: HADOOP-13114-trunk_2016-05-07-1.patch, > HADOOP-13114-trunk_2016-05-08-1.patch, HADOOP-13114-trunk_2016-05-10-1.patch > > Original Estimate: 48h > Remaining Estimate: 48h
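The rule described in this fix (suffix files, leave directories alone) could look roughly like the sketch below. It is not taken from the patch: it works on {{java.nio.file.Path}} rather than Hadoop's {{org.apache.hadoop.fs.Path}}, and {{withCodecSuffix}} is a hypothetical helper name chosen for the example.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Sketch: only regular files receive the codec extension; directories keep their names. */
final class SuffixFilesOnlySketch {

    /**
     * Returns the name to use on the target side.
     *
     * @param target      path the entry would have without compression
     * @param isDirectory whether the source entry is a directory
     * @param extension   codec extension including the dot, for example ".gz"
     */
    static Path withCodecSuffix(Path target, boolean isDirectory, String extension) {
        if (isDirectory) {
            // Renaming a directory would move every child under a different parent.
            return target;
        }
        return target.resolveSibling(target.getFileName().toString() + extension);
    }

    public static void main(String[] args) {
        System.out.println(withCodecSuffix(Paths.get("backup", "logs"), true, ".gz"));
        System.out.println(withCodecSuffix(Paths.get("backup", "logs", "app.log"), false, ".gz"));
    }
}
```

Because the suffix is applied to the last path element only, the directory layout under the target root stays identical to the source.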
[jira] [Commented] (HADOOP-13114) DistCp should have option to compress data on write
[ https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15277629#comment-15277629 ] Suraj Nayak commented on HADOOP-13114: -- With the uploaded patch [HADOOP-13114-trunk_2016-05-08-1.patch|https://issues.apache.org/jira/secure/attachment/12802907/HADOOP-13114-trunk_2016-05-08-1.patch] there is an issue with directory naming. The change was intended to rename files (append the codec file extension), but the patch renames the directory itself instead of the files in it. Working on a patch to fix it.
[jira] [Updated] (HADOOP-13114) DistCp should have option to compress data on write
[ https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suraj Nayak updated HADOOP-13114: - Attachment: HADOOP-13114-trunk_2016-05-08-1.patch
[jira] [Updated] (HADOOP-13114) DistCp should have option to compress data on write
[ https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suraj Nayak updated HADOOP-13114: - Affects Version/s: 3.0.0 Status: Patch Available (was: Open)
[jira] [Updated] (HADOOP-13114) DistCp should have option to compress data on write
[ https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suraj Nayak updated HADOOP-13114: - Attachment: HADOOP-13114-trunk_2016-05-07-1.patch * Added {{-compressoutput}} option. * JUnit TestCases for output compression test and option parsing * Created helper method {{getCodec}} which sets codec only once -> Needs review.
[jira] [Commented] (HADOOP-13114) DistCp should have option to compress data on write
[ https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275375#comment-15275375 ] Suraj Nayak commented on HADOOP-13114: -- [~raviprak] : Regarding your [comment|https://issues.apache.org/jira/browse/HADOOP-8065?focusedCommentId=15269857=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15269857] on reusing the codec instead of creating a new one for each file, here are my thoughts and questions: * {{org.apache.hadoop.io.compress.CompressionCodec.Util}} is a static nested class that contains the {{createOutputStreamWithCodecPool}} method. Do you think it is a good idea to make the class and method public? * I thought of copying the {{createOutputStreamWithCodecPool}} code into {{DistCpUtils}}, but that would result in code duplication. What would you suggest for making this code reusable?
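The idea behind reusing a compressor instead of creating one per file can be illustrated with a small pool, sketched below. This is an analogy built on the JDK's {{java.util.zip.Deflater}}, not Hadoop's {{CodecPool}} and not the code in the patch; the class {{CompressorPoolSketch}} and its methods are hypothetical names used only here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

/** Sketch: borrow a compressor per file and hand it back, rather than allocating a new one. */
final class CompressorPoolSketch {
    private final Deque<Deflater> idle = new ArrayDeque<>();
    private int created; // number of Deflater instances actually allocated

    synchronized Deflater borrow() {
        Deflater deflater = idle.poll();
        if (deflater == null) {
            deflater = new Deflater();
            created++;
        }
        return deflater;
    }

    synchronized void giveBack(Deflater deflater) {
        deflater.reset(); // clear state so the next file starts a fresh stream
        idle.push(deflater);
    }

    synchronized int created() {
        return created;
    }

    /** Compresses one file's bytes with a pooled compressor. */
    byte[] compress(byte[] data) {
        Deflater deflater = borrow();
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            // A caller-supplied Deflater is not ended when the stream closes, so it can be reused.
            try (DeflaterOutputStream out = new DeflaterOutputStream(sink, deflater)) {
                out.write(data);
            }
            return sink.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            giveBack(deflater);
        }
    }

    public static void main(String[] args) {
        CompressorPoolSketch pool = new CompressorPoolSketch();
        for (String name : new String[] {"a.txt", "b.txt", "c.txt"}) {
            byte[] packed = pool.compress(("contents of " + name).getBytes(StandardCharsets.UTF_8));
            System.out.println(name + " -> " + packed.length + " bytes");
        }
        System.out.println("compressors allocated: " + pool.created());
    }
}
```

In Hadoop itself the analogous calls are {{CodecPool.getCompressor(codec)}} and {{CodecPool.returnCompressor(compressor)}}; whether DistCp should call them directly or go through a shared helper is exactly the open question in the comment above.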
[jira] [Updated] (HADOOP-13114) DistCp should have option to compress data on write
[ https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suraj Nayak updated HADOOP-13114: - Labels: distcp (was: )
[jira] [Commented] (HADOOP-8065) distcp should have an option to compress data while copying.
[ https://issues.apache.org/jira/browse/HADOOP-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275173#comment-15275173 ] Suraj Nayak commented on HADOOP-8065: - Thanks [~raviprak]. I have created JIRA [HADOOP-13114|https://issues.apache.org/jira/browse/HADOOP-13114] and added you as a watcher. On your comment about {{codec}}, you are right: I was in the middle of extracting the default codec extension that needs to be appended to the end of the file name. Will upload the patch once my local build gives +1. > distcp should have an option to compress data while copying. > > > Key: HADOOP-8065 > URL: https://issues.apache.org/jira/browse/HADOOP-8065 > Project: Hadoop Common > Issue Type: Improvement > Components: fs >Affects Versions: 0.20.2 >Reporter: Suresh Antony >Assignee: Suraj Nayak >Priority: Minor > Labels: distcp > Fix For: 0.20.2 > > Attachments: HADOOP-8065-trunk_2015-11-03.patch, > HADOOP-8065-trunk_2015-11-04.patch, HADOOP-8065-trunk_2016-04-29-4.patch, > patch.distcp.2012-02-10 > > > We would like to compress the data while transferring it from our source system to > the target system. One way to do this is to write a map/reduce job that compresses > the data before or after the transfer. This looks inefficient. > Since distcp is already reading and writing the data, it would be better if it could > compress while doing so. > The flip side is that the distcp -update option cannot check the file size > before copying data. It can only check for the existence of the file. > So I propose that if the -compress option is given, the file size is not checked. > Also, when we copy a file, the appropriate extension needs to be added to the file name > depending on the compression type.
[jira] [Updated] (HADOOP-13114) DistCp should have option to compress data on write
[ https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suraj Nayak updated HADOOP-13114: - Description: DistCp utility should have capability to store data in user specified compression format. This avoids one hop of compressing data after transfer. Backup strategies to different cluster also get benefit of saving one IO operation to and from HDFS, thus saving resources, time and effort. * Create an option -compressOutput defaulting to {{org.apache.hadoop.io.compress.BZip2Codec}}. * Users will be able to change codec with {{-D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec}} * If distcp compression is enabled, suffix the filenames with default codec extension to indicate the file is compressed. Thus users can be aware of what codec was used to compress the data. was: DistCp utility should have capability to store data in user specified compression format. This avoids one hop of compressing data after transfer. Backup strategies to different cluster also get benefit of saving one IO operation to and from HDFS, thus saving resources, time and effort. * Create an option -compressOutput defaulting to {{org.apache.hadoop.io.compress.BZip2Codec}}. * Users will be able to change codec with {{-D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec}} * If distcp compression is enabled, suffix the filenames with default codec extension to indicate the file is compressed. Thus users can be aware of what codec was used to compress the data. This JIRA is similar to [HADOOP-8065|https://issues.apache.org/jira/browse/HADOOP-8065]. [HADOOP-8065|https://issues.apache.org/jira/browse/HADOOP-8065] aims to compress data *during transit* which is a huge effort. This JIRA is simplified to enable the user to compress data when the data lands on target filesystem.
[jira] [Commented] (HADOOP-13114) DistCp should have option to compress data on write
[ https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275171#comment-15275171 ]

Suraj Nayak commented on HADOOP-13114:

This JIRA is similar to HADOOP-8065. HADOOP-8065 aims to compress data during transit, which is a huge effort. This JIRA is simplified to enable the user to compress data when the data lands on the target filesystem.
[jira] [Updated] (HADOOP-13114) DistCp should have option to compress data on write
[ https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suraj Nayak updated HADOOP-13114:

    Description:
The DistCp utility should be able to store data in a user-specified compression format. This avoids a separate hop to compress the data after transfer. Backup strategies that copy to a different cluster also benefit by saving one IO operation to and from HDFS, which saves resources, time and effort.

* Create an option -compressOutput defaulting to {{org.apache.hadoop.io.compress.BZip2Codec}}.
* Users will be able to change the codec with {{-D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec}}
* If distcp compression is enabled, suffix the filenames with the default codec extension to indicate that the file is compressed, so users know which codec was used to compress the data.

This JIRA is similar to [HADOOP-8065|https://issues.apache.org/jira/browse/HADOOP-8065]. [HADOOP-8065|https://issues.apache.org/jira/browse/HADOOP-8065] aims to compress data *during transit*, which is a huge effort. This JIRA is simplified to enable the user to compress data when the data lands on the target filesystem.

  was:
The DistCp utility should be able to store data in a user-specified compressed format. This avoids a separate hop to compress the data after transfer. Backup strategies that copy to a different cluster benefit by saving one IO operation, time and effort.

* Create an option -compressOutput defaulting to {{org.apache.avro.file.BZip2Codec}}.
* Users will be able to change the codec with {{-D mapreduce.output.fileoutputformat.compress.codec=org.apache.avro.file.SnappyCodec}}
* If distcp compression is enabled, suffix the filenames with the default codec extension to indicate that the file is compressed, so users know which codec was used to compress the data.

This JIRA is similar to [HADOOP-8065|https://issues.apache.org/jira/browse/HADOOP-8065]. [HADOOP-8065|https://issues.apache.org/jira/browse/HADOOP-8065] aims to compress data *during transit*, which is a huge effort. This JIRA is simplified to enable the user to compress data when the data lands on the target filesystem.
[jira] [Created] (HADOOP-13114) DistCp should have option to compress data on write
Suraj Nayak created HADOOP-13114:

    Summary: DistCp should have option to compress data on write
    Key: HADOOP-13114
    URL: https://issues.apache.org/jira/browse/HADOOP-13114
    Project: Hadoop Common
    Issue Type: Improvement
    Reporter: Suraj Nayak
    Assignee: Suraj Nayak
    Priority: Minor
    Fix For: 3.0.0

The DistCp utility should be able to store data in a user-specified compressed format. This avoids a separate hop to compress the data after transfer. Backup strategies that copy to a different cluster benefit by saving one IO operation, time and effort.

* Create an option -compressOutput defaulting to {{org.apache.avro.file.BZip2Codec}}.
* Users will be able to change the codec with {{-D mapreduce.output.fileoutputformat.compress.codec=org.apache.avro.file.SnappyCodec}}
* If distcp compression is enabled, suffix the filenames with the default codec extension to indicate that the file is compressed, so users know which codec was used to compress the data.

This JIRA is similar to [HADOOP-8065|https://issues.apache.org/jira/browse/HADOOP-8065]. [HADOOP-8065|https://issues.apache.org/jira/browse/HADOOP-8065] aims to compress data *during transit*, which is a huge effort. This JIRA is simplified to enable the user to compress data when the data lands on the target filesystem.
[jira] [Commented] (HADOOP-8065) distcp should have an option to compress data while copying.
[ https://issues.apache.org/jira/browse/HADOOP-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272016#comment-15272016 ]

Suraj Nayak commented on HADOOP-8065:

[~raviprak]: It would be really helpful if you could give me some hints on how to implement the compression *during transit*. Is it after {{context.write()}} or before?

> distcp should have an option to compress data while copying.
>
> Key: HADOOP-8065
> URL: https://issues.apache.org/jira/browse/HADOOP-8065
> Project: Hadoop Common
> Issue Type: Improvement
> Components: fs
> Affects Versions: 0.20.2
> Reporter: Suresh Antony
> Assignee: Suraj Nayak
> Priority: Minor
> Labels: distcp
> Fix For: 0.20.2
>
> Attachments: HADOOP-8065-trunk_2015-11-03.patch,
> HADOOP-8065-trunk_2015-11-04.patch, HADOOP-8065-trunk_2016-04-29-4.patch,
> patch.distcp.2012-02-10
>
> We would like to compress the data while transferring it from our source
> system to the target system. One way to do this is to write a map/reduce job
> that compresses the data before or after it is transferred. This looks inefficient.
> Since distcp is already reading and writing the data, it would be better if it
> could compress the data while doing so.
> The flip side is that the distcp -update option cannot check the file size
> before copying data. It can only check for the existence of the file.
> So I propose that if the -compress option is given, the file size is not checked.
> Also, when we copy a file, the appropriate extension needs to be added to the
> file name, depending on the compression type.
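The -update behaviour proposed in the quoted description can be sketched as a small decision function. This is a hedged illustration only, not DistCp's actual logic: it models just the size and existence checks mentioned in the issue, and `FileStatus`, `target_name` and `should_skip` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileStatus:
    """Hypothetical stand-in for a file listing entry (name and length in bytes)."""
    name: str
    length: int


def target_name(source_name: str, extension: Optional[str]) -> str:
    """Append the codec extension (for example ".gz") when compression is on."""
    return source_name + extension if extension else source_name


def should_skip(source: FileStatus, target: Optional[FileStatus], compress: bool) -> bool:
    """Decide whether an -update style copy can skip this file."""
    if target is None:
        return False  # nothing at the destination yet, so the file must be copied
    if compress:
        # A compressed target's length is not comparable to the source length,
        # so existence is the only check that remains, as the issue proposes.
        return True
    # Without compression, equal lengths are treated as "already copied".
    return source.length == target.length
```

With compression on, an existing `part-00000.gz` of any length causes `part-00000` to be skipped, which is exactly the trade-off the reporter calls the flip side: a stale or truncated compressed target would not be detected by this check.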