[jira] [Commented] (HADOOP-13114) DistCp should have option to compress data on write

2016-06-15 Thread Suraj Nayak (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333079#comment-15333079
 ] 

Suraj Nayak commented on HADOOP-13114:
--

[~raviprak]: Any improvements, suggestions, or review comments on this patch?

> DistCp should have option to compress data on write
> ---
>
> Key: HADOOP-13114
> URL: https://issues.apache.org/jira/browse/HADOOP-13114
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 2.8.0, 2.7.3, 3.0.0-alpha1
>Reporter: Suraj Nayak
>Assignee: Suraj Nayak
>Priority: Minor
>  Labels: distcp
> Attachments: HADOOP-13114-trunk_2016-05-07-1.patch, 
> HADOOP-13114-trunk_2016-05-08-1.patch, HADOOP-13114-trunk_2016-05-10-1.patch, 
> HADOOP-13114-trunk_2016-05-12-1.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The DistCp utility should be able to store data in a user-specified 
> compression format. This avoids the extra hop of compressing data after 
> transfer. Backup strategies that copy to a different cluster also benefit 
> from saving one I/O round trip to and from HDFS, which saves resources, time 
> and effort.
> * Create an option -compressOutput defaulting to 
> {{org.apache.hadoop.io.compress.BZip2Codec}}. 
> * Users will be able to change the codec with {{-D 
> mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec}}
> * If distcp compression is enabled, suffix the filenames with the codec's 
> default extension to indicate that the file is compressed, so users can tell 
> which codec was used to compress the data.
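
The three bullets above amount to a small piece of naming logic. What follows is a minimal sketch in plain Java, not the code from the attached patches: the class {{CodecSuffixSketch}}, its method names, and the use of a plain {{Map}} in place of Hadoop's {{Configuration}} are assumptions made for illustration. The extension table covers only the two codecs named in the description ({{.bz2}} and {{.gz}} are the default extensions of those Hadoop codecs).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; a plain Map stands in for Hadoop's Configuration.
class CodecSuffixSketch {
    static final String CODEC_KEY = "mapreduce.output.fileoutputformat.compress.codec";
    static final String DEFAULT_CODEC = "org.apache.hadoop.io.compress.BZip2Codec";

    // Default file extensions for the two codecs mentioned in the issue.
    private static final Map<String, String> EXTENSIONS = new HashMap<>();
    static {
        EXTENSIONS.put("org.apache.hadoop.io.compress.BZip2Codec", ".bz2");
        EXTENSIONS.put("org.apache.hadoop.io.compress.GzipCodec", ".gz");
    }

    // The codec comes from the -D override when present, else the default.
    static String resolveCodec(Map<String, String> conf) {
        return conf.getOrDefault(CODEC_KEY, DEFAULT_CODEC);
    }

    // Appends the codec extension only when output compression is enabled.
    static String targetName(String sourceName, boolean compressOutput,
                             Map<String, String> conf) {
        if (!compressOutput) {
            return sourceName;
        }
        String codec = resolveCodec(conf);
        String extension = EXTENSIONS.get(codec);
        if (extension == null) {
            throw new IllegalArgumentException("Unknown codec: " + codec);
        }
        return sourceName + extension;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(targetName("part-00000", true, conf)); // part-00000.bz2
        conf.put(CODEC_KEY, "org.apache.hadoop.io.compress.GzipCodec");
        System.out.println(targetName("part-00000", true, conf)); // part-00000.gz
    }
}
```

In a real implementation the extension would presumably be obtained from the codec instance itself rather than from a hard-coded table.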



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13114) DistCp should have option to compress data on write

2016-05-25 Thread Suraj Nayak (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suraj Nayak updated HADOOP-13114:
-
Component/s: tools/distcp

> DistCp should have option to compress data on write
> ---
>
> Key: HADOOP-13114
> URL: https://issues.apache.org/jira/browse/HADOOP-13114
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 2.8.0, 2.7.3, 3.0.0-alpha1
>Reporter: Suraj Nayak
>Assignee: Suraj Nayak
>Priority: Minor
>  Labels: distcp
> Attachments: HADOOP-13114-trunk_2016-05-07-1.patch, 
> HADOOP-13114-trunk_2016-05-08-1.patch, HADOOP-13114-trunk_2016-05-10-1.patch, 
> HADOOP-13114-trunk_2016-05-12-1.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h



[jira] [Updated] (HADOOP-13114) DistCp should have option to compress data on write

2016-05-21 Thread Suraj Nayak (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suraj Nayak updated HADOOP-13114:
-
Affects Version/s: 2.7.3, 2.8.0
 Target Version/s:   (was: )
    Fix Version/s:   (was: 3.0.0-alpha1)

> DistCp should have option to compress data on write
> ---
>
> Key: HADOOP-13114
> URL: https://issues.apache.org/jira/browse/HADOOP-13114
> Project: Hadoop Common
>  Issue Type: Improvement
>Affects Versions: 2.8.0, 2.7.3, 3.0.0-alpha1
>Reporter: Suraj Nayak
>Assignee: Suraj Nayak
>Priority: Minor
>  Labels: distcp
> Attachments: HADOOP-13114-trunk_2016-05-07-1.patch, 
> HADOOP-13114-trunk_2016-05-08-1.patch, HADOOP-13114-trunk_2016-05-10-1.patch, 
> HADOOP-13114-trunk_2016-05-12-1.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h



[jira] [Commented] (HADOOP-13114) DistCp should have option to compress data on write

2016-05-13 Thread Suraj Nayak (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15283079#comment-15283079
 ] 

Suraj Nayak commented on HADOOP-13114:
--

JIRA was not accepting comments when I uploaded the latest patch with the 
{{CodecPool}} changes, so I am adding the details of the Jenkins build in this 
comment.
Jenkins console output link: 
[https://builds.apache.org/job/PreCommit-HADOOP-Build/9414/console]
Jenkins output:



+1 overall

| Vote | Subsystem  | Runtime | Comment |
|   0  | reexec     | 0m 13s  | Docker mode activated. |
|  +1  | @author    | 0m 0s   | The patch does not contain any @author tags. |
|  +1  | test4tests | 0m 0s   | The patch appears to include 2 new or modified test files. |
|  +1  | mvninstall | 7m 1s   | trunk passed |
|  +1  | compile    | 0m 14s  | trunk passed with JDK v1.8.0_91 |
|  +1  | compile    | 0m 17s  | trunk passed with JDK v1.7.0_95 |
|  +1  | checkstyle | 0m 17s  | trunk passed |
|  +1  | mvnsite    | 0m 22s  | trunk passed |
|  +1  | mvneclipse | 0m 15s  | trunk passed |
|  +1  | findbugs   | 0m 28s  | trunk passed |
|  +1  | javadoc    | 0m 12s  | trunk passed with JDK v1.8.0_91 |
|  +1  | javadoc    | 0m 15s  | trunk passed with JDK v1.7.0_95 |
|  +1  | mvninstall | 0m 17s  | the patch passed |
|  +1  | compile    | 0m 13s  | the patch passed with JDK v1.8.0_91 |
|  +1  | javac      | 0m 13s  | the patch passed |
|  +1  | compile    | 0m 15s  | the patch passed with JDK v1.7.0_95 |
|  +1  | javac      | 0m 15s  | the patch passed |
|  +1  | checkstyle | 0m 14s  | the patch passed |
|  +1  | mvnsite    | 0m 20s  | the patch passed |
|  +1  | mvneclipse | 0m 11s  | the patch passed |
|  +1  | whitespace | 0m 0s   | The patch has no whitespace issues. |
|  +1  | findbugs   | 0m 36s  | the patch passed |
|  +1  | javadoc    | 0m 10s  | the patch passed with JDK v1.8.0_91 |
|  +1  | javadoc    | 0m 12s  | the patch passed with JDK v1.7.0_95 |
|  +1  | unit       | 8m 40s  | hadoop-distcp in the patch passed with JDK v1.8.0_91. |
|  +1  | unit       | 7m 55s  | hadoop-distcp in the patch passed with JDK v1.7.0_95. |
|  +1  | asflicense | 0m 17s  | The patch does not generate ASF License warnings. |
|      |            | 29m 51s | |


|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:cf2ee45 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12803827/HADOOP-13114-trunk_2016-05-12-1.patch |
| JIRA Issue | HADOOP-13114 |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
| uname | Linux 62e2be2ea3c4 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / fa440a3 |
| Default Java | 1.7.0_95 |
| Multi-JDK versions | /usr/lib/jvm/java-8-oracle:1.8.0_91 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95 |
| findbugs | v3.0.0 |
| JDK v1.7.0_95 Test Results | https://builds.apache.org/job/PreCommit-HADOOP-Build/9414/testReport/ |
| modules | C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp |
| Console output | https://builds.apache.org/job/PreCommit-HADOOP-Build/9414/console |
| Powered by | Apache Yetus 0.3.0-SNAPSHOT   http://yetus.apache.org |

> DistCp should have option to compress data on write
> ---
>
> Key: HADOOP-13114
> URL: https://issues.apache.org/jira/browse/HADOOP-13114
> Project: Hadoop Common
>  Issue Type: Improvement
>Affects Versions: 3.0.0-alpha1
>Reporter: Suraj Nayak
>Assignee: Suraj Nayak
>Priority: Minor
>  Labels: distcp
> Fix For: 3.0.0-alpha1
>
> Attachments: HADOOP-13114-trunk_2016-05-07-1.patch, 
> HADOOP-13114-trunk_2016-05-08-1.patch, HADOOP-13114-trunk_2016-05-10-1.patch, 
> HADOOP-13114-trunk_2016-05-12-1.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h

[jira] [Updated] (HADOOP-13114) DistCp should have option to compress data on write

2016-05-13 Thread Suraj Nayak (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suraj Nayak updated HADOOP-13114:
-
Attachment: HADOOP-13114-trunk_2016-05-12-1.patch

> DistCp should have option to compress data on write
> ---
>
> Key: HADOOP-13114
> URL: https://issues.apache.org/jira/browse/HADOOP-13114
> Project: Hadoop Common
>  Issue Type: Improvement
>Affects Versions: 3.0.0-alpha1
>Reporter: Suraj Nayak
>Assignee: Suraj Nayak
>Priority: Minor
>  Labels: distcp
> Fix For: 3.0.0-alpha1
>
> Attachments: HADOOP-13114-trunk_2016-05-07-1.patch, 
> HADOOP-13114-trunk_2016-05-08-1.patch, HADOOP-13114-trunk_2016-05-10-1.patch, 
> HADOOP-13114-trunk_2016-05-12-1.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h



[jira] [Updated] (HADOOP-13114) DistCp should have option to compress data on write

2016-05-13 Thread Suraj Nayak (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suraj Nayak updated HADOOP-13114:
-
Attachment: (was: HADOOP-13114-trunk_2016-05-12-1.patch)

> DistCp should have option to compress data on write
> ---
>
> Key: HADOOP-13114
> URL: https://issues.apache.org/jira/browse/HADOOP-13114
> Project: Hadoop Common
>  Issue Type: Improvement
>Affects Versions: 3.0.0-alpha1
>Reporter: Suraj Nayak
>Assignee: Suraj Nayak
>Priority: Minor
>  Labels: distcp
> Fix For: 3.0.0-alpha1
>
> Attachments: HADOOP-13114-trunk_2016-05-07-1.patch, 
> HADOOP-13114-trunk_2016-05-08-1.patch, HADOOP-13114-trunk_2016-05-10-1.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h



[jira] [Updated] (HADOOP-13114) DistCp should have option to compress data on write

2016-05-12 Thread Suraj Nayak (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suraj Nayak updated HADOOP-13114:
-
Attachment: HADOOP-13114-trunk_2016-05-12-1.patch

> DistCp should have option to compress data on write
> ---
>
> Key: HADOOP-13114
> URL: https://issues.apache.org/jira/browse/HADOOP-13114
> Project: Hadoop Common
>  Issue Type: Improvement
>Affects Versions: 3.0.0-alpha1
>Reporter: Suraj Nayak
>Assignee: Suraj Nayak
>Priority: Minor
>  Labels: distcp
> Fix For: 3.0.0-alpha1
>
> Attachments: HADOOP-13114-trunk_2016-05-07-1.patch, 
> HADOOP-13114-trunk_2016-05-08-1.patch, HADOOP-13114-trunk_2016-05-10-1.patch, 
> HADOOP-13114-trunk_2016-05-12-1.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h



[jira] [Updated] (HADOOP-13114) DistCp should have option to compress data on write

2016-05-10 Thread Suraj Nayak (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suraj Nayak updated HADOOP-13114:
-
Attachment: HADOOP-13114-trunk_2016-05-10-1.patch

Fixed the directory rename issue: only files are now renamed with the 
compression suffix, and directory names are left unchanged. Attaching a new patch.
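
The intended behaviour after this fix can be illustrated with a short sketch. It is not taken from the patch: the class {{FileOnlySuffixSketch}} and the method {{targetPath}} are hypothetical names, and the extension is passed in as a plain string.

```java
// Hypothetical illustration of "suffix files, never directories".
class FileOnlySuffixSketch {
    // Returns the target path for one entry of the copy listing.
    static String targetPath(String relativePath, boolean isDirectory,
                             String codecExtension) {
        if (isDirectory) {
            // Directories keep their name so the tree layout is preserved.
            return relativePath;
        }
        // Only regular files receive the compression suffix.
        return relativePath + codecExtension;
    }

    public static void main(String[] args) {
        System.out.println(targetPath("logs/2016", true, ".bz2"));
        System.out.println(targetPath("logs/2016/part-00000", false, ".bz2"));
    }
}
```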

> DistCp should have option to compress data on write
> ---
>
> Key: HADOOP-13114
> URL: https://issues.apache.org/jira/browse/HADOOP-13114
> Project: Hadoop Common
>  Issue Type: Improvement
>Affects Versions: 3.0.0
>Reporter: Suraj Nayak
>Assignee: Suraj Nayak
>Priority: Minor
>  Labels: distcp
> Fix For: 3.0.0
>
> Attachments: HADOOP-13114-trunk_2016-05-07-1.patch, 
> HADOOP-13114-trunk_2016-05-08-1.patch, HADOOP-13114-trunk_2016-05-10-1.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h



[jira] [Commented] (HADOOP-13114) DistCp should have option to compress data on write

2016-05-09 Thread Suraj Nayak (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15277629#comment-15277629
 ] 

Suraj Nayak commented on HADOOP-13114:
--

With the uploaded patch 
[HADOOP-13114-trunk_2016-05-08-1.patch|https://issues.apache.org/jira/secure/attachment/12802907/HADOOP-13114-trunk_2016-05-08-1.patch]
 there is an issue with directory naming. The change was intended to alter the 
file name (append the codec file extension), but the patch changes the 
directory name instead of the file names. Working on a patch to fix it.

> DistCp should have option to compress data on write
> ---
>
> Key: HADOOP-13114
> URL: https://issues.apache.org/jira/browse/HADOOP-13114
> Project: Hadoop Common
>  Issue Type: Improvement
>Affects Versions: 3.0.0
>Reporter: Suraj Nayak
>Assignee: Suraj Nayak
>Priority: Minor
>  Labels: distcp
> Fix For: 3.0.0
>
> Attachments: HADOOP-13114-trunk_2016-05-07-1.patch, 
> HADOOP-13114-trunk_2016-05-08-1.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h



[jira] [Updated] (HADOOP-13114) DistCp should have option to compress data on write

2016-05-08 Thread Suraj Nayak (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suraj Nayak updated HADOOP-13114:
-
Attachment: HADOOP-13114-trunk_2016-05-08-1.patch

> DistCp should have option to compress data on write
> ---
>
> Key: HADOOP-13114
> URL: https://issues.apache.org/jira/browse/HADOOP-13114
> Project: Hadoop Common
>  Issue Type: Improvement
>Affects Versions: 3.0.0
>Reporter: Suraj Nayak
>Assignee: Suraj Nayak
>Priority: Minor
>  Labels: distcp
> Fix For: 3.0.0
>
> Attachments: HADOOP-13114-trunk_2016-05-07-1.patch, 
> HADOOP-13114-trunk_2016-05-08-1.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h



[jira] [Updated] (HADOOP-13114) DistCp should have option to compress data on write

2016-05-07 Thread Suraj Nayak (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suraj Nayak updated HADOOP-13114:
-
Affects Version/s: 3.0.0
           Status: Patch Available  (was: Open)

> DistCp should have option to compress data on write
> ---
>
> Key: HADOOP-13114
> URL: https://issues.apache.org/jira/browse/HADOOP-13114
> Project: Hadoop Common
>  Issue Type: Improvement
>Affects Versions: 3.0.0
>Reporter: Suraj Nayak
>Assignee: Suraj Nayak
>Priority: Minor
>  Labels: distcp
> Fix For: 3.0.0
>
> Attachments: HADOOP-13114-trunk_2016-05-07-1.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h



[jira] [Updated] (HADOOP-13114) DistCp should have option to compress data on write

2016-05-07 Thread Suraj Nayak (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suraj Nayak updated HADOOP-13114:
-
Attachment: HADOOP-13114-trunk_2016-05-07-1.patch

* Added the {{-compressoutput}} option.
* Added JUnit test cases for output compression and option parsing.
* Created a helper method {{getCodec}} that sets the codec only once (needs review).
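
The {{getCodec}} helper itself is not shown in this thread. As a rough illustration of the "set only once" idea, here is a minimal lazy-initialisation sketch; the class {{CodecHolderSketch}}, the {{factoryCalls}} counter, and the use of a generic {{Supplier}} in place of real codec construction are all assumptions.

```java
import java.util.function.Supplier;

// Hypothetical holder that creates its codec on first use and then reuses it.
class CodecHolderSketch<T> {
    private final Supplier<T> factory;
    private T codec;
    int factoryCalls = 0; // exposed only so the behaviour can be observed

    CodecHolderSketch(Supplier<T> factory) {
        this.factory = factory;
    }

    // synchronized so that concurrent callers cannot create two instances
    synchronized T getCodec() {
        if (codec == null) {
            codec = factory.get();
            factoryCalls++;
        }
        return codec;
    }

    public static void main(String[] args) {
        CodecHolderSketch<Object> holder = new CodecHolderSketch<>(Object::new);
        Object first = holder.getCodec();
        Object second = holder.getCodec();
        System.out.println(first == second);     // same instance both times
        System.out.println(holder.factoryCalls); // factory ran once
    }
}
```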


> DistCp should have option to compress data on write
> ---
>
> Key: HADOOP-13114
> URL: https://issues.apache.org/jira/browse/HADOOP-13114
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Suraj Nayak
>Assignee: Suraj Nayak
>Priority: Minor
>  Labels: distcp
> Fix For: 3.0.0
>
> Attachments: HADOOP-13114-trunk_2016-05-07-1.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h



[jira] [Commented] (HADOOP-13114) DistCp should have option to compress data on write

2016-05-07 Thread Suraj Nayak (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275375#comment-15275375
 ] 

Suraj Nayak commented on HADOOP-13114:
--

[~raviprak]: Regarding your 
[comment|https://issues.apache.org/jira/browse/HADOOP-8065?focusedCommentId=15269857=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15269857]
 on reusing the codec instead of creating a new one for each file, here are my 
thoughts and questions:
* {{org.apache.hadoop.io.compress.CompressionCodec.Util}} is a static nested 
class that contains the {{createOutputStreamWithCodecPool}} method. Do you 
think it's a good idea to make the class and method public? 
* I thought of copying the {{createOutputStreamWithCodecPool}} method code into 
{{DistCpUtils}}, but that would result in code duplication. What would you 
suggest for making this code reusable?
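
To make the reuse question concrete, the sketch below shows the general borrow, use, and return pattern with JDK classes only. It is an analogy, not Hadoop code: {{java.util.zip.Deflater}} stands in for a Hadoop compressor, and {{DeflaterPoolSketch}} with its methods is a hypothetical class that is unrelated to the real {{CodecPool}} implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical pool: one Deflater is reused across files instead of
// allocating a new compressor for every file.
class DeflaterPoolSketch {
    private final ArrayDeque<Deflater> idle = new ArrayDeque<>();
    int created = 0; // number of Deflater instances ever allocated

    synchronized Deflater borrow() {
        Deflater deflater = idle.poll();
        if (deflater == null) {
            deflater = new Deflater();
            created++;
        }
        return deflater;
    }

    synchronized void giveBack(Deflater deflater) {
        deflater.reset(); // clear state so the next file starts fresh
        idle.push(deflater);
    }

    byte[] compress(byte[] data) {
        Deflater deflater = borrow();
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // A caller-supplied Deflater is not ended when the stream closes,
        // which is what makes returning it to the pool possible.
        try (DeflaterOutputStream out = new DeflaterOutputStream(sink, deflater)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            giveBack(deflater);
        }
        return sink.toByteArray();
    }

    static byte[] decompress(byte[] compressed) {
        try (InflaterInputStream in =
                 new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Compressing several files in sequence through one {{DeflaterPoolSketch}} allocates a single {{Deflater}}, which is the saving that per-file codec creation gives up.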

> DistCp should have option to compress data on write
> ---
>
> Key: HADOOP-13114
> URL: https://issues.apache.org/jira/browse/HADOOP-13114
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Suraj Nayak
>Assignee: Suraj Nayak
>Priority: Minor
>  Labels: distcp
> Fix For: 3.0.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h



[jira] [Updated] (HADOOP-13114) DistCp should have option to compress data on write

2016-05-07 Thread Suraj Nayak (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suraj Nayak updated HADOOP-13114:
-
Labels: distcp  (was: )

> DistCp should have option to compress data on write
> ---
>
> Key: HADOOP-13114
> URL: https://issues.apache.org/jira/browse/HADOOP-13114
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Suraj Nayak
>Assignee: Suraj Nayak
>Priority: Minor
>  Labels: distcp
> Fix For: 3.0.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h



[jira] [Commented] (HADOOP-8065) distcp should have an option to compress data while copying.

2016-05-07 Thread Suraj Nayak (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275173#comment-15275173
 ] 

Suraj Nayak commented on HADOOP-8065:
-

Thanks [~raviprak]. I have created JIRA 
[HADOOP-13114|https://issues.apache.org/jira/browse/HADOOP-13114] and added you 
as a watcher. Regarding your comment on {{codec}}: you are right, I was in the 
middle of extracting the default codec extension that needs to be appended to 
the end of the file name. I will upload the patch once my local build gives +1.

> distcp should have an option to compress data while copying.
> 
>
> Key: HADOOP-8065
> URL: https://issues.apache.org/jira/browse/HADOOP-8065
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: fs
>Affects Versions: 0.20.2
>Reporter: Suresh Antony
>Assignee: Suraj Nayak
>Priority: Minor
>  Labels: distcp
> Fix For: 0.20.2
>
> Attachments: HADOOP-8065-trunk_2015-11-03.patch, 
> HADOOP-8065-trunk_2015-11-04.patch, HADOOP-8065-trunk_2016-04-29-4.patch, 
> patch.distcp.2012-02-10
>
>
> We would like to compress the data while transferring it from our source 
> system to the target system. One way to do this is to write a map/reduce job 
> that compresses the data before or after the transfer, which looks inefficient. 
> Since distcp is already reading and writing the data, it would be better if it 
> could compress the data while copying. 
> The flip side is that the distcp -update option cannot check file size 
> before copying data; it can only check for the existence of the file. 
> So I propose that if the -compress option is given, the file size is not checked.
> Also, when a file is copied, the appropriate extension needs to be added to 
> the file name, depending on the compression type.
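
The proposed interaction between -update and -compress can be written down as a small decision function. This is only a sketch of the rule stated above: {{UpdateCheckSketch}} and {{needsCopy}} are hypothetical names, and any other comparison the real tool performs is deliberately not modeled.

```java
// Hypothetical decision rule for -update when output is compressed.
class UpdateCheckSketch {
    static boolean needsCopy(boolean targetExists, long sourceLength,
                             long targetLength, boolean compress) {
        if (!targetExists) {
            return true; // nothing at the target yet, so copy
        }
        if (compress) {
            // A compressed target has a different size than its source,
            // so only the existence check is meaningful.
            return false;
        }
        return sourceLength != targetLength;
    }

    public static void main(String[] args) {
        System.out.println(needsCopy(true, 1000L, 400L, true));  // false
        System.out.println(needsCopy(true, 1000L, 400L, false)); // true
    }
}
```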






[jira] [Updated] (HADOOP-13114) DistCp should have option to compress data on write

2016-05-07 Thread Suraj Nayak (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suraj Nayak updated HADOOP-13114:
-
Description: 
The DistCp utility should be able to store data in a user-specified 
compression format. This avoids the extra hop of compressing data after 
transfer. Backup strategies that copy to a different cluster also benefit 
from saving one I/O round trip to and from HDFS, which saves resources, time 
and effort.

* Create an option -compressOutput defaulting to 
{{org.apache.hadoop.io.compress.BZip2Codec}}. 
* Users will be able to change the codec with {{-D 
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec}}
* If distcp compression is enabled, suffix the filenames with the codec's 
default extension to indicate that the file is compressed, so users can tell 
which codec was used to compress the data.

  was:
The DistCp utility should be able to store data in a user-specified 
compression format. This avoids the extra hop of compressing data after 
transfer. Backup strategies that copy to a different cluster also benefit 
from saving one I/O round trip to and from HDFS, which saves resources, time 
and effort.

* Create an option -compressOutput defaulting to 
{{org.apache.hadoop.io.compress.BZip2Codec}}. 
* Users will be able to change the codec with {{-D 
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec}}
* If distcp compression is enabled, suffix the filenames with the codec's 
default extension to indicate that the file is compressed, so users can tell 
which codec was used to compress the data.

This JIRA is similar to 
[HADOOP-8065|https://issues.apache.org/jira/browse/HADOOP-8065]. 
[HADOOP-8065|https://issues.apache.org/jira/browse/HADOOP-8065] aims to 
compress data *during transit*, which is a huge effort. This JIRA is simplified 
to enable the user to compress data when it lands on the target filesystem.


> DistCp should have option to compress data on write
> ---
>
> Key: HADOOP-13114
> URL: https://issues.apache.org/jira/browse/HADOOP-13114
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Suraj Nayak
>Assignee: Suraj Nayak
>Priority: Minor
> Fix For: 3.0.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> DistCp utility should have capability to store data in user specified 
> compression format. This avoids one hop of compressing data after transfer. 
> Backup strategies to different cluster also get benefit of saving one IO 
> operation to and from HDFS, thus saving resources, time and effort.
> * Create an option -compressOutput defaulting to 
> {{org.apache.hadoop.io.compress.BZip2Codec}}. 
> * Users will be able to change codec with {{-D 
> mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec}}
> * If distcp compression is enabled, suffix the filenames with default codec 
> extension to indicate the file is compressed. Thus users can be aware of what 
> codec was used to compress the data.






[jira] [Commented] (HADOOP-13114) DistCp should have option to compress data on write

2016-05-07 Thread Suraj Nayak (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275171#comment-15275171
 ] 

Suraj Nayak commented on HADOOP-13114:
--

This JIRA is similar to HADOOP-8065. HADOOP-8065 aims to compress data during 
transit, which is a huge effort. This JIRA is simpler in scope: it only lets the 
user compress data as it lands on the target filesystem.
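
The three points in the issue description (default codec, override through 
{{-D}}, and suffixing the target filename) can be sketched as below. This is 
only an illustration, not the attached patch: the class name, the 
{{targetName}} helper and the lookup table are hypothetical, and a plain 
{{Map}} stands in for the Hadoop {{Configuration}}. Real code would more 
likely ask the codec for its extension via 
{{CompressionCodec.getDefaultExtension()}}.

```java
import java.util.HashMap;
import java.util.Map;

public class CompressedTargetNameSketch {
    // Property and default codec as named in the issue description.
    static final String CODEC_KEY =
        "mapreduce.output.fileoutputformat.compress.codec";
    static final String DEFAULT_CODEC =
        "org.apache.hadoop.io.compress.BZip2Codec";

    // Hypothetical lookup table: codec class name -> default file extension.
    static final Map<String, String> EXTENSIONS = new HashMap<>();
    static {
        EXTENSIONS.put("org.apache.hadoop.io.compress.BZip2Codec", ".bz2");
        EXTENSIONS.put("org.apache.hadoop.io.compress.GzipCodec", ".gz");
    }

    // Use the codec set with -D if present, otherwise fall back to the default.
    static String resolveCodec(Map<String, String> conf) {
        return conf.getOrDefault(CODEC_KEY, DEFAULT_CODEC);
    }

    // Append the codec extension only when compression was requested.
    static String targetName(String sourceName, boolean compressOutput,
                             Map<String, String> conf) {
        if (!compressOutput) {
            return sourceName;
        }
        String codec = resolveCodec(conf);
        String extension = EXTENSIONS.get(codec);
        if (extension == null) {
            throw new IllegalArgumentException("Unknown codec: " + codec);
        }
        return sourceName + extension;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(targetName("part-00000", true, conf));
        conf.put(CODEC_KEY, "org.apache.hadoop.io.compress.GzipCodec");
        System.out.println(targetName("part-00000", true, conf));
        System.out.println(targetName("part-00000", false, conf));
    }
}
```

With an empty configuration the sketch yields {{part-00000.bz2}}; after 
setting the codec property to {{GzipCodec}} it yields {{part-00000.gz}}; 
without compression the name is unchanged.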

> DistCp should have option to compress data on write
> ---
>
> Key: HADOOP-13114
> URL: https://issues.apache.org/jira/browse/HADOOP-13114
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Suraj Nayak
>Assignee: Suraj Nayak
>Priority: Minor
> Fix For: 3.0.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> DistCp utility should have capability to store data in user specified 
> compression format. This avoids one hop of compressing data after transfer. 
> Backup strategies to different cluster also get benefit of saving one IO 
> operation to and from HDFS, thus saving resources, time and effort.
> * Create an option -compressOutput defaulting to 
> {{org.apache.hadoop.io.compress.BZip2Codec}}. 
> * Users will be able to change codec with {{-D 
> mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec}}
> * If distcp compression is enabled, suffix the filenames with default codec 
> extension to indicate the file is compressed. Thus users can be aware of what 
> codec was used to compress the data.
> This JIRA is similar to 
> [HADOOP-8065|https://issues.apache.org/jira/browse/HADOOP-8065]. 
> [HADOOP-8065|https://issues.apache.org/jira/browse/HADOOP-8065] aims to 
> compress data *during transit* which is a huge effort. This JIRA is 
> simplified to enable to user to compress data when the data lands on target 
> filesystem.






[jira] [Updated] (HADOOP-13114) DistCp should have option to compress data on write

2016-05-07 Thread Suraj Nayak (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suraj Nayak updated HADOOP-13114:
-
Description: 
DistCp utility should have capability to store data in user specified 
compression format. This avoids one hop of compressing data after transfer. 
Backup strategies to different cluster also get benefit of saving one IO 
operation to and from HDFS, thus saving resources, time and effort.

* Create an option -compressOutput defaulting to 
{{org.apache.hadoop.io.compress.BZip2Codec}}. 
* Users will be able to change codec with {{-D 
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec}}
* If distcp compression is enabled, suffix the filenames with default codec 
extension to indicate the file is compressed. Thus users can be aware of what 
codec was used to compress the data.

This JIRA is similar to 
[HADOOP-8065|https://issues.apache.org/jira/browse/HADOOP-8065]. 
[HADOOP-8065|https://issues.apache.org/jira/browse/HADOOP-8065] aims to 
compress data *during transit* which is a huge effort. This JIRA is simplified 
to enable to user to compress data when the data lands on target filesystem.

  was:
DistCp utility should have capability to store data in user specified 
compressed format. This avoids one hop of compressing data after transfer. 
Backup strategies to different cluster gets benefit saving one IO operation, 
time and effort.

* Create a option -compressOutput with defaulting to 
{{org.apache.avro.file.BZip2Codec}}. 
* Users will be able to change codec with {{-D 
mapreduce.output.fileoutputformat.compress.codec=org.apache.avro.file.SnappyCodec}}
* If distcp compression is enables, suffix the filenames with default codec 
extension to indicate the file is compressed. Thus users can be aware of what 
codec was used to compress the data.

This JIRA is similar to 
[HADOOP-8065|https://issues.apache.org/jira/browse/HADOOP-8065]. 
[HADOOP-8065|https://issues.apache.org/jira/browse/HADOOP-8065] aims to 
compress data *during transit* which is a huge effort. This JIRA is simplified 
to enable to user to compress data when the data lands on target filesystem.


> DistCp should have option to compress data on write
> ---
>
> Key: HADOOP-13114
> URL: https://issues.apache.org/jira/browse/HADOOP-13114
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Suraj Nayak
>Assignee: Suraj Nayak
>Priority: Minor
> Fix For: 3.0.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> DistCp utility should have capability to store data in user specified 
> compression format. This avoids one hop of compressing data after transfer. 
> Backup strategies to different cluster also get benefit of saving one IO 
> operation to and from HDFS, thus saving resources, time and effort.
> * Create an option -compressOutput defaulting to 
> {{org.apache.hadoop.io.compress.BZip2Codec}}. 
> * Users will be able to change codec with {{-D 
> mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec}}
> * If distcp compression is enabled, suffix the filenames with default codec 
> extension to indicate the file is compressed. Thus users can be aware of what 
> codec was used to compress the data.
> This JIRA is similar to 
> [HADOOP-8065|https://issues.apache.org/jira/browse/HADOOP-8065]. 
> [HADOOP-8065|https://issues.apache.org/jira/browse/HADOOP-8065] aims to 
> compress data *during transit* which is a huge effort. This JIRA is 
> simplified to enable to user to compress data when the data lands on target 
> filesystem.






[jira] [Created] (HADOOP-13114) DistCp should have option to compress data on write

2016-05-07 Thread Suraj Nayak (JIRA)
Suraj Nayak created HADOOP-13114:


     Summary: DistCp should have option to compress data on write
         Key: HADOOP-13114
         URL: https://issues.apache.org/jira/browse/HADOOP-13114
     Project: Hadoop Common
  Issue Type: Improvement
    Reporter: Suraj Nayak
    Assignee: Suraj Nayak
    Priority: Minor
     Fix For: 3.0.0


DistCp utility should have capability to store data in user specified 
compressed format. This avoids one hop of compressing data after transfer. 
Backup strategies to different cluster gets benefit saving one IO operation, 
time and effort.

* Create a option -compressOutput with defaulting to 
{{org.apache.avro.file.BZip2Codec}}. 
* Users will be able to change codec with {{-D 
mapreduce.output.fileoutputformat.compress.codec=org.apache.avro.file.SnappyCodec}}
* If distcp compression is enables, suffix the filenames with default codec 
extension to indicate the file is compressed. Thus users can be aware of what 
codec was used to compress the data.

This JIRA is similar to 
[HADOOP-8065|https://issues.apache.org/jira/browse/HADOOP-8065]. 
[HADOOP-8065|https://issues.apache.org/jira/browse/HADOOP-8065] aims to 
compress data *during transit* which is a huge effort. This JIRA is simplified 
to enable to user to compress data when the data lands on target filesystem.






[jira] [Commented] (HADOOP-8065) distcp should have an option to compress data while copying.

2016-05-05 Thread Suraj Nayak (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272016#comment-15272016
 ] 

Suraj Nayak commented on HADOOP-8065:
-

[~raviprak] : It would be really helpful if you could give me some hints on how 
to implement the compression *during transit*. Should it happen before or after 
{{context.write()}}?

> distcp should have an option to compress data while copying.
> 
>
> Key: HADOOP-8065
> URL: https://issues.apache.org/jira/browse/HADOOP-8065
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: fs
>Affects Versions: 0.20.2
>Reporter: Suresh Antony
>Assignee: Suraj Nayak
>Priority: Minor
>  Labels: distcp
> Fix For: 0.20.2
>
> Attachments: HADOOP-8065-trunk_2015-11-03.patch, 
> HADOOP-8065-trunk_2015-11-04.patch, HADOOP-8065-trunk_2016-04-29-4.patch, 
> patch.distcp.2012-02-10
>
>
> We would like compress the data while transferring from our source system to 
> target system. One way to do this is to write a map/reduce job to compress 
> that after/before being transferred. This looks inefficient. 
> Since distcp already reading writing data it would be better if it can 
> accomplish while doing this. 
> Flip side of this is that distcp -update option can not check file size 
> before copying data. It can only check for the existence of file. 
> So I propose if -compress option is given then file size is not checked.
> Also when we copy file appropriate extension needs to be added to file 
> depending on compression type.
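
The trade-off with {{-update}} described in the quoted text can be sketched 
as below. This is a simplified model that looks only at file existence and 
length; the class and method names are hypothetical, and the actual DistCp 
comparison may consider more than size. The point it illustrates is that a 
compressed target no longer has the same length as its source, so a size 
comparison would always report a difference and only the existence check 
remains usable.

```java
public class UpdateSkipSketch {
    /**
     * Decide whether an -update style copy may skip a file.
     *
     * @param targetExists whether the target file is already present
     * @param sourceLen    length of the source file in bytes
     * @param targetLen    length of the target file in bytes
     * @param compress     whether the target is written compressed
     */
    static boolean shouldSkip(boolean targetExists, long sourceLen,
                              long targetLen, boolean compress) {
        if (!targetExists) {
            return false; // nothing at the target yet, so copy
        }
        if (compress) {
            // Compressed and uncompressed lengths are not comparable,
            // so fall back to the existence check alone.
            return true;
        }
        return sourceLen == targetLen; // plain copy: equal size means skip
    }

    public static void main(String[] args) {
        System.out.println(shouldSkip(false, 100L, 0L, false));
        System.out.println(shouldSkip(true, 100L, 100L, false));
        System.out.println(shouldSkip(true, 100L, 40L, false));
        System.out.println(shouldSkip(true, 100L, 40L, true));
    }
}
```

Under this model, a changed source file whose compressed target already 
exists would be skipped, which is the cost of dropping the size check.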


