[ https://issues.apache.org/jira/browse/HADOOP-13340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16322869#comment-16322869 ]

Jason Lowe commented on HADOOP-13340:
-------------------------------------

Choosing which files to compress doesn't really solve the issues I brought up 
in my previous comment.  Even if we choose only to compress some of the files 
but not all of them, unless we choose a splittable/seekable codec and provide 
transparent decoding in the HarFileSystem layer it could change the semantics 
of how an application accesses the data before and after it enters the .har 
archive.  (e.g.: app was working just fine on uncompressed data but doesn't 
gracefully handle the compressed data, especially if it isn't splittable).  
That would be adding compression to the har that is not transparent.  I suppose 
as long as that's clearly documented and the user expects that behavior it 
could be OK.

What needs to be clarified is the requirements and expectations of this 
feature.  Is the compression transparent  (i.e.: data appears to be exactly as 
it was to anyone accessing the .har archive yet it is actually stored 
compressed and transparently decoded during access) or simply each file 
(optionally) compressed as it is added to the archive?  The latter has a 
straightforward workaround today (i.e.: simply compress the original files 
before archiving them).  The former would require support in HarFileSystem but 
could be nice for the common use-case for .har archives which is packing 
together a lot of relatively small files.  The compression could work across 
file boundaries, achieving a greater compression ratio than if each file were 
compressed separately, with the overhead of needing to decode up to an entire 
codec block to access a file's contents.
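The per-file workaround mentioned above can be sketched as a short shell sequence: compress each file individually, then pack the results into the archive. The staging directory and the HDFS paths/archive name in the trailing comments are hypothetical examples, not anything from this issue:

```shell
#!/bin/sh
# Sketch of the per-file workaround: compress each file *before* archiving.
# This is per-file (non-transparent) compression -- readers of the .har will
# see .gz files, not the original uncompressed data.

STAGE=$(mktemp -d)

# Create a couple of sample files standing in for the many small files
# that .har archives typically pack together.
printf 'part one\n' > "$STAGE/part-00000.txt"
printf 'part two\n' > "$STAGE/part-00001.txt"

# Compress each file separately; gzip replaces each file with file.gz.
gzip "$STAGE"/*.txt

ls "$STAGE"

# The compressed files can then be archived as usual (illustrative only;
# assumes a running cluster and these example paths):
#   hdfs dfs -put "$STAGE" /user/me/input
#   hadoop archive -archiveName data.har -p /user/me input /user/me/archives
```

Note that because each file is gzipped separately, nothing in this flow spans file boundaries, which is exactly the compression-ratio limitation described above.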




> Compress Hadoop Archive output
> ------------------------------
>
>                 Key: HADOOP-13340
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13340
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: tools
>    Affects Versions: 2.5.0
>            Reporter: Duc Le Tu
>              Labels: features, performance
>
> Why can't the Hadoop Archive tool compress its output like other map-reduce jobs? 
> I tried options like -D mapreduce.output.fileoutputformat.compress=true 
> -D 
> mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec
>  but they don't work. Did I do something wrong?
> If not, please support an option to compress the output of the Hadoop Archive 
> tool; it's very necessary for data retention for everyone (the small files 
> problem plus compressed data).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
