[
https://issues.apache.org/jira/browse/HADOOP-13340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16322869#comment-16322869
]
Jason Lowe commented on HADOOP-13340:
-------------------------------------
Choosing which files to compress doesn't really solve the issues I brought up
in my previous comment. Even if we compress only some of the files, then
unless we choose a splittable/seekable codec and provide transparent decoding
in the HarFileSystem layer, it could change the semantics of how an
application accesses the data before and after it enters the .har archive
(e.g.: an app that was working just fine on uncompressed data may not
gracefully handle the compressed data, especially if it isn't splittable).
That would be adding compression to the har that is not transparent. I suppose
as long as that's clearly documented and the user expects that behavior it
could be OK.
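If that non-transparent, per-file behavior is what users expect, roughly the
same effect is already achievable by hand today. A minimal sketch (all paths
and file names here are illustrative, not from the issue; the hadoop command
is shown commented out):

```shell
# Compress each small file before archiving it; anything reading the
# archive must then expect .gz data inside (i.e. not transparent).
mkdir -p /tmp/har-demo
printf 'sample,record\n' > /tmp/har-demo/part-00000
gzip -f /tmp/har-demo/part-00000        # replaces part-00000 with part-00000.gz
ls /tmp/har-demo
# hadoop archive -archiveName demo.har -p /tmp/har-demo /archives   (illustrative)
```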
What needs to be clarified are the requirements and expectations of this
feature. Is the compression transparent (i.e.: the data appears exactly as it
was to anyone accessing the .har archive, yet it is actually stored compressed
and transparently decoded during access), or is each file simply (optionally)
compressed as it is added to the archive? The latter has a straightforward
workaround today (i.e.: simply compress the original files before archiving
them). The former would require support in HarFileSystem but could be nice
for the common use-case for .har archives, which is packing together a lot of
relatively small files. The compression could work across file boundaries,
achieving a greater compression ratio than if each file were compressed
separately, at the cost of needing to decode up to an entire codec block to
access a file's contents.
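The cross-file-boundary ratio advantage is easy to demonstrate outside Hadoop.
The sketch below (plain Python gzip, not HarFileSystem code; the synthetic
"files" are made up for illustration) compares compressing many small, similar
files individually versus as one concatenated stream:

```python
import gzip

# Illustrative sketch: many small, similar files compress better as one
# concatenated stream than individually, because the codec can reuse
# matches across file boundaries and pays per-stream overhead only once.
files = [f"host-{i % 5},GET,/index.html,200\n".encode() * 20 for i in range(50)]

per_file_total = sum(len(gzip.compress(f)) for f in files)
combined = len(gzip.compress(b"".join(files)))

print(f"per-file: {per_file_total} bytes, combined: {combined} bytes")
assert combined < per_file_total
```

The trade-off the comment describes shows up on the read side: with the
combined stream, reaching one file's bytes can require decoding everything
that precedes it in the codec block.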
> Compress Hadoop Archive output
> ------------------------------
>
> Key: HADOOP-13340
> URL: https://issues.apache.org/jira/browse/HADOOP-13340
> Project: Hadoop Common
> Issue Type: New Feature
> Components: tools
> Affects Versions: 2.5.0
> Reporter: Duc Le Tu
> Labels: features, performance
>
> Why can't the Hadoop Archive tool compress its output like other map-reduce
> jobs? I used options like -D mapreduce.output.fileoutputformat.compress=true
> -D
> mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec
> but it doesn't work. Did I go wrong somewhere?
> If not, please support an option to compress the output of the Hadoop
> Archive tool; it's very necessary for data retention for everyone (the
> small-files problem and compressed data).
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)