[ https://issues.apache.org/jira/browse/HADOOP-2402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12551587 ]
Doug Cutting commented on HADOOP-2402:
--------------------------------------
> For Lzo, this means it will compress no more than 4k at a time, yielding even
> less than 20% compression.
Can't the lzo codec compress multiple buffers together? Does it only compress
one buffer at a time? If so, then the codec itself should do the buffering and
set the buffer size, since this is an lzo-specific issue.
> This might be a better place to add some buffering, but then the codec will
> be returning a buffered stream.
I don't think it's a bug for a codec to return a buffered stream if a
particular buffer size is required to get good performance from that codec. If
it's impossible to get lzo to compress data across buffers, and 64k or larger
is required to get good compression, then it should mandate that buffer size,
perhaps adding a new configuration parameter.
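A minimal sketch of that idea, under loud assumptions: PerWriteCompressor is a
hypothetical stand-in (built on java.util.zip.Deflater, not the real lzo codec)
for a stream that compresses every write() as an independent block, and
BufferedCodecSketch.run is an illustrative helper, not Hadoop API. It only shows
how a codec that wraps its own stream in a buffer of a mandated size turns many
tiny key/tab/value writes into a few large blocks.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class BufferedCodecSketch {

    /** Hypothetical codec stream: each write() becomes its own compressed block. */
    static class PerWriteCompressor extends OutputStream {
        private final OutputStream out;
        int blocks = 0; // number of independently compressed blocks emitted

        PerWriteCompressor(OutputStream out) {
            this.out = out;
        }

        @Override
        public void write(int b) throws IOException {
            write(new byte[] {(byte) b}, 0, 1);
        }

        @Override
        public void write(byte[] buf, int off, int len) throws IOException {
            if (len == 0) {
                return;
            }
            // A fresh Deflater per call means no compression across writes.
            Deflater deflater = new Deflater();
            deflater.setInput(buf, off, len);
            deflater.finish();
            byte[] chunk = new byte[1024];
            while (!deflater.finished()) {
                int n = deflater.deflate(chunk);
                out.write(chunk, 0, n);
            }
            deflater.end();
            blocks++;
        }
    }

    /**
     * Writes key, tab, value, newline as four separate writes per record, the
     * way the issue describes TextOutputFormat doing. A codecBufferSize of 0
     * means the codec adds no buffer. Returns {blocks, compressedBytes}.
     */
    static int[] run(int codecBufferSize, int records) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            PerWriteCompressor codec = new PerWriteCompressor(sink);
            OutputStream out = codecBufferSize > 0
                    ? new BufferedOutputStream(codec, codecBufferSize)
                    : codec;
            byte[] tab = {'\t'};
            byte[] newline = {'\n'};
            for (int i = 0; i < records; i++) {
                out.write(("key" + i).getBytes(StandardCharsets.UTF_8));
                out.write(tab);
                out.write(("value" + i).getBytes(StandardCharsets.UTF_8));
                out.write(newline);
            }
            out.close(); // flushes any buffered bytes into the compressor
            return new int[] {codec.blocks, sink.size()};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] unbuffered = run(0, 1000);
        int[] buffered = run(64 * 1024, 1000);
        System.out.println("unbuffered: blocks=" + unbuffered[0] + " bytes=" + unbuffered[1]);
        System.out.println("64k buffer: blocks=" + buffered[0] + " bytes=" + buffered[1]);
    }
}
```

In this sketch the caller never sees the buffer size; the codec decides it, which
is the behaviour argued for above when the size is a codec-specific requirement.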
Separately, we should consider whether to (a) unilaterally add an
io.file.buffer.size buffer in TextOutputFormat, since it helps other codecs, or
(b) assume that all codecs return appropriately buffered streams, and add a
buffer in the Zip codec if it improves performance. If an io.file.buffer.size
of 4k gives somewhat improved Zip performance, and a 64k buffer gives even
better performance, I think that's okay. Performance should improve a bit by
increasing io.file.buffer.size, at the expense of chewing up more memory per
open file. The default setting should be for decent performance with minimal
memory use.
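To make option (a) and its memory trade-off concrete, here is a rough sketch. It
assumes java.util.Properties as a stand-in for the job configuration, and the
class and method names (OutputFormatBufferSketch, wrap, bufferMemory) are
hypothetical; only the io.file.buffer.size key and the 4k and 64k figures come
from the discussion.

```java
import java.io.BufferedOutputStream;
import java.io.OutputStream;
import java.util.Properties;

public class OutputFormatBufferSketch {
    static final String BUFFER_SIZE_KEY = "io.file.buffer.size";
    // Assumed default for this sketch: the 4k figure used in the discussion.
    static final int DEFAULT_BUFFER_SIZE = 4096;

    /** Reads the buffer size from the (stand-in) configuration. */
    static int bufferSize(Properties conf) {
        String value = conf.getProperty(BUFFER_SIZE_KEY, String.valueOf(DEFAULT_BUFFER_SIZE));
        return Integer.parseInt(value);
    }

    /** Option (a): the output format buffers whatever stream the codec returns. */
    static OutputStream wrap(OutputStream codecStream, Properties conf) {
        return new BufferedOutputStream(codecStream, bufferSize(conf));
    }

    /** Buffer memory held when this many files are open at once. */
    static long bufferMemory(Properties conf, int openFiles) {
        return (long) bufferSize(conf) * openFiles;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(bufferMemory(conf, 100)); // 100 files at 4k each
        conf.setProperty(BUFFER_SIZE_KEY, "65536");
        System.out.println(bufferMemory(conf, 100)); // 100 files at 64k each
    }
}
```

With 100 open files the buffers grow from roughly 400 KB to roughly 6.5 MB when
the setting goes from 4k to 64k, which is the cost side of the trade-off.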
> Lzo compression compresses each write from TextOutputFormat
> -----------------------------------------------------------
>
> Key: HADOOP-2402
> URL: https://issues.apache.org/jira/browse/HADOOP-2402
> Project: Hadoop
> Issue Type: Bug
> Components: io, mapred, native
> Reporter: Chris Douglas
> Fix For: 0.16.0
>
> Attachments: 2402-0.patch
>
>
> Outputting with TextOutputFormat and Lzo compression generates a file such
> that each key, tab delimiter, and value are compressed separately.