Here are the tradeoffs that I'm aware of:
* Compression ratios are comparable
* Snappy decompression is about twice as fast as LZO's
* LZO is "splittable." It can be decompressed in pieces natively without
using an AVRO or sequence file. For LZO, this requres a separate operation to
generate an index file that identifies where the blocks are in the main file.
* LZO has to be downloaded and installed separately because its (GPL)
license is incompatible with Hadoop's Apache license.
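
A minimal sketch of that indexing step, assuming a hadoop-lzo build is
installed (the jar path and log file name here are illustrative):

$ hadoop jar /usr/lib/hadoop/lib/hadoop-lzo.jar \
    com.hadoop.compression.lzo.LzoIndexer \
    /user/mark/logs/server.log.lzo

This writes a server.log.lzo.index file alongside the original, which the
LZO-aware input formats use to compute splits.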
- Tim.
________________________________________
From: Mark [[email protected]]
Sent: Sunday, October 30, 2011 9:33 AM
To: [email protected]
Subject: Re: LZO Compression
Thanks for the info, very helpful.
What's the difference between LZO and Snappy? I like that Cloudera has
Snappy support, so it looks like I'm going to go with that, but I just
wanted to know the tradeoffs.
Thanks again
On 10/29/11 8:52 PM, Harsh J wrote:
> Hey Mark,
>
> (Before you jump in with LZO, perhaps consider using Snappy+SequenceFiles?)
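>
> If you go that route, a minimal sketch (the sort example in the examples
> jar reads and writes SequenceFiles by default; assumes a Snappy-enabled
> CDH build, with property names from the old 0.20 API):
>
> $ hadoop jar /usr/lib/hadoop/hadoop-examples-0.20.2-cdh3u2.jar sort \
>     -Dmapred.output.compress=true \
>     -Dmapred.output.compression.type=BLOCK \
>     -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
>     inputDir outputDir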
>
> On 30-Oct-2011, at 7:59 AM, Mark wrote:
>
>> Email was sent a bit prematurely.
>>
>> Anyway. How can one test that LZO compression is configured correctly? I've
>> found multiple sources on how to compile the hadoop-lzo jars and native
>> files but nowhere did I see a definitive way to test that the
>> installation/configuration is correct.
> You can run the compression codec test per node, or run a job that reads
> or writes with that codec.
>
> A single-node test example, using an available test jar:
>
> $ HADOOP_CLASSPATH=/usr/lib/hadoop/hadoop-test-0.20.2-cdh3u2.jar hadoop \
>     org.apache.hadoop.io.compress.TestCodec -count 1000 \
>     -codec com.hadoop.compression.lzo.LzoCodec
>
>> Also, when is this compression enabled? Is it enabled on every file I write?
>> Do I somehow have to specify that I want to use this format? For example we
>> have a rather large directory of server logs ... /user/mark/logs. How can we
>> enable compression on this directory?
>>
> Compression in HDFS is a purely client-side setting. You can't enable it
> 'globally'.
>
> For jobs, you may set {File}OutputFormat#setOutputCompressorClass(…) to
> the desired codec class to have the final job outputs written with that
> codec (compression of the write streams is toggled by
> {File}OutputFormat#setCompressOutput(…)). To optimize the transient
> map-to-reduce stage, use JobConf#setMapOutputCompressorClass(…) and
> toggle it with JobConf#setCompressMapOutput(…).
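>
> For that transient map-output side, the same settings are exposed as job
> properties. A command-line sketch (0.20 property names; yourJob.jar and
> YourDriver are placeholders, and the -D flags assume the driver uses
> ToolRunner):
>
> $ hadoop jar yourJob.jar YourDriver \
>     -Dmapred.compress.map.output=true \
>     -Dmapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec \
>     inputDir outputDir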
>
> Reading compressed files back again is handled automagically by your Hadoop
> framework, and should require no settings.
>
> Hence, for a fully distributed test of your LZO install (which you have
> hopefully done with Todd's easy packaging tool at
> https://github.com/toddlipcon/hadoop-lzo-packager), you can run a simple
> parameterized (or mapred-site.xml-configured) wordcount via an available
> example jar:
>
> $ hadoop jar /usr/lib/hadoop/hadoop-examples-0.20.2-cdh3u2.jar wordcount \
>     -Dmapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec \
>     -Dmapred.output.compress=true inputDir outputDir
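>
> If the codec is wired up correctly, the part files in outputDir should
> carry the codec's extension (.lzo_deflate for LzoCodec; the
> lzop-compatible LzopCodec writes .lzo):
>
> $ hadoop fs -ls outputDir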
>
> Hope this helps!
>
> --
> Harsh J