Thanks for the info, very helpful.

What's the difference between LZO and Snappy? I like that Cloudera has Snappy support, so it looks like I'm going to go with that, but I just wanted to know the tradeoffs.

Thanks again

On 10/29/11 8:52 PM, Harsh J wrote:
Hey Mark,

(Before you jump in with LZO, perhaps consider using Snappy+SequenceFiles?)

On 30-Oct-2011, at 7:59 AM, Mark wrote:

Email was sent a bit prematurely.

Anyway. How can one test that LZO compression is configured correctly? I've 
found multiple sources on how to compile the hadoop-lzo jars and native files 
but nowhere did I see a definitive way to test that the 
installation/configuration is correct.

You can run the compression codec test on each node, or run a job that reads or 
writes with that codec.

Single node test example, using an available test jar:

$ HADOOP_CLASSPATH=/usr/lib/hadoop/hadoop-test-0.20.2-cdh3u2.jar hadoop \
    org.apache.hadoop.io.compress.TestCodec -count 1000 -codec \
    com.hadoop.compression.lzo.LzoCodec

Also, when is this compression enabled? Is it enabled on every file I write? Do 
I somehow have to specify that I want to use this format? For example we have a 
rather large directory of server logs ... /user/mark/logs. How can we enable 
compression on this directory?

Compression in HDFS is purely a client-side setting. You can't enable it 
'globally', or for a directory.

For jobs, call {File}OutputFormat#setOutputCompressorClass(…) with the desired 
codec class to have the final job outputs written with that codec (compression 
of the output itself is toggled by {File}OutputFormat#setCompressOutput(…)). To 
optimize the transient stages, you can use 
JobConf#setMapOutputCompressorClass(…), toggled by 
JobConf#setCompressMapOutput(…).

Reading compressed files back again is handled automagically by your Hadoop 
framework, and should require no settings.

Hence, for a fully distributed test of your LZO install (which you have 
hopefully done with Todd's easy tool at 
https://github.com/toddlipcon/hadoop-lzo-packager), you can run a simple 
wordcount via an available example jar, with the codec passed as parameters 
(or configured in mapred-site.xml):

$ hadoop jar /usr/lib/hadoop/hadoop-examples-0.20.2-cdh3u2.jar wordcount \
    -Dmapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec \
    -Dmapred.output.compress=true inputDir outputDir
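
If you would rather configure this than pass it on the command line, the same two properties can go in mapred-site.xml. This is only a rough sketch: it assumes the file is the one read by the client that submits the job, and that the entries sit inside the existing <configuration> element. Note that it would then apply to every job submitted with that configuration, not just the test run.

```xml
<!-- Sketch only: same two settings as the -D options in the wordcount command -->
<property>
  <!-- turn on compression of final job output -->
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <!-- codec used for the compressed output -->
  <name>mapred.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```

With these in place the wordcount command should work without the two -D options.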

Hope this helps!

--
Harsh J
