Hey Mark,

(Before you jump in with LZO, perhaps consider using Snappy+SequenceFiles?)

On 30-Oct-2011, at 7:59 AM, Mark wrote:

> Email was sent a bit prematurely.
> 
> Anyway. How can one test that LZO compression is configured correctly? I've 
> found multiple sources on how to compile the hadoop-lzo jars and native files 
> but no where did I see a definitive way to test that the 
> installation/configuration is correct.

You can run the compression codec test on each node, or run a job that reads or 
writes with that codec.

Single node test example, using an available test jar:

$ HADOOP_CLASSPATH=/usr/lib/hadoop/hadoop-test-0.20.2-cdh3u2.jar hadoop \
    org.apache.hadoop.io.compress.TestCodec -count 1000 -codec \
    com.hadoop.compression.lzo.LzoCodec

> Also, when is this compression enabled? Is it enabled on every file I write? 
> Do I somehow have to specify that I want to use this format? For example we 
> have a rather large directory of server logs ... /user/mark/logs. How can we 
> enable compression on this directory?
> 

Compression in HDFS is purely a client-side setting. You can't enable it 
'globally', or for a particular directory.

For jobs, call {File}OutputFormat#setCompressOutput(…) to turn on compression 
of the final job outputs, and {File}OutputFormat#setOutputCompressorClass(…) to 
choose the codec they are written with. To compress the transient (map output) 
stage as well, use JobConf#setCompressMapOutput(…) to toggle it and 
JobConf#setMapOutputCompressorClass(…) to pick the codec.

Reading compressed files back again is handled automagically by your Hadoop 
framework, and should require no settings.

Hence, for a fully distributed test of your LZO install (which you have 
hopefully done with Todd's easy tool at 
https://github.com/toddlipcon/hadoop-lzo-packager), you can run a simple 
wordcount with the codec passed as parameters (or configured in 
mapred-site.xml), via an available example jar:

$ hadoop jar /usr/lib/hadoop/hadoop-examples-0.20.2-cdh3u2.jar wordcount \
    -Dmapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec \
    -Dmapred.output.compress=true inputDir outputDir
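
If you would rather configure this in mapred-site.xml than pass -D flags, a 
rough sketch of the client-side properties follows. The first two are the same 
ones used in the command above; the map-output pair is my assumption of the 
0.20-era property names behind JobConf#setCompressMapOutput(…) and 
JobConf#setMapOutputCompressorClass(…), so do check them against the 
mapred-default.xml of your release before relying on them.

```xml
<!-- Goes inside the existing <configuration> element of the client's
     mapred-site.xml. Sketch only; verify names against your Hadoop version. -->

<!-- Final job output: same settings as the -D flags in the wordcount run. -->
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

<!-- Transient map output (assumed 0.20-era property names). -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```

Note that this makes compression the default for every job submitted with that 
client configuration, whereas the -D flags affect only the one job.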

Hope this helps!

--
Harsh J
