Hey Mark,
(Before you jump in with LZO, perhaps consider using Snappy+SequenceFiles?)
On 30-Oct-2011, at 7:59 AM, Mark wrote:
> Email was sent a bit prematurely.
>
> Anyway. How can one test that LZO compression is configured correctly? I've
> found multiple sources on how to compile the hadoop-lzo jars and native files
> but nowhere did I see a definitive way to test that the
> installation/configuration is correct.
You can run the compression codec test on each node, or run a job that reads or
writes with that codec.
Single node test example, using an available test jar:
$ HADOOP_CLASSPATH=/usr/lib/hadoop/hadoop-test-0.20.2-cdh3u2.jar hadoop \
    org.apache.hadoop.io.compress.TestCodec -count 1000 \
    -codec com.hadoop.compression.lzo.LzoCodec
> Also, when is this compression enabled? Is it enabled on every file I write?
> Do I somehow have to specify that I want to use this format? For example we
> have a rather large directory of server logs ... /user/mark/logs. How can we
> enable compression on this directory?
>
Compression in HDFS is purely a client-side setting. You can't enable it
'globally' or for a whole directory; each writer chooses its own codec.
For jobs, you can pass the desired codec class to
FileOutputFormat#setOutputCompressorClass(…) to have the final job outputs
written with that codec (output compression itself is switched on with
FileOutputFormat#setCompressOutput(…)). To optimize the transient stages, use
JobConf#setMapOutputCompressorClass(…) and switch it on with
JobConf#setCompressMapOutput(…).
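If you'd rather not touch job code, the same switches can usually be passed as
-D properties on the command line. A rough sketch only: the helper name
lzo_job_opts is made up, and the map-side property names
(mapred.compress.map.output, mapred.map.output.compression.codec) are what I'd
assume for a 0.20-era install, so check them against your mapred-default.xml.

```shell
# Hypothetical helper: prints the -D flags that turn on LZO for both the
# intermediate map output and the final job output.
# Property names are assumed to be the 0.20-era ones.
lzo_job_opts() {
  codec=com.hadoop.compression.lzo.LzoCodec
  printf '%s ' \
    "-Dmapred.compress.map.output=true" \
    "-Dmapred.map.output.compression.codec=$codec" \
    "-Dmapred.output.compress=true" \
    "-Dmapred.output.compression.codec=$codec"
}

# Usage (needs a working Hadoop client, so it is left commented out):
# hadoop jar /usr/lib/hadoop/hadoop-examples-0.20.2-cdh3u2.jar wordcount \
#     $(lzo_job_opts) inputDir outputDir
```

This only works for jobs that parse generic options (the bundled examples do),
and the flags have to come before inputDir and outputDir.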
Reading compressed files back is handled automagically by the Hadoop framework
and should need no per-job settings, provided the codec is listed in
io.compression.codecs.
Hence, for a fully distributed test of your LZO install (which you have
hopefully done with Todd's easy tool at
https://github.com/toddlipcon/hadoop-lzo-packager), you can run a simple
wordcount, with the codec passed as parameters or configured in
mapred-site.xml, via an available example jar:
$ hadoop jar /usr/lib/hadoop/hadoop-examples-0.20.2-cdh3u2.jar wordcount \
    -Dmapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec \
    -Dmapred.output.compress=true inputDir outputDir
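Once the job finishes, a quick sanity check is to look at the names of the
output files. A small sketch, assuming part files written by LzoCodec carry its
default .lzo_deflate extension (the helper name has_lzo_output is made up):

```shell
# Hypothetical helper: succeeds if at least one part file listed under the
# given output directory ends in .lzo_deflate.
has_lzo_output() {
  hadoop fs -ls "$1" | grep -q 'part-[^ ]*\.lzo_deflate$'
}

# Usage: has_lzo_output outputDir && echo "LZO-compressed output found"
```

If the part files show up without that extension, the job most likely ran with
output compression switched off.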
Hope this helps!
--
Harsh J