If you're using TextInputFormat, you need to add LzoCodec to the list
of codecs in the io.compression.codecs property.
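For example, a hadoop-site.xml entry along these lines (the first two
codecs are the stock defaults; keep whatever your cluster already
lists and append LzoCodec):

  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.LzoCodec</value>
  </property>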
LzopCodec is only for reading/writing files produced/consumed by the C
tool; it's not in 0.17. The ".lzo" files produced in 0.17 are not
"real" .lzo files, but that's how you can get the codec to recognize
them in this version. In the future, you might want to just use the
lzo codec with SequenceFileOutputFormat (use BLOCK compression).
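Something like these streaming flags should do it (untested; property
names as they stand in 0.17, and I'm assuming
mapred.output.compression.type is honored by SequenceFileOutputFormat
in your build):

  ~/hadoop/bin/hadoop jar \
    /home/hadoop/hadoop/contrib/streaming/hadoop-0.17.2.1-streaming.jar \
    -outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat \
    -jobconf mapred.output.compress=true \
    -jobconf mapred.output.compression.type=BLOCK \
    -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec \
    ...

-C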
On Sep 19, 2008, at 8:46 AM, Alex Feinberg wrote:
Hi Chris,
I was also unable to decompress by simply running a map/reduce job
with "cat" as the mapper, or by doing dfs -get, either.
I will try using LzopCodec.
Thanks,
- Alex
On Fri, Sep 19, 2008 at 2:34 AM, Chris Douglas <[EMAIL PROTECTED]inc.com> wrote:
It's probably not corrupted. If by "compressed lzo file" you mean
something readable with lzop, you should use LzopCodec, not LzoCodec.
LzoCodec doesn't write the header information required by that tool.

Guessing at the output format (length-encoded blocks of data
compressed by the lzo algorithm), it's probably readable by
TextInputFormat, but YMMV. If you wanted to use the C tool, you'll
have to add the appropriate header (see the lzop source or LzopCodec)
using a hex editor, and four zero bytes at the end of the file. You
can also use lzo compression in SequenceFiles.
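To pull the data back out through TextInputFormat, an identity
streaming job with the codec registered for that job might work
(untested; assumes the part files kept the codec's ".lzo" extension,
the native lzo library is available on the tasktrackers, and the
output path is just an example):

  ~/hadoop/bin/hadoop jar \
    /home/hadoop/hadoop/contrib/streaming/hadoop-0.17.2.1-streaming.jar \
    -jobconf io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.LzoCodec \
    -mapper cat -reducer NONE \
    -input crawl.lzo -output crawl-text

-C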
On Sep 18, 2008, at 9:15 PM, Alex Feinberg wrote:
Hello,
I am running a custom crawler (written internally) using hadoop
streaming. I am attempting to compress the output using LZO, but
instead I am receiving corrupted output that is neither in the format
I am aiming for nor a compressed lzo file. Is this a known issue? Is
there anything I am doing inherently wrong?
Here is the command line I am using:
  ~/hadoop/bin/hadoop jar \
    /home/hadoop/hadoop/contrib/streaming/hadoop-0.17.2.1-streaming.jar \
    -inputformat org.apache.hadoop.mapred.SequenceFileAsTextInputFormat \
    -mapper /home/hadoop/crawl_map -reducer NONE \
    -jobconf mapred.output.compress=true \
    -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec \
    -input pages -output crawl.lzo -jobconf mapred.reduce.tasks=0
The input is in the form of URLs stored as a SequenceFile.
When running this without LZO compression, no such issue occurs.
Is there any way for me to recover the corrupted data so as to be
able to process it with other hadoop jobs or offline?
Thanks,
--
Alex Feinberg
Platform Engineer, SocialMedia Networks