It's probably not corrupted. If by "compressed lzo file" you mean
something readable with lzop, you should use LzopCodec, not LzoCodec.
LzoCodec doesn't write the header information required by that tool.
Guessing at the output format (length-encoded blocks of data
compressed by the lzo algorithm), it's probably readable by
TextInputFormat, but YMMV. If you want to use the C tool, you'll
have to prepend the appropriate header (see the lzop source or
LzopCodec) using a hex editor, and append four zero bytes to the end
of the file. You can also use lzo compression in SequenceFiles.
-C
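
For offline recovery, here is a rough Python sketch of walking the framing guessed at above. It assumes, without my having verified it against your files, that each block starts with a 4-byte big-endian uncompressed length, followed by one or more chunks that each carry a 4-byte big-endian compressed length and then the compressed bytes. The names read_exact and iter_blocks are made up for illustration, and decompress is a placeholder for whatever raw LZO routine you have available; it is not part of any particular library.

```python
import struct
from typing import BinaryIO, Callable, Iterator


def read_exact(stream: BinaryIO, n: int) -> bytes:
    """Read exactly n bytes, or raise EOFError if the stream ends early."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError(f"expected {n} bytes, got {len(data)}")
    return data


def iter_blocks(
    stream: BinaryIO,
    decompress: Callable[[bytes, int], bytes],
) -> Iterator[bytes]:
    """Yield uncompressed blocks from a length-prefixed stream.

    ASSUMED layout (a guess, not a confirmed specification):
      [4-byte BE uncompressed block length]
      then chunks until that many bytes have been produced:
        [4-byte BE compressed chunk length][compressed chunk bytes]

    decompress(chunk, max_out) is a caller-supplied placeholder that
    returns the uncompressed bytes for one chunk.
    """
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of file
        if len(header) != 4:
            raise EOFError("truncated block header")
        (raw_len,) = struct.unpack(">I", header)

        produced = bytearray()
        while len(produced) < raw_len:
            (chunk_len,) = struct.unpack(">I", read_exact(stream, 4))
            chunk = read_exact(stream, chunk_len)
            produced += decompress(chunk, raw_len - len(produced))
        yield bytes(produced)
```

To try it, open one of the part files in binary mode and pass it in along with your LZO routine, e.g. `for block in iter_blocks(open(path, "rb"), my_lzo_decompress)`, where my_lzo_decompress is again hypothetical. If the lengths it reads look like garbage, the framing assumption is wrong for your Hadoop version.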
On Sep 18, 2008, at 9:15 PM, Alex Feinberg wrote:
Hello,
I am running a custom crawler (written internally) using Hadoop
streaming. I am attempting to compress the output using LZO, but
instead I am receiving corrupted output that is neither in the
format I am aiming for nor a compressed lzo file. Is this a known
issue? Is there anything I am doing inherently wrong?
Here is the command line I am using:
~/hadoop/bin/hadoop jar \
  /home/hadoop/hadoop/contrib/streaming/hadoop-0.17.2.1-streaming.jar \
  -inputformat org.apache.hadoop.mapred.SequenceFileAsTextInputFormat \
  -mapper /home/hadoop/crawl_map -reducer NONE \
  -jobconf mapred.output.compress=true \
  -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec \
  -input pages -output crawl.lzo \
  -jobconf mapred.reduce.tasks=0
The input is in the form of URLs stored as a SequenceFile.
When running this without LZO compression, no such issue occurs.
Is there any way for me to recover the corrupted data so that I can
process it with other hadoop jobs or offline?
Thanks,
--
Alex Feinberg
Platform Engineer, SocialMedia Networks