Hello, I am running a custom crawler (written internally) using hadoop streaming. I am attempting to compress the output using LZO, but the output I get is corrupted: it is neither in the format I am aiming for nor a valid compressed LZO file. Is this a known issue? Is there anything I am doing inherently wrong?
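In case it helps narrow this down, a minimal round-trip check of the codec outside of streaming would look roughly like the sketch below (the test data is a placeholder, and this assumes LzoCodec and its native LZO library are actually installed on the node; I have not confirmed that this is where the problem lies):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.LzoCodec;

public class LzoRoundTrip {
    public static void main(String[] args) throws IOException {
        // Same codec class the streaming job is configured with.
        LzoCodec codec = new LzoCodec();
        codec.setConf(new Configuration());

        byte[] original = "http://example.com/page".getBytes(); // placeholder data

        // Compress into an in-memory buffer; close() flushes the compressor.
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        OutputStream out = codec.createOutputStream(compressed);
        out.write(original);
        out.close();

        // Decompress and compare. An exception here (for example, about the
        // native lzo library not being available) would point at the codec
        // setup on the node rather than at streaming itself.
        InputStream in = codec.createInputStream(
            new ByteArrayInputStream(compressed.toByteArray()));
        ByteArrayOutputStream decompressed = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            decompressed.write(chunk, 0, n);
        }
        in.close();

        System.out.println("round trip ok: "
            + Arrays.equals(original, decompressed.toByteArray()));
    }
}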
Here is the command line I am using:

~/hadoop/bin/hadoop jar /home/hadoop/hadoop/contrib/streaming/hadoop-0.17.2.1-streaming.jar \
    -inputformat org.apache.hadoop.mapred.SequenceFileAsTextInputFormat \
    -mapper /home/hadoop/crawl_map \
    -reducer NONE \
    -jobconf mapred.output.compress=true \
    -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec \
    -input pages \
    -output crawl.lzo \
    -jobconf mapred.reduce.tasks=0

The input is a SequenceFile of URLs. When I run this without LZO compression, no such issue occurs.

Is there any way for me to recover the corrupted data so that I can process it with other hadoop jobs or offline?

Thanks,

-- 
Alex Feinberg
Platform Engineer, SocialMedia Networks
