Re: LzopCodec and SequenceFile?

Harsh J Fri, 15 Jun 2012 03:59:55 -0700

Hey Joaquin,

When using SequenceFiles, use LzoCodec rather than LzopCodec. The
reason is that SequenceFile is a container format of its own, just
like LZOP files are, so it does not make sense to combine the two.
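
As a rough analogy that needs only the JDK (no Hadoop or LZO classes), the
sketch below uses raw DEFLATE to stand in for the headerless stream that
LzoCodec produces and gzip to stand in for the LZOP container. The class and
method names are made up for illustration. It only shows the general failure
mode: a reader that expects a container header rejects data that was written
without one, or that is empty.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class HeaderAnalogy {
    /** Raw DEFLATE with no header: the stand-in for a bare codec stream. */
    public static byte[] rawDeflate(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // nowrap=true suppresses the zlib header and checksum.
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        try (DeflaterOutputStream z = new DeflaterOutputStream(out, deflater)) {
            z.write(data);
        } finally {
            deflater.end();
        }
        return out.toByteArray();
    }

    /** gzip adds its own header and trailer: the stand-in for a container file. */
    public static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream z = new GZIPOutputStream(out)) {
            z.write(data);
        }
        return out.toByteArray();
    }

    /** Returns true if a gzip reader accepts the header at the start of the bytes. */
    public static boolean opensAsGzip(byte[] bytes) {
        // The GZIPInputStream constructor reads and validates the header.
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(bytes))) {
            return true;
        } catch (IOException e) {
            // ZipException (bad magic) or EOFException (input too short).
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello hello hello".getBytes(StandardCharsets.UTF_8);
        System.out.println("raw stream opens as gzip:  " + opensAsGzip(rawDeflate(data)));
        System.out.println("gzip stream opens as gzip: " + opensAsGzip(gzip(data)));
        System.out.println("empty input opens as gzip: " + opensAsGzip(new byte[0]));
    }
}
```

The empty-input case is the closest match to the exception quoted below,
where the header read runs out of bytes.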


For reading sequence files, use the SequenceFile.Reader class
(http://hadoop.apache.org/common/docs/stable/api/org/apache/hadoop/io/SequenceFile.Reader.html)
and it will automatically handle decompressing the K/V fields for you.
You don't have to run lzop or a similar tool first to be able to read
the file, because the compression is applied internally and not over
the entire file.
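
A minimal sketch of such a reader, assuming the Hadoop 0.20-era API shipped
with CDH3, keys and values that are Writables, and the codec classes (plus
any native libraries) available on the classpath. The class name SeqFileDump
and the record limit are made up for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class SeqFileDump {
    /** Prints up to maxRecords records and returns how many were read. */
    public static int dump(Configuration conf, Path path, int maxRecords) throws IOException {
        FileSystem fs = path.getFileSystem(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            // The codec is recorded in the file header; the reader picks it up itself.
            CompressionCodec codec = reader.getCompressionCodec();
            System.out.println("codec: " + (codec == null ? "none" : codec.getClass().getName()));

            // Key and value classes are also stored in the header.
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

            int count = 0;
            // next() decompresses each record (or block) transparently.
            while (count < maxRecords && reader.next(key, value)) {
                System.out.println(key + "\t" + value);
                count++;
            }
            return count;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path of the sequence file to inspect.
        dump(new Configuration(), new Path(args[0]), 10);
    }
}
```

The calling code never names a codec; that choice was made when the file was
written.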

There is also a good explanation of the difference on Quora:
http://www.quora.com/Whats-the-difference-between-the-LzoCodec-and-the-LzopCodec-in-Hadoop-LZO

On Fri, Jun 15, 2012 at 11:34 AM, JOAQUIN GUANTER GONZALBEZ <x...@tid.es> wrote:
> Hello,
>
>
>
> I have a sequence of MR Jobs that are using the SequenceFile for their
> output and input format. If I run them without any compression enabled they
> work fine. If I use the LzoCodec they also work just fine (but then the
> output is not Lzop compatible which is inconvenient).
>
>
>
> If I try using the LzopCodec, then the first MR job (which reads from a
> TextFile and outputs to a SequenceFile) runs OK, but when the second job
> tries to read what the first job wrote, I get the following exception:
>
>
>
> java.io.EOFException: Premature EOF from inputStream
>         at com.hadoop.compression.lzo.LzopInputStream.readFully(LzopInputStream.java:75)
>         at com.hadoop.compression.lzo.LzopInputStream.readHeader(LzopInputStream.java:114)
>         at com.hadoop.compression.lzo.LzopInputStream.<init>(LzopInputStream.java:54)
>         at com.hadoop.compression.lzo.LzopCodec.createInputStream(LzopCodec.java:83)
>         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1591)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1493)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1480)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
>         at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.initialize(SequenceFileRecordReader.java:50)
>         at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:451)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:646)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
>         at org.apache.ha
>
>
>
> Does anyone know why this could be happening? I’m using the latest
> Cloudera CDH3 distribution and I’m configuring the compression through the
> mapred.output.compression.codec property in the mapred-site.xml file.
>
>
>
> Thanks!
>
> Ximo.
>
>



-- 
Harsh J
