On Sun, Feb 26, 2012 at 1:49 PM, Harsh J <ha...@cloudera.com> wrote:
> Hi Mohit,
>
> On Sun, Feb 26, 2012 at 10:42 PM, Mohit Anchlia <mohitanch...@gmail.com> 
> wrote:
>> Thanks! Some questions I have are:
>> 1. Would it work with sequence files? I am using
>> SequenceFileAsTextInputStream
>
> Yes, you just need to set the right codec when you write the file.
> Reading is then normal as reading a non-compressed sequence-file.
>
> The codec classnames are stored as meta information into sequence
> files and are read back to load the right codec for the reader - thus
> you don't have to specify a 'reader' codec once you are done writing a
> file with any codec of choice.
>
>> 2. If I use SequenceFile.CompressionType.RECORD or BLOCK would it still
>> split the files?
>
> Yes SequenceFiles are a natively splittable file format, designed for
> HDFS and MapReduce. Compressed sequence files are thus splittable too.
>
> You mostly need block compression unless your records are large in
> size and you feel you'll benefit better with compression algorithms
> applied to a single, complete record instead of a bunch of records.
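
To make the two answers above concrete, here is a minimal sketch against
the 0.20-era SequenceFile API (the local path, DefaultCodec, and Text
key/value types are just placeholder choices; Hadoop must be on the
classpath) of writing a block-compressed sequence file and then reading
it back without ever naming the codec:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class SeqFileCompressionDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    Path file = new Path("/tmp/demo.seq"); // placeholder path

    // Write with BLOCK compression; the codec classname is stored
    // as metadata in the file header.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, file, Text.class, Text.class,
        SequenceFile.CompressionType.BLOCK, new DefaultCodec());
    writer.append(new Text("key"), new Text("value"));
    writer.close();

    // Read back: no codec is specified here; the reader loads the
    // right one from the file's own metadata.
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
    Text k = new Text();
    Text v = new Text();
    while (reader.next(k, v)) {
      System.out.println(k + " => " + v);
    }
    reader.close();
  }
}
```

Swapping DefaultCodec for any other installed codec (GzipCodec, SnappyCodec,
etc.) changes only the writer line; the read path stays identical.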
>
>> 3. I am also using CDH's 20.2 version of hadoop.
>
> http://www.cloudera.com/assets/images/diagrams/whats-in-a-version.png :)
>
> --
> Harsh J

LZO confuses most people because of how it was added to and then removed
from Hadoop. There is also a system that makes raw LZO files splittable
by indexing them.

I have just patched google-snappy into 0.20.2. Snappy has a performance
profile similar to LZO: good compression with low processor overhead. It
does not have LZO's licence issues, and there are not thousands of
semi-contradictory/confusing write-ups floating around, so it ends up
being easier to set up and use.

http://code.google.com/p/snappy/

Recent versions of Hadoop have Snappy support built in, so it will just
work out of the box.
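
For job output, a rough sketch of wiring Snappy in through the 0.20-era
"old" mapred API would look like the following (this assumes a Hadoop
build where org.apache.hadoop.io.compress.SnappyCodec and its native
library are available; on a stock 0.20.2 you need the patch mentioned
above):

```java
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class SnappyJobSetup {
  public static JobConf configure(JobConf job) {
    // Emit sequence files, block-compressed with Snappy.
    job.setOutputFormat(SequenceFileOutputFormat.class);
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(job,
        SequenceFile.CompressionType.BLOCK);
    return job;
  }
}
```

The same three settings can equivalently be put in mapred-site.xml via
mapred.output.compress, mapred.output.compression.type, and
mapred.output.compression.codec.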

Edward