On Sun, Feb 26, 2012 at 1:49 PM, Harsh J <ha...@cloudera.com> wrote:

> Hi Mohit,
>
> On Sun, Feb 26, 2012 at 10:42 PM, Mohit Anchlia <mohitanch...@gmail.com>
> wrote:
>> Thanks! Some questions I have is:
>> 1. Would it work with sequence files? I am using
>> SequenceFileAsTextInputStream
>
> Yes, you just need to set the right codec when you write the file.
> Reading is then normal as reading a non-compressed sequence-file.
>
> The codec classnames are stored as meta information into sequence
> files and are read back to load the right codec for the reader - thus
> you don't have to specify a 'reader' codec once you are done writing a
> file with any codec of choice.
>
>> 2. If I use SequenceFile.CompressionType.RECORD or BLOCK would it still
>> split the files?
>
> Yes SequenceFiles are a natively splittable file format, designed for
> HDFS and MapReduce. Compressed sequence files are thus splittable too.
>
> You mostly need block compression unless your records are large in
> size and you feel you'll benefit better with compression algorithms
> applied to a single, complete record instead of a bunch of records.
>
>> 3. I am also using CDH's 20.2 version of hadoop.
>
> http://www.cloudera.com/assets/images/diagrams/whats-in-a-version.png :)
>
> --
> Harsh J
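
To illustrate the splitting point above, here is a toy model in plain Java (java.util.zip only). It is not the real SequenceFile on-disk format and it does not use the Hadoop API: the class ToyBlockFile, the fixed SYNC marker and the header layout are all made up for this sketch. It only shows the general idea, which is that each batch of records is compressed independently and preceded by a sync marker, so a reader handed an arbitrary byte range can skip ahead to the next marker and start decoding from there.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Toy model only: NOT the real SequenceFile format or API. */
class ToyBlockFile {
    // Hypothetical fixed 16-byte marker. This toy ignores the (tiny) chance
    // that the same bytes show up inside compressed data.
    static final byte[] SYNC = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);
    static final int HEADER = 8; // two ints: raw length, compressed length

    /** Compresses records in independent blocks; records must not contain '\n'. */
    static byte[] write(List<String> records, int recordsPerBlock) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < records.size(); i += recordsPerBlock) {
            List<String> batch = records.subList(i, Math.min(i + recordsPerBlock, records.size()));
            byte[] raw = String.join("\n", batch).getBytes(StandardCharsets.UTF_8);
            byte[] packed = deflate(raw);
            out.write(SYNC, 0, SYNC.length);
            out.write(ByteBuffer.allocate(HEADER).putInt(raw.length).putInt(packed.length).array(), 0, HEADER);
            out.write(packed, 0, packed.length);
        }
        return out.toByteArray();
    }

    /** Reads every block whose sync marker starts inside [start, end). */
    static List<String> readSplit(byte[] file, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = indexOfSync(file, start);
        while (pos >= 0 && pos < end && pos + SYNC.length + HEADER <= file.length) {
            ByteBuffer header = ByteBuffer.wrap(file, pos + SYNC.length, HEADER);
            int rawLen = header.getInt();
            int packedLen = header.getInt();
            int dataStart = pos + SYNC.length + HEADER;
            byte[] packed = Arrays.copyOfRange(file, dataStart, dataStart + packedLen);
            String text = new String(inflate(packed, rawLen), StandardCharsets.UTF_8);
            records.addAll(Arrays.asList(text.split("\n", -1)));
            pos = indexOfSync(file, dataStart + packedLen);
        }
        return records;
    }

    /** Returns the offset of the first sync marker at or after 'from', or -1. */
    static int indexOfSync(byte[] file, int from) {
        for (int i = Math.max(from, 0); i + SYNC.length <= file.length; i++) {
            if (Arrays.equals(Arrays.copyOfRange(file, i, i + SYNC.length), SYNC)) return i;
        }
        return -1;
    }

    static byte[] deflate(byte[] raw) {
        Deflater d = new Deflater();
        d.setInput(raw);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!d.finished()) {
            int n = d.deflate(buf);
            out.write(buf, 0, n);
        }
        d.end();
        return out.toByteArray();
    }

    static byte[] inflate(byte[] packed, int rawLen) {
        Inflater inf = new Inflater();
        inf.setInput(packed);
        byte[] raw = new byte[rawLen];
        try {
            int off = 0;
            while (off < rawLen) {
                int n = inf.inflate(raw, off, rawLen - off);
                if (n == 0 && (inf.finished() || inf.needsInput())) break;
                off += n;
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inf.end();
        }
        return raw;
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 10; i++) records.add("rec-" + i);
        byte[] file = write(records, 3);
        int mid = file.length / 2; // arbitrary split point, usually mid-block
        System.out.println("first split:  " + readSplit(file, 0, mid));
        System.out.println("second split: " + readSplit(file, mid, file.length));
    }
}
```

The two calls to readSplit stand in for two map tasks: each one decodes only the blocks that begin inside its own byte range, and together they cover every record exactly once even though the split point falls in the middle of a compressed block.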

LZO confuses most people because of how it was added and then removed. There is also a system to make raw LZO files splittable by indexing them.

I have just patched google-snappy into 0.20.2. Snappy has a performance profile similar to LZO's: good compression with low processor overhead. It does not have the licence issues, and there are not thousands of semi-contradictory, confusing pieces of information about it, so it ends up being easier to set up and use.

http://code.google.com/p/snappy/

Recent versions of Hadoop have Snappy built in, so it will just work out of the box.

Edward