Re: LZO with sequenceFile

2012-02-26 Thread Shi Yu
Hi, 

There is plenty of documentation on this. Try searching Google for
kevinweil-hadoop-lzo.

Shi


Re: LZO with sequenceFile

2012-02-26 Thread Ioan Eugen Stan
2012/2/26 Mohit Anchlia mohitanch...@gmail.com:
 Thanks. Does it mean LZO is not installed by default? How can I install LZO?

The LZO library is released under the GPL, and I believe it can't be
included in most distributions of Hadoop because of this (GPL code can't
be mixed with non-GPL code). It should be easily available though.

-- 
Ioan Eugen Stan
http://ieugen.blogspot.com/


Re: LZO with sequenceFile

2012-02-26 Thread Harsh J
If you just want to quickly package the hadoop-lzo items instead of
building and managing the deployment on your own, you can reuse Todd
Lipcon's script at https://github.com/toddlipcon/hadoop-lzo-packager,
which creates both RPMs and DEBs.

-- 
Harsh J


Re: LZO with sequenceFile

2012-02-26 Thread Mohit Anchlia
On Sun, Feb 26, 2012 at 9:09 AM, Harsh J ha...@cloudera.com wrote:

 If you want to just quickly package the hadoop-lzo items instead of
 building/managing-deployment on your own, you can reuse Todd Lipcon's
 script at https://github.com/toddlipcon/hadoop-lzo-packager - Creates
 both RPMs and DEBs.


Thanks! I have some questions:
1. Would it work with sequence files? I am using
SequenceFileAsTextInputFormat
2. If I use SequenceFile.CompressionType.RECORD or BLOCK, would the files
still be splittable?
3. I am also using CDH's 20.2 version of Hadoop.






Re: LZO with sequenceFile

2012-02-26 Thread Harsh J
Hi Mohit,

On Sun, Feb 26, 2012 at 10:42 PM, Mohit Anchlia mohitanch...@gmail.com wrote:
 Thanks! I have some questions:
 1. Would it work with sequence files? I am using
 SequenceFileAsTextInputFormat

Yes, you just need to set the right codec when you write the file.
Reading then works the same as reading an uncompressed sequence file.

The codec class names are stored as metadata in sequence files and are
read back to load the right codec for the reader. Thus you don't have
to specify a 'reader' codec once you are done writing a file with the
codec of your choice.
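
To illustrate why the reader needs no codec setting, here is a toy sketch
in plain Java. It is not the real SequenceFile format and does not use
Hadoop at all: the class SelfDescribingFile, its methods, and the codec
names "gzip" and "deflate" are all made up for this example. The only point
is that the writer records the codec name in a header and the reader picks
the decompressor from that header.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical toy container: a codec-name header followed by a compressed body.
class SelfDescribingFile {

    // Writer side: the codec is chosen here and its name is stored in the header.
    static byte[] write(String codecName, byte[] payload) {
        try {
            ByteArrayOutputStream file = new ByteArrayOutputStream();
            new DataOutputStream(file).writeUTF(codecName);
            try (OutputStream body = compressor(codecName, file)) {
                body.write(payload);
            }
            return file.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the codec name recorded in the header.
    static String codecOf(byte[] fileBytes) {
        try {
            return new DataInputStream(new ByteArrayInputStream(fileBytes)).readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reader side: no codec argument, it is looked up from the header.
    static byte[] read(byte[] fileBytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(fileBytes));
            String codecName = in.readUTF();
            try (InputStream body = decompressor(codecName, in)) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                int n;
                while ((n = body.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return out.toByteArray();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static OutputStream compressor(String name, OutputStream out) throws IOException {
        switch (name) {
            case "gzip": return new GZIPOutputStream(out);
            case "deflate": return new DeflaterOutputStream(out);
            default: throw new IllegalArgumentException("unknown codec: " + name);
        }
    }

    private static InputStream decompressor(String name, InputStream in) throws IOException {
        switch (name) {
            case "gzip": return new GZIPInputStream(in);
            case "deflate": return new InflaterInputStream(in);
            default: throw new IllegalArgumentException("unknown codec: " + name);
        }
    }
}
```

Calling SelfDescribingFile.read(SelfDescribingFile.write("gzip", data))
gives back data without the caller of read ever naming a codec, which is
the same idea as a sequence file reader loading the codec class named in
the file.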

 2. If I use SequenceFile.CompressionType.RECORD or BLOCK would it still
 split the files?

Yes, SequenceFiles are a natively splittable file format, designed for
HDFS and MapReduce. Compressed sequence files are thus splittable too.

You mostly want block compression, unless your records are large and
you feel you'll benefit more from the compression algorithm being
applied to a single, complete record instead of a bunch of records.
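
As a rough illustration of why block compression usually wins for small
records, here is a sketch in plain Java using java.util.zip.Deflater as a
stand-in codec. This is not Hadoop code: the class RecordVsBlockDemo, its
methods, and the sample record layout are invented for the example, and the
real BLOCK mode also handles keys, values, and sync markers, which this
ignores.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

// Hypothetical demo: compress records one at a time versus as one batch.
class RecordVsBlockDemo {

    // Number of bytes Deflater produces for the given input.
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[1024];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf);
        }
        deflater.end();
        return total;
    }

    // "RECORD" style: every record is compressed on its own.
    static int recordCompressedSize(List<byte[]> records) {
        int total = 0;
        for (byte[] record : records) {
            total += deflatedSize(record);
        }
        return total;
    }

    // "BLOCK" style: records are batched and compressed together.
    static int blockCompressedSize(List<byte[]> records) {
        ByteArrayOutputStream block = new ByteArrayOutputStream();
        for (byte[] record : records) {
            block.write(record, 0, record.length);
        }
        return deflatedSize(block.toByteArray());
    }

    // Small, similar-looking records (made-up layout).
    static List<byte[]> sampleRecords(int count) {
        List<byte[]> records = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            records.add(("user=" + i + ",status=ok").getBytes(StandardCharsets.UTF_8));
        }
        return records;
    }

    public static void main(String[] args) {
        List<byte[]> records = sampleRecords(1000);
        System.out.println("per-record: " + recordCompressedSize(records) + " bytes");
        System.out.println("per-block:  " + blockCompressedSize(records) + " bytes");
    }
}
```

With many short, similar records the batched size comes out smaller,
because the compressor can reuse patterns across records and pays its fixed
per-stream overhead only once. With large records that gap narrows, which
is when per-record compression becomes reasonable.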

 3. I am also using CDH's 20.2 version of hadoop.

http://www.cloudera.com/assets/images/diagrams/whats-in-a-version.png :)

-- 
Harsh J


Re: LZO with sequenceFile

2012-02-26 Thread Edward Capriolo

LZO confuses most people because of how it was added and then removed.
There is also a system to make raw LZO files splittable by indexing
them.

I have just patched google-snappy into 0.20.2. Snappy has a similar
performance profile to LZO: good compression with low processor
overhead. It does not have the licence issues, and there aren't
thousands of pieces of semi-contradictory, confusing information about
it, so it ends up being easier to set up and use.

http://code.google.com/p/snappy/

Recent versions of Hadoop have Snappy built in, so it will just work out
of the box.

Edward


Re: LZO with sequenceFile

2012-02-25 Thread Shi Yu
Yes, it is supported by Hadoop sequence file. It is splittable
by default. If you have installed and specified LZO correctly,
use these:

    org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
        .setCompressOutput(job, true);

    org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
        .setOutputCompressorClass(job,
            com.hadoop.compression.lzo.LzoCodec.class);

    org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
        .setOutputCompressionType(job,
            SequenceFile.CompressionType.BLOCK);

    job.setOutputFormatClass(
        org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.class);


Shi


Re: LZO with sequenceFile

2012-02-25 Thread Mohit Anchlia
Thanks. Does it mean LZO is not installed by default? How can I install LZO?
