Hi Arun,

thanks for your reply; I am CCing this e-mail to hadoop-dev. I will create
the appropriate JIRA tickets today. Here are a few insights from my
experience with Hadoop compression (all my comments apply to 0.13.0):

1. Map output compression: aside from the issue I mentioned to you guys about
not being able to choose two different codecs for map output and overall job
output, it works very well for us. I have been using non-native map output
compression on jobs that generate over 6 TB of data with no problems. Since I
am on 0.13.0, because of HADOOP-1193 I could test native LZO on very small
jobs only. Our benchmarks show no degradation in performance whatsoever when
using native LZO.
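
For reference, this is roughly how we enable it in hadoop-site.xml (property
names as in 0.13.0; note that "mapred.output.compression.codec" is currently
shared between map outputs and job outputs, which is exactly the
codec-selection issue above, and LzoCodec is simply what our setup uses):

```xml
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.LzoCodec</value>
</property>
```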

2. Compression type configuration: we noticed a small issue with the
configuration here. If "io.seqfile.compression.type" is set to NONE in
hadoop-site.xml, M/R jobs will not do any compression and there is no way to
override it programmatically. In fact, each worker machine ends up using the
value read from its local hadoop conf folder. I like the fact that each
worker reads this property locally when creating generic SequenceFile(s),
but, IMHO, the behavior of M/R jobs should be controlled by the JobConf
only. This issue is very easy to reproduce.
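
A minimal reproduction, as a hadoop-site.xml fragment on the worker machines
(property name as in 0.13.0):

```xml
<property>
  <name>io.seqfile.compression.type</name>
  <value>NONE</value>
  <!-- With this set in a worker's local conf, M/R jobs stop compressing,
       and the value cannot be overridden programmatically via JobConf. -->
</property>
```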

3. Non-native GzipCodec: the codec falls back to Java's
java.util.zip.GZIPOutputStream and java.util.zip.GZIPInputStream when native
compression is not available. However, lines 197, 238, 299, and 357 of
SequenceFile (basically all the createWriter() methods that select a
compression codec) will throw an IllegalArgumentException if the GzipCodec
is selected but the native library is *not* available. Why is that, given
that the pure-Java fallback exists?
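
For what it's worth, here is a minimal, self-contained sketch of the
pure-Java round trip that the non-native fallback relies on (plain
java.util.zip, no Hadoop classes; the class and method names are mine, for
illustration only):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipFallbackDemo {
    // Compress with the pure-Java stream that GzipCodec can fall back to
    // when the native zlib library is unavailable.
    static byte[] gzip(byte[] data) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        GZIPOutputStream out = new GZIPOutputStream(bos);
        out.write(data);
        out.close(); // finishes the gzip trailer
        return bos.toByteArray();
    }

    // Decompress with the matching pure-Java input stream.
    static byte[] gunzip(byte[] data) throws Exception {
        GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data));
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            bos.write(buf, 0, n);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] original = "hello hadoop".getBytes("UTF-8");
        byte[] roundTrip = gunzip(gzip(original));
        System.out.println(new String(roundTrip, "UTF-8")); // prints "hello hadoop"
    }
}
```

Since this works without any native code, it is unclear why createWriter()
refuses the codec outright instead of using it.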

4. Reduce progress reporting when consuming compressed map outputs: it is
generally incorrect, with reducers reporting over 220% completion. This
happens regardless of whether native compression is used or not.

Best,

Riccardo


On 9/5/07, Arun C Murthy <[EMAIL PROTECTED]> wrote:
>
> Hi Riccardo,
>
> On Tue, Sep 04, 2007 at 12:12:19PM -0700, Nt Never wrote:
> >Thanks Devaraj, good to hear from you.
> >
> >Actually, if you guys are interested, I have been testing Hadoop
> >compression (native and non-native) in the last 5 days on a cluster of
> >200 machines (running 0.12.3, with HDFS as file system). I have a few
> >insights you guys might be interested in. I am just trying to figure out
> >what the proper channels would be, that is why I contacted you first.
> >Thanks.
> >
>
> You are absolutely correct. Please file a jira (and a patch if you are so
> inclined! *smile*) to request a separate property for the 2 codecs.
>
> We'd love to hear any insights/opinion/ideas about the compression stuff
> you've been working on, please don't hesitate to mail hadoop-dev@ or file
> jira issues about any of them...
>
> thanks!
> Arun
>
> >Riccardo
> >
> >
> >On 9/4/07, Devaraj Das <[EMAIL PROTECTED]> wrote:
> >>
> >>  Hi Riccardo,
> >> Thanks for contacting me. I am doing good and hope you are doing great
> >> too!
> >> I am copying this mail to Arun who is our compression expert. Arun pls
> >> respond to the mail.
> >> Thanks,
> >> Devaraj
> >>
> >>  ------------------------------
> >> *From:* Nt Never [mailto:[EMAIL PROTECTED]
> >> *Sent:* Tuesday, September 04, 2007 10:24 PM
> >> *To:* [EMAIL PROTECTED]
> >> *Subject:* map output compression codec setting issue
> >>
> >> Hi Devaraj,
> >>
> >> how have you been doing? I finally got around to doing some extensive
> >> testing with Hadoop's compression. I am aware of HADOOP-1193 and
> >> HADOOP-1545, so I am waiting for the release of 0.15.0 before I do more
> >> benchmarks. However, I noticed what seems to be a bug in JobConf. The
> >> property "mapred.output.compression.codec" is used both when setting and
> >> when getting the map output compression codec, thus making it impossible
> >> to use a different codec for map outputs and overall job outputs. The
> >> methods that affect this behavior are in lines 341-371 of JobConf in
> >> Hadoop 0.13.0:
> >>
> >> /**
> >>  * Set the given class as the compression codec for the map outputs.
> >>  * @param codecClass the CompressionCodec class that will compress the
> >>  *                   map outputs
> >>  */
> >> public void setMapOutputCompressorClass(Class<? extends CompressionCodec> codecClass) {
> >>   setCompressMapOutput(true);
> >>   setClass("mapred.output.compression.codec", codecClass,
> >>            CompressionCodec.class);
> >> }
> >>
> >> /**
> >>  * Get the codec for compressing the map outputs
> >>  * @param defaultValue the value to return if it is not set
> >>  * @return the CompressionCodec class that should be used to compress
> >>  *   the map outputs
> >>  * @throws IllegalArgumentException if the class was specified, but not found
> >>  */
> >> public Class<? extends CompressionCodec> getMapOutputCompressorClass(
> >>     Class<? extends CompressionCodec> defaultValue) {
> >>   String name = get("mapred.output.compression.codec");
> >>   if (name == null) {
> >>     return defaultValue;
> >>   } else {
> >>     try {
> >>       return getClassByName(name).asSubclass(CompressionCodec.class);
> >>     } catch (ClassNotFoundException e) {
> >>       throw new IllegalArgumentException("Compression codec " + name +
> >>                                          " was not found.", e);
> >>     }
> >>   }
> >> }
> >>
> >> This could be easily fixed by using a different property, for example
> >> "map.output.compression.codec". Should I create an issue on JIRA for
> >> this? Thanks.
> >>
> >> Riccardo
> >>
> >>
>
