Hi Arun, thanks for your reply, I am CCing this e-mail to hadoop-dev. I will create the appropriate JIRA tickets today. Here are a few insights about my experience with Hadoop compression (all my comments apply to 0.13.0):
1. Map output compression: aside from the issue I mentioned to you guys about choosing two different codecs for map output and overall job output, it works very well for us. I have been using non-native map output compression on jobs that generate over 6 TB of data with no problems. Since I am on 0.13.0, because of HADOOP-1193 I could test native LZO on very small jobs only; our benchmarks show no degradation in performance whatsoever when using native LZO.

2. Compression type configuration: we noticed a small issue with the configuration here. If "io.seqfile.compression.type" is set to NONE in hadoop-site.xml, M/R jobs will not do any compression, and there is no way to override it programmatically. As a matter of fact, each worker machine ends up using the value read from its local hadoop conf folder. I like the fact that each worker reads this property locally when creating generic SequenceFile(s), but, IMHO, the behavior of M/R jobs should be controlled by JobConf only. This issue is very easy to reproduce.

3. Non-native GzipCodec: the codec falls back to Java's java.util.zip.GZIPOutputStream and java.util.zip.GZIPInputStream when native compression is not available. However, lines 197, 238, 299, and 357 of SequenceFile (basically all the createWriter() methods that select a compression codec) will throw an IllegalArgumentException if GzipCodec is selected but the native library is *not* available. Why is that?

4. Reduce reported progress when consuming compressed map outputs: it is generally incorrect, with reducers reporting over 220% completion. This happens regardless of whether native compression is used or not.

Best,

Riccardo

On 9/5/07, Arun C Murthy <[EMAIL PROTECTED]> wrote:
>
> Hi Riccardo,
>
> On Tue, Sep 04, 2007 at 12:12:19PM -0700, Nt Never wrote:
> >Thanks Devaraj, good to hear from you.
> >
> >Actually, if you guys are interested, I have been testing Hadoop compression
> >(native and non-native) in the last 5 days on a cluster of 200 machines
> >(running 0.12.3, with HDFS as the file system). I have a few insights you
> >guys might be interested in. I am just trying to figure out what the proper
> >channels would be, which is why I contacted you first. Thanks.
> >
>
> You are absolutely correct. Please file a jira (and a patch if you are so
> inclined! *smile*) to request a separate property for the 2 codecs.
>
> We'd love to hear any insights/opinions/ideas about the compression stuff
> you've been working on, please don't hesitate to mail hadoop-dev@ or file
> jira issues about any of them...
>
> thanks!
> Arun
>
> >Riccardo
> >
> >
> >On 9/4/07, Devaraj Das <[EMAIL PROTECTED]> wrote:
> >>
> >> Hi Riccardo,
> >> Thanks for contacting me. I am doing good and hope you are doing great
> >> too!
> >> I am copying this mail to Arun, who is our compression expert. Arun, pls
> >> respond to the mail.
> >> Thanks,
> >> Devaraj
> >>
> >> ------------------------------
> >> *From:* Nt Never [mailto:[EMAIL PROTECTED]]
> >> *Sent:* Tuesday, September 04, 2007 10:24 PM
> >> *To:* [EMAIL PROTECTED]
> >> *Subject:* map output compression codec setting issue
> >>
> >> Hi Devaraj,
> >>
> >> how have you been doing? I finally got around to doing some extensive
> >> testing with Hadoop's compression. I am aware of HADOOP-1193 and
> >> HADOOP-1545, so I am waiting for the release of 0.15.0 before I do more
> >> benchmarks. However, I noticed what seems to be a bug in JobConf. The
> >> property "mapred.output.compression.codec" is used both when setting and
> >> when getting the map output compression codec, thus making it impossible
> >> to use a different codec for map outputs and overall job outputs.
> >> The methods that affect this behavior are at lines 341-371 of JobConf
> >> in Hadoop 0.13.0:
> >>
> >>   /**
> >>    * Set the given class as the compression codec for the map outputs.
> >>    * @param codecClass the CompressionCodec class that will compress the
> >>    *                   map outputs
> >>    */
> >>   public void setMapOutputCompressorClass(Class<? extends CompressionCodec>
> >>       codecClass) {
> >>     setCompressMapOutput(true);
> >>     setClass("mapred.output.compression.codec", codecClass,
> >>              CompressionCodec.class);
> >>   }
> >>
> >>   /**
> >>    * Get the codec for compressing the map outputs
> >>    * @param defaultValue the value to return if it is not set
> >>    * @return the CompressionCodec class that should be used to compress the
> >>    *         map outputs
> >>    * @throws IllegalArgumentException if the class was specified, but not
> >>    *         found
> >>    */
> >>   public Class<? extends CompressionCodec> getMapOutputCompressorClass(
> >>       Class<? extends CompressionCodec> defaultValue) {
> >>     String name = get("mapred.output.compression.codec");
> >>     if (name == null) {
> >>       return defaultValue;
> >>     } else {
> >>       try {
> >>         return getClassByName(name).asSubclass(CompressionCodec.class);
> >>       } catch (ClassNotFoundException e) {
> >>         throw new IllegalArgumentException("Compression codec " + name +
> >>                                            " was not found.", e);
> >>       }
> >>     }
> >>   }
> >>
> >> This could be easily fixed by using a different property, for example,
> >> "map.output.compression.codec". Should I create an issue on JIRA for this?
> >> Thanks.
> >>
> >> Riccardo
> >>
> >>
>
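P.S. To make the fix I am proposing concrete, here is a rough, self-contained sketch of separate setters/getters over two distinct property keys. The MiniConf class below is only a stand-in for JobConf, and the key name "mapred.map.output.compression.codec" is my suggestion, not an existing Hadoop property:

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for JobConf: a plain key/value store, enough to show why
// the map-output codec needs its own property key.
class MiniConf {
    private final Map<String, String> props = new HashMap<>();

    void set(String key, String value) { props.put(key, value); }
    String get(String key) { return props.get(key); }

    // Proposed: map outputs get their own key, so this setting no
    // longer collides with the job-output codec setting.
    void setMapOutputCompressorClass(String codecClassName) {
        set("mapred.map.output.compression.codec", codecClassName);
    }
    String getMapOutputCompressorClass(String defaultValue) {
        String name = get("mapred.map.output.compression.codec");
        return name == null ? defaultValue : name;
    }

    // Job outputs keep the original key.
    void setOutputCompressorClass(String codecClassName) {
        set("mapred.output.compression.codec", codecClassName);
    }
    String getOutputCompressorClass(String defaultValue) {
        String name = get("mapred.output.compression.codec");
        return name == null ? defaultValue : name;
    }
}

public class SeparateCodecDemo {
    public static void main(String[] args) {
        MiniConf conf = new MiniConf();
        // With one shared key, the second call below would silently
        // overwrite the first; with two keys, both settings survive.
        conf.setMapOutputCompressorClass("LzoCodec");
        conf.setOutputCompressorClass("GzipCodec");
        System.out.println(conf.getMapOutputCompressorClass("DefaultCodec"));
        System.out.println(conf.getOutputCompressorClass("DefaultCodec"));
    }
}
```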
