On Sun, Dec 19, 2010 at 6:14 PM, Scott Carey <[email protected]> wrote:
>
> On Dec 18, 2010, at 1:05 PM, Joe Crobak wrote:
>
> > AvroOutputFormat supports setting deflate level, but not the sync
> > interval. Was this a conscious decision (i.e. would there be drawbacks
> > of making the sync interval larger)?
> >
> > In some tests that I've done, Avro data files were over 50% smaller when
> > I upped the sync interval to 2MB (default is 16000 bytes). I also saw a
> > modest speedup in building the files (I suspect my program was IO-bound).
> >
> > Would folks support a patch to add setting a sync interval as a static
> > configuration option to AvroOutputFormat?
>
> Yes, it makes sense to expose that.
>

In that case, I'd be happy to file a ticket and create a patch.

>
> Out of curiosity, how much of an improvement do you get for going to 64000
> bytes? A larger default for the MapReduce case makes sense, but 2MB may be
> on the large side. M/R has to split the file at sync boundaries and you
> don't want those to end up too far from the HDFS block boundaries.
>

Here are the compression ratios I'm seeing (block size, compression ratio):

  16384    0.217
  32768    0.164
  65536    0.132
  131072   0.116
  262144   0.108
  524288   0.104
  1048576  0.102
  2097152  0.100

So the sweet-spot for this data seems to be around 128K-256K, which is
within 7.7% - 16% of "optimal" (where optimal is the uncompressed file
compressed with command-line gzip).

>
> The file format default is moderately sized because for many non M/R use
> cases, syncing to disk more regularly is a good idea. With the default
> deflate lookback window 32k, compression ratio as a function of block size
> tends to have a sharp elbow near that size. In my experiments, compression
> ratio did not go up after blocks that are about 120k in size, and was only
> moderately better than 16000 byte blocks. But my data isn't your data.
>

Thanks for this suggestion -- I had only looked at the two extremes.
If the ability to configure the size is added, then I should be able to do
some tests to see how these window sizes affect performance for our
application.

Thanks,
Joe
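
For anyone who wants to run a similar block-size sweep on their own data before any patch lands, here is a rough sketch in Python. It is not Avro's writer: it only approximates the effect by deflating fixed-size chunks independently with the standard zlib module (raw deflate, no zlib header) and ignores Avro's per-block overhead such as the sync marker. The names blockwise_ratio and make_sample are hypothetical, and the synthetic records are a stand-in for real data, so the numbers it prints will not match the table above.

```python
import random
import zlib


def raw_deflate_size(block: bytes, level: int) -> int:
    # wbits=-15 asks zlib for a raw deflate stream (no header or checksum).
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
    return len(compressor.compress(block)) + len(compressor.flush())


def blockwise_ratio(data: bytes, block_size: int, level: int = 6) -> float:
    """Deflate `data` in independent chunks of `block_size` bytes and
    return total compressed size divided by the original size."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if not data:
        raise ValueError("data must be non-empty")
    compressed = 0
    for start in range(0, len(data), block_size):
        compressed += raw_deflate_size(data[start:start + block_size], level)
    return compressed / len(data)


def make_sample(n_records: int = 20000, seed: int = 0) -> bytes:
    # Hypothetical stand-in for real records: repetitive, text-like rows.
    rng = random.Random(seed)
    rows = [
        f"user{i % 500},event{rng.randint(0, 20)},{rng.random():.6f}\n"
        for i in range(n_records)
    ]
    return "".join(rows).encode("ascii")


if __name__ == "__main__":
    sample = make_sample()
    for size in (16384, 32768, 65536, 131072, 262144):
        print(size, round(blockwise_ratio(sample, size), 3))
```

Swapping make_sample for the serialized bytes of real records should give a first idea of where the elbow sits for a given data set, though the actual Avro file sizes are what matter in the end.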
