Hi Sriguru,

Thank you for the tips.  Just to clarify a few things.

Our machines have 32 GB of RAM.

I'm planning on setting each machine to run 12 mappers and 2 reducers with
the heap size set to 2048 MB per task, so total heap usage would be about 28 GB.

If that's the case, should io.sort.mb be set to 70% of 2048 MB (so ~1400 MB)?
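
For reference, here is roughly what I was planning to put in mapred-site.xml
(property names are from our 0.20-style configs, so please correct me if I
have any of them wrong):

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>12</value>    <!-- 12 map slots per machine -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>     <!-- 2 reduce slots per machine -->
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx2048m</value>  <!-- 14 task slots x 2 GB = ~28 GB of heap -->
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>1400</value>  <!-- ~70% of the 2 GB task heap, per your rule of thumb -->
  </property>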

Also, I did not see an fs.inmemorysize.mb setting in any of the Hadoop
configuration files.  Is that the correct setting I should be looking for?
Should it also be set to 70% of the heap size, or does it need to share
the heap with the io.sort.mb setting?

I assume that if I'm bumping up io.sort.mb that much, I also need to increase
io.sort.factor from its default of 10.  Is there a recommended relationship
between the two?
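
My possibly-wrong understanding is that io.sort.factor caps how many spill
segments get merged in a single pass, so with a much larger sort buffer I was
going to try raising it to something like this (the value 100 is just a guess
on my part):

  <property>
    <name>io.sort.factor</name>
    <value>100</value>  <!-- merge up to 100 spill segments per pass -->
  </property>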

Thank you for your help!

~Ed

On Sun, Sep 26, 2010 at 3:05 AM, Srigurunath Chakravarthi <
[email protected]> wrote:

> Ed,
>  Tuning io.sort.mb will certainly be worthwhile if you have enough RAM to
> allow a higher Java heap per map task without risking swapping.
>
>  Similarly, you can decrease spills on the reduce side using
> fs.inmemorysize.mb.
>
> You can use the following rules of thumb for tuning those two:
>
> - Set them to ~70% of the Java heap size. Pick heap sizes to utilize ~80% of
> RAM across all processes (maps, reducers, TT, DN, other).
> - Set them small enough to avoid swap activity, but
> - Set them large enough to minimize disk spills.
> - Ensure that io.sort.factor is set large enough to allow full use of the
> buffer space.
> - Balance the buffer space between output records (default 95%) and record
> metadata (5%) using io.sort.spill.percent and io.sort.record.percent (see
> the sketch below).
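>
> To make that last rule concrete, here are those two knobs with their usual
> defaults (values are illustrative only; tune them to your record sizes):
>
>   <property>
>     <name>io.sort.record.percent</name>
>     <value>0.05</value>  <!-- share of io.sort.mb reserved for record metadata -->
>   </property>
>   <property>
>     <name>io.sort.spill.percent</name>
>     <value>0.80</value>  <!-- start spilling once the buffer is 80% full -->
>   </property>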
>
>  Your mileage may vary. We've seen job execution time improvements of 1-3%
> via spill avoidance across a variety of applications.
>
>  Your other option of running a map per 32MB or 64MB of input should give
> you better performance if your map task execution time is significant (i.e.,
> much larger than a few seconds) compared to the overhead of launching map
> tasks and reading input.
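>
> (If you do go the smaller-split route, the hadoop-lzo package includes an
> indexer you can run over the input directory so the .lzo files become
> splittable; the jar path and input path below are just placeholders and
> will differ per install:)
>
>   hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer /path/to/lzo/input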
>
> Regards,
> Sriguru
>
> >-----Original Message-----
> >From: pig [mailto:[email protected]]
> >Sent: Saturday, September 25, 2010 2:36 AM
> >To: [email protected]
> >Subject: Proper blocksize and io.sort.mb setting when using compressed
> >LZO files
> >
> >Hello,
> >
> >We just recently switched to using LZO-compressed file input for our
> >Hadoop cluster, using Kevin Weil's LZO library.  The files are pretty
> >uniform in size at around 200 MB compressed.  Our block size is 256 MB.
> >Decompressed, the average LZO input file is around 1.0 GB.  I noticed
> >lots of our jobs are now spilling lots of data to disk.  We have almost
> >3x more spilled records than map input records, for example.  I'm
> >guessing this is because each mapper is getting a 200 MB LZO file which
> >decompresses into 1 GB of data per mapper.
> >
> >Would you recommend solving this by reducing the block size to 64 MB, or
> >even 32 MB, and then using the LZO indexer so that a single 200 MB LZO
> >file is actually split among 3 or 4 mappers?  Would it be better to play
> >with the io.sort.mb value?  Or would it be best to play with both?  Right
> >now the io.sort.mb value is the default 200 MB.  Have other LZO users had
> >to adjust their block size to compensate for the "expansion" of the data
> >after decompression?
> >
> >Thank you for any help!
> >
> >~Ed
>
