Re: how to fine tuning my map reduce job that is generating a lot of intermediate key-value pairs (a lot of I/O operations)

Bejoy Ks Tue, 03 Apr 2012 04:48:39 -0700

Jane,
       At first glance, these properties could help:
- Increase io.sort.factor to 100
- Increase io.sort.mb to 512 MB
- Increase the map task heap size to 2 GB (mapred.child.java.opts)
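
As a rough sketch, the three settings above could be passed as -D generic options when the job is launched. This assumes the job driver goes through ToolRunner / GenericOptionsParser so that -D options are picked up. The class and method names below are made up for illustration. Note that on 0.20, mapred.child.java.opts applies to every child task JVM, not only to map tasks, and io.sort.mb is allocated inside that heap, so the heap has to be comfortably larger than the sort buffer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: only builds the option string, it does not talk to Hadoop.
class SortTuning {
    // Property names as used by Hadoop 0.20; values are the ones suggested above.
    static Map<String, String> suggestedSettings() {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("io.sort.factor", "100");               // streams merged at once
        settings.put("io.sort.mb", "512");                   // sort buffer size, in MB
        settings.put("mapred.child.java.opts", "-Xmx2048m"); // 2 GB heap per task JVM
        return settings;
    }

    // Render the settings as "-D key=value" pairs for the hadoop command line.
    static String toGenericOptions(Map<String, String> settings) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : settings.entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append("-D ").append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: -D io.sort.factor=100 -D io.sort.mb=512 -D mapred.child.java.opts=-Xmx2048m
        System.out.println(toGenericOptions(suggestedSettings()));
    }
}
```

The same three key/value pairs could instead go into the job configuration; the names and values are what matter here, not the helper.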


If the task still stalls, try giving each mapper less input.
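
One way to give each mapper less input is to make the input splits smaller. The sketch below mirrors, as I understand it, how the old-API (org.apache.hadoop.mapred) FileInputFormat picks a split size: max(minSize, min(goalSize, blockSize)), where goalSize is the total input size divided by the mapred.map.tasks hint. The 10 GB input, the 64 MB block size and the class name are assumptions for illustration only. If your job uses the new API, the split size is capped through mapred.max.split.size instead.

```java
// Hypothetical sketch: reproduces the split-size arithmetic, it does not call Hadoop.
class SplitSizeSketch {
    // Same shape as the old-API formula: max(minSize, min(goalSize, blockSize)).
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long totalInput = 10L * 1024L * mb; // assumed: 10 GB of input
        long blockSize = 64L * mb;          // assumed: 64 MB blocks
        long minSize = 1L;                  // assumed: mapred.min.split.size left at 1

        // Requested number of map tasks (the mapred.map.tasks hint).
        for (int requestedMaps : new int[] {2, 160, 640}) {
            long goalSize = totalInput / requestedMaps;
            long split = computeSplitSize(goalSize, minSize, blockSize);
            System.out.println(requestedMaps + " requested maps -> "
                    + (split / mb) + " MB per split");
        }
    }
}
```

With these assumed numbers, the split only shrinks below the block size once the requested map count pushes goalSize under 64 MB, so a small hint has no effect.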

Regards
Bejoy KS

On Tue, Apr 3, 2012 at 2:08 PM, Jane Wayne <jane.wayne2...@gmail.com> wrote:

> i have a map reduce job that is generating a lot of intermediate key-value
> pairs. for example, when i am 1/3 complete with my map phase, i may have
> generated over 130,000,000 output records (which is about 9 gigabytes). to
> get to the 1/3 complete mark is very fast (less than 10 minutes), but at
> the 1/3 complete mark, it seems to stall. when i look at the counter logs,
> i do not see any logging of spilling yet. however, on the web job UI, i see
> that FILE_BYTES_WRITTEN and Spilled Records keep increasing. needless to
> say, i have to dig deeper to see what is going on.
>
> my question is, how do i fine tune my map reduce job with the above
> properties? namely, the property of generating a lot of intermediate
> key-value pairs? it seems the I/O operations are negatively impacting the
> job speed. there are so many map- and reduce-side tuning properties (see
> Tom White, Hadoop, 2nd edition, pp 181-182), i am a little unsure about
> just how to approach the tuning parameters. since the slow down is
> happening during the map-phase/task, i assume i should narrow down on the
> map-side tuning properties.
>
> by the way, i am using the CPU-intensive c1.medium instances of amazon web
> service's (AWS) elastic map reduce (EMR) on hadoop v0.20. a compute node
> has 2 mappers, 1 reducer, and 384 MB JVM memory per task. this instance
> type is documented to have moderate I/O performance.
>
> any help on fine tuning my particular map reduce job is appreciated.
>
