When the reducers start up, they first have to contact every mapper to
fetch their partitions for the reduce phase.  It appears to be
implemented right now so that every reduce task starts fetching at map
task 0, pulls its data, then moves on to map 1, and so on up through
map n-1.  This causes a huge slowdown.  For instance, when I look at
the reduce task table, it often looks like:

tip_0014_r_000000       26.95%  reduce > copy (38 of 47 at 0.06 MB/s)
tip_0014_r_000001       26.95%  reduce > copy (38 of 47 at 0.06 MB/s)
tip_0014_r_000002       26.95%  reduce > copy (38 of 47 at 0.06 MB/s)
tip_0014_r_000003       26.95%  reduce > copy (38 of 47 at 0.06 MB/s)
tip_0014_r_000004       26.95%  reduce > copy (38 of 47 at 0.06 MB/s)
tip_0014_r_000005       26.95%  reduce > copy (38 of 47 at 0.06 MB/s)
tip_0014_r_000006       26.95%  reduce > copy (38 of 47 at 0.06 MB/s)
tip_0014_r_000007       26.95%  reduce > copy (38 of 47 at 0.06 MB/s)
tip_0014_r_000008       26.95%  reduce > copy (38 of 47 at 0.06 MB/s)
tip_0014_r_000009       26.95%  reduce > copy (38 of 47 at 0.06 MB/s)
tip_0014_r_000010       26.95%  reduce > copy (38 of 47 at 0.06 MB/s)
......

By contrast, if I just send data directly between two nodes I usually
get closer to 30 MB/s.  Where is the logic that pulls data from the
map tasks to the reducers, so that I can randomize which map task each
reducer starts fetching from?
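To make the idea concrete, here is a minimal sketch (not the actual Hadoop code; the class and method names are my own) of the kind of randomization I have in mind: seed a permutation of the map indices with the reduce task id, so each reducer walks the maps in a different deterministic order instead of all of them hammering map 0 first.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper: compute the order in which a given reduce task
// should fetch map outputs.  Seeding with the reduce id means each
// reducer gets a different (but reproducible) permutation, spreading
// the fetch load across the map hosts.
public class FetchOrder {
    static List<Integer> fetchOrder(int numMaps, int reduceId) {
        List<Integer> order = new ArrayList<>();
        for (int m = 0; m < numMaps; m++) {
            order.add(m);
        }
        // Shuffle deterministically per reducer so retries revisit
        // the same sequence.
        Collections.shuffle(order, new Random(reduceId));
        return order;
    }
}
```

With 47 maps, reducer 0 and reducer 1 would then start their copies from different map tasks instead of all 47 reducers queueing up on the same mapper at once.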

Thanks in advance