When the reducers start up, they first have to contact every mapper to collect the data for the reduce phase. It appears to be implemented so that every reduce task starts fetching from map task 0, works through that map's output, then moves on to map 1, and so on through map n-1. This causes a huge slowdown. For instance, the reduce task table often looks like this:
tip_0014_r_000000 26.95% reduce > copy (38 of 47 at 0.06 MB/s)
tip_0014_r_000001 26.95% reduce > copy (38 of 47 at 0.06 MB/s)
tip_0014_r_000002 26.95% reduce > copy (38 of 47 at 0.06 MB/s)
tip_0014_r_000003 26.95% reduce > copy (38 of 47 at 0.06 MB/s)
tip_0014_r_000004 26.95% reduce > copy (38 of 47 at 0.06 MB/s)
tip_0014_r_000005 26.95% reduce > copy (38 of 47 at 0.06 MB/s)
tip_0014_r_000006 26.95% reduce > copy (38 of 47 at 0.06 MB/s)
tip_0014_r_000007 26.95% reduce > copy (38 of 47 at 0.06 MB/s)
tip_0014_r_000008 26.95% reduce > copy (38 of 47 at 0.06 MB/s)
tip_0014_r_000009 26.95% reduce > copy (38 of 47 at 0.06 MB/s)
tip_0014_r_000010 26.95% reduce > copy (38 of 47 at 0.06 MB/s)
...

Meanwhile, if I just send data between two nodes directly, I usually get closer to 30 MB/s. Where is the logic located that pulls data from the mappers to the reducers, so that I can randomize which map task each reducer starts grabbing data from? Thanks in advance
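The randomization I have in mind would look something like the sketch below. This is not Hadoop code, just a hypothetical illustration: each reducer builds the list of map task indices 0..n-1, then shuffles it with a per-reducer seed (e.g. its own task id) so the reducers' initial fetches spread across different mappers instead of all hammering map 0 at once.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ShuffledFetchOrder {
    // Hypothetical helper: return a per-reducer permutation of the map
    // task indices 0..numMaps-1 to fetch from, instead of the fixed
    // 0, 1, ..., n-1 order that makes every reducer start at map 0.
    static List<Integer> fetchOrder(int numMaps, long reducerSeed) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < numMaps; i++) {
            order.add(i);
        }
        // Seeding by reducer id makes each reducer's order different
        // but still deterministic for that reducer.
        Collections.shuffle(order, new Random(reducerSeed));
        return order;
    }

    public static void main(String[] args) {
        // Two reducers (ids 0 and 1) each get their own fetch order
        // over 8 map outputs.
        System.out.println("reducer 0: " + fetchOrder(8, 0));
        System.out.println("reducer 1: " + fetchOrder(8, 1));
    }
}
```

The idea is simply to decorrelate which mapper each reducer hits first, so the aggregate copy bandwidth is spread over all map hosts rather than serialized behind one.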
