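1) The scheduler first tries to run the map task on a node that holds a replica of the block. If that is not possible, it tries a node on the same rack, and failing that, a node on another rack. When the block is not local, I believe it is streamed over the network from a DataNode that has it as the map reads its input, rather than copied in full before the map can begin. This feature is known as data locality (rack locality being the same-rack case). You can see counters associated with this in the jobs you run (data-local map tasks, rack-local map tasks, etc.).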
2) The reducer has a phase called copy, which fetches _all_ the map outputs it needs to act on (the first 33% of reduce progress). Only then is the sort phase initiated (the next 33%). Only after copy and sort does the reduce itself begin (on to 100%). So such an issue won't occur: all map outputs are fetched before any reduce logic runs. Map outputs that do not fit in the reducer's memory are merged to local disk during the copy, so they do not all have to sit in memory at once.

On Oct 13, 2010 5:42 PM, "Matthew John" <[email protected]> wrote:

Hi all,

Had some doubts:

1) What happens when a mapper running on node A needs data from a block it doesn't have? (The block might be present on some other node in the cluster.)
2) The Sort/Shuffle phase is just a logical representation of all the map outputs sorted together, right? And again, what happens when a reduce on node C needs access to some map outputs not in its memory?

Matthew
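
To make the two answers concrete, here is a rough Python sketch. It is not Hadoop code or any Hadoop API: the RACK_OF table and the function names locality and reduce_progress are made up for illustration, and the equal one-third weighting simply follows the 33% / 33% / 100% figures quoted in answer 2.

```python
# Hypothetical cluster layout (node name -> rack name), for illustration only.
RACK_OF = {"A": "rack1", "B": "rack1", "C": "rack2"}


def locality(task_node, block_nodes):
    """Classify a map task by how close it runs to a replica of its block."""
    if task_node in block_nodes:
        return "data-local"   # a replica is on the same node
    task_rack = RACK_OF[task_node]
    if any(RACK_OF[node] == task_rack for node in block_nodes):
        return "rack-local"   # a replica is on another node in the same rack
    return "off-rack"         # every replica is on a different rack


def reduce_progress(copied, total_maps, sort_fraction=0.0, reduce_fraction=0.0):
    """Overall reducer progress, assuming copy, sort and reduce each count
    for one third, and each phase starts only after the previous one ends."""
    copy_fraction = copied / total_maps
    if copy_fraction < 1.0:
        return copy_fraction / 3          # still copying map outputs
    if sort_fraction < 1.0:
        return (1 + sort_fraction) / 3    # copy done, sorting
    return (2 + reduce_fraction) / 3      # copy and sort done, reducing


if __name__ == "__main__":
    print(locality("A", ["B", "C"]))      # mapper on A, replicas on B and C
    print(round(reduce_progress(5, 10), 3))
```

A mapper on node A whose block lives only on B and C is classified as rack-local here, because B shares A's rack in the made-up layout. A reducer that has copied 5 of 10 map outputs reports about 17%, and it cannot pass 33% until every map output has been fetched.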
