Hi, I have a question about the "mapred.reduce.parallel.copies" configuration parameter in Hadoop. The mapred-default.xml file describes it as "The default number of parallel transfers run by reduce during the copy(shuffle) phase." Is this the number of slave nodes from which a reduce task reads in parallel? Or is it the number of intermediate map outputs that a reduce task can read from in parallel?
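For reference, here is the form the property takes in mapred-site.xml, following the property format of mapred-default.xml (a minimal sketch; the value 5 matches my example below, and 5 is also the shipped default):

    <property>
      <name>mapred.reduce.parallel.copies</name>
      <value>5</value>
      <!-- Description from mapred-default.xml: "The default number of
           parallel transfers run by reduce during the copy(shuffle)
           phase." -->
    </property>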
For example, suppose I have 4 slave nodes and run a job with 800 maps and 4 reducers, with mapred.reduce.parallel.copies=5. Can each reduce task then read from all 4 nodes in parallel, i.e., make at most 4 concurrent connections, one per node? Or can it fetch 5 of the 800 map outputs at once, which (with 5 fetches spread over only 4 nodes) means at least 2 concurrent connections to a single node?

In essence, I am trying to determine, for any given Hadoop cluster and job configuration, how many reducers would be accessing a single disk concurrently, as a function of the various parameters that can be specified in the configuration files.

Thanks,
Virajith