Hi, I ran into a strange issue while developing a system. I have data where a reducer receives about 3 million values, and the reducer emits all permutations of those values. In pseudocode:

    Reducer {
        List<values> permutations = FindPermutations(values);
        foreach (permutation)
            emit(key, permutation);
    }

Holding the values in memory to compute the permutations is feasible when the number of values is small, say fewer than 10,000. Beyond that the approach does not scale, even from a computational point of view.

So I tried writing the values to a file, moving the file to HDFS, and starting a new MapReduce job for the permutations from inside the reducer; let me call it a nested MapReduce job. This distributes the reducer's load among the available machines. The parent task waits until the nested job completes and then uses the result to emit the permutations.
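To make the setup concrete, here is a minimal sketch of what I mean (using the new org.apache.hadoop.mapreduce API); PermutationMapper, the /tmp/perm paths, and the job wiring are simplified placeholders rather than my real code:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class NestedJobReducer extends Reducer<Text, Text, Text, Text> {

        // Placeholder: the real mapper would expand its share of the
        // values into permutations.
        public static class PermutationMapper
                extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable offset, Text line, Context context) {
                // permutation logic omitted
            }
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            FileSystem fs = FileSystem.get(conf);

            // Spill the ~3 million values to HDFS instead of holding
            // them in memory.
            Path input = new Path("/tmp/perm/" + key + "/input");  // placeholder
            FSDataOutputStream out = fs.create(input);
            for (Text value : values) {
                out.writeBytes(value.toString() + "\n");
            }
            out.close();

            // Launch the nested job that computes the permutations in parallel.
            Job nested = new Job(conf, "permutations-" + key);
            nested.setJarByClass(NestedJobReducer.class);
            nested.setMapperClass(PermutationMapper.class);
            nested.setOutputKeyClass(Text.class);
            nested.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(nested, input);
            FileOutputFormat.setOutputPath(nested,
                    new Path("/tmp/perm/" + key + "/output"));     // placeholder

            try {
                // The parent reduce task blocks here, still holding its slot
                // on the tasktracker until the nested job finishes.
                nested.waitForCompletion(true);
            } catch (ClassNotFoundException e) {
                throw new IOException(e);
            }
        }
    }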
Since the parent job's task sits idle while it waits, the nested job's tasks could run on the same tasktracker, but the tasktracker is not doing that. Is there a way to signal the tasktracker that the current task is paused or sitting idle, without terminating it? All the available tasktrackers are running the parent job's tasks, so the nested job never gets the resources to start, and everything falls into a deadlock.

I can suspend the parent task after starting the nested permutation job, and it does continue from the same instruction when it resumes. In simple words, the parent task is not pausing but suspending.
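For reference, the wait at the end of the sketch above could just as well be a polling loop; as far as I can tell, calling context.progress() keeps the tasktracker from killing the idle parent task for inactivity, but the task still occupies its slot, so the deadlock remains:

    // Inside the same try/catch, replacing nested.waitForCompletion(true):
    nested.submit();                 // returns as soon as the job is submitted
    while (!nested.isComplete()) {
        context.progress();          // heartbeat so the tasktracker does not
                                     // kill the parent task for inactivity
        Thread.sleep(10000);         // arbitrary 10-second poll interval
    }

Has anybody run into this situation? If you have any thoughts on it, please post them here. All your help is appreciated.

Thanks,
Venkat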