Hi,
I'm writing a MapReduce job that takes a large set of complex numbers,
does some processing on each, and returns a double. Since the processing
is computationally expensive and can take up to 10 minutes per number, I
am using Hadoop to distribute the work across many machines.
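
Concretely, the map side looks something like the sketch below. The
record format ("re im" as Text) and the computation are just illustrative
stand-ins; the real per-record work is much heavier:

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ComplexMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, LongWritable, DoubleWritable> {

  public void map(LongWritable key, Text value,
                  OutputCollector<LongWritable, DoubleWritable> output,
                  Reporter reporter) throws IOException {
    // Each record holds one complex number as "re im", e.g. "3.0 4.0".
    String[] parts = value.toString().trim().split("\\s+");
    double re = Double.parseDouble(parts[0]);
    double im = Double.parseDouble(parts[1]);

    // Stand-in for the expensive (up to ~10 minute) per-number computation.
    double result = Math.hypot(re, im);

    output.collect(key, new DoubleWritable(result));
  }
}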
So ideally I want a separate map task for each complex number, to spread
the load as evenly as possible. A typical run might have as many as 7500
complex numbers to process, and I will eventually have access to a
cluster of approximately 500 machines.
So far, the only way I can get one map task per complex number is to
create a separate SequenceFile for each number in the input directory.
That works, but creating thousands of tiny files is slow. I was hoping I
could instead write all the complex numbers into a single SequenceFile
and then use JobConf.setNumMapTasks(n) to get one map task per record in
the file. This doesn't work, though: I end up with roughly 60-70 complex
numbers per map task (depending on the total number of input records).
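
Here is roughly what the second attempt looks like (the paths, names,
and dummy records are illustrative, not my real data):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

public class ComplexDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path in = new Path("input/numbers.seq");

    // Write every complex number into one SequenceFile, key = record index.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, in, LongWritable.class, Text.class);
    try {
      for (long i = 0; i < 7500; i++) {
        writer.append(new LongWritable(i), new Text(i + ".0 0.0"));  // dummy "re im"
      }
    } finally {
      writer.close();
    }

    JobConf job = new JobConf(conf, ComplexDriver.class);
    job.setJobName("complex-processing");
    job.setInputFormat(SequenceFileInputFormat.class);
    FileInputFormat.setInputPaths(job, in);
    FileOutputFormat.setOutputPath(job, new Path("output"));
    job.setMapperClass(ComplexMapper.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(DoubleWritable.class);
    // My attempt at one map per record; in practice I get ~60-70 records per map.
    job.setNumMapTasks(7500);
    JobClient.runJob(job);
  }
}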
Does anyone have any idea why this second method doesn't behave as I
expected? If setNumMapTasks is not supposed to work this way, are there
any suggestions for getting one map task per input record without
putting each record in a separate file?
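
One thing I have wondered about, but not yet tried properly: if I
switched the input to a plain text file with one number per line, would
NLineInputFormat do what I want? As I understand it, it creates one
split per N lines, with N defaulting to 1, so the driver change would be
something like (path again illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

// In the driver, in place of the SequenceFile setup above:
JobConf job = new JobConf(ComplexMapper.class);
job.setInputFormat(NLineInputFormat.class);
job.setInt("mapred.line.input.format.linespermap", 1);  // one line (record) per map
FileInputFormat.setInputPaths(job, new Path("input/numbers.txt"));

I don't know whether that is the idiomatic fix or just a workaround,
though.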
Thanks in advance for any help,
Ollie