I am a novice on Hadoop, so please take my response with caution. :)
The split size seems to be calculated as follows:

    split size = max(value("mapred.min.split.size"),
                     min(total size of all input files / requested number of maps,
                         file system block size))
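In code, that comes out of FileInputFormat.getSplits() roughly as follows (a paraphrase from memory, not the exact source; the names are mine):

    // Paraphrase of the split-size computation in the old mapred
    // FileInputFormat.getSplits(); variable names are illustrative.
    class SplitSizeSketch {
        static long computeSplitSize(long totalSize, int numSplits,
                                     long minSplitSize, long blockSize) {
            // numSplits is the hint supplied via JobConf.setNumMapTasks(n)
            long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
            // minSplitSize combines value("mapred.min.split.size") with any
            // floor the input format itself sets via setMinSplitSize()
            return Math.max(minSplitSize, Math.min(goalSize, blockSize));
        }
    }

Note that setNumMapTasks(n) only feeds the goalSize term, so the clamping can silently override it. Purely back-of-the-envelope, with every number assumed: 7500 records at ~32 bytes on disk gives a goalSize of ~32 bytes; if the effective minimum split size is around 2000 bytes (SequenceFileInputFormat seems to raise the floor to the SequenceFile sync interval, if I am reading the source correctly), each split would cover roughly 60-70 records, which matches what you are seeing.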
The value of "file system block size" might be the cause of your situation. I had a (possibly) similar situation and I ended up extending FileInputFormat. You can override the method getSplits() (and possibly isSplitable(), getRecordReader(), etc.). The above formula is from FileInputFormat.getSplits(). You can then do a conf.setInputFormat(<Your class>) before submitting the job. In your case, you might want to extend SequenceFileInputFormat, which itself extends FileInputFormat. When Hadoop wants to create the splits, your code can then specify them; a rough sketch of such an input format is appended after the quoted message below. If you can predict the exact number of bytes it takes to store one of your complex numbers, splitting shouldn't be difficult; if the split boundaries can only be decided while reading data from the file, it will be hard to arrange exactly one split per number.

I know it is good to split a job into as many smaller parts as possible, but too much granularity can, in my opinion, result in excessive overhead in creating the splits and then combining the results. You might want to experiment with the granularity to find the best value.

~ Neeraj

-----Original Message-----
From: Oliver Haggarty [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 03, 2007 8:45 AM
To: [email protected]
Subject: Setting number of Maps

Hi,

I'm writing a MapReduce task that will take a load of complex numbers, do some processing on each and then return a double. As this processing is complex and could take up to 10 minutes, I am using Hadoop to distribute it amongst many machines. So ideally I want a new map task for each complex number, to spread the load most efficiently. A typical run might have as many as 7500 complex numbers that need processing, and I will eventually have access to a cluster of approximately 500 machines.

So far, the only way I can get one map task per complex number is to create a new SequenceFile for each number in the input directory. This takes a while, though, and I was hoping I could just create a single SequenceFile holding all the complex numbers and then use JobConf.setNumMapTasks(n) to get one map task per number in the file. This doesn't work, though, and I end up with approx. 60-70 complex numbers per map task (depending on the total number of input numbers).

Does anyone have any idea why this second method doesn't work? If it is not supposed to work this way, are there any suggestions as to how to get a map per input record without having to put each one in a separate file?

Thanks in advance for any help,

Ollie
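To make the custom input format idea concrete, here is an untested sketch against the old mapred API (method names and constructors vary between Hadoop versions, so treat it as pseudocode; RECORD_BYTES is a made-up placeholder for the measured on-disk size of one record, including per-record overhead):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;

    // Emits one FileSplit per fixed-size record instead of letting
    // FileInputFormat.getSplits() apply the formula discussed above.
    public class PerRecordInputFormat extends SequenceFileInputFormat {

        private static final long RECORD_BYTES = 32L; // hypothetical size

        public InputSplit[] getSplits(JobConf job, int numSplitsHint)
                throws IOException {
            List<InputSplit> splits = new ArrayList<InputSplit>();
            for (Path file : listPaths(job)) { // inherited from FileInputFormat
                FileSystem fs = file.getFileSystem(job);
                long len = fs.getFileStatus(file).getLen();
                // one split per RECORD_BYTES-sized slice of the file
                for (long off = 0; off < len; off += RECORD_BYTES) {
                    splits.add(new FileSplit(file, off,
                            Math.min(RECORD_BYTES, len - off), job));
                }
            }
            return splits.toArray(new InputSplit[splits.size()]);
        }
    }

You would then register it with conf.setInputFormat(PerRecordInputFormat.class) before submitting. One caveat I am not sure about: a SequenceFile reader realigns on sync markers, so splits this fine may not each yield exactly one record unless the boundaries happen to line up; you may also need to look at isSplitable() and getRecordReader() to get true per-record behaviour, so please verify against your Hadoop version before relying on it.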
