Hi All,

I am new to Hadoop and I seem to be having a problem setting the number of map tasks per node. I have an application that needs to load a significant amount of data (about 1 GB) into memory for use in mapping data read from files. I store this data in a singleton and access it from my mapper. Because of this, I need exactly one map task running on a node at any one time, or the memory requirements will far exceed my RAM. I am generating my own splits using an InputFormat class. This gives me roughly 10 splits per node, and I need the corresponding map tasks to run sequentially in the same child JVM so that each run does not have to reinitialize the data.
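To make the memory pattern concrete, here is a minimal sketch of the singleton I am describing (class, field, and method names are illustrative, not from my actual code; the real load is ~1 GB from disk):

```java
import java.util.HashMap;
import java.util.Map;

public class ReferenceData {
    private static ReferenceData instance;
    // counts how many times the expensive load has actually run,
    // to show the data is initialized only once per child JVM
    static int loadCount = 0;

    private final Map<String, String> table = new HashMap<>();

    private ReferenceData() {
        // stand-in for loading ~1 GB of lookup data into memory
        table.put("example-key", "example-value");
        loadCount++;
    }

    // lazily initialize so the first map task in the JVM pays the
    // load cost once, and later tasks in the same JVM reuse it
    public static synchronized ReferenceData getInstance() {
        if (instance == null) {
            instance = new ReferenceData();
        }
        return instance;
    }

    public String lookup(String key) {
        return table.get(key);
    }
}
```

Each map() call does ReferenceData.getInstance().lookup(...), which is why everything only works if all map tasks on a node share one JVM.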
I have tried the following in a single-node configuration with 2 splits:
- setting setNumMapTasks(1) on the JobConf - Hadoop still creates 2 map tasks
- setting the mapred.tasktracker.tasks.maximum property to 1 - same result, 2 map tasks
- setting the mapred.map.tasks property to 1 - same result, 2 map tasks

I have yet to try this in a multi-node configuration; my target will be 20 AWS EC2 instances. Can you please let me know what I should be doing or looking at to make sure that at most 1 map task runs per node? Also, how can I have multiple splits mapped within the same child JVM by map tasks running in sequence?

Thanks in advance,
Dev
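For reference, this is roughly how I set the two properties in my site configuration (shown as a hadoop-site.xml style fragment, with the property names exactly as I used them); the setNumMapTasks(1) call was made on the JobConf in my driver:

```xml
<!-- Fragment of my configuration; each of these settings
     still produced 2 concurrent map tasks on the node -->
<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>1</value>
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>1</value>
</property>
```

If I have the property names wrong, or if these are only hints rather than hard limits, that would be good to know too.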
