I understand that map tasks are normally run close to their input files. In my application, though, the input file is a text file with many lines of query parameters. The mapper reads each line and uses its parameters to query a local db file (for example sqlite3), so the query itself takes a lot of time while the input query parameters are very small; the time to fetch the input file is negligible. The db file already sits on every box in the cluster, so there is no cost to copy the db.
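To make the setup concrete, here is a minimal sketch of the per-line mapper logic described above, written as a Hadoop Streaming-style Python script. The db path, table name, and column names are hypothetical placeholders, not something from my actual job:

```python
import sqlite3
import sys

# Hypothetical path: the db file is assumed to sit on every node's local disk.
DB_PATH = "/local/data/lookup.db"

def map_line(line, conn):
    """Treat one input line as a query parameter and look it up in the local db.

    Returns a "key<TAB>value" string (the usual Streaming output format),
    or None for blank lines. The slow part is the db query, not the input I/O.
    """
    param = line.strip()
    if not param:
        return None
    # Hypothetical schema: a table "lookup" with columns "key" and "value".
    cur = conn.execute("SELECT value FROM lookup WHERE key = ?", (param,))
    row = cur.fetchone()
    return "%s\t%s" % (param, row[0] if row else "NOT_FOUND")

def run_mapper(stream, db_path=DB_PATH):
    """Read query params from stdin (one per line) and emit one result per line."""
    conn = sqlite3.connect(db_path)
    try:
        for line in stream:
            out = map_line(line, conn)
            if out is not None:
                print(out)
    finally:
        conn.close()

# In a real job this would be invoked as the -mapper of a Streaming run,
# e.g. run_mapper(sys.stdin); it is left uncalled here so the sketch can
# be imported and exercised standalone.
```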
The problem is that when I have an empty cluster (100 nodes) and a job with only 4 mappers, Hadoop schedules all 4 mappers on the same node, presumably close to where the input data is. But since the run time here is determined mostly by CPU and disk seeks, I would like to spread the mappers out as much as possible. Given that the input data is present on only 1 node, how can I spread out my mappers?

Thanks
Yang