I understand that map tasks are normally run close to their input files. In my application, though, the input file is a text file with many lines of query parameters. The mapper reads each line and uses its parameters to query a local db file (for example sqlite3), so the query itself takes a lot of time while the input query parameters are very small; the time to fetch the input file is negligible. The db file already sits on every box in the cluster, so there is no cost to copy the db.
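To make the setup concrete, here is a minimal sketch of the per-line mapper logic described above, written as a Hadoop Streaming-style Python script. The db path, table name, and column names are hypothetical placeholders, not something from my actual job:

```python
import sqlite3
import sys

# Hypothetical path: the db file is assumed to sit on every node's local disk.
DB_PATH = "/local/data/lookup.db"

def map_line(line, conn):
    """Treat one input line as a query parameter and look it up in the local db.

    Returns a "key<TAB>value" string (the usual Streaming output format),
    or None for blank lines. The slow part is the db query, not the input I/O.
    """
    param = line.strip()
    if not param:
        return None
    # Hypothetical schema: a table "lookup" with columns "key" and "value".
    cur = conn.execute("SELECT value FROM lookup WHERE key = ?", (param,))
    row = cur.fetchone()
    return "%s\t%s" % (param, row[0] if row else "NOT_FOUND")

def run_mapper(stream, db_path=DB_PATH):
    """Read query params from stdin (one per line) and emit one result per line."""
    conn = sqlite3.connect(db_path)
    try:
        for line in stream:
            out = map_line(line, conn)
            if out is not None:
                print(out)
    finally:
        conn.close()

# In a real job this would be invoked as the -mapper of a Streaming run,
# e.g. run_mapper(sys.stdin); it is left uncalled here so the sketch can
# be imported and exercised standalone.
```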
The problem is that when I have an empty cluster (100 nodes) and a job with only 4 mappers, Hadoop schedules all 4 mappers on the same node, presumably close to where the input data is. But since the run time here is determined mostly by CPU and disk seeks, I would like to spread the mappers out as much as possible. Given that the input data is present on only 1 node, how can I spread out my mappers?

Thanks
Yang