I have a question that someone may have answered here before, but I cannot
find the answer.
Assume I have a cluster of servers hosting a large amount of data.
I want to run a large job where the maps take a lot of CPU power to run and
the reduces take only a small amount of CPU.
I want to run the maps on a group of EC2 servers and run the reduces on the
local cluster of 10 machines.
The problem I am seeing is with the map outputs: if I run the maps on EC2,
they are stored locally on the instance.
What I am looking to do is have the map output files stored in HDFS so I can
kill the EC2 instances, since I do not need them for the reduces.
The only way I can think of to do this is to run two jobs: one map-only job
that stores its output in HDFS, and then a second job that runs the reduces
from the map outputs stored in HDFS.
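
To make the two-job idea concrete, here is a rough sketch of what I have in
mind, written against the org.apache.hadoop.mapreduce API. The class names
(TwoStageSketch, HeavyMapper, LightReducer), the key/value types, and the
three path arguments are placeholders I made up for illustration, and I am
assuming the nodes running the maps can write to the HDFS that holds the
intermediate path. Setting the number of reduce tasks to zero is what makes
the first job write its map output directly to the job output path.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class TwoStageSketch {

    // Hypothetical stand-in for the CPU-heavy map: keys each line by its first token.
    public static class HeavyMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\\s+", 2);
            ctx.write(new Text(parts[0]), line);
        }
    }

    // Hypothetical stand-in for the cheap reduce: counts the values per key.
    public static class LightReducer extends Reducer<Text, Text, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            int n = 0;
            for (Text ignored : values) {
                n++;
            }
            ctx.write(key, new IntWritable(n));
        }
    }

    // Job 1: map-only. With zero reduce tasks there is no shuffle, and each
    // mapper writes its output files to the job output path.
    public static Job mapOnlyStage(Configuration conf, Path input, Path intermediate)
            throws IOException {
        Job job = Job.getInstance(conf, "stage 1: maps only");
        job.setJarByClass(TwoStageSketch.class);
        job.setMapperClass(HeavyMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, intermediate);
        return job;
    }

    // Job 2: the base Mapper class passes records through unchanged, so this
    // job only shuffles the stored map output and runs the reduces.
    public static Job reduceStage(Configuration conf, Path intermediate, Path output)
            throws IOException {
        Job job = Job.getInstance(conf, "stage 2: reduces");
        job.setJarByClass(TwoStageSketch.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapperClass(Mapper.class);
        job.setReducerClass(LightReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, intermediate);
        FileOutputFormat.setOutputPath(job, output);
        return job;
    }

    public static void main(String[] args) throws Exception {
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);
        Path output = new Path(args[2]);
        // Run the reduces only if the map-only stage finished successfully.
        if (!mapOnlyStage(new Configuration(), input, intermediate).waitForCompletion(true)) {
            System.exit(1);
        }
        System.exit(reduceStage(new Configuration(), intermediate, output)
                .waitForCompletion(true) ? 0 : 1);
    }
}
```

In this sketch both jobs use the default Configuration; in my setup the
first job would be submitted to the EC2 machines and the second to the local
cluster, and the EC2 instances could be shut down once the first job
finishes. The cost is an extra pass of identity maps in the second job.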
Is there a way to make the mappers store their final output in HDFS?