I have a question that may have been answered here before, but I cannot find the answer.

Assume I have a cluster of servers hosting a large amount of data.
I want to run a large job whose maps take a lot of CPU power and whose reduces take only a small amount of CPU. I want to run the maps on a group of EC2 servers and the reduces on the local cluster of 10 machines.

The problem I am seeing is with the map outputs: if I run the maps on EC2, they are stored locally on the instances. What I am looking to do is have the map output files stored in HDFS, so I can kill the EC2 instances, since I do not need them for the reduces.

The only way I can think of to do this is to run two jobs: one map-only job that stores its output in HDFS, and then a second job that runs the reduces
on the map outputs stored in HDFS.
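
To make the two-job idea concrete, here is a minimal sketch of the data flow in plain Python. It is only a simulation under stated assumptions, not Hadoop code: a temporary directory stands in for an HDFS path, word counting stands in for the CPU-heavy map, and the names run_map_job, run_reduce_job, and the part-NNNNN.json file layout are hypothetical. In Hadoop itself, as far as I know, the first job would be a map-only job (number of reduce tasks set to zero), which writes map output directly to the job's output path.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def run_map_job(records, shared_dir, num_partitions):
    """Job 1 (map-only): write partitioned (key, value) pairs to shared storage."""
    shared_dir = Path(shared_dir)
    shared_dir.mkdir(parents=True, exist_ok=True)
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        for word in record.split():  # stand-in for the CPU-heavy map work
            key = word.lower()
            # Deterministic partitioner: the same key always lands in one file.
            index = sum(key.encode()) % num_partitions
            partitions[index].append((key, 1))
    for i, pairs in enumerate(partitions):
        with open(shared_dir / f"part-{i:05d}.json", "w") as f:
            json.dump(pairs, f)


def run_reduce_job(shared_dir):
    """Job 2: read the persisted map output and aggregate the values per key."""
    totals = defaultdict(int)
    for part in sorted(Path(shared_dir).glob("part-*.json")):
        with open(part) as f:
            for key, value in json.load(f):
                totals[key] += value
    return dict(totals)


with tempfile.TemporaryDirectory() as shared:  # stands in for an HDFS path
    run_map_job(["a b a", "b c"], shared, num_partitions=2)
    # The map output is now persisted, so the map workers are no longer needed.
    result = run_reduce_job(shared)
```

The point of the sketch is the hand-off: once the first job has written its partition files to shared storage, nothing in the second job depends on the machines that ran the maps.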

Is there a way to make the mappers store their final output in HDFS?
