Yes, adding more resources to the scheduling request would be the ideal solution to the problem, but sadly that is not a trivial change. The initial solution I suggested is an ugly hack and will not work for the cases you have described. If you feel that this work is important, please feel free to file a JIRA for it. We can continue the discussion on that JIRA about the details of how to add this type of functionality. I am very interested in the scheduler and would be happy to help out, but sadly my time right now is very limited.

--Bobby Evans

On 5/10/12 6:56 AM, "Radim Kolar" <h...@filez.com> wrote:

> We've been against these 'features' since it leads to very bad
> behaviour across the cluster with multiple apps/users etc.

It's not a new feature; it's an extension of the existing resource scheduling, which works well enough only for RAM. There are two other resources, CPU cores and network IO, that need to be considered. We have a job that does a lot of network IO in the mapper, and it is desirable to run its mappers on different nodes even if reading blocks from HDFS will not be local. Our second job burns all CPU cores on a machine while doing computations, so it is important for its mappers not to land on the same node.
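
To make the request above concrete, here is a minimal sketch of what checking more than RAM could look like. It is only an illustration under stated assumptions: `Resources`, `Node`, and `MultiResourceSketch` are hypothetical names, not Hadoop classes, and the first-fit loop stands in for whatever policy a real scheduler would use.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical resource vector (not a Hadoop class): RAM, CPU cores, network bandwidth. */
final class Resources {
    final int memoryMb;
    final int cpuCores;
    final int networkMbps;

    Resources(int memoryMb, int cpuCores, int networkMbps) {
        this.memoryMb = memoryMb;
        this.cpuCores = cpuCores;
        this.networkMbps = networkMbps;
    }

    /** True only if this request fits into 'free' in every dimension, not just RAM. */
    boolean fitsIn(Resources free) {
        return memoryMb <= free.memoryMb
            && cpuCores <= free.cpuCores
            && networkMbps <= free.networkMbps;
    }

    /** Returns what remains after subtracting 'used' from this vector. */
    Resources minus(Resources used) {
        return new Resources(memoryMb - used.memoryMb,
                             cpuCores - used.cpuCores,
                             networkMbps - used.networkMbps);
    }
}

/** Hypothetical worker node that tracks its remaining free resources. */
final class Node {
    final String name;
    Resources free;

    Node(String name, Resources free) {
        this.name = name;
        this.free = free;
    }

    /** Reserves the request on this node if it fits; returns whether it was reserved. */
    boolean tryAssign(Resources request) {
        if (!request.fitsIn(free)) {
            return false;
        }
        free = free.minus(request);
        return true;
    }
}

final class MultiResourceSketch {
    /**
     * First-fit placement: each task goes to the first node with room in all
     * dimensions. The result holds one node name per task, or null if no node fits.
     */
    static List<String> place(List<Node> nodes, Resources perTask, int tasks) {
        List<String> placement = new ArrayList<>();
        for (int t = 0; t < tasks; t++) {
            String chosen = null;
            for (Node node : nodes) {
                if (node.tryAssign(perTask)) {
                    chosen = node.name;
                    break;
                }
            }
            placement.add(chosen);
        }
        return placement;
    }
}
```

With two 8-core nodes, a mapper that asks for all 8 cores lands on a different node each time, while a mapper that asks for 1 core can share a node, which is the behaviour the CPU-bound job above needs. A RAM-only check would have packed both CPU-heavy mappers onto the first node.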