You have no guarantee that your time-sensitive data is safe/committed
until after your reduce has completed. If you care about reliability
or data integrity, simply run a full map-reduce job in your collection
window and store the result in HDFS. Do the expensive post-processing
that you have a quarter to complete as another job. Being able to
preempt a long job with a time-sensitive short job seems to be your
real requirement.
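Roughly like this (an untested sketch: the paths, job names, and
driver class are invented, and the method names follow the later
mapred API rather than whatever the current release has):

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // Sketch: collect during the window and commit to HDFS as one
    // job, then run the expensive post-processing as a separate job.
    public class CollectionWindowJob {
      public static void main(String[] args) throws IOException {
        JobConf collect = new JobConf(CollectionWindowJob.class);
        collect.setJobName("collect-window");
        // Identity map/reduce by default; a real job sets its own.
        FileInputFormat.setInputPaths(collect, new Path("/feeds/incoming"));
        FileOutputFormat.setOutputPath(collect, new Path("/warehouse/raw"));
        // Blocks until the job completes; only then is the reduce
        // output committed to HDFS and the collected data durable.
        JobClient.runJob(collect);

        // Post-processing has until the end of the quarter, so it is
        // its own job rather than being bolted onto collection.
        JobConf post = new JobConf(CollectionWindowJob.class);
        post.setJobName("post-process");
        FileInputFormat.setInputPaths(post, new Path("/warehouse/raw"));
        FileOutputFormat.setOutputPath(post, new Path("/warehouse/processed"));
        JobClient.runJob(post);
      }
    }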
On May 21, 2006, at 11:22 AM, Rod Taylor wrote:
(2) Have a per-job total task count limit. Currently, we establish the
number of tasks each node runs, and how many map or reduce tasks we
have total in a given job. But it would be great if we could set a
ceiling on the number of tasks that run concurrently for a given job.
This may help with Andrzej's fetcher (since it is bandwidth
constrained, maybe fewer concurrent jobs would be fine?).
I like this idea. So if the highest-priority job is already running at
its task limit, then tasks can be run from the next highest-priority
job. Should there be separate limits for maps and reduces?
Limits for map and reduce are useful for a job class, not so much for
a specific job instance. Data collection may be best achieved with 15
parallel maps pulling data from remote data sources, but whether there
are 3 from one job and 12 from another isn't important. What matters
is that the 15 make the best use of resources.
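For example (a sketch only: setNumMapTasks is a real call, but the
concurrency-ceiling property is invented to illustrate the proposal,
and Fetcher stands in for a Nutch-style fetch job):

    // A fetch-class job with 200 maps total, but a hypothetical
    // ceiling of 15 running concurrently across the cluster.
    JobConf fetch = new JobConf(Fetcher.class);
    fetch.setNumMapTasks(200);                          // total maps (real API)
    fetch.setInt("mapred.job.max.concurrent.maps", 15); // hypothetical property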
A different priority for map and reduce would also be useful. Often,
collecting data within a set timeframe is far more important than
reducing it for storage or post-processing, particularly when
collection means retrieving the data from a remote resource.
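Something like this (both properties are invented for illustration;
today a job carries one priority for all of its tasks):

    // Hypothetical per-phase priorities: maps (collection) outrank
    // everything, reduces (storage/post-processing) can wait.
    JobConf job = new JobConf(CollectionWindowJob.class);
    job.set("mapred.job.map.priority", "VERY_HIGH");  // hypothetical
    job.set("mapred.job.reduce.priority", "LOW");     // hypothetical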
Data warehousing activities often require that data collection occur
once a night between set hours (very high priority), but processing of
the collected data can occur any time until the end of the quarter.
For Nutch, with both of the above you should be able to keep N fetch
map tasks running at all times, with everything else secondary within
the remaining resources. This could make use of 100% of the available
remote bandwidth.
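Putting the two invented knobs together for the fetcher (again, both
properties are hypothetical), the whole proposal comes down to a few
lines of configuration:

    // Keep up to 15 fetch maps running at all times at top priority;
    // everything else fills the remaining task slots.
    JobConf fetch = new JobConf(Fetcher.class);
    fetch.setInt("mapred.job.max.concurrent.maps", 15); // hypothetical
    fetch.set("mapred.job.map.priority", "VERY_HIGH");  // hypothetical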
--
Rod Taylor <[EMAIL PROTECTED]>