> You have no guarantee that your time sensitive data is safe /
> committed until after your reduce has completed. If you care about
> reliability or data integrity, simply run a full map-reduce job in
> your collection window and store the result in the HDFS.
Perhaps I explained incorrectly. It's NOT the data that is
time-sensitive; it is the resource availability that is time-sensitive,
with a given availability window for retrieval. So long as sorting is a
requirement of reduce, the overhead of saving is going to remain
significant.

> Do expensive post processing that you have a quarter to complete as
> another job. Being able to preempt a long job with a time-sensitive
> short job seems to really be your requirement.

Fetch has the same problem. Running fetches end-to-end (starting a new
one the instant the previous one has finished), you still end up with
lulls between fetches. For me this is about 15% of the time (15% wasted
bandwidth, since you pay a flat rate). My machines all have 12GB RAM --
temporary storage is in memory -- and reasonably fast processors. I
really don't want to hold up a new fetch map for a previous round's
fetch reduce.

> On May 21, 2006, at 11:22 AM, Rod Taylor wrote:
>
> >>> (2) Have a per-job total task count limit. Currently, we establish
> >>> the number of tasks each node runs, and how many map or reduce
> >>> tasks we have total in a given job. But it would be great if we
> >>> could set a ceiling on the number of tasks that run concurrently
> >>> for a given job. This may help with Andrzej's fetcher (since it is
> >>> bandwidth constrained, maybe fewer concurrent jobs would be fine?).
> >>
> >> I like this idea. So if the highest-priority job is already running
> >> at its task limit, then tasks can be run from the next
> >> highest-priority job. Should there be separate limits for maps and
> >> reduces?
> >
> > Limits for map and reduce are useful for a job class, not so much
> > for a specific job instance. Data collection may be best achieved
> > with 15 parallel maps pulling data from remote data sources, but the
> > fact that there are 3 from one job and 12 from another isn't
> > important. It's important that the 15 make best use of resources.
> >
> > A different priority for map and reduce would also be useful. Many
> > times data collection in a set timeframe is far more important than
> > reducing it for storage or post processing, particularly when data
> > collection is retrieving it from a remote resource.
> >
> > Data warehousing activities often require that data collection occur
> > once a night between set hours (very high priority) but processing
> > of the data collected can occur any time until the end of the
> > quarter.
> >
> > For Nutch, with both of the above you should be able to achieve N
> > number of Fetch Map processes running at all times with everything
> > else being secondary within the remaining resources. This could make
> > use of 100% of available remote bandwidth.
> >
> > --
> > Rod Taylor <[EMAIL PROTECTED]>

-- 
Rod Taylor <[EMAIL PROTECTED]>
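The scheduling behaviour discussed above -- per-job ceilings on
concurrent tasks plus separate map and reduce priorities, with
lower-priority work filling whatever slots the top job's ceiling leaves
free -- can be sketched roughly as below. This is only an illustrative
sketch of the selection logic: the class and method names (Job,
pickJobForSlot, etc.) and the numbers are invented for the example and
do not correspond to the actual Hadoop JobTracker API of the time.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TaskCeilingSchedulerSketch {

    enum Kind { MAP, REDUCE }

    static class Job {
        final String name;
        final int mapPriority;      // higher value = more urgent
        final int reducePriority;
        final int taskCeiling;      // max tasks of this job running at once
        int running;                // tasks currently running for this job
        int pendingMaps;
        int pendingReduces;

        Job(String name, int mapPriority, int reducePriority,
            int taskCeiling, int pendingMaps, int pendingReduces) {
            this.name = name;
            this.mapPriority = mapPriority;
            this.reducePriority = reducePriority;
            this.taskCeiling = taskCeiling;
            this.pendingMaps = pendingMaps;
            this.pendingReduces = pendingReduces;
        }

        int priority(Kind kind) {
            return kind == Kind.MAP ? mapPriority : reducePriority;
        }

        int pending(Kind kind) {
            return kind == Kind.MAP ? pendingMaps : pendingReduces;
        }
    }

    // Pick the job that should receive a newly freed slot of the given
    // kind: highest priority first, but only among jobs that still have
    // pending work of that kind and are below their per-job ceiling.
    static Job pickJobForSlot(List<Job> jobs, Kind kind) {
        List<Job> candidates = new ArrayList<>();
        for (Job job : jobs) {
            if (job.pending(kind) > 0 && job.running < job.taskCeiling) {
                candidates.add(job);
            }
        }
        candidates.sort(
            Comparator.comparingInt((Job j) -> j.priority(kind)).reversed());
        return candidates.isEmpty() ? null : candidates.get(0);
    }

    public static void main(String[] args) {
        // Fetch: high map priority, low reduce priority, capped at 15
        // concurrent tasks so it saturates remote bandwidth and no more.
        Job fetch = new Job("fetch", 10, 1, 15, 100, 10);
        // Post-processing: low priority, effectively no ceiling.
        Job postProcess =
            new Job("post-process", 2, 2, Integer.MAX_VALUE, 500, 50);

        List<Job> jobs = List.of(fetch, postProcess);

        // Fill 20 free map slots: the first 15 go to fetch, then fetch
        // hits its ceiling and the rest spill to the lower-priority job.
        for (int slot = 0; slot < 20; slot++) {
            Job chosen = pickJobForSlot(jobs, Kind.MAP);
            if (chosen == null) break;
            chosen.running++;
            chosen.pendingMaps--;
            System.out.println("slot " + slot + " -> " + chosen.name);
        }
    }
}

In the Nutch scenario described above, the fetch job would carry the
high map priority and a ceiling tuned to the available remote
bandwidth, while its reduce and any long-running post-processing jobs
run at low priority in whatever slots remain.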
