> You have no guarantee that your time sensitive data is safe /
> committed until after your reduce has completed. If you care about
> reliability or data integrity, simply run a full map-reduce job in
> your collection window and store the result in the HDFS.
Perhaps I explained incorrectly. It's NOT the data that is
time-sensitive; it is the resource availability that is time-sensitive,
with a given availability window for retrieval. So long as sorting is a
requirement of reduce, the overhead of saving is going to remain
significant.

> Do expensive post processing that you have a quarter to complete as
> another job. Being able to preempt a long job with a time-sensitive
> short job seems to really be your requirement.

Fetch has the same problem. Running fetches end-to-end (starting a new
one the instant the previous one has finished), you still end up with
lulls between fetches. For me this is about 15% of the time (15% wasted
bandwidth, since you pay a flat rate). My machines all have 12GB RAM --
temporary storage is in memory -- and reasonably fast processors. I
really don't want to hold up a new fetch map for a previous round's
fetch reduce.

> On May 21, 2006, at 11:22 AM, Rod Taylor wrote:
>
> >>> (2) Have a per-job total task count limit. Currently, we establish
> >>> the number of tasks each node runs, and how many map or reduce
> >>> tasks we have total in a given job. But it would be great if we
> >>> could set a ceiling on the number of tasks that run concurrently
> >>> for a given job. This may help with Andrzej's fetcher (since it is
> >>> bandwidth constrained, maybe fewer concurrent jobs would be fine?).
> >>
> >> I like this idea. So if the highest-priority job is already running
> >> at its task limit, then tasks can be run from the next
> >> highest-priority job. Should there be separate limits for maps and
> >> reduces?
> >
> > Limits for map and reduce are useful for a job class, not so much
> > for a specific job instance. Data collection may be best achieved
> > with 15 parallel maps pulling data from remote data sources, but the
> > fact that there are 3 from one job and 12 from another isn't
> > important. It's important that the 15 make best use of resources.
> >
> > A different priority for map and reduce would also be useful. Many
> > times data collection in a set timeframe is far more important than
> > reducing it for storage or post processing, particularly when data
> > collection is retrieving it from a remote resource.
> >
> > Data warehousing activities often require that data collection occur
> > once a night between set hours (very high priority) but processing
> > of the data collected can occur any time until the end of the
> > quarter.
> >
> > For Nutch, with both of the above you should be able to achieve N
> > number of Fetch Map processes running at all times with everything
> > else being secondary within the remaining resources. This could make
> > use of 100% of available remote bandwidth.
> >
> > --
> > Rod Taylor <[EMAIL PROTECTED]>

-- 
Rod Taylor <[EMAIL PROTECTED]>
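The scheduling behaviour discussed above -- per-job ceilings on
concurrent tasks plus separate map and reduce priorities, with
lower-priority work filling whatever slots the top job's ceiling leaves
free -- can be sketched roughly as below. This is only an illustrative
sketch of the selection logic: the class and method names (Job,
pickJobForSlot, etc.) and the numbers are invented for the example and
do not correspond to the actual Hadoop JobTracker API of the time.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TaskCeilingSchedulerSketch {

    enum Kind { MAP, REDUCE }

    static class Job {
        final String name;
        final int mapPriority;      // higher value = more urgent
        final int reducePriority;
        final int taskCeiling;      // max tasks of this job running at once
        int running;                // tasks currently running for this job
        int pendingMaps;
        int pendingReduces;

        Job(String name, int mapPriority, int reducePriority,
            int taskCeiling, int pendingMaps, int pendingReduces) {
            this.name = name;
            this.mapPriority = mapPriority;
            this.reducePriority = reducePriority;
            this.taskCeiling = taskCeiling;
            this.pendingMaps = pendingMaps;
            this.pendingReduces = pendingReduces;
        }

        int priority(Kind kind) {
            return kind == Kind.MAP ? mapPriority : reducePriority;
        }

        int pending(Kind kind) {
            return kind == Kind.MAP ? pendingMaps : pendingReduces;
        }
    }

    // Pick the job that should receive a newly freed slot of the given
    // kind: highest priority first, but only among jobs that still have
    // pending work of that kind and are below their per-job ceiling.
    static Job pickJobForSlot(List<Job> jobs, Kind kind) {
        List<Job> candidates = new ArrayList<>();
        for (Job job : jobs) {
            if (job.pending(kind) > 0 && job.running < job.taskCeiling) {
                candidates.add(job);
            }
        }
        candidates.sort(
            Comparator.comparingInt((Job j) -> j.priority(kind)).reversed());
        return candidates.isEmpty() ? null : candidates.get(0);
    }

    public static void main(String[] args) {
        // Fetch: high map priority, low reduce priority, capped at 15
        // concurrent tasks so it saturates remote bandwidth and no more.
        Job fetch = new Job("fetch", 10, 1, 15, 100, 10);
        // Post-processing: low priority, effectively no ceiling.
        Job postProcess =
            new Job("post-process", 2, 2, Integer.MAX_VALUE, 500, 50);

        List<Job> jobs = List.of(fetch, postProcess);

        // Fill 20 free map slots: the first 15 go to fetch, then fetch
        // hits its ceiling and the rest spill to the lower-priority job.
        for (int slot = 0; slot < 20; slot++) {
            Job chosen = pickJobForSlot(jobs, Kind.MAP);
            if (chosen == null) break;
            chosen.running++;
            chosen.pendingMaps--;
            System.out.println("slot " + slot + " -> " + chosen.name);
        }
    }
}

In the Nutch scenario described above, the fetch job would carry the
high map priority and a ceiling tuned to the available remote
bandwidth, while its reduce and any long-running post-processing jobs
run at low priority in whatever slots remain.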
