You have no guarantee that your time-sensitive data is safe/committed
until after your reduce has completed. If you care about reliability
or data integrity, simply run a full map-reduce job in your collection
window and store the result in HDFS. Do the expensive post-processing
that you have a quarter to complete as another job. Being able to
preempt a long job with a time-sensitive short job seems to be your
real requirement.
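Roughly like this (an untested sketch: the paths, job names, and
driver class are invented, and the method names follow the later
mapred API rather than whatever the current release has):

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // Sketch: collect during the window and commit to HDFS as one
    // job, then run the expensive post-processing as a separate job.
    public class CollectionWindowJob {
      public static void main(String[] args) throws IOException {
        JobConf collect = new JobConf(CollectionWindowJob.class);
        collect.setJobName("collect-window");
        // Identity map/reduce by default; a real job sets its own.
        FileInputFormat.setInputPaths(collect, new Path("/feeds/incoming"));
        FileOutputFormat.setOutputPath(collect, new Path("/warehouse/raw"));
        // Blocks until the job completes; only then is the reduce
        // output committed to HDFS and the collected data durable.
        JobClient.runJob(collect);

        // Post-processing has until the end of the quarter, so it is
        // its own job rather than being bolted onto collection.
        JobConf post = new JobConf(CollectionWindowJob.class);
        post.setJobName("post-process");
        FileInputFormat.setInputPaths(post, new Path("/warehouse/raw"));
        FileOutputFormat.setOutputPath(post, new Path("/warehouse/processed"));
        JobClient.runJob(post);
      }
    }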
On May 21, 2006, at 11:22 AM, Rod Taylor wrote:
(2) Have a per-job total task count limit. Currently, we establish the
number of tasks each node runs, and how many map or reduce tasks we
have total in a given job. But it would be great if we could set a
ceiling on the number of tasks that run concurrently for a given job.
This may help with Andrzej's fetcher (since it is bandwidth
constrained, maybe fewer concurrent jobs would be fine?).
I like this idea. So if the highest-priority job is already running at
its task limit, then tasks can be run from the next highest-priority
job. Should there be separate limits for maps and reduces?
Limits for map and reduce are useful for a job class, not so much for
a specific job instance. Data collection may be best achieved with 15
parallel maps pulling data from remote data sources, but whether there
are 3 from one job and 12 from another isn't important. What matters
is that the 15 make the best use of resources.
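For example (a sketch only: setNumMapTasks is a real call, but the
concurrency-ceiling property is invented to illustrate the proposal,
and Fetcher stands in for a Nutch-style fetch job):

    // A fetch-class job with 200 maps total, but a hypothetical
    // ceiling of 15 running concurrently across the cluster.
    JobConf fetch = new JobConf(Fetcher.class);
    fetch.setNumMapTasks(200);                          // total maps (real API)
    fetch.setInt("mapred.job.max.concurrent.maps", 15); // hypothetical property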
A different priority for map and reduce would also be useful. Often,
collecting data within a set timeframe is far more important than
reducing it for storage or post-processing, particularly when
collection means retrieving the data from a remote resource.
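Something like this (both properties are invented for illustration;
today a job carries one priority for all of its tasks):

    // Hypothetical per-phase priorities: maps (collection) outrank
    // everything, reduces (storage/post-processing) can wait.
    JobConf job = new JobConf(CollectionWindowJob.class);
    job.set("mapred.job.map.priority", "VERY_HIGH");  // hypothetical
    job.set("mapred.job.reduce.priority", "LOW");     // hypothetical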
Data warehousing activities often require that data collection occur
once a night between set hours (very high priority), but processing of
the collected data can occur any time until the end of the quarter.
For Nutch, with both of the above you should be able to keep N fetch
map tasks running at all times, with everything else secondary within
the remaining resources. This could make use of 100% of the available
remote bandwidth.
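Putting the two invented knobs together for the fetcher (again, both
properties are hypothetical), the whole proposal comes down to a few
lines of configuration:

    // Keep up to 15 fetch maps running at all times at top priority;
    // everything else fills the remaining task slots.
    JobConf fetch = new JobConf(Fetcher.class);
    fetch.setInt("mapred.job.max.concurrent.maps", 15); // hypothetical
    fetch.set("mapred.job.map.priority", "VERY_HIGH");  // hypothetical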
--
Rod Taylor <[EMAIL PROTECTED]>