Hi Markus,

> Currently the various Nutch jobs return 0 or -1 resp. indicating success or
> failure. It would be convenient to have certain jobs return the number of
> processed items instead of zero to make it a lot easier for shell scripts
> to fetch useful statistics.
>
> What would be an argument against the fetcher or indexer to return a
> positive integer instead of zero on success?
>

Convention, I suppose: a non-zero exit code normally signals failure to the
calling shell. I'd also rather have a way for the scripts to retrieve the
context counters, e.g. by dumping them to the standard output at the end of
a job. For instance, one of the small improvements I had in mind recently
was to count the number of URLs per status during the reduce step of the
update job, as this would give a quick overview of the progress of a crawl
without having to call readdb -stats separately. Typically this information
can't be reduced to a single value. We already have similar counters in the
other steps, e.g. fetching and parsing, so we might as well add this
functionality. With the counters dumped to the standard output, the scripts
could then filter the output and do whatever they need. There is probably a
clean and standard way of doing this in Hadoop already.
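
To illustrate the script side of this, here is a minimal sketch in Python.
It assumes each counter is printed as one line of the hypothetical form
COUNTER<TAB>group<TAB>name<TAB>value. Neither that line format nor the names
PREFIX and parse_counters exist in Nutch or Hadoop; they are placeholders
for whatever format we would agree on.

```python
import sys
from collections import defaultdict

# Hypothetical line format, NOT an existing Nutch/Hadoop output:
#   COUNTER<TAB>group<TAB>name<TAB>value
PREFIX = "COUNTER"


def parse_counters(lines):
    """Collect counter lines into {group: {name: value}}, ignoring other output."""
    counters = defaultdict(dict)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4 or fields[0] != PREFIX:
            continue  # ordinary log output, not a counter line
        _, group, name, value = fields
        try:
            number = int(value)
        except ValueError:
            continue  # malformed value: skip it rather than fail the script
        counters[group][name] = number
    return dict(counters)


if __name__ == "__main__":
    # Reads a job's standard output from stdin and prints group.name=value pairs.
    for group, values in sorted(parse_counters(sys.stdin).items()):
        for name, value in sorted(values.items()):
            print(f"{group}.{name}={value}")
```

A script could pipe a job's standard output through this and keep as many
values as it needs, which a single exit code cannot carry.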

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
