Hi Markus,

> Currently the various Nutch jobs return 0 or -1 resp. indicating success or
> failure. It would be convenient to have certain jobs return the number of
> processed items instead of zero to make it a lot easier for shell scripts to
> fetch useful statistics.
>
> What would be an argument against the fetcher or indexer to return a positive
> integer instead of zero on success?
>

Conventions, I suppose. Also, I'd rather have a way for the scripts to retrieve
the context counters, e.g. by dumping the counters to standard output at the
end of a job.

For instance, one of the small improvements I had in mind recently was to count
the number of URLs per status during the reduce step of the update job, as this
would give a quick overview of the progress of a crawl without having to call
readdb -stats separately. Typically this information can't be reduced to a
single value.

We already have similar counters in the other steps, e.g. fetching and parsing,
so we might as well simply add this functionality. By dumping that to standard
output, the scripts could then filter the output and do whatever they need.
There is probably a clean and standard way of doing this in Hadoop already.

Julien

-- 
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
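
PS: a rough sketch of what "dumping the counters to standard output" could look
like, so that a script can filter the lines it cares about. This is only an
illustration, not existing Nutch code: the class name CounterDump, the COUNTER
line prefix, the tab-separated layout and the numbers are all made up. In a
real job the nested map would be filled from the counters Hadoop exposes once
the job has finished (e.g. Job.getCounters() in the mapreduce API) rather than
by hand.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CounterDump {
    // Hypothetical marker so scripts can pick counter lines out of the job output.
    static final String PREFIX = "COUNTER";

    // Turns group -> (counter name -> value) into one tab-separated line per counter.
    public static List<String> format(Map<String, Map<String, Long>> counters) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Map<String, Long>> group : counters.entrySet()) {
            for (Map.Entry<String, Long> counter : group.getValue().entrySet()) {
                lines.add(PREFIX + "\t" + group.getKey() + "\t"
                        + counter.getKey() + "\t" + counter.getValue());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Made-up per-status URL counts, standing in for what the update job could count.
        Map<String, Long> status = new LinkedHashMap<>();
        status.put("db_fetched", 1200L);
        status.put("db_unfetched", 5400L);
        status.put("db_gone", 35L);

        Map<String, Map<String, Long>> counters = new LinkedHashMap<>();
        counters.put("CrawlDb status", status);

        // Dump at the end of the job; the exit code stays 0 or -1 as today.
        for (String line : format(counters)) {
            System.out.println(line);
        }
    }
}
```

A shell script could then keep only the lines starting with COUNTER and read
the group, name and value from the tab-separated fields, which covers the
several-values-per-job case that a single return code cannot.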

