Hi Lewis, see comments below.
> So I thought I'd take this one on tonight and see if I can resolve.
> Basically, my high level question is as follows...
> Is each line of a text file (seed file) which we attempt to inject
> into the webdb considered as an individual map task?

No - each file is processed within a single map task, not each line.

> The idea is to establish a counter for the successfully injected URLs
> (and possibly a counter for unsuccessful ones as well) so that how
> many URLs are (or should be) present within the webdb can be
> determined after bootstrapping Nutch via the inject command.

You get this information from the Hadoop MapReduce admin: the number of seeds is the Map input records of the first job, the number left after filtering and normalisation is in its Map output records, and the final number of URLs in the crawldb, after merging with whatever is already in it, is in the Reduce output records. Just get the values from the counters of these 2 jobs to display a user-friendly message in the log.
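Something along these lines, for instance (just a rough sketch against the new mapreduce API - the class, method and message wording are placeholders, not the actual Injector code):

  import java.io.IOException;

  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.TaskCounter;

  public class InjectStats {

    // sortJob: the first job, which reads, filters and normalises the seeds.
    // mergeJob: the second job, which merges the result into the crawldb.
    // Both jobs are assumed to have completed already.
    public static void logInjectStats(Job sortJob, Job mergeJob)
        throws IOException {
      // seed urls read from the text file(s)
      long seeds = sortJob.getCounters()
          .findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
      // urls that survived filtering and normalisation
      long accepted = sortJob.getCounters()
          .findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue();
      // total urls in the crawldb after merging with the existing entries
      long total = mergeJob.getCounters()
          .findCounter(TaskCounter.REDUCE_OUTPUT_RECORDS).getValue();

      System.out.println("Injector: " + seeds + " seed urls read, "
          + (seeds - accepted) + " rejected by filtering/normalisation, "
          + total + " urls in the crawldb after merge.");
    }
  }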
In general I would advise anyone to use the pseudo-distributed mode instead of the local one, as you get a lot more info from the Hadoop admin screen and won't have to trawl through the log files.

HTH

Julien

-- 
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
