Hi Lewis, see comments below.
> So I thought I'd take this one on tonight and see if I can resolve.
> Basically, my high level question is as follows...
> Is each line of a text file (seed file) which we attempt to inject
> into the webdb considered as an individual map task?

No - each file is processed within a single map task, not each line.

> The idea is to establish a counter for the successfully injected URLs
> (and possibly a counter for unsuccessful ones as well) so that how
> many URLs are (or should be) present within the webdb can be
> determined after bootstrapping Nutch via the inject command.

You get this information from the Hadoop MapReduce admin: the number of seeds is the Map input records of the first job, the number left after filtering and normalisation is in its Map output records, and the final number of URLs in the crawldb, after merging with whatever is already in it, is in the Reduce output records. Just get the values from the counters of these 2 jobs to display a user-friendly message in the log.
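Something along these lines, for instance (just a rough sketch against the new mapreduce API - the class, method and message wording are placeholders, not the actual Injector code):

  import java.io.IOException;

  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.TaskCounter;

  public class InjectStats {

    // sortJob: the first job, which reads, filters and normalises the seeds.
    // mergeJob: the second job, which merges the result into the crawldb.
    // Both jobs are assumed to have completed already.
    public static void logInjectStats(Job sortJob, Job mergeJob)
        throws IOException {
      // seed urls read from the text file(s)
      long seeds = sortJob.getCounters()
          .findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
      // urls that survived filtering and normalisation
      long accepted = sortJob.getCounters()
          .findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue();
      // total urls in the crawldb after merging with the existing entries
      long total = mergeJob.getCounters()
          .findCounter(TaskCounter.REDUCE_OUTPUT_RECORDS).getValue();

      System.out.println("Injector: " + seeds + " seed urls read, "
          + (seeds - accepted) + " rejected by filtering/normalisation, "
          + total + " urls in the crawldb after merge.");
    }
  }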
In general I would advise anyone to use the pseudo-distributed mode instead of the local one, as you get a lot more info from the Hadoop admin screen and won't have to trawl through the log files.

HTH

Julien

-- 
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
