Hi Julien,

Thanks for the comments. Any additional ones regarding the accessibility of getDataStoreClass?
Thanks again
Lewis

On Mon, Oct 29, 2012 at 4:52 PM, Julien Nioche <[email protected]> wrote:

> Hi Lewis
>
> see comments below
>
>> So I thought I'd take this one on tonight and see if I can resolve.
>> Basically, my high-level question is as follows...
>> Is each line of a text file (seed file) which we attempt to inject
>> into the webdb considered an individual map task?
>
> no - each file is a map task
>
>> The idea is to establish a counter for the successfully injected URLs
>> (and possibly a counter for unsuccessful ones as well), so that how
>> many URLs are (or should be) present within the webdb can be
>> determined after bootstrapping Nutch via the inject command.
>
> You get this information from the Hadoop MapReduce admin: the number of
> seeds is the Map input records of the first job, the number after
> filtering and normalisation is its Map output records, and the final
> number of URLs in the crawldb, after merging with whatever is already
> there, is the Reduce output records.
>
> Just get the values from the counters of these 2 jobs to display a
> user-friendly message in the log.
>
> In general I would advise anyone to use the pseudo-distributed mode
> instead of the local one, as you get a lot more info from the Hadoop
> admin screen and won't have to trawl through the log files.
>
> HTH
>
> Julien
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble

--
Lewis
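[Editor's note] For readers following the thread, below is a rough sketch of the approach Julien describes: run the two inject jobs, read the built-in task counters from each, and log a friendly summary. It is illustrative only, not the actual Nutch Injector code; the class name, the JobConf variables, and the counter group string are assumptions based on the Hadoop 1.x "old" mapred API that Nutch 1.x used at the time, so verify them against your Hadoop version.

import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class InjectCounterSketch {

  // Group name of Hadoop 1.x's built-in task counters; check this string
  // against the Hadoop version Nutch is built with.
  private static final String TASK_GROUP =
      "org.apache.hadoop.mapred.Task$Counter";

  // sortJobConf / mergeJobConf stand in for the two JobConfs the Injector
  // would normally build; they are hypothetical names for this sketch.
  public static void logInjectStats(JobConf sortJobConf, JobConf mergeJobConf)
      throws java.io.IOException {

    // First job: reads the seed file(s), filters and normalises the URLs.
    RunningJob sortJob = JobClient.runJob(sortJobConf);
    Counters sortCounters = sortJob.getCounters();
    long seedUrls = sortCounters
        .findCounter(TASK_GROUP, "MAP_INPUT_RECORDS").getCounter();
    long acceptedUrls = sortCounters
        .findCounter(TASK_GROUP, "MAP_OUTPUT_RECORDS").getCounter();

    // Second job: merges the new URLs with whatever is already in the crawldb.
    RunningJob mergeJob = JobClient.runJob(mergeJobConf);
    long totalUrls = mergeJob.getCounters()
        .findCounter(TASK_GROUP, "REDUCE_OUTPUT_RECORDS").getCounter();

    // User-friendly summary, as suggested in the thread.
    System.out.println("Injector: " + seedUrls + " seed URLs read, "
        + acceptedUrls + " accepted after filtering/normalisation, "
        + totalUrls + " URLs now in the crawldb.");
  }
}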

