Hi Julien,

Thanks for the comments. Any additional ones regarding the accessibility of getDataStoreClass?
Thanks again
Lewis

On Mon, Oct 29, 2012 at 4:52 PM, Julien Nioche <[email protected]> wrote:

> Hi Lewis
>
> see comments below
>
>> So I thought I'd take this one on tonight and see if I can resolve.
>> Basically, my high-level question is as follows...
>> Is each line of a text file (seed file) which we attempt to inject
>> into the webdb considered an individual map task?
>
> no - each file is a map task
>
>> The idea is to establish a counter for the successfully injected URLs
>> (and possibly a counter for unsuccessful ones as well), so that how
>> many URLs are (or should be) present within the webdb can be
>> determined after bootstrapping Nutch via the inject command.
>
> You get this information from the Hadoop MapReduce admin: the number of
> seeds is the Map input records of the first job, the number after
> filtering and normalisation is its Map output records, and the final
> number of URLs in the crawldb, after merging with whatever is already
> there, is the Reduce output records.
>
> Just get the values from the counters of these 2 jobs to display a
> user-friendly message in the log.
>
> In general I would advise anyone to use the pseudo-distributed mode
> instead of the local one, as you get a lot more info from the Hadoop
> admin screen and won't have to trawl through the log files.
>
> HTH
>
> Julien
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble

--
Lewis
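[Editor's note] For readers following the thread, below is a rough sketch of the approach Julien describes: run the two inject jobs, read the built-in task counters from each, and log a friendly summary. It is illustrative only, not the actual Nutch Injector code; the class name, the JobConf variables, and the counter group string are assumptions based on the Hadoop 1.x "old" mapred API that Nutch 1.x used at the time, so verify them against your Hadoop version.

import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class InjectCounterSketch {

  // Group name of Hadoop 1.x's built-in task counters; check this string
  // against the Hadoop version Nutch is built with.
  private static final String TASK_GROUP =
      "org.apache.hadoop.mapred.Task$Counter";

  // sortJobConf / mergeJobConf stand in for the two JobConfs the Injector
  // would normally build; they are hypothetical names for this sketch.
  public static void logInjectStats(JobConf sortJobConf, JobConf mergeJobConf)
      throws java.io.IOException {

    // First job: reads the seed file(s), filters and normalises the URLs.
    RunningJob sortJob = JobClient.runJob(sortJobConf);
    Counters sortCounters = sortJob.getCounters();
    long seedUrls = sortCounters
        .findCounter(TASK_GROUP, "MAP_INPUT_RECORDS").getCounter();
    long acceptedUrls = sortCounters
        .findCounter(TASK_GROUP, "MAP_OUTPUT_RECORDS").getCounter();

    // Second job: merges the new URLs with whatever is already in the crawldb.
    RunningJob mergeJob = JobClient.runJob(mergeJobConf);
    long totalUrls = mergeJob.getCounters()
        .findCounter(TASK_GROUP, "REDUCE_OUTPUT_RECORDS").getCounter();

    // User-friendly summary, as suggested in the thread.
    System.out.println("Injector: " + seedUrls + " seed URLs read, "
        + acceptedUrls + " accepted after filtering/normalisation, "
        + totalUrls + " URLs now in the crawldb.");
  }
}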

