Hi again,

Thanks Julien, I will also make this method public in the patch for 2.x.
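For anyone following along, a rough sketch of how a caller might use getDataStoreClass once it is public. The class names and generics are from memory of the 2.x StorageUtils, so treat them as approximate rather than the final signature:

    import org.apache.gora.store.DataStore;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.storage.StorageUtils;
    import org.apache.nutch.storage.WebPage;
    import org.apache.nutch.util.NutchConfiguration;

    public class DataStoreClassDemo {
      public static void main(String[] args) throws ClassNotFoundException {
        Configuration conf = NutchConfiguration.create();
        // Resolves the store class configured under storage.data.store.class;
        // the generics below are approximate.
        Class<? extends DataStore<String, WebPage>> storeClass =
            StorageUtils.getDataStoreClass(conf);
        System.out.println("DataStore implementation: " + storeClass.getName());
      }
    }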
This is actually getting quite interesting now, as I've found out that using the o.a.hadoop.mapreduce.Job#Counters API can actually lead to security issues when attempting to obtain counters for map and reduce jobs. For my own interest I'm heading over to mapreduce-user@ to get to the bottom of this one. What is really interesting is that an issue was filed [0] to deal with exactly this task, so maybe I can chip in over there... we will see :0)

Thanks for the info Julien. The above aside, the patch for 2.x is nearly done. I'll patch trunk in due course once I have the mapred specifics sorted out.

Lewis

[0] https://issues.apache.org/jira/browse/MAPREDUCE-3520

On Tue, Oct 30, 2012 at 8:27 AM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:

> Hi,
>
> Sounds pretty harmless to have that method public IMHO
>
> Julien
>
> On 29 October 2012 16:57, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
>
>> Hi Julien,
>>
>> Thanks for the comments. Any additional ones regarding the
>> accessibility of getDataStoreClass?
>>
>> Thanks again
>>
>> Lewis
>>
>> On Mon, Oct 29, 2012 at 4:52 PM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:
>>
>>> Hi Lewis
>>>
>>> see comments below
>>>
>>>> So I thought I'd take this one on tonight and see if I can resolve
>>>> it. Basically, my high-level question is as follows: is each line
>>>> of a text file (seed file) which we attempt to inject into the
>>>> webdb considered an individual map task?
>>>
>>> no - each file is a map task
>>>
>>>> The idea is to establish a counter for the successfully injected
>>>> URLs (and possibly a counter for unsuccessful ones as well), so
>>>> that the number of URLs which are (or should be) present within
>>>> the webdb can be determined after bootstrapping Nutch via the
>>>> inject command.
>>>
>>> you get this information from the Hadoop MapReduce admin: the number
>>> of seeds is the Map input records of the first job, the number post
>>> filtering and normalisation is in its Map output records, and the
>>> final number of urls in the crawldb, post merging with whatever is
>>> already in it, is in the Reduce output records.
>>>
>>> Just get the values from the counters of these 2 jobs to display a
>>> user-friendly message in the log.
>>>
>>> In general I would advise anyone to use the pseudo-distributed mode
>>> instead of the local one, as you get a lot more info from the Hadoop
>>> admin screen and won't have to trawl through the log files.
>>>
>>> HTH
>>>
>>> Julien
>>>
>>> --
>>> Open Source Solutions for Text Engineering
>>>
>>> http://digitalpebble.blogspot.com/
>>> http://www.digitalpebble.com
>>> http://twitter.com/digitalpebble
>>
>> --
>> Lewis

> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble

--
Lewis
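A minimal sketch of the counter bookkeeping discussed in this thread, using the standard o.a.hadoop.mapreduce API. The InjectStatus enum, the mapper, and the log wording are illustrative assumptions, not Nutch's actual injector code:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Counter;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    public class InjectCounterSketch {

      // Illustrative counter names; Nutch's injector does not
      // necessarily define these.
      public enum InjectStatus { URLS_INJECTED, URLS_FILTERED }

      // Mapper over the seed file: one counter bump per accepted or
      // rejected line.
      public static class SeedMapper
          extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String url = value.toString().trim();
          if (url.isEmpty() || url.startsWith("#")) {
            context.getCounter(InjectStatus.URLS_FILTERED).increment(1);
            return; // skipped: blank line or comment
          }
          context.getCounter(InjectStatus.URLS_INJECTED).increment(1);
          context.write(new Text(url), value);
        }
      }

      // After job.waitForCompletion(true), read the totals back out and
      // log a user-friendly summary, as Julien suggests.
      public static void logSummary(Job job) throws IOException {
        Counter injected =
            job.getCounters().findCounter(InjectStatus.URLS_INJECTED);
        Counter filtered =
            job.getCounters().findCounter(InjectStatus.URLS_FILTERED);
        // The built-in counters Julien mentions can be read the same way,
        // e.g. findCounter("org.apache.hadoop.mapred.Task$Counter",
        // "MAP_INPUT_RECORDS") on Hadoop 1.x.
        System.out.println("Injector: " + injected.getValue()
            + " urls injected, " + filtered.getValue() + " filtered out.");
      }
    }

Custom enum counters also show up in the same Hadoop admin screen Julien refers to, alongside the built-in Map input/output record counters, so the pseudo-distributed setup gets this for free.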