Hi again,

Thanks Julien, I will also make this method public in the patch for 2.x.
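For anyone following along, a rough sketch of how a caller might use getDataStoreClass once it is public. The class names and generics are from memory of the 2.x StorageUtils, so treat them as approximate rather than the final signature:

    import org.apache.gora.store.DataStore;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.storage.StorageUtils;
    import org.apache.nutch.storage.WebPage;
    import org.apache.nutch.util.NutchConfiguration;

    public class DataStoreClassDemo {
      public static void main(String[] args) throws ClassNotFoundException {
        Configuration conf = NutchConfiguration.create();
        // Resolves the store class configured under storage.data.store.class;
        // the generics below are approximate.
        Class<? extends DataStore<String, WebPage>> storeClass =
            StorageUtils.getDataStoreClass(conf);
        System.out.println("DataStore implementation: " + storeClass.getName());
      }
    }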
This is actually getting quite interesting now, as I've found out that using the o.a.hadoop.mapreduce.Job#Counters API can actually lead to security issues when attempting to obtain counters for map and reduce jobs. For my own interest I'm heading over to mapreduce-user@ to get to the bottom of this one. What is really interesting is that an issue was filed [0] to deal with exactly this task, so maybe I can chip in over there... we will see :0)

Thanks for the info Julien. The above aside, the patch for 2.x is nearly done. I'll patch trunk in due course once I have the mapred specifics sorted out.

Lewis

[0] https://issues.apache.org/jira/browse/MAPREDUCE-3520

On Tue, Oct 30, 2012 at 8:27 AM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:

> Hi,
>
> Sounds pretty harmless to have that method public IMHO
>
> Julien
>
> On 29 October 2012 16:57, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
>
>> Hi Julien,
>>
>> Thanks for the comments. Any additional ones regarding the
>> accessibility of getDataStoreClass?
>>
>> Thanks again
>>
>> Lewis
>>
>> On Mon, Oct 29, 2012 at 4:52 PM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:
>>
>>> Hi Lewis
>>>
>>> see comments below
>>>
>>>> So I thought I'd take this one on tonight and see if I can resolve
>>>> it. Basically, my high-level question is as follows: is each line
>>>> of a text file (seed file) which we attempt to inject into the
>>>> webdb considered an individual map task?
>>>
>>> no - each file is a map task
>>>
>>>> The idea is to establish a counter for the successfully injected
>>>> URLs (and possibly a counter for unsuccessful ones as well), so
>>>> that the number of URLs which are (or should be) present within
>>>> the webdb can be determined after bootstrapping Nutch via the
>>>> inject command.
>>>
>>> you get this information from the Hadoop MapReduce admin: the number
>>> of seeds is the Map input records of the first job, the number post
>>> filtering and normalisation is in its Map output records, and the
>>> final number of urls in the crawldb, post merging with whatever is
>>> already in it, is in the Reduce output records.
>>>
>>> Just get the values from the counters of these 2 jobs to display a
>>> user-friendly message in the log.
>>>
>>> In general I would advise anyone to use the pseudo-distributed mode
>>> instead of the local one, as you get a lot more info from the Hadoop
>>> admin screen and won't have to trawl through the log files.
>>>
>>> HTH
>>>
>>> Julien
>>>
>>> --
>>> Open Source Solutions for Text Engineering
>>>
>>> http://digitalpebble.blogspot.com/
>>> http://www.digitalpebble.com
>>> http://twitter.com/digitalpebble
>>
>> --
>> Lewis

> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble

--
Lewis
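A minimal sketch of the counter bookkeeping discussed in this thread, using the standard o.a.hadoop.mapreduce API. The InjectStatus enum, the mapper, and the log wording are illustrative assumptions, not Nutch's actual injector code:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Counter;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    public class InjectCounterSketch {

      // Illustrative counter names; Nutch's injector does not
      // necessarily define these.
      public enum InjectStatus { URLS_INJECTED, URLS_FILTERED }

      // Mapper over the seed file: one counter bump per accepted or
      // rejected line.
      public static class SeedMapper
          extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String url = value.toString().trim();
          if (url.isEmpty() || url.startsWith("#")) {
            context.getCounter(InjectStatus.URLS_FILTERED).increment(1);
            return; // skipped: blank line or comment
          }
          context.getCounter(InjectStatus.URLS_INJECTED).increment(1);
          context.write(new Text(url), value);
        }
      }

      // After job.waitForCompletion(true), read the totals back out and
      // log a user-friendly summary, as Julien suggests.
      public static void logSummary(Job job) throws IOException {
        Counter injected =
            job.getCounters().findCounter(InjectStatus.URLS_INJECTED);
        Counter filtered =
            job.getCounters().findCounter(InjectStatus.URLS_FILTERED);
        // The built-in counters Julien mentions can be read the same way,
        // e.g. findCounter("org.apache.hadoop.mapred.Task$Counter",
        // "MAP_INPUT_RECORDS") on Hadoop 1.x.
        System.out.println("Injector: " + injected.getValue()
            + " urls injected, " + filtered.getValue() + " filtered out.");
      }
    }

Custom enum counters also show up in the same Hadoop admin screen Julien refers to, alongside the built-in Map input/output record counters, so the pseudo-distributed setup gets this for free.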