Yes, you pointed me to the slides in a previous thread. I looked at them, but when I was reading the nutch source code, it escaped my mind. Thank you so much for reminding me again, Ted.
-- Jim

On 10/30/07, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
> There is a slide show on nutch that would be much more clear.  I mentioned
> it some time ago.  If you go to the hadoop presentations page on the wiki
> (http://wiki.apache.org/lucene-hadoop/HadoopPresentations) then you will be
> able to find one of the slide shows that goes through the nutch MR steps.
>
> On 10/30/07 10:17 AM, "Jim the Standing Bear" <[EMAIL PROTECTED]> wrote:
>
> > Thanks for jumping in and giving me input, Ted.  Yes, intuitively it
> > is an easy project (we had a conversation a few days back), except
> > when it comes to implementation, I am having trouble with the details.
> >
> > I tried to look at nutch's source code, but frankly it wasn't trivial.
> > I guess I will try again, with what you just said in the emails as a
> > guide.
> >
> > -- Jim
> >
> > On 10/30/07, Ted Dunning <[EMAIL PROTECTED]> wrote:
> >>
> >> When there are no new catalogs to examine, then the main code can exit.
> >>
> >> The easiest way to present this back to the controller is by using the
> >> counter capability.  That way the controller can look at the results of
> >> a map-reduce step to determine how many new catalogs were found.
> >>
> >> You haven't hit a dead end.  This is really a pretty simple program that
> >> is very similar to what nutch does all the time to crawl web sites.
> >>
> >> On 10/29/07 6:57 PM, "Jim the Standing Bear" <[EMAIL PROTECTED]> wrote:
> >>
> >>> Thanks, Stu...  Maybe my mind is way off track, but I still sense a
> >>> problem with the mapper sending feedback to the job controller.  That
> >>> is, when a mapper has reached the terminal condition, how can it tell
> >>> the job controller to stop?
> >>>
> >>> If I keep a JobConf object in the mapper and set a property
> >>> "stop.processing" to true when a mapping task has reached the terminal
> >>> condition, will it cause synchronization problems?  There could be
> >>> other mapping tasks that still wish to go on.
> >>>
> >>> I tried to find a way for the job controller to open the file in the
> >>> output path at the end of the loop and read its contents, but thus far
> >>> I haven't seen a way to achieve this.
> >>>
> >>> Does this mean I have hit a dead end?
> >>>
> >>> -- Jim
> >>>
> >>> On 10/29/07, Stu Hood <[EMAIL PROTECTED]> wrote:
> >>>> The iteration would take place in your control code (your 'main'
> >>>> method, as shown in the examples).
> >>>>
> >>>> In order to prevent records from looping infinitely, each iteration
> >>>> would need to use a separate output/input directory.
> >>>>
> >>>> Thanks,
> >>>> Stu
> >>>>
> >>>> -----Original Message-----
> >>>> From: Jim the Standing Bear <[EMAIL PROTECTED]>
> >>>> Sent: Monday, October 29, 2007 5:45pm
> >>>> To: [email protected]
> >>>> Subject: Re: can jobs be launched recursively within a mapper ?
> >>>>
> >>>> Thanks, Owen and David,
> >>>>
> >>>> I also thought of making a queue so that I can push catalog names onto
> >>>> the end of it, while the job control loop keeps removing items off the
> >>>> queue until there are none left.
> >>>>
> >>>> However, the problem is I don't see how I can do so within the
> >>>> map/reduce context.  All the code examples are one-shot deals and
> >>>> there is no iteration involved.
> >>>>
> >>>> Furthermore, what David said made sense, but to avoid an infinite
> >>>> loop, the code must remove the record it just read from the input
> >>>> file.  How do I do that using hadoop's fs?  Or does hadoop take care
> >>>> of it automatically?
> >>>>
> >>>> -- Jim
> >>>>
> >>>> On 10/29/07, David Balatero <[EMAIL PROTECTED]> wrote:
> >>>>> Aren't these questions a little advanced for a bear to be asking?
> >>>>> I'll be here all night...
> >>>>>
> >>>>> But seriously, if your job is inherently recursive, one possible way
> >>>>> to do it would be to make sure that you output in the same format
> >>>>> that you input.  Then you can keep re-reading the outputted file back
> >>>>> into a new map/reduce job, until you hit some base case and you
> >>>>> terminate.  I've had a main method before that would kick off a bunch
> >>>>> of jobs in a row -- but I wouldn't really recommend starting another
> >>>>> map/reduce job in the scope of a running map() or reduce() method.
> >>>>>
> >>>>> - David
> >>>>>
> >>>>> On Oct 29, 2007, at 2:17 PM, Jim the Standing Bear wrote:
> >>>>>
> >>>>>> then
> >>>>
> >>>> --
> >>>> --------------------------------------
> >>>> Standing Bear Has Spoken
> >>>> --------------------------------------
> >>>
> >>
> >
>

--
--------------------------------------
Standing Bear Has Spoken
--------------------------------------
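To make the suggestions above concrete, here is a rough sketch of the control loop Ted and Stu are describing, written against the old org.apache.hadoop.mapred API. Everything specific to the catalog problem is made up for illustration: the CatalogCrawler and CatalogMapper class names, the CrawlCounters enum, and the catalogs/iter-N directory layout are assumptions, not anything taken from nutch or hadoop itself. The driver runs one job per round, points each round at the previous round's output directory, and stops when the finished job's counter reports that no new catalogs were found.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

// Sketch only: class names, counter names, and paths are illustrative.
public class CatalogCrawler {

    // Counter the mapper bumps whenever it discovers another catalog to visit.
    public enum CrawlCounters { NEW_CATALOGS }

    public static void main(String[] args) throws IOException {
        int iteration = 0;
        long newCatalogs;
        do {
            JobConf conf = new JobConf(CatalogCrawler.class);
            conf.setJobName("catalog-crawl-" + iteration);
            conf.setMapperClass(CatalogMapper.class);
            conf.setNumReduceTasks(0);                    // map-only: map output feeds the next round
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(NullWritable.class);

            // Each round reads the previous round's output and writes to a fresh
            // directory (Stu's point), so records never loop back on themselves.
            // catalogs/iter-0 is assumed to be pre-seeded with the root catalog
            // names, one per line.
            FileInputFormat.setInputPaths(conf, new Path("catalogs/iter-" + iteration));
            FileOutputFormat.setOutputPath(conf, new Path("catalogs/iter-" + (iteration + 1)));

            RunningJob job = JobClient.runJob(conf);      // blocks until the job completes

            // Ted's counter suggestion: the controller, not the mapper, decides
            // when to stop, by inspecting the finished job's counters.
            newCatalogs = job.getCounters().getCounter(CrawlCounters.NEW_CATALOGS);
            iteration++;
        } while (newCatalogs > 0);
    }
}

Because each round reads only the catalogs emitted by the previous round, nothing has to be deleted from any input file to avoid reprocessing; the older directories simply stop being used.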

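A matching sketch of the mapper side, under the same assumptions: it reads one catalog name per input line, writes any sub-catalogs it discovers in exactly the same one-name-per-line format (David's point about output matching input), and bumps the counter so the driver above knows another round is needed. The fetchChildCatalogs helper is a placeholder for whatever lookup actually walks the catalog hierarchy.

import java.io.IOException;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch only: the catalog lookup is a stand-in for the real service call.
public class CatalogMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, NullWritable> {

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, NullWritable> output, Reporter reporter)
            throws IOException {
        String catalog = line.toString().trim();
        if (catalog.length() == 0) {
            return;
        }
        for (String child : fetchChildCatalogs(catalog)) {
            // Emit in the same one-name-per-line format the job reads,
            // so this output can be fed straight into the next round.
            output.collect(new Text(child), NullWritable.get());
            // Tell the controller that this round produced more work.
            reporter.incrCounter(CatalogCrawler.CrawlCounters.NEW_CATALOGS, 1);
        }
    }

    private List<String> fetchChildCatalogs(String catalog) {
        return Collections.emptyList();  // placeholder for the real catalog lookup
    }
}

One thing this sketch does not do is de-duplicate catalogs already seen in an earlier round; if the catalog structure can contain cycles, a reduce step or a side list of visited catalogs would be needed to guarantee termination.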