Thanks for jumping in and giving me input, Ted. Yes, intuitively it is an easy
project (we had a conversation about it a few days back); it's only when it
comes to implementation that I am having trouble with the details.
I tried to look at nutch's source code, but frankly it wasn't trivial. I
guess I will try again, with what you just said in the emails as a guide.

-- Jim

On 10/30/07, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
> When there are no new catalogs to examine, then the main code can exit.
>
> The easiest way to present this back to the controller is by using the
> counter capability. That way the controller can look at the results of a
> map-reduce step to determine how many new catalogs were found.
>
> You haven't hit a dead end. This is really a pretty simple program that is
> very similar to what nutch does all the time to crawl web sites.
>
> On 10/29/07 6:57 PM, "Jim the Standing Bear" <[EMAIL PROTECTED]> wrote:
>
> > Thanks, Stu... Maybe my mind is way off track - but I still sense a
> > problem with the mapper sending feedback to the job controller. That
> > is, when a mapper has reached the terminal condition, how can it tell
> > the job controller to stop?
> >
> > If I keep a JobConf object in the mapper, and set a property
> > "stop.processing" to true when a mapping task has reached the terminal
> > condition, will it cause synchronization problems? There could be
> > other mapping tasks that still wish to go on?
> >
> > I tried to find a way so that the job controller can open the file in
> > the output path at the end of the loop to read the contents; but thus
> > far, I haven't seen a way to achieve this.
> >
> > Does this mean I have hit a dead-end?
> >
> > -- Jim
> >
> > On 10/29/07, Stu Hood <[EMAIL PROTECTED]> wrote:
> >> The iteration would take place in your control code (your 'main' method, as
> >> shown in the examples).
> >>
> >> In order to prevent records from looping infinitely, each iteration would
> >> need to use a separate output/input directory.
> >>
> >> Thanks,
> >> Stu
> >>
> >> -----Original Message-----
> >> From: Jim the Standing Bear <[EMAIL PROTECTED]>
> >> Sent: Monday, October 29, 2007 5:45pm
> >> To: [email protected]
> >> Subject: Re: can jobs be launched recursively within a mapper ?
> >>
> >> thanks, Owen and David,
> >>
> >> I also thought of making a queue so that I can push catalog names to
> >> the end of it, while the job control loop keeps removing items off the
> >> queue until there is no more left.
> >>
> >> However, the problem is I don't see how I can do so within the
> >> map/reduce context. All the code examples are one-shot deals and
> >> there is no iteration involved.
> >>
> >> Furthermore, what David said made sense, but to avoid an infinite loop,
> >> the code must remove the record it just read from the input file. How
> >> do I do that using hadoop's fs? or does hadoop take care of it
> >> automatically?
> >>
> >> -- Jim
> >>
> >> On 10/29/07, David Balatero <[EMAIL PROTECTED]> wrote:
> >>> Aren't these questions a little advanced for a bear to be asking?
> >>> I'll be here all night...
> >>>
> >>> But seriously, if your job is inherently recursive, one possible way
> >>> to do it would be to make sure that you output in the same format
> >>> that you input. Then you can keep re-reading the outputted file back
> >>> into a new map/reduce job, until you hit some base case and you
> >>> terminate. I've had a main method before that would kick off a bunch
> >>> of jobs in a row -- but I wouldn't really recommend starting another
> >>> map/reduce job in the scope of a running map() or reduce() method.
> >>>
> >>> - David
> >>>
> >>> On Oct 29, 2007, at 2:17 PM, Jim the Standing Bear wrote:
> >>>
> >>>> then
> >>>
> >>
> >> --
> >> --------------------------------------
> >> Standing Bear Has Spoken
> >> --------------------------------------
> >>
> >
>

--
--------------------------------------
Standing Bear Has Spoken
--------------------------------------
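
P.S. To check that I am reading the suggestions correctly, here is roughly how
I picture the control loop. This is only a sketch: CatalogCrawlDriver, the
CrawlCounters.NEW_CATALOGS counter, the iteration-N output directories, and the
command-line arguments are all names I am making up, and the CatalogMapper it
refers to is sketched in the next snippet. Please correct me if this is not
what you meant.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class CatalogCrawlDriver {

    // Counter that each map task increments once per newly discovered catalog.
    public static enum CrawlCounters { NEW_CATALOGS }

    public static void main(String[] args) throws Exception {
        Path input = new Path(args[0]);   // seed list of catalogs
        int iteration = 0;
        long newCatalogs;

        do {
            // Each round writes to its own directory, which then becomes the
            // input of the next round, so records never loop back on themselves.
            Path output = new Path(args[1], "iteration-" + iteration);

            JobConf job = new JobConf(CatalogCrawlDriver.class);
            job.setJobName("catalog-crawl-" + iteration);
            job.setMapperClass(CatalogMapper.class);   // sketched in the next snippet
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.setInputPaths(job, input);
            FileOutputFormat.setOutputPath(job, output);

            // runJob() blocks until the job finishes, so the counters are final here.
            RunningJob finished = JobClient.runJob(job);

            // The controller reads the counter from the finished job to decide
            // whether another round is needed.
            newCatalogs = finished.getCounters().getCounter(CrawlCounters.NEW_CATALOGS);

            input = output;   // this round's output feeds the next round
            iteration++;
        } while (newCatalogs > 0);
    }
}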

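P.P.S. And the mapper side, under the same made-up names. The isNewCatalog()
check is only a placeholder for whatever logic decides that a catalog has not
been seen in an earlier iteration, and I am assuming plain text input (one
catalog name per line, read with the default TextInputFormat), so each round
writes its output in the same shape the next round reads.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CatalogMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        String catalog = value.toString().trim();
        if (catalog.length() == 0) {
            return;
        }

        // Emit in the same shape we read, so this output directory can be fed
        // straight back in as the next iteration's input.
        output.collect(new Text(catalog), new Text(""));

        if (isNewCatalog(catalog)) {
            // Report the discovery through the counter framework instead of
            // trying to write anything back into the JobConf from a task.
            reporter.incrCounter(CatalogCrawlDriver.CrawlCounters.NEW_CATALOGS, 1);
        }
    }

    // Placeholder: deciding whether a catalog is new is application logic,
    // e.g. comparing against what earlier iterations already emitted.
    private boolean isNewCatalog(String catalog) {
        return false;
    }
}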