God, I never knew they had a project like this. I should definitely check it out. I may even be able to use it at my workplace. Thanks for the tip!!
From mp2893's iPhone

On 2010. 12. 10., at 10:36 PM, "Jones, Nick" <[email protected]> wrote:

> It might be worth looking into Nutch; it can probably be configured to do the
> type of crawling you need.
>
> Nick Jones
>
> -----Original Message-----
> From: Edward Choi [mailto:[email protected]]
> Sent: Friday, December 10, 2010 6:24 AM
> To: [email protected]
> Subject: Re: Is it possible to write file output in Map phase once and write
> another file output in Reduce phase?
>
> Wow, thanks for the info. I'll definitely try that.
> One question, though...
> Are those "tagged names" and "free indicators" some kind of special class
> variables provided by the MultipleOutputs class?
>
> Ed
>
> From mp2893's iPhone
>
> On 2010. 12. 10., at 5:30 PM, Harsh J <[email protected]> wrote:
>
>> Hi,
>>
>> You can use the MultipleOutputs class to achieve this, with tagged names
>> and free indicators of whether the output was from a map or a reduce.
>>
>> On Fri, Dec 10, 2010 at 12:57 PM, edward choi <[email protected]> wrote:
>>> Hi,
>>>
>>> I'm trying to crawl numerous news sites.
>>> My plan is to make a file containing a list of all the news RSS feed URLs
>>> and the paths to save the crawled news articles.
>>> So it would be like this:
>>>
>>> nytimes_nation, /user/hadoop/nytimes
>>> nytimes_sports, /user/hadoop/nytimes
>>> latimes_world, /user/hadoop/latimes
>>> latimes_nation, /user/hadoop/latimes
>>> ...
>>>
>>> Each mapper would get a single line, crawl the assigned URL, process the
>>> text, and save the result.
>>> So this job does not need any reduce step.
>>>
>>> But what I'd also like to do is build a dictionary at the same time.
>>> This could definitely take advantage of the reduce phase. Each mapper can
>>> generate output as "KEY: term, VALUE: term_frequency",
>>> and the reducers can then merge them all together and create a dictionary.
>>> (Of course, I would be using many reducers, so the dictionary would be
>>> partitioned.)
>>>
>>> I know that I can do this by creating two separate jobs (one for crawling,
>>> the other for building the dictionary), but I'd like to do this in one pass.
>>>
>>> So my design is:
>>>
>>> Map phase ==> crawl news articles, process text, write the result to a file
>>>   |
>>>   | pass (term, term_frequency) pairs to the reducers
>>>   v
>>> Reduce phase ==> merge the (term, term_frequency) pairs and create a
>>> dictionary
>>>
>>> Is this at all possible? Or is it inherently impossible due to the
>>> structure of Hadoop?
>>> If it's possible, could anyone tell me how to do it?
>>>
>>> Ed.
>>
>> --
>> Harsh J
>> www.harshj.com
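[Archive editor's note] A minimal sketch of the one-pass design Harsh suggests, using the new-API MultipleOutputs class (org.apache.hadoop.mapreduce.lib.output.MultipleOutputs). The "tagged names" are simply named outputs registered in the driver; there is no special class variable. All concrete names here (CrawlMapper, DictReducer, the "crawl" output name, the crawl() helper) are hypothetical illustrations, not code from this thread, and the sketch needs a Hadoop classpath and cluster to actually run:

```java
// Sketch only, assuming Hadoop's new (mapreduce) API.
// CrawlMapper, DictReducer, "crawl", and crawl() are hypothetical names.
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class CrawlAndDict {

  public static class CrawlMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    private MultipleOutputs<Text, LongWritable> mos;

    @Override
    protected void setup(Context context) {
      mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void map(LongWritable key, Text line, Context context)
        throws IOException, InterruptedException {
      // Hypothetical helper: fetch and clean the article for this feed line.
      String article = crawl(line.toString());

      // Map-side file output: written directly from the mapper through the
      // named output ("crawl") registered in the driver below.
      mos.write("crawl", line, new Text(article));

      // Normal shuffle output: (term, 1) pairs headed for the reducers.
      for (String term : article.split("\\s+")) {
        context.write(new Text(term), new LongWritable(1));
      }
    }

    @Override
    protected void cleanup(Context context)
        throws IOException, InterruptedException {
      mos.close(); // flush the map-side files
    }

    private String crawl(String feedLine) {
      return ""; // placeholder for the actual fetching/processing
    }
  }

  public static class DictReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text term, Iterable<LongWritable> counts,
        Context context) throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable c : counts) {
        sum += c.get();
      }
      context.write(term, new LongWritable(sum)); // one dictionary entry
    }
  }

  // Driver-side: register the named output the mapper writes to.
  static void configure(Job job) {
    MultipleOutputs.addNamedOutput(job, "crawl",
        TextOutputFormat.class, Text.class, Text.class);
  }
}
```

The map-side files land alongside the job's regular output, prefixed with the named-output tag, so the crawl results and the reduce-built dictionary come out of a single job.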
