Hi,

You can use the MultipleOutputs class to achieve this. It lets you write side outputs under tagged names, and the names also indicate whether a given output came from a map or a reduce task.
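A rough sketch of the idea, using the newer `org.apache.hadoop.mapreduce` API (the names "crawl", `CrawlMapper`, `DictReducer`, and the `crawl()` stub are illustrative, not from the original mail; this requires the Hadoop jars on the classpath and is not runnable standalone):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class CrawlAndCount {

  public static class CrawlMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private MultipleOutputs<Text, IntWritable> mos;

    @Override
    protected void setup(Context ctx) {
      mos = new MultipleOutputs<Text, IntWritable>(ctx);
    }

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      // value holds one "feed_name, output_path" line from the input file.
      String article = crawl(value.toString());

      // Side output: the crawled article text, written under the named
      // output "crawl" and kept out of the reduce phase entirely.
      mos.write("crawl", value, new Text(article));

      // Main output: (term, 1) pairs for the dictionary-building reducers.
      for (String term : article.split("\\s+")) {
        ctx.write(new Text(term), ONE);
      }
    }

    // Placeholder for the actual RSS fetch and text processing.
    private String crawl(String feedLine) {
      return "";
    }

    @Override
    protected void cleanup(Context ctx)
        throws IOException, InterruptedException {
      mos.close(); // required, or the side files may be left incomplete
    }
  }

  public static class DictReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text term, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      ctx.write(term, new IntWritable(sum)); // one dictionary entry per term
    }
  }
}
```

In the driver you would register the named output before submitting the job, e.g. `MultipleOutputs.addNamedOutput(job, "crawl", TextOutputFormat.class, Text.class, Text.class);`. The reducer then only ever sees the (term, frequency) stream, so the crawl results and the dictionary come out of a single pass.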
On Fri, Dec 10, 2010 at 12:57 PM, edward choi <[email protected]> wrote:
> Hi,
>
> I'm trying to crawl numerous news sites.
> My plan is to make a file containing a list of all the news rss feed urls,
> and the path to save the crawled news article.
> So it would be like this:
>
> nytimes_nation, /user/hadoop/nytimes
> nytimes_sports, /user/hadoop/nytimes
> latimes_world, /user/hadoop/latimes
> latimes_nation, /user/hadoop/latimes
> ...
> ...
> ...
>
> Each mapper would get a single line and crawl the assigned url, process
> text, and save the result.
> So this job does not need any Reducing process.
>
> But what I'd also like to do is to create a dictionary at the same time.
> This could definitely take advantage of the Reduce phase. Each mapper can
> generate output as "KEY: term, VALUE: term_frequency".
> Then the Reducer can merge them all together and create a dictionary. (Of
> course I would be using many Reducers, so the dictionary would be
> partitioned.)
>
> I know that I can do this by creating two separate jobs (one for crawling,
> the other for making the dictionary), but I'd like to do this in one pass.
>
> So my design is:
>
> Map phase    ==> crawl news articles, process text, write the result to a file
>     ||
>     ||  pass (term, term_frequency) pairs to the Reducer
>     \/
> Reduce phase ==> merge the (term, term_frequency) pairs and create a
>                  dictionary
>
> Is this at all possible? Or is it inherently impossible due to the
> structure of Hadoop?
> If it's possible, could anyone tell me how to do it?
>
> Ed.

--
Harsh J
www.harshj.com
