Hi again,

Not sure if you are still on this approach after the previous suggestions, but since you asked:
2010/12/10 Edward Choi <[email protected]>:
> Wow thanks for the info. I'll definitely try that.
> One question though...
> Is that "tagged name" and "free indicator" some kind of special class
> variable provided by the MultipleOutputs class?

To add a multiple-output collector to your Mapper, you need to do something like MultipleOutputs.addNamedOutput, wherein you give a name (a string identifier, what I referred to as a "tag"). Then, when you use this collector to write your files from the mapper, you get files named <tag>-m-00000, <tag>-m-00001, and so on, apart from the usual part-00000 files. Notice also that you can tell the output file was created by a "mapper", since there is an "m" in the name itself. This is the free indicator that comes along with no extra configuration. What's more, you also get counters for each multiple-output collector you define, just by enabling them (and using a reporter)!

> Ed
>
> From mp2893's iPhone
>
> On 2010. 12. 10., at 5:30 PM, Harsh J <[email protected]> wrote:
>
>> Hi,
>>
>> You can use the MultipleOutputs class to achieve this, with tagged
>> names and free indicators of whether the output was from a map or a
>> reduce, too.
>>
>> On Fri, Dec 10, 2010 at 12:57 PM, edward choi <[email protected]> wrote:
>>> Hi,
>>>
>>> I'm trying to crawl numerous news sites.
>>> My plan is to make a file containing a list of all the news RSS feed
>>> URLs and the paths to save the crawled news articles.
>>> So it would be like this:
>>>
>>> nytimes_nation, /user/hadoop/nytimes
>>> nytimes_sports, /user/hadoop/nytimes
>>> latimes_world, /user/hadoop/latimes
>>> latimes_nation, /user/hadoop/latimes
>>> ...
>>>
>>> Each mapper would get a single line, crawl the assigned URL, process
>>> the text, and save the result.
>>> So this job does not need any reduce phase.
>>>
>>> But what I'd also like to do is to create a dictionary at the same
>>> time. This could definitely take advantage of a reduce phase.
>>> Each mapper can generate output as "KEY: term, VALUE: term_frequency".
>>> Then a reducer can merge them all together and create a dictionary.
>>> (Of course, I would be using many reducers, so the dictionary would
>>> be partitioned.)
>>>
>>> I know that I can do this by creating two separate jobs (one for
>>> crawling, the other for making the dictionary), but I'd like to do
>>> this in one pass.
>>>
>>> So my design is:
>>>
>>> Map phase ==> crawl news articles, process text, write the result to a file
>>>      ||
>>>      ||  pass (term, term_frequency) pairs to the reducer
>>>      ||
>>>      \/
>>> Reduce phase ==> merge the (term, term_frequency) pairs and create a
>>> dictionary
>>>
>>> Is this at all possible? Or is it inherently impossible due to the
>>> structure of Hadoop?
>>> If it's possible, could anyone tell me how to do it?
>>>
>>> Ed.
>>
>> --
>> Harsh J
>> www.harshj.com

--
Harsh J
www.harshj.com
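[Editor's note] The calls described above map onto the old-API (org.apache.hadoop.mapred.lib) MultipleOutputs class roughly as follows. This is only a sketch under assumptions, not a complete job: CrawlJob, the "articles" tag, and the variables url/article/term/freq are made up for illustration, and the actual crawling and term-counting logic is elided.

```java
// Sketch: old-API (org.apache.hadoop.mapred) MultipleOutputs usage.
// Job setup -- register a named output (the "tag"):
JobConf conf = new JobConf(CrawlJob.class);  // CrawlJob is hypothetical
MultipleOutputs.addNamedOutput(conf, "articles",
    TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.setCountersEnabled(conf, true);  // the free per-tag counters

// Inside the Mapper implementation:
private MultipleOutputs mos;

public void configure(JobConf job) {
    mos = new MultipleOutputs(job);
}

public void map(LongWritable key, Text line,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
    // ... crawl the URL on this line, producing url, article,
    // and (term, freq) pairs -- elided ...

    // Goes to files named articles-m-00000, articles-m-00001, ...
    mos.getCollector("articles", reporter).collect(url, article);

    // Normal output still goes to part-00000 etc., feeding the reducer.
    output.collect(term, freq);
}

public void close() throws IOException {
    mos.close();  // required, or the named-output files may be incomplete
}
```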

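[Editor's note] The reduce half of Ed's design boils down to summing counts per term. As a plain-Java illustration of what the reducer logic would do (outside Hadoop, with made-up input; TermMerge is a hypothetical name):

```java
import java.util.*;

public class TermMerge {
    // Merge partial (term, frequency) pairs the way a reducer would:
    // all values arriving for a given key are summed into one entry.
    static Map<String, Integer> merge(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> dict = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            dict.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return dict;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = List.of(
            Map.entry("hadoop", 3),   // from mapper 1
            Map.entry("news", 2),     // from mapper 1
            Map.entry("hadoop", 5));  // from mapper 2
        System.out.println(merge(pairs)); // {hadoop=8, news=2}
    }
}
```

In a real job, Hadoop's shuffle delivers all values for one key to a single reduce() call, so the reducer only sums; the TreeMap here just stands in for the sorted, partitioned output files.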