Thanks for the detailed answer! I also need to check out the suggested approaches. But since my goal is just to crawl RSS feeds, I might be better off making a small crawler myself :-). Thanks again for the reply.
Ed

From mp2893's iPhone

On 2010. 12. 11., at 2:41 AM, Harsh J <[email protected]> wrote:

> Hi again,
>
> Not sure if you are still on this approach after the previous
> suggestions, but since you asked:
>
> 2010/12/10 Edward Choi <[email protected]>:
>> Wow, thanks for the info. I'll definitely try that.
>> One question though...
>> Are "tagged name" and "free indicator" some kind of special class
>> variables provided by the MultipleOutputs class?
>
> To add a multiple-output collector to your Mapper, you need to do
> something like a MultipleOutputs.addNamedOutput -- wherein you give a
> name (a string identifier, what I referred to as a "tag"). Then, when
> using this collector to write your file from the mapper, you will get
> files named <tag>-m-00000, <tag>-m-00001 and so on, apart from the
> usual part-00000 stuff.
>
> If you notice, you also got that the output file was created from a
> "mapper", since there's an "m" in the name itself. This is the free
> indicator that comes along with no extra config.
>
> What's more -- you also get counters for the multiple-output collector
> you defined, just by enabling them (and using a reporter)!
>
>>
>> Ed
>>
>> From mp2893's iPhone
>>
>> On 2010. 12. 10., at 5:30 PM, Harsh J <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> You can use the MultipleOutputs class to achieve this, with tagged names
>>> and free indicators of whether the output was from a map or reduce
>>> also.
>>>
>>> On Fri, Dec 10, 2010 at 12:57 PM, edward choi <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> I'm trying to crawl numerous news sites.
>>>> My plan is to make a file containing a list of all the news RSS feed urls,
>>>> and the path to save the crawled news articles.
>>>> So it would be like this:
>>>>
>>>> nytimes_nation, /user/hadoop/nytimes
>>>> nytimes_sports, /user/hadoop/nytimes
>>>> latimes_world, /user/hadoop/latimes
>>>> latimes_nation, /user/hadoop/latimes
>>>> ...
>>>> ...
>>>> ...
>>>>
>>>> Each mapper would get a single line, crawl the assigned url, process the
>>>> text, and save the result.
>>>> So this job does not need any Reduce step.
>>>>
>>>> But what I'd also like to do is create a dictionary at the same time.
>>>> This could definitely take advantage of the Reduce phase. Each mapper can
>>>> generate output as "KEY: term, VALUE: term_frequency".
>>>> Then the Reducers can merge them all together and create a dictionary. (Of
>>>> course, I would be using many Reducers, so the dictionary would be
>>>> partitioned.)
>>>>
>>>> I know that I can do this by creating two separate jobs (one for crawling,
>>>> the other for building the dictionary), but I'd like to do this in one pass.
>>>>
>>>> So my design is:
>>>> Map phase ==> crawl news articles, process text, write the result to a
>>>> file.
>>>>   II
>>>>   II  pass (term, term_frequency) pairs to the Reducer
>>>>   II
>>>>   V
>>>> Reduce phase ==> merge the (term, term_frequency) pairs and create a
>>>> dictionary
>>>>
>>>> Is this at all possible? Or is it inherently impossible due to the
>>>> structure of Hadoop?
>>>> If it's possible, could anyone tell me how to do it?
>>>>
>>>> Ed.
>>>>
>>>
>>> --
>>> Harsh J
>>> www.harshj.com
>>
>
> --
> Harsh J
> www.harshj.com
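
[Archive note] A rough sketch of the setup described above, written against the old org.apache.hadoop.mapred API that was current around Hadoop 0.20: the processed article text goes to a named map-side output registered with MultipleOutputs.addNamedOutput() (so it appears as files like articles-m-00000), while (term, 1) pairs flow through the normal shuffle to reducers that build the dictionary. The named output "articles", the input line format, and the fetchAndCleanFeed() helper are illustrative assumptions, not anything posted in the thread; the actual RSS fetching and text processing is stubbed out.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class CrawlAndCount {

  public static class CrawlMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    private MultipleOutputs mos;
    private final IntWritable one = new IntWritable(1);
    private final Text term = new Text();

    @Override
    public void configure(JobConf job) {
      mos = new MultipleOutputs(job);
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      // Each input line looks like: "feed_name, /user/hadoop/output_path"
      String[] parts = value.toString().split(",", 2);
      String feedName = parts[0].trim();
      String article = fetchAndCleanFeed(feedName); // hypothetical crawl + text-processing helper

      // Side output: the processed article, written only from the map phase
      // (ends up in files named articles-m-00000, articles-m-00001, ...).
      mos.getCollector("articles", reporter).collect(new Text(feedName), new Text(article));

      // Normal output: (term, 1) pairs that feed the dictionary reducers.
      StringTokenizer tok = new StringTokenizer(article);
      while (tok.hasMoreTokens()) {
        term.set(tok.nextToken());
        output.collect(term, one);
      }
    }

    @Override
    public void close() throws IOException {
      mos.close(); // flush the named-output collectors
    }

    private String fetchAndCleanFeed(String feedName) {
      return ""; // placeholder: fetch the RSS feed and strip it down to plain text
    }
  }

  public static class DictReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) sum += values.next().get();
      output.collect(key, new IntWritable(sum)); // one dictionary entry per term
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CrawlAndCount.class);
    conf.setJobName("crawl-and-build-dictionary");
    conf.setMapperClass(CrawlMapper.class);
    conf.setReducerClass(DictReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // Register the named map-side output ("tag" = articles) and enable its counters.
    MultipleOutputs.addNamedOutput(conf, "articles", TextOutputFormat.class, Text.class, Text.class);
    MultipleOutputs.setCountersEnabled(conf, true);

    FileInputFormat.setInputPaths(conf, args[0]);
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}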
