Thanks for the detailed answer! I also need to check out the suggested 
approaches. But since my goal is just to crawl RSS feeds, I might be 
better off making a small crawler myself :-). Thanks again for the reply. 

Ed

From mp2893's iPhone

On 2010. 12. 11., at 2:41 AM, Harsh J <[email protected]> wrote:

> Hi again,
> 
> Not sure if you are still on this approach after the previous
> suggestions, but since you asked:
> 
> 2010/12/10 Edward Choi <[email protected]>:
>> Wow thanks for the info. I'll definitely try that.
>> One question though...
>> Is that "tagged name" and "free indicator" some kind of special class 
>> variable provided by the MultipleOutputs class?
> 
> To add a multiple-output collector to your Mapper, you need to do
> something like a MultipleOutputs.addNamedOutput -- wherein you give a
> name (a string identifier, what I referred to as a "tag"). Then while
> using this collector to write your file from the mapper, you will get
> files named <tag>-m-00000, <tag>-m-00001 and so on, apart from the
> usual part-00000 stuff.
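> 
> As a rough, untested sketch (old "mapred" API; the "article" tag and
> the CrawlJob class name are just placeholders I made up), the driver
> side would look something like this:
> 
>   import org.apache.hadoop.io.Text;
>   import org.apache.hadoop.mapred.JobConf;
>   import org.apache.hadoop.mapred.TextOutputFormat;
>   import org.apache.hadoop.mapred.lib.MultipleOutputs;
> 
>   JobConf conf = new JobConf(CrawlJob.class);
>   // Register a named output tagged "article". Its files will show up
>   // as article-m-00000, article-m-00001, ... next to the usual part-*.
>   MultipleOutputs.addNamedOutput(conf, "article",
>       TextOutputFormat.class, Text.class, Text.class);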
> 
> If you notice, you can also tell that the output file was created by a
> "mapper", since there's an "m" in the name itself. This is the free
> indicator that comes along with no extra config.
> 
> What's more -- you also get counters for the multiple output collector
> you defined just by enabling them (and using a reporter)!
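> 
> A minimal sketch of the mapper side (again untested, with made-up
> class and value names) would be along these lines:
> 
>   import java.io.IOException;
>   import org.apache.hadoop.io.IntWritable;
>   import org.apache.hadoop.io.LongWritable;
>   import org.apache.hadoop.io.Text;
>   import org.apache.hadoop.mapred.JobConf;
>   import org.apache.hadoop.mapred.MapReduceBase;
>   import org.apache.hadoop.mapred.Mapper;
>   import org.apache.hadoop.mapred.OutputCollector;
>   import org.apache.hadoop.mapred.Reporter;
>   import org.apache.hadoop.mapred.lib.MultipleOutputs;
> 
>   public class CrawlMapper extends MapReduceBase
>       implements Mapper<LongWritable, Text, Text, IntWritable> {
> 
>     private MultipleOutputs mos;
> 
>     public void configure(JobConf conf) {
>       mos = new MultipleOutputs(conf);
>     }
> 
>     public void map(LongWritable key, Text line,
>         OutputCollector<Text, IntWritable> output, Reporter reporter)
>         throws IOException {
>       // Write the crawled/processed article to the "article" named
>       // output; this is what produces the article-m-00000 files.
>       mos.getCollector("article", reporter)
>          .collect(new Text("feed-or-url"), new Text("processed article"));
> 
>       // The normal output still goes to the reducers as usual,
>       // e.g. (term, count) pairs for a dictionary.
>       output.collect(new Text("some-term"), new IntWritable(1));
>     }
> 
>     public void close() throws IOException {
>       mos.close();
>     }
>   }
> 
> The counters bit is just MultipleOutputs.setCountersEnabled(conf, true)
> in the driver; the Reporter you pass to getCollector() is what keeps
> them updated.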
> 
>> 
>> Ed
>> 
>> From mp2893's iPhone
>> 
>> On 2010. 12. 10., at 5:30 PM, Harsh J <[email protected]> wrote:
>> 
>>> Hi,
>>> 
>>> You can use the MultipleOutputs class to achieve this, with tagged
>>> names and free indicators of whether the output came from a map or a
>>> reduce.
>>> 
>>> On Fri, Dec 10, 2010 at 12:57 PM, edward choi <[email protected]> wrote:
>>>> Hi,
>>>> 
>>>> I'm trying to crawl numerous news sites.
>>>> My plan is to make a file containing a list of all the news RSS feed URLs,
>>>> and the paths to save the crawled news articles.
>>>> So it would be like this:
>>>> 
>>>> nytimes_nation,    /user/hadoop/nytimes
>>>> nytimes_sports,    /user/hadoop/nytimes
>>>> latimes_world,      /user/hadoop/latimes
>>>> latimes_nation,     /user/hadoop/latimes
>>>> ...
>>>> ...
>>>> ...
>>>> 
>>>> Each mapper would get a single line, crawl the assigned URL, process the
>>>> text, and save the result.
>>>> So this job does not need a Reduce phase at all.
>>>> 
>>>> But what I'd also like to do is create a dictionary at the same time.
>>>> This could definitely take advantage of the Reduce phase. Each mapper can
>>>> generate output as "KEY: term, VALUE: term_frequency".
>>>> Then the Reducer can merge them all together and create a dictionary.
>>>> (Of course I would be using many Reducers, so the dictionary would be
>>>> partitioned.)
>>>> 
>>>> I know that I can do this by creating two separate jobs (one for crawling,
>>>> the other for making the dictionary), but I'd like to do this in one pass.
>>>> 
>>>> So my design is:
>>>> Map phase ==> crawl news articles, process text, write the result to a file
>>>>       II
>>>>       II     pass the (term, term_frequency) pairs to the Reducer
>>>>       II
>>>>       V
>>>> Reduce phase ==> Merge the (term, term_frequency) pairs and create a
>>>> dictionary
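>>>> 
>>>> In code, the rough skeleton I have in mind is something like this
>>>> (just to illustrate the idea, nothing tested; the actual crawling and
>>>> text processing are stubbed out, and the "save under savePath" part
>>>> is exactly what I'm unsure about):
>>>> 
>>>>   // map: one input line = one feed to crawl
>>>>   public void map(LongWritable key, Text line,
>>>>       OutputCollector<Text, IntWritable> output, Reporter reporter)
>>>>       throws IOException {
>>>>     String[] parts = line.toString().split(",");
>>>>     String feed = parts[0].trim();
>>>>     String savePath = parts[1].trim();
>>>>     String article = fetchAndProcess(feed);  // crawl + clean text (stub)
>>>>     // ... somehow write "article" out under savePath ...
>>>> 
>>>>     // count term frequencies within this article and emit them
>>>>     Map<String, Integer> freq = new HashMap<String, Integer>();
>>>>     for (String term : article.split("\\s+")) {
>>>>       Integer c = freq.get(term);
>>>>       freq.put(term, c == null ? 1 : c + 1);
>>>>     }
>>>>     for (Map.Entry<String, Integer> e : freq.entrySet()) {
>>>>       output.collect(new Text(e.getKey()), new IntWritable(e.getValue()));
>>>>     }
>>>>   }
>>>> 
>>>> and a plain summing reducer to merge the counts:
>>>> 
>>>>   public void reduce(Text term, Iterator<IntWritable> counts,
>>>>       OutputCollector<Text, IntWritable> output, Reporter reporter)
>>>>       throws IOException {
>>>>     int sum = 0;
>>>>     while (counts.hasNext()) {
>>>>       sum += counts.next().get();
>>>>     }
>>>>     output.collect(term, new IntWritable(sum));
>>>>   }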
>>>> 
>>>> Is this at all possible? Or is it inherently impossible due to the 
>>>> structure
>>>> of Hadoop?
>>>> If it's possible, could anyone tell me how to do it?
>>>> 
>>>> Ed.
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Harsh J
>>> www.harshj.com
>> 
> 
> 
> 
> -- 
> Harsh J
> www.harshj.com
