Hi again,

Not sure if you are still on this approach after the previous
suggestions, but since you asked:

2010/12/10 Edward Choi <[email protected]>:
> Wow thanks for the info. I'll definitely try that.
> One question though...
> Is that "tagged name"and "free indicator" some kind of special class variable 
> provided by MultipleOutputs class?

To add a multiple-output collector to your Mapper, you need to do
something like a MultipleOutputs.addNamedOutput -- wherein you give a
name (a string identifier, what I referred to as a "tag"). Then, while
using this collector to write your file from the mapper, you will get
files named <tag>-m-00000, <tag>-m-00001 and so on, apart from the
usual part-00000 stuff.
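
For example, here's a rough sketch against the old "mapred" API -- the
"articles" tag, the Text key/value types and TextOutputFormat are only
placeholders I picked for illustration:

  import java.io.IOException;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;
  import org.apache.hadoop.mapred.lib.MultipleOutputs;

  // In the driver, register the named output ("tag") once:
  //   MultipleOutputs.addNamedOutput(conf, "articles",
  //       TextOutputFormat.class, Text.class, Text.class);

  public class CrawlMapper extends MapReduceBase
      implements Mapper<Text, Text, Text, Text> {

    private MultipleOutputs mos;

    public void configure(JobConf conf) {
      mos = new MultipleOutputs(conf);
    }

    public void map(Text key, Text value,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // Side output: lands in files named articles-m-00000 and so on.
      mos.getCollector("articles", reporter).collect(key, value);
      // Normal output: goes to the usual part-* files and the reducer.
      output.collect(key, value);
    }

    public void close() throws IOException {
      mos.close(); // flushes the named-output collectors
    }
  }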

If you notice, the file name itself also tells you that the output was
created from a "mapper", since there's an "m" in it (reduce-side writes
get an "r" instead). That is the indicator that comes along for free,
with no extra config.

What's more -- you also get counters for every multiple-output
collector you define, just by enabling them (and passing a Reporter)!
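
Enabling them is one more driver line (same JobConf as in the sketch
above):

  // Turns on a counter per named output, grouped under the
  // MultipleOutputs class name.
  MultipleOutputs.setCountersEnabled(conf, true);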

>
> Ed
>
> From mp2893's iPhone
>
> On 2010. 12. 10., at 5:30 PM, Harsh J <[email protected]> wrote:
>
>> Hi,
>>
>> You can use MultipleOutputs class to achieve this, with tagged names
>> and free indicators of whether the output was from a map or reduce
>> also.
>>
>> On Fri, Dec 10, 2010 at 12:57 PM, edward choi <[email protected]> wrote:
>>> Hi,
>>>
>>> I'm trying to crawl numerous news sites.
>>> My plan is to make a file containing a list of all the news rss feed urls,
>>> and the path to save the crawled news article.
>>> So it would be like this:
>>>
>>> nytimes_nation,    /user/hadoop/nytimes
>>> nytimes_sports,    /user/hadoop/nytimes
>>> latimes_world,      /user/hadoop/latimes
>>> latimes_nation,     /user/hadoop/latimes
>>> ...
>>> ...
>>> ...
>>>
>>> Each mapper would get a single line and crawl the assigned url, process
>>> text, and save the result.
>>> So this job does not need any Reducing process.
>>>
>>> But what I'd also like to do is to create a dictionary at the same time.
>>> This could definitely take advantage of Reduce phase. Each mapper can
>>> generate output as "KEY:term, VALUE:term_frequency"
>>> Then Reducer can merge them all together and create a dictionary. (Of course
>>> I would be using many Reducers so the dictionary would be partitioned)
>>>
>>> I know that I can do this by creating two separate jobs (one for crawling,
>>> the other for making dictionary), but I'd like to do this in one-pass.
>>>
>>> So my design is:
>>> Map phase ==> crawl news articles, process text, write the result to a file.
>>>        II
>>>        II     pass (term, term_frequency) pair to the Reducer
>>>        II
>>>        V
>>> Reduce phase ==> Merge the (term, term_frequency) pair and create a
>>> dictionary
>>>
>>> Is this at all possible? Or is it inherently impossible due to the structure
>>> of Hadoop?
>>> If it's possible, could anyone tell me how to do it?
>>>
>>> Ed.
>>>
>>
>>
>>
>> --
>> Harsh J
>> www.harshj.com
>



-- 
Harsh J
www.harshj.com
