I'd start with only a few RSS feeds at first, but I plan to eventually expand it to the scale of thousands of RSS feeds every 30 minutes. That's why I am so eager to implement my system in Hadoop. I skimmed through Nutch and Bixo, but I feel that eventually I'm going to have to build the system from scratch. I need a very specific index structure to do what I want, and customizing Nutch or Bixo seems to require more effort and time than writing the code from the ground up. But I can certainly refer to their methodology.
Ed

On December 11, 2010 at 4:34 PM, Ted Dunning <[email protected]> wrote:

> If you are only loading articles at that rate, I would suggest that a simple
> Java or Perl or Ruby program would be MUCH easier to write and debug than a
> full-on map-reduce program.
>
> 2010/12/10 Edward Choi <[email protected]>
>
> > Thanks for the advice. But my plan is to crawl news RSS feeds every 30
> > minutes, so I'd be downloading at most 5 to 10 news articles per map task
> > (since news isn't published that often). So I guess I won't have to worry
> > too much about the crawling delay.
> > I thought it would be a good idea to make a dictionary during the crawling
> > process, because I will need a dictionary to calculate tf-idf and I didn't
> > want to have to go through the whole repository every time a news article
> > is added.
> > If I crawl and make a dictionary at the same time, all I need to do to make
> > a dictionary is to merge the new ones (which are generated every 30 minutes)
> > with the existing dictionary, which I guess will be computationally cheap.
> >
> > Ed
> >
> > From mp2893's iPhone
> >
> > On 2010. 12. 11., at 3:42 AM, Ted Dunning <[email protected]> wrote:
> >
> > > Regarding the idea of doing word counts during the crawl, I think you are
> > > motivated by the best of principles (read input only once), but in
> > > practice, you will be doing many small crawls and saving the content.
> > > Word counting should probably not be tied too closely to the crawl
> > > because the crawl can be delayed arbitrarily. Better to have a good
> > > content repository that is updated as often as crawls complete, and run
> > > other processing against the repository whenever it seems like a good idea.
> > >
> > > 2010/12/10 Edward Choi <[email protected]>
> > >
> > >> Thanks for the tip. I guess it's a little different project from Nutch.
> > >> My understanding is that while Nutch tries to implement a whole web
> > >> search package, Bixo focuses on the crawling part. I should look into
> > >> both projects more deeply. Thanks again!!
> > >>
> > >> Ed
> > >>
> > >> From mp2893's iPhone
> > >>
> > >> On 2010. 12. 11., at 1:15 AM, Ted Dunning <[email protected]> wrote:
> > >>
> > >>> That is definitely possible, but may not be very desirable.
> > >>>
> > >>> Take a look at the Bixo project for a full-scale crawler. There is a
> > >>> lot of subtlety in the fetching of URLs due to the varying quality of
> > >>> different sites and the interaction with crawl choking due to
> > >>> robots.txt considerations.
> > >>>
> > >>> http://bixo.101tec.com/
> > >>>
> > >>> On Thu, Dec 9, 2010 at 11:27 PM, edward choi <[email protected]> wrote:
> > >>>
> > >>>> So my design is:
> > >>>> Map phase ==> crawl news articles, process text, write the result to a file.
> > >>>>   ||
> > >>>>   ||  pass (term, term_frequency) pairs to the Reducer
> > >>>>   ||
> > >>>>   V
> > >>>> Reduce phase ==> merge the (term, term_frequency) pairs and create a dictionary
> > >>>>
> > >>>> Is this at all possible? Or is it inherently impossible due to the
> > >>>> structure of Hadoop?
> > >>>>
> > >>
> > >
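For what it's worth, here is a minimal sketch of the Map -> Reduce dictionary design quoted at the bottom of the thread, just to show it is mechanically possible in Hadoop. Everything about the setup is my own assumption, not something from the thread: the class names (CrawlAndCount, CrawlMapper, DictionaryReducer), the idea that the input is a plain text file with one article URL per line, and fetching with java.net.URL plus a crude lowercase/split tokenizer instead of a real feed parser, HTML stripper, or robots.txt handling.

```java
// Sketch only: mapper fetches one article per input line and emits
// (term, term_frequency) pairs; reducer merges them into a dictionary.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CrawlAndCount {

  // Map phase: fetch the article behind one URL, tokenize it, and emit
  // per-article (term, term_frequency) pairs.
  public static class CrawlMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      Map<String, Integer> counts = new HashMap<String, Integer>();
      BufferedReader in = new BufferedReader(
          new InputStreamReader(new URL(value.toString().trim()).openStream()));
      String line;
      while ((line = in.readLine()) != null) {
        for (String term : line.toLowerCase().split("[^a-z]+")) {
          if (term.isEmpty()) continue;
          Integer c = counts.get(term);
          counts.put(term, c == null ? 1 : c + 1);
        }
      }
      in.close();
      for (Map.Entry<String, Integer> e : counts.entrySet()) {
        context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
      }
    }
  }

  // Reduce phase: merge the per-article counts into one global count per
  // term -- the "dictionary".
  public static class DictionaryReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text term, Iterable<IntWritable> freqs, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable f : freqs) {
        sum += f.get();
      }
      context.write(term, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "crawl and count");
    job.setJarByClass(CrawlAndCount.class);
    job.setMapperClass(CrawlMapper.class);
    job.setReducerClass(DictionaryReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // file of article URLs
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // dictionary output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

If I end up following Ted's advice to decouple crawling from counting, the same mapper minus the URL fetch could simply be run over article text already stored in HDFS, and the 30-minute dictionary merge would just be another pass of the reducer over the old and new counts.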
