If you are only loading articles at that rate, I would suggest that a simple
java or perl or ruby program would be MUCH easier to write and debug than a
full-on map-reduce program.
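
For concreteness, here is a minimal sketch of the kind of plain-Java poller
that would do the job: fetch a few RSS feeds every 30 minutes and append the
raw responses to a local repository. The feed URLs and the output directory
are hypothetical placeholders, not anything from this thread.

// Plain-Java alternative to a MapReduce crawl: a scheduled fetcher that
// polls RSS feeds every 30 minutes and saves each response to a repository
// directory. Feed URLs and paths are hypothetical placeholders.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class FeedPoller {
    // Hypothetical feed URLs; a real deployment would read these from config.
    private static final List<String> FEEDS = List.of(
            "http://example.com/news/rss",
            "http://example.org/headlines/rss");
    private static final Path REPO = Paths.get("crawl-repository");

    public static void main(String[] args) throws Exception {
        Files.createDirectories(REPO);
        HttpClient client = HttpClient.newHttpClient();
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        // Poll every 30 minutes, matching the crawl interval in the thread.
        scheduler.scheduleAtFixedRate(() -> {
            for (String feed : FEEDS) {
                try {
                    HttpRequest req =
                            HttpRequest.newBuilder(URI.create(feed)).GET().build();
                    HttpResponse<String> resp =
                            client.send(req, HttpResponse.BodyHandlers.ofString());
                    // One file per fetch; a real crawler would also parse the
                    // feed, dedupe articles, and respect robots.txt.
                    Path out = REPO.resolve("fetch-" + System.nanoTime() + ".xml");
                    Files.writeString(out, resp.body());
                } catch (Exception e) {
                    e.printStackTrace(); // skip a failed feed; retry next cycle
                }
            }
        }, 0, 30, TimeUnit.MINUTES);
    }
}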
2010/12/10 Edward Choi <[email protected]>

> Thanks for the advice. But my plan is to crawl news RSS feeds every 30
> minutes, so I'd be downloading at most 5 to 10 news articles per map task
> (since news isn't published that often). So I guess I won't have to worry
> too much about the crawling delay.
> I thought it would be a good idea to make a dictionary during the crawling
> process, because I will need a dictionary to calculate tf-idf and I didn't
> want to have to go through the whole repository every time a news article
> is added.
> If I crawl and make a dictionary at the same time, all I need to do to
> maintain the dictionary is to merge the new ones (which are generated
> every 30 minutes) with the existing dictionary, which I guess will be
> computationally cheap.
>
> Ed
>
> From mp2893's iPhone
>
> On 2010. 12. 11., at 3:42 AM, Ted Dunning <[email protected]> wrote:
>
> > Regarding the idea of doing word counts during the crawl, I think you
> > are motivated by the best of principles (read input only once), but in
> > practice you will be doing many small crawls and saving the content.
> > Word counting should probably not be tied too closely to the crawl,
> > because the crawl can be delayed arbitrarily. Better to have a good
> > content repository that is updated as often as crawls complete, and run
> > other processing against the repository whenever it seems like a good
> > idea.
> >
> > 2010/12/10 Edward Choi <[email protected]>
> >
> >> Thanks for the tip. I guess it's a little different project from
> >> Nutch. My understanding is that while Nutch tries to implement a whole
> >> web search package, Bixo focuses on the crawling part. I should look
> >> into both projects more deeply. Thanks again!!
> >>
> >> Ed
> >>
> >> From mp2893's iPhone
> >>
> >> On 2010. 12. 11., at 1:15 AM, Ted Dunning <[email protected]> wrote:
> >>
> >>> That is definitely possible, but may not be very desirable.
> >>>
> >>> Take a look at the Bixo project for a full-scale crawler. There is a
> >>> lot of subtlety in the fetching of URLs due to the varying quality of
> >>> different sites and the interaction with crawl choking due to
> >>> robots.txt considerations.
> >>>
> >>> http://bixo.101tec.com/
> >>>
> >>> On Thu, Dec 9, 2010 at 11:27 PM, edward choi <[email protected]> wrote:
> >>>
> >>>> So my design is:
> >>>>
> >>>> Map phase ==> crawl news articles, process text, write the result to
> >>>> a file
> >>>>    ||
> >>>>    ||  pass (term, term_frequency) pairs to the Reducer
> >>>>    ||
> >>>>    \/
> >>>> Reduce phase ==> merge the (term, term_frequency) pairs and create a
> >>>> dictionary
> >>>>
> >>>> Is this at all possible? Or is it inherently impossible due to the
> >>>> structure of Hadoop?
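
On the design question at the bottom of the thread: as Ted says, it is
definitely possible, and the counting half is a standard Hadoop word count.
Here is a minimal sketch, assuming (per Ted's advice) it runs over the saved
content repository rather than inside the crawl; the input and output paths
are placeholders.

// Counting half of the proposed design, run over the content repository:
// the mapper emits (term, 1) per occurrence, the reducer sums them into
// one dictionary entry per term. Paths are placeholders.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DictionaryBuilder {
    public static class TermMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text term = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            // Naive whitespace tokenization; real text processing would
            // normalize and filter terms here.
            StringTokenizer tok = new StringTokenizer(value.toString());
            while (tok.hasMoreTokens()) {
                term.set(tok.nextToken().toLowerCase());
                ctx.write(term, ONE); // emit (term, 1) per occurrence
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable c : counts) sum += c.get();
            ctx.write(key, new LongWritable(sum)); // one dictionary entry per term
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "dictionary builder");
        job.setJarByClass(DictionaryBuilder.class);
        job.setMapperClass(TermMapper.class);
        job.setCombinerClass(SumReducer.class); // safe: summing is associative
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path("content-repository")); // placeholder
        FileOutputFormat.setOutputPath(job, new Path("dictionary-out"));   // placeholder
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}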

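And the incremental merge Edward describes really is cheap: folding each new
30-minute dictionary into the running one is just a per-term sum. A sketch,
assuming hypothetical file names and the tab-separated "term<TAB>count" lines
that the default TextOutputFormat of the job above would produce:

// Merge a freshly generated dictionary into the master dictionary by
// summing counts per term. File names and format are assumptions.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

public class DictionaryMerge {
    // Load a "term<TAB>count" file into a sorted map.
    static Map<String, Long> load(Path file) throws IOException {
        Map<String, Long> dict = new TreeMap<>();
        try (Stream<String> lines = Files.lines(file)) {
            lines.forEach(line -> {
                String[] parts = line.split("\t");
                dict.merge(parts[0], Long.parseLong(parts[1]), Long::sum);
            });
        }
        return dict;
    }

    public static void main(String[] args) throws IOException {
        Map<String, Long> master = load(Paths.get("dictionary-master.txt")); // placeholder
        Map<String, Long> delta = load(Paths.get("dictionary-delta.txt"));   // placeholder
        // The merge itself is a per-term sum, which is why it stays cheap.
        delta.forEach((term, count) -> master.merge(term, count, Long::sum));
        List<String> out = new ArrayList<>();
        master.forEach((t, c) -> out.add(t + "\t" + c));
        Files.write(Paths.get("dictionary-master.txt"), out);
    }
}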