I'd start with only a few RSS feeds at first, but I plan to eventually expand it to the scale of thousands of RSS feeds every 30 minutes. That's why I am so eager to implement my system in Hadoop. I skimmed through Nutch and Bixo, but I feel that eventually I'm going to have to build the system from scratch. I need a very specific index structure to do what I want, and customizing Nutch or Bixo seems to require more effort and time than writing the code from the ground up. But I can certainly refer to their methodology.
Ed

On December 11, 2010 at 4:34 PM, Ted Dunning <[email protected]> wrote:

> If you are only loading articles at that rate, I would suggest that a simple
> Java or Perl or Ruby program would be MUCH easier to write and debug than a
> full-on map-reduce program.
>
> 2010/12/10 Edward Choi <[email protected]>
>
> > Thanks for the advice. But my plan is to crawl news RSS feeds every 30
> > minutes, so I'd be downloading at most 5 to 10 news articles per map task
> > (since news isn't published that often). So I guess I won't have to worry
> > too much about the crawling delay.
> > I thought it would be a good idea to make a dictionary during the crawling
> > process, because I will need a dictionary to calculate tf-idf and I didn't
> > want to have to go through the whole repository every time a news article
> > is added.
> > If I crawl and make a dictionary at the same time, all I need to do to make
> > a dictionary is to merge the new ones (which are generated every 30 minutes)
> > with the existing dictionary, which I guess will be computationally cheap.
> >
> > Ed
> >
> > From mp2893's iPhone
> >
> > On 2010. 12. 11., at 3:42 AM, Ted Dunning <[email protected]> wrote:
> >
> > > Regarding the idea of doing word counts during the crawl, I think you are
> > > motivated by the best of principles (read input only once), but in
> > > practice, you will be doing many small crawls and saving the content.
> > > Word counting should probably not be tied too closely to the crawl
> > > because the crawl can be delayed arbitrarily. Better to have a good
> > > content repository that is updated as often as crawls complete, and run
> > > other processing against the repository whenever it seems like a good idea.
> > >
> > > 2010/12/10 Edward Choi <[email protected]>
> > >
> > >> Thanks for the tip. I guess it's a little different project from Nutch.
> > >> My understanding is that while Nutch tries to implement a whole web
> > >> search package, Bixo focuses on the crawling part. I should look into
> > >> both projects more deeply. Thanks again!!
> > >>
> > >> Ed
> > >>
> > >> From mp2893's iPhone
> > >>
> > >> On 2010. 12. 11., at 1:15 AM, Ted Dunning <[email protected]> wrote:
> > >>
> > >>> That is definitely possible, but may not be very desirable.
> > >>>
> > >>> Take a look at the Bixo project for a full-scale crawler. There is a
> > >>> lot of subtlety in the fetching of URLs due to the varying quality of
> > >>> different sites and the interaction with crawl choking due to
> > >>> robots.txt considerations.
> > >>>
> > >>> http://bixo.101tec.com/
> > >>>
> > >>> On Thu, Dec 9, 2010 at 11:27 PM, edward choi <[email protected]> wrote:
> > >>>
> > >>>> So my design is:
> > >>>> Map phase ==> crawl news articles, process text, write the result to a file.
> > >>>>   ||
> > >>>>   ||  pass (term, term_frequency) pairs to the Reducer
> > >>>>   ||
> > >>>>   V
> > >>>> Reduce phase ==> merge the (term, term_frequency) pairs and create a dictionary
> > >>>>
> > >>>> Is this at all possible? Or is it inherently impossible due to the
> > >>>> structure of Hadoop?
> > >>>>
> > >>
> > >
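For what it's worth, here is a minimal sketch of the Map -> Reduce dictionary design quoted at the bottom of the thread, just to show it is mechanically possible in Hadoop. Everything about the setup is my own assumption, not something from the thread: the class names (CrawlAndCount, CrawlMapper, DictionaryReducer), the idea that the input is a plain text file with one article URL per line, and fetching with java.net.URL plus a crude lowercase/split tokenizer instead of a real feed parser, HTML stripper, or robots.txt handling.

```java
// Sketch only: mapper fetches one article per input line and emits
// (term, term_frequency) pairs; reducer merges them into a dictionary.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CrawlAndCount {

  // Map phase: fetch the article behind one URL, tokenize it, and emit
  // per-article (term, term_frequency) pairs.
  public static class CrawlMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      Map<String, Integer> counts = new HashMap<String, Integer>();
      BufferedReader in = new BufferedReader(
          new InputStreamReader(new URL(value.toString().trim()).openStream()));
      String line;
      while ((line = in.readLine()) != null) {
        for (String term : line.toLowerCase().split("[^a-z]+")) {
          if (term.isEmpty()) continue;
          Integer c = counts.get(term);
          counts.put(term, c == null ? 1 : c + 1);
        }
      }
      in.close();
      for (Map.Entry<String, Integer> e : counts.entrySet()) {
        context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
      }
    }
  }

  // Reduce phase: merge the per-article counts into one global count per
  // term -- the "dictionary".
  public static class DictionaryReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text term, Iterable<IntWritable> freqs, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable f : freqs) {
        sum += f.get();
      }
      context.write(term, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "crawl and count");
    job.setJarByClass(CrawlAndCount.class);
    job.setMapperClass(CrawlMapper.class);
    job.setReducerClass(DictionaryReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // file of article URLs
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // dictionary output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

If I end up following Ted's advice to decouple crawling from counting, the same mapper minus the URL fetch could simply be run over article text already stored in HDFS, and the 30-minute dictionary merge would just be another pass of the reducer over the old and new counts.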
