Thanks for the advice. But my plan is to crawl news RSS feeds every 30 minutes, so I'd be downloading at most 5 to 10 news articles per map task (news isn't published that often). So I guess I won't have to worry too much about crawling delay. I thought it would be a good idea to build a dictionary during the crawling process, because I will need the dictionary to calculate tf-idf and I didn't want to go through the whole repository every time a news article is added. If I crawl and build the dictionary at the same time, all I need to do is merge the new counts (which are generated every 30 minutes) with the existing dictionary, which I guess will be computationally cheap.
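To make the incremental-merge idea concrete, here is a minimal sketch (not my actual code, and the function name is hypothetical) assuming each 30-minute crawl produces a plain term -> frequency map; merging then costs time proportional to the new terms only, not the whole repository:

```python
from collections import Counter

def merge_dictionaries(existing, new_batches):
    """Merge per-crawl (term -> frequency) counts into the running
    dictionary. Counter.update adds counts for terms already present
    and inserts terms seen for the first time."""
    merged = Counter(existing)
    for batch in new_batches:
        merged.update(batch)
    return merged

# Example: the existing dictionary plus counts from one 30-minute crawl.
existing = {"hadoop": 12, "crawl": 7}
new_crawl = {"crawl": 2, "rss": 3}
print(merge_dictionaries(existing, [new_crawl]))
```

With 5 to 10 articles per crawl the new batch is tiny, so this merge should indeed be cheap compared to re-scanning the repository.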
Ed

From mp2893's iPhone

On 2010. 12. 11., at 3:42 AM, Ted Dunning <[email protected]> wrote:

> Regarding the idea of doing word counts during the crawl, I think you are
> motivated by the best of principles (read input only once), but in
> practice, you will be doing many small crawls and saving the content.
> Word counting should probably not be tied too closely to the crawl
> because the crawl can be delayed arbitrarily. Better to have a good
> content repository that is updated as often as crawls complete and run
> other processing against the repository whenever it seems like a good idea.
>
> 2010/12/10 Edward Choi <[email protected]>
>
>> Thanks for the tip. I guess it's a little different project from Nutch. My
>> understanding is that while Nutch tries to implement a whole web search
>> package, Bixo focuses on the crawling part. I should look into both
>> projects more deeply. Thanks again!!
>>
>> Ed
>>
>> From mp2893's iPhone
>>
>> On 2010. 12. 11., at 1:15 AM, Ted Dunning <[email protected]> wrote:
>>
>>> That is definitely possible, but may not be very desirable.
>>>
>>> Take a look at the Bixo project for a full-scale crawler. There is a lot
>>> of subtlety in the fetching of URLs due to the varying quality of
>>> different sites and the interaction with crawl choking due to robots.txt
>>> considerations.
>>>
>>> http://bixo.101tec.com/
>>>
>>> On Thu, Dec 9, 2010 at 11:27 PM, edward choi <[email protected]> wrote:
>>>
>>>> So my design is:
>>>>
>>>> Map phase ==> crawl news articles, process text, write the result to a file
>>>>   |
>>>>   | pass (term, term_frequency) pairs to the Reducer
>>>>   V
>>>> Reduce phase ==> merge the (term, term_frequency) pairs and create a
>>>> dictionary
>>>>
>>>> Is this at all possible? Or is it inherently impossible due to the
>>>> structure of Hadoop?
>>>>
