Thanks for the tip. I guess it's a slightly different project from Nutch. My understanding is that while Nutch tries to implement a whole web search package, Bixo focuses on the crawling part. I should look into both projects more deeply. Thanks again!
Ed

From mp2893's iPhone

On Dec 11, 2010, at 1:15 AM, Ted Dunning <[email protected]> wrote:

> That is definitely possible, but may not be very desirable.
>
> Take a look at the Bixo project for a full-scale crawler. There is a lot of
> subtlety in the fetching of URLs due to the varying quality of different
> sites and the interaction with crawl choking due to robots.txt
> considerations.
>
> http://bixo.101tec.com/
>
> On Thu, Dec 9, 2010 at 11:27 PM, edward choi <[email protected]> wrote:
>
>> So my design is:
>>
>> Map phase ==> crawl news articles, process text, write the result to a
>> file
>>     ||
>>     ||  pass (term, term_frequency) pairs to the Reducer
>>     ||
>>     \/
>> Reduce phase ==> merge the (term, term_frequency) pairs and create a
>> dictionary
>>
>> Is this at all possible? Or is it inherently impossible due to the
>> structure of Hadoop?
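For anyone following the thread: the two-phase design in the quoted message is essentially a word-count job where the mapper also does the fetching. Below is a minimal plain-Java sketch of just the counting and merging logic (no crawling, and deliberately outside the Hadoop API so it runs standalone); all class and method names here are made up for illustration, not part of Hadoop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the proposed two-phase design: the "map" step turns one
// fetched article into (term, frequency) pairs, and the "reduce" step
// merges the pairs from all mappers into a single dictionary -- the same
// merge Hadoop's shuffle + reducer would perform per term key.
public class TermDictionarySketch {

    // Map phase: tokenize one article and emit (term, frequency) pairs.
    static Map<String, Integer> mapArticle(String articleText) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : articleText.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Reduce phase: merge the per-article counts into one dictionary.
    static Map<String, Integer> reduce(List<Map<String, Integer>> mapOutputs) {
        Map<String, Integer> dictionary = new TreeMap<>();
        for (Map<String, Integer> partial : mapOutputs) {
            partial.forEach((term, freq) ->
                dictionary.merge(term, freq, Integer::sum));
        }
        return dictionary;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> mapOutputs = new ArrayList<>();
        mapOutputs.add(mapArticle("Hadoop crawls the web"));
        mapOutputs.add(mapArticle("the web is big"));
        System.out.println(reduce(mapOutputs));
        // {big=1, crawls=1, hadoop=1, is=1, the=2, web=2}
    }
}
```

In a real Hadoop job, `mapArticle` would live in a `Mapper` that emits each (term, count) pair, and `reduce` would be the per-key summing in a `Reducer`; the structure of Hadoop supports this fine, though as Ted notes, doing the fetching inside the map tasks is where the trouble starts.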
