That is definitely possible, but may not be very desirable. Take a look at the Bixo project for a full-scale crawler. There is a lot of subtlety in fetching URLs, due to the varying quality of different sites and the need to throttle the crawl to honor robots.txt.
http://bixo.101tec.com/

On Thu, Dec 9, 2010 at 11:27 PM, edward choi <[email protected]> wrote:
> So my design is:
>
> Map phase ==> crawl news articles, process text, write the result to a
> file.
>       II
>       II  pass (term, term_frequency) pair to the Reducer
>       II
>       V
> Reduce phase ==> Merge the (term, term_frequency) pair and create a
> dictionary
>
> Is this at all possible? Or is it inherently impossible due to the
> structure of Hadoop?
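For what it's worth, the map/reduce flow described in the quoted design can be sketched outside Hadoop as plain functions (the style you'd use with Hadoop Streaming). This is only an illustration: the crawling/fetching step is omitted, and the function names `map_article` and `reduce_terms` are made up for the example, not part of any Hadoop API.

```python
import re
from collections import defaultdict

def map_article(text):
    # Map phase of the quoted design: process an article's text and emit
    # (term, term_frequency) pairs. In a real job, each map task would
    # first fetch/crawl its assigned news articles; that part is omitted.
    counts = defaultdict(int)
    for term in re.findall(r"[a-z]+", text.lower()):
        counts[term] += 1
    for term, tf in counts.items():
        yield term, tf

def reduce_terms(pairs):
    # Reduce phase: merge the (term, term_frequency) pairs from all
    # mappers into a single dictionary of corpus-wide frequencies.
    dictionary = defaultdict(int)
    for term, tf in pairs:
        dictionary[term] += tf
    return dict(dictionary)

# Simulate two mapped articles feeding one reducer.
articles = ["the cat sat", "the dog sat down"]
pairs = [p for a in articles for p in map_article(a)]
dictionary = reduce_terms(pairs)
```

So the dictionary-building half of the design is straightforward MapReduce; it's the fetching inside the map tasks that Bixo exists to handle well.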
