I really like the concept of Nutch and Hadoop, but I haven't been able to build an application with them. Most of the apps I like building are targeted at the public, anyone on the internet. I built a crawler of top sites like the NYTimes and Slate, but I couldn't filter out the pages that were off-topic. E.g., links to advertising sites made up the majority of the crawled content.
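For context, the kind of exclusion I have in mind would go in Nutch's conf/regex-urlfilter.txt, where each line is a + (accept) or - (reject) prefix followed by a regex and the first matching rule wins. A minimal sketch; the ad-related patterns here are just illustrative examples, not a real blocklist:

```
# conf/regex-urlfilter.txt -- first matching rule wins

# Reject some example ad networks and ad paths (illustrative only)
-doubleclick\.net
-/ads/
-adserver

# Reject non-content file types
-\.(gif|jpg|png|css|js)$

# Accept everything else
+.
```

The problem is that a static blocklist like this only catches known ad hosts; it doesn't generalize to junk links you haven't seen yet, which is why I suspect something smarter is needed.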
My question: have you built a general crawler for the internet, and how did you find the links people would actually be interested in, as opposed to capturing a lot of the junk out there? I guess stop words and other filters are important, but I was never successful in building them. It almost seems like, for these types of apps, Nutch needs a custom spam-filtering system.

-- Berlin Brown
http://botspiritcompany.com/botlist/spring/help/about.html
newspirit technologies
