I really like the concept of Nutch and Hadoop, but I haven't been able
to build an application with them.  Most of the apps I like building
are targeted at the public, anyone on the internet.  I built a crawler
of top sites like the NYTimes and Slate, but I couldn't filter out the
sites that were off-topic.  For example, links to advertising sites
made up the majority of the search content.
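
To make the question concrete, the kind of filtering I have in mind is
a host block list in conf/regex-urlfilter.txt.  This is only a rough
sketch, assuming the usual first-match-wins '-'/'+' rule syntax; the ad
domains below are placeholders, not a real list:

  # conf/regex-urlfilter.txt -- rules are tried top to bottom, first
  # match wins; a leading '-' rejects the URL, a leading '+' keeps it.

  # reject known ad/tracking hosts (placeholder domains)
  -^https?://([a-z0-9-]+\.)*doubleclick\.net/
  -^https?://([a-z0-9-]+\.)*ads\.example\.com/

  # skip static junk that just bloats the crawl
  -\.(gif|jpg|png|css|js)$

  # accept anything else
  +.

A static list like that never kept up with the volume of ad links I was
seeing, which is what prompts the question below.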

My question: have you built a general site that crawls the internet,
and if so, how did you find links that people would actually be
interested in, as opposed to capturing a lot of the junk out there?

I guess stop words and other filters are important, but I was never
successful in building them.  It is almost as if, for these types of
apps, Nutch needs a custom spam-filtering system.
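
If I were to try again, I imagine it would look something like a custom
URLFilter plugin.  A minimal sketch, assuming Nutch's
org.apache.nutch.net.URLFilter extension point (where returning null
drops a URL); the class name and the blacklist contents are made up for
illustration:

  import java.net.MalformedURLException;
  import java.net.URL;
  import java.util.HashSet;
  import java.util.Set;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLFilter;

  // Sketch of a "spam host" filter: drop any URL whose host is on a
  // blacklist.  Returning null from filter() tells Nutch to discard
  // the URL; returning it unchanged keeps it in the crawl.
  public class SpamHostURLFilter implements URLFilter {

    private Configuration conf;

    // Hard-coded here only to keep the sketch self-contained; in a
    // real plugin this would be read from a maintained list.
    private static final Set<String> BLACKLIST = new HashSet<String>();
    static {
      BLACKLIST.add("doubleclick.net");   // placeholder ad host
      BLACKLIST.add("ads.example.com");   // placeholder ad host
    }

    public String filter(String urlString) {
      try {
        String host = new URL(urlString).getHost().toLowerCase();
        for (String bad : BLACKLIST) {
          if (host.equals(bad) || host.endsWith("." + bad)) {
            return null;                  // reject: blacklisted host
          }
        }
        return urlString;                 // keep everything else
      } catch (MalformedURLException e) {
        return null;                      // drop unparsable URLs too
      }
    }

    public void setConf(Configuration conf) {
      this.conf = conf;
    }

    public Configuration getConf() {
      return conf;
    }
  }

The plugin would still have to be wired into plugin.includes in
nutch-site.xml, and the hard-coded set would need to come from a real,
maintained blacklist, which is really the part I never solved.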

-- 
Berlin Brown
http://botspiritcompany.com/botlist/spring/help/about.html
newspirit technologies
