Droids is much simpler if all you want to do is a little bit of crawling.
Nutch is built to scale to many millions of web pages.
If you need to crawl just a few sites, I'd suggest Droids.
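
On pulling out "just the contents of a few tags": whichever crawler you pick, the extraction step itself is small. Below is a minimal sketch in plain Python, using only the standard library's html.parser. It is not the Nutch or Droids API; the names TagTextExtractor and extract_fields and the choice of tags are made up for illustration. The resulting dictionary is what you would then map onto fields of a Lucene document.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect the text found inside a chosen set of tags (hypothetical helper)."""

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted = set(wanted_tags)
        self.open_tags = []   # stack of wanted tags that are currently open
        self.fields = {}      # tag name -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        text = data.strip()
        # Keep text only while we are inside one of the wanted tags.
        if text and self.open_tags:
            self.fields.setdefault(self.open_tags[-1], []).append(text)


def extract_fields(html, wanted_tags):
    """Return {tag: text} for the wanted tags that appear in the page."""
    parser = TagTextExtractor(wanted_tags)
    parser.feed(html)
    parser.close()
    return {tag: " ".join(parts) for tag, parts in parser.fields.items()}


if __name__ == "__main__":
    page = ("<html><head><title>Hello</title></head>"
            "<body><h1>Big news</h1><p>ignored</p></body></html>")
    print(extract_fields(page, ["title", "h1"]))
```

Text nested inside an unwanted tag that itself sits inside a wanted one is credited to the enclosing wanted tag, which is usually what you want for headings and titles.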

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



----- Original Message ----
> From: no spam <mrs.nos...@gmail.com>
> To: nutch-user@lucene.apache.org
> Sent: Sun, November 15, 2009 11:06:52 AM
> Subject: crawling / data aggregation - is nutch the right tool?
> 
> I'm trying to crawl several sites, aggregate the data and provide lucene
> based search.  I want the lucene index to contain a small subset of the
> data, ie just the contents of a few tags.  I see that nutch provides the
> crawling infrastructure and scales really nicely.  I just don't have great
> insight into how I can tie into the part that extracts text from html.
> 
> Apache droids seems to be built for a task like this but I wonder if I'd
> spend a lot of time writing the infrastructure to handle the main task of
> crawling.
> 
> Thanks,
> Mark
