Droids is much simpler if all you want to do is a little bit of crawling. Nutch is built to scale to many millions of web pages. If you need to crawl just a few sites, I'd suggest Droids.
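
As for keeping only the contents of a few tags, that step is small whichever crawler fetches the pages. Here is a rough sketch in Python, using only the standard library's html.parser. It is not Droids or Nutch code; the names TagTextExtractor and extract_fields, and the default tag list, are made up for illustration.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect the text inside a chosen set of tags; ignore everything else."""

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted = set(wanted_tags)
        self.open_tags = []  # stack of wanted tags that are currently open
        self.buffers = []    # one text buffer per entry in open_tags
        self.fields = {}     # tag name -> list of extracted strings

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.open_tags.append(tag)
            self.buffers.append([])

    def handle_endtag(self, tag):
        # Only close the innermost wanted tag; stray end tags are ignored.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            raw = "".join(self.buffers.pop())
            text = " ".join(raw.split())  # collapse runs of whitespace
            self.fields.setdefault(tag, []).append(text)

    def handle_data(self, data):
        # Text belongs to every wanted tag that is currently open.
        for buf in self.buffers:
            buf.append(data)


def extract_fields(html, wanted_tags=("title", "h1")):
    """Return {tag: [text, ...]} for the wanted tags found in html."""
    parser = TagTextExtractor(wanted_tags)
    parser.feed(html)
    parser.close()
    return parser.fields
```

Each key in the returned dict could become one field of the document you add to your Lucene index, so the index holds only the subset you care about.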

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

----- Original Message ----
> From: no spam <mrs.nos...@gmail.com>
> To: nutch-user@lucene.apache.org
> Sent: Sun, November 15, 2009 11:06:52 AM
> Subject: crawling / data aggregation - is nutch the right tool?
>
> I'm trying to crawl several sites, aggregate the data and provide lucene
> based search. I want the lucene index to contain a small subset of the
> data, ie just the contents of a few tags. I see that nutch provides the
> crawling infrastructure and scales really nicely. I just don't have great
> insight into how I can tie into the part that extracts text from html.
>
> Apache droids seems to be built for a task like this but I wonder if I'd
> spend a lot of time writing the infrastructure to handle the main task of
> crawling.
>
> Thanks,
> Mark