For now I only need to crawl hundreds of pages; previously I wrote everything from scratch in Perl. I want something that lets me get started quickly and leaves room to scale in the future. I like that Droids is a framework, so I only have to do minimal work to get started. Apache Tika is the framework for parsing, and it looks right for the job. Parsing and extraction is the part I have a hard time evaluating with Nutch. Some of what I have read on the mailing list suggests it's still not all that easy to do extraction with Nutch. Am I wrong?
Mark