I'm looking at setting up a Lucene index fronted by Solr, returning JSON responses to a Python/Django app running the search UI. I have about 10,000 URLs I need to crawl, and that number is expected to rise to about 200,000 over the next year. In crawling these URLs, I will need to go 5 levels deep and stay within the domain. I need to keep the index fresh, secure, and fast, and I need to be able to scale the system up to probably tens of thousands of searches a minute.
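For context, this is roughly what I have in mind for the Django side talking to Solr - just a minimal sketch assuming the standard /select handler with wt=json; the core name "pages" and the fields are placeholders, not anything I've settled on:

    import json
    import urllib.parse
    import urllib.request

    # Placeholder core name; adjust to whatever the Solr setup ends up being.
    SOLR_SELECT = "http://localhost:8983/solr/pages/select"

    def search(query, rows=10):
        # Ask Solr for JSON back instead of the default XML response.
        params = urllib.parse.urlencode({
            "q": query,
            "rows": rows,
            "wt": "json",
        })
        with urllib.request.urlopen(SOLR_SELECT + "?" + params) as resp:
            data = json.load(resp)
        return data["response"]["docs"]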
I've ordered Lucene in Action, and I'm frantically bookmarking every wiki and FAQ page I can find on the subject. Here are the options I've come up with so far. Can anyone comment?

1) Use Nutch to build an index through Solr - requires some low-level config.

2) Build a simple crawler in Python and post XML packets to Solr to build the index - simple, but maybe too simple (rough sketch of what I mean below).

3) Use wget to get all the pages, then use ?? to index the pages locally (probably a Python script) - a hack.

I'm not sure I like any of these ideas, but I'm leaning toward 2) as it seems easy. I can always get this project going quick and agile-like, and then refactor into using Nutch down the road. That's assuming there isn't something about Nutch that you think I'll need immediately. Thoughts?
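Here's the kind of thing I'm picturing for option 2 - a crawler hands each fetched page to something like this, which posts an <add> packet to Solr's /update handler and then commits. Again, the core name and field names are placeholders, and real code would batch the commits rather than commit per document:

    import urllib.request
    from xml.sax.saxutils import escape

    # Placeholder core name; the fields would have to match the Solr schema.
    SOLR_UPDATE = "http://localhost:8983/solr/pages/update"

    def post_xml(xml_payload):
        # Send a raw XML update message to Solr.
        req = urllib.request.Request(
            SOLR_UPDATE,
            data=xml_payload.encode("utf-8"),
            headers={"Content-Type": "text/xml; charset=utf-8"},
        )
        with urllib.request.urlopen(req) as resp:
            return resp.read()

    def index_page(url, title, body):
        # Build an <add> packet for one crawled page, using the URL as the id.
        doc = (
            "<add><doc>"
            "<field name='id'>" + escape(url) + "</field>"
            "<field name='title'>" + escape(title) + "</field>"
            "<field name='text'>" + escape(body) + "</field>"
            "</doc></add>"
        )
        post_xml(doc)
        post_xml("<commit/>")  # in practice, commit per batch, not per page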
