I'm looking at setting up a Lucene index fronted by Solr, returning JSON
responses to a Python/Django app that runs the search UI.  I have about
10,000 URLs I need to crawl, and that number is expected to rise to about
200,000 over the next year.  In crawling these URLs, I will need to go
5 levels deep and stay within the domain.  I will need to keep the index
fresh, secure, and fast, and I need to be able to scale the system up to
probably tens of thousands of searches a minute.
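
On the Django side, this is roughly the query path I picture - just a
sketch using the standard library, where the host, core path, and field
names ("id", "title") are placeholders for whatever the real schema ends
up being:

# Sketch: hit Solr's /select handler with wt=json and hand the parsed
# docs to a Django view.  Host, path, and field names are placeholders.
import json
import urllib.parse
import urllib.request

def search_solr(query, rows=10):
    params = urllib.parse.urlencode({
        "q": query,
        "wt": "json",   # ask Solr for a JSON response
        "rows": rows,
    })
    url = "http://localhost:8983/solr/select?" + params
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return data["response"]["docs"]

# e.g. in a view:
# results = search_solr("title:lucene")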

I've ordered Lucene in Action, and I'm frantically bookmarking all the
wiki and FAQ pages I can find on the subject.

Here are some options I've come up with.  Can anyone comment?

1)  Use Nutch to build an index through Solr - requires some low-level config
2)  Build a simple crawler in Python and post XML packets to Solr to
build the index - simple, but may be too simple (rough sketch below)
3)  Use wget to get all the pages, and then use ?? to index the pages
locally (probably a Python script) - a hack
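
For 2), here's roughly what I'm picturing - a breadth-first crawler that
stays on one domain, goes N levels deep, and posts each page to Solr's
XML update handler.  It's only a sketch; the host, update path, and field
names (id, url, content) are placeholders for whatever the real schema
would be:

# Sketch of option 2: breadth-first, same-domain crawl to a fixed depth,
# posting each fetched page to Solr as an <add><doc> XML packet.
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from xml.sax.saxutils import escape

SOLR_UPDATE = "http://localhost:8983/solr/update"   # placeholder

class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def post_to_solr(url, content):
    """Send one page to Solr as an XML add packet."""
    xml = ("<add><doc>"
           "<field name='id'>%s</field>"
           "<field name='url'>%s</field>"
           "<field name='content'>%s</field>"
           "</doc></add>") % (escape(url), escape(url), escape(content))
    req = urllib.request.Request(
        SOLR_UPDATE, data=xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"})
    urllib.request.urlopen(req).read()

def crawl(start_url, max_depth=5):
    """Breadth-first crawl that never leaves start_url's domain."""
    domain = urlparse(start_url).netloc
    seen, queue = {start_url}, deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        try:
            html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
        except Exception:
            continue
        post_to_solr(url, html)
        if depth < max_depth:
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:
                link = urljoin(url, link)
                if urlparse(link).netloc == domain and link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    # commit so the new documents become searchable
    urllib.request.urlopen(urllib.request.Request(
        SOLR_UPDATE, data=b"<commit/>",
        headers={"Content-Type": "text/xml; charset=utf-8"}))

Obviously that skips robots.txt, politeness delays, re-crawling for
freshness, and real error handling, which is part of why I wonder if it's
too simple.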

I'm not sure I like any of these ideas, but I'm leaning toward 2) as it
seems easy.  I can always get this project going quick and agile-like,
and then refactor into using Nutch down the road.  That's assuming there
isn't something about Nutch that you think I'll need immediately.

Thoughts?
