I'm looking at setting up a Lucene index fronted by Solr, returning JSON responses to a Python/Django app running the search UI. I have about 10,000 URLs I need to crawl, and that number is expected to rise to about 200,000 over the next year. In crawling these URLs, I will need to go 5 levels deep and stay within the domain. I need to keep the index fresh, secure, and fast, and I need to be able to scale the system up to probably tens of thousands of searches a minute.
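For context, this is roughly what I have in mind for the Django side talking to Solr - just a minimal sketch assuming the standard /select handler with wt=json; the core name "pages" and the fields are placeholders, not anything I've settled on:

    import json
    import urllib.parse
    import urllib.request

    # Placeholder core name; adjust to whatever the Solr setup ends up being.
    SOLR_SELECT = "http://localhost:8983/solr/pages/select"

    def search(query, rows=10):
        # Ask Solr for JSON back instead of the default XML response.
        params = urllib.parse.urlencode({
            "q": query,
            "rows": rows,
            "wt": "json",
        })
        with urllib.request.urlopen(SOLR_SELECT + "?" + params) as resp:
            data = json.load(resp)
        return data["response"]["docs"]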
I've ordered Lucene in Action, and I'm frantically bookmarking every wiki and FAQ page I can find on the subject. Here are the options I've come up with so far. Can anyone comment?

1) Use Nutch to build an index through Solr - requires some low-level config.

2) Build a simple crawler in Python and post XML packets to Solr to build the index - simple, but maybe too simple (rough sketch of what I mean below).

3) Use wget to get all the pages, then use ?? to index the pages locally (probably a Python script) - a hack.

I'm not sure I like any of these ideas, but I'm leaning toward 2) as it seems easy. I can always get this project going quick and agile-like, and then refactor into using Nutch down the road. That's assuming there isn't something about Nutch that you think I'll need immediately. Thoughts?
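Here's the kind of thing I'm picturing for option 2 - a crawler hands each fetched page to something like this, which posts an <add> packet to Solr's /update handler and then commits. Again, the core name and field names are placeholders, and real code would batch the commits rather than commit per document:

    import urllib.request
    from xml.sax.saxutils import escape

    # Placeholder core name; the fields would have to match the Solr schema.
    SOLR_UPDATE = "http://localhost:8983/solr/pages/update"

    def post_xml(xml_payload):
        # Send a raw XML update message to Solr.
        req = urllib.request.Request(
            SOLR_UPDATE,
            data=xml_payload.encode("utf-8"),
            headers={"Content-Type": "text/xml; charset=utf-8"},
        )
        with urllib.request.urlopen(req) as resp:
            return resp.read()

    def index_page(url, title, body):
        # Build an <add> packet for one crawled page, using the URL as the id.
        doc = (
            "<add><doc>"
            "<field name='id'>" + escape(url) + "</field>"
            "<field name='title'>" + escape(title) + "</field>"
            "<field name='text'>" + escape(body) + "</field>"
            "</doc></add>"
        )
        post_xml(doc)
        post_xml("<commit/>")  # in practice, commit per batch, not per page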
