We've successfully used:

1) Nutch to fetch + parse pages
2) A custom Nutch2Solr indexer to push the parsed content into Solr

This ran on an EC2 cluster.
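
The indexer side isn't anything exotic -- the basic idea is just posting
documents to Solr's XML update handler.  A very rough Python sketch of that
posting step (this is not our actual indexer; the Solr URL and the field
names are just assumptions, and newer Solr versions want a core/collection
name in the path):

import urllib.request
from xml.sax.saxutils import escape

# Assumed Solr endpoint; adjust host/port and add a core name if needed.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"

def post_xml(xml_body):
    # POST a raw XML packet to Solr's update handler.
    req = urllib.request.Request(
        SOLR_UPDATE_URL,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    return urllib.request.urlopen(req).read()

def index_page(doc_id, url, content):
    # Wrap one page in an <add><doc> packet; field names are examples only
    # and have to match whatever your schema.xml defines.
    packet = (
        "<add><doc>"
        f'<field name="id">{escape(doc_id)}</field>'
        f'<field name="url">{escape(url)}</field>'
        f'<field name="content">{escape(content)}</field>'
        "</doc></add>"
    )
    post_xml(packet)

def commit():
    # Make the newly added documents searchable.
    post_xml("<commit/>")

If throughput matters, batch many <doc> elements into each <add> and
commit only at the end of a run rather than per document.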

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: Gene Campbell <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Friday, May 30, 2008 1:05:24 AM
> Subject: Ideas for solutions to Crawling and Solr
> 
> I'm looking at setting up a Lucene index front-ended by Solr, returning
> JSON responses to a Python/Django app running a search UI.  I have
> about 10,000 URLs I need to crawl, and that number is expected to rise
> to about 200,000 over the next year.  In crawling these URLs, I will
> need to go 5 levels deep and stay within the domain.  I will need to
> keep the index fresh, secure, and fast, and I need to be able to scale
> the system up to probably tens of thousands of searches a minute.
> 
> I've ordered Lucene in Action, and I'm frantically bookmarking all the
> wiki and FAQ pages I can find on the subject.
> 
> Here are some options I've come up with.  Can anyone comment?
> 
> 1)  Use Nutch to build an index through Solr - requires some low-level config
> 2)  Build a simple crawler in Python and post XML packets to Solr to
> build the index - simple, but may be too simple (rough sketch after
> this list)
> 3)  Use wget to get all the pages, and then use ??  to index the pages
> locally (probably a Python script) - a hack
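> 
> For 2), roughly what I have in mind is a depth-limited, same-domain
> fetch loop like the sketch below (solr_post() is just a stand-in for
> whatever actually sends the page to Solr; robots.txt, politeness
> delays, retries, etc. are all left out):
> 
> from collections import deque
> import urllib.request
> from urllib.parse import urljoin, urlparse
> from html.parser import HTMLParser
> 
> class LinkExtractor(HTMLParser):
>     # Collect href values from <a> tags.
>     def __init__(self):
>         super().__init__()
>         self.links = []
>     def handle_starttag(self, tag, attrs):
>         if tag == "a":
>             for name, value in attrs:
>                 if name == "href" and value:
>                     self.links.append(value)
> 
> def solr_post(url, content):
>     # Placeholder: replace with whatever actually indexes the page.
>     pass
> 
> def crawl(start_url, max_depth=5):
>     # Breadth-first crawl, capped at max_depth levels, restricted to
>     # the starting domain.
>     domain = urlparse(start_url).netloc
>     seen = {start_url}
>     frontier = deque([(start_url, 0)])
>     while frontier:
>         url, depth = frontier.popleft()
>         try:
>             body = urllib.request.urlopen(url, timeout=10).read()
>             body = body.decode("utf-8", "replace")
>         except Exception:
>             continue
>         solr_post(url, body)
>         if depth >= max_depth:
>             continue
>         parser = LinkExtractor()
>         parser.feed(body)
>         for link in parser.links:
>             absolute = urljoin(url, link)
>             if urlparse(absolute).netloc == domain and absolute not in seen:
>                 seen.add(absolute)
>                 frontier.append((absolute, depth + 1))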
> 
> I'm not sure I like any of these ideas, but I'm leaning toward 2) as
> it seems easy.  I can always get this project going quick and
> agile-like, and then refactor into using Nutch down the road.  That's
> assuming there isn't something about Nutch that you think I'll need
> immediately.
> 
> Thoughts?
