We've successfully used:

1) Nutch to fetch + parse pages
2) A custom Nutch2Solr indexer

This ran on an EC2 cluster.
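For the curious, the posting half of such an indexer boils down to Solr's XML update format. Here is a minimal sketch in Python, assuming a Solr instance at http://localhost:8983/solr/update and a schema with id, url, title, and content fields (the URL and field names are assumptions for illustration, not the actual Nutch2Solr code):

    # Minimal sketch of posting parsed pages to Solr's XML update handler.
    # SOLR_UPDATE_URL and the field names are assumptions, not real config.
    import urllib.request
    import xml.etree.ElementTree as ET

    SOLR_UPDATE_URL = "http://localhost:8983/solr/update"  # assumption

    def solr_add_xml(docs):
        """Build an <add><doc>...</doc></add> message from field dicts."""
        add = ET.Element("add")
        for fields in docs:
            doc = ET.SubElement(add, "doc")
            for name, value in fields.items():
                field = ET.SubElement(doc, "field", name=name)
                field.text = value
        return ET.tostring(add, encoding="utf-8")

    def post_to_solr(payload):
        """POST an update message (bytes) to Solr and return its reply."""
        req = urllib.request.Request(
            SOLR_UPDATE_URL, data=payload,
            headers={"Content-Type": "text/xml; charset=utf-8"})
        return urllib.request.urlopen(req).read()

    # Index one page, then commit so it becomes searchable.
    post_to_solr(solr_add_xml([{"id": "http://example.com/",
                                "url": "http://example.com/",
                                "title": "Example",
                                "content": "page text here"}]))
    post_to_solr(b"<commit/>")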
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Gene Campbell <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Friday, May 30, 2008 1:05:24 AM
> Subject: Ideas for solutions to Crawling and Solr
>
> I'm looking at setting up a Lucene index front-ended by Solr, returning
> JSON responses to a Python/Django app running a search UI. I have about
> 10,000 URLs I need to crawl, and that number is expected to rise to
> about 200,000 over the next year. In crawling these URLs, I will need
> to go 5 levels deep and stay within the domain. I will need to keep the
> index fresh, secure, and fast, and to scale the system up to probably
> tens of thousands of searches a minute.
>
> I've ordered Lucene in Action, and I'm frantically bookmarking all the
> wiki and FAQ pages I can find on the subject.
>
> Here are some options I've come up with. Can anyone comment?
>
> 1) Use Nutch to build an index through Solr - requires some low-level config
> 2) Build a simple crawler in Python and post XML packets to Solr to
> build the index - simple, but may be too simple (sketched below)
> 3) Use wget to get all the pages, and then use ?? to index the pages
> locally (probably a Python script) - a hack
>
> I'm not sure I like any of these ideas, but I'm leaning toward 2) as it
> seems easy. I can always get this project going quick and agile-like,
> and then refactor into using Nutch down the road. That's assuming there
> isn't something about Nutch that you think I'll need immediately.
>
> Thoughts?
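Regarding option 2, a rough sketch of what such a crawler might look like, using only Python's standard library. This is a sketch, not a finished design: it has no politeness delay, robots.txt handling, or retries, all of which a production crawler needs, and the index callback is a hypothetical hook for a Solr posting step like the one above:

    # Breadth-first crawler: stays on one domain, stops at max_depth,
    # hands each fetched page to an indexing callback. Illustrative only.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    import urllib.request

    class LinkExtractor(HTMLParser):
        """Collect href values from <a> tags."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, max_depth=5, index=print):
        domain = urlparse(start_url).netloc
        seen = {start_url}
        queue = deque([(start_url, 0)])
        while queue:
            url, depth = queue.popleft()
            try:
                html = urllib.request.urlopen(url, timeout=10) \
                    .read().decode("utf-8", "replace")
            except Exception:
                continue  # skip unreachable pages in this sketch
            index(url, html)  # e.g. post_to_solr(solr_add_xml([...]))
            if depth >= max_depth:
                continue
            parser = LinkExtractor()
            parser.feed(html)
            for href in parser.links:
                absolute = urljoin(url, href)
                if urlparse(absolute).netloc == domain and absolute not in seen:
                    seen.add(absolute)
                    queue.append((absolute, depth + 1))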

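On the "Solr returning JSON responses" piece: Solr's JSON response writer (wt=json) makes the query side straightforward from Python or Django. A hedged sketch, assuming a Solr instance at http://localhost:8983/solr/select and url/title/content fields in the schema (assumptions, as above):

    # Query Solr and read the JSON response; URL and fields are assumptions.
    import json
    import urllib.request
    from urllib.parse import urlencode

    params = urlencode({"q": "content:lucene", "wt": "json", "rows": 10})
    with urllib.request.urlopen("http://localhost:8983/solr/select?" + params) as resp:
        results = json.load(resp)
    for doc in results["response"]["docs"]:
        print(doc.get("url"), doc.get("title"))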