What do people do when 'something goes wrong' with a crawl?

First, some background: we are a small-ish university using Nutch to crawl 60,000 - 100,000 pages across 50 or so domains, which probably puts us in a different category than most Nutch users. Our crawl cycle consists of a script that crawls everything, one domain at a time, each Sunday, and we run search across all the indexes (one per domain). Our original reason for this was that merging was taking too long, but it also keeps one bad index (or a crawl with bad results) from destroying everything. Maybe we're worrying about nothing, since we haven't had any problems in almost a year of production use (knock on wood) and I don't know how often indexes 'blow up'. We also move the previous week's indexes out of the way before replacing them, so we have a backup if something happens (a rough sketch of that rotation step is below).

We have been moving things to a CMS and want to move to a system where pages are indexed as they are edited, while still being able to crawl things that don't fit in the CMS. This would be a big incentive for most of our people to use the CMS. The Solr back end looks promising, but I'm not sure how to implement a recovery plan with Solr.

Any thoughts on, or experience with, backing up Solr indexes? Is it as simple as moving the index out of the way, like we do with the Nutch indexes?
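For context, the "move last week's indexes out of the way" step boils down to something like the Python below. This is just a rough sketch, not our actual script; the paths and the domain name are made up.

#!/usr/bin/env python
# Rough sketch of the weekly "move the old index aside" step.
# CRAWL_ROOT, BACKUP_ROOT, and the example domain are placeholders.
import os
import shutil
from datetime import date

CRAWL_ROOT = "/data/nutch/indexes"        # one live index directory per domain
BACKUP_ROOT = "/data/nutch/index-backups" # where last week's indexes end up

def rotate_index(domain, new_index_dir):
    """Move the current index for a domain aside, then promote the new one."""
    live = os.path.join(CRAWL_ROOT, domain)
    backup = os.path.join(BACKUP_ROOT,
                          "%s-%s" % (domain, date.today().isoformat()))

    if not os.path.isdir(BACKUP_ROOT):
        os.makedirs(BACKUP_ROOT)

    if os.path.isdir(live):
        shutil.move(live, backup)   # keep last week's index as the fallback
    shutil.move(new_index_dir, live)  # drop in the freshly built index

# e.g. rotate_index("www.example.edu", "/data/nutch/crawl-tmp/www.example.edu/index")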
Thanks,
Eric

--
Eric J. Christeson <eric.christe...@ndsu.edu>
Enterprise Computing and Infrastructure
Phone: (701) 231-8693
North Dakota State University, Fargo, North Dakota, USA