Eric,

There are a couple of ways you can back up a Lucene index built by Solr:
1) Have a look at the Solr replication scripts, specifically snapshooter. This script creates a snapshot of an index. It's typically triggered by Solr after its "commit" or "optimize" calls, when the index is "stable" and not being modified. If you use snapshooter to create index snapshots, you can simply grab a snapshot, and there is your backup.

2) Have a look at Solr's new replication mechanism (info on the Solr Wiki), which does something similar to the above, but without relying on the replication (shell) scripts; it does everything via HTTP.

(A rough sketch of both approaches is at the end of this message, after the quoted original.)

In my 10 years of using Lucene and N years of using Solr and Nutch, I've never had index corruption. Nowadays Lucene even has transactional commit semantics, so it's much harder (theoretically impossible) to corrupt the index.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Eric J. Christeson <eric.christe...@ndsu.edu>
> To: nutch-user@lucene.apache.org
> Sent: Friday, March 13, 2009 8:42:48 PM
> Subject: Index Disaster Recovery
>
> What do people do when 'something goes wrong' with a crawl?
>
> First, some background: we are a small-ish university using Nutch to
> crawl 60,000-100,000 pages across 50 or so domains. This probably
> puts us in a different category than most Nutch users. Our crawl cycle
> consists of a script that crawls everything, one domain at a time, each
> Sunday; search then runs across all the indexes (one per domain). Our
> original reason for this was that merging was taking too long, but it
> also keeps one bad index (or a crawl with bad results) from destroying
> everything. Maybe we're worrying about nothing, since we haven't had any
> problems in almost a year of production use (knock on wood) and I don't
> know how often indexes 'blow up'. We also move the previous week's
> indexes out of the way before replacing them, so we have a backup if
> something happens.
>
> We have been moving things to a CMS and want to move to a system where
> pages are indexed as they are edited, while still being able to crawl
> things that don't fit in the CMS. This would be a big incentive for most
> of our people to use the CMS. The Solr back end looks promising, but I'm
> not sure how to implement a recovery plan with Solr. Any thoughts or
> experience with backing up Solr indexes? Is it as simple as moving the
> index like we do with Nutch indexes?
>
> Thanks,
> Eric
> --
> Eric J. Christeson
> Enterprise Computing and Infrastructure
> Phone: (701) 231-8693
> North Dakota State University, Fargo, North Dakota, USA
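
P.S. Here is the rough sketch of both approaches I mentioned above. Treat it as a starting point, not a drop-in script: the install paths, host/port, and the /replication handler name are assumptions for a fairly default single-core setup, so adjust them for yours (and for approach 2, the ReplicationHandler has to be registered in solrconfig.xml first).

  #!/bin/sh
  # -- Approach 1: script-based snapshot (the Solr 1.3 replication scripts) --
  # snapshooter hard-links the current index files into data/snapshot.YYYYMMDDhhmmss,
  # so taking a snapshot is cheap; the snapshot directory is then safe to archive.
  # Paths are assumptions; -d should point at your actual Solr data directory
  # (check the usage message of your version's snapshooter).
  /usr/local/solr/bin/snapshooter -d /usr/local/solr/data -v

  # Archive the newest snapshot somewhere safe; that archive is your backup.
  # (snapshot.YYYYMMDDhhmmss names sort chronologically, so tail -1 is the newest.)
  latest=$(ls -d /usr/local/solr/data/snapshot.* | tail -1)
  tar czf /backups/solr-index-$(date +%Y%m%d).tar.gz "$latest"

  # -- Approach 2: HTTP-based backup via the new ReplicationHandler (Solr 1.4) --
  # Assumes the handler is registered as /replication in solrconfig.xml.
  # This likewise writes a snapshot.<timestamp> directory under the data dir,
  # which you can archive the same way as above.
  curl 'http://localhost:8983/solr/replication?command=backup'

Restoring should then be about as simple as what you already do with your Nutch indexes: stop Solr, replace the contents of data/index with the contents of a snapshot, and start Solr back up.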