What do people do when 'something goes wrong' with a crawl?
First, some background: we are a small-ish university using Nutch to
crawl 60,000 - 100,000 pages across 50 or so domains.  This probably
puts us in a different category than most Nutch users.  Our crawl
cycle is a script that crawls everything, one domain at a time, each
Sunday, and search then runs across all the indexes (one per domain).
Our original reason for this was that merging was taking too long,
but it also keeps one bad index (or a crawl with bad results) from
taking everything down.  Maybe we're worrying about nothing, since we
haven't had any problems in almost a year of production use (knock on
wood) and I don't know how often indexes 'blow up'.  We also move the
previous week's indexes out of the way before replacing them, so we
have a backup if something happens; roughly the idea in the sketch
below.
We have been moving things into a CMS and want to move to a system
where pages are indexed as they are edited, while still being able to
crawl things that don't fit in the CMS.  This would be a big incentive
for most of our people to use the CMS.  The Solr back end looks
promising, but I'm not sure how to implement a recovery plan with
Solr.  Any thoughts or experience with backing up Solr indexes?  Is it
as simple as moving the index directory like we do with the Nutch
indexes, something along the lines of the sketch below?
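What I'm picturing (and I may be off base) is just rotating the index
directory the way we do now; the path below assumes a stock
single-core setup and is only a guess at how this would look:

    #!/bin/bash
    # Hypothetical Solr 'backup' by copying the index directory aside;
    # presumably writes would need to be paused (or a commit forced)
    # before doing this.
    SOLR_DATA=/opt/solr/example/solr/data

    cp -a "$SOLR_DATA/index" "$SOLR_DATA/index.$(date +%Y%m%d)"

If there is a better-supported way to do this in Solr, that is exactly
the kind of thing I'm hoping to hear about.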

Thanks,
Eric
-- 
Eric J. Christeson             <eric.christe...@ndsu.edu>
Enterprise Computing and Infrastructure
Phone: (701) 231-8693
North Dakota State University, Fargo, North Dakota, USA
