Hello, I've been asked to devise a way to discover and correct data in Lucene indexes that have been corrupted. "Corrupt," in this case, has a few different meanings, some of which strike me as exceedingly difficult to grok. What concerns me are the cases where we don't know that an index has been changed: a bit error in a stored field, for instance, is a form of corruption that we should, at the very least, be able to identify, and hopefully correct. This case seems particularly onerous, since it will never throw an exception of any kind.
We have a fairly good handle on remedying problems that throw exceptions, so we should be alright with corruption where, say, an operator logs in and overwrites a file. I'm wondering how other Lucene users have tackled this problem in the past.

Calculating checksums on the documents seems like one way to go about it: compute a checksum on each document and, in a background thread, compare the checksum to the data. Unfortunately, we're building a large, federated system, and it would take months to exhaustively check every document this way. Checksumming the index files themselves might be too costly: we're storing gigabytes of data per index and there is some churn in the data, so the overhead of this method might be too high.

Thanks for any help you might have.

-Joseph Rose
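To make the per-document checksum idea concrete, here is a minimal sketch of what I have in mind. It assumes we add an extra stored field holding a CRC32 computed over the document's other stored-field values at index time; a background pass would then recompute the checksum and compare. The class and method names here are hypothetical, not anything Lucene provides:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical sketch: at index time, compute a CRC32 over a document's
// stored-field values and persist it as an additional stored field; a
// background verifier later recomputes it and flags mismatches.
public class DocChecksum {

    // Checksum over the concatenated stored-field values, in a fixed order.
    public static long checksum(String... fieldValues) {
        CRC32 crc = new CRC32();
        for (String value : fieldValues) {
            crc.update(value.getBytes(StandardCharsets.UTF_8));
        }
        return crc.getValue();
    }

    // Verification step: recompute and compare against the stored checksum.
    // A bit error in any field changes the CRC, so this returns false.
    public static boolean isIntact(long storedChecksum, String... fieldValues) {
        return checksum(storedChecksum == 0 ? fieldValues : fieldValues) == storedChecksum;
    }
}
```

A background thread could walk the index slowly, recomputing `checksum(...)` for each document and comparing it to the stored value, which bounds the verification cost per document rather than per file; whether that pace is acceptable for a federated system of our size is exactly the open question.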