Out of curiosity, if the expense of the hash function is a consideration, have you all looked at using fnv1a or a similarly fast hash in place of md5?
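
For reference, 32-bit FNV-1a over a binary is only a handful of lines of Erlang. This is just an illustrative sketch for comparison against erlang:md5/1 (untested, unbenchmarked, and not anything that exists in couch_file today):

    %% Illustrative 32-bit FNV-1a over a binary. Sketch only -- not part of
    %% couch_file; shown just to compare cost against erlang:md5/1.
    -module(fnv1a).
    -export([hash/1]).

    -define(FNV_OFFSET_BASIS, 16#811C9DC5).
    -define(FNV_PRIME, 16#01000193).

    hash(Bin) when is_binary(Bin) ->
        hash(Bin, ?FNV_OFFSET_BASIS).

    hash(<<>>, Hash) ->
        Hash;
    hash(<<Byte, Rest/binary>>, Hash) ->
        %% XOR the next byte in, multiply by the FNV prime, truncate to 32 bits.
        hash(Rest, ((Hash bxor Byte) * ?FNV_PRIME) band 16#FFFFFFFF).

Obviously a 32-bit checksum catches corruption much less reliably than a 128-bit MD5, so it would be a speed/robustness trade-off rather than a free win.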
-Chris

Paul J Davis wrote:
>
> On Apr 6, 2010, at 11:20 PM, Adam Kocoloski <[email protected]> wrote:
>
>> On Apr 6, 2010, at 10:50 PM, Paul J Davis wrote:
>>
>>> This corruption was quite odd in that there wasn't a conspicuous reason for
>>> it. I didn't dive too deep into the whole thing so it's possible I missed
>>> something obvious.
>>
>> The instance was unresponsive to ssh for 12 hours. The report from AWS
>> Support was merely a "problem with the underlying host" followed by a
>> recommendation to "launch a replacement at your earliest convenience". I
>> don't know what the gremlins were doing behind the scenes, but I'm not
>> surprised the files are corrupted :)
>
> Yeah, I don't think that we should worry about high energy particles flipping
> bits too much here.
>
>>> There are two things at play here: how proactive should we be in provoking
>>> these errors, and how much should we check for situations where our data
>>> file got trounced.
>>>
>>> The extreme proactive position would be equivalent to a full table scan per
>>> write, which is out of the question. So to some extent we won't be able to
>>> detect some errors until read time, which is an unknowable interval.
>>
>> I'm totally comfortable with only detecting them at read time.
>>
>>> The other aspect is how rigorously should we check reads? The extreme here
>>> would basically require a sha1 for every read or write no matter how small,
>>> not to mention the storage overhead. This part I'm not sure about. There's
>>> probably middle ground with crc sums and whatnot, but I don't see a clear
>>> answer.
>>
>> We currently store MD5 checksums with document bodies and validate them on
>> reads. It hasn't proven to be an undue burden.
>
> We do that for every doc body? Did not know that. Perhaps general
> append_term_md5 usage wouldn't be as big of a deal as I feared.
>
>> Best, Adam
>>
>>> Basically, the question is how much should we attempt to detect when
>>> hardware lies. I reckon that there's probably a middle ground between
>>> reporting when an assumption is violated and full-on table scans. Ideally
>>> such things would be fairly configurable, but I sure don't see an obvious
>>> answer.
>>>
>>> On Apr 6, 2010, at 10:06 PM, Randall Leeds <[email protected]> wrote:
>>>
>>>> I immediately want to say 'ini file option', but I'm not sure whether to
>>>> err on safety or speed.
>>>>
>>>> Maybe this is a good candidate for merkle trees or something else we can do
>>>> throughout the view tree that might have less overhead than md5 summing all
>>>> the nodes? After all, most inner nodes shouldn't change most of the time.
>>>> Some incremental, cheap checksum might be a worthwhile *option*.
>>>>
>>>> On Apr 6, 2010 6:04 PM, "Adam Kocoloski" <[email protected]> wrote:
>>>>
>>>> Hi all, we recently had an EC2 node go AWOL for about 12 hours. When it
>>>> came back, we noticed after a few days that a number of the view indexes
>>>> stored on that node were not updating. I did some digging into the error
>>>> logs and with Paul's help pieced together what was going on. I won't bother
>>>> you with all the gory details unless you ask for them, but the gist of it
>>>> is that those files are corrupted.
>>>>
>>>> The troubling thing for me is that we only discovered the corruption when
>>>> it completely broke the index updates. In one case, it did this by
>>>> rearranging the bits so that couch_file thought that the btree node it was
>>>> reading from disk had an associated MD5 checksum. It didn't (no btree nodes
>>>> do), and so couch_file threw a file_corruption exception. But if the
>>>> corruption had shown up in another part of the file I might never have
>>>> known. In fact, some of the other indices on that node probably are
>>>> silently corrupted.
>>>>
>>>> You might wonder how likely it is that a file becomes corrupted but still
>>>> appears to be functioning. I checked the last modified timestamps for three
>>>> broken files. One was last modified when the node went down, but the other
>>>> two had timestamps in between the node's recovery and now. To me, that
>>>> means that the view indexer was able to update those files for quite a
>>>> while (~2 days) before it bumped into a part of the btree that was
>>>> corrupted.
>>>>
>>>> I wonder what we should do about this. My first thought is to make it
>>>> optional to write btree nodes (possibly only for view index files?) using
>>>> append_term_md5 instead of append_term. It seems like a simple patch, but I
>>>> don't know a priori what the performance hit would be. Other thoughts?
>>>>
>>>> Best, Adam
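
One more data point on the append_term_md5 idea above: the checksum-on-read scheme really just amounts to prepending a digest to the serialized term and re-checking it on the way back out. Rough sketch below; the module and function names are made up for illustration and don't reflect couch_file's actual internals:

    %% Hypothetical sketch of checksum-on-read: store an MD5 next to the
    %% serialized term and verify it when the term is read back.
    %% Module and function names are invented for illustration only.
    -module(term_sum).
    -export([encode/1, decode/1]).

    encode(Term) ->
        Bin = term_to_binary(Term),
        Md5 = erlang:md5(Bin),
        %% 16-byte MD5 prefix guards the payload that follows it.
        <<Md5/binary, Bin/binary>>.

    decode(<<Md5:16/binary, Bin/binary>>) ->
        %% Recompute the digest; fail loudly if the stored copy doesn't match.
        case erlang:md5(Bin) of
            Md5 -> {ok, binary_to_term(Bin)};
            _   -> {error, file_corruption}
        end.

Whether the digest is MD5 or something cheaper like fnv1a is orthogonal to the scheme itself, which is why I was asking about the hash above.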
