Out of curiosity, if the expense of the hash function is a
consideration, have you all looked at using fnv1a or a similarly fast hash
in place of md5?
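
For reference, a 32-bit fnv1a over a binary is only a few lines of Erlang.
Something like this untested sketch:

  %% 32-bit FNV-1a over a binary
  fnv1a(Bin) ->
      fnv1a(Bin, 2166136261).

  fnv1a(<<>>, Hash) ->
      Hash;
  fnv1a(<<Byte, Rest/binary>>, Hash) ->
      fnv1a(Rest, ((Hash bxor Byte) * 16777619) band 16#FFFFFFFF).

The tradeoff is that 32 bits of fnv1a is a much weaker integrity check than
128-bit md5, but it's fast and adds only 4 bytes per value.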

-Chris

Paul J Davis wrote:
> 
> On Apr 6, 2010, at 11:20 PM, Adam Kocoloski <[email protected]> wrote:
> 
>> On Apr 6, 2010, at 10:50 PM, Paul J Davis wrote:
>>
>>> This corruption was quite odd in that there wasn't a conspicuous reason for
>>> it.  I didn't dive too deep into the whole thing, so it's possible I missed
>>> something obvious.
>> The instance was unresponsive to ssh for 12 hours.  The report from AWS 
>> Support was merely a "problem with the underlying host" followed by a 
>> recommendation to "launch a replacement at your earliest convenience".  I 
>> don't know what the gremlins were doing behind the scenes, but I'm not 
>> surprised the files are corrupted :)
>>
> 
> Yeah, I don't think we should worry too much about high energy particles
> flipping bits here.
> 
>>> There are two things at play here: how proactive should we be in provoking
>>> these errors, and how much should we check for situations where our data
>>> file got trounced?
>>>
>>> The extreme proactive position would be equivalent to a full table scan per
>>> write, which is out of the question.  So to some extent we won't be able to
>>> detect some errors until read time, which is an unknowable interval.
>> I'm totally comfortable with only detecting them at read-time.
>>
>>> The other aspect is how rigorously we should check reads.  The extreme here
>>> would basically require a sha1 for every read or write, no matter how small,
>>> not to mention the storage overhead.  This part I'm not sure about.  There's
>>> probably a middle ground with CRC sums and whatnot, but I don't see a clear
>>> answer.
>> We currently store MD5 checksums with document bodies and validate them on 
>> reads.  It hasn't proven to be an undue burden.
>>
> 
> We do that for every doc body?  Did not know that.  Perhaps general
> append_term_md5 usage wouldn't be as big of a deal as I feared.
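>
> Roughly, I'd picture it looking something like this untested sketch (not the
> actual couch_file code; append_binary/read_binary here are just placeholder
> helpers for the raw file ops):
>
>   %% store the md5 of the serialized term in front of it
>   append_term_md5(Fd, Term) ->
>       Bin = term_to_binary(Term),
>       append_binary(Fd, <<(erlang:md5(Bin))/binary, Bin/binary>>).
>
>   %% verify the checksum before deserializing
>   read_term_md5(Fd, Pos) ->
>       {ok, <<Md5:16/binary, Bin/binary>>} = read_binary(Fd, Pos),
>       case erlang:md5(Bin) of
>           Md5 -> {ok, binary_to_term(Bin)};
>           _   -> throw({file_corruption, Pos})
>       end.
>
> The cost per append/read is one md5 over the serialized term plus 16 bytes of
> storage.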
> 
>> Best, Adam
>>
>>> Basically, the question is how much we should attempt to detect when
>>> hardware lies.  I reckon there's probably a middle ground between reporting
>>> when an assumption is violated and doing full-on table scans.  Ideally such
>>> things would be fairly configurable, but I sure don't see an obvious answer.
>>>
>>>
>>> On Apr 6, 2010, at 10:06 PM, Randall Leeds <[email protected]> wrote:
>>>
>>>> I immediately want to say 'ini file option' but I'm not sure whether to err
>>>> on the side of safety or speed.
>>>>
>>>> Maybe this is a good candidate for merkle trees or something else we can do
>>>> throughout the view tree that might have less overhead than md5-summing all
>>>> the nodes?  After all, most inner nodes shouldn't change most of the time.
>>>> Some incremental, cheap checksum might be a worthwhile *option*.
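>>>>
>>>> Something as cheap as erlang:crc32/1,2 (which can be extended
>>>> incrementally) might fit; a rough, untested sketch of an incremental
>>>> checksum over a list of serialized nodes:
>>>>
>>>>   checksum(Nodes) ->
>>>>       lists:foldl(fun(Node, Crc) ->
>>>>                       erlang:crc32(Crc, term_to_binary(Node))
>>>>                   end, erlang:crc32(<<>>), Nodes).
>>>>
>>>> At 4 bytes per sum the storage overhead would be tiny compared to md5's 16.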
>>>>
>>>> On Apr 6, 2010 6:04 PM, "Adam Kocoloski" <[email protected]> wrote:
>>>>
>>>> Hi all, we recently had an EC2 node go AWOL for about 12 hours.  When it
>>>> came back, we noticed after a few days that a number of the view indexes
>>>> stored on that node were not updating.  I did some digging into the error
>>>> logs and with Paul's help pieced together what was going on.  I won't
>>>> bother you with all the gory details unless you ask for them, but the gist
>>>> of it is that those files are corrupted.
>>>>
>>>> The troubling thing for me is that we only discovered the corruption when
>>>> it completely broke the index updates.  In one case, it did this by
>>>> rearranging the bits so that couch_file thought that the btree node it was
>>>> reading from disk had an associated MD5 checksum.  It didn't (no btree
>>>> nodes do), and so couch_file threw a file_corruption exception.  But if
>>>> the corruption had shown up in another part of the file I might never have
>>>> known.  In fact, some of the other indices on that node probably are
>>>> silently corrupted.
>>>>
>>>> You might wonder how likely it is that a file becomes corrupted but still
>>>> appears to be functioning.  I checked the last modified timestamps for
>>>> three broken files.  One was last modified when the node went down, but
>>>> the other two had timestamps in between the node's recovery and now.  To
>>>> me, that means that the view indexer was able to update those files for
>>>> quite a while (~2 days) before it bumped into a part of the btree that was
>>>> corrupted.
>>>>
>>>> I wonder what we should do about this.  My first thought is to make it
>>>> optional to write btree nodes (possibly only for view index files?) using
>>>> append_term_md5 instead of append_term.  It seems like a simple patch, but
>>>> I don't know a priori what the performance hit would be.  Other thoughts?
>>>>
>>>> Best, Adam
