[ 
https://issues.apache.org/jira/browse/KUDU-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15238384#comment-15238384
 ] 

Mike Percy commented on KUDU-1414:
----------------------------------

bq. But isn't this an extremely rare scenario? Typically an on-disk post-write 
corruption would be a flipped bit (a single byte) which would only affect a 
single entry.

Yes, I agree we're talking about rare events. But I'm not sure it's true that 
successive errors are even rarer. A fairly recent paper on this topic is 
"An Analysis of Data Corruption in the Storage Stack" by Bairavasundaram et 
al. (FAST '08), which explores corruption in real-world disks: 
http://www.cs.toronto.edu/~bianca/papers/fast08.pdf ... In sections 4.3.1 
(Checksum mismatches per corrupt disk) and 4.3.3 (Spatial locality) they show 
that disks with checksum errors tend to have multiple sectors with errors, and 
that those sectors tend to be close together. At least for spinning SATA 
disks, this work is presumably still relevant.

bq. even if we incorrectly truncated the file and started up, we'd get an error 
when trying to open the block

Is this true? I suppose it holds if we had applied a write and lost only the 
commit message, but if we lost both the write and its commit message I don't 
think we'd catch it.

bq. For Kudu's case I think we care about xfs and ext4-ordered. Both of these 
seem to guarantee a multi-block prefix append property - in other words, since 
we're appending to a file without overwrite, we're guaranteed to see a correct 
prefix of the append (ie not some zeros followed by some real data).

Since we are preallocating space using fallocate(), I tend to think that what 
we are doing is considered an overwrite in ALICE parlance. If so, then 
ext4-ordered appears to be vulnerable to the type of error you have 
described, where later sectors may be persisted before earlier ones. Since the 
block metadata is not updated, we are reliant on the order in which the data 
makes it to the disk. It's not totally clear from the ALICE paper whether 
fallocate() combined with fdatasync() will cause this but I suspect it may be 
the case. Both ext4-ordered and xfs have an "x" under ordering in the 
"overwrite -> any" category in their vulnerability table.

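To make the overwrite point concrete, here is a minimal Python sketch. It is 
not Kudu code: the 8 KiB size, the file name and the helper name are made up 
for illustration. It assumes preallocation extends the file size, as 
posix_fallocate() does; if space were reserved without changing the size, the 
picture could differ. Under that assumption every record written afterwards 
lands inside the existing file extent, so the file size never changes and 
only data blocks are dirtied:

```python
import os
import tempfile

PREALLOC_BYTES = 8192  # hypothetical segment size, chosen only for this demo


def append_into_preallocated(path, records):
    """Write records into space reserved up front, syncing data after each.

    Returns (size before writes, size after writes, bytes written).
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Reserve the space. posix_fallocate also extends the file size.
        os.posix_fallocate(fd, 0, PREALLOC_BYTES)
        size_before = os.fstat(fd).st_size
        offset = 0
        for rec in records:
            # Each "append" lands inside the already-allocated extent.
            offset += os.pwrite(fd, rec, offset)
            os.fdatasync(fd)
        size_after = os.fstat(fd).st_size
        return size_before, size_after, offset
    finally:
        os.close(fd)


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    print(append_into_preallocated(os.path.join(demo_dir, "wal-demo"),
                                   [b"a" * 100, b"b" * 200]))
```

Because the size is identical before and after, nothing in the file metadata 
records how far the writes have progressed, which is why the on-disk ordering 
of the data blocks themselves matters here.
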
bq. The more likely scenario for corruption towards the end of a file is a 
partial write, which might be a string of zeros (eg one sector or 4k page) 
followed by some real data. In that case, we do want to truncate it rather than 
fail startup, no?

Based on the above analysis, I tend to believe that this is true.

bq. Handling bit-swaps that happen on cold data later seems like it should be 
considered separately from the more common case of crashes which are 
enumerated by the ALICE paper.

Yeah, taken together I think these two papers do a pretty good job of 
enumerating all the bad things that can happen.

All in all, it still sounds like we're vulnerable if a disk starts getting bit 
errors, albeit in pretty specific circumstances.

> Corrupting multiple log entries at the end of a WAL file may go undetected
> --------------------------------------------------------------------------
>
>                 Key: KUDU-1414
>                 URL: https://issues.apache.org/jira/browse/KUDU-1414
>             Project: Kudu
>          Issue Type: Bug
>          Components: log
>    Affects Versions: 0.8.0
>            Reporter: Mike Percy
>
> While looking at KUDU-1377, I investigated how we are handling WAL truncation 
> when corruption is detected. The way the code is written today, a trailing 
> series of corrupt log entries is truncated with only a log warning message. 
> I'll post a unit test demonstrating this behavior.
> One way to get around this is to ensure that we only accept zeros following a 
> truncated record, instead of just bad records, in order to consider it a 
> partially-written record that we can safely truncate. We would have to 
> maintain this invariant when preallocating space and truncating partial 
> records before continuing to write.

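The zeros-only rule from the issue description can be sketched as follows. 
This is a minimal Python illustration under stated assumptions, not Kudu's 
log reader: the record framing (4-byte length, 4-byte CRC32, payload), the 
names encode_record and scan_log, and the three verdict strings are all 
hypothetical. The sketch also trusts the length field of the first bad 
record, which a real implementation would need to protect separately.

```python
import struct
import zlib

# Hypothetical framing for illustration only (not Kudu's on-disk format):
# 4-byte little-endian payload length, 4-byte CRC32 of the payload, payload.
HEADER = struct.Struct("<II")


def encode_record(payload: bytes) -> bytes:
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def scan_log(buf: bytes):
    """Return (offset where valid records end, verdict).

    'clean'    : every record is valid; only zeros (if anything) follow
    'truncate' : one bad record, and nothing but zeros after its extent
    'corrupt'  : non-zero bytes exist beyond the first bad record
    """
    data_end = len(buf.rstrip(b"\x00"))  # one past the last non-zero byte
    pos = 0
    while pos < data_end:
        claimed_end = len(buf)  # used if even the header does not fit
        if pos + HEADER.size <= len(buf):
            length, crc = HEADER.unpack_from(buf, pos)
            claimed_end = pos + HEADER.size + length
            payload = buf[pos + HEADER.size:claimed_end]
            # Empty records are rejected so a run of zeros never parses.
            if 0 < length == len(payload) and zlib.crc32(payload) == crc:
                pos = claimed_end
                continue
        # First bad record. Accept it as a partial write only if all
        # non-zero data lies within the extent that record claims.
        if data_end <= claimed_end:
            return pos, "truncate"
        return pos, "corrupt"
    return pos, "clean"
```

With this rule a torn final record followed by untouched preallocated space 
is still truncated, while two or more damaged entries at the end of the file 
are reported as corruption instead of being dropped with only a warning.
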


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
