[
https://issues.apache.org/jira/browse/KUDU-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15238384#comment-15238384
]
Mike Percy commented on KUDU-1414:
----------------------------------
bq. But isn't this an extremely rare scenario? Typically an on-disk post-write
corruption would be a flipped bit (a single byte) which would only affect a
single entry.
Yes, I agree we're talking about rare events. But I'm not sure it's true that
successive errors are even more rare. A somewhat recent paper on this topic is
"An Analysis of Data Corruption in the Storage Stack" by Bairavasundaram et
al. from FAST '08, which explores corruption in real-world disks:
http://www.cs.toronto.edu/~bianca/papers/fast08.pdf ... in sections 4.3.1
(Checksum mismatches per corrupt disk) and 4.3.3 (Spatial locality), they show
that among disks with checksum errors, there tend to be multiple sectors with
errors, and that those sectors tend to be close together. At least for spinning
SATA disks, this work is likely still relevant.
bq. even if we incorrectly truncated the file and started up, we'd get an error
when trying to open the block
Is this true? I suppose that if we had applied a write and lost the commit
message, that's true; but if we lost both a write and a commit message, I don't
think we'd catch it.
bq. For Kudu's case I think we care about xfs and ext4-ordered. Both of these
seem to guarantee a multi-block prefix append property - in other words, since
we're appending to a file without overwrite, we're guaranteed to see a correct
prefix of the append (i.e. not some zeros followed by some real data).
Since we are preallocating space using fallocate(), I tend to think that what
we are doing is considered an overwrite in ALICE parlance. If so, then
ext4-ordered appears to be vulnerable to the type of error you have described,
where later sectors may be persisted before earlier ones. Since the block
metadata is not updated, we are reliant on the order in which the data makes it
to the disk. It's not totally clear from the ALICE paper whether fallocate()
combined with fdatasync() will cause this, but I suspect it may be the case.
Both ext4-ordered and xfs have an "x" under ordering in the "overwrite -> any"
category in their vulnerability table.
bq. The more likely scenario for corruption towards the end of a file is a
partial write, which might be a string of zeros (eg one sector or 4k page)
followed by some real data. In that case, we do want to truncate it rather than
fail startup, no?
Based on the above analysis, I tend to believe that this is true.
bq. Handling bit-swaps that happen on cold data later seems like it should be
considered separately from the more common case of crashes, which are
enumerated by the ALICE paper.
Yeah, taken together I think these two papers do a pretty good job of
enumerating all the bad things that can happen.
All in all, it still sounds like we're vulnerable if a disk starts getting bit
errors, albeit in pretty specific circumstances.
> Corrupting multiple log entries at the end of a WAL file may go undetected
> --------------------------------------------------------------------------
>
> Key: KUDU-1414
> URL: https://issues.apache.org/jira/browse/KUDU-1414
> Project: Kudu
> Issue Type: Bug
> Components: log
> Affects Versions: 0.8.0
> Reporter: Mike Percy
>
> While looking at KUDU-1377, I investigated how we are handling WAL truncation
> when corruption is detected. The way the code is written today, a trailing
> series of corrupt log entries is truncated with only a log warning message.
> I'll post a unit test demonstrating this behavior.
> One way to get around this is to only consider a record a partially-written
> record that we can safely truncate if it is followed by nothing but zeros,
> rather than accepting any string of bad records. We would have to maintain
> this invariant when preallocating space, and again when truncating partial
> records before continuing to write.
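For illustration, a rough sketch of the check proposed above (a hypothetical
is_safe_to_truncate() helper, not the actual log reader code): a bad record is
only treated as a safe truncation point if every byte after it is zero; any
non-zero byte afterward indicates corruption that should fail startup rather
than be silently truncated.
{code}
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helper illustrating the proposed invariant: a bad record at
 * bad_record_offset may only be truncated if everything after it is zeros,
 * i.e. it looks like a partial write into preallocated space. */
bool is_safe_to_truncate(const uint8_t* segment, size_t segment_len,
                         size_t bad_record_offset) {
  for (size_t i = bad_record_offset; i < segment_len; i++) {
    if (segment[i] != 0) {
      return false;  /* real data after garbage: surface as corruption */
    }
  }
  return true;  /* all zeros: consistent with a partially-written record */
}
{code}
Maintaining this invariant would mean zero-filling preallocated space, and
re-zeroing the tail after truncating a partial record, as the description
above notes.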
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)