Sorry for the delay in responding and thanks Dan for enumerating the error cases. Responses below:
> 1) Header magic number does not match

Unrecoverable (corruption).

> 2) Container version is not recognized

Unrecoverable (corruption).

> 3) Record length field is 0 (perhaps this isn't an error).

Yeah, I just checked, and this is not an error. The case is writing a message
with an optional field that is not present. The resulting serialized message
is 0-length.

> 4) Record length field is greater than the remaining bytes in the file.

Recoverable, assuming the CRC checks out. We can truncate.

> 5) Record length CRC does not match (if we go ahead and add a CRC for the
> length).

Unrecoverable (corruption) -- assuming the file is long enough to store the
full CRC.

> 6) Record data CRC does not match.

Unrecoverable (corruption).

> 7) After the header or final record, there are more than 0 bytes remaining
> in the file, but not enough for the length and CRC(length) fields.

Recoverable; we can truncate. Actually, assuming we're doing <len, crc(len),
data, crc(data)> and the len and CRC fields are each 4 bytes long, we can say
that if 0 < len(remaining bytes) < 12 then we can safely truncate.

Additionally, I'll add some scenarios that only occur when preallocating
space:

8) Record data CRC does not match, but it is the last record in the file. We
may choose to treat this not as corruption but as a partial write that could
be truncated. This is only a valid assumption if we preallocate the size of
the record before writing the data (or if we assume file metadata is updated
before the data is fsynced). If we don't preallocate -- for example, with
ext4 in data=ordered mode -- this should never occur normally and should be
treated as an error (same as #6).

9) Same as #8, but followed by a long run of 0s at the end of the file. This
is another case that only occurs with preallocation, but this time we assume
we're preallocating more than a single record at a time. We could preallocate
64MB at a time, for example.
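For concreteness, the truncation arithmetic above could look something like
this (a rough sketch only; the names are mine, and it assumes 4-byte length
and CRC fields as discussed):

```cpp
#include <cstddef>
#include <cstdint>

// Field sizes assumed in the discussion above.
constexpr size_t kLenSize = 4;  // uint32 data length
constexpr size_t kCrcSize = 4;  // crc(len) and crc(data) each

// Smallest possible complete record: even a 0-length payload (case #3)
// still needs <len, crc(len), crc(data)> = 12 bytes.
constexpr size_t kMinRecordSize = kLenSize + 2 * kCrcSize;

// Case #7: a nonzero tail too short to hold any complete record can be
// safely truncated.
bool CanSafelyTruncateTail(size_t remaining_bytes) {
  return remaining_bytes > 0 && remaining_bytes < kMinRecordSize;
}

// Case #4: the length field (already validated via crc(len)) claims more
// payload than the file has left; recoverable by truncation.
bool LengthExceedsRemaining(uint32_t data_len, size_t bytes_after_len_fields) {
  return data_len > bytes_after_len_fields;
}
```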
If we get a partial write, we'll end up with some bytes followed by zeroes.
Now let's assume a data corruption issue where several bytes got corrupted to
0 on disk, and further that these are trailing bytes of the record.
Unfortunately, in this case we don't have any way to detect whether we
fsynced the write to the file and then that record was corrupted, or whether
we wrote only a partial record to disk and the fsync never completed. The
same problem holds in the case of a series of trailing records if we use a
"scan forward" approach (see KUDU-1414).

If we want to preallocate for performance reasons (we may call fdatasync()
for durability instead of fsync(), thus avoiding syncing file metadata), I
think we have to accept that corruption of trailing zeros will be
undetectable, and that data loss may occur in this (probably very rare) case
if we treat the situation as a partially written record that may be
truncated. To minimize the problem surface area of this heuristic, we can
enforce that only the last record may be corrupt, and that it must be
followed by zeroes. Otherwise, we should consider the file corrupt, treating
it the same as case #6.

*However, if we do not preallocate, then scenarios 1-7 are exhaustive and we
shouldn't need any heuristic cases.* Personally, I think that's preferable
for the log block metadata, which isn't written to as heavily as, say, the
consensus WAL. Without profiling data suggesting this is a bottleneck, I'd
rather err on the side of durability. So for now, I'm assuming we're not
going to preallocate space in the PBC files (this is not a change from what
we do today).

Mike

On Tue, Apr 5, 2016 at 10:42 AM, Dan Burkert <[email protected]> wrote:

> 7) After the header or final record, there are more than 0 bytes remaining
> in the file, but not enough for the length and CRC(length) fields.
> On Tue, Apr 5, 2016 at 10:34 AM, Dan Burkert <[email protected]> wrote:
>
> > Perhaps not related to the issue exactly, but there is a pretty good
> > slide deck on CRCs here:
> > https://users.ece.cmu.edu/~koopman/pubs/KoopmanCRCWebinar9May2012.pdf.
> > Starting on slide 38 it has some do's and don'ts, including:
> >
> > 1) Use a CRC over the length field, and then the data field; otherwise a
> > 1-bit error in the length field may go undetected. Adar also brought
> > this up.
> > 2) Use a seed other than 0. This prevents zeroed runs in the file from
> > passing the CRC check.
> > 3) "Be careful with bit ordering" - I'm not sure exactly how this
> > applies to our use case.
> > 4) The slide deck doesn't say this - but perhaps we should be chaining
> > the CRCs by using the previous CRC as the seed for the next. I'm still
> > trying to find some evidence that this is a beneficial thing to do.
> >
> > It's probably worth thinking about these things if we are going to
> > revise the format.
> >
> > Back to the issue at hand, I think it may help to enumerate the possible
> > failure scenarios. Here are the failures I can think of:
> >
> > 1) Header magic number does not match
> > 2) Container version is not recognized
> > 3) Record length field is 0 (perhaps this isn't an error).
> > 4) Record length field is greater than the remaining bytes in the file.
> > 5) Record length CRC does not match (if we go ahead and add a CRC for
> > the length).
> > 6) Record data CRC does not match.
> >
> > From what I understand, it's #4 that is the motivator behind KUDU-1377,
> > right?
> >
> > - Dan
> >
> > On Tue, Apr 5, 2016 at 1:07 AM, Mike Percy <[email protected]> wrote:
> >
> > > It looks to me like the WAL suffers from an issue very similar to
> > > KUDU-1377, where we could be more forgiving when a partial record is
> > > written. The WAL will rebuild a segment footer for you, but it won't
> > > ignore a partial record. However, it's less likely because the WAL
> > > uses preallocation.
> > >
> > > Mike
> > >
> > > On Tue, Apr 5, 2016 at 12:38 AM, Adar Dembo <[email protected]> wrote:
> > >
> > > > Quick note: the existing checksum is actually of the length AND the
> > > > bytes, not just the bytes. But that doesn't help solve your problem,
> > > > because the checksum follows the data, which means if the length is
> > > > corrupted, you don't really know where the checksum lies.
> > > >
> > > > Isn't the solution you're describing effectively the same as what's
> > > > described in KUDU-668? That closely mirrors what the WAL does, which
> > > > I'd argue is a good thing (since it's tried and true).
> > > >
> > > > On Tue, Apr 5, 2016 at 12:14 AM, Mike Percy <[email protected]> wrote:
> > > >
> > > > > I'm working on solving an issue in the Log Block Manager (LBM)
> > > > > filed as KUDU-1377. In this case, writing a partial record to the
> > > > > end of a container metadata file can cause a failure to restart
> > > > > Kudu. One possible way for this to happen is to run out of disk
> > > > > space when appending a block id to this metadata file. I wanted to
> > > > > discuss a potential fix for this issue.
> > > > >
> > > > > The LBM uses the protobuf container (PBC) file format documented
> > > > > in pb_util.h for the container metadata file. The current version
> > > > > of this format (V1) looks like this:
> > > > >
> > > > > <magic number>
> > > > > <container version>
> > > > > repeated <record>
> > > > >
> > > > > with each <record> looking like the following:
> > > > >
> > > > > <uint32 data length>
> > > > > <data bytes>
> > > > > <data checksum>
> > > > >
> > > > > In the LBM, each time we create a new block we append the block id
> > > > > to the metadata file. On startup, we verify that all records in
> > > > > the file are valid. If not, we print to the log and exit.
> > > > >
> > > > > In the case of a full disk, we will have written a partial record
> > > > > to the metadata file, and at startup we will fail validation;
> > > > > however, we should be able to detect this case and ignore the
> > > > > partial record on startup. Because we still need to support
> > > > > deleting blocks, we need to be able to continue appending to this
> > > > > metadata file after startup, so we also need to truncate the file
> > > > > to the last good record when this occurs.
> > > > >
> > > > > Here is what I am thinking about to fix this issue:
> > > > >
> > > > > 1. When we are reading a container metadata file at startup, if we
> > > > > detect that there is a trailing record that is too short to fit a
> > > > > valid record (relative to the length of the file), then we
> > > > > truncate the last partial record from the file and continue as
> > > > > normal.
> > > > >
> > > > > 2. To avoid truncating "good" records in the case that there is
> > > > > data corruption in one of the length fields, we also need to
> > > > > extend the PBC format to add a checksum for the record length. So
> > > > > a record would now look like the following:
> > > > >
> > > > > <uint32 data length>
> > > > > <length checksum>
> > > > > <data bytes>
> > > > > <data checksum>
> > > > >
> > > > > Does anyone see any drawbacks to this approach?
> > > > >
> > > > > If you made it this far, thanks for reading.
> > > > >
> > > > > Mike
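P.S. To make the proposed <len, crc(len), data, crc(data)> format and the
recovery rules concrete, here is a rough read-side sketch. Everything in it
is illustrative: the names are mine, fields are little-endian by assumption,
the seed is an arbitrary non-zero value (per Dan's point about zeroed runs),
and I'm using a plain bitwise CRC-32 as a stand-in for whatever checksum we
actually use.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Arbitrary non-zero seed so that runs of zeros fail the CRC check.
constexpr uint32_t kCrcSeed = 0xA5A5A5A5;

// Plain (reflected) CRC-32; a stand-in for the real checksum.
uint32_t Crc32(const uint8_t* data, size_t n, uint32_t seed) {
  uint32_t crc = ~seed;
  for (size_t i = 0; i < n; ++i) {
    crc ^= data[i];
    for (int b = 0; b < 8; ++b) {
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}

uint32_t ReadLE32(const uint8_t* p) {
  return static_cast<uint32_t>(p[0]) | (static_cast<uint32_t>(p[1]) << 8) |
         (static_cast<uint32_t>(p[2]) << 16) |
         (static_cast<uint32_t>(p[3]) << 24);
}

void PutLE32(uint8_t* p, uint32_t v) {
  p[0] = v & 0xff; p[1] = (v >> 8) & 0xff;
  p[2] = (v >> 16) & 0xff; p[3] = (v >> 24) & 0xff;
}

// Write-side helper: append <len, crc(len), data, crc(data)> at buf[off];
// returns the offset just past the record.
size_t WriteRecord(uint8_t* buf, size_t off, const uint8_t* data, uint32_t n) {
  PutLE32(buf + off, n);
  PutLE32(buf + off + 4, Crc32(buf + off, 4, kCrcSeed));
  std::memcpy(buf + off + 8, data, n);
  PutLE32(buf + off + 8 + n, Crc32(buf + off + 8, n, kCrcSeed));
  return off + 12 + n;
}

enum class TailStatus { kOk, kTruncatableTail, kCorrupt };

bool AllZeros(const uint8_t* p, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    if (p[i] != 0) return false;
  }
  return true;
}

// Classify the next record (or tail) in [off, size), i.e. the bytes
// following the last known-good record.
TailStatus ClassifyTail(const uint8_t* buf, size_t size, size_t off) {
  const size_t remaining = size - off;
  if (remaining == 0) return TailStatus::kOk;                // clean EOF
  if (remaining < 12) return TailStatus::kTruncatableTail;   // case #7
  const uint32_t data_len = ReadLE32(buf + off);
  if (Crc32(buf + off, 4, kCrcSeed) != ReadLE32(buf + off + 4)) {
    // Case #5 -- or, if the whole tail is zeros and we preallocate,
    // case #9: a partial write that can be truncated.
    return AllZeros(buf + off, remaining) ? TailStatus::kTruncatableTail
                                          : TailStatus::kCorrupt;
  }
  if (data_len > remaining - 12) return TailStatus::kTruncatableTail;  // #4
  if (Crc32(buf + off + 8, data_len, kCrcSeed) !=
      ReadLE32(buf + off + 8 + data_len)) {
    return TailStatus::kCorrupt;  // case #6 (assuming no preallocation)
  }
  return TailStatus::kOk;
}
```

If we stick with no preallocation, as I argued above, the AllZeros branch
would be dropped and a length-CRC mismatch would be unconditionally treated
as corruption.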
