Sounds good to me.  It should also be possible to begin preallocating in
the future if we see a need, since it's the same on-disk format, right?
The only other concern about preallocation is making sure that the seed
for a new record's length CRC is never 0, since a run of null bytes could
otherwise be read as repeated 0-length records.  E.g., assuming
CRC(null bytes, 0) == 0, and a preallocated file with a single record:

[header]
[record1 length] = 10
[CRC(record1 length, <initial seed>)] = 0x01234567
[record1 data] = 0x01ab...
[CRC(record1 data, 0x01234567)] = 0x00000000
0x00 // beginning of preallocated trailing 0 bytes
0x00
0x00
...

My understanding is that if the server wrote this file (in which the CRC of
the record1 data just happens to be 0), then when it comes back up it will
read 0-length records from the preallocated remainder of the file.  I
haven't actually verified the CRC(null bytes, 0) == 0 assumption, though.
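For what it's worth, the assumption holds for a "raw" CRC register (no init or
final XOR), though not for zlib's crc32 variant, which folds in both. A quick
Python check (toy bitwise CRC-32, not necessarily the exact variant used here):

```python
import zlib

def raw_crc32(data: bytes, seed: int = 0) -> int:
    """Reflected CRC-32 with no init or final XOR, so the seed is the
    raw starting register value."""
    crc = seed
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0xEDB88320 if crc & 1 else 0)
    return crc

zeros = b"\x00" * 16
# With a 0 seed, a raw CRC over null bytes stays 0: every step just
# shifts an all-zero register.
print(raw_crc32(zeros, 0))         # 0
# A nonzero seed breaks the pattern (the register update is invertible,
# so a nonzero state never collapses to 0 on zero input), which is why
# zeroed preallocated space can't masquerade as valid 0-length records.
print(raw_crc32(zeros, 1) != 0)    # True
# zlib's crc32 applies an init value and final XOR internally, so it
# does not have the all-zeros-gives-0 property even with seed 0.
print(zlib.crc32(zeros, 0) != 0)   # True
```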

- Dan

On Tue, Apr 12, 2016 at 3:09 PM, Mike Percy <[email protected]> wrote:

> Sorry for the delay in responding and thanks Dan for enumerating the error
> cases. Responses below:
>
> > 1) Header magic number does not match
> Unrecoverable (corruption).
>
> > 2) Container version is not recognized
> Unrecoverable (corruption).
>
> > 3) Record length field is 0 (perhaps this isn't an error).
> Yeah, I just checked, and this is not an error. The case is writing a
> message with an optional field that is not present. The resulting
> serialized message is 0-length.
>
> > 4) Record length field is greater than the remaining bytes in the file.
> Recoverable, assuming the CRC checks out. We can truncate.
>
> > 5) Record length CRC does not match (if we go ahead and add a CRC for the
> length).
> Unrecoverable (corruption) -- assuming the file has enough length to store
> the full CRC.
>
> > 6) Record data CRC does not match.
> Unrecoverable (corruption).
>
> > 7) After the header or final record, there are more than 0 bytes
> remaining in the file, but not enough for the length and CRC(length)
> fields.
> Recoverable, we can truncate. Actually, assuming we're doing <len,
> crc(len), data, crc(data)> and the len and crc fields are each 4 bytes
> long, we can say that if 0 < remaining bytes < 12 then we can safely
> truncate.
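That rule could be sketched as a small predicate (hypothetical helper, assuming
the 4-byte field sizes above):

```python
LEN_SIZE = 4   # uint32 record length
CRC_SIZE = 4   # CRC-32 of the length, and of the data
# 12 bytes is the smallest possible record, since 0-length data is legal:
MIN_RECORD_SIZE = LEN_SIZE + CRC_SIZE + CRC_SIZE

def can_safely_truncate(remaining: int) -> bool:
    """True when the bytes left after the last good record are too few to
    hold even a 0-length <len, crc(len), data, crc(data)> record, so they
    can only be a partial write."""
    return 0 < remaining < MIN_RECORD_SIZE

print(can_safely_truncate(7))    # True: partial write, safe to truncate
print(can_safely_truncate(12))   # False: could be a real (0-length) record
```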
>
> Additionally, here are some scenarios that only occur when preallocating
> space:
>
> 8) Record data CRC does not match, but it is the last record in the file.
>
> We may choose to treat this as not corruption, but a partial write that
> could be truncated. This is only a valid assumption if we are preallocating
> the size of the record before writing the data (or if we assume file
> metadata is updated before the data is fsynced). If we don't preallocate
> (for example, with ext4 in data=ordered mode), this should never occur
> normally and should be treated as an error (same as #6).
>
> 9) Same as #8, but followed by a long run of 0s at the end of the file.
>
> Another case that only occurs with preallocation, but this time we assume
> we're preallocating more than a single record at a time. We could
> preallocate 64MB at a time, for example. If we get a partial write, we'll
> end up with some bytes followed by zeroes. Let's assume a data corruption
> issue where several bytes got corrupted to 0 on-disk, and further that
> these are trailing bytes on the record. Unfortunately, in this case we
> don't have any way to detect whether we fsynced the write to the file and
> then that record was corrupted or whether we wrote only a partial record to
> disk and the fsync never completed. The same problem holds in the case of a
> series of trailing records if we use a "scan forward" approach (see
> KUDU-1414).
>
> If we want to preallocate for performance reasons (we may call fdatasync()
> for durability instead of fsync(), thus avoiding syncing file metadata), I
> think we have to accept that corruption of trailing zeros will be
> undetectable and that data loss may occur in this (probably very rare) case
> if we treat this situation as a partially written record that may be
> truncated. To minimize the problem surface area of this heuristic we can
> enforce that only the last record may be corrupt, and must be followed by
> zeroes. Otherwise, we should consider the file corrupt, treating it the
> same as case #6.
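One way to scope that heuristic (a sketch only, not the actual Kudu code): after
a CRC failure on the file's last record, accept it as a truncatable partial
write only if every byte after it really is zero.

```python
import io

def looks_like_partial_write(f, record_end: int, chunk: int = 1 << 20) -> bool:
    """After a data-CRC failure on the last record of a preallocated file,
    return True only if every byte from record_end to EOF is zero.
    Any nonzero trailing byte means real corruption (treat like case #6)."""
    f.seek(record_end)
    while True:
        buf = f.read(chunk)
        if not buf:
            return True   # reached EOF having seen nothing but zeros
        if buf.count(0) != len(buf):
            return False  # a nonzero byte: not just preallocated space

# Partial record followed by preallocated zeros: truncatable.
print(looks_like_partial_write(io.BytesIO(b"\xab" + b"\x00" * 64), 1))  # True
```

As noted above, even this check cannot distinguish trailing record bytes that
were corrupted *to* zero from a genuine partial write; that ambiguity is
inherent to the preallocation approach.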
>
> *However, if we do not preallocate, then scenarios 1-7 are exhaustive and
> we shouldn't have any heuristic cases.* Personally, I think that's
> preferable for the log block metadata, which isn't getting written to as
> much as, say, the consensus WAL. Without profiling data suggesting this is
> a bottleneck, I'd rather err on the side of durability. So for now, I'm
> assuming we're not going to be preallocating space in the PBC files (this
> is not a change from what we do today).
>
> Mike
>
> On Tue, Apr 5, 2016 at 10:42 AM, Dan Burkert <[email protected]> wrote:
>
> > 7) After the header or final record, there are more than 0 bytes remaining
> > in the file, but not enough for the length and CRC(length) fields.
> >
> > On Tue, Apr 5, 2016 at 10:34 AM, Dan Burkert <[email protected]> wrote:
> >
> > > Perhaps not related to the issue exactly, but there is a pretty good
> > > slide deck on CRCs here:
> > > https://users.ece.cmu.edu/~koopman/pubs/KoopmanCRCWebinar9May2012.pdf
> > > Starting on slide 38 it has some do's and don'ts, including:
> > >
> > > 1) Use a CRC over the length field, and then the data field; otherwise
> > > a 1-bit error in the length field may go undetected.  Adar also brought
> > > this up.
> > > 2) Use a seed other than 0. This prevents zeroed runs in the file from
> > > passing the CRC check.
> > > 3) "Be careful with bit ordering" - I'm not sure exactly how this
> > > applies to our use case.
> > > 4) The slide deck doesn't say this - but perhaps we should be chaining
> > > the CRCs by using the previous CRC as the seed for the next. I'm still
> > > trying to find some evidence that this is a beneficial thing to do.
> > >
> > > It's probably worth thinking about these things if we are going to
> > > revise the format.
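Taken together, points 1, 2, and 4 might look like this when framing a record
(hypothetical layout and seed value; zlib.crc32's second argument serves as the
chaining seed, though its init/xor-out semantics differ from a raw CRC register):

```python
import struct
import zlib

INITIAL_SEED = 0xA5A5A5A5  # any fixed nonzero value (point 2)

def frame_record(data: bytes, prev_crc: int = INITIAL_SEED) -> bytes:
    """Build <len, crc(len), data, crc(data)>, covering the length with its
    own CRC (point 1) and seeding each CRC with the previous one (point 4)."""
    length = struct.pack("<I", len(data))
    len_crc = zlib.crc32(length, prev_crc)
    data_crc = zlib.crc32(data, len_crc)
    return length + struct.pack("<I", len_crc) + data + struct.pack("<I", data_crc)

rec = frame_record(b"hello")
print(len(rec))  # 17: 4 (len) + 4 (crc) + 5 (data) + 4 (crc)
```

The chaining means a record's checksums are only valid in the context of the
records before it, which is the property Dan's example at the top of the thread
relies on.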
> > >
> > > Back to the issue at hand, I think it may help to enumerate the
> > > possible failure scenarios.  Here are the failures I can think of:
> > >
> > > 1) Header magic number does not match
> > > 2) Container version is not recognized
> > > 3) Record length field is 0 (perhaps this isn't an error).
> > > 4) Record length field is greater than the remaining bytes in the file.
> > > 5) Record length CRC does not match (if we go ahead and add a CRC
> > > for the length).
> > > 6) Record data CRC does not match.
> > >
> > > From what I understand, it's #4 that is the motivator behind KUDU-1377,
> > > right?
> > >
> > > - Dan
> > >
> > > On Tue, Apr 5, 2016 at 1:07 AM, Mike Percy <[email protected]> wrote:
> > >
> > >> It looks to me like the WAL suffers from an issue very similar to
> > >> KUDU-1377, where we could be more forgiving when a partial record is
> > >> written. The WAL will rebuild a segment footer for you, but it won't
> > >> ignore
> > >> a partial record. However it's less likely because the WAL uses
> > >> preallocation.
> > >>
> > >> Mike
> > >>
> > >> On Tue, Apr 5, 2016 at 12:38 AM, Adar Dembo <[email protected]> wrote:
> > >>
> > >> > Quick note: the existing checksum is actually of the length AND the
> > >> > bytes, not just the bytes. But that doesn't help solve your problem,
> > >> > because the checksum follows the data, which means if the length is
> > >> > corrupted, you don't really know where the checksum lies.
> > >> >
> > >> > Isn't the solution you're describing effectively the same as what's
> > >> > described in KUDU-668? That closely mirrors what the WAL does, which
> > >> > I'd argue is a good thing (since it's tried and true).
> > >> >
> > >> > On Tue, Apr 5, 2016 at 12:14 AM, Mike Percy <[email protected]> wrote:
> > >> >
> > >> > > I'm working on solving an issue in the Log Block Manager (LBM)
> > >> > > filed as KUDU-1377. In this case, writing a partial record to the
> > >> > > end of a container metadata file can cause a failure to restart
> > >> > > Kudu. One possible way for this to happen is to run out of disk
> > >> > > space when appending a block id to this metadata file. I wanted to
> > >> > > discuss a potential fix for this issue.
> > >> > >
> > >> > > The LBM uses the protobuf container (PBC) file format documented
> > >> > > in pb_util.h for the container metadata file. The current version
> > >> > > of this format (V1) looks like this:
> > >> > >
> > >> > >
> > >> > > <magic number>
> > >> > > <container version>
> > >> > > repeated <record>
> > >> > >
> > >> > >
> > >> > > with each <record> looking like the following:
> > >> > >
> > >> > > <uint32 data length>
> > >> > > <data bytes>
> > >> > > <data checksum>
> > >> > >
> > >> > >
> > >> > > In the LBM, each time we create a new block we append the block
> > >> > > id to the metadata file. On startup, we verify that all records
> > >> > > in the file are valid. If not, we print to the log and exit.
> > >> > >
> > >> > > In the case of a full disk, we will have written a partial record
> > >> > > to the metadata file, and at startup we will fail validation;
> > >> > > however, we should be able to detect this case and ignore the
> > >> > > partial record on startup. Because we still need to support
> > >> > > deleting blocks, we need to be able to continue appending to this
> > >> > > metadata file after startup, so we also need to truncate the file
> > >> > > to the last good record when this occurs.
> > >> > >
> > >> > > Here is what I am thinking about to fix this issue:
> > >> > >
> > >> > > 1. When we are reading a container metadata file at startup, if
> > >> > > we detect a trailing record that is too short to be valid
> > >> > > (relative to the length of the file), then we truncate the last
> > >> > > partial record from the file and continue as normal.
> > >> > >
> > >> > > 2. To avoid truncating "good" records in the case that there is
> > >> > > data corruption in one of the length fields, we also need to
> > >> > > extend the PBC format to add a checksum for the record length. So
> > >> > > a record would now look like the following:
> > >> > >
> > >> > > <uint32 data length>
> > >> > > <length checksum>
> > >> > > <data bytes>
> > >> > > <data checksum>
> > >> > >
> > >> > >
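A reader for this revised format could recover along the lines sketched below
(hypothetical code, not the actual pb_util implementation; assumes no
preallocation so cases 1-7 are exhaustive, 4-byte little-endian fields, and
zlib's CRC-32 rather than whatever CRC variant Kudu actually uses):

```python
import os
import struct
import zlib

def read_records(path: str, header_size: int):
    """Read every <len, crc(len), data, crc(data)> record after the header.
    A partial trailing record (cases 4 and 7) is truncated away so the file
    can be appended to again; a CRC mismatch (cases 5 and 6) is corruption."""
    records = []
    with open(path, "r+b") as f:
        size = os.fstat(f.fileno()).st_size
        good = header_size  # end offset of the last fully validated record
        f.seek(good)
        while good < size:
            hdr = f.read(8)
            if len(hdr) < 8:
                break  # case 7: not even room for <len, crc(len)>; truncate
            (length,) = struct.unpack("<I", hdr[:4])
            (len_crc,) = struct.unpack("<I", hdr[4:])
            if zlib.crc32(hdr[:4]) != len_crc:
                raise IOError("length CRC mismatch: corruption")  # case 5
            body = f.read(length + 4)
            if len(body) < length + 4:
                break  # case 4: length runs past EOF; truncate
            data = body[:length]
            (data_crc,) = struct.unpack("<I", body[length:])
            if zlib.crc32(data) != data_crc:
                raise IOError("data CRC mismatch: corruption")  # case 6
            records.append(data)
            good = f.tell()
        f.truncate(good)  # drop any partial trailing record
    return records
```

Because the length CRC is checked before the length is trusted, a corrupted
length field can no longer cause good trailing records to be thrown away.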
> > >> > > Does anyone see any drawbacks to this approach?
> > >> > >
> > >> > > If you made it this far, thanks for reading.
> > >> > >
> > >> > > Mike
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>
