Hi Stan,

> 2b. cp/ shows that db/ is in an inconsistent state (Ignite was stopped in
> the middle of a checkpoint)

Is it correct to say that the above is the most common scenario nowadays?
Assuming that index file corruption falls into this category. At least, we
can print out a message suggesting to remove indexes as a recovery attempt.
What about that?

Also, I remember that Alex Goncharuk was proposing a data recovery tool
which might apply WALs and make the cluster recoverable. Alex, could you
remind us about this?

Do you see other scenarios firing off in production?

- Denis

On Mon, Feb 4, 2019 at 12:34 AM Stanislav Lukyanov wrote:

> Hi Igniters,
>
> I’d like to talk about Ignite startup when we have some of the persistence
> files missing.
>
> This is related to the topic “Ignite index corruption issue ->
> unrecoverable cluster” that is discussed nearby,
> but not exactly the same – I’d like to avoid talking about indexes for now
> (let’s think of them as normal partition files)
> and focus on possible behavioral changes, not documentation.
>
> We have three parts of the persistent storage:
> - db/ - partition files
> - cp/ - checkpoint markers
> - wal/ - write-ahead log (let’s not make a distinction between wal/ and
> wal/archive/ for now)
>
> What if some of these pieces are missing? Currently we don’t handle it that
> well, but experience shows that
> bugs exist, disks fail and users make mistakes – all of which lead to
> files becoming inaccessible.
>
> For starters, let’s not talk about missing db/ - if we’ve lost the base of
> our PDS we’re in trouble, that’s understandable.
>
> Here are the cases I’d like to discuss:
> 1. db/ is OK, cp/ and wal/ are completely missing.
> This isn’t really too likely to happen due to a disk failure since cp/ is
> stored together with db/.
> But a user’s mistake or a bug in Ignite might lead to this.
>
> Current behavior (AFAIK): Ignite doesn’t start.
> I guess the current behavior is fine - we don’t know if the data is
> consistent (if we were in the middle of a checkpoint or not),
> so let’s not even try to use it.
> But a user might want to still start with at least something (or may know
> for sure that the data is consistent) – perhaps we could
> allow that with some flag/option like “--force”.
>
> 2. db and cp are OK, wal is missing.
> This is a highly likely situation – after all, we suggest that users have
> a WAL on a separate disk (that may fail).
> Because of that I think we should really be well-prepared for this.
>
> There are two cases:
>
> 2a. cp/ shows that db/ is in a consistent state (Ignite was stopped not in
> the middle of a checkpoint)
> Current behavior (AFAIK): Ignite doesn’t start.
> We could (almost) safely start here – the data is consistent after all.
> Might require the user to acknowledge that
> the start is without WAL (so we might’ve lost some updates of the last
> checkpoint) by using, again, “--force”.
>
> 2b. cp/ shows that db/ is in an inconsistent state (Ignite was stopped in
> the middle of a checkpoint)
> Current behavior (AFAIK): Ignite doesn’t start.
> Current behavior is OK – we’re in an inconsistent state, so let’s not
> start. It is a question of whether to allow a force-start in this case.
>
> 3. db and wal are OK, cp is missing.
> Current behavior (AFAIK): Ignite will start.
> The current behavior is really awkward. Since we don’t have cp/, we don’t
> have a way to map wal/ to the state of db/, so it is as good as missing.
> I’d have the same behavior here as in case 1.
>
> WDYT?
>
> Thanks,
> Stan
>
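
For reference, the cases in the quoted message can be condensed into a small decision table. The following is only a sketch in Python, not Ignite code: the names `Action` and `proposed_action` are hypothetical, the "force" outcome stands in for the “--force” option floated above (which does not exist today), and refusing to start when db/ is missing is an assumption, since that case was explicitly left out of the discussion.

```python
from enum import Enum


class Action(Enum):
    START = "start normally"
    REFUSE = "refuse to start"
    FORCE_ONLY = "start only if the user explicitly forces it"


def proposed_action(db_ok, cp_ok, wal_ok, mid_checkpoint=False):
    """Map the state of db/, cp/ and wal/ to the behavior proposed in the thread.

    mid_checkpoint is only meaningful when cp/ is present: it says whether
    the checkpoint markers show that the node stopped during a checkpoint.
    """
    if not db_ok:
        # Assumption: the thread leaves a missing db/ out of scope.
        return Action.REFUSE
    if cp_ok and wal_ok:
        # Nothing is missing, so this is an ordinary start.
        return Action.START
    if not cp_ok:
        # Cases 1 and 3: without cp/ the WAL cannot be mapped to db/,
        # so wal/ is as good as missing either way.
        return Action.FORCE_ONLY
    # Case 2: cp/ is present, wal/ is missing.
    if mid_checkpoint:
        # Case 2b: db/ is inconsistent; whether to allow a forced start
        # here is still an open question, so the sketch refuses.
        return Action.REFUSE
    # Case 2a: db/ is consistent, but recent updates may be lost.
    return Action.FORCE_ONLY


if __name__ == "__main__":
    print(proposed_action(db_ok=True, cp_ok=True, wal_ok=False).value)
```

Note that cases 1 and 3 collapse into one branch, which matches the suggestion to treat a missing cp/ the same way regardless of whether wal/ survived.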
