Re: Starting with missing PDS pieces

Denis Magda Wed, 06 Feb 2019 16:09:51 -0800

Hi Stan,

> 2b. cp/ shows that db/ is in an inconsistent state (Ignite was stopped in
> the middle of a checkpoint)



Is it correct to say that the above is the most common scenario nowadays,
assuming that index file corruption falls into this category? At least we
can print out a message suggesting removing the indexes as a recovery
attempt. What about that?

Also, I remember that Alex Goncharuk was proposing a data recovery tool
that might apply WALs and make the cluster recoverable. Alex, could you
remind us about this?

Do you see other scenarios firing off in production?

-
Denis


On Mon, Feb 4, 2019 at 12:34 AM Stanislav Lukyanov <[email protected]>
wrote:

> Hi Igniters,
>
> I’d like to talk about Ignite startup when we have some of the persistence
> files missing.
>
> This is related to the topic “Ignite index corruption issue ->
> unrecoverable cluster” that is discussed nearby,
> but not exactly the same – I’d like to avoid talking about indexes for now
> (let’s think of them as of normal partition files)
> and focus on possible behavioral changes, not documentation.
>
> We have three parts of the persistent storage:
> - db/ - partition files
> - cp/ - checkpoint markers
> - wal/ - write-ahead log (let’s not make a distinction between wal/ and
> wal/archive/ for now)
>
> What if some of these pieces are missing? Currently we don’t handle it that
> well, but experience shows that
> bugs exist, disks fail and users make mistakes – all of which leads to
> files becoming inaccessible.
>
> For starters, let’s not talk about missing db/ - if we’ve lost the base of
> our PDS we’re in trouble, that’s understandable.
>
> Here are the cases I’d like to discuss:
> 1. db/ is OK, cp/ and wal/ are completely missing.
> This isn’t really too likely to happen due to a disk failure since cp/ is
> stored together with db/.
> But a user’s mistake or a bug in Ignite might lead to this.
>
> Current behavior (AFAIK): Ignite doesn’t start.
> I guess the current behavior is fine - we don’t know if the data is
> consistent (if we were in the middle of a checkpoint or not),
> so let’s not even try to use it.
> But a user might want to still start with at least something (or may know
> for sure that the data is consistent) – perhaps we could
> allow that with some flag/option like “--force”.
>
> 2. db and cp are OK, wal is missing.
> This is a highly likely situation – after all, we suggest that users have
> a WAL on a separate disk (that may fail).
> Because of that I think we should really be well-prepared for this.
>
> There are two cases:
>
> 2a. cp/ shows that db/ is in a consistent state (Ignite was stopped not in
> the middle of a checkpoint)
> Current behavior (AFAIK): Ignite doesn’t start.
> We could (almost) safely start here – the data is consistent after all.
> Might require the user to acknowledge that
> the start is without WAL (so we might’ve lost some updates since the last
> checkpoint) by using, again, “--force”.
>
> 2b. cp/ shows that db/ is in an inconsistent state (Ignite was stopped in
> the middle of a checkpoint)
> Current behavior (AFAIK): Ignite doesn’t start.
> Current behavior is OK – we’re in an inconsistent state, so let’s not
> start. It is a question of whether to allow a force-start in this case.
>
> 3. db and wal are OK, cp is missing.
> Current behavior (AFAIK): Ignite will start.
> The current behavior is really awkward. Since we don’t have cp/, we don’t
> have a way to map wal/ to the state of db/, so it is as good as missing.
> I’d have the same behavior here as in case 1.
>
> WDYT?
>
> Thanks,
> Stan
>
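
For reference, the cases in the quoted message can be condensed into a small
decision table. The Java sketch below is only an illustration of the proposed
behavior, not Ignite code: the class, enum and method names are hypothetical,
the db/, cp/ and wal/ layout is the simplified one used in the message, db/ is
assumed to be present, and the "everything is present" branch is an assumption
added for completeness. Whether a force-start should be allowed in case 2b is
left open in the thread, so the sketch simply refuses there.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

/** Hypothetical sketch of the startup policy discussed above; not an Ignite API. */
final class PdsStartupSketch {
    /** Possible outcomes: start, start only with a "--force"-like option, or refuse. */
    enum Action { START, START_IF_FORCED, REFUSE }

    /** Returns true if dir exists, is a directory and contains at least one entry. */
    static boolean hasFiles(Path dir) throws IOException {
        if (!Files.isDirectory(dir)) {
            return false;
        }
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.findAny().isPresent();
        }
    }

    /**
     * Decides what to do on startup, assuming db/ itself is present.
     *
     * @param cpPresent          cp/ (checkpoint markers) is available
     * @param walPresent         wal/ (write-ahead log) is available
     * @param checkpointComplete cp/ shows the node was NOT stopped in the middle
     *                           of a checkpoint (ignored when cp/ is missing)
     */
    static Action decide(boolean cpPresent, boolean walPresent, boolean checkpointComplete) {
        if (!cpPresent) {
            // Cases 1 and 3: without cp/ the state of db/ is unknown and wal/
            // cannot be mapped onto it, so wal/ is as good as missing.
            return Action.START_IF_FORCED;
        }
        if (walPresent) {
            // Assumption: all pieces present, so the usual recovery applies.
            return Action.START;
        }
        // Case 2: wal/ is missing.
        return checkpointComplete
            ? Action.START_IF_FORCED // 2a: consistent, but recent updates may be lost
            : Action.REFUSE;         // 2b: inconsistent; force-start is an open question
    }
}
```

A caller would compute the first two flags with something like
hasFiles(root.resolve("cp")) and hasFiles(root.resolve("wal")), where root is
a hypothetical storage directory, and derive the third flag from the
checkpoint markers.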
