Re: About the system behavior when the checkpoint is corrupted

Chen Luo Wed, 29 Nov 2017 14:27:06 -0800

I'm not sure how the checkpoint file got corrupted. For my experiments, I
have several versions of AsterixDB sharing the same storage dir (so that I
can evaluate the performance after making some changes). Recently I synced
my branch with master, and maybe this caused a problem with the checkpoint
file (e.g., different versions of the codebase?).


However, I think cleaning up the entire data directory is dangerous. The
user (such as me) can back up the checkpoint file because it's small, but it
would be cumbersome to back up the entire data directory. When something is
indeed wrong with the checkpoint file, it's better that the user is made
aware of it and can make the decision themselves.
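
To make the suggestion from my first mail (quoted below) concrete, here is a
minimal sketch of the startup decision I have in mind. All names here
(StartupGuard, Action, the forceCleanStart flag) are hypothetical and are not
taken from the AsterixDB codebase; it only illustrates the rule "never wipe a
non-empty data directory unless the user explicitly asks for it".

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical sketch: none of these names exist in AsterixDB.
final class StartupGuard {

    enum Action {
        RECOVER_FROM_CHECKPOINT, // checkpoint is readable: normal restart
        INITIALIZE_CLEAN,        // nothing to lose, or the user forced it
        REFUSE_TO_START          // unreadable checkpoint but data is present
    }

    private StartupGuard() {
    }

    // Pure decision rule, kept free of I/O so it is easy to test.
    static Action decide(boolean checkpointReadable, boolean dataDirEmpty,
            boolean forceCleanStart) {
        if (checkpointReadable) {
            return Action.RECOVER_FROM_CHECKPOINT;
        }
        if (dataDirEmpty || forceCleanStart) {
            return Action.INITIALIZE_CLEAN;
        }
        // Leave the files alone and let the user decide what to do.
        return Action.REFUSE_TO_START;
    }

    // True if the data directory does not exist or has no entries.
    static boolean hasNoData(Path dataDir) {
        if (Files.notExists(dataDir)) {
            return true;
        }
        try (Stream<Path> entries = Files.list(dataDir)) {
            return !entries.findAny().isPresent();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a rule like this, the case I hit (unreadable checkpoint, non-empty data
directory, no force flag) maps to REFUSE_TO_START, so the user can restore a
backup checkpoint or clean up manually before trying again.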

Best regards,
Chen Luo

On Wed, Nov 29, 2017 at 2:11 PM, abdullah alamoudi <[email protected]>
wrote:

> I wonder how it got to that state.
>
> The first thing an instance does after initialization is create the
> snapshot file.
> This will only be deleted after a new (uncorrupted) snapshot file is
> created.
>
> I understand your point, but I wonder how it got to this state. Bug!?
>
> Cheers,
> Abdullah.
>
> > On Nov 29, 2017, at 1:54 PM, Chen Luo <[email protected]> wrote:
> >
> > Hi devs,
> >
> > Recently I ran into a very annoying issue with recovery. The
> > checkpoint file of my dataset was somehow corrupted (and I don't know
> > why). When I restarted AsterixDB, it failed to read the checkpoint
> > file and started recovering from a clean state. This is highly
> > undesirable in the sense that it silently cleaned up all of my
> > experiment datasets, roughly 100GB. And it'll take me days to re-ingest
> > this data to resume my experiments.
> >
> > I think the behavior of cleaning up all data when some small thing goes
> > wrong is undesirable and dangerous. When AsterixDB fails to restart and
> > finds the data directory non-empty, I think it should notify the user and
> > let the user make the decision. For example, it could refuse to restart
> > at that point, and the user could clean up the directory manually, try
> > to use a backup checkpoint file, or add some flag to force the restart.
> > Anyway, blindly cleaning up all files seems to be a dangerous solution.
> >
> > Any thoughts on this?
> >
> > Best regards,
> > Chen Luo
>
>
