Re: About the system behavior when the checkpoint is corrupted

abdullah alamoudi Wed, 29 Nov 2017 14:11:54 -0800

I wonder how it got to that state.

The first thing an instance does after initialization is create the snapshot file.
This file is only deleted after a new (uncorrupted) snapshot file has been created.
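A minimal sketch of the "keep the old snapshot until a new valid one exists" rotation described above. It is illustrative only and assumes a hypothetical on-disk layout, not AsterixDB's actual format: each checkpoint is a JSON file named `checkpoint_<n>` carrying a CRC32 of its payload. The new file is written and fsynced under a temporary name, atomically renamed into place, read back to validate it, and only then are older checkpoints deleted.

```python
import json
import os
import zlib

PREFIX = "checkpoint_"  # hypothetical naming scheme, not AsterixDB's


def _encode(payload: dict) -> bytes:
    # Wrap the payload with a CRC32 so corruption is detectable on read.
    body = json.dumps(payload, sort_keys=True)
    return json.dumps({"crc": zlib.crc32(body.encode()), "body": body}).encode()


def read_checkpoint(path: str):
    """Return the payload, or None if the file is unreadable or its CRC mismatches."""
    try:
        with open(path, "rb") as f:
            wrapper = json.loads(f.read())
        body = wrapper["body"]
        if zlib.crc32(body.encode()) != wrapper["crc"]:
            return None
        return json.loads(body)
    except (OSError, ValueError, KeyError, TypeError, AttributeError):
        return None


def _checkpoint_ids(directory: str):
    # Only names like "checkpoint_3" count; "checkpoint_3.tmp" is ignored.
    ids = []
    for name in os.listdir(directory):
        suffix = name[len(PREFIX):]
        if name.startswith(PREFIX) and suffix.isdigit():
            ids.append(int(suffix))
    return sorted(ids)


def write_checkpoint(directory: str, payload: dict) -> str:
    ids = _checkpoint_ids(directory)
    next_id = ids[-1] + 1 if ids else 0
    final = os.path.join(directory, f"{PREFIX}{next_id}")
    tmp = final + ".tmp"
    with open(tmp, "wb") as f:
        f.write(_encode(payload))
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, final)  # atomic rename on POSIX filesystems
    if read_checkpoint(final) is None:
        # Keep every older checkpoint if the new one does not validate.
        raise OSError(f"new checkpoint {final} failed validation")
    for old in ids:
        os.remove(os.path.join(directory, f"{PREFIX}{old}"))
    return final


def latest_valid_checkpoint(directory: str):
    # Newest first; skip any file that fails validation.
    for cid in reversed(_checkpoint_ids(directory)):
        payload = read_checkpoint(os.path.join(directory, f"{PREFIX}{cid}"))
        if payload is not None:
            return payload
    return None
```

A crash between the rename and the deletions only leaves extra older checkpoints behind, which `latest_valid_checkpoint` passes over in favor of the newest valid one. For full durability the containing directory would also need an fsync after the rename; that step is omitted here.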


I understand your point, but I wonder how it got to this state. Bug!?

Cheers,
Abdullah.

> On Nov 29, 2017, at 1:54 PM, Chen Luo <[email protected]> wrote:
> 
> Hi devs,
> 
> Recently I ran into a very annoying issue with recovery. The
> checkpoint file of my dataset was somehow corrupted (and I don't know
> why). When I restarted AsterixDB, it failed to read the
> checkpoint file and started recovering from a clean state. This is highly
> undesirable because it silently cleaned up all of my experiment datasets,
> roughly 100GB, and it will take me days to re-ingest this data to
> resume my experiments.
> 
> I think the behavior of cleaning up all data when some small thing goes
> wrong is undesirable and dangerous. When AsterixDB fails to read the
> checkpoint on restart and finds the data directory non-empty, I think it
> should notify the user and let the user make the decision. For example, it
> could refuse to start at this point, and the user could clean up the
> directory manually, try to use a backup checkpoint file, or pass some flag
> to force the restart. In any case, blindly cleaning up all files seems to be
> a dangerous solution.
> 
> Any thoughts on this?
> 
> Best regards,
> Chen Luo
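The startup policy Chen proposes can be expressed as a small decision function. This is a minimal sketch: `decide_startup`, `StartupAction`, and the `force_clean_start` flag are hypothetical names, not existing AsterixDB APIs or configuration options, and checkpoint validation (primary and backup) is assumed to happen elsewhere.

```python
import os
from enum import Enum


class StartupAction(Enum):
    RECOVER = "recover from checkpoint"
    CLEAN_START = "start from an empty state"
    REFUSE = "refuse to start and ask the operator"


def decide_startup(primary_valid: bool, backup_valid: bool,
                   data_dir: str, force_clean_start: bool = False) -> StartupAction:
    """Hypothetical policy: never wipe a non-empty data directory silently."""
    if primary_valid or backup_valid:
        # A readable checkpoint (primary or backup copy) is enough to recover.
        return StartupAction.RECOVER
    data_dir_empty = not os.path.isdir(data_dir) or not os.listdir(data_dir)
    if data_dir_empty or force_clean_start:
        # Fresh install, or the operator explicitly opted in to discarding data.
        return StartupAction.CLEAN_START
    # No usable checkpoint but existing data: stop instead of cleaning up.
    return StartupAction.REFUSE
```

Under this policy, the scenario above (unreadable checkpoint, about 100GB of existing data, no flag) ends in `REFUSE` rather than a silent cleanup.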
