+1 for not proceeding and simply removing the data in this (ideally
unreachable) state....
On 11/29/17 2:15 PM, Ian Maxon wrote:
I too have seen this issue, but I couldn't reproduce or surmise how it
might happen from just inspecting the code. How'd it appear for you?
I would disagree that a checkpoint file not appearing is a small thing,
however. It is more or less the most important artifact for recovery;
it should never have an issue like this.
On Wed, Nov 29, 2017 at 1:54 PM, Chen Luo <[email protected]> wrote:
Hi devs,
Recently I ran into a very annoying issue with recovery. The
checkpoint file of my dataset was somehow corrupted (and I don't know
why). When I restarted AsterixDB, it failed to read the checkpoint
file and started recovering from a clean state. This is highly
undesirable: it silently cleaned up all of my experiment datasets,
roughly 100 GB, and it will take me days to re-ingest that data and
resume my experiments.
I think the behavior of cleaning up all data when some small thing goes
wrong is undesirable and dangerous. When AsterixDB fails to read the
checkpoint on restart and finds the data directory non-empty, I think it
should notify the user and let the user make the decision. For example,
it could refuse to start; the user could then clean up the directory
manually, try a backup checkpoint file, or add a flag to force the
restart. In any case, blindly cleaning up all files seems to be a
dangerous solution.
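To make this concrete, here is a minimal sketch of the kind of startup
guard I am proposing. The class, method, and flag names here are
hypothetical, not the actual AsterixDB code:

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public final class RecoveryGuard {

        /**
         * Called when the checkpoint file cannot be read at startup.
         * Fails fast instead of silently falling back to clean-slate
         * recovery while the data directory still has contents.
         */
        public static void checkBeforeCleanSlate(Path dataDir,
                boolean forceCleanRestart) throws IOException {
            boolean dataDirNonEmpty;
            try (DirectoryStream<Path> ds = Files.newDirectoryStream(dataDir)) {
                dataDirNonEmpty = ds.iterator().hasNext();
            }
            if (dataDirNonEmpty && !forceCleanRestart) {
                // Surface the decision to the user instead of wiping data.
                throw new IllegalStateException("Checkpoint is unreadable but "
                        + dataDir + " is non-empty; refusing to start. Restore a "
                        + "backup checkpoint, clean the directory manually, or "
                        + "restart with the force-clean flag.");
            }
            // Clean-slate recovery proceeds only when it cannot destroy data,
            // or when the user explicitly asked for it.
        }
    }

With a guard like this, losing the checkpoint would stop the node with a
clear error rather than deleting the stores underneath it.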
Any thoughts on this?
Best regards,
Chen Luo