Re: About the system behavior when the checkpoint is corrupted

Ian Maxon Wed, 29 Nov 2017 14:19:30 -0800

To be more precise, what I saw was that the checkpoint file was
actually there but had zero length, if memory serves (and hence was corrupt).
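
To make the "present but zero length" case concrete, here is a minimal sketch of how a checkpoint file could be classified before recovery decides anything. This is not AsterixDB code: the function name, the three states, and the assumption that the checkpoint is a JSON document are all hypothetical and only serve the illustration.

```python
import json
import os
from enum import Enum


class CheckpointState(Enum):
    MISSING = "missing"   # no file at all: plausibly a first start
    CORRUPT = "corrupt"   # file exists but cannot be trusted
    VALID = "valid"       # file exists and parses


def classify_checkpoint(path):
    """Classify a checkpoint file (hypothetical helper, JSON format assumed)."""
    if not os.path.exists(path):
        return CheckpointState.MISSING
    # A zero-length file is the case described above: present, but unusable.
    if os.path.getsize(path) == 0:
        return CheckpointState.CORRUPT
    try:
        with open(path, "r", encoding="utf-8") as f:
            json.load(f)
    except (OSError, ValueError):
        # ValueError covers both malformed JSON and undecodable bytes.
        return CheckpointState.CORRUPT
    return CheckpointState.VALID
```

The point of keeping MISSING and CORRUPT apart is that a missing file may be a legitimate first start, while a corrupt one means something went wrong and should probably not be treated as a clean state.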


On Wed, Nov 29, 2017 at 2:11 PM, abdullah alamoudi <[email protected]> wrote:
> I wonder how it got to that state.
>
> The first thing an instance does after initialization is create the snapshot
> file.
> That file is only deleted after a new (uncorrupted) snapshot file is created.
>
> I understand your point, but I wonder how it got to this state. Bug!?
>
> Cheers,
> Abdullah.
>
>> On Nov 29, 2017, at 1:54 PM, Chen Luo <[email protected]> wrote:
>>
>> Hi devs,
>>
>> Recently I ran into a very annoying recovery issue. The
>> checkpoint file of my dataset was somehow corrupted (and I don't know
>> why). When I restarted AsterixDB, it failed to read the
>> checkpoint file and started recovering from a clean state. This is highly
>> undesirable because it silently cleaned up all of my experiment datasets,
>> roughly 100GB, and it will take me days to re-ingest the data and
>> resume my experiments.
>>
>> I think that cleaning up all data when some small thing goes
>> wrong is undesirable and dangerous. When AsterixDB fails to restart and
>> finds the data directory non-empty, I think it should notify the user and
>> let the user make the decision. For example, it could refuse to start at
>> that point, and the user could clean up the directory manually, try to use a
>> backup checkpoint file, or add some flag to force the restart. In any case,
>> blindly cleaning up all files seems to be a dangerous solution.
>>
>> Any thoughts on this?
>>
>> Best regards,
>> Chen Luo
>
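
The behavior proposed in the original message can be sketched as a small decision function: recover when the checkpoint is readable, start fresh only when there is nothing to lose, and otherwise refuse to start unless the user explicitly asks for a clean start. Everything here is an assumption for illustration, including the `decide_startup` name, the `force_clean` flag, and the returned action strings; none of it is taken from the AsterixDB code base.

```python
import os


class RecoveryRefused(Exception):
    """Raised when startup would otherwise destroy existing data."""


def decide_startup(data_dir, checkpoint_ok, force_clean=False):
    """Return the startup action (hypothetical names, not AsterixDB's API)."""
    has_data = False
    if os.path.isdir(data_dir):
        with os.scandir(data_dir) as entries:
            has_data = any(entries)  # True if the directory has any entry

    if checkpoint_ok:
        return "recover"            # normal restart from the checkpoint
    if not has_data:
        return "fresh_start"        # nothing on disk, safe to initialize
    if force_clean:
        return "clean_and_start"    # user explicitly accepted the data loss
    raise RecoveryRefused(
        f"checkpoint unreadable but {data_dir!r} is not empty; "
        "restore a backup checkpoint, clean the directory manually, "
        "or restart with the force flag"
    )
```

With this shape, the destructive path is only reachable through an explicit choice, and the default on a corrupt checkpoint is to stop and tell the user what the options are.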
