Re: Cassandra commitlog corruption on hard shutdown

2022-04-05 Thread Erick Ramirez
Thanks for circling back and posting your experience!



Re: Cassandra commitlog corruption on hard shutdown

2022-04-04 Thread Leon Zaruvinsky
Hi all,

I wanted to echo back on this thread a bit of a "win".  In investigating
ways to mitigate the "corruption on hard shutdown" issue, we came across
the Group Commitlog feature that was added in 4.0 (
https://issues.apache.org/jira/browse/CASSANDRA-13530).  We backported and
enabled this feature with "commitlog_sync_group_window_in_ms: 2" and the
results are:
- As expected, IOPS on the commitlog drive dropped drastically and no
longer scaled with the number of writes.
- Write performance did not change significantly, and there was no impact
on our application (Cassandra write latency above 2 ms did not seem to be a
bottleneck).
- We've had *zero* commitlog corruption errors since we rolled this out to
our fleet 6 months ago!! Previously using batch commitlog, we faced 1-2
corruptions per month.
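
One way to picture why the sync rate stops tracking the write rate is a toy model, sketched here in Python. It is not Cassandra's implementation: the function names, the hypothetical load, and the simplification that the old behaviour costs one sync per write are all assumptions made for illustration. The only figure taken from this message is the 2 ms group window.

```python
def syncs_per_write(arrivals_us):
    """Toy model of per-write syncing: every write forces its own sync."""
    return len(arrivals_us)


def syncs_grouped(arrivals_us, window_us):
    """Toy model of group commit: all writes that land in the same
    fixed window share a single sync."""
    return len({t // window_us for t in arrivals_us})


# Hypothetical load: one write every 100 microseconds for one second.
arrivals = list(range(0, 1_000_000, 100))
window = 2_000  # 2 ms, matching commitlog_sync_group_window_in_ms: 2

print(syncs_per_write(arrivals))        # 10000, grows with the write rate
print(syncs_grouped(arrivals, window))  # 500, bounded by 1 s / 2 ms
```

In this model, raising the write rate tenfold raises the first number tenfold and leaves the second unchanged, which is consistent with the IOPS behaviour described above.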

Cheers,
Leon


On Tue, Aug 3, 2021 at 11:39 PM Leon Zaruvinsky wrote:

> Following up, I've found that we tend to encounter one of three types of
> exceptions/commitlog corruptions:
>
> 1.
> org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException:
> Mutation checksum failure at ... in CommitLog-5-1531150627243.log
> at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:638)
>
> 2.
> org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException:
> Could not read commit log descriptor in file CommitLog-5-1550003067433.log
> at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:638)
>
> 3.
> org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException:
> Encountered bad header at position ... of commit log
> CommitLog-5-1603991140803.log, with invalid CRC. The end of segment marker
> should be zero.
> at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:647)
>
> I believe exception (2) is mitigated by
> https://issues.apache.org/jira/browse/CASSANDRA-11995 and
> https://issues.apache.org/jira/browse/CASSANDRA-13918
>
> But it's not clear to me how (1) and (3) can be mitigated.

Re: Cassandra commitlog corruption on hard shutdown

2021-08-03 Thread Leon Zaruvinsky
Following up, I've found that we tend to encounter one of three types of
exceptions/commitlog corruptions:

1.
org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException:
Mutation checksum failure at ... in CommitLog-5-1531150627243.log
at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:638)

2.
org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException:
Could not read commit log descriptor in file CommitLog-5-1550003067433.log
at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:638)

3.
org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException:
Encountered bad header at position ... of commit log
CommitLog-5-1603991140803.log, with invalid CRC. The end of segment marker
should be zero.
at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:647)

I believe exception (2) is mitigated by
https://issues.apache.org/jira/browse/CASSANDRA-11995 and
https://issues.apache.org/jira/browse/CASSANDRA-13918

But it's not clear to me how (1) and (3) can be mitigated.
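
To track how often each of the three cases occurs across a fleet, the messages can be bucketed by their distinguishing text. This is a minimal Python sketch: the substrings are copied from the three exceptions listed in this message, while the function name and the category labels are hypothetical names chosen for illustration.

```python
from collections import Counter

# Substrings taken from the three exception messages in this thread;
# the labels on the right are hypothetical names used only in this sketch.
SIGNATURES = [
    ("Mutation checksum failure", "mutation_checksum"),
    ("Could not read commit log descriptor", "bad_descriptor"),
    ("Encountered bad header", "bad_sync_header"),
]


def classify_replay_error(message):
    """Return a label for a CommitLogReplayException message, or None."""
    for needle, label in SIGNATURES:
        if needle in message:
            return label
    return None


sample = [
    "Mutation checksum failure at ... in CommitLog-5-1531150627243.log",
    "Could not read commit log descriptor in file CommitLog-5-1550003067433.log",
    "Encountered bad header at position ... of commit log CommitLog-5-1603991140803.log",
]
counts = Counter(classify_replay_error(m) for m in sample)
print(counts)
```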

On Mon, Jul 26, 2021 at 6:40 PM Leon Zaruvinsky wrote:

> Thanks for the links/comments Jeff and Bowen.
>
> We run xfs. Not sure that we can switch to zfs, so a different solution
> would be preferred.
>
> I’ll take a look through that patch – maybe I’ll try to backport and
> replicate.  We’ve seen both cases where the commitlog is just 0s (empty)
> and where it has had real data in it.
>
> Leon

Re: Cassandra commitlog corruption on hard shutdown

2021-07-26 Thread Leon Zaruvinsky
Thanks for the links/comments Jeff and Bowen.

We run xfs. Not sure that we can switch to zfs, so a different solution
would be preferred.

I’ll take a look through that patch – maybe I’ll try to backport and
replicate.  We’ve seen both cases where the commitlog is just 0s (empty)
and where it has had real data in it.

Leon

On Mon, Jul 26, 2021 at 6:38 PM Jeff Jirsa wrote:

> The commitlog code has changed DRASTICALLY between 2.x and trunk.
>
> If it's really a bunch of trailing 0s as was suggested later, then
> https://issues.apache.org/jira/browse/CASSANDRA-11995 addresses at least
> one cause/case of that particular bug.


Re: Cassandra commitlog corruption on hard shutdown

2021-07-26 Thread Jeff Jirsa
The commitlog code has changed DRASTICALLY between 2.x and trunk.

If it's really a bunch of trailing 0s as was suggested later, then
https://issues.apache.org/jira/browse/CASSANDRA-11995 addresses at least
one cause/case of that particular bug.



On Mon, Jul 26, 2021 at 3:11 PM Leon Zaruvinsky wrote:

> And for completeness, a sample stack trace:
>
> ERROR [2021-07-21T02:11:01.994Z] org.apache.cassandra.db.commitlog.CommitLog:
> Failed commit log replay. Commit disk failure policy is stop_on_startup;
> terminating thread (throwable0_message: Mutation checksum failure at 15167277
> in CommitLog-5-1626828286977.log)
> org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException:
> Mutation checksum failure at 15167277 in CommitLog-5-1626828286977.log
>   at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:647)
>   at org.apache.cassandra.db.commitlog.CommitLogReplayer.replaySyncSection(CommitLogReplayer.java:519)
>   at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:401)
>   at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:143)
>   at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:175)
>   at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:155)
>   at org.apache.cassandra.service.CassandraDaemon.recoverCommitlogAndCompleteSetup(CassandraDaemon.java:296)
>   at org.apache.cassandra.service.CassandraDaemon.completeSetupMayThrowSstableException(CassandraDaemon.java:289)
>   at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:222)
>   at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:630)
>   at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:741)


Re: Cassandra commitlog corruption on hard shutdown

2021-07-26 Thread Bowen Song
I have seen the same error in Cassandra 3.x too, and in fact quite a few 
times. On a few occasions, I opened the corrupted commit log file in a 
hex editor, and it was filled with lots of 0x00s. I believe it was 
caused by the combination of the way Cassandra flushes the commit log + 
the way XFS handles the metadata in journal + an unexpected power cut + 
the SSD write back cache. I have never experienced this again since we 
moved all Cassandra servers to ZFS.
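
The hex-editor check described above can be approximated with a short script. This is a minimal Python sketch, not a Cassandra tool: the function name is hypothetical, and the demo runs against a throwaway temporary file instead of a real segment. Unused space at the end of a segment may legitimately be zero, so a long zero run is a hint to look closer and not proof of corruption.

```python
import os
import tempfile


def trailing_zero_bytes(path, chunk_size=1 << 20):
    """Return (file_size, number_of_trailing_0x00_bytes), reading the
    file backwards in chunks so large segments are not loaded at once."""
    size = os.path.getsize(path)
    zeros = 0
    with open(path, "rb") as f:
        pos = size
        while pos > 0:
            start = max(0, pos - chunk_size)
            f.seek(start)
            chunk = f.read(pos - start)
            stripped = chunk.rstrip(b"\x00")
            zeros += len(chunk) - len(stripped)
            if stripped:  # reached a non-zero byte, stop scanning
                break
            pos = start
    return size, zeros


# Demo on a temporary file standing in for a commit log segment.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"real data" + b"\x00" * 100)
    demo_path = tmp.name

size, zeros = trailing_zero_bytes(demo_path)
print(f"{zeros} of {size} bytes are trailing zeros")
os.remove(demo_path)
```

A file for which both numbers are equal is entirely zero-filled.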


On 26/07/2021 23:11, Leon Zaruvinsky wrote:

> And for completeness, a sample stack trace:
>
> ERROR [2021-07-21T02:11:01.994Z] org.apache.cassandra.db.commitlog.CommitLog:
> Failed commit log replay. Commit disk failure policy is stop_on_startup;
> terminating thread (throwable0_message: Mutation checksum failure at 15167277
> in CommitLog-5-1626828286977.log)
> org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException:
> Mutation checksum failure at 15167277 in CommitLog-5-1626828286977.log
>   at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:647)
>   at org.apache.cassandra.db.commitlog.CommitLogReplayer.replaySyncSection(CommitLogReplayer.java:519)
>   at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:401)
>   at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:143)
>   at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:175)
>   at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:155)
>   at org.apache.cassandra.service.CassandraDaemon.recoverCommitlogAndCompleteSetup(CassandraDaemon.java:296)
>   at org.apache.cassandra.service.CassandraDaemon.completeSetupMayThrowSstableException(CassandraDaemon.java:289)
>   at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:222)
>   at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:630)
>   at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:741)




Re: Cassandra commitlog corruption on hard shutdown

2021-07-26 Thread Leon Zaruvinsky
And for completeness, a sample stack trace:

ERROR [2021-07-21T02:11:01.994Z] org.apache.cassandra.db.commitlog.CommitLog: Failed commit log replay. Commit disk failure policy is stop_on_startup; terminating thread (throwable0_message: Mutation checksum failure at 15167277 in CommitLog-5-1626828286977.log)
org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Mutation checksum failure at 15167277 in CommitLog-5-1626828286977.log
    at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:647)
    at org.apache.cassandra.db.commitlog.CommitLogReplayer.replaySyncSection(CommitLogReplayer.java:519)
    at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:401)
    at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:143)
    at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:175)
    at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:155)
    at org.apache.cassandra.service.CassandraDaemon.recoverCommitlogAndCompleteSetup(CassandraDaemon.java:296)
    at org.apache.cassandra.service.CassandraDaemon.completeSetupMayThrowSstableException(CassandraDaemon.java:289)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:222)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:630)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:741)
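
For anyone wanting to inspect the failing region in a hex editor, the position and segment name can be pulled out of the message programmatically. This is a minimal Python sketch: the function name is hypothetical, and it assumes the number in the message is a byte position within the segment. The 32 MB segment size comes from the commitlog_segment_size_in_mb setting posted elsewhere in this thread.

```python
import re

# Matches the message text shown in the trace above.
PATTERN = re.compile(
    r"Mutation checksum failure at (\d+) in (CommitLog-\d+-\d+\.log)"
)


def parse_checksum_failure(line):
    """Return (position, segment_file_name), or None if the line
    does not contain a mutation checksum failure message."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


line = "Mutation checksum failure at 15167277 in CommitLog-5-1626828286977.log"
offset, segment = parse_checksum_failure(line)

# Assumption: the position is in bytes and the segment is 32 MB,
# as in commitlog_segment_size_in_mb: 32.
segment_bytes = 32 * 1024 * 1024
print(segment, offset, f"{offset / segment_bytes:.1%} into the segment")
```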


On Mon, Jul 26, 2021 at 6:08 PM Leon Zaruvinsky wrote:

> Currently we're using batch commitlog sync:
>
> commitlog_sync: batch
> commitlog_sync_batch_window_in_ms: 2
> commitlog_segment_size_in_mb: 32
>
> durable_writes is also true.
>
> Unfortunately we are still using Cassandra 2.2.x :( Though I'd be curious
> if much in this space has changed since then (I've looked through the
> changelogs and nothing stood out).


Re: Cassandra commitlog corruption on hard shutdown

2021-07-26 Thread Leon Zaruvinsky
Currently we're using batch commitlog sync:

commitlog_sync: batch
commitlog_sync_batch_window_in_ms: 2
commitlog_segment_size_in_mb: 32

durable_writes is also true.

Unfortunately we are still using Cassandra 2.2.x :( Though I'd be curious
if much in this space has changed since then (I've looked through the
changelogs and nothing stood out).

On Mon, Jul 26, 2021 at 5:20 PM Jeff Jirsa wrote:

> What commitlog settings are you using?
>
> Default is periodic with 10s sync. That leaves you a 10s window on hard
> poweroff/crash.
>
> I would also expect Cassandra to clean up and start cleanly. Which version
> are you running?


Re: Cassandra commitlog corruption on hard shutdown

2021-07-26 Thread Jeff Jirsa
What commitlog settings are you using?

Default is periodic with 10s sync. That leaves you a 10s window on hard
poweroff/crash.

I would also expect Cassandra to clean up and start cleanly. Which version
are you running?
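
To put a rough number on that 10s window, here is a back-of-the-envelope Python sketch. The function name and the write rate are hypothetical; the 10 s figure is the periodic default mentioned above. It only bounds what a single node may hold unsynced in periodic mode and says nothing about copies on other replicas.

```python
def writes_at_risk(writes_per_second, sync_period_ms):
    """Upper bound on writes a node has accepted but not yet synced
    to the commit log when syncing happens once per period."""
    return writes_per_second * sync_period_ms // 1000


rate = 5_000  # hypothetical per-node write rate

print(writes_at_risk(rate, 10_000))  # 50000 with the 10 s periodic default
print(writes_at_risk(rate, 1_000))   # 5000 with a hypothetical 1 s period
```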



On Mon, Jul 26, 2021 at 1:00 PM Leon Zaruvinsky wrote:

> Hi Cassandra community,
>
> We (and others) regularly run into commit log corruptions that are caused
> by Cassandra, or the underlying infrastructure, being hard restarted.  I
> suspect that this is because it happens in the middle of a commitlog file
> write to disk.
>
> Could anyone point me at resources / code to understand why this is
> happening?  Shouldn't Cassandra not be acking writes until the commitlog is
> safely written to disk?  I would expect that on startup, Cassandra should
> be able to clean up bad commitlog files and recover gracefully.
>
> I've seen various references online to this issue as something that will
> be fixed in the future - so I'm curious if there is any movement or
> thoughts there.
>
> Thanks a bunch,
> Leon
>


Re: Cassandra commitlog corruption on hard shutdown

2021-07-26 Thread Arvinder Dhillon
I thought durable_writes is the solution.

-Arvinder

On Mon, Jul 26, 2021, 1:00 PM Leon Zaruvinsky wrote:

> Hi Cassandra community,
>
> We (and others) regularly run into commit log corruptions that are caused
> by Cassandra, or the underlying infrastructure, being hard restarted.  I
> suspect that this is because it happens in the middle of a commitlog file
> write to disk.
>
> Could anyone point me at resources / code to understand why this is
> happening?  Shouldn't Cassandra not be acking writes until the commitlog is
> safely written to disk?  I would expect that on startup, Cassandra should
> be able to clean up bad commitlog files and recover gracefully.
>
> I've seen various references online to this issue as something that will
> be fixed in the future - so I'm curious if there is any movement or
> thoughts there.
>
> Thanks a bunch,
> Leon
>