Re: Cassandra commitlog corruption on hard shutdown
Thanks for the links and comments, Jeff and Bowen.

We run XFS. I'm not sure we can switch to ZFS, so a different solution would be preferred. I'll take a look through that patch; maybe I'll try to backport it and replicate the issue.

We've seen both cases: where the commitlog is just 0s (empty) and where it contains real data.

Leon

On Mon, Jul 26, 2021 at 6:38 PM Jeff Jirsa wrote:
> The commitlog code has changed DRASTICALLY between 2.x and trunk.
>
> If it's really a bunch of trailing 0s as was suggested later, then
> https://issues.apache.org/jira/browse/CASSANDRA-11995 addresses at least
> one cause/case of that particular bug.
Re: Cassandra commitlog corruption on hard shutdown
The commitlog code has changed DRASTICALLY between 2.x and trunk.

If it's really a bunch of trailing 0s as was suggested later, then https://issues.apache.org/jira/browse/CASSANDRA-11995 addresses at least one cause/case of that particular bug.

On Mon, Jul 26, 2021 at 3:11 PM Leon Zaruvinsky wrote:
> And for completeness, a sample stack trace:
>
> ERROR [2021-07-21T02:11:01.994Z] org.apache.cassandra.db.commitlog.CommitLog: Failed commit log replay. Commit disk failure policy is stop_on_startup; terminating thread (throwable0_message: Mutation checksum failure at 15167277 in CommitLog-5-1626828286977.log)
> org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Mutation checksum failure at 15167277 in CommitLog-5-1626828286977.log
>     at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:647)
>     at org.apache.cassandra.db.commitlog.CommitLogReplayer.replaySyncSection(CommitLogReplayer.java:519)
>     at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:401)
>     at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:143)
>     at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:175)
>     at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:155)
>     at org.apache.cassandra.service.CassandraDaemon.recoverCommitlogAndCompleteSetup(CassandraDaemon.java:296)
>     at org.apache.cassandra.service.CassandraDaemon.completeSetupMayThrowSstableException(CassandraDaemon.java:289)
>     at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:222)
>     at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:630)
>     at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:741)
Re: Cassandra commitlog corruption on hard shutdown
I have seen the same error in Cassandra 3.x too, and in fact quite a few times. On a few occasions, I opened the corrupted commit log file in a hex editor, and it was filled with lots of 0x00s.

I believe it was caused by the combination of the way Cassandra flushes the commit log, the way XFS handles metadata in its journal, an unexpected power cut, and the SSD write-back cache.

I have never experienced this again since we moved all Cassandra servers to ZFS.

On 26/07/2021 23:11, Leon Zaruvinsky wrote:
> And for completeness, a sample stack trace:
>
> ERROR [2021-07-21T02:11:01.994Z] org.apache.cassandra.db.commitlog.CommitLog: Failed commit log replay. Commit disk failure policy is stop_on_startup; terminating thread (throwable0_message: Mutation checksum failure at 15167277 in CommitLog-5-1626828286977.log)
> org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Mutation checksum failure at 15167277 in CommitLog-5-1626828286977.log
>     at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:647)
>     at org.apache.cassandra.db.commitlog.CommitLogReplayer.replaySyncSection(CommitLogReplayer.java:519)
>     at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:401)
>     at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:143)
>     at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:175)
>     at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:155)
>     at org.apache.cassandra.service.CassandraDaemon.recoverCommitlogAndCompleteSetup(CassandraDaemon.java:296)
>     at org.apache.cassandra.service.CassandraDaemon.completeSetupMayThrowSstableException(CassandraDaemon.java:289)
>     at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:222)
>     at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:630)
>     at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:741)
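The hex-editor check described in this message can be scripted. The following is a minimal sketch, not a Cassandra tool: the function names and the classification labels are hypothetical, and it only counts the 0x00 bytes at the end of a file. A zero tail is not proof of corruption on its own, since a pre-sized segment that was only partly used may also end in zeros; the result is most useful when compared with the offset reported in the replay error.

```python
import os


def trailing_zero_bytes(path, chunk_size=64 * 1024):
    """Return how many 0x00 bytes sit at the very end of the file."""
    size = os.path.getsize(path)
    zeros = 0
    with open(path, "rb") as f:
        pos = size
        # Walk backwards through the file one chunk at a time.
        while pos > 0:
            start = max(0, pos - chunk_size)
            f.seek(start)
            chunk = f.read(pos - start)
            stripped = chunk.rstrip(b"\x00")
            zeros += len(chunk) - len(stripped)
            if stripped:
                break  # reached a non-zero byte, the zero tail ends here
            pos = start
    return zeros


def classify_segment(path):
    """Label a file as empty, all-zero, zero-tail, or no-zero-tail."""
    size = os.path.getsize(path)
    if size == 0:
        return "empty"
    zeros = trailing_zero_bytes(path)
    if zeros == size:
        return "all-zero"
    if zeros > 0:
        return "zero-tail"
    return "no-zero-tail"
```

Calling `classify_segment` on each file in the commitlog directory of a stopped node would separate the "just 0s" case from the "real data" case mentioned earlier in the thread; the directory location depends on the installation.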
Re: Cassandra commitlog corruption on hard shutdown
And for completeness, a sample stack trace:

ERROR [2021-07-21T02:11:01.994Z] org.apache.cassandra.db.commitlog.CommitLog: Failed commit log replay. Commit disk failure policy is stop_on_startup; terminating thread (throwable0_message: Mutation checksum failure at 15167277 in CommitLog-5-1626828286977.log)
org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Mutation checksum failure at 15167277 in CommitLog-5-1626828286977.log
    at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:647)
    at org.apache.cassandra.db.commitlog.CommitLogReplayer.replaySyncSection(CommitLogReplayer.java:519)
    at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:401)
    at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:143)
    at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:175)
    at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:155)
    at org.apache.cassandra.service.CassandraDaemon.recoverCommitlogAndCompleteSetup(CassandraDaemon.java:296)
    at org.apache.cassandra.service.CassandraDaemon.completeSetupMayThrowSstableException(CassandraDaemon.java:289)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:222)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:630)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:741)

On Mon, Jul 26, 2021 at 6:08 PM Leon Zaruvinsky wrote:
> Currently we're using commitlog_batch:
>
> commitlog_sync: batch
> commitlog_sync_batch_window_in_ms: 2
> commitlog_segment_size_in_mb: 32
>
> durable_writes is also true.
>
> Unfortunately we are still using Cassandra 2.2.x :( Though I'd be curious
> if much in this space has changed since then (I've looked through the
> changelogs and nothing stood out).
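To illustrate how a "checksum failure at <offset>" can arise during replay after a torn write, here is a small sketch of a checksummed append-only log. The record layout (4-byte length, 4-byte CRC32, payload) is invented for this example and is not Cassandra's on-disk commitlog format; `encode_record` and `replay` are hypothetical names. The sketch shows one possible recovery choice: stop at the first record that fails validation and report where and why, rather than aborting outright.

```python
import struct
import zlib

# Toy record header: payload length, then CRC32 of the payload (big-endian).
HEADER = struct.Struct(">II")


def encode_record(payload: bytes) -> bytes:
    """Frame a payload with its length and checksum."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def replay(log: bytes):
    """Return (payloads, stop_offset, reason) for a toy log image.

    Replay stops at the first record that cannot be validated.
    Empty payloads are not supported: an all-zero header is treated
    as padding, because crc32(b"") is 0.
    """
    payloads = []
    offset = 0
    while offset < len(log):
        if len(log) - offset < HEADER.size:
            return payloads, offset, "truncated header"
        length, crc = HEADER.unpack_from(log, offset)
        if length == 0 and crc == 0:
            return payloads, offset, "zero padding"
        body_start = offset + HEADER.size
        body = log[body_start:body_start + length]
        if len(body) < length:
            return payloads, offset, "truncated record"
        if zlib.crc32(body) != crc:
            return payloads, offset, "checksum failure"
        payloads.append(body)
        offset = body_start + length
    return payloads, offset, "end of log"
```

In this toy format, a record whose header reached disk but whose payload was cut off or zeroed shows up as "truncated record" or "checksum failure" at the record's starting offset, while a clean zero tail shows up as "zero padding". Whether a real replayer should tolerate the first two cases is a policy decision, which is the question raised in the original post.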
Re: Cassandra commitlog corruption on hard shutdown
Currently we're using commitlog_batch:

commitlog_sync: batch
commitlog_sync_batch_window_in_ms: 2
commitlog_segment_size_in_mb: 32

durable_writes is also true.

Unfortunately we are still using Cassandra 2.2.x :( Though I'd be curious if much in this space has changed since then (I've looked through the changelogs and nothing stood out).

On Mon, Jul 26, 2021 at 5:20 PM Jeff Jirsa wrote:
> What commitlog settings are you using?
>
> Default is periodic with 10s sync. That leaves you a 10s window on hard
> poweroff/crash.
>
> I would also expect cassandra to cleanup and start cleanly, which version
> are you running?
Re: Cassandra commitlog corruption on hard shutdown
What commitlog settings are you using?

Default is periodic with 10s sync. That leaves you a 10s window on hard poweroff/crash.

I would also expect Cassandra to clean up and start cleanly. Which version are you running?

On Mon, Jul 26, 2021 at 1:00 PM Leon Zaruvinsky wrote:
> Hi Cassandra community,
>
> We (and others) regularly run into commit log corruptions that are caused
> by Cassandra, or the underlying infrastructure, being hard restarted. I
> suspect that this is because it happens in the middle of a commitlog file
> write to disk.
>
> Could anyone point me at resources / code to understand why this is
> happening? Shouldn't Cassandra not be acking writes until the commitlog is
> safely written to disk? I would expect that on startup, Cassandra should
> be able to clean up bad commitlog files and recover gracefully.
>
> I've seen various references online to this issue as something that will
> be fixed in the future - so I'm curious if there is any movement or
> thoughts there.
>
> Thanks a bunch,
> Leon
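The difference between a periodic sync and syncing before acknowledging can be sketched with a plain append-only file. This is an illustration only, not how Cassandra implements `commitlog_sync`: the `AppendLog` class and its `sync_period_s` parameter are hypothetical, and syncing on every append is a simplification of batch mode, which groups writes. The return value of `append` says whether the record had been fsynced by the time the call returned.

```python
import os
import time


class AppendLog:
    """Toy append-only log with either per-append or periodic fsync."""

    def __init__(self, path, sync_period_s=None):
        # sync_period_s=None means: fsync before every append returns.
        self._f = open(path, "ab")
        self._sync_period_s = sync_period_s
        self._last_sync = time.monotonic()
        self.unsynced_bytes = 0

    def _sync(self):
        self._f.flush()               # push Python's buffer to the OS
        os.fsync(self._f.fileno())    # ask the OS to push it to the device
        self._last_sync = time.monotonic()
        self.unsynced_bytes = 0

    def append(self, record: bytes) -> bool:
        """Append a record; return True if it was fsynced before returning."""
        self._f.write(record)
        self.unsynced_bytes += len(record)
        if self._sync_period_s is None:
            self._sync()
            return True
        if time.monotonic() - self._last_sync >= self._sync_period_s:
            self._sync()
            return True
        return False  # acknowledged while still only in memory or page cache

    def close(self):
        self._sync()
        self._f.close()
```

With `sync_period_s=10.0`, anything counted in `unsynced_bytes` at the moment of a power cut is the exposure window described above. Note that even a successful `os.fsync` only covers what the operating system controls; a device write-back cache that ignores flush requests can still lose data.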
Re: Cassandra commitlog corruption on hard shutdown
I thought durable_writes is the solution.

-Arvinder

On Mon, Jul 26, 2021, 1:00 PM Leon Zaruvinsky wrote:
> Hi Cassandra community,
>
> We (and others) regularly run into commit log corruptions that are caused
> by Cassandra, or the underlying infrastructure, being hard restarted. I
> suspect that this is because it happens in the middle of a commitlog file
> write to disk.
>
> Could anyone point me at resources / code to understand why this is
> happening? Shouldn't Cassandra not be acking writes until the commitlog is
> safely written to disk? I would expect that on startup, Cassandra should
> be able to clean up bad commitlog files and recover gracefully.
>
> I've seen various references online to this issue as something that will
> be fixed in the future - so I'm curious if there is any movement or
> thoughts there.
>
> Thanks a bunch,
> Leon
Re: [RELEASE] Apache Cassandra 4.0.0 released
Whoo hoo! Looking forward to trying it out!

-Joe

On 7/26/2021 4:03 PM, Brandon Williams wrote:
> The Cassandra team is pleased to announce the release of Apache Cassandra
> version 4.0.0.
>
> Apache Cassandra is a fully distributed database. It is the right choice
> when you need scalability and high availability without compromising
> performance.
>
> http://cassandra.apache.org/
>
> Downloads of source and binary distributions are available in our
> download section:
>
> http://cassandra.apache.org/download/
>
> This version is the initial release in the 4.0 series. As always, please
> pay attention to the release notes[2] and Let us know[3] if you were to
> encounter any problem.
>
> Enjoy!
>
> [1]: CHANGES.txt https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=CHANGES.txt;hb=refs/tags/cassandra-4.0.0
> [2]: NEWS.txt https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=NEWS.txt;hb=refs/tags/cassandra-4.0.0
> [3]: https://issues.apache.org/jira/browse/CASSANDRA
[RELEASE] Apache Cassandra 4.0.0 released
The Cassandra team is pleased to announce the release of Apache Cassandra version 4.0.0.

Apache Cassandra is a fully distributed database. It is the right choice when you need scalability and high availability without compromising performance.

http://cassandra.apache.org/

Downloads of source and binary distributions are available in our download section:

http://cassandra.apache.org/download/

This version is the initial release in the 4.0 series. As always, please pay attention to the release notes[2] and let us know[3] if you encounter any problems.

Enjoy!

[1]: CHANGES.txt https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=CHANGES.txt;hb=refs/tags/cassandra-4.0.0
[2]: NEWS.txt https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=NEWS.txt;hb=refs/tags/cassandra-4.0.0
[3]: https://issues.apache.org/jira/browse/CASSANDRA
Cassandra commitlog corruption on hard shutdown
Hi Cassandra community,

We (and others) regularly run into commit log corruptions that are caused by Cassandra, or the underlying infrastructure, being hard restarted. I suspect that this is because it happens in the middle of a commitlog file write to disk.

Could anyone point me at resources / code to understand why this is happening? Shouldn't Cassandra not be acking writes until the commitlog is safely written to disk? I would expect that on startup, Cassandra should be able to clean up bad commitlog files and recover gracefully.

I've seen various references online to this issue as something that will be fixed in the future - so I'm curious if there is any movement or thoughts there.

Thanks a bunch,
Leon