[
https://issues.apache.org/jira/browse/CASSANDRA-18119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656282#comment-17656282
]
Josh McKenzie commented on CASSANDRA-18119:
-------------------------------------------
+1 on the code change.
Looks like the trunk and 4.1 ci runs have failures in repeat runs on
LogTransactionTest.java; the one on 4.1 [looks like it might be
related|https://app.circleci.com/pipelines/github/krummas/cassandra/846/workflows/f9ff701a-22f6-463b-b356-c37a61d24a75/jobs/7054/tests]
to this change here; it now chains down to
{code:java}
static boolean removeUnfinishedLeftovers(Map.Entry<String, List<File>> entry)
{
try(LogFile txn = LogFile.make(entry.getKey(), entry.getValue()))
{
logger.info("Verifying logfile transaction {}", txn);
// We don't check / include the stats file timestamp on LogRecord
creation / verification as that might
// be modified by a race in compaction notification and then needlessly
fail subsequent node starts.
if (txn.verify(true)) // REVIEWER NOTE: this param leading to skipping
TS calc on stats file is what we changed here{code}
>From the logs on the test failure, looks like it was expecting true (files
>found and deleted correctly) but received fals:
{code:java}
34409 [junit-timeout] ERROR [main] 2022-12-13 15:45:01,465 LogFile.java:170 -
Failed to read records for transaction log
[nb_txn_compaction_1ae0fb60-7afd-11ed-bd8f-0b2a035c989d.log in
build/test/cassandra/data/TransactionLogsTest/mockcf39-1add51e07afd11edbd8f0b2a035c989d/1,
build/test/cassandra/data/TransactionLogs
34410 [junit-timeout] ERROR [main] 2022-12-13 15:45:01,465
LogTransaction.java:561 - Unexpected disk state: failed to read transaction log
[nb_txn_compaction_1ae0fb60-7afd-11ed-bd8f-0b2a035c989d.log in
build/test/cassandra/data/TransactionLogsTest/mockcf39-1add51e07afd11edbd8f0b2a035c989d/1,
build/test/cassandra/da
34411 [junit-timeout] Files and contents follow:^M
34412 [junit-timeout]
build/test/cassandra/data/TransactionLogsTest/mockcf39-1add51e07afd11edbd8f0b2a035c989d/1/nb_txn_compaction_1ae0fb60-7afd-11ed-bd8f-0b2a035c989d.log^M
34413 [junit-timeout]
REMOVE:[/tmp/cassandra/build/test/cassandra/data/TransactionLogsTest/mockcf39-1add51e07afd11edbd8f0b2a035c989d/1/nb-0-big-,1670946301437,5][2873235910]^M
34414 [junit-timeout]
REMOVE:[/tmp/cassandra/build/test/cassandra/data/TransactionLogsTest/mockcf39-1add51e07afd11edbd8f0b2a035c989d/2/nb-2-big-,1670946301441,5][1283732776]^M
34415 [junit-timeout]
ADD:[/tmp/cassandra/build/test/cassandra/data/TransactionLogsTest/mockcf39-1add51e07afd11edbd8f0b2a035c989d/1/nb-1-big-,0,5][1197593494]^M
34416 [junit-timeout]
ADD:[/tmp/cassandra/build/test/cassandra/data/TransactionLogsTest/mockcf39-1add51e07afd11edbd8f0b2a035c989d/2/nb-3-big-,0,5][1374830355]^M
34417 [junit-timeout] COMMIT:[,0,0][2613697770]^M
34418 [junit-timeout] ***This record should have been the last one in
all replicas^M
34419 [junit-timeout]
build/test/cassandra/data/TransactionLogsTest/mockcf39-1add51e07afd11edbd8f0b2a035c989d/2/nb_txn_compaction_1ae0fb60-7afd-11ed-bd8f-0b2a035c989d.log^M
34420 [junit-timeout]
REMOVE:[/tmp/cassandra/build/test/cassandra/data/TransactionLogsTest/mockcf39-1add51e07afd11edbd8f0b2a035c989d/1/nb-0-big-,1670946301437,5][2873235910]^M
34421 [junit-timeout]
REMOVE:[/tmp/cassandra/build/test/cassandra/data/TransactionLogsTest/mockcf39-1add51e07afd11edbd8f0b2a035c989d/2/nb-2-big-,1670946301441,5][1283732776]^M
34422 [junit-timeout]
ADD:[/tmp/cassandra/build/test/cassandra/data/TransactionLogsTest/mockcf39-1add51e07afd11edbd8f0b2a035c989d/1/nb-1-big-,0,5][1197593494]^M
34423 [junit-timeout]
ADD:[/tmp/cassandra/build/test/cassandra/data/TransactionLogsTest/mockcf39-1add51e07afd11edbd8f0b2a035c989d/2/nb-3-big-,0,5][1374830355]^M
34424 [junit-timeout] COMMIT:[,0,0][2613697770]^M
34425 [junit-timeout] ***This record should have been the last one in
all replicas^M
34426 [junit-timeout] COMMIT:[,0,0][2613697770]^M
34427 [junit-timeout] ***This record should have been the last one in
all replicas^M
34428 [junit-timeout] ^M {code}
Not clear to me how the change in this patch could have contributed to that;
might be worth multiplexing this unit test on trunk w/out the change to see if
there's a pre-existing race / issue in there. The repeated leaks of the
TransactionTidier from LogTransaction throughout the repeated runs also look a
smidge suspicious:
{code:java}
34435 [junit-timeout] ERROR [Reference-Reaper] 2022-12-13 15:45:01,466
Ref.java:237 - LEAK DETECTED: a reference (class
org.apache.cassandra.db.lifecycle.LogTransaction$TransactionTidier@1776643200:[nb_txn_compaction_198e1590-7afd-11ed-bd8f-0b2a035c989d.log
in /tmp/cassandra/build/test/cassandra/data/TransactionLog {code}
> Handle sstable metadata stats file getting a new mtime after compaction has
> finished
> ------------------------------------------------------------------------------------
>
> Key: CASSANDRA-18119
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18119
> Project: Cassandra
> Issue Type: Bug
> Components: Local/Compaction, Local/Startup and Shutdown
> Reporter: Marcus Eriksson
> Assignee: Marcus Eriksson
> Priority: Normal
> Fix For: 3.11.x, 4.0.x, 4.1.x, 4.x
>
> Time Spent: 40m
> Remaining Estimate: 0h
>
> Due to a race between compaction finishing and compaction strategies getting
> reloaded there is a chance that we try to add both the new sstable and the
> old compacted sstable to the compaction strategy, and in the LCS case this
> can cause the old sstable to get sent to L0 to avoid overlap. This changes
> the mtime of the stats metadata file and if the node is shut down before the
> sstable is actually deleted from disk, we fail starting with the following
> exception:
> {code}
> .../mockcf1-392b3ff07c5a11ed8c662f5760cb10b3/nb_txn_compaction_3983c030-7c5a-11ed-8c66-2f5760cb10b3.log
> REMOVE:[.../data/TransactionLogsTest/mockcf1-392b3ff07c5a11ed8c662f5760cb10b3/nb-0-big-,1671096247000,5][4003386800]
> ***Unexpected files detected for sstable [nb-0-big-]: last update
> time [Thu Dec 15 10:24:09 CET 2022] (1671096249000) should have been [Thu Dec
> 15 10:24:07 CET 2022] (1671096247000)
> ADD:[.../data/TransactionLogsTest/mockcf1-392b3ff07c5a11ed8c662f5760cb10b3/nb-2-big-,0,5][319189529]
> {code}
> A workaround for this (until we properly fix the way compaction strategies
> get notified about sstable changes) is to ignore the timestamp of the STATS
> component when cleaning up compaction leftovers on startup.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]