[jira] [Comment Edited] (CASSANDRA-11479) BatchlogManager unit tests failing on truncate race condition

Yuki Morishita (JIRA) Fri, 01 Apr 2016 15:24:34 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-11479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222438#comment-15222438
 ]


Yuki Morishita edited comment on CASSANDRA-11479 at 4/1/16 10:23 PM:
---------------------------------------------------------------------

Code-wise, It looks like compaction still can be running during 
{{runWithCompactionDisabled}}.

>From the log attached (Thank you!):

{noformat}
TRACE [MemtableFlushWriter:1] 2016-04-01 15:02:32,335 Scheduling a background 
task check for system.batches with SizeTieredCompactionStrategy
{noformat}

Here, compaction was submitted for flushed SSTable.

{noformat}
TRACE [main] 2016-04-01 15:02:32,337 Cancelling in-progress compactions for 
batches
TRACE [CompactionExecutor:1] 2016-04-01 15:02:32,337 Compaction buckets are []
TRACE [main] 2016-04-01 15:02:32,337 Compactions successfully cancelled
{noformat}

and cancelling compaction happened, but since compaction strategy (STCS) failed 
to grab compacting SSTables, compaction task was not yet created.

{code:java}
// interrupt in-progress compactions
CompactionManager.instance.interruptCompactionForCFs(selfWithAuxiliaryCfs, 
interruptValidation);
CompactionManager.instance.waitForCessation(selfWithAuxiliaryCfs);
{code}

did not happen since there was no {{CompactionTask}} and no SSTables were in 
compaction yet.

So,

{noformat}
TRACE [main] 2016-04-01 15:02:32,337 Discarding sstable data for truncated CF + 
indexes
{noformat}

truncate proceeded, but

{noformat}
TRACE [CompactionExecutor:1] 2016-04-01 15:02:32,337 Compaction buckets are 
[[BigTableReader(path='build/test/cassandra/data:0/system/batches-919a4bc57a333573b03e13fc3f68b465/ma-5-big-Data.db'),
 
BigTableReader(path='build/test/cassandra/data:0/system/batches-919a4bc57a333573b03e13fc3f68b465/ma-4-big-Data.db')]]
{noformat}

in CompactionExecutor, STCS was still trying to grab compaction candidate and 
it succeeded, putting those two into {{compacting}} SSTable.

EDIT: The first "Compaction buckets are: []" can be for repaired SSTables, and 
second one is for unrepaired.

Thus, inside truncate logic,

{code:java}
public ReplayPosition discardSSTables(long truncatedAt)
{
    assert data.getCompacting().isEmpty() : data.getCompacting();
    ....
}
{code}
throws AssertionError claiming some SSTables are still compacting.

{noformat}
INFO  [CompactionExecutor:1] 2016-04-01 15:02:32,345 Compaction interrupted: 
Compaction@919a4bc5-7a33-3573-b03e-13fc3f68b465(system, batches, 0/240554)bytes
{noformat}

Eventually, the submitted {{CompactionTask}} was aborted since 
{{runWithCompactionDisabled}} marks compaction strategy to {{pause()}} at first.


was (Author: yukim):
Code-wise, It looks like compaction still can be running during 
{{runWithCompactionDisabled}}.

>From the log attached (Thank you!):

{noformat}
TRACE [MemtableFlushWriter:1] 2016-04-01 15:02:32,335 Scheduling a background 
task check for system.batches with SizeTieredCompactionStrategy
{noformat}

Here, compaction was submitted for flushed SSTable.

{noformat}
TRACE [main] 2016-04-01 15:02:32,337 Cancelling in-progress compactions for 
batches
TRACE [CompactionExecutor:1] 2016-04-01 15:02:32,337 Compaction buckets are []
TRACE [main] 2016-04-01 15:02:32,337 Compactions successfully cancelled
{noformat}

and cancelling compaction happened, but since compaction strategy (STCS) failed 
to grab compacting SSTables, compaction task was not yet created.

{code:java}
// interrupt in-progress compactions
CompactionManager.instance.interruptCompactionForCFs(selfWithAuxiliaryCfs, 
interruptValidation);
CompactionManager.instance.waitForCessation(selfWithAuxiliaryCfs);
{code}

did not happen since there was no {{CompactionTask}} and no SSTables were in 
compaction yet.

So,

{noformat}
TRACE [main] 2016-04-01 15:02:32,337 Discarding sstable data for truncated CF + 
indexes
{noformat}

truncate proceeded, but

{noformat}
TRACE [CompactionExecutor:1] 2016-04-01 15:02:32,337 Compaction buckets are 
[[BigTableReader(path='build/test/cassandra/data:0/system/batches-919a4bc57a333573b03e13fc3f68b465/ma-5-big-Data.db'),
 
BigTableReader(path='build/test/cassandra/data:0/system/batches-919a4bc57a333573b03e13fc3f68b465/ma-4-big-Data.db')]]
{noformat}

in CompactionExecutor, [STCS was still trying to grab compaction candidate 
eagerly|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/compaction/SizeTieredCompactionStrategy.java#L180]
 and it succeeded, putting those two into {{compacting}} SSTable.
Thus, inside truncate logic,

{code:java}
public ReplayPosition discardSSTables(long truncatedAt)
{
    assert data.getCompacting().isEmpty() : data.getCompacting();
    ....
}
{code}
throws AssertionError claiming some SSTables are still compacting.

{noformat}
INFO  [CompactionExecutor:1] 2016-04-01 15:02:32,345 Compaction interrupted: 
Compaction@919a4bc5-7a33-3573-b03e-13fc3f68b465(system, batches, 0/240554)bytes
{noformat}

Eventually, the submitted {{CompactionTask}} was aborted since 
{{runWithCompactionDisabled}} marks compaction strategy to {{pause()}} at first.

> BatchlogManager unit tests failing on truncate race condition
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-11479
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11479
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Joel Knighton
>         Attachments: 
> TEST-org.apache.cassandra.batchlog.BatchlogManagerTest.log
>
>
> Example on CI 
> [here|http://cassci.datastax.com/job/trunk_testall/818/testReport/junit/org.apache.cassandra.batchlog/BatchlogManagerTest/testLegacyReplay_compression/].
>  This seems to have only started happening relatively recently (within the 
> last month or two).
> As far as I can tell, this is only showing up on BatchlogManagerTests purely 
> because it is an aggressive user of truncate. The assertion is hit in the 
> setUp method, so it can happen before any of the test methods. The assertion 
> occurs because a compaction is happening when truncate wants to discard 
> SSTables; trace level logs suggest that this compaction is submitted after 
> the pause on the CompactionStrategyManager.
> This should be reproducible by running BatchlogManagerTest in a loop - it 
> takes up to half an hour in my experience. A trace-level log from such a run 
> is attached - grep for my added log message "SSTABLES COMPACTING WHEN 
> DISCARDING" to find when the assert is hit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (CASSANDRA-11479) BatchlogManager unit tests failing on truncate race condition

Reply via email to