[
https://issues.apache.org/jira/browse/CASSANDRA-18366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719574#comment-17719574
]
David Capwell commented on CASSANDRA-18366:
-------------------------------------------
Sorry, I just improved my filters and saw I was pinged....
Double checking: is this ticket about junit timing out the test, or about the
test being flaky in general?
https://issues.apache.org/jira/browse/CASSANDRA-18366?focusedCommentId=17704821&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17704821
[~smiklosovic] I have not looked much at CASSANDRA-18294, but "die" should kill
the JVM; stopping messaging is the wrong behavior... I thought we even test
that this happens in jvm-dtest (we mock out how instances die and make sure
that they do try...).
bq. It fails again, obviously. Basically, when we hit "die" and we stop
transports, for some as-yet-unknown reason, the code in FailingRepairTest which
waits for this loops forever:
git says I replaced this behavior in CASSANDRA-17116, and that I now use
isShutdown rather than replicaInstance.killAttempts... I don't know why I made
this change; it actually feels wrong, as the test causes an sstable to be
corrupted and the policy is "die"... so we just stopped validating that the die
policy was properly handled? If I look at
org.apache.cassandra.service.DefaultFSErrorHandler#handleCorruptSSTable, this
isn't where we handle die, so I'm not sure why CASSANDRA-18294 changed that...
we handle die in
org.apache.cassandra.utils.JVMStabilityInspector#inspectThrowable(java.lang.Throwable,
java.util.function.Consumer<java.lang.Throwable>)
{code}
if (DatabaseDescriptor.getDiskFailurePolicy() == Config.DiskFailurePolicy.die)
    if (t instanceof FSError || t instanceof CorruptSSTableException)
        isUnstable = true;
// Check for file handle exhaustion
if (t instanceof FileNotFoundException || t instanceof FileSystemException || t instanceof SocketException)
    if (t.getMessage() != null && t.getMessage().contains("Too many open files"))
        isUnstable = true;
if (isUnstable)
{
    if (!StorageService.instance.isDaemonSetupCompleted())
        FileUtils.handleStartupFSError(t);
    killer.killCurrentJVM(t);
}
{code}
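To make the point about validating "die" concrete, here is a minimal standalone sketch of that check with the kill action injected, so a test can count kill attempts instead of watching for a shutdown. Everything in it is a stand-in for illustration: DiePolicySketch, its nested DiskFailurePolicy enum and CorruptSSTableException, and the Consumer-based killer are assumptions, not the real Cassandra classes or the jvm-dtest mocking API.

```java
import java.io.FileNotFoundException;
import java.net.SocketException;
import java.nio.file.FileSystemException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

public class DiePolicySketch
{
    // Hypothetical stand-in for Config.DiskFailurePolicy (only the values needed here).
    public enum DiskFailurePolicy { die, stop, ignore }

    // Hypothetical stand-in for Cassandra's CorruptSSTableException.
    public static class CorruptSSTableException extends RuntimeException
    {
        public CorruptSSTableException(String message) { super(message); }
    }

    // Same shape as the quoted check: decide whether the throwable makes the node
    // unstable, and if so hand it to the injected killer. Returns the decision.
    public static boolean inspect(Throwable t, DiskFailurePolicy policy, Consumer<Throwable> killer)
    {
        boolean isUnstable = false;
        if (policy == DiskFailurePolicy.die && t instanceof CorruptSSTableException)
            isUnstable = true;
        // File handle exhaustion is treated as unstable regardless of the policy.
        if (t instanceof FileNotFoundException || t instanceof FileSystemException || t instanceof SocketException)
            if (t.getMessage() != null && t.getMessage().contains("Too many open files"))
                isUnstable = true;
        if (isUnstable)
            killer.accept(t);
        return isUnstable;
    }

    public static void main(String[] args)
    {
        // The mock killer only records that a kill was attempted.
        AtomicInteger killAttempts = new AtomicInteger();
        Consumer<Throwable> mockKiller = t -> killAttempts.incrementAndGet();

        inspect(new CorruptSSTableException("corrupt"), DiskFailurePolicy.die, mockKiller);
        inspect(new CorruptSSTableException("corrupt"), DiskFailurePolicy.ignore, mockKiller);
        System.out.println("kill attempts: " + killAttempts.get());
    }
}
```

With a mock like this, a test can assert that a corrupted sstable under "die" produced a kill attempt, which is a stronger check than only observing that the instance shut down.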
Now, if the issue is not that the tests are failing but just that we keep
timing out... I sadly feel the best solution is to do what we always do
(sigh)... split the test across different class files....
{code}
@Parameters(name = "{0}/{1}/{2}")
public static Collection<Object[]> messages()
{
    List<Object[]> tests = new ArrayList<>();
    for (RepairParallelism parallelism : RepairParallelism.values())
    {
        for (Boolean withTracing : Arrays.asList(Boolean.TRUE, Boolean.FALSE))
        {
            tests.add(new Object[]{ Verb.VALIDATION_REQ, parallelism, withTracing, failingReaders(Verb.VALIDATION_REQ, parallelism, withTracing) });
        }
    }
    return tests;
}
{code}
This is 3 * 2 = 6 tests... we probably want to split it at the RepairParallelism
level so we have 3 top-level tests that each do 2 cases (with and without
tracing)
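
A rough sketch of how that split could be partitioned, in plain Java so it runs on its own. The names are hypothetical: SplitByParallelismSketch and casesFor are made up, the RepairParallelism enum here is a local stand-in with assumed values, and the Verb and failingReaders(...) arguments are left out (a string stands in for the verb).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitByParallelismSketch
{
    // Local stand-in for RepairParallelism; the values are assumed for illustration.
    public enum RepairParallelism { SEQUENTIAL, PARALLEL, DATACENTER_AWARE }

    // What a shared base class could expose: the cases for ONE parallelism,
    // i.e. with tracing and without tracing.
    public static List<Object[]> casesFor(RepairParallelism parallelism)
    {
        List<Object[]> tests = new ArrayList<>();
        for (Boolean withTracing : Arrays.asList(Boolean.TRUE, Boolean.FALSE))
            tests.add(new Object[]{ "VALIDATION_REQ", parallelism, withTracing });
        return tests;
    }

    public static void main(String[] args)
    {
        // Each parallelism would back one test class; together they cover the same matrix.
        int total = 0;
        for (RepairParallelism p : RepairParallelism.values())
        {
            List<Object[]> cases = casesFor(p);
            total += cases.size();
            System.out.println(p + ": " + cases.size() + " cases");
        }
        System.out.println("total: " + total);
    }
}
```

Each of the 3 test classes would then have its own @Parameters method returning casesFor(...) for a single parallelism, so each class gets its own junit timeout budget.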
> Test failure: org.apache.cassandra.distributed.test.FailingRepairTest -
> testFailingMessage[VALIDATION_REQ/parallel/true]
> ------------------------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-18366
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18366
> Project: Cassandra
> Issue Type: Bug
> Components: Test/dtest/java
> Reporter: Brandon Williams
> Priority: Normal
> Fix For: 4.0.x
>
>
> First seen
> [here|https://app.circleci.com/pipelines/github/driftx/cassandra/928/workflows/f4e93a72-d4aa-47a2-996f-aa3fb018d848/jobs/16206]
> this test times out for me consistently on both j8 and j11 where 4.1 and
> trunk do not.