
David Capwell (Jira) Thu, 04 May 2023 17:31:07 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-18366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719574#comment-17719574
 ]


David Capwell commented on CASSANDRA-18366:
-------------------------------------------

sorry, just improved my filters and saw I was pinged....

Double checking: is this ticket about JUnit timing out the test, or about the 
test being flaky in general?

https://issues.apache.org/jira/browse/CASSANDRA-18366?focusedCommentId=17704821&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17704821

[~smiklosovic] I have not looked much at CASSANDRA-18294, but "die" should kill 
the JVM; stopping messaging is the wrong behavior. I thought we even test 
that this happens in jvm-dtest (we mock out how instances die and make sure 
that they do try).

bq. It fails again, obviously. Basically, when we hit "die" and we stop 
transports, for some unknow-yet reason, the code in FailingRepairTest which 
waits for this loops forever:

git says I replaced this behavior in CASSANDRA-17116, and that I now use 
isShutdown instead of replicaInstance.killAttempts. I don't know why I made 
this change. It actually feels wrong: the test corrupts an sstable and the 
policy is "die", so did we just stop validating that the die policy is 
properly handled? Looking at 
org.apache.cassandra.service.DefaultFSErrorHandler#handleCorruptSSTable, this 
isn't where we handle die, so I'm not sure why CASSANDRA-18294 changed that. 
We handle die in 
org.apache.cassandra.utils.JVMStabilityInspector#inspectThrowable(java.lang.Throwable, java.util.function.Consumer<java.lang.Throwable>):

{code}
        if (DatabaseDescriptor.getDiskFailurePolicy() == Config.DiskFailurePolicy.die)
            if (t instanceof FSError || t instanceof CorruptSSTableException)
                isUnstable = true;

        // Check for file handle exhaustion
        if (t instanceof FileNotFoundException || t instanceof FileSystemException || t instanceof SocketException)
            if (t.getMessage() != null && t.getMessage().contains("Too many open files"))
                isUnstable = true;

        if (isUnstable)
        {
            if (!StorageService.instance.isDaemonSetupCompleted())
                FileUtils.handleStartupFSError(t);
            killer.killCurrentJVM(t);
        }
{code}
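
As a rough illustration of the check I have in mind (die policy plus a corrupt sstable should end in a kill attempt), here is a self-contained sketch. DiePolicySketch, Policy, CorruptionError and CountingKiller are hypothetical stand-ins made up for this example; they are not Cassandra classes, and the real logic lives in JVMStabilityInspector and the jvm-dtest mocks.

```java
import java.io.FileNotFoundException;
import java.net.SocketException;
import java.nio.file.FileSystemException;
import java.util.concurrent.atomic.AtomicInteger;

public class DiePolicySketch
{
    // Hypothetical stand-in for Config.DiskFailurePolicy
    enum Policy { die, stop, ignore }

    // Hypothetical stand-in for CorruptSSTableException
    static class CorruptionError extends RuntimeException
    {
        CorruptionError(String msg) { super(msg); }
    }

    // Hypothetical mock killer: records kill attempts instead of halting the JVM
    static class CountingKiller
    {
        final AtomicInteger killAttempts = new AtomicInteger();

        void killCurrentJVM(Throwable t) { killAttempts.incrementAndGet(); }
    }

    // Mirrors the shape of the snippet above, minus the startup handling
    static void inspect(Throwable t, Policy policy, CountingKiller killer)
    {
        boolean isUnstable = false;

        if (policy == Policy.die && t instanceof CorruptionError)
            isUnstable = true;

        // Check for file handle exhaustion
        if (t instanceof FileNotFoundException || t instanceof FileSystemException || t instanceof SocketException)
            if (t.getMessage() != null && t.getMessage().contains("Too many open files"))
                isUnstable = true;

        if (isUnstable)
            killer.killCurrentJVM(t);
    }

    public static void main(String[] args)
    {
        CountingKiller killer = new CountingKiller();

        // Same error, non-die policy: no kill attempt is recorded
        inspect(new CorruptionError("corrupt sstable"), Policy.ignore, killer);
        System.out.println("attempts after ignore: " + killer.killAttempts.get());

        // Die policy: the mock killer sees one attempt
        inspect(new CorruptionError("corrupt sstable"), Policy.die, killer);
        System.out.println("attempts after die: " + killer.killAttempts.get());
    }
}
```

The point of asserting on a kill-attempt counter, rather than on a shutdown flag, is that it checks the die policy was actually exercised.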


Now, if the issue is not that the tests are failing but just that we keep 
timing out... I sadly feel the best solution is to do what we always do 
(sigh): split the test across different class files.

{code}
    @Parameters(name = "{0}/{1}/{2}")
    public static Collection<Object[]> messages()
    {
        List<Object[]> tests = new ArrayList<>();
        for (RepairParallelism parallelism : RepairParallelism.values())
        {
            for (Boolean withTracing : Arrays.asList(Boolean.TRUE, Boolean.FALSE))
            {
                tests.add(new Object[]{ Verb.VALIDATION_REQ, parallelism, withTracing, failingReaders(Verb.VALIDATION_REQ, parallelism, withTracing) });
            }
        }
        return tests;
    }
{code}

This is 3 * 2 = 6 tests. We probably want to split it at the RepairParallelism 
level, so we have 3 top-level tests that each run 2 cases (with and without 
tracing).
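
A minimal sketch of what that split could look like, assuming each new test class passes a single parallelism value to a shared helper from its own @Parameters method. SplitByParallelismSketch, Parallelism and messagesFor are hypothetical names for illustration only; the enum is a stand-in for RepairParallelism, and failingReaders is left out to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

public class SplitByParallelismSketch
{
    // Hypothetical stand-in for RepairParallelism (3 values)
    enum Parallelism { SEQUENTIAL, PARALLEL, DATACENTER_AWARE }

    // Shared helper: one parallelism in, the with/without tracing cases out.
    // Each split test class would return messagesFor(<its value>) from @Parameters.
    static Collection<Object[]> messagesFor(Parallelism parallelism)
    {
        List<Object[]> tests = new ArrayList<>();
        for (Boolean withTracing : Arrays.asList(Boolean.TRUE, Boolean.FALSE))
            tests.add(new Object[]{ "VALIDATION_REQ", parallelism, withTracing });
        return tests;
    }

    public static void main(String[] args)
    {
        int total = 0;
        for (Parallelism p : Parallelism.values())
        {
            Collection<Object[]> cases = messagesFor(p);
            total += cases.size();
            System.out.println(p + ": " + cases.size() + " cases");
        }
        System.out.println("total: " + total);
    }
}
```

Each class then carries its own JUnit timeout budget for 2 cases instead of sharing one budget across all 6.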



> Test failure: org.apache.cassandra.distributed.test.FailingRepairTest - 
> testFailingMessage[VALIDATION_REQ/parallel/true]
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-18366
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18366
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/dtest/java
>            Reporter: Brandon Williams
>            Priority: Normal
>             Fix For: 4.0.x
>
>
> First seen 
> [here|https://app.circleci.com/pipelines/github/driftx/cassandra/928/workflows/f4e93a72-d4aa-47a2-996f-aa3fb018d848/jobs/16206]
>  this test times out for me consistently on both j8 and j11 where 4.1 and 
> trunk do not.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (CASSANDRA-18366) Test failure: org.apache.cassandra.distributed.test.FailingRepairTest - testFailingMessage[VALIDATION_REQ/parallel/true]
