[ 
https://issues.apache.org/jira/browse/CASSANDRA-18366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704821#comment-17704821
 ] 

Stefan Miklosovic commented on CASSANDRA-18366:
-----------------------------------------------

4.0 currently has this in DefaultFSErrorHandler:

{code}
    @Override
    public void handleCorruptSSTable(CorruptSSTableException e)
    {
        if (!StorageService.instance.isDaemonSetupCompleted())
            handleStartupFSError(e);

        switch (DatabaseDescriptor.getDiskFailurePolicy())
        {
            case die:
            case stop_paranoid:
                // exception not logged here on purpose as it is already logged
                logger.error("Stopping transports as disk_failure_policy is " + DatabaseDescriptor.getDiskFailurePolicy());
                StorageService.instance.stopTransports();
                break;
        }
    }
{code}

"case die:" was added in 18294.

Now, when I remove this "case die:", all tests pass. 

However, when I do this:

{code}
    @Override
    public void handleCorruptSSTable(CorruptSSTableException e)
    {
        if (!StorageService.instance.isDaemonSetupCompleted())
            handleStartupFSError(e);

        switch (DatabaseDescriptor.getDiskFailurePolicy())
        {
            case die:
                // exception not logged here on purpose as it is already logged
                logger.error("Stopping transports as disk_failure_policy is " + DatabaseDescriptor.getDiskFailurePolicy());
                StorageService.instance.stopTransports();
                break;
            case stop_paranoid:
                // exception not logged here on purpose as it is already logged
                logger.error("Stopping transports as disk_failure_policy is " + DatabaseDescriptor.getDiskFailurePolicy());
                StorageService.instance.stopTransports();
                break;
        }
    }
{code}

It fails again, obviously, since this is behaviorally the same as the original 
fall-through. Basically, when we hit "die" and stop transports, then for some 
as-yet-unknown reason the loop in FailingRepairTest that waits for a kill 
attempt on the replica spins forever:

{code}
        IInvokableInstance replicaInstance = CLUSTER.get(replica);
        while (replicaInstance.killAttempts() <= 0)
            Uninterruptibles.sleepUninterruptibly(50, TimeUnit.MILLISECONDS);
{code}
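
To make the hang show up as a quick, explicit failure rather than a test timeout, the wait could be bounded. The following is only a minimal sketch under assumptions, not what FailingRepairTest does today: BoundedWait and awaitKillAttempt are hypothetical names, and the IntSupplier stands in for the instance's kill-attempt counter.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.IntSupplier;

public class BoundedWait
{
    /**
     * Hypothetical helper: polls the supplied counter every 50 ms until it
     * reports a positive value or the timeout elapses.
     *
     * @return true if a kill attempt was observed, false on timeout or interrupt
     */
    public static boolean awaitKillAttempt(IntSupplier killAttempts, long timeout, TimeUnit unit)
    {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (killAttempts.getAsInt() <= 0)
        {
            // Give up once the deadline has passed instead of looping forever
            if (System.nanoTime() - deadline >= 0)
                return false;
            try
            {
                TimeUnit.MILLISECONDS.sleep(50);
            }
            catch (InterruptedException e)
            {
                // Preserve the interrupt status and stop waiting
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args)
    {
        // Stand-in for an instance that never reports a kill attempt
        boolean seen = awaitKillAttempt(() -> 0, 200, TimeUnit.MILLISECONDS);
        System.out.println("kill attempt observed: " + seen);
    }
}
```

In the test, the supplier would presumably be replicaInstance::killAttempts, with an assertion on the returned value. That would not explain why no kill attempt is registered, it would only turn the endless loop into a clear failure.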

> Test failure: org.apache.cassandra.distributed.test.FailingRepairTest - 
> testFailingMessage[VALIDATION_REQ/parallel/true]
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-18366
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18366
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/dtest/java
>            Reporter: Brandon Williams
>            Priority: Normal
>             Fix For: 4.0.x
>
>
> First seen 
> [here|https://app.circleci.com/pipelines/github/driftx/cassandra/928/workflows/f4e93a72-d4aa-47a2-996f-aa3fb018d848/jobs/16206],
> this test times out for me consistently on both j8 and j11, whereas on 4.1 
> and trunk it does not.


