[ https://issues.apache.org/jira/browse/CASSANDRA-18366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704821#comment-17704821 ]

Stefan Miklosovic edited comment on CASSANDRA-18366 at 3/25/23 7:48 AM:
------------------------------------------------------------------------

4.0 currently has this in DefaultFSErrorHandler

{code}
    @Override
    public void handleCorruptSSTable(CorruptSSTableException e)
    {
        if (!StorageService.instance.isDaemonSetupCompleted())
            handleStartupFSError(e);

        switch (DatabaseDescriptor.getDiskFailurePolicy())
        {
            case die:
            case stop_paranoid:
                // exception not logged here on purpose as it is already logged
                logger.error("Stopping transports as disk_failure_policy is " + DatabaseDescriptor.getDiskFailurePolicy());
                StorageService.instance.stopTransports();
                break;
        }
    }
{code}

"case die:" was added in CASSANDRA-18294.

Now, when I remove this "case die:", all tests pass. 

However, when I do this:

{code}
    @Override
    public void handleCorruptSSTable(CorruptSSTableException e)
    {
        if (!StorageService.instance.isDaemonSetupCompleted())
            handleStartupFSError(e);

        switch (DatabaseDescriptor.getDiskFailurePolicy())
        {
            case die:
                // exception not logged here on purpose as it is already logged
                logger.error("Stopping transports as disk_failure_policy is " + DatabaseDescriptor.getDiskFailurePolicy());
                StorageService.instance.stopTransports();
                break;
            case stop_paranoid:
                // exception not logged here on purpose as it is already logged
                logger.error("Stopping transports as disk_failure_policy is " + DatabaseDescriptor.getDiskFailurePolicy());
                StorageService.instance.stopTransports();
                break;
        }
    }
{code}
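
(For clarity: the two variants above are behaviorally identical, because adjacent `case` labels with no statements between them fall through to the same body. A minimal standalone sketch of that, with a made-up `Policy` enum standing in for `DiskFailurePolicy`:)

```java
// Standalone fall-through demo; Policy is a stand-in for DiskFailurePolicy.
public class FallThroughDemo
{
    public enum Policy { die, stop_paranoid, ignore }

    // Shared-label form, as in DefaultFSErrorHandler: die falls through to stop_paranoid's body.
    public static boolean stopsTransportsShared(Policy p)
    {
        switch (p)
        {
            case die:
            case stop_paranoid:
                return true; // stopTransports() would run here for both labels
            default:
                return false;
        }
    }

    // Duplicated-body form: each label carries its own identical body.
    public static boolean stopsTransportsDuplicated(Policy p)
    {
        switch (p)
        {
            case die:
                return true;
            case stop_paranoid:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args)
    {
        // Both forms agree for every policy value.
        for (Policy p : Policy.values())
            System.out.println(p + ": " + (stopsTransportsShared(p) == stopsTransportsDuplicated(p)));
    }
}
```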

It fails again, obviously, since the behavior is identical. Basically, when we hit "die" and we stop the transports, for a yet-unknown reason the code in FailingRepairTest which waits for this loops forever:

{code}
        IInvokableInstance replicaInstance = CLUSTER.get(replica);
        while (replicaInstance.killAttempts() <= 0)
            Uninterruptibles.sleepUninterruptibly(50, TimeUnit.MILLISECONDS);
{code}
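
As an aside, a wait like that could be bounded so a broken kill path fails the test instead of hanging the suite. A sketch in plain Java (the `IntSupplier` stands in for `replicaInstance.killAttempts()`; the names here are illustrative, not the test's actual API):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.IntSupplier;

public class BoundedWait
{
    /**
     * Polls until killAttempts() > 0 or the deadline passes,
     * throwing TimeoutException instead of looping forever.
     */
    public static void awaitKill(IntSupplier killAttempts, long timeout, TimeUnit unit)
            throws TimeoutException, InterruptedException
    {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (killAttempts.getAsInt() <= 0)
        {
            if (System.nanoTime() >= deadline)
                throw new TimeoutException("instance was never killed within " + timeout + " " + unit);
            TimeUnit.MILLISECONDS.sleep(50);
        }
    }

    public static void main(String[] args) throws Exception
    {
        // Simulate a kill counter that flips to 1 after a few polls.
        int[] polls = { 0 };
        awaitKill(() -> ++polls[0] > 3 ? 1 : 0, 5, TimeUnit.SECONDS);
        System.out.println("killed after " + polls[0] + " polls");
    }
}
```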

EDIT:

The next test fails because we stop the transports in the very first one when we hit "die". The cluster is created statically, so while the first test passes fine, it stops the transports, and then the second test prints:

{code}
ERROR 21:55:19 Repair 916278f0-ca8e-11ed-8125-8104d0e5c44e failed:
java.lang.RuntimeException: Endpoint not alive: /127.0.0.1:7012
        at org.apache.cassandra.service.ActiveRepairService.failRepair(ActiveRepairService.java:665)
        at org.apache.cassandra.service.ActiveRepairService.prepareForRepair(ActiveRepairService.java:597)
        at org.apache.cassandra.repair.RepairRunnable.prepare(RepairRunnable.java:393)
        at org.apache.cassandra.repair.RepairRunnable.runMayThrow(RepairRunnable.java:269)
        at org.apache.cassandra.repair.RepairRunnable.run(RepairRunnable.java:241)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:750)
ERROR [node2_Repair-Task:1] node2 2023-03-24 22:55:19,186 RepairRunnable.java:178 - Repair 916278f0-ca8e-11ed-8125-8104d0e5c44e failed:
java.lang.RuntimeException: Endpoint not alive: /127.0.0.1:7012
        at org.apache.cassandra.service.ActiveRepairService.failRepair(ActiveRepairService.java:665)
        at org.apache.cassandra.service.ActiveRepairService.prepareForRepair(ActiveRepairService.java:597)
        at org.apache.cassandra.repair.RepairRunnable.prepare(RepairRunnable.java:393)
        at org.apache.cassandra.repair.RepairRunnable.runMayThrow(RepairRunnable.java:269)
        at org.apache.cassandra.repair.RepairRunnable.run(RepairRunnable.java:241)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:750)
{code}

It seems the transport is not available on the "previously killed" node, so repair fails to send repair messages; hence nothing is killed and the wait keeps looping.

In 4.1 this test comes with some non-trivial changes which patch other internal stuff (Instance, AbstractCluster, a bunch of internal services) just to accommodate this test.

I tried to port them to 4.0 but no luck yet. I think we might also rewrite this test so the cluster is created per test method instead of being shared across the parameterized runs.



> Test failure: org.apache.cassandra.distributed.test.FailingRepairTest - 
> testFailingMessage[VALIDATION_REQ/parallel/true]
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-18366
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18366
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/dtest/java
>            Reporter: Brandon Williams
>            Priority: Normal
>             Fix For: 4.0.x
>
>
> First seen 
> [here|https://app.circleci.com/pipelines/github/driftx/cassandra/928/workflows/f4e93a72-d4aa-47a2-996f-aa3fb018d848/jobs/16206]
>  this test times out for me consistently on both j8 and j11 where 4.1 and 
> trunk do not.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
