[
https://issues.apache.org/jira/browse/CASSANDRA-18366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704821#comment-17704821
]
Stefan Miklosovic edited comment on CASSANDRA-18366 at 3/25/23 7:47 AM:
------------------------------------------------------------------------
4.0 currently has this in DefaultFSErrorHandler
{code}
@Override
public void handleCorruptSSTable(CorruptSSTableException e)
{
    if (!StorageService.instance.isDaemonSetupCompleted())
        handleStartupFSError(e);

    switch (DatabaseDescriptor.getDiskFailurePolicy())
    {
        case die:
        case stop_paranoid:
            // exception not logged here on purpose as it is already logged
            logger.error("Stopping transports as disk_failure_policy is " + DatabaseDescriptor.getDiskFailurePolicy());
            StorageService.instance.stopTransports();
            break;
    }
}
{code}
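As a side note on the semantics: the two labels above share one branch body via switch fall-through, so splitting them into two separate but identical branches is behaviorally a no-op. A minimal standalone sketch of that equivalence (the enum and method below are illustrative stand-ins, not Cassandra's actual classes):

```java
public class FallThroughDemo
{
    // Illustrative stand-in for Cassandra's disk_failure_policy values
    enum DiskFailurePolicy { ignore, best_effort, stop, stop_paranoid, die }

    // Fall-through: die and stop_paranoid execute the same branch body
    static boolean stopsTransportsOnCorruptSSTable(DiskFailurePolicy policy)
    {
        switch (policy)
        {
            case die:
            case stop_paranoid:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args)
    {
        System.out.println(stopsTransportsOnCorruptSSTable(DiskFailurePolicy.die));           // true
        System.out.println(stopsTransportsOnCorruptSSTable(DiskFailurePolicy.stop_paranoid)); // true
        System.out.println(stopsTransportsOnCorruptSSTable(DiskFailurePolicy.ignore));        // false
    }
}
```

So the failure described below is not caused by the label split itself; both shapes of the switch do the same thing for both policies.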
"case die:" was added in CASSANDRA-18294.
When I remove "case die:", all tests pass.
However, when I split the fall-through into two identical branches:
{code}
@Override
public void handleCorruptSSTable(CorruptSSTableException e)
{
    if (!StorageService.instance.isDaemonSetupCompleted())
        handleStartupFSError(e);

    switch (DatabaseDescriptor.getDiskFailurePolicy())
    {
        case die:
            // exception not logged here on purpose as it is already logged
            logger.error("Stopping transports as disk_failure_policy is " + DatabaseDescriptor.getDiskFailurePolicy());
            StorageService.instance.stopTransports();
            break;
        case stop_paranoid:
            // exception not logged here on purpose as it is already logged
            logger.error("Stopping transports as disk_failure_policy is " + DatabaseDescriptor.getDiskFailurePolicy());
            StorageService.instance.stopTransports();
            break;
    }
}
{code}
It fails again, which is expected since both shapes of the switch are behaviorally identical. Basically, when we hit "die" and stop the transports, for some yet-unknown reason this wait in FailingRepairTest loops forever:
{code}
IInvokableInstance replicaInstance = CLUSTER.get(replica);

while (replicaInstance.killAttempts() <= 0)
    Uninterruptibles.sleepUninterruptibly(50, TimeUnit.MILLISECONDS);
{code}
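Incidentally, that wait has no upper bound, which is why a missed kill hangs the whole run. A bounded wait would surface the problem as a fast test failure instead of a timeout of the suite; a minimal sketch (the helper below is invented for illustration, it is not the dtest API):

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

public class BoundedWait
{
    // Poll 'condition' every pollMillis until it is true or deadlineMillis elapses.
    // Returns true if the condition became true in time, false on timeout.
    static boolean awaitOrTimeout(BooleanSupplier condition, long deadlineMillis, long pollMillis)
            throws InterruptedException
    {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(deadlineMillis);
        while (!condition.getAsBoolean())
        {
            if (System.nanoTime() >= deadline)
                return false;
            Thread.sleep(pollMillis);
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException
    {
        // A condition that never becomes true now times out instead of hanging forever
        System.out.println(awaitOrTimeout(() -> false, 200, 50)); // false
        System.out.println(awaitOrTimeout(() -> true, 200, 50));  // true
    }
}
```

The test's loop would then be `awaitOrTimeout(() -> replicaInstance.killAttempts() > 0, timeoutMillis, 50)` followed by an assertion on the result.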
EDIT:
The next test fails because we stopped the transports in the very first test when it hit "die". The cluster is created statically, so the first test passes fine but leaves the transports stopped, and the second test then prints:
{code}
ERROR 21:55:19 Repair 916278f0-ca8e-11ed-8125-8104d0e5c44e failed:
java.lang.RuntimeException: Endpoint not alive: /127.0.0.1:7012
    at org.apache.cassandra.service.ActiveRepairService.failRepair(ActiveRepairService.java:665)
    at org.apache.cassandra.service.ActiveRepairService.prepareForRepair(ActiveRepairService.java:597)
    at org.apache.cassandra.repair.RepairRunnable.prepare(RepairRunnable.java:393)
    at org.apache.cassandra.repair.RepairRunnable.runMayThrow(RepairRunnable.java:269)
    at org.apache.cassandra.repair.RepairRunnable.run(RepairRunnable.java:241)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:750)
ERROR [node2_Repair-Task:1] node2 2023-03-24 22:55:19,186 RepairRunnable.java:178 - Repair 916278f0-ca8e-11ed-8125-8104d0e5c44e failed:
java.lang.RuntimeException: Endpoint not alive: /127.0.0.1:7012
    at org.apache.cassandra.service.ActiveRepairService.failRepair(ActiveRepairService.java:665)
    at org.apache.cassandra.service.ActiveRepairService.prepareForRepair(ActiveRepairService.java:597)
    at org.apache.cassandra.repair.RepairRunnable.prepare(RepairRunnable.java:393)
    at org.apache.cassandra.repair.RepairRunnable.runMayThrow(RepairRunnable.java:269)
    at org.apache.cassandra.repair.RepairRunnable.run(RepairRunnable.java:241)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:750)
{code}
It seems the transports are not available on the previously killed node, so repair fails to send its messages, hence nothing gets killed and the wait above loops forever.
In 4.1 this test depends on non-trivial changes to internal classes (Instance, AbstractCluster and a number of internal services) made just to accommodate it.
I tried to port those changes to 4.0 but no luck yet. We might also rewrite this
test so the cluster is created per test method instead of being shared across the parameterized runs.
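The point of that suggestion can be shown with a toy model (the FakeCluster type below is invented for illustration; it is not the dtest API): a statically shared cluster carries the stopped-transports state from the first parameterized run into the next one, while a per-method cluster starts each run clean.

```java
public class ClusterLifecycleDemo
{
    // Invented stand-in for an in-JVM dtest cluster; tracks only transport state
    static class FakeCluster
    {
        boolean transportsRunning = true;
        void stopTransports() { transportsRunning = false; }
    }

    // Simulates one parameterized test run: hitting "die" stops the transports
    static void runTest(FakeCluster cluster, boolean hitsDie)
    {
        if (hitsDie)
            cluster.stopTransports();
    }

    public static void main(String[] args)
    {
        // Shared static cluster: the second run inherits stopped transports
        FakeCluster shared = new FakeCluster();
        runTest(shared, true);   // first parameterized run hits "die"
        boolean sharedStillUp = shared.transportsRunning;

        // Per-method cluster: the next run gets a fresh instance with transports up
        FakeCluster fresh = new FakeCluster();
        boolean freshStillUp = fresh.transportsRunning;

        System.out.println(sharedStillUp + " " + freshStillUp); // false true
    }
}
```

The trade-off is runtime: creating a cluster per test method is much slower than a static one, which is presumably why the test was parameterized over a shared cluster in the first place.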
> Test failure: org.apache.cassandra.distributed.test.FailingRepairTest -
> testFailingMessage[VALIDATION_REQ/parallel/true]
> ------------------------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-18366
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18366
> Project: Cassandra
> Issue Type: Bug
> Components: Test/dtest/java
> Reporter: Brandon Williams
> Priority: Normal
> Fix For: 4.0.x
>
>
> First seen
> [here|https://app.circleci.com/pipelines/github/driftx/cassandra/928/workflows/f4e93a72-d4aa-47a2-996f-aa3fb018d848/jobs/16206]
> this test times out for me consistently on both j8 and j11 where 4.1 and
> trunk do not.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)