[
https://issues.apache.org/jira/browse/CASSANDRA-19651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876137#comment-17876137
]
Dmitry Konstantinov edited comment on CASSANDRA-19651 at 8/22/24 11:56 PM:
---------------------------------------------------------------------------
If I understand correctly, org.apache.cassandra.gms.Gossiper#waitToSettle just
waits for 5 + 3 x 1 = 8 seconds if the number of nodes discovered via gossip
does not change (even if we have not had any gossip interaction with other
nodes at all).
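The 5 + 3 x 1 arithmetic can be sketched as a small simulation. This is only an illustration of the timing described above, not the Cassandra code: the class name, method name, and constants are hypothetical stand-ins, the sleeps are replaced by adding to a counter, and any extra conditions in the real waitToSettle are not modeled.

```java
import java.util.function.IntSupplier;

// Hypothetical simulation of the settle wait described above (not Cassandra code).
final class SettleWaitSketch
{
    static final int MIN_WAIT_MS = 5000;          // assumed initial wait
    static final int POLL_INTERVAL_MS = 1000;     // assumed delay between polls
    static final int POLL_SUCCESSES_REQUIRED = 3; // assumed number of stable polls

    /** Returns the simulated total wait in milliseconds; nothing actually sleeps. */
    static long settleWaitMillis(IntSupplier endpointCount)
    {
        long waited = MIN_WAIT_MS;
        int previous = endpointCount.getAsInt();
        int stablePolls = 0;
        while (stablePolls < POLL_SUCCESSES_REQUIRED)
        {
            waited += POLL_INTERVAL_MS;
            int current = endpointCount.getAsInt();
            // A poll counts as stable only if the endpoint count did not change.
            stablePolls = (current == previous) ? stablePolls + 1 : 0;
            previous = current;
        }
        return waited;
    }

    public static void main(String[] args)
    {
        // A node that never gossiped still sees a constant count (only itself),
        // so the wait "settles" after 5000 + 3 * 1000 ms.
        System.out.println(settleWaitMillis(() -> 1)); // prints 8000
    }
}
```

The point of the sketch is that a count that never changes is indistinguishable from a settled cluster, so a node whose first gossip round is delayed still passes the check.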
I have added a 5-second sleep to
org.apache.cassandra.gms.Gossiper.GossipTask#run (we also have a 1-second
initial delay when we schedule GossipTask):
{code:java}
private class GossipTask implements Runnable
{
    public void run()
    {
        try
        {
            //wait on messaging service to start listening
            MessagingService.instance().waitUntilListening();
            Thread.sleep(5000); // <===============================
            taskLock.lock();
{code}
With this change, the NPE is reproduced quite frequently in
org.apache.cassandra.distributed.test.ring.BootstrapTest#bootstrapUnspecifiedResumeTest:
{code:java}
java.lang.NullPointerException: Cannot invoke "org.apache.cassandra.gms.EndpointState.getApplicationState(org.apache.cassandra.gms.ApplicationState)" because "state" is null
    at org.apache.cassandra.distributed.action.GossipHelper$PullSchemaFrom.lambda$accept$6adea493$1(GossipHelper.java:245)
    at org.apache.cassandra.distributed.impl.IsolatedExecutor.lambda$async$10(IsolatedExecutor.java:156)
    at org.apache.cassandra.concurrent.FutureTask$2.call(FutureTask.java:124)
    at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
    at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:840)
{code}
So it is possible that the NPE in the original test run was caused by a delay
in GossipTask execution.
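To illustrate the suspected race, here is a minimal sketch of a lookup that runs before the first gossip round has populated the endpoint state. It is not the GossipHelper code and not a proposed fix for the ticket: the nested map is a hypothetical stand-in for the gossip state, and the "SCHEMA" key and endpoint address are made up for the example.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a gossip state lookup (not Cassandra code).
final class EndpointStateLookupSketch
{
    /**
     * Returns the schema version known for the endpoint, or empty if no
     * state has been learnt for that endpoint yet.
     */
    static Optional<String> schemaVersion(Map<String, Map<String, String>> endpointStates,
                                          String endpoint)
    {
        Map<String, String> state = endpointStates.get(endpoint);
        if (state == null)
            return Optional.empty(); // no gossip round has delivered this endpoint yet
        return Optional.ofNullable(state.get("SCHEMA"));
    }

    public static void main(String[] args)
    {
        Map<String, Map<String, String>> states = new ConcurrentHashMap<>();

        // Before the (delayed) first gossip round: nothing is known about the peer.
        System.out.println(schemaVersion(states, "127.0.0.2").isPresent()); // prints false

        // After gossip has delivered the peer's state.
        states.put("127.0.0.2", Map.of("SCHEMA", "v1"));
        System.out.println(schemaVersion(states, "127.0.0.2").get()); // prints v1
    }
}
```

A caller that dereferences the state without the null check would fail in the first case, which matches the shape of the NPE above.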
> idealCLWriteLatency metric reports the worst response time instead of the
> time when ideal CL is satisfied
> ---------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-19651
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19651
> Project: Cassandra
> Issue Type: Bug
> Components: Legacy/Observability
> Reporter: Dmitry Konstantinov
> Assignee: Dmitry Konstantinov
> Priority: Normal
> Fix For: 4.0.14, 5.0.1, 5.1, 4.1.7
>
> Attachments:
> ci_summary-cassandra-4.0-a75f6c3e81f677e50c0a0d467dd5dad672f923e3.html,
> ci_summary-cassandra-4.1-1ed312f881c0c170c8833ff9fbf397ab8fc625cc.html,
> ci_summary-cassandra-5.0-009f2982ac88d9c9bc0a7a7d29220f055aa7f11e.html,
> ci_summary-trunk-da68729322515b4a7a698b73a0154ecdeb3abf39.html,
> result_details-cassandra-4.0-a75f6c3e81f677e50c0a0d467dd5dad672f923e3.tar.gz,
> result_details-cassandra-5.0-009f2982ac88d9c9bc0a7a7d29220f055aa7f11e.tar.gz,
> result_details-trunk-da68729322515b4a7a698b73a0154ecdeb3abf39.tar.gz,
> select-junit-tests-rerun-4.1.zip
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> org.apache.cassandra.service.AbstractWriteResponseHandler:
> {code:java}
> private final void decrementResponseOrExpired()
> {
>     int decrementedValue = responsesAndExpirations.decrementAndGet();
>     if (decrementedValue == 0)
>     {
>         // The condition being signaled is a valid proxy for the CL being achieved
>         // Only mark it as failed if the requested CL was achieved.
>         if (!condition.isSignalled() && requestedCLAchieved)
>         {
>             replicaPlan.keyspace().metric.writeFailedIdealCL.inc();
>         }
>         else
>         {
>             replicaPlan.keyspace().metric.idealCLWriteLatency.addNano(nanoTime() - queryStartNanoTime);
>         }
>     }
> }
> {code}
> Actual result: responsesAndExpirations is the total number of replicas across
> all DCs, which does not depend on the ideal CL, so the metric value for
> replicaPlan.keyspace().metric.idealCLWriteLatency is updated only when the
> last response/timeout from all replicas arrives.
> Expected result: replicaPlan.keyspace().metric.idealCLWriteLatency is updated
> when we get enough responses from replicas according to the ideal CL.
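The difference between the actual and expected behaviour can be shown with a small sketch. It is not the patch for this ticket: the class, fields, and the idealClBlockFor parameter are hypothetical, and timestamps are passed in explicitly so the example stays deterministic.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the expected behaviour (not the actual Cassandra fix).
final class IdealClLatencySketch
{
    private final AtomicInteger remainingForIdealCl;
    private final long startNanos;
    private volatile long idealClLatencyNanos = -1; // -1 means "not recorded yet"

    /** idealClBlockFor is the number of responses the ideal CL needs (assumed input). */
    IdealClLatencySketch(int idealClBlockFor, long startNanos)
    {
        this.remainingForIdealCl = new AtomicInteger(idealClBlockFor);
        this.startNanos = startNanos;
    }

    /** Called once per successful replica response. */
    void onResponse(long nowNanos)
    {
        // Record latency when the ideal CL is satisfied, not when the
        // last of all replicas has answered.
        if (remainingForIdealCl.decrementAndGet() == 0)
            idealClLatencyNanos = nowNanos - startNanos;
    }

    long idealClLatencyNanos()
    {
        return idealClLatencyNanos;
    }

    public static void main(String[] args)
    {
        // Three replicas, ideal CL needs two of them.
        IdealClLatencySketch handler = new IdealClLatencySketch(2, 0L);
        handler.onResponse(10L);
        handler.onResponse(20L);  // ideal CL satisfied here
        handler.onResponse(500L); // slow third replica must not affect the metric
        System.out.println(handler.idealClLatencyNanos()); // prints 20
    }
}
```

Counting down from the total number of replicas instead would record 500 in this example, which is the "worst response time" behaviour named in the issue title.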