[jira] [Comment Edited] (CASSANDRA-18866) Node sends multiple inflight echos

2023-11-13 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17785472#comment-17785472
 ] 

Stefan Miklosovic edited comment on CASSANDRA-18866 at 11/13/23 12:16 PM:
--

[~brandon.williams]  do we still want to see this in 5.0.0? Strictly speaking, 
this is not a bug, just an improvement. I think it took so much time to review 
/ complete (not blaming Brandon, just saying), that we entered freeze period 
and here we are ... 

cc [~mck] 


was (Author: smiklosovic):
[~brandon.williams]  do we still want this to see in 5.0.0? Strictly speaking, 
this is not a bug, just an improvement. I think it took so much time to review 
/ complete (not blaming Brandon, just saying), that we entered freeze period 
and here we are ... 

cc [~mck] 

> Node sends multiple inflight echos
> --
>
> Key: CASSANDRA-18866
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18866
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Cluster/Gossip
>Reporter: Cameron Zemek
>Assignee: Cameron Zemek
>Priority: Normal
> Attachments: 18866-regression.patch, duplicates.log, echo.log
>
>
> CASSANDRA-18854 rolled back the changes from CASSANDRA-18845. In particular, 
> 18845 had change to only allow 1 inflight ECHO request at a time. As per 
> 18854 some tests have an error rate due to this change. Creating this ticket 
> to discuss this further. As the current state also does not have retry logic, 
> it just allowing multiple ECHO requests inflight at the same time so less 
> likely that all ECHO will timeout or get lost.
> With the change from 18845 adding in some extra logging to track what is 
> going on, I do see it retrying ECHOs. Likewise, I patched a node to drop ECHO 
> requests from a node and also see it retrying ECHOs when it doesn't get a 
> reply.
> Therefore, I think the problem is more specific than the dropping of one ECHO 
> request. Yes there no retry logic for failed ECHO requests, but this is the 
> case even both before and after 18845. ECHO requests are only sent via gossip 
> verb handlers calling applyStateLocally. In these failed tests I therefore 
> assuming their cases where it won't call markAlive when other nodes consider 
> the node UP but its marked DOWN by a node.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-18866) Node sends multiple inflight echos

2023-11-13 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17785472#comment-17785472
 ] 

Stefan Miklosovic edited comment on CASSANDRA-18866 at 11/13/23 12:14 PM:
--

[~brandon.williams]  do we still want this to see in 5.0.0? Strictly speaking, 
this is not a bug, just an improvement. I think it took so much time to review 
/ complete (not blaming Brandon, just saying), that we entered freeze period 
and here we are ... 

cc [~mck] 


was (Author: smiklosovic):
[~brandon.williams]  do we still want this to see in 5.0? Strictly speaking, 
this is not a bug, just an improvement. I think it took so much time to review 
/ complete (not blaming Brandon, just saying), that we entered freeze period 
and here we are ... 

cc [~mck] 

> Node sends multiple inflight echos
> --
>
> Key: CASSANDRA-18866
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18866
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Cluster/Gossip
>Reporter: Cameron Zemek
>Assignee: Cameron Zemek
>Priority: Normal
> Attachments: 18866-regression.patch, duplicates.log, echo.log
>
>
> CASSANDRA-18854 rolled back the changes from CASSANDRA-18845. In particular, 
> 18845 had change to only allow 1 inflight ECHO request at a time. As per 
> 18854 some tests have an error rate due to this change. Creating this ticket 
> to discuss this further. As the current state also does not have retry logic, 
> it just allowing multiple ECHO requests inflight at the same time so less 
> likely that all ECHO will timeout or get lost.
> With the change from 18845 adding in some extra logging to track what is 
> going on, I do see it retrying ECHOs. Likewise, I patched a node to drop ECHO 
> requests from a node and also see it retrying ECHOs when it doesn't get a 
> reply.
> Therefore, I think the problem is more specific than the dropping of one ECHO 
> request. Yes there no retry logic for failed ECHO requests, but this is the 
> case even both before and after 18845. ECHO requests are only sent via gossip 
> verb handlers calling applyStateLocally. In these failed tests I therefore 
> assuming their cases where it won't call markAlive when other nodes consider 
> the node UP but its marked DOWN by a node.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-18866) Node sends multiple inflight echos

2023-10-13 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17774815#comment-17774815
 ] 

Stefan Miklosovic edited comment on CASSANDRA-18866 at 10/13/23 7:44 AM:
-

[~brandon.williams] do you remember CASSANDRA-18854 / CASSANDRA-18543 where we 
reverted the logic around missed echo message? This one fixes it. Repeated 
tests seem to be stable, it is seen that in some cases it resends echo request 
when lost (in 1% of cases). Do you think this is something you could take a 
look at?  

[trunk 
j17|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3337/workflows/f44ee4a8-03fb-488f-ba54-43306bdc86d0]
[trunk 
j11|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3337/workflows/459e6b55-3d43-4ee8-85cd-b308e8797e51]
[5.0 
j17|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3336/workflows/50a0bc41-b800-478d-b7e1-38cc73c16f84]
[5.0 
j11|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3336/workflows/d92a1a77-1f81-4210-a14a-da61fb18e1dd]
[4.1 
j11|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3335/workflows/dc9128f2-94f0-4651-afc8-df4a2db53a9b]
[4.1 
j8|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3335/workflows/8205db12-78f4-429a-81b8-508ba83b98bb]
[4.0 
j11|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3326/workflows/86b807f3-801f-47a0-925b-9ca49eb76d97]
[4.0 
j8|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3326/workflows/0af2bd74-55fc-4d3f-a4e0-2a4c1461ed90]
[3.11 j8| 
https://app.circleci.com/pipelines/github/instaclustr/cassandra/3325/workflows/ff2b2562-c238-49d2-abb2-3457acb9618d]


was (Author: smiklosovic):
[~brandon.williams] do you remember CASSANDRA-18854 / CASSANDRA-18543 where we 
reverted the logic around missed echo message? This one fixes it. Repeated 
tests seem to be stable, it is seen that in some cases it resends echo request 
when lost (in 1% of cases). Do you think this is something you could take a 
look at?  

[trunk 
j17|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3337/workflows/f44ee4a8-03fb-488f-ba54-43306bdc86d0]
[trunk 
j11|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3337/workflows/459e6b55-3d43-4ee8-85cd-b308e8797e51]
[5.0 
j17|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3336/workflows/50a0bc41-b800-478d-b7e1-38cc73c16f84]
[5.0 
j11|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3336/workflows/d92a1a77-1f81-4210-a14a-da61fb18e1dd]
[4.1 
j11|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3335/workflows/dc9128f2-94f0-4651-afc8-df4a2db53a9b]
[4.1 
j8|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3335/workflows/8205db12-78f4-429a-81b8-508ba83b98bb]
[4.0 
j11|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3326/workflows/86b807f3-801f-47a0-925b-9ca49eb76d97]
[4.0 
j8|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3326/workflows/0af2bd74-55fc-4d3f-a4e0-2a4c1461ed90]
[3.11 j8 
https://app.circleci.com/pipelines/github/instaclustr/cassandra/3325/workflows/ff2b2562-c238-49d2-abb2-3457acb9618d]

> Node sends multiple inflight echos
> --
>
> Key: CASSANDRA-18866
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18866
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Cluster/Gossip
>Reporter: Cameron Zemek
>Assignee: Cameron Zemek
>Priority: Normal
> Attachments: 18866-regression.patch, duplicates.log, echo.log
>
>
> CASSANDRA-18854 rolled back the changes from CASSANDRA-18845. In particular, 
> 18845 had change to only allow 1 inflight ECHO request at a time. As per 
> 18854 some tests have an error rate due to this change. Creating this ticket 
> to discuss this further. As the current state also does not have retry logic, 
> it just allowing multiple ECHO requests inflight at the same time so less 
> likely that all ECHO will timeout or get lost.
> With the change from 18845 adding in some extra logging to track what is 
> going on, I do see it retrying ECHOs. Likewise, I patched a node to drop ECHO 
> requests from a node and also see it retrying ECHOs when it doesn't get a 
> reply.
> Therefore, I think the problem is more specific than the dropping of one ECHO 
> request. Yes there no retry logic for failed ECHO requests, but this is the 
> case even both before and after 18845. ECHO requests are only sent via gossip 
> verb handlers calling applyStateLocally. In these failed tests I therefore 
> assuming their cases where it won't call markAlive when other nodes consider 
> the node UP but its marked DOWN by a node.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (CASSANDRA-18866) Node sends multiple inflight echos

2023-09-24 Thread Cameron Zemek (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17768462#comment-17768462
 ] 

Cameron Zemek edited comment on CASSANDRA-18866 at 9/24/23 11:47 PM:
-

Had to make the following change for some more dtests:

Previous:
{code:java}
            @Override
            public void onFailure(InetAddressAndPort from, RequestFailureReason 
failureReason)
            {
                logger.trace("Resending ECHO_REQ to {}", addr);
                Message echoMessage = Message.out(ECHO_REQ, 
noPayload);
                MessagingService.instance().sendWithCallback(echoMessage, addr, 
this);
            } {code}
After:
{code:java}
            @Override
            public void onFailure(InetAddressAndPort from, RequestFailureReason 
failureReason)
            {
                if (isEnabled())
                {
                    logger.trace("Resending ECHO_REQ to {}", addr);
                    Message echoMessage = Message.out(ECHO_REQ, 
noPayload);
                    MessagingService.instance().sendWithCallback(echoMessage, 
addr, this);
                }
                else
                {
                    logger.trace("Failed ECHO_REQ to {}, aborting due to 
disabled gossip", addr);
                    inflightEcho.remove(addr);
                 }
            }
 {code}
[instaclustr/cassandra at CASSANDRA-18866-regressiontest 
(github.com)|https://github.com/instaclustr/cassandra/tree/CASSANDRA-18866-regressiontest]


was (Author: cam1982):
Had to make the following change for some more dtests:

Previous:
{code:java}
            @Override
            public void onFailure(InetAddressAndPort from, RequestFailureReason 
failureReason)
            {
                logger.trace("Resending ECHO_REQ to {}", addr);
                Message echoMessage = Message.out(ECHO_REQ, 
noPayload);
                MessagingService.instance().sendWithCallback(echoMessage, addr, 
this);
            } {code}
After:
{code:java}
            @Override
            public void onFailure(InetAddressAndPort from, RequestFailureReason 
failureReason)
            {
                if (isEnabled())
                {
                    logger.trace("Resending ECHO_REQ to {}", addr);
                    Message echoMessage = Message.out(ECHO_REQ, 
noPayload);
                    MessagingService.instance().sendWithCallback(echoMessage, 
addr, this);
                }
                else
                {
                    logger.trace("Failed ECHO_REQ to {}, aborting due to 
disabled gossip", addr);
                }
            }
 {code}
[instaclustr/cassandra at CASSANDRA-18866-regressiontest 
(github.com)|https://github.com/instaclustr/cassandra/tree/CASSANDRA-18866-regressiontest]

> Node sends multiple inflight echos
> --
>
> Key: CASSANDRA-18866
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18866
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Cameron Zemek
>Priority: Normal
> Attachments: 18866-regression.patch, duplicates.log, echo.log
>
>
> CASSANDRA-18854 rolled back the changes from CASSANDRA-18845. In particular, 
> 18845 had change to only allow 1 inflight ECHO request at a time. As per 
> 18854 some tests have an error rate due to this change. Creating this ticket 
> to discuss this further. As the current state also does not have retry logic, 
> it just allowing multiple ECHO requests inflight at the same time so less 
> likely that all ECHO will timeout or get lost.
> With the change from 18845 adding in some extra logging to track what is 
> going on, I do see it retrying ECHOs. Likewise, I patched a node to drop ECHO 
> requests from a node and also see it retrying ECHOs when it doesn't get a 
> reply.
> Therefore, I think the problem is more specific than the dropping of one ECHO 
> request. Yes there no retry logic for failed ECHO requests, but this is the 
> case even both before and after 18845. ECHO requests are only sent via gossip 
> verb handlers calling applyStateLocally. In these failed tests I therefore 
> assuming their cases where it won't call markAlive when other nodes consider 
> the node UP but its marked DOWN by a node.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-18866) Node sends multiple inflight echos

2023-09-24 Thread Cameron Zemek (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17768462#comment-17768462
 ] 

Cameron Zemek edited comment on CASSANDRA-18866 at 9/24/23 11:42 PM:
-

Had to make the following change for some more dtests:

Previous:
{code:java}
            @Override
            public void onFailure(InetAddressAndPort from, RequestFailureReason 
failureReason)
            {
                logger.trace("Resending ECHO_REQ to {}", addr);
                Message echoMessage = Message.out(ECHO_REQ, 
noPayload);
                MessagingService.instance().sendWithCallback(echoMessage, addr, 
this);
            } {code}
After:
{code:java}
            @Override
            public void onFailure(InetAddressAndPort from, RequestFailureReason 
failureReason)
            {
                if (isEnabled())
                {
                    logger.trace("Resending ECHO_REQ to {}", addr);
                    Message echoMessage = Message.out(ECHO_REQ, 
noPayload);
                    MessagingService.instance().sendWithCallback(echoMessage, 
addr, this);
                }
                else
                {
                    logger.trace("Failed ECHO_REQ to {}, aborting due to 
disabled gossip", addr);
                }
            }
 {code}
[instaclustr/cassandra at CASSANDRA-18866-regressiontest 
(github.com)|https://github.com/instaclustr/cassandra/tree/CASSANDRA-18866-regressiontest]


was (Author: cam1982):
Had to make the following change for some more dtests:

Previous:
{code:java}
            @Override
            public void onFailure(InetAddressAndPort from, RequestFailureReason 
failureReason)
            {
                logger.trace("Resending ECHO_REQ to {}", addr);
                Message echoMessage = Message.out(ECHO_REQ, 
noPayload);
                MessagingService.instance().sendWithCallback(echoMessage, addr, 
this);
            } {code}
After:
{code:java}
            @Override
            public void onFailure(InetAddressAndPort from, RequestFailureReason 
failureReason)
            {
                if (isEnabled())
                {
                    logger.trace("Resending ECHO_REQ to {}", addr);
                    Message echoMessage = Message.out(ECHO_REQ, 
noPayload);
                    MessagingService.instance().sendWithCallback(echoMessage, 
addr, this);
                }
                else
                {
                    logger.trace("Failed ECHO_REQ to {}, aborting due to 
disabled gossip", addr);
                }
            }
 {code}

> Node sends multiple inflight echos
> --
>
> Key: CASSANDRA-18866
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18866
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Cameron Zemek
>Priority: Normal
> Attachments: 18866-regression.patch, duplicates.log, echo.log
>
>
> CASSANDRA-18854 rolled back the changes from CASSANDRA-18845. In particular, 
> 18845 had change to only allow 1 inflight ECHO request at a time. As per 
> 18854 some tests have an error rate due to this change. Creating this ticket 
> to discuss this further. As the current state also does not have retry logic, 
> it just allowing multiple ECHO requests inflight at the same time so less 
> likely that all ECHO will timeout or get lost.
> With the change from 18845 adding in some extra logging to track what is 
> going on, I do see it retrying ECHOs. Likewise, I patched a node to drop ECHO 
> requests from a node and also see it retrying ECHOs when it doesn't get a 
> reply.
> Therefore, I think the problem is more specific than the dropping of one ECHO 
> request. Yes there no retry logic for failed ECHO requests, but this is the 
> case even both before and after 18845. ECHO requests are only sent via gossip 
> verb handlers calling applyStateLocally. In these failed tests I therefore 
> assuming their cases where it won't call markAlive when other nodes consider 
> the node UP but its marked DOWN by a node.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-18866) Node sends multiple inflight echos

2023-09-21 Thread Cameron Zemek (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17767742#comment-17767742
 ] 

Cameron Zemek edited comment on CASSANDRA-18866 at 9/22/23 3:00 AM:


Found some bugs:
{code:java}
if (inflightEcho.contains(addr))
{
return;
}
inflightEcho.add(addr); {code}
should be
{code:java}
        if (!inflightEcho.add(addr))
        {
            return;
        } {code}
Otherwise, data race allows multiple inflight echos.

 

and
{code:java}
            @Override
            public void onFailure(InetAddressAndPort from, RequestFailureReason 
failureReason)
            {
                MessagingService.instance().sendWithCallback(echoMessage, addr, 
this);
            } {code}
should be
{code:java}
            @Override
            public void onFailure(InetAddressAndPort from, RequestFailureReason 
failureReason)
            {
                logger.trace("Resending ECHO_REQ to {}", addr);
                Message echoMessage = Message.out(ECHO_REQ, 
noPayload);
                MessagingService.instance().sendWithCallback(echoMessage, addr, 
this);
            }
 {code}
That is need to construct a new message, not send the same message again.

 


was (Author: cam1982):
Found some bugs:
{code:java}
if (inflightEcho.contains(addr))
{
return;
}
inflightEcho.add(addr); {code}
should be
{noformat}
if (!inflightEcho.add(addr))
{
logger.info("Skip ECHO_REQ to {}", addr);
return;
}{noformat}
Otherwise, data race allows multiple inflight echos.

 

and
{code:java}
            @Override
            public void onFailure(InetAddressAndPort from, RequestFailureReason 
failureReason)
            {
                MessagingService.instance().sendWithCallback(echoMessage, addr, 
this);
            } {code}
should be
{code:java}
            @Override
            public void onFailure(InetAddressAndPort from, RequestFailureReason 
failureReason)
            {
                logger.trace("Resending ECHO_REQ to {}", addr);
                Message echoMessage = Message.out(ECHO_REQ, 
noPayload);
                MessagingService.instance().sendWithCallback(echoMessage, addr, 
this);
            }
 {code}
That is need to construct a new message, not send the same message again.

 

> Node sends multiple inflight echos
> --
>
> Key: CASSANDRA-18866
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18866
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Cameron Zemek
>Priority: Normal
> Attachments: 18866-regression.patch, duplicates.log, echo.log
>
>
> CASSANDRA-18854 rolled back the changes from CASSANDRA-18845. In particular, 
> 18845 had change to only allow 1 inflight ECHO request at a time. As per 
> 18854 some tests have an error rate due to this change. Creating this ticket 
> to discuss this further. As the current state also does not have retry logic, 
> it just allowing multiple ECHO requests inflight at the same time so less 
> likely that all ECHO will timeout or get lost.
> With the change from 18845 adding in some extra logging to track what is 
> going on, I do see it retrying ECHOs. Likewise, I patched a node to drop ECHO 
> requests from a node and also see it retrying ECHOs when it doesn't get a 
> reply.
> Therefore, I think the problem is more specific than the dropping of one ECHO 
> request. Yes there no retry logic for failed ECHO requests, but this is the 
> case even both before and after 18845. ECHO requests are only sent via gossip 
> verb handlers calling applyStateLocally. In these failed tests I therefore 
> assuming their cases where it won't call markAlive when other nodes consider 
> the node UP but its marked DOWN by a node.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-18866) Node sends multiple inflight echos

2023-09-21 Thread Cameron Zemek (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17767742#comment-17767742
 ] 

Cameron Zemek edited comment on CASSANDRA-18866 at 9/21/23 10:04 PM:
-

Found some bugs:
{code:java}
if (inflightEcho.contains(addr))
{
return;
}
inflightEcho.add(addr); {code}
should be
{noformat}
if (!inflightEcho.add(addr))
{
logger.info("Skip ECHO_REQ to {}", addr);
return;
}{noformat}
Otherwise, data race allows multiple inflight echos.

 

and
{code:java}
            @Override
            public void onFailure(InetAddressAndPort from, RequestFailureReason 
failureReason)
            {
                MessagingService.instance().sendWithCallback(echoMessage, addr, 
this);
            } {code}
should be
{code:java}
            @Override
            public void onFailure(InetAddressAndPort from, RequestFailureReason 
failureReason)
            {
                logger.trace("Resending ECHO_REQ to {}", addr);
                Message echoMessage = Message.out(ECHO_REQ, 
noPayload);
                MessagingService.instance().sendWithCallback(echoMessage, addr, 
this);
            }
 {code}
That is need to construct a new message, not send the same message again.

 


was (Author: cam1982):
Found some bugs:

 
{code:java}
if (inflightEcho.contains(addr))
{
return;
}
inflightEcho.add(addr); {code}
should be

 

 
{noformat}
if (!inflightEcho.add(addr))
{
logger.info("Skip ECHO_REQ to {}", addr);
return;
}{noformat}
 

Otherwise, data race allows multiple inflight echos.

 

and

 
{code:java}
            @Override
            public void onFailure(InetAddressAndPort from, RequestFailureReason 
failureReason)
            {
                MessagingService.instance().sendWithCallback(echoMessage, addr, 
this);
            } {code}
should be

 

 
{code:java}
            @Override
            public void onFailure(InetAddressAndPort from, RequestFailureReason 
failureReason)
            {
                logger.trace("Resending ECHO_REQ to {}", addr);
                Message echoMessage = Message.out(ECHO_REQ, 
noPayload);
                MessagingService.instance().sendWithCallback(echoMessage, addr, 
this);
            }
 {code}
That is need to construct a new message, not send the same message again.

 

> Node sends multiple inflight echos
> --
>
> Key: CASSANDRA-18866
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18866
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Cameron Zemek
>Priority: Normal
> Attachments: 18866-regression.patch, duplicates.log, echo.log
>
>
> CASSANDRA-18854 rolled back the changes from CASSANDRA-18845. In particular, 
> 18845 had change to only allow 1 inflight ECHO request at a time. As per 
> 18854 some tests have an error rate due to this change. Creating this ticket 
> to discuss this further. As the current state also does not have retry logic, 
> it just allowing multiple ECHO requests inflight at the same time so less 
> likely that all ECHO will timeout or get lost.
> With the change from 18845 adding in some extra logging to track what is 
> going on, I do see it retrying ECHOs. Likewise, I patched a node to drop ECHO 
> requests from a node and also see it retrying ECHOs when it doesn't get a 
> reply.
> Therefore, I think the problem is more specific than the dropping of one ECHO 
> request. Yes there no retry logic for failed ECHO requests, but this is the 
> case even both before and after 18845. ECHO requests are only sent via gossip 
> verb handlers calling applyStateLocally. In these failed tests I therefore 
> assuming their cases where it won't call markAlive when other nodes consider 
> the node UP but its marked DOWN by a node.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-18866) Node sends multiple inflight echos

2023-09-20 Thread Cameron Zemek (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17767036#comment-17767036
 ] 

Cameron Zemek edited comment on CASSANDRA-18866 at 9/20/23 7:37 AM:


going to run overnight the broken dtest that was flagged by the ECHO changes. 
But with potential fix:
{noformat}
            @Override
            public void onFailure(InetAddressAndPort from, RequestFailureReason 
failureReason)
            {
                MessagingService.instance().sendWithCallback(echoMessage, addr, 
this);
            }{noformat}
will report back in the morning.


was (Author: cam1982):
!18866-regression.patch|width=7,height=7,align=absmiddle!

going to run overnight the broken dtest that was flagged by the ECHO changes. 
But with potential fix:
{noformat}
            @Override
            public void onFailure(InetAddressAndPort from, RequestFailureReason 
failureReason)
            {
                MessagingService.instance().sendWithCallback(echoMessage, addr, 
this);
            }{noformat}
will report back in the morning.

> Node sends multiple inflight echos
> --
>
> Key: CASSANDRA-18866
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18866
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Cameron Zemek
>Priority: Normal
> Attachments: 18866-regression.patch, duplicates.log, echo.log
>
>
> CASSANDRA-18854 rolled back the changes from CASSANDRA-18845. In particular, 
> 18845 had change to only allow 1 inflight ECHO request at a time. As per 
> 18854 some tests have an error rate due to this change. Creating this ticket 
> to discuss this further. As the current state also does not have retry logic, 
> it just allowing multiple ECHO requests inflight at the same time so less 
> likely that all ECHO will timeout or get lost.
> With the change from 18845 adding in some extra logging to track what is 
> going on, I do see it retrying ECHOs. Likewise, I patched a node to drop ECHO 
> requests from a node and also see it retrying ECHOs when it doesn't get a 
> reply.
> Therefore, I think the problem is more specific than the dropping of one ECHO 
> request. Yes there no retry logic for failed ECHO requests, but this is the 
> case even both before and after 18845. ECHO requests are only sent via gossip 
> verb handlers calling applyStateLocally. In these failed tests I therefore 
> assuming their cases where it won't call markAlive when other nodes consider 
> the node UP but its marked DOWN by a node.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org