[ 
https://issues.apache.org/jira/browse/GEODE-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16815817#comment-16815817
 ] 

Bruce Schuchardt edited comment on GEODE-3948 at 4/11/19 10:18 PM:
-------------------------------------------------------------------

Reopening as the changes for this ticket caused an unintended side effect:
some reads are now performed without a read timeout.


{quote} 
We have 2 VMs, each running one Geode 1.7 server. Alongside the server, each
VM also hosts one Geode 1.7 client, so the cluster has 2 servers and 2 clients.

 

While doing validation, we introduced packet loss (~65%) on the first VM, “A”,
and after about 1 minute the client on VM “B” reported the following:

 

[warning 2019/04/11 16:20:27.502 AMT 
Collector-c0f1ee3e-366a-4ac3-8fda-60540cdd21c4 <ThreadsMonitor> tid=0x1c] 
Thread <2182> that was executed at <11 Apr 2019 16:19:11 AMT> has been stuck 
for <76.204 seconds> and number of thread monitor iteration <1>

  Thread Name <poolTimer-CollectorControllerPool-142>

  Thread state <RUNNABLE>

  Executor Group <ScheduledThreadPoolExecutorWithKeepAlive>

  Monitored metric <ResourceManagerStats.numThreadsStuck>

  Thread Stack:
  java.net.SocketInputStream.socketRead0(Native Method)
  java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
  java.net.SocketInputStream.read(SocketInputStream.java:171)
  java.net.SocketInputStream.read(SocketInputStream.java:141)
  sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
  sun.security.ssl.InputRecord.read(InputRecord.java:503)
  sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
  sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933)
  sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
  org.apache.geode.internal.cache.tier.sockets.Message.fetchHeader(Message.java:809)
  org.apache.geode.internal.cache.tier.sockets.Message.readHeaderAndBody(Message.java:659)
  org.apache.geode.internal.cache.tier.sockets.Message.receiveWithHeaderReadTimeout(Message.java:1124)
  org.apache.geode.internal.cache.tier.sockets.Message.receive(Message.java:1135)
  org.apache.geode.cache.client.internal.AbstractOp.attemptReadResponse(AbstractOp.java:205)
  org.apache.geode.cache.client.internal.AbstractOp.attempt(AbstractOp.java:386)
  org.apache.geode.cache.client.internal.ConnectionImpl.execute(ConnectionImpl.java:276)
  org.apache.geode.cache.client.internal.QueueConnectionImpl.execute(QueueConnectionImpl.java:167)
  org.apache.geode.cache.client.internal.OpExecutorImpl.executeWithPossibleReAuthentication(OpExecutorImpl.java:894)
  org.apache.geode.cache.client.internal.OpExecutorImpl.executeOnServer(OpExecutorImpl.java:387)
  org.apache.geode.cache.client.internal.OpExecutorImpl.executeOn(OpExecutorImpl.java:349)
  org.apache.geode.cache.client.internal.PoolImpl.executeOn(PoolImpl.java:827)
  org.apache.geode.cache.client.internal.PingOp.execute(PingOp.java:36)
  org.apache.geode.cache.client.internal.LiveServerPinger$PingTask.run2(LiveServerPinger.java:90)
  org.apache.geode.cache.client.internal.PoolImpl$PoolTask.run(PoolImpl.java:1338)
  java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
  java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
  org.apache.geode.internal.ScheduledThreadPoolExecutorWithKeepAlive$DelegatingScheduledFuture.run(ScheduledThreadPoolExecutorWithKeepAlive.java:271)
  java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  java.lang.Thread.run(Thread.java:748)

 

This report and stack trace are continuously repeated by ThreadsMonitor
over time – only the iteration count and “stuck for” values increase. From the
stack trace it appears to be a ping operation initiated by the client on VM “B”
to the server on VM “A”. Because of the packet drop between the nodes, the
response never reaches the calling client, and this thread remains blocked for
hours. In the source I see that receiveWithHeaderReadTimeout receives
NO_HEADER_READ_TIMEOUT as its timeout argument, which means we will wait
indefinitely. Is this reasonable? So the question is: why is the ping operation
executed without a timeout?

 

Or could it be that this stuck thread will be interrupted by some monitoring
logic at some point?
{quote}
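The indefinite blocking read in the quoted stack trace can be reproduced in isolation. The following is a minimal, hypothetical sketch (not Geode code, assuming only the JDK socket API): a server accepts a connection but never sends a byte, simulating a peer whose responses are lost to packet drop. Without Socket.setSoTimeout the client's read would block forever, exactly as in the report; with it, the read fails fast with SocketTimeoutException.

```java
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutDemo {
    public static void main(String[] args) throws Exception {
        // Server that accepts a connection but never writes anything,
        // simulating a peer whose response is lost to packet drop.
        try (ServerSocket server = new ServerSocket(0)) {
            Thread acceptor = new Thread(() -> {
                try {
                    server.accept();      // hold the connection open, send nothing
                    Thread.sleep(5_000);
                } catch (Exception ignored) {
                }
            });
            acceptor.setDaemon(true);
            acceptor.start();

            try (Socket client = new Socket("localhost", server.getLocalPort())) {
                // Without this call, the read() below blocks indefinitely --
                // the behavior seen in the reported stack trace.
                client.setSoTimeout(500); // read timeout in milliseconds
                InputStream in = client.getInputStream();
                try {
                    in.read(); // blocks until data, EOF, or the timeout fires
                    System.out.println("unexpected: got data");
                } catch (SocketTimeoutException e) {
                    System.out.println("read timed out");
                }
            }
        }
    }
}
```

This is what a bounded read timeout on the ping path would provide: the thread surfaces a SocketTimeoutException that the pool can handle, instead of being stuck in socketRead0 for hours.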



> Improve CQ performance under flaky network conditions
> -----------------------------------------------------
>
>                 Key: GEODE-3948
>                 URL: https://issues.apache.org/jira/browse/GEODE-3948
>             Project: Geode
>          Issue Type: Improvement
>          Components: docs
>            Reporter: Galen O'Sullivan
>            Assignee: Bruce Schuchardt
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 1.5.0
>
>          Time Spent: 3h
>  Remaining Estimate: 0h
>
> Client CQ connections occasionally stop receiving messages and become blocked 
> indefinitely. 
> This can be caused by a server that hangs or dies without sending a close 
> message, or by some firewalls. 
> The client already gets ping messages from the server, but currently ignores 
> them. Let's use those messages to detect a failed connection and close it.
> Probably the client should follow the same logic and send ping messages if it 
> has sent no acks for a while, so that the server can also detect and close a 
> broken connection.
> The timeout could be specified as a number and time interval, the ping 
> interval and the number of missed pings after which to fail.
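
The policy proposed in the description (a ping interval plus a count of missed pings after which to fail) could be sketched roughly as follows. This is an illustrative sketch only; PingWatchdog and its methods are hypothetical names, not Geode APIs.

```java
// Hypothetical sketch of the proposed failure-detection policy:
// declare a connection dead once more than `maxMissedPings` ping
// intervals elapse without a ping being recorded from the peer.
public class PingWatchdog {
    private final long pingIntervalMillis;
    private final int maxMissedPings;
    private volatile long lastPingMillis;

    public PingWatchdog(long pingIntervalMillis, int maxMissedPings, long nowMillis) {
        this.pingIntervalMillis = pingIntervalMillis;
        this.maxMissedPings = maxMissedPings;
        this.lastPingMillis = nowMillis;
    }

    /** Called whenever a ping message arrives from the peer. */
    public void recordPing(long nowMillis) {
        lastPingMillis = nowMillis;
    }

    /** True once more than maxMissedPings intervals passed without a ping. */
    public boolean isConnectionDead(long nowMillis) {
        return nowMillis - lastPingMillis > pingIntervalMillis * maxMissedPings;
    }
}
```

With, say, a 1-second ping interval and 3 allowed misses, the connection would be closed a little over 3 seconds after the last ping arrived, rather than blocking indefinitely.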



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
