[ 
https://issues.apache.org/jira/browse/GEODE-3709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219786#comment-16219786
 ] 

Barry Oglesby commented on GEODE-3709:
--------------------------------------

The client logs correspond with what you're saying.

In the 18:03 UTC minute (which is the minute the issue started), the 
dca-prd-rstc12 client received 42458 invalidate messages from the server 
between 18:03:00.000 and 18:03:38.799, all of which were processed immediately. 
Thats an average of 1061 invalidates / second.

For example, the 18:03:00 second received and processed 2106 invalidate 
messages.

Processing log messages look like:
{noformat}
[fine 2017/10/20 18:03:00.000 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Received 
invalidate message of length (141 bytes)
[fine 2017/10/20 18:03:00.000 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Invalidating 
entry for region: /CircularStoreMap key: 
19991|5057|offerCode|2_W_0000787828_20171018_5 callbackArgument: null 
withInterest=true withCQs=false version=null
[fine 2017/10/20 18:03:00.000 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Invalidated 
entry for region: /CircularStoreMap key: 
19991|5057|offerCode|2_W_0000787828_20171018_5 callbackArgument: null
{noformat}
>From 18:03:39 on, the client received ~2 invalidate messages a second (roughly 
>a message every 400 ms).

So, the client goes from receiving ~1000 messages / sec to ~2 messages / second.

Here are the log messages from 18:03:39 UTC to 18:03:45 UTC:
{noformat}
[fine 2017/10/20 18:03:39.396 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Received 
invalidate message of length (142 bytes)
[fine 2017/10/20 18:03:39.396 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Invalidating 
entry for region: /CircularStoreMap key: 
20001|1620|b0ac2670-3fd5-408e-8da3-43a399215645 callbackArgument: null 
withInterest=true withCQs=false version=null
[fine 2017/10/20 18:03:39.396 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Invalidated 
entry for region: /CircularStoreMap key: 
20001|1620|b0ac2670-3fd5-408e-8da3-43a399215645 callbackArgument: null
[fine 2017/10/20 18:03:39.804 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Received 
invalidate message of length (142 bytes)
[fine 2017/10/20 18:03:39.804 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Invalidating 
entry for region: /CircularStoreMap key: 
20001|1620|20836a0a-7435-4e97-a4be-cdef4dc31213 callbackArgument: null 
withInterest=true withCQs=false version=null
[fine 2017/10/20 18:03:39.804 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Invalidated 
entry for region: /CircularStoreMap key: 
20001|1620|20836a0a-7435-4e97-a4be-cdef4dc31213 callbackArgument: null

[fine 2017/10/20 18:03:40.416 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Received 
invalidate message of length (141 bytes)
[fine 2017/10/20 18:03:40.416 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Invalidating 
entry for region: /CircularStoreMap key: 
19991|5085|offerCode|2_W_0000296797_20171018_0 callbackArgument: null 
withInterest=true withCQs=false version=null
[fine 2017/10/20 18:03:40.416 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Invalidated 
entry for region: /CircularStoreMap key: 
19991|5085|offerCode|2_W_0000296797_20171018_0 callbackArgument: null
[fine 2017/10/20 18:03:40.824 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Received 
invalidate message of length (142 bytes)
[fine 2017/10/20 18:03:40.824 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Invalidating 
entry for region: /CircularStoreMap key: 
20001|1620|f371045a-f0a2-4fcd-92f5-10b1088e70fc callbackArgument: null 
withInterest=true withCQs=false version=null
[fine 2017/10/20 18:03:40.824 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Invalidated 
entry for region: /CircularStoreMap key: 
20001|1620|f371045a-f0a2-4fcd-92f5-10b1088e70fc callbackArgument: null

[fine 2017/10/20 18:03:41.232 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Received 
invalidate message of length (141 bytes)
[fine 2017/10/20 18:03:41.232 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Invalidating 
entry for region: /CircularStoreMap key: 
19991|5085|offerCode|2_W_0000576827_20171018_0 callbackArgument: null 
withInterest=true withCQs=false version=null
[fine 2017/10/20 18:03:41.232 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Invalidated 
entry for region: /CircularStoreMap key: 
19991|5085|offerCode|2_W_0000576827_20171018_0 callbackArgument: null
[fine 2017/10/20 18:03:41.844 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Received 
invalidate message of length (142 bytes)
[fine 2017/10/20 18:03:41.844 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Invalidating 
entry for region: /CircularStoreMap key: 
20001|1620|45e86530-cb06-4d8b-a501-206a2be47395 callbackArgument: null 
withInterest=true withCQs=false version=null
[fine 2017/10/20 18:03:41.844 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Invalidated 
entry for region: /CircularStoreMap key: 
20001|1620|45e86530-cb06-4d8b-a501-206a2be47395 callbackArgument: null

[fine 2017/10/20 18:03:42.252 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Received 
invalidate message of length (142 bytes)
[fine 2017/10/20 18:03:42.252 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Invalidating 
entry for region: /CircularStoreMap key: 
20001|1620|d5424b01-7115-41cb-89e9-71f3b949c6d9 callbackArgument: null 
withInterest=true withCQs=false version=null
[fine 2017/10/20 18:03:42.252 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Invalidated 
entry for region: /CircularStoreMap key: 
20001|1620|d5424b01-7115-41cb-89e9-71f3b949c6d9 callbackArgument: null
[fine 2017/10/20 18:03:42.660 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Received 
invalidate message of length (141 bytes)
[fine 2017/10/20 18:03:42.660 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Invalidating 
entry for region: /CircularStoreMap key: 
19991|5085|offerCode|2_W_0000935202_20171018_0 callbackArgument: null 
withInterest=true withCQs=false version=null
[fine 2017/10/20 18:03:42.660 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Invalidated 
entry for region: /CircularStoreMap key: 
19991|5085|offerCode|2_W_0000935202_20171018_0 callbackArgument: null

[fine 2017/10/20 18:03:43.272 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Received 
invalidate message of length (142 bytes)
[fine 2017/10/20 18:03:43.272 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Invalidating 
entry for region: /CircularStoreMap key: 
20001|1620|7d91518b-bd61-4c55-81d1-a0b8fc293062 callbackArgument: null 
withInterest=true withCQs=false version=null
[fine 2017/10/20 18:03:43.272 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Invalidated 
entry for region: /CircularStoreMap key: 
20001|1620|7d91518b-bd61-4c55-81d1-a0b8fc293062 callbackArgument: null
[fine 2017/10/20 18:03:43.680 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Received 
invalidate message of length (141 bytes)
[fine 2017/10/20 18:03:43.680 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Invalidating 
entry for region: /CircularStoreMap key: 
19991|5085|offerCode|2_W_0000446976_20171018_0 callbackArgument: null 
withInterest=true withCQs=false version=null
[fine 2017/10/20 18:03:43.680 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Invalidated 
entry for region: /CircularStoreMap key: 
19991|5085|offerCode|2_W_0000446976_20171018_0 callbackArgument: null

[fine 2017/10/20 18:03:44.292 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Received 
invalidate message of length (142 bytes)
[fine 2017/10/20 18:03:44.292 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Invalidating 
entry for region: /CircularStoreMap key: 
20001|1620|7c825d65-7af7-4952-9f52-ba2a28b5b852 callbackArgument: null 
withInterest=true withCQs=false version=null
[fine 2017/10/20 18:03:44.292 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Invalidated 
entry for region: /CircularStoreMap key: 
20001|1620|7c825d65-7af7-4952-9f52-ba2a28b5b852 callbackArgument: null
[fine 2017/10/20 18:03:44.700 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Received 
invalidate message of length (142 bytes)
[fine 2017/10/20 18:03:44.700 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Invalidating 
entry for region: /CircularStoreMap key: 
20001|1620|8f87d525-2a08-4c68-8c0f-80137d39f2ad callbackArgument: null 
withInterest=true withCQs=false version=null
[fine 2017/10/20 18:03:44.700 UTC  <Cache Client Updater Thread  on 
10.55.13.77(DCA-PRD-GDCS11:9074)<v94>:1024 port 40404> tid=0xa9] Invalidated 
entry for region: /CircularStoreMap key: 
20001|1620|8f87d525-2a08-4c68-8c0f-80137d39f2ad callbackArgument: null
{noformat}
The 'Received invalidate message' message is logged right after the Cache 
Client Updater Thread reads a message from the socket and determines the 
message type.

The way that server -> client messaging works is:

On the server:

- the Client Message Dispatcher peeks a message from queue
- the Client Message Dispatcher writes the message to the socket
- repeat

On the client:
- the Cache Client Updater reads a message from the socket
- the Cache Client Updater processes that message
- repeat

There isn't an explicit ACK from the client to the server for each message, and 
the server doesn't wait for an ACK from the client.

The acknowledgements from the client to the server are asynchronous every 100 
ms by default. The client sends the last message it has seen which causes the 
server to remove those messages from the queue.

The ACKs in the TCP dump are at the TCP level not the Geode level.

Based on the thread dumps, I don't see that Geode is doing anything other than 
trying to write to the socket. If the Geode server were doing something for 
400ms between message sends, the thread dumps would show it, but they don't. 
All they show is the {{Client Message Dispatcher}} attempting to write to the 
socket.

Unfortunately, there isn't a whole lot of logging on the server in this area, 
so I'm not sure server fine logs will tell us much.


> Geode Version: 1.1.1    In one of the project we a...
> -----------------------------------------------------
>
>                 Key: GEODE-3709
>                 URL: https://issues.apache.org/jira/browse/GEODE-3709
>             Project: Geode
>          Issue Type: Improvement
>          Components: client queues
>            Reporter: Gregory Chase
>         Attachments: 20171006-logs-stats-tds.zip, 20171020.zip, 
> CacheClientProxyStats_sentBytes.gif, 
> DistributionStats_receivedBytes_CacheClientProxyStats_sentBytes.gif, 
> gf-rest-stats-12-05.gfs, myStatisticsArchiveFile-04-01.gfs
>
>
> Geode Version: 1.1.1
> In one of the project we are using Geode. Here is a summary of how we use it.
> - Geode servers have multiple regions. 
> - Clients subscribe to the data from these regions.
> - Clients subscribe interest in all the entries, therefore they get updates 
> about all the entries from creation to modification to deletion.
> - One of the regions usually has 5-10 million entries with a TTL of 24 hours. 
> Most entries are added in an hour's span one after other. So when TTL kicks 
> in, they are often destroyed in an hour.
> Problem:
> Every now and then we observe following message: 
>       Client queue for 
> _gfe_non_durable_client_with_id_x.x.x.x(14229:loner):42754:e4266fc4_2_queue 
> client is full.
> This seems to happen when the TTL kicks in on the region with 5-10 million 
> entries. Entries start getting evicted (deleted); the updates (destroys) now 
> must be sent to clients. We see that the updates do happen for a while but 
> suddenly the updates stop and the queue size starts growing. This is becoming 
> a major issue for smooth functioning of our production setup. Any help will 
> be much appreciated. 
> I did some ground work by downloading and looking at the code. I see 
> reference to 2 issues #37581, #51400. But I am unable to view actual JIRA 
> tickets (needs login credentials) Hopefully, it helps someone looking at the 
> issue.
> Here is the pertinent code:
>    @Override
>     @edu.umd.cs.findbugs.annotations.SuppressWarnings("TLW_TWO_LOCK_WAIT")
>     void checkQueueSizeConstraint() throws InterruptedException {
>       if (this.haContainer instanceof HAContainerMap && isPrimary()) { // Fix 
> for bug 39413
>         if (Thread.interrupted())
>           throw new InterruptedException();
>         synchronized (this.putGuard) {
>           if (putPermits <= 0) {
>             synchronized (this.permitMon) {
>               if (reconcilePutPermits() <= 0) {
>                 if 
> (region.getSystem().getConfig().getRemoveUnresponsiveClient()) {
>                   isClientSlowReciever = true;
>                 } else {
>                   try {
>                     long logFrequency = 
> CacheClientNotifier.DEFAULT_LOG_FREQUENCY;
>                     CacheClientNotifier ccn = 
> CacheClientNotifier.getInstance();
>                     if (ccn != null) { // check needed for junit tests
>                       logFrequency = ccn.getLogFrequency();
>                     }
>                     if ((this.maxQueueSizeHitCount % logFrequency) == 0) {
>                       logger.warn(LocalizedMessage.create(
>                           
> LocalizedStrings.HARegionQueue_CLIENT_QUEUE_FOR_0_IS_FULL,
>                           new Object[] {region.getName()}));
>                       this.maxQueueSizeHitCount = 0;
>                     }
>                     ++this.maxQueueSizeHitCount;
>                     this.region.checkReadiness(); // fix for bug 37581
>                     // TODO: wait called while holding two locks
>                     
> this.permitMon.wait(CacheClientNotifier.eventEnqueueWaitTime);
>                     this.region.checkReadiness(); // fix for bug 37581
>                     // Fix for #51400. Allow the queue to grow beyond its
>                     // capacity/maxQueueSize, if it is taking a long time to
>                     // drain the queue, either due to a slower client or the
>                     // deadlock scenario mentioned in the ticket.
>                     reconcilePutPermits();
>                     if ((this.maxQueueSizeHitCount % logFrequency) == 1) {
>                       logger.info(LocalizedMessage
>                           
> .create(LocalizedStrings.HARegionQueue_RESUMING_WITH_PROCESSING_PUTS));
>                     }
>                   } catch (InterruptedException ex) {
>                     // TODO: The line below is meaningless. Comment it out 
> later
>                     this.permitMon.notifyAll();
>                     throw ex;
>                   }
>                 }
>               }
>             } // synchronized (this.permitMon)
>           } // if (putPermits <= 0)
>           --putPermits;
>         } // synchronized (this.putGuard)
>       }
>     }
> *Reporter*: Mangesh Deshmukh
> *E-mail*: [mailto:[email protected]]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to