[
https://issues.apache.org/jira/browse/GEODE-3709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219788#comment-16219788
]
Barry Oglesby commented on GEODE-3709:
--------------------------------------
The TCP dump shows similar messages. Here are some examples from the TCP dump of
server-to-client messages and the corresponding client-to-server ACKs during the
11:03:38 second:
{noformat}
284220 339.961818 10.55.13.77 10.55.12.224 TCP 224 [TCP
segment of a reassembled PDU]
Arrival Time: Oct 20, 2017 11:03:38.622066000 PDT
284221 339.961847 10.55.12.224 10.55.13.77 TCP 66 43482 →
40404 [ACK] Seq=1 Ack=29245611 Win=30408 Len=0 TSval=45105917 TSecr=45105911
Arrival Time: Oct 20, 2017 11:03:38.622095000 PDT
[This is an ACK to the segment in frame: 284220]
{noformat}
{noformat}
284222 339.961933 10.55.13.77 10.55.12.224 TCP 225 [TCP
segment of a reassembled PDU]
Arrival Time: Oct 20, 2017 11:03:38.622181000 PDT
284223 339.961944 10.55.12.224 10.55.13.77 TCP 66 43482 →
40404 [ACK] Seq=1 Ack=29245770 Win=30408 Len=0 TSval=45105917 TSecr=45105911
Arrival Time: Oct 20, 2017 11:03:38.622192000 PDT
[This is an ACK to the segment in frame: 284222]
{noformat}
{noformat}
284224 339.962202 10.55.13.77 10.55.12.224 TCP 225 [TCP
segment of a reassembled PDU]
Arrival Time: Oct 20, 2017 11:03:38.622450000 PDT
284225 339.962232 10.55.12.224 10.55.13.77 TCP 66 43482 →
40404 [ACK] Seq=1 Ack=29245929 Win=30408 Len=0 TSval=45105917 TSecr=45105911
Arrival Time: Oct 20, 2017 11:03:38.622480000 PDT
[This is an ACK to the segment in frame: 284224]
{noformat}
At this point and before, the TCP dump is basically a series of
send-ack, send-ack, etc. Then it changes.
Starting with frame 284272, there are a bunch of server message sends with no
corresponding client ACK. There are a few client ACKs, but not many.
{noformat}
284272 340.101627 10.55.13.77 10.55.12.224 TCP 383 [TCP
segment of a reassembled PDU]
Arrival Time: Oct 20, 2017 11:03:38.761875000 PDT
284273 340.101665 10.55.13.77 10.55.12.224 TCP 383 [TCP
segment of a reassembled PDU]
Arrival Time: Oct 20, 2017 11:03:38.761913000 PDT
284274 340.101700 10.55.13.77 10.55.12.224 TCP 225 [TCP
segment of a reassembled PDU]
Arrival Time: Oct 20, 2017 11:03:38.761948000 PDT
284275 340.101728 10.55.13.77 10.55.12.224 TCP 225 [TCP
segment of a reassembled PDU]
Arrival Time: Oct 20, 2017 11:03:38.761976000 PDT
{noformat}
That behavior continues all the way until frame 284751, when there is a TCP
Retransmission:
{noformat}
284751 340.124070 10.55.13.77 10.55.12.224 TCP 224 [TCP
Retransmission] 40404 → 43482 [PSH, ACK] Seq=29322581 Ack=1 Win=14480 Len=158
TSval=45105952 TSecr=45105955
Arrival Time: Oct 20, 2017 11:03:38.784318000 PDT
{noformat}
From then on, there are server message send / client ACK pairs, but between
each pair there is ~200ms (as you noted). This continues for the remainder of
the dump. (A throwaway checker for verifying that cadence is sketched after
these excerpts.)
{noformat}
284756 340.328147 10.55.13.77 10.55.12.224 TCP 133 [TCP
segment of a reassembled PDU]
Arrival Time: Oct 20, 2017 11:03:38.988395000 PDT
284757 340.328178 10.55.12.224 10.55.13.77 TCP 66 43482 →
40404 [ACK] Seq=1 Ack=29322806 Win=67 Len=0 TSval=45106009 TSecr=45106003
Arrival Time: Oct 20, 2017 11:03:38.988426000 PDT
[This is an ACK to the segment in frame: 284756]
{noformat}
{noformat}
284758 340.532096 10.55.13.77 10.55.12.224 TCP 133 [TCP
segment of a reassembled PDU]
Arrival Time: Oct 20, 2017 11:03:39.192344000 PDT
284759 340.532122 10.55.12.224 10.55.13.77 TCP 66 43482 →
40404 [ACK] Seq=1 Ack=29322873 Win=67 Len=0 TSval=45106060 TSecr=45106054
Arrival Time: Oct 20, 2017 11:03:39.192370000 PDT
[This is an ACK to the segment in frame: 284758]
{noformat}
{noformat}
284760 340.736142 10.55.13.77 10.55.12.224 TCP 133 [TCP
segment of a reassembled PDU]
Arrival Time: Oct 20, 2017 11:03:39.396390000 PDT
284761 340.736262 10.55.12.224 10.55.13.77 TCP 66 43482 →
40404 [ACK] Seq=1 Ack=29322940 Win=67 Len=0 TSval=45106111 TSecr=45106105
Arrival Time: Oct 20, 2017 11:03:39.396510000 PDT
[This is an ACK to the segment in frame: 284760]
{noformat}
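As an aside, here is a minimal throwaway checker (not Geode code) for confirming the gap between frames and the advertised window from a pasted Wireshark/tshark summary like the excerpts above. It assumes each frame line starts with "<frame> <relative-time> <src> <dst> TCP ..." and that a "Win=<n>" field may spill onto the wrapped continuation line; the class name FrameGapChecker is just for illustration.
{noformat}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Throwaway helper (not Geode code): reads a pasted Wireshark summary on stdin
// and prints the gap between consecutive frames plus any advertised window.
public class FrameGapChecker {
  // "<frame> <relative-time> <src> <dst> TCP ..."
  private static final Pattern FRAME =
      Pattern.compile("^\\s*(\\d+)\\s+(\\d+\\.\\d+)\\s+(\\S+)\\s+(\\S+)\\s+TCP.*");
  private static final Pattern WIN = Pattern.compile("Win=(\\d+)");

  public static void main(String[] args) throws IOException {
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    double lastTime = -1;
    String line;
    while ((line = in.readLine()) != null) {
      Matcher frame = FRAME.matcher(line);
      Matcher win = WIN.matcher(line);
      if (frame.matches()) {
        double time = Double.parseDouble(frame.group(2));
        double gapMs = lastTime < 0 ? 0 : (time - lastTime) * 1000.0;
        System.out.printf("frame=%s src=%s gap=%.1fms", frame.group(1), frame.group(3), gapMs);
        if (win.find()) {
          System.out.printf(" win=%s", win.group(1));
        }
        System.out.println();
        lastTime = time;
      } else if (win.find()) {
        // The paste wraps long lines, so Win= often lands on the continuation line.
        System.out.printf("  (win=%s for the previous frame)%n", win.group(1));
      }
    }
  }
}
{noformat}
Feeding the frames above through it should show the ~200ms gaps between the server sends once the slowdown starts, along with the window values discussed next.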
Another interesting thing is that the window size advertised by the client drops
from 30408 to 67 (right when the issue starts). The window size advertised by
the server stays steady at 14480.
There are a lot of messages where Win=30408, like:
{noformat}
284034 339.946179 10.55.12.224 10.55.13.77 TCP 66 43482 →
40404 [ACK] Seq=1 Ack=29230500 Win=30408 Len=0 TSval=45105913 TSecr=45105907
{noformat}
But then it starts to fluctuate (mainly dropping). The drop corresponds to when
messages were being sent but not ACKed.
{noformat}
284298 340.102442 10.55.12.224 10.55.13.77 TCP 66 43482 →
40404 [ACK] Seq=1 Ack=29253860 Win=25808 Len=0 TSval=45105952 TSecr=45105946
284323 340.103121 10.55.12.224 10.55.13.77 TCP 66 43482 →
40404 [ACK] Seq=1 Ack=29257681 Win=21987 Len=0 TSval=45105952 TSecr=45105946
284368 340.104484 10.55.12.224 10.55.13.77 TCP 66 43482 →
40404 [ACK] Seq=1 Ack=29264214 Win=19849 Len=0 TSval=45105953 TSecr=45105947
284402 340.105416 10.55.12.224 10.55.13.77 TCP 66 43482 →
40404 [ACK] Seq=1 Ack=29268986 Win=17265 Len=0 TSval=45105953 TSecr=45105947
284410 340.105723 10.55.12.224 10.55.13.77 TCP 66 43482 →
40404 [ACK] Seq=1 Ack=29270255 Win=15996 Len=0 TSval=45105953 TSecr=45105947
284487 340.107882 10.55.12.224 10.55.13.77 TCP 66 43482 →
40404 [ACK] Seq=1 Ack=29281545 Win=10920 Len=0 TSval=45105954 TSecr=45105947
284632 340.112171 10.55.12.224 10.55.13.77 TCP 66 43482 →
40404 [ACK] Seq=1 Ack=29303160 Win=8688 Len=0 TSval=45105955 TSecr=45105948
284675 340.113402 10.55.12.224 10.55.13.77 TCP 66 43482 →
40404 [ACK] Seq=1 Ack=29310195 Win=4349 Len=0 TSval=45105955 TSecr=45105949
284752 340.124096 10.55.12.224 10.55.13.77 TCP 78 43482 →
40404 [ACK] Seq=1 Ack=29322739 Win=67 Len=0 TSval=45105958 TSecr=45105949
SLE=29322581 SRE=29322739
{noformat}
The window stays at 67 for the remainder of the file. I'm not sure what this
means, but it seems significant.
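In general TCP terms (not specific to Geode), an advertised window that collapses like that usually means the receiver's socket buffer is full because the application on top of it is not reading fast enough, so the sender gets flow-controlled down to tiny sends. Below is a minimal, self-contained sketch of that effect; it is not Geode code, and the port number and class name are made up for illustration. Running it and watching loopback traffic in Wireshark should show the same pattern: the receiver's window shrinking toward zero and the sender trickling small segments at the reader's pace.
{noformat}
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

// Standalone demo (not Geode code): a sender that writes as fast as it can and
// a receiver that reads tiny amounts every 200ms. The receiver's socket buffer
// fills, its advertised window collapses, and the sender is throttled.
public class SlowReaderDemo {
  public static void main(String[] args) throws Exception {
    final int port = 40405; // arbitrary port for the demo
    final ServerSocket listener = new ServerSocket(port); // bind before connecting

    Thread sender = new Thread(() -> {
      try (Socket s = listener.accept()) {
        OutputStream out = s.getOutputStream();
        byte[] chunk = new byte[1024];
        while (true) {
          out.write(chunk); // blocks once the receiver's window is exhausted
        }
      } catch (Exception e) {
        // the demo ends when the slow reader closes its socket
      }
    });
    sender.setDaemon(true); // let the JVM exit when the reader finishes
    sender.start();

    try (Socket client = new Socket("127.0.0.1", port)) {
      InputStream in = client.getInputStream();
      byte[] buf = new byte[64];
      for (int i = 0; i < 300; i++) { // ~60 seconds of slow reading
        in.read(buf);      // drain only 64 bytes...
        Thread.sleep(200); // ...every 200ms, like a slow client
      }
    }
  }
}
{noformat}
If something similar is happening here, it would be consistent with the reported behavior: the client side stops draining its connection, the server can no longer push events, and the server-side queue backs up until the "client queue is full" warning appears. That is an inference from the dump, not something the dump proves on its own.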
> Geode Version: 1.1.1 In one of the project we a...
> -----------------------------------------------------
>
> Key: GEODE-3709
> URL: https://issues.apache.org/jira/browse/GEODE-3709
> Project: Geode
> Issue Type: Improvement
> Components: client queues
> Reporter: Gregory Chase
> Attachments: 20171006-logs-stats-tds.zip, 20171020.zip,
> CacheClientProxyStats_sentBytes.gif,
> DistributionStats_receivedBytes_CacheClientProxyStats_sentBytes.gif,
> gf-rest-stats-12-05.gfs, myStatisticsArchiveFile-04-01.gfs
>
>
> Geode Version: 1.1.1
> In one of our projects we are using Geode. Here is a summary of how we use it.
> - Geode servers have multiple regions.
> - Clients subscribe to the data from these regions.
> - Clients register interest in all the entries, therefore they get updates
> about all the entries, from creation to modification to deletion.
> - One of the regions usually has 5-10 million entries with a TTL of 24 hours.
> Most entries are added within an hour's span, one after the other, so when the
> TTL kicks in, they are often destroyed within an hour.
> Problem:
> Every now and then we observe the following message:
> Client queue for
> _gfe_non_durable_client_with_id_x.x.x.x(14229:loner):42754:e4266fc4_2_queue
> client is full.
> This seems to happen when the TTL kicks in on the region with 5-10 million
> entries. Entries start getting evicted (deleted); the updates (destroys) now
> must be sent to clients. We see that the updates do happen for a while, but
> then they suddenly stop and the queue size starts growing. This is becoming a
> major issue for the smooth functioning of our production setup. Any help will
> be much appreciated.
> I did some groundwork by downloading and looking at the code. I see references
> to 2 issues, #37581 and #51400, but I am unable to view the actual JIRA tickets
> (they need login credentials). Hopefully this helps someone looking at the
> issue.
> Here is the pertinent code:
> @Override
> @edu.umd.cs.findbugs.annotations.SuppressWarnings("TLW_TWO_LOCK_WAIT")
> void checkQueueSizeConstraint() throws InterruptedException {
>   if (this.haContainer instanceof HAContainerMap && isPrimary()) { // Fix for bug 39413
>     if (Thread.interrupted())
>       throw new InterruptedException();
>     synchronized (this.putGuard) {
>       if (putPermits <= 0) {
>         synchronized (this.permitMon) {
>           if (reconcilePutPermits() <= 0) {
>             if (region.getSystem().getConfig().getRemoveUnresponsiveClient()) {
>               isClientSlowReciever = true;
>             } else {
>               try {
>                 long logFrequency = CacheClientNotifier.DEFAULT_LOG_FREQUENCY;
>                 CacheClientNotifier ccn = CacheClientNotifier.getInstance();
>                 if (ccn != null) { // check needed for junit tests
>                   logFrequency = ccn.getLogFrequency();
>                 }
>                 if ((this.maxQueueSizeHitCount % logFrequency) == 0) {
>                   logger.warn(LocalizedMessage.create(
>                       LocalizedStrings.HARegionQueue_CLIENT_QUEUE_FOR_0_IS_FULL,
>                       new Object[] {region.getName()}));
>                   this.maxQueueSizeHitCount = 0;
>                 }
>                 ++this.maxQueueSizeHitCount;
>                 this.region.checkReadiness(); // fix for bug 37581
>                 // TODO: wait called while holding two locks
>                 this.permitMon.wait(CacheClientNotifier.eventEnqueueWaitTime);
>                 this.region.checkReadiness(); // fix for bug 37581
>                 // Fix for #51400. Allow the queue to grow beyond its
>                 // capacity/maxQueueSize, if it is taking a long time to
>                 // drain the queue, either due to a slower client or the
>                 // deadlock scenario mentioned in the ticket.
>                 reconcilePutPermits();
>                 if ((this.maxQueueSizeHitCount % logFrequency) == 1) {
>                   logger.info(LocalizedMessage
>                       .create(LocalizedStrings.HARegionQueue_RESUMING_WITH_PROCESSING_PUTS));
>                 }
>               } catch (InterruptedException ex) {
>                 // TODO: The line below is meaningless. Comment it out later
>                 this.permitMon.notifyAll();
>                 throw ex;
>               }
>             }
>           }
>         } // synchronized (this.permitMon)
>       } // if (putPermits <= 0)
>       --putPermits;
>     } // synchronized (this.putGuard)
>   }
> }
> *Reporter*: Mangesh Deshmukh
> *E-mail*: [mailto:[email protected]]
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)