[
https://issues.apache.org/jira/browse/GEODE-3709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219788#comment-16219788
]
Barry Oglesby commented on GEODE-3709:
--------------------------------------
The TCP dump shows similar messages. Here are some examples from the TCP dump of
server-to-client messages and the corresponding client-to-server ACKs during the
11:03:38 second:
{noformat}
284220 339.961818 10.55.13.77 10.55.12.224 TCP 224 [TCP
segment of a reassembled PDU]
Arrival Time: Oct 20, 2017 11:03:38.622066000 PDT
284221 339.961847 10.55.12.224 10.55.13.77 TCP 66 43482 →
40404 [ACK] Seq=1 Ack=29245611 Win=30408 Len=0 TSval=45105917 TSecr=45105911
Arrival Time: Oct 20, 2017 11:03:38.622095000 PDT
[This is an ACK to the segment in frame: 284220]
{noformat}
{noformat}
284222 339.961933 10.55.13.77 10.55.12.224 TCP 225 [TCP
segment of a reassembled PDU]
Arrival Time: Oct 20, 2017 11:03:38.622181000 PDT
284223 339.961944 10.55.12.224 10.55.13.77 TCP 66 43482 →
40404 [ACK] Seq=1 Ack=29245770 Win=30408 Len=0 TSval=45105917 TSecr=45105911
Arrival Time: Oct 20, 2017 11:03:38.622192000 PDT
[This is an ACK to the segment in frame: 284222]
{noformat}
{noformat}
284224 339.962202 10.55.13.77 10.55.12.224 TCP 225 [TCP
segment of a reassembled PDU]
Arrival Time: Oct 20, 2017 11:03:38.622450000 PDT
284225 339.962232 10.55.12.224 10.55.13.77 TCP 66 43482 →
40404 [ACK] Seq=1 Ack=29245929 Win=30408 Len=0 TSval=45105917 TSecr=45105911
Arrival Time: Oct 20, 2017 11:03:38.622480000 PDT
[This is an ACK to the segment in frame: 284224]
{noformat}
At this point and before, the TCP dump is basically a series of
send-ack, send-ack, etc. Then it changes.
Starting with frame 284272, there are a bunch of server message sends with no
corresponding client ACK. There are a few client ACKs, but not many.
{noformat}
284272 340.101627 10.55.13.77 10.55.12.224 TCP 383 [TCP
segment of a reassembled PDU]
Arrival Time: Oct 20, 2017 11:03:38.761875000 PDT
284273 340.101665 10.55.13.77 10.55.12.224 TCP 383 [TCP
segment of a reassembled PDU]
Arrival Time: Oct 20, 2017 11:03:38.761913000 PDT
284274 340.101700 10.55.13.77 10.55.12.224 TCP 225 [TCP
segment of a reassembled PDU]
Arrival Time: Oct 20, 2017 11:03:38.761948000 PDT
284275 340.101728 10.55.13.77 10.55.12.224 TCP 225 [TCP
segment of a reassembled PDU]
Arrival Time: Oct 20, 2017 11:03:38.761976000 PDT
{noformat}
That behavior continues all the way until frame 284751, when there is a TCP
Retransmission:
{noformat}
284751 340.124070 10.55.13.77 10.55.12.224 TCP 224 [TCP
Retransmission] 40404 → 43482 [PSH, ACK] Seq=29322581 Ack=1 Win=14480 Len=158
TSval=45105952 TSecr=45105955
Arrival Time: Oct 20, 2017 11:03:38.784318000 PDT
{noformat}
From then on, there are server message send / client ACK pairs, but between
each pair there is ~200ms (as you noted). This continues for the remainder of
the dump. (A throwaway checker for verifying that cadence is sketched after
these excerpts.)
{noformat}
284756 340.328147 10.55.13.77 10.55.12.224 TCP 133 [TCP
segment of a reassembled PDU]
Arrival Time: Oct 20, 2017 11:03:38.988395000 PDT
284757 340.328178 10.55.12.224 10.55.13.77 TCP 66 43482 →
40404 [ACK] Seq=1 Ack=29322806 Win=67 Len=0 TSval=45106009 TSecr=45106003
Arrival Time: Oct 20, 2017 11:03:38.988426000 PDT
[This is an ACK to the segment in frame: 284756]
{noformat}
{noformat}
284758 340.532096 10.55.13.77 10.55.12.224 TCP 133 [TCP
segment of a reassembled PDU]
Arrival Time: Oct 20, 2017 11:03:39.192344000 PDT
284759 340.532122 10.55.12.224 10.55.13.77 TCP 66 43482 →
40404 [ACK] Seq=1 Ack=29322873 Win=67 Len=0 TSval=45106060 TSecr=45106054
Arrival Time: Oct 20, 2017 11:03:39.192370000 PDT
[This is an ACK to the segment in frame: 284758]
{noformat}
{noformat}
284760 340.736142 10.55.13.77 10.55.12.224 TCP 133 [TCP
segment of a reassembled PDU]
Arrival Time: Oct 20, 2017 11:03:39.396390000 PDT
284761 340.736262 10.55.12.224 10.55.13.77 TCP 66 43482 →
40404 [ACK] Seq=1 Ack=29322940 Win=67 Len=0 TSval=45106111 TSecr=45106105
Arrival Time: Oct 20, 2017 11:03:39.396510000 PDT
[This is an ACK to the segment in frame: 284760]
{noformat}
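As an aside, here is a minimal throwaway checker (not Geode code) for confirming the gap between frames and the advertised window from a pasted Wireshark/tshark summary like the excerpts above. It assumes each frame line starts with "<frame> <relative-time> <src> <dst> TCP ..." and that a "Win=<n>" field may spill onto the wrapped continuation line; the class name FrameGapChecker is just for illustration.
{noformat}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Throwaway helper (not Geode code): reads a pasted Wireshark summary on stdin
// and prints the gap between consecutive frames plus any advertised window.
public class FrameGapChecker {
  // "<frame> <relative-time> <src> <dst> TCP ..."
  private static final Pattern FRAME =
      Pattern.compile("^\\s*(\\d+)\\s+(\\d+\\.\\d+)\\s+(\\S+)\\s+(\\S+)\\s+TCP.*");
  private static final Pattern WIN = Pattern.compile("Win=(\\d+)");

  public static void main(String[] args) throws IOException {
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    double lastTime = -1;
    String line;
    while ((line = in.readLine()) != null) {
      Matcher frame = FRAME.matcher(line);
      Matcher win = WIN.matcher(line);
      if (frame.matches()) {
        double time = Double.parseDouble(frame.group(2));
        double gapMs = lastTime < 0 ? 0 : (time - lastTime) * 1000.0;
        System.out.printf("frame=%s src=%s gap=%.1fms", frame.group(1), frame.group(3), gapMs);
        if (win.find()) {
          System.out.printf(" win=%s", win.group(1));
        }
        System.out.println();
        lastTime = time;
      } else if (win.find()) {
        // The paste wraps long lines, so Win= often lands on the continuation line.
        System.out.printf("  (win=%s for the previous frame)%n", win.group(1));
      }
    }
  }
}
{noformat}
Feeding the frames above through it should show the ~200ms gaps between the server sends once the slowdown starts, along with the window values discussed next.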
Another interesting thing is that the window size advertised by the client drops
from 30408 to 67 (right when the issue starts). The window size advertised by
the server stays steady at 14480.
There are a lot of messages where Win=30408, like:
{noformat}
284034 339.946179 10.55.12.224 10.55.13.77 TCP 66 43482 →
40404 [ACK] Seq=1 Ack=29230500 Win=30408 Len=0 TSval=45105913 TSecr=45105907
{noformat}
But then it starts to fluctuate (mainly dropping). The drop corresponds to when
messages were being sent but not ACKed.
{noformat}
284298 340.102442 10.55.12.224 10.55.13.77 TCP 66 43482 →
40404 [ACK] Seq=1 Ack=29253860 Win=25808 Len=0 TSval=45105952 TSecr=45105946
284323 340.103121 10.55.12.224 10.55.13.77 TCP 66 43482 →
40404 [ACK] Seq=1 Ack=29257681 Win=21987 Len=0 TSval=45105952 TSecr=45105946
284368 340.104484 10.55.12.224 10.55.13.77 TCP 66 43482 →
40404 [ACK] Seq=1 Ack=29264214 Win=19849 Len=0 TSval=45105953 TSecr=45105947
284402 340.105416 10.55.12.224 10.55.13.77 TCP 66 43482 →
40404 [ACK] Seq=1 Ack=29268986 Win=17265 Len=0 TSval=45105953 TSecr=45105947
284410 340.105723 10.55.12.224 10.55.13.77 TCP 66 43482 →
40404 [ACK] Seq=1 Ack=29270255 Win=15996 Len=0 TSval=45105953 TSecr=45105947
284487 340.107882 10.55.12.224 10.55.13.77 TCP 66 43482 →
40404 [ACK] Seq=1 Ack=29281545 Win=10920 Len=0 TSval=45105954 TSecr=45105947
284632 340.112171 10.55.12.224 10.55.13.77 TCP 66 43482 →
40404 [ACK] Seq=1 Ack=29303160 Win=8688 Len=0 TSval=45105955 TSecr=45105948
284675 340.113402 10.55.12.224 10.55.13.77 TCP 66 43482 →
40404 [ACK] Seq=1 Ack=29310195 Win=4349 Len=0 TSval=45105955 TSecr=45105949
284752 340.124096 10.55.12.224 10.55.13.77 TCP 78 43482 →
40404 [ACK] Seq=1 Ack=29322739 Win=67 Len=0 TSval=45105958 TSecr=45105949
SLE=29322581 SRE=29322739
{noformat}
The window stays at 67 for the remainder of the file. I'm not sure what this
means, but it seems significant.
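In general TCP terms (not specific to Geode), an advertised window that collapses like that usually means the receiver's socket buffer is full because the application on top of it is not reading fast enough, so the sender gets flow-controlled down to tiny sends. Below is a minimal, self-contained sketch of that effect; it is not Geode code, and the port number and class name are made up for illustration. Running it and watching loopback traffic in Wireshark should show the same pattern: the receiver's window shrinking toward zero and the sender trickling small segments at the reader's pace.
{noformat}
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

// Standalone demo (not Geode code): a sender that writes as fast as it can and
// a receiver that reads tiny amounts every 200ms. The receiver's socket buffer
// fills, its advertised window collapses, and the sender is throttled.
public class SlowReaderDemo {
  public static void main(String[] args) throws Exception {
    final int port = 40405; // arbitrary port for the demo
    final ServerSocket listener = new ServerSocket(port); // bind before connecting

    Thread sender = new Thread(() -> {
      try (Socket s = listener.accept()) {
        OutputStream out = s.getOutputStream();
        byte[] chunk = new byte[1024];
        while (true) {
          out.write(chunk); // blocks once the receiver's window is exhausted
        }
      } catch (Exception e) {
        // the demo ends when the slow reader closes its socket
      }
    });
    sender.setDaemon(true); // let the JVM exit when the reader finishes
    sender.start();

    try (Socket client = new Socket("127.0.0.1", port)) {
      InputStream in = client.getInputStream();
      byte[] buf = new byte[64];
      for (int i = 0; i < 300; i++) { // ~60 seconds of slow reading
        in.read(buf);      // drain only 64 bytes...
        Thread.sleep(200); // ...every 200ms, like a slow client
      }
    }
  }
}
{noformat}
If something similar is happening here, it would be consistent with the reported behavior: the client side stops draining its connection, the server can no longer push events, and the server-side queue backs up until the "client queue is full" warning appears. That is an inference from the dump, not something the dump proves on its own.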
> Geode Version: 1.1.1 In one of the project we a...
> -----------------------------------------------------
>
> Key: GEODE-3709
> URL: https://issues.apache.org/jira/browse/GEODE-3709
> Project: Geode
> Issue Type: Improvement
> Components: client queues
> Reporter: Gregory Chase
> Attachments: 20171006-logs-stats-tds.zip, 20171020.zip,
> CacheClientProxyStats_sentBytes.gif,
> DistributionStats_receivedBytes_CacheClientProxyStats_sentBytes.gif,
> gf-rest-stats-12-05.gfs, myStatisticsArchiveFile-04-01.gfs
>
>
> Geode Version: 1.1.1
> In one of our projects we are using Geode. Here is a summary of how we use it.
> - Geode servers have multiple regions.
> - Clients subscribe to the data from these regions.
> - Clients register interest in all the entries, therefore they get updates
> about all the entries, from creation to modification to deletion.
> - One of the regions usually has 5-10 million entries with a TTL of 24 hours.
> Most entries are added within an hour's span, one after the other, so when the
> TTL kicks in, they are often destroyed within an hour.
> Problem:
> Every now and then we observe the following message:
> Client queue for
> _gfe_non_durable_client_with_id_x.x.x.x(14229:loner):42754:e4266fc4_2_queue
> client is full.
> This seems to happen when the TTL kicks in on the region with 5-10 million
> entries. Entries start getting evicted (deleted); the updates (destroys) now
> must be sent to clients. We see that the updates do happen for a while, but
> then they suddenly stop and the queue size starts growing. This is becoming a
> major issue for the smooth functioning of our production setup. Any help will
> be much appreciated.
> I did some groundwork by downloading and looking at the code. I see references
> to 2 issues, #37581 and #51400, but I am unable to view the actual JIRA tickets
> (they need login credentials). Hopefully this helps someone looking at the
> issue.
> Here is the pertinent code:
> @Override
> @edu.umd.cs.findbugs.annotations.SuppressWarnings("TLW_TWO_LOCK_WAIT")
> void checkQueueSizeConstraint() throws InterruptedException {
>   if (this.haContainer instanceof HAContainerMap && isPrimary()) { // Fix for bug 39413
>     if (Thread.interrupted())
>       throw new InterruptedException();
>     synchronized (this.putGuard) {
>       if (putPermits <= 0) {
>         synchronized (this.permitMon) {
>           if (reconcilePutPermits() <= 0) {
>             if (region.getSystem().getConfig().getRemoveUnresponsiveClient()) {
>               isClientSlowReciever = true;
>             } else {
>               try {
>                 long logFrequency = CacheClientNotifier.DEFAULT_LOG_FREQUENCY;
>                 CacheClientNotifier ccn = CacheClientNotifier.getInstance();
>                 if (ccn != null) { // check needed for junit tests
>                   logFrequency = ccn.getLogFrequency();
>                 }
>                 if ((this.maxQueueSizeHitCount % logFrequency) == 0) {
>                   logger.warn(LocalizedMessage.create(
>                       LocalizedStrings.HARegionQueue_CLIENT_QUEUE_FOR_0_IS_FULL,
>                       new Object[] {region.getName()}));
>                   this.maxQueueSizeHitCount = 0;
>                 }
>                 ++this.maxQueueSizeHitCount;
>                 this.region.checkReadiness(); // fix for bug 37581
>                 // TODO: wait called while holding two locks
>                 this.permitMon.wait(CacheClientNotifier.eventEnqueueWaitTime);
>                 this.region.checkReadiness(); // fix for bug 37581
>                 // Fix for #51400. Allow the queue to grow beyond its
>                 // capacity/maxQueueSize, if it is taking a long time to
>                 // drain the queue, either due to a slower client or the
>                 // deadlock scenario mentioned in the ticket.
>                 reconcilePutPermits();
>                 if ((this.maxQueueSizeHitCount % logFrequency) == 1) {
>                   logger.info(LocalizedMessage
>                       .create(LocalizedStrings.HARegionQueue_RESUMING_WITH_PROCESSING_PUTS));
>                 }
>               } catch (InterruptedException ex) {
>                 // TODO: The line below is meaningless. Comment it out later
>                 this.permitMon.notifyAll();
>                 throw ex;
>               }
>             }
>           }
>         } // synchronized (this.permitMon)
>       } // if (putPermits <= 0)
>       --putPermits;
>     } // synchronized (this.putGuard)
>   }
> }
> *Reporter*: Mangesh Deshmukh
> *E-mail*: [mailto:[email protected]]
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)