[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473648#comment-16473648 ]

Hadoop QA commented on ZOOKEEPER-2770:
--------------------------------------

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12866052/ZOOKEEPER-2770.003.patch
against trunk revision 2fa315b7d0ed65828479fcdcc9e76ca8552fba4a.

    +1 @author. The patch does not contain any @author tags.

    -1 tests included. The patch doesn't appear to include any new or modified tests.
    Please justify why no new tests are needed for this patch. Also please list what
    manual steps were performed to verify this patch.

    -1 patch. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3691//console

This message is automatically generated.

> ZooKeeper slow operation log
> ----------------------------
>
>                 Key: ZOOKEEPER-2770
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
>             Project: ZooKeeper
>          Issue Type: Improvement
>            Reporter: Karan Mehta
>            Assignee: Karan Mehta
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch,
>                      ZOOKEEPER-2770.003.patch
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> ZooKeeper is a complex distributed application. There are many reasons why
> any given read or write operation may become slow: a software bug, a protocol
> problem, a hardware issue with the commit log(s), a network issue. If the
> problem is constant it is trivial to come to an understanding of the cause.
> However in order to diagnose intermittent problems we often don't know where,
> or when, to begin looking. We need some sort of timestamped indication of the
> problem. Although ZooKeeper is not a datastore, it does persist data, and can
> suffer intermittent performance degradation, and should consider implementing
> a 'slow query' log, a feature very common to services which persist
> information on behalf of clients which may be sensitive to latency while
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally
> processing the request, that the current time minus arrival time of the
> request is beyond a configured threshold.
> Look at the HBase {{responseTooSlow}} feature for inspiration.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
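To make the proposal concrete, here is a minimal sketch of the kind of check described above. Every name in it (SlowRequestLogger, maybeLogSlowRequest, the constructor argument) is an illustrative assumption; the attached patches implement the check inside ServerStats/FinalRequestProcessor and add rate limiting on top.

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Sketch only: log client and request details when a request finishes
    // later than a configured threshold after its arrival time.
    class SlowRequestLogger {
        private static final Logger LOG = LoggerFactory.getLogger(SlowRequestLogger.class);
        private final long warnThresholdMs; // a negative value disables the check

        SlowRequestLogger(long warnThresholdMs) {
            this.warnThresholdMs = warnThresholdMs;
        }

        void maybeLogSlowRequest(String client, String request, long arrivalTimeMs) {
            long latencyMs = System.currentTimeMillis() - arrivalTimeMs;
            if (warnThresholdMs >= 0 && latencyMs > warnThresholdMs) {
                LOG.warn("Slow request from {}: {} took {} ms (threshold {} ms)",
                        client, request, latencyMs, warnThresholdMs);
            }
        }
    }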
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458304#comment-16458304 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user hanm commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/307#discussion_r184910779

--- Diff: src/java/main/org/apache/zookeeper/server/ServerStats.java ---
@@ -148,9 +174,46 @@ synchronized public void resetRequestCounters(){
         packetsReceived = 0;
         packetsSent = 0;
     }
+    synchronized public void resetNumRequestsAboveThresholdTime() {
+        numRequestsAboveThresholdTime = 0;
+    }
     synchronized public void reset() {
         resetLatency();
         resetRequestCounters();
+        resetNumRequestsAboveThresholdTime();
+    }
+
+    public void checkLatency(final ZooKeeperServer zks, Request request) {
+        long requestLatency = Time.currentElapsedTime() - request.createTime;
+        boolean enabledAndAboveThreshold = (requestWarnThresholdMs == 0) ||
+                (requestWarnThresholdMs > -1 && requestLatency > requestWarnThresholdMs);
+        if (enabledAndAboveThreshold) {
+            zks.serverStats().incNumRequestsAboveThresholdTime();
+
+            // Try acquiring lock only if not waiting
+            boolean success = waitForLoggingWarnThresholdMsg.compareAndSet(Boolean.FALSE, Boolean.TRUE);
+            if (success) {
+                LOG.warn("Request {} exceeded threshold. Took {} ms", request, requestLatency);
+                startCount = zks.serverStats().getNumRequestsAboveThresholdTime();
+                timer.schedule(new TimerTask() {
+                    @Override
+                    public void run() {
+                        long count = zks.serverStats().getNumRequestsAboveThresholdTime() - startCount;
+                        LOG.warn("Number of requests that exceeded {} ms in past {} ms: {}",
+                                requestWarnThresholdMs, delayTimeForLoggingWarnThresholdMsg, count);
+                        waitForLoggingWarnThresholdMsg.set(Boolean.FALSE);
+                    }
+                }, delayTimeForLoggingWarnThresholdMsg);
+            }
+        }
+    }
+
+    public void setDelayTimeForLoggingWarnThresholdMsg(int delay) {
+        this.delayTimeForLoggingWarnThresholdMsg = delay;
--- End diff --

I think it is worth adding a comment about this method. Probably also add
comments in the documentation about the behavior of the log and how to control
the rate using this method.
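One possible shape for the comment being requested, based purely on the behavior visible in the diff above; the wording is a suggestion, not text from the patch:

    /**
     * Sets the interval, in milliseconds, over which slow-request warnings are
     * aggregated. The first request that exceeds requestWarnThresholdMs is logged
     * immediately; subsequent slow requests are only counted, and a single summary
     * line with that count is emitted once this interval has elapsed. Smaller
     * values make the log more verbose; larger values reduce log pressure.
     */
    public void setDelayTimeForLoggingWarnThresholdMsg(int delay) {
        this.delayTimeForLoggingWarnThresholdMsg = delay;
    }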
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458305#comment-16458305 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user hanm commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/307#discussion_r184910500

--- Diff: src/java/main/org/apache/zookeeper/server/ServerStats.java ---
@@ -148,9 +174,46 @@ synchronized public void resetRequestCounters(){
         packetsReceived = 0;
         packetsSent = 0;
     }
+    synchronized public void resetNumRequestsAboveThresholdTime() {
+        numRequestsAboveThresholdTime = 0;
+    }
     synchronized public void reset() {
         resetLatency();
         resetRequestCounters();
+        resetNumRequestsAboveThresholdTime();
+    }
+
+    public void checkLatency(final ZooKeeperServer zks, Request request) {
+        long requestLatency = Time.currentElapsedTime() - request.createTime;
+        boolean enabledAndAboveThreshold = (requestWarnThresholdMs == 0) ||
+                (requestWarnThresholdMs > -1 && requestLatency > requestWarnThresholdMs);
--- End diff --

This could be simplified:
`boolean enabledAndAboveThreshold = requestWarnThresholdMs != -1 && requestLatency > requestWarnThresholdMs`
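For context, the configuration cases the original expression distinguishes can be written out explicitly. This is only a restatement of the semantics already present in the quoted diff, not a proposed replacement, and the method wrapper is made up for the sketch:

    // Restating the semantics of the quoted expression:
    //   threshold < 0  -> feature disabled, never count or log
    //   threshold == 0 -> count every request, regardless of latency
    //   threshold > 0  -> count only requests whose latency exceeds the threshold
    static boolean enabledAndAboveThreshold(long requestWarnThresholdMs, long requestLatency) {
        if (requestWarnThresholdMs < 0) {
            return false;
        }
        return requestWarnThresholdMs == 0 || requestLatency > requestWarnThresholdMs;
    }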
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458303#comment-16458303 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user hanm commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/307#discussion_r184910721

--- Diff: src/java/main/org/apache/zookeeper/server/ServerStats.java ---
@@ -148,9 +174,46 @@ synchronized public void resetRequestCounters(){
         packetsReceived = 0;
         packetsSent = 0;
     }
+    synchronized public void resetNumRequestsAboveThresholdTime() {
+        numRequestsAboveThresholdTime = 0;
+    }
     synchronized public void reset() {
         resetLatency();
         resetRequestCounters();
+        resetNumRequestsAboveThresholdTime();
+    }
+
+    public void checkLatency(final ZooKeeperServer zks, Request request) {
+        long requestLatency = Time.currentElapsedTime() - request.createTime;
+        boolean enabledAndAboveThreshold = (requestWarnThresholdMs == 0) ||
+                (requestWarnThresholdMs > -1 && requestLatency > requestWarnThresholdMs);
+        if (enabledAndAboveThreshold) {
+            zks.serverStats().incNumRequestsAboveThresholdTime();
+
+            // Try acquiring lock only if not waiting
+            boolean success = waitForLoggingWarnThresholdMsg.compareAndSet(Boolean.FALSE, Boolean.TRUE);
+            if (success) {
+                LOG.warn("Request {} exceeded threshold. Took {} ms", request, requestLatency);
+                startCount = zks.serverStats().getNumRequestsAboveThresholdTime();
+                timer.schedule(new TimerTask() {
+                    @Override
+                    public void run() {
+                        long count = zks.serverStats().getNumRequestsAboveThresholdTime() - startCount;
--- End diff --

If there are no slow requests coming in between the time the task is scheduled and
the time the task is executed, this count will be 0. It will not reflect the actual
number of requests we want to log. Maybe log the startCount instead, reset it, and
leave the task only to reset the barrier? Also, please use ScheduledService instead
of Timer task.
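A rough sketch of the direction suggested here, reading "ScheduledService" as java.util.concurrent.ScheduledExecutorService (an assumption) and logging the window's accumulated count before resetting it, which is one way to avoid the count-of-zero case. All names are illustrative; this is not the code that was eventually committed:

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.concurrent.atomic.AtomicLong;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Rate-limited slow-request logging: log the first slow request in a window,
    // then emit a single summary with that window's count when the window closes.
    class SlowRequestSummaryLogger {
        private static final Logger LOG = LoggerFactory.getLogger(SlowRequestSummaryLogger.class);

        private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        private final AtomicBoolean summaryScheduled = new AtomicBoolean(false);
        private final AtomicLong slowRequestsInWindow = new AtomicLong();
        private final long thresholdMs; // assumed counterpart of requestWarnThresholdMs
        private final long windowMs;    // assumed counterpart of delayTimeForLoggingWarnThresholdMsg

        SlowRequestSummaryLogger(long thresholdMs, long windowMs) {
            this.thresholdMs = thresholdMs;
            this.windowMs = windowMs;
        }

        void onSlowRequest(String request, long requestLatencyMs) {
            slowRequestsInWindow.incrementAndGet();
            if (summaryScheduled.compareAndSet(false, true)) {
                LOG.warn("Request {} exceeded threshold. Took {} ms", request, requestLatencyMs);
                scheduler.schedule(() -> {
                    // Report everything seen in this window (including the request above), then reset.
                    long count = slowRequestsInWindow.getAndSet(0);
                    LOG.warn("Number of requests that exceeded {} ms in past {} ms: {}",
                            thresholdMs, windowMs, count);
                    summaryScheduled.set(false);
                }, windowMs, TimeUnit.MILLISECONDS);
            }
        }
    }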
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438989#comment-16438989 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user karanmehta93 commented on the issue:

    https://github.com/apache/zookeeper/pull/307

@hanm Thank you! I have rebased the patch. The Jenkins build has been failing for a
while, and my patch doesn't introduce any new failures. Thanks once again!
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438977#comment-16438977 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user hanm commented on the issue:

    https://github.com/apache/zookeeper/pull/307

Looks like the branch has merge conflicts. Do you mind updating your patch against
the latest master? I'll review it then. @karanmehta93
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392529#comment-16392529 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user zodvik commented on the issue:

    https://github.com/apache/zookeeper/pull/307

    > reasons why any given read or write operation may become slow: a software
    > bug, a protocol problem, a hardware issue with the commit log(s), a network
    > issue. If the problem is constant it is trivial to come to an understanding
    > of the cause. However in order to diagnose intermittent problems we often
    > don't know where, or when, to begin looking. We need some sort of timestamped
    > indication of the problem

IMO, a slow query log is useful for systems where response times can vary with the
type of query (long scans, improper index, poor query plan, etc.). For the type of
issues mentioned here, we can expect all operations to be impacted and to fill up
the log.

If the intent is to get a timestamped indication of the problem, a metric reporting
mechanism will be more useful. If we have rolling-interval percentiles (per
command?) published either to `stat` or `jmx`, that would give good visibility.
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16389945#comment-16389945 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user karanmehta93 commented on the issue:

    https://github.com/apache/zookeeper/pull/307

Hello everyone,

Appreciate your efforts in reviewing this patch. @hanm @tdunning @eribeiro @skamille
Is there any possibility that the patch will get merged in (with minor changes if
required), or shall we mark this JIRA as 'never' and close this PR? Thanks!
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16263413#comment-16263413 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user karanmehta93 commented on the issue:

    https://github.com/apache/zookeeper/pull/307

    > Can we at least make it so that the process of "push a metric to a buffer,
    > have a thread that wakes up periodically and flushes information out of that
    > buffer" is usable by multiple parts of the system, instead of coupling it to
    > the one metric of request time?

I am ready to put in the effort of making this framework more generic, if you feel
it's worth the time and effort. I would also like to hear others' opinions on this.

The inspiration for this JIRA came from HBase, where it has been used to log slow
requests at the basic level. I feel that it can be useful to keep track of latency
spikes that are seen in production. @apurtell can suggest use cases.
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16261884#comment-16261884 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user skamille commented on the issue:

    https://github.com/apache/zookeeper/pull/307

I appreciate the work that you are trying to do here, but this still feels like an
incomplete approach to a problem that would be worth investing in further. It seems
like we all agree that it would be good to have clearer metrics about the
performance behavior of the internals of the system. This implements some basic
infrastructure for a metric-recording framework, coupled to the recording of one
metric. Can we at least make it so that the process of "push a metric to a buffer,
have a thread that wakes up periodically and flushes information out of that
buffer" is usable by multiple parts of the system, instead of coupling it to the
one metric of request time?
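One way to read this suggestion, sketched under the assumption that a simple named-counter buffer with a single flush thread would be enough; none of these names exist in the patch or in ZooKeeper:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.LongAdder;
    import java.util.function.BiConsumer;

    // Generic counter buffer: any part of the server can push to a named counter;
    // a single background thread periodically flushes and resets all of them.
    class PeriodicMetricBuffer {
        private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();
        private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();

        PeriodicMetricBuffer(long flushIntervalMs, BiConsumer<String, Long> sink) {
            flusher.scheduleAtFixedRate(() -> counters.forEach((name, adder) -> {
                long value = adder.sumThenReset();
                if (value > 0) {
                    sink.accept(name, value); // e.g. write a log line or update a JMX bean
                }
            }), flushIntervalMs, flushIntervalMs, TimeUnit.MILLISECONDS);
        }

        void increment(String metricName) {
            counters.computeIfAbsent(metricName, k -> new LongAdder()).increment();
        }
    }

With something like this, the slow-request path would only call increment("requestsAboveThresholdTime") and the flush thread would own the logging, keeping the buffering machinery decoupled from any single metric.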
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244971#comment-16244971 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user karanmehta93 commented on the issue:

    https://github.com/apache/zookeeper/pull/307

Ping @hanm @tdunning @eribeiro @skamille for a review if you get some time. Thanks!
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16221123#comment-16221123 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user karanmehta93 commented on the issue:

    https://github.com/apache/zookeeper/pull/307

For some reason, Jenkins didn't build it automatically. I triggered a manual build
with my latest commit SHA and the results are at
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/1123/

All the Findbugs warnings have been resolved, and the single failing test seems to
be unrelated to my change (or is possibly a flapper, since it passes on my local
machine). This commit is a squashed commit based on the earlier input provided by
everyone. Thank you!

As of now, the default value for `requestWarnThresholdMs` is 10 seconds, which is
too high. Users can set it according to the average latency values they typically
see in the server stats. The patch also limits the logging rate, so if the value is
set too low, it will print a message every 60 seconds showing the number of
high-latency requests in the past minute.

The motivation for this JIRA is given in the description at
https://issues.apache.org/jira/browse/ZOOKEEPER-2770. If you can review this PR, it
would be great. @hanm @tdunning @eribeiro @skamille

If we can get this JIRA in, the future plan is to work on @tdunning's t-digest and
use it to determine high-latency requests based on request type. Thanks!
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219981#comment-16219981 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user karanmehta93 commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/307#discussion_r147047396

--- Diff: src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java ---
@@ -81,9 +81,11 @@
     private static final Logger LOG = LoggerFactory.getLogger(FinalRequestProcessor.class);

     ZooKeeperServer zks;
+    ServerStats serverStats;
--- End diff --

Fixed it in the next commit.
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219982#comment-16219982 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user karanmehta93 commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/307#discussion_r147047400

--- Diff: src/java/main/org/apache/zookeeper/server/ServerStats.java ---
@@ -148,9 +174,46 @@ synchronized public void resetRequestCounters(){
         packetsReceived = 0;
         packetsSent = 0;
     }
+    synchronized public void resetNumRequestsAboveThresholdTime() {
+        numRequestsAboveThresholdTime = 0;
+    }
     synchronized public void reset() {
         resetLatency();
         resetRequestCounters();
+        resetNumRequestsAboveThresholdTime();
+    }
+
+    public void checkLatency(final ZooKeeperServer zks, Request request) {
+        long requestLatency = Time.currentElapsedTime() - request.createTime;
+        boolean enabledAndAboveThreshold = (requestWarnThresholdMs == 0) ||
+                (requestWarnThresholdMs > -1 && requestLatency > requestWarnThresholdMs);
+        if (enabledAndAboveThreshold) {
+            zks.serverStats().incNumRequestsAboveThresholdTime();
+
+            // Try acquiring lock only if not waiting
+            boolean success = waitForLoggingWarnThresholdMsg.compareAndSet(Boolean.FALSE, Boolean.TRUE);
+            if(success) {
--- End diff --

Fixed it in the next commit.
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219919#comment-16219919 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user tdunning commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/307#discussion_r147040783

--- Diff: src/java/main/org/apache/zookeeper/server/ServerStats.java ---
@@ -148,9 +174,46 @@ synchronized public void resetRequestCounters(){
         packetsReceived = 0;
         packetsSent = 0;
     }
+    synchronized public void resetNumRequestsAboveThresholdTime() {
+        numRequestsAboveThresholdTime = 0;
+    }
     synchronized public void reset() {
         resetLatency();
         resetRequestCounters();
+        resetNumRequestsAboveThresholdTime();
+    }
+
+    public void checkLatency(final ZooKeeperServer zks, Request request) {
+        long requestLatency = Time.currentElapsedTime() - request.createTime;
+        boolean enabledAndAboveThreshold = (requestWarnThresholdMs == 0) ||
+                (requestWarnThresholdMs > -1 && requestLatency > requestWarnThresholdMs);
+        if (enabledAndAboveThreshold) {
+            zks.serverStats().incNumRequestsAboveThresholdTime();
+
+            // Try acquiring lock only if not waiting
+            boolean success = waitForLoggingWarnThresholdMsg.compareAndSet(Boolean.FALSE, Boolean.TRUE);
+            if(success) {
--- End diff --

Need a space between if and (
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219918#comment-16219918 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user tdunning commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/307#discussion_r147040873

--- Diff: src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java ---
@@ -81,9 +81,11 @@
     private static final Logger LOG = LoggerFactory.getLogger(FinalRequestProcessor.class);

     ZooKeeperServer zks;
+    ServerStats serverStats;
--- End diff --

I think that caching this reference is a bad practice. Getting a reference every
time costs essentially nothing, and keeping a reference to a data structure that we
don't own is asking for trouble if anybody ever reallocates it.
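In practice the suggestion amounts to dropping the cached field and resolving the stats object at the call site. A sketch of the idea only; the exact call site inside FinalRequestProcessor is not shown in this thread:

    // Instead of caching the reference when the processor is constructed:
    //     serverStats = zks.serverStats();
    //     ...
    //     serverStats.checkLatency(zks, request);
    // fetch it at the point of use, so the processor never holds a stale reference:
    zks.serverStats().checkLatency(zks, request);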
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219838#comment-16219838 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

GitHub user karanmehta93 reopened a pull request:

    https://github.com/apache/zookeeper/pull/307

    ZOOKEEPER-2770 ZooKeeper slow operation log

    ZooKeeper is a complex distributed application. There are many reasons why any
    given read or write operation may become slow: a software bug, a protocol
    problem, a hardware issue with the commit log(s), a network issue. If the
    problem is constant it is trivial to come to an understanding of the cause.
    However in order to diagnose intermittent problems we often don't know where,
    or when, to begin looking. We need some sort of timestamped indication of the
    problem. Although ZooKeeper is not a datastore, it does persist data, and can
    suffer intermittent performance degradation, and should consider implementing
    a 'slow query' log, a feature very common to services which persist
    information on behalf of clients which may be sensitive to latency while
    waiting for confirmation of successful persistence.

    The patch sets a default threshold of 10 seconds per request. If the value of
    warn.responseTime is configured as -1, the logging is disabled.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/karanmehta93/zookeeper ZOOKEEPER-2770

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zookeeper/pull/307.patch

To close this pull request, make a commit to your master/trunk branch with (at
least) the following in the commit message:

    This closes #307

----
commit bf0ed9eca106faebca6548122108e3ff123b04d3
Author: Karan Mehta
Date:   2017-10-26T01:48:56Z

    ZOOKEEPER-2770 ZooKeeper slow operation log
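For operators, the knob described above would be set roughly as follows. The property name warn.responseTime is taken from this PR description; treating it as a zoo.cfg entry is an assumption for illustration and may differ in the final patch:

    # zoo.cfg (sketch): warn about requests slower than 500 ms.
    # The PR's default is 10000 (10 seconds); -1 disables the slow-operation log.
    warn.responseTime=500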
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219820#comment-16219820 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user karanmehta93 closed the pull request at:

    https://github.com/apache/zookeeper/pull/307
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219169#comment-16219169 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user karanmehta93 commented on the issue:

    https://github.com/apache/zookeeper/pull/307

@tdunning Sorry about this. I rebased my branch onto the current master and pushed
my patch along with it. Now it is showing all of those commits as mine. Let me try
to figure out a way to resolve this.
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218938#comment-16218938 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user tdunning commented on the issue:

    https://github.com/apache/zookeeper/pull/307

What is this pull request? 45 commits? 95 changed files? Lots of commits without
associated issues? (danger, danger)
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218937#comment-16218937 ]

Ted Dunning commented on ZOOKEEPER-2770:
----------------------------------------

There is a pull request just now that has 95 files changed and 45 commits. What is
this?!?

https://github.com/apache/zookeeper/pull/307/commits

It looks like some wires got seriously crossed here. There is no way that this
feature should have so many commits.
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218293#comment-16218293 ]

Hadoop QA commented on ZOOKEEPER-2770:
--------------------------------------

-1 overall. GitHub Pull Request Build

    +1 @author. The patch does not contain any @author tags.

    +1 tests included. The patch appears to include 89 new or modified tests.

    +1 javadoc. The javadoc tool did not generate any warning messages.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs. The patch appears to introduce 1 new Findbugs (version 3.0.1) warnings.

    -1 release audit. The applied patch generated 1 release audit warnings (more than the trunk's current 0 warnings).

    -1 core tests. The patch failed core unit tests.

    +1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/1118//testReport/
Release audit warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/1118//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/1118//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/1118//console

This message is automatically generated.
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213980#comment-16213980 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user eribeiro commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/307#discussion_r146108372

--- Diff: src/java/main/org/apache/zookeeper/server/ServerStats.java ---
@@ -148,9 +173,47 @@ synchronized public void resetRequestCounters(){
         packetsReceived = 0;
         packetsSent = 0;
     }
+    synchronized public void resetNumRequestsAboveThresholdTime() {
+        numRequestsAboveThresholdTime = 0;
+    }
     synchronized public void reset() {
         resetLatency();
         resetRequestCounters();
+        resetNumRequestsAboveThresholdTime();
+    }
+
+    public void checkLatency(final ZooKeeperServer zks, Request request) {
+        long requestLatency = Time.currentElapsedTime() - request.createTime;
+        boolean enabledAndAboveThreshold = (requestWarnThresholdMs == 0) ||
+                (requestWarnThresholdMs > -1 && requestLatency > requestWarnThresholdMs);
+        if (enabledAndAboveThreshold) {
+            zks.serverStats().incNumRequestsAboveThresholdTime();
+            // Try acquiring lock only if not waiting
+            if(!waitForLoggingWarnThresholdMsg) {
--- End diff --

nit: space between `if` and `(`
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213978#comment-16213978 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user eribeiro commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/307#discussion_r146108343 --- Diff: src/java/main/org/apache/zookeeper/server/ServerStats.java --- @@ -21,17 +21,32 @@ import org.apache.zookeeper.common.Time; +import org.apache.zookeeper.server.quorum.QuorumPeerConfig; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Timer; +import java.util.TimerTask; /** * Basic Server Statistics */ public class ServerStats { +private static final Logger LOG = LoggerFactory.getLogger(ServerStats.class); + private long packetsSent; private long packetsReceived; private long maxLatency; private long minLatency = Long.MAX_VALUE; private long totalLatency = 0; private long count = 0; +private long numRequestsAboveThresholdTime = 0; + +final static long requestWarnThresholdMs = QuorumPeerConfig.getRequestWarnResponseThresholdMs(); +final static Timer timer = new Timer(); +volatile Boolean waitForLoggingWarnThresholdMsg = false; --- End diff -- Oh, excuse me, I got it. It's because you are synchronizing on it, right? Well, it's not recommended (an anti-pattern) to synchronize on non-final fields. Didn't Findbugs raise a warning about this?
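A side note on the review point above: `Boolean.TRUE` and `Boolean.FALSE` are interned, so a field that is reassigned between them makes a poor synchronization target; different threads can end up locking different objects, or colliding with unrelated code that locks the same shared instance. A minimal illustrative sketch of the safer pattern discussed in this thread, using `AtomicBoolean`; the class and method names here are made up for illustration and are not the patch's code:

```
import java.util.concurrent.atomic.AtomicBoolean;

public class WarnLogGate {
    // Fragile pattern: a reassigned, boxed Boolean. Synchronizing on this field
    // locks whichever interned instance (Boolean.TRUE or Boolean.FALSE) it happens
    // to reference at that instant, so two threads may each "hold the lock".
    private volatile Boolean waitingBoxed = Boolean.FALSE;

    // Safer pattern: one AtomicBoolean instance that is never replaced.
    private final AtomicBoolean waiting = new AtomicBoolean(false);

    /** Returns true for exactly one caller until reset() is invoked. */
    public boolean tryEnter() {
        // compareAndSet is atomic; no synchronized block is required.
        return waiting.compareAndSet(false, true);
    }

    public void reset() {
        waiting.set(false);
    }

    public static void main(String[] args) {
        WarnLogGate gate = new WarnLogGate();
        System.out.println(gate.tryEnter()); // true: first caller wins
        System.out.println(gate.tryEnter()); // false: gate already held
        gate.reset();
        System.out.println(gate.tryEnter()); // true again after reset
    }
}
```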
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213976#comment-16213976 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user eribeiro commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/307#discussion_r146108249 --- Diff: src/java/main/org/apache/zookeeper/server/ServerStats.java --- @@ -21,17 +21,32 @@ import org.apache.zookeeper.common.Time; +import org.apache.zookeeper.server.quorum.QuorumPeerConfig; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Timer; +import java.util.TimerTask; /** * Basic Server Statistics */ public class ServerStats { +private static final Logger LOG = LoggerFactory.getLogger(ServerStats.class); + private long packetsSent; private long packetsReceived; private long maxLatency; private long minLatency = Long.MAX_VALUE; private long totalLatency = 0; private long count = 0; +private long numRequestsAboveThresholdTime = 0; + +final static long requestWarnThresholdMs = QuorumPeerConfig.getRequestWarnResponseThresholdMs(); +final static Timer timer = new Timer(); +volatile Boolean waitForLoggingWarnThresholdMsg = false; --- End diff -- What is the reason to make `waitForLoggingWarnThresholdMsg` a boxed type instead of a primitive `boolean`? > ZooKeeper slow operation log > > > Key: ZOOKEEPER-2770 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Karan Mehta >Assignee: Karan Mehta > Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, > ZOOKEEPER-2770.003.patch > > > ZooKeeper is a complex distributed application. There are many reasons why > any given read or write operation may become slow: a software bug, a protocol > problem, a hardware issue with the commit log(s), a network issue. If the > problem is constant it is trivial to come to an understanding of the cause. > However in order to diagnose intermittent problems we often don't know where, > or when, to begin looking. We need some sort of timestamped indication of the > problem. Although ZooKeeper is not a datastore, it does persist data, and can > suffer intermittent performance degradation, and should consider implementing > a 'slow query' log, a feature very common to services which persist > information on behalf of clients which may be sensitive to latency while > waiting for confirmation of successful persistence. > Log the client and request details if the server discovers, when finally > processing the request, that the current time minus arrival time of the > request is beyond a configured threshold. > Look at the HBase {{responseTooSlow}} feature for inspiration. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16176070#comment-16176070 ] Hadoop QA commented on ZOOKEEPER-2770: -- -1 overall. GitHub Pull Request Build +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 8 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 3 new Findbugs (version 3.0.1) warnings. -1 release audit. The applied patch generated 1 release audit warnings (more than the trunk's current 0 warnings). -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/1032//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/1032//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/1032//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/1032//console This message is automatically generated. > ZooKeeper slow operation log > > > Key: ZOOKEEPER-2770 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Karan Mehta >Assignee: Karan Mehta > Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, > ZOOKEEPER-2770.003.patch > > > ZooKeeper is a complex distributed application. There are many reasons why > any given read or write operation may become slow: a software bug, a protocol > problem, a hardware issue with the commit log(s), a network issue. If the > problem is constant it is trivial to come to an understanding of the cause. > However in order to diagnose intermittent problems we often don't know where, > or when, to begin looking. We need some sort of timestamped indication of the > problem. Although ZooKeeper is not a datastore, it does persist data, and can > suffer intermittent performance degradation, and should consider implementing > a 'slow query' log, a feature very common to services which persist > information on behalf of clients which may be sensitive to latency while > waiting for confirmation of successful persistence. > Log the client and request details if the server discovers, when finally > processing the request, that the current time minus arrival time of the > request is beyond a configured threshold. > Look at the HBase {{responseTooSlow}} feature for inspiration. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16133177#comment-16133177 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user hanm commented on the issue: https://github.com/apache/zookeeper/pull/307 There are -1s on findbugs, release audit, and test failures reported by Jenkins. Please investigate.
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131884#comment-16131884 ] Hadoop QA commented on ZOOKEEPER-2770: -- -1 overall. GitHub Pull Request Build +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 8 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 3 new Findbugs (version 3.0.1) warnings. -1 release audit. The applied patch generated 1 release audit warnings (more than the trunk's current 0 warnings). -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/947//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/947//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/947//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/947//console This message is automatically generated. > ZooKeeper slow operation log > > > Key: ZOOKEEPER-2770 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Karan Mehta >Assignee: Karan Mehta > Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, > ZOOKEEPER-2770.003.patch > > > ZooKeeper is a complex distributed application. There are many reasons why > any given read or write operation may become slow: a software bug, a protocol > problem, a hardware issue with the commit log(s), a network issue. If the > problem is constant it is trivial to come to an understanding of the cause. > However in order to diagnose intermittent problems we often don't know where, > or when, to begin looking. We need some sort of timestamped indication of the > problem. Although ZooKeeper is not a datastore, it does persist data, and can > suffer intermittent performance degradation, and should consider implementing > a 'slow query' log, a feature very common to services which persist > information on behalf of clients which may be sensitive to latency while > waiting for confirmation of successful persistence. > Log the client and request details if the server discovers, when finally > processing the request, that the current time minus arrival time of the > request is beyond a configured threshold. > Look at the HBase {{responseTooSlow}} feature for inspiration. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131874#comment-16131874 ] Hadoop QA commented on ZOOKEEPER-2770: -- -1 overall. GitHub Pull Request Build +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 8 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 3 new Findbugs (version 3.0.1) warnings. -1 release audit. The applied patch generated 1 release audit warnings (more than the trunk's current 0 warnings). -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/946//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/946//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/946//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/946//console This message is automatically generated. > ZooKeeper slow operation log > > > Key: ZOOKEEPER-2770 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Karan Mehta >Assignee: Karan Mehta > Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, > ZOOKEEPER-2770.003.patch > > > ZooKeeper is a complex distributed application. There are many reasons why > any given read or write operation may become slow: a software bug, a protocol > problem, a hardware issue with the commit log(s), a network issue. If the > problem is constant it is trivial to come to an understanding of the cause. > However in order to diagnose intermittent problems we often don't know where, > or when, to begin looking. We need some sort of timestamped indication of the > problem. Although ZooKeeper is not a datastore, it does persist data, and can > suffer intermittent performance degradation, and should consider implementing > a 'slow query' log, a feature very common to services which persist > information on behalf of clients which may be sensitive to latency while > waiting for confirmation of successful persistence. > Log the client and request details if the server discovers, when finally > processing the request, that the current time minus arrival time of the > request is beyond a configured threshold. > Look at the HBase {{responseTooSlow}} feature for inspiration. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131873#comment-16131873 ] Hadoop QA commented on ZOOKEEPER-2770: -- -1 overall. GitHub Pull Request Build +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 8 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 3 new Findbugs (version 3.0.1) warnings. -1 release audit. The applied patch generated 1 release audit warnings (more than the trunk's current 0 warnings). +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/945//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/945//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/945//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/945//console This message is automatically generated. > ZooKeeper slow operation log > > > Key: ZOOKEEPER-2770 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Karan Mehta >Assignee: Karan Mehta > Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, > ZOOKEEPER-2770.003.patch > > > ZooKeeper is a complex distributed application. There are many reasons why > any given read or write operation may become slow: a software bug, a protocol > problem, a hardware issue with the commit log(s), a network issue. If the > problem is constant it is trivial to come to an understanding of the cause. > However in order to diagnose intermittent problems we often don't know where, > or when, to begin looking. We need some sort of timestamped indication of the > problem. Although ZooKeeper is not a datastore, it does persist data, and can > suffer intermittent performance degradation, and should consider implementing > a 'slow query' log, a feature very common to services which persist > information on behalf of clients which may be sensitive to latency while > waiting for confirmation of successful persistence. > Log the client and request details if the server discovers, when finally > processing the request, that the current time minus arrival time of the > request is beyond a configured threshold. > Look at the HBase {{responseTooSlow}} feature for inspiration. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131704#comment-16131704 ] Hadoop QA commented on ZOOKEEPER-2770: -- -1 overall. GitHub Pull Request Build +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 5 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 3 new Findbugs (version 3.0.1) warnings. -1 release audit. The applied patch generated 1 release audit warnings (more than the trunk's current 0 warnings). -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/944//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/944//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/944//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/944//console This message is automatically generated. > ZooKeeper slow operation log > > > Key: ZOOKEEPER-2770 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Karan Mehta >Assignee: Karan Mehta > Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, > ZOOKEEPER-2770.003.patch > > > ZooKeeper is a complex distributed application. There are many reasons why > any given read or write operation may become slow: a software bug, a protocol > problem, a hardware issue with the commit log(s), a network issue. If the > problem is constant it is trivial to come to an understanding of the cause. > However in order to diagnose intermittent problems we often don't know where, > or when, to begin looking. We need some sort of timestamped indication of the > problem. Although ZooKeeper is not a datastore, it does persist data, and can > suffer intermittent performance degradation, and should consider implementing > a 'slow query' log, a feature very common to services which persist > information on behalf of clients which may be sensitive to latency while > waiting for confirmation of successful persistence. > Log the client and request details if the server discovers, when finally > processing the request, that the current time minus arrival time of the > request is beyond a configured threshold. > Look at the HBase {{responseTooSlow}} feature for inspiration. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131694#comment-16131694 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user karanmehta93 commented on the issue: https://github.com/apache/zookeeper/pull/307 @hanm @eribeiro @tdunning @skamille Please review. Now that I have added rate limiting to logging, can we also turn this on by default? > ZooKeeper slow operation log > > > Key: ZOOKEEPER-2770 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Karan Mehta >Assignee: Karan Mehta > Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, > ZOOKEEPER-2770.003.patch > > > ZooKeeper is a complex distributed application. There are many reasons why > any given read or write operation may become slow: a software bug, a protocol > problem, a hardware issue with the commit log(s), a network issue. If the > problem is constant it is trivial to come to an understanding of the cause. > However in order to diagnose intermittent problems we often don't know where, > or when, to begin looking. We need some sort of timestamped indication of the > problem. Although ZooKeeper is not a datastore, it does persist data, and can > suffer intermittent performance degradation, and should consider implementing > a 'slow query' log, a feature very common to services which persist > information on behalf of clients which may be sensitive to latency while > waiting for confirmation of successful persistence. > Log the client and request details if the server discovers, when finally > processing the request, that the current time minus arrival time of the > request is beyond a configured threshold. > Look at the HBase {{responseTooSlow}} feature for inspiration. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113679#comment-16113679 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user eribeiro commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/307#discussion_r131282046 --- Diff: src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java --- @@ -460,11 +465,20 @@ public void processRequest(Request request) { if (request.type == OpCode.closeSession) { cnxn.sendCloseSession(); } +checkLatency(request); } catch (IOException e) { LOG.error("FIXMSG",e); } } +private void checkLatency(Request request) { +long requestLatency = Time.currentElapsedTime() - request.createTime; +if((requestWarnThresholdMs > -1 && requestLatency > requestWarnThresholdMs) || requestWarnThresholdMs == 0) { +LOG.warn("Request " + request + " exceeded threshold. Took " + requestLatency + " ms."); --- End diff -- You can leverage parameterized string interpolation in logging: ``` LOG.warn("Request {} exceeded {}", obj1, obj2); ``` Again, this is such a minor suggestion that it's up to you to adopt it or not. Cheers!
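For context on the suggestion above: with SLF4J's `{}` placeholders the message is only formatted when the event is actually emitted, whereas string concatenation always builds the message even if WARN is disabled. A small illustrative snippet; the class and variable names are invented for the example:

```
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LatencyLogging {
    private static final Logger LOG = LoggerFactory.getLogger(LatencyLogging.class);

    void warnSlow(Object request, long requestLatency) {
        // Concatenation builds the String unconditionally:
        // LOG.warn("Request " + request + " exceeded threshold. Took " + requestLatency + " ms.");

        // The parameterized form defers formatting until the message is actually logged:
        LOG.warn("Request {} exceeded threshold. Took {} ms.", request, requestLatency);
    }
}
```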
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113678#comment-16113678 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user eribeiro commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/307#discussion_r131281803 --- Diff: src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java --- @@ -460,11 +465,20 @@ public void processRequest(Request request) { if (request.type == OpCode.closeSession) { cnxn.sendCloseSession(); } +checkLatency(request); } catch (IOException e) { LOG.error("FIXMSG",e); } } +private void checkLatency(Request request) { +long requestLatency = Time.currentElapsedTime() - request.createTime; +if((requestWarnThresholdMs > -1 && requestLatency > requestWarnThresholdMs) || requestWarnThresholdMs == 0) { --- End diff -- @karanmehta93 really *really* minor nit: ``` boolean enabledAndAboveThreshold = requestWarnThresholdMs == 0 || (requestWarnThresholdMs > -1 && requestLatency > requestWarnThresholdMs); if (enabledAndAboveThreshold) { ``` I moved `requestWarnThresholdMs == 0` to the beginning of the expression so that the whole expression can be short-circuited if it equals zero. ;-) But this is such a trivial recommendation that it's up to you to use it or not.
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16102749#comment-16102749 ] Ted Dunning commented on ZOOKEEPER-2770: The typical approach is to set a limit on the number of messages per unit time (say one every 10 minutes). Each message that is printed sets a coalescence time during which no further messages are printed, but a counter is updated. At the end of the coalescence time a modified message is printed which mentions that n additional events were detected, and the coalescence time is disabled. This way, if the warnings are rare, you get normal behavior. If the warnings are frequent, you get at most one message per 10 minutes (or whatever coalescence period you choose). You get instant notification of a problem and limited log output.
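A rough sketch of the coalescence scheme described above, written as a standalone helper rather than against any ZooKeeper class; the names, the use of a ScheduledExecutorService, and the constructor-supplied window are illustrative assumptions, not the patch's implementation:

```
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class CoalescingSlowOpLogger {
    private static final Logger LOG = LoggerFactory.getLogger(CoalescingSlowOpLogger.class);

    private final long coalesceMillis;
    private final AtomicBoolean coalescing = new AtomicBoolean(false);
    private final AtomicLong suppressed = new AtomicLong();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public CoalescingSlowOpLogger(long coalesceMillis) {
        this.coalesceMillis = coalesceMillis;
    }

    /** Call for every operation that exceeded the latency threshold. */
    public void slowOperation(String detail, long latencyMs) {
        if (coalescing.compareAndSet(false, true)) {
            // First event opens the window: log it immediately with full detail.
            LOG.warn("Slow operation {} took {} ms", detail, latencyMs);
            // When the window closes, report how many further events were coalesced.
            scheduler.schedule(() -> {
                long n = suppressed.getAndSet(0);
                if (n > 0) {
                    LOG.warn("{} additional slow operations in the past {} ms",
                            n, coalesceMillis);
                }
                coalescing.set(false);
            }, coalesceMillis, TimeUnit.MILLISECONDS);
        } else {
            // Window already open: just count the event; it will be summarized later.
            suppressed.incrementAndGet();
        }
    }
}
```

Counts are approximate at the window boundary (an event arriving while the summary is being emitted may be reported in the next window), which seems acceptable for the purpose described here: rare warnings log normally, frequent warnings collapse to one detailed message plus a periodic count.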
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16102708#comment-16102708 ] Karan Mehta commented on ZOOKEEPER-2770: [~tdunning] There is no rate limiting on logging. If the threshold is too low, it might result in a huge number of messages getting printed. At this point, I suggest the following options. Please share your opinion.
1. Turn off this feature by default, so that we don't end up with an arbitrary value. I personally do not want this, since I believe that no matter what your requirements or hardware are, it should be possible to put some upper bound on this value. Experienced people can comment on this more than I can.
2. Add a rate limiter based on some logic:
2.1 Time-based (limit the number of messages printed in a given amount of time).
2.2 Random sampling based on some probability percentage.
I am not aware of how these things are typically implemented. It would be good if you can point to some code which does similar stuff. Thanks!
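For option 2.2 above (random sampling), the core is only a few lines. A hedged illustration, not part of any attached patch; the class and field names are invented:

```
import java.util.concurrent.ThreadLocalRandom;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SampledSlowOpLogger {
    private static final Logger LOG = LoggerFactory.getLogger(SampledSlowOpLogger.class);

    // e.g. 0.01 logs roughly 1% of slow operations
    private final double sampleProbability;

    public SampledSlowOpLogger(double sampleProbability) {
        this.sampleProbability = sampleProbability;
    }

    public void slowOperation(String detail, long latencyMs) {
        // Randomly drop most events; the surviving sample still shows the pattern
        // without flooding the log when the threshold is set very low.
        if (ThreadLocalRandom.current().nextDouble() < sampleProbability) {
            LOG.warn("Slow operation {} took {} ms (sampled)", detail, latencyMs);
        }
    }
}
```

Note that sampling only bounds the log volume probabilistically; the time-based coalescence sketched above gives a hard cap per window.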
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16101161#comment-16101161 ] Ted Dunning commented on ZOOKEEPER-2770: Btw I note that there is no metering on this logging. That raises an obligatory question. Is there a plausible circumstance where thousands of nearly identical messages might be logged?
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16101158#comment-16101158 ] Ted Dunning commented on ZOOKEEPER-2770: {quote} With that said, is 300 ms a good value, or would even less be better? {quote} I would suggest that getting a real-time, time-varying histogram is the right answer. I suggested that early on for just this kind of reason.
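The histogram idea mentioned here can be sketched with fixed latency buckets that are snapshotted and reset on a schedule, so the picture tracks current behavior rather than all-time extremes. The bucket boundaries and names below are illustrative assumptions, not ZooKeeper's actual metrics code:

```
import java.util.concurrent.atomic.AtomicLongArray;

public class LatencyHistogram {
    // Bucket upper bounds in ms; anything above the last bound falls in an open-ended bucket.
    private static final long[] BOUNDS = {1, 2, 5, 10, 25, 50, 100, 250, 500, 1000, 5000};
    private final AtomicLongArray counts = new AtomicLongArray(BOUNDS.length + 1);

    public void record(long latencyMs) {
        int i = 0;
        while (i < BOUNDS.length && latencyMs > BOUNDS[i]) {
            i++;
        }
        counts.incrementAndGet(i);
    }

    /** Snapshot and reset, e.g. once a minute, so the histogram varies with time. */
    public long[] snapshotAndReset() {
        long[] snapshot = new long[counts.length()];
        for (int i = 0; i < snapshot.length; i++) {
            snapshot[i] = counts.getAndSet(i, 0);
        }
        return snapshot;
    }
}
```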
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16101114#comment-16101114 ] Karan Mehta commented on ZOOKEEPER-2770: bq. Operations over 100ms should be vanishingly rare, but I wouldn't leap up to find out what is happening. I would be fairly unhappy, though, and would start checking. Let's take this as a motivation. :) With that said, is 300 ms a good value, or would even less be better?
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100945#comment-16100945 ] Ted Dunning commented on ZOOKEEPER-2770: On second thought, I could imagine that startup transients could cause a long operation. Once you have your quorum in a groove, however, >1 second is very bad, especially if you don't have something like a quorum leader change happening. > ZooKeeper slow operation log > > > Key: ZOOKEEPER-2770 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Karan Mehta >Assignee: Karan Mehta > Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, > ZOOKEEPER-2770.003.patch > > > ZooKeeper is a complex distributed application. There are many reasons why > any given read or write operation may become slow: a software bug, a protocol > problem, a hardware issue with the commit log(s), a network issue. If the > problem is constant it is trivial to come to an understanding of the cause. > However in order to diagnose intermittent problems we often don't know where, > or when, to begin looking. We need some sort of timestamped indication of the > problem. Although ZooKeeper is not a datastore, it does persist data, and can > suffer intermittent performance degradation, and should consider implementing > a 'slow query' log, a feature very common to services which persist > information on behalf of clients which may be sensitive to latency while > waiting for confirmation of successful persistence. > Log the client and request details if the server discovers, when finally > processing the request, that the current time minus arrival time of the > request is beyond a configured threshold. > Look at the HBase {{responseTooSlow}} feature for inspiration. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100942#comment-16100942 ] Ted Dunning commented on ZOOKEEPER-2770: To put some color on Camille's surprise, I would consider any operation over a second to be indicative of gross failure in the quorum. Operations over 100ms should be vanishingly rare, but I wouldn't leap up to find out what is happening. I would be fairly unhappy, though, and would start checking. > ZooKeeper slow operation log > > > Key: ZOOKEEPER-2770 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Karan Mehta >Assignee: Karan Mehta > Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, > ZOOKEEPER-2770.003.patch > > > ZooKeeper is a complex distributed application. There are many reasons why > any given read or write operation may become slow: a software bug, a protocol > problem, a hardware issue with the commit log(s), a network issue. If the > problem is constant it is trivial to come to an understanding of the cause. > However in order to diagnose intermittent problems we often don't know where, > or when, to begin looking. We need some sort of timestamped indication of the > problem. Although ZooKeeper is not a datastore, it does persist data, and can > suffer intermittent performance degradation, and should consider implementing > a 'slow query' log, a feature very common to services which persist > information on behalf of clients which may be sensitive to latency while > waiting for confirmation of successful persistence. > Log the client and request details if the server discovers, when finally > processing the request, that the current time minus arrival time of the > request is beyond a configured threshold. > Look at the HBase {{responseTooSlow}} feature for inspiration. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100940#comment-16100940 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user karanmehta93 commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/307#discussion_r129450258 --- Diff: src/java/main/org/apache/zookeeper/server/quorum/QuorumPeerConfig.java --- @@ -61,6 +61,7 @@ private static boolean standaloneEnabled = true; private static boolean reconfigEnabled = false; +private static int requestWarnThresholdMs = 1; --- End diff -- To be frank, I am a newbie and haven't debugged this in detail. This value is purely based on the 'stat' command output on our test cluster. @apurtell might be able to suggest more practical values. @skamille I would prefer turning this on by default, although the default value needs to be discussed. In my understanding, this helps in situations where we see timeouts at the application level; such a log might help narrow down the cause.
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100906#comment-16100906 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user skamille commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/307#discussion_r129444784 --- Diff: src/java/main/org/apache/zookeeper/server/quorum/QuorumPeerConfig.java --- @@ -61,6 +61,7 @@ private static boolean standaloneEnabled = true; private static boolean reconfigEnabled = false; +private static int requestWarnThresholdMs = 1; --- End diff -- You've seen 2.3 seconds latency within the ZK quorum operations? That seems worthy of posting to the mailing list along with some information about what was happening and why. I think it sounds like @hanm wants to turn this off by default, which makes this moot, and I'm supportive of that, so I'll let him make the call. > ZooKeeper slow operation log > > > Key: ZOOKEEPER-2770 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Karan Mehta >Assignee: Karan Mehta > Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, > ZOOKEEPER-2770.003.patch > > > ZooKeeper is a complex distributed application. There are many reasons why > any given read or write operation may become slow: a software bug, a protocol > problem, a hardware issue with the commit log(s), a network issue. If the > problem is constant it is trivial to come to an understanding of the cause. > However in order to diagnose intermittent problems we often don't know where, > or when, to begin looking. We need some sort of timestamped indication of the > problem. Although ZooKeeper is not a datastore, it does persist data, and can > suffer intermittent performance degradation, and should consider implementing > a 'slow query' log, a feature very common to services which persist > information on behalf of clients which may be sensitive to latency while > waiting for confirmation of successful persistence. > Log the client and request details if the server discovers, when finally > processing the request, that the current time minus arrival time of the > request is beyond a configured threshold. > Look at the HBase {{responseTooSlow}} feature for inspiration. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100881#comment-16100881 ] Michael Han commented on ZOOKEEPER-2770: By "hardcoded" I meant the default value of "requestWarnThresholdMs" baked into the code.
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100852#comment-16100852 ] Andrew Purtell commented on ZOOKEEPER-2770: From the original patch, the warning threshold has been configurable. Calling it 'hardcoded' isn't correct. Maybe you meant a simple threshold only? That's true. It's better than nothing. FWIW I also like Ted's suggestion as a followup, and in fact would like to carry that over to HBase if it works out well here.
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100846#comment-16100846 ] Michael Han commented on ZOOKEEPER-2770: A hardcoded default value in code is unlikely to work for everyone, and it is possible to have false negatives if the value is too small. I am leaning towards making this an opt-in feature with a default value of -1; those who want to use it can tune the parameter for their deployment, but it has to be enabled explicitly.
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100820#comment-16100820 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user karanmehta93 commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/307#discussion_r129431091 --- Diff: src/java/main/org/apache/zookeeper/server/quorum/QuorumPeerConfig.java --- @@ -61,6 +61,7 @@ private static boolean standaloneEnabled = true; private static boolean reconfigEnabled = false; +private static int requestWarnThresholdMs = 1; --- End diff -- Is 2 or 3 seconds reasonable? I have sometimes seen 2.3 seconds as the max latency; however, I don't have much experience.
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100780#comment-16100780 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user skamille commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/307#discussion_r129424302 --- Diff: src/java/main/org/apache/zookeeper/server/quorum/QuorumPeerConfig.java --- @@ -61,6 +61,7 @@ private static boolean standaloneEnabled = true; private static boolean reconfigEnabled = false; +private static int requestWarnThresholdMs = 1; --- End diff -- If we're going to implement this let's at least put some sort of realistic threshold. 10s is basically saying "don't enable this feature" is that what we want? > ZooKeeper slow operation log > > > Key: ZOOKEEPER-2770 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Karan Mehta >Assignee: Karan Mehta > Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, > ZOOKEEPER-2770.003.patch > > > ZooKeeper is a complex distributed application. There are many reasons why > any given read or write operation may become slow: a software bug, a protocol > problem, a hardware issue with the commit log(s), a network issue. If the > problem is constant it is trivial to come to an understanding of the cause. > However in order to diagnose intermittent problems we often don't know where, > or when, to begin looking. We need some sort of timestamped indication of the > problem. Although ZooKeeper is not a datastore, it does persist data, and can > suffer intermittent performance degradation, and should consider implementing > a 'slow query' log, a feature very common to services which persist > information on behalf of clients which may be sensitive to latency while > waiting for confirmation of successful persistence. > Log the client and request details if the server discovers, when finally > processing the request, that the current time minus arrival time of the > request is beyond a configured threshold. > Look at the HBase {{responseTooSlow}} feature for inspiration. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100706#comment-16100706 ] Andrew Purtell commented on ZOOKEEPER-2770: --- The originally proposed change is hardly complex. I don't understand that aspect of this discussion. Whether or not the metric is useful, on the other hand... ok. That is a matter of opinion. I think we'd like to know if any ZK op takes longer than a second to complete, and how often that might happen, and on what host it is happening. We have fleet of thousands of servers. We have tens of ZooKeeper installations, each on five servers. Hardware does funny things from time to time. We'd like to be proactive. > ZooKeeper slow operation log > > > Key: ZOOKEEPER-2770 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Karan Mehta >Assignee: Karan Mehta > Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, > ZOOKEEPER-2770.003.patch > > > ZooKeeper is a complex distributed application. There are many reasons why > any given read or write operation may become slow: a software bug, a protocol > problem, a hardware issue with the commit log(s), a network issue. If the > problem is constant it is trivial to come to an understanding of the cause. > However in order to diagnose intermittent problems we often don't know where, > or when, to begin looking. We need some sort of timestamped indication of the > problem. Although ZooKeeper is not a datastore, it does persist data, and can > suffer intermittent performance degradation, and should consider implementing > a 'slow query' log, a feature very common to services which persist > information on behalf of clients which may be sensitive to latency while > waiting for confirmation of successful persistence. > Log the client and request details if the server discovers, when finally > processing the request, that the current time minus arrival time of the > request is beyond a configured threshold. > Look at the HBase {{responseTooSlow}} feature for inspiration. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100640#comment-16100640 ] Camille Fournier commented on ZOOKEEPER-2770: - Are there really 10s long slow requests? It's defaults like this that make me skeptical about the usefulness of this particular implementation. If we have a request through ZK that takes 10s to process your whole system is completely effed. I don't think we should add complexity to the code base without suitable justification for the value of the new feature. With that in mind, I'd like to understand what, specifically, the circumstances we're trying to measure are. It looks like processing time for a request through the ZK quorum alone, correct? The only network time that might be captured would be, in the case of a write, the quorum voting time. I'm all for making ZK more operable and exposing metrics but I don't think exposing low-value metrics is worth the additional code complexity without justification. > ZooKeeper slow operation log > > > Key: ZOOKEEPER-2770 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Karan Mehta >Assignee: Karan Mehta > Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, > ZOOKEEPER-2770.003.patch > > > ZooKeeper is a complex distributed application. There are many reasons why > any given read or write operation may become slow: a software bug, a protocol > problem, a hardware issue with the commit log(s), a network issue. If the > problem is constant it is trivial to come to an understanding of the cause. > However in order to diagnose intermittent problems we often don't know where, > or when, to begin looking. We need some sort of timestamped indication of the > problem. Although ZooKeeper is not a datastore, it does persist data, and can > suffer intermittent performance degradation, and should consider implementing > a 'slow query' log, a feature very common to services which persist > information on behalf of clients which may be sensitive to latency while > waiting for confirmation of successful persistence. > Log the client and request details if the server discovers, when finally > processing the request, that the current time minus arrival time of the > request is beyond a configured threshold. > Look at the HBase {{responseTooSlow}} feature for inspiration. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
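For readers following the thread, the interval in question can be summarized as below. This is an editorial sketch based on the issue description (arrival time to final processing) with hypothetical names; it is not code from the patch.
{code:java}
// Illustrative only: the latency under discussion is measured entirely on the
// server, from the moment the request is stamped on arrival to the end of
// final processing. Client-to-server network time is excluded; for writes,
// the quorum proposal/commit round is included.
final class SlowOpCheck {
    static boolean isSlow(long arrivalTimeMs, long nowMs, long thresholdMs) {
        long latencyMs = nowMs - arrivalTimeMs;   // server-side latency only
        return thresholdMs >= 0 && latencyMs > thresholdMs;
    }
}
{code}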
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100603#comment-16100603 ] Ted Dunning commented on ZOOKEEPER-2770: [~fournc], I am not so sure that *I* agree with me at this point. It is fair to say that on occasion there are slow operations in ZK and it would be good to know about them. This kind of problem is almost always due, in my own vicarious experience, to bad configuration. Often the bad configuration is simply collocation with a noisy neighbor on a deficient storage layer. There might be situations where an operation is slow due to the content of the query itself, but I cannot imagine what those situations might be. Writing a large value (but that is strictly limited in size), or even doing a huge multi-op (which has the same limited size in aggregate) should never take very long. As such, I would expect that the highest diagnostic value would not be something that dumped the contents of slow queries, but rather a capability that characterizes the entire distribution of query times. The frequency of slow queries is a diagnostic of sorts, but is one that could be inferred from the time-varying distributional information I was suggesting. That said, I don't think that a slow query log is a BAD thing (except a bit bad in terms of security if it logs the actual query). And I wouldn't want the BEST thing (a distribution log) to stop somebody contributing something. > ZooKeeper slow operation log > > > Key: ZOOKEEPER-2770 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Karan Mehta >Assignee: Karan Mehta > Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, > ZOOKEEPER-2770.003.patch > > > ZooKeeper is a complex distributed application. There are many reasons why > any given read or write operation may become slow: a software bug, a protocol > problem, a hardware issue with the commit log(s), a network issue. If the > problem is constant it is trivial to come to an understanding of the cause. > However in order to diagnose intermittent problems we often don't know where, > or when, to begin looking. We need some sort of timestamped indication of the > problem. Although ZooKeeper is not a datastore, it does persist data, and can > suffer intermittent performance degradation, and should consider implementing > a 'slow query' log, a feature very common to services which persist > information on behalf of clients which may be sensitive to latency while > waiting for confirmation of successful persistence. > Log the client and request details if the server discovers, when finally > processing the request, that the current time minus arrival time of the > request is beyond a configured threshold. > Look at the HBase {{responseTooSlow}} feature for inspiration. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
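The distribution-oriented alternative suggested here could be prototyped with a simple bucketed histogram along the following lines. This is a hypothetical sketch (power-of-two millisecond buckets), not an existing ZooKeeper class, and it deliberately records only timings, never request contents, which also sidesteps the security concern above.
{code:java}
import java.util.concurrent.atomic.AtomicLongArray;

// Illustrative latency histogram: counts requests in power-of-two millisecond
// buckets so the full distribution can be dumped or exported periodically.
final class LatencyHistogram {
    private final AtomicLongArray buckets = new AtomicLongArray(32);

    void record(long latencyMs) {
        // Bucket i holds latencies in [2^(i-1), 2^i), i.e. strictly below 2^i ms.
        int bucket = 64 - Long.numberOfLeadingZeros(Math.max(latencyMs, 1));
        buckets.incrementAndGet(Math.min(bucket, buckets.length() - 1));
    }

    // Human-readable summary such as "<2ms: 5403, <4ms: 211, <4096ms: 1, "
    String summary() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < buckets.length(); i++) {
            long count = buckets.get(i);
            if (count > 0) {
                sb.append('<').append(1L << i).append("ms: ").append(count).append(", ");
            }
        }
        return sb.toString();
    }
}
{code}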
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100551#comment-16100551 ] Camille Fournier commented on ZOOKEEPER-2770: - I completely agree with [~tdunning]; I don't understand the motivation for this. Are we just timing the internal processing time for the request? ZK is not the same type of system as HBase, so cross-implementing this feature risks comparing apples to oranges. > ZooKeeper slow operation log > > > Key: ZOOKEEPER-2770 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Karan Mehta >Assignee: Karan Mehta > Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, > ZOOKEEPER-2770.003.patch > > > ZooKeeper is a complex distributed application. There are many reasons why > any given read or write operation may become slow: a software bug, a protocol > problem, a hardware issue with the commit log(s), a network issue. If the > problem is constant it is trivial to come to an understanding of the cause. > However in order to diagnose intermittent problems we often don't know where, > or when, to begin looking. We need some sort of timestamped indication of the > problem. Although ZooKeeper is not a datastore, it does persist data, and can > suffer intermittent performance degradation, and should consider implementing > a 'slow query' log, a feature very common to services which persist > information on behalf of clients which may be sensitive to latency while > waiting for confirmation of successful persistence. > Log the client and request details if the server discovers, when finally > processing the request, that the current time minus arrival time of the > request is beyond a configured threshold. > Look at the HBase {{responseTooSlow}} feature for inspiration. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099714#comment-16099714 ] Hadoop QA commented on ZOOKEEPER-2770: -- -1 overall. GitHub Pull Request Build +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 3.0.1) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/900//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/900//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/900//console This message is automatically generated. > ZooKeeper slow operation log > > > Key: ZOOKEEPER-2770 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Karan Mehta >Assignee: Karan Mehta > Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, > ZOOKEEPER-2770.003.patch > > > ZooKeeper is a complex distributed application. There are many reasons why > any given read or write operation may become slow: a software bug, a protocol > problem, a hardware issue with the commit log(s), a network issue. If the > problem is constant it is trivial to come to an understanding of the cause. > However in order to diagnose intermittent problems we often don't know where, > or when, to begin looking. We need some sort of timestamped indication of the > problem. Although ZooKeeper is not a datastore, it does persist data, and can > suffer intermittent performance degradation, and should consider implementing > a 'slow query' log, a feature very common to services which persist > information on behalf of clients which may be sensitive to latency while > waiting for confirmation of successful persistence. > Log the client and request details if the server discovers, when finally > processing the request, that the current time minus arrival time of the > request is beyond a configured threshold. > Look at the HBase {{responseTooSlow}} feature for inspiration. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099515#comment-16099515 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user karanmehta93 commented on the issue: https://github.com/apache/zookeeper/pull/307 @hanm @eribeiro @tdunning I have fixed the whitespace issues and final patch is ready, Please review whenever convenient. > ZooKeeper slow operation log > > > Key: ZOOKEEPER-2770 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Karan Mehta >Assignee: Karan Mehta > Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, > ZOOKEEPER-2770.003.patch > > > ZooKeeper is a complex distributed application. There are many reasons why > any given read or write operation may become slow: a software bug, a protocol > problem, a hardware issue with the commit log(s), a network issue. If the > problem is constant it is trivial to come to an understanding of the cause. > However in order to diagnose intermittent problems we often don't know where, > or when, to begin looking. We need some sort of timestamped indication of the > problem. Although ZooKeeper is not a datastore, it does persist data, and can > suffer intermittent performance degradation, and should consider implementing > a 'slow query' log, a feature very common to services which persist > information on behalf of clients which may be sensitive to latency while > waiting for confirmation of successful persistence. > Log the client and request details if the server discovers, when finally > processing the request, that the current time minus arrival time of the > request is beyond a configured threshold. > Look at the HBase {{responseTooSlow}} feature for inspiration. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093587#comment-16093587 ] Hadoop QA commented on ZOOKEEPER-2770: -- -1 overall. GitHub Pull Request Build +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 3.0.1) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/890//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/890//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/890//console This message is automatically generated. > ZooKeeper slow operation log > > > Key: ZOOKEEPER-2770 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Karan Mehta >Assignee: Karan Mehta > Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, > ZOOKEEPER-2770.003.patch > > > ZooKeeper is a complex distributed application. There are many reasons why > any given read or write operation may become slow: a software bug, a protocol > problem, a hardware issue with the commit log(s), a network issue. If the > problem is constant it is trivial to come to an understanding of the cause. > However in order to diagnose intermittent problems we often don't know where, > or when, to begin looking. We need some sort of timestamped indication of the > problem. Although ZooKeeper is not a datastore, it does persist data, and can > suffer intermittent performance degradation, and should consider implementing > a 'slow query' log, a feature very common to services which persist > information on behalf of clients which may be sensitive to latency while > waiting for confirmation of successful persistence. > Log the client and request details if the server discovers, when finally > processing the request, that the current time minus arrival time of the > request is beyond a configured threshold. > Look at the HBase {{responseTooSlow}} feature for inspiration. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093571#comment-16093571 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user karanmehta93 commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/307#discussion_r128330873 --- Diff: src/java/main/org/apache/zookeeper/server/Request.java --- @@ -29,12 +29,14 @@ import org.apache.zookeeper.server.quorum.flexible.QuorumVerifier; import org.apache.zookeeper.txn.TxnHeader; + --- End diff -- Some whitespace changes have been introduced by mistake. Whats the best way to resolve them? I am not changing `Request.java` file at all in the final code. > ZooKeeper slow operation log > > > Key: ZOOKEEPER-2770 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Karan Mehta >Assignee: Karan Mehta > Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, > ZOOKEEPER-2770.003.patch > > > ZooKeeper is a complex distributed application. There are many reasons why > any given read or write operation may become slow: a software bug, a protocol > problem, a hardware issue with the commit log(s), a network issue. If the > problem is constant it is trivial to come to an understanding of the cause. > However in order to diagnose intermittent problems we often don't know where, > or when, to begin looking. We need some sort of timestamped indication of the > problem. Although ZooKeeper is not a datastore, it does persist data, and can > suffer intermittent performance degradation, and should consider implementing > a 'slow query' log, a feature very common to services which persist > information on behalf of clients which may be sensitive to latency while > waiting for confirmation of successful persistence. > Log the client and request details if the server discovers, when finally > processing the request, that the current time minus arrival time of the > request is beyond a configured threshold. > Look at the HBase {{responseTooSlow}} feature for inspiration. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092620#comment-16092620 ] Hadoop QA commented on ZOOKEEPER-2770: -- -1 overall. GitHub Pull Request Build +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 3.0.1) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/889//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/889//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/889//console This message is automatically generated. > ZooKeeper slow operation log > > > Key: ZOOKEEPER-2770 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Karan Mehta >Assignee: Karan Mehta > Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, > ZOOKEEPER-2770.003.patch > > > ZooKeeper is a complex distributed application. There are many reasons why > any given read or write operation may become slow: a software bug, a protocol > problem, a hardware issue with the commit log(s), a network issue. If the > problem is constant it is trivial to come to an understanding of the cause. > However in order to diagnose intermittent problems we often don't know where, > or when, to begin looking. We need some sort of timestamped indication of the > problem. Although ZooKeeper is not a datastore, it does persist data, and can > suffer intermittent performance degradation, and should consider implementing > a 'slow query' log, a feature very common to services which persist > information on behalf of clients which may be sensitive to latency while > waiting for confirmation of successful persistence. > Log the client and request details if the server discovers, when finally > processing the request, that the current time minus arrival time of the > request is beyond a configured threshold. > Look at the HBase {{responseTooSlow}} feature for inspiration. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092593#comment-16092593 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user karanmehta93 commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/307#discussion_r128155180 --- Diff: src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java --- @@ -138,14 +145,56 @@ void delete(File f) throws IOException { ServerCnxnFactory getCnxnFactory() { return main.getCnxnFactory(); } + } -public static class TestZKSMain extends ZooKeeperServerMain { +public static class TestZKSMain extends ZooKeeperServerMain { + +private ServerStats serverStats; + +@Override +public ZooKeeperServer getZooKeeperServer(FileTxnSnapLog txnLog, ServerConfig config, ZKDatabase zkDb) { +ZooKeeperServer zooKeeperServer = super.getZooKeeperServer(txnLog, config, zkDb); +serverStats = zooKeeperServer.serverStats(); +return zooKeeperServer; +} + +@Override public void shutdown() { super.shutdown(); } } +// Test for ZOOKEEPER-2770 ZooKeeper slow operation log +@Test +public void testRequestWarningThreshold() throws IOException, KeeperException, InterruptedException { +ClientBase.setupTestEnv(); + +final int CLIENT_PORT = PortAssignment.unique(); + +MainThread main = new MainThread(CLIENT_PORT, true, null, 0); +main.start(); + +Assert.assertTrue("waiting for server being up", +ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT, +CONNECTION_TIMEOUT)); +// Get the stats object from the ZooKeeperServer to keep track of high latency requests. +ServerStats stats = main.main.serverStats; + +ZooKeeper zk = new ZooKeeper("127.0.0.1:" + CLIENT_PORT, +ClientBase.CONNECTION_TIMEOUT, this); + +zk.create("/foo1", "foobar".getBytes(), Ids.OPEN_ACL_UNSAFE, +CreateMode.PERSISTENT); + +Assert.assertEquals(new String(zk.getData("/foo1", null, null)), "foobar"); +// It takes a while for the counter to get updated sometimes, this is added to reduce flakyness +Thread.sleep(1000); --- End diff -- > We can always increase the counter before we send a response back to client in FinalRequestProcessor. This will make test deterministic, Or, Not this approach. It will boil down to the same thing as `ServerStats` which we don't want in the first place. > In test, implement retry polling with a timeout and a smaller sleep interval (e..g 10ms instead of 1000ms). We use this pattern a lot in our test. In general we'd like to make deterministic tests as much as possible (in other words, avoid sleep at all cost.) though sometimes that is hard to do due to various reasons. I will add the polling mechanism with a smaller sleep interval. I would also love to make the test deterministic, however I don't think this will be possible here because of race condition involved between client side and server side processing thread. > ZooKeeper slow operation log > > > Key: ZOOKEEPER-2770 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Karan Mehta >Assignee: Karan Mehta > Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, > ZOOKEEPER-2770.003.patch > > > ZooKeeper is a complex distributed application. There are many reasons why > any given read or write operation may become slow: a software bug, a protocol > problem, a hardware issue with the commit log(s), a network issue. If the > problem is constant it is trivial to come to an understanding of the cause. 
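The retry-polling approach settled on above might look like the hypothetical test helper below: a short poll interval bounded by an overall deadline, instead of a fixed one-second sleep.
{code:java}
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

// Illustrative test helper: poll a condition every 10 ms until it holds or
// an overall deadline passes. Names are hypothetical, not part of the patch.
final class TestWait {
    static void waitFor(String what, BooleanSupplier condition, long timeoutMs)
            throws InterruptedException, TimeoutException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() > deadline) {
                throw new TimeoutException("Timed out waiting for " + what);
            }
            Thread.sleep(10);
        }
    }
}
{code}
The test could then replace the fixed sleep with something like {{TestWait.waitFor("slow request counter", () -> stats.getNumRequestsAboveThresholdTime() > 0, 5000)}}, assuming a getter for the numRequestsAboveThresholdTime counter discussed in this thread.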
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092574#comment-16092574 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user hanm commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/307#discussion_r128153360 --- Diff: src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java --- @@ -138,14 +145,56 @@ void delete(File f) throws IOException { ServerCnxnFactory getCnxnFactory() { return main.getCnxnFactory(); } + } -public static class TestZKSMain extends ZooKeeperServerMain { +public static class TestZKSMain extends ZooKeeperServerMain { + +private ServerStats serverStats; + +@Override +public ZooKeeperServer getZooKeeperServer(FileTxnSnapLog txnLog, ServerConfig config, ZKDatabase zkDb) { +ZooKeeperServer zooKeeperServer = super.getZooKeeperServer(txnLog, config, zkDb); +serverStats = zooKeeperServer.serverStats(); +return zooKeeperServer; +} + +@Override public void shutdown() { super.shutdown(); } } +// Test for ZOOKEEPER-2770 ZooKeeper slow operation log +@Test +public void testRequestWarningThreshold() throws IOException, KeeperException, InterruptedException { +ClientBase.setupTestEnv(); + +final int CLIENT_PORT = PortAssignment.unique(); + +MainThread main = new MainThread(CLIENT_PORT, true, null, 0); +main.start(); + +Assert.assertTrue("waiting for server being up", +ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT, +CONNECTION_TIMEOUT)); +// Get the stats object from the ZooKeeperServer to keep track of high latency requests. +ServerStats stats = main.main.serverStats; + +ZooKeeper zk = new ZooKeeper("127.0.0.1:" + CLIENT_PORT, +ClientBase.CONNECTION_TIMEOUT, this); + +zk.create("/foo1", "foobar".getBytes(), Ids.OPEN_ACL_UNSAFE, +CreateMode.PERSISTENT); + +Assert.assertEquals(new String(zk.getData("/foo1", null, null)), "foobar"); +// It takes a while for the counter to get updated sometimes, this is added to reduce flakyness +Thread.sleep(1000); --- End diff -- sync() will not help here because this specific counter is not part of any zab transactions. Two approaches to make test more robust: * We can always increase the counter before we send a response back to client in FinalRequestProcessor. This will make test deterministic, Or, * In test, implement retry polling with a timeout and a smaller sleep interval (e..g 10ms instead of 1000ms). We use this pattern a lot in our test. In general we'd like to make deterministic tests as much as possible (in other words, avoid sleep at all cost.) though sometimes that is hard to do due to various reasons. A side note on sync() - sync is not a quorum operation so it only works if the client connects to the leader. I think there is a jira trying to fix this. > ZooKeeper slow operation log > > > Key: ZOOKEEPER-2770 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Karan Mehta >Assignee: Karan Mehta > Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, > ZOOKEEPER-2770.003.patch > > > ZooKeeper is a complex distributed application. There are many reasons why > any given read or write operation may become slow: a software bug, a protocol > problem, a hardware issue with the commit log(s), a network issue. If the > problem is constant it is trivial to come to an understanding of the cause. > However in order to diagnose intermittent problems we often don't know where, > or when, to begin looking. 
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092569#comment-16092569 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user karanmehta93 commented on the issue: https://github.com/apache/zookeeper/pull/307 @hanm > I think at some point after merge this you need create another pull request targeting stable branch (branch-3.4) where the version number should be 3.4.11 (hopefully, if this can get in before next release) I will create a new PR for that branch as well. I am hoping this to get this out in next release if possible. > ZooKeeper slow operation log > > > Key: ZOOKEEPER-2770 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Karan Mehta >Assignee: Karan Mehta > Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, > ZOOKEEPER-2770.003.patch > > > ZooKeeper is a complex distributed application. There are many reasons why > any given read or write operation may become slow: a software bug, a protocol > problem, a hardware issue with the commit log(s), a network issue. If the > problem is constant it is trivial to come to an understanding of the cause. > However in order to diagnose intermittent problems we often don't know where, > or when, to begin looking. We need some sort of timestamped indication of the > problem. Although ZooKeeper is not a datastore, it does persist data, and can > suffer intermittent performance degradation, and should consider implementing > a 'slow query' log, a feature very common to services which persist > information on behalf of clients which may be sensitive to latency while > waiting for confirmation of successful persistence. > Log the client and request details if the server discovers, when finally > processing the request, that the current time minus arrival time of the > request is beyond a configured threshold. > Look at the HBase {{responseTooSlow}} feature for inspiration. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092565#comment-16092565 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user hanm commented on the issue: https://github.com/apache/zookeeper/pull/307 >> And for the documentation, this property is being introduced from 3.5.4 version, Is that correct? Yes, for this specific pull request targeting master. I think at some point after merge this you need create another pull request targeting stable branch (branch-3.4) where the version number should be 3.4.11 (hopefully, if this can get in before next release). > ZooKeeper slow operation log > > > Key: ZOOKEEPER-2770 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Karan Mehta >Assignee: Karan Mehta > Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, > ZOOKEEPER-2770.003.patch > > > ZooKeeper is a complex distributed application. There are many reasons why > any given read or write operation may become slow: a software bug, a protocol > problem, a hardware issue with the commit log(s), a network issue. If the > problem is constant it is trivial to come to an understanding of the cause. > However in order to diagnose intermittent problems we often don't know where, > or when, to begin looking. We need some sort of timestamped indication of the > problem. Although ZooKeeper is not a datastore, it does persist data, and can > suffer intermittent performance degradation, and should consider implementing > a 'slow query' log, a feature very common to services which persist > information on behalf of clients which may be sensitive to latency while > waiting for confirmation of successful persistence. > Log the client and request details if the server discovers, when finally > processing the request, that the current time minus arrival time of the > request is beyond a configured threshold. > Look at the HBase {{responseTooSlow}} feature for inspiration. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092524#comment-16092524 ] Hadoop QA commented on ZOOKEEPER-2770: -- -1 overall. GitHub Pull Request Build +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 3.0.1) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/888//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/888//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/888//console This message is automatically generated. > ZooKeeper slow operation log > > > Key: ZOOKEEPER-2770 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Karan Mehta >Assignee: Karan Mehta > Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, > ZOOKEEPER-2770.003.patch > > > ZooKeeper is a complex distributed application. There are many reasons why > any given read or write operation may become slow: a software bug, a protocol > problem, a hardware issue with the commit log(s), a network issue. If the > problem is constant it is trivial to come to an understanding of the cause. > However in order to diagnose intermittent problems we often don't know where, > or when, to begin looking. We need some sort of timestamped indication of the > problem. Although ZooKeeper is not a datastore, it does persist data, and can > suffer intermittent performance degradation, and should consider implementing > a 'slow query' log, a feature very common to services which persist > information on behalf of clients which may be sensitive to latency while > waiting for confirmation of successful persistence. > Log the client and request details if the server discovers, when finally > processing the request, that the current time minus arrival time of the > request is beyond a configured threshold. > Look at the HBase {{responseTooSlow}} feature for inspiration. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092421#comment-16092421 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user tdunning commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/307#discussion_r128130857 --- Diff: src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java --- @@ -138,14 +145,56 @@ void delete(File f) throws IOException { ServerCnxnFactory getCnxnFactory() { return main.getCnxnFactory(); } + } -public static class TestZKSMain extends ZooKeeperServerMain { +public static class TestZKSMain extends ZooKeeperServerMain { + +private ServerStats serverStats; + +@Override +public ZooKeeperServer getZooKeeperServer(FileTxnSnapLog txnLog, ServerConfig config, ZKDatabase zkDb) { +ZooKeeperServer zooKeeperServer = super.getZooKeeperServer(txnLog, config, zkDb); +serverStats = zooKeeperServer.serverStats(); +return zooKeeperServer; +} + +@Override public void shutdown() { super.shutdown(); } } +// Test for ZOOKEEPER-2770 ZooKeeper slow operation log +@Test +public void testRequestWarningThreshold() throws IOException, KeeperException, InterruptedException { +ClientBase.setupTestEnv(); + +final int CLIENT_PORT = PortAssignment.unique(); + +MainThread main = new MainThread(CLIENT_PORT, true, null, 0); +main.start(); + +Assert.assertTrue("waiting for server being up", +ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT, +CONNECTION_TIMEOUT)); +// Get the stats object from the ZooKeeperServer to keep track of high latency requests. +ServerStats stats = main.main.serverStats; + +ZooKeeper zk = new ZooKeeper("127.0.0.1:" + CLIENT_PORT, +ClientBase.CONNECTION_TIMEOUT, this); + +zk.create("/foo1", "foobar".getBytes(), Ids.OPEN_ACL_UNSAFE, +CreateMode.PERSISTENT); + +Assert.assertEquals(new String(zk.getData("/foo1", null, null)), "foobar"); +// It takes a while for the counter to get updated sometimes, this is added to reduce flakyness +Thread.sleep(1000); --- End diff -- So if the test does any kind of operation that involves the interaction of the `FinalRequestProcessor`, the completion of that operation will occur after the counter increment has completed. The client side can then query the counter with safety. You are correct that the counter isn't part of the state that ZK is protecting, but the effect will be the same since the `FRP` handles all requests in order. > ZooKeeper slow operation log > > > Key: ZOOKEEPER-2770 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Karan Mehta >Assignee: Karan Mehta > Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, > ZOOKEEPER-2770.003.patch > > > ZooKeeper is a complex distributed application. There are many reasons why > any given read or write operation may become slow: a software bug, a protocol > problem, a hardware issue with the commit log(s), a network issue. If the > problem is constant it is trivial to come to an understanding of the cause. > However in order to diagnose intermittent problems we often don't know where, > or when, to begin looking. We need some sort of timestamped indication of the > problem. 
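If the ordering argument above holds, the test could drop the sleep entirely. The fragment below is a sketch of that idea in the context of the test shown earlier in the thread (zk, stats, and the getter name are assumed from that discussion); it relies on the single FinalRequestProcessor handling requests in order, so completing a later synchronous operation implies the earlier request's counter increment has already run.
{code:java}
// Sketch only: a later synchronous read acts as an ordering barrier, so the
// create's slow-op bookkeeping must be finished by the time it returns.
zk.create("/foo1", "foobar".getBytes(), Ids.OPEN_ACL_UNSAFE,
        CreateMode.PERSISTENT);
Assert.assertEquals("foobar", new String(zk.getData("/foo1", null, null)));
Assert.assertTrue("counter should already reflect the create",
        stats.getNumRequestsAboveThresholdTime() > 0);
{code}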
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092332#comment-16092332 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user karanmehta93 commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/307#discussion_r128122188 --- Diff: src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java --- @@ -138,14 +145,56 @@ void delete(File f) throws IOException { ServerCnxnFactory getCnxnFactory() { return main.getCnxnFactory(); } + } -public static class TestZKSMain extends ZooKeeperServerMain { +public static class TestZKSMain extends ZooKeeperServerMain { + +private ServerStats serverStats; + +@Override +public ZooKeeperServer getZooKeeperServer(FileTxnSnapLog txnLog, ServerConfig config, ZKDatabase zkDb) { +ZooKeeperServer zooKeeperServer = super.getZooKeeperServer(txnLog, config, zkDb); +serverStats = zooKeeperServer.serverStats(); +return zooKeeperServer; +} + +@Override public void shutdown() { super.shutdown(); } } +// Test for ZOOKEEPER-2770 ZooKeeper slow operation log +@Test +public void testRequestWarningThreshold() throws IOException, KeeperException, InterruptedException { +ClientBase.setupTestEnv(); + +final int CLIENT_PORT = PortAssignment.unique(); + +MainThread main = new MainThread(CLIENT_PORT, true, null, 0); +main.start(); + +Assert.assertTrue("waiting for server being up", +ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT, +CONNECTION_TIMEOUT)); +// Get the stats object from the ZooKeeperServer to keep track of high latency requests. +ServerStats stats = main.main.serverStats; + +ZooKeeper zk = new ZooKeeper("127.0.0.1:" + CLIENT_PORT, +ClientBase.CONNECTION_TIMEOUT, this); + +zk.create("/foo1", "foobar".getBytes(), Ids.OPEN_ACL_UNSAFE, +CreateMode.PERSISTENT); + +Assert.assertEquals(new String(zk.getData("/foo1", null, null)), "foobar"); +// It takes a while for the counter to get updated sometimes, this is added to reduce flakyness +Thread.sleep(1000); --- End diff -- @tdunning > OK. The point of the timeout, however, is that an update has not been applied. The sync() is designed to wait for exactly this. Where does the update have to applied? To ZKDb or the counter? This is what my understanding is, correct me where I am getting it wrong. The call to `zk.getData()` is a blocking call. The `FinalRequestProcessor` completes all the relevant task related to this request and in the end sends out a reply packet to the client. It puts the packet on a queue which is handled by the `ServerCnxn` on a separate thread. In the meanwhile, the former thread goes and increments the `numRequestsAboveThresholdTime` counter. In a case when client gets acknowledged about the data being persisted at the node, the test proceeds to check the counter value. If the later thread got suspended due to any reason and didn't update the counter, this might cause the test the fail. > Code that depends on timeouts is not deterministic in my book. It may just have a low probability of failure. Agreed, using the term `completely deterministic` on my part was incorrect. > ZooKeeper slow operation log > > > Key: ZOOKEEPER-2770 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Karan Mehta >Assignee: Karan Mehta > Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, > ZOOKEEPER-2770.003.patch > > > ZooKeeper is a complex distributed application. 
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16091875#comment-16091875 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user tdunning commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/307#discussion_r128045203 --- Diff: src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java --- @@ -138,14 +145,56 @@ void delete(File f) throws IOException { ServerCnxnFactory getCnxnFactory() { return main.getCnxnFactory(); } + } -public static class TestZKSMain extends ZooKeeperServerMain { +public static class TestZKSMain extends ZooKeeperServerMain { + +private ServerStats serverStats; + +@Override +public ZooKeeperServer getZooKeeperServer(FileTxnSnapLog txnLog, ServerConfig config, ZKDatabase zkDb) { +ZooKeeperServer zooKeeperServer = super.getZooKeeperServer(txnLog, config, zkDb); +serverStats = zooKeeperServer.serverStats(); +return zooKeeperServer; +} + +@Override public void shutdown() { super.shutdown(); } } +// Test for ZOOKEEPER-2770 ZooKeeper slow operation log +@Test +public void testRequestWarningThreshold() throws IOException, KeeperException, InterruptedException { +ClientBase.setupTestEnv(); + +final int CLIENT_PORT = PortAssignment.unique(); + +MainThread main = new MainThread(CLIENT_PORT, true, null, 0); +main.start(); + +Assert.assertTrue("waiting for server being up", +ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT, +CONNECTION_TIMEOUT)); +// Get the stats object from the ZooKeeperServer to keep track of high latency requests. +ServerStats stats = main.main.serverStats; + +ZooKeeper zk = new ZooKeeper("127.0.0.1:" + CLIENT_PORT, +ClientBase.CONNECTION_TIMEOUT, this); + +zk.create("/foo1", "foobar".getBytes(), Ids.OPEN_ACL_UNSAFE, +CreateMode.PERSISTENT); + +Assert.assertEquals(new String(zk.getData("/foo1", null, null)), "foobar"); +// It takes a while for the counter to get updated sometimes, this is added to reduce flakyness +Thread.sleep(1000); --- End diff -- OK. The point of the timeout, however, is that an update has not been applied. The sync() is designed to wait for exactly this. Code that depends on timeouts is not deterministic in my book. It may just have a low probability of failure. If this isn't feasible, however, it may be something that has to be lived with. > ZooKeeper slow operation log > > > Key: ZOOKEEPER-2770 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Karan Mehta >Assignee: Karan Mehta > Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, > ZOOKEEPER-2770.003.patch > > > ZooKeeper is a complex distributed application. There are many reasons why > any given read or write operation may become slow: a software bug, a protocol > problem, a hardware issue with the commit log(s), a network issue. If the > problem is constant it is trivial to come to an understanding of the cause. > However in order to diagnose intermittent problems we often don't know where, > or when, to begin looking. We need some sort of timestamped indication of the > problem. Although ZooKeeper is not a datastore, it does persist data, and can > suffer intermittent performance degradation, and should consider implementing > a 'slow query' log, a feature very common to services which persist > information on behalf of clients which may be sensitive to latency while > waiting for confirmation of successful persistence. 
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16091189#comment-16091189 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user karanmehta93 commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/307#discussion_r127892837 --- Diff: src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java --- @@ -138,14 +145,56 @@ void delete(File f) throws IOException { ServerCnxnFactory getCnxnFactory() { return main.getCnxnFactory(); } + } -public static class TestZKSMain extends ZooKeeperServerMain { +public static class TestZKSMain extends ZooKeeperServerMain { + +private ServerStats serverStats; + +@Override +public ZooKeeperServer getZooKeeperServer(FileTxnSnapLog txnLog, ServerConfig config, ZKDatabase zkDb) { +ZooKeeperServer zooKeeperServer = super.getZooKeeperServer(txnLog, config, zkDb); +serverStats = zooKeeperServer.serverStats(); +return zooKeeperServer; +} + +@Override public void shutdown() { super.shutdown(); } } +// Test for ZOOKEEPER-2770 ZooKeeper slow operation log +@Test +public void testRequestWarningThreshold() throws IOException, KeeperException, InterruptedException { +ClientBase.setupTestEnv(); + +final int CLIENT_PORT = PortAssignment.unique(); + +MainThread main = new MainThread(CLIENT_PORT, true, null, 0); +main.start(); + +Assert.assertTrue("waiting for server being up", +ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT, +CONNECTION_TIMEOUT)); +// Get the stats object from the ZooKeeperServer to keep track of high latency requests. +ServerStats stats = main.main.serverStats; + +ZooKeeper zk = new ZooKeeper("127.0.0.1:" + CLIENT_PORT, +ClientBase.CONNECTION_TIMEOUT, this); + +zk.create("/foo1", "foobar".getBytes(), Ids.OPEN_ACL_UNSAFE, +CreateMode.PERSISTENT); + +Assert.assertEquals(new String(zk.getData("/foo1", null, null)), "foobar"); +// It takes a while for the counter to get updated sometimes, this is added to reduce flakyness +Thread.sleep(1000); --- End diff -- > There is a sync() call that ZK supports on the client side. It waits until the replica that the client is connected to catches up with the leader. I am not sure if this is the case here since this test has a single instance of ZooKeeper Server and not a quorum. IMO, the conditional timeout should be good enough to make this test completely deterministic. Suggest if you feel otherwise. > ZooKeeper slow operation log > > > Key: ZOOKEEPER-2770 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Karan Mehta >Assignee: Karan Mehta > Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, > ZOOKEEPER-2770.003.patch > > > ZooKeeper is a complex distributed application. There are many reasons why > any given read or write operation may become slow: a software bug, a protocol > problem, a hardware issue with the commit log(s), a network issue. If the > problem is constant it is trivial to come to an understanding of the cause. > However in order to diagnose intermittent problems we often don't know where, > or when, to begin looking. We need some sort of timestamped indication of the > problem. 
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16091065#comment-16091065 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user tdunning commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/307#discussion_r127877506 --- Diff: src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java --- @@ -138,14 +145,56 @@ void delete(File f) throws IOException { ServerCnxnFactory getCnxnFactory() { return main.getCnxnFactory(); } + } -public static class TestZKSMain extends ZooKeeperServerMain { +public static class TestZKSMain extends ZooKeeperServerMain { + +private ServerStats serverStats; + +@Override +public ZooKeeperServer getZooKeeperServer(FileTxnSnapLog txnLog, ServerConfig config, ZKDatabase zkDb) { +ZooKeeperServer zooKeeperServer = super.getZooKeeperServer(txnLog, config, zkDb); +serverStats = zooKeeperServer.serverStats(); +return zooKeeperServer; +} + +@Override public void shutdown() { super.shutdown(); } } +// Test for ZOOKEEPER-2770 ZooKeeper slow operation log +@Test +public void testRequestWarningThreshold() throws IOException, KeeperException, InterruptedException { +ClientBase.setupTestEnv(); + +final int CLIENT_PORT = PortAssignment.unique(); + +MainThread main = new MainThread(CLIENT_PORT, true, null, 0); +main.start(); + +Assert.assertTrue("waiting for server being up", +ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT, +CONNECTION_TIMEOUT)); +// Get the stats object from the ZooKeeperServer to keep track of high latency requests. +ServerStats stats = main.main.serverStats; + +ZooKeeper zk = new ZooKeeper("127.0.0.1:" + CLIENT_PORT, +ClientBase.CONNECTION_TIMEOUT, this); + +zk.create("/foo1", "foobar".getBytes(), Ids.OPEN_ACL_UNSAFE, +CreateMode.PERSISTENT); + +Assert.assertEquals(new String(zk.getData("/foo1", null, null)), "foobar"); +// It takes a while for the counter to get updated sometimes, this is added to reduce flakyness +Thread.sleep(1000); --- End diff -- There is a sync() call that ZK supports on the client side. It waits until the replica that the client is connected to catches up with the leader. This solves lots of timing issues and is typically the way to implement read-what-you-wrote. On the other hand, if your test is running with server internals, this isn't likely to work as such. The point is that if you wait until all pending items in the queue have been committed to all followers, you will be guaranteed to see all writes you caused before starting to wait. > ZooKeeper slow operation log > > > Key: ZOOKEEPER-2770 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Karan Mehta >Assignee: Karan Mehta > Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, > ZOOKEEPER-2770.003.patch > > > ZooKeeper is a complex distributed application. There are many reasons why > any given read or write operation may become slow: a software bug, a protocol > problem, a hardware issue with the commit log(s), a network issue. If the > problem is constant it is trivial to come to an understanding of the cause. > However in order to diagnose intermittent problems we often don't know where, > or when, to begin looking. We need some sort of timestamped indication of the > problem. 
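For illustration only (not part of the patch under review), a minimal sketch of the sync()-based wait suggested above, assuming nothing beyond the public client API (`ZooKeeper.sync` with an `AsyncCallback.VoidCallback`); the helper name `syncAndWait` is made up for the example:

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;

    import org.apache.zookeeper.AsyncCallback.VoidCallback;
    import org.apache.zookeeper.ZooKeeper;

    public class SyncWait {
        // Blocks until the server this client is connected to has applied all
        // pending updates for the path, or until the timeout expires.
        public static boolean syncAndWait(ZooKeeper zk, String path, long timeoutMs)
                throws InterruptedException {
            final CountDownLatch latch = new CountDownLatch(1);
            zk.sync(path, new VoidCallback() {
                @Override
                public void processResult(int rc, String p, Object ctx) {
                    // The sync response only comes back once the connected server
                    // has caught up, so the caller can safely assert afterwards.
                    latch.countDown();
                }
            }, null);
            return latch.await(timeoutMs, TimeUnit.MILLISECONDS);
        }
    }

As noted in the comment above, this helps a test that asserts on data visible through the client, but not necessarily one that asserts on server-internal counters such as ServerStats.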
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16090088#comment-16090088 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user karanmehta93 commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/307#discussion_r127762613

    --- Diff: src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java ---
    @@ -460,11 +464,21 @@ public void processRequest(Request request) {
                 if (request.type == OpCode.closeSession) {
                     cnxn.sendCloseSession();
                 }
    +            checkLatency(request);
             } catch (IOException e) {
                 LOG.error("FIXMSG",e);
             }
         }

    +    private void checkLatency(Request request) {
    +        long requestWarnThresholdMs = QuorumPeerConfig.getRequestWarnResponseThresholdMs();
    --- End diff --

    I am thinking of caching the value of `requestWarnThresholdMs` inside `FinalRequestProcessor.java`. Is that a good idea?
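A rough sketch of what the caching idea could look like — this is not the actual FinalRequestProcessor; the class and field names are illustrative, while the threshold semantics (-1 disabled, 0 log everything, >0 threshold in ms) follow the patch:

    public class CachedThresholdSketch {
        // Illustrative: read the configured value once instead of on every request.
        private final long requestWarnThresholdMs;

        public CachedThresholdSketch(long configuredThresholdMs) {
            this.requestWarnThresholdMs = configuredThresholdMs;
        }

        public boolean isAboveThreshold(long requestLatencyMs) {
            // -1 (or any negative value) disables the check, 0 logs every request,
            // a positive value is the latency threshold in milliseconds.
            return requestWarnThresholdMs == 0
                    || (requestWarnThresholdMs > -1 && requestLatencyMs > requestWarnThresholdMs);
        }
    }

The trade-off is that a cached value avoids a static lookup per request, but a configuration change after the processor is constructed would not be picked up.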
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16090045#comment-16090045 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user karanmehta93 commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/307#discussion_r127757305

    --- Diff: src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java ---
    (same testRequestWarningThreshold hunk as quoted above; the comment is on the `Thread.sleep(1000);` line)
    --- End diff --

    @tdunning Yes, a conditional timeout is a better option. From what I understand, the failure can be caused in a test-only scenario where there is no physical network between client and server, which sometimes results in the client getting the acknowledgement back while the server has not yet finished incrementing the count. IMO, it is not possible in an actual cluster.

    > Would this be susceptible to a sync() to force the test to come up to date with the leader?

    I am not sure what you mean by this. Could you please explain?
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089846#comment-16089846 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user tdunning commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/307#discussion_r127711561

    --- Diff: src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java ---
    (same testRequestWarningThreshold hunk as quoted above; the comment is on the `Thread.sleep(1000);` line)
    --- End diff --

    At the very least, make the timeout conditional on getting the wrong answer. That avoids paying the penalty of a slow test every time.

    The mere presence of a timeout indicates a problem with concurrency, however. Would this be susceptible to a sync() to force the test to come up to date with the leader? If so, that can make the test deterministic.
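A generic sketch of the conditional wait being suggested, replacing the fixed `Thread.sleep(1000)` with a bounded poll; the `Condition` interface and `waitFor` helper are invented for the example and are not ZooKeeper test-harness code:

    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    public class WaitFor {
        public interface Condition {
            boolean holds();
        }

        // Polls until the condition holds or the deadline passes, so the test only
        // pays a waiting penalty when the counter really is lagging behind.
        public static void waitFor(Condition condition, long timeoutMs)
                throws InterruptedException, TimeoutException {
            long deadline = System.currentTimeMillis() + timeoutMs;
            while (!condition.holds()) {
                if (System.currentTimeMillis() > deadline) {
                    throw new TimeoutException("condition not met within " + timeoutMs + " ms");
                }
                TimeUnit.MILLISECONDS.sleep(10);
            }
        }
    }

The test could then wait for something like `stats.getNumRequestsAboveThresholdTime()` reaching the expected value (the getter name follows the patch) instead of sleeping unconditionally.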
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089319#comment-16089319 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user karanmehta93 commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/307#discussion_r127633626

    --- Diff: src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java ---
    (same testRequestWarningThreshold hunk as quoted above; the comment is on the `Assert.assertEquals(new String(zk.getData("/foo1", null, null)), "foobar");` line)
    --- End diff --

    Not necessary as such, but it will be useful for debugging if some change to the `getData()` method causes this test to fail.
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089315#comment-16089315 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user karanmehta93 commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/307#discussion_r127633278

    --- Diff: src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java ---
    (same testRequestWarningThreshold hunk as quoted above; the comment is on the `Thread.sleep(1000);` line)
    --- End diff --

    I didn't see it consistently. I ran the same test about 60-70 times individually and a couple of times as a suite with all the other tests. It happened only once for me that the final assertion failed because the registered count was 2 instead of 3, and that was when I was running it with other tests from the same class. What do you suggest? Should I reduce the timeout or remove it altogether?
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089313#comment-16089313 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user karanmehta93 commented on the issue:

    https://github.com/apache/zookeeper/pull/307

    @eribeiro Thank you for your comments. Will post an updated patch soon. A small concern regarding the `ServerStats.updateLatency()` method: should the updated patch keep both methods for now and file another JIRA to look into it? And for the documentation, this property is being introduced in version `3.5.4`, is that correct? @hanm
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089281#comment-16089281 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user eribeiro commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/307#discussion_r127627334

    --- Diff: src/java/main/org/apache/zookeeper/server/quorum/QuorumPeerConfig.java ---
    @@ -292,6 +293,14 @@ public void parseProperties(Properties zkProp)
                     }
                 } else if ((key.startsWith("server.") || key.startsWith("group") || key.startsWith("weight")) && zkProp.containsKey("dynamicConfigFile")) {
                     throw new ConfigException("parameter: " + key + " must be in a separate dynamic config file");
    +            } else if(key.equals("request.warningthresholdms")) {
    --- End diff --

    nit: space between `if` and `(`
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089276#comment-16089276 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user eribeiro commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/307#discussion_r127628162

    --- Diff: src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java ---
    @@ -27,6 +27,7 @@
     import java.util.concurrent.CountDownLatch;
     import java.util.concurrent.TimeUnit;

    +import org.apache.log4j.BasicConfigurator;
    --- End diff --

    Unused import
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089275#comment-16089275 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user eribeiro commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/307#discussion_r127627096

    --- Diff: src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java ---
    @@ -64,10 +65,15 @@
             public MainThread(int clientPort, boolean preCreateDirs, String configs)
                     throws IOException {
    -            this(clientPort, preCreateDirs, ClientBase.createTmpDir(), configs);
    +            this(clientPort, preCreateDirs, ClientBase.createTmpDir(), configs, -1);
             }

    -        public MainThread(int clientPort, boolean preCreateDirs, File tmpDir, String configs)
    +        public MainThread(int clientPort, boolean preCreateDirs, String configs, int threshold)
    --- End diff --

    The word `threshold` in the method signature doesn't say much, so it requires digging into the code to see what it's all about. It would be nice to rename it to something closer to its intent (there are plenty of names in this patch already).
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089278#comment-16089278 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user eribeiro commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/307#discussion_r127618910

    --- Diff: src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java ---
    @@ -460,11 +464,21 @@ public void processRequest(Request request) {
                 if (request.type == OpCode.closeSession) {
                     cnxn.sendCloseSession();
                 }
    +            checkLatency(request);
             } catch (IOException e) {
                 LOG.error("FIXMSG",e);
             }
         }

    +    private void checkLatency(Request request) {
    +        long requestWarnThresholdMs = QuorumPeerConfig.getRequestWarnResponseThresholdMs();
    +        long requestLatency = Time.currentElapsedTime() - request.createTime;
    +        if((requestWarnThresholdMs > -1 && requestLatency > requestWarnThresholdMs) || requestWarnThresholdMs == 0) {
    +            LOG.warn("Request " + request + " exceeded threshold. Took " + requestLatency + " ms.");
    +            zks.serverStats().updateNumRequestsExceededThresholdTime();
    --- End diff --

    Cool! :)

    nit: Two suggestions: change the prefix from `update` to `inc`, and use `numRequestsAboveThresholdTime()` (up to you both).
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089279#comment-16089279 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user eribeiro commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/307#discussion_r127619032

    --- Diff: src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java ---
    (same testRequestWarningThreshold hunk as quoted above; the comment is on the `// Test for ZOOKEEPER-2770 ZooKeeper slow operation log` line)
    --- End diff --

    We don't put in this kind of comment. The git log already has the patch name, description, and files changed/added.
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089277#comment-16089277 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user eribeiro commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/307#discussion_r127618988

    --- Diff: src/java/main/org/apache/zookeeper/server/ServerStats.java ---
    @@ -32,6 +32,7 @@
         private long minLatency = Long.MAX_VALUE;
         private long totalLatency = 0;
         private long count = 0;
    +    private long numRequestsExceededThresholdTime = 0;
    --- End diff --

    The addition of this metric is missing a `resetRequestsExceededThresholdTime` method to zero the counter. ;)
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089280#comment-16089280 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user eribeiro commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/307#discussion_r127627537

    --- Diff: src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java ---
    (same testRequestWarningThreshold hunk as quoted above; the comment is on the `Assert.assertEquals(new String(zk.getData("/foo1", null, null)), "foobar");` line)
    --- End diff --

    question: Is this assertion check really necessary? Isn't a plain `zk.getData("/foo1", null, null);` enough?
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089282#comment-16089282 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user eribeiro commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/307#discussion_r127627805

    --- Diff: src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java ---
    (same testRequestWarningThreshold hunk as quoted above; the comment is on the `Thread.sleep(1000);` line)
    --- End diff --

    Wondering why... 🤔 I have run this test case a couple of times, but was unable to see this lag in the counter update manifest as an assertion failure. Did you see this consistently?
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089160#comment-16089160 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user eribeiro commented on the issue:

    https://github.com/apache/zookeeper/pull/307

    @karanmehta93 hi, mate. You don't need to touch ZooKeeperAdmin.html. You change ZooKeeperAdmin.xml **only**, and Apache Forrest generates the HTML and PDF. Also, this XML change should be part of this patch, okay? Cheers!
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16088755#comment-16088755 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user karanmehta93 commented on the issue:

    https://github.com/apache/zookeeper/pull/307

    @hanm @eribeiro Thank you for your comments. I have added a rebased patch with a test in the `ZooKeeperServerMainTest` class. Please review whenever convenient.

    > It's odd to have two (or in future even more) types of latency checks scattered around which creates fragmentation w.r.t. the definition of what a request latency means.

    We can decide based on opinions from other people and I can update the patch accordingly. I will create new JIRAs based on @tdunning's suggestions after this one gets approved; I am eager to take them up.

    Documentation is still pending; I will add a patch for it soon. I am not sure whether I have to modify the ZooKeeperAdmin.xml file or the ZooKeeperAdmin.html file. Can you point me to any relevant material?
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16086643#comment-16086643 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user karanmehta93 commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/307#discussion_r127363108

    --- Diff: src/java/main/org/apache/zookeeper/server/quorum/QuorumPeerConfig.java ---
    @@ -61,6 +61,7 @@
         private static boolean standaloneEnabled = true;
         private static boolean reconfigEnabled = false;
    +    private static int warnThreshold = 1;
    --- End diff --

    I suggest turning the logging on by default. It won't hurt much as such, and people can opt out if they want. It's similar to the usual stats collection, IMO.
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16086597#comment-16086597 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user hanm commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/307#discussion_r127356689

    --- Diff: src/java/main/org/apache/zookeeper/server/quorum/QuorumPeerConfig.java ---
    @@ -61,6 +61,7 @@
         private static boolean standaloneEnabled = true;
         private static boolean reconfigEnabled = false;
    +    private static int warnThreshold = 1;
    --- End diff --

    Should this have a default value of -1 instead? I think we'd like this to be an opt-in feature, and by default the logging should be disabled.
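A sketch of how the opt-in default being discussed could be parsed. This is illustrative only: it reuses the `request.warningthresholdms` key from the patch, but the class and method names are made up:

    import java.util.Properties;

    public class WarnThresholdConfigSketch {
        // -1 disables the slow-request log (opt-in), 0 logs every request,
        // and any positive value is the warning threshold in milliseconds.
        public static long parseRequestWarnThresholdMs(Properties zkProp) {
            String value = zkProp.getProperty("request.warningthresholdms", "-1");
            return Long.parseLong(value.trim());
        }
    }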
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16086596#comment-16086596 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user hanm commented on the issue:

    https://github.com/apache/zookeeper/pull/307

    I think we should consolidate the latency check in `zks.serverStats().updateLatency`. It's odd to have two (or in the future even more) types of latency checks scattered around, which creates fragmentation w.r.t. the definition of what a request latency means. The existing latency measurement in ServerStats measures the time between a request's creation and the request landing at the final request processor; the patch instead measures the end-to-end time of a request from its start to the end of processing. I am fine with the end-to-end processing time, though I'd like to double check with a few folks around to make sure the regression risk and impact of this change are limited.

    I think ServerStats is a good place to put the data structure Ted recommended.

    I think it's a good idea to scope the JIRA so it's easier to get it reviewed and committed. What this patch is doing is a positive improvement to the operational aspects of ZK, so that can be the scope of this PR. On top of that, future improvements could be what Edward and Ted suggested (JMX, distribution of latencies / histogram, etc.). That work can be tracked by making these items sub-tasks under the current JIRA.
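To make the consolidation idea concrete, here is a rough sketch — not the real ServerStats, and the names are illustrative — of a single updateLatency entry point that feeds both the min/avg/max statistics and the slow-operation warning from the same latency value:

    import java.util.concurrent.atomic.AtomicLong;

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class ConsolidatedStatsSketch {
        private static final Logger LOG = LoggerFactory.getLogger(ConsolidatedStatsSketch.class);

        private final long requestWarnThresholdMs;   // -1 disables, 0 logs all, >0 threshold in ms
        private long minLatency = Long.MAX_VALUE;
        private long maxLatency = 0;
        private long totalLatency = 0;
        private long count = 0;
        private final AtomicLong numRequestsAboveThresholdTime = new AtomicLong();

        public ConsolidatedStatsSketch(long requestWarnThresholdMs) {
            this.requestWarnThresholdMs = requestWarnThresholdMs;
        }

        // Single entry point: one definition of "request latency" feeds both the
        // min/avg/max stats and the slow-operation warning.
        public synchronized void updateLatency(String requestDescription, long latencyMs) {
            totalLatency += latencyMs;
            count++;
            minLatency = Math.min(minLatency, latencyMs);
            maxLatency = Math.max(maxLatency, latencyMs);

            boolean enabledAndAbove = requestWarnThresholdMs == 0
                    || (requestWarnThresholdMs > -1 && latencyMs > requestWarnThresholdMs);
            if (enabledAndAbove) {
                numRequestsAboveThresholdTime.incrementAndGet();
                LOG.warn("Request {} exceeded threshold. Took {} ms", requestDescription, latencyMs);
            }
        }
    }

Whether the latency fed in is arrival-at-final-processor time or true end-to-end time is exactly the definitional question raised above; the sketch is agnostic about that.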
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085462#comment-16085462 ]

ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user eribeiro commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/307#discussion_r127172773

    --- Diff: src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java ---
    @@ -430,6 +432,7 @@ public void processRequest(Request request) {
                 // the client and leader disagree on where the client is most
                 // recently attached (and therefore invalid SESSION MOVED generated)
                 cnxn.sendCloseSession();
    +            request.checkLatency();
    --- End diff --

    Yes. The percentiles and so on should be a matter for another issue, no doubt about that, IMO. The meat of my comment was really whether checkLatency shouldn't be the responsibility of the Stats class, even if it stays a separate method (i.e., not folded into updateLatency).
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085180#comment-16085180 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user hanm commented on the issue: https://github.com/apache/zookeeper/pull/307 @karanmehta93 Thanks for updating the pull request with more context information. Would you mind putting the context description in the pull request description? Regarding test cases, I think you can just add a new test file under zookeeper/test.
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085171#comment-16085171 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user karanmehta93 commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/307#discussion_r127127351 --- Diff: src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java --- @@ -430,6 +432,7 @@ public void processRequest(Request request) { // the client and leader disagree on where the client is most // recently attached (and therefore invalid SESSION MOVED generated) cnxn.sendCloseSession(); +request.checkLatency(); --- End diff -- @eribeiro That is exactly why I created my own function after `cnxn.sendCloseSession()`: I was interested in end-to-end latency, so I didn't use the `zks.serverStats().updateLatency(request.createTime);` method. I would also like to hear others' opinions on this. Should we scope this JIRA to adding a general latency threshold, and create a new JIRA for extending the scope to percentiles backed by a new data structure inside the `ServerStats` class?
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085106#comment-16085106 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user eribeiro commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/307#discussion_r127112584 --- Diff: src/java/main/org/apache/zookeeper/server/quorum/QuorumPeerConfig.java --- @@ -735,6 +738,8 @@ public String getConfigFilename(){ public Boolean getQuorumListenOnAllIPs() { return quorumListenOnAllIPs; } + +public static int getWarnResponseTime() { return warnThreshold; } --- End diff -- What about renaming this method to `getMaxRequestThreshold()` or `getWarnRequestThreshold()`?
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085105#comment-16085105 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user eribeiro commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/307#discussion_r12796 --- Diff: src/java/main/org/apache/zookeeper/server/quorum/QuorumPeerConfig.java --- @@ -61,6 +61,7 @@ private static boolean standaloneEnabled = true; private static boolean reconfigEnabled = false; +private static int warnThreshold = 1; --- End diff -- I would suggest renaming this field to `warnThresholdMs` or `maxRequestThresholdMs`. The important part here is the `Ms` suffix.
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085107#comment-16085107 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user eribeiro commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/307#discussion_r127117526 --- Diff: src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java --- @@ -430,6 +432,7 @@ public void processRequest(Request request) { // the client and leader disagree on where the client is most // recently attached (and therefore invalid SESSION MOVED generated) cnxn.sendCloseSession(); +request.checkLatency(); --- End diff -- On a related note, besides logging those slow requests, wouldn't it be nice to keep a counter of how many requests were above the latency threshold in `zks.serverStats()` so that it could be exposed as a metric via JMX? Going even further, we could expose other metrics related to slow requests, such as average time and 99th-percentile latency, as suggested by @tdunning.
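One way to realize that counter, sketched here with illustrative names (the `numRequestsAboveThresholdTime` identifier and the MBean wiring are assumptions, not the committed API):

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: a slow-request counter kept alongside the other per-server stats,
// surfaced through a getter that a JMX MBean could delegate to.
public class SlowRequestCounterSketch {
    private final AtomicLong numRequestsAboveThresholdTime = new AtomicLong();

    // Called from the request path whenever a request exceeds the warn threshold.
    public void incNumRequestsAboveThresholdTime() {
        numRequestsAboveThresholdTime.incrementAndGet();
    }

    // A server MBean could expose this so monitoring systems can poll it.
    public long getNumRequestsAboveThresholdTime() {
        return numRequestsAboveThresholdTime.get();
    }

    // Resettable, like the other request counters.
    public void resetNumRequestsAboveThresholdTime() {
        numRequestsAboveThresholdTime.set(0);
    }
}
```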
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085103#comment-16085103 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user eribeiro commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/307#discussion_r127110615 --- Diff: src/java/main/org/apache/zookeeper/server/Request.java --- @@ -21,20 +21,27 @@ import java.nio.ByteBuffer; import java.util.List; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; import org.apache.jute.Record; import org.apache.zookeeper.KeeperException; import org.apache.zookeeper.ZooDefs.OpCode; import org.apache.zookeeper.common.Time; import org.apache.zookeeper.data.Id; +import org.apache.zookeeper.server.quorum.QuorumPeerConfig; import org.apache.zookeeper.server.quorum.flexible.QuorumVerifier; import org.apache.zookeeper.txn.TxnHeader; + /** * This is the structure that represents a request moving through a chain of * RequestProcessors. There are various pieces of information that is tacked * onto the request as it is processed. */ public class Request { +private static final Logger LOG = LoggerFactory +.getLogger(Request.class); --- End diff -- No need to break line here.
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085104#comment-16085104 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user eribeiro commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/307#discussion_r127116226 --- Diff: src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java --- @@ -430,6 +432,7 @@ public void processRequest(Request request) { // the client and leader disagree on where the client is most // recently attached (and therefore invalid SESSION MOVED generated) cnxn.sendCloseSession(); +request.checkLatency(); --- End diff -- I see that both `request.checkLatency()` and `zks.serverStats().updateLatency(request.createTime);` (a few lines above) deal with the same matter (request latency), even though they sit at radically different positions. IMHO, `zks.serverStats().updateLatency(request.createTime);` should be moved to the line where `request.checkLatency()` currently is, the if condition from `request.checkLatency()` should be folded into `zks.serverStats().updateLatency(request.createTime);`, and `request.checkLatency()` could eventually be removed. Could this introduce unexpected bugs? @karanmehta93 @hanm, wdyt?
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085102#comment-16085102 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user eribeiro commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/307#discussion_r127110211 --- Diff: src/java/main/org/apache/zookeeper/server/Request.java --- @@ -121,6 +128,13 @@ public void setTxn(Record txn) { this.txn = txn; } +public void checkLatency() { +long requestLatency = Time.currentElapsedTime() - this.createTime; +if(QuorumPeerConfig.getWarnResponseTime() > -1 && requestLatency > QuorumPeerConfig.getWarnResponseTime()) { --- End diff -- nit: space between `if` and `(`. You can shorten the `if` expression and avoid calling the same method twice in the same expression by doing:
```java
long warnRequestThreshold = QuorumPeerConfig.getWarnResponseTime();
if (warnRequestThreshold > -1 && requestLatency > warnRequestThreshold) {
```
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085060#comment-16085060 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user karanmehta93 commented on the issue: https://github.com/apache/zookeeper/pull/307 From my understanding, the `QuorumPeer` class would be the candidate class to maintain this data structure. Let me know of any alternatives. In the meantime, I will look into the data-structure ideas you provided.
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085014#comment-16085014 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user eribeiro commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/307#discussion_r127109651 --- Diff: src/java/main/org/apache/zookeeper/server/quorum/QuorumPeerConfig.java --- @@ -292,6 +293,8 @@ public void parseProperties(Properties zkProp) } } else if ((key.startsWith("server.") || key.startsWith("group") || key.startsWith("weight")) && zkProp.containsKey("dynamicConfigFile")) { throw new ConfigException("parameter: " + key + " must be in a separate dynamic config file"); +} else if(key.equals("warn.responseTime")) { --- End diff -- +1 on @hanm's suggestions. nit: a space between `if` and `(` at line 296.
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16084975#comment-16084975 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user tdunning commented on the issue: https://github.com/apache/zookeeper/pull/307 So here is an alternative suggestion. For each class of operation (GET, PUT, MULTIPUT, DELETE and so on), maintain a data structure (such as t-digest or FloatHistogram; see https://github.com/tdunning/t-digest) that keeps track of the distribution of latencies for that operation. Log any transactions that are above a settable percentile latency for operations of that class; the default could be 99.99%. Operations longer than 2x the 99.9th-percentile latency should not be entered into the historical distribution.
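A hedged sketch of that adaptive approach, assuming the t-digest library from the linked repository is on the classpath; the class name, operation classes, and quantile defaults below are illustrative assumptions, and the wiring into the ZooKeeper request path is omitted:

```java
import com.tdunning.math.stats.TDigest;
import java.util.EnumMap;
import java.util.Map;

// Sketch: per-operation-class latency digests with a percentile-based slow threshold
// instead of a fixed millisecond value.
public class AdaptiveSlowLogSketch {
    enum OpClass { GET, PUT, MULTIPUT, DELETE }

    private final Map<OpClass, TDigest> digests = new EnumMap<>(OpClass.class);
    private final double warnQuantile;     // e.g. 0.9999
    private final double excludeQuantile;  // e.g. 0.999; 2x this latency is kept out of the history

    public AdaptiveSlowLogSketch(double warnQuantile, double excludeQuantile) {
        this.warnQuantile = warnQuantile;
        this.excludeQuantile = excludeQuantile;
        for (OpClass op : OpClass.values()) {
            digests.put(op, TDigest.createMergingDigest(100));
        }
    }

    /** Records a latency sample and returns true if the request should be logged as slow. */
    public synchronized boolean record(OpClass op, double latencyMs) {
        TDigest digest = digests.get(op);
        boolean slow = digest.size() > 0 && latencyMs > digest.quantile(warnQuantile);
        // Keep extreme outliers out of the historical distribution so a burst of very
        // slow requests does not inflate the threshold for future requests.
        if (digest.size() == 0 || latencyMs <= 2 * digest.quantile(excludeQuantile)) {
            digest.add(latencyMs);
        }
        return slow;
    }
}
```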
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16084536#comment-16084536 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user karanmehta93 commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/307#discussion_r127048435 --- Diff: src/java/main/org/apache/zookeeper/server/quorum/QuorumPeerConfig.java --- @@ -292,6 +293,8 @@ public void parseProperties(Properties zkProp) } } else if ((key.startsWith("server.") || key.startsWith("group") || key.startsWith("weight")) && zkProp.containsKey("dynamicConfigFile")) { throw new ConfigException("parameter: " + key + " must be in a separate dynamic config file"); +} else if(key.equals("warn.responseTime")) { --- End diff -- I will document the new property and change its name to the suggested one. I have tested this feature manually. I am currently trying to add a unit test case in NettyServerCnxnTest.java or NIOServerCnxnTest.java, but that seems like too high-level an approach. Can you suggest any locations for adding a test case? @hanm
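For reference, a minimal standalone sketch of parsing and validating a millisecond-suffixed warn-threshold property; the `warnThresholdMs` key and the "-1 disables the feature" semantics are placeholders, not the names or behavior settled on in this thread:

```java
import java.util.Properties;

// Sketch: parse and validate a millisecond warn threshold from server properties.
public class WarnThresholdConfigSketch {
    static int parseWarnThresholdMs(Properties zkProp) {
        String value = zkProp.getProperty("warnThresholdMs", "-1");
        int warnThresholdMs = Integer.parseInt(value.trim());
        if (warnThresholdMs < -1) {
            throw new IllegalArgumentException(
                    "warnThresholdMs must be -1 (disabled) or a non-negative number of ms, found: " + value);
        }
        return warnThresholdMs;
    }

    public static void main(String[] args) {
        Properties p = new Properties();
        p.setProperty("warnThresholdMs", "300");
        System.out.println("warn threshold = " + parseWarnThresholdMs(p) + " ms");
    }
}
```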
[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16084500#comment-16084500 ] ASF GitHub Bot commented on ZOOKEEPER-2770: --- Github user karanmehta93 commented on the issue: https://github.com/apache/zookeeper/pull/307 ZooKeeper is a complex distributed application. There are many reasons why any given read or write operation may become slow: a software bug, a protocol problem, a hardware issue with the commit log(s), a network issue. If the problem is constant it is trivial to come to an understanding of the cause. However in order to diagnose intermittent problems we often don't know where, or when, to begin looking. We need some sort of timestamped indication of the problem. Although ZooKeeper is not a datastore, it does persist data, and can suffer intermittent performance degradation, and should consider implementing a 'slow query' log, a feature very common to services which persist information on behalf of clients which may be sensitive to latency while waiting for confirmation of successful persistence. @hanm FYI