[ https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458304#comment-16458304 ]
ASF GitHub Bot commented on ZOOKEEPER-2770:
-------------------------------------------

Github user hanm commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/307#discussion_r184910779

    --- Diff: src/java/main/org/apache/zookeeper/server/ServerStats.java ---
    @@ -148,9 +174,46 @@ synchronized public void resetRequestCounters(){
             packetsReceived = 0;
             packetsSent = 0;
         }
    +    synchronized public void resetNumRequestsAboveThresholdTime() {
    +        numRequestsAboveThresholdTime = 0;
    +    }
         synchronized public void reset() {
             resetLatency();
             resetRequestCounters();
    +        resetNumRequestsAboveThresholdTime();
    +    }
    +
    +    public void checkLatency(final ZooKeeperServer zks, Request request) {
    +        long requestLatency = Time.currentElapsedTime() - request.createTime;
    +        boolean enabledAndAboveThreshold = (requestWarnThresholdMs == 0) ||
    +                (requestWarnThresholdMs > -1 && requestLatency > requestWarnThresholdMs);
    +        if (enabledAndAboveThreshold) {
    +            zks.serverStats().incNumRequestsAboveThresholdTime();
    +
    +            // Try acquiring lock only if not waiting
    +            boolean success = waitForLoggingWarnThresholdMsg.compareAndSet(Boolean.FALSE, Boolean.TRUE);
    +            if (success) {
    +                LOG.warn("Request {} exceeded threshold. Took {} ms", request, requestLatency);
    +                startCount = zks.serverStats().getNumRequestsAboveThresholdTime();
    +                timer.schedule(new TimerTask() {
    +                    @Override
    +                    public void run() {
    +                        long count = zks.serverStats().getNumRequestsAboveThresholdTime() - startCount;
    +                        LOG.warn("Number of requests that exceeded {} ms in past {} ms: {}",
    +                                requestWarnThresholdMs, delayTimeForLoggingWarnThresholdMsg, count);
    +                        waitForLoggingWarnThresholdMsg.set(Boolean.FALSE);
    +                    }
    +                }, delayTimeForLoggingWarnThresholdMsg);
    +            }
    +        }
    +    }
    +
    +    public void setDelayTimeForLoggingWarnThresholdMsg(int delay) {
    +        this.delayTimeForLoggingWarnThresholdMsg = delay;
    --- End diff --

    I think it is worth adding a comment explaining this method. It would probably also be good to document, in the user documentation, how this log behaves and how its rate can be controlled through this method.

> ZooKeeper slow operation log
> ----------------------------
>
>                 Key: ZOOKEEPER-2770
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
>             Project: ZooKeeper
>          Issue Type: Improvement
>            Reporter: Karan Mehta
>            Assignee: Karan Mehta
>            Priority: Major
>         Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, ZOOKEEPER-2770.003.patch
>
> ZooKeeper is a complex distributed application. There are many reasons why any given read or write operation may become slow: a software bug, a protocol problem, a hardware issue with the commit log(s), or a network issue. If the problem is constant, it is trivial to work out the cause, but to diagnose intermittent problems we often don't know where, or when, to begin looking; we need some timestamped indication of the problem. Although ZooKeeper is not a datastore, it does persist data and can suffer intermittent performance degradation, so it should consider implementing a 'slow query' log, a feature common to services that persist information on behalf of latency-sensitive clients waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally processing the request, that the elapsed time since the request arrived exceeds a configured threshold.
> Look at the HBase {{responseTooSlow}} feature for inspiration.
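
For context on the logic under review: the pattern is a compare-and-set gate plus a one-shot timer, so a burst of slow requests produces one immediate warning and, after a delay, a single aggregate count rather than one log line per request. Below is a minimal, self-contained sketch of that pattern. It is illustrative only: the class and parameter names (SlowRequestLogger, warnThresholdMs, suppressionWindowMs) are not the identifiers used in the patch, the sketch writes to stderr instead of SLF4J, and it keeps its own counter rather than reading ServerStats.

{code:java}
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Minimal sketch of the rate-limited slow-request warning discussed above.
 * The first slow request in a window is logged immediately; further slow
 * requests are only counted until the window expires, at which point a
 * single aggregate line is emitted and logging is re-enabled.
 */
public class SlowRequestLogger {
    private final long warnThresholdMs;      // latency above this is "slow"; <= 0 disables the sketch
    private final long suppressionWindowMs;  // aggregation window between individual warnings
    private final Timer timer = new Timer("slow-request-logger", true);
    private final AtomicBoolean windowOpen = new AtomicBoolean(false);
    private final AtomicLong slowCount = new AtomicLong();

    public SlowRequestLogger(long warnThresholdMs, long suppressionWindowMs) {
        this.warnThresholdMs = warnThresholdMs;
        this.suppressionWindowMs = suppressionWindowMs;
    }

    /** Call once per completed request with its end-to-end latency. */
    public void checkLatency(String requestDescription, long latencyMs) {
        if (warnThresholdMs <= 0 || latencyMs <= warnThresholdMs) {
            return;
        }
        long total = slowCount.incrementAndGet();
        // Only the thread that flips the gate logs and schedules the summary task,
        // so at most one warning (plus one summary) is emitted per window.
        if (windowOpen.compareAndSet(false, true)) {
            System.err.printf("Request %s exceeded %d ms (took %d ms)%n",
                    requestDescription, warnThresholdMs, latencyMs);
            final long countAtWindowStart = total;
            timer.schedule(new TimerTask() {
                @Override
                public void run() {
                    long suppressed = slowCount.get() - countAtWindowStart;
                    System.err.printf("Requests over %d ms in the past %d ms: %d%n",
                            warnThresholdMs, suppressionWindowMs, suppressed);
                    windowOpen.set(false); // allow the next slow request to log again
                }
            }, suppressionWindowMs);
        }
    }
}
{code}

With new SlowRequestLogger(300, 60000), a burst of slow requests would produce one warning right away and, a minute later, one summary line with the number of additional requests that crossed the 300 ms threshold in that window. This is also roughly the behavior that the review comment asks to have documented: the delay setter effectively controls how often individual slow-request warnings can appear in the log.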