[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2018-05-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473648#comment-16473648
 ] 

Hadoop QA commented on ZOOKEEPER-2770:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12866052/ZOOKEEPER-2770.003.patch
  against trunk revision 2fa315b7d0ed65828479fcdcc9e76ca8552fba4a.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

-1 patch.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3691//console

This message is automatically generated.

> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
>Priority: Major
>  Labels: pull-request-available
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant, it is trivial to identify the cause. However, to diagnose 
> intermittent problems we often don't know where, or when, to begin looking. We 
> need some sort of timestamped indication of the problem. Although ZooKeeper is 
> not a datastore, it does persist data and can suffer intermittent performance 
> degradation, so it should consider implementing a 'slow query' log, a feature 
> very common to services that persist information on behalf of clients which 
> may be latency-sensitive while waiting for confirmation of successful 
> persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus the arrival time of the 
> request exceeds a configured threshold.
> Look at the HBase {{responseTooSlow}} feature for inspiration.
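
As an illustrative sketch of the check being described (the class and wiring here are hypothetical, not the attached patch; only the threshold comparison mirrors the proposal):

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Sketch only: warn when a request's server-side latency crosses a configured threshold.
    final class SlowRequestLogger {
        private static final Logger LOG = LoggerFactory.getLogger(SlowRequestLogger.class);

        // A threshold of -1 disables the check; any other value is the warning threshold in ms.
        private final long requestWarnThresholdMs;

        SlowRequestLogger(long requestWarnThresholdMs) {
            this.requestWarnThresholdMs = requestWarnThresholdMs;
        }

        // Call when a request finishes processing; createTimeMs is the request's arrival time.
        void checkLatency(String requestDetails, long createTimeMs) {
            long latencyMs = System.currentTimeMillis() - createTimeMs;
            if (requestWarnThresholdMs > -1 && latencyMs > requestWarnThresholdMs) {
                LOG.warn("Request {} exceeded threshold of {} ms; took {} ms",
                        requestDetails, requestWarnThresholdMs, latencyMs);
            }
        }
    }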



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2018-04-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458304#comment-16458304
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user hanm commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r184910779
  
--- Diff: src/java/main/org/apache/zookeeper/server/ServerStats.java ---
@@ -148,9 +174,46 @@ synchronized public void resetRequestCounters(){
         packetsReceived = 0;
         packetsSent = 0;
     }
+    synchronized public void resetNumRequestsAboveThresholdTime() {
+        numRequestsAboveThresholdTime = 0;
+    }
     synchronized public void reset() {
         resetLatency();
         resetRequestCounters();
+        resetNumRequestsAboveThresholdTime();
+    }
+
+    public void checkLatency(final ZooKeeperServer zks, Request request) {
+        long requestLatency = Time.currentElapsedTime() - request.createTime;
+        boolean enabledAndAboveThreshold = (requestWarnThresholdMs == 0) ||
+                (requestWarnThresholdMs > -1 && requestLatency > requestWarnThresholdMs);
+        if (enabledAndAboveThreshold) {
+            zks.serverStats().incNumRequestsAboveThresholdTime();
+
+            // Try acquiring lock only if not waiting
+            boolean success = waitForLoggingWarnThresholdMsg.compareAndSet(Boolean.FALSE, Boolean.TRUE);
+            if (success) {
+                LOG.warn("Request {} exceeded threshold. Took {} ms", request, requestLatency);
+                startCount = zks.serverStats().getNumRequestsAboveThresholdTime();
+                timer.schedule(new TimerTask() {
+                    @Override
+                    public void run() {
+                        long count = zks.serverStats().getNumRequestsAboveThresholdTime() - startCount;
+                        LOG.warn("Number of requests that exceeded {} ms in past {} ms: {}",
+                                requestWarnThresholdMs, delayTimeForLoggingWarnThresholdMsg, count);
+                        waitForLoggingWarnThresholdMsg.set(Boolean.FALSE);
+                    }
+                }, delayTimeForLoggingWarnThresholdMsg);
+            }
+        }
+    }
+
+    public void setDelayTimeForLoggingWarnThresholdMsg(int delay) {
+        this.delayTimeForLoggingWarnThresholdMsg = delay;
--- End diff --

I think it is worth adding a comment about this method. Probably also add a 
note in the documentation about the behavior of the log and how to control its 
rate using this method.
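
For instance, a doc comment along the lines suggested, describing the behavior visible in the diff above (wording illustrative):

    /**
     * Sets the delay, in milliseconds, between summary warnings for slow requests.
     * After the first warning is logged, further slow requests are only counted;
     * one follow-up message with that count is emitted once this delay elapses,
     * which bounds the rate of requestWarnThresholdMs warnings in the log.
     */
    public void setDelayTimeForLoggingWarnThresholdMsg(int delay) {
        this.delayTimeForLoggingWarnThresholdMsg = delay;
    }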





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2018-04-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458305#comment-16458305
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user hanm commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r184910500
  
--- Diff: src/java/main/org/apache/zookeeper/server/ServerStats.java ---
@@ -148,9 +174,46 @@ synchronized public void resetRequestCounters(){
         packetsReceived = 0;
         packetsSent = 0;
     }
+    synchronized public void resetNumRequestsAboveThresholdTime() {
+        numRequestsAboveThresholdTime = 0;
+    }
     synchronized public void reset() {
         resetLatency();
         resetRequestCounters();
+        resetNumRequestsAboveThresholdTime();
+    }
+
+    public void checkLatency(final ZooKeeperServer zks, Request request) {
+        long requestLatency = Time.currentElapsedTime() - request.createTime;
+        boolean enabledAndAboveThreshold = (requestWarnThresholdMs == 0) ||
+                (requestWarnThresholdMs > -1 && requestLatency > requestWarnThresholdMs);
--- End diff --

This could be simplified: `boolean enabledAndAboveThreshold = 
requestWarnThresholdMs != -1 && requestLatency > requestWarnThresholdMs`





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2018-04-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458303#comment-16458303
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user hanm commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r184910721
  
--- Diff: src/java/main/org/apache/zookeeper/server/ServerStats.java ---
@@ -148,9 +174,46 @@ synchronized public void resetRequestCounters(){
         packetsReceived = 0;
         packetsSent = 0;
     }
+    synchronized public void resetNumRequestsAboveThresholdTime() {
+        numRequestsAboveThresholdTime = 0;
+    }
     synchronized public void reset() {
         resetLatency();
         resetRequestCounters();
+        resetNumRequestsAboveThresholdTime();
+    }
+
+    public void checkLatency(final ZooKeeperServer zks, Request request) {
+        long requestLatency = Time.currentElapsedTime() - request.createTime;
+        boolean enabledAndAboveThreshold = (requestWarnThresholdMs == 0) ||
+                (requestWarnThresholdMs > -1 && requestLatency > requestWarnThresholdMs);
+        if (enabledAndAboveThreshold) {
+            zks.serverStats().incNumRequestsAboveThresholdTime();
+
+            // Try acquiring lock only if not waiting
+            boolean success = waitForLoggingWarnThresholdMsg.compareAndSet(Boolean.FALSE, Boolean.TRUE);
+            if (success) {
+                LOG.warn("Request {} exceeded threshold. Took {} ms", request, requestLatency);
+                startCount = zks.serverStats().getNumRequestsAboveThresholdTime();
+                timer.schedule(new TimerTask() {
+                    @Override
+                    public void run() {
+                        long count = zks.serverStats().getNumRequestsAboveThresholdTime() - startCount;
--- End diff --

If there are no slow requests between the time the task is scheduled and the 
time it executes, this count will be 0, so it will not reflect the actual 
number of requests we want to log. Maybe log the startCount instead, reset it, 
and leave the task only to reset the barrier?

Also, please use a ScheduledExecutorService instead of a TimerTask.
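
A rough sketch of that suggestion, under the assumption that a ScheduledExecutorService replaces the Timer; the names here (ThrottledSlowRequestCounter, onSlowRequest, summaryScheduled) are illustrative, not the patch's API:

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.concurrent.atomic.AtomicLong;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    final class ThrottledSlowRequestCounter {
        private static final Logger LOG = LoggerFactory.getLogger(ThrottledSlowRequestCounter.class);

        private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        private final AtomicBoolean summaryScheduled = new AtomicBoolean(false);
        private final AtomicLong slowRequestsSinceLastSummary = new AtomicLong();
        private final long summaryDelayMs;

        ThrottledSlowRequestCounter(long summaryDelayMs) {
            this.summaryDelayMs = summaryDelayMs;
        }

        void onSlowRequest(Object request, long latencyMs) {
            slowRequestsSinceLastSummary.incrementAndGet();
            // Only the request that wins this CAS logs immediately and schedules the summary.
            if (summaryScheduled.compareAndSet(false, true)) {
                LOG.warn("Request {} exceeded threshold. Took {} ms", request, latencyMs);
                scheduler.schedule(() -> {
                    // Log and reset the count accumulated while the summary was pending,
                    // then reopen the barrier for the next immediate warning.
                    long count = slowRequestsSinceLastSummary.getAndSet(0);
                    LOG.warn("Number of slow requests in the past {} ms: {}", summaryDelayMs, count);
                    summaryScheduled.set(false);
                }, summaryDelayMs, TimeUnit.MILLISECONDS);
            }
        }
    }

Here the count includes the request that triggered the summary, so a window with exactly one slow request reports 1 rather than 0.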





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2018-04-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438989#comment-16438989
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user karanmehta93 commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
@hanm Thank you! I have rebased the patch. The Jenkins build has been failing 
for a while, and my patch doesn't introduce any new failures. Thanks once 
again!





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2018-04-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438977#comment-16438977
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user hanm commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
Looks like the branch has merge conflicts. Do you mind updating your patch 
with the latest master? Then I'll review, @karanmehta93.





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392529#comment-16392529
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user zodvik commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
> reasons why any given read or write operation may become slow: a software 
bug, a protocol problem, a hardware issue with the commit log(s), a network 
issue. If the problem is constant it is trivial to come to an understanding of 
the cause. However in order to diagnose intermittent problems we often don't 
know where, or when, to begin looking. We need some sort of timestamped 
indication of the problem

IMO, a slow query log is useful for systems where response times can vary with 
the type of query (long scans, improper indexes, poor query plans, etc.). For 
the type of issues mentioned here, we can expect all operations to be impacted 
and fill up the log.

If the intent is to get a timestamped indication of the problem, a metric 
reporting mechanism would be more useful. If we published rolling-interval 
percentiles (per command?) to either `stat` or `jmx`, that would give good 
visibility.
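
As a hypothetical sketch of that kind of rolling, per-command percentile tracking (not part of this patch; a periodic task could publish the computed values via `stat` or JMX):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch: keep recent latencies per command and report a percentile
    // over the current window, then clear the window for the next interval.
    final class RollingLatencyPercentiles {
        private final Map<String, List<Long>> samplesByCommand = new ConcurrentHashMap<>();

        void record(String command, long latencyMs) {
            samplesByCommand
                .computeIfAbsent(command, c -> Collections.synchronizedList(new ArrayList<>()))
                .add(latencyMs);
        }

        // Returns the given percentile (0-100] for a command and clears its window.
        long percentileAndReset(String command, double percentile) {
            List<Long> samples = samplesByCommand.remove(command);
            if (samples == null || samples.isEmpty()) {
                return 0L;
            }
            List<Long> sorted = new ArrayList<>(samples);
            Collections.sort(sorted);
            int index = (int) Math.ceil(percentile / 100.0 * sorted.size()) - 1;
            return sorted.get(Math.max(index, 0));
        }
    }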






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2018-03-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16389945#comment-16389945
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user karanmehta93 commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
Hello everyone,
Appreciate your efforts in reviewing this patch. @hanm @tdunning @eribeiro @skamille
Is there any possibility that the patch will get merged (with minor changes if 
required), or shall we mark this JIRA as 'never' and close this PR?
Thanks!





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-11-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16263413#comment-16263413
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user karanmehta93 commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
>  Can we at least make it so that the process of "push a metric to a 
buffer, have a thread that wakes up periodically and flushes information out of 
that buffer" is usable by multiple parts of the system, instead of coupling it 
to the one metric of request time?

I am ready to put in the effort to make this framework more generic, if you 
feel it's worth the time and effort. I would also like to hear others' 
opinions on this.
The inspiration for this JIRA came from HBase, where such a log has been used 
to record slow requests at a basic level. I feel it can be useful for keeping 
track of latency spikes seen in production. @apurtell can suggest use-cases.





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-11-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16261884#comment-16261884
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user skamille commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
I appreciate the work that you are trying to do here, but this still feels 
like an incomplete approach to a problem that would be worth investing in 
further. It seems like we all agree that it would be good to have clearer 
metrics about the performance behavior of the internals of the system. This 
implements some basic infrastructure for a metric recording framework, coupled 
to the recording of one metric. Can we at least make it so that the process of 
"push a metric to a buffer, have a thread that wakes up periodically and 
flushes information out of that buffer" is usable by multiple parts of the 
system, instead of coupling it to the one metric of request time?
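
A rough sketch of the decoupled shape being described, with hypothetical names; any subsystem could push samples into such a buffer, and a single flusher thread periodically hands batches to a pluggable sink:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.function.Consumer;

    // Hypothetical sketch of a reusable "push to a buffer, flush periodically" helper
    // that is not tied to request latency specifically.
    final class MetricBuffer<T> {
        private final BlockingQueue<T> buffer = new LinkedBlockingQueue<>();
        private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();

        MetricBuffer(long flushIntervalMs, Consumer<List<T>> sink) {
            flusher.scheduleAtFixedRate(() -> {
                List<T> batch = new ArrayList<>();
                buffer.drainTo(batch);
                if (!batch.isEmpty()) {
                    sink.accept(batch); // e.g. log a summary, update ServerStats, publish to JMX
                }
            }, flushIntervalMs, flushIntervalMs, TimeUnit.MILLISECONDS);
        }

        void push(T sample) {
            buffer.offer(sample);
        }
    }

A slow-request logger would then just be one producer pushing latency samples, with the summarizing and logging policy living in the sink.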





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-11-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244971#comment-16244971
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user karanmehta93 commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
Ping @hanm @tdunning @eribeiro @skamille for a review if you got some time. 
Thanks!





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-10-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16221123#comment-16221123
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user karanmehta93 commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
For some reason, Jenkins didn't build it automatically. I triggered a manual 
build with my latest commit SHA1 and the results are at 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/1123/

All the Findbugs warnings have been resolved, and the single failing test 
seems to be unrelated to my change (possibly a flapper, since it passes on my 
local machine).

This commit is a squashed commit based on the earlier inputs provided by 
everyone. Thank you!

As of now, the default value for `requestWarnThresholdMs` is 10 seconds, which 
is too high. Users can set it according to the average latency values they 
typically see in the server stats. The patch also limits the logging rate, so 
if the value is set too low, it will print a message every 60 seconds showing 
the number of high-latency requests in the past minute. The motivation for 
this JIRA is described at https://issues.apache.org/jira/browse/ZOOKEEPER-2770.

If you can review this PR, it would be great.
@hanm @tdunning @eribeiro @skamille 

If we can get this JIRA in, the future plan is to work on @tdunning's t-digest 
and use it to determine high-latency requests based specifically on request 
type. Thanks!





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219981#comment-16219981
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user karanmehta93 commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r147047396
  
--- Diff: src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java ---
@@ -81,9 +81,11 @@
     private static final Logger LOG = LoggerFactory.getLogger(FinalRequestProcessor.class);
 
     ZooKeeperServer zks;
+    ServerStats serverStats;
--- End diff --

Fixed it in the next commit.





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219982#comment-16219982
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user karanmehta93 commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r147047400
  
--- Diff: src/java/main/org/apache/zookeeper/server/ServerStats.java ---
@@ -148,9 +174,46 @@ synchronized public void resetRequestCounters(){
         packetsReceived = 0;
         packetsSent = 0;
     }
+    synchronized public void resetNumRequestsAboveThresholdTime() {
+        numRequestsAboveThresholdTime = 0;
+    }
     synchronized public void reset() {
         resetLatency();
         resetRequestCounters();
+        resetNumRequestsAboveThresholdTime();
+    }
+
+    public void checkLatency(final ZooKeeperServer zks, Request request) {
+        long requestLatency = Time.currentElapsedTime() - request.createTime;
+        boolean enabledAndAboveThreshold = (requestWarnThresholdMs == 0) ||
+                (requestWarnThresholdMs > -1 && requestLatency > requestWarnThresholdMs);
+        if (enabledAndAboveThreshold) {
+            zks.serverStats().incNumRequestsAboveThresholdTime();
+
+            // Try acquiring lock only if not waiting
+            boolean success = waitForLoggingWarnThresholdMsg.compareAndSet(Boolean.FALSE, Boolean.TRUE);
+            if(success) {
--- End diff --

Fixed it in the next commit.





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219919#comment-16219919
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user tdunning commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r147040783
  
--- Diff: src/java/main/org/apache/zookeeper/server/ServerStats.java ---
@@ -148,9 +174,46 @@ synchronized public void resetRequestCounters(){
         packetsReceived = 0;
         packetsSent = 0;
     }
+    synchronized public void resetNumRequestsAboveThresholdTime() {
+        numRequestsAboveThresholdTime = 0;
+    }
     synchronized public void reset() {
         resetLatency();
         resetRequestCounters();
+        resetNumRequestsAboveThresholdTime();
+    }
+
+    public void checkLatency(final ZooKeeperServer zks, Request request) {
+        long requestLatency = Time.currentElapsedTime() - request.createTime;
+        boolean enabledAndAboveThreshold = (requestWarnThresholdMs == 0) ||
+                (requestWarnThresholdMs > -1 && requestLatency > requestWarnThresholdMs);
+        if (enabledAndAboveThreshold) {
+            zks.serverStats().incNumRequestsAboveThresholdTime();
+
+            // Try acquiring lock only if not waiting
+            boolean success = waitForLoggingWarnThresholdMsg.compareAndSet(Boolean.FALSE, Boolean.TRUE);
+            if(success) {
--- End diff --

Need a space between if and (





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219918#comment-16219918
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user tdunning commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r147040873
  
--- Diff: src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java ---
@@ -81,9 +81,11 @@
     private static final Logger LOG = LoggerFactory.getLogger(FinalRequestProcessor.class);
 
     ZooKeeperServer zks;
+    ServerStats serverStats;
--- End diff --

I think that caching this reference is a bad practice. Getting the reference 
every time costs essentially nothing, and keeping a reference to a data 
structure that we don't own is asking for trouble if anybody ever reallocates 
it.
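
In sketch form, the pattern being suggested (hypothetical names; StatsHolder stands in for ZooKeeperServer and its serverStats() accessor):

    // Hypothetical holder whose stats object may be replaced over time.
    final class StatsHolder {
        private volatile Stats current = new Stats();
        Stats stats() { return current; }               // always hands out the live instance
        void reallocate() { current = new Stats(); }    // e.g. on a reset
        static final class Stats { long slowRequests; }
    }

    final class RequestProcessorSketch {
        private final StatsHolder holder;
        RequestProcessorSketch(StatsHolder holder) { this.holder = holder; }

        void onSlowRequest() {
            // Fetch the reference on each use rather than caching it in a field;
            // a cached field could keep pointing at an instance the holder has replaced.
            holder.stats().slowRequests++;
        }
    }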





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219838#comment-16219838
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

GitHub user karanmehta93 reopened a pull request:

https://github.com/apache/zookeeper/pull/307

ZOOKEEPER-2770 ZooKeeper slow operation log

ZooKeeper is a complex distributed application. There are many reasons why any 
given read or write operation may become slow: a software bug, a protocol 
problem, a hardware issue with the commit log(s), a network issue. If the 
problem is constant, it is trivial to identify the cause. However, to diagnose 
intermittent problems we often don't know where, or when, to begin looking. We 
need some sort of timestamped indication of the problem. Although ZooKeeper is 
not a datastore, it does persist data and can suffer intermittent performance 
degradation, so it should consider implementing a 'slow query' log, a feature 
very common to services that persist information on behalf of clients which 
may be latency-sensitive while waiting for confirmation of successful 
persistence.

The patch sets a default threshold of 10 seconds per request. If the value of 
warn.responseTime is configured as -1, this logging is disabled.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/karanmehta93/zookeeper ZOOKEEPER-2770

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/zookeeper/pull/307.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #307


commit bf0ed9eca106faebca6548122108e3ff123b04d3
Author: Karan Mehta 
Date:   2017-10-26T01:48:56Z

ZOOKEEPER-2770 ZooKeeper slow operation log







--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219820#comment-16219820
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user karanmehta93 closed the pull request at:

https://github.com/apache/zookeeper/pull/307





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219169#comment-16219169
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user karanmehta93 commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
@tdunning Sorry about this. I rebased my branch on the current master and 
pushed my patch along with it. Now it is showing all of those commits as mine. 
Let me try to figure out a way to resolve this.





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218938#comment-16218938
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user tdunning commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
What is this pull request?  45 commits? 95 changed files? Lots of commits 
without associated issues?

(danger, danger)





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-10-25 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218937#comment-16218937
 ] 

Ted Dunning commented on ZOOKEEPER-2770:



There is a pull request just now that has 95 files changed and 45 commits.

What is this?!?

https://github.com/apache/zookeeper/pull/307/commits

It looks like some wires got seriously crossed here. There is no way that this 
feature should have so many commits.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-10-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218293#comment-16218293
 ] 

Hadoop QA commented on ZOOKEEPER-2770:
--

-1 overall.  GitHub Pull Request  Build
  

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 89 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 1 new Findbugs (version 3.0.1) 
warnings.

-1 release audit.  The applied patch generated 1 release audit warnings 
(more than the trunk's current 0 warnings).

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/1118//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/1118//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/1118//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/1118//console

This message is automatically generated.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-10-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213980#comment-16213980
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user eribeiro commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r146108372
  
--- Diff: src/java/main/org/apache/zookeeper/server/ServerStats.java ---
@@ -148,9 +173,47 @@ synchronized public void resetRequestCounters(){
 packetsReceived = 0;
 packetsSent = 0;
 }
+synchronized public void resetNumRequestsAboveThresholdTime() {
+numRequestsAboveThresholdTime = 0;
+}
 synchronized public void reset() {
 resetLatency();
 resetRequestCounters();
+resetNumRequestsAboveThresholdTime();
+}
+
+public void checkLatency(final ZooKeeperServer zks, Request request) {
+long requestLatency = Time.currentElapsedTime() - 
request.createTime;
+boolean enabledAndAboveThreshold = (requestWarnThresholdMs == 0) ||
+(requestWarnThresholdMs > -1 && requestLatency > 
requestWarnThresholdMs);
+if (enabledAndAboveThreshold) {
+zks.serverStats().incNumRequestsAboveThresholdTime();
+// Try acquiring lock only if not waiting
+if(!waitForLoggingWarnThresholdMsg) {
--- End diff --

nit: space between `if` and `(`


> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-10-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213978#comment-16213978
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user eribeiro commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r146108343
  
--- Diff: src/java/main/org/apache/zookeeper/server/ServerStats.java ---
@@ -21,17 +21,32 @@
 
 
 import org.apache.zookeeper.common.Time;
+import org.apache.zookeeper.server.quorum.QuorumPeerConfig;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Timer;
+import java.util.TimerTask;
 
 /**
  * Basic Server Statistics
  */
 public class ServerStats {
+private static final Logger LOG = 
LoggerFactory.getLogger(ServerStats.class);
+
 private long packetsSent;
 private long packetsReceived;
 private long maxLatency;
 private long minLatency = Long.MAX_VALUE;
 private long totalLatency = 0;
 private long count = 0;
+private long numRequestsAboveThresholdTime = 0;
+
+final static long requestWarnThresholdMs = 
QuorumPeerConfig.getRequestWarnResponseThresholdMs();
+final static Timer timer = new Timer();
+volatile Boolean waitForLoggingWarnThresholdMsg = false;
--- End diff --

Oh, excuse me, I got it. It's because you are synchronizing on it, right? 
Well, it's not recommended (it's an anti-pattern) to synchronize on non-final fields. 
Didn't Findbugs raise a warning about this?
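A minimal sketch of the usual way around that anti-pattern, assuming a dedicated final lock object guarding a primitive flag (the class and method names are illustrative; only the field name comes from the diff above):

```java
// Hypothetical sketch: avoid synchronizing on a mutable/boxed field by using a
// dedicated final lock object and a primitive boolean guarded by it.
class WarnThresholdGate {
    private final Object lock = new Object();
    private boolean waitForLoggingWarnThresholdMsg = false;

    /** Returns true if the caller acquired the "logging slot", false if someone else holds it. */
    boolean tryEnter() {
        synchronized (lock) {
            if (waitForLoggingWarnThresholdMsg) {
                return false;
            }
            waitForLoggingWarnThresholdMsg = true;
            return true;
        }
    }

    void exit() {
        synchronized (lock) {
            waitForLoggingWarnThresholdMsg = false;
        }
    }
}
```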


> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-10-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213976#comment-16213976
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user eribeiro commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r146108249
  
--- Diff: src/java/main/org/apache/zookeeper/server/ServerStats.java ---
@@ -21,17 +21,32 @@
 
 
 import org.apache.zookeeper.common.Time;
+import org.apache.zookeeper.server.quorum.QuorumPeerConfig;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Timer;
+import java.util.TimerTask;
 
 /**
  * Basic Server Statistics
  */
 public class ServerStats {
+private static final Logger LOG = 
LoggerFactory.getLogger(ServerStats.class);
+
 private long packetsSent;
 private long packetsReceived;
 private long maxLatency;
 private long minLatency = Long.MAX_VALUE;
 private long totalLatency = 0;
 private long count = 0;
+private long numRequestsAboveThresholdTime = 0;
+
+final static long requestWarnThresholdMs = 
QuorumPeerConfig.getRequestWarnResponseThresholdMs();
+final static Timer timer = new Timer();
+volatile Boolean waitForLoggingWarnThresholdMsg = false;
--- End diff --

What is the reason to make `waitForLoggingWarnThresholdMsg` a boxed type 
instead of a primitive `boolean`?


> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-09-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16176070#comment-16176070
 ] 

Hadoop QA commented on ZOOKEEPER-2770:
--

-1 overall.  GitHub Pull Request  Build
  

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 8 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 3 new Findbugs (version 3.0.1) 
warnings.

-1 release audit.  The applied patch generated 1 release audit warnings 
(more than the trunk's current 0 warnings).

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/1032//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/1032//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/1032//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/1032//console

This message is automatically generated.

> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-08-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16133177#comment-16133177
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user hanm commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
There are -1s on findbugs and release audit, and test failures reported by 
Jenkins. Please investigate.


> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-08-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131884#comment-16131884
 ] 

Hadoop QA commented on ZOOKEEPER-2770:
--

-1 overall.  GitHub Pull Request  Build
  

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 8 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 3 new Findbugs (version 3.0.1) 
warnings.

-1 release audit.  The applied patch generated 1 release audit warnings 
(more than the trunk's current 0 warnings).

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/947//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/947//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/947//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/947//console

This message is automatically generated.

> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-08-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131874#comment-16131874
 ] 

Hadoop QA commented on ZOOKEEPER-2770:
--

-1 overall.  GitHub Pull Request  Build
  

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 8 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 3 new Findbugs (version 3.0.1) 
warnings.

-1 release audit.  The applied patch generated 1 release audit warnings 
(more than the trunk's current 0 warnings).

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/946//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/946//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/946//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/946//console

This message is automatically generated.

> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-08-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131873#comment-16131873
 ] 

Hadoop QA commented on ZOOKEEPER-2770:
--

-1 overall.  GitHub Pull Request  Build
  

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 8 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 3 new Findbugs (version 3.0.1) 
warnings.

-1 release audit.  The applied patch generated 1 release audit warnings 
(more than the trunk's current 0 warnings).

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/945//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/945//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/945//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/945//console

This message is automatically generated.

> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-08-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131704#comment-16131704
 ] 

Hadoop QA commented on ZOOKEEPER-2770:
--

-1 overall.  GitHub Pull Request  Build
  

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 5 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 3 new Findbugs (version 3.0.1) 
warnings.

-1 release audit.  The applied patch generated 1 release audit warnings 
(more than the trunk's current 0 warnings).

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/944//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/944//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/944//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/944//console

This message is automatically generated.

> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-08-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131694#comment-16131694
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user karanmehta93 commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
@hanm @eribeiro @tdunning @skamille 
Please review.
Now that I have added rate limiting to logging, can we also turn this on by 
default?


> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-08-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113679#comment-16113679
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user eribeiro commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r131282046
  
--- Diff: 
src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java ---
@@ -460,11 +465,20 @@ public void processRequest(Request request) {
 if (request.type == OpCode.closeSession) {
 cnxn.sendCloseSession();
 }
+checkLatency(request);
 } catch (IOException e) {
 LOG.error("FIXMSG",e);
 }
 }
 
+private void checkLatency(Request request) {
+long requestLatency = Time.currentElapsedTime() - 
request.createTime;
+if((requestWarnThresholdMs > -1 && requestLatency > 
requestWarnThresholdMs) || requestWarnThresholdMs == 0) {
+LOG.warn("Request "  + request + " exceeded threshold. Took " 
+ requestLatency + " ms.");
--- End diff --

You can leverage string interpolation (parameterized messages) in the logging:
```
LOG.warn("Request {} exceeded {}", obj1, obj2);
```

Again, it's such a minor suggestion that it's up to you to adopt it or not. Cheers!
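Applied to the warning in the diff above, the parameterized form might look like this (a small illustrative sketch; the surrounding class is invented only to keep the snippet self-contained):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical sketch: the parameterized form of the slow-request warning.
// The placeholders are only rendered when WARN is enabled, avoiding string concatenation cost.
class SlowRequestLogExample {
    private static final Logger LOG = LoggerFactory.getLogger(SlowRequestLogExample.class);

    void warnSlowRequest(Object request, long requestLatencyMs) {
        LOG.warn("Request {} exceeded threshold. Took {} ms.", request, requestLatencyMs);
    }
}
```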


> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-08-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113678#comment-16113678
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user eribeiro commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r131281803
  
--- Diff: 
src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java ---
@@ -460,11 +465,20 @@ public void processRequest(Request request) {
 if (request.type == OpCode.closeSession) {
 cnxn.sendCloseSession();
 }
+checkLatency(request);
 } catch (IOException e) {
 LOG.error("FIXMSG",e);
 }
 }
 
+private void checkLatency(Request request) {
+long requestLatency = Time.currentElapsedTime() - 
request.createTime;
+if((requestWarnThresholdMs > -1 && requestLatency > 
requestWarnThresholdMs) || requestWarnThresholdMs == 0) {
--- End diff --

@karanmehta93 really *really* minor nit: 
```
boolean enabledAndAboveThreshold = requestWarnThresholdMs == 0 || 
(requestWarnThresholdMs > -1 && requestLatency > requestWarnThresholdMs);

if (enabledAndAboveThreshold) {
```

I moved `requestWarnThresholdMs == 0` to the beginning of the expression so 
that the whole expression can be short-circuited if it equals zero. ;-)

But this is such a trivial recommendation that it's up to you to use it or not.


> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-26 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16102749#comment-16102749
 ] 

Ted Dunning commented on ZOOKEEPER-2770:


The typical approach is to set a limit on the number of messages per unit time
(say one every 10 minutes). Each message that is printed sets a coalescence
time during which no further messages are printed, but a counter is
updated. At the end of the coalescence time, a modified message is printed which
mentions that n additional events were detected, and the coalescence time is
disabled.

This way if the warnings are rare, you get normal behavior. If the warnings
are frequent, you get at most one message per 10 minutes (or whatever
coalescence period you choose). You get instant notification of a problem
and limited log output.
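A minimal sketch of that coalescence scheme, assuming a lazy variant that emits the summary on the next slow request after the window elapses rather than via a timer (class, method, and message text here are illustrative, not from the patch):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical sketch of the coalescence scheme described above: log at most one
// warning per window and count how many slow requests were suppressed in between.
class CoalescingSlowRequestLogger {
    private static final Logger LOG = LoggerFactory.getLogger(CoalescingSlowRequestLogger.class);

    private final long windowMs;     // coalescence period, e.g. 10 minutes
    private long windowEndMs = 0;    // end of the current coalescence window
    private long suppressed = 0;     // slow requests seen but not logged during the window

    CoalescingSlowRequestLogger(long windowMs) {
        this.windowMs = windowMs;
    }

    synchronized void onSlowRequest(String requestDetails, long latencyMs, long nowMs) {
        if (nowMs >= windowEndMs) {
            // Window expired (or never started): log this event, mention anything that was
            // coalesced, and open a new window during which further events are only counted.
            if (suppressed > 0) {
                LOG.warn("Request {} exceeded threshold. Took {} ms ({} additional slow requests in the last {} ms)",
                        requestDetails, latencyMs, suppressed, windowMs);
            } else {
                LOG.warn("Request {} exceeded threshold. Took {} ms", requestDetails, latencyMs);
            }
            suppressed = 0;
            windowEndMs = nowMs + windowMs;
        } else {
            suppressed++;
        }
    }
}
```

With this shape, rare warnings are logged immediately and frequent warnings collapse into at most one line per window, which matches the behavior described above.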





> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-26 Thread Karan Mehta (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16102708#comment-16102708
 ] 

Karan Mehta commented on ZOOKEEPER-2770:


[~tdunning]
There is no rate limiting on the logging. If the threshold is too low, it might 
result in a huge amount of messages getting printed. At this point, I suggest the 
following options. Please share your opinion.
1. Turn off this feature by default, so that we don't end up on an arbitrary 
value.
I personally do not want this, since I believe that no matter what your 
requirements or hardware are, it should be possible to put some upper bound 
on this value. Experienced people can comment on this more than I can. 
2. Add a rate limiter based on some logic (a rough sketch of option 2.2 follows below)
  2.1 Time-based logic (limit the messages printed in a given amount of time)
  2.2 Random sampling based on some probability percentage.

I am not aware of how these things are typically implemented. It would be good 
if you can point to some code which does similar stuff. Thanks!
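A trivial sketch of option 2.2, random sampling (the class name and probability value are illustrative assumptions, not from any patch):

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of option 2.2: log only a random sample of slow requests.
class SampledSlowRequestLogger {
    private final double sampleProbability; // e.g. 0.01 logs roughly 1% of slow requests

    SampledSlowRequestLogger(double sampleProbability) {
        this.sampleProbability = sampleProbability;
    }

    // Called once per slow request; the caller logs only when this returns true.
    boolean shouldLog() {
        return ThreadLocalRandom.current().nextDouble() < sampleProbability;
    }
}
```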

> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-25 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16101161#comment-16101161
 ] 

Ted Dunning commented on ZOOKEEPER-2770:



Btw I note that there is no metering on this logging.

That raises an obligatory question: is there a plausible circumstance where 
thousands of nearly identical messages might be logged?



> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-25 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16101158#comment-16101158
 ] 

Ted Dunning commented on ZOOKEEPER-2770:


{quote}
With that said, is 300 ms a good value, or is even less better?
{quote}

I would suggest that getting a real-time varying histogram is the right answer. 
I suggested that early on for just this kind of reason.
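A rough sketch of one way to get such a histogram, assuming fixed latency buckets that are snapshotted and reset on an interval (bucket bounds and names are illustrative, not from any patch):

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical sketch: bucketed request-latency counts that are snapshotted and reset
// on a fixed interval, yielding a histogram that varies over time.
class LatencyHistogram {
    // Upper bounds (ms) of each bucket; the final slot is an overflow bucket.
    private static final long[] BOUNDS = {1, 2, 5, 10, 20, 50, 100, 200, 500, 1000};
    private final AtomicLongArray counts = new AtomicLongArray(BOUNDS.length + 1);

    void record(long latencyMs) {
        int i = 0;
        while (i < BOUNDS.length && latencyMs > BOUNDS[i]) {
            i++;
        }
        counts.incrementAndGet(i);
    }

    // Returns the counts accumulated since the last snapshot and starts a new interval.
    long[] snapshotAndReset() {
        long[] out = new long[counts.length()];
        for (int i = 0; i < out.length; i++) {
            out[i] = counts.getAndSet(i, 0);
        }
        return out;
    }

    @Override
    public String toString() {
        return "bounds(ms)=" + Arrays.toString(BOUNDS);
    }
}
```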



> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-25 Thread Karan Mehta (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16101114#comment-16101114
 ] 

Karan Mehta commented on ZOOKEEPER-2770:


bq. Operations over 100ms should be vanishingly rare, but I wouldn't leap up to 
find out what is happening. I would be fairly unhappy, though, and would start 
checking.
Let's take this as a motivation. :) 
With that said, is 300 ms a good value, or is even less better?

> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-25 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100945#comment-16100945
 ] 

Ted Dunning commented on ZOOKEEPER-2770:



On second thought, I could imagine that startup transients could cause a long 
operation. Once you have your quorum in a groove, however, >1 second is very 
bad, especially if you don't have something like a quorum leader change 
happening.


> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-25 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100942#comment-16100942
 ] 

Ted Dunning commented on ZOOKEEPER-2770:


To put some color on Camille's surprise, I would consider any operation over a 
second to be indicative of gross failure in the quorum. Operations over 100ms 
should be vanishingly rare, but I wouldn't leap up to find out what is 
happening. I would be fairly unhappy, though, and would start checking.


> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100940#comment-16100940
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user karanmehta93 commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r129450258
  
--- Diff: 
src/java/main/org/apache/zookeeper/server/quorum/QuorumPeerConfig.java ---
@@ -61,6 +61,7 @@
 
 private static boolean standaloneEnabled = true;
 private static boolean reconfigEnabled = false;
+private static int requestWarnThresholdMs = 1;
--- End diff --

To be frank, I am a newbie and haven't debugged this in detail. This value is 
purely based on the output of the 'stat' command on our test cluster. @apurtell might be 
able to suggest more practical values.

@skamille I would prefer turning this on by default, although the default 
value needs to be discussed. In my understanding, this helps in situations when 
we see timeouts at the application level; such a log might help narrow down 
the cause.


> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100906#comment-16100906
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user skamille commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r129444784
  
--- Diff: 
src/java/main/org/apache/zookeeper/server/quorum/QuorumPeerConfig.java ---
@@ -61,6 +61,7 @@
 
 private static boolean standaloneEnabled = true;
 private static boolean reconfigEnabled = false;
+private static int requestWarnThresholdMs = 1;
--- End diff --

You've seen 2.3 seconds of latency within ZK quorum operations? That seems 
worthy of posting to the mailing list along with some information about what 
was happening and why.
I think it sounds like @hanm wants to turn this off by default, which makes 
this moot, and I'm supportive of that, so I'll let him make the call.


> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-25 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100881#comment-16100881
 ] 

Michael Han commented on ZOOKEEPER-2770:


By "hardcoded" I meant the default value of "requestWarnThresholdMs" baked into 
the code.

> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-25 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100852#comment-16100852
 ] 

Andrew Purtell commented on ZOOKEEPER-2770:
---

From the original patch the warning threshold has been configurable. Calling 
it 'hardcoded' isn't correct. Maybe you meant a simple threshold only? That's 
true. It's better than nothing. FWIW I also like Ted's suggestion as a 
followup, and in fact would like to carry that over to HBase if it works out 
well here.

> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-25 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100846#comment-16100846
 ] 

Michael Han commented on ZOOKEEPER-2770:


A hardcoded default value in code is unlikely to work for everyone, and it is 
possible to have false negatives if the value is too small. I am leaning 
towards making this an opt-in feature whose value defaults to -1: those who 
want to use it can tune the parameter for their deployment, but it has to be 
enabled explicitly.
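
For illustration, here is a minimal sketch of the opt-in semantics being 
discussed. The field name and the -1 sentinel come from this thread; the 
class, logger, and method names below are assumptions, not code from the patch.

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Hypothetical sketch: a negative threshold (the suggested default of -1)
    // disables the slow-request warning entirely, making the feature opt-in.
    final class SlowRequestCheck {
        private static final Logger LOG =
                LoggerFactory.getLogger(SlowRequestCheck.class);
        private final long requestWarnThresholdMs;

        SlowRequestCheck(long requestWarnThresholdMs) {
            this.requestWarnThresholdMs = requestWarnThresholdMs;
        }

        void checkLatency(long requestLatencyMs) {
            if (requestWarnThresholdMs < 0) {
                return; // disabled by default; must be enabled explicitly
            }
            if (requestLatencyMs > requestWarnThresholdMs) {
                LOG.warn("Request took {} ms, above the {} ms warning threshold",
                        requestLatencyMs, requestWarnThresholdMs);
            }
        }
    }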

> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100820#comment-16100820
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user karanmehta93 commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r129431091
  
--- Diff: 
src/java/main/org/apache/zookeeper/server/quorum/QuorumPeerConfig.java ---
@@ -61,6 +61,7 @@
 
 private static boolean standaloneEnabled = true;
 private static boolean reconfigEnabled = false;
+private static int requestWarnThresholdMs = 1;
--- End diff --

Is 2 or 3 seconds reasonable? I have sometimes seen 2.3 seconds as the maximum 
latency, though I don't have much experience with this.


> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100780#comment-16100780
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user skamille commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r129424302
  
--- Diff: 
src/java/main/org/apache/zookeeper/server/quorum/QuorumPeerConfig.java ---
@@ -61,6 +61,7 @@
 
 private static boolean standaloneEnabled = true;
 private static boolean reconfigEnabled = false;
+private static int requestWarnThresholdMs = 1;
--- End diff --

If we're going to implement this, let's at least pick a realistic default 
threshold. A 10s default is basically saying "don't enable this feature"; is 
that what we want?


> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-25 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100706#comment-16100706
 ] 

Andrew Purtell commented on ZOOKEEPER-2770:
---

The originally proposed change is hardly complex. I don't understand that 
aspect of this discussion. Whether or not the metric is useful, on the other 
hand... ok. That is a matter of opinion. I think we'd like to know if any ZK op 
takes longer than a second to complete, and how often that might happen, and on 
what host it is happening. We have a fleet of thousands of servers. We have 
tens of ZooKeeper installations, each on five servers. Hardware does funny 
things from time to time. We'd like to be proactive. 

> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-25 Thread Camille Fournier (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100640#comment-16100640
 ] 

Camille Fournier commented on ZOOKEEPER-2770:
-

Are there really 10s-long slow requests? It's defaults like this that make me 
skeptical about the usefulness of this particular implementation. If a request 
through ZK takes 10s to process, your whole system is completely effed. 

I don't think we should add complexity to the code base without suitable 
justification for the value of the new feature. With that in mind, I'd like to 
understand what, specifically, the circumstances we're trying to measure are. 
It looks like processing time for a request through the ZK quorum alone, 
correct? The only network time that might be captured would be, in the case of 
a write, the quorum voting time.

I'm all for making ZK more operable and exposing metrics but I don't think 
exposing low-value metrics is worth the additional code complexity without 
justification.

> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-25 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100603#comment-16100603
 ] 

Ted Dunning commented on ZOOKEEPER-2770:


[~fournc],

I am not so sure that *I* agree with me at this point.

It is fair to say that on occasion there are slow operations in ZK and it would 
be good to know about them. 

This kind of problem is almost always due, in my own vicarious experience,  to 
bad configuration. Often the bad configuration is simply collocation with a 
noisy neighbor on a deficient storage layer.  There might be situations where 
an operation is slow due to the content of the query itself, but I cannot 
imagine what those situations might be.  Writing a large value (but that is 
strictly limited in size), or even doing a huge multi-op (which has the same 
limited size in aggregate) should never take very long.

As such, I would expect that the highest diagnostic value would not be 
something that dumped the contents of slow queries, but rather a capability 
that characterizes the entire distribution of query times. The frequency of 
slow queries is a diagnostic of sorts, but is one that could be inferred from 
the time-varying distributional information I was suggesting.

That said, I don't think that a slow query log is a BAD thing (except a bit bad 
in terms of security if it logs the actual query). And I wouldn't want the BEST 
thing (a distribution log) to stop somebody contributing something.
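
For reference, a minimal sketch of the kind of distributional tracking 
suggested here, independent of any ZooKeeper internals; the bucket scheme and 
all names are illustrative assumptions only.

    import java.util.concurrent.atomic.AtomicLongArray;

    // Illustrative latency histogram: record every request latency into
    // exponentially sized buckets so the full distribution of query times
    // (not just a single slow-op threshold) can be reported periodically.
    final class LatencyHistogram {
        // Bucket upper bounds in ms: 1, 2, 4, ..., 1024, plus an overflow bucket.
        private static final int BUCKETS = 12;
        private final AtomicLongArray counts = new AtomicLongArray(BUCKETS);

        void record(long latencyMs) {
            int bucket = 0;
            long bound = 1;
            while (bucket < BUCKETS - 1 && latencyMs >= bound) {
                bound <<= 1;
                bucket++;
            }
            counts.incrementAndGet(bucket);
        }

        long countInBucket(int bucket) {
            return counts.get(bucket);
        }
    }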




> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-25 Thread Camille Fournier (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100551#comment-16100551
 ] 

Camille Fournier commented on ZOOKEEPER-2770:
-

I completely agree with [~tdunning]; I don't understand the motivation for 
this. Are we just timing the internal processing time for the request? ZK is 
not the same type of system as HBase, so trying to cross-implement this feature 
may be comparing apples to oranges.

> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099714#comment-16099714
 ] 

Hadoop QA commented on ZOOKEEPER-2770:
--

-1 overall.  GitHub Pull Request  Build
  

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 1 new Findbugs (version 3.0.1) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/900//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/900//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/900//console

This message is automatically generated.

> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099515#comment-16099515
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user karanmehta93 commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
@hanm @eribeiro @tdunning 
I have fixed the whitespace issues and final patch is ready, Please review 
whenever convenient.


> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-19 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093587#comment-16093587
 ] 

Hadoop QA commented on ZOOKEEPER-2770:
--

-1 overall.  GitHub Pull Request  Build
  

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 1 new Findbugs (version 3.0.1) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/890//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/890//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/890//console

This message is automatically generated.

> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093571#comment-16093571
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user karanmehta93 commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r128330873
  
--- Diff: src/java/main/org/apache/zookeeper/server/Request.java ---
@@ -29,12 +29,14 @@
 import org.apache.zookeeper.server.quorum.flexible.QuorumVerifier;
 import org.apache.zookeeper.txn.TxnHeader;
 
+
--- End diff --

Some whitespace changes have been introduced by mistake. What's the best way 
to resolve them? I am not changing the `Request.java` file at all in the final 
code.


> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092620#comment-16092620
 ] 

Hadoop QA commented on ZOOKEEPER-2770:
--

-1 overall.  GitHub Pull Request  Build
  

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 1 new Findbugs (version 3.0.1) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/889//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/889//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/889//console

This message is automatically generated.

> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092593#comment-16092593
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user karanmehta93 commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r128155180
  
--- Diff: 
src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java ---
@@ -138,14 +145,56 @@ void delete(File f) throws IOException {
 ServerCnxnFactory getCnxnFactory() {
 return main.getCnxnFactory();
 }
+
 }
 
-public static  class TestZKSMain extends ZooKeeperServerMain {
+public static class TestZKSMain extends ZooKeeperServerMain {
+
+private ServerStats serverStats;
+
+@Override
+public ZooKeeperServer getZooKeeperServer(FileTxnSnapLog txnLog, 
ServerConfig config, ZKDatabase zkDb) {
+ZooKeeperServer zooKeeperServer = 
super.getZooKeeperServer(txnLog, config, zkDb);
+serverStats = zooKeeperServer.serverStats();
+return zooKeeperServer;
+}
+
+@Override
 public void shutdown() {
 super.shutdown();
 }
 }
 
+// Test for ZOOKEEPER-2770 ZooKeeper slow operation log
+@Test
+public void testRequestWarningThreshold() throws IOException, 
KeeperException, InterruptedException {
+ClientBase.setupTestEnv();
+
+final int CLIENT_PORT = PortAssignment.unique();
+
+MainThread main = new MainThread(CLIENT_PORT, true, null, 0);
+main.start();
+
+Assert.assertTrue("waiting for server being up",
+ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT,
+CONNECTION_TIMEOUT));
+// Get the stats object from the ZooKeeperServer to keep track of 
high latency requests.
+ServerStats stats = main.main.serverStats;
+
+ZooKeeper zk = new ZooKeeper("127.0.0.1:" + CLIENT_PORT,
+ClientBase.CONNECTION_TIMEOUT, this);
+
+zk.create("/foo1", "foobar".getBytes(), Ids.OPEN_ACL_UNSAFE,
+CreateMode.PERSISTENT);
+
+Assert.assertEquals(new String(zk.getData("/foo1", null, null)), 
"foobar");
+// It takes a while for the counter to get updated sometimes, this 
is added to reduce flakyness
+Thread.sleep(1000);
--- End diff --

> We can always increase the counter before we send the response back to the 
client in FinalRequestProcessor. This will make the test deterministic. Or,

Not this approach. It will boil down to the same thing as `ServerStats` 
which we don't want in the first place. 

> In test, implement retry polling with a timeout and a smaller sleep interval 
(e.g. 10 ms instead of 1000 ms). We use this pattern a lot in our tests. In 
general we'd like to make tests as deterministic as possible (in other words, 
avoid sleep at all costs), though sometimes that is hard to do for various 
reasons.

I will add the polling mechanism with a smaller sleep interval. I would also 
love to make the test deterministic; however, I don't think that is possible 
here because of the race condition between the client-side and server-side 
processing threads.
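
For reference, a minimal sketch of the retry-polling pattern referred to 
above. The helper class, the ServerStats getter name, and the 10 ms / 30 s 
values are assumptions for illustration, not code or values from the patch.

    import java.util.concurrent.TimeoutException;
    import java.util.function.BooleanSupplier;

    // Illustrative polling helper (not an existing ZooKeeper test utility):
    // retry a cheap check at a short interval instead of one long sleep.
    final class PollUtil {
        static void pollForCondition(BooleanSupplier condition, long intervalMs,
                                     long timeoutMs)
                throws InterruptedException, TimeoutException {
            long deadline = System.currentTimeMillis() + timeoutMs;
            while (!condition.getAsBoolean()) {
                if (System.currentTimeMillis() > deadline) {
                    throw new TimeoutException(
                            "condition not met within " + timeoutMs + " ms");
                }
                Thread.sleep(intervalMs);
            }
        }
    }

    // Possible use in the test, replacing Thread.sleep(1000); the getter name
    // on ServerStats is assumed here:
    // PollUtil.pollForCondition(
    //         () -> stats.getNumRequestsAboveThresholdTime() > 0, 10, 30000);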


> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold.

[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092574#comment-16092574
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user hanm commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r128153360
  
--- Diff: 
src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java ---
@@ -138,14 +145,56 @@ void delete(File f) throws IOException {
 ServerCnxnFactory getCnxnFactory() {
 return main.getCnxnFactory();
 }
+
 }
 
-public static  class TestZKSMain extends ZooKeeperServerMain {
+public static class TestZKSMain extends ZooKeeperServerMain {
+
+private ServerStats serverStats;
+
+@Override
+public ZooKeeperServer getZooKeeperServer(FileTxnSnapLog txnLog, 
ServerConfig config, ZKDatabase zkDb) {
+ZooKeeperServer zooKeeperServer = 
super.getZooKeeperServer(txnLog, config, zkDb);
+serverStats = zooKeeperServer.serverStats();
+return zooKeeperServer;
+}
+
+@Override
 public void shutdown() {
 super.shutdown();
 }
 }
 
+// Test for ZOOKEEPER-2770 ZooKeeper slow operation log
+@Test
+public void testRequestWarningThreshold() throws IOException, 
KeeperException, InterruptedException {
+ClientBase.setupTestEnv();
+
+final int CLIENT_PORT = PortAssignment.unique();
+
+MainThread main = new MainThread(CLIENT_PORT, true, null, 0);
+main.start();
+
+Assert.assertTrue("waiting for server being up",
+ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT,
+CONNECTION_TIMEOUT));
+// Get the stats object from the ZooKeeperServer to keep track of 
high latency requests.
+ServerStats stats = main.main.serverStats;
+
+ZooKeeper zk = new ZooKeeper("127.0.0.1:" + CLIENT_PORT,
+ClientBase.CONNECTION_TIMEOUT, this);
+
+zk.create("/foo1", "foobar".getBytes(), Ids.OPEN_ACL_UNSAFE,
+CreateMode.PERSISTENT);
+
+Assert.assertEquals(new String(zk.getData("/foo1", null, null)), 
"foobar");
+// It takes a while for the counter to get updated sometimes, this 
is added to reduce flakyness
+Thread.sleep(1000);
--- End diff --

sync() will not help here because this specific counter is not part of any 
zab transactions. Two approaches to make the test more robust:

* We can always increase the counter before we send the response back to the 
client in FinalRequestProcessor. This will make the test deterministic. Or,

* In test, implement retry polling with a timeout and a smaller sleep interval 
(e.g. 10 ms instead of 1000 ms). We use this pattern a lot in our tests. In 
general we'd like to make tests as deterministic as possible (in other words, 
avoid sleep at all costs), though sometimes that is hard to do for various 
reasons.

A side note on sync(): sync is not a quorum operation, so it only works if the 
client connects to the leader. I think there is a JIRA trying to fix this.


> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092569#comment-16092569
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user karanmehta93 commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
@hanm 
> I think at some point after this is merged you will need to create another 
pull request targeting the stable branch (branch-3.4), where the version number 
should be 3.4.11 (hopefully, if this can get in before the next release).

I will create a new PR for that branch as well. I am hoping to get this out in 
the next release if possible.


> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092565#comment-16092565
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user hanm commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
>> And for the documentation, this property is being introduced in version 
3.5.4, is that correct?

Yes, for this specific pull request targeting master. I think at some point 
after this is merged you will need to create another pull request targeting the 
stable branch (branch-3.4), where the version number should be 3.4.11 
(hopefully, if this can get in before the next release).


> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092524#comment-16092524
 ] 

Hadoop QA commented on ZOOKEEPER-2770:
--

-1 overall.  GitHub Pull Request  Build
  

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 1 new Findbugs (version 3.0.1) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/888//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/888//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/888//console

This message is automatically generated.

> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092421#comment-16092421
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user tdunning commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r128130857
  
--- Diff: 
src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java ---
@@ -138,14 +145,56 @@ void delete(File f) throws IOException {
 ServerCnxnFactory getCnxnFactory() {
 return main.getCnxnFactory();
 }
+
 }
 
-public static  class TestZKSMain extends ZooKeeperServerMain {
+public static class TestZKSMain extends ZooKeeperServerMain {
+
+private ServerStats serverStats;
+
+@Override
+public ZooKeeperServer getZooKeeperServer(FileTxnSnapLog txnLog, 
ServerConfig config, ZKDatabase zkDb) {
+ZooKeeperServer zooKeeperServer = 
super.getZooKeeperServer(txnLog, config, zkDb);
+serverStats = zooKeeperServer.serverStats();
+return zooKeeperServer;
+}
+
+@Override
 public void shutdown() {
 super.shutdown();
 }
 }
 
+// Test for ZOOKEEPER-2770 ZooKeeper slow operation log
+@Test
+public void testRequestWarningThreshold() throws IOException, 
KeeperException, InterruptedException {
+ClientBase.setupTestEnv();
+
+final int CLIENT_PORT = PortAssignment.unique();
+
+MainThread main = new MainThread(CLIENT_PORT, true, null, 0);
+main.start();
+
+Assert.assertTrue("waiting for server being up",
+ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT,
+CONNECTION_TIMEOUT));
+// Get the stats object from the ZooKeeperServer to keep track of 
high latency requests.
+ServerStats stats = main.main.serverStats;
+
+ZooKeeper zk = new ZooKeeper("127.0.0.1:" + CLIENT_PORT,
+ClientBase.CONNECTION_TIMEOUT, this);
+
+zk.create("/foo1", "foobar".getBytes(), Ids.OPEN_ACL_UNSAFE,
+CreateMode.PERSISTENT);
+
+Assert.assertEquals(new String(zk.getData("/foo1", null, null)), 
"foobar");
+// It takes a while for the counter to get updated sometimes, this 
is added to reduce flakyness
+Thread.sleep(1000);
--- End diff --

So if the test does any kind of operation that goes through the 
`FinalRequestProcessor`, the completion of that operation will occur after the 
counter increment has completed. The client side can then query the counter 
safely.

You are correct that the counter isn't part of the state that ZK is 
protecting, but the effect will be the same since the `FRP` handles all 
requests in order.



> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092332#comment-16092332
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user karanmehta93 commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r128122188
  
--- Diff: 
src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java ---
@@ -138,14 +145,56 @@ void delete(File f) throws IOException {
 ServerCnxnFactory getCnxnFactory() {
 return main.getCnxnFactory();
 }
+
 }
 
-public static  class TestZKSMain extends ZooKeeperServerMain {
+public static class TestZKSMain extends ZooKeeperServerMain {
+
+private ServerStats serverStats;
+
+@Override
+public ZooKeeperServer getZooKeeperServer(FileTxnSnapLog txnLog, 
ServerConfig config, ZKDatabase zkDb) {
+ZooKeeperServer zooKeeperServer = 
super.getZooKeeperServer(txnLog, config, zkDb);
+serverStats = zooKeeperServer.serverStats();
+return zooKeeperServer;
+}
+
+@Override
 public void shutdown() {
 super.shutdown();
 }
 }
 
+// Test for ZOOKEEPER-2770 ZooKeeper slow operation log
+@Test
+public void testRequestWarningThreshold() throws IOException, 
KeeperException, InterruptedException {
+ClientBase.setupTestEnv();
+
+final int CLIENT_PORT = PortAssignment.unique();
+
+MainThread main = new MainThread(CLIENT_PORT, true, null, 0);
+main.start();
+
+Assert.assertTrue("waiting for server being up",
+ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT,
+CONNECTION_TIMEOUT));
+// Get the stats object from the ZooKeeperServer to keep track of 
high latency requests.
+ServerStats stats = main.main.serverStats;
+
+ZooKeeper zk = new ZooKeeper("127.0.0.1:" + CLIENT_PORT,
+ClientBase.CONNECTION_TIMEOUT, this);
+
+zk.create("/foo1", "foobar".getBytes(), Ids.OPEN_ACL_UNSAFE,
+CreateMode.PERSISTENT);
+
+Assert.assertEquals(new String(zk.getData("/foo1", null, null)), 
"foobar");
+// It takes a while for the counter to get updated sometimes, this 
is added to reduce flakyness
+Thread.sleep(1000);
--- End diff --

@tdunning 

> OK. The point of the timeout, however, is that an update has not been 
applied. The sync() is designed to wait for exactly this.

Where does the update have to be applied? To the ZKDb or to the counter? This 
is my understanding; correct me where I am getting it wrong. The call to 
`zk.getData()` is a blocking call. The `FinalRequestProcessor` completes all 
the relevant work for this request and in the end sends out a reply packet to 
the client: it puts the packet on a queue which is handled by the `ServerCnxn` 
on a separate thread. Meanwhile, the former thread goes on to increment the 
`numRequestsAboveThresholdTime` counter. Once the client is acknowledged that 
the data has been persisted on the node, the test proceeds to check the counter 
value. If the thread incrementing the counter got suspended for any reason and 
hadn't yet updated it, this could cause the test to fail.

> Code that depends on timeouts is not deterministic in my book. It may 
just have a low probability of failure.

Agreed, using the term `completely deterministic` on my part was incorrect.
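
Below is a toy, self-contained illustration of the ordering described above 
(not ZooKeeper code, and all names are made up): the reply is queued before the 
counter is incremented, so the client can observe the reply while the counter 
is still stale, which is why the test needs to poll rather than assert 
immediately.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.atomic.AtomicInteger;

    // Toy model of the race: a "server" thread enqueues the reply and only then
    // increments the stats counter, while the "client" thread reads the counter
    // as soon as it sees the reply.
    public class ResponseCounterRace {
        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<String> responseQueue = new ArrayBlockingQueue<>(1);
            AtomicInteger slowRequestCount = new AtomicInteger();

            Thread server = new Thread(() -> {
                responseQueue.add("reply");          // client unblocks from here on
                slowRequestCount.incrementAndGet();  // counter updated afterwards
            });
            server.start();

            responseQueue.take();                    // "client" received its reply
            // This read may still observe 0, so asserting immediately is racy.
            System.out.println("counter right after reply = " + slowRequestCount.get());
            server.join();
        }
    }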



> ZooKeeper slow operation log
> 
>
> Key: ZOOKEEPER-2770
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Karan Mehta
>Assignee: Karan Mehta
> Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.

[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16091875#comment-16091875
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user tdunning commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r128045203
  
--- Diff: 
src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java ---
@@ -138,14 +145,56 @@ void delete(File f) throws IOException {
 ServerCnxnFactory getCnxnFactory() {
 return main.getCnxnFactory();
 }
+
 }
 
-public static  class TestZKSMain extends ZooKeeperServerMain {
+public static class TestZKSMain extends ZooKeeperServerMain {
+
+private ServerStats serverStats;
+
+@Override
+public ZooKeeperServer getZooKeeperServer(FileTxnSnapLog txnLog, 
ServerConfig config, ZKDatabase zkDb) {
+ZooKeeperServer zooKeeperServer = 
super.getZooKeeperServer(txnLog, config, zkDb);
+serverStats = zooKeeperServer.serverStats();
+return zooKeeperServer;
+}
+
+@Override
 public void shutdown() {
 super.shutdown();
 }
 }
 
+// Test for ZOOKEEPER-2770 ZooKeeper slow operation log
+@Test
+public void testRequestWarningThreshold() throws IOException, 
KeeperException, InterruptedException {
+ClientBase.setupTestEnv();
+
+final int CLIENT_PORT = PortAssignment.unique();
+
+MainThread main = new MainThread(CLIENT_PORT, true, null, 0);
+main.start();
+
+Assert.assertTrue("waiting for server being up",
+ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT,
+CONNECTION_TIMEOUT));
+// Get the stats object from the ZooKeeperServer to keep track of 
high latency requests.
+ServerStats stats = main.main.serverStats;
+
+ZooKeeper zk = new ZooKeeper("127.0.0.1:" + CLIENT_PORT,
+ClientBase.CONNECTION_TIMEOUT, this);
+
+zk.create("/foo1", "foobar".getBytes(), Ids.OPEN_ACL_UNSAFE,
+CreateMode.PERSISTENT);
+
+Assert.assertEquals(new String(zk.getData("/foo1", null, null)), 
"foobar");
+// It takes a while for the counter to get updated sometimes, this 
is added to reduce flakyness
+Thread.sleep(1000);
--- End diff --

OK. The point of the timeout, however, is that an update has not been 
applied. The sync() is designed to wait for exactly this.

Code that depends on timeouts is not deterministic in my book. It may just 
have a low probability of failure.

If this isn't feasible, however, it may be something that has to be lived 
with.




[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16091189#comment-16091189
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user karanmehta93 commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r127892837
  
--- Diff: 
src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java ---
@@ -138,14 +145,56 @@ void delete(File f) throws IOException {
 ServerCnxnFactory getCnxnFactory() {
 return main.getCnxnFactory();
 }
+
 }
 
-public static  class TestZKSMain extends ZooKeeperServerMain {
+public static class TestZKSMain extends ZooKeeperServerMain {
+
+private ServerStats serverStats;
+
+@Override
+public ZooKeeperServer getZooKeeperServer(FileTxnSnapLog txnLog, 
ServerConfig config, ZKDatabase zkDb) {
+ZooKeeperServer zooKeeperServer = 
super.getZooKeeperServer(txnLog, config, zkDb);
+serverStats = zooKeeperServer.serverStats();
+return zooKeeperServer;
+}
+
+@Override
 public void shutdown() {
 super.shutdown();
 }
 }
 
+// Test for ZOOKEEPER-2770 ZooKeeper slow operation log
+@Test
+public void testRequestWarningThreshold() throws IOException, 
KeeperException, InterruptedException {
+ClientBase.setupTestEnv();
+
+final int CLIENT_PORT = PortAssignment.unique();
+
+MainThread main = new MainThread(CLIENT_PORT, true, null, 0);
+main.start();
+
+Assert.assertTrue("waiting for server being up",
+ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT,
+CONNECTION_TIMEOUT));
+// Get the stats object from the ZooKeeperServer to keep track of 
high latency requests.
+ServerStats stats = main.main.serverStats;
+
+ZooKeeper zk = new ZooKeeper("127.0.0.1:" + CLIENT_PORT,
+ClientBase.CONNECTION_TIMEOUT, this);
+
+zk.create("/foo1", "foobar".getBytes(), Ids.OPEN_ACL_UNSAFE,
+CreateMode.PERSISTENT);
+
+Assert.assertEquals(new String(zk.getData("/foo1", null, null)), 
"foobar");
+// It takes a while for the counter to get updated sometimes, this 
is added to reduce flakyness
+Thread.sleep(1000);
--- End diff --

> There is a sync() call that ZK supports on the client side. It waits 
until the replica that the client is connected to catches up with the leader.

I am not sure if that applies here, since this test runs a single ZooKeeper 
server instance and not a quorum. IMO, the conditional timeout should be good 
enough to make this test completely deterministic. Let me know if you feel 
otherwise.




[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16091065#comment-16091065
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user tdunning commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r127877506
  
--- Diff: 
src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java ---
@@ -138,14 +145,56 @@ void delete(File f) throws IOException {
 ServerCnxnFactory getCnxnFactory() {
 return main.getCnxnFactory();
 }
+
 }
 
-public static  class TestZKSMain extends ZooKeeperServerMain {
+public static class TestZKSMain extends ZooKeeperServerMain {
+
+private ServerStats serverStats;
+
+@Override
+public ZooKeeperServer getZooKeeperServer(FileTxnSnapLog txnLog, 
ServerConfig config, ZKDatabase zkDb) {
+ZooKeeperServer zooKeeperServer = 
super.getZooKeeperServer(txnLog, config, zkDb);
+serverStats = zooKeeperServer.serverStats();
+return zooKeeperServer;
+}
+
+@Override
 public void shutdown() {
 super.shutdown();
 }
 }
 
+// Test for ZOOKEEPER-2770 ZooKeeper slow operation log
+@Test
+public void testRequestWarningThreshold() throws IOException, 
KeeperException, InterruptedException {
+ClientBase.setupTestEnv();
+
+final int CLIENT_PORT = PortAssignment.unique();
+
+MainThread main = new MainThread(CLIENT_PORT, true, null, 0);
+main.start();
+
+Assert.assertTrue("waiting for server being up",
+ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT,
+CONNECTION_TIMEOUT));
+// Get the stats object from the ZooKeeperServer to keep track of 
high latency requests.
+ServerStats stats = main.main.serverStats;
+
+ZooKeeper zk = new ZooKeeper("127.0.0.1:" + CLIENT_PORT,
+ClientBase.CONNECTION_TIMEOUT, this);
+
+zk.create("/foo1", "foobar".getBytes(), Ids.OPEN_ACL_UNSAFE,
+CreateMode.PERSISTENT);
+
+Assert.assertEquals(new String(zk.getData("/foo1", null, null)), 
"foobar");
+// It takes a while for the counter to get updated sometimes, this 
is added to reduce flakyness
+Thread.sleep(1000);
--- End diff --

There is a sync() call that ZK supports on the client side. It waits until 
the replica that the client is connected to catches up with the leader. 

This solves lots of timing issues and is typically the way to implement 
read-what-you-wrote.

On the other hand, if your test is running with server internals, this 
isn't likely to work as such.

The point is that if you wait until all pending items in the queue have 
been committed to all followers, you will be guaranteed to see all writes you 
caused before starting to wait.
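For reference, a minimal sketch of this read-what-you-wrote pattern, assuming an 
already-connected `ZooKeeper` handle; the helper class and method names are made 
up and this is not code from the patch:

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.ZooKeeper;

final class SyncBeforeRead {
    // Ask the server this session is attached to to catch up with the leader,
    // then read; any write acknowledged before the sync should be visible.
    static byte[] syncAndRead(ZooKeeper zk, String path) throws Exception {
        final CountDownLatch synced = new CountDownLatch(1);
        // sync() only has an asynchronous form, so block on a latch here.
        zk.sync(path, (rc, p, ctx) -> synced.countDown(), null);
        synced.await();
        return zk.getData(path, false, null);
    }
}
```

As noted above, this helps a client talking to a follower; a test poking at 
server internals on a standalone server still needs its own wait.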




[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16090088#comment-16090088
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user karanmehta93 commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r127762613
  
--- Diff: 
src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java ---
@@ -460,11 +464,21 @@ public void processRequest(Request request) {
 if (request.type == OpCode.closeSession) {
 cnxn.sendCloseSession();
 }
+checkLatency(request);
 } catch (IOException e) {
 LOG.error("FIXMSG",e);
 }
 }
 
+private void checkLatency(Request request) {
+long requestWarnThresholdMs = 
QuorumPeerConfig.getRequestWarnResponseThresholdMs();
--- End diff --

I am thinking of caching the value of `requestWarnThresholdMs` inside 
`FinalRequestProcessor.java`. Is that a good idea?
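For illustration only, a rough sketch of that caching idea, assuming the 
threshold is read once at construction time; the class, constructor and method 
names are hypothetical, and the comparison mirrors the condition in the diff 
above (`0` logs every request, `-1` disables the check):

```java
// Hypothetical sketch: cache the configured threshold in a field instead of
// reading QuorumPeerConfig on every request. Not the actual patch code.
public class LatencyCheck {
    private final long requestWarnThresholdMs;

    public LatencyCheck(long requestWarnThresholdMs) {
        // Read once; later changes to the static config are not picked up,
        // which is the trade-off of caching.
        this.requestWarnThresholdMs = requestWarnThresholdMs;
    }

    // Mirrors the condition in the diff: 0 logs every request, -1 disables.
    public boolean exceedsThreshold(long requestLatencyMs) {
        return requestWarnThresholdMs == 0
                || (requestWarnThresholdMs > -1 && requestLatencyMs > requestWarnThresholdMs);
    }
}
```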




[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16090045#comment-16090045
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user karanmehta93 commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r127757305
  
--- Diff: 
src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java ---
@@ -138,14 +145,56 @@ void delete(File f) throws IOException {
 ServerCnxnFactory getCnxnFactory() {
 return main.getCnxnFactory();
 }
+
 }
 
-public static  class TestZKSMain extends ZooKeeperServerMain {
+public static class TestZKSMain extends ZooKeeperServerMain {
+
+private ServerStats serverStats;
+
+@Override
+public ZooKeeperServer getZooKeeperServer(FileTxnSnapLog txnLog, 
ServerConfig config, ZKDatabase zkDb) {
+ZooKeeperServer zooKeeperServer = 
super.getZooKeeperServer(txnLog, config, zkDb);
+serverStats = zooKeeperServer.serverStats();
+return zooKeeperServer;
+}
+
+@Override
 public void shutdown() {
 super.shutdown();
 }
 }
 
+// Test for ZOOKEEPER-2770 ZooKeeper slow operation log
+@Test
+public void testRequestWarningThreshold() throws IOException, 
KeeperException, InterruptedException {
+ClientBase.setupTestEnv();
+
+final int CLIENT_PORT = PortAssignment.unique();
+
+MainThread main = new MainThread(CLIENT_PORT, true, null, 0);
+main.start();
+
+Assert.assertTrue("waiting for server being up",
+ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT,
+CONNECTION_TIMEOUT));
+// Get the stats object from the ZooKeeperServer to keep track of 
high latency requests.
+ServerStats stats = main.main.serverStats;
+
+ZooKeeper zk = new ZooKeeper("127.0.0.1:" + CLIENT_PORT,
+ClientBase.CONNECTION_TIMEOUT, this);
+
+zk.create("/foo1", "foobar".getBytes(), Ids.OPEN_ACL_UNSAFE,
+CreateMode.PERSISTENT);
+
+Assert.assertEquals(new String(zk.getData("/foo1", null, null)), 
"foobar");
+// It takes a while for the counter to get updated sometimes, this 
is added to reduce flakyness
+Thread.sleep(1000);
--- End diff --

@tdunning Yes, a conditional timeout is a better option.
From what I understand, the failure can be caused in a test-only scenario 
where there is no physical network between client and server, which sometimes 
results in the client getting back the acknowledgement while the server has not 
yet finished incrementing the count. IMO, it is not possible in an actual cluster. 
> Would this be susceptible to a sync() to force the test to come up to 
date with the leader?

I am not sure what you mean by this. Could you please explain?




[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089846#comment-16089846
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user tdunning commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r127711561
  
--- Diff: 
src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java ---
@@ -138,14 +145,56 @@ void delete(File f) throws IOException {
 ServerCnxnFactory getCnxnFactory() {
 return main.getCnxnFactory();
 }
+
 }
 
-public static  class TestZKSMain extends ZooKeeperServerMain {
+public static class TestZKSMain extends ZooKeeperServerMain {
+
+private ServerStats serverStats;
+
+@Override
+public ZooKeeperServer getZooKeeperServer(FileTxnSnapLog txnLog, 
ServerConfig config, ZKDatabase zkDb) {
+ZooKeeperServer zooKeeperServer = 
super.getZooKeeperServer(txnLog, config, zkDb);
+serverStats = zooKeeperServer.serverStats();
+return zooKeeperServer;
+}
+
+@Override
 public void shutdown() {
 super.shutdown();
 }
 }
 
+// Test for ZOOKEEPER-2770 ZooKeeper slow operation log
+@Test
+public void testRequestWarningThreshold() throws IOException, 
KeeperException, InterruptedException {
+ClientBase.setupTestEnv();
+
+final int CLIENT_PORT = PortAssignment.unique();
+
+MainThread main = new MainThread(CLIENT_PORT, true, null, 0);
+main.start();
+
+Assert.assertTrue("waiting for server being up",
+ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT,
+CONNECTION_TIMEOUT));
+// Get the stats object from the ZooKeeperServer to keep track of 
high latency requests.
+ServerStats stats = main.main.serverStats;
+
+ZooKeeper zk = new ZooKeeper("127.0.0.1:" + CLIENT_PORT,
+ClientBase.CONNECTION_TIMEOUT, this);
+
+zk.create("/foo1", "foobar".getBytes(), Ids.OPEN_ACL_UNSAFE,
+CreateMode.PERSISTENT);
+
+Assert.assertEquals(new String(zk.getData("/foo1", null, null)), 
"foobar");
+// It takes a while for the counter to get updated sometimes, this 
is added to reduce flakyness
+Thread.sleep(1000);
--- End diff --

At the very least, make the timeout conditional on getting the wrong 
answer. That avoids paying the penalty of a slow test every time.

The mere presence of a timeout indicates a problem with concurrency, 
however. Would this be susceptible to a sync() to force the test to come up to 
date with the leader? If so, that can make the test deterministic.
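
As a sketch of that conditional wait, under the assumption of a hypothetical 
test helper and counter accessor that are not part of the patch:

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

final class TestWaits {
    // Poll the supplied counter until it reaches the expected value or the
    // deadline passes; returns whether the condition was eventually met.
    static boolean waitForCounter(LongSupplier counter, long expected,
                                  long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (counter.getAsLong() < expected) {
            if (System.currentTimeMillis() > deadline) {
                return false;                 // gave up after the full timeout
            }
            TimeUnit.MILLISECONDS.sleep(10);  // short poll interval
        }
        return true;                          // usually returns almost at once
    }
}
```

The test would then assert on `waitForCounter(...)` instead of sleeping a fixed 
second, so the full delay is only paid when the counter really is late.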





[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089319#comment-16089319
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user karanmehta93 commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r127633626
  
--- Diff: 
src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java ---
@@ -138,14 +145,56 @@ void delete(File f) throws IOException {
 ServerCnxnFactory getCnxnFactory() {
 return main.getCnxnFactory();
 }
+
 }
 
-public static  class TestZKSMain extends ZooKeeperServerMain {
+public static class TestZKSMain extends ZooKeeperServerMain {
+
+private ServerStats serverStats;
+
+@Override
+public ZooKeeperServer getZooKeeperServer(FileTxnSnapLog txnLog, 
ServerConfig config, ZKDatabase zkDb) {
+ZooKeeperServer zooKeeperServer = 
super.getZooKeeperServer(txnLog, config, zkDb);
+serverStats = zooKeeperServer.serverStats();
+return zooKeeperServer;
+}
+
+@Override
 public void shutdown() {
 super.shutdown();
 }
 }
 
+// Test for ZOOKEEPER-2770 ZooKeeper slow operation log
+@Test
+public void testRequestWarningThreshold() throws IOException, 
KeeperException, InterruptedException {
+ClientBase.setupTestEnv();
+
+final int CLIENT_PORT = PortAssignment.unique();
+
+MainThread main = new MainThread(CLIENT_PORT, true, null, 0);
+main.start();
+
+Assert.assertTrue("waiting for server being up",
+ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT,
+CONNECTION_TIMEOUT));
+// Get the stats object from the ZooKeeperServer to keep track of 
high latency requests.
+ServerStats stats = main.main.serverStats;
+
+ZooKeeper zk = new ZooKeeper("127.0.0.1:" + CLIENT_PORT,
+ClientBase.CONNECTION_TIMEOUT, this);
+
+zk.create("/foo1", "foobar".getBytes(), Ids.OPEN_ACL_UNSAFE,
+CreateMode.PERSISTENT);
+
+Assert.assertEquals(new String(zk.getData("/foo1", null, null)), 
"foobar");
--- End diff --

Not strictly necessary, but it will be useful for debugging if some change to 
the `getData()` method ever causes this test to fail. 




[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089315#comment-16089315
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user karanmehta93 commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r127633278
  
--- Diff: 
src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java ---
@@ -138,14 +145,56 @@ void delete(File f) throws IOException {
 ServerCnxnFactory getCnxnFactory() {
 return main.getCnxnFactory();
 }
+
 }
 
-public static  class TestZKSMain extends ZooKeeperServerMain {
+public static class TestZKSMain extends ZooKeeperServerMain {
+
+private ServerStats serverStats;
+
+@Override
+public ZooKeeperServer getZooKeeperServer(FileTxnSnapLog txnLog, 
ServerConfig config, ZKDatabase zkDb) {
+ZooKeeperServer zooKeeperServer = 
super.getZooKeeperServer(txnLog, config, zkDb);
+serverStats = zooKeeperServer.serverStats();
+return zooKeeperServer;
+}
+
+@Override
 public void shutdown() {
 super.shutdown();
 }
 }
 
+// Test for ZOOKEEPER-2770 ZooKeeper slow operation log
+@Test
+public void testRequestWarningThreshold() throws IOException, 
KeeperException, InterruptedException {
+ClientBase.setupTestEnv();
+
+final int CLIENT_PORT = PortAssignment.unique();
+
+MainThread main = new MainThread(CLIENT_PORT, true, null, 0);
+main.start();
+
+Assert.assertTrue("waiting for server being up",
+ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT,
+CONNECTION_TIMEOUT));
+// Get the stats object from the ZooKeeperServer to keep track of 
high latency requests.
+ServerStats stats = main.main.serverStats;
+
+ZooKeeper zk = new ZooKeeper("127.0.0.1:" + CLIENT_PORT,
+ClientBase.CONNECTION_TIMEOUT, this);
+
+zk.create("/foo1", "foobar".getBytes(), Ids.OPEN_ACL_UNSAFE,
+CreateMode.PERSISTENT);
+
+Assert.assertEquals(new String(zk.getData("/foo1", null, null)), 
"foobar");
+// It takes a while for the counter to get updated sometimes, this 
is added to reduce flakyness
+Thread.sleep(1000);
--- End diff --

I didn't see it consistently. I ran the same test about 60-70 times 
individually and a couple of times as a suite with all the other tests. Only 
once did the final assertion fail, because the registered count was 2 instead 
of 3, and that was when I was running it with the other tests from the same class. 
What do you suggest? Should I reduce the timeout or remove it altogether?




[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089313#comment-16089313
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user karanmehta93 commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
@eribeiro Thank you for your comments. Will post an updated patch soon.
A small concern regarding the `ServerStats.updateLatency()` method: should the 
updated patch keep both methods for now, with another JIRA filed to look into it?
And for the documentation, this property is being introduced in version 
`3.5.4`, is that correct?
@hanm 




[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089281#comment-16089281
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user eribeiro commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r127627334
  
--- Diff: 
src/java/main/org/apache/zookeeper/server/quorum/QuorumPeerConfig.java ---
@@ -292,6 +293,14 @@ public void parseProperties(Properties zkProp)
 }
 } else if ((key.startsWith("server.") || 
key.startsWith("group") || key.startsWith("weight")) && 
zkProp.containsKey("dynamicConfigFile")) {
 throw new ConfigException("parameter: " + key + " must be 
in a separate dynamic config file");
+} else if(key.equals("request.warningthresholdms")) {
--- End diff --

nit: space between `if` and `(`




[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089276#comment-16089276
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user eribeiro commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r127628162
  
--- Diff: 
src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java ---
@@ -27,6 +27,7 @@
 import java.util.concurrent.CountDownLatch;
 import java.util.concurrent.TimeUnit;
 
+import org.apache.log4j.BasicConfigurator;
--- End diff --

Unused import




[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089275#comment-16089275
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user eribeiro commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r127627096
  
--- Diff: 
src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java ---
@@ -64,10 +65,15 @@
 
 public MainThread(int clientPort, boolean preCreateDirs, String 
configs)
 throws IOException {
-this(clientPort, preCreateDirs, ClientBase.createTmpDir(), 
configs);
+this(clientPort, preCreateDirs, ClientBase.createTmpDir(), 
configs, -1);
 }
 
-public MainThread(int clientPort, boolean preCreateDirs, File 
tmpDir, String configs)
+public MainThread(int clientPort, boolean preCreateDirs, String 
configs, int threshold)
--- End diff --

The word `threshold` in the method signature doesn't say much, so it requires 
digging into the code to see what it's all about. It would be nice to rename it 
to something closer to its intent (there are plenty of names in this patch 
already).




[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089278#comment-16089278
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user eribeiro commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r127618910
  
--- Diff: 
src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java ---
@@ -460,11 +464,21 @@ public void processRequest(Request request) {
 if (request.type == OpCode.closeSession) {
 cnxn.sendCloseSession();
 }
+checkLatency(request);
 } catch (IOException e) {
 LOG.error("FIXMSG",e);
 }
 }
 
+private void checkLatency(Request request) {
+long requestWarnThresholdMs = 
QuorumPeerConfig.getRequestWarnResponseThresholdMs();
+long requestLatency = Time.currentElapsedTime() - 
request.createTime;
+if((requestWarnThresholdMs > -1 && requestLatency > 
requestWarnThresholdMs) || requestWarnThresholdMs == 0) {
+LOG.warn("Request "  + request + " exceeded threshold. Took " 
+ requestLatency + " ms.");
+zks.serverStats().updateNumRequestsExceededThresholdTime();
--- End diff --

Cool! :) 

nit: two suggestions: change the prefix from `update` to `inc`, and use 
`numRequestsAboveThresholdTime()` (up to you both).




[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089279#comment-16089279
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user eribeiro commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r127619032
  
--- Diff: 
src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java ---
@@ -138,14 +145,56 @@ void delete(File f) throws IOException {
 ServerCnxnFactory getCnxnFactory() {
 return main.getCnxnFactory();
 }
+
 }
 
-public static  class TestZKSMain extends ZooKeeperServerMain {
+public static class TestZKSMain extends ZooKeeperServerMain {
+
+private ServerStats serverStats;
+
+@Override
+public ZooKeeperServer getZooKeeperServer(FileTxnSnapLog txnLog, 
ServerConfig config, ZKDatabase zkDb) {
+ZooKeeperServer zooKeeperServer = 
super.getZooKeeperServer(txnLog, config, zkDb);
+serverStats = zooKeeperServer.serverStats();
+return zooKeeperServer;
+}
+
+@Override
 public void shutdown() {
 super.shutdown();
 }
 }
 
+// Test for ZOOKEEPER-2770 ZooKeeper slow operation log
--- End diff --

We don't put this kind of comment in the code. The git log already has the patch 
name, description, and files changed/added.




[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089277#comment-16089277
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user eribeiro commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r127618988
  
--- Diff: src/java/main/org/apache/zookeeper/server/ServerStats.java ---
@@ -32,6 +32,7 @@
 private long minLatency = Long.MAX_VALUE;
 private long totalLatency = 0;
 private long count = 0;
+private long numRequestsExceededThresholdTime = 0;
--- End diff --

The addition of this metric is missing a 
`resetRequestsExceededThresholdTime` method to zero the counter. ;)
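
Putting the two suggestions in this thread together, a rough sketch of what the 
counter handling could look like; the method names follow the `inc`/`reset` 
naming proposed above and are not the committed code:

```java
// Illustrative only: synchronized increment, getter and reset for the new
// stat, following the naming suggested in this review thread.
public class SlowRequestStats {
    private long numRequestsAboveThresholdTime = 0;

    public synchronized void incNumRequestsAboveThresholdTime() {
        numRequestsAboveThresholdTime++;
    }

    public synchronized long getNumRequestsAboveThresholdTime() {
        return numRequestsAboveThresholdTime;
    }

    // Zeroes the counter so it can be reset alongside the other stats.
    public synchronized void resetNumRequestsAboveThresholdTime() {
        numRequestsAboveThresholdTime = 0;
    }
}
```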




[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089280#comment-16089280
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user eribeiro commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r127627537
  
--- Diff: 
src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java ---
@@ -138,14 +145,56 @@ void delete(File f) throws IOException {
 ServerCnxnFactory getCnxnFactory() {
 return main.getCnxnFactory();
 }
+
 }
 
-public static  class TestZKSMain extends ZooKeeperServerMain {
+public static class TestZKSMain extends ZooKeeperServerMain {
+
+private ServerStats serverStats;
+
+@Override
+public ZooKeeperServer getZooKeeperServer(FileTxnSnapLog txnLog, 
ServerConfig config, ZKDatabase zkDb) {
+ZooKeeperServer zooKeeperServer = 
super.getZooKeeperServer(txnLog, config, zkDb);
+serverStats = zooKeeperServer.serverStats();
+return zooKeeperServer;
+}
+
+@Override
 public void shutdown() {
 super.shutdown();
 }
 }
 
+// Test for ZOOKEEPER-2770 ZooKeeper slow operation log
+@Test
+public void testRequestWarningThreshold() throws IOException, 
KeeperException, InterruptedException {
+ClientBase.setupTestEnv();
+
+final int CLIENT_PORT = PortAssignment.unique();
+
+MainThread main = new MainThread(CLIENT_PORT, true, null, 0);
+main.start();
+
+Assert.assertTrue("waiting for server being up",
+ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT,
+CONNECTION_TIMEOUT));
+// Get the stats object from the ZooKeeperServer to keep track of 
high latency requests.
+ServerStats stats = main.main.serverStats;
+
+ZooKeeper zk = new ZooKeeper("127.0.0.1:" + CLIENT_PORT,
+ClientBase.CONNECTION_TIMEOUT, this);
+
+zk.create("/foo1", "foobar".getBytes(), Ids.OPEN_ACL_UNSAFE,
+CreateMode.PERSISTENT);
+
+Assert.assertEquals(new String(zk.getData("/foo1", null, null)), 
"foobar");
--- End diff --

question: Is this assertion check really necessary? Isn't a plain 
`zk.getData("/foo1", null, null);` enough?




[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089282#comment-16089282
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user eribeiro commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r127627805
  
--- Diff: 
src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java ---
@@ -138,14 +145,56 @@ void delete(File f) throws IOException {
 ServerCnxnFactory getCnxnFactory() {
 return main.getCnxnFactory();
 }
+
 }
 
-public static  class TestZKSMain extends ZooKeeperServerMain {
+public static class TestZKSMain extends ZooKeeperServerMain {
+
+private ServerStats serverStats;
+
+@Override
+public ZooKeeperServer getZooKeeperServer(FileTxnSnapLog txnLog, 
ServerConfig config, ZKDatabase zkDb) {
+ZooKeeperServer zooKeeperServer = 
super.getZooKeeperServer(txnLog, config, zkDb);
+serverStats = zooKeeperServer.serverStats();
+return zooKeeperServer;
+}
+
+@Override
 public void shutdown() {
 super.shutdown();
 }
 }
 
+// Test for ZOOKEEPER-2770 ZooKeeper slow operation log
+@Test
+public void testRequestWarningThreshold() throws IOException, 
KeeperException, InterruptedException {
+ClientBase.setupTestEnv();
+
+final int CLIENT_PORT = PortAssignment.unique();
+
+MainThread main = new MainThread(CLIENT_PORT, true, null, 0);
+main.start();
+
+Assert.assertTrue("waiting for server being up",
+ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT,
+CONNECTION_TIMEOUT));
+// Get the stats object from the ZooKeeperServer to keep track of 
high latency requests.
+ServerStats stats = main.main.serverStats;
+
+ZooKeeper zk = new ZooKeeper("127.0.0.1:" + CLIENT_PORT,
+ClientBase.CONNECTION_TIMEOUT, this);
+
+zk.create("/foo1", "foobar".getBytes(), Ids.OPEN_ACL_UNSAFE,
+CreateMode.PERSISTENT);
+
+Assert.assertEquals(new String(zk.getData("/foo1", null, null)), 
"foobar");
+// It takes a while for the counter to get updated sometimes, this 
is added to reduce flakyness
+Thread.sleep(1000);
--- End diff --

Wondering why... 🤔 I have run this test case a couple of times, but was 
unable to see this lag in the counter update manifest as an assertion failure. 
Did you see this consistently?




[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089160#comment-16089160
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user eribeiro commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
@karanmehta93 hi, mate.

You don't need to touch `ZooKeeperAdmin.html`. You change 
`ZooKeeperAdmin.xml` **only**, and Apache Forrest generates the HTML and PDF. 
Also, this XML change should be part of this patch, okay?

Cheers!





[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16088755#comment-16088755
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user karanmehta93 commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
@hanm @eribeiro 
Thank you for your comments. I have added a rebased patch with a test in the `ZooKeeperServerMainTest` class. Please review whenever convenient.

> It's odd to have two (or in future even more) types of latency checks 
scattered around which creates fragmentation w.r.t. the definition of what a 
request latency means. 

We can decide based on opinions from other people, and I can update the patch accordingly. I will create new JIRAs based on @tdunning's suggestions after this one gets approved. I am eager to take them up.

Documentation is still pending; I will add a patch for it soon. I am not sure whether I have to modify the `ZooKeeperAdmin.xml` file or the `ZooKeeperAdmin.html` file. Can you point me to any relevant material?





[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16086643#comment-16086643
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user karanmehta93 commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r127363108
  
--- Diff: 
src/java/main/org/apache/zookeeper/server/quorum/QuorumPeerConfig.java ---
@@ -61,6 +61,7 @@
 
 private static boolean standaloneEnabled = true;
 private static boolean reconfigEnabled = false;
+private static int warnThreshold = 1;
--- End diff --

I suggest turning on logging by default. It won't hurt much as such; people can opt out if they want. It's similar to the usual stats collection, IMO.




[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16086597#comment-16086597
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user hanm commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r127356689
  
--- Diff: 
src/java/main/org/apache/zookeeper/server/quorum/QuorumPeerConfig.java ---
@@ -61,6 +61,7 @@
 
 private static boolean standaloneEnabled = true;
 private static boolean reconfigEnabled = false;
+private static int warnThreshold = 1;
--- End diff --

Should this have a default value of -1 instead? I think we'd like this to be an opt-in feature, and by default the logging should be disabled.
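For illustration only, a minimal sketch of how a -1 sentinel keeps the feature opt-in; the field and method names here are hypothetical, not the patch's:

``` java
/** Illustrative sketch: -1 means "slow-request warning disabled" (opt-in behaviour). */
final class SlowRequestConfigSketch {
    static final int DISABLED = -1;

    private final int warnThresholdMs;

    SlowRequestConfigSketch(int warnThresholdMs) {
        this.warnThresholdMs = warnThresholdMs;   // would be parsed from the config file
    }

    boolean shouldWarn(long requestLatencyMs) {
        // With the default of -1 the first clause short-circuits, so nothing is ever logged.
        return warnThresholdMs > DISABLED && requestLatencyMs > warnThresholdMs;
    }
}
```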




[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16086596#comment-16086596
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user hanm commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
I think we should consolidate the latency check in `zks.serverStats().updateLatency`. It's odd to have two (or in the future even more) types of latency checks scattered around, which creates fragmentation w.r.t. the definition of what request latency means. The existing latency measurement in ServerStats measures the time between a request's creation and the request landing at the final request processor; the patch instead measures the end-to-end time of a request from the start to the finish of processing. I am fine with the end-to-end processing time, though I'd like to double-check with a few folks to make sure the regression risk and impact of this change are limited.

I think ServerStats is a good place to put the data structure Ted recommended.

I think it's a good idea to scope the JIRA so it's easier to get it reviewed and committed. What this patch is doing is a positive improvement to the operational aspects of ZK, so that can be the scope of this PR. On top of that, future improvements could be what Edward and Ted suggested (JMX, distribution of latencies / histograms, etc.). That work can be tracked as sub-tasks under the current JIRA.
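One way to read the consolidation idea, as a hedged sketch only (field names, logging, and synchronization here are illustrative, not the actual ServerStats code): `updateLatency` computes the latency once, updates the aggregates, and emits the slow-request warning from the same place.

``` java
import java.util.concurrent.atomic.AtomicLong;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Sketch of a single consolidated latency path; not the real ServerStats implementation. */
final class LatencyStatsSketch {
    private static final Logger LOG = LoggerFactory.getLogger(LatencyStatsSketch.class);

    private final long warnThresholdMs;                      // -1 disables the warning
    private final AtomicLong requestsAboveThreshold = new AtomicLong();

    LatencyStatsSketch(long warnThresholdMs) {
        this.warnThresholdMs = warnThresholdMs;
    }

    /** One entry point: updates aggregates and logs the slow request in the same place. */
    void updateLatency(long requestCreateTimeMs, String requestDetails) {
        long latencyMs = System.currentTimeMillis() - requestCreateTimeMs;
        // ... update min/max/total latency aggregates here ...
        if (warnThresholdMs > -1 && latencyMs > warnThresholdMs) {
            requestsAboveThreshold.incrementAndGet();
            LOG.warn("Request exceeded warn threshold ({} ms > {} ms): {}",
                    latencyMs, warnThresholdMs, requestDetails);
        }
    }

    long getRequestsAboveThreshold() {
        return requestsAboveThreshold.get();
    }
}
```

Whether this is invoked with the request's create time where the current stats update happens, or after the response is sent, is exactly the measurement question discussed in this comment.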




[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085462#comment-16085462
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user eribeiro commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r127172773
  
--- Diff: 
src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java ---
@@ -430,6 +432,7 @@ public void processRequest(Request request) {
 // the client and leader disagree on where the client is most
// recently attached (and therefore invalid SESSION MOVED generated)
 cnxn.sendCloseSession();
+request.checkLatency();
--- End diff --

Yes. The percentiles and so on should be a matter for another issue. No doubt about that, IMO.

The meat of my comment was really whether checkLatency shouldn't be the responsibility of the Stats class, even if it remains a separate method (i.e., not folded into updateLatency).




[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085180#comment-16085180
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user hanm commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
@karanmehta93 Thanks for updating the pull request with more context information. Do you mind putting the context description in the pull request description?

Regarding test cases, I think you can just add a new test file under 
zookeeper/test. 




[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085171#comment-16085171
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user karanmehta93 commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r127127351
  
--- Diff: 
src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java ---
@@ -430,6 +432,7 @@ public void processRequest(Request request) {
 // the client and leader disagree on where the client is most
// recently attached (and therefore invalid SESSION MOVED generated)
 cnxn.sendCloseSession();
+request.checkLatency();
--- End diff --

@eribeiro 
That is exactly the reason I created my own method after `cnxn.sendCloseSession()`: I was interested in end-to-end latency, so I didn't use the `zks.serverStats().updateLatency(request.createTime);` method. I would also like to hear opinions from others on this.
Should we scope this JIRA to adding the threshold in general, and create a new JIRA for extending the scope to include percentiles with a new data structure inside the `ServerStats` class?




[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085106#comment-16085106
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user eribeiro commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r127112584
  
--- Diff: 
src/java/main/org/apache/zookeeper/server/quorum/QuorumPeerConfig.java ---
@@ -735,6 +738,8 @@ public String getConfigFilename(){
 public Boolean getQuorumListenOnAllIPs() {
 return quorumListenOnAllIPs;
 }
+
+public static int getWarnResponseTime() { return  warnThreshold; }
--- End diff --

What about renaming this method to `getMaxRequestThreshold()` or `getWarnRequestThreshold()`?




[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085105#comment-16085105
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user eribeiro commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r12796
  
--- Diff: 
src/java/main/org/apache/zookeeper/server/quorum/QuorumPeerConfig.java ---
@@ -61,6 +61,7 @@
 
 private static boolean standaloneEnabled = true;
 private static boolean reconfigEnabled = false;
+private static int warnThreshold = 1;
--- End diff --

I would suggest renaming this field to `warnThresholdMs` or `maxRequestThresholdMs`. The important part here is the `Ms` suffix.




[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085107#comment-16085107
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user eribeiro commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r127117526
  
--- Diff: 
src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java ---
@@ -430,6 +432,7 @@ public void processRequest(Request request) {
 // the client and leader disagree on where the client is most
// recently attached (and therefore invalid SESSION MOVED generated)
 cnxn.sendCloseSession();
+request.checkLatency();
--- End diff --

On a related note, besides logging those slow requests, wouldn't it be nice to expose a counter of how many requests were above the latency threshold and store it in `zks.serverStats()`, so that it could be exposed as a metric via JMX?

Going even further, we could expose other metrics related to slow requests, like the average time and 99th-percentile latency, as suggested by @tdunning.
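A sketch of that counter-plus-JMX idea, using only the standard platform MBean server; the object name and types are hypothetical, not ZooKeeper's own MBean registry:

``` java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;

import javax.management.MBeanServer;
import javax.management.ObjectName;

/** Getter-only view so JMX clients see the counter as a read-only attribute. */
interface SlowRequestMetricsMXBean {
    long getSlowRequestCount();
}

/** Illustrative sketch: count slow requests and publish the count over JMX. */
final class SlowRequestMetrics implements SlowRequestMetricsMXBean {
    private final AtomicLong slowRequests = new AtomicLong();

    void recordSlowRequest() {
        slowRequests.incrementAndGet();
    }

    @Override
    public long getSlowRequestCount() {
        return slowRequests.get();
    }

    /** Register under a hypothetical object name; ZooKeeper's own MBean registration differs. */
    void register() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        server.registerMBean(this, new ObjectName("org.example:type=SlowRequestMetrics"));
    }
}
```

Percentile-style metrics would then build on the distribution tracking discussed in Ted's suggestion further down the thread.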




[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085103#comment-16085103
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user eribeiro commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r127110615
  
--- Diff: src/java/main/org/apache/zookeeper/server/Request.java ---
@@ -21,20 +21,27 @@
 import java.nio.ByteBuffer;
 import java.util.List;
 
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
 import org.apache.jute.Record;
 import org.apache.zookeeper.KeeperException;
 import org.apache.zookeeper.ZooDefs.OpCode;
 import org.apache.zookeeper.common.Time;
 import org.apache.zookeeper.data.Id;
+import org.apache.zookeeper.server.quorum.QuorumPeerConfig;
 import org.apache.zookeeper.server.quorum.flexible.QuorumVerifier;
 import org.apache.zookeeper.txn.TxnHeader;
 
+
 /**
  * This is the structure that represents a request moving through a chain of
  * RequestProcessors. There are various pieces of information that is tacked
  * onto the request as it is processed.
  */
 public class Request {
+private static final Logger LOG = LoggerFactory
+.getLogger(Request.class);
--- End diff --

No need to break line here.




[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085104#comment-16085104
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user eribeiro commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r127116226
  
--- Diff: 
src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java ---
@@ -430,6 +432,7 @@ public void processRequest(Request request) {
 // the client and leader disagree on where the client is most
// recently attached (and therefore invalid SESSION MOVED generated)
 cnxn.sendCloseSession();
+request.checkLatency();
--- End diff --

I see that both `request.checkLatency()` and `zks.serverStats().updateLatency(request.createTime);` (a few lines above) deal with the same matter (request latency), even though they sit in radically different places.

IMHO, `zks.serverStats().updateLatency(request.createTime);` should be moved to the line where `request.checkLatency()` currently is, and the if condition from `request.checkLatency()` should be folded into `zks.serverStats().updateLatency(request.createTime);`, eventually getting rid of `request.checkLatency()` altogether. Could this raise some unexpected bugs?

@karanmehta93 @hanm, wdyt?




[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085102#comment-16085102
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user eribeiro commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r127110211
  
--- Diff: src/java/main/org/apache/zookeeper/server/Request.java ---
@@ -121,6 +128,13 @@ public void setTxn(Record txn) {
 this.txn = txn;
 }
 
+public void checkLatency() {
+long requestLatency = Time.currentElapsedTime() - this.createTime;
+if(QuorumPeerConfig.getWarnResponseTime() > -1 && requestLatency > QuorumPeerConfig.getWarnResponseTime()) {
--- End diff --

nit: space between `if` and `(`.

You can shorten the `if` expression and avoid an unnecessary second call to the same method by doing:

``` java
long warnRequestThreshold = QuorumPeerConfig.getWarnResponseTime();
if (warnRequestThreshold > -1 && requestLatency > warnRequestThreshold) {
```






[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085060#comment-16085060
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user karanmehta93 commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
From my understanding, the `QuorumPeer` class will be the candidate class to maintain this data structure. Let me know of any alternatives.
I will look into the data-structure ideas you provided in the meantime.




[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085014#comment-16085014
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user eribeiro commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r127109651
  
--- Diff: 
src/java/main/org/apache/zookeeper/server/quorum/QuorumPeerConfig.java ---
@@ -292,6 +293,8 @@ public void parseProperties(Properties zkProp)
 }
 } else if ((key.startsWith("server.") || key.startsWith("group") || key.startsWith("weight")) && zkProp.containsKey("dynamicConfigFile")) {
 throw new ConfigException("parameter: " + key + " must be in a separate dynamic config file");
+} else if(key.equals("warn.responseTime")) {
--- End diff --

+1 on @hanm's suggestions.

nit: a space between `if` and `(` at line 296.






[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16084975#comment-16084975
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user tdunning commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
So here is an alternative suggestion.

For each class of operation (GET, PUT, MULTIPUT, DELETE and so on), maintain a data structure (such as t-digest or FloatHistogram; see https://github.com/tdunning/t-digest) that keeps track of the distribution of latencies for that operation.

Log any transactions that are above a settable percentile latency for operations of that class. The default could be 99.99%.

Operations longer than 2x the 99.9th-percentile latency should not be entered into the historical distribution.
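A minimal sketch of that scheme, assuming the t-digest library's `TDigest.createMergingDigest`, `add`, and `quantile` methods; the thresholds, operation classes, and lack of thread-safety here are illustrative only:

``` java
import java.util.EnumMap;
import java.util.Map;

import com.tdunning.math.stats.TDigest;

/** Sketch: per-operation-class latency distributions with percentile-based slow logging. */
final class PerOpLatencyTracker {
    enum OpClass { GET, PUT, MULTIPUT, DELETE }

    private final Map<OpClass, TDigest> digests = new EnumMap<>(OpClass.class);
    private final double slowQuantile;                         // e.g. 0.9999 by default

    PerOpLatencyTracker(double slowQuantile) {
        this.slowQuantile = slowQuantile;
        for (OpClass op : OpClass.values()) {
            digests.put(op, TDigest.createMergingDigest(100)); // compression parameter
        }
    }

    /** Returns true when this latency should be logged as slow for its operation class. */
    boolean record(OpClass op, double latencyMs) {
        TDigest digest = digests.get(op);                      // not thread-safe; real code would synchronize
        double slowCutoff = digest.quantile(slowQuantile);     // NaN until some history exists
        double outlierCutoff = 2 * digest.quantile(0.999);
        boolean slow = !Double.isNaN(slowCutoff) && latencyMs > slowCutoff;
        // Keep extreme outliers out of the historical distribution, as suggested above;
        // while the cutoff is still NaN (empty digest) every sample is accepted.
        if (Double.isNaN(outlierCutoff) || latencyMs <= outlierCutoff) {
            digest.add(latencyMs);
        }
        return slow;
    }
}
```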




[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16084536#comment-16084536
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user karanmehta93 commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/307#discussion_r127048435
  
--- Diff: 
src/java/main/org/apache/zookeeper/server/quorum/QuorumPeerConfig.java ---
@@ -292,6 +293,8 @@ public void parseProperties(Properties zkProp)
 }
 } else if ((key.startsWith("server.") || key.startsWith("group") || key.startsWith("weight")) && zkProp.containsKey("dynamicConfigFile")) {
 throw new ConfigException("parameter: " + key + " must be in a separate dynamic config file");
+} else if(key.equals("warn.responseTime")) {
--- End diff --

I will document the new property and change the name to the suggested one.
I have tested this feature manually.
I am currently trying to add a unit test case in NettyServerCnxnTest.java or NIOServerCnxnTest.java, but that seems like too high-level an approach.
Can you suggest any locations for adding a test case? @hanm




[jira] [Commented] (ZOOKEEPER-2770) ZooKeeper slow operation log

2017-07-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16084500#comment-16084500
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2770:
---

Github user karanmehta93 commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
ZooKeeper is a complex distributed application. There are many reasons why 
any given read or write operation may become slow: a software bug, a protocol 
problem, a hardware issue with the commit log(s), a network issue. If the 
problem is constant it is trivial to come to an understanding of the cause. 
However in order to diagnose intermittent problems we often don't know where, 
or when, to begin looking. We need some sort of timestamped indication of the 
problem. Although ZooKeeper is not a datastore, it does persist data, and can 
suffer intermittent performance degradation, and should consider implementing a 
'slow query' log, a feature very common to services which persist information 
on behalf of clients which may be sensitive to latency while waiting for 
confirmation of successful persistence.

@hanm FYI



