[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100706#comment-16100706
 ] 

Andrew Purtell edited comment on ZOOKEEPER-2770 at 7/25/17 8:26 PM:
--------------------------------------------------------------------

The originally proposed change is hardly complex. I don't understand that 
aspect of this discussion. Whether or not the metric is useful, on the other 
hand... ok. That is a matter of opinion. I think we'd like to know if any ZK op 
takes longer than a second to complete, and how often that might happen, and on 
what host(s)/quorum it is happening. We have a fleet of thousands of servers. We 
have tens of ZooKeeper installations, each on five servers. Hardware does funny 
things from time to time. We'd like to be proactive. 

Edit: More like 160 quorums, I think. 


was (Author: apurtell):
The originally proposed change is hardly complex. I don't understand that 
aspect of this discussion. Whether or not the metric is useful, on the other 
hand... ok. That is a matter of opinion. I think we'd like to know if any ZK op 
takes longer than a second to complete, and how often that might happen, and on 
what host it is happening. We have a fleet of thousands of servers. We have tens 
of ZooKeeper installations, each on five servers. Hardware does funny things 
from time to time. We'd like to be proactive. 

Edit: More like 160 quorums, I think. 

> ZooKeeper slow operation log
> ----------------------------
>
>                 Key: ZOOKEEPER-2770
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
>             Project: ZooKeeper
>          Issue Type: Improvement
>            Reporter: Karan Mehta
>            Assignee: Karan Mehta
>         Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant, it is trivial to come to an understanding of the cause. 
> However, in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 
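
For illustration, the check described above (log when current time minus request arrival time exceeds a configured threshold) might look roughly like the following. This is only a sketch under assumptions: the class name, method signature, and log format are hypothetical and are not taken from ZooKeeper or from the attached patches. Timestamps are passed in by the caller so the logic can be exercised without a real server.

```java
import java.util.Optional;

/**
 * Hypothetical sketch of a slow-request check. None of these names come
 * from ZooKeeper itself; they exist only to illustrate the idea.
 */
final class SlowRequestCheck {
    // Configured threshold in milliseconds (assumed to be a server setting).
    private final long thresholdMs;

    SlowRequestCheck(long thresholdMs) {
        if (thresholdMs < 0) {
            throw new IllegalArgumentException("thresholdMs must be >= 0");
        }
        this.thresholdMs = thresholdMs;
    }

    /**
     * Returns a log line if the request took longer than the threshold
     * between arrival and final processing, otherwise an empty Optional.
     */
    Optional<String> check(String clientAddress, long sessionId, String opName,
                           String path, long arrivalTimeMs, long nowMs) {
        long elapsedMs = nowMs - arrivalTimeMs;
        // "Beyond" the threshold is read here as strictly greater than it.
        if (elapsedMs <= thresholdMs) {
            return Optional.empty();
        }
        return Optional.of(String.format(
                "Slow request: client=%s session=0x%x op=%s path=%s elapsedMs=%d thresholdMs=%d",
                clientAddress, sessionId, opName, path, elapsedMs, thresholdMs));
    }
}
```

With a threshold of 1000 ms, a request that arrived at t=0 and is processed at t=1500 would yield a log line containing the client, session, operation, path, and elapsed time, while one processed at t=1000 would yield nothing. A real implementation would presumably hand the line to the server's logger and take its timestamps from a single consistent clock.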



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
