[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics
Github user hanm commented on the issue: https://github.com/apache/zookeeper/pull/580 Got a [green build](https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2193/). Merging. ---
[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics
Github user hanm commented on the issue: https://github.com/apache/zookeeper/pull/580 Thanks for the rebase, @jtuple. The latest build failed, but those are flaky tests. Kicking Jenkins again to get a green build. @anmolnar Could you please also sign this off? I believe you raised a request to remove `currentTime`, which @jtuple addressed by pointing out its usage in test code. Do you have other concerns regarding this patch? ---
[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics
Github user jtuple commented on the issue: https://github.com/apache/zookeeper/pull/580 Rebased to resolve merge conflict. ---
[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics
Github user hanm commented on the issue: https://github.com/apache/zookeeper/pull/580 From the conversation so far, it looks like we have reached a consensus regarding the direction of this patch and future metrics-related patches. For this specific patch, it looks good to me; I only have one comment regarding FOLLOWER_SYNC_TIME instrumentation, but that's not blocking. Any comments from other reviewers? @anmolnar @eolivelli @lvfangmin Let's try to get this patch merged in soon to avoid constant churn from merge conflicts (unfortunately, it has picked up another merge conflict that requires another rebase). ---
[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics
Github user hanm commented on the issue: https://github.com/apache/zookeeper/pull/580 Thanks for the detailed reply, @jtuple. Jenkins was not happy because one C client test failed (the 'Test Result' page only shows Java tests). I kicked off a [new build](https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2141/), and it passes. >> Other types like a Gauge could easily be done in a future pull-request, although our experience has been that gauges are less useful in a polling-based approach since you only see the most recent value. OK. >> Once we land the set-variants, you could add a single metric to the ServerMetric enum and rely upon the set logic. I think this is what I was looking for - a way of organizing metrics hierarchically, like files/folders or namespaces. Good to know it's already supported and will be upstreamed. ---
[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics
Github user jtuple commented on the issue: https://github.com/apache/zookeeper/pull/580 Sorry for the delay, August was a super busy month. I've rebased onto latest master and added basic unit tests for `AvgMinMaxCounter` and `SimpleCounter`. I also resolved the findbugs issue and the `testMonitor` unit tests. Jenkins still seems unhappy, but I'm not sure what's up -- the linked test results show no failures. I've also addressed the majority of inline comments. Please let me know if there's anything else I overlooked. --- As mentioned in a prior comment, this pull request is just the first of many. In this PR we add the `AvgMinMaxCounter` and `SimpleCounter` metrics. We have additional metrics internally that we'll upstream in the future, including `AvgMinMaxPercentileCounter` as well as set variants: `AvgMinMaxCounterSet` and `AvgMinMaxPercentileCounterSet`. I talked about both of these in [a previous comment](https://github.com/apache/zookeeper/pull/580#issuecomment-410340039). Those are all the types we currently have internally. Other types like a Gauge could easily be done in a future pull-request, although our experience has been that gauges are less useful in a polling-based approach since you only see the most recent value. Something like an `AvgMinMaxCounter` metric gives you more information when there are multiple events, and trivially reduces to a gauge-equivalent when there's only a single value during the sampling interval. For things like commit processor queue sizes, we simply query the queue size every time we enqueue an item and report that instantaneous size as a value in an `AvgMinMaxCounter`. When we periodically query metrics, we then know the min/max/avg queue size during the interval, as well as the population count, which is equivalent to the number of enqueues during the interval. --- @hanm: For your example, you have a few options:

1. Do as you mentioned and add a new metric to the `ServerMetric` enum for each request type.
2. Once we land the set-variants, you could add a single metric to the `ServerMetric` enum and rely upon the set logic. E.g. add a `REQUEST_LATENCY(new AvgMinMaxCounterSet("request_latency"))` metric and then do `ServerMetrics.REQUEST_LATENCY.add("create_session", latency)`, `ServerMetrics.REQUEST_LATENCY.add("set_data", latency)`, etc. For your example, you wanted to track counts, so you'd probably want a `SimpleCounterSet`, which we don't have but would be easy to create.
3. Use something custom. For example, the majority of our internal metrics build on `ServerMetrics`, but we do have a handful that don't. One example that we'll be upstreaming soon is our so-called "request state" metrics that track the count of requests at different stages in the request pipeline. We maintain a count of requests queued, issued, and completed for each category of requests (all, read, write, create_session, close_session, sync, auth, set_watches, other). This matrix is also added to the output of `/metrics` but doesn't use any of the `ServerMetrics` logic. ---
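The counter and set-variant pattern described in this comment can be sketched as follows. This is a hypothetical illustration, not the actual ZooKeeper implementation: the class names follow the PR's terminology, but the internals (locking, report key format, the `reportAndReset` method) are assumptions for the sake of a self-contained example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Tracks avg/min/max/count of values observed during a polling interval.
class AvgMinMaxCounter {
    private final String name;
    private long count, total, min = Long.MAX_VALUE, max = Long.MIN_VALUE;

    AvgMinMaxCounter(String name) { this.name = name; }

    synchronized void add(long value) {
        count++;
        total += value;
        min = Math.min(min, value);
        max = Math.max(max, value);
    }

    // Report the interval's stats and reset, so each polling interval
    // stands alone (ZooKeeper metrics are historically resettable).
    synchronized Map<String, Long> reportAndReset() {
        Map<String, Long> values = new LinkedHashMap<>();
        values.put("avg_" + name, count == 0 ? 0 : total / count);
        values.put("min_" + name, count == 0 ? 0 : min);
        values.put("max_" + name, count == 0 ? 0 : max);
        values.put("cnt_" + name, count);
        count = 0; total = 0; min = Long.MAX_VALUE; max = Long.MIN_VALUE;
        return values;
    }
}

// Hypothetical set variant: one sub-counter per dynamic key, e.g.
// REQUEST_LATENCY.add("create_session", latency).
class AvgMinMaxCounterSet {
    private final String name;
    private final ConcurrentHashMap<String, AvgMinMaxCounter> counters =
            new ConcurrentHashMap<>();

    AvgMinMaxCounterSet(String name) { this.name = name; }

    void add(String key, long value) {
        counters.computeIfAbsent(key,
                k -> new AvgMinMaxCounter(name + "_" + k)).add(value);
    }

    AvgMinMaxCounter get(String key) { return counters.get(key); }
}
```

Used this way, the commit-processor example above amounts to calling `add(queue.size())` on every enqueue, with a periodic `/metrics` poll calling `reportAndReset()` to read out min/max/avg queue size and the enqueue count for the interval.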
[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics
Github user hanm commented on the issue: https://github.com/apache/zookeeper/pull/580 And yes, like others commented, please rebase, resolve conflicts, and reply to the previously unattended comments. My previous comments that need addressing are: * Existing unit test failures and one findbugs issue. * Are there more types of metrics going to be added, such as Gauge? * It would be good to have some tests around the system. ---
[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics
Github user hanm commented on the issue: https://github.com/apache/zookeeper/pull/580 One question about gathering metrics using this system: let's say I have a use case where I want to gather the number of requests processed by FinalRequestProcessor for each request type (e.g. Ping, CreateSession, and other OpCode types). How would I do this using this metrics system? I think I have to add a new enum type in ServerMetrics for each OpCode - is this correct? ---
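For reference, the enum-entry-per-OpCode approach suggested in this question might look like the sketch below. The enum name, entries, and methods are hypothetical and not taken from the actual patch; they only mirror the ServerMetrics style of one static enum constant per metric.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical: one enum entry per request type, each backed by its
// own counter, mirroring the enum-based ServerMetrics style.
enum RequestTypeMetrics {
    PING,
    CREATE_SESSION,
    SET_DATA;

    private final AtomicLong count = new AtomicLong();

    public void inc() { count.incrementAndGet(); }
    public long get() { return count.get(); }
}

// In FinalRequestProcessor one would then dispatch on the OpCode:
//   RequestTypeMetrics.PING.inc();
```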
[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics
Github user hanm commented on the issue: https://github.com/apache/zookeeper/pull/580 We at Twitter manage large ZK clusters, and we also have our own internal metrics system (you can't really operate things like ZK at large scale without metrics). Our design is similar to this one in terms of metrics collection and code instrumentation, but we coupled the metrics collection with publication. We publish metrics through Finagle (Twitter's RPC library), and the tight coupling caused several issues, such as a circular dependency between Finagle and ZooKeeper: ZK depends on Finagle for metrics, and Finagle depends on ZK for ServerSets (name resolution). I think the current design proposed in the community would solve this problem. Basically we will have two systems: * The ZooKeeper metrics system for metrics collection. * The pluggable system @eolivelli is working on as the backend for the metrics. So IMO both efforts are more or less orthogonal and can be done in parallel. And I agree with the design decisions @jtuple proposed and @eolivelli commented on previously. Regarding using an external metrics type system: I think it is more flexible for us to define and own the metrics definitions, so it's easy to add to or change the system without depending on third parties. ---
[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics
Github user lvfangmin commented on the issue: https://github.com/apache/zookeeper/pull/580 @jtuple do you have time to resolve the comments and rebase this onto the latest branch? It would be great if we could get this in this week. ---
[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics
Github user lvfangmin commented on the issue: https://github.com/apache/zookeeper/pull/580 Thanks @anmolnar, I've answered some of the questions here. @jtuple can you resolve some of these comments and the conflict? We might be ready to go now. This will unblock other patches we're going to upstream, so I'd like to see it merged sooner. ---
[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics
Github user anmolnar commented on the issue: https://github.com/apache/zookeeper/pull/580 @lvfangmin @eolivelli Thanks for the clarification. Let's do it the way you're suggesting. @jtuple I'm happy to commit your patch once you've answered my questions and resolved the conflicts. ---
[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics
Github user eolivelli commented on the issue: https://github.com/apache/zookeeper/pull/580 I think it will be easier for the FB guys to merge their work into the community repo and then move to the new system. Otherwise the two forks will never converge. So I lean towards merging all the FB work on metrics and then converting it to the new system. In parallel we can make decisions about the new system; we can merge my changes as long as they do not have significant impact at runtime. For me it would be a nice approach to let @lvfangmin eventually convert the instrumentation points to the new system. I can focus on the implementation of providers, like Prometheus and Dropwizard. ---
[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics
Github user lvfangmin commented on the issue: https://github.com/apache/zookeeper/pull/580 @anmolnar If I understand correctly, @eolivelli's work is more about making it easier to export ZooKeeper's internal metrics to external systems, while FB is focusing on adding a lot more useful new metrics, so they don't conflict with each other that much. And even if we need to change something to adapt to the new metrics framework, it should be trivial. ---
[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics
Github user anmolnar commented on the issue: https://github.com/apache/zookeeper/pull/580 @eolivelli I understand that this is Facebook's work on adding new metrics and that, at the same time, you're working on a generic long-term solution for the ZooKeeper metrics system. My point is that we should keep these 2 initiatives as close as possible and minimize the need for refactoring on both sides. > will be (quite) easily refactored to the new system once we have merged Facebook contribs and merged the new system. I'm not sure that we can get to that point in the short term, hence I encourage doing these 2 things side by side. ---
[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics
Github user eolivelli commented on the issue: https://github.com/apache/zookeeper/pull/580 @anmolnar this change is a backport from the Facebook fork. In PR #582 I have started a brand new effort to design and implement a long-term ZK internal API to instrument the server and client Java code base. Users are requesting support for Prometheus, for instance. The idea of PR #582 is to design how a "MetricsProvider" can interact with ZK code. This great work from @jtuple and his colleagues will be (quite) easily refactored to the new system once we have merged the Facebook contribs and merged the new system. The most interesting key points of the new Metrics API are: - we are going to support several providers and/or let users contribute/use their own providers - we are not using 'static' variables, so you can have several ZooKeeper clients (for instance an HBase client, a Curator-based client...) inside the same JVM I would like to move forward and send the next patches, but we need agreement on the high-level decisions. cc @hanm @phunt ---
[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics
Github user eolivelli commented on the issue: https://github.com/apache/zookeeper/pull/580 @jtuple in my opinion it is better to allow all of the changes coming from Facebook to be merged in as soon as possible. Having only one common codebase is necessary in order to work together. So from my point of view it would be better to clearly mark all of the changes coming from the FB fork; this way reviewers and committers will take that into account and review the patch in that light. Last year in BookKeeper, for instance, we merged three 'diverged' forks from Yahoo, Twitter, and Salesforce into the Apache codebase. It was a long effort, but now it is super easy to collaborate. Once we have a common codebase, refactoring will be easier for everyone. Just my 2 cents ---
[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics
Github user jtuple commented on the issue: https://github.com/apache/zookeeper/pull/580 I'm happy to add tests for `AvgMinMaxCounter` and `SimpleCounter`; I'll go ahead and update this PR today to include them for completeness, since it's an easy change to make while we discuss any other reworking. We actually landed a version of #440 internally at Facebook and have been running with Dropwizard-based percentile counters in production for a few months now. We build on the same `ServerMetrics` design as in this PR and have an `AvgMinMaxPercentileCounter` that wraps a Dropwizard `Histogram`, which uses a custom uniform-sampling reservoir implementation that enables resetting the metric/reservoir. This makes it consistent with all the other metric types, which are resettable. In practice, the memory overhead of the Dropwizard histogram is much higher than that of the flat `AvgMinMaxCounter` metric, so we only converted about 20 of our 100 metrics to `AvgMinMaxPercentileCounter` metrics -- pretty much just the latency metrics and metrics that track time spent in various queues and stages of the request pipeline. We also have `AvgMinMaxCounterSet` and `AvgMinMaxPercentileCounterSet` metric types that aggregate metrics across a dynamic set of entities. These are types we were planning to submit in future PRs that build on this one. Example uses include tracking learner queue size/time and ack latency on the leader for each quorum peer. --- In any case, I'm happy to rework this PR with guidance and have the bandwidth to do so if we can come to consensus on a design. My biggest concern is that pretty much all of our remaining internal features at Facebook that we want to upstream are blocked on this feature, since we aggressively add metrics when adding new features (so that we can monitor/track them in production, etc.). The sooner we can land something, the sooner we can rebase on top of whatever the agreed approach is and unblock our other contributions.
Open to any design the community is happy with; however, I think there are a few things that are important to maintain:

1. Any solution should make it super easy to add new metrics and use them throughout the codebase. We've learned the hard way that lowering the barrier to entry is the only way to truly motivate adding new metrics while adding new features.
2. It's useful to have a `Metric` type interface that allows us to enforce certain requirements. As an example, ZooKeeper metrics have historically been resettable, which isn't necessarily true for other metric libraries. Having an adaptor class allowed us to wrap the Dropwizard histograms and add a fast reset capability, so that they then behaved like all our other metrics.
3. Even if we export to external systems, having ZooKeeper maintain internal metrics/aggregation is still important. There should be a low barrier to entry for getting visibility into the system from day one, with minimal config/setup. ---
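The adaptor idea in point 2 can be sketched as below. This is a simplified stand-in, not the Facebook implementation: instead of a real Dropwizard `Histogram`, it uses a plain list as the reservoir (sampling logic elided, not thread-safe for concurrent writers), just to show how an atomically swappable reservoir yields a cheap reset behind a common `Metric` interface.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Every ZooKeeper metric is resettable; wrappers around external
// libraries must supply that capability.
interface Metric {
    void add(long value);
    void reset();
}

// Simplified resettable histogram: reset() swaps the backing reservoir
// rather than clearing it, which is O(1) regardless of size. A real
// implementation would wrap a Dropwizard Histogram around a custom
// resettable uniform-sampling reservoir.
class ResettableHistogram implements Metric {
    private final AtomicReference<List<Long>> reservoir =
            new AtomicReference<>(new ArrayList<>());

    @Override
    public void add(long value) {
        reservoir.get().add(value); // uniform-sampling logic elided
    }

    @Override
    public void reset() {
        reservoir.set(new ArrayList<>()); // swap, don't clear
    }

    public int count() { return reservoir.get().size(); }
}
```

The swap-on-reset design is what makes the wrapped histogram behave like the flat counters: a `/metrics` poll can read and reset every metric uniformly, regardless of whether it is backed by a few longs or a full reservoir.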
[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics
Github user anmolnar commented on the issue: https://github.com/apache/zookeeper/pull/580 @eolivelli @jtuple I think the best approach would be to join efforts and work together to establish the new framework and implement these new metrics with a suitable library like Dropwizard. In the end, everybody will surely be happy with the results. ---
[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics
Github user hanm commented on the issue: https://github.com/apache/zookeeper/pull/580 Finally this is happening - thanks for contributing the change back, @jtuple. My quick comments are inline. In addition, there are two things: * There is a findbugs warning. I believe it's the txn variable [here](https://github.com/jtuple/zookeeper/blob/e6935f8d99eace05d29c2d6659e68e8b90b9a633/src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java#L110) that should be removed. I know that's not part of this patch, and I am not sure why it gets triggered here, but please remove it so we can get a clean findbugs check. * The test `org.apache.zookeeper.server.admin.CommandsTest` is failing. Please investigate. ---