[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

2018-09-17 Thread hanm
Github user hanm commented on the issue:

https://github.com/apache/zookeeper/pull/580
  
Got a [green build](https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2193/). Merging.


---


[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

2018-09-14 Thread hanm
Github user hanm commented on the issue:

https://github.com/apache/zookeeper/pull/580
  
Thanks for the rebase, @jtuple. The latest build failed, but those failures are 
flaky tests. Kicking Jenkins again to get a green build.

@anmolnar Could you please also sign this off? I believe you raised a request 
to remove `currentTime`, which @jtuple addressed by pointing out its usage in 
test code. Do you have other concerns regarding this patch?


---


[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

2018-09-14 Thread jtuple
Github user jtuple commented on the issue:

https://github.com/apache/zookeeper/pull/580
  
Rebased to resolve merge conflict.


---


[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

2018-09-14 Thread hanm
Github user hanm commented on the issue:

https://github.com/apache/zookeeper/pull/580
  
From the conversation so far, it looks like we have reached a consensus 
regarding the direction of this patch and future metrics-related patches. This 
specific patch looks good to me; I only have one comment regarding 
FOLLOWER_SYNC_TIME instrumentation, but that's not blocking.

Any comments from other reviewers? @anmolnar @eolivelli @lvfangmin 
Let's try to get this patch merged soon to avoid constant churn from merge 
conflicts (unfortunately, it has picked up another merge conflict that 
requires another rebase).


---


[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

2018-09-07 Thread hanm
Github user hanm commented on the issue:

https://github.com/apache/zookeeper/pull/580
  
Thanks for the detailed reply, @jtuple.

Jenkins was not happy because one C client test failed (the 'Test Result' 
only shows Java tests). I kicked off a 
[new build](https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2141/), 
and it passes.

>> Other types like a Gauge could easily be done in a future pull request, 
although our experience has been that gauges are less useful in a 
polling-based approach since you only see the most recent value.

OK.

>> Once we land the set-variants, you could add a single metric to the 
ServerMetric enum and rely upon the set logic.

I think this is what I was looking for - a way of organizing metrics 
hierarchically, like files/folders or namespaces. Good to know it's already 
supported and will be upstreamed.


---


[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

2018-09-07 Thread jtuple
Github user jtuple commented on the issue:

https://github.com/apache/zookeeper/pull/580
  
Sorry for the delay, August was a super busy month.

I've rebased onto the latest master and added basic unit tests for 
`AvgMinMaxCounter` and `SimpleCounter`. I also resolved the findbugs issue and 
fixed the `testMonitor` unit tests. Jenkins still seems unhappy, but I'm not 
sure what's up -- the linked test results show no failures.

I've also addressed the majority of the inline comments. Please let me know if 
there's anything else I overlooked.

---

As mentioned in a prior comment, this pull request is just the first of 
many. In this PR we add the `AvgMinMaxCounter` and `SimpleCounter` metrics.

We have additional metrics internally that we'll upstream in the future, 
including `AvgMinMaxPercentileCounter` as well as set variants: 
`AvgMinMaxCounterSet` and `AvgMinMaxPercentileCounterSet`. I talked about both 
of these in 
[a previous comment](https://github.com/apache/zookeeper/pull/580#issuecomment-410340039).

Those are all the types we currently have internally. Other types like a 
Gauge could easily be done in a future pull request, although our experience 
has been that gauges are less useful in a polling-based approach since you 
only see the most recent value. Something like an `AvgMinMaxCounter` metric 
gives you more information when there are multiple events, and trivially 
reduces to a gauge equivalent when there's only a single value during the 
sampling interval.

For things like commit processor queue sizes, we simply query the queue size 
every time we enqueue an item and report that instantaneous size as a value in 
an `AvgMinMaxCounter`. When we periodically query metrics, we then know the 
min/max/avg queue size during the interval, as well as the population count, 
which is equivalent to the number of enqueues during the interval.
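
As a rough sketch of that counter pattern (simplified for illustration; the 
actual `AvgMinMaxCounter` class in this PR differs in its concurrency and 
reporting details):

```java
// Simplified sketch of an avg/min/max counter. Illustrative only; the
// real class in this PR handles concurrency and reporting differently.
public class AvgMinMaxCounterSketch {
    private long count;
    private long sum;
    private long min = Long.MAX_VALUE;
    private long max = Long.MIN_VALUE;

    public synchronized void add(long value) {
        count++;
        sum += value;
        min = Math.min(min, value);
        max = Math.max(max, value);
    }

    public synchronized double avg() {
        return count == 0 ? 0.0 : (double) sum / count;
    }

    public synchronized long min() { return count == 0 ? 0 : min; }
    public synchronized long max() { return count == 0 ? 0 : max; }
    public synchronized long count() { return count; }

    // ZooKeeper metrics have historically been resettable, so the
    // poller can clear state after each reporting interval.
    public synchronized void reset() {
        count = 0;
        sum = 0;
        min = Long.MAX_VALUE;
        max = Long.MIN_VALUE;
    }
}
```

For the queue-size example above, each enqueue would call 
`add(queue.size())`, and the periodic poller would read avg/min/max/count and 
then call `reset()`.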

---

@hanm: For your example, you have a few options:

1. Do as you mentioned and add a new metric to the `ServerMetric` enum for 
each request type.

2. Once we land the set-variants, you could add a single metric to the 
`ServerMetric` enum and rely upon the set logic. E.g., add a `REQUEST_LATENCY(new 
AvgMinMaxCounterSet("request_latency"))` metric and then do 
`ServerMetrics.REQUEST_LATENCY.add("create_session", latency)`, 
`ServerMetrics.REQUEST_LATENCY.add("set_data", latency)`, etc. For your 
example, you wanted to track counts, so you'd probably want a 
`SimpleCounterSet`, which we don't have but would be easy to create (see the 
sketch after this list).

3. Use something custom. For example, the majority of our internal metrics 
build on `ServerMetrics`, but we do have a handful that don't. One example 
that we'll be upstreaming soon is our so-called "request state" metrics, which 
track the count of requests at different stages of the request pipeline. We 
maintain a count of requests queued, issued, and completed for different 
categories of requests (all, read, write, create_session, close_session, sync, 
auth, set_watches, other). This matrix is also added to the output of 
`/metrics` but doesn't use any of the `ServerMetrics` logic.
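
To make option 2 concrete for the count case, here's a rough sketch of what a 
`SimpleCounterSet` could look like (illustrative only -- this type doesn't 
exist yet, and the final API may differ):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of a set-variant counter keyed by a dynamic name,
// e.g. one sub-counter per request type. Not a final API.
public class SimpleCounterSetSketch {
    // Prefix used when reporting, e.g. "<name>_<key>".
    private final String name;
    private final Map<String, AtomicLong> counters = new ConcurrentHashMap<>();

    public SimpleCounterSetSketch(String name) {
        this.name = name;
    }

    public void add(String key, long delta) {
        counters.computeIfAbsent(key, k -> new AtomicLong()).addAndGet(delta);
    }

    public long get(String key) {
        AtomicLong counter = counters.get(key);
        return counter == null ? 0 : counter.get();
    }
}
```

Usage would then mirror the latency example, e.g. a hypothetical 
`ServerMetrics.REQUESTS_PROCESSED.add("create_session", 1)` from 
`FinalRequestProcessor`.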


---


[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

2018-09-05 Thread hanm
Github user hanm commented on the issue:

https://github.com/apache/zookeeper/pull/580
  
And yes, like others commented, please rebase, resolve conflicts, and reply to 
the earlier unaddressed comments. My previous comments that need to be 
addressed are:

* Existing unit test failures and one findbugs issue.
* Are more types of metrics going to be added, such as a Gauge?
* It would be good to have some tests around the system.


---


[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

2018-09-05 Thread hanm
Github user hanm commented on the issue:

https://github.com/apache/zookeeper/pull/580
  
One question about gathering metrics using this system: let's say I have a 
use case where I want to gather the number of requests processed by 
FinalRequestProcessor for each request type (e.g. Ping, CreateSession, and 
other OpCode types). How would I do this with this metrics system? I think I 
would have to add a new enum entry in ServerMetrics for each OpCode; is that 
correct?
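
For concreteness, this is roughly what I'm imagining (the enum entries and 
names here are hypothetical, just to illustrate the question):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical per-OpCode entries following the ServerMetrics pattern;
// one enum constant per request type we want to count.
public enum RequestCountMetricsSketch {
    PING_COUNT,
    CREATE_SESSION_COUNT,
    SET_DATA_COUNT; // ... and so on, one entry per OpCode

    private final AtomicLong counter = new AtomicLong();

    public void add(long delta) {
        counter.addAndGet(delta);
    }

    public long get() {
        return counter.get();
    }
}

// In FinalRequestProcessor, after handling a ping:
//   RequestCountMetricsSketch.PING_COUNT.add(1);
```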


---


[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

2018-09-05 Thread hanm
Github user hanm commented on the issue:

https://github.com/apache/zookeeper/pull/580
  
We at Twitter manage large ZK clusters, and we also have our own internal 
metrics system (you can't really operate something like ZK at large scale 
without metrics). Our design is similar to this one in terms of metrics 
collection and code instrumentation, but we coupled metrics collection with 
publication. We publish metrics through Finagle (Twitter's RPC library), and 
the tight coupling caused several issues, such as circular dependencies 
between Finagle and ZooKeeper, where ZK depends on Finagle for metrics and 
Finagle depends on ZK for ServerSets (name resolution).

I think the design currently proposed in the community would solve this 
problem. Basically we will have two systems:
* The ZooKeeper metrics system for metrics collection.
* The pluggable system @eolivelli is working on as the backend for the 
metrics.

So IMO both pieces of work are largely orthogonal and can be done in parallel. 
And I agree with the design decisions @jtuple proposed and @eolivelli 
commented on previously.

Regarding using an external metrics type system: I think it is more flexible 
for us to define and own the metrics definitions, so it's easy to add to or 
change the system without depending on third parties.


---


[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

2018-08-30 Thread lvfangmin
Github user lvfangmin commented on the issue:

https://github.com/apache/zookeeper/pull/580
  
@jtuple do you have time to resolve the comments and rebase this onto the 
latest branch? It would be great if we could get this in this week.


---


[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

2018-08-22 Thread lvfangmin
Github user lvfangmin commented on the issue:

https://github.com/apache/zookeeper/pull/580
  
Thanks @anmolnar, I've answered some of the questions here.

@jtuple can you resolve some of these comments and the conflict? We might be 
ready to go now. This will unblock other patches we're going to upstream, so 
I'd like to see it merged sooner.


---


[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

2018-08-11 Thread anmolnar
Github user anmolnar commented on the issue:

https://github.com/apache/zookeeper/pull/580
  
@lvfangmin @eolivelli Thanks for the clarification. Let's do it the way 
you're suggesting.
@jtuple I'm happy to commit your patch once you've answered my questions and 
resolved the conflicts.


---


[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

2018-08-11 Thread eolivelli
Github user eolivelli commented on the issue:

https://github.com/apache/zookeeper/pull/580
  
I think it will be easier for the FB guys to merge their work into the 
community repo and then move to the new system. Otherwise the two forks will 
never converge.

So I lean towards merging all the FB work on metrics and then converting to 
the new system.

In parallel we can make decisions about the new system; we can merge my 
changes as long as they do not have a significant impact at runtime.

For me it would be a nice approach to eventually let @lvfangmin convert the 
instrumentation points to the new system.

I can focus on the implementation of providers, like Prometheus and Dropwizard.


---


[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

2018-08-10 Thread lvfangmin
Github user lvfangmin commented on the issue:

https://github.com/apache/zookeeper/pull/580
  
@anmolnar If I understand correctly, @eolivelli's work is more about making 
it easier to export ZooKeeper's internal metrics to external systems, while FB 
is focusing on adding a lot more useful new metrics, so they don't conflict 
with each other that much.

And even if we need to change something to adapt to the new metric framework, 
it should be trivial.


---


[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

2018-08-06 Thread anmolnar
Github user anmolnar commented on the issue:

https://github.com/apache/zookeeper/pull/580
  
@eolivelli 
I get that this is Facebook's work on adding new metrics, and that at the same 
time you're working on a generic long-term solution for the ZooKeeper metrics 
system.

My point is that we should bring these 2 initiatives as close together as 
possible and minimize the need for refactoring on both sides.

> will be (quite) easily refactored to the new system once we have merged 
Facebook contribs and merged the new system.

I'm not sure that we can get to that point in the short term, hence I 
encourage doing these 2 things side by side.


---


[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

2018-08-06 Thread eolivelli
Github user eolivelli commented on the issue:

https://github.com/apache/zookeeper/pull/580
  
@anmolnar this change is a backport from the Facebook fork.

In PR #582 I have started brand new work to design and implement a long-term 
ZK internal API to instrument the server and client Java codebases. Users are 
requesting support for Prometheus, for instance. The idea of PR #582 is to 
design how a "MetricsProvider" can interact with ZK code.

This great work from @jtuple and his colleagues will be (quite) easily 
refactored to the new system once we have merged the Facebook contribs and 
merged the new system.

On the new Metrics API the most interesting key points are (a rough sketch 
follows):
- we are going to support several providers and/or let users contribute/use 
their own providers
- we are not using 'static' variables, so you can have several ZooKeeper 
clients (for instance an HBase client, a Curator-based client...) inside the 
same JVM
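
To sketch the idea (purely hypothetical; the actual interface is being 
designed in PR #582 and may end up looking quite different):

```java
import java.util.Properties;

// Hypothetical shape of a pluggable metrics provider. The names and
// signatures here are illustrative only, not the PR #582 API.
public interface MetricsProviderSketch {
    void configure(Properties config);  // e.g. endpoint/port for Prometheus
    void start();                       // begin publishing metrics
    void addValue(String name, long v); // called from instrumentation points
    void stop();                        // release resources on shutdown
}
```

Because provider instances would be passed around rather than held in static 
fields, several ZooKeeper clients in the same JVM could each use their own 
provider.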

I would like to move forward and send the next patches, but we need agreement 
on the high-level decisions.

cc @hanm @phunt 



---


[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

2018-08-04 Thread eolivelli
Github user eolivelli commented on the issue:

https://github.com/apache/zookeeper/pull/580
  
@jtuple in my opinion it is better to allow all of the changes coming from 
Facebook to be merged in as soon as possible. Having only one common codebase 
is necessary in order to work together.

So from my point of view it is better to clearly mark all of the changes 
coming from the FB fork; this way reviewers and committers will take it into 
account and treat the patch in that light.

Last year, for instance, in BookKeeper we merged three 'diverged' forks from 
Yahoo, Twitter, and Salesforce into the Apache codebase. It was a long effort, 
but now it is super easy to collaborate.

Once we have a common codebase, refactoring will be easier for everyone.

Just my 2 cents


---


[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

2018-08-03 Thread jtuple
Github user jtuple commented on the issue:

https://github.com/apache/zookeeper/pull/580
  
I'm happy to add tests for `AvgMinMaxCounter` and `SimpleCounter`, and will 
go ahead and update this PR today to include them for completeness, since it's 
an easy change to make while we discuss any other reworkings.

We actually landed a version of #440 internally at Facebook and have been 
running with Dropwizard-based percentile counters in production for a few 
months now. We build on the same `ServerMetrics` design as in this PR and have 
an `AvgMinMaxPercentileCounter` that wraps a Dropwizard `Histogram` with a 
custom uniform-sampling reservoir implementation that enables resetting the 
metric/reservoir. This keeps things consistent with all the other metric 
types, which are resettable.
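
As a rough illustration of the reset idea (our actual implementation uses a 
custom reservoir that supports fast reset; this sketch simply swaps in a fresh 
histogram instead):

```java
import com.codahale.metrics.Histogram;
import com.codahale.metrics.Snapshot;
import com.codahale.metrics.UniformReservoir;

// Illustrative only: approximates a resettable percentile metric by
// replacing the underlying Dropwizard Histogram on reset.
public class ResettableHistogramSketch {
    private volatile Histogram histogram = new Histogram(new UniformReservoir());

    public void update(long value) {
        histogram.update(value);
    }

    public Snapshot snapshot() {
        return histogram.getSnapshot(); // percentiles over the current reservoir
    }

    public void reset() {
        histogram = new Histogram(new UniformReservoir());
    }
}
```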

In practice, the memory overhead of the Dropwizard histogram is much higher 
than that of the flat `AvgMinMaxCounter` metric, so we only converted about 20 
of our 100 metrics to `AvgMinMaxPercentileCounter` metrics -- pretty much just 
the latency metrics and the metrics that track time spent in various queues 
and stages of the request pipeline.

We also have `AvgMinMaxCounterSet` and `AvgMinMaxPercentileCounterSet` 
metric types that aggregate metrics across a dynamic set of entities. These 
are types we were planning to submit in future PRs that build on this one. 
Example uses include tracking learner queue size/time and ack latency on the 
leader for each quorum peer.

---

In any case, I'm happy to rework this PR with guidance and have the bandwidth 
to do so if we can come to consensus on a design. My biggest concern is that 
pretty much all of the remaining internal features at Facebook that we want to 
upstream are blocked on this feature, since we aggressively add metrics when 
adding new features (so that we can monitor/track them in production, etc.). 
The sooner we can land something, the sooner we can rebase on top of whatever 
the agreed approach is and unblock our other contributions.

I'm open to any design that the community is happy with; however, I think 
there are a few things which are important to maintain:

1. Any solution should make it super easy to add new metrics and use them 
throughout the codebase. We've learned the hard way that lowering the barrier 
to entry is the only way to truly motivate adding new metrics while adding new 
features.

2. It's useful to have a `Metric` type interface that allows us to enforce 
certain requirements (a minimal sketch follows this list). As an example, 
ZooKeeper metrics have historically been resettable, which isn't necessarily 
true for other metric libraries. Having an adaptor class allowed us to wrap 
the Dropwizard histograms and add a fast reset capability so that they behaved 
like all our other metrics.

3. Even if we export to external systems, having ZooKeeper maintain internal 
metrics/aggregation is still important. There should be a low barrier to entry 
for getting visibility into the system from day one with minimal config/setup.
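
For point 2, a minimal sketch of the kind of interface I mean (illustrative 
only; not the exact type in this PR):

```java
import java.util.Map;

// Illustrative contract: every metric accepts values and can be reset,
// which lets an adapter impose resettability on wrapped third-party
// types such as a Dropwizard Histogram.
public interface MetricSketch {
    void add(long value);          // record one observation
    void reset();                  // clear state after a reporting cycle
    Map<String, Object> values();  // name -> current value(s), e.g. for /metrics
}
```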


---


[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

2018-07-31 Thread anmolnar
Github user anmolnar commented on the issue:

https://github.com/apache/zookeeper/pull/580
  
@eolivelli @jtuple 
I think the best would be to join the efforts and work together to 
establish the new framework and implement these new metrics with a suitable 
library like dropwizard. At the end everybody will be happy with the results 
for sure.


---


[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

2018-07-20 Thread hanm
Github user hanm commented on the issue:

https://github.com/apache/zookeeper/pull/580
  
Finally this is happening; thanks for contributing the change back, @jtuple. 
My quick comments are inline.

In addition, there are two things:
* There is a findbugs warning. I believe it's the txn variable 
[here](https://github.com/jtuple/zookeeper/blob/e6935f8d99eace05d29c2d6659e68e8b90b9a633/src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java#L110) 
that should be removed. I know that's not part of this patch, and I am not 
sure why it gets triggered here, but please remove it so we can get a clean 
findbugs check.

* The test `org.apache.zookeeper.server.admin.CommandsTest` is failing. 
Please investigate.


---