Using queue depth is a good idea for measuring the load of components, though there are scenarios where users want to know the latency of a specific method or processing flow. BTW, we've added support for turning metrics on/off dynamically in JStorm (we actually had this feature in JStorm 2.1.1, but there was a bug in the implementation), i.e. metrics filtering. I think this is a pretty practical requirement.
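For illustration, here is a rough sketch of what dynamic filtering could look like on top of Codahale's MetricFilter interface; the class name and the prefix-based toggle are hypothetical, not our actual JStorm implementation:

    import com.codahale.metrics.Metric;
    import com.codahale.metrics.MetricFilter;

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch: a filter whose accepted metric-name prefixes can be
    // changed at runtime (e.g. by a config watcher), so expensive metrics stay
    // off until someone is actively debugging a problem.
    public class DynamicMetricFilter implements MetricFilter {
        private final Set<String> enabledPrefixes = ConcurrentHashMap.newKeySet();

        public void enable(String prefix)  { enabledPrefixes.add(prefix); }
        public void disable(String prefix) { enabledPrefixes.remove(prefix); }

        @Override
        public boolean matches(String name, Metric metric) {
            for (String prefix : enabledPrefixes) {
                if (name.startsWith(prefix)) {
                    return true;
                }
            }
            return false;
        }
    }

Any Codahale reporter built with .filter(...) would then only emit whatever is enabled at report time, which is roughly the behavior we want for expensive per-method latency metrics.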
I agree with you that it's not a simple 1-to-1 translation; actually, the new metrics requirements look more like a refactoring than incremental improvements.

On Mon, May 23, 2016 at 9:36 PM, Bobby Evans <ev...@yahoo-inc.com.invalid> wrote:

> From JStorm we got the ability to show most of the metrics on the UI, but I think we missed the ability to turn metrics on and off dynamically. This is important because there will be metrics that are expensive to gather, but very important when debugging specific problems.
> I also think we should spend some time thinking about why we are gathering/computing specific metrics, what question we are trying to answer, and whether there is a better way to answer it.
> For example, we collect a lot of latency statistics for a bolt to answer a few different questions, but for the most part the latency stats we have are bad at answering either of those questions.
>
> 1) How overloaded is a particular bolt? aka capacity.
> 2) How does this bolt contribute to the overall processing time in a topology?
>
> For the first one we subsample because it is expensive to compute the latency every time. But in practice I have seen many bolts with a capacity well over 1; I have even seen a 35. How can a bolt be running 3500% of the time? If we use the input queue depth instead, that could give us a much better picture and be a lot less expensive.
>
> For the second one we have talked about adding in more latency measurements (serialization/deserialization time, etc.), but I think we will end up playing whack-a-mole unless we take a holistic approach where we don't just measure small parts of the system, but measure latency from point A to point B and from point B to point C, so there are no gaps.
>
> I just don't think we want to have a simple 1-to-1 translation.
> - Bobby
>
> On Monday, May 23, 2016 5:05 AM, Jungtaek Lim <kabh...@gmail.com> wrote:
>
> Thanks for following up, Abhishek. I agree with you about the consensus. I think you already listed most of my / the community's requirements. I'm adding a few things I have in mind:
> 7. Aggregation at stream level (STORM-1719), and machine level
> 8. A way to subscribe to cluster metrics (STORM-1723)
> 9. Counter stats as non-sampled if it doesn't hurt performance
> 10. More metrics like serialization/deserialization latency, queue status
> I'd also like to see Spout offset information if the Spout implementation can provide it, but that's just my 2 cents.
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Mon, May 23, 2016 at 6:33 PM, Abhishek Agarwal <abhishc...@gmail.com> wrote:
>
> So I am assuming that there is a general consensus on adding a new API for metrics and gradually phasing out the old one. If yes, maybe we can work toward the finer details of how to maintain two APIs as well as the design of the new API.
> Jungtaek, it would be better to summarize the requirements and let others add what they feel is missing. Some asks I have seen:
> 1. Aggregation at component level (Average, Sum, etc.)
> 2. Blacklist/whitelist
> 3. Allow only numbers for values
> 4. Efficient routing of built-in metrics to the UI (currently they get tagged along with executor heartbeats, which puts pressure on ZooKeeper)
> 5. Worker/JVM level metrics which are not owned by a particular component
> 6. Percentiles for latency metrics such as p99, p95, etc.
> Not all of them may be considered. Please add anything I might have missed.
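To make the queue-depth and percentile points above concrete (Bobby's capacity question, and item 6 in the list), here is a minimal Codahale-based sketch; the metric names and the queue are stand-ins for illustration, not Storm's real internals:

    import com.codahale.metrics.Gauge;
    import com.codahale.metrics.MetricRegistry;
    import com.codahale.metrics.Snapshot;
    import com.codahale.metrics.Timer;

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class BoltMetricsSketch {
        public static void main(String[] args) {
            MetricRegistry registry = new MetricRegistry();
            BlockingQueue<Object> receiveQueue = new LinkedBlockingQueue<>(1024);

            // 1) Load: queue depth is a single O(1) read, no sampling required.
            Gauge<Integer> queueDepth = receiveQueue::size;
            registry.register("executor.receive.queue.depth", queueDepth);

            // 2) Latency contribution: time each execute() call and read
            //    percentiles (p95/p99) from the snapshot.
            Timer executeLatency = registry.timer("bolt.execute.latency");
            try (Timer.Context ignored = executeLatency.time()) {
                // bolt.execute(tuple) would run here
            }

            Snapshot snap = executeLatency.getSnapshot();
            System.out.printf("p95=%.2f ms, p99=%.2f ms%n",
                    snap.get95thPercentile() / 1_000_000.0,   // snapshot values are in nanoseconds
                    snap.get99thPercentile() / 1_000_000.0);
        }
    }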
> On Fri, May 20, 2016 at 5:57 AM, Jungtaek Lim <kabh...@gmail.com> wrote:
>
> > Personally I'm also in favor of maintaining the old API (but deprecated) and adding a new API. It's the ideal way; it's what many projects try to do, including the other project I'm also maintaining.
> >
> > And I also prefer that the current metrics feature goes away in the next major release. In general, before each major release we should discuss which classes/features to drop. I think we forgot this when we released 1.0.0, and the deprecated things are all still alive.
> >
> > I'm also in favor of dropping the support for a custom frequency for each metric. I guess we don't need to worry about the mapping since Storm's internal metrics will be moved to the new metrics feature. While some behavioral changes could occur (for example, IMetricsConsumer could be changed to no longer receive Storm's built-in metrics), it will not break backward compatibility from the API side anyway.
> >
> > Thanks,
> > Jungtaek Lim (HeartSaVioR)
> >
> > On Fri, May 20, 2016 at 12:57 AM, Abhishek Agarwal <abhishc...@gmail.com> wrote:
> >
> > > Sounds good. Having two separate metric reporters may be confusing, but it is better than breaking the client code.
> > >
> > > The Codahale library allows users to specify the frequency per reporter instance. Storm, on the other hand, allows a different reporting frequency for each metric. How will that mapping work? I am OK with dropping the support for a custom frequency for each metric. Internal metrics in Storm use the same reporting frequency anyway.
> > >
> > > On Thu, May 19, 2016 at 9:04 PM, Bobby Evans <ev...@yahoo-inc.com.invalid> wrote:
> > >
> > > > I personally would like to see that change happen differently for the two branches.
> > > > On 1.x we add a new API for both reporting and collecting metrics, in parallel to the old API. We leave IMetric and IMetricsConsumer in place but deprecated. As we move internal metrics over from the old interface to the new one, we either keep versions of the old ones in place or we provide a translation shim going from the new to the old.
> > > >
> > > > In 2.x either the old way is gone completely or it is off by default. I prefer gone completely.
> > > >
> > > > If we go off of dropwizard/codahale metrics, or a layer around them like was discussed previously, it seems fairly straightforward to take some of our current metrics that all trigger at the same interval and set up a reporter that can translate them into the format that was reported previously.
> > > > In 1.x, to get a full picture of what is happening in your topology you may need two separate reporters, one for the new metrics and one for the old, but it should only be for a short period of time. - Bobby
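On the frequency question and the two-reporters transition above: Codahale fixes the reporting interval per reporter rather than per metric, so the 1.x picture would roughly be two reporters on their own schedules over one registry. A sketch, with arbitrary reporter choices and intervals:

    import com.codahale.metrics.ConsoleReporter;
    import com.codahale.metrics.MetricRegistry;
    import com.codahale.metrics.Slf4jReporter;
    import org.slf4j.LoggerFactory;

    import java.util.concurrent.TimeUnit;

    public class DualReporterSketch {
        public static void main(String[] args) {
            MetricRegistry registry = new MetricRegistry();

            // "New" path: everything in the registry reported every 10 seconds.
            ConsoleReporter newPath = ConsoleReporter.forRegistry(registry)
                    .convertRatesTo(TimeUnit.SECONDS)
                    .convertDurationsTo(TimeUnit.MILLISECONDS)
                    .build();
            newPath.start(10, TimeUnit.SECONDS);

            // "Old" path during the 1.x transition: a second reporter on its own
            // schedule; in practice this could be a custom ScheduledReporter that
            // translates registry contents into the previously reported format.
            Slf4jReporter oldPath = Slf4jReporter.forRegistry(registry)
                    .outputTo(LoggerFactory.getLogger("metrics.legacy"))
                    .build();
            oldPath.start(60, TimeUnit.SECONDS);
        }
    }

Per-metric custom frequencies don't map onto this model directly; the closest equivalents would probably be separate registries/reporters per interval, or a reporter-level filter, which is consistent with dropping per-metric frequency support.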
> > > > On Thursday, May 19, 2016 1:00 AM, Cody Innowhere <e.neve...@gmail.com> wrote:
> > > >
> > > > If we want to refactor the metrics system, I think we may have to incur breaking changes. We can make it backward compatible, but that means we may need to build an adapter layer on top of metrics, or a lot of "if...else..." which might be ugly; either way, it might be a pain to maintain the code.
> > > > So I prefer making breaking changes if we want to build a new metrics system. I'm OK with moving the JStorm metrics migration phase forward to 1.x, and I'm happy to share our design & experiences.
> > > >
> > > > On Thu, May 19, 2016 at 11:12 AM, Jungtaek Lim <kabh...@gmail.com> wrote:
> > > >
> > > > > Hi devs,
> > > > >
> > > > > I'd like to hear your opinions on breaking changes to metrics for 1.x.
> > > > >
> > > > > Some background here:
> > > > >
> > > > > - As you may have seen, I'm trying to address some places to improve metrics without breaking backward compatibility, but it's limited due to the IMetric interface, which is exposed to the public.
> > > > > - We're working on Storm 2.0.0, and evaluation/adoption of JStorm's metrics feature is planned for phase 2, but we don't know the estimated release date, and I feel it's not in the near future.
> > > > > - We've just released Storm 1.0.x, so I expect the lifetime of Storm 1.0 (even 0.9) to be months or even years.
> > > > >
> > > > > If someone wants to know exactly which aspects of the current metrics feature matter, please let me know and I will summarize.
> > > > >
> > > > > I have other ideas in mind to relieve some problems with the current metrics, so I'm also OK with postponing the metrics renewal to 2.0.0 and applying those workaround ideas. But if we're willing to address metrics in 1.x, IMO we can consider breaking backward compatibility from 1.x for once.
> > > > >
> > > > > What do you think?
> > > > >
> > > > > Thanks,
> > > > > Jungtaek Lim (HeartSaVioR)
> > >
> > > --
> > > Regards,
> > > Abhishek Agarwal
>
> --
> Regards,
> Abhishek Agarwal
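Finally, on the adapter-layer / translation-shim idea discussed above, a purely hypothetical bridge that exposes a Codahale Counter through the existing IMetric interface, so old IMetricsConsumer plumbing could keep working during a deprecation window (the class name and the delta-since-last-poll semantics are assumptions, not a proposed design):

    import com.codahale.metrics.Counter;
    import org.apache.storm.metric.api.IMetric;

    // Hypothetical bridge: a Codahale Counter surfaced through the old
    // IMetric API while both systems coexist in 1.x.
    public class CounterToIMetricAdapter implements IMetric {
        private final Counter counter;
        private long lastSeen = 0L;

        public CounterToIMetricAdapter(Counter counter) {
            this.counter = counter;
        }

        @Override
        public Object getValueAndReset() {
            // Codahale counters are not reset by readers, so report the delta
            // since the previous poll instead of resetting the counter itself.
            long current = counter.getCount();
            long delta = current - lastSeen;
            lastSeen = current;
            return delta;
        }
    }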