I am working on adding queue depth metrics. https://github.com/apache/storm/pull/1406
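For context, the queue-depth idea discussed further down the thread boils down to something like the sketch below: expose the population of an executor's receive queue as a cheap gauge instead of relying on sampled execute latency. This is only an illustration on top of a Codahale MetricRegistry; the class, the metric names, and the plain BlockingQueue stand-in are illustrative and not taken from PR 1406.

import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/**
 * Illustrative sketch only: exposes the depth of an executor's receive
 * queue as a gauge. The queue handle here is a plain BlockingQueue; the
 * real receive queue in Storm is a different type, so treat this as the
 * shape of the idea, not the actual implementation in the PR.
 */
public class QueueDepthMetrics {

    public static void register(MetricRegistry registry,
                                String componentId,
                                BlockingQueue<?> receiveQueue) {
        // Reading size() once per reporting interval is cheap compared to
        // timing every tuple, which is why queue depth is attractive as a
        // load signal.
        registry.register(MetricRegistry.name("queue", componentId, "depth"),
                          (Gauge<Integer>) receiveQueue::size);
    }

    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();
        BlockingQueue<Object> queue = new LinkedBlockingQueue<>();
        register(registry, "my-bolt", queue);
        System.out.println(registry.getGauges().keySet()); // [queue.my-bolt.depth]
    }
}

A depth gauge like this can answer the "how overloaded is this bolt" question without the capacity-over-100% artifacts mentioned below, since it never needs to time individual tuples.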
On Mon, May 23, 2016 at 7:21 PM, Cody Innowhere <e.neve...@gmail.com> wrote:

> Taking queue depth is a good idea for measuring the load of components, though there are some scenarios where users want to know the latency of a specific method/processing flow.
> BTW, we've added support to turn metrics on/off dynamically in JStorm (actually we had this feature in JStorm 2.1.1, but there was a bug in the implementation), like metrics filtering. I think this is a pretty practical requirement.
>
> I agree with you that it's not a simple 1-to-1 translation; actually the new metrics requirements look more like refactoring than straightforward improvements.
>
> On Mon, May 23, 2016 at 9:36 PM, Bobby Evans <ev...@yahoo-inc.com.invalid> wrote:
>
> > From JStorm we got the ability to show most of the metrics on the UI, but I think we missed the ability to turn metrics on and off dynamically. This is important because there will be metrics that are expensive to gather, but very important when debugging specific problems.
> > I also think we should spend some time thinking about why we are gathering/computing specific metrics: what is the question we are trying to answer, and is there a better way to answer it?
> > For example, we collect a lot of latency statistics for a bolt to answer a few different questions, but for the most part the latency stats we have are bad at answering either of those questions.
> >
> > 1) How overloaded is a particular bolt? aka capacity.
> > 2) How does this bolt contribute to the overall processing time in a topology?
> >
> > For the first one we subsample because it is expensive to compute the latency every time. But in practice I have seen many bolts with a capacity well over 1; I have even seen a 35. How can a bolt be running 3500% of the time? If we use the input queue depth instead, that could give us a much better picture, and be a lot less expensive.
> >
> > For the second one we have talked about adding in more latency measurements (serialization/deserialization time, etc.), but I think we will end up playing whack-a-mole unless we take a holistic approach where we don't just measure small parts of the system, but measure latency from point A to point B and from point B to point C, so there are no gaps.
> >
> > I just don't think we want to have a simple 1-to-1 translation.
> > - Bobby
> >
> > On Monday, May 23, 2016 5:05 AM, Jungtaek Lim <kabh...@gmail.com> wrote:
> >
> > Thanks for following up, Abhishek. I agree with you about the consensus. I think you already listed most of my / the community's requirements. I'm adding some things I have in mind:
> >
> > 7. Aggregation at stream level (STORM-1719), and at machine level
> > 8. A way to subscribe to cluster metrics (STORM-1723)
> > 9. Counter stats as non-sampled, if it doesn't hurt performance
> > 10. More metrics, such as serialization/deserialization latency and queue status
> >
> > I'd also like to see Spout offset information if the Spout implementation can provide it, but that's just my 2 cents.
> >
> > Thanks,
> > Jungtaek Lim (HeartSaVioR)
> >
> > On Mon, May 23, 2016 at 6:33 PM, Abhishek Agarwal <abhishc...@gmail.com> wrote:
> >
> > So I am assuming that there is a general consensus on adding a new API for metrics and gradually phasing out the old one. If yes, maybe we can work toward finer details of how to maintain the two APIs as well as the design of the new API.
> >
> > Jungtaek, it would be better to summarize the requirements and let others add what they feel is missing. Some asks I have seen:
> >
> > 1. Aggregation at component level (Average, Sum, etc.)
> > 2. Blacklist/whitelist
> > 3. Allow only numbers for values
> > 4. Efficient routing of built-in metrics to the UI (currently they get tagged along with executor heartbeats, which puts pressure on ZooKeeper)
> > 5. Worker/JVM-level metrics which are not owned by a particular component
> > 6. Percentiles for latency metrics, such as p99, p95, etc.
> >
> > Not all of them may be considered. Please add anything I might have missed.
> >
> > On Fri, May 20, 2016 at 5:57 AM, Jungtaek Lim <kabh...@gmail.com> wrote:
> >
> > > Personally I'm also in favor of maintaining the old API (but deprecated) and adding a new API. It's the ideal way, it's what many projects try to do, and it's what I do on the other project I maintain.
> > >
> > > And I also prefer that the current metrics feature be gone in the next major release. In general, before each major release we should discuss which classes/features to drop. I think we forgot this when we released 1.0.0, and the deprecated things are all still alive.
> > >
> > > I'm also in favor of dropping the support for a custom frequency for each metric. I guess we don't need to worry about the mapping, since the internal metrics in Storm will be moved to the new metrics feature. While some behavioral changes could occur (for example, IMetricsConsumer could be changed to no longer receive Storm's built-in metrics), it will not break backward compatibility from the API side anyway.
> > >
> > > Thanks,
> > > Jungtaek Lim (HeartSaVioR)
> > >
> > > On Fri, May 20, 2016 at 12:57 AM, Abhishek Agarwal <abhishc...@gmail.com> wrote:
> > >
> > > > Sounds good. Having two separate metric reporters may be confusing, but it is better than breaking the client code.
> > > >
> > > > The Codahale library allows the user to specify the frequency per reporter instance. Storm, on the other hand, allows a different reporting frequency for each metric. How will that mapping work? I am OK with dropping the support for a custom frequency for each metric. Internal metrics in Storm use the same reporting frequency anyway.
> > > >
> > > > On Thu, May 19, 2016 at 9:04 PM, Bobby Evans <ev...@yahoo-inc.com.invalid> wrote:
> > > >
> > > > > I personally would like to see that change happen differently for the two branches.
> > > > > On 1.x we add in a new API, for both reporting and collecting metrics, in parallel to the old API. We leave IMetric and IMetricsConsumer in place but deprecated. As we move internal metrics over from the old interface to the new one, we either keep versions of the old ones in place or we provide a translation shim going from the new to the old.
> > > > >
> > > > > In 2.x either the old way is gone completely or it is off by default. I prefer gone completely.
> > > > >
> > > > > If we go off of dropwizard/codahale metrics, or a layer around them like was discussed previously, it seems fairly straightforward to take some of our current metrics that all trigger at the same interval and set up a reporter that can translate them into the format that was reported previously.
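To make the translation idea above concrete, one possible shape is a custom Codahale ScheduledReporter that repackages the registry contents on each tick and hands them to the legacy consumer path. The LegacyMetricsConsumer interface below is a hypothetical stand-in for the existing IMetricsConsumer plumbing, so treat this strictly as a sketch of the approach, not the real shim.

import com.codahale.metrics.*;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.concurrent.TimeUnit;

/**
 * Sketch of a reporter that forwards Codahale metrics to a legacy-style
 * consumer. LegacyMetricsConsumer is a hypothetical stand-in for the old
 * IMetricsConsumer path, used here only to show the translation idea.
 */
public class LegacyBridgeReporter extends ScheduledReporter {

    /** Hypothetical callback mimicking the old "name -> value" data points. */
    public interface LegacyMetricsConsumer {
        void handleDataPoints(Map<String, Object> dataPoints);
    }

    private final LegacyMetricsConsumer consumer;

    public LegacyBridgeReporter(MetricRegistry registry, LegacyMetricsConsumer consumer) {
        super(registry, "legacy-bridge", MetricFilter.ALL,
              TimeUnit.SECONDS, TimeUnit.MILLISECONDS);
        this.consumer = consumer;
    }

    @Override
    public void report(SortedMap<String, Gauge> gauges,
                       SortedMap<String, Counter> counters,
                       SortedMap<String, Histogram> histograms,
                       SortedMap<String, Meter> meters,
                       SortedMap<String, Timer> timers) {
        Map<String, Object> dataPoints = new HashMap<>();
        gauges.forEach((name, gauge) -> dataPoints.put(name, gauge.getValue()));
        counters.forEach((name, counter) -> dataPoints.put(name, counter.getCount()));
        // Histograms/meters/timers would be flattened similarly (mean, p99, ...).
        consumer.handleDataPoints(dataPoints);
    }
}

A reporter like this is started with a single fixed period, e.g. reporter.start(60, TimeUnit.SECONDS), which is exactly the per-reporter rather than per-metric frequency that Abhishek asks about above.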
> > > > > In 1.x, to get a full picture of what is happening in your topology, you may need two separate reporters, one for the new metrics and one for the old, but it should only be for a short period of time.
> > > > > - Bobby
> > > > >
> > > > > On Thursday, May 19, 2016 1:00 AM, Cody Innowhere <e.neve...@gmail.com> wrote:
> > > > >
> > > > > If we want to refactor the metrics system, I think we may have to incur breaking changes. We can make it backward compatible, but that means we may have to build an adaptation layer on top of the metrics, or a lot of "if...else..." which might be ugly; either way, it might be a pain to maintain the code.
> > > > > So I prefer making breaking changes if we want to build a new metrics system, I'm OK with moving the JStorm metrics migration phase forward to 1.x, and I'm happy to share our design & experiences.
> > > > >
> > > > > On Thu, May 19, 2016 at 11:12 AM, Jungtaek Lim <kabh...@gmail.com> wrote:
> > > > >
> > > > > > Hi devs,
> > > > > >
> > > > > > I'd like to hear opinions on breaking changes to metrics for 1.x.
> > > > > >
> > > > > > Some background here:
> > > > > >
> > > > > > - As you may have seen, I'm trying to address some places where metrics could be improved without breaking backward compatibility, but it's limited by the IMetric interface, which is exposed to the public.
> > > > > > - We're working on Storm 2.0.0, and evaluation/adoption of the JStorm metrics feature is planned for phase 2, but we don't know the estimated release date, and I feel it's not in the near future.
> > > > > > - We've just released Storm 1.0.x, so I expect the lifetime of Storm 1.0 (even 0.9) to be months or even years.
> > > > > >
> > > > > > If someone wants to know exactly what matters about the current metrics feature, please let me know and I will summarize it.
> > > > > >
> > > > > > I have other ideas in mind to relieve some of the problems with the current metrics, so I'm also OK with postponing the renewal of metrics to 2.0.0 and applying those workaround ideas. But if we're willing to address metrics on 1.x, IMO we can consider breaking backward compatibility from 1.x for once.
> > > > > >
> > > > > > What do you think?
> > > > > >
> > > > > > Thanks,
> > > > > > Jungtaek Lim (HeartSaVioR)
> > > >
> > > > --
> > > > Regards,
> > > > Abhishek Agarwal
> >
> > --
> > Regards,
> > Abhishek Agarwal

--
Regards,
Abhishek Agarwal
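Since turning expensive metrics on and off at runtime (and the blacklist/whitelist ask) comes up several times in the thread, here is one way it could look on top of Codahale: a mutable MetricFilter that the reporter consults on every reporting tick. The prefix scheme and the ConsoleReporter are placeholders for whatever naming and transport Storm would actually use.

import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.Metric;
import com.codahale.metrics.MetricFilter;
import com.codahale.metrics.MetricRegistry;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

/**
 * Sketch of runtime metric filtering: reporters consult the filter on each
 * reporting tick, so adding/removing prefixes here effectively turns
 * metrics on and off without restarting anything.
 */
public class DynamicMetricFilter implements MetricFilter {

    private final Set<String> disabledPrefixes = ConcurrentHashMap.newKeySet();

    @Override
    public boolean matches(String name, Metric metric) {
        return disabledPrefixes.stream().noneMatch(name::startsWith);
    }

    public void disable(String prefix) { disabledPrefixes.add(prefix); }
    public void enable(String prefix)  { disabledPrefixes.remove(prefix); }

    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();
        DynamicMetricFilter filter = new DynamicMetricFilter();

        // Each reporter instance has a single reporting period (the
        // per-reporter rather than per-metric frequency discussed above).
        ConsoleReporter reporter = ConsoleReporter.forRegistry(registry)
                .filter(filter)
                .build();
        reporter.start(10, TimeUnit.SECONDS);

        registry.timer("my-bolt.execute-latency");
        filter.disable("my-bolt.execute-latency"); // expensive: off for now
        filter.enable("my-bolt.execute-latency");  // back on while debugging
    }
}

A side note on requirement 6: a Codahale Timer already exposes p95/p99 through its snapshot, so percentile latencies come essentially for free once latencies are recorded in a Timer or Histogram.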