Re: [DISCUSS] FLIP-33: Standardize connector metrics

Becket Qin Fri, 22 Feb 2019 05:04:00 -0800

Hi Chesnay,

Thanks for the explanation.


** Re: FLIP
I might have misunderstood this, but it seems that "major changes" are well
defined in FLIP. The full contents is following:
What is considered a "major change" that needs a FLIP?

Any of the following should be considered a major change:

   - Any major new feature, subsystem, or piece of functionality
   - *Any change that impacts the public interfaces of the project*

What are the "public interfaces" of the project?



*All of the following are public interfaces *that people build around:

   - DataStream and DataSet API, including classes related to that, such as
   StreamExecutionEnvironment


   - Classes marked with the @Public annotation


   - On-disk binary formats, such as checkpoints/savepoints


   - User-facing scripts/command-line tools, i.e. bin/flink, Yarn scripts,
   Mesos scripts


   - Configuration settings


   - *Exposed monitoring information*


So any monitoring information change is considered as public interface, and
any public interface change is considered as a "major change".


** Re: over complication of implementation.

Although this is more of implementation details that is not covered by the
FLIP. But it may be worth discussing.

First of all, I completely agree that we should use the simplest way to
achieve our goal. To me the goal is the following:
1. Clear connector conventions and interfaces.
2. The easiness of creating a connector.

Both of them are important to the prosperity of the connector ecosystem. So
I'd rather abstract as much as possible on our side to make the connector
developer's work lighter. Given this goal, a static util method approach
might have a few drawbacks:
1. Users still have to construct the metrics by themselves. (And note that
this might be erroneous by itself. For example, a customer wrapper around
dropwizard meter maybe used instead of MeterView).
2. When connector specific metrics are added, it is difficult to enforce
the scope to be the same as standard metrics.
3. It seems that a method proliferation is inevitable if we want to apply
sanity checks. e.g. The metric of numBytesIn was not registered for a meter.
4. Metrics are still defined in random places and hard to track.

The current PR I had was inspired by the Config system in Kafka, which I
found pretty handy. In fact it is not only used by Kafka itself but even
some other projects that depend on Kafka. I am not saying this approach is
perfect. But I think it worths to save the work for connector writers and
encourage more systematic implementation. That being said, I am fully open
to suggestions.


Re: Histogram
I think there are two orthogonal questions around those metrics:

1. Regardless of the metric type, by just looking at the meaning of a
metric, is generic to all connectors? If the answer is yes, we should
include the metric into the convention. No matter whether we include it
into the convention or not, some connector implementations will emit such
metric. It is better to have a convention than letting each connector do
random things.

2. If a standard metric is a histogram, what should we do?
I agree that we should make it clear that using histograms will have
performance risk. But I do see histogram is useful in some fine-granularity
debugging where one do not have the luxury to stop the system and inject
more inspection code. So the workaround I am thinking is to provide some
implementation suggestions. Assume later on we have a mechanism of
selective metrics. In the abstract metrics class we can disable those
metrics by default individual connector writers does not have to do
anything (this is another advantage of having an AbstractMetrics instead of
static util methods.)

I am not sure I fully understand the histogram in the backend approach. Can
you explain a bit more? Do you mean emitting the raw data, e.g. fetchTime
and emitTime with each record and let the histogram computation happen in
the background? Or let the processing thread putting the values into a
queue and have a separate thread polling from the queue and add them into
the histogram?

Thanks,

Jiangjie (Becket) Qin





On Fri, Feb 22, 2019 at 4:34 PM Chesnay Schepler <[email protected]> wrote:

> Re: Flip
> The very first line under both the main header and Purpose section
> describe Flips as "major changes", which this isn't.
>
> Re: complication
> I'm not arguing against standardization, but again an over-complicated
> implementation when a static utility method would be sufficient.
>
> public static void setupConnectorMetrics(
> MetricGroup operatorMetricGroup,
> String connectorName,
> Optional<Gauge<Long>> numRecordsIn,
> ...)
>
> This gives you all you need:
> * a well-defined set of metrics for a connector to opt-in
> * standardized naming schemes for scope and individual metrics
> * standardize metric types (although personally I'm not interested in that
> since metric types should be considered syntactic sugar)
>
> Re: Configurable Histogram
> If anything they _must_ be turned off by default, but the metric system is
> already exposing so many options that I'm not too keen on adding even more.
> You have also only addressed my first argument against histograms
> (performance), the second one still stands (calculate histogram in metric
> backends instead).
>
> On 21.02.2019 16:27, Becket Qin wrote:
> > Hi Chesnay,
> >
> > Thanks for the comments. I think this is worthy of a FLIP because it is
> > public API. According to the FLIP description a FlIP is required in case
> of:
> >
> >     - Any change that impacts the public interfaces of the project
> >
> > and the following entry is found in the definition of "public interface".
> >
> >     - Exposed monitoring information
> >
> > Metrics are critical to any production system. So a clear metric
> definition
> > is important for any serious users. For an organization with large Flink
> > installation, change in metrics means great amount of work. So such
> changes
> > do need to be fully discussed and documented.
> >
> > ** Re: Histogram.
> > We can discuss whether there is a better way to expose metrics that are
> > suitable for histograms. My micro-benchmark on various histogram
> > implementations also indicates that they are significantly slower than
> > other metric types. But I don't think that means never use histogram, but
> > means use it with caution. For example, we can suggest the
> implementations
> > to turn them off by default and only turn it on for a small amount of
> time
> > when performing some micro-debugging.
> >
> > ** Re: complication:
> > Connector conventions are essential for Flink ecosystem. Flink connectors
> > pool is probably the most important part of Flink, just like any other
> data
> > system. Clear conventions of connectors will help build Flink ecosystem
> in
> > a more organic way.
> > Take the metrics convention as an example, imagine someone has developed
> a
> > Flink connector for System foo, and another developer may have developed
> a
> > monitoring and diagnostic framework for Flink which analyzes the Flink
> job
> > performance based on metrics. With a clear metric convention, those two
> > projects could be developed independently.  Once users put them together,
> > it would work without additional modifications. This cannot be easily
> > achieved by just defining a few constants.
> >
> > ** Re: selective metrics:
> > Sure, we can discuss that in a separate thread.
> >
> > @Dawid
> >
> > ** Re: latency / fetchedLatency
> > The primary purpose of establish such a convention is to help developers
> > write connectors in a more compatible way. The convention is supposed to
> be
> > defined more proactively. So when look at the convention, it seems more
> > important to see if the concept is applicable to connectors in general.
> It
> > might be true so far only Kafka connector reports latency. But there
> might
> > be hundreds of other connector implementations in the Flink ecosystem,
> > though not in the Flink repo, and some of them also emits latency. I
> think
> > a lot of other sources actually also has an append timestamp. e.g.
> database
> > bin logs and some K-V stores. So I wouldn't be surprised if some database
> > connector can also emit latency metrics.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> > On Thu, Feb 21, 2019 at 10:14 PM Chesnay Schepler <[email protected]>
> > wrote:
> >
> >> Regarding 2) It doesn't make sense to investigate this as part of this
> >> FLIP. This is something that could be of interest for the entire metric
> >> system, and should be designed for as such.
> >>
> >> Regarding the proposal as a whole:
> >>
> >> Histogram metrics shall not be added to the core of Flink. They are
> >> significantly more expensive than other metrics, and calculating
> >> histograms in the application is regarded as an anti-pattern by several
> >> metric backends, who instead recommend to expose the raw data and
> >> calculate the histogram in the backend.
> >>
> >> Second, this seems overly complicated. Given that we already established
> >> that not all connectors will export all metrics we are effectively
> >> reducing this down to a consistent naming scheme. We don't need anything
> >> sophisticated for that; basically just a few constants that all
> >> connectors use.
> >>
> >> I'm not convinced that this is worthy of a FLIP.
> >>
> >> On 21.02.2019 14:26, Dawid Wysakowicz wrote:
> >>> Hi,
> >>>
> >>> Ad 1. In general I undestand and I agree. But those particular metrics
> >>> (latency, fetchLatency), right now would only be reported if user uses
> >>> KafkaConsumer with internal timestampAssigner with StreamCharacteristic
> >>> set to EventTime, right? That sounds like a very specific case. I am
> not
> >>> sure if we should introduce a generic metric that will be
> >>> disabled/absent for most of implementations.
> >>>
> >>> Ad.2 That sounds like an orthogonal issue, that might make sense to
> >>> investigate in the future.
> >>>
> >>> Best,
> >>>
> >>> Dawid
> >>>
> >>> On 21/02/2019 13:20, Becket Qin wrote:
> >>>> Hi Dawid,
> >>>>
> >>>> Thanks for the feedback. That makes sense to me. There are two cases
> to
> >> be
> >>>> addressed.
> >>>>
> >>>> 1. The metrics are supposed to be a guidance. It is likely that a
> >> connector
> >>>> only supports some but not all of the metrics. In that case, each
> >> connector
> >>>> implementation should have the freedom to decide which metrics are
> >>>> reported. For the metrics that are supported, the guidance should be
> >>>> followed.
> >>>>
> >>>> 2. Sometimes users may want to disable certain metrics for some reason
> >>>> (e.g. performance / reprocessing of data). A generic mechanism should
> be
> >>>> provided to allow user choose which metrics are reported. This
> mechanism
> >>>> should also be honored by the connector implementations.
> >>>>
> >>>> Does this sound reasonable to you?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Jiangjie (Becket) Qin
> >>>>
> >>>>
> >>>>
> >>>> On Thu, Feb 21, 2019 at 4:22 PM Dawid Wysakowicz <
> >> [email protected]>
> >>>> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> Generally I like the idea of having a unified, standard set of
> metrics
> >> for
> >>>>> all connectors. I have some slight concerns about fetchLatency and
> >>>>> latency though. They are computed based on EventTime which is not a
> >> purely
> >>>>> technical feature. It depends often on some business logic, might be
> >> absent
> >>>>> or defined after source. Those metrics could also behave in a weird
> >> way in
> >>>>> case of replaying backlog. Therefore I am not sure if we should
> include
> >>>>> those metrics by default. Maybe we could at least introduce a feature
> >>>>> switch for them? What do you think?
> >>>>>
> >>>>> Best,
> >>>>>
> >>>>> Dawid
> >>>>> On 21/02/2019 03:13, Becket Qin wrote:
> >>>>>
> >>>>> Bump. If there is no objections to the proposed metrics. I'll start a
> >>>>> voting thread later toady.
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Jiangjie (Becket) Qin
> >>>>>
> >>>>> On Mon, Feb 11, 2019 at 8:17 PM Becket Qin <[email protected]> <
> >> [email protected]> wrote:
> >>>>>
> >>>>> Hi folks,
> >>>>>
> >>>>> I would like to start the FLIP discussion thread about standardize
> the
> >>>>> connector metrics.
> >>>>>
> >>>>> In short, we would like to provide a convention of Flink connector
> >>>>> metrics. It will help simplify the monitoring and alerting on Flink
> >> jobs.
> >>>>> The FLIP link is following:
> >>>>>
> >>>>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-33%3A+Standardize+Connector+Metrics
> >>>>> Thanks,
> >>>>>
> >>>>> Jiangjie (Becket) Qin
> >>>>>
> >>>>>
> >>>>>
> >>
>
>

Re: [DISCUSS] FLIP-33: Standardize connector metrics

Reply via email to