Hi Piotr, That's great to know. Chances are that we will need to revisit the convention list and adjust it accordingly when FLIP-27 is ready. At that point we can mark some of the metrics as available by default for connectors implementing the new interface.
Thanks, Jiangjie (Becket) Qin On Thu, Jun 13, 2019 at 3:51 PM Piotr Nowojski <pi...@ververica.com> wrote: > Thanks for driving this. I’ve just noticed one small thing. With new > SourceReader interface Flink will be able to provide `idleTime` metric > automatically. > > Piotrek > > > On 13 Jun 2019, at 03:30, Becket Qin <becket....@gmail.com> wrote: > > > > Thanks all for the feedback and discussion. > > > > Since there wasn't any concern raised, I've started the voting thread for > > this FLIP, but please feel free to continue the discussion here if you > > think something still needs to be addressed. > > > > Thanks, > > > > Jiangjie (Becket) Qin > > > > > > > > On Mon, Jun 10, 2019 at 9:10 AM Becket Qin <becket....@gmail.com> wrote: > > > >> Hi Piotr, > >> > >> Thanks for the comments. Yes, you are right. Users will have to look at > >> other metrics to decide whether the pipeline is healthy or not in the > first > >> place before they can use the time-based metric to fix the bottleneck. > >> > >> I agree that once we have FLIP-27 ready, some of the metrics can just be > >> reported by the abstract implementation. > >> > >> I've updated the FLIP-33 wiki page to add the pendingBytes and > pendingRecords > >> metrics. Please let me know if you have any concern over the updated > metric > >> convention proposal. > >> > >> @Chesnay Schepler <ches...@apache.org> @Stephan Ewen > >> <step...@ververica.com> will you also have time to take a look at the > >> proposed metric convention? If there is no further concern I'll start a > >> voting thread for this FLIP in two days. > >> > >> Thanks, > >> > >> Jiangjie (Becket) Qin > >> > >> > >> > >> On Wed, Jun 5, 2019 at 6:54 PM Piotr Nowojski <pi...@ververica.com> > wrote: > >> > >>> Hi Becket, > >>> > >>> Thanks for the answer :) > >>> > >>>> By time-based metric, I meant the portion of time spent on producing > the > >>>> record to downstream. For example, a source connector can report that > >>> it's > >>>> spending 80% of time to emit records to the downstream processing pipeline. > >>> In > >>>> another case, a sink connector may report that it's spending 30% of > time > >>>> producing the records to the external system. > >>>> > >>>> This is in some sense equivalent to the buffer usage metric: > >>> > >>>> - 80% of time spent on emitting records to downstream ---> downstream > >>>> node is bottleneck ---> output buffer is probably full. > >>>> - 30% of time spent on emitting records to downstream ---> downstream > >>>> node is not bottleneck ---> output buffer is probably not full. > >>> > >>> If by “time spent on emitting records to downstream” you understand > >>> “waiting on back pressure”, then I see your point. And I agree that > some > >>> kind of ratio/time based metric gives you more information. However > under > >>> “time spent on emitting records to downstream” might be hidden the > >>> following (extreme) situation: > >>> > >>> 1. Job is barely able to handle influx of records, there is 99% > >>> CPU/resource usage in the cluster, but nobody is > >>> bottlenecked/backpressured, all output buffers are empty, everybody is > >>> waiting in 1% of its time for more records to process. > >>> 2. 80% of time can still be spent on "down stream operators”, because they > >>> are the CPU intensive operations, but this doesn’t mean that > increasing the > >>> parallelism down the stream will help with anything there. To the > contrary, > >>> increasing parallelism of the source operator might help to increase > >>> resource utilisation up to 100%. 
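For illustration of the time-ratio idea discussed above, a minimal sketch of how a source could expose such a ratio as a Flink Gauge. The class and method names here are invented for this sketch and are not part of the FLIP-33 proposal:

import java.util.concurrent.atomic.AtomicLong;
import org.apache.flink.metrics.Gauge;

// Hypothetical sketch: fraction of wall-clock time spent emitting records,
// reset on every report. A value near 1.0 suggests the downstream is the
// bottleneck; a value near 0.0 suggests it is not.
public class EmitTimeRatioGauge implements Gauge<Double> {
    private final AtomicLong emitNanos = new AtomicLong();
    private volatile long windowStart = System.nanoTime();

    // The source loop would call this around each emit/collect call.
    public void addEmitTime(long nanos) {
        emitNanos.addAndGet(nanos);
    }

    @Override
    public Double getValue() {
        long now = System.nanoTime();
        long window = Math.max(1L, now - windowStart);
        windowStart = now;
        return (double) emitNanos.getAndSet(0L) / window; // e.g. 0.8 = 80%
    }
}

As the extreme situation above notes, a high ratio alone does not distinguish back pressure from genuinely CPU-heavy downstream work.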
> >>> > >>> However, this “time based/ratio” approach can be extended to in/output > >>> buffer usage. Besides collecting the information that the input/output > buffer is > >>> full/empty, we can profile how often buffers are empty/full. If > output > >>> buffer is full 1% of the time, there is almost no back pressure. If it’s > full > >>> 80% of the time, there is some back pressure, if it’s full 99.9% of the time, > >>> there is huge back pressure. > >>> > >>> Now for autoscaling you could compare the input & output buffers fill > >>> ratio: > >>> > >>> 1. Both are high, the source of bottleneck is down the stream > >>> 2. Output is low, input is high, this is the bottleneck and the higher > >>> the difference, the more this is the bottleneck > operator/task > >>> 3. Output is high, input is low - there was some load spike that we are > >>> currently finishing processing > >>> > >>> > >>> But long story short, we are probably diverging from the topic of this > >>> discussion, and we can discuss this at some later point. > >>> > >>> For now, for sources: > >>> > >>> as I wrote before, +1 for: > >>> - pending.bytes, Gauge > >>> - pending.messages, Gauge > >>> > >>> When we develop/discuss the SourceReader from FLIP-27 we > might > >>> then add: > >>> > >>> - in-memory.buffer.usage (0 - 100%) > >>> > >>> This will be estimated automatically by Flink, while the user will be able > to > >>> override it / provide a better estimation. > >>> > >>> Piotrek > >>> > >>>> On 5 Jun 2019, at 05:42, Becket Qin <becket....@gmail.com> wrote: > >>>> > >>>> Hi Piotr, > >>>> > >>>> Thanks for the explanation. Please see some clarifications below. > >>>> > >>>> By time-based metric, I meant the portion of time spent on producing > the > >>>> record to downstream. For example, a source connector can report that > >>> it's > >>>> spending 80% of time to emit records to the downstream processing pipeline. > >>> In > >>>> another case, a sink connector may report that it's spending 30% of > time > >>>> producing the records to the external system. > >>>> > >>>> This is in some sense equivalent to the buffer usage metric: > >>>> - 80% of time spent on emitting records to downstream ---> downstream > >>>> node is bottleneck ---> output buffer is probably full. > >>>> - 30% of time spent on emitting records to downstream ---> downstream > >>>> node is not bottleneck ---> output buffer is probably not full. > >>>> > >>>> However, the time-based metric has a few advantages that the buffer > >>> usage > >>>> metric may not have. > >>>> > >>>> 1. Buffer usage metric may not be applicable to all the connector > >>>> implementations, while reporting time-based metrics is always doable. > >>>> Some source connectors may not have any input buffer, or they may use > >>> some > >>>> third party library that does not expose the input buffer at all. > >>>> Similarly, for sink connectors, the implementation may not have any > >>> output > >>>> buffer, or the third party library does not expose such buffer. > >>>> > >>>> 2. Although both types of metrics can detect bottlenecks, time-based > >>> metrics > >>>> can be used to generate a more informed action to remove the > bottleneck. > >>>> For example, when the downstream is the bottleneck, the output buffer > usage > >>>> metric is likely to be 100%, and the input buffer usage might be 0%. > >>> That > >>>> means we don't know what the suitable parallelism is to lift the > >>>> bottleneck. The time-based metric, on the other hand, would give > useful > >>>> information, e.g. 
if 80% of time was spent on emitting records, we can > >>>> roughly increase the downstream node parallelism by 4 times. > >>>> > >>>> Admittedly, the time-based metrics are more expensive than buffer > >>> usage. So > >>>> we may have to do some sampling to reduce the cost. But in general, > >>> using > >>>> time-based metrics seems worth adding. > >>>> > >>>> That being said, I don't think buffer usage metric and time-based > >>> metrics > >>>> are mutually exclusive. We can probably have both. It is just that in > >>>> practice, features like auto-scaling might prefer time-based metrics > for > >>>> the reason stated above. > >>>> > >>>>> 1. Define the metrics that would allow us to manually detect > >>> bottlenecks. > >>>> As I wrote, we already have them in most of the places, except for > >>>> sources/sinks. > >>>>> 2. Use those metrics to automatically detect bottlenecks. Currently > we > >>>> are only automatically detecting back pressure and reporting it to the > >>> user > >>>> in web UI (is it exposed as a metric at all?). Detecting the root > cause > >>> of > >>>> the back pressure (bottleneck) is one step further. > >>>>> 3. Use the knowledge about where exactly the bottleneck is located, > to > >>>> try to do something with it. > >>>> > >>>> As explained above, I think time-based metrics also address item 1 > and > >>>> item 2. > >>>> > >>>> Any thoughts? > >>>> > >>>> Thanks, > >>>> > >>>> Jiangjie (Becket) Qin > >>>> > >>>> > >>>> > >>>> On Mon, Jun 3, 2019 at 4:14 PM Piotr Nowojski <pi...@ververica.com> > >>> wrote: > >>>> > >>>>> Hi again :) > >>>>> > >>>>>> - pending.bytes, Gauge > >>>>>> - pending.messages, Gauge > >>>>> > >>>>> > >>>>> +1 > >>>>> > >>>>> And true, instead of overloading one of the metrics it is better when > >>> user > >>>>> can choose to provide only one of them. > >>>>> > >>>>> Re 2: > >>>>> > >>>>>> If I understand correctly, this metric along with the pending > messages > >>> / > >>>>>> bytes would answer the questions of: > >>>>> > >>>>>> - Does the connector consume fast enough? Lagging behind + empty > >>> buffer > >>>>> = > >>>>>> cannot consume fast enough. > >>>>>> - Does the connector emit fast enough? Lagging behind + full buffer > = > >>>>>> cannot emit fast enough, i.e. the Flink pipeline is slow. > >>>>> > >>>>> Yes, exactly. This can also be used to support decisions like > changing > >>> the > >>>>> parallelism of the sources and/or down stream operators. > >>>>> > >>>>> I’m not sure if I understand your proposal with time based > >>> measurements. > >>>>> Maybe I’m missing something, but I do not see how measuring time > alone > >>>>> could answer the problem: where is the bottleneck. Time spent on the > >>>>> next/emit might be short or long (depending on how heavy the record is > >>>>> to process) and the source can still be bottlenecked/back pressured or > >>> not. > >>>>> Usually the easiest and the most reliable way to detect > >>> bottlenecks is > >>>>> by checking usage of input & output buffers, since when input buffer > is > >>>>> full while output buffer is empty, that’s the definition of a > >>> bottleneck. > >>>>> Also this is usually very easy and cheap to measure (it works > >>> effectively > >>>>> the same way as the current Flink back pressure monitoring, but more > >>> cleanly, > >>>>> without probing thread’s stack traces). > >>>>> > >>>>> Also keep in mind that we are already using the buffer usage metrics > >>> for > >>>>> detecting the bottlenecks in Flink’s internal network exchanges > (manual > >>>>> work). 
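A sketch of the input/output fill-ratio comparison described earlier in this mail; the 0.8 threshold is an arbitrary example, not something proposed in the thread:

// Illustrative only: locating a bottleneck from input/output buffer
// usage in the [0.0, 1.0] range, following the three cases above.
static String locateBottleneck(double inputUsage, double outputUsage) {
    boolean inFull = inputUsage > 0.8;   // arbitrary example threshold
    boolean outFull = outputUsage > 0.8;
    if (inFull && outFull) {
        return "bottleneck is further down the stream";
    } else if (inFull) {
        return "this operator/task is the bottleneck";
    } else if (outFull) {
        return "finishing a load spike";
    } else {
        return "no back pressure";
    }
}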
That’s the reason why I wanted to extend this to > sources/sinks, > >>>>> since they are currently our blind spot. > >>>>> > >>>>>> One feature we are currently working on to scale Flink automatically > >>>>> relies > >>>>>> on some metrics answering the same question > >>>>> > >>>>> That would be a very helpful feature. I think in order to achieve that > we > >>>>> would need to: > >>>>> 1. Define the metrics that would allow us to manually detect > >>> bottlenecks. > >>>>> As I wrote, we already have them in most of the places, except for > >>>>> sources/sinks. > >>>>> 2. Use those metrics to automatically detect bottlenecks. Currently > we > >>>>> are only automatically detecting back pressure and reporting it to > the > >>> user > >>>>> in web UI (is it exposed as a metric at all?). Detecting the root > >>> cause of > >>>>> the back pressure (bottleneck) is one step further. > >>>>> 3. Use the knowledge about where exactly the bottleneck is located, > to > >>> try > >>>>> to do something with it. > >>>>> > >>>>> I think you are aiming for point 3., but before we reach it, we are > >>> still > >>>>> missing 1. & 2. Also even if we have 3., there is a value in 1 & 2 > for > >>>>> manual analysis/dashboards. > >>>>> > >>>>> However, having the knowledge of where the bottleneck is, doesn’t > >>>>> necessarily mean that we know what we can do about it. For example > >>>>> increasing parallelism might or might not help with anything (data > >>> skew, > >>>>> bottleneck on some resource that does not scale), but this remark > >>> applies > >>>>> always, regardless of how we detected the bottleneck. > >>>>> > >>>>> Piotrek > >>>>> > >>>>>> On 3 Jun 2019, at 06:16, Becket Qin <becket....@gmail.com> wrote: > >>>>>> > >>>>>> Hi Piotr, > >>>>>> > >>>>>> Thanks for the suggestion. Some thoughts below: > >>>>>> > >>>>>> Re 1: The pending messages / bytes. > >>>>>> I completely agree these are very useful metrics and we should > expect > >>> the > >>>>>> connector to report. WRT the way to expose them, it seems more > >>> consistent > >>>>>> to add two metrics instead of adding a method (unless there are > other > >>> use > >>>>>> cases besides metric reporting). So we can add the following two > >>>>> metrics. > >>>>>> - pending.bytes, Gauge > >>>>>> - pending.messages, Gauge > >>>>>> Applicable connectors can choose to report them. These two metrics > >>> along > >>>>>> with latency should be sufficient for users to understand the > progress > >>>>> of a > >>>>>> connector. > >>>>>> > >>>>>> > >>>>>> Re 2: Number of buffered data in-memory of the connector > >>>>>> If I understand correctly, this metric along with the pending > messages > >>> / > >>>>>> bytes would answer the questions of: > >>>>>> - Does the connector consume fast enough? Lagging behind + empty > >>> buffer > >>>>> = > >>>>>> cannot consume fast enough. > >>>>>> - Does the connector emit fast enough? Lagging behind + full buffer > = > >>>>>> cannot emit fast enough, i.e. the Flink pipeline is slow. > >>>>>> > >>>>>> One feature we are currently working on to scale Flink automatically > >>>>> relies > >>>>>> on some metrics answering the same question, more specifically, we > are > >>>>>> profiling the time spent on .next() method (time to consume) and the > >>> time > >>>>>> spent on .collect() method (time to emit / process). One advantage > of > >>>>> such > >>>>>> method-level time cost is that it allows us to calculate the parallelism > >>> required to > >>>>>> keep up in case there is a lag. 
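The arithmetic behind that estimate could look roughly like the following; this is an illustration of the idea, not a formula from the FLIP:

// Illustration: if a source spends emitRatio of its time blocked in
// .collect() and (1 - emitRatio) in .next(), balancing the two stages
// suggests scaling the downstream parallelism by emitRatio / (1 - emitRatio),
// e.g. 0.8 / 0.2 = 4x for the 80% example mentioned earlier in this thread.
static double downstreamScaleFactor(double emitRatio) {
    return emitRatio / (1.0 - emitRatio);
}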
> >>>>>> > >>>>>> However, one concern I have regarding such metrics is that they are > >>>>>> implementation specific. Either profiling on the method time, or > >>>>> reporting > >>>>>> buffer usage assumes the connectors are implemented in such a way. A > >>>>>> slightly better solution might be to have the following metric: > >>>>>> > >>>>>> - EmitTimeRatio (or FetchTimeRatio): The time spent on emitting > >>>>>> records / Total time elapsed. > >>>>>> > >>>>>> This assumes that the source connectors have to emit the records to > >>> the > >>>>>> downstream at some point. The emission may take some time ( e.g. go > >>>>> through > >>>>>> chained operators). And the rest of the time is spent preparing > the > >>>>>> record to emit, including time for consuming and format conversion, > >>> etc. > >>>>>> Ideally, we'd like to see the time spent on record fetch and emit to > >>> be > >>>>>> about the same, so neither is a bottleneck for the other. > >>>>>> > >>>>>> The downside of these time based metrics is additional overhead to > get > >>>>> the > >>>>>> time, therefore sampling might be needed. But in practice I feel > such > >>>>> time > >>>>>> based metrics might be more useful if we want to take action. > >>>>>> > >>>>>> > >>>>>> I think we should absolutely add metrics in (1) to the metric > >>> convention. > >>>>>> We could also add the metrics mentioned in (2) if we reach consensus > >>> on > >>>>>> that. What do you think? > >>>>>> > >>>>>> Thanks, > >>>>>> > >>>>>> Jiangjie (Becket) Qin > >>>>>> > >>>>>> > >>>>>> On Fri, May 31, 2019 at 4:26 PM Piotr Nowojski <pi...@ververica.com > > > >>>>> wrote: > >>>>>> > >>>>>>> Hey Becket, > >>>>>>> > >>>>>>> Re 1a) and 1b) +1 from my side. > >>>>>>> > >>>>>>> I’ve discussed this issue: > >>>>>>> > >>>>>>>>>> > >>>>>>>>>> 2. It would be nice to have metrics, that allow us to check the > >>> cause > >>>>>>> of > >>>>>>>>>> back pressure: > >>>>>>>>>> a) for sources, length of input queue (in bytes? Or boolean > >>>>>>>>>> hasSomething/isEmpty) > >>>>>>>>>> b) for sinks, length of output queue (in bytes? Or boolean > >>>>>>>>>> hasSomething/isEmpty) > >>>>>>> > >>>>>>> With Nico at some length and he also saw the benefits of them. We > >>> also > >>>>>>> have a more concrete proposal for that. > >>>>>>> > >>>>>>> Actually there are two really useful metrics, that we are missing > >>>>>>> currently: > >>>>>>> > >>>>>>> 1. Number of data/records/bytes in the backlog to process. For > >>> example > >>>>>>> remaining number of bytes in unread files. Or pending data in Kafka > >>>>> topics. > >>>>>>> 2. Number of buffered data in-memory of the connector, that are > >>> waiting > >>>>> to > >>>>>>> be processed / pushed to the Flink pipeline. > >>>>>>> > >>>>>>> Re 1: > >>>>>>> This would have to be a metric provided directly by a connector. It > >>>>> could > >>>>>>> be an undefined `int`: > >>>>>>> > >>>>>>> `int backlog` - estimate of pending work. > >>>>>>> > >>>>>>> “Undefined” meaning that it would be up to a connector to decide > >>>>> whether > >>>>>>> it’s measured in bytes, records, pending files or whatever it is > >>>>> possible > >>>>>>> to provide by the connector. This is because I assume not every > >>>>> connector > >>>>>>> can provide an exact number and for some of them it might be > impossible > >>> to > >>>>>>> provide record or byte counts. 
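As an example of such a connector-provided backlog gauge, a hypothetical file source might report its pending work in bytes; `unreadFiles` and the class name are invented for this sketch:

import java.io.File;
import java.util.List;
import org.apache.flink.metrics.Gauge;
import org.apache.flink.metrics.MetricGroup;

// Hypothetical sketch: a file source reporting its backlog. The unit
// (bytes here) is up to the connector, as described above.
public class FileSourceMetrics {
    public static void register(MetricGroup group, List<File> unreadFiles) {
        group.gauge("pending.bytes", (Gauge<Long>) () ->
                unreadFiles.stream().mapToLong(File::length).sum());
    }
}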
> >>>>>>> > >>>>>>> Re 2: > >>>>>>> This metric could be either provided by a connector, or calculated > >>>>> crudely > >>>>>>> by Flink: > >>>>>>> > >>>>>>> `float bufferUsage` - value from [0.0, 1.0] range > >>>>>>> > >>>>>>> Percentage of used in memory buffers, like in Kafka’s handover. > >>>>>>> > >>>>>>> It could be crudely implemented by Flink with FLIP-27 > >>>>>>> SourceReader#isAvailable. If SourceReader is not available, the reported > >>>>>>> `bufferUsage` could be 0.0. If it is available, it could be 1.0. I > >>> think > >>>>>>> this would be a good enough estimation for most of the use cases > >>> (that > >>>>>>> could be overloaded and implemented better if desired). Especially > >>>>> since we > >>>>>>> are reporting only probed values: if probed values are almost > always > >>>>> “1.0”, > >>>>>>> it would mean that we have back pressure. If they are almost > always > >>>>>>> “0.0”, there is probably no back pressure at the sources. > >>>>>>> > >>>>>>> What do you think about this? > >>>>>>> > >>>>>>> Piotrek > >>>>>>> > >>>>>>>> On 30 May 2019, at 11:41, Becket Qin <becket....@gmail.com> > wrote: > >>>>>>>> > >>>>>>>> Hi all, > >>>>>>>> > >>>>>>>> Thanks a lot for all the feedback and comments. I'd like to > continue > >>>>> the > >>>>>>>> discussion on this FLIP. > >>>>>>>> > >>>>>>>> I updated the FLIP-33 wiki to remove all the histogram metrics > from > >>> the > >>>>>>>> first version of metric convention due to the performance concern. > >>> The > >>>>>>> plan > >>>>>>>> is to introduce them later when we have a mechanism to opt in/out > >>>>>>> metrics. > >>>>>>>> At that point, users can decide whether they want to pay the cost > to > >>>>> get > >>>>>>>> the metric or not. > >>>>>>>> > >>>>>>>> As Stephan suggested, for this FLIP, let's first try to agree on > the > >>>>>>> small > >>>>>>>> list of conventional metrics that connectors should follow. > >>>>>>>> Just to be clear, the purpose of the convention is not to enforce > >>> every > >>>>>>>> connector to report all these metrics, but to provide guidance > in > >>>>> case > >>>>>>>> these metrics are reported by some connectors. > >>>>>>>> > >>>>>>>> > >>>>>>>> @ Stephan & Chesnay, > >>>>>>>> > >>>>>>>> Regarding the duplication of `RecordsIn` metric in operator / task > >>>>>>>> IOMetricGroups, from what I understand, for the source operator, it is > >>>>>>> actually > >>>>>>>> the SourceFunction that reports the operator level > >>>>>>>> RecordsIn/RecordsInPerSecond metric. So they are essentially the > >>> same > >>>>>>>> metric in the operator level IOMetricGroup. Similarly for the Sink > >>>>>>>> operator, the operator level RecordsOut/RecordsOutPerSecond > metrics > >>> are > >>>>>>>> also reported by the Sink function. I marked them as existing in > the > >>>>>>>> FLIP-33 wiki page. Please let me know if I misunderstood. > >>>>>>>> > >>>>>>>> Thanks, > >>>>>>>> > >>>>>>>> Jiangjie (Becket) Qin > >>>>>>>> > >>>>>>>> > >>>>>>>> On Thu, May 30, 2019 at 5:16 PM Becket Qin <becket....@gmail.com> > >>>>> wrote: > >>>>>>>> > >>>>>>>>> Hi Piotr, > >>>>>>>>> > >>>>>>>>> Thanks a lot for the feedback. > >>>>>>>>> > >>>>>>>>> 1a) I guess you are referring to the part that "original system > >>>>> specific > >>>>>>>>> metrics should also be reported". The performance impact depends > on > >>>>> the > >>>>>>>>> implementation. An efficient implementation would only record the > >>>>> metric > >>>>>>>>> once, but report it under two different metric names. 
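A sketch of that record-once/report-twice pattern; the Kafka-style alias below is only an example:

import org.apache.flink.metrics.Counter;
import org.apache.flink.metrics.Gauge;
import org.apache.flink.metrics.MetricGroup;

// Sketch: one underlying count, exposed under both the standardized name
// and an original system-specific name. Only inc() runs on the hot path.
public static Counter registerRecordsIn(MetricGroup group) {
    Counter numRecordsIn = group.counter("numRecordsIn");
    group.gauge("records-consumed-total",           // Kafka-style alias
            (Gauge<Long>) numRecordsIn::getCount);  // same underlying value
    return numRecordsIn;
}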
This is > >>>>> unlikely > >>>>>>> to > >>>>>>>>> hurt performance. > >>>>>>>>> > >>>>>>>>> 1b) Yes, I agree that we should avoid adding overhead to the > >>> critical > >>>>>>> path > >>>>>>>>> by all means. This is sometimes a tradeoff, running blindly > without > >>>>> any > >>>>>>>>> metric gives the best performance, but sometimes might be frustrating > >>> when > >>>>>>> we > >>>>>>>>> debug some issues. > >>>>>>>>> > >>>>>>>>> 2. The metrics are indeed very useful. Are they supposed to be > >>>>> reported > >>>>>>> by > >>>>>>>>> the connectors or Flink itself? At this point FLIP-33 is more > >>> focused > >>>>> on > >>>>>>>>> providing guidance to the connector authors on the metrics > >>> reporting. > >>>>>>> That > >>>>>>>>> said, after FLIP-27, I think we should absolutely report these > >>> metrics > >>>>>>> in > >>>>>>>>> the abstract implementation. In any case, the metric conventions > in > >>>>> this > >>>>>>>>> list are expected to evolve over time. > >>>>>>>>> > >>>>>>>>> Thanks, > >>>>>>>>> > >>>>>>>>> Jiangjie (Becket) Qin > >>>>>>>>> > >>>>>>>>> On Tue, May 28, 2019 at 6:24 PM Piotr Nowojski < > >>> pi...@ververica.com> > >>>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>>> Hi, > >>>>>>>>>> > >>>>>>>>>> Thanks for the proposal and driving the effort here Becket :) > I’ve > >>>>> read > >>>>>>>>>> through FLIP-33 [1], and here are a couple of my thoughts. > >>>>>>>>>> > >>>>>>>>>> Big +1 for standardising the metric names between connectors, it > >>> will > >>>>>>>>>> definitely help us and users a lot. > >>>>>>>>>> > >>>>>>>>>> Issues/questions/things to discuss that I’ve thought of: > >>>>>>>>>> > >>>>>>>>>> 1a. If we are about to duplicate some metrics, can this become a > >>>>>>>>>> performance issue? > >>>>>>>>>> 1b. Generally speaking, we should make sure that collecting > those > >>>>>>> metrics > >>>>>>>>>> is as non intrusive as possible, especially as they will need > >>> to be > >>>>>>>>>> updated once per record. (They might be collected more rarely > with > >>>>> some > >>>>>>>>>> overhead, but the hot path of updating it per record will need > to > >>> be > >>>>> as > >>>>>>>>>> quick as possible). That includes both avoiding heavy > computation > >>> on > >>>>>>> the per > >>>>>>>>>> record path (histograms?) and measuring time for time based metrics > >>> (per > >>>>>>>>>> second) (System.currentTimeMillis(), depending on the > >>> implementation, > >>>>> can > >>>>>>>>>> invoke a system call) > >>>>>>>>>> > >>>>>>>>>> 2. It would be nice to have metrics, that allow us to check the > >>> cause > >>>>>>> of > >>>>>>>>>> back pressure: > >>>>>>>>>> a) for sources, length of input queue (in bytes? Or boolean > >>>>>>>>>> hasSomething/isEmpty) > >>>>>>>>>> b) for sinks, length of output queue (in bytes? Or boolean > >>>>>>>>>> hasSomething/isEmpty) > >>>>>>>>>> > >>>>>>>>>> a) is useful in a scenario when we are processing backlog of > >>> records, > >>>>>>> all > >>>>>>>>>> of the internal Flink’s input/output network buffers are empty, > >>> and > >>>>> we > >>>>>>> want > >>>>>>>>>> to check whether the external source system is the bottleneck > >>>>> (source’s > >>>>>>>>>> input queue will be empty), or if the Flink connector is the > >>>>>>> bottleneck > >>>>>>>>>> (source’s input queues will be full). > >>>>>>>>>> b) similar story. 
Backlog of records, but this time all of the > >>>>> internal > >>>>>>>>>> Flink’s input/output network buffers are full, and we want to > check > >>>>>>> whether > >>>>>>>>>> the external sink system is the bottleneck (sink output queues > are > >>>>>>> full), > >>>>>>>>>> or if the Flink connector is the bottleneck (sink’s output > >>> queues > >>>>>>> will be > >>>>>>>>>> empty) > >>>>>>>>>> > >>>>>>>>>> It might sometimes be difficult to provide those metrics, so > they > >>>>> could > >>>>>>>>>> be optional, but if we could provide them, it would be really > >>>>> helpful. > >>>>>>>>>> > >>>>>>>>>> Piotrek > >>>>>>>>>> > >>>>>>>>>> [1] > >>>>>>>>>> > >>>>>>> > >>>>> > >>> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-33:+Standardize+Connector+Metrics > >>>>>>>>>> < > >>>>>>>>>> > >>>>>>> > >>>>> > >>> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-33:+Standardize+Connector+Metrics > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>>> On 24 Apr 2019, at 13:28, Stephan Ewen <se...@apache.org> > wrote: > >>>>>>>>>>> > >>>>>>>>>>> I think this sounds reasonable. > >>>>>>>>>>> > >>>>>>>>>>> Let's keep the "reconfiguration without stopping the job" out > of > >>>>> this, > >>>>>>>>>>> because that would be a super big effort and if we approach > that, > >>>>> then > >>>>>>>>>> in a > >>>>>>>>>>> more generic way rather than specific to connector metrics. > >>>>>>>>>>> > >>>>>>>>>>> I would suggest to look at the following things before starting > >>> with > >>>>>>> any > >>>>>>>>>>> implementation work: > >>>>>>>>>>> > >>>>>>>>>>> - Try and find a committer to support this, otherwise it will > be > >>>>> hard > >>>>>>>>>> to > >>>>>>>>>>> make progress > >>>>>>>>>>> - Start with defining a smaller set of "core metrics" and > extend > >>> the > >>>>>>>>>> set > >>>>>>>>>>> later. I think that is easier than now blocking on reaching > >>>>> consensus > >>>>>>>>>> on a > >>>>>>>>>>> large group of metrics. > >>>>>>>>>>> - Find a solution to the problem Chesnay mentioned, that the > >>>>> "records > >>>>>>>>>> in" > >>>>>>>>>>> metric is somehow overloaded and exists already in the IO > Metric > >>>>>>> group. > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> On Mon, Mar 25, 2019 at 7:16 AM Becket Qin < > becket....@gmail.com > >>>> > >>>>>>>>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>>> Hi Stephan, > >>>>>>>>>>>> > >>>>>>>>>>>> Thanks a lot for the feedback. All makes sense. > >>>>>>>>>>>> > >>>>>>>>>>>> It is a good suggestion to simply have an onRecord(numBytes, > >>>>>>> eventTime) > >>>>>>>>>>>> method for connector writers. It should meet most of the > >>>>>>> requirements, > >>>>>>>>>>>> individual > >>>>>>>>>>>> > >>>>>>>>>>>> The configurable metrics feature is something really useful, > >>>>>>>>>> especially if > >>>>>>>>>>>> we can somehow make it dynamically configurable without > stopping > >>>>> the > >>>>>>>>>> jobs. > >>>>>>>>>>>> It might be better to make it a separate discussion because it > >>> is a > >>>>>>>>>> more > >>>>>>>>>>>> generic feature instead of only for connectors. > >>>>>>>>>>>> > >>>>>>>>>>>> So in order to make some progress, in this FLIP we can limit > the > >>>>>>>>>> discussion > >>>>>>>>>>>> scope to the connector related items: > >>>>>>>>>>>> > >>>>>>>>>>>> - the standard connector metric names and types. 
> >>>>>>>>>>>> - the abstract ConnectorMetricHandler interface > >>>>>>>>>>>> > >>>>>>>>>>>> I'll start a separate thread to discuss other general metric > >>>>> related > >>>>>>>>>>>> enhancement items including: > >>>>>>>>>>>> > >>>>>>>>>>>> - optional metrics > >>>>>>>>>>>> - dynamic metric configuration > >>>>>>>>>>>> - potential combination with rate limiter > >>>>>>>>>>>> > >>>>>>>>>>>> Does this plan sound reasonable? > >>>>>>>>>>>> > >>>>>>>>>>>> Thanks, > >>>>>>>>>>>> > >>>>>>>>>>>> Jiangjie (Becket) Qin > >>>>>>>>>>>> > >>>>>>>>>>>> On Sat, Mar 23, 2019 at 5:53 AM Stephan Ewen < > se...@apache.org> > >>>>>>> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> Ignoring for a moment implementation details, this connector > >>>>> metrics > >>>>>>>>>> work > >>>>>>>>>>>>> is a really good thing to do, in my opinion > >>>>>>>>>>>>> > >>>>>>>>>>>>> The question "oh, my job seems to be doing nothing, I am > >>> looking > >>>>> at > >>>>>>>>>> the > >>>>>>>>>>>> UI > >>>>>>>>>>>>> and the 'records in' value is still zero" is in the top three > >>>>>>> support > >>>>>>>>>>>>> questions I have been asked personally. > >>>>>>>>>>>>> Introspection into "how far is the consumer lagging behind" > >>> (event > >>>>>>>>>> time > >>>>>>>>>>>>> fetch latency) came up many times as well. > >>>>>>>>>>>>> > >>>>>>>>>>>>> So big +1 to solving this problem. > >>>>>>>>>>>>> > >>>>>>>>>>>>> About the exact design - I would try to go for the following > >>>>>>>>>> properties: > >>>>>>>>>>>>> > >>>>>>>>>>>>> - keep complexity out of connectors. Ideally the metrics > handler > >>>>> has > >>>>>>> a > >>>>>>>>>>>>> single onRecord(numBytes, eventTime) method or so, and > >>> everything > >>>>>>>>>> else is > >>>>>>>>>>>>> internal to the handler. That makes it dead simple for the > >>>>>>> connector. > >>>>>>>>>> We > >>>>>>>>>>>>> can also think of an extensive scheme for connector specific > >>>>>>> metrics. > >>>>>>>>>>>>> > >>>>>>>>>>>>> - make it configurable on the job or cluster level which > >>> metrics > >>>>> the > >>>>>>>>>>>>> handler internally creates when that method is invoked. > >>>>>>>>>>>>> > >>>>>>>>>>>>> What do you think? > >>>>>>>>>>>>> > >>>>>>>>>>>>> Best, > >>>>>>>>>>>>> Stephan > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Thu, Mar 21, 2019 at 10:42 AM Chesnay Schepler < > >>>>>>> ches...@apache.org > >>>>>>>>>>> > >>>>>>>>>>>>> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>>> As I said before, I believe this to be over-engineered and > >>> have > >>>>> no > >>>>>>>>>>>>>> interest in this implementation. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> There are conceptual issues like defining a duplicate > >>>>>>>>>>>> numBytesIn(PerSec) > >>>>>>>>>>>>>> metric that already exists for each operator. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On 21.03.2019 06:13, Becket Qin wrote: > >>>>>>>>>>>>>>> A few updates to the thread. I uploaded a patch[1] as a > >>> complete > >>>>>>>>>>>>>>> example of how users can use the metrics in the future. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Some thoughts below after taking a look at the > >>>>> AbstractMetricGroup > >>>>>>>>>>>> and > >>>>>>>>>>>>>>> its subclasses. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> This patch intends to provide convenience for Flink > connector > >>>>>>>>>>>>>>> implementations to follow metrics standards proposed in > >>> FLIP-33. > >>>>>>> It > >>>>>>>>>>>>>>> also tries to enhance the metric management in a general way to > >>> help > >>>>>>>>>>>> users > >>>>>>>>>>>>>>> with: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> 1. metric definition > >>>>>>>>>>>>>>> 2. 
metric dependencies check > >>>>>>>>>>>>>> 3. metric validation > >>>>>>>>>>>>>>> 4. metric control (turn on / off particular metrics) > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> This patch wraps |MetricGroup| to extend the functionality > of > >>>>>>>>>>>>>>> |AbstractMetricGroup| and its subclasses. The > >>>>>>>>>>>>>>> |AbstractMetricGroup| mainly focuses on the metric group > >>>>> hierarchy, > >>>>>>>>>> but > >>>>>>>>>>>>>>> does not really manage the metrics other than keeping them > >>> in a > >>>>>>> Map. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Ideally we should only have one entry point for the > metrics. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Right now the entry point is |AbstractMetricGroup|. > However, > >>>>>>> besides > >>>>>>>>>>>>>>> the missing functionality mentioned above, > >>> |AbstractMetricGroup| > >>>>>>>>>>>> seems > >>>>>>>>>>>>>>> deeply rooted in Flink runtime. We could extract it out to > >>>>>>>>>>>>>>> flink-metrics in order to use it for generic purposes. There > >>> will > >>>>>>> be > >>>>>>>>>>>>>>> some work, though. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Another approach is to make |AbstractMetrics| in this patch > >>> > >>>>> the > >>>>>>>>>>>>>>> metric entry point. It wraps the metric group and provides the > >>>>> missing > >>>>>>>>>>>>>>> functionalities. Then we can roll out this pattern to > runtime > >>>>>>>>>>>>>>> components gradually as well. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> My first thought is that the latter approach gives a > >>> smoother > >>>>>>>>>>>>>>> migration. But I am also OK with doing a refactoring on the > >>>>>>>>>>>>>>> |AbstractMetricGroup| family. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Jiangjie (Becket) Qin > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> [1] https://github.com/becketqin/flink/pull/1 > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> On Mon, Feb 25, 2019 at 2:32 PM Becket Qin < > >>>>> becket....@gmail.com > >>>>>>>>>>>>>>> <mailto:becket....@gmail.com>> wrote: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Hi Chesnay, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> It might be easier to discuss some implementation details > in > >>>>> the > >>>>>>>>>>>>>>> PR review instead of in the FLIP discussion thread. I have > a > >>>>>>>>>>>> patch > >>>>>>>>>>>>>>> for Kafka connectors ready but haven't submitted the PR > yet. > >>>>>>>>>>>>>>> Hopefully that will help explain a bit more. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> ** Re: metric type binding > >>>>>>>>>>>>>>> This is a valid point that is worth discussing. If I > understand > >>>>>>>>>>>>>>> correctly, there are two points: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> 1. Metric type / interface does not matter as long as the > >>>>> metric > >>>>>>>>>>>>>>> semantic is clearly defined. > >>>>>>>>>>>>>>> Conceptually speaking, I agree that as long as the metric > >>>>>>>>>>>> semantic > >>>>>>>>>>>>>>> is defined, metric type does not matter. To some extent, > >>> Gauge > >>>>> / > >>>>>>>>>>>>>>> Counter / Meter / Histogram themselves can be thought of as > >>> some > >>>>>>>>>>>>>>> well-recognized semantics, if you wish. In Flink, these > >>> metric > >>>>>>>>>>>>>>> semantics have their associated interface classes. In > >>> practice, > >>>>>>>>>>>>>>> such semantic to interface binding seems necessary for > >>>>> different > >>>>>>>>>>>>>>> components to communicate. Simply standardizing the semantics > >>> of > >>>>>>>>>>>> the > >>>>>>>>>>>>>>> connector metrics seems insufficient for people to build > >>>>>>>>>>>>>>> an ecosystem on top of. 
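For instance, a metric consumer programming against such a convention might handle the Gauge/Counter variation discussed below with a small shim; this is purely illustrative:

import org.apache.flink.metrics.Counter;
import org.apache.flink.metrics.Gauge;
import org.apache.flink.metrics.Metric;

// Illustrative consumer-side shim: read a count regardless of whether the
// connector registered it as a Counter or as a Gauge.
static long readCount(Metric metric) {
    if (metric instanceof Counter) {
        return ((Counter) metric).getCount();
    }
    if (metric instanceof Gauge) {
        Object value = ((Gauge<?>) metric).getValue();
        if (value instanceof Number) {
            return ((Number) value).longValue();
        }
    }
    throw new IllegalArgumentException("Not a count-like metric: " + metric);
}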
At the end of the day, we still need > to > >>>>>>> have > >>>>>>>>>>>>>> some embodiment of the metric semantics that people can > >>> program > >>>>>>>>>>>>>> against. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> 2. Sometimes the same metric semantic can be exposed using > >>>>>>>>>>>>>>> different metric types / interfaces. > >>>>>>>>>>>>>>> This is a good point. Counter and Gauge-as-a-Counter are > >>> pretty > >>>>>>>>>>>>>>> much interchangeable. This is more of a trade-off between > the > >>>>>>>>>>>> user > >>>>>>>>>>>>>>> experience of metric producers and consumers. The metric > >>>>>>>>>>>> producers > >>>>>>>>>>>>>>> want to use Counter or Gauge depending on whether the > counter > >>>>> is > >>>>>>>>>>>>>>> already tracked in code, while ideally the metric consumers > >>>>> only > >>>>>>>>>>>>>>> want to see a single metric type for each metric. I am > >>> leaning > >>>>>>>>>>>>>>> towards making the metric producers happy, i.e. allow > Gauge > >>> / > >>>>>>>>>>>>>>> Counter metric type, and let the metric consumers handle > the > >>>>>>> type > >>>>>>>>>>>>>>> variation. The reason is that in practice, there might be > >>> more > >>>>>>>>>>>>>>> connector implementations than metric reporter > >>> implementations. > >>>>>>>>>>>> We > >>>>>>>>>>>>>>> could also provide some helper methods to facilitate reading > >>>>> from > >>>>>>>>>>>>>>> such variable metric types. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Just some quick replies to the comments around > implementation > >>>>>>>>>>>>>> details. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> 4) single place where metrics are registered except > >>>>>>>>>>>>>>> connector-specific > >>>>>>>>>>>>>>> ones (which we can't really avoid). > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Registering connector specific ones in a single place is > >>> actually > >>>>>>>>>>>>>>> something that I want to achieve. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> 2) I'm talking about time-series databases like > >>> Prometheus. > >>>>>>>>>>>> We > >>>>>>>>>>>>>>> would > >>>>>>>>>>>>>>> only have a gauge metric exposing the last > >>>>>>> fetchTime/emitTime > >>>>>>>>>>>>>>> that is > >>>>>>>>>>>>>>> regularly reported to the backend (Prometheus), where a > >>>>> user > >>>>>>>>>>>>>>> could build > >>>>>>>>>>>>>>> a histogram of his choosing when/if he wants it. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Not sure if such downsampling works. As an example, suppose a > user > >>>>>>>>>>>>>>> complains that there are some intermittent latency spikes > >>>>> (maybe > >>>>>>>>>>>> a > >>>>>>>>>>>>>>> few records in 10 seconds) in their processing system. > >>> Having a > >>>>>>>>>>>>>>> Gauge sampling instantaneous latency seems unlikely to be useful. > >>>>>>>>>>>>>>> However, looking at the actual 99.9 percentile latency might > >>>>> help. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Jiangjie (Becket) Qin > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> On Fri, Feb 22, 2019 at 9:30 PM Chesnay Schepler > >>>>>>>>>>>>>>> <ches...@apache.org <mailto:ches...@apache.org>> wrote: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Re: over complication of implementation. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> I think I now understand better what you're > shooting > >>>>>>>>>>>> for, > >>>>>>>>>>>>>>> effectively something like the OperatorIOMetricGroup. 
But still, re-define setupConnectorMetrics() to accept a > >>>>> set > >>>>>>>>>>>>>> of flags > >>>>>>>>>>>>>>> for counters/meters (and _possibly_ histograms) along > >>> with a > >>>>>>>>>>>>>>> set of > >>>>>>>>>>>>>>> well-defined Optional<Gauge<?>>, and return the group. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Solves all issues as far as I can tell: > >>>>>>>>>>>>>>> 1) no metrics must be created manually (except Gauges, > >>>>> which > >>>>>>>>>>>>> are > >>>>>>>>>>>>>>> effectively just Suppliers and you can't get around > >>> this), > >>>>>>>>>>>>>>> 2) additional metrics can be registered on the returned > >>>>>>>>>>>> group, > >>>>>>>>>>>>>>> 3) see 1), > >>>>>>>>>>>>>>> 4) single place where metrics are registered except > >>>>>>>>>>>>>>> connector-specific > >>>>>>>>>>>>>>> ones (which we can't really avoid). > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Re: Histogram > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> 1) As an example, whether "numRecordsIn" is exposed as a > >>>>>>>>>>>>>>> Counter or a > >>>>>>>>>>>>>>> Gauge should be irrelevant. So far we're using the > metric > >>>>>>>>>>>> type > >>>>>>>>>>>>>>> that is > >>>>>>>>>>>>>>> the most convenient at exposing a given value. If there > >>> is > >>>>>>>>>>>>>>> some backing > >>>>>>>>>>>>>>> data-structure that we want to expose some data from we > >>>>>>>>>>>>>>> typically opt > >>>>>>>>>>>>>>> for a Gauge, as otherwise we're just mucking around with > >>>>> the > >>>>>>>>>>>>>>> Meter/Counter API to get it to match. Similarly, if we > >>> want > >>>>>>>>>>>> to > >>>>>>>>>>>>>>> count > >>>>>>>>>>>>>>> something but no current count exists we typically add > >>> a > >>>>>>>>>>>>>>> Counter. > >>>>>>>>>>>>>>> That's why attaching semantics to metric types makes > >>> little > >>>>>>>>>>>>>>> sense (but > >>>>>>>>>>>>>>> unfortunately several reporters already do it); for > >>>>>>>>>>>>>>> counters/meters > >>>>>>>>>>>>>>> certainly, but the majority of metrics are gauges. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> 2) I'm talking about time-series databases like > >>> Prometheus. > >>>>>>>>>>>> We > >>>>>>>>>>>>>>> would > >>>>>>>>>>>>>>> only have a gauge metric exposing the last > >>>>>>> fetchTime/emitTime > >>>>>>>>>>>>>>> that is > >>>>>>>>>>>>>>> regularly reported to the backend (Prometheus), where a > >>>>> user > >>>>>>>>>>>>>>> could build > >>>>>>>>>>>>>>> a histogram of his choosing when/if he wants it. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> On 22.02.2019 13:57, Becket Qin wrote: > >>>>>>>>>>>>>>>> Hi Chesnay, > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Thanks for the explanation. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> ** Re: FLIP > >>>>>>>>>>>>>>>> I might have misunderstood this, but it seems that "major > >>>>>>>>>>>>>>> changes" are well > >>>>>>>>>>>>>>>> defined in FLIP. The full content is as follows: > >>>>>>>>>>>>>>>> What is considered a "major change" that needs a FLIP? > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Any of the following should be considered a major change: > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> - Any major new feature, subsystem, or piece of > >>>>>>>>>>>>>>> functionality > >>>>>>>>>>>>>>>> - *Any change that impacts the public interfaces of the > >>>>>>>>>>>>>>> project* > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> What are the "public interfaces" of the project? 
> >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> *All of the following are public interfaces *that people > >>>>>>>>>>>>>>> build around: > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> - DataStream and DataSet API, including classes related > >>>>>>>>>>>>>>> to that, such as > >>>>>>>>>>>>>>>> StreamExecutionEnvironment > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> - Classes marked with the @Public annotation > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> - On-disk binary formats, such as > >>>>>>>>>>>> checkpoints/savepoints > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> - User-facing scripts/command-line tools, i.e. > >>>>>>>>>>>>>>> bin/flink, Yarn scripts, > >>>>>>>>>>>>>>>> Mesos scripts > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> - Configuration settings > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> - *Exposed monitoring information* > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> So any monitoring information change is considered a > >>>>>>>>>>>> public > >>>>>>>>>>>>>>> interface change, and > >>>>>>>>>>>>>>>> any public interface change is considered a "major > >>>>>>>>>>>>> change". > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> ** Re: over complication of implementation. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Although this is more of an implementation detail that is > not > >>>>>>>>>>>>>>> covered by the > >>>>>>>>>>>>>>>> FLIP, it may be worth discussing. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> First of all, I completely agree that we should use the > >>>>>>>>>>>>>>> simplest way to > >>>>>>>>>>>>>>>> achieve our goal. To me the goal is the following: > >>>>>>>>>>>>>>>> 1. Clear connector conventions and interfaces. > >>>>>>>>>>>>>>>> 2. The easiness of creating a connector. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Both of them are important to the prosperity of the > >>>>>>>>>>>>>>> connector ecosystem. So > >>>>>>>>>>>>>>>> I'd rather abstract as much as possible on our side to > make > >>>>>>>>>>>>>>> the connector > >>>>>>>>>>>>>>>> developer's work lighter. Given this goal, a static util > >>>>>>>>>>>>>>> method approach > >>>>>>>>>>>>>>>> might have a few drawbacks: > >>>>>>>>>>>>>>>> 1. Users still have to construct the metrics by > themselves. > >>>>>>>>>>>>>>> (And note that > >>>>>>>>>>>>>>>> this might be erroneous by itself. For example, a custom > >>>>>>>>>>>>>>> wrapper around a > >>>>>>>>>>>>>>>> Dropwizard meter may be used instead of MeterView). > >>>>>>>>>>>>>>>> 2. When connector specific metrics are added, it is > >>>>>>>>>>>>>>> difficult to enforce > >>>>>>>>>>>>>>>> the scope to be the same as standard metrics. > >>>>>>>>>>>>>>>> 3. It seems that method proliferation is inevitable if > we > >>>>>>>>>>>>>>> want to apply > >>>>>>>>>>>>>>>> sanity checks, e.g. the numBytesIn metric was not > >>>>>>>>>>>>>>> registered as a meter. > >>>>>>>>>>>>>>>> 4. Metrics are still defined in random places and hard to > >>>>>>>>>>>>>> track. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> The current PR I had was inspired by the Config system in > >>>>>>>>>>>>>>> Kafka, which I > >>>>>>>>>>>>>>>> found pretty handy. In fact it is not only used by Kafka > >>>>>>>>>>>>>>> itself but even by > >>>>>>>>>>>>>>>> some other projects that depend on Kafka. I am not saying > >>>>>>>>>>>>>>> this approach is > >>>>>>>>>>>>>>>> perfect. But I think it is worth saving the work for > >>>>>>>>>>>>>>> connector writers and > >>>>>>>>>>>>>>>> encouraging a more systematic implementation. 
That being said, > >>>>>>>>>>>> I > >>>>>>>>>>>>>> am fully open > >>>>>>>>>>>>>> to suggestions. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Re: Histogram > >>>>>>>>>>>>>>>> I think there are two orthogonal questions around those > >>>>>>>>>>>>>> metrics: > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> 1. Regardless of the metric type, by just looking at the > >>>>>>>>>>>>>>> meaning of a > >>>>>>>>>>>>>>>> metric, is it generic to all connectors? If the answer is > yes, > >>>>>>>>>>>>>>> we should > >>>>>>>>>>>>>>>> include the metric into the convention. No matter whether > >>>>>>>>>>>> we > >>>>>>>>>>>>>>> include it > >>>>>>>>>>>>>>>> into the convention or not, some connector implementations > >>>>>>>>>>>>>>> will emit such a > >>>>>>>>>>>>>>>> metric. It is better to have a convention than letting > each > >>>>>>>>>>>>>>> connector do > >>>>>>>>>>>>>>>> random things. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> 2. If a standard metric is a histogram, what should we do? > >>>>>>>>>>>>>>>> I agree that we should make it clear that using histograms > >>>>>>>>>>>>>>> will have a > >>>>>>>>>>>>>>>> performance risk. But I do see histograms are useful in some > >>>>>>>>>>>>>>> fine-granularity > >>>>>>>>>>>>>>>> debugging where one does not have the luxury to stop the > >>>>>>>>>>>>>>> system and inject > >>>>>>>>>>>>>>>> more inspection code. So the workaround I am thinking is > to > >>>>>>>>>>>>>>> provide some > >>>>>>>>>>>>>>>> implementation suggestions. Assume later on we have a > >>>>>>>>>>>>>>> mechanism of > >>>>>>>>>>>>>>>> selective metrics. In the abstract metrics class we can > >>>>>>>>>>>>>>> disable those > >>>>>>>>>>>>>>>> metrics by default so individual connector writers do not > >>>>>>>>>>>>>>> have to do > >>>>>>>>>>>>>>>> anything (this is another advantage of having an > >>>>>>>>>>>>>>> AbstractMetrics instead of > >>>>>>>>>>>>>>>> static util methods.) > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> I am not sure I fully understand the histogram in the > >>>>>>>>>>>>>>> backend approach. Can > >>>>>>>>>>>>>>>> you explain a bit more? Do you mean emitting the raw data, > >>>>>>>>>>>>>>> e.g. fetchTime > >>>>>>>>>>>>>>>> and emitTime with each record and letting the histogram > >>>>>>>>>>>>>>> computation happen in > >>>>>>>>>>>>>>>> the background? Or letting the processing thread put the > >>>>>>>>>>>>>>> values into a > >>>>>>>>>>>>>>>> queue and having a separate thread poll from the queue > and > >>>>>>>>>>>>>>> add them to > >>>>>>>>>>>>>>>> the histogram? > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Jiangjie (Becket) Qin > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> On Fri, Feb 22, 2019 at 4:34 PM Chesnay Schepler > >>>>>>>>>>>>>>> <ches...@apache.org <mailto:ches...@apache.org>> wrote: > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Re: FLIP > >>>>>>>>>>>>>>>>> The very first line under both the main header and > Purpose > >>>>>>>>>>>>>>> section > >>>>>>>>>>>>>>>>> describes FLIPs as "major changes", which this isn't. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Re: complication > >>>>>>>>>>>>>>>>> I'm not arguing against standardization, but again against an > >>>>>>>>>>>>>>> over-complicated > >>>>>>>>>>>>>>>>> implementation when a static utility method would be > >>>>>>>>>>>>>>> sufficient. 
> >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> public static void setupConnectorMetrics( > >>>>>>>>>>>>>>>>> MetricGroup operatorMetricGroup, > >>>>>>>>>>>>>>>>> String connectorName, > >>>>>>>>>>>>>>>>> Optional<Gauge<Long>> numRecordsIn, > >>>>>>>>>>>>>>>>> ...) > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> This gives you all you need: > >>>>>>>>>>>>>>>>> * a well-defined set of metrics for a connector to opt-in > >>>>>>>>>>>>>>>>> * standardized naming schemes for scope and individual > >>>>>>>>>>>>> metrics > >>>>>>>>>>>>>>>>> * standardized metric types (although personally I'm not > >>>>>>>>>>>>>>> interested in that > >>>>>>>>>>>>>>>>> since metric types should be considered syntactic sugar) > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Re: Configurable Histogram > >>>>>>>>>>>>>>>>> If anything they _must_ be turned off by default, but the > >>>>>>>>>>>>>>> metric system is > >>>>>>>>>>>>>>>>> already exposing so many options that I'm not too keen on > >>>>>>>>>>>>>>> adding even more. > >>>>>>>>>>>>>>>>> You have also only addressed my first argument against > >>>>>>>>>>>>>>> histograms > >>>>>>>>>>>>>>>>> (performance), the second one still stands (calculate > >>>>>>>>>>>>>>> histogram in metric > >>>>>>>>>>>>>>>>> backends instead). > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> On 21.02.2019 16:27, Becket Qin wrote: > >>>>>>>>>>>>>>>>>> Hi Chesnay, > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Thanks for the comments. I think this is worthy of a > FLIP > >>>>>>>>>>>>>>> because it is > >>>>>>>>>>>>>>>>>> public API. According to the FLIP description a FLIP is > >>>>>>>>>>>>>>> required in case > >>>>>>>>>>>>>>>>> of: > >>>>>>>>>>>>>>>>>> - Any change that impacts the public interfaces of > >>>>>>>>>>>>>>> the project > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> and the following entry is found in the definition of > >>>>>>>>>>>>>>> "public interface". > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> - Exposed monitoring information > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Metrics are critical to any production system. So a > clear > >>>>>>>>>>>>>>> metric > >>>>>>>>>>>>>>>>> definition > >>>>>>>>>>>>>>>>>> is important for any serious user. For an organization > >>>>>>>>>>>>>>> with a large Flink > >>>>>>>>>>>>>>>>>> installation, a change in metrics means a great amount of > >>>>>>>>>>>>>>> work. So such > >>>>>>>>>>>>>>>>> changes > >>>>>>>>>>>>>>>>>> do need to be fully discussed and documented. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> ** Re: Histogram. > >>>>>>>>>>>>>>>>>> We can discuss whether there is a better way to expose > >>>>>>>>>>>>>>> metrics that are > >>>>>>>>>>>>>>>>>> suitable for histograms. My micro-benchmark on various > >>>>>>>>>>>>>>> histogram > >>>>>>>>>>>>>>>>>> implementations also indicates that they are > >>>>>>>>>>>> significantly > >>>>>>>>>>>>>>> slower than > >>>>>>>>>>>>>>>>>> other metric types. But I don't think that means we should never > >>>>>>>>>>>> use > >>>>>>>>>>>>>>> histograms; it > >>>>>>>>>>>>>>>>>> means we should use them with caution. For example, we can suggest > >>>>>>>>>>>> that > >>>>>>>>>>>>>>>>> implementations > >>>>>>>>>>>>>>>>>> turn them off by default and only turn them on for a > >>>>>>>>>>>>>>> small amount of > >>>>>>>>>>>>>>>>> time > >>>>>>>>>>>>>>>>>> when performing some micro-debugging. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> ** Re: complication: > >>>>>>>>>>>>>>>>>> Connector conventions are essential for the Flink ecosystem. 
> >>>>>>>>>>>>>> The Flink connector > >>>>>>>>>>>>>>>>> pool is probably the most important part of Flink, just > >>>>>>>>>>>>>>> like for any other > >>>>>>>>>>>>>>>>> data > >>>>>>>>>>>>>>>>>> system. Clear conventions of connectors will help build > >>>>>>>>>>>>>>> the Flink ecosystem > >>>>>>>>>>>>>>>>> in > >>>>>>>>>>>>>>>>>> a more organic way. > >>>>>>>>>>>>>>>>>> Take the metrics convention as an example, imagine > >>>>>>>>>>>> someone > >>>>>>>>>>>>>>> has developed > >>>>>>>>>>>>>>>>> a > >>>>>>>>>>>>>>>>>> Flink connector for System foo, and another developer > may > >>>>>>>>>>>>>>> have developed > >>>>>>>>>>>>>>>>> a > >>>>>>>>>>>>>>>>>> monitoring and diagnostic framework for Flink which > >>>>>>>>>>>>>>> analyzes the Flink > >>>>>>>>>>>>>>>>> job > >>>>>>>>>>>>>>>>>> performance based on metrics. With a clear metric > >>>>>>>>>>>>>>> convention, those two > >>>>>>>>>>>>>>>>>> projects could be developed independently. Once users > put > >>>>>>>>>>>>>>> them together, > >>>>>>>>>>>>>>>>>> it would work without additional modifications. This > >>>>>>>>>>>>>>> cannot be easily > >>>>>>>>>>>>>>>>>> achieved by just defining a few constants. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> ** Re: selective metrics: > >>>>>>>>>>>>>>>>>> Sure, we can discuss that in a separate thread. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> @Dawid > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> ** Re: latency / fetchedLatency > >>>>>>>>>>>>>>>>>> The primary purpose of establishing such a convention is to > >>>>>>>>>>>>>>> help developers > >>>>>>>>>>>>>>>>>> write connectors in a more compatible way. The > convention > >>>>>>>>>>>>>>> is supposed to > >>>>>>>>>>>>>>>>> be > >>>>>>>>>>>>>>>>>> defined more proactively. So when looking at the > convention, > >>>>>>>>>>>>>>> it seems more > >>>>>>>>>>>>>>>>>> important to see if the concept is applicable to > >>>>>>>>>>>>>>> connectors in general. > >>>>>>>>>>>>>>>>> It > >>>>>>>>>>>>>>>>>> might be true that so far only the Kafka connector reports > >>>>>>>>>>>> latency. > >>>>>>>>>>>>>>> But there > >>>>>>>>>>>>>>>>> might > >>>>>>>>>>>>>>>>>> be hundreds of other connector implementations in the > >>>>>>>>>>>>>>> Flink ecosystem, > >>>>>>>>>>>>>>>>>> though not in the Flink repo, and some of them also > emit > >>>>>>>>>>>>>>> latency. I > >>>>>>>>>>>>>>>>> think > >>>>>>>>>>>>>>>>>> a lot of other sources actually also have an append > >>>>>>>>>>>>>>> timestamp, e.g. > >>>>>>>>>>>>>>>>> database > >>>>>>>>>>>>>>>>>> binlogs and some K-V stores. So I wouldn't be surprised > >>>>>>>>>>>>>>> if some database > >>>>>>>>>>>>>>>>>> connector can also emit latency metrics. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> On Thu, Feb 21, 2019 at 10:14 PM Chesnay Schepler > >>>>>>>>>>>>>>> <ches...@apache.org <mailto:ches...@apache.org>> > >>>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Regarding 2) It doesn't make sense to investigate this > >>>>>>>>>>>> as > >>>>>>>>>>>>>>> part of this > >>>>>>>>>>>>>>>>>>> FLIP. This is something that could be of interest for > >>>>>>>>>>>> the > >>>>>>>>>>>>>>> entire metric > >>>>>>>>>>>>>>>>>>> system, and should be designed for as such. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Regarding the proposal as a whole: > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Histogram metrics shall not be added to the core of > >>>>>>>>>>>>>>> Flink. They are 
They are > >>>>>>>>>>>>>>>>>>> significantly more expensive than other metrics, and > >>>>>>>>>>>>>>> calculating > >>>>>>>>>>>>>>>>>>> histograms in the application is regarded as an > >>>>>>>>>>>>>>> anti-pattern by several > >>>>>>>>>>>>>>>>>>> metric backends, who instead recommend to expose the > raw > >>>>>>>>>>>>>>> data and > >>>>>>>>>>>>>>>>>>> calculate the histogram in the backend. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Second, this seems overly complicated. Given that we > >>>>>>>>>>>>>>> already established > >>>>>>>>>>>>>>>>>>> that not all connectors will export all metrics we are > >>>>>>>>>>>>>>> effectively > >>>>>>>>>>>>>>>>>>> reducing this down to a consistent naming scheme. We > >>>>>>>>>>>>>>> don't need anything > >>>>>>>>>>>>>>>>>>> sophisticated for that; basically just a few constants > >>>>>>>>>>>>>>> that all > >>>>>>>>>>>>>>>>>>> connectors use. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> I'm not convinced that this is worthy of a FLIP. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> On 21.02.2019 14:26, Dawid Wysakowicz wrote: > >>>>>>>>>>>>>>>>>>>> Hi, > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> Ad 1. In general I undestand and I agree. But those > >>>>>>>>>>>>>>> particular metrics > >>>>>>>>>>>>>>>>>>>> (latency, fetchLatency), right now would only be > >>>>>>>>>>>>>>> reported if user uses > >>>>>>>>>>>>>>>>>>>> KafkaConsumer with internal timestampAssigner with > >>>>>>>>>>>>>>> StreamCharacteristic > >>>>>>>>>>>>>>>>>>>> set to EventTime, right? That sounds like a very > >>>>>>>>>>>>>>> specific case. I am > >>>>>>>>>>>>>>>>> not > >>>>>>>>>>>>>>>>>>>> sure if we should introduce a generic metric that will > >>>>>>>>>>>> be > >>>>>>>>>>>>>>>>>>>> disabled/absent for most of implementations. > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> Ad.2 That sounds like an orthogonal issue, that might > >>>>>>>>>>>>>>> make sense to > >>>>>>>>>>>>>>>>>>>> investigate in the future. > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> Best, > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> Dawid > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> On 21/02/2019 13:20, Becket Qin wrote: > >>>>>>>>>>>>>>>>>>>>> Hi Dawid, > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> Thanks for the feedback. That makes sense to me. > There > >>>>>>>>>>>>>>> are two cases > >>>>>>>>>>>>>>>>> to > >>>>>>>>>>>>>>>>>>> be > >>>>>>>>>>>>>>>>>>>>> addressed. > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> 1. The metrics are supposed to be a guidance. It is > >>>>>>>>>>>>>>> likely that a > >>>>>>>>>>>>>>>>>>> connector > >>>>>>>>>>>>>>>>>>>>> only supports some but not all of the metrics. In > that > >>>>>>>>>>>>>>> case, each > >>>>>>>>>>>>>>>>>>> connector > >>>>>>>>>>>>>>>>>>>>> implementation should have the freedom to decide > which > >>>>>>>>>>>>>>> metrics are > >>>>>>>>>>>>>>>>>>>>> reported. For the metrics that are supported, the > >>>>>>>>>>>>>>> guidance should be > >>>>>>>>>>>>>>>>>>>>> followed. > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> 2. Sometimes users may want to disable certain > metrics > >>>>>>>>>>>>>>> for some reason > >>>>>>>>>>>>>>>>>>>>> (e.g. performance / reprocessing of data). A generic > >>>>>>>>>>>>>>> mechanism should > >>>>>>>>>>>>>>>>> be > >>>>>>>>>>>>>>>>>>>>> provided to allow user choose which metrics are > >>>>>>>>>>>>>>> reported. This > >>>>>>>>>>>>>>>>> mechanism > >>>>>>>>>>>>>>>>>>>>> should also be honored by the connector > >>>>>>>>>>>> implementations. > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> Does this sound reasonable to you? 
> >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> On Thu, Feb 21, 2019 at 4:22 PM Dawid Wysakowicz < > >>>>>>>>>>>>>>>>>>> dwysakow...@apache.org <mailto:dwysakow...@apache.org > >> > >>>>>>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Hi, > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Generally I like the idea of having a unified, > >>>>>>>>>>>>>>> standard set of > >>>>>>>>>>>>>>>>> metrics > >>>>>>>>>>>>>>>>>>> for > >>>>>>>>>>>>>>>>>>>>>> all connectors. I have some slight concerns about > >>>>>>>>>>>>>>> fetchLatency and > >>>>>>>>>>>>>>>>>>>>>> latency though. They are computed based on EventTime > >>>>>>>>>>>>>>> which is not a > >>>>>>>>>>>>>>>>>>> purely > >>>>>>>>>>>>>>>>>>>>>> technical feature. It depends often on some business > >>>>>>>>>>>>>>> logic, might be > >>>>>>>>>>>>>>>>>>> absent > >>>>>>>>>>>>>>>>>>>>>> or defined after the source. Those metrics could also > >>>>>>>>>>>>>>> behave in a weird > >>>>>>>>>>>>>>>>>>> way in > >>>>>>>>>>>>>>>>>>>>>> case of replaying backlog. Therefore I am not sure > if > >>>>>>>>>>>>>>> we should > >>>>>>>>>>>>>>>>> include > >>>>>>>>>>>>>>>>>>>>>> those metrics by default. Maybe we could at least > >>>>>>>>>>>>>>> introduce a feature > >>>>>>>>>>>>>>>>>>>>>> switch for them? What do you think? > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Best, > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Dawid > >>>>>>>>>>>>>>>>>>>>>> On 21/02/2019 03:13, Becket Qin wrote: > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Bump. If there are no objections to the proposed > >>>>>>>>>>>>>>> metrics, I'll start a > >>>>>>>>>>>>>>>>>>>>>> voting thread later today. > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> On Mon, Feb 11, 2019 at 8:17 PM Becket Qin > >>>>>>>>>>>>>>> <becket....@gmail.com <mailto:becket....@gmail.com>> < > >>>>>>>>>>>>>>>>>>> becket....@gmail.com <mailto:becket....@gmail.com>> > >>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>>>>>> Hi folks, > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> I would like to start the FLIP discussion thread > >>>>>>>>>>>> about > >>>>>>>>>>>>>>> standardizing > >>>>>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>>>>>>> connector metrics. > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> In short, we would like to provide a convention of > >>>>>>>>>>>>>>> Flink connector > >>>>>>>>>>>>>>>>>>>>>> metrics. It will help simplify the monitoring and > >>>>>>>>>>>>>>> alerting on Flink > >>>>>>>>>>>>>>>>>>> jobs. > >>>>>>>>>>>>>>>>>>>>>> The FLIP link is the following: > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>> > >>>>>>> > >>>>> > >>> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-33%3A+Standardize+Connector+Metrics > >>>>>>>>>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>> > >>>>> > >>>>> > >>> > >>> > >