Hi Guojun!

I've read the modified document. I still have the following concerns:

1. Some metric names are still misleading. For example, "totalTablesFiles"
is the "number of total data files currently maintained on storage",
however "numTotalManifests" is the "number of scanned manifests files in
the last scan planning", not the total number of manifest files on storage.
Please keep metric names consistent, even across different metric types.

2. There is no grouping in the current design. Let's consider all metrics
related to compaction. As one parallelism can hold multiple writers, it is
very likely that compactions for different buckets may happen in the same
parallelism. How are you going to distinguish the same compaction metric
for different buckets? Flink has metric groups which separate different
subtasks of the same operator.

Guojun Li <[email protected]> 于2023年6月2日周五 18:25写道:

> Thanks caizhi for the detailed review.
>
> I've updated the pip-3 based on your comments, they're useful to improve
> this proposal and make it clear.
> 1. I added the Histogram and HistogramStatistic interfaces.
> 2. Added histogram metrics to measure the distribution of commit / scan /
> compaction duration.
> 3. Update the descriptions for some metrics, make them more clear.
>
> And any other ideas from devs? Looking forward to your feedback.
>
> Best,
> Guojun
>
> On Mon, May 29, 2023 at 1:51 PM Guojun Li <[email protected]> wrote:
>
> > Thanks Caizhi for your comments. I'll think about introducing Histogram
> > metric type and making the metrics description more clear.
> >
> > Best,
> > Guojun
> >
> > On Thu, May 25, 2023 at 7:13 PM Caizhi Weng <[email protected]>
> wrote:
> >
> >> I also forgot to mention that Flink metrics [1] is a good reference. It
> >> clearly describes what Gauge, Counter and Histogram should do and has a
> >> list of metrics with clear definition. For example they have a metric
> >> named
> >> "lastCheckpointDuration" (which is a Gauge) and not just
> >> "checkpointDuration".
> >>
> >> [1]
> >> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/
> >>
> >> Caizhi Weng <[email protected]> 于2023年5月25日周四 19:10写道:
> >>
> >> > Hi Guojun!
> >> >
> >> > Thanks for raising this discussion. After reading the document, I'd
> like
> >> > to raise a few opinions.
> >> >
> >> >
> >> >
> >>
> --------------------------------------------------------------------------------
> >> >
> >> > 1. About the overall types of metrics.
> >> >
> >> > You suggested two types of metrics, namely Gauge and Counter. A gauge
> >> > indicates the value of a metric at that instance, for example "the
> >> duration
> >> > of the last compaction". A counter indicates an accumulated value of a
> >> > metric overtime, for example "the total number of records written".
> >> >
> >> > However there are times when we want to study the history and see the
> >> > trend. For example if we notice that the last compaction is slower
> than
> >> > normal, we might want to know "the average duration of the last few
> >> > compactions" to make sure that this slow down is just by chance or is
> a
> >> > common case. Both gauge and counter cannot meet this need. What we
> need
> >> is
> >> > another type of metrics called Histogram. This metrics type should be
> >> able
> >> > to record the last few values of a metric and calculate their
> statistics
> >> > such as average, min/max and more.
> >> >
> >> >
> >> >
> >>
> --------------------------------------------------------------------------------
> >> >
> >> > 2. About the types of some specific metrics.
> >> >
> >> > It seems to me that in your proposal, you cannot clearly tell the
> >> > difference between Gauges and Counters, and sometimes it is not clear
> >> that
> >> > what a specific metric means. Let me give you some example.
> >> >
> >> > Metric Name: commitDuration
> >> >
> >> > Description:  Commit
> >> >
> >> > Type: Gauge
> >> >
> >> > Update at: Timer starts before commit starting, update commit duration
> >> >> after commit finished
> >> >
> >> >
> >> > At first glance, "Commit" is not a valid description. From the metric
> >> name
> >> > I guess you'd like to record the length of duration for committing
> >> > snapshots. As its type is a Gauge, I suppose you want to record the
> >> *last*
> >> > commit. Please make things clear what exactly you want to record. And
> >> as I
> >> > said in the last section, it might be better to introduce a Histogram
> >> type
> >> > so that we can study the average duration of the last few commits.
> >> >
> >> > Metric Name: numTableFilesAdded
> >> >
> >> > Description: Number of added table files in this commit
> >> >
> >> > Type: Counter
> >> >
> >> > Update at: Collecting changes from committables
> >> >
> >> >
> >> > By "this commit" I suppose you mean *last* commit. If you the type
> >> should
> >> > be a Gauge, not a Counter. By using Counter you may want to record the
> >> *total
> >> > number* of files added during all commits.
> >> >
> >> > Metric Name: numFilesCompactedBefore
> >> >
> >> > Description: Number of deleted files in compaction
> >> >
> >> > Type: Counter
> >> >
> >> > Update at: Triggering compaction
> >> >
> >> >
> >> > It is also unclear to me what you're going to record. Are you going
> the
> >> > record the number of deleted files during the *last* compaction? Or
> >> you'd
> >> > like to record the *total number* of deleted files during all
> >> compaction?
> >> >
> >> > I'm afraid most of your proposed metrics seem unclear to me. Please
> >> > rewrite the proposal to make them clear.
> >> >
> >> > Guojun Li <[email protected]> 于2023年5月25日周四 16:59写道:
> >> >
> >> >> Hi, Paimon Devs,
> >> >>      I’d like to start a discussion about PIP-3[1].
> >> >>      In this PIP, I'm talking about how to support metrics for
> paimon,
> >> >> what
> >> >> metrics paimon needs. Look forward to your question and suggestions.
> >> >>
> >> >> Best,
> >> >> Guojun
> >> >>
> >> >> [1]
> >> >>
> >> >>
> >>
> https://cwiki.apache.org/confluence/display/PAIMON/PIP-3%3A+Introduce+metrics+for+Paimon
> >> >>
> >> >
> >>
> >
>

Reply via email to