Yes, thanks Caizhi, it looks very nice. I'm going to make the metrics name more reasonable. Metrics group is also a good idea to take into account.
On Thu, Jun 8, 2023 at 4:50 PM Caizhi Weng <[email protected]> wrote: > Hi Guojun! > > I've read the modified document. I still have the following concerns: > > 1. Some metric names are still misleading. For example, "totalTablesFiles" > is the "number of total data files currently maintained on storage", > however "numTotalManifests" is the "number of scanned manifests files in > the last scan planning", not the total number of manifest files on storage. > Please keep metric names consistent, even across different metric types. > > 2. There is no grouping in the current design. Let's consider all metrics > related to compaction. As one parallelism can hold multiple writers, it is > very likely that compactions for different buckets may happen in the same > parallelism. How are you going to distinguish the same compaction metric > for different buckets? Flink has metric groups which separate different > subtasks of the same operator. > > Guojun Li <[email protected]> 于2023年6月2日周五 18:25写道: > > > Thanks caizhi for the detailed review. > > > > I've updated the pip-3 based on your comments, they're useful to improve > > this proposal and make it clear. > > 1. I added the Histogram and HistogramStatistic interfaces. > > 2. Added histogram metrics to measure the distribution of commit / scan / > > compaction duration. > > 3. Update the descriptions for some metrics, make them more clear. > > > > And any other ideas from devs? Looking forward to your feedback. > > > > Best, > > Guojun > > > > On Mon, May 29, 2023 at 1:51 PM Guojun Li <[email protected]> > wrote: > > > > > Thanks Caizhi for your comments. I'll think about introducing Histogram > > > metric type and making the metrics description more clear. > > > > > > Best, > > > Guojun > > > > > > On Thu, May 25, 2023 at 7:13 PM Caizhi Weng <[email protected]> > > wrote: > > > > > >> I also forgot to mention that Flink metrics [1] is a good reference. > It > > >> clearly describes what Gauge, Counter and Histogram should do and has > a > > >> list of metrics with clear definition. For example they have a metric > > >> named > > >> "lastCheckpointDuration" (which is a Gauge) and not just > > >> "checkpointDuration". > > >> > > >> [1] > > >> > https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/ > > >> > > >> Caizhi Weng <[email protected]> 于2023年5月25日周四 19:10写道: > > >> > > >> > Hi Guojun! > > >> > > > >> > Thanks for raising this discussion. After reading the document, I'd > > like > > >> > to raise a few opinions. > > >> > > > >> > > > >> > > > >> > > > -------------------------------------------------------------------------------- > > >> > > > >> > 1. About the overall types of metrics. > > >> > > > >> > You suggested two types of metrics, namely Gauge and Counter. A > gauge > > >> > indicates the value of a metric at that instance, for example "the > > >> duration > > >> > of the last compaction". A counter indicates an accumulated value > of a > > >> > metric overtime, for example "the total number of records written". > > >> > > > >> > However there are times when we want to study the history and see > the > > >> > trend. For example if we notice that the last compaction is slower > > than > > >> > normal, we might want to know "the average duration of the last few > > >> > compactions" to make sure that this slow down is just by chance or > is > > a > > >> > common case. Both gauge and counter cannot meet this need. What we > > need > > >> is > > >> > another type of metrics called Histogram. This metrics type should > be > > >> able > > >> > to record the last few values of a metric and calculate their > > statistics > > >> > such as average, min/max and more. > > >> > > > >> > > > >> > > > >> > > > -------------------------------------------------------------------------------- > > >> > > > >> > 2. About the types of some specific metrics. > > >> > > > >> > It seems to me that in your proposal, you cannot clearly tell the > > >> > difference between Gauges and Counters, and sometimes it is not > clear > > >> that > > >> > what a specific metric means. Let me give you some example. > > >> > > > >> > Metric Name: commitDuration > > >> > > > >> > Description: Commit > > >> > > > >> > Type: Gauge > > >> > > > >> > Update at: Timer starts before commit starting, update commit > duration > > >> >> after commit finished > > >> > > > >> > > > >> > At first glance, "Commit" is not a valid description. From the > metric > > >> name > > >> > I guess you'd like to record the length of duration for committing > > >> > snapshots. As its type is a Gauge, I suppose you want to record the > > >> *last* > > >> > commit. Please make things clear what exactly you want to record. > And > > >> as I > > >> > said in the last section, it might be better to introduce a > Histogram > > >> type > > >> > so that we can study the average duration of the last few commits. > > >> > > > >> > Metric Name: numTableFilesAdded > > >> > > > >> > Description: Number of added table files in this commit > > >> > > > >> > Type: Counter > > >> > > > >> > Update at: Collecting changes from committables > > >> > > > >> > > > >> > By "this commit" I suppose you mean *last* commit. If you the type > > >> should > > >> > be a Gauge, not a Counter. By using Counter you may want to record > the > > >> *total > > >> > number* of files added during all commits. > > >> > > > >> > Metric Name: numFilesCompactedBefore > > >> > > > >> > Description: Number of deleted files in compaction > > >> > > > >> > Type: Counter > > >> > > > >> > Update at: Triggering compaction > > >> > > > >> > > > >> > It is also unclear to me what you're going to record. Are you going > > the > > >> > record the number of deleted files during the *last* compaction? Or > > >> you'd > > >> > like to record the *total number* of deleted files during all > > >> compaction? > > >> > > > >> > I'm afraid most of your proposed metrics seem unclear to me. Please > > >> > rewrite the proposal to make them clear. > > >> > > > >> > Guojun Li <[email protected]> 于2023年5月25日周四 16:59写道: > > >> > > > >> >> Hi, Paimon Devs, > > >> >> I’d like to start a discussion about PIP-3[1]. > > >> >> In this PIP, I'm talking about how to support metrics for > > paimon, > > >> >> what > > >> >> metrics paimon needs. Look forward to your question and > suggestions. > > >> >> > > >> >> Best, > > >> >> Guojun > > >> >> > > >> >> [1] > > >> >> > > >> >> > > >> > > > https://cwiki.apache.org/confluence/display/PAIMON/PIP-3%3A+Introduce+metrics+for+Paimon > > >> >> > > >> > > > >> > > > > > >
