Hi Jingsong Li. Thank you for your reply. I will revise the PIP according to this approach.
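
A rough sketch of this approach, assuming one wide listener interface with no-op defaults and a metrics implementation on top of it. PaimonListener and MetricsPaimonListener are the names from the reply below. The event classes, their fields, and ListenerBus are hypothetical placeholders for illustration only, not existing Paimon APIs or the final PIP interfaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

final class ListenerSketch {

    /** Hypothetical commit event: base information plus a metric measured during the commit. */
    static final class CommitEvent {
        final String table;
        final long snapshotId;
        final long durationMillis;

        CommitEvent(String table, long snapshotId, long durationMillis) {
            this.table = table;
            this.snapshotId = snapshotId;
            this.durationMillis = durationMillis;
        }
    }

    /** Hypothetical compaction event with the same shape. */
    static final class CompactionEvent {
        final String table;
        final long durationMillis;

        CompactionEvent(String table, long durationMillis) {
            this.table = table;
            this.durationMillis = durationMillis;
        }
    }

    /** One wide listener interface; no-op defaults let an implementation override only what it needs. */
    interface PaimonListener {
        default void onCommit(CommitEvent event) {}

        default void onCompaction(CompactionEvent event) {}
    }

    /** Folds events into counters that a scheduled reporter could read later. */
    static final class MetricsPaimonListener implements PaimonListener {
        private final AtomicLong commits = new AtomicLong();
        private final AtomicLong commitMillis = new AtomicLong();
        private final AtomicLong compactions = new AtomicLong();

        @Override
        public void onCommit(CommitEvent event) {
            commits.incrementAndGet();
            commitMillis.addAndGet(event.durationMillis);
        }

        @Override
        public void onCompaction(CompactionEvent event) {
            compactions.incrementAndGet();
        }

        long commitCount() { return commits.get(); }

        long totalCommitMillis() { return commitMillis.get(); }

        long compactionCount() { return compactions.get(); }
    }

    /** Hypothetical dispatcher that forwards each event to every registered listener. */
    static final class ListenerBus {
        private final List<PaimonListener> listeners = new ArrayList<>();

        void register(PaimonListener listener) { listeners.add(listener); }

        void post(CommitEvent event) {
            for (PaimonListener listener : listeners) {
                listener.onCommit(event);
            }
        }

        void post(CompactionEvent event) {
            for (PaimonListener listener : listeners) {
                listener.onCompaction(event);
            }
        }
    }

    public static void main(String[] args) {
        MetricsPaimonListener metrics = new MetricsPaimonListener();
        ListenerBus bus = new ListenerBus();
        bus.register(metrics);
        bus.post(new CommitEvent("db.orders", 1L, 120L));
        bus.post(new CommitEvent("db.orders", 2L, 80L));
        bus.post(new CompactionEvent("db.orders", 900L));
        System.out.println(metrics.commitCount() + " commits, "
                + metrics.totalCommitMillis() + " ms in total");
    }
}
```

With this shape, the metric system is just one listener among others: a Table Service could register its own PaimonListener that reacts to events immediately, while MetricsPaimonListener only accumulates values for periodic reporting.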
Best,
Shidayang

Jingsong Li <[email protected]> wrote on Wed, Sep 6, 2023 at 10:01:
>
> Thanks Jocean and Shammon.
>
> I took a look at Spark code. I think its abstraction is OK for us too.
>
> A big listener interface PaimonListener (Just like SparkListener), and
> an implementation is MetricsPaimonListener to report metrics.
>
> Or you can create another listener implementation, I don't know, if
> you need you should mention it in the PIP.
>
> Best,
> Jingsong
>
> On Mon, Sep 4, 2023 at 10:10 PM Shammon FY <[email protected]> wrote:
> >
> > Thanks Jocean.
> >
> > So should we need to introduce trigger-based metrics for some special
> > events such as commit/compaction? Maybe we can hear from others cc
> > @Caizhi Weng, @Jingsong Li, what do you think?
> >
> > Best,
> > Shammon FY
> >
> > On Fri, Sep 1, 2023 at 11:29 AM Jocean shi <[email protected]> wrote:
> >>
> >> Hi Shammon FY
> >>
> >> Continuing from the previous discussion. I would like to discuss the
> >> abstract model of "Listener," "Metric," and "MetricReport" further.
> >> Firstly, it is important to clarify that the discussed compaction and
> >> commit operations have clear boundaries, making them events. These
> >> events, in addition to basic information, also include metrics
> >> generated during the execution process, such as execution time and CPU
> >> consumption. Therefore, an event consists of both base information and
> >> base metrics. Users can obtain these events through Listeners, and
> >> they can construct the desired new Metric using these events. They can
> >> then report the Metric periodically via MetricReport. Thus, I believe
> >> the Metric system is a use case of the Listener.
> >>
> >> Some similar implementations:
> >>
> >> SparkListener: The SparkListenerTaskEnd event includes TaskInfo and
> >> TaskMetrics, and users can subscribe to SparkListenerTaskEnd to obtain
> >> the desired metrics.
> >> Iceberg: Iceberg's CommitMetric is also generated through the
> >> CreateSnapshotEvent and reported via a reporter. The difference is
> >> that Iceberg's reporting is trigger-based, while Paimon performs it on
> >> a schedule.
> >>
> >> Best
> >> shidayang
> >>
> >> Jocean shi <[email protected]> wrote on Thu, Aug 24, 2023 at 17:41:
> >> >
> >> > Hi Shammon FY
> >> >
> >> > Thanks for your comment.
> >> >
> >> > 1. DDL events
> >> > Many behaviors of the Table service are related to the options of
> >> > tables, such as whether the table has enabled full-compaction and the
> >> > triggering conditions for compaction. If the options of a table are
> >> > changed, the Table service needs to perceive it in a timely manner and
> >> > make corresponding adjustments to the behavior of the table. Without a
> >> > listener mechanism, the Table service needs to constantly poll the
> >> > table to determine if its configuration has changed, which increases
> >> > the pressure on Hive and the Table service. If we can listen to the
> >> > AlterTableEvent, we won't need to poll the options of the table.
> >> >
> >> > 2. Why not metric
> >> > Metric is mainly processed statistical indicators that are usually
> >> > measured at regular intervals, and multiple reported values may be the
> >> > same. This is quite different from events. For example, for 'commit',
> >> > Metric usually measures the size, quantity, and duration of recently
> >> > committed files, and the results obtained from multiple retrievals may
> >> > be the same. It can be imagined that replacing the currently existing
> >> > CommitCallback with Metric would be very troublesome.
> >> >
> >> > Best
> >> > shidayang
> >> >
> >> > Shammon FY <[email protected]> wrote on Wed, Aug 23, 2023 at 10:53:
> >> > >
> >> > > Hi Jocean
> >> > >
> >> > > Thanks for your answer. I think there are two types of the
> >> > > information you want to report: the ddl events and the runtime
> >> > > events such as commit, compaction.
> >> > >
> >> > > For the ddl events, I don't quite understand why you need to poll
> >> > > the table information regularly? As we all know that Paimon is
> >> > > really a storage which has all meta information in it, and even
> >> > > when you poll the information from Paimon, you need to store it
> >> > > somewhere. I think you can just use Paimon as the storage itself.
> >> > > If the performance of obtaining Paimon tables is relatively low,
> >> > > such as the large number of tables you mentioned, I think we
> >> > > should improve this, for example, add a table cache?
> >> > >
> >> > > For the runtime events, I understand that they are indeed
> >> > > necessary to report to a system like `Table Service`. But my issue
> >> > > is: can we do this in the existing metrics mechanism? For example,
> >> > > reporting relevant metrics to the `Table Service` instead of
> >> > > adding a new `listener`? If the metrics information is not
> >> > > complete enough, we can continue to add information in it.
> >> > >
> >> > > Best,
> >> > > Shammon FY
> >> > >
> >> > > On Tue, Aug 22, 2023 at 2:20 PM Jocean shi <[email protected]> wrote:
> >> > >
> >> > > > Hi Shammon FY,
> >> > > >
> >> > > > I get your point, but the role of a Listener is more towards
> >> > > > notification. For example, as you mentioned, we can query the
> >> > > > relevant information through APIs for DDL and commit
> >> > > > information. However, when we want to know if there have been
> >> > > > any changes to the relevant information, we need to constantly
> >> > > > poll the tables. This mechanism can be resource-intensive,
> >> > > > especially when there are many tables. With a Listener, we can
> >> > > > promptly detect changes in status. Consider a separate Table
> >> > > > service that has a requirement to compact all tables, and the
> >> > > > compact parameters are stored in the options.
> >> > > > When there is a change in the options of a table, the Table
> >> > > > Service needs to be notified promptly to determine whether to
> >> > > > immediately compact the table. When there is new data committed
> >> > > > to a table, it needs to be promptly detected to determine
> >> > > > whether to compact it. Also, users need the assistance of
> >> > > > CommitEvent to trigger downstream tasks based on the watermark
> >> > > > of a table.
> >> > > > Querying compact information through SQL or APIs is indeed a
> >> > > > good way. It is relatively simple to query historical compact
> >> > > > records. However, if you want to know the current compact status
> >> > > > of a table, using a Listener may be simpler.
> >> > > >
> >> > > > Best
> >> > > > Shidayang
> >> > > >
> >> > > > Shammon FY <[email protected]> wrote on Mon, Aug 21, 2023 at 23:24:
> >> > > > >
> >> > > > > Hi Jocean,
> >> > > > >
> >> > > > > Thanks for your explanation. I still have some issues
> >> > > > >
> >> > > > > 1. What are the ddl events for Paimon used for? If you need to
> >> > > > > show tables for paimon in your system, I think it's better to
> >> > > > > define table related interfaces, and then you can implement
> >> > > > > them for Paimon, Iceberg and Hudi instead of adding a ddl
> >> > > > > listener in them. It's more general and you can even manage
> >> > > > > other tables such as databases, mongodb and hive.
> >> > > > >
> >> > > > > 2. If some system information in `CompactEvent` is currently
> >> > > > > missing or there's no information about `compact`, I think a
> >> > > > > better way is to add this system information in Paimon, rather
> >> > > > > than adding a listener and creating an event with the
> >> > > > > information. Then the external system can get the information
> >> > > > > by SQL or API directly, this is a more reasonable approach.
> >> > > > >
> >> > > > > 3. Also what is the `CommitEvent` used for? Currently we have
> >> > > > > metrics for `Commit` and jobs can report them. How about
> >> > > > > adding a customized reporter for metrics instead of a listener
> >> > > > > for `CommitEvent`?
> >> > > > >
> >> > > > > Best,
> >> > > > > Shammon FY
> >> > > > >
> >> > > > > On Mon, Aug 21, 2023 at 5:16 PM Jocean shi <[email protected]> wrote:
> >> > > > >
> >> > > > > > Hi Shammon FY,
> >> > > > > >
> >> > > > > > Thanks for your comments. I’d like to share my thoughts
> >> > > > > > about your comments.
> >> > > > > >
> >> > > > > > 1. Public Interface
> >> > > > > > Thank you for the reminder. I overlooked the correspondence
> >> > > > > > between the Public Interface of PIP and the "@Public"
> >> > > > > > annotation.
> >> > > > > > My idea was that Event, Listener, and ListenerFactory are
> >> > > > > > public, while the others are non-public.
> >> > > > > >
> >> > > > > > 2. Add `Factory` to create `Listener`
> >> > > > > > Great suggestion, I have already added the ListenerFactory
> >> > > > > > to PIP.
> >> > > > > >
> >> > > > > > 3. Flink and Spark support meta-data listeners
> >> > > > > > It will be very inconvenient for users to obtain DDL
> >> > > > > > information through engines. Firstly, there are many
> >> > > > > > implementations of various engines that need to be
> >> > > > > > connected. Secondly, in addition to Flink and Spark, many
> >> > > > > > engines do not support meta-data listeners. As a general
> >> > > > > > data lake, Paimon should have its own mechanism for
> >> > > > > > meta-data listeners.
> >> > > > > >
> >> > > > > > 4. report events such as commit/compact to an external system
> >> > > > > > CompactEvent: Currently, the compact state is a black box,
> >> > > > > > and users cannot obtain the information through SQL or API.
> >> > > > > > CommitEvent: Currently, the methods of querying through SQL
> >> > > > > > or API are based on polling, which makes it difficult for
> >> > > > > > users to perceive commit operations in a timely manner and
> >> > > > > > consumes a lot of resources.
> >> > > > > >
> >> > > > > > Best
> >> > > > > > Shidayang
> >> > > > > >
> >> > > > > > Shammon FY <[email protected]> wrote on Fri, Aug 18, 2023 at 14:07:
> >> > > > > > >
> >> > > > > > > Thanks @Jocean for starting this discussion, I have some
> >> > > > > > > comments
> >> > > > > > >
> >> > > > > > > 1. About the public interfaces in the PIP, we should add
> >> > > > > > > @Public for them such as `Event`, `Listener` and even
> >> > > > > > > `CommitEvent` and other events. But for `Listeners`, I
> >> > > > > > > don't think it should be a public interface. All fields in
> >> > > > > > > the public interface for users should be `Public` too, but
> >> > > > > > > I found the information such as `ManifestEntry` in
> >> > > > > > > `CommitEvent` is not a public interface. I think you may
> >> > > > > > > need to reconsider which interfaces need to be marked with
> >> > > > > > > @Public and which are not.
> >> > > > > > >
> >> > > > > > > 2. In general, it is better to give a `Factory` to create
> >> > > > > > > `Listener` which should be all marked as `@Public` and you
> >> > > > > > > can see `CatalogFactory`->`Catalog` as an example.
> >> > > > > > >
> >> > > > > > > 3. Currently Flink and Spark support meta-data listeners
> >> > > > > > > and we can support reporting ddl information there, should
> >> > > > > > > we need to add the same listener in Paimon?
> >> > > > > > >
> >> > > > > > > 4. Should we need to report the events such as
> >> > > > > > > commit/compact to an external system? Currently we have
> >> > > > > > > some system tables and users can get these information by
> >> > > > > > > SQL or API, should the external system query these
> >> > > > > > > information regularly instead of a listener to push them?
> >> > > > > > >
> >> > > > > > > Best,
> >> > > > > > > Shammon FY
> >> > > > > > >
> >> > > > > > > On Tue, Aug 15, 2023 at 11:08 AM Jocean shi <[email protected]> wrote:
> >> > > > > > >
> >> > > > > > > > Hi devs:
> >> > > > > > > >
> >> > > > > > > > We would like to start a discussion about PIP-8:
> >> > > > > > > > Introduce listeners for Paimon[1].
> >> > > > > > > >
> >> > > > > > > > In production environments, users often need to perceive
> >> > > > > > > > the state changes of Paimon table, such as whether a new
> >> > > > > > > > file has been committed to the table, in which
> >> > > > > > > > partitions the committed files are, the size and number
> >> > > > > > > > of the committed files, the status and type of
> >> > > > > > > > compaction, operations like table creation, deletion,
> >> > > > > > > > and schema changes, etc.
> >> > > > > > > > So, we introduce a Listener system for Paimon.
> >> > > > > > > > Looking forward to hearing from you.
> >> > > > > > > >
> >> > > > > > > > [1]
> >> > > > > > > > https://cwiki.apache.org/confluence/display/PAIMON/PIP-8%3A+Introduce+listeners+for+Paimon
> >> > > > > > > >
> >> > > > > > > > Best
> >> > > > > > > > shidayang
