Hey Jack,

This is a good idea.
I am not sure if per-table events are in scope for these event notifications. I would like to see event notifications for the following events as well:

- Schema changes on an individual Iceberg table
- Commits in an Iceberg table

Thanks
Vivek

On Thu, Dec 2, 2021 at 5:44 AM Ashish Singh <singhashish....@gmail.com> wrote:

> Hey Jack,
>
> Thanks for sharing your thoughts on this. We ran into a need for richer
> event notification for table operations as well, for various reasons
> including enforcements like ownership. While looking into potential ways to
> add pluggable logic during various table operations, we primarily
> considered the following two options.
>
> 1. Enhance Iceberg’s event notification to include more event types.
> 2. Use the existing pre and post table operations hook interface from HMS.
>
> We decided to go with the 2nd option for the following reasons.
>
> 1. All table operations (SQL or programmatic access) go through HMS. Users
> don’t have to worry about configuring listeners per app and also won’t be
> able to remove mandatory listeners.
> 2. We, and I am guessing most HMS prod installations, already have HMS
> events pipelines set up that we will be able to reuse.
> 3. Ability to take action pre and post commit.
>
> The extra compute needed to construct metadata from Iceberg metadata and
> manifest files is a drawback of the second approach, though we are not too
> concerned about it as of now.
>
> I will be curious to learn what others think of this approach.
> Irrespective, please count me in on any discussion along these lines. I
> will loop in some more folks from Pinterest who are actively looking into
> this as well.
>
> - Ashish
>
> On Wed, Dec 1, 2021 at 6:51 AM Mayur Srivastava <
> mayur.srivast...@twosigma.com> wrote:
>
>> +1 to the idea.
>>
>> The events are very useful for building asynchronous services around
>> Iceberg such as observability, garbage collection, compaction, asynchronous
>> table deletion (to handle slow purge calls in the background), etc.
>>
>> It seems like the Iceberg catalog is a good place to configure/set up the
>> events because almost all access starts with the catalog and namespaces are
>> managed at the catalog level. I feel that an extension to the catalog to
>> send notifications would be nice because it can track events such as
>> create/drop/alter properties on namespaces, list calls,
>> create/alter/rename/drop tables, successful commits, etc.
>>
>> Notification configuration at the namespace level may be leveraged, with an
>> override at the table level in case a specific table needs one.
>>
>> (Namespaces are a good abstraction for configuration management that is
>> common to multiple tables, and if a namespace is mapped to a bucket,
>> uniform IAM can be used, which generally works better on GCS.)
>>
>> In terms of events, there are two kinds of events that we are interested
>> in:
>>
>> 1. Change events: any change that happens in the table such as
>> create, drop table/namespace.
>>
>> 2. Access events: any time a user accesses a table or calls list. It is
>> good to send notifications because they can be consumed by users if they
>> want to track usage for a specific namespace or table.
>>
>> When this is implemented, could we provide generic hooks and events so
>> that another notification system such as PubSub (on GCP) or Kafka can be
>> leveraged?
>>
>> I’m happy to join any brainstorming discussion around this topic.
>>
>> Thanks,
>>
>> Mayur
>>
>> *From:* Kyle Bendickson <k...@tabular.io>
>> *Sent:* Wednesday, December 1, 2021 1:23 AM
>> *To:* dev@iceberg.apache.org
>> *Subject:* Re: Iceberg event notification support
>>
>> I think this is a great idea, Jack. Thank you for bringing this up! +1
>>
>> There have been several people interested in having more observability
>> (for example for table design patterns akin to how folks might monitor
>> Hive) and events would be a big win for that and something users could use
>> with a lot of their existing infra (Kafka, REST services, AWS or other
>> cloud provider queue types).
>>
>> Spark has an existing interface, ExternalCatalogWithListener, which emits
>> events we might hook into. I won't go into too much detail here. And while
>> these Spark "ExternalCatalogEvents" shouldn't be how we define our own
>> events, which should have their own type system, it could be a beneficial
>> source of event hooks from within Spark. It also provides us table level
>> query data we don't currently otherwise get. It's worth investigating if we
>> haven't, though we might choose to forgo its complexity.
>>
>> I agree conceptually that most events should be registered at the table
>> level, though I'd be open to having events of differing granularities,
>> especially if this helps support cross-table patterns. But table level data
>> should be prioritized first.
>>
>> If you have something to share or would like to make time to discuss,
>> please count me in. This is an area I've been thinking about a bit lately
>> as I've had quite some interest in observability and possible event-driven
>> patterns.
>>
>> Best
>>
>> Kyle (GitHub @kbendick)
>>
>> On Tue, Nov 30, 2021 at 9:50 PM Neelesh Salian <neeleshssal...@gmail.com>
>> wrote:
>>
>> +1 to this effort.
>>
>> There is value in adding support for Events - general bookkeeping and
>> helping replay actions in the event of recovery.
>>
>> At the minimum we should aim to track the following for all catalogs:
>>
>> 1. Create actions
>>
>> 2. Alter actions
>>
>> 3. Delete actions
>>
>> across all tables, properties and namespaces.
>>
>> On Tue, Nov 30, 2021 at 9:12 PM Jack Ye <yezhao...@gmail.com> wrote:
>>
>> Hi everyone,
>>
>> I would like to start some initial discussions around Iceberg event
>> notification support, because we might have some engineering resources to
>> work on Iceberg notification integration with AWS services such as SNS,
>> SQS, CloudWatch.
>>
>> As of today, we have a Listener interface and 3 events: ScanEvent,
>> IncrementalScanEvent, CreateSnapshotEvent. There is a static registry
>> called Listeners that registers the event listeners in the JVM.
>>
>> However, when I read the related code paths, my thought is that it might
>> be better to register listeners per-table, based on the following
>> observations:
>>
>> 1. Iceberg events are all table or sub-table level events. For any
>> catalog or global level events, the catalog service can provide
>> notifications, and Iceberg can be out of the picture.
>>
>> 2. A user might have multiple Iceberg catalogs defined, pointing to
>> different catalog services (e.g. one to AWS Glue, one to a Hive
>> metastore). The notifications from tables of these different catalogs
>> should be directed to different listeners at least per catalog, instead of
>> the same set of listeners that are registered globally.
>>
>> 3. Event listener configurations are usually static. It makes more sense
>> to me to define it once and then repeatedly use it, instead of
>> re-registering it every time I start an application.
>>
>> If we register the listeners at table level, we can add a hook in
>> TableOperations to get a set of listeners to emit specific events. The
>> listeners could be defined and serialized as a part of the table
>> properties, or maybe even a part of the Iceberg spec.
>>
>> This is really just my brainstorming. Maybe it's a bit overkill, maybe I
>> am missing the correct way to use the Listeners static registry. It would
>> be great if anyone could provide more context or thoughts around this
>> topic.
>>
>> Best,
>>
>> Jack Ye
>>
>> --
>>
>> Regards,
>>
>> Neelesh S. Salian
>>
>> --
> - Ashish
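
For readers following the thread, here is a rough sketch of the per-table registration idea discussed above: listeners are keyed by table name and event type instead of living in a single JVM-wide registry. Everything in it is hypothetical and for illustration only. `TableListener`, `SnapshotCreated`, `PerTableListeners`, and the table names are made-up stand-ins that only mimic the general shape of a listener plus registry; they are not Iceberg's actual classes, and nothing here reflects how TableOperations would really expose such a hook.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical listener type (assumption: a single-method callback per event type).
interface TableListener<E> {
  void notify(E event);
}

// Hypothetical event emitted after a successful commit on one table.
final class SnapshotCreated {
  final String tableName;
  final long snapshotId;

  SnapshotCreated(String tableName, long snapshotId) {
    this.tableName = tableName;
    this.snapshotId = snapshotId;
  }
}

// Registry scoped by table name, then by event class, rather than one global static set.
final class PerTableListeners {
  private final Map<String, Map<Class<?>, List<TableListener<?>>>> listeners =
      new ConcurrentHashMap<>();

  <E> void register(String tableName, Class<E> eventType, TableListener<E> listener) {
    listeners
        .computeIfAbsent(tableName, k -> new ConcurrentHashMap<>())
        .computeIfAbsent(eventType, k -> new CopyOnWriteArrayList<>())
        .add(listener);
  }

  // Delivers the event only to listeners registered for this table and this exact event class.
  @SuppressWarnings("unchecked")
  <E> void notifyAll(String tableName, E event) {
    Map<Class<?>, List<TableListener<?>>> byType = listeners.get(tableName);
    if (byType == null) {
      return;
    }
    List<TableListener<?>> registered = byType.get(event.getClass());
    if (registered == null) {
      return;
    }
    for (TableListener<?> listener : registered) {
      // Safe by construction: register() only stores listeners under their own event class.
      ((TableListener<E>) listener).notify(event);
    }
  }
}

final class PerTableListenersDemo {
  // Registers one listener for "db.orders" and emits events for two different tables.
  static List<String> run() {
    PerTableListeners registry = new PerTableListeners();
    List<String> seen = new ArrayList<>();
    registry.register(
        "db.orders", SnapshotCreated.class, e -> seen.add(e.tableName + ":" + e.snapshotId));
    registry.notifyAll("db.orders", new SnapshotCreated("db.orders", 42L));
    registry.notifyAll("db.other", new SnapshotCreated("db.other", 7L)); // no listener, ignored
    return seen;
  }

  public static void main(String[] args) {
    System.out.println(run());
  }
}
```

In this sketch only the `db.orders` listener fires, which is the routing behaviour a global registry cannot express without extra filtering inside each listener. Persisting the listener configuration (for example in table properties, as suggested in the thread) is deliberately left out.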