I have been thinking about a slightly different use case, but I am mentioning it here for context. It would be great if the ScanEvent could capture richer statistics about the files, bytes, records, etc. touched by scans. That would provide rich information about which tables are good candidates for file compaction (scans are touching a lot of small files), which columns are good candidates for becoming sort columns (scan predicates on them do not benefit from file pruning), and so on.
Thanks,
- Puneet

On Wed, Dec 1, 2021 at 7:09 PM Vivekanand Vellanki <vi...@dremio.com> wrote:

> Hey Jack,
>
> This is a good idea.
>
> I am not sure if per-table events are in scope for these event
> notifications. I would like to see event notifications for the following
> events as well:
>
> - Schema changes on an individual Iceberg table
> - Commits in an Iceberg table
>
> Thanks
> Vivek
>
> On Thu, Dec 2, 2021 at 5:44 AM Ashish Singh <singhashish....@gmail.com>
> wrote:
>
>> Hey Jack,
>>
>> Thanks for sharing your thoughts on this. We ran into a need for richer
>> event notification for table operations as well, for various reasons
>> including enforcements like ownership. While looking into potential ways to
>> add pluggable logic during various table operations, we considered
>> the following two options primarily.
>>
>> 1. Enhance Iceberg’s event notification to include more event types.
>> 2. Use the existing pre and post table operations hook interface from HMS.
>>
>> We decided to go with the 2nd option for the following reasons.
>>
>> 1. All table operations (SQL or programmatic access) go through HMS.
>> Users don’t have to worry about configuring listeners per app and also
>> won’t be able to remove mandatory listeners.
>> 2. We, and I am guessing most HMS prod installations, already have HMS
>> events pipelines set up that we will be able to reuse.
>> 3. Ability to take action pre and post commit.
>>
>> Extra compute needed to construct metadata from Iceberg metadata and
>> manifest files is a drawback with the second approach, which we are not too
>> concerned about as of now though.
>>
>> I will be curious to learn what others think of this approach.
>> Irrespective, please count me in on any discussion along this as well. I will
>> loop in some more folks from Pinterest who are actively looking into this
>> as well.
>>
>> - Ashish
>>
>> On Wed, Dec 1, 2021 at 6:51 AM Mayur Srivastava <
>> mayur.srivast...@twosigma.com> wrote:
>>
>>> +1 to the idea.
>>>
>>> The events are very useful for building asynchronous services around
>>> Iceberg such as observability, garbage collection, compaction, asynchronous
>>> table deletion (to handle slow purge calls in the background), etc.
>>>
>>> It seems like the Iceberg catalog is a good place to configure/set up the
>>> events because almost all access starts with the catalog and namespaces are
>>> managed at the catalog level. I feel that an extension to the catalog to
>>> send notifications would be nice because it can track events such as
>>> create/drop/alter properties on namespaces, list calls,
>>> create/alter/rename/drop tables, successful commits, etc.
>>>
>>> Notification configuration at namespace level may be leveraged with an
>>> override at the table level in case a specific table has an override.
>>>
>>> (Namespaces are a good abstraction for configuration management that is
>>> common to multiple tables, and if a namespace is mapped to a bucket the
>>> uniform IAM can be used, which generally works better on GCS).
>>>
>>> In terms of events, there are two kinds of events that we are interested
>>> in:
>>>
>>> 1. Change events: any change that happens in the table such as
>>> create, drop table/namespace.
>>>
>>> 2. Access events: any time a user accesses a table or calls list. It
>>> is good to send notifications because they can be consumed by the users if
>>> they want to track usage for a specific namespace or table.
>>>
>>> When this is implemented, could we provide generic hooks and events so
>>> that another notification system such as PubSub (on GCP) or Kafka can be
>>> leveraged?
>>>
>>> I’m happy to join any brainstorming discussion around this topic.
>>>
>>> Thanks,
>>> Mayur
>>>
>>> *From:* Kyle Bendickson <k...@tabular.io>
>>> *Sent:* Wednesday, December 1, 2021 1:23 AM
>>> *To:* dev@iceberg.apache.org
>>> *Subject:* Re: Iceberg event notification support
>>>
>>> I think this is a great idea, Jack. Thank you for bringing this up! +1
>>>
>>> There have been several people interested in having more observability
>>> (for example for table design patterns akin to how folks might monitor
>>> Hive) and events would be a big win for that and something users could use
>>> with a lot of their existing infra (Kafka, REST services, AWS or other
>>> cloud provider queue types).
>>>
>>> Spark has an existing interface, ExternalCatalogWithListener, which
>>> emits events we might hook into. I won't go into too much detail here. And
>>> while these Spark "ExternalCatalogEvents" shouldn't be how we define our
>>> own events, which should have their own type system, it could be a
>>> beneficial source of event hooks from within Spark. It also provides us
>>> table level query data we don't currently otherwise get. It's worth
>>> investigating if we haven't, though we might choose to forgo its
>>> complexity.
>>>
>>> I agree conceptually that most events should be registered at the table
>>> level, though I'd be open to having events of differing granularities,
>>> especially if this helps support cross-table patterns. But table level data
>>> should be prioritized first.
>>>
>>> If you have something to share or would like to make time to discuss,
>>> please count me in. This is an area I've been thinking about a bit lately
>>> as I've had quite some interest in observability and possible event-driven
>>> patterns.
>>>
>>> Best
>>> Kyle (GitHub @kbendick)
>>>
>>> On Tue, Nov 30, 2021 at 9:50 PM Neelesh Salian <neeleshssal...@gmail.com>
>>> wrote:
>>>
>>> +1 to this effort.
>>>
>>> There is value in adding support for Events - general bookkeeping and
>>> helping replay actions in the event of recovery.
>>>
>>> At the minimum we should aim to track the following across all catalogs:
>>>
>>> 1. Create actions
>>>
>>> 2. Alter actions
>>>
>>> 3. Delete actions
>>>
>>> across all tables, properties and namespaces.
>>>
>>> On Tue, Nov 30, 2021 at 9:12 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>> Hi everyone,
>>>
>>> I would like to start some initial discussions around Iceberg event
>>> notification support, because we might have some engineering resources to
>>> work on Iceberg notification integration with AWS services such as SNS,
>>> SQS, CloudWatch.
>>>
>>> As of today, we have a Listener interface and 3 events: ScanEvent,
>>> IncrementalScanEvent, CreateSnapshotEvent. There is a static registry
>>> called Listeners that registers the event listeners in the JVM.
>>>
>>> However, when I read the related code paths, my thought is that it might
>>> be better to register listeners per-table, based on the following
>>> observations:
>>>
>>> 1. Iceberg events are all table or sub-table level events. For any
>>> catalog or global level events, the catalog service can provide
>>> notifications, and Iceberg can be out of the picture.
>>>
>>> 2. A user might have multiple Iceberg catalogs defined, pointing to
>>> different catalog services (e.g. one to AWS Glue, one to a Hive
>>> metastore). The notifications from tables of these different catalogs
>>> should be directed to different listeners at least per catalog, instead of
>>> the same set of listeners that are registered globally.
>>>
>>> 3. Event listener configurations are usually static. It makes more sense
>>> to me to define it once and then repeatedly use it, instead of
>>> re-registering it every time I start an application.
>>>
>>> If we register the listeners at table level, we can add a hook in
>>> TableOperations to get a set of listeners to emit specific events. The
>>> listeners could be defined and serialized as a part of the table
>>> properties, or maybe even a part of the Iceberg spec.
>>>
>>> This is really just my brainstorming. Maybe it's a bit overkill, maybe I
>>> am missing the correct way to use the Listeners static registry. It would
>>> be great if anyone could provide more context or thoughts around this
>>> topic.
>>>
>>> Best,
>>> Jack Ye
>>>
>>> --
>>> Regards,
>>> Neelesh S. Salian
>>>
>>> --
>> - Ashish
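
For readers following the thread, the per-table registration idea (as opposed to one JVM-wide static registry) can be illustrated with a small Java sketch. This is only an illustration under assumptions: `TableScanEvent`, `TableListenerRegistry`, `PerTableListenerDemo`, and the `catalog.db.table` key format are hypothetical names invented here, and none of them are Iceberg's actual `Listener`, `Listeners`, or `ScanEvent` classes. The event also carries file and byte counts, in the spirit of the richer scan statistics suggested at the top of this message.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical event type with scan statistics; NOT Iceberg's ScanEvent.
final class TableScanEvent {
    final String table;       // assumed "catalog.db.table" identifier
    final long filesScanned;
    final long bytesScanned;

    TableScanEvent(String table, long filesScanned, long bytesScanned) {
        this.table = table;
        this.filesScanned = filesScanned;
        this.bytesScanned = bytesScanned;
    }
}

// Hypothetical registry that scopes listeners to a table instead of the whole JVM.
final class TableListenerRegistry {
    private final Map<String, List<Consumer<TableScanEvent>>> listeners =
        new ConcurrentHashMap<>();

    void register(String table, Consumer<TableScanEvent> listener) {
        listeners.computeIfAbsent(table, k -> new CopyOnWriteArrayList<>()).add(listener);
    }

    // Delivers the event only to listeners registered for the event's table
    // and returns how many listeners were notified.
    int emit(TableScanEvent event) {
        List<Consumer<TableScanEvent>> targets =
            listeners.getOrDefault(event.table, Collections.emptyList());
        targets.forEach(listener -> listener.accept(event));
        return targets.size();
    }
}

final class PerTableListenerDemo {
    static List<String> run() {
        TableListenerRegistry registry = new TableListenerRegistry();
        List<String> received = new ArrayList<>();

        // Two tables with the same name in different catalogs get separate listeners.
        registry.register("glue.db.orders", e -> received.add("glue:" + e.filesScanned));
        registry.register("hive.db.orders", e -> received.add("hive:" + e.filesScanned));

        // Only the Glue listener should see a scan of the Glue table.
        registry.emit(new TableScanEvent("glue.db.orders", 1200, 64_000_000L));
        return received;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In this sketch the registry is an ordinary object rather than a static, so each catalog (or table) could own its own instance. How such listeners would be configured or serialized (table properties, spec changes) is left open, as it is in the discussion above.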