Hey Jack,

This is a good idea.

I am not sure if per-table events are in scope for these event
notifications. I would like to see event notifications for the following
events as well:

   - Schema changes on an individual Iceberg table
   - Commits in an Iceberg table
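
To make the ask a bit more concrete, here is a rough sketch of what
per-table events like these could look like. Everything in it is
hypothetical: SchemaChangeEvent, CommitEvent, TableListeners and the emit
method are names made up for illustration and are not existing Iceberg
classes. The only idea borrowed from Iceberg is a single-method listener;
the registry is keyed by table name as well as event type, which is an
assumption about how per-table registration might work.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch only: none of these types exist in Iceberg under these names.
public class PerTableEventsSketch {

  /** Single-method listener; E is the event type it handles. */
  interface Listener<E> {
    void notify(E event);
  }

  /** Hypothetical event: the schema of one table changed. */
  static final class SchemaChangeEvent {
    final String tableName;
    final int oldSchemaId;
    final int newSchemaId;

    SchemaChangeEvent(String tableName, int oldSchemaId, int newSchemaId) {
      this.tableName = tableName;
      this.oldSchemaId = oldSchemaId;
      this.newSchemaId = newSchemaId;
    }
  }

  /** Hypothetical event: a commit to one table succeeded. */
  static final class CommitEvent {
    final String tableName;
    final long snapshotId;

    CommitEvent(String tableName, long snapshotId) {
      this.tableName = tableName;
      this.snapshotId = snapshotId;
    }
  }

  /** Registry keyed by table name and event type, instead of one JVM-wide list per event type. */
  static final class TableListeners {
    private final Map<String, Map<Class<?>, List<Listener<?>>>> listeners =
        new ConcurrentHashMap<>();

    <E> void register(String tableName, Class<E> eventType, Listener<E> listener) {
      listeners
          .computeIfAbsent(tableName, t -> new ConcurrentHashMap<>())
          .computeIfAbsent(eventType, c -> new CopyOnWriteArrayList<>())
          .add(listener);
    }

    /** Delivers the event only to listeners registered for this table and this exact event class. */
    @SuppressWarnings("unchecked")
    <E> void emit(String tableName, E event) {
      List<Listener<?>> registered =
          listeners.getOrDefault(tableName, Map.of()).getOrDefault(event.getClass(), List.of());
      for (Listener<?> listener : registered) {
        // Safe by construction: register() only stores Listener<E> under Class<E>.
        ((Listener<E>) listener).notify(event);
      }
    }
  }

  public static void main(String[] args) {
    TableListeners registry = new TableListeners();
    registry.register("db.orders", CommitEvent.class,
        e -> System.out.println("commit on " + e.tableName + ", snapshot " + e.snapshotId));
    registry.register("db.orders", SchemaChangeEvent.class,
        e -> System.out.println("schema of " + e.tableName + " changed: "
            + e.oldSchemaId + " -> " + e.newSchemaId));

    registry.emit("db.orders", new CommitEvent("db.orders", 42L));
    registry.emit("db.orders", new SchemaChangeEvent("db.orders", 0, 1));
    // No listener is registered for db.other, so this event is dropped.
    registry.emit("db.other", new CommitEvent("db.other", 7L));
  }
}
```

With something along these lines, a consumer that only cares about one
table would subscribe to that table's events without seeing commits from
every other table in the JVM.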

Thanks
Vivek

On Thu, Dec 2, 2021 at 5:44 AM Ashish Singh <singhashish....@gmail.com>
wrote:

> Hey Jack,
>
> Thanks for sharing your thoughts on this. We also ran into a need for richer
> event notifications for table operations, for various reasons including
> enforcement of rules such as ownership. While looking into potential ways to
> add pluggable logic during various table operations, we primarily considered
> the following two options.
>
> 1. Enhance Iceberg’s event notification to include more event types.
> 2. Use the existing pre and post table operation hook interface from HMS.
>
> We decided to go with the second option for the following reasons.
>
> 1. All table operations (SQL or programmatic access) go through HMS. Users
> don’t have to worry about configuring listeners per app and also won’t be
> able to remove mandatory listeners.
> 2. We, and I am guessing most HMS prod installations, already have HMS
> event pipelines set up that we will be able to reuse.
> 3. It gives us the ability to take action both pre and post commit.
>
> The extra compute needed to construct metadata from Iceberg metadata and
> manifest files is a drawback of the second approach, though we are not too
> concerned about it as of now.
>
> I will be curious to learn what others think of this approach.
> Regardless, please count me in on any discussion along these lines as well.
> I will loop in some more folks from Pinterest who are actively looking into
> this as well.
>
> - Ashish
>
> On Wed, Dec 1, 2021 at 6:51 AM Mayur Srivastava <
> mayur.srivast...@twosigma.com> wrote:
>
>> +1 to the idea.
>>
>>
>>
>> The events are very useful for building asynchronous services around
>> Iceberg such as observability, garbage collection, compaction, asynchronous
>> table deletion (to handle slow purge calls in the background), etc.
>>
>>
>>
>> It seems like the Iceberg catalog is a good place to configure and set up
>> the events because almost all access starts with the catalog and namespaces
>> are managed at the catalog level. I feel that an extension to the catalog to
>> send notifications would be nice because it can track events such as
>> create/drop/alter properties on namespaces, list calls,
>> create/alter/rename/drop tables, successful commits, etc.
>>
>>
>>
>> Notification configuration could be set at the namespace level, with an
>> override at the table level in case a specific table needs one.
>>
>> (Namespaces are a good abstraction for managing configuration that is
>> common to multiple tables, and if a namespace is mapped to a bucket then
>> uniform IAM can be used, which generally works better on GCS.)
>>
>>
>>
>> In terms of events, there are two kinds of events that we are interested
>> in:
>>
>> 1. Change events: any change that happens in the table, such as
>> create or drop of a table/namespace.
>>
>> 2. Access events: any time a user accesses a table or calls list. It is
>> good to send notifications for these because users can consume them if they
>> want to track usage for a specific namespace or table.
>>
>>
>>
>> When this is implemented, could we provide generic hooks and events so
>> that another notification system such as PubSub (on GCP) or Kafka can be
>> leveraged?
>>
>>
>>
>> I’m happy to join any brainstorming discussion around this topic.
>>
>>
>>
>> Thanks,
>>
>> Mayur
>>
>>
>>
>> *From:* Kyle Bendickson <k...@tabular.io>
>> *Sent:* Wednesday, December 1, 2021 1:23 AM
>> *To:* dev@iceberg.apache.org
>> *Subject:* Re: Iceberg event notification support
>>
>>
>>
>> I think this is a great idea, Jack. Thank you for bringing this up! +1
>>
>>
>>
>> There have been several people interested in having more observability
>> (for example for table design patterns akin to how folks might monitor
>> Hive) and events would be a big win for that and something users could use
>> with a lot of their existing infra (Kafka, REST services, AWS or other
>> cloud provider queue types).
>>
>>
>>
>> Spark has an existing interface, ExternalCatalogWithListener, which emits
>> events we might hook into. I won't go into too much detail here. And while
>> these Spark "ExternalCatalogEvents" shouldn't be how we define our own
>> events, which should have their own type system, it could be a beneficial
>> source of event hooks from within Spark. It also provides us table level
>> query data we don't currently otherwise get. It's worth investigating if we
>> haven't, though we might choose to forgo its complexity.
>>
>>
>>
>> I agree conceptually that most events should be registered at the table
>> level, though I'd be open to having events of differing granularities.
>> Especially if this helps support cross-table patterns. But table level data
>> should be prioritized first.
>>
>>
>>
>> If you have something to share or would like to make time to discuss,
>> please count me in. This is an area I've been thinking about a bit lately
>> as I've had quite some interest in observability and possible event-driven
>> patterns.
>>
>>
>>
>> Best
>>
>> Kyle (GitHub @kbendick)
>>
>>
>>
>> On Tue, Nov 30, 2021 at 9:50 PM Neelesh Salian <neeleshssal...@gmail.com>
>> wrote:
>>
>> +1 to this effort.
>>
>> There is value in adding support for Events - general bookkeeping and
>> helping replay actions in the event of recovery.
>>
>> At a minimum, we should aim to track the following in all catalogs:
>>
>> 1. Create actions
>>
>> 2. Alter actions
>>
>> 3. Delete actions
>>
>> across all tables, properties and namespaces.
>>
>>
>> On Tue, Nov 30, 2021 at 9:12 PM Jack Ye <yezhao...@gmail.com> wrote:
>>
>> Hi everyone,
>>
>>
>>
>> I would like to start some initial discussions around Iceberg event
>> notification support, because we might have some engineering resources to
>> work on Iceberg notification integration with AWS services such as SNS,
>> SQS, CloudWatch.
>>
>>
>>
>> As of today, we have a Listener interface and 3 events: ScanEvent,
>> IncrementalScanEvent, and CreateSnapshotEvent. There is a static registry
>> called Listeners that registers the event listeners in the JVM.
>>
>>
>>
>> However, when I read the related code paths, my thought is that it might
>> be better to register listeners per-table, based on the following
>> observations:
>>
>> 1. Iceberg events are all table or sub-table level events. For any
>> catalog or global level events, the catalog service can provide
>> notifications, and Iceberg can stay out of the picture.
>>
>> 2. A user might have multiple Iceberg catalogs defined, pointing to
>> different catalog services. (e.g. one to AWS Glue, one to a Hive
>> metastore). The notifications from tables of these different catalogs
>> should be directed to different listeners at least per catalog, instead of
>> the same set of listeners that are registered globally.
>>
>> 3. Event listener configurations are usually static. It makes more sense
>> to me to define it once and then repeatedly use it, instead of
>> re-registering it every time I start an application.
>>
>>
>>
>> If we register the listeners at table level, we can add a hook in
>> TableOperations to get a set of listeners to emit specific events. The
>> listeners could be defined and serialized as a part of the table
>> properties, or maybe even a part of the Iceberg spec.
>>
>>
>>
>> This is really just my brainstorming. Maybe it's a bit overkill, or maybe I
>> am missing the correct way to use the Listeners static registry. It would
>> be great if anyone could provide more context or thoughts around this
>> topic.
>>
>>
>>
>> Best,
>>
>> Jack Ye
>>
>>
>>
>> --
>>
>> Regards,
>>
>> Neelesh S. Salian
>>
>>
>>
>> --
> - Ashish
>