+1 to the idea.

The events are very useful for building asynchronous services around Iceberg 
such as observability, garbage collection, compaction, asynchronous table 
deletion (to handle slow purge calls in the background), etc.

It seems like the Iceberg catalog is a good place to configure and set up the 
events because almost all access starts with the catalog and namespaces are 
managed at the catalog level. I feel that an extension to the catalog to send 
notifications would be nice because it can track events such as 
create/drop/alter properties on namespaces, list calls, 
create/alter/rename/drop tables, successful commits, etc.

Notifications could be configured at the namespace level, with an optional 
override at the table level for tables that need different settings.
(Namespaces are a good abstraction for managing configuration that is common 
to multiple tables, and if a namespace is mapped to a bucket, uniform IAM can 
be used, which generally works better on GCS.)
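
To make the override idea a bit more concrete, here is a minimal sketch of how 
a table-level setting could take precedence over a namespace-level default. 
All names (NotificationConfigResolver, the "target" strings) are hypothetical 
and only for illustration; nothing like this exists in Iceberg today.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: none of these names are Iceberg APIs.
final class NotificationConfigResolver {
  // namespace -> notification target (for example, a topic name)
  private final Map<String, String> namespaceTargets = new HashMap<>();
  // (namespace, table) -> target that overrides the namespace default
  private final Map<List<String>, String> tableOverrides = new HashMap<>();

  void setNamespaceTarget(String namespace, String target) {
    namespaceTargets.put(namespace, target);
  }

  void setTableOverride(String namespace, String table, String target) {
    tableOverrides.put(List.of(namespace, table), target);
  }

  // A table-level override wins; otherwise fall back to the namespace default.
  // An empty Optional means no notifications are configured at either level.
  Optional<String> resolve(String namespace, String table) {
    String override = tableOverrides.get(List.of(namespace, table));
    if (override != null) {
      return Optional.of(override);
    }
    return Optional.ofNullable(namespaceTargets.get(namespace));
  }
}
```

With a namespace default set for "sales" and an override for "sales.orders", 
resolve("sales", "orders") would return the override while any other table in 
"sales" would get the namespace default.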

In terms of events, there are two kinds of events that we are interested in:

1. Change events: any change that happens to a table or namespace, such as 
creating or dropping it.

2. Access events: any time a user accesses a table or calls list. It is good 
to send notifications for these because users can consume them if they want to 
track usage for a specific namespace or table.

When this is implemented, could we provide generic hooks and events so that 
another notification system such as Pub/Sub (on GCP) or Kafka can be leveraged?
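
As a rough illustration of what a generic hook could look like, here is a small 
sketch with one sink interface that a Kafka, Pub/Sub, or SNS integration might 
implement. Every type here (EventKind, CatalogEvent, EventSink, InMemorySink, 
EventDispatcher) is an assumption made up for this example, and the in-memory 
sink only stands in for a real messaging client.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: these types are illustrative, not Iceberg APIs.
enum EventKind { CHANGE, ACCESS }

final class CatalogEvent {
  final EventKind kind;
  final String operation; // for example "createTable" or "listTables"
  final String target;    // namespace or table name

  CatalogEvent(EventKind kind, String operation, String target) {
    this.kind = kind;
    this.operation = operation;
    this.target = target;
  }
}

// The generic hook: each notification backend would provide its own implementation.
interface EventSink {
  void publish(CatalogEvent event);
}

// In-memory sink standing in for a real messaging client.
final class InMemorySink implements EventSink {
  final List<CatalogEvent> received = new ArrayList<>();

  @Override
  public void publish(CatalogEvent event) {
    received.add(event);
  }
}

// Fans each event out to every configured sink, in registration order.
final class EventDispatcher {
  private final List<EventSink> sinks = new ArrayList<>();

  void addSink(EventSink sink) {
    sinks.add(sink);
  }

  void emit(CatalogEvent event) {
    for (EventSink sink : sinks) {
      sink.publish(event);
    }
  }
}
```

The point of the design is that the catalog would only know about EventSink, 
so adding another backend would not require touching the emitting code.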

I’m happy to join any brainstorming discussion around this topic.

Thanks,
Mayur

From: Kyle Bendickson <k...@tabular.io>
Sent: Wednesday, December 1, 2021 1:23 AM
To: dev@iceberg.apache.org
Subject: Re: Iceberg event notification support

I think this is a great idea, Jack. Thank you for bringing this up! +1

There have been several people interested in having more observability (for 
example, for table design patterns, akin to how folks might monitor Hive), and 
events would be a big win for that and something users could use with a lot of 
their existing infra (Kafka, REST services, AWS or other cloud provider queue 
types).

Spark has an existing interface, ExternalCatalogWithListener, which emits 
events we might hook into. I won't go into too much detail here. And while 
these Spark "ExternalCatalogEvents" shouldn't be how we define our own events, 
which should have their own type system, it could be a beneficial source of 
event hooks from within Spark. It also provides us table-level query data we 
don't currently otherwise get. It's worth investigating if we haven't, though 
we might choose to forgo its complexity.

I agree conceptually that most events should be registered at the table level, 
though I'd be open to having events of differing granularities, especially if 
this helps support cross-table patterns. But table-level data should be 
prioritized first.

If you have something to share or would like to make time to discuss, please 
count me in. This is an area I've been thinking about a bit lately as I've had 
quite some interest in observability and possible event-driven patterns.

Best
Kyle (GitHub @kbendick)

On Tue, Nov 30, 2021 at 9:50 PM Neelesh Salian 
<neeleshssal...@gmail.com> wrote:
+1 to this effort.
There is value in adding support for Events - general bookkeeping and helping 
replay actions in the event of recovery.
At a minimum, we should aim to track the following in all catalogs:
1. Create actions
2. Alter actions
3. Delete actions
across all tables, properties and namespaces.



On Tue, Nov 30, 2021 at 9:12 PM Jack Ye 
<yezhao...@gmail.com> wrote:
Hi everyone,

I would like to start some initial discussions around Iceberg event 
notification support, because we might have some engineering resources to work 
on Iceberg notification integration with AWS services such as SNS, SQS, 
CloudWatch.

As of today, we have a Listener interface and 3 events: ScanEvent, 
IncrementalScanEvent, and CreateSnapshotEvent. There is a static registry 
called Listeners that registers the event listeners in the JVM.
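
For anyone not familiar with the pattern, a JVM-wide static registry of this 
kind can be sketched roughly as below. This is a simplified, self-contained 
stand-in: SimpleListener, SimpleListeners, and ScanEventStub are names made up 
for illustration and are not the actual Iceberg classes or signatures.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Simplified stand-in for the pattern; not the actual Iceberg classes.
interface SimpleListener<E> {
  void onEvent(E event);
}

// Stand-in event type carrying only a table name.
final class ScanEventStub {
  final String tableName;

  ScanEventStub(String tableName) {
    this.tableName = tableName;
  }
}

final class SimpleListeners {
  // One global map per JVM: event class -> listeners registered for that class.
  private static final Map<Class<?>, List<SimpleListener<?>>> REGISTRY =
      new ConcurrentHashMap<>();

  private SimpleListeners() {}

  static <E> void register(SimpleListener<E> listener, Class<E> eventType) {
    REGISTRY.computeIfAbsent(eventType, k -> new CopyOnWriteArrayList<>()).add(listener);
  }

  // Delivers the event to every listener registered for its exact class.
  @SuppressWarnings("unchecked")
  static <E> void emit(E event) {
    List<SimpleListener<?>> listeners = REGISTRY.get(event.getClass());
    if (listeners == null) {
      return; // nobody registered for this event type
    }
    for (SimpleListener<?> listener : listeners) {
      ((SimpleListener<E>) listener).onEvent(event);
    }
  }
}
```

Because the map is static, every table and every catalog in the same JVM 
shares one set of listeners, and registration has to be repeated each time an 
application starts.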

However, when I read the related code paths, my thought is that it might be 
better to register listeners per-table, based on the following observations:
1. Iceberg events are all table or sub-table level events. For any catalog or 
global level events, the catalog service can provide notifications, and Iceberg 
can stay out of the picture.
2. A user might have multiple Iceberg catalogs defined, pointing to different 
catalog services. (e.g. one to AWS Glue, one to a Hive metastore). The 
notifications from tables of these different catalogs should be directed to 
different listeners at least per catalog, instead of the same set of listeners 
that are registered globally.
3. Event listener configurations are usually static. It makes more sense to me 
to define it once and then repeatedly use it, instead of re-registering it 
every time I start an application.

If we register the listeners at table level, we can add a hook in 
TableOperations to get a set of listeners to emit specific events. The 
listeners could be defined and serialized as a part of the table properties, or 
maybe even a part of the Iceberg spec.
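
A very rough sketch of the table-property idea, under heavy assumptions: the 
property key "event.listeners", the TableListenerLoader class, and the use of 
Consumer<String> as a listener are all hypothetical, and a real version would 
hook into TableOperations rather than take a plain map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch: the property key and all names are made up for illustration.
final class TableListenerLoader {
  static final String LISTENERS_PROPERTY = "event.listeners"; // hypothetical key

  // Listener name -> listener, standing in for a factory that builds real listeners.
  private final Map<String, Consumer<String>> known;

  TableListenerLoader(Map<String, Consumer<String>> known) {
    this.known = known;
  }

  // Reads a comma-separated list of listener names from the table properties
  // and returns the matching listeners; unknown names are skipped.
  List<Consumer<String>> listenersFor(Map<String, String> tableProperties) {
    List<Consumer<String>> result = new ArrayList<>();
    String value = tableProperties.get(LISTENERS_PROPERTY);
    if (value == null || value.trim().isEmpty()) {
      return result;
    }
    for (String name : value.split(",")) {
      Consumer<String> listener = known.get(name.trim());
      if (listener != null) {
        result.add(listener);
      }
    }
    return result;
  }
}
```

Since the listener names travel with the table properties, they would be 
defined once and picked up by every application that loads the table.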

This is really just my brainstorming. Maybe it's a bit overkill, maybe I am 
missing the correct way to use the Listeners static registry. It would be great 
if anyone could provide more context or thoughts around this topic.

Best,
Jack Ye

--
Regards,
Neelesh S. Salian
