Re: [DISCUSS] Proposal: Returning Commit Results from commit()

yuxia Thu, 18 Sep 2025 10:42:10 -0700

Hi, Jason. 

Thanks for bringing this up. When integrating Apache Iceberg with Apache Fluss
(incubating)[1], we also needed to capture commit results and store snapshot
IDs in our internal state to track tiering progress. However, we can’t simply
refresh the table to get the latest snapshot, since other writers may be
concurrently committing to the same table, and we only want the snapshot
generated by our own commit. To work around this, we used the Iceberg listener
mechanism[2], but this still feels a bit like a hack. It would be much cleaner
if Iceberg provided a standard interface to return commit results.
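
As a rough illustration of the race described above, here is a minimal sketch in plain Java. ToyTable and CommitRaceSketch are hypothetical in-memory stand-ins, not Iceberg or Fluss classes; the only assumption modeled is that every commit appends a snapshot ID and that a refresh reports whichever snapshot is latest, regardless of who wrote it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for a shared table; NOT the Iceberg API.
final class ToyTable {
    private final List<Long> snapshotIds = new ArrayList<>();
    private long nextId = 1;

    // Commits and returns the ID of the snapshot created by this call.
    synchronized long commitAndReturnId() {
        long id = nextId++;
        snapshotIds.add(id);
        return id;
    }

    // Models a refresh: reports the latest snapshot, whoever committed it.
    synchronized long latestSnapshotId() {
        return snapshotIds.get(snapshotIds.size() - 1);
    }
}

final class CommitRaceSketch {
    // Returns {ID produced by our commit, ID observed by a later refresh}.
    static long[] run() {
        ToyTable table = new ToyTable();
        long ours = table.commitAndReturnId();  // our commit
        table.commitAndReturnId();              // another writer commits before we refresh
        long seen = table.latestSnapshotId();   // the refresh now sees the other writer's snapshot
        return new long[] {ours, seen};
    }

    public static void main(String[] args) {
        long[] r = run();
        System.out.println("ours=" + r[0] + " latestAfterRefresh=" + r[1]);
    }
}
```

In this toy model the ID returned by the commit and the ID observed after the refresh differ, which is the situation a returned commit result would avoid.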

[1] https://github.com/apache/fluss
[2] https://github.com/apache/fluss/blob/03313a9b02dca57c87c406f0ecf396b08fa8726a/fluss-lake/fluss-lake-iceberg/src/main/java/org/apache/fluss/lake/iceberg/tiering/IcebergLakeCommitter.java#L328

Best regards,
Yuxia

From: "Russell Spitzer" <[email protected]>
To: "dev" <[email protected]>
Sent: Friday, September 12, 2025, 2:21:56 AM
Subject: Re: [DISCUSS] Proposal: Returning Commit Results from commit()

I don't think I'm opposed to this idea in general, but I think we probably need
to get some concrete examples of how this is going to be used by a consumer.
Since this would require modifying every implementation of XXXOperation that we
currently have, it's a rather heavy change and should probably be backed by
some concrete use cases where the client needs the exact metadata object
produced by the operation.

We also need to actually nail down the return type in the proposal, as you
mentioned. I don't think there is a problem returning a table metadata object,
but this would be rather complicated for any REST catalog interface. A REST
catalog would still require a round trip to the catalog to get the new state,
since there is no other way to know what was actually committed: the
metadata.json is written remotely, so we would still be leaning on
TableOperations to actually figure out what that is.

For the future, we will probably also have issues as we move towards a "send
changes" to the catalog model instead of a "send new state" model. In those
cases we will also have the issue of not actually knowing what was committed
without contacting the catalog after the commit succeeds. So we also need to
consider how the REST spec would need to change to support this.

On Thu, Sep 11, 2025 at 5:08 AM Jason Fine <[email protected]> wrote:

Hi all,

I’d like to start a discussion about PR #13987
(https://github.com/apache/iceberg/pull/13987), which adds support for
returning results from the commit() operation.

What this PR is about

The core idea is: when a client calls commit(), they should be able to
immediately obtain the updated information produced by the commit (whether
that’s a new snapshot or updated table metadata), instead of performing a
redundant refresh() afterwards. This is useful for distributed systems that want
to track their progress and save progress state. But I’m sure it will have many
other uses as well.

Calling refresh() unnecessarily is a slowdown, and it also counts against
your quota for rate limits in certain services.

I know some implementations currently don’t call refresh() while others do,
and the interface doesn’t enforce either behavior. The wanted information is
already available in the client after the commit, since the client produced it.

Key points/concerns raised

* API compatibility breakage: Several folks pointed out that returning the
updated snapshot or metadata from commit() changes the existing API contract.
We can resolve this by adding a new method instead.
* What counts as a snapshot: Some committed operations don’t produce
snapshots, e.g. metadata operations (schema changes, property updates). The
distinction between operations that produce snapshots and ones that just update
metadata matters. Perhaps always returning the TableMetadata, as mentioned
below, is a good solution.
* Behavior varies by catalog implementation: Some implementations already
refresh automatically (shouldRefresh etc.), others don’t. RestCatalog vs
Metastore vs others behave differently.

Proposal / Possible compromises

From the discussion, here are options that seem promising to me, or ways to
mitigate the drawbacks:

1. Add a new method, e.g. commitWithResult(...). This method would commit
and return the updated snapshot / metadata, but leave the existing commit()
method with its current behavior. That way we retain backward compatibility.
2. Return a read-only metadata snapshot. If returning the full metadata
object is too heavy or too risky, return a minimal “read-only” summary
containing just what is needed (snapshotId, maybe timestamp). This reduces
implementation risk. See
https://github.com/apache/iceberg/pull/14023/files#diff-c941602822c0e1c24d7de4ef5db76105d414d7b3f0b26df7ca75e76ba79e9663
(GitHub). This is also helpful if we want to avoid adding a new generic
argument to the SnapshotUpdate interface.
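
To make the two options above concrete, here is a minimal sketch of how they might fit together. Everything in it is an assumption for illustration: CommitSummary, UpdateSketch, and the toy implementations are hypothetical names, not types from Iceberg or from either PR, and the default-method approach is just one possible way to keep existing implementations compiling.

```java
import java.util.Optional;

// Hypothetical read-only commit summary (option 2); field names are illustrative.
final class CommitSummary {
    private final Long snapshotId;       // null when the commit produced no snapshot
    private final long timestampMillis;

    CommitSummary(Long snapshotId, long timestampMillis) {
        this.snapshotId = snapshotId;
        this.timestampMillis = timestampMillis;
    }

    // Empty for metadata-only commits such as property updates.
    Optional<Long> snapshotId() { return Optional.ofNullable(snapshotId); }
    long timestampMillis() { return timestampMillis; }
}

// Hypothetical stand-in for an update interface; NOT Iceberg's PendingUpdate.
interface UpdateSketch {
    void commit();  // existing contract, unchanged

    // Option 1: a new method with a default, so existing implementations still compile.
    default CommitSummary commitWithResult() {
        throw new UnsupportedOperationException("commitWithResult is not supported");
    }
}

// Toy append: a commit that produces a snapshot.
final class ToyAppend implements UpdateSketch {
    private final long newSnapshotId;
    ToyAppend(long newSnapshotId) { this.newSnapshotId = newSnapshotId; }

    @Override public void commit() { commitWithResult(); }

    @Override public CommitSummary commitWithResult() {
        return new CommitSummary(newSnapshotId, System.currentTimeMillis());
    }
}

// Toy property update: a metadata-only commit with no snapshot.
final class ToyPropertyUpdate implements UpdateSketch {
    @Override public void commit() { commitWithResult(); }

    @Override public CommitSummary commitWithResult() {
        return new CommitSummary(null, System.currentTimeMillis());
    }
}

// Legacy implementation that only implements commit() and inherits the default.
final class LegacyUpdate implements UpdateSketch {
    @Override public void commit() { }
}
```

In this sketch a caller would read `new ToyAppend(42L).commitWithResult().snapshotId()` to get its own snapshot ID, while a metadata-only commit yields an empty Optional, which is one way to handle the "what counts as a snapshot" concern without a new generic argument.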

Please let me know what you think about this suggestion and how we can move it
forward.

Thanks,

Jason

