Re: [DISCUSS] Proposal: Returning Commit Results from commit()

Yufei Gu Mon, 15 Sep 2025 10:51:39 -0700

Hi Endi, Could you elaborate on your use case? Once a commit succeeds, the
client already holds the latest snapshot as it's a part of the request, so
what’s the need for an additional call? For any subsequent commits, the
client would have to reload the table regardless.


Yufei

On Mon, Sep 15, 2025 at 9:34 AM Endi Caushi <[email protected]> wrote:

> Hi
>
> >  it's a rather heavy change and should probably be backed by some
>> concrete use cases where the client needs the exact metadata object
>> produced by the operation.
>
>
> Apologies for chiming in late, but I wanted to share an example from our
> side.
> We ingest our data pipelines incrementally using PySpark, leveraging the
> snapshotID as a watermark.
> After each run, we store the new snapshotID in the snapshot summary as the
> updated watermark. It would be very convenient if the commit() operation
> returned the snapshotID directly, as it would save us from making an
> additional round-trip just to retrieve the latest snapshot.
>
>
> Best regards,
> Endi
>
>
> On Sun, 14 Sept 2025 at 21:44, yuxia <[email protected]> wrote:
>
>> Hi, Peter,
>> Yes, you're right. I meant Apache Fluss, sorry for the mistake.
>>
>> Thanks for your suggestion, the workaround you proposed can also solve
>> our problem.
>>
>>
>> Best regards,
>> Yuxia
>>
>> ------------------------------
>> *From: *"Jason Fine" <[email protected]>
>> *To: *"dev" <[email protected]>
>> *Sent: *Sunday, 14 September 2025, 7:43:05 PM
>> *Subject: *Re: [DISCUSS] Proposal: Returning Commit Results from commit()
>>
>> I was about to say that we should ask the Flink/Fluss team about this
>> since they also do streaming stateful transforms so I expected you would
>> need it as well!
>>
>>
>>
>> That’s a neat trick with the listener, but I agree it’s a little hacky
>> and a cleaner interface would be nice. In our case it’s OK if we get a
>> newer commit, so we can rely on the refreshed data. It is also another good
>> point that if you rely on calling currentSnapshot() today, you may get
>> newer data than desired, since the implementation may call refresh() after a
>> commit.
>>
>> Russell, regarding some of the points you brought up: I think adding a
>> new method to the interface will help here. In future versions of the
>> REST catalog that may only send updates, the method that doesn’t return a
>> result can just send the update and skip loading the response data,
>> while the new method will load it.
>> Regarding implementing all the XXXOperations, I found it to be not much
>> work since most of them inherit from the same base class and the info is
>> available to the operation itself. In the future if the catalogs get more
>> complex with the REST partial update request they may require some more
>> work to get the required info back from the TableOperations class. For now
>> though it seems like it’s not necessary.
>>
>>
>>
>> Regarding the return type I think there are two decent options:
>> 1. Return TableMetadata (or a minimized ReadOnly interface version of it
>> since it’s currently not in the API project)
>>
>>    Pros – Contains all data that the user may want
>>
>>    Cons – Might be slower and heavier for future implementations of
>> things like the rest catalog
>>
>> 2. Return just locally created info particular to the current operation
>>
>>    Pros – Can always return locally without additional network calls
>>
>>    Cons
>>
>>    - Might not always contain all the info the user wants
>>    - Implementation requires more work as each Operation is different
>>    - Might require an additional interface or expanding SnapshotUpdate
>>    with an additional generic argument
>>
>>
>>
>>
>>
>> *From: *Péter Váry <[email protected]>
>> *Date: *Friday, 12 September 2025 at 15:53
>> *To: *[email protected] <[email protected]>
>> *Subject: *Re: [DISCUSS] Proposal: Returning Commit Results from commit()
>>
>>
>> Hi Yuxia,
>>
>>
>>
>> You meant Apache Fluss instead of Apache Flink, right? :)
>>
>>
>>
>> As a workaround in the meantime, you could add a UUID changeset
>> identifier to the commit summary. After refresh, you can find the
>> corresponding snapshot by searching the commits for this UUID.
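
A minimal sketch of the UUID workaround described above, written against hypothetical stand-in types rather than the Iceberg API. TaggedSnapshot, CommitTagLookup, and the "changeset-id" key are assumptions made for illustration. In Iceberg itself the tag would presumably be written into the snapshot summary before commit() and read back from each snapshot's summary after refresh().

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

// Hypothetical stand-in for a table snapshot: an ID plus its summary properties.
final class TaggedSnapshot {
    final long snapshotId;
    final Map<String, String> summary;

    TaggedSnapshot(long snapshotId, Map<String, String> summary) {
        this.snapshotId = snapshotId;
        this.summary = summary;
    }
}

final class CommitTagLookup {
    // Hypothetical summary key; any key the writer controls would do.
    static final String TAG_KEY = "changeset-id";

    // Generates a fresh identifier to store in the summary before committing.
    static String newTag() {
        return UUID.randomUUID().toString();
    }

    // After a refresh, scans the snapshots for the one carrying our tag.
    // Snapshots from concurrent writers carry a different tag or none at all.
    static Optional<Long> findSnapshotId(List<TaggedSnapshot> snapshots, String tag) {
        return snapshots.stream()
                .filter(s -> tag.equals(s.summary.get(TAG_KEY)))
                .map(s -> s.snapshotId)
                .findFirst();
    }
}
```

The lookup costs a scan over the snapshot list after every refresh, which is part of why a direct return value from commit would be cleaner.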
>>
>>
>>
>> On Fri, 12 Sept 2025 at 14:30, yuxia <[email protected]> wrote:
>>
>> Hi, Jason.
>>
>>
>>
>> Thanks for bringing this up. When integrating Apache Iceberg with Apache
>> Flink (incubating)[1], we also needed to capture commit results and store
>> snapshot IDs in our internal state to track tiering progress. However, we
>> can’t simply refresh the table to get the latest snapshot, since other
>> writers may be concurrently committing to the same table — and we only want
>> the snapshot generated by our own commit. To work around this, we used the
>> Iceberg listener mechanism[2], but this still feels a bit like a hack. It
>> would be much cleaner if Iceberg provided a standard interface to return
>> commit results.
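
A rough sketch of the listener approach described above, assuming that snapshot-created events are published only inside the process that performed the commit, so a listener never sees other writers' snapshots. SnapshotCreated, CommitEvents, and CommitCapture are hypothetical stand-ins, not Iceberg's or Fluss's event classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical event published locally when a commit creates a snapshot.
final class SnapshotCreated {
    final String tableName;
    final long snapshotId;

    SnapshotCreated(String tableName, long snapshotId) {
        this.tableName = tableName;
        this.snapshotId = snapshotId;
    }
}

// Hypothetical in-process registry: listeners only see commits made by this process.
final class CommitEvents {
    private final List<Consumer<SnapshotCreated>> listeners = new ArrayList<>();

    void register(Consumer<SnapshotCreated> listener) {
        listeners.add(listener);
    }

    void publish(SnapshotCreated event) {
        for (Consumer<SnapshotCreated> listener : listeners) {
            listener.accept(event);
        }
    }
}

// Remembers the snapshot ID of the most recent local commit to one table.
final class CommitCapture implements Consumer<SnapshotCreated> {
    private final String tableName;
    private Long lastSnapshotId; // null until a matching event arrives

    CommitCapture(String tableName) {
        this.tableName = tableName;
    }

    @Override
    public void accept(SnapshotCreated event) {
        // Ignore events for other tables committed from the same process.
        if (tableName.equals(event.tableName)) {
            lastSnapshotId = event.snapshotId;
        }
    }

    Long lastSnapshotId() {
        return lastSnapshotId;
    }
}
```

The capture object holds side-channel state that the caller must read right after its own commit, which is what makes this pattern feel like a hack compared with a return value.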
>>
>>
>>
>> [1] https://github.com/apache/fluss
>>
>> [2]
>> https://github.com/apache/fluss/blob/03313a9b02dca57c87c406f0ecf396b08fa8726a/fluss-lake/fluss-lake-iceberg/src/main/java/org/apache/fluss/lake/iceberg/tiering/IcebergLakeCommitter.java#L328
>>
>>
>>
>> Best regards,
>> Yuxia
>>
>>
>> ------------------------------
>>
>> *From: *"Russell Spitzer" <[email protected]>
>> *To: *"dev" <[email protected]>
>> *Sent: *Friday, 12 September 2025, 2:21:56 AM
>> *Subject: *Re: [DISCUSS] Proposal: Returning Commit Results from commit()
>>
>>
>>
>> I don't think I'm opposed to this idea in general but I think we probably
>> need to get some concrete examples of how this is going to be used by a
>> consumer. Since this would require modifying every implementation of
>> XXXOperation that we currently have, it's a rather heavy change and should
>> probably be backed by some concrete use cases where the client needs the
>> exact metadata object produced by the operation.
>>
>> We also need to actually nail down in the proposal the return type as you
>> mentioned. I don't think there is a problem returning a table metadata
>> object but this would be rather complicated for any REST catalog interface.
>> A Rest Catalog would still require a round trip to the Catalog to get the
>> new state since there is no other way to know what was actually committed,
>> as the metadata.json is written remotely, so we would still be leaning on
>> TableOperations to actually figure out what that is.
>>
>> For the future, we probably will also have issues as we move towards a
>> "send changes" to the catalog model instead of a "send new state" model.
>> In those cases we will also have the issue of not actually knowing what
>> was committed without contacting the catalog after the commit succeeds. So
>> we also need to consider how the REST Spec would need to change to support
>> this.
>>
>>
>>
>>
>>
>> On Thu, Sep 11, 2025 at 5:08 AM Jason Fine <[email protected]>
>> wrote:
>>
>> Hi all,
>>
>> I’d like to start a discussion about PR #13987
>> <https://github.com/apache/iceberg/pull/13987>  which adds support for
>> returning results from the commit() operation.
>> ------------------------------
>>
>> *What this PR is about*
>>
>> The core idea is: when a client calls commit(), they should be able to
>> immediately obtain the *updated information produced by the commit* (whether
>> that’s a new snapshot or updated table metadata), instead of performing a
>> redundant refresh() afterwards. This is useful for distributed systems
>> that want to track their progress and save progress state. But I’m sure it
>> will have many other uses as well.
>>
>> Calling refresh unnecessarily is a slowdown, but it also counts
>> against your quota for rate limits in certain services.
>>
>>
>>
>> I know some implementations currently don’t call refresh() while others do,
>> and the interface doesn’t enforce either behavior. The wanted information is
>> already available in the client after the commit, since the client produced it.
>>
>>
>> ------------------------------
>>
>> *Key points/concerns raised*
>>
>>    - *API compatibility breakage*: Several folks pointed out that
>>    returning updated snapshot or metadata from commit() changes the
>>    existing API contract. We can resolve this by adding a new method
>>    instead.
>>    - *What counts as a snapshot*: Some committed operations *don’t* produce
>>    snapshots, e.g. metadata operations (schema changes, property updates).
>>    The distinction between operations that produce snapshots and ones that
>>    just update metadata matters. Perhaps always returning the TableMetadata,
>>    as mentioned below, is a good solution.
>>    - *Behavior varies by catalog implementation*: Some implementations
>>    already refresh automatically (shouldRefresh etc.), others don’t.
>>    RestCatalog vs Metastore vs others behave differently.
>>
>> ------------------------------
>>
>> *Proposal / Possible compromises*
>>
>> From the discussion, here are options that seem promising to me, or ways
>> to mitigate the drawbacks:
>>
>>    1. *Add a new method, e.g. **commitWithResult(...)*
>>    This method would commit and return the updated snapshot / metadata,
>>    but leave the existing commit() method with its current behavior.
>>    That way we retain backward compatibility.
>>    2. *Return a read-only metadata snapshot*
>>    If returning the full metadata object is too heavy or too risky,
>>    return a minimal “read-only” summary containing just what is needed
>>    (snapshotId, maybe timestamp). This reduces implementation risk. See
>>    <https://github.com/apache/iceberg/pull/14023/files#diff-c941602822c0e1c24d7de4ef5db76105d414d7b3f0b26df7ca75e76ba79e9663>.
>>    This is also helpful if we want to avoid adding a new generic argument to
>>    the SnapshotUpdate interface.
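
The two options above can be sketched together as follows. This is only an illustration of the shape being discussed, not the code in either PR: CommitSummary, PendingUpdateSketch, and InMemoryAppend are hypothetical names, and the in-memory class exists only to exercise the interface.

```java
// Hypothetical read-only result: only what the caller needs after a commit.
interface CommitSummary {
    long snapshotId();
    long timestampMillis();
}

// Hypothetical shape of a pending update offering both commit variants.
interface PendingUpdateSketch {
    // Existing behavior: commit and return nothing, so current callers are unaffected.
    void commit();

    // Proposed addition: commit and return a summary of what was produced.
    CommitSummary commitWithResult();
}

// Toy in-memory implementation; each commit produces the next snapshot ID.
final class InMemoryAppend implements PendingUpdateSketch {
    private long nextSnapshotId = 1;

    @Override
    public void commit() {
        // Same work as commitWithResult(), with the result discarded.
        commitWithResult();
    }

    @Override
    public CommitSummary commitWithResult() {
        final long id = nextSnapshotId++;
        final long ts = System.currentTimeMillis();
        // The summary is built from values the committer already holds locally.
        return new CommitSummary() {
            @Override
            public long snapshotId() {
                return id;
            }

            @Override
            public long timestampMillis() {
                return ts;
            }
        };
    }
}
```

Operations that produce no snapshot, such as schema changes, would need a different or optional field here, which is the "what counts as a snapshot" concern listed earlier.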
>>
>>
>>
>> Please let me know what you think about this suggestion and how we can
>> move it forward.
>>
>>
>> Thanks,
>>
>> Jason
>>
>>
>>
>>
>>
>> The information transmitted by Qlik is intended only for the person or
>> entity to which it is addressed and may contain confidential and/or
>> privileged material. Any review, retransmission, dissemination or other use
>> of, or taking of any action in reliance upon, this information by persons
>> or entities other than the intended recipient is prohibited. If you
>> received this in error, please contact the sender and delete the material
>> from any computer. Qlik's Privacy & Cookie Notice
>> <https://www.qlik.com/us/legal/privacy-and-cookie-notice> describes how
>> we handle personal information
>>
>>
>>
>>
>>
>>
