Hi Dmitri,

Thanks a lot for the review and for calling this out! You’re right that PR
3205 only shows the persistence SPI, so the cross‑node behavior isn’t
obvious from the code alone. I’ve updated the design doc with a new section
“Multi‑node Coordination, Liveness, and Cleanup” that spells this out, but
let me summarize the answers to your questions here:

Retry hits a different node while the original is still running

   - Each request with an Idempotency-Key goes through an HTTP idempotency
   filter (IdempotencyFilter) that calls
   reserve(realmId, key, operationType, normalizedResourceId, expiresAt,
   executorId, now).
   - The Postgres implementation uses INSERT … ON CONFLICT DO NOTHING on
   (realm_id, idempotency_key), so only one node wins globally (OWNED); all
   others see DUPLICATE + the existing record.
   - The winning node is the owner and periodically calls
   updateHeartbeat(...) while the record is IN_PROGRESS, which keeps
   heartbeat_at fresh.
   - When a duplicate lands on another node, it looks at heartbeat_at and a
   short lease (lease_ttl, e.g. ~25s): if now − heartbeat_at ≤ lease_ttl,
   the original owner is treated as active. The duplicate does not re‑execute
   the mutation; it waits (with a small bounded poll) for the record to become
   FINALIZED, then replays the stored result (a rough sketch of this decision
   follows after the list).
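
To make the duplicate path concrete, here is a minimal sketch (in Java) of
the decision the filter could make once reserve(...) returns DUPLICATE,
assuming the record exposes its status and heartbeat_at. The names here
(DuplicateHandlingSketch, Action, the simplified IdempotencyRecord shape)
are illustrative only and not the exact types from PR 3205:

    import java.time.Duration;
    import java.time.Instant;

    // Hypothetical sketch of the duplicate-handling decision in the HTTP
    // filter; field and method names are illustrative, not the PR's API.
    final class DuplicateHandlingSketch {

      static final Duration LEASE_TTL = Duration.ofSeconds(25); // short liveness lease

      enum Status { IN_PROGRESS, FINALIZED }

      record IdempotencyRecord(
          Status status, Instant heartbeatAt, int httpStatus, String responseBody) {}

      enum Action { REPLAY_STORED_RESULT, WAIT_AND_REPLAY, RECONCILE }

      /** Decide what to do when reserve(...) returned DUPLICATE with an existing record. */
      static Action onDuplicate(IdempotencyRecord existing, Instant now) {
        if (existing.status() == Status.FINALIZED) {
          return Action.REPLAY_STORED_RESULT;  // replay stored http_status / response_*
        }
        Duration sinceHeartbeat = Duration.between(existing.heartbeatAt(), now);
        if (sinceHeartbeat.compareTo(LEASE_TTL) <= 0) {
          return Action.WAIT_AND_REPLAY;       // owner alive: bounded poll, then replay
        }
        return Action.RECONCILE;               // stale lease: hand off to the Reconciler
      }
    }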

Original node crashes without finalizing the record

   - In that case we still have an IN_PROGRESS row with an old heartbeat_at
   and the original executor_id.
   - After lease_ttl has passed, subsequent requests with the same key see
   IN_PROGRESS and a stale heartbeat (now − heartbeat_at > lease_ttl), so
   the filter treats the owner as dead and invokes a follow‑up Reconciler
   SPI.
   - The Reconciler does “finalize‑gap” reconciliation: it inspects the
   catalog state for the (operationType, resourceId, realmId) to decide
   whether the original mutation actually landed.
      - If the state already reflects the intended change, it can mark the
      record FINALIZED and allow duplicates to replay an equivalent success
      response without re‑executing the commit.
      - If the state does not reflect the change, it can re‑execute once on
      the new node (“takeover”).
   - The current PR just introduces the fields needed for this (executor_id,
   heartbeat_at, http_status, finalized_at, response_*); the filter +
   Reconciler wiring comes in a follow‑up change (a rough shape of the
   Reconciler is sketched below).
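
For the takeover path, the follow‑up Reconciler could look roughly like the
sketch below. This interface is not part of PR 3205; the name, method
signature, and Outcome values are assumptions about the follow‑up change:

    import java.time.Instant;

    // Hypothetical shape of the follow-up Reconciler SPI; names and
    // signatures are illustrative only, not part of PR 3205.
    interface Reconciler {

      /** Outcome of reconciling an IN_PROGRESS record whose owner's lease has expired. */
      enum Outcome {
        FINALIZED_AS_SUCCESS, // catalog already reflects the change: mark FINALIZED, replay success
        TAKEN_OVER            // change did not land: re-execute once on this node
      }

      /**
       * Inspect catalog state for (operationType, resourceId, realmId) and either
       * finalize the existing record or take over execution on the current node.
       */
      Outcome reconcile(
          String realmId,
          String operationType,
          String normalizedResourceId,
          String staleExecutorId,
          Instant now);
    }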

Cleanup of old idempotency records / collaboration across nodes

   - On first acceptance we set expires_at = firstAcceptance +
   idempotency-key-lifetime + grace. This is a long‑lived retention
   horizon, separate from the short liveness lease:
      - lease_ttl (seconds) → “is the current owner still alive?”
      - expires_at (minutes/hours+) → “how long must we remember this key
      for duplicates/replay and when can we delete it?”
   - The SPI exposes purgeExpired(before), which is implemented as DELETE
   FROM idempotency_records WHERE expires_at < ?. Any node (or a
   maintenance job) can call this periodically (see the sketch below). The
   delete is idempotent and keyed only by time, so multiple nodes can run
   cleanup concurrently without coordination; it’s intentionally a
   collaborative, best‑effort process.
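
As an illustration of how a node might participate in cleanup, here is a
small sketch of a periodic purge task. The scheduling interval and the
purgeExpired(Instant) signature shown here are assumptions for the example,
not taken verbatim from the PR:

    import java.time.Instant;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Hypothetical cleanup loop; any node can run it, since the time-keyed
    // delete is safe to execute concurrently from multiple nodes.
    final class IdempotencyCleanupTask {

      interface IdempotencyStore {
        int purgeExpired(Instant before); // DELETE ... WHERE expires_at < ?; returns rows deleted
      }

      static ScheduledExecutorService start(IdempotencyStore store) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Purge everything whose retention horizon (expires_at) has passed.
        scheduler.scheduleAtFixedRate(
            () -> store.purgeExpired(Instant.now()), 0, 15, TimeUnit.MINUTES);
        return scheduler;
      }
    }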

Thanks,
Huaxin


On Fri, Dec 5, 2025 at 4:56 PM Dmitri Bourlatchkov <[email protected]> wrote:

> Hi Huaxin,
>
> I looked at the PR briefly. It appears to add a persistence SPI for storing
> information related to the processing of Idempotency Keys.
>
> I do not have any particular comment about that interface itself ATM, but I
> suppose it will be used (eventually) in conjunction with coordinating
> request processing in multiple Polaris server nodes. This aspect is not
> apparent in PR 3205 and IIRC it is not discussed in depth in the proposal
> doc (please correct me if I missed that). The updateHeartbeat() method is
> declared, but how it is supposed to be used is not clear to me.
>
> So, in order to better understand how the proposed persistence SPI can work
> end-to-end, could you expand on your proposal for coordinating retry
> processing across nodes and how that will relate to persistence?
>
> Specifically, I mean the case when a retry is submitted by a client to one
> node while another node is still processing the original request. Another
> case is when the original request node crashes without updating the
> persisted data.
>
> Additionally, what is your vision regarding the cleanup of old Idempotency
> records? Is this a collaborative process where all server nodes participate
> equally?
>
> Thanks,
> Dmitri.
>
> On Thu, Dec 4, 2025 at 4:53 PM huaxin gao <[email protected]> wrote:
>
> > Hi all,
> >
> > Following the earlier idempotency proposal, I’ve opened a PR
> > <https://github.com/apache/polaris/pull/3205> that adds the first
> > implementation of idempotency retries to Polaris core and would like
> > feedback on the Java design.
> >
> > The PR introduces:
> >
> >    - org.apache.polaris.core.persistence.IdempotencyStore – an SPI to
> >    reserve/load/finalize/purge idempotency records (with ReserveResult /
> >    ReserveResultType).
> >    - org.apache.polaris.idempotency.IdempotencyRecord – the value type for
> >    stored idempotency metadata.
> >    - PostgresIdempotencyStore + PostgresIdempotencyStoreIT – a
> >    relational‑jdbc implementation backed by
> >    POLARIS_SCHEMA.idempotency_records, tested via Testcontainers/Postgres.
> >
> > Please take a look at the PR and reply here or comment on the change with
> > any concerns or suggestions.
> >
> > Thanks,
> >
> > Huaxin
> >
>
