Hi Dmitri,
Thanks a lot for the review and for calling this out! You’re right that PR
3205 only shows the persistence SPI, so the cross‑node behavior isn’t
obvious from the code alone. I’ve updated the design doc with a new section
“Multi‑node Coordination, Liveness, and Cleanup” that spells this out, but
let me summarize the answers to your questions here:
Retry hits a different node while the original is still running
- Each request with an Idempotency-Key goes through an HTTP idempotency
filter (IdempotencyFilter) that calls
reserve(realmId, key, operationType, normalizedResourceId, expiresAt,
executorId, now).
- The Postgres implementation uses INSERT … ON CONFLICT DO NOTHING on
(realm_id, idempotency_key), so only one node wins globally (OWNED); all
others see DUPLICATE + the existing record.
- The winning node is the owner and periodically calls
updateHeartbeat(...) while the record is IN_PROGRESS, which keeps
heartbeat_at fresh.
- When a duplicate lands on another node, it looks at heartbeat_at and a
short lease (lease_ttl, e.g. ~25s): if now − heartbeat_at ≤ lease_ttl,
the original owner is treated as active. The duplicate does not re‑execute
the mutation; it waits (with a small bounded poll) for the record to become
FINALIZED, then replays the stored result (a rough sketch of this flow
follows this list).
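To make that concrete, here is a rough sketch of the duplicate-handling flow.
This is not the code in PR 3205: Outcome / StoredRecord / Store below are
simplified stand-ins for ReserveResultType / IdempotencyRecord /
IdempotencyStore, and the constants are only illustrative defaults.

import java.time.Duration;
import java.time.Instant;

public final class DuplicateWaitSketch {

  // Simplified stand-ins for the real SPI types.
  enum Outcome { OWNED, DUPLICATE }

  record StoredRecord(String status, Instant heartbeatAt, int httpStatus, String responseBody) {}

  interface Store {
    Outcome reserve(String realmId, String key, Instant now); // INSERT ... ON CONFLICT DO NOTHING
    StoredRecord load(String realmId, String key);
  }

  static final Duration LEASE_TTL = Duration.ofSeconds(25);     // short liveness lease
  static final Duration POLL_INTERVAL = Duration.ofMillis(500); // small bounded poll
  static final int MAX_POLLS = 20;

  /** Returns the stored response to replay, or null if this node owns the key and must execute. */
  static StoredRecord handleRequest(Store store, String realmId, String key)
      throws InterruptedException {
    if (store.reserve(realmId, key, Instant.now()) == Outcome.OWNED) {
      return null; // we won the insert: run the mutation, heartbeat while IN_PROGRESS, finalize
    }
    for (int i = 0; i < MAX_POLLS; i++) {
      StoredRecord r = store.load(realmId, key);
      if ("FINALIZED".equals(r.status())) {
        return r; // replay the stored result; never re-execute the mutation here
      }
      boolean ownerAlive =
          Duration.between(r.heartbeatAt(), Instant.now()).compareTo(LEASE_TTL) <= 0;
      if (!ownerAlive) {
        // Stale heartbeat: hand off to the Reconciler path described in the next section.
        throw new IllegalStateException("owner lease expired, reconciliation required");
      }
      Thread.sleep(POLL_INTERVAL.toMillis());
    }
    throw new IllegalStateException("timed out waiting for FINALIZED record");
  }
}

The key property is that “who executes” is decided entirely by the single
INSERT, so two nodes can never both run the mutation while the owner’s lease
is fresh.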
Original node crashes without finalizing the record
- In that case we still have an IN_PROGRESS row with an old heartbeat_at
and the original executor_id.
- After lease_ttl has passed, subsequent requests with the same key see
IN_PROGRESS and a stale heartbeat (now − heartbeat_at > lease_ttl), so
the filter treats the owner as dead and invokes a follow‑up Reconciler
SPI.
- The Reconciler does “finalize‑gap” reconciliation (sketched after this
list): it inspects the catalog state for the (operationType, resourceId,
realmId) to decide whether the original mutation actually landed.
- If the state already reflects the intended change, it can mark the
record FINALIZED and allow duplicates to replay an equivalent success
response without re‑executing the commit.
- If the state does not reflect the change, it can re‑execute once on
the new node (“takeover”).
- The current PR just introduces the fields needed for this (executor_id,
heartbeat_at, http_status, finalized_at, response_*); the filter +
Reconciler wiring comes in a follow‑up change.
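For illustration only, the decision the Reconciler has to make could look
roughly like this. These are hypothetical, simplified shapes; as noted above,
the real filter + Reconciler wiring is a follow‑up change, and
CatalogInspector / Executor are just stand-in names here.

import java.time.Duration;
import java.time.Instant;

public final class ReconcilerSketch {

  record StaleRecord(String realmId, String key, String operationType,
                     String resourceId, Instant heartbeatAt) {}

  interface CatalogInspector {
    // True if the catalog already reflects the intended mutation for this operation/resource.
    boolean mutationLanded(String realmId, String operationType, String resourceId);
  }

  interface Store {
    // Marks the record FINALIZED with an equivalent success response for later replay.
    void finalizeRecord(String realmId, String key, int httpStatus, String responseBody);
  }

  interface Executor {
    // Re-executes the original mutation exactly once on this node ("takeover").
    void takeover(StaleRecord record);
  }

  static final Duration LEASE_TTL = Duration.ofSeconds(25);

  static void reconcile(StaleRecord r, String equivalentSuccessBody,
                        CatalogInspector catalog, Store store, Executor executor) {
    // Only act on records whose owner lease has actually expired.
    if (Duration.between(r.heartbeatAt(), Instant.now()).compareTo(LEASE_TTL) <= 0) {
      return; // owner still considered alive; duplicates keep waiting
    }
    if (catalog.mutationLanded(r.realmId(), r.operationType(), r.resourceId())) {
      // Finalize-gap: the commit landed but the record was never finalized.
      store.finalizeRecord(r.realmId(), r.key(), 200, equivalentSuccessBody);
    } else {
      // The commit never landed: take over and re-execute once on this node.
      executor.takeover(r);
    }
  }
}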
Cleanup of old idempotency records / collaboration across nodes
- On first acceptance we set expires_at = firstAcceptance +
idempotency-key-lifetime + grace. This is a long‑lived retention
horizon, separate from the short liveness lease:
- lease_ttl (seconds) → “is the current owner still alive?”
- expires_at (minutes/hours+) → “how long must we remember this key
for duplicates/replay and when can we delete it?”
- The SPI exposes purgeExpired(before), which is implemented as DELETE
FROM idempotency_records WHERE expires_at < ?. Any node (or a
maintenance job) can call this periodically. The delete is idempotent and
keyed only by time, so multiple nodes can run cleanup concurrently without
coordination; it’s intentionally a collaborative, best‑effort process (see
the small sketch below).
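To show why it is safe for every node to run cleanup independently, the delete
behind purgeExpired(before) is essentially just the following (the real logic
lives in PostgresIdempotencyStore; this only shows the shape of the
statement):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Instant;
import javax.sql.DataSource;

public final class PurgeSketch {

  static int purgeExpired(DataSource ds, Instant before) throws SQLException {
    String sql = "DELETE FROM idempotency_records WHERE expires_at < ?";
    try (Connection c = ds.getConnection();
         PreparedStatement ps = c.prepareStatement(sql)) {
      ps.setTimestamp(1, Timestamp.from(before));
      // Keyed only by time: two nodes running this concurrently just race to delete the
      // same expired rows, and the later DELETE simply matches fewer rows.
      return ps.executeUpdate();
    }
  }
}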
Thanks,
Huaxin
On Fri, Dec 5, 2025 at 4:56 PM Dmitri Bourlatchkov <[email protected]> wrote:
> Hi Huaxin,
>
> I looked at the PR briefly. It appears to add a persistence SPI for storing
> information related to the processing of Idempotency Keys.
>
> I do not have any particular comment about that interface itself ATM, but I
> suppose it will be used (eventually) in conjunction with coordinating
> request processing in multiple Polaris server nodes. This aspect is not
> apparent in PR 3205 and IIRC it is not discussed in depth in the proposal
> doc (please correct me if I missed that). The updateHeartbeat() method is
> declared, but how it is supposed to be used is not clear to me.
>
> So, in order to better understand how the proposed persistence SPI can work
> end-to-end, could you expand on your proposal for coordinating retry
> processing across nodes and how that will relate to persistence?
>
> Specifically, I mean the case when a retry is submitted by a client to one
> node while another node is still processing the original request. Another
> case is when the original request node crashes without updating the
> persisted data.
>
> Additionally, what is your vision regarding the cleanup of old Idempotency
> records? Is this a collaborative process where all server nodes participate
> equally?
>
> Thanks,
> Dmitri.
>
> On Thu, Dec 4, 2025 at 4:53 PM huaxin gao <[email protected]> wrote:
>
> > Hi all,
> >
> > Following the earlier idempotency proposal, I’ve opened a PR
> > <https://github.com/apache/polaris/pull/3205> that adds the first
> > implementation of idempotency retries to Polaris core and would like
> > feedback on the Java design.
> >
> > The PR introduces:
> >
> > -
> >
> > org.apache.polaris.core.persistence.IdempotencyStore – an SPI to
> > reserve/load/finalize/purge idempotency records (with ReserveResult /
> > ReserveResultType).
> >
> >
> > -
> >
> > org.apache.polaris.idempotency.IdempotencyRecord – the value type for
> > stored idempotency metadata.
> >
> >
> > -
> >
> > PostgresIdempotencyStore + PostgresIdempotencyStoreIT – a
> > relational‑jdbc implementation backed by
> > POLARIS_SCHEMA.idempotency_records, tested via
> > Testcontainers/Postgres.
> >
> > Please take a look at the PR and reply here or comment on the change with
> > any concerns or suggestions.
> >
> > Thanks,
> >
> > Huaxin
> >
>