Hi Huaxin,

Thanks for responding. A few corrections and clarifications.
My PR review described three consequences of the missing caller identity
binding and the credential leakage attack vector. Those were not
suggestions for a Phase 1/Phase 2 split. The offline decision to split
into phases, and what each phase contains, was yours [1].

On the phasing: the concern is not PR size. The concern is that Phase 1
alone ships a full authorization bypass. The filter replays cached
responses before Polaris's resource-level authorization runs. Stripping
credentials does not fix that. Any authenticated user with a valid
idempotency key can receive a cached response for a resource they are
not authorized to access, with or without credentials in that response.
"We will not release it until entire idempotency coding is finished" is
reassuring, but the architecture question remains: can the generic
filter approach support re-vending at all? My original email explained
why I believe it cannot: it would require the filter to authorize the
caller, resolve storage locations, call cloud IAM, and manage credential
caching. That is the catalog handler's job, not a filter's.

The questions from my original email that still need answers:

1. How does the binding prevent cross-principal cache hits without
   including the caller's identity?
2. How does Phase 2 re-vend credentials from a generic HTTP filter
   without reimplementing the catalog handler's authorization and
   credential vending logic?
3. Has the operational overhead (3 DB round-trips per mutation on the
   catalog's connection pool) been benchmarked on any backend?

I'd like to resolve these on this thread before more code is written.

Best,
Robert

[1] https://github.com/apache/polaris/pull/3803#issuecomment-4180969838

On Sun, Apr 5, 2026 at 7:55 AM huaxin gao <[email protected]> wrote:

> Hi Robert,
>
> Thanks for raising this on the mailing list. I acknowledge the design
> concerns you listed.
>
> To clarify: I picked your option 2 (strip credentials and re-vend on
> replay), which you suggested in the PR review. The offline discussion
> with Yufei and Prashant was simply to verify that I picked the best
> of the three options you listed in the review.
>
> The reason I split it into two phases is that the current PR is
> already around 3000 lines. Adding re-vending logic on top would make
> it very hard to review. This is a big feature and we need to do it
> piece by piece. We will not release it until the entire idempotency
> coding is finished.
>
> I'm happy to discuss the architecture further on this thread, whether
> the filter approach can support re-vending, or whether we need a
> handler-level design. I may be slow to respond next week but I'll
> follow up after that.
>
> Thanks,
>
> Huaxin
>
> On Sat, Apr 4, 2026 at 9:34 AM Robert Stupp <[email protected]> wrote:
>
> > Hi all,
> >
> > I need to raise some fundamental design concerns with the current
> > Idempotency-Key implementation in PR #3803 [1]. Some of these were
> > raised in November 2025 [2] and were acknowledged [3] but never
> > addressed in the implementation.
> >
> > I raised the caller identity concern on the mailing list in
> > November 2025 [2]. In December 2025 [4], I reiterated that all
> > existing implementations I could find do request fingerprinting
> > including the request body, and warned that the idempotency
> > subsystem itself must not become a source of failures. The PR was
> > opened in February 2026 without addressing either point.
> >
> > The credential leakage consequences surfaced during code review
> > [5]. The response was an offline decision between three
> > contributors, without mailing list discussion [6]. The decision was
> > to split into Phase 1 (strip credentials, which produces broken
> > responses but does not fix the authorization bypass) and Phase 2
> > (re-vend on replay, which would require a complete redesign and
> > reimplementation).
> > Details below.
> >
> > I think we need to resolve these at the design level before more
> > code is written. I will not do further code reviews on this PR
> > until these design concerns are resolved.
> >
> >
> > 1. Caller identity is not part of the idempotency binding
> >
> > The caller's identity is not part of the binding, and the filter
> > replays cached responses before Polaris's resource-level
> > authorization runs. Any authenticated user with a valid idempotency
> > key can receive a cached response for a resource they are not
> > authorized to access, including vended storage credentials. This is
> > a full authorization bypass.
> >
> > Idempotency keys are HTTP headers that end up in access logs,
> > observability tooling, and debug output. They cannot be a security
> > boundary. I described the attack vector in my PR review [5].
> >
> > The proposed Phase 2 [6] says it will "re-authorize the user and
> > re-vend fresh credentials on replay." That addresses credentials,
> > but not the binding itself.
> >
> >
> > 2. Credential stripping breaks Polaris's deployment model
> >
> > Without mailing list discussion, the decision was made offline to
> > strip credentials before storing ("Phase 1") [6].
> >
> > Polaris's purpose is to vend down-scoped credentials. Clients have
> > no ambient storage access.
> >
> > A stripped response tells the client "table created", and the
> > client then fails hard because it cannot access the object storage.
> > Query engines (Spark, Trino, etc.) will fail on the first storage
> > operation against that table, and query engine users have no way to
> > resolve this: they don't control the REST catalog interaction. For
> > stage-create, the client gets a successful response telling it to
> > write data but has no credentials to write with. No fallback
> > exists.
> >
> > The stripped response is spec-compliant, but spec-compliant and
> > operationally correct are not the same thing.
> >
> >
> > 3. Phase 2 is architecturally incompatible with Phase 1
> >
> > The filter is a generic HTTP filter below the application layer. It
> > treats responses as opaque JSON. Credential vending happens deep
> > inside IcebergCatalogHandler, scoped to specific table locations
> > and principals, through cloud IAM calls (STS AssumeRole, GCS token
> > exchange).
> >
> > Re-vending on replay would require the filter to resolve the
> > table's storage location, authorize the caller, call cloud IAM to
> > mint scoped credentials, and manage credential caching. That is not
> > a filter; that is reimplementing the catalog handler's
> > authorization and credential vending logic.
> >
> > Phase 1 and Phase 2 need to be designed and shipped together.
> >
> >
> > 4. Operational overhead
> >
> > Every idempotent mutation adds 3 database round-trips (INSERT,
> > UPDATE, heartbeat UPDATEs), each borrowing its own connection
> > (autoCommit=true, no reuse), sharing the catalog's connection pool
> > and RDBMS. This added latency increases the probability of client
> > timeouts, which are exactly the transient failures that idempotency
> > is supposed to prevent.
> >
> > The idempotency table is high-churn and write-hot, competing with
> > catalog operations for connection pool capacity. The overhead gets
> > worse on managed and distributed database backends. Duplicate
> > requests trigger polling loops with repeated database reads.
> >
> > No benchmark data has been presented for any backend.
> >
> >
> > What needs to change
> >
> > The current generic-filter approach cannot support these
> > requirements. This needs a different design.
> >
> > A new design must address the following:
> >
> > a) The idempotency binding MUST include the caller's principal
> >    identity.
> > b) Credential-bearing responses must not be stored as-is. On
> >    replay, the handler must build a fresh response with fresh
> >    credentials for the current caller.
> > c) The operational overhead needs to be benchmarked.
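To make requirement (a) above concrete, here is a minimal sketch of a
principal-bound binding. Every name in it is invented for illustration
(this is not Polaris code, and the exact field set is just one
reasonable choice): deriving the service-internal storage key from the
caller's principal plus a fingerprint of the full request means a
leaked Idempotency-Key header can never produce a cross-principal cache
hit, because a different principal derives a different key.

```python
import hashlib

def idempotency_binding(principal_id: str, method: str, path: str,
                        body: bytes, client_key: str) -> str:
    # Hash the principal, method, path, a digest of the request body,
    # and the client-supplied Idempotency-Key into one service-internal
    # key. NUL separators keep field boundaries unambiguous, so
    # ("a", "bc") never collides with ("ab", "c").
    h = hashlib.sha256()
    for field in (principal_id, method, path):
        h.update(field.encode())
        h.update(b"\x00")
    h.update(hashlib.sha256(body).digest())
    h.update(client_key.encode())
    return h.hexdigest()

# Same principal replaying the byte-identical request derives the same
# binding, i.e. a legitimate replay still hits the cache:
a = idempotency_binding("alice", "POST", "/v1/ns/tables", b'{"name":"t"}', "k-123")
b = idempotency_binding("alice", "POST", "/v1/ns/tables", b'{"name":"t"}', "k-123")
# A different principal presenting the same (leaked) key simply misses:
c = idempotency_binding("mallory", "POST", "/v1/ns/tables", b'{"name":"t"}', "k-123")
assert a == b and a != c
```

Fingerprinting the body in the same derivation also addresses the
earlier feedback [2] that the client-provided key alone gives weak
uniqueness.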
> >
> > I am happy to review code once the design addresses these points.
> >
> > Best,
> > Robert
> >
> > [1] https://github.com/apache/polaris/pull/3803
> > [2] https://lists.apache.org/thread/qqrgr2bzsp69629jdj8kf39m10pzwy6l
> >     (Nov 24, 2025: "Building a service-internal key from the client
> >     provided idempotency key plus more data from the request gives
> >     better uniqueness. That more data can come from: operation id,
> >     all operation parameters, including the whole payload, quite a
> >     few HTTP request headers like user-agent, authorization, host,
> >     accept-* and also the client's address.")
> > [3] https://lists.apache.org/thread/7vy44mxsq2cgtp8gsj0r8pk8blp37h5b
> >     (Dec 8, 2025: "I agree that a bare Idempotency-Key + entity
> >     identifier is not enough to protect against buggy or malicious
> >     clients; fingerprinting the full request would be stronger.")
> > [4] https://lists.apache.org/thread/jt3gmpnjh6z490dvxorjs6ms00kvo264
> >     (Dec 8, 2025: "We should avoid failing Polaris if this subsystem
> >     fails, or letting this subsystem be a reason for its existence
> >     (aka retry due to timeouts because the idempotent-request
> >     subsystem hangs).")
> > [5] https://github.com/apache/polaris/pull/3803#pullrequestreview-4051690055
> > [6] https://github.com/apache/polaris/pull/3803#issuecomment-4180969838
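PS: a similarly toy sketch of requirement (b), handler-level replay.
Again, every name here is hypothetical (a stand-in ACL and a fake token
format, not Polaris APIs); the only point is the shape: on replay the
handler re-authorizes the current caller and mints fresh down-scoped
credentials, so nothing credential-shaped is ever read back from the
idempotency store.

```python
from dataclasses import dataclass

# Toy ACL standing in for Polaris's resource-level authorization.
ALLOWED = {"ns.tbl": {"alice"}}

@dataclass
class CachedResponse:
    resource: str   # which resource the original request targeted
    body: dict      # stored response, credentials already stripped

def authorize(caller: str, resource: str) -> None:
    # Resource-level authorization runs on *every* replay, so a cached
    # response is never handed to a principal that cannot access it.
    if caller not in ALLOWED.get(resource, set()):
        raise PermissionError(f"{caller} may not access {resource}")

def vend_credentials(caller: str, resource: str) -> dict:
    # Fresh, down-scoped credentials minted for the current caller; in
    # a real catalog this is the cloud IAM call (STS AssumeRole, GCS
    # token exchange), not a formatted string.
    return {"token": f"scoped:{caller}:{resource}"}

def replay(cached: CachedResponse, caller: str) -> dict:
    authorize(caller, cached.resource)
    creds = vend_credentials(caller, cached.resource)
    return {**cached.body, "config": creds}

cached = CachedResponse("ns.tbl", {"metadata-location": "s3://bucket/t/m.json"})
resp = replay(cached, "alice")      # authorized caller: fresh credentials
assert resp["config"]["token"] == "scoped:alice:ns.tbl"
try:
    replay(cached, "mallory")       # unauthorized caller: rejected
    assert False, "expected PermissionError"
except PermissionError:
    pass
```

This logic only makes sense where authorization and vending already
live, which is why I keep saying Phase 1 and Phase 2 have to be
designed together at the handler level rather than in a generic filter.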
