Hi Robert,

Thanks for the follow-up. I appreciate the clarification. I read the listed items as fix options and picked option 2, and the Phase 1/Phase 2 split was purely to keep the PR at a reviewable size, not a design choice.

I agree that the caller's identity should be part of the binding. That is straightforward to add. On the architecture question, whether re-vending can work within the filter or needs a handler-level approach, I want to give this proper thought rather than rush an answer. I may not be able to get back to you on this until next week (the week of 4/13).

Thanks,
Huaxin

On Sat, Apr 4, 2026 at 11:42 PM Robert Stupp <[email protected]> wrote:

> Hi Huaxin,
>
> Thanks for responding. A few corrections and clarifications.
>
> My PR review described three consequences of the missing caller identity binding and the credential leakage attack vector. Those were not suggestions for a Phase 1/Phase 2 split. The offline decision to split into phases, and what each phase contains, was yours [1].
>
> On the phasing: the concern is not PR size. The concern is that Phase 1 alone ships a full authorization bypass. The filter replays cached responses before Polaris's resource-level authorization runs. Stripping credentials does not fix that. Any authenticated user with a valid idempotency key can receive a cached response for a resource they are not authorized to access. That is the case with or without credentials in the response.
>
> "We will not release it until entire idempotency coding is finished" is reassuring, but the architecture question remains: can the generic filter approach support re-vending at all? My email explained why I believe it cannot, since it would require the filter to authorize the caller, resolve storage locations, call cloud IAM, and manage credential caching. That is the catalog handler's job, not a filter's.
>
> The questions from my original email that still need answers:
>
> How does the binding prevent cross-principal cache hits without including the caller's identity?
>
> How does Phase 2 re-vend credentials from a generic HTTP filter without reimplementing the catalog handler's authorization and credential vending logic?
>
> Has the operational overhead (3 DB round-trips per mutation on the catalog's connection pool) been benchmarked on any backend?
>
> I'd like to resolve these on this thread before more code is written.
>
> Best,
> Robert
>
> [1] https://github.com/apache/polaris/pull/3803#issuecomment-4180969838
>
> On Sun, Apr 5, 2026 at 7:55 AM huaxin gao <[email protected]> wrote:
>
> > Hi Robert,
> >
> > Thanks for raising this on the mailing list. I acknowledge the design concerns you listed.
> >
> > To clarify: I picked option 2 (strip credentials and re-vend on replay), which you suggested in the PR review. The offline discussion with Yufei and Prashant was simply to verify that I had picked the best of the three options you listed in the review.
> >
> > The reason I split it into two phases is that the current PR is already around 3,000 lines. Adding re-vending logic on top would make it very hard to review. This is a big feature and we need to do it piece by piece. We will not release it until entire idempotency coding is finished.
> >
> > I'm happy to discuss the architecture further on this thread, whether the filter approach can support re-vending, or whether we need a handler-level design. I may be slow to respond next week but I'll follow up after that.
> >
> > Thanks,
> >
> > Huaxin
> >
> > On Sat, Apr 4, 2026 at 9:34 AM Robert Stupp <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > I need to raise some fundamental design concerns with the current Idempotency-Key implementation in PR #3803 [1].
> > > Some of these were raised in November 2025 [2] and were acknowledged [3] but never addressed in the implementation.
> > >
> > > I raised the caller identity concern on the mailing list in November 2025 [2]. In December 2025 [4], I reiterated that all existing implementations I could find do request fingerprinting including the request body, and warned that the idempotency subsystem itself must not become a source of failures. The PR was opened in February 2026 without addressing either point.
> > >
> > > The credential leakage consequences surfaced during code review [5]. The response was an offline decision between three contributors, without mailing list discussion [6]. The decision was to split into Phase 1 (strip credentials, which produces broken responses but does not fix the authorization bypass) and Phase 2 (re-vend on replay, which would require a complete redesign and reimplementation). Details below.
> > >
> > > I think we need to resolve these at the design level before more code is written.
> > >
> > > I will not do further code reviews on this PR until these design concerns are resolved.
> > >
> > >
> > > 1. Caller identity is not part of the idempotency binding
> > >
> > > The caller's identity is not part of the binding. The filter replays cached responses before Polaris's resource-level authorization runs. Any authenticated user with a valid idempotency key can receive a cached response for a resource they are not authorized to access, including vended storage credentials. This is a full authorization bypass.
> > >
> > > Idempotency keys are HTTP headers that end up in access logs, observability tooling, and debug output. They cannot be a security boundary. I described the attack vector in my PR review [5].
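To make the binding concern concrete, here is a minimal sketch of a caller-bound idempotency binding. This is an illustration only, not Polaris code: the function name, field set, and key derivation are assumptions, showing one way a binding could include the caller's principal and a fingerprint of the request.

```python
import hashlib

def binding_key(principal_id: str, idempotency_key: str,
                method: str, path: str, body: bytes) -> str:
    """Hypothetical sketch: derive the idempotency binding from the caller's
    principal plus a fingerprint of the request, so a cached response can
    never match a lookup made by a different principal."""
    h = hashlib.sha256()
    for part in (principal_id, idempotency_key, method, path):
        h.update(part.encode())
        h.update(b"\x00")  # separator keeps field boundaries unambiguous
    h.update(body)
    return h.hexdigest()

body = b'{"name": "t1"}'
a = binding_key("principal-A", "key-123", "POST", "/v1/tables", body)
b = binding_key("principal-B", "key-123", "POST", "/v1/tables", body)
# Same Idempotency-Key, same request, different principals:
# the bindings differ, so B cannot receive A's cached response.
print(a != b)  # prints: True
```

Under this scheme a leaked Idempotency-Key header is harmless on its own, because a lookup by another principal hashes to a different binding.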
> > >
> > > The proposed Phase 2 [6] says it will "re-authorize the user and re-vend fresh credentials on replay." That addresses credentials but not the binding itself.
> > >
> > >
> > > 2. Credential stripping breaks Polaris's deployment model
> > >
> > > Without mailing list discussion, the decision was made offline to strip credentials before storing ("Phase 1") [6].
> > >
> > > Polaris's purpose is to vend down-scoped credentials. Clients have no ambient storage access.
> > >
> > > A stripped response still signals success ("table created") to the client, which then fails hard because it cannot access the object storage. Query engines (Spark, Trino, etc.) will fail on the first storage operation against that table. Query engine users have no way to resolve this; they don't control the REST catalog interaction. For stage-create, the client gets a successful response telling it to write data but has no credentials to write with. No fallback exists.
> > >
> > > The stripped response is spec-compliant, but spec-compliant and operationally correct are not the same thing.
> > >
> > >
> > > 3. Phase 2 is architecturally incompatible with Phase 1
> > >
> > > The filter is a generic HTTP filter below the application layer. It treats responses as opaque JSON. Credential vending happens deep inside IcebergCatalogHandler, scoped to specific table locations and principals, through cloud IAM calls (STS AssumeRole, GCS token exchange).
> > >
> > > Re-vending on replay would require the filter to resolve the table's storage location, authorize the caller, call cloud IAM to mint scoped credentials, and manage credential caching. That is not a filter; that is reimplementing the catalog handler's authorization and credential vending logic.
> > >
> > > Phase 1 and Phase 2 need to be designed and shipped together.
> > >
> > >
> > > 4. Operational overhead
> > >
> > > Every idempotent mutation adds 3 database round-trips (INSERT, UPDATE, heartbeat UPDATEs), each borrowing its own connection (autoCommit=true, no reuse), sharing the catalog's connection pool and RDBMS. This added latency increases the probability of client timeouts, which are exactly the transient failures that idempotency is supposed to prevent.
> > >
> > > The idempotency table is high-churn and write-hot, competing with catalog operations for connection pool capacity. The overhead gets worse on managed and distributed database backends. Duplicate requests trigger polling loops with repeated database reads.
> > >
> > > No benchmark data has been presented for any backend.
> > >
> > >
> > > What needs to change
> > >
> > > The current generic-filter approach cannot support these requirements. This needs a different design.
> > >
> > > A new design must address the following:
> > >
> > > a) The idempotency binding MUST include the caller's principal identity.
> > > b) Credential-bearing responses must not be stored as-is. On replay, the handler must build a fresh response with fresh credentials for the current caller.
> > > c) The operational overhead needs to be benchmarked.
> > >
> > > I am happy to review code once the design addresses these points.
> > >
> > > Best,
> > > Robert
> > >
> > > [1] https://github.com/apache/polaris/pull/3803
> > > [2] https://lists.apache.org/thread/qqrgr2bzsp69629jdj8kf39m10pzwy6l
> > > (Nov 24, 2025: "Building a service-internal key from the client provided idempotency key plus more data from the request gives better uniqueness. That more data can come from: operation id, all operation parameters, including the whole payload, quite a few HTTP request headers like user-agent, authorization, host, accept-* and also the client's address.")
> > > [3] https://lists.apache.org/thread/7vy44mxsq2cgtp8gsj0r8pk8blp37h5b
> > > (Dec 8, 2025: "I agree that a bare Idempotency-Key + entity identifier is not enough to protect against buggy or malicious clients; fingerprinting the full request would be stronger.")
> > > [4] https://lists.apache.org/thread/jt3gmpnjh6z490dvxorjs6ms00kvo264
> > > (Dec 8, 2025: "We should avoid failing Polaris if this subsystem fails, or letting this subsystem be a reason for its existence (aka retry due to timeouts because the idempotent-request subsystem hangs).")
> > > [5] https://github.com/apache/polaris/pull/3803#pullrequestreview-4051690055
> > > [6] https://github.com/apache/polaris/pull/3803#issuecomment-4180969838
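The handler-level replay discussed in this thread can be sketched in a few lines. This is a hypothetical illustration under stated assumptions, not Polaris code: the cached entry holds only a credential-free response body, and on replay the handler re-runs resource-level authorization for the current caller and vends fresh credentials before returning. All names (CACHE, GRANTS, authorize, vend_credentials, replay) are invented for the sketch.

```python
class NotAuthorized(Exception):
    pass

CACHE = {}                      # binding key -> credential-free response dict
GRANTS = {"alice": {"db.t1"}}   # toy ACL: principal -> tables it may access

def authorize(principal: str, table: str) -> None:
    # Stand-in for Polaris's resource-level authorization check.
    if table not in GRANTS.get(principal, set()):
        raise NotAuthorized(principal)

def vend_credentials(principal: str, table: str) -> dict:
    # Stand-in for an STS AssumeRole / token-exchange call that mints
    # credentials scoped to this principal and table location.
    return {"token": f"scoped({principal},{table})"}

def replay(principal: str, key: str, table: str) -> dict:
    cached = CACHE[key]                    # stored without any credentials
    authorize(principal, table)            # re-authorize the current caller
    fresh = dict(cached)
    fresh["config"] = vend_credentials(principal, table)  # re-vend, never replay
    return fresh

CACHE["k1"] = {"metadata-location": "s3://bucket/db/t1/metadata.json"}
print(replay("alice", "k1", "db.t1")["config"]["token"])  # prints: scoped(alice,db.t1)
try:
    replay("mallory", "k1", "db.t1")
except NotAuthorized:
    print("denied")  # unauthorized caller never sees the cached response
```

The point of the sketch is that both steps, authorization and vending, already live in the catalog handler; a generic HTTP filter would have to duplicate them to achieve the same behavior.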
