Hi Robert and JB,

+1 on moving the document to the proposal area for now.

Given that we don't have a place to store the markdown proposal, I updated issue 3786[1] so the proposal page can locate and reference it properly. I will close PR 3990 if you don't mind.

1. https://github.com/apache/polaris/issues/3786

Yufei

On Wed, May 20, 2026 at 7:22 AM Jean-Baptiste Onofré <[email protected]> wrote:

> Hi Robert,
>
> The PR is currently a draft, and my intent when creating it was to facilitate discussion on the dev@ mailing list.
>
> I am fine with moving it to the proposal area for now. We can move it back to the documentation once we have reached a consensus.
>
> I chose to start this as a PR rather than a Google Doc for two reasons:
> 1. To evaluate how efficiently we can collaborate via PR and explore the related changes needed in the Polaris core (API/SPI, etc.).
> 2. To simplify the merge process once we have consensus, as the ultimate goal is to update the documentation.
>
> Regards,
> JB
>
> On Wed, May 20, 2026 at 1:23 PM Robert Stupp <[email protected]> wrote:
>
> > Thanks Yufei, that helps.
> >
> > If the intent is proposal/design-level direction, I think we are mostly aligned then.
> >
> > My main concern is the placement/wording of the doc.
> >
> > If this is published as release documentation, users will read it as supported behavior.
> >
> > So I think the PR should make this very explicit:
> > push mode is conceptual/proposed, and the concrete task lifecycle, reliability, security, request-budget, and operational contracts are future work.
> >
> > Maybe the cleanest option is to keep this under the existing community/proposals area for now, rather than under release documentation.
> > That would match the current status better: useful architectural direction, but not yet a supported push-mode contract.
> >
> > Thanks also for the context from the sprint discussions, that is useful background.
> >
> > For the project decision, I think we should make sure the desired direction is explicit on the dev list.
> > Same for the open contract questions.
> > Then the community can validate or challenge them here and build consensus on that.
> >
> > With that clarification, I think the pull/push terminology is useful.
> >
> > For the actual execution semantics, I still think the safer foundation is the durable task-state approach from the async/reliable tasks proposal.
> >
> > Polaris owns the persistent record of what work exists, whether it finished, and what needs retry.
> >
> > Remote execution can then still be added later as an optional executor backend, without making it the baseline model for everyone.
> >
> > Robert
> >
> > On Wed, May 20, 2026 at 2:53 AM Yufei Gu <[email protected]> wrote:
> >
> > > Thanks Robert, this is helpful feedback.
> > >
> > > I think there may be a scope mismatch between the intent of the current document and how “push mode” is being interpreted. The current doc is mainly trying to capture architectural directions and terminology discussed during the sprint, especially the distinction between pull mode and push mode. The goal is not yet to standardize a full distributed task execution or reliability contract. To share some more context, we agreed to publish a short doc for architectural directions in two sprints (one in Feb, one in April). This PR (3990) is based on it. I think JB initiated it a few months ago.
> > >
> > > I agree that the topics you raised, durable task state, retry semantics, failure handling, credential scoping, request budgets, operational guarantees, etc., are important discussions, especially once we move toward production semantics for async execution. But I do not think the current document is trying to define those guarantees yet.
> > > It is more intended as a design/proposal level document describing possible execution/deployment models and the general direction the community discussed.
> > >
> > > I also agree that we should avoid overstating the maturity of push mode. We can clarify in the document that push mode is still conceptual/proposed and that the detailed operational and reliability contracts remain future work.
> > >
> > > Yufei
> > >
> > >
> > > On Tue, May 19, 2026 at 5:48 AM Robert Stupp <[email protected]> wrote:
> > >
> > > > Hi all,
> > > >
> > > > thanks for creating the doc and for splitting the discussion into pull and push mode.
> > > >
> > > > I think that terminology is useful and helps to separate two very different cases.
> > > >
> > > > I agree that pull and push are useful options to discuss.
> > > > I also think this is the right time to clarify whether push mode should be release documentation already, and what contract would be behind it.
> > > >
> > > > I am not objecting to the direction.
> > > >
> > > > I am objecting to publishing push mode as release documentation before we have defined its contract.
> > > >
> > > > Pull mode mostly looks like a normal REST/OAuth client pattern.
> > > > I am not sure that needs a separate Delegation Service specification.
> > > > I think pull mode is a good fit when the external service owns the workflow.
> > > >
> > > > When Polaris exposes the operation as Polaris behavior, for example DROP TABLE PURGE or server-side scan planning, Polaris owns the contract.
> > > >
> > > > For purge, that means durable state and eventual completion.
> > > > For scan planning, that means bounded request behavior: timeouts, cancellation, resource limits, result-size limits, fallback behavior, and cache ownership.
> > > >
> > > > After that, pull vs push is mostly about where execution runs.
> > > >
> > > > Remote push mode is still different operationally:
> > > >
> > > > Polaris needs to coordinate with another separately deployed service that can fail independently, but users will still hold Polaris responsible for the correct result.
> > > > That means the contract must define retry, failure handling, credentials, status, and operator controls.
> > > >
> > > > It also crosses security and service boundaries.
> > > >
> > > > The contract needs to define who the worker acts as, which credentials it gets, and how those credentials are scoped.
> > > > It also needs to define how Polaris and the worker safely talk to each other across Kubernetes service, network, and proxy boundaries.
> > > >
> > > > Once documented as release behavior, users will expect Polaris to define what happens when Polaris, the worker, the object store, or the network fails.
> > > >
> > > > I do not think that contract exists yet.
> > > > So I think this should either stay a design/proposal note for now, or the release documentation should clearly say that the push-mode contract is still TBD.
> > > >
> > > > I think the good news is that the "Asynchronous & Reliable Tasks" proposal already gives us a simpler foundation:
> > > > Polaris should own the durable task state, meaning the persistent record of what work exists, whether it finished, and what needs retry.
> > > > With that, the default deployment can stay simple, and remote execution can still be added later as an optional executor backend.
> > > >
> > > > I also think we should separate the advanced deployment option from the common user path.
> > > >
> > > > A remote push-mode Delegation Service can be useful for deployments that already have the operational machinery for separate worker services.
> > > > But for many self-hosted users it also means another service to deploy, secure, monitor, scale, upgrade, and debug.
> > > >
> > > > So I would prefer that the common path stays simple first: Polaris owns the durable task state, and operators can run the worker in the same deployment or same image.
> > > >
> > > > Remote execution can then be added as an optional executor backend without making it the baseline model for everyone.
> > > >
> > > > The failure cases below are why I think this matters.
> > > > They are not a request to solve every detail in this PR.
> > > >
> > > > For example:
> > > >
> > > > * What happens if the user-visible drop succeeds, but the purge task is not recorded yet?
> > > >   This matters when entities and tasks are served by different SPIs or backends.
> > > >   Atomicity across those writes cannot then be assumed.
> > > >
> > > > * What happens if a worker deletes some files and then crashes?
> > > >   Who owns retry?
> > > >   Where is progress recorded?
> > > >   Can another node safely resume a crashed node's work?
> > > >
> > > > * What happens if the worker needs to call Polaris after the table is already hidden or dropped from the normal API surface?
> > > >   This creates a cyclic dependency unless the task contains the information needed to continue without rediscovering the table through loadTable.
> > > >
> > > > * Server-side scan planning is also not a simple service call.
> > > >   It either needs a query engine, or the relevant planning parts of one.
> > > >   At minimum, the contract needs request budgets: timeouts, cancellation, backpressure, result-size limits, fallback behavior, and cache ownership.
> > > >
> > > > The existing proposals already contain most of the useful building blocks.
> > > >
> > > > For me, the safer order is to define the guarantees first, then document the deployment modes on top.
> > > >
> > > > One possible path could roughly look like this:
> > > >
> > > > 1. Define how destructive operations persist the intent for DROP TABLE PURGE.
> > > >    The important part is that the user-visible drop and the purge intent are recorded atomically.
> > > >
> > > > 2. Building on the "Asynchronous & Reliable Tasks" work for the durable Polaris task control plane gives us deterministic task IDs, task state, retry/lost-task recovery, and admin-visible status.
> > > >
> > > > 3. Using the "Object store functionality" work as the execution library for purge/file cleanup gives us streaming file discovery, bulk deletes, rate limiting, stats, and lower heap pressure.
> > > >
> > > > 4. Wire DROP TABLE PURGE to a reliable task behavior using those object store operations.
> > > >    Once Polaris returns success, the table is hidden from normal catalog APIs and the purge intent is durable.
> > > >    File deletion can continue asynchronously and survive process restarts.
> > > >
> > > > 5. Then consider deployment variants.
> > > >    A same-image task runner gives self-hosted operators isolation and separate scaling without a second protocol or persistence model.
> > > >    A remote Delegation Service can still be added later as an optional executor backend if SaaS deployments need that shape.
> > > >
> > > > This is not meant to block pull/push terminology.
> > > > It is also not meant to rule out remote execution.
> > > > I am mostly trying to avoid publishing push mode as supported release behavior before the task, security, request-budget, and operational contracts are defined.
> > > >
> > > > So I would prefer to keep this PR as a design/proposal note for now, or make the released documentation explicit that push mode is still TBD.
> > > >
> > > > My worry is that otherwise we ship a simple-looking doc that commits the project to a surprisingly complex distributed-systems design.
> > > >
> > > > Robert
> > > >
> > > > On Wed, May 13, 2026 at 11:50 PM Yufei Gu <[email protected]> wrote:
> > > >
> > > > > Hi folks,
> > > > >
> > > > > Sharing a few updates regarding the delegation service design doc. JB and I will be co-authoring the document, and the PR has been updated accordingly.
> > > > >
> > > > > Please take a look at the latest changes here:
> > > > > https://github.com/apache/polaris/pull/3990
> > > > >
> > > > > Yufei
> > > > >
> > > > >
> > > > > On Tue, Apr 14, 2026 at 1:56 PM Yufei Gu <[email protected]> wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > We had a productive discussion on the delegation service during the Polaris Sprint on April 7, thanks all for the great input.
> > > > > >
> > > > > > As a quick summary, the current direction is to condense the design doc[1] and focus on the two options the community seems to prefer moving forward with: pull mode and push mode. The goal is to keep the doc concise and briefly describe these two modes.
> > > > > >
> > > > > > Please let me know if I missed anything. Looking forward to your feedback.
> > > > > >
> > > > > > 1. https://github.com/apache/polaris/pull/3990
> > > > > >
> > > > > > Thanks,
> > > > > > Yufei
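
For readers following the thread, here is a rough illustration of the "durable task state" idea discussed above: the user-visible drop and the purge intent are recorded in one transaction, the task ID is deterministic, the task payload carries what the worker needs (so it does not have to call loadTable on a hidden table), and progress is recorded so another worker can resume after a crash. This is only a sketch under stated assumptions and not Polaris code: it assumes entities and tasks live in the same transactional store (the thread notes this cannot be assumed when they are served by different SPIs or backends), it uses SQLite as a stand-in, and every name in it (`tables`, `tasks`, `task_progress`, `drop_table_purge`, `run_purge`, `TASK_NAMESPACE`) is hypothetical.

```python
import json
import sqlite3
import uuid

# Hypothetical namespace used only to derive deterministic task IDs.
TASK_NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")

# Assumption: entities and tasks share one transactional store.
SCHEMA = """
CREATE TABLE tables (name TEXT PRIMARY KEY, location TEXT NOT NULL,
                     dropped INTEGER NOT NULL DEFAULT 0);
CREATE TABLE tasks (id TEXT PRIMARY KEY, kind TEXT NOT NULL,
                    payload TEXT NOT NULL, state TEXT NOT NULL);
CREATE TABLE task_progress (task_id TEXT NOT NULL, file TEXT NOT NULL,
                            PRIMARY KEY (task_id, file));
"""


def open_store() -> sqlite3.Connection:
    """Create an in-memory store with the hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def purge_task_id(table_name: str, location: str) -> str:
    """Same inputs always yield the same ID, so a retried drop cannot
    create a second purge task for the same table."""
    return str(uuid.uuid5(TASK_NAMESPACE, f"purge:{table_name}:{location}"))


def drop_table_purge(conn: sqlite3.Connection, table_name: str, files: list) -> str:
    """Hide the table and record the purge intent in ONE transaction."""
    with conn:  # commits on success, rolls back if anything raises
        row = conn.execute(
            "SELECT location FROM tables WHERE name = ? AND dropped = 0",
            (table_name,),
        ).fetchone()
        if row is None:
            raise KeyError(table_name)
        task_id = purge_task_id(table_name, row[0])
        conn.execute("UPDATE tables SET dropped = 1 WHERE name = ?", (table_name,))
        # The payload carries everything the worker needs, so it never
        # has to look the (now hidden) table up again.
        payload = json.dumps({"location": row[0], "files": files})
        conn.execute(
            "INSERT OR IGNORE INTO tasks VALUES (?, 'purge', ?, 'PENDING')",
            (task_id, payload),
        )
    return task_id


def run_purge(conn: sqlite3.Connection, task_id: str, delete_file, crash_after=None) -> None:
    """Delete remaining files, recording progress after each one.
    `delete_file` is assumed to be idempotent; `crash_after` only
    simulates a worker crash for demonstration."""
    payload = json.loads(
        conn.execute("SELECT payload FROM tasks WHERE id = ?", (task_id,)).fetchone()[0]
    )
    done = {
        r[0]
        for r in conn.execute(
            "SELECT file FROM task_progress WHERE task_id = ?", (task_id,)
        )
    }
    deleted_now = 0
    for f in payload["files"]:
        if f in done:
            continue  # already handled by an earlier (possibly crashed) run
        if crash_after is not None and deleted_now >= crash_after:
            raise RuntimeError("simulated worker crash")
        delete_file(f)
        with conn:
            conn.execute("INSERT OR IGNORE INTO task_progress VALUES (?, ?)", (task_id, f))
        deleted_now += 1
    with conn:
        conn.execute("UPDATE tasks SET state = 'DONE' WHERE id = ?", (task_id,))
```

In this sketch, whether `run_purge` runs in the same process, in a same-image task runner, or in a remote executor does not change the persistence model, which is roughly the point made in the thread about treating deployment variants as a later choice.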
