Re: [DISCUSS] Delegation Service design doc direction (pull vs push modes)

Yufei Gu Wed, 20 May 2026 16:58:12 -0700

Hi Robert and JB,

+1 on moving the document to the proposal area for now.


Given that we don't have a place to store the markdown proposal, I updated
issue 3786[1] so the proposal page can locate and reference it properly. I
will close PR 3990 if you don't mind.

1. https://github.com/apache/polaris/issues/3786

Yufei


On Wed, May 20, 2026 at 7:22 AM Jean-Baptiste Onofré <[email protected]>
wrote:

> Hi Robert,
>
> The PR is currently a draft, and my intent when creating it was to
> facilitate discussion on the dev@ mailing list.
>
> I am fine with moving it to the proposal area for now. We can move it back
> to the documentation once we have reached a consensus.
>
> I chose to start this as a PR rather than a Google Doc for two reasons:
> 1. To evaluate how efficiently we can collaborate via PR and explore the
> related changes needed in the Polaris core (API/SPI, etc.).
> 2. To simplify the merge process once we have consensus, as the ultimate
> goal is to update the documentation.
>
> Regards,
> JB
>
> On Wed, May 20, 2026 at 1:23 PM Robert Stupp <[email protected]> wrote:
>
> > Thanks Yufei, that helps.
> >
> > If the intent is proposal/design-level direction, I think we are mostly
> > aligned then.
> >
> > My main concern is the placement/wording of the doc.
> >
> > If this is published as release documentation, users will read it as
> > supported behavior.
> >
> > So I think the PR should make this very explicit:
> > push mode is conceptual/proposed, and the concrete task lifecycle,
> > reliability, security, request-budget, and operational contracts are
> > future work.
> >
> > Maybe the cleanest option is to keep this under the existing
> > community/proposals area for now, rather than under release
> > documentation.
> > That would match the current status better: useful architectural
> > direction, but not yet a supported push-mode contract.
> >
> > Thanks also for the context from the sprint discussions, that is useful
> > background.
> >
> > For the project decision, I think we should make sure the desired
> > direction is explicit on the dev list.
> > Same for the open contract questions.
> > Then the community can validate or challenge them here and build
> > consensus on that.
> >
> > With that clarification, I think the pull/push terminology is useful.
> >
> > For the actual execution semantics, I still think the safer foundation is
> > the durable task-state approach from the async/reliable tasks proposal.
> >
> > Polaris owns the persistent record of what work exists, whether it
> > finished, and what needs retry.
> >
> > Remote execution can then still be added later as an optional executor
> > backend, without making it the baseline model for everyone.
> >
> > Robert
> >
> > On Wed, May 20, 2026 at 2:53 AM Yufei Gu <[email protected]> wrote:
> >
> > > Thanks Robert, this is helpful feedback.
> > >
> > > I think there may be a scope mismatch between the intent of the current
> > > document and how “push mode” is being interpreted. The current doc is
> > > mainly trying to capture architectural directions and terminology
> > > discussed during the sprint, especially the distinction between pull
> > > mode and push mode. The goal is not yet to standardize a full
> > > distributed task execution or reliability contract. To share some more
> > > context, we agreed to publish a short doc for architectural directions
> > > in two sprints (one in Feb, one in April). This PR (3990) is based on
> > > it. I think JB initiated it a few months ago.
> > >
> > > I agree the topics you raised, durable task state, retry semantics,
> > > failure handling, credential scoping, request budgets, operational
> > > guarantees, etc., are important discussions, especially once we move
> > > toward production semantics for async execution. But I do not think the
> > > current document is trying to define those guarantees yet. It is more
> > > intended as a design/proposal level document describing possible
> > > execution/deployment models and the general direction the community
> > > discussed.
> > >
> > > I also agree that we should avoid overstating the maturity of push
> > > mode. We can clarify in the document that push mode is still
> > > conceptual/proposed and that the detailed operational and reliability
> > > contracts remain future work.
> > >
> > > Yufei
> > >
> > >
> > > On Tue, May 19, 2026 at 5:48 AM Robert Stupp <[email protected]> wrote:
> > >
> > > > Hi all,
> > > >
> > > > thanks for creating the doc and for splitting the discussion into
> > > > pull and push mode.
> > > >
> > > > I think that terminology is useful and helps to separate two very
> > > > different cases.
> > > >
> > > > I agree that pull and push are useful options to discuss.
> > > > I also think this is the right time to clarify whether push mode
> > > > should be release documentation already, and what contract would be
> > > > behind it.
> > > >
> > > > I am not objecting to the direction.
> > > >
> > > > I am objecting to publishing push mode as release documentation
> > > > before we have defined its contract.
> > > >
> > > > Pull mode mostly looks like a normal REST/OAuth client pattern.
> > > > I am not sure that needs a separate Delegation Service specification.
> > > > I think pull mode is a good fit when the external service owns the
> > > > workflow.
> > > >
> > > > When Polaris exposes the operation as Polaris behavior, for example
> > > > DROP TABLE PURGE or server-side scan planning, Polaris owns the
> > > > contract.
> > > >
> > > > For purge, that means durable state and eventual completion.
> > > > For scan planning, that means bounded request behavior: timeouts,
> > > > cancellation, resource limits, result-size limits, fallback behavior,
> > > > and cache ownership.
> > > >
> > > > After that, pull vs push is mostly about where execution runs.
> > > >
> > > > Remote push mode is still different operationally:
> > > >
> > > > Polaris needs to coordinate with another separately deployed service
> > > > that can fail independently, but users will still hold Polaris
> > > > responsible for the correct result.
> > > > That means the contract must define retry, failure handling,
> > > > credentials, status, and operator controls.
> > > >
> > > > It also crosses security and service boundaries.
> > > >
> > > > The contract needs to define who the worker acts as, which
> > > > credentials it gets, and how those credentials are scoped.
> > > > It also needs to define how Polaris and the worker safely talk to
> > > > each other across Kubernetes service, network, and proxy boundaries.
> > > >
> > > > Once documented as release behavior, users will expect Polaris to
> > > > define what happens when Polaris, the worker, the object store, or
> > > > the network fails.
> > > >
> > > > I do not think that contract exists yet.
> > > > So I think this should either stay a design/proposal note for now, or
> > > > the release documentation should clearly say that the push-mode
> > > > contract is still TBD.
> > > >
> > > > I think the good news is that the "Asynchronous & Reliable Tasks"
> > > > proposal already gives us a simpler foundation:
> > > > Polaris should own the durable task state, meaning the persistent
> > > > record of what work exists, whether it finished, and what needs retry.
> > > > With that, the default deployment can stay simple, and remote
> > > > execution can still be added later as an optional executor backend.
> > > >
> > > > I also think we should separate the advanced deployment option from
> > > > the common user path.
> > > >
> > > > A remote push-mode Delegation Service can be useful for deployments
> > > > that already have the operational machinery for separate worker
> > > > services.
> > > > But for many self-hosted users it also means another service to
> > > > deploy, secure, monitor, scale, upgrade, and debug.
> > > >
> > > > So I would prefer that the common path stays simple first: Polaris
> > > > owns the durable task state, and operators can run the worker in the
> > > > same deployment or same image.
> > > >
> > > > Remote execution can then be added as an optional executor backend
> > > > without making it the baseline model for everyone.
> > > >
> > > > The failure cases below are why I think this matters.
> > > > They are not a request to solve every detail in this PR.
> > > >
> > > > For example:
> > > >
> > > > * What happens if the user-visible drop succeeds, but the purge task
> > > >   is not recorded yet?
> > > >   This matters when entities and tasks are served by different SPIs
> > > >   or backends.
> > > >   Atomicity across those writes cannot then be assumed.
> > > >
> > > > * What happens if a worker deletes some files and then crashes?
> > > >   Who owns retry?
> > > >   Where is progress recorded?
> > > >   Can another node safely resume a crashed node's work?
> > > >
> > > > * What happens if the worker needs to call Polaris after the table is
> > > >   already hidden or dropped from the normal API surface?
> > > >   This creates a cyclic dependency unless the task contains the
> > > >   information needed to continue without rediscovering the table
> > > >   through loadTable.
> > > >
> > > > * Server-side scan planning is also not a simple service call.
> > > >   It either needs a query engine, or the relevant planning parts of
> > > >   one.
> > > >   At minimum, the contract needs request budgets: timeouts,
> > > >   cancellation, backpressure, result-size limits, fallback behavior,
> > > >   and cache ownership.
> > > >
> > > > The existing proposals already contain most of the useful building
> > > > blocks.
> > > >
> > > > For me, the safer order is to define the guarantees first, then
> > > > document the deployment modes on top.
> > > >
> > > > One possible path could roughly look like this:
> > > >
> > > > 1. Define how destructive operations persist the intent for DROP
> > > >    TABLE PURGE.
> > > >    The important part is that the user-visible drop and the purge
> > > >    intent are recorded atomically.
> > > >
> > > > 2. Building on the "Asynchronous & Reliable Tasks" work for the
> > > >    durable Polaris task control plane gives us deterministic task
> > > >    IDs, task state, retry/lost-task recovery, and admin-visible
> > > >    status.
> > > >
> > > > 3. Using the "Object store functionality" work as the execution
> > > >    library for purge/file cleanup gives us streaming file discovery,
> > > >    bulk deletes, rate limiting, stats, and lower heap pressure.
> > > >
> > > > 4. Wire DROP TABLE PURGE to a reliable task behavior using those
> > > >    object store operations.
> > > >    Once Polaris returns success, the table is hidden from normal
> > > >    catalog APIs and the purge intent is durable.
> > > >    File deletion can continue asynchronously and survive process
> > > >    restarts.
> > > >
> > > > 5. Then consider deployment variants.
> > > >    A same-image task runner gives self-hosted operators isolation and
> > > >    separate scaling without a second protocol or persistence model.
> > > >    A remote Delegation Service can still be added later as an
> > > >    optional executor backend if SaaS deployments need that shape.
> > > >
> > > > This is not meant to block pull/push terminology.
> > > > It is also not meant to rule out remote execution.
> > > > I am mostly trying to avoid publishing push mode as supported release
> > > > behavior before the task, security, request-budget, and operational
> > > > contracts are defined.
> > > >
> > > > So I would prefer to keep this PR as a design/proposal note for now,
> > > > or make the released documentation explicit that push mode is still
> > > > TBD.
> > > >
> > > > My worry is that otherwise we ship a simple-looking doc that commits
> > > > the project to a surprisingly complex distributed-systems design.
> > > >
> > > > Robert
> > > >
> > > > On Wed, May 13, 2026 at 11:50 PM Yufei Gu <[email protected]> wrote:
> > > >
> > > > > Hi folks,
> > > > >
> > > > > Sharing a few updates regarding the delegation service design doc.
> > > > > JB and I will be co-authoring the document, and the PR has been
> > > > > updated accordingly.
> > > > >
> > > > > Please take a look at the latest changes here:
> > > > > https://github.com/apache/polaris/pull/3990
> > > > >
> > > > > Yufei
> > > > >
> > > > >
> > > > > On Tue, Apr 14, 2026 at 1:56 PM Yufei Gu <[email protected]> wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > We had a productive discussion on the delegation service during
> > > > > > the Polaris Sprint on April 7, thanks all for the great input.
> > > > > >
> > > > > > As a quick summary, the current direction is to condense the
> > > > > > design doc[1] and focus on the two options the community seems to
> > > > > > prefer moving forward with: pull mode and push mode. The goal is
> > > > > > to keep the doc concise and briefly describe these two modes.
> > > > > >
> > > > > > Please let me know if I missed anything. Looking forward to your
> > > > > > feedback.
> > > > > >
> > > > > > 1. https://github.com/apache/polaris/pull/3990
> > > > > >
> > > > > > Thanks,
> > > > > > Yufei
> > > > > >
> > > > >
> > > >
> > >
> >
>
