Re: [DISCUSS] Delegation Service design doc direction (pull vs push modes)

Robert Stupp Tue, 19 May 2026 05:48:41 -0700

Hi all,

thanks for creating the doc and for splitting the discussion into pull and
push mode.

I think that terminology is useful and helps to separate two very different
cases.

I agree that pull and push are useful options to discuss.
I also think this is the right time to clarify whether push mode should
already be part of the release documentation, and what contract would be
behind it.

I am not objecting to the direction.

I am objecting to publishing push mode as release documentation before we
have defined its contract.

Pull mode mostly looks like a normal REST/OAuth client pattern.
I am not sure that needs a separate Delegation Service specification.
I think pull mode is a good fit when the external service owns the workflow.

When Polaris exposes the operation as Polaris behavior, for example DROP
TABLE PURGE or server-side scan planning, Polaris owns the contract.

For purge, that means durable state and eventual completion.
For scan planning, that means bounded request behavior: timeouts,
cancellation, resource limits, result-size limits, fallback behavior, and
cache ownership.
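
To make "bounded request behavior" a bit more concrete, here is a minimal
sketch in Python (chosen only for brevity, this is not Polaris code).
RequestBudget, plan_scan, and plan_with_fallback are hypothetical names, and
the limits are placeholders rather than proposed values. It only covers the
timeout, result-size, and fallback parts of the list above.

```python
import time
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a planning request exceeds its time or result-size budget."""


@dataclass
class RequestBudget:
    # Hypothetical limits; real values would come from the contract.
    timeout_seconds: float
    max_results: int


def plan_scan(candidate_files, budget, clock=time.monotonic):
    """Collect planned files until the input ends or a budget limit is hit."""
    deadline = clock() + budget.timeout_seconds
    results = []
    for f in candidate_files:
        if clock() > deadline:
            raise BudgetExceeded("timeout")
        if len(results) >= budget.max_results:
            raise BudgetExceeded("result-size limit")
        results.append(f)
    return results


def plan_with_fallback(candidate_files, budget):
    """Fallback behavior: ask the client to plan locally instead of failing."""
    try:
        return {"mode": "server", "files": plan_scan(candidate_files, budget)}
    except BudgetExceeded as e:
        return {"mode": "client-fallback", "reason": str(e)}
```

The point of the sketch is that every limit has a defined outcome. Whether
the fallback is "client plans locally" or an error is exactly the kind of
decision the contract would need to make.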

After that, pull vs push is mostly about where execution runs.

Remote push mode is still different operationally:

Polaris needs to coordinate with another separately deployed service that
can fail independently, but users will still hold Polaris responsible for
the correct result.
That means the contract must define retry, failure handling, credentials,
status, and operator controls.

It also crosses security and service boundaries.

The contract needs to define who the worker acts as, which credentials it
gets, and how those credentials are scoped.
It also needs to define how Polaris and the worker safely talk to each
other across Kubernetes service, network, and proxy boundaries.

Once documented as release behavior, users will expect Polaris to define
what happens when Polaris, the worker, the object store, or the network
fails.

I do not think that contract exists yet.
So I think this should either stay a design/proposal note for now, or the
release documentation should clearly say that the push-mode contract is
still TBD.

I think the good news is that the "Asynchronous & Reliable Tasks" proposal
already gives us a simpler foundation:
Polaris should own the durable task state, meaning the persistent record of
what work exists, whether it finished, and what needs retry.
With that, the default deployment can stay simple, and remote execution can
still be added later as an optional executor backend.
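
As a rough illustration of what "durable task state" could mean, here is a
minimal Python sketch. It is not taken from the "Asynchronous & Reliable
Tasks" proposal. TaskRecord, claim, complete, and recover_lost are
hypothetical names, the lease mechanism is my assumption, and the records
live in memory here where a real implementation would persist them.

```python
from dataclasses import dataclass
from typing import Optional

PENDING, RUNNING, DONE = "PENDING", "RUNNING", "DONE"


@dataclass
class TaskRecord:
    task_id: str
    state: str = PENDING
    attempts: int = 0
    lease_expires_at: Optional[float] = None  # set while a runner holds the task


def claim(task, now, lease_seconds):
    """A runner claims a pending task for a limited lease period."""
    if task.state != PENDING:
        return False
    task.state = RUNNING
    task.attempts += 1
    task.lease_expires_at = now + lease_seconds
    return True


def complete(task):
    """Mark the task finished and release its lease."""
    task.state = DONE
    task.lease_expires_at = None


def recover_lost(tasks, now):
    """Return RUNNING tasks whose lease expired to PENDING so they are retried."""
    recovered = []
    for t in tasks:
        expired = t.lease_expires_at is not None and t.lease_expires_at <= now
        if t.state == RUNNING and expired:
            t.state = PENDING
            t.lease_expires_at = None
            recovered.append(t.task_id)
    return recovered
```

With a record like this, it does not matter much whether the runner is in
the same process, the same image, or a remote service. The record answers
what work exists, whether it finished, and what needs retry.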

I also think we should separate the advanced deployment option from the
common user path.

A remote push-mode Delegation Service can be useful for deployments that
already have the operational machinery for separate worker services.
But for many self-hosted users it also means another service to deploy,
secure, monitor, scale, upgrade, and debug.

So I would prefer that the common path stays simple first: Polaris owns the
durable task state, and operators can run the worker in the same deployment
or same image.

Remote execution can then be added as an optional executor backend without
making it the baseline model for everyone.

The failure cases below are why I think this matters.
They are not a request to solve every detail in this PR.

For example:

* What happens if the user-visible drop succeeds, but the purge task is not
  recorded yet?
  This matters when entities and tasks are served by different SPIs or
  backends.
  Atomicity across those writes cannot then be assumed.

* What happens if a worker deletes some files and then crashes?
  Who owns retry?
  Where is progress recorded?
  Can another node safely resume a crashed node's work?

* What happens if the worker needs to call Polaris after the table is already
  hidden or dropped from the normal API surface?
  This creates a cyclic dependency unless the task contains the information
  needed to continue without rediscovering the table through loadTable.

* Server-side scan planning is also not a simple service call.
  It either needs a query engine, or the relevant planning parts of one.
  At minimum, the contract needs request budgets: timeouts, cancellation,
  backpressure, result-size limits, fallback behavior, and cache ownership.
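
For the crash and cyclic-dependency cases above, a minimal Python sketch of
one possible shape: the task carries what it needs (here simply a file list
captured at drop time, which is an assumption) and records progress after
each delete, so another node can resume without calling loadTable.
run_purge, delete_file, and save_checkpoint are hypothetical names.

```python
def run_purge(task, delete_file, save_checkpoint):
    """Delete the task's files in order, checkpointing after each delete.

    task: dict with "files" (captured at drop time, so no catalog lookup is
          needed) and "next_index" (the durable progress marker).
    delete_file: must be idempotent; deleting a missing file is not an error.
    save_checkpoint: persists the task; stand-in for a real task store.
    """
    files = task["files"]
    while task["next_index"] < len(files):
        delete_file(files[task["next_index"]])
        task["next_index"] += 1
        save_checkpoint(task)
    task["state"] = "DONE"
    save_checkpoint(task)
```

A crash between a delete and its checkpoint means the resuming node repeats
that one delete, which is why the sketch requires deletes to be idempotent.
A real task would more likely reference manifests than carry a full file
list, but the ownership question is the same.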

The existing proposals already contain most of the useful building blocks.

For me, the safer order is to define the guarantees first, then document
the deployment modes on top.

One possible path could roughly look like this:

1. Define how destructive operations persist the intent for DROP TABLE PURGE.
   The important part is that the user-visible drop and the purge intent are
   recorded atomically.

2. Build on the "Asynchronous & Reliable Tasks" work for the durable
   Polaris task control plane.
   That gives us deterministic task IDs, task state, retry/lost-task
   recovery, and admin-visible status.

3. Use the "Object store functionality" work as the execution library for
   purge/file cleanup.
   That gives us streaming file discovery, bulk deletes, rate limiting,
   stats, and lower heap pressure.

4. Wire DROP TABLE PURGE to a reliable task behavior using those object store
   operations.
   Once Polaris returns success, the table is hidden from normal catalog APIs
   and the purge intent is durable.
   File deletion can continue asynchronously and survive process restarts.

5. Then consider deployment variants.
   A same-image task runner gives self-hosted operators isolation and
   separate scaling without a second protocol or persistence model.
   A remote Delegation Service can still be added later as an optional
   executor backend if SaaS deployments need that shape.
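
To illustrate steps 1, 2, and 4, here is a minimal Python sketch using
sqlite3 from the standard library. It assumes that entities and tasks live
in the same transactional store, which, as noted above, cannot be assumed
when they are served by different SPIs or backends. The schema,
TASK_NAMESPACE, and drop_table_purge are hypothetical and not Polaris code.

```python
import sqlite3
import uuid

# Hypothetical namespace used to derive deterministic task IDs.
TASK_NAMESPACE = uuid.UUID("00000000-0000-0000-0000-000000000001")


def create_schema(conn):
    conn.execute(
        "CREATE TABLE entities (id TEXT PRIMARY KEY, dropped INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE tasks (task_id TEXT PRIMARY KEY, entity_id TEXT, state TEXT)"
    )


def drop_table_purge(conn, entity_id):
    """Hide the table and record the purge intent in one transaction."""
    # Same entity always yields the same task ID, so a retried drop
    # cannot create a second purge task.
    task_id = str(uuid.uuid5(TASK_NAMESPACE, "purge:" + entity_id))
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute("UPDATE entities SET dropped = 1 WHERE id = ?", (entity_id,))
        conn.execute(
            "INSERT OR IGNORE INTO tasks VALUES (?, ?, 'PENDING')",
            (task_id, entity_id),
        )
    return task_id
```

If the task insert fails, the drop is rolled back too, so success is only
returned once the purge intent is durable. When the two writes go to
different backends, this single transaction is not available and the
contract needs another mechanism.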

This is not meant to block pull/push terminology.
It is also not meant to rule out remote execution.
I am mostly trying to avoid publishing push mode as supported release
behavior before the task, security, request-budget, and operational
contracts are defined.

So I would prefer to keep this PR as a design/proposal note for now, or
make the released documentation explicit that push mode is still TBD.

My worry is that otherwise we ship a simple-looking doc that commits the
project to a surprisingly complex distributed-systems design.

Robert

On Wed, May 13, 2026 at 11:50 PM Yufei Gu <[email protected]> wrote:

> Hi folks,
>
> Sharing a few updates regarding the delegation service design doc. JB and I
> will be co-authoring the document, and the PR has been updated accordingly.
>
> Please take a look at the latest changes here:
> https://github.com/apache/polaris/pull/3990
>
> Yufei
>
>
> On Tue, Apr 14, 2026 at 1:56 PM Yufei Gu <[email protected]> wrote:
>
> > Hi everyone,
> >
> > We had a productive discussion on the delegation service during the
> > Polaris Sprint on April 7, thanks all for the great input.
> >
> > As a quick summary, the current direction is to condense the design doc[1]
> > and focus on the two options the community seems to prefer moving forward
> > with: pull mode and push mode. The goal is to keep the doc concise and
> > briefly describe these two modes.
> >
> > Please let me know if I missed anything. Looking forward to your feedback.
> >
> > 1. https://github.com/apache/polaris/pull/3990
> >
> > Thanks,
> > Yufei
> >
>
