Re: [PROPOSAL] Asynchronous & Reliable Tasks

Robert Stupp Fri, 12 Dec 2025 03:53:14 -0800

Hi all,

Thanks for the discussion so far.
Apologies for the long silence. I would like to revive this discussion!


The context, for those who haven't followed the thread since the
beginning, is to provide a resilient framework to submit long-running
tasks and to eventually execute those on "any" live Polaris instance.
(You can find the "full version" in the initial email of this thread
[1] from May 2025.)

A related proposal [2] from June 2025 suggests keeping the existing
implementations but moving the task execution to a separate service,
with the current, local behavior as the fallback (or default, if you
like).

The "Async & reliable tasks" proposal allows instances to choose
whether tasks can only be submitted or whether tasks can also be
executed. In other words, support for delegation is built-in.
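
To illustrate the submit-only vs. submit-and-execute split, here is a
minimal Java sketch. All names in it (`TaskModeSketch`, `Mode`,
`SharedStore`, `Node`) are hypothetical and are not taken from PR [2180];
the in-memory queue merely stands in for a shared, persistent task store.

```java
import java.util.Optional;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class TaskModeSketch {
  /** Hypothetical per-instance setting; not part of the Polaris code base. */
  enum Mode { SUBMIT_ONLY, SUBMIT_AND_EXECUTE }

  /** Stand-in for a shared, persistent task store (assumption: in-memory queue). */
  static final class SharedStore {
    private final Queue<String> pending = new ConcurrentLinkedQueue<>();

    void persist(String taskId) { pending.add(taskId); }

    /** Atomically removes and returns the next pending task, if any. */
    Optional<String> claimNext() { return Optional.ofNullable(pending.poll()); }

    int pendingCount() { return pending.size(); }
  }

  /** One instance that shares the store with all other instances. */
  static final class Node {
    private final String name;
    private final Mode mode;
    private final SharedStore store;

    Node(String name, Mode mode, SharedStore store) {
      this.name = name;
      this.mode = mode;
      this.store = store;
    }

    /** Every node may submit: the task is only recorded, not run. */
    void submit(String taskId) { store.persist(taskId); }

    /** Runs pending tasks if this node is allowed to; returns how many it ran. */
    int runPending() {
      if (mode == Mode.SUBMIT_ONLY) {
        return 0; // delegation: leave the work to executing nodes
      }
      int executed = 0;
      Optional<String> next;
      while ((next = store.claimNext()).isPresent()) {
        System.out.println(name + " executes " + next.get());
        executed++;
      }
      return executed;
    }
  }

  public static void main(String[] args) {
    SharedStore store = new SharedStore();
    Node admin = new Node("admin-tool", Mode.SUBMIT_ONLY, store);
    Node server = new Node("server-1", Mode.SUBMIT_AND_EXECUTE, store);

    admin.submit("purge-table-1");
    admin.submit("purge-table-2");

    System.out.println("admin-tool ran " + admin.runPending() + " task(s)");
    System.out.println("server-1 ran " + server.runPending() + " task(s)");
  }
}
```

The point of the sketch is only that submission and execution are
decoupled through the store, so a task submitted on one instance can be
picked up by any other instance that is configured to execute.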

Related to the overall effort is the "Object store functionality"
proposal [3] (via PR [3256]), which provides a CPU- and heap-friendly
API and implementation to work against object stores. It is designed
around "pluggable" functions.

The "object store functionality" proposal implicitly addresses the
current issue of running into out-of-memory errors when purging
Iceberg tables. Details about that issue can be found in [4].

I would like to bring the whole effort "back to life" and propose to
1. Start with [2180] and [3256]. Those two are orthogonal and do not
depend on each other.
2. Continue with implementation PR(s) building on top of [2180] for
both NoSQL and "DB native" persistence.
3. Provide a task behavior implementation using "object store
functionality" to purge Iceberg tables and views.
4. Wire the behavior into the existing code base.
The code base of the existing tasks implementation is not touched by
this effort.

If all that's in, we could even think about a more intelligent and
fully automatic approach to purge unreferenced files in object stores
to keep object store usage at a reasonable size.

Looking forward to hearing your thoughts and to a friendly and
constructive collaboration!

Robert

[1] https://lists.apache.org/thread/gg0kn89vmblmjgllxn7jkn8ky2k28f5l
(initial email of this thread)
[2] https://lists.apache.org/thread/ph10th4ocjczpf5gz17mqys4fkp5qrzw
(delegation service proposal)
[3] https://lists.apache.org/thread/0z8nb3w58zb9s617gsoyhzlnz53rt9zx
(object storage operations proposal)
[3256] https://github.com/apache/polaris/pull/3256 (Object storage
operations PR)
[4] https://lists.apache.org/thread/9pgvhr9btfgzofbm6qhyfyqnk62hzp4m (OOM)
[2180] https://github.com/apache/polaris/pull/2180 (Async & reliable tasks PR)




On Mon, Aug 4, 2025 at 11:15 AM Robert Stupp <[email protected]> wrote:
>
> Right, the idea is to have a "common abstraction" first.
> I'm actively looking into exactly that at the moment. Will come up
> with a couple of PRs to enable this.
> Some of it is implicitly covered by the work that Christopher's
> contributing, although it's rather orthogonal.
>
> On Fri, Aug 1, 2025 at 6:54 PM Eric Maynard <[email protected]> wrote:
> >
> > I agree with Robert that the current implementation is not good and should
> > be ripped out ASAP. However, I see this effort as complementary to Will's
> > refactor, not as a dependency. We should first add a layer of abstraction
> > between the business logic in Polaris and the task execution -- once that's
> > in place, we can replace the existing task implementation behind that
> > abstraction. At the same time, adding this abstraction will unlock the
> > ability for us to implement remote task execution as well.
> >
> > --EM
> >
> > On Fri, Aug 1, 2025 at 6:31 AM Yufei Gu <[email protected]> wrote:
> >
> > > Thanks for the async task proposal. I think it's the right direction
> > > for async light tasks. Meanwhile, we will still need other models:
> > > 1. A scalable way to execute synchronous tasks
> > > 2. A scalable way to execute heavy async tasks, e.g., table maintenance
> > > tasks.
> > >
> > > The delegation service[1] is a good candidate for that.
> > >
> > > 1.
> > >
> > > https://docs.google.com/document/d/1AhR-cZ6WW6M-z8v53txOfcWvkDXvS-0xcMe3zjLMLj8/edit?tab=t.0#heading=h.xjibr7sfbv6a
> > >
> > > Yufei
> > >
> > >
> > > On Thu, Jul 31, 2025 at 11:37 AM Russell Spitzer <
> > > [email protected]>
> > > wrote:
> > >
> > > > I'm fine with the plan although I think we should probably change step 4
> > > > to allow both the current implementation and the new implementation to
> > > > exist at the same time with a flag for switching over to the new task
> > > > implementation. While the new implementation may be much better, it is a
> > > > > pretty significant behavior change that I think should be opt-in until
> > > > > it's been in Polaris for a release or two. After that we could force all
> > > > > users to switch once it's been out in the wild for a bit.
> > > >
> > > > On 2025/07/30 01:30:43 William Hyun wrote:
> > > > > >
> > > > > > > Considering the current issues, I don't think it's worth the
> > > > > > > effort to keep the current implementation.
> > > > >
> > > > >
> > > > > > It seems risky to me to not support the current implementation at
> > > > > > least for the period where the new tasks implementation is unstable.
> > > > >
> > > > > Bests,
> > > > > William
> > > > >
> > > > > On Tue, Jul 29, 2025 at 3:49 AM Robert Stupp <[email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > (starting w/ a recap for everybody watching this thread)
> > > > > > The goal of this is to have a mechanism to guarantee the _eventual_
> > > > > > execution of a task. That may happen immediately on the same node or
> > > > > > at a later time on another node.
> > > > > > This particular "async reliable tasks" is to ensure that tasks run
> > > > > > eventually in any Polaris node. The related "Delegation Service"
> > > > > > proposal is to let tasks run in a separate, different remote 
> > > > > > service.
> > > > > > But it requires a "local fallback" in case the remote service is not
> > > > > > available, which would be provided by this proposal.
> > > > > >
> > > > > > Currently, all scheduled and running tasks are "lost", if Polaris is
> > > > > > stopped, killed or crashed. So I'd prefer to get this proposal in
> > > > > > first to address the current issues and have a reliable fallback for
> > > > > > the Delegation Service.
> > > > > >
> > > > > > Considering the current issues, I don't think it's worth the effort
> > > to
> > > > > > keep the current implementation.
> > > > > >
> > > > > > Both, this proposal and the Delegation Service, shouldn't rely on
> > > > > > Polaris entities but rather have targeted definitions for the tasks
> > > to
> > > > > > execute, which contain exactly (and not more) what the tasks need to
> > > > > > be executed.
> > > > > >
> > > > > > So I think the following steps (approx 1 PR for each) would be:
> > > > > > 1. Add the tasks API (the draft PR [1])
> > > > > > 2. Add the tasks implementation, w/o any persistence integration but
> > > > > > with mock testing
> > > > > > 3. Add persistence integration
> > > > > > 4. Replace current task implementation with the new one
> > > > > >
> > > > > > I'll probably have more details soon-ish.
> > > > > >
> > > > > > Robert
> > > > > >
> > > > > > [1] https://github.com/apache/polaris/pull/2180
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, Jul 28, 2025 at 6:22 AM William Hyun <[email protected]>
> > > > wrote:
> > > > > > >
> > > > > > > Hey Robert!
> > > > > > >
> > > > > > > Thank you for the draft PR.
> > > > > > > I have taken a look and the general approach seems good to me.
> > > > > > > However, one of my concerns would be the timeline to deliver this
> > > new
> > > > > > > task framework refactoring as this could be intrusive due to the
> > > > scope
> > > > > > > of the change.
> > > > > > > What do you plan as the ETA for delivering this change?
> > > > > > >
> > > > > > > It seems we need to support both the pre-existing (v1) and new 
> > > > > > > task
> > > > > > > framework (v2) until we are sure that v2 is stabilized so that we
> > > can
> > > > > > > delete v1.
> > > > > > > With the Delegation Service proposal being a new feature for
> > > users, I
> > > > > > > am proposing to include it within the 1.1 release as a small,
> > > > optional
> > > > > > > extension and also support it in v2 by reusing via implementing
> > > v2's
> > > > > > > SPI module as we previously discussed.
> > > > > > > I also have opened a PR demonstrating what the Delegation Service
> > > > > > > looks like here:
> > > > > > >
> > > > > > > - https://github.com/apache/polaris/pull/2193
> > > > > > >
> > > > > > > WDYT?
> > > > > > >
> > > > > > > Bests,
> > > > > > > William
> > > > > > >
> > > > > > > On Thu, Jul 24, 2025 at 11:18 AM Robert Stupp <[email protected]>
> > > > wrote:
> > > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > As discussed on the Polaris Community Sync today, we're aligned
> > > > that
> > > > > > > > the current tasks handling needs some refactoring.
> > > > > > > >
> > > > > > > > This proposal focuses on the "eventual execution" of a task.
> > > > > > > > > Implementations would follow.
> > > > > > > > The "Delegation Service" [1]  proposal focuses on the execution
> > > of
> > > > > > > > tasks "outside" of Polaris.
> > > > > > > >
> > > > > > > > I've pushed a draft PR [2] with the Java interfaces and value
> > > types
> > > > > > > > for the API, the SPI (behavior implementation) and store (used 
> > > > > > > > by
> > > > > > > > tasks implementations).
> > > > > > > >
> > > > > > > > The only entry point is the `org.apache.polaris.tasks.api.Tasks`
> > > > > > > > interface with a function defining the behavior and providing a
> > > > > > > > parameter object (if necessary), returning a `TaskSubmission`.
> > > Call
> > > > > > > > sites _may_ subscribe to a `CompletionStage`, but the idea is
> > > that
> > > > > > > > it's rather "fire and forget" and the task behavior does
> > > > "everything
> > > > > > > > that's needed". This allows the task to be executed on any node.
> > > > > > > > There's no guarantee in any form that a task will run "locally"
> > > or
> > > > any
> > > > > > > > other specific node. Every Polaris node can handle task 
> > > > > > > > execution
> > > > and
> > > > > > > > perform failure/retry handling. Polaris nodes may use a "server"
> > > > > > > > implementation or a "client" implementation or a "remote"
> > > > > > > > implementation - that's defined upon deployment or by
> > > configuration
> > > > > > > > (TBD).
> > > > > > > >
> > > > > > > > I think that we can get to a Polaris internal API/SPI that can 
> > > > > > > > be
> > > > > > > > leveraged by both proposals.
> > > > > > > > This proposal is implementation and persistence backend 
> > > > > > > > agnostic.
> > > > > > > > There could be a "server" implementation that can run tasks, a
> > > > > > > > "client" implementation that can only submit tasks (think: from
> > > the
> > > > > > > > polaris-admin tool), and an implementation for the delegation
> > > > service
> > > > > > > > to execute tasks remotely.
> > > > > > > >
> > > > > > > > I do have a working implementation sitting around locally that's
> > > > > > > > passing tests exercising concurrency, multi-node and failure
> > > > > > > > scenarios. Since there's only a store-implementation for NoSQL, 
> > > > > > > > I
> > > > > > > > haven't pushed that yet. Adding a store-implementation that
> > > solely
> > > > > > > > > uses `BasePersistence` (JDBC) is not such a big deal.
> > > > > > > >
> > > > > > > > If we're okay with the approach in general, I can follow up with
> > > a
> > > > > > > > more concrete implementation including the "purge table" use 
> > > > > > > > case
> > > > and
> > > > > > > > maybe another example use case.
> > > > > > > >
> > > > > > > > Robert
> > > > > > > >
> > > > > > > > [1]
> > > > https://lists.apache.org/thread/ph10th4ocjczpf5gz17mqys4fkp5qrzw
> > > > > > > > [2] https://github.com/apache/polaris/pull/2180
> > > > > > > >
> > > > > > > > On Mon, May 19, 2025 at 12:05 PM Robert Stupp <[email protected]>
> > > > wrote:
> > > > > > > > >
> > > > > > > > > Yes, each "task behavior" has an ID. I've chosen the term 
> > > > > > > > > "task
> > > > > > > > > behavior" over "type", because it doesn't only define "what's
> > > > done"
> > > > > > but
> > > > > > > > > also "when" it's done (delay) and "how it behaves" (retries on
> > > > > > failures).
> > > > > > > > >
> > > > > > > > > On 14.05.25 04:25, Adnan Hemani wrote:
> > > > > > > > > > Hi Robert,
> > > > > > > > > >
> > > > > > > > > > Firstly, thanks for this document. One quick question: is 
> > > > > > > > > > the
> > > > > > `behavior ID` basically the task type? This part was slightly 
> > > > > > unclear
> > > > to me.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Adnan Hemani
> > > > > > > > > >
> > > > > > > > > >> On May 9, 2025, at 6:07 AM, Robert Stupp <[email protected]>
> > > > wrote:
> > > > > > > > > >>
> > > > > > > > > >> Hi,
> > > > > > > > > >>
> > > > > > > > > >> Polaris is a service, which has to eventually perform
> > > > operations
> > > > > > asynchronously. Polaris is also meant to be backed by multiple 
> > > > > > server
> > > > > > instances (think: high-availability & load-balancing setups).
> > > > > > > > > >>
> > > > > > > > > >> During runtime, things can go sideways in many ways. Server
> > > > > > instances may crash, be killed or whatever... Task executions may
> > > fail,
> > > > > > because some other remote service fails, configuration values (and
> > > > > > credentials) may be wrong or other error situations.
> > > > > > > > > >>
> > > > > > > > > >> Task execution should be resilient to both kinds of
> > > scenarios:
> > > > > > being able to eventually recover from a "dead/lost node" scenario 
> > > > > > and
> > > > to
> > > > > > retry failed tasks.
> > > > > > > > > >>
> > > > > > > > > >> Each individual task should also be executed only once.
> > > > > > > > > >>
> > > > > > > > > >> There are also different kinds of tasks with different
> > > > behaviors:
> > > > > > the "function" being executed and the retry behavior.
> > > > > > > > > >>
> > > > > > > > > >> Proposal doc for this:
> > > > > >
> > > >
> > > https://docs.google.com/document/d/17D28E2ne5dzOHWc9DJ91Yz3lnQOtgmWaA_TBNdXv0sY/edit?tab=t.0
> > > > > > > > > >>
> > > > > > > > > >> Robert
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >> --
> > > > > > > > > >> Robert Stupp
> > > > > > > > > >> @snazy
> > > > > > > > > >>
> > > > > > > > > --
> > > > > > > > > Robert Stupp
> > > > > > > > > @snazy
> > > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
