Re: [PROPOSAL] Asynchronous & Reliable Tasks

Yufei Gu Thu, 31 Jul 2025 14:33:00 -0700

Thanks for the async task proposal. I think it's the right direction
for async light tasks. Meanwhile, we will still need other models:
1. A scalable way to execute synchronous tasks
2. A scalable way to execute heavy async tasks, e.g., table maintenance
tasks.


The delegation service[1] is a good candidate for that.

1.
https://docs.google.com/document/d/1AhR-cZ6WW6M-z8v53txOfcWvkDXvS-0xcMe3zjLMLj8/edit?tab=t.0#heading=h.xjibr7sfbv6a

Yufei


On Thu, Jul 31, 2025 at 11:37 AM Russell Spitzer <[email protected]>
wrote:

> I'm fine with the plan although I think we should probably change step 4
> to allow both the current implementation and the new implementation to
> exist at the same time with a flag for switching over to the new task
> implementation. While the new implementation may be much better, it is a
> pretty significant behavior change that I think should be opt in until it's
> been in Polaris for a release or two. After that we could force all users
> to switch once it's been out in the wild for a bit.
>
> On 2025/07/30 01:30:43 William Hyun wrote:
> > >
> > > Considering the current issues, I don't think it's worth the effort to
> > > keep the current implementation.
> >
> >
> > It seems risky to me to not support the current implementation at least
> > for the period where the new tasks implementation is unstable.
> >
> > Bests,
> > William
> >
> > On Tue, Jul 29, 2025 at 3:49 AM Robert Stupp <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > (starting w/ a recap for everybody watching this thread)
> > > The goal of this is to have a mechanism to guarantee the _eventual_
> > > execution of a task. That may happen immediately on the same node or
> > > at a later time on another node.
> > > This particular "async reliable tasks" is to ensure that tasks run
> > > eventually in any Polaris node. The related "Delegation Service"
> > > proposal is to let tasks run in a separate, different remote service.
> > > But it requires a "local fallback" in case the remote service is not
> > > available, which would be provided by this proposal.
> > >
> > > Currently, all scheduled and running tasks are "lost" if Polaris is
> > > stopped, killed, or crashes. So I'd prefer to get this proposal in
> > > first to address the current issues and have a reliable fallback for
> > > the Delegation Service.
> > >
> > > Considering the current issues, I don't think it's worth the effort to
> > > keep the current implementation.
> > >
> > > Neither this proposal nor the Delegation Service should rely on
> > > Polaris entities. Both should instead have targeted definitions for the
> > > tasks to execute, which contain exactly (and not more than) what the
> > > tasks need in order to be executed.
> > >
> > > So I think the following steps (approx 1 PR for each) would be:
> > > 1. Add the tasks API (the draft PR [1])
> > > 2. Add the tasks implementation, w/o any persistence integration but
> > > with mock testing
> > > 3. Add persistence integration
> > > 4. Replace current task implementation with the new one
> > >
> > > I'll probably have more details soon-ish.
> > >
> > > Robert
> > >
> > > [1] https://github.com/apache/polaris/pull/2180
> > >
> > >
> > >
> > > On Mon, Jul 28, 2025 at 6:22 AM William Hyun <[email protected]>
> wrote:
> > > >
> > > > Hey Robert!
> > > >
> > > > Thank you for the draft PR.
> > > > I have taken a look and the general approach seems good to me.
> > > > However, one of my concerns would be the timeline to deliver this new
> > > > task framework refactoring as this could be intrusive due to the
> scope
> > > > of the change.
> > > > What do you plan as the ETA for delivering this change?
> > > >
> > > > It seems we need to support both the pre-existing (v1) and new task
> > > > framework (v2) until we are sure that v2 is stabilized so that we can
> > > > delete v1.
> > > > With the Delegation Service proposal being a new feature for users, I
> > > > am proposing to include it within the 1.1 release as a small, optional
> > > > extension and also to support it in v2 by reusing it via an
> > > > implementation of v2's SPI module, as we previously discussed.
> > > > I also have opened a PR demonstrating what the Delegation Service
> > > > looks like here:
> > > >
> > > > - https://github.com/apache/polaris/pull/2193
> > > >
> > > > WDYT?
> > > >
> > > > Bests,
> > > > William
> > > >
> > > > On Thu, Jul 24, 2025 at 11:18 AM Robert Stupp <[email protected]>
> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > As discussed on the Polaris Community Sync today, we're aligned
> that
> > > > > the current tasks handling needs some refactoring.
> > > > >
> > > > > > This proposal focuses on the "eventual execution" of a task.
> > > > > > Implementations would follow.
> > > > > > The "Delegation Service" [1] proposal focuses on the execution of
> > > > > > tasks "outside" of Polaris.
> > > > >
> > > > > I've pushed a draft PR [2] with the Java interfaces and value types
> > > > > for the API, the SPI (behavior implementation) and store (used by
> > > > > tasks implementations).
> > > > >
> > > > > The only entry point is the `org.apache.polaris.tasks.api.Tasks`
> > > > > interface with a function defining the behavior and providing a
> > > > > parameter object (if necessary), returning a `TaskSubmission`. Call
> > > > > sites _may_ subscribe to a `CompletionStage`, but the idea is that
> > > > > > it's rather "fire and forget" and the task behavior does "everything
> > > > > > that's needed". This allows the task to be executed on any node.
> > > > > > There's no guarantee in any form that a task will run "locally" or on
> > > > > > any other specific node. Every Polaris node can handle task execution
> > > > > > and perform failure/retry handling. Polaris nodes may use a "server"
> > > > > implementation or a "client" implementation or a "remote"
> > > > > implementation - that's defined upon deployment or by configuration
> > > > > (TBD).
> > > > >
> > > > > I think that we can get to a Polaris internal API/SPI that can be
> > > > > leveraged by both proposals.
> > > > > This proposal is implementation and persistence backend agnostic.
> > > > > There could be a "server" implementation that can run tasks, a
> > > > > "client" implementation that can only submit tasks (think: from the
> > > > > polaris-admin tool), and an implementation for the delegation
> service
> > > > > to execute tasks remotely.
> > > > >
> > > > > I do have a working implementation sitting around locally that's
> > > > > passing tests exercising concurrency, multi-node and failure
> > > > > scenarios. Since there's only a store-implementation for NoSQL, I
> > > > > haven't pushed that yet. Adding a store-implementation that solely
> > > > > > uses `BasePersistence` (JDBC) is not such a big deal.
> > > > >
> > > > > If we're okay with the approach in general, I can follow up with a
> > > > > more concrete implementation including the "purge table" use case
> and
> > > > > maybe another example use case.
> > > > >
> > > > > Robert
> > > > >
> > > > > [1]
> https://lists.apache.org/thread/ph10th4ocjczpf5gz17mqys4fkp5qrzw
> > > > > [2] https://github.com/apache/polaris/pull/2180
> > > > >
> > > > > On Mon, May 19, 2025 at 12:05 PM Robert Stupp <[email protected]>
> wrote:
> > > > > >
> > > > > > > Yes, each "task behavior" has an ID. I've chosen the term "task
> > > > > > > behavior" over "type", because it doesn't only define "what's done"
> > > > > > > but also "when" it's done (delay) and "how it behaves" (retries on
> > > > > > > failures).
> > > > > >
> > > > > > On 14.05.25 04:25, Adnan Hemani wrote:
> > > > > > > Hi Robert,
> > > > > > >
> > > > > > > > Firstly, thanks for this document. One quick question: is the
> > > > > > > > `behavior ID` basically the task type? This part was slightly
> > > > > > > > unclear to me.
> > > > > > >
> > > > > > > Best,
> > > > > > > Adnan Hemani
> > > > > > >
> > > > > > >> On May 9, 2025, at 6:07 AM, Robert Stupp <[email protected]>
> wrote:
> > > > > > >>
> > > > > > >> Hi,
> > > > > > >>
> > > > > > > >> Polaris is a service that has to eventually perform operations
> > > > > > > >> asynchronously. Polaris is also meant to be backed by multiple
> > > > > > > >> server instances (think: high-availability & load-balancing
> > > > > > > >> setups).
> > > > > > > >>
> > > > > > > >> During runtime, things can go sideways in many ways. Server
> > > > > > > >> instances may crash or be killed. Task executions may fail because
> > > > > > > >> some other remote service fails, because configuration values (and
> > > > > > > >> credentials) are wrong, or because of other error situations.
> > > > > > > >>
> > > > > > > >> Task execution should be resilient to both kinds of scenarios:
> > > > > > > >> being able to eventually recover from a "dead/lost node" scenario
> > > > > > > >> and to retry failed tasks.
> > > > > > > >>
> > > > > > > >> Each individual task should also be executed only once.
> > > > > > > >>
> > > > > > > >> There are also different kinds of tasks with different behaviors:
> > > > > > > >> the "function" being executed and the retry behavior.
> > > > > > > >> Proposal doc for this:
> > > > > > > >> https://docs.google.com/document/d/17D28E2ne5dzOHWc9DJ91Yz3lnQOtgmWaA_TBNdXv0sY/edit?tab=t.0
> > > > > > >>
> > > > > > >> Robert
> > > > > > >>
> > > > > > >>
> > > > > > >> --
> > > > > > >> Robert Stupp
> > > > > > >> @snazy
> > > > > > >>
> > > > > > --
> > > > > > Robert Stupp
> > > > > > @snazy
> > > > > >
> > >
> >
>
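
The `Tasks` entry point described in the Jul 24 message (submit a behavior
plus an optional parameter, get a `TaskSubmission` back, optionally subscribe
to a `CompletionStage`) could look roughly like the minimal Java sketch below.
Every type shape, field name, and the in-memory implementation here is an
assumption made for illustration; none of it is taken from the draft PR. The
sketch also folds in the May 19 description of a "task behavior" as an ID plus
"what" (function), "when" (delay), and "how it behaves" (retries). It runs
everything in one JVM and has no persistence, so it does not show the
multi-node or crash-recovery guarantees the proposal is about.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch: names and shapes are illustrative, not the draft PR's API.
final class TasksSketch {

  /** A task behavior: an ID plus what is done, when (delay), and how failures are retried. */
  record TaskBehavior<P, R>(String id, Duration delay, int maxAttempts, Function<P, R> function) {
    TaskBehavior {
      if (maxAttempts < 1) {
        throw new IllegalArgumentException("maxAttempts must be >= 1");
      }
    }
  }

  /** Handle returned to the call site; subscribing to the stage is optional. */
  record TaskSubmission<R>(String taskId, CompletionStage<R> completion) {}

  /** Single entry point for submitting tasks. */
  interface Tasks {
    <P, R> TaskSubmission<R> submit(TaskBehavior<P, R> behavior, P param);
  }

  /** In-memory stand-in for a "server" implementation: runs tasks in this JVM only. */
  static final class InMemoryTasks implements Tasks {
    private final AtomicInteger counter = new AtomicInteger();

    @Override
    public <P, R> TaskSubmission<R> submit(TaskBehavior<P, R> behavior, P param) {
      String taskId = behavior.id() + "-" + counter.incrementAndGet();
      // Start the task only after the behavior's delay has elapsed.
      Executor delayed =
          CompletableFuture.delayedExecutor(behavior.delay().toMillis(), TimeUnit.MILLISECONDS);
      CompletableFuture<R> completion =
          CompletableFuture.supplyAsync(() -> runWithRetries(behavior, param), delayed);
      return new TaskSubmission<>(taskId, completion);
    }

    private static <P, R> R runWithRetries(TaskBehavior<P, R> behavior, P param) {
      RuntimeException last = null;
      for (int attempt = 1; attempt <= behavior.maxAttempts(); attempt++) {
        try {
          return behavior.function().apply(param);
        } catch (RuntimeException e) {
          last = e; // retried immediately here; a real implementation would reschedule
        }
      }
      throw last; // all attempts failed; never null because maxAttempts >= 1
    }
  }

  public static void main(String[] args) {
    Tasks tasks = new InMemoryTasks();
    AtomicInteger calls = new AtomicInteger();
    // Fails on the first two attempts, succeeds on the third.
    TaskBehavior<String, String> purge =
        new TaskBehavior<>("purge-table", Duration.ofMillis(10), 3, table -> {
          if (calls.incrementAndGet() < 3) {
            throw new IllegalStateException("transient failure");
          }
          return "purged " + table;
        });
    TaskSubmission<String> submission = tasks.submit(purge, "db.tbl");
    // Call sites may ignore the stage ("fire and forget"); the demo waits for it.
    System.out.println(
        submission.taskId() + ": " + submission.completion().toCompletableFuture().join());
  }
}
```

Keeping the delay and retry policy on the behavior rather than on the call
site matches the "fire and forget" idea: the caller only names the behavior
and hands over a parameter.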

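The opt-in switch suggested in the Jul 31 reply (keep both task
implementations side by side and select the new one with a flag) could be
sketched as below. The property name `polaris.tasks.implementation` and the
values `legacy` and `reliable` are hypothetical, invented for this example;
the thread does not define them. The default stays on the current
implementation, so the new one is used only when explicitly requested.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: the flag name and its values are assumptions, not Polaris configuration.
final class TaskImplSelector {

  enum TaskImpl { LEGACY, RELIABLE }

  /** Assumed property name, for illustration only. */
  static final String FLAG = "polaris.tasks.implementation";

  /** Picks the task implementation; the current (legacy) one is the default. */
  static TaskImpl select(Map<String, String> config) {
    String value = config.getOrDefault(FLAG, "legacy").trim().toLowerCase(Locale.ROOT);
    switch (value) {
      case "legacy":
        return TaskImpl.LEGACY;
      case "reliable":
        return TaskImpl.RELIABLE;
      default:
        throw new IllegalArgumentException("Unknown value for " + FLAG + ": " + value);
    }
  }

  public static void main(String[] args) {
    System.out.println(select(Map.of()));                 // flag unset: current implementation
    System.out.println(select(Map.of(FLAG, "reliable"))); // explicit opt-in to the new one
  }
}
```

Rejecting unknown values instead of silently falling back makes a mistyped
flag visible at startup, which matters while both implementations coexist.
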