Re: [PROPOSAL] Asynchronous & Reliable Tasks

William Hyun Sun, 27 Jul 2025 21:23:03 -0700

Hey Robert!

Thank you for the draft PR.
I have taken a look and the general approach seems good to me.
However, one of my concerns would be the timeline to deliver this new
task framework refactoring as this could be intrusive due to the scope
of the change.
What do you plan as the ETA for delivering this change?


It seems we need to support both the pre-existing (v1) and new task
framework (v2) until we are sure that v2 is stabilized so that we can
delete v1.
With the Delegation Service proposal being a new feature for users, I
am proposing to include it within the 1.1 release as a small, optional
extension and also support it in v2 by reusing it through an
implementation of v2's SPI module, as we previously discussed.
I also have opened a PR demonstrating what the Delegation Service
looks like here:

- https://github.com/apache/polaris/pull/2193

WDYT?

Bests,
William

On Thu, Jul 24, 2025 at 11:18 AM Robert Stupp <sn...@snazy.de> wrote:
>
> Hi,
>
> As discussed on the Polaris Community Sync today, we're aligned that
> the current tasks handling needs some refactoring.
>
> This proposal focuses on the "eventual execution" of a task.
> Implementations would follow.
> The "Delegation Service" [1] proposal focuses on the execution of
> tasks "outside" of Polaris.
>
> I've pushed a draft PR [2] with the Java interfaces and value types
> for the API, the SPI (behavior implementation) and store (used by
> tasks implementations).
>
> The only entry point is the `org.apache.polaris.tasks.api.Tasks`
> interface with a function defining the behavior and providing a
> parameter object (if necessary), returning a `TaskSubmission`. Call
> sites _may_ subscribe to a `CompletionStage`, but the idea is that
> it's rather "fire and forget" and the task behavior does "everything
> that's needed". This allows the task to be executed on any node.
> There's no guarantee in any form that a task will run "locally" or on
> any other specific node. Every Polaris node can handle task execution and
> perform failure/retry handling. Polaris nodes may use a "server"
> implementation or a "client" implementation or a "remote"
> implementation - that's defined upon deployment or by configuration
> (TBD).
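
To make the entry point described above a bit more concrete, here is a minimal sketch of what a submission could look like. Only the names `Tasks` and `TaskSubmission` come from the proposal; `TaskBehavior`, `InMemoryTasks`, `TasksSketch` and every method signature are assumptions made for illustration, not the API in the draft PR.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical shapes: only the names Tasks and TaskSubmission appear in the
// proposal; every method signature here is an assumption for illustration.
interface TaskBehavior<P, R> {
  String id();        // stable behavior ID
  R execute(P param); // the work itself, must be safe to run on any node
}

interface TaskSubmission<R> {
  String taskId();
  CompletionStage<R> completionStage(); // call sites may subscribe, or ignore it
}

interface Tasks {
  <P, R> TaskSubmission<R> submit(TaskBehavior<P, R> behavior, P param);
}

// Toy single-JVM implementation. A real one would persist the task so that
// any node could execute it and handle failures and retries.
final class InMemoryTasks implements Tasks {
  private final AtomicLong sequence = new AtomicLong();

  @Override
  public <P, R> TaskSubmission<R> submit(TaskBehavior<P, R> behavior, P param) {
    final String id = behavior.id() + "-" + sequence.incrementAndGet();
    final CompletableFuture<R> future =
        CompletableFuture.supplyAsync(() -> behavior.execute(param));
    return new TaskSubmission<R>() {
      @Override public String taskId() { return id; }
      @Override public CompletionStage<R> completionStage() { return future; }
    };
  }
}

final class TasksSketch {
  static String demo() {
    Tasks tasks = new InMemoryTasks();
    TaskBehavior<String, String> purge = new TaskBehavior<String, String>() {
      @Override public String id() { return "purge-table"; }
      @Override public String execute(String table) { return "purged " + table; }
    };
    // Fire and forget would stop after submit(); joining is only for the demo.
    TaskSubmission<String> submission = tasks.submit(purge, "db.events");
    return submission.taskId() + ": "
        + submission.completionStage().toCompletableFuture().join();
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

The point of the sketch is that the call site only hands over a behavior and a parameter object; whether the work then runs locally or on another node is hidden behind the `Tasks` implementation.
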
>
> I think that we can get to a Polaris internal API/SPI that can be
> leveraged by both proposals.
> This proposal is implementation and persistence backend agnostic.
> There could be a "server" implementation that can run tasks, a
> "client" implementation that can only submit tasks (think: from the
> polaris-admin tool), and an implementation for the delegation service
> to execute tasks remotely.
>
> I do have a working implementation sitting around locally that's
> passing tests exercising concurrency, multi-node and failure
> scenarios. Since there's only a store-implementation for NoSQL, I
> haven't pushed that yet. Adding a store-implementation that solely
> uses `BasePersistence` (JDBC) is not such a big deal.
>
> If we're okay with the approach in general, I can follow up with a
> more concrete implementation including the "purge table" use case and
> maybe another example use case.
>
> Robert
>
> [1] https://lists.apache.org/thread/ph10th4ocjczpf5gz17mqys4fkp5qrzw
> [2] https://github.com/apache/polaris/pull/2180
>
> On Mon, May 19, 2025 at 12:05 PM Robert Stupp <sn...@snazy.de> wrote:
> >
> > Yes, each "task behavior" has an ID. I've chosen the term "task
> > behavior" over "type", because it doesn't only define "what's done" but
> > also "when" it's done (delay) and "how it behaves" (retries on failures).
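
As a rough illustration of why "behavior" says more than "type", the sketch below bundles the ID, the delay and the retry handling in one object. All names here (`BehaviorSketch`, `RetryLoop`, `BehaviorDemo`) and the doubling backoff are hypothetical and not taken from the proposal doc or the PR; delays are only recorded, not slept, to keep the example deterministic.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Hypothetical: one object describing what is done, when, and how failures behave.
interface BehaviorSketch<R> {
  String id();                             // stable behavior ID
  Duration initialDelay();                 // "when": delay before the first attempt
  int maxAttempts();                       // "how it behaves": retry budget
  Duration retryDelay(int failedAttempt);  // wait after a failed attempt
  R run(int attempt) throws Exception;     // "what's done"
}

final class RetryLoop {
  // Runs the behavior until it succeeds or the retry budget is exhausted.
  // Planned delays are collected instead of slept.
  static <R> R runWithRetries(BehaviorSketch<R> behavior, List<Duration> plannedDelays)
      throws Exception {
    if (behavior.maxAttempts() < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    plannedDelays.add(behavior.initialDelay());
    Exception last = null;
    for (int attempt = 1; attempt <= behavior.maxAttempts(); attempt++) {
      try {
        return behavior.run(attempt);
      } catch (Exception e) {
        last = e;
        if (attempt < behavior.maxAttempts()) {
          plannedDelays.add(behavior.retryDelay(attempt));
        }
      }
    }
    throw last;
  }
}

final class BehaviorDemo {
  static String demo() {
    BehaviorSketch<String> flaky = new BehaviorSketch<String>() {
      @Override public String id() { return "purge-table"; }
      @Override public Duration initialDelay() { return Duration.ofSeconds(1); }
      @Override public int maxAttempts() { return 5; }
      @Override public Duration retryDelay(int failedAttempt) {
        return Duration.ofSeconds(1L << failedAttempt); // 2s, 4s, 8s, ...
      }
      @Override public String run(int attempt) {
        if (attempt < 3) {
          throw new IllegalStateException("remote service unavailable");
        }
        return "ok after " + attempt + " attempts";
      }
    };
    List<Duration> delays = new ArrayList<>();
    try {
      return RetryLoop.runWithRetries(flaky, delays) + " " + delays;
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```
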
> >
> > On 14.05.25 04:25, Adnan Hemani wrote:
> > > Hi Robert,
> > >
> > > Firstly, thanks for this document. One quick question: is the
> > > `behavior ID` basically the task type? This part was slightly unclear to me.
> > >
> > > Best,
> > > Adnan Hemani
> > >
> > >> On May 9, 2025, at 6:07 AM, Robert Stupp <sn...@snazy.de> wrote:
> > >>
> > >> Hi,
> > >>
> > >> Polaris is a service, which has to eventually perform operations 
> > >> asynchronously. Polaris is also meant to be backed by multiple server 
> > >> instances (think: high-availability & load-balancing setups).
> > >>
> > >> During runtime, things can go sideways in many ways. Server instances 
> > >> may crash, be killed or whatever... Task executions may fail because
> > >> some other remote service fails, because configuration values (and
> > >> credentials) are wrong, or because of other error situations.
> > >>
> > >> Task execution should be resilient to both kinds of scenarios: being 
> > >> able to eventually recover from a "dead/lost node" scenario and to retry 
> > >> failed tasks.
> > >>
> > >> Each individual task should also be executed only once.
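
One common way to combine "recover from a dead node" with "run each task only once" is a time-limited lease per task. The sketch below is an assumption about how that could look, not what the proposal doc specifies; `LeaseStore`, `tryClaim` and `LeaseDemo` are hypothetical names, and an in-memory map stands in for a shared persistence backend.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical lease table: at most one node holds a task's lease at a time,
// and an expired lease (for example from a crashed node) can be taken over.
final class LeaseStore {
  private static final class Lease {
    final String owner;
    final long expiresAtMillis;

    Lease(String owner, long expiresAtMillis) {
      this.owner = owner;
      this.expiresAtMillis = expiresAtMillis;
    }
  }

  private final ConcurrentHashMap<String, Lease> leases = new ConcurrentHashMap<>();

  // Returns true if `node` now holds the lease for `taskId`.
  // compute() runs atomically per key, so two nodes cannot both win.
  boolean tryClaim(String taskId, String node, long nowMillis, long leaseMillis) {
    Lease candidate = new Lease(node, nowMillis + leaseMillis);
    Lease winner = leases.compute(taskId, (id, current) ->
        (current == null || current.expiresAtMillis <= nowMillis) ? candidate : current);
    return winner == candidate;
  }
}

final class LeaseDemo {
  static String demo() {
    LeaseStore store = new LeaseStore();
    boolean first = store.tryClaim("task-1", "node-a", 0, 1000);       // free, claimed
    boolean second = store.tryClaim("task-1", "node-b", 500, 1000);    // lease still held
    boolean takeover = store.tryClaim("task-1", "node-b", 1000, 1000); // lease expired
    return first + "," + second + "," + takeover;
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

A lease alone only guarantees a single runner at a time. To avoid re-running a finished task, a real store would also have to record completion, and the owning node would have to renew its lease while the task is still running.
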
> > >>
> > >> There are also different kinds of tasks with different behaviors: the 
> > >> "function" being executed and the retry behavior.
> > >>
> > >> Proposal doc for this: 
> > >> https://docs.google.com/document/d/17D28E2ne5dzOHWc9DJ91Yz3lnQOtgmWaA_TBNdXv0sY/edit?tab=t.0
> > >>
> > >> Robert
> > >>
> > >>
> > >> --
> > >> Robert Stupp
> > >> @snazy
> > >>
> > --
> > Robert Stupp
> > @snazy
> >
