Thanks for the heads up, Robert. I reviewed the API/SPI/Store PR and apart from minor Javadoc comments, that is a +1 for me. I especially like the clear SPI definition that allows using the data store of choice. This is really good work.
Go team!

-- Pierre

On Fri, Dec 12, 2025 at 12:53 PM Robert Stupp <[email protected]> wrote:

> Hi all,
>
> Thanks for the discussion that happened so far.
> After some time of silence, apologies, I would like to revive this
> discussion!
>
> The context, for those who haven't followed the thread since the
> beginning, is to provide a resilient framework to submit long-running
> tasks and to eventually execute those on "any" live Polaris instance.
> (You can find the "full version" in the initial email of this thread
> [1] from May 2025.)
>
> For some context, a related proposal [2] was made in June 2025 to
> keep the existing implementations but move the task execution to a
> separate service, with the current, local behavior as the fallback (or
> default if you like).
>
> The "Async & reliable tasks" proposal allows instances to choose
> whether tasks can only be submitted or whether tasks can also be
> executed. In other words, support for delegation is built-in.
>
> Related to the overall effort is the "Object store functionality"
> proposal [3] (via PR [3256]) to provide a CPU- and heap-friendly API and
> implementation to work against object stores. It is built in a way to
> provide "pluggable" functions.
>
> The "object store functionality" proposal implicitly addresses the
> current issue of running into out-of-memory errors when purging
> Iceberg tables. Details about that issue can be found in [4].
>
> I would like to bring the whole effort "back to life" and propose to
> 1. Start with [2180] and [3256]. Those two are orthogonal and do not
> depend on each other.
> 2. Continue with implementation PR(s) building on top of [2180] for
> both NoSQL and "DB native" persistence.
> 3. Provide a task behavior implementation using "object store
> functionality" to purge Iceberg tables and views.
> 4. Wire the behavior into the existing code base.
> The code base of the existing tasks implementation is not touched by
> this effort.
>
> If all that's in, we could even think about a more intelligent and
> fully automatic approach to purge unreferenced files in object stores
> to keep object store usage at a reasonable size.
>
> Looking forward to hearing your thoughts and a friendly and
> constructive collaboration!
>
> Robert
>
> [1] https://lists.apache.org/thread/gg0kn89vmblmjgllxn7jkn8ky2k28f5l
> (initial email of this thread)
> [2] https://lists.apache.org/thread/ph10th4ocjczpf5gz17mqys4fkp5qrzw
> (delegation service proposal)
> [3] https://lists.apache.org/thread/0z8nb3w58zb9s617gsoyhzlnz53rt9zx
> (object storage operations proposal)
> [3256] https://github.com/apache/polaris/pull/3256 (Object storage
> operations PR)
> [4] https://lists.apache.org/thread/9pgvhr9btfgzofbm6qhyfyqnk62hzp4m (OOM)
> [2180] https://github.com/apache/polaris/pull/2180 (Async & reliable
> tasks PR)
>
> On Mon, Aug 4, 2025 at 11:15 AM Robert Stupp <[email protected]> wrote:
> >
> > Right, the idea is to have a "common abstraction" first.
> > I'm actively looking into exactly that at the moment. Will come up
> > with a couple PRs to enable this.
> > Some of it is implicitly covered by the work that Christopher's
> > contributing, although it's rather orthogonal.
> >
> > On Fri, Aug 1, 2025 at 6:54 PM Eric Maynard <[email protected]> wrote:
> > >
> > > I agree with Robert that the current implementation is not good and
> > > should be ripped out ASAP. However, I see this effort as complementary
> > > to Will's refactor, not as a dependency. We should first add a layer of
> > > abstraction between the business logic in Polaris and the task
> > > execution -- once that's in place, we can replace the existing task
> > > implementation behind that abstraction. At the same time, adding this
> > > abstraction will unlock the ability for us to implement remote task
> > > execution as well.
> > >
> > > --EM
> > >
> > > On Fri, Aug 1, 2025 at 6:31 AM Yufei Gu <[email protected]> wrote:
> > >
> > > > Thanks for the async task proposal. I think it's the right direction
> > > > for async light tasks. Meanwhile, we will still need other models:
> > > > 1. A scalable way to execute synchronous tasks
> > > > 2. A scalable way to execute heavy async tasks, e.g., table
> > > > maintenance tasks.
> > > >
> > > > The delegation service[1] is a good candidate for that.
> > > >
> > > > 1. https://docs.google.com/document/d/1AhR-cZ6WW6M-z8v53txOfcWvkDXvS-0xcMe3zjLMLj8/edit?tab=t.0#heading=h.xjibr7sfbv6a
> > > >
> > > > Yufei
> > > >
> > > >
> > > > On Thu, Jul 31, 2025 at 11:37 AM Russell Spitzer <[email protected]> wrote:
> > > >
> > > > > I'm fine with the plan although I think we should probably change
> > > > > step 4 to allow both the current implementation and the new
> > > > > implementation to exist at the same time with a flag for switching
> > > > > over to the new task implementation. While the new implementation
> > > > > may be much better, it is a pretty significant behavior change that
> > > > > I think should be opt in until it's been in Polaris for a release or
> > > > > two. After that we could force all users to switch once it's been
> > > > > out in the wild for a bit.
> > > > >
> > > > > On 2025/07/30 01:30:43 William Hyun wrote:
> > > > > >
> > > > > > > Considering the current issues, I don't think it's worth the
> > > > > > > effort to keep the current implementation.
> > > > > >
> > > > > >
> > > > > > It seems risky to me to not support the current implementation at
> > > > > > least for the period where the new tasks implementation is unstable.
> > > > > >
> > > > > > Bests,
> > > > > > William
> > > > > >
> > > > > > On Tue, Jul 29, 2025 at 3:49 AM Robert Stupp <[email protected]> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > (starting w/ a recap for everybody watching this thread)
> > > > > > > The goal of this is to have a mechanism to guarantee the
> > > > > > > _eventual_ execution of a task. That may happen immediately on
> > > > > > > the same node or at a later time on another node.
> > > > > > > This particular "async reliable tasks" is to ensure that tasks
> > > > > > > run eventually in any Polaris node. The related "Delegation
> > > > > > > Service" proposal is to let tasks run in a separate, different
> > > > > > > remote service. But it requires a "local fallback" in case the
> > > > > > > remote service is not available, which would be provided by this
> > > > > > > proposal.
> > > > > > >
> > > > > > > Currently, all scheduled and running tasks are "lost", if
> > > > > > > Polaris is stopped, killed or crashed. So I'd prefer to get this
> > > > > > > proposal in first to address the current issues and have a
> > > > > > > reliable fallback for the Delegation Service.
> > > > > > >
> > > > > > > Considering the current issues, I don't think it's worth the
> > > > > > > effort to keep the current implementation.
> > > > > > >
> > > > > > > Both, this proposal and the Delegation Service, shouldn't rely
> > > > > > > on Polaris entities but rather have targeted definitions for the
> > > > > > > tasks to execute, which contain exactly (and not more) what the
> > > > > > > tasks need to be executed.
> > > > > > >
> > > > > > > So I think the following steps (approx 1 PR for each) would be:
> > > > > > > 1. Add the tasks API (the draft PR [1])
> > > > > > > 2. Add the tasks implementation, w/o any persistence integration
> > > > > > > but with mock testing
> > > > > > > 3. Add persistence integration
> > > > > > > 4. Replace current task implementation with the new one
> > > > > > >
> > > > > > > I'll probably have more details soon-ish.
> > > > > > >
> > > > > > > Robert
> > > > > > >
> > > > > > > [1] https://github.com/apache/polaris/pull/2180
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Jul 28, 2025 at 6:22 AM William Hyun <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Hey Robert!
> > > > > > > >
> > > > > > > > Thank you for the draft PR.
> > > > > > > > I have taken a look and the general approach seems good to me.
> > > > > > > > However, one of my concerns would be the timeline to deliver
> > > > > > > > this new task framework refactoring as this could be intrusive
> > > > > > > > due to the scope of the change.
> > > > > > > > What do you plan as the ETA for delivering this change?
> > > > > > > >
> > > > > > > > It seems we need to support both the pre-existing (v1) and new
> > > > > > > > task framework (v2) until we are sure that v2 is stabilized so
> > > > > > > > that we can delete v1.
> > > > > > > > With the Delegation Service proposal being a new feature for
> > > > > > > > users, I am proposing to include it within the 1.1 release as a
> > > > > > > > small, optional extension and also support it in v2 by reusing
> > > > > > > > via implementing v2's SPI module as we previously discussed.
> > > > > > > > I also have opened a PR demonstrating what the Delegation
> > > > > > > > Service looks like here:
> > > > > > > >
> > > > > > > > - https://github.com/apache/polaris/pull/2193
> > > > > > > >
> > > > > > > > WDYT?
> > > > > > > >
> > > > > > > > Bests,
> > > > > > > > William
> > > > > > > >
> > > > > > > > On Thu, Jul 24, 2025 at 11:18 AM Robert Stupp <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > As discussed on the Polaris Community Sync today, we're
> > > > > > > > > aligned that the current tasks handling needs some
> > > > > > > > > refactoring.
> > > > > > > > >
> > > > > > > > > This proposal focuses on the "eventual execution" of a task.
> > > > > > > > > Implementations would follow.
> > > > > > > > > The "Delegation Service" [1] proposal focuses on the
> > > > > > > > > execution of tasks "outside" of Polaris.
> > > > > > > > >
> > > > > > > > > I've pushed a draft PR [2] with the Java interfaces and value
> > > > > > > > > types for the API, the SPI (behavior implementation) and
> > > > > > > > > store (used by tasks implementations).
> > > > > > > > >
> > > > > > > > > The only entry point is the
> > > > > > > > > `org.apache.polaris.tasks.api.Tasks` interface with a
> > > > > > > > > function defining the behavior and providing a parameter
> > > > > > > > > object (if necessary), returning a `TaskSubmission`. Call
> > > > > > > > > sites _may_ subscribe to a `CompletionStage`, but the idea is
> > > > > > > > > that it's rather "fire and forget" and the task behavior does
> > > > > > > > > "everything that's needed". This allows the task to be
> > > > > > > > > executed on any node.
> > > > > > > > > There's no guarantee in any form that a task will run
> > > > > > > > > "locally" or on any other specific node. Every Polaris node
> > > > > > > > > can handle task execution and perform failure/retry handling.
> > > > > > > > > Polaris nodes may use a "server" implementation or a "client"
> > > > > > > > > implementation or a "remote" implementation - that's defined
> > > > > > > > > upon deployment or by configuration (TBD).
> > > > > > > > >
> > > > > > > > > I think that we can get to a Polaris internal API/SPI that
> > > > > > > > > can be leveraged by both proposals.
> > > > > > > > > This proposal is implementation and persistence backend
> > > > > > > > > agnostic.
> > > > > > > > > There could be a "server" implementation that can run tasks,
> > > > > > > > > a "client" implementation that can only submit tasks (think:
> > > > > > > > > from the polaris-admin tool), and an implementation for the
> > > > > > > > > delegation service to execute tasks remotely.
> > > > > > > > >
> > > > > > > > > I do have a working implementation sitting around locally
> > > > > > > > > that's passing tests exercising concurrency, multi-node and
> > > > > > > > > failure scenarios. Since there's only a store-implementation
> > > > > > > > > for NoSQL, I haven't pushed that yet. Adding a
> > > > > > > > > store-implementation that solely uses `BasePersistence`
> > > > > > > > > (JDBC) is not such a big deal.
> > > > > > > > >
> > > > > > > > > If we're okay with the approach in general, I can follow up
> > > > > > > > > with a more concrete implementation including the "purge
> > > > > > > > > table" use case and maybe another example use case.
> > > > > > > > >
> > > > > > > > > Robert
> > > > > > > > >
> > > > > > > > > [1] https://lists.apache.org/thread/ph10th4ocjczpf5gz17mqys4fkp5qrzw
> > > > > > > > > [2] https://github.com/apache/polaris/pull/2180
> > > > > > > > >
> > > > > > > > > On Mon, May 19, 2025 at 12:05 PM Robert Stupp <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > Yes, each "task behavior" has an ID. I've chosen the term
> > > > > > > > > > "task behavior" over "type", because it doesn't only define
> > > > > > > > > > "what's done" but also "when" it's done (delay) and "how it
> > > > > > > > > > behaves" (retries on failures).
> > > > > > > > > >
> > > > > > > > > > On 14.05.25 04:25, Adnan Hemani wrote:
> > > > > > > > > > > Hi Robert,
> > > > > > > > > > >
> > > > > > > > > > > Firstly, thanks for this document. One quick question: is
> > > > > > > > > > > the `behavior ID` basically the task type? This part was
> > > > > > > > > > > slightly unclear to me.
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > > Adnan Hemani
> > > > > > > > > > >
> > > > > > > > > > >> On May 9, 2025, at 6:07 AM, Robert Stupp <[email protected]> wrote:
> > > > > > > > > > >>
> > > > > > > > > > >> Hi,
> > > > > > > > > > >>
> > > > > > > > > > >> Polaris is a service, which has to eventually perform
> > > > > > > > > > >> operations asynchronously. Polaris is also meant to be
> > > > > > > > > > >> backed by multiple server instances (think:
> > > > > > > > > > >> high-availability & load-balancing setups).
> > > > > > > > > > >>
> > > > > > > > > > >> During runtime, things can go sideways in many ways.
> > > > > > > > > > >> Server instances may crash, be killed or whatever... Task
> > > > > > > > > > >> executions may fail, because some other remote service
> > > > > > > > > > >> fails, configuration values (and credentials) may be wrong
> > > > > > > > > > >> or other error situations.
> > > > > > > > > > >>
> > > > > > > > > > >> Task execution should be resilient to both kinds of
> > > > > > > > > > >> scenarios: being able to eventually recover from a
> > > > > > > > > > >> "dead/lost node" scenario and to retry failed tasks.
> > > > > > > > > > >>
> > > > > > > > > > >> Each individual task should also be executed only once.
> > > > > > > > > > >>
> > > > > > > > > > >> There are also different kinds of tasks with different
> > > > > > > > > > >> behaviors: the "function" being executed and the retry
> > > > > > > > > > >> behavior.
> > > > > > > > > > >>
> > > > > > > > > > >> Proposal doc for this:
> > > > > > > > > > >> https://www.google.com/url?q=https://docs.google.com/document/d/17D28E2ne5dzOHWc9DJ91Yz3lnQOtgmWaA_TBNdXv0sY/edit?tab%3Dt.0&source=gmail-imap&ust=1747400861000000&usg=AOvVaw3x56ChuB1ga0MelG6URxxi
> > > > > > > > > > >>
> > > > > > > > > > >> Robert
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > >> --
> > > > > > > > > > >> Robert Stupp
> > > > > > > > > > >> @snazy
> > > > > > > > > > >>
> > > > > > > > > > --
> > > > > > > > > > Robert Stupp
> > > > > > > > > > @snazy
