Feel free to add it to the Polaris community sync agenda for Thu next week (https://docs.google.com/document/d/1TAAMjCtk4KuWSwfxpCBhhK9vM1k_3n7YE4L28slclXU)
On Tue, Jul 15, 2025 at 10:03 AM William Hyun <will...@apache.org> wrote:
>
> Hey Robert,
>
> Thank you for your review and comments!
> To address some of your concerns:
> 1. Polaris would fall back to local execution (current behavior) in this case.
> 2. The delegation service would update the task status as a terminal failure in its persistence, allowing users to retry once a reliable Polaris instance is able to communicate with the delegation service.
> 3. Additional systems for handling retries can be explored in further discussions, but are currently not part of the MVP.
>
> These mostly seem to be implementation details, and I would be happy to have a discussion with you on this!
>
> Bests,
> William
>
> On Wed, Jul 9, 2025 at 7:36 AM Robert Stupp <sn...@snazy.de> wrote:
> >
> > Hi all,
> >
> > Overall, Polaris deserves a thorough asynchronous task handling infrastructure.
> >
> > The general difference from my proposal [1] is that this one is a dedicated service. It seems that there will be different implementations of task types depending on whether those are run inside Polaris or inside the new service; at the least, the (integration) test and maintenance efforts are higher. Having "dedicated task runners" (instances that do not serve IRC requests but only run tasks asynchronously) is possible with [1].
> >
> > The "dedicated service" proposal needs some clarification on a few concerns:
> > 1. Resiliency of Polaris in case the remote delegation service is not, or not reliably, available?
> > 2. Resiliency of the delegation service in case Polaris is not, or not reliably, available?
> > 3. I suspect that both sides require additional retry handling logic in case the respective remote side is not available. Are additional queuing/messaging systems needed?
> >
> > [1] does not require additional credential vending endpoints, and does not require additional infrastructure (k8s, persistent state) nor an additional or separate code base.
> >
> > In summary, [1] would share the exact same code base in every setup, whether a user wants all server instance(s) to serve IRC and tasks or whether a user really wants dedicated instances only for tasks. This means no additional testing overhead, no new publicly accessible security-related endpoints, no new services to care about and maintain, no cross-service communication, and no additional configuration overhead for users.
> >
> > PS: I have to mention that I'm a bit disappointed by this counter proposal to [1], where the latter did not receive a lot of attention since May 19.
> >
> > [1] https://lists.apache.org/thread/gg0kn89vmblmjgllxn7jkn8ky2k28f5l
> >
> >
> > On Thu, Jun 26, 2025 at 12:53 AM William Hyun <will...@apache.org> wrote:
> > >
> > > Hi Anurag,
> > >
> > > Thank you for your interest and for taking the time to review the design doc!
> > >
> > > To answer some of your questions:
> > > 1. The source of truth for all delegated tasks is within the Delegation Service's own persistence layer.
> > > 2. The current document abstracts away the implementation details of the Delegation Service. The intent is to first agree on the high-level architecture and the API contract between the services. For the synchronous MVP, there is no traditional in-memory or message-broker queue. Instead, the persistence layer itself acts as a durable log; a task is persisted upon submission and then processed by the API thread. An example task execution loop has been added to the appendix outlining this approach.
> > > 3. The plan is to provide the Delegation Service as a new, separate Docker image to be deployed alongside the existing Polaris container.
> > > We envision a one-to-one Polaris-to-Delegation-Service binding enforced through the security measures outlined in the document. I have included a new entry in the appendix discussing the high-level approach.
> > >
> > > Thanks again for the valuable questions. Please let me know if these clarifications address your concerns or if you have any further thoughts.
> > >
> > > Bests,
> > > William
> > >
> > > On Tue, Jun 24, 2025 at 5:35 PM Anurag Mantripragada <amantriprag...@apple.com.invalid> wrote:
> > > >
> > > > Thank you for your proposal, William.
> > > >
> > > > This type of companion service is necessary, as evidenced by the other proposal on asynchronous tasks. Overall, this is a promising start. I understand that the scope for this proposal is limited, so please feel free to indicate that something is not in scope. However, I have a few questions:
> > > >
> > > > 1. Could you clarify in the documentation the source of truth for task status? From your diagram, it appears that it is in the delegation service.
> > > > 2. The implementation details of the service are abstracted away. Are these not in scope for this design? (For instance, do we have a task queue in the delegation service?)
> > > > 3. Could you provide additional details on how this service will be deployed?
> > > >
> > > > It becomes very complicated when we transition from a synchronous model to an asynchronous model (handling failures, task executor unavailability, status updates, etc.). We can have a separate discussion for those.
> > > >
> > > > Thank you,
> > > > Anurag Mantripragada
> > > >
> > > >
> > > > > On Jun 24, 2025, at 11:56 AM, William Hyun <will...@apache.org> wrote:
> > > > >
> > > > > Hey Dmitri,
> > > > >
> > > > > Thank you for your comments!
> > > > >
> > > > > I would like to first clarify that while the initial use case is internal, we are not closing the door completely on having the Delegation Service be accessible through user-driven clients.
> > > > > We would love this service to eventually be deployed and run independently from the Polaris Catalog to handle scheduled, asynchronous tasks, as Eric mentioned above with compaction.
> > > > > We believe the REST API is the foundational building block for that evolution, and the initial proposal aims simply to introduce the framework to the Polaris ecosystem with the purge table task as the main focal point.
> > > > >
> > > > > Secondly, to address the concern about task failures, I have added a section in the appendix discussing the expected behavior of failed tasks.
> > > > > Please feel free to take a look and let me know what you think!
> > > > > - https://docs.google.com/document/d/1AhR-cZ6WW6M-z8v53txOfcWvkDXvS-0xcMe3zjLMLj8/edit?tab=t.0#heading=h.fr5gi42vvat3
> > > > >
> > > > > Bests,
> > > > > William
> > > > >
> > > > >
> > > > > On Mon, Jun 23, 2025 at 4:42 PM Dmitri Bourlatchkov <di...@apache.org> wrote:
> > > > >>
> > > > >> Apologies for missing the reference to Robert's doc. I hope it does not invalidate my comments :)
> > > > >>
> > > > >> This is certainly up for discussion.
> > > > >>
> > > > >> To clarify my concern about the REST API: if we are to have resilient tasks and the node that serves the initial REST request fails, other nodes will have to be able to provide responses about the task instead of the failed node. Ultimately the data will come from persistence (I assume). Also, I suppose the Tasks Service is meant for internal interactions (not for user-driven clients). Therefore, it seems to me that the REST API is somewhat superficial in this case.
> > > > >>
> > > > >> Like I mentioned before, this is just what I thought after a quick review. I'll certainly have a deeper look later.
> > > > >>
> > > > >> Cheers,
> > > > >> Dmitri.
> > > > >>
> > > > >> On Mon, Jun 23, 2025 at 6:02 PM Eric Maynard <eric.w.mayn...@gmail.com> wrote:
> > > > >>
> > > > >>> Hey Dmitri,
> > > > >>>
> > > > >>> There's a section in the email above and the linked doc that talks about the linked proposal. See "Relationship to the 'Asynchronous & Reliable Tasks' Proposal".
> > > > >>>
> > > > >>> As for pulling away from a REST API in favor of driving things directly from persistence, there's a lot to discuss here. Bear in mind that the design goes into detail about one proposed "TaskExecutor" implementation; maybe another TaskExecutor could work exactly like you describe. But the reason that this implementation proposes to be driven by a REST API is that there's a lot of interesting future work -- see the "Future Work" section of the doc for some examples -- that can be added on to the REST API. In particular, table maintenance actions like compaction.
> > > > >>>
> > > > >>> --EM
> > > > >>>
> > > > >>> On Mon, Jun 23, 2025 at 2:31 PM Dmitri Bourlatchkov <di...@apache.org> wrote:
> > > > >>>
> > > > >>>> Hi All,
> > > > >>>>
> > > > >>>> A previous proposal by Robert [1] from May 9 appears to be related. I think we should consider both at the same time, possibly as alternatives, but perhaps also sharing / reusing their respective ideas.
> > > > >>>>
> > > > >>>> A few notes after a quick review:
> > > > >>>>
> > > > >>>> * Separate scaling for task executors seems reasonable at first glance, but it adds deployment complexity. If we go with this approach, I believe it would be worth making this deployment strategy optional. In other words, let admin users decide whether they want to have extra nodes dedicated to specific tasks or whether they are ok with having uniform nodes.
> > > > >>>>
> > > > >>>> * I'm not sure a separate rich REST API for submitting tasks is really necessary. Proper synchronization among multiple nodes will probably require roundtrips to Persistence anyway, so task submission could probably be done via Persistence.
> > > > >>>>
> > > > >>>> [1] https://lists.apache.org/thread/gg0kn89vmblmjgllxn7jkn8ky2k28f5l
> > > > >>>>
> > > > >>>> Thanks,
> > > > >>>> Dmitri.
> > > > >>>>
> > > > >>>>
> > > > >>>> On Mon, Jun 23, 2025 at 3:12 PM William Hyun <will...@apache.org> wrote:
> > > > >>>>
> > > > >>>>> Hello Polaris Community,
> > > > >>>>>
> > > > >>>>> I would like to share my proposal for a new service, the Polaris Delegation Service, and to share the design document for discussion and feedback. The Delegation Service is intended to optionally be deployed alongside Polaris to handle the execution of certain long-running tasks.
> > > > >>>>>
> > > > >>>>> 1. Motivation
> > > > >>>>> The Polaris Catalog is optimized for low-latency metadata operations. However, certain tasks, such as purging data files for dropped tables, are resource-intensive and can impact its core performance.
> > > > >>>>> The motivation for this new service is to decouple these I/O-heavy background tasks from the main catalog, ensuring it remains highly responsive while allowing the task execution workload to be managed and scaled independently.
> > > > >>>>>
> > > > >>>>> 2. Proposal
> > > > >>>>> We propose an optional, independent Delegation Service responsible for executing these offloaded operations.
> > > > >>>>> The MVP will focus on synchronously handling the data file deletion process for DROP TABLE WITH PURGE commands.
> > > > >>>>>
> > > > >>>>> 3. Relationship to the "Asynchronous & Reliable Tasks" Proposal
> > > > >>>>> This proposal is designed to be highly synergistic with the existing "Asynchronous & Reliable Tasks" proposal.
> > > > >>>>>
> > > > >>>>> The Asynchronous Task proposal describes a general internal framework for reliably scheduling and managing the lifecycle of any task within Polaris. This proposal, on the other hand, defines a specific, external worker service optimized for executing a particular class of I/O-heavy tasks.
> > > > >>>>>
> > > > >>>>> The Delegation Service does not alter the core Polaris task schema. This allows it to seamlessly act as a specialized "backend" worker that can execute tasks scheduled and managed by the more advanced Asynchronous Task Framework, which would serve as the reliable "frontend." This relationship is explored further in section 10.2 of the document.
> > > > >>>>>
> > > > >>>>> Please find the detailed design document here for review:
> > > > >>>>> - https://docs.google.com/document/d/1AhR-cZ6WW6M-z8v53txOfcWvkDXvS-0xcMe3zjLMLj8/edit?usp=sharing
> > > > >>>>>
> > > > >>>>> Best Regards,
> > > > >>>>> William
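The synchronous MVP mechanics discussed in this thread (the persistence layer acting as a durable log, the task persisted on submission and then executed on the API thread, a terminal FAILED status left for later retry, and Polaris falling back to local execution when the delegation service is unreachable) can be sketched roughly as follows. This is a minimal illustration, not the actual design: every name here is hypothetical, and the real contract lives in the linked design doc.

```python
# Hypothetical sketch of the synchronous delegation flow. Not Polaris code;
# all names are illustrative.
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class DelegationService:
    """The persistence layer doubles as a durable log of submitted tasks."""
    tasks: Dict[str, str] = field(default_factory=dict)  # task_id -> status

    def submit_purge(self, task_id: str, purge: Callable[[], None]) -> str:
        self.tasks[task_id] = "PENDING"   # persist the task before executing
        try:
            purge()                       # run synchronously on the API thread
            self.tasks[task_id] = "SUCCEEDED"
        except Exception:
            self.tasks[task_id] = "FAILED"  # terminal status; user may retry
        return self.tasks[task_id]


def drop_table_with_purge(service, task_id: str,
                          purge: Callable[[], None]) -> str:
    """Polaris side: delegate if the service is reachable, else purge locally."""
    if service is None:                   # delegation service unreachable
        purge()                           # fall back to current local behavior
        return "LOCAL"
    return service.submit_purge(task_id, purge)
```

On a FAILED outcome the task remains in the durable log, so a retry can be issued later once a reliable Polaris instance can reach the service again, matching William's answer at the top of the thread.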