Feel free to add it to the Polaris community sync agenda for Thu next week (https://docs.google.com/document/d/1TAAMjCtk4KuWSwfxpCBhhK9vM1k_3n7YE4L28slclXU)
On Tue, Jul 15, 2025 at 10:03 AM William Hyun <will...@apache.org> wrote:
>
> Hey Robert,
>
> Thank you for your review and comments!
> To address some of your concerns:
> 1. Polaris would fall back to local execution (current behavior) in this case.
> 2. The delegation service would update the task status as a terminal failure in its persistence, allowing users to retry once a reliable Polaris instance is able to communicate with the delegation service.
> 3. Additional systems for handling retries can be explored in further discussions, but are currently not part of the MVP.
>
> These mostly seem to be implementation details, and I would be happy to have a discussion with you on this!
>
> Bests,
> William
>
> On Wed, Jul 9, 2025 at 7:36 AM Robert Stupp <sn...@snazy.de> wrote:
> >
> > Hi all,
> >
> > Overall, Polaris deserves a thorough asynchronous task handling infrastructure.
> >
> > The general difference from my proposal [1] is that this one is a dedicated service. It seems that there will be different implementations of task types depending on whether those are run inside Polaris or inside the new service; at the least, the (integration) test and maintenance efforts are higher. Having "dedicated task runners" (instances that do not serve IRC requests but only run tasks asynchronously) is possible with [1].
> >
> > The "dedicated service" proposal needs some clarification on a few concerns:
> > 1. Resiliency of Polaris in case the remote delegation service is not, or not reliably, available?
> > 2. Resiliency of the delegation service in case Polaris is not, or not reliably, available?
> > 3. I suspect that both sides require additional retry handling logic in case the respective remote side is not available. Are additional queuing/messaging systems needed?
> >
> > [1] does not require additional credential vending endpoints, and does not require additional infrastructure (k8s, persistent state) nor an additional or separate code base.
> >
> > In summary, [1] would share the exact same code base in every setup, whether a user wants all server instance(s) to serve IRC and tasks or whether a user really wants dedicated instances only for tasks. This means no additional testing overhead, no new publicly accessible security-related endpoints, no new services to care about and maintain, no cross-service communication, and no additional configuration overhead for users.
> >
> > PS: I have to mention that I'm a bit disappointed by this counter proposal to [1], where the latter did not receive a lot of attention since May 19.
> >
> > [1] https://lists.apache.org/thread/gg0kn89vmblmjgllxn7jkn8ky2k28f5l
> >
> >
> > On Thu, Jun 26, 2025 at 12:53 AM William Hyun <will...@apache.org> wrote:
> > >
> > > Hi Anurag,
> > >
> > > Thank you for your interest and for taking the time to review the design doc!
> > >
> > > To answer some of your questions:
> > > 1. The source of truth for all delegated tasks is within the Delegation Service's own persistence layer.
> > > 2. The current document abstracts away the implementation details of the Delegation Service. The intent is to first agree on the high-level architecture and the API contract between the services. For the synchronous MVP, there is no traditional in-memory or message-broker queue. Instead, the persistence layer itself acts as a durable log; a task is persisted upon submission and then processed by the API thread. An example task execution loop has been added to the appendix outlining this approach.
> > > 3. The plan is to provide the Delegation Service as a new, separate Docker image to be deployed alongside the existing Polaris container.
> > > We envision a one-to-one Polaris-to-Delegation-Service binding enforced through the security measures outlined in the document. I have included a new entry in the appendix discussing the high-level approach.
> > >
> > > Thanks again for the valuable questions. Please let me know if these clarifications address your concerns or if you have any further thoughts.
> > >
> > > Bests,
> > > William
> > >
> > > On Tue, Jun 24, 2025 at 5:35 PM Anurag Mantripragada <amantriprag...@apple.com.invalid> wrote:
> > > >
> > > > Thank you for your proposal, William.
> > > >
> > > > This type of companion service is necessary, as evidenced by the other proposal on asynchronous tasks. Overall, this is a promising start. I understand that the scope for this proposal is limited, so please feel free to indicate that something is not in scope. However, I have a few questions:
> > > >
> > > > 1. Could you clarify in the documentation the source of truth for task status? From your diagram, it appears that it is in the delegation service.
> > > > 2. The implementation details of the service are abstracted away. Are these not in scope for this design? (For instance, do we have a task queue in the delegation service?)
> > > > 3. Could you provide additional details on how this service will be deployed?
> > > >
> > > > It becomes very complicated when we transition from a synchronous model to an asynchronous model (handling failures, task executor unavailability, status updates, etc.). We can have a separate discussion for those.
> > > >
> > > > Thank you,
> > > > Anurag Mantripragada
> > > >
> > > >
> > > > > On Jun 24, 2025, at 11:56 AM, William Hyun <will...@apache.org> wrote:
> > > > >
> > > > > Hey Dmitri,
> > > > >
> > > > > Thank you for your comments!
> > > > >
> > > > > I would like to first clarify that while the initial use case is internal, we are not closing the door completely on having the Delegation Service be accessible through user-driven clients.
> > > > > We would love this service to eventually be deployed and run independently from the Polaris Catalog to handle scheduled, asynchronous tasks, as Eric mentioned above with compaction.
> > > > > We believe the REST API is the foundational building block for that evolution, and the initial proposal aims simply to introduce the framework to the Polaris ecosystem with the purge table task as the main focal point.
> > > > >
> > > > > Secondly, to address the concern about task failures, I have added a section in the appendix discussing the expected behavior of failed tasks.
> > > > > Please feel free to take a look and let me know what you think!
> > > > > - https://docs.google.com/document/d/1AhR-cZ6WW6M-z8v53txOfcWvkDXvS-0xcMe3zjLMLj8/edit?tab=t.0#heading=h.fr5gi42vvat3
> > > > >
> > > > > Bests,
> > > > > William
> > > > >
> > > > >
> > > > > On Mon, Jun 23, 2025 at 4:42 PM Dmitri Bourlatchkov <di...@apache.org> wrote:
> > > > >>
> > > > >> Apologies for missing the reference to Robert's doc. I hope it does not invalidate my comments :)
> > > > >>
> > > > >> This is certainly up for discussion.
> > > > >>
> > > > >> To clarify my concern about the REST API: if we are to have resilient tasks and the node that serves the initial REST request fails, other nodes will have to be able to provide responses about the task instead of the failed node. Ultimately the data will come from persistence (I assume). Also, I suppose the Tasks Service is meant for internal interactions (not for user-driven clients). Therefore, it seems to me that the REST API is somewhat superficial in this case.
> > > > >>
> > > > >> Like I mentioned before, this is just what I thought after a quick review. I'll certainly have a deeper look later.
> > > > >>
> > > > >> Cheers,
> > > > >> Dmitri.
> > > > >>
> > > > >> On Mon, Jun 23, 2025 at 6:02 PM Eric Maynard <eric.w.mayn...@gmail.com> wrote:
> > > > >>
> > > > >>> Hey Dmitri,
> > > > >>>
> > > > >>> There's a section in the email above and the linked doc that talks about the linked proposal. See "Relationship to the 'Asynchronous & Reliable Tasks' Proposal".
> > > > >>>
> > > > >>> As for pulling away from a REST API in favor of driving things directly from persistence, there's a lot to discuss here. Bear in mind that the design goes into detail about one proposed "TaskExecutor" implementation; maybe another TaskExecutor could work exactly like you describe. But the reason that this implementation proposes to be driven by a REST API is that there's a lot of interesting future work -- see the "Future Work" section of the doc for some examples -- that can be added on to the REST API. In particular, table maintenance actions like compaction.
> > > > >>>
> > > > >>> --EM
> > > > >>>
> > > > >>> On Mon, Jun 23, 2025 at 2:31 PM Dmitri Bourlatchkov <di...@apache.org> wrote:
> > > > >>>
> > > > >>>> Hi All,
> > > > >>>>
> > > > >>>> A previous proposal by Robert [1] from May 9 appears to be related. I think we should consider both at the same time, possibly as alternatives, but perhaps also sharing / reusing their respective ideas.
> > > > >>>>
> > > > >>>> A few notes after a quick review:
> > > > >>>>
> > > > >>>> * Separate scaling for task executors seems reasonable at first glance, but it adds deployment complexity. If we go with this approach, I believe it would be worth making this deployment strategy optional. In other words, let admin users decide whether they want to have extra nodes dedicated to specific tasks or whether they are ok with having uniform nodes.
> > > > >>>>
> > > > >>>> * I'm not sure a separate rich REST API for submitting tasks is really necessary. Proper synchronization among multiple nodes will probably require roundtrips to Persistence anyway, so task submission could probably be done via Persistence.
> > > > >>>>
> > > > >>>> [1] https://lists.apache.org/thread/gg0kn89vmblmjgllxn7jkn8ky2k28f5l
> > > > >>>>
> > > > >>>> Thanks,
> > > > >>>> Dmitri.
> > > > >>>>
> > > > >>>>
> > > > >>>> On Mon, Jun 23, 2025 at 3:12 PM William Hyun <will...@apache.org> wrote:
> > > > >>>>
> > > > >>>>> Hello Polaris Community,
> > > > >>>>>
> > > > >>>>> I would like to share my proposal for a new service, the Polaris Delegation Service, and to share the design document for discussion and feedback. The Delegation Service is intended to optionally be deployed alongside Polaris to handle the execution of certain long-running tasks.
> > > > >>>>>
> > > > >>>>> 1. Motivation
> > > > >>>>> The Polaris Catalog is optimized for low-latency metadata operations. However, certain tasks, such as purging data files for dropped tables, are resource-intensive and can impact its core performance.
> > > > >>>>> The motivation for this new service is to decouple these I/O-heavy background tasks from the main catalog, ensuring it remains highly responsive while allowing the task execution workload to be managed and scaled independently.
> > > > >>>>>
> > > > >>>>> 2. Proposal
> > > > >>>>> We propose an optional, independent Delegation Service responsible for executing these offloaded operations.
> > > > >>>>> The MVP will focus on synchronously handling the data file deletion process for DROP TABLE WITH PURGE commands.
> > > > >>>>>
> > > > >>>>> 3. Relationship to the "Asynchronous & Reliable Tasks" Proposal
> > > > >>>>> This proposal is designed to be highly synergistic with the existing "Asynchronous & Reliable Tasks" proposal.
> > > > >>>>>
> > > > >>>>> The Asynchronous Task proposal describes a general internal framework for reliably scheduling and managing the lifecycle of any task within Polaris. This proposal, on the other hand, defines a specific, external worker service optimized for executing a particular class of I/O-heavy tasks.
> > > > >>>>>
> > > > >>>>> The Delegation Service does not alter the core Polaris task schema. This allows it to seamlessly act as a specialized "backend" worker that can execute tasks scheduled and managed by the more advanced Asynchronous Task Framework, which would serve as the reliable "frontend." This relationship is explored further in section 10.2 of the document.
> > > > >>>>>
> > > > >>>>> Please find the detailed design document here for review:
> > > > >>>>> - https://docs.google.com/document/d/1AhR-cZ6WW6M-z8v53txOfcWvkDXvS-0xcMe3zjLMLj8/edit?usp=sharing
> > > > >>>>>
> > > > >>>>> Best Regards,
> > > > >>>>> William
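The synchronous MVP mechanics discussed in this thread (the persistence layer acting as a durable log, the task persisted on submission and then executed on the API thread, a terminal FAILED status left for later retry, and Polaris falling back to local execution when the delegation service is unreachable) can be sketched roughly as follows. This is a minimal illustration, not the actual design: every name here is hypothetical, and the real contract lives in the linked design doc.

```python
# Hypothetical sketch of the synchronous delegation flow. Not Polaris code;
# all names are illustrative.
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class DelegationService:
    """The persistence layer doubles as a durable log of submitted tasks."""
    tasks: Dict[str, str] = field(default_factory=dict)  # task_id -> status

    def submit_purge(self, task_id: str, purge: Callable[[], None]) -> str:
        self.tasks[task_id] = "PENDING"   # persist the task before executing
        try:
            purge()                       # run synchronously on the API thread
            self.tasks[task_id] = "SUCCEEDED"
        except Exception:
            self.tasks[task_id] = "FAILED"  # terminal status; user may retry
        return self.tasks[task_id]


def drop_table_with_purge(service, task_id: str,
                          purge: Callable[[], None]) -> str:
    """Polaris side: delegate if the service is reachable, else purge locally."""
    if service is None:                   # delegation service unreachable
        purge()                           # fall back to current local behavior
        return "LOCAL"
    return service.submit_purge(task_id, purge)
```

On a FAILED outcome the task remains in the durable log, so a retry can be issued later once a reliable Polaris instance can reach the service again, matching William's answer at the top of the thread.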