Re: [PROPOSAL] Scan Planning with Optional Caching Layers

Tornike Gurgenidze Fri, 19 Jun 2026 06:20:38 -0700

Yufei, Adnan, thanks for taking a look at the proposal.

I definitely understand the concern and agree that there should be a way to
keep compute-intensive workloads out of the Polaris server and/or the
metadata DB. Still, my preferred approach would be to implement the entire
functionality first and make it configurable later on, once we have a
better idea of what the Delegation Service will look like (planning will
sit behind a feature flag, after all). If that sounds fine, I can adjust
the proposal to include eventual integration with the Delegation Service
(both for the ScanPlanner SPI and indexing) rather than make the Delegation
Service a hard prerequisite.
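
To illustrate what "configurable later" could mean, here is a rough sketch,
not the proposal's actual SPI. Every name in it (ScanPlanner.plan,
InProcessPlanner, DelegatingPlanner, choose_planner) is hypothetical and
only shows how an in-process planner and a delegated one could sit behind
the same interface and the same feature flag:

```python
from typing import Callable, Dict, List, Optional, Protocol


class ScanPlanner(Protocol):
    """Hypothetical SPI: map a table and a filter to candidate file paths."""

    def plan(self, table: str, predicate: str) -> List[str]: ...


class InProcessPlanner:
    """Plans inside the server; stands in for the initial implementation."""

    def __init__(self, files_by_table: Dict[str, List[str]]):
        self._files = files_by_table

    def plan(self, table: str, predicate: str) -> List[str]:
        # A real planner would read manifests and apply the predicate;
        # this stub simply returns every known file for the table.
        return list(self._files.get(table, []))


class DelegatingPlanner:
    """Forwards planning to an external service via an injected callable."""

    def __init__(self, remote_plan: Callable[[str, str], List[str]]):
        self._remote_plan = remote_plan

    def plan(self, table: str, predicate: str) -> List[str]:
        return self._remote_plan(table, predicate)


def choose_planner(
    planning_enabled: bool,
    local: ScanPlanner,
    delegate: Optional[ScanPlanner] = None,
) -> Optional[ScanPlanner]:
    """Feature flag off: no planner. Otherwise prefer the delegate if set."""
    if not planning_enabled:
        return None
    return delegate if delegate is not None else local
```

With something of this shape, the first phase would ship only the
in-process planner, and a delegated one could be plugged in later without
touching the endpoints.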


Regarding the SQL pruning index: I agree that it's a big topic and probably
valuable even outside the scope of Polaris. Still, since there's no
existing spec for anything like it outside of Polaris, I think it makes
sense to start laying the foundation for it here, for this particular use
case. Don't you agree? In terms of compute, the actual indexing can happen
"externally", maybe orchestrated by the Polaris CLI rather than as a side
effect of a snapshot update.
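
To make the pruning index idea concrete, here is a minimal sketch rather
than the proposal's actual schema. It assumes a hypothetical file_stats
table with one row per data file and column, holding integer lower/upper
bounds, and uses Python's built-in sqlite3 purely for illustration. A
predicate such as ts >= 150 keeps only files whose upper bound could
satisfy it, and files with unknown stats are kept to stay safe:

```python
import sqlite3

# Hypothetical schema: one row per (data file, column) with min/max stats.
SCHEMA = """
CREATE TABLE file_stats (
    snapshot_id INTEGER NOT NULL,  -- snapshot that added the file
    file_path   TEXT    NOT NULL,
    column_name TEXT    NOT NULL,
    lower_bound INTEGER,           -- NULL means stats are unknown
    upper_bound INTEGER
)
"""


def build_index(rows):
    """Create an in-memory index and load the given stat rows into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO file_stats VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def prune_ge(conn, column, value):
    """Return files that may contain rows where `column >= value`."""
    cur = conn.execute(
        "SELECT file_path FROM file_stats "
        "WHERE column_name = ? "
        "AND (upper_bound IS NULL OR upper_bound >= ?) "
        "ORDER BY file_path",
        (column, value),
    )
    return [row[0] for row in cur]


rows = [
    (1, "data/a.parquet", "ts", 0, 99),
    (1, "data/b.parquet", "ts", 100, 199),
    (2, "data/c.parquet", "ts", 200, 299),
    (2, "data/d.parquet", "ts", None, None),  # no stats: never pruned
]
conn = build_index(rows)
candidates = prune_ge(conn, "ts", 150)
# candidates == ["data/b.parquet", "data/c.parquet", "data/d.parquet"]
```

The point is that pruning becomes a single indexed query instead of a walk
over manifest files. Whatever populates the table is orthogonal to the
query itself.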

In short, while I agree that we should coordinate planning and the
Delegation Service, I'd much rather implement the feature first and then
build the Delegation Service around it, especially since there are two
types of delegation requirement here (invoking an external planner and
notifying an external indexer).

Thanks,
Tornike

On Fri, Jun 19, 2026 at 2:12 AM Adnan Hemani via dev <[email protected]>
wrote:

> I agree with Yufei - I don't think we can implement something as heavy as
> server-side planning directly onto Polaris as it stands. I think we need to
> revisit the Delegation Service discussion; it would be a great place to
> implement this type of functionality.
>
> Best,
> Adnan Hemani
>
> On Wed, Jun 17, 2026 at 4:11 PM Yufei Gu <[email protected]> wrote:
>
> > Thanks for putting this together. The first phase sounds good to me.
> >
> > My main concern is that, without some form of delegation service, scan
> > planning could easily become a heavy workload that impacts Polaris
> > performance.
> >
> > The SQL pruning index is also a pretty big topic with a lot of design
> > choices around ownership, consistency, updates, and operations. I'm not
> > sure Polaris itself should be responsible for managing the index.
> >
> > One possible direction is to delegate scan planning and indexing to a
> > separate service. That would keep Polaris focused on catalog and
> > governance responsibilities while still enabling these optimizations. In
> > a way, that brings us back to the delegation service discussion.
> >
> > Curious what others think.
> >
> > Yufei
> >
> >
> > On Tue, Jun 16, 2026 at 12:44 AM Tornike Gurgenidze <
> > [email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > I drafted a proposal regarding adding iceberg rest-compliant scan
> > > planning support to Polaris. The proposal doc can be found here:
> > >
> > > https://docs.google.com/document/d/1agpz4wwXxWfEy9fJLgPRDcrzdR5USM1i9vQhOBcHo3Q/edit?usp=sharing
> > >
> > > tldr: doc proposes to first add a straightforward implementation of
> > > scan planning in the initial phase and integrate new endpoints with
> > > polaris authz. Subsequently, we can enhance scan planning performance
> > > with 2 independent caching layers:
> > >
> > >    - *CachingFileIO* - FileIO wrapper that wraps existing FileIO
> > >    implementations and introduces a configurable Caffeine-powered
> > >    in-memory cache to speed up access to manifest files.
> > >    - *SQL Pruning Index* - additional index stored in a rdbms and
> > >    asynchronously updated by polaris when a new table snapshot is
> > >    registered. The goal is to store all relevant per-file stats in a db
> > >    table that will allow applying a pruning predicate in a single sql
> > >    query. This is essentially a ducklake-style index but used only as a
> > >    file pruning index rather than the source of truth. Index is allowed
> > >    to lag behind the latest snapshot in which case ScanPlanner will use
> > >    both index and underlying files for the relevant parts of the table
> > >    metadata.
> > >
> > > I have a POC for caching layers in a private repo which you can take a
> > > look at as well: https://github.com/tokoko/iceberg-cache/.
> > >
> > > thanks,
> > > Tornike
> > >
> >
>
