Re: [PROPOSAL] Scan Planning with Optional Caching Layers

Yufei Gu Wed, 17 Jun 2026 16:11:59 -0700

Thanks for putting this together. The first phase sounds good to me.

My main concern is that, without some form of delegation service, scan
planning could easily become a heavy workload that impacts Polaris
performance.


The SQL pruning index is also a pretty big topic with a lot of design
choices around ownership, consistency, updates, and operations. I'm not
sure Polaris itself should be responsible for managing the index.

One possible direction is to delegate scan planning and indexing to a
separate service. That would keep Polaris focused on catalog and governance
responsibilities while still enabling these optimizations. In a way, that
brings us back to the delegation service discussion.

Curious what others think.

Yufei


On Tue, Jun 16, 2026 at 12:44 AM Tornike Gurgenidze <[email protected]>
wrote:

> Hi,
>
> I drafted a proposal for adding Iceberg REST-compliant scan planning
> support to Polaris. The proposal doc can be found here:
>
> https://docs.google.com/document/d/1agpz4wwXxWfEy9fJLgPRDcrzdR5USM1i9vQhOBcHo3Q/edit?usp=sharing
>
> TL;DR: the doc proposes first adding a straightforward implementation of
> scan planning and integrating the new endpoints with Polaris authz.
> Subsequently, we can improve scan planning performance with two
> independent caching layers:
>
>    - *CachingFileIO* - a FileIO wrapper around existing FileIO
>    implementations that adds a configurable Caffeine-powered in-memory
>    cache to speed up access to manifest files.
>    - *SQL Pruning Index* - an additional index stored in an RDBMS and
>    asynchronously updated by Polaris when a new table snapshot is
>    registered. The goal is to store all relevant per-file stats in a DB
>    table so that a pruning predicate can be applied in a single SQL
>    query. This is essentially a DuckLake-style index, but used only as a
>    file pruning index rather than as the source of truth. The index is
>    allowed to lag behind the latest snapshot, in which case ScanPlanner
>    will use both the index and the underlying files for the relevant
>    parts of the table metadata.
>
> I have a POC for caching layers in a private repo which you can take a look
> at as well: https://github.com/tokoko/iceberg-cache/.
>
> thanks,
> Tornike
>
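
For readers following the thread, here is a minimal sketch of the pruning-index
idea described in the quoted proposal. It is not taken from the proposal doc or
the POC. The table name `file_stats`, its columns, and the functions
`build_index` and `plan_scan` are all hypothetical, and SQLite stands in for
whatever RDBMS would actually be used. It assumes one indexed integer column
with per-file min/max stats. Files from snapshots the index has not caught up
with are passed in separately and checked in memory.

```python
import sqlite3


def build_index(conn, files):
    """Create the hypothetical per-file stats table and load it.

    `files` is an iterable of (path, lower_bound, upper_bound) tuples for a
    single column. A real index would track many columns and snapshots.
    """
    conn.execute(
        "CREATE TABLE file_stats ("
        "path TEXT PRIMARY KEY, lower_bound INTEGER, upper_bound INTEGER)"
    )
    conn.executemany("INSERT INTO file_stats VALUES (?, ?, ?)", files)


def plan_scan(conn, unindexed_files, lo, hi):
    """Return paths of files whose value range overlaps [lo, hi].

    Indexed files are pruned with a single SQL query. `unindexed_files`
    models the lagging case: stats for files added after the last indexed
    snapshot, assumed here to have been read from manifests, in the same
    (path, lower_bound, upper_bound) shape.
    """
    rows = conn.execute(
        "SELECT path FROM file_stats "
        "WHERE upper_bound >= ? AND lower_bound <= ? ORDER BY path",
        (lo, hi),
    ).fetchall()
    from_index = [row[0] for row in rows]
    # Apply the same range-overlap test in memory to files not yet indexed.
    from_files = sorted(
        path for path, fmin, fmax in unindexed_files
        if fmax >= lo and fmin <= hi
    )
    return from_index + from_files
```

The range-overlap predicate is the same on both paths, so the result should
not depend on how far the index lags. Only the cost of producing it changes.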
