Thanks JB! I think that's the right direction. That said, I don't think the default scan service should run inside the Polaris service itself. Scanning can be very I/O and network intensive and could easily saturate a Polaris instance. We'll likely need a delegation service for that.
I think the most practical path forward is to work on the delegation service to unblock it. In parallel, we can continue working on volume support without the inventory table.

Yufei

On Sat, Jun 20, 2026 at 10:37 PM Jean-Baptiste Onofré <[email protected]> wrote:

> Hi everyone
>
> Thanks to your feedback, I will update the proposal/PR to include a default
> object store scan service in Polaris (that can be disabled and replaced by
> a custom one).
>
> I will keep you posted when the PR is updated.
>
> Thanks,
>
> Regards
> JB
>
> On Tue, Jun 9, 2026 at 21:42, Jean-Baptiste Onofré <[email protected]> wrote:
>
> > Hi Robert,
> >
> > Thanks for your feedback!
> >
> > From a user perspective, I personally prefer having the Directory and
> > Table share the same name, as I find it less confusing to see the
> > association at first glance. However, I'm open to including the inventory
> > table name as part of the Directory configuration instead.
> >
> > As mentioned in my initial proposal, the current PR is intended to
> > illustrate a potential implementation. It is certainly not the final
> > version, and I am happy to update it based on community input. I fully
> > agree with the high-level model you outlined, and I believe the PR is
> > well-aligned with that direction.
> >
> > I still believe the inventory table is essential, as it represents the
> > core value of the Directory and scanner; without it, users could simply
> > create an Iceberg table manually to list objects.
> > I'm fine with adding an endpoint to the Directory API to create an
> > inventory table without scanning (but using the static schema), and also
> > other endpoints to deal with entries in an inventory (if you think it's
> > helpful).
> >
> > Regards,
> > JB
> >
> > On Mon, Jun 8, 2026 at 1:27 PM Robert Stupp <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> I support the general direction.
> >> Modeling a directory/prefix as a first-class catalog concept in Polaris,
> >> complete with an inventory table for discovered objects, seems very
> >> useful.
> >>
> >> I think we should separate agreement on that direction from locking in
> >> the exact object model too early, though.
> >> One design point I would like to keep open is the relationship between
> >> the directory configuration and the inventory table.
> >>
> >> For example, if the directory configuration and the inventory table
> >> share the same name in the same namespace and are distinguished only by
> >> object type, that may be workable, but it can create ambiguity for APIs,
> >> UI, events, authorization/audit, and lifecycle operations like
> >> rename/drop.
> >> I don’t think we need to settle that in the first discussion, but I also
> >> would not want the current PR shape to imply that this part is already
> >> fixed.
> >>
> >> My preference would be to first agree on the higher-level model:
> >>
> >> - Polaris has a first-class Directory abstraction.
> >> - A Directory has a configured object-store location and scan/inventory
> >>   settings.
> >> - A Directory is associated with an Iceberg inventory table.
> >> - Scanner execution can be discussed separately: Polaris-provided,
> >>   disabled, or integrator-provided.
> >>
> >> Then we can discuss whether the inventory table is implicitly named,
> >> explicitly referenced, hidden/internal, user-visible, or modeled some
> >> other way.
> >>
> >> Thoughts?
> >>
> >> Robert
> >>
> >> On Sun, Jun 7, 2026 at 7:07 AM Jean-Baptiste Onofré <[email protected]>
> >> wrote:
> >>
> >> > Hi
> >> >
> >> > I wanted to have two steps in the proposal: the configuration and high
> >> > level architecture (that’s the current proposal), then the scanning
> >> > service.
> >> >
> >> > I think the scanning should be part of Polaris but not mandatory: if
> >> > integrators want to have their own scanning they should be able to do so.
> >> > Users should be able to disable the Polaris scanners. Integrators would
> >> > probably like to have scanning performed by a distributed engine or
> >> > within cloud provider infra.
> >> >
> >> > So my proposal here is:
> >> > 1. To have a scanner in Polaris
> >> > 2. Be able to disable the Polaris scanner
> >> > 3. Allow users/integrators to provide their own scanners
> >> >
> >> > The first step is to get consensus on the Polaris Directories proposal
> >> > approach.
> >> >
> >> > I will create a follow up PR with a scanner.
> >> >
> >> > Regards
> >> > JB
> >> >
> >> > On Fri, Jun 5, 2026 at 23:25, Yufei Gu <[email protected]> wrote:
> >> >
> >> > > I think one thing we should clarify is where the scanner lives.
> >> > >
> >> > > If the scanner is completely outside Polaris, the UX becomes a bit
> >> > > confusing to me. In that model, Polaris only stores a directory
> >> > > configuration, while users still need to bring their own service to
> >> > > scan object storage and write an Iceberg table. In that case, I’m not
> >> > > sure what value Polaris Directories add over *manually creating an
> >> > > Iceberg table to track unstructured data files*. Users can already do
> >> > > that today, and it is arguably more flexible because they can define
> >> > > any schema they want and use any engine or workflow to populate it.
> >> > >
> >> > > To me, the more compelling direction is for Polaris to own the
> >> > > scanner or at least provide it as part of the project, likely through
> >> > > a push mode delegation service[1]. Polaris would still not need to do
> >> > > all the heavy scanning work itself, but it should provide a clear,
> >> > > first class workflow for turning a directory configuration into an
> >> > > updated directory table, via a delegated service.
> >> > >
> >> > > That also seems related to Romain’s questions.
> >> > > If the metadata extraction and scanning model are fully external,
> >> > > then extensibility and streaming support become entirely out of
> >> > > scope. But if Polaris provides the scanner framework, we can define
> >> > > clear extension points for custom metadata and think about supporting
> >> > > both batch and event driven scanning.
> >> > >
> >> > > 1. https://github.com/apache/polaris/issues/3786#issuecomment-4503583696
> >> > >
> >> > > Yufei
> >> > >
> >> > >
> >> > > On Fri, Jun 5, 2026 at 2:41 AM Romain Manni-Bucau <[email protected]>
> >> > > wrote:
> >> > >
> >> > > > Hi JB,
> >> > > >
> >> > > > I have two questions on this scope:
> >> > > >
> >> > > > 1. Any hope it is extensible, so that a user can plug in their own
> >> > > >    metadata?
> >> > > > 2. Will scanning be made streaming-friendly (I assume phase 0 is
> >> > > >    batch)? The idea would be to use a Kappa-like architecture to get
> >> > > >    real-time capabilities.
> >> > > >
> >> > > > Thanks,
> >> > > > Romain Manni-Bucau
> >> > > > @rmannibucau <https://x.com/rmannibucau> | .NET Blog
> >> > > > <https://dotnetbirdie.github.io/> | Blog <https://rmannibucau.github.io/>
> >> > > > | Old Blog <http://rmannibucau.wordpress.com> | Github
> >> > > > <https://github.com/rmannibucau> | LinkedIn
> >> > > > <https://www.linkedin.com/in/rmannibucau> | Book
> >> > > > <https://www.packtpub.com/en-us/product/java-ee-8-high-performance-9781788473064>
> >> > > > Javaccino founder (Java/.NET service - contact via linkedin)
> >> > > >
> >> > > >
> >> > > > On Fri, Jun 5, 2026 at 02:20, Yufei Gu <[email protected]> wrote:
> >> > > >
> >> > > > > Great to see the progress here. Thanks a lot JB! I will take a
> >> > > > > look at the PR.
> >> > > > >
> >> > > > > Yufei
> >> > > > >
> >> > > > >
> >> > > > > On Thu, Jun 4, 2026 at 2:58 AM Jean-Baptiste Onofré
> >> > > > > <[email protected]> wrote:
> >> > > > >
> >> > > > > > Hi everyone,
> >> > > > > >
> >> > > > > > After several months of discussion (involving Directories,
> >> > > > > > Table Sources, etc), I would like to propose Polaris
> >> > > > > > Directories.
> >> > > > > >
> >> > > > > > I drafted a PR:
> >> > > > > > https://github.com/apache/polaris/pull/4613
> >> > > > > >
> >> > > > > > The proposal is documented as part of the PR:
> >> > > > > >
> >> > > > > > https://github.com/jbonofre/polaris/blob/12dfea48570d076d4012143e66f02e8b503c4f99/site/content/in-dev/unreleased/directories.md
> >> > > > > >
> >> > > > > > In a nutshell, Polaris Directories make objects (including
> >> > > > > > unstructured data like images, videos, and documents)
> >> > > > > > discoverable alongside structured Iceberg tables within a
> >> > > > > > Polaris catalog. A directory points to a base location/prefix
> >> > > > > > on an object store and automatically tracks the objects it
> >> > > > > > contains by maintaining an Iceberg table with object-level
> >> > > > > > metadata such as URI, size, content type, checksum, ...
> >> > > > > >
> >> > > > > > This means query engines and tools that already know how to
> >> > > > > > read Iceberg tables can discover and access unstructured data
> >> > > > > > with little or no extra work (accessing the object itself).
> >> > > > > >
> >> > > > > > A directory has two main parts:
> >> > > > > > - Directory configuration, stored by the Polaris server. It
> >> > > > > >   describes where the data lives, how to authenticate, which
> >> > > > > >   objects to include, and how often to re-scan.
> >> > > > > >   The configuration "lives" in a namespace.
> >> > > > > > - Directory table, an Iceberg table serving as the inventory
> >> > > > > >   of all objects contained in the directory, with one row per
> >> > > > > >   object discovered during a scan. The directory table uses
> >> > > > > >   the configuration name.
> >> > > > > > The Polaris server itself does not perform scans. Instead,
> >> > > > > > external services (e.g. directory table scanning service) read
> >> > > > > > the directory configuration through the REST API, walk the
> >> > > > > > object store, and write the results into the directory table.
> >> > > > > >
> >> > > > > > I propose we discuss this both on the mailing list (this
> >> > > > > > thread) and on the PR. If needed, I'm happy to schedule a
> >> > > > > > dedicated meeting.
> >> > > > > >
> >> > > > > > I'm looking forward to your thoughts!
> >> > > > > >
> >> > > > > > Thanks!
> >> > > > > >
> >> > > > > > Regards
> >> > > > > > JB
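
For readers following the thread, here is a rough sketch of the model being discussed: a directory configuration, inventory rows carrying the metadata named in the proposal (URI, size, content type, checksum), and a scanner that is provided by default, can be disabled, or can be replaced. It is only an illustration under stated assumptions. Every name here (`DirectoryConfig`, `InventoryRow`, `register_scanner`, `run_scan`, the `include_glob` and `scanner` fields) is hypothetical and is not taken from the PR or from any Polaris API, and the object listing is an in-memory stand-in for walking an object store.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Callable, Dict, Iterable, List, Optional


@dataclass(frozen=True)
class DirectoryConfig:
    """Hypothetical directory configuration; it lives in a namespace."""
    namespace: str
    name: str
    base_location: str                  # base location/prefix on the object store
    include_glob: str = "*"             # which objects to include (assumed field)
    scan_interval_seconds: int = 3600   # how often to re-scan (assumed field)
    scanner: Optional[str] = "default"  # None means scanning is disabled


@dataclass(frozen=True)
class InventoryRow:
    """One row per discovered object, with the metadata listed in the proposal."""
    uri: str
    size: int
    content_type: str
    checksum: str


# A scanner maps a configuration plus an object listing to inventory rows.
Scanner = Callable[[DirectoryConfig, Iterable[InventoryRow]], List[InventoryRow]]


def default_scanner(config: DirectoryConfig,
                    listing: Iterable[InventoryRow]) -> List[InventoryRow]:
    """Keep objects under the base location whose relative path matches the glob."""
    rows = []
    for obj in listing:
        if not obj.uri.startswith(config.base_location):
            continue
        relative = obj.uri[len(config.base_location):]
        if fnmatchcase(relative, config.include_glob):
            rows.append(obj)
    return rows


# Registry of available scanners; integrators could add their own.
SCANNERS: Dict[str, Scanner] = {"default": default_scanner}


def register_scanner(name: str, scanner: Scanner) -> None:
    """Make a custom scanner selectable by name."""
    SCANNERS[name] = scanner


def run_scan(config: DirectoryConfig,
             listing: Iterable[InventoryRow]) -> Optional[List[InventoryRow]]:
    """Run the configured scanner, or return None when scanning is disabled."""
    if config.scanner is None:
        return None
    return SCANNERS[config.scanner](config, listing)
```

With this shape, a configuration with `scanner=None` corresponds to "scanner disabled", and a name registered through `register_scanner` corresponds to an integrator-provided scanner. Where the scan actually executes (inside Polaris or in a delegation service) is left open, as in the discussion above.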
