Hi everyone,

Thanks to your feedback, I will update the proposal/PR to include a default
object store scan service in Polaris (which can be disabled and replaced by a
custom one).

I will keep you posted when the PR is updated.

Thanks,
Regards
JB

On Tue, Jun 9, 2026 at 9:42 PM Jean-Baptiste Onofré <[email protected]> wrote:

> Hi Robert,
>
> Thanks for your feedback!
>
> From a user perspective, I personally prefer having the Directory and
> Table share the same name, as I find it less confusing to see the
> association at first glance. However, I'm open to including the inventory
> table name as part of the Directory configuration instead.
>
> As mentioned in my initial proposal, the current PR is intended to
> illustrate a potential implementation. It is certainly not the final
> version, and I am happy to update it based on community input. I fully
> agree with the high-level model you outlined, and I believe the PR is
> well-aligned with that direction.
>
> I still believe the inventory table is essential, as it represents the
> core value of the Directory and scanner; without it, users could simply
> create an Iceberg table manually to list objects.
> I'm fine with adding an endpoint in the Directory API to create an
> inventory table without scanning (but using the static schema), and also
> other endpoints to deal with entries in an inventory (if you think it's
> helpful).
>
> Regards,
> JB
>
> On Mon, Jun 8, 2026 at 1:27 PM Robert Stupp <[email protected]> wrote:
>
>> Hi,
>>
>> I support the general direction.
>> Modeling a directory/prefix as a first-class catalog concept in Polaris,
>> complete with an inventory table for discovered objects, seems very
>> useful.
>>
>> I think we should separate agreement on that direction from locking in the
>> exact object model too early, though.
>> One design point I would like to keep open is the relationship between the
>> directory configuration and the inventory table.
>>
>> For example, if the directory configuration and the inventory table share
>> the same name in the same namespace and are distinguished only by object
>> type, that may be workable, but it can create ambiguity for APIs, UI,
>> events, authorization/audit, and lifecycle operations like rename/drop.
>> I don’t think we need to settle that in the first discussion, but I also
>> would not want the current PR shape to imply that this part is already
>> fixed.
>>
>> My preference would be to first agree on the higher-level model:
>>
>> - Polaris has a first-class Directory abstraction.
>> - A Directory has a configured object-store location and scan/inventory
>>   settings.
>> - A Directory is associated with an Iceberg inventory table.
>> - Scanner execution can be discussed separately: Polaris-provided,
>>   disabled, or integrator-provided.
>>
>> Then we can discuss whether the inventory table is implicitly named,
>> explicitly referenced, hidden/internal, user-visible, or modeled some
>> other way.
>>
>> Thoughts?
>>
>> Robert
>>
>> On Sun, Jun 7, 2026 at 7:07 AM Jean-Baptiste Onofré <[email protected]> wrote:
>>
>> > Hi
>> >
>> > I wanted to have two steps in the proposal: the configuration and
>> > high-level architecture (that’s the current proposal), then the
>> > scanning service.
>> >
>> > I think the scanning should be part of Polaris but not mandatory: if
>> > integrators want to have their own scanning, they should be able to do
>> > so. Users should be able to disable the Polaris scanners. Integrators
>> > would probably like to have scanning performed by a distributed engine
>> > or within cloud provider infra.
>> >
>> > So my proposal here is:
>> > 1. Have a scanner in Polaris
>> > 2. Be able to disable the Polaris scanner
>> > 3. Allow users/integrators to provide their own scanners
>> >
>> > The first step is to get consensus on the Polaris Directories proposal
>> > approach.
>> >
>> > I will create a follow-up PR with a scanner.
>> >
>> > Regards
>> > JB
>> >
>> > On Fri, Jun 5, 2026 at 11:25 PM Yufei Gu <[email protected]> wrote:
>> >
>> > > I think one thing we should clarify is where the scanner lives.
>> > >
>> > > If the scanner is completely outside Polaris, the UX becomes a bit
>> > > confusing to me. In that model, Polaris only stores a directory
>> > > configuration, while users still need to bring their own service to
>> > > scan object storage and write an Iceberg table. In that case, I’m not
>> > > sure what value Polaris Directories add over *manually creating an
>> > > Iceberg table to track unstructured data files*. Users can already do
>> > > that today, and it is arguably more flexible because they can define
>> > > any schema they want and use any engine or workflow to populate it.
>> > >
>> > > To me, the more compelling direction is for Polaris to own the scanner
>> > > or at least provide it as part of the project, likely through a push
>> > > mode delegation service[1]. Polaris would still not need to do all the
>> > > heavy scanning work itself, but it should provide a clear, first-class
>> > > workflow for turning a directory configuration into an updated
>> > > directory table, via a delegated service.
>> > >
>> > > That also seems related to Romain’s questions. If the metadata
>> > > extraction and scanning model are fully external, then extensibility
>> > > and streaming support become entirely out of scope. But if Polaris
>> > > provides the scanner framework, we can define clear extension points
>> > > for custom metadata and think about supporting both batch and
>> > > event-driven scanning.
>> > >
>> > > 1. https://github.com/apache/polaris/issues/3786#issuecomment-4503583696
>> > >
>> > > Yufei
>> > >
>> > > On Fri, Jun 5, 2026 at 2:41 AM Romain Manni-Bucau <[email protected]> wrote:
>> > >
>> > > > Hi JB,
>> > > >
>> > > > I have two questions on this scope:
>> > > >
>> > > > 1. Any hope it is extensible, so a user can plug in their own
>> > > >    metadata?
>> > > > 2. Will scanning be made streaming-friendly (I assume phase 0 is
>> > > >    batch)? The idea would be to use a Kappa-like architecture to get
>> > > >    real-time capabilities.
>> > > >
>> > > > Thanks,
>> > > > Romain Manni-Bucau
>> > > > @rmannibucau <https://x.com/rmannibucau> |
>> > > > .NET Blog <https://dotnetbirdie.github.io/> |
>> > > > Blog <https://rmannibucau.github.io/> |
>> > > > Old Blog <http://rmannibucau.wordpress.com> |
>> > > > Github <https://github.com/rmannibucau> |
>> > > > LinkedIn <https://www.linkedin.com/in/rmannibucau> |
>> > > > Book <https://www.packtpub.com/en-us/product/java-ee-8-high-performance-9781788473064>
>> > > > Javaccino founder (Java/.NET service - contact via linkedin)
>> > > >
>> > > > On Fri, Jun 5, 2026 at 2:20 AM Yufei Gu <[email protected]> wrote:
>> > > >
>> > > > > Great to see the progress here. Thanks a lot JB! I will take a
>> > > > > look at the PR.
>> > > > >
>> > > > > Yufei
>> > > > >
>> > > > > On Thu, Jun 4, 2026 at 2:58 AM Jean-Baptiste Onofré <[email protected]> wrote:
>> > > > >
>> > > > > > Hi everyone,
>> > > > > >
>> > > > > > After several months of discussion (involving Directories, Table
>> > > > > > Sources, etc.), I would like to propose Polaris Directories.
>> > > > > >
>> > > > > > I drafted a PR:
>> > > > > > https://github.com/apache/polaris/pull/4613
>> > > > > >
>> > > > > > The proposal is documented as part of the PR:
>> > > > > > https://github.com/jbonofre/polaris/blob/12dfea48570d076d4012143e66f02e8b503c4f99/site/content/in-dev/unreleased/directories.md
>> > > > > >
>> > > > > > In a nutshell, Polaris Directories make objects (including
>> > > > > > unstructured data like images, videos, and documents)
>> > > > > > discoverable alongside structured Iceberg tables within a
>> > > > > > Polaris catalog. A directory points to a base location/prefix on
>> > > > > > an object store and automatically tracks the objects it contains
>> > > > > > by maintaining an Iceberg table with object-level metadata such
>> > > > > > as URI, size, content type, checksum, ...
>> > > > > >
>> > > > > > This means query engines and tools that already know how to read
>> > > > > > Iceberg tables can discover and access unstructured data with
>> > > > > > little or no extra work (accessing the object itself).
>> > > > > >
>> > > > > > A directory has two main parts:
>> > > > > > - Directory configuration, stored by the Polaris server. It
>> > > > > >   describes where the data lives, how to authenticate, which
>> > > > > >   objects to include, and how often to re-scan. The
>> > > > > >   configuration "lives" in a namespace.
>> > > > > > - Directory table, an Iceberg table serving as the inventory of
>> > > > > >   all objects contained in the directory, with one row per
>> > > > > >   object discovered during a scan. The directory table uses the
>> > > > > >   configuration name.
>> > > > > > The Polaris server itself does not perform scans. Instead,
>> > > > > > external services (e.g. directory table scanning service) read
>> > > > > > the directory configuration through the REST API, walk the
>> > > > > > object store, and write the results into the directory table.
>> > > > > >
>> > > > > > I propose we discuss this both on the mailing list (this thread)
>> > > > > > and on the PR. If needed, I'm happy to schedule a dedicated
>> > > > > > meeting.
>> > > > > >
>> > > > > > I'm looking forward to your thoughts!
>> > > > > >
>> > > > > > Thanks!
>> > > > > >
>> > > > > > Regards
>> > > > > > JB
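
For readers who want something concrete, here is a minimal sketch of the model discussed in this thread: a directory configuration, an inventory with one row per object, and a scanner that is either the built-in one, disabled, or provided by an integrator. It is an illustration under stated assumptions, not the design in the PR: every name in it (`DirectoryConfig`, `InventoryRow`, `SCANNERS`, `scan`) is hypothetical, a local folder stands in for the object-store prefix, and a Python list stands in for the Iceberg inventory table. The row fields are the ones named in the proposal (URI, size, content type, checksum).

```python
import hashlib
import mimetypes
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, Iterator, List, Optional


@dataclass(frozen=True)
class DirectoryConfig:
    """Hypothetical directory configuration (not the schema from the PR)."""
    name: str
    namespace: str
    base_location: str        # local path standing in for an object-store prefix
    include_glob: str = "*"   # which objects to include
    scanner: str = "default"  # "default", "disabled", or a registered custom name


@dataclass(frozen=True)
class InventoryRow:
    """One row per discovered object."""
    uri: str
    size: int
    content_type: Optional[str]
    checksum: str


# A scanner turns a configuration into inventory rows.
Scanner = Callable[[DirectoryConfig], Iterator[InventoryRow]]


def default_scanner(config: DirectoryConfig) -> Iterator[InventoryRow]:
    """Walk the base location recursively and describe every matching file."""
    root = Path(config.base_location).resolve()
    for path in sorted(root.rglob(config.include_glob)):
        if not path.is_file():
            continue
        data = path.read_bytes()
        yield InventoryRow(
            uri=path.as_uri(),
            size=len(data),
            content_type=mimetypes.guess_type(path.name)[0],
            checksum=hashlib.sha256(data).hexdigest(),
        )


# Integrators would register their own scanners here (assumed mechanism).
SCANNERS: Dict[str, Scanner] = {"default": default_scanner}


def scan(config: DirectoryConfig) -> List[InventoryRow]:
    """Run the configured scanner; a disabled scanner produces no rows here,
    modelling the case where an external service populates the inventory."""
    if config.scanner == "disabled":
        return []
    if config.scanner not in SCANNERS:
        raise ValueError(f"unknown scanner: {config.scanner}")
    return list(SCANNERS[config.scanner](config))
```

A custom scanner is then just another entry in `SCANNERS`, selected by setting `scanner` in the configuration; whether selection belongs in the directory configuration or in server-level settings is one of the points still open in the discussion above.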
