Let me clarify: the default scan service code lives in the Polaris project, but it runs outside of the Polaris server.

Regards
JB

On Sun, Jun 21, 2026 at 8:30 PM Yufei Gu <[email protected]> wrote:

> Thanks JB! I think that's the right direction.
>
> That said, I don't think the default scan service should run inside the
> Polaris service itself. Scanning can be very I/O and network intensive and
> could easily saturate a Polaris instance. We'll likely need a delegation
> service for that.
>
> I think the most practical path forward is to work on the delegation
> service to unblock it. In parallel, we can continue working on volume
> support without the inventory table.
>
> Yufei
>
>
> On Sat, Jun 20, 2026 at 10:37 PM Jean-Baptiste Onofré <[email protected]>
> wrote:
>
>> Hi everyone
>>
>> Thanks to your feedback, I will update the proposal/PR to include a
>> default object store scan service in Polaris (that can be disabled and
>> replaced by a custom one).
>>
>> I will keep you posted when the PR is updated.
>>
>> Thanks,
>>
>> Regards
>> JB
>>
>> On Tue, Jun 9, 2026 at 9:42 PM Jean-Baptiste Onofré <[email protected]>
>> wrote:
>>
>> > Hi Robert,
>> >
>> > Thanks for your feedback!
>> >
>> > From a user perspective, I personally prefer having the Directory and
>> > Table share the same name, as I find it less confusing to see the
>> > association at first glance. However, I'm open to including the
>> > inventory table name as part of the Directory configuration instead.
>> >
>> > As mentioned in my initial proposal, the current PR is intended to
>> > illustrate a potential implementation. It is certainly not the final
>> > version, and I am happy to update it based on community input. I fully
>> > agree with the high-level model you outlined, and I believe the PR is
>> > well-aligned with that direction.
>> >
>> > I still believe the inventory table is essential, as it represents the
>> > core value of the Directory and scanner; without it, users could simply
>> > create an Iceberg table manually to list objects.
>> >
>> > I'm fine with adding an endpoint to the Directory API to create an
>> > inventory table without scanning (but using the static schema), as
>> > well as other endpoints to deal with entries in an inventory (if you
>> > think it's helpful).
>> >
>> > Regards,
>> > JB
>> >
>> >
>> > On Mon, Jun 8, 2026 at 1:27 PM Robert Stupp <[email protected]> wrote:
>> >
>> >> Hi,
>> >>
>> >> I support the general direction.
>> >> Modeling a directory/prefix as a first-class catalog concept in
>> >> Polaris, complete with an inventory table for discovered objects,
>> >> seems very useful.
>> >>
>> >> I think we should separate agreement on that direction from locking
>> >> in the exact object model too early, though.
>> >> One design point I would like to keep open is the relationship
>> >> between the directory configuration and the inventory table.
>> >>
>> >> For example, if the directory configuration and the inventory table
>> >> share the same name in the same namespace and are distinguished only
>> >> by object type, that may be workable, but it can create ambiguity for
>> >> APIs, UI, events, authorization/audit, and lifecycle operations like
>> >> rename/drop.
>> >> I don’t think we need to settle that in the first discussion, but I
>> >> also would not want the current PR shape to imply that this part is
>> >> already fixed.
>> >>
>> >> My preference would be to first agree on the higher-level model:
>> >>
>> >> - Polaris has a first-class Directory abstraction.
>> >> - A Directory has a configured object-store location and
>> >>   scan/inventory settings.
>> >> - A Directory is associated with an Iceberg inventory table.
>> >> - Scanner execution can be discussed separately: Polaris-provided,
>> >>   disabled, or integrator-provided.
>> >>
>> >> Then we can discuss whether the inventory table is implicitly named,
>> >> explicitly referenced, hidden/internal, user-visible, or modeled some
>> >> other way.
>> >>
>> >> Thoughts?
>> >>
>> >> Robert
>> >>
>> >> On Sun, Jun 7, 2026 at 7:07 AM Jean-Baptiste Onofré <[email protected]>
>> >> wrote:
>> >>
>> >> > Hi
>> >> >
>> >> > I wanted to have two steps in the proposal: the configuration and
>> >> > high level architecture (that’s the current proposal), then the
>> >> > scanning service.
>> >> >
>> >> > I think the scanning should be part of Polaris but not mandatory:
>> >> > if integrators want to have their own scanning they should be able
>> >> > to do so. Users should be able to disable the Polaris scanners.
>> >> > Integrators would probably like to have scanning performed by a
>> >> > distributed engine or within cloud provider infra.
>> >> >
>> >> > So my proposal here is:
>> >> > 1. To have a scanner in Polaris
>> >> > 2. Be able to disable the Polaris scanner
>> >> > 3. Allow users/integrators to provide their own scanners
>> >> >
>> >> > The first step is to get consensus on the Polaris Directories
>> >> > proposal approach.
>> >> >
>> >> > I will create a follow up PR with a scanner.
>> >> >
>> >> > Regards
>> >> > JB
>> >> >
>> >> > On Fri, Jun 5, 2026 at 11:25 PM Yufei Gu <[email protected]> wrote:
>> >> >
>> >> > > I think one thing we should clarify is where the scanner lives.
>> >> > >
>> >> > > If the scanner is completely outside Polaris, the UX becomes a
>> >> > > bit confusing to me. In that model, Polaris only stores a
>> >> > > directory configuration, while users still need to bring their
>> >> > > own service to scan object storage and write an Iceberg table. In
>> >> > > that case, I’m not sure what value Polaris Directories add over
>> >> > > *manually creating an Iceberg table to track unstructured data
>> >> > > files*. Users can already do that today, and it is arguably more
>> >> > > flexible because they can define any schema they want and use any
>> >> > > engine or workflow to populate it.
>> >> > >
>> >> > > To me, the more compelling direction is for Polaris to own the
>> >> > > scanner or at least provide it as part of the project, likely
>> >> > > through a push mode delegation service [1]. Polaris would still
>> >> > > not need to do all the heavy scanning work itself, but it should
>> >> > > provide a clear, first class workflow for turning a directory
>> >> > > configuration into an updated directory table, via a delegated
>> >> > > service.
>> >> > >
>> >> > > That also seems related to Romain’s questions. If the metadata
>> >> > > extraction and scanning model are fully external, then
>> >> > > extensibility and streaming support become entirely out of scope.
>> >> > > But if Polaris provides the scanner framework, we can define
>> >> > > clear extension points for custom metadata and think about
>> >> > > supporting both batch and event driven scanning.
>> >> > >
>> >> > > [1] https://github.com/apache/polaris/issues/3786#issuecomment-4503583696
>> >> > >
>> >> > > Yufei
>> >> > >
>> >> > >
>> >> > > On Fri, Jun 5, 2026 at 2:41 AM Romain Manni-Bucau <[email protected]>
>> >> > > wrote:
>> >> > >
>> >> > > > Hi JB,
>> >> > > >
>> >> > > > I have two questions on this scope:
>> >> > > >
>> >> > > > 1. Any hope it is extensible, so that a user can plug in their
>> >> > > >    own metadata?
>> >> > > > 2. Will scanning be made streaming friendly (I assume phase 0
>> >> > > >    is a batch)? The idea would be to be able to use a Kappa-like
>> >> > > >    architecture to have real-time capabilities.
>> >> > > >
>> >> > > > Thanks,
>> >> > > > Romain Manni-Bucau
>> >> > > > @rmannibucau <https://x.com/rmannibucau> |
>> >> > > > .NET Blog <https://dotnetbirdie.github.io/> |
>> >> > > > Blog <https://rmannibucau.github.io/> |
>> >> > > > Old Blog <http://rmannibucau.wordpress.com> |
>> >> > > > Github <https://github.com/rmannibucau> |
>> >> > > > LinkedIn <https://www.linkedin.com/in/rmannibucau> |
>> >> > > > Book <https://www.packtpub.com/en-us/product/java-ee-8-high-performance-9781788473064>
>> >> > > > Javaccino founder (Java/.NET service - contact via linkedin)
>> >> > > >
>> >> > > >
>> >> > > > On Fri, Jun 5, 2026 at 2:20 AM Yufei Gu <[email protected]> wrote:
>> >> > > >
>> >> > > > > Great to see the progress here. Thanks a lot JB! I will take
>> >> > > > > a look at the PR.
>> >> > > > >
>> >> > > > > Yufei
>> >> > > > >
>> >> > > > >
>> >> > > > > On Thu, Jun 4, 2026 at 2:58 AM Jean-Baptiste Onofré <[email protected]>
>> >> > > > > wrote:
>> >> > > > >
>> >> > > > > > Hi everyone,
>> >> > > > > >
>> >> > > > > > After several months of discussion (involving Directories,
>> >> > > > > > Table Sources, etc), I would like to propose Polaris
>> >> > > > > > Directories.
>> >> > > > > >
>> >> > > > > > I drafted a PR:
>> >> > > > > > https://github.com/apache/polaris/pull/4613
>> >> > > > > >
>> >> > > > > > The proposal is documented as part of the PR:
>> >> > > > > > https://github.com/jbonofre/polaris/blob/12dfea48570d076d4012143e66f02e8b503c4f99/site/content/in-dev/unreleased/directories.md
>> >> > > > > >
>> >> > > > > > In a nutshell, Polaris Directories make objects (including
>> >> > > > > > unstructured data like images, videos, and documents)
>> >> > > > > > discoverable alongside structured Iceberg tables within a
>> >> > > > > > Polaris catalog. A directory points to a base
>> >> > > > > > location/prefix on an object store and automatically tracks
>> >> > > > > > the objects it contains by maintaining an Iceberg table
>> >> > > > > > with object-level metadata such as URI, size, content type,
>> >> > > > > > checksum, ...
>> >> > > > > >
>> >> > > > > > This means query engines and tools that already know how to
>> >> > > > > > read Iceberg tables can discover and access unstructured
>> >> > > > > > data with little or no extra work (accessing the object
>> >> > > > > > itself).
>> >> > > > > >
>> >> > > > > > A directory has two main parts:
>> >> > > > > > - Directory configuration, stored by the Polaris server. It
>> >> > > > > >   describes where the data lives, how to authenticate,
>> >> > > > > >   which objects to include, and how often to re-scan. The
>> >> > > > > >   configuration "lives" in a namespace.
>> >> > > > > > - Directory table, an Iceberg table serving as the
>> >> > > > > >   inventory of all objects contained in the directory, with
>> >> > > > > >   one row per object discovered during a scan. The
>> >> > > > > >   directory table uses the configuration name.
>> >> > > > > >
>> >> > > > > > The Polaris server itself does not perform scans. Instead,
>> >> > > > > > external services (e.g. directory table scanning service)
>> >> > > > > > read the directory configuration through the REST API, walk
>> >> > > > > > the object store, and write the results into the directory
>> >> > > > > > table.
>> >> > > > > >
>> >> > > > > > I propose we discuss this both on the mailing list (this
>> >> > > > > > thread) and on the PR. If needed, I'm happy to schedule a
>> >> > > > > > dedicated meeting.
>> >> > > > > >
>> >> > > > > > I'm looking forward to your thoughts!
>> >> > > > > >
>> >> > > > > > Thanks!
>> >> > > > > >
>> >> > > > > > Regards
>> >> > > > > > JB
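
For readers following the thread, the model in the original proposal (a directory configuration plus an inventory table with one row per discovered object, populated by a scanner running outside the Polaris server) could be sketched roughly as below. This is only an illustration under stated assumptions: `DirectoryConfig`, `InventoryRow`, `scan_directory`, and all field names are hypothetical and are not taken from the PR. A local filesystem walk stands in for an object store listing, and rows are yielded instead of being written to an Iceberg table.

```python
import fnmatch
import hashlib
import mimetypes
import os
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class DirectoryConfig:
    """Hypothetical directory configuration; field names are assumptions."""
    name: str                      # the proposal reuses this name for the directory table
    base_location: str             # base location/prefix (a local path in this sketch)
    include_glob: str = "*"        # which objects to include
    rescan_interval_s: int = 3600  # how often to re-scan (unused in this sketch)


@dataclass(frozen=True)
class InventoryRow:
    """One row per discovered object, mirroring the metadata named in the proposal."""
    uri: str
    size: int
    content_type: str
    checksum: str


def scan_directory(config: DirectoryConfig) -> Iterator[InventoryRow]:
    """Walk the base location and yield one inventory row per matching object."""
    for root, _dirs, files in os.walk(config.base_location):
        for file_name in sorted(files):
            if not fnmatch.fnmatch(file_name, config.include_glob):
                continue
            path = os.path.join(root, file_name)
            # Simplification: the whole object is read into memory to hash it.
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            guessed_type, _encoding = mimetypes.guess_type(path)
            yield InventoryRow(
                uri="file://" + os.path.abspath(path),
                size=os.path.getsize(path),
                content_type=guessed_type or "application/octet-stream",
                checksum=digest,
            )
```

A caller would build a `DirectoryConfig` (in the proposal, read from the Polaris REST API) and iterate over `scan_directory(config)`. A real scanner would instead list objects through the object store API and append the rows to the Iceberg directory table.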
