Hi EJ,

Thanks for looking at the proposal. I've responded to most of your comments on the document itself, but I'll summarize my positions here to close the loop.

I am consciously letting the OpenLineage standard drive the requirements here; this is a feature, not a bug. IMO, OpenLineage is by far the most widely used standard for data lineage; I don't know of any significant competitors. Big data engines like Spark and Trino, which represent a significant use case for Polaris, have OpenLineage integrations and nothing else. Going the extra mile to decouple our lineage implementation from OpenLineage is, IMO, unlikely to repay the work involved. Happy to hear any other thoughts on this topic.

I also don't agree that Polaris should morph into a full-fledged OpenLineage server. I don't think the Polaris community is trying to make a "Swiss Army knife" out of Polaris. For major lineage use cases, users absolutely should be redirected to other servers like Marquez, where they can get full graph history, multi-hop traversal, jobs/runs info, etc. Based on this reasoning, I disagree with the "extensions" piece of your email.

Regarding the "out-of-the-box" experience, I have no doubt: Polaris cannot have lineage information out of the box. An admin must take a small step to configure how they want to enable lineage data persistence: either Polaris-local persistence or the passthrough/proxy/AuthZ-layer modes.

I think you've missed some of the points in the earlier replies on this thread; the Query API is really only helpful when using the Polaris-local persistence mode. The current plan is to build toward "passthrough" mode first, with plans to support the Polaris-local implementation soon afterward. A Query API won't be introduced until the Polaris-local implementation work begins, so there is no point at which a Query API exists but cannot return data to the user. You can see this in my first PR, where only the Ingest API is implemented: https://github.com/apache/polaris/pull/4667.

One last note/suggestion for you: the term "default battery" on its own generally doesn't make much sense. I'm only able to piece together your comments because you used the phrase "batteries included" in this morning's community sync. I would usually use "out-of-the-box (OOTB)" or "default implementation". Using similar terms in the future would improve readability in general.

Best,
Adnan Hemani

On Thu, Jun 11, 2026 at 4:12 PM EJ Wang <[email protected]> wrote:

> Hi all,
>
> I read through the proposal and the comments. One framing that may help us
> converge is to split the proposal into a few separate decisions instead of
> reviewing it as one bundled “OpenLineage support in Polaris” feature.
>
> This seems related to a broader direction I understand for Polaris as a
> platform: it should be flexible enough to support different deployment and
> integration use cases, but still battery-included enough to be useful out
> of the box. For lineage, I think that means we should explicitly separate:
> what Polaris promises as native lineage semantics, what the default battery
> implementation does, and what should remain pluggable for richer or
> deployment-specific implementations.
>
> I have been using a similar exercise in a recent SPI proposal draft: first
> separate external contracts, default/battery implementation, extension
> implementations, and provider-facing replacement points; then decide
> implementation. I think that exercise applies well here because this
> proposal touches several different boundary types at once: ingest protocol,
> Polaris-native lineage model, persistence, query API, downstream
> forwarding, auth, and dataset resolution.
>
> The questions I think we should separate are:
>
>    1. *OpenLineage compatibility: *Do we require existing OpenLineage
>    clients to emit to Polaris by changing only the endpoint/config?
>       - If yes, then a server-side OpenLineage-compatible adapter
>       endpoint makes sense.
>       - If not, another option is a Polaris-provided OpenLineage
>       transport/client shim that reshapes OpenLineage events into a
>       Polaris-native lineage API.
>       - Those are different adoption tradeoffs, and I think we should choose
>       intentionally rather than letting OpenLineage compatibility implicitly
>       define the Polaris-native API.
>    2. *Polaris-native lineage model: *Should the long-term Polaris
>    lineage model/query API be OpenLineage-specific, or framework-agnostic with
>    OpenLineage as one adapter?
>       - My preference is the latter. OpenLineage compatibility is useful,
>       but I would avoid making the OpenLineage payload shape the Polaris-native
>       lineage model by accident.
>    3. *Default battery behavior: *What should work out of the box?
>       - If query is part of the initial release, I think the battery
>       needs enough local state to answer a minimal query. A narrow default could
>       be: latest observed direct table-level upstreams for a Polaris-managed
>       target table, with observed timestamp, producer/engine identifier, and
>       upstream dataset refs.
>    4. *Extension implementations: *What should be pluggable or future
>    work?
>       - I would put raw OpenLineage forwarding/proxying, external backend
>       query, full graph history, multi-hop traversal, column-level query, job/run
>       graph, pruning/staleness, and richer governance-aware behavior into
>       extension/future implementation areas rather than the default battery.
>
> *One subtle point*: I do not think the default battery and the REST/API
> envelope need to have exactly the same scope.
>
> The default battery can be intentionally small. For example, latest direct
> table-level lineage summary for Polaris-managed target tables. *But the
> REST/API envelope can still be designed so that richer implementations are
> possible later or through extensions*.
> For example, the API can carry
> metadata such as *granularity (table/col/job etc.), format/source
> protocol (OpenLineage or other lineage framework)*, or requested mode to
> help Polaris route handling to the configured provider, without requiring
> every default implementation to support every mode.
>
> Said differently, I would separate:
>
>    - what the API envelope can represent;
>    - what the default battery actually guarantees;
>    - what extension implementations can support.
>
> *My concrete recommendation would be*:
>
> If Polaris exposes a lineage Query API in the initial release, the default
> battery should provide a minimal latest table-level summary implementation
> so the query works out of the box. If we do not want any local persistence
> in the initial release, then I think the Query API should be out of scope
> for the initial release or clearly extension-provided. I would avoid
> exposing a core query API whose default implementation cannot answer
> anything.
>
> *My preferred shape would be*:
>
>    - Polaris-native lineage semantics stay *framework-agnostic*.
>    - OpenLineage is supported as an adapter/adoption path, *not as the
>    only Polaris lineage model*.
>    - The default battery, if query is in scope, is latest direct
>    table-level lineage summary only.
>    - *The API envelope leaves room for richer provider implementations*.
>    - Full OpenLineage backend behavior, downstream forwarding/proxying,
>    historical graph, column lineage, job/run lineage, multi-hop query,
>    pruning/staleness, and external backend query *are extension or future
>    work*.
>
> This would still give Polaris a useful out-of-the-box lineage experience,
> while avoiding turning Polaris into a full lineage backend in the first
> step.
>
> -ej
>
> On Mon, Jun 8, 2026 at 2:31 PM Adnan Hemani via dev <[email protected]> wrote:
>
>> Hi Robert,
>>
>> > Is my understanding correct that option 1 is out of scope from your
>> > perspective, and option 2 is not sufficient for the M0 you have in mind? In
>> > other words, you are proposing option 3 as the baseline, with active
>> > planning toward option 4?
>>
>> Yes, that's correct. Happy to hear others' opinions, but Option 4 has been
>> detailed in the proposal document since the very start. I'm happy to wait a
>> few more days for others' opinions, but as of now I don't see any active
>> opposition to the plans as-is and the "lazy consensus" suggested deadline
>> was over 2 weeks ago. I-Ting and I will start implementation in the
>> meantime.
>>
>> Best,
>> Adnan Hemani
>>
>> On Mon, Jun 8, 2026 at 3:19 AM Robert Stupp <[email protected]> wrote:
>>
>> > Hi all,
>> >
>> > Thanks Adnan, that helps clarify the shape.
>> >
>> > I think this is the point where broader community input would be useful,
>> > because options 3/4 are a materially different commitment from options 1/2.
>> >
>> > Is my understanding correct that option 1 is out of scope from your
>> > perspective, and option 2 is not sufficient for the M0 you have in mind? In
>> > other words, you are proposing option 3 as the baseline, with active
>> > planning toward option 4?
>> >
>> > Option 3 does not just put a proxy endpoint in Polaris.
>> > It makes Polaris responsible for the OL ingest path: dataset-name
>> > resolution, per-entity authZ over OL assertions, policy for non-Polaris
>> > datasets, trusted-service credentials to downstream systems, request-size
>> > and payload limits, forwarding failure semantics, audit behavior, and
>> > tenant isolation.
>> >
>> > Option 4 then adds a Polaris-local lineage storage/query subsystem.
>> > Even if the first version stores only a reduced projection, Polaris would
>> > take on many responsibilities of an OL backend: persistence semantics,
>> > query semantics, staleness/pruning, auth-filtered reads, backend
>> > compatibility, migrations, limits, and long-term compatibility with OL
>> > event shapes.
>> > At that point, even if intentionally limited, Polaris effectively operates
>> > as an OL backend for the supported subset.
>> >
>> > So before we treat option 3 plus active planning toward option 4 as the M0
>> > baseline, I think it would be good to hear whether others agree that
>> > Polaris should take on that implementation and maintenance surface for the
>> > first milestone.
>> >
>> > Or whether we should start with a smaller integration point first.
>> >
>> > Robert
