Hi all,

Thanks for your review, Dmitri. I think we're discussing most of your
comments on the document itself, so I'll keep the majority of this response
focused on Robert's feedback.

*Security Model:*

> What exactly does LINEAGE_INGEST authorize: all inputs, all outputs, only
> outputs, or parent namespaces for CTAS-style outputs?

We should authorize all of the above :)

> What about external datasets that are not Polaris securables?

We cannot authorize these, because Polaris has no securable to check them
against. If the user holds the LINEAGE_INGEST authorization for all other
datasets in the request, we should allow the request.
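
As a rough sketch of that rule (illustrative only, not taken from the
proposal document): every input and output that resolves to a Polaris
securable must pass a LINEAGE_INGEST check, and external datasets are
skipped. The names `Dataset`, `authorize_ingest`, `resolve`, and
`has_privilege` are hypothetical, and the parent-namespace check for
CTAS-style outputs is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

LINEAGE_INGEST = "LINEAGE_INGEST"


@dataclass(frozen=True)
class Dataset:
    """An OL dataset reference (namespace + name) as it appears in an event."""
    namespace: str
    name: str


def authorize_ingest(
    principal: str,
    inputs: Iterable[Dataset],
    outputs: Iterable[Dataset],
    resolve: Callable[[Dataset], Optional[str]],
    has_privilege: Callable[[str, str, str], bool],
) -> bool:
    """Allow the event only if the principal holds LINEAGE_INGEST on every
    dataset that resolves to a Polaris securable.

    `resolve` returns a securable id, or None for an external dataset.
    `has_privilege(principal, privilege, securable_id)` is a hypothetical
    stand-in for the real authorizer.
    """
    for dataset in list(inputs) + list(outputs):
        securable_id = resolve(dataset)
        if securable_id is None:
            # External dataset: nothing in Polaris to authorize against.
            continue
        if not has_privilege(principal, LINEAGE_INGEST, securable_id):
            return False
    return True
```

With this shape, a single unauthorized Polaris dataset rejects the whole
event, while any number of external datasets do not affect the decision.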

> Are OL events trusted metadata from privileged engine principals, or
> should Polaris validate submitted read/write claims?

No, they are not "trusted" in any way. The user making the request should
be authenticated and authorized for all actions.

> For forwarding, Polaris should use configured downstream credentials by
> default, not forward inbound Polaris bearer tokens or arbitrary request
> headers, and explicitly define what realm/principal/resolved-entity context
> is propagated.

Agreed. This is the model advocated by the document.
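
To illustrate the header handling this implies, here is a minimal sketch.
It is not from the proposal document: the function name, the allowlist
contents, and the use of a bearer token for the downstream credential are
all assumptions made for the example.

```python
from typing import Dict, Mapping

# Assumption: content type is the only inbound header worth carrying over.
FORWARDED_HEADER_ALLOWLIST = frozenset({"content-type"})


def build_forward_headers(
    inbound: Mapping[str, str], downstream_token: str
) -> Dict[str, str]:
    """Build headers for the downstream OL request.

    Inbound headers are dropped unless explicitly allowlisted, so the
    caller's Polaris bearer token and arbitrary headers never leave Polaris.
    """
    headers = {
        name.lower(): value
        for name, value in inbound.items()
        if name.lower() in FORWARDED_HEADER_ALLOWLIST
    }
    # Always authenticate with the configured downstream credential
    # (hypothetically a bearer token here), never the inbound one.
    headers["authorization"] = f"Bearer {downstream_token}"
    return headers
```

Any realm, principal, or resolved-entity context that should be propagated
would be added explicitly in the same place rather than copied from the
inbound request.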

> For reads, opaque nodes with preserved edges may still leak that hidden
> datasets exist and are connected to visible ones; that should be an
> explicit choice.

We should never show lineage information for tables the user isn't
authorized to see. We can either create a LINEAGE_READ permission or bundle
this into a permission that already exists in Polaris.
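
A minimal sketch of the strictest reading of this rule: hidden datasets are
removed from the response, together with every edge that touches them, so
no opaque placeholder nodes remain. The function name and the `can_read`
callback are hypothetical and stand in for whichever permission we end up
using.

```python
from typing import Callable, Iterable, List, Set, Tuple

Edge = Tuple[str, str]  # (upstream dataset id, downstream dataset id)


def visible_lineage(
    nodes: Iterable[str],
    edges: Iterable[Edge],
    can_read: Callable[[str], bool],
) -> Tuple[Set[str], List[Edge]]:
    """Return only the nodes the caller may see and the edges between them."""
    visible = {node for node in nodes if can_read(node)}
    # Keep an edge only when both endpoints are visible, so the response
    # never reveals that a hidden dataset exists or what it connects to.
    kept = [(src, dst) for (src, dst) in edges if src in visible and dst in visible]
    return visible, kept
```

One consequence of this choice is that a path between two visible datasets
that runs through a hidden one disappears from the result entirely.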

*Runtime Limits and Failure Behavior:*

I agree that we need to define these, but are there specific concerns about
the proposal itself that you'd like answered at this stage? I believe these
are implementation details rather than items that belong in a proposal
document.

*User Value of Local Store:*

> Which queries is Polaris-local storage supposed to answer, and which are
> left to Marquez/DataHub/OpenMetadata/etc.?

My vision is that Polaris should mainly handle smaller deployments or use
cases out of the box with the local store. This is mainly for users who
lack a use case substantial enough to require an external service or vendor
for data lineage, but who are interested in introducing data lineage to
their organization. If their use case grows, they should transition to an
external OL server using "mixed mode" (Polaris local store + external OL
server). This approach allows for a graceful migration, with the eventual
goal of retiring the Polaris local store. This is why "mixed mode" is not
P0 but the "pass-through" and "local store" modes are: there should be
users who can benefit from them today.

Based on my understanding of the remainder of the email, I think you favor
the proxy/"passthrough-only" mode. I'm OK with starting implementation
there for now while we iron out the remaining concerns you may have about
the "local store" mode. Could you please list those questions point by
point (either in this mail thread or on the document) so that I can answer
them? In the meantime, I will work on defining the LineagePersistence
contract.

Best,
Adnan Hemani


On Tue, May 26, 2026 at 5:10 AM Robert Stupp <[email protected]> wrote:

> Hi Adnan,
>
> I am generally supportive of Polaris integrating with OpenLineage,
> especially around configuration, identity, and dataset resolution.
> If Polaris must be inline for auth/enrichment, the gateway / resolver /
> forwarder shape seems like the most natural fit to me.
>
> That said, I wonder whether the simpler M0 is to keep Polaris out of the OL
> event data path unless inline authorization/enrichment is a hard
> requirement.
> OL clients can already emit directly to Marquez/DataHub/OpenMetadata/etc.
> Polaris could still add value by exposing the OL transport configuration,
> auth/credential references, realm/tenant context, TLS settings, and naming
> conventions.
> That would avoid proxying large OL payloads through Polaris and would keep
> gateway, local storage, custom query APIs, and downstream query
> normalization from being bundled into one milestone.
>
> For the Polaris-local storage and Polaris-specific query API parts, I think
> a few things still need clarification before PRs assume that scope.
>
> First, could we make the security model more explicit?
> What exactly does LINEAGE_INGEST authorize: all inputs, all outputs, only
> outputs, or parent namespaces for CTAS-style outputs?
> What about external datasets that are not Polaris securables?
> Are OL events trusted metadata from privileged engine principals, or should
> Polaris validate submitted read/write claims?
> For forwarding, Polaris should use configured downstream credentials by
> default, not forward inbound Polaris bearer tokens or arbitrary request
> headers, and explicitly define what realm/principal/resolved-entity context
> is propagated.
> For reads, opaque nodes with preserved edges may still leak that hidden
> datasets exist and are connected to visible ones; that should be an
> explicit choice.
>
> Second, could we define runtime limits and failure behavior?
> OL payloads are effectively unbounded by the spec and can get large through
> schemas, SQL, column lineage, data quality facets, debug facets, or custom
> facets.
> Even if Polaris stores only a projection, it still has to receive, parse,
> and maybe forward the full event.
> The proposal should define request/event/batch limits, facet and
> column-lineage limits, timeouts, backpressure/concurrency behavior,
> oversized-payload responses, and logging rules.
>
> Third, could we clarify the user value of the reduced local store?
> The proposed local representation drops job/run history, run state, most
> facets, and process attribution.
> That may still be useful as a small dependency index for impact analysis or
> future catalog policy checks, but it is not what users usually get from
> querying an OL backend.
> Which queries is Polaris-local storage supposed to answer, and which are
> left to Marquez/DataHub/OpenMetadata/etc.?
>
> Given those questions, my preferred scoping would be:
> if inline Polaris auth/enrichment is not required, start with
> configuration/discovery and let OL clients emit directly to the OL backend.
> If Polaris does need to be in the event path, scope M0 to proxy/gateway
> mode only.
> I would keep local storage and Polaris-specific lineage query APIs out of
> M0 until the product semantics, security model, runtime limits, and
> persistence model are clearer.
>
> If local storage stays in scope, I think the proposal should define a
> backend-agnostic LineagePersistence contract first, instead of making
> relational tables the logical model.
> JDBC-only as the first implementation is fine, but the SPI should not bake
> in assumptions that make the NoSQL backend awkward later.
>
> There are also a few core semantics that I think still need to be spelled
> out: collapsed edges, staleness/removal, dataset identity and Polaris
> entity resolution, query node identity, supported OL event types, batch
> behavior, and forwarding failure modes.
> In particular, a 200 from Polaris should have a clear meaning in each mode:
> local-only, fail-closed forwarding, fail-open forwarding, and
> passthrough-only.
>
> So I am not against the OpenLineage integration direction.
> I mainly want to avoid treating "Polaris as an OL gateway" and "Polaris as
> a new lineage storage/query system" as the same milestone.
> The gateway/configuration pieces feel like a natural fit; the storage/query
> pieces still need more design work before I would be comfortable calling
> that part consensus.
>
> Robert
>
> On Mon, May 25, 2026 at 6:57 PM Dmitri Bourlatchkov <[email protected]>
> wrote:
>
> > Hi Adnan,
> >
> > Thanks for the update and apologies again for the late review.
> >
> > I posted some comments in the docs; all of them are non-blocking. We can
> > address them in the doc or PRs, if you prefer.
> >
> > However, for the sake of clarity, I'd like to better understand the idea
> > behind collapsing OL edges.
> >
> > What is the use case for the resulting data?
> >
> > What kind of problems can it help to solve?
> >
> > How is "staleness pruning" supposed to work (mentioned in section "Edge
> > Semantics")?
> >
> > Thanks,
> > Dmitri.
> >
> > On Fri, May 22, 2026 at 2:50 PM Adnan Hemani via dev
> > <[email protected]> wrote:
> >
> > > Hi Dmitri,
> > >
> > > My understanding is that we discuss threads in the community before
> > > implementation to ensure alignment on the proposal's direction before
> > > community members put time into the implementation. I'm fully aligned
> > > that the implementation details are always up for discussion (both
> > > before or after a proposal or even before/after a PR :) I just don't
> > > want us to proceed with putting further time/effort in if the
> > > community is not aligned on introducing these endpoints in general. I
> > > am claiming "approv[al] by lazy consensus" for this direction because
> > > the doc has been circulating in the ML for quite long without any
> > > objections.
> > >
> > > The old proposal has a different direction altogether and does not
> > > align with where we now want to go with this proposal. I wanted to
> > > ensure there was no confusion between the old (now-abandoned) proposal
> > > and the new one. I-Ting is joining on as a co-author on the new
> > > proposal.
> > >
> > > We should definitely discuss this at the next Community Sync!
> > >
> > > Best,
> > > Adnan Hemani
> > >
> > > On Fri, May 22, 2026 at 8:53 AM Dmitri Bourlatchkov <[email protected]>
> > > wrote:
> > >
> > > > Hi Adnan,
> > > >
> > > > Apologies for the lack of feedback (too many concurrent activities).
> > > >
> > > > However, this is a discussion thread, not a vote on a concrete
> > > > implementation with a timeline. Lack of comment does not mean that
> > > > people approve the design.
> > > >
> > > > If you prefer to proceed to concrete PRs, that would be fine from my
> > > > POV, but please do not assume that PRs will not be challenged on
> > > > aspects that were not already discussed.
> > > >
> > > > I believe it would be preferable to add this to the next Community
> > > > Sync agenda to invigorate the discussion.
> > > >
> > > > Also, I'm not sure what the status is of this document / email thread
> > > > with respect to I-Ting's old proposal [1]. Why not continue the
> > > > discussion on that thread? WDYT?
> > > >
> > > > [1] https://lists.apache.org/thread/qqpq5hl1xrq8mwnd7kn4vgt8x9mqtvmg
> > > >
> > > > Cheers,
> > > > Dmitri.
> > > >
> > > > On Fri, May 22, 2026 at 1:07 AM Adnan Hemani via dev <
> > > > [email protected]> wrote:
> > > >
> > > >> Hi folks,
> > > >>
> > > > >> Since there hasn't been much traffic on this document or email
> > > > >> threads, I will consider the document approved by lazy consensus
> > > > >> if there are no further blocking comments by Friday, May 28th.
> > > >>
> > > >> Thanks!
> > > >>
> > > >> Best,
> > > >> Adnan Hemani
> > > >>
> > > > >> On Thu, May 14, 2026 at 1:49 PM Yufei Gu <[email protected]> wrote:
> > > >>
> > > >> > Hi Adnan,
> > > >> >
> > > >> > Thanks for putting this proposal together and resurfacing it in a
> > > >> > dedicated thread.
> > > >> >
> > > > >> > I did one round of review on the document already, and I think it
> > > > >> > would be great if more folks from the community could take a look
> > > > >> > and provide feedback as well. It would be especially helpful to
> > > > >> > get perspectives from people working on related areas before
> > > > >> > implementation starts.
> > > >> >
> > > >> > Thanks,
> > > >> > Yufei
> > > >> >
> > > >> >
> > > >> > On Wed, May 13, 2026 at 10:01 PM Adnan Hemani via dev <
> > > >> > [email protected]> wrote:
> > > >> >
> > > >> >> Hi all,
> > > >> >>
> > > > >> >> I wanted to ensure that the OpenLineage proposal I previously
> > > > >> >> posted in a different thread [1] was actually being found, given
> > > > >> >> that it was deep into the thread. I request the community to
> > > > >> >> review this proposal so we can potentially start implementation.
> > > >> >>
> > > >> >> Proposal:
> > > >> >>
> > > >> >>
> > > >>
> > >
> >
> https://docs.google.com/document/d/1iOzIuFW66SFL2wZOADD9knMTG21OwY7VmaWVSvMUqQk/edit?tab=t.0#heading=h.59bmbnsf0gp1
> > > >> >>
> > > >> >> Best,
> > > >> >> Adnan Hemani
> > > >> >>
> > > >> >> [1]
> > https://lists.apache.org/thread/1fd6hrvx0v0s5wm6gh74cdo3yn4w1zhx
> > > >> >>
> > > >> >
> > > >>
> > > >
> > >
> >
>
