collado-mike opened a new issue, #425:
URL: https://github.com/apache/polaris/issues/425

   ### Is your feature request related to a problem? Please describe.
   
   [OpenLineage](https://openlineage.io/) is an open standard for reporting and 
collecting lineage information about processing jobs (i.e., which datasets 
were inputs to a processing job and which datasets were outputs). OpenLineage 
libraries are typically modeled as listeners or before/after hooks that are 
triggered by running processing jobs and have engine-specific code that 
collects information about the job and the datasets. That information is 
serialized as JSON and transmitted to a well-defined endpoint that either 
processes and stores that information or modifies and relays it to another 
endpoint.
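
   To make the shape of such a payload concrete, here is a minimal sketch of a run event built as a plain dictionary and serialized to JSON. The top-level field names follow the OpenLineage run event model as I understand it. The job name, dataset names, producer URL, Polaris endpoint URL, and the spec version in `schemaURL` are all illustrative assumptions, and `build_run_event` is a hypothetical helper rather than part of any OpenLineage client library.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(job_name, inputs, outputs, namespace):
    """Build a minimal OpenLineage-style run event as a plain dict.

    All datasets are reported under one namespace, which mirrors how a
    client would report the Polaris endpoint it used to reach the data.
    """
    def dataset(name):
        return {"namespace": namespace, "name": name}

    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        # Assumption: placeholder producer URI and illustrative spec version.
        "producer": "https://example.com/hypothetical-producer",
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "example-jobs", "name": job_name},
        "inputs": [dataset(n) for n in inputs],
        "outputs": [dataset(n) for n in outputs],
    }


event = build_run_event(
    job_name="daily_orders_rollup",
    inputs=["polaris.sales.orders"],
    outputs=["polaris.sales.orders_daily"],
    # Assumption: hypothetical Polaris endpoint used as the dataset namespace.
    namespace="https://polaris.example.com/api/catalog",
)

# This JSON string is what a client would transmit to the lineage endpoint.
payload = json.dumps(event)
```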
   
   ### Describe the solution you'd like
   
   Polaris is a good candidate for a proxying lineage endpoint because it has a 
canonical view of the datasets being processed and can augment the lineage 
payload with useful data. This is especially true when Polaris is used to 
access External catalogs, where the authoritative metadata lives somewhere 
else. 
   
   Spark or other OpenLineage clients can only report information about the 
datasets that can be gleaned from the client - e.g., the namespace of the data 
will be the Polaris endpoint that was used to access the data. The name of the 
catalog will be whatever name was assigned to it in that particular 
application (e.g., a user might configure the catalog as either 
`spark.sql.catalog.polaris` or `spark.sql.catalog.iceberg`). A table might have 
been renamed or moved from another catalog.
   
   Polaris, however, knows exactly where the dataset originated and can use the 
table metadata's UUID field to uniquely identify the dataset. It also knows the 
snapshot information (datasets can be versioned in OpenLineage) as well as the 
schema, table properties, and other information that could be reported as an 
OpenLineage facet. 
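
   As a rough sketch of what that augmentation could look like, the snippet below attaches catalog-derived facets to each dataset in an event. It assumes a lookup keyed by the namespace and name the client reported. `CATALOG_VIEW`, `augment_dataset`, `augment_event`, the `polaris_tableIdentity` facet, the UUID, the snapshot id, and every URL are hypothetical and used only for illustration. The `version` facet with a `datasetVersion` field is modeled on the OpenLineage dataset version facet, but the schema URL shown should be checked against the spec.

```python
import copy

NAMESPACE = "https://polaris.example.com/api/catalog"  # hypothetical endpoint
PRODUCER = "https://example.com/hypothetical-polaris-lineage"

# Hypothetical stand-in for what the catalog knows about each table,
# keyed by (namespace, name) exactly as the client reported them.
CATALOG_VIEW = {
    (NAMESPACE, "iceberg.sales.orders"): {
        "table_uuid": "9c12d441-03fe-4693-9a96-a0705ddf69c1",
        "snapshot_id": 3051729675574597004,
    },
}


def augment_dataset(dataset, catalog_view):
    """Return a copy of the dataset with catalog-derived facets, if known."""
    known = catalog_view.get((dataset["namespace"], dataset["name"]))
    if known is None:
        return dataset  # nothing to add; pass the entry through unchanged
    result = copy.deepcopy(dataset)
    facets = result.setdefault("facets", {})
    # Report the table snapshot as the dataset version.
    facets["version"] = {
        "_producer": PRODUCER,
        "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/DatasetVersionDatasetFacet.json",
        "datasetVersion": str(known["snapshot_id"]),
    }
    # Hypothetical custom facet carrying the table metadata UUID.
    facets["polaris_tableIdentity"] = {
        "_producer": PRODUCER,
        "_schemaURL": "https://example.com/hypothetical/TableIdentityFacet.json",
        "tableUuid": known["table_uuid"],
    }
    return result


def augment_event(event, catalog_view):
    """Return a shallow copy of the event with its datasets augmented."""
    result = dict(event)
    for key in ("inputs", "outputs"):
        result[key] = [augment_dataset(d, catalog_view) for d in event.get(key, [])]
    return result


event = {
    "inputs": [{"namespace": NAMESPACE, "name": "iceberg.sales.orders"}],
    "outputs": [{"namespace": NAMESPACE, "name": "iceberg.sales.not_in_catalog"}],
}
augmented = augment_event(event, CATALOG_VIEW)
```

   Because the UUID comes from the table metadata rather than from the client's catalog alias, a downstream consumer could treat two differently named references to the same table as one dataset.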
   
   Polaris doesn't make sense as a container for lineage information, as 
parsing and storing that information is not cheap. However, there is already 
precedent for an [OpenLineage 
proxy](https://openlineage.io/docs/development/ol-proxy/), which can be used to 
augment the lineage information and pass it on to another service 
([Marquez](https://github.com/MarquezProject/marquez) is the reference 
implementation of the OpenLineage server endpoint). 
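
   The proxy role described here can be sketched as a stateless relay: parse the incoming event, augment it, and hand it to a downstream sender without storing anything. This is only an illustration. `relay`, `augment`, and `forward` are hypothetical names, `forward` stands in for whatever transport is used (for example, an HTTP POST to an OpenLineage-compatible server such as Marquez), and forwarding the original event when augmentation fails is an assumption of this sketch, not something the proposal specifies.

```python
import copy
import json


def relay(raw_body, augment, forward):
    """Augment an incoming lineage event and pass it on; store nothing.

    raw_body: the JSON request body as bytes or str.
    augment:  callable taking an event dict and returning an augmented dict.
    forward:  callable taking the outgoing body as bytes.
    """
    original = json.loads(raw_body)
    try:
        # Work on a copy so a partial failure cannot corrupt the original.
        outgoing = augment(copy.deepcopy(original))
    except Exception:
        # Assumption: lineage should still flow if augmentation fails,
        # so fall back to relaying the event exactly as received.
        outgoing = original
    return forward(json.dumps(outgoing).encode("utf-8"))


# Usage with in-memory stand-ins for the augmenter and the downstream service.
sent = []


def add_marker(event):
    event["job"]["facets"] = {"hypothetical_marker": {"seen": True}}
    return event


def failing_augment(event):
    event["job"]["name"] = "half-modified"
    raise RuntimeError("catalog lookup failed")


body = json.dumps({"eventType": "START", "job": {"namespace": "ns", "name": "j"}})
relay(body, add_marker, sent.append)
relay(body, failing_augment, sent.append)
```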
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
