[ 
https://issues.apache.org/jira/browse/FLINK-31275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17782529#comment-17782529
 ] 

Maciej Obuchowski commented on FLINK-31275:
-------------------------------------------

[~zjureel] 
Generally the idea is that the interface should allow for connecting the 
listener and particular connector in a scenario where they are written by 
totally separate people and don't even know of each other. For example, I 
should be able to write an OpenLineage listener (or Datahub, Atlan...) and it 
should work - and report lineage - from some proprietary, internal connector 
that implements FLIP-314 interface.

The way of doing it in the currently proposed and implemented interface 
requires either
 # Listener knowing the LineageVertex subclasses of all possible connectors, or
 # Listener providing it's own specialized subclass that the connectors would 
implement.

The problem with first idea is that it's not possible to provide extensibility 
mechanism. If I have to enumerate all the supported connectors, I won't be able 
to cover some proprietary connector that that I don't know the code of - and 
the connector author won't be able to do it either.

The problem with second idea is that it's pretty much duplicating effort here. 
It's hard to get consensus around particular interface, and it's even harder to 
have implementation of it in open source. Even if this was done, it pretty much 
forces all the people to use this listener - which I believe is the opposite of 
the goal of the open interface as FLIP-314.

> 2. For customized source/sink in datastream jobs, we can get source and slink 
> `LineageVertex` implementations from `LineageVertexProvider`. When users 
> implement customized lineage vertex and edge, they need to update them when 
> their connectors are updated.

> IIUC, do you mean we should give an implementation of `LineageVertex` for 
> datastream jobs and users can provide source/sink information there just like 
> `TableLinageVertex` in sql jobs? Then listeners can use the datastream 
> lineage vertex which is similar with table lineage vertex?

In a way, yes. While datastream jobs are flexible, sources and sinks generally 
read from and write to known data systems, and those connectors know them. On 
the other hand, I think best possible interface wouldn't have 
`TableLinageVertex` or `DatasetLineageVertex`, just one unified interface that 
would allow connectors themselves to describe the list of datasets read from 
and written to, like the one I've posted in previous comment.

> Due to the flexibility of the source and sink in `DataStream`, we think it's 
> hard to cover all of them, so we just provide `LineageVertex` and 
> `LineageVertexProvider` for them. So we left this flexibility to users and 
> listeners. If a custom connector is a table in `DataStream` job, users can 
> return `TableLineageVertex` in the `LineageVertexProvider`.

My idea is that connector authors provide those - not end users. End users 
providing those means the job is duplicated - while the Kafka connector always 
knows it's reading from some topics, or JDBC connector knows it's writing to a 
particular table in particular database.

However, end users should have the ability to enrich this data with some 
particular information.

> I feel the most painful thing is to infer the schema of source/source for 
> lineage perspective. If the schema info can be provided in Flink connector, 
> the integration in open lineage or even other framework will be clean, 
> concise. 

Agreed - schema is very important next to actual dataset identifier. And very 
important for any possible future column level lineage work.

> Flink supports reporting and storage of source/sink tables relationship
> -----------------------------------------------------------------------
>
>                 Key: FLINK-31275
>                 URL: https://issues.apache.org/jira/browse/FLINK-31275
>             Project: Flink
>          Issue Type: Improvement
>          Components: Table SQL / Planner
>    Affects Versions: 1.18.0
>            Reporter: Fang Yong
>            Assignee: Fang Yong
>            Priority: Major
>
> FLIP-314 has been accepted 
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-314%3A+Support+Customized+Job+Lineage+Listener



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to