[
https://issues.apache.org/jira/browse/FLINK-31275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786866#comment-17786866
]
Maciej Obuchowski edited comment on FLINK-31275 at 11/16/23 5:24 PM:
---------------------------------------------------------------------
[~zjureel] I think this is a step in the right direction.
For `StreamingFacet`, if the intention is to attach it to connectors
interfacing with services like Kafka, Kinesis, Pulsar, or PubSub, then I
believe it needs the ability to describe multiple topics - since, at least for
Kafka, you can read from either a list of topics or a wildcard pattern. The
same goes for `SchemaFacet` - the topics can have different schemas.
The way that OpenLineage deals with this - and Flink could too - is by
associating those facets with the concept of a `Dataset`, rather than with the
job itself. So we'd have roughly:
{code:java}
import java.util.List;
import java.util.Map;

public interface LineageVertex {
    /* List of input (for source) or output (for sink) datasets the connector
       interacts with. */
    List<Dataset> datasets();

    /* Facets for the lineage vertex, describing the general information of
       the source/sink. */
    Map<String, Facet> facets();

    /* Config for the lineage vertex, containing all the options for the
       connector. */
    Map<String, String> config();
}
{code}
{code:java}
import java.util.Map;

public interface Dataset {
    /* Name for this particular dataset. */
    String name();

    /* Unique name for this dataset's datasource. */
    String namespace();

    /* Facets describing the particular information of this dataset. */
    Map<String, Facet> facets();
}
{code}
and then those facets could be assigned either to a `Dataset` or to the
`LineageVertex`.
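To make that concrete, here is a minimal, hypothetical sketch - assuming the
method-style `Dataset` interface above, plus made-up `Facet`, `SchemaFacet`,
and `kafkaDatasets` names that are not part of the proposal - of how a Kafka
source reading a list of topics could expose one `Dataset` per topic, each
carrying its own schema facet (the wildcard-pattern case would build the same
list from the topics resolved at runtime):
{code:java}
// Hypothetical sketch only: Facet, SchemaFacet and kafkaDatasets are assumed
// names for illustration; Dataset is the interface proposed above.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

interface Facet {}

/* A per-dataset facet carrying a schema description (assumption). */
class SchemaFacet implements Facet {
    final String schema;
    SchemaFacet(String schema) { this.schema = schema; }
}

class KafkaLineageExample {
    /* Builds one Dataset per Kafka topic; the namespace identifies the
       cluster, so topics on the same brokers share a namespace. */
    static List<Dataset> kafkaDatasets(String bootstrapServers, List<String> topics) {
        List<Dataset> datasets = new ArrayList<>();
        for (String topic : topics) {
            datasets.add(new Dataset() {
                @Override public String name() { return topic; }
                @Override public String namespace() { return "kafka://" + bootstrapServers; }
                @Override public Map<String, Facet> facets() {
                    // Each topic can carry its own schema facet.
                    return Collections.singletonMap(
                            "schema", new SchemaFacet("id INT, payload STRING"));
                }
            });
        }
        return datasets;
    }
}
{code}
The vertex-level `facets()` map would then be reserved for information that
applies to the source/sink as a whole.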
> Flink supports reporting and storage of source/sink tables relationship
> -----------------------------------------------------------------------
>
> Key: FLINK-31275
> URL: https://issues.apache.org/jira/browse/FLINK-31275
> Project: Flink
> Issue Type: Improvement
> Components: Table SQL / Planner
> Affects Versions: 1.18.0
> Reporter: Fang Yong
> Assignee: Fang Yong
> Priority: Major
>
> FLIP-314 has been accepted
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-314%3A+Support+Customized+Job+Lineage+Listener