[
https://issues.apache.org/jira/browse/FLINK-31275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786866#comment-17786866
]
Maciej Obuchowski edited comment on FLINK-31275 at 11/16/23 5:24 PM:
---------------------------------------------------------------------
[~zjureel] I think this is a step in the right direction.
For `StreamingFacet`, if the intention is to attach it to connectors
interfacing with services like Kafka, Kinesis, Pulsar, or PubSub, then I
believe it needs the ability to describe multiple topics - since, at least for
Kafka, you can read from either a list of topics or a wildcard pattern. The
same goes for `SchemaFacet` - the topics can have different schemas.
The way that OpenLineage deals with this - and Flink could too - is by
associating those facets with the concept of a `Dataset`, rather than with the
job itself. So we'd have roughly:
{code:java}
import java.util.List;
import java.util.Map;

public interface LineageVertex {
    /* List of input (for source) or output (for sink) datasets the connector
       interacts with. */
    List<Dataset> datasets();

    /* Facets for the lineage vertex, describing the general information of
       the source/sink. */
    Map<String, Facet> facets();

    /* Config for the lineage vertex, containing all the options for the
       connector. */
    Map<String, String> config();
}
{code}
{code:java}
import java.util.Map;

public interface Dataset {
    /* Name for this particular dataset. */
    String name();

    /* Unique name for this dataset's datasource. */
    String namespace();

    /* Facets describing the particular information of this dataset. */
    Map<String, Facet> facets();
}
{code}
and then those facets could be assigned either to a `Dataset` or to the
`LineageVertex`.
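To make that concrete, here is a minimal, hypothetical sketch - assuming the
method-style `Dataset` interface above, plus made-up `Facet`, `SchemaFacet`,
and `kafkaDatasets` names that are not part of the proposal - of how a Kafka
source reading a list of topics could expose one `Dataset` per topic, each
carrying its own schema facet (the wildcard-pattern case would build the same
list from the topics resolved at runtime):
{code:java}
// Hypothetical sketch only: Facet, SchemaFacet and kafkaDatasets are assumed
// names for illustration; Dataset is the interface proposed above.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

interface Facet {}

/* A per-dataset facet carrying a schema description (assumption). */
class SchemaFacet implements Facet {
    final String schema;
    SchemaFacet(String schema) { this.schema = schema; }
}

class KafkaLineageExample {
    /* Builds one Dataset per Kafka topic; the namespace identifies the
       cluster, so topics on the same brokers share a namespace. */
    static List<Dataset> kafkaDatasets(String bootstrapServers, List<String> topics) {
        List<Dataset> datasets = new ArrayList<>();
        for (String topic : topics) {
            datasets.add(new Dataset() {
                @Override public String name() { return topic; }
                @Override public String namespace() { return "kafka://" + bootstrapServers; }
                @Override public Map<String, Facet> facets() {
                    // Each topic can carry its own schema facet.
                    return Collections.singletonMap(
                            "schema", new SchemaFacet("id INT, payload STRING"));
                }
            });
        }
        return datasets;
    }
}
{code}
The vertex-level `facets()` map would then be reserved for information that
applies to the source/sink as a whole.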
> Flink supports reporting and storage of source/sink tables relationship
> -----------------------------------------------------------------------
>
> Key: FLINK-31275
> URL: https://issues.apache.org/jira/browse/FLINK-31275
> Project: Flink
> Issue Type: Improvement
> Components: Table SQL / Planner
> Affects Versions: 1.18.0
> Reporter: Fang Yong
> Assignee: Fang Yong
> Priority: Major
>
> FLIP-314 has been accepted
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-314%3A+Support+Customized+Job+Lineage+Listener