[
https://issues.apache.org/jira/browse/FLINK-22915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dong Lin updated FLINK-22915:
-----------------------------
Description:
Currently the Flink ML API allows users to compose an Estimator/Transformer from
a pipeline (i.e. a linear sequence) of Estimators/Transformers, where each
Estimator/Transformer has one input and one output.
The following use-cases are not yet supported. We propose FLIP-173 [1] to
address them.
1) Express an Estimator/Transformer that has multiple inputs/outputs.
For example, some graph embedding algorithms need to take two tables as inputs,
where the two tables represent the nodes and the edges of the graph
respectively. This logic can be expressed as an Estimator with 2 input tables.
And some workflows may need to split 1 table into 2 tables, and use these
tables for training and validation respectively. This logic can be expressed as
a Transformer with 1 input table and 2 output tables.
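As an illustration, here is a minimal Python sketch of these two shapes. The class names and signatures are hypothetical (not the actual FLIP-173 API), and plain lists of rows stand in for Flink Tables:

```python
# Hypothetical sketch: a 2-input Estimator and a 1-input/2-output Transformer.
# "Tables" are represented as plain lists of row tuples.

class GraphEmbeddingEstimator:
    """Hypothetical Estimator taking two input tables: nodes and edges."""
    def fit(self, nodes, edges):
        # A real implementation would learn embeddings from the graph;
        # here we just record one placeholder embedding per node.
        model = {node_id: [0.0] for (node_id,) in nodes}
        return GraphEmbeddingModel(model)

class GraphEmbeddingModel:
    """The fitted Transformer: maps each node to its embedding."""
    def __init__(self, model):
        self.model = model
    def transform(self, nodes):
        return [(node_id, self.model[node_id]) for (node_id,) in nodes]

class TrainValidationSplitter:
    """Hypothetical Transformer with 1 input table and 2 output tables."""
    def __init__(self, train_fraction=0.8):
        self.train_fraction = train_fraction
    def transform(self, table):
        cut = int(len(table) * self.train_fraction)
        return table[:cut], table[cut:]  # (train_table, validation_table)
```

The point is only the arity of fit/transform: the pipeline-style API fixes both to one table, whereas these stages take or produce several.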
2) Compose a directed-acyclic-graph of Estimators/Transformers into an
Estimator/Transformer.
For example, a workflow may involve the join of two tables, where each table is
generated by a chain of Estimators/Transformers. The entire workflow is
therefore a DAG of Estimators/Transformers.
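A minimal sketch of such DAG composition, again with hypothetical names (the DagTransformer class and its node format are invented for illustration, not the FLIP-173 API) and lists standing in for Tables:

```python
# Hypothetical sketch: composing a DAG of stages into a single Transformer.
# Each node is (stage, input_names, output_names); stages are callables here.

class DagTransformer:
    def __init__(self, nodes, input_names, output_names):
        self.nodes = nodes              # stages in topological order
        self.input_names = input_names
        self.output_names = output_names

    def transform(self, *input_tables):
        env = dict(zip(self.input_names, input_tables))
        for stage, ins, outs in self.nodes:
            results = stage(*[env[name] for name in ins])
            if len(outs) == 1:
                results = (results,)    # single-output stage returns one table
            env.update(zip(outs, results))
        return tuple(env[name] for name in self.output_names)

# Example DAG: one chain feeds a join of two tables.
def upper(t):
    return [(k, v.upper()) for k, v in t]

def join(a, b):
    d = dict(b)
    return [(k, v, d[k]) for k, v in a if k in d]

dag = DagTransformer(
    nodes=[(upper, ["a"], ["a2"]), (join, ["a2", "b"], ["joined"])],
    input_names=["a", "b"],
    output_names=["joined"],
)
```

Once composed, the DAG is itself a multi-input Transformer and can be nested in a larger graph, which is the composability this use-case asks for.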
3) Online learning, where a long-running Transformer instance needs to be
updated with the latest model data generated by another long-running Estimator
instance.
In this scenario, we need to allow the Estimator to be run on a different
machine than the Transformer, so that the Estimator can consume sufficient
computation resources in a cluster while the Transformer is deployed on edge
devices.
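The interaction can be sketched roughly as follows; the OnlineModel class and the queue-based "model-data stream" are illustrative stand-ins (in practice the two processes would communicate over an actual stream), not the proposed API:

```python
# Hypothetical sketch: a long-running Transformer that picks up the latest
# model data published by a separately running Estimator. A deque stands in
# for the model-data stream connecting the two processes.

from collections import deque

class OnlineModel:
    def __init__(self, model_stream):
        self.model_stream = model_stream
        self.coeff = 1.0                # initial model data

    def _refresh(self):
        # Apply any model updates published since the last transform call.
        while self.model_stream:
            self.coeff = self.model_stream.popleft()

    def transform(self, table):
        self._refresh()
        return [x * self.coeff for x in table]

stream = deque()
model = OnlineModel(stream)
out1 = model.transform([1.0, 2.0])      # uses the initial model data
stream.append(3.0)                       # Estimator publishes new model data
out2 = model.transform([1.0, 2.0])      # now uses the updated model data
```

The key property is that transform keeps serving while the model data evolves, which is why the Estimator and Transformer must be deployable on different machines.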
was:
Currently the Flink ML API allows users to compose an Estimator/Transformer
from a pipeline (i.e. a linear sequence) of Estimators/Transformers. We propose
to extend the Flink ML API so that users can compose an Estimator/Transformer
from a directed-acyclic-graph (i.e. DAG) of Estimators/Transformers.
This feature is useful for the following use-cases:
1) The preprocessing workflow (shared between the training and inference
workflows) may involve the join of multiple tables, where the join of two
tables can be expressed as a Transformer with 2 inputs and 1 output. The
preprocessing workflow could also involve a split operation, which has 1 input
(e.g. the original table) and 2 outputs (e.g. the two splits of the original
table).
A preprocessing workflow involving join/split operations therefore needs to be
expressed as a DAG of Transformers.
2) The graph-embedding algorithm can be expressed as an Estimator, where the
Estimator takes as input two tables (e.g. a node table and an edge table). The
corresponding Transformer has 1 input (i.e. the nodes) and 1 output (i.e. the
nodes after embedding).
A training workflow involving the graph-embedding Estimator therefore needs to
be expressed as a DAG of Transformers/Estimators.
Summary: Update Flink ML library to support Estimator/Transformer DAG
and online learning (was: Extend Flink ML API to support Estimator/Transformer
DAG)
> Update Flink ML library to support Estimator/Transformer DAG and online
> learning
> --------------------------------------------------------------------------------
>
> Key: FLINK-22915
> URL: https://issues.apache.org/jira/browse/FLINK-22915
> Project: Flink
> Issue Type: Improvement
> Components: Library / Machine Learning
> Reporter: Dong Lin
> Priority: Major
>
> Currently the Flink ML API allows users to compose an Estimator/Transformer
> from a pipeline (i.e. a linear sequence) of Estimators/Transformers, where
> each Estimator/Transformer has one input and one output.
> The following use-cases are not yet supported. We propose FLIP-173 [1] to
> address them.
> 1) Express an Estimator/Transformer that has multiple inputs/outputs.
> For example, some graph embedding algorithms need to take two tables as
> inputs, where the two tables represent the nodes and the edges of the graph
> respectively. This logic can be expressed as an Estimator with 2 input tables.
> And some workflows may need to split 1 table into 2 tables, and use these
> tables for training and validation respectively. This logic can be expressed
> as a Transformer with 1 input table and 2 output tables.
> 2) Compose a directed-acyclic-graph of Estimators/Transformers into an
> Estimator/Transformer.
> For example, a workflow may involve the join of two tables, where each table
> is generated by a chain of Estimators/Transformers. The entire workflow is
> therefore a DAG of Estimators/Transformers.
> 3) Online learning, where a long-running Transformer instance needs to be
> updated with the latest model data generated by another long-running
> Estimator instance.
> In this scenario, we need to allow the Estimator to be run on a different
> machine than the Transformer, so that the Estimator can consume sufficient
> computation resources in a cluster while the Transformer is deployed on
> edge devices.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)