[ 
https://issues.apache.org/jira/browse/FLINK-22915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated FLINK-22915:
-----------------------------
    Description: 
The FLIP design doc can be found at 
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=184615783.

The existing Flink ML library allows users to compose an Estimator/Transformer 
from a pipeline (i.e. a linear sequence) of Estimators/Transformers, where each 
Estimator/Transformer has one input and one output.

The following use-cases are not yet supported. We would like to address these 
use-cases with the changes proposed in this FLIP.

1) Express an Estimator/Transformer that has multiple inputs/outputs.

For example, some graph embedding algorithms (e.g., MetaPath2Vec) need to take 
two tables as inputs. These two tables represent the node labels and the edges 
of the graph respectively. This logic can be expressed as an Estimator with 2 
input tables.

And some workflows may need to split 1 table into 2 tables, and use these 
tables for training and validation respectively. This logic can be expressed 
by a Transformer with 1 input table and 2 output tables.
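
As a hedged sketch of what multi-input/multi-output signatures could look like 
(the interface and method names below are purely illustrative, not the API 
proposed in the FLIP doc), the two examples above might be expressed as:

{code:java}
import org.apache.flink.table.api.Table;

// Illustrative sketch only; the concrete interfaces are defined in the FLIP doc.
interface MultiInputEstimator<M> {
    // e.g. inputs[0] = node labels, inputs[1] = edges for a MetaPath2Vec-style algorithm.
    M fit(Table... inputs);
}

interface MultiOutputTransformer {
    // e.g. split one input table into {training table, validation table}.
    Table[] transform(Table... inputs);
}
{code}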

2) Express a generic machine learning computation logic that does not have the 
"transformation" semantic.

We believe most machine learning engineers associate the name "Transformer" 
with the "transformation" semantic, where a record in the output typically 
corresponds to one record in the input. Thus it is counter-intuitive to use a 
Transformer to encode aggregation logic, where a record in the output 
corresponds to an arbitrary number of records in the input.

Therefore we need a class with a name other than "Transformer" to encode 
generic multi-input multi-output computation logic.
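
As a purely illustrative sketch (the class name and method below are invented 
for this example; the FLIP doc defines the actual class), such a stage would 
only promise "N tables in, M tables out" without implying per-record 
transformation semantics:

{code:java}
import org.apache.flink.table.api.Table;

// Illustrative name and signature, not the FLIP's final API. The contract is only
// "takes N tables, produces M tables", so aggregation-style logic where many input
// records collapse into one output record is not surprising to the caller.
interface GenericAlgoOperator {
    Table[] compute(Table... inputs);
}
{code}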

3) Online learning, where a long-running Model instance needs to be 
continuously updated with the latest model data generated by another 
long-running Estimator instance.

In this scenario, we need to allow the Estimator to run on a different machine 
than the Model, so that the Estimator could consume sufficient computation 
resources in a cluster while the Model could be deployed on edge devices.
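
A hedged sketch of how this decoupling could look, assuming the model data is 
exchanged as an unbounded Table (all names below are illustrative):

{code:java}
import org.apache.flink.table.api.Table;

// Illustrative sketch: the Estimator, running in a cluster, keeps emitting model
// data; the Model, running e.g. on an edge device, ingests that data as an
// unbounded table and serves inference with the latest values it has received.
interface OnlineModel {
    // Continuously ingest model-data updates produced by a remote Estimator.
    void setModelData(Table... modelData);

    // Serve inference on the input table(s) using the most recent model data.
    Table[] transform(Table... inputs);
}
{code}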

4) Provide APIs that allow an Estimator/Model to be efficiently saved/loaded 
even if the state (e.g. model data) of the Estimator/Model amounts to tens of 
GBs.

The existing PipelineStage::toJson essentially requires the developer of an 
Estimator/Model to serialize all model data into an in-memory string, which 
could be very inefficient (or practically impossible) if the model data is 
very large (e.g. tens of GBs).
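
A hedged sketch of a path-based alternative (method names illustrative): the 
stage writes its parameters and model data under a directory, so tens of GBs 
never have to be materialized as a single in-memory string:

{code:java}
import java.io.IOException;

// Illustrative sketch only, not the FLIP's final save/load API.
interface PersistableStage {
    // Write parameters and (potentially very large) model data under the given
    // path, e.g. on a distributed filesystem.
    void save(String path) throws IOException;

    // A matching static load(path) factory on each concrete class would restore
    // the stage, streaming the model data back in rather than parsing one string.
}
{code}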


In addition to addressing the above use-cases, this FLIP also proposes a few 
more changes to simplify the class hierarchy and improve API usability. The 
existing Flink ML library has the following usability issues:

5) The fit/transform API requires users to explicitly provide the 
TableEnvironment, even though the TableEnvironment could be retrieved from the 
Table instances given to fit/transform.
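
For illustration, the environment can in principle be recovered from an input 
Table; the sketch below assumes Flink's internal TableImpl (and its 
getTableEnvironment() accessor) may be used for this, which is an 
implementation detail rather than a committed API:

{code:java}
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.internal.TableImpl;

final class TableEnvironmentUtil {
    private TableEnvironmentUtil() {}

    // Hedged sketch: derive the TableEnvironment from the Table itself so that
    // callers of fit/transform no longer need to pass it explicitly.
    static TableEnvironment getTableEnvironment(Table table) {
        return ((TableImpl) table).getTableEnvironment();
    }
}
{code}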

6) A Pipeline is currently both a Transformer and an Estimator. The experience 
of using a Pipeline is inconsistent with the experience of using an Estimator 
(due to the needFit API).
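
For context, a hedged sketch of the inconsistency, with the existing 
interfaces reduced to stand-ins for illustration: callers of a Pipeline have 
to branch on needFit, whereas a plain Estimator is simply fitted:

{code:java}
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

// Minimal stand-in for the existing Pipeline surface, reduced for illustration.
interface SimplePipeline {
    boolean needFit();
    void fit(TableEnvironment tEnv, Table input);
    Table transform(TableEnvironment tEnv, Table input);
}

final class PipelineUsageSketch {
    // The extra needFit() branch is what makes using a Pipeline feel different
    // from calling fit() on an ordinary Estimator.
    static Table apply(SimplePipeline pipeline, TableEnvironment tEnv, Table input) {
        if (pipeline.needFit()) {
            pipeline.fit(tEnv, input);
        }
        return pipeline.transform(tEnv, input);
    }
}
{code}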



  was:
The FLIP design doc can be found at 
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=184615783.

The existing Flink ML library allows users to compose an Estimator/Transformer 
from a pipeline (i.e. a linear sequence) of Estimators/Transformers, where each 
Estimator/Transformer has one input and one output.

The following use-cases are not yet supported. We propose FLIP-173 [1] to 
address these use-cases.

1) Express an Estimator/Transformer that has multiple inputs/outputs.

For example, some graph embedding algorithms need to take two tables as inputs. 
These two tables represent nodes and edges of the graph respectively. This 
logic can be expressed as an Estimator with 2 input tables.

And some workflows may need to split 1 table into 2 tables, and use these 
tables for training and validation respectively. This logic can be expressed 
by a Transformer with 1 input table and 2 output tables.

2) Compose a directed-acyclic-graph (DAG) of Estimators/Transformers into a 
single Estimator/Transformer.

For example, the workflow may involve the join of 2 tables, where each table 
could be generated by a chain of Estimator/Transformer. The entire workflow is 
therefore a DAG of Estimator/Transformer.

3) Online learning, where a long-running Transformer instance needs to be 
updated with the latest model data generated by another long-running Estimator 
instance.

In this scenario, we need to allow the Estimator to run on a different machine 
than the Transformer, so that the Estimator could consume sufficient 
computation resources in a cluster while the Transformer could be deployed on 
edge devices.

In addition to addressing the above use-cases, we also propose a few more 
changes to simplify the class hierarchy and improve API usability. The existing 
Flink ML library has the following usability issues:

4) The Model interface does not provide any added value (given that we already 
have Transformer). The added class hierarchy complexity is not justified.

5) The fit/transform API requires users to explicitly provide the 
TableEnvironment, even though the TableEnvironment could be retrieved from the 
Table instance given to fit/transform.

6) A Pipeline is both a Transformer and an Estimator. The experience of using 
Pipeline is therefore different from the experience of using Estimator (with 
the needFit API).

7) There is no API provided by the Estimator/Transformer interface to validate 
the schema consistency of a Pipeline. Users would have to instantiate Tables 
(with I/O logic) and run fit/transform to find out whether the stages in the 
Pipeline are compatible with each other.


[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=181311360



        Summary: FLIP-173: Support DAG of algorithms  (was: FLIP-173: Update 
Flink ML library to support Estimator/Transformer DAG and online learning)

> FLIP-173: Support DAG of algorithms
> -----------------------------------
>
>                 Key: FLINK-22915
>                 URL: https://issues.apache.org/jira/browse/FLINK-22915
>             Project: Flink
>          Issue Type: Improvement
>          Components: Library / Machine Learning
>            Reporter: Dong Lin
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
