[
https://issues.apache.org/jira/browse/FLINK-22915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dong Lin updated FLINK-22915:
-----------------------------
Description:
The FLIP design doc can be found at
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=184615783.
The existing Flink ML library allows users to compose an Estimator/Transformer
from a pipeline (i.e. linear sequence) of Estimator/Transformer, and each
Estimator/Transformer has one input and one output.
The following use-cases are not supported yet. And we propose FLIP-173 [1] to
address these use-cases.
1) Express an Estimator/Transformer that has multiple inputs/outputs.
For example, some graph embedding algorithms need to take two tables as inputs.
These two tables represent nodes and edges of the graph respectively. This
logic can be expressed as an Estimator with 2 input tables.
And some workflow may need to split 1 table into 2 tables, and use these tables
for training and validation respectively. This logic can be expressed by a
Transformer with 1 input table and 2 output tables.
2) Compose a directed-acyclic-graph Estimator/Transformer into an
Estimator/Transformer.
For example, the workflow may involve the join of 2 tables, where each table
could be generated by a chain of Estimator/Transformer. The entire workflow is
therefore a DAG of Estimator/Transformer.
3) Online learning where a long-running instance Transformer needs to be
updated by the latest model data generated by another long-running instance of
Estimator.
In this scenario, we need to allow the Estimator to be run on a different
machine than the Transformer. So that Estimator could consume sufficient
computation resource in a cluster while the Transformer could be deployed on
edge devices.
In addition to addressing the above use-cases, we also propose a few more
changes to simplify the class hierarchy and improve API usability. The existing
Flink ML library has the following usability issues:
4) The Model interface does not provide any added value (given that we already
have Transformer). The added class hierarchy complexity is not justified.
5) fit/transform API requires users to explicitly provide the TableEnvironment,
where the TableEnvironment could be retrieved from the Table instance given to
the fit/transform.
6) A Pipeline is both a Transformer and an Estimator. The experience of using
Pipeline is therefore different from the experience of using Estimator (with
the needFit API).
7) There is no API provided by the Estimator/Transformer interface to validate
the schema consistency of a Pipeline. Users would have to instantiate Tables
(with I/O logics) and run fit/transform to know whether the stages in the
Pipeline are compatible with each other.
[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=181311360
was:
The existing Flink ML library allows users to compose an Estimator/Transformer
from a pipeline (i.e. linear sequence) of Estimator/Transformer, and each
Estimator/Transformer has one input and one output.
The following use-cases are not supported yet. And we propose FLIP-173 [1] to
address these use-cases.
1) Express an Estimator/Transformer that has multiple inputs/outputs.
For example, some graph embedding algorithms need to take two tables as inputs.
These two tables represent nodes and edges of the graph respectively. This
logic can be expressed as an Estimator with 2 input tables.
And some workflow may need to split 1 table into 2 tables, and use these tables
for training and validation respectively. This logic can be expressed by a
Transformer with 1 input table and 2 output tables.
2) Compose a directed-acyclic-graph Estimator/Transformer into an
Estimator/Transformer.
For example, the workflow may involve the join of 2 tables, where each table
could be generated by a chain of Estimator/Transformer. The entire workflow is
therefore a DAG of Estimator/Transformer.
3) Online learning where a long-running instance Transformer needs to be
updated by the latest model data generated by another long-running instance of
Estimator.
In this scenario, we need to allow the Estimator to be run on a different
machine than the Transformer. So that Estimator could consume sufficient
computation resource in a cluster while the Transformer could be deployed on
edge devices.
In addition to addressing the above use-cases, we also propose a few more
changes to simplify the class hierarchy and improve API usability. The existing
Flink ML library has the following usability issues:
4) The Model interface does not provide any added value (given that we already
have Transformer). The added class hierarchy complexity is not justified.
5) fit/transform API requires users to explicitly provide the TableEnvironment,
where the TableEnvironment could be retrieved from the Table instance given to
the fit/transform.
6) A Pipeline is both a Transformer and an Estimator. The experience of using
Pipeline is therefore different from the experience of using Estimator (with
the needFit API).
7) There is no API provided by the Estimator/Transformer interface to validate
the schema consistency of a Pipeline. Users would have to instantiate Tables
(with I/O logics) and run fit/transform to know whether the stages in the
Pipeline are compatible with each other.
[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=181311360
Summary: FLIP-173: Update Flink ML library to support
Estimator/Transformer DAG and online learning (was: Update Flink ML library to
support Estimator/Transformer DAG and online learning)
> FLIP-173: Update Flink ML library to support Estimator/Transformer DAG and
> online learning
> ------------------------------------------------------------------------------------------
>
> Key: FLINK-22915
> URL: https://issues.apache.org/jira/browse/FLINK-22915
> Project: Flink
> Issue Type: Improvement
> Components: Library / Machine Learning
> Reporter: Dong Lin
> Priority: Major
>
> The FLIP design doc can be found at
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=184615783.
> The existing Flink ML library allows users to compose an
> Estimator/Transformer from a pipeline (i.e. linear sequence) of
> Estimator/Transformer, and each Estimator/Transformer has one input and one
> output.
> The following use-cases are not supported yet. And we propose FLIP-173 [1] to
> address these use-cases.
> 1) Express an Estimator/Transformer that has multiple inputs/outputs.
> For example, some graph embedding algorithms need to take two tables as
> inputs. These two tables represent nodes and edges of the graph respectively.
> This logic can be expressed as an Estimator with 2 input tables.
> And some workflow may need to split 1 table into 2 tables, and use these
> tables for training and validation respectively. This logic can be expressed
> by a Transformer with 1 input table and 2 output tables.
> 2) Compose a directed-acyclic-graph Estimator/Transformer into an
> Estimator/Transformer.
> For example, the workflow may involve the join of 2 tables, where each table
> could be generated by a chain of Estimator/Transformer. The entire workflow
> is therefore a DAG of Estimator/Transformer.
> 3) Online learning where a long-running instance Transformer needs to be
> updated by the latest model data generated by another long-running instance
> of Estimator.
> In this scenario, we need to allow the Estimator to be run on a different
> machine than the Transformer. So that Estimator could consume sufficient
> computation resource in a cluster while the Transformer could be deployed on
> edge devices.
> In addition to addressing the above use-cases, we also propose a few more
> changes to simplify the class hierarchy and improve API usability. The
> existing Flink ML library has the following usability issues:
> 4) The Model interface does not provide any added value (given that we
> already have Transformer). The added class hierarchy complexity is not
> justified.
> 5) fit/transform API requires users to explicitly provide the
> TableEnvironment, where the TableEnvironment could be retrieved from the
> Table instance given to the fit/transform.
> 6) A Pipeline is both a Transformer and an Estimator. The experience of using
> Pipeline is therefore different from the experience of using Estimator (with
> the needFit API).
> 7) There is no API provided by the Estimator/Transformer interface to
> validate the schema consistency of a Pipeline. Users would have to
> instantiate Tables (with I/O logics) and run fit/transform to know whether
> the stages in the Pipeline are compatible with each other.
> [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=181311360
--
This message was sent by Atlassian Jira
(v8.3.4#803005)