[
https://issues.apache.org/jira/browse/FLINK-22915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jiangjie Qin reassigned FLINK-22915:
------------------------------------
Assignee: Dong Lin
> FLIP-173: Support DAG of algorithms
> -----------------------------------
>
> Key: FLINK-22915
> URL: https://issues.apache.org/jira/browse/FLINK-22915
> Project: Flink
> Issue Type: Improvement
> Components: Library / Machine Learning
> Reporter: Dong Lin
> Assignee: Dong Lin
> Priority: Major
> Labels: pull-request-available
>
> The FLIP design doc can be found at
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=184615783.
> The existing Flink ML library allows users to compose an
> Estimator/Transformer from a pipeline (i.e. a linear sequence) of
> Estimators/Transformers, where each Estimator/Transformer has one input and
> one output.
> The following use-cases are not supported yet, and we would like to address
> them with the changes proposed in this FLIP.
> 1) Express an Estimator/Transformer that has multiple inputs/outputs.
> For example, some graph embedding algorithms (e.g. MetaPath2Vec) need to
> take two tables as inputs. These two tables represent the node labels and
> the edges of the graph, respectively. This logic can be expressed as an
> Estimator with 2 input tables.
> Some workflows may also need to split 1 table into 2 tables and use these
> tables for training and validation, respectively. This logic can be
> expressed by a Transformer with 1 input table and 2 output tables.
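> A minimal sketch of how such stages could be expressed, assuming varargs
> Table inputs and Table[] outputs; the interfaces below are defined locally
> for illustration and are not the final API:
> {code:java}
> import org.apache.flink.table.api.Table;
>
> // Locally defined, illustrative interfaces (not the final Flink ML API).
> interface MultiOutputTransformer {
>     Table[] transform(Table... inputs);
> }
>
> interface MultiInputEstimator<M extends MultiOutputTransformer> {
>     M fit(Table... inputs);
> }
>
> class UsageSketch {
>     static void run(MultiInputEstimator<MultiOutputTransformer> metaPath2Vec,
>                     MultiOutputTransformer splitter,
>                     Table nodeLabels, Table edges, Table samples) {
>         // Estimator with 2 input tables (node labels + edges of the graph).
>         MultiOutputTransformer embedding = metaPath2Vec.fit(nodeLabels, edges);
>         // Transformer with 1 input table and 2 output tables (train/validation split).
>         Table[] splits = splitter.transform(samples);
>         Table train = splits[0];
>         Table validation = splits[1];
>     }
> }
> {code}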
> 2) Express generic machine learning computation logic that does not have
> the "transformation" semantic.
> We believe most machine learning engineers associate the name "Transformer"
> with the "transformation" semantic, where a record in the output typically
> corresponds to one record in the input. Thus it is counter-intuitive to use
> Transformer to encode aggregation logic, where a record in the output
> corresponds to an arbitrary number of records in the input.
> Therefore we need a class with a name different from "Transformer" to encode
> generic multi-input multi-output computation logic.
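> As a sketch of what such a class could look like (the name AlgoOperator
> follows the linked design doc; the signature and the example implementation
> below are illustrative assumptions, not the final API):
> {code:java}
> import org.apache.flink.table.api.Table;
> import static org.apache.flink.table.api.Expressions.$;
>
> // Generic many-to-many computation, without the record-to-record
> // connotation carried by the name "Transformer". Illustrative only.
> interface AlgoOperator {
>     Table[] transform(Table... inputs);
> }
>
> // Example: an aggregation where one output record summarizes many input records.
> class CountPerLabel implements AlgoOperator {
>     @Override
>     public Table[] transform(Table... inputs) {
>         Table counts = inputs[0]
>                 .groupBy($("label"))
>                 .select($("label"), $("label").count().as("cnt"));
>         return new Table[] {counts};
>     }
> }
> {code}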
> 3) Support online learning, where a long-running Model instance needs to be
> continuously updated with the latest model data generated by another
> long-running Estimator instance.
> In this scenario, we need to allow the Estimator to run on a different
> machine than the Model, so that the Estimator can consume sufficient
> computation resources in a cluster while the Model can be deployed on edge
> devices.
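> A sketch of the kind of separation this implies, assuming model data is
> exchanged between the two processes as an unbounded Table (all names below
> are illustrative):
> {code:java}
> import org.apache.flink.table.api.Table;
>
> // Runs in the training cluster: continuously trains on an unbounded input
> // and emits the latest model data as an unbounded table.
> interface OnlineTrainer {
>     Table fitAndEmitModelData(Table trainingStream);
> }
>
> // Runs on an edge device: serves requests while ingesting the model data
> // stream produced by the trainer on another machine (e.g. via a message queue).
> interface ServingModel {
>     void setModelData(Table modelDataStream);
>     Table transform(Table requests);
> }
> {code}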
> 4) Provide APIs to allow an Estimator/Model to be efficiently saved/loaded
> even if the state (e.g. model data) of the Estimator/Model is tens of GBs.
> The existing PipelineStage::toJson essentially requires the developer of an
> Estimator/Model to serialize all model data into an in-memory string, which
> could be very inefficient (or practically impossible) if the model data is
> very large (e.g. tens of GBs).
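> A sketch of a save/load style that avoids the in-memory string, assuming
> model data is written to files under a directory while only small metadata
> is serialized eagerly (method names are illustrative):
> {code:java}
> import java.io.IOException;
>
> // Illustrative persistence contract: parameters are stored as small metadata,
> // while model data (possibly tens of GBs) is written as files under `path`
> // instead of being serialized into one in-memory JSON string.
> interface PersistableStage {
>     void save(String path) throws IOException;
> }
>
> // Loading mirrors save(): metadata is read eagerly, and the (large) model
> // data can be re-read lazily, e.g. as a Table backed by the files under `path`.
> interface StageLoader<T extends PersistableStage> {
>     T load(String path) throws IOException;
> }
> {code}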
> In addition to addressing the above use-cases, this FLIP also proposes a few
> more changes to simplify the class hierarchy and improve API usability. The
> existing Flink ML library has the following usability issues:
> 5) The fit/transform API requires users to explicitly provide the
> TableEnvironment, even though the TableEnvironment could be retrieved from
> the Table instances given to fit/transform.
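> To illustrate that the simplification is feasible, the environment can be
> recovered from an input Table itself; the snippet below relies on Flink's
> internal TableImpl and is only a sketch of the idea, not a proposed API:
> {code:java}
> import org.apache.flink.table.api.Table;
> import org.apache.flink.table.api.TableEnvironment;
> import org.apache.flink.table.api.internal.TableImpl;
>
> class EnvUtils {
>     // Derive the TableEnvironment from the Table passed to fit/transform,
>     // so the caller does not have to pass it explicitly. Uses an internal
>     // class, shown here only to illustrate that the information is available.
>     static TableEnvironment getTableEnvironment(Table table) {
>         return ((TableImpl) table).getTableEnvironment();
>     }
> }
> {code}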
> 6) A Pipeline is currently both a Transformer and an Estimator. The
> experience of using a Pipeline is inconsistent with the experience of using
> an Estimator (e.g. due to the needFit API).
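> One way to make the experience consistent is to let Pipeline act as a
> regular Estimator whose fit() produces a fitted pipeline that acts as a
> regular Transformer, removing the need for a needFit-style check; the
> classes below are a rough illustrative sketch, not the proposed API:
> {code:java}
> import java.util.List;
> import org.apache.flink.table.api.Table;
>
> // Illustrative, locally defined interfaces (not the final API).
> interface SketchTransformer {
>     Table[] transform(Table... inputs);
> }
>
> interface SketchEstimator<M extends SketchTransformer> {
>     M fit(Table... inputs);
> }
>
> // A Pipeline behaves like any other Estimator: fitting it yields a fitted
> // pipeline, so callers never need a needFit()-style check.
> class SketchPipeline implements SketchEstimator<SketchPipelineModel> {
>     private final List<Object> stages;
>
>     SketchPipeline(List<Object> stages) {
>         this.stages = stages;
>     }
>
>     @Override
>     public SketchPipelineModel fit(Table... inputs) {
>         // Fit Estimator stages in order and collect the fitted Transformers
>         // (details omitted in this sketch).
>         return new SketchPipelineModel();
>     }
> }
>
> // The fitted pipeline behaves like any other Transformer.
> class SketchPipelineModel implements SketchTransformer {
>     @Override
>     public Table[] transform(Table... inputs) {
>         // Would chain the fitted stages' transform() calls; pass-through here.
>         return inputs;
>     }
> }
> {code}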
--
This message was sent by Atlassian Jira
(v8.3.4#803005)