[ 
https://issues.apache.org/jira/browse/FLINK-22915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-22915:
-----------------------------------
    Labels: pull-request-available  (was: )

> FLIP-173: Support DAG of algorithms
> -----------------------------------
>
>                 Key: FLINK-22915
>                 URL: https://issues.apache.org/jira/browse/FLINK-22915
>             Project: Flink
>          Issue Type: Improvement
>          Components: Library / Machine Learning
>            Reporter: Dong Lin
>            Priority: Major
>              Labels: pull-request-available
>
> The FLIP design doc can be found at 
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=184615783.
> The existing Flink ML library allows users to compose an 
> Estimator/Transformer from a pipeline (i.e., a linear sequence) of 
> Estimators/Transformers, where each Estimator/Transformer has one input 
> and one output.
> The following use-cases are not yet supported. We would like to address 
> them with the changes proposed in this FLIP.
> 1) Express an Estimator/Transformer that has multiple inputs/outputs.
> For example, some graph embedding algorithms (e.g., MetaPath2Vec) need to 
> take two tables as inputs, representing the node labels and the edges of 
> the graph respectively. This logic can be expressed as an Estimator with 
> two input tables.
> And some workflows may need to split one table into two tables, and use 
> these tables for training and validation respectively. This logic can be 
> expressed as a Transformer with one input table and two output tables.
> 2) Express a generic machine learning computation logic that does not have 
> the "transformation" semantic.
> We believe most machine learning engineers associate the name "Transformer" 
> with the "transformation" semantic, where a record in the output typically 
> corresponds to one record in the input. Thus it is counter-intuitive to use 
> Transformer to encode aggregation logic, where a record in the output 
> corresponds to an arbitrary number of records in the input.
> Therefore we need a class with a name different from "Transformer" to 
> encode generic multi-input multi-output computation logic.
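Use-cases 1) and 2) both call for a stage that maps N input tables to M output tables. Below is a minimal, self-contained sketch of that shape; the stub `Table` type and the `AlgoOperator` name are illustrative stand-ins, and the design doc linked above defines the actual API:

```java
import java.util.Arrays;

// Minimal sketch of a multi-input/multi-output stage. The Table class here
// is a stub standing in for Flink's org.apache.flink.table.api.Table, and
// AlgoOperator is an assumed name for the generic computation interface.
public class AlgoOperatorSketch {

    // Stub for org.apache.flink.table.api.Table.
    static class Table {
        final double[] rows;
        Table(double[] rows) { this.rows = rows; }
    }

    // Generic computation stage: N input tables -> M output tables.
    interface AlgoOperator {
        Table[] transform(Table... inputs);
    }

    // Example from use-case 1): split one table into train/validation
    // tables (1 input table, 2 output tables).
    static class TrainValidationSplit implements AlgoOperator {
        private final double trainFraction;
        TrainValidationSplit(double trainFraction) { this.trainFraction = trainFraction; }

        @Override
        public Table[] transform(Table... inputs) {
            double[] rows = inputs[0].rows;
            int cut = (int) (rows.length * trainFraction);
            return new Table[] {
                new Table(Arrays.copyOfRange(rows, 0, cut)),          // training split
                new Table(Arrays.copyOfRange(rows, cut, rows.length)) // validation split
            };
        }
    }

    public static void main(String[] args) {
        Table[] out = new TrainValidationSplit(0.8)
            .transform(new Table(new double[] {1, 2, 3, 4, 5}));
        // prints "4 train rows, 1 validation rows"
        System.out.println(out[0].rows.length + " train rows, "
            + out[1].rows.length + " validation rows");
    }
}
```

The same `Table[] transform(Table...)` shape also covers the aggregation logic of use-case 2), since nothing in the contract ties an output record to a single input record.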
> 3) Online learning where a long-running Model instance needs to be 
> continuously updated by the latest model data generated by another 
> long-running Estimator instance.
> In this scenario, we need to allow the Estimator to run on a different 
> machine than the Model, so that the Estimator can consume sufficient 
> computation resources in a cluster while the Model can be deployed on edge 
> devices.
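A minimal sketch of this decoupling, with a `BlockingQueue` standing in for the unbounded stream of model data flowing between the two machines (all names here are illustrative, not the FLIP's API):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of decoupling a long-running Estimator from a long-running Model:
// the estimator side publishes successive model-data snapshots to a channel
// (a BlockingQueue here, standing in for a network stream / unbounded table),
// and the model side applies each snapshot as it arrives.
public class OnlineLearningSketch {

    static class Model {
        volatile double weight;                    // latest model data applied
        double predict(double x) { return weight * x; }
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Double> modelDataStream = new ArrayBlockingQueue<>(16);
        Model model = new Model();

        // "Estimator side": emits improved model data over time.
        Thread estimator = new Thread(() -> {
            for (double w = 1.0; w <= 3.0; w += 1.0) {
                modelDataStream.offer(w);
            }
        });
        estimator.start();
        estimator.join();

        // "Model side": continuously applies the latest model data.
        Double w;
        while ((w = modelDataStream.poll()) != null) {
            model.weight = w;
        }
        System.out.println(model.predict(2.0)); // prints 6.0
    }
}
```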
> 4) Provide APIs that allow an Estimator/Model to be saved/loaded efficiently 
> even if the state (e.g. model data) of the Estimator/Model amounts to tens 
> of gigabytes.
> The existing PipelineStage::toJson effectively requires the developer of an 
> Estimator/Model to serialize all model data into an in-memory string, which 
> could be very inefficient (or practically impossible) if the model data is 
> very large (e.g. tens of gigabytes).
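A path-based save/load contract avoids materializing the model data as a single in-memory string. The sketch below uses only standard-library I/O and hypothetical names to illustrate streaming model data to files under a directory:

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of a path-based save/load contract: model data is streamed to
// files under a directory instead of being serialized into one in-memory
// JSON string. Class and method names are illustrative, not the FLIP's API.
public class ModelPersistenceSketch {

    static class LinearModel {
        final double[] weights;
        LinearModel(double[] weights) { this.weights = weights; }

        // Stream weights to disk one value at a time; for tens of gigabytes
        // of model data this avoids buffering everything in memory.
        void save(Path dir) throws IOException {
            Files.createDirectories(dir);
            try (DataOutputStream out = new DataOutputStream(new BufferedOutputStream(
                    Files.newOutputStream(dir.resolve("weights.bin"))))) {
                out.writeInt(weights.length);
                for (double w : weights) out.writeDouble(w);
            }
        }

        static LinearModel load(Path dir) throws IOException {
            try (DataInputStream in = new DataInputStream(new BufferedInputStream(
                    Files.newInputStream(dir.resolve("weights.bin"))))) {
                double[] weights = new double[in.readInt()];
                for (int i = 0; i < weights.length; i++) weights[i] = in.readDouble();
                return new LinearModel(weights);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("model");
        new LinearModel(new double[] {0.5, -1.25}).save(dir);
        LinearModel restored = LinearModel.load(dir);
        System.out.println(java.util.Arrays.toString(restored.weights)); // prints [0.5, -1.25]
    }
}
```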
> In addition to addressing the above use-cases, this FLIP also proposes a few 
> more changes to simplify the class hierarchy and improve API usability. The 
> existing Flink ML library has the following usability issues:
> 5) The fit/transform API requires users to explicitly provide the 
> TableEnvironment, even though the TableEnvironment could be retrieved from 
> the Table instances given to fit/transform.
> 6) A Pipeline is currently both a Transformer and an Estimator. The 
> experience of using a Pipeline is inconsistent with the experience of using 
> an Estimator (e.g. the needFit API).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
