[ https://issues.apache.org/jira/browse/FLINK-22915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated FLINK-22915:
-----------------------------------
    Labels: pull-request-available  (was: )

> FLIP-173: Support DAG of algorithms
> -----------------------------------
>
>                 Key: FLINK-22915
>                 URL: https://issues.apache.org/jira/browse/FLINK-22915
>             Project: Flink
>          Issue Type: Improvement
>          Components: Library / Machine Learning
>            Reporter: Dong Lin
>            Priority: Major
>              Labels: pull-request-available
>
> The FLIP design doc can be found at
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=184615783.
>
> The existing Flink ML library allows users to compose an Estimator/Transformer
> from a pipeline (i.e. a linear sequence) of Estimators/Transformers, where
> each Estimator/Transformer has one input and one output.
>
> The following use-cases are not supported yet, and we would like to address
> them with the changes proposed in this FLIP.
>
> 1) Express an Estimator/Transformer that has multiple inputs/outputs.
>
> For example, some graph embedding algorithms (e.g., MetaPath2Vec) need to
> take two tables as inputs. These two tables represent the node labels and the
> edges of the graph respectively. This logic can be expressed as an Estimator
> with 2 input tables.
>
> And some workflows may need to split 1 table into 2 tables, and use these
> tables for training and validation respectively. This logic can be expressed
> by a Transformer with 1 input table and 2 output tables.
>
> 2) Express generic machine learning computation logic that does not have the
> "transformation" semantic.
>
> We believe most machine learning engineers associate the name "Transformer"
> with the "transformation" semantic, where a record in the output typically
> corresponds to one record in the input. Thus it is counter-intuitive to use a
> Transformer to encode aggregation logic, where a record in the output
> corresponds to an arbitrary number of records in the input.
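A minimal sketch of the 1-input/2-output shape described in use-case 1, assuming `List<Integer>` as a stand-in for a Flink Table so the example runs without Flink; the class name, method name, and split logic are all illustrative and are not the FLIP-173 API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a multi-output "Transformer": one input table,
// two output tables (training and validation). List<Integer> stands in
// for a Flink Table so the example is self-contained.
public class SplitSketch {

    // Splits one input "table" into two: the first trainFraction of rows
    // goes to training, the rest to validation.
    static List<List<Integer>> split(List<Integer> input, double trainFraction) {
        int cut = (int) (input.size() * trainFraction);
        List<List<Integer>> outputs = new ArrayList<>();
        outputs.add(new ArrayList<>(input.subList(0, cut)));            // training table
        outputs.add(new ArrayList<>(input.subList(cut, input.size()))); // validation table
        return outputs;
    }

    public static void main(String[] args) {
        List<Integer> rows = List.of(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        List<List<Integer>> tables = split(rows, 0.8);
        // Prints "8 train / 2 validation"
        System.out.println(tables.get(0).size() + " train / " + tables.get(1).size() + " validation");
    }
}
```

The point of the sketch is only the shape of the signature: a stage returns a list of tables rather than a single table, which the existing one-input/one-output Transformer interface cannot express.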
> Therefore we need a class with a name different from "Transformer" to encode
> generic multi-input multi-output computation logic.
>
> 3) Online learning, where a long-running Model instance needs to be
> continuously updated with the latest model data generated by another
> long-running Estimator instance.
>
> In this scenario, we need to allow the Estimator to run on a different
> machine than the Model, so that the Estimator can consume sufficient
> computation resources in a cluster while the Model can be deployed on edge
> devices.
>
> 4) Provide APIs that allow an Estimator/Model to be efficiently saved/loaded
> even if the state (e.g., model data) of the Estimator/Model is more than 10s
> of GBs.
>
> The existing PipelineStage::toJson basically requires the developer of an
> Estimator/Model to serialize all model data into an in-memory string, which
> could be very inefficient (or practically impossible) if the model data is
> very large (e.g., 10s of GBs).
>
> In addition to addressing the above use-cases, this FLIP also proposes a few
> more changes to simplify the class hierarchy and improve API usability. The
> existing Flink ML library has the following usability issues:
>
> 5) The fit/transform API requires users to explicitly provide the
> TableEnvironment, even though the TableEnvironment could be retrieved from
> the Table instance given to fit/transform.
>
> 6) A Pipeline is currently both a Transformer and an Estimator. The
> experience of using a Pipeline is inconsistent with the experience of using
> an Estimator (with the needFit API).

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
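Use-case 4 contrasts two save strategies: serializing all model data into one in-memory string (as PipelineStage::toJson effectively requires) versus persisting it to a path so it can be written incrementally. The sketch below illustrates the contrast in plain Java; the method names `saveAsString` and `saveToPath` are hypothetical and are not the FLIP-173 API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

public class SaveSketch {

    // Old style: the entire model data must be materialized as one in-memory
    // string, which is very inefficient (or impossible) for 10s of GBs.
    static String saveAsString(byte[] modelData) {
        return Base64.getEncoder().encodeToString(modelData);
    }

    // Direction described in the use-case: persist model data to a path, so
    // it can be streamed to storage instead of held in memory as a string.
    // (A real implementation would write from a streaming source, not a
    // byte[]; the array here just keeps the example self-contained.)
    static void saveToPath(byte[] modelData, Path dir) throws IOException {
        Files.createDirectories(dir);
        try (InputStream in = new ByteArrayInputStream(modelData)) {
            Files.copy(in, dir.resolve("model_data"));
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] model = "weights...".getBytes();
        Path dir = Files.createTempDirectory("model");
        saveToPath(model, dir);
        System.out.println(Files.size(dir.resolve("model_data")) + " bytes saved");
    }
}
```

The string-based variant forces the whole payload through the heap at once; the path-based variant keeps only a buffer in memory, which is why a path-oriented save/load API scales to large model data.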