Thanks for the post, Dong :) We welcome everyone to drop us an email about Flink ML. Let's work together to build machine learning on Flink :)
On Wed, Aug 25, 2021 at 8:58 PM, Dong Lin <lindon...@gmail.com> wrote:

> Hi everyone,
>
> Based on the feedback received in the online/offline discussion in the past few weeks, we (Zhipeng, Fan, myself and a few other developers at Alibaba) have reached agreement on the design to support a DAG of algorithms. We have merged the ideas from the initial two options into this FLIP-176 <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=184615783> design doc.
>
> If you have comments on the latest design doc, please let us know!
>
> Cheers,
> Dong
>
>
> On Mon, Aug 23, 2021 at 5:07 PM Becket Qin <becket....@gmail.com> wrote:
>
>> Thanks for the comments, Fan. Please see the reply inline.
>>
>> On Thu, Aug 19, 2021 at 10:25 PM Fan Hong <hongfa...@gmail.com> wrote:
>>
>> > Hi, Becket,
>> >
>> > Many thanks for your detailed review. I agree that it is easier to involve more people in the discussion if the fundamental differences are highlighted.
>> >
>> > Here are some of my thoughts to help other people think about these differences. (Correct me if the technical details are not right.)
>> >
>> > 1. One set of API or not? Maybe not that important.
>> >
>> > First of all, AlgoOperators and Pipeline / Transformer / Estimator in Proposal 2 are absolutely *NOT* independent.
>> >
>> > One may think they are independent, because Pipeline / Transformer / Estimator are already in Flink ML Lib while AlgoOperators are newly added in this proposal. But that's not true. If you check Alink [1], where the idea of Proposal 2 originated, both of them have been present for a long time, and they collaborate tightly.
>> >
>> > In terms of functionality, they are also not independent. Their relation is more like a two-level API for specifying ML tasks: AlgoOperators are a general-purpose level to represent any ML algorithm, while Pipeline / Transformer / Estimator provides a higher-level API which enables wrapping multiple ML algorithms together in a fit-transform way.
>>
>> We probably need to first clarify what "independent" means here. Sure, users can always wrap a Transformer into an AlgoOperator, but users can basically wrap any code, any class into an AlgoOperator. And we wouldn't say AlgoOperator is not independent of any class, right? In my opinion, the two APIs are independent because even if we agree that Transformers are doing things that are conceptually a subset of what AlgoOperators do, a Transformer cannot be used as an AlgoOperator out of the box without wrapping. And even worse, a MIMO AlgoOperator cannot be wrapped into a Transformer / Estimator if these two APIs are SISO. So from what I see, in Option 2, these two APIs are independent from an API design perspective.
>>
>> > One could consider Flink DataStream - Table as an analogy to AlgoOperators - Pipeline. The two-level APIs provide different functionalities to end users, and the higher-level API calls the lower-level API in its internal implementation. I'm not saying the two-level API design in Proposal 2 is good just because Flink already did this. I just hope to help people in the community understand the relation between AlgoOperators and Pipeline.
>>
>> I am not sure it is accurate to say DataStream is a low-level API of Table.
>> They are simply two different DSLs, one for the relational / SQL-like analytics paradigm, and the other for those who are more familiar with streaming applications. More importantly, they are designed to support conversion from one to the other out of the box, which is unlike Pipeline and AlgoOperators in proposal 2.
>>
>> > An additional usage and benefit of the Pipeline API is that a SISO PipelineModel corresponds exactly to a deployable unit for online serving.
>> >
>> > In online serving, the Flink runtime is usually avoided to achieve low latency, so models have to be wrapped for transmission from the Flink ecosystem to a non-Flink one. This is where the wrapping is really needed and inevitable, because serving service providers are usually expected to be generic over one type of model. The Pipeline API in Proposal 2 targets exactly this scenario without complicated APIs.
>> >
>> > Offline or nearline inference, on the other hand, can be completed within the Flink ecosystem. That's where Flink ML Lib still comes in, so a loose wrapping using AlgoOperators in Proposal 2 still works without much overhead.
>>
>> It seems that a MIMO transformer can easily support all SISO use cases, right? And there is zero overhead, because users may not have to wrap AlgoOperators, but can just build a Pipeline directly by putting either Transformers or AlgoOperators into it, without worrying about whether they are interoperable.
>>
>> > At the same time, these two levels of APIs are not redundant in their functionalities; they have to collaborate to build ML tasks.
>> >
>> > The AlgoOperator API is self-consistent and self-complete in constructing ML tasks, but if users seek to wrap a sequence of subtasks, especially for online serving, the Pipeline / Transformer / Estimator API is inevitable. On the other side, the Pipeline / Transformer / Estimator API lacks completeness, even in the extended version plus the Graph API in Proposal 1 (last case in [4]), so it cannot replace the AlgoOperator API.
>> >
>> > One case of their collaboration lies in my response to Mingliang's recommendation scenarios, where AlgoOperators + Pipeline can provide cleaner usage than the Graph API.
>>
>> I think the link/linkFrom API is more like a convenient wrapper of fit/transform/compute. Functionality-wise, they are equivalent. The Graph/GraphBuilder API, on the other hand, is an encapsulation design on top of Estimator/Transformer/AlgoOperators. Without the GraphBuilder/Graph API, users would have to create their own class to encapsulate the code, just like without the Pipeline API they would have to create their own class to wrap the pipeline logic. So I don't think we should compare link/linkFrom with the Graph/GraphBuilder API, because they serve different purposes. Even though both of them need to describe a DAG, GraphBuilder describes it for encapsulation while link/linkFrom does not.
>>
>> > 2. What is the core semantics of Pipeline / Transformer / Estimator?
>> >
>> > I will not give my answer, because I can't. I think it would be difficult to reach an agreement on this. But I did two things, and hope they can provide some hints.
>> >
>> > One thing is to seek answers from other ML libraries. Scikit-learn and Spark ML are well-known general-purpose ML libraries.
>> > Spark ML gives the definition of Pipeline / Transformer / Estimator in its documents. Here I quote as follows [2]:
>> >
>> >> *Transformer* <https://spark.apache.org/docs/latest/ml-pipeline.html#transformers>: A Transformer is an algorithm which can transform *one* DataFrame into *another* DataFrame. E.g., an ML model is a Transformer which transforms *a* DataFrame with features into *a* DataFrame with predictions.
>> >> *Estimator* <https://spark.apache.org/docs/latest/ml-pipeline.html#estimators>: An Estimator is an algorithm which can be fit on *a* DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.
>> >> *Pipeline* <https://spark.apache.org/docs/latest/ml-pipeline.html#pipeline>: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.
>> >
>> > Spark ML clearly declares the quantity of inputs and outputs for the Estimator and Transformer APIs. Scikit-learn does not give a clear definition; instead, it presents its APIs [3]:
>> >
>> >> *Estimator:* The base object, implements a fit method to learn from data, either: estimator = estimator.fit(data, targets) or: estimator = estimator.fit(data)
>> >> *Transformer:* For filtering or modifying the data, in a supervised or unsupervised way, implements: new_data = transformer.transform(data). When fitting and transforming can be performed much more efficiently together than separately, implements: new_data = transformer.fit_transform(data)
>> >
>> > In these API signatures, one input and one output are defined.
>> >
>> > Another thing I did was to seek concepts in Big Data APIs that make good analogies to Pipeline / Transformer / Estimator in ML APIs, so that non-ML developers may better understand their positions in ML APIs.
>> >
>> > In the end, I think 'map' in the MapReduce paradigm may be a fair analogy and easy to understand for everyone. One may think of 'map' as the MapFunction or FlatMapFunction in Flink, or the Mapper in Hadoop. As far as I know, no Big Data API tries to extend 'map' to support multiple inputs or outputs while still keeping the original name. In Flink, there exist co-Map and co-FlatMap, which can be considered extensions, yet they do not use the name 'map'.
>> >
>> > So, is the core semantic of 'map' a conversion from data to data, or from one dataset to another dataset? With either answer, the fact is that no one breaks the usage convention of 'map'.
>>
>> This is an interesting discussion. First of all, I don't think "map" is a good comparison here. This method is always defined in a class representing a data collection, so there is no actual data input to the method at all. The only semantic that makes sense is to operate on the data collection the `map` method was defined on. And the parameter of `map` is the processing logic, which would also be weird to have more than one of.
>>
>> Regarding scikit-learn and Spark, every API design has its context, targeted use cases and design goals. I think it is more important to analyze and understand the reason WHY their API looks like that and whether it is a good design, instead of just following WHAT it looks like.
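>>
>> Just to make the SISO contract in those quotes concrete, the signatures boil down to something like the following sketch ("Table" is a placeholder for the data abstraction; these are simplified shapes, not the actual Spark or scikit-learn classes):
>>
>>     // A minimal sketch of the SISO contract quoted above. "Table" is a
>>     // placeholder type, not the real Spark/scikit-learn data abstraction.
>>     interface Table {}
>>
>>     interface Transformer {
>>         // one dataset in, one dataset out
>>         Table transform(Table input);
>>     }
>>
>>     interface Estimator {
>>         // learn from one dataset and produce a model, i.e. a Transformer
>>         Transformer fit(Table input);
>>     }
>>
>> The question, then, is whether hard-coding this one-in/one-out arity is a design goal in itself, or merely a consequence of the use cases those libraries targeted.
>>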
>> In my opinion, one primary reason is that the Spark and scikit-learn Pipelines assume all the samples, whether for training or inference, are well prepared. They basically exclude the data preparation step from the Pipeline. Take recommendation systems as an example: it is quite typical that the samples are generated from user behaviors stored in different datasets, such as exposures, clicks, and maybe also user profiles stored in relational databases. So MIMO is a must in this case. Today, data preparation is out of the scope of scikit-learn and not included in its Pipeline API; people usually use other tools such as Pandas or Spark DataFrame to prepare the data.
>>
>> In order to discuss whether a MIMO Pipeline makes sense, we need to think about whether it is valuable to include data preparation in the Pipeline as well. Personally I think it is a good extension and I don't see much harm in doing so. Apparently, a MIMO Pipeline API would also support SISO Pipelines, so it is conceptually backwards compatible. Those algorithms that only make sense as SISO can stay just as they are; the only difference is that instead of returning an output Table, they return a single-table array. And for those who do need MIMO support, we also make them happy. Therefore it looks like a useful feature with little cost.
>>
>> BTW, I have asked a few AI practitioners, including some from industry and academia. The concept of a MIMO Pipeline itself seems well accepted. Somewhat surprisingly, although the concept of Transformer / Estimator is understood by most people I talked to, they are not familiar with what a transformer / estimator should look like. I think this is partly because the ML Pipeline is a well-known concept without a well-agreed API. In fact, even Spark and scikit-learn have quite different designs for Estimator / Transformer / Pipeline when it comes to the details.
>>
>> > 3. About potential inconsistent availability of algorithms
>> >
>> > Becket has mentioned that developers may be confused about how to implement the same algorithm in the two levels of APIs in Proposal 2.
>> >
>> > If one accepts the relation between the AlgoOperator API and the Pipeline API described before, then it is not a problem. It is natural that developers implement their algorithms in AlgoOperators, and call AlgoOperators in Estimators/Transformers.
>> >
>> > If not, I propose a rough idea here: an abstract class AlgoOpEstimatorImpl is provided as a subclass of Estimator. It has a method named getTrainOp() which returns the AlgoOperator where the computation logic resides. The other code in AlgoOpEstimatorImpl is fixed. In this way, developers of Flink ML Lib are asked to implement Estimator by inheriting AlgoOpEstimatorImpl.
>> >
>> > Other solutions are also possible, but may still need some community convention.
>> >
>> > I would also like to mention that the same issue exists in Proposal 1, as there are also multiple places where developers can implement algorithms.
>>
>> I am not sure I fully understand what "there are also multiple places where developers can implement algorithms" means. It is always the algorithm authors' call how to implement the interfaces. Implementation-wise, it is OK to have an abstract class such as AlgoImpl; the algorithm authors can choose to leverage it or not.
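>>
>> For example, Fan's rough idea above might be sketched like this (hedged: AlgoOpEstimatorImpl and getTrainOp() are the names from Fan's description, while AlgoOperator.process() and ModelTransformer are made up here purely for illustration):
>>
>>     // Hedged sketch of the AlgoOpEstimatorImpl idea described above; only
>>     // getTrainOp() varies per algorithm, the rest of the class is fixed.
>>     interface Table {}
>>     interface Transformer { Table transform(Table input); }
>>     interface Estimator { Transformer fit(Table input); }
>>
>>     // process() is an assumed method name, not an actual proposal 2 API.
>>     interface AlgoOperator { Table process(Table input); }
>>
>>     // A made-up wrapper that turns a fitted model table into a Transformer.
>>     class ModelTransformer implements Transformer {
>>         private final Table model;
>>         ModelTransformer(Table model) { this.model = model; }
>>         @Override
>>         public Table transform(Table input) {
>>             // real code would apply 'model' to 'input'; returning the input
>>             // just keeps this sketch compilable
>>             return input;
>>         }
>>     }
>>
>>     abstract class AlgoOpEstimatorImpl implements Estimator {
>>         // the AlgoOperator holding the actual training logic
>>         protected abstract AlgoOperator getTrainOp();
>>
>>         @Override
>>         public final Transformer fit(Table input) {
>>             // the fixed part: delegate the training computation to the
>>             // AlgoOperator and wrap its output for inference
>>             return new ModelTransformer(getTrainOp().process(input));
>>         }
>>     }
>>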
>> But in either case, the end users won't see the implementation class and should only rely on public interfaces such as Estimator / Transformer / AlgoOperator, etc.
>>
>> > In summary, I think the first and second issues above are preference-related, and I hope my thoughts can give some clues. The third issue can be considered a common technical problem in both proposals. We may work together to seek better solutions.
>> >
>> > Sincerely,
>> >
>> > Fan Hong.
>> >
>> > [1] https://github.com/alibaba/Alink
>> > [2] https://spark.apache.org/docs/latest/ml-pipeline.html
>> > [3] https://scikit-learn.org/stable/developers/develop.html
>> > [4] https://docs.google.com/document/d/1L3aI9LjkcUPoM52liEY6uFktMnFMNFQ6kXAjnz_11do
>> >
>> > On Tue, Jul 20, 2021 at 11:42 AM Becket Qin <becket....@gmail.com> wrote:
>> >
>> >> Hi Dong, Zhipeng and Fan,
>> >>
>> >> Thanks for the detailed proposals. It is quite a lot of reading! Given that we are introducing a lot of stuff here, I find that it might be easier for people to discuss if we list the fundamental differences first. From what I understand, the most fundamental difference between the two proposals is the following:
>> >>
>> >> * In order to support graph structure, do we extend Transformer/Estimator, or do we introduce a new set of API? *
>> >>
>> >> Proposal 1 tries to keep one set of API, which is based on Transformer/Estimator/Pipeline. More specifically, it does the following:
>> >>   - Make Transformer and Estimator multi-input and multi-output (MIMO).
>> >>   - Introduce Graph/GraphModel as counterparts of Pipeline/PipelineModel.
>> >>
>> >> Proposal 2 leaves the existing Transformer/Estimator/Pipeline as is. Instead, it introduces AlgoOperators to support the graph structure. The AlgoOperators are general-purpose graph nodes supporting MIMO. They are independent of Pipeline / Transformer / Estimator.
>> >>
>> >> My two cents:
>> >>
>> >> I think it is a big advantage to have a single set of API rather than two independent sets of API, if possible. But I would suggest we change the current proposal 1 a little bit, by learning from proposal 2.
>> >>
>> >> What I like about proposal 1:
>> >> 1. A single set of API, symmetric in Graph/GraphModel and Pipeline/PipelineModel.
>> >> 2. Keeping most of the benefits of Transformer/Estimator, including the fit-then-transform relation and the save/load capability.
>> >>
>> >> However, proposal 1 also introduces some changes that I am not sure about:
>> >>
>> >> 1. The most controversial part of proposal 1 is whether we should extend Transformer/Estimator/Pipeline at all. In fact, different projects have slightly different designs for Transformer/Estimator/Pipeline, so I think it is OK to extend them. However, there are some commonly recognized core semantics that we ideally want to keep. To me these core semantics are:
>> >>   1. Transformer is a Data -> Data conversion; Estimator deals with the Data -> Model conversion.
>> >>   2. Estimator.fit() gives a Transformer, and users can just call Transformer.transform() to perform inference.
>> >> To me, as long as these core semantics are kept, extension to the API seems acceptable.
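>> >>
>> >> For illustration, a minimal sketch of a MIMO extension that keeps both core semantics might look like this (hypothetical signatures, not the FLIP's final API; Table is a placeholder type):
>> >>
>> >>     // Hedged sketch: a MIMO Transformer/Estimator preserving the two
>> >>     // core semantics above; only the arity changes from one Table to many.
>> >>     interface Table {}
>> >>
>> >>     interface Transformer {
>> >>         // semantic 1: still a Data -> Data conversion, now MIMO
>> >>         Table[] transform(Table... inputs);
>> >>     }
>> >>
>> >>     interface Estimator {
>> >>         // semantic 2: fit() still returns a Transformer for inference
>> >>         Transformer fit(Table... inputs);
>> >>     }
>> >>
>> >> A SISO algorithm is then just the special case where every array has length one, so nothing existing is lost.
>> >>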
>> >> Proposal 1 extends the semantics of Transformer from a Data -> Data conversion to a generic Table -> Table conversion, and claims it is equivalent to the "AlgoOperator" in proposal 2 as a general-purpose graph node. That does change the first semantic. That said, this might just be a naming problem. One possible solution is to have a new subclass of Stage without strong conventional semantics, e.g. "AlgoOp" if we borrow the name from proposal 2, and let Transformer extend it. Just like a PipelineModel is a more specific Transformer, a Transformer would be a more specific "AlgoOp". If we do that, the processing logic that people don't feel comfortable treating as a Transformer can just be put into an "AlgoOp" and thus can still be added to a Pipeline / Graph. This borrows the advantage of proposal 2. In other words, this essentially makes the "AlgoOp" equivalent to the "AlgoOperator" in proposal 2, but allows it to be added to a Graph or Pipeline if people want to.
>> >>
>> >> This also captures my thoughts on the concern that making Transformer/Estimator MIMO would break the convention of single-input single-output (SISO) Transformer/Estimator. Since this does not change the core semantics of Transformer/Estimator, it sounds like an intuitive extension to me.
>> >>
>> >> 2. Another semantics-related case brought up was heterogeneous topologies in training and inference. In that case, the input of an Estimator would be different from the input of the Transformer returned by Estimator.fit(). An example of this case is Word2Vec, where the input of the Estimator is an article while the input to the Transformer is a single word. The well-recognized ML Pipeline doesn't seem to support this case, because it assumes the inputs of the Estimator and the corresponding Transformer are the same.
>> >>
>> >> Both proposal 1 and proposal 2 leave this case unsupported in the Pipeline. To support it:
>> >>   - Proposal 1 adds support for such cases in Graph/GraphModel by introducing "EstimatorInput" and "TransformerInput". The downside is that it complicates the API.
>> >>   - Proposal 2 leaves it to users to construct two different DAGs for training and inference respectively. This means users would have to construct the DAG twice even if most parts of the DAG are the same in training and inference.
>> >>
>> >> My gut feeling is that this is not a critical difference, because such a heterogeneous topology is sort of a corner case. Most users do not need to worry about it. For those who do, either proposal 1 or proposal 2 seems acceptable to me. That said, it looks like with proposal 1, users can still choose to write the program twice without using the Graph API, just like what they do in proposal 2. So technically speaking, proposal 1 is more flexible and allows users to choose either flavor. On the other hand, one could argue that proposal 1 may confuse users with these two flavors. Although personally it seems clear to me, I am open to other ideas.
>> >>
>> >> 3. Lastly, there was a concern about proposal 1: some Estimators can no longer be added to the Pipeline, while the original Pipeline accepts any Estimator.
>> >> It seems that users always have to make sure the input schema required by the Estimator matches the input table. So even with the existing Pipeline, people cannot naively add any Estimator into a pipeline. Admittedly, proposal 1 adds some more requirements, namely 1) the number of inputs needs to match the number of outputs of the previous stage, and 2) the Estimator must not generate a transformer with a different required input schema (the heterogeneous case mentioned above). However, given that these mismatches will result in exceptions at compile time, just like when users put in an Estimator with a mismatched input schema, personally I find it does not change the user experience much.
>> >>
>> >> So, to summarize my thoughts on this fundamental difference:
>> >>   - In general, I personally prefer having one set of API.
>> >>   - The current proposal 1 may need some improvements in a few cases, by borrowing ideas from proposal 2.
>> >>
>> >> A few other differences that I consider non-fundamental:
>> >>
>> >> * Do we need a top-level encapsulation API for an algorithm? *
>> >>
>> >> Proposal 1 has the concept of a Graph, which encapsulates an entire algorithm to provide a unified API following the same semantics as Estimator/Transformer. Users can choose not to package everything into a Graph, and just write their own program and wrap it as an ordinary function.
>> >>
>> >> Proposal 2 does not have a top-level API such as Graph. Instead, users can choose to write an arbitrary function if they want to.
>> >>
>> >> From what I understand, in proposal 1, users may still choose to ignore the Graph API and simply construct a DAG by themselves by calling transform() and fit(), or calling AlgoOp.process() if we add "AlgoOp" to proposal 1 as I suggested earlier. So Graph is just an additional way to construct a graph; people can use Graph in a similar way as they do Pipeline/PipelineModel. In other words, there is no conflict between proposal 1 and proposal 2.
>> >>
>> >> * The ways to describe a Graph? *
>> >>
>> >> Proposal 1 gives two ways to construct a DAG:
>> >>   1. the raw API using Estimator/Transformer (potentially "AlgoOp" as well);
>> >>   2. the GraphBuilder API.
>> >>
>> >> Proposal 2 only gives the raw API of AlgoOperator. It assumes there is a main output and some other side outputs, so it can call algoOp1.linkFrom(algoOp2) without specifying the index of the output, at the cost of wrapping all the Tables into an AlgoOperator.
>> >>
>> >> The usability argument was mostly around the raw APIs. I don't think the two APIs differ much from each other; with the same assumption, proposal 1 and proposal 2 can probably achieve very similar levels of usability when describing a Graph, if not exactly the same (the sketch at the end of this email tries to make this concrete).
>> >>
>> >> There are some other differences/arguments between the two proposals. However, I don't think they are fundamental, and just like the cases mentioned above, the two proposals can easily learn from each other.
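>> >>
>> >> To close with something concrete, here is a hedged sketch of the raw-API comparison from the "ways to describe a Graph" section above (all names are illustrative, not the actual FLIP APIs; linkFrom follows proposal 2's description):
>> >>
>> >>     // Hedged sketch of the two "raw" DAG-building flavors; every name
>> >>     // here is illustrative rather than a real FLIP-173/176 API.
>> >>     interface Table {}
>> >>     interface Transformer { Table[] transform(Table... inputs); }
>> >>     interface Estimator { Transformer fit(Table... inputs); }
>> >>
>> >>     interface AlgoOperator {
>> >>         // link this operator to the main output of upstream operators,
>> >>         // so no output index needs to be specified
>> >>         AlgoOperator linkFrom(AlgoOperator... upstream);
>> >>     }
>> >>
>> >>     class Dags {
>> >>         // proposal 1 flavor: the DAG is implied by passing Tables around
>> >>         static Table[] rawPipelineFlavor(Estimator est, Table train, Table input) {
>> >>             Transformer model = est.fit(train);
>> >>             return model.transform(input);
>> >>         }
>> >>
>> >>         // proposal 2 flavor: the DAG is explicit links between operators
>> >>         static AlgoOperator rawLinkFlavor(AlgoOperator source, AlgoOperator op) {
>> >>             return op.linkFrom(source);
>> >>         }
>> >>     }
>> >>
>> >> Written out this way, the two flavors read almost the same, which is why I don't see a big usability gap between the raw APIs.
>> >>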
>> >> Thanks,
>> >>
>> >> Jiangjie (Becket) Qin
>> >>
>> >> On Thu, Jul 1, 2021 at 7:29 PM Dong Lin <lindon...@gmail.com> wrote:
>> >>
>> >>> Hi all,
>> >>>
>> >>> Zhipeng, Fan (cc'ed) and I are opening this thread to discuss two different designs to extend the Flink ML API to support more use cases, e.g. expressing a DAG of preprocessing and training logic. These two designs have been documented in FLIP-173 <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=184615783>.
>> >>>
>> >>> We have different opinions on the usability and ease of understanding of the proposed APIs. It would be really useful to have comments on those designs from the open source community and to learn your preferences.
>> >>>
>> >>> To facilitate the discussion, we have summarized our design principles and opinions in this Google doc <https://docs.google.com/document/d/1L3aI9LjkcUPoM52liEY6uFktMnFMNFQ6kXAjnz_11do>. Code snippets for a few example use cases are also provided in the doc to demonstrate the difference between the two solutions.
>> >>>
>> >>> This Flink ML API is super important to the future of the Flink ML library. Please feel free to reply to this email thread or comment in the Google doc directly.
>> >>>
>> >>> Thank you!
>> >>> Dong, Zhipeng, Fan

--
best,
Zhipeng