It has been a while and I think we can move forward to a JIRA discussion. I will try to split the design into smaller pieces to make it more understandable.
Actually, I have already implemented an initial version and ported some flink.ml algorithms using this new API. Thus, we can have a better base for design discussion.

Thanks
Weihua

Chen Qin <qinnc...@gmail.com> wrote on Wed, Nov 21, 2018 at 1:36 PM:

> Hi Yun,
>
> Very excited to see Flink ML moving forward! Your document touches on
> many points. I couldn't agree more on the value that a (unified) Table
> API could bring to the Flink ecosystem for running ML workloads. Most ML
> pipelines we have observed start from single-box Python scripts or ad hoc
> tools that researchers run to train a model on a powerful machine. When
> that proves successful, they need to hook up with the data warehouse and
> extract features (this is where SQL kicks in). In the training phase, the
> landscape is very segmented. Small to medium sized models can be trained
> on the JVM, while large/deep models need to optimize operators and
> per-iteration random data shuffling (SGD based DL), and often end up in
> JNI/C++/CUDA and task scheduling (gang scheduled instead of hacking
> around map-reduce).
>
> Hope it makes sense. BTW, xgboost (most popular ML competition framework)
> has very primitive Flink support; it might be worth checking out.
> https://github.com/dmlc/xgboost
>
> Chen
>
> On Tue, Nov 20, 2018 at 6:13 PM Weihua Jiang <weihua.ji...@gmail.com>
> wrote:
>
> > Hi Yun,
> >
> > Can't wait to see your design.
> >
> > Thanks
> > Weihua
> >
> > Yun Gao <yungao...@aliyun.com.invalid> wrote on Wed, Nov 21, 2018 at
> > 12:43 AM:
> >
> > > Hi Weihua,
> > >
> > > Thanks for the exciting proposal!
> > >
> > > I have quickly read through it, and I really appreciate the idea of
> > > providing an ML Pipeline API similar to the commonly used library
> > > scikit-learn, since it greatly reduces the learning cost for AI
> > > engineers moving to the Flink platform.
> > >
> > > Currently we are also working on a related issue, namely enhancing
> > > the stream iteration of Flink to support both SGD and online
> > > learning; it also supports batch training as a special case. We have
> > > a rough design and will start a new discussion in the next few days.
> > > I think the enhanced stream iteration will help to implement
> > > Estimators directly in Flink, and it may help to simplify the online
> > > learning pipeline by eliminating the requirement to load the models
> > > from external file systems.
> > >
> > > I will read the design doc more carefully. Thanks again for sharing
> > > the design doc!
> > >
> > > Yours sincerely
> > > Yun Gao
> > >
> > > ------------------------------------------------------------------
> > > From: Weihua Jiang <weihua.ji...@gmail.com>
> > > Sent: Tuesday, November 20, 2018, 20:53
> > > To: dev <dev@flink.apache.org>
> > > Subject: [DISCUSS] Embracing Table API in Flink ML
> > >
> > > ML Pipeline is the idea brought by Scikit-learn
> > > <https://arxiv.org/abs/1309.0238>. Both Spark and Flink have borrowed
> > > this idea and made their own implementations [Spark ML Pipeline
> > > <https://spark.apache.org/docs/latest/ml-pipeline.html>, Flink ML
> > > Pipeline
> > > <https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/libs/ml/pipelines.html>].
> > >
> > > NOTE: though I am using the term "ML", ML Pipeline shall apply to
> > > both ML and DL pipelines.
> > >
> > > ML Pipeline is quite helpful for model composition (i.e. using
> > > model(s) for feature engineering). It also enables logic reuse across
> > > the training and inference phases (via pipeline persistence and
> > > load), which is essential for AI engineering. ML Pipeline can also be
> > > a good base for a Flink based AI engineering platform if we can give
> > > ML Pipeline good tooling support (i.e. human-readable metadata).
> > >
> > > As the Table API will be the unified high level API for both stream
> > > and batch processing, I want to initiate the design discussion of a
> > > new Table based Flink ML Pipeline.
> > >
> > > I drafted a design document [1] for this discussion. This design
> > > tries to create a new ML Pipeline implementation so that concrete
> > > ML/DL algorithms can fit into this new API to achieve
> > > interoperability.
> > >
> > > Any feedback is highly appreciated.
> > >
> > > Thanks
> > >
> > > Weihua
> > >
> > > [1]
> > > https://docs.google.com/document/d/1PLddLEMP_wn4xHwi6069f3vZL7LzkaP0MN9nAB63X90/edit?usp=sharing