c4emmmm commented on a change in pull request #8402: [FLINK-12473][ml] Add the interface of ML pipeline and ML lib URL: https://github.com/apache/flink/pull/8402#discussion_r286024108
########## File path: flink-ml/flink-ml-api/src/main/java/org/apache/flink/ml/api/core/PipelineStage.java ##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.api.core;
+
+import org.apache.flink.ml.api.misc.param.ParamInfo;
+import org.apache.flink.ml.api.misc.param.WithParams;
+import org.apache.flink.ml.api.misc.persist.Persistable;
+import org.apache.flink.ml.util.param.ExtractParamInfosUtil;
+import org.apache.flink.ml.util.persist.MLStageFactory;
+
+import com.google.gson.Gson;
+import com.google.gson.JsonObject;
+
+import java.io.Serializable;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Base class for a stage in a pipeline. The interface is only a concept, and does not have any
+ * actual functionality. Its subclasses must be either Estimator or Transformer. No other classes
+ * should inherit this interface directly.
+ *
+ * <p>Each pipeline stage has parameters and is persistable.
+ *
+ * @param <T> The class type of the PipelineStage implementation itself, used by {@link
+ *            org.apache.flink.ml.api.misc.param.WithParams}
+ * @see WithParams
+ */
+interface PipelineStage<T extends PipelineStage<T>> extends WithParams<T>, Serializable,
+	Persistable {
+
+	default String toJson() {

Review comment:
@blublinsky Thanks for your feedback. You raised a few good points and suggestions on persistence and model serving. Here are my thoughts. First of all, I'd like to highlight that the current persistence system is mainly used inside Flink; export will be provided in the future as a utility interface.

--------------

I believe the persistence system should use a reversible format that can fully express a PipelineStage (including a Pipeline) and can be used to restore the stage, whatever the stage is. A saved model can be used for serving or as the initial model of another estimator. A saved estimator may be used in tuning, where users only want to modify some params and rerun the same estimator. Some third-party systems may also consume the persistence result. The jpmml-sparkml you mentioned reads a saved PipelineModel and converts it to PMML; the saved PipelineModel consists of saved stage files, and how these files are written is defined entirely by the stage class itself. Exporting a pipeline as PMML and saving a pipeline are therefore independent in this case.

In my view, persistence is neither export nor checkpointing. It is more like a JobGraph or SQL, which describes the procedure of the pipeline. We also tried to find a SQL-like way to describe the pipeline, but it is hard to design a new SQL-like language that everyone understands and accepts, so JSON was finally chosen. It is true that the JSON may become too complex to read if everything in a large pipeline is put into it. In fact, we intend the JSON to contain only the params set by users; large models should be saved separately, with only their metadata in the JSON.
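To make this concrete, here is a hypothetical illustration of the idea of "params only in the JSON, model data elsewhere". All field names (`stageClassName`, `params`, `modelFilePath`) are assumptions for illustration, not the actual persistence format of this PR:

```json
{
  "stages": [
    {
      "stageClassName": "org.example.MyEstimator",
      "params": {
        "learningRate": 0.1,
        "maxIter": 100,
        "modelFilePath": "hdfs:///models/my-model"
      }
    }
  ]
}
```

Restoring the pipeline would then mean instantiating each stage class and re-applying its params, while the large model binary is loaded from the referenced path.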
Since where to save the model is also one of the required configurations of estimators, usually used to create a Sink, it is quite natural to put it in the params. Of course, utility interfaces and methods like "WithExternalModelFile" will be provided in the future to make implementations easier. The persisted model JSON can be used to restore the pipeline and serve it in Flink. All of this should be really easy for a Flink user. If one wants to use the trained model in an external serving system that supports only standard model formats like PMML, the model should be "exported" as PMML before use, and that is not the job of persistence. Conversely, we can implement serving transformers such as PMMLServingTransformer to serve external standard models. I hope the above explains "persistence". Next I will try to explain "export", which is not initially included in this PR.

---------------

I really appreciate your 3rd point. Initially, we planned to provide "Exportable" interfaces like PMMLExportable; a Model could extend one and provide an exportPMML() method to users. Since it is hard to give a unified implementation for all models, each model would implement the interface itself. However, this design requires the model to implement the interface, so it would be hard to add export support to a model without modifying its code, and even worse if multiple export formats are to be supported.

Your 3rd suggestion is a perfect solution to this issue: we can implement an exporter utility that is completely independent from the model. We may first introduce a basic Exporter&lt;M extends Model&gt; interface, which library developers, or anyone who wants to extend a library, can implement. Users then create an exporter instance and call exporter.export(model) as needed. This should be more flexible and also easy to use. We can add the "exporter" interface in follow-up PRs after the API and some initial estimators/transformers are merged.
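The model-independent exporter idea could be sketched as below. Everything here (`Model`, `Exporter`, the toy linear model and PMML-ish output) is an assumption for illustration, not the final Flink API:

```java
import java.io.Serializable;

// Hypothetical marker for a trained model; a stand-in for the real Model interface.
interface Model extends Serializable {}

// Hypothetical exporter utility, fully decoupled from the model class:
// new export formats can be added without touching the model's code.
interface Exporter<M extends Model, R> {
    // Converts a trained model into an external representation, e.g. a PMML string.
    R export(M model);
}

// A toy model with two learned values, just to have something to export.
class LinearModel implements Model {
    final double weight;
    final double bias;

    LinearModel(double weight, double bias) {
        this.weight = weight;
        this.bias = bias;
    }
}

// A sketch exporter; a real one would emit a complete, schema-valid PMML document.
class ToyPmmlExporter implements Exporter<LinearModel, String> {
    @Override
    public String export(LinearModel model) {
        return "<PMML><RegressionModel weight=\"" + model.weight
                + "\" bias=\"" + model.bias + "\"/></PMML>";
    }
}

public class ExporterDemo {
    public static void main(String[] args) {
        // Usage mirrors the proposal: create an exporter, then call export(model).
        Exporter<LinearModel, String> exporter = new ToyPmmlExporter();
        System.out.println(exporter.export(new LinearModel(2.0, 0.5)));
    }
}
```

Because the exporter is a separate object, supporting a second format (say, ONNX) is just another `Exporter` implementation rather than another method forced onto every model class.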
I would really appreciate it if you could join the discussion when we design it at that time.

-----------------

Thanks again for your feedback. Please let me know if I have addressed all your comments.

----------------------------------------------------------------

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services
