c4emmmm commented on a change in pull request #8402: [FLINK-12473][ml] Add the interface of ML pipeline and ML lib URL: https://github.com/apache/flink/pull/8402#discussion_r286024108
########## File path: flink-ml/flink-ml-api/src/main/java/org/apache/flink/ml/api/core/PipelineStage.java ##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.api.core;
+
+import org.apache.flink.ml.api.misc.param.ParamInfo;
+import org.apache.flink.ml.api.misc.param.WithParams;
+import org.apache.flink.ml.api.misc.persist.Persistable;
+import org.apache.flink.ml.util.param.ExtractParamInfosUtil;
+import org.apache.flink.ml.util.persist.MLStageFactory;
+
+import com.google.gson.Gson;
+import com.google.gson.JsonObject;
+
+import java.io.Serializable;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Base class for a stage in a pipeline. The interface is only a concept, and does not have any
+ * actual functionality. Its subclasses must be either Estimator or Transformer. No other classes
+ * should inherit this interface directly.
+ *
+ * <p>Each pipeline stage has parameters and is persistable.
+ *
+ * @param <T> The class type of the PipelineStage implementation itself, used by {@link
+ *            org.apache.flink.ml.api.misc.param.WithParams}
+ * @see WithParams
+ */
+interface PipelineStage<T extends PipelineStage<T>> extends WithParams<T>, Serializable,
+	Persistable {
+
+	default String toJson() {

Review comment:
@blublinsky Thanks for your feedback. You raised a few good points and suggestions on persistence and model serving. Here are my thoughts. First of all, I'd like to highlight that the current persistence system is mainly used inside Flink; export will be provided in the future as a utility interface.

--------------

I believe the persistence system should use a reversible format that can fully express a PipelineStage (including a Pipeline) and can be used to restore the stage, whatever the stage is. A saved model can be used for serving or as the initial model of another estimator. A saved estimator may be used in tuning, where users only want to modify some params and rerun the same estimator. Some third-party systems may also consume the persistence result. The jpmml-sparkml you mentioned reads a saved PipelineModel and converts it to PMML; the saved PipelineModel consists of saved stage files, and how these files are written is defined entirely by the stage class itself. Exporting a pipeline as PMML and saving a pipeline are therefore independent in this case.

In my view, persistence is neither export nor checkpointing. It is more like a JobGraph or SQL, which describes the procedure of the pipeline. We also tried to find a SQL-like way to describe the pipeline, but it is hard to design a new SQL-like language that everyone understands and accepts, so JSON was finally chosen. It is true that the JSON may become too complex to read if everything in a large pipeline is put into it. In fact, we intend the JSON to contain only the params set by users; large models should be saved separately, with only their metadata in the JSON.
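To make this concrete, here is a hypothetical illustration of the idea of "params only in the JSON, model data elsewhere". All field names (`stageClassName`, `params`, `modelFilePath`) are assumptions for illustration, not the actual persistence format of this PR:

```json
{
  "stages": [
    {
      "stageClassName": "org.example.MyEstimator",
      "params": {
        "learningRate": 0.1,
        "maxIter": 100,
        "modelFilePath": "hdfs:///models/my-model"
      }
    }
  ]
}
```

Restoring the pipeline would then mean instantiating each stage class and re-applying its params, while the large model binary is loaded from the referenced path.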
Since where to save the model is also one of the required configurations of estimators, usually used to create a Sink, it is quite natural to put it in the params. Of course, utility interfaces and methods like "WithExternalModelFile" will be provided in the future to make implementations easier. The persisted model JSON can be used to restore the pipeline and serve it in Flink. All of this should be really easy for a Flink user. If one wants to use the trained model in an external serving system that supports only standard model formats like PMML, the model should be "exported" as PMML before use, and that is not the job of persistence. Conversely, we can implement serving transformers such as PMMLServingTransformer to serve external standard models. I hope the above explains "persistence". Next I will try to explain "export", which is not initially included in this PR.

---------------

I really appreciate your 3rd point. Initially, we planned to provide "Exportable" interfaces like PMMLExportable; a Model could extend one and provide an exportPMML() method to users. Since it is hard to give a unified implementation for all models, each model would implement the interface itself. However, this design requires the model to implement the interface, so it would be hard to add export support to a model without modifying its code, and even worse if multiple export formats are to be supported.

Your 3rd suggestion is a perfect solution to this issue: we can implement an exporter utility that is completely independent from the model. We may first introduce a basic Exporter&lt;M extends Model&gt; interface, which library developers, or anyone who wants to extend a library, can implement. Users then create an exporter instance and call exporter.export(model) as needed. This should be more flexible and also easy to use. We can add the "exporter" interface in follow-up PRs after the API and some initial estimators/transformers are merged.
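The model-independent exporter idea could be sketched as below. Everything here (`Model`, `Exporter`, the toy linear model and PMML-ish output) is an assumption for illustration, not the final Flink API:

```java
import java.io.Serializable;

// Hypothetical marker for a trained model; a stand-in for the real Model interface.
interface Model extends Serializable {}

// Hypothetical exporter utility, fully decoupled from the model class:
// new export formats can be added without touching the model's code.
interface Exporter<M extends Model, R> {
    // Converts a trained model into an external representation, e.g. a PMML string.
    R export(M model);
}

// A toy model with two learned values, just to have something to export.
class LinearModel implements Model {
    final double weight;
    final double bias;

    LinearModel(double weight, double bias) {
        this.weight = weight;
        this.bias = bias;
    }
}

// A sketch exporter; a real one would emit a complete, schema-valid PMML document.
class ToyPmmlExporter implements Exporter<LinearModel, String> {
    @Override
    public String export(LinearModel model) {
        return "<PMML><RegressionModel weight=\"" + model.weight
                + "\" bias=\"" + model.bias + "\"/></PMML>";
    }
}

public class ExporterDemo {
    public static void main(String[] args) {
        // Usage mirrors the proposal: create an exporter, then call export(model).
        Exporter<LinearModel, String> exporter = new ToyPmmlExporter();
        System.out.println(exporter.export(new LinearModel(2.0, 0.5)));
    }
}
```

Because the exporter is a separate object, supporting a second format (say, ONNX) is just another `Exporter` implementation rather than another method forced onto every model class.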
I would really appreciate it if you could join the discussion when we design it at that time.

-----------------

Thanks again for your feedback. Please let me know if I have addressed all your comments.

----------------------------------------------------------------

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services
