This has been discussed numerous times. I suppose no one thinks that supporting only pickle is great, but a custom dict format would be unmaintainable. The best we've got, AFAIK, is a tool that converts one-way to PMML, which is portable to production environments (and, judging by <https://github.com/jpmml/jpmml-sklearn/graphs/contributors>, it looks like it's getting better all the time). See https://github.com/jpmml/sklearn2pmml (Python interface) and https://github.com/jpmml/jpmml-sklearn (command-line interface and guts of the thing).
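To make the custom-dict idea concrete, here is a minimal sketch of the to_dict()/from_dict() round trip proposed in the quoted message below. Everything in it is hypothetical: DictSerializableMixin, _fields and ToyScaler are made-up names for illustration, and ToyScaler is a plain-Python stand-in, not scikit-learn's RobustScaler or anything in BaseEstimator.

```python
import json


class DictSerializableMixin:
    """Hypothetical mixin: round-trips the attributes named in _fields."""

    _fields = ()  # each subclass lists the attributes it wants persisted

    def to_dict(self):
        return {name: getattr(self, name) for name in self._fields}

    @classmethod
    def from_dict(cls, state):
        obj = cls.__new__(cls)  # bypass __init__ and restore state directly
        for name in cls._fields:
            setattr(obj, name, state[name])  # KeyError if a field is missing
        return obj


class ToyScaler(DictSerializableMixin):
    """Stand-in for a fitted scaler (assumption: not scikit-learn's code)."""

    _fields = ("center_", "scale_")

    def fit(self, values):
        ordered = sorted(values)
        self.center_ = ordered[len(ordered) // 2]  # upper median
        # Use the full range as the scale; fall back to 1.0 for constant input.
        self.scale_ = float(ordered[-1] - ordered[0]) or 1.0
        return self

    def transform(self, values):
        return [(v - self.center_) / self.scale_ for v in values]


scaler = ToyScaler().fit([1.0, 2.0, 3.0, 4.0, 10.0])
payload = json.dumps(scaler.to_dict())  # plain JSON, no pickle involved
restored = ToyScaler.from_dict(json.loads(payload))
```

The round trip itself is easy. The catch is that _fields has to be kept in sync by hand for every estimator and across releases: rename or add one fitted attribute and from_dict() fails with a KeyError on payloads written by an older version.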
I hope that helps; and thanks to Villu Ruusmann: that list of supported estimators is awesome!

PS: please write to the new list at [email protected]

On 14 July 2016 at 17:24, Miroslav Zoričák <[email protected]> wrote:

> Hi everybody,
>
> I have been using scikit-learn for a while, but I have run into a problem
> that does not seem to have any good solutions.
>
> Basically I would like to:
> - build my pipeline in a Jupyter Notebook
> - persist it (to json or hdf5)
> - load it in production and execute the prediction there
>
> The problem is that for persisting estimators such as the RobustScaler for
> example, the recommended way is to pickle them. Now I don't want to do
> this, for three reasons:
>
> - Security, pickle is potentially dangerous
> - Portability, I can't unpickle it in scala for example
> - Pickle stores a lot of details and information which is not strictly
> necessary to reconstruct the RobustScaler and therefore might prevent it
> from being reconstructed correctly if a different version is used.
>
> Another option I would seem to have is to access the private members of
> each serialiser that I want to use and store them on my own, but this is
> inconvenient, because:
>
> - It forces me as a user to understand how the robust scaler works and how
> it stores its internal state, which is generally bad for usability
> - The internal implementation could change, leaving me to fix my
> serialisers (see #1)
> - I would need to do this for each new Estimator I decide to use
>
> Now, to me it seems the solution is quite obvious:
> Write a Mixin or update the BaseEstimator class to include two additional
> methods:
>
> to_dict() - will return a dictionary such, that when passed to
> from_dict(dictionary) - it will reconstruct the original object
>
> these dictionaries could be passed to the JSON module or the YAML module
> or stored elsewhere. We could provide more convenience methods to do this
> for the user.
>
> In case of the RobustScaler the dict would look something like:
> { "center": "0,0", "scale": "1.0"}
>
> Now the bulk of the work is writing these serialisers and deserialisers
> for all of the estimators, but that can be simplified by adding a method
> that could do that automatically via reflection and the estimator would
> only need to specify which fields to serialise.
>
> I am happy to start working on this and create a pull request on Github,
> but before I do that I wanted to get some initial thoughts and reactions
> from the community, so please let me know what you think.
>
> Best Regards,
> Miroslav Zoricak
