The ONNX approach sounds most promising, especially because it would also allow library interoperability, but I wonder whether it covers only parametric models and not nonparametric ones such as KNN, tree-based classifiers, etc.
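For illustration, a minimal sketch of such an export, assuming the skl2onnx converter package (tree ensembles at least have dedicated ONNX-ML operators, so the format is not limited to parametric models; the input name and output path below are arbitrary):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from skl2onnx import convert_sklearn
    from skl2onnx.common.data_types import FloatTensorType

    X, y = load_iris(return_X_y=True)
    clf = RandomForestClassifier(n_estimators=10).fit(X, y)

    # Declare the input signature; the batch dimension is left open (None).
    onnx_model = convert_sklearn(
        clf, initial_types=[("input", FloatTensorType([None, X.shape[1]]))]
    )

    with open("rf_iris.onnx", "wb") as f:
        f.write(onnx_model.SerializeToString())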
All in all, I can definitely see the appeal of having a way to export sklearn estimators in a text-based format (e.g., via JSON), since it would make sharing code easier. This doesn't even have to be compatible across multiple sklearn versions. A typical use case would be to include these JSON exports as, e.g., supplemental files of a research paper for other people to run the models (here, one can just specify which sklearn version is required; of course, one could also share pickle files, but I am personally always hesitant about running/trusting other people's pickle files). Unfortunately, as Gael pointed out, this "feature" would be a huge burden for the devs, and it would probably also negatively impact the development of scikit-learn itself because it imposes another design constraint. However, I do think this sounds like an excellent case for a contrib project, something like scikit-export or scikit-serialize.

Best,
Sebastian

> On Oct 3, 2018, at 5:49 AM, Javier López <jlo...@ende.cc> wrote:
>
> On Tue, Oct 2, 2018 at 5:07 PM Gael Varoquaux <gael.varoqu...@normalesup.org> wrote:
>> The reason that pickles are brittle and that sharing pickles is a bad practice is that pickle uses an implicitly defined data model, which is defined via the internals of objects.
>
> Plus the fact that loading a pickle can execute arbitrary code, and there is no way to know in advance whether any malicious code is in there, because the contents of the pickle cannot be easily inspected without loading/executing it.
>
>> So, the problems of pickle are not specific to pickle, but rather intrinsic to any generic persistence code [*]. Writing persistence code that does not fall into these problems is very costly in terms of developer time and makes it harder to add new methods or improve existing ones. I am not excited about it.
>
> My "text-based serialization" suggestion was nowhere near as ambitious as that, as I have already explained, and wasn't aiming at solving the versioning issues, but rather at having something which is "about as good" as pickle but in a human-readable format. I am not asking for a Turing-complete language to reproduce the prediction function, but rather something simple in the spirit of the output produced by the gist code I linked above, just for the model families where it is reasonable:
>
> https://gist.github.com/jlopezpena/2cdd09c56afda5964990d5cf278bfd31
>
> The code I posted mostly works (specific cases of nested models need to be addressed separately, as well as pipelines), and we have been using (a version of) it in production for quite some time. But there are hackish aspects to it that we are not happy with, such as the manual separation of init and fitted parameters by checking whether the name ends with "_", having to infer class name and location using "model.__class__.__name__" and "model.__module__", and the wacky use of "__import__".
> My suggestion was more along the lines of adding some metadata to sklearn estimators so that code in a similar style would be nicer to write; little things like having `init_parameters` and `fit_parameters` properties that would return the lists of named parameters, or a `model_info` method that would return data like the sklearn version, class name and location, or a package-level dictionary pointing at the estimator classes by a string name, like
>
> from sklearn.linear_model import LogisticRegression
> estimator_classes = {"LogisticRegression": LogisticRegression, ...}
>
> so that one can load the appropriate class from the string description without calling __import__ or eval; that sort of stuff.
>
> I am aware this would not address the common complaint about "perfect prediction reproducibility" across versions, but I think we can all agree that this utopia of perfect reproducibility is not feasible.
>
> And in the long, long run, I agree that PFA/ONNX or whichever similar format emerges is the way to go.
>
> J
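For concreteness, a minimal sketch of the gist-style export described above, separating init parameters (via get_params) from fitted attributes (via the trailing-underscore convention); the function name export_estimator and the JSON layout are illustrative, not an existing API:

    import json
    import numpy as np

    def export_estimator(model):
        # Fitted attributes follow the sklearn convention of a trailing underscore.
        fitted = {
            name: value.tolist() if isinstance(value, np.ndarray) else value
            for name, value in vars(model).items()
            if name.endswith("_") and not name.startswith("_")
        }
        spec = {
            "class": model.__class__.__name__,   # inferred, as in the gist
            "module": model.__module__,
            "init_params": model.get_params(deep=False),
            "fitted_params": fitted,
        }
        return json.dumps(spec, default=str)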
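And a sketch of the proposed registry idea, rebuilding an (unfitted) estimator from such a description without calling __import__ or eval; the estimator_classes dictionary and load_estimator helper are hypothetical, not part of sklearn:

    import json
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    # A plain package-level dictionary mapping string names to estimator classes.
    estimator_classes = {
        "LogisticRegression": LogisticRegression,
        "RandomForestClassifier": RandomForestClassifier,
        # ... one entry per supported estimator
    }

    def load_estimator(serialized):
        # Rebuild an unfitted estimator from the JSON produced by export_estimator above.
        spec = json.loads(serialized)
        cls = estimator_classes[spec["class"]]
        return cls(**spec["init_params"])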