The ONNX approach sounds most promising, especially because it would also 
allow library interoperability, but I wonder whether it works for parametric 
models only and not for nonparametric ones like KNN, tree-based classifiers, 
etc.
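
(For reference, a minimal sketch of what such an export looks like via the 
skl2onnx converter package, assuming it is installed; I haven't verified 
myself how far its coverage of nonparametric models goes:)

import numpy as np
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
from sklearn.linear_model import LogisticRegression

# fit a toy model so there is something to export
X = np.random.rand(20, 4).astype(np.float32)
y = (X[:, 0] > 0.5).astype(int)
clf = LogisticRegression().fit(X, y)

# declare the input signature; the name "float_input" is arbitrary
initial_type = [("float_input", FloatTensorType([None, 4]))]
onnx_model = convert_sklearn(clf, initial_types=initial_type)

with open("logreg.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())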

All in all, I can definitely see the appeal of having a way to export sklearn 
estimators in a text-based format (e.g., via JSON), since it would make 
sharing models easier. This wouldn't even have to be compatible across 
multiple sklearn versions. A typical use case would be to include these JSON 
exports as, e.g., supplemental files of a research paper so that other people 
can run the models (here, one can simply specify which sklearn version is 
required; of course, one could also share pickle files, but I am personally 
always hesitant about running/trusting other people's pickle files).
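
To make this concrete, here is a rough sketch of what such an export could 
look like (the `to_json` helper is hypothetical; it leans on the sklearn 
convention that fitted attributes end in an underscore):

import json
import numpy as np
from sklearn.linear_model import LogisticRegression

def to_json(model, path):
    # fitted attributes follow the trailing-underscore convention
    fitted = {k: v for k, v in vars(model).items() if k.endswith("_")}
    payload = {
        "class_name": model.__class__.__name__,
        "module": model.__module__,
        "init_params": model.get_params(),
        "fitted_params": {k: (v.tolist() if isinstance(v, np.ndarray) else v)
                          for k, v in fitted.items()},
    }
    with open(path, "w") as f:
        json.dump(payload, f)

X = np.random.rand(20, 3)
y = (X[:, 0] > 0.5).astype(int)
to_json(LogisticRegression().fit(X, y), "logreg.json")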

Unfortunately, though, as Gael pointed out, this "feature" would be a huge 
burden for the devs, and it would probably also negatively impact the 
development of scikit-learn itself, since it imposes yet another design 
constraint.

However, I do think this sounds like an excellent case for a contrib project, 
something like scikit-export or scikit-serialize.

Best,
Sebastian



> On Oct 3, 2018, at 5:49 AM, Javier López <jlo...@ende.cc> wrote:
> 
> 
> On Tue, Oct 2, 2018 at 5:07 PM Gael Varoquaux <gael.varoqu...@normalesup.org> 
> wrote:
> The reason that pickles are brittle and that sharing pickles is a bad
> practice is that pickle uses an implicitly defined data model, which is
> specified via the internals of objects.
> 
> Plus the fact that loading a pickle can execute arbitrary code, and there
> is no way to know in advance whether any malicious code is in there,
> because the contents of a pickle cannot be easily inspected without
> loading/executing it.
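> 
> A toy example of what that means in practice (the "payload" below runs a
> shell command as a side effect of being unpickled, and nothing about the
> raw bytes gives that away beforehand):
> 
> import os
> import pickle
> 
> class Payload:
>     # __reduce__ tells pickle how to "reconstruct" the object on load;
>     # here, reconstruction means calling os.system on an arbitrary command
>     def __reduce__(self):
>         return (os.system, ("echo this could have been anything",))
> 
> blob = pickle.dumps(Payload())
> pickle.loads(blob)  # the command executes just by loading the pickle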
>  
> So, the problems of pickle are not specific to pickle, but rather
> intrinsic to any generic persistence code [*]. Writing persistence code
> that does not fall into these problems is very costly in terms of
> developer time and makes it harder to add new methods or improve existing
> ones. I am not excited about it.
> 
> My "text-based serialization" suggestion was nowhere near as ambitious as 
> that,
> as I have already explained, and wasn't aiming at solving the versioning 
> issues, but
> rather at having something which is "about as good" as pickle but in a 
> human-readable
> format. I am not asking for a Turing-complete language to reproduce the 
> prediction
> function, but rather something simple in the spirit of the output produced by 
> the gist code I linked above, just for the model families where it is 
> reasonable:
> 
> https://gist.github.com/jlopezpena/2cdd09c56afda5964990d5cf278bfd31
> 
> The code I posted mostly works (specific cases of nested models, as well
> as pipelines, need to be addressed separately), and we have been using (a
> version of) it in production for quite some time. But there are hackish
> aspects to it that we are not happy with, such as the manual separation of
> init and fitted parameters by checking whether the name ends with "_",
> having to infer the class name and location using
> "model.__class__.__name__" and "model.__module__", and the wacky use of
> "__import__".
> 
> My suggestion was more along the lines of adding some metadata to sklearn
> estimators so that code in a similar style would be nicer to write; little
> things like having `init_parameters` and `fit_parameters` properties that
> would return the lists of named parameters, or a `model_info` method that
> would return data like the sklearn version, class name, and location, or a
> package-level dictionary pointing at the estimator classes by string name,
> like
> 
> from sklearn.linear_model import LogisticRegression
> estimator_classes = {"LogisticRegression": LogisticRegression, ...}
> 
> so that one can load the appropriate class from the string description 
> without calling __import__ or eval; that sort of stuff.
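> 
> Purely as a sketch of the kind of interface I mean (the names are made up;
> in my suggestion this would live directly on BaseEstimator):
> 
> import sklearn
> from sklearn.linear_model import LogisticRegression
> 
> class ModelInfoMixin:
>     # hypothetical mixin carrying the proposed metadata
> 
>     @property
>     def init_parameters(self):
>         # constructor parameters, as already exposed by get_params()
>         return sorted(self.get_params(deep=False))
> 
>     @property
>     def fit_parameters(self):
>         # attributes learned during fit(), per the trailing-underscore
>         # convention
>         return [name for name in vars(self) if name.endswith("_")]
> 
>     def model_info(self):
>         return {
>             "sklearn_version": sklearn.__version__,
>             "class_name": type(self).__name__,
>             "module": type(self).__module__,
>         }
> 
> class MyLogisticRegression(ModelInfoMixin, LogisticRegression):
>     pass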
> 
> I am aware this would not address the common complaint of "perfect
> prediction reproducibility" across versions, but I think we can all agree
> that this utopia of perfect reproducibility is not feasible.
> 
> And in the long, long run, I agree that PFA/ONNX, or whichever similar
> format emerges, is the way to go.
> 
> J

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
