On 10/03/2018 03:32 PM, Nick Pentreath wrote:
For ONNX you may be interested in
https://github.com/onnx/onnxmltools - which supports conversion of a
few sklearn models to ONNX already.
However, as far as I am aware, none of the ONNX backends actually
support the ONNX-ML extended spec (in open-source at least). So you
would not be able to actually do prediction I think...
Exactly, that's what I'm waiting for. MS is working on it, AFAIK.
As for PFA, to my current knowledge there is no library that does it
yet. Our own Aardpfark project
(https://github.com/CODAIT/aardpfark) focuses on SparkML export to PFA
for now, but we would like to add sklearn support in the future.
On Wed, 3 Oct 2018 at 20:07 Sebastian Raschka
<m...@sebastianraschka.com> wrote:
The ONNX approach sounds most promising, esp. because it will also
allow library interoperability, but I wonder if this is for
parametric models only and not for the nonparametric ones like
KNN, tree-based classifiers, etc.
All-in-all I can definitely see the appeal for having a way to
export sklearn estimators in a text-based format (e.g., via JSON),
since it would make sharing code easier. This doesn't even have to
be compatible with multiple sklearn versions. A typical use case
would be to include these JSON exports as e.g., supplemental files
of a research paper for other people to run the models etc. (here,
one can just specify which sklearn version it would require; of
course, one could also share pickle files, but I am personally
always hesitant regarding running/trusting other people's pickle files).
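To make the idea concrete, here is a minimal sketch of such a JSON export; the `StubEstimator` and `export_json` names are made up for illustration and are not an existing sklearn API:

```python
import json

class StubEstimator:
    """Toy stand-in for a sklearn-style estimator: constructor arguments
    are stored as plain attributes, fitted results get a trailing "_"."""
    def __init__(self, alpha=1.0, fit_intercept=True):
        self.alpha = alpha
        self.fit_intercept = fit_intercept

    def fit(self):
        self.coef_ = [0.5, -1.2]  # pretend these came out of fitting
        return self

def export_json(model):
    """Dump the class name plus all current attributes to a JSON string."""
    return json.dumps({"class": type(model).__name__,
                       "attributes": vars(model)})

blob = export_json(StubEstimator(alpha=0.1).fit())
data = json.loads(blob)
```

A file like this could ship as supplemental material with a paper, alongside a note of the sklearn version used.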
Unfortunately though, as Gael pointed out, this "feature" would be
a huge burden for the devs, and it would probably also negatively
impact the development of scikit-learn itself because it imposes
another design constraint.
However, I do think this sounds like an excellent case for a
contrib project. Like scikit-export, scikit-serialize, or something
like that.
Best,
Sebastian
> On Oct 3, 2018, at 5:49 AM, Javier López <jlo...@ende.cc> wrote:
>
>
> On Tue, Oct 2, 2018 at 5:07 PM Gael Varoquaux
<gael.varoqu...@normalesup.org> wrote:
> The reason that pickles are brittle and that sharing pickles is
a bad
> practice is that pickle uses an implicitly defined data model,
which is
> defined via the internals of objects.
>
> Plus the fact that loading a pickle can execute arbitrary code,
and there is no way to know
> if any malicious code is in there in advance because the
contents of the pickle cannot
> be easily inspected without loading/executing it.
>
> So, the problems of pickle are not specific to pickle, but rather
> intrinsic to any generic persistence code [*]. Writing
persistence code that
> does not fall into these problems is very costly in terms of
developer time
> and makes it harder to add new methods or improve existing ones.
I am not
> excited about it.
>
> My "text-based serialization" suggestion was nowhere near as
ambitious as that,
> as I have already explained, and wasn't aiming at solving the
versioning issues, but
> rather at having something which is "about as good" as pickle
but in a human-readable
> format. I am not asking for a Turing-complete language to
reproduce the prediction
> function, but rather something simple in the spirit of the
output produced by the gist code I linked above, just for the
model families where it is reasonable:
>
> https://gist.github.com/jlopezpena/2cdd09c56afda5964990d5cf278bfd31
>
> The code I posted mostly works (specific cases of nested models
need to be addressed
> separately, as well as pipelines), and we have been using (a
version of) it in production
> for quite some time. But there are hackish aspects to it that we
are not happy with,
> such as the manual separation of init and fitted parameters by
checking if the name ends with "_", having to infer class name and
location using
> "model.__class__.__name__" and "model.__module__", and the wacky
use of "__import__".
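For illustration, the underscore split and the import dance can be written a bit less wackily with `importlib` instead of `__import__`. Everything below is a hypothetical sketch in that style, using `argparse.Namespace` as a stand-in for a fitted estimator rather than a real sklearn model:

```python
import argparse
import importlib

def describe(model):
    """Record import location and parameters, splitting init from fitted
    attributes via the trailing-underscore convention."""
    return {
        "module": type(model).__module__,
        "class": type(model).__name__,
        "init": {k: v for k, v in vars(model).items() if not k.endswith("_")},
        "fitted": {k: v for k, v in vars(model).items() if k.endswith("_")},
    }

def rebuild(desc):
    """Look the class up again via importlib, avoiding __import__ and eval."""
    cls = getattr(importlib.import_module(desc["module"]), desc["class"])
    obj = cls(**desc["init"])
    for name, value in desc["fitted"].items():
        setattr(obj, name, value)
    return obj

# argparse.Namespace stands in for a fitted estimator here
m = argparse.Namespace(alpha=0.1)
m.coef_ = [1.0, 2.0]
clone = rebuild(describe(m))
```

The same shape of code works for any class whose constructor accepts its init parameters as keyword arguments, which is what the sklearn API already guarantees.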
>
> My suggestion was more along the lines of adding some metadata
to sklearn estimators so
> that code in a similar style would be nicer to write; little
things like having `init_parameters` and `fit_parameters`
properties that would return the lists of named parameters,
> or a `model_info` method that would return data like sklearn
version, class name and location, or a package level dictionary
pointing at the estimator classes by a string name, like
>
> from sklearn.linear_model import LogisticRegression
> estimator_classes = {"LogisticRegression": LogisticRegression, ...}
>
> so that one can load the appropriate class from the string
description without calling __import__ or eval; that sort of stuff.
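A registry of that shape is straightforward to sketch; the placeholder classes below just illustrate the lookup and are not the real sklearn estimators:

```python
class LogisticRegression:
    """Placeholder for sklearn.linear_model.LogisticRegression."""
    def __init__(self, C=1.0):
        self.C = C

class Ridge:
    """Placeholder for sklearn.linear_model.Ridge."""
    def __init__(self, alpha=1.0):
        self.alpha = alpha

# Package-level lookup table: string name -> class, no __import__ or eval
estimator_classes = {cls.__name__: cls for cls in (LogisticRegression, Ridge)}

def from_description(name, params):
    """Instantiate an estimator from its string name and init params."""
    return estimator_classes[name](**params)

model = from_description("Ridge", {"alpha": 0.5})
```

With such a dictionary shipped at package level, a loader only ever instantiates classes that the library itself has whitelisted.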
>
> I am aware this would not address the common complaint of
"perfect prediction reproducibility"
> across versions, but I think we can all agree that this utopia
of perfect reproducibility is not
> feasible.
>
> And in the long, long run, I agree that PFA/ONNX, or whichever
similar format emerges, is
> the way to go.
>
> J
> _______________________________________________
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn