On Fri, Sep 28, 2018 at 8:46 PM Andreas Mueller <t3k...@gmail.com> wrote:
> Basically what you're saying is that you're fine with versioning the
> models and having the model break loudly if anything changes.
> That's not actually what most people want. They want to be able to make
> predictions with a given model forever into the future.

Are we talking about "(the new version of) the old model can still make
predictions" or "the old model makes exactly the same predictions as
before"? I'd like the first to hold; I don't care that much about the second.

> Your use-case is similar, but if retraining the model is not an issue,
> why don't you want to retrain every time scikit-learn releases a new
> version?

Thousands of models. I don't want to retrain ALL of them unless needed.

> We're now storing the version of scikit-learn that was used in the
> pickle and warn if you're trying to load with a different version.

This is not the whole truth. Yes, you store the sklearn version in the
pickle and raise a warning; I am mostly OK with that. But the pickles are
brittle, and oftentimes they stop loading when versions of other things
change. I am not talking about a "Warning: wrong version", but rather an
"UnpicklingError: expected bytes, found tuple" that prevents the file from
loading entirely.

> That's basically a stricter test than what you wanted. Yes, there are
> false positives, but given that this release took a year,
> this doesn't seem that big an issue?

1. Things in the current state break when something else changes, not only
   sklearn.
2. Sharing pickles is a bad practice for a number of reasons.
3. We might want to explore model parameters without having to load the
   entire runtime.

Also, in order to retrain a model we need to keep the whole model
description with its parameters. This needs to be saved somewhere, which in
the current state would force us to keep two files: one with the parameters
(in a text format, to avoid the "non-loading" problems above) and the .pkl
with the fitted model.
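To make the two-file situation concrete, here is a minimal sketch of that workflow using only public sklearn API (`get_params` and `sklearn.__version__`); the sidecar layout itself is just an illustration, not anything sklearn provides:

```python
import json
import pickle

import sklearn
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=0.5, max_iter=200)

# File 1 (illustrative layout): a plain-text sidecar with the constructor
# parameters, readable and diffable without unpickling anything.
sidecar = {
    "sklearn_version": sklearn.__version__,
    "estimator": type(model).__name__,
    "params": model.get_params(),
}
params_text = json.dumps(sidecar, indent=2, default=str)

# File 2: the usual brittle pickle with the fitted state.
blob = pickle.dumps(model)

# Later: rebuild an equivalent (unfitted) estimator for retraining from the
# text alone, without ever touching the pickle.
spec = json.loads(params_text)
rebuilt = LogisticRegression(**spec["params"])
assert rebuilt.get_params() == model.get_params()
```

If the pickle stops loading after an upgrade, the sidecar still tells you exactly what to retrain.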
My proposal would keep both in a single file. As mentioned in previous
emails, we already have our own solution that kind-of-works for our needs,
but we have to do a few hackish things to keep it running. If sklearn
estimators simply included a text serialization method (similar in spirit
to the one used for __display__ or __repr__) it would make things easier.

But I understand that not everyone's needs are the same, so if you guys
don't consider this type of thing a priority, we can live with that :) I
mostly mentioned it since "Backwards-compatible de/serialization of some
estimators" is listed in the roadmap as a desirable goal for version 1.0,
and feedback on that roadmap was requested.

J
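For what it's worth, the kind of text serialization method being asked for can be sketched in a few lines on top of `get_params`. The names `to_text` / `from_text` are made up for this example, not sklearn API, and this only captures hyperparameters, not fitted state:

```python
import importlib
import json

from sklearn.base import BaseEstimator
from sklearn.tree import DecisionTreeClassifier


def to_text(est: BaseEstimator) -> str:
    # Hypothetical helper: dump the class path plus constructor params
    # as human-readable JSON (fitted attributes are deliberately excluded).
    return json.dumps(
        {
            "class": f"{type(est).__module__}.{type(est).__qualname__}",
            "params": est.get_params(deep=False),
        },
        indent=2,
        default=str,
    )


def from_text(text: str) -> BaseEstimator:
    # Hypothetical inverse: import the class and rebuild an unfitted
    # estimator with the same hyperparameters.
    spec = json.loads(text)
    module_name, _, cls_name = spec["class"].rpartition(".")
    cls = getattr(importlib.import_module(module_name), cls_name)
    return cls(**spec["params"])


tree = DecisionTreeClassifier(max_depth=3)
clone = from_text(to_text(tree))
assert clone.get_params() == tree.get_params()
```

Shipping something like this inside the estimators themselves, with a versioned schema, is essentially what the proposal amounts to.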
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn