Re: [Scikit-learn-general] Scikit-learn standards for serializing/saving objects

Joel Nothman Wed, 23 Mar 2016 20:32:22 -0700

I think all the scikit-learn devs know that the serialisation available in
scikit-learn is inadequate, and recommend storing training data and model
parameters.
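
A minimal sketch of that recommendation, assuming the hyperparameters returned by get_params() are JSON-serialisable (true for simple estimators such as LinearRegression, not for pipelines or nested estimators). The helper names save_recipe and refit_from_recipe and the file layout are hypothetical, not part of scikit-learn:

```python
import json


def _to_list(a):
    """Convert a NumPy array (or anything with .tolist()) to plain lists."""
    return a.tolist() if hasattr(a, "tolist") else a


def save_recipe(estimator, X, y, path):
    """Store hyperparameters and training data, not the fitted object."""
    recipe = {
        "class": type(estimator).__name__,
        "params": estimator.get_params(),  # assumed JSON-serialisable
        "X": _to_list(X),
        "y": _to_list(y),
    }
    with open(path, "w", encoding="utf-8") as outfile:
        json.dump(recipe, outfile)


def refit_from_recipe(estimator_cls, path):
    """Rebuild the model by refitting on the stored training data."""
    with open(path, "r", encoding="utf-8") as infile:
        recipe = json.load(infile)
    if recipe["class"] != estimator_cls.__name__:
        raise ValueError("recipe was saved for %s" % recipe["class"])
    estimator = estimator_cls(**recipe["params"])
    estimator.fit(recipe["X"], recipe["y"])
    return estimator
```

In a new session, refit_from_recipe(LinearRegression, 'recipe.json') would rebuild the model by refitting rather than by unpickling, so the result is only as reproducible as the estimator's fit is deterministic.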


Designing a serialisation format that is robust to future changes is a huge
engineering effort, and is likely to result in one of: (a) a framework that
has all the power and hence faults of pickling; (b) an implementation that
is limited to only some parameter values on some estimators; or (c) a
specialised, over-engineered monolith that we can't afford to maintain.

One approach mooted time and again is supporting export to a
framework-independent model description language, like PMML. For this see
the work begun at https://github.com/alex-pirozhenko/sklearn-pmml. The
intention here, however, is not especially to re-load the models in
scikit-learn, but to perform prediction with scikit-learn-fitted models in
other frameworks.
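
To illustrate the idea (this is not PMML itself): a hand-rolled JSON description of a fitted linear model that any runtime could evaluate without scikit-learn. The schema and the function names are assumptions made up for this sketch.

```python
import json


def export_linear_model(coef, intercept, feature_names):
    """Describe a fitted linear model as framework-independent JSON."""
    return json.dumps({
        "model_type": "linear_regression",
        "features": list(feature_names),
        "coefficients": [float(c) for c in coef],
        "intercept": float(intercept),
    })


def predict_from_description(description, row):
    """Evaluate the described model on one row, using only the stdlib."""
    spec = json.loads(description)
    if spec["model_type"] != "linear_regression":
        raise ValueError("unsupported model type: %s" % spec["model_type"])
    if len(row) != len(spec["coefficients"]):
        raise ValueError("row length does not match number of coefficients")
    weighted = sum(c * x for c, x in zip(spec["coefficients"], row))
    return spec["intercept"] + weighted
```

With a fitted LinearRegression, the inputs would be regr.coef_, regr.intercept_ and the column names. PMML plays the same role with a standardised XML schema.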

On 24 March 2016 at 13:04, Chris Hausler <chaus...@gmail.com> wrote:

> We also have similar issues. It'd be great to hear any cool solutions :-)
>
> On Thu, 24 Mar 2016 at 12:47 Keith Lehman <kleh...@intercapenergy.com>
> wrote:
>
>> Thanks Sebastian.
>>
>> This is basically what we are doing too. The hard, time-consuming part is
>> determining what attributes of each scikit-learn object need to be saved
>> and how best to extract them.
>>
>> - Keith
>>
>> -----Original Message-----
>> From: Sebastian Raschka [mailto:se.rasc...@gmail.com]
>> Sent: Wednesday, March 23, 2016 4:05 PM
>> To: scikit-learn-general@lists.sourceforge.net
>> Subject: Re: [Scikit-learn-general] Scikit-learn standards for
>> serializing/saving objects
>>
>> I also had some issues with Pickle in the past and have to admit that I
>> actually don't trust pickle files ;). Maybe, I am too paranoid, but I am
>> always afraid of corrupting or losing the data.
>> Probably not the most elegant solution, but I typically store estimator
>> settings and model parameters as JSON files (since they are human readable
>> in the worst case scenario having "reproducible research" in mind ;)).
>>
>>
>> For example:
>>
>>
>> # Model fitting and saving params to JSON
>>
>> from sklearn.linear_model import LinearRegression
>> from sklearn.datasets import load_diabetes
>>
>> diabetes = load_diabetes()
>> X, y = diabetes.data, diabetes.target
>> regr = LinearRegression()
>> regr.fit(X, y)
>>
>> import json
>>
>> with open('./params.json', 'w', encoding='utf-8') as outfile:
>>     json.dump(regr.get_params(), outfile)
>>
>> with open('./weights.json', 'w', encoding='utf-8') as outfile:
>>     json.dump(regr.coef_.tolist(), outfile, separators=(',', ':'),
>>               sort_keys=True, indent=4)
>>
>> with open('./intercept.json', 'w', encoding='utf-8') as outfile:
>>     json.dump(regr.intercept_, outfile)
>>
>>
>> # In a new session: load the params from the JSON files
>>
>>
>> import json
>>
>> import numpy as np
>> from sklearn.datasets import load_diabetes
>> from sklearn.linear_model import LinearRegression
>>
>> diabetes = load_diabetes()
>> X, y = diabetes.data, diabetes.target
>>
>> # Read each JSON file back; the with-blocks close the files afterwards
>> with open('./params.json', 'r', encoding='utf-8') as infile:
>>     params = json.load(infile)
>>
>> with open('./weights.json', 'r', encoding='utf-8') as infile:
>>     weights = json.load(infile)
>>
>> with open('./intercept.json', 'r', encoding='utf-8') as infile:
>>     intercept = json.load(infile)
>>
>> regr = LinearRegression()
>> regr.set_params(**params)
>> regr.intercept_, regr.coef_ = intercept, np.array(weights)
>>
>> regr.predict(X[:10])
>>
>> array([ 206.11706979,   68.07234761,  176.88406035,  166.91796559,
>>         128.45984241,  106.34908972,   73.89417947,  118.85378669,
>>         158.81033076,  213.58408893])
>>
>>
>> In any case, I know that this isn't pretty, and I would also be looking
>> forward to a better solution!
>>
>> Best,
>> Sebastian Raschka
>>
>>
>> > On Mar 23, 2016, at 12:47 PM, Keith Lehman <kleh...@intercapenergy.com> wrote:
>> >
>> > Hi:
>> >
>> > I’m fairly new to scikit-learn, Python, and machine learning. This
>> > community has built a great set of libraries, though, and is actually a
>> > large part of the reason why my company has selected Python to experiment
>> > with ML.
>> >
>> > As we are developing our product, however, we keep running into trouble
>> > saving various objects. When possible, we use pickle to save the objects,
>> > but this can cause problems in development – objects saved during a debug
>> > session cannot be loaded outside of the debugger. The reason appears to be
>> > that even when pickling a “pickleable” object (such as a trained
>> > LinearRegression), pickle finds and saves more primitive objects that have
>> > been instantiated within the debug environment. Dill and cPickle have the
>> > same issue. My question is: does the scikit-learn community plan to add
>> > standard load/save or dump/dumps and load/loads methods that would not
>> > create these dependencies?
>> >
>> > If there is a better forum for posting questions like these, please let
>> > me know and I’ll be happy to post there instead.
>> >
>> > Thanks!
>> >
>> > Keith Lehman
>> > Cell: 617-834-2863
>> > Skype: k.lehman
>> > e-mail: kleh...@intercapenergy.com
>> >
>> > ------------------------------------------------------------------------------
>> > Transform Data into Opportunity.
>> > Accelerate data analysis in your applications with Intel Data
>> > Analytics Acceleration Library.
>> > Click to learn more.
>> > http://pubads.g.doubleclick.net/gampad/clk?id=278785351&iu=/4140
>> > _______________________________________________
>> > Scikit-learn-general mailing list
>> > Scikit-learn-general@lists.sourceforge.net
>> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>
>
