Spark has a project to add PMML export for ML Pipelines.

https://issues.apache.org/jira/browse/SPARK-11171

__________________________________________________________________________________________
Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science and 
Capacity Planning
| 5985 State Bridge Road, Johns Creek, GA 30097 | [email protected]

From: scikit-learn 
[mailto:[email protected]] On Behalf Of 
William Komp
Sent: Thursday, July 14, 2016 10:06 AM
To: Scikit-learn user and developer mailing list
Subject: Re: [scikit-learn] [Scikit-learn-general] Estimator serialisability

Hi,
Interesting conversation. I have captured model parameters in SQL and used SQL 
for scoring in massively parallel setups. You can score billion-record sets in 
seconds. This works really well with logistic regression and other 
function-based models; trees would be a bit more difficult.
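That SQL-scoring approach can be sketched for logistic regression: read the fitted coefficients out of the estimator and emit a SQL expression for the sigmoid. This is only an illustration of the idea; the column and table names below are made up.

```python
# Sketch: turn a fitted LogisticRegression into a SQL scoring expression.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_features=3, n_informative=3, n_redundant=0,
                           random_state=0)
clf = LogisticRegression().fit(X, y)

columns = ["age", "income", "tenure"]  # hypothetical column names
terms = " + ".join(f"{coef:.6f} * {col}"
                   for coef, col in zip(clf.coef_[0], columns))
# 1 / (1 + exp(-(b0 + b1*x1 + ...))) is the logistic model's score
sql = (f"SELECT 1.0 / (1.0 + EXP(-({clf.intercept_[0]:.6f} + {terms}))) "
       f"AS score FROM customers")
print(sql)
```

The generated expression can then run inside whatever parallel SQL engine holds the data, with no Python on the scoring path.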

Has there been any discussion of incorporating PFA (Portable Format for 
Analytics, http://dmg.org/pfa/index.html) into scikit-learn? Bob Grossman is 
the driving force behind it. Here is a link to a deck from a Predictive 
Analytics World talk he gave in Chicago a few months ago.

http://www.slideshare.net/rgrossman/how-to-lower-the-cost-of-deploying-analytics-an-introduction-to-the-portable-format-for-analytics
William

On Thu, Jul 14, 2016 at 8:35 AM, Dale T Smith 
<[email protected]> wrote:
Hello,

I investigated this subject last year, and have tried to keep up, so I can 
perhaps offer some alternatives.


•         The only packages I know of that read PMML in Python are 
proprietary. There are several alternatives for writing PMML, as you can 
easily find.

I also found

https://code.google.com/archive/p/augustus/

and

https://github.com/ctrl-alt-d/lightpmmlpredictor

Depending on your project, sklearn-compiledtrees may be an option.

https://github.com/ajtulloch/sklearn-compiledtrees

Py2PMML (https://support.zementis.com/entries/37092748-Introducing-Py2PMML) is 
by Zementis; it’s a commercial product, meaning you pay for a license.


•         Another option is what we planned to do at an old job of mine – read 
the model characteristics out of the scikit-learn estimator after fit, and 
generate C code ourselves. This is a viable option for decision trees; adapt 
print_decision_trees() from this Stack Overflow answer.

http://stackoverflow.com/questions/20224526/how-to-extract-the-decision-rules-from-scikit-learn-decision-tree
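For illustration, here is a minimal version of that rule-extraction idea, walking the fitted `tree_` arrays directly. This is a sketch in the spirit of the linked answer, not that answer's own code:

```python
# Sketch: recover if/else decision rules from a fitted DecisionTreeClassifier
# by recursing over the tree_ arrays (children, features, thresholds, values).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(
    iris.data, iris.target)

def tree_to_rules(tree, feature_names, node=0, indent=""):
    """Return the tree's decision rules as lines of pseudo-code."""
    t = tree.tree_
    if t.children_left[node] == -1:  # -1 marks a leaf node
        return [f"{indent}return class {t.value[node].argmax()}"]
    name = feature_names[t.feature[node]]
    lines = [f"{indent}if {name} <= {t.threshold[node]:.3f}:"]
    lines += tree_to_rules(tree, feature_names,
                           t.children_left[node], indent + "    ")
    lines += [f"{indent}else:"]
    lines += tree_to_rules(tree, feature_names,
                           t.children_right[node], indent + "    ")
    return lines

print("\n".join(tree_to_rules(clf, iris.feature_names)))
```

Swapping the f-strings for C syntax turns the same walk into a C code generator.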


•         You can also reconsider your use of joblib.dump. I’m aware that it 
has problems, but you can include enough versioning information in the objects 
you dump to check in your code that scikit-learn versions are compatible, etc. 
I know this is a pain in the neck, but it’s a viable alternative to writing 
your own PMML reader, writing a code generator of some kind, or buying a 
license.
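A minimal sketch of that versioning idea, assuming the standalone joblib package (at the time it also shipped as sklearn.externals.joblib):

```python
# Sketch: bundle version metadata with the model so the loader can refuse
# to use a pickle made under an incompatible scikit-learn version.
import os
import tempfile

import joblib
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(random_state=0)
model = LogisticRegression().fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump({"sklearn_version": sklearn.__version__, "model": model}, path)

bundle = joblib.load(path)
if bundle["sklearn_version"] != sklearn.__version__:
    raise RuntimeError(
        f"model was fit with scikit-learn {bundle['sklearn_version']}, "
        f"this environment runs {sklearn.__version__}")
restored = bundle["model"]
```

An exact-match check is the crudest possible policy; a real deployment might compare only major/minor versions, or run a smoke-test prediction after loading.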



From: scikit-learn 
[mailto:[email protected]] On Behalf Of 
Joel Nothman
Sent: Thursday, July 14, 2016 4:18 AM
To: Scikit-learn user and developer mailing list
Subject: Re: [scikit-learn] [Scikit-learn-general] Estimator serialisability

This has been discussed numerous times. I suppose no one thinks supporting 
pickle only is great, but a custom dict is unmaintainable. The best we've got 
AFAIK (and, judging by 
https://github.com/jpmml/jpmml-sklearn/graphs/contributors, it looks like it's 
getting better all the time) is a tool to convert one-way to PMML, which is 
portable to production environments. See https://github.com/jpmml/sklearn2pmml 
(Python interface) and https://github.com/jpmml/jpmml-sklearn (command-line 
interface and guts of the thing).

I hope that helps; and thanks to Villu Ruusmann: that list of supported 
estimators is awesome!

PS: please write to the new list at [email protected]

On 14 July 2016 at 17:24, Miroslav Zoričák 
<[email protected]> wrote:
Hi everybody,

I have been using scikit-learn for a while, but I have run into a problem that 
does not seem to have any good solutions.

Basically I would like to:
- build my pipeline in a Jupyter Notebook
- persist it (to json or hdf5)
- load it in production and execute the prediction there

The problem is that the recommended way to persist estimators such as 
RobustScaler is to pickle them. I don't want to do this, for three reasons:

- Security: pickle is potentially dangerous.
- Portability: I can't unpickle it in Scala, for example.
- Pickle stores a lot of detail that is not strictly necessary to reconstruct 
the RobustScaler, which might prevent it from being reconstructed correctly 
under a different version.
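On the security point, here is a short demonstration of why unpickling untrusted data is unsafe: any object can tell pickle to call an arbitrary function at load time via __reduce__.

```python
# Demonstration: unpickling can execute attacker-chosen code.
import pickle

class Exploit:
    def __reduce__(self):
        # pickle will call print(...) when this payload is loaded;
        # a real attack would call os.system or similar instead.
        return (print, ("arbitrary code ran during unpickling",))

payload = pickle.dumps(Exploit())
pickle.loads(payload)  # runs print(); no Exploit instance comes back
```

This is exactly why the pickle documentation warns never to load data from untrusted sources.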

Another option would be to access the private members of each estimator I want 
to use and store them on my own, but this is inconvenient, because:

- It forces me as a user to understand how the robust scaler works and how it 
stores its internal state, which is generally bad for usability
- The internal implementation could change, leaving me to fix my serialisers 
(see #1)
- I would need to do this for each new Estimator I decide to use

Now, to me the solution seems quite obvious: write a mixin, or update the 
BaseEstimator class, to include two additional methods:

to_dict() - returns a dictionary such that, when passed to
from_dict(dictionary) - it reconstructs the original object

these dictionaries could be passed to the JSON module or the YAML module or 
stored elsewhere. We could provide more convenience methods to do this for the 
user.

In the case of RobustScaler the dict would look something like:
{"center_": [0.0], "scale_": [1.0]}

Now, the bulk of the work is writing these serialisers and deserialisers for 
all of the estimators, but that can be simplified by adding a method that does 
it automatically via reflection; each estimator would then only need to 
specify which fields to serialise.
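A rough sketch of what such reflection could look like, written as free functions rather than the proposed mixin. It leans on scikit-learn's own convention that fitted attributes end in a trailing underscore; the helper names and the list-for-array encoding are my assumptions, not an existing API.

```python
# Sketch: reflection-based JSON round-trip for a fitted estimator,
# using get_params() for hyper-parameters and the trailing-underscore
# convention to find fitted state.
import json

import numpy as np
from sklearn.preprocessing import RobustScaler

def to_dict(estimator):
    state = {"params": estimator.get_params(), "fitted": {}}
    for name, value in vars(estimator).items():
        if name.endswith("_") and not name.startswith("_"):
            # arrays become plain lists so json can handle them
            state["fitted"][name] = (value.tolist()
                                     if isinstance(value, np.ndarray)
                                     else value)
    return state

def from_dict(cls, state):
    est = cls(**state["params"])
    for name, value in state["fitted"].items():
        setattr(est, name,
                np.asarray(value) if isinstance(value, list) else value)
    return est

scaler = RobustScaler().fit([[0.0], [1.0], [2.0], [3.0]])
roundtripped = json.loads(json.dumps(to_dict(scaler)))
restored = from_dict(RobustScaler, roundtripped)
print(restored.transform([[2.0]]))
```

The weak spot is exactly the one raised above: any fitted attribute that is not an ndarray or a JSON-native type needs a per-estimator rule, which is why each estimator would have to declare its serialisable fields.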

I am happy to start working on this and create a pull request on Github, but 
before I do that I wanted to get some initial thoughts and reactions from the 
community, so please let me know what you think.

Best Regards,
Miroslav Zoricak
--
Best Regards,
Miroslav Zoricak

_______________________________________________
Scikit-learn-general mailing list
[email protected]<mailto:[email protected]>
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


_______________________________________________
scikit-learn mailing list
[email protected]<mailto:[email protected]>
https://mail.python.org/mailman/listinfo/scikit-learn
