Re: [scikit-learn] [Scikit-learn-general] Estimator serialisability

Nick Pentreath Thu, 14 Jul 2016 07:21:47 -0700

For PFA, you may wish to check out
https://github.com/opendatagroup/hadrian/ (the
"titus" subproject is a full Python impl of PFA, with a focus on some
"model producing" hooks such as a PrettyPFA higher-level text-based DSL for
PFA document construction).
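
For reference, a PFA document is plain JSON with "input", "output" and "action" fields. Below is a minimal sketch that builds one by hand using only the standard library. The add-100 action is an illustrative toy, not a real model, and running the document would still need a PFA engine such as titus, which is not used here.

```python
import json

# A toy PFA document: read a double, add 100, emit a double.
# The field names follow the PFA format; the action itself is illustrative.
pfa_document = {
    "input": "double",
    "output": "double",
    "action": [{"+": ["input", 100]}],
}

# Serialise to the JSON text that a PFA engine would consume.
pfa_json = json.dumps(pfa_document, indent=2)
print(pfa_json)
```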




On Thu, 14 Jul 2016 at 16:07 William Komp <[email protected]> wrote:

> Hi,
> Interesting conversation. I have captured model parameters in SQL and used
> SQL for scoring in massively parallel setups. You can score billion-record
> sets in seconds. This works really well with logistic regression and other
> function-based models. Trees would be a bit more difficult.
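
A minimal sketch of the SQL-scoring idea, assuming the coefficients have already been read out of a fitted model. The table, column names and parameter values are hypothetical, and an in-memory SQLite database stands in for a parallel SQL engine.

```python
import math
import sqlite3


def logistic_sql(feature_names, coefficients, intercept, table):
    """Build a SELECT that scores every row of `table` with a logistic model."""
    terms = " + ".join(
        f"({coef!r} * {name})" for name, coef in zip(feature_names, coefficients)
    )
    linear = f"({intercept!r} + {terms})"
    return (
        f"SELECT id, 1.0 / (1.0 + exp(-{linear})) AS score "
        f"FROM {table} ORDER BY id"
    )


# Hypothetical parameters; with scikit-learn they would come from a fitted
# LogisticRegression's coef_[0] and intercept_[0].
FEATURES = ["age", "income"]
COEFS = [0.5, -0.25]
INTERCEPT = 0.1

conn = sqlite3.connect(":memory:")
# Register exp() explicitly, since not every SQLite build ships it.
conn.create_function("exp", 1, math.exp)
conn.execute("CREATE TABLE customers (id INTEGER, age REAL, income REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 2.0, 0.0), (2, 0.0, 4.0)],
)

query = logistic_sql(FEATURES, COEFS, INTERCEPT, "customers")
scores = conn.execute(query).fetchall()
print(query)
print(scores)
```

The generated query is a single arithmetic expression per row, which is why this approach suits linear models better than trees.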
>
> Has there been any discussion of incorporating PFA (Portable Format for
> Analytics, http://dmg.org/pfa/index.html) into scikit-learn? Bob Grossman is
> the driving force behind it. Here is a link to a deck from a Predictive
> Analytics World talk he gave in Chicago a few months ago.
>
>
> http://www.slideshare.net/rgrossman/how-to-lower-the-cost-of-deploying-analytics-an-introduction-to-the-portable-format-for-analytics
>
> William
>
> On Thu, Jul 14, 2016 at 8:35 AM, Dale T Smith <[email protected]>
> wrote:
>
>> Hello,
>>
>>
>>
>> I investigated this subject last year, and have tried to keep up, so I
>> can perhaps offer some alternatives.
>>
>>
>>
>> - The only packages I know of that read PMML in Python are
>> proprietary. There are several alternatives for writing PMML, as you can
>> easily find.
>>
>>
>>
>> I also found
>>
>>
>>
>> https://code.google.com/archive/p/augustus/
>>
>>
>>
>> and
>>
>>
>>
>> https://github.com/ctrl-alt-d/lightpmmlpredictor
>>
>>
>>
>> Depending on your project, sklearn-compiledtrees may be an option.
>>
>>
>>
>> https://github.com/ajtulloch/sklearn-compiledtrees
>>
>>
>>
>> Py2PMML (
>> https://support.zementis.com/entries/37092748-Introducing-Py2PMML) is by
>> Zementis and it’s a commercial product, meaning you pay for a license.
>>
>>
>>
>> - Another option is what we planned to do at an old job of mine:
>> read the model characteristics out of the scikit-learn object after fit,
>> and produce C code ourselves. This is a viable option for decision trees.
>> You can adapt print_decision_trees() from this Stack Overflow answer.
>>
>>
>>
>>
>> http://stackoverflow.com/questions/20224526/how-to-extract-the-decision-rules-from-scikit-learn-decision-tree
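
A minimal sketch of the code-generation idea, not the code from the linked answer. It assumes the tree is given as flat arrays laid out like scikit-learn's tree_.children_left, tree_.children_right, tree_.feature and tree_.threshold, with -1 marking a leaf. The per-node VALUE list and the function name tree_to_c are simplifications made up for this example, and the arrays are hand-written rather than taken from a fitted model.

```python
def tree_to_c(children_left, children_right, feature, threshold, value,
              name="predict"):
    """Emit a C function made of nested if/else statements for one tree."""
    lines = [f"double {name}(const double *x) {{"]

    def emit(node, depth):
        pad = "    " * depth
        if children_left[node] == -1:  # -1 marks a leaf node
            lines.append(f"{pad}return {value[node]!r};")
            return
        # Samples with x[feature] <= threshold go to the left child.
        lines.append(f"{pad}if (x[{feature[node]}] <= {threshold[node]!r}) {{")
        emit(children_left[node], depth + 1)
        lines.append(f"{pad}}} else {{")
        emit(children_right[node], depth + 1)
        lines.append(f"{pad}}}")

    emit(0, 1)
    lines.append("}")
    return "\n".join(lines)


# Hand-written three-node tree: a root split on feature 0 and two leaves.
CHILDREN_LEFT = [1, -1, -1]
CHILDREN_RIGHT = [2, -1, -1]
FEATURE = [0, -2, -2]
THRESHOLD = [0.5, -2.0, -2.0]
VALUE = [0.0, 10.0, 20.0]  # simplified: one prediction per node

c_source = tree_to_c(CHILDREN_LEFT, CHILDREN_RIGHT, FEATURE, THRESHOLD, VALUE)
print(c_source)
```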
>>
>>
>>
>> - You can also reconsider your use of joblib.dump. I’m
>> aware that it has problems, but you can include enough versioning
>> information in the objects you dump to let your code check that
>> scikit-learn versions are compatible, etc. I know this is a
>> pain in the neck, but it’s a viable alternative to creating your own PMML
>> reader, writing a code generator of some kind, or buying a license.
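
A minimal sketch of the versioning idea: wrap the model in a small payload that records versions, and check them at load time. To keep the example self-contained it uses the standard-library pickle and a plain dict as a stand-in model. With joblib the same payload could be passed to joblib.dump and read back with joblib.load, and the library version would typically be sklearn.__version__. The helper names and FORMAT_VERSION are made up for this example.

```python
import io
import pickle

FORMAT_VERSION = 1  # hypothetical version of our own payload layout


def dump_with_metadata(model, library_version, fileobj):
    """Store the model together with the versions it was produced with."""
    payload = {
        "format_version": FORMAT_VERSION,
        "library_version": library_version,
        "model": model,
    }
    pickle.dump(payload, fileobj)


def load_checked(fileobj, expected_library_version):
    """Load a payload and refuse it if the recorded versions do not match."""
    payload = pickle.load(fileobj)  # only load files you trust
    if payload["format_version"] != FORMAT_VERSION:
        raise ValueError("unknown payload format")
    if payload["library_version"] != expected_library_version:
        raise RuntimeError(
            f"model was saved with version {payload['library_version']}, "
            f"but {expected_library_version} is installed"
        )
    return payload["model"]


# Stand-in for a fitted estimator.
toy_model = {"center": 0.0, "scale": 1.0}

buffer = io.BytesIO()
dump_with_metadata(toy_model, "0.17.1", buffer)

buffer.seek(0)
restored = load_checked(buffer, "0.17.1")
print(restored)
```

Note that the check runs only after unpickling, so it guards against silently using a model from another version but not against a failure during loading itself.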
>>
>>
>>
>>
>>
>>
>> __________________________________________________________________________________________
>> *Dale Smith* | Macy's Systems and Technology | IFS eCommerce | Data
>> Science and Capacity Planning
>> | 5985 State Bridge Road, Johns Creek, GA 30097 | [email protected]
>>
>>
>>
>> *From:* scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=
>> [email protected]] *On Behalf Of *Joel Nothman
>> *Sent:* Thursday, July 14, 2016 4:18 AM
>> *To:* Scikit-learn user and developer mailing list
>> *Subject:* Re: [scikit-learn] [Scikit-learn-general] Estimator
>> serialisability
>>
>>
>>
>> This has been discussed numerous times. I suppose no one thinks
>> supporting pickle only is great, but a custom dict is unmaintainable. The
>> best we've got AFAIK (and it looks
>> <https://github.com/jpmml/jpmml-sklearn/graphs/contributors> like it's
>> getting better all the time) is a tool to convert one-way to PMML, which is
>> portable to production environments. See
>> https://github.com/jpmml/sklearn2pmml (Python interface) and
>> https://github.com/jpmml/jpmml-sklearn (command-line interface and guts
>> of the thing).
>>
>>
>>
>> I hope that helps; and thanks to Villu Ruusmann: that list of supported
>> estimators is awesome!
>>
>>
>>
>> PS: please write to the new list at [email protected]
>>
>>
>>
>> On 14 July 2016 at 17:24, Miroslav Zoričák <[email protected]>
>> wrote:
>>
>> Hi everybody,
>>
>>
>>
>> I have been using scikit-learn for a while, but I have run into a problem
>> that does not seem to have any good solutions.
>>
>>
>>
>> Basically I would like to:
>>
>> - build my pipeline in a Jupyter Notebook
>>
>> - persist it (to json or hdf5)
>>
>> - load it in production and execute the prediction there
>>
>>
>>
>> The problem is that for persisting estimators such as the RobustScaler
>> for example, the recommended way is to pickle them. Now I don't want to do
>> this, for three reasons:
>>
>>
>>
>> - Security, pickle is potentially dangerous
>>
>> - Portability, I can't unpickle it in scala for example
>>
>> - Pickle stores a lot of details and information which is not strictly
>> necessary to reconstruct the RobustScaler and therefore might prevent it
>> from being reconstructed correctly if a different version is used.
>>
>>
>>
>> Another option I seem to have is to access the private members of
>> each estimator that I want to use and store them on my own, but this is
>> inconvenient, because:
>>
>>
>>
>> - It forces me as a user to understand how the robust scaler works and
>> how it stores its internal state, which is generally bad for usability
>>
>> - The internal implementation could change, leaving me to fix my
>> serialisers (see #1)
>>
>> - I would need to do this for each new Estimator I decide to use
>>
>>
>>
>> Now, to me it seems the solution is quite obvious:
>>
>> Write a Mixin or update the BaseEstimator class to include two additional
>> methods:
>>
>>
>>
>> to_dict() - will return a dictionary such that, when passed to
>>
>> from_dict(dictionary) - it will reconstruct the original object
>>
>>
>>
>> these dictionaries could be passed to the JSON module or the YAML module
>> or stored elsewhere. We could provide more convenience methods to do this
>> for the user.
>>
>>
>>
>> In case of the RobustScaler the dict would look something like:
>>
>> { "center": "0.0", "scale": "1.0"}
>>
>>
>>
>> Now the bulk of the work is writing these serialisers and deserialisers
>> for all of the estimators, but that can be simplified by adding a method
>> that could do that automatically via reflection and the estimator would
>> only need to specify which fields to serialise.
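
A minimal sketch of the proposed mixin, assuming each estimator declares the attributes to persist in a class-level tuple. The names DictSerialisableMixin, _serialised_fields and ToyScaler are hypothetical, and ToyScaler is a simplified stand-in, not scikit-learn's RobustScaler.

```python
import json


class DictSerialisableMixin:
    """Hypothetical mixin: subclasses list the attributes to persist."""

    _serialised_fields = ()

    def to_dict(self):
        # Collect only the declared fields via getattr (simple reflection).
        return {name: getattr(self, name) for name in self._serialised_fields}

    @classmethod
    def from_dict(cls, state):
        obj = cls()
        for name in cls._serialised_fields:
            setattr(obj, name, state[name])
        return obj


class ToyScaler(DictSerialisableMixin):
    """Simplified stand-in for an estimator; not scikit-learn code."""

    _serialised_fields = ("center_", "scale_")

    def __init__(self):
        self.center_ = None
        self.scale_ = None

    def fit(self, values):
        ordered = sorted(values)
        self.center_ = ordered[len(ordered) // 2]  # middle element
        self.scale_ = (ordered[-1] - ordered[0]) or 1.0  # range, never zero
        return self

    def transform(self, values):
        return [(v - self.center_) / self.scale_ for v in values]


scaler = ToyScaler().fit([1.0, 2.0, 3.0, 4.0, 5.0])

# Round-trip through JSON text, as the proposal suggests.
as_json = json.dumps(scaler.to_dict())
restored = ToyScaler.from_dict(json.loads(as_json))
print(as_json)
print(restored.transform([3.0, 5.0]))
```

Real estimators keep their fitted state in NumPy arrays, so to_dict() would also need to convert those to lists (and from_dict() back again) before they could be passed to the JSON module.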
>>
>>
>>
>> I am happy to start working on this and create a pull request on Github,
>> but before I do that I wanted to get some initial thoughts and reactions
>> from the community, so please let me know what you think.
>>
>>
>>
>> Best Regards,
>>
>> Miroslav Zoricak
>>
>>
>>
>> _______________________________________________
>> Scikit-learn-general mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> [email protected]
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>

