For PFA, you may wish to check out https://github.com/opendatagroup/hadrian/. Its "titus" subproject is a full Python implementation of PFA, with a focus on "model producing" hooks such as PrettyPFA, a higher-level text-based DSL for constructing PFA documents.
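
To give a flavour of what a PFA document looks like: it is plain JSON with declared input and output types plus an action. Below is a minimal sketch using only Python's standard json module. The add-100 action is a toy I made up for illustration, not a real model. As far as I recall, titus can load such a string with PFAEngine.fromJson from titus.genpy, but treat that call as an assumption and check the hadrian docs.

```python
import json

# A toy PFA document: takes a double and returns that double plus 100.
# "input", "output" and "action" are the top-level PFA fields; the
# arithmetic is a made-up example, not a fitted model.
pfa_document = {
    "input": "double",
    "output": "double",
    "action": [{"+": ["input", 100]}],
}

# PFA documents are exchanged as JSON text, so serialise the dict.
pfa_json = json.dumps(pfa_document, indent=2)
print(pfa_json)
```

The point is that the whole scoring procedure travels as data, so any PFA engine in any language could in principle execute it.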

On Thu, 14 Jul 2016 at 16:07 William Komp <[email protected]> wrote:

> Hi,
> Interesting conversation. I have captured model parameters in sql and use
> sql for scoring in massively parallel setups. You can score billion record
> sets in seconds. Works really well with logistic regression and other
> functional based models. Trees would be a bit more difficult.
>
> Has there been any discussion on PFA (Portable Format for Analytics):
> http://dmg.org/pfa/index.html incorporation in scikit? Bob Grossman is
> the driving force behind it. Here is a link to a deck from a Predictive
> Analytics World talk he gave in chicago a few months ago.
>
> http://www.slideshare.net/rgrossman/how-to-lower-the-cost-of-deploying-analytics-an-introduction-to-the-portable-format-for-analytics
>
> William
>
> On Thu, Jul 14, 2016 at 8:35 AM, Dale T Smith <[email protected]>
> wrote:
>
>> Hello,
>>
>> I investigated this subject last year, and have tried to keep up, so I
>> can perhaps offer some alternatives.
>>
>> · The only packages I know that read PMML in Python are
>> proprietary. There are several alternatives for writing to PMML, as you can
>> easily find.
>>
>> I also found
>>
>> https://code.google.com/archive/p/augustus/
>>
>> and
>>
>> https://github.com/ctrl-alt-d/lightpmmlpredictor
>>
>> Depending on your project, sklearn-compiledtrees may be an option.
>>
>> https://github.com/ajtulloch/sklearn-compiledtrees
>>
>> Py2PMML (
>> https://support.zementis.com/entries/37092748-Introducing-Py2PMML) is by
>> Zementis and it’s a commercial product, meaning you pay for a license.
>>
>> · Another option is what we planned to do at an old job of mine
>> – read the model characteristics out of the scikit-learn object after fit,
>> and produce C code ourselves. This is a viable option for decision trees.
>> Adapt print_decision_trees() from this Stackoverflow answer.
>>
>> http://stackoverflow.com/questions/20224526/how-to-extract-the-decision-rules-from-scikit-learn-decision-tree
>>
>> · You can also reconsider your use of joblib.dump again. I’m
>> aware that it has problems, but you can include enough versioning
>> information in the objects you dump in order to apply checks in your code
>> to make sure scikit-learn versions are compatible, etc. I know this is a
>> pain in the neck, but it’s a viable alternative to creating your own PMML
>> reader, writing a code generator of some kind, or buying a license.
>>
>> __________________________________________________________________________________________
>> *Dale Smith* | Macy's Systems and Technology | IFS eCommerce | Data
>> Science and Capacity Planning
>> | 5985 State Bridge Road, Johns Creek, GA 30097 | [email protected]
>>
>> *From:* scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=[email protected]] *On Behalf Of *Joel Nothman
>> *Sent:* Thursday, July 14, 2016 4:18 AM
>> *To:* Scikit-learn user and developer mailing list
>> *Subject:* Re: [scikit-learn] [Scikit-learn-general] Estimator
>> serialisability
>>
>> This has been discussed numerous times. I suppose no one thinks
>> supporting pickle only is great, but a custom dict is unmaintainable. The
>> best we've got AFAIK (and it looks
>> <https://github.com/jpmml/jpmml-sklearn/graphs/contributors> like it's
>> getting better all the time) is a tool to convert one-way to PMML, which is
>> portable to production environments. See
>> https://github.com/jpmml/sklearn2pmml (python interface) and
>> https://github.com/jpmml/jpmml-sklearn (command-line interface and guts
>> of the thing).
>>
>> I hope that helps; and thanks to Villu Ruusmann: that list of supported
>> estimators is awesome!
>>
>> PS: please write to the new list at [email protected]
>>
>> On 14 July 2016 at 17:24, Miroslav Zoričák <[email protected]>
>> wrote:
>>
>> Hi everybody,
>>
>> I have been using scikit-learn for a while, but I have run into a problem
>> that does not seem to have any good solutions.
>>
>> Basically I would like to:
>>
>> - build my pipeline in a Jupyter Notebook
>> - persist it (to json or hdf5)
>> - load it in production and execute the prediction there
>>
>> The problem is that for persisting estimators such as the RobustScaler
>> for example, the recommended way is to pickle them. Now I don't want to do
>> this, for three reasons:
>>
>> - Security, pickle is potentially dangerous
>> - Portability, I can't unpickle it in scala for example
>> - Pickle stores a lot of details and information which is not strictly
>> necessary to reconstruct the RobustScaler and therefore might prevent it
>> from being reconstructed correctly if a different version is used.
>>
>> Another option I would seem to have is to access the private members of
>> each serialiser that I want to use and store them on my own, but this is
>> inconvenient, because:
>>
>> - It forces me as a user to understand how the robust scaler works and
>> how it stores its internal state, which is generally bad for usability
>> - The internal implementation could change, leaving me to fix my
>> serialisers (see #1)
>> - I would need to do this for each new Estimator I decide to use
>>
>> Now, to me it seems the solution is quite obvious:
>>
>> Write a Mixin or update the BaseEstimator class to include two additional
>> methods:
>>
>> to_dict() - will return a dictionary such, that when passed to
>> from_dict(dictionary) - it will reconstruct the original object
>>
>> these dictionaries could be passed to the JSON module or the YAML module
>> or stored elsewhere. We could provide more convenience methods to do this
>> for the user.
>>
>> In case of the RobustScaler the dict would look something like:
>>
>> { "center": "0,0", "scale": "1.0"}
>>
>> Now the bulk of the work is writing these serialisers and deserialisers
>> for all of the estimators, but that can be simplified by adding a method
>> that could do that automatically via reflection and the estimator would
>> only need to specify which fields to serialise.
>>
>> I am happy to start working on this and create a pull request on Github,
>> but before I do that I wanted to get some initial thoughts and reactions
>> from the community, so please let me know what you think.
>>
>> Best Regards,
>> Miroslav Zoricak
>>
>> _______________________________________________
>> Scikit-learn-general mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
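
The to_dict()/from_dict() idea in Miroslav's message quoted above, with each estimator declaring which fields to serialise, can be sketched roughly as follows. This is only an illustration under stated assumptions: DictSerialisableMixin, _serialised_fields and ToyScaler are hypothetical names, none of this is part of scikit-learn, and the toy scaler handles a single feature with plain floats.

```python
import json


class DictSerialisableMixin:
    """Hypothetical mixin; not part of scikit-learn."""

    # Subclasses list the attribute names that make up their state.
    _serialised_fields = ()

    def to_dict(self):
        # Read each declared field off the instance via reflection.
        return {name: getattr(self, name) for name in self._serialised_fields}

    @classmethod
    def from_dict(cls, state):
        # Rebuild an instance without running __init__, then restore fields.
        obj = cls.__new__(cls)
        for name in cls._serialised_fields:
            setattr(obj, name, state[name])
        return obj


class ToyScaler(DictSerialisableMixin):
    """Toy stand-in for an estimator such as RobustScaler."""

    _serialised_fields = ("center", "scale")

    def __init__(self, center=0.0, scale=1.0):
        self.center = center
        self.scale = scale

    def transform(self, values):
        # Subtract the centre and divide by the scale, element by element.
        return [(v - self.center) / self.scale for v in values]


original = ToyScaler(center=2.0, scale=4.0)
payload = json.dumps(original.to_dict())  # plain JSON text, no pickle
restored = ToyScaler.from_dict(json.loads(payload))
print(restored.transform([2.0, 6.0]))  # [0.0, 1.0]
```

Real estimators keep NumPy arrays in their fitted attributes, so a real version would at least need to convert those to lists before json.dumps, and the field list would have to be kept up to date for every estimator, which is the maintenance concern raised earlier in the thread.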
_______________________________________________ scikit-learn mailing list [email protected] https://mail.python.org/mailman/listinfo/scikit-learn
