Surprisingly, I am working on a similar code-generation project, with the target language being C. One of the reasons I chose decision trees (& ensembles thereof) was that they should be easy to code-generate & deploy.
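To make concrete what I mean by code gen: a rough, untested sketch that walks a fitted scikit-learn tree and prints a nested C if/else predictor (the iris model is just a stand-in):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    iris = load_iris()
    clf = DecisionTreeClassifier(max_depth=3).fit(iris.data, iris.target)

    def emit_c(t, node=0, pad="    "):
        if t.children_left[node] == -1:          # -1 marks a leaf node
            counts = t.value[node][0]            # class counts at this leaf
            return "%sreturn %d;\n" % (pad, counts.argmax())
        test = "%sif (x[%d] <= %.17g) {\n" % (pad, t.feature[node], t.threshold[node])
        left = emit_c(t, t.children_left[node], pad + "    ")
        right = emit_c(t, t.children_right[node], pad + "    ")
        return test + left + pad + "} else {\n" + right + pad + "}\n"

    print("int predict(const double *x) {\n%s}" % emit_c(clf.tree_))

The emitted function returns the argmax class index, so the only thing the target code needs to know is the classes_ ordering.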
On Wed, 12 Aug 2015 11:46:18 +0000 Rafael Calsaverini <rafael.calsaver...@gmail.com> wrote:

> Hum, I see. So those values aren't available from the DecisionTreeClassifier class, is that right?
>
> Let me make clearer what I'm trying to do; maybe you guys have had this problem in the past and can devise better solutions. I need to embed a classifier in an external codebase as a proof of concept. There are a few constraints on how much of that code I have the freedom to change, so the most productive approach seems to be the following:
>
> 1) Train/optimize hyperparameters/cross-validate the model with scikit-learn until I have a decent initial model.
> 2) Implement in the target language (probably Java, but it could be Python) only the part of the code that does the prediction, with hard-coded parameters copied from the scikit-learn model.
>
> So, for instance, I can train a RandomForestClassifier in scikit-learn and then just implement a simple decision function in the Java code, with all the trees hard-coded (basically just a list of thresholds, features, left and right children, and the final class decision for each leaf node, plus a method to run the decisions and report the same result that predict_proba would).
>
> I can already retrieve most of the needed parameters from the DecisionTreeClassifier (namely: thresholds, left and right children, and the feature index for each node). But the example count for each class at each node doesn't seem to be externally available. If it isn't, I can just do a "manual" count, but having it directly would help.
>
> The main problem is: I can't just serialize the final trained model and load it every time. That would involve more change to the final code than I'm allowed to make (reading the serialized model on every call would be a huge overhead, and to avoid it I'd have to change code that is beyond the scope of what we're willing to change in the short term). Another problem is that the platform runs on a JVM language, so I'll probably implement that hard-coded predictor in that language. I could get away with Python if the dev team decides to use Apache Thrift for communication, but that is currently not 100% certain.
>
> If you guys have had this kind of problem in the past and found better solutions, I'd be thankful to hear about it.
>
> Thanks.
>
> On Wed, 12 Aug 2015 at 04:58, Jacob Schreiber <jmschreibe...@gmail.com> wrote:
>
> > Hi Rafael,
> >
> > When the tree needs to make a prediction, it usually goes through the predict method, then the apply method, then the _apply_dense method (this split helps partition between dense and sparse data).
> >
> > Take a look at lines 3463 to 3503, the _apply_dense method. It ends up returning an array of offsets to the predict method, where each offset is the leaf node a point falls under. The predict method then indexes the value array (where node prediction values are stored) by this offset array, assigning a prediction value to each point.
> >
> > A small source of confusion is that for regression trees, the value array holds one value per output per node, which makes sense. For classification trees, however, the value array stores the number of training points of each class, for each output, for each node. For example, a regression tree may have 2.5 as the prediction value in a leaf, but a classification tree may have [3, 40, 5] as the value in a leaf if there are three classes. The final prediction uses argmax to select class 1.
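FWIW, that value array is reachable from Python as clf.tree_.value, so the per-class counts Rafael asked about are already exposed; no manual count needed. A rough, untested sketch, reusing the iris clf and X from my snippet above:

    import numpy as np

    X = iris.data

    # Leaf id for each sample; Tree.apply wants C-contiguous float32.
    leaves = clf.tree_.apply(np.asarray(X, dtype=np.float32))
    counts = clf.tree_.value[leaves, 0]   # training class counts per leaf
    # Normalizing the counts reproduces predict_proba; argmax reproduces predict.
    proba = counts / counts.sum(axis=1, keepdims=True)
    pred = clf.classes_[proba.argmax(axis=1)]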
> > Let me know if you have any other questions!
> >
> > Jacob
> >
> > On Tue, Aug 11, 2015 at 2:17 PM, Rafael Calsaverini <rafael.calsaver...@gmail.com> wrote:
> >
> >> Hi there all,
> >>
> >> I'm taking a look at the code for decision trees, trying to understand how it actually decides the class, and I'm having some trouble with the final step.
> >>
> >> The heart of the algorithm seems to be on lines 3249 to 3260 of the sklearn/tree/_tree.pyx file.
> >>
> >> Lines 3249 to 3258 are fine; they are just the standard walk down the branches of the decision tree. What I failed to understand is how the tree actually decides which class to assign to the sample being classified once it reaches a leaf node. Aren't the final classes assigned to each leaf stored anywhere?
> >>
> >> Thanks,
> >> Rafael Calsaverini
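And to close the loop on the original question: the walk plus the leaf lookup fit in a few lines of plain Python. A sketch against the same clf.tree_ (the -1 sentinel marks leaves):

    def predict_one(t, x):
        # Mirror of the _apply_dense walk for a single sample.
        node = 0
        while t.children_left[node] != -1:       # -1 == TREE_LEAF
            if x[t.feature[node]] <= t.threshold[node]:
                node = t.children_left[node]
            else:
                node = t.children_right[node]
        counts = t.value[node][0]                # per-class counts at the leaf
        return counts.argmax()                   # position into clf.classes_

    # e.g. clf.classes_[predict_one(clf.tree_, iris.data[0])]
    # should equal clf.predict(iris.data[:1])[0]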