Surprisingly, I am working on a similar code-generation project, with the target language being C. One of the reasons I chose decision trees (& ensembles thereof) was that they should be easy to code-generate & deploy.
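To make concrete what I mean by code gen: a rough, untested sketch that walks a fitted scikit-learn tree and prints a nested C if/else predictor (the iris model is just a stand-in):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    iris = load_iris()
    clf = DecisionTreeClassifier(max_depth=3).fit(iris.data, iris.target)

    def emit_c(t, node=0, pad="    "):
        if t.children_left[node] == -1:          # -1 marks a leaf node
            counts = t.value[node][0]            # class counts at this leaf
            return "%sreturn %d;\n" % (pad, counts.argmax())
        test = "%sif (x[%d] <= %.17g) {\n" % (pad, t.feature[node], t.threshold[node])
        left = emit_c(t, t.children_left[node], pad + "    ")
        right = emit_c(t, t.children_right[node], pad + "    ")
        return test + left + pad + "} else {\n" + right + pad + "}\n"

    print("int predict(const double *x) {\n%s}" % emit_c(clf.tree_))

The emitted function returns the argmax class index, so the only thing the target code needs to know is the classes_ ordering.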
On Wed, 12 Aug 2015 11:46:18 +0000 Rafael Calsaverini <rafael.calsaver...@gmail.com> wrote:

> Hum, I see. So those values aren't available from the DecisionTreeClassifier class, is that right?
>
> Let me make clearer what I'm trying to do; maybe you guys have had this problem in the past and can devise better solutions. I need to embed a classifier in an external codebase as a proof of concept. There are a few constraints on how much of that code I have the freedom to change, so the most productive approach seems to be the following:
>
> 1) Train/optimize hyperparameters/cross-validate the model with scikit-learn until I have a decent initial model.
> 2) Implement in the target language (probably Java, but it could be Python) only the part of the code that does the prediction, with hard-coded parameters copied from the scikit-learn model.
>
> So, for instance, I can train a RandomForestClassifier in scikit-learn and then just implement a simple decision function in the Java code, with all the trees hard-coded (basically just a list of thresholds, features, left and right children, and the final class decision for each leaf node, plus a method to run the decisions and report the same result that predict_proba would).
>
> I can already retrieve most of the needed parameters from the DecisionTreeClassifier (namely: thresholds, left and right children, and the feature index for each node). But the example count for each class at each node doesn't seem to be externally available. If it isn't, I can just do a "manual" count, but having it directly would help.
>
> The main problem is: I can't just serialize the final trained model and load it every time. That would involve more change to the final code than I'm allowed to make (reading the serialized model on every call would be a huge overhead, and to avoid it I'd have to change code that is beyond the scope of what we're willing to change in the short term). Another problem is that the platform runs on a JVM language, so I'll probably implement that hard-coded predictor in that language. I could get away with Python if the dev team decides to use Apache Thrift for communication, but that is currently not 100% certain.
>
> If you guys have had this kind of problem in the past and found better solutions, I'd be thankful to hear about it.
>
> Thanks.
>
> On Wed, 12 Aug 2015 at 04:58, Jacob Schreiber <jmschreibe...@gmail.com> wrote:
>
> > Hi Rafael,
> >
> > When the tree needs to make a prediction, it usually goes through the predict method, then the apply method, then the _apply_dense method (this split helps partition between dense and sparse data).
> >
> > Take a look at lines 3463 to 3503, the _apply_dense method. It ends up returning an array of offsets to the predict method, where each offset is the leaf node a point falls under. The predict method then indexes the value array (where node prediction values are stored) by this offset array, assigning a prediction value to each point.
> >
> > A small source of confusion is that for regression trees, the value array holds one value per output per node, which makes sense. For classification trees, however, the value array stores the number of training points of each class, for each output, for each node. For example, a regression tree may have 2.5 as the prediction value in a leaf, but a classification tree may have [3, 40, 5] as the value in a leaf if there are three classes. The final prediction uses argmax to select class 1.
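FWIW, that value array is reachable from Python as clf.tree_.value, so the per-class counts Rafael asked about are already exposed; no manual count needed. A rough, untested sketch, reusing the iris clf and X from my snippet above:

    import numpy as np

    X = iris.data

    # Leaf id for each sample; Tree.apply wants C-contiguous float32.
    leaves = clf.tree_.apply(np.asarray(X, dtype=np.float32))
    counts = clf.tree_.value[leaves, 0]   # training class counts per leaf
    # Normalizing the counts reproduces predict_proba; argmax reproduces predict.
    proba = counts / counts.sum(axis=1, keepdims=True)
    pred = clf.classes_[proba.argmax(axis=1)]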
> > Let me know if you have any other questions!
> >
> > Jacob
> >
> > On Tue, Aug 11, 2015 at 2:17 PM, Rafael Calsaverini <rafael.calsaver...@gmail.com> wrote:
> >
> >> Hi there all,
> >>
> >> I'm taking a look at the code for decision trees, trying to understand how it actually decides the class, and I'm having some trouble with the final step.
> >>
> >> The heart of the algorithm seems to be on lines 3249 to 3260 of the sklearn/tree/_tree.pyx file.
> >>
> >> Lines 3249 to 3258 are fine; they are just the standard walk down the branches of the decision tree. What I failed to understand is how the tree actually decides which class to assign to the sample being classified once it reaches a leaf node. Aren't the final classes assigned to each leaf stored anywhere?
> >>
> >> Thanks,
> >> Rafael Calsaverini
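And to close the loop on the original question: the walk plus the leaf lookup fit in a few lines of plain Python. A sketch against the same clf.tree_ (the -1 sentinel marks leaves):

    def predict_one(t, x):
        # Mirror of the _apply_dense walk for a single sample.
        node = 0
        while t.children_left[node] != -1:       # -1 == TREE_LEAF
            if x[t.feature[node]] <= t.threshold[node]:
                node = t.children_left[node]
            else:
                node = t.children_right[node]
        counts = t.value[node][0]                # per-class counts at the leaf
        return counts.argmax()                   # position into clf.classes_

    # e.g. clf.classes_[predict_one(clf.tree_, iris.data[0])]
    # should equal clf.predict(iris.data[:1])[0]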