For C you should definitely check out this:
https://github.com/ajtulloch/sklearn-compiledtrees/

It's linked here btw ;)
http://scikit-learn.org/dev/related_projects.html

On 08/13/2015 01:04 PM, Simon Burton wrote:
> Coincidentally, I am working on a similar code-generation project,
> with the target language being C. One of the reasons I chose to
> use decision trees (& ensembles thereof) was that it should be
> easy to code-gen these things & deploy.
>
>
>
> On Wed, 12 Aug 2015 11:46:18 +0000
> Rafael Calsaverini <rafael.calsaver...@gmail.com> wrote:
>
>> Hum, I see. So, those values aren't available from the
>> DecisionTreeClassifier class, is that right?
>>
>> Let me make clearer what I'm trying to do; maybe you guys have had this
>> problem in the past and found better solutions. I need to embed a
>> classifier in an external codebase as a proof of concept. There are a few
>> constraints on how much of that code I am free to change, so the most
>> productive approach seems to be the following:
>>
>> 1) Train/optimize hyperparameters/cross-validate the model with
>> scikit-learn until I have a decent initial model.
>> 2) Implement at the target (probably Java, but could be python) only the
>> part of the code that does the prediction with hard-coded parameters copied
>> from the scikit-learn model.
>>
>> So, for instance, I can train a RandomForestClassifier in scikit-learn and
>> then just implement a simple decision function in the Java code, with all
>> the trees hard-coded (basically just a list of thresholds, features, left
>> and right children and the final class decision for each leaf node, and a
>> method to run the decisions and report the same result that predict_proba
>> would).
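[A minimal sketch of step 2: the hard-coded predictor only needs a node-by-node walk over flat arrays. The arrays below are hypothetical, hand-written values standing in for parameters copied from a trained model; a Java version would be a direct translation.]

```python
# Hypothetical hard-coded single tree: node 0 tests feature 0 against 0.5;
# nodes 1 and 2 are leaves. Arrays would be copied from the trained model.
FEATURE = [0, -2, -2]            # -2 marks a leaf node
THRESHOLD = [0.5, -2.0, -2.0]    # split threshold per node (unused at leaves)
LEFT = [1, -1, -1]               # left child index (-1 means no child)
RIGHT = [2, -1, -1]              # right child index
VALUE = [[10.0, 10.0], [9.0, 1.0], [1.0, 9.0]]  # per-class counts per node

def predict_proba(x):
    node = 0
    while FEATURE[node] != -2:               # descend until a leaf
        if x[FEATURE[node]] <= THRESHOLD[node]:
            node = LEFT[node]
        else:
            node = RIGHT[node]
    counts = VALUE[node]
    total = sum(counts)
    return [c / total for c in counts]       # normalize counts to probabilities

print(predict_proba([0.3]))  # → [0.9, 0.1]
```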
>>
>> I can already retrieve most of the needed parameters from the
>> DecisionTreeClassifier (namely: thresholds, left and right children, and
>> the feature index for each node). But the example count for each class at
>> each node doesn't seem to be externally available — is that right? If not,
>> I can just do a "manual" count, but having it would help.
>>
>> The main problem is: I can't just serialize the final trained model and
>> load it every time. It would involve more change in the final code than I'm
>> allowed to do (reading the serialized model every time would be a huge
>> overhead and to avoid it I'd have to change code that is beyond the scope
>> of what we're willing to change in the short term). Another problem is that
>> the platform is running in a JVM language, so probably I'll implement that
>> hard-coded predictor in that language. I could get away with Python if the
>> dev team decides to use Apache Thrift for communication, but that is
>> currently not 100% certain.
>>
>> If you guys had this kind of problem in the past and found better
>> solutions, I'd be thankful to hear about it.
>>
>> Thanks.
>>
>> On Wed, Aug 12, 2015 at 04:58, Jacob Schreiber <jmschreibe...@gmail.com>
>> wrote:
>>
>>> Hi Rafael
>>>
>>> When the tree needs to make a prediction, it usually goes through the
>>> predict method, then the apply method, then the _apply_dense method (this
>>> split dispatches between dense and sparse data).
>>>
>>> Take a look at lines 3463 to 3503, the _apply_dense method. It ends up
>>> returning an array of offsets to the predict method, where each offset is
>>> the leaf node a point falls under. The predict method then indexes the
>>> value array (where node prediction values are stored) by this offset
>>> array, assigning a prediction value to each point.
>>>
>>> A small source of confusion is that for regression trees, the value array
>>> is one value per output per node, which makes sense. However, for
>>> classification trees, the value array stores the number of training points
>>> for each class for each output for each node. For example, a regression
>>> tree may have 2.5 as the prediction value in a leaf, but a classification
>>> tree may have [3, 40, 5] as the value in a leaf if there are three classes.
>>> The final prediction uses argmax to select class 1.
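[The leaf values and the argmax step can be inspected directly; a small sketch, again on iris as an illustrative dataset. `clf.apply` returns the leaf node id for each sample, and argmax over the per-class values reproduces `predict`.]

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
X, y = data.data, data.target
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

leaves = clf.apply(X)                  # leaf node id for each sample
values = clf.tree_.value[leaves, 0]    # per-class values stored at those leaves
pred = clf.classes_[np.argmax(values, axis=1)]  # argmax picks the class

assert (pred == clf.predict(X)).all()
```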
>>>
>>> Let me know if you have any other questions!
>>>
>>> Jacob
>>>
>>> On Tue, Aug 11, 2015 at 2:17 PM, Rafael Calsaverini <
>>> rafael.calsaver...@gmail.com> wrote:
>>>
>>>> Hi there all,
>>>>
>>>> I'm taking a look on the code for decision trees and trying to understand
>>>> how it actually decides the class and I'm having some trouble with the
>>>> final step.
>>>>
>>>> The heart of the algorithm seems to be in lines 3249 to 3260 of
>>>> the sklearn/tree/_tree.pyx file.
>>>>
>>>> Lines 3249 to 3258 are fine; they are just the standard walk through
>>>> the branches of the decision tree. What I failed to understand is how the
>>>> tree actually decides which class to assign to the sample being classified
>>>> once it reaches a leaf node. Aren't the final classes assigned to each
>>>> leaf stored anywhere?
>>>>
>>>> Thanks,
>>>> Rafael Calsaverini
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>>
>>>> _______________________________________________
>>>> Scikit-learn-general mailing list
>>>> Scikit-learn-general@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>
>>>>
>>> ------------------------------------------------------------------------------
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> Scikit-learn-general@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>