Is it possible to efficiently access the branch statistics that
decision tree algorithms iterate over in scikit-learn?

For example, if the root population has these class counts in the output vector:
   c0: 5000
   c1: 500

Then I'd like to iterate over:
# For a boolean (two-valued category)
   f1=True:   c0=3000,  c1=450
   f1=False:  c0=300,   c1=30
   f1=Null:   c0=1700,  c1=20   # are nulls considered at all?

# For a continuous value, one pair per candidate threshold
   f2<10:     c0= ...  c1= ...
   f2>=10:    c0= ...  c1= ...

   f2<22:     c0= ...  c1= ...
   f2>=22:    c0= ...  c1= ...
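
To make the statistics above concrete, here is a minimal sketch that
computes them directly with NumPy. It is not scikit-learn's internal
API. The helper name branch_class_counts, the toy arrays, and the
choice to encode Null as NaN are all assumptions for illustration.

```python
import numpy as np

def branch_class_counts(mask, y, n_classes):
    """Return the per-class row counts for the rows selected by `mask`."""
    return np.bincount(y[mask], minlength=n_classes)

# Hypothetical toy data: 8 rows, 2 classes (not real data).
y = np.array([0, 0, 0, 1, 1, 0, 1, 0])
# Boolean feature stored as floats so that NaN can stand in for Null.
f1 = np.array([1.0, 0.0, 1.0, 1.0, np.nan, 0.0, 1.0, np.nan])
# Continuous feature.
f2 = np.array([3.0, 12.0, 25.0, 8.0, 30.0, 15.0, 9.0, 22.0])

n_classes = 2
stats = {
    "f1=True": branch_class_counts(f1 == 1.0, y, n_classes),
    "f1=False": branch_class_counts(f1 == 0.0, y, n_classes),
    "f1=Null": branch_class_counts(np.isnan(f1), y, n_classes),
    "f2<10": branch_class_counts(f2 < 10, y, n_classes),
    "f2>=10": branch_class_counts(f2 >= 10, y, n_classes),
}

for branch, counts in stats.items():
    print(branch, {f"c{k}": int(n) for k, n in enumerate(counts)})
```

Each branch costs one pass over the column here, so this only pins down
the quantities; it does not yet address the efficiency question.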


I'd like to experiment with building models on demand for each input
row at prediction time.
To do this efficiently, I'd like to reduce the training set to the 'most
significant' sub-space(s) using the population statistics.

I can do it in pandas, although it's fairly inefficient to iterate over
each feature column many times.
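
One way to avoid re-scanning a continuous column once per threshold is
to sort it once and take cumulative class counts, which yields the
left/right counts for every candidate threshold in a single pass. This
is a sketch under assumptions, not scikit-learn's own code: the function
name threshold_class_counts is hypothetical, and it assumes a non-empty
column with no NaN values and integer class labels 0..n_classes-1.

```python
import numpy as np

def threshold_class_counts(x, y, n_classes):
    """Class counts on each side of every candidate split `x < t`.

    Returns (thresholds, left, right), where left[i] and right[i] hold the
    per-class counts of rows with x < thresholds[i] and x >= thresholds[i].
    Assumes x is non-empty and contains no NaN.
    """
    order = np.argsort(x, kind="stable")
    x_sorted = x[order]
    y_sorted = y[order]

    # One-hot encode the sorted labels, then accumulate down the rows:
    # cum[i] holds the class counts of the first i + 1 sorted rows.
    one_hot = np.eye(n_classes, dtype=np.int64)[y_sorted]
    cum = np.cumsum(one_hot, axis=0)
    total = cum[-1]

    # Positions where the sorted value increases; each such value is a
    # candidate threshold t, with everything before it on the left side.
    first = np.flatnonzero(np.diff(x_sorted) > 0) + 1
    thresholds = x_sorted[first]
    left = cum[first - 1]
    right = total - left
    return thresholds, left, right

# Hypothetical toy data (not real data).
y = np.array([0, 0, 0, 1, 1, 0, 1, 0])
f2 = np.array([3.0, 12.0, 25.0, 8.0, 30.0, 15.0, 9.0, 22.0])

thresholds, left, right = threshold_class_counts(f2, y, n_classes=2)
for t, l, r in zip(thresholds, left, right):
    print(f"f2<{t:g}: {l.tolist()}   f2>={t:g}: {r.tolist()}")
```

The cost is one sort per feature column, after which the counts for all
thresholds come from a single cumulative sum.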

Thanks,
- Stu
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn