Hello scikit-learn team!
After reading about on-demand features in "Decision Forests with
Long-Range Spatial Context for Organ Localization in CT Volumes"
<http://research.microsoft.com/apps/pubs/default.aspx?id=81675> by
Criminisi et al. [1], the idea came up to make user-defined feature
functions generally available in random decision forests (RDFs).
The basic idea is not to randomly choose features from a predefined
feature stack, as is done now, but to compute features on demand via a
user-supplied /function/ with randomized input parameters. This way, an
effectively unlimited number of features becomes available.
A short example of how this could be used:
In this example, organs in medical images are segmented by voxel-wise
RDF classification. To this end, Criminisi et al. compare two
sub-regions (*F1*, *F2*) of the image whose locations are chosen at
random and indicated by an offset vector *o*. The difference of the
sums of the voxel values within these boxes becomes the on-demand
feature *f(x, theta)*, with *theta* describing the feature's
parameters [1]:
*f(x, theta) = sum(C(q1)) - sum(C(q2))*
with *q1*, *q2* elements of boxes *F1*, *F2*,
*C(q)* = the voxel value at *q*,
*theta* = (instantiations of *o*, *F1* and *F2*)
Thus, an unlimited number of features can be generated simply by
varying the position and size of the boxes. The organ-localization
results on CT scans suggest that this approach could benefit many
classification tasks.
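To make this concrete, here is a rough sketch of such a feature
function in Python/NumPy. The helper names (sample_theta, box_sum) and
the sampling ranges are my own assumptions, not taken from the paper:

import numpy as np

def sample_theta(rng, max_offset=20, max_size=10):
    # theta = (offset o and edge lengths of F1, same for F2), drawn at random
    return [(rng.integers(-max_offset, max_offset + 1, size=3),
             rng.integers(1, max_size + 1, size=3))
            for _ in range(2)]

def box_sum(volume, center, size):
    # sum of the voxel values C(q) over an axis-aligned box, clipped to the volume
    lo = np.clip(center - size // 2, 0, volume.shape)
    hi = np.clip(center + size // 2 + 1, 0, volume.shape)
    return volume[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]].sum()

def f(volume, x, theta):
    # on-demand feature: f(x, theta) = sum(C(q1)) - sum(C(q2))
    (o1, s1), (o2, s2) = theta
    x = np.asarray(x)
    return box_sum(volume, x + o1, s1) - box_sum(volume, x + o2, s2)

rng = np.random.default_rng(0)
volume = rng.random((50, 50, 50))   # stand-in for a CT volume
print(f(volume, (25, 25, 25), sample_theta(rng)))

With theta drawn freshly for every candidate, each call yields a new
feature.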
The idea for the scikit-learn library:
The code should handle predefined /and/ on-demand features at the same
time. To give an example: to train a node split, 400 candidate features
are compared to find the greatest information gain. These consist of 2
predefined features and 398 computed ones, defined by a /user-given/
function with randomly generated input parameters (see the sketch
below).
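A rough sketch of what assembling those 400 candidates could look like
(placeholder names only, this is not existing scikit-learn API):

import numpy as np

def node_candidates(X_predefined, raw_samples, feature_fn, sample_theta,
                    n_on_demand, rng):
    # predefined columns first, then one on-demand column per freshly
    # drawn random theta
    columns = [X_predefined]
    thetas = []
    for _ in range(n_on_demand):
        theta = sample_theta(rng)
        thetas.append(theta)
        columns.append(np.array([feature_fn(x, theta)
                                 for x in raw_samples])[:, None])
    return np.hstack(columns), thetas

# toy usage: 2 predefined + 398 on-demand features -> 400 candidates
rng = np.random.default_rng(0)
raw_samples = rng.random((100, 5))        # raw data the user function works on
X_predefined = raw_samples[:, :2]         # the 2 precomputed features
feature_fn = lambda x, theta: x @ theta   # stand-in for the user-given function
sample_theta = lambda rng: rng.normal(size=5)
candidates, thetas = node_candidates(X_predefined, raw_samples, feature_fn,
                                     sample_theta, 398, rng)
print(candidates.shape)                   # (100, 400)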
This little example should work fine for a single node. Thinking of
training a whole forest, however, we need lots of features, and they
should be a mix of randomly selected predefined features and features
randomly generated by a function. Here a little problem arises:
The proportion problem:
v = [0.5, 0.3, 0.8, 1.2, 0.09, 0.2, …] # predefined set of features
f(x,theta) # function 1
g(x,theta) # function 2
feature_set = [v, f, g]
The candidate subset given to a tree or node then consists of features
drawn from v, f and g. But in which proportion? Should we prefer
predefined features or generated ones? Would it help to use a weighting
so that the user can decide how much influence each feature source
gets? This could look like the following:
Weight = (wv, wf, wg) , example: Weight = (0.2, 0.5, 0.3)
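One possible way to turn such weights into per-node counts (again only
a sketch under my own assumptions, not a proposed API):

import numpy as np

def split_budget(n_candidates, weights, rng):
    # assign each candidate slot to one feature source at random,
    # proportionally to the user-given weights
    sources = rng.choice(len(weights), size=n_candidates, p=weights)
    return np.bincount(sources, minlength=len(weights))

rng = np.random.default_rng(0)
weights = [0.2, 0.5, 0.3]               # (wv, wf, wg) from the example above
print(split_budget(400, weights, rng))  # roughly [80, 200, 120]

Slots assigned to v would then be filled by sampling from the
predefined set, while each slot assigned to f or g would get a fresh
random theta.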
I am struggling here because a few situations can occur that make this
decision a bit tricky:
1) number of predefined features << candidate set size for one node
2) number of predefined features >> candidate set size for one node
3) number of predefined features ≃ candidate set size for one node
In situation 1, the few predefined features may be used
disproportionately often; on the other hand, it is also possible that
some trees use no predefined features at all.
Situation 2 is the reverse: the on-demand features could be ignored or
drawn only in very small numbers, even though these functions are meant
to be instantiated in large numbers because of their probabilistic
nature.
In contrast to the settings above, in the last situation an equal
distribution over all input sources (predefined, function 1,
function 2, …) would probably be preferable. What do you think?
I hope this new functionality will strike a chord with you and the
scikit-learn community.
Additionally, some feedback on the proportion problem described above,
and later on runtime and usability, would be greatly appreciated.
Best regards,
Alex
[1] Criminisi, A., Shotton, J., Bucciarelli, S. (2009). Decision
Forests with Long-Range Spatial Context for Organ Localization in CT
Volumes. MICCAI Workshop on Probabilistic Models for Medical Image
Analysis (MICCAI-PMMIA).
http://research.microsoft.com/apps/pubs/default.aspx?id=81675