Hello scikit-learn team!

After reading about on-demand features in "Decision Forests with Long-Range Spatial Context for Organ Localization in CT Volumes" by Criminisi et al. [1], the idea came up to make user-defined feature functions in RDFs generally available.

The basic idea is not to randomly choose features from a predefined feature stack, as is done now, but to compute features on demand via a /function/ applied to randomized input data. Thus, an effectively infinite number of features becomes available.



     A short example of how this feature can be used:

In this example, organs in medical images are to be segmented using voxel-wise RDF classification. To that end, Criminisi et al. compare the summed voxel values of two sub-regions (*F1*, *F2*) of the image. The sub-regions have randomly chosen locations indicated by an offset vector *o*. The difference of the sums over these two boxes becomes the on-demand feature *f(x, theta)*, with *theta* describing the feature's parameters [1]:

*f(x, theta) = sum_{q1 in F1} C(q1) - sum_{q2 in F2} C(q2)*
        with *q1*, *q2* elements of box *F1*, *F2*,
        *C(q)* = voxel value at position *q*,
        *theta* = (instantiations of *o*, *F1* and *F2*)

Thus, an infinite number of features can be generated by varying the position and size of the boxes. The organ-localization results on CT scans suggest that this approach could benefit many classification tasks.
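To make this concrete, here is a minimal NumPy sketch of such a box-difference feature (my own illustration, not code from the paper; the helper name `make_box_feature`, the parameter ranges, and the clipping at the volume border are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_box_feature(rng, max_offset=10, max_size=5):
    """Randomly instantiate theta = (o, F1, F2) and return f(x, theta):
    the difference of the voxel-value sums inside two boxes placed at
    random offsets relative to the query voxel x."""
    o1 = rng.integers(-max_offset, max_offset + 1, size=3)  # offset of F1
    o2 = rng.integers(-max_offset, max_offset + 1, size=3)  # offset of F2
    s1 = rng.integers(1, max_size + 1, size=3)              # size of F1
    s2 = rng.integers(1, max_size + 1, size=3)              # size of F2

    def f(volume, x):
        x = np.asarray(x)
        shape = np.array(volume.shape)

        def box_sum(offset, size):
            # clip the box to the volume so border voxels stay valid
            lo = np.clip(x + offset, 0, shape - 1)
            hi = np.clip(lo + size, 1, shape)
            return volume[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]].sum()

        return box_sum(o1, s1) - box_sum(o2, s2)

    return f

volume = rng.random((32, 32, 32))   # toy CT volume
f = make_box_feature(rng)           # one randomly instantiated theta
value = f(volume, (16, 16, 16))     # one on-demand feature value
```

Each call to `make_box_feature` yields a new *theta*, so the supply of distinct features is limited only by the random generator.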


     The idea for scikit-learn library:

The code should handle predefined /and/ on-demand features at the same time. To give an example: to train a node split, a set of 400 candidate features is compared to find the greatest information gain. These features consist of 2 predefined ones and 398 computed ones, defined by a /user-given/ function with randomly generated input data.
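A rough sketch of how such a candidate matrix could be assembled for one node (the name `user_feature` and the toy linear-projection function are purely illustrative assumptions, not a proposed API):

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_predefined = 100, 2
X = rng.random((n_samples, n_predefined))   # the 2 predefined feature columns
X_raw = rng.random((n_samples, 8))          # raw data the user function sees

def user_feature(X_raw, theta):
    """Hypothetical user-given feature function: here simply a random
    linear projection of the raw inputs, parameterised by theta."""
    return X_raw @ theta

n_candidates = 400
candidates = np.empty((n_samples, n_candidates))
candidates[:, :n_predefined] = X            # 2 predefined features
for j in range(n_predefined, n_candidates): # 398 on-demand features
    theta = rng.standard_normal(8)          # randomized input data
    candidates[:, j] = user_feature(X_raw, theta)

# the split search would now pick, among these 400 columns,
# the feature/threshold pair with the greatest information gain
```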

This little example should work fine for one node. When training a whole forest, however, we need lots of features, and they should be randomly selected both from the set of predefined ones and from those generated by a function. Here a little problem comes in:


     The proportion problem:

v = [0.5, 0.3, 0.8, 1.2, 0.09, 0.2, …]   # predefined set of features
f(x, theta)                              # function 1
g(x, theta)                              # function 2
feature_set = [v, f, g]

The subset given to a tree or node then consists of features drawn from v, f and g. But in which proportion? Should we prefer predefined features or generated ones? Would it help to use weights, so the user can decide how much influence each feature source gets? This could look like the following:

weight = (wv, wf, wg) , for example: weight = (0.2, 0.5, 0.3)
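Such a weighting could be realized, for instance, by sampling the source of each candidate feature (just a sketch; the variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(2)
sources = ["v", "f", "g"]           # predefined set, function 1, function 2
weights = [0.2, 0.5, 0.3]           # user-chosen influence per source

n_node_features = 400
picks = rng.choice(sources, size=n_node_features, p=weights)
counts = {s: int((picks == s).sum()) for s in sources}
# on average roughly 80 features from v, 200 from f and 120 from g
```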

I am struggling here because there are a few situations that can occur and that make this decision a bit tricky:

1) predefined << feature set size for one node
2) predefined >> feature set size for one node
3) predefined ≃ feature set size for one node

Situation 1 leads to an inordinate use of the few predefined features on the one hand. On the other hand its possible that in a few trees no predefined features are used. The second situation can be handled just vice-versa. The on-demand features could be ignored or only available in a very small amount. These functions are mostly designed to appear in a huge number because of their probabilistic manner. In contrast to the settings above in the last situation an equal distribution of all input sources (predefined, function1, function2, …) should be preferred. What do you think?
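One possible rule to soften situation 1, namely capping the predefined features at one draw each per node and splitting the remainder equally among the functions, could look like this (a sketch of the idea only, not a settled design, and it visibly reintroduces the situation-2 problem):

```python
def allocate(n_node_features, n_predefined, n_functions):
    """Hypothetical allocation rule: never draw a predefined feature
    more than once per node; distribute the remaining slots equally
    among the generating functions."""
    n_pre = min(n_predefined, n_node_features)
    remaining = n_node_features - n_pre
    per_function = remaining // n_functions
    counts = [per_function] * n_functions
    counts[0] += remaining - per_function * n_functions  # leftover slots
    return n_pre, counts

# situation 1: predefined << node feature set size
print(allocate(400, 2, 2))      # (2, [199, 199])
# situation 2: predefined >> node feature set size
print(allocate(400, 10000, 2))  # (400, [0, 0]) -- functions starved
```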


I hope this new functionality will strike a chord with you and the scikit-learn community. Feedback on the proportion problem described above, and later on runtime and usability improvements, would be greatly appreciated.

Best regards,
Alex


   [1] Criminisi, A., Shotton, J., Bucciarelli, S. 2009. Decision
   Forests with Long-Range Spatial Context for Organ Localization in CT
   Volumes. MICCAI Workshop on Probabilistic Models for Medical Image
   Analysis (MICCAI-PMMIA). URL:
   http://research.microsoft.com/apps/pubs/default.aspx?id=81675