Hello scikit-learn team!
After reading about on-demand features in "Decision Forests with
Long-Range Spatial Context for Organ Localization in CT Volumes"
<http://research.microsoft.com/apps/pubs/default.aspx?id=81675> by
Criminisi et al. [1], the idea came up to make user-defined feature
functions generally available in random decision forests (RDFs).
The basic idea is not to randomly choose features from a predefined
feature stack, as is done now, but to compute features on demand via a
user-supplied /function/ with randomized input parameters. This way, an
effectively unlimited number of features becomes available.
A short example of how this could be used:
In this example, organs in medical images are segmented by voxel-wise
RDF classification. To this end, Criminisi et al. compare two
sub-regions (*F1*, *F2*) of the image whose locations are chosen at
random and indicated by an offset vector *o*. The difference of the
sums of the voxel values within these boxes becomes the on-demand
feature *f(x, theta)*, with *theta* describing the feature's
parameters [1]:
*f(x, theta) = sum(C(q1)) - sum(C(q2))*
with *q1*, *q2* elements of boxes *F1*, *F2*,
*C(q)* = the voxel value at *q*,
*theta* = (instantiations of *o*, *F1* and *F2*)
Thus, an unlimited number of features can be generated simply by
varying the position and size of the boxes. The organ-localization
results on CT scans suggest that this approach could benefit many
classification tasks.
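To make this concrete, here is a rough sketch of such a feature
function in Python/NumPy. The helper names (sample_theta, box_sum) and
the sampling ranges are my own assumptions, not taken from the paper:

import numpy as np

def sample_theta(rng, max_offset=20, max_size=10):
    # theta = (offset o and edge lengths of F1, same for F2), drawn at random
    return [(rng.integers(-max_offset, max_offset + 1, size=3),
             rng.integers(1, max_size + 1, size=3))
            for _ in range(2)]

def box_sum(volume, center, size):
    # sum of the voxel values C(q) over an axis-aligned box, clipped to the volume
    lo = np.clip(center - size // 2, 0, volume.shape)
    hi = np.clip(center + size // 2 + 1, 0, volume.shape)
    return volume[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]].sum()

def f(volume, x, theta):
    # on-demand feature: f(x, theta) = sum(C(q1)) - sum(C(q2))
    (o1, s1), (o2, s2) = theta
    x = np.asarray(x)
    return box_sum(volume, x + o1, s1) - box_sum(volume, x + o2, s2)

rng = np.random.default_rng(0)
volume = rng.random((50, 50, 50))   # stand-in for a CT volume
print(f(volume, (25, 25, 25), sample_theta(rng)))

With theta drawn freshly for every candidate, each call yields a new
feature.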
The idea for the scikit-learn library:
The code should handle predefined /and/ on-demand features at the same
time. To give an example: to train a node split, 400 candidate features
are compared to find the greatest information gain. These consist of 2
predefined features and 398 computed ones, defined by a /user-given/
function with randomly generated input parameters (see the sketch
below).
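A rough sketch of what assembling those 400 candidates could look like
(placeholder names only, this is not existing scikit-learn API):

import numpy as np

def node_candidates(X_predefined, raw_samples, feature_fn, sample_theta,
                    n_on_demand, rng):
    # predefined columns first, then one on-demand column per freshly
    # drawn random theta
    columns = [X_predefined]
    thetas = []
    for _ in range(n_on_demand):
        theta = sample_theta(rng)
        thetas.append(theta)
        columns.append(np.array([feature_fn(x, theta)
                                 for x in raw_samples])[:, None])
    return np.hstack(columns), thetas

# toy usage: 2 predefined + 398 on-demand features -> 400 candidates
rng = np.random.default_rng(0)
raw_samples = rng.random((100, 5))        # raw data the user function works on
X_predefined = raw_samples[:, :2]         # the 2 precomputed features
feature_fn = lambda x, theta: x @ theta   # stand-in for the user-given function
sample_theta = lambda rng: rng.normal(size=5)
candidates, thetas = node_candidates(X_predefined, raw_samples, feature_fn,
                                     sample_theta, 398, rng)
print(candidates.shape)                   # (100, 400)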
This little example should work fine for a single node. Thinking of
training a whole forest, however, we need lots of features, and they
should be a mix of randomly selected predefined features and features
randomly generated by a function. Here a little problem arises:
The proportion problem:
v = [0.5, 0.3, 0.8, 1.2, 0.09, 0.2, …] # predefined set of features
f(x,theta) # function 1
g(x,theta) # function 2
feature_set = [v, f, g]
The candidate subset given to a tree or node then consists of features
drawn from v, f and g. But in which proportion? Should we prefer
predefined features or generated ones? Would it help to use a weighting
so that the user can decide how much influence each feature source
gets? This could look like the following:
Weight = (wv, wf, wg) , example: Weight = (0.2, 0.5, 0.3)
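One possible way to turn such weights into per-node counts (again only
a sketch under my own assumptions, not a proposed API):

import numpy as np

def split_budget(n_candidates, weights, rng):
    # assign each candidate slot to one feature source at random,
    # proportionally to the user-given weights
    sources = rng.choice(len(weights), size=n_candidates, p=weights)
    return np.bincount(sources, minlength=len(weights))

rng = np.random.default_rng(0)
weights = [0.2, 0.5, 0.3]               # (wv, wf, wg) from the example above
print(split_budget(400, weights, rng))  # roughly [80, 200, 120]

Slots assigned to v would then be filled by sampling from the
predefined set, while each slot assigned to f or g would get a fresh
random theta.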
I am struggling here because a few situations can occur that make this
decision a bit tricky:
1) number of predefined features << candidate set size for one node
2) number of predefined features >> candidate set size for one node
3) number of predefined features ≃ candidate set size for one node
In situation 1, the few predefined features may be used
disproportionately often; on the other hand, it is also possible that
some trees use no predefined features at all.
Situation 2 is the reverse: the on-demand features could be ignored or
drawn only in very small numbers, even though these functions are meant
to be instantiated in large numbers because of their probabilistic
nature.
In contrast to the settings above, in the last situation an equal
distribution over all input sources (predefined, function 1,
function 2, …) would probably be preferable. What do you think?
I hope this new functionality will strike a chord with you and the
scikit-learn community.
Additionally, some feedback on the proportion problem described above,
and later on runtime and usability, would be greatly appreciated.
Best regards,
Alex
[1] Criminisi, A., Shotton, J., Bucciarelli, S. (2009). Decision
Forests with Long-Range Spatial Context for Organ Localization in CT
Volumes. MICCAI Workshop on Probabilistic Models for Medical Image
Analysis (MICCAI-PMMIA).
http://research.microsoft.com/apps/pubs/default.aspx?id=81675