Hi Alexander.
On-demand features for random forests have been well established for a
long time.
I think the main reason we don't support them in scikit-learn is
that they would need to be GIL-released functions to be efficient.
The tree-building code is all in Cython and runs without the GIL, so
calling a Python function is simply not possible, and even if it were,
it would be much too slow.
So you would need to hand in a Cython / C function pointer, which is a
lot less convenient for the Python user to provide, and I also don't know
whether it is possible to pass that in through Python code (I guess it would be).
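For what it's worth, here is a minimal sketch (purely illustrative, not
an existing scikit-learn interface) of one way a C-level callback can
travel through Python code, using ctypes. Note that a ctypes-wrapped
Python callable still re-acquires the GIL on every call, so it would
not fix the speed problem:

import ctypes

# Hypothetical sketch: ctypes.CFUNCTYPE wraps a Python callable as a
# C function pointer, so such a pointer *can* be created from pure
# Python -- but invoking it re-acquires the GIL every time.
FEATURE_FUNC = ctypes.CFUNCTYPE(
    ctypes.c_double,                  # return type: the feature value
    ctypes.POINTER(ctypes.c_double),  # pointer to the sample's raw data
    ctypes.c_int,                     # number of values
)

def my_feature(data, n):
    # toy on-demand feature: mean of the first n values
    return sum(data[i] for i in range(n)) / n

callback = FEATURE_FUNC(my_feature)   # C function pointer, callable from C / Cython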
Concerning your question: for vision applications, I think it is common
to provide all features as functions and to attach a sampling probability
to each function.
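As a toy illustration (every name below is made up, none of this is
scikit-learn API), the feature pool could simply be a list of functions
with attached sampling probabilities:

import numpy as np

rng = np.random.RandomState(0)

# hypothetical feature generators: each maps a sample x to one feature value
feature_funcs = [
    lambda x: x.mean(),           # e.g. mean intensity
    lambda x: x.max() - x.min(),  # e.g. dynamic range
    lambda x: x[0],               # e.g. a single raw value
]
probs = [0.6, 0.3, 0.1]  # sampling probability attached to each function

# at each node, draw which functions produce the candidate features
chosen = rng.choice(len(feature_funcs), size=10, p=probs)
x = rng.rand(20)
candidates = [feature_funcs[i](x) for i in chosen]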
How exactly you parametrize them is a good question; you could, for
example, use hyperparameter optimization for that. A student in my
previous lab wrote pretty comprehensively about this:
http://www.ais.uni-bonn.de/theses/Benedikt_Waldvogel_Master_Thesis_07_2013.pdf
Cheers,
Andy
On 11/18/2014 03:34 AM, Alexander Rüsch wrote:
Hello scikit-learn team!
After reading about on-demand features in Decision Forests with
Long-Range Spatial Context for Organ Localization in CT Volumes
<http://research.microsoft.com/apps/pubs/default.aspx?id=81675> by
Criminisi et al. [1], the idea came up to make user-defined feature
functions in random decision forests (RDFs) generally available.
The basic idea is not to choose features at random from a predefined
feature stack, as has been done until now, but to compute features
on demand via a /function/ with randomized input parameters. Thus, a
practically infinite number of features becomes available.
A short example of how this could be used:
In this example, organs in medical images are segmented using
voxel-wise RDF classification. For that purpose, Criminisi et al.
compare the means of two sub-regions (*F1*, *F2*) of the image. The
sub-regions have randomly chosen locations indicated by a displacement
vector *o*. The difference of the sums of all voxel values within these
boxes becomes the on-demand feature *f(x, theta)*, with *theta*
describing the feature used [1]:
*f(x, theta) = sum(C(q1)) - sum(C(q2))*
with *q1*, *q2* elements of boxes *F1*, *F2*,
*C(q)* = voxel value at *q*,
*theta* = (instantiations of *o*, *F1* and *F2*)
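A minimal numpy sketch of this feature, assuming sums over axis-aligned
boxes and ignoring boundary handling (illustrative only, not the exact
parametrization of [1]):

import numpy as np

def box_sum(volume, corner, size):
    # sum of voxel values C(q) over the box starting at `corner`
    z, y, x = corner
    dz, dy, dx = size
    return volume[z:z + dz, y:y + dy, x:x + dx].sum()

def f(volume, voxel, theta):
    # f(x, theta) = sum(C(q1)) - sum(C(q2)), with q1, q2 ranging over
    # the boxes F1, F2 displaced from `voxel` by the offsets in theta
    (o1, s1), (o2, s2) = theta  # theta instantiates o, F1 and F2
    voxel = np.asarray(voxel)
    return box_sum(volume, voxel + o1, s1) - box_sum(volume, voxel + o2, s2)

# drawing theta at random yields a practically unlimited feature pool
rng = np.random.RandomState(0)
theta = ((rng.randint(0, 5, size=3), rng.randint(1, 4, size=3)),
         (rng.randint(0, 5, size=3), rng.randint(1, 4, size=3)))
volume = rng.rand(32, 32, 32)
value = f(volume, (10, 10, 10), theta)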
Thus, an infinite number of features can be generated by varying the
position and size of the boxes. The results for organ localization in
CT scans promise benefits for many classification tasks.
The idea for the scikit-learn library:
The code should handle predefined /and/ on-demand features at the same
time. To give an example: a node split is trained by comparing a pool
of 400 candidate features to find the greatest information gain. The
features consist of 2 predefined ones and 398 computed ones, defined by
a /user-given/ function with randomly generated input parameters.
This little example should work fine for one node. When training a
whole forest, we need lots of features, and they should be randomly
selected from a pool of predefined ones and randomly generated by a
function. Here a small problem arises:
The proportion problem:
v = [0.5, 0.3, 0.8, 1.2, 0.09, 0.2, …]  # predefined set of features
f(x, theta)  # function 1
g(x, theta)  # function 2
feature_set = [v, f, g]
The subset given to a tree or node then consists of candidate features
drawn from v, f and g. But in which proportion? Should we prefer
predefined features or generated ones? Would it help to use a weighting,
so the user can decide how much influence each feature source gets?
This could look like the following:
Weight = (w_v, w_f, w_g), for example: Weight = (0.2, 0.5, 0.3)
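A rough sketch of how such a weighting could steer the sampling for one
node (interface and names are made up):

import numpy as np

rng = np.random.RandomState(0)

v = np.array([0.5, 0.3, 0.8, 1.2, 0.09, 0.2])  # predefined features
f = lambda x, theta: (x * theta).sum()          # toy stand-in for function 1
g = lambda x, theta: np.abs(x - theta).max()    # toy stand-in for function 2

weights = [0.2, 0.5, 0.3]                       # (w_v, w_f, w_g)

# decide, for each of the node's 400 candidates, which source it comes from
sources = rng.choice(["v", "f", "g"], size=400, p=weights)
n_predefined = (sources == "v").sum()           # in expectation 0.2 * 400 = 80

Note that with only six entries in v, those roughly 80 draws necessarily
reuse the same predefined features over and over, which is exactly
situation 1 below.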
I am struggling here because there are a few situations that can occur
and that make this decision a bit tricky:
1) number of predefined features << feature pool size for one node
2) number of predefined features >> feature pool size for one node
3) number of predefined features ≃ feature pool size for one node
Situation 1 leads, on the one hand, to inordinate reuse of the few
predefined features; on the other hand, it is possible that some trees
use no predefined features at all.
The second situation is the mirror image: the on-demand features could
be ignored or drawn only in very small numbers, even though these
functions are designed to be instantiated in huge numbers because of
their probabilistic nature.
In contrast to the settings above, in the last situation an equal
distribution across all input sources (predefined, function 1,
function 2, …) would probably be preferred. What do you think?
I hope this new functionality will strike a chord with you and the
scikit-learn community.
Additionally, any feedback concerning the problem mentioned above and,
later on, runtime and usability improvements would be greatly appreciated.
Best regards,
Alex
[1] Criminisi, A., Shotton, J., Bucciarelli, S. 2009. Decision
Forests with Long-Range Spatial Context for Organ Localization in
CT Volumes. MICCAI Workshop on Probabilistic Models for Medical
Image Analysis (MICCAI-PMMIA). URL:
http://research.microsoft.com/apps/pubs/default.aspx?id=81675