Hi Alexander.
On-demand features for random forests have been well established for a
long time.
I think the main reason we don't support them in scikit-learn is
that they would need to be GIL-released functions to be efficient.
The tree-building code is all in Cython and runs without the GIL, so
calling a Python function is simply not possible, and even if it were,
it would be much too slow.
So you would need to hand in a Cython / C function pointer, which is a
lot less convenient for the Python user to provide, and I also don't know
whether it is possible to pass that in through Python code (I guess it would be).
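For what it's worth, here is a minimal sketch (purely illustrative, not
an existing scikit-learn interface) of one way a C-level callback can
travel through Python code, using ctypes. Note that a ctypes-wrapped
Python callable still re-acquires the GIL on every call, so it would
not fix the speed problem:

import ctypes

# Hypothetical sketch: ctypes.CFUNCTYPE wraps a Python callable as a
# C function pointer, so such a pointer *can* be created from pure
# Python -- but invoking it re-acquires the GIL every time.
FEATURE_FUNC = ctypes.CFUNCTYPE(
    ctypes.c_double,                  # return type: the feature value
    ctypes.POINTER(ctypes.c_double),  # pointer to the sample's raw data
    ctypes.c_int,                     # number of values
)

def my_feature(data, n):
    # toy on-demand feature: mean of the first n values
    return sum(data[i] for i in range(n)) / n

callback = FEATURE_FUNC(my_feature)   # C function pointer, callable from C / Cython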
Concerning your question: for vision applications, I think it is common
to provide all features as functions and to attach a sampling probability
to each function.
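As a toy illustration (every name below is made up, none of this is
scikit-learn API), the feature pool could simply be a list of functions
with attached sampling probabilities:

import numpy as np

rng = np.random.RandomState(0)

# hypothetical feature generators: each maps a sample x to one feature value
feature_funcs = [
    lambda x: x.mean(),           # e.g. mean intensity
    lambda x: x.max() - x.min(),  # e.g. dynamic range
    lambda x: x[0],               # e.g. a single raw value
]
probs = [0.6, 0.3, 0.1]  # sampling probability attached to each function

# at each node, draw which functions produce the candidate features
chosen = rng.choice(len(feature_funcs), size=10, p=probs)
x = rng.rand(20)
candidates = [feature_funcs[i](x) for i in chosen]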
How exactly you parametrize them is a good question; you could, for
example, use hyperparameter optimization for that. A student in my
previous lab wrote pretty comprehensively about this:
http://www.ais.uni-bonn.de/theses/Benedikt_Waldvogel_Master_Thesis_07_2013.pdf
Cheers,
Andy
On 11/18/2014 03:34 AM, Alexander Rüsch wrote:
Hello scikit-learn team!
After reading about on-demand features in Decision Forests with
Long-Range Spatial Context for Organ Localization in CT Volumes
<http://research.microsoft.com/apps/pubs/default.aspx?id=81675> by
Criminisi et al. [1], the idea came up to make user-defined feature
functions in random decision forests (RDFs) generally available.
The basic idea is not to choose features at random from a predefined
feature stack, as has been done until now, but to compute features
on demand via a /function/ with randomized input parameters. Thus, a
practically infinite number of features becomes available.
A short example of how this could be used:
In this example, organs in medical images are segmented using
voxel-wise RDF classification. For that purpose, Criminisi et al.
compare the means of two sub-regions (*F1*, *F2*) of the image. The
sub-regions have randomly chosen locations indicated by a displacement
vector *o*. The difference of the sums of all voxel values within these
boxes becomes the on-demand feature *f(x, theta)*, with *theta*
describing the feature used [1]:
*f(x, theta) = sum(C(q1)) - sum(C(q2))*
with *q1*, *q2* elements of boxes *F1*, *F2*,
*C(q)* = voxel value at *q*,
*theta* = (instantiations of *o*, *F1* and *F2*)
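A minimal numpy sketch of this feature, assuming sums over axis-aligned
boxes and ignoring boundary handling (illustrative only, not the exact
parametrization of [1]):

import numpy as np

def box_sum(volume, corner, size):
    # sum of voxel values C(q) over the box starting at `corner`
    z, y, x = corner
    dz, dy, dx = size
    return volume[z:z + dz, y:y + dy, x:x + dx].sum()

def f(volume, voxel, theta):
    # f(x, theta) = sum(C(q1)) - sum(C(q2)), with q1, q2 ranging over
    # the boxes F1, F2 displaced from `voxel` by the offsets in theta
    (o1, s1), (o2, s2) = theta  # theta instantiates o, F1 and F2
    voxel = np.asarray(voxel)
    return box_sum(volume, voxel + o1, s1) - box_sum(volume, voxel + o2, s2)

# drawing theta at random yields a practically unlimited feature pool
rng = np.random.RandomState(0)
theta = ((rng.randint(0, 5, size=3), rng.randint(1, 4, size=3)),
         (rng.randint(0, 5, size=3), rng.randint(1, 4, size=3)))
volume = rng.rand(32, 32, 32)
value = f(volume, (10, 10, 10), theta)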
Thus, an infinite number of features can be generated by varying the
position and size of the boxes. The results for organ localization in
CT scans promise benefits for many classification tasks.
The idea for the scikit-learn library:
The code should handle predefined /and/ on-demand features at the same
time. To give an example: a node split is trained by comparing a pool
of 400 candidate features to find the greatest information gain. The
features consist of 2 predefined ones and 398 computed ones, defined by
a /user-given/ function with randomly generated input parameters.
This little example should work fine for one node. When training a
whole forest, we need lots of features, and they should be randomly
selected from a pool of predefined ones and randomly generated by a
function. Here a small problem arises:
The proportion problem:
v = [0.5, 0.3, 0.8, 1.2, 0.09, 0.2, …]  # predefined set of features
f(x, theta)  # function 1
g(x, theta)  # function 2
feature_set = [v, f, g]
The subset given to a tree or node then consists of candidate features
drawn from v, f and g. But in which proportion? Should we prefer
predefined features or generated ones? Would it help to use a weighting,
so the user can decide how much influence each feature source gets?
This could look like the following:
Weight = (w_v, w_f, w_g), for example: Weight = (0.2, 0.5, 0.3)
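A rough sketch of how such a weighting could steer the sampling for one
node (interface and names are made up):

import numpy as np

rng = np.random.RandomState(0)

v = np.array([0.5, 0.3, 0.8, 1.2, 0.09, 0.2])  # predefined features
f = lambda x, theta: (x * theta).sum()          # toy stand-in for function 1
g = lambda x, theta: np.abs(x - theta).max()    # toy stand-in for function 2

weights = [0.2, 0.5, 0.3]                       # (w_v, w_f, w_g)

# decide, for each of the node's 400 candidates, which source it comes from
sources = rng.choice(["v", "f", "g"], size=400, p=weights)
n_predefined = (sources == "v").sum()           # in expectation 0.2 * 400 = 80

Note that with only six entries in v, those roughly 80 draws necessarily
reuse the same predefined features over and over, which is exactly
situation 1 below.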
I am struggling here because there are a few situations that can occur
and that make this decision a bit tricky:
1) number of predefined features << feature pool size for one node
2) number of predefined features >> feature pool size for one node
3) number of predefined features ≃ feature pool size for one node
Situation 1 leads, on the one hand, to inordinate reuse of the few
predefined features; on the other hand, it is possible that some trees
use no predefined features at all.
The second situation is the mirror image: the on-demand features could
be ignored or drawn only in very small numbers, even though these
functions are designed to be instantiated in huge numbers because of
their probabilistic nature.
In contrast to the settings above, in the last situation an equal
distribution across all input sources (predefined, function 1,
function 2, …) would probably be preferred. What do you think?
I hope this new functionality will strike a chord with you and the
scikit-learn community.
Additionally, any feedback concerning the problem mentioned above and,
later on, runtime and usability improvements would be greatly appreciated.
Best regards,
Alex
[1] Criminisi, A., Shotton, J., Bucciarelli, S. 2009. Decision
Forests with Long-Range Spatial Context for Organ Localization in
CT Volumes. MICCAI Workshop on Probabilistic Models for Medical
Image Analysis (MICCAI-PMMIA). URL:
http://research.microsoft.com/apps/pubs/default.aspx?id=81675