For my knowledge’s sake, could you please inform me about the
technique being employed now to take advantage of the correlations
between targets? Is it the Mahalanobis distance or some other metric?
In other words, could you please give me a hint as to the underlying
reason why the single output predictions differ from the multioutput
predictions?
I don't know much more than what's already in the doc that I linked to.
Namely, the best split is chosen to minimize the *average* criterion
across all outputs, instead of just using a single output. You'll find
more details in the code.
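If it helps to make that concrete, here is a rough sketch of the
averaged criterion (an illustration only, not scikit-learn's actual
implementation, which lives in the Cython tree code; the helper names
here are made up):

    import numpy as np

    def node_impurity(y):
        # MSE criterion per output (variance around the mean
        # prediction), averaged over all outputs -- this is the
        # "average criterion" across targets.
        return np.var(y, axis=0).mean()

    def split_score(y, left_mask):
        # Weighted impurity of a candidate split; the tree greedily
        # picks the split that minimizes this score.
        y_left, y_right = y[left_mask], y[~left_mask]
        n = len(y)
        return (len(y_left) / n * node_impurity(y_left)
                + len(y_right) / n * node_impurity(y_right))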
About the docs: we generally try to write all the useful info about the
estimators in the "User Guide" section
(https://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees).
In this case you can find a link to the multi-output handling. Sometimes
the info is instead in the docstrings. That's not always ideal, and the
link might not have been there when you first looked. We're working hard
to keep improving the docs, but there's so much info that it's easy to
miss some...
Welcome back to Python!
On 2/14/20 8:47 PM, Paul Chike Ofoche via scikit-learn wrote:
Many thanks Nicolas and Andreas.
I appreciate your taking the time and effort to look into the issue
that I raised and for pointing me to the documentation. It is quite
pleasant to know that scikit-learn’s RandomForestRegressor handles
multioutput cases. This issue has been very important to me; it was the
sole reason that I switched from Python to R for my research in the
Fall of 2018, and I have seldom used Python since then.
I became convinced of my earlier stance when reading documentation such
as
https://scikit-learn.org/stable/modules/multiclass.html#multioutput-regression,
which explains that “MultiOutputRegressor fits one regressor per
target and cannot take advantage of correlations between targets”,
although I am aware that MultiOutputRegressor is different from
RandomForestRegressor.
I was wondering whether this multioutput handling capability of the
RandomForestRegressor was added recently. To verify, I went on a
fact-finding mission by re-running the exact same code I had in 2018
and noticed quite a number of changes. I guess that many moons have
passed since then!
For instance, sklearn.cross_validation has been deprecated since I last
used it in 2018 (and replaced by sklearn.model_selection).
Also, errors such as:
i. ValueError: Expected 2D array, got scalar array instead:
array=6.5.
Reshape your data either using array.reshape(-1, 1) if your data has a
single feature or array.reshape(1, -1) if it contains a single sample.
and
ii. DataConversionWarning: A column-vector y was passed when a 1d
array was expected. Please change the shape of y to (n_samples,), for
example using ravel().
raised when passing a *scalar* and a *column-vector y* respectively,
are entirely new since I last used Python’s RandomForestRegressor.
Previously, both worked just fine without throwing any errors. I know
that the “multioutputs” were handled back in 2018 (I actually tested
this capability back then), but I assumed that the regressors were fit
one per target, i.e. that no correlation between targets was taken
into account.
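Both of those messages point to their own fix; for reference, a minimal
sketch (with made-up data) of the shapes that current scikit-learn
expects:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    X = np.arange(10).reshape(-1, 1)          # (n_samples, n_features)
    y = np.random.RandomState(0).rand(10, 1)  # a column vector

    model = RandomForestRegressor(random_state=0)
    model.fit(X, y.ravel())  # ravel() avoids the DataConversionWarning

    # model.predict(6.5) would raise the "Expected 2D array" ValueError;
    # a single sample must be wrapped as a 2D array instead:
    model.predict(np.array(6.5).reshape(1, -1))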
Today, for comparison, I generated some random target outputs (three
columns) and, using the same *random_state*, ran the all-inclusive
multioutput prediction (all three output targets simultaneously) vs.
re-running each output prediction one at a time. The results are
different, implying that some form of correlation amongst the
multioutput targets is taken into account when they are predicted
together. (For completeness, I display the first 28 predicted output
values from the multioutput prediction as well as from the single
output predictions.)
Results from the multioutput prediction of the targets (capturing
their correlations).
Results from the individual prediction of each single output target.
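A minimal script along these lines (with made-up data, but the same
comparison) shows the effect:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.RandomState(42)
    X = rng.rand(200, 5)
    Y = rng.rand(200, 3)      # three target columns
    X_test = rng.rand(10, 5)

    # One forest fit on all three targets at once
    multi = RandomForestRegressor(random_state=0).fit(X, Y)
    pred_multi = multi.predict(X_test)

    # Three forests, one per target, with the same random_state
    pred_single = np.column_stack([
        RandomForestRegressor(random_state=0)
        .fit(X, Y[:, k]).predict(X_test)
        for k in range(3)
    ])

    # Typically False: the multioutput forest picks its splits by
    # averaging the criterion over all outputs, so the trees differ.
    print(np.allclose(pred_multi, pred_single))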
For my knowledge’s sake, could you please inform me about the
technique being employed now to take advantage of the correlations
between targets? Is it the Mahalanobis distance or some other metric?
In other words, could you please give me a hint as to the underlying
reason why the single output predictions differ from the multioutput
predictions? I am curious to know, as this would finally settle the
question for me after nearly two years. I will have to retrace my steps
and get back to the good old Python ways (again). Thank you.
Highest regards,
Paul
On Friday, February 14, 2020, 07:00:35 a.m. CST, Nicolas Hug
<nio...@gmail.com> wrote:
Hi Paul,
The way multioutput is handled in decision trees (and thus in the
forests) is described in
https://scikit-learn.org/stable/modules/tree.html#multi-output-problems.
As you can see, the correlation between the output values *is* taken
into account.
Can you explain what you would like to modify there?
Nicolas
On 2/14/20 7:37 AM, Paul Chike Ofoche via scikit-learn wrote:
Scikit-learn random forest does *not* handle the multi-output case,
but only maps to each output one at a time, thereby not accounting for
the correlation between multi-outputs, which is what the Mahalanobis
distance does. I, as well as other researchers, have observed this
issue for as long as two years. Could there be a solution to implement
it in RandomForest, since Python already has a function that computes
Mahalanobis distances?
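For reference, the function presumably meant here is
scipy.spatial.distance.mahalanobis, which takes the inverse covariance
matrix of the outputs:

    import numpy as np
    from scipy.spatial.distance import mahalanobis

    rng = np.random.RandomState(0)
    Y = rng.rand(100, 3)             # samples of a 3-dimensional output
    VI = np.linalg.inv(np.cov(Y.T))  # inverse covariance of the outputs

    # Distance between two output vectors that accounts for the
    # correlations between the output dimensions
    print(mahalanobis(Y[0], Y[1], VI))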
On Thursday, February 13, 2020, 10:15:11 PM CST, Andreas Mueller
<t3k...@gmail.com> wrote:
On 2/9/20 12:21 PM, Paul Chike Ofoche via scikit-learn wrote:
Hello all,
My name is Paul and I am enthused about data science. I have been
using Python and other programming languages for close to two years.
There is an issue that I have been facing since I began applying
Python to the analysis of my research work.
My question has remained unanswered for months. Has nobody else run
into the need to work with data where the regression results are
multiple outputs whose parameters are correlated with each other? This
is called a multi-output multivariate problem. A version of random
forest that handles multiple outputs is referred to as the multivariate
random forest. It is implemented in the programming language R (see
attached reference documentation below).
The scikit-learn random forest actually handles this. It doesn't use
the Mahalanobis distance, but that seems like a simple preprocessing
step.
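A minimal sketch of what such preprocessing could look like (an
illustration, not an existing scikit-learn feature): whiten the targets
with the Cholesky factor of their covariance, so that Euclidean
distances on the whitened targets match Mahalanobis distances on the
originals, fit one multioutput forest, then map the predictions back:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.RandomState(0)
    X = rng.rand(200, 5)
    Y = rng.rand(200, 3)

    # L @ L.T = cov(Y); in the whitened space, Euclidean distance
    # corresponds to Mahalanobis distance in the original space.
    L = np.linalg.cholesky(np.cov(Y.T))
    mu = Y.mean(axis=0)
    Y_white = np.linalg.solve(L, (Y - mu).T).T

    forest = RandomForestRegressor(random_state=0).fit(X, Y_white)

    # Map predictions back to the original target space
    Y_pred = forest.predict(X) @ L.T + mu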
To date, there exists no such package in Python. My question is whether
anybody knows how to go about implementing this. The random forest
univariate regression case uses the Euclidean distance as the
measurement criterion, whereas the multivariate regression case uses
the Mahalanobis distance, which takes into account the
inter-relationships between the multiple outputs. I have inquired about
an equivalent capability in Python for many years, but it has still not
been addressed. Such a multivariate random forest mode is very
applicable to the type of research and analysis that I do. Could
someone help, please?
Thank you,
Paul Ofoche
PS: This is an important need for multivariate output analysis as a
technique for solving practical research problems. Here are some
questions posted by various other Python users concerning this same
issue.
https://datascience.stackexchange.com/questions/21637/code-for-multivariate-random-forest-in-python-r
Multi-output regression
<https://stackoverflow.com/questions/49391637/multi-output-regression>
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn