Re: [scikit-learn] Need for multioutput multivariate algorithm for Random Forest in Python (using Mahalanobis distance)

Andreas Mueller Mon, 17 Feb 2020 15:36:13 -0800


On 2/14/20 5:47 PM, Paul Chike Ofoche via scikit-learn wrote:

Many thanks Nicolas and Andreas.
I was wondering whether this multioutput handling capability of the RandomForestRegressor has been added recently. In order to verify, I went on a fact-finding mission by re-running the exact same code I had in 2018 and noticed quite a number of changes. I guess that many moons have passed since then!
For instance, sklearn.cross_validation has been deprecated since I last used it in 2018 (and replaced by sklearn.model_selection). Also, such errors as:
i. ValueError: Expected 2D array, got scalar array instead:

array=6.5.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
and
ii. DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().

All of these were already errors in 2018; you might not have had the most up-to-date version then ;)

cross_validation was deprecated in 2016:
https://scikit-learn.org/dev/whats_new/v0.18.html#version-0-18
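For reference, here is a minimal sketch of how the two messages above are usually avoided with a current version. The data is synthetic and the variable names are made up for illustration; only the import path, reshape(1, -1), and ravel() come from the messages themselves.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
# sklearn.model_selection replaces the old sklearn.cross_validation module
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): one feature, one target.
rng = np.random.RandomState(0)
X = rng.rand(100, 1)
y = 2.0 * X + rng.normal(scale=0.1, size=(100, 1))  # column vector, shape (100, 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=10, random_state=0)
# ravel() turns the (n_samples, 1) column vector into shape (n_samples,),
# which avoids the DataConversionWarning.
model.fit(X_train, y_train.ravel())

# A bare scalar such as 6.5 is rejected; reshape it into a single sample
# with a single feature, i.e. shape (1, 1).
single_sample = np.array(6.5).reshape(1, -1)
pred = model.predict(single_sample)
print(single_sample.shape, pred.shape)
```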

when passing a *scalar* and a *column-vector y* respectively are entirely new from when last I made use of Python’s RandomForestRegressor. Previously, they worked just fine without throwing any errors. I know that the “multioutputs” were handled back in 2018 (I actually tested this capability back then), but I assumed that the regressors were fit per target, i.e. that there was no correlation between targets.

I can't find a changelog entry, but I'm pretty sure this goes back to 2014 or so. It was definitely present in 2018.

Today, for comparison, I generated some random target outputs (three columns) and using the same *random_state*, I ran the all-inclusive multioutput prediction (with all three output targets simultaneously vs. re-running each output prediction one at a time). The results are different, implying that some form of correlation takes place amongst the multioutput targets, when predicted together. (For completeness, I display the first 28 predicted output values, from the multioutput prediction as well as the single output predictions.)
For my knowledge’s sake, could you please inform me about the technique being employed now to take advantage of the correlations between targets? Is it the Mahalanobis distance or some other metric? In other words, could you please give me a hint as to the underlying reason why the single output predictions differ from the multioutput predictions? I am curious to know as this would finally fully quench my appetite after nearly two years. I will have to retrace my steps and get back to the good old Python ways (again). Thank you.

It doesn't explicitly use the correlation. The splitting criterion is the sum of the per-output splitting criteria. That means there's an implicit regularization, as the tree is shared between the targets.
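A small sketch of what that means in practice, on synthetic random data (the data and variable names are assumptions for illustration, not your setup). It fits one forest on all three targets and three forests on one target each with the same random_state, and checks that the predictions differ. It also checks, for a single tree, that the root impurity matches the per-output variance averaged over the outputs; as far as I can tell the implementation averages rather than sums, which only changes the criterion by a constant factor and so picks the same splits.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): 5 features, 3 random targets.
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
Y = rng.rand(200, 3)

# One forest for all targets: every tree chooses each split by looking at
# all three outputs at once, so the tree structure is shared.
joint = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, Y).predict(X)

# One forest per target: each tree only sees its own output.
separate = np.column_stack([
    RandomForestRegressor(n_estimators=20, random_state=0)
    .fit(X, Y[:, k])
    .predict(X)
    for k in range(Y.shape[1])
])

print("max abs difference:", np.abs(joint - separate).max())

# For a single multi-output tree, the root node holds all samples, so its
# impurity should equal the per-output variance averaged over the outputs.
tree = DecisionTreeRegressor(random_state=0).fit(X, Y)
root_impurity = tree.tree_.impurity[0]
mean_output_variance = Y.var(axis=0).mean()
print(root_impurity, mean_output_variance)
```

The joint and per-target predictions differ because the shared trees pick different splits, not because any distance or correlation between targets is computed.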

_______________________________________________
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn
