Hi Robert,
Thanks for the report. This is definitely not something just on your end;
MAE does run longer than MSE, especially on larger datasets, because
evaluating a candidate split under MAE requires finding medians
(expensive), while MSE only needs means (cheap). We've used a variety of
tricks to make tree growing faster, but it still remains quite slow on
these larger datasets.
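To illustrate the cost gap, here's a toy, pure-Python sketch of scanning
all candidate splits of a sorted node under each criterion. This is not
scikit-learn's actual Cython criterion code; the function names and data
are made up for illustration only. The MSE scan keeps running sums, so
each split is an O(1) update; the naive MAE scan recomputes medians per
split, so the whole scan is O(n^2 log n):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated clusters, sorted the way a node sees candidate
# thresholds along one feature.
y = np.sort(np.concatenate([rng.normal(-5, 1, 500),
                            rng.normal(5, 1, 500)]))

def mse_scan(y):
    """Return the split index minimizing total within-side squared error.

    Running sums make each candidate split an O(1) update, so the
    whole scan over n candidates is O(n)."""
    n = len(y)
    total, total_sq = y.sum(), (y ** 2).sum()
    left_sum = left_sq = 0.0
    best, best_i = np.inf, 0
    for i in range(1, n):
        left_sum += y[i - 1]
        left_sq += y[i - 1] ** 2
        # SSE of a side = sum(y^2) - (sum y)^2 / count
        sse = (left_sq - left_sum ** 2 / i
               + (total_sq - left_sq)
               - (total - left_sum) ** 2 / (n - i))
        if sse < best:
            best, best_i = sse, i
    return best_i

def mae_scan(y):
    """Return the split index minimizing total within-side absolute error.

    Each candidate split needs fresh medians for both sides, so the
    naive scan is O(n^2 log n) rather than O(n)."""
    n = len(y)
    best, best_i = np.inf, 0
    for i in range(1, n):
        sae = (np.abs(y[:i] - np.median(y[:i])).sum()
               + np.abs(y[i:] - np.median(y[i:])).sum())
        if sae < best:
            best, best_i = sae, i
    return best_i

print(mse_scan(y), mae_scan(y))  # both find the gap near index 500
```

Both criteria agree on where to split here; the difference is purely how
much work each one does per candidate, and that per-split cost is what
blows up on large datasets.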

I've been working on a patch to speed it up by using a binary mask to
further reduce the amount of computation MAE needs per split, but I've been
bogged down with real life recently and haven't had a chance to wrap it up.

Nelson Liu

On Sun, Oct 23, 2016 at 3:37 PM, Robert Slater <[email protected]> wrote:

> I searched the archives to see if this was a known issue, but could not
> seem to find anyone else having the problem.
>
> Nevertheless, in the latest version (0.18) RandomForestRegressor has the
> new option of 'mae' for criterion.  However, it appears to run
> disproportionately longer than the 'mse' criterion.
>
> For example:
>
> from sklearn.ensemble import RandomForestRegressor
> rf_tree = 50
> rf_depth = 5
> rf = RandomForestRegressor(n_estimators=rf_tree, criterion='mae',
>                            max_depth=rf_depth, min_samples_split=4,
>                            min_samples_leaf=2, max_features=0.5,
>                            max_leaf_nodes=5, oob_score=True,
>                            n_jobs=1, random_state=0, verbose=1)
>
> from sklearn.ensemble import ExtraTreesRegressor
> et_tree = 100
> et = ExtraTreesRegressor(n_estimators=et_tree, max_depth=5,
>                          min_samples_split=4, min_samples_leaf=2,
>                          max_features=0.5, verbose=1, n_jobs=4)
>
> from sklearn.model_selection import train_test_split
> from sklearn.metrics import mean_absolute_error
> X_train, X_test, y_train, y_test = train_test_split(train, loss,
>                                                     test_size=0.2,
>                                                     random_state=42)
>
> et.fit(X_train,y_train)
> rf.fit(X_train,y_train)
>
> rf_pred=rf.predict(X_test)
> et_pred=et.predict(X_test)
>
> print(mean_absolute_error(y_test,rf_pred))
> print(mean_absolute_error(y_test,et_pred))
>
> I was using these two for a recent Kaggle competition.  If I use
> criterion='mse' in the random forest, it takes around 1 minute to build.
> Switching to 'mae' causes 100% CPU usage and at least 30 minutes of wait
> time before I kill my kernel.
>
> Not sure if the problem is on my end or if there is a real issue, so I
> wanted to reach out and see if others are seeing it too.
>
> _______________________________________________
> scikit-learn mailing list
> [email protected]
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>