Hi Stuart Reynolds,

Like Jacob said, we have an active PR at https://github.com/scikit-learn/scikit-learn/pull/5974
You could do

    git fetch https://github.com/raghavrv/scikit-learn.git missing_values_rf:missing_values_rf
    git checkout missing_values_rf
    python setup.py install

and try it out. I warn you though, there are some memory leaks I'm trying to debug. But for the most part it works well and outperforms basic imputation techniques. Please let us know if it breaks or doesn't solve your use case. Your input as a user of that feature would be invaluable!

> I ran into this several times as well with scikit-learn's implementation of GBM. Look at xgboost if you have not already (is there someone out there that hasn't? :) - it deals with missing values in the predictor space in a very elegant manner. http://xgboost.readthedocs.io/en/latest/python/python_intro.html

The PR handles it in a conceptually similar approach. It is currently implemented for DecisionTreeClassifier. After review and integration, DecisionTreeRegressor would also support missing values. Once that happens, enabling it in gradient boosting will be possible.

Thanks for the interest!!

On Thu, Oct 13, 2016 at 8:33 PM, Raphael C <drr...@gmail.com> wrote:

> You can simply make a new binary feature (per feature that might have a
> missing value) that is 1 if the value is missing and 0 otherwise. The RF
> can then work out what to do with this information.
>
> I don't know how this compares in practice to more sophisticated
> approaches.
>
> Raphael
>
> On Thursday, October 13, 2016, Stuart Reynolds <stu...@stuartreynolds.net> wrote:
>
>> I'm looking for a decision tree and RF implementation that supports
>> missing data (without imputation) -- ideally in Python, Java/Scala or C++.
>>
>> It seems that scikit's decision tree algorithm doesn't allow this --
>> which is disappointing because it's one of the few methods that should be
>> able to sensibly handle problems with high amounts of missingness.
>>
>> Are there plans to allow missing data in scikit's decision trees?
>>
>> Also, is there any particular reason why missing values weren't supported
>> originally (e.g. integrates poorly with other features)?
>>
>> Regards
>> - Stuart
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
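For reference, the binary-indicator trick Raphael describes can be sketched in a few lines of plain NumPy. This is just a minimal illustration of the idea, not code from the PR; `add_missing_indicators` is a made-up helper name:

```python
import numpy as np

def add_missing_indicators(X):
    """Append one binary column per feature: 1 where the value was missing, else 0.

    NaNs in the original columns are replaced with 0 so that learners which
    reject NaN (like scikit-learn's trees) accept the matrix; the appended
    indicator columns preserve the missingness information for the model.
    """
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)
    X_filled = np.where(mask, 0.0, X)
    return np.hstack([X_filled, mask.astype(float)])

X = np.array([[1.0, np.nan],
              [2.0, 3.0]])
print(add_missing_indicators(X))
# [[1. 0. 0. 1.]
#  [2. 3. 0. 0.]]
```

The fill value (0 here) is arbitrary; a tree can split on the indicator column first, so the filled value mostly matters only for rows where the feature is present.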