Hi Gaël,

[...] the other one that
amounts to imputation using a forest, and can be done in scikit-learn
by setting up the IterativeImputer to use forests as a base learner
(this will however be slow).

The main difference is that with the IterativeImputer in scikit-learn, I still have to apply the imputation to the test set before I can predict with the RF. Other implementations do not impute missing values at all; instead, they split up the test instance when it reaches a node whose split attribute is missing.
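For reference, this is roughly the scikit-learn setup I mean (a minimal sketch; X_train, y_train, X_test are placeholders and the forest settings are arbitrary):

from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.impute import IterativeImputer
from sklearn.pipeline import make_pipeline

# The imputer is fitted on the training data and re-applied to the test
# data by the pipeline; a forest serves as the base learner for imputation.
model = make_pipeline(
    IterativeImputer(estimator=RandomForestRegressor(n_estimators=50)),
    RandomForestClassifier(),
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)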

In my experience this makes a big difference: it lets you use features where the majority of values is missing, even when the class ratio of the examples with missing values differs strongly from that of the examples without missing values.
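To make the split-up approach concrete, here is a toy sketch of mine of the prediction step (the node structure with is_leaf, feature, threshold, n_left, n_right and class_distribution is hypothetical, not WEKA or scikit-learn code). When the split attribute is missing, the instance is sent down both branches, weighted by the fraction of training instances that went each way, and the resulting class distributions are averaged:

import numpy as np

def predict_proba(node, x):
    # Leaf: return the stored class distribution, e.g. np.array([p0, p1]).
    if node.is_leaf:
        return node.class_distribution
    # Split attribute missing: follow BOTH children, weighted by the
    # fraction of training instances that went left at this node.
    if np.isnan(x[node.feature]):
        w = node.n_left / (node.n_left + node.n_right)
        return (w * predict_proba(node.left, x)
                + (1.0 - w) * predict_proba(node.right, x))
    # Otherwise: the usual single-branch descent.
    child = node.left if x[node.feature] <= node.threshold else node.right
    return predict_proba(child, x)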

Kind regards,
Martin





On 03.03.2023 15:41, Gael Varoquaux wrote:
> On Fri, Mar 03, 2023 at 10:22:04AM +0000, Martin Gütlein wrote:
> > > 2. Ignores whether a value is missing or not for the inference
> > What I meant is rather that the missing value should NOT be treated as
> > another possible value of the variable (this is, e.g., what the
> > HistGradientBoostingClassifier implementation in scikit-learn does).
> > Instead, multiple predictions could be made when a split attribute is
> > missing, and those can be averaged.
> >
> > This is how it is implemented in WEKA, for example (we cannot switch
> > to Java, though ;-):
> > http://web.archive.org/web/20080601175721/http://wekadocs.com/node/2/#_edn4
> > and described by the inventors of the RF:
> > https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1
>
> The text that you link to describes two types of strategies, one that
> is similar to that done in HistGradientBoosting, the other one that
> amounts to imputation using a forest, and can be done in scikit-learn
> by setting up the IterativeImputer to use forests as a base learner
> (this will however be slow).
>
> Cheers,
>
> Gaël