Hi Martin,

I think that you could use `imbalanced-learn` and a bit of pandas/NumPy to get the behaviour that you want. You can use a `FunctionSampler` (https://imbalanced-learn.org/stable/references/generated/imblearn.FunctionSampler.html) in which you remove the samples containing missing values. This resampling is only applied when calling `fit`. You will need to use the `Pipeline` from `imbalanced-learn` as well, so that the resampling is skipped at prediction time.
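Here is a minimal sketch of what I mean (untested; the toy arrays and the helper name `drop_rows_with_nan` are just placeholders, and note that `validate=False` is needed so that the NaNs actually reach your function):

```python
import numpy as np
from imblearn import FunctionSampler
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# toy data with missing values (placeholder for your dataset)
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0],
              [6.0, np.nan], [7.0, 8.0], [9.0, 1.0]])
y = np.array([0, 0, 1, 1, 0, 1])

def drop_rows_with_nan(X, y):
    """Keep only the samples without any missing value."""
    mask = ~np.isnan(X).any(axis=1)
    return X[mask], y[mask]

model = Pipeline(steps=[
    # validate=False so that NaNs are passed through to our function
    ("drop_nan", FunctionSampler(func=drop_rows_with_nan, validate=False)),
    ("forest", RandomForestClassifier(random_state=0)),
])

# the sampler is applied inside `fit` only; at `predict` time the data
# goes straight to the forest, untouched
model.fit(X, y)
```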
In some way, it seems that you want to resample the training set, which is what the `Sampler` objects in `imbalanced-learn` are intended for.

Cheers,

On Fri, 10 Mar 2023 at 14:21, Martin Gütlein <guetl...@posteo.de> wrote:
> Hi Gaël,
>
> > [...] the other one that amounts to imputation using a forest, and
> > can be done in scikit-learn by setting up the IterativeImputer to use
> > forests as a base learner (this will however be slow).
>
> The main difference is that when I use the IterativeImputer in
> scikit-learn, I still have to apply this imputation on the test set
> before being able to predict with the RF. However, other implementations
> do not impute missing values, but instead split up the test instance.
>
> In my experience this makes a big difference: you are able to use
> features where the majority of values is missing, and where at the same
> time the class ratio of the examples with missing values is largely
> different from that of the examples without missing values.
>
> Kind regards,
> Martin
>
> On 03.03.2023 at 15:41, Gael Varoquaux wrote:
> > On Fri, Mar 03, 2023 at 10:22:04AM +0000, Martin Gütlein wrote:
> >> > 2. Ignores whether a value is missing or not for the inference
> >> What I meant is rather that the missing value should NOT be treated
> >> as another possible value of the variable (this is, e.g., what the
> >> HistGradientBoostingClassifier implementation in scikit-learn does).
> >> Instead, multiple predictions could be made when a split attribute
> >> is missing, and those can be averaged.
> >
> >> This is how it is implemented in WEKA, for example (we cannot switch
> >> to Java, though ;-):
> >> http://web.archive.org/web/20080601175721/http://wekadocs.com/node/2/#_edn4
> >> and described by the inventors of the RF:
> >> https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1
> >
> > The text that you link to describes two types of strategies, one that
> > is similar to that done in HistGradientBoosting, the other one that
> > amounts to imputation using a forest, and can be done in scikit-learn
> > by setting up the IterativeImputer to use forests as a base learner
> > (this will however be slow).
> >
> > Cheers,
> >
> > Gaël

--
Guillaume Lemaitre
Scikit-learn @ Inria Foundation
https://glemaitre.github.io/
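P.S. For completeness, the `IterativeImputer` approach that Gaël mentions would look roughly like this (again only a sketch with toy arrays; `IterativeImputer` is still experimental, hence the explicit enable import, and using forests as the base learner will be slow):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# toy data (placeholder for a real dataset)
X_train = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, np.nan]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[np.nan, 4.0]])

# impute each feature from the others, using a forest as the base learner
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    random_state=0,
)
clf = RandomForestClassifier(random_state=0)
clf.fit(imputer.fit_transform(X_train), y_train)

# as Martin points out, the same imputation has to be applied to the
# test set before predicting
pred = clf.predict(imputer.transform(X_test))
```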