It would already help us if someone could confirm that this is not
possible in scikit-learn, because we are still not entirely sure that
we have not missed something.
Regards,
Martin
On 21.02.2023 at 15:48, Martin Gütlein wrote:
Hi,
I am looking for a classification model in Python that can handle
missing values without imputation and without "learning from missing
values", i.e. without using the fact that a value is missing as
information for the inference.
Explained with the help of decision trees:
* The algorithm should NOT learn whether missing values should go to
the left or the right child (as HistGradientBoostingClassifier does).
* Instead, it could build the prediction for each child node and
aggregate these (as some Random Forest implementations do); see the
sketch below.
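To make this concrete, here is a minimal sketch of the aggregation
behaviour I mean. The Node class and all its fields are hypothetical,
for illustration only; this is not scikit-learn's tree API:

import numpy as np

class Node:
    """Hypothetical tree node, for illustration only."""
    def __init__(self, feature=None, threshold=None,
                 left=None, right=None, class_probs=None):
        self.feature = feature          # split feature index (None at a leaf)
        self.threshold = threshold      # split threshold
        self.left = left                # child for x[feature] <= threshold
        self.right = right              # child for x[feature] >  threshold
        self.class_probs = class_probs  # class distribution at a leaf

def predict_proba(node, x):
    # At a leaf, return its class distribution.
    if node.feature is None:
        return node.class_probs
    v = x[node.feature]
    # Missing value: descend into BOTH children and average their
    # predictions, instead of learning a preferred direction.
    if np.isnan(v):
        return 0.5 * (predict_proba(node.left, x) +
                      predict_proba(node.right, x))
    child = node.left if v <= node.threshold else node.right
    return predict_proba(child, x)

# Tiny hand-built tree splitting on feature 0 at threshold 0.5:
tree = Node(feature=0, threshold=0.5,
            left=Node(class_probs=np.array([0.9, 0.1])),
            right=Node(class_probs=np.array([0.2, 0.8])))
print(predict_proba(tree, np.array([np.nan])))   # -> [0.55 0.45]

(A real implementation would probably weight the two children by their
training sample counts rather than 0.5/0.5, but the idea is the same.)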
If that is not possible in scikit-learn, maybe you have already
discussed this? Or do you know of a fork of scikit-learn that can do
this, or some other Python library?
Any help would be really appreciated, kind regards,
Martin
P.S. Here is my use case, in case you are interested: I have a binary
classification problem with a positive and a negative class, and two
types of features, A and B. In my training data, B is missing in most
rows (90%). In my test data, B is always present, which is good,
because the B features are better than the A features. In the training
cases where B is present, the ratio of positive examples is much
higher than where it is missing. So HistGradientBoostingClassifier
exploits the fact that B is never missing in the test data and
predicts far too many positives. (Additionally, some feature values of
type A are also often missing.)
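Here is a small synthetic sketch that reproduces the effect (all
numbers are made up; only the structure matches my data):

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(0)
n = 10_000
A = rng.normal(size=(n, 1))              # feature A, always present
B = rng.normal(size=(n, 1))              # feature B, values uninformative here
miss = rng.random(n) < 0.9               # B missing in 90% of training rows
# positives are much more frequent where B is observed
y = np.where(miss, rng.random(n) < 0.1,  # ~10% positives when B is missing
                   rng.random(n) < 0.6)  # ~60% positives when B is present
y = y.astype(int)
B[miss] = np.nan
X_train = np.hstack([A, B])

clf = HistGradientBoostingClassifier(random_state=0).fit(X_train, y)

# In the test data, B is always present:
X_test = np.hstack([rng.normal(size=(1000, 1)), rng.normal(size=(1000, 1))])
print("overall positive rate in training:", y.mean())
print("predicted positive rate on test:  ", clf.predict(X_test).mean())

The model learns "B observed => likely positive", so the predicted
positive rate on the test data ends up far above the overall training
base rate, even though the B values themselves carry no signal here.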