Hi,
I am looking for a classification model in Python that can handle
missing values without imputation and without "learning from
missingness", i.e. without using the fact that a value is missing as
information at inference time.
To explain with the help of decision trees:
* The algorithm should NOT learn whether missing values should go to the
left or the right child (as HistGradientBoostingClassifier does).
* Instead it could build the prediction for each child node and
aggregate these (as some Random Forest implementations do; see the
sketch below).
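For concreteness, here is a minimal sketch of the aggregation I mean
(all names are my own; it assumes a plain sklearn.tree.DecisionTreeClassifier
fitted on complete data, and descends one sample through the fitted tree,
averaging both children, weighted by their training sample counts,
whenever the split feature is NaN):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def predict_proba_with_missing(clf: DecisionTreeClassifier, x):
    """Class probabilities for one sample x (1-D array, may contain NaN).

    Descends the fitted tree; at a split on a missing feature, recurses
    into BOTH children and averages their predictions, weighted by the
    number of training samples that reached each child.
    """
    tree = clf.tree_

    def recurse(node):
        if tree.children_left[node] == -1:  # leaf: normalized class counts
            counts = tree.value[node][0]
            return counts / counts.sum()
        f = tree.feature[node]
        if np.isnan(x[f]):  # missing: aggregate both children
            left = tree.children_left[node]
            right = tree.children_right[node]
            n_l, n_r = tree.n_node_samples[left], tree.n_node_samples[right]
            return (n_l * recurse(left) + n_r * recurse(right)) / (n_l + n_r)
        if x[f] <= tree.threshold[node]:  # observed: follow the split
            return recurse(tree.children_left[node])
        return recurse(tree.children_right[node])

    return recurse(0)

This is of course only a per-tree hack; what I am hoping for is a
library that does this (or something equivalent) natively.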
If that is not possible in scikit-learn, have you perhaps already
discussed this? Or do you know of a fork of scikit-learn that can do
this, or some other Python library?
Any help would be really appreciated. Kind regards,
Martin
P.S. Here is my use case, in case you are interested: I have a binary
classification problem with a positive and a negative class, and two
types of features, A and B. In my training data, B is missing in most
rows (90%). In my test data, B is always present, which is good because
the B features are more informative than the A features. In the cases
where B is present in the training data, the ratio of positive examples
is much higher than when it is missing. So HistGradientBoostingClassifier
exploits the fact that B is never missing in the test data and predicts
far too many positives. (Additionally, some feature values of type A are
also often missing.)
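To make the failure mode concrete, here is a toy reproduction with
invented numbers (the rates and signal strengths are made up; the only
point is that B's missingness is correlated with the label in training,
while B is fully observed at test time):

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(0)
n = 10_000

# Labels and two features: A is weakly informative, B strongly informative.
y = rng.integers(0, 2, n)
A = y + rng.normal(0, 2.0, n)   # noisy signal
B = y + rng.normal(0, 0.5, n)   # strong signal

# In training, B is observed in only ~10% of rows, mostly positives,
# so "B present" is itself predictive of the positive class.
observed = rng.random(n) < np.where(y == 1, 0.18, 0.02)
X_train = np.column_stack([A, np.where(observed, B, np.nan)])

# At test time B is always present.
y_test = rng.integers(0, 2, n)
X_test = np.column_stack([y_test + rng.normal(0, 2.0, n),
                          y_test + rng.normal(0, 0.5, n)])

clf = HistGradientBoostingClassifier().fit(X_train, y)
print("true positive rate in test labels:", y_test.mean())
print("predicted positive rate:", clf.predict(X_test).mean())
# Because "B present" implied "likely positive" in training, the model
# tends to over-predict the positive class on the fully observed test set.

This mirrors what I see on my real data.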