Hi,

I am looking for a classification model in Python that can handle missing values without imputation and without "learning from missing values", i.e. without using the fact that a value is missing as information at inference time.

Explained with the help of decision trees:
* The algorithm should NOT learn whether missing values should go to the left or the right child (as HistGradientBoostingClassifier does).
* Instead, it could build the prediction for each child node and aggregate these (like some Random Forest implementations do); see the sketch below.
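
For concreteness, here is a minimal sketch of what I mean by the second option, assuming a plain DecisionTreeClassifier fitted on rows that are complete in the features it splits on (so the tree itself never learns a missing-value direction). At prediction time, when the split feature is NaN, the sample is sent down both children with weights proportional to the training samples that reached each child, and the weighted leaf distributions are summed. The function name and the weighting scheme are my own illustration, not an existing scikit-learn API:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def predict_proba_with_missing(clf, X):
    # clf: a fitted DecisionTreeClassifier, trained on complete rows.
    # X: array that may contain NaN; a sample with a missing split
    # feature is routed down *both* children, weighted by the number
    # of training samples in each child (C4.5-style fractional cases).
    tree = clf.tree_

    def recurse(node, x, weight):
        if tree.children_left[node] == -1:  # leaf: weighted class distribution
            counts = tree.value[node][0]
            return weight * counts / counts.sum()
        f, t = tree.feature[node], tree.threshold[node]
        left, right = tree.children_left[node], tree.children_right[node]
        if np.isnan(x[f]):  # missing: follow both branches fractionally
            w_l = tree.weighted_n_node_samples[left]
            w_r = tree.weighted_n_node_samples[right]
            return (recurse(left, x, weight * w_l / (w_l + w_r))
                    + recurse(right, x, weight * w_r / (w_l + w_r)))
        return recurse(left if x[f] <= t else right, x, weight)

    return np.vstack([recurse(0, x, 1.0) for x in np.asarray(X, dtype=float)])

The same recursion could then be averaged over the estimators_ of a RandomForestClassifier to get the forest-level behaviour.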

If that is not possible in scikit-learn, maybe you have already discussed this? Or do you know of a fork of scikit-learn that can do this, or some other Python library?

Any help would be really appreciated.

Kind regards,
Martin


P.S. Here is my use case, in case you are interested: I have a binary classification problem with a positive and a negative class, and two types of features, A and B. In my training data, B is missing in most rows (90%). In my test data, B is always present, which is good because the B features are more informative than the A features. In the training rows where B is present, the ratio of positive examples is much higher than where it is missing. So HistGradientBoostingClassifier exploits the fact that B is never missing in the test data and predicts far too many positives. (Additionally, some feature values of type A are also often missing.)
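
To make this concrete, a toy script along these lines should reproduce the behaviour (the numbers and feature construction are made up for illustration; only the missingness pattern mirrors my data):

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(0)
n = 10_000
A = rng.normal(size=n)
B = rng.normal(size=n)
y = (B + 0.2 * A + rng.normal(scale=0.5, size=n) > 0).astype(int)

# B is observed far more often for positives, so missingness is informative.
b_present = rng.random(n) < np.where(y == 1, 0.2, 0.04)
X_train = np.column_stack([A, np.where(b_present, B, np.nan)])

clf = HistGradientBoostingClassifier(random_state=0).fit(X_train, y)

# At test time B is always present; since the model has learned that
# "B not missing" correlates with the positive class, the predicted
# positive rate ends up well above the base rate.
X_test = np.column_stack([A, B])
print("base rate:", y.mean())
print("predicted positive rate:", clf.predict(X_test).mean())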