On 17/11/2020 09:57, Sole Galli via scikit-learn wrote:
> And I understand that it has to do with the cost function, because if we re-balance the dataset with, say, class_weight = 'balanced', then the probabilities seem to be calibrated as a result.
As far as I know, logistic regression will have well-calibrated probabilities even in the imbalanced case. However, with the default decision threshold of 0.5, some of the infrequent categories may never be predicted, since their predicted probability rarely exceeds the threshold.
If you use class_weight = 'balanced', the probabilities will no longer be well calibrated, but you would predict some of those infrequent categories.
See discussions in https://github.com/scikit-learn/scikit-learn/issues/10613 and linked issues.
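To make the point above concrete, here is a minimal sketch on synthetic data. The data-generating setup (one Gaussian feature, about 5% positives, 20000 samples) and all variable names are my own assumptions for illustration, not something taken from the linked issue. It compares the mean predicted probability with the observed positive rate, the Brier score, and the number of positive predictions at the default 0.5 threshold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

# Synthetic imbalanced data (assumed setup): ~5% positives, one feature
# drawn from N(0, 1) for negatives and N(1, 1) for positives.
rng = np.random.default_rng(0)
n = 20000
y = (rng.random(n) < 0.05).astype(int)
X = rng.normal(loc=y, scale=1.0).reshape(-1, 1)

# Same model, fitted without and with class re-weighting.
plain = LogisticRegression().fit(X, y)
balanced = LogisticRegression(class_weight="balanced").fit(X, y)

p_plain = plain.predict_proba(X)[:, 1]
p_bal = balanced.predict_proba(X)[:, 1]

# Calibration in the large: mean predicted probability vs. observed rate.
base_rate = y.mean()
mean_plain = p_plain.mean()
mean_bal = p_bal.mean()

# Brier score (lower is better) as a rough summary of probability quality.
brier_plain = brier_score_loss(y, p_plain)
brier_bal = brier_score_loss(y, p_bal)

# Number of samples predicted as the rare class at the default 0.5 threshold.
n_pos_plain = int(plain.predict(X).sum())
n_pos_bal = int(balanced.predict(X).sum())

print(f"observed positive rate:     {base_rate:.3f}")
print(f"mean predicted p, plain:    {mean_plain:.3f}")
print(f"mean predicted p, balanced: {mean_bal:.3f}")
print(f"Brier score, plain:         {brier_plain:.4f}")
print(f"Brier score, balanced:      {brier_bal:.4f}")
print(f"predicted positives, plain / balanced: {n_pos_plain} / {n_pos_bal}")
```

In this setup I would expect the unweighted model's mean predicted probability to sit close to the observed positive rate, while the 'balanced' model predicts the rare class far more often at the cost of inflated probabilities. This is only one toy dataset, so treat it as an illustration rather than a proof.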
-- Roman