Hi Stuart.
There is no interface to do this in scikit-learn (and maybe we should add this to the FAQ).
Yes, in principle this would be possible with several of the models.

I think statsmodels can do that, and I believe I've seen another GLM
package for Python that does as well?
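
E.g. something along these lines should work with statsmodels (just a
rough sketch with made-up data, not tested against any particular
version; the Binomial family defaults to the logit link and accepts
fractional targets):

import numpy as np
import statsmodels.api as sm

rng = np.random.RandomState(0)
X = rng.randn(100, 3)                  # continuous features
y = rng.uniform(0, 1, size=100)        # fractional targets in [0, 1]

# Binomial family => logit link by default; fit accepts y in [0, 1]
model = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial())
result = model.fit()
print(result.predict(sm.add_constant(X))[:5])   # predictions in (0, 1)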

It's certainly a legitimate use-case but would require substantial
changes to the code. I think so far we decided not to support
this in scikit-learn. Basically we don't have a concept of a link
function, and it's a concept that only applies to a subset of models.
We try to have a consistent interface for all our estimators, and
this doesn't really fit well within that interface.

Hth,
Andy

On 10/04/2017 03:58 PM, Stuart Reynolds wrote:
I'd like to fit a model that maps a matrix of continuous inputs to a
target that's between 0 and 1 (a probability).

In principle, I'd expect logistic regression to work out of the box
with no modification (although it's often posed as being strictly for
classification, its loss function allows for fitting targets in the
range 0 to 1, not just strictly zero or one).
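
For concreteness, here is a minimal sketch of what I mean -- using
scipy directly on made-up data, not scikit-learn -- the cross-entropy
loss is perfectly well-defined for fractional targets:

import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.RandomState(0)
X = rng.randn(200, 3)                  # continuous features
y = rng.uniform(0, 1, size=200)        # fractional targets in (0, 1)

def neg_log_likelihood(w):
    p = expit(X @ w[:-1] + w[-1])      # logistic link; w[-1] is intercept
    p = np.clip(p, 1e-12, 1 - 1e-12)   # numerical safety
    # Cross-entropy works for fractional y, not just hard 0/1 labels.
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

res = minimize(neg_log_likelihood, np.zeros(X.shape[1] + 1), method="BFGS")
print(res.x)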

However, scikit's LogisticRegression and LogisticRegressionCV reject
target arrays that are continuous. Other LR implementations allow a
matrix of probability estimates. Looking at:
http://scikit-learn-general.narkive.com/4dSCktaM/using-logistic-regression-on-a-continuous-target-variable
and the fix here:
https://github.com/scikit-learn/scikit-learn/pull/5084, which disallows
continuous targets, it looks like there was some reason for this. So
... I'm looking for alternatives.
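
(To illustrate the rejection, something like this -- made-up data --
fails for me:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.randn(50, 3)
y = np.random.uniform(0, 1, size=50)   # probabilities, not class labels

# raises something like: ValueError: Unknown label type: 'continuous'
LogisticRegression().fit(X, y)
)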

SGDClassifier allows log loss and (if I understood the docs correctly)
adds a logistic link function, but also rejects continuous targets.
Oddly, SGDRegressor only allows ‘squared_loss’, ‘huber’,
‘epsilon_insensitive’, or ‘squared_epsilon_insensitive’, and doesn't
seem to offer a logistic option.

In principle, GLMs allow this, but scikit's docs say the GLM models
only allow strictly linear functions of their inputs and don't allow
a logistic link function. The docs direct people to the
LogisticRegression class for this case.

In R, there is:

glm(Total_Service_Points_Won/Total_Service_Points_Played ~ ... ,
     family = binomial(link=logit), weights = Total_Service_Points_Played)
which would be ideal.
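
(For what it's worth, something roughly equivalent seems possible in
statsmodels -- a sketch with made-up data standing in for the tennis
example; its Binomial GLM accepts a two-column successes/failures
target, which matches the proportion-plus-weights form above:

import numpy as np
import statsmodels.api as sm

rng = np.random.RandomState(0)
played = rng.randint(50, 100, size=30)
won = rng.binomial(played, 0.6)
X = sm.add_constant(rng.randn(30, 2))

# (successes, failures) is equivalent to won/played weighted by played
endog = np.column_stack([won, played - won])
result = sm.GLM(endog, X, family=sm.families.Binomial()).fit()
print(result.predict(X)[:5])   # fitted probabilities in (0, 1)

But I'd still like to know the scikit-learn answer.)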

Is something similar available in scikit? (Or any continuous model
that takes a 0 to 1 target and outputs a 0 to 1 target?)

I was surprised to see that CalibratedClassifierCV(method="sigmoid")
uses its own internal implementation of logistic regression to do the
sigmoid fit -- which I could use, although I'd prefer a user-facing
library.

Thanks,
- Stuart