Hi Stuart,

The underlying logistic regression code in scikit-learn (at least the
non-liblinear implementations) supports sample weights, which would let
you do what you want: for each 'instance', pass in two rows -- one with
target 1 and sample weight Total_Service_Points_Won, and one with target
0 and sample weight (Total_Service_Points_Played - Total_Service_Points_Won).
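For concreteness, a minimal sketch of that two-row expansion (assuming a
scikit-learn build where LogisticRegression.fit accepts sample_weight for
a non-liblinear solver -- as I note below, this support was never fully
exposed, so treat it as illustrative; the variable names just mirror your
R formula):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_fractional_logit(X, won, played):
        # Each original row becomes two rows: one with target 1,
        # weighted by the success count, and one with target 0,
        # weighted by the failure count.
        X2 = np.vstack([X, X])
        y2 = np.concatenate([np.ones(len(X)), np.zeros(len(X))])
        w2 = np.concatenate([won, played - won])
        return LogisticRegression(solver='lbfgs').fit(X2, y2, sample_weight=w2)

model.predict_proba(X_new)[:, 1] would then give you the fitted fraction
in [0, 1].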
Unfortunately, it has never been fully implemented and exposed; see
https://github.com/scikit-learn/scikit-learn/pull/2784#issuecomment-84734590.
I have given it a go myself and ran into problems because, as I recall,
the code is shared with the linear SVC model: logistic regression would
work, but some of the test cases would fail for linear SVC. (Note that
there is also a version of the original liblinear code that supports
sample weights.)

I would point out that having a single row rather than two is easier --
e.g. cross-validation is a pain otherwise.

If you really want a continuous target then you probably want beta
regression. An example would be predicting concentrations; the sample
weights then give you the number of times you observed each
concentration. (You could replace concentration with probability too,
e.g. if you literally had an 'oracle' that gave you the true probability
of an instance.)
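Since Andy mentions statsmodels below: as a sketch, its GLM accepts a
binomial family with a two-column (successes, failures) endog, which is
the statsmodels analogue of the R glm call you quote below (variable
names are illustrative):

    import numpy as np
    import statsmodels.api as sm

    # won = Total_Service_Points_Won, played = Total_Service_Points_Played
    endog = np.column_stack([won, played - won])  # (successes, failures)
    exog = sm.add_constant(X)                     # add an intercept column
    result = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()
    print(result.summary())
    # result.predict(exog) returns fitted probabilities in [0, 1]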
sean

On Wed, Oct 4, 2017 at 10:26 PM, Stuart Reynolds <stu...@stuartreynolds.net> wrote:
> Hi Andy,
> Thanks -- I'll give statsmodels another go.
> I remember I had some fitting speed issues with it in the past, and
> also some issues related to their models keeping references to the data
> (= disaster for serialization and multiprocessing) -- although that was
> a long time ago.
> - Stuart
>
> On Wed, Oct 4, 2017 at 1:09 PM, Andreas Mueller <t3k...@gmail.com> wrote:
> > Hi Stuart.
> > There is no interface to do this in scikit-learn (and maybe we should
> > add this to the FAQ).
> > Yes, in principle this would be possible with several of the models.
> >
> > I think statsmodels can do that, and I think I saw another glm package
> > for Python that does that?
> >
> > It's certainly a legitimate use case but would require substantial
> > changes to the code. I think so far we decided not to support
> > this in scikit-learn. Basically we don't have a concept of a link
> > function, and it's a concept that only applies to a subset of models.
> > We try to have a consistent interface for all our estimators, and
> > this doesn't really fit well within that interface.
> >
> > Hth,
> > Andy
> >
> > On 10/04/2017 03:58 PM, Stuart Reynolds wrote:
> >>
> >> I'd like to fit a model that maps a matrix of continuous inputs to a
> >> target that's between 0 and 1 (a probability).
> >>
> >> In principle, I'd expect logistic regression to work out of the
> >> box with no modification (although it's often posed as being strictly
> >> for classification, its loss function allows for fitting targets in
> >> the range 0 to 1, not strictly zero or one).
> >>
> >> However, scikit's LogisticRegression and LogisticRegressionCV reject
> >> target arrays that are continuous. Other LR implementations allow a
> >> matrix of probability estimates. Looking at:
> >> http://scikit-learn-general.narkive.com/4dSCktaM/using-logistic-regression-on-a-continuous-target-variable
> >> and the fix here:
> >> https://github.com/scikit-learn/scikit-learn/pull/5084, which disables
> >> continuous inputs, it looks like there was some reason for this. So
> >> ... I'm looking for alternatives.
> >>
> >> SGDClassifier allows log loss and (if I understood the docs correctly)
> >> adds a logistic link function, but also rejects continuous targets.
> >> Oddly, SGDRegressor only allows ‘squared_loss’, ‘huber’,
> >> ‘epsilon_insensitive’, or ‘squared_epsilon_insensitive’, and doesn't
> >> seem to give a logistic function.
> >>
> >> In principle, GLMs allow this, but scikit's docs say its GLM models
> >> only allow strictly linear functions of their inputs, and don't allow
> >> a logistic link function. The docs direct people to the
> >> LogisticRegression class for this case.
> >>
> >> In R, there is:
> >>
> >> glm(Total_Service_Points_Won/Total_Service_Points_Played ~ ...,
> >>     family = binomial(link=logit), weights = Total_Service_Points_Played)
> >>
> >> which would be ideal.
> >>
> >> Is something similar available in scikit? (Or any continuous model
> >> that takes a 0 to 1 target and outputs a 0 to 1 target?)
> >>
> >> I was surprised to see that the implementation of
> >> CalibratedClassifierCV(method="sigmoid") uses an internal
> >> implementation of logistic regression to do its logistic regression --
> >> which I can use, although I'd prefer to use a user-facing library.
> >>
> >> Thanks,
> >> - Stuart
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn