Hi Stuart,

The underlying logistic regression code in scikit-learn (at least the
non-liblinear implementations) supports sample weights, which would let
you do what you want: for each 'instance', pass in two rows -- one with
target 1 and sample weight Total_Service_Points_Won, and one with target
0 and sample weight (Total_Service_Points_Played - Total_Service_Points_Won).
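For concreteness, a minimal sketch of that two-row expansion (assuming a
scikit-learn build where LogisticRegression.fit accepts sample_weight for
a non-liblinear solver -- as I note below, this support was never fully
exposed, so treat it as illustrative; the variable names just mirror your
R formula):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_fractional_logit(X, won, played):
        # Each original row becomes two rows: one with target 1,
        # weighted by the success count, and one with target 0,
        # weighted by the failure count.
        X2 = np.vstack([X, X])
        y2 = np.concatenate([np.ones(len(X)), np.zeros(len(X))])
        w2 = np.concatenate([won, played - won])
        return LogisticRegression(solver='lbfgs').fit(X2, y2, sample_weight=w2)

model.predict_proba(X_new)[:, 1] would then give you the fitted fraction
in [0, 1].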
Unfortunately, it has never been fully implemented and exposed; see
https://github.com/scikit-learn/scikit-learn/pull/2784#issuecomment-84734590.
I have given it a go myself and ran into problems because, as I recall,
the code is shared with the linear SVC model: logistic regression would
work, but some of the test cases would fail for linear SVC. (Note that
there is also a version of the original liblinear code that supports
sample weights.)

I would point out that having a single row rather than two is easier --
e.g. cross-validation is a pain otherwise.

If you really want a continuous target then you probably want beta
regression. An example would be predicting concentrations; the sample
weights then give you the number of times you observed each
concentration. (You could replace concentration with probability too,
e.g. if you literally had an 'oracle' that gave you the true probability
of an instance.)
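Since Andy mentions statsmodels below: as a sketch, its GLM accepts a
binomial family with a two-column (successes, failures) endog, which is
the statsmodels analogue of the R glm call you quote below (variable
names are illustrative):

    import numpy as np
    import statsmodels.api as sm

    # won = Total_Service_Points_Won, played = Total_Service_Points_Played
    endog = np.column_stack([won, played - won])  # (successes, failures)
    exog = sm.add_constant(X)                     # add an intercept column
    result = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()
    print(result.summary())
    # result.predict(exog) returns fitted probabilities in [0, 1]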
sean

On Wed, Oct 4, 2017 at 10:26 PM, Stuart Reynolds <stu...@stuartreynolds.net> wrote:
> Hi Andy,
> Thanks -- I'll give statsmodels another go.
> I remember I had some fitting speed issues with it in the past, and
> also some issues related to their models keeping references to the data
> (= disaster for serialization and multiprocessing) -- although that was
> a long time ago.
> - Stuart
>
> On Wed, Oct 4, 2017 at 1:09 PM, Andreas Mueller <t3k...@gmail.com> wrote:
> > Hi Stuart.
> > There is no interface to do this in scikit-learn (and maybe we should
> > add this to the FAQ).
> > Yes, in principle this would be possible with several of the models.
> >
> > I think statsmodels can do that, and I think I saw another glm package
> > for Python that does that?
> >
> > It's certainly a legitimate use case but would require substantial
> > changes to the code. I think so far we decided not to support
> > this in scikit-learn. Basically we don't have a concept of a link
> > function, and it's a concept that only applies to a subset of models.
> > We try to have a consistent interface for all our estimators, and
> > this doesn't really fit well within that interface.
> >
> > Hth,
> > Andy
> >
> > On 10/04/2017 03:58 PM, Stuart Reynolds wrote:
> >>
> >> I'd like to fit a model that maps a matrix of continuous inputs to a
> >> target that's between 0 and 1 (a probability).
> >>
> >> In principle, I'd expect logistic regression to work out of the
> >> box with no modification (although it's often posed as being strictly
> >> for classification, its loss function allows for fitting targets in
> >> the range 0 to 1, not strictly zero or one).
> >>
> >> However, scikit's LogisticRegression and LogisticRegressionCV reject
> >> target arrays that are continuous. Other LR implementations allow a
> >> matrix of probability estimates. Looking at:
> >> http://scikit-learn-general.narkive.com/4dSCktaM/using-logistic-regression-on-a-continuous-target-variable
> >> and the fix here:
> >> https://github.com/scikit-learn/scikit-learn/pull/5084, which disables
> >> continuous inputs, it looks like there was some reason for this. So
> >> ... I'm looking for alternatives.
> >>
> >> SGDClassifier allows log loss and (if I understood the docs correctly)
> >> adds a logistic link function, but also rejects continuous targets.
> >> Oddly, SGDRegressor only allows ‘squared_loss’, ‘huber’,
> >> ‘epsilon_insensitive’, or ‘squared_epsilon_insensitive’, and doesn't
> >> seem to give a logistic function.
> >>
> >> In principle, GLMs allow this, but scikit's docs say its GLM models
> >> only allow strictly linear functions of their inputs, and don't allow
> >> a logistic link function. The docs direct people to the
> >> LogisticRegression class for this case.
> >>
> >> In R, there is:
> >>
> >> glm(Total_Service_Points_Won/Total_Service_Points_Played ~ ...,
> >>     family = binomial(link=logit), weights = Total_Service_Points_Played)
> >>
> >> which would be ideal.
> >>
> >> Is something similar available in scikit? (Or any continuous model
> >> that takes a 0 to 1 target and outputs a 0 to 1 target?)
> >>
> >> I was surprised to see that the implementation of
> >> CalibratedClassifierCV(method="sigmoid") uses an internal
> >> implementation of logistic regression to do its logistic regression --
> >> which I can use, although I'd prefer to use a user-facing library.
> >>
> >> Thanks,
> >> - Stuart
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn