Thanks Josef, that was very useful. result.remove_data() reduces a 5-parameter Logit result object from megabytes to about 5 KB (compared with a minimum uncompressed size of ~320 bytes for the parameters themselves), which is a big improvement. I'll experiment with what you suggest, since this is still more than 10x larger than the minimum possible; I suspect the difference is mostly attribute names. I don't mind the lack of multinomial support -- I've often had better results combining independent models for each class.
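For anyone curious where those numbers come from, here is a minimal sketch (entirely synthetic data; exact sizes will vary by statsmodels version) comparing the pickled size of the full result object, the same object after remove_data(), and the bare coefficient array:

    import pickle
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.RandomState(0)
    X = sm.add_constant(rng.randn(100000, 4))              # 5 parameters including the constant
    p = 1.0 / (1.0 + np.exp(-X.dot([0.0, 0.5, 1.0, -1.0, 2.0])))
    y = (rng.rand(100000) < p).astype(float)

    res = sm.Logit(y, X).fit(disp=0)
    size_full = len(pickle.dumps(res))                     # result still references X and y
    res.remove_data()                                      # drop the data-sized arrays
    size_slim = len(pickle.dumps(res))
    size_bare = len(pickle.dumps(np.asarray(res.params)))  # coefficients only
    print(size_full, size_slim, size_bare)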
I'll experiment with the different solvers. I tried the Logit model in the past -- its fit function only exposed a maxiter, and not a tolerance -- meaning I had to set maxiter very high. The newer statsmodels GLM module looks great and seems to solve this.

For others who come this way, I think the magic for ridge (L2-penalized) logistic regression is:

    from statsmodels.genmod.generalized_linear_model import GLM
    from statsmodels.genmod import families
    from statsmodels.genmod.families import links

    # the logit link is Binomial's default, so specifying it is optional
    model = GLM(y, Xtrain, family=families.Binomial(link=links.Logit()))
    # L1_wt=0.0 turns the elastic net penalty into a pure L2 (ridge) penalty;
    # cnvrg_tol is the convergence tolerance of the elastic net fit
    result = model.fit_regularized(method='elastic_net', alpha=l2weight,
                                   L1_wt=0.0, cnvrg_tol=...)
    result.remove_data()   # drop data references before serializing
    result.predict(Xtest)

One last thing -- it's clear that it should be possible to do something like scikit's LogisticRegressionCV in order to quickly optimize a single regularization parameter by re-using past coefficients. Are there any wrappers in statsmodels for doing this, or should I roll my own? (A minimal roll-your-own sketch appears after the quoted thread below.)

- Stu

On Wed, Oct 4, 2017 at 3:43 PM, <josef.p...@gmail.com> wrote:
>
> On Wed, Oct 4, 2017 at 4:26 PM, Stuart Reynolds <stu...@stuartreynolds.net> wrote:
>>
>> Hi Andy,
>> Thanks -- I'll give statsmodels another go.
>> I remember I had some fitting speed issues with it in the past, and
>> also some issues related to their models keeping references to the data
>> (= a disaster for serialization and multiprocessing) -- although that was
>> a long time ago.
>
> The second has not changed and will not change, but there is a remove_data
> method that deletes all references to full, data-sized arrays. However, once
> the data is removed, it is not possible anymore to compute any new results
> statistics, which are almost all lazily computed.
> The fitting speed depends a lot on the optimizer, convergence criteria,
> difficulty of the problem, and availability of good starting parameters.
> Almost all nonlinear estimation problems use the scipy optimizers; all
> unconstrained optimizers can be used. There are no optimized special methods
> for cases with a very large number of features.
>
> Multinomial/multiclass models don't support continuous response (yet); all
> other GLM and discrete models allow for continuous data in the interval
> extension of the domain.
>
> Josef
>
>>
>> - Stuart
>>
>> On Wed, Oct 4, 2017 at 1:09 PM, Andreas Mueller <t3k...@gmail.com> wrote:
>> > Hi Stuart.
>> > There is no interface to do this in scikit-learn (and maybe we should add
>> > this to the FAQ).
>> > Yes, in principle this would be possible with several of the models.
>> >
>> > I think statsmodels can do that, and I think I saw another GLM package
>> > for Python that does that?
>> >
>> > It's certainly a legitimate use case but would require substantial
>> > changes to the code. I think so far we decided not to support
>> > this in scikit-learn. Basically we don't have a concept of a link
>> > function, and it's a concept that only applies to a subset of models.
>> > We try to have a consistent interface for all our estimators, and
>> > this doesn't really fit well within that interface.
>> >
>> > Hth,
>> > Andy
>> >
>> > On 10/04/2017 03:58 PM, Stuart Reynolds wrote:
>> >>
>> >> I'd like to fit a model that maps a matrix of continuous inputs to a
>> >> target that's between 0 and 1 (a probability).
>> >>
>> >> In principle, I'd expect logistic regression to work out of the box
>> >> with no modification (although it's often posed as being strictly for
>> >> classification, its loss function allows for fitting targets in the
>> >> range 0 to 1, not strictly zero or one).
>> >>
>> >> However, scikit's LogisticRegression and LogisticRegressionCV reject
>> >> target arrays that are continuous. Other LR implementations allow a
>> >> matrix of probability estimates. Looking at:
>> >>
>> >> http://scikit-learn-general.narkive.com/4dSCktaM/using-logistic-regression-on-a-continuous-target-variable
>> >> and the fix here:
>> >> https://github.com/scikit-learn/scikit-learn/pull/5084, which disables
>> >> continuous inputs, it looks like there was some reason for this. So
>> >> ... I'm looking for alternatives.
>> >>
>> >> SGDClassifier allows log loss and (if I understood the docs correctly)
>> >> adds a logistic link function, but also rejects continuous targets.
>> >> Oddly, SGDRegressor only allows ‘squared_loss’, ‘huber’,
>> >> ‘epsilon_insensitive’, or ‘squared_epsilon_insensitive’, and doesn't
>> >> seem to give a logistic function.
>> >>
>> >> In principle, GLMs allow this, but scikit's docs say the GLM models
>> >> only allow strictly linear functions of their input, and don't allow
>> >> a logistic link function. The docs direct people to the
>> >> LogisticRegression class for this case.
>> >>
>> >> In R, there is:
>> >>
>> >> glm(Total_Service_Points_Won/Total_Service_Points_Played ~ ... ,
>> >>     family = binomial(link=logit),
>> >>     weights = Total_Service_Points_Played)
>> >>
>> >> which would be ideal.
>> >>
>> >> Is something similar available in scikit? (Or any continuous model
>> >> that takes a 0 to 1 target and outputs a 0 to 1 prediction?)
>> >>
>> >> I was surprised to see that the implementation of
>> >> CalibratedClassifierCV(method="sigmoid") uses an internal
>> >> implementation of logistic regression to do its logistic regression --
>> >> which I can use, although I'd prefer to use a user-facing library.
>> >>
>> >> Thanks,
>> >> - Stuart
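Following up on the warm-start question above: whether or not statsmodels has a ready-made LogisticRegressionCV-style wrapper, rolling your own regularization path is short. Here is a minimal sketch, assuming GLM.fit_regularized's start_params keyword for warm starts; the alpha grid, the data names (X_train, y_train, X_val, y_val), and the hold-out scoring are all placeholders:

    import numpy as np
    from statsmodels.genmod.generalized_linear_model import GLM
    from statsmodels.genmod import families

    # X_train, y_train, X_val, y_val are assumed to exist (a held-out split)
    alphas = np.logspace(1, -4, 10)          # penalty grid, strongest first (placeholder)
    start = None
    best_nll, best_alpha = np.inf, None
    for a in alphas:
        model = GLM(y_train, X_train, family=families.Binomial())
        res = model.fit_regularized(method='elastic_net', alpha=a,
                                    L1_wt=0.0, start_params=start)
        start = np.asarray(res.params)       # warm-start the next (weaker) penalty
        p = np.clip(res.predict(X_val), 1e-12, 1 - 1e-12)
        nll = -np.mean(y_val * np.log(p) + (1 - y_val) * np.log(1 - p))
        if nll < best_nll:
            best_nll, best_alpha = nll, a
    print("best alpha:", best_alpha)

Going from the strongest to the weakest penalty keeps each warm start close to the next solution, which is essentially the trick LogisticRegressionCV exploits.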