On Aug 24, 2010, at 3:16 PM, Daniel Yarlett wrote:
Hello,
I am using R to train a logistic regression model and save the resulting model to disk. I then reload these saved objects and use predict.glm on them in order to make predictions about single-row data frames that are generated in real time from requests arriving at an HTTP server. The following code demonstrates the sort of R calls that I have in mind:
cases <- 2000000
data <- data.frame(x1=runif(cases), x2=runif(cases), y=sample(0:1, cases, replace=TRUE))
lr1 <- glm(y~x1*x2,family=binomial,data=data)
new_data <- data.frame(x1=0,x2=0)
out <- predict(lr1,type="response",newdata=new_data)
The first thing I am noticing is that the models I am storing are very large because I am using large data-sets, and the models seem to store residuals, fitted values and so on by default.
object.size(lr1)
1056071320 bytes
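A quick way to see where those bytes go (a sketch, using the lr1 fitted above):

# Rank the components of the fitted model by size in bytes.
sizes <- sapply(lr1, function(component) as.numeric(object.size(component)))
head(sort(sizes, decreasing = TRUE))
# With 2e6 cases, the length-n components (residuals, fitted.values,
# linear.predictors, weights, effects) and the stored model frame dominate.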
Access to all this information is not necessary for my application -- all I really need is access to model$coefficients in order to make my predictions -- so I am wondering if there is some way to prevent this information from being stored in the glm objects when they are created (or of removing it after the models have been trained). I have discovered the model=FALSE, x=FALSE, y=FALSE switches to glm(), and these seem to help somewhat, but perhaps there is some other way of recording only the coefficients of the model and other minimal details?
Perhaps instead:

lr2 <- coef(glm(y~x1*x2, family=binomial, data=data, model=FALSE, x=FALSE, y=FALSE))
object.size(lr2)

will be much smaller.
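If predict() itself is still wanted, a further option is to strip the large components from the fitted object by hand. A sketch (an assumption about what predict.glm consults, not tested at this scale): keep the small pieces needed for point predictions -- terms, coefficients, xlevels, contrasts, rank, family, and the qr pivot -- and null out the rest. Standard errors (se.fit=TRUE) will no longer be available.

# Sketch: slim a fitted glm while keeping predict(..., newdata=) working
# for point predictions (se.fit = TRUE will no longer work).
fit <- glm(y ~ x1 * x2, family = binomial, data = data,
           model = FALSE, x = FALSE, y = FALSE)
fit$residuals         <- NULL   # length-n vectors dominate object.size()
fit$fitted.values     <- NULL
fit$linear.predictors <- NULL
fit$weights           <- NULL
fit$prior.weights     <- NULL
fit$effects           <- NULL
fit$data              <- NULL   # drop the stored reference to the training data
fit$qr$qr             <- NULL   # keep fit$qr$pivot, which predict.lm consults
# Stop the formula/terms from dragging a large environment into save():
attr(fit$terms,   ".Environment") <- globalenv()
attr(fit$formula, ".Environment") <- globalenv()
object.size(fit)
predict(fit, newdata = data.frame(x1 = 0, x2 = 0), type = "response")

After this, the object should be dominated by the call and family components rather than by anything of length cases.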
Secondly, on data-sets of the scale I am using, predict.glm seems to be taking a very long time to make its predictions.
print(system.time(predict(lr1, type="response", newdata=new_data)))
   user  system elapsed
  0.136   0.040   0.175
print(system.time(predict(lr2, type="response", newdata=new_data)))
   user  system elapsed
  0.109   0.013   0.121

(Here lr2 is the model refit with model=FALSE, x=FALSE, y=FALSE; predict() needs a fitted model object rather than a bare coefficient vector.)
This may be an issue of swap-time, and so it could potentially be solved by addressing my first question above. However, given that I am essentially asking R to compute

1 / (1 + exp(-(b0 + b1*x1 + b2*x2 + b3*x1*x2)))

I can't see any reason why this request should take longer than a hundredth or a thousandth of a second, say.
You could try crossprod with a data.matrix and a matrix of coefficients:

1 / (1 + exp(-(crossprod(lr2, new_data))))
> cases <- 2000
> data <- data.frame(x1=runif(cases), x2=runif(cases), y=sample(0:1, cases, replace=TRUE))
> lr1 <- coef(glm(y~x1*x2, family=binomial, data=data))
> new_data <- matrix(c(1, x1=0, x2=0, x1x2=0), nrow=4)
# took me a while to figure out that I needed an interaction entry
> out <- 1 / (1 + exp(-(crossprod(new_data, lr1))))
> out
          [,1]
[1,] 0.5107252
> lr1
(Intercept)          x1          x2       x1:x2
 0.04290728 -0.16826991 -0.03561711  0.06229122
> object.size(lr1)
456 bytes
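The interaction entry (and the intercept) can also be generated automatically by keeping a terms object and letting model.matrix build the design row. A sketch along those lines, with nd a hypothetical single-row data frame and lr1 the coefficient vector from above:

# Sketch: build the design row via model.matrix so the (Intercept) and
# x1:x2 columns appear in the same order as the coefficients.
tt  <- delete.response(terms(y ~ x1 * x2))
nd  <- data.frame(x1 = 0, x2 = 0)
X   <- model.matrix(tt, nd)      # 1 x 4: (Intercept), x1, x2, x1:x2
out <- plogis(drop(X %*% lr1))   # plogis(z) is 1 / (1 + exp(-z))

This generalizes to any formula without hand-assembling the interaction column.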
Obviously R is providing a much greater level of functionality than I require in this particular instance, so my overall question is: what is the best way for me to reduce the size of the data I have to store in my GLM models, and to increase the speed at which I can use R to generate predictions of this sort (i.e. for novel x1,x2 pairs)?

I could obviously write a custom function / class which only stores the model coefficients and computes predictions based on these using the equation above, but before I go down this route I wanted to get some advice from the R community about whether there might be a better way to address this problem and/or whether I have missed something obvious (to others). I also want to avoid writing custom code if possible, because that means sacrificing the great generality and power of R, which could clearly be useful in my application down the line.
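A minimal sketch of that custom-predictor idea (make_predictor is hypothetical, not an existing function): keep only the coefficients and the terms, and compute the inverse-logit predictions directly:

# Sketch: a closure that stores only coefficients + terms of the fit.
make_predictor <- function(fit) {
  cf <- coef(fit)
  tt <- delete.response(terms(fit))
  attr(tt, ".Environment") <- globalenv()  # avoid serializing the fit's env
  rm(fit)                                  # keep the closure environment small
  function(newdata) {
    X <- model.matrix(tt, newdata)
    plogis(drop(X %*% cf))                 # 1 / (1 + exp(-eta))
  }
}
pred <- make_predictor(lr1)
pred(data.frame(x1 = 0, x2 = 0))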
Many thanks in advance for your assistance,
Dan.
David Winsemius, MD
West Hartford, CT
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.