Another thing you could try after doing this is to sample
some rows from your data and see whether the subset gives
nearly the same answer as the entire data set.
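
A rough sketch of that check, reusing the chunked-reading idea from Andy's example quoted below. The temporary file, `chunk_rows`, and `frac` are hypothetical stand-ins chosen for illustration, and the data are simulated. Each chunk contributes to the full-data X'X and X'y, and a random fraction of its rows is kept for the subsample fit:

```r
## Simulated stand-in for the big file: column 1 is the response,
## columns 2..(p + 1) are predictors (same layout as the quoted example).
set.seed(1)
p <- 5
n <- 3000
dat <- cbind(rnorm(n), matrix(runif(n * p), n, p))
path <- tempfile(fileext = ".txt")   # hypothetical location
write.table(dat, file = path, row.names = FALSE, col.names = FALSE)

chunk_rows <- 500   # rows read per pass (assumed to fit in memory)
frac <- 0.2         # fraction of each chunk kept in the subsample

xtx <- matrix(0, p + 1, p + 1)   # running X'X for the full data
xty <- numeric(p + 1)            # running X'y for the full data
kept <- list()                   # sampled rows, one matrix per chunk

f <- file(path, open = "r")
repeat {
    vals <- scan(f, nlines = chunk_rows, quiet = TRUE)
    if (length(vals) == 0) break          # nothing left to read
    x <- matrix(vals, ncol = p + 1, byrow = TRUE)
    X <- cbind(1, x[, -1, drop = FALSE])
    xtx <- xtx + crossprod(X)
    xty <- xty + crossprod(X, x[, 1])
    keep <- runif(nrow(x)) < frac         # keep each row with probability frac
    kept[[length(kept) + 1]] <- x[keep, , drop = FALSE]
}
close(f)
unlink(path)

sub <- do.call(rbind, kept)
b_full <- drop(solve(xtx, xty))
b_sub <- coef(lm.fit(cbind(1, sub[, -1]), sub[, 1]))

## Side-by-side comparison of the two sets of coefficients.
print(cbind(full = b_full, subsample = b_sub))
```

One caveat: with many dummy columns, a small subsample can leave some of them all zero, in which case the subsample fit is rank deficient and lm.fit returns NA for those coefficients.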
On 4/24/06, Liaw, Andy <[EMAIL PROTECTED]> wrote:
> Here's a skeletal example. Embellish as needed:
>
> p <- 5    # number of predictors
> n <- 300  # number of rows
> set.seed(1)
> ## Column 1 is the response; columns 2..(p + 1) are the predictors.
> dat <- cbind(rnorm(n), matrix(runif(n * p), n, p))
> write.table(dat, file="c:/temp/big.txt", row.names=FALSE, col.names=FALSE)
>
> ## Accumulators for X'X and X'y (the extra 1 is for the intercept).
> xtx <- matrix(0, p + 1, p + 1)
> xty <- numeric(p + 1)
> f <- file("c:/temp/big.txt", open="r")
> for (i in 1:3) {  # 3 chunks of 100 rows each
>     x <- matrix(scan(f, nlines=100), 100, p + 1, byrow=TRUE)
>     X <- cbind(1, x[, -1])
>     xtx <- xtx + crossprod(X)
>     xty <- xty + crossprod(X, x[, 1])
> }
> close(f)
> solve(xtx, xty)
> coef(lm.fit(cbind(1, dat[,-1]), dat[,1])) ## check result
>
> unlink("c:/temp/big.txt") ## clean up.
>
> Andy
>
> -----Original Message-----
> From: Sachin J [mailto:[EMAIL PROTECTED]
> Sent: Monday, April 24, 2006 5:09 PM
> To: Liaw, Andy; [email protected]
> Subject: RE: [R] Handling large dataset & dataframe [Broadcast]
>
>
> Hi Andy:
>
> I searched through the R archive to find out how to handle large data sets
> using readLines and other related R functions, but I couldn't find a single
> post that explains the process. Can you provide an example, or any pointers
> to postings that elaborate on it?
>
> Thanx in advance
> Sachin
>
>
> "Liaw, Andy" <[EMAIL PROTECTED]> wrote:
>
> Instead of reading the entire data set in at once, you read a chunk at a time,
> compute X'X and X'y on that chunk, and accumulate (i.e., add) them. Both are
> sums over rows, so the accumulated totals equal the full-data X'X and X'y.
> There are examples in "S Programming", taken from independent replies by the
> two authors to a post on S-news, if I remember correctly.
>
> Andy
>
> From: Sachin J
> >
> > Gabor:
> >
> > Can you elaborate?
> >
> > Thanx
> > Sachin
> >
> > Gabor Grothendieck wrote:
> > You just need the much smaller cross-product matrix X'X and
> > vector X'y, so you can build those up as you read the data
> > in chunks.
> >
> >
> > On 4/24/06, Sachin J wrote:
> > > Hi,
> > >
> > > > I have a dataset consisting of 350,000 rows and 266 columns. Out of
> > > > the 266 columns, 250 are dummy variable columns. I am trying to read
> > > > this data set into an R data frame but am unable to do so due to
> > > > memory limitations (the object created is too large to handle in R).
> > > > Is there a way to handle such a large dataset in R?
> > >
> > > > My PC has 1 GB of RAM and 55 GB of hard disk space, and runs Windows XP.
> > >
> > > Any pointers would be of great help.
> > >
> > > TIA
> > > Sachin
> > >
> > >
> > > ---------------------------------
> > >
> > > [[alternative HTML version deleted]]
> > >
> > > ______________________________________________
> > > [email protected] mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide!
> > > http://www.R-project.org/posting-guide.html
> > >