Hello R users,

First, my settings: R 1.8.1, compiled as a 64-bit application for Solaris 5.8 (64-bit). The machine has 8 GB of RAM and I am its sole user, so pretty much all of the 8 GB is available to R.
I am pretty new to R and I am having a hard time working with large data sets, which make up over 90% of the analyses done here. The data set I imported into R from S+ has a little over 2,000,000 rows and roughly 60 variables, most of them factors but a few continuous. It is in fact a subset of a larger data set used for analysis in S+. I know some of you will think I should sample, but that is not an option in the present setting.

Reading the data set into R had its own challenges. When I quit R and save the workspace it takes over 5 minutes, and when I start a new session and load the data set it takes around 15 minutes. I am trying to build a model I have already built in S+, so I can make sure I am doing the right thing and can compare resource usage, but so far I have had no luck: after 45 minutes or so R has used up all the available memory and is swapping, which brings CPU usage close to nothing.

I am convinced there are settings I could use to optimize memory management for such problems. I tried help(Memory), which tells me about the options "--min-vsize=vl --max-vsize=vu --min-nsize=nl --max-nsize=nu", but it is not clear whether and when they should be used. Further down the page it says: "..., and since setting larger values of the minima will make R slightly more efficient on large tasks." On the other hand, searching the R site for memory-management clues I found this from Brian Ripley, dated 13 Nov. 2003: "But had you actually read the documentation you would know it did not do that. That needs --max-memory-size set." That was in response to someone who had increased the value of "min-vsize="; furthermore, I cannot find any "--max-memory-size" option. I am wondering whether someone with experience working with large data sets would share the configurations and options they are using. (A sketch of how these start-up options can be supplied appears at the end of this message.)

If it matters, here is the model I was trying to fit:

library(package = "statmod", pos = 2,
        lib.loc = "/home/jeg002/R-1.8.1/lib/R/R_LIBS")

qc.B3.tweedie <- glm(formula = pp20B3 ~ ageveh + anpol + categveh + champion +
                         cie + dossiera + faq13c + faq5a + kmaff + kmprom +
                         nbvt + rabprof + sexeprin + newage,
                     family = tweedie(var.power = 1.577, link.power = 0),
                     etastart = log(rep(mean(qc.b3.sans.occ[, 'pp20B3']),
                                        nrow(qc.b3.sans.occ))),
                     weights = unsb3t1, trace = T, data = qc.b3.sans.occ)

After one iteration (45+ minutes) R is thrashing through over 10 GB of memory.

Thanks for any insights,

Gérald Jean
Consulting analyst (statistics), Actuarial
telephone: (418) 835-4900 ext. 7639
fax: (418) 835-6657
e-mail: [EMAIL PROTECTED]

"In God we trust, all others must bring data"  W. Edwards Deming
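For reference, here is a minimal sketch of how the start-up options from help(Memory) could be supplied and how heap usage can be checked from inside the session. The sizes are illustrative guesses for an 8 GB machine, not tested recommendations, and the object name qc.b3.sans.occ is simply taken from the model call above.

## Illustrative start-up from the shell, raising the initial heaps so R does
## not have to grow them in many small steps (sizes are guesses only):
##   R --min-vsize=1000M --min-nsize=2M

## Inside the session:
gcinfo(TRUE)                     # report every garbage collection as it runs
gc()                             # force a collection; shows Ncells/Vcells in use
mem.limits()                     # current nsize/vsize caps (NA means unlimited)
object.size(qc.b3.sans.occ)      # bytes taken by the large data frame

Comparing the gc() output just before the glm() call and after the first iteration should indicate whether most of the vector heap goes into building the model matrix or into the iterative fitting itself.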