Don't run 500,000 separate models. Use the limma package to fit the same model to every site in a single call, so the variance parameters are estimated jointly across sites; it will run on your laptop. And don't use % methylation (the Beta value) as your response variable; use its logit, i.e. the M-value.
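A minimal sketch of that limma workflow, under a couple of stated assumptions: `beta` is a placeholder for the full ~500,000-site matrix, oriented sites-in-rows / samples-in-columns as lmFit() expects (i.e. the original file orientation, not the transposed chunks), and `predictors` / `covariates` are the 1264-row data.frames from the question below.

library(limma)

## beta: numeric matrix of Beta values, one row per CpG site, one column
## per sample. If the values are stored as 0-100 percentages, divide by
## 100 first.

## Convert Beta values to M-values (logit on the log2 scale).
M <- log2(beta / (1 - beta))

## One design matrix shared by every site: intercept, the predictor of
## interest, and the nine covariates.
design <- model.matrix(~ ., data = cbind(predictors[, 1, drop = FALSE],
                                         covariates))

## Fit all site-wise models in one vectorised call, then moderate the
## per-site variances by borrowing information across sites.
fit <- lmFit(M, design)
fit <- eBayes(fit)

## Moderated p-values for the predictor (coef = 2 is the predictor
## column, assuming a single numeric predictor).
pvals <- topTable(fit, coef = 2, number = Inf, sort.by = "none")$P.Value

lmFit() solves all the least-squares fits in one pass and eBayes() shrinks the per-site variance estimates toward a common prior, so this is both much faster and better behaved statistically than 500,000 independent lm() calls; repeat with a rebuilt design (or a different coef) for each of the other three predictors.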
-Aaron

On Mon, Aug 8, 2016 at 2:49 PM, Ellis, Alicia M <alicia.el...@med.uvm.edu> wrote:
> I have a large dataset with ~500,000 columns and 1264 rows. Each column
> represents the percent methylation at a given location in the genome. I
> need to run 500,000 linear models for each of 4 predictors of interest, in
> the form of:
>
>   Methylation.site1 ~ predictor1 + covariate1 + covariate2 + ... + covariate9
>
> ...and save only the p-value for the predictor.
>
> The original methylation data file had methylation sites as row labels and
> the individuals as columns, so I read the data in chunks and transposed it;
> I now have 5 csv files (chunks) with columns representing methylation sites
> and rows as individuals.
>
> I was able to get results for all of the regressions by running each chunk
> of methylation data separately on our supercomputer using the code below.
> However, I'm going to have to do this again for another project, and I would
> really like to accomplish two things to make the whole process more
> computationally efficient:
>
> 1) Work with data.tables instead of data.frames (reading and manipulating
>    will be much easier and faster).
>
> 2) Do the work in parallel, using say 12 cores at once and having the
>    program divide the work up among the cores, rather than me having to
>    split the data and run 5 separate jobs on the supercomputer.
>
> I have some basic knowledge of the data.table package, but I wasn't able to
> modify the foreach code below to get it to work, and the code using
> data.frames didn't seem to be using all 12 cores that I created in the
> cluster.
>
> Can anyone suggest some modifications to the foreach code below that will
> allow me to do this in parallel with data.tables and not have to do it in
> chunks?
>
> ############# Set up cluster
> clus = makeCluster(12, type = "SOCK")
> registerDoSNOW(clus)
> getDoParWorkers()
> getDoParName()
>
> ################### Following code needs to be modified to run the full
> ################### dataset (batch1-batch5) in parallel.
> ### Currently I read in the following chunks and run each predictor
> ### separately for each chunk of data.
>
> ############### Methylation data in batches
> ### Each batch has about 100,000 columns and 1264 rows; want to alter this
> ### to: batch1 = fread(file = )
> batch1 = read.csv("/home/alicia.m.ellis/batch1.csv")
> batch2 = read.csv(file = "/home/alicia.m.ellis/batch2.csv")
> batch3 = read.csv(file = "/home/alicia.m.ellis/batch3.csv")
> batch4 = read.csv(file = "/home/alicia.m.ellis/batch4.csv")
> batch5 = read.csv(file = "/home/alicia.m.ellis/batch5.csv")
>
> predictors  ## this is a data.frame with 4 columns and 1264 rows
> covariates  ## this is a data.frame with 9 columns and 1264 rows
>
> fits <- as.data.table(batch1)[, list(MyFits = lapply(1:ncol(batch1),
>           function(x) summary(lm(batch1[, x] ~ predictors[, 1] +
>                                    covariates[, 1] + covariates[, 2] +
>                                    covariates[, 3] + covariates[, 4] +
>                                    covariates[, 5] + covariates[, 6] +
>                                    covariates[, 7] + covariates[, 8] +
>                                    covariates[, 9]
>                                  ))$coefficients[2, 4]))]
>
> ###################################### This is what I was trying but
> ###################################### wasn't having much luck.
> #### I'm having trouble getting the data merged as a single data.frame, and
> #### the code below doesn't seem to be dividing the work among the 12 cores
> #### in the cluster.
>
> all.fits = foreach(j = 1:ncol(predictors), i = 1:ncol(meth1),
>                    combine = 'rbind', .inorder = TRUE) %dopar% {
>   model = lm(meth[, i] ~ predictors[, j] +
>                covariates[, 1] + covariates[, 2] + covariates[, 3] +
>                covariates[, 4] + covariates[, 5] + covariates[, 6] +
>                covariates[, 7] + covariates[, 8] + covariates[, 9])
>   summary(model)$coefficients[2, 4]
> }
>
> Alicia Ellis, Ph.D.
> Biostatistician
> Pathology & Laboratory Medicine
> Colchester Research Facility
> 360 South Park Drive, Room 209C
> Colchester, VT 05446
> 802-656-9840
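For completeness: the attempted foreach call has two separate problems. The combiner argument is spelled .combine (with a leading dot), so combine = 'rbind' is not doing what was intended, and giving a single foreach() two iteration variables steps them in lockstep (like mapply) rather than looping over every (predictor, site) pair; nesting with the %:% operator does the latter. A minimal corrected sketch of that per-site approach, assuming doParallel in place of doSNOW, fread for the chunk files, and the question's predictors/covariates objects (this is still far slower and more memory-hungry than the single limma fit above, since every worker gets its own copy of the methylation table):

library(data.table)   # fread
library(foreach)
library(doParallel)

## Read the five chunks and bind them into one wide table
## (sites in columns, individuals in rows), as in the question.
files <- sprintf("/home/alicia.m.ellis/batch%d.csv", 1:5)
meth  <- do.call(cbind, lapply(files, fread))

cl <- makeCluster(12)
registerDoParallel(cl)

## %:% nests the loops so every (predictor, site) pair is fitted;
## note .combine, with the leading dot.
all.fits <- foreach(j = 1:ncol(predictors), .combine = 'cbind') %:%
  foreach(i = 1:ncol(meth), .combine = 'c') %dopar% {
    model <- lm(meth[[i]] ~ predictors[, j] +
                  covariates[, 1] + covariates[, 2] + covariates[, 3] +
                  covariates[, 4] + covariates[, 5] + covariates[, 6] +
                  covariates[, 7] + covariates[, 8] + covariates[, 9])
    summary(model)$coefficients[2, 4]   # p-value for the predictor
  }

stopCluster(cl)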