Hello,

I have a large data set: 320,000 rows and 1,000 columns. All the values are
0, 1 or 2, with missing values coded as 9.

I wrote a script to remove all rows with more than 46 missing values. It works
perfectly on a smaller data set, but when I run it on the larger one I get the
error "cannot allocate vector of size 1240 Kb".

I've searched through previous posts and found that this might be because I'm
running R 2.1.0 on a Linux cluster with 32-bit processors, but I could not find
a solution. The cluster is fast (3.4 GHz, 4 GB of memory) and should be able to
cope with this amount of data. Is there a way to change the relevant settings
in R?

I want to run randomForest on the large data set, and it should be able to
cope with that amount. Has anyone tried this in R before, or is Fortran a
better choice? My R script is below.

Best regards,
Iris Kolder
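
For context on the error, here is a back-of-the-envelope estimate of how much memory the table needs. It is only a sketch: the dimensions come from the post, and the byte sizes assume R's usual storage of 4 bytes per integer or logical value and 8 bytes per double. The variable names are made up for illustration.

```r
# Rough memory estimate for the full table (dimensions taken from the post).
n_rows <- 320000
n_cols <- 1000
cells  <- n_rows * n_cols            # 3.2e8 values in total

bytes_per_integer <- 4               # integers and logicals: 4 bytes each
bytes_per_double  <- 8               # doubles: 8 bytes each

gb_integer <- cells * bytes_per_integer / 1024^3
gb_double  <- cells * bytes_per_double  / 1024^3

cat(sprintf("stored as integer: %.2f GB\n", gb_integer))
cat(sprintf("stored as double:  %.2f GB\n", gb_double))
```

A single integer copy is therefore about 1.2 GB. Expressions such as SNP == 9 or is.na(SNP) each create a temporary logical object of the same dimensions, so several copies may have to exist at once. A 32-bit process can address at most 4 GB, and in practice usually less, which would be consistent with an allocation failing even for a small additional vector.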
# Read in the data file
SNP <- read.table("file.txt", header = FALSE, sep = "")

# Recode missing values from 9 to NA
SNP[SNP == 9] <- NA

# Count the NAs in each row and store the total in a temporary column
SNP$total.NAs <- rowSums(is.na(SNP))

# Keep only rows with no more than 46 (5%) NAs
SNP <- SNP[SNP$total.NAs <= 46, ]

# Remove the temporary NA-count column
SNP$total.NAs <- NULL
# Transpose rows and columns
SNP <- t(as.matrix(SNP))

library(randomForest)  # provides na.roughfix() and randomForest()
set.seed(1)
snp.na <- SNP
snp.roughfix <- na.roughfix(snp.na)

# Use the first column (case/control status) as a factor response
fSNP <- factor(snp.roughfix[, 1])

snp.narf <- randomForest(snp.roughfix[, -1], fSNP, na.action = na.roughfix,
                         ntree = 500, mtry = 10, importance = TRUE,
                         keep.forest = FALSE, do.trace = 100)
print(snp.narf)
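
One possible way to reduce peak memory in the filtering step is to read the file in blocks and keep only the rows that pass the NA threshold, rather than loading the whole table first. The sketch below is an assumption-laden illustration, not a tested fix for the cluster: filter_rows_chunked and its arguments are hypothetical names, the file is assumed to be whitespace-separated integers with a known number of columns, and the small demo file stands in for the real data.

```r
# Hypothetical helper: filter rows block by block instead of all at once.
filter_rows_chunked <- function(path, n_cols, max_na = 46,
                                chunk_rows = 10000, missing_code = 9) {
  con <- file(path, open = "r")
  on.exit(close(con))
  kept <- list()
  repeat {
    # Read the next chunk_rows lines from the open connection
    vals <- scan(con, what = integer(), nlines = chunk_rows, quiet = TRUE)
    if (length(vals) == 0) break
    block <- matrix(vals, ncol = n_cols, byrow = TRUE)
    block[block == missing_code] <- NA
    # Keep rows with at most max_na missing values
    keep <- rowSums(is.na(block)) <= max_na
    kept[[length(kept) + 1]] <- block[keep, , drop = FALSE]
  }
  do.call(rbind, kept)
}

# Tiny demo file (made-up data): 3 rows, 3 columns, 9 = missing
demo_file <- tempfile()
demo <- matrix(c(0, 1, 2,
                 9, 9, 1,
                 2, 9, 0), ncol = 3, byrow = TRUE)
write.table(demo, demo_file, row.names = FALSE, col.names = FALSE)

filtered <- filter_rows_chunked(demo_file, n_cols = 3, max_na = 1,
                                chunk_rows = 2)
print(filtered)
```

The result is an integer matrix rather than a data frame. The filtered data must still fit in memory afterwards, so this only helps if enough rows are dropped or if the remaining limit is the temporary copies rather than the data itself.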
______________________________________________
R-help mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.