hi all

i know that one should try to limit the amount of looping in R
programs. i have supplied some code below and am interested in seeing how
it could be rewritten without the loops.


a brief overview of what is done in the code.
==============================================
==============================================
==============================================

1. the input file contains 120*500*61 cells. 120*500 rows and 61
columns.

2. we need to import the rows 500 at a time and perform the same
operations on each subgroup.

3. the file contains numeric values. there are quite a lot of missing
values, which are coded as NA in the text file (the file that is
imported).

4. for each variable we check for outliers. any value more than 3
standard deviations (sd) from the mean of a variable is set to the
nearest limit, i.e. mean + 3 sd or mean - 3 sd.

5. the data set has one response variable , the first column, and 60
explanatory variables.

6. we regress each of the explanatory variables against the response and
record the slope of the explanatory variable. (i.e. simple linear
regression is performed)

7. nsize = 500 since we import 500 rows at a time

8. nruns = how many groups you want to run the analysis on

==============================================
==============================================
==============================================
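
for step 4, a loop-free version might look something like the sketch below. this is only a rough idea under my own assumptions: clamp3sd is a made-up name, m stands for one imported batch (dm in the code further down), and apply() is still a loop under the hood, just a tidier one.

```r
# hypothetical helper (not part of the original code): clamp every column of a
# numeric matrix m to mean +/- 3 sd without an explicit loop over the columns
clamp3sd <- function(m) {
  mu <- colMeans(m, na.rm = TRUE)
  s  <- apply(m, 2, sd, na.rm = TRUE)
  # repeat each column limit down its column so it lines up with m element-wise
  top    <- rep(mu + 3 * s, each = nrow(m))
  bottom <- rep(mu - 3 * s, each = nrow(m))
  # which() drops NA comparisons, so missing values are left untouched
  hi <- which(m > top)
  m[hi] <- top[hi]
  lo <- which(m < bottom)
  m[lo] <- bottom[lo]
  m
}
```

if this is right, the standardizing step could then probably be done with scale(clamp3sd(dm)) instead of the two sweep() calls, since scale() centres and scales each column and leaves out the NA values.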


TRY<-function(nsize=500,filename="C:/A.txt",nvar=61,nruns=1)
{

#the matrix with the payoff weights
fit.reg<-matrix(nrow=nruns,ncol=nvar-1)

for (ii in 1:nruns)
{
skip=1+(ii-1)*nsize

        #import the data in batches of "nsize*nvar"
        #save as a matrix and then delete "dscan" to save memory space

        dscan<-scan(file=filename,sep="\t",skip=skip,nlines=nsize,fill=TRUE,quiet=TRUE)
        dm<-matrix(dscan,nrow=nsize,byrow=TRUE)
        rm(dscan)

        #count the NA entries in each column
        #col.points = the number of entries in the column that are NA
        #a column with col.points equal to nsize has no data at all
        #ideally we would only regress on columns with many more than
        #2 data points, but only the "all NA" case is checked below

        col.points<-colSums(is.na(dm))

        #adjust for outliers
        dm.new<-dm
        mean.dm.new<-apply(dm.new,2,function(x) mean(x,na.rm=T))
        sd.dm.new<-apply(dm.new,2,function(x) sd(x,na.rm=T))
        top.dm.new<-mean.dm.new+3*sd.dm.new
        bottom.dm.new<-mean.dm.new-3*sd.dm.new

        for (i in 1:nvar)
        {
                dm.new[,i][dm.new[,i]>top.dm.new[i]]<-top.dm.new[i]
                dm.new[,i][dm.new[,i]<bottom.dm.new[i]]<-bottom.dm.new[i]
        }

        #standardize the variables
        #we dont have to change the variable names here but i did!
        means.dm.new<-apply(dm.new,2,function(x) mean(x,na.rm=T))
        std.dm.new<-apply(dm.new,2,function(x) sd(x,na.rm=T))

        dm.new<-sweep(sweep(dm.new,2,means.dm.new,"-"),2,std.dm.new,"/")

        for (j in 2:nvar)
        {       
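                #we do not perform the regression if all values in the column are NA
                #such columns get NA rather than "L", since a character value
                #would turn the whole fit.reg matrix into character
                if (col.points[j]!=nsize)
                {       
                        #fit the regression and keep the slope
                        fit.reg[ii,j-1]<-coef(lm(dm.new[,1]~dm.new[,j]))[2]
                }
                else fit.reg[ii,j-1]<-NA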
        }
}

dm.names<-scan(file=filename,sep="\t",skip=0,nlines=1,fill=TRUE,quiet=TRUE,what="character")
dm.names<-matrix(dm.names,nrow=1,ncol=nvar,byrow=T)
colnames(fit.reg)<-dm.names[-1]

output<-c("$fit.reg")

list(fit.reg=fit.reg,output=output)

}

a=TRY(nsize=500,filename="C:/A.txt",nvar=61,nruns=1)


==============================================
==============================================
==============================================
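
for step 6, one idea for getting all the slopes at once rather than calling lm() once per column is the textbook formula for a simple regression slope, cov(x,y)/var(x), worked out column by column with colSums(). the sketch below is only that, a sketch: slopes is a made-up name, y stands for the response column and X for the matrix of explanatory columns. only rows where both x and y are present are used for each column, which is what lm() does by default.

```r
# hypothetical helper (not part of the original code): slope of y regressed on
# each column of X, computed for all columns at once
slopes <- function(y, X) {
  # TRUE where both the x value and the matching y value are present
  # (y is recycled down each column of X)
  ok <- !is.na(X) & !is.na(y)
  n  <- colSums(ok)
  # zero out the unusable cells so they drop out of the sums
  X0 <- X
  X0[!ok] <- 0
  Y0 <- matrix(y, nrow(X), ncol(X))
  Y0[!ok] <- 0
  sx  <- colSums(X0)
  sy  <- colSums(Y0)
  sxy <- colSums(X0 * Y0)
  sxx <- colSums(X0^2)
  # slope = cov(x, y) / var(x); the (n - 1) terms cancel
  (sxy - sx * sy / n) / (sxx - sx^2 / n)
}
```

inside the function above this would be used roughly as fit.reg[ii,]<-slopes(dm.new[,1],dm.new[,-1]). a column with no usable pairs gives 0/0 here, so it comes out as NaN rather than an error.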




thanking you in advance
/
allan
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
