Re: [R] replace Na values with the mean of the column which contains them

William Dunlap Mon, 29 Jul 2013 11:43:29 -0700

Replacements are a case where I think an explicit for-loop is better than sapply or any
other *apply function.  The for-loop makes the output resemble the input, while
sapply and friends mangle the class, dimnames, and other attributes of the input.
Also, if you want to replace the NA's by the mean of the containing row, then you have
to use t() on sapply's output.
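
A minimal sketch of the row-wise case, assuming a small numeric matrix. The names
m, by_row_sapply and by_row_loop are made up for illustration and are not from
the original thread.

```r
# Hypothetical example data: a 2 x 3 numeric matrix with row and column names.
m <- matrix(c(1, NA, NA, 4, 5, 6), nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

# sapply over rows returns each filled row as a *column* of its result,
# so t() is needed to get back the original orientation.
by_row_sapply <- function(x) {
  t(sapply(seq_len(nrow(x)), function(i) {
    row <- x[i, ]
    row[is.na(row)] <- mean(row, na.rm = TRUE)
    row
  }))
}

# The for-loop assigns in place, so the shape and dimnames of the input survive.
by_row_loop <- function(x) {
  for (i in seq_len(nrow(x))) {
    x[i, is.na(x[i, ])] <- mean(x[i, ], na.rm = TRUE)
  }
  x
}

dimnames(by_row_sapply(m))  # row names are lost
dimnames(by_row_loop(m))    # same dimnames as m
```

Both functions fill in the same values here, but only the loop version keeps the
row names.  The same loss of attributes shows up in the column-wise case.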


E.g.
  > d <- cbind(AllNAs=NA, NoNAs=c(i=1,ii=2,iii=3,iv=4,v=5),
  +            SomeNAs=rep(c(100,NA),len=5))
  > f1 <- function(de) sapply(seq_len(ncol(de)), function(i) {
  +     de[,i][is.na(de[,i])] <- mean(de[,i], na.rm=TRUE); de[,i] })
  > f2 <- function(de) {
  +     for(i in seq_len(ncol(de))) de[is.na(de[,i]),i] <- mean(de[,i], na.rm=TRUE)
  +     de }
  > str(f1(d)) # no column names
   num [1:5, 1:3] NaN NaN NaN NaN NaN 1 2 3 4 5 ...
   - attr(*, "dimnames")=List of 2
    ..$ : chr [1:5] "i" "ii" "iii" "iv" ...
    ..$ : NULL
  > str(f2(d))
   num [1:5, 1:3] NaN NaN NaN NaN NaN 1 2 3 4 5 ...
   - attr(*, "dimnames")=List of 2
    ..$ : chr [1:5] "i" "ii" "iii" "iv" ...
    ..$ : chr [1:3] "AllNAs" "NoNAs" "SomeNAs"

  > df <- data.frame(AllNAs=NA, NoNAs=c(i=1,ii=2,iii=3,iv=4,v=5),
  +                  SomeNAs=rep(c(100+1i,NA),len=5))
  > str(f1(df)) # matrix of complex, not data.frame
   cplx [1:5, 1:3] NaN+0i NaN+0i NaN+0i ...
  > str(f2(df))
  'data.frame':   5 obs. of  3 variables:
   $ AllNAs : num  NaN NaN NaN NaN NaN
   $ NoNAs  : num  1 2 3 4 5
   $ SomeNAs: cplx  100+1i 100+1i 100+1i ...
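
For what it is worth, the attempt in the original question quoted below appears
to go wrong because the vector of column means is recycled over all the NA
positions at once.  A minimal sketch on a made-up matrix (the names x, bad and
good are hypothetical, and is.na() is used directly instead of which(is.na())):

```r
# Hypothetical 4 x 2 matrix: both NAs are in column "a".
x <- cbind(a = c(NA, NA, 2, 4), b = c(10, 20, 30, 40))
col_means <- colMeans(x, na.rm = TRUE)   # a = 3, b = 25

# Assigning the vector of means to all NA positions at once recycles it
# in column-major order, regardless of which column each NA sits in.
bad <- x
bad[is.na(bad)] <- col_means
bad[, "a"]   # 3 25 2 4: the second NA received the mean of column "b"

# Looping over columns keeps each mean with its own column.
good <- x
for (j in seq_len(ncol(good))) {
  good[is.na(good[, j]), j] <- col_means[j]
}
good[, "a"]  # 3 3 2 4

# Filling NAs with the column mean leaves the column means unchanged.
all.equal(colMeans(good), col_means)
```

When the number of NAs is not a multiple of the number of means, R also warns
that the number of items to replace is not a multiple of the replacement length.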

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of arun
> Sent: Monday, July 29, 2013 10:58 AM
> To: iza.ch1
> Cc: R help
> Subject: Re: [R] replace Na values with the mean of the column which contains them
> 
> Hi,
> 
> de<- structure(c(NA, NA, NA, NA, NA, NA, NA, NA, 0.27500571, -3.07568579,
> -0.42240954, -0.26901731, 0.01766284, -0.8099958, 0.20805934,
> 0.03036708, -0.26928087, 1.20925752, 0.38012008, -0.41778861,
> -0.49677462, -0.13248754, -0.54179054, 0.35788624, -0.41467591,
> -0.59234248, 0.73642396, -0.06768044, -0.40321968, -1.52283305,
> 0.25974308, -0.0401373, -0.1192078, 0.9325334, -1.8927164, 1.4330507,
> 0.2892706, 1.3976522, 0.2295291, -0.5009389, -0.342656, -0.8439027,
> -0.4971999, -1.6127122, -0.6508823, 1.4729576, -1.6093478, 0.1686006
> ), .Dim = c(16L, 3L))
> 
> 
> Your code should be:
> sapply(seq_len(ncol(de)),function(i) {de[,i][is.na(de[,i])]<-
> mean(de[,i],na.rm=TRUE);de[,i]})
> A.K.
> 
> 
> 
> 
> Hi everyone
> 
> I have a problem with replacing the NA values with the mean of
> the column which contains them. If I replace the NAs with the mean of the
> remaining values in the column, the mean of the whole column will still be
> the same as if I had omitted the NA values. I have the following data
> 
> de
>              [,1]        [,2]       [,3]
>  [1,]          NA -0.26928087 -0.1192078
>  [2,]          NA  1.20925752  0.9325334
>  [3,]          NA  0.38012008 -1.8927164
>  [4,]          NA -0.41778861  1.4330507
>  [5,]          NA -0.49677462  0.2892706
>  [6,]          NA -0.13248754  1.3976522
>  [7,]          NA -0.54179054  0.2295291
>  [8,]          NA  0.35788624 -0.5009389
>  [9,]  0.27500571 -0.41467591 -0.3426560
> [10,] -3.07568579 -0.59234248 -0.8439027
> [11,] -0.42240954  0.73642396 -0.4971999
> [12,] -0.26901731 -0.06768044 -1.6127122
> [13,]  0.01766284 -0.40321968 -0.6508823
> [14,] -0.80999580 -1.52283305  1.4729576
> [15,]  0.20805934  0.25974308 -1.6093478
> [16,]  0.03036708 -0.04013730  0.1686006
> 
> and I wrote the code
> de[which(is.na(de))]<-sapply(seq_len(ncol(de)),function(i) 
> {mean(de[,i],na.rm=TRUE)})
> 
> I get as the result
>              [,1]        [,2]       [,3]
>  [1,] -0.50575168 -0.26928087 -0.1192078
>  [2,] -0.12222376  1.20925752  0.9325334
>  [3,] -0.13412312  0.38012008 -1.8927164
>  [4,] -0.50575168 -0.41778861  1.4330507
>  [5,] -0.12222376 -0.49677462  0.2892706
>  [6,] -0.13412312 -0.13248754  1.3976522
>  [7,] -0.50575168 -0.54179054  0.2295291
>  [8,] -0.12222376  0.35788624 -0.5009389
>  [9,]  0.27500571 -0.41467591 -0.3426560
> [10,] -3.07568579 -0.59234248 -0.8439027
> [11,] -0.42240954  0.73642396 -0.4971999
> [12,] -0.26901731 -0.06768044 -1.6127122
> [13,]  0.01766284 -0.40321968 -0.6508823
> [14,] -0.80999580 -1.52283305  1.4729576
> [15,]  0.20805934  0.25974308 -1.6093478
> [16,]  0.03036708 -0.04013730  0.1686006
> 
> It has replaced the first NA in the first column with the mean of the first
> column (-0.505...), the second NA with the mean of the second column, etc.
> I want to have the result like this:
>              [,1]        [,2]       [,3]
>  [1,] -0.50575168 -0.26928087 -0.1192078
>  [2,] -0.50575168  1.20925752  0.9325334
>  [3,] -0.50575168  0.38012008 -1.8927164
>  [4,] -0.50575168 -0.41778861  1.4330507
>  [5,] -0.50575168 -0.49677462  0.2892706
>  [6,] -0.50575168 -0.13248754  1.3976522
>  [7,] -0.50575168 -0.54179054  0.2295291
>  [8,] -0.50575168  0.35788624 -0.5009389
>  [9,]  0.27500571 -0.41467591 -0.3426560
> [10,] -3.07568579 -0.59234248 -0.8439027
> [11,] -0.42240954  0.73642396 -0.4971999
> [12,] -0.26901731 -0.06768044 -1.6127122
> [13,]  0.01766284 -0.40321968 -0.6508823
> [14,] -0.80999580 -1.52283305  1.4729576
> [15,]  0.20805934  0.25974308 -1.6093478
> [16,]  0.03036708 -0.04013730  0.1686006
> 
> Thanks in advance
> 
> ______________________________________________
> [email protected] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

