Re: [R] Dates and missing values

Marc Schwartz Mon, 08 Feb 2016 12:39:38 -0800

> On Feb 8, 2016, at 12:45 PM, Göran Broström <[email protected]> wrote:
> 
> Thanks Marc, but see below!
> 
> On 2016-02-08 19:26, Marc Schwartz wrote:
>> 
>>> On Feb 8, 2016, at 11:26 AM, Göran Broström <[email protected]> wrote:
>>> 
>>> I have a data frame with dates as integers:
>>> 
>>>> summary(persons[, c("foddat", "doddat")])
>>>     foddat             doddat
>>> Min.   :16790000   Min.   :18000000
>>> 1st Qu.:18760904   1st Qu.:18810924
>>> Median :19030426   Median :19091227
>>> Mean   :18946659   Mean   :19027233
>>> 3rd Qu.:19220911   3rd Qu.:19310526
>>> Max.   :19660124   Max.   :19691228
>>> NA's   :624        NA's   :207570
>>> 
>>> After converting the dates to Date format ('as.Date') I get:
>>> 
>>>> summary(per[, c("foddat", "doddat")])
>>>    foddat               doddat
>>> Min.   :1679-07-01   Min.   :1800-01-26
>>> 1st Qu.:1876-09-04   1st Qu.:1881-09-24
>>> Median :1903-04-26   Median :1909-12-27
>>> Mean   :1895-02-04   Mean   :1903-02-22
>>> 3rd Qu.:1922-09-10   3rd Qu.:1931-05-26
>>> Max.   :1966-01-24   Max.   :1969-12-28
>>> 
>>> My question is: Why are the numbers of missing values not printed in the 
>>> second case? 'is.na' gives the correct (same) numbers.
>>> 
>>> Can I somehow force 'summary' to print NA's? I found no clues in the 
>>> documentation.
>> 
>> 
>> Hi,
>> 
>> Two things:
>> 
>> 1. We are going to need to see the exact call to as.Date() that you used. 
>> as.Date() will take a numeric vector as input, but the presumption is that 
>> the number represents the number of days since an origin, which needs to be 
>> specified explicitly. If you coerced the numeric vector to character first, 
>> presuming a "%Y%m%d" format, then you need to be cautious about how that is 
>> done and the result.
>> 
>> 2. Your second call is to a data frame called 'per', which may or may not 
>> have the same content as 'persons' in your first call.
>> 
>> 
>> If I do the following, taking some of your numeric values from above:
>> 
>> x <- c(18000000, 18810924, 19091227, 19027233, 19310526, 19691228, NA)
>> 
>> DF <- data.frame(x)
>> 
>>> summary(DF)
>>        x
>>  Min.   :18000000
>>  1st Qu.:18865001
>>  Median :19059230
>>  Mean   :18988523
>>  3rd Qu.:19255701
>>  Max.   :19691228
>>  NA's   :1
>> 
>>> as.character(DF$x)
>> [1] "1.8e+07"  "18810924" "19091227" "19027233" "19310526" "19691228"
>> [7] NA
>> 
>> DF$x.Date <- as.Date(as.character(DF$x), format = "%Y%m%d")
>> 
>>> DF
>>          x     x.Date
>> 1 18000000       <NA>
>> 2 18810924 1881-09-24
>> 3 19091227 1909-12-27
>> 4 19027233       <NA>
>> 5 19310526 1931-05-26
>> 6 19691228 1969-12-28
>> 7       NA       <NA>
>> 
>>> summary(DF)
>>        x                x.Date
>>  Min.   :18000000   Min.   :1881-09-24
>>  1st Qu.:18865001   1st Qu.:1902-12-04
>>  Median :19059230   Median :1920-09-10
>>  Mean   :18988523   Mean   :1923-04-12
>>  3rd Qu.:19255701   3rd Qu.:1941-01-17
>>  Max.   :19691228   Max.   :1969-12-28
>>  NA's   :1          NA's   :3
>> 
> But:
> 
> > summary(DF[, "x.Date", drop = FALSE])
>     x.Date
> Min.   :1881-09-24
> 1st Qu.:1902-12-04
> Median :1920-09-10
> Mean   :1923-04-12
> 3rd Qu.:1941-01-17
> Max.   :1969-12-28
> 
> No NA's. But again:
> 
> > summary(DF[, "x.Date"])
>        Min.      1st Qu.       Median         Mean      3rd Qu.   Max.
> "1881-09-24" "1902-12-04" "1920-09-10" "1923-04-12" "1941-01-17" "1969-12-28"
>        NA's
>         "3"
> 
>> 
>> So summary does support the reporting of NA's for Dates, using 
>> summary.Date().
> 
> Not always, as it seems. Strange. (The 'persons' vs. 'per' is a red herring.)
> 
> Göran Broström



Ok, thanks for the clarification.

I spent some time running summary.Date() under debug, trying to see where 
things fail.

Within the function, the result object 'x' is created correctly, with the 
correct class attributes and the count of NA values retained in an "NAs" 
attribute.

However, upon function exit, the class attributes appear to be lost and the 
result is of class table, which also loses the "NAs" attribute, which is 
assigned within the function body.

I believe that this is happening within summary.data.frame().

I can extend the example more generally, when the only columns in the source 
data frame are Dates:

DF.Dates <- data.frame(Col1 = DF$x.Date, Col2 = DF$x.Date)

> DF.Dates
        Col1       Col2
1       <NA>       <NA>
2 1881-09-24 1881-09-24
3 1909-12-27 1909-12-27
4       <NA>       <NA>
5 1931-05-26 1931-05-26
6 1969-12-28 1969-12-28
7       <NA>       <NA>

> summary(DF.Dates)
      Col1                 Col2           
 Min.   :1881-09-24   Min.   :1881-09-24  
 1st Qu.:1902-12-04   1st Qu.:1902-12-04  
 Median :1920-09-10   Median :1920-09-10  
 Mean   :1923-04-12   Mean   :1923-04-12  
 3rd Qu.:1941-01-17   3rd Qu.:1941-01-17  
 Max.   :1969-12-28   Max.   :1969-12-28  


So, it is not dependent upon the subsetting used in your original call per se, 
but when the data frame passed to summary.data.frame() consists of only Date 
class columns.

I am still working through the code, but the immediate source of the issue 
appears to be the following line in summary.data.frame:

length(sms) <- nr

which truncates the internal object 'sms': before that line 'sms' is of 
length 7, and afterwards it is of length 6:

Browse[2]> nr
[1] 6
Browse[2]> sms
[1] "Min.   :1881-09-24  " "1st Qu.:1902-12-04  " "Median :1920-09-10  "
[4] "Mean   :1923-04-12  " "3rd Qu.:1941-01-17  " "Max.   :1969-12-28  "
[7] "NA's   :3  "         
Browse[2]> 
debug: length(sms) <- nr
Browse[2]> sms
[1] "Min.   :1881-09-24  " "1st Qu.:1902-12-04  " "Median :1920-09-10  "
[4] "Mean   :1923-04-12  " "3rd Qu.:1941-01-17  " "Max.   :1969-12-28  "
[7] "NA's   :3  "         
Browse[2]> 
debug: z[[i]] <- sms
Browse[2]> sms
[1] "Min.   :1881-09-24  " "1st Qu.:1902-12-04  " "Median :1920-09-10  "
[4] "Mean   :1923-04-12  " "3rd Qu.:1941-01-17  " "Max.   :1969-12-28  "



OK, I now believe that I have found the issue...

Internally, an object 'z' is created by the following:

z <- lapply(X = as.list(object), FUN = summary, maxsum = maxsum, 
        digits = 12L, ...)

For my data frame, DF.Dates, 'z' is:

Browse[2]> z
$Col1
        Min.      1st Qu.       Median         Mean      3rd Qu. 
"1881-09-24" "1902-12-04" "1920-09-10" "1923-04-12" "1941-01-17" 
        Max.         NA's 
"1969-12-28"          "3" 

$Col2
        Min.      1st Qu.       Median         Mean      3rd Qu. 
"1881-09-24" "1902-12-04" "1920-09-10" "1923-04-12" "1941-01-17" 
        Max.         NA's 
"1969-12-28"          "3" 

which shows the result of summary.Date() on the two columns. 

The print()ed output is the result of each list element being of the class set 
by summary.Date():

Browse[2]> class(z$Col1)
[1] "summaryDefault" "table"          "Date"          
Browse[2]> class(z$Col2)
[1] "summaryDefault" "table"          "Date"       


The problem is that the NA component of the result is an attribute and not part 
of the vector itself:

Browse[2]> str(z)
List of 2
 $ Col1: summaryDefault[1:6], format: "1881-09-24" ...
  ..- attr(*, "names")="Min." ...
 $ Col2: summaryDefault[1:6], format: "1881-09-24" ...
  ..- attr(*, "names")="Min." ...


Note that each list element is of length 6, hence the value used in 'nr' above, 
rather than 7.

The counts of NA values are stored in attributes:

Browse[2]> attr(z$Col1, "NAs")
[1] 3
Browse[2]> attr(z$Col2, "NAs")
[1] 3


Hence, when internal variable 'nr' is set, it is:

Browse[2]> max(unlist(lapply(z, NROW)))
[1] 6

Browse[2]> nr
[1] 6


And...that results in the truncation seen above and the loss of the NA 
attribute components otherwise returned.
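
The effect on 'nr' can be reproduced without summary() at all, since NROW() 
only counts elements and ignores attributes. This is a sketch under 
assumptions: 'date_like' and 'num_like' are mock objects I built by hand to 
imitate the two shapes of result (count in an "NAs" attribute versus count as 
a seventh element); they are not what the summary methods actually return.

```r
# Mock of the Date-style result seen in the debug output above:
# six elements, with the NA count kept in an "NAs" attribute.
date_like <- structure(
  c(Min. = 1, `1st Qu.` = 2, Median = 3, Mean = 4, `3rd Qu.` = 5, Max. = 6),
  NAs = 3L
)

# Mock of a numeric-style result: the NA count is a seventh element.
num_like <- c(Min. = 1, `1st Qu.` = 2, Median = 3, Mean = 4,
              `3rd Qu.` = 5, Max. = 6, `NA's` = 1)

# Same expression as used for 'nr' in the debug session
nr_dates_only <- max(unlist(lapply(list(date_like, date_like), NROW)))
nr_mixed      <- max(unlist(lapply(list(date_like, num_like), NROW)))

print(nr_dates_only)   # the attribute does not count towards the length
print(nr_mixed)        # the extra element does
```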

My original example worked, with a Date column alongside columns of other 
data types, because 'nr' is then set internally to the correct length (7) by 
the other columns, BUT only if NA's are present in at least one of those 
other columns:

DF.Dates$Col3 <- 1:7

> DF.Dates
        Col1       Col2 Col3
1       <NA>       <NA>    1
2 1881-09-24 1881-09-24    2
3 1909-12-27 1909-12-27    3
4       <NA>       <NA>    4
5 1931-05-26 1931-05-26    5
6 1969-12-28 1969-12-28    6
7       <NA>       <NA>    7

> summary(DF.Dates)
      Col1                 Col2                 Col3    
 Min.   :1881-09-24   Min.   :1881-09-24   Min.   :1.0  
 1st Qu.:1902-12-04   1st Qu.:1902-12-04   1st Qu.:2.5  
 Median :1920-09-10   Median :1920-09-10   Median :4.0  
 Mean   :1923-04-12   Mean   :1923-04-12   Mean   :4.0  
 3rd Qu.:1941-01-17   3rd Qu.:1941-01-17   3rd Qu.:5.5  
 Max.   :1969-12-28   Max.   :1969-12-28   Max.   :7.0  


DF.Dates$Col3 <- c(1:6, NA)

> summary(DF.Dates)
      Col1                 Col2                 Col3     
 Min.   :1881-09-24   Min.   :1881-09-24   Min.   :1.00  
 1st Qu.:1902-12-04   1st Qu.:1902-12-04   1st Qu.:2.25  
 Median :1920-09-10   Median :1920-09-10   Median :3.50  
 Mean   :1923-04-12   Mean   :1923-04-12   Mean   :3.50  
 3rd Qu.:1941-01-17   3rd Qu.:1941-01-17   3rd Qu.:4.75  
 Max.   :1969-12-28   Max.   :1969-12-28   Max.   :6.00  
 NA's   :3            NA's   :3            NA's   :1     



So, from what this suggests, there is a bug in summary.data.frame() when Date 
class columns contain NA's and no non-Date column has NA's.
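
In the meantime, the NA counts can be obtained without relying on summary(). 
A minimal sketch, rebuilding the DF.Dates example from above; 'na_counts' and 
'shows_nas' are names I introduce here for illustration. The last check 
simply reports whether the running version of R prints the "NA's" row for a 
Date-only data frame, and I make no claim about what it returns on any given 
version.

```r
# Rebuild the Date-only data frame used above
x <- as.Date(c(NA, "1881-09-24", "1909-12-27", NA,
               "1931-05-26", "1969-12-28", NA), format = "%Y-%m-%d")
DF.Dates <- data.frame(Col1 = x, Col2 = x)

# NA counts per column, computed directly and independent of summary()
na_counts <- colSums(is.na(DF.Dates))
print(na_counts)

# Does summary() on this R installation show an "NA's" row here?
shows_nas <- any(grepl("NA's", summary(DF.Dates), fixed = TRUE))
print(shows_nas)
```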

The key would seem to be to modify the code that creates 'nr', which is 
currently:

nr <- if (nv) 
        max(unlist(lapply(z, NROW)))
    else 0


to account for the presence of the "NAs" attribute from summary.Date(), or to 
restore the attribute further down in the code, if present. Alternatively, one 
could modify the code for summary.Date() so that rather than adding the "NAs" 
attribute:

x <- summary.default(unclass(object), digits = digits, ...)
if (m <- match("NA's", names(x), 0)) {
    NAs <- as.integer(x[m])
    x <- x[-m]
    attr(x, "NAs") <- NAs
}


it behaves more like summary.default(), so that the NA count is an actual 
element in the result vector, rather than an attribute:

nas <- is.na(object)
object <- object[!nas]
qq <- stats::quantile(object)
qq <- signif(c(qq[1L:3L], mean(object), qq[4L:5L]), digits)
names(qq) <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", 
    "Max.")
if (any(nas)) 
    c(qq, `NA's` = sum(nas))
else qq


This is where I would defer to a member of R Core for guidance, since I presume 
that there may be some logic in the difference, other than perhaps different 
authors over time, and there may be other implications that I am not 
considering here.
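
To make the first option concrete, here is a rough sketch of folding the "NAs" 
attribute back into the vector before 'nr' is computed. Note that fold_nas() 
is a hypothetical helper of my own, not part of base R and not a proposed 
patch, and 's' is a hand-built mock of a formatted Date summary rather than 
actual summary.Date() output.

```r
# Hypothetical helper (not in base R): if an "NAs" attribute is present,
# append it as a named "NA's" element; otherwise return the values unchanged.
fold_nas <- function(s) {
  nas <- attr(s, "NAs")
  vals <- stats::setNames(as.character(s), names(s))  # drops other attributes
  if (is.null(nas)) vals else c(vals, `NA's` = as.character(nas))
}

# Hand-built mock of a formatted Date summary with the count as an attribute
s <- structure(
  c(Min. = "1881-09-24", `1st Qu.` = "1902-12-04", Median = "1920-09-10",
    Mean = "1923-04-12", `3rd Qu.` = "1941-01-17", Max. = "1969-12-28"),
  NAs = 3L
)

folded <- fold_nas(s)
print(folded)

# With the count folded in, the 'nr' expression sees seven rows
nr <- max(unlist(lapply(list(Col1 = folded, Col2 = folded), NROW)))
print(nr)
```

Whether something like this belongs in summary.data.frame() or in 
summary.Date() itself is exactly the design question I would leave to R Core.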

Regards,

Marc

______________________________________________
[email protected] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
