> On Feb 8, 2016, at 12:45 PM, Göran Broström <[email protected]> wrote:
>
> Thanks Marc, but see below!
>
> On 2016-02-08 19:26, Marc Schwartz wrote:
>>
>>> On Feb 8, 2016, at 11:26 AM, Göran Broström <[email protected]> wrote:
>>>
>>> I have a data frame with dates as integers:
>>>
>>>> summary(persons[, c("foddat", "doddat")])
>>>      foddat             doddat
>>>  Min.   :16790000   Min.   :18000000
>>>  1st Qu.:18760904   1st Qu.:18810924
>>>  Median :19030426   Median :19091227
>>>  Mean   :18946659   Mean   :19027233
>>>  3rd Qu.:19220911   3rd Qu.:19310526
>>>  Max.   :19660124   Max.   :19691228
>>>  NA's   :624        NA's   :207570
>>>
>>> After converting the dates to Date format ('as.Date') I get:
>>>
>>>> summary(per[, c("foddat", "doddat")])
>>>      foddat               doddat
>>>  Min.   :1679-07-01   Min.   :1800-01-26
>>>  1st Qu.:1876-09-04   1st Qu.:1881-09-24
>>>  Median :1903-04-26   Median :1909-12-27
>>>  Mean   :1895-02-04   Mean   :1903-02-22
>>>  3rd Qu.:1922-09-10   3rd Qu.:1931-05-26
>>>  Max.   :1966-01-24   Max.   :1969-12-28
>>>
>>> My question is: Why are the numbers of missing values not printed in the
>>> second case? 'is.na' gives the correct (same) numbers.
>>>
>>> Can I somehow force 'summary' to print NA's? I found no clues in the
>>> documentation.
>>
>>
>> Hi,
>>
>> Two things:
>>
>> 1. We are going to need to see the exact call to as.Date() that you used.
>> as.Date() will take a numeric vector as input, but the presumption is that
>> the number represents the number of days since an origin, which needs to be
>> specified explicitly. If you coerced the numeric vector to character first,
>> presuming a "%Y%m%d" format, then you need to be cautious about how that is
>> done and the result.
>>
>> 2. Your second call is to a data frame called 'per', which may or may not
>> have the same content as 'persons' in your first call.
>>
>>
>> If I do the following, taking some of your numeric values from above:
>>
>> x <- c(18000000, 18810924, 19091227, 19027233, 19310526, 19691228, NA)
>>
>> DF <- data.frame(x)
>>
>>> summary(DF)
>>        x
>>  Min.   :18000000
>>  1st Qu.:18865001
>>  Median :19059230
>>  Mean   :18988523
>>  3rd Qu.:19255701
>>  Max.   :19691228
>>  NA's   :1
>>
>>> as.character(DF$x)
>> [1] "1.8e+07" "18810924" "19091227" "19027233" "19310526" "19691228"
>> [7] NA
>>
>> DF$x.Date <- as.Date(as.character(DF$x), format = "%Y%m%d")
>>
>>> DF
>>          x     x.Date
>> 1 18000000       <NA>
>> 2 18810924 1881-09-24
>> 3 19091227 1909-12-27
>> 4 19027233       <NA>
>> 5 19310526 1931-05-26
>> 6 19691228 1969-12-28
>> 7       NA       <NA>
>>
>>> summary(DF)
>>        x                x.Date
>>  Min.   :18000000   Min.   :1881-09-24
>>  1st Qu.:18865001   1st Qu.:1902-12-04
>>  Median :19059230   Median :1920-09-10
>>  Mean   :18988523   Mean   :1923-04-12
>>  3rd Qu.:19255701   3rd Qu.:1941-01-17
>>  Max.   :19691228   Max.   :1969-12-28
>>  NA's   :1          NA's   :3
>>
> But:
>
> > summary(DF[, "x.Date", drop = FALSE])
>      x.Date
>  Min.   :1881-09-24
>  1st Qu.:1902-12-04
>  Median :1920-09-10
>  Mean   :1923-04-12
>  3rd Qu.:1941-01-17
>  Max.   :1969-12-28
>
> No NA's. But again:
>
> > summary(DF[, "x.Date"])
>         Min.      1st Qu.       Median         Mean      3rd Qu.         Max.
> "1881-09-24" "1902-12-04" "1920-09-10" "1923-04-12" "1941-01-17" "1969-12-28"
>         NA's
>          "3"
>
>>
>> So summary does support the reporting of NA's for Dates, using
>> summary.Date().
>
> Not always, as it seems. Strange. (The 'persons' vs. 'per' is a red herring.)
>
> Göran Broström

OK, thanks for the clarification.

I spent some time running summary.Date() under debug, trying to see where
things fail.

Within the function, the result object 'x' is created correctly, with the
correct class attributes and the count of NA values retained in an "NAs"
attribute.

However, upon function exit, the class attributes appear to be lost and the
result is of class table, which also loses the "NAs" attribute that is
assigned within the function body.

I believe that this is happening within summary.data.frame().
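
For anyone who wants to look at the same thing without stepping through the debugger, here is a minimal sketch that inspects the summary of a bare Date vector. The dates are the ones from the example above; whether the NA count appears as an "NAs" attribute or as a regular element may depend on the R version, so no output is shown.

```r
# Dates taken from the example above, with three missing values.
d <- as.Date(c(NA, "1881-09-24", "1909-12-27", NA,
               "1931-05-26", "1969-12-28", NA))

s <- summary(d)

class(s)        # class vector set by summary.Date()
length(s)       # number of elements actually stored in the vector
names(s)        # labels of those elements
attr(s, "NAs")  # NA count, if this R version stores it as an attribute

# Version-independent cross-check of the NA count.
n_missing <- sum(is.na(d))
n_missing
```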

I can extend the example more generally, to the case where the only columns in
the source data frame are Dates:

DF.Dates <- data.frame(Col1 = DF$x.Date, Col2 = DF$x.Date)

> DF.Dates
        Col1       Col2
1       <NA>       <NA>
2 1881-09-24 1881-09-24
3 1909-12-27 1909-12-27
4       <NA>       <NA>
5 1931-05-26 1931-05-26
6 1969-12-28 1969-12-28
7       <NA>       <NA>

> summary(DF.Dates)
      Col1                 Col2
 Min.   :1881-09-24   Min.   :1881-09-24
 1st Qu.:1902-12-04   1st Qu.:1902-12-04
 Median :1920-09-10   Median :1920-09-10
 Mean   :1923-04-12   Mean   :1923-04-12
 3rd Qu.:1941-01-17   3rd Qu.:1941-01-17
 Max.   :1969-12-28   Max.   :1969-12-28

So, it does not depend on the subsetting used in your original call per se. It
happens when the data frame passed to summary.data.frame() consists of only
Date class columns.
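
Until the summary() output can be relied on here, one possible workaround is to count the missing values directly. This is only a sketch that rebuilds the example data frame from scratch; it does not change what summary() prints.

```r
# Rebuild the Date-only data frame from the example above.
d <- as.Date(c(NA, "1881-09-24", "1909-12-27", NA,
               "1931-05-26", "1969-12-28", NA))
DF.Dates <- data.frame(Col1 = d, Col2 = d)

# Count missing values per column without relying on summary().
na_counts <- colSums(is.na(DF.Dates))
na_counts
```

The result is a named vector with one count per column, which can be printed next to the summary() table.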

I am still working through the code, but the proximate source of the issue
appears to be the following line in summary.data.frame():

length(sms) <- nr

which truncates the internal object 'sms'. Before that line executes, 'sms' is
of length 7; afterwards, it is of length 6:

Browse[2]> nr
[1] 6
Browse[2]> sms
[1] "Min. :1881-09-24 " "1st Qu.:1902-12-04 " "Median :1920-09-10 "
[4] "Mean :1923-04-12 " "3rd Qu.:1941-01-17 " "Max. :1969-12-28 "
[7] "NA's :3 "
Browse[2]>
debug: length(sms) <- nr
Browse[2]> sms
[1] "Min. :1881-09-24 " "1st Qu.:1902-12-04 " "Median :1920-09-10 "
[4] "Mean :1923-04-12 " "3rd Qu.:1941-01-17 " "Max. :1969-12-28 "
[7] "NA's :3 "
Browse[2]>
debug: z[[i]] <- sms
Browse[2]> sms
[1] "Min. :1881-09-24 " "1st Qu.:1902-12-04 " "Median :1920-09-10 "
[4] "Mean :1923-04-12 " "3rd Qu.:1941-01-17 " "Max. :1969-12-28 "
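
The effect of that assignment can be seen in isolation. The sketch below uses a simplified stand-in for 'sms' (labels only, not the real formatted cells) to show that `length<-` drops trailing elements when shortening and pads with NA when lengthening. Padding shorter columns is presumably the purpose of the line in summary.data.frame(), though that is my reading of the code, not something documented.

```r
# Stand-in for the seven formatted summary cells of one column.
sms <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.", "NA's")

nr <- 6L
length(sms) <- nr   # shortening silently drops the trailing "NA's" cell
sms

# Lengthening pads with NA instead.
short <- c("a", "b")
length(short) <- 4L
short
```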

OK, I now believe that I have found the issue...

Internally, an object 'z' is created by the following:

z <- lapply(X = as.list(object), FUN = summary, maxsum = maxsum,
            digits = 12L, ...)

For my data frame, DF.Dates, 'z' is:

Browse[2]> z
$Col1
        Min.      1st Qu.       Median         Mean      3rd Qu.
"1881-09-24" "1902-12-04" "1920-09-10" "1923-04-12" "1941-01-17"
        Max.         NA's
"1969-12-28"          "3"

$Col2
        Min.      1st Qu.       Median         Mean      3rd Qu.
"1881-09-24" "1902-12-04" "1920-09-10" "1923-04-12" "1941-01-17"
        Max.         NA's
"1969-12-28"          "3"

which shows the result of summary.Date() on the two columns.

The print()ed output is the result of each list element being of the class set
by summary.Date():

Browse[2]> class(z$Col1)
[1] "summaryDefault" "table"          "Date"
Browse[2]> class(z$Col2)
[1] "summaryDefault" "table"          "Date"

The problem is that the NA component of the result is an attribute and not part
of the vector itself:

Browse[2]> str(z)
List of 2
 $ Col1: summaryDefault[1:6], format: "1881-09-24" ...
  ..- attr(*, "names")="Min." ...
 $ Col2: summaryDefault[1:6], format: "1881-09-24" ...
  ..- attr(*, "names")="Min." ...

Note that each list element is of length 6, which is why 'nr' above is 6
rather than 7.

The count of NA values is stored in an attribute of each element:

Browse[2]> attr(z$Col1, "NAs")
[1] 3
Browse[2]> attr(z$Col2, "NAs")
[1] 3

Hence, when the internal variable 'nr' is set, it is:

Browse[2]> max(unlist(lapply(z, NROW)))
[1] 6
Browse[2]> nr
[1] 6

And...that results in the truncation seen above and the loss of the NA counts
that would otherwise be returned.
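
The same arithmetic can be reproduced outside the debugger. The sketch below uses hand-built stand-ins for the two list elements (a named character vector plus an "NAs" attribute, mimicking what the session above shows), so it does not depend on how a given R version implements summary.Date(). The helper name make_standin() is made up for this illustration.

```r
# Hand-built stand-in for what summary.Date() returned in the session above:
# six named values plus the NA count stored as an "NAs" attribute.
make_standin <- function() {
  x <- c(`Min.` = "1881-09-24", `1st Qu.` = "1902-12-04",
         `Median` = "1920-09-10", `Mean` = "1923-04-12",
         `3rd Qu.` = "1941-01-17", `Max.` = "1969-12-28")
  attr(x, "NAs") <- 3L
  x
}
z <- list(Col1 = make_standin(), Col2 = make_standin())

nr <- max(unlist(lapply(z, NROW)))  # attributes do not count towards NROW()
nr
attr(z$Col1, "NAs")                 # the count is still there, just not a row
```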

My original example, where a Date column is present alongside columns of other
data types, worked because 'nr' is then set internally to the correct length
(7) by the other columns, BUT only if NA's are present in at least one of those
other columns:

DF.Dates$Col3 <- 1:7

> DF.Dates
        Col1       Col2 Col3
1       <NA>       <NA>    1
2 1881-09-24 1881-09-24    2
3 1909-12-27 1909-12-27    3
4       <NA>       <NA>    4
5 1931-05-26 1931-05-26    5
6 1969-12-28 1969-12-28    6
7       <NA>       <NA>    7

> summary(DF.Dates)
      Col1                 Col2                 Col3
 Min.   :1881-09-24   Min.   :1881-09-24   Min.   :1.0
 1st Qu.:1902-12-04   1st Qu.:1902-12-04   1st Qu.:2.5
 Median :1920-09-10   Median :1920-09-10   Median :4.0
 Mean   :1923-04-12   Mean   :1923-04-12   Mean   :4.0
 3rd Qu.:1941-01-17   3rd Qu.:1941-01-17   3rd Qu.:5.5
 Max.   :1969-12-28   Max.   :1969-12-28   Max.   :7.0

DF.Dates$Col3 <- c(1:6, NA)

> summary(DF.Dates)
      Col1                 Col2                 Col3
 Min.   :1881-09-24   Min.   :1881-09-24   Min.   :1.00
 1st Qu.:1902-12-04   1st Qu.:1902-12-04   1st Qu.:2.25
 Median :1920-09-10   Median :1920-09-10   Median :3.50
 Mean   :1923-04-12   Mean   :1923-04-12   Mean   :3.50
 3rd Qu.:1941-01-17   3rd Qu.:1941-01-17   3rd Qu.:4.75
 Max.   :1969-12-28   Max.   :1969-12-28   Max.   :6.00
 NA's   :3            NA's   :3            NA's   :1

So, from what this suggests, there is a bug in summary.data.frame(): the NA
counts for Date columns are dropped unless at least one non-Date column also
contains NA's.

The key would seem to be to modify the code that creates 'nr', which is
currently:

nr <- if (nv)
    max(unlist(lapply(z, NROW)))
else 0

to account for the presence of the "NAs" attribute from summary.Date(), or to
restore the attribute further down in the code, if present, or alternatively,
to modify the code for summary.Date() so that rather than adding the "NAs"
attribute:

x <- summary.default(unclass(object), digits = digits, ...)
if (m <- match("NA's", names(x), 0)) {
    NAs <- as.integer(x[m])
    x <- x[-m]
    attr(x, "NAs") <- NAs
}

it behaves more like summary.default(), so that the NA count is an actual
element in the result vector, rather than an attribute:

nas <- is.na(object)
object <- object[!nas]
qq <- stats::quantile(object)
qq <- signif(c(qq[1L:3L], mean(object), qq[4L:5L]), digits)
names(qq) <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.",
               "Max.")
if (any(nas))
    c(qq, `NA's` = sum(nas))
else qq
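
To make the first option concrete, here is a rough sketch of how the row count could take the "NAs" attribute into account. It is an illustration only, not a tested patch to base R: rows_needed() is a hypothetical helper, the Date summary is a hand-built stand-in mimicking the debug session above, and the code further down in summary.data.frame() would still have to turn the attribute into a printed row.

```r
# Hypothetical helper: rows a column summary needs once an "NAs" attribute,
# if present, is counted as one extra row.
rows_needed <- function(x) NROW(x) + !is.null(attr(x, "NAs"))

# Stand-in for a summary.Date() result as seen in the debug session.
date_like <- c(`Min.` = "1881-09-24", `1st Qu.` = "1902-12-04",
               `Median` = "1920-09-10", `Mean` = "1923-04-12",
               `3rd Qu.` = "1941-01-17", `Max.` = "1969-12-28")
attr(date_like, "NAs") <- 3L

# A numeric summary without NA's has six elements and no such attribute.
num_summary <- summary(1:7)

z <- list(Col1 = date_like, Col3 = num_summary)

nr_current  <- max(unlist(lapply(z, NROW)))             # calculation quoted above
nr_adjusted <- max(vapply(z, rows_needed, integer(1)))  # counts the attribute
c(current = nr_current, adjusted = nr_adjusted)
```

With the adjusted count, a Date-only data frame would reserve a seventh row, so the "NA's" cell would no longer be cut off by the length assignment.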

This is where I would defer to a member of R Core for guidance, since I presume
that there may be some logic in the difference, other than perhaps different
authors over time, and there may be other implications that I am not
considering here.

Regards,

Marc