On Aug 6, 2013, at 4:02 PM, Mike Miller wrote:
> I received two additional suggestions, one off-list, both appended below.
> Both helped me to learn a bit more about how to get what I want.
>
> First, the aggregate() function is in package:stats, it provides the numbers
> I needed, but I don't like the output format as much as I liked the format
> from doBy:summaryBy(). Here it is:
>
>> aggregate(Age ~ Generation + Zygosity + Sex + Cohort + ESstatus, data=x,
>> function(x) c(mean=mean(x), sd=sd(x), quantile(x), N=length(x)))
> Generation Zygosity Sex Cohort ESstatus Age.mean Age.sd
> Age.0% Age.25% Age.50% Age.75% Age.100% Age.N
> 1 Offspring DZ Female 11 ES 17.7852830 0.3535863
> 16.9300000 17.6000000 17.7750000 17.9650000 18.9200000 106.0000000
> 2 Parent DZ Female 11 ES 44.6151240 5.1246314
> 32.1700000 41.3400000 44.6800000 48.2800000 57.9500000 121.0000000
>
snipped
> 23 Offspring MZ Male 17 notES 17.4911446 0.3961757
> 16.6500000 17.1775000 17.5000000 17.8100000 18.3500000 332.0000000
> 24 Parent MZ Male 17 notES 46.6929771 5.2421896
> 34.4500000 43.1500000 45.8900000 49.0050000 63.8000000 131.0000000
>
> That's great but there are two things I didn't like: (1) There are too many
> digits, especially on the integers in the last column. I thought five digits
> to the right of the decimal was more than enough but here we have seven, even
> for integers. (2) The ordering of levels within factors implied by the right
> side of the formula is not honored -- it looks like it used the order Cohort,
> ESstatus, Sex, Zygosity, Generation. Unlike doBy::summaryBy(), it does not
> accept an order=T argument (that is the default in doBy::summaryBy()).
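
If the row order matters, the aggregate() result can be sorted afterwards with order(). The following is only a sketch on made-up data: the data frame, its values and the `by_vars` helper vector are assumptions for illustration, not the data from this thread.

```r
# Made-up data; the column names only mimic the ones in the thread.
x <- data.frame(
  Generation = rep(c("Parent", "Offspring"), each = 4),
  Sex        = rep(c("Male", "Female"), times = 4),
  Age        = c(45, 44, 47, 43, 17, 18, 17.5, 18.5)
)

res <- aggregate(Age ~ Generation + Sex, data = x, FUN = mean)

# Sort rows by the grouping variables in formula order
# (Generation first, then Sex).
by_vars <- c("Generation", "Sex")
res <- res[do.call(order, unname(as.list(res[by_vars]))), ]
rownames(res) <- NULL
print(res)
```

Listing the variables in `by_vars` in a different order changes which one varies slowest in the printed table.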
>
> One thing both suggestions taught me was to use names in function definitions
> so that I always get correct column headings on output. This was in the
> documentation for doBy::summaryBy(), but I didn't understand it when I first
> read it. Using that naming concept, I created this function:
>
> descriptivefun <- function(x, ...){c(mean=mean(x, ...), sd=sd(x, ...),
> quantile(x, ...), N=sum(!is.na(x)), NAs=sum(is.na(x)))}
>
> That will allow me to feed the na.rm=T argument to the mean, sd and quantile
> functions. By not naming the quantile function (e.g., not using
> q=quantile(x, ...)) I allow the builtin column names to be used unaltered
> (i.e., 0%, 25%, 50%, 75%, 100%). I also did not use length() because it
> counts NA values and I want to see the sample sizes used for mean, sd and
> quantile. To deal with that problem I added an element named "N" that
> counts the non-missing values and one named "NAs" that counts the NAs.
> Then I do this:
>
>> summaryBy(Age ~ Generation + Zygosity + Sex + Cohort + ESstatus, data=x,
>> FUN=descriptivefun, na.rm=T)
> Generation Zygosity Sex Cohort ESstatus Age.mean Age.sd Age.0%
> Age.25% Age.50% Age.75% Age.100% Age.N Age.NAs
> 1 Offspring DZ Female 11 ES 17.78528 0.3535863 16.93
> 17.6000 17.775 17.9650 18.92 106 0
> 2 Offspring DZ Female 11 notES 18.13679 0.5555968 16.76
> 17.8525 18.190 18.4575 19.50 162 0
>
snipped
> 22 Parent MZ Male 11 ES 43.40787 5.3507439 31.28
> 39.9700 43.440 46.4800 64.65 197 0
> 23 Parent MZ Male 11 notES 41.56363 4.6564818 32.10
> 38.0250 41.390 44.6450 65.29 331 0
> 24 Parent MZ Male 17 notES 46.69298 5.2421896 34.45
> 43.1500 45.890 49.0050 63.80 131 0
>
> I think that output looks very nice. One thing that I don't understand is
> why my function produces %.5f output for every value but the
> doBy::summaryBy() function uses different formats in different columns.
Look at the code. You are attributing behavior to `summaryBy` that should be
ascribed to `print.data.frame` and `format.data.frame`, which format each
column separately. Your function returns a numeric vector, which is displayed
by `print.default` with one common format for all of its elements.
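
A small sketch of the difference, using made-up numbers loosely shaped like one row of your summary (the objects `v` and `d` are only for illustration):

```r
# Made-up values, loosely shaped like one row of the summary above.
v <- c(mean = 17.78528, sd = 0.3535863, N = 106)

# print.default() formats the whole vector at once, so every element
# is shown with the same number of decimal places.
print(v)

# In a data frame each column is formatted on its own by
# format.data.frame(), so the integer-valued column is shown
# without decimals.
d <- data.frame(mean = c(17.78528, 18.13679),
                sd   = c(0.3535863, 0.5555968),
                N    = c(106, 162))
print(d)
```

The same reasoning would explain the aggregate() output further up: there the summary statistics come back as a single matrix column, which is formatted as one unit.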
--
David.
> Compare the above output with this output:
>
>> descriptivefun(x$Age)
> mean sd 0% 25% 50% 75% 100%
> N NAs
> 28.49302 13.29077 16.55000 17.65000 18.23000 42.25500 65.29000
> 4434.00000 0.00000
>
> It's not a big deal, but it would be cool if I could tell doBy::summaryBy()
> how to format the numbers using something like format=c(rep("%.2f",7),
> rep("%d",2)).
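
As far as I know summaryBy() has no such argument, but the result is an ordinary data frame, so the columns can be formatted with sprintf() before printing. A rough sketch: `format_cols` is a hypothetical helper, not part of doBy, and the data frame `s` is made up to stand in for summaryBy() output. It uses "%.0f" rather than "%d" for the counts because those columns are stored as doubles.

```r
# Hypothetical helper (not part of doBy): apply one sprintf() format per
# named column and return the data frame with those columns as character.
format_cols <- function(df, formats) {
  stopifnot(all(names(formats) %in% names(df)))
  df[names(formats)] <- Map(sprintf, formats, df[names(formats)])
  df
}

# Made-up summary rows standing in for summaryBy() output.
s <- data.frame(Sex      = c("Female", "Male"),
                Age.mean = c(17.78528, 17.49114),
                Age.sd   = c(0.3535863, 0.3961757),
                Age.N    = c(106, 332))

out <- format_cols(s, c(Age.mean = "%.2f", Age.sd = "%.2f", Age.N = "%.0f"))
print(out)
```

The formatted columns are character strings, so this is best done as a last step for display, with the numeric result kept for any further computation.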
>
> Mike
>
> --
> Michael B. Miller, Ph.D.
> Minnesota Center for Twin and Family Research
> Department of Psychology
> University of Minnesota
>
>
>
> On Mon, 5 Aug 2013, David Carlson wrote:
>
>> This is a bit simpler. The function quantile() labels the output whereas
>> fivenum() does not:
>>
>> aggregate(Age ~ Generation + Zygosity + Sex + Cohort +
>> ESstatus, data=x,
>> function(x) c(mean=mean(x), sd=sd(x), quantile(x)))
>
>
> On Mon, 5 Aug 2013, Dr. Thomas W. MacFarland wrote:
>
>> Dear Dr. Miller:
>>
>> Pasted below is syntax that should mostly answer your recent question to the
>> R mailing list:
>>
>> descriptivefun <- function(x, ...){
>> c(m=mean(x, ...), sd=sd(x, ...), l=length(x))
>> }
>>
>> doBy::summaryBy(Final ~ Method.recode +
>> ComCol.recode,
>> data=Final.table,
>> FUN=descriptivefun,
>> na.rm=TRUE,
>> keep.names=TRUE,
>> order=TRUE)
>>
>> I go into far more detail on this package::function and similar functions in
>> my recent text on Twoway ANOVA,
>> http://www.springer.com/statistics/social+sciences+%26+law/book/978-1-4614-2133-7.
>>
>> Best wishes.
>>
>> Tom
David Winsemius
Alameda, CA, USA