I am using ddply on a data set that contains 2+ million rows. I am trying to
rank the values of a variable within groups and then transform the ranks to
(approximate) z-scores, i.e., generate quantiles on the normal scale.
Here is some sample data for one group:

x <- c(NA, 0.3640951, 0.1175880, 0.3453916, 0.4214050, 0.7469022, 0.1091423,
       0.6099482, NA, NA, 0.6786140, 0.1785854, 0.9750262, NA)

I have tried the following two alternatives:
(1) Using the qnorm function from the stats package in conjunction with the
percent_rank function from the dplyr package. For example:

y <- qnorm(percent_rank(x))

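This produces -Inf and Inf for the extreme values in the sample data, because
percent_rank maps the smallest value to 0 and the largest to 1. For this
sample the issue goes away if I use the base rank function instead, for
example:

y <- qnorm(rank(x, na.last = "keep", ties.method = "average") / length(x))

Here length(x) counts the NAs, so the largest rank divided by length(x) stays
below 1. But if there are no NAs in a certain group, the ratio for the upper
extreme data point is exactly 1 and it still evaluates to Inf.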
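
To make the behaviour above concrete, here is a minimal base R sketch on the sample data. It does not load dplyr: the line computing p_pct is my assumed equivalent of percent_rank(x), namely (min rank - 1) / (number of non-NA values - 1). The last line divides by the non-NA count plus one, a common plotting-position adjustment, and is included only as an untested point of comparison. The names n_obs, p_pct, y_pct, y_len, y_full and y_adj are illustrative.

```r
x <- c(NA, 0.3640951, 0.1175880, 0.3453916, 0.4214050, 0.7469022, 0.1091423,
       0.6099482, NA, NA, 0.6786140, 0.1785854, 0.9750262, NA)

n_obs <- sum(!is.na(x))  # number of non-missing values in the group

# Assumed equivalent of dplyr::percent_rank(x): smallest -> 0, largest -> 1
r_min <- rank(x, na.last = "keep", ties.method = "min")
p_pct <- (r_min - 1) / (n_obs - 1)
y_pct <- qnorm(p_pct)    # qnorm(0) is -Inf and qnorm(1) is Inf

# Alternative from the text: average ranks divided by length(x), NAs included
r_avg <- rank(x, na.last = "keep", ties.method = "average")
y_len <- qnorm(r_avg / length(x))

# Same formula on a group without NAs: the largest ratio is exactly 1
x_full <- x[!is.na(x)]
y_full <- qnorm(rank(x_full) / length(x_full))

# Comparison only: dividing by (n_obs + 1) keeps every ratio strictly in (0, 1)
y_adj <- qnorm(r_avg / (n_obs + 1))

print(range(y_pct, na.rm = TRUE))
print(range(y_full))
print(range(y_adj, na.rm = TRUE))
```

The NA positions are kept as NA in y_pct, y_len and y_adj because rank() is called with na.last = "keep".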
(2) Using the ztransform function from the GenABEL package. For example:

y <- ztransform(percent_rank(x))

This preserves the extreme values but produces one of the following types of
errors when used on my full data set:

Error in ztransform(x) : trait is binary

OR

Error in ztransform(x) : trait is monomorphic

I suspect these errors may be due to the fact that there are very few
observations and/or several missing values (NAs) within certain groups, but I
am not sure since there are several hundred groups.
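
One way to check that suspicion would be to tabulate, per group, the number of non-missing observations and the number of distinct non-missing values. The sketch below uses a small made-up data frame; df, group and value are hypothetical names standing in for my real data, and the threshold of two distinct values is only my reading of the wording of the error messages ("binary", "monomorphic"), not something confirmed from the GenABEL source.

```r
# Hypothetical stand-in for the full data set (names and values are made up)
df <- data.frame(
  group = c("a", "a", "a", "a", "b", "b", "b", "c", "c", "c"),
  value = c(0.36, 0.12, 0.35, NA, 0.42, 0.42, NA, 0.75, 0.11, 0.75)
)

# Non-missing observations per group
n_obs <- tapply(df$value, df$group, function(v) sum(!is.na(v)))

# Distinct non-missing values per group
n_distinct <- tapply(df$value, df$group,
                     function(v) length(unique(v[!is.na(v)])))

diag <- data.frame(group      = names(n_obs),
                   n_obs      = as.vector(n_obs),
                   n_distinct = as.vector(n_distinct))

# Groups with at most two distinct values are the ones I would inspect first
suspect <- diag[diag$n_distinct <= 2, ]
print(suspect)
```

With several hundred groups, the suspect table should be much shorter than the full list and easier to inspect by hand.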
Is there a better way?