A couple of follow-up questions:

1. Is there any way to modify this so that non-numeric values are ignored? (As it is, length seems to "count" the NA values.)

2. In order for the cbind call "x <- cbind(x, do.call("rbind", r))" to work as intended, does the data need to be ordered by State and Year? e.g. "x <- x[order(x$State,x$Year), ]"

Here is some sample data, with non-numeric values included:

Year,State,Subject,Income
2000,TX,1,30776
2000,AL,1,81240
2000,TX,2,28035
2000,AL,2,35947
2000,TX,3,42010
2000,AL,3,48830
2000,TX,4,18040
2000,AL,4,77758
2000,TX,5,20771
2000,AL,5,59132
2000,TX,6,46370
2000,AL,6,45573
2000,TX,7,57256
2000,AL,7,83402
2000,TX,8,3780
2000,AL,8,90695
2000,TX,9,51745
2000,AL,9,4105
2000,TX,10,1154
2000,AL,10,96598
2001,TX,1,25767
2001,AL,1,37032
2001,TX,2,39848
2001,AL,2,69029
2001,TX,3,17142
2001,AL,3,92850
2001,TX,4,62939
2001,AL,4,82730
2001,TX,5,30708
2001,AL,5,25339
2001,TX,6,64710
2001,AL,6,44541
2001,TX,7,96699
2001,AL,7,9151
2001,TX,8,57793
2001,AL,8,20981
2001,TX,9,12523
2001,AL,9,36139
2001,TX,10,53553
2001,AL,10,3767
2002,TX,1,55232
2002,AL,1,54655
2002,TX,2,76255
2002,AL,2,53581
2002,TX,3,77030
2002,AL,3,34869
2002,TX,4,98956
2002,AL,4,60332
2002,TX,5,33052
2002,AL,5,12348
2002,TX,6,96057
2002,AL,6,24509
2002,TX,7,66177
2002,AL,7,45952
2002,TX,8,73331
2002,AL,8,35813
2002,TX,9,3014
2002,AL,9,57097
2002,TX,10,83657
2002,AL,10,91640
2003,TX,1,5638
2003,AL,1,17026
2003,TX,2,66902
2003,AL,2,71080
2003,TX,3,88195
2003,AL,3,95415
2003,TX,4,13028
2003,AL,4,49123
2003,TX,5,19867
2003,AL,5,22990
2003,TX,6,67639
2003,AL,6,69435
2003,TX,7,62469
2003,AL,7,59939
2003,TX,8,24874
2003,AL,8,44829
2003,TX,9,77180
2003,AL,9,68488
2003,TX,10,80686
2003,AL,10,72622
2004,TX,1,46854
2004,AL,1,62499
2004,TX,2,20461
2004,AL,2,53834
2004,TX,3,54909
2004,AL,3,69527
2004,TX,4,33066
2004,AL,4,78035
2004,TX,5,23569
2004,AL,5,59757
2004,TX,6,44514
2004,AL,6,41223
2004,TX,7,85665
2004,AL,7,91972
2004,TX,8,30073
2004,AL,8,90642
2004,TX,9,32741
2004,AL,9,97111
2004,TX,10,8093
2004,AL,10,20077
2005,TX,1,48377
2005,AL,1,88216
2005,TX,2,35752
2005,AL,2,74897
2005,TX,3,27772
2005,AL,3,88945
2005,TX,4,86512
2005,AL,4,88422
2005,TX,5,27488
2005,AL,5,21140
2005,TX,6,35777
2005,AL,6,32772
2005,TX,7,77477
2005,AL,7,98282
2005,TX,8,73346
2005,AL,8,38943
2005,TX,9,38947
2005,AL,9,70195
2005,TX,10,23890
2005,AL,10,84020
2000,TX,11,na
2005,AL,11,null

Sundar Dorai-Raj <[EMAIL PROTECTED]> wrote:

t c wrote:
> What is the easiest way to calculate a percent rank by an index key?
>
> For example, I have a dataset with 3 fields:
>
> Year, State, Income,
>
> I wish to calculate the rank, by year, by state.
>
> I also wish to calculate the percent rank, where I define percent rank as rank/n.
>
> (n is the number of numeric data points within each date-state grouping.)
>
> This is what I am currently doing:
>
> 1. I create a group-by field by using the paste function to combine date and state into a field called date_state. I then use the rank function to calculate the rank by date, by state.
>
> 2. I then add a field called one that I set to 1 if the value in income is numeric and to 0 if it is not.
>
> 3. I then take an aggregate sum of one. This gives me a count (n) for each date-state grouping.
>
> 4. I next use merge to add this count to the table.
>
> 5. Finally, I calculate the percent rank.
>
> Pr<-rank/n
>
> The merge takes quite a bit of time to process.
>
> Is there an easier/more efficient way to calculate the percent rank?
>

How about using ?by:

set.seed(100)
# fake data set, replace with your own
# "Subject" is just a dummy to produce replicates
x <- expand.grid(Year = 2000:2005, State = c("TX", "AL"), Subject = 1:10)
x$Income <- floor(runif(NROW(x)) * 100000)
r <- by(x$Income, x[c("Year", "State")], function(x) {
  r <- rank(x)
  n <- length(x)
  cbind(Rank = r, PRank = r/n)
})
x <- cbind(x, do.call("rbind", r))

HTH,

--sundar
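
On the two follow-up questions, here is a minimal sketch that uses base R's ave() in place of by(). This is a swapped-in technique, not the method from the reply above. It assumes non-numeric entries such as "na" or "null" can simply be coerced to NA with as.numeric(), and the small data frame is a hypothetical cut-down of the sample data. rank(v, na.last = "keep") leaves NA values unranked and sum(!is.na(v)) counts only the numeric values, which speaks to question 1. On question 2, as far as I can tell do.call("rbind", r) stacks the groups one after another, so the cbind only lines up when x is already sorted to match; ave() sidesteps that because it returns its values in the original row order.

```r
# Hypothetical cut-down of the sample data; rows are deliberately unsorted
# and "na" / "null" stand in for the non-numeric entries.
x <- data.frame(
  Year   = c(2000, 2000, 2000, 2000, 2001, 2001, 2000, 2001),
  State  = c("TX", "AL", "TX", "AL", "TX", "AL", "TX", "AL"),
  Income = c("30776", "81240", "28035", "35947",
             "25767", "37032", "na", "null"),
  stringsAsFactors = FALSE
)

# Assumption: anything that is not a number should become NA.
# as.numeric() warns about the coercion, so the warning is silenced here.
x$Income <- suppressWarnings(as.numeric(x$Income))

# Rank within each Year-State group; NA incomes stay NA instead of
# being ranked last.
x$Rank <- ave(x$Income, x$Year, x$State,
              FUN = function(v) rank(v, na.last = "keep"))

# n is the number of non-missing incomes in each group.
x$n <- ave(x$Income, x$Year, x$State,
           FUN = function(v) sum(!is.na(v)))

# Percent rank as defined in the original question: rank / n.
x$PRank <- x$Rank / x$n
```

Because ave() writes each group's result back into the positions the group came from, no prior call to order() and no merge should be needed; the rows holding non-numeric incomes end up with NA in both Rank and PRank.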
______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html