Re: [R] Data manipulation problem

David Winsemius Tue, 06 Apr 2010 12:43:28 -0700


On Apr 6, 2010, at 3:30 PM, David Winsemius wrote:

On Apr 6, 2010, at 9:56 AM, moleps islon wrote:
OK... next question.. Which is still a data manipulation problem so I
believe the heading is still OK.

##So now I read my population data from excel.
No, you read it from a text file, and providing the first ten lines of that text file should have been really easy. Read the Posting Guide for advice about offering datasets either as structure() objects with dput or dump or as attached files with a "*.txt" extension (not .csv). Just change the file name with your file browser.
pop<-read.csv("pop.csv")

typeof(pop) ## yields a list
Really? I would have guessed it to yield just "list".
where I have age-specific population rows
and a yearly column population, where the years are suffixed by X
And had you used class(pop) you would have learned it was a data frame; even more informative would have been str(pop).
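To make the difference between typeof(), class() and str() concrete, here is a minimal sketch on a made-up data frame. The object `pop_demo` and its values are invented for illustration and are not the poster's data.

```r
# Hypothetical stand-in for the poster's population table (invented values)
pop_demo <- data.frame(X1953 = c(100, 120), X1954 = c(105, 118))

typeof(pop_demo)   # storage mode: a data frame is stored as a list
class(pop_demo)    # class attribute: "data.frame"
str(pop_demo)      # compact summary of dimensions and column types
```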
c<-(1953:2008)
No, no, no. Do not use variable names that are important function names. The R interpreter can (usually) keep things straight, but it is our brains that experience problems. Other function names to avoid: data, df, cut, mean, sd, list, vector, matrix
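A small sketch of the point about names, assuming a fresh R session. The name `years` is just an illustrative alternative, not something taken from the thread.

```r
# Assigning a vector to `c` does not break calls such as c(1, 2), because R
# looks for a function when a name is used in call position.
c <- 1953:2008
still_works <- c(1, 2)

# It is still confusing to read, so remove the variable and use a
# descriptive name instead.
rm(c)
years <- 1953:2008
```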
names(pop)<-c
c.div<-cut(c,break=seq(1950,2010,by=5)
(You should have gotten an error here.) After fixing the error, did you notice that there were only 3 in the first level???
Watch out for cut(). It uses the default convention of ( , ] , i.e. open interval at right

er, ^left^

which is backwards to what some (most?) of us think natural. Because of that the lowest level gets dropped unless you take special precautions. That is undoubtedly why Harrell set up his Hmisc::cut2 to have the default be [ , )
Aggregating across columns? Certainly possible, but maybe not as natural a fit to functions like split as would occur with working across rows. I suppose you could use something like this untested (because _still_ no sample dataset provided) code:
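The interval conventions can be checked on the same year range used in the thread. This is a minimal sketch in base R only; the variable names are illustrative.

```r
years  <- 1953:2008                      # same range as in the thread
breaks <- seq(1950, 2010, by = 5)

# Default: right = TRUE, so intervals are (a, b]
right_closed <- table(cut(years, breaks = breaks))

# right = FALSE gives [a, b) intervals instead
left_closed <- table(cut(years, breaks = breaks, right = FALSE))

right_closed[1]   # first bin holds 1953, 1954 and 1955
left_closed[1]    # first bin holds 1953 and 1954

# A value equal to the smallest break falls outside (a, b] and becomes NA
# unless include.lowest = TRUE closes the first interval.
cut(1950, breaks = breaks)
cut(1950, breaks = breaks, include.lowest = TRUE)
```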
apply(pop, 1,    # this works a row at a time
      function(x) tapply(x, list(c.div), sum) )  # or use aggregate, which uses tapply
I'm not sure it will work, since I don't know if the column names would get carried over into "x" by apply(). You might need to create a separate index that used the numeric positions of the columns rather than their names. Perhaps use c.div <- seq(0,(2008-1953)) %/% 5 or some such inside tapply.
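One way to sidestep the column-name question is to group column positions rather than names. The following is a sketch on invented data, not tested against the poster's file; `pop_demo`, `year_bin` and `pop_5yr` are hypothetical names, and the age groups and counts are made up.

```r
# Invented example: 3 age groups (rows) by yearly columns 1953..2008
years <- 1953:2008
set.seed(1)
pop_demo <- matrix(sample(100:200, 3 * length(years), replace = TRUE),
                   nrow = 3,
                   dimnames = list(c("0-4", "5-9", "10-14"),
                                   as.character(years)))

# One bin label per column; right = FALSE gives [1950,1955), [1955,1960), ...
year_bin <- cut(years, breaks = seq(1950, 2010, by = 5), right = FALSE)

# Split the column positions by bin, then sum the matching columns per row
pop_5yr <- sapply(split(seq_along(years), year_bin),
                  function(idx) rowSums(pop_demo[, idx, drop = FALSE]))

dim(pop_5yr)   # age groups in rows, five-year bins in columns
```

The result keeps the age-specific rows and replaces the yearly columns with five-year totals, which is the table shape asked for below.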
Now I'd like to sum the agespecific population over the individual
levels of -c.div- and generate a new table for this with agespecific
rows and columns containing the 5-year bins instead of the original
yearly data. Do I have to program this from scratch or is it possible
to use an already existing function?
I think you ought to read more introductory material (and the Posting Guide regarding how to offer example datasets). In this case there are many functions that do data aggregation and most of them should be illustrated in a good introductory text.
--
David.
//M

qta<- table(cut(age,breaks = seq(0, 100, by = 10),include.lowest =
TRUE),cut(year,breaks=seq(1950,2010,by=5),include.lowest=TRUE

On Mon, Apr 5, 2010 at 10:11 PM, moleps <[email protected]> wrote:
Thx Erik,
I have no idea what went wrong with the other code snippet, but this one works.. Appreciate it.
qta<- table(cut(age,breaks = seq(0, 100, by = 10),include.lowest =TRUE),cut(year,breaks=seq(1950,2010,by=5),include.lowest=TRUE))
M


On 5. apr. 2010, at 21.45, Erik Iverson wrote:
I don't know what your data are like, since you haven't given a reproducible example. I was imagining something like:
## generate fake data
age <- sample(20:90, 100, replace = TRUE)
year <- sample(1950:2000, 100, replace = TRUE)

##look at big table
table(age, year)

## categorize data
## see include.lowest and right arguments to cut
age.factor <- cut(age, breaks = seq(20, 90, by = 10),
               include.lowest = TRUE)

year.factor <- cut(year, breaks = seq(1950, 2000, by = 10),
                include.lowest = TRUE)

table(age.factor, year.factor)

moleps wrote:
I already did try the regression modeling approach. However, the epidemiologist (referee) turns out to be quite fond of comparing the incidence rates to different standard populations, hence the need for this laborious approach. And trying the "cutting" approach I ended up with:
table (age5)
age5
    (0,5]   (5,10]  (10,15]  (15,20]  (20,25]  (25,30]
       35       34       33       47       51      109
  (30,35]  (35,40]  (40,45]  (45,50]  (50,55]  (55,60]
      157      231      362      511      745      926
  (60,65]  (65,70]  (70,75]  (75,80]  (80,85] (85,100]
     1002      866      547      247       82       18
table (yr5)
yr5
 (1950,1955] (1955,1960] (1960,1965] (1965,1970]
           3           5           5           5
 (1970,1975] (1975,1980] (1980,1985] (1985,1990]
           5           5           5           5
 (1990,1995] (1995,2000] (2000,2005] (2005,2009]
           5           5           5           3
table (yr5,age5)
Error in table(yr5, age5) : all arguments must have the same length
Sincerely,
M
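A note on the error above: table() needs vectors with one element per case. The yr5 counts add up to 56, the number of years from 1953 to 2008, so one plausible reading (not confirmed in the thread) is that yr5 was built from the list of calendar years rather than from each case's year of diagnosis. A sketch with invented per-case data, where `n_cases`, `age` and `year` are all made up:

```r
# Invented per-case data: one age and one diagnosis year per case
set.seed(42)
n_cases <- 200
age  <- round(runif(n_cases, 0, 100), 1)
year <- sample(1953:2008, n_cases, replace = TRUE)

# Cut the per-case vectors, so both factors have length n_cases
age5 <- cut(age,  breaks = c(seq(0, 85, by = 5), 100), include.lowest = TRUE)
yr5  <- cut(year, breaks = seq(1950, 2010, by = 5))

length(age5) == length(yr5)   # equal lengths, so table() can cross them
counts <- table(age5, yr5)    # age groups in rows, periods in columns
```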
On 5. apr. 2010, at 20.59, Bert Gunter wrote:
You have tempted, and being weak, I yield to temptation:

"Any good ideas?"

Yes. Don't do this.
(what you probably really want to do is fit a model with age as a factor, which can be done statistically e.g. by logistic regression; or graphically using conditioning plots, e.g. via trellis graphics (the lattice package). This avoids the arbitrariness and discontinuities of binning by age range.)
Bert Gunter
Genentech Nonclinical Biostatistics

-----Original Message-----
From: [email protected] [mailto:[email protected]] On
Behalf Of moleps
Sent: Monday, April 05, 2010 11:46 AM
To: [email protected]
Subject: [R] Data manipulation problem

Dear R´ers.

I´ve got a dataset with age and year of diagnosis. In order to
age-standardize the incidence I need to transform the data into a
matrix with age-groups (divided in 5 or 10 years) along one axis and
year divided into 5 years along the other axis. Each cell should
contain the number of cases for that age group and for that period.
I.e.
My data format now is
ID-age (to one decimal)-year(yearly data).

What I´d like is

age 1960-1965 1966-1970 etc...
0-5 3 8 10 15
6-10 2 5 8 13
etc..


Any good ideas?

Regards,
M


David Winsemius, MD
West Hartford, CT

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
