David, Thanks!!! Your previous example, which I'm just playing with now seems to work. Given that the file I'm working with has, usually, over a million cases, I want to make sure it works on something smaller before setting it to work on the bigger table. I'll try optimizing later.
Thanks a TON, Damian Damian Betebenner Center for Assessment PO Box 351 Dover, NH 03821-0351 Phone (office): (603) 516-7900 Phone (cell): (857) 234-2474 Fax: (603) 516-7910 [email protected] www.nciea.org -----Original Message----- From: David Winsemius [mailto:[email protected]] Sent: Sunday, September 05, 2010 6:56 PM To: [email protected] Cc: Damian Betebenner; David Winsemius Subject: Re: [datatable-help] Data table syntax I went back and reviewed the merge.data.frame method and then (re-)reviewed the data.table documents. I was disappointed to find the there were relatively few worked examples of data.table merges, but finally got something that seems to make sense: dftxt <- as.data.frame(dtxt) # The dataframe method > merge(dftxt[dftxt$YEAR==2008, c(1,2,4)], dftxt[dftxt$YEAR==2009, c(1,2,4)], by="STUDENT_ID") STUDENT_ID SCHOOL_NUMBER.x SCORE.x SCHOOL_NUMBER.y SCORE.y 1 1 100 39 200 48 2 2 100 64 200 73 3 3 100 35 200 35 4 4 100 52 200 61 5 5 100 51 200 58 6 6 300 45 400 55 7 7 300 69 400 77 8 8 300 47 400 47 9 9 300 57 400 58 10 10 300 47 400 53 #The datatable method: > setkey(dtxt) <- "STUDENT_ID" #as far as I can tell this is needed to repalce the by= argument in merge.data.frame() #( I tried several failed efforts at incorporating the key= statement within the data.table calls.) > merge(dtxt[YEAR==2008, ], dtxt[YEAR==2009, ]) STUDENT_ID SCHOOL_NUMBER YEAR SCORE SCHOOL_NUMBER.1 YEAR.1 SCORE.1 [1,] 1 100 2008 39 200 2009 48 [2,] 2 100 2008 64 200 2009 73 [3,] 3 100 2008 35 200 2009 35 [4,] 4 100 2008 52 200 2009 61 [5,] 5 100 2008 51 200 2009 58 [6,] 6 300 2008 45 400 2009 55 [7,] 7 300 2008 69 400 2009 77 [8,] 8 300 2008 47 400 2009 47 [9,] 9 300 2008 57 400 2009 58 [10,] 10 300 2008 47 400 2009 53 > mtxt <- merge(dtxt[YEAR==2008, ], dtxt[YEAR==2009, ]) > mtxt[, mean(SCORE) , by=list("2009_SCH"=SCHOOL_NUMBER.1)] X2009_SCH V1 [1,] 200 48.2 [2,] 400 53.0 On Sep 5, 2010, at 4:43 PM, David Winsemius wrote: > > On Sep 5, 2010, at 3:02 PM, Damian Betebenner wrote: > >> Hi David, >> >> Thanks for the quick and thoughtful reply. >> >> Sorry for not being clearer. >> >> There are two schools that the students attended in 2009 (200 and >> 400). I'd like to break on those, and calculate the mean for all >> the students in those two schools but for their 2008 scores. >> >> Thus, the output would have 2 rows: >> >> YEAR, 2009_SCHOOL_NUMBER, 2008_SCORE_MEAN >> >> 2009 200 54.4 >> 2009 400 53 > > I dont see what sort of linkage you have between the 2008 and 2009 > school-numbers but see if this satisfies: > > > dtxt[YEAR==2008 & STUDENT_ID %in% dtxt[YEAR==2009, STUDENT_ID], > mean(SCORE), by=SCHOOL_NUMBER] > SCHOOL_NUMBER V1 > [1,] 100 48.2 > [2,] 300 53.0 > >> >> Thanks for considering this, >> >> Damian >> >> >> Damian Betebenner >> Center for Assessment >> PO Box 351 >> Dover, NH 03821-0351 >> >> Phone (office): (603) 516-7900 >> Phone (cell): (857) 234-2474 >> Fax: (603) 516-7910 >> >> [email protected] >> www.nciea.org >> >> >> >> >> -----Original Message----- >> From: David Winsemius [mailto:[email protected]] >> Sent: Sunday, September 05, 2010 1:03 PM >> To: David Winsemius >> Cc: Damian Betebenner; [email protected] >> Subject: Re: [datatable-help] Data table syntax >> >> >> On Sep 5, 2010, at 12:43 PM, David Winsemius wrote: >> >>> >>> On Sep 5, 2010, at 11:38 AM, Damian Betebenner wrote: >>> >>>> Thanks for the invaluable help on my previous questions. The speed >>>> up in create summary tables has been immense and I'm enthused about >>>> all the possibilities going forward. >>>> >>>> I'm currently stuck in trying to put together syntax for a "long" >>>> for table. In the example below, each case is a unique Student by >>>> Year combination. What I'm trying to do is take >>>> such a table, aggregate on the student's current year (i.e., 2009 >>>> in this data) SCHOOL_NUMBER, and calculate their mean score in the >>>> previous year (i.e., 2008 in this data). >>>> >>>> If the file were "wide", with each case representing a unique >>>> student with separate variables for the year, then it would be easy >>>> to break on the 2009 SCHOOL_NUMBER and take the >>>> mean of the 2008 SCORE. >> >> But there is only one 2008 SCORE for each student??? >> >>>> >>>> Is conversion of long to wide necessary to do this? >>> >>> Probably not. Are you familiar with the "ave" function in base R? >> >> I am having some difficulty understanding the structure of the >> desired >> output. I initially thought it might be something like: >> rd.txt <- >> function(txt, header=TRUE, ...) { >> rd <- read.table(textConnection(txt), header=header, ...) >> closeAllConnections() >> rd } >> txt <- rd.txt("STUDENT_ID SCHOOL_NUMBER YEAR SCORE >> 1 100 2008 39 >> 1 200 2009 48 >> 2 100 2008 64 >> 2 200 2009 73 >> 3 100 2008 35 >> 3 200 2009 35 >> 4 100 2008 52 >> 4 200 2009 61 >> 5 100 2008 51 >> 5 200 2009 58 >> 6 300 2008 45 >> 6 400 2009 55 >> 7 300 2008 69 >> 7 400 2009 77 >> 8 300 2008 47 >> 8 400 2009 47 >> 9 300 2008 57 >> 9 400 2009 58 >> 10 300 2008 47 >> 10 400 2009 53") >> dtxt <- data.table(txt) >> >>> dtxt$avScr <- dtxt[ , ave(SCORE, list(STUDENT_ID))] # returns a >> vector as long as its input >>> dtxt >> >> But now I am wondering if you wanted: >> >>> dtxt[ , tapply(SCORE, list(STUDENT_ID), mean)] # returns vector >> only as long as product of category levels. >> 1 2 3 4 5 6 7 8 9 10 >> 43.5 68.5 35.0 56.5 54.5 50.0 73.0 47.0 57.5 50.0 >> >>> >>>> >>>> >>>> STUDENT_ID SCHOOL_NUMBER YEAR SCORE >>>> [1,] 1 100 2008 39 >>>> [2,] 1 200 2009 48 >>>> [3,] 2 100 2008 64 >>>> [4,] 2 200 2009 73 >>>> [5,] 3 100 2008 35 >>>> [6,] 3 200 2009 35 >>>> [7,] 4 100 2008 52 >>>> [8,] 4 200 2009 61 >>>> [9,] 5 100 2008 51 >>>> [10,] 5 200 2009 58 >>>> [11,] 6 300 2008 45 >>>> [12,] 6 400 2009 55 >>>> [13,] 7 300 2008 69 >>>> [14,] 7 400 2009 77 >>>> [15,] 8 300 2008 47 >>>> [16,] 8 400 2009 47 >>>> [17,] 9 300 2008 57 >>>> [18,] 9 400 2009 58 >>>> [19,] 10 300 2008 47 >>>> [20,] 10 400 2009 53 >>>> >>>> >>>> Thanks, >>>> >>>> Damian >>> >> >> >> David Winsemius, MD >> West Hartford, CT >> > > David Winsemius, MD > West Hartford, CT > > _______________________________________________ > datatable-help mailing list > [email protected] > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help David Winsemius, MD West Hartford, CT _______________________________________________ datatable-help mailing list [email protected] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
