Thanks Matthew and David, The joining to the previous score, in the method you suggested, is really nice and, I think, efficient.
I had tried what David initially proposed as well as "flattening" the file in a manner you suggested to an earlier problem I had: dtxt[,as.list(SCORE[match(2008, YEAR)]), by=SCHOOL_NUMBER] I like your solution because it keeps the original "long" file intact and just adds on the relevant data right where it belongs. The file I'm working with has about 6 million records so getting something that works, at least, relatively efficiently was a priority. Best regards, Damian Damian Betebenner Center for Assessment PO Box 351 Dover, NH 03821-0351 Phone (office): (603) 516-7900 Phone (cell): (857) 234-2474 Fax: (603) 516-7910 [email protected] www.nciea.org -----Original Message----- From: Matthew Dowle [mailto:[email protected]] Sent: Monday, September 06, 2010 7:09 PM To: Damian Betebenner Cc: [email protected]; [email protected] Subject: Re: [datatable-help] Data table syntax Damian, I was also struggling with the example data but I think I get it now. All these children moved from one school to another then; none stayed at the same school. Also, in the desired result : YEAR, 2009_SCHOOL_NUMBER, 2008_SCORE_MEAN 2009 200 54.4 2009 400 53 it seems the 54.4 should be 48.2 as David obtained : SCHOOL_NUMBER V1 100 48.2 300 53.0 If I understand that correctly so far then ... It's usually good to include time as the last column of the key. I pasted David's code (thanks) to create the data then did : > setkey(dtxt, STUDENT_ID, YEAR) > dtxt STUDENT_ID SCHOOL_NUMBER YEAR SCORE [1,] 1 100 2008 39 [2,] 1 200 2009 48 [3,] 2 100 2008 64 [4,] 2 200 2009 73 [5,] 3 100 2008 35 [6,] 3 200 2009 35 [7,] 4 100 2008 52 [8,] 4 200 2009 61 [9,] 5 100 2008 51 [10,] 5 200 2009 58 [11,] 6 300 2008 45 [12,] 6 400 2009 55 [13,] 7 300 2008 69 [14,] 7 400 2009 77 [15,] 8 300 2008 47 [16,] 8 400 2009 47 [17,] 9 300 2008 57 [18,] 9 400 2009 58 [19,] 10 300 2008 47 [20,] 10 400 2009 53 Self join each student to their previous year : > dtxt[J(STUDENT_ID,YEAR-1)] STUDENT_ID SCHOOL_NUMBER YEAR SCORE [1,] NA NA NA NA [2,] 1 100 2008 39 [3,] NA NA NA NA [4,] 2 100 2008 64 [5,] NA NA NA NA [6,] 3 100 2008 35 [7,] NA NA NA NA [8,] 4 100 2008 52 [9,] NA NA NA NA [10,] 5 100 2008 51 [11,] NA NA NA NA [12,] 6 300 2008 45 [13,] NA NA NA NA [14,] 7 300 2008 69 [15,] NA NA NA NA [16,] 8 300 2008 47 [17,] NA NA NA NA [18,] 9 300 2008 57 [19,] NA NA NA NA [20,] 10 300 2008 47 Pick out the prior year's score and add it as column : > dtxt$PRIOR_SCORE = dtxt[J(STUDENT_ID,YEAR-1),SCORE] > dtxt STUDENT_ID SCHOOL_NUMBER YEAR SCORE PRIOR_SCORE [1,] 1 100 2008 39 NA [2,] 1 200 2009 48 39 [3,] 2 100 2008 64 NA [4,] 2 200 2009 73 64 [5,] 3 100 2008 35 NA [6,] 3 200 2009 35 35 [7,] 4 100 2008 52 NA [8,] 4 200 2009 61 52 [9,] 5 100 2008 51 NA [10,] 5 200 2009 58 51 [11,] 6 300 2008 45 NA [12,] 6 400 2009 55 45 [13,] 7 300 2008 69 NA [14,] 7 400 2009 77 69 [15,] 8 300 2008 47 NA [16,] 8 400 2009 47 47 [17,] 9 300 2008 57 NA [18,] 9 400 2009 58 57 [19,] 10 300 2008 47 NA [20,] 10 400 2009 53 47 > dtxt[,mean(PRIOR_SCORE),by=list(SCHOOL_NUMBER,YEAR)] SCHOOL_NUMBER YEAR V1 [1,] 100 2008 NA [2,] 200 2009 48.2 [3,] 300 2008 NA [4,] 400 2009 53.0 or the 'by without by' direct way : > setkey(dtxt,SCHOOL_NUMBER,YEAR) > dtxt[J(c(200,400),2009),mean(PRIOR_SCORE),mult="all"] SCHOOL_NUMBER YEAR V1 [1,] 200 2009 48.2 [2,] 400 2009 53.0 Hope that helps? Matthew On Sun, 2010-09-05 at 19:12 -0400, David Winsemius wrote: > On Sep 5, 2010, at 6:58 PM, Damian Betebenner wrote: > > > David, > > > > Thanks!!! Your previous example, which I'm just playing with now > > seems to work. Given that the file I'm working with has, usually, > > over a million cases, I want to make sure it works on something > > smaller before setting it to work on the bigger table. I'll try > > optimizing later. > > The problem with my first solution is that there is no method to > associate or test for the prior or next school attended. The second > method should allow looking at the cross-tabulation of 2008 and 2009 > schools if you constructed a more realistic test table and also used a > by= argument that had both years in the list(). > > -- > David. > > > > > > Thanks a TON, > > > > Damian > > > > Damian Betebenner > > Center for Assessment > > PO Box 351 > > Dover, NH 03821-0351 > > > > Phone (office): (603) 516-7900 > > Phone (cell): (857) 234-2474 > > Fax: (603) 516-7910 > > > > [email protected] > > www.nciea.org > > > > > > > > > > -----Original Message----- > > From: David Winsemius [mailto:[email protected]] > > Sent: Sunday, September 05, 2010 6:56 PM > > To: [email protected] > > Cc: Damian Betebenner; David Winsemius > > Subject: Re: [datatable-help] Data table syntax > > > > I went back and reviewed the merge.data.frame method and then > > (re-)reviewed the data.table documents. I was disappointed to find the > > there were relatively few worked examples of data.table merges, but > > finally got something that seems to make sense: > > > > dftxt <- as.data.frame(dtxt) > > # The dataframe method > >> merge(dftxt[dftxt$YEAR==2008, c(1,2,4)], dftxt[dftxt$YEAR==2009, > > c(1,2,4)], by="STUDENT_ID") > > STUDENT_ID SCHOOL_NUMBER.x SCORE.x SCHOOL_NUMBER.y SCORE.y > > 1 1 100 39 200 48 > > 2 2 100 64 200 73 > > 3 3 100 35 200 35 > > 4 4 100 52 200 61 > > 5 5 100 51 200 58 > > 6 6 300 45 400 55 > > 7 7 300 69 400 77 > > 8 8 300 47 400 47 > > 9 9 300 57 400 58 > > 10 10 300 47 400 53 > > > > #The datatable method: > > > >> setkey(dtxt) <- "STUDENT_ID" #as far as I can tell this is needed > > to repalce the by= argument in merge.data.frame() > > > > #( I tried several failed efforts at incorporating the key= statement > > within the data.table calls.) > > > >> merge(dtxt[YEAR==2008, ], dtxt[YEAR==2009, ]) > > STUDENT_ID SCHOOL_NUMBER YEAR SCORE SCHOOL_NUMBER.1 YEAR.1 > > SCORE.1 > > [1,] 1 100 2008 39 200 2009 > > 48 > > [2,] 2 100 2008 64 200 2009 > > 73 > > [3,] 3 100 2008 35 200 2009 > > 35 > > [4,] 4 100 2008 52 200 2009 > > 61 > > [5,] 5 100 2008 51 200 2009 > > 58 > > [6,] 6 300 2008 45 400 2009 > > 55 > > [7,] 7 300 2008 69 400 2009 > > 77 > > [8,] 8 300 2008 47 400 2009 > > 47 > > [9,] 9 300 2008 57 400 2009 > > 58 > > [10,] 10 300 2008 47 400 > > 2009 53 > > > > > >> mtxt <- merge(dtxt[YEAR==2008, ], dtxt[YEAR==2009, ]) > > > >> mtxt[, mean(SCORE) , by=list("2009_SCH"=SCHOOL_NUMBER.1)] > > X2009_SCH V1 > > [1,] 200 48.2 > > [2,] 400 53.0 > > > > > > On Sep 5, 2010, at 4:43 PM, David Winsemius wrote: > > > >> > >> On Sep 5, 2010, at 3:02 PM, Damian Betebenner wrote: > >> > >>> Hi David, > >>> > >>> Thanks for the quick and thoughtful reply. > >>> > >>> Sorry for not being clearer. > >>> > >>> There are two schools that the students attended in 2009 (200 and > >>> 400). I'd like to break on those, and calculate the mean for all > >>> the students in those two schools but for their 2008 scores. > >>> > >>> Thus, the output would have 2 rows: > >>> > >>> YEAR, 2009_SCHOOL_NUMBER, 2008_SCORE_MEAN > >>> > >>> 2009 200 54.4 > >>> 2009 400 53 > >> > >> I dont see what sort of linkage you have between the 2008 and 2009 > >> school-numbers but see if this satisfies: > >> > >>> dtxt[YEAR==2008 & STUDENT_ID %in% dtxt[YEAR==2009, STUDENT_ID], > >> mean(SCORE), by=SCHOOL_NUMBER] > >> SCHOOL_NUMBER V1 > >> [1,] 100 48.2 > >> [2,] 300 53.0 > >> > >>> > >>> Thanks for considering this, > >>> > >>> Damian > >>> > >>> > >>> Damian Betebenner > >>> Center for Assessment > >>> PO Box 351 > >>> Dover, NH 03821-0351 > >>> > >>> Phone (office): (603) 516-7900 > >>> Phone (cell): (857) 234-2474 > >>> Fax: (603) 516-7910 > >>> > >>> [email protected] > >>> www.nciea.org > >>> > >>> > >>> > >>> > >>> -----Original Message----- > >>> From: David Winsemius [mailto:[email protected]] > >>> Sent: Sunday, September 05, 2010 1:03 PM > >>> To: David Winsemius > >>> Cc: Damian Betebenner; [email protected] > >>> Subject: Re: [datatable-help] Data table syntax > >>> > >>> > >>> On Sep 5, 2010, at 12:43 PM, David Winsemius wrote: > >>> > >>>> > >>>> On Sep 5, 2010, at 11:38 AM, Damian Betebenner wrote: > >>>> > >>>>> Thanks for the invaluable help on my previous questions. The speed > >>>>> up in create summary tables has been immense and I'm enthused > >>>>> about > >>>>> all the possibilities going forward. > >>>>> > >>>>> I'm currently stuck in trying to put together syntax for a "long" > >>>>> for table. In the example below, each case is a unique Student by > >>>>> Year combination. What I'm trying to do is take > >>>>> such a table, aggregate on the student's current year (i.e., 2009 > >>>>> in this data) SCHOOL_NUMBER, and calculate their mean score in the > >>>>> previous year (i.e., 2008 in this data). > >>>>> > >>>>> If the file were "wide", with each case representing a unique > >>>>> student with separate variables for the year, then it would be > >>>>> easy > >>>>> to break on the 2009 SCHOOL_NUMBER and take the > >>>>> mean of the 2008 SCORE. > >>> > >>> But there is only one 2008 SCORE for each student??? > >>> > >>>>> > >>>>> Is conversion of long to wide necessary to do this? > >>>> > >>>> Probably not. Are you familiar with the "ave" function in base R? > >>> > >>> I am having some difficulty understanding the structure of the > >>> desired > >>> output. I initially thought it might be something like: > >>> rd.txt <- > >>> function(txt, header=TRUE, ...) { > >>> rd <- read.table(textConnection(txt), header=header, ...) > >>> closeAllConnections() > >>> rd } > >>> txt <- rd.txt("STUDENT_ID SCHOOL_NUMBER YEAR SCORE > >>> 1 100 2008 39 > >>> 1 200 2009 48 > >>> 2 100 2008 64 > >>> 2 200 2009 73 > >>> 3 100 2008 35 > >>> 3 200 2009 35 > >>> 4 100 2008 52 > >>> 4 200 2009 61 > >>> 5 100 2008 51 > >>> 5 200 2009 58 > >>> 6 300 2008 45 > >>> 6 400 2009 55 > >>> 7 300 2008 69 > >>> 7 400 2009 77 > >>> 8 300 2008 47 > >>> 8 400 2009 47 > >>> 9 300 2008 57 > >>> 9 400 2009 58 > >>> 10 300 2008 47 > >>> 10 400 2009 53") > >>> dtxt <- data.table(txt) > >>> > >>>> dtxt$avScr <- dtxt[ , ave(SCORE, list(STUDENT_ID))] # returns a > >>> vector as long as its input > >>>> dtxt > >>> > >>> But now I am wondering if you wanted: > >>> > >>>> dtxt[ , tapply(SCORE, list(STUDENT_ID), mean)] # returns vector > >>> only as long as product of category levels. > >>> 1 2 3 4 5 6 7 8 9 10 > >>> 43.5 68.5 35.0 56.5 54.5 50.0 73.0 47.0 57.5 50.0 > >>> > >>>> > >>>>> > >>>>> > >>>>> STUDENT_ID SCHOOL_NUMBER YEAR SCORE > >>>>> [1,] 1 100 2008 39 > >>>>> [2,] 1 200 2009 48 > >>>>> [3,] 2 100 2008 64 > >>>>> [4,] 2 200 2009 73 > >>>>> [5,] 3 100 2008 35 > >>>>> [6,] 3 200 2009 35 > >>>>> [7,] 4 100 2008 52 > >>>>> [8,] 4 200 2009 61 > >>>>> [9,] 5 100 2008 51 > >>>>> [10,] 5 200 2009 58 > >>>>> [11,] 6 300 2008 45 > >>>>> [12,] 6 400 2009 55 > >>>>> [13,] 7 300 2008 69 > >>>>> [14,] 7 400 2009 77 > >>>>> [15,] 8 300 2008 47 > >>>>> [16,] 8 400 2009 47 > >>>>> [17,] 9 300 2008 57 > >>>>> [18,] 9 400 2009 58 > >>>>> [19,] 10 300 2008 47 > >>>>> [20,] 10 400 2009 53 > >>>>> > >>>>> > >>>>> Thanks, > >>>>> > >>>>> Damian > >>>> > >>> > >>> > >>> David Winsemius, MD > >>> West Hartford, CT > >>> > >> > >> David Winsemius, MD > >> West Hartford, CT > >> > >> _______________________________________________ > >> datatable-help mailing list > >> [email protected] > >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > David Winsemius, MD > > West Hartford, CT > > > > David Winsemius, MD > West Hartford, CT > > _______________________________________________ > datatable-help mailing list > [email protected] > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help _______________________________________________ datatable-help mailing list [email protected] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
