At least in your toy example, each lm would be fitted within a single group, where x takes only one value, so y ~ x would not be meaningful. Perhaps you are doing something else in the actual problem, e.g. lm(y ~ x2) with by=x1? (No need to reply.)
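To make the point above concrete, here is a minimal sketch, assuming the data.table package is installed. The columns x1 and x2 are hypothetical names introduced only for illustration (they are not in the posted toy data). The sketch also passes .SD (the rows of the current group) to lm rather than the whole table d; referring to d inside j appears to be why the posted call returned 40 rows, since the model is then refitted on all 20 rows once per group.

```r
library(data.table)

set.seed(1)
# Hypothetical data: x1 is the grouping column, x2 a predictor that varies within each group.
d <- data.table(x1 = rep(1:2, each = 10),
                x2 = rep(1:10, times = 2),
                y  = rnorm(20),
                key = "x1")

# With a constant predictor (as within one group of the toy data),
# the slope is not estimable and lm reports it as NA.
const_fit <- lm(y ~ x, data = data.frame(x = rep(1, 10), y = d$y[1:10]))
coef(const_fit)[["x"]]   # NA

# Referring to the whole table inside j fits all 20 rows once per group.
whole <- d[, list(pred = fitted(lm(y ~ x2, data = d))), by = x1]
nrow(whole)              # 40

# Using .SD fits each group on its own rows only.
p <- d[, list(pred = fitted(lm(y ~ x2, data = .SD))), by = x1]
dim(p)                   # 20 2

# Alternative to a later cbind: add the fitted values by reference (this modifies d).
d[, pred := fitted(lm(y ~ x2, data = .SD)), by = x1]
```

For the real data, the same pattern should carry over with by = list(X1, X2, X3, X4), provided the predictor in the formula varies within each group.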
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Matthew Finkbeiner
Sent: Wednesday, February 22, 2012 5:40 AM
To: [email protected]
Subject: Re: [datatable-help] Using list valued columns with by (Matthew Dowle)

I wanted to follow up on this as I am trying to do something similar to what Chris asked about. But first, let me say thanks for the work on this. I have several different situations where a call to ddply takes about 10 minutes but only ~1 second with data.table. So I'm very thankful for the package, but I'm still very much a novice with it.

Here's the present problem. Here's the toy data again:

    d <- data.table(x=rep(1:2,each=10), y=rnorm(20), key="x")
    > dim(d)
    [1] 20 2

I would like to generate a column of fitted values from lm that I'll later cbind to the original data.

    f <- function(d) list(pred = fitted(lm(y ~ x,d)))
    p <- d[,f(d), by = x]
    > dim(p)
    [1] 40 2

For reasons I don't understand, this generates 2 sets of (correct) "pred" values, but the "x" values are wrong. Why does this generate two duplicate sets?

I should say that the real data has ~2 million rows and the call will be something closer to:

    p <- d[,f(d), by = list(X1, X2, X3, X4)]

Matthew

> or functional form :
>
> f <- function(y) list(a=mean(y), b=list(rep(y[1],3)) )
> data[, f(y), by=x]
>      x           a                                  b
> [1,] 1 -0.07760762 -0.1715334, -0.1715334, -0.1715334
> [2,] 2  0.36923570          1.01892, 1.01892, 1.01892
>
>> data <- data.table(x=rep(1:2,each=10), y=rnorm(20), key="x")
>>
>> f <- function(y) {
>>     return( list(a=mean(y), b=rep(y[1],10)) )
>> }
>>
>> result <- data[, list(f(y)), by=x]
>>
>>
>> What winds up happening is that result winds up having V1 alternate
>> between f(y)$a and f(y)$b, resulting in 4 rows, 2 for each value of x.
>> What I want instead is result to have 2 rows, with V1 being the list
>> that gets returned from f(y).
>>
>> I have found that this works:
>>
>> result <- data[, list(list(f(y))), by=x]
>>
>> But then I have to do:
>>
>> result[J(1),][,V1][[1]]
>>
>> to get the same thing I would get from f(result[J(1),][,V1]). I want
>> to lose the [[1]] but I can't seem to see how I would do so. Really
>> what I would envision is like with sapply, I want to do
>>
>>
>> result <- data[, f(y), by=x, simplify=FALSE]
>>
>> But of course simplify isn't an argument for data.table. Thoughts?
>>
>> -Chris
>> _______________________________________________
>> datatable-help mailing list
>> [email protected]
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
> ------------------------------
>
> Message: 2
> Date: Tue, 21 Feb 2012 17:52:40 -0500
> From: Steve Lianoglou <[email protected]>
> To: [email protected]
> Cc: [email protected], Prasad Chalasani
>     <[email protected]>
> Subject: Re: [datatable-help] BUG: droplevels mangles subsetted
>     data.table
> Message-ID:
>     <caha9mcnblauink9fjr2jnow10r8vfhyrfvu0upea4qjwfde...@mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> Hi,
>
> I guess I'm missing something, but ... why isn't your proposed
> droplevels.data.table consistent with base? Because the ordering of
> the rows might change (maybe(?))?
>
> -steve
>
> On Tue, Feb 21, 2012 at 4:42 PM, Matthew Dowle <[email protected]> wrote:
>>
>> Yes, could do. Building on that here's a quick stab at
>> droplevels.data.table. This does it by reference, or it could take a
>> copy(). If it takes a copy() it would be consistent with base
>> (probably required), but then how best to make a non-copying version
>> available?
>>
>> droplevels.data.table = function(dt) {
>>     oldkey = key( dt )
>>     for (i in names(dt)) {
>>         if (is.factor(dt[[i]])) dt[,i:=droplevels(dt[[i]]),with=FALSE]
>>     }
>>     setkeyv( dt, oldkey )
>>     dt
>> }
>>
>> On Tue, 2012-02-21 at 15:38 -0500, Prasad Chalasani wrote:
>>> Meanwhile as a work-around, I suppose one should do:
>>>
>>> keys <- key( dt ) # this could in general be a large set of keys
>>> sub_d <- droplevels( as.data.frame( dt[ name != 'a' ] ) )
>>> sub_dt <- data.table( sub_d )
>>> setkeyv( sub_dt, keys )
>>>
>>>
>>>
>>> On Feb 21, 2012, at 1:59 PM, Matthew Dowle wrote:
>>>
>>> >
>>> > I see the problem too but (just) adding droplevels.data.table
>>> > might miss the root cause.
>>> >
>>> >> because the way the
>>> >> droplevels.data.frame method works isn't compatible with
>>> >> data.table indexing.
>>> >
>>> > But it's intended to be. I can see the switch at the top of
>>> > [.data.table is detecting the caller isn't data.table aware, and
>>> > it is then dispatching to `[.data.frame` but why it then isn't
>>> > working I'm not sure. Something to do with the missing j or
>>> > missing drop not being passed through correctly, perhaps.
>>> >
>>> > I have heard it said (once or twice) that data.table is "almost"
>>> > compatible with non-data.table-aware packages, but never had an
>>> > example before. I wonder if this is it!
>>> >
>>> > A (fast) droplevels.data.table using := would be good anyway, though.
>>> >
>>> > Matthew
>>> >
>>> >
>>> >
>>> >> Hi,
>>> >>
>>> >> I see what the problem is -- we need to provide a
>>> >> droplevels.data.table S3 method, because the way the
>>> >> droplevels.data.frame method works isn't compatible with
>>> >> data.table indexing.
>>> >>
>>> >> Will fix:
>>> >>
>>> >> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=1841&group_id=240&atid=975
>>> >>
>>> >> Thanks for raising the flag.
>>> >>
>>> >> Cheers,
>>> >> -steve
>>> >>
>>> >> On Tue, Feb 21, 2012 at 12:38 PM, pchalasani <[email protected]>
>>> >> wrote:
>>> >>> Surprising that this wasn't noticed before, or perhaps I'm not
>>> >>> following some recommended idiom to drop levels when using
>>> >>> data.table. The following code illustrates the bug clearly: The
>>> >>> bug remains regardless of whether I use "subset" or simply use
>>> >>> dt1 = dt[ name != 'a' ].
>>> >>>
>>> >>>
>>> >>>
>>> >>>    d <- data.table(name = c('a','b','c'), value = 1:3)
>>> >>>    dt <- data.table(d)
>>> >>>    setkey(dt,'name')
>>> >>>    dt1 <- subset(dt,name != 'a')  # or dt1 <- dt[ name != 'a' ]
>>> >>>    > dt1
>>> >>>         name value
>>> >>>    [1,]    b     2
>>> >>>    [2,]    c     3
>>> >>>
>>> >>>    > droplevels(dt1)
>>> >>>         name value
>>> >>>    [1,]    b     1
>>> >>>    [2,]    c     3
>>> >>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> View this message in context:
>>> >>> http://r.789695.n4.nabble.com/BUG-droplevels-mangles-subsetted-data-table-tp4407694p4407694.html
>>> >>> Sent from the datatable-help mailing list archive at Nabble.com.
>>> >>> _______________________________________________
>>> >>> datatable-help mailing list
>>> >>> [email protected]
>>> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Steve Lianoglou
>>> >> Graduate Student: Computational Systems Biology
>>> >>  | Memorial Sloan-Kettering Cancer Center
>>> >>  | Weill Medical College of Cornell University
>>> >> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>> >> _______________________________________________
>>> >> datatable-help mailing list
>>> >> [email protected]
>>> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>> >>
>>> >
>>> >
>>>
>>
>
>
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
>
>
> ------------------------------
>
> Message: 3
> Date: Tue, 21 Feb 2012 23:22:59 +0000
> From: Matthew Dowle <[email protected]>
> To: Steve Lianoglou <[email protected]>
> Cc: [email protected]
> Subject: Re: [datatable-help] BUG: droplevels mangles
>     subsetted data.table
> Message-ID: <1329866579.2108.208.camel@netbook>
> Content-Type: text/plain; charset="UTF-8"
>
> Hi. Just because as it stands it doesn't copy, so
>
>     newDT = dropfactors(DT)
>
> would change DT by reference with newDT a new pointer to that same
> modified object, whereas base would leave DT unchanged with newDT a
> modified copy.
>
> Just adding dt=copy(dt) at the start of the function would make it
> consistent, but then how would we (data.table-aware code) call the
> non-copying version if we wanted that (which is likely needed, given the
> motivation of dropping unused levels I guess). Could continue the set*
> theme and create setdropfactors()? But that doesn't roll off the tongue.
> Or the copy() could be switched in the usual way :
>
>     if (!cedta) dt = copy(dt)
>
> and then we data.table users would just know that droplevels worked by
> reference and we should copy() first if we want a copy, in the usual
> way. Whilst not upsetting non-data.table-aware packages, since they
> would still copy. Think I prefer the switched copy, carefully
> documented, which would save yet another new function. I'm thinking that
> users' expectations of dropfactors() would probably be that it worked by
> reference on data.tables anyway (or if not, would want it to after the
> initial surprise).
>
> Matthew
>
>
> ------------------------------
>
> Message: 4
> Date: Tue, 21 Feb 2012 21:24:33 -0500
> From: Steve Lianoglou <[email protected]>
> To: [email protected]
> Cc: [email protected]
> Subject: Re: [datatable-help] BUG: droplevels mangles subsetted
>     data.table
> Message-ID:
>     <CAHA9McNzNWNS+=4pXwLwfj5GvnpUerJx9otUOV4pY1fEXfk=r...@mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> Ahh, right ... the copying. Good point.
>
> Regarding the logic you suggest as to when to copy or not, how do you
> feel about going the explicit route instead of trying to take a best
> guess when we should/shouldn't copy via `cedta` and doing the
> 'data.frame behavior' by default?
>
> By that I mean: since the droplevels function has a `...` param, can
> we do something like:
>
> droplevels.data.table <- function(x, except=NULL, do.copy=TRUE, ...) {
>   if (do.copy) {
>     x <- copy(x)
>   }
>   oldkey = key(x)
>   change.me <- names(x)
>   if (!is.null(except)) {
>     change.me <- setdiff(change.me, names(x)[except])
>   }
>   for (i in change.me) {
>     if (is.factor(x[[i]])) x[,i:=droplevels(x[[i]]),with=FALSE]
>   }
>   setkeyv( x, oldkey )
> }
>
> yay/nay?
>
> -steve
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
>
>
> ------------------------------
>
> _______________________________________________
> datatable-help mailing list
> [email protected]
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
> End of datatable-help Digest, Vol 24, Issue 9
> *********************************************
_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
