Ricardo,

You read my mind. :) I was thinking of the same thing. It would be
interesting to see whether the community agrees. It could save the trouble
of calling `alloc.col` manually.

Arun

On Wednesday, July 31, 2013 at 6:04 PM, Ricardo Saporta wrote:
> Hey Arun,
>
> Great call on using `alloc.col()`. I would not have thought of that.
>
> Since we were previously talking about updates to common functions in the
> package, I wouldn't mind seeing an argument added to `unique.data.table`
> along the lines of `useKey=FALSE` (perhaps better named). Thoughts?
>
> Rick
>
> On Wed, Jul 31, 2013 at 11:06 AM, Arunkumar Srinivasan <[email protected]>
> wrote:
> > Ricardo,
> >
> > Yes, I was also thinking of this, because of precisely the issue you
> > mention. In this case, I'd do `invisible(alloc.col(DT2))` before assigning
> > by reference. The typical way of converting from a data.frame to a
> > data.table (without a complete copy, or rather with a "shallow" copy) is:
> >
> >     DF <- data.frame(x=1:5, y=6:10)
> >     tracemem(DF)
> >     [1] "<0x100f08678>"
> >
> >     setattr(DF, 'class', c('data.table', 'data.frame'))
> >     data.table:::settruelength(DF, 0)
> >     invisible(alloc.col(DF))
> >     tracemem(DF)
> >     [1] "<0x103c23b30>"
> >
> >     DF[, z := 1]
> >
> > Even though there's a copy happening, this, as I understand it, is a
> > "shallow" copy (copying only references/pointers and not the entire data)
> > and therefore should take almost negligible time. Now, if you look
> > at the second line, it first sets the "truelength" attribute to 0 (which is
> > set to NULL for a data.frame, if you look at the as.data.frame.data.table
> > function). Then it allocates the columns with "alloc.col". So,
> >
> >     DT1 <- data.table(1)
> >     DT2 <- unique.data.frame(DT1) # <~~~ your true length is screwed up
> >     truelength(DT2)
> >     # [1] 0
> >
> >     invisible(alloc.col(DT2))
> >     truelength(DT2)
> >     # [1] 100
> >
> >     DT2[, w := 2]
> >     # no warning / full copy.
> >
> > So, Frank, I guess this is an alternate way if you don't want the
> > warning/full copy, but you want to specifically use `unique.data.frame`.
> >
> > Thanks for bringing it up, Ricardo.
> > If I've gotten something wrong, feel free to correct me.
> >
> > Arun
> >
> > On Wednesday, July 31, 2013 at 3:49 PM, Ricardo Saporta wrote:
> > > Arun, just to comment on this part:
> > >
> > > <<The answer to your problem is that you should be using `unique(DT1)`
> > > instead of `unique.data.frame(DT1)` because `unique` will call the
> > > "correct" `unique.data.table` method on DT1.>>
> > >
> > > I use `unique.data.frame(DT)` all the time.
> > > The reason is that I often have data with multiple rows per key. If I
> > > want all unique rows, `unique.data.table` gives me a result other than
> > > what I need. Any thoughts on a better way?
> > >
> > > On Wednesday, July 31, 2013, Arunkumar Srinivasan wrote:
> > > > Frank,
> > > >
> > > > The answer to your problem is that you should be using `unique(DT1)`
> > > > instead of `unique.data.frame(DT1)` because `unique` will call the
> > > > "correct" `unique.data.table` method on DT1.
> > > >
> > > > Now, as to why this is happening: you should know that data.table
> > > > over-allocates a list of column pointers in order to add columns by
> > > > reference (you can read more about this, if you wish, by looking at
> > > > ?`:=`). That is, if you do:
> > > >
> > > >     DT1 <- data.table(1)
> > > >
> > > > you've created 1 column. But you've (or data.table has) allocated a
> > > > vector of 100 column pointers (by default). You can see this by using
> > > > the function `truelength`.
> > > >
> > > >     truelength(DT1)
> > > >     # [1] 100
> > > >
> > > > Your problem with `unique.data.frame` is that this `truelength` is not
> > > > maintained after doing this copy. That is:
> > > >
> > > >     DT2 <- unique(DT1) # <~~~ correct way
> > > >     DT3 <- unique.data.frame(DT1) # <~~~ incorrect way
> > > >
> > > >     truelength(DT2)
> > > >     # [1] 100
> > > >     truelength(DT3)
> > > >     # [1] 0
> > > >
> > > > Therefore, we have a problem now. The over-allocated memory is somehow
> > > > "gone" after this copy.
> > > > Therefore, when you do a `:=` after this, we
> > > > will be writing to a memory location which isn't allocated. And this
> > > > would normally lead to a segmentation fault (IIUC).
> > > >
> > > > And this is what happened with an earlier version of data.table in a
> > > > similar context: setting the key of a data.table. In version 1.7.8, the
> > > > key of a data.table was set by:
> > > >
> > > >     key(DT) <- …
> > > >
> > > > And this resulted in a "copy" that set the true length to 0. So
> > > > assigning by reference after this step led to a segmentation fault.
> > > > This is why we now have a "setkey" function, or the more general
> > > > "setattr" function, to assign things without R's copy screwing things up.
> > > >
> > > > In order to catch this issue and rectify it without throwing a
> > > > segmentation fault, the attribute ".internal.selfref" was designed.
> > > > Basically, it finds these situations and in that case gets a copy before
> > > > assigning by reference. I can't find documentation on "how" it's
> > > > done. But the way I think of it is that when you assign by reference,
> > > > the existing .internal.selfref attribute (which is of class
> > > > externalptr) is compared with the actual value of your data.table and
> > > > if they match, then everything's good. Else, it has to make a copy and
> > > > set the correct ptr as the attribute.
> > > >
> > > > You can read about this in ?setkey. So in essence, use `unique`, which'll
> > > > call the correct `unique.data.table` (hidden) function. Hope this
> > > > helps. If there's ambiguity or I got something wrong, please point it out.
> > > >
> > > > Arun
> > > >
> > > > On Wednesday, July 31, 2013 at 12:07 AM, Frank Erickson wrote:
> > > > > I expect DT2 <- unique.data.frame(DT1) to be a new object, but get a
> > > > > warning about pointers, so apparently it is not...?
> > > > >
> > > > > A short example:
> > > > >
> > > > >     DT1 <- data.table(1)
> > > > >     DT2 <- unique.data.frame(DT1)
> > > > >
> > > > >     DT2[,gah:=1]
> > > > >
> > > > > An example closer to my application, undoing a cartesian/cross join:
> > > > >
> > > > >     DT1 <- CJ(A=0:1,B=1:6,D0=0:1,D=0:1)[D>=D0]
> > > > >     setkey(DT1,A)
> > > > >
> > > > >     DT2 <- unique.data.frame(DT1[,-which(names(DT1)%in%'B'),with=FALSE])
> > > > >
> > > > >     DT2[,gah:=1] # warning: I should have made a copy, apparently
> > > > >
> > > > > I'm fine with explicitly making a copy, of course, and don't really
> > > > > know anything about pointers. I just thought I'd bring it up.
> > > > >
> > > > > --Frank
> > > > > _______________________________________________
> > > > > datatable-help mailing list
> > > > > [email protected]
> > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > >
> > > --
> > > Ricardo Saporta
> > > Graduate Student, Data Analytics
> > > Rutgers University, New Jersey
> > > e: [email protected]
