Arun, just to comment on this part: <<The answer to your problem is that you should be using `unique(DT1)` instead of `unique.data.frame(DT1)` because `unique` will call the "correct" `unique.data.table` method on DT1. >>
I use `unique.data.frame(DT)` all the time. The reason is that I often have
data with multiple rows per key. If I want all unique rows,
`unique.data.table` gives me a result other than what I need. Any thoughts
on a better way?

On Wednesday, July 31, 2013, Arunkumar Srinivasan wrote:

> Frank,
>
> The answer to your problem is that you should be using `unique(DT1)`
> instead of `unique.data.frame(DT1)` because `unique` will call the
> "correct" `unique.data.table` method on DT1.
>
> Now, as to why this is happening… You should know that data.table
> over-allocates a list of column pointers in order to add columns by
> reference (you can read more about this, if you wish, by looking at
> ?`:=`). That is, if you do:
>
>     DT1 <- data.table(1)
>
> You've created 1 column. But you've (or data.table has) allocated a
> vector of 100 column pointers (by default). You can see this by using
> the function `truelength`.
>
>     truelength(DT1)
>     # 100
>
> Your problem with `unique.data.frame` is that this `truelength` is not
> maintained after doing this copy. That is:
>
>     DT2 <- unique(DT1)             # <~~~ correct way
>     DT3 <- unique.data.frame(DT1)  # <~~~ incorrect way
>
>     truelength(DT2)
>     # 100
>     truelength(DT3)
>     # 0
>
> Therefore, we have a problem now. The over-allocated memory is somehow
> "gone" after this copy. So when you do a `:=` after this, we will be
> writing to a memory location which isn't allocated. And this would
> normally lead to a segmentation fault (IIUC).
>
> And this is what happened with an earlier version of data.table in a
> similar context: setting the key of a data.table. In version 1.7.8, the
> key of a data.table was set by:
>
>     key(DT) <- …
>
> And this resulted in a "copy" that set the true length to 0. So
> assigning by reference after this step led to a segmentation fault.
> This is why we now have a "setkey" function, or the more general
> "setattr" function, to assign things without R's copy screwing things up.
>
> In order to catch this issue and rectify it without throwing a
> segmentation fault, the attribute ".internal.selfref" was designed.
> Basically it finds these situations and in that case gets a copy before
> assigning by reference. I can't find documentation on "how" it's done.
> But the way I think of it is that when you assign by reference, the
> existing .internal.selfref attribute (which is of class externalptr) is
> compared with the actual value of your data.table and if they match,
> then everything's good. Else, it has to make a copy and set the correct
> ptr as the attribute.
>
> You can read about this in ?setkey. So in essence use `unique`, which'll
> call the correct `unique.data.table` (hidden) function. Hope this helps.
> If there's ambiguity or I got something wrong, please point it out.
>
> Arun
>
> On Wednesday, July 31, 2013 at 12:07 AM, Frank Erickson wrote:
>
> > I expect DT2 <- unique.data.frame(DT1) to be a new object, but get a
> > warning about pointers, so apparently it is not...?
> >
> > A short example:
> >
> >     DT1 <- data.table(1)
> >     DT2 <- unique.data.frame(DT1)
> >     DT2[,gah:=1]
> >
> > An example closer to my application, undoing a cartesian/cross join:
> >
> >     DT1 <- CJ(A=0:1,B=1:6,D0=0:1,D=0:1)[D>=D0]
> >     setkey(DT1,A)
> >     DT2 <- unique.data.frame(DT1[,-which(names(DT1)%in%'B'),with=FALSE])
> >     DT2[,gah:=1] # warning: I should have made a copy, apparently
> >
> > I'm fine with explicitly making a copy, of course, and don't really
> > know anything about pointers. I just thought I'd bring it up.
> >
> > --Frank
> > _______________________________________________
> > datatable-help mailing list
> > [email protected]
> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

--
Ricardo Saporta
Graduate Student, Data Analytics
Rutgers University, New Jersey
e: [email protected]
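For anyone following along, the over-allocation Arun describes can be inspected directly. This is only a rough sketch: the column name `x` is made up for illustration, the numbers `truelength` reports depend on the data.table version and the `datatable.alloccol` option (so no expected output is shown), and whether `unique.data.frame` still drops the over-allocation in current releases is not something this thread confirms. The explicit `copy()` at the end is the "make a copy" route Frank mentions.

```r
library(data.table)

DT1 <- data.table(x = 1)          # 'x' is an illustrative column name
length(DT1)                       # columns actually in use
truelength(DT1)                   # column-pointer slots allocated

DT2 <- unique(DT1)                # dispatches to the data.table method
DT3 <- unique.data.frame(DT1)     # calls the data.frame method directly

truelength(DT2)                   # over-allocation kept
truelength(DT3)                   # may differ from DT2 (0 in the thread above)

# Taking an explicit data.table copy gives a properly allocated object,
# so adding a column by reference afterwards is safe.
DT3 <- copy(DT3)
DT3[, gah := 1]
names(DT3)
```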
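On Ricardo's question about getting all unique rows from a keyed table: one possible approach, assuming a more recent data.table release than the ones discussed in this thread, is the `by` argument of `unique()`, which names the columns to compare. The table below is invented for illustration, and the sketch stays within the data.table method, so the result should remain a properly allocated data.table.

```r
library(data.table)

# Hypothetical keyed table with several rows per key value
DT <- data.table(A = c(1, 1, 1, 2),
                 B = c("x", "x", "y", "y"),
                 key = "A")

u_all <- unique(DT, by = names(DT))  # compare every column
u_key <- unique(DT, by = key(DT))    # compare the key column(s) only

nrow(u_all)  # 3
nrow(u_key)  # 2

# The result came from the data.table method, so := can be used on it
u_all[, gah := 1]
```

Passing `by = names(DT)` makes the intent explicit regardless of what the installed version uses as its default.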
