Hey Arun, great call on using `alloc.col()`; I would not have thought of that.
Since we were previously talking about updates to common functions in the package, I wouldn't mind seeing an argument added to `unique.data.table` along the lines of `useKey=FALSE` (perhaps better named). Thoughts?

Rick

On Wed, Jul 31, 2013 at 11:06 AM, Arunkumar Srinivasan <[email protected]> wrote:

> Ricardo,
>
> Yes, I was also thinking of this, because of precisely the issue you mention. In this case, I'd do `invisible(alloc.col(DT2))` before assigning by reference. The typical way of converting from a data.frame to a data.table (without a complete copy, or rather with a "shallow" copy) is:
>
>     DF <- data.frame(x=1:5, y=6:10)
>     tracemem(DF)
>     # [1] "<0x100f08678>"
>
>     setattr(DF, 'class', c('data.table', 'data.frame'))
>     data.table:::settruelength(DF, 0)
>     invisible(alloc.col(DF))
>     tracemem(DF)
>     # [1] "<0x103c23b30>"
>
>     DF[, z := 1]
>
> Even though there's a copy happening, this, as I understand it, is a "shallow" copy (copying only references/pointers and not the entire data) and should therefore take almost negligible time. Now, if you look at the second line, it first sets the "truelength" attribute to 0 (which is set to NULL for a data.frame, if you look at the as.data.frame.data.table function). Then it allocates the columns with "alloc.col". So,
>
>     DT1 <- data.table(1)
>     DT2 <- unique.data.frame(DT1) # <~~~ your true length is screwed up
>     truelength(DT2)
>     # [1] 0
>
>     invisible(alloc.col(DT2))
>     truelength(DT2)
>     # [1] 100
>
>     DT2[, w := 2]
>     # no warning / full copy.
>
> So, Frank, I guess this is an alternate way if you don't want the warning/full copy, but you want to specifically use `unique.data.frame`.
>
> Thanks for bringing it up, Ricardo. If I've gotten something wrong, feel free to correct me.
>
> Arun
>
> On Wednesday, July 31, 2013 at 3:49 PM, Ricardo Saporta wrote:
>
> Arun, just to comment on this part:
>
> <<The answer to your problem is that you should be using `unique(DT1)` instead of `unique.data.frame(DT1)` because `unique` will call the "correct" `unique.data.table` method on DT1. >>
>
> I use `unique.data.frame(DT)` all the time. The reason is that I often have data with multiple rows per key. If I want all unique rows, `unique.data.table` gives me a result other than what I need. Any thoughts on a better way?
>
> On Wednesday, July 31, 2013, Arunkumar Srinivasan wrote:
>
> Frank,
>
> The answer to your problem is that you should be using `unique(DT1)` instead of `unique.data.frame(DT1)` because `unique` will call the "correct" `unique.data.table` method on DT1.
>
> Now, as to why this is happening… You should know that data.table over-allocates a list of column pointers in order to add columns by reference (you can read more about this, if you wish, by looking at ?`:=`). That is, if you do:
>
>     DT1 <- data.table(1)
>
> You've created 1 column. But you've (or data.table has) allocated a vector of 100 column pointers (by default). You can see this by using the function `truelength`.
>
>     truelength(DT1)
>     # [1] 100
>
> Your problem with `unique.data.frame` is that this `truelength` is not maintained after doing this copy. That is:
>
>     DT2 <- unique(DT1) # <~~~ correct way
>     DT3 <- unique.data.frame(DT1) # <~~~ incorrect way
>
>     truelength(DT2)
>     # [1] 100
>     truelength(DT3)
>     # [1] 0
>
> Therefore, we've a problem now. The over-allocated memory is somehow "gone" after this copy. So when you do a `:=` after this, we will be writing to a memory location which isn't allocated. And this would normally lead to a segmentation fault (IIUC).
>
> And this is what happened with an earlier version of data.table in a similar context: setting the key of a data.table.
> In version 1.7.8, the key of a data.table was set by:
>
>     key(DT) <- …
>
> And this resulted in a "copy" that set the true length to 0. So assigning by reference after this step led to a segmentation fault. This is why we now have a "setkey" function, or the more general "setattr" function, to assign things without R's copy screwing things up.
>
> In order to catch this issue and rectify it without throwing a segmentation fault, the attribute ".internal.selfref" was designed. Basically it finds these situations and in that case gets a copy before assigning by reference. I can't find any documentation on "how" it's done. But the way I think of it is that when you assign by reference, the existing .internal.selfref attribute (which is of class externalptr) is compared with the actual value of your data.table, and if they match, then everything's good. Else, it has to make a copy and set the correct ptr as the attribute.
>
> You can read about this in ?setkey. So in essence, use `unique`, which will call the correct `unique.data.table` (hidden) function. Hope this helps. If there's ambiguity or I got something wrong, please point it out.
>
> Arun
>
> On Wednesday, July 31, 2013 at 12:07 AM, Frank Erickson wrote:
>
> I expect DT2 <- unique.data.frame(DT1) to be a new object, but get a warning about pointers, so apparently it is not...?
>
> A short example:
>
>     DT1 <- data.table(1)
>     DT2 <- unique.data.frame(DT1)
>     DT2[,gah:=1]
>
> An example closer to my application, undoing a cartesian/cross join:
>
>     DT1 <- CJ(A=0:1,B=1:6,D0=0:1,D=0:1)[D>=D0]
>     setkey(DT1,A)
>     DT2 <- unique.data.frame(DT1[,-which(names(DT1)%in%'B'),with=FALSE])
>     DT2[,gah:=1] # warning: I should have made a copy, apparently
>
> I'm fine with explicitly making a copy, of course, and don't really know anything about pointers. I just thought I'd bring it up.
>
> --Frank
> _______________________________________________
> datatable-help mailing list
> [email protected]
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
> --
> Ricardo Saporta
> Graduate Student, Data Analytics
> Rutgers University, New Jersey
> e: [email protected]
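
P.S. Until something like the `useKey=FALSE` idea exists, here is a rough sketch of how the two steps from this thread (`unique.data.frame`, then `alloc.col`) might be wrapped up in user code. The name `unique_all_rows` is made up for illustration and is not a data.table function. I'm assuming `alloc.col` returns the re-allocated table, so treat this as a sketch rather than a recommended implementation.

```r
library(data.table)

# Hypothetical helper (NOT part of data.table): de-duplicate on all columns
# via the data.frame method, then restore the over-allocated column slots
# so that a later `:=` does not hit the selfref warning described above.
unique_all_rows <- function(DT) {
  out <- unique.data.frame(DT)  # data.frame method, as used in the thread
  alloc.col(out)                # re-allocate column pointers; value is returned
}

# Several rows per key value, one of them an exact duplicate
DT1 <- data.table(A = c(1L, 1L, 1L), B = c(1L, 2L, 2L), key = "A")

DT2 <- unique_all_rows(DT1)     # keeps the rows (1, 1) and (1, 2)
DT2[, gah := 1]                 # add a column by reference
```

The design choice here is only to keep the repair next to the call that breaks the over-allocation, so callers don't have to remember the `alloc.col` step themselves.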
