Frank,  

The answer to your problem is that you should be using `unique(DT1)` instead of 
`unique.data.frame(DT1)` because `unique` will call the "correct" 
`unique.data.table` method on DT1.  

Now, as to why this is happening: data.table over-allocates its list of column 
pointers so that columns can be added by reference (you can read more about 
this, if you wish, in ?`:=`). That is, if you do:

DT1 <- data.table(1)

You've created 1 column, but data.table has allocated a vector of 100 column 
pointers (by default). You can see this by using the function `truelength`.

truelength(DT1)
> 100
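
If you want to check the over-allocation on your own installation, here is a 
minimal sketch. It assumes only that data.table is installed. The exact 
`truelength` you see depends on your data.table version and on the 
"datatable.alloccol" option, so I deliberately don't show a number for it.

```r
library(data.table)

DT1 <- data.table(1)

length(DT1)      # columns actually in use: 1
truelength(DT1)  # column-pointer slots allocated (version/option dependent)

# The option that controls how many spare slots are allocated
getOption("datatable.alloccol")
```

The point is only that `truelength(DT1)` is larger than `length(DT1)`; the spare 
slots are what `:=` uses when it adds a column by reference.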

Your problem with `unique.data.frame` is that this `truelength` is not 
maintained in the copy it returns. That is:

DT2 <- unique(DT1) # <~~~ correct way
DT3 <- unique.data.frame(DT1) # <~~~ incorrect way

truelength(DT2)
> 100
truelength(DT3)
> 0

So now we have a problem: the over-allocated memory is effectively "gone" 
after this copy. When you do a `:=` after this, we would be writing to a 
memory location that isn't allocated, which would normally lead to a 
segmentation fault (IIUC).  
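
Since you mentioned you're fine with making an explicit copy, here is a rough 
sketch of that route. It assumes that `copy()` hands back a freshly 
over-allocated data.table (that is my understanding of it). The names DT3 and 
DT4 are only for illustration.

```r
library(data.table)

DT1 <- data.table(1)
DT3 <- unique.data.frame(DT1)   # the route that loses the over-allocation

# TRUE here means there are fewer allocated slots than columns in use
truelength(DT3) < length(DT3)

# An explicit copy() gives a data.table with spare slots again,
# so adding a column by reference is safe
DT4 <- copy(DT3)
DT4[, gah := 1]
names(DT4)
```

Comparing `truelength` with `length` before using `:=` is a cheap way to spot an 
object that has been through one of these R-level copies.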

And this is what happened with an earlier version of data.table in a similar 
context: setting the key of a data.table. In version 1.7.8, the key of a 
data.table was set by:

key(DT) <- …

And this resulted in a "copy" that set the true length to 0, so assigning by 
reference after this step led to a segmentation fault. This is why we now have 
the `setkey` function, and the more general `setattr` function, to assign 
things without R's copy getting in the way.
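
As an illustration of the set* style, here is a small sketch. The table and the 
attribute name "note" are made up for the example; only `setkey`, `setattr`, 
`key` and `truelength` are actual data.table functions.

```r
library(data.table)

DT <- data.table(a = c(2, 1, 3), b = c("x", "y", "z"))
tl_before <- truelength(DT)

setkey(DT, a)                      # sorts by 'a' and marks it as the key, by reference
setattr(DT, "note", "demo table")  # sets an arbitrary attribute, by reference

key(DT)
attr(DT, "note")

# Check whether the over-allocation was left alone by the set* calls
identical(truelength(DT), tl_before)
```

Neither call uses `<-`, so R never gets the chance to copy the table and drop 
the spare column slots.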

In order to catch this issue and rectify it without throwing a segmentation 
fault, the attribute ".internal.selfref" was designed. Basically, it detects 
these situations and makes a copy before assigning by reference. I can't find 
any documentation on "how" it's done, but the way I think of it is this: when 
you assign by reference, the existing .internal.selfref attribute (which is of 
class externalptr) is compared with the actual address of your data.table. If 
they match, everything's good. Otherwise, it has to make a copy and set the 
correct pointer as the attribute.
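
If you're curious, you can at least look at the attribute itself. This is only 
a sketch for poking around; it shows that the attribute exists and what type it 
is, not how data.table does the comparison internally.

```r
library(data.table)

DT1 <- data.table(1)

# Pull out the self-reference attribute and look at its type
ref <- attr(DT1, ".internal.selfref")
typeof(ref)
```

On a freshly created data.table I'd expect this to report an external pointer.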

You can read about this in ?setkey. So, in essence, use `unique`, which will 
call the correct (hidden) `unique.data.table` method. Hope this helps. If 
anything is ambiguous or I got something wrong, please point it out.

Arun


On Wednesday, July 31, 2013 at 12:07 AM, Frank Erickson wrote:

> I expect DT2 <- unique.data.frame(DT1) to be a new object, but get a warning 
> about pointers, so apparently it is not...?  
>  
> A short example:
>  
> > DT1 <- data.table(1)
> > DT2 <- unique.data.frame(DT1)
> >  
> > DT2[,gah:=1]
> >  
>  
>  
> An example closer to my application, undoing a cartesian/cross join:  
>  
> > DT1 <- CJ(A=0:1,B=1:6,D0=0:1,D=0:1)[D>=D0]
> > setkey(DT1,A)
> >  
> > DT2 <- unique.data.frame(DT1[,-which(names(DT1)%in%'B'),with=FALSE])
> >  
> > DT2[,gah:=1] # warning: I should have made a copy, apparently
> >  
>  
>  
> I'm fine with explicitly making a copy, of course, and don't really know 
> anything about pointers. I just thought I'd bring it up.  
>  
> --Frank  
> _______________________________________________
> datatable-help mailing list
> [email protected] 
> (mailto:[email protected])
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>  
>  

