Ricardo,  

Yes, I was also thinking of this, because of precisely the issue you mention. 
In this case, I'd do `invisible(alloc.col(DT2))` before assigning by reference. 
The typical way of converting from a data.frame to a data.table (without a 
complete copy, or rather with only a "shallow" copy) is:

DF <- data.frame(x=1:5, y=6:10)
tracemem(DF)
[1] "<0x100f08678>"

setattr(DF, 'class', c('data.table', 'data.frame'))
data.table:::settruelength(DF, 0)
invisible(alloc.col(DF))
tracemem(DF)
[1] "<0x103c23b30>"

DF[, z := 1]

Even though there's a copy happening, this, as I understand it, is a "shallow" 
copy (copying only references/pointers and not the entire data) and should 
therefore take almost negligible time. Now, if you look at the second line of 
the conversion, it first sets the "truelength" attribute to 0 (which is set to 
NULL for a data.frame, if you look at the as.data.frame.data.table function). 
Then it allocates the columns with "alloc.col". So,  

DT1 <- data.table(1)
DT2 <- unique.data.frame(DT1) # <~~~ your true length is screwed up
truelength(DT2)
# [1] 0

invisible(alloc.col(DT2))
truelength(DT2)
# [1] 100

DT2[, w := 2]
# no warning, and no full copy.
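
If this check is needed in more than one place, the repair step above could be 
wrapped in a small helper. This is only a sketch: `ensure_overallocated` is a 
made-up name (it is not part of data.table), the example table is invented, and 
it assumes `truelength` and `alloc.col` are exported by the installed version 
and behave as described above.

```r
library(data.table)

# Hypothetical helper (not part of data.table): if the over-allocated
# column pointer slots were lost by a non-data.table copy, re-allocate
# them so that a later `:=` has room to add columns.
ensure_overallocated <- function(DT) {
  stopifnot(is.data.table(DT))
  if (truelength(DT) <= length(DT)) {
    # No spare slots left (e.g. truelength is 0): over-allocate again.
    DT <- alloc.col(DT)
  }
  DT
}

# Invented example data: rows 1 and 2 are duplicates.
DT1 <- data.table(a = c(1, 1, 2), b = c(3, 3, 4))

# unique.data.frame() goes through data.frame code, so repair afterwards.
DT2 <- ensure_overallocated(unique.data.frame(DT1))

# Add a column by reference.
DT2[, w := 2]
```

The helper returns the table rather than relying on a side effect, so the 
result has to be assigned, as in the `DT2 <- ...` line.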

So, Frank, I guess this is an alternate way if you don't want the warning/full 
copy, but you want to specifically use `unique.data.frame`.
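
For the case Ricardo describes (all unique rows of a keyed table), another 
option may be to stay with `unique` and tell it which columns to compare. The 
sketch below assumes a data.table version whose `unique` method accepts a `by` 
argument; if yours does not, the `alloc.col` route above still applies. The 
table and column names are made up for illustration.

```r
library(data.table)

# Made-up keyed table with several rows per key value.
DT <- data.table(id = c(1, 1, 1, 2), v = c("a", "a", "b", "c"))
setkey(DT, id)

# Compare rows on every column rather than on the key only.
# Assumption: the installed unique.data.table supports `by`.
U <- unique(DT, by = names(DT))

# U was produced by a data.table method, so adding a column by
# reference should not need the alloc.col repair step.
U[, w := 2]
```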

Thanks for bringing it up Ricardo. If I've gotten something wrong, feel free to 
correct me.

Arun


On Wednesday, July 31, 2013 at 3:49 PM, Ricardo Saporta wrote:

> Arun, just to comment on this part:  
>  
> <<The answer to your problem is that you should be using `unique(DT1)` 
> instead of `unique.data.frame(DT1)` because `unique` will call the "correct" 
> `unique.data.table` method on DT1. >>  
>  
> I use `unique.data.frame(DT)` all the time.  
> The reason being that I often have data with multiple rows per key.  If I 
> want all unique rows, `unique.data.table` gives me a result other than what I 
> need.   Any thoughts on a better way?  
>  
> On Wednesday, July 31, 2013, Arunkumar Srinivasan wrote:
> > Frank,  
> >  
> > The answer to your problem is that you should be using `unique(DT1)` 
> > instead of `unique.data.frame(DT1)` because `unique` will call the 
> > "correct" `unique.data.table` method on DT1.   
> >  
> > Now, as to why this is happening… You should know that data.table 
> > over-allocates a list of column pointers in order to add columns by reference 
> > (you can read about this more, if you wish, by looking at ?`:=`). That is, 
> > if you do:  
> >  
> > DT1 <- data.table(1)
> >  
> > You've created 1 column. But you've (or data.table has) allocated a vector 
> > of 100 column pointers (by default). You can see this by using the function 
> > `truelength`.  
> >  
> > truelength(DT1)
> > > 100
> >  
> > Your problem with `unique.data.frame` is that this `truelength` is not 
> > maintained after doing this copy. That is:
> >  
> > DT2 <- unique(DT1) # <~~~ correct way  
> > DT3 <- unique.data.frame(DT1) # <~~~ incorrect way
> >  
> > truelength(DT2)
> > > 100
> > truelength(DT3)
> > > 0
> >  
> > Therefore, we've a problem now. The over-allocated memory is somehow "gone" 
> > after this copy. Therefore when you do a `:=` after this, we will be 
> > writing to a memory location which isn't allocated. And this would normally 
> > lead to a segmentation fault (IIUC).   
> >  
> > And this is what happened with an earlier version of data.table in a 
> > similar context: setting the key of a data.table. In version 1.7.8, the key 
> > of a data.table was set by:
> >  
> > key(DT) <- …  
> >  
> > And this resulted in a "copy" that set the truelength to 0. So assigning 
> > by reference after this step led to a segmentation fault. This is why we now 
> > have a "setkey" function, or the more general "setattr" function, to assign 
> > things without R's copy screwing things up.  
> >  
> > In order to catch this issue and rectify it without throwing a segmentation 
> > fault, the attribute ".internal.selfref" was designed. Basically it finds 
> > these situations and in that case gets a copy before assigning by 
> > reference. I can't find any documentation on "how" it's done. But the way I 
> > think of it is that when you assign by reference the existing 
> > .internal.selfref attribute (which is of class externalptr) is compared 
> > with the actual value of your data.table and if they match, then 
> > everything's good. Else, it has to make a copy and set the correct ptr as 
> > the attribute.  
> >  
> > You can read about this in ?setkey. So in essence use `unique` which'll 
> > call the correct `unique.data.table` (hidden) function. Hope this helps. If 
> > there's ambiguity or I got something wrong, please point out.  
> >  
> > Arun
> >  
> >  
> > On Wednesday, July 31, 2013 at 12:07 AM, Frank Erickson wrote:
> >  
> > > I expect DT2 <- unique.data.frame(DT1) to be a new object, but get a 
> > > warning about pointers, so apparently it is not...?  
> > >  
> > > A short example:
> > >  
> > > > DT1 <- data.table(1)
> > > > DT2 <- unique.data.frame(DT1)
> > > >  
> > > > DT2[,gah:=1]
> > > >  
> > >  
> > >  
> > > An example closer to my application, undoing a cartesian/cross join:  
> > >  
> > > > DT1 <- CJ(A=0:1,B=1:6,D0=0:1,D=0:1)[D>=D0]
> > > > setkey(DT1,A)
> > > >  
> > > > DT2 <- unique.data.frame(DT1[,-which(names(DT1)%in%'B'),with=FALSE])
> > > >  
> > > > DT2[,gah:=1] # warning: I should have made a copy, apparently
> > > >  
> > >  
> > >  
> > > I'm fine with explicitly making a copy, of course, and don't really know 
> > > anything about pointers. I just thought I'd bring it up.  
> > >  
> > > --Frank  
> > > _______________________________________________
> > > datatable-help mailing list
> > > > [email protected]
> > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > >  
> > >  
> > >  
> >  
> >  
>  
>  
> --  
> Ricardo Saporta
> Graduate Student, Data Analytics
> Rutgers University, New Jersey
> e: [email protected]
>  
>  
