Re: [datatable-help] rbindlist and unique

Arunkumar Srinivasan Wed, 21 May 2014 04:01:32 -0700

Nathaniel, Thanks.

> First, I use rbindlist pretty often, and I've been quite happy with it.  The 
> new use.names and fill features definitely scratch an itch for me; I wound up 
> using rbind_all from dplyr (which worked well, I'm not complaining), but I'm 
> looking forward to having a data.table implementation.

A data.table implementation (in rbind) has existed since the last release 
(v1.9.0/2). The new rbindlist features just build on it.
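
For readers following along, here is a minimal sketch of what use.names and fill do in rbindlist. The tables a and b are made up for illustration, and the behaviour shown assumes a data.table release in which rbindlist accepts both arguments.

```r
library(data.table)

# Two illustrative tables: different column order, and each lacks a column
# that the other has.
a <- data.table(V1 = 1:2, V2 = c("x", "y"))
b <- data.table(V3 = c(TRUE, FALSE), V1 = 3:4)

# use.names = TRUE matches columns by name rather than by position;
# fill = TRUE pads columns that are absent from a table with NA.
combined <- rbindlist(list(a, b), use.names = TRUE, fill = TRUE)
print(combined)
```

The result has the columns V1, V2 and V3, with NA wherever a source table did not supply a value.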

Arun

From: Nathaniel Graham [email protected]
Reply: Nathaniel Graham [email protected]
Date: May 21, 2014 at 2:20:44 AM
To: data.table source forge [email protected]
Subject:  [datatable-help] rbindlist and unique  

First, I use rbindlist pretty often, and I've been quite happy with it.  The 
new use.names and fill features definitely scratch an itch for me; I wound up 
using rbind_all from dplyr (which worked well, I'm not complaining), but I'm 
looking forward to having a data.table implementation.  The speed increase is 
also welcome.  So thank you for the new features!  I don't personally have a 
preference with respect to the use.names and fill defaults, so whatever you 
guys decide will be fine with me.

I do have a question regarding unique, which I use very, very frequently, and 
often after rbindlist.  I have a fairly large data set (tens of millions of raw 
observations), many of which are duplicates.  The observations come from a 
variety of sources, but the formats and variable names are (nearly) identical.

The problem is that many "duplicates" aren't perfect duplicates, and some rows 
have more information than others.  A simple example might look like this:

> foo
   V1 V2   V3
1:  1  3 TRUE
2:  1  4 TRUE
3:  2  3   NA
4:  2  4 TRUE
5:  1  3 TRUE
6:  1  4   NA
7:  2  3 TRUE
8:  2  4 TRUE
9:  3  1   NA
> unique(foo, by = c("V1", "V2"))
   V1 V2   V3
1:  1  3 TRUE
2:  1  4 TRUE
3:  2  3   NA
4:  2  4 TRUE
5:  3  1   NA

Sometimes V3 is present and sometimes it isn't.  V1 and V2 (in my story) 
uniquely identify an observation, but if there's a row where I also have V3, 
I'd prefer to have that row rather than a row where it's missing.  You can see 
that a naive use of unique here gets me the less-preferable 2,3 row.  If I only 
had three columns, this would be easy to solve (sort/setkey first would do it). 
 However, I have more than a dozen additional columns, and when I drop 
duplicates I want to retain the row with the greatest number of non-missing 
values.  Additionally, some columns are more important than others.  If (to 
refer again to the example above), there are no rows that have V3 for a given 
V1 & V2 (like 3,1), I still need to retain a row, so I can't just condition on 
!is.na(V3).

Does anybody have any insight or techniques for this sort of thing?  I'm 
currently sorting on all columns prior to unique, but I'm quite sure that this 
loses some information.
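
One possible approach, sketched here rather than offered as an established recipe: score each row by a weighted count of its non-missing values, order so the highest score comes first within each V1/V2 group, and then call unique. The weights vector and the helper columns score and row_id are hypothetical names introduced only for this example. With every weight equal to 1 the score is simply the number of non-missing columns.

```r
library(data.table)

# The example table from above.
foo <- data.table(
  V1 = c(1, 1, 2, 2, 1, 1, 2, 2, 3),
  V2 = c(3, 4, 3, 4, 3, 4, 3, 4, 1),
  V3 = c(TRUE, TRUE, NA, TRUE, TRUE, NA, TRUE, TRUE, NA)
)

id_cols    <- c("V1", "V2")                 # columns that identify an observation
value_cols <- setdiff(names(foo), id_cols)  # all remaining columns

# Hypothetical importance weights, one per value column (larger = more important).
weights <- setNames(rep(1, length(value_cols)), value_cols)

dt <- copy(foo)        # work on a copy; the steps below modify by reference
dt[, score := 0]
for (col in value_cols) {
  # Add the column's weight to every row where that column is not missing.
  set(dt, j = "score", value = dt[["score"]] + weights[[col]] * (!is.na(dt[[col]])))
}

# Keep the original position so that ties go to the row that appeared first.
dt[, row_id := .I]

# Sort by the id columns, then by descending score, then by original position.
setorderv(dt, c(id_cols, "score", "row_id"),
          order = c(rep(1L, length(id_cols)), -1L, 1L))

# unique() keeps the first row of each group, which is now the best-scoring one.
best <- unique(dt, by = id_cols)
best[, c("score", "row_id") := NULL]
print(best)
```

On the example data this keeps a row with V3 = TRUE for the 2,3 group and still retains the 3,1 row, where V3 is NA. Note that it picks one existing row per group; it does not combine values from several partial rows.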

-------
Nathaniel Graham
[email protected]
[email protected]
https://sites.google.com/site/npgraham1/
_______________________________________________  
datatable-help mailing list  
[email protected]  
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

