Thanks! That's a good idea, and a lot simpler than what I was concocting in my head. I'll give that a try. Just for posterity, to confirm the sign:
DT[, importance := 0 + is.na(V3)] (rather than 0 - is.na(V3)) is the
right form, so that rows with V3 get the lower importance, sort first,
and are the ones unique() keeps.

-------
Nathaniel Graham
[email protected]
https://sites.google.com/site/npgraham1/


On Tue, May 20, 2014 at 8:34 PM, Gabor Grothendieck
<[email protected]> wrote:

> On Tue, May 20, 2014 at 8:20 PM, Nathaniel Graham
> <[email protected]> wrote:
> > First, I use rbindlist pretty often, and I've been quite happy with
> > it. The new use.names and fill features definitely scratch an itch
> > for me; I wound up using rbind_all from dplyr (which worked well,
> > I'm not complaining), but I'm looking forward to having a data.table
> > implementation. The speed increase is also welcome. So thank you
> > for the new features! I don't personally have a preference with
> > respect to the use.names and fill defaults, so whatever you guys
> > decide will be fine with me.
> >
> > I do have a question regarding unique, which I use very, very
> > frequently, and often after rbindlist. I have a fairly large data
> > set (tens of millions of raw observations), many of which are
> > duplicates. The observations come from a variety of sources, but
> > the formats and variable names are (nearly) identical.
> >
> > The problem is that many "duplicates" aren't perfect duplicates,
> > and some rows have more information than others. A simple example
> > might look like this:
> >
> >> foo
> >    V1 V2   V3
> > 1:  1  3 TRUE
> > 2:  1  4 TRUE
> > 3:  2  3   NA
> > 4:  2  4 TRUE
> > 5:  1  3 TRUE
> > 6:  1  4   NA
> > 7:  2  3 TRUE
> > 8:  2  4 TRUE
> > 9:  3  1   NA
> >> unique(foo, by = c("V1", "V2"))
> >    V1 V2   V3
> > 1:  1  3 TRUE
> > 2:  1  4 TRUE
> > 3:  2  3   NA
> > 4:  2  4 TRUE
> > 5:  3  1   NA
> >
> > Sometimes V3 is present and sometimes it isn't. V1 and V2 (in my
> > story) uniquely identify an observation, but if there's a row where
> > I also have V3, I'd prefer to have that row rather than a row where
> > it's missing. You can see that a naive use of unique here gets me
> > the less-preferable 2,3 row. If I only had three columns, this
> > would be easy to solve (sort/setkey first would do it). However, I
> > have more than a dozen additional columns, and when I drop
> > duplicates I want to retain the row with the greatest number of
> > non-missing values. Additionally, some columns are more important
> > than others. If (to refer again to the example above) there are no
> > rows that have V3 for a given V1 & V2 (like 3,1), I still need to
> > retain a row, so I can't just condition on !is.na(V3).
> >
> > Does anybody have any insight or techniques for this sort of thing?
> > I'm currently sorting on all columns prior to unique, but I'm quite
> > sure that this loses some information.
>
> Append an importance column which ranks the importance of that row
> (lower is better) and make importance the low-order component of the
> key:
>
> DT[, importance := 0 + is.na(V3)]
> setkey(DT, V1, V2, importance)
> unique(DT, by = c("V1", "V2"))
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
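For anyone reading this in the archive: the importance-key trick
generalizes to the many-column case described above. Here is a minimal
sketch, assuming a made-up extra column V4 standing in for the "dozen
additional columns" and arbitrary weights (2 for V3, 1 for V4); nothing
beyond V1, V2, and V3 comes from the thread itself.

    library(data.table)

    ## The foo example from the thread, plus a hypothetical column V4
    ## with invented values.
    foo <- data.table(
      V1 = c(1, 1, 2, 2, 1, 1, 2, 2, 3),
      V2 = c(3, 4, 3, 4, 3, 4, 3, 4, 1),
      V3 = c(TRUE, TRUE, NA, TRUE, TRUE, NA, TRUE, TRUE, NA),
      V4 = c(2, NA, 5, 5, NA, 4, 5, NA, NA)
    )

    ## Weighted count of missing values: an NA in V3 costs more than an
    ## NA in V4, so completeness of V3 dominates the ranking.  With many
    ## equally important columns, a plain count works instead:
    ##   foo[, importance := rowSums(is.na(.SD)), .SDcols = info_cols]
    foo[, importance := 2 * is.na(V3) + 1 * is.na(V4)]

    ## Lower importance sorts first, so unique() keeps the most complete
    ## row in each (V1, V2) group; a group with no complete row (here
    ## V1 == 3, V2 == 1) still keeps one row.
    setkey(foo, V1, V2, importance)
    unique(foo, by = c("V1", "V2"))

In the (1, 4) group this keeps the row with V3 = TRUE and V4 missing
over the row with V3 missing and V4 present, which is the effect of the
weighting. The helper column can be dropped afterwards with
foo[, importance := NULL].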
