I would like to identify _almost_ duplicated rows in a data frame.  For
example, I might declare a pair of rows to be duplicates if they agree in
about 80% of their columns.  With tens of thousands of rows and upwards of
20 columns, an iterative approach that tests all pairwise combinations can
be time consuming.
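For a moderate number of rows, the pairwise test can at least be vectorized
over columns rather than looped element by element.  A minimal sketch, using
toy data with a planted near-duplicate and an assumed 80% threshold (the
O(n^2) pair enumeration is still the bottleneck at tens of thousands of rows):

```r
## Toy data: 10 rows x 20 columns of small integers; the seed, the planted
## near-duplicate in row 2, and the 80% threshold are all illustrative.
set.seed(1)
m <- matrix(sample(1:3, 200, replace = TRUE), nrow = 10)
m[2, ] <- m[1, ]                          # copy row 1 into row 2 ...
m[2, c(3, 7)] <- (m[2, c(3, 7)] %% 3) + 1 # ... then change 2 of 20 columns

p <- ncol(m)
pairs <- combn(nrow(m), 2)                # all row pairs: O(n^2) in rows
matches <- apply(pairs, 2, function(ij) sum(m[ij[1], ] == m[ij[2], ]))
near_dups <- pairs[, matches >= 0.8 * p, drop = FALSE]
near_dups   # each column is a pair of rows alike in >= 80% of columns
```

Restricting the candidate pairs first (e.g. by sorting or hashing on a few
columns) would cut the n^2 cost, but how well that works depends on the data.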

duplicated() with an incomparables argument sounds like the ticket.  But
previous discussion on this list indicates that specifying incomparable
values when using duplicated() on a data frame is not yet implemented.

Any suggestions about how to implement this efficiently would be appreciated.  

All data are numerical, and each datum could, for example, be reduced to a byte 
representation in a string.  A fuzzy matching approach with agrep() might be 
possible.
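Along those lines, a rough sketch of the agrep() idea, with the same caveats:
the single-letter encoding only works when the values map onto a small
discrete alphabet, max.distance = 4 (up to 4 mismatches out of 20 columns,
i.e. 80% alike) is an assumption, and agrep() does approximate substring
matching, so this is only a coarse filter, not an exact column-mismatch count:

```r
## Toy data as before: 10 rows x 20 columns of values in 1:3, with row 2 a
## planted near-duplicate of row 1 (differs in 2 of 20 columns).
set.seed(1)
m <- matrix(sample(1:3, 200, replace = TRUE), nrow = 10)
m[2, ] <- m[1, ]
m[2, c(3, 7)] <- (m[2, c(3, 7)] %% 3) + 1

## Encode each row as a 20-character string, one letter per datum.
keys <- apply(m, 1, function(row) paste0(letters[row], collapse = ""))

## For each row, find other rows within edit distance 4 of its encoding.
near <- lapply(seq_along(keys), function(i)
  setdiff(agrep(keys[i], keys, max.distance = 4), i))
near[[1]]   # rows whose encoding is close to row 1's
```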

Thanks.





______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
