Hello, This is a perfect example for using hash sets (std::unordered_set in C++11) with custom implementations of hashing and comparison operators.
See this gist: https://gist.github.com/romainfrancois/8187193 I have only tested it with Rcpp11 (and therefore using C++11), but I get down to 2 or 3 seconds. > microbenchmark( unordered_pairs( m, 1, 2 ), times=10 ) Unit: seconds expr min lq median uq max neval unordered_pairs(m, 1, 2) 2.534141 2.536889 2.556607 2.714883 2.804312 10 I have used an input matrix of the same size, but probably not the same as yours. I would still expect this code to perform better to your 29 seconds. Romain Le 30 déc. 2013 à 19:01, Asis Hallab <asis.hal...@gmail.com> a écrit : > Dear Rcpp experts, > > I need to select all unique un-ordered pairs from a two column table. > Here each row represents a pair and > two rows, un-ordered pairs p and q are identical if > either first and second elements of p and q are identical, respectively, > or if first element of p is identical to second element of q and vice versa. > > Hence: > ( "A", "B" ) > ( "B", "A" ) > ( "A", "B" ) > are all identical pairs. > > Currently I have this generation of a mathematical set of unordered > pairs implemented as follows - basically using a custom comparator for > a standard set of standard vectors: > " > struct Comparator { > bool operator()(const std::vector<std::string> & a, const > std::vector<std::string> & b) { > const bool swapA = a[0] < a[1]; > const std::string & al = swapA ? a[0] : a[1]; > const std::string & ar = swapA ? a[1] : a[0]; > const bool swapB = b[0] < b[1]; > const std::string & bl = swapB ? b[0] : b[1]; > const std::string & br = swapB ? b[1] : b[0]; > return al < bl || (al == bl && ar < br); > } > }; > > SEXP extractProteinPairs( SEXP proteinAccessionMatrix, SEXP > pairFirstMemberColIndex, SEXP pairSecondMemberColIndex ) { > BEGIN_RCPP > > StringMatrix tbl( proteinAccessionMatrix ); > int colA( NumericVector( pairFirstMemberColIndex )( 0 ) ); > int colB( NumericVector( pairSecondMemberColIndex )( 0 ) ); > std::set<std::vector<std::string>, Comparator> prs; > > for ( int i=0; i<tbl.nrow(); ++i ) { > CharacterVector r( tbl( i, _ ) ); > std::vector< std::string > p; > p.push_back( std::string( r( colA ) ) ); > p.push_back( std::string( r( colB ) ) ); > prs.insert( p ); > } > > return( wrap( prs ) ); > > END_RCPP > } > " > > I am wondering if this could not be sped up using the following > algorithm sketched in "pseudo R": > > # input: > pairsTable > # a matrix with two columns and n rows, each representing an un-ordered pair > > # output: > m <- matrix( ncol=2 ) > # a matrix with two columns, subset of input and initialized as given > > for_each row p in argument pairsTable { > indices <- find all rows of m where EITHER column is identical with > either p[[1]] or p[[2]] > identityCandidates <- m[ indices, ] > if( ! any( candidate in identiyCandidates identical with p ) ) { > append p to m > } > } > > Would the above algorithm be faster than my implementation, because it > does just two "find_all" for each column in the result matrix and a > subsequent identity comparison with the current pair p? > And if the algorithm is faster how would I best implement this in Rcpp? > - The currently used Rcpp implementation given above takes > approximately 29 secs on an input table with 3,018,671 rows, returning > a subset with 2,597,797 unique un-ordered pairs (rows). > > - If I managed to explain my problem properly and did not confuse you too > much: > Any comments and help will be much appreciated! > > A happy new year to everyone! > Cheers! > _______________________________________________ > Rcpp-devel mailing list > Rcpp-devel@lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel _______________________________________________ Rcpp-devel mailing list Rcpp-devel@lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel