Asis, On 30 December 2013 at 19:01, Asis Hallab wrote: | I need to select all unique un-ordered pairs from a two column table. | Here each row represents a pair and | two rows, un-ordered pairs p and q are identical if | either first and second elements of p and q are identical, respectively, | or if first element of p is identical to second element of q and vice versa. | | Hence: | ( "A", "B" ) | ( "B", "A" ) | ( "A", "B" ) | are all identical pairs. | | Currently I have this generation of a mathematical set of unordered | pairs implemented as follows - basically using a custom comparator for | a standard set of standard vectors: | " | struct Comparator { | bool operator()(const std::vector<std::string> & a, const | std::vector<std::string> & b) { | const bool swapA = a[0] < a[1]; | const std::string & al = swapA ? a[0] : a[1]; | const std::string & ar = swapA ? a[1] : a[0]; | const bool swapB = b[0] < b[1]; | const std::string & bl = swapB ? b[0] : b[1]; | const std::string & br = swapB ? b[1] : b[0]; | return al < bl || (al == bl && ar < br); | } | }; | | SEXP extractProteinPairs( SEXP proteinAccessionMatrix, SEXP | pairFirstMemberColIndex, SEXP pairSecondMemberColIndex ) { | BEGIN_RCPP | | StringMatrix tbl( proteinAccessionMatrix ); | int colA( NumericVector( pairFirstMemberColIndex )( 0 ) ); | int colB( NumericVector( pairSecondMemberColIndex )( 0 ) ); | std::set<std::vector<std::string>, Comparator> prs; | | for ( int i=0; i<tbl.nrow(); ++i ) { | CharacterVector r( tbl( i, _ ) ); | std::vector< std::string > p; | p.push_back( std::string( r( colA ) ) ); | p.push_back( std::string( r( colB ) ) ); | prs.insert( p ); | } | | return( wrap( prs ) ); | | END_RCPP | } | " | | I am wondering if this could not be sped up using the following | algorithm sketched in "pseudo R": | | # input: | pairsTable | # a matrix with two columns and n rows, each representing an un-ordered pair | | # output: | m <- matrix( ncol=2 ) | # a matrix with two columns, subset of input and initialized as given | | for_each row p in argument pairsTable { | indices <- find all rows of m where EITHER column is identical with | either p[[1]] or p[[2]] | identityCandidates <- m[ indices, ] | if( ! any( candidate in identiyCandidates identical with p ) ) { | append p to m | } | } | | Would the above algorithm be faster than my implementation, because it | does just two "find_all" for each column in the result matrix and a | subsequent identity comparison with the current pair p?
Not sure. | And if the algorithm is faster how would I best implement this in Rcpp? Not sure either. You have something that works. Maybe just profile it to see if there are any bottlenecks. | - The currently used Rcpp implementation given above takes | approximately 29 secs on an input table with 3,018,671 rows, returning | a subset with 2,597,797 unique un-ordered pairs (rows). | | - If I managed to explain my problem properly and did not confuse you too much: | Any comments and help will be much appreciated! With that many objects it starts to matter how you store and access the data. Maybe std::vector<std::string> is not the best -- I don't know. But as you have something reasonably fast now -- is it worth spending days on getting 29 secs down to 19 secs ? Dirk | | A happy new year to everyone! | Cheers! | _______________________________________________ | Rcpp-devel mailing list | Rcpp-devel@lists.r-forge.r-project.org | https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel -- Dirk Eddelbuettel | e...@debian.org | http://dirk.eddelbuettel.com _______________________________________________ Rcpp-devel mailing list Rcpp-devel@lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel