On 28 March 2012 at 16:36, Jonas Rauch wrote:
| If I remember correctly, environments use hash tables to lookup names.
|
| In pure R to check if an element L[["key"]] exists I would usually use
|> "key" %in% names(L)
| which seems to be reasonably fast. `%in%` works in vectors, too
|> keys %in% L
| returning exactly what Dirks suggestion was supposed to do (you missed an
"if",
| or alternatively an "or")
|
| `%in%` uses .Internal(match(...)), which computes a temporary hash table as
far
| as I can see in the source (src/main/unique.c). I am not sure about the
| implementation of the Hash. In particular I don't know if some kind of caching
| is done, because the lookup is much faster than Dirks function (see below) and
| computing a hash table for one lookup seems inefficient. I think it would be
| good to use the internal `match` function of R to do what you want. In
| Rinternals.h match is exposed as
| SEXP Rf_match(SEXP itable, SEXP ix, int nmatch)
| where nmatch is the value to be returned if there is no match.
Excellent suggestion!
I think we can even return just the index vector, with NA or another filler
for non-matches:
R> g<- cxxfunction(signature(ls="list", ts="character"), plugin="Rcpp", body='
+ Rcpp::List lst(ls);
+ Rcpp::CharacterVector nam = lst.names();
+
+ Rcpp::IntegerVector ind(Rf_match(nam,ts,R_NaInt));
+ return ind;
+ ')
R> g(ll, c("aa", "b", "c"))
[1] NA 2 NA
R>
Dirk
| Below are the results of my little benchmark (see below). First parameter of
| `lookup` is the number elements in L, second is the length of keys, third one
| specifies the probability that a key is in present in the list.
|
| Regards,
| Jonas
|
| ##Benchmark results
|> lookup(10000, 1000, 0.1)
| test elapsed relative
| 2 f(L, keys) 0.641 160.25
| 3 g(L, keys) 0.005 1.25
| 1 keys %in% names(L) 0.004 1.00
|> lookup(10000, 1000, 1.0)
| test elapsed relative
| 2 f(L, keys) 0.643 160.75
| 3 g(L, keys) 0.004 1.00
| 1 keys %in% names(L) 0.004 1.00
|> lookup(100000, 100, 0.1)
| test elapsed relative
| 2 f(L, keys) 0.656 164
| 3 g(L, keys) 0.004 1
| 1 keys %in% names(L) 0.004 1
|> lookup(100000, 100, 1.0)
| test elapsed relative
| 2 f(L, keys) 0.654 163.5
| 3 g(L, keys) 0.004 1.0
| 1 keys %in% names(L) 0.004 1.0
|> lookup(100000, 1, 0.1)
| test elapsed relative
| 2 f(L, keys) 0.647 161.75
| 3 g(L, keys) 0.004 1.00
| 1 keys %in% names(L) 0.005 1.25
|> lookup(100000, 1, 1.0)
| test elapsed relative
| 2 f(L, keys) 0.647 161.75
| 3 g(L, keys) 0.004 1.00
| 1 keys %in% names(L) 0.005 1.25
|
|
| ## Code for benchmarks
| library(inline)
| library(rbenchmark)
| f<- cxxfunction(signature(ls="list", ts="character"), plugin="Rcpp", body='
| Rcpp::List lst(ls);
| Rcpp::CharacterVector nam = lst.names();
| std::vector<std::string> nm = Rcpp::as< std::vector< std::string> >
| (nam);
| int m = nam.size();
|
| Rcpp::CharacterVector tok(ts);
| std::vector<std::string> tk = Rcpp::as< std::vector< std::string> >
| (tok);
| int n = tok.size();
|
| Rcpp::LogicalVector log(n);
|
| for (int i=0; i<n; i++) { // look at all tokens
| log[i] = false; // assume
| false, but compare to all names
| for (int j=0; j<m; j++) {
| if(tk[i] == nm[j])
| log[i] = 1; // and note equality if we find
| it
| }
| }
| return log;
| ')
|
| g<- cxxfunction(signature(ls="list", ts="character"), plugin="Rcpp", body='
| Rcpp::List lst(ls);
| Rcpp::CharacterVector nam = lst.names();
|
| Rcpp::IntegerVector ind(Rf_match(nam,ts,0));
| Rcpp::LogicalVector log(ind);
| return log;
| ')
|
|
| lookup<-function(nList, nKeys, P) {
| nList<-10000
| nKeys<-1000
| P<-0.1
| keys<-as.character(floor(runif(nKeys, 1, nList/P+1)))
| L<-as.list(1:nList)
| names(L)<-as.character(1:nList)
| benchmark(keys %in% names(L), f(L, keys), g(L, keys), replications=10,
| columns=c("test", "elapsed", "relative"))
| }
|
| lookup(10000, 1000, 0.1)
| lookup(10000, 1000, 1.0)
| lookup(100000, 100, 0.1)
| lookup(100000, 100, 1.0)
| lookup(100000, 1, 0.1)
| lookup(100000, 1, 1.0)
|
|