Take a look at the 'data.table' package.  Here are some timings for the
lookup using data frames, matrices and data.tables.  The keyed data.table
gives the answer in less than 0.1 seconds of user time (0.13 to 0.16
seconds elapsed).

> str(x.df)
'data.frame':   2500000 obs. of  4 variables:
 $ x  : Factor w/ 455063 levels "AAAA","AAAB",..: 200683 388992 241029 305994 209907 112469 105656 233058 247529 416273 ...
 $ x.1: Factor w/ 455063 levels "AAAA","AAAB",..: 200683 388992 241029 305994 209907 112469 105656 233058 247529 416273 ...
 $ x.2: Factor w/ 455063 levels "AAAA","AAAB",..: 200683 388992 241029 305994 209907 112469 105656 233058 247529 416273 ...
 $ x.3: Factor w/ 455063 levels "AAAA","AAAB",..: 200683 388992 241029 305994 209907 112469 105656 233058 247529 416273 ...
> system.time(a <- x.df[[1]] %in% "AAAA")
   user  system elapsed
   0.33    0.00    0.39
> x.m <- as.matrix(x.df)
> str(x.m)
 chr [1:2500000, 1:4] "LMDC" "WFXC" "NUBQ" "RMOK" "LZVR" "GLCE" "GAZE" "NIFT" ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:4] "x" "x.1" "x.2" "x.3"
> system.time(a <- x.m[,1] %in% "AAAA")
   user  system elapsed
   0.50    0.00    0.51
> require(data.table)
> x.df <- data.table(x.df)
> setkey(x.df, x)
> system.time(a <- x.df["AAAA"])
   user  system elapsed
   0.05    0.03    0.13
> str(a)
Classes ‘data.table’ and 'data.frame':  7 obs. of  4 variables:
 $ x  : Factor w/ 1 level "AAAA": 1 1 1 1 1 1 1
 $ x.1: Factor w/ 455063 levels "AAAA","AAAB",..: 1 1 1 1 1 1 1
 $ x.2: Factor w/ 455063 levels "AAAA","AAAB",..: 1 1 1 1 1 1 1
 $ x.3: Factor w/ 455063 levels "AAAA","AAAB",..: 1 1 1 1 1 1 1
 - attr(*, "sorted")= chr "x"
> system.time(x.df["ABCD"])
   user  system elapsed
   0.08    0.02    0.16
>
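
For anyone who wants to try this without the original data, here is a
small sketch of the same comparison.  The data are made up: rand_keys()
is a hypothetical helper, the table has only 100,000 rows and two
columns, and the columns are character rather than factor, so the
timings will not match the ones above.  It only illustrates the idea
that setkey() sorts the table once so that later lookups can use a
binary search instead of scanning every row.

```r
library(data.table)

set.seed(1)
n <- 1e5  # assumption: much smaller than the 2.5 million rows above

# Hypothetical helper: random four-letter keys such as "LMDC"
rand_keys <- function(n) {
  m <- matrix(sample(LETTERS, 4 * n, replace = TRUE), ncol = 4)
  paste0(m[, 1], m[, 2], m[, 3], m[, 4])
}

x.df <- data.frame(x = rand_keys(n), x.1 = rand_keys(n),
                   stringsAsFactors = FALSE)
x.df$x[1:7] <- "AAAA"  # make sure the key we look up is present

# Data frame scan: every row is compared against the value
hits.df <- x.df[x.df[["x"]] %in% "AAAA", ]

# Keyed data.table: setkey() sorts by x, so the lookup is a binary search
x.dt <- as.data.table(x.df)
setkey(x.dt, x)
hits.dt <- x.dt["AAAA"]

# Both approaches should find the same number of rows
nrow(hits.df) == nrow(hits.dt)

# A key that is not in the table returns one all-NA row by default;
# nomatch = 0L drops it, which is closer to an SQL-style lookup
nrow(x.dt["ZZZZZ", nomatch = 0L])
```

Wrapping either lookup in system.time(), as in the transcript above,
shows how the two compare on your own machine.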

On Tue, Nov 22, 2011 at 2:01 PM, TimothyDalbey <tmdal...@gmail.com> wrote:
> Hey All,
>
> So - I promise to write a blog post on this topic and post it somewhere on
> the internet once I get to the bottom of this.  Basically, the set-up to the
> problem is like this:
>
> 1.  I have a data frame with dim (2547290, 4)
> 2.  I need to make SQL-like lookups on the data frame.  I have been using the
> following sort of syntax:
>
> a.dataframe[a.dataframe[[column_index]] %in% some_value, ]
>
> 3.  This process takes quite a lot of time (~2 seconds) on m1.small
> instances (AWS)
>
> So, I hope I can get that look-up/search logic quite a lot faster.  I have
> heard that using matrices is the way to do it but I haven't found any
> resources on performing that sort of operation specifically that have
> yielded better results.
>
> Thoughts, feelings and advice are more than welcome.
>
> Best,
> TMD
>
> --
> View this message in context: 
> http://r.789695.n4.nabble.com/Data-Frame-Search-Slow-tp4096906p4096906.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
