Take a look at the 'data.table' package. The session below times the lookup using data frames, matrices and data.tables: the keyed data.table lookups take less than 0.1 seconds of user time (under 0.2 seconds elapsed).
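If you want to try this without the original data, here is a minimal sketch of the same comparison. The synthetic data, the helper rand_key(), the names wanted, res.df, x.dt and res.dt, the reduced row count and the use of character columns instead of factors are all assumptions for illustration; they are not the data or the exact calls timed in the session below, so the timings will differ.

```r
library(data.table)

set.seed(1)      # arbitrary seed, only so the sketch is repeatable
n <- 100000L     # assumed size, far smaller than the 2.5 million rows timed below

# Hypothetical stand-in for the real data: random 4-letter uppercase keys.
rand_key <- function(n) {
  paste0(sample(LETTERS, n, TRUE), sample(LETTERS, n, TRUE),
         sample(LETTERS, n, TRUE), sample(LETTERS, n, TRUE))
}
x.df <- data.frame(x = rand_key(n), x.1 = rand_key(n),
                   x.2 = rand_key(n), x.3 = rand_key(n),
                   stringsAsFactors = FALSE)

wanted <- x.df$x[1]   # a key that is known to be present

# Data-frame lookup in the style of the original question:
# build a logical vector by scanning the column, then subset the rows.
system.time(res.df <- x.df[x.df[["x"]] %in% wanted, ])

# Keyed data.table lookup: setkey() sorts the table by x once, and
# lookups on the key then use a binary search instead of a full scan.
x.dt <- as.data.table(x.df)
setkey(x.dt, x)
system.time(res.dt <- x.dt[.(wanted)])

# Both approaches should select the same number of rows.
stopifnot(nrow(res.df) == nrow(res.dt))
```

Note that setkey() has a one-time sorting cost, so the keyed approach pays off when many lookups are made against the same table.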

> str(x.df)
'data.frame':   2500000 obs. of  4 variables:
 $ x  : Factor w/ 455063 levels "AAAA","AAAB",..: 200683 388992 241029 305994 209907 112469 105656 233058 247529 416273 ...
 $ x.1: Factor w/ 455063 levels "AAAA","AAAB",..: 200683 388992 241029 305994 209907 112469 105656 233058 247529 416273 ...
 $ x.2: Factor w/ 455063 levels "AAAA","AAAB",..: 200683 388992 241029 305994 209907 112469 105656 233058 247529 416273 ...
 $ x.3: Factor w/ 455063 levels "AAAA","AAAB",..: 200683 388992 241029 305994 209907 112469 105656 233058 247529 416273 ...
> system.time(a <- x.df[[1]] %in% "AAAA")
   user  system elapsed
   0.33    0.00    0.39
> x.m <- as.matrix(x.df)
> str(x.m)
 chr [1:2500000, 1:4] "LMDC" "WFXC" "NUBQ" "RMOK" "LZVR" "GLCE" "GAZE" "NIFT" ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:4] "x" "x.1" "x.2" "x.3"
> system.time(a <- x.m[,1] %in% "AAAA")
   user  system elapsed
   0.50    0.00    0.51
> require(data.table)
> x.df <- data.table(x.df)
> setkey(x.df, x)
> system.time(a <- x.df["AAAA"])
   user  system elapsed
   0.05    0.03    0.13
> str(a)
Classes ‘data.table’ and 'data.frame':  7 obs. of  4 variables:
 $ x  : Factor w/ 1 level "AAAA": 1 1 1 1 1 1 1
 $ x.1: Factor w/ 455063 levels "AAAA","AAAB",..: 1 1 1 1 1 1 1
 $ x.2: Factor w/ 455063 levels "AAAA","AAAB",..: 1 1 1 1 1 1 1
 $ x.3: Factor w/ 455063 levels "AAAA","AAAB",..: 1 1 1 1 1 1 1
 - attr(*, "sorted")= chr "x"
> system.time(x.df["ABCD"])
   user  system elapsed
   0.08    0.02    0.16
>

On Tue, Nov 22, 2011 at 2:01 PM, TimothyDalbey <tmdal...@gmail.com> wrote:
> Hey All,
>
> So - I promise to write a blog post on this topic and post it somewhere on
> the internet once I get to the bottom of this. Basically, the set-up to the
> problem is like this:
>
> 1. I have a data frame with dim (2547290, 4)
> 2. I need to make SQL like lookups on the dataframe. I have been using the
> following sort of syntax:
>
> a.dataframe[a.dataframe[[column_index]] %in% some_value, ]
>
> 3. This process takes quite a lot of time (~2 seconds) on m1.small
> instances AMIs (AWS)
>
> So, I hope I can get that look-up/search logic quite a lot faster. I have
> heard that using matrices is the way to do it but I haven't found any
> resources on performing that sort of operation specifically that have
> yielded better results.
>
> Thought, feelings and advice are more than welcome.
>
> Best,
> TMD
>
> --
> View this message in context:
> http://r.789695.n4.nabble.com/Data-Frame-Search-Slow-tp4096906p4096906.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.