Dear Duncan,

This is a model of the data I work with.

database <- replicate(50000, paste(sample(letters, rexp(1, 1/500), rep=TRUE), collapse=""))
words <- replicate(10000, paste(sample(letters, rexp(1, 1/70), rep=TRUE), collapse=""))
NumberOfWords <- 10
system.time(lapply(words[1:NumberOfWords], grep, database))
   user  system elapsed
  5.002   0.003   5.005

The model reproduces the running times I have to cope with.

Using grep in this context is rather naive, and I am wondering whether there
are better solutions available in R.

On 3 August 2015 at 15:13, Duncan Murdoch <murdoch.dun...@gmail.com> wrote:

> On 03/08/2015 5:25 AM, Witold E Wolski wrote:
> > I have a database of text documents (letter sequences). Several thousands
> > of documents with approx. 1000-2000 letters each.
> >
> > I need to find exact matches of short 3-15 letter sequences in those
> > documents.
> >
> > Without any regexp patterns, the search for one 3-15 letter "word" takes
> > on the order of 1 s.
> >
> > So for a database with several thousand documents it's on the order of
> > hours.
> > The naive approach would be to use mcmapply, but then on standard
> > hardware I am still in the same order, and since R is an interactive
> > programming environment this isn't a solution I would go for.
> >
> > But aren't there faster algorithmic solutions? Can anyone please point me
> > to an implementation available in R?
>
> You haven't shown us what you did, but it sounds far slower than I'd
> expect.
> I just used the code below to set up a database of 10000
> documents of 2000 letters each, and searching those documents for "abc"
> takes about 70 milliseconds:
>
> database <- replicate(10000, paste(sample(letters, 2000, rep=TRUE), collapse=""))
>
> grep("abc", database, fixed=TRUE)
>
> Duncan Murdoch

-- 
Witold Eryk Wolski
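
P.S. For reference, a minimal sketch of how the fixed=TRUE argument from the quoted reply could be passed through lapply for a model like the one above. The seed, the reduced database size and the 3-15 letter word lengths are assumptions chosen only to keep the example small and reproducible; timings are not shown because they will differ by machine.

```r
set.seed(1)  # assumption: seed added only to make the sketch reproducible

# Smaller stand-in for the model data (sizes are illustrative assumptions)
database <- replicate(1000, paste(sample(letters, 1500, replace = TRUE), collapse = ""))
words <- replicate(10, paste(sample(letters, sample(3:15, 1), replace = TRUE), collapse = ""))

# Default grep treats each word as a regular expression
hits_regex <- lapply(words, grep, database)

# fixed = TRUE is forwarded by lapply to grep and requests literal matching
hits_fixed <- lapply(words, grep, database, fixed = TRUE)

# The words contain only lowercase letters, so both searches must agree
stopifnot(identical(hits_regex, hits_fixed))

# Compare the two variants on the local machine
system.time(lapply(words, grep, database))
system.time(lapply(words, grep, database, fixed = TRUE))
```

Because the search words contain no regular-expression metacharacters, the two calls return the same document indices; the only difference is that fixed=TRUE matches the pattern literally instead of going through the regex engine.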