On Thursday 09 July 2009 10:17:40 Ronald J Kimball wrote:
> On Wed, Jul 08, 2009 at 08:33:34PM -0400, Steve Tolkin wrote:
> > I want to search a sorted array to see which strings, if any, have
> > the same prefix as my search string. For example, given the search
> > string of "tempest" I want to find "tempests" and "tempestuous". My
> > word list has about 250,000 words. Even though only about 7,500 will be
> > used to start a search, I think brute force would be too slow. It would
> > take O(n*n) time.
> >
> > Because I believe in measuring performance I actually wrote the brute
> > force code. It only uses 2500 strings to start the search, rather than
> > the 7500 I plan to use. In fact it is very slow. It is still running
> > 15 minutes later, and I can tell it is only half done.
>
> I'm very surprised it is so slow. I wrote code to test the brute force
> approach and it searches ~250,000 words for 2500 strings in just 3 minutes.
>
> > Alternatively, can someone recommend another perl module for prefix
> > searching on a moderately large set of strings, e.g. a trie, suffix tree,
> > or suffix array, etc. It must be reliable and easy to use.
>
> Another option would be a dictionary tree, in which each node is a single
> letter, so each word is represented by a path through the tree. I don't
> know how suitable it would be for this purpose. It is a cool structure
> though. :)
Heh, yes, it's a really cool data structure, and it seems to be one of those
rites of passage for every Perl programmer (templating system or markup
parser, anyone?)...
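For the curious, here's a minimal sketch of that structure in Perl, using
nested hashes -- nothing from my actual script, and the word list is just an
example:

# Build a trie: each key is one letter; a '#' key marks a complete word.
my @words = qw(temper tempest tempests tempestuous);
my %trie;
for my $word (@words) {
    my $node = \%trie;
    $node = $node->{$_} //= {} for split //, $word;
    $node->{'#'} = 1;
}

# Walk down to the node for a prefix, then collect every word below it.
sub prefix_matches {
    my ($trie, $prefix) = @_;
    my $node = $trie;
    for my $letter (split //, $prefix) {
        $node = $node->{$letter} or return;   # prefix not in the trie
    }
    my (@matches, @stack);
    push @stack, [$node, $prefix];
    while (my $frame = pop @stack) {
        my ($n, $sofar) = @$frame;
        for my $key (keys %$n) {
            if   ($key eq '#') { push @matches, $sofar }
            else               { push @stack, [$n->{$key}, $sofar . $key] }
        }
    }
    return @matches;
}

say $_ for prefix_matches(\%trie, 'tempest');
# tempest, tempests, tempestuous (in hash order)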
Since in Perl 5.10 the regex engine automatically optimizes alternations of
literal strings into tries, why not let Perl's regexes do the hard work for you?
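You can actually watch that optimization happen: compile a literal
alternation under the re pragma's debug mode, and on 5.10 the dump should
show TRIE nodes (hedging a bit, since the exact output varies by version):

$ perl -Mre=debug -e 'qr/temper|tempest|tempests/' 2>&1 | grep -i trie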
Here are timings for three algorithms to solve the problem Steve proposed
(assume that the lists of target, search and found strings are in
@targetWords, @searchWords and @foundWords):
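The snippets below assume those lists are already loaded; a minimal harness
for that -- my guess at it, assuming one word per line in each input file --
could be:

#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

my ($targetFile, $searchFile) = @ARGV;

open my $tfh, '<', $targetFile or die "Can't open $targetFile: $!";
chomp(my @targetWords = <$tfh>);
close $tfh;

open my $sfh, '<', $searchFile or die "Can't open $searchFile: $!";
chomp(my @searchWords = <$sfh>);
close $sfh;

my @foundWords;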
For reference, the brute force algorithm:
#### Brute force
# Label the outer loop so the first prefix hit skips to the next target
# word. (Assumes the prefixes contain no regex metacharacters.)
T: foreach my $targetWord (@targetWords) {
    foreach my $searchWord (@searchWords) {
        if ($targetWord =~ /^$searchWord/) {
            push @foundWords, $targetWord;
            next T;
        }
    }
}
say $_ for @foundWords;
$ time ./search_prefixWords.pl wn-tmp/data.srtu wn-tmp/data.srtu.5chars > out-brute.stdout
real    21m54.758s
user    21m44.346s
sys     0m2.772s
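Part of that cost is that /^$searchWord/ interpolates a different pattern on
every inner iteration, so the regex is recompiled for each pair. A
brute-force variant using index() instead (an untested sketch) should cut
the constant factor considerably:

# Same double loop, but a prefix test is just "occurs at offset 0";
# no regex compilation in the inner loop at all.
T: foreach my $targetWord (@targetWords) {
    foreach my $searchWord (@searchWords) {
        if (index($targetWord, $searchWord) == 0) {
            push @foundWords, $targetWord;
            next T;
        }
    }
}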
Compile the search prefixes into a single trie regex:
#### Semi-brute force
# Build one regex from all the prefixes; on 5.10 the literal
# alternation is compiled into a trie.
my $searchWord_alt = join('|', @searchWords);
my $searchWord_re  = qr/$searchWord_alt/;
foreach my $targetWord (@targetWords) {
    if ($targetWord =~ /^($searchWord_re)/) {
        push @foundWords, $targetWord;
    }
}
say $_ for @foundWords;
$ time ./search_prefixWords.pl wn-tmp/data.srtu wn-tmp/data.srtu.5chars > out-semibrute.stdout
real    0m5.296s
user    0m4.888s
sys     0m0.064s
Also compile the target words, this time into a single string with word
boundaries ('#'):
#### Regex force
my $searchWord_alt = join('|', @searchWords);
my $searchWord_re  = qr/$searchWord_alt/;
# All the target words in one big string, '#'-delimited on both ends.
my $targetWords = '#' . join('#', @targetWords) . '#';
# A hit is a prefix right after a '#', extended up to the next '#'.
(@foundWords) = $targetWords =~ /(?<=#)($searchWord_re[^#]*)/g;
@foundWords = grep { defined } @foundWords;
say $_ for @foundWords;
That grep is there because, for every found word, the match handed back five
undefined list elements. My best guess (unverified) is that a few of the
search words contain parentheses -- WordNet tags some adjective entries with
markers like '(p)' -- and, interpolated unescaped, those become extra capture
groups that are undef on every match. A possible fix is sketched below. In
any case, even with that extra filtering step, the increase in speed is huge.
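If that theory is right, escaping the prefixes up front would both kill the
stray groups and keep the alternation purely literal, so the trie
optimization still applies; a two-line sketch:

# Hypothetical fix: quotemeta each prefix before joining, so no
# metacharacters (parentheses included) survive into the pattern.
my $searchWord_alt = join('|', map { quotemeta } @searchWords);
my $searchWord_re  = qr/$searchWord_alt/;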
$ time ./search_prefixWords.pl wn-tmp/data.srtu wn-tmp/data.srtu.5chars > out-regex.stdout
real    0m0.940s
user    0m0.816s
sys     0m0.068s
All times include reading the input files into lists. For target words, I used
a sort-uniqued list from WordNet 3.0 of the nouns, verbs, adjectives and
adverbs (~88 kwords), and for the search prefixes, the subset of words in the
target list that were 5 or fewer characters (~7500 words):
$ wc wn-tmp/data.srtu*
87633 87633 1006762 wn-tmp/data.srtu
7477 7477 40785 wn-tmp/data.srtu.5chars
95110 95110 1047547 total
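In case anyone wants to reproduce this, the 5-character subset can be cut
from the target list with a one-liner along these lines (paths as above,
otherwise hypothetical):

$ perl -nle 'print if length() <= 5' wn-tmp/data.srtu > wn-tmp/data.srtu.5chars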
So both the semi-brute and the regex variants should be O(n) in the target
list size: thanks to the trie, each lookup costs roughly the length of the
word rather than the number of prefixes. I suppose the regex variant just
has smaller constant factors. Unless you need to recompile the target string
or the search regex often, this should be pretty efficient, and the memory
footprint should be reasonable too (if you undefine the lists after
compiling them into the string and the regex).
Bernardo