On Thursday 09 July 2009 10:17:40 Ronald J Kimball wrote:
> On Wed, Jul 08, 2009 at 08:33:34PM -0400, Steve Tolkin wrote:
> > I want to search a sorted array to see which strings, if any, have
> > the same prefix as my search string.  For example, given the search
> > string "tempest" I want to find "tempests" and "tempestuous".  My
> > word list has about 250,000 words.  Even though only about 7500 will be
> > used to start a search, I think brute force would be too slow.  It would
> > take O(n*n) time.
> >
> > Because I believe in measuring performance I actually wrote the brute
> > force code.  It only uses 2500 strings to start the search, rather than
> > the 7500 I plan to use.  In fact it is very slow.  It is still running
> > 15 minutes later, and I can tell it is only half done.
>
> I'm very surprised it is so slow.  I wrote code to test the brute force
> approach and it searches ~250,000 words for 2500 strings in just 3 minutes.
>
> > Alternatively, can someone recommend another Perl module for prefix
> > searching on a moderately large set of strings, e.g. a trie, suffix
> > tree, suffix array, etc.  It must be reliable and easy to use.
>
> Another option would be a dictionary tree, in which each node is a single
> letter, so each word is represented by a path through the tree.  I don't
> know how suitable it would be for this purpose.  It is a cool structure
> though.  :)

Heh, yes, it's a really cool data structure, and it seems writing one is one of 
those rites of passage for every Perl programmer (templating system or markup 
parser, anyone?)...
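
For the curious, here is a minimal sketch of such a dictionary tree as nested 
hashes; the helper names (add_word, find_prefixed) are mine, not from any 
module:

use v5.10;

# Each node is a hash keyed by letter; an empty-string key marks
# the end of a word.
sub add_word {
    my ($trie, $word) = @_;
    my $node = $trie;
    $node = $node->{$_} //= {} for split //, $word;
    $node->{''} = 1;
}

# Walk down the prefix, then collect every word in the subtree below it.
sub find_prefixed {
    my ($trie, $prefix) = @_;
    my $node = $trie;
    for my $letter (split //, $prefix) {
        return unless $node = $node->{$letter};
    }
    my @found;
    my @queue = ([$prefix, $node]);
    while (my $item = shift @queue) {
        my ($sofar, $n) = @$item;
        push @found, $sofar if $n->{''};
        push @queue, [$sofar . $_, $n->{$_}] for grep { $_ ne '' } keys %$n;
    }
    return @found;
}

my %trie;
add_word(\%trie, $_) for qw(tempest tempests tempestuous temper);
say for sort find_prefixed(\%trie, 'tempest');
# tempest tempests tempestuous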

Since the Perl 5.10 regex engine automatically optimizes alternations of 
literal strings into tries, why not let Perl's regexes do the hard work for you?
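
(You can actually watch 5.10 do this with the re pragma: the debug dump of the 
compiled program should contain TRIE nodes for a literal alternation, e.g.

$ perl -Mre=debug -e '"tempests" =~ /tempest|temper|tender/' 2>&1 | grep -i trie

since the debugging output goes to stderr.)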

Here are timings for three algorithms to solve the problem Steve proposed 
(assume that the lists of target, search and found strings are in 
@targetWords, @searchWords and @foundWords):


For reference, the brute force algorithm:

#### Brute force
# Try every search prefix against every target word; stop at the first
# prefix that matches and move on to the next target.
T: foreach my $targetWord (@targetWords) {
    foreach my $searchWord (@searchWords) {
        if ($targetWord =~ /^$searchWord/) {
            push @foundWords, $targetWord;
            next T;
        }
    }
}
say $_ for @foundWords;


$ time ./search_prefixWords.pl wn-tmp/data.srtu wn-tmp/data.srtu.5chars > out-brute.stdout

real    21m54.758s
user    21m44.346s
sys     0m2.772s


Compile the search prefixes into a single trie-optimized regex:

#### Semi-brute force
# Compile all the prefixes into one alternation; 5.10 turns it into a trie,
# so each target word needs only a single match attempt.
my $searchWord_alt = join('|', @searchWords);
my $searchWord_re = qr/$searchWord_alt/;
foreach my $targetWord (@targetWords) {
    if ($targetWord =~ /^($searchWord_re)/) {
        push @foundWords, $targetWord;
    }
}
say $_ for @foundWords;


$ time ./search_prefixWords.pl wn-tmp/data.srtu wn-tmp/data.srtu.5chars > out-semibrute.stdout

real    0m5.296s
user    0m4.888s
sys     0m0.064s


Also compile the target words, this time into a single string, using '#' as a 
word-boundary marker:

#### Regex force
# Match the compiled alternation once, globally, against a single
# '#'-delimited string of all the target words.
my $searchWord_alt = join('|', @searchWords);
my $searchWord_re = qr/$searchWord_alt/;
my $targetWords = '#' . join('#', @targetWords) . '#';
@foundWords = $targetWords =~ /(?<=#)($searchWord_re[^#]*)/g;
@foundWords = grep { defined } @foundWords;   # see the note below
say $_ for @foundWords;

That grep is there because it turns out that for every found word, I got 5 
undefined list elements.  My best guess at why: m//g in list context returns 
every capture group for each match, with undef for groups that didn't 
participate, so if a handful of the search words contain literal parentheses, 
those would add capture groups to the alternation.  Escaping the words when 
joining (join '|', map { quotemeta } @searchWords) should avoid that.  In any 
case, even with the extra filtering step, the increase in speed is huge.
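
A tiny self-contained demonstration of that list-context behavior (toy 
pattern, just to show the mechanism):

use v5.10;

# With /g in list context, Perl returns *every* capture group for each
# match, filling in undef for any group that did not participate.
my @hits = '#foo#bar#' =~ /(?<=#)((f(o+))|bar)/g;

# the "foo" match yields ('foo', 'foo', 'oo');
# the "bar" match yields ('bar', undef, undef)
say scalar @hits;                 # 6

my @kept = grep { defined } @hits;
say scalar @kept;                 # 4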


$ time ./search_prefixWords.pl wn-tmp/data.srtu wn-tmp/data.srtu.5chars > out-regex.stdout

real    0m0.940s
user    0m0.816s
sys     0m0.068s


All times include reading the input files into lists (the scaffolding for that 
is sketched below, after the word counts).  For the target words, I used a 
sort-uniqued list of the nouns, verbs, adjectives and adverbs from WordNet 3.0 
(~88 kwords), and for the search prefixes, the subset of words in the target 
list that were 5 or fewer characters long (~7500 words):

$ wc wn-tmp/data.srtu*
  87633   87633 1006762 wn-tmp/data.srtu
   7477    7477   40785 wn-tmp/data.srtu.5chars
  95110   95110 1047547 total
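
For completeness, the scaffolding around each variant goes something like this 
(reconstructed; I haven't shown the full script here):

#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

# Slurp a word list, one word per line.
sub read_words {
    my ($file) = @_;
    open my $fh, '<', $file or die "Can't open $file: $!";
    chomp(my @words = <$fh>);
    return @words;
}

# Usage: ./search_prefixWords.pl targetfile searchfile
my @targetWords = read_words(shift @ARGV);
my @searchWords = read_words(shift @ARGV);
my @foundWords;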

So both the semi-brute and the regex variants should be O(n) in the target 
list size, but I suppose the latter has smaller constant factors.  Unless you 
need to recompile the target string or the search regex often, this should be 
pretty efficient, and the memory footprint should be pretty good too, if you 
undefine the lists after compiling them into the string and the regex:
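
That is, once $targetWords and $searchWord_re exist:

undef @targetWords;   # the joined string holds its own copy of the data
undef @searchWords;   # likewise for the compiled regex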


Bernardo

_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm
