Re: [Boston.pm] binary search on a list of sorted strings in memory

Shlomi Fish Thu, 09 Jul 2009 07:43:49 -0700

On Thursday 09 July 2009 17:17:40 Ronald J Kimball wrote:
> On Wed, Jul 08, 2009 at 08:33:34PM -0400, Steve Tolkin wrote:
> > I want to search a sorted array to see if which strings, if any,  have
> > the same prefix as my search string.  For example, given the search
> > string of "tempest" I want to find "tempests" and "tempestuous".   My
> > word list has about 250,000 words.  Even though only about 7500 will be
> > used to start as search I think brute force would be too slow.  It would
> > take  O (n*n) time.
> >
> > Because I believe in measuring performance I actually wrote the brute
> > force code.  It only uses 2500 strings to strart the search, rather than
> > the 7500 I plan to use.  In fact it is very slow.   It is still running
> > 15 minutes later, and I can tell it is only half done.
>
> I'm very surprised it is so slow.  I wrote code to test the brute force
> approach and it searches ~250,000 words for 2500 strings in just 3 minutes.
>
> > Alternatively, can someone recommend  another perl module for prefix
> > searching on a moderately large set of strings, e.g. trie  or suffix tree
> > or suffix array, etc..  It must be reliable and easy to use.
>
> Another option would be a dictionary tree, in which each node is a single
> letter, so each word is represented by a path through the tree.  I don't
> know how suitable it would be for this purpose.  It is a cool structure
> though.  :)
>


This is called a Radix Tree, a Patricia Tree or a Trie:

http://en.wikipedia.org/wiki/Radix_tree

I have written a slight variation of this in Perl to parse the Freecell Solver 
command line arguments:

http://xrl.us/be2h4g

I could not have used a getopt implementation because the syntax of the 
program's command line processing is different from the GNU conventions. (i.e: 
"-asw" is different from "-a -s -w" and you cannot do "--key=value" only "--
key" "value"). What I did there was build a radix tree in memory in Perl and 
then generate C code using if's and case's, strcmp's etc. from it. 

You can find another implementation of a radix tree (this time as a C data-
structure) in hspell, the open-source Hebrew speller:

http://tech.groups.yahoo.com/group/hackers-il/message/3475

There is an article in the book "Best of the Perl Journal: Computer Science 
and Perl Programming" where its author mentions several methods he tried to do 
in Perl in order to locate all the stock market symbols of companies (e.g: 
"MSFT", "SUN", "IBM", etc.) in text so he can hyperlink them. If I were him, I 
would have just used a radix tree (in C or possibly in Perl).

Regards,

        Shlomi Fish

> Ronald
>
> _______________________________________________
> Boston-pm mailing list
> [email protected]
> http://mail.pm.org/mailman/listinfo/boston-pm

-- 
-----------------------------------------------------------------
Shlomi Fish       http://www.shlomifish.org/
Original Riddles - http://www.shlomifish.org/puzzles/

God gave us two eyes and ten fingers so we will type five times as much as we
read.

_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm

Re: [Boston.pm] binary search on a list of sorted strings in memory

Reply via email to