On Thursday 09 July 2009 17:17:40 Ronald J Kimball wrote: > On Wed, Jul 08, 2009 at 08:33:34PM -0400, Steve Tolkin wrote: > > I want to search a sorted array to see if which strings, if any, have > > the same prefix as my search string. For example, given the search > > string of "tempest" I want to find "tempests" and "tempestuous". My > > word list has about 250,000 words. Even though only about 7500 will be > > used to start as search I think brute force would be too slow. It would > > take O (n*n) time. > > > > Because I believe in measuring performance I actually wrote the brute > > force code. It only uses 2500 strings to strart the search, rather than > > the 7500 I plan to use. In fact it is very slow. It is still running > > 15 minutes later, and I can tell it is only half done. > > I'm very surprised it is so slow. I wrote code to test the brute force > approach and it searches ~250,000 words for 2500 strings in just 3 minutes. > > > Alternatively, can someone recommend another perl module for prefix > > searching on a moderately large set of strings, e.g. trie or suffix tree > > or suffix array, etc.. It must be reliable and easy to use. > > Another option would be a dictionary tree, in which each node is a single > letter, so each word is represented by a path through the tree. I don't > know how suitable it would be for this purpose. It is a cool structure > though. :) >
This is called a Radix Tree, a Patricia Tree or a Trie: http://en.wikipedia.org/wiki/Radix_tree I have written a slight variation of this in Perl to parse the Freecell Solver command line arguments: http://xrl.us/be2h4g I could not have used a getopt implementation because the syntax of the program's command line processing is different from the GNU conventions. (i.e: "-asw" is different from "-a -s -w" and you cannot do "--key=value" only "-- key" "value"). What I did there was build a radix tree in memory in Perl and then generate C code using if's and case's, strcmp's etc. from it. You can find another implementation of a radix tree (this time as a C data- structure) in hspell, the open-source Hebrew speller: http://tech.groups.yahoo.com/group/hackers-il/message/3475 There is an article in the book "Best of the Perl Journal: Computer Science and Perl Programming" where its author mentions several methods he tried to do in Perl in order to locate all the stock market symbols of companies (e.g: "MSFT", "SUN", "IBM", etc.) in text so he can hyperlink them. If I were him, I would have just used a radix tree (in C or possibly in Perl). Regards, Shlomi Fish > Ronald > > _______________________________________________ > Boston-pm mailing list > [email protected] > http://mail.pm.org/mailman/listinfo/boston-pm -- ----------------------------------------------------------------- Shlomi Fish http://www.shlomifish.org/ Original Riddles - http://www.shlomifish.org/puzzles/ God gave us two eyes and ten fingers so we will type five times as much as we read. _______________________________________________ Boston-pm mailing list [email protected] http://mail.pm.org/mailman/listinfo/boston-pm

