>>>>> "ST" == Steve Tolkin <[email protected]> writes:
ST> I want to do a binary search on a list of strings in memory. ST> Basically I want the bsearch functionality, also implemented in ST> Search::Dict. ST> I want to search a sorted array to see if which strings, if any, ST> have the same prefix as my search string. For example, given the ST> search string of "tempest" I want to find "tempests" and ST> "tempestuous". My word list has about 250,000 words. Even though ST> only about 7500 will be used to start as search I think brute ST> force would be too slow. It would take O (n*n) time. ST> Because I believe in measuring performance I actually wrote the ST> brute force code. It only uses 2500 strings to strart the search, ST> rather than the 7500 I plan to use. In fact it is very slow. It ST> is still running 15 minutes later, and I can tell it is only half ST> done. ST> Alternatively, can someone recommend another perl module for ST> prefix searching on a moderately large set of strings, e.g. trie ST> or suffix tree or suffix array, etc.. It must be reliable and ST> easy to use. i can think of several algorithms which would work. start with a simple binary search to locate where the prefix falls in the sorted list. then follow it by a simple linear scan to find all the words with matching prefixes. there is a Binary::Search module on cpan which does the search on any data you want as it uses a 'read' callback to get the next 'record'. you can do this on your array of strings pretty easily. and then scan from the location (<= the prefix) for matching words. implementing your own binary search is pretty easy as well. if you want more speed, here is one idea. find out the max length (N) of your prefixes and then scan the list grabbing from each word the leading strings from 1 to N in length. use those as keys in a hash with the values being a hash all the words that share that prefix. then finding the matching words is a single hash lookup followed by a keys call on another hash. this is the classic preindexing data for fast searches. it sucks up ram but it is instantaneous in getting your results. if your prefixes are all longer than 1 (or M) then you can index fewer total prefixes that way. so you have a choice of how fast you want to make this vs ram size. the binary search is O(log N) assuming presorted data as you said. the index is O(1) but requires more work and ram to start. uri -- Uri Guttman ------ [email protected] -------- http://www.sysarch.com -- ----- Perl Code Review , Architecture, Development, Training, Support ------ --------- Free Perl Training --- http://perlhunter.com/college.html --------- --------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com --------- _______________________________________________ Boston-pm mailing list [email protected] http://mail.pm.org/mailman/listinfo/boston-pm

