Re: [Boston.pm] binary search on a list of sorted strings in memory

Uri Guttman Wed, 08 Jul 2009 20:10:43 -0700

>>>>> "ST" == Steve Tolkin <[email protected]> writes:


  ST> I want to do a binary search on a list of strings in memory.
  ST> Basically I want the bsearch functionality, also implemented in
  ST> Search::Dict.

  ST> I want to search a sorted array to see if which strings, if any,
  ST> have the same prefix as my search string.  For example, given the
  ST> search string of "tempest" I want to find "tempests" and
  ST> "tempestuous".  My word list has about 250,000 words.  Even though
  ST> only about 7500 will be used to start as search I think brute
  ST> force would be too slow.  It would take O (n*n) time.

  ST> Because I believe in measuring performance I actually wrote the
  ST> brute force code.  It only uses 2500 strings to strart the search,
  ST> rather than the 7500 I plan to use.  In fact it is very slow.  It
  ST> is still running 15 minutes later, and I can tell it is only half
  ST> done.

  ST> Alternatively, can someone recommend another perl module for
  ST> prefix searching on a moderately large set of strings, e.g. trie
  ST> or suffix tree or suffix array, etc..  It must be reliable and
  ST> easy to use.

i can think of several algorithms which would work. start with a simple
binary search to locate where the prefix falls in the sorted list. then
follow it by a simple linear scan to find all the words with matching
prefixes. there is a Binary::Search module on cpan which does the search
on any data you want as it uses a 'read' callback to get the next
'record'. you can do this on your array of strings pretty easily. and
then scan from the location (<= the prefix) for matching
words. implementing your own binary search is pretty easy as well.

if you want more speed, here is one idea. find out the max length (N) of
your prefixes and then scan the list grabbing from each word the leading
strings from 1 to N in length. use those as keys in a hash with the
values being a hash all the words that share that prefix. then finding
the matching words is a single hash lookup followed by a keys call on
another hash. this is the classic preindexing data for fast searches. it
sucks up ram but it is instantaneous in getting your results. if your
prefixes are all longer than 1 (or M) then you can index fewer total
prefixes that way.

so you have a choice of how fast you want to make this vs ram size. the
binary search is O(log N) assuming presorted data as you said. the index
is O(1) but requires more work and ram to start.

uri

-- 
Uri Guttman  ------  [email protected]  --------  http://www.sysarch.com --
-----  Perl Code Review , Architecture, Development, Training, Support ------
--------- Free Perl Training --- http://perlhunter.com/college.html ---------
---------  Gourmet Hot Cocoa Mix  ----  http://bestfriendscocoa.com ---------

_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm

Re: [Boston.pm] binary search on a list of sorted strings in memory

Reply via email to