Re: [Boston.pm] Q: giant-but-simple regex efficiency

Uri Guttman Fri, 04 Feb 2011 15:53:34 -0800

>>>>> "KS" == Kripa Sundar <[email protected]> writes:


  KS> I have a 900 Meg text file, containing random text.  I also have a list
  KS> of 6000 names (alphanumeric strings) that occur in the random text.
  KS> I need to tag a prefix on to each occurrence of each of these 6000
  KS> names.

  KS> My premise:
  KS> I believe a regex would give the simplest and most efficient algorithm.
  KS> If I am mistaken, I would be happy to learn.

  KS>   2: my $regex = join "|", @names;

that will likely kill your cpu. a big alternation can be very slow
because at every position in the text the engine may have to try each
alternative in turn, starting over from the first name in the list.

one trick would be to find a way to grab candidate names in a generic
way and check whether each one is a key in a hash of the names. without
data it is hard to show this in detail, but i will assume each name is
2-3 'words' in the text. the idea is to loop over the text's words and
keep the most recent 2-3 of them (a simple shift register using an array
works for this): push in each new word and shift out the oldest one.
then take that list of words (you could try the first 2 and then all 3
to cover most name combos) and look them up in the hash of names. if
found, edit the file in place and continue. you could read large blocks
of text from the file in an outer loop and keep a running buffer. this
is how i do it in File::ReadBackwards to get lines without knowing the
boundaries in advance.
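
a minimal sketch of the word-window lookup described above, not tested
against real data. assumptions that are mine and not from the original
question: names are 1-3 words joined by single spaces, words in the
text are separated by whitespace only (punctuation glued to a word will
defeat the lookup), and the sub name tag_names and the prefix 'TAG_'
are made up for illustration. it slides an index window over a split
list instead of a literal push/shift array, which makes it easy to
rebuild the text with its original spacing.

```perl
use strict;
use warnings;

# Hypothetical helper: prefix every occurrence of a known name in $text.
# $names is a hash ref whose keys are the names (words joined by one space).
sub tag_names {
    my ($text, $names, $prefix, $max_words) = @_;
    $max_words ||= 3;

    # Split on whitespace but keep the separators, so that joining the
    # list back together reproduces the original spacing. Words land on
    # even indexes, separators on odd ones.
    my @parts = split /(\s+)/, $text;

    my $i = 0;
    while ($i < @parts) {
        my $matched = 0;

        # Try the longest window first so a 3-word name beats a 1-word one.
        for my $n (reverse 1 .. $max_words) {
            my $last = $i + 2 * ($n - 1);
            next if $last > $#parts;

            my $candidate = join ' ', map { $parts[$i + 2 * $_] } 0 .. $n - 1;
            if (exists $names->{$candidate}) {
                $parts[$i] = $prefix . $parts[$i];
                $i = $last + 2;    # skip past the words just matched
                $matched = 1;
                last;
            }
        }
        $i += 2 unless $matched;   # no name starts here, move on one word
    }
    return join '', @parts;
}

my %names = ('alice smith' => 1, 'bob' => 1);
print tag_names('call alice smith or bob', \%names, 'TAG_'), "\n";
# prints: call TAG_alice smith or TAG_bob
```

each word costs at most $max_words hash lookups, so the work grows with
the size of the text and not with the number of names.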

so this technique would only scan the file one time and use a fast hash
for lookups. it could actually run in minutes or less if done
correctly.
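
the outer block-reading loop with a running buffer could look roughly
like this. it is only a sketch under my own assumptions: no name spans
a newline (so cutting the buffer at the last newline is safe), output
goes to a second filehandle instead of editing in place, and the sub
name process_in_blocks and the 1 MiB block size are arbitrary choices
for illustration. this is not the File::ReadBackwards code, just the
same general idea.

```perl
use strict;
use warnings;

# Hypothetical driver: stream $in to $out in large blocks, handing only
# complete lines to $process (a code ref that returns the edited text).
sub process_in_blocks {
    my ($in, $out, $process, $block_size) = @_;
    $block_size ||= 1 << 20;    # 1 MiB per read; an arbitrary choice

    my $buffer = '';
    while (read($in, my $block, $block_size)) {
        $buffer .= $block;

        # Everything up to the last newline is safe to process now;
        # the partial line after it stays in the running buffer.
        my $cut = rindex($buffer, "\n");
        next if $cut < 0;       # no complete line yet, keep reading

        # 4-arg substr removes the complete lines from $buffer and
        # returns them in one step.
        print {$out} $process->(substr($buffer, 0, $cut + 1, ''));
    }

    # Flush whatever is left after the final newline.
    print {$out} $process->($buffer) if length $buffer;
    return;
}
```

the callback is where the per-word hash lookup would go; the file is
still read exactly once, front to back.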

uri

-- 
Uri Guttman  ------  [email protected]  --------  http://www.sysarch.com --
-----  Perl Code Review , Architecture, Development, Training, Support ------
---------  Gourmet Hot Cocoa Mix  ----  http://bestfriendscocoa.com ---------

_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm
