On Mon, Nov 9, 2009 at 9:22 PM, Matthew Sacks <matthewsacks1...@yahoo.com> wrote:
>
> I am unabashedly posting a quiz question I have about regular expressions:
>
> Looking for suggestions.
>
> I am thinking
>
> 1) make the set of regular expressions into one big expression?
> 2) search the search strings, somehow, for common substrings. "acme.org"
> would be an example. Each hit on acme.org would indicate a match on one of
> the original search strings?
> Comments invited.
>
>
> You have 100,000 strings which are regex patterns
> intended to match URLs (including hostname and possibly a URI). When you
> receive a message, you need to see if it matches at least one of these
> patterns.
>
> The naive approach would be to go through your list of
> patterns linearly and attempt a regex match on each one. Suggest an
> alternative that would be more efficient.
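For what it's worth, option 1) can be sketched in plain Perl by wrapping each pattern in a non-capturing group and joining them into one big alternation (the CPAN module Regexp::Assemble does this more cleverly, merging common prefixes into a trie). The three sample patterns below are made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-ins for the 100,000 real URL patterns.
my @patterns = (
    qr{^https?://[\w.-]*acme\.org/},
    qr{^https?://example\.com/cgi-bin/},
    qr{\.evil\.net/payload},
);

# Wrap each branch in (?:...) so alternation precedence stays correct,
# then compile the whole thing once.  A qr// object interpolates into
# a string as a valid (?^:...) subpattern, so this is safe.
my $big    = join '|', map { "(?:$_)" } @patterns;
my $big_re = qr/$big/;

my $url = 'http://www.acme.org/index.html';
print "matched\n" if $url =~ $big_re;
```

This trades 100,000 match attempts for one pass by a single (large) automaton; whether that wins in practice depends on how much the patterns share and on the regex engine's handling of wide alternations.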
For me, the question of optimization hinges on two questions: 1) is the goal to optimize a single match, or to optimize the *process*, and 2) will a single long-running program (a daemon, even) process many messages, or will the program be launched for each incoming message?

The branching approach is effective for a program that will execute separately for each incoming message, but assuming that the program will process many messages, and that the goal is to optimize the average, I would suggest the following strategy:

1) Compile all the regexes (use the qr// operator)
2) Store the regexes in a linked list indexed with an integer
3) Begin with the naive approach, recording hits (either in a separate hash or in another layer nested in the list)
4) Periodically reorder the list pointers based on the recorded hits, so that you are always matching against the most likely targets first.

Further optimizations: You may want to introduce aging in a long-running filter so that you are sure the currently-trending patterns always occupy the front of the list. You might introduce Shawn's branching tree to reduce start-up time for the first few (hundred? thousand?) matches, until you feel you have a large enough sample to be predictive. It might also be worth it--depending on the particular use case--to match everything against only the top few (dozen? hundred?) hits first, and then branch.

Branching optimization: If you go the tree route, either as a start-up measure or as a complete solution, consider breaking out the branches of the tree for simultaneous processing (multi-threading, forking). If you don't multi-thread, consider dynamically reordering the branches based on predictive data.

HTH,
--
jay

--------------------------------------------------
This email and attachment(s): [ ] blogable; [ x ] ask first; [ ] private and confidential

daggerquill [at] gmail [dot] com
http://www.tuaw.com http://www.downloadsquad.com http://www.engatiki.org

values of β will give rise to dom!
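The reorder-by-hits strategy above might look roughly like this; the sample patterns, the reorder interval, and the array-of-pairs representation (standing in for a true linked list) are all my own illustrative choices:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each filter is a [hit_count, compiled_regex] pair; hypothetical patterns.
my @filters = map { [ 0, qr/$_/ ] } (
    'acme\.org',
    'example\.com',
    'evil\.net',
);

my $seen          = 0;
my $reorder_every = 1000;    # hypothetical tuning knob

sub matches_any {
    my ($msg) = @_;

    # Periodically re-sort so the hottest patterns are tried first.
    if ( ++$seen % $reorder_every == 0 ) {
        @filters = sort { $b->[0] <=> $a->[0] } @filters;
    }

    # Naive linear scan, but over a list that converges on hit order.
    for my $f (@filters) {
        if ( $msg =~ $f->[1] ) {
            $f->[0]++;       # record the hit
            return 1;
        }
    }
    return 0;
}

print matches_any('http://www.example.com/foo') ? "hit\n" : "miss\n";
```

Aging could then be added by, say, halving every hit counter at each reorder, so patterns that were hot last week drift back down the list instead of pinning the front forever.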
--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/