On Mon, Nov 9, 2009 at 9:22 PM, Matthew Sacks <matthewsacks1...@yahoo.com> wrote:
>
> I am unabashedly posting a quiz question I have about regular expressions:
>
> Looking for suggestions.
>
> I am thinking
>
> 1) make the set of regular expressions into one big expression?
> 2) search the search strings, somehow, for common substrings. "acme.org"
> would be an example. Each hit on acme.org would indicate a match on one of
> the original search strings?
> Comments invited.
>
>
> You have 100,000 strings which are regex patterns
> intended to match URLs (including hostname and possibly a URI). When you
> receive a message, you need to see if it matches at least one of these
> patterns.
>
> The naive approach would be to go through your list of
> patterns linearly and attempt a regex match on each one. Suggest an
> alternative that would be more efficient.
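For what it's worth, option 1) can be sketched in plain Perl by wrapping each pattern in a non-capturing group and joining them into one big alternation (the CPAN module Regexp::Assemble does this more cleverly, merging common prefixes into a trie). The three sample patterns below are made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-ins for the 100,000 real URL patterns.
my @patterns = (
    qr{^https?://[\w.-]*acme\.org/},
    qr{^https?://example\.com/cgi-bin/},
    qr{\.evil\.net/payload},
);

# Wrap each branch in (?:...) so alternation precedence stays correct,
# then compile the whole thing once.  A qr// object interpolates into
# a string as a valid (?^:...) subpattern, so this is safe.
my $big    = join '|', map { "(?:$_)" } @patterns;
my $big_re = qr/$big/;

my $url = 'http://www.acme.org/index.html';
print "matched\n" if $url =~ $big_re;
```

This trades 100,000 match attempts for one pass by a single (large) automaton; whether that wins in practice depends on how much the patterns share and on the regex engine's handling of wide alternations.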
For me, the question of optimization hinges on two questions: 1) is the goal to optimize a single match, or to optimize the *process*, and 2) will a single long-running program (a daemon, even) process many messages, or will the program be launched for each incoming message?

The branching approach is effective for a program that will execute separately for each incoming message, but assuming that the program will process many messages, and that the goal is to optimize the average, I would suggest the following strategy:

1) Compile all the regexes (use the qr// operator)
2) Store the regexes in a linked list indexed with an integer
3) Begin with the naive approach, recording hits (either in a separate hash or in another layer nested in the list)
4) Periodically reorder the list pointers based on the recorded hits, so that you are always matching against the most likely targets first.

Further optimizations: You may want to introduce aging in a long-running filter so that you are sure the currently-trending patterns always occupy the front of the list. You might introduce Shawn's branching tree to reduce start-up time for the first few (hundred? thousand?) matches, until you feel you have a large enough sample to be predictive. It might also be worth it--depending on the particular use case--to match everything against only the top few (dozen? hundred?) hits first, and then branch.

Branching optimization: If you go the tree route, either as a start-up measure or as a complete solution, consider breaking out the branches of the tree for simultaneous processing (multi-threading, forking). If you don't multi-thread, consider dynamically reordering the branches based on predictive data.

HTH,
--
jay

--------------------------------------------------
This email and attachment(s): [ ] blogable; [ x ] ask first; [ ] private and confidential

daggerquill [at] gmail [dot] com
http://www.tuaw.com http://www.downloadsquad.com http://www.engatiki.org

values of β will give rise to dom!
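The reorder-by-hits strategy above might look roughly like this; the sample patterns, the reorder interval, and the array-of-pairs representation (standing in for a true linked list) are all my own illustrative choices:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each filter is a [hit_count, compiled_regex] pair; hypothetical patterns.
my @filters = map { [ 0, qr/$_/ ] } (
    'acme\.org',
    'example\.com',
    'evil\.net',
);

my $seen          = 0;
my $reorder_every = 1000;    # hypothetical tuning knob

sub matches_any {
    my ($msg) = @_;

    # Periodically re-sort so the hottest patterns are tried first.
    if ( ++$seen % $reorder_every == 0 ) {
        @filters = sort { $b->[0] <=> $a->[0] } @filters;
    }

    # Naive linear scan, but over a list that converges on hit order.
    for my $f (@filters) {
        if ( $msg =~ $f->[1] ) {
            $f->[0]++;       # record the hit
            return 1;
        }
    }
    return 0;
}

print matches_any('http://www.example.com/foo') ? "hit\n" : "miss\n";
```

Aging could then be added by, say, halving every hit counter at each reorder, so patterns that were hot last week drift back down the list instead of pinning the front forever.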
--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/