On Tue, 15 Feb 2005 13:11:21 -0800, Justin Mason <[EMAIL PROTECTED]> wrote:
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Yitzchak Scott-Thoennes writes:
On Tue, Feb 15, 2005 at 12:40:50PM -0800, Justin Mason wrote:
FWIW, this looks like it'd be excellent for SpamAssassin ;)
Maybe somebody with SpamAssassin could do some benchmarks?
[...]
I did however do some benchmarks on the perfomance of one particular regex in the SA suite. Its from SA::Bayes and looks like this:
($token =~ /^(?:a(?:nd|ny|ble|ll|re)| m(?:uch|ost|ade|ore|ail|ake|ailing|any|ailto)| t(?:his|he|ime|hrough|hat)| w(?:hy|here|ork|orld|ith|ithout|eb)| f(?:rom|or|ew)| e(?:ach|ven|mail)| o(?:ne|ff|nly|wn|ut)| n(?:ow|ot|eed)| s(?:uch|ame)| l(?:ook|ike|ong)| y(?:ou|our|ou're)| The|has|have|into|using|http|see|It's|it's| number|just|both|come|years|right|know|already| people|place|first|because| And|give|year|information|can)$/x);
As you can see this regex has been PreSuf'ed for a minor speedup. Run
times of this regex being used to search regcomp.c (a large file) in a
while /\b(RE)\b/ loop were prepatch: 21/s postpatch: 51/s. When the
pre-suf in the pattern is unrolled the times were prepatched:10/s postpatched:100/s. (Note this is the number of times to extract all
of the matches from the file.)
If you're benching, can you compare how your trie version (out)?performs regexps produced by Regexp::Optimizer and Regexp::Assemble on the above list of words?
Regexp::Assemble produces the following RE on the list of above words:
(?:m(?:a(?:il(?:ing|to)?|[dk]e)|o(?:re|st)|uch)| w(?:ith(?:out)?|h(?:ere|y)|or(?:ld|k)|eb)| a(?:l(?:ready|l)|(?:bl|r)e|n[dy])|t(?:h(?:rough|at|is|e)|ime)| i(?:n(?:formation|to)|t's)|o(?:n(?:ly|e)|ff|ut|wn)| y(?:ou(?:'re|r)?|ears?)|(?:p(?:eopl|lac)|giv)e| n(?:o[tw]|umber|eed)|f(?:irst|rom|ew|or)|l(?:o(?:ng|ok)|ike)| h(?:a(?:ve|s)|ttp)|s(?:(?:am|e)e|uch)|e(?:mail|ach|ven)| b(?:ecause|oth)|(?:righ|jus)t|c(?:ome|an)|using|It's|know|And)
Regexp::Optimizer produces:
(?:a(?:n[dy]|l(?:l|ready)|(?:bl|r)e)|m(?:o(?:st|re)| a(?:il(?:ing|to)?|[dk]e)|uch)|t(?:h(?:is|e|rough|at)|ime)| w(?:h(?:y|ere)|or(?:k|ld)|ith(?:out)?|eb)|f(?:rom|or|ew|irst)| e(?:ach|ven|mail)|o(?:n(?:e|ly)|ff|wn|ut)|n(?:o[wt]|eed|umber)| s(?:uch|(?:am|e)e)|l(?:o(?:ok|ng)|ike)|y(?:ou(?:r|'re)?|ears?)| h(?:a(?:s|ve)|ttp)|i(?:n(?:to|formation)|t's)|b(?:oth|ecause)| c(?:ome|an)|p(?:eopl|lac)e|using|It's|(?:jus|righ)t|know|And|give)
(word-wrapped to avoid mangling, should be on one line).
David
