[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

bugzilla-daemon Tue, 26 May 2009 14:58:56 -0700

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119

--- Comment #7 from Karsten Bräckelmann <[email protected]>  2009-05-26 14:58:30 PST ---
Just had a quick look at attachment 4452 and the BodyEval tvd_vertical_words()
function, adding some noisy debugging love. The reason is quite simple -- the
space to non-space ratio doesn't exceed 9%, which is less than the default 10%
max.

This wasn't apparent from reading the code alone, without the debugging
output, though. I expected it to check the body line by line. However, it
actually checks the space ratio for *paragraphs* in the traditional UN*X sense:
a paragraph ends with *two* newlines.
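
To illustrate the difference between the two views of a body, here is a
minimal Python sketch. It is not the BodyEval code; split_paragraphs() and the
sample text are made up for illustration, and it only assumes that a paragraph
is a run of lines terminated by an empty line.

```python
import re

def split_paragraphs(body: str) -> list[str]:
    """Split text into paragraphs separated by one or more empty lines.

    Hypothetical helper for illustration only, not SpamAssassin code.
    """
    return [p for p in re.split(r"\n{2,}", body.strip("\n")) if p]

body = "first line\nsecond line\n\nnext paragraph\n"

# Line-by-line view: every physical line is its own unit.
print(body.splitlines())

# Paragraph view: adjacent lines are glued together until an empty line.
print(split_paragraphs(body))
```

Any per-unit measurement, such as a space ratio, is therefore taken over much
more text in the paragraph view than in the line view.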

This line for example would have a ratio of 18% on its own, still 13% with the
longish header-style prefix and no (munged?) linebreak.

  Over_to_maintainer_(via_the_GNATS_Auto_Assign_Tool)

The text being looked at is the entire paragraph, though, including all lines
immediately preceding or following without an empty line, resulting in 20/201,
or about 9%. One reason, and an explanation why it loves to hit on such
messages, is the very long words prefixing each line. Or, in other words:
There's not much real, human generated text there. Compare it to this very
paragraph...
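
The arithmetic can be reproduced with a small Python sketch. This is not the
tvd_vertical_words() implementation; percent() and space_ratio() are
hypothetical helpers. It assumes the ratio is spaces divided by non-space
characters, truncated to a whole percentage (which matches the 18% and 9%
figures above), and that the underscores in the quoted line stand for spaces.

```python
def percent(spaces: int, nonspaces: int) -> int:
    """Space to non-space ratio as a whole percentage, truncated.

    Returning 100 for text without any non-space character is an
    assumption made for this sketch.
    """
    if nonspaces == 0:
        return 100
    return 100 * spaces // nonspaces

def space_ratio(text: str) -> int:
    """Ratio for a piece of text, counting only ' ' as a space."""
    spaces = text.count(" ")
    return percent(spaces, len(text) - spaces)

# The quoted line, assuming the underscores were originally spaces.
line = "Over to maintainer (via the GNATS Auto Assign Tool)"
print(space_ratio(line))   # 8 spaces / 43 non-spaces -> 18

# The whole paragraph, using the counts given above.
print(percent(20, 201))    # 20 / 201 -> 9
```

With these assumptions the single line is well above a max of 10, while the
paragraph as a whole falls just under it.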

A quick and easy fix is to lower the max threshold (second argument) in
20_body_tests.cf, which currently reads:
  body TVD_SPACE_RATIO  eval:tvd_vertical_words('0','10')

However, given the idea is to identify lots of *vertical* words, I seriously
wonder if this used to work on actual *lines*, rather than whole paragraphs.
Theo?

