https://bz.apache.org/SpamAssassin/show_bug.cgi?id=8315

            Bug ID: 8315
           Summary: BayesStore/SQL regression when using MySQL defaults
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: minor
          Priority: P2
         Component: Learner
          Assignee: dev@spamassassin.apache.org
          Reporter: d...@sr71.net
  Target Milestone: Undefined

Created attachment 6000
  --> https://bz.apache.org/SpamAssassin/attachment.cgi?id=6000&action=edit
fix the SQLite concat syntax

I noticed that my bayes learning was not working well. The biggest symptom was
lots of new tokens but very few hammy, neutral or spammy tokens.  For instance:

X-Spam-TokenSummary: Tokens: new, 177; hammy, 0; neutral, 1; spammy, 1.
X-Spam-TokenSummary: Tokens: new, 178; hammy, 0; neutral, 0; spammy, 2.
X-Spam-TokenSummary: Tokens: new, 104; hammy, 1; neutral, 1; spammy, 0.

I've been using the following configuration:

bayes_store_module           Mail::SpamAssassin::BayesStore::SQL

for a loooooooong time, probably 10+ years. A change in 2022[1] switched the
default SQL syntax to use "||" as the string concatenation operator.  That's
evidently fine in SQLite, but MySQL by default[2] treats "||" as logical OR.
As a result, the generated SQL returned a boolean value (0 or 1) instead of a
string for the token:

MariaDB [spamassassin]> SELECT SUBSTR(token || '     ', 1, 5), spam_count,
ham_count, atime from bayes_token limit 10;
+--------------------------------+------------+-----------+------------+
| SUBSTR(token || '     ', 1, 5) | spam_count | ham_count | atime      |
+--------------------------------+------------+-----------+------------+
| 0                              |          0 |         1 | 1696434003 |
| 0                              |          0 |         3 | 1696434018 |
| 0                              |          0 |         6 | 1696441099 |
| 0                              |          0 |         1 | 1696434008 |
| 0                              |          0 |         2 | 1696440870 |
| 0                              |          0 |         3 | 1696440394 |
| 0                              |          0 |         1 | 1696434011 |
| 0                              |          0 |         2 | 1696445725 |
| 0                              |          0 |         1 | 1696441419 |
| 0                              |          0 |         1 | 1696433986 |
+--------------------------------+------------+-----------+------------+

In other words, every token came back as either 0 or 1.
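
For comparison, here is a minimal sketch of what the same expression returns
where "||" really is concatenation. It uses Python's bundled sqlite3 module
with an in-memory table; the table layout and sample tokens are assumptions
made up for illustration, not the real SpamAssassin schema or data.

```python
import sqlite3


def padded_tokens(tokens):
    """Run the padded-SUBSTR expression from the report against SQLite.

    The table below is a hypothetical stand-in for bayes_token.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bayes_token ("
        "token TEXT, spam_count INTEGER, ham_count INTEGER, atime INTEGER)"
    )
    conn.executemany(
        "INSERT INTO bayes_token VALUES (?, 0, 1, 0)",
        [(t,) for t in tokens],
    )
    # In SQLite, || concatenates, so each token is padded with spaces
    # and then cut to exactly five characters.
    rows = conn.execute(
        "SELECT SUBSTR(token || '     ', 1, 5) FROM bayes_token ORDER BY rowid"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


sample = padded_tokens(["ab", "hello", "toolong"])
```

Each token comes back as a distinct five-character string here, which is what
the query is meant to produce. On MySQL or MariaDB without PIPES_AS_CONCAT,
the same expression yields the 0/1 values shown in the table above.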

Then this loop in SpamAssassin/Plugin/Bayes.pm:

  foreach my $tokendata (@{$tokensdata}) {
    ...
    my ($token, $tok_spam, $tok_ham, $atime) = @{$tokendata};
    $pw{$token} = {...
  }

would only see $token as "0" or "1". Because %pw is keyed on the token, it
could hold at *MOST* two entries, which explains the low token counts I see
coming out of the database.
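
The collapse can be sketched outside of SpamAssassin. The snippet below is a
Python stand-in for the Perl loop, not the actual Bayes.pm code: the dict
field names are hypothetical (the original elides the hash contents), the
first three rows are taken from the query output above, and the "1" row is
invented to show the second possible key.

```python
def build_token_map(tokensdata):
    """Key each row by its token, as the Perl loop does with %pw."""
    pw = {}
    for token, tok_spam, tok_ham, atime in tokensdata:
        # A later row with the same token overwrites the earlier one.
        pw[token] = {
            "spam_count": tok_spam,  # hypothetical field name
            "ham_count": tok_ham,    # hypothetical field name
            "atime": atime,          # hypothetical field name
        }
    return pw


rows = [
    ("0", 0, 1, 1696434003),
    ("0", 0, 3, 1696434018),
    ("0", 0, 6, 1696441099),
    ("1", 2, 0, 1696440000),  # invented row, for illustration only
]

pw = build_token_map(rows)
```

Four database rows end up as two entries, and only the last row seen for each
key survives.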

The issue can be worked around by using:

bayes_store_module           Mail::SpamAssassin::BayesStore::MySQL

but I think it should probably be fixed in case other folks are using plain
"SQL", not "MySQL". A totally untested patch is attached.

1. https://svn.apache.org/viewvc?view=revision&revision=1899738
2. https://dev.mysql.com/doc/refman/8.4/en/sql-mode.html#sqlmode_pipes_as_concat
