https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6855

          Priority: P2
            Bug ID: 6855
          Assignee: [email protected]
           Summary: Rules with extremely poor CPU_time / mail_size ratio
          Severity: normal
    Classification: Unclassified
                OS: All
          Reporter: [email protected]
          Hardware: All
            Status: NEW
           Version: 3.4 SVN branch
         Component: Rules
           Product: Spamassassin

Here is another problem report referring to long SpamAssassin scan times
with larger mail. Unlike Bug 6854, which concentrates on code hotspots,
this one refers to problematic rules (and/or their implementation).
There may be other similar PRs elsewhere.

Attached is a png image of a plot. The scattergram shows dots for
each mail message being processed on our system for the last two days.
The x-axis is a mail size in bytes, the y-axis is a CPU usage time
in milliseconds used only by SpamAssassin, as reported by a getrusage(2)
system call (via Unix-Getrusage-0.03 perl module).

Initialization and startup times are excluded from the measurements,
and so are mail transfer times. Measuring the CPU time instead of
elapsed time also largely removes the effect of network rules.

What caught my eye is that the ratio of CPU time vs. mail size
does not fall on a narrow line, but forms a rather wide wedge.
While most samples fall on the lower side of the wedge (showing
a slope of about 355 kB/s with a fixed offset of about 0.4 s),
some samples fall near the upper side of the wedge, showing
an extremely wasteful linear slope of about 11.5 kB/s.
These samples are shown in blue on the diagram.

I managed to get hold of a couple of such extreme messages
which fell near the upper edge of the wedge. Luckily some
were addressed to me - like a SA cron job as sent to
"[email protected]" and a
"bad sandbox rules report". Similar were a couple of in-house
cron jobs or rsync reports. But not all extreme samples were
the result of automatic runs, there were also legitimate
googlegroups chats with long thread citations and some reports
to authors from a publishing company. Will attach two or three
of these.

Note these were not extremely large messages, here are sample
sizes of four of such cases: 187kB, 154kB, 73kB, 93kB.
Profiling these four with a NYTProf profiler in a command-line
spamassassin run shows similar results as can be obtained
through a HitFreqsRuleTiming.pm plugin.

Will attach the results. Here is a summary - a top few of
the most "CPU resource usage per mail size" wasteful rules
for each of these four (quite different) messages:

785ms  72_active.cf, rule __FILL_THIS_FORM_SHORT2, 
777ms  72_active.cf, rule __FILL_THIS_FORM_LONG2, 
299ms  72_active.cf, rule __FILL_THIS_FORM_FRAUD_PHISH1, 
216ms  72_active.cf, rule __FILL_THIS_FORM_LOAN1, 
128ms  72_active.cf, rule __FILL_THIS_FORM_SHORT1, 
126ms  72_active.cf, rule __FILL_THIS_FORM_LONG1, 
59.5ms 72_active.cf, rule __LOTSA_MONEY_03,

645ms  72_active.cf, rule __FILL_THIS_FORM_SHORT2, 
641ms  72_active.cf, rule __FILL_THIS_FORM_LONG2, 
179ms  72_active.cf, rule __FILL_THIS_FORM_LOAN1, 
145ms  72_active.cf, rule __FILL_THIS_FORM_SHORT1, 
141ms  72_active.cf, rule __FILL_THIS_FORM_LONG1, 
110ms  72_active.cf, rule __FILL_THIS_FORM_FRAUD_PHISH1, 
48.6ms 72_active.cf, rule __LOTSA_MONEY_03, 

219ms  72_active.cf, rule __FILL_THIS_FORM_SHORT2, 
212ms  72_active.cf, rule __FILL_THIS_FORM_LONG2, 
84.2ms 72_active.cf, rule __FILL_THIS_FORM_FRAUD_PHISH1, 
62.6ms 72_active.cf, rule __FILL_THIS_FORM_LOAN1, 
28.6ms 72_active.cf, rule __FILL_THIS_FORM_SHORT1, 
27.7ms 72_active.cf, rule __FILL_THIS_FORM_LONG1, 
15.7ms 72_active.cf, rule __LOTSA_MONEY_03, 

236ms  72_active.cf, rule __FILL_THIS_FORM_SHORT2, 
225ms  72_active.cf, rule __FILL_THIS_FORM_LONG2, 
89.9ms 72_active.cf, rule __FILL_THIS_FORM_FRAUD_PHISH1, 
62.9ms 72_active.cf, rule __FILL_THIS_FORM_LOAN1, 
34.9ms 72_active.cf, rule __FILL_THIS_FORM_LONG1, 
34.5ms 72_active.cf, rule __FILL_THIS_FORM_SHORT1, 
18.8ms 72_active.cf, rule T_LOTTO_URI, 
18.4ms 72_active.cf, rule __GAPPY_HTML_02, 
16.8ms 72_active.cf, rule __LOTSA_MONEY_03, 
15.9ms 72_active.cf, rule __LOTSA_MONEY_01, 

I'm not sure what can/should be done about it,
but people tell me it is not acceptable that it takes
30 seconds of intensive CPU spinning to process a 350 kB message
(on our AMD Opteron, not the very latest and greatest,
but still a decent hardware).

-- 
You are receiving this mail because:
You are the assignee for the bug.

Reply via email to