http://bugzilla.spamassassin.org/show_bug.cgi?id=3776





------- Additional Comments From [EMAIL PROTECTED]  2004-10-09 14:59 -------
Dallas, I know that you had said that the entire message was being sent to
create_lm. I thought from Thunderbird's rendering of the message that would not
be true and expressed surprise when I saw that you were correct about that.

Thunderbird displays only that first picture as text in the message, and the
other MIME parts show up as malformed attached jpeg files.

Theo, when I said "it makes sense" I meant that the running time for create_lm
makes sense for an input of over 200K, given the other numbers I saw, not that
it makes sense for this message to produce an input that big.

I see several issues here:

1. TextCat behaves badly on "large" inputs that are still within the size
threshold we recommend for SpamAssassin, and it doesn't need that much real
text to get a reliable result (anything over 1K works very well, provided that
1K contains representative text in a modeled language). I therefore suggest
limiting its input to, say, 10000 bytes. Would someone who is more of a Perl
expert than I am comment on whether create_lm(substr($input, 0, 10000)) is the
proper way to do that?
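
For discussion, here is a minimal sketch of that truncation. The helper name
truncate_for_lm and the constant MAX_LM_INPUT are hypothetical, not existing
SpamAssassin code. Note that substr counts characters, which are the same as
bytes only if the string has not been decoded. The sketch also trims back to
the last whitespace so that a word cut in half does not add misleading
substrings to the counts.

```perl
use strict;
use warnings;

# Hypothetical cap; 10000 is the figure suggested above, not a tuned value.
use constant MAX_LM_INPUT => 10_000;

# Return at most MAX_LM_INPUT characters of $text. If the text had to be
# cut, drop the trailing partial word (and the whitespace before it).
sub truncate_for_lm {
    my ($text) = @_;
    return '' unless defined $text;
    return $text if length($text) <= MAX_LM_INPUT;

    my $head = substr($text, 0, MAX_LM_INPUT);
    # If the head contains no whitespace at all, this leaves it unchanged.
    $head =~ s/\s+\S*\z//;
    return $head;
}
```

The call would then become create_lm(truncate_for_lm($input)), assuming
create_lm accepts a plain string as in the call quoted above.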

2. The memory blowup is of more concern to me than the time it takes. The code
in create_lm is supposed to do the following:

For every "word" in the input, where a word is defined as a run of characters
delimited by digits and whitespace, count the occurrences of every substring
of length 1, 2, 3, 4, and 5 of the word, after wrapping the word in a start
and end marker of character \000.

Again, for the Perl experts: Is there a better way of getting all those
substrings into a hash table for counting without all the overhead of creating
all the temporary strings and sorting and so on that the current code does?
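
To make the question concrete, here is a sketch that codes the description
above directly. The function name count_ngrams is hypothetical, and this is
not the current create_lm code. It still creates one temporary string per
substring, so whether it is any cheaper than the existing code would have to
be measured.

```perl
use strict;
use warnings;

# Count every substring of length 1..5 of each "word" in $text. Words are
# delimited by runs of digits and whitespace, and each word is wrapped in
# "\000" markers before counting. Returns a hash ref: substring => count.
sub count_ngrams {
    my ($text) = @_;
    my %count;
    return \%count unless defined $text;

    # grep drops the empty leading field that split produces when the
    # text starts with a delimiter.
    for my $word (grep { length } split /[0-9\s]+/, $text) {
        my $marked = "\000" . $word . "\000";
        my $len    = length $marked;
        for my $start (0 .. $len - 1) {
            # Longest substring that still fits, capped at 5 characters.
            my $max = $len - $start;
            $max = 5 if $max > 5;
            $count{ substr($marked, $start, $_) }++ for 1 .. $max;
        }
    }
    return \%count;
}
```

For the input "ab 12 ab" this counts "ab" twice and the marked word
"\000ab\000" twice; nothing spanning the digits is counted.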

3. This is coming from a malformed message. But is there something we could do
to handle it better, so that SpamAssassin would not put all of the message into
what it thinks is the rendered message body? If this fools other MUAs, perhaps
there is nothing we can do, as we do need to duplicate the behavior of MUAs.
But what do MUAs do with this message? Thunderbird does not display all of it.
What about Outlook Express, Eudora, and some others? If in fact MUAs do not
display this entire message as text, then SpamAssassin should not be using it
all, no matter what Ripmime does.

4. We should see where the rest of the time is being spent in the processing of
the large body in case there is another optimization to do, as it still is quite
slow. It might just turn out that processing a 200K message body with our rules
does take several seconds, but it would be good to take a careful look.


