[Bug 3776] [REVIEW] massive memory consumption in spamd

2004-10-17 Thread bugzilla-daemon
http://bugzilla.spamassassin.org/show_bug.cgi?id=3776

[EMAIL PROTECTED] changed:

   What|Removed |Added

 Status|REOPENED|RESOLVED
 Resolution||FIXED



--- Additional Comments From [EMAIL PROTECTED]  2004-10-17 11:25 ---
committed to 3.0 branch in rev 54970



--- You are receiving this mail because: ---
You are the assignee for the bug, or are watching the assignee.


[Bug 3776] [REVIEW] massive memory consumption in spamd

2004-10-17 Thread bugzilla-daemon
http://bugzilla.spamassassin.org/show_bug.cgi?id=3776





--- Additional Comments From [EMAIL PROTECTED]  2004-10-17 10:49 ---
+1



--- You are receiving this mail because: ---
You are the assignee for the bug, or are watching the assignee.


[Bug 3776] [REVIEW] massive memory consumption in spamd

2004-10-14 Thread bugzilla-daemon
http://bugzilla.spamassassin.org/show_bug.cgi?id=3776





--- Additional Comments From [EMAIL PROTECTED]  2004-10-13 23:36 ---
to be explicit: +1 from me.  I've applied the 3.1.0 version as r54771.

pity about the int array, but c'est la vie.  I think the length
limit will take care of the main part of the issue anyway...




--- You are receiving this mail because: ---
You are the assignee for the bug, or are watching the assignee.


[Bug 3776] [REVIEW] massive memory consumption in spamd

2004-10-14 Thread bugzilla-daemon
http://bugzilla.spamassassin.org/show_bug.cgi?id=3776

[EMAIL PROTECTED] changed:

   What|Removed |Added

Attachment #2447 is|0   |1
   obsolete||



--- Additional Comments From [EMAIL PROTECTED]  2004-10-13 21:00 ---
Created an attachment (id=2449)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=2449&action=view)
jm's alternative patch made on b3_0 branch

This is the same code as patch 2448 in the form of a patch for the b3_0 branch.


Please apply this and review for committing to 3.0.1. I think we can count this
as +1s from Justin and me already, correct?

I'll wait for that one more vote before I commit to the 3.1 branch, in case
there are any more changes from anyone.

Justin, I don't see a reasonable way of using an integer array. It really is a
hash table with a quarter million entries and key values up to 40 bits long.

In any case there is no good reason I see for not truncating the input down to
10,000 bytes, and I bet that makes it fast enough that there is no longer even
a reason to use an XS version.




--- You are receiving this mail because: ---
You are the assignee for the bug, or are watching the assignee.


[Bug 3776] [REVIEW] massive memory consumption in spamd

2004-10-13 Thread bugzilla-daemon
http://bugzilla.spamassassin.org/show_bug.cgi?id=3776





--- Additional Comments From [EMAIL PROTECTED]  2004-10-13 15:38 ---
(ps: integers *are* more efficient if you can use arrays with a sane length.  I
don't know if this code is amenable to that though.)



--- You are receiving this mail because: ---
You are the assignee for the bug, or are watching the assignee.


[Bug 3776] [REVIEW] massive memory consumption in spamd

2004-10-13 Thread bugzilla-daemon
http://bugzilla.spamassassin.org/show_bug.cgi?id=3776





--- Additional Comments From [EMAIL PROTECTED]  2004-10-13 15:37 ---
yep, the memory is generally freed once processing finishes -- freed to the perl
allocator.   However, that generally does not get returned to the OS; it remains
mapped in the process for future allocations.  So it's better to avoid
allocating too much if possible.

Each hash entry is about 15 bytes for the key, plus string length+1, 15 bytes
for the int value, and the entry overhead itself is about 15 bytes too iirc.  So
with a perfect allocator, that's about 12.5 megs.  then there's overhead from
allocating a lot of small structures; perl doesn't make them all contiguous, for
speed.  (perl is generally optimised for speed over memory.)

the "75MB" figure is probably including the overhead from the rest of
SpamAssassin, which is about 25-30MB iirc in current 3.0.0.

so in other words there's no leaking -- it's just an inefficient allocation
strategy for this situation, which then never gets freed.  it may get reused,
but in the meantime, it's not efficient to carry around that many pages, perl
may allocate new stuff all over that arena, and it looks bad in ps listings ;)




--- You are receiving this mail because: ---
You are the assignee for the bug, or are watching the assignee.


[Bug 3776] [REVIEW] massive memory consumption in spamd

2004-10-13 Thread bugzilla-daemon
http://bugzilla.spamassassin.org/show_bug.cgi?id=3776





--- Additional Comments From [EMAIL PROTECTED]  2004-10-13 14:25 ---
> hash keys are always converted to strings anyway

I expected that my Lisp instincts for optimization would not work here :-)

Well, does all the memory allocated in create_lm for its my %ngram and all the
strings used for most of its keys get freed when create_lm returns just the
array of the top 400 keys?

Actually, as I think about it, I don't see how the problem of this bug report
happens even if the memory is not freed at the end of create_lm. Create_lm is
only called once, in the child process, so even if it leaked memory, wouldn't
that be freed when the child finished processing the message? The hash table in
the example is only growing to around 250,000 entries with keys 1 to 5  bytes
long, and with integer values. The message body is less than 250,000 characters.
How does that translate into a spamd child blowing up past a 75Mb memory image?

It's great that empirically we see limiting the size of the message body passed
to create_lm fixes the problem, but I'm worried that there is still a memory
leak that we have just pushed down below our annoyance threshold without really
fixing it. Can you explain how the current code creates the problem that we see?

Sorry for pushing this, but I don't want to leave a problem of memory blowup to
a fix that seems to work for one possible cause without a full understanding of
how it all is working.




--- You are receiving this mail because: ---
You are the assignee for the bug, or are watching the assignee.


[Bug 3776] [REVIEW] massive memory consumption in spamd

2004-10-13 Thread bugzilla-daemon
http://bugzilla.spamassassin.org/show_bug.cgi?id=3776





--- Additional Comments From [EMAIL PROTECTED]  2004-10-13 11:53 ---
actually, hash keys are always converted to strings anyway, so it wouldn't make
any difference :(  using ints therefore wouldn't help.



--- You are receiving this mail because: ---
You are the assignee for the bug, or are watching the assignee.


[Bug 3776] [REVIEW] massive memory consumption in spamd

2004-10-13 Thread bugzilla-daemon
http://bugzilla.spamassassin.org/show_bug.cgi?id=3776





--- Additional Comments From [EMAIL PROTECTED]  2004-10-13 11:30 ---
I like that. I did ask for someone with more perl chops to look at the code :-)
I have to test it for myself before I vote on it, though, to be responsible.

While you are looking at this so closely, is there any savings to be had inside
the create_lm loop by using something else for the hash keys instead of building
a separate string for each one? Right now it takes each word from the split,
then it makes a string from that with a \000 added at the beginning and the end,
then it makes strings for the hash keys from every length 1, 2, 3, 4, and 5
substring of that word string. Would it be more efficient to use integers
instead of strings? Or use integers instead of strings for all but the 5
character substrings? I know the answer to these things are not always intuitive
because of what gets done in the perl runtime vs what has to be done in perl 
code.




--- You are receiving this mail because: ---
You are the assignee for the bug, or are watching the assignee.


[Bug 3776] [REVIEW] massive memory consumption in spamd

2004-10-13 Thread bugzilla-daemon
http://bugzilla.spamassassin.org/show_bug.cgi?id=3776





--- Additional Comments From [EMAIL PROTECTED]  2004-10-13 10:25 ---
Created an attachment (id=2448)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=2448&action=view)
alternative patch

FWIW, here's *another* alternative patch for the same issue. This avoids
copying the string multiple times, which could chew RAM for large messages, by
passing a ref to the string, and by truncating it further up the call stack (in
Mail/SpamAssassin/Message/Metadata).

This way, the string won't get copied between its assembly and the split() call
in create_lm().

As a bonus, it also removes the use of $_ in create_lm, since $_ is slower than
a "my" var; and it inlines the $non_word_characters regexp into the split call,
as that will then only be compiled once rather than once per split() call.  So
it should be faster, too.

I think I prefer this one ;)




--- You are receiving this mail because: ---
You are the assignee for the bug, or are watching the assignee.


[Bug 3776] [REVIEW] massive memory consumption in spamd

2004-10-13 Thread bugzilla-daemon
http://bugzilla.spamassassin.org/show_bug.cgi?id=3776

[EMAIL PROTECTED] changed:

   What|Removed |Added

Summary|massive memory consumption  |[REVIEW] massive memory
   |in spamd|consumption in spamd
   Target Milestone|Future  |3.0.1





--- You are receiving this mail because: ---
You are the assignee for the bug, or are watching the assignee.