[Bug 3776] [REVIEW] massive memory consumption in spamd
http://bugzilla.spamassassin.org/show_bug.cgi?id=3776 [EMAIL PROTECTED] changed: What|Removed |Added Status|REOPENED|RESOLVED Resolution||FIXED --- Additional Comments From [EMAIL PROTECTED] 2004-10-17 11:25 --- committed to 3.0 branch in rev 54970 --- You are receiving this mail because: --- You are the assignee for the bug, or are watching the assignee.
[Bug 3776] [REVIEW] massive memory consumption in spamd
http://bugzilla.spamassassin.org/show_bug.cgi?id=3776 --- Additional Comments From [EMAIL PROTECTED] 2004-10-17 10:49 --- +1 --- You are receiving this mail because: --- You are the assignee for the bug, or are watching the assignee.
[Bug 3776] [REVIEW] massive memory consumption in spamd
http://bugzilla.spamassassin.org/show_bug.cgi?id=3776 --- Additional Comments From [EMAIL PROTECTED] 2004-10-13 23:36 --- to be explicit: +1 from me. I've applied the 3.1.0 version as r54771. pity about the int array, but c'est la vie. I think the length limit will take care of the main part of the issue anyway... --- You are receiving this mail because: --- You are the assignee for the bug, or are watching the assignee.
[Bug 3776] [REVIEW] massive memory consumption in spamd
http://bugzilla.spamassassin.org/show_bug.cgi?id=3776 [EMAIL PROTECTED] changed: What|Removed |Added Attachment #2447 is|0 |1 obsolete|| --- Additional Comments From [EMAIL PROTECTED] 2004-10-13 21:00 --- Created an attachment (id=2449) --> (http://bugzilla.spamassassin.org/attachment.cgi?id=2449&action=view) jm's alternative patch made on b3_0 branch This is the same code as patch 2448 in the form of a patch for the b3_0 branch. Please apply this and review for committing to 3.0.1. I think we can count this as +1s from Justin and me already, correct? I'll wait for that one more vote before I commit to the 3.1 branch, in case there are any more changes from anyone. Justin, I don't see a reasonable way of using an integer array. It really is a hash table with a quarter million entries and key values up to 40 bits long. In any case there is no good reason I see for not truncating the input down to 10,000 bytes, and I bet that makes it fast enough that there is no longer even a reason to use an XS version. --- You are receiving this mail because: --- You are the assignee for the bug, or are watching the assignee.
[Bug 3776] [REVIEW] massive memory consumption in spamd
http://bugzilla.spamassassin.org/show_bug.cgi?id=3776 --- Additional Comments From [EMAIL PROTECTED] 2004-10-13 15:38 --- (ps: integers *are* more efficient if you can use arrays with a sane length. I don't know if this code is amenable to that though.) --- You are receiving this mail because: --- You are the assignee for the bug, or are watching the assignee.
[Bug 3776] [REVIEW] massive memory consumption in spamd
http://bugzilla.spamassassin.org/show_bug.cgi?id=3776 --- Additional Comments From [EMAIL PROTECTED] 2004-10-13 15:37 --- yep, the memory is generally freed once processing finishes -- freed to the perl allocator. However, that generally does not get returned to the OS; it remains mapped in the process for future allocations. So it's better to avoid allocating too much if possible. Each hash entry is about 15 bytes for the key, plus string length+1, 15 bytes for the int value, and the entry overhead itself is about 15 bytes too iirc. So with a perfect allocator, that's about 12.5 megs. then there's overhead from allocating a lot of small structures; perl doesn't make them all contiguous, for speed. (perl is generally optimised for speed over memory.) the "75MB" figure is probably including the overhead from the rest of SpamAssassin, which is about 25-30MB iirc in current 3.0.0. so in other words there's no leaking -- it's just an inefficient allocation strategy for this situation, which then never gets freed. it may get reused, but in the meantime, it's not efficient to carry around that many pages, perl may allocate new stuff all over that arena, and it looks bad in ps listings ;) --- You are receiving this mail because: --- You are the assignee for the bug, or are watching the assignee.
[Bug 3776] [REVIEW] massive memory consumption in spamd
http://bugzilla.spamassassin.org/show_bug.cgi?id=3776 --- Additional Comments From [EMAIL PROTECTED] 2004-10-13 14:25 --- > hash keys are always converted to strings anyway I expected that my Lisp instincts for optimization would not work here :-) Well, does all the memory allocated in create_lm for its my %ngram and all the strings used for most of its keys get freed when create_lm returns just the array of the top 400 keys? Actually, as I think about it, I don't see how the problem of this bug report happens even if the memory is not freed at the end of create_lm. Create_lm is only called once, in the child process, so even if it leaked memory, wouldn't that be freed when the child finished processing the message? The hash table in the example is only growing to around 250,000 entries with keys 1 to 5 bytes long, and with integer values. The message body is less than 250,000 characters. How does that translate into a spamd child blowing up past a 75Mb memory image? It's great that empirically we see limiting the size of the message body passed to create_lm fixes the problem, but I'm worried that there is still a memory leak that we have just pushed down below our annoyance threshold without really fixing it. Can you explain how the current code creates the problem that we see? Sorry for pushing this, but I don't want to leave a problem of memory blowup to a fix that seems to work for one possible cause without a full understanding of how it all is working. --- You are receiving this mail because: --- You are the assignee for the bug, or are watching the assignee.
[Bug 3776] [REVIEW] massive memory consumption in spamd
http://bugzilla.spamassassin.org/show_bug.cgi?id=3776 --- Additional Comments From [EMAIL PROTECTED] 2004-10-13 11:53 --- actually, hash keys are always converted to strings anyway, so it wouldn't make any difference :( using ints therefore wouldn't help. --- You are receiving this mail because: --- You are the assignee for the bug, or are watching the assignee.
[Bug 3776] [REVIEW] massive memory consumption in spamd
http://bugzilla.spamassassin.org/show_bug.cgi?id=3776 --- Additional Comments From [EMAIL PROTECTED] 2004-10-13 11:30 --- I like that. I did ask for someone with more perl chops to look at the code :-) I have to test it for myself before I vote on it, though, to be responsible. While you are looking at this so closely, is there any savings to be had inside the create_lm loop by using something else for the hash keys instead of building a separate string for each one? Right now it takes each word from the split, then it makes a string from that with a \000 added at the beginning and the end, then it makes strings for the hash keys from every length 1, 2, 3, 4, and 5 substring of that word string. Would it be more efficient to use integers instead of strings? Or use integers instead of strings for all but the 5 character substrings? I know the answer to these things are not always intuitive because of what gets done in the perl runtime vs what has to be done in perl code. --- You are receiving this mail because: --- You are the assignee for the bug, or are watching the assignee.
[Bug 3776] [REVIEW] massive memory consumption in spamd
http://bugzilla.spamassassin.org/show_bug.cgi?id=3776 --- Additional Comments From [EMAIL PROTECTED] 2004-10-13 10:25 --- Created an attachment (id=2448) --> (http://bugzilla.spamassassin.org/attachment.cgi?id=2448&action=view) alternative patch FWIW, here's *another* alternative patch for the same issue. This avoids copying the string multiple times, which could chew RAM for large messages, by passing a ref to the string, and by truncating it further up the call stack (in Mail/SpamAssassin/Message/Metadata). This way, the string won't get copied between its assembly and the split() call in create_lm(). As a bonus, it also removes the use of $_ in create_lm, since $_ is slower than a "my" var; and it inlines the $non_word_characters regexp into the split call, as that will then only be compiled once rather than once per split() call. So it should be faster, too. I think I prefer this one ;) --- You are receiving this mail because: --- You are the assignee for the bug, or are watching the assignee.
[Bug 3776] [REVIEW] massive memory consumption in spamd
http://bugzilla.spamassassin.org/show_bug.cgi?id=3776 [EMAIL PROTECTED] changed: What|Removed |Added Summary|massive memory consumption |[REVIEW] massive memory |in spamd|consumption in spamd Target Milestone|Future |3.0.1 --- You are receiving this mail because: --- You are the assignee for the bug, or are watching the assignee.