[sniffer] Re: FW: [sniffer] Re: Sniffer 3.0 Froze Mail Server

2008-10-04 Thread Andy Schmidt
Hi Pete,

Well, I eliminated WeightGate for the time being, just to do my due
diligence.

Also, since there is a fixed-size buffer, I assume that actually LOWERING the
3rd number (the allocation for each non-interactive process) would allow for
MORE parallel processes to run (as long as the value is still large enough
to support each of the applications that rely on it).
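
The arithmetic behind that assumption can be sketched in a few lines. This is only an illustration: the pool size and the per-allocation sizes below are hypothetical placeholders, not values read from this server, and the model simply assumes that equal-sized allocations are carved out of one fixed pool.

```python
def max_allocations(pool_kb: int, per_allocation_kb: int) -> int:
    """Return how many equal-sized allocations fit into a fixed-size pool."""
    if per_allocation_kb <= 0:
        raise ValueError("per-allocation size must be positive")
    return pool_kb // per_allocation_kb


# Hypothetical numbers, chosen only to show the trend.
POOL_KB = 48 * 1024

for per_kb in (512, 256, 128):
    # Halving the per-allocation size doubles how many allocations fit.
    print(f"{per_kb} KB each -> {max_allocations(POOL_KB, per_kb)} allocations")
```

The trade-off is the one noted above: a smaller allocation only helps if every application that relies on it can still run within it.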

Of course, I assume the heap issue is in reality a SECONDARY problem (a
symptom of too many non-interactive tasks being launched and not
completing). Since the 'heap' space is finite, there is a hard limit on
how many processes can be in a wait state at the same time. The problem to
focus on is not the known, limited heap, but rather why these processes
were unable to complete, so that eventually too many of them were active
at once.

Best Regards,
Andy

From: Pete McNeil [mailto:[EMAIL PROTECTED] 
Sent: Saturday, October 04, 2008 10:07 PM
To: Andy Schmidt
Cc: [EMAIL PROTECTED]
Subject: Re: FW: [sniffer] Re: Sniffer 3.0 Froze Mail Server

 

Hello Andy,

 

Saturday, October 4, 2008, 9:22:39 PM, you wrote:

 


 

Hi Pete,

Here are the log files.

I can't tell you WHEN the problem was triggered. I was off site and was
alerted around noon that the SMTP service had become unresponsive. I assumed
it had crashed, but found it running. Thus I tried restarting the SMTP
service, but after shutting down, it wouldn't allow me to restart. That's
when I started looking a bit more closely.

Once I realized that I had all these SNFclient processes running, I checked
the event log to see if it would give me any clue. But since the errors had
been occurring for a while, my system event log had wrapped around, so I
couldn't tell when the problem actually started or how long it took from
then until the SMTP service became unresponsive.

This Imail server is a PowerEdge 2950, Quad CPU, 3GHz.

2 GB of RAM and normally using about 1.5 GB of virtual RAM and on weekends,
CPU load is usually below 10%.

When this was going on, I didn't pay close attention because I wasn't quite
sure yet what was going on and was trying to figure out how to get out of
it. But, based on the memory use graph, I would guess it had maxed out 4 GB
of virtual RAM, which eventually starved the SMTP service and prevented it
from accepting more connections. As soon as I flushed the command line
programs, the memory curve dropped very sharply by easily half.

Sorry - don't have anything more specific.

 

 

I've been watching your telemetry and I don't think the problem was
triggered by an ordinary overload. Your message rate is not high enough to
cause that -- SNFClients will only wait about 30 seconds or so at most if
they are unable to make contact -- even on the busiest systems.

 

The other thing that strikes me is that you had to kill a lot of
imailsrv.exe instances as well -- this is new and very different.

 

Once the mystery heap was exhausted I would expect SNFClient instances to
build up in a broken state (0x142) but there is no good reason for
imailsrv instances to build up that I can think of -- except maybe some kind
of list processing event? (IIRC, imailsrv is called to handle list
processing requests through an alias -- it's been a while).

 

I will check the SNF log to see if I can identify anything useful.

 

Thanks,

 

_M

 

-- 

Pete McNeil

Chief Scientist,

Arm Research Labs, LLC.



[sniffer] Re: FW: [sniffer] Re: Sniffer 3.0 Froze Mail Server

2008-10-04 Thread Pete McNeil




Hello Andy,

Saturday, October 4, 2008, 10:21:31 PM, you wrote:







Hi Pete,
Well, I eliminated WeightGate for the time being, just to do my due diligence.
Also, since there is a fixed-size buffer, I assume that actually LOWERING the 3rd number (the allocation for each non-interactive process) would allow for MORE parallel processes to run (as long as the value is still large enough to support each of the applications that rely on it).
Of course, I assume the heap issue is in reality a SECONDARY problem (a symptom of too many non-interactive tasks being launched and not completing). Since the heap space is finite, there is a hard limit on how many processes can be in a wait state at the same time. The problem to focus on is not the known, limited heap, but rather why these processes were unable to complete, so that eventually too many of them were active at once.





Indeed. Eliminating WeightGate might help with this because it means one fewer process per message.

I just did a search of errors in the SNF logs and didn't find anything unusual.

I was unable to pinpoint the time of the problem -- that will require a harder analysis of the data. Indications are that SNFServer didn't see any significant issues during the period covered by the two logs you sent. When clients talked to it they were served (according to the logs).

You're showing about 40 msg/minute on average.

According to a spot check of log entries, SNFServer finishes processing these in an unmeasurably short time (0 indicates < 15 ms for setup, read, scan, and response combined). Most of the log's performance metrics (the <p/> element) indicate s='0' and t='0' -- setup time in ms, and scan time in ms.

On occasion I see some nonzero t values - but nothing unusual (16, 47, 63, etc).

You probably don't need a lot of threads active on your system. If you have provided for a high number then you might consider reducing that number... Processing 1 message per second would exceed your average handily and doesn't take a lot of threads.
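
As a rough check on that, the average number of messages in flight is the arrival rate multiplied by the time each message spends in the system (Little's law). The sketch below uses the 40 msg/minute figure from above; the 0.015 s figure stands in for a scan that logs as 0 ms, and the 30 s figure is the approximate worst-case SNFClient wait mentioned earlier in the thread. The function name is made up for illustration.

```python
def average_in_flight(msgs_per_minute: float, seconds_per_message: float) -> float:
    """Little's law: mean concurrency = arrival rate * time in system."""
    return msgs_per_minute * seconds_per_message / 60.0


RATE = 40  # messages per minute, the average reported above

# A scan that completes within one timer tick (assumed 15 ms here).
print(average_in_flight(RATE, 0.015))

# Every client stalling for the full ~30 second wait.
print(average_in_flight(RATE, 30))
```

Even in the stalled case the steady-state figure stays modest at this message rate, which is why a large thread count should not be needed for average traffic; a sudden burst is a different matter.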

If for some reason you were hit with a large number of messages and they were all put to work in parallel, then that might have exhausted the heap.

The new SNF is much more efficient than the old one and so it would have more easily allowed this... Sometimes introducing a more efficient component into a system exposes problems that were hidden by the previous less efficient component -- the less efficient component may have masked the problem by artificially reducing or shaping throughput. When we see this kind of thing we call it a "lens effect" -- the newer component reshapes the dynamics of the system and brings previously unknown problems "into focus".

It's possible the heap problem you experienced was caused by a "lens effect" since the new SNF engine is more efficient and would naturally allow for more messages to be handled concurrently in a burst than the previous version would have allowed.

A theory -- the previous version would naturally be constrained by I/O contention since it would need to create, scan, modify, and remove job control files. This would naturally couple performance to other I/O intensive operations such as writing new messages to the spool etc. The new version does not have any of this overhead and so would allow for an unconstrained ramp-up of new instances -- that might lead to a higher number of concurrent tasks and cause heap exhaustion. Once the heap is exhausted, additional tasks build up in a failed and partially initialized state. This typically continues until the failed tasks are manually removed -- since none of them is ever properly initialized, none of the tasks can time out, fail, or shut down on their own.
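
The dynamics in this theory can be illustrated with a toy simulation. This is a sketch under stated assumptions, not a model of SNF, IMail, or Windows internals: time advances in abstract ticks, each healthy task holds one heap "slot" for a fixed number of ticks, and any task that starts while all slots are taken is counted as stuck and never exits on its own. All names and numbers are hypothetical.

```python
def simulate(arrivals_per_tick: int, service_ticks: int,
             heap_slots: int, ticks: int) -> tuple[int, int]:
    """Return (healthy tasks still running, stuck tasks) after `ticks` ticks."""
    running: list[int] = []  # remaining ticks for each healthy task
    stuck = 0                # tasks that failed to initialize; never cleaned up

    for _ in range(ticks):
        # Healthy tasks make progress; those on their last tick finish.
        running = [left - 1 for left in running if left > 1]

        for _ in range(arrivals_per_tick):
            if len(running) < heap_slots:
                running.append(service_ticks)  # got a slot, will complete
            else:
                stuck += 1                     # heap exhausted: stuck forever

    return len(running), stuck


# Steady trickle: the heap is never exhausted, nothing gets stuck.
print(simulate(arrivals_per_tick=1, service_ticks=2, heap_slots=10, ticks=100))

# Unconstrained burst: slots run out and stuck tasks keep accumulating.
print(simulate(arrivals_per_tick=5, service_ticks=4, heap_slots=10, ticks=50))
```

In this toy model the stuck count only ever grows, which mirrors the observation that the failed instances had to be removed by hand.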

Hope this helps,

_M



--
Pete McNeil
Chief Scientist,
Arm Research Labs, LLC.


