On 4/25/2014 6:38 PM, Sam Varshavchik wrote:
> Bowie Bailey writes:
>
>> This server is pretty low-volume, but I checked it anyway.  I didn't see
>> any errors about maximum number of connections.  In fact, yesterday
>> morning, I got notices about timeouts at 1:14, 1:15, and 1:16.  When I
>> checked the logs, there were a grand total of 37 smtp connections made
>> between 1:10 and 1:20 (courieresmtpd: started,ip=).  And 11 of those
>> were from my monitoring system.
>>
>> There was a burst of 14 connections from the same IP between 1:13:47 and
>> 1:14:01, but I wouldn't expect 14 connections to cause a problem.
>> Besides, I can't even find that much traffic in some of the other time
>> periods.
>>
>> Any other suggestions?
> Reverse DNS lookup and identd. Each connecting IP address gets a reverse DNS
> lookup, and an identd check. The focus would be on the host that's running
> your monitoring script.
>
> Do test reverse DNS lookups on the host's IP address. Doesn't matter if the
> IP address resolves, or not, just that the success/fail is immediate, and
> the DNS lookup doesn't hang.

Reverse DNS does not resolve, but it returns immediately.
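
To put a number on "immediately", the lookup can be timed from the mail
server itself. This is only a minimal Python sketch: time_reverse_lookup
is a made-up helper name, 192.0.2.10 is a placeholder address rather than
the real monitoring host, and it goes through the system resolver, which
may not behave exactly like Courier's own lookup.

```python
import socket
import time


def time_reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return (hostname_or_None, elapsed_seconds) for a reverse lookup of ip.

    `resolver` is injectable only so the helper can be exercised offline;
    by default it is the standard library's socket.gethostbyaddr.
    """
    start = time.monotonic()
    try:
        hostname = resolver(ip)[0]
    except (socket.herror, socket.gaierror):
        # No PTR record. That is fine as long as the failure came back fast.
        hostname = None
    elapsed = time.monotonic() - start
    return hostname, elapsed


# Example use (placeholder address, substitute the monitoring host's IP):
#   name, secs = time_reverse_lookup("192.0.2.10")
#   print(f"PTR={name!r} took {secs:.3f}s")
```

What matters here is the elapsed time, not whether a name comes back.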

> From the mail server telnet to the monitoring host's IP address, port 113.
> Doesn't matter whether you connect or get a connection failure. If the
> connection hangs, Courier will wait 30 seconds before timing out. That, in
> conjunction with the reverse DNS lookup, may be long enough for the
> monitoring script to complain.

I have disabled ident lookups on all services.  I went through that 
routine a long time ago.
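
For anyone who still has ident lookups enabled, the port 113 test Sam
describes can be timed instead of eyeballed in telnet. A rough Python
sketch, assuming nothing beyond the standard library; time_connect is a
hypothetical helper name, and the 35 second default is just a guess chosen
to sit above the 30 second wait mentioned above.

```python
import socket
import time


def time_connect(host, port=113, timeout=35.0):
    """Attempt a TCP connect and return (outcome, elapsed_seconds).

    outcome is one of 'connected', 'refused', 'timeout', or 'error'.
    A fast 'connected' or 'refused' is harmless; a slow 'timeout' is the
    case that would make each incoming connection stall.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "connected"
    except ConnectionRefusedError:
        outcome = "refused"
    except socket.timeout:
        outcome = "timeout"
    except OSError:
        outcome = "error"
    return outcome, time.monotonic() - start


# Example use (placeholder address, substitute the monitoring host's IP):
#   print(time_connect("192.0.2.10"))
```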

> Also, get some detail on the monitoring script, exactly what it does and how
> long it waits for whatever it's waiting for; that should be more productive
> than making random guesses about what it doesn't like.

The monitoring is being done by a Foundry ServerIron load balancer. For 
an SMTP port, what it does is open a connection and look for a greeting 
message with a 220 status.  I haven't been able to determine what the 
timeout is.
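
The check described above (connect, wait for a greeting, look for 220) is
easy to imitate from another machine, which gives a second opinion on
whether the server really goes quiet. The following is only a sketch of
that idea in Python, not a reproduction of what the ServerIron does
internally: probe_greeting and watch are made-up names, and the timeout,
interval, and "slow" threshold are arbitrary assumptions since the real
timeout is unknown.

```python
import socket
import time


def probe_greeting(host, port=25, timeout=10.0):
    """Connect, read the first line, and return (ok, elapsed, first_line).

    ok is True when the first line starts with "220". The timeout applies
    to the connect and to each read separately, so the total elapsed time
    can exceed it slightly.
    """
    start = time.monotonic()
    data = b""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # Read until the end of the first line (or a sane size limit).
            while b"\n" not in data and len(data) < 1024:
                chunk = sock.recv(1024)
                if not chunk:
                    break
                data += chunk
    except OSError as exc:
        return False, time.monotonic() - start, f"error: {exc}"
    elapsed = time.monotonic() - start
    first = data.split(b"\n", 1)[0]
    text = first.decode("ascii", errors="replace").rstrip("\r")
    return text.startswith("220"), elapsed, text


def watch(host, interval=5.0, slow=2.0):
    """Probe forever, printing only failed or slow greetings with a timestamp."""
    while True:
        ok, elapsed, text = probe_greeting(host)
        if not ok or elapsed > slow:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(f"{stamp} ok={ok} {elapsed:.2f}s {text}")
        time.sleep(interval)
```

Running watch() against the server leaves a timestamped record of slow or
failed greetings that can be lined up against the ServerIron's failure
notices. Note that each probe also shows up in the Courier logs as another
connection.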

Occasionally, the server seems to stop responding briefly.  The
downtime is generally around 5-10 seconds before the server is reported
back up again.

This may just be a quirk of the ServerIron.  I only had the old server 
running through it for a couple of weeks before I did the switchover, 
but I never saw this happen to the old server.  On the new server, it is 
*very* intermittent.  It did not fail at all over the weekend.  It just 
bugs me because everything on the server looks like it is running fine.  
Nothing is being logged anywhere except for the simple fact of the 
failure from the ServerIron, and I can't figure out how to get any more
information short of leaving Wireshark running and hoping to catch a
failure before the capture file grows too huge.

If you have any other ideas, let me know.  Otherwise, I'm just going to 
let it be unless I start noticing other problems.

-- 
Bowie

_______________________________________________
courier-users mailing list
courier-users@lists.sourceforge.net
Unsubscribe: https://lists.sourceforge.net/lists/listinfo/courier-users
