So I'm following up on this as no matter what version I run I seem to come
up against this.

I've started noticed a few things.

Firstly, if I send message by SMTP using thunderbird the server can sit
there for a while (10, 20 seconds) at connected before it actually does
anything. It then accepts the message really quickly and then sits there
for a while again.

I then spotted something in the logs:

2016-10-03 08:09:38 [Main_Thread] Info: Main_Thread freed by interrupted
Worker_5 in 31.754 seconds - got (ok)
2016-10-03 08:43:38 [Main_Thread] Info: Main_Thread freed by idle Worker_2
in 31.700 seconds - got (ok)
2016-10-03 08:44:32 [Main_Thread] Info: Main_Thread freed by interrupted
Worker_5 in 31.757 seconds - got (ok)
2016-10-03 08:49:23 [Main_Thread] Info: Main_Thread freed by interrupted
Worker_9 in 31.792 seconds - got (ok)
2016-10-03 08:50:01 [Main_Thread] Info: Main_Thread freed by interrupted
Worker_7 in 31.622 seconds - got (ok)
2016-10-03 08:50:41 [Main_Thread] Info: Main_Thread freed by idle Worker_2
in 31.514 seconds - got (ok)
2016-10-03 08:53:12 [Main_Thread] Info: Main_Thread freed by interrupted
Worker_9 in 29.114 seconds - got (ok)
2016-10-03 08:56:32 [Main_Thread] Info: Main_Thread freed by interrupted
Worker_3 in 31.499 seconds - got (ok)
2016-10-03 08:58:18 [Main_Thread] Info: Main_Thread freed by idle Worker_10
in 31.365 seconds - got (ok)
2016-10-03 08:58:52 [Main_Thread] Info: Main_Thread freed by interrupted
Worker_9 in 31.988 seconds - got (ok)
2016-10-03 09:22:05 [Main_Thread] Info: Main_Thread freed by idle Worker_1
in 31.164 seconds - got (ok)
2016-10-03 09:22:39 [Main_Thread] Info: Main_Thread freed by idle Worker_6
in 30.599 seconds - got (ok)

I actually spotted the idle workers first. Surely if they are idle they
shouldn't take 30 seconds to pick up a connection from the Main Thread?
Then whilst looking at the logs to see what happened to the thread before I
found the interrupted Workers as well. I actually dropped this server back
to 7 workers during the day yesterday but that doesn't seem to have made a
difference.

The big thing here is that they are mostly around 30s. There are some
quicker and some longer. These happen throughout the day whilst I am seeing
the no running worker errors - for example yesterday these were logged
between 08:02 and 18:22.

The first shut down from no running workers was 09:32 and the last 18:25.

If I examine the first one, Worker 5 I read that as 08:09:38 - 32s 08:09:06

At 08:09:05 Worker 5 I see:

2016-10-03 08:09:05 [Worker_5] Info: Worker_5 got connection from MainThread

This is a 29.38kB email from gmail. It finishes this message at 08:09:19

2016-10-03 08:09:19 [Worker_5] Disconnected: session:7FEDCB20ACD8
74.125.82.50 - command list was 'EHLO,STARTTLS,EHLO,MAIL FROM,RCPT
TO,DATA,QUIT' - used 30 SocketCalls - processing time 14 seconds - damped 0
seconds

There is nothing from Worker 5 until:

2016-10-03 08:09:38 [Worker_5] Info: Worker_5 got connection from MainThread
2016-10-03 08:09:38 [Main_Thread] Info: Main_Thread freed by interrupted
Worker_5 in 31.754 seconds - got (ok)

If I go back up in the logs, all previous connections to Worker 5 have been
completed and closed.

So I can't make sense of this. The connection at 08:09:05 is passed off to
Worker_5 so surely the MainThread should be freed straight away. It is
almost like the Workers aren't freeing MainThread and MainThread is timing
them out after 30 seconds.

>From my understanding that would explain why Main_Thread can't accept any
more connections and explain a lot of the delays that people have
complained about.

Have I understood correctly or have I missed something obvious?


On Fri, Sep 30, 2016 at 3:25 PM, cw <colin.war...@gmail.com> wrote:

> Hi,
>
> For a while now I've had stability problems with ASSP. It has generally
> been one or two restarts a day. When I upgraded to 16270 I had huge
> problems with delayed mail, mail not getting through at all and ASSP
> continually shutting down each time there were thousands of "unable to
> detect any running worker" spewed into the log files. Yesterday I dropped
> back to 16256 and at least mail is flowing but so far I've had one ASSP
> instance shut itself off 8 times today.
>
> I've two instances doing this, both on Ubuntu 14.04.
>
> Last night I set debugging on and caught the incident on both servers
> within half an hour. I've looked through the debug file but there is
> nothing I can see to indicate any errors. Both servers were handling mail
> from different senders in the few minutes leading up to the fault.
>
> So I looked back through previous threads on the same issue today and saw
> a Thomas ask what the worker status page showed when it happens. I was
> wondering how on earth I was going to catch it when it happens and before
> it reboots then lo and behold whilst I was on the web interface I saw the
> errors flying past in a tail of the maillog.
>
> I went on the web interface and the dot at the top had turned red. I then
> went on the worker status page and that was all green.
>
> Up until now, I have been running 10 workers which is possibly overkill. I
> had just reduced this instance of ASSP to 5 workers as a test. Status of
> the workers is:
> 1,2,3,5 - ThreadGetNewCon with loop age 0s (worker 3 had 1s)
> 4 - Maillog
> 10000 - MonitorMainThread (0s)
> 10001 - schedule waiting (71s)
>
> I went back to the main page and the dot had gone back to green but the
> maillog was still filling with the running worker errors.
>
> I refreshed the status and the only changes were:
> 4 - "wh:0 - write: - wait: 0.005"
> The time on schedule waiting went up to 96s.
>
> Shortly after ASSP Shut down. It is like the main thread and the workers
> just stop talking to each other.
>
> I'd love to crack this and give the latest development version a go
> because right now I have the annoying issue of an SSL session taking so
> long that the sending server starts sending it again (This is
> smtproutes.com not gmail in this case). I've also seen this from Mandrill
> and got in touch with their support. They explained that it was down to
> shared spools. One server behind the infrastructure picks up the message
> and starts delivering it. 10 minutes later it is still there so another
> server picks it up and starts delivering it. Whichever completes first
> removes the file and the other servers terminate.
>
> All the best,
> Colin.
>
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most 
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
Assp-test mailing list
Assp-test@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/assp-test

Reply via email to