On Wed, May 15, 2013 at 11:44:17AM -0400, Wietse Venema wrote:


> We just don't want to dedicate too many mail delivery resources to
> the slowest messages.  Faster messages (or an approximate proxy:
> new mail) should be scheduled soon for delivery.  They should not
> have to wait at the end of the line.

Providing more capacity when we:

        - Have the memory resources.

        - Run no risk of creating excessive contention.
is unconditionally going to help, even when some of the processes
are reserved for new mail.  As I've said before, these approaches
are *composable*.  One can dynamically yield more slots to the slow
mail, and use separate transports for slow vs. fast mail.
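
For example, the separate-transport half needs no new code.  A
minimal sketch with stock Postfix (the "slow" transport name, the
example.com domain, and the limits below are illustrative, not
recommendations):

    /etc/postfix/master.cf:
        # clone of smtp(8), capped at 5 processes for known-slow sites
        slow      unix  -       -       n       -       5       smtp

    /etc/postfix/main.cf:
        transport_maps = hash:/etc/postfix/transport
        slow_destination_concurrency_limit = 2

    /etc/postfix/transport:
        example.com     slow:

Fast mail keeps the full default process limit of the regular smtp
transport, while the slow destinations compete only for the 5 "slow"
processes.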

We could have multiple deferred queues, where mail that took a
long time to deliver goes into a slow deferred queue, while
mail that was simply greylisted, ... goes into the regular
deferred queue, thus giving us better proxies for fast/slow.

There are many possible knobs.  Not letting slow mail accumulate
in the active queue by adding concurrency is one of them.  Tuning
retries is another...

> Now we could take advantage of the fact that in many cases the
> "slow" and "fast" messages cluster around different sites, thus
> their recipients will end up in different in-memory queues.  If
> there were feedback of fine-grained delivery agent latencies to
> qmgr(8), then we could rank nexthop destinations.  Not to starve
> slow mail, but only to ensure that slow mail does not starve new
> mail.

The queue manager tends to forget everything about a queue when it
becomes empty.  This is needed to conserve memory.  With intermittent
queue scans, if the blockages clear before the next queue scan, the
knowledge that a source is slow may be flushed.

It takes multiple dead destinations to create a problem, since the
initial destination concurrency is 5, and we don't raise it when
deliveries site-fail (as they typically do for the bogus destinations).

So it takes 20+ such destinations to saturate the default process
limit (100 delivery agents / 5 concurrent deliveries per dead
destination = 20 destinations).  If there are many, the queue
starvation happens before there is time for any feedback to rank
slow destinations.  In Patrick's case with made-up domains on
webforms, does the bogus mail in fact tend to have lots of instances
of the same invented destination domain?  Or are users sufficiently
inventive to make collisions rare?

In all probability, all we need to do is raise the maximal backoff
time from 4000s to ~4-5 hours, and advise the users in question to
reduce the maximal queue lifetime from 5 days to ~2.  This will
reduce the stream from the deferred queue to a trickle.
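
In main.cf terms (the exact values are illustrative, to be tuned
per site):

    # Retry the slowest mail roughly every 4 hours, not every 4000s
    maximal_backoff_time = 4h
    # Give up and bounce after 2 days instead of the default 5
    maximal_queue_lifetime = 2d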

It may also help to add a random number between 1 and minimal_backoff_time
to the next retry time of a deferred message (on top of the
exponential backoff).  This will help to diffuse clusters of deferred
mail.
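
A rough sketch of that idea in C (hypothetical code, not actual
Postfix source; the function and parameter names are invented, with
min_backoff/max_backoff standing in for minimal_backoff_time and
maximal_backoff_time):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /*
     * Hypothetical sketch: exponential backoff capped at max_backoff,
     * plus 1..min_backoff seconds of random jitter, so that messages
     * deferred in the same instant do not all retry in lockstep.
     */
    static time_t next_retry(time_t now, long queue_age,
                             long min_backoff, long max_backoff)
    {
        long    backoff = min_backoff;

        /* Double the delay as the message ages, up to the cap. */
        while (backoff < max_backoff && backoff <= queue_age)
            backoff *= 2;
        if (backoff > max_backoff)
            backoff = max_backoff;

        /* The proposed jitter: 1..min_backoff extra seconds. */
        return (now + backoff + 1 + rand() % min_backoff);
    }

    int     main(void)
    {
        time_t  now = time((time_t *) 0);

        srand((unsigned) now);
        /* A message 8000s old, with 300s/4h backoff limits. */
        printf("next retry in %ld seconds\n",
               (long) (next_retry(now, 8000, 300, 4 * 3600) - now));
        return (0);
    }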

-- 
        Viktor.
