On Wed, May 15, 2013 at 11:44:17AM -0400, Wietse Venema wrote:
> We just don't want to dedicate too many mail delivery resources to
> the slowest messages. Faster messages (or an approximate proxy:
> new mail) should be scheduled soon for delivery. It should not have
> to wait at the end of the line.

Providing more capacity when we:

 - have the memory resources, and
 - are at no risk of creating excessive contention,

is unconditionally going to help, even when some of the processes are
reserved for new mail. As I've said before, these approaches are
*composable*. One can dynamically yield more slots to the slow mail,
and use separate transports for slow vs. fast mail.

We could have multiple deferred queues, where mail that took a long
time to deliver goes into the slow deferred queue, while mail that was
simply greylisted, ... goes into the regular deferred queue, thus
giving us better proxies for fast/slow.

There are many possible knobs. Not letting slow mail accumulate in the
active queue by adding concurrency is one of them. Tuning retries is
another...

> Now we could take advantage of the fact that in many cases the
> "slow" and "fast" messages cluster around different sites, thus
> their recipients will end up in different in-memory queues. If
> there was a feedback of fine-grained delivery agent latencies to
> qmgr(8), then could rank nexthop destinations. Not to starve slow
> mail, but only to ensure that slow mail does not starve new mail.

The queue manager tends to forget everything about a queue when it
becomes empty. This is needed to conserve memory. With intermittent
queue scans, if the blockages clear before the next queue scan, the
knowledge that a source is slow may be flushed.

It takes multiple dead destinations to create a problem, since the
initial destination concurrency is 5, and we don't raise it when
deliveries site-fail (as they typically do for the bogus destinations).
So, with the stock process limit of 100, it takes 100/5 = 20+ such
destinations to saturate the process limit. If there are many, the
queue starvation happens before there is time for any feedback to rank
slow destinations.

In Patrick's case with made-up domains on webforms, does the bogus
mail in fact tend to have lots of instances of the same invented
destination domain? Or are users sufficiently inventive to make
collisions rare?

In all probability all we need to do is raise the maximal backoff time
from 4000s to ~4-5 hours, and advise the users in question to reduce
the maximal queue lifetime from 5 days to ~2. This will reduce the
stream from the deferred queue to a trickle (a back-of-the-envelope
sketch of the retry counts follows below).

It may also help to add a random number between 1 and
minimal_backoff_time to the next retry time of a deferred message (on
top of the exponential backoff). This will help to diffuse clusters of
deferred mail (also sketched below).
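
A back-of-the-envelope sketch of the backoff/lifetime change above
(Python, not qmgr code; it assumes the stock 300s minimal_backoff_time
and that retry intervals simply double up to the cap, ignoring queue
scan granularity, so the counts are only approximate):

    # Count delivery attempts for a message that is deferred on every
    # attempt, with retry intervals doubling from min_backoff up to
    # max_backoff, until the maximal queue lifetime expires.
    def attempts(min_backoff, max_backoff, queue_lifetime):
        t, delay, n = 0, min_backoff, 0
        while t < queue_lifetime:
            n += 1
            t += delay
            delay = min(delay * 2, max_backoff)
        return n

    HOUR, DAY = 3600, 86400

    # Current: maximal_backoff_time = 4000s, 5-day queue lifetime.
    print(attempts(300, 4000, 5 * DAY))             # ~111 attempts
    # Proposed: ~4.5h maximal backoff, 2-day queue lifetime.
    print(attempts(300, int(4.5 * HOUR), 2 * DAY))  # ~16 attempts

A message that never delivers then generates roughly 7x fewer retries
from the deferred queue before it expires.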
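
And a minimal sketch of the jitter idea (again Python; next_retry_time
is a hypothetical helper, not anything in the qmgr source):

    import random

    MINIMAL_BACKOFF_TIME = 300  # stock minimal_backoff_time, seconds

    # Next retry = exponential backoff plus a random offset of
    # 1..minimal_backoff_time seconds, so that batches of mail that
    # were deferred together do not all become eligible again in the
    # same deferred queue scan.
    def next_retry_time(now, backoff):
        return now + backoff + random.randint(1, MINIMAL_BACKOFF_TIME)

-- 
	Viktor.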