Hi, I'm advising a user on a system which experiences occasional lockups in an outgoing runner. The runner contacts a remote smarthost, and occasionally locks up with the smtp connection still open according to "lsof -i :25". This state seems to be permanent: the backlog for that slice grows without bound, and has done so for at least an hour. When this was happening once a month, they would just restart Mailman core, and the backlog would clear in minutes. But recently it's been happening daily, which distracts their staff and worries me.

Another thing that is strange about this site is that it should be possible to hit that runner with a SIGUSR1 and restart it. This works for me, but on that system the stuck runner exits, but does not restart. Since normally it only happens to one runner of several, I wanted to identify the runner and process. I attach the tool I developed, for anyone who might have configured multiple runners and is interested to see the distribution across runners.

Has anybody seen an outgoing runner lock up with an open smtp session? Any ideas on why?

My analysis so far:

- Because a restart works every time, I'm pretty sure it doesn't have anything to do with message content.
- I believe both the Mailman host and the outgoing smarthost are Linodes, in the same datacenter. The problematic Mailman system is Mailman 3.3.6, Python 3.10 on Ubuntu 16.xx LTS. I believe the smarthost is a more recent Ubuntu LTS, probably running Postfix 3.7 as the MTA. (Yes, they have sufficiently paranoid security and QA teams. ;-)
- We're using smtplib, which as far as I can tell basically has a 60s timeout for each command. Thus you'd think it would time out. I guess it could be inflooping on timeout, retry, timeout, retry, but I don't know how to check that. Maybe ss(8) will serve?
- My system where SIGUSR1 works as documented is Python 3.11.2 on a Digital Ocean droplet with Debian 12.9, Linux 6.1.0.
- I haven't tried to reproduce the Python 3.10 + Mailman 3.3.6 configuration and test SIGUSR1 in that configuration yet. Seems unlikely, waiting for the proverbial "round tuit".
- It has occurred to us to use the local Postfix as the relay MTA. I'm waiting on a report whether that alleviates the problem. Even if so, I want to fix the underlying defect if possible.

Any ideas would be welcome, including general debugging advice.

Here's the promised slice_inspector tool. (Patches and suggestions welcome, though I don't promise to implement any time soon.)

The tool imports some modules from Mailman core, plus click and psutil (not psutils!). I believe the latter two are required by Mailman, so you should be able to run it as is under the 'mailman' user (or perhaps 'list' on Debian) in the environment Mailman itself uses. It defaults to assuming that mailman.cfg is /etc/mailman3/mailman.cfg. You'll probably need to fix the shebang if you want to chmod +x. The '--help' should be pretty self-explanatory.

#! /opt/mailman/venv/bin/python
# -*- coding: utf-8 -*-
# Copyright (C) 2001-2023 by the Free Software Foundation, Inc.
#
# This file is derived from GNU Mailman, and useful only with Mailman.
#
# This file is free software: you can redistribute it and/or modify it under
# the terms of the GNU General Public License as published by the Free
# Software Foundation, either version 3 of the License, or (at your option)
# any later version.
#
# GNU Mailman is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
# FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
# more details.
#
# You should have received a copy of the GNU General Public License along with
# GNU Mailman.  If not, see <https://www.gnu.org/licenses/>.

"""Utility for inspecting queue slices."""

import click
import os
import psutil

from mailman.config import config
from mailman.core.initialize import initialize
from mailman.utilities.string import expand
from signal import SIGUSR1 as USR1

# 20 bytes of all bits set, maximum hashlib.sha.digest() value.  We do it this
# way for Python 2/3 compatibility.
shamax = int('0xffffffffffffffffffffffffffffffffffffffff', 16)
normal_backlog_max = 100
# Small increment added to a timestamp key when two entries share the same
# time, so that neither is dropped.  The value is an assumption; it only needs
# to be small enough not to disturb the ordering of entries.
DELTA = .0001


class SliceInspector:
    """Finds the queued messages for a queued slice.  See also `ISwitchboard`."""

    def __init__(self, name, queue_directory, slice, numslices):
        """Create a switchboard-like object.

        :param name: The queue name.
        :type name: str
        :param queue_directory: The queue directory.
        :type queue_directory: str
        :param slice: The slice number for this switchboard, or None.  If not
            None, it must be [0..`numslices`).
        :type slice: int or None
        :param numslices: The total number of slices to split this queue
            directory into.  It must be a power of 2.
        :type numslices: int
        """
        assert (numslices & (numslices - 1)) == 0, (
            'Not a power of 2: {}'.format(numslices))
        self.name = name
        self.slice = slice
        self.numslices = numslices
        self.queue_directory = queue_directory
        self.pid = None
        # Fast track for no slices
        self._lower = None
        self._upper = None
        # Algorithm may be subject to change!
        if numslices != 1:
            self._lower = ((shamax + 1) * slice) / numslices
            self._upper = (((shamax + 1) * (slice + 1)) / numslices) - 1

    def get_files(self, extension='.pck'):
        """See `ISwitchboard`."""
        times = {}
        lower = self._lower
        upper = self._upper
        for f in os.listdir(self.queue_directory):
            # By ignoring anything that doesn't end in .pck, we ignore
            # tempfiles and avoid a race condition.
            filebase, ext = os.path.splitext(f)
            if ext != extension:
                continue
            when, digest = filebase.split('+', 1)
            # Throw out any files which don't match our bitrange.
            if lower is None or (lower <= int(digest, 16) <= upper):
                key = float(when)
                while key in times:
                    key += DELTA
                times[key] = filebase
        # FIFO sort
        return [times[k] for k in sorted(times)]

    def get_pid(self):
        """Return the PID of the runner serving this slice, or None."""
        target = f'--runner={self.name}:{self.slice}:{self.numslices}'
        for p in psutil.process_iter(['pid', 'cmdline']):
            # print(p.info)
            if p.info['cmdline'] and target == p.info['cmdline'][-1]:
                self.pid = p.info['pid']
        return self.pid


def initialize_config(ctx, param, value):
    if not ctx.resilient_parsing:
        initialize(value)


@click.command()
@click.option(
    '-C', '--config', 'config_file',
    envvar='MAILMAN_CONFIG_FILE',
    type=click.Path(exists=True, dir_okay=False, resolve_path=True),
    help="""\
    Configuration file to use.  If not given, the environment variable
    MAILMAN_CONFIG_FILE is consulted and used if set.  If neither are given, a
    default configuration file is loaded.""",
    is_eager=True, callback=initialize_config)
@click.option(
    '--restart/--no-restart', type=bool, default=False,
    help='Any runner with a queue over the threshold is sent SIGUSR1')
@click.option(
    '--threshold', type=int, default=normal_backlog_max,
    help=f'Threshold for restarting runners, default {normal_backlog_max}')
@click.argument('name')
@click.pass_context
def cli(ctx, config_file, restart, threshold, name):
    """
    List the process ID and backlog count for each slice of the NAME queue.
    """
    # from mailman/src/mailman/core/runner.py
    section = getattr(config, 'runner.' + name)
    substitutions = config.paths
    substitutions['name'] = name
    numslices = int(section.instances)
    queuedir = expand(section.path, None, substitutions)

    print(f'Using Mailman configuration in {config.filename}:')
    print(f'  Queue directory: {queuedir}')
    print(f'  Instances: {numslices}')
    print(f'Process ID and count of entries in slices of {name} queue:')
    for i in range(numslices):
        # #### is this factored appropriately?
        inspector = SliceInspector(name, queuedir, i, numslices)
        count = len(inspector.get_files())
        pid = inspector.get_pid()
        if pid is None:
            print(f'  Slice {i}: runner not found')
            continue
        print(f'  Slice {i}: {pid:8} {count}')
        if restart and count > threshold:
            print(f'  Slice {i} probably stuck: restarting')
            os.kill(pid, USR1)


if __name__ == '__main__':
    cli()

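
For anyone wondering how a queue file ends up in a particular slice: get_files() above keeps a file when the hex digest in its name falls inside that slice's range of the 160-bit SHA-1 space. Here is a minimal sketch of the same idea turned around, mapping a file name to a slice index. The helper name slice_for() and the sample file names are made up for illustration, and it uses integer division rather than the tool's range comparison, so treat it as an approximation of the idea, not Mailman's code.

```python
# Sketch only: slice_for() is a hypothetical helper, not part of Mailman.
# It assumes queue file names of the form '<timestamp>+<40 hex digits>',
# as parsed by SliceInspector.get_files().

SHAMAX = (1 << 160) - 1   # largest possible 20-byte SHA-1 digest value


def slice_for(filebase, numslices):
    """Return the index of the slice whose range contains FILEBASE's digest."""
    assert numslices > 0 and (numslices & (numslices - 1)) == 0, (
        'Not a power of 2: {}'.format(numslices))
    when, digest = filebase.split('+', 1)
    # Each slice covers an equal, contiguous share of the digest space.
    width = (SHAMAX + 1) // numslices
    return int(digest, 16) // width


if __name__ == '__main__':
    # Made-up file names covering the edges of each of four slices.
    samples = [
        '1700000000.5+' + '0' * 40,
        '1700000000.5+' + '7' + 'f' * 39,
        '1700000000.5+' + '8' + '0' * 39,
        '1700000000.5+' + 'f' * 40,
    ]
    for filebase in samples:
        print(slice_for(filebase, 4))
```

With four slices the top two bits of the digest pick the slice, so the samples land in slices 0, 1, 2 and 3 respectively. Since the digest does not depend on the recipient or the list, a single slice backing up points at the runner process rather than at any particular kind of message.
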
Steve