Hi,

I'm advising a user on a system which experiences occasional lockups
in an outgoing runner.  The runner contacts a remote smarthost, and
occasionally locks up with the smtp connection still open according to
"lsof -i :25".  This state seems to be permanent: the backlog for that
slice starts to grow without bound, for at least an hour.  When this
was happening once a month, they would just restart Mailman core, and
the backlog would clear in minutes.  But recently it's been happening
daily, which is distracting their staff and worries me.
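
A rough psutil equivalent of the "lsof -i :25" check might look like
the sketch below.  The function names are made up for illustration,
and the psutil import is done lazily only so that the filter can be
tried without psutil installed.

```python
def smtp_connections(conns, port=25):
    """Return (remote_ip, status) for each connection to remote `port`."""
    found = []
    for c in conns:
        # raddr is an empty tuple for sockets with no remote end
        # (listening sockets, for example).
        if c.raddr and c.raddr.port == port:
            found.append((c.raddr.ip, c.status))
    return found


def runner_smtp_connections(pid, port=25):
    """List the SMTP connections held open by the process `pid`."""
    import psutil  # lazy import, see the note above
    proc = psutil.Process(pid)
    # psutil 6.0 renamed Process.connections() to net_connections();
    # use whichever one this psutil provides.
    getter = getattr(proc, 'net_connections', None) or proc.connections
    return smtp_connections(getter(kind='tcp'), port)
```

A status of CLOSE_WAIT would mean the smarthost has closed its end and
the runner has not noticed; ESTABLISHED would mean both ends still
consider the session live.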

Another strange thing about this site: it should be possible to
restart that runner by sending it SIGUSR1.  This works for me, but on
that system the stuck runner exits and does not restart.
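
One way to confirm "exits but does not restart" without watching ps by
hand is to poll for a replacement process after sending the signal.
This is only a sketch: wait_for_restart is a hypothetical helper, and
find_pid stands for any callable that returns the runner's current PID
or None (the get_pid method of the tool further down would do).

```python
import time


def wait_for_restart(find_pid, old_pid, timeout=30.0, interval=1.0):
    """Poll find_pid() until it reports a PID other than old_pid.

    Returns the new PID, or None if no replacement appeared in time.
    """
    deadline = time.monotonic() + timeout
    while True:
        pid = find_pid()
        if pid is not None and pid != old_pid:
            return pid
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

The intended use would be to call os.kill(old_pid, signal.SIGUSR1) and
then wait_for_restart(inspector.get_pid, old_pid); a None result
matches the behavior seen on the problem system.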

Since normally it only happens to one runner of several, I wanted to
identify the runner and process.  I attach the tool I developed, for
anyone who might have configured multiple runners and is interested to
see the distribution across runners.

Has anybody seen an outgoing runner lock up with an open smtp session?
Any ideas on why?

My analysis so far:
- Because a restart works every time, I'm pretty sure it doesn't have
  anything to do with message content.
- I believe both the Mailman host and the outgoing smarthost are
  Linodes, in the same datacenter.  The problematic Mailman system is
  Mailman 3.3.6, Python 3.10 on Ubuntu 16.xx LTS.  I believe the
  smarthost is a more recent Ubuntu LTS, probably running Postfix 3.7
  as the MTA.  (Yes, they have sufficiently paranoid security and QA
  teams. ;-)
- We're using smtplib, which as far as I can tell basically has a 60s
  timeout for each command.  Thus you'd think it would time out.  I
  guess it could be inflooping on timeout, retry, timeout, retry, but
  I don't know how to check that.  Maybe ss(8) will serve?
- My system where SIGUSR1 works as documented is Python 3.11.2 on a
  Digital Ocean droplet with Debian 12.9, Linux 6.1.0.
- I haven't tried to reproduce the Python 3.10 + Mailman 3.3.6
  configuration and test SIGUSR1 in that configuration yet.  It seems
  unlikely to matter, and I'm waiting for the proverbial "round tuit".
- It has occurred to us to use the local Postfix as the relay MTA.
  I'm waiting on a report whether that alleviates the problem.  Even
  if so, I want to fix the underlying defect if possible.
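
The timeout theory in the smtplib bullet can at least be checked in
isolation.  The sketch below points smtplib at a local socket that
accepts connections but never sends a greeting, and reports what
smtplib does when an explicit timeout expires.  It assumes nothing
about how Mailman itself sets up the connection; probe_silent_server
is a made-up name.

```python
import smtplib
import socket


def probe_silent_server(timeout=1.0):
    """Connect smtplib to a local socket that accepts but never speaks.

    Returns the text of the exception smtplib raised, or None if it
    raised nothing.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(('127.0.0.1', 0))    # port 0: let the OS pick a port
    listener.listen(1)
    host, port = listener.getsockname()
    try:
        # The kernel completes the TCP handshake from the listen
        # backlog, so connect() succeeds even though nobody ever sends
        # the 220 greeting that smtplib then waits for.
        smtplib.SMTP(host, port, timeout=timeout)
    except smtplib.SMTPServerDisconnected as exc:
        return str(exc)
    finally:
        listener.close()
    return None
```

If this returns promptly with an error string, then a hang of an hour
or more would suggest either that no timeout is in effect on the stuck
connection or that the runner is blocked somewhere other than a socket
read.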

Any ideas would be welcome, including general debugging advice.

Here's the promised slice_inspector tool.  (Patches and suggestions
welcome, though I don't promise to implement any time soon.)  The tool
imports some modules from Mailman core, plus click and psutil (not
psutils!).  I believe the latter two are required by Mailman, so you
should be able to run it as is under the 'mailman' user (or perhaps
'list' on Debian) in the environment Mailman itself uses.  It defaults to
assuming that mailman.cfg is /etc/mailman3/mailman.cfg.  You'll
probably need to fix the shebang if you want to chmod +x.  The
'--help' should be pretty self-explanatory.

#! /opt/mailman/venv/bin/python
# -*- coding: utf-8 -*-

# Copyright (C) 2001-2023 by the Free Software Foundation, Inc.
#
# This file is derived from GNU Mailman, and useful only with Mailman.
#
# This file is free software: you can redistribute it and/or modify it under
# the terms of the GNU General Public License as published by the Free
# Software Foundation, either version 3 of the License, or (at your option)
# any later version.
#
# GNU Mailman is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
# FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
# more details.
#
# You should have received a copy of the GNU General Public License along with
# GNU Mailman.  If not, see <https://www.gnu.org/licenses/>.

"""Utility for inspecting queue slices."""

import click
import os
import psutil
from mailman.config import config
from mailman.core.initialize import initialize
from mailman.utilities.string import expand
from signal import SIGUSR1 as USR1

# 20 bytes of all bits set, maximum hashlib.sha.digest() value.  We do it this
# way for Python 2/3 compatibility.
shamax = int('0xffffffffffffffffffffffffffffffffffffffff', 16)
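# Small increment used by get_files() to keep the keys of files with
# identical timestamps distinct.
DELTA = .0001
normal_backlog_max = 100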

class SliceInspector:
    """Finds the queued messages for a queue slice. See also `ISwitchboard`."""

    def __init__(self, name, queue_directory, slice, numslices):
        """Create a switchboard-like object.

        :param name: The queue name.
        :type name: str
        :param queue_directory: The queue directory.
        :type queue_directory: str
        :param slice: The slice number for this switchboard, or None.  If not
            None, it must be [0..`numslices`).
        :type slice: int or None
        :param numslices: The total number of slices to split this queue
            directory into.  It must be a power of 2.
        :type numslices: int
        """
        assert (numslices & (numslices - 1)) == 0, (
            'Not a power of 2: {}'.format(numslices))
        self.name = name
        self.slice = slice
        self.numslices = numslices
        self.queue_directory = queue_directory
        self.pid = None
        # Fast track for no slices
        self._lower = None
        self._upper = None
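        # Algorithm may be subject to change!
        # Integer division keeps the bounds exact; plain `/` would turn
        # these 160-bit values into floats under Python 3.
        if numslices != 1:
            self._lower = ((shamax + 1) * slice) // numslices
            self._upper = (((shamax + 1) * (slice + 1)) // numslices) - 1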

    def get_files(self, extension='.pck'):
        """See `ISwitchboard`."""
        times = {}
        lower = self._lower
        upper = self._upper
        for f in os.listdir(self.queue_directory):
            # By ignoring anything that doesn't end in .pck, we ignore
            # tempfiles and avoid a race condition.
            filebase, ext = os.path.splitext(f)
            if ext != extension:
                continue
            when, digest = filebase.split('+', 1)
            # Throw out any files which don't match our bitrange.
            if lower is None or (lower <= int(digest, 16) <= upper):
                key = float(when)
                while key in times:
                    key += DELTA
                times[key] = filebase
        # FIFO sort
        return [times[k] for k in sorted(times)]

    def get_pid(self):
        """Return the PID of the runner serving this slice, or None."""
        # Match on the final command-line argument of each process.
        target = f'--runner={self.name}:{self.slice}:{self.numslices}'
        for p in psutil.process_iter(['pid', 'cmdline']):
            if p.info['cmdline'] and target == p.info['cmdline'][-1]:
                self.pid = p.info['pid']
                return self.pid
        return None


def initialize_config(ctx, param, value):
    if not ctx.resilient_parsing:
        initialize(value)

@click.command()
@click.option('-C', '--config', 'config_file',
    envvar='MAILMAN_CONFIG_FILE',
    type=click.Path(exists=True, dir_okay=False, resolve_path=True),
    help="""\
    Configuration file to use.  If not given, the environment variable
    MAILMAN_CONFIG_FILE is consulted and used if set.  If neither are given, a
    default configuration file is loaded.""",
    is_eager=True, callback=initialize_config)
@click.option('--restart/--no-restart', type=bool, default=False,
              help='Any runner with a queue over the threshold is sent SIGUSR1')
@click.option('--threshold', type=int, default=normal_backlog_max,
              help=f'Threshold for restarting runners, default {normal_backlog_max}')
@click.argument('name')
@click.pass_context
def cli(ctx, config_file, restart, threshold, name):
    """
    List the process ID and backlog count for each slice of the NAME queue.
    """

    # from mailman/src/mailman/core/runner.py
    section = getattr(config, 'runner.' + name)
    substitutions = config.paths
    substitutions['name'] = name
    numslices = int(section.instances)
    queuedir = expand(section.path, None, substitutions)

    print(f'Using Mailman configuration in {config.filename}:')
    print(f'  Queue directory: {queuedir}')
    print(f'  Instances: {numslices}')
    print(f'Process ID and count of entries in slices of {name} queue:')
    for i in range(numslices):
        # #### is this factored appropriately?
        inspector = SliceInspector(name, queuedir, i, numslices)
        count = len(inspector.get_files())
        pid = inspector.get_pid()
        if pid is None:
            print(f'  Slice {i}: runner not found')
            continue
        print(f'  Slice {i}: {pid:8} {count}')
        if restart and count > threshold:
            print(f'  Slice {i} probably stuck: restarting')
            os.kill(pid, USR1)


if __name__ == '__main__':
    cli()
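
To make the slice arithmetic in SliceInspector concrete, here is a
small standalone sketch that maps a queue file name of the form
"when+digest" to the slice whose range contains it.  It uses integer
arithmetic throughout, which is an assumption of this sketch, and
slice_bounds and slice_for are names invented here, not part of
Mailman.

```python
SHAMAX = int('f' * 40, 16)    # same value as shamax in the tool above


def slice_bounds(slice_no, numslices):
    """Return the inclusive (lower, upper) digest range for one slice."""
    span = SHAMAX + 1
    return (span * slice_no // numslices,
            span * (slice_no + 1) // numslices - 1)


def slice_for(filebase, numslices):
    """Return the index of the slice whose range holds this file's digest."""
    _when, digest = filebase.split('+', 1)
    return int(digest, 16) * numslices // (SHAMAX + 1)
```

With four instances, a digest made of all zeros falls in the first
slice and one made of all f's falls in the last, so a file can be
attributed to a runner without scanning the whole queue directory.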

Steve

_______________________________________________
Mailman-users mailing list -- mailman-users@mailman3.org
To unsubscribe send an email to mailman-users-le...@mailman3.org
https://lists.mailman3.org/mailman3/lists/mailman-users.mailman3.org/
Archived at: 
https://lists.mailman3.org/archives/list/mailman-users@mailman3.org/message/L73MKR2FR2MXZTRURXUMME3GNLJ7VVIZ/
