Bug#966544: [Pkg-net-snmp-devel] Bug#966544: snmpd: extend option broken after update
Hello Craig, Do these issues warrant utterly breaking things, w/o any recourse short of recompiling, for the many, many users that use the extend feature? Especially given that SNMP traffic tends to be on private networks and that the feature is not enabled by default in the config. At the very least a "this will break things, abort now" notice during the upgrade would have been nice. If upstream can't/won't fix this, snmpd has lost its usefulness for me in the long run compared to other data collectors. Regards, Christian On Fri, 31 Jul 2020 10:46:29 +1000 Craig Small wrote: > Hi James, > That would have been intentional, the EXTEND MIB has major security > issues. > > - Craig > > > On Thu, 30 Jul 2020 at 23:03, James Greig wrote: > > > Package: snmpd > > Version: 5.7.3+dfsg-1.7+deb9u2 > > Severity: important > > > > Dear Maintainer, > > > > *** Reporter, please consider answering these questions, where appropriate > > *** > > > > Updating snmpd from deb9u1 to deb9u2 via apt on any stretch system > > breaks the ability to use 'extend' in snmpd. > > > > After updating on any stretch system and restarting snmpd this error will > > appear:- > > > > Warning: Unknown token: extend > > > > It's likely the latest binary build of this package has not included options to > > enable extend and/or other extras.
> > > > *** End of the template - remove these template lines *** > > > > > > -- System Information: > > Debian Release: 9.13 > > APT prefers oldstable-updates > > APT policy: (500, 'oldstable-updates'), (500, 'oldstable') > > Architecture: amd64 (x86_64) > > > > Kernel: Linux 4.9.0-13-amd64 (SMP w/8 CPU cores) > > Locale: LANG=en_GB.UTF-8, LC_CTYPE=en_GB.UTF-8 (charmap=UTF-8), > > LANGUAGE=en_GB:en (charmap=UTF-8) > > Shell: /bin/sh linked to /bin/dash > > Init: systemd (via /run/systemd/system) > > > > Versions of packages snmpd depends on: > > ii adduser 3.115 > > ii debconf [debconf-2.0] 1.5.61 > > ii init-system-helpers 1.48 > > ii libc6 2.24-11+deb9u4 > > ii libsnmp-base 5.7.3+dfsg-1.7+deb9u2 > > ii libsnmp30 5.7.3+dfsg-1.7+deb9u2 > > ii lsb-base 9.20161125 > > > > snmpd recommends no packages. > > > > Versions of packages snmpd suggests: > > pn snmptrapd > > > > -- debconf information excluded -- Christian Balzer, Network/Systems Engineer, ch...@gol.com Rakuten Mobile Inc.
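For reference, the feature in question is the `extend` directive in snmpd.conf; a minimal sketch (the label and script path here are made up for illustration, not taken from the report) of the kind of line that deb9u2 now rejects with "Unknown token: extend":

```
# /etc/snmp/snmpd.conf fragment (sketch; name and path are illustrative):
# the script's output becomes readable via NET-SNMP-EXTEND-MIB::nsExtendOutput1Line
extend check_foo /usr/local/bin/check_foo.sh
```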
Bug#922234: Still ongoing
Hello, just wanted to confirm that this is still ongoing and turned a significant number of my sparse hairs gray in the middle of last night. This is on a Supermicro SYS-2028TP-DC0FR/X10DRT-PIBF with Intel I350 and 82575EB ethernet interfaces and a Mellanox ConnectX-3, in case anybody is taking notes on which firmware/driver might be the culprit. I also had to revert to the oldest (4.9.0-4) kernel to get this working and basically have to consider everything with Stretch backport kernels non-bootable at this point. It would be nice if somebody from the kernel team could pipe up so we can provide more info if needed. Regards, Christian -- Christian Balzer, Network/Systems Engineer, ch...@gol.com Rakuten Mobile Inc.
Bug#888512: Clamd suddenly eat up all file descriptors, 'Too many open files' error
Hello, I can very much and very urgently confirm this bug. It started happening today on several servers here, unsurprisingly first on the ones with the smallest number of default file descriptors configured. Interestingly, even though the clamd version number is the same, Stretch servers are unaffected while Jessie ones very much are. Alas, while I'm able to upgrade most servers, 2 I definitely can not at this time, so that's not a solution. Regards, Christian -- Christian Balzer, Network/Systems Engineer, ch...@gol.com Rakuten Communications
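When chasing a descriptor leak like this, the per-process count can be watched via procfs; a sketch, shown here against the current shell ($$) as a stand-in, since on an affected server you would substitute $(pidof clamd):

```shell
# Count the open file descriptors of a process via /proc.
# $$ (the current shell) stands in for $(pidof clamd) on an affected host.
ls /proc/$$/fd | wc -l
```

Watching this number climb toward the configured limit (see `ulimit -n` for the default) is a quick way to tell how long an affected clamd has before it hits "Too many open files".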
Bug#649106: syncid 'fix' breaks state sync in init script completely
Hello, sorry for the bug necromancy, but this clearly has not been fixed in 1.26, or another regression happened later. The patch was obviously only half (manually?) applied: the current initscript in 1:1.28-3+b1 has the "--syncid" statement TWICE in the start invocations; the first occurrence, before "--start-daemon", needs to be removed. I've been using LVS/ipvsadm for a really long time and not once in over 4 releases has a Debian start script worked out of the box for people using the daemons. Would be nice to get this fixed once and for all. Christian -- Christian Balzer, Network/Systems Engineer, ch...@gol.com Rakuten Communications
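To illustrate the fix described above (the exact initscript line is my assumption, not a verbatim quote from 1:1.28-3+b1): only the `--syncid` after `--start-daemon` should remain, roughly:

```
- ipvsadm --syncid $SYNCID --start-daemon master --syncid $SYNCID
+ ipvsadm --start-daemon master --syncid $SYNCID
```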
Bug#864756: Failover of VMs from dead nodes doesn't work with RBD or any other extstorage
Package: ganeti Version: 2.15.2-7 Severity: important Stretch As the tin says, ganeti 2.15 is unable/unwilling to fail over VMs from a dead compute node if they aren't DRBD-backed. This used to work with 2.12 and is a regression that had not been fixed by the time the current Debian version was released. See the following for details and a patch (works fine for me): https://groups.google.com/forum/#!topic/ganeti/ahbr7vb7zRo Regards, Christian -- Christian Balzer, Network/Systems Engineer, ch...@gol.com Rakuten Communications
Bug#850823: Official patch
The official patch is here and includes a bit more: https://github.com/ganeti/ganeti/commit/27a999616efefcff96b14688208c93c6a76d8813 -- Christian Balzer, Network/Systems Engineer, ch...@gol.com Rakuten Communications
Bug#864754: CPU affinity broken with Stretch (or backport python versions on Jessie)
Package: ganeti Version: 2.15.2-7 Severity: important Stretch The Python psutil module played the familiar game of Python developers and renamed a function call, in this case the one for CPU affinity, between the version supplied in Jessie (2.1.1) and version 4.x (Stretch and Jessie backports). For details and the solution [simply change "set_cpu_affinity" to "cpu_affinity" (both occurrences) in lib/hypervisor/hv_kvm/__init__.py] see: https://groups.google.com/forum/#!topic/ganeti/fQu6Wr14k2M Regards, Christian -- Christian Balzer, Network/Systems Engineer, ch...@gol.com Rakuten Communications
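The rename above can be applied mechanically with sed; a sketch, demonstrated here on a scratch file standing in for lib/hypervisor/hv_kvm/__init__.py (the file contents below are illustrative, not quoted from ganeti):

```shell
# Apply the psutil 2.x -> 4.x rename (set_cpu_affinity -> cpu_affinity)
# to a scratch copy; on a real install you would run the sed against the
# hv_kvm/__init__.py shipped by the ganeti package instead.
f=$(mktemp)
printf 'process.set_cpu_affinity(cpus)\nprocess.set_cpu_affinity(all_cpus)\n' > "$f"
sed -i 's/set_cpu_affinity/cpu_affinity/g' "$f"
cat "$f"
rm -f "$f"
```

Since "cpu_affinity" is a suffix of "set_cpu_affinity", a single global substitution covers both occurrences.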
Bug#864755: Bash completion missing in 2.15
Package: ganeti Version: 2.15.2-7 Severity: important Stretch Unlike with 2.12, there is no longer a /etc/bash_completion.d/ganeti provided by this package. Given the nature of the ganeti beast and the lack of a (stable, usable) GUI, this is a rather essential feature. Regards, Christian -- Christian Balzer, Network/Systems Engineer, ch...@gol.com Rakuten Communications
Bug#850823: Patch works and REALLY should be in there.
Hello Apollon (again). Just to confirm that the patch works; I'm in the middle of installing a new ganeti/ceph cluster based on Stretch. Also, the ganeti bash completion is gone; should I open a bug report for this, or is that something that happened upstream? Christian -- Christian Balzer, Network/Systems Engineer, ch...@gol.com Rakuten Communications
Bug#862248: [Debian-ha-maintainers] Bug#862248: No straightforward and permanent way to disable DRBD autostart, no drbd systemd unit file
On Thu, 11 May 2017 09:33:59 +0300 Apollon Oikonomopoulos wrote: > On 09:15 Thu 11 May , Christian Balzer wrote: > > Firstly I recreated the initial state by unmasking drbd and enabling > > it, then reloading systemd. > > > > That find then gives us: > > --- > > /run/systemd/generator.late/drbd.service > > /etc/systemd/system/multi-user.target.wants/drbd.service > > So, now we need ls -l /etc/systemd/system/multi-user.target.wants/drbd.service to see how old > the symlink is and where it points to. Can you also zgrep drbd-utils > /var/log/dpkg.log*? > Sure thing, that's quite the old chestnut indeed: --- lrwxrwxrwx 1 root root 32 Aug 11 2015 /etc/systemd/system/multi-user.target.wants/drbd.service -> /lib/systemd/system/drbd.service --- Note that this link is also present on: A Wheezy system with "8.9.2~rc1-1~bpo70+1" installed. On a system that was initially installed with Jessie but had 8.9.2~rc1-2+deb8u1 installed first. The plot thickens, see below. No trace of the original install, just the upgrades when going from Wheezy to Jessie: --- /var/log/dpkg.log:2017-05-09 13:03:35 upgrade drbd-utils:amd64 8.9.2~rc1-1~bpo70+1 8.9.2~rc1-2+deb8u1 /var/log/dpkg.log:2017-05-09 13:14:41 upgrade drbd-utils:amd64 8.9.2~rc1-2+deb8u1 8.9.5-1~bpo8+1 --- As for Ferenc, that was of course _after_ again disabling drbd, but I wanted to start from the same state as before, so I unmasked and enabled it first. Also for the record, on another Jessie system that never had drbd-utils installed, I installed "8.9.5-1~bpo8+1" directly. There disable works as expected and only leaves the /run/systemd/generator.late/drbd.service around. So it all points to 8.9.2~rc1 as the culprit. Christian -- Christian Balzer, Network/Systems Engineer, ch...@gol.com Global OnLine Japan/Rakuten Communications http://www.gol.com/
Bug#862248: [Debian-ha-maintainers] Bug#862248: No straightforward and permanent way to disable DRBD autostart, no drbd systemd unit file
Hello, On Wed, 10 May 2017 15:18:15 +0300 Apollon Oikonomopoulos wrote: > On 20:55 Wed 10 May , Christian Balzer wrote: > > is there any package you're not involved with? ^o^ > > Nah, we just happen to be running the same things :) > Evidently so. ^o^ > > On Wed, 10 May 2017 12:37:34 +0300 Apollon Oikonomopoulos wrote: > > > > > Control: severity -1 wishlist > > > > > Sure thing. > > > > > Hi, > > > > > > On 17:53 Wed 10 May , Christian Balzer wrote: > > > > Jessie (backports), systemd. > > > > > > > > When running DRBD with pacemaker it is recommended (and with systemd > > > > required, see link below) to disable DRBD startup at boot time. > > > > > > > > However: > > > > --- > > > > # systemctl disable drbd > > > > drbd.service is not a native service, redirecting to > > > > systemd-sysv-install. > > > > Executing: /lib/systemd/systemd-sysv-install disable drbd > > > > insserv: warning: current start runlevel(s) (empty) of script `drbd' > > > > overrides LSB defaults (2 3 4 5). > > > > insserv: warning: current stop runlevel(s) (0 1 2 3 4 5 6) of script > > > > `drbd' overrides LSB defaults (0 1 6). > > > > --- > > > > > > > > But since systemd-sysv picks up anything in /etc/init.d/ we get after a > > > > reboot: > > > > --- > > > > # systemctl status drbd > > > > drbd.service - LSB: Control drbd resources. > > > >Loaded: loaded (/etc/init.d/drbd; generated; vendor preset: enabled) > > > >Active: active (exited) since Wed 2017-05-10 10:37:39 JST; 6h ago > > > > Docs: man:systemd-sysv-generator(8) > > > >CGroup: /system.slice/drbd.service > > > > --- > > > > > > > > Ways forward would be a unit file for systemd that actually allows > > > > disable > > > > to work as expected or some other means to (permanently) neuter the > > > > init.d > > > > file (instead of an "exit 0" at the top which did the trick for now). > > > > > > > > > > Thanks for the report! 
> > > > > > You can always use `systemctl mask drbd.service', which will neuter the > > > initscript completely. I'm downgrading the severity to 'wishlist', > > > unless `systemctl mask' causes some ill side-effects, in which case > > > please change the severity again. > > > > > That worked w/o any ill effects I can see. > > > > Unfortunately mask is not a particularly well-known/referenced systemctl > > feature, but then again that might be my tremendous love and admiration > > for all things systemd speaking. ^o^ > > mask is well-documented, it's just something we didn't have with > sysvinit, so most people ignore its existence and it's not cited often. > > > > > But yes, ideally we should provide a native unit. > > > > > I wonder if this bears referencing to the systemd/systemd-sysv folks, to > > maybe suggest "mask" in the output when somebody runs disable against > > an LSB sysv init script. > > The thing is, systemctl disable *should* do the right thing, even in > jessie. It makes me suspect there are some older package left-overs > around. Can you please try running: > > $ systemctl disable drbd.service > $ systemctl daemon-reload > $ find /lib/systemd /run/systemd /etc/systemd -name drbd.service > Firstly I recreated the initial state by unmasking drbd and enabling it, then reloading systemd. That find then gives us: --- /run/systemd/generator.late/drbd.service /etc/systemd/system/multi-user.target.wants/drbd.service --- These systems are an upgrade from Wheezy, but there are no old packages left and the relevant bits (drbd and systemd) are actually from backports. Christian -- Christian Balzer, Network/Systems Engineer, ch...@gol.com Global OnLine Japan/Rakuten Communications http://www.gol.com/
Bug#862248: [Debian-ha-maintainers] Bug#862248: No straightforward and permanent way to disable DRBD autostart, no drbd systemd unit file
Hello Apollon, is there any package you're not involved with? ^o^ On Wed, 10 May 2017 12:37:34 +0300 Apollon Oikonomopoulos wrote: > Control: severity -1 wishlist > Sure thing. > Hi, > > On 17:53 Wed 10 May , Christian Balzer wrote: > > Jessie (backports), systemd. > > > > When running DRBD with pacemaker it is recommended (and with systemd > > required, see link below) to disable DRBD startup at boot time. > > > > However: > > --- > > # systemctl disable drbd > > drbd.service is not a native service, redirecting to systemd-sysv-install. > > Executing: /lib/systemd/systemd-sysv-install disable drbd > > insserv: warning: current start runlevel(s) (empty) of script `drbd' > > overrides LSB defaults (2 3 4 5). > > insserv: warning: current stop runlevel(s) (0 1 2 3 4 5 6) of script `drbd' > > overrides LSB defaults (0 1 6). > > --- > > > > But since systemd-sysv picks up anything in /etc/init.d/ we get after a > > reboot: > > --- > > # systemctl status drbd > > drbd.service - LSB: Control drbd resources. > > Loaded: loaded (/etc/init.d/drbd; generated; vendor preset: enabled) > > Active: active (exited) since Wed 2017-05-10 10:37:39 JST; 6h ago > > Docs: man:systemd-sysv-generator(8) > > CGroup: /system.slice/drbd.service > > --- > > > > Ways forward would be a unit file for systemd that actually allows disable > > to work as expected or some other means to (permanently) neuter the init.d > > file (instead of an "exit 0" at the top which did the trick for now). > > Thanks for the report! > > You can always use `systemctl mask drbd.service', which will neuter the > initscript completely. I'm downgrading the severity to 'wishlist', > unless `systemctl mask' causes some ill side-effects, in which case > please change the severity again. > That worked w/o any ill effects I can see. Unfortunately mask is not a particularly well-known/referenced systemctl feature, but then again that might be my tremendous love and admiration for all things systemd speaking.
^o^ > But yes, ideally we should provide a native unit. > I wonder if this bears referencing to the systemd/systemd-sysv folks, to maybe suggest "mask" in the output when somebody runs disable against an LSB sysv init script. Regards, Christian > Regards, > Apollon > -- Christian Balzer, Network/Systems Engineer, ch...@gol.com Global OnLine Japan/Rakuten Communications http://www.gol.com/
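For reference, `systemctl mask` is effective against generated sysv wrappers because it links the unit name to /dev/null in /etc/systemd/system, which takes precedence over both native units and generator output; a sketch of that mechanism, demonstrated in a scratch directory rather than the real /etc:

```shell
# What `systemctl mask drbd.service` does under the hood, shown in a
# scratch directory instead of /etc/systemd/system:
dir=$(mktemp -d)
ln -s /dev/null "$dir/drbd.service"
readlink "$dir/drbd.service"   # prints /dev/null
rm -rf "$dir"
```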
Bug#862248: No straightforward and permanent way to disable DRBD autostart, no drbd systemd unit file
Package: drbd-utils Version: 8.9.5-1~bpo8+1 Severity: important Jessie (backports), systemd. When running DRBD with pacemaker it is recommended (and with systemd required, see link below) to disable DRBD startup at boot time. However: --- # systemctl disable drbd drbd.service is not a native service, redirecting to systemd-sysv-install. Executing: /lib/systemd/systemd-sysv-install disable drbd insserv: warning: current start runlevel(s) (empty) of script `drbd' overrides LSB defaults (2 3 4 5). insserv: warning: current stop runlevel(s) (0 1 2 3 4 5 6) of script `drbd' overrides LSB defaults (0 1 6). --- But since systemd-sysv picks up anything in /etc/init.d/ we get after a reboot: --- # systemctl status drbd drbd.service - LSB: Control drbd resources. Loaded: loaded (/etc/init.d/drbd; generated; vendor preset: enabled) Active: active (exited) since Wed 2017-05-10 10:37:39 JST; 6h ago Docs: man:systemd-sysv-generator(8) CGroup: /system.slice/drbd.service --- Ways forward would be a unit file for systemd that actually allows disable to work as expected, or some other means to (permanently) neuter the init.d file (instead of an "exit 0" at the top, which did the trick for now). See also: https://www.mail-archive.com/drbd-user%40lists.linbit.com/msg11045.html Regards, Christian -- Christian Balzer, Network/Systems Engineer, ch...@gol.com Global OnLine Japan/Rakuten Communications http://www.gol.com/
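A native unit of the kind asked for above might look roughly like the following; this is purely a hypothetical sketch (the unit name, ordering, and drbdadm invocations are my assumptions, not anything shipped by drbd-utils), with `systemctl disable` and `mask` then behaving as expected:

```
[Unit]
Description=DRBD block devices
After=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/sbin/drbdadm up all
ExecStop=/sbin/drbdadm down all

[Install]
WantedBy=multi-user.target
```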
Bug#859700: client state / Message count mismatch with imap-hibernate and mixed POP3/IMAP access
Hello Apollon, On Thu, 6 Apr 2017 12:17:10 +0300 Apollon Oikonomopoulos wrote: > Control: tags -1 moreinfo > > Hi Christian, > > Thanks for the report. > > On 15:48 Thu 06 Apr , Christian Balzer wrote: > > I've been seeing a few of these since starting this cluster (see > > previous > > mail), they all follow the same pattern, a user who accesses their mailbox > > with both POP3 and IMAP deletes mails with POP3 and the IMAP > > (imap-hibernate really) is getting confused and upset about this: > > > > --- > > Apr 6 09:55:49 mbx11 dovecot: imap-login: Login: user=<redac...@gol.com>, > > method=PLAIN, rip=xxx.xxx.x.46, lip=xxx.xxx.x.113, mpid=951561, secured, > > session=<2jBV+HRM1Pbc9w8u> > > Apr 6 10:01:06 mbx11 dovecot: pop3-login: Login: user=<redac...@gol.com>, > > method=PLAIN, rip=xxx.xxx.x.46, lip=xxx.xxx.x.41, mpid=35447, secured, > > session= > > Apr 6 10:01:07 mbx11 dovecot: pop3(redac...@gol.com): Disconnected: Logged > > out top=0/0, retr=0/0, del=1/1, size=20674 session= > > Apr 6 10:01:07 mbx11 dovecot: imap(redac...@gol.com): Error: imap-master: > > Failed to import client state: Message count mismatch after handling > > expunges (0 != 1) > > Apr 6 10:01:07 mbx11 dovecot: imap(redac...@gol.com): Client state > > initialization failed in=0 out=0 head=<0> del=<0> exp=<0> trash=<0> > > session=<2jBV+HRM1Pbc9w8u> > > Apr 6 10:01:15 mbx11 dovecot: imap-login: Login: user=<redac...@gol.com>, > > method=PLAIN, rip=xxx.xxx.x.46, lip=xxx.xxx.x.113, mpid=993303, secured, > > session=<6QC6C3VMF/jc9w8u> > > Apr 6 10:07:42 mbx11 dovecot: imap-hibernate(redac...@gol.com): Connection > > closed in=85 out=1066 head=<0> del=<0> exp=<0> trash=<0> > > session=<6QC6C3VMF/jc9w8u> > > --- > > > > According to the dovecot ML, this is fixed in 2.2.28, so getting this into > > Debian and backports would be much appreciated. > > Unfortunately we are in the middle of the freeze for next stable, and > updating to 2.2.28 itself is not an option at this time I'm afraid. 
If > we pinpoint the fix however, we can always backport it to 2.2.27 and > have it released with Stretch. > Nods, I think I saw that freeze bit when glancing over the new package tracker. Would love to have that fixed, even though it seems to be (thankfully) mostly transparent to the actual (and few) users that are affected by it. > > > > See also: > > https://www.dovecot.org/pipermail/dovecot/2017-April/107668.html > > According to the message by Aki in the dovecot ML, this is fixed in > 1fd44e0634. However, 1fd44e0634 is already part of 2.2.26 (and 2.2.27 > of course), which complicates things a bit more: > > $ git describe --contains 1fd44e0634ac312d0960f39f9518b71e08248b65 > 2.2.26~318 > > So either the fix is incomplete, or you have some old processes lying > around. I can't think of anything else :) > I vote for the first option, since this is a fresh install that never saw anything but 2.2.27 from backports. I also saw the .26, .27 bits in the git excerpt but didn't check them in detail; though why did Aki suggest .28 when I stated that it's already .27 here? Will point the folks on the dovecot ML to this bug report, maybe Timo can make sense of it. Thanks, Christian > Regards, > Apollon > -- Christian Balzer, Network/Systems Engineer, ch...@gol.com Global OnLine Japan/Rakuten Communications http://www.gol.com/
Bug#859700: client state / Message count mismatch with imap-hibernate and mixed POP3/IMAP access
Package: dovecot-core Version: 1:2.2.27-2~bpo8+1 Severity: important Hello, I've been seeing a few of these since starting this cluster (see previous mail); they all follow the same pattern: a user who accesses their mailbox with both POP3 and IMAP deletes mails with POP3, and the IMAP side (imap-hibernate really) gets confused and upset about this: --- Apr 6 09:55:49 mbx11 dovecot: imap-login: Login: user=<redac...@gol.com>, method=PLAIN, rip=xxx.xxx.x.46, lip=xxx.xxx.x.113, mpid=951561, secured, session=<2jBV+HRM1Pbc9w8u> Apr 6 10:01:06 mbx11 dovecot: pop3-login: Login: user=<redac...@gol.com>, method=PLAIN, rip=xxx.xxx.x.46, lip=xxx.xxx.x.41, mpid=35447, secured, session= Apr 6 10:01:07 mbx11 dovecot: pop3(redac...@gol.com): Disconnected: Logged out top=0/0, retr=0/0, del=1/1, size=20674 session= Apr 6 10:01:07 mbx11 dovecot: imap(redac...@gol.com): Error: imap-master: Failed to import client state: Message count mismatch after handling expunges (0 != 1) Apr 6 10:01:07 mbx11 dovecot: imap(redac...@gol.com): Client state initialization failed in=0 out=0 head=<0> del=<0> exp=<0> trash=<0> session=<2jBV+HRM1Pbc9w8u> Apr 6 10:01:15 mbx11 dovecot: imap-login: Login: user=<redac...@gol.com>, method=PLAIN, rip=xxx.xxx.x.46, lip=xxx.xxx.x.113, mpid=993303, secured, session=<6QC6C3VMF/jc9w8u> Apr 6 10:07:42 mbx11 dovecot: imap-hibernate(redac...@gol.com): Connection closed in=85 out=1066 head=<0> del=<0> exp=<0> trash=<0> session=<6QC6C3VMF/jc9w8u> --- According to the dovecot ML, this is fixed in 2.2.28, so getting this into Debian and backports would be much appreciated. See also: https://www.dovecot.org/pipermail/dovecot/2017-April/107668.html Regards, Christian -- Christian Balzer, Network/Systems Engineer, ch...@gol.com Global OnLine Japan/Rakuten Communications http://www.gol.com/
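For context, imap-hibernate only comes into play when hibernation is enabled in dovecot's configuration; a minimal sketch of the relevant setting (the timeout value here is illustrative, not our production value):

```
# conf.d/20-imap.conf style fragment (sketch; value illustrative):
# idle IMAP sessions are handed off to the imap-hibernate process
# after this delay, which is what trips over the POP3-side expunges
imap_hibernate_timeout = 30s
```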
Bug#820958: Upgrade of 4.1 to 4.2 in Jessie forces the samba package to be installed and the daemons started (nagios plugins install)
Package: samba Version: 2:4.2.10+dfsg-0+deb8u1 Severity: normal Hello, the just-released security fix, and thus upgrade from Samba 4.1 to 4.2 in Jessie, introduces another potential security problem. Consider this (fairly common) scenario: A server isn't running samba at all, but nagios-plugins-standard was installed to monitor (NRPE) other services. nagios-plugins-standard pulls in samba-common (to get smbclient). So far so good; until now this didn't do anything dangerous and people most likely allowed all the dependencies/recommendations to be installed. However, this latest version of samba requires the actual samba package to be installed as well if samba-common is present, which of course will install the daemon binaries and start them, potentially exposing the server in question to attacks. A quick workaround is of course to un-install samba if one didn't need the functionality in the first place. But a re-packaging in the previous style, or at least a stern warning when pulling samba into a system that only had samba-common before, would be the correct way forward. Regards, Christian -- Christian Balzer, Network/Systems Engineer, ch...@gol.com Global OnLine Japan/Rakuten Communications http://www.gol.com/
Bug#795060: Latest Wheezy backport kernel prefers Infiniband mlx4_en over mlx4_ib, breaks existing installs
On Fri, 14 Aug 2015 13:01:03 +0200 Ben Hutchings wrote: On Fri, 2015-08-14 at 13:45 +0900, Christian Balzer wrote: [...] So I decided to downgrade the Mellanox firmware of mbx09. After building a current version of mstflint (the one in Wheezy is ancient, and not particularly justified in being so) I got the oldest firmware (2.32.5100) on the Supermicro FTP site for that mainboard and flashed it. Lo and behold, no more vanishing acts of the ib0: interface, no more need to blacklist/fake-install the mlx4_en module. While moderately happy with the outcome, I still consider this a kernel bug. All the described behavior is not only very unexpected and unwelcome, the fact that a remote card reboot can make your network stack vanish (and not re-appear unless done manually) is just wrong. [...] This sounds rather more like a firmware bug than a kernel bug. Please ask Mellanox technical support about this. If they can identify a fix in the driver then I'll be happy to apply that. Never mind that a firmware bug in my book is something that affects things locally; the fact that 3.2 is not affected, plus the items listed above, make it a kernel/upstream Mellanox bug. I'd appreciate it if you could use whatever means you have to kick this upstream as well. I'll send a mail to Mellanox support, which they may well ignore, this being a Supermicro OEM product, never mind that it's 100% identical. Regards, Christian -- Christian Balzer, Network/Systems Engineer, ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/
Bug#795060: Latest Wheezy backport kernel prefers Infiniband mlx4_en over mlx4_ib, breaks existing installs
On Tue, 11 Aug 2015 20:00:41 +0200 Ben Hutchings wrote: On Tue, 2015-08-11 at 10:38 +0900, Christian Balzer wrote: Hello Ben, thanks for the quick and detailed reply. On Mon, 10 Aug 2015 15:53:57 +0200 Ben Hutchings wrote: Control: severity -1 important Control: tag -1 upstream On Mon, 2015-08-10 at 13:52 +0900, Christian Balzer wrote: [...] I'm also not seeing this on several other machines we use for Ceph with the current Jessie kernel, but to be fair they use slightly different (QDR, not FDR) ConnectX-3 HBAs. If SR-IOV is enabled on the adapter then the ports will always operate in Ethernet mode as it's apparently not supported for IB. Perhaps SR -IOV enabled on some adapters but not others? I was wondering about that, but wasn't aware of the Ethernet only bit of SR-IOV. Anyway, the previous cluster and one blade of this new one have Mellanox firmware 2.30.8000, which doesn't offer the Flexboot Bios menu and thus have no SR-IOV configuration option at boot time. However the other blade (replacement mobo for a DoA one) in the new server does have firmware 2.33.5100 and the Flexboot menu and had SR-IOV enabled. Alas disabling it (and taking out the fake-install) did result in the same behavior, mlx4_en was auto-loaded before mlx4_ib. [...] I added that options mlx4_core port_type_array=1 (since there is only one port) to /etc/modprobe.d/local.conf, depmod -a, update-initramfs -u, but no joy. The mlx4_en module gets auto-loaded before the IB one as well with this setting. [...] There was a deliberate change in mlx4_core in Linux 3.15 to load mlx4_en first if it finds any Ethernet port. Interesting. So this _could_ have bitten me earlier with any flavor of 3.16 kernel if there had been an Ethernet port around. Again, given that a cluster with otherwise identical hardware doesn't do this leads me to assume that the presence of that Ethernet port stems from the 2.33.5100 firmware, no matter if SR-IOV is enabled or not. 
But that is separate from the decision of what types of port are configured. From where I'm standing it looks like it will use/configure mlx_en no matter what. And once the mlx4_en is loaded, mlx4_ib is no longer capable of creating IB ports. In fact it will even tear down the remote IB port and load mlx4_en if just one side changes from IB to EN. To wit, I had both nodes up with running ib0: interfaces (mlx4_en disabled via fake-install). I then commented out the fake-install on both and did a depmod -a. On node mbx09 (the one with the newer firmware) I then rmmod'ed mlx4_ib and mlx4_core. Then I modprobe'd mlx4_core: --- Aug 12 10:14:56 mbx09 mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014) Aug 12 10:14:56 mbx09 mlx4_core: Initializing :02:00.0 Aug 12 10:15:01 mbx09 mlx4_core :02:00.0: PCIe link speed is 8.0GT/s, device supports 8.0GT/s Aug 12 10:15:01 mbx09 mlx4_core :02:00.0: PCIe link width is x8, device supports x8 Aug 12 10:15:01 mbx09 mlx4_core :02:00.0: irq 124 for MSI/MSI-X Aug 12 10:15:01 mbx09 mlx4_core :02:00.0: irq 125 for MSI/MSI-X Aug 12 10:15:01 mbx09 mlx4_core :02:00.0: irq 126 for MSI/MSI-X Aug 12 10:15:01 mbx09 mlx4_core :02:00.0: irq 127 for MSI/MSI-X Aug 12 10:15:01 mbx09 mlx4_core :02:00.0: irq 128 for MSI/MSI-X Aug 12 10:15:01 mbx09 mlx4_core :02:00.0: irq 129 for MSI/MSI-X Aug 12 10:15:01 mbx09 mlx4_core :02:00.0: irq 130 for MSI/MSI-X Aug 12 10:15:01 mbx09 mlx4_core :02:00.0: irq 131 for MSI/MSI-X Aug 12 10:15:01 mbx09 mlx4_core :02:00.0: irq 132 for MSI/MSI-X Aug 12 10:15:01 mbx09 mlx4_core :02:00.0: irq 133 for MSI/MSI-X Aug 12 10:15:01 mbx09 mlx4_core :02:00.0: irq 134 for MSI/MSI-X Aug 12 10:15:01 mbx09 mlx4_core :02:00.0: irq 135 for MSI/MSI-X Aug 12 10:15:01 mbx09 mlx4_core :02:00.0: irq 136 for MSI/MSI-X Aug 12 10:15:01 mbx09 mlx4_core :02:00.0: irq 137 for MSI/MSI-X Aug 12 10:15:01 mbx09 mlx4_core :02:00.0: irq 138 for MSI/MSI-X Aug 12 10:15:01 mbx09 mlx4_core :02:00.0: irq 139 for MSI/MSI-X Aug 12 10:15:01 mbx09 
mlx4_core :02:00.0: irq 140 for MSI/MSI-X Aug 12 10:15:01 mbx09 mlx4_core :02:00.0: irq 141 for MSI/MSI-X Aug 12 10:15:01 mbx09 mlx4_core :02:00.0: irq 142 for MSI/MSI-X Aug 12 10:15:01 mbx09 mlx4_core :02:00.0: irq 143 for MSI/MSI-X Aug 12 10:15:01 mbx09 mlx4_core :02:00.0: irq 144 for MSI/MSI-X Aug 12 10:15:01 mbx09 mlx4_ib mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v2.2-1 (Feb 2014) Aug 12 10:15:01 mbx09 mlx4_en: Mellanox ConnectX HCA Ethernet driver v2.2-1 (Feb 2014) Aug 12 10:15:01 mbx09 mlx4_en :02:00.0: registered PHC clock Aug 12 10:15:01 mbx09 mlx4_en :02:00.0: Activating port:1 Aug 12 10:15:01 mbx09 mlx4_en: :02:00.0: Port 1: Using 192 TX rings Aug 12 10:15:01 mbx09 mlx4_en: :02:00.0: Port 1: Using 8 RX rings Aug 12 10:15:01 mbx09 mlx4_en: :02:00.0
Bug#795060: Latest Wheezy backport kernel prefers Infiniband mlx4_en over mlx4_ib, breaks existing installs
Hello Ben, thanks for the quick and detailed reply. On Mon, 10 Aug 2015 15:53:57 +0200 Ben Hutchings wrote: Control: severity -1 important Control: tag -1 upstream On Mon, 2015-08-10 at 13:52 +0900, Christian Balzer wrote: [...] I'm also not seeing this on several other machines we use for Ceph with the current Jessie kernel, but to be fair they use slightly different (QDR, not FDR) ConnectX-3 HBAs. If SR-IOV is enabled on the adapter then the ports will always operate in Ethernet mode as it's apparently not supported for IB. Perhaps SR -IOV enabled on some adapters but not others? I was wondering about that, but wasn't aware of the Ethernet only bit of SR-IOV. Anyway, the previous cluster and one blade of this new one have Mellanox firmware 2.30.8000, which doesn't offer the Flexboot Bios menu and thus have no SR-IOV configuration option at boot time. However the other blade (replacement mobo for a DoA one) in the new server does have firmware 2.33.5100 and the Flexboot menu and had SR-IOV enabled. Alas disabling it (and taking out the fake-install) did result in the same behavior, mlx4_en was auto-loaded before mlx4_ib. In all following tests I did reboot both nodes simultaneously, to avoid having one port in Ethernet mode forcing things on the other side. Also the newest QDR card for one of the Ceph cluster machines here does have that firmware, but behaves properly (no mlx4_en auto-load) with the latest Jessie kernel. If that's not the issue, it looks like you are supposed to set a module parameter in mlx4_core: port_type_array:Array of port types: HW_DEFAULT (0) is default 1 for IB, 2 for Ethernet (array of int) e.g.: options mlx4_core port_type_array=1,1 I added that options mlx4_core port_type_array=1 (since there is only one port) to /etc/modprobe.d/local.conf, depmod -a, update-initramfs -u, but no joy. The mlx4_en module gets auto-loaded before the IB one as well with this setting. So ultimately only the fake-install of mlx4_en provides a workaround. 
If you have anything else you would like to try let me know, this cluster will probably not go into production for another 2 weeks. I don't know what determines the hardware default. [...] Given that the previous version works as expected and that Jessie is doing the right thing as well, I'd consider this a critical bug. No, it is important (since it is a regression) but it is not critical. Fair enough. Had I rebooted the older production cluster with 500,000 users on it into this kernel, the results would not have been pretty. And that's why you tested on one machine first, right? Of course, but it would still have a) broken things (replication stopped) and b) taken me even more time to figure out what was going on and how to work around it, as I can't reboot that cluster willy-nilly. There simply is a very high expectation that a kernel update like this won't leave you dead in the water. Regards, Christian -- Christian Balzer, Network/Systems Engineer, ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/ -- To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Bug#795060: Latest Wheezy backport kernel prefers Infiniband mlx4_en over mlx4_ib, breaks existing installs
Package: linux-image-3.16.0-0.bpo.4-amd64
Version: 3.16.7-ckt11-1+deb8u2~bpo70+1
Severity: critical

Hello,

We have a 2 node Supermicro chassis (2028TP-DC0FR) with an onboard Mellanox ConnectX-3 HBA in production since last year. Both nodes are directly connected with a QSFP FDR cable. We use IPoIB (for DRBD) and thus load the mlx4_ib module and all the assorted other ones in /etc/modules at boot time. These are Wheezy machines, currently with the 3.16.7-ckt2-1~bpo70+1 kernel.

Last week we got another (identical) one of these chassis and I installed Wheezy as well (we need pacemaker, which is sorely lacking in Jessie). This was with the 3.16.7-ckt11-1+deb8u2~bpo70+1 kernel, and unlike in the past it proceeded to load the mlx4_en module automatically, created an eth2: interface, and the ib0: interface was nowhere to be found. This was not only very unexpected; I was also under the impression that mlx4_en and mlx4_ib could be used in parallel, but even though mlx4_ib was loaded it did not work (the /sys/class/net/ib0 entry was not created).

Booting into the stock Wheezy 3.2 kernel (which we also run on older machines with ConnectX-2 HBAs) resulted in the expected behavior: IB interface, no Ethernet. I'm also not seeing this on several other machines we use for Ceph with the current Jessie kernel, but to be fair they use slightly different (QDR, not FDR) ConnectX-3 HBAs.

After doing a fake-install (blacklisting didn't work) like this:
---
echo "install mlx4_en /bin/true" > /etc/modprobe.d/mlx4_en.conf
depmod -a
update-initramfs -u
---
and rebooting, I have IB running on 3.16.0-0.bpo.4-amd64 again as well.

Given that the previous version works as expected and that Jessie is doing the right thing as well, I'd consider this a critical bug. Had I rebooted the older production cluster with 500,000 users on it into this kernel, the results would not have been pretty.
Regards,

Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
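The fake-install trick in the report above works where blacklisting does not, because `blacklist` only suppresses alias-based autoloading, while an `install` rule replaces the load command itself. A minimal sketch of the two modprobe.d mechanisms (the file name is illustrative):

```shell
# /etc/modprobe.d/mlx4_en.conf -- two ways to keep mlx4_en from loading.

# 1. blacklist: only prevents loading via device/alias matching (udev);
#    the module can still be pulled in as a dependency or loaded by name.
# blacklist mlx4_en

# 2. fake-install: modprobe runs this command *instead of* inserting the
#    module, so even dependency and by-name loads become no-ops.
# install mlx4_en /bin/true

# Remember to propagate the change into the initramfs:
# depmod -a && update-initramfs -u
```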
Bug#773361: ceph: osd dies, something corrupts journal
Hello,

see my thread in the Ceph ML named: "OSD trashed by simple reboot (Debian Jessie, systemd?)"

I believe that upgrading to 0.80.9 will fix this problem.

Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
Bug#768922: Bug#768618: pacemaker: FTBFS in jessie: build-dependency not installable: libqb-dev (= 0.16.0.real)
Well... Meanwhile, here in what we tenuously call reality, one can observe the following things:

1. Pacemaker has been broken in Jessie for more than 2 months now.
2. Silence on this bug for more than one month.
3. Pacemaker was recently removed from Jessie.
4. The February 5th deadline is rapidly approaching, cue the laughingstock.

Between systemd and this gem, Jessie is shaping up to be the best Debian release ever...

Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
Bug#754341: Definitely a python conflict
I just downgraded a test machine running sid by wget-ing the following packages (unfortunately they are no longer in any package list):
---
libpython2.7_2.7.7-2_amd64.deb
python2.7_2.7.7-2_amd64.deb
libpython2.7-minimal_2.7.7-2_amd64.deb
python2.7-minimal_2.7.7-2_amd64.deb
libpython2.7-stdlib_2.7.7-2_amd64.deb
---
then putting them into a directory by themselves and running:
---
dpkg --install *
---
ceph (the command in any incarnation, not just "ceph -s") now works again, and the OSD on that machine unsurprisingly can be started again as well.

Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
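After a manual downgrade like the one above, apt will try to pull the newer versions right back in on the next upgrade run. A hedged sketch of holding the downgraded packages in place (these are standard apt-mark commands; whether a hold is appropriate long-term is a local policy call):

```shell
# Mark the manually downgraded packages "hold" so the next
# "apt-get upgrade" does not immediately reinstall 2.7.8.
# Undo later with: apt-mark unhold <packages>
# apt-mark hold python2.7 libpython2.7 python2.7-minimal \
#     libpython2.7-minimal libpython2.7-stdlib

# Verify which packages are currently held:
# apt-mark showhold
```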
Bug#754341: ceph command hangs in Jessie after update (python 2.7.8 suspected)
Package: ceph
Version: 0.80.1-1+b1
Severity: critical

Hello,

this is a Jessie machine, part of a Jessie based Ceph cluster. After doing a minor update today (40 odd packages) the ceph command hangs just before returning to the command prompt, like this:
---
# ceph -s
    cluster d6b84616-ff3e-4b04-b50b-bd398d7fa69a
     health HEALTH_OK
     monmap e1: 3 mons at {c-admin=10.0.0.10:6789/0,ceph-01=10.0.0.41:6789/0,ceph-02=10.0.0.42:6789/0}, election epoch 86, quorum 0,1,2 c-admin,ceph-01,ceph-02
     osdmap e1676: 4 osds: 4 up, 4 in
      pgmap v4135542: 1152 pgs, 3 pools, 699 GB data, 182 kobjects
            1358 GB used, 98819 GB / 100178 GB avail
                1152 active+clean
  client io 525 kB/s wr, 132 op/s
[hangs indefinitely until ctrl-c]
---
This update included python 2.7.8, as in:
---
Preparing to unpack .../python2.7_2.7.8-1_amd64.deb ...
Unpacking python2.7 (2.7.8-1) over (2.7.7-2) ...
---
So I suspect this is an incompatibility between ceph and python.

Needless to say that this is critical, as this hanging command will prevent normal operations, the start of monitors or OSDs; in short, it ruins one's day quite effectively.

Regards,

Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
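Until the underlying incompatibility is fixed, scripts that call `ceph` can at least be kept from wedging by wrapping the invocation in coreutils `timeout`. A minimal sketch (the helper name and the 30-second limit are arbitrary choices, not anything from ceph itself):

```shell
# Wrap a possibly-hanging command; coreutils timeout kills it after the
# given number of seconds and exits with status 124 in that case.
run_with_timeout() {
    t="$1"
    shift
    timeout "$t" "$@"
}

# Usage sketch: run_with_timeout 30 ceph -s
```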
Bug#729961: qemu-system-x86: rbd support
Hello,

On Sun, 13 Apr 2014 19:04:45 +0400 Michael Tokarev wrote:

09.01.2014 10:53, Christian Balzer wrote:

Hello, Meanwhile, we're at qemu 1.7, ceph is at 0.72, both in sid and wheezy-backports. I'd really, really love to see RBD re-enabled by default for these and when things have trickled into jessie.

I just removed librbd support in qemu once again, because the same old issue - lack of library/symbol versioning - which prevented ceph from going into wheezy is _still_ not fixed. Because once I uploaded rbd-enabled qemu, I received a new grave bug report against qemu-system which tells me that running the qemu-system binary results in the dynamic linker not finding the symbol rbd_aio_flush or some other.

Where is that NEW bug report? Would that be the OLD one, https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=679686, you merged things with?

And now I need a _really_ good reason to re-enable it again, because, well, guys, this is not funny at all.

Firstly, I am running 1.7.0+dfsg-6 under Jessie and it works just fine. As did source-built versions with RBD enabled in the past, either by using the Inktank Ceph packages or, since 0.72.x entered sid and then jessie, those.

Secondly, Bug 679686 is about something that is not true in Jessie, as it contains Ceph 0.72.2 (at this time). That version is not in wheezy-backports yet, and for that particular case it would be true. However, for Jessie it most emphatically isn't.

If there is indeed a bug (new or still present) with jessie, fine. If this is about wheezy-backports, since when do backports block packages from entering testing?

Lastly, I will make the Ceph developers aware of this.

Regards,

Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
Bug#729961: qemu-system-x86: rbd support
On Thu, 9 Jan 2014 10:19:18 -0800 Vagrant Cascadian wrote:

On Thu, Jan 09, 2014 at 03:53:14PM +0900, Christian Balzer wrote:

Meanwhile, we're at qemu 1.7, ceph is at 0.72, both in sid and wheezy-backports.

It doesn't appear to be in wheezy-backports, and is still having troubles migrating to jessie/testing:

Argh, my bad. I clearly had too many parallel browser windows open yesterday when researching this, and saw not the wheezy-backport but the sid one for ceph. ^_^;;

rmadison ceph
 ceph | 0.48-2   | jessie | source, amd64, armel, armhf, i386, ia64, mips, mipsel, powerpc, s390x, sparc
 ceph | 0.48-2   | sid    | source
 ceph | 0.72.2-1 | sid    | source, amd64, armel, armhf, i386, ia64, mips, mipsel, powerpc, s390x, sparc

grep-excuses ceph
 ceph (0.48-2 to 0.72.2-1)
  Maintainer: Ceph Maintainers
  7 days old (needed 5 days)
  out of date on i386: ceph-fuse, ceph-fuse-dbg (from 0.48-2)
  ...
  out of date on ia64: ceph-fuse, ceph-fuse-dbg (from 0.48-2) (but ia64 isn't keeping up, so nevermind)
  Updating ceph fixes old bugs: #705262
  Not considered
  Depends: ceph google-perftools (not considered)

Doesn't really seem like it's ready yet...

Unfortunately, though nobody in their right mind would use ceph-fuse anyway. ^o^

Well, let's hope this gets resolved in time for jessie; it would be a shame otherwise.

Thanks,

Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
Bug#729961: qemu-system-x86: rbd support
Hello,

Meanwhile, we're at qemu 1.7, ceph is at 0.72, both in sid and wheezy-backports. I'd really, really love to see RBD re-enabled by default for these and when things have trickled into jessie.

Regards,

Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
Bug#719675: [Pkg-libvirt-maintainers] Bug#719675: Live migration of KVM guests fails if it takes more than 30 seconds (large memory guests)
On Thu, 15 Aug 2013 08:16:02 +0200 Guido Günther wrote: On Thu, Aug 15, 2013 at 09:35:09AM +0900, Christian Balzer wrote: On Wed, 14 Aug 2013 21:50:22 +0200 Guido Günther wrote: On Wed, Aug 14, 2013 at 04:49:42PM +0900, Christian Balzer wrote: Package: libvirt0 Version: 0.9.12-11+deb7u1 Severity: important Hello, when doing a live migration using Pacemaker (the OCF VirtualDomain RA) on a cluster with DRBD (active/active) backing storage everything works fine with recently started (small memory footprint of about 200MB at most) KVM guests. After inflating one guest to 2GB memory usage (memtester comes in handy for that) the migration failed after 30 seconds, having managed to migrate about 400MB in that time over the direct, dedicated GbE link between my test cluster host nodes. libvirtd.log on the migration target node, migration start time is 07:24:51 : --- 2013-08-13 07:24:51.807+: 31953: warning : qemuDomainObjEnterMonitorInternal :994 : This thread seems to be the async job owner; entering monitor without ask ing for a nested job is dangerous 2013-08-13 07:24:51.886+: 31953: warning : qemuDomainObjEnterMonitorInternal :994 : This thread seems to be the async job owner; entering monitor without ask ing for a nested job is dangerous 2013-08-13 07:24:51.888+: 31953: warning : qemuDomainObjEnterMonitorInternal :994 : This thread seems to be the async job owner; entering monitor without ask ing for a nested job is dangerous 2013-08-13 07:24:51.948+: 31953: warning : qemuDomainObjEnterMonitorInternal :994 : This thread seems to be the async job owner; entering monitor without ask ing for a nested job is dangerous 2013-08-13 07:24:51.948+: 31953: warning : qemuDomainObjEnterMonitorInternal :994 : This thread seems to be the async job owner; entering monitor without ask ing for a nested job is dangerous 2013-08-13 07:25:21.217+: 31950: warning : virKeepAliveTimer:182 : No response from client 0x1948280 after 5 keepalive messages in 30 seconds 2013-08-13 07:25:31.224+: 
31950: warning : qemuProcessKill:3813 : Timed out waiting after SIGTERM to process 15926, sending SIGKILL

This looks more like you're not replying via the keepalive protocol. What are you using to migrate VMs?
-- Guido

As I said up there, the Pacemaker (heartbeat, OCF really) resource agent, with SSH as transport (and only) option.

This is not telling me how this is done within pacemaker. RHCS used to do this with virsh internally. I'll check the sources once I get around to it.

Sorry, I was assuming some familiarity with this resource agent. It indeed creates a virsh command line internally; the relevant code for this case is basically:
---
# Find out the remote hypervisor to connect to. That is, turn
# something like qemu://foo:/system into
# qemu+tcp://bar:/system
if [ -n "${OCF_RESKEY_migration_transport}" ]; then
    transport_suffix="+${OCF_RESKEY_migration_transport}"
fi
---
The above defines the transport, ssh in my case. And then later:
---
# Scared of that sed expression? So am I. :-)
remoteuri=$(echo "${OCF_RESKEY_hypervisor}" | \
    sed -e "s,\(.*\)://[^/:]*\(:\?[0-9]*\)/\(.*\),\1${transport_suffix}://${target_node}\2/\3,")

# OK, we know where to connect to. Now do the actual migration.
ocf_log info "$DOMAIN_NAME: Starting live migration to ${target_node} (using remote hypervisor URI ${remoteuri} ${migrateuri})."
virsh ${VIRSH_OPTIONS} migrate --live $DOMAIN_NAME ${remoteuri} ${migrateuri}
rc=$?
---
In my case the migrateuri is empty as I didn't define anything; I thus left out the code that would potentially define it.

Hope that helps,

Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
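For the curious, the resource agent's sed rewrite can be exercised in isolation; a small sketch with placeholder values (foo, bar and ssh are just examples standing in for the OCF_RESKEY_* parameters):

```shell
# Reproduce the RA's URI rewrite standalone: turn qemu://foo:/system into
# qemu+ssh://bar:/system by inserting a transport suffix and swapping the
# host, while keeping scheme, port and path intact.
hypervisor_uri="qemu://foo:/system"   # what OCF_RESKEY_hypervisor would hold
transport_suffix="+ssh"               # from OCF_RESKEY_migration_transport
target_node="bar"

remoteuri=$(echo "${hypervisor_uri}" | \
    sed -e "s,\(.*\)://[^/:]*\(:\?[0-9]*\)/\(.*\),\1${transport_suffix}://${target_node}\2/\3,")
echo "$remoteuri"   # -> qemu+ssh://bar:/system
```

Note the `\?` optional-match is a GNU sed extension to basic regular expressions, so this sketch assumes GNU sed.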
Bug#719675: Live migration of KVM guests fails if it takes more than 30 seconds (large memory guests)
Package: libvirt0 Version: 0.9.12-11+deb7u1 Severity: important Hello, when doing a live migration using Pacemaker (the OCF VirtualDomain RA) on a cluster with DRBD (active/active) backing storage everything works fine with recently started (small memory footprint of about 200MB at most) KVM guests. After inflating one guest to 2GB memory usage (memtester comes in handy for that) the migration failed after 30 seconds, having managed to migrate about 400MB in that time over the direct, dedicated GbE link between my test cluster host nodes. libvirtd.log on the migration target node, migration start time is 07:24:51 : --- 2013-08-13 07:24:51.807+: 31953: warning : qemuDomainObjEnterMonitorInternal :994 : This thread seems to be the async job owner; entering monitor without ask ing for a nested job is dangerous 2013-08-13 07:24:51.886+: 31953: warning : qemuDomainObjEnterMonitorInternal :994 : This thread seems to be the async job owner; entering monitor without ask ing for a nested job is dangerous 2013-08-13 07:24:51.888+: 31953: warning : qemuDomainObjEnterMonitorInternal :994 : This thread seems to be the async job owner; entering monitor without ask ing for a nested job is dangerous 2013-08-13 07:24:51.948+: 31953: warning : qemuDomainObjEnterMonitorInternal :994 : This thread seems to be the async job owner; entering monitor without ask ing for a nested job is dangerous 2013-08-13 07:24:51.948+: 31953: warning : qemuDomainObjEnterMonitorInternal :994 : This thread seems to be the async job owner; entering monitor without ask ing for a nested job is dangerous 2013-08-13 07:25:21.217+: 31950: warning : virKeepAliveTimer:182 : No response from client 0x1948280 after 5 keepalive messages in 30 seconds 2013-08-13 07:25:31.224+: 31950: warning : qemuProcessKill:3813 : Timed out waiting after SIGTERM to process 15926, sending SIGKILL --- Below is the only thing I could find which is somewhat related to this, unfortunately it was cured by the miracle that is the next 
version upgrade, without the root cause being found:

https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=816451

I will install Sid on another test cluster tomorrow and am betting that it will work just fine there. Since Testing is still at the same level as Wheezy, I'm also betting that we won't see anything in wheezy-backports anytime soon. I'd really rather not create a production cluster based on Jessie or do those rather complex backports myself...

Regards,

Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
Bug#719675: [Pkg-libvirt-maintainers] Bug#719675: Live migration of KVM guests fails if it takes more than 30 seconds (large memory guests)
On Wed, 14 Aug 2013 21:50:22 +0200 Guido Günther wrote: On Wed, Aug 14, 2013 at 04:49:42PM +0900, Christian Balzer wrote: Package: libvirt0 Version: 0.9.12-11+deb7u1 Severity: important Hello, when doing a live migration using Pacemaker (the OCF VirtualDomain RA) on a cluster with DRBD (active/active) backing storage everything works fine with recently started (small memory footprint of about 200MB at most) KVM guests. After inflating one guest to 2GB memory usage (memtester comes in handy for that) the migration failed after 30 seconds, having managed to migrate about 400MB in that time over the direct, dedicated GbE link between my test cluster host nodes. libvirtd.log on the migration target node, migration start time is 07:24:51 : --- 2013-08-13 07:24:51.807+: 31953: warning : qemuDomainObjEnterMonitorInternal :994 : This thread seems to be the async job owner; entering monitor without ask ing for a nested job is dangerous 2013-08-13 07:24:51.886+: 31953: warning : qemuDomainObjEnterMonitorInternal :994 : This thread seems to be the async job owner; entering monitor without ask ing for a nested job is dangerous 2013-08-13 07:24:51.888+: 31953: warning : qemuDomainObjEnterMonitorInternal :994 : This thread seems to be the async job owner; entering monitor without ask ing for a nested job is dangerous 2013-08-13 07:24:51.948+: 31953: warning : qemuDomainObjEnterMonitorInternal :994 : This thread seems to be the async job owner; entering monitor without ask ing for a nested job is dangerous 2013-08-13 07:24:51.948+: 31953: warning : qemuDomainObjEnterMonitorInternal :994 : This thread seems to be the async job owner; entering monitor without ask ing for a nested job is dangerous 2013-08-13 07:25:21.217+: 31950: warning : virKeepAliveTimer:182 : No response from client 0x1948280 after 5 keepalive messages in 30 seconds 2013-08-13 07:25:31.224+: 31950: warning : qemuProcessKill:3813 : Timed out waiting after SIGTERM to process 15926, sending SIGKILL This looks more 
like you're not replying via the keepalive protocol. What are you using to migrate VMs?
-- Guido

As I said up there, the Pacemaker (heartbeat, OCF really) resource agent, with SSH as transport (and only) option. So the resulting migration URI should be something like:

qemu+ssh://targethost/system

Of course with properly distributed authorized_keys; again, this works just fine with a small enough guest. If there wasn't proper two-way communication going on, shouldn't the migration fail from the start?

[snip]

Regards,

Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
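The "No response from client ... after 5 keepalive messages in 30 seconds" warning comes from libvirt's client keepalive mechanism. As a workaround sketch: keepalive_interval and keepalive_count are real libvirtd.conf settings, but the values shown are arbitrary assumptions, and whether relaxing them cures this particular migration abort is untested here.

```shell
# /etc/libvirt/libvirtd.conf -- relax the keepalive policy so a peer that
# is busy migrating is not declared dead after interval * count seconds:
# keepalive_interval = 10   # seconds between keepalive probes
# keepalive_count = 10      # unanswered probes tolerated before dropping

# Then restart the daemon for the change to take effect:
# service libvirtd restart
```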
Bug#576901: init.d script fails under Squeeze with insserv due to lack of run level definitions
Oh, one more thing to make it _really_ work. ^.^

Before insserv times, drbd was started before HA (heartbeat) and stopped after it. Without the additional X- lines below it will stop in parallel with HA and thus lead to all kinds of nastiness and fireworks. Please consider this for your final fix, or alternatively get the HA people to include drbd in their Should- sections.
---
### BEGIN INIT INFO
# Provides:          drbd
# Required-Start:    $local_fs $network $syslog
# Required-Stop:     $local_fs $network $syslog
# Should-Start:      sshd multipathd
# Should-Stop:       sshd multipathd
# X-Start-Before:    HA
# X-Stop-After:      HA
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Control drbd resources.
### END INIT INFO
---
Regards,

Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
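As an aside, a header like the one above can also be supplied without patching the packaged init script, via insserv's overrides directory; a hedged sketch (path as documented for insserv, untested against this particular drbd package):

```shell
# /etc/insserv/overrides/drbd -- insserv prefers an LSB header found here
# over the one embedded in /etc/init.d/drbd, so local ordering tweaks
# survive package upgrades:
### BEGIN INIT INFO
# Provides:          drbd
# Required-Start:    $local_fs $network $syslog
# Required-Stop:     $local_fs $network $syslog
# X-Start-Before:    HA
# X-Stop-After:      HA
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Control drbd resources.
### END INIT INFO

# Re-run insserv afterwards so the rc?.d links are regenerated:
# insserv drbd
```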
Bug#576901: init.d script fails under Squeeze with insserv due to lack of run level definitions
Hello,

On Sun, 9 May 2010 20:31:41 +0200 gregor herrmann wrote:

On Thu, 08 Apr 2010 13:39:25 +0900, Christian Balzer wrote:

Package: drbd8-utils
Version: 2:8.3.7-1
Severity: serious

This incarnation of drbd8-utils has missing run level definitions in the INIT INFO section of the init.d script and

Hm, 2:8.3.7-1 seems to have the header (cf. also debian/changelog), did you mean to report this bug against older versions?

Nope, this was a fresh Squeeze install. Since my Sid test servers have been upgraded from at least Etch times back and are thus not really trustworthy in this regard, I just installed drbd8-utils on a fresh Sid box, and still got the same result:
---
### BEGIN INIT INFO
# Provides:          drbd
# Required-Start:    $local_fs $network $syslog
# Required-Stop:     $local_fs $network $syslog
# Should-Start:      sshd multipathd
# Should-Stop:       sshd multipathd
# Default-Start:
# Default-Stop:
# Short-Description: Control drbd resources.
---
No Default-Start or Default-Stop definitions...

Regards,

Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
Bug#576901: init.d script fails under Squeeze with insserv due to lack of run level definitions
Package: drbd8-utils
Version: 2:8.3.7-1
Severity: serious

This incarnation of drbd8-utils has missing run level definitions in the INIT INFO section of the init.d script and thus does not get included when insserv does its magic. Read: drbd is never started during bootup on Squeeze.

Changing the void in Default-Start/Stop to the levels below and running "insserv drbd" again fixed things:

### BEGIN INIT INFO
# Provides:          drbd
# Required-Start:    $local_fs $network $syslog
# Required-Stop:     $local_fs $network $syslog
# Should-Start:      sshd multipathd
# Should-Stop:       sshd multipathd
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Control drbd resources.
### END INIT INFO

Regards,

Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
Bug#553503: And another one
Hello, definitely seems to be happening around group_list, that looks rather messed up down there. --- batzmaru:~# gdb -c /tmp/exim4.core.1259874378.25229 GNU gdb 6.8-debian Copyright (C) 2008 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type show copying and show warranty for details. This GDB was configured as x86_64-linux-gnu. (no debugging symbols found) Core was generated by `/usr/sbin/exim4 -Mc 1NGIsU-0006Yu-NU'. Program terminated with signal 11, Segmentation fault. [New process 25229] #0 0x0041e1fa in ?? () (gdb) bt full #0 0x0041e1fa in ?? () No symbol table info available. (gdb) symbol-file /usr/lib/debug/usr/sbin/exim4 Reading symbols from /usr/lib/debug/usr/sbin/exim4...done. (gdb) bt full #0 exim_setugid (uid=0, gid=103, igflag=0, msg=0x49d0b2 Address 0x49d0b2 out of bounds) at exim.c:539 euid = value optimized out egid = value optimized out #1 0x0042213a in main (argc=3, cargv=0x7fffa93b4b08) at exim.c:3200 arg_receive_timeout = -1 arg_smtp_receive_timeout = -1 arg_error_handling = 0 filter_sfd = value optimized out filter_ufd = value optimized out group_count = 1 i = 0 list_queue_option = 0 msg_action = 0 msg_action_arg = 2 namelen = value optimized out queue_only_reason = value optimized out perl_start_option = 0 recipients_arg = 3 sender_address_domain = 0 test_retry_arg = -1 test_rewrite_arg = -1 arg_queue_only = 0 bi_option = 0 checking = 0 count_queue = 0 expansion_test = 0 extract_recipients = 0 forced_delivery = 0 f_end_dot = 0 deliver_give_up = 0 list_queue = 0 list_options = 0 local_queue_only = value optimized out more = value optimized out one_msg_action = 0 queue_only_set = 0 sender_ident_set = 0 session_local_queue_only = value optimized out unprivileged = 0 removed_privilege = value optimized out usage_wanted = value optimized out verify_address_mode = 0 
verify_as_sender = 0 version_printed = 0 alias_arg = (uschar *) 0x0 called_as = (uschar *) 0x4be89f Address 0x4be89f out of bounds start_queue_run_id = (uschar *) 0x0 stop_queue_run_id = (uschar *) 0x0 expansion_test_message = (uschar *) 0x0 ftest_domain = (uschar *) 0x0 ftest_localpart = (uschar *) 0x0 ftest_prefix = (uschar *) 0x0 ftest_suffix = (uschar *) 0x0 real_sender_address = value optimized out originator_home = value optimized out reset_point = value optimized out pw = (struct passwd *) 0x6d15a0 statbuf = {st_dev = 13, st_ino = 574, st_nlink = 1, st_mode = 8630, st_uid = 0, st_gid = 0, pad0 = 0, st_rdev = 259, st_size = 0, st_blksize = 4096, st_blocks = 0, st_atim = {tv_sec = 1259088058, tv_nsec = 506449560}, st_mtim = {tv_sec = 1259088058, tv_nsec = 506449560}, st_ctim = {tv_sec = 1259088058, tv_nsec = 634448800}, __unused = {0, 0, 0}} passed_qr_pid = 0 passed_qr_pipe = -1 group_list = {103, 0 repeats 62757 times, 2847059989, 32767, 0, 0, 2849139040, 32767, 2771834999, 32767, 2847034884, 32767, 0 repeats 16 times, 1, 0 repeats 33 times, 2847059989, 32767, 0, 0, 2849139040, 32767, 2773942719, 32767, 2847034884, 32767, 0 repeats 16 times, 1, 0 repeats 41 times, 2847059989, 32767, 0, 0, 2849139040, 32767, 2776079430, 32767, 2847034884, 32767, 0 repeats 16 times, 1, 0 repeats 41 times, 2847059989, 32767, 0, 0, 2849139040, 32767, 2778334571, 32767, 2847034884, 32767, 0 repeats 16 times, 1, 0 repeats 29 times, 2839234176, 32767, 2839234288, 32767, 40, 0, 2773935248, 32767, 0, 0, 2848057752, 32767, 2847054253, 32767, 0, 0, 2847034884, 32767, 0, 0, 2847056758, 32767, 2839234176, 32767, 2847054192, 32767, 2839234239, 32767, 2839234224, 32767, 2839234216, 32767, 2849217336, 32767, 1, 0, 0, 0, 0, 0, 2771834999, 32767, 2380267520, 4294922870, 40, 0, 2773935248, 32767, 0, 0, 2848057752, 32767, 1077936128, 4294922870, 1186332672, 4294923109, 0, 0, 0, 0, 2839234176, 32767, 2839234288, 32767, 0...} rsopts = Cannot access memory at address 0x49e740 --- Regards, Christian 
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
Bug#553503: Got a core
Hello there, I hope I've done the backtrace correctly, looks not so useful to me at least with all the values optimized out. I got the core here so if somebody can clue me in about how to get more info out of gdb, please do so. --- mb14:~# gdb -se /usr/lib/debug/usr/sbin/exim4 -c /tmp/exim4.core.1259801287.395 GNU gdb 6.8-debian Copyright (C) 2008 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type show copying and show warranty for details. This GDB was configured as x86_64-linux-gnu... warning: Unable to find dynamic linker breakpoint function. GDB will be unable to debug shared library initializers and track explicitly loaded dynamic code. Core was generated by `/usr/sbin/exim4 -Mc 1NFzrb-6M-DK'. Program terminated with signal 11, Segmentation fault. [New process 395] #0 main (argc=3, cargv=0x7fff949d3d28) at exim.c:1296 1296exim.c: No such file or directory. 
in exim.c (gdb) bt #0 main (argc=3, cargv=0x7fff949d3d28) at exim.c:1296 (gdb) bt full #0 main (argc=3, cargv=0x7fff949d3d28) at exim.c:1296 arg_receive_timeout = 0 arg_smtp_receive_timeout = 4812357 arg_error_handling = value optimized out filter_sfd = value optimized out filter_ufd = value optimized out group_count = value optimized out i = value optimized out list_queue_option = value optimized out msg_action = value optimized out msg_action_arg = value optimized out namelen = 0 queue_only_reason = value optimized out perl_start_option = value optimized out recipients_arg = value optimized out sender_address_domain = value optimized out test_retry_arg = value optimized out test_rewrite_arg = value optimized out arg_queue_only = value optimized out bi_option = value optimized out checking = value optimized out count_queue = value optimized out expansion_test = value optimized out extract_recipients = value optimized out forced_delivery = value optimized out f_end_dot = value optimized out deliver_give_up = value optimized out list_queue = value optimized out list_options = value optimized out local_queue_only = value optimized out more = value optimized out one_msg_action = value optimized out queue_only_set = value optimized out sender_ident_set = value optimized out session_local_queue_only = value optimized out unprivileged = value optimized out removed_privilege = value optimized out usage_wanted = value optimized out verify_address_mode = value optimized out verify_as_sender = value optimized out version_printed = value optimized out alias_arg = value optimized out called_as = value optimized out start_queue_run_id = value optimized out stop_queue_run_id = value optimized out expansion_test_message = value optimized out ftest_domain = value optimized out ftest_localpart = value optimized out ftest_prefix = value optimized out ftest_suffix = value optimized out real_sender_address = value optimized out originator_home = value optimized out reset_point = value 
optimized out pw = value optimized out statbuf = {st_dev = 140735686720624, st_ino = 140735686720568, st_nlink = 4131212846, st_mode = 2493332512, st_uid = 32767, st_gid = 0, pad0 = 0, st_rdev = 140735751927214, st_size = 0, st_blksize = 140735752954696, st_blocks = 140733193388033, st_atim = { tv_sec = 0, tv_nsec = 140733193388033}, st_mtim = { tv_sec = 140735754040496, tv_nsec = 37}, st_ctim = {tv_sec = 4294967295, tv_nsec = 6797900016}, __unused = {140735752954696, 140735754093400, 140735686720672}} passed_qr_pid = value optimized out passed_qr_pipe = value optimized out group_list = Cannot access memory at address 0x7fff94993960 --- The core itself is 1.2MB, I could attach it to a mail or put it someplace accessible as well if that helps. Regards, Christian -- Christian BalzerNetwork/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/ -- To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Bug#553503: Confirmed here as well, suspect kernel interaction
Package: exim4
Version: 4.69-9
Followup-For: Bug #553503

I have been seeing this for a while, esp. when stress testing 2 new mailbox servers here. There seem to be no lost mails, and the segfault below happened during a queue run (-q2m) on a pretty idle box, with no mails being delivered for hours beforehand. Which pretty much rules out corrupted database files. Seen this with both heavy and light daemon flavors. No problems on the remaining Etch machines.
---
Nov 17 20:57:35 mb14 kernel: [2956900.869830] exim4[15676]: segfault at 7fff6f022504 ip 0041e95c sp 7fff6f0224d0 error 6 in exim4[40+c8000]
---
My architecture is, as with the other 2 reporters, amd64 (x86_64) on 2-8 core SMP machines, all running Lenny. I use custom kernels, predominantly 2.6.27.latest or 2.6.latest. However, one decently busy machine running a 2.6.24.7 kernel NEVER exhibited this problem, leading me to suspect that there is some interaction between a kernel feature after 2.6.24 and exim 4.69-9.

Regards,

Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
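The per-process core files examined earlier in this bug log (e.g. /tmp/exim4.core.1259874378.25229) look like they came from a core_pattern of the form /tmp/%e.core.%t.%p. A sketch of enabling such dumps for a crashing daemon; the pattern is an assumption inferred from those file names, and %e, %t, %p are standard kernel core_pattern specifiers:

```shell
# Allow core dumps and have the kernel name them
# <executable>.core.<unix-time>.<pid> under /tmp:
# ulimit -c unlimited      # in the daemon's init script or invoking shell
# echo "/tmp/%e.core.%t.%p" > /proc/sys/kernel/core_pattern

# Afterwards, load a core with matching debug symbols, e.g.:
# gdb -se /usr/lib/debug/usr/sbin/exim4 -c /tmp/exim4.core.<time>.<pid>
```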
Bug#508135: 1.4.9a-3 security upgrade breaks (with) plugins
Hello Thijs,

On Mon, 8 Dec 2008 13:02:59 +0100 (CET) Thijs Kinkhorst wrote:
[...]
> > Only after disabling the check quota plugin as well (which does/did
> > NOT require any patching nor has seen any updates since the original
> > install) full rendering was restored.
> I cannot reproduce this with a plain installation of the squirrelmail
> Debian package and adding the check_quota plugin to that. The fact that
> you get a completely blank page suggests to me that you get PHP errors
> but that display_errors is turned off.

Indeed it was, and I even added a nice log file for PHP errors back when this change was forced upon us. And obviously promptly forgot about it, too.

> I suggest you look into your error log to see what the PHP errors are,
> and whether they are caused by the change that the security update
> brought. Please let me know what you find.

This is what I found:

[08-Dec-2008 22:03:39] PHP Fatal error: Call to undefined function sq_change_text_domain() in /usr/share/squirrelmail/plugins/check_quota/functions.php on line 761
[08-Dec-2008 22:03:39] PHP Fatal error: Call to undefined function get_current_hook_name() in /usr/share/squirrelmail/plugins/check_quota/functions.php on line 149

The first one is in the folder frame, the second in the message frame.

> Could it be that

Yes, of course. The evil compatibility plugin raises its rather functional head again. After re-applying that patch things are working again.

It might be a very, very good idea to mention re-running all patches needed by plugins, and explicitly the compatibility one, during installation. Especially the compatibility one tends to be rather invisible.

Case closed, and another reason to pine for 1.5, I guess. ^^

> > Tough choice between an insecure or a crippled webmail interface
> > here...
> That seems to be a false dilemma, since you could as well disable the
> specific plugin causing the trouble.

Oh, I disabled it alright. ^^ The problem is of course that these 2 plugins are rather essential to the functionality we strive to provide here.
Thanks for looking into this; the display_errors hint was all I needed to figure it out.

Christian
--
Christian Balzer        Network/Systems Engineer       NOC
[EMAIL PROTECTED]       Global OnLine Japan/Fusion Network Services
http://www.gol.com/
https://secure3.gol.com/mod-pl/ols/index.cgi/?intr_id=F-2ECXvzcr6656
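As a hedged illustration of how this dependency could have been spotted up front: the two undefined functions in the PHP fatal errors quoted earlier are supplied by the compatibility plugin, so grepping the plugin tree for them flags every plugin that will fatal without it. The plugin directory is the Debian default path, and the fallback message is my own; this is a sketch, not a squirrelmail-provided tool.

```shell
# Sketch: list plugins calling functions that only the "compatibility"
# plugin provides (function names taken from the PHP fatal errors).
plugdir="${PLUGDIR:-/usr/share/squirrelmail/plugins}"
grep -rl -e 'sq_change_text_domain' -e 'get_current_hook_name' "$plugdir" 2>/dev/null \
  || echo "no compatibility-plugin calls found under $plugdir"
```

Run against a real installation, any file listed belongs to a plugin that needs the compatibility plugin installed (and patched in) first.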
Bug#508135: 1.4.9a-3 security upgrade breaks (with) plugins
Package: squirrelmail
Version: 1.4.9a-3

Hello,

That will teach me to install security updates in a timely manner. ;-)

Standard SM installation with 3 plugins enabled:

$plugins[0] = 'select_language';
$plugins[1] = 'spam_buttons';
$plugins[2] = 'check_quota';

After the upgrade the select_language plugin still works and logging in works fine. Alas, nothing is rendered in either the folder frame or the message frame; just the menus are present. Selecting the display preferences results in a long wait and ultimately a totally blank browser window.

Now, spam_buttons requires a minor patch, so some breakage was to be expected, but nothing of this scale. Disabling spam_buttons made the display preferences menu work again, but still no love from either the folder or the message frame. Only after disabling the check_quota plugin as well (which does/did NOT require any patching, nor has it seen any updates since the original install) was full rendering restored.

Tough choice between an insecure or a crippled webmail interface here...

Regards,

Christian
--
Christian Balzer        Network/Systems Engineer       NOC
[EMAIL PROTECTED]       Global OnLine Japan/Fusion Network Services
http://www.gol.com/
https://secure3.gol.com/mod-pl/ols/index.cgi/?intr_id=F-2ECXvzcr6656
Bug#495511: drbd-utils needs to be heartbeat aware
Package: drbd8-utils
Version: 2:8.0.13-1

Hello,

This really applies to all versions, but I guess getting it fixed in sid/lenny will be the way to do it.

Every time the drbd8-utils package gets updated, it happily smashes the ownership and permissions of drbdsetup and drbdmeta that are needed to work with heartbeat (dopd). You know, these happy messages from the cluster resource manager after upgrading drbd8-utils:

---
You are using the 'drbd-peer-outdater' as outdate-peer program. If you use that mechanism the dopd heartbeat plugin program needs to be able to call drbdsetup and drbdmeta with root privileges.

You need to fix this with these commands:
chgrp haclient /sbin/drbdsetup
chmod o-x /sbin/drbdsetup
chmod u+s /sbin/drbdsetup
chgrp haclient /sbin/drbdmeta
chmod o-x /sbin/drbdmeta
chmod u+s /sbin/drbdmeta
---

I'd reckon the majority of serious drbd users utilize heartbeat to manage their drbd resources and are thus potentially in for a rude awakening if they are not aware of this in advance.

Solutions would be either a debconf option to always set these ownerships and permissions to the correct values, or to check the state of these 2 binaries in preinst and then reapply the same settings in postinst.

Of course, the current postinst behaviour of trying to stop the drbd resources and reload the module is also not very cooperative (or successful) with heartbeat on top of things and the resource likely mounted. Printing out dire warnings about wanting matching drbd module and util versions is one thing (the upstream drbd maintainer btw stated that running a higher version util with a lower version module should be safe), but trying to pull the rug out from under a running system is... rude.
;)

Regards,

Christian
--
Christian Balzer        Network/Systems Engineer       NOC
[EMAIL PROTECTED]       Global OnLine Japan/Fusion Network Services
http://www.gol.com/
https://secure3.gol.com/mod-pl/ols/index.cgi/?intr_id=F-2ECXvzcr6656
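For what it's worth, a hedged sketch of how such settings can be made upgrade-proof on Debian without any package changes: `dpkg-statoverride` records the desired owner/group/mode so dpkg itself reapplies them whenever drbd8-utils is upgraded. Mode 4754 with owner root:haclient is exactly what the chgrp/chmod sequence above produces from a standard 755 binary; the /sbin paths are the stock install locations. The scratch-file demo just verifies the mode arithmetic.

```shell
# Demo on a scratch file: 755, then "o-x,u+s", equals mode 4754
# (setuid root, group r-x, world r--), i.e. the state dopd asks for.
f=$(mktemp)
chmod 755 "$f"
chmod o-x,u+s "$f"
stat -c '%a' "$f"   # prints 4754
rm -f "$f"

# On a real system (as root), make dpkg preserve this across upgrades:
#   dpkg-statoverride --update --add root haclient 4754 /sbin/drbdsetup
#   dpkg-statoverride --update --add root haclient 4754 /sbin/drbdmeta
```

With the overrides registered, the post-upgrade "you need to fix this" ritual should no longer be necessary, though the dopd warning text itself may still appear.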
Bug#304735: slapd 2.2.23 database corruption
Hello,

Monday is nearly over here, and neither today nor over the weekend was any corruption or inconsistency observed (and I checked each record that was modified in the last 3 days). So using BDB instead of LDBM indeed seems to have fixed things for me. I guess the choice, as far as the Debian package is concerned, is now to either get a working LDBM backend from upstream or forcibly migrate users away from LDBM when Sarge hits the limelight...

Even with the default 256KB cache of BDB things worked quite well, and db_stat -m showed pretty nice cache hit rates. For the record, and in case somebody wants to use this data, my DB_CONFIG now reads like this (after many tests on my test server):

---
set_cachesize 0 134217728 1
set_flags DB_LOG_AUTOREMOVE
set_flags DB_TXN_NOSYNC
---

Yes, these servers have 2GB RAM, so I was very generous with the cache. It helps quite a bit; that alone made a full load with ldapadd 6 times faster. DB_TXN_NOSYNC speeds that up another 8 times, so instead of 53 minutes it takes 1 minute to load the entire LDIF. Inserting it with slapadd -q now takes 22 seconds; I'm reminded of the good ole ldif2ldbm days.

I know that DB_LOG_AUTOREMOVE doesn't work the way it should for the moment, but here's hoping for the future. ;)

I'm unsure about DB_TXN_NOSYNC in production; basically only writing out changes when the server gets shut down is somewhat hair-raising. OTOH it speeds things up, and I have never had either slapd or the whole server crash. In which case I could create a good instance in the 22 seconds mentioned up there.

Regards,

Christian Balzer
--
Christian Balzer        Network/Systems Engineer       NOC
[EMAIL PROTECTED]       Global OnLine Japan/Fusion Network Services
http://www.gol.com/
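A small hedged aside on the set_cachesize line above: BDB takes the cache size as a gigabytes/bytes pair plus a segment count, so 128MB becomes `0 134217728 1`. A throwaway sketch of the conversion (my own arithmetic, not a BDB-supplied tool):

```shell
# Convert a cache size in MB into BDB's set_cachesize gbytes/bytes pair.
mb=128
bytes=$((mb * 1024 * 1024))
gb=$((bytes / 1073741824))      # whole gigabytes
rest=$((bytes % 1073741824))    # remaining bytes
echo "set_cachesize $gb $rest 1"   # -> set_cachesize 0 134217728 1
```

The trailing 1 is the number of cache segments; splitting into more segments only matters for caches larger than what one contiguous region can hold.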
Bug#304735: slapd 2.2.23 database corruption
Steve Langasek wrote:
> On Fri, Apr 15, 2005 at 01:30:33PM +0900, Christian Balzer wrote:
> > [backend used] See above, LDBM (whatever actual DB that defaults to
> > these days).
> Sorry, I missed that. I would strongly encourage you to switch to BDB,
> which is the recommended backend for OpenLDAP 2.2; LDBM was more stable
> in 2.1 because BDB itself was *un*stable, but in 2.2, BDB is reportedly
> quite solid whereas LDBM is less stable than it had been in 2.1.

Seeing that it can hardly get worse (I have been running BDB on a test machine, and that worked for the limited exposure it has), I changed the 2 servers over to BDB, something I would not have done without the -q switch in slapadd (all those BDB log files otherwise, argh). I will monitor this over the weekend and see if the problem persists, goes away or (heavens forbid) mutates.

No matter the outcome, though, the severity of this bug report remains the same. Right now anybody with a working sarge or woody LDAP installation will find themselves encountering mysterious heisenbugs when upgrading to 2.2.23-1 (at the very least when using LDBM). So unless the underlying problem can be fixed, or the update somehow enforces BDB usage (it didn't even suggest it, and this always assumes BDB actually fixes what I'm seeing here), we have a major show stopper.

> > I loathe BDB for the times it takes for massive adds/modifies. Even
> > with slapadd, which takes about 2 minutes to load the entire DB using
> > ldbm as backend, but about 50 minutes with BDB.
> OpenLDAP 2.2 includes a '-q' option to slapadd that makes the load time
> much quicker by disabling checks that are unnecessary while loading a
> fresh db. This option will be enabled by default on database reloads in
> the slapd install scripts.

This sure helps (helped in my case) with a fresh load. I still dread to see BDB performance in case something modifies or adds a large number of entries in normal (ldapmodify) operation. It tends to be about 2 times slower than LDBM at that.
Regards,

Christian Balzer
--
Christian Balzer        Network/Systems Engineer       NOC
[EMAIL PROTECTED]       Global OnLine Japan/Fusion Network Services
http://www.gol.com/
Bug#304735: slapd 2.2.23 database corruption
Hello,

Just a quick reply to the 3 mails from Torsten:

a) I will try to ride this out with BDB and slapd 2.2.23 for the moment and make the call on Monday whether this is working or not. So far no corruption, but also just a few modify actions. If it fails as well, I might indeed need an old package. ;P

b) I know of the DB_CONFIG stuff from other encounters with BDB (INN overview) and the test runs with it for slapd. It gives me headaches, but I'll look at it again. The slapd.conf cachesize is set to 100 and the servers are vastly overspec'ed in all aspects. So no problems thus far.

c) The -q did indeed help (2 minutes instead of 43) because it suppressed those pesky log.01 files which really kill BDB performance in this scenario.

Regards,

Christian Balzer
--
Christian Balzer        Network/Systems Engineer       NOC
[EMAIL PROTECTED]       Global OnLine Japan/Fusion Network Services
http://www.gol.com/
Bug#304735: slapd 2.2.23 database corruption
Package: slapd
Version: 2.2.23-1 (sarge)
Severity: critical

This is basically the same as #303826 (why that got classified as normal and 2.2.23 got pushed into sarge is beyond me).

I have a LARGE (60k users) users/mailsettings database in LDAP, on two identical servers running sarge. They have been rock stable like that for over a year. Changes are generated by ldifsort.pl/ldifdiff.pl and then applied with ldapmodify for a low-impact and smooth operation, using ldbm as the backend.

Since the update of slapd in sarge 2 days ago, I have been getting an increasing number of reports of user settings vanishing from the system. As with #303826, a full dump of the DB WILL show that these records are present, but a specific search for them will fail. So this hints very much at index corruption of some sort, as a stop/start of slapd does not change things. However, a delete/add of that entire record tends to fix things, and so far it seems only records that were touched with modify have been affected. Unfortunately this is not deterministic in the least; while one slapd instance on one server will happily return the correct data for a specific query, the other one might not, or vice versa.

I urge you (in case this can't be fixed in a time frame of 1-2 days) to back out this update and revert to the previous version. If this LDAP DB were the canonical one, and not fed from an SQL DB, I'd be out of a job by now instead of frantically fixing things with good data.

Caffeinated Greetings,

Christian Balzer
--
Christian Balzer        Network/Systems Engineer       NOC
[EMAIL PROTECTED]       Global OnLine Japan/Fusion Network Services
http://www.gol.com/
Bug#304735: slapd 2.2.23 database corruption
Steve Langasek wrote:
> On Fri, Apr 15, 2005 at 12:09:19PM +0900, Christian Balzer wrote:
> > Changes are generated by ldifsort.pl/ldifdiff.pl and then applied with
> > ldapmodify for a low impact and smooth operation, using ldbm as
> > backend.
> The previous version of slapd *also* had corruption issues, and this is
> the driving reason for putting slapd 2.2 in sarge.

I read that, and I'm all for using current versions of software when getting near to a Debian release. Alas, it's hard to contrast one year of trouble-free operation with the current state of affairs. A fix that breaks all the users who until now had a perfectly working setup is, well, not a fix. Or to put it quite bluntly: people encountering DB corruption with the previous version most likely did NOT run production systems with it. Me and others on the other hand...

> Which LDAP backend are you using for this directory?

See above, LDBM (whatever actual DB that defaults to these days). I loathe BDB for the time it takes for massive adds/modifies. Even with slapadd, which takes about 2 minutes to load the entire DB using ldbm as backend, but about 50 minutes with BDB.

Regards,

Christian Balzer
--
Christian Balzer        Network/Systems Engineer       NOC
[EMAIL PROTECTED]       Global OnLine Japan/Fusion Network Services
http://www.gol.com/