Bug#1032104: linux: ppc64el iouring corrupted read
Hello!

This is not fixed. I sampled failing autopkgtests for MariaDB at
https://ci.debian.net/packages/m/mariadb/testing/ppc64el/ between May 7th and 22nd. They still have crashes that include the error message 'Database page corruption on disk'.

Both failing and passing ones were running kernel:
Linux 6.1.0-9-powerpc64le #1 SMP Debian 6.1.27-1 (2023-05-08)

Most passing ones were on ci-worker-ppc64el-03, but also on -02. The failing ones were on workers -01 and -02. Since -02 had both failing and passing runs, this indicates that it is not a hardware issue.

The overall symptoms indicate that this is a software issue that started on Feb 6th (kernel: Linux 5.10.0-21-powerpc64le). It happens sporadically, not on every run, and continues to happen.

- Otto
Bug#1032104: linux: ppc64el iouring corrupted read
Hi Otto,

On Sun, Apr 09, 2023 at 03:30:35PM -0700, Otto Kekäläinen wrote:
> > > > Paul Gevers asked if the issues are gone as well with 6.1.12-1
> > > > (or later 6.1.y series versions, which will land in bookworm). That
> > > > would be valuable information to know as well to exclude we do not
> > > > have the issue as well in bookworm.
> > >
> > > Were you able to verify this?
>
> Yes and new kernel did not fix it.
>
> I reviewed now all ppc64el autopkgtest runs of src:mariadb at
> https://ci.debian.net/packages/m/mariadb/testing/ppc64el/
>
> This is still happening on latest kernel and latest src:mariadb in
> bookworm. The failing test varies, but they all have in common that
> they error on 'Database page corruption on disk'.
>
> autopkgtest [20:11:55]: starting date and time: 2023-04-08 20:11:55+
> autopkgtest [20:12:17]: testbed running kernel: Linux
> 6.1.0-7-powerpc64le #1 SMP Debian 6.1.20-1 (2023-03-19)
> autopkgtest [20:12:39]: testing package mariadb version 1:10.11.2-1
> Completed: Failed 6/1021 tests, 99.41% were successful.
> Failing test(s): main.innodb_ext_key main.statistics_upgrade_not_done
>
> Attached summary of downloading all recent logs and running:
> $ zgrep -e 'starting date' -e 'running kernel' -e 'testing package
> mariadb version' -e 'Completed: ' -e 'Failing test(s)' *.gz | tee
> mariadb-autopkgtest-ppc64el-summary.txt

Are those issues still present with recent kernels? There have again been enough io_uring changes that it is worth rebasing our checks on those.

Regards,
Salvatore
Bug#1032104: linux: ppc64el iouring corrupted read
> > > Paul Gevers asked if the issues are gone as well with 6.1.12-1
> > > (or later 6.1.y series versions, which will land in bookworm). That
> > > would be valuable information to know as well to exclude we do not
> > > have the issue as well in bookworm.
> >
> > Were you able to verify this?

Yes and new kernel did not fix it.

I reviewed now all ppc64el autopkgtest runs of src:mariadb at
https://ci.debian.net/packages/m/mariadb/testing/ppc64el/

This is still happening on latest kernel and latest src:mariadb in bookworm. The failing test varies, but they all have in common that they error on 'Database page corruption on disk'.

autopkgtest [20:11:55]: starting date and time: 2023-04-08 20:11:55+
autopkgtest [20:12:17]: testbed running kernel: Linux 6.1.0-7-powerpc64le #1 SMP Debian 6.1.20-1 (2023-03-19)
autopkgtest [20:12:39]: testing package mariadb version 1:10.11.2-1
Completed: Failed 6/1021 tests, 99.41% were successful.
Failing test(s): main.innodb_ext_key main.statistics_upgrade_not_done

Attached summary of downloading all recent logs and running:

$ zgrep -e 'starting date' -e 'running kernel' -e 'testing package mariadb version' -e 'Completed: ' -e 'Failing test(s)' *.gz | tee mariadb-autopkgtest-ppc64el-summary.txt

30542346.log.gz:autopkgtest [16:38:18]: starting date and time: 2023-01-20 16:38:18+
30542346.log.gz:autopkgtest [16:39:14]: testbed running kernel: Linux 5.10.0-20-powerpc64le #1 SMP Debian 5.10.158-2 (2022-12-13)
30542346.log.gz:autopkgtest [16:39:30]: testing package mariadb version 1:10.11.1-1
30542346.log.gz:Completed: All 1016 tests were successful.
31013059.log.gz:autopkgtest [23:16:23]: starting date and time: 2023-02-03 23:16:23+
31013059.log.gz:autopkgtest [23:16:53]: testbed running kernel: Linux 5.10.0-20-powerpc64le #1 SMP Debian 5.10.158-2 (2022-12-13)
31013059.log.gz:autopkgtest [23:17:06]: testing package mariadb version 1:10.11.1-2
31013059.log.gz:Completed: All 1016 tests were successful.
31114152.log.gz:autopkgtest [10:00:31]: starting date and time: 2023-02-06 10:00:31+
31114152.log.gz:autopkgtest [10:00:57]: testbed running kernel: Linux 5.10.0-21-powerpc64le #1 SMP Debian 5.10.162-1 (2023-01-21)
31114152.log.gz:autopkgtest [10:01:09]: testing package mariadb version 1:10.11.1-3
31114152.log.gz:Completed: Failed 2/1016 tests, 99.80% were successful.
31114152.log.gz:Failing test(s): main.xa_prepared_binlog_off main.update_use_source
31138628.log.gz:autopkgtest [06:52:36]: starting date and time: 2023-02-07 06:52:36+
31138628.log.gz:autopkgtest [06:53:04]: testbed running kernel: Linux 5.10.0-21-powerpc64le #1 SMP Debian 5.10.162-1 (2023-01-21)
31138628.log.gz:autopkgtest [06:53:17]: testing package mariadb version 1:10.11.1-3
31138628.log.gz:Completed: All 1016 tests were successful.
31204767.log.gz:autopkgtest [12:32:51]: starting date and time: 2023-02-10 12:32:51+
31204767.log.gz:autopkgtest [12:33:23]: testbed running kernel: Linux 5.10.0-21-powerpc64le #1 SMP Debian 5.10.162-1 (2023-01-21)
31204767.log.gz:autopkgtest [12:33:46]: testing package mariadb version 1:10.11.1-4
31204767.log.gz:Completed: Failed 2/1016 tests, 99.80% were successful.
31204767.log.gz:Failing test(s): main.innodb_ext_key main.order_by_innodb
31253808.log.gz:autopkgtest [19:05:34]: starting date and time: 2023-02-11 19:05:34+
31253808.log.gz:autopkgtest [19:06:15]: testbed running kernel: Linux 5.10.0-21-powerpc64le #1 SMP Debian 5.10.162-1 (2023-01-21)
31253808.log.gz:autopkgtest [19:06:25]: testing package mariadb version 1:10.11.1-4
31253808.log.gz:Completed: All 1016 tests were successful.
31452860.log.gz:autopkgtest [09:50:34]: starting date and time: 2023-02-17 09:50:34+
31452860.log.gz:autopkgtest [09:51:00]: testbed running kernel: Linux 5.10.0-21-powerpc64le #1 SMP Debian 5.10.162-1 (2023-01-21)
31452860.log.gz:autopkgtest [09:51:21]: testing package mariadb version 1:10.11.1-5
31452860.log.gz:Completed: Failed 6/1020 tests, 99.41% were successful.
31452860.log.gz:Failing test(s): main.ctype_utf8mb4_innodb main.index_merge_innodb
31480673.log.gz:autopkgtest [01:00:30]: starting date and time: 2023-02-18 01:00:30+
31480673.log.gz:autopkgtest [01:01:00]: testbed running kernel: Linux 5.10.0-21-powerpc64le #1 SMP Debian 5.10.162-1 (2023-01-21)
31480673.log.gz:autopkgtest [01:01:17]: testing package mariadb version 1:10.11.2-1
31480673.log.gz:Completed: Failed 6/1021 tests, 99.41% were successful.
31480673.log.gz:Failing test(s): main.xa_prepared_binlog_off main.range_mrr_icp
31509348.log.gz:autopkgtest [05:09:32]: starting date and time: 2023-02-19 05:09:32+
31509348.log.gz:autopkgtest [05:10:50]: testbed running kernel: Linux 5.10.0-21-powerpc64le #1 SMP Debian 5.10.162-1 (2023-01-21)
31509348.log.gz:autopkgtest [05:11:06]: testing package mariadb version 1:10.11.2-1
31509348.log.gz:Completed: Failed 3/1019 tests, 99.71% were successful.
31509348.log.gz:Failing test(s): main.ctype_utf8mb4_innodb
323410
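For anyone who wants to tally the zgrep summary per kernel rather than read it by eye, here is a minimal Python sketch. It assumes the `file.log.gz:...` line format shown in the summary above; the function names `parse_summary` and `tally_by_kernel` are made up for this illustration, and the sample is only a few lines copied from the summary, not the full data set.

```python
import re
from collections import defaultdict

# A few lines copied from the summary above; not the complete data set.
SAMPLE = """\
30542346.log.gz:autopkgtest [16:39:14]: testbed running kernel: Linux 5.10.0-20-powerpc64le #1 SMP Debian 5.10.158-2 (2022-12-13)
30542346.log.gz:Completed: All 1016 tests were successful.
31114152.log.gz:autopkgtest [10:00:57]: testbed running kernel: Linux 5.10.0-21-powerpc64le #1 SMP Debian 5.10.162-1 (2023-01-21)
31114152.log.gz:Completed: Failed 2/1016 tests, 99.80% were successful.
31114152.log.gz:Failing test(s): main.xa_prepared_binlog_off main.update_use_source
31138628.log.gz:autopkgtest [06:53:04]: testbed running kernel: Linux 5.10.0-21-powerpc64le #1 SMP Debian 5.10.162-1 (2023-01-21)
31138628.log.gz:Completed: All 1016 tests were successful.
"""

KERNEL_RE = re.compile(r"testbed running kernel: Linux (\S+)")
FAILED_RE = re.compile(r"Completed: Failed (\d+)/(\d+) tests")
PASSED_RE = re.compile(r"Completed: All (\d+) tests were successful")


def parse_summary(text):
    """Return {log_name: {"kernel": str, "failed": int}} from zgrep output."""
    runs = defaultdict(dict)
    for line in text.splitlines():
        # zgrep prefixes every match with "<file name>:".
        log_name, sep, rest = line.partition(":")
        if not sep:
            continue
        kernel = KERNEL_RE.search(rest)
        failed = FAILED_RE.search(rest)
        if kernel:
            runs[log_name]["kernel"] = kernel.group(1)
        elif failed:
            runs[log_name]["failed"] = int(failed.group(1))
        elif PASSED_RE.search(rest):
            runs[log_name]["failed"] = 0
    return dict(runs)


def tally_by_kernel(runs):
    """Return {kernel: [passing_runs, failing_runs]} for complete records."""
    tally = defaultdict(lambda: [0, 0])
    for run in runs.values():
        if "kernel" in run and "failed" in run:
            tally[run["kernel"]][1 if run["failed"] else 0] += 1
    return dict(tally)


if __name__ == "__main__":
    for kernel, (ok, bad) in sorted(tally_by_kernel(parse_summary(SAMPLE)).items()):
        print(f"{kernel}: {ok} passing, {bad} failing")
```

Feeding it the whole of mariadb-autopkgtest-ppc64el-summary.txt instead of SAMPLE should give the per-kernel pass/fail split. Since the failure is sporadic, a kernel with only passing runs is weak evidence unless the run count is reasonably large.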
Bug#1032104: linux: ppc64el iouring corrupted read
Hi Otto,

On 09-04-2023 03:54, Otto Kekäläinen wrote:
> > Paul Gevers asked if the issues are gone as well with 6.1.12-1
> > (or later 6.1.y series versions, which will land in bookworm). That
> > would be valuable information to know as well to exclude we do not
> > have the issue as well in bookworm.
> >
> > Were you able to verify this?
>
> No, not yet. I have not done new uploads to experimental after the one
> I mentioned and linked above from March 18th.

I don't understand this point, so I wonder if you understood my question. Maybe you did, but in my view no new uploads are needed to answer the bookworm question.

> The builds for unstable are passing because I forced the tests to run
> with regular fsync instead of native I/O in
> https://salsa.debian.org/mariadb-team/mariadb-server/-/commit/fc1358087b39ac6520420c7bbae2e536bc86748d.
>
> I will test this again later but right now I don't want to do any
> extra uploads as the package is pending unblock and inclusion in
> Bullseye (Bug#1033811) and I don't want one single minor issue to
> jeopardize getting fixes for multiple major issues forward.

My point was that I upgraded the ppc64el hosts where ci.debian.net runs the autopkgtests (so *not* the Debian build infrastructure). Since that upgrade, all tests on ci.debian.net *in every suite* have been using the bookworm (6.1.y) kernel. E.g. in unstable, MariaDB 1:10.11.2-1 (so before the "Prevent mariadb-test-run from using native I/O on ppc64el and s390x due to Linux kernel bug" change) passed on 2023-03-26 10:39 but failed on the same day at 14:40.

Are any of the failures on ppc64el before 1:10.11.2-2 and after 2023-03-09 from the same kernel issue we're discussing here (and thus the kernel still needs fixing in bookworm)? Or are all the failures in that time-span from something else, so that we can conclude that the kernel *probably* (no proof of course) got fixed between the version of the kernel in bullseye and the version in bookworm?

Paul
Bug#1032104: linux: ppc64el iouring corrupted read
> > On Sat, Mar 18, 2023 at 11:19:29PM -0700, Otto Kekäläinen wrote:
> > > Any updates on this one?
> > >
> > > I am still seeing the main.index_merge_innodb failure in
> > > https://buildd.debian.org/status/fetch.php?pkg=mariadb&arch=ppc64el&ver=1%3A10.11.2-2%7Eexp1&stamp=1678728871&raw=0
> > > and rebuild
> > > https://buildd.debian.org/status/fetch.php?pkg=mariadb&arch=ppc64el&ver=1%3A10.11.2-2%7Eexp1&stamp=1679174850&raw=0.
> > >
> > > Logs show: Kernel: Linux 5.10.0-21-powerpc64le #1 SMP Debian
> > > 5.10.162-1 (2023-01-21) ppc64el (ppc64le)
> >
> > Remember that with the 5.10.162 upstream version the io_uring code was
> > rebased to the 5.15-stable one. So it is likely, and it matches the
> > version ranges, that the regression was introduced with these
> > particular changes. Ideally someone with access to the given
> > architecture can verify that the issue is gone with the current
> > 5.10.175 upstream (where there were several followup fixes, in
> > particular e.g. a similar one for s390x), and if not, report the
> > problem to upstream.
> >
> > Paul Gevers asked if the issues are gone as well with 6.1.12-1
> > (or later 6.1.y series versions, which will land in bookworm). That
> > would be valuable information to know as well to exclude we do not
> > have the issue as well in bookworm.
>
> Were you able to verify this?

No, not yet. I have not done new uploads to experimental after the one I mentioned and linked above from March 18th.

The builds for unstable are passing because I forced the tests to run with regular fsync instead of native I/O in
https://salsa.debian.org/mariadb-team/mariadb-server/-/commit/fc1358087b39ac6520420c7bbae2e536bc86748d.

I will test this again later but right now I don't want to do any extra uploads as the package is pending unblock and inclusion in Bullseye (Bug#1033811) and I don't want one single minor issue to jeopardize getting fixes for multiple major issues forward.
Bug#1032104: linux: ppc64el iouring corrupted read
Control: tags -1 + moreinfo

Hi

On Sun, Mar 19, 2023 at 05:02:19PM +0100, Salvatore Bonaccorso wrote:
> Hi,
>
> On Sat, Mar 18, 2023 at 11:19:29PM -0700, Otto Kekäläinen wrote:
> > Any updates on this one?
> >
> > I am still seeing the main.index_merge_innodb failure in
> > https://buildd.debian.org/status/fetch.php?pkg=mariadb&arch=ppc64el&ver=1%3A10.11.2-2%7Eexp1&stamp=1678728871&raw=0
> > and rebuild
> > https://buildd.debian.org/status/fetch.php?pkg=mariadb&arch=ppc64el&ver=1%3A10.11.2-2%7Eexp1&stamp=1679174850&raw=0.
> >
> > Logs show: Kernel: Linux 5.10.0-21-powerpc64le #1 SMP Debian
> > 5.10.162-1 (2023-01-21) ppc64el (ppc64le)
>
> Remember that with the 5.10.162 upstream version the io_uring code was
> rebased to the 5.15-stable one. So it is likely, and it matches the
> version ranges, that the regression was introduced with these
> particular changes. Ideally someone with access to the given
> architecture can verify that the issue is gone with the current
> 5.10.175 upstream (where there were several followup fixes, in
> particular e.g. a similar one for s390x), and if not, report the
> problem to upstream.
>
> Paul Gevers asked if the issues are gone as well with 6.1.12-1
> (or later 6.1.y series versions, which will land in bookworm). That
> would be valuable information to know as well to exclude we do not
> have the issue as well in bookworm.

Were you able to verify this?

Regards,
Salvatore
Bug#1032104: linux: ppc64el iouring corrupted read
Hi,

On Sat, Mar 18, 2023 at 11:19:29PM -0700, Otto Kekäläinen wrote:
> Any updates on this one?
>
> I am still seeing the main.index_merge_innodb failure in
> https://buildd.debian.org/status/fetch.php?pkg=mariadb&arch=ppc64el&ver=1%3A10.11.2-2%7Eexp1&stamp=1678728871&raw=0
> and rebuild
> https://buildd.debian.org/status/fetch.php?pkg=mariadb&arch=ppc64el&ver=1%3A10.11.2-2%7Eexp1&stamp=1679174850&raw=0.
>
> Logs show: Kernel: Linux 5.10.0-21-powerpc64le #1 SMP Debian
> 5.10.162-1 (2023-01-21) ppc64el (ppc64le)

Remember that with the 5.10.162 upstream version the io_uring code was rebased to the 5.15-stable one. So it is likely, and it matches the version ranges, that the regression was introduced with these particular changes. Ideally someone with access to the given architecture can verify that the issue is gone with the current 5.10.175 upstream (where there were several followup fixes, in particular e.g. a similar one for s390x), and if not, report the problem to upstream.

Paul Gevers asked if the issues are gone as well with 6.1.12-1 (or later 6.1.y series versions, which will land in bookworm). That would be valuable information to know as well, to rule out that we have the issue in bookworm too.

Regards,
Salvatore
Bug#1032104: linux: ppc64el iouring corrupted read
Any updates on this one? I am still seeing the main.index_merge_innodb failure in https://buildd.debian.org/status/fetch.php?pkg=mariadb&arch=ppc64el&ver=1%3A10.11.2-2%7Eexp1&stamp=1678728871&raw=0 and rebuild https://buildd.debian.org/status/fetch.php?pkg=mariadb&arch=ppc64el&ver=1%3A10.11.2-2%7Eexp1&stamp=1679174850&raw=0. Logs show: Kernel: Linux 5.10.0-21-powerpc64le #1 SMP Debian 5.10.162-1 (2023-01-21) ppc64el (ppc64le)
Bug#1032104: linux: ppc64el iouring corrupted read
On Mon, 6 Mar 2023 13:25:36 +1100 Daniel Black wrote:
> Since reverting to linux-image-5.10.0-20 we've been free of the same errors.

On ci.debian.net I upgraded all ppc64el hosts to bookworm on 2023-03-09.

debian@ci-worker-ppc64el-04:~$ uname -a
Linux ci-worker-ppc64el-04 6.1.0-5-powerpc64le #1 SMP Debian 6.1.12-1 (2023-02-15) ppc64le GNU/Linux

Can you check if the errors are still the same (yes, there are still intermittent failures)?

Paul
Bug#1032104: linux: ppc64el iouring corrupted read
Since reverting to linux-image-5.10.0-20 we've been free of the same errors.
Bug#1032104: linux: ppc64el iouring corrupted read
On Tue, Feb 28, 2023 at 5:24 PM Diederik de Haas wrote:
>
> On Tuesday, 28 February 2023 04:13:18 CET Daniel Black wrote:
> > Source: linux
> > Version: 5.10.0-21-powerpc64le
> > Severity: grave
> > Justification: causes non-serious data loss
> > X-Debbugs-Cc: dan...@mariadb.org
> >
> > From https://jira.mariadb.org/browse/MDEV-30728
> >
> > MariaDB's mtr tests on a number of specific tests depend on the correct
> > kernel operation.
> >
> > As observed in these tests, there is a ~1/5 chance the
> > encryption.innodb_encryption test will read zeros on the later part of
> > the 16k pages that InnoDB uses by default.
> >
> > This affects MariaDB-10.6+ packages where there is a liburing in the
> > distribution.
> >
> > I tested on tmpfs. This is a different fault from bug #1020831 as:
> > * there is no iouring error, just a bunch of zeros where data was
> > expected.
> > * this is ppc64le only.
>
> What was the last kernel where this problem did NOT occur?

2022-12-19 03:55:34 install linux-image-5.10.0-20-powerpc64le:ppc64el 5.10.158-2

no similar errors between ^ and ..

2023-01-24 03:19:59 install linux-image-5.10.0-21-powerpc64le:ppc64el 5.10.162-1

(no other linux image installs in between these two)

first failure found ~ Feb 4 2023. Unsure when the kernel rebooted to this kernel, but it does appear to be the last revision.

https://buildbot.mariadb.org/#/builders/318/builds/10008

log example https://ci.mariadb.org/32263/logs/ppc64le-debian-11/mysqld.1.err.7 (search for CURRENT_TEST: encryption.innodb_encryption) - contains hex dump of page

> It's probably needed to pinpoint the (upstream) commit that caused this error/
> issue and the best start is normally finding the closest range with Debian
> kernel releases where it did not and did occur.
>
> > -- System Information:
> > Debian Release: bullseye
> > APT prefers jammy-updates
> > APT policy: (500, 'jammy-updates'), (500, 'jammy-security'), (500,
> > 'jammy'), (100, 'jammy-backports')
> > Architecture: ppc64el (ppc64le)
> >
> > Kernel: Linux 5.10.0-21-powerpc64le (SMP w/128 CPU threads)
> > Init: unable to detect
>
> Why is there no 'bullseye' in APT policy's output?
> Mixing distributions (aka FrankenDebian) isn't recommended, but seeing no
> bullseye in there is odd, especially since the kernel version very much does
> look like Debian.

Apologies for the FrankenDebian look. This was a jammy container and jammy report bug with bullseye edited (badly) in the system info.
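The reasoning above (which kernel package was the most recently installed one when the first failure showed up) can be sketched in Python as follows. This is only an illustration: it parses the two install lines exactly as pasted in this message, the helper names `kernel_installs` and `installed_before` are hypothetical, and "installed" is not the same as "booted", which is the uncertainty already noted above.

```python
from datetime import datetime

# The two install lines exactly as pasted in this message.
LINES = [
    "2022-12-19 03:55:34 install linux-image-5.10.0-20-powerpc64le:ppc64el 5.10.158-2",
    "2023-01-24 03:19:59 install linux-image-5.10.0-21-powerpc64le:ppc64el 5.10.162-1",
]


def kernel_installs(lines):
    """Return sorted (timestamp, package, version) tuples for linux-image installs."""
    installs = []
    for line in lines:
        fields = line.split()
        if len(fields) < 5 or fields[2] != "install":
            continue
        if not fields[3].startswith("linux-image-"):
            continue
        when = datetime.strptime(f"{fields[0]} {fields[1]}", "%Y-%m-%d %H:%M:%S")
        # The last field is taken as the version being installed.
        installs.append((when, fields[3], fields[-1]))
    return sorted(installs)


def installed_before(installs, moment):
    """Return the latest install at or before `moment`, or None if there is none."""
    earlier = [entry for entry in installs if entry[0] <= moment]
    return earlier[-1] if earlier else None


if __name__ == "__main__":
    first_failure = datetime(2023, 2, 4)  # "first failure found ~ Feb 4 2023"
    print(installed_before(kernel_installs(LINES), first_failure))
```

This only narrows the suspect range to the Debian kernel revisions on either side of the first failure. Pinpointing the upstream commit would still need a bisect between those two versions.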
Bug#1032104: linux: ppc64el iouring corrupted read
On Tuesday, 28 February 2023 04:13:18 CET Daniel Black wrote:
> Source: linux
> Version: 5.10.0-21-powerpc64le
> Severity: grave
> Justification: causes non-serious data loss
> X-Debbugs-Cc: dan...@mariadb.org
>
> From https://jira.mariadb.org/browse/MDEV-30728
>
> MariaDB's mtr tests on a number of specific tests depend on the correct
> kernel operation.
>
> As observed in these tests, there is a ~1/5 chance the
> encryption.innodb_encryption test will read zeros on the later part of
> the 16k pages that InnoDB uses by default.
>
> This affects MariaDB-10.6+ packages where there is a liburing in the
> distribution.
>
> I tested on tmpfs. This is a different fault from bug #1020831 as:
> * there is no iouring error, just a bunch of zeros where data was
> expected.
> * this is ppc64le only.

What was the last kernel where this problem did NOT occur?

It's probably needed to pinpoint the (upstream) commit that caused this error/issue, and the best start is normally finding the closest range of Debian kernel releases where it did not and did occur.

> -- System Information:
> Debian Release: bullseye
> APT prefers jammy-updates
> APT policy: (500, 'jammy-updates'), (500, 'jammy-security'), (500,
> 'jammy'), (100, 'jammy-backports')
> Architecture: ppc64el (ppc64le)
>
> Kernel: Linux 5.10.0-21-powerpc64le (SMP w/128 CPU threads)
> Init: unable to detect

Why is there no 'bullseye' in APT policy's output?
Mixing distributions (aka FrankenDebian) isn't recommended, but seeing no bullseye in there is odd, especially since the kernel version very much does look like Debian.
Bug#1032104: linux: ppc64el iouring corrupted read
Source: linux
Version: 5.10.0-21-powerpc64le
Severity: grave
Justification: causes non-serious data loss
X-Debbugs-Cc: dan...@mariadb.org

Dear Maintainer,

From https://jira.mariadb.org/browse/MDEV-30728

MariaDB's mtr tests on a number of specific tests depend on the correct kernel operation.

As observed in these tests, there is a ~1/5 chance the encryption.innodb_encryption test will read zeros on the later part of the 16k pages that InnoDB uses by default.

This affects MariaDB-10.6+ packages where there is a liburing in the distribution.

This has been observed in the CI of Debian (https://ci.debian.net/packages/m/mariadb/testing/ppc64el/) and upstream's https://buildbot.mariadb.org/#/builders/318.

The one ppc64le worker that has the Debian 5.10.0-21 kernel, the same as the Debian CI, has the prefix ppc64le-db-bbw1-*.

Test faults occur on all MariaDB 10.6+ builds in containers on this kernel. There are no faults on non-ppc64le or RHEL7/8 based ppc64le kernels.

To reproduce:

    apt-get install mariadb-test
    cd /usr/share/mysql/mysql-test
    ./mtr --mysqld=--innodb-flush-method=fsync --mysqld=--innodb-use-native-aio=1 --vardir=/var/lib/mysql --force encryption.innodb_encryption,innodb,undo0 --repeat=12

A test will frequently fail.

    2023-02-28 1:41:01 0 [ERROR] InnoDB: Database page corruption on disk or a failed read of file './ibdata1' page [page id: space=0, page number=282]. You may have to recover from a backup.

(the page number isn't predictable)

The complete mtr error log of mariadb server is $PWD/var/log/mysqld.1.err

I tested on tmpfs. This is a different fault from bug #1020831 as:
* there is no iouring error, just a bunch of zeros where data was expected.
* this is ppc64le only.

Note, more serious faults exist on overlayfs (MDEV-28751) and remote filesystems so sticking to local xfs, ext4, btrfs is recommended.

-- System Information:
Debian Release: bullseye
APT prefers jammy-updates
APT policy: (500, 'jammy-updates'), (500, 'jammy-security'), (500, 'jammy'), (100, 'jammy-backports')
Architecture: ppc64el (ppc64le)

Kernel: Linux 5.10.0-21-powerpc64le (SMP w/128 CPU threads)
Locale: LANG=C.UTF-8, LC_CTYPE=C.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /usr/bin/dash
Init: unable to detect
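The symptom described in this report (zeros in the later part of a 16k page) can be looked for in a captured buffer or file with a small Python sketch like the one below. It is a rough heuristic under stated assumptions, not how InnoDB itself validates pages: the 4096-byte `min_tail` threshold and the function names are arbitrary choices for this illustration, and a page that is legitimately sparse would also be flagged. Completely zero pages are skipped on the assumption that they are simply unused.

```python
import sys

PAGE_SIZE = 16 * 1024  # the 16k default page size named in the report


def zero_tail_length(page):
    """Return the number of consecutive zero bytes at the end of `page`."""
    return len(page) - len(page.rstrip(b"\x00"))


def suspicious_pages(data, page_size=PAGE_SIZE, min_tail=4096):
    """Yield (page_number, zero_tail_bytes) for partly zeroed pages.

    A page is reported when it ends in at least `min_tail` zero bytes but
    is not entirely zero. `min_tail` is an arbitrary illustrative threshold.
    """
    for offset in range(0, len(data) - page_size + 1, page_size):
        page = data[offset:offset + page_size]
        tail = zero_tail_length(page)
        if min_tail <= tail < page_size:
            yield offset // page_size, tail


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python3 this_script.py <copy of a data file>
    with open(sys.argv[1], "rb") as handle:
        for number, tail in suspicious_pages(handle.read()):
            print(f"page {number}: last {tail} bytes are zero")
```

Note that this inspects what is on disk. If the corruption only happens in the read path, the file itself may look fine and only the buffer returned by the failed read would show the zeroed tail.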