Re: Kernel page fault in "sddetach -> bufq_destroy" during "bioctl -d"
> 2020-10-24 16:41 GMT+02:00 Stefan Sperling :
> > On Sat, Oct 24, 2020 at 04:11:00PM +0200, Filippo Valsorda wrote:
> > > Fair enough, but "there's no auto-assembly and it's inefficient and
> > > nothing stops you from messing with the intermediate discipline" is a
> > > different kind of not supported than "you should expect kernel panics".
> > >
> > > If the latter is the case, maybe it should be documented in the
> > > softraid(4) CAVEATS, as it breaks the sd(4) abstraction.
> >
> > Neither Joel's mail nor the word "unsupported" imply a promise
> > that it will work without auto-assembly and with inefficient i/o.
> >
> > Unsupported means unsupported. We don't need to list any reasons
> > for this in user-facing documentation.
>
> I'm not suggesting justifying why, I am saying that softraid(4) is
> documented to assemble sd(4) devices into sd(4) devices. If it's
> actually "sd(4) devices that are not themselves softraid(4) backed",
> that would be worth documenting as it breaks the sd(4) abstraction.
>
> Said another way, how was I supposed to find out this is unsupported?
> It's not like "a mirrored full-disk encrypted device" is an exotic
> configuration that would give me pause.

It's documented in the FAQ:

    Note that "stacking" softraid modes (mirrored drives and encryption,
    for example) is not supported at this time

https://www.openbsd.org/faq/faq14.html#softraidFDE
Re: Kernel page fault in "sddetach -> bufq_destroy" during "bioctl -d"
2020-10-24 19:26 GMT+02:00 Theo de Raadt :
> Filippo Valsorda wrote:
>
> > 2020-10-24 19:01 GMT+02:00 Theo de Raadt :
> > > Filippo Valsorda wrote:
> > > > Said another way, how was I supposed to find out this is unsupported?
> > >
> > > The way you just found out.
> > >
> > > > It's not like "a mirrored full-disk encrypted device" is an exotic
> > > > configuration that would give me pause.
> > >
> > > there's a song that goes "You can't always get what you want"
> > >
> > > Nothing is perfect. Do people rail against other groups in the same way?
> >
> > Alright, I'm disengaging.
> >
> > This was a bizarre interaction, I just reported a crash that doesn't
> > even affect me anymore (I was disassembling that system), trying to
> > follow the reporting guidelines as much as possible, for something that
> > I had no way of knowing was unsupported.
>
> You are disengaging... but just have to get ONE MORE snipe in!
>
> Meanwhile, no diff. Not for the kernel, that would be difficult.
>
> But no diff for the manual pages either (it is rather obvious that
> the people who hit this would know what pages they read, and
> where they should have seen a warning, and what form it should take)

Ah, if you're interested in a patch for the manual page, happy to send
one. I'll read the contribution docs and send one tomorrow.

I had suggested both the page and the section where I would have found
a warning, but sending a diff telling you what you support and what you
don't felt more like overstepping. In my own projects, I prefer users
don't do that, as they can't know the boundary of what is supported and
what is not.
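[For context, a CAVEATS note along the lines discussed in this thread might look something like the sketch below. The wording and the mdoc context lines are hypothetical, not an actual tree diff against share/man/man4/softraid.4:]

```diff
--- share/man/man4/softraid.4
+++ share/man/man4/softraid.4
@@ .Sh CAVEATS
 .Sh CAVEATS
+Stacking
+.Nm
+disciplines, for example creating a CRYPTO volume on top of a RAID 1
+volume, is not supported, and detaching such stacked volumes with
+.Xr bioctl 8
+may panic the kernel.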
Re: Kernel page fault in "sddetach -> bufq_destroy" during "bioctl -d"
Filippo Valsorda wrote:
> 2020-10-24 19:01 GMT+02:00 Theo de Raadt :
> > Filippo Valsorda wrote:
> > > Said another way, how was I supposed to find out this is unsupported?
> >
> > The way you just found out.
> >
> > > It's not like "a mirrored full-disk encrypted device" is an exotic
> > > configuration that would give me pause.
> >
> > there's a song that goes "You can't always get what you want"
> >
> > Nothing is perfect. Do people rail against other groups in the same way?
>
> Alright, I'm disengaging.
>
> This was a bizarre interaction, I just reported a crash that doesn't
> even affect me anymore (I was disassembling that system), trying to
> follow the reporting guidelines as much as possible, for something that
> I had no way of knowing was unsupported.

You are disengaging... but just have to get ONE MORE snipe in!

Meanwhile, no diff. Not for the kernel, that would be difficult.

But no diff for the manual pages either (it is rather obvious that
the people who hit this would know what pages they read, and
where they should have seen a warning, and what form it should take)

But no. Either the margin is too narrow for such a diff, or it's
easier to assume that "I am right" commentary will generate results.

Some users really are their own worst enemy.
Re: Kernel page fault in "sddetach -> bufq_destroy" during "bioctl -d"
2020-10-24 19:01 GMT+02:00 Theo de Raadt :
> Filippo Valsorda wrote:
> > Said another way, how was I supposed to find out this is unsupported?
>
> The way you just found out.
>
> > It's not like "a mirrored full-disk encrypted device" is an exotic
> > configuration that would give me pause.
>
> there's a song that goes "You can't always get what you want"
>
> Nothing is perfect. Do people rail against other groups in the same way?

Alright, I'm disengaging.

This was a bizarre interaction, I just reported a crash that doesn't
even affect me anymore (I was disassembling that system), trying to
follow the reporting guidelines as much as possible, for something that
I had no way of knowing was unsupported.
Re: Kernel page fault in "sddetach -> bufq_destroy" during "bioctl -d"
Filippo Valsorda wrote:
> Said another way, how was I supposed to find out this is unsupported?

The way you just found out.

> It's not like "a mirrored full-disk encrypted device" is an exotic
> configuration that would give me pause.

there's a song that goes "You can't always get what you want"

Nothing is perfect. Do people rail against other groups in the same way?
Re: Kernel page fault in "sddetach -> bufq_destroy" during "bioctl -d"
2020-10-24 16:41 GMT+02:00 Stefan Sperling :
> On Sat, Oct 24, 2020 at 04:11:00PM +0200, Filippo Valsorda wrote:
> > Fair enough, but "there's no auto-assembly and it's inefficient and
> > nothing stops you from messing with the intermediate discipline" is a
> > different kind of not supported than "you should expect kernel panics".
> >
> > If the latter is the case, maybe it should be documented in the
> > softraid(4) CAVEATS, as it breaks the sd(4) abstraction.
>
> Neither Joel's mail nor the word "unsupported" imply a promise
> that it will work without auto-assembly and with inefficient i/o.
>
> Unsupported means unsupported. We don't need to list any reasons
> for this in user-facing documentation.

I'm not suggesting justifying why, I am saying that softraid(4) is
documented to assemble sd(4) devices into sd(4) devices. If it's
actually "sd(4) devices that are not themselves softraid(4) backed",
that would be worth documenting as it breaks the sd(4) abstraction.

Said another way, how was I supposed to find out this is unsupported?
It's not like "a mirrored full-disk encrypted device" is an exotic
configuration that would give me pause.
Re: Kernel page fault in "sddetach -> bufq_destroy" during "bioctl -d"
Demi M. Obenour wrote:
> On 10/24/20 10:41 AM, Stefan Sperling wrote:
> > On Sat, Oct 24, 2020 at 04:11:00PM +0200, Filippo Valsorda wrote:
> >> Fair enough, but "there's no auto-assembly and it's inefficient and
> >> nothing stops you from messing with the intermediate discipline" is a
> >> different kind of not supported than "you should expect kernel panics".
> >>
> >> If the latter is the case, maybe it should be documented in the
> >> softraid(4) CAVEATS, as it breaks the sd(4) abstraction.
> >
> > Neither Joel's mail nor the word "unsupported" imply a promise
> > that it will work without auto-assembly and with inefficient i/o.
> >
> > Unsupported means unsupported. We don't need to list any reasons
> > for this in user-facing documentation.
>
> One could also argue that the kernel must never panic because userspace
> did something wrong. The only exceptions I am aware of are:
>
> - init dying
> - corrupt kernel image
> - corrupt root filesystem
> - not being able to mount the root filesystem
> - overwriting kernel memory with /dev/mem or DMA
> - hardware fault

Really.

    rm -rf /
    reboot

Oh my god, it panics on reboot.

And hundreds of other possible ways for root to configure a broken
system for the next operation.

Sadly, the margin was too narrow for any solution in the form of source
code or diff; instead we as developers get instructed on What To Do.

If you guys aren't part of the solution, you are part of the precipitate.
Re: Kernel page fault in "sddetach -> bufq_destroy" during "bioctl -d"
On 10/24/20 10:41 AM, Stefan Sperling wrote:
> On Sat, Oct 24, 2020 at 04:11:00PM +0200, Filippo Valsorda wrote:
>> Fair enough, but "there's no auto-assembly and it's inefficient and
>> nothing stops you from messing with the intermediate discipline" is a
>> different kind of not supported than "you should expect kernel panics".
>>
>> If the latter is the case, maybe it should be documented in the
>> softraid(4) CAVEATS, as it breaks the sd(4) abstraction.
>
> Neither Joel's mail nor the word "unsupported" imply a promise
> that it will work without auto-assembly and with inefficient i/o.
>
> Unsupported means unsupported. We don't need to list any reasons
> for this in user-facing documentation.

One could also argue that the kernel must never panic because userspace
did something wrong. The only exceptions I am aware of are:

- init dying
- corrupt kernel image
- corrupt root filesystem
- not being able to mount the root filesystem
- overwriting kernel memory with /dev/mem or DMA
- hardware fault

In particular, I would expect that at securelevel 1 or higher, userspace
should not be able to cause a fatal kernel page fault.

Demi
Re: Kernel page fault in "sddetach -> bufq_destroy" during "bioctl -d"
On Sat, Oct 24, 2020 at 04:11:00PM +0200, Filippo Valsorda wrote:
> Fair enough, but "there's no auto-assembly and it's inefficient and
> nothing stops you from messing with the intermediate discipline" is a
> different kind of not supported than "you should expect kernel panics".
>
> If the latter is the case, maybe it should be documented in the
> softraid(4) CAVEATS, as it breaks the sd(4) abstraction.

Neither Joel's mail nor the word "unsupported" imply a promise
that it will work without auto-assembly and with inefficient i/o.

Unsupported means unsupported. We don't need to list any reasons
for this in user-facing documentation.
Re: Kernel page fault in "sddetach -> bufq_destroy" during "bioctl -d"
2020-10-24 15:37 GMT+02:00 Stefan Sperling :
> On Sat, Oct 24, 2020 at 03:10:05PM +0200, Filippo Valsorda wrote:
> > >Synopsis:	kernel page fault in "sddetach -> bufq_destroy" during "bioctl -d"
> > >Category:	kernel
> > >Environment:
> > 	System      : OpenBSD 6.8
> > 	Details     : OpenBSD 6.8 (GENERIC.MP) #98: Sun Oct  4 18:13:26 MDT 2020
> > 	              dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> >
> > 	Architecture: OpenBSD.amd64
> > 	Machine     : amd64
> > >Description:
> > 	Starting with two RAID 1 arrays with a CRYPTO device on top of each (see
> > 	"devices" below), I first unmounted the filesystems, then successfully
> > 	detached the CRYPTO devices, and then tried to detach the first RAID 1.
>
> Stacking softraid volumes is not supported.
>
> That's the best answer at this point in time. It's clear that a raid1+crypto
> solution is needed, but nobody has done the work to make it happen.
>
> As Joel explains here:
> https://marc.info/?l=openbsd-misc&m=154349798307366&w=2
> you get to keep the pieces when it breaks.

Fair enough, but "there's no auto-assembly and it's inefficient and
nothing stops you from messing with the intermediate discipline" is a
different kind of not supported than "you should expect kernel panics".

If the latter is the case, maybe it should be documented in the
softraid(4) CAVEATS, as it breaks the sd(4) abstraction.
Re: Kernel page fault in "sddetach -> bufq_destroy" during "bioctl -d"
On Sat, Oct 24, 2020 at 03:10:05PM +0200, Filippo Valsorda wrote:
> >Synopsis:	kernel page fault in "sddetach -> bufq_destroy" during "bioctl -d"
> >Category:	kernel
> >Environment:
> 	System      : OpenBSD 6.8
> 	Details     : OpenBSD 6.8 (GENERIC.MP) #98: Sun Oct  4 18:13:26 MDT 2020
> 	              dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
>
> 	Architecture: OpenBSD.amd64
> 	Machine     : amd64
> >Description:
> 	Starting with two RAID 1 arrays with a CRYPTO device on top of each (see
> 	"devices" below), I first unmounted the filesystems, then successfully
> 	detached the CRYPTO devices, and then tried to detach the first RAID 1.

Stacking softraid volumes is not supported.

That's the best answer at this point in time. It's clear that a raid1+crypto
solution is needed, but nobody has done the work to make it happen.

As Joel explains here:
https://marc.info/?l=openbsd-misc&m=154349798307366&w=2
you get to keep the pieces when it breaks.
Kernel page fault in "sddetach -> bufq_destroy" during "bioctl -d"
>Synopsis:	kernel page fault in "sddetach -> bufq_destroy" during "bioctl -d"
>Category:	kernel
>Environment:
	System      : OpenBSD 6.8
	Details     : OpenBSD 6.8 (GENERIC.MP) #98: Sun Oct  4 18:13:26 MDT 2020
	              dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP

	Architecture: OpenBSD.amd64
	Machine     : amd64
>Description:
	Starting with two RAID 1 arrays with a CRYPTO device on top of each (see
	"devices" below), I first unmounted the filesystems, then successfully
	detached the CRYPTO devices, and then tried to detach the first RAID 1.

	# bioctl -d sd8		# CRYPTO on top of sd5a
	# bioctl -d sd7		# CRYPTO on top of sd6a
	# bioctl -d sd5		# RAID 1 on top of sd2a,sd3a

	The machine immediately dropped into ddb, which I reached from the
	serial console. I was unable to recover the console output before the
	panic.

	ddb{0}> show panic
	kernel page fault
	uvm_fault(0x82153778, 0x0002, 0, 1) -> e
	bufq_destroy(80115710) at bufq_destroy+0x83
	end trace frame: 0x8000227379a0, count: 0

	ddb{0}> trace
	bufq_destroy(80115710) at bufq_destroy+0x83
	sddetach(80115600,1) at sddetach+0x42
	config_detach(80115600,1) at config_detach+0x142
	scsi_detach_link(80125800,1) at scsi_detach_link+0x4d
	sr_discipline_shutdown(80124000,1,0) at sr_discipline_shutdown+0x13e
	sr_bio_handler(8012,80124000,c2d04227,80655c00) at sr_bio_handler+0x1ce
	sdioctl(d52,c2d04227,80655c00,3,800022604030) at sdioctl+0x4e9
	VOP_IOCTL(fd8234962018,c2d04227,80655c00,3,fd828b7bd300,800022604030) at VOP_IOCTL+0x55
	vn_ioctl(fd80edfe2788,c2d04227,80655c00,800022604030) at vn_ioctl+0x75
	sys_ioctl(800022604030,800022737e60,800022737ec0) at sys_ioctl+0x2d4
	syscall(800022737f30) at syscall+0x389
	Xsyscall() at Xsyscall+0x128
	end of kernel
	end trace frame: 0x7f7de2e0, count: -12

	After rebooting that RAID 1 was gone. I successfully detached the
	remaining RAID 1, and generated the sendbug(1) output after wiping
	sd0,sd1,sd2,sd3.
devices:

==> sd0 / duid: b27184271af409ec
sd0: , serial WD-WCC4N2HCXC13
  a:  2794.5G        64  RAID
  c:  2794.5G         0  unused

==> sd1 / duid: cbcf06a16339dea8
sd1: , serial WD-WCC4N2KYDERF
  a:  2794.5G        64  RAID
  c:  2794.5G         0  unused

==> sd2 / duid: 612a362a9042bc12
sd2: , serial WD-WCC4N4ARTXUZ
  a:  2794.5G        64  RAID
  c:  2794.5G         0  unused

==> sd3 / duid: 6c4bf1c584cf2ad5
sd3: , serial WD-WCC4N5VRVF3C
  a:  2794.5G        64  RAID
  c:  2794.5G         0  unused

==> sd4 / duid: 3141c7e6e6fc07f5
sd4: , serial (unknown)
  a:     0.3G        64  4.2BSD  2048 16384  5657 # /
  b:     0.5G    730144  swap
  c:    14.6G         0  unused
  d:     0.4G   1739776  4.2BSD  2048 16384  7206 # /mfs/dev
  e:     0.6G   2662144  4.2BSD  2048 16384  9870 # /mfs/var
  f:     1.9G   3925504  4.2BSD  2048 16384 12960 # /usr
  g:     0.5G   7843264  4.2BSD  2048 16384  8061 # /home
  h:     1.6G   8883424  4.2BSD  2048 16384 12960 # /usr/local
  i:     1.4G  12249248  4.2BSD  2048 16384 12960 # /usr/src
  j:     5.2G  15080800  4.2BSD  2048 16384 12960 # /usr/obj

==> sd5 / duid: 7cc6a1f7b9a86bc3
    Volume      Status               Size Device
 softraid0 0    Online      3000592646656 sd5     RAID1
           0    Online      3000592646656 0:0.0   noencl
           1    Online      3000592646656 0:1.0   noencl
  a:  2794.5G         0  RAID
  c:  2794.5G         0  unused

==> sd6 / duid: b5e9c71ced61dca9
    Volume      Status               Size Device
 softraid0 1    Online      3000592646656 sd6     RAID1
           0    Online      3000592646656 1:0.0   noencl
           1    Online      3000592646656 1:1.0   noencl
  a:  1863.0G          0  RAID
  c:  2794.5G          0  unused
  d:   930.5G 3907021696  4.2BSD  8192 65536 52238 # /array/ct

==> sd7 / duid: d74b6781f7ca7b59
    Volume      Status               Size Device
 softraid0 2    Online      2000394827264 sd7     CRYPTO
           0    Online      2000394827264 2:0.0   noencl
  a:  1863.0G        64  4.2BSD  8192 65536 24688 # /array/misc
  c:  1863.0G         0  unused

==> sd8 / duid: 246216bff2633caa
    Volume      Status               Size Device
 softraid0 3    Online      3000592376320 sd8     CRYPTO
           0    Online      3000592376320 3:0.0   noencl
  a: 93
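[For readers unfamiliar with the layout above: the stacked configuration that triggers the crash can be assembled roughly as follows. This is a hypothetical reconstruction from the disklabels in the report, not commands taken from it; the resulting device names depend on attach order, and it requires root on an OpenBSD system:]

```shell
# First layer: RAID 1 mirrors over the RAID partitions of the raw disks.
bioctl -c 1 -l sd2a,sd3a softraid0    # attaches as sd5 in the report
bioctl -c 1 -l sd0a,sd1a softraid0    # attaches as sd6 in the report

# Second layer: CRYPTO disciplines on the mirrors' "a" partitions.
# This is the unsupported "stacking" step; bioctl prompts for a passphrase.
bioctl -c C -l sd5a softraid0         # attaches as sd8 in the report
bioctl -c C -l sd6a softraid0         # attaches as sd7 in the report

# Teardown in the order the report used; the third detach panicked:
# bioctl -d sd8 && bioctl -d sd7 && bioctl -d sd5
```

Detaching the CRYPTO volumes first and only then the mirror is the natural teardown order, which is why the panic in sr_discipline_shutdown -> sddetach -> bufq_destroy is easy to hit once the stack exists at all.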