Re: Kernel page fault in "sddetach -> bufq_destroy" during "bioctl -d"

2020-10-24 Thread T.J. Townsend
> 2020-10-24 16:41 GMT+02:00 Stefan Sperling :
> > On Sat, Oct 24, 2020 at 04:11:00PM +0200, Filippo Valsorda wrote:
> > > Fair enough, but "there's no auto-assembly and it's inefficient and
> > > nothing stops you from messing with the intermediate discipline" is a
> > > different kind of not supported than "you should expect kernel panics".
> > > 
> > > If the latter is the case, maybe it should be documented in the
> > > softraid(4) CAVEATS, as it breaks the sd(4) abstraction.
> > 
> > Neither Joel's mail nor the word "unsupported" imply a promise
> > that it will work without auto-assembly and with inefficient i/o.
> > 
> > Unsupported means unsupported. We don't need to list any reasons
> > for this in user-facing documentation.
> 
> I'm not suggesting justifying why, I am saying that softraid(4) is
> documented to assemble sd(4) devices into sd(4) devices. If it's
> actually "sd(4) devices that are not themselves softraid(4) backed",
> that would be worth documenting as it breaks the sd(4) abstraction.
> 
> Said another way, how was I supposed to find out this is unsupported?
> It's not like "a mirrored full-disk encrypted device" is an exotic
> configuration that would give me pause.

It's documented in the FAQ:

> Note that "stacking" softraid modes (mirrored drives and encryption,
> for example) is not supported at this time

https://www.openbsd.org/faq/faq14.html#softraidFDE



Re: Kernel page fault in "sddetach -> bufq_destroy" during "bioctl -d"

2020-10-24 Thread Filippo Valsorda
2020-10-24 19:26 GMT+02:00 Theo de Raadt :
> Filippo Valsorda  wrote:
> 
> > 2020-10-24 19:01 GMT+02:00 Theo de Raadt :
> > 
> >  Filippo Valsorda  wrote:
> > 
> >  > Said another way, how was I supposed to find out this is unsupported?
> > 
> >  The way you just found out.
> > 
> >  > It's not like "a mirrored full-disk encrypted device" is an exotic
> >  > configuration that would give me pause.
> > 
> >  there's a song that goes "You can't always get what you want"
> > 
> >  Nothing is perfect.  Do people rail against other groups in the same way?
> > 
> > Alright, I'm disengaging.
> > 
> > This was a bizarre interaction, I just reported a crash that doesn't
> > even affect me anymore (I was disassembling that system), trying to
> > follow the reporting guidelines as much as possible, for something that
> > I had no way of knowing was unsupported.
> > 
> 
> You are disengaging... but just have to get ONE MORE snipe in!
> 
> Meanwhile, no diff.  Not for the kernel, that would be difficult.
> 
> But no diff for the manual pages either (it is rather obvious that
> the people who hit this would know what pages they read, and
> where they should have seen a warning, and what form it should take)

Ah, if you're interested in a patch for the manual page, happy to send
one. I'll read the contribution docs and send one tomorrow.

I had suggested both the page and the section where I would have
found a warning, but sending a diff telling you what you support
and what you don't felt more like overstepping. In my own projects, I
prefer users don't do that, as they can't know the boundary of what is
supported and what is not.
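
For what it's worth, the wording I have in mind is a short addition to the
CAVEATS section of softraid.4, along these lines (a rough sketch only; I'll
match it to the existing mdoc style before sending the actual diff):

.Sh CAVEATS
.\" ...existing text...
Stacking
.Nm
volumes, for example creating a CRYPTO volume on top of a RAID 1 volume,
is not supported and may lead to kernel panics.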


Re: Kernel page fault in "sddetach -> bufq_destroy" during "bioctl -d"

2020-10-24 Thread Theo de Raadt
Filippo Valsorda  wrote:

> 2020-10-24 19:01 GMT+02:00 Theo de Raadt :
> 
>  Filippo Valsorda  wrote:
> 
>  > Said another way, how was I supposed to find out this is unsupported?
> 
>  The way you just found out.
> 
>  > It's not like "a mirrored full-disk encrypted device" is an exotic
>  > configuration that would give me pause.
> 
>  there's a song that goes "You can't always get what you want"
> 
>  Nothing is perfect.  Do people rail against other groups in the same way?
> 
> Alright, I'm disengaging.
> 
> This was a bizarre interaction, I just reported a crash that doesn't
> even affect me anymore (I was disassembling that system), trying to
> follow the reporting guidelines as much as possible, for something that
> I had no way of knowing was unsupported.
> 

You are disengaging... but just have to get ONE MORE snipe in!

Meanwhile, no diff.  Not for the kernel, that would be difficult.

But no diff for the manual pages either (it is rather obvious that
the people who hit this would know what pages they read, and
where they should have seen a warning, and what form it should take)

But no.

Either the margin is too narrow for such a diff, or it's easier to
assume that "I am right" commentary will generate results.

Some users really are their own worst enemy.



Re: Kernel page fault in "sddetach -> bufq_destroy" during "bioctl -d"

2020-10-24 Thread Filippo Valsorda
2020-10-24 19:01 GMT+02:00 Theo de Raadt :
> Filippo Valsorda  wrote:
> 
> > Said another way, how was I supposed to find out this is unsupported?
> 
> The way you just found out.
> 
> > It's not like "a mirrored full-disk encrypted device" is an exotic
> > configuration that would give me pause.
> 
> there's a song that goes "You can't always get what you want"
> 
> 
> Nothing is perfect.  Do people rail against other groups in the same way?

Alright, I'm disengaging.

This was a bizarre interaction, I just reported a crash that doesn't
even affect me anymore (I was disassembling that system), trying to
follow the reporting guidelines as much as possible, for something that
I had no way of knowing was unsupported.


Re: Kernel page fault in "sddetach -> bufq_destroy" during "bioctl -d"

2020-10-24 Thread Theo de Raadt
Filippo Valsorda  wrote:

> Said another way, how was I supposed to find out this is unsupported?

The way you just found out.

> It's not like "a mirrored full-disk encrypted device" is an exotic
> configuration that would give me pause.

there's a song that goes "You can't always get what you want"


Nothing is perfect.  Do people rail against other groups in the same way?



Re: Kernel page fault in "sddetach -> bufq_destroy" during "bioctl -d"

2020-10-24 Thread Filippo Valsorda
2020-10-24 16:41 GMT+02:00 Stefan Sperling :
> On Sat, Oct 24, 2020 at 04:11:00PM +0200, Filippo Valsorda wrote:
> > Fair enough, but "there's no auto-assembly and it's inefficient and
> > nothing stops you from messing with the intermediate discipline" is a
> > different kind of not supported than "you should expect kernel panics".
> > 
> > If the latter is the case, maybe it should be documented in the
> > softraid(4) CAVEATS, as it breaks the sd(4) abstraction.
> 
> Neither Joel's mail nor the word "unsupported" imply a promise
> that it will work without auto-assembly and with inefficient i/o.
> 
> Unsupported means unsupported. We don't need to list any reasons
> for this in user-facing documentation.

I'm not suggesting justifying why, I am saying that softraid(4) is
documented to assemble sd(4) devices into sd(4) devices. If it's
actually "sd(4) devices that are not themselves softraid(4) backed",
that would be worth documenting as it breaks the sd(4) abstraction.

Said another way, how was I supposed to find out this is unsupported?
It's not like "a mirrored full-disk encrypted device" is an exotic
configuration that would give me pause.


Re: Kernel page fault in "sddetach -> bufq_destroy" during "bioctl -d"

2020-10-24 Thread Theo de Raadt
Demi M. Obenour  wrote:

> On 10/24/20 10:41 AM, Stefan Sperling wrote:
> > On Sat, Oct 24, 2020 at 04:11:00PM +0200, Filippo Valsorda wrote:
> >> Fair enough, but "there's no auto-assembly and it's inefficient and
> >> nothing stops you from messing with the intermediate discipline" is a
> >> different kind of not supported than "you should expect kernel panics".
> >>
> >> If the latter is the case, maybe it should be documented in the
> >> softraid(4) CAVEATS, as it breaks the sd(4) abstraction.
> > 
> > Neither Joel's mail nor the word "unsupported" imply a promise
> > that it will work without auto-assembly and with inefficient i/o.
> > 
> > Unsupported means unsupported. We don't need to list any reasons
> > for this in user-facing documentation.
> 
> One could also argue that the kernel must never panic because userspace
> did something wrong.  The only exceptions I am aware of are:
> 
> - init dying
> - corrupt kernel image
> - corrupt root filesystem
> - not being able to mount the root filesystem
> - overwriting kernel memory with /dev/mem or DMA
> - hardware fault

Really.

rm -rf /
reboot

Oh my god, it panics on reboot.  And hundreds of other possible ways for
root to configure a broken system for the next operation.

Sadly, the margin was too narrow for any solution in the form of source code
or diff; instead, we as developers get instructed on What To Do.

If you guys aren't part of the solution, you are part of the precipitate.



Re: Kernel page fault in "sddetach -> bufq_destroy" during "bioctl -d"

2020-10-24 Thread Demi M. Obenour
On 10/24/20 10:41 AM, Stefan Sperling wrote:
> On Sat, Oct 24, 2020 at 04:11:00PM +0200, Filippo Valsorda wrote:
>> Fair enough, but "there's no auto-assembly and it's inefficient and
>> nothing stops you from messing with the intermediate discipline" is a
>> different kind of not supported than "you should expect kernel panics".
>>
>> If the latter is the case, maybe it should be documented in the
>> softraid(4) CAVEATS, as it breaks the sd(4) abstraction.
> 
> Neither Joel's mail nor the word "unsupported" imply a promise
> that it will work without auto-assembly and with inefficient i/o.
> 
> Unsupported means unsupported. We don't need to list any reasons
> for this in user-facing documentation.

One could also argue that the kernel must never panic because userspace
did something wrong.  The only exceptions I am aware of are:

- init dying
- corrupt kernel image
- corrupt root filesystem
- not being able to mount the root filesystem
- overwriting kernel memory with /dev/mem or DMA
- hardware fault

In particular, I would expect that at securelevel 1 or higher,
userspace should not be able to cause a fatal kernel page fault.
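
For reference, the securelevel in effect can be checked with sysctl(8); a
stock multi-user install should report level 1:

$ sysctl kern.securelevel
kern.securelevel=1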

Demi




Re: Kernel page fault in "sddetach -> bufq_destroy" during "bioctl -d"

2020-10-24 Thread Stefan Sperling
On Sat, Oct 24, 2020 at 04:11:00PM +0200, Filippo Valsorda wrote:
> Fair enough, but "there's no auto-assembly and it's inefficient and
> nothing stops you from messing with the intermediate discipline" is a
> different kind of not supported than "you should expect kernel panics".
> 
> If the latter is the case, maybe it should be documented in the
> softraid(4) CAVEATS, as it breaks the sd(4) abstraction.

Neither Joel's mail nor the word "unsupported" imply a promise
that it will work without auto-assembly and with inefficient i/o.

Unsupported means unsupported. We don't need to list any reasons
for this in user-facing documentation.



Re: Kernel page fault in "sddetach -> bufq_destroy" during "bioctl -d"

2020-10-24 Thread Filippo Valsorda
2020-10-24 15:37 GMT+02:00 Stefan Sperling :
> On Sat, Oct 24, 2020 at 03:10:05PM +0200, Filippo Valsorda wrote:
> > >Synopsis: kernel page fault in "sddetach -> bufq_destroy" during "bioctl -d"
> > >Category: kernel
> > >Environment:
> > System  : OpenBSD 6.8
> > Details : OpenBSD 6.8 (GENERIC.MP) #98: Sun Oct  4 18:13:26 MDT 2020
> >  dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> > 
> > Architecture: OpenBSD.amd64
> > Machine : amd64
> > >Description:
> > Starting with two RAID 1 arrays with a CRYPTO device on top of each (see
> > "devices" below), I first unmounted the filesystems, then successfully
> > detached the CRYPTO devices, and then tried to detach the first RAID 1.
> 
> Stacking softraid volumes is not supported.
> 
> That's the best answer at this point in time. It's clear that a raid1+crypto
> solution is needed, but nobody has done the work to make it happen.
> 
> As Joel explains here: 
> > https://marc.info/?l=openbsd-misc&m=154349798307366&w=2
> you get to keep the pieces when it breaks.

Fair enough, but "there's no auto-assembly and it's inefficient and
nothing stops you from messing with the intermediate discipline" is a
different kind of not supported than "you should expect kernel panics".

If the latter is the case, maybe it should be documented in the
softraid(4) CAVEATS, as it breaks the sd(4) abstraction.


Re: Kernel page fault in "sddetach -> bufq_destroy" during "bioctl -d"

2020-10-24 Thread Stefan Sperling
On Sat, Oct 24, 2020 at 03:10:05PM +0200, Filippo Valsorda wrote:
> >Synopsis: kernel page fault in "sddetach -> bufq_destroy" during "bioctl -d"
> >Category: kernel
> >Environment:
>   System  : OpenBSD 6.8
>   Details : OpenBSD 6.8 (GENERIC.MP) #98: Sun Oct  4 18:13:26 MDT 2020
>
> dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> 
>   Architecture: OpenBSD.amd64
>   Machine : amd64
> >Description:
> Starting with two RAID 1 arrays with a CRYPTO device on top of each (see
> "devices" below), I first unmounted the filesystems, then successfully
> detached the CRYPTO devices, and then tried to detach the first RAID 1.

Stacking softraid volumes is not supported.

That's the best answer at this point in time. It's clear that a raid1+crypto
solution is needed, but nobody has done the work to make it happen.

As Joel explains here: https://marc.info/?l=openbsd-misc&m=154349798307366&w=2
you get to keep the pieces when it breaks.



Kernel page fault in "sddetach -> bufq_destroy" during "bioctl -d"

2020-10-24 Thread Filippo Valsorda
>Synopsis:      kernel page fault in "sddetach -> bufq_destroy" during "bioctl -d"
>Category:  kernel
>Environment:
System  : OpenBSD 6.8
Details : OpenBSD 6.8 (GENERIC.MP) #98: Sun Oct  4 18:13:26 MDT 2020
 
dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP

Architecture: OpenBSD.amd64
Machine : amd64
>Description:
Starting with two RAID 1 arrays with a CRYPTO device on top of each (see
"devices" below), I first unmounted the filesystems, then successfully
detached the CRYPTO devices, and then tried to detach the first RAID 1.
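
For reference, each stack had been assembled with the usual bioctl(8)
invocations, roughly as follows (a reconstruction from memory; the unit
numbers match the "devices" listing below):

# bioctl -c 1 -l sd2a,sd3a softraid0   # RAID 1 volume, attached as sd5
# bioctl -c C -l sd5a softraid0        # CRYPTO on sd5a, attached as sd8 (prompts for a passphrase)

and the same again for the second pair (sd6, with sd7 on top). The detach
sequence that triggered the crash was: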

# bioctl -d sd8 # CRYPTO on top of sd5a
# bioctl -d sd7 # CRYPTO on top of sd6a
# bioctl -d sd5 # RAID 1 on top of sd2a,sd3a

The machine immediately dropped into ddb, which I reached from the serial
console. I was unable to recover the console output before the panic.

ddb{0}> show panic
kernel page fault
uvm_fault(0x82153778, 0x0002, 0, 1) -> e
bufq_destroy(80115710) at bufq_destroy+0x83
end trace frame: 0x8000227379a0, count: 0

ddb{0}> trace
bufq_destroy(80115710) at bufq_destroy+0x83
sddetach(80115600,1) at sddetach+0x42
config_detach(80115600,1) at config_detach+0x142
scsi_detach_link(80125800,1) at scsi_detach_link+0x4d
sr_discipline_shutdown(80124000,1,0) at sr_discipline_shutdown+0x13e
sr_bio_handler(8012,80124000,c2d04227,80655c00) at sr_bio_handler+0x1ce
sdioctl(d52,c2d04227,80655c00,3,800022604030) at sdioctl+0x4e9
VOP_IOCTL(fd8234962018,c2d04227,80655c00,3,fd828b7bd300,800022604030) at VOP_IOCTL+0x55
vn_ioctl(fd80edfe2788,c2d04227,80655c00,800022604030) at vn_ioctl+0x75
sys_ioctl(800022604030,800022737e60,800022737ec0) at sys_ioctl+0x2d4
syscall(800022737f30) at syscall+0x389
Xsyscall() at Xsyscall+0x128
end of kernel
end trace frame: 0x7f7de2e0, count: -12

After rebooting, that RAID 1 was gone. I successfully detached the remaining
RAID 1, and generated the sendbug(1) output after wiping sd0,sd1,sd2,sd3.

devices:
==> sd0 / duid: b27184271af409ec
sd0: , serial WD-WCC4N2HCXC13
  a:  2794.5G   64RAID
  c:  2794.5G0  unused
==> sd1 / duid: cbcf06a16339dea8
sd1: , serial WD-WCC4N2KYDERF
  a:  2794.5G   64RAID
  c:  2794.5G0  unused
==> sd2 / duid: 612a362a9042bc12
sd2: , serial WD-WCC4N4ARTXUZ
  a:  2794.5G   64RAID
  c:  2794.5G0  unused
==> sd3 / duid: 6c4bf1c584cf2ad5
sd3: , serial WD-WCC4N5VRVF3C
  a:  2794.5G   64RAID
  c:  2794.5G0  unused
==> sd4 / duid: 3141c7e6e6fc07f5
sd4: , serial (unknown)
  a: 0.3G   64  4.2BSD   2048 16384  5657 # /
  b: 0.5G   730144swap
  c:14.6G0  unused
  d: 0.4G  1739776  4.2BSD   2048 16384  7206 # /mfs/dev
  e: 0.6G  2662144  4.2BSD   2048 16384  9870 # /mfs/var
  f: 1.9G  3925504  4.2BSD   2048 16384 12960 # /usr
  g: 0.5G  7843264  4.2BSD   2048 16384  8061 # /home
  h: 1.6G  8883424  4.2BSD   2048 16384 12960 # /usr/local
  i: 1.4G 12249248  4.2BSD   2048 16384 12960 # /usr/src
  j: 5.2G 15080800  4.2BSD   2048 16384 12960 # /usr/obj
==> sd5 / duid: 7cc6a1f7b9a86bc3
Volume  Status   Size Device
softraid0 0 Online  3000592646656 sd5 RAID1
  0 Online  3000592646656 0:0.0   noencl 
  1 Online  3000592646656 0:1.0   noencl 
  a:  2794.5G0RAID
  c:  2794.5G0  unused
==> sd6 / duid: b5e9c71ced61dca9
Volume  Status   Size Device
softraid0 1 Online  3000592646656 sd6 RAID1
  0 Online  3000592646656 1:0.0   noencl 
  1 Online  3000592646656 1:1.0   noencl 
  a:  1863.0G0RAID
  c:  2794.5G0  unused
  d:   930.5G   3907021696  4.2BSD   8192 65536 52238 # /array/ct
==> sd7 / duid: d74b6781f7ca7b59
Volume  Status   Size Device
softraid0 2 Online  2000394827264 sd7 CRYPTO
  0 Online  2000394827264 2:0.0   noencl 
  a:  1863.0G   64  4.2BSD   8192 65536 24688 # /array/misc
  c:  1863.0G0  unused
==> sd8 / duid: 246216bff2633caa
Volume  Status   Size Device
softraid0 3 Online  3000592376320 sd8 CRYPTO
  0 Online  3000592376320 3:0.0   noencl 
  a:   93