Re: AMD EPYC 7551 box panic: pr_find_pagehead

2020-06-16 Thread Kenneth R Westerback
On Tue, Jun 16, 2020 at 10:34:44AM +0200, Janne Johansson wrote:
> On Mon, 15 June 2020 at 20:18, Kenneth R Westerback <
> kwesterb...@gmail.com> wrote:
> 
> > > If it would help, I could screenshot one page at a time, that seems to be
> > > the best I can do today.
> >
> > Works for me, though I don't recommend sending all those pics to
> > bugs@.
> >
> >
> http://c66.it.su.se:8080/obsd/amd-dmesg-jpegs/index.html
> 
> Painstakingly screenshotted one at a time, but hopefully readable as a
> series of images in a row.
> 
> -- 
> May the most significant bit of your life be positive.

I see

nvme0: KXG60ZNV256G Toshiba, firmware AGGA4103, serial 
scsibus2 at nvme0: 2 targets, initiator 0
sd0 at scsibus2 targ 1 lun 0: 
sd0: 244198MB, 512 bytes/sector, 500118192 sectors

nvme1: Micron_9300_MTFDHAL3T8DP, firmware 11300DG0, serial 
scsibus3 at nvme1: 33 targets, initiator 0
sd1 at scsibus3 targ 1 lun 0: 
sd1: 3662830MB, 512 bytes/sector, 7501476528 sectors

 sd32 at scsibus3 targ 2 -> 32 lun 0: 

nvme2: Micron_9300_MTFDHAL3T8DP, firmware 11300DG0, serial 
scsibus4 at nvme2: 33 targets, initiator 0
sd33 at scsibus4 targ 1 lun 0: 
sd33: 3662830MB, 512 bytes/sector, 7501476528 sectors

 sd64 at scsibus4 targ 2 -> 32 lun 0: 

nvme3: Micron_9300_MTFDHAL3T8DP, firmware 11300DG0, serial 
scsibus5 at nvme3: 33 targets, initiator 0
sd65 at scsibus5 targ 1 lun 0: 
sd65: 3662830MB, 512 bytes/sector, 7501476528 sectors

 sd96 at scsibus5 targ 2 -> 32 lun 0: 

nvme4: INTEL SSDPED1K375GA, firmware E2010435, serial 
scsibus6 at nvme4: 2 targets, initiator 0
sd97 at scsibus6 targ 1 lun 0: 
sd97: 357707MB, 512 bytes/sector, 732585168 sectors

So FIVE physical drives?

It looks to me like the Micron devices are reporting/configured with
32 namespaces (hence the 33 targets). Each namespace is treated as a
separate disk. I don't know if the device is actually configured this
way, or is reporting a max number where other drives are reporting
actual configured namespaces.

I am *guessing* that all the namespaces other than the first one have
0 sectors allocated. But we print out the "sdNN at scsibusXX ..."
line before the size is determined via INQUIRY. Dunno if that is
avoidable.

A SCSIDEBUG kernel will print a lot of interesting information that
would shed more light.

 Ken



Re: AMD EPYC 7551 box panic: pr_find_pagehead

2020-06-16 Thread Janne Johansson
On Mon, 15 June 2020 at 20:18, Kenneth R Westerback <
kwesterb...@gmail.com> wrote:

> > If it would help, I could screenshot one page at a time, that seems to be
> > the best I can do today.
>
> Works for me, though I don't recommend sending all those pics to
> bugs@.
>
>
http://c66.it.su.se:8080/obsd/amd-dmesg-jpegs/index.html

Painstakingly screenshotted one at a time, but hopefully readable as a
series of images in a row.

-- 
May the most significant bit of your life be positive.


Re: AMD EPYC 7551 box panic: pr_find_pagehead

2020-06-15 Thread Kenneth R Westerback
On Mon, Jun 15, 2020 at 08:03:40PM +0200, Janne Johansson wrote:
> On Mon, 15 June 2020 at 19:59, Kenneth R Westerback <
> kwesterb...@gmail.com> wrote:
> 
> > On Mon, Jun 15, 2020 at 07:15:36PM +0200, Janne Johansson wrote:
> > > Recent AMD box with a bunch of nvme drives, never booted anything, crashes
> >
> > The line
> >
> > scsibus2 at nvme1: 33 targets, initiator 0
> >
> > is also weird. I have never seen anything but 1 or 2 targets on nvme.
> >
> >
> My bad. It has four or five nvmes and there is some .. reflection going on.
> I seem to recall similar things in the old SCSI-1 days if you had bad
> termination; it has been a long time since I saw mirrored devices like this
> on the bus.
> 
> 
> > Running a kernel with SCSIDEBUG will produce more information on the
> > negotiation/discovery interactions.
> > Clarification of "bunch of nvme drives" and a complete dmesg would
> > also help.
> >
> 
> If it would help, I could screenshot one page at a time, that seems to be
> the best I can do today.

Works for me, though I don't recommend sending all those pics to
bugs@.

> 
> -- 
> May the most significant bit of your life be positive.

The other random thing to try is to find the line

sc->sc_link.adapter_buswidth = sc->sc_nn + 1;

in /usr/src/sys/dev/ic/nvme.c

and replace the "sc->sc_nn + 1" with 1 or 2. Perhaps the nvme
controller is returning interesting values for the namespace count in
the identify message. I see there is already some weird code to deal
with Apple oddities. :-)

 Ken



Re: AMD EPYC 7551 box panic: pr_find_pagehead

2020-06-15 Thread Janne Johansson
On Mon, 15 June 2020 at 19:59, Kenneth R Westerback <
kwesterb...@gmail.com> wrote:

> On Mon, Jun 15, 2020 at 07:15:36PM +0200, Janne Johansson wrote:
> > Recent AMD box with a bunch of nvme drives, never booted anything, crashes
>
> The line
>
> scsibus2 at nvme1: 33 targets, initiator 0
>
> is also weird. I have never seen anything but 1 or 2 targets on nvme.
>
>
My bad. It has four or five nvmes and there is some .. reflection going on.
I seem to recall similar things in the old SCSI-1 days if you had bad
termination; it has been a long time since I saw mirrored devices like this
on the bus.


> Running a kernel with SCSIDEBUG will produce more information on the
> negotiation/discovery interactions.
> Clarification of "bunch of nvme drives" and a complete dmesg would
> also help.
>

If it would help, I could screenshot one page at a time, that seems to be
the best I can do today.

-- 
May the most significant bit of your life be positive.


Re: AMD EPYC 7551 box panic: pr_find_pagehead

2020-06-15 Thread Kenneth R Westerback
On Mon, Jun 15, 2020 at 07:15:36PM +0200, Janne Johansson wrote:
> Recent AMD box with a bunch of nvme drives, never booted anything, crashes
> with
> panic: pr_find_pagehead: dma256: page header missing
> after listing some sd(4) drives on 14-Jun snapshot installation.
> 
> That is the only odd output in the dmesg as far as I can see, picture
> included.
> 
> 6.7 release has the same issue.
> 
> 6.6 installer works and I can boot the installed OS but I have no net on
> the box (mcx 40/56/100GE card in it) yet.
> 
> 
> -- 
> May the most significant bit of your life be positive.

The line

scsibus2 at nvme1: 33 targets, initiator 0

is also weird. I have never seen anything but 1 or 2 targets on nvme.

Running a kernel with SCSIDEBUG will produce more information on the
negotiation/discovery interactions.

Clarification of "bunch of nvme drives" and a complete dmesg would
also help.

 Ken



Re: AMD EPYC 7551 box panic: pr_find_pagehead

2020-06-15 Thread Janne Johansson
On Mon, 15 June 2020 at 19:55, Janne Johansson wrote:

>
> I currently only have remote-console access to it, but three of the
> four nvmes all get tons of ghosts.
>
>
and I am sorry in advance for the pictures, but with no network configured
on the switches the box is connected to, all I can do is screenshot the
console window I used for installation (along with remote-cd-iso use for
install66/67.iso)

-- 
May the most significant bit of your life be positive.


Re: AMD EPYC 7551 box panic: pr_find_pagehead

2020-06-15 Thread Mark Kettenis
> From: Janne Johansson 
> Date: Mon, 15 Jun 2020 19:15:36 +0200
> 
> Recent AMD box with a bunch of nvme drives, never booted anything, crashes
> with
> panic: pr_find_pagehead: dma256: page header missing
> after listing some sd(4) drives on 14-Jun snapshot installation.
> 
> That is the only odd output in the dmesg as far as I can see, picture
> included.

That Micron_9300_MTFD disk showing up twice is suspicious.

Does the machine boot with nvme(4) disabled and/or that particular
NVMe disk removed?