Re: head -r338804 boots threadripper 1950X fine; head -r338810+ do not; -r338807 seems implicated

2018-10-22 Thread Mark Millard via freebsd-stable
[I'm unable to reproduce the early kernel crash under
Hyper-V with a WITH_ZFS= (implicit) build that includes
the loader patch I was given to try.]

On 2018-Oct-22, at 10:01 AM, Mark Millard  wrote:

> [I will note that the loader problem has been shown to
> not be involved in the kernel problem that this
> "Subject:" was originally for.]
> 
> On 2018-Oct-22, at 9:26 AM, Warner Losh  wrote:
> 
>> On Mon, Oct 22, 2018 at 6:39 AM Mark Millard  wrote:
>>> On 2018-Oct-22, at 4:07 AM, Toomas Soome  wrote:
>>> 
>>>> On 22 Oct 2018, at 13:58, Mark Millard wrote:
>>>>>
>>>>> On 2018-Oct-22, at 2:27 AM, Toomas Soome wrote:
>>>>>>
>>>>>>> On 22 Oct 2018, at 06:30, Warner Losh wrote:
>>>>>>>
>>>>>>> On Sun, Oct 21, 2018 at 9:28 PM Warner Losh wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sun, Oct 21, 2018 at 8:57 PM Mark Millard via freebsd-stable <
>>>>>>>> freebsd-stable@freebsd.org> wrote:
>>>>>>>>
>>>>>>>>> [I built based on WITHOUT_ZFS= for other reasons. But,
>>>>>>>>> after installing the build, Hyper-V based boots are
>>>>>>>>> working.]
>>>>>>>>>
>>>>>>>>> On 2018-Oct-20, at 2:09 AM, Mark Millard wrote:
>>>>>>>>>
>>>>>>>>>> On 2018-Oct-20, at 1:39 AM, Mark Millard wrote:
>>>>>>>>>> . . .
>>>>>>>
>>>>>>
>>>>>> It would help to get output from loader lsdev -v command.
>>>>>
>>>>> That turned out to be very interesting: The non-ZFS loader
>>>>> crashes during the listing, during disk8, which shows a
>>>>> x0 instead of a x512.
>>>>>
>>>>
>>>> Yes, thats the root cause there. The non-zfs loader does only *read* the
>>>> boot disk, thats why the issue was not revealed there.
>>>>
>>>> It would help to identify the sector size for that disk, at least from OS,
>>>> so we can compare with what we can get from INT13.
>>>>
>>>> I have pretty good idea what to look there, but I am afraid we need to run
>>>> few tests with you to understand why that disk is reporting sector size 0
>>>> there.
>>>>
>>>>
>>> 
>>> Looks like I guessed wrong about the device
>>> for "drive8".
>>> 
>>> So I unplugged the only other external
>>> storage device, so the original drives
>>> 0-13 become 0-11 overall.
>>> 
>>> The machine has a multi-LUN media card reader with
>>> no cards plugged in. It is built-in rather than
>>> one that I plugged into a port. It has 4 LUN's.
>>> 
>>> So 8+4=12 and drives 0-7 show up with media before
>>> it tries any of the 4 LUN's with no card in place.
>>> 
>>> I conclude that "drive8" is an empty LUN in a media
>>> card reader.
>>> 
>>> I conclude that there is no sector size available for
>>> any of the empty LUNs in the media reader.
>>> 
>> I think you are probably right and we're hitting some divide by 0 error when 
>> we should just ignore the disk.
> 
> In the Hyper-V context, the loader and kernel do not
> see the 4-LUN media reader at all: only drives with
> normal freebsd-* style partitions and free space.
> This explains why I did not see a loader problem
> in that context.
> 
> So I conclude that the kernel crash under Hyper-V
> associated with -r338807 is a separate issue even
> though WITHOUT_ZFS= seems to have avoided the
> crash.
> 
> My plan is to continue with the -r338807 investigation
> after the loader problem is fixed in my builds. Then
> I'll go back to trying builds using WITH_ZFS= (implicit),
> both native boots and Hyper-V based ones.

So much for my ability to make that inference correctly:

The WITH_ZFS= (implicit) build worked fine for booting
natively and via Hyper-V once the build included the patch
that fixes the loaders. I'm now unable to reproduce
the kernel-time crash.

The patch was from: https://reviews.freebsd.org/D11174

The empty LUNs in the media reader now get messages that
look something like:

disk8: Read 1 sector(s) from 0 to 0xe000 (0x8000): 0x31

early in the loader activity.
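
A minimal sketch, in Python, of how a diagnostic line like the one above could be picked apart when comparing boot logs. The regular expression and the group names (start, target, paren) are only a reading of the single line quoted here, not a documented loader format. As far as I know, 0x31 is the INT 13h status usually listed as "no media in drive", which would fit an empty card-reader LUN, but treat that as an assumption to check against a BIOS reference.

```python
import re

# Pattern inferred from the one message quoted above; group names are hypothetical.
READ_ERROR = re.compile(
    r"(?P<disk>disk\d+): Read (?P<count>\d+) sector\(s\) "
    r"from (?P<start>\S+) to (?P<target>\S+) \((?P<paren>\S+)\): "
    r"(?P<status>0x[0-9a-fA-F]+)"
)

def parse_read_error(line):
    """Return the disk name, sector count and status code, or None if no match."""
    match = READ_ERROR.search(line)
    if match is None:
        return None
    return {
        "disk": match.group("disk"),
        "count": int(match.group("count")),
        "status": int(match.group("status"), 16),
    }

sample = "disk8: Read 1 sector(s) from 0 to 0xe000 (0x8000): 0x31"
print(parse_read_error(sample))
```

Grouping such lines by disk name would show quickly whether only the empty LUNs report this status.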

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: head -r338804 boots threadripper 1950X fine; head -r338810+ do not; -r338807 seems implicated

2018-10-22 Thread Mark Millard via freebsd-stable
[I will note that the loader problem has been shown to
not be involved in the kernel problem that this
"Subject:" was originally for.]

On 2018-Oct-22, at 9:26 AM, Warner Losh  wrote:

> On Mon, Oct 22, 2018 at 6:39 AM Mark Millard  wrote:
>> On 2018-Oct-22, at 4:07 AM, Toomas Soome  wrote:
>> 
>> > On 22 Oct 2018, at 13:58, Mark Millard  wrote:
>> >> 
>> >> On 2018-Oct-22, at 2:27 AM, Toomas Soome  wrote:
>> >>> 
>> >>>> On 22 Oct 2018, at 06:30, Warner Losh wrote:
>> >>>>
>> >>>> On Sun, Oct 21, 2018 at 9:28 PM Warner Losh wrote:
>> >>>>
>> >>>>>
>> >>>>>
>> >>>>> On Sun, Oct 21, 2018 at 8:57 PM Mark Millard via freebsd-stable <
>> >>>>> freebsd-stable@freebsd.org> wrote:
>> >>>>>
>> >>>>>> [I built based on WITHOUT_ZFS= for other reasons. But,
>> >>>>>> after installing the build, Hyper-V based boots are
>> >>>>>> working.]
>> >>>>>>
>> >>>>>> On 2018-Oct-20, at 2:09 AM, Mark Millard wrote:
>> >>>>>>
>> >>>>>>> On 2018-Oct-20, at 1:39 AM, Mark Millard wrote:
>> >>>>>>> . . .
>> >>>>
>> >>> 
>> >>> It would help to get output from loader lsdev -v command.
>> >> 
>> >> That turned out to be very interesting: The non-ZFS loader
>> >> crashes during the listing, during disk8, which shows a
>> >> x0 instead of a x512.
>> >> 
>> > 
>> > Yes, thats the root cause there. The non-zfs loader does only *read* the 
>> > boot disk, thats why the issue was not revealed there. 
>> > 
>> > It would help to identify the sector size for that disk, at least from OS, 
>> > so we can compare with what we can get from INT13.
>> > 
>> > I have pretty good idea what to look there, but I am afraid we need to run 
>> > few tests with you to understand why that disk is reporting sector size 0 
>> > there.
>> > 
>> > 
>> 
>> Looks like I guessed wrong about the device
>> for "drive8".
>> 
>> So I unplugged the only other external
>> storage device, so the original drives
>> 0-13 become 0-11 overall.
>> 
>> The machine has a multi-LUN media card reader with
>> no cards plugged in. It is built-in rather than
>> one that I plugged into a port. It has 4 LUN's.
>> 
>> So 8+4=12 and drives 0-7 show up with media before
>> it tries any of the 4 LUN's with no card in place.
>> 
>> I conclude that "drive8" is an empty LUN in a media
>> card reader.
>> 
>> I conclude that there is no sector size available for
>> any of the empty LUNs in the media reader.
>> 
> I think you are probably right and we're hitting some divide by 0 error when 
> we should just ignore the disk.

In the Hyper-V context, the loader and kernel do not
see the 4-LUN media reader at all: only drives with
normal freebsd-* style partitions and free space.
This explains why I did not see a loader problem
in that context.

So I conclude that the kernel crash under Hyper-V
associated with -r338807 is a separate issue even
though WITHOUT_ZFS= seems to have avoided the
crash.

My plan is to continue with the -r338807 investigation
after the loader problem is fixed in my builds. Then
I'll go back to trying builds using WITH_ZFS= (implicit),
both native boots and Hyper-V based ones.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)



Re: head -r338804 boots threadripper 1950X fine; head -r338810+ do not; -r338807 seems implicated

2018-10-22 Thread Warner Losh
On Mon, Oct 22, 2018 at 6:39 AM Mark Millard  wrote:

> On 2018-Oct-22, at 4:07 AM, Toomas Soome  wrote:
>
> > On 22 Oct 2018, at 13:58, Mark Millard  wrote:
> >>
> >> On 2018-Oct-22, at 2:27 AM, Toomas Soome  wrote:
> >>>
> >>>> On 22 Oct 2018, at 06:30, Warner Losh wrote:
> >>>>
> >>>> On Sun, Oct 21, 2018 at 9:28 PM Warner Losh wrote:
> >>>>
> >>>>>
> >>>>>
> >>>>> On Sun, Oct 21, 2018 at 8:57 PM Mark Millard via freebsd-stable <
> >>>>> freebsd-stable@freebsd.org> wrote:
> >>>>>
> >>>>>> [I built based on WITHOUT_ZFS= for other reasons. But,
> >>>>>> after installing the build, Hyper-V based boots are
> >>>>>> working.]
> >>>>>>
> >>>>>> On 2018-Oct-20, at 2:09 AM, Mark Millard wrote:
> >>>>>>
> >>>>>>> On 2018-Oct-20, at 1:39 AM, Mark Millard wrote:
> >>>>>>> . . .
> >>>>
> >>>
> >>> It would help to get output from loader lsdev -v command.
> >>
> >> That turned out to be very interesting: The non-ZFS loader
> >> crashes during the listing, during disk8, which shows a
> >> x0 instead of a x512.
> >>
> >
> > Yes, thats the root cause there. The non-zfs loader does only *read* the
> > boot disk, thats why the issue was not revealed there.
> >
> > It would help to identify the sector size for that disk, at least from
> > OS, so we can compare with what we can get from INT13.
> >
> > I have pretty good idea what to look there, but I am afraid we need to
> > run few tests with you to understand why that disk is reporting sector size
> > 0 there.
> >
> >
>
> Looks like I guessed wrong about the device
> for "drive8".
>
> So I unplugged the only other external
> storage device, so the original drives
> 0-13 become 0-11 overall.
>
> The machine has a multi-LUN media card reader with
> no cards plugged in. It is built-in rather than
> one that I plugged into a port. It has 4 LUN's.
>
> So 8+4=12 and drives 0-7 show up with media before
> it tries any of the 4 LUN's with no card in place.
>
> I conclude that "drive8" is an empty LUN in a media
> card reader.
>
> I conclude that there is no sector size available for
> any of the empty LUNs in the media reader.
>

I think you are probably right and we're hitting some divide by 0 error
when we should just ignore the disk.
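
A rough sketch of both halves of that guess, written in Python for brevity rather than in the loader's C. The cs:eip bytes in the quoted BTX dump start with f7 f1, which under the standard x86 encoding (opcode F7, group 3, register-form ModRM) decodes as a 32-bit divide by ecx, so a zero divisor would fit the picture. The function names and the (name, sectors, sector_size) tuple layout are hypothetical illustrations, not the loader's data structures, and the real fix is whatever the patch under review does.

```python
# Illustrative only: none of these names come from the FreeBSD loader.

GROUP3_OPS = {0: "test", 2: "not", 3: "neg", 4: "mul", 5: "imul", 6: "div", 7: "idiv"}
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def decode_f7_register_form(opcode, modrm):
    """Decode an x86 F7 group-3 instruction whose ModRM byte names a register."""
    if opcode != 0xF7 or (modrm >> 6) != 0b11:
        return None
    op = GROUP3_OPS.get((modrm >> 3) & 0b111)
    if op is None:
        return None
    return f"{op} {REG32[modrm & 0b111]}"

def usable_disks(disks):
    """Keep only (name, sectors, sector_size) entries with a non-zero sector size."""
    return [d for d in disks if d[2] > 0]

def sectors_needed(nbytes, sector_size):
    """Round a byte count up to whole sectors; callers must pass sector_size > 0."""
    return (nbytes + sector_size - 1) // sector_size

# Two of the geometries from the lsdev -v listing quoted in this thread.
reported = [("disk0", 937703088, 512), ("disk8", 262144, 0)]
print(decode_f7_register_form(0xF7, 0xF1))
print([name for name, _, _ in usable_disks(reported)])
```

Filtering before any per-sector arithmetic means sectors_needed() is never reached with a zero sector size, which is the "just ignore the disk" behaviour in miniature.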

Warner


> >
> >
> >> Hand transcribed from pictures:
> >>
> >> OK lsdev -v
> >> disk devices
> >> disk0: BIOS drive C (937703088 x 512):
> >> disk0p1: FreeBSD boot 512K
> >> disk0p2: FreeBSD UFS  356G
> >> disk0p3: FreeBSD swap 15G
> >> disk0p4: FreeBSD swap 76G
> >> disk1: BIOS drive D (16514064 x 512):
> >> disk1s1: Linux   2048KB
> >> disk1s2: Unknown 952GB
> >> disk2: BIOS drive E (16514064 x 512):
> >> disk2p1: Unknown 128MB
> >> disk3: BIOS drive F (16514064 x 512):
> >> disk3p1: Unknown 128MB
> >> disk4: BIOS drive G (16434495 x 512):
> >> disk4p1: Unknown 128MB
> >> disk4p2: DOS/Windows 1716GB
> >> disk5: BIOS drive H (16434495 x 512):
> >> disk5p1: FreeBSD boot 512K
> >> disk5p2: FreeBSD UFS  176G
> >> disk5p3: FreeBSD swap 193G
> >> disk5p4: FreeBSD swap 15G
> >> disk6: BIOS drive I (16434495 x 512):
> >> disk6p1: Unknown 499MB
> >> disk6p2: EFI 99MB
> >> disk6p3: Unknown 16MB
> >> disk6p4: DOS/Windows 886G
> >> disk7: BIOS drive H (16434495 x 512):
> >> disk7p1: FreeBSD boot 512K
> >> disk7p2: FreeBSD UFS  953G
> >> disk8: BIOS drive K (262144 x 0):
> >>
> >> int=  err=  efl=00010246  eip=000286bd
> >> eax=  ebx=72b50430  ecx=  edx=
> >> esi=  edi=00092080  ebp=00091eec  esp=00091ea8
> >> cs=002b  ds=0033  es=0033  fs=0033  gs=0033  ss=0033
> >> cs:eip=f7 f1 89 c1 85 d2 0f 85-d8 01 00 00 6a 05 58 85
> >>   f6 0f 88 75 01 00 00 89-cb c1 fb 1f 89 ca 03 55
> >> ss:esp=09 00 00 00 00 00 00 00-0a 00 00 00 02 00 00 00
> >>   00 00 00 00 00 00 00 00-78 1f 09 00 33 45 04 00
> >> BTX halted
> >>
> >> I expect that "disk8" is what gpart show -p
> >> from a native boot showed as:
> >>
> >> =>      1  60062499    da1  MBR  (29G)
> >>         1        31         - free -  (16K)
> >>        32  60062468  da1s1  fat32lba  (29G)
> >>
> >> (That gpart show -p output is in another of the
> >> list messages.)
> >>
> >>> Also if you could test boot loader with UEFI - for example get to
> >>> loader prompt via usb/cd boot and then get the same lsdev -v output.
> >>
> >> Still true given the above crash? Or, going the
> >> other way, should "drive8" be left as it is in
> >> order to be sure to do this test with the drive
> >> present?
> >>
> >> If I do this test later, it will take a bit to
> >> get media to do it with. (It is about 4AM in the
> >> morning and I've yet to get to sleep.)
> >>
> >> Note: I've never tried a UEFI based boot of FreeBSD
> >> on this machine (but the Windows 10 Pro x64 is EFI
> >> based). The only FreeBSD context using a EFI partition
> >> to boot that I have used is on an arm aarch64
> >> Cortex-A57 system.
> >>
> >>> I would be interested to see the sector size information and if the
> >>> UEFI loader does also have issues.
> >>
> >> Understood.
> >>
> >>> If 

Re: krpc: unbootable ZFS-on-root after major upgrade to 11.2

2018-10-22 Thread Eugene Grosbein
22.10.2018 21:35, Glen Barber wrote:

>> This is just a typical foot-shooting (and a shortcoming of the kernel build
>> system that allows such foot-shooting to happen).
>> I think that there can be other ways in which you can specify inconsistent
>> kernel options and/or an incorrect subset of modules in MODULES_OVERRIDE to
>> create missing dependencies for critical modules.
>> Do we want to issue an errata for each possible misconfiguration?
> 
> Not necessarily.  I think it is a matter of how common the edge case is,
> for example.  I am perfectly fine removing the errata entry if this is
> an extreme edge case.

Well, using a stripped-down kernel instead of GENERIC may be less common than
it was 10 years ago, but it is not extremely rare, because of low-end virtual
machines.

The same goes for a stripped-down installed set of files without the full set
of kernel modules, though this may be rarer, as (virtual) disk space is
cheaper than RAM.

ZFS-on-root is definitely not that rare.



Re: krpc: unbootable ZFS-on-root after major upgrade to 11.2

2018-10-22 Thread Andriy Gapon
On 22/10/2018 17:32, Eugene Grosbein wrote:
> 22.10.2018 21:21, Andriy Gapon wrote:
>> This is just a typical foot-shooting (and a shortcoming of the kernel build
>> system that allows such foot-shooting to happen).
>> I think that there can be other ways in which you can specify inconsistent
>> kernel options and/or an incorrect subset of modules in MODULES_OVERRIDE to
>> create missing dependencies for critical modules.
>> Do we want to issue an errata for each possible misconfiguration?
> 
> OTOH, we have option krpc in sys/conf/options but it is not mentioned 
> elsewhere:
> not in the Handbook nor in the sys/conf/NOTES or GENERIC. Not a bit of our 
> documentation mentions
> that ZFS requires KRPC for last 10 years.
> 
> One can call it foot-shooting if it is against documentation but that's not 
> the case.

I certainly agree that there is a lack of documentation.
Still, this is a foot-shooting.

Anyway, my point was about a need to create an erratum for this kind of issue.
A documentation update would be much more appropriate.

-- 
Andriy Gapon


Re: krpc: unbootable ZFS-on-root after major upgrade to 11.2

2018-10-22 Thread Glen Barber
On Mon, Oct 22, 2018 at 05:21:43PM +0300, Andriy Gapon wrote:
> On 22/10/2018 17:15, Glen Barber wrote:
> > On Mon, Oct 22, 2018 at 09:09:14PM +0700, Eugene Grosbein wrote:
> >> 22.10.2018 21:03, Glen Barber wrote:
> >>
> >>>>> It's strange that this is a 10.x vs 11.x issue.
> >>>>> I see that zfs has the krpc dependency since r193128.
> >>>>> And the call to xdrmem_create is there since r168404.
> >>>>
> >>>> You are right. I was mis-informed and have not verified enough a report
> >>>> from local user.
> >>>>
> >>>> Glen, maybe that errata record should be deleted. The problem is real
> >>>> but it is long-standing
> >>>> and present in 10.x too.
> >>>>
> >>>
> >>> Could you elaborate more on the failure case you originally reported
> >>> first?  If the problem is real, my feeling is that the errata entry
> >>> should stay, just worded differently to reflect the failure case here.
> >>
> >> zfs.ko depends on krpc.ko. The KRPC code in compiled in GENERIC kernel as 
> >> dependency
> >> of NFS client/server code. The problem arises if all of these are true:
> >>
> >> 1) a system uses custom kernel with NFS options removed;
> >> 2) there is no krpc.ko available due to MODULES_OVERRIDE excluding it;
> >> 3) the system boots off ZFS pool.
> >>
> >> In such case, loader cannot resolve dependency and fails to load zfs.ko
> >> and kernel fails to mount root breaking boot sequence.
> >>
> >>
> > 
> > So, if I understand correctly (and please correct me if I am wrong), the
> > majority of the text in the errata note is correct, however needs to be
> > tweaked to remove "upgrading from 10.x...".  Is this generally correct?
> 
> This is just a typical foot-shooting (and a shortcoming of the kernel build
> system that allows such foot-shooting to happen).
> I think that there can be other ways in which you can specify inconsistent
> kernel options and/or an incorrect subset of modules in MODULES_OVERRIDE to
> create missing dependencies for critical modules.
> Do we want to issue an errata for each possible misconfiguration?
> 

Not necessarily.  I think it is a matter of how common the edge case is,
for example.  I am perfectly fine removing the errata entry if this is
an extreme edge case.  Meaning, I think it would be excessive to
document the fallout from adding 'nodevice mem' to the configuration
file.

Glen





Re: krpc: unbootable ZFS-on-root after major upgrade to 11.2

2018-10-22 Thread Eugene Grosbein
22.10.2018 21:21, Andriy Gapon wrote:

>>>>> Glen, maybe that errata record should be deleted. The problem is real but
>>>>> it is long-standing
>>>>> and present in 10.x too.
>>>>>
>>>>
>>>> Could you elaborate more on the failure case you originally reported
>>>> first?  If the problem is real, my feeling is that the errata entry
>>>> should stay, just worded differently to reflect the failure case here.
>>>
>>> zfs.ko depends on krpc.ko. The KRPC code in compiled in GENERIC kernel as 
>>> dependency
>>> of NFS client/server code. The problem arises if all of these are true:
>>>
>>> 1) a system uses custom kernel with NFS options removed;
>>> 2) there is no krpc.ko available due to MODULES_OVERRIDE excluding it;
>>> 3) the system boots off ZFS pool.
>>>
>>> In such case, loader cannot resolve dependency and fails to load zfs.ko
>>> and kernel fails to mount root breaking boot sequence.
>>>
>>>
>>
>> So, if I understand correctly (and please correct me if I am wrong), the
>> majority of the text in the errata note is correct, however needs to be
>> tweaked to remove "upgrading from 10.x...".  Is this generally correct?

Yes.

> This is just a typical foot-shooting (and a shortcoming of the kernel build
> system that allows such foot-shooting to happen).
> I think that there can be other ways in which you can specify inconsistent
> kernel options and/or an incorrect subset of modules in MODULES_OVERRIDE to
> create missing dependencies for critical modules.
> Do we want to issue an errata for each possible misconfiguration?

OTOH, we have the krpc option in sys/conf/options, but it is not mentioned
anywhere else: not in the Handbook, nor in sys/conf/NOTES or GENERIC. For the
last 10 years, nothing in our documentation has mentioned that ZFS requires
KRPC.

One can call it foot-shooting if it goes against the documentation, but that's
not the case here.




Re: krpc: unbootable ZFS-on-root after major upgrade to 11.2

2018-10-22 Thread Andriy Gapon
On 22/10/2018 17:15, Glen Barber wrote:
> On Mon, Oct 22, 2018 at 09:09:14PM +0700, Eugene Grosbein wrote:
>> 22.10.2018 21:03, Glen Barber wrote:
>>
>>>>> It's strange that this is a 10.x vs 11.x issue.
>>>>> I see that zfs has the krpc dependency since r193128.
>>>>> And the call to xdrmem_create is there since r168404.
>>>>
>>>> You are right. I was mis-informed and have not verified enough a report
>>>> from local user.
>>>>
>>>> Glen, maybe that errata record should be deleted. The problem is real but
>>>> it is long-standing
>>>> and present in 10.x too.
>>>>
>>>
>>> Could you elaborate more on the failure case you originally reported
>>> first?  If the problem is real, my feeling is that the errata entry
>>> should stay, just worded differently to reflect the failure case here.
>>
>> zfs.ko depends on krpc.ko. The KRPC code in compiled in GENERIC kernel as 
>> dependency
>> of NFS client/server code. The problem arises if all of these are true:
>>
>> 1) a system uses custom kernel with NFS options removed;
>> 2) there is no krpc.ko available due to MODULES_OVERRIDE excluding it;
>> 3) the system boots off ZFS pool.
>>
>> In such case, loader cannot resolve dependency and fails to load zfs.ko
>> and kernel fails to mount root breaking boot sequence.
>>
>>
> 
> So, if I understand correctly (and please correct me if I am wrong), the
> majority of the text in the errata note is correct, however needs to be
> tweaked to remove "upgrading from 10.x...".  Is this generally correct?

This is just a typical foot-shooting (and a shortcoming of the kernel build
system that allows such foot-shooting to happen).
I think that there can be other ways in which you can specify inconsistent
kernel options and/or an incorrect subset of modules in MODULES_OVERRIDE to
create missing dependencies for critical modules.
Do we want to issue an errata for each possible misconfiguration?


-- 
Andriy Gapon


Re: krpc: unbootable ZFS-on-root after major upgrade to 11.2

2018-10-22 Thread Glen Barber
On Mon, Oct 22, 2018 at 09:09:14PM +0700, Eugene Grosbein wrote:
> 22.10.2018 21:03, Glen Barber wrote:
> 
> >>> It's strange that this is a 10.x vs 11.x issue.
> >>> I see that zfs has the krpc dependency since r193128.
> >>> And the call to xdrmem_create is there since r168404.
> >>
> >> You are right. I was mis-informed and have not verified enough a report 
> >> from local user.
> >>
> >> Glen, maybe that errata record should be deleted. The problem is real but 
> >> it is long-standing
> >> and present in 10.x too.
> >>
> > 
> > Could you elaborate more on the failure case you originally reported
> > first?  If the problem is real, my feeling is that the errata entry
> > should stay, just worded differently to reflect the failure case here.
> 
> zfs.ko depends on krpc.ko. The KRPC code in compiled in GENERIC kernel as 
> dependency
> of NFS client/server code. The problem arises if all of these are true:
> 
> 1) a system uses custom kernel with NFS options removed;
> 2) there is no krpc.ko available due to MODULES_OVERRIDE excluding it;
> 3) the system boots off ZFS pool.
> 
> In such case, loader cannot resolve dependency and fails to load zfs.ko
> and kernel fails to mount root breaking boot sequence.
> 
> 

So, if I understand correctly (and please correct me if I am wrong), the
majority of the text in the errata note is correct, however needs to be
tweaked to remove "upgrading from 10.x...".  Is this generally correct?

Glen





Re: krpc: unbootable ZFS-on-root after major upgrade to 11.2

2018-10-22 Thread Eugene Grosbein
22.10.2018 21:03, Glen Barber wrote:

>>> It's strange that this is a 10.x vs 11.x issue.
>>> I see that zfs has the krpc dependency since r193128.
>>> And the call to xdrmem_create is there since r168404.
>>
>> You are right. I was mis-informed and have not verified enough a report from 
>> local user.
>>
>> Glen, maybe that errata record should be deleted. The problem is real but it 
>> is long-standing
>> and present in 10.x too.
>>
> 
> Could you elaborate more on the failure case you originally reported
> first?  If the problem is real, my feeling is that the errata entry
> should stay, just worded differently to reflect the failure case here.

zfs.ko depends on krpc.ko. The KRPC code is compiled into the GENERIC kernel
as a dependency of the NFS client/server code. The problem arises if all of
these are true:

1) the system uses a custom kernel with the NFS options removed;
2) there is no krpc.ko available because MODULES_OVERRIDE excludes it;
3) the system boots off a ZFS pool.

In that case, the loader cannot resolve the dependency and fails to load zfs.ko,
and the kernel fails to mount root, breaking the boot sequence.
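
The three conditions above can be sketched as a small predicate, here in Python purely for illustration. The function and parameter names are hypothetical; in particular, kernel_has_krpc stands for "KRPC ended up compiled into the kernel", whether pulled in by the NFS options or enabled explicitly.

```python
def zfs_root_boot_fails(kernel_has_krpc, krpc_ko_installed, root_on_zfs):
    """True when zfs.ko cannot get its krpc dependency and root is on ZFS."""
    # zfs.ko is loadable if KRPC is either in the kernel or available as a module.
    zfs_loadable = kernel_has_krpc or krpc_ko_installed
    return root_on_zfs and not zfs_loadable

# Custom kernel without NFS, MODULES_OVERRIDE without krpc, root on ZFS.
print(zfs_root_boot_fails(False, False, True))
# Same kernel, but krpc.ko is installed alongside zfs.ko.
print(zfs_root_boot_fails(False, True, True))
```

Breaking any one of the three conditions is enough to make the predicate false, which is why GENERIC users never see the problem.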




Re: krpc: unbootable ZFS-on-root after major upgrade to 11.2

2018-10-22 Thread Glen Barber
On Mon, Oct 22, 2018 at 08:59:10PM +0700, Eugene Grosbein wrote:
> 19.10.2018 21:34, Andriy Gapon wrote:
> 
> > It's strange that this is a 10.x vs 11.x issue.
> > I see that zfs has the krpc dependency since r193128.
> > And the call to xdrmem_create is there since r168404.
> 
> You are right. I was mis-informed and have not verified enough a report from 
> local user.
> 
> Glen, maybe that errata record should be deleted. The problem is real but it 
> is long-standing
> and present in 10.x too.
> 

Could you elaborate more on the failure case you originally reported
first?  If the problem is real, my feeling is that the errata entry
should stay, just worded differently to reflect the failure case here.

Glen





Re: krpc: unbootable ZFS-on-root after major upgrade to 11.2

2018-10-22 Thread Eugene Grosbein
19.10.2018 21:34, Andriy Gapon wrote:

> It's strange that this is a 10.x vs 11.x issue.
> I see that zfs has the krpc dependency since r193128.
> And the call to xdrmem_create is there since r168404.

You are right. I was misinformed and did not sufficiently verify a report from
a local user.

Glen, maybe that errata record should be deleted. The problem is real, but it
is long-standing and present in 10.x too.


Re: head -r338804 boots threadripper 1950X fine; head -r338810+ do not; -r338807 seems implicated

2018-10-22 Thread Mark Millard via freebsd-stable
On 2018-Oct-22, at 4:07 AM, Toomas Soome  wrote:

> On 22 Oct 2018, at 13:58, Mark Millard  wrote:
>> 
>> On 2018-Oct-22, at 2:27 AM, Toomas Soome  wrote:
>>> 
>>>> On 22 Oct 2018, at 06:30, Warner Losh wrote:
>>>>
>>>> On Sun, Oct 21, 2018 at 9:28 PM Warner Losh wrote:
>>>>
>>>>>
>>>>>
>>>>> On Sun, Oct 21, 2018 at 8:57 PM Mark Millard via freebsd-stable <
>>>>> freebsd-stable@freebsd.org> wrote:
>>>>>
>>>>>> [I built based on WITHOUT_ZFS= for other reasons. But,
>>>>>> after installing the build, Hyper-V based boots are
>>>>>> working.]
>>>>>>
>>>>>> On 2018-Oct-20, at 2:09 AM, Mark Millard wrote:
>>>>>>
>>>>>>> On 2018-Oct-20, at 1:39 AM, Mark Millard wrote:
>>>>>>> . . .
>>>>
>>> 
>>> It would help to get output from loader lsdev -v command.
>> 
>> That turned out to be very interesting: The non-ZFS loader
>> crashes during the listing, during disk8, which shows a
>> x0 instead of a x512.
>> 
> 
> Yes, thats the root cause there. The non-zfs loader does only *read* the boot 
> disk, thats why the issue was not revealed there. 
> 
> It would help to identify the sector size for that disk, at least from OS, so 
> we can compare with what we can get from INT13.
> 
> I have pretty good idea what to look there, but I am afraid we need to run 
> few tests with you to understand why that disk is reporting sector size 0 
> there.
> 
> 

Looks like I guessed wrong about the device
for "drive8".

So I unplugged the only other external
storage device, so the original drives
0-13 become 0-11 overall.

The machine has a multi-LUN media card reader with
no cards plugged in. It is built-in rather than
one that I plugged into a port. It has 4 LUNs.

So 8+4=12 and drives 0-7 show up with media before
it tries any of the 4 LUNs with no card in place.

I conclude that "drive8" is an empty LUN in a media
card reader.

I conclude that there is no sector size available for
any of the empty LUNs in the media reader.
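
The counting argument above, spelled out as a tiny Python sketch. It assumes, as the reasoning does, that the BIOS enumerates the 8 drives with media first and the 4 empty card-reader LUNs after them; the function name and labels are made up for illustration.

```python
def label_bios_drives(with_media, empty_luns):
    """Map diskN names to a label, assuming media-bearing drives are listed first."""
    labels = {}
    for n in range(with_media + empty_luns):
        labels[f"disk{n}"] = "media" if n < with_media else "empty LUN"
    return labels

labels = label_bios_drives(with_media=8, empty_luns=4)
print(len(labels), labels["disk7"], labels["disk8"])
```

Under that ordering assumption disk0 through disk7 carry media and disk8 is the first empty LUN, matching where lsdev -v stops.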

> 
> 
>> Hand transcribed from pictures:
>> 
>> OK lsdev -v
>> disk devices
>> disk0: BIOS drive C (937703088 x 512):
>> disk0p1: FreeBSD boot 512K
>> disk0p2: FreeBSD UFS  356G
>> disk0p3: FreeBSD swap 15G
>> disk0p4: FreeBSD swap 76G
>> disk1: BIOS drive D (16514064 x 512):
>> disk1s1: Linux   2048KB
>> disk1s2: Unknown 952GB
>> disk2: BIOS drive E (16514064 x 512):
>> disk2p1: Unknown 128MB
>> disk3: BIOS drive F (16514064 x 512):
>> disk3p1: Unknown 128MB
>> disk4: BIOS drive G (16434495 x 512):
>> disk4p1: Unknown 128MB
>> disk4p2: DOS/Windows 1716GB
>> disk5: BIOS drive H (16434495 x 512):
>> disk5p1: FreeBSD boot 512K
>> disk5p2: FreeBSD UFS  176G
>> disk5p3: FreeBSD swap 193G
>> disk5p4: FreeBSD swap 15G
>> disk6: BIOS drive I (16434495 x 512):
>> disk6p1: Unknown 499MB
>> disk6p2: EFI 99MB
>> disk6p3: Unknown 16MB
>> disk6p4: DOS/Windows 886G
>> disk7: BIOS drive H (16434495 x 512):
>> disk7p1: FreeBSD boot 512K
>> disk7p2: FreeBSD UFS  953G
>> disk8: BIOS drive K (262144 x 0):
>> 
>> int=  err=  efl=00010246  eip=000286bd
>> eax=  ebx=72b50430  ecx=  edx=
>> esi=  edi=00092080  ebp=00091eec  esp=00091ea8
>> cs=002b  ds=0033  es=0033  fs=0033  gs=0033  ss=0033
>> cs:eip=f7 f1 89 c1 85 d2 0f 85-d8 01 00 00 6a 05 58 85
>>   f6 0f 88 75 01 00 00 89-cb c1 fb 1f 89 ca 03 55
>> ss:esp=09 00 00 00 00 00 00 00-0a 00 00 00 02 00 00 00
>>   00 00 00 00 00 00 00 00-78 1f 09 00 33 45 04 00
>> BTX halted
>> 
>> I expect that "disk8" is what gpart show -p
>> from a native boot showed as:
>> 
>> =>      1  60062499    da1  MBR  (29G)
>>         1        31         - free -  (16K)
>>        32  60062468  da1s1  fat32lba  (29G)
>> 
>> (That gpart show -p output is in another of the
>> list messages.)
>> 
>>> Also if you could test boot loader with UEFI - for example get to loader 
>>> prompt via usb/cd boot and then get the same lsdev -v output.
>> 
>> Still true given the above crash? Or, going the
>> other way, should "drive8" be left as it is in
>> order to be sure to do this test with the drive
>> present?
>> 
>> If I do this test later, it will take a bit to
>> get media to do it with. (It is about 4AM in the
>> morning and I've yet to get to sleep.)
>> 
>> Note: I've never tried a UEFI based boot of FreeBSD
>> on this machine (but the Windows 10 Pro x64 is EFI
>> based). The only FreeBSD context using a EFI partition
>> to boot that I have used is on an arm aarch64
>> Cortex-A57 system.
>> 
>>> I would be interested to see the sector size information and if the UEFI 
>>> loader does also have issues.
>> 
>> Understood.
>> 
>>> If it does, I’d like to see the outputs from commands:
>> 
>>> zpool status
>>> zpool import
>> 
>> Independent of the UEFI test . . .
>> 
>> I do have a -r331924 head version on another one
>> of the devices and can native-boot that. It still
>> has its ZFS software (but a default loader without
>> ZFS).
>> 
>> Trying from that context, hand transcribed:
>> 
>> # zpool status
>> ZFS filesystem 

Re: head -r338804 boots threadripper 1950X fine; head -r338810+ do not; -r338807 seems implicated

2018-10-22 Thread Toomas Soome via freebsd-stable


> On 22 Oct 2018, at 13:58, Mark Millard  wrote:
> 
> On 2018-Oct-22, at 2:27 AM, Toomas Soome wrote:
>> 
>>> On 22 Oct 2018, at 06:30, Warner Losh  wrote:
>>> 
>>> On Sun, Oct 21, 2018 at 9:28 PM Warner Losh  wrote:
>>> 
 
 
 On Sun, Oct 21, 2018 at 8:57 PM Mark Millard via freebsd-stable <
 freebsd-stable@freebsd.org> wrote:
 
> [I built based on WITHOUT_ZFS= for other reasons. But,
> after installing the build, Hyper-V based boots are
> working.]
> 
> On 2018-Oct-20, at 2:09 AM, Mark Millard  wrote:
> 
>> On 2018-Oct-20, at 1:39 AM, Mark Millard  wrote:
>> 
>>> I attempted to jump from head -r334014 to -r339076
>>> on a threadripper 1950X board and the boot fails.
>>> This is both native booting and under Hyper-V,
>>> same machine and root file system in both cases.
>> 
>> I did my investigation under Hyper-V after seeing
>> a boot failure native.
>> 
>> Looks like the native failure is even earlier,
>> before db> is even possible, possibly during
>> early loader activity.
>> 
>> So this report is really for running under
>> Hyper-V: -r338804 boots and -r338810 does
>> not. By contrast -r334804 does not boot native.
>> (But I've little information for that context.)
>> 
>> Sorry for the confusion. I rushed the report
>> in hopes of getting to sleep. It was not to be.
>> 
>>> It fails just after the FreeBSD/SMP lines,
>>> reporting "kernel trap 9 with interrupts disabled".
>>> 
>>> It fails in pmap_force_invalidate_cache_range at
>>> a clflusl (%rax) instruction that produces a
>>> "Fatal trap 9: general protection fault while
>>> in kernel mode". cpuid=0 apic id= 00
>>> 
>>> I used kernel.txz files from:
>>> 
>>> https://artifact.ci.freebsd.org/snapshot/head/r*/amd64/amd64/
>>> 
>>> to narrow the range of kernel builds for working -> failing
>>> and got:
>>> 
>>> -r338804 boots fine
>>> (no amd64 kernel builds between to try)
>>> -r338810+ fails (any that I tried, anyway)
>>> 
>>> In that range is -r338807 :
>>> 
>>> QUOTE
>>> Author: kib
>>> Date: Wed Sep 19 19:35:02 2018
>>> New Revision: 338807
>>> URL:
>>> https://svnweb.freebsd.org/changeset/base/338807
>>> 
>>> 
>>> Log:
>>> Convert x86 cache invalidation functions to ifuncs.
>>> 
>>> This simplifies the runtime logic and reduces the number of
>>> runtime-constant branches.
>>> 
>>> Reviewed by: alc, markj
>>> Sponsored by: The FreeBSD Foundation
>>> Approved by: re (gjb)
>>> Differential revision:
>>> https://reviews.freebsd.org/D16736
>>> 
>>> Modified:
>>> head/sys/amd64/amd64/pmap.c
>>> head/sys/amd64/include/pmap.h
>>> head/sys/dev/drm2/drm_os_freebsd.c
>>> head/sys/dev/drm2/i915/intel_ringbuffer.c
>>> head/sys/i386/i386/pmap.c
>>> head/sys/i386/i386/vm_machdep.c
>>> head/sys/i386/include/pmap.h
>>> head/sys/x86/iommu/intel_utils.c
>>> END QUOTE
>>> 
>>> There do seem to be changes associated with
>>> clflush(...) use. Looking at:
>>> 
>>> 
> https://svnweb.freebsd.org/base/head/sys/amd64/amd64/pmap.c?annotate=339432
>>> 
>>> it appears that pmap_force_invalidate_cache_range has not
>>> changed since -r338807.
>>> 
>>> It seems that -r338806 and -r338810 would be unlikely
>>> contributors.
>> 
> 
> I went after my native-boot loader problem first because I
> could switch kernels via the loader for booting FreeBSD under
> Hyper-V. Switching loaders is more of a problem.
> 
> In order to avoid the loader-time crash I switched to building and
> installing based on WITHOUT_ZFS= . I've had no active use of
> ZFS in years. (The old official-build loaders that worked were
> non-ZFS ones.)
> 
> This took care of the native-boot loader-crash --and, to my
> surprise, also the Hyper-V-boot kernel-time crash.
> 
> My private builds now boot the 1950X in both contexts just
> fine.
> 
> During my early investigation I did pick up specific changes
> from after -r339076 that seemed to be tied to Ryzen and such.
> (They made no difference to the boot problems at the time
> but I saw no reason to remove them.)
> 
> # uname -apKU
> FreeBSD FBSDFSSD 12.0-ALPHA8 FreeBSD 12.0-ALPHA8 #5 r339076:339432M: Sun
> Oct 21 16:44:25 PDT 2018 
> markmi@FBSDFSSD:/usr/obj/amd64_clang/amd64.amd64/usr/src/amd64.amd64/sys/GENERIC-NODBG
> amd64 amd64 1200084 1200084
 
 
>>> (stupid gmail)
>>> 
>>> The phrase "no active use" bothers me. What does that mean? Are there any
>>> ZFS pools or any disks that have any whiff of a ZFSish thing on them at all?
>>> Clearly, there's something in the zfs boot loader that's freaking out over
>>> something on your system, but absent 

Re: head -r338804 boots threadripper 1950X fine; head -r338810+ do not; -r338807 seems implicated

2018-10-22 Thread Mark Millard via freebsd-stable
On 2018-Oct-22, at 2:27 AM, Toomas Soome  wrote:
> 
>> On 22 Oct 2018, at 06:30, Warner Losh  wrote:
>> 
>> On Sun, Oct 21, 2018 at 9:28 PM Warner Losh  wrote:
>> 
>>> 
>>> 
>>> On Sun, Oct 21, 2018 at 8:57 PM Mark Millard via freebsd-stable <
>>> freebsd-stable@freebsd.org> wrote:
>>> 
 [I built based on WITHOUT_ZFS= for other reasons. But,
 after installing the build, Hyper-V based boots are
 working.]
 
 On 2018-Oct-20, at 2:09 AM, Mark Millard  wrote:
 
> On 2018-Oct-20, at 1:39 AM, Mark Millard  wrote:
> 
>> I attempted to jump from head -r334014 to -r339076
>> on a threadripper 1950X board and the boot fails.
>> This is both native booting and under Hyper-V,
>> same machine and root file system in both cases.
> 
> I did my investigation under Hyper-V after seeing
> a boot failure native.
> 
> Looks like the native failure is even earlier,
> before db> is even possible, possibly during
> early loader activity.
> 
> So this report is really for running under
> Hyper-V: -r338804 boots and -r338810 does
> not. By contrast -r334804 does not boot native.
> (But I've little information for that context.)
> 
> Sorry for the confusion. I rushed the report
> in hopes of getting to sleep. It was not to be.
> 
>> It fails just after the FreeBSD/SMP lines,
>> reporting "kernel trap 9 with interrupts disabled".
>> 
>> It fails in pmap_force_invalidate_cache_range at
>> a clflusl (%rax) instruction that produces a
>> "Fatal trap 9: general protection fault while
>> in kernel mode". cpuid=0 apic id= 00
>> 
>> I used kernel.txz files from:
>> 
>> https://artifact.ci.freebsd.org/snapshot/head/r*/amd64/amd64/
>> 
>> to narrow the range of kernel builds for working -> failing
>> and got:
>> 
>> -r338804 boots fine
>> (no amd64 kernel builds between to try)
>> -r338810+ fails (any that I tried, anyway)
>> 
>> In that range is -r338807 :
>> 
>> QUOTE
>> Author: kib
>> Date: Wed Sep 19 19:35:02 2018
>> New Revision: 338807
>> URL:
>> https://svnweb.freebsd.org/changeset/base/338807
>> 
>> 
>> Log:
>> Convert x86 cache invalidation functions to ifuncs.
>> 
>> This simplifies the runtime logic and reduces the number of
>> runtime-constant branches.
>> 
>> Reviewed by: alc, markj
>> Sponsored by: The FreeBSD Foundation
>> Approved by: re (gjb)
>> Differential revision:
>> https://reviews.freebsd.org/D16736
>> 
>> Modified:
>> head/sys/amd64/amd64/pmap.c
>> head/sys/amd64/include/pmap.h
>> head/sys/dev/drm2/drm_os_freebsd.c
>> head/sys/dev/drm2/i915/intel_ringbuffer.c
>> head/sys/i386/i386/pmap.c
>> head/sys/i386/i386/vm_machdep.c
>> head/sys/i386/include/pmap.h
>> head/sys/x86/iommu/intel_utils.c
>> END QUOTE
>> 
>> There do seem to be changes associated with
>> clflush(...) use. Looking at:
>> 
>> 
 https://svnweb.freebsd.org/base/head/sys/amd64/amd64/pmap.c?annotate=339432
>> 
>> it appears that pmap_force_invalidate_cache_range has not
>> changed since -r338807.
>> 
>> It seems that -r338806 and -r338810 would be unlikely
>> contributors.
> 
 
 I went after my native-boot loader problem first because I
 could switch kernels via the loader for booting FreeBSD under
 Hyper-V. Switching loaders is more of a problem.
 
 In order to avoid the loader-time crash I switched to building and
 installing based on WITHOUT_ZFS= . I've had no active use of
 ZFS in years. (The old official-build loaders that worked were
 non-ZFS ones.)
 
 This took care of the native-boot loader-crash --and, to my
 surprise, also the Hyper-V-boot kernel-time crash.
 
 My private builds now boot the 1950X in both contexts just
 fine.
 
 During my early investigation I did pick up specific changes
 from after -r339076 that seemed to be tied to Ryzen and such.
 (They made no difference to the boot problems at the time
 but I saw no reason to remove them.)
 
 # uname -apKU
 FreeBSD FBSDFSSD 12.0-ALPHA8 FreeBSD 12.0-ALPHA8 #5 r339076:339432M: Sun
 Oct 21 16:44:25 PDT 2018 
 markmi@FBSDFSSD:/usr/obj/amd64_clang/amd64.amd64/usr/src/amd64.amd64/sys/GENERIC-NODBG
 amd64 amd64 1200084 1200084
>>> 
>>> 
>> (stupid gmail)
>> 
>> The phrase "no active use" bothers me. What does that mean? Are there any
>> ZFS pools or any disks that have any whiff of a ZFSish thing on them at all?
>> Clearly, there's something in the zfs boot loader that's freaking out over
>> something on your system, but absent that information I can't help you.
>> 
> 
> It would help to get output from loader lsdev -v command.

That turned out to be very interesting: The non-ZFS loader
crashes during the listing, during disk8, which 

Re: head -r338804 boots threadripper 1950X fine; head -r338810+ do not; -r338807 seems implicated

2018-10-22 Thread Toomas Soome via freebsd-stable


> On 22 Oct 2018, at 06:30, Warner Losh  wrote:
> 
> On Sun, Oct 21, 2018 at 9:28 PM Warner Losh  wrote:
> 
>> 
>> 
>> On Sun, Oct 21, 2018 at 8:57 PM Mark Millard via freebsd-stable <
>> freebsd-stable@freebsd.org> wrote:
>> 
>>> [I built based on WITHOUT_ZFS= for other reasons. But,
>>> after installing the build, Hyper-V based boots are
>>> working.]
>>> 
>>> On 2018-Oct-20, at 2:09 AM, Mark Millard  wrote:
>>> 
 On 2018-Oct-20, at 1:39 AM, Mark Millard  wrote:
 
> I attempted to jump from head -r334014 to -r339076
> on a threadripper 1950X board and the boot fails.
> This is both native booting and under Hyper-V,
> same machine and root file system in both cases.
 
 I did my investigation under Hyper-V after seeing
 a boot failure native.
 
 Looks like the native failure is even earlier,
 before db> is even possible, possibly during
 early loader activity.
 
 So this report is really for running under
 Hyper-V: -r338804 boots and -r338810 does
 not. By contrast -r334804 does not boot native.
 (But I've little information for that context.)
 
 Sorry for the confusion. I rushed the report
 in hopes of getting to sleep. It was not to be.
 
> It fails just after the FreeBSD/SMP lines,
> reporting "kernel trap 9 with interrupts disabled".
> 
> It fails in pmap_force_invalidate_cache_range at
> a clflusl (%rax) instruction that produces a
> "Fatal trap 9: general protection fault while
> in kernel mode". cpuid=0 apic id= 00
> 
> I used kernel.txz files from:
> 
> https://artifact.ci.freebsd.org/snapshot/head/r*/amd64/amd64/
> 
> to narrow the range of kernel builds for working -> failing
> and got:
> 
> -r338804 boots fine
> (no amd64 kernel builds between to try)
> -r338810+ fails (any that I tried, anyway)
> 
> In that range is -r338807 :
> 
> QUOTE
> Author: kib
> Date: Wed Sep 19 19:35:02 2018
> New Revision: 338807
> URL:
> https://svnweb.freebsd.org/changeset/base/338807
> 
> 
> Log:
> Convert x86 cache invalidation functions to ifuncs.
> 
> This simplifies the runtime logic and reduces the number of
> runtime-constant branches.
> 
> Reviewed by: alc, markj
> Sponsored by: The FreeBSD Foundation
> Approved by: re (gjb)
> Differential revision:
> https://reviews.freebsd.org/D16736
> 
> Modified:
> head/sys/amd64/amd64/pmap.c
> head/sys/amd64/include/pmap.h
> head/sys/dev/drm2/drm_os_freebsd.c
> head/sys/dev/drm2/i915/intel_ringbuffer.c
> head/sys/i386/i386/pmap.c
> head/sys/i386/i386/vm_machdep.c
> head/sys/i386/include/pmap.h
> head/sys/x86/iommu/intel_utils.c
> END QUOTE
> 
> There do seem to be changes associated with
> clflush(...) use. Looking at:
> 
> 
>>> https://svnweb.freebsd.org/base/head/sys/amd64/amd64/pmap.c?annotate=339432
> 
> it appears that pmap_force_invalidate_cache_range has not
> changed since -r338807.
> 
> It seems that -r338806 and -r338810 would be unlikely
> contributors.
 
>>> 
>>> I went after my native-boot loader problem first because I
>>> could switch kernels via the loader for booting FreeBSD under
>>> Hyper-V. Switching loaders is more of a problem.
>>> 
>>> In order to avoid the loader-time crash I switched to building and
>>> installing based on WITHOUT_ZFS= . I've had no active use of
>>> ZFS in years. (The old official-build loaders that worked were
>>> non-ZFS ones.)
>>> 
>>> This took care of the native-boot loader-crash --and, to my
>>> surprise, also the Hyper-V-boot kernel-time crash.
>>> 
>>> My private builds now boot the 1950X in both contexts just
>>> fine.
>>> 
>>> During my early investigation I did pick up specific changes
>>> from after -r339076 that seemed to be tied to Ryzen and such.
>>> (They made no difference to the boot problems at the time
>>> but I saw no reason to remove them.)
>>> 
>>> # uname -apKU
>>> FreeBSD FBSDFSSD 12.0-ALPHA8 FreeBSD 12.0-ALPHA8 #5 r339076:339432M: Sun
>>> Oct 21 16:44:25 PDT 2018 
>>> markmi@FBSDFSSD:/usr/obj/amd64_clang/amd64.amd64/usr/src/amd64.amd64/sys/GENERIC-NODBG
>>> amd64 amd64 1200084 1200084
>> 
>> 
> (stupid gmail)
> 
> The phrase "no active use" bothers me. What does that mean? Are there any
> ZFS pools or any disks that have any whiff of a ZFSish thing on them at all?
> Clearly, there's something in the zfs boot loader that's freaking out over
> something on your system, but absent that information I can't help you.
> 

It would help to get the output of the loader's lsdev -v command. Also, could
you test the boot loader with UEFI: for example, get to the loader prompt via a
usb/cd boot and then capture the same lsdev -v output? I would be interested to
see the sector size information and whether the UEFI loader also has issues. If
it does, I’d like to see the 

Re: head -r338804 boots threadripper 1950X fine; head -r338810+ do not; -r338807 seems implicated

2018-10-22 Thread Mark Millard via freebsd-stable



On 2018-Oct-21, at 8:30 PM, Warner Losh  wrote:

> On Sun, Oct 21, 2018 at 9:28 PM Warner Losh  wrote:
> 
> On Sun, Oct 21, 2018 at 8:57 PM Mark Millard via freebsd-stable 
>  wrote:
>> [I built based on WITHOUT_ZFS= for other reasons. But,
>> after installing the build, Hyper-V based boots are
>> working.]
>> 
>> On 2018-Oct-20, at 2:09 AM, Mark Millard  wrote:
>> 
>> > On 2018-Oct-20, at 1:39 AM, Mark Millard  wrote:
>> > 
>> >> I attempted to jump from head -r334014 to -r339076
>> >> on a threadripper 1950X board and the boot fails.
>> >> This is both native booting and under Hyper-V,
>> >> same machine and root file system in both cases.
>> > 
>> > I did my investigation under Hyper-V after seeing
>> > a boot failure native.
>> > 
>> > Looks like the native failure is even earlier,
>> > before db> is even possible, possibly during
>> > early loader activity.
>> > 
>> > So this report is really for running under
>> > Hyper-V: -r338804 boots and -r338810 does
>> > not. By contrast -r334804 does not boot native.
>> > (But I've little information for that context.)
>> > 
>> > Sorry for the confusion. I rushed the report
>> > in hopes of getting to sleep. It was not to be.
>> > 
>> >> It fails just after the FreeBSD/SMP lines,
>> >> reporting "kernel trap 9 with interrupts disabled".
>> >> 
>> >> It fails in pmap_force_invalidate_cache_range at
>> >> a clflusl (%rax) instruction that produces a
>> >> "Fatal trap 9: general protection fault while
>> >> in kernel mode". cpuid=0 apic id= 00
>> >> 
>> >> I used kernel.txz files from:
>> >> 
>> >> https://artifact.ci.freebsd.org/snapshot/head/r*/amd64/amd64/
>> >> 
>> >> to narrow the range of kernel builds for working -> failing
>> >> and got:
>> >> 
>> >> -r338804 boots fine
>> >> (no amd64 kernel builds between to try)
>> >> -r338810+ fails (any that I tried, anyway)
>> >> 
>> >> In that range is -r338807 :
>> >> 
>> >> QUOTE
>> >> Author: kib
>> >> Date: Wed Sep 19 19:35:02 2018
>> >> New Revision: 338807
>> >> URL: 
>> >> https://svnweb.freebsd.org/changeset/base/338807
>> >> 
>> >> 
>> >> Log:
>> >> Convert x86 cache invalidation functions to ifuncs.
>> >> 
>> >> This simplifies the runtime logic and reduces the number of
>> >> runtime-constant branches.
>> >> 
>> >> Reviewed by: alc, markj
>> >> Sponsored by: The FreeBSD Foundation
>> >> Approved by: re (gjb)
>> >> Differential revision:   
>> >> https://reviews.freebsd.org/D16736
>> >> 
>> >> Modified:
>> >> head/sys/amd64/amd64/pmap.c
>> >> head/sys/amd64/include/pmap.h
>> >> head/sys/dev/drm2/drm_os_freebsd.c
>> >> head/sys/dev/drm2/i915/intel_ringbuffer.c
>> >> head/sys/i386/i386/pmap.c
>> >> head/sys/i386/i386/vm_machdep.c
>> >> head/sys/i386/include/pmap.h
>> >> head/sys/x86/iommu/intel_utils.c
>> >> END QUOTE
>> >> 
>> >> There do seem to be changes associated with
>> >> clflush(...) use. Looking at:
>> >> 
>> >> https://svnweb.freebsd.org/base/head/sys/amd64/amd64/pmap.c?annotate=339432
>> >> 
>> >> it appears that pmap_force_invalidate_cache_range has not
>> >> changed since -r338807.
>> >> 
>> >> It seems that -r338806 and -r338810 would be unlikely
>> >> contributors.
>> > 
>> 
>> I went after my native-boot loader problem first because I
>> could switch kernels via the loader for booting FreeBSD under
>> Hyper-V. Switching loaders is more of a problem.
>> 
>> In order to avoid the loader-time crash I switched to building and
>> installing based on WITHOUT_ZFS= . I've had no active use of
>> ZFS in years. (The old official-build loaders that worked were
>> non-ZFS ones.)
>> 
>> This took care of the native-boot loader-crash --and, to my
>> surprise, also the Hyper-V-boot kernel-time crash.
>> 
>> My private builds now boot the 1950X in both contexts just
>> fine.
>> 
>> During my early investigation I did pick up specific changes
>> from after -r339076 that seemed to be tied to Ryzen and such.
>> (They made no difference to the boot problems at the time
>> but I saw no reason to remove them.)
>> 
>> # uname -apKU
>> FreeBSD FBSDFSSD 12.0-ALPHA8 FreeBSD 12.0-ALPHA8 #5 r339076:339432M: Sun Oct 
>> 21 16:44:25 PDT 2018 
>> markmi@FBSDFSSD:/usr/obj/amd64_clang/amd64.amd64/usr/src/amd64.amd64/sys/GENERIC-NODBG
>>   amd64 amd64 1200084 1200084
>> 
>> (stupid gmail) 
> 
> The phrase "no active use" bothers me. What does that mean? Are there any ZFS 
> pools or any disks that have any whiff of a ZFSish thing on them at all? Clearly, 
> there's something in the zfs boot loader that's freaking out over something on 
> your system, but absent that information I can't help you.

No ZFS pools: Strictly UFS for FreeBSD file systems
for the last few years, UFS before I had access to
the 1950X system.

I've never before bothered to use WITHOUT_ZFS= in
my builds. So the system had the ZFS support,
such as kernel modules, over all the time that
this system had been in use.

Prior to the recent versions I saw no such problems.
But the default loader was not ZFS capable.
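
For anyone wanting to double-check the "no whiff of ZFS" claim on a
given disk, here is a minimal sketch, not something that was run as
part of this report. It assumes the usual ZFS on-disk layout as I
understand it: four 256 KiB vdev labels (two at the start of the
device, two at the end), each with its uberblock ring starting
128 KiB into the label, and each uberblock beginning with the 64-bit
magic 0x00bab10c in the writer's byte order. The function names are
made up for illustration.

```python
import io
import struct

# Assumed ZFS on-disk constants (see the lead-in; not taken from this thread).
LABEL_SIZE = 256 * 1024             # size of one vdev label
UBERBLOCK_RING_OFFSET = 128 * 1024  # uberblock ring offset inside a label
UBERBLOCK_STEP = 1024               # smallest uberblock slot size
UBERBLOCK_MAGIC = 0x00BAB10C


def label_offsets(device_size):
    """Return the byte offsets of the four label positions, or [] if too small."""
    usable = device_size - device_size % LABEL_SIZE  # align down to label size
    if usable < 4 * LABEL_SIZE:
        return []
    return [0, LABEL_SIZE, usable - 2 * LABEL_SIZE, usable - LABEL_SIZE]


def find_zfs_uberblocks(dev, device_size):
    """Return offsets where an uberblock magic is found in any label's ring.

    dev is any seekable binary file object (a disk device or an image).
    """
    hits = []
    for label_start in label_offsets(device_size):
        ring_start = label_start + UBERBLOCK_RING_OFFSET
        dev.seek(ring_start)
        ring = dev.read(LABEL_SIZE - UBERBLOCK_RING_OFFSET)
        for slot in range(0, len(ring) - 7, UBERBLOCK_STEP):
            raw = ring[slot:slot + 8]
            # The magic is stored in the byte order of the host that wrote it.
            little = struct.unpack("<Q", raw)[0]
            big = struct.unpack(">Q", raw)[0]
            if UBERBLOCK_MAGIC in (little, big):
                hits.append(ring_start + slot)
    return hits
```

An empty result only means that no uberblock magic was found at the
assumed positions. The device size has to be supplied by the caller
(for example from diskinfo output), and on a real system zdb -l on
the device is the more authoritative check.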


As seen in the under-Hyper-V