Re: Regression in 028abd92 for Sun UltraSPARC T1

2021-03-23 Thread Frank Scheiner

On 23.03.21 17:57, Christoph Hellwig wrote:

Frank, can you double check that commit
67e306c6906137020267eb9bbdbc127034da3627 really still works, and
only 028abd9222df0cf5855dab5014a5ebaf06f90565 broke your setup?


So I manually checked out both 67e306c6906137020267eb9bbdbc127034da3627
and 028abd9222df0cf5855dab5014a5ebaf06f90565 and recompiled both (doing
`make [...] mrproper` before each run).

The results didn't change from the ones from the bisecting process:

67e306c6906137020267eb9bbdbc127034da3627

...is working and:

028abd9222df0cf5855dab5014a5ebaf06f90565

...is broken on my T1000.

As I don't know how big attachments can be on this list, I put the logs
on pastebin.

A log for 028abd9222df is here:

https://pastebin.com/ApPYsMcu

A log for 67e306c69061 is here:

https://pastebin.com/uGLXX7RS

Cheers,
Frank



Re: Regression in 028abd92 for Sun UltraSPARC T1

2021-03-23 Thread Frank Scheiner

On 23.03.21 17:57, Christoph Hellwig wrote:

On Tue, Mar 23, 2021 at 05:50:59PM +0100, Jan Engelhardt wrote:

Some participants in the discussion over at the debian-sparc list mentioned
"NFS" and "Invalid argument", which is something I know just too well from
iptables. NFS is a filesystem that uses an extra data blob (5th argument to the
mount syscall). Such blobs have historically not always been designed to bear
the same layout between ILP32 and LP64 modes, and nfs's structs fell prey to
this as well.

My hypothesis now is that fs/nfs/fs_context.c line 1160:

    if (in_compat_syscall())
            nfs4_compat_mount_data_conv(data);

and ones similar to it (I didn't look too close where nfs3 gets to do its
conversion), no longer trigger as a result of compat_sys_mount being
wiped from the syscall table:


No, if in_compat_syscall() syscall doesn't trigger properly the kernel
would not get this far.

That being said, the NFS compat code was moved out of the compat mount
handler and into nfs and refactored in the commit just before this one.

Frank, can you double check that commit
67e306c6906137020267eb9bbdbc127034da3627 really still works, and
only 028abd9222df0cf5855dab5014a5ebaf06f90565 broke your setup?


Indeed, I also expected 67e306c6906137020267eb9bbdbc127034da3627 to fail
because of its commit message, but from my log it did work correctly.

As the T1000 is at home and I don't have another T1 based system in my
storage location where I am now, I'll double check that in the evening
and report back.

Strangely for a V245 (with UltraSPARC IIIi) both commits seem to work
according to my testing, but 5.10.x (from Debian) doesn't work and
5.9.15 (also from Debian) does work - tested now both for boot from
network and boot from disk.

Possibly unrelated to the problem with the T1000, the V245 emits the
following for boot from disk with 5.10.x:

```
[...]
Loading Linux 5.10.0-5-sparc64-smp ...
Loading initial ramdisk ...

[    2.602821] rtc_cmos rtc_cmos: IRQ index 0 not found
/dev/sda2: clean, 33516/8454144 files, 1105784/33798750 blocks
[   13.542728] autofs4:pid:1:autofs_fill_super: called with bogus options
[   13.628931] systemd[1]: proc-sys-fs-binfmt_misc.automount: Failed to initialize automounter: Invalid argument
[   13.759917] systemd[1]: Failed to set up automount Arbitrary Executable File Formats File System Automount Point.
[FAILED] Failed to set up automount  File System Automount Point.
[   14.456396] Unable to handle kernel paging request in mna handler
[   14.456400]  at virtual address da65f2fed110e482
[   14.597474] current->{active_,}mm->context = 00ce
[   14.597478] current->{active_,}mm->pgd = fff006d5c000
[   14.752380] Unable to handle kernel paging request in mna handler
[   14.752383]  at virtual address da65f2fed110e482
[   14.893509] current->{active_,}mm->context = 0094
[   14.969141] current->{active_,}mm->pgd = fff00011010e
[   15.040554] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0009
[   15.141430] Press Stop-A (L1-A) from sun keyboard or send break
[   15.141430] twice on console to return to the boot prom
[   15.141459] kernel BUG at kernel/cpu.c:960
```

Cheers,
Frank



Re: Regression in 028abd92 for Sun UltraSPARC T1

2021-03-23 Thread Christoph Hellwig
On Tue, Mar 23, 2021 at 05:50:59PM +0100, Jan Engelhardt wrote:
> Some participants in the discussion over at the debian-sparc list mentioned
> "NFS" and "Invalid argument", which is something I know just too well from
> iptables. NFS is a filesystem that uses an extra data blob (5th argument to the
> mount syscall). Such blobs have historically not always been designed to bear
> the same layout between ILP32 and LP64 modes, and nfs's structs fell prey to
> this as well.
> 
> My hypothesis now is that fs/nfs/fs_context.c line 1160:
> 
>   if (in_compat_syscall())
>           nfs4_compat_mount_data_conv(data);
> 
> and ones similar to it (I didn't look too close where nfs3 gets to do its
> conversion), no longer trigger as a result of compat_sys_mount being
> wiped from the syscall table:

No, if in_compat_syscall() syscall doesn't trigger properly the kernel
would not get this far.

That being said, the NFS compat code was moved out of the compat mount
handler and into nfs and refactored in the commit just before this one.

Frank, can you double check that commit
67e306c6906137020267eb9bbdbc127034da3627 really still works, and
only 028abd9222df0cf5855dab5014a5ebaf06f90565 broke your setup?



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-23 Thread Frank Scheiner

Hi,

On 23.03.21 17:30, Connor McLaughlan wrote:

Hi,

can anyone possibly give a list of known stable kernel versions for
SPARC machines? (is there a difference necessary between
architectures/old vs. newer machines? sun4u/sun4v)?

Also this instability manifests such that the machine is crashing during
high workload? (halting? rebooting?)

I ask because on three different SPARC machines I have been
experiencing a weird effect when using Debian:
I would start a high compile load for several days (7-10), during which the
machines run fine without any apparent error visible in dmesg or
anywhere else.
Then when I power off and on again, the filesystem would be corrupt and
sometimes impossible to repair without reinstallation.


Can you be sure that your used disks are in full working order? Maybe
you have bad sectors on them and their EOL is nearing, manifesting in
these FS errors? I assume the more accesses you have on your disks the
more a problem is prone to show up. And the accesses happening during
compile runs could be already too much for your disks. If you have
enough RAM, you could try to run your compile jobs in a RAM disk and
check if this makes a difference.


This seems to happen only when the machines do a long run with high
workload and seemingly not when I just power them off for the night
with no high workload.


I believe the error this thread is about is unrelated to what you
experience on your machines, because the problem here happens early on,
when the root FS is to be mounted.

Cheers,
Frank



Re: Regression in 028abd92 for Sun UltraSPARC T1

2021-03-23 Thread Jan Engelhardt


On Monday 2021-03-22 22:55, Frank Scheiner wrote:
>>> Riccardo Mottola first recognized a problem with 5.10.x kernels on his
>>> Sun T2000 with UltraSPARC T1 (details in [this thread]). I could verify
>>> the problem also on my Sun T1000 and it looks like this specific issue
>>> breaks the mounting of the root FS or maybe mounting file systems at
>>> all. This affects both booting from disk and from network.
>>> (...)
>>> ...as first bad commit.
>>>
>>> ```
>>> commit 028abd9222df0cf5855dab5014a5ebaf06f90565
>>> Author: Christoph Hellwig 
>>>  fs: remove compat_sys_mount

Some participants in the discussion over at the debian-sparc list mentioned
"NFS" and "Invalid argument", which is something I know just too well from
iptables. NFS is a filesystem that uses an extra data blob (5th argument to the
mount syscall). Such blobs have historically not always been designed to bear
the same layout between ILP32 and LP64 modes, and nfs's structs fell prey to
this as well.

My hypothesis now is that fs/nfs/fs_context.c line 1160:

    if (in_compat_syscall())
            nfs4_compat_mount_data_conv(data);

and ones similar to it (I didn't look too close where nfs3 gets to do its
conversion), no longer trigger as a result of compat_sys_mount being
wiped from the syscall table:

+++ arch/sparc/kernel/syscalls/syscall.tbl
@@ -201,7 +201,7 @@
 164  64      utrap_install    sys_utrap_install
 165  common  quotactl         sys_quotactl
 166  common  set_tid_address  sys_set_tid_address
-167  common  mount            sys_mount            compat_sys_mount
+167  common  mount            sys_mount

I didn't extract from the debian-sparc discussion whether people were running
the all-LP64 userspace, or had some older Debian with an ILP32-on-64-bit-kernel
setup.


[But that's just a theory - a kernel theory!]



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-23 Thread Connor McLaughlan
Hi,

can anyone possibly give a list of known stable kernel versions for SPARC
machines? (is there a difference necessary between architectures/old vs.
newer machines? sun4u/sun4v)?

Also this instability manifests such that the machine is crashing during
high workload? (halting? rebooting?)

I ask because on three different SPARC machines I have been experiencing a
weird effect when using Debian:
I would start a high compile load for several days (7-10), during which the
machines run fine without any apparent error visible in dmesg or
anywhere else.
Then when I power off and on again, the filesystem would be corrupt and
sometimes impossible to repair without reinstallation.

This seems to happen only when the machines do a long run with high
workload and seemingly not when I just power them off for the night with
no high workload.

Regards,
Connor


On Tue, Mar 23, 2021 at 4:46 PM Frank Scheiner wrote:

> Hi Jan,
>
> On 23.03.21 16:36, Jan Engelhardt wrote:
> > On Tuesday 2021-03-23 16:29, Frank Scheiner wrote:
> >> ```
> >> [...]
> >> Begin: Retrying nfs mount ... [   41.753937] NFS: mount program didn't
> >> pass remote address
> >> mount: Invalid argument
> >
> > I seem to recall that NFS is one of those filesystems that (a) makes use of
> > filesystem-specific data, i.e. mount(2)'s 5th argument, and (b) a mount helper,
> > /usr/sbin/mount.nfs.
> >
> > Now, with the change in Linux kernel 028abd9222df0cf5855dab5014a5ebaf06f90565,
> > I am postulating the hypothesis that the fs/nfs/ code for parsing this
> > binary blob is no longer aware that it is being invoked in a compat32 context.
> That sounds interesting. Can you perhaps post your hypothesis also in
> this thread:
>
> https://marc.info/?t=16164490063=1=2
>
> Maybe this gives the kernel developers some ideas.
>
> > Since T2 systems were said to be fine and T1 not, perhaps the T1 systems in
> > question were all on NFS mounts and the T2 one wasn't?
>
> No, the T5220 was also running diskless, actually using the same root FS
> as the T1000 (in form of a btrfs subvolume snapshot) plus identical
> kernel and initramfs:
>
> ```
> root@nfs:/srv/tftp# ls -la $( host2hex t5220 )*
> lrwxrwxrwx 1 root root 35 Feb 28  2018 AC10026E -> boot/grub/sparc64-ieee1275/core.img
> lrwxrwxrwx 1 root root 38 Mar 15 18:16 AC10026E.initrd.img -> initrd.img.5.10.0-4.debian.sid.sparc64
> lrwxrwxrwx 1 root root 36 Mar 15 18:16 AC10026E.vmlinuz -> linux.mp.5.10.0-4.debian.sid.sparc64
> ```
>
> Cheers,
> Frank
>
>


Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-23 Thread Frank Scheiner

Hi Jan,

On 23.03.21 16:36, Jan Engelhardt wrote:

On Tuesday 2021-03-23 16:29, Frank Scheiner wrote:

```
[...]
Begin: Retrying nfs mount ... [   41.753937] NFS: mount program didn't
pass remote address
mount: Invalid argument
```


I seem to recall that NFS is one of those filesystems that (a) makes use of
filesystem-specific data, i.e. mount(2)'s 5th argument, and (b) a mount helper,
/usr/sbin/mount.nfs.

Now, with the change in Linux kernel 028abd9222df0cf5855dab5014a5ebaf06f90565,
I am postulating the hypothesis that the fs/nfs/ code for parsing this
binary blob is no longer aware that it is being invoked in a compat32 context.


That sounds interesting. Can you perhaps post your hypothesis also in
this thread:

https://marc.info/?t=16164490063=1=2

Maybe this gives the kernel developers some ideas.


Since T2 systems were said to be fine and T1 not, perhaps the T1 systems in
question were all on NFS mounts and the T2 one wasn't?


No, the T5220 was also running diskless, actually using the same root FS
as the T1000 (in form of a btrfs subvolume snapshot) plus identical
kernel and initramfs:

```
root@nfs:/srv/tftp# ls -la $( host2hex t5220 )*
lrwxrwxrwx 1 root root 35 Feb 28  2018 AC10026E -> boot/grub/sparc64-ieee1275/core.img
lrwxrwxrwx 1 root root 38 Mar 15 18:16 AC10026E.initrd.img -> initrd.img.5.10.0-4.debian.sid.sparc64
lrwxrwxrwx 1 root root 36 Mar 15 18:16 AC10026E.vmlinuz -> linux.mp.5.10.0-4.debian.sid.sparc64
```

Cheers,
Frank



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-23 Thread Jan Engelhardt


On Tuesday 2021-03-23 16:29, Frank Scheiner wrote:
>>
>> while I was able to "install" correctly using a slightly older ISO, I
>> do not get a bootable system. The kernel appears to crash very early during
>> boot.
>
> From my current testing it looks like "UltraSPARC IIIi"s are also
> affected by this problem with UltraSPARC T1s in some way:
>
> With the latest Linux 5.10.x (from Debian) the root FS can't be
> successfully mounted, with the latest Linux 5.9.x (also from Debian) it
> just works fine. Unfortunately the V245 doesn't fail/work for the exact
> same kernels that I tested during the bisecting for the T1000, e.g. the
> first bad commit version that didn't work on the T1000 seems to work on
> the V245 but some good versions don't with:
>
> ```
> [...]
> Begin: Retrying nfs mount ... [   41.753937] NFS: mount program didn't
> pass remote address
> mount: Invalid argument

I seem to recall that NFS is one of those filesystems that (a) makes use of
filesystem-specific data, i.e. mount(2)'s 5th argument, and (b) a mount helper,
/usr/sbin/mount.nfs.

Now, with the change in Linux kernel 028abd9222df0cf5855dab5014a5ebaf06f90565,
I am postulating the hypothesis that the fs/nfs/ code for parsing this
binary blob is no longer aware that it is being invoked in a compat32 context.

Since T2 systems were said to be fine and T1 not, perhaps the T1 systems in
question were all on NFS mounts and the T2 one wasn't?



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-23 Thread Frank Scheiner

Hi all,

On 09.03.21 13:23, Riccardo Mottola wrote:

Hi all,

while I was able to "install" correctly using a slightly older ISO, I
do not get a bootable system. The kernel appears to crash very early during
boot.

Does anybody else have this issue?

   Booting `Debian GNU/Linux'

Loading Linux 5.10.0-4-sparc64-smp ...
Loading initial ramdisk ...



From my current testing it looks like "UltraSPARC IIIi"s are also
affected by this problem with UltraSPARC T1s in some way:

With the latest Linux 5.10.x (from Debian) the root FS can't be
successfully mounted, with the latest Linux 5.9.x (also from Debian) it
just works fine. Unfortunately the V245 doesn't fail/work for the exact
same kernels that I tested during the bisecting for the T1000: e.g. the
first bad commit, which didn't work on the T1000, seems to work on
the V245, but some good versions fail with:

```
[...]
Begin: Retrying nfs mount ... [   41.753937] NFS: mount program didn't
pass remote address
mount: Invalid argument
done.
[...]
```

I'm unsure what could go wrong here, as I always pass the remote address
via the kernel commandline:

```
[...]
[    2.928512] Kernel command line: BOOT_IMAGE=(tftp)/AC10027A.vmlinux root=/dev/nfs ip=172.16.2.122:172.16.0.2:172.16.0.1:255.255.0.0:v245-2:enp9s4f0:off nfsroot=172.16.0.2:/srv/nfs/v245-2/root nfsrootdebug rw
[...]
```

Maybe there is some breakage in the klibc-based programs in the
initramfs, but it is somewhat strange that this would not affect both
UltraSPARC IIIi and T1 in the same way.

Cheers,
Frank