Re: releng/13 release/13.0.0 : odd/incorrect diff result over nfs (in a zfs file systems context)

2021-05-21 Thread Rick Macklem
Mark Millard wrote:
[stuff snipped]
>Well, why is it that ls -R, find, and diff -r all get file
>name problems via genet0 but diff -r gets no problems
>comparing the content of files that it does match up (the
>vast majority)? Any clue how could the problems possibly
>be unique to the handling of file names/paths? Does it
>suggest anything else to look into for getting some more
>potentially useful evidence?
Well, all I can do is describe the most common TSO related
failure:
- When a read RPC reply (including NFS/RPC/TCP/IP headers)
  is slightly less than 64K bytes (many TSO implementations are
  limited to 64K or 32 discontiguous segments, think 32 2K
  mbuf clusters), the driver decides it is ok, but when the MAC
  header is added it exceeds what the hardware can handle correctly...
--> This will happen when reading a regular file that is slightly less
   than a multiple of 64K in size.
or
--> This will happen when reading just about any large directory,
  since the directory reply for a 64K request is converted to Sun XDR
  format and clipped at the last full directory entry that will fit within 
64K.
For ports, where most files are small, I think you can tell which is more
likely to happen.
--> If TSO is disabled, I have no idea how this might matter, but??

>I'll note that netstat -I ue0 -d and netstat -I genet0 -d
>do not report changes in Ierrs or Idrop in a before vs.
>after failures comparison. (There may be better figures
>to look at for all I know.)
>
>I tried "ifconfig genet0 -rxcsum -rxcsum -rxcsum6 -txcsum6"
>and got no obvious change in behavior.
All we know is that the data is getting corrupted somehow.

NFS traffic looks very different than typical TCP traffic. It is
mostly small messages travelling in both directions concurrently,
with some large messages thrown in the mix.
All I'm saying is that, testing a net interface with something like
bulk data transfer in one direction doesn't verify it works for NFS
traffic.

Also, the large RPC messages are a chain of about 33 mbufs of
various lengths, including a mix of partial clusters and regular
data mbufs, whereas a bulk send on a socket will typically
result in an mbuf chain of a lot of full 2K clusters.
--> As such, NFS can be good at tickling subtle bugs in the
  net driver related to mbuf handling.

rick

> W.r.t. reverting r367492...the patch to replace r367492 was just
> committed to "main" by rscheff@ with a two week MFC, so it
> should be in stable/13 soon. Not sure if an errata can be done
> for it for releng13.0?

That update is reported to be causing "rack" related panics:

https://lists.freebsd.org/pipermail/dev-commits-src-main/2021-May/004440.html

reports (via links):

panic: _mtx_lock_sleep: recursed on non-recursive mutex so_snd @ 
/syzkaller/managers/i386/kernel/sys/modules/tcp/rack/../../../netinet/tcp_stacks/rack.c:10632

Still, I have a non-debug update to main building and will
likely do a debug build as well. llvm is rebuilding, so
the builds will take a notable time.

> Thanks for isolating this, rick
> ps: Co-incidentally, I've been thinking of buying an RBPi4 as a toy.

I'll warn that the primary "small arm" development/support
folk(s) do not work on the RPi*'s these days, beyond
committing what others provide and the like.




===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)


___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: releng/13 release/13.0.0 : odd/incorrect diff result over nfs (in a zfs file systems context)

2021-05-21 Thread Mark Millard via freebsd-stable



On 2021-May-21, at 09:00, Rick Macklem  wrote:

> Mark Millard wrote:
>> On 2021-May-20, at 22:19, Rick Macklem  wrote:
> [stuff snipped]
>>> ps: I do not think that r367492 could cause this, but it would be
>>>nice if you try a kernel with the r367492 patch reverted.
>>>It is currently in all of releng13, stable13 and main, although
>>>the patch to fix this was just reviewed and may hit main soon.
>> 
>> Do you want a debug kernel to be used? Do you have a preference
>> for main vs. stable/13 vs. release/13.0.0 based? Is it okay to
>> stick to the base version things are now based on --or do you
>> want me to update to more recent? (That last only applies if
>> main or stable/13 is to be put to use.)
> Well, it sounds like you've isolated it to the genet interface.
> Good sleuthing.
> Unfortunately, NFS is only as good as the network fabric under it.
> However, it's usually hangs or poor performance. Except maybe
> for the readdir issue that Jason Bacon reported and resolved via
> an upgrade, this is a first.
> --> In the old days, I would have expected IP checksums to catch
>   this, but I'm guessing the hardware/net driver are doing them
>   these days?

Well, why is it that ls -R, find, and diff -r all get file
name problems via genet0 but diff -r gets no problems
comparing the content of files that it does match up (the
vast majority)? Any clue how could the problems possibly
be unique to the handling of file names/paths? Does it
suggest anything else to look into for getting some more
potentially useful evidence?

I'll note that netstat -I ue0 -d and netstat -I genet0 -d
do not report changes in Ierrs or Idrop in a before vs.
after failures comparison. (There may be better figures
to look at for all I know.)

I tried "ifconfig genet0 -rxcsum -rxcsum -rxcsum6 -txcsum6"
and got no obvious change in behavior.

> W.r.t. reverting r367492...the patch to replace r367492 was just
> committed to "main" by rscheff@ with a two week MFC, so it
> should be in stable/13 soon. Not sure if an errata can be done
> for it for releng13.0?

That update is reported to be causing "rack" related panics:

https://lists.freebsd.org/pipermail/dev-commits-src-main/2021-May/004440.html

reports (via links):

panic: _mtx_lock_sleep: recursed on non-recursive mutex so_snd @ 
/syzkaller/managers/i386/kernel/sys/modules/tcp/rack/../../../netinet/tcp_stacks/rack.c:10632

Still, I have a non-debug update to main building and will
likely do a debug build as well. llvm is rebuilding, so
the builds will take a notable time.

> Thanks for isolating this, rick
> ps: Co-incidentally, I've been thinking of buying an RBPi4 as a toy.

I'll warn that the primary "small arm" development/support
folk(s) do not work on the RPi*'s these days, beyond
committing what others provide and the like.




===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)



Re: releng/13 release/13.0.0 : odd/incorrect diff result over nfs (in a zfs file systems context)

2021-05-21 Thread Rick Macklem
Mark Millard wrote:
>On 2021-May-20, at 22:19, Rick Macklem  wrote:
[stuff snipped]
>> ps: I do not think that r367492 could cause this, but it would be
>> nice if you try a kernel with the r367492 patch reverted.
>> It is currently in all of releng13, stable13 and main, although
>> the patch to fix this was just reviewed and may hit main soon.
>
>Do you want a debug kernel to be used? Do you have a preference
>for main vs. stable/13 vs. release/13.0.0 based? Is it okay to
>stick to the base version things are now based on --or do you
>want me to update to more recent? (That last only applies if
>main or stable/13 is to be put to use.)
Well, it sounds like you've isolated it to the genet interface.
Good sleuthing.
Unfortunately, NFS is only as good as the network fabric under it.
However, it's usually hangs or poor performance. Except maybe
for the readdir issue that Jason Bacon reported and resolved via
an upgrade, this is a first.
--> In the old days, I would have expected IP checksums to catch
   this, but I'm guessing the hardware/net driver are doing them
   these days?

W.r.t. reverting r367492...the patch to replace r367492 was just
committed to "main" by rscheff@ with a two week MFC, so it
should be in stable/13 soon. Not sure if an errata can be done
for it for releng13.0?

Thanks for isolating this, rick
ps: Co-incidentally, I've been thinking of buying an RBPi4 as a toy.
> . . . old history deleted . . .

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)




Re: (was) FreeBSD 12 and Nocona

2021-05-21 Thread Marek Zarychta
W dniu 25.02.2019 o 18:47, Marek Zarychta pisze:
> W dniu 08.01.2019 o 12:51, Stefan Bethke pisze:
>> Am 08.01.2019 um 10:34 schrieb Marek Zarychta 
>> :
>>> W dniu 03.01.2019 o 14:13, Stefan Bethke pisze:
> I have under supervision a few old servers running 11.2-STABLE. The
> hardware is almost for retirement, but still in working condition. It's
> all old Nocona NetBurst microarchitecture. I have recently tried to
> upgrade the OS on two of them to 12.0-STABLE, but failed. When I use the old
> bootloader, the boot freezes at the blue highlighted "Booting" stage; when I
> tried to use the 12 loader, it freezes earlier, while loading kernel modules.
> The kernel was compiled from fresh sources for CPUTYPE?=nocona.
> 11.2-STABLE is fine with this optimization and the same kernel boots
> fine on newer hardware.
>
> It is fair to say that 11's EOL is expected on September 30, 2021, and these
> servers will likely be retired before then, but some questions arise:
>
> Is such old hardware still supported? Is it possible (how to) debug the
> booting process?
 The first step is to try with known-good bits: can you boot these machines 
 off the 12.0 ISO or memstick images? Can you load your kernel and modules 
 with the loader from the ISO/memstick? Does GENERIC built without any 
 flags work?

 If any of these don’t work, try to be as specific as possible when 
 reporting problems. For example, the exact make of mainboard (kenv output) 
 and the BIOS version, and any relevant BIOS settings are likely important 
 for problems regarding the loader. If the kernel and modules load, you can 
 try a verbose boot to see better how far the kernel gets.

 I’d be really surprised if the CPUs themselves would cause trouble.
>>> The first step is done. The affected hardware doesn't boot from official
>>> 12.0-RELEASE CD either. Loader also freezes at the stage of loading
>>> kernel modules. These servers are old Maxdata Platinum 500 and 3200.
>>> Some time ago I submitted dmesgs to the NYC*BUG dmesg repository[1][2].
>>>
>>> Both configurations are fine with 11-STABLE, so I am not going to
>>> upgrade them and I am replying only FYI.
>>>
>>>
>>> [1] https://dmesgd.nycbug.org/index.cgi?do=view&id=3790 
>>> 
>>> [2] https://dmesgd.nycbug.org/index.cgi?do=view&id=4111 
>>> 
>> I think it would be great to get some input from someone familiar with the 
>> new loader. I’ve cc’ed Warner, Kyle and Toomas, as they were listed in the 
>> quarterly status report.
>>
> After recent MFCs LUA loader became available in 11-STABLE branch. I
> have tested it on this old hardware using freshly built r344482. I must
> admit the LUA loader on this hardware loads and boots 11-STABLE kernel
> flawlessly in a root-on-ZFS environment. The only caveat was that
> kernel.old wasn't listed under the "5" key, but this wasn't tested
> intensively, so the error could be transient or nonexistent.
>
>
> Nothing changed with regard to booting FreeBSD 12 on this hardware. I
> have tried to boot the recent 12-STABLE kernel. The kernel and all
> modules were loaded (in the same manner as previously loader 4th did),
> but instead of booting machine froze at this stage, so the breakage is
> elsewhere.
>
Out of curiosity I booted 13.0-RELEASE on the last working machine from
this family before it gets retired. Surprisingly it boots just fine.  So
thank you all involved for unbreaking this and providing us with such an
excellent release.

Best regards,

-- 
Marek Zarychta






odd behaviour using lagg on bridge

2021-05-21 Thread Ruben van Staveren via freebsd-stable
Hi List,

I’m observing some odd behaviour after I decided to put the 2 interfaces in my 
system into a lagg failover bond

- Can’t add the lagg to a bridge, it will say:

$ sudo ifconfig vm-public addm lagg0
ifconfig: BRDGADD lagg0: Device busy
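For what it's worth, a minimal rc.conf sketch of the intended setup is below. The member interface names em0/em1 are placeholders (the report doesn't name them), ordering and constraints should be checked against lagg(4) and if_bridge(4), and this is untested against vm-bhyve's own bridge management:

```
# /etc/rc.conf sketch -- failover lagg with the lagg as a bridge member.
# em0/em1 are hypothetical; substitute the real NICs.
cloned_interfaces="lagg0 bridge0"
ifconfig_em0="up"
ifconfig_em1="up"
ifconfig_lagg0="laggproto failover laggport em0 laggport em1 up"
ifconfig_bridge0="addm lagg0 up"
```

A "Device busy" from BRDGADD typically means the interface is already claimed by another bridge or similar consumer, so it is worth checking whether vm-bhyve had already attached lagg0 somewhere.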

I also noticed that after reconfiguring a VM to use a manually created bridge
(instead of the vm-bhyve managed one) where lagg0 is present, starting the VM
strips all IPv6 addresses from the lagg interface.

A service netif/routing restart doesn't restore IPv6 connectivity.
I think I saw something similar with an (iocage managed) vnet jail too, as
that uses bridges as well.

Regards,
Ruben






Re: releng/13 release/13.0.0 : odd/incorrect diff result over nfs (in a zfs file systems context) [RPi4B genet0 involved in problem]

2021-05-21 Thread Mark Millard via freebsd-stable
[Looks like the RPi4B genet0 handling is involved.]

On 2021-May-20, at 22:56, Mark Millard  wrote:
> 
> On 2021-May-20, at 22:19, Rick Macklem  wrote:
> 
>> Ok, so it isn't related to "soft".
>> I am wondering if it is something specific to what
>> "diff -r" does?
>> 
>> Could you try:
>> # cd /usr/ports
>> # ls -R > /tmp/x
>> # cd /mnt
>> # ls -R > /tmp/y
>> # cd /tmp
>> # diff -u -p x y
>> --> To see if "ls -R" finds any difference?
>> 
> 
> # diff -u -p x y 
> --- x   2021-05-20 22:35:48.021663000 -0700
> +++ y   2021-05-20 22:39:03.691936000 -0700
> @@ -227209,10 +227209,10 @@ 
> patch-chrome_browser_background_background__mode__mana
> patch-chrome_browser_background_background__mode__optimizer.cc
> patch-chrome_browser_browser__resources.grd
> patch-chrome_browser_browsing__data_chrome__browsing__data__remover__delegate.cc
> +patch-chrome_browser_chrome__browser
> patch-chrome_browser_chrome__browser__interface__binders.cc
> patch-chrome_browser_chrome__browser__main.cc
> patch-chrome_browser_chrome__browser__main__linux.cc
> -patch-chrome_browser_chrome__browser__main__posix.cc
> patch-chrome_browser_chrome__content__browser__client.cc
> patch-chrome_browser_chrome__content__browser__client.h
> patch-chrome_browser_crash__upload__list_crash__upload__list.cc
> 
> # find /usr/ports/ -name 'patch-chrome_browser_chrome__browser*' -print | more
> /usr/ports/devel/electron12/files/patch-chrome_browser_chrome__browser__main__linux.cc
> /usr/ports/devel/electron12/files/patch-chrome_browser_chrome__browser__main.cc
> /usr/ports/devel/electron12/files/patch-chrome_browser_chrome__browser__main__posix.cc
> /usr/ports/devel/electron12/files/patch-chrome_browser_chrome__browser__interface__binders.cc
> /usr/ports/www/chromium/files/patch-chrome_browser_chrome__browser__main__posix.cc
> /usr/ports/www/chromium/files/patch-chrome_browser_chrome__browser__main.cc
> /usr/ports/www/chromium/files/patch-chrome_browser_chrome__browser__main__linux.cc
> /usr/ports/www/chromium/files/patch-chrome_browser_chrome__browser__interface__binders.cc
> 
> find /mnt/ -name 'patch-chrome_browser_chrome__browser*' -print | more
> /mnt/devel/electron12/files/patch-chrome_browser_chrome__browser__main__linux.cc
> /mnt/devel/electron12/files/patch-chrome_browser_chrome__browser__main.cc
> /mnt/devel/electron12/files/patch-chrome_browser_chrome__browser__main__posix.cc
> /mnt/devel/electron12/files/patch-chrome_browser_chrome__browser__interface__binders.cc
> /mnt/www/chromium/files/patch-chrome_browser_chrome__browser
> /mnt/www/chromium/files/patch-chrome_browser_chrome__browser__main.cc
> /mnt/www/chromium/files/patch-chrome_browser_chrome__browser__main__linux.cc
> /mnt/www/chromium/files/patch-chrome_browser_chrome__browser__interface__binders.cc
> 
> So: patch-chrome_browser_chrome__browser appears to be a
> truncated: patch-chrome_browser_chrome__browser__main__posix.cc
> file name and find also gets the same oddity.
> 
> (Note: This had /usr/ports in a main context and /mnt/
> referring to a release/13.0.0 context.)
> 
>> ps: I do not think that r367492 could cause this, but it would be
>>nice if you try a kernel with the r367492 patch reverted.
>>It is currently in all of releng13, stable13 and main, although
>>the patch to fix this was just reviewed and may hit main soon.
> 
> Do you want a debug kernel to be used? Do you have a preference
> for main vs. stable/13 vs. release/13.0.0 based? Is it okay to
> stick to the base version things are now based on --or do you
> want me to update to more recent? (That last only applies if
> main or stable/13 is to be put to use.)
> 
>> . . . old history deleted . . .

I reversed the roles of the faster vs. somewhat slower
machine and so far my diff -r attempts for this found
no differences. The machines were using different types
of EtherNet devices.

So I've substituted a different EtherNet device onto
the slower machine: the same type of USB3 EtherNet
device in use on the faster machine (instead of
using the RPi4B's builtin EtherNet). So the below
testing is with both machines having a:

ugen0.6:  at usbus0
ure0 on uhub0
ure0:  on usbus0
miibus1:  on ure0
rgephy0:  PHY 0 on miibus1
rgephy0:  none, 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT-FDX, 
1000baseT-FDX-master, auto

in use.

I rebooted with this connected instead of the genet0
interface.

Mounting the slower machine's /usr/ports/ as /mnt/ from the faster machine:
No differences found by diff -r this way (expected result).

Mounting the faster machine's /usr/ports/ as /mnt/ from the slower machine:
No differences found by diff -r this way (expected result).

Doing diff -r's from both sides at the same time:
No differences found by diff -r this way (expected result).


So it looks like genet0 or its supporting software
is contributing to the problems that I had reported.

It is interesting that there were no examples of the
content of files reporting a mismatch, just some file
names/paths not findin