Re: git: 8d49fd7331bc - main - pf: remove DIOCGETRULE and DIOCGETSTATUS : net/py-libdnet and net/scapy now broken, kyua test suite damaged
On 14 Sep 2023, at 15:34, Mark Millard wrote:
> [I've cc'd a couple of folks that have dealt with fixing
> breakage in the past.]

I’ve submitted a fix for libdnet in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=273899 because it blocks net/scapy, which we rely on for tests.

I do not plan to fix other ports as well.

Best regards,
Kristof
Fwd: git: 8d49fd7331bc - main - pf: remove DIOCGETRULE and DIOCGETSTATUS : net/py-libdnet and net/scapy now broken, kyua test suite damaged
[I've cc'd a couple of folks that have dealt with fixing
breakage in the past.]

From: Kristof Provost
Subject: Re: git: 8d49fd7331bc - main - pf: remove DIOCGETRULE and DIOCGETSTATUS : net/py-libdnet and net/scapy now broken, kyua test suite damaged
Date: September 14, 2023 at 02:02:38 PDT
To: Mark Millard
Cc: Current FreeBSD

> Hi Mark,
>
> On 14 Sep 2023, at 7:37, Mark Millard wrote:
>> This change leads the port net/py-libdnet to be broken:
>>
>> --- fw-pf.lo ---
>> fw-pf.c:212:22: error: use of undeclared identifier 'DIOCGETRULE'
>>     if (ioctl(fw->fd, DIOCGETRULE, &pcr) == 0 &&
>>                       ^
>> fw-pf.c:252:22: error: use of undeclared identifier 'DIOCGETRULE'
>>     if (ioctl(fw->fd, DIOCGETRULE, &pcr) == 0 &&
>>                       ^
>> --- intf.lo ---
>>     for (cnt = 0; !matched && cnt < (int) entry->intf_alias_num; cnt++) {
>>     ^
>> intf.c:571:2: note: previous statement is here
>>     if (entry->intf_addr.addr_type == ADDR_TYPE_IP &&
>>     ^
>> --- fw-pf.lo ---
>> fw-pf.c:296:28: error: use of undeclared identifier 'DIOCGETRULE'
>>     if ((ret = ioctl(fw->fd, DIOCGETRULE, &pr)) < 0)
>>                              ^
>> 3 errors generated.
>>
>> That leads to:
>>
>> [00:00:41] [29] [00:00:26] Finished net/py-libdnet@py39 | py39-libdnet-1.13_4: Failed: build
>> [00:00:42] [29] [00:00:27] Skipping net/scapy@py39 | py39-scapy-2.5.0_1: Dependent port net/py-libdnet@py39 | py39-libdnet-1.13_4 failed
>>
>
> The commit removed those ioctls because they’ve been superseded by newer
> (nvlist-based) versions.
> Ports are strongly advised to use libpfctl rather than trying to deal with
> nvlists themselves.
>
> See https://lists.freebsd.org/archives/freebsd-pf/2023-April/000345.html for
> an example of what the ports will have to do. It’s generally a trivial change.
>
> Best regards,
> Kristof

===
Mark Millard
marklmi at yahoo.com
Re: git: 8d49fd7331bc - main - pf: remove DIOCGETRULE and DIOCGETSTATUS : net/py-libdnet and net/scapy now broken, kyua test suite damaged
Hi Mark,

On 14 Sep 2023, at 7:37, Mark Millard wrote:
> This change leads the port net/py-libdnet to be broken:
>
> --- fw-pf.lo ---
> fw-pf.c:212:22: error: use of undeclared identifier 'DIOCGETRULE'
>     if (ioctl(fw->fd, DIOCGETRULE, &pcr) == 0 &&
>                       ^
> fw-pf.c:252:22: error: use of undeclared identifier 'DIOCGETRULE'
>     if (ioctl(fw->fd, DIOCGETRULE, &pcr) == 0 &&
>                       ^
> --- intf.lo ---
>     for (cnt = 0; !matched && cnt < (int) entry->intf_alias_num; cnt++) {
>     ^
> intf.c:571:2: note: previous statement is here
>     if (entry->intf_addr.addr_type == ADDR_TYPE_IP &&
>     ^
> --- fw-pf.lo ---
> fw-pf.c:296:28: error: use of undeclared identifier 'DIOCGETRULE'
>     if ((ret = ioctl(fw->fd, DIOCGETRULE, &pr)) < 0)
>                              ^
> 3 errors generated.
>
> That leads to:
>
> [00:00:41] [29] [00:00:26] Finished net/py-libdnet@py39 | py39-libdnet-1.13_4: Failed: build
> [00:00:42] [29] [00:00:27] Skipping net/scapy@py39 | py39-scapy-2.5.0_1: Dependent port net/py-libdnet@py39 | py39-libdnet-1.13_4 failed
>

The commit removed those ioctls because they’ve been superseded by newer (nvlist-based) versions.
Ports are strongly advised to use libpfctl rather than trying to deal with nvlists themselves.

See https://lists.freebsd.org/archives/freebsd-pf/2023-April/000345.html for an example of what the ports will have to do. It’s generally a trivial change.

Best regards,
Kristof
Re: git: 8d49fd7331bc - main - pf: remove DIOCGETRULE and DIOCGETSTATUS : net/py-libdnet and net/scapy now broken, kyua test suite damaged
This change leads the port net/py-libdnet to be broken:

--- fw-pf.lo ---
fw-pf.c:212:22: error: use of undeclared identifier 'DIOCGETRULE'
    if (ioctl(fw->fd, DIOCGETRULE, &pcr) == 0 &&
                      ^
fw-pf.c:252:22: error: use of undeclared identifier 'DIOCGETRULE'
    if (ioctl(fw->fd, DIOCGETRULE, &pcr) == 0 &&
                      ^
--- intf.lo ---
    for (cnt = 0; !matched && cnt < (int) entry->intf_alias_num; cnt++) {
    ^
intf.c:571:2: note: previous statement is here
    if (entry->intf_addr.addr_type == ADDR_TYPE_IP &&
    ^
--- fw-pf.lo ---
fw-pf.c:296:28: error: use of undeclared identifier 'DIOCGETRULE'
    if ((ret = ioctl(fw->fd, DIOCGETRULE, &pr)) < 0)
                             ^
3 errors generated.

That leads to:

[00:00:41] [29] [00:00:26] Finished net/py-libdnet@py39 | py39-libdnet-1.13_4: Failed: build
[00:00:42] [29] [00:00:27] Skipping net/scapy@py39 | py39-scapy-2.5.0_1: Dependent port net/py-libdnet@py39 | py39-libdnet-1.13_4 failed

net/scapy is used by parts of the kyua test suite (when installed,
anyway). So the kyua test suite now has damaged functionality on
main [so: 15].

===
Mark Millard
marklmi at yahoo.com
Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" possible file odd result
Just FYI: For the specific machine/storage media combination used for
the openzfs import testing, the following combination seemed to work
well relative to the subject of the "odd result":

A) /etc/sysctl.conf having "vfs.zfs.per_txg_dirty_frees_percent=30"

B) autotrim off but use of "zpool trim -w zamd64" first, after any
   freeing of space by deleting files. (Probably again after the
   build and any cleanout of the temporary results.)

C) Avoid having more poudriere builders than hardware threads.

Of course, the combination does not apply to media for which trim is
not accessible (USB3, for example) or that may not support trim at
all (Optane, for example).

By contrast . . .

In a USB3 context, vfs.zfs.per_txg_dirty_frees_percent=30 did not work
well because of the delete sequence when a builder is to be reused
for its next build. vfs.zfs.per_txg_dirty_frees_percent=5 allowed more
overall progress on that aarch64 system. The big cleanout of all the
builders at the end is not the only consideration in setting
vfs.zfs.per_txg_dirty_frees_percent (for at least some systems).

===
Mark Millard
marklmi at yahoo.com
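The A/B/C combination above can be sketched as a small sh fragment. This is only an illustration under stated assumptions: trim_pool and clamp_builders are hypothetical helper names that do not appear in the thread, zamd64 is the pool name from the message, and hw.ncpu is assumed to match the hardware thread count.

```shell
#!/bin/sh
# Sketch only: the helper names below are hypothetical, not from the thread.

# (A) Line for /etc/sysctl.conf, as quoted in the message:
#   vfs.zfs.per_txg_dirty_frees_percent=30

# (B) Keep autotrim off and trim manually after freeing space.
#     "zpool trim -w" waits until the trim has completed.
trim_pool() {
    zpool set autotrim=off "$1" && zpool trim -w "$1"
}

# (C) Cap a requested poudriere builder count at the hardware thread count.
clamp_builders() {
    requested=$1
    hw_threads=$2
    if [ "$requested" -gt "$hw_threads" ]; then
        echo "$hw_threads"
    else
        echo "$requested"
    fi
}

# Possible use on a FreeBSD system (not executed here):
#   trim_pool zamd64
#   poudriere bulk -a -J "$(clamp_builders 128 "$(sysctl -n hw.ncpu)")"
```

Nothing runs when the fragment is sourced; the functions only take effect when called, so the trim step stays an explicit manual action, as in the message.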
Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" possible file odd result
On Sep 5, 2023, at 00:00, Mark Millard wrote:

> On Sep 4, 2023, at 22:06, Mark Millard wrote:
>
>> . . .
>
> So I tried 30 for per_txg_dirty_frees_percent for 2 contexts:
> autotrim on
> vs.
> autotrim off
>
> where there was 100 GiByte+ to delete after a poudriere
> bulk run.
>
> autotrim on: takes a fair time to delete even 1 GiByte of the 100+ GiByte
> vs.
> autotrim off: takes less time to delete more.
>
> The difference is very visible via "gstat -spod" use.
>
> autotrim on likely makes things less concurrent, somewhat
> like USB3 storage having only one command to the device
> at a time for FreeBSD. autotrim on seems to prevent the
> larger unit of work from being an effective way to
> decrease the time required, instead possibly increasing
> the time requirement.
>
> That may be an example of the context dependency for
> what value of per_txg_dirty_frees_percent to use to
> avoid wasting much time.

Trying autotrim off with 30 for per_txg_dirty_frees_percent
got me the oddity/extra-message (just using 32 builders
to match the hardware thread count):

. . .
[00:03:25] [32] [00:00:00] Builder starting
[00:03:43] [01] [00:00:18] Finished print/indexinfo | indexinfo-0.3.1: Success
[00:03:43] [01] [00:00:00] Building devel/gettext-runtime | gettext-runtime-0.22_1
[00:05:20] [01] [00:01:37] Finished devel/gettext-runtime | gettext-runtime-0.22_1: Success
23/.p/cleaning/rdeps/gettext-runtime-0.22_1/chemtool-1.6.14_4 copy: open failed: No such file or directory
[00:05:23] [01] [00:00:00] Building devel/gmake | gmake-4.3_2
[00:05:55] [02] [00:02:30] Builder started
. . .

(Not that I know if the context actually matters and
I have no clue if I'll ever get a repetition.)

===
Mark Millard
marklmi at yahoo.com
Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned
On Sep 4, 2023, at 22:06, Mark Millard wrote: > On Sep 4, 2023, at 18:39, Mark Millard wrote: > >> On Sep 4, 2023, at 10:05, Alexander Motin wrote: >> >>> On 04.09.2023 11:45, Mark Millard wrote: >>>> On Sep 4, 2023, at 06:09, Alexander Motin wrote: >>>>> per_txg_dirty_frees_percent is directly related to the delete delays we >>>>> see here. You are forcing ZFS to commit transactions each 5% of dirty >>>>> ARC limit, which is 5% of 10% or memory size. I haven't looked on that >>>>> code recently, but I guess setting it too low can make ZFS commit >>>>> transactions too often, increasing write inflation for the underlying >>>>> storage. I would propose you to restore the default and try again. >>>> While this machine is different, the original problem was worse than >>>> the issue here: the load average was less than 1 for the most part >>>> the parallel bulk build when 30 was used. The fraction of time waiting >>>> was much longer than with 5. If I understand right, both too high and >>>> too low for a type of context can lead to increased elapsed time and >>>> getting it set to a near optimal is a non-obvious exploration. >>> >>> IIRC this limit was modified several times since originally implemented. >>> May be it could benefit from another look, if default 30% is not good. It >>> would be good if generic ZFS issues like this were reported to OpenZFS >>> upstream to be visible to a wider public. Unfortunately I have several >>> other project I must work on, so if it is not a regression I can't promise >>> I'll take it right now, so anybody else is welcome. >> >> As I understand, there are contexts were 5 is inappropriate >> and 30 works fairly well: no good single answer as to what >> value range will avoid problems. >> >>>> An overall point for the goal of my activity is: what makes a >>>> good test context for checking if ZFS is again safe to use? >>>> May be other tradeoffs make, say, 4 hardware threads more >>>> reasonable than 32. 
>>> >>> Thank you for your testing. The best test is one that nobody else run. It >>> also correlates with the topic of "safe to use", which also depends on what >>> it is used for. :) >> >> Looks like use of a M.2 Samsung SSD 960 PRO 1TB with a >> non-debug FreeBSD build is suitable for the bulk -a -J128 >> test (no ALLOW_MAKE_JOBS variants enabled, USE_TMPFS=no in >> use) on the 32 hardware thread system. (The swap partition >> in use is from the normal environment's PCIe Optane media.) >> The %idle and the load averages and %user stayed reasonable >> in a preliminary test. One thing it does introduce is trim >> management (both available and potentially useful). (Optane >> media does not support or need it.) No >> per_txg_dirty_frees_percent adjustment involved (still 5). >> >> I've learned to not use ^T for fear of /bin/sh aborting >> and messing up poudriere's context. So I now monitor with: >> >> # poudriere status -b >> >> in a separate ssh session. >> >> I'll note that I doubt I'd try for a complete bulk -a . >> I'd probably stop it if I notice that the number of >> active builders drops off for a notable time (normal >> waiting for prerequisites appearing to be why). >> >> > > So much for that idea. It has reached a state of staying > under 3500 w/s and up to 4.5ms/w (normally above 2ms/w). > %busy wondering in the range 85% to 101%. Lots of top > showing tx->tx. There is some read and other activity as > well. Of course the kBps figures are larger than the > earlier USB3 context (larger kB figures). > > It reached about 1350 port->package builds over the first > hour after the 2nd "Buildee started". > > autotrim is off. Doing a "zpool trim -w zamd64" leads to > . . . larger w/s figures during the process. So > more exploring to do at some point. Possibly: > > autotrim > per_txg_dirty_frees_percent > > For now, I'm just running "zpool trim -w zamd64" once > and a while so the process continues better. > > Still no evidence of deadlocks. 
No evidence of builds > failing for corruptions. > > . . . At around the end of 2nd hour: 2920 or so built. > > Still no evidence of deadlocks. No evidence of builds > failing for corruptions. > > . . . I've turned on autotrim without stopping the bulk > build. > >
Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned
On Sep 4, 2023, at 18:39, Mark Millard wrote: > On Sep 4, 2023, at 10:05, Alexander Motin wrote: > >> On 04.09.2023 11:45, Mark Millard wrote: >>> On Sep 4, 2023, at 06:09, Alexander Motin wrote: >>>> per_txg_dirty_frees_percent is directly related to the delete delays we >>>> see here. You are forcing ZFS to commit transactions each 5% of dirty ARC >>>> limit, which is 5% of 10% or memory size. I haven't looked on that code >>>> recently, but I guess setting it too low can make ZFS commit transactions >>>> too often, increasing write inflation for the underlying storage. I would >>>> propose you to restore the default and try again. >>> While this machine is different, the original problem was worse than >>> the issue here: the load average was less than 1 for the most part >>> the parallel bulk build when 30 was used. The fraction of time waiting >>> was much longer than with 5. If I understand right, both too high and >>> too low for a type of context can lead to increased elapsed time and >>> getting it set to a near optimal is a non-obvious exploration. >> >> IIRC this limit was modified several times since originally implemented. >> May be it could benefit from another look, if default 30% is not good. It >> would be good if generic ZFS issues like this were reported to OpenZFS >> upstream to be visible to a wider public. Unfortunately I have several >> other project I must work on, so if it is not a regression I can't promise >> I'll take it right now, so anybody else is welcome. > > As I understand, there are contexts were 5 is inappropriate > and 30 works fairly well: no good single answer as to what > value range will avoid problems. > >>> An overall point for the goal of my activity is: what makes a >>> good test context for checking if ZFS is again safe to use? >>> May be other tradeoffs make, say, 4 hardware threads more >>> reasonable than 32. >> >> Thank you for your testing. The best test is one that nobody else run. 
It >> also correlates with the topic of "safe to use", which also depends on what >> it is used for. :) > > Looks like use of a M.2 Samsung SSD 960 PRO 1TB with a > non-debug FreeBSD build is suitable for the bulk -a -J128 > test (no ALLOW_MAKE_JOBS variants enabled, USE_TMPFS=no in > use) on the 32 hardware thread system. (The swap partition > in use is from the normal environment's PCIe Optane media.) > The %idle and the load averages and %user stayed reasonable > in a preliminary test. One thing it does introduce is trim > management (both available and potentially useful). (Optane > media does not support or need it.) No > per_txg_dirty_frees_percent adjustment involved (still 5). > > I've learned to not use ^T for fear of /bin/sh aborting > and messing up poudriere's context. So I now monitor with: > > # poudriere status -b > > in a separate ssh session. > > I'll note that I doubt I'd try for a complete bulk -a . > I'd probably stop it if I notice that the number of > active builders drops off for a notable time (normal > waiting for prerequisites appearing to be why). > > So much for that idea. It has reached a state of staying under 3500 w/s and up to 4.5ms/w (normally above 2ms/w). %busy wondering in the range 85% to 101%. Lots of top showing tx->tx. There is some read and other activity as well. Of course the kBps figures are larger than the earlier USB3 context (larger kB figures). It reached about 1350 port->package builds over the first hour after the 2nd "Buildee started". autotrim is off. Doing a "zpool trim -w zamd64" leads to . . . larger w/s figures during the process. So more exploring to do at some point. Possibly: autotrim per_txg_dirty_frees_percent For now, I'm just running "zpool trim -w zamd64" once and a while so the process continues better. Still no evidence of deadlocks. No evidence of builds failing for corruptions. . . . At around the end of 2nd hour: 2920 or so built. Still no evidence of deadlocks. 
No evidence of builds failing for corruptions. . . . I've turned on autotrim without stopping the bulk build. . . . At around the end of 3rd hour: 4080 or so built. Still no evidence of deadlocks. No evidence of builds failing for corruptions. Looks like the % idle has been high for a significant time. I think I'll stop this specific test and clean things out. Looks like lang/guile* are examples of not respecting the lack of ALLOW_MAKE_JOBS use. Hmm. The ^C ended up with: ^C[03:41:07] Error: Signal SIGINT caught, cleaning up and exiting [main-amd64-bulk_a-defau
Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned
On Sep 4, 2023, at 10:05, Alexander Motin wrote:

> On 04.09.2023 11:45, Mark Millard wrote:
>> On Sep 4, 2023, at 06:09, Alexander Motin wrote:
>>> per_txg_dirty_frees_percent is directly related to the delete delays we see
>>> here. You are forcing ZFS to commit transactions each 5% of dirty ARC
>>> limit, which is 5% of 10% of memory size. I haven't looked at that code
>>> recently, but I guess setting it too low can make ZFS commit transactions
>>> too often, increasing write inflation for the underlying storage. I would
>>> propose that you restore the default and try again.
>> While this machine is different, the original problem was worse than
>> the issue here: the load average was less than 1 for most of
>> the parallel bulk build when 30 was used. The fraction of time waiting
>> was much longer than with 5. If I understand right, both too high and
>> too low for a type of context can lead to increased elapsed time and
>> getting it set to a near optimal is a non-obvious exploration.
>
> IIRC this limit was modified several times since originally implemented. Maybe
> it could benefit from another look, if the default of 30% is not good. It would
> be good if generic ZFS issues like this were reported to OpenZFS upstream to
> be visible to a wider public. Unfortunately I have several other projects I
> must work on, so if it is not a regression I can't promise I'll take it right
> now, so anybody else is welcome.

As I understand, there are contexts where 5 is inappropriate
and 30 works fairly well: no good single answer as to what
value range will avoid problems.

>> An overall point for the goal of my activity is: what makes a
>> good test context for checking if ZFS is again safe to use?
>> Maybe other tradeoffs make, say, 4 hardware threads more
>> reasonable than 32.
>
> Thank you for your testing. The best test is one that nobody else runs. It
> also correlates with the topic of "safe to use", which also depends on what
> it is used for. :)

Looks like use of an M.2 Samsung SSD 960 PRO 1TB with a
non-debug FreeBSD build is suitable for the bulk -a -J128
test (no ALLOW_MAKE_JOBS variants enabled, USE_TMPFS=no in
use) on the 32 hardware thread system. (The swap partition
in use is from the normal environment's PCIe Optane media.)
The %idle and the load averages and %user stayed reasonable
in a preliminary test. One thing it does introduce is trim
management (both available and potentially useful). (Optane
media does not support or need it.) No
per_txg_dirty_frees_percent adjustment involved (still 5).

I've learned to not use ^T for fear of /bin/sh aborting
and messing up poudriere's context. So I now monitor with:

# poudriere status -b

in a separate ssh session.

I'll note that I doubt I'd try for a complete bulk -a .
I'd probably stop it if I notice that the number of
active builders drops off for a notable time (normal
waiting for prerequisites appearing to be why).

===
Mark Millard
marklmi at yahoo.com
Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned
On 04.09.2023 11:45, Mark Millard wrote:
> On Sep 4, 2023, at 06:09, Alexander Motin wrote:
>> per_txg_dirty_frees_percent is directly related to the delete delays we
>> see here. You are forcing ZFS to commit transactions each 5% of dirty ARC
>> limit, which is 5% of 10% of memory size. I haven't looked at that code
>> recently, but I guess setting it too low can make ZFS commit transactions
>> too often, increasing write inflation for the underlying storage. I would
>> propose that you restore the default and try again.
> While this machine is different, the original problem was worse than
> the issue here: the load average was less than 1 for most of
> the parallel bulk build when 30 was used. The fraction of time waiting
> was much longer than with 5. If I understand right, both too high and
> too low for a type of context can lead to increased elapsed time and
> getting it set to a near optimal is a non-obvious exploration.

IIRC this limit was modified several times since originally implemented. Maybe it could benefit from another look, if the default of 30% is not good. It would be good if generic ZFS issues like this were reported to OpenZFS upstream to be visible to a wider public. Unfortunately I have several other projects I must work on, so if it is not a regression I can't promise I'll take it right now, so anybody else is welcome.

> An overall point for the goal of my activity is: what makes a
> good test context for checking if ZFS is again safe to use?
> Maybe other tradeoffs make, say, 4 hardware threads more
> reasonable than 32.

Thank you for your testing. The best test is one that nobody else runs. It also correlates with the topic of "safe to use", which also depends on what it is used for. :)

--
Alexander Motin
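The "5% of 10% of memory size" description in the message above can be worked through numerically. The sketch below takes that description at face value (the dirty limit as 10% of RAM); an actual system may cap the limit differently. frees_budget_mib is a hypothetical helper name, and the 65536 MiB RAM figure in the usage comment is an assumed example, not a value from the thread.

```shell
#!/bin/sh
# Sketch only: frees_budget_mib is a hypothetical helper, not from the thread.
# It follows the description quoted above: the amount of frees per TXG is
# per_txg_dirty_frees_percent of the dirty limit, and the dirty limit is
# taken as 10% of RAM. Integer arithmetic, so results are rounded down.
frees_budget_mib() {
    ram_mib=$1    # RAM size in MiB
    percent=$2    # value of vfs.zfs.per_txg_dirty_frees_percent
    dirty_limit_mib=$((ram_mib / 10))
    echo $((dirty_limit_mib * percent / 100))
}

# Example with an assumed 65536 MiB (64 GiB) machine:
#   frees_budget_mib 65536 5
#   frees_budget_mib 65536 30
```

Under these assumptions the setting of 5 allows roughly one sixth of what the default of 30 allows per TXG, which is the sense in which a low value forces more frequent commits.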
Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned
xg_thread_entermi_switch+0x173 sleepq_switch+0x104 >>>> sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 >>>> dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 >>>> fork_trampoline+0xe >>>> /usr/home/root/mmjnk01.txt:6 100881 zfskern >>>> txg_thread_entermi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 >>>> txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 >>>> fork_trampoline+0xe >>>> /usr/home/root/mmjnk01.txt:6 100882 zfskern >>>> txg_thread_entermi_switch+0x173 sleepq_switch+0x104 >>>> sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 >>>> dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 >>>> fork_trampoline+0xe >>>> /usr/home/root/mmjnk02.txt:6 100881 zfskern >>>> txg_thread_entermi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 >>>> txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 >>>> fork_trampoline+0xe >>>> /usr/home/root/mmjnk02.txt:6 100882 zfskern >>>> txg_thread_entermi_switch+0x173 sleepq_switch+0x104 >>>> sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 >>>> dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 >>>> fork_trampoline+0xe >>>> /usr/home/root/mmjnk03.txt:6 100881 zfskern >>>> txg_thread_entermi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 >>>> txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 >>>> fork_trampoline+0xe >>>> /usr/home/root/mmjnk03.txt:6 100882 zfskern >>>> txg_thread_entermi_switch+0x173 sleepq_switch+0x104 >>>> sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 >>>> dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 >>>> fork_trampoline+0xe >>>> /usr/home/root/mmjnk04.txt:6 100881 zfskern >>>> txg_thread_entermi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 >>>> txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 >>>> fork_trampoline+0xe >>>> /usr/home/root/mmjnk04.txt:6 100882 zfskern >>>> txg_thread_entermi_switch+0x173 sleepq_switch+0x104 >>>> 
sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 >>>> dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 >>>> fork_trampoline+0xe >>>> /usr/home/root/mmjnk05.txt:6 100881 zfskern >>>> txg_thread_entermi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 >>>> txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 >>>> fork_trampoline+0xe >>>> /usr/home/root/mmjnk05.txt:6 100882 zfskern >>>> txg_thread_entermi_switch+0x173 sleepq_switch+0x104 >>>> sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 >>>> dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 >>>> fork_trampoline+0xe > > So quiesce threads are idle, while sync thread is waiting for TXG commit > writes completion. I see no no crime, we should see the same just for slow > storage. > >>>>> `zpool status`, `zpool get all` and `sysctl -a` would also not harm. >>>> >>>> It is a very simple zpool configuration: one partition. >>>> I only use it for bectl BE reasons, not the general >>>> range of reasons for using zfs. I created the media with >>>> my normal content, then checkpointed before doing the >>>> git fetch to start to set up the experiment. > > OK. And I see no scrub or async destroy, that could delay sync thread. > Though I don't see them in the above procstat either. > >>>> /etc/sysctl.conf does have: >>>> >>>> vfs.zfs.min_auto_ashift=12 >>>> vfs.zfs.per_txg_dirty_frees_percent=5 >>>> >>>> The vfs.zfs.per_txg_dirty_frees_percent is from prior >>>> Mateusz Guzik help, where after testing the change I >>>> reported: >>>> >>>> Result summary: Seems to have avoided the sustained periods >>>> of low load average activity. Much better for the context. >>>> >>>> But it was for a different machine (aarch64, 8 cores). But >>>> it was for poudriere bulk use. >>>> >>>> Turns out the default of 30 was causing
Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned
On 04.09.2023 05:56, Mark Millard wrote: On Sep 4, 2023, at 02:00, Mark Millard wrote: On Sep 3, 2023, at 23:35, Mark Millard wrote: On Sep 3, 2023, at 22:06, Alexander Motin wrote: On 03.09.2023 22:54, Mark Millard wrote: After that ^t produced the likes of: load: 6.39 cmd: sh 4849 [tx->tx_quiesce_done_cv] 10047.33r 0.51u 121.32s 1% 13004k So the full state is not "tx->tx", but is actually a "tx->tx_quiesce_done_cv", which means the thread is waiting for new transaction to be opened, which means some previous to be quiesced and then synced. #0 0x80b6f103 at mi_switch+0x173 #1 0x80bc0f24 at sleepq_switch+0x104 #2 0x80aec4c5 at _cv_wait+0x165 #3 0x82aba365 at txg_wait_open+0xf5 #4 0x82a11b81 at dmu_free_long_range+0x151 Here it seems like transaction commit is waited due to large amount of delete operations, which ZFS tries to spread between separate TXGs. That fit the context: cleaning out /usr/local/poudriere/data/.m/ You should probably see some large and growing number in sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay . After the reboot I started a -J64 example. It has avoided the early "witness exhausted". Again I ^C'd after about an hours after the 2nd builder had started. So: again cleaning out /usr/local/poudriere/data/.m/ Only seconds between: # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276042 # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276427 # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 277323 # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 278027 As expected, deletes trigger and wait for TXG commits. I have found a measure of progress: zfs list's USED for /usr/local/poudriere/data/.m is decreasing. So ztop's d/s was a good classification: deletes. 
#5 0x829a87d2 at zfs_rmnode+0x72 #6 0x829b658d at zfs_freebsd_reclaim+0x3d #7 0x8113a495 at VOP_RECLAIM_APV+0x35 #8 0x80c5a7d9 at vgonel+0x3a9 #9 0x80c5af7f at vrecycle+0x3f #10 0x829b643e at zfs_freebsd_inactive+0x4e #11 0x80c598cf at vinactivef+0xbf #12 0x80c590da at vput_final+0x2aa #13 0x80c68886 at kern_funlinkat+0x2f6 #14 0x80c68588 at sys_unlink+0x28 #15 0x8106323f at amd64_syscall+0x14f #16 0x8103512b at fast_syscall_common+0xf8 What we don't see here is what quiesce and sync threads of the pool are actually doing. Sync thread has plenty of different jobs, including async write, async destroy, scrub and others, that all may delay each other. Before you rebooted the system, depending how alive it is, could you save a number of outputs of `procstat -akk`, or at least specifically `procstat -akk | grep txg_thread_enter` if the full is hard? Or somehow else observe what they are doing. # grep txg_thread_enter ~/mmjnk0[0-5].txt /usr/home/root/mmjnk00.txt:6 100881 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe /usr/home/root/mmjnk00.txt:6 100882 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe /usr/home/root/mmjnk01.txt:6 100881 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe /usr/home/root/mmjnk01.txt:6 100882 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe /usr/home/root/mmjnk02.txt:6 100881 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 
fork_exit+0x82 fork_trampoline+0xe /usr/home/root/mmjnk02.txt:6 100882 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe /usr/home/root/mmjnk03.txt:6 100881 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe /usr/home/root/mmjnk03.txt:6 100882 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0
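The repeated manual sysctl sampling of kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay earlier in this message could be scripted. The following is a minimal sketch: the helper names are hypothetical, the 5 second default interval is an arbitrary choice, and read_frees_delay only works on a FreeBSD system with ZFS loaded.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not from the thread.

# Difference between two counter samples (older first, newer second).
frees_delay_delta() {
    echo $(( $2 - $1 ))
}

# Read the counter once (FreeBSD with ZFS loaded).
read_frees_delay() {
    sysctl -n kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
}

# Print how much the counter grew over one interval (default: 5 seconds).
watch_frees_delay() {
    interval=${1:-5}
    before=$(read_frees_delay) || return 1
    sleep "$interval"
    after=$(read_frees_delay) || return 1
    frees_delay_delta "$before" "$after"
}
```

With the first two samples quoted in the thread, frees_delay_delta 276042 276427 gives 385; a steadily positive difference is what "large and growing" looks like in practice.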
Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned
On Sep 4, 2023, at 02:00, Mark Millard wrote: > On Sep 3, 2023, at 23:35, Mark Millard wrote: > >> On Sep 3, 2023, at 22:06, Alexander Motin wrote: >> >>> >>> On 03.09.2023 22:54, Mark Millard wrote: After that ^t produced the likes of: load: 6.39 cmd: sh 4849 [tx->tx_quiesce_done_cv] 10047.33r 0.51u 121.32s 1% 13004k >>> >>> So the full state is not "tx->tx", but is actually a >>> "tx->tx_quiesce_done_cv", which means the thread is waiting for new >>> transaction to be opened, which means some previous to be quiesced and then >>> synced. >>> #0 0x80b6f103 at mi_switch+0x173 #1 0x80bc0f24 at sleepq_switch+0x104 #2 0x80aec4c5 at _cv_wait+0x165 #3 0x82aba365 at txg_wait_open+0xf5 #4 0x82a11b81 at dmu_free_long_range+0x151 >>> >>> Here it seems like transaction commit is waited due to large amount of >>> delete operations, which ZFS tries to spread between separate TXGs. >> >> That fit the context: cleaning out /usr/local/poudriere/data/.m/ >> >>> You should probably see some large and growing number in sysctl >>> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay . >> >> After the reboot I started a -J64 example. It has avoided the >> early "witness exhausted". Again I ^C'd after about an hours >> after the 2nd builder had started. So: again cleaning out >> /usr/local/poudriere/data/.m/ Only seconds between: >> >> # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay >> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276042 >> >> # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay >> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276427 >> >> # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay >> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 277323 >> >> # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay >> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 278027 >> >> I have found a measure of progress: zfs list's USED >> for /usr/local/poudriere/data/.m is decreasing. So >> ztop's d/s was a good classification: deletes. 
>> #5 0x829a87d2 at zfs_rmnode+0x72 #6 0x829b658d at zfs_freebsd_reclaim+0x3d #7 0x8113a495 at VOP_RECLAIM_APV+0x35 #8 0x80c5a7d9 at vgonel+0x3a9 #9 0x80c5af7f at vrecycle+0x3f #10 0x829b643e at zfs_freebsd_inactive+0x4e #11 0x80c598cf at vinactivef+0xbf #12 0x80c590da at vput_final+0x2aa #13 0x80c68886 at kern_funlinkat+0x2f6 #14 0x80c68588 at sys_unlink+0x28 #15 0x8106323f at amd64_syscall+0x14f #16 0x8103512b at fast_syscall_common+0xf8 >>> >>> What we don't see here is what quiesce and sync threads of the pool are >>> actually doing. Sync thread has plenty of different jobs, including async >>> write, async destroy, scrub and others, that all may delay each other. >>> >>> Before you rebooted the system, depending how alive it is, could you save a >>> number of outputs of `procstat -akk`, or at least specifically `procstat >>> -akk | grep txg_thread_enter` if the full is hard? Or somehow else observe >>> what they are doing. >> >> # procstat -akk > ~/mmjnk00.txt >> # procstat -akk > ~/mmjnk01.txt >> # procstat -akk > ~/mmjnk02.txt >> # procstat -akk > ~/mmjnk03.txt >> # procstat -akk > ~/mmjnk04.txt >> # procstat -akk > ~/mmjnk05.txt >> # grep txg_thread_enter ~/mmjnk0[0-5].txt >> /usr/home/root/mmjnk00.txt:6 100881 zfskern txg_thread_enter >>mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb >> txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe >> /usr/home/root/mmjnk00.txt:6 100882 zfskern txg_thread_enter >>mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b >> _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 >> txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe >> /usr/home/root/mmjnk01.txt:6 100881 zfskern txg_thread_enter >>mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb >> txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe >> /usr/home/root/mmjnk01.txt:6 100882 zfskern txg_thread_enter >>mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b >> 
_cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 >> txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe >> /usr/home/root/mmjnk02.txt:6 100881 zfskern txg_thread_enter >>mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb >> txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe >> /usr/home/root/mmjnk02.txt:6 100882 zfskern txg_thread_enter >>mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b >> _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 >> txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe >> /usr/home/root/mmjnk03.txt:
Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned
On Sep 3, 2023, at 23:35, Mark Millard wrote: > On Sep 3, 2023, at 22:06, Alexander Motin wrote: > >> >> On 03.09.2023 22:54, Mark Millard wrote: >>> After that ^t produced the likes of: >>> load: 6.39 cmd: sh 4849 [tx->tx_quiesce_done_cv] 10047.33r 0.51u 121.32s >>> 1% 13004k >> >> So the full state is not "tx->tx", but is actually a >> "tx->tx_quiesce_done_cv", which means the thread is waiting for new >> transaction to be opened, which means some previous to be quiesced and then >> synced. >> >>> #0 0x80b6f103 at mi_switch+0x173 >>> #1 0x80bc0f24 at sleepq_switch+0x104 >>> #2 0x80aec4c5 at _cv_wait+0x165 >>> #3 0x82aba365 at txg_wait_open+0xf5 >>> #4 0x82a11b81 at dmu_free_long_range+0x151 >> >> Here it seems like transaction commit is waited due to large amount of >> delete operations, which ZFS tries to spread between separate TXGs. > > That fit the context: cleaning out /usr/local/poudriere/data/.m/ > >> You should probably see some large and growing number in sysctl >> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay . > > After the reboot I started a -J64 example. It has avoided the > early "witness exhausted". Again I ^C'd after about an hours > after the 2nd builder had started. So: again cleaning out > /usr/local/poudriere/data/.m/ Only seconds between: > > # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay > kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276042 > > # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay > kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276427 > > # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay > kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 277323 > > # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay > kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 278027 > > I have found a measure of progress: zfs list's USED > for /usr/local/poudriere/data/.m is decreasing. So > ztop's d/s was a good classification: deletes. 
> >>> #5 0x829a87d2 at zfs_rmnode+0x72 >>> #6 0x829b658d at zfs_freebsd_reclaim+0x3d >>> #7 0x8113a495 at VOP_RECLAIM_APV+0x35 >>> #8 0x80c5a7d9 at vgonel+0x3a9 >>> #9 0x80c5af7f at vrecycle+0x3f >>> #10 0x829b643e at zfs_freebsd_inactive+0x4e >>> #11 0x80c598cf at vinactivef+0xbf >>> #12 0x80c590da at vput_final+0x2aa >>> #13 0x80c68886 at kern_funlinkat+0x2f6 >>> #14 0x80c68588 at sys_unlink+0x28 >>> #15 0x8106323f at amd64_syscall+0x14f >>> #16 0x8103512b at fast_syscall_common+0xf8 >> >> What we don't see here is what quiesce and sync threads of the pool are >> actually doing. Sync thread has plenty of different jobs, including async >> write, async destroy, scrub and others, that all may delay each other. >> >> Before you rebooted the system, depending how alive it is, could you save a >> number of outputs of `procstat -akk`, or at least specifically `procstat >> -akk | grep txg_thread_enter` if the full is hard? Or somehow else observe >> what they are doing. > > # procstat -akk > ~/mmjnk00.txt > # procstat -akk > ~/mmjnk01.txt > # procstat -akk > ~/mmjnk02.txt > # procstat -akk > ~/mmjnk03.txt > # procstat -akk > ~/mmjnk04.txt > # procstat -akk > ~/mmjnk05.txt > # grep txg_thread_enter ~/mmjnk0[0-5].txt > /usr/home/root/mmjnk00.txt:6 100881 zfskern txg_thread_enter > mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb > txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe > /usr/home/root/mmjnk00.txt:6 100882 zfskern txg_thread_enter > mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b > _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 > txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe > /usr/home/root/mmjnk01.txt:6 100881 zfskern txg_thread_enter > mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb > txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe > /usr/home/root/mmjnk01.txt:6 100882 zfskern txg_thread_enter > mi_switch+0x173 sleepq_switch+0x104 
sleepq_timedwait+0x4b > _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 > txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe > /usr/home/root/mmjnk02.txt:6 100881 zfskern txg_thread_enter > mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb > txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe > /usr/home/root/mmjnk02.txt:6 100882 zfskern txg_thread_enter > mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b > _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 > txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe > /usr/home/root/mmjnk03.txt:6 100881 zfskern txg_thread_enter > mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb > txg_quiesce_thread+0x144
Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned
> #9 0x80c5af7f at vrecycle+0x3f
> #10 0x829b643e at zfs_freebsd_inactive+0x4e
> #11 0x80c598cf at vinactivef+0xbf
> #12 0x80c590da at vput_final+0x2aa
> #13 0x80c68886 at kern_funlinkat+0x2f6
> #14 0x80c68588 at sys_unlink+0x28
> #15 0x8106323f at amd64_syscall+0x14f
> #16 0x8103512b at fast_syscall_common+0xf8
>
> The console/logs do report "witness exhausted":
>
> . . .
> Sep 3 13:41:08 amd64-ZFS login[1751]: ROOT LOGIN (root) ON ttyv0
> Sep 3 13:51:35 amd64-ZFS kernel: witness_lock_list_get: witness exhausted
> Sep 3 14:26:38 amd64-ZFS kernel: pid 27418 (conftest), jid 245, uid 0:
> exited on signal 11 (core dumped)
> . . .
>
> So it did not take long for the "witness exhausted" to
> happen.
>
> # uname -apKU
> FreeBSD amd64-ZFS 15.0-CURRENT FreeBSD 15.0-CURRENT amd64 150 #74
> main-n265143-525bc87f54f2-dirty: Sun Sep 3 13:35:04 PDT 2023
> root@amd64_ZFS:/usr/obj/BUILDs/main-amd64-dbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-DBG
> amd64 amd64 150 150
>
>
> Looks like I'll be forcing the machine to reboot or
> to power off. The media was deliberately set up for
> doing risky tests. It is not my normal environment.
>

Using -J64 instead of -J128. It does avoid "witness exhausted" for at least the 1st hour.

[00:03:51] Building 34214 packages using up to 64 builders
[00:03:51] Hit CTRL+t at any time to see build progress and stats
[00:03:51] [01] [00:00:00] Builder starting
[00:04:49] [01] [00:00:58] Builder started
[00:04:49] [01] [00:00:00] Building ports-mgmt/pkg | pkg-1.20.6
[00:06:07] [01] [00:01:18] Finished ports-mgmt/pkg | pkg-1.20.6: Success
[00:06:31] [01] [00:00:00] Building print/indexinfo | indexinfo-0.3.1
[00:06:31] [02] [00:00:00] Builder starting
. . .
[00:06:33] [64] [00:00:00] Builder starting
[00:09:06] [01] [00:02:35] Finished print/indexinfo | indexinfo-0.3.1: Success
[00:09:08] [01] [00:00:00] Building devel/gettext-runtime | gettext-runtime-0.22_1
[00:21:49] [16] [00:15:18] Builder started
[00:21:49] [16] [00:00:00] Building devel/libdaemon | libdaemon-0.14_1
[00:21:49] [29] [00:15:17] Builder started
[00:21:49] [20] [00:15:18] Builder started
[00:21:49] [41] [00:15:17] Builder started
[00:21:49] [29] [00:00:00] Building textproc/libunibreak | libunibreak-5.1,1
[00:21:49] [20] [00:00:00] Building graphics/poppler-data | poppler-data-0.4.12
[00:21:49] [35] [00:15:17] Builder started
[00:21:49] [41] [00:00:00] Building archivers/libmspack | libmspack-0.11alpha
. . .
[main-amd64-bulk_a-default] [2023-09-03_20h48m38s] [parallel_build:] Queued: 34588 Built: 438 Failed: 1 Skipped: 50 Ignored: 335 Fetched: 0 Tobuild: 33764 Time: 01:21:30
. . .
^C[01:21:57] [32] [00:07:04] Finished devel/p5-Test-Deep | p5-Test-Deep-1.204: Success
[01:21:57] Error: Signal SIGINT caught, cleaning up and exiting
[01:22:03] [39] [00:06:01] Finished textproc/p5-Lingua-Stem-Ru | p5-Lingua-Stem-Ru-0.04: Success
[01:22:04] [35] [00:06:09] Finished devel/p5-ExtUtils-InstallPaths | p5-ExtUtils-InstallPaths-0.012: Success
[main-amd64-bulk_a-default] [2023-09-03_20h48m38s] [sigint:] Queued: 34588 Built: 442 Failed: 1 Skipped: 50 Ignored: 335 Fetched: 0 Tobuild: 33760 Time: 01:21:50
[01:22:04] Logs: /usr/local/poudriere/data/logs/bulk/main-amd64-bulk_a-default/2023-09-03_20h48m38s
[01:22:11] Cleaning up

(So only around 438 were built, in a time frame similar to that of the -J128 test, which got to more like 752. But -J128 had the "witness exhausted" status as well.)

It turns out that a measure of progress is the USED listed by zfs list for /usr/local/poudriere/data/.m/ . It spends lots of time waiting in various processes during its deletion activity.
Previously I'd been told to use vfs.zfs.per_txg_dirty_frees_percent=5 instead of the default (30) in order to avoid ending up with sustained, very small load averages in my poudriere bulk runs for my kind of context. (5 is the older default, as it turns out.) This may be something of a deletion/cleanup stage variant of that sort of issue.

It may be that trying a factor of 2+ for 32 hardware threads just does not scale the same way that a factor of 2 did for folks using 4 hardware thread machines when testing for the deadlock issue. Maybe -J36 (so: 32+4) would be more reasonable for deadlock testing in this context, possibly avoiding running into these other issues so strongly.

===
Mark Millard
marklmi at yahoo.com
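The progress measures used in this thread (the kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay counter, or the USED value from zfs list) are easier to judge as differences between successive samples than as raw numbers. The following is only a minimal sketch, assuming POSIX sh and awk; the function names print_deltas and sample_sysctl are made up for illustration and are not part of any FreeBSD tool.

```sh
#!/bin/sh
# print_deltas: read one integer per line on stdin and print the
# difference between each value and the one before it.
print_deltas() {
    awk 'NR > 1 { print $1 - prev } { prev = $1 }'
}

# sample_sysctl NAME COUNT INTERVAL: print COUNT readings of a numeric
# sysctl, one per line, sleeping INTERVAL seconds between readings.
# (Hypothetical helper; it relies only on "sysctl -n".)
sample_sysctl() {
    name=$1
    count=$2
    interval=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        sysctl -n "$name"
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Possible use on the machine being observed (not run here):
#   sample_sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay 5 10 | print_deltas
```

Fed the four readings quoted earlier in the thread (276042, 276427, 277323, 278027), print_deltas yields 385, 896 and 704, which makes the steady growth visible at a glance.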
Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned
On Sep 3, 2023, at 22:06, Alexander Motin wrote: > > On 03.09.2023 22:54, Mark Millard wrote: >> After that ^t produced the likes of: >> load: 6.39 cmd: sh 4849 [tx->tx_quiesce_done_cv] 10047.33r 0.51u 121.32s 1% >> 13004k > > So the full state is not "tx->tx", but is actually a > "tx->tx_quiesce_done_cv", which means the thread is waiting for new > transaction to be opened, which means some previous to be quiesced and then > synced. > >> #0 0x80b6f103 at mi_switch+0x173 >> #1 0x80bc0f24 at sleepq_switch+0x104 >> #2 0x80aec4c5 at _cv_wait+0x165 >> #3 0x82aba365 at txg_wait_open+0xf5 >> #4 0x82a11b81 at dmu_free_long_range+0x151 > > Here it seems like transaction commit is waited due to large amount of delete > operations, which ZFS tries to spread between separate TXGs. That fit the context: cleaning out /usr/local/poudriere/data/.m/ > You should probably see some large and growing number in sysctl > kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay . After the reboot I started a -J64 example. It has avoided the early "witness exhausted". Again I ^C'd after about an hours after the 2nd builder had started. So: again cleaning out /usr/local/poudriere/data/.m/ Only seconds between: # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276042 # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276427 # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 277323 # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 278027 I have found a measure of progress: zfs list's USED for /usr/local/poudriere/data/.m is decreasing. So ztop's d/s was a good classification: deletes. 
>> #5 0x829a87d2 at zfs_rmnode+0x72 >> #6 0x829b658d at zfs_freebsd_reclaim+0x3d >> #7 0x8113a495 at VOP_RECLAIM_APV+0x35 >> #8 0x80c5a7d9 at vgonel+0x3a9 >> #9 0x80c5af7f at vrecycle+0x3f >> #10 0x829b643e at zfs_freebsd_inactive+0x4e >> #11 0x80c598cf at vinactivef+0xbf >> #12 0x80c590da at vput_final+0x2aa >> #13 0x80c68886 at kern_funlinkat+0x2f6 >> #14 0x80c68588 at sys_unlink+0x28 >> #15 0x8106323f at amd64_syscall+0x14f >> #16 0x8103512b at fast_syscall_common+0xf8 > > What we don't see here is what quiesce and sync threads of the pool are > actually doing. Sync thread has plenty of different jobs, including async > write, async destroy, scrub and others, that all may delay each other. > > Before you rebooted the system, depending how alive it is, could you save a > number of outputs of `procstat -akk`, or at least specifically `procstat -akk > | grep txg_thread_enter` if the full is hard? Or somehow else observe what > they are doing. # procstat -akk > ~/mmjnk00.txt # procstat -akk > ~/mmjnk01.txt # procstat -akk > ~/mmjnk02.txt # procstat -akk > ~/mmjnk03.txt # procstat -akk > ~/mmjnk04.txt # procstat -akk > ~/mmjnk05.txt # grep txg_thread_enter ~/mmjnk0[0-5].txt /usr/home/root/mmjnk00.txt:6 100881 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe /usr/home/root/mmjnk00.txt:6 100882 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe /usr/home/root/mmjnk01.txt:6 100881 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe /usr/home/root/mmjnk01.txt:6 100882 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 
dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe /usr/home/root/mmjnk02.txt:6 100881 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe /usr/home/root/mmjnk02.txt:6 100882 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe /usr/home/root/mmjnk03.txt:6 100881 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe /usr/home/root/mmjnk03.txt:6 100882 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wai
Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned
Mark,

On 03.09.2023 22:54, Mark Millard wrote:
> After that ^t produced the likes of:
> load: 6.39 cmd: sh 4849 [tx->tx_quiesce_done_cv] 10047.33r 0.51u 121.32s 1% 13004k

So the full state is not "tx->tx", but is actually a "tx->tx_quiesce_done_cv", which means the thread is waiting for new transaction to be opened, which means some previous to be quiesced and then synced.

> #0 0x80b6f103 at mi_switch+0x173
> #1 0x80bc0f24 at sleepq_switch+0x104
> #2 0x80aec4c5 at _cv_wait+0x165
> #3 0x82aba365 at txg_wait_open+0xf5
> #4 0x82a11b81 at dmu_free_long_range+0x151

Here it seems like transaction commit is waited due to large amount of delete operations, which ZFS tries to spread between separate TXGs. You should probably see some large and growing number in sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay .

> #5 0x829a87d2 at zfs_rmnode+0x72
> #6 0x829b658d at zfs_freebsd_reclaim+0x3d
> #7 0x8113a495 at VOP_RECLAIM_APV+0x35
> #8 0x80c5a7d9 at vgonel+0x3a9
> #9 0x80c5af7f at vrecycle+0x3f
> #10 0x829b643e at zfs_freebsd_inactive+0x4e
> #11 0x80c598cf at vinactivef+0xbf
> #12 0x80c590da at vput_final+0x2aa
> #13 0x80c68886 at kern_funlinkat+0x2f6
> #14 0x80c68588 at sys_unlink+0x28
> #15 0x8106323f at amd64_syscall+0x14f
> #16 0x8103512b at fast_syscall_common+0xf8

What we don't see here is what quiesce and sync threads of the pool are actually doing. Sync thread has plenty of different jobs, including async write, async destroy, scrub and others, that all may delay each other.

Before you rebooted the system, depending how alive it is, could you save a number of outputs of `procstat -akk`, or at least specifically `procstat -akk | grep txg_thread_enter` if the full is hard? Or somehow else observe what they are doing.

`zpool status`, `zpool get all` and `sysctl -a` would also not harm.

PS: I may be wrong, but USB in "USB3 NVMe SSD storage" makes me shiver.
Make sure there are no storage problems, such as huge delays or timeouts, which can show up, for example, as busy percentages regularly spiking far above 100% in your `gstat -spod`.

--
Alexander Motin
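When several `procstat -akk` snapshots are compared, it can help to reduce each one to "which txg thread is this". What follows is a rough sketch, assuming the usual `procstat -akk` column order (PID, TID, COMM, TDNAME, then the kernel stack); the function name classify_txg_threads is hypothetical, and the classification only looks for the txg_quiesce_thread and txg_sync_thread frames that appear in the stacks quoted in this thread.

```sh
#!/bin/sh
# classify_txg_threads: read "procstat -akk" output on stdin and, for each
# line that mentions txg_thread_enter, print the thread ID followed by
# "quiesce", "sync" or "unknown", depending on which txg thread function
# appears among the stack frames.
classify_txg_threads() {
    awk '/txg_thread_enter/ {
        role = "unknown"
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^txg_quiesce_thread/) role = "quiesce"
            if ($i ~ /^txg_sync_thread/) role = "sync"
        }
        print $2, role
    }'
}

# Possible use on the machine being observed (not run here):
#   procstat -akk | classify_txg_threads
```

The full stacks are still what matters for diagnosis; a summary like this is only a quick way to pick the quiesce and sync threads out of a large snapshot.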
An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned
ThreadRipper 1950X (32 hardware threads) doing bulk -J128 with USE_TMPFS=no , no ALLOW_MAKE_JOBS , no ALLOW_MAKE_JOBS_PACKAGES , USB3 NVMe SSD storage/ZFS-boot-media, debug system build in use :

[00:03:44] Building 34214 packages using up to 128 builders
[00:03:44] Hit CTRL+t at any time to see build progress and stats
[00:03:44] [01] [00:00:00] Builder starting
[00:04:37] [01] [00:00:53] Builder started
[00:04:37] [01] [00:00:00] Building ports-mgmt/pkg | pkg-1.20.6
[00:05:53] [01] [00:01:16] Finished ports-mgmt/pkg | pkg-1.20.6: Success
[00:06:15] [01] [00:00:00] Building print/indexinfo | indexinfo-0.3.1
[00:06:15] [02] [00:00:00] Builder starting
. . .
[00:06:18] [128] [00:00:00] Builder starting
[00:07:42] [01] [00:01:27] Finished print/indexinfo | indexinfo-0.3.1: Success
[00:07:45] [01] [00:00:00] Building devel/gettext-runtime | gettext-runtime-0.22_1
[00:18:45] [01] [00:11:00] Finished devel/gettext-runtime | gettext-runtime-0.22_1: Success
[00:19:06] [01] [00:00:00] Building devel/gmake | gmake-4.3_2
[00:24:13] [01] [00:05:07] Finished devel/gmake | gmake-4.3_2: Success
[00:24:39] [01] [00:00:00] Building devel/libtextstyle | libtextstyle-0.22
[00:31:08] [125] [00:24:50] Builder started
[00:31:08] [125] [00:00:00] Building print/t1utils | t1utils-1.32
[00:31:15] [33] [00:25:00] Builder started
[00:31:15] [81] [00:24:59] Builder started
[00:31:15] [33] [00:00:00] Building databases/xapian-core | xapian-core-1.4.23,1
[00:31:15] [13] [00:25:00] Builder started
[00:31:15] [81] [00:00:00] Building devel/bmake | bmake-20230723
[00:31:15] [13] [00:00:00] Building devel/evdev-proto | evdev-proto-5.8
[00:31:16] [41] [00:25:00] Builder started
[00:31:16] [41] [00:00:00] Building devel/pcre | pcre-8.45_3
. . .

(Looks like lang/go120 ignores the lack of ALLOW_MAKE_JOBS . There may be others that still have significant parallel activity.)
[main-amd64-bulk_a-default] [2023-09-03_13h48m45s] [parallel_build:] Queued: 34588 Built: 727 Failed: 1 Skipped: 40 Ignored: 335 Fetched: 0 Tobuild: 33485 Time: 01:36:51

(So about 1 hr after the last "Builder starting" it had built 727.)

The vast majority of the time: lots of cpdup's, with tx->tx showing for STATE most of the time but with some CPU time accumulating. ^T commonly showed various Builders in the starting PHASE for 3min..6min. Around 66% mean Idle time (a guess from watching top).

After the ^C, "gstat -spod" reports almost always 2200 to 2500 writes per second or so, for *hours* (still going on). ztop instead reports 1500 to 3200 d/s or so, almost always for Dataset zamd64/poudriere/data/.m (also still going on). Normally no other Dataset is shown.

With all the disk I/O activity, this is definitely "live" in some sense. But I've no clue whether it is just repeating itself over and over or is making some sort of progress.

For reference for the ^C and after:

^C[01:39:00] [20] [00:00:03] Building sysutils/linux-c7-dosfstools | linux-c7-dosfstools-3.0.20
[01:39:00] [93] [00:07:12] Finished science/dimod | dimod-0.12.11: Success
[01:39:00] Error: Signal SIGINT caught, cleaning up and exiting
[01:39:02] [63] [00:06:34] Finished archivers/unarj | unarj-2.65_2: Success
[01:39:03] [128] [00:07:47] Finished sysutils/shuf | shuf-3.0: Success
[01:39:04] [113] [00:07:06] Finished devel/bsddialog | bsddialog-0.4.1: Success
[main-amd64-bulk_a-default] [2023-09-03_13h48m45s] [sigint:] Queued: 34588 Built: 752 Failed: 1 Skipped: 40 Ignored: 335 Fetched: 0 Tobuild: 33460 Time: 01:38:56
[01:39:06] Logs: /usr/local/poudriere/data/logs/bulk/main-amd64-bulk_a-default/2023-09-03_13h48m45s
[01:39:14] [12] [00:09:07] Finished archivers/rzip | rzip-2.1_1: Success
[01:39:14] Cleaning up
exit: cannot open ./var/run/49_nohang.pid: No such file or directory
exit: cannot open ./var/run/87_nohang.pid: No such file or directory

After that ^t produced the likes of:

load: 6.39 cmd: sh 4849
[tx->tx_quiesce_done_cv] 10047.33r 0.51u 121.32s 1% 13004k
#0 0x80b6f103 at mi_switch+0x173
#1 0x80bc0f24 at sleepq_switch+0x104
#2 0x80aec4c5 at _cv_wait+0x165
#3 0x82aba365 at txg_wait_open+0xf5
#4 0x82a11b81 at dmu_free_long_range+0x151
#5 0x829a87d2 at zfs_rmnode+0x72
#6 0x829b658d at zfs_freebsd_reclaim+0x3d
#7 0x8113a495 at VOP_RECLAIM_APV+0x35
#8 0x80c5a7d9 at vgonel+0x3a9
#9 0x80c5af7f at vrecycle+0x3f
#10 0x829b643e at zfs_freebsd_inactive+0x4e
#11 0x80c598cf at vinactivef+0xbf
#12 0x80c590da at vput_final+0x2aa
#13 0x80c68886 at kern_funlinkat+0x2f6
#14 0x80c68588 at sys_unlink+0x28
#15 0x8106323f at amd64_syscall+0x14f
#16 0x8103512b at fast_syscall_common+0xf8

The console/logs do report "witness exhausted":

. . .
Sep 3 13:41:08 amd64-ZFS login[1751]: ROOT LOGIN (root) ON ttyv0
Sep 3 13:51:35 amd64-ZFS kernel: witness_lock_list_get: witness exhausted
Sep 3 14:26:38 amd64-ZFS kernel: pid 27418 (conftest), jid 245
See bugzilla's 272965 and 272966 for cortex-A7 armv7 example kyua test case panics for main [so: 14], split by backtrace structure
See:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=272965
and:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=272966

All involve an 'Alignment Fault' on read at some point, but the 272966 ones first involve "Kernel page fault with the following non-sleepable locks held" during in6ifa_ifwithaddr. The 272965 ones involve udp_input getting the alignment fault.

These reports are based on the most recent snapshot of main [so: 14], not on my own builds. They do not involve aarch64 at all: no chroot use, no jail use, and no lib32, since the context is already armv7-only.

===
Mark Millard
marklmi at yahoo.com
Some results from my crude technique of using (some of) Kyua to test aarch64's lib32 vs. kyua runs in a armv7 chroot on aarch64
[I first report how I tested before reporting on errors that look to be valid kyua reports of issues.]

[This testing is with a line commented out in order to prevent sys/net/if_bridge_test:gif from being tested, because it leads to a panic for lib32 style testing.]

I have /usr/obj/DESTDIRs/main-CA7-chroot/ containing an armv7 installworld distrib-dirs distribution DB_FROM_SRC=1 result. It also has various ports installed that kyua runs use. I use this tree for both the lib32 and the chroot testing.

I have a script to preload various kernel modules:

# grep kldload ~/prekyua-kldloads.sh | sort
#kldload -v -n ipfw.ko
#kldload -v -n pflog.ko
#kldload -v -n pfsync.ko
kldload -v -n bridgestp.ko
kldload -v -n carp.ko
kldload -v -n cryptodev.ko
kldload -v -n dtrace.ko
kldload -v -n dummynet.ko
kldload -v -n fdescfs.ko
kldload -v -n filemon.ko
kldload -v -n geom_concat.ko
kldload -v -n geom_eli.ko
kldload -v -n geom_gate.ko
kldload -v -n geom_mirror.ko
kldload -v -n geom_multipath.ko
kldload -v -n geom_nop.ko
kldload -v -n geom_raid3.ko
kldload -v -n geom_shsec.ko
kldload -v -n geom_stripe.ko
kldload -v -n geom_uzip.ko
kldload -v -n if_bridge.ko
kldload -v -n if_epair.ko
kldload -v -n if_gif.ko
kldload -v -n if_infiniband.ko
kldload -v -n if_lagg.ko
kldload -v -n if_ovpn.ko
kldload -v -n if_stf.ko
kldload -v -n if_tuntap.ko
kldload -v -n if_wg.ko
kldload -v -n ipdivert.ko
kldload -v -n ipsec.ko
kldload -v -n mqueuefs.ko
kldload -v -n netgraph.ko
kldload -v -n nfsd.ko
kldload -v -n ng_bridge.ko
kldload -v -n ng_ether.ko
kldload -v -n ng_hub.ko
kldload -v -n ng_socket.ko
kldload -v -n ng_vlan_rotate.ko
kldload -v -n nullfs.ko
kldload -v -n opensolaris.ko
kldload -v -n pf.ko
kldload -v -n sctp.ko
kldload -v -n sdt.ko
kldload -v -n tarfs.ko
kldload -v -n tcpmd5.ko
kldload -v -n xz.ko
kldload -v -n zfs.ko

(Some I've listed despite their being built into the kernel or already being loaded for my normal environment.) Likely I'll end up adding some more later.
I have some ports used by kyua runs that I build and then install into /usr/obj/DESTDIRs/main-CA7-chroot/ :

# more ~/origins/kyua-origins.txt
archivers/gtar
devel/py-pytest
devel/py-pytest-twisted
devel/py-twisted
lang/perl5.32
lang/python
net/scapy
security/openvpn
security/sudo
shells/ksh93
shells/bash
sysutils/coreutils
sysutils/sg3_utils
textproc/jq

Likely I'll add some more later. The above, of course, leads to other installs as well.

For lib32 testing, I try to control where most *.so* 's that are not based on full path references are found. This is via use of LD_32_LIBRARY_PATH . I also try to have more of the programs that are not based on full path references run as armv7 code. This is via use of PATH . So:

# env \
LD_32_LIBRARY_PATH=/usr/obj/DESTDIRs/main-CA7-chroot/lib\
:/usr/obj/DESTDIRs/main-CA7-chroot/usr/lib\
:/usr/obj/DESTDIRs/main-CA7-chroot/usr/tests/libexec/rtld-elf\
:/usr/obj/DESTDIRs/main-CA7-chroot/usr/tests/lib/libxo\
:/usr/obj/DESTDIRs/main-CA7-chroot/usr/tests/lib/csu/dynamiclib\
:/usr/obj/DESTDIRs/main-CA7-chroot/usr/tests/lib/libc/tls\
:/usr/obj/DESTDIRs/main-CA7-chroot/usr/tests/lib/libc/stdlib\
:/usr/obj/DESTDIRs/main-CA7-chroot/usr/tests/lib/libthr/dlopen\
:/usr/obj/DESTDIRs/main-CA7-chroot/usr/local/lib\
:/usr/obj/DESTDIRs/main-CA7-chroot/usr/local/lib/python3.9/site-packages\
:/usr/obj/DESTDIRs/main-CA7-chroot/usr/local/lib/python3.9/lib-dynload\
:/usr/obj/DESTDIRs/main-CA7-chroot/usr/local/lib/perl5/5.32/mach/CORE\
:/usr/obj/DESTDIRs/main-CA7-chroot/usr/local/lib/perl5/5.32/mach/auto \
PATH=/usr/obj/DESTDIRs/main-CA7-chroot/sbin\
:/usr/obj/DESTDIRs/main-CA7-chroot/bin\
:/usr/obj/DESTDIRs/main-CA7-chroot/usr/sbin\
:/usr/obj/DESTDIRs/main-CA7-chroot/usr/bin\
:/usr/obj/DESTDIRs/main-CA7-chroot/usr/local/sbin\
:/usr/obj/DESTDIRs/main-CA7-chroot/usr/local/bin\
:/usr/obj/DESTDIRs/main-CA7-chroot/root/bin \
/usr/obj/DESTDIRs/main-CA7-chroot/usr/bin/kyua test \
-k /usr/obj/DESTDIRs/main-CA7-chroot/usr/tests/Kyuafile

On the Windows Dev Kit 2023 I end up
with the lib32 summary being (so far): # kyua report --verbose \ --results-file=usr_obj_DESTDIRs_main-CA7-chroot_usr_tests.20230731-080820-275974 \ 2>&1 \ | tail -6 ===> Summary Results read from /usr/home/root/.kyua/store/results.usr_obj_DESTDIRs_main-CA7-chroot_usr_tests.20230731-080820-275974.db Test cases: 8704 total, 1442 skipped, 37 expected failures, 46 broken, 746 failed Start time: 2023-07-31T08:08:20.858437Z End time: 2023-07-31T10:18:37.393732Z Total time: 6954.365s Of course, some tests labeled as broken/failed are just from the limitations of the techniques involved lib32 based kyua testing. For example the 127 "failures": In the chroot it is currently: # kyua report --verbose \ --results-file=usr_tests.20230731-163737-720329 \ 2>&1 \ | tail -6 ===> Summary Results read from /usr/home/root/.kyua/store/results.usr_tests.20230731-163737-720329.db Test cases: 8699 total, 1478 skipped, 38 exp
FYI for aarch64/armv7 lib32: armv7 kyua test sys/net/if_bridge_test:gif with preloaded if_bridge.ko still panics in my style of testing
I finally got around to testing lib32 some more, first trying the panic case that I'd gotten in early testing. The below is without any special lib32 patching for testing, just my normal non-debug environment updated to a lib32-present aarch64 FreeBSD vintage.

Reminder: /usr/obj/DESTDIRs/main-CA7-chroot/ contains an armv7 installworld distrib-dirs distribution DB_FROM_SRC=1 result. (It also has various ports installed.)

# ~/prekyua-kldloads.sh
. . .
# env \
> LD_32_LIBRARY_PATH=/usr/obj/DESTDIRs/main-CA7-chroot/lib\
> :/usr/obj/DESTDIRs/main-CA7-chroot/usr/lib\
> :/usr/obj/DESTDIRs/main-CA7-chroot/usr/tests/libexec/rtld-elf\
> :/usr/obj/DESTDIRs/main-CA7-chroot/usr/tests/lib/libxo\
> :/usr/obj/DESTDIRs/main-CA7-chroot/usr/tests/lib/csu/dynamiclib\
> :/usr/obj/DESTDIRs/main-CA7-chroot/usr/tests/lib/libc/tls\
> :/usr/obj/DESTDIRs/main-CA7-chroot/usr/tests/lib/libc/stdlib\
> :/usr/obj/DESTDIRs/main-CA7-chroot/usr/tests/lib/libthr/dlopen\
> :/usr/obj/DESTDIRs/main-CA7-chroot/usr/local/lib\
> :/usr/obj/DESTDIRs/main-CA7-chroot/usr/local/lib/python3.9/site-packages\
> :/usr/obj/DESTDIRs/main-CA7-chroot/usr/local/lib/python3.9/lib-dynload\
> :/usr/obj/DESTDIRs/main-CA7-chroot/usr/local/lib/perl5/5.32/mach/CORE\
> :/usr/obj/DESTDIRs/main-CA7-chroot/usr/local/lib/perl5/5.32/mach/auto \
> PATH=/usr/obj/DESTDIRs/main-CA7-chroot/sbin\
> :/usr/obj/DESTDIRs/main-CA7-chroot/bin\
> :/usr/obj/DESTDIRs/main-CA7-chroot/usr/sbin\
> :/usr/obj/DESTDIRs/main-CA7-chroot/usr/bin\
> :/usr/obj/DESTDIRs/main-CA7-chroot/usr/local/sbin\
> :/usr/obj/DESTDIRs/main-CA7-chroot/usr/local/bin\
> :/usr/obj/DESTDIRs/main-CA7-chroot/root/bin \
> /usr/obj/DESTDIRs/main-CA7-chroot/usr/bin/kyua test \
> -k /usr/obj/DESTDIRs/main-CA7-chroot/usr/tests/Kyuafile
> sys/net/if_bridge_test:gif
sys/net/if_bridge_test:gif ->
Jul 29 21:29:16 CA72-16Gp-ZFS dhclient[56641]: epair0a: not found
Jul 29 21:29:16 CA72-16Gp-ZFS dhclient[56641]: exiting.
Fatal data abort:
x0: 0xa0275306c560
x1: 0xa027f9d053d2
x2: 0x002a
x3: 0xa0275306c560
x4: 0xa027f9d053fc
x5: 0xa0275306c58a
x6: 0x3ec2
x7: 0x010006085ba958bc
x8: 0x002a
x9: 0x002a
x10: 0x0008010006085ba9
x11: 0x58bc3ec201000406
x12: 0x016433c65ba9
x13: 0x026433c6
x14: 0x00ff
x15: 0x289f
x16: 0x0002d056b370 (_DYNAMIC + 0x370)
x17: 0x00598110 (m_dup + 0x0)
x18: 0x0002801e94a0
x19: 0x0001
x20: 0x
x21: 0x
x22: 0x00d95000 (vop_spare3_desc + 0x18)
x23: 0xa0275306c500
x24: 0xa0275306c500
x25: 0x00a0
x26: 0x0002
x27: 0x
x28: 0xa0275306c500
x29: 0x0002801e94c0
sp: 0x0002801e94a0
lr: 0x00598308 (m_dup + 0x1f8)
elr: 0x00598160 (m_dup + 0x50)
spsr: 0x2045
far: 0x001c
esr: 0x9604
panic: vm_fault failed: 0x00598160 error 1
cpuid = 14
time = 1690691356
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
vpanic() at vpanic+0x13c
panic() at panic+0x44
data_abort() at data_abort+0x2fc
handle_el1h_sync() at handle_el1h_sync+0x14
--- exception, esr 0x9604
m_dup() at m_dup+0x50
bridge_input() at bridge_input+0x17c
gif_input() at gif_input+0x2dc
in_gif_input() at in_gif_input+0x5c
encap_input() at encap_input+0xfc
encap4_input() at encap4_input+0x30
ip_input() at ip_input+0x5ac
netisr_dispatch_src() at netisr_dispatch_src+0xf8
ether_demux() at ether_demux+0x14c
ether_nh_input() at ether_nh_input+0x39c
netisr_dispatch_src() at netisr_dispatch_src+0xf8
ether_input() at ether_input+0x50
epair_tx_start_deferred() at epair_tx_start_deferred+0x110
taskqueue_run_locked() at taskqueue_run_locked+0x198
taskqueue_thread_loop() at taskqueue_thread_loop+0x130
fork_exit() at fork_exit+0x88
fork_trampoline() at fork_trampoline+0x14
KDB: enter: panic
[ thread pid 0 tid 1028122 ]
Stopped at kdb_enter+0x44: str xzr, [x19, #3328]

For reference:

# uname -apKU
FreeBSD CA72-16Gp-ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT aarch64 1400093 #102 main-n264334-215bab7924f6-dirty: Wed Jul 26 02:02:48 PDT 2023
root@CA72-16Gp-ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72 arm64 aarch64 1400093 1400093 === Mark Millard marklmi at yahoo.com
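The long LD_32_LIBRARY_PATH and PATH values in the env invocation above all share one root prefix. A minimal sketch of assembling such a colon-separated value from that prefix plus a list of subdirectories follows. The helper name join_under_root is hypothetical (it is not part of FreeBSD), and only a handful of the directories from the message are listed to keep it short.

```shell
#!/bin/sh
# Sketch: build a colon-separated search path from one root prefix and a
# list of subdirectories, instead of typing every entry by hand.
# join_under_root is a hypothetical helper name, not a FreeBSD tool.

join_under_root() {
    root=$1
    shift
    out=
    for sub in "$@"; do
        # Append "<root><sub>", with ':' between entries after the first.
        out="${out:+$out:}${root}${sub}"
    done
    printf '%s\n' "$out"
}

ROOT32=/usr/obj/DESTDIRs/main-CA7-chroot

# Only a few of the directories from the message, to keep the sketch short.
LD_32_LIBRARY_PATH=$(join_under_root "$ROOT32" /lib /usr/lib /usr/local/lib)
PATH32=$(join_under_root "$ROOT32" /sbin /bin /usr/sbin /usr/bin)

echo "LD_32_LIBRARY_PATH=$LD_32_LIBRARY_PATH"
echo "PATH=$PATH32"
```

The two results would then be handed to env(1) in front of the kyua command, as in the original invocation.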
Re: For snapshot builds: armv7 chroot on aarch64 has kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin hung up [in getpid?], unkillable, prevents reboot
On Jul 7, 2023, at 10:13, Mike Karels wrote: > On 7 Jul 2023, at 11:38, Mark Millard wrote: > >> On Jul 7, 2023, at 07:36, Mark Millard wrote: >> >>> On Jul 7, 2023, at 06:50, Mike Karels wrote: >>> >>>> On 7 Jul 2023, at 6:06, John F Carr wrote: >>>> >>>>> On Jul 6, 2023, at 20:42, Mike Karels wrote: >>>>>> >>>>>> >>>>>> Thanks for isolating this. Let me know when you have the bug number. >>>>>> I just tested a fix (the compat code drops the reference on the current >>>>>> address space an extra time, probably freeing it). >>>>>> >>>>>> Mike >>>> >>>> The fix is in >>>> https://cgit.freebsd.org/src/commit/?id=be30fd3ab2e8418a696e69f54a91a7e2db5962de. >>>> >>>>> The bug was introduced in January, 2022. It allows 32 bit binaries to >>>>> crash a 64 bit system when COMPAT_FREEBSD32 is on. Test coverage of the >>>>> buggy function (sysctl_kern_proc_vm_layout) was added at the same time. >>>>> >>>>> There should be routine runs of 32 bit test suites on 64 bit systems. >>>>> Although i386 and armv7 are tier 2 systems, the tier 1 COMPAT_FREEBSD32 >>>>> kernel code needs to be exercised. This bug was only discovered by >>>>> manually running tests in the right environment, 17 months after >>>>> automated testing could have discovered it. >>>> >>>> That is not so simple currently, as the shared libraries for the >>>> test environment are not part of 32-bit compatibility package. >>>> The required bits could be extracted from the corresponding 32-bit >>>> build, but that isn't easy to automate. Fortunately, I think that >>>> very few of the tests exercise any 32-bit-specific code paths. >>> >>> One way that I demonstrated this problem on an aarch64 system >>> that supports aarch32/armv7 in user space was via the use of >>> an official snapshot armv7 image. 
In my case I: >>> >>> A) dd'd the image to USB3 media (after downloading it) >>> B) mounted the ufs file system on the media to a mount point >> >> I forgot to mention an important step: before chroot is >> used, I preload various kernel modules that are used in >> the Kyuafile tests --because the chroot'd activity will >> not cause the loads of themsleves but will use the >> modules if they have already been loaded. >> >>> C) used a chroot into that mount point to run the: >>> "kyua test -k /usr/tests/Kyuafile" >>> >>> (I happened to do all that as root.) >>> >>> There may be viable alternatives to dd'ing to allow an analogy to >>> (B) for (C) to use. I've not experimented with using a jail >>> instead of a chroot. >>> >>> One can also install an armv7 world into a local directory tree >>> and then use chroot (or analogous). >>> >>> How far off is an analogous sort of procedure from being reasonable >>> to automate? > > It would be easier to use the packages rather than the full image > (base.txz and tests.txz). But doing this as part of a CI setup would > mean fetching things from a different source from the install image, > and then of course doing various configuration. It's always a Small > Matter of Programming. Doing a full chroot gets into some other > problems, e.g. mdconfig doesn't currently work in compatibility > mode. It doesn't seem that automating all this would yield much; > it's hard enough to do manually. > >>> i386, of course, also has lib32 and independently testing that is >>> a messier issue, including trying to use /usr/tests/Kyuafile based >>> testing that avoids use of chroot (or analogous). I'm not claiming >>> lib32 has as simple of a potential path to automated testing. > > I think the problem is essentially the same. A chroot could be used > or a 32-bit library setup (which would test the libraries as well). > >>> I do not know if FreeBSD has powerpc64 hardware able to use a >>> powerpc world directory tree analogously. 
Such hardware may be too >>> old and otherwise problematical, making it not viable to automate >>> testing. > > Powerpc supports 32-bit libraries, unlike arm (so far). My understanding is that powerpc64le does not in FreeBSD: there is no powerpcle in FreeBSD. So there is not even chroot-style support for 32-bit little-endian use. If I understand right, no 32-bit little-endian ABI is defined, other than the Void Linux activity's material, anyway. It may be that all big-endian POWER use has lib32 support, but I'm not sure if all POWER has big-endian FreeBSD support. Maybe POWER9 (10?) still has such support in FreeBSD. === Mark Millard marklmi at yahoo.com
Re: For snapshot builds: armv7 chroot on aarch64 has kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin hung up [in getpid?], unkillable, prevents reboot
On 7 Jul 2023, at 11:38, Mark Millard wrote: > On Jul 7, 2023, at 07:36, Mark Millard wrote: > >> On Jul 7, 2023, at 06:50, Mike Karels wrote: >> >>> On 7 Jul 2023, at 6:06, John F Carr wrote: >>> >>>> On Jul 6, 2023, at 20:42, Mike Karels wrote: >>>>> >>>>> >>>>> Thanks for isolating this. Let me know when you have the bug number. >>>>> I just tested a fix (the compat code drops the reference on the current >>>>> address space an extra time, probably freeing it). >>>>> >>>>> Mike >>> >>> The fix is in >>> https://cgit.freebsd.org/src/commit/?id=be30fd3ab2e8418a696e69f54a91a7e2db5962de. >>> >>>> The bug was introduced in January, 2022. It allows 32 bit binaries to >>>> crash a 64 bit system when COMPAT_FREEBSD32 is on. Test coverage of the >>>> buggy function (sysctl_kern_proc_vm_layout) was added at the same time. >>>> >>>> There should be routine runs of 32 bit test suites on 64 bit systems. >>>> Although i386 and armv7 are tier 2 systems, the tier 1 COMPAT_FREEBSD32 >>>> kernel code needs to be exercised. This bug was only discovered by >>>> manually running tests in the right environment, 17 months after automated >>>> testing could have discovered it. >>> >>> That is not so simple currently, as the shared libraries for the >>> test environment are not part of 32-bit compatibility package. >>> The required bits could be extracted from the corresponding 32-bit >>> build, but that isn't easy to automate. Fortunately, I think that >>> very few of the tests exercise any 32-bit-specific code paths. >> >> One way that I demonstrated this problem on an aarch64 system >> that supports aarch32/armv7 in user space was via the use of >> an official snapshot armv7 image. 
In my case I: >> >> A) dd'd the image to USB3 media (after downloading it) >> B) mounted the ufs file system on the media to a mount point > > I forgot to mention an important step: before chroot is > used, I preload various kernel modules that are used in > the Kyuafile tests --because the chroot'd activity will > not cause the loads of themsleves but will use the > modules if they have already been loaded. > >> C) used a chroot into that mount point to run the: >>"kyua test -k /usr/tests/Kyuafile" >> >> (I happened to do all that as root.) >> >> There may be viable alternatives to dd'ing to allow an analogy to >> (B) for (C) to use. I've not experimented with using a jail >> instead of a chroot. >> >> One can also install an armv7 world into a local directory tree >> and then use chroot (or analogous). >> >> How far off is an analogous sort of procedure from being reasonable >> to automate? It would be easier to use the packages rather than the full image (base.txz and tests.txz). But doing this as part of a CI setup would mean fetching things from a different source from the install image, and then of course doing various configuration. It's always a Small Matter of Programming. Doing a full chroot gets into some other problems, e.g. mdconfig doesn't currently work in compatibility mode. It doesn't seem that automating all this would yield much; it's hard enough to do manually. >> i386, of course, also has lib32 and independently testing that is >> a messier issue, including trying to use /usr/tests/Kyuafile based >> testing that avoids use of chroot (or analogous). I'm not claiming >> lib32 has as simple of a potential path to automated testing. I think the problem is essentially the same. A chroot could be used or a 32-bit library setup (which would test the libraries as well). >> I do not know if FreeBSD has powerpc64 hardware able to use a >> powerpc world directory tree analogously. 
Such hardware may be too >> old and otherwise problematical, making it not viable to automate >> testing. Powerpc supports 32-bit libraries, unlike arm (so far). Mike
Re: For snapshot builds: armv7 chroot on aarch64 has kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin hung up [in getpid?], unkillable, prevents reboot
On Jul 7, 2023, at 07:36, Mark Millard wrote: > On Jul 7, 2023, at 06:50, Mike Karels wrote: > >> On 7 Jul 2023, at 6:06, John F Carr wrote: >> >>> On Jul 6, 2023, at 20:42, Mike Karels wrote: >>>> >>>> >>>> Thanks for isolating this. Let me know when you have the bug number. >>>> I just tested a fix (the compat code drops the reference on the current >>>> address space an extra time, probably freeing it). >>>> >>>> Mike >> >> The fix is in >> https://cgit.freebsd.org/src/commit/?id=be30fd3ab2e8418a696e69f54a91a7e2db5962de. >> >>> The bug was introduced in January, 2022. It allows 32 bit binaries to >>> crash a 64 bit system when COMPAT_FREEBSD32 is on. Test coverage of the >>> buggy function (sysctl_kern_proc_vm_layout) was added at the same time. >>> >>> There should be routine runs of 32 bit test suites on 64 bit systems. >>> Although i386 and armv7 are tier 2 systems, the tier 1 COMPAT_FREEBSD32 >>> kernel code needs to be exercised. This bug was only discovered by >>> manually running tests in the right environment, 17 months after automated >>> testing could have discovered it. >> >> That is not so simple currently, as the shared libraries for the >> test environment are not part of 32-bit compatibility package. >> The required bits could be extracted from the corresponding 32-bit >> build, but that isn't easy to automate. Fortunately, I think that >> very few of the tests exercise any 32-bit-specific code paths. > > One way that I demonstrated this problem on an aarch64 system > that supports aarch32/armv7 in user space was via the use of > an official snapshot armv7 image. 
In my case I: > > A) dd'd the image to USB3 media (after downloading it) > B) mounted the ufs file system on the media to a mount point I forgot to mention an important step: before chroot is used, I preload various kernel modules that are used in the Kyuafile tests, because the chroot'd activity will not cause the loads by itself but will use the modules if they have already been loaded. > C) used a chroot into that mount point to run the: >"kyua test -k /usr/tests/Kyuafile" > > (I happened to do all that as root.) > > There may be viable alternatives to dd'ing to allow an analogy to > (B) for (C) to use. I've not experimented with using a jail > instead of a chroot. > > One can also install an armv7 world into a local directory tree > and then use chroot (or analogous). > > How far off is an analogous sort of procedure from being reasonable > to automate? > > i386, of course, also has lib32 and independently testing that is > a messier issue, including trying to use /usr/tests/Kyuafile based > testing that avoids use of chroot (or analogous). I'm not claiming > lib32 has as simple of a potential path to automated testing. > > I do not know if FreeBSD has powerpc64 hardware able to use a > powerpc world directory tree analogously. Such hardware may be too > old and otherwise problematical, making it not viable to automate > testing. > === Mark Millard marklmi at yahoo.com
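For anyone wanting to try the sequence, the steps A through C plus the module preload can be sketched roughly as below. This is a dry-run sketch only: by default it prints the commands instead of running them. The device names, mount point and image name are assumptions borrowed from elsewhere in this thread, and the module list is purely illustrative, since the actual set preloaded was not given.

```shell
#!/bin/sh
# Sketch of the dd / mount / preload / chroot sequence described above.
# With DRY_RUN=1 (the default here) commands are only printed, not executed.
# IMG, DISK, PART, MNT and MODULES are illustrative assumptions, not values
# confirmed for any particular machine.

DRY_RUN=${DRY_RUN:-1}
IMG=${IMG:-FreeBSD-14.0-CURRENT-arm-armv7-GENERICSD-20230622-b95d2237af40-263748.img}
DISK=${DISK:-/dev/da1}
PART=${PART:-/dev/da1s2a}
MNT=${MNT:-/mnt}
MODULES=${MODULES:-"if_epair if_bridge"}

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

armv7_chroot_tests() {
    run dd if="$IMG" of="$DISK" bs=1m                      # A) image to media
    run mount -o noatime "$PART" "$MNT"                    # B) mount its UFS root
    for m in $MODULES; do
        run kldload -n "$m"                                # preload before chroot
    done
    run chroot "$MNT" kyua test -k /usr/tests/Kyuafile     # C) run the suite
}

armv7_chroot_tests
```

Setting DRY_RUN=0 would execute the commands as root; the dd step overwrites the target device, so DISK needs checking first.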
Re: For snapshot builds: armv7 chroot on aarch64 has kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin hung up [in getpid?], unkillable, prevents reboot
On Jul 7, 2023, at 06:50, Mike Karels wrote: > On 7 Jul 2023, at 6:06, John F Carr wrote: > >> On Jul 6, 2023, at 20:42, Mike Karels wrote: >>> >>> >>> Thanks for isolating this. Let me know when you have the bug number. >>> I just tested a fix (the compat code drops the reference on the current >>> address space an extra time, probably freeing it). >>> >>> Mike > > The fix is in > https://cgit.freebsd.org/src/commit/?id=be30fd3ab2e8418a696e69f54a91a7e2db5962de. > >> The bug was introduced in January, 2022. It allows 32 bit binaries to >> crash a 64 bit system when COMPAT_FREEBSD32 is on. Test coverage of the >> buggy function (sysctl_kern_proc_vm_layout) was added at the same time. >> >> There should be routine runs of 32 bit test suites on 64 bit systems. >> Although i386 and armv7 are tier 2 systems, the tier 1 COMPAT_FREEBSD32 >> kernel code needs to be exercised. This bug was only discovered by manually >> running tests in the right environment, 17 months after automated testing >> could have discovered it. > > That is not so simple currently, as the shared libraries for the > test environment are not part of 32-bit compatibility package. > The required bits could be extracted from the corresponding 32-bit > build, but that isn't easy to automate. Fortunately, I think that > very few of the tests exercise any 32-bit-specific code paths. One way that I demonstrated this problem on an aarch64 system that supports aarch32/armv7 in user space was via the use of an official snapshot armv7 image. In my case I: A) dd'd the image to USB3 media (after downloading it) B) mounted the ufs file system on the media to a mount point C) used a chroot into that mount point to run the: "kyua test -k /usr/tests/Kyuafile" (I happened to do all that as root.) There may be viable alternatives to dd'ing to allow an analogy to (B) for (C) to use. I've not experimented with using a jail instead of a chroot. 
One can also install an armv7 world into a local directory tree and then use chroot (or analogous). How far off is an analogous sort of procedure from being reasonable to automate? i386, of course, also has lib32 and independently testing that is a messier issue, including trying to use /usr/tests/Kyuafile based testing that avoids use of chroot (or analogous). I'm not claiming lib32 has as simple of a potential path to automated testing. I do not know if FreeBSD has powerpc64 hardware able to use a powerpc world directory tree analogously. Such hardware may be too old and otherwise problematical, making it not viable to automate testing. === Mark Millard marklmi at yahoo.com
Re: For snapshot builds: armv7 chroot on aarch64 has kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin hung up [in getpid?], unkillable, prevents reboot
On 7 Jul 2023, at 6:06, John F Carr wrote: > On Jul 6, 2023, at 20:42, Mike Karels wrote: >> >> >> Thanks for isolating this. Let me know when you have the bug number. >> I just tested a fix (the compat code drops the reference on the current >> address space an extra time, probably freeing it). >> >> Mike The fix is in https://cgit.freebsd.org/src/commit/?id=be30fd3ab2e8418a696e69f54a91a7e2db5962de. > The bug was introduced in January, 2022. It allows 32 bit binaries to crash > a 64 bit system when COMPAT_FREEBSD32 is on. Test coverage of the buggy > function (sysctl_kern_proc_vm_layout) was added at the same time. > > There should be routine runs of 32 bit test suites on 64 bit systems. > Although i386 and armv7 are tier 2 systems, the tier 1 COMPAT_FREEBSD32 > kernel code needs to be exercised. This bug was only discovered by manually > running tests in the right environment, 17 months after automated testing > could have discovered it. That is not so simple currently, as the shared libraries for the test environment are not part of 32-bit compatibility package. The required bits could be extracted from the corresponding 32-bit build, but that isn't easy to automate. Fortunately, I think that very few of the tests exercise any 32-bit-specific code paths. Mike
Re: For snapshot builds: armv7 chroot on aarch64 has kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin hung up [in getpid?], unkillable, prevents reboot
On Jul 6, 2023, at 20:42, Mike Karels wrote: > > > Thanks for isolating this. Let me know when you have the bug number. > I just tested a fix (the compat code drops the reference on the current > address space an extra time, probably freeing it). > > Mike The bug was introduced in January, 2022. It allows 32 bit binaries to crash a 64 bit system when COMPAT_FREEBSD32 is on. Test coverage of the buggy function (sysctl_kern_proc_vm_layout) was added at the same time. There should be routine runs of 32 bit test suites on 64 bit systems. Although i386 and armv7 are tier 2 systems, the tier 1 COMPAT_FREEBSD32 kernel code needs to be exercised. This bug was only discovered by manually running tests in the right environment, 17 months after automated testing could have discovered it.
Re: For snapshot builds: armv7 chroot on aarch64 has kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin hung up [in getpid?], unkillable, prevents reboot
On 6 Jul 2023, at 18:53, John F Carr wrote: > The hang is caused by the sysctl call in tests/sys/kern/kern_copyin.c. The > function below hangs when called in a 32 bit ARM process running in a chroot > environment on a 64 bit ARM system. I will write up a bug report. > > static int > get_vm_layout(struct kinfo_vm_layout *kvm) > { > size_t len; > int mib[4]; > > mib[0] = CTL_KERN; > mib[1] = KERN_PROC; > mib[2] = KERN_PROC_VM_LAYOUT; > mib[3] = getpid(); > len = sizeof(*kvm); > > return (sysctl(mib, nitems(mib), kvm, &len, NULL, 0)); > } Thanks for isolating this. Let me know when you have the bug number. I just tested a fix (the compat code drops the reference on the current address space an extra time, probably freeing it). Mike
Re: For snapshot builds: armv7 chroot on aarch64 has kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin hung up [in getpid?], unkillable, prevents reboot
On Jul 6, 2023, at 16:53, John F Carr wrote: > On Jun 25, 2023, at 20:16, Mark Millard wrote: >> >> . . . >> > > The hang is caused by the sysctl call in tests/sys/kern/kern_copyin.c. The > function below hangs when called in a 32 bit ARM process running in a chroot > environment on a 64 bit ARM system. I will write up a bug report. > > static int > get_vm_layout(struct kinfo_vm_layout *kvm) > { > size_t len; > int mib[4]; > > mib[0] = CTL_KERN; > mib[1] = KERN_PROC; > mib[2] = KERN_PROC_VM_LAYOUT; > mib[3] = getpid(); > len = sizeof(*kvm); > > return (sysctl(mib, nitems(mib), kvm, &len, NULL, 0)); > } > Thanks for the tiny-reproducer analysis! That should help make getting to a fix more actionable. === Mark Millard marklmi at yahoo.com
For snapshot builds: armv7 chroot on aarch64 has kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin hung up [in getpid?], unkillable, prevents reboot
Using the likes of:

FreeBSD-14.0-CURRENT-arm64-aarch64-ROCK64-20230622-b95d2237af40-263748.img
and:
FreeBSD-14.0-CURRENT-arm-armv7-GENERICSD-20230622-b95d2237af40-263748.img

I have shown the following behavior after setting up storage media based on them. (This was a test that my builds were not odd for the issue.)

Boot the aarch64 media and log in. (Note: I logged in as root.)

mount the armv7 media (-noatime is just my habit) and then put it to use:

# mount -onoatime /dev/da1s2a /mnt
# chroot /mnt/
# kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin
sys/kern/kern_copyin:kern_copyin ->

On the serial console:

# ps -xu
USER  PID   %CPU %MEM   VSZ  RSS TT  STAT STARTED      TIME COMMAND
root   11 1498.4  0.0     0  256  -  RNL  23:24   542:52.92 [idle]
root 1174  100.0  0.0     0   16  -  Rs   23:37     0:00.00 /usr/tests/sys/kern/kern_copyin -vunprivileged-user=tests -r/tmp/kyua.9YUttj/2/result.atf kern_copyin
root    0    0.0  0.0     0 1616  -  DLs  23:24     0:00.50 [kernel]
root    1    0.0  0.0 11704 1288  -  ILs  23:24     0:00.02 /sbin/init
root    2    0.0  0.0     0  256  -  WL   23:24     0:00.26 [clock]
root    3    0.0  0.0     0  272  -  DL   23:24     0:00.00 [crypto]
root    4    0.0  0.0     0   80  -  DL   23:24     0:00.95 [cam]
root    5    0.0  0.0     0   16  -  DL   23:24     0:00.00 [busdma]
root    6    0.0  0.0     0   16  -  DL   23:24     0:00.03 [rand_harvestq]
root    7    0.0  0.0     0   48  -  DL   23:24     0:00.06 [pagedaemon]
root    8    0.0  0.0     0   16  -  DL   23:24     0:00.00 [vmdaemon]
root    9    0.0  0.0     0  160  -  DL   23:24     0:00.38 [bufdaemon]
root   10    0.0  0.0     0   16  -  DL   23:24     0:00.00 [audit]
root   12    0.0  0.0     0  880  -  WL   23:24     0:11.81 [intr]
root   13    0.0  0.0     0   48  -  DL   23:24     0:00.04 [geom]
root   14    0.0  0.0     0   16  -  DL   23:24     0:00.00 [sequencer 00]
root   15    0.0  0.0     0  160  -  DL   23:24     0:06.42 [usb]
root   16    0.0  0.0     0   16  -  DL   23:24     0:00.10 [acpi_thermal]
root   17    0.0  0.0     0   16  -  DL   23:24     0:00.00 [acpi_cooling0]
root   18    0.0  0.0     0   16  -  DL   23:24     0:00.04 [syncer]
root   19    0.0  0.0     0   16  -  DL   23:24     0:00.00 [vnlru]
root  671    0.0  0.0 13260 2600  -  Is   23:25     0:00.00 dhclient: system.syslog (dhclient)
root  674    0.0  0.0 13260 2752  -  Is   23:25     0:00.00 dhclient: dpni0 [priv] (dhclient)
root  761    0.0  0.0 14572 3972  -  Ss   23:25     0:00.02 /sbin/devd
root  964    0.0  0.0 12832 2764  -  Is   23:25     0:00.02 /usr/sbin/syslogd -s
root 1033    0.0  0.0 13012 2604  -  Ss   23:25     0:00.01 /usr/sbin/cron -s
root 1058    0.0  0.0 21052 8308  -  Is   23:25     0:00.01 sshd: /usr/sbin/sshd [listener] 0 of 10-100 startups (sshd)
root 1078    0.0  0.0 21288 9304  -  Is   23:26     0:00.09 sshd: root@pts/0 (sshd)
root 1175    0.0  0.0 21288 9496  -  Is   23:37     0:00.04 sshd: root@pts/1 (sshd)
root 1074    0.0  0.0 13380 3008 u0  Is   23:25     0:00.01 login [pam] (login)
root 1075    0.0  0.0 13460 3292 u0  S    23:25     0:00.02 -sh (sh)
root 1233    0.0  0.0 13588 3016 u0  R+   00:00     0:00.00 ps -xu
root 1081    0.0  0.0 13460 3328  0  Is   23:26     0:00.02 -sh (sh)
root 1170    0.0  0.0  5788 2884  0  I    23:36     0:00.02 /bin/sh -i
root 1172    0.0  0.0 10408 7192  0  I+   23:37     0:00.30 kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin
root 1178    0.0  0.0 13460 3320  1  Is+  23:38     0:00.01 -sh (sh)

1174 is stuck, even if one waits for 30min+. kill and kill -9 will not kill 1174.

"shutdown -r now" hangs before the reboot happens and reports: "some processes would not die".

An interesting property is that ps and top disagree about 1174 CPU usage: ps 100%, top 0%. But top also indicates 1174 always has CPU0 "STATE". (Across tests CPUn varies but within a test it has a fixed n.)

I have also seen ps "STAT" being RXs.

The following is from my earlier activity with my own builds involved, here 1119, not the 1174 from above. truss reports as the last thing for the stuck process as "getpid()".

. . .
1119: 0.588983953 fstatat(AT_FDCWD,"/usr/tests/sys/kern/kern_copyin",{ mode=-r-xr-xr-x ,inode=111756,size=9776,blksize=10240 },AT_SYMLINK_NOFOLLOW) = 0 (0x0)
1119: 0.589065030 mmap(0x0,20480,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON|MAP_ALIGNED(12),-1,0x0) = 1074188288 (0x4006d000)
1119: 0.589227544 openat(AT_FDCWD,"/tmp/kyua.aBQv6E/2/result.atf",O_WRONLY|O_CREAT|O_TRUNC,0644) = 3 (0x3)
1119: 0.589276503 getpid() = 1119 (0x45f)

For reference, from inside an armv7 chroot session before doing such a test:

# uname -apKU
FreeBSD generic 14.0-CURRENT FreeBSD 14.0-CURRENT #0 main-n263748-b95d2237af40: Thu Jun 22 11:10:50 UTC 2023 r...@releng1.nyi.freebsd.org:/usr/obj/usr/src/arm64.aarch64/sys/GENERIC arm armv7 1400090 1400090

===
Mark Millard
marklmi at yahoo.com
Re: TP-LINK USB no carrier after speed test
On 1/18/23 12:45, Gary Jennejohn wrote: It's not clear from the content of README.md whether Hans has added thunderbolt to the files under /sys/conf. Not much has changed there so far, apart from regularly rebasing the repository on top of FreeBSD-main. I currently have my hands full! --HPS
Re: TP-LINK USB no carrier after speed test
On Tue, 17 Jan 2023 19:58:54 -0300 (-03) Ivan Quitschal wrote: > On Tue, 27 Sep 2022, Hans Petter Selasky wrote: > > > > > FYI: There is some experimental thunderbolt support at: > > > > https://github.com/hselasky/usb4 > > > > But I'm not sure if it supports the hardware you've got. > > > > --HPS > > > > > Hi Hans > > i just told you early today that the problem was solved by sticking it into > USB > 2.0, well i was wrong. problem came back just like before > > I see Alexander also has the same XHCI that i have here > > xhci0@pci0:0:20:0: class=0x0c0330 rev=0x20 hdr=0x00 vendor=0x8086 > device=0xa0ed > subvendor=0x1028 subdevice=0x0ab0 > vendor = 'Intel Corporation' > device = 'Tiger Lake-LP USB 3.2 Gen 2x1 xHCI Host Controller' > class = serial bus > subclass = USB > > > maybe this tiger lake support is the problem? > > > I have checked your git repository above, how could i test it here ? what > dirs am i supposed to copy to my /usr/src ? > > thank you > That information is in the README.md: This implements a basic kernel driver and userland tool for USB4 and Thunderbolt3. The relevant code is in the following locations: sys/dev/thunderbolt sys/modules/thunderbolt usr.sbin/tbtconfig So, you need the contents of those directories. You'll have to build a module under sys/modules/thunderbolt, which should result in a tb.ko file, which will have to be loaded using kldload. You also have to go into /usr/src/usr.sbin/tbtconfig and build that binary. There's a manpage there. It's not clear from the content of README.md whether Hans has added thunderbolt to the files under /sys/conf. -- Gary Jennejohn
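The copy-and-build steps described above could look roughly like the following. It is only a sketch that prints the commands rather than running them. USB4_SRC is a hypothetical location for a local checkout of the repository, and the make and kldload invocations are the usual ones for building a module outside a full kernel build; they are assumptions on my part, not something taken from README.md.

```shell
#!/bin/sh
# Sketch: print (not execute) the steps for trying the usb4 repository.
# USB4_SRC is a hypothetical path to a local checkout; SRC is the FreeBSD
# source tree. The make/kldload invocations are assumed, not from README.md.

USB4_SRC=${USB4_SRC:-$HOME/usb4}
SRC=${SRC:-/usr/src}

thunderbolt_plan() {
    # Copy each of the three directories named in README.md into the
    # matching parent directory of the source tree.
    for d in sys/dev/thunderbolt sys/modules/thunderbolt usr.sbin/tbtconfig; do
        echo "cp -R $USB4_SRC/$d $SRC/${d%/*}/"
    done
    # Build and install the module, load it, then build the userland tool.
    echo "make -C $SRC/sys/modules/thunderbolt all install"
    echo "kldload tb"
    echo "make -C $SRC/usr.sbin/tbtconfig all install"
}

thunderbolt_plan
```

If the module is not installed, kldload would need the full path to the built tb.ko instead of the bare name.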
Re: RES: TP-LINK USB no carrier after speed test
On Tue, 27 Sep 2022, Hans Petter Selasky wrote:

FYI: There is some experimental thunderbolt support at:

https://github.com/hselasky/usb4

But I'm not sure if it supports the hardware you've got.

--HPS

Hi Hans

i just told you early today that the problem was solved by sticking it into USB 2.0, well i was wrong. problem came back just like before

I see Alexander also has the same XHCI that i have here

xhci0@pci0:0:20:0: class=0x0c0330 rev=0x20 hdr=0x00 vendor=0x8086 device=0xa0ed subvendor=0x1028 subdevice=0x0ab0
    vendor     = 'Intel Corporation'
    device     = 'Tiger Lake-LP USB 3.2 Gen 2x1 xHCI Host Controller'
    class      = serial bus
    subclass   = USB

maybe this tiger lake support is the problem?

I have checked your git repository above, how could i test it here ? what dirs am i supposed to copy to my /usr/src ?

thank you

--tzk
Re: RES: TP-LINK USB no carrier after speed test
On 1/17/23 14:13, Ivan Quitschal wrote:

not THAT fine of course, since it's limited to around 300mbps. when in USB 3 it reaches 600mbps just fine. but besides that limitation from version 2.0, it really works. i've tried a whole day of heavy traffic here and nothing happened at all.

rings any bells ?

Yes, I see that too:

ugen0.3: at usbus0, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=ON (248mA)

Works like a charm at spd=HIGH, but probably not at super-speed. Maybe the vendor does something different when the speed is super-speed so that the BULK transport can move more data at a time ...

Vendor documentation is wanted! Maybe you simply need to USB trace the protocol when super-speed is used and vendor drivers are in place.

Right now there is no option to disable super-speed only, but maybe try to run this command on all USB 3.x root HUBs:

usbconfig -d X.Y set_config 255

Maybe the device will then show up as high-speed on the other computer as well. It's worth a try.

--HPS
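To check which speed each device actually negotiated, the spd= field of the usbconfig listing can be pulled out with a small filter. The sketch below is only an illustration: the function name `usb_speeds` is made up, and it assumes listing lines of the form shown above (ugenX.Y: ... spd=NAME (rate) ...).

```sh
# usb_speeds: read a usbconfig device listing on stdin and print
# "ugenX.Y SPEED" for every line that carries an spd= field.
# Assumption: lines look like the ugen0.3 example quoted above.
usb_speeds() {
    sed -n 's/^\(ugen[0-9][0-9]*\.[0-9][0-9]*\):.* spd=\([A-Z]*\) .*/\1 \2/p'
}

# Example: usbconfig | usb_speeds
```

Running it before and after the set_config experiment shows at a glance whether the dongle moved from SUPER to HIGH.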
Re: RES: TP-LINK USB no carrier after speed test
On Wed, 28 Sep 2022, Ivan Quitschal wrote:

FYI: There is some experimental thunderbolt support at:

https://github.com/hselasky/usb4

But I'm not sure if it supports the hardware you've got.

--HPS

Hi Hans

i got two log versions for you, one with the constant set to 2048 (the working version), and the other with no patches whatsoever (the bad version)

since the entire logs reached more than 150M in size, i had to cut them to the last 1000 lines, hope that's enough

please find attached the two files

for xhci_NOT_working, i stopped recording right after the problem occurred

please let me know if you need something else

thank you

--tzk

I need the full log. The XHCI driver is very verbose you see. Maybe you can do some filtering, like figuring out all the status codes you see:

status=1
status=13

and so on.

--HPS

Hi Hans,

sorry for the delay. i wasn't able to send the full logs because they got too big :(

but i have something, not sure if that helps. it happens that i've moved my notebook to another place and now i'm using the ethernet adapter in the USB 2.0 port instead of USB 3 (where the problem used to happen), and now it works fine.

not THAT fine of course, since it's limited to around 300mbps. when in USB 3 it reaches 600mbps just fine. but besides that limitation from version 2.0, it really works. i've tried a whole day of heavy traffic here and nothing happened at all.

rings any bells ?

thanks

Ivan

PS: (if you still want that log, let me know some place where i could upload it, i don't know)
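The filtering Hans asks for can be done without shipping the full 150M log. A minimal sketch, assuming the debug log contains tokens of the form status=<decimal> as in the two examples above; the function name `count_status_codes` and the log file name in the usage comment are hypothetical.

```sh
# count_status_codes: read an XHCI debug log on stdin and print each
# distinct "status=N" token followed by the number of times it occurs,
# ordered by N.
# Assumption: status codes appear as status=<decimal>; adjust the
# pattern if the driver prints them differently.
count_status_codes() {
    grep -o 'status=[0-9][0-9]*' |
        sort -t= -k2,2n |
        uniq -c |
        awk '{ print $2, $1 }'
}

# Example: count_status_codes < xhci_NOT_working.log
```

The resulting summary is a few lines long and can go straight into a mail, while the full log stays available for upload if it is still needed.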
Re: RES: TP-LINK USB no carrier after speed test
On Wed, 28 Sep 2022, Hans Petter Selasky wrote: On 9/28/22 11:07, Ivan Quitschal wrote: On Tue, 27 Sep 2022, Hans Petter Selasky wrote: On 9/27/22 15:22, Hans Petter Selasky wrote: On 9/27/22 14:17, Ivan Quitschal wrote: On Tue, 27 Sep 2022, Hans Petter Selasky wrote: On 9/27/22 02:24, Alexander Motin wrote: On 26.09.2022 17:29, Hans Petter Selasky wrote: I've got a supposedly "broken" if_ure dongle from Alexander, but I'm unable to reproduce the if_ure hang on two different pieces of XHCI hardware, Intel based and AMD based, which I've got. This leads me to believe there is a bug in the XHCI driver or hardware on your system. Can you share the pciconfig -lv output for your XHCI controllers? I have two laptops of different generations reproducing this problem, but both are having Thunderbolt on the USB-C ports: This is one (7th Gen Core i7): xhci1@pci0:56:0:0: class=0x0c0330 rev=0x02 hdr=0x00 vendor=0x8086 device=0x15d4 subvendor=0x subdevice=0x vendor = 'Intel Corporation' device = 'JHL6540 Thunderbolt 3 USB Controller (C step) [Alpine Ridge 4C 2016]' class = serial bus subclass = USB bar [10] = type Memory, range 32, base 0xc3f0, size 65536, enabled cap 01[80] = powerspec 3 supports D0 D1 D2 D3 current D0 cap 05[88] = MSI supports 8 messages, 64 bit enabled with 1 message cap 10[c0] = PCI-Express 2 endpoint max data 128(128) RO NS max read 512 link x4(x4) speed 2.5(2.5) ASPM disabled(L0s/L1) ClockPM disabled ecap 0003[100] = Serial 1 20ff910876f10c00 ecap 0001[200] = AER 1 0 fatal 0 non-fatal 1 corrected ecap 0002[300] = VC 1 max VC0 ecap 0004[400] = Power Budgeting 1 ecap 000b[500] = Vendor [1] ID 1234 Rev 1 Length 216 ecap 0018[600] = LTR 1 ecap 0019[700] = PCIe Sec 1 lane errors 0 This is another (11th Gen Core i7); xhci0@pci0:0:13:0: class=0x0c0330 rev=0x01 hdr=0x00 vendor=0x8086 device=0x9a13 subvendor=0x1028 subdevice=0x0991 vendor = 'Intel Corporation' device = 'Tiger Lake-LP Thunderbolt 4 USB Controller' class = serial bus subclass = USB bar [10] = type 
Memory, range 64, base 0x60552c, size 65536, enabled cap 01[70] = powerspec 2 supports D0 D3 current D0 cap 05[80] = MSI supports 8 messages, 64 bit enabled with 1 message cap 09[90] = vendor (length 20) Intel cap 15 version 0 cap 09[b0] = vendor (length 0) Intel cap 0 version 1 Does the system you also has Thunderbolt chip, or you use native Intel chipet's XHCI? Also, when running the stress test and you see the traffic stops, what happens if you run this command as root on the ugen which the if_ure belongs to: usbconfig -d ugenX.Y dump_string 0 Does the traffic resume? Nope. Out of 4 times when traffic stopped 2 times it reported error> and 2 times it completed successfully, but it neither case it recovered traffic. Only reset recovered it. Hi Alexander, Could you run "usbdump -d X.Y" at the same time to capture all the errors? Looking especially for USB_ERR_TIMEOUT . I have this: xhci0@pci0:3:0:3: class=0x0c0330 rev=0x00 hdr=0x00 vendor=0x1022 device=0x15e0 subvendor=0x1849 subdevice=0x vendor = 'Advanced Micro Devices, Inc. 
[AMD]' device = 'Raven USB 3.1' class = serial bus subclass = USB xhci0@pci0:0:20:0: class=0x0c0330 rev=0x21 hdr=0x00 vendor=0x8086 device=0x9d2f subvendor=0x8086 subdevice=0x9d2f vendor = 'Intel Corporation' device = 'Sunrise Point-LP USB 3.0 xHCI Controller' class = serial bus subclass = USB --HPS hi Hans i think i got some good logs for you before the problem i ran this: ugen0.10: at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (72mA) # usbconfig -d ugen0.10 >> before # usbconfig -d ugen0.10 dump_all_desc >> before # usbconfig -d ugen0.10 dump_stats >> before_status the after the problem happened i ran # usbconfig -d ugen0.10 >> after # usbconfig -d ugen0.10 dump_all_desc >> after # usbconfig -d ugen0.10 dump_stats >> after_status just by looking i already see some problems comparing both for example before the problem we have: -- ugen0.10: at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (72mA) ugen0.10: at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (72mA) bLength = 0x0012 bDescriptorType = 0x0001 bcdUSB = 0x0300 bDeviceClass = 0x bDeviceSubClass = 0x bDeviceProtocol = 0x bMaxPacketSize0 = 0x0009 idVendor = 0x2357 idProduct = 0x0601 bcdDevice = 0x3000 iManufacturer = 0x0001 iProduct = 0x0002 iSerialNumber = 0x0006 <01> bNumConfigurations = 0x0002
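The before/after capture quoted above can be wrapped in a small helper so that both snapshots are taken the same way. This is a sketch under stated assumptions: the function names and the TAG.desc / TAG.stats file names are invented here, and only the usbconfig subcommands already used in the thread (dump_all_desc, dump_stats) are called. It writes with > rather than >> so that a rerun does not append to stale data.

```sh
# snapshot_usb DEV TAG
# Save the device listing plus descriptors to TAG.desc and the transfer
# statistics to TAG.stats (file names are this sketch's own choice).
snapshot_usb() {
    dev=$1
    tag=$2
    {
        usbconfig -d "$dev"
        usbconfig -d "$dev" dump_all_desc
    } > "$tag.desc"
    usbconfig -d "$dev" dump_stats > "$tag.stats"
}

# compare_snapshots BEFORE AFTER
# Show what changed between two snapshots taken with snapshot_usb.
compare_snapshots() {
    diff -u "$1.desc" "$2.desc"
    diff -u "$1.stats" "$2.stats"
}

# Example:
#   snapshot_usb ugen0.10 before
#   ... run the speed test until the link drops ...
#   snapshot_usb ugen0.10 after
#   compare_snapshots before after
```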
Re: RES: TP-LINK USB no carrier after speed test
On 9/28/22 11:47, Tomoaki AOKI wrote:

As I stated on Bug 237666 [1], I have Titan Ridge TB3 bridge on my ThinkPad P52. The relevant part of HW probe is at comment 206 [2].

Are there any other info I can provide for Titan Ridge support? (Not yet tried the codes.)

[1] https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=237666
[2] https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=237666#c206

I cannot promise anything, and I don't have an overview of which TB3 controllers are compatible with each other. Maybe grepping for the PCI IDs in Linux will give some clues, since I don't have access to any Thunderbolt documentation myself!

--HPS
Re: RES: TP-LINK USB no carrier after speed test
On Tue, 27 Sep 2022 20:17:54 +0200 Hans Petter Selasky wrote:

> On 9/27/22 15:22, Hans Petter Selasky wrote:

(snip)

> FYI: There is some experimental thunderbolt support at:
>
> https://github.com/hselasky/usb4
>
> But I'm not sure if it supports the hardware you've got.
>
> --HPS

Thanks for the hard work and info.

As I stated on Bug 237666 [1], I have a Titan Ridge TB3 bridge on my ThinkPad P52. The relevant part of the HW probe is at comment 206 [2].

Is there any other info I can provide for Titan Ridge support? (I have not tried the code yet.)

[1] https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=237666
[2] https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=237666#c206

--
Tomoaki AOKI
Re: RES: TP-LINK USB no carrier after speed test
On 9/28/22 11:07, Ivan Quitschal wrote: On Tue, 27 Sep 2022, Hans Petter Selasky wrote: On 9/27/22 15:22, Hans Petter Selasky wrote: On 9/27/22 14:17, Ivan Quitschal wrote: On Tue, 27 Sep 2022, Hans Petter Selasky wrote: On 9/27/22 02:24, Alexander Motin wrote: On 26.09.2022 17:29, Hans Petter Selasky wrote: I've got a supposedly "broken" if_ure dongle from Alexander, but I'm unable to reproduce the if_ure hang on two different pieces of XHCI hardware, Intel based and AMD based, which I've got. This leads me to believe there is a bug in the XHCI driver or hardware on your system. Can you share the pciconfig -lv output for your XHCI controllers? I have two laptops of different generations reproducing this problem, but both are having Thunderbolt on the USB-C ports: This is one (7th Gen Core i7): xhci1@pci0:56:0:0: class=0x0c0330 rev=0x02 hdr=0x00 vendor=0x8086 device=0x15d4 subvendor=0x subdevice=0x vendor = 'Intel Corporation' device = 'JHL6540 Thunderbolt 3 USB Controller (C step) [Alpine Ridge 4C 2016]' class = serial bus subclass = USB bar [10] = type Memory, range 32, base 0xc3f0, size 65536, enabled cap 01[80] = powerspec 3 supports D0 D1 D2 D3 current D0 cap 05[88] = MSI supports 8 messages, 64 bit enabled with 1 message cap 10[c0] = PCI-Express 2 endpoint max data 128(128) RO NS max read 512 link x4(x4) speed 2.5(2.5) ASPM disabled(L0s/L1) ClockPM disabled ecap 0003[100] = Serial 1 20ff910876f10c00 ecap 0001[200] = AER 1 0 fatal 0 non-fatal 1 corrected ecap 0002[300] = VC 1 max VC0 ecap 0004[400] = Power Budgeting 1 ecap 000b[500] = Vendor [1] ID 1234 Rev 1 Length 216 ecap 0018[600] = LTR 1 ecap 0019[700] = PCIe Sec 1 lane errors 0 This is another (11th Gen Core i7); xhci0@pci0:0:13:0: class=0x0c0330 rev=0x01 hdr=0x00 vendor=0x8086 device=0x9a13 subvendor=0x1028 subdevice=0x0991 vendor = 'Intel Corporation' device = 'Tiger Lake-LP Thunderbolt 4 USB Controller' class = serial bus subclass = USB bar [10] = type Memory, range 64, base 0x60552c, size 65536, 
enabled cap 01[70] = powerspec 2 supports D0 D3 current D0 cap 05[80] = MSI supports 8 messages, 64 bit enabled with 1 message cap 09[90] = vendor (length 20) Intel cap 15 version 0 cap 09[b0] = vendor (length 0) Intel cap 0 version 1 Does the system you also has Thunderbolt chip, or you use native Intel chipet's XHCI? Also, when running the stress test and you see the traffic stops, what happens if you run this command as root on the ugen which the if_ure belongs to: usbconfig -d ugenX.Y dump_string 0 Does the traffic resume? Nope. Out of 4 times when traffic stopped 2 times it reported and 2 times it completed successfully, but it neither case it recovered traffic. Only reset recovered it. Hi Alexander, Could you run "usbdump -d X.Y" at the same time to capture all the errors? Looking especially for USB_ERR_TIMEOUT . I have this: xhci0@pci0:3:0:3: class=0x0c0330 rev=0x00 hdr=0x00 vendor=0x1022 device=0x15e0 subvendor=0x1849 subdevice=0x vendor = 'Advanced Micro Devices, Inc. [AMD]' device = 'Raven USB 3.1' class = serial bus subclass = USB xhci0@pci0:0:20:0: class=0x0c0330 rev=0x21 hdr=0x00 vendor=0x8086 device=0x9d2f subvendor=0x8086 subdevice=0x9d2f vendor = 'Intel Corporation' device = 'Sunrise Point-LP USB 3.0 xHCI Controller' class = serial bus subclass = USB --HPS hi Hans i think i got some good logs for you before the problem i ran this: ugen0.10: at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (72mA) # usbconfig -d ugen0.10 >> before # usbconfig -d ugen0.10 dump_all_desc >> before # usbconfig -d ugen0.10 dump_stats >> before_status the after the problem happened i ran # usbconfig -d ugen0.10 >> after # usbconfig -d ugen0.10 dump_all_desc >> after # usbconfig -d ugen0.10 dump_stats >> after_status just by looking i already see some problems comparing both for example before the problem we have: -- ugen0.10: at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (72mA) ugen0.10: at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (72mA) bLength = 0x0012 
bDescriptorType = 0x0001 bcdUSB = 0x0300 bDeviceClass = 0x bDeviceSubClass = 0x bDeviceProtocol = 0x bMaxPacketSize0 = 0x0009 idVendor = 0x2357 idProduct = 0x0601 bcdDevice = 0x3000 iManufacturer = 0x0001 iProduct = 0x0002 iSerialNumber = 0x0006 <01> bNumConfigurations = 0x0002 after the problem -- u
RES: RES: TP-LINK USB no carrier after speed test
> > FYI: There is some experimental thunderbolt support at: > > https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.c > om%2Fhselasky%2Fusb4&data=05%7C01%7C%7C14c86eee9f5d492c41d50 > 8daa0b49bdb%7C84df9e7fe9f640afb435%7C1%7C0%7C6379989 > 94857157968%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjo > iV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata > =%2FOnIO3esoAmi1FSPkHRYpHCHkcN6U2rO9WhaimdaVbk%3D&reserved > =0 > > But I'm not sure if it supports the hardware you've got. > > --HPS Hi Hans Should i wait to apply this thunderbolt business just yet , at least you analize the log I sent you in the last email or should I go for it ? Thanks --tzk
Re: RES: TP-LINK USB no carrier after speed test
On 9/27/22 15:22, Hans Petter Selasky wrote: On 9/27/22 14:17, Ivan Quitschal wrote: On Tue, 27 Sep 2022, Hans Petter Selasky wrote: On 9/27/22 02:24, Alexander Motin wrote: On 26.09.2022 17:29, Hans Petter Selasky wrote: I've got a supposedly "broken" if_ure dongle from Alexander, but I'm unable to reproduce the if_ure hang on two different pieces of XHCI hardware, Intel based and AMD based, which I've got. This leads me to believe there is a bug in the XHCI driver or hardware on your system. Can you share the pciconfig -lv output for your XHCI controllers? I have two laptops of different generations reproducing this problem, but both are having Thunderbolt on the USB-C ports: This is one (7th Gen Core i7): xhci1@pci0:56:0:0: class=0x0c0330 rev=0x02 hdr=0x00 vendor=0x8086 device=0x15d4 subvendor=0x subdevice=0x vendor = 'Intel Corporation' device = 'JHL6540 Thunderbolt 3 USB Controller (C step) [Alpine Ridge 4C 2016]' class = serial bus subclass = USB bar [10] = type Memory, range 32, base 0xc3f0, size 65536, enabled cap 01[80] = powerspec 3 supports D0 D1 D2 D3 current D0 cap 05[88] = MSI supports 8 messages, 64 bit enabled with 1 message cap 10[c0] = PCI-Express 2 endpoint max data 128(128) RO NS max read 512 link x4(x4) speed 2.5(2.5) ASPM disabled(L0s/L1) ClockPM disabled ecap 0003[100] = Serial 1 20ff910876f10c00 ecap 0001[200] = AER 1 0 fatal 0 non-fatal 1 corrected ecap 0002[300] = VC 1 max VC0 ecap 0004[400] = Power Budgeting 1 ecap 000b[500] = Vendor [1] ID 1234 Rev 1 Length 216 ecap 0018[600] = LTR 1 ecap 0019[700] = PCIe Sec 1 lane errors 0 This is another (11th Gen Core i7); xhci0@pci0:0:13:0: class=0x0c0330 rev=0x01 hdr=0x00 vendor=0x8086 device=0x9a13 subvendor=0x1028 subdevice=0x0991 vendor = 'Intel Corporation' device = 'Tiger Lake-LP Thunderbolt 4 USB Controller' class = serial bus subclass = USB bar [10] = type Memory, range 64, base 0x60552c, size 65536, enabled cap 01[70] = powerspec 2 supports D0 D3 current D0 cap 05[80] = MSI supports 8 
messages, 64 bit enabled with 1 message cap 09[90] = vendor (length 20) Intel cap 15 version 0 cap 09[b0] = vendor (length 0) Intel cap 0 version 1 Does the system you also has Thunderbolt chip, or you use native Intel chipet's XHCI? Also, when running the stress test and you see the traffic stops, what happens if you run this command as root on the ugen which the if_ure belongs to: usbconfig -d ugenX.Y dump_string 0 Does the traffic resume? Nope. Out of 4 times when traffic stopped 2 times it reported error> and 2 times it completed successfully, but it neither case it recovered traffic. Only reset recovered it. Hi Alexander, Could you run "usbdump -d X.Y" at the same time to capture all the errors? Looking especially for USB_ERR_TIMEOUT . I have this: xhci0@pci0:3:0:3: class=0x0c0330 rev=0x00 hdr=0x00 vendor=0x1022 device=0x15e0 subvendor=0x1849 subdevice=0x vendor = 'Advanced Micro Devices, Inc. [AMD]' device = 'Raven USB 3.1' class = serial bus subclass = USB xhci0@pci0:0:20:0: class=0x0c0330 rev=0x21 hdr=0x00 vendor=0x8086 device=0x9d2f subvendor=0x8086 subdevice=0x9d2f vendor = 'Intel Corporation' device = 'Sunrise Point-LP USB 3.0 xHCI Controller' class = serial bus subclass = USB --HPS hi Hans i think i got some good logs for you before the problem i ran this: ugen0.10: at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (72mA) # usbconfig -d ugen0.10 >> before # usbconfig -d ugen0.10 dump_all_desc >> before # usbconfig -d ugen0.10 dump_stats >> before_status the after the problem happened i ran # usbconfig -d ugen0.10 >> after # usbconfig -d ugen0.10 dump_all_desc >> after # usbconfig -d ugen0.10 dump_stats >> after_status just by looking i already see some problems comparing both for example before the problem we have: -- ugen0.10: at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (72mA) ugen0.10: at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (72mA) bLength = 0x0012 bDescriptorType = 0x0001 bcdUSB = 0x0300 bDeviceClass = 0x bDeviceSubClass = 0x 
bDeviceProtocol = 0x bMaxPacketSize0 = 0x0009 idVendor = 0x2357 idProduct = 0x0601 bcdDevice = 0x3000 iManufacturer = 0x0001 iProduct = 0x0002 iSerialNumber = 0x0006 <01> bNumConfigurations = 0x0002 after the problem -- ugen0.10: at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (72mA) ugen0.10: at usbus0, cfg=0 m
Re: RES: TP-LINK USB no carrier after speed test
On 9/27/22 14:17, Ivan Quitschal wrote: On Tue, 27 Sep 2022, Hans Petter Selasky wrote: On 9/27/22 02:24, Alexander Motin wrote: On 26.09.2022 17:29, Hans Petter Selasky wrote: I've got a supposedly "broken" if_ure dongle from Alexander, but I'm unable to reproduce the if_ure hang on two different pieces of XHCI hardware, Intel based and AMD based, which I've got. This leads me to believe there is a bug in the XHCI driver or hardware on your system. Can you share the pciconfig -lv output for your XHCI controllers? I have two laptops of different generations reproducing this problem, but both are having Thunderbolt on the USB-C ports: This is one (7th Gen Core i7): xhci1@pci0:56:0:0: class=0x0c0330 rev=0x02 hdr=0x00 vendor=0x8086 device=0x15d4 subvendor=0x subdevice=0x vendor = 'Intel Corporation' device = 'JHL6540 Thunderbolt 3 USB Controller (C step) [Alpine Ridge 4C 2016]' class = serial bus subclass = USB bar [10] = type Memory, range 32, base 0xc3f0, size 65536, enabled cap 01[80] = powerspec 3 supports D0 D1 D2 D3 current D0 cap 05[88] = MSI supports 8 messages, 64 bit enabled with 1 message cap 10[c0] = PCI-Express 2 endpoint max data 128(128) RO NS max read 512 link x4(x4) speed 2.5(2.5) ASPM disabled(L0s/L1) ClockPM disabled ecap 0003[100] = Serial 1 20ff910876f10c00 ecap 0001[200] = AER 1 0 fatal 0 non-fatal 1 corrected ecap 0002[300] = VC 1 max VC0 ecap 0004[400] = Power Budgeting 1 ecap 000b[500] = Vendor [1] ID 1234 Rev 1 Length 216 ecap 0018[600] = LTR 1 ecap 0019[700] = PCIe Sec 1 lane errors 0 This is another (11th Gen Core i7); xhci0@pci0:0:13:0: class=0x0c0330 rev=0x01 hdr=0x00 vendor=0x8086 device=0x9a13 subvendor=0x1028 subdevice=0x0991 vendor = 'Intel Corporation' device = 'Tiger Lake-LP Thunderbolt 4 USB Controller' class = serial bus subclass = USB bar [10] = type Memory, range 64, base 0x60552c, size 65536, enabled cap 01[70] = powerspec 2 supports D0 D3 current D0 cap 05[80] = MSI supports 8 messages, 64 bit enabled with 1 message cap 09[90] 
= vendor (length 20) Intel cap 15 version 0 cap 09[b0] = vendor (length 0) Intel cap 0 version 1 Does the system you also has Thunderbolt chip, or you use native Intel chipet's XHCI? Also, when running the stress test and you see the traffic stops, what happens if you run this command as root on the ugen which the if_ure belongs to: usbconfig -d ugenX.Y dump_string 0 Does the traffic resume? Nope. Out of 4 times when traffic stopped 2 times it reported error> and 2 times it completed successfully, but it neither case it recovered traffic. Only reset recovered it. Hi Alexander, Could you run "usbdump -d X.Y" at the same time to capture all the errors? Looking especially for USB_ERR_TIMEOUT . I have this: xhci0@pci0:3:0:3: class=0x0c0330 rev=0x00 hdr=0x00 vendor=0x1022 device=0x15e0 subvendor=0x1849 subdevice=0x vendor = 'Advanced Micro Devices, Inc. [AMD]' device = 'Raven USB 3.1' class = serial bus subclass = USB xhci0@pci0:0:20:0: class=0x0c0330 rev=0x21 hdr=0x00 vendor=0x8086 device=0x9d2f subvendor=0x8086 subdevice=0x9d2f vendor = 'Intel Corporation' device = 'Sunrise Point-LP USB 3.0 xHCI Controller' class = serial bus subclass = USB --HPS hi Hans i think i got some good logs for you before the problem i ran this: ugen0.10: at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (72mA) # usbconfig -d ugen0.10 >> before # usbconfig -d ugen0.10 dump_all_desc >> before # usbconfig -d ugen0.10 dump_stats >> before_status the after the problem happened i ran # usbconfig -d ugen0.10 >> after # usbconfig -d ugen0.10 dump_all_desc >> after # usbconfig -d ugen0.10 dump_stats >> after_status just by looking i already see some problems comparing both for example before the problem we have: -- ugen0.10: at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (72mA) ugen0.10: at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (72mA) bLength = 0x0012 bDescriptorType = 0x0001 bcdUSB = 0x0300 bDeviceClass = 0x bDeviceSubClass = 0x bDeviceProtocol = 0x bMaxPacketSize0 = 0x0009 
idVendor = 0x2357 idProduct = 0x0601 bcdDevice = 0x3000 iManufacturer = 0x0001 iProduct = 0x0002 iSerialNumber = 0x0006 <01> bNumConfigurations = 0x0002 after the problem -- ugen0.10: at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (72mA) ugen0.10: at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (72mA)
Re: RES: TP-LINK USB no carrier after speed test
On Tue, 27 Sep 2022, Hans Petter Selasky wrote: On 9/27/22 02:24, Alexander Motin wrote: On 26.09.2022 17:29, Hans Petter Selasky wrote: I've got a supposedly "broken" if_ure dongle from Alexander, but I'm unable to reproduce the if_ure hang on two different pieces of XHCI hardware, Intel based and AMD based, which I've got. This leads me to believe there is a bug in the XHCI driver or hardware on your system. Can you share the pciconfig -lv output for your XHCI controllers? I have two laptops of different generations reproducing this problem, but both are having Thunderbolt on the USB-C ports: This is one (7th Gen Core i7): xhci1@pci0:56:0:0: class=0x0c0330 rev=0x02 hdr=0x00 vendor=0x8086 device=0x15d4 subvendor=0x subdevice=0x vendor = 'Intel Corporation' device = 'JHL6540 Thunderbolt 3 USB Controller (C step) [Alpine Ridge 4C 2016]' class = serial bus subclass = USB bar [10] = type Memory, range 32, base 0xc3f0, size 65536, enabled cap 01[80] = powerspec 3 supports D0 D1 D2 D3 current D0 cap 05[88] = MSI supports 8 messages, 64 bit enabled with 1 message cap 10[c0] = PCI-Express 2 endpoint max data 128(128) RO NS max read 512 link x4(x4) speed 2.5(2.5) ASPM disabled(L0s/L1) ClockPM disabled ecap 0003[100] = Serial 1 20ff910876f10c00 ecap 0001[200] = AER 1 0 fatal 0 non-fatal 1 corrected ecap 0002[300] = VC 1 max VC0 ecap 0004[400] = Power Budgeting 1 ecap 000b[500] = Vendor [1] ID 1234 Rev 1 Length 216 ecap 0018[600] = LTR 1 ecap 0019[700] = PCIe Sec 1 lane errors 0 This is another (11th Gen Core i7); xhci0@pci0:0:13:0: class=0x0c0330 rev=0x01 hdr=0x00 vendor=0x8086 device=0x9a13 subvendor=0x1028 subdevice=0x0991 vendor = 'Intel Corporation' device = 'Tiger Lake-LP Thunderbolt 4 USB Controller' class = serial bus subclass = USB bar [10] = type Memory, range 64, base 0x60552c, size 65536, enabled cap 01[70] = powerspec 2 supports D0 D3 current D0 cap 05[80] = MSI supports 8 messages, 64 bit enabled with 1 message cap 09[90] = vendor (length 20) Intel cap 15 
version 0 cap 09[b0] = vendor (length 0) Intel cap 0 version 1 Does the system you also has Thunderbolt chip, or you use native Intel chipet's XHCI? Also, when running the stress test and you see the traffic stops, what happens if you run this command as root on the ugen which the if_ure belongs to: usbconfig -d ugenX.Y dump_string 0 Does the traffic resume? Nope. Out of 4 times when traffic stopped 2 times it reported and 2 times it completed successfully, but it neither case it recovered traffic. Only reset recovered it. Hi Alexander, Could you run "usbdump -d X.Y" at the same time to capture all the errors? Looking especially for USB_ERR_TIMEOUT . I have this: xhci0@pci0:3:0:3: class=0x0c0330 rev=0x00 hdr=0x00 vendor=0x1022 device=0x15e0 subvendor=0x1849 subdevice=0x vendor = 'Advanced Micro Devices, Inc. [AMD]' device = 'Raven USB 3.1' class = serial bus subclass = USB xhci0@pci0:0:20:0: class=0x0c0330 rev=0x21 hdr=0x00 vendor=0x8086 device=0x9d2f subvendor=0x8086 subdevice=0x9d2f vendor = 'Intel Corporation' device = 'Sunrise Point-LP USB 3.0 xHCI Controller' class = serial bus subclass = USB --HPS hi Hans i think i got some good logs for you before the problem i ran this: ugen0.10: at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (72mA) # usbconfig -d ugen0.10 >> before # usbconfig -d ugen0.10 dump_all_desc >> before # usbconfig -d ugen0.10 dump_stats >> before_status the after the problem happened i ran # usbconfig -d ugen0.10 >> after # usbconfig -d ugen0.10 dump_all_desc >> after # usbconfig -d ugen0.10 dump_stats >> after_status just by looking i already see some problems comparing both for example before the problem we have: -- ugen0.10: at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (72mA) ugen0.10: at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (72mA) bLength = 0x0012 bDescriptorType = 0x0001 bcdUSB = 0x0300 bDeviceClass = 0x bDeviceSubClass = 0x bDeviceProtocol = 0x bMaxPacketSize0 = 0x0009 idVendor = 0x2357 idProduct = 0x0601 bcdDevice = 
0x3000 iManufacturer = 0x0001 iProduct = 0x0002 iSerialNumber = 0x0006 <01> bNumConfigurations = 0x0002 after the problem -- ugen0.10: at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (72mA) ugen0.10: at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (72mA) bLength = 0x0012 bDescriptorType = 0x0001 bcdUSB = 0x0300 bDev
Re: RES: TP-LINK USB no carrier after speed test
On 9/27/22 02:24, Alexander Motin wrote:

On 26.09.2022 17:29, Hans Petter Selasky wrote:

I've got a supposedly "broken" if_ure dongle from Alexander, but I'm unable to reproduce the if_ure hang on two different pieces of XHCI hardware, Intel based and AMD based, which I've got.

This leads me to believe there is a bug in the XHCI driver or hardware on your system.

Can you share the pciconf -lv output for your XHCI controllers?

I have two laptops of different generations reproducing this problem, but both have Thunderbolt on the USB-C ports:

This is one (7th Gen Core i7):

xhci1@pci0:56:0:0: class=0x0c0330 rev=0x02 hdr=0x00 vendor=0x8086 device=0x15d4 subvendor=0x subdevice=0x
    vendor     = 'Intel Corporation'
    device     = 'JHL6540 Thunderbolt 3 USB Controller (C step) [Alpine Ridge 4C 2016]'
    class      = serial bus
    subclass   = USB
    bar   [10] = type Memory, range 32, base 0xc3f0, size 65536, enabled
    cap 01[80] = powerspec 3 supports D0 D1 D2 D3 current D0
    cap 05[88] = MSI supports 8 messages, 64 bit enabled with 1 message
    cap 10[c0] = PCI-Express 2 endpoint max data 128(128) RO NS
                 max read 512
                 link x4(x4) speed 2.5(2.5) ASPM disabled(L0s/L1) ClockPM disabled
    ecap 0003[100] = Serial 1 20ff910876f10c00
    ecap 0001[200] = AER 1 0 fatal 0 non-fatal 1 corrected
    ecap 0002[300] = VC 1 max VC0
    ecap 0004[400] = Power Budgeting 1
    ecap 000b[500] = Vendor [1] ID 1234 Rev 1 Length 216
    ecap 0018[600] = LTR 1
    ecap 0019[700] = PCIe Sec 1 lane errors 0

This is another (11th Gen Core i7):

xhci0@pci0:0:13:0: class=0x0c0330 rev=0x01 hdr=0x00 vendor=0x8086 device=0x9a13 subvendor=0x1028 subdevice=0x0991
    vendor     = 'Intel Corporation'
    device     = 'Tiger Lake-LP Thunderbolt 4 USB Controller'
    class      = serial bus
    subclass   = USB
    bar   [10] = type Memory, range 64, base 0x60552c, size 65536, enabled
    cap 01[70] = powerspec 2 supports D0 D3 current D0
    cap 05[80] = MSI supports 8 messages, 64 bit enabled with 1 message
    cap 09[90] = vendor (length 20) Intel cap 15 version 0
    cap 09[b0] = vendor (length 0) Intel cap 0 version 1

Does your system also have a Thunderbolt chip, or do you use the native Intel chipset's XHCI?

Also, when running the stress test and you see the traffic stops, what happens if you run this command as root on the ugen which the if_ure belongs to:

usbconfig -d ugenX.Y dump_string 0

Does the traffic resume?

Nope. Out of 4 times when traffic stopped, 2 times it reported an error and 2 times it completed successfully, but in neither case did it recover traffic. Only reset recovered it.

Hi Alexander,

Could you run "usbdump -d X.Y" at the same time to capture all the errors? Looking especially for USB_ERR_TIMEOUT.

I have this:

xhci0@pci0:3:0:3: class=0x0c0330 rev=0x00 hdr=0x00 vendor=0x1022 device=0x15e0 subvendor=0x1849 subdevice=0x
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    device     = 'Raven USB 3.1'
    class      = serial bus
    subclass   = USB

xhci0@pci0:0:20:0: class=0x0c0330 rev=0x21 hdr=0x00 vendor=0x8086 device=0x9d2f subvendor=0x8086 subdevice=0x9d2f
    vendor     = 'Intel Corporation'
    device     = 'Sunrise Point-LP USB 3.0 xHCI Controller'
    class      = serial bus
    subclass   = USB

--HPS
Re: RES: RES: TP-LINK USB no carrier after speed test
On 9/27/22 00:25, Ivan Quitschal wrote:

Hi Hans, how do you want me to do those tests for you? With or without any of your patches? With the actual code on git?

Without any patches.

--HPS
Re: RES: TP-LINK USB no carrier after speed test
On 26.09.2022 17:29, Hans Petter Selasky wrote:
> I've got a supposedly "broken" if_ure dongle from Alexander, but I'm
> unable to reproduce the if_ure hang on two different pieces of XHCI
> hardware, Intel based and AMD based, which I've got.
>
> This leads me to believe there is a bug in the XHCI driver or hardware
> on your system.
>
> Can you share the pciconf -lv output for your XHCI controllers?

I have two laptops of different generations reproducing this problem, but both have Thunderbolt on the USB-C ports.

This is one (7th Gen Core i7):

xhci1@pci0:56:0:0: class=0x0c0330 rev=0x02 hdr=0x00 vendor=0x8086 device=0x15d4 subvendor=0x subdevice=0x
    vendor     = 'Intel Corporation'
    device     = 'JHL6540 Thunderbolt 3 USB Controller (C step) [Alpine Ridge 4C 2016]'
    class      = serial bus
    subclass   = USB
    bar   [10] = type Memory, range 32, base 0xc3f0, size 65536, enabled
    cap 01[80] = powerspec 3  supports D0 D1 D2 D3  current D0
    cap 05[88] = MSI supports 8 messages, 64 bit enabled with 1 message
    cap 10[c0] = PCI-Express 2 endpoint max data 128(128) RO NS
                 max read 512
                 link x4(x4) speed 2.5(2.5) ASPM disabled(L0s/L1) ClockPM disabled
    ecap 0003[100] = Serial 1 20ff910876f10c00
    ecap 0001[200] = AER 1 0 fatal 0 non-fatal 1 corrected
    ecap 0002[300] = VC 1 max VC0
    ecap 0004[400] = Power Budgeting 1
    ecap 000b[500] = Vendor [1] ID 1234 Rev 1 Length 216
    ecap 0018[600] = LTR 1
    ecap 0019[700] = PCIe Sec 1 lane errors 0

This is another (11th Gen Core i7):

xhci0@pci0:0:13:0: class=0x0c0330 rev=0x01 hdr=0x00 vendor=0x8086 device=0x9a13 subvendor=0x1028 subdevice=0x0991
    vendor     = 'Intel Corporation'
    device     = 'Tiger Lake-LP Thunderbolt 4 USB Controller'
    class      = serial bus
    subclass   = USB
    bar   [10] = type Memory, range 64, base 0x60552c, size 65536, enabled
    cap 01[70] = powerspec 2  supports D0 D3  current D0
    cap 05[80] = MSI supports 8 messages, 64 bit enabled with 1 message
    cap 09[90] = vendor (length 20) Intel cap 15 version 0
    cap 09[b0] = vendor (length 0) Intel cap 0 version 1

Does your system also have a Thunderbolt chip, or do you use the native Intel chipset's XHCI?

> Also, when running the stress test and you see the traffic stops, what
> happens if you run this command as root on the ugen which the if_ure
> belongs to:
>
> usbconfig -d ugenX.Y dump_string 0
>
> Does the traffic resume?

Nope. Out of the 4 times traffic stopped, 2 times it reported an error and 2 times it completed successfully, but in neither case did traffic recover. Only a reset recovered it.

-- Alexander Motin
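For anyone else asked for the same information, a small filter can cut the full pciconf -lv listing down to the XHCI controllers. This is only a sketch: the function name xhci_entries is made up here, and it assumes each device entry starts at column 0 with its driver name (xhci0, xhci1, ...) and is followed by indented detail lines.

```shell
#!/bin/sh
# Hypothetical helper: read `pciconf -lv` output on stdin and print only
# the xhci controller entries together with their indented detail lines.
xhci_entries() {
    awk '
        {
            c = substr($0, 1, 1)
            # A non-indented, non-empty line starts a new device entry;
            # remember whether that entry is an xhci controller.
            if (c != " " && c != "\t" && c != "")
                show = ($0 ~ /^xhci/)
        }
        show    # print every line that belongs to an xhci entry
    '
}

# Usage on a FreeBSD system (not run here):
#   pciconf -lv | xhci_entries
```

The filter reads standard input, so it can also be run against a saved copy of the listing.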
Re: RES: TP-LINK USB no carrier after speed test
On Mon, 26 Sep 2022, Hans Petter Selasky wrote:
> Hi,
>
> I've got a supposedly "broken" if_ure dongle from Alexander, but I'm
> unable to reproduce the if_ure hang on two different pieces of XHCI
> hardware, Intel based and AMD based, which I've got.
>
> This leads me to believe there is a bug in the XHCI driver or hardware
> on your system.
>
> Can you share the pciconf -lv output for your XHCI controllers?
>
> Also, when running the stress test and you see the traffic stops, what
> happens if you run this command as root on the ugen which the if_ure
> belongs to:
>
> usbconfig -d ugenX.Y dump_string 0
>
> Does the traffic resume?
>
> --HPS

Hi Hans,

Without any patch, using the actual code in the repository:

pciconf -lv

xhci0@pci0:0:20:0: class=0x0c0330 rev=0x20 hdr=0x00 vendor=0x8086 device=0xa0ed subvendor=0x1028 subdevice=0x0ab0
    vendor     = 'Intel Corporation'
    device     = 'Tiger Lake-LP USB 3.2 Gen 2x1 xHCI Host Controller'
    class      = serial bus
    subclass   = USB

I did the stress test, hit the problem, and then tried the command below:

[root@tzk-inspiron ~ ]# usbconfig -d ugen0.6 dump_string 0
STRING_0x00 = 0x04, 0x03, 0x09, 0x04
[root@tzk-inspiron ~ ]#

Nothing happened; still no carrier. To get the internet back I had to run:

[root@tzk-inspiron ~ ]# usbconfig -d ugen0.6 reset

--tzk
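Until the root cause is found, the manual recovery reported here (checking for "no carrier" and then resetting the USB device) could be scripted roughly as follows. This is a sketch, not a fix: the function names are hypothetical, and it assumes ifconfig reports a lost link with a line containing "status: no carrier". The interface and device names (ue0, ugen0.6) are the ones from this thread and will differ on other machines.

```shell
#!/bin/sh
# Hypothetical helper: read `ifconfig <iface>` output on stdin and print
# "down" if it reports "status: no carrier", otherwise "up".
link_state() {
    if grep -q 'status: no carrier'; then
        echo down
    else
        echo up
    fi
}

# Hypothetical helper: reset the USB device when the interface has lost
# carrier. $1 = network interface (e.g. ue0), $2 = USB device (e.g. ugen0.6)
reset_if_no_carrier() {
    if [ "$(ifconfig "$1" | link_state)" = "down" ]; then
        usbconfig -d "$2" reset
    fi
}

# Usage (as root, not run here):
#   reset_if_no_carrier ue0 ugen0.6
```

Running this from cron or a loop only hides the hang; it does nothing about the underlying driver or controller problem.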
RES: RES: TP-LINK USB no carrier after speed test
> -----Original Message-----
> From: Hans Petter Selasky
> Sent: Monday, 26 September 2022 18:29
> To: Alexander Motin; Ivan Quitschal
> Cc: freebsd-current@freebsd.org; freebsd-...@freebsd.org
> Subject: Re: RES: TP-LINK USB no carrier after speed test
>
> On 9/26/22 21:28, Alexander Motin wrote:
> > On my tests I found that reduction of URE_MAX_TX from 4 to 1 actually
> > helps a lot more without such a dramatic performance decrease. Though it
> > is likely only a workaround and does not explain the cause, so I hope
> > Hans has more ideas for us to test. ;)
>
> Hi,
>
> I've got a supposedly "broken" if_ure dongle from Alexander, but I'm
> unable to reproduce the if_ure hang on two different pieces of XHCI
> hardware, Intel based and AMD based, which I've got.
>
> This leads me to believe there is a bug in the XHCI driver or hardware
> on your system.
>
> Can you share the pciconf -lv output for your XHCI controllers?
>
> Also, when running the stress test and you see the traffic stops, what
> happens if you run this command as root on the ugen which the if_ure
> belongs to:
>
> usbconfig -d ugenX.Y dump_string 0
>
> Does the traffic resume?
>
> --HPS

Hi Hans,

How do you want me to run those tests for you? With or without any of your patches? With the actual code on git?

Hi Alexander,

I did what you suggested, and what happened was the inverse: the upload got back to 300 Mbps, and what dropped to a half was the download, which fell to 200 instead of 600, hehe.

--tzk
Re: RES: TP-LINK USB no carrier after speed test
On 9/26/22 21:28, Alexander Motin wrote:
> On my tests I found that reduction of URE_MAX_TX from 4 to 1 actually
> helps a lot more without such a dramatic performance decrease. Though it
> is likely only a workaround and does not explain the cause, so I hope
> Hans has more ideas for us to test. ;)

Hi,

I've got a supposedly "broken" if_ure dongle from Alexander, but I'm unable to reproduce the if_ure hang on two different pieces of XHCI hardware, Intel based and AMD based, which I've got.

This leads me to believe there is a bug in the XHCI driver or hardware on your system.

Can you share the pciconf -lv output for your XHCI controllers?

Also, when running the stress test and you see the traffic stops, what happens if you run this command as root on the ugen which the if_ure belongs to:

usbconfig -d ugenX.Y dump_string 0

Does the traffic resume?

--HPS
Re: RES: TP-LINK USB no carrier after speed test
Ivan,

On 26.09.2022 13:11, Ivan Quitschal wrote:
> Bad news, I'm afraid: the problem occurred at the first attempt on
> speedtest.net. I'm really trying to help you by analyzing this code
> myself, but the problem is that I'm far from an expert on network
> protocols, if it is a network problem at all. It seems to me more like
> a USB protocol limit issue or something. Just FYI, limiting that first
> constant to 2048 still limits my upload to 90 Mbps, and also still
> solves the issue, so there has to be something about it, obviously.

In my tests I found that reducing URE_MAX_TX from 4 to 1 actually helps a lot more, without such a dramatic performance decrease. Though it is likely only a workaround and does not explain the cause, so I hope Hans has more ideas for us to test. ;)

-- Alexander Motin
Re: RES: TP-LINK USB no carrier after speed test
On Mon, 26 Sep 2022, Hans Petter Selasky wrote:
> Hi Ivan,
>
> Can you revert all if_ure patches, and try this one instead.
>
> --HPS

Hi Hans,

Bad news, I'm afraid: the problem occurred at the first attempt on speedtest.net.

I'm really trying to help you by analyzing this code myself, but the problem is that I'm far from an expert on network protocols, if it is a network problem at all. It seems to me more like a USB protocol limit issue or something.

Just FYI, limiting that first constant to 2048 still limits my upload to 90 Mbps, and also still solves the issue, so there has to be something about it, obviously.

I don't remember if I told you, but the TP-Link adapter is currently plugged into a USB 3.2 port.

Anyway, is there anything I could do to help you here? Just let me know.

Thanks

--tzk
Re: RES: TP-LINK USB no carrier after speed test
Hi Ivan,

Can you revert all if_ure patches, and try this one instead.

--HPS

diff --git a/sys/dev/usb/controller/xhci.c b/sys/dev/usb/controller/xhci.c
index 045be9a40b99..09aefb02687d 100644
--- a/sys/dev/usb/controller/xhci.c
+++ b/sys/dev/usb/controller/xhci.c
@@ -2848,8 +2848,16 @@ xhci_transfer_insert(struct usb_xfer *xfer)
 
 	/* check if already inserted */
 	if (xfer->flags_int.bandwidth_reclaimed) {
-		DPRINTFN(8, "Already in schedule\n");
-		return (0);
+		DPRINTFN(8, "Already in schedule (ringin doorbell only)\n");
+
+		/*
+		 * Apparently there may be a race with multi
+		 * buffering, that the hardware doesn't see the new
+		 * chain bit value and stops the endpoint
+		 * execution. Fix this by ringing the doorbell after
+		 * each and every job that has been completed.
+		 */
+		goto ring_doorbell;
 	}
 
 	pepext = xhci_get_endpoint_ext(xfer->xroot->udev,
@@ -2966,6 +2974,7 @@ xhci_transfer_insert(struct usb_xfer *xfer)
 
 	xfer->flags_int.bandwidth_reclaimed = 1;
 
+ring_doorbell:
 	xhci_endpoint_doorbell(xfer);
 
 	return (0);
Re: RES: TP-LINK USB no carrier after speed test
On Mon, 19 Sep 2022, Ivan Quitschal wrote:
> On Mon, 19 Sep 2022, Hans Petter Selasky wrote:
>> Hi Ivan,
>>
>> Can you also test this USB kernel patch? And revert your if_ure.c patch?
>>
>> --HPS
>
> Hi Hans,
>
> It *almost* worked. Everything was perfect, full speed 600/300 on the
> first 5 tests (on speedtest.net); on the 6th test, the same problem
> happened, unfortunately.
>
> Thanks
>
> --tzk

Hi Hans,

Today I tested again and the problem occurred right away, at the first attempt :(

But the problem is definitely related to that constant and to upload. I set it back to 2048 and the problem never happened again. But of course, the upload speed also dropped back to 90 Mbps (instead of 300).

Thanks

--tzk
Re: RES: TP-LINK USB no carrier after speed test
On Mon, 19 Sep 2022, Hans Petter Selasky wrote:
> Hi Ivan,
>
> Can you also test this USB kernel patch? And revert your if_ure.c patch?
>
> --HPS

Hi Hans,

It *almost* worked. Everything was perfect, full speed 600/300 on the first 5 tests (on speedtest.net); on the 6th test, the same problem happened, unfortunately.

Thanks

--tzk
Re: RES: TP-LINK USB no carrier after speed test
Hi Ivan,

Can you also test this USB kernel patch? And revert your if_ure.c patch?

--HPS

diff --git a/sys/dev/usb/usb_transfer.c b/sys/dev/usb/usb_transfer.c
index 20ed2c897aac..757697926106 100644
--- a/sys/dev/usb/usb_transfer.c
+++ b/sys/dev/usb/usb_transfer.c
@@ -419,6 +419,7 @@ usbd_get_max_frame_length(const struct usb_endpoint_descriptor *edesc,
 
 	switch (type) {
 	case UE_CONTROL:
+	case UE_BULK:
 		max_packet_count = 1;
 		break;
 	case UE_ISOCHRONOUS:
@@ -529,6 +530,7 @@ usbd_transfer_setup_sub(struct usb_setup_params *parm)
 
 	switch (type) {
 	case UE_CONTROL:
+	case UE_BULK:
 		xfer->max_packet_count = 1;
 		break;
 	case UE_ISOCHRONOUS:
Re: RES: TP-LINK USB no carrier after speed test
On 9/18/22 13:50, Ivan Quitschal wrote:
> Hi Hans,
>
> Just a heads up: it worked. I tested a thousand times and the problem
> no longer occurs after I changed the constant to 2048, but I believe
> the upload speed was affected a little. Instead of 600/300 of internet
> speed, I'm getting 600/90, but that's fine, way better now.
>
> Should this bug be in Bugzilla for the ure driver as well, as we have
> for axge?
>
> Thanks
>
> --tzk

Hi Ivan,

I have one of those if_ure adapters at hand, and will test it a bit before concluding. Stay tuned, and thanks for your testing efforts!

--HPS
Re: RES: TP-LINK USB no carrier after speed test
On Fri, 16 Sep 2022, Ivan Quitschal wrote:
> Hi Hans,
>
> I had to reboot my router, but I got my interface back running. And I
> have good news after that: I'm still using the buffer size 2048, and
> this time I got 600 Mbps download / 300 upload, just like that. I can
> see these 2 things were not related.
>
> I guess that's it, looks like it's solved. I will keep monitoring here,
> but so far so good.
>
> Thank you a lot, as always
>
> --tzk

Hi Hans,

Just a heads up: it worked. I tested a thousand times and the problem no longer occurs after I changed the constant to 2048, but I believe the upload speed was affected a little. Instead of 600/300 of internet speed, I'm getting 600/90, but that's fine, way better now.

Should this bug be in Bugzilla for the ure driver as well, as we have for axge?

Thanks

--tzk
Re: RES: TP-LINK USB no carrier after speed test
On Fri, 16 Sep 2022, Hans Petter Selasky wrote:
> usbconfig -d X.Y set_config 0
>
> or
>
> usbconfig -d X.Y reset
>
> Did you check dmesg?
>
> --HPS

Hi Hans,

I had to reboot my router, but I got my interface back running. And I have good news after that: I'm still using the buffer size 2048, and this time I got 600 Mbps download / 300 upload, just like that. I can see these 2 things were not related.

I guess that's it, looks like it's solved. I will keep monitoring here, but so far so good.

Thank you a lot, as always

--tzk
Re: RES: TP-LINK USB no carrier after speed test
On 9/16/22 16:31, Ivan Quitschal wrote:
> Hi Hans,
>
> After the usbconfig -d X.Y set_config 1, the interface doesn't work
> anymore. How can I undo this?
>
> Thanks
>
> Ivan

usbconfig -d X.Y set_config 0

or

usbconfig -d X.Y reset

Did you check dmesg?

--HPS
RES: TP-LINK USB no carrier after speed test
> -----Original Message-----
> From: Hans Petter Selasky
> Sent: Friday, 16 September 2022 10:40
> To: Ivan Quitschal
> Cc: freebsd-current@freebsd.org
> Subject: Re: TP-LINK USB no carrier after speed test
>
> Hi,
>
> Can you try this instead:
>
> usbconfig -d X.Y set_config 1
>
> X.Y are the numbers after ugenX.Y
>
> Then it will use a different driver.
>
> --HPS

Hi Hans,

After the usbconfig -d X.Y set_config 1, the interface doesn't work anymore. How can I undo this?

Thanks

Ivan
Re: TP-LINK USB no carrier after speed test
Hi, I compared the Linux code and the FreeBSD code, and the Linux code has firmware upload support for this device. Maybe implementing that will fix some issues. Will come back to this after EuroBSDcon :-) --HPS
Re: TP-LINK USB no carrier after speed test
On 9/16/22 14:18, Ivan Quitschal wrote:
> Hi Hans,
>
> It worked, but with a cost. I tried 8192 and it works for 5 tests
> (5 speed tests in a row), then freezes on the 6th. Then I tried 4096,
> no success. Then I tried buffer size 2048 and it's working now; I did
> several tests in a row and the internet keeps working just fine.
>
> But I noticed that the speed also dropped. With the same server,
> testing on Windows, it goes like:
>
> 600 download / 300 upload
>
> and on FreeBSD with the 2048 buffer size it goes like:
>
> 300 download / 150 upload
>
> So it worked with a cost, like I said: I had to give up half of my
> bandwidth.
>
> What do you think? And thank you again
>
> --tzk

Hi,

Can you try this instead:

usbconfig -d X.Y set_config 1

X.Y are the numbers after ugenX.Y

Then it will use a different driver.

--HPS
Re: TP-LINK USB no carrier after speed test
On Fri, 16 Sep 2022, Hans Petter Selasky wrote:
> On 9/16/22 08:34, Hans Petter Selasky wrote:
>> On 9/16/22 08:20, Hans Petter Selasky wrote:
>>> On 9/15/22 17:36, Ivan Quitschal wrote:
>>>> On Thu, 15 Sep 2022, Ivan Quitschal wrote:
>>>>> On Thu, 15 Sep 2022, Hans Petter Selasky wrote:
>>>>>> On 9/15/22 17:18, Hans Petter Selasky wrote:
>>>>>>> On 9/15/22 17:16, Ivan Quitschal wrote:
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> Does anybody have any idea what could be happening here?
>>>>>>>> I have a DELL INSPIRON 3511 laptop and everything works just
>>>>>>>> fine, literally everything, even the iwlwifi0.
>>>>>>>>
>>>>>>>> But in order to use my full 600 Mbps, I don't use the wireless
>>>>>>>> but a TP-LINK USB Ethernet adapter connected on "ue0":
>>>>>>>>
>>>>>>>> ugen0.6: at usbus0, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=ON (200mA)
>>>>>>>>
>>>>>>>> But something really strange is happening. Every time I open
>>>>>>>> Chromium and do a speed test (could be speedtest.net or any
>>>>>>>> other), at the end of the test the eth interface dies. It
>>>>>>>> changes from full-duplex to half-duplex/no carrier, and the
>>>>>>>> only way to get the internet back through ue0 is by rebooting
>>>>>>>> the whole thing. Not even a "service netif restart" does
>>>>>>>> anything.
>>>>>>>>
>>>>>>>> If anyone has any ideas why that is, it would be appreciated.
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I think it is some new features they use in the USB data
>>>>>>> protocol which we don't support. Check the Linux code.
>>>>>>>
>>>>>>> By the way, does:
>>>>>>>
>>>>>>> usbconfig -d 0.6 reset
>>>>>>>
>>>>>>> recover the device?
>>>>>>>
>>>>>>> --HPS
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Search for axge on bugzilla:
>>>>>>
>>>>>> I suspect you are using this chipset:
>>>>>>
>>>>>> Try:
>>>>>>
>>>>>> usbconfig show_ifdrv
>>>>>>
>>>>>> To know for sure.
>>>>>>
>>>>>> Also see:
>>>>>>
>>>>>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=210488
>>>>>>
>>>>>> --HPS
>>>>>
>>>>> Hi Hans,
>>>>>
>>>>> Actually the driver I use is not axge (I thought it would be when
>>>>> I bought the USB card). This is the module I'm using:
>>>>>
>>>>> if_ure.ko
>>>>>
>>>>> And thank you, yes, resetting the USB entry with your command
>>>>> worked just fine. I got the internet back after doing this:
>>>>>
>>>>> usbconfig -d 0.6 reset
>>>>>
>>>>> Do we have a bug here then?
>>>>>
>>>>> Thank you
>>>>>
>>>>> --tzk
>>>>
>>>> Oh, I forgot to mention that the ure driver freezes not during the
>>>> download test but in the middle of the upload, always. Don't know
>>>> if that helps.
>>>>
>>>> Thanks
>>>>
>>>> --tzk
>>>
>>> Hi Ivan,
>>>
>>> Yes, there seems to be a problem there. I need to look at the driver.
>>> Maybe it is simply sending more data than the device can handle!
>>>
>>> --HPS
>>
>> Hi,
>>
>> Try lowering this constant to 8192:
>>
>> sys/dev/usb/net/if_urereg.h:#define URE_TX_BUFSZ 16384
>>
>> Then recompile and install if_ure:
>>
>> make -C sys/modules/usb/ure KMODDIR=/boot/kernel all install
>>
>> --HPS
>
> You can also try other values, like subtracting one.
>
> --HPS

Hi Hans,

It worked, but with a cost. I tried 8192 and it works for 5 tests (5 speed tests in a row), then freezes on the 6th. Then I tried 4096, no success. Then I tried buffer size 2048 and it's working now; I did several tests in a row and the internet keeps working just fine.

But I noticed that the speed also dropped. With the same server, testing on Windows, it goes like:

600 download / 300 upload

and on FreeBSD with the 2048 buffer size it goes like:

300 download / 150 upload

So it worked with a cost, like I said: I had to give up half of my bandwidth.

What do you think? And thank you again

--tzk
Re: TP-LINK USB no carrier after speed test
On 9/16/22 08:34, Hans Petter Selasky wrote:
> Hi,
>
> Try lowering this constant to 8192:
>
> sys/dev/usb/net/if_urereg.h:#define URE_TX_BUFSZ 16384
>
> Then recompile and install if_ure:
>
> make -C sys/modules/usb/ure KMODDIR=/boot/kernel all install
>
> --HPS

You can also try other values, like subtracting one.

--HPS
Re: TP-LINK USB no carrier after speed test
On 9/16/22 08:20, Hans Petter Selasky wrote:
> Hi Ivan,
>
> Yes, there seems to be a problem there. I need to look at the driver.
> Maybe it is simply sending more data than the device can handle!
>
> --HPS

Hi,

Try lowering this constant to 8192:

sys/dev/usb/net/if_urereg.h:#define URE_TX_BUFSZ 16384

Then recompile and install if_ure:

make -C sys/modules/usb/ure KMODDIR=/boot/kernel all install

--HPS
Re: TP-LINK USB no carrier after speed test
On 9/15/22 17:36, Ivan Quitschal wrote:
> Oh, I forgot to mention that the ure driver freezes not during the
> download test but in the middle of the upload, always. Don't know if
> that helps.
>
> Thanks
> --tzk

Hi Ivan,

Yes, there seems to be a problem there. I need to look at the driver. Maybe it is simply sending more data than the device can handle!

--HPS
Re: TP-LINK USB no carrier after speed test
On Thu, Sep 15, 2022 at 01:45:11PM -0300, Ivan Quitschal wrote:
>         capabilities=68009b
>         ether 54:af:97:86:be:2c
>         inet 192.168.0.35 netmask 0xff00 broadcast 192.168.0.255
>         media: Ethernet 1000baseT
>         status: active
>         supported media:
>                 media autoselect
>                 media 1000baseT mediaopt full-duplex,master
>                 media 1000baseT mediaopt full-duplex
>                 media 100baseTX mediaopt full-duplex
>                 media 100baseTX
>                 media 10baseT/UTP mediaopt full-duplex
>                 media 10baseT/UTP
>                 media none
>         nd6 options=29

In /etc/rc.conf, is it autoselected (so no mediaopt line), or are you specifying "media 1000baseT mediaopt full-duplex,master"?

I'm asking because some network devices sometimes seem to *require* the speed to be specified because they don't play well autonegotiating.

--
Re: TP-LINK USB no carrier after speed test
On Thu, 15 Sep 2022, Ivan Quitschal wrote:
> Oh, I forgot to mention that the ure driver freezes not during the
> download test but in the middle of the upload, always. Don't know if
> that helps.
>
> Thanks
> --tzk

Hi Hans,

I've seen you made 2 patches for the ure driver that look somewhat related to the problem I'm having here:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=256675

The problem is, it's not compiling any longer; the code must have changed since you made the patch.

Regarding the "axge" bugzilla you sent me, THAT'S EXACTLY the problem I'm having. The workaround for the guy's problem was doing this:

# ifconfig ue0 media 1000baseT mediaopt flow

The problem is, my ure/ue0 interface does not have that "flow" option:

[tzk@tzk-inspiron ~ ]$ ifconfig -m ue0
ue0: flags=8843 metric 0 mtu 1500
        options=68009b
        capabilities=68009b
        ether 54:af:97:86:be:2c
        inet 192.168.0.35 netmask 0xff00 broadcast 192.168.0.255
        media: Ethernet 1000baseT
        status: active
        supported media:
                media autoselect
                media 1000baseT mediaopt full-duplex,master
                media 1000baseT mediaopt full-duplex
                media 100baseTX mediaopt full-duplex
                media 100baseTX
                media 10baseT/UTP mediaopt full-duplex
                media 10baseT/UTP
                media none
        nd6 options=29

Any ideas, or any other patch you made? I'd appreciate any insights.

Thanks in advance
--tzk
Re: TP-LINK USB no carrier after speed test
On Thu, 15 Sep 2022, Ivan Quitschal wrote:
> Hi Hans,
>
> Actually, the driver I use is not axge (I thought it would be when I
> bought the USB card). This is the module I'm using:
>
> if_ure.ko
>
> And thank you, yes, resetting the USB entry with your command worked
> just fine. I got the internet back after doing this:
>
> usbconfig -d 0.6 reset
>
> Do we have a bug here then?
>
> Thank you
> --tzk

Oh, I forgot to mention that the ure driver freezes not during the download test but in the middle of the upload, always. Don't know if that helps.

Thanks
--tzk
Re: TP-LINK USB no carrier after speed test
On Thu, 15 Sep 2022, Hans Petter Selasky wrote:
> Hi,
>
> Search for axge on bugzilla; I suspect you are using this chipset. Try:
>
> usbconfig show_ifdrv
>
> to know for sure.
>
> Also see: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=210488
>
> --HPS

Hi Hans,

Actually, the driver I use is not axge (I thought it would be when I bought the USB card). This is the module I'm using:

if_ure.ko

And thank you, yes, resetting the USB entry with your command worked just fine. I got the internet back after doing this:

usbconfig -d 0.6 reset

Do we have a bug here then?

Thank you
--tzk
Re: TP-LINK USB no carrier after speed test
On 9/15/22 17:18, Hans Petter Selasky wrote:
> Hi,
>
> I think it is some new feature they use in the USB data protocol which
> we don't support. Check the Linux code.
>
> By the way, does:
>
> usbconfig -d 0.6 reset
>
> recover the device?
>
> --HPS

Hi,

Search for axge on bugzilla; I suspect you are using this chipset. Try:

usbconfig show_ifdrv

to know for sure.

Also see: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=210488

--HPS
Re: TP-LINK USB no carrier after speed test
On 9/15/22 17:16, Ivan Quitschal wrote:
> Hi all,
>
> Does anybody have any idea what could be happening here?
>
> I have a Dell Inspiron 3511 laptop and everything works just fine,
> literally everything, even iwlwifi0. But in order to use my full
> 600 Mbps, I don't use the wireless but a TP-LINK USB ethernet
> connected on "ue0":
>
> ugen0.6: at usbus0, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=ON (200mA)
>
> But something really strange is happening. Every time I open Chromium
> and do a speed test (could be speedtest.net or any other), at the end
> of the test the eth interface dies: it changes from full-duplex to
> half-duplex/no carrier, and the only way to get the internet back
> through ue0 is by rebooting the whole thing. Not even a
> "service netif restart" does anything.
>
> If anyone has any ideas why that is, it would be appreciated.

Hi,

I think it is some new feature they use in the USB data protocol which we don't support. Check the Linux code.

By the way, does:

usbconfig -d 0.6 reset

recover the device?

--HPS
TP-LINK USB no carrier after speed test
Hi all,

Does anybody have any idea what could be happening here?

I have a Dell Inspiron 3511 laptop and everything works just fine, literally everything, even iwlwifi0. But in order to use my full 600 Mbps, I don't use the wireless but a TP-LINK USB ethernet connected on "ue0":

ugen0.6: at usbus0, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=ON (200mA)

But something really strange is happening. Every time I open Chromium and do a speed test (could be speedtest.net or any other), at the end of the test the eth interface dies: it changes from full-duplex to half-duplex/no carrier, and the only way to get the internet back through ue0 is by rebooting the whole thing. Not even a "service netif restart" does anything.

If anyone has any ideas why that is, it would be appreciated.

Thanks
--tzk
Re: test-includes breaks buildworld when WITHOUT_PF is set in src.conf
On Wed, 09 Feb 2022 11:08:44 +0100 Kristof Provost wrote:

> On 9 Feb 2022, at 10:57, Gary Jennejohn wrote:
> > test-includes uses pf.h when checking usage of pfvar.h.
> >
> > But, these lines in include/Makefile remove pf.h when WITHOUT_PF is
> > set in src.conf:
> >
> > .if ${MK_PF} != "no"
> > INCSGROUPS+= PF
> > .endif
> >
> > This breaks buildworld. The error message:
> >
> > In file included from net_pfvar.c:1:
> > /usr/obj/usr/src/amd64.amd64/tmp/usr/include/net/pfvar.h:65:10: fatal error:
> > 'netpfil/pf/pf.h' file not found
> > #include
> >          ^
> > 1 error generated.
> > --- net_pfvar.o ---
> > *** [net_pfvar.o] Error code 1
> >
> > make[3]: stopped in /usr/src/tools/build/test-includes
> > .ERROR_TARGET='net_pfvar.o'
> >
> > Removing the .if/.endif fixes it for me, although there may be a better
> > way to avoid the error.
> >
> Warner's working on a better fix. See https://reviews.freebsd.org/D34009 for
> the discussion.
>

Thanks for the info.

--
Gary Jennejohn
Re: test-includes breaks buildworld when WITHOUT_PF is set in src.conf
On 9 Feb 2022, at 10:57, Gary Jennejohn wrote:
> test-includes uses pf.h when checking usage of pfvar.h.
>
> But, these lines in include/Makefile remove pf.h when WITHOUT_PF is
> set in src.conf:
>
> .if ${MK_PF} != "no"
> INCSGROUPS+= PF
> .endif
>
> This breaks buildworld. The error message:
>
> In file included from net_pfvar.c:1:
> /usr/obj/usr/src/amd64.amd64/tmp/usr/include/net/pfvar.h:65:10: fatal error:
> 'netpfil/pf/pf.h' file not found
> #include
>          ^
> 1 error generated.
> --- net_pfvar.o ---
> *** [net_pfvar.o] Error code 1
>
> make[3]: stopped in /usr/src/tools/build/test-includes
> .ERROR_TARGET='net_pfvar.o'
>
> Removing the .if/.endif fixes it for me, although there may be a better
> way to avoid the error.
>

Warner’s working on a better fix. See https://reviews.freebsd.org/D34009 for the discussion.

Kristof
test-includes breaks buildworld when WITHOUT_PF is set in src.conf
test-includes uses pf.h when checking usage of pfvar.h.

But, these lines in include/Makefile remove pf.h when WITHOUT_PF is
set in src.conf:

.if ${MK_PF} != "no"
INCSGROUPS+= PF
.endif

This breaks buildworld. The error message:

In file included from net_pfvar.c:1:
/usr/obj/usr/src/amd64.amd64/tmp/usr/include/net/pfvar.h:65:10: fatal error: 'netpfil/pf/pf.h' file not found
#include
         ^
1 error generated.
--- net_pfvar.o ---
*** [net_pfvar.o] Error code 1

make[3]: stopped in /usr/src/tools/build/test-includes
.ERROR_TARGET='net_pfvar.o'

Removing the .if/.endif fixes it for me, although there may be a better
way to avoid the error.

--
Gary Jennejohn
Re: kyua run under WITH_ASAN= built world reports a global-buffer-overflow during cpio test.
On 2022-Jan-12, at 01:54, Mark Millard wrote:

> For the below it appears that the report from ASAN is accurate.
>
> ==85511==ERROR: AddressSanitizer: global-buffer-overflow on address
> 0x010753ca at pc 0x01139bda bp 0x7fffc2b0 sp 0x7fffc2a8
> READ of size 1 at 0x010753ca thread T0
>    #0 0x1139bd9 in hexdump
> /usr/main-src/contrib/libarchive/test_utils/test_main.c:875:35
>    #1 0x113b73c in assertion_text_file_contents
> /usr/main-src/contrib/libarchive/test_utils/test_main.c:1182:3
>    #2 0x1125d46 in basic_cpio
> /usr/main-src/contrib/libarchive/cpio/test/test_basic.c:84:2
>    #3 0x11259dc in test_basic
> /usr/main-src/contrib/libarchive/cpio/test/test_basic.c:229:2
>    #4 0x1144943 in test_run
> /usr/main-src/contrib/libarchive/test_utils/test_main.c:3561:2
>    #5 0x1144943 in main
> /usr/main-src/contrib/libarchive/test_utils/test_main.c:4062:9
>
> 0x010753ca is located 54 bytes to the left of global variable '<string
> literal>' defined in
> '/usr/main-src/contrib/libarchive/cpio/test/test_basic.c:229:13' (0x1075400)
> of size 5
>  '<string literal>' is ascii string 'copy'
> 0x010753ca is located 22 bytes to the left of global variable '<string
> literal>' defined in
> '/usr/main-src/contrib/libarchive/cpio/test/test_basic.c:228:38' (0x10753e0)
> of size 9
>  '<string literal>' is ascii string '1 block
> '
> 0x010753ca is located 0 bytes to the right of global variable '<string
> literal>' defined in
> '/usr/main-src/contrib/libarchive/cpio/test/test_basic.c:220:18' (0x10753c0)
> of size 10
>  '<string literal>' is ascii string '2 blocks
> '
> SUMMARY: AddressSanitizer: global-buffer-overflow
> /usr/main-src/contrib/libarchive/test_utils/test_main.c:875:35 in hexdump
> Shadow bytes around the buggy address:
>  0x4020ea20: f9 f9 f9 f9 02 f9 f9 f9 00 01 f9 f9 00 02 f9 f9
>  0x4020ea30: 00 00 00 00 00 00 02 f9 f9 f9 f9 f9 00 f9 f9 f9
>  0x4020ea40: 00 01 f9 f9 00 00 00 00 00 00 01 f9 f9 f9 f9 f9
>  0x4020ea50: 06 f9 f9 f9 07 f9 f9 f9 00 00 00 00 00 07 f9 f9
>  0x4020ea60: f9 f9 f9 f9 04 f9 f9 f9 05 f9 f9 f9 00 00 00 00
> =>0x4020ea70: 00 05 f9 f9 f9 f9 f9 f9 00[02]f9 f9 00 01 f9 f9
>  0x4020ea80: 05 f9 f9 f9 01 f9 f9 f9 00 01 f9 f9 00 05 f9 f9
>  0x4020ea90: 00 02 f9 f9 00 f9 f9 f9 00 02 f9 f9 07 f9 f9 f9
>  0x4020eaa0: 00 01 f9 f9 07 f9 f9 f9 00 02 f9 f9 00 02 f9 f9
>  0x4020eab0: 00 03 f9 f9 00 01 f9 f9 00 04 f9 f9 00 00 00 00
>  0x4020eac0: 00 00 00 03 f9 f9 f9 f9 00 00 00 f9 f9 f9 f9 f9
> Shadow byte legend (one shadow byte represents 8 application bytes):
>  Addressable:           00
>  Partially addressable: 01 02 03 04 05 06 07
>  Heap left redzone:     fa
>  Freed heap region:     fd
>  Stack left redzone:    f1
>  Stack mid redzone:     f2
>  Stack right redzone:   f3
>  Stack after return:    f5
>  Stack use after scope: f8
>  Global redzone:        f9
>  Global init order:     f6
>  Poisoned by user:      f7
>  Container overflow:    fc
>  Array cookie:          ac
>  Intra object redzone:  bb
>  ASan internal:         fe
>  Left alloca redzone:   ca
>  Right alloca redzone:  cb
> ==85511==ABORTING
>
> Well, contrib/libarchive/cpio/test/test_basic.c:84 is doing:
>
>        assertTextFileContents(se, "pack.err");
>
> which involves, in turn:
>
> int
> assertion_text_file_contents(const char *filename, int line, const char
> *buff, const char *fn)
> {
> . . .
>        s = (int)strlen(buff);
>        contents = malloc(s * 2 + 128);
>        n = (int)fread(contents, 1, s * 2 + 128 - 1, f);
> . . .
>        if (n > 0) {
>                hexdump(contents, buff, n, 0);
> . . .
>
> Nothing about the code seems to constrain n to fit the
> size of the space for "pack.err" (9 bytes of "global"
> space).
>
> The report is for the ref[i + j] in the code:
>
> static void
> hexdump(const char *p, const char *ref, size_t l, size_t offset)
> {
> . . .
>                for (j = 0; j < 16 && i + j < l; j++) {
>                        if (ref != NULL && p[i + j] != ref[i + j])
> . . .
>
> where ref points to the space for "pack.err" and l was
> given a copy of the value of n in the previously shown
> code.
>
> The i + j < l constraint need not avoid the code doing
> ref[i + j] in a way that reaches outside the space for
> "pack.err" --because of the supplied value of n (a.k.a. l)
> not being sufficient to respect the space for "pack.err".

The pair below shows up in 13 reports:

    #0 0x1139bd9 in hexdump /usr/main-src/contrib/libarchive/test_utils/test_main.c:875:35
    #1 0x113b73c in assertion_text_file_contents /usr/main-src/contrib/libarchive/test_utils/test_main.c:1182:3

So the above notes are just an illustration of a more general
issue with the assertion_text_file_contents use of
"hexdump(contents, buff, n, 0)".

===
Mark Millard
marklmi at yahoo.com
kyua run under WITH_ASAN= built world reports a global-buffer-overflow during cpio test.
For the below it appears that the report from ASAN is accurate.

==85511==ERROR: AddressSanitizer: global-buffer-overflow on address 0x010753ca at pc 0x01139bda bp 0x7fffc2b0 sp 0x7fffc2a8
READ of size 1 at 0x010753ca thread T0
    #0 0x1139bd9 in hexdump /usr/main-src/contrib/libarchive/test_utils/test_main.c:875:35
    #1 0x113b73c in assertion_text_file_contents /usr/main-src/contrib/libarchive/test_utils/test_main.c:1182:3
    #2 0x1125d46 in basic_cpio /usr/main-src/contrib/libarchive/cpio/test/test_basic.c:84:2
    #3 0x11259dc in test_basic /usr/main-src/contrib/libarchive/cpio/test/test_basic.c:229:2
    #4 0x1144943 in test_run /usr/main-src/contrib/libarchive/test_utils/test_main.c:3561:2
    #5 0x1144943 in main /usr/main-src/contrib/libarchive/test_utils/test_main.c:4062:9

0x010753ca is located 54 bytes to the left of global variable '<string literal>' defined in '/usr/main-src/contrib/libarchive/cpio/test/test_basic.c:229:13' (0x1075400) of size 5
  '<string literal>' is ascii string 'copy'
0x010753ca is located 22 bytes to the left of global variable '<string literal>' defined in '/usr/main-src/contrib/libarchive/cpio/test/test_basic.c:228:38' (0x10753e0) of size 9
  '<string literal>' is ascii string '1 block
'
0x010753ca is located 0 bytes to the right of global variable '<string literal>' defined in '/usr/main-src/contrib/libarchive/cpio/test/test_basic.c:220:18' (0x10753c0) of size 10
  '<string literal>' is ascii string '2 blocks
'
SUMMARY: AddressSanitizer: global-buffer-overflow /usr/main-src/contrib/libarchive/test_utils/test_main.c:875:35 in hexdump
Shadow bytes around the buggy address:
  0x4020ea20: f9 f9 f9 f9 02 f9 f9 f9 00 01 f9 f9 00 02 f9 f9
  0x4020ea30: 00 00 00 00 00 00 02 f9 f9 f9 f9 f9 00 f9 f9 f9
  0x4020ea40: 00 01 f9 f9 00 00 00 00 00 00 01 f9 f9 f9 f9 f9
  0x4020ea50: 06 f9 f9 f9 07 f9 f9 f9 00 00 00 00 00 07 f9 f9
  0x4020ea60: f9 f9 f9 f9 04 f9 f9 f9 05 f9 f9 f9 00 00 00 00
=>0x4020ea70: 00 05 f9 f9 f9 f9 f9 f9 00[02]f9 f9 00 01 f9 f9
  0x4020ea80: 05 f9 f9 f9 01 f9 f9 f9 00 01 f9 f9 00 05 f9 f9
  0x4020ea90: 00 02 f9 f9 00 f9 f9 f9 00 02 f9 f9 07 f9 f9 f9
  0x4020eaa0: 00 01 f9 f9 07 f9 f9 f9 00 02 f9 f9 00 02 f9 f9
  0x4020eab0: 00 03 f9 f9 00 01 f9 f9 00 04 f9 f9 00 00 00 00
  0x4020eac0: 00 00 00 03 f9 f9 f9 f9 00 00 00 f9 f9 f9 f9 f9
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:     fa
  Freed heap region:     fd
  Stack left redzone:    f1
  Stack mid redzone:     f2
  Stack right redzone:   f3
  Stack after return:    f5
  Stack use after scope: f8
  Global redzone:        f9
  Global init order:     f6
  Poisoned by user:      f7
  Container overflow:    fc
  Array cookie:          ac
  Intra object redzone:  bb
  ASan internal:         fe
  Left alloca redzone:   ca
  Right alloca redzone:  cb
==85511==ABORTING

Well, contrib/libarchive/cpio/test/test_basic.c:84 is doing:

        assertTextFileContents(se, "pack.err");

which involves, in turn:

int
assertion_text_file_contents(const char *filename, int line, const char *buff, const char *fn)
{
. . .
        s = (int)strlen(buff);
        contents = malloc(s * 2 + 128);
        n = (int)fread(contents, 1, s * 2 + 128 - 1, f);
. . .
        if (n > 0) {
                hexdump(contents, buff, n, 0);
. . .

Nothing about the code seems to constrain n to fit the
size of the space for "pack.err" (9 bytes of "global"
space).

The report is for the ref[i + j] in the code:

static void
hexdump(const char *p, const char *ref, size_t l, size_t offset)
{
. . .
                for (j = 0; j < 16 && i + j < l; j++) {
                        if (ref != NULL && p[i + j] != ref[i + j])
. . .

where ref points to the space for "pack.err" and l was
given a copy of the value of n in the previously shown
code.

The i + j < l constraint need not avoid the code doing
ref[i + j] in a way that reaches outside the space for
"pack.err" --because of the supplied value of n (a.k.a. l)
not being sufficient to respect the space for "pack.err".

===
Mark Millard
marklmi at yahoo.com
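One way to avoid the kind of out-of-bounds read described above might be to check the index against the length of the reference string before reading ref[i]. What follows is only a minimal sketch under that assumption: count_mismatches_bounded is a hypothetical helper written for illustration, not libarchive's hexdump() and not necessarily how upstream addresses the issue. It assumes the reference is a NUL-terminated string, as the string literals named in the report are (the faulting address is reported as lying 0 bytes to the right of the 10-byte literal '2 blocks\n').

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of libarchive: count how many of the
 * first l bytes of p differ from the NUL-terminated reference string
 * ref, without ever reading ref at or past its terminator.  Bytes of
 * p that lie beyond the end of ref are counted as mismatches.
 */
static size_t
count_mismatches_bounded(const char *p, size_t l, const char *ref)
{
	size_t ref_len, i, mismatches = 0;

	assert(p != NULL);
	if (ref == NULL)
		return (0);	/* nothing to compare against */
	ref_len = strlen(ref);
	for (i = 0; i < l; i++) {
		/* Test the bound first so ref[i] is read only when valid. */
		if (i >= ref_len || p[i] != ref[i])
			mismatches++;
	}
	return (mismatches);
}
```

The point of the sketch is the ordering of the two conditions: the file contents may be longer than the reference text (the caller reads up to s * 2 + 128 - 1 bytes), so the loop bound on the file length alone cannot protect the reference buffer.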
Re: FYI: An example ASAN failure report during kyua test -k /usr/tests/Kyuafile (info for some more examples)
On 2022-Jan-9, at 13:47, Mark Millard wrote: > On 2022-Jan-7, at 03:39, Mark Millard wrote: > >> Having done a buildworld with both WITH_ASAN= and WITH_UBSAN= >> after finding what to control to allow the build, I installed >> it in a directory tree for chroot use and have >> "kyua test -k /usr/tests/Kyuafile" running. >> >> I see evidence of one AddressSanitizer report. (kyua is still >> running.) The context is: >> >> # more >> /usr/obj/DESTDIRs/main-amd64-xSAN-chroot/tmp/kyua.FKD2vh/434/stdout.txt >> Executing command [ mkdir /tmp/kyua.FKD2vh/434/work/mntpt ] >> mount -t tmpfs -o size=10M tmpfs /tmp/kyua.FKD2vh/434/work/mntpt >> Executing command [ touch a ] >> Executing command [ rm a ] >> Executing command [ dd if=/dev/zero of=a bs=1m count=15 ] >> Executing command [ rm a ] >> >> # more >> /usr/obj/DESTDIRs/main-amd64-xSAN-chroot/tmp/kyua.FKD2vh/434/stderr.txt >> = >> ==14384==ERROR: AddressSanitizer: stack-buffer-overflow on address >> 0x7fffa948 at pc 0x000801f38f5a bp 0x7fffa830 sp 0x7fffa828 >> WRITE of size 8 at 0x7fffa948 thread T0 >> #0 0x801f38f59 in strtoimax_l >> /usr/main-src/lib/libc/stdlib/strtoimax.c:148:11 >> #1 0x10de6c8 in strtoimax >> /usr/main-src/contrib/llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_common_interceptors.inc:3441:18 >> #2 0x11a4723 in getq /usr/main-src/bin/test/test.c:560:6 >> #3 0x11a4523 in intcmp /usr/main-src/bin/test/test.c:584:7 >> #4 0x11a4523 in binop /usr/main-src/bin/test/test.c:351:10 >> #5 0x11a2f06 in primary /usr/main-src/bin/test/test.c:317:10 >> #6 0x11a2f06 in nexpr /usr/main-src/bin/test/test.c:275:9 >> #7 0x11a28cb in aexpr /usr/main-src/bin/test/test.c:261:8 >> #8 0x11a2a03 in aexpr /usr/main-src/bin/test/test.c:263:10 >> #9 0x11a228b in oexpr /usr/main-src/bin/test/test.c:247:8 >> #10 0x11a1fcf in testcmd /usr/main-src/bin/test/test.c:224:10 >> #11 0x1145289 in evalcommand /usr/main-src/bin/sh/eval.c:1107:16 >> #12 0x113eeb7 in evaltree /usr/main-src/bin/sh/eval.c:289:4 >> #13 0x113fb34 in 
evaltree /usr/main-src/bin/sh/eval.c:225:4 >> #14 0x113f86b in evaltree /usr/main-src/bin/sh/eval.c:212:4 >> #15 0x1144d89 in evalcommand /usr/main-src/bin/sh/eval.c:1053:3 >> #16 0x113eeb7 in evaltree /usr/main-src/bin/sh/eval.c:289:4 >> #17 0x113fc55 in evaltree /usr/main-src/bin/sh/eval.c:241:4 >> #18 0x1144d89 in evalcommand /usr/main-src/bin/sh/eval.c:1053:3 >> #19 0x113eeb7 in evaltree /usr/main-src/bin/sh/eval.c:289:4 >> #20 0x1144d89 in evalcommand /usr/main-src/bin/sh/eval.c:1053:3 >> #21 0x113eeb7 in evaltree /usr/main-src/bin/sh/eval.c:289:4 >> #22 0x113eb88 in evalstring /usr/main-src/bin/sh/eval.c >> #23 0x1179727 in main /usr/main-src/bin/sh/main.c:171:3 >> >> Address 0x7fffa948 is located in stack of thread T0 at offset 264 in >> frame >> #0 0x801f387ff in strtoimax_l /usr/main-src/lib/libc/stdlib/strtoimax.c:58 >> >> This frame has 1 object(s): >> [32, 36) '__limit.i.i.i' <== Memory access at offset 264 overflows this >> variable >> HINT: this may be a false positive if your program uses some custom stack >> unwind mechanism, swapcontext or vfork >> (longjmp and C++ exceptions *are* supported) >> SUMMARY: AddressSanitizer: stack-buffer-overflow >> /usr/main-src/lib/libc/stdlib/strtoimax.c:148:11 in strtoimax_l >> Shadow bytes around the buggy address: >> 0x44d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >> 0x44e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >> 0x44f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >> 0x4500: f1 f1 f1 f1 00 00 00 00 f1 f1 f1 f1 f8 f3 f3 f3 >> 0x4510: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >> =>0x4520: 00 00 00 00 f3 f3 f3 f3 f3[f3]f3 f3 00 00 00 00 >> 0x4530: f1 f1 f1 f1 00 f3 f3 f3 00 00 00 00 00 00 00 00 >> 0x4540: f1 f1 f1 f1 00 f2 f2 f2 00 f3 f3 f3 00 00 00 00 >> 0x4550: f1 f1 f1 f1 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 >> 0x4560: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 >> 0x4570: f2 f2 f2 f2 f2 f2 f2 f2 f8 f8 f8 f8 f8 f8 f8 f8 >> Shadow byte legend (one shadow byte represents 8 
application bytes): >> Addressable: 00 >> Partially addressable: 01 02 03 04 05 06 07 >> Heap left redzone: fa >> Freed heap region: fd >> Stack left redzone: f1 >> Stack mid redzone: f2 >> Stack right r
Re: FYI: An example ASAN failure report during kyua test -k /usr/tests/Kyuafile (info for some more examples)
On 2022-Jan-7, at 03:39, Mark Millard wrote: > Having done a buildworld with both WITH_ASAN= and WITH_UBSAN= > after finding what to control to allow the build, I installed > it in a directory tree for chroot use and have > "kyua test -k /usr/tests/Kyuafile" running. > > I see evidence of one AddressSanitizer report. (kyua is still > running.) The context is: > > # more > /usr/obj/DESTDIRs/main-amd64-xSAN-chroot/tmp/kyua.FKD2vh/434/stdout.txt > Executing command [ mkdir /tmp/kyua.FKD2vh/434/work/mntpt ] > mount -t tmpfs -o size=10M tmpfs /tmp/kyua.FKD2vh/434/work/mntpt > Executing command [ touch a ] > Executing command [ rm a ] > Executing command [ dd if=/dev/zero of=a bs=1m count=15 ] > Executing command [ rm a ] > > # more > /usr/obj/DESTDIRs/main-amd64-xSAN-chroot/tmp/kyua.FKD2vh/434/stderr.txt > = > ==14384==ERROR: AddressSanitizer: stack-buffer-overflow on address > 0x7fffa948 at pc 0x000801f38f5a bp 0x7fffa830 sp 0x7fffa828 > WRITE of size 8 at 0x7fffa948 thread T0 >#0 0x801f38f59 in strtoimax_l > /usr/main-src/lib/libc/stdlib/strtoimax.c:148:11 >#1 0x10de6c8 in strtoimax > /usr/main-src/contrib/llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_common_interceptors.inc:3441:18 >#2 0x11a4723 in getq /usr/main-src/bin/test/test.c:560:6 >#3 0x11a4523 in intcmp /usr/main-src/bin/test/test.c:584:7 >#4 0x11a4523 in binop /usr/main-src/bin/test/test.c:351:10 >#5 0x11a2f06 in primary /usr/main-src/bin/test/test.c:317:10 > #6 0x11a2f06 in nexpr /usr/main-src/bin/test/test.c:275:9 > #7 0x11a28cb in aexpr /usr/main-src/bin/test/test.c:261:8 >#8 0x11a2a03 in aexpr /usr/main-src/bin/test/test.c:263:10 >#9 0x11a228b in oexpr /usr/main-src/bin/test/test.c:247:8 >#10 0x11a1fcf in testcmd /usr/main-src/bin/test/test.c:224:10 >#11 0x1145289 in evalcommand /usr/main-src/bin/sh/eval.c:1107:16 >#12 0x113eeb7 in evaltree /usr/main-src/bin/sh/eval.c:289:4 >#13 0x113fb34 in evaltree /usr/main-src/bin/sh/eval.c:225:4 >#14 0x113f86b in evaltree /usr/main-src/bin/sh/eval.c:212:4 
>#15 0x1144d89 in evalcommand /usr/main-src/bin/sh/eval.c:1053:3 >#16 0x113eeb7 in evaltree /usr/main-src/bin/sh/eval.c:289:4 >#17 0x113fc55 in evaltree /usr/main-src/bin/sh/eval.c:241:4 >#18 0x1144d89 in evalcommand /usr/main-src/bin/sh/eval.c:1053:3 >#19 0x113eeb7 in evaltree /usr/main-src/bin/sh/eval.c:289:4 >#20 0x1144d89 in evalcommand /usr/main-src/bin/sh/eval.c:1053:3 >#21 0x113eeb7 in evaltree /usr/main-src/bin/sh/eval.c:289:4 >#22 0x113eb88 in evalstring /usr/main-src/bin/sh/eval.c >#23 0x1179727 in main /usr/main-src/bin/sh/main.c:171:3 > > Address 0x7fffa948 is located in stack of thread T0 at offset 264 in frame >#0 0x801f387ff in strtoimax_l /usr/main-src/lib/libc/stdlib/strtoimax.c:58 > > This frame has 1 object(s): >[32, 36) '__limit.i.i.i' <== Memory access at offset 264 overflows this > variable > HINT: this may be a false positive if your program uses some custom stack > unwind mechanism, swapcontext or vfork > (longjmp and C++ exceptions *are* supported) > SUMMARY: AddressSanitizer: stack-buffer-overflow > /usr/main-src/lib/libc/stdlib/strtoimax.c:148:11 in strtoimax_l > Shadow bytes around the buggy address: > 0x44d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > 0x44e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > 0x44f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > 0x4500: f1 f1 f1 f1 00 00 00 00 f1 f1 f1 f1 f8 f3 f3 f3 > 0x4510: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > =>0x4520: 00 00 00 00 f3 f3 f3 f3 f3[f3]f3 f3 00 00 00 00 > 0x4530: f1 f1 f1 f1 00 f3 f3 f3 00 00 00 00 00 00 00 00 > 0x4540: f1 f1 f1 f1 00 f2 f2 f2 00 f3 f3 f3 00 00 00 00 > 0x4550: f1 f1 f1 f1 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 > 0x4560: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 > 0x4570: f2 f2 f2 f2 f2 f2 f2 f2 f8 f8 f8 f8 f8 f8 f8 f8 > Shadow byte legend (one shadow byte represents 8 application bytes): > Addressable: 00 > Partially addressable: 01 02 03 04 05 06 07 > Heap left redzone: fa > Freed heap region: fd > Stack left redzone: f1 > 
Stack mid redzone: f2 > Stack right redzone: f3 > Stack after return: f5 > Stack use after scope: f8 > Global redzone: f9 > Global init order: f6 > Poisoned by user:f7 > Container overflow: fc > Array cookie:ac > Intra object redzone:bb > ASan internal: fe > Left alloca re
Re: FYI: An example type of UBSAN failure during kyua test -k /usr/tests/Kyuafile
On 2022-Jan-7, at 04:31, Stefan Esser wrote: > Am 07.01.22 um 12:49 schrieb Mark Millard: >> Having done a buildworld with both WITH_ASAN= and WITH_UBSAN= >> after finding what to control to allow the build, I installed >> it in a directory tree for chroot use and have >> "kyua test -k /usr/tests/Kyuafile" running. >> >> I see evidence of various examples of one type of undefined >> behavior: "applying zero offset to null pointer" >> >> # more >> /usr/obj/DESTDIRs/main-amd64-xSAN-chroot/tmp/kyua.FKD2vh/356/stderr.txt >> /usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero >> offset to null pointer >> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior >> /usr/main-src/lib/libc/stdio/fread.c:133:10 in >> /usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero >> offset to null pointer >> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior >> /usr/main-src/lib/libc/stdio/fread.c:133:10 in >> /usr/main-src/usr.bin/sed/process.c:715:18: runtime error: applying zero >> offset to null pointer >> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior >> /usr/main-src/usr.bin/sed/process.c:715:18 in >> /usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero >> offset to null pointer >> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior >> /usr/main-src/lib/libc/stdio/fread.c:133:10 in >> Fail: stderr not empty >> --- /dev/null 2022-01-07 10:29:57.182903000 + >> +++ /tmp/kyua.FKD2vh/356/work/check.Mk9llD/stderr 2022-01-07 >> 10:29:57.17310 + >> @@ -0,0 +1,2 @@ >> +/usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero >> offset to null pointer >> +SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior >> /usr/main-src/lib/libc/stdio/fread.c:133:10 in >> Files left in work directory after failure: mntpt, mounterr >> >> >> In general the lib/libc/stdio/fread.c:133:10 example seems to >> be in a place that would make it fairly common. 
>
> Interesting find:
>
>     while (resid > (r = fp->_r)) {
>         (void)memcpy((void *)p, (void *)fp->_p, (size_t)r);
>         fp->_p += r; /* line 133 */
>         /* fp->_r = 0 ... done in __srefill */
>         p += r;
>         resid -= r;
>
> If fp->_p == NULL in line 133, then NULL has been passed as source address
> in memcpy() in the line above, and I'd think that is undefined behavior,
> even if a length of 0 is passed at the same time.

My copy of ISO/IEC 9899:2011 (E) only explicitly mentions such a
limitation for the memcpy_s variant. It does say "[t]he memcpy
function returns the value of s1". The only mentioned "behavior
is undefined" is for copying between objects that overlap.

But there is more general wording in 7.24.1 (of 7.24 String
handling <string.h>):

QUOTE
Where an argument declared as size_t n specifies the length of the
array for a function, n can have the value zero on a call to that
function. Unless explicitly stated otherwise in the description of
a particular function in this subclause, pointer arguments on such
a call still shall have valid values, as described in 7.1.4. On
such a call, . . . a function that copies characters copies zero
characters.
END QUOTE

But I've not noticed anything in 7.1.4 that is that explicit about
NULL arguments with zero sizes or that bans NULL arguments in any
generality.

In other words, I believe that the lack of a report for memcpy's
argument values is consistent with what ISO/IEC 9899:2011 is
explicit about for such things.

I've not tried going through POSIX material or any other potential
standards.

> Maybe the code block quoted above (lines 132 to 136) should be wrapped
> in "if (r > 0) {}"?
>

===
Mark Millard
marklmi at yahoo.com
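Whatever the standard's wording turns out to permit, code can sidestep the question by not calling memcpy() at all when there is nothing to copy. A minimal sketch, assuming a hypothetical wrapper named copy_bytes() (not a libc function):

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical wrapper: behaves like memcpy() for n > 0. For n == 0 it
 * returns dst without calling memcpy(), so the pointer arguments are
 * never handed to the library for an empty copy.
 */
void *
copy_bytes(void *dst, const void *src, size_t n)
{
    if (n == 0)
        return (dst);
    return (memcpy(dst, src, n));
}
```

With this shape, an empty copy from a null source is a no-op that never reaches memcpy(), and non-empty copies are unchanged.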
Re: FYI: An example type of UBSAN failure during kyua test -k /usr/tests/Kyuafile
On 2022-Jan-7, at 05:08, Mark Millard wrote: > On 2022-Jan-7, at 03:49, Mark Millard wrote: > >> Having done a buildworld with both WITH_ASAN= and WITH_UBSAN= >> after finding what to control to allow the build, I installed >> it in a directory tree for chroot use and have >> "kyua test -k /usr/tests/Kyuafile" running. >> >> I see evidence of various examples of one type of undefined >> behavior: "applying zero offset to null pointer" >> >> # more >> /usr/obj/DESTDIRs/main-amd64-xSAN-chroot/tmp/kyua.FKD2vh/356/stderr.txt >> /usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero >> offset to null pointer >> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior >> /usr/main-src/lib/libc/stdio/fread.c:133:10 in >> /usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero >> offset to null pointer >> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior >> /usr/main-src/lib/libc/stdio/fread.c:133:10 in >> /usr/main-src/usr.bin/sed/process.c:715:18: runtime error: applying zero >> offset to null pointer >> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior >> /usr/main-src/usr.bin/sed/process.c:715:18 in >> /usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero >> offset to null pointer >> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior >> /usr/main-src/lib/libc/stdio/fread.c:133:10 in >> Fail: stderr not empty >> --- /dev/null 2022-01-07 10:29:57.182903000 + >> +++ /tmp/kyua.FKD2vh/356/work/check.Mk9llD/stderr 2022-01-07 >> 10:29:57.17310 + >> @@ -0,0 +1,2 @@ >> +/usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero >> offset to null pointer >> +SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior >> /usr/main-src/lib/libc/stdio/fread.c:133:10 in >> Files left in work directory after failure: mntpt, mounterr >> >> >> In general the lib/libc/stdio/fread.c:133:10 example seems to >> be in a place that would make it fairly common. 
>> >> usr.bin/sed/process.c:715:18 is more limited: just sed use. >> > > kyua ran to completion. This note is focused on UBSAN reports. > > By far the most common UBSAN report is for the > lib/libc/stdio/fread.c:133:10 code. > > Another somewhat common UBSAN report is: > > Standard error: > /usr/main-src/usr.bin/cut/cut.c:458:7: runtime error: addition of unsigned > offset to 0x6210010d overflowed to 0x6210010c > SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior > /usr/main-src/usr.bin/cut/cut.c:458:7 in > Fail: incorrect exit status: 1, expected: 0 > > > There is at least one example of: > > Standard error: > ld-elf.so.1: /lib/libthr.so.3: Undefined symbol > "__asan_option_detect_stack_use_after_return" > > > Some more zero offsets to null are: > > +/usr/main-src/bin/sh/jobs.c:590:35: runtime error: applying zero offset to > null pointer > +SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior > /usr/main-src/bin/sh/jobs.c:590:35 in > +/usr/main-src/bin/sh/jobs.c:601:22: runtime error: applying zero offset to > null pointer > +SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior > /usr/main-src/bin/sh/jobs.c:601:22 in > +/usr/main-src/contrib/xz/src/liblzma/common/common.c:292:16: runtime error: > applying zero offset to null pointer > +SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior > /usr/main-src/contrib/xz/src/liblzma/common/common.c:292:16 in > > +/usr/main-src/usr.sbin/makefs/ffs.c:1053:35: runtime error: applying zero > offset to null pointer > +SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior > /usr/main-src/usr.sbin/makefs/ffs.c:1053:35 in > Files left in work directory after failure: dir, ufs.img > > > contrib/libxo/libxo/xo_buf.h has examples of non-zero offsets: > > +/usr/main-src/contrib/libxo/libxo/xo_buf.h:116:22: runtime error: applying > non-zero offset 4 to null pointer > +SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior > /usr/main-src/contrib/libxo/libxo/xo_buf.h:116:22 in > 
+/usr/main-src/contrib/libxo/libxo/xo_buf.h:116:44: runtime error: applying > zero offset to null pointer > +SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior > /usr/main-src/contrib/libxo/libxo/xo_buf.h:116:44 in > +/usr/main-src/contrib/libxo/libxo/xo_buf.h:120:29: runtime error: applying > non-zero offset 4 to null pointer > +SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior > /usr/main-src/contrib/libxo/libxo/xo_buf.h:120:29 in > > As does contrib/openzfs/module/nvpair/nvpair.c : > >
Re: FYI: An example type of UBSAN failure during kyua test -k /usr/tests/Kyuafile
On 2022-Jan-7, at 03:49, Mark Millard wrote: > Having done a buildworld with both WITH_ASAN= and WITH_UBSAN= > after finding what to control to allow the build, I installed > it in a directory tree for chroot use and have > "kyua test -k /usr/tests/Kyuafile" running. > > I see evidence of various examples of one type of undefined > behavior: "applying zero offset to null pointer" > > # more > /usr/obj/DESTDIRs/main-amd64-xSAN-chroot/tmp/kyua.FKD2vh/356/stderr.txt > /usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero > offset to null pointer > SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior > /usr/main-src/lib/libc/stdio/fread.c:133:10 in > /usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero > offset to null pointer > SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior > /usr/main-src/lib/libc/stdio/fread.c:133:10 in > /usr/main-src/usr.bin/sed/process.c:715:18: runtime error: applying zero > offset to null pointer > SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior > /usr/main-src/usr.bin/sed/process.c:715:18 in > /usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero > offset to null pointer > SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior > /usr/main-src/lib/libc/stdio/fread.c:133:10 in > Fail: stderr not empty > --- /dev/null 2022-01-07 10:29:57.182903000 + > +++ /tmp/kyua.FKD2vh/356/work/check.Mk9llD/stderr 2022-01-07 > 10:29:57.17310 + > @@ -0,0 +1,2 @@ > +/usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero > offset to null pointer > +SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior > /usr/main-src/lib/libc/stdio/fread.c:133:10 in > Files left in work directory after failure: mntpt, mounterr > > > In general the lib/libc/stdio/fread.c:133:10 example seems to > be in a place that would make it fairly common. > > usr.bin/sed/process.c:715:18 is more limited: just sed use. > kyua ran to completion. This note is focused on UBSAN reports. 
By far the most common UBSAN report is for the
lib/libc/stdio/fread.c:133:10 code.

Another somewhat common UBSAN report is:

Standard error:
/usr/main-src/usr.bin/cut/cut.c:458:7: runtime error: addition of unsigned offset to 0x6210010d overflowed to 0x6210010c
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/main-src/usr.bin/cut/cut.c:458:7 in
Fail: incorrect exit status: 1, expected: 0

There is at least one example of:

Standard error:
ld-elf.so.1: /lib/libthr.so.3: Undefined symbol "__asan_option_detect_stack_use_after_return"

Some more zero offsets to null are:

+/usr/main-src/bin/sh/jobs.c:590:35: runtime error: applying zero offset to null pointer
+SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/main-src/bin/sh/jobs.c:590:35 in
+/usr/main-src/bin/sh/jobs.c:601:22: runtime error: applying zero offset to null pointer
+SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/main-src/bin/sh/jobs.c:601:22 in
+/usr/main-src/contrib/xz/src/liblzma/common/common.c:292:16: runtime error: applying zero offset to null pointer
+SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/main-src/contrib/xz/src/liblzma/common/common.c:292:16 in

+/usr/main-src/usr.sbin/makefs/ffs.c:1053:35: runtime error: applying zero offset to null pointer
+SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/main-src/usr.sbin/makefs/ffs.c:1053:35 in
Files left in work directory after failure: dir, ufs.img

contrib/libxo/libxo/xo_buf.h has examples of non-zero offsets:

+/usr/main-src/contrib/libxo/libxo/xo_buf.h:116:22: runtime error: applying non-zero offset 4 to null pointer
+SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/main-src/contrib/libxo/libxo/xo_buf.h:116:22 in
+/usr/main-src/contrib/libxo/libxo/xo_buf.h:116:44: runtime error: applying zero offset to null pointer
+SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/main-src/contrib/libxo/libxo/xo_buf.h:116:44 in
+/usr/main-src/contrib/libxo/libxo/xo_buf.h:120:29: runtime error: applying non-zero offset 4 to null pointer
+SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/main-src/contrib/libxo/libxo/xo_buf.h:120:29 in

As does sys/contrib/openzfs/module/nvpair/nvpair.c:

/usr/main-src/sys/contrib/openzfs/module/nvpair/nvpair.c:3129:49: runtime error: applying non-zero offset 4 to null pointer
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/main-src/sys/contrib/openzfs/module/nvpair/nvpair.c:3129:49 in

There is a:

+/usr/main-src/bin/sh/arith_yacc.c:193:10: runtime error: negation of -9223372036854775808 cannot be represented in type 'arith_t' (aka 'long'); cast to an unsigned type to negate this value to itself
+SUMMARY: UndefinedBehaviorSanitizer: undefined-behavi
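The arith_yacc.c diagnostic above itself suggests casting to an unsigned type before negating. A minimal sketch of that idea follows, using plain long in place of arith_t and a hypothetical helper name negate_wrapping(); it is not the sh(1) code.

```c
#include <limits.h>

/*
 * Hypothetical helper: negate v without signed overflow. The
 * subtraction is done in unsigned long, where wraparound is well
 * defined. Converting an out-of-range result back to long is
 * implementation-defined rather than undefined; on two's complement
 * targets it yields LONG_MIN when v is LONG_MIN ("negates to itself").
 */
long
negate_wrapping(long v)
{
    return ((long)(0UL - (unsigned long)v));
}
```

For every other input the result is the ordinary arithmetic negation, so the helper differs from plain -v only for the most negative value.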
Re: FYI: An example type of UBSAN failure during kyua test -k /usr/tests/Kyuafile
On 07.01.22 at 12:49, Mark Millard wrote:
> Having done a buildworld with both WITH_ASAN= and WITH_UBSAN=
> after finding what to control to allow the build, I installed
> it in a directory tree for chroot use and have
> "kyua test -k /usr/tests/Kyuafile" running.
>
> I see evidence of various examples of one type of undefined
> behavior: "applying zero offset to null pointer"
>
> # more
> /usr/obj/DESTDIRs/main-amd64-xSAN-chroot/tmp/kyua.FKD2vh/356/stderr.txt
> /usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero
> offset to null pointer
> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior
> /usr/main-src/lib/libc/stdio/fread.c:133:10 in
> /usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero
> offset to null pointer
> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior
> /usr/main-src/lib/libc/stdio/fread.c:133:10 in
> /usr/main-src/usr.bin/sed/process.c:715:18: runtime error: applying zero
> offset to null pointer
> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior
> /usr/main-src/usr.bin/sed/process.c:715:18 in
> /usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero
> offset to null pointer
> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior
> /usr/main-src/lib/libc/stdio/fread.c:133:10 in
> Fail: stderr not empty
> --- /dev/null 2022-01-07 10:29:57.182903000 +
> +++ /tmp/kyua.FKD2vh/356/work/check.Mk9llD/stderr 2022-01-07
> 10:29:57.17310 +
> @@ -0,0 +1,2 @@
> +/usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero
> offset to null pointer
> +SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior
> /usr/main-src/lib/libc/stdio/fread.c:133:10 in
> Files left in work directory after failure: mntpt, mounterr
>
>
> In general the lib/libc/stdio/fread.c:133:10 example seems to
> be in a place that would make it fairly common.
Interesting find:

    while (resid > (r = fp->_r)) {
        (void)memcpy((void *)p, (void *)fp->_p, (size_t)r);
        fp->_p += r; /* line 133 */
        /* fp->_r = 0 ... done in __srefill */
        p += r;
        resid -= r;

If fp->_p == NULL in line 133, then NULL has been passed as source address
in memcpy() in the line above, and I'd think that is undefined behavior,
even if a length of 0 is passed at the same time.

Maybe the code block quoted above (lines 132 to 136) should be wrapped
in "if (r > 0) {}"?

Regards, Stefan
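The suggested "if (r > 0)" guard can be sketched outside of libc. The struct fake_file and the function drain_buffer() below are hypothetical stand-ins for the FILE fields _p and _r and for the copy step in the quoted loop; they are not the actual FreeBSD fread.c code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the two FILE fields used by the loop. */
struct fake_file {
    unsigned char *_p;  /* current position in the buffer, may be NULL */
    int _r;             /* bytes remaining in the buffer */
};

/*
 * Copy whatever is buffered into dst, at most resid bytes, and return
 * the number of bytes copied. When nothing is buffered (r == 0) neither
 * memcpy() nor the pointer arithmetic is performed, so a NULL _p is
 * never used as a source address or as the base of an offset.
 */
size_t
drain_buffer(struct fake_file *fp, unsigned char *dst, size_t resid)
{
    size_t r = (size_t)fp->_r;

    if (r > resid)
        r = resid;
    if (r > 0) {
        memcpy(dst, fp->_p, r);
        fp->_p += r;
        fp->_r -= (int)r;
    }
    return (r);
}
```

The guard costs one comparison per call and leaves the non-empty path unchanged.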
FYI: An example type of UBSAN failure during kyua test -k /usr/tests/Kyuafile
Having done a buildworld with both WITH_ASAN= and WITH_UBSAN=
after finding what to control to allow the build, I installed
it in a directory tree for chroot use and have
"kyua test -k /usr/tests/Kyuafile" running.

I see evidence of various examples of one type of undefined
behavior: "applying zero offset to null pointer"

# more /usr/obj/DESTDIRs/main-amd64-xSAN-chroot/tmp/kyua.FKD2vh/356/stderr.txt
/usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero offset to null pointer
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/main-src/lib/libc/stdio/fread.c:133:10 in
/usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero offset to null pointer
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/main-src/lib/libc/stdio/fread.c:133:10 in
/usr/main-src/usr.bin/sed/process.c:715:18: runtime error: applying zero offset to null pointer
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/main-src/usr.bin/sed/process.c:715:18 in
/usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero offset to null pointer
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/main-src/lib/libc/stdio/fread.c:133:10 in
Fail: stderr not empty
--- /dev/null 2022-01-07 10:29:57.182903000 +
+++ /tmp/kyua.FKD2vh/356/work/check.Mk9llD/stderr 2022-01-07 10:29:57.17310 +
@@ -0,0 +1,2 @@
+/usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero offset to null pointer
+SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/main-src/lib/libc/stdio/fread.c:133:10 in
Files left in work directory after failure: mntpt, mounterr

In general the lib/libc/stdio/fread.c:133:10 example seems to
be in a place that would make it fairly common.

usr.bin/sed/process.c:715:18 is more limited: just sed use.

===
Mark Millard
marklmi at yahoo.com
FYI: An example ASAN failure report during kyua test -k /usr/tests/Kyuafile
Having done a buildworld with both WITH_ASAN= and WITH_UBSAN=
after finding what to control to allow the build, I installed
it in a directory tree for chroot use and have
"kyua test -k /usr/tests/Kyuafile" running.

I see evidence of one AddressSanitizer report. (kyua is still
running.) The context is:

# more /usr/obj/DESTDIRs/main-amd64-xSAN-chroot/tmp/kyua.FKD2vh/434/stdout.txt
Executing command [ mkdir /tmp/kyua.FKD2vh/434/work/mntpt ]
mount -t tmpfs -o size=10M tmpfs /tmp/kyua.FKD2vh/434/work/mntpt
Executing command [ touch a ]
Executing command [ rm a ]
Executing command [ dd if=/dev/zero of=a bs=1m count=15 ]
Executing command [ rm a ]

# more /usr/obj/DESTDIRs/main-amd64-xSAN-chroot/tmp/kyua.FKD2vh/434/stderr.txt
=================================================================
==14384==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7fffa948 at pc 0x000801f38f5a bp 0x7fffa830 sp 0x7fffa828
WRITE of size 8 at 0x7fffa948 thread T0
    #0 0x801f38f59 in strtoimax_l /usr/main-src/lib/libc/stdlib/strtoimax.c:148:11
    #1 0x10de6c8 in strtoimax /usr/main-src/contrib/llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_common_interceptors.inc:3441:18
    #2 0x11a4723 in getq /usr/main-src/bin/test/test.c:560:6
    #3 0x11a4523 in intcmp /usr/main-src/bin/test/test.c:584:7
    #4 0x11a4523 in binop /usr/main-src/bin/test/test.c:351:10
    #5 0x11a2f06 in primary /usr/main-src/bin/test/test.c:317:10
    #6 0x11a2f06 in nexpr /usr/main-src/bin/test/test.c:275:9
    #7 0x11a28cb in aexpr /usr/main-src/bin/test/test.c:261:8
    #8 0x11a2a03 in aexpr /usr/main-src/bin/test/test.c:263:10
    #9 0x11a228b in oexpr /usr/main-src/bin/test/test.c:247:8
    #10 0x11a1fcf in testcmd /usr/main-src/bin/test/test.c:224:10
    #11 0x1145289 in evalcommand /usr/main-src/bin/sh/eval.c:1107:16
    #12 0x113eeb7 in evaltree /usr/main-src/bin/sh/eval.c:289:4
    #13 0x113fb34 in evaltree /usr/main-src/bin/sh/eval.c:225:4
    #14 0x113f86b in evaltree /usr/main-src/bin/sh/eval.c:212:4
    #15 0x1144d89 in evalcommand /usr/main-src/bin/sh/eval.c:1053:3
    #16 0x113eeb7 in evaltree /usr/main-src/bin/sh/eval.c:289:4
    #17 0x113fc55 in evaltree /usr/main-src/bin/sh/eval.c:241:4
    #18 0x1144d89 in evalcommand /usr/main-src/bin/sh/eval.c:1053:3
    #19 0x113eeb7 in evaltree /usr/main-src/bin/sh/eval.c:289:4
    #20 0x1144d89 in evalcommand /usr/main-src/bin/sh/eval.c:1053:3
    #21 0x113eeb7 in evaltree /usr/main-src/bin/sh/eval.c:289:4
    #22 0x113eb88 in evalstring /usr/main-src/bin/sh/eval.c
    #23 0x1179727 in main /usr/main-src/bin/sh/main.c:171:3

Address 0x7fffa948 is located in stack of thread T0 at offset 264 in frame
    #0 0x801f387ff in strtoimax_l /usr/main-src/lib/libc/stdlib/strtoimax.c:58

  This frame has 1 object(s):
    [32, 36) '__limit.i.i.i' <== Memory access at offset 264 overflows this variable
HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork
      (longjmp and C++ exceptions *are* supported)
SUMMARY: AddressSanitizer: stack-buffer-overflow /usr/main-src/lib/libc/stdlib/strtoimax.c:148:11 in strtoimax_l
Shadow bytes around the buggy address:
  0x44d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x44e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x44f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x4500: f1 f1 f1 f1 00 00 00 00 f1 f1 f1 f1 f8 f3 f3 f3
  0x4510: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x4520: 00 00 00 00 f3 f3 f3 f3 f3[f3]f3 f3 00 00 00 00
  0x4530: f1 f1 f1 f1 00 f3 f3 f3 00 00 00 00 00 00 00 00
  0x4540: f1 f1 f1 f1 00 f2 f2 f2 00 f3 f3 f3 00 00 00 00
  0x4550: f1 f1 f1 f1 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
  0x4560: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
  0x4570: f2 f2 f2 f2 f2 f2 f2 f2 f8 f8 f8 f8 f8 f8 f8 f8
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable: 00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone: fa
  Freed heap region: fd
  Stack left redzone: f1
  Stack mid redzone: f2
  Stack right redzone: f3
  Stack after return: f5
  Stack use after scope: f8
  Global redzone: f9
  Global init order: f6
  Poisoned by user: f7
  Container overflow: fc
  Array cookie: ac
  Intra object redzone: bb
  ASan internal: fe
  Left alloca redzone: ca
  Right alloca redzone: cb
==14384==ABORTING
Files left in work directory after failure: mntpt, mounterr

===
Mark Millard
marklmi at yahoo.com
Re: FYI: WITH_REPRODUCIBLE_BUILD= problem for some files? [aarch64 test did not reproduce the issue]
On 2021-May-4, at 20:26, Mark Millard wrote:

> On 2021-May-4, at 13:38, Mark Millard wrote:
>
>> [The first buildworld is still in process. So while waiting . . .]
>>
>> On 2021-May-4, at 10:31, Mark Millard wrote:
>>
>>> I probably know why the huge count of differences this time
>>> unlike the original report . . .
>>>
>>> Previously I built based on a checked-in branch as part of
>>> my experimenting. This time it was in a -dirty form (not
>>> checked in), again as part of my experimental exploration.
>>>
>>> WITH_REPRODUCIBLE_BUILD= makes a distinction between these
>>> if I remember right: (partially?) disabling itself for
>>> -dirty style.
>>>
>>> To reproduce the original style of test I need to create
>>> a branch with my few patches checked in and do the
>>> buildworlds from that branch.
>>>
>>> This will, of course, take a while.
>>>
>>> Sorry for the noise.
>>>
>>
>> I've confirmed some of the details of the large number of
>> files with differences while waiting for the 1st buildworld:
>>
>> The 4 bytes at the end of the .gnu_debuglink section
>> that are ending up different are the checksum for the
>> .debug file. The .debug files have differences such as:
>>
>> │ -<1a> DW_AT_comp_dir: (indirect string)
>> /usr/obj/BUILDs/13_0R-CA72-nodbg-clang-alt/usr/13_0R-src/arm64.aarch64/lib/csu/aarch64
>> │ +<1a> DW_AT_comp_dir: (indirect string)
>> /usr/obj/BUILDs/13_0R-CA72-nodbg-clang/usr/13_0R-src/arm64.aarch64/lib/csu/aarch64
>>
>> So I need to build, snapshot (in case need
>> to reference), install, clean-out, build,
>> install elsewhere, compare. (Or analogous
>> that uses the same build base-path for both
>> installs despite separate buildworld's.)
>> This is separate from any potential -dirty
>> vs. checked-in handling variation by
>> WITH_REPRODUCIBLE_BUILD= .
>>
>> My process that produced the original armv7
>> report happened to do that before I accidentally
>> discovered the presence of the few files with
>> differences. My new experiments were different
>> and I'd not thought of needing to vary the
>> procedure to get you the right evidence.
>>
>
> The two aarch64 test installs did not show any
> differences in a "diff -rq" . Ignoring *.meta
> files generated during the builds, the build
> directory tree snapshots showed just the
> differences:
>
> # diff -rq
> /usr/obj/BUILDs/13_0R-CA72-nodbg-clang-alt/.zfs/snapshot/commited-style-0/usr
> /usr/obj/BUILDs/13_0R-CA72-nodbg-clang-alt/.zfs/snapshot/commited-style-1/usr
> | grep -v '\.meta' | more
> Files
> /usr/obj/BUILDs/13_0R-CA72-nodbg-clang-alt/.zfs/snapshot/commited-style-0/usr/13_0R-src/arm64.aarch64/stand/ficl/softcore.c
> and
> /usr/obj/BUILDs/13_0R-CA72-nodbg-clang-alt/.zfs/snapshot/commited-style-1/usr/13_0R-src/arm64.aarch64/stand/ficl/softcore.c
> differ
> Files
> /usr/obj/BUILDs/13_0R-CA72-nodbg-clang-alt/.zfs/snapshot/commited-style-0/usr/13_0R-src/arm64.aarch64/toolchain-metadata.mk
> and
> /usr/obj/BUILDs/13_0R-CA72-nodbg-clang-alt/.zfs/snapshot/commited-style-1/usr/13_0R-src/arm64.aarch64/toolchain-metadata.mk
> differ
>
> # diff -u
> /usr/obj/BUILDs/13_0R-CA72-nodbg-clang-alt/.zfs/snapshot/commited-style-0/usr/13_0R-src/arm64.aarch64/stand/ficl/softcore.c
>
> /usr/obj/BUILDs/13_0R-CA72-nodbg-clang-alt/.zfs/snapshot/commited-style-1/usr/13_0R-src/arm64.aarch64/stand/ficl/softcore.c
> ---
> /usr/obj/BUILDs/13_0R-CA72-nodbg-clang-alt/.zfs/snapshot/commited-style-0/usr/13_0R-src/arm64.aarch64/stand/ficl/softcore.c
> 2021-05-04 13:45:14.463351000 -0700
> +++
> /usr/obj/BUILDs/13_0R-CA72-nodbg-clang-alt/.zfs/snapshot/commited-style-1/usr/13_0R-src/arm64.aarch64/stand/ficl/softcore.c
> 2021-05-04 19:04:32.338203000 -0700
> @@ -4,7 +4,7 @@
> ** Words from CORE set written in FICL
> ** Author: John Sadler (john_sad...@alum.mit.edu)
> ** Created: 27 December 1997
> -** Last update: Tue May 4 13:45:14 PDT 2021
> +** Last update: Tue May 4 19:04:32 PDT 2021
> ***/
> /*
> ** DO NOT EDIT THIS FILE -- it is generated by softwords/softcore.awk
>
> # diff -u
> /usr/obj/BUILDs/13_0R-CA72-nodbg-clang-alt/.zfs/snapshot/commited-style-0/usr/13_0R-src/arm64.aarch64/toolchain-metadata.mk
>
> /usr/obj/BUILDs/13_0R-CA72-nodbg-clang-alt/.zfs/snapshot/commited-style-1/usr/13_0R-src/arm64.aarch64/toolchain-metadata.mk
> ---
> /usr/obj/BUILDs/13_0R-CA7
Re: FYI: WITH_REPRODUCIBLE_BUILD= problem for some files? [aarch64 test did not reproduce the issue]
On 2021-May-4, at 13:38, Mark Millard wrote: > [The first buidlworld is still in process. So while waiting . . .] > > On 2021-May-4, at 10:31, Mark Millard wrote: > >> I probably know why the huge count of differences this time >> unlike the original report . . . >> >> Previously I built based on a checked-in branch as part of >> my experimenting. This time it was in a -dirty form (not >> checked in), again as part of my experimental exploration. >> >> WITH_REPRODUCIBLE_BUILD= makes a distinction between these >> if I remember right: (partially?) disabling itself for >> -dirty style. >> >> To reproduce the original style of test I need to create >> a branch with my few patches checked in and do the >> buildworlds from that branch. >> >> This will, of course, take a while. >> >> Sorry for the noise. >> > > I've confirmed some of the details of the large number of > files with difference while waiting for the 1st buildworld : > > The 4 bytes at the end of the .gnu_debuglink section > that are ending up different are the checksum for the > .debug file. The .debug files have differences such as: > > │ -<1a> DW_AT_comp_dir: (indirect string) > /usr/obj/BUILDs/13_0R-CA72-nodbg-clang-alt/usr/13_0R-src/arm64.aarch64/lib/csu/aarch64 > │ +<1a> DW_AT_comp_dir: (indirect string) > /usr/obj/BUILDs/13_0R-CA72-nodbg-clang/usr/13_0R-src/arm64.aarch64/lib/csu/aarch64 > > So I need to build, snapshot (in case need > to reference), install, clean-out, build, > install elsewhere, compare. (Or analogous > that uses the same build base-path for both > installs despite separate buildworld's.) > This is separate from any potential -dirty > vs. checked-in handling variation by > WITH_REPRODUCIBLE_BUILD= . > > My process that produced the original armv7 > report happened to do that before I accidentally > discovered the presence of the few files with > differences. My new experiments were different > and I'd not though of needing to vary the > procedure to get you the right evidence. 
> The two aarch64 test installs did not show any differences in a "diff -rq" . Ignoring *.meta files generated during the builds, the build directory tree snapshots showed just the differences: # diff -rq /usr/obj/BUILDs/13_0R-CA72-nodbg-clang-alt/.zfs/snapshot/commited-style-0/usr /usr/obj/BUILDs/13_0R-CA72-nodbg-clang-alt/.zfs/snapshot/commited-style-1/usr | grep -v '\.meta' | more Files /usr/obj/BUILDs/13_0R-CA72-nodbg-clang-alt/.zfs/snapshot/commited-style-0/usr/13_0R-src/arm64.aarch64/stand/ficl/softcore.c and /usr/obj/BUILDs/13_0R-CA72-nodbg-clang-alt/.zfs/snapshot/commited-style-1/usr/13_0R-src/arm64.aarch64/stand/ficl/softcore.c differ Files /usr/obj/BUILDs/13_0R-CA72-nodbg-clang-alt/.zfs/snapshot/commited-style-0/usr/13_0R-src/arm64.aarch64/toolchain-metadata.mk and /usr/obj/BUILDs/13_0R-CA72-nodbg-clang-alt/.zfs/snapshot/commited-style-1/usr/13_0R-src/arm64.aarch64/toolchain-metadata.mk differ # diff -u /usr/obj/BUILDs/13_0R-CA72-nodbg-clang-alt/.zfs/snapshot/commited-style-0/usr/13_0R-src/arm64.aarch64/stand/ficl/softcore.c /usr/obj/BUILDs/13_0R-CA72-nodbg-clang-alt/.zfs/snapshot/commited-style-1/usr/13_0R-src/arm64.aarch64/stand/ficl/softcore.c --- /usr/obj/BUILDs/13_0R-CA72-nodbg-clang-alt/.zfs/snapshot/commited-style-0/usr/13_0R-src/arm64.aarch64/stand/ficl/softcore.c 2021-05-04 13:45:14.463351000 -0700 +++ /usr/obj/BUILDs/13_0R-CA72-nodbg-clang-alt/.zfs/snapshot/commited-style-1/usr/13_0R-src/arm64.aarch64/stand/ficl/softcore.c 2021-05-04 19:04:32.338203000 -0700 @@ -4,7 +4,7 @@ ** Words from CORE set written in FICL ** Author: John Sadler (john_sad...@alum.mit.edu) ** Created: 27 December 1997 -** Last update: Tue May 4 13:45:14 PDT 2021 +** Last update: Tue May 4 19:04:32 PDT 2021 ***/ /* ** DO NOT EDIT THIS FILE -- it is generated by softwords/softcore.awk # diff -u /usr/obj/BUILDs/13_0R-CA72-nodbg-clang-alt/.zfs/snapshot/commited-style-0/usr/13_0R-src/arm64.aarch64/toolchain-metadata.mk 
/usr/obj/BUILDs/13_0R-CA72-nodbg-clang-alt/.zfs/snapshot/commited-style-1/usr/13_0R-src/arm64.aarch64/toolchain-metadata.mk --- /usr/obj/BUILDs/13_0R-CA72-nodbg-clang-alt/.zfs/snapshot/commited-style-0/usr/13_0R-src/arm64.aarch64/toolchain-metadata.mk 2021-05-04 10:55:26.030179000 -0700 +++ /usr/obj/BUILDs/13_0R-CA72-nodbg-clang-alt/.zfs/snapshot/commited-style-1/usr/13_0R-src/arm64.aarch64/toolchain-metadata.mk 2021-05-04 16:14:24.513346000 -0700 @@ -1,4 +1,4 @@ -.info Using cached toolchain metadata from build at CA72_4c8G_ZFS on Tue May 4 10:55:26 PDT 2021 +.info Using cached toolchain metadata from build at CA72_4c8G_ZFS on Tue May 4 16:14:24 P
Re: FYI: WITH_REPRODUCIBLE_BUILD= problem for some files? [Ignore recent test: -dirty vs. checked-in usage difference]
[The first buildworld is still in progress. So while waiting . . .] On 2021-May-4, at 10:31, Mark Millard wrote: > I probably know why the huge count of differences this time > unlike the original report . . . > > Previously I built based on a checked-in branch as part of > my experimenting. This time it was in a -dirty form (not > checked in), again as part of my experimental exploration. > > WITH_REPRODUCIBLE_BUILD= makes a distinction between these > if I remember right: (partially?) disabling itself for > -dirty style. > > To reproduce the original style of test I need to create > a branch with my few patches checked in and do the > buildworlds from that branch. > > This will, of course, take a while. > > Sorry for the noise. > I've confirmed some of the details of the large number of files with differences while waiting for the 1st buildworld: The 4 bytes at the end of the .gnu_debuglink section that end up different are the checksum for the .debug file. The .debug files have differences such as: │ -<1a> DW_AT_comp_dir: (indirect string) /usr/obj/BUILDs/13_0R-CA72-nodbg-clang-alt/usr/13_0R-src/arm64.aarch64/lib/csu/aarch64 │ +<1a> DW_AT_comp_dir: (indirect string) /usr/obj/BUILDs/13_0R-CA72-nodbg-clang/usr/13_0R-src/arm64.aarch64/lib/csu/aarch64 That is, the two builds used different build base-paths, and that path is recorded in the debug information. So I need to build, snapshot (in case I need to refer back to it), install, clean out, build, install elsewhere, and compare. (Or an analogous sequence that uses the same build base-path for both installs despite separate buildworlds.) This is separate from any potential -dirty vs. checked-in handling variation by WITH_REPRODUCIBLE_BUILD= . My process that produced the original armv7 report happened to do that before I accidentally discovered the presence of the few files with differences. My new experiments were different and I had not thought of needing to vary the procedure to get you the right evidence.
=== Mark Millard marklmi at yahoo.com ( dsl-only.net went away in early 2018-Mar) ___ freebsd-current@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
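The "compare" step of the procedure above could be sketched as a small shell function built on the same diff -rq and grep -v '\.meta' combination used in the aarch64 follow-up test. This is only a minimal sketch: the function name compare_trees is hypothetical, and the two tree paths are whatever install or snapshot directories are being compared.

```sh
#!/bin/sh
# compare_trees is a hypothetical helper name, not a FreeBSD tool.
# Usage: compare_trees TREE_A TREE_B
# Prints one "Files ... differ" or "Only in ..." line per difference,
# skipping the *.meta files that the build generates.
compare_trees() {
    # diff exits 1 when the trees differ and grep exits 1 when nothing
    # survives the filter; neither is treated as an error in this
    # sketch, hence the "|| true".
    diff -rq "$1" "$2" | grep -v '\.meta' || true
}
```

Called as compare_trees followed by the two tree paths, an empty result means that no differences remain once the *.meta files are ignored.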
Re: FYI: WITH_REPRODUCIBLE_BUILD= problem for some files? [Ignore recent test: -dirty vs. checked-in usage difference]
I probably know why there was a huge count of differences this time, unlike in the original report . . . Previously I built based on a checked-in branch as part of my experimenting. This time it was in a -dirty form (not checked in), again as part of my experimental exploration. WITH_REPRODUCIBLE_BUILD= makes a distinction between these if I remember right: (partially?) disabling itself for -dirty style. To reproduce the original style of test I need to create a branch with my few patches checked in and do the buildworlds from that branch. This will, of course, take a while. Sorry for the noise. === Mark Millard marklmi at yahoo.com ( dsl-only.net went away in early 2018-Mar)
Re: test suite for NIC features...
Hi! > Has anyone compiled a script/test suite for testing various NIC > features to make sure they work/function properly? > > That is, being able to run a couple of interfaces back to back, turn > the features off on one, and make sure things like checksum offload > and the like work properly? I don't know of any project of that kind, but it sounds very useful. -- p...@opsec.eu +49 171 3101372 Now what ?
test suite for NIC features...
Has anyone compiled a script/test suite for testing various NIC features to make sure they work/function properly? That is, being able to run a couple of interfaces back to back, turn the features off on one, and make sure things like checksum offload and the like work properly? -- John-Mark Gurney Voice: +1 415 225 5579 "All that I will do, has been done, All that I have, has not."
Re: Undeletable files after kyua test runs
On Sat, Jul 04, 2020 at 10:02:45AM -0700, Enji Cooper wrote: > > On Jul 4, 2020, at 8:59 AM, Enji Cooper wrote: > >> On Jul 2, 2020, at 7:57 PM, Enji Cooper wrote: > >>> On Jun 29, 2020, at 10:26 AM, Gordon Bergling wrote: > >>> > >>> Hi, > >>> > >>> I recently stumbled across undeletable files that are generated by kyua > >>> test runs, > >>> for example > >>> > >>> -rwxr-xr-x 1 root wheel 0 May 9 13:10 > >>> /tmp/kyua.aB4q62/8676/work/fileforaudit > >>> > >>> I haven't yet identified the test that generate those files, but it is > >>> impossible > >>> to delete them. I have clear_tmp_enable="YES" set in the /etc/rc.conf, > >>> but > >>> on every boot the system argues that these file aren't deletable. > >>> I tried to 'rm -rf' them by hand but, even this wasn't possible. I have > >>> looked for > >>> any extend attributes, but I didn't find any. > >>> > >>> Has anyone an idea how this is possible and may how these files can be > >>> deleted? > >> > >> The issue is tests/sys/audit/file-attribute-modify.c , based on the file > >> present that can’t be deleted. Can you please provide more information > >> about the test run in a PR (I see how it can leave files behind, but I > >> want to make sure it is what I think it is, first)? > > > > PR filed: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=247761 . Working on a > > CR. > > Hi, > I just created the following CR: https://reviews.freebsd.org/D25561 and added the following related GitHub > issue to PR: https://github.com/jmmv/kyua/issues/142 . This should be completed to avoid > issues like this in the future from occurring. > Thank you for the report! > -Enji Hi Enji, thanks for taking care of this issue and creating a PR. I didn't find the time in the last 3 days.
Do you still need information about the kyua runs? I usually just run # kyua test -k /usr/tests/Kyuafile once in a while, which leaves behind, for example, the following undeletable files: -rwxr-xr-x 1 root wheel 0 Jul 1 12:44 /tmp/kyua.gv1loN/8718/work/fileforaudit -rwxr-xr-x 1 root wheel 0 Jul 5 08:50 /tmp/kyua.FH0CAp/8718/work/fileforaudit --Gordon
Re: Undeletable files after kyua test runs
> On Jul 4, 2020, at 8:59 AM, Enji Cooper wrote: > >> >> On Jul 2, 2020, at 7:57 PM, Enji Cooper wrote: >> >>> >>> On Jun 29, 2020, at 10:26 AM, Gordon Bergling wrote: >>> >>> Hi, >>> >>> I recently stumbled across undeletable files that are generated by kyua >>> test runs, >>> for example >>> >>> -rwxr-xr-x 1 root wheel 0 May 9 13:10 >>> /tmp/kyua.aB4q62/8676/work/fileforaudit >>> >>> I haven't yet identified the test that generate those files, but it is >>> impossible >>> to delete them. I have clear_tmp_enable="YES" set in the /etc/rc.conf, but >>> on every boot the system argues that these file aren't deletable. >>> I tried to 'rm -rf' them by hand but, even this wasn't possible. I have >>> looked for >>> any extend attributes, but I didn't find any. >>> >>> Has anyone an idea how this is possible and may how these files can be >>> deleted? >> >> The issue is tests/sys/audit/file-attribute-modify.c , based on the file >> present that can’t be deleted. Can you please provide more information about >> the test run in a PR (I see how it can leave files behind, but I want to >> make sure it is what I think it is, first)? > > PR filed: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=247761 . Working on a CR. Hi, I just created the following CR: https://reviews.freebsd.org/D25561 and added the following related GitHub issue to the PR: https://github.com/jmmv/kyua/issues/142 . This should be completed to avoid issues like this occurring in the future. Thank you for the report! -Enji
Re: Undeletable files after kyua test runs
> On Jul 2, 2020, at 7:57 PM, Enji Cooper wrote: > >> >> On Jun 29, 2020, at 10:26 AM, Gordon Bergling wrote: >> >> Hi, >> >> I recently stumbled across undeletable files that are generated by kyua test >> runs, >> for example >> >> -rwxr-xr-x 1 root wheel 0 May 9 13:10 >> /tmp/kyua.aB4q62/8676/work/fileforaudit >> >> I haven't yet identified the test that generate those files, but it is >> impossible >> to delete them. I have clear_tmp_enable="YES" set in the /etc/rc.conf, but >> on every boot the system argues that these file aren't deletable. >> I tried to 'rm -rf' them by hand but, even this wasn't possible. I have >> looked for >> any extend attributes, but I didn't find any. >> >> Has anyone an idea how this is possible and may how these files can be >> deleted? > > The issue is tests/sys/audit/file-attribute-modify.c , based on the file > present that can’t be deleted. Can you please provide more information about > the test run in a PR (I see how it can leave files behind, but I want to make > sure it is what I think it is, first)? PR filed: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=247761 . Working on a CR. Thanks, -Enji
Re: Undeletable files after kyua test runs
> On Jun 29, 2020, at 10:26 AM, Gordon Bergling wrote: > > Hi, > > I recently stumbled across undeletable files that are generated by kyua test > runs, > for example > > -rwxr-xr-x 1 root wheel 0 May 9 13:10 > /tmp/kyua.aB4q62/8676/work/fileforaudit > > I haven't yet identified the test that generate those files, but it is > impossible > to delete them. I have clear_tmp_enable="YES" set in the /etc/rc.conf, but > on every boot the system argues that these file aren't deletable. > I tried to 'rm -rf' them by hand but, even this wasn't possible. I have > looked for > any extend attributes, but I didn't find any. > > Has anyone an idea how this is possible and may how these files can be > deleted? Judging by the name of the file that can’t be deleted, the issue is in tests/sys/audit/file-attribute-modify.c . Can you please provide more information about the test run in a PR (I see how it can leave files behind, but I want to make sure it is what I think it is, first)? Cheers, -Enji
Re: Undeletable files after kyua test runs
On Mon, Jun 29, 2020 at 01:18:12PM -0600, Ian Lepore wrote: > On Mon, 2020-06-29 at 21:08 +0200, Gordon Bergling wrote: > > On Mon, Jun 29, 2020 at 11:58:47AM -0700, Rodney W. Grimes wrote: > > > > On Mon, Jun 29, 2020 at 10:32:38AM -0700, Kevin Oberman wrote: > > > > > On Mon, Jun 29, 2020 at 10:26 AM Gordon Bergling < > > > > > g...@freebsd.org> wrote: > > > > > > I recently stumbled across undeletable files that are > > > > > > generated by kyua > > > > > > test runs, > > > > > > for example > > > > > > > > > > > > -rwxr-xr-x 1 root wheel 0 May 9 13:10 > > > > > > /tmp/kyua.aB4q62/8676/work/fileforaudit > > > > > > > > > > > > I haven't yet identified the test that generate those files, > > > > > > but it is > > > > > > impossible > > > > > > to delete them. I have clear_tmp_enable="YES" set in the > > > > > > /etc/rc.conf, but > > > > > > on every boot the system argues that these file aren't > > > > > > deletable. > > > > > > I tried to 'rm -rf' them by hand but, even this wasn't > > > > > > possible. I have > > > > > > looked for > > > > > > any extend attributes, but I didn't find any. > > > > > > > > > > > > Has anyone an idea how this is possible and may how these > > > > > > files can be > > > > > > deleted? > > > > > > > > > > Have you done 'ls -o' to check for flags like schg? > > > > > -- > > > > > Kevin Oberman, Part time kid herder and retired Network > > > > > Engineer > > > > > E-mail: rkober...@gmail.com > > > > > PGP Fingerprint: D03FB98AFA78E3B78C1694B318AB39EF1B055683 > > > > > > > > Argh, I haven't thought about chflags for quite some time. The > > > > chflags > > > > bit was set and after an > > > > > > > > # find /tmp/ -type f -exec chflags -R 0 {} \; > > > > > >^^Only files ^^ meaningless when chflags is > > > given ONLY files > > > > > > You probably could of done: > > > chflags -R 0 /tmp/ > > > > Okay, I am currently working on an update for clear_tmp_enable="YES" > > to include > > a check like this. 
I would think that an rc option like this should > > delete > > everything in /tmp. > > > > I disagree. One of the few things those immutable flags are good for > is protecting files from things like an rc script or other automation > that deletes files. Those flags are typically set and maintained by > users and admins, and automation should not change them in order to > delete files. > > The real fix we need is for the kyua tests to properly clean up after > themselves, including fixing the flags on temporary files created or > used by the tests, and then deleting them. > > -- Ian Fixing the rc script was my first idea, but of course I also thought that the kyua test could be fixed so that it does not leave behind files whose flags keep a job that is supposed to delete them from doing so. I'll take this as homework and look at the kyua tests that created those files. --Gordon
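The check suggested earlier in the thread ('ls -o' to look for flags like schg) could be wrapped in a small filter along the following lines. This is a sketch under stated assumptions: the helper name flagged_entries is hypothetical, and it assumes the layout of FreeBSD's long listing with -o, where the file flags are the 5th column and show as "-" when none are set.

```sh
#!/bin/sh
# flagged_entries is a hypothetical helper name.
# Reads `ls -lo` output on stdin and prints only the entries that have
# file flags set. Assumption: the flags are the 5th column and show as
# "-" when no flag is set, as in FreeBSD's long listing with -o.
flagged_entries() {
    # NF >= 10 skips the "total N" header line and blank lines.
    awk 'NF >= 10 && $5 != "-"'
}

# Example (not run here; the path is a placeholder):
#   ls -lo /tmp/kyua.XXXXXX/8676/work | flagged_entries
# If the flags really should go, they can be cleared recursively as
# suggested in the thread, and the directory removed afterwards:
#   chflags -R 0 /tmp/kyua.XXXXXX && rm -rf /tmp/kyua.XXXXXX
```

Keeping the listing and the clearing as separate manual steps matches the objection raised above: the flags are only dropped after someone has looked at what carries them, rather than by automation. Clearing system flags such as schg also needs root and a securelevel that permits it.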
Re: Undeletable files after kyua test runs
> On 29. Jun 2020, at 21:18, Ian Lepore wrote: > > On Mon, 2020-06-29 at 21:08 +0200, Gordon Bergling wrote: >> On Mon, Jun 29, 2020 at 11:58:47AM -0700, Rodney W. Grimes wrote: >>>> On Mon, Jun 29, 2020 at 10:32:38AM -0700, Kevin Oberman wrote: >>>>> On Mon, Jun 29, 2020 at 10:26 AM Gordon Bergling < >>>>> g...@freebsd.org> wrote: >>>>>> I recently stumbled across undeletable files that are >>>>>> generated by kyua >>>>>> test runs, >>>>>> for example >>>>>> >>>>>> -rwxr-xr-x 1 root wheel 0 May 9 13:10 >>>>>> /tmp/kyua.aB4q62/8676/work/fileforaudit >>>>>> >>>>>> I haven't yet identified the test that generate those files, >>>>>> but it is >>>>>> impossible >>>>>> to delete them. I have clear_tmp_enable="YES" set in the >>>>>> /etc/rc.conf, but >>>>>> on every boot the system argues that these file aren't >>>>>> deletable. >>>>>> I tried to 'rm -rf' them by hand but, even this wasn't >>>>>> possible. I have >>>>>> looked for >>>>>> any extend attributes, but I didn't find any. >>>>>> >>>>>> Has anyone an idea how this is possible and may how these >>>>>> files can be >>>>>> deleted? >>>>> >>>>> Have you done 'ls -o' to check for flags like schg? >>>>> -- >>>>> Kevin Oberman, Part time kid herder and retired Network >>>>> Engineer >>>>> E-mail: rkober...@gmail.com >>>>> PGP Fingerprint: D03FB98AFA78E3B78C1694B318AB39EF1B055683 >>>> >>>> Argh, I haven't thought about chflags for quite some time. The >>>> chflags >>>> bit was set and after an >>>> >>>> # find /tmp/ -type f -exec chflags -R 0 {} \; >>> >>> ^^Only files ^^ meaningless when chflags is >>> given ONLY files >>> >>> You probably could of done: >>> chflags -R 0 /tmp/ >> >> Okay, I am currently working on an update for clear_tmp_enable="YES" >> to include >> a check like this. I would think that an rc option like this should >> delete >> everything in /tmp. >> > > I disagree. 
One of the few things those immutable flags are good for > is protecting files from things like an rc script or other automation > that deletes files. Those flags are typically set and maintained by > users and admins, and automation should not change them in order to > delete files. > > The real fix we need is for the kyua tests to properly clean up after > themselves, including fixing the flags on temporary files created or > used by the tests, and then deleting them. > +1, having a routine script remove schg automatically IMHO defeats the purpose of setting this flag. Cheers, Michael