
On Sep 4, 2023, at 18:39, Mark Millard  wrote:

> On Sep 4, 2023, at 10:05, Alexander Motin  wrote:
> 
>> On 04.09.2023 11:45, Mark Millard wrote:
>>> On Sep 4, 2023, at 06:09, Alexander Motin  wrote:
>>>> per_txg_dirty_frees_percent is directly related to the delete delays we 
>>>> see here.  You are forcing ZFS to commit transactions each 5% of dirty ARC 
>>>> limit, which is 5% of 10% of memory size.  I haven't looked at that code 
>>>> recently, but I guess setting it too low can make ZFS commit transactions 
>>>> too often, increasing write inflation for the underlying storage.  I would 
>>>> propose you to restore the default and try again.
>>> While this machine is different, the original problem was worse than
>>> the issue here: the load average was less than 1 for the most part
>>> the parallel bulk build when 30 was used. The fraction of time waiting
>>> was much longer than with 5. If I understand right, both too high and
>>> too low for a type of context can lead to increased elapsed time and
>>> getting it set to a near optimal is a non-obvious exploration.
>> 
>> IIRC this limit was modified several times since originally implemented.  
>> May be it could benefit from another look, if default 30% is not good.  It 
>> would be good if generic ZFS issues like this were reported to OpenZFS 
>> upstream to be visible to a wider public.  Unfortunately I have several 
>> other project I must work on, so if it is not a regression I can't promise 
>> I'll take it right now, so anybody else is welcome.
> 
> As I understand, there are contexts where 5 is inappropriate
> and 30 works fairly well: no good single answer as to what
> value range will avoid problems.
> 
>>> An overall point for the goal of my activity is: what makes a
>>> good test context for checking if ZFS is again safe to use?
>>> May be other tradeoffs make, say, 4 hardware threads more
>>> reasonable than 32.
>> 
>> Thank you for your testing.  The best test is one that nobody else run. It 
>> also correlates with the topic of "safe to use", which also depends on what 
>> it is used for. :)
> 
> Looks like use of a M.2 Samsung SSD 960 PRO 1TB with a
> non-debug FreeBSD build is suitable for the bulk -a -J128
> test (no ALLOW_MAKE_JOBS variants enabled, USE_TMPFS=no in
> use) on the 32 hardware thread system. (The swap partition
> in use is from the normal environment's PCIe Optane media.)
> The %idle and the load averages and %user stayed reasonable
> in a preliminary test. One thing it does introduce is trim
> management (both available and potentially useful). (Optane
> media does not support or need it.) No
> per_txg_dirty_frees_percent adjustment involved (still 5).
> 
> I've learned to not use ^T for fear of /bin/sh aborting
> and messing up poudriere's context. So I now monitor with:
> 
> # poudriere status -b
> 
> in a separate ssh session.
> 
> I'll note that I doubt I'd try for a complete bulk -a .
> I'd probably stop it if I notice that the number of
> active builders drops off for a notable time (normal
> waiting for prerequisites appearing to be why).
> 
> 

So much for that idea. It has reached a state of staying
under 3500 w/s and up to 4.5ms/w (normally above 2ms/w).
%busy is wandering in the range 85% to 101%. Lots of top
output showing tx->tx. There is some read and other activity
as well. Of course the kBps figures are larger than in the
earlier USB3 context (larger kB figures).

It reached about 1350 port->package builds over the first
hour after the 2nd "Builder started".

autotrim is off. Doing a "zpool trim -w zamd64" leads to
. . . larger w/s figures during the process. So
more exploring to do at some point. Possibly:

autotrim
per_txg_dirty_frees_percent

For now, I'm just running "zpool trim -w zamd64" once
in a while so the process continues better.

Still no evidence of deadlocks. No evidence of builds
failing for corruptions.

. . . At around the end of 2nd hour: 2920 or so built. 

Still no evidence of deadlocks. No evidence of builds
failing for corruptions.

. . . I've turned on autotrim without stopping the bulk
build.

. . . At around the end of 3rd hour: 4080 or so built. 

Still no evidence of deadlocks. No evidence of builds
failing for corruptions.

Looks like the % idle has been high for a significant
time. I think I'll stop this specific test and clean
things out.
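
The build counts reported above (about 1350, 2920 and 4080) can be turned
into per-hour rates to make the slowdown explicit. A minimal Python sketch,
treating the three figures as end-of-hour cumulative totals, which is only
an approximation of how they were taken; the function name is illustrative:

```python
def per_interval(cumulative):
    """Turn cumulative totals into per-interval increments."""
    previous = 0
    rates = []
    for total in cumulative:
        rates.append(total - previous)
        previous = total
    return rates


# Approximate cumulative port->package builds at the end of hours 1, 2
# and 3, as reported in this message.
built = [1350, 2920, 4080]
rates = per_interval(built)
print(rates)  # [1350, 1570, 1160]
```

On these rough figures the third hour is the slowest of the three, which
is consistent with the high %idle noted above.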

Looks like lang/guile* are examples of not respecting
the lack of ALLOW_MAKE_JOBS use.

Hmm. The ^C ended up with:

^C[03:41:07] Error: Signal SIGINT caught, cleaning up and exiting
[main-amd64-bulk_a-defau

Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned

On Sep 4, 2023, at 10:05, Alexander Motin  wrote:

> On 04.09.2023 11:45, Mark Millard wrote:
>> On Sep 4, 2023, at 06:09, Alexander Motin  wrote:
>>> per_txg_dirty_frees_percent is directly related to the delete delays we see 
>>> here.  You are forcing ZFS to commit transactions each 5% of dirty ARC 
>>> limit, which is 5% of 10% of memory size.  I haven't looked at that code 
>>> recently, but I guess setting it too low can make ZFS commit transactions 
>>> too often, increasing write inflation for the underlying storage.  I would 
>>> propose you to restore the default and try again.
>> While this machine is different, the original problem was worse than
>> the issue here: the load average was less than 1 for the most part
>> the parallel bulk build when 30 was used. The fraction of time waiting
>> was much longer than with 5. If I understand right, both too high and
>> too low for a type of context can lead to increased elapsed time and
>> getting it set to a near optimal is a non-obvious exploration.
> 
> IIRC this limit was modified several times since originally implemented.  May 
> be it could benefit from another look, if default 30% is not good.  It would 
> be good if generic ZFS issues like this were reported to OpenZFS upstream to 
> be visible to a wider public.  Unfortunately I have several other project I 
> must work on, so if it is not a regression I can't promise I'll take it right 
> now, so anybody else is welcome.

As I understand, there are contexts where 5 is inappropriate
and 30 works fairly well: no good single answer as to what
value range will avoid problems.

>> An overall point for the goal of my activity is: what makes a
>> good test context for checking if ZFS is again safe to use?
>> May be other tradeoffs make, say, 4 hardware threads more
>> reasonable than 32.
> 
> Thank you for your testing.  The best test is one that nobody else run. It 
> also correlates with the topic of "safe to use", which also depends on what 
> it is used for. :)

Looks like use of a M.2 Samsung SSD 960 PRO 1TB with a
non-debug FreeBSD build is suitable for the bulk -a -J128
test (no ALLOW_MAKE_JOBS variants enabled, USE_TMPFS=no in
use) on the 32 hardware thread system. (The swap partition
in use is from the normal environment's PCIe Optane media.)
The %idle and the load averages and %user stayed reasonable
in a preliminary test. One thing it does introduce is trim
management (both available and potentially useful). (Optane
media does not support or need it.) No
per_txg_dirty_frees_percent adjustment involved (still 5).

I've learned to not use ^T for fear of /bin/sh aborting
and messing up poudriere's context. So I now monitor with:

# poudriere status -b

in a separate ssh session.

I'll note that I doubt I'd try for a complete bulk -a .
I'd probably stop it if I notice that the number of
active builders drops off for a notable time (normal
waiting for prerequisites appearing to be why).

===
Mark Millard
marklmi at yahoo.com

Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned

2023-09-04 Thread Alexander Motin


On 04.09.2023 11:45, Mark Millard wrote:

On Sep 4, 2023, at 06:09, Alexander Motin  wrote:

per_txg_dirty_frees_percent is directly related to the delete delays we see 
here.  You are forcing ZFS to commit transactions each 5% of dirty ARC limit, 
which is 5% of 10% of memory size.  I haven't looked at that code recently, but 
I guess setting it too low can make ZFS commit transactions too often, 
increasing write inflation for the underlying storage.  I would propose you to 
restore the default and try again.
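
As a rough worked example of the arithmetic described above, taking the
"5% of 10% of memory size" description at face value. The 128 GiB RAM
figure is purely hypothetical (the machine's memory size is not given in
this thread), the function name is illustrative, and any additional cap
on the dirty data limit is ignored in this sketch:

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def frees_budget_bytes(ram_bytes, frees_percent, dirty_percent=10):
    """Freed dirty data per TXG, assuming the dirty limit is
    dirty_percent of RAM and the frees limit is frees_percent of that."""
    dirty_limit = ram_bytes * dirty_percent // 100
    return dirty_limit * frees_percent // 100


ram = 128 * GIB  # hypothetical memory size, not taken from the thread
for pct in (5, 30):
    budget = frees_budget_bytes(ram, pct)
    print(f"per_txg_dirty_frees_percent={pct}: "
          f"about {budget / MIB:.0f} MiB of frees per TXG")
```

Under these assumptions the value 5 allows one sixth of what 30 allows
before a commit is forced, which is the direction of the effect being
discussed here.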


While this machine is different, the original problem was worse than
the issue here: the load average was less than 1 for the most part
the parallel bulk build when 30 was used. The fraction of time waiting
was much longer than with 5. If I understand right, both too high and
too low for a type of context can lead to increased elapsed time and
getting it set to a near optimal is a non-obvious exploration.


IIRC this limit was modified several times since originally implemented.
Maybe it could benefit from another look, if the default of 30% is not good.
It would be good if generic ZFS issues like this were reported to
OpenZFS upstream to be visible to a wider public.  Unfortunately I have
several other projects I must work on, so if it is not a regression I
can't promise I'll take it right now, so anybody else is welcome.



An overall point for the goal of my activity is: what makes a
good test context for checking if ZFS is again safe to use?
May be other tradeoffs make, say, 4 hardware threads more
reasonable than 32.


Thank you for your testing.  The best test is one that nobody else runs.
It also correlates with the topic of "safe to use", which also depends
on what it is used for. :)


--
Alexander Motin

Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned

txg_thread_enter mi_switch+0x173 sleepq_switch+0x104
>>>> sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9
>>>> dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82
>>>> fork_trampoline+0xe
>>>> /usr/home/root/mmjnk01.txt:6 100881 zfskern
>>>> txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165
>>>> txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82
>>>> fork_trampoline+0xe
>>>> /usr/home/root/mmjnk01.txt:6 100882 zfskern
>>>> txg_thread_enter mi_switch+0x173 sleepq_switch+0x104
>>>> sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9
>>>> dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82
>>>> fork_trampoline+0xe
>>>> /usr/home/root/mmjnk02.txt:6 100881 zfskern
>>>> txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165
>>>> txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82
>>>> fork_trampoline+0xe
>>>> /usr/home/root/mmjnk02.txt:6 100882 zfskern
>>>> txg_thread_enter mi_switch+0x173 sleepq_switch+0x104
>>>> sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9
>>>> dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82
>>>> fork_trampoline+0xe
>>>> /usr/home/root/mmjnk03.txt:6 100881 zfskern
>>>> txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165
>>>> txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82
>>>> fork_trampoline+0xe
>>>> /usr/home/root/mmjnk03.txt:6 100882 zfskern
>>>> txg_thread_enter mi_switch+0x173 sleepq_switch+0x104
>>>> sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9
>>>> dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82
>>>> fork_trampoline+0xe
>>>> /usr/home/root/mmjnk04.txt:6 100881 zfskern
>>>> txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165
>>>> txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82
>>>> fork_trampoline+0xe
>>>> /usr/home/root/mmjnk04.txt:6 100882 zfskern
>>>> txg_thread_enter mi_switch+0x173 sleepq_switch+0x104
>>>> sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9
>>>> dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82
>>>> fork_trampoline+0xe
>>>> /usr/home/root/mmjnk05.txt:6 100881 zfskern
>>>> txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165
>>>> txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82
>>>> fork_trampoline+0xe
>>>> /usr/home/root/mmjnk05.txt:6 100882 zfskern
>>>> txg_thread_enter mi_switch+0x173 sleepq_switch+0x104
>>>> sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9
>>>> dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82
>>>> fork_trampoline+0xe
> 
> So the quiesce threads are idle, while the sync thread is waiting for TXG
> commit writes to complete.  I see no crime; we should see the same just for
> slow storage.
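
The reading above can be checked mechanically against saved
`procstat -akk` lines. A rough Python sketch that labels each stack by
the functions it contains; the labels and the `classify` helper are for
illustration only, and the two sample stacks are copied from the output
in this thread:

```python
def classify(stack):
    """Label a txg thread's kernel stack by what it appears to be doing."""
    if "txg_quiesce_thread" in stack:
        # Sleeping in txg_thread_wait: nothing to quiesce at the moment.
        if "txg_thread_wait" in stack:
            return "quiesce: idle"
        return "quiesce: busy"
    if "txg_sync_thread" in stack:
        # zio_wait below spa_sync: the sync is waiting on I/O completion.
        if "zio_wait" in stack:
            return "sync: waiting for I/O"
        return "sync: other"
    return "unknown"


quiesce = ("mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 "
           "txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 "
           "fork_trampoline+0xe")
sync = ("mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b "
        "_cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 "
        "spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 "
        "fork_trampoline+0xe")

print(classify(quiesce))
print(classify(sync))
```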
> 
>>>>> `zpool status`, `zpool get all` and `sysctl -a` would also not harm.
>>>> 
>>>> It is a very simple zpool configuration: one partition.
>>>> I only use it for bectl BE reasons, not the general
>>>> range of reasons for using zfs. I created the media with
>>>> my normal content, then checkpointed before doing the
>>>> git fetch to start to set up the experiment.
> 
> OK.  And I see no scrub or async destroy, that could delay sync thread. 
> Though I don't see them in the above procstat either.
> 
>>>> /etc/sysctl.conf does have:
>>>> 
>>>> vfs.zfs.min_auto_ashift=12
>>>> vfs.zfs.per_txg_dirty_frees_percent=5
>>>> 
>>>> The vfs.zfs.per_txg_dirty_frees_percent is from prior
>>>> Mateusz Guzik help, where after testing the change I
>>>> reported:
>>>> 
>>>> Result summary: Seems to have avoided the sustained periods
>>>> of low load average activity. Much better for the context.
>>>> 
>>>> But it was for a different machine (aarch64, 8 cores). But
>>>> it was for poudriere bulk use.
>>>> 
>>>> Turns out the default of 30 was causing

Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned

2023-09-04 Thread Alexander Motin


On 04.09.2023 05:56, Mark Millard wrote:

On Sep 4, 2023, at 02:00, Mark Millard  wrote:

On Sep 3, 2023, at 23:35, Mark Millard  wrote:

On Sep 3, 2023, at 22:06, Alexander Motin  wrote:

On 03.09.2023 22:54, Mark Millard wrote:

After that ^t produced the likes of:
load: 6.39  cmd: sh 4849 [tx->tx_quiesce_done_cv] 10047.33r 0.51u 121.32s 1% 
13004k


So the full state is not "tx->tx", but is actually a "tx->tx_quiesce_done_cv", 
which means the thread is waiting for new transaction to be opened, which means some previous to be 
quiesced and then synced.


#0 0x80b6f103 at mi_switch+0x173
#1 0x80bc0f24 at sleepq_switch+0x104
#2 0x80aec4c5 at _cv_wait+0x165
#3 0x82aba365 at txg_wait_open+0xf5
#4 0x82a11b81 at dmu_free_long_range+0x151


Here it seems like transaction commit is waited due to large amount of delete 
operations, which ZFS tries to spread between separate TXGs.


That fit the context: cleaning out /usr/local/poudriere/data/.m/


You should probably see some large and growing number in sysctl 
kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay .


After the reboot I started a -J64 example. It has avoided the
early "witness exhausted". Again I ^C'd after about an hour
after the 2nd builder had started. So: again cleaning out
/usr/local/poudriere/data/.m/ Only seconds between:

# sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276042

# sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276427

# sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 277323

# sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 278027


As expected, deletes trigger and wait for TXG commits.
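
The growth of the counter between the samples above can be computed
directly. A small Python sketch that parses `sysctl` output lines of the
form shown and lists the increments; the `parse_counter` helper is an
assumption made for illustration:

```python
def parse_counter(line):
    """Extract the integer value from a 'name: value' sysctl output line."""
    _name, _sep, value = line.rpartition(": ")
    return int(value)


# The four samples quoted above, taken only seconds apart.
samples = [
    "kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276042",
    "kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276427",
    "kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 277323",
    "kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 278027",
]

values = [parse_counter(s) for s in samples]
deltas = [later - earlier for earlier, later in zip(values, values[1:])]
print(deltas)  # [385, 896, 704]
```

Since the exact time between samples was not recorded, only the
increments are shown, not a rate.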


I have found a measure of progress: zfs list's USED
for /usr/local/poudriere/data/.m is decreasing. So
ztop's d/s was a good classification: deletes.


#5 0x829a87d2 at zfs_rmnode+0x72
#6 0x829b658d at zfs_freebsd_reclaim+0x3d
#7 0x8113a495 at VOP_RECLAIM_APV+0x35
#8 0x80c5a7d9 at vgonel+0x3a9
#9 0x80c5af7f at vrecycle+0x3f
#10 0x829b643e at zfs_freebsd_inactive+0x4e
#11 0x80c598cf at vinactivef+0xbf
#12 0x80c590da at vput_final+0x2aa
#13 0x80c68886 at kern_funlinkat+0x2f6
#14 0x80c68588 at sys_unlink+0x28
#15 0x8106323f at amd64_syscall+0x14f
#16 0x8103512b at fast_syscall_common+0xf8


What we don't see here is what quiesce and sync threads of the pool are 
actually doing.  Sync thread has plenty of different jobs, including async 
write, async destroy, scrub and others, that all may delay each other.

Before you rebooted the system, depending how alive it is, could you save a 
number of outputs of `procstat -akk`, or at least specifically `procstat -akk | 
grep txg_thread_enter` if the full is hard?  Or somehow else observe what they 
are doing.


# grep txg_thread_enter ~/mmjnk0[0-5].txt
/usr/home/root/mmjnk00.txt:6 100881 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb 
txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk00.txt:6 100882 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b 
_cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 
txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk01.txt:6 100881 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb 
txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk01.txt:6 100882 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b 
_cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 
txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk02.txt:6 100881 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb 
txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk02.txt:6 100882 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b 
_cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 
txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk03.txt:6 100881 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb 
txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk03.txt:6 100882 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b 
_cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 
txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0

Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned

On Sep 4, 2023, at 02:00, Mark Millard  wrote:

> On Sep 3, 2023, at 23:35, Mark Millard  wrote:
> 
>> On Sep 3, 2023, at 22:06, Alexander Motin  wrote:
>> 
>>> 
>>> On 03.09.2023 22:54, Mark Millard wrote:
>>>> After that ^t produced the likes of:
>>>> load: 6.39  cmd: sh 4849 [tx->tx_quiesce_done_cv] 10047.33r 0.51u 121.32s
>>>> 1% 13004k
>>> 
>>> So the full state is not "tx->tx", but is actually a 
>>> "tx->tx_quiesce_done_cv", which means the thread is waiting for new 
>>> transaction to be opened, which means some previous to be quiesced and then 
>>> synced.
>>> 
>>>> #0 0x80b6f103 at mi_switch+0x173
>>>> #1 0x80bc0f24 at sleepq_switch+0x104
>>>> #2 0x80aec4c5 at _cv_wait+0x165
>>>> #3 0x82aba365 at txg_wait_open+0xf5
>>>> #4 0x82a11b81 at dmu_free_long_range+0x151
>>> 
>>> Here it seems like transaction commit is waited due to large amount of 
>>> delete operations, which ZFS tries to spread between separate TXGs.
>> 
>> That fit the context: cleaning out /usr/local/poudriere/data/.m/
>> 
>>> You should probably see some large and growing number in sysctl 
>>> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay .
>> 
>> After the reboot I started a -J64 example. It has avoided the
>> early "witness exhausted". Again I ^C'd after about an hour
>> after the 2nd builder had started. So: again cleaning out
>> /usr/local/poudriere/data/.m/ Only seconds between:
>> 
>> # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
>> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276042
>> 
>> # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
>> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276427
>> 
>> # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
>> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 277323
>> 
>> # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
>> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 278027
>> 
>> I have found a measure of progress: zfs list's USED
>> for /usr/local/poudriere/data/.m is decreasing. So
>> ztop's d/s was a good classification: deletes.
>> 
>>>> #5 0x829a87d2 at zfs_rmnode+0x72
>>>> #6 0x829b658d at zfs_freebsd_reclaim+0x3d
>>>> #7 0x8113a495 at VOP_RECLAIM_APV+0x35
>>>> #8 0x80c5a7d9 at vgonel+0x3a9
>>>> #9 0x80c5af7f at vrecycle+0x3f
>>>> #10 0x829b643e at zfs_freebsd_inactive+0x4e
>>>> #11 0x80c598cf at vinactivef+0xbf
>>>> #12 0x80c590da at vput_final+0x2aa
>>>> #13 0x80c68886 at kern_funlinkat+0x2f6
>>>> #14 0x80c68588 at sys_unlink+0x28
>>>> #15 0x8106323f at amd64_syscall+0x14f
>>>> #16 0x8103512b at fast_syscall_common+0xf8
>>> 
>>> What we don't see here is what quiesce and sync threads of the pool are 
>>> actually doing.  Sync thread has plenty of different jobs, including async 
>>> write, async destroy, scrub and others, that all may delay each other.
>>> 
>>> Before you rebooted the system, depending how alive it is, could you save a 
>>> number of outputs of `procstat -akk`, or at least specifically `procstat 
>>> -akk | grep txg_thread_enter` if the full is hard?  Or somehow else observe 
>>> what they are doing.
>> 
>> # procstat -akk > ~/mmjnk00.txt
>> # procstat -akk > ~/mmjnk01.txt
>> # procstat -akk > ~/mmjnk02.txt
>> # procstat -akk > ~/mmjnk03.txt
>> # procstat -akk > ~/mmjnk04.txt
>> # procstat -akk > ~/mmjnk05.txt
>> # grep txg_thread_enter ~/mmjnk0[0-5].txt
>> /usr/home/root/mmjnk00.txt:6 100881 zfskern txg_thread_enter
>> mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb
>> txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
>> /usr/home/root/mmjnk00.txt:6 100882 zfskern txg_thread_enter
>> mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b
>> _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68
>> txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
>> /usr/home/root/mmjnk01.txt:6 100881 zfskern txg_thread_enter
>> mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb
>> txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
>> /usr/home/root/mmjnk01.txt:6 100882 zfskern txg_thread_enter
>> mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b
>> _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68
>> txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
>> /usr/home/root/mmjnk02.txt:6 100881 zfskern txg_thread_enter
>> mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb
>> txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
>> /usr/home/root/mmjnk02.txt:6 100882 zfskern txg_thread_enter
>> mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b
>> _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68
>> txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
>> /usr/home/root/mmjnk03.txt:

Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned

On Sep 3, 2023, at 23:35, Mark Millard  wrote:

> On Sep 3, 2023, at 22:06, Alexander Motin  wrote:
> 
>> 
>> On 03.09.2023 22:54, Mark Millard wrote:
>>> After that ^t produced the likes of:
>>> load: 6.39  cmd: sh 4849 [tx->tx_quiesce_done_cv] 10047.33r 0.51u 121.32s 
>>> 1% 13004k
>> 
>> So the full state is not "tx->tx", but is actually a 
>> "tx->tx_quiesce_done_cv", which means the thread is waiting for new 
>> transaction to be opened, which means some previous to be quiesced and then 
>> synced.
>> 
>>> #0 0x80b6f103 at mi_switch+0x173
>>> #1 0x80bc0f24 at sleepq_switch+0x104
>>> #2 0x80aec4c5 at _cv_wait+0x165
>>> #3 0x82aba365 at txg_wait_open+0xf5
>>> #4 0x82a11b81 at dmu_free_long_range+0x151
>> 
>> Here it seems like transaction commit is waited due to large amount of 
>> delete operations, which ZFS tries to spread between separate TXGs.
> 
> That fit the context: cleaning out /usr/local/poudriere/data/.m/
> 
>> You should probably see some large and growing number in sysctl 
>> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay .
> 
> After the reboot I started a -J64 example. It has avoided the
> early "witness exhausted". Again I ^C'd after about an hour
> after the 2nd builder had started. So: again cleaning out
> /usr/local/poudriere/data/.m/ Only seconds between:
> 
> # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276042
> 
> # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276427
> 
> # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 277323
> 
> # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 278027
> 
> I have found a measure of progress: zfs list's USED
> for /usr/local/poudriere/data/.m is decreasing. So
> ztop's d/s was a good classification: deletes.
> 
>>> #5 0x829a87d2 at zfs_rmnode+0x72
>>> #6 0x829b658d at zfs_freebsd_reclaim+0x3d
>>> #7 0x8113a495 at VOP_RECLAIM_APV+0x35
>>> #8 0x80c5a7d9 at vgonel+0x3a9
>>> #9 0x80c5af7f at vrecycle+0x3f
>>> #10 0x829b643e at zfs_freebsd_inactive+0x4e
>>> #11 0x80c598cf at vinactivef+0xbf
>>> #12 0x80c590da at vput_final+0x2aa
>>> #13 0x80c68886 at kern_funlinkat+0x2f6
>>> #14 0x80c68588 at sys_unlink+0x28
>>> #15 0x8106323f at amd64_syscall+0x14f
>>> #16 0x8103512b at fast_syscall_common+0xf8
>> 
>> What we don't see here is what quiesce and sync threads of the pool are 
>> actually doing.  Sync thread has plenty of different jobs, including async 
>> write, async destroy, scrub and others, that all may delay each other.
>> 
>> Before you rebooted the system, depending how alive it is, could you save a 
>> number of outputs of `procstat -akk`, or at least specifically `procstat 
>> -akk | grep txg_thread_enter` if the full is hard?  Or somehow else observe 
>> what they are doing.
> 
> # procstat -akk > ~/mmjnk00.txt
> # procstat -akk > ~/mmjnk01.txt
> # procstat -akk > ~/mmjnk02.txt
> # procstat -akk > ~/mmjnk03.txt
> # procstat -akk > ~/mmjnk04.txt
> # procstat -akk > ~/mmjnk05.txt
> # grep txg_thread_enter ~/mmjnk0[0-5].txt
> /usr/home/root/mmjnk00.txt:6 100881 zfskern txg_thread_enter  
>   mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb 
> txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe 
> /usr/home/root/mmjnk00.txt:6 100882 zfskern txg_thread_enter  
>   mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b 
> _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 
> txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe 
> /usr/home/root/mmjnk01.txt:6 100881 zfskern txg_thread_enter  
>   mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb 
> txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe 
> /usr/home/root/mmjnk01.txt:6 100882 zfskern txg_thread_enter  
>   mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b 
> _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 
> txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe 
> /usr/home/root/mmjnk02.txt:6 100881 zfskern txg_thread_enter  
>   mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb 
> txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe 
> /usr/home/root/mmjnk02.txt:6 100882 zfskern txg_thread_enter  
>   mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b 
> _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 
> txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe 
> /usr/home/root/mmjnk03.txt:6 100881 zfskern txg_thread_enter  
>   mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb 
> txg_quiesce_thread+0x144

Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned

> #9 0x80c5af7f at vrecycle+0x3f
> #10 0x829b643e at zfs_freebsd_inactive+0x4e
> #11 0x80c598cf at vinactivef+0xbf
> #12 0x80c590da at vput_final+0x2aa
> #13 0x80c68886 at kern_funlinkat+0x2f6
> #14 0x80c68588 at sys_unlink+0x28
> #15 0x8106323f at amd64_syscall+0x14f
> #16 0x8103512b at fast_syscall_common+0xf8
> 
> The console/logs do report "witness exhausted":
> 
> . . .
> Sep  3 13:41:08 amd64-ZFS login[1751]: ROOT LOGIN (root) ON ttyv0
> Sep  3 13:51:35 amd64-ZFS kernel: witness_lock_list_get: witness exhausted
> Sep  3 14:26:38 amd64-ZFS kernel: pid 27418 (conftest), jid 245, uid 0: 
> exited on signal 11 (core dumped)
> . . .
> 
> So it did not take long for the "witness exhausted" to
> happen.
> 
> # uname -apKU
> FreeBSD amd64-ZFS 15.0-CURRENT FreeBSD 15.0-CURRENT amd64 150 #74 
> main-n265143-525bc87f54f2-dirty: Sun Sep  3 13:35:04 PDT 2023 
> root@amd64_ZFS:/usr/obj/BUILDs/main-amd64-dbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-DBG
>  amd64 amd64 150 150
> 
> 
> Looks like I'll be forcing the machine to reboot or
> to power off. The media was deliberately set up for
> doing risky tests. It is not my normal environment.
> 

Using -J64 instead of -J128. It does avoid "witness exhausted"
for at least the 1st hour.

[00:03:51] Building 34214 packages using up to 64 builders
[00:03:51] Hit CTRL+t at any time to see build progress and stats
[00:03:51] [01] [00:00:00] Builder starting
[00:04:49] [01] [00:00:58] Builder started
[00:04:49] [01] [00:00:00] Building ports-mgmt/pkg | pkg-1.20.6
[00:06:07] [01] [00:01:18] Finished ports-mgmt/pkg | pkg-1.20.6: Success
[00:06:31] [01] [00:00:00] Building print/indexinfo | indexinfo-0.3.1
[00:06:31] [02] [00:00:00] Builder starting
. . .
[00:06:33] [64] [00:00:00] Builder starting
[00:09:06] [01] [00:02:35] Finished print/indexinfo | indexinfo-0.3.1: Success
[00:09:08] [01] [00:00:00] Building devel/gettext-runtime | 
gettext-runtime-0.22_1
[00:21:49] [16] [00:15:18] Builder started
[00:21:49] [16] [00:00:00] Building devel/libdaemon | libdaemon-0.14_1
[00:21:49] [29] [00:15:17] Builder started
[00:21:49] [20] [00:15:18] Builder started
[00:21:49] [41] [00:15:17] Builder started
[00:21:49] [29] [00:00:00] Building textproc/libunibreak | libunibreak-5.1,1
[00:21:49] [20] [00:00:00] Building graphics/poppler-data | poppler-data-0.4.12
[00:21:49] [35] [00:15:17] Builder started
[00:21:49] [41] [00:00:00] Building archivers/libmspack | libmspack-0.11alpha
. . .
[main-amd64-bulk_a-default] [2023-09-03_20h48m38s] [parallel_build:] Queued: 
34588 Built: 438   Failed: 1 Skipped: 50 Ignored: 335   Fetched: 0 
Tobuild: 33764  Time: 01:21:30
. . .
^C[01:21:57] [32] [00:07:04] Finished devel/p5-Test-Deep | p5-Test-Deep-1.204: 
Success
[01:21:57] Error: Signal SIGINT caught, cleaning up and exiting
[01:22:03] [39] [00:06:01] Finished textproc/p5-Lingua-Stem-Ru | 
p5-Lingua-Stem-Ru-0.04: Success
[01:22:04] [35] [00:06:09] Finished devel/p5-ExtUtils-InstallPaths | 
p5-ExtUtils-InstallPaths-0.012: Success
[main-amd64-bulk_a-default] [2023-09-03_20h48m38s] [sigint:] Queued: 34588 
Built: 442   Failed: 1 Skipped: 50 Ignored: 335   Fetched: 0 
Tobuild: 33760  Time: 01:21:50
[01:22:04] Logs: 
/usr/local/poudriere/data/logs/bulk/main-amd64-bulk_a-default/2023-09-03_20h48m38s
[01:22:11] Cleaning up

(So only around 438 were built, in a time frame similar to the one in
which the -J128 test got to more like 752. But the -J128 run had the
"witness exhausted" status as well.)

Turns out a measure of progress is the USED listed by zfs list for
/usr/local/poudriere/data/.m/ . It spends lots of time waiting in
various processes during its deletion activity.

Previously I'd been told to use vfs.zfs.per_txg_dirty_frees_percent=5
instead of the default (30) in order to avoid ending up with sustained
very small load averages in my poudriere bulk runs for my kind of
context. (5 is the older default, as it turns out.) This may be
somewhat of a deletion/cleanup stage variant of that sort of issue.

It may be that trying a factor of 2+ for 32 hardware threads just
does not scale the same as a factor of 2 did for folks using 4
hardware thread machines when testing for the deadlock issue. Maybe
-J36 (so: 32+4) would be more reasonable for deadlock testing
for this context, possibly avoiding running into these other issues
so strongly.
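
For scale, the -J values mentioned in this thread can be expressed as
builders per hardware thread. This is just arithmetic on numbers from the
thread (32 hardware threads; -J128, -J64, -J36); the function name is a
hypothetical label, and the ratio says nothing by itself about how many
processes each builder runs.

```python
def oversubscription(builders, hw_threads):
    """Builders per hardware thread for a given poudriere -J setting."""
    return builders / hw_threads


for jobs in (128, 64, 36):
    ratio = oversubscription(jobs, 32)
    print(f"-J{jobs}: {ratio:.3f} builders per hardware thread")
# -J128: 4.000 builders per hardware thread
# -J64: 2.000 builders per hardware thread
# -J36: 1.125 builders per hardware thread
```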

===
Mark Millard
marklmi at yahoo.com

Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned

2023-09-03 Thread Mark Millard

On Sep 3, 2023, at 22:06, Alexander Motin  wrote:

> 
> On 03.09.2023 22:54, Mark Millard wrote:
>> After that ^t produced the likes of:
>> load: 6.39  cmd: sh 4849 [tx->tx_quiesce_done_cv] 10047.33r 0.51u 121.32s 1% 
>> 13004k
> 
> So the full state is not "tx->tx", but is actually 
> "tx->tx_quiesce_done_cv", which means the thread is waiting for a new 
> transaction to be opened, which in turn means waiting for some previous 
> one to be quiesced and then synced.
> 
>> #0 0x80b6f103 at mi_switch+0x173
>> #1 0x80bc0f24 at sleepq_switch+0x104
>> #2 0x80aec4c5 at _cv_wait+0x165
>> #3 0x82aba365 at txg_wait_open+0xf5
>> #4 0x82a11b81 at dmu_free_long_range+0x151
> 
> Here it seems the wait is for a transaction commit, caused by the large 
> number of delete operations, which ZFS tries to spread across separate TXGs.

That fit the context: cleaning out /usr/local/poudriere/data/.m/

> You should probably see some large and growing number in sysctl 
> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay .

After the reboot I started a -J64 example. It has avoided the
early "witness exhausted". Again I ^C'd about an hour
after the 2nd builder had started. So: again cleaning out
/usr/local/poudriere/data/.m/ . Only seconds between:

# sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276042

# sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276427

# sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 277323

# sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 278027
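
To put numbers on "growing", the increases between the four readings above
can be worked out as below. This is a minimal sketch: counter_deltas is a
hypothetical helper, and since the sampling times were not recorded, only
per-sample increases (not rates) can be derived.

```python
COUNTER = "kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay"


def counter_deltas(text, name=COUNTER):
    """Return the increases between successive `sysctl` readings of name."""
    values = []
    for line in text.splitlines():
        line = line.strip()
        # Skip the "# sysctl ..." command lines; keep only "name: value".
        if line.startswith(name + ":"):
            values.append(int(line.split(":", 1)[1]))
    return [later - earlier for earlier, later in zip(values, values[1:])]


# The four readings quoted above.
readings = """\
# sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276042
# sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276427
# sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 277323
# sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 278027
"""

deltas = counter_deltas(readings)
print(deltas, sum(deltas))  # [385, 896, 704] 1985
```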

I have found a measure of progress: zfs list's USED
for /usr/local/poudriere/data/.m is decreasing. So
ztop's d/s was a good classification: deletes.

>> #5 0x829a87d2 at zfs_rmnode+0x72
>> #6 0x829b658d at zfs_freebsd_reclaim+0x3d
>> #7 0x8113a495 at VOP_RECLAIM_APV+0x35
>> #8 0x80c5a7d9 at vgonel+0x3a9
>> #9 0x80c5af7f at vrecycle+0x3f
>> #10 0x829b643e at zfs_freebsd_inactive+0x4e
>> #11 0x80c598cf at vinactivef+0xbf
>> #12 0x80c590da at vput_final+0x2aa
>> #13 0x80c68886 at kern_funlinkat+0x2f6
>> #14 0x80c68588 at sys_unlink+0x28
>> #15 0x8106323f at amd64_syscall+0x14f
>> #16 0x8103512b at fast_syscall_common+0xf8
> 
> What we don't see here is what the quiesce and sync threads of the pool are 
> actually doing.  The sync thread has plenty of different jobs, including async 
> write, async destroy, scrub and others, that all may delay each other.
> 
> Before you reboot the system, depending on how alive it is, could you save a 
> number of outputs of `procstat -akk`, or at least specifically `procstat -akk 
> | grep txg_thread_enter` if the full output is hard?  Or observe in some 
> other way what they are doing.

# procstat -akk > ~/mmjnk00.txt
# procstat -akk > ~/mmjnk01.txt
# procstat -akk > ~/mmjnk02.txt
# procstat -akk > ~/mmjnk03.txt
# procstat -akk > ~/mmjnk04.txt
# procstat -akk > ~/mmjnk05.txt
# grep txg_thread_enter ~/mmjnk0[0-5].txt
/usr/home/root/mmjnk00.txt:6 100881 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb 
txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe 
/usr/home/root/mmjnk00.txt:6 100882 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b 
_cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 
txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe 
/usr/home/root/mmjnk01.txt:6 100881 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb 
txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe 
/usr/home/root/mmjnk01.txt:6 100882 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b 
_cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 
txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe 
/usr/home/root/mmjnk02.txt:6 100881 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb 
txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe 
/usr/home/root/mmjnk02.txt:6 100882 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b 
_cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 
txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe 
/usr/home/root/mmjnk03.txt:6 100881 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb 
txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe 
/usr/home/root/mmjnk03.txt:6 100882 zfskern txg_thread_enter
mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b 
_cv_timedwait_sbt+0x188 zio_wai
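
Captures like these can be summarised by counting the distinct stacks of
the txg threads. The sketch below is an illustration only. It assumes one
thread per line in the order PID, TID, COMM, TDNAME, then the stack frames,
and that the thread name contains no spaces (true for txg_thread_enter, not
for every kernel thread). txg_thread_stacks and SAMPLE are hypothetical
names; SAMPLE is the pair of stacks from mmjnk00.txt with the line wrapping
undone.

```python
from collections import Counter


def txg_thread_stacks(procstat_text):
    """Count distinct kernel stacks of threads named txg_thread_enter."""
    stacks = Counter()
    for line in procstat_text.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[3] != "txg_thread_enter":
            continue
        frames = tuple(fields[4:])
        # Label the thread by which txg loop appears in its stack.
        role = "other"
        for frame in frames:
            if frame.startswith("txg_quiesce_thread+"):
                role = "quiesce"
            elif frame.startswith("txg_sync_thread+"):
                role = "sync"
        stacks[(role, frames)] += 1
    return stacks


SAMPLE = (
    "6 100881 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 "
    "_cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 "
    "fork_exit+0x82 fork_trampoline+0xe\n"
    "6 100882 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 "
    "sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 "
    "dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb "
    "fork_exit+0x82 fork_trampoline+0xe\n"
)

# Two identical captures: each distinct stack is counted twice.
for (role, frames), count in txg_thread_stacks(SAMPLE * 2).items():
    print(role, count, " ".join(frames[2:5]))
```

In the captures shown above, the quiesce thread is in txg_thread_wait and
the sync thread is in zio_wait under dsl_pool_sync each time.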

Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned

2023-09-03 Thread Alexander Motin


Mark,

On 03.09.2023 22:54, Mark Millard wrote:

After that ^t produced the likes of:

load: 6.39  cmd: sh 4849 [tx->tx_quiesce_done_cv] 10047.33r 0.51u 121.32s 1% 
13004k


So the full state is not "tx->tx", but is actually 
"tx->tx_quiesce_done_cv", which means the thread is waiting for a new 
transaction to be opened, which in turn means waiting for some previous 
one to be quiesced and then synced.



#0 0x80b6f103 at mi_switch+0x173
#1 0x80bc0f24 at sleepq_switch+0x104
#2 0x80aec4c5 at _cv_wait+0x165
#3 0x82aba365 at txg_wait_open+0xf5
#4 0x82a11b81 at dmu_free_long_range+0x151


Here it seems the wait is for a transaction commit, caused by the large 
number of delete operations, which ZFS tries to spread across separate TXGs.  
You should probably see some large and growing number in sysctl 
kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay .



#5 0x829a87d2 at zfs_rmnode+0x72
#6 0x829b658d at zfs_freebsd_reclaim+0x3d
#7 0x8113a495 at VOP_RECLAIM_APV+0x35
#8 0x80c5a7d9 at vgonel+0x3a9
#9 0x80c5af7f at vrecycle+0x3f
#10 0x829b643e at zfs_freebsd_inactive+0x4e
#11 0x80c598cf at vinactivef+0xbf
#12 0x80c590da at vput_final+0x2aa
#13 0x80c68886 at kern_funlinkat+0x2f6
#14 0x80c68588 at sys_unlink+0x28
#15 0x8106323f at amd64_syscall+0x14f
#16 0x8103512b at fast_syscall_common+0xf8


What we don't see here is what the quiesce and sync threads of the pool are 
actually doing.  The sync thread has plenty of different jobs, including 
async write, async destroy, scrub and others, that all may delay each 
other.


Before you reboot the system, depending on how alive it is, could you 
save a number of outputs of `procstat -akk`, or at least specifically 
`procstat -akk | grep txg_thread_enter` if the full output is hard?  Or 
observe in some other way what they are doing.


`zpool status`, `zpool get all` and `sysctl -a` would not hurt either.

PS: I may be wrong, but USB in "USB3 NVMe SSD storage" makes me shiver. 
Make sure there are no storage problems, like huge delays, timeouts, 
etc., which can show up, for example, as busy percentages regularly spiking 
far above 100% in your `gstat -spod`.


--
Alexander Motin

An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned

2023-09-03 Thread Mark Millard

ThreadRipper 1950X (32 hardware threads) doing bulk -J128
with USE_TMPFS=no , no ALLOW_MAKE_JOBS , no
ALLOW_MAKE_JOBS_PACKAGES , USB3 NVMe SSD storage/ZFS-boot-media,
debug system build in use :

[00:03:44] Building 34214 packages using up to 128 builders
[00:03:44] Hit CTRL+t at any time to see build progress and stats
[00:03:44] [01] [00:00:00] Builder starting
[00:04:37] [01] [00:00:53] Builder started
[00:04:37] [01] [00:00:00] Building ports-mgmt/pkg | pkg-1.20.6
[00:05:53] [01] [00:01:16] Finished ports-mgmt/pkg | pkg-1.20.6: Success
[00:06:15] [01] [00:00:00] Building print/indexinfo | indexinfo-0.3.1
[00:06:15] [02] [00:00:00] Builder starting
. . .
[00:06:18] [128] [00:00:00] Builder starting
[00:07:42] [01] [00:01:27] Finished print/indexinfo | indexinfo-0.3.1: Success
[00:07:45] [01] [00:00:00] Building devel/gettext-runtime | 
gettext-runtime-0.22_1
[00:18:45] [01] [00:11:00] Finished devel/gettext-runtime | 
gettext-runtime-0.22_1: Success
[00:19:06] [01] [00:00:00] Building devel/gmake | gmake-4.3_2
[00:24:13] [01] [00:05:07] Finished devel/gmake | gmake-4.3_2: Success
[00:24:39] [01] [00:00:00] Building devel/libtextstyle | libtextstyle-0.22
[00:31:08] [125] [00:24:50] Builder started
[00:31:08] [125] [00:00:00] Building print/t1utils | t1utils-1.32
[00:31:15] [33] [00:25:00] Builder started
[00:31:15] [81] [00:24:59] Builder started
[00:31:15] [33] [00:00:00] Building databases/xapian-core | xapian-core-1.4.23,1
[00:31:15] [13] [00:25:00] Builder started
[00:31:15] [81] [00:00:00] Building devel/bmake | bmake-20230723
[00:31:15] [13] [00:00:00] Building devel/evdev-proto | evdev-proto-5.8
[00:31:16] [41] [00:25:00] Builder started
[00:31:16] [41] [00:00:00] Building devel/pcre | pcre-8.45_3
. . .

(Looks like lang/go120 ignores the lack of ALLOW_MAKE_JOBS .
There may be others that still have significant parallel
activity.)

[main-amd64-bulk_a-default] [2023-09-03_13h48m45s] [parallel_build:] Queued: 
34588 Built: 727   Failed: 1 Skipped: 40   Ignored: 335   Fetched: 0 
Tobuild: 33485  Time: 01:36:51

(So about 1 hr after the last "Builder starting" it had
built 727.)
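
Status lines like the one above can be pulled apart mechanically, which
makes runs easier to compare. The sketch below is illustrative: parse_status
is a hypothetical helper, and the check that the listed categories add up to
Queued is an observation about the status lines in this thread, not
something poudriere is known here to guarantee.

```python
import re

# "Name: 123" pairs; the lookahead keeps "Time: 01:36:51" out of the counts.
COUNT_RE = re.compile(r"([A-Za-z]+):\s*(\d+)(?![\d:])")
TIME_RE = re.compile(r"Time:\s*(\d+):(\d+):(\d+)")
PARTS = ("Built", "Failed", "Skipped", "Ignored", "Fetched", "Tobuild")


def parse_status(text):
    """Return the counters of a poudriere status line plus elapsed seconds."""
    counts = {key: int(value) for key, value in COUNT_RE.findall(text)}
    hours, minutes, seconds = map(int, TIME_RE.search(text).groups())
    counts["seconds"] = hours * 3600 + minutes * 60 + seconds
    return counts


status = parse_status(
    "[main-amd64-bulk_a-default] [2023-09-03_13h48m45s] [parallel_build:] "
    "Queued: 34588 Built: 727   Failed: 1 Skipped: 40   Ignored: 335   "
    "Fetched: 0 Tobuild: 33485  Time: 01:36:51"
)
print(status["Built"], status["seconds"])                   # 727 5811
print(sum(status[p] for p in PARTS) == status["Queued"])    # True
```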

The vast majority of the time: lots of cpdup's, with tx->tx
showing for STATE most of the time but with the processes still
accumulating some CPU time.

^T commonly showed various Builders in starting PHASE for
3min..6min.

Around 66% mean Idle time (guess from watching top).

After ^C "gstat -spod" reports almost always 2200 to 2500
writes per second or so for *hours* (still
going on).

ztop reports 1500 to 3200 d/s or so almost always for
Dataset zamd64/poudriere/data/.m instead (also still going
on). Normally no other Dataset is shown.

With all the disk I/O activity, this is definitely "live"
in some sense. But I've no clue if it is just repeating
itself over and over vs. if it is making some sort of progress.

For reference for the ^C and after:

^C[01:39:00] [20] [00:00:03] Building sysutils/linux-c7-dosfstools | 
linux-c7-dosfstools-3.0.20
[01:39:00] [93] [00:07:12] Finished science/dimod | dimod-0.12.11: Success
[01:39:00] Error: Signal SIGINT caught, cleaning up and exiting
[01:39:02] [63] [00:06:34] Finished archivers/unarj | unarj-2.65_2: Success
[01:39:03] [128] [00:07:47] Finished sysutils/shuf | shuf-3.0: Success
[01:39:04] [113] [00:07:06] Finished devel/bsddialog | bsddialog-0.4.1: Success
[main-amd64-bulk_a-default] [2023-09-03_13h48m45s] [sigint:] Queued: 34588 
Built: 752   Failed: 1 Skipped: 40   Ignored: 335   Fetched: 0 
Tobuild: 33460  Time: 01:38:56
[01:39:06] Logs: 
/usr/local/poudriere/data/logs/bulk/main-amd64-bulk_a-default/2023-09-03_13h48m45s
[01:39:14] [12] [00:09:07] Finished archivers/rzip | rzip-2.1_1: Success
[01:39:14] Cleaning up
exit: cannot open ./var/run/49_nohang.pid: No such file or directory
exit: cannot open ./var/run/87_nohang.pid: No such file or directory

After that ^t produced the likes of:

load: 6.39  cmd: sh 4849 [tx->tx_quiesce_done_cv] 10047.33r 0.51u 121.32s 1% 
13004k
#0 0x80b6f103 at mi_switch+0x173
#1 0x80bc0f24 at sleepq_switch+0x104
#2 0x80aec4c5 at _cv_wait+0x165
#3 0x82aba365 at txg_wait_open+0xf5
#4 0x82a11b81 at dmu_free_long_range+0x151
#5 0x829a87d2 at zfs_rmnode+0x72
#6 0x829b658d at zfs_freebsd_reclaim+0x3d
#7 0x8113a495 at VOP_RECLAIM_APV+0x35
#8 0x80c5a7d9 at vgonel+0x3a9
#9 0x80c5af7f at vrecycle+0x3f
#10 0x829b643e at zfs_freebsd_inactive+0x4e
#11 0x80c598cf at vinactivef+0xbf
#12 0x80c590da at vput_final+0x2aa
#13 0x80c68886 at kern_funlinkat+0x2f6
#14 0x80c68588 at sys_unlink+0x28
#15 0x8106323f at amd64_syscall+0x14f
#16 0x8103512b at fast_syscall_common+0xf8
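
The ^T status line above packs several fields into one line: load average,
command, pid, wait channel in brackets, real/user/system seconds, a CPU
percentage and resident size. A minimal parsing sketch follows. It assumes
the command name contains no spaces, and parse_siginfo is a hypothetical
helper name.

```python
import re

SIGINFO_RE = re.compile(
    r"load:\s*(?P<load>[\d.]+)\s+cmd:\s*(?P<cmd>\S+)\s+(?P<pid>\d+)\s+"
    r"\[(?P<wchan>[^\]]+)\]\s+(?P<real>[\d.]+)r\s+(?P<user>[\d.]+)u\s+"
    r"(?P<sys>[\d.]+)s\s+(?P<cpu>\d+)%\s+(?P<rss>\d+)k"
)


def parse_siginfo(line):
    """Split a ^T status line into typed fields."""
    match = SIGINFO_RE.search(line)
    if match is None:
        raise ValueError("not a recognised ^T status line")
    info = match.groupdict()
    for key in ("load", "real", "user", "sys"):
        info[key] = float(info[key])
    for key in ("pid", "cpu", "rss"):
        info[key] = int(info[key])
    return info


info = parse_siginfo(
    "load: 6.39  cmd: sh 4849 [tx->tx_quiesce_done_cv] "
    "10047.33r 0.51u 121.32s 1% 13004k"
)
# Share of the process's elapsed time spent on a CPU (user + system).
on_cpu = (info["user"] + info["sys"]) / info["real"]
print(info["wchan"], f"{on_cpu:.1%}")  # tx->tx_quiesce_done_cv 1.2%
```

So over roughly 10047 seconds of elapsed time, that sh process had only
about 122 seconds of CPU time; the rest was spent off the CPU.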

The console/logs do report "witness exhausted":

. . .
Sep  3 13:41:08 amd64-ZFS login[1751]: ROOT LOGIN (root) ON ttyv0
Sep  3 13:51:35 amd64-ZFS kernel: witness_lock_list_get: witness exhausted
Sep  3 14:26:38 amd64-ZFS kernel: pid 27418 (conftest), jid 245

See bugzilla's 272965 and 272966 for cortex-A7 armv7 example kyua test case panics for main [so: 14], split by backtrace structure

2023-08-06 Thread Mark Millard

See:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=272965
and:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=272966

All involve 'Alignment Fault' on read at some point
but the 272966 ones first involve: Kernel page fault with
the following non-sleepable locks held during
in6ifa_ifwithaddr.

The 272965 ones involve udp_input getting the alignment
fault.

These reports are based on using the most recent snapshot
of main [so: 14], not based on my own builds. They do
not involve aarch64 at all: no chroot use, no jail use,
no lib32 existing for the already just-armv7 context.

===
Mark Millard
marklmi at yahoo.com

Some results from my crude technique of using (some of) Kyua to test aarch64's lib32 vs. kyua runs in a armv7 chroot on aarch64

2023-08-01 Thread Mark Millard

[I first report how I tested before reporting on errors
that look to be valid kyua reports of issues.]

[This testing is with a line commented out in order to
prevent sys/net/if_bridge_test:gif from being tested,
because of its leading to a panic for lib32 style
testing.]

I have /usr/obj/DESTDIRs/main-CA7-chroot/ containing an
armv7 installworld distrib-dirs distribution DB_FROM_SRC=1
result. It also has various ports installed that kyua
runs use. I use this tree for both the lib32 and the
chroot testing.

I have a script to preload various kernel modules:

# grep kldload ~/prekyua-kldloads.sh | sort
#kldload -v -n ipfw.ko
#kldload -v -n pflog.ko
#kldload -v -n pfsync.ko
kldload -v -n bridgestp.ko
kldload -v -n carp.ko
kldload -v -n cryptodev.ko
kldload -v -n dtrace.ko
kldload -v -n dummynet.ko
kldload -v -n fdescfs.ko
kldload -v -n filemon.ko
kldload -v -n geom_concat.ko
kldload -v -n geom_eli.ko
kldload -v -n geom_gate.ko
kldload -v -n geom_mirror.ko
kldload -v -n geom_multipath.ko
kldload -v -n geom_nop.ko
kldload -v -n geom_raid3.ko
kldload -v -n geom_shsec.ko
kldload -v -n geom_stripe.ko
kldload -v -n geom_uzip.ko
kldload -v -n if_bridge.ko
kldload -v -n if_epair.ko
kldload -v -n if_gif.ko
kldload -v -n if_infiniband.ko
kldload -v -n if_lagg.ko
kldload -v -n if_ovpn.ko
kldload -v -n if_stf.ko
kldload -v -n if_tuntap.ko
kldload -v -n if_wg.ko
kldload -v -n ipdivert.ko
kldload -v -n ipsec.ko
kldload -v -n mqueuefs.ko
kldload -v -n netgraph.ko
kldload -v -n nfsd.ko
kldload -v -n ng_bridge.ko
kldload -v -n ng_ether.ko
kldload -v -n ng_hub.ko
kldload -v -n ng_socket.ko
kldload -v -n ng_vlan_rotate.ko
kldload -v -n nullfs.ko
kldload -v -n opensolaris.ko
kldload -v -n pf.ko
kldload -v -n sctp.ko
kldload -v -n sdt.ko
kldload -v -n tarfs.ko
kldload -v -n tcpmd5.ko
kldload -v -n xz.ko
kldload -v -n zfs.ko

(Some I've listed despite their being built into the
kernel or already being loaded for my normal environment.)
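
For illustration only, here is how such a preload list might be processed,
skipping the commented-out entries. This is a sketch and not the actual
script: active_kldloads and preload are hypothetical names, and continuing
after a failed load is a choice made for this sketch. In kldload, -v is
verbose and -n skips modules that are already loaded.

```python
import subprocess


def active_kldloads(script_text):
    """Return argv lists for the uncommented kldload lines of a script."""
    commands = []
    for line in script_text.splitlines():
        argv = line.split()
        # "#kldload ..." lines are commented out and are skipped.
        if argv and argv[0] == "kldload":
            commands.append(argv)
    return commands


def preload(script_text, dry_run=True):
    """Print each kldload command, or run it (as root) if dry_run is False."""
    for argv in active_kldloads(script_text):
        if dry_run:
            print(" ".join(argv))
        else:
            # check=False: keep going if one module cannot be loaded.
            subprocess.run(argv, check=False)


SCRIPT = """\
#kldload -v -n ipfw.ko
kldload -v -n if_bridge.ko
kldload -v -n if_epair.ko
"""
preload(SCRIPT)  # dry run: prints the two active commands
```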

Likely I'll end up adding some more later.

I have some ports used by kyua runs that I build and then
install into /usr/obj/DESTDIRs/main-CA7-chroot/ :

# more ~/origins/kyua-origins.txt 
archivers/gtar
devel/py-pytest
devel/py-pytest-twisted
devel/py-twisted
lang/perl5.32
lang/python
net/scapy
security/openvpn
security/sudo
shells/ksh93
shells/bash
sysutils/coreutils
sysutils/sg3_utils
textproc/jq

Likely I'll add some more later. The above, of course,
lead to other installs as well.

For lib32 testing, I try to control where most *.so* 's
that are not based on full path references are found. This
is via use of LD_32_LIBRARY_PATH . I also try to have more
programs that are not based on full path references run
as armv7 code. This is via use of PATH . So:

# env \
LD_32_LIBRARY_PATH=/usr/obj/DESTDIRs/main-CA7-chroot/lib\
:/usr/obj/DESTDIRs/main-CA7-chroot/usr/lib\
:/usr/obj/DESTDIRs/main-CA7-chroot/usr/tests/libexec/rtld-elf\
:/usr/obj/DESTDIRs/main-CA7-chroot/usr/tests/lib/libxo\
:/usr/obj/DESTDIRs/main-CA7-chroot/usr/tests/lib/csu/dynamiclib\
:/usr/obj/DESTDIRs/main-CA7-chroot/usr/tests/lib/libc/tls\
:/usr/obj/DESTDIRs/main-CA7-chroot/usr/tests/lib/libc/stdlib\
:/usr/obj/DESTDIRs/main-CA7-chroot/usr/tests/lib/libthr/dlopen\
:/usr/obj/DESTDIRs/main-CA7-chroot/usr/local/lib\
:/usr/obj/DESTDIRs/main-CA7-chroot/usr/local/lib/python3.9/site-packages\
:/usr/obj/DESTDIRs/main-CA7-chroot/usr/local/lib/python3.9/lib-dynload\
:/usr/obj/DESTDIRs/main-CA7-chroot/usr/local/lib/perl5/5.32/mach/CORE\
:/usr/obj/DESTDIRs/main-CA7-chroot/usr/local/lib/perl5/5.32/mach/auto \
PATH=/usr/obj/DESTDIRs/main-CA7-chroot/sbin\
:/usr/obj/DESTDIRs/main-CA7-chroot/bin\
:/usr/obj/DESTDIRs/main-CA7-chroot/usr/sbin\
:/usr/obj/DESTDIRs/main-CA7-chroot/usr/bin\
:/usr/obj/DESTDIRs/main-CA7-chroot/usr/local/sbin\
:/usr/obj/DESTDIRs/main-CA7-chroot/usr/local/bin\
:/usr/obj/DESTDIRs/main-CA7-chroot/root/bin \
/usr/obj/DESTDIRs/main-CA7-chroot/usr/bin/kyua test \
-k /usr/obj/DESTDIRs/main-CA7-chroot/usr/tests/Kyuafile
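
The same environment can be assembled from a root prefix and two lists of
subdirectories, which makes the structure of the long command easier to
see. This is a sketch under that framing: lib32_env and kyua_argv are
hypothetical helper names, while the paths are the ones from the command
above.

```python
import os

ROOT = "/usr/obj/DESTDIRs/main-CA7-chroot"

LIB_DIRS = [
    "lib", "usr/lib",
    "usr/tests/libexec/rtld-elf", "usr/tests/lib/libxo",
    "usr/tests/lib/csu/dynamiclib", "usr/tests/lib/libc/tls",
    "usr/tests/lib/libc/stdlib", "usr/tests/lib/libthr/dlopen",
    "usr/local/lib",
    "usr/local/lib/python3.9/site-packages",
    "usr/local/lib/python3.9/lib-dynload",
    "usr/local/lib/perl5/5.32/mach/CORE",
    "usr/local/lib/perl5/5.32/mach/auto",
]

BIN_DIRS = [
    "sbin", "bin", "usr/sbin", "usr/bin",
    "usr/local/sbin", "usr/local/bin", "root/bin",
]


def lib32_env(root, lib_dirs, bin_dirs, base_env=None):
    """Copy the environment, replacing LD_32_LIBRARY_PATH and PATH."""
    env = dict(os.environ if base_env is None else base_env)
    env["LD_32_LIBRARY_PATH"] = ":".join(f"{root}/{d}" for d in lib_dirs)
    env["PATH"] = ":".join(f"{root}/{d}" for d in bin_dirs)
    return env


def kyua_argv(root):
    """The armv7 kyua binary run against the armv7 tree's Kyuafile."""
    return [f"{root}/usr/bin/kyua", "test", "-k",
            f"{root}/usr/tests/Kyuafile"]


env = lib32_env(ROOT, LIB_DIRS, BIN_DIRS)
print(env["PATH"])
print(" ".join(kyua_argv(ROOT)))
```

Passing env and kyua_argv(ROOT) to subprocess.run would then correspond to
the env invocation shown above.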

On the Windows Dev Kit 2023 I end up with the lib32 summary
being (so far):

# kyua report --verbose \
   --results-file=usr_obj_DESTDIRs_main-CA7-chroot_usr_tests.20230731-080820-275974 \
   2>&1 \
| tail -6
===> Summary
Results read from 
/usr/home/root/.kyua/store/results.usr_obj_DESTDIRs_main-CA7-chroot_usr_tests.20230731-080820-275974.db
Test cases: 8704 total, 1442 skipped, 37 expected failures, 46 broken, 746 
failed
Start time: 2023-07-31T08:08:20.858437Z
End time:   2023-07-31T10:18:37.393732Z
Total time: 6954.365s
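
The "Test cases" line lists only the non-passing categories, so the
remainder has to be derived. The sketch below does that arithmetic for the
summary above. parse_summary is a hypothetical helper, and reading the
remainder as the number of passed tests is an assumption about how kyua
accounts for results.

```python
import re


def parse_summary(text):
    """Extract the counts from a kyua "Test cases:" summary line."""
    total = int(re.search(r"(\d+)\s+total", text).group(1))
    counts = {}
    for number, label in re.findall(
        r"(\d+)\s+(skipped|expected\s+failures|broken|failed)", text
    ):
        # Normalise whitespace in case the line was wrapped mid-label.
        counts[" ".join(label.split())] = int(number)
    counts["total"] = total
    # Whatever is not listed explicitly (assumed here to be the passes).
    counts["remainder"] = total - sum(
        value for key, value in counts.items() if key != "total"
    )
    return counts


summary = parse_summary(
    "Test cases: 8704 total, 1442 skipped, 37 expected failures, "
    "46 broken, 746 failed"
)
print(summary["remainder"])  # 6433
```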

Of course, some tests labeled as broken/failed are just from the
limitations of the techniques involved in lib32 based kyua testing.
For example the 127 "failures":

In the chroot it is currently:

# kyua report --verbose \
--results-file=usr_tests.20230731-163737-720329 \
2>&1 \
| tail -6
===> Summary
Results read from 
/usr/home/root/.kyua/store/results.usr_tests.20230731-163737-720329.db
Test cases: 8699 total, 1478 skipped, 38 exp

FYI for aarch64/armv7 lib32: armv7 kyua test sys/net/if_bridge_test:gif with preloaded if_bridge.ko still panics in my style of testing

2023-07-29 Thread Mark Millard

I finally got around to testing lib32 some more, first
trying the panic case that I'd gotten in early testing.
The below is without any special lib32 patching for
testing, just my normal non-debug environment updated
to a lib32-present aarch64 FreeBSD vintage.

Reminder: /usr/obj/DESTDIRs/main-CA7-chroot/ contains an
armv7 installworld distrib-dirs distribution DB_FROM_SRC=1
result. (It also has various ports installed.)

# ~/prekyua-kldloads.sh
. . .
# env \
> LD_32_LIBRARY_PATH=/usr/obj/DESTDIRs/main-CA7-chroot/lib\
> :/usr/obj/DESTDIRs/main-CA7-chroot/usr/lib\
> :/usr/obj/DESTDIRs/main-CA7-chroot/usr/tests/libexec/rtld-elf\
> :/usr/obj/DESTDIRs/main-CA7-chroot/usr/tests/lib/libxo\
> :/usr/obj/DESTDIRs/main-CA7-chroot/usr/tests/lib/csu/dynamiclib\
> :/usr/obj/DESTDIRs/main-CA7-chroot/usr/tests/lib/libc/tls\
> :/usr/obj/DESTDIRs/main-CA7-chroot/usr/tests/lib/libc/stdlib\
> :/usr/obj/DESTDIRs/main-CA7-chroot/usr/tests/lib/libthr/dlopen\
> :/usr/obj/DESTDIRs/main-CA7-chroot/usr/local/lib\
> :/usr/obj/DESTDIRs/main-CA7-chroot/usr/local/lib/python3.9/site-packages\
> :/usr/obj/DESTDIRs/main-CA7-chroot/usr/local/lib/python3.9/lib-dynload\
> :/usr/obj/DESTDIRs/main-CA7-chroot/usr/local/lib/perl5/5.32/mach/CORE\
> :/usr/obj/DESTDIRs/main-CA7-chroot/usr/local/lib/perl5/5.32/mach/auto \
> PATH=/usr/obj/DESTDIRs/main-CA7-chroot/sbin\
> :/usr/obj/DESTDIRs/main-CA7-chroot/bin\
> :/usr/obj/DESTDIRs/main-CA7-chroot/usr/sbin\
> :/usr/obj/DESTDIRs/main-CA7-chroot/usr/bin\
> :/usr/obj/DESTDIRs/main-CA7-chroot/usr/local/sbin\
> :/usr/obj/DESTDIRs/main-CA7-chroot/usr/local/bin\
> :/usr/obj/DESTDIRs/main-CA7-chroot/root/bin \
> /usr/obj/DESTDIRs/main-CA7-chroot/usr/bin/kyua test \
> -k /usr/obj/DESTDIRs/main-CA7-chroot/usr/tests/Kyuafile 
> sys/net/if_bridge_test:gif
sys/net/if_bridge_test:gif  ->  Jul 29 21:29:16 CA72-16Gp-ZFS dhclient[56641]: 
epair0a: not found
Jul 29 21:29:16 CA72-16Gp-ZFS dhclient[56641]: exiting.
Fatal data abort:
  x0: 0xa0275306c560
  x1: 0xa027f9d053d2
  x2: 0x002a
  x3: 0xa0275306c560
  x4: 0xa027f9d053fc
  x5: 0xa0275306c58a
  x6: 0x3ec2
  x7: 0x010006085ba958bc
  x8: 0x002a
  x9: 0x002a
 x10: 0x0008010006085ba9
 x11: 0x58bc3ec201000406
 x12: 0x016433c65ba9
 x13: 0x026433c6
 x14: 0x00ff
 x15: 0x289f
 x16: 0x0002d056b370 (_DYNAMIC + 0x370)
 x17: 0x00598110 (m_dup + 0x0)
 x18: 0x0002801e94a0
 x19: 0x0001
 x20: 0x
 x21: 0x
 x22: 0x00d95000 (vop_spare3_desc + 0x18)
 x23: 0xa0275306c500
 x24: 0xa0275306c500
 x25: 0x00a0
 x26: 0x0002
 x27: 0x
 x28: 0xa0275306c500
 x29: 0x0002801e94c0
  sp: 0x0002801e94a0
  lr: 0x00598308 (m_dup + 0x1f8)
 elr: 0x00598160 (m_dup + 0x50)
spsr: 0x2045
 far: 0x001c
 esr: 0x9604
panic: vm_fault failed: 0x00598160 error 1
cpuid = 14
time = 1690691356
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
vpanic() at vpanic+0x13c
panic() at panic+0x44
data_abort() at data_abort+0x2fc
handle_el1h_sync() at handle_el1h_sync+0x14
--- exception, esr 0x9604
m_dup() at m_dup+0x50
bridge_input() at bridge_input+0x17c
gif_input() at gif_input+0x2dc
in_gif_input() at in_gif_input+0x5c
encap_input() at encap_input+0xfc
encap4_input() at encap4_input+0x30
ip_input() at ip_input+0x5ac
netisr_dispatch_src() at netisr_dispatch_src+0xf8
ether_demux() at ether_demux+0x14c
ether_nh_input() at ether_nh_input+0x39c
netisr_dispatch_src() at netisr_dispatch_src+0xf8
ether_input() at ether_input+0x50
epair_tx_start_deferred() at epair_tx_start_deferred+0x110
taskqueue_run_locked() at taskqueue_run_locked+0x198
taskqueue_thread_loop() at taskqueue_thread_loop+0x130
fork_exit() at fork_exit+0x88
fork_trampoline() at fork_trampoline+0x14
KDB: enter: panic
[ thread pid 0 tid 1028122 ]
Stopped at  kdb_enter+0x44: str xzr, [x19, #3328]

For reference:

# uname -apKU
FreeBSD CA72-16Gp-ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT aarch64 1400093 #102 
main-n264334-215bab7924f6-dirty: Wed Jul 26 02:02:48 PDT 2023 
root@CA72-16Gp-ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72
 arm64 aarch64 1400093 1400093



===
Mark Millard
marklmi at yahoo.com

Re: For snapshot builds: armv7 chroot on aarch64 has kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin hung up [in getpid?], unkillable, prevents reboot

2023-07-07 Thread Mark Millard

On Jul 7, 2023, at 10:13, Mike Karels  wrote:

> On 7 Jul 2023, at 11:38, Mark Millard wrote:
> 
>> On Jul 7, 2023, at 07:36, Mark Millard  wrote:
>> 
>>> On Jul 7, 2023, at 06:50, Mike Karels  wrote:
>>> 
>>>> On 7 Jul 2023, at 6:06, John F Carr wrote:
>>>> 
>>>>> On Jul 6, 2023, at 20:42, Mike Karels  wrote:
>>>>>> 
>>>>>> 
>>>>>> Thanks for isolating this.  Let me know when you have the bug number.
>>>>>> I just tested a fix (the compat code drops the reference on the current
>>>>>> address space an extra time, probably freeing it).
>>>>>> 
>>>>>> Mike
>>>> 
>>>> The fix is in 
>>>> https://cgit.freebsd.org/src/commit/?id=be30fd3ab2e8418a696e69f54a91a7e2db5962de.
>>>> 
>>>>> The bug was introduced in January, 2022.   It allows 32 bit binaries to 
>>>>> crash a 64 bit system when COMPAT_FREEBSD32 is on.  Test coverage of the 
>>>>> buggy function (sysctl_kern_proc_vm_layout) was added at the same time.
>>>>> 
>>>>> There should be routine runs of 32 bit test suites on 64 bit systems.  
>>>>> Although i386 and armv7 are tier 2 systems, the tier 1 COMPAT_FREEBSD32 
>>>>> kernel code needs to be exercised.  This bug was only discovered by 
>>>>> manually running tests in the right environment, 17 months after 
>>>>> automated testing could have discovered it.
>>>> 
>>>> That is not so simple currently, as the shared libraries for the
>>>> test environment are not part of 32-bit compatibility package.
>>>> The required bits could be extracted from the corresponding 32-bit
>>>> build, but that isn't easy to automate.  Fortunately, I think that
>>>> very few of the tests exercise any 32-bit-specific code paths.
>>> 
>>> One way that I demonstrated this problem on an aarch64 system
>>> that supports aarch32/armv7 in user space was via the use of
>>> an official snapshot armv7 image. In my case I:
>>> 
>>> A) dd'd the image to USB3 media (after downloading it)
>>> B) mounted the ufs file system on the media to a mount point
>> 
>> I forgot to mention an important step: before chroot is
>> used, I preload various kernel modules that are used in
>> the Kyuafile tests --because the chroot'd activity will
>> not cause the loads of themselves but will use the
>> modules if they have already been loaded.
>> 
>>> C) used a chroot into that mount point to run the:
>>>   "kyua test -k /usr/tests/Kyuafile"
>>> 
>>> (I happened to do all that as root.)
>>> 
>>> There may be viable alternatives to dd'ing to allow an analogy to
>>> (B) for (C) to use. I've not experimented with using a jail
>>> instead of a chroot.
>>> 
>>> One can also install an armv7 world into a local directory tree
>>> and then use chroot (or analogous).
>>> 
>>> How far off is an analogous sort of procedure from being reasonable
>>> to automate?
> 
> It would be easier to use the packages rather than the full image
> (base.txz and tests.txz).  But doing this as part of a CI setup would
> mean fetching things from a different source from the install image,
> and then of course doing various configuration.  It's always a Small
> Matter of Programming.  Doing a full chroot gets into some other
> problems, e.g. mdconfig doesn't currently work in compatibility
> mode.  It doesn't seem that automating all this would yield much;
> it's hard enough to do manually.
> 
>>> i386, of course, also has lib32 and independently testing that is
>>> a messier issue, including trying to use /usr/tests/Kyuafile based
>>> testing that avoids use of chroot (or analogous). I'm not claiming
>>> lib32 has as simple of a potential path to automated testing.
> 
> I think the problem is essentially the same.  A chroot could be used
> or a 32-bit library setup (which would test the libraries as well).
> 
>>> I do not know if FreeBSD has powerpc64 hardware able to use a
>>> powerpc world directory tree analogously. Such hardware may be too
>>> old and otherwise problematical, making it not viable to automate
>>> testing.
> 
> Powerpc supports 32-bit libraries, unlike arm (so far).

My understanding is that powerpc64le does not in FreeBSD:
there is no powerpcle in FreeBSD. So, not even chroot style
support for 32-bit little endian use.

If I understand right, no 32 bit little endian ABI is defined,
other than the Void Linux activity's material, anyway.

It may be that all big endian POWER use has lib32 support, but
I'm not sure if all POWER has big endian FreeBSD support. Maybe
POWER9 (10?) still has such support in FreeBSD.

===
Mark Millard
marklmi at yahoo.com

Re: For snapshot builds: armv7 chroot on aarch64 has kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin hung up [in getpid?], unkillable, prevents reboot

2023-07-07 Thread Mike Karels

On 7 Jul 2023, at 11:38, Mark Millard wrote:

> On Jul 7, 2023, at 07:36, Mark Millard  wrote:
>
>> On Jul 7, 2023, at 06:50, Mike Karels  wrote:
>>
>>> On 7 Jul 2023, at 6:06, John F Carr wrote:
>>>
>>>> On Jul 6, 2023, at 20:42, Mike Karels  wrote:
>>>>>
>>>>>
>>>>> Thanks for isolating this.  Let me know when you have the bug number.
>>>>> I just tested a fix (the compat code drops the reference on the current
>>>>> address space an extra time, probably freeing it).
>>>>>
>>>>> Mike
>>>
>>> The fix is in 
>>> https://cgit.freebsd.org/src/commit/?id=be30fd3ab2e8418a696e69f54a91a7e2db5962de.
>>>
>>>> The bug was introduced in January, 2022.   It allows 32 bit binaries to 
>>>> crash a 64 bit system when COMPAT_FREEBSD32 is on.  Test coverage of the 
>>>> buggy function (sysctl_kern_proc_vm_layout) was added at the same time.
>>>>
>>>> There should be routine runs of 32 bit test suites on 64 bit systems.  
>>>> Although i386 and armv7 are tier 2 systems, the tier 1 COMPAT_FREEBSD32 
>>>> kernel code needs to be exercised.  This bug was only discovered by 
>>>> manually running tests in the right environment, 17 months after automated 
>>>> testing could have discovered it.
>>>
>>> That is not so simple currently, as the shared libraries for the
>>> test environment are not part of 32-bit compatibility package.
>>> The required bits could be extracted from the corresponding 32-bit
>>> build, but that isn't easy to automate.  Fortunately, I think that
>>> very few of the tests exercise any 32-bit-specific code paths.
>>
>> One way that I demonstrated this problem on an aarch64 system
>> that supports aarch32/armv7 in user space was via the use of
>> an official snapshot armv7 image. In my case I:
>>
>> A) dd'd the image to USB3 media (after downloading it)
>> B) mounted the ufs file system on the media to a mount point
>
> I forgot to mention an important step: before chroot is
> used, I preload various kernel modules that are used in
> the Kyuafile tests --because the chroot'd activity will
> not cause the loads of themselves but will use the
> modules if they have already been loaded.
>
>> C) used a chroot into that mount point to run the:
>>   "kyua test -k /usr/tests/Kyuafile"
>>
>> (I happened to do all that as root.)
>>
>> There may be viable alternatives to dd'ing to allow an analogy to
>> (B) for (C) to use. I've not experimented with using a jail
>> instead of a chroot.
>>
>> One can also install an armv7 world into a local directory tree
>> and then use chroot (or analogous).
>>
>> How far off is an analogous sort of procedure from being reasonable
>> to automate?

It would be easier to use the packages rather than the full image
(base.txz and tests.txz).  But doing this as part of a CI setup would
mean fetching things from a different source from the install image,
and then of course doing various configuration.  It's always a Small
Matter of Programming.  Doing a full chroot gets into some other
problems, e.g. mdconfig doesn't currently work in compatibility
mode.  It doesn't seem that automating all this would yield much;
it's hard enough to do manually.

>> i386, of course, also has lib32 and independently testing that is
>> a messier issue, including trying to use /usr/tests/Kyuafile based
>> testing that avoids use of chroot (or analogous). I'm not claiming
>> lib32 has as simple of a potential path to automated testing.

I think the problem is essentially the same.  A chroot could be used
or a 32-bit library setup (which would test the libraries as well).

>> I do not know if FreeBSD has powerpc64 hardware able to use a
>> powerpc world directory tree analogously. Such hardware may be too
>> old and otherwise problematical, making it not viable to automate
>> testing.

Powerpc supports 32-bit libraries, unlike arm (so far).

Mike

Re: For snapshot builds: armv7 chroot on aarch64 has kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin hung up [in getpid?], unkillable, prevents reboot

2023-07-07 Thread Mark Millard

On Jul 7, 2023, at 07:36, Mark Millard  wrote:

> On Jul 7, 2023, at 06:50, Mike Karels  wrote:
> 
>> On 7 Jul 2023, at 6:06, John F Carr wrote:
>> 
>>> On Jul 6, 2023, at 20:42, Mike Karels  wrote:
>>>> 
>>>> 
>>>> Thanks for isolating this.  Let me know when you have the bug number.
>>>> I just tested a fix (the compat code drops the reference on the current
>>>> address space an extra time, probably freeing it).
>>>> 
>>>> Mike
>> 
>> The fix is in 
>> https://cgit.freebsd.org/src/commit/?id=be30fd3ab2e8418a696e69f54a91a7e2db5962de.
>> 
>>> The bug was introduced in January, 2022.   It allows 32 bit binaries to 
>>> crash a 64 bit system when COMPAT_FREEBSD32 is on.  Test coverage of the 
>>> buggy function (sysctl_kern_proc_vm_layout) was added at the same time.
>>> 
>>> There should be routine runs of 32 bit test suites on 64 bit systems.  
>>> Although i386 and armv7 are tier 2 systems, the tier 1 COMPAT_FREEBSD32 
>>> kernel code needs to be exercised.  This bug was only discovered by 
>>> manually running tests in the right environment, 17 months after automated 
>>> testing could have discovered it.
>> 
>> That is not so simple currently, as the shared libraries for the
>> test environment are not part of 32-bit compatibility package.
>> The required bits could be extracted from the corresponding 32-bit
>> build, but that isn't easy to automate.  Fortunately, I think that
>> very few of the tests exercise any 32-bit-specific code paths.
> 
> One way that I demonstrated this problem on an aarch64 system
> that supports aarch32/armv7 in user space was via the use of
> an official snapshot armv7 image. In my case I:
> 
> A) dd'd the image to USB3 media (after downloading it)
> B) mounted the ufs file system on the media to a mount point

I forgot to mention an important step: before chroot is
used, I preload the various kernel modules that the
Kyuafile tests use, because the chroot'd activity will
not load the modules by itself but will use them if
they have already been loaded.
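
A minimal sketch of that preload step, assuming a FreeBSD host. The
thread does not list which modules were preloaded, so the names are
passed in by the caller and module_a/module_b below are placeholders.
By default the function only prints the kldload commands; setting
DRY_RUN=0 would actually run them (as root).

```shell
#!/bin/sh
# Sketch: load kernel modules on the host before entering the chroot.
# Module names are supplied by the caller; none are taken from the thread.

# DRY_RUN=1 (default) prints the commands; DRY_RUN=0 runs them.
DRY_RUN="${DRY_RUN:-1}"

preload_modules() {
    for mod in "$@"; do
        if [ "$DRY_RUN" = 1 ]; then
            echo "kldload -n $mod"
        else
            # -n: do not try to load a module that is already loaded
            kldload -n "$mod"
        fi
    done
}

# Example with placeholder names:
#   preload_modules module_a module_b
```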

> C) used a chroot into that mount point to run the:
>"kyua test -k /usr/tests/Kyuafile"
> 
> (I happened to do all that as root.)
> 
> There may be viable alternatives to dd'ing to allow an analogy to
> (B) for (C) to use. I've not experimented with using a jail
> instead of a chroot.
> 
> One can also install an armv7 world into a local directory tree
> and then use chroot (or analogous).
> 
> How far off is an analogous sort of procedure from being reasonable
> to automate?
> 
> i386, of course, also has lib32 and independently testing that is
> a messier issue, including trying to use /usr/tests/Kyuafile based
> testing that avoids use of chroot (or analogous). I'm not claiming
> lib32 has as simple of a potential path to automated testing.
> 
> I do not know if FreeBSD has powerpc64 hardware able to use a
> powerpc world directory tree analogously. Such hardware may be too
> old and otherwise problematical, making it not viable to automate
> testing.
> 



===
Mark Millard
marklmi at yahoo.com

Re: For snapshot builds: armv7 chroot on aarch64 has kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin hung up [in getpid?], unkillable, prevents reboot

2023-07-07 Thread Mark Millard

On Jul 7, 2023, at 06:50, Mike Karels  wrote:

> On 7 Jul 2023, at 6:06, John F Carr wrote:
> 
>> On Jul 6, 2023, at 20:42, Mike Karels  wrote:
>>> 
>>> 
>>> Thanks for isolating this.  Let me know when you have the bug number.
>>> I just tested a fix (the compat code drops the reference on the current
>>> address space an extra time, probably freeing it).
>>> 
>>> Mike
> 
> The fix is in 
> https://cgit.freebsd.org/src/commit/?id=be30fd3ab2e8418a696e69f54a91a7e2db5962de.
> 
>> The bug was introduced in January, 2022.   It allows 32 bit binaries to 
>> crash a 64 bit system when COMPAT_FREEBSD32 is on.  Test coverage of the 
>> buggy function (sysctl_kern_proc_vm_layout) was added at the same time.
>> 
>> There should be routine runs of 32 bit test suites on 64 bit systems.  
>> Although i386 and armv7 are tier 2 systems, the tier 1 COMPAT_FREEBSD32 
>> kernel code needs to be exercised.  This bug was only discovered by manually 
>> running tests in the right environment, 17 months after automated testing 
>> could have discovered it.
> 
> That is not so simple currently, as the shared libraries for the
> test environment are not part of 32-bit compatibility package.
> The required bits could be extracted from the corresponding 32-bit
> build, but that isn't easy to automate.  Fortunately, I think that
> very few of the tests exercise any 32-bit-specific code paths.

One way that I demonstrated this problem on an aarch64 system
that supports aarch32/armv7 in user space was via the use of
an official snapshot armv7 image. In my case I:

A) dd'd the image to USB3 media (after downloading it)
B) mounted the ufs file system on the media to a mount point
C) used a chroot into that mount point to run the:
"kyua test -k /usr/tests/Kyuafile"

(I happened to do all that as root.)
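
Steps (A) through (C) could be scripted roughly as follows. This is
only a sketch under assumptions: the image file name and /dev/da1 are
hypothetical, the s2a partition suffix and /mnt are borrowed from the
original report in this thread, and by default nothing is executed
(the commands are just printed).

```shell
#!/bin/sh
# Sketch of steps (A)-(C). Nothing runs unless DRY_RUN=0 (and then as root).
IMAGE="${IMAGE:-armv7-snapshot.img}"   # hypothetical image file name
DISK="${DISK:-/dev/da1}"               # hypothetical USB3 media device
MNT="${MNT:-/mnt}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

armv7_chroot_test() {
    run dd if="$IMAGE" of="$DISK" bs=1m conv=sync        # (A) write the image
    run mount -o noatime "${DISK}s2a" "$MNT"             # (B) mount its UFS root
    run chroot "$MNT" kyua test -k /usr/tests/Kyuafile   # (C) run the tests
}
```

Calling armv7_chroot_test with the defaults prints the three commands
in order, which makes it easy to review them before setting DRY_RUN=0.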

There may be viable alternatives to dd'ing to allow an analogy to
(B) for (C) to use. I've not experimented with using a jail
instead of a chroot.

One can also install an armv7 world into a local directory tree
and then use chroot (or analogous).

How far off is an analogous sort of procedure from being reasonable
to automate?

i386, of course, also has lib32 and independently testing that is
a messier issue, including trying to use /usr/tests/Kyuafile based
testing that avoids use of chroot (or analogous). I'm not claiming
lib32 has as simple of a potential path to automated testing.

I do not know if FreeBSD has powerpc64 hardware able to use a
powerpc world directory tree analogously. Such hardware may be too
old and otherwise problematical, making it not viable to automate
testing.

===
Mark Millard
marklmi at yahoo.com

Re: For snapshot builds: armv7 chroot on aarch64 has kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin hung up [in getpid?], unkillable, prevents reboot

2023-07-07 Thread Mike Karels

On 7 Jul 2023, at 6:06, John F Carr wrote:

> On Jul 6, 2023, at 20:42, Mike Karels  wrote:
>>
>>
>> Thanks for isolating this.  Let me know when you have the bug number.
>> I just tested a fix (the compat code drops the reference on the current
>> address space an extra time, probably freeing it).
>>
>> Mike

The fix is in 
https://cgit.freebsd.org/src/commit/?id=be30fd3ab2e8418a696e69f54a91a7e2db5962de.

> The bug was introduced in January, 2022.   It allows 32 bit binaries to crash 
> a 64 bit system when COMPAT_FREEBSD32 is on.  Test coverage of the buggy 
> function (sysctl_kern_proc_vm_layout) was added at the same time.
>
> There should be routine runs of 32 bit test suites on 64 bit systems.  
> Although i386 and armv7 are tier 2 systems, the tier 1 COMPAT_FREEBSD32 
> kernel code needs to be exercised.  This bug was only discovered by manually 
> running tests in the right environment, 17 months after automated testing 
> could have discovered it.

That is not so simple currently, as the shared libraries for the
test environment are not part of the 32-bit compatibility package.
The required bits could be extracted from the corresponding 32-bit
build, but that isn't easy to automate.  Fortunately, I think that
very few of the tests exercise any 32-bit-specific code paths.

Mike

Re: For snapshot builds: armv7 chroot on aarch64 has kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin hung up [in getpid?], unkillable, prevents reboot

2023-07-07 Thread John F Carr

On Jul 6, 2023, at 20:42, Mike Karels  wrote:
> 
> 
> Thanks for isolating this.  Let me know when you have the bug number.
> I just tested a fix (the compat code drops the reference on the current
> address space an extra time, probably freeing it).
> 
> Mike

The bug was introduced in January, 2022.   It allows 32 bit binaries to crash a 
64 bit system when COMPAT_FREEBSD32 is on.  Test coverage of the buggy function 
(sysctl_kern_proc_vm_layout) was added at the same time.

There should be routine runs of 32 bit test suites on 64 bit systems.  Although 
i386 and armv7 are tier 2 systems, the tier 1 COMPAT_FREEBSD32 kernel code 
needs to be exercised.  This bug was only discovered by manually running tests 
in the right environment, 17 months after automated testing could have 
discovered it.

Re: For snapshot builds: armv7 chroot on aarch64 has kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin hung up [in getpid?], unkillable, prevents reboot

2023-07-06 Thread Mike Karels

On 6 Jul 2023, at 18:53, John F Carr wrote:

> The hang is caused by the sysctl call in tests/sys/kern/kern_copyin.c.  The 
> function below hangs when called in a 32 bit ARM process running in a chroot 
> environment on a 64 bit ARM system.  I will write up a bug report.
>
> static int
> get_vm_layout(struct kinfo_vm_layout *kvm)
> {
>   size_t len;
>   int mib[4];
>
>   mib[0] = CTL_KERN;
>   mib[1] = KERN_PROC;
>   mib[2] = KERN_PROC_VM_LAYOUT;
>   mib[3] = getpid();
>   len = sizeof(*kvm);
>
>   return (sysctl(mib, nitems(mib), kvm, &len, NULL, 0));
> }

Thanks for isolating this.  Let me know when you have the bug number.
I just tested a fix (the compat code drops the reference on the current
address space an extra time, probably freeing it).

Mike

Re: For snapshot builds: armv7 chroot on aarch64 has kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin hung up [in getpid?], unkillable, prevents reboot

2023-07-06 Thread Mark Millard

On Jul 6, 2023, at 16:53, John F Carr  wrote:

> On Jun 25, 2023, at 20:16, Mark Millard  wrote:
>> 
>> . . .
>> 
> 
> The hang is caused by the sysctl call in tests/sys/kern/kern_copyin.c.  The 
> function below hangs when called in a 32 bit ARM process running in a chroot 
> environment on a 64 bit ARM system.  I will write up a bug report.
> 
> static int
> get_vm_layout(struct kinfo_vm_layout *kvm)
> {
> size_t len;
> int mib[4];
> 
> mib[0] = CTL_KERN;
> mib[1] = KERN_PROC;
> mib[2] = KERN_PROC_VM_LAYOUT;
> mib[3] = getpid();
> len = sizeof(*kvm);
> 
> return (sysctl(mib, nitems(mib), kvm, &len, NULL, 0));
> }
> 

Thanks for the tiny-reproducer analysis! That should help
make getting to a fix more actionable.

===
Mark Millard
marklmi at yahoo.com

Re: For snapshot builds: armv7 chroot on aarch64 has kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin hung up [in getpid?], unkillable, prevents reboot

2023-07-06 Thread John F Carr




> On Jun 25, 2023, at 20:16, Mark Millard  wrote:
> 
> Using the likes of:
> 
> FreeBSD-14.0-CURRENT-arm64-aarch64-ROCK64-20230622-b95d2237af40-263748.img
> and:
> FreeBSD-14.0-CURRENT-arm-armv7-GENERICSD-20230622-b95d2237af40-263748.img
> 
> I have shown the following behavior after setting up storage
> media based on them. (This was a test that my builds were not
> odd for the issue.)
> 
> Boot the aarch64 media and log in. (Note: I logged in
> as root.)
> 
> mount the armv7 media (-noatime is just my habit)
> and then put it to use:
> 
> # mount -onoatime /dev/da1s2a /mnt
> 
> # chroot /mnt/
> 
> # kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin
> sys/kern/kern_copyin:kern_copyin  ->  
> 
> On the serial console:
> 
> # ps -xu
> USER  PID   %CPU %MEM   VSZ  RSS TT  STAT STARTED      TIME COMMAND
> root   11 1498.4  0.0     0  256  -  RNL  23:24   542:52.92 [idle]
> root 1174  100.0  0.0     0   16  -  Rs   23:37     0:00.00 /usr/tests/sys/kern/kern_copyin -vunprivileged-user=tests -r/tmp/kyua.9YUttj/2/result.atf kern_copyin
> root    0    0.0  0.0     0 1616  -  DLs  23:24     0:00.50 [kernel]
> root    1    0.0  0.0 11704 1288  -  ILs  23:24     0:00.02 /sbin/init
> root    2    0.0  0.0     0  256  -  WL   23:24     0:00.26 [clock]
> root    3    0.0  0.0     0  272  -  DL   23:24     0:00.00 [crypto]
> root    4    0.0  0.0     0   80  -  DL   23:24     0:00.95 [cam]
> root    5    0.0  0.0     0   16  -  DL   23:24     0:00.00 [busdma]
> root    6    0.0  0.0     0   16  -  DL   23:24     0:00.03 [rand_harvestq]
> root    7    0.0  0.0     0   48  -  DL   23:24     0:00.06 [pagedaemon]
> root    8    0.0  0.0     0   16  -  DL   23:24     0:00.00 [vmdaemon]
> root    9    0.0  0.0     0  160  -  DL   23:24     0:00.38 [bufdaemon]
> root   10    0.0  0.0     0   16  -  DL   23:24     0:00.00 [audit]
> root   12    0.0  0.0     0  880  -  WL   23:24     0:11.81 [intr]
> root   13    0.0  0.0     0   48  -  DL   23:24     0:00.04 [geom]
> root   14    0.0  0.0     0   16  -  DL   23:24     0:00.00 [sequencer 00]
> root   15    0.0  0.0     0  160  -  DL   23:24     0:06.42 [usb]
> root   16    0.0  0.0     0   16  -  DL   23:24     0:00.10 [acpi_thermal]
> root   17    0.0  0.0     0   16  -  DL   23:24     0:00.00 [acpi_cooling0]
> root   18    0.0  0.0     0   16  -  DL   23:24     0:00.04 [syncer]
> root   19    0.0  0.0     0   16  -  DL   23:24     0:00.00 [vnlru]
> root  671    0.0  0.0 13260 2600  -  Is   23:25     0:00.00 dhclient: system.syslog (dhclient)
> root  674    0.0  0.0 13260 2752  -  Is   23:25     0:00.00 dhclient: dpni0 [priv] (dhclient)
> root  761    0.0  0.0 14572 3972  -  Ss   23:25     0:00.02 /sbin/devd
> root  964    0.0  0.0 12832 2764  -  Is   23:25     0:00.02 /usr/sbin/syslogd -s
> root 1033    0.0  0.0 13012 2604  -  Ss   23:25     0:00.01 /usr/sbin/cron -s
> root 1058    0.0  0.0 21052 8308  -  Is   23:25     0:00.01 sshd: /usr/sbin/sshd [listener] 0 of 10-100 startups (sshd)
> root 1078    0.0  0.0 21288 9304  -  Is   23:26     0:00.09 sshd: root@pts/0 (sshd)
> root 1175    0.0  0.0 21288 9496  -  Is   23:37     0:00.04 sshd: root@pts/1 (sshd)
> root 1074    0.0  0.0 13380 3008 u0  Is   23:25     0:00.01 login [pam] (login)
> root 1075    0.0  0.0 13460 3292 u0  S    23:25     0:00.02 -sh (sh)
> root 1233    0.0  0.0 13588 3016 u0  R+   00:00     0:00.00 ps -xu
> root 1081    0.0  0.0 13460 3328  0  Is   23:26     0:00.02 -sh (sh)
> root 1170    0.0  0.0  5788 2884  0  I    23:36     0:00.02 /bin/sh -i
> root 1172    0.0  0.0 10408 7192  0  I+   23:37     0:00.30 kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin
> root 1178    0.0  0.0 13460 3320  1  Is+  23:38     0:00.01 -sh (sh)
> 
> 1174 is stuck, even if one waits for 30min+.
> kill and kill -9 will not kill 1174.
> 
> "shutdown -r now" hangs before the reboot happens
> and reports: "some processes would not die".
> 
> An interesting property is that ps and top disagree
> about 1174 CPU usage: ps 100%, top 0%. But top also
> indicates 1174 always has CPU0 "STATE". (Across
> tests CPUn varies but within a test it has
> a fixed n.)
> 
> I have also seen ps "STAT" being RXs.
> 
> The following is from my earlier activity with my own
> builds involved, here 1119, not the 1174 from above.
> truss reports as the last thing for the stuck process
> as "getpid()".
> 
> . . .
> 1119: 0.588983953 fstatat(AT_FDCWD,"/usr/tests/sys/kern/kern_copyin",{ 
> mode=-r-xr-xr-x ,inode=111756,size=9776,blksize=10240 },AT_SYMLINK_NOFOLLOW) 
> = 0 (0x0)
> 1119: 0.589065030 
> mmap(0x0,20480,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANO

Re: For snapshot builds: armv7 chroot on aarch64 has kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin hung up [in getpid?], unkillable, prevents reboot

2023-07-05 Thread Mark Millard

On Jun 25, 2023, at 17:16, Mark Millard  wrote:

> Using the likes of:
> 
> FreeBSD-14.0-CURRENT-arm64-aarch64-ROCK64-20230622-b95d2237af40-263748.img
> and:
> FreeBSD-14.0-CURRENT-arm-armv7-GENERICSD-20230622-b95d2237af40-263748.img
> 
> I have shown the following behavior after setting up storage
> media based on them. (This was a test that my builds were not
> odd for the issue.)
> 
> Boot the aarch64 media and log in. (Note: I logged in
> as root.)
> 
> mount the armv7 media (-noatime is just my habit)
> and then put it to use:
> 
> # mount -onoatime /dev/da1s2a /mnt
> 
> # chroot /mnt/
> 
> # kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin
> sys/kern/kern_copyin:kern_copyin  ->  
> 
> On the serial console:
> 
> # ps -xu
> USER  PID   %CPU %MEM   VSZ  RSS TT  STAT STARTED      TIME COMMAND
> root   11 1498.4  0.0     0  256  -  RNL  23:24   542:52.92 [idle]
> root 1174  100.0  0.0     0   16  -  Rs   23:37     0:00.00 /usr/tests/sys/kern/kern_copyin -vunprivileged-user=tests -r/tmp/kyua.9YUttj/2/result.atf kern_copyin
> root    0    0.0  0.0     0 1616  -  DLs  23:24     0:00.50 [kernel]
> root    1    0.0  0.0 11704 1288  -  ILs  23:24     0:00.02 /sbin/init
> root    2    0.0  0.0     0  256  -  WL   23:24     0:00.26 [clock]
> root    3    0.0  0.0     0  272  -  DL   23:24     0:00.00 [crypto]
> root    4    0.0  0.0     0   80  -  DL   23:24     0:00.95 [cam]
> root    5    0.0  0.0     0   16  -  DL   23:24     0:00.00 [busdma]
> root    6    0.0  0.0     0   16  -  DL   23:24     0:00.03 [rand_harvestq]
> root    7    0.0  0.0     0   48  -  DL   23:24     0:00.06 [pagedaemon]
> root    8    0.0  0.0     0   16  -  DL   23:24     0:00.00 [vmdaemon]
> root    9    0.0  0.0     0  160  -  DL   23:24     0:00.38 [bufdaemon]
> root   10    0.0  0.0     0   16  -  DL   23:24     0:00.00 [audit]
> root   12    0.0  0.0     0  880  -  WL   23:24     0:11.81 [intr]
> root   13    0.0  0.0     0   48  -  DL   23:24     0:00.04 [geom]
> root   14    0.0  0.0     0   16  -  DL   23:24     0:00.00 [sequencer 00]
> root   15    0.0  0.0     0  160  -  DL   23:24     0:06.42 [usb]
> root   16    0.0  0.0     0   16  -  DL   23:24     0:00.10 [acpi_thermal]
> root   17    0.0  0.0     0   16  -  DL   23:24     0:00.00 [acpi_cooling0]
> root   18    0.0  0.0     0   16  -  DL   23:24     0:00.04 [syncer]
> root   19    0.0  0.0     0   16  -  DL   23:24     0:00.00 [vnlru]
> root  671    0.0  0.0 13260 2600  -  Is   23:25     0:00.00 dhclient: system.syslog (dhclient)
> root  674    0.0  0.0 13260 2752  -  Is   23:25     0:00.00 dhclient: dpni0 [priv] (dhclient)
> root  761    0.0  0.0 14572 3972  -  Ss   23:25     0:00.02 /sbin/devd
> root  964    0.0  0.0 12832 2764  -  Is   23:25     0:00.02 /usr/sbin/syslogd -s
> root 1033    0.0  0.0 13012 2604  -  Ss   23:25     0:00.01 /usr/sbin/cron -s
> root 1058    0.0  0.0 21052 8308  -  Is   23:25     0:00.01 sshd: /usr/sbin/sshd [listener] 0 of 10-100 startups (sshd)
> root 1078    0.0  0.0 21288 9304  -  Is   23:26     0:00.09 sshd: root@pts/0 (sshd)
> root 1175    0.0  0.0 21288 9496  -  Is   23:37     0:00.04 sshd: root@pts/1 (sshd)
> root 1074    0.0  0.0 13380 3008 u0  Is   23:25     0:00.01 login [pam] (login)
> root 1075    0.0  0.0 13460 3292 u0  S    23:25     0:00.02 -sh (sh)
> root 1233    0.0  0.0 13588 3016 u0  R+   00:00     0:00.00 ps -xu
> root 1081    0.0  0.0 13460 3328  0  Is   23:26     0:00.02 -sh (sh)
> root 1170    0.0  0.0  5788 2884  0  I    23:36     0:00.02 /bin/sh -i
> root 1172    0.0  0.0 10408 7192  0  I+   23:37     0:00.30 kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin
> root 1178    0.0  0.0 13460 3320  1  Is+  23:38     0:00.01 -sh (sh)
> 
> 1174 is stuck, even if one waits for 30min+.
> kill and kill -9 will not kill 1174.
> 
> "shutdown -r now" hangs before the reboot happens
> and reports: "some processes would not die".
> 
> An interesting property is that ps and top disagree
> about 1174 CPU usage: ps 100%, top 0%. But top also
> indicates 1174 always has CPU0 "STATE". (Across
> tests CPUn varies but within a test it has
> a fixed n.)
> 
> I have also seen ps "STAT" being RXs.
> 
> The following is from my earlier activity with my own
> builds involved, here 1119, not the 1174 from above.
> truss reports as the last thing for the stuck process
> as "getpid()".
> 
> . . .
> 1119: 0.588983953 fstatat(AT_FDCWD,"/usr/tests/sys/kern/kern_copyin",{ 
> mode=-r-xr-xr-x ,inode=111756,size=9776,blksize=10240 },AT_SYMLINK_NOFOLLOW) 
> = 0 (0x0)
> 1119: 0.589065030 
> mmap(0x0,20480,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON|MAP_ALIGNED(

For snapshot builds: armv7 chroot on aarch64 has kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin hung up [in getpid?], unkillable, prevents reboot

2023-06-25 Thread Mark Millard

Using the likes of:

FreeBSD-14.0-CURRENT-arm64-aarch64-ROCK64-20230622-b95d2237af40-263748.img
and:
FreeBSD-14.0-CURRENT-arm-armv7-GENERICSD-20230622-b95d2237af40-263748.img

I have shown the following behavior after setting up storage
media based on them. (This was a test that my builds were not
odd for the issue.)

Boot the aarch64 media and log in. (Note: I logged in
as root.)

mount the armv7 media (-noatime is just my habit)
and then put it to use:

# mount -onoatime /dev/da1s2a /mnt

# chroot /mnt/

# kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin
sys/kern/kern_copyin:kern_copyin  ->  

On the serial console:

# ps -xu
USER  PID   %CPU %MEM   VSZ  RSS TT  STAT STARTED      TIME COMMAND
root   11 1498.4  0.0     0  256  -  RNL  23:24   542:52.92 [idle]
root 1174  100.0  0.0     0   16  -  Rs   23:37     0:00.00 /usr/tests/sys/kern/kern_copyin -vunprivileged-user=tests -r/tmp/kyua.9YUttj/2/result.atf kern_copyin
root    0    0.0  0.0     0 1616  -  DLs  23:24     0:00.50 [kernel]
root    1    0.0  0.0 11704 1288  -  ILs  23:24     0:00.02 /sbin/init
root    2    0.0  0.0     0  256  -  WL   23:24     0:00.26 [clock]
root    3    0.0  0.0     0  272  -  DL   23:24     0:00.00 [crypto]
root    4    0.0  0.0     0   80  -  DL   23:24     0:00.95 [cam]
root    5    0.0  0.0     0   16  -  DL   23:24     0:00.00 [busdma]
root    6    0.0  0.0     0   16  -  DL   23:24     0:00.03 [rand_harvestq]
root    7    0.0  0.0     0   48  -  DL   23:24     0:00.06 [pagedaemon]
root    8    0.0  0.0     0   16  -  DL   23:24     0:00.00 [vmdaemon]
root    9    0.0  0.0     0  160  -  DL   23:24     0:00.38 [bufdaemon]
root   10    0.0  0.0     0   16  -  DL   23:24     0:00.00 [audit]
root   12    0.0  0.0     0  880  -  WL   23:24     0:11.81 [intr]
root   13    0.0  0.0     0   48  -  DL   23:24     0:00.04 [geom]
root   14    0.0  0.0     0   16  -  DL   23:24     0:00.00 [sequencer 00]
root   15    0.0  0.0     0  160  -  DL   23:24     0:06.42 [usb]
root   16    0.0  0.0     0   16  -  DL   23:24     0:00.10 [acpi_thermal]
root   17    0.0  0.0     0   16  -  DL   23:24     0:00.00 [acpi_cooling0]
root   18    0.0  0.0     0   16  -  DL   23:24     0:00.04 [syncer]
root   19    0.0  0.0     0   16  -  DL   23:24     0:00.00 [vnlru]
root  671    0.0  0.0 13260 2600  -  Is   23:25     0:00.00 dhclient: system.syslog (dhclient)
root  674    0.0  0.0 13260 2752  -  Is   23:25     0:00.00 dhclient: dpni0 [priv] (dhclient)
root  761    0.0  0.0 14572 3972  -  Ss   23:25     0:00.02 /sbin/devd
root  964    0.0  0.0 12832 2764  -  Is   23:25     0:00.02 /usr/sbin/syslogd -s
root 1033    0.0  0.0 13012 2604  -  Ss   23:25     0:00.01 /usr/sbin/cron -s
root 1058    0.0  0.0 21052 8308  -  Is   23:25     0:00.01 sshd: /usr/sbin/sshd [listener] 0 of 10-100 startups (sshd)
root 1078    0.0  0.0 21288 9304  -  Is   23:26     0:00.09 sshd: root@pts/0 (sshd)
root 1175    0.0  0.0 21288 9496  -  Is   23:37     0:00.04 sshd: root@pts/1 (sshd)
root 1074    0.0  0.0 13380 3008 u0  Is   23:25     0:00.01 login [pam] (login)
root 1075    0.0  0.0 13460 3292 u0  S    23:25     0:00.02 -sh (sh)
root 1233    0.0  0.0 13588 3016 u0  R+   00:00     0:00.00 ps -xu
root 1081    0.0  0.0 13460 3328  0  Is   23:26     0:00.02 -sh (sh)
root 1170    0.0  0.0  5788 2884  0  I    23:36     0:00.02 /bin/sh -i
root 1172    0.0  0.0 10408 7192  0  I+   23:37     0:00.30 kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin
root 1178    0.0  0.0 13460 3320  1  Is+  23:38     0:00.01 -sh (sh)

1174 is stuck, even if one waits for 30min+.
kill and kill -9 will not kill 1174.

"shutdown -r now" hangs before the reboot happens
and reports: "some processes would not die".

An interesting property is that ps and top disagree
about 1174 CPU usage: ps 100%, top 0%. But top also
indicates 1174 always has CPU0 "STATE". (Across
tests CPUn varies but within a test it has
a fixed n.)
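
The odd combination in the ps listing (about 100% CPU but an
accumulated TIME of 0:00.00) can be picked out mechanically. The
following is only a sketch: it assumes the "ps -xu" column order shown
in this report (PID in column 2, %CPU in column 3, TIME in column 10),
and the function name and the 99% threshold are arbitrary choices of
mine. It is a filter for this one signature, not a general hang
detector.

```shell
#!/bin/sh
# Reads "ps -xu" style output on stdin and prints "PID COMMAND" for every
# process shown at >= 99% CPU whose accumulated TIME is still 0:00.00.
flag_zero_time_spinners() {
    awk 'NR > 1 && $3 + 0 >= 99 && $10 == "0:00.00" {
        printf "%s", $2
        for (i = 11; i <= NF; i++) printf " %s", $i
        printf "\n"
    }'
}

# Usage on a live system:
#   ps -xu | flag_zero_time_spinners
```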

I have also seen ps "STAT" being RXs.

The following is from my earlier activity with my own
builds, where the stuck process was 1119 rather than the
1174 above. The last thing truss reports for the stuck
process is "getpid()".

. . .
1119: 0.588983953 fstatat(AT_FDCWD,"/usr/tests/sys/kern/kern_copyin",{ mode=-r-xr-xr-x ,inode=111756,size=9776,blksize=10240 },AT_SYMLINK_NOFOLLOW) = 0 (0x0)
1119: 0.589065030 mmap(0x0,20480,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON|MAP_ALIGNED(12),-1,0x0) = 1074188288 (0x4006d000)
1119: 0.589227544 openat(AT_FDCWD,"/tmp/kyua.aBQv6E/2/result.atf",O_WRONLY|O_CREAT|O_TRUNC,0644) = 3 (0x3)
1119: 0.589276503 getpid()  = 1119 (0x45f)
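
When a saved trace covers several processes, the last call reported
for each one can be extracted with a small filter. This sketch assumes
only the "PID: seconds syscall(...)" line shape seen above (which looks
like truss run with -f and -d, though the exact command line is not
shown in this report); the function name is my own. Wrapped
continuation lines that do not start with "PID:" are ignored.

```shell
#!/bin/sh
# Reads a truss log on stdin and prints, for each PID, the last line that
# starts with "PID: ". PIDs are listed in order of first appearance.
last_syscall_per_pid() {
    awk '/^ *[0-9]+: / {
        pid = $1
        sub(/:$/, "", pid)
        if (!(pid in last)) order[++n] = pid
        last[pid] = $0
    }
    END { for (i = 1; i <= n; i++) print last[order[i]] }'
}

# Usage, with truss.out being a previously saved trace:
#   last_syscall_per_pid < truss.out
```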



For reference, from inside an armv7 chroot session
before doing such a test:

# uname -apKU
FreeBSD generic 14.0-CURRENT FreeBSD 14.0-CURRENT #0 main-n263748-b95d2237af40: Thu Jun 22 11:10:50 UTC 2023 r...@releng1.nyi.freebsd.org:/usr/obj/usr/src/arm64.aarch64/sys/GENERIC arm armv7 1400090 1400090

===
Mark Millard
marklmi at yahoo.com

Re: TP-LINK USB no carrier after speed test

2023-01-18 Thread Hans Petter Selasky


On 1/18/23 12:45, Gary Jennejohn wrote:

It's not clear from the content of README.md whether Hans has added
thunderbolt to the files under /sys/conf.


Not much has changed there currently, apart from regularly rebasing 
the repository on top of FreeBSD-main.


I currently have my hands full!

--HPS

Re: TP-LINK USB no carrier after speed test

2023-01-18 Thread Gary Jennejohn

On Tue, 17 Jan 2023 19:58:54 -0300 (-03)
Ivan Quitschal  wrote:

> On Tue, 27 Sep 2022, Hans Petter Selasky wrote:
>
> >
> > FYI: There is some experimental thunderbolt support at:
> >
> > https://github.com/hselasky/usb4
> >
> > But I'm not sure if it supports the hardware you've got.
> >
> > --HPS
> >
>
>
> Hi Hans
>
> I told you earlier today that the problem was solved by plugging it into
> USB 2.0; well, I was wrong. The problem came back just like before.
>
> I see Alexander also has the same XHCI that i have here
>
> xhci0@pci0:0:20:0:  class=0x0c0330 rev=0x20 hdr=0x00 vendor=0x8086 
> device=0xa0ed
> subvendor=0x1028 subdevice=0x0ab0
>  vendor = 'Intel Corporation'
>  device = 'Tiger Lake-LP USB 3.2 Gen 2x1 xHCI Host Controller'
>  class  = serial bus
>  subclass   = USB
>
>
> maybe this tiger lake support is the problem?
>
>
> I have checked your git repository above, how could i test it here ? what 
> dirs am i supposed to copy to my /usr/src ?
>
> thank you
>

That information is in the README.md:

This implements a basic kernel driver and userland tool for USB4 and
Thunderbolt3.  The relevant code is in the following locations:

sys/dev/thunderbolt
sys/modules/thunderbolt
usr.sbin/tbtconfig

So, you need the contents of those directories.

You'll have to build a module under sys/modules/thunderbolt, which
should result in a tb.ko file, which will have to be loaded using
kldload.

You also have to go into /usr/src/usr.sbin/tbtconfig and build that
binary.  There's a manpage there.
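
Put together, the steps above might look like the following sketch.
It assumes the usb4 repository has been cloned somewhere locally (the
USB4_CHECKOUT path is hypothetical) and that it uses the same
directory layout as /usr/src. By default the function only prints the
commands. Loading the resulting tb.ko with kldload is left as a manual
step, since where the file ends up depends on the local object
directory setup.

```shell
#!/bin/sh
# Sketch: copy the three thunderbolt directories into /usr/src and build them.
SRC="${SRC:-/usr/src}"
USB4_CHECKOUT="${USB4_CHECKOUT:-$HOME/usb4}"   # hypothetical clone location
DRY_RUN="${DRY_RUN:-1}"                        # 1 = print only, 0 = execute

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

build_thunderbolt() {
    for dir in sys/dev/thunderbolt sys/modules/thunderbolt usr.sbin/tbtconfig; do
        run mkdir -p "$SRC/$dir"
        run cp -R "$USB4_CHECKOUT/$dir/." "$SRC/$dir/"
    done
    run make -C "$SRC/sys/modules/thunderbolt"   # should produce tb.ko
    run make -C "$SRC/usr.sbin/tbtconfig"        # userland tool
}
```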

It's not clear from the content of README.md whether Hans has added
thunderbolt to the files under /sys/conf.

--
Gary Jennejohn

Re: RES: TP-LINK USB no carrier after speed test

2023-01-17 Thread Ivan Quitschal





On Tue, 27 Sep 2022, Hans Petter Selasky wrote:



FYI: There is some experimental thunderbolt support at:

https://github.com/hselasky/usb4

But I'm not sure if it supports the hardware you've got.

--HPS




Hi Hans

I told you earlier today that the problem was solved by plugging it into USB
2.0; well, I was wrong. The problem came back just like before.

I see Alexander also has the same XHCI that i have here

xhci0@pci0:0:20:0:  class=0x0c0330 rev=0x20 hdr=0x00 vendor=0x8086 device=0xa0ed
subvendor=0x1028 subdevice=0x0ab0
vendor = 'Intel Corporation'
device = 'Tiger Lake-LP USB 3.2 Gen 2x1 xHCI Host Controller'
class  = serial bus
subclass   = USB


maybe this tiger lake support is the problem?


I have checked your git repository above, how could i test it here ? what dirs 
am i supposed to copy to my /usr/src ?


thank you

--tzk

Re: RES: TP-LINK USB no carrier after speed test

2023-01-17 Thread Hans Petter Selasky


On 1/17/23 14:13, Ivan Quitschal wrote:
not THAT fine of course, since its limited to around 300mbps. when in 
USB 3 it reaches 600mbps just fine.


but besides that limitation from the version 2.0, it really works. ive 
tried a whole day of heavy traffic here and nothing happened at all.


rings any bells ?


Yes,

I see that too:

ugen0.3:  at usbus0, cfg=0 md=HOST spd=HIGH 
(480Mbps) pwr=ON (248mA)


Works like a charm at spd=HIGH, but probably not at super-speed.

Maybe the vendor does something different when the speed is super speed 
so that the BULK transport can move more data at a time ...


Vendor documentation is wanted! Maybe you simply need to USB trace the 
protocol when super-speed is used and vendor drivers are in place.


Right now there is no option to disable super speed only, but maybe try 
to run this command on all USB 3.x root HUBS:


usbconfig -d X.Y set_config 255

Maybe the device will show up as high-speed on the other computer 
as well. It's worth a try.
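
To find the X.Y values for the USB 3.x root hubs, the device list can
be filtered. A rough sketch, assuming root hubs show up in the
usbconfig listing with "root HUB" in their description and
"spd=SUPER" in the same line; the function name is made up, and the
exact description text may differ between controllers.

```shell
#!/bin/sh
# Reads usbconfig listing lines on stdin and prints the bus.addr (X.Y)
# of each entry that looks like a SuperSpeed root hub.
super_speed_root_hubs() {
    awk '/root HUB/ && /spd=SUPER/ {
        dev = $1
        sub(/^ugen/, "", dev)
        sub(/:$/, "", dev)
        print dev
    }'
}

# Usage (as root); note this affects everything attached to those hubs:
#   usbconfig list | super_speed_root_hubs | while read -r dev; do
#       usbconfig -d "$dev" set_config 255
#   done
```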


--HPS

Re: RES: TP-LINK USB no carrier after speed test

2023-01-17 Thread Ivan Quitschal





On Wed, 28 Sep 2022, Ivan Quitschal wrote:


FYI: There is some experimental thunderbolt support at:

https://github.com/hselasky/usb4

But I'm not sure if it supports the hardware you've got.

--HPS




Hi Hans

i got two log versions for you, one with the constant set to 2048 (the 
working version) , and the other with no patches whatsoever (the bad 
version)


since the entire logs reached more than 150M of size, i had to cut to the 
last 1000 lines , hope that's enough


please find attached the two files

for xhci_NOT_working i stopped recording right after the problem occurred

please let me know if you need something else

thank you

--tzk


I need the full log.

The XHCI driver is very verbose you see.

Maybe you can do some filtering, like figuring out all the status codes you 
see:


status=1
status=13

and so on.
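
One way to do that filtering, as a sketch: it assumes only that the
log contains tokens of the form status=N as quoted above, and the
function name is made up. It prints each distinct status code with the
number of times it occurs, most frequent first, so the summary stays
small even when the full log is far too large to mail.

```shell
#!/bin/sh
# Reads a log on stdin and prints "status=N COUNT" for each distinct
# status code, most frequent first.
count_status_codes() {
    grep -o 'status=[0-9][0-9]*' | sort | uniq -c | sort -rn |
        awk '{ print $2, $1 }'
}

# Usage, with xhci.log being the saved debug log:
#   count_status_codes < xhci.log
```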

--HPS






Hi Hans,

sorry for the delay. i wasn't able to send the full logs because they grew too 
big :(

but i have something, not sure if that helps.

it happens that ive moved my notebook to another place and now im using the 
ethernet adapter in the port USB 2.0 instead of USB 3 (where the problem used to 
happen) and now it works fine.


not THAT fine of course, since its limited to around 300mbps. when in USB 3 it 
reaches 600mbps just fine.


but besides that limitation from the version 2.0, it really works. ive tried a 
whole day of heavy traffic here and nothing happened at all.


rings any bells ?

thanks

Ivan

PS: (if you still want that log, let me know some place where i could upload it, 
i dont know)

Re: RES: TP-LINK USB no carrier after speed test

2022-09-28 Thread Ivan Quitschal




On Wed, 28 Sep 2022, Hans Petter Selasky wrote:


On 9/28/22 11:07, Ivan Quitschal wrote:



On Tue, 27 Sep 2022, Hans Petter Selasky wrote:


On 9/27/22 15:22, Hans Petter Selasky wrote:

On 9/27/22 14:17, Ivan Quitschal wrote:



On Tue, 27 Sep 2022, Hans Petter Selasky wrote:


On 9/27/22 02:24, Alexander Motin wrote:

On 26.09.2022 17:29, Hans Petter Selasky wrote:
I've got a supposedly "broken" if_ure dongle from Alexander, but I'm 
unable to reproduce the if_ure hang on two different pieces of XHCI 
hardware, Intel based and AMD based, which I've got.


This leads me to believe there is a bug in the XHCI driver or 
hardware on your system.


Can you share the pciconf -lv output for your XHCI controllers?


I have two laptops of different generations reproducing this problem, 
but both are having Thunderbolt on the USB-C ports:


This is one (7th Gen Core i7):

xhci1@pci0:56:0:0:  class=0x0c0330 rev=0x02 hdr=0x00 vendor=0x8086 
device=0x15d4 subvendor=0x subdevice=0x

 vendor = 'Intel Corporation'
 device = 'JHL6540 Thunderbolt 3 USB Controller (C step) 
[Alpine Ridge 4C 2016]'

 class  = serial bus
 subclass   = USB
 bar   [10] = type Memory, range 32, base 0xc3f0, size 65536, 
enabled

 cap 01[80] = powerspec 3  supports D0 D1 D2 D3  current D0
 cap 05[88] = MSI supports 8 messages, 64 bit enabled with 1 
message

 cap 10[c0] = PCI-Express 2 endpoint max data 128(128) RO NS
  max read 512
  link x4(x4) speed 2.5(2.5) ASPM disabled(L0s/L1) 
ClockPM disabled

 ecap 0003[100] = Serial 1 20ff910876f10c00
 ecap 0001[200] = AER 1 0 fatal 0 non-fatal 1 corrected
 ecap 0002[300] = VC 1 max VC0
 ecap 0004[400] = Power Budgeting 1
 ecap 000b[500] = Vendor [1] ID 1234 Rev 1 Length 216
 ecap 0018[600] = LTR 1
 ecap 0019[700] = PCIe Sec 1 lane errors 0

This is another (11th Gen Core i7);

xhci0@pci0:0:13:0:  class=0x0c0330 rev=0x01 hdr=0x00 vendor=0x8086 
device=0x9a13 subvendor=0x1028 subdevice=0x0991

 vendor = 'Intel Corporation'
 device = 'Tiger Lake-LP Thunderbolt 4 USB Controller'
 class  = serial bus
 subclass   = USB
 bar   [10] = type Memory, range 64, base 0x60552c, size 
65536, enabled

 cap 01[70] = powerspec 2  supports D0 D3  current D0
 cap 05[80] = MSI supports 8 messages, 64 bit enabled with 1 
message

 cap 09[90] = vendor (length 20) Intel cap 15 version 0
 cap 09[b0] = vendor (length 0) Intel cap 0 version 1

Does your system also have a Thunderbolt chip, or do you use the native 
Intel chipset's XHCI?


Also, when running the stress test and you see the traffic stops, 
what happens if you run this command as root on the ugen which the 
if_ure belongs to:


usbconfig -d ugenX.Y dump_string 0

Does the traffic resume?


Nope. Of the 4 times traffic stopped, the command reported an error 
2 times and completed successfully 2 times, but in neither case did 
traffic recover.  Only a reset recovered it.




Hi Alexander,

Could you run "usbdump -d X.Y" at the same time to capture all the 
errors?


Looking especially for USB_ERR_TIMEOUT .

I have this:

xhci0@pci0:3:0:3:    class=0x0c0330 rev=0x00 hdr=0x00 vendor=0x1022 
device=0x15e0 subvendor=0x1849 subdevice=0x

   vendor = 'Advanced Micro Devices, Inc. [AMD]'
   device = 'Raven USB 3.1'
   class  = serial bus
   subclass   = USB

xhci0@pci0:0:20:0:    class=0x0c0330 rev=0x21 hdr=0x00 vendor=0x8086 
device=0x9d2f subvendor=0x8086 subdevice=0x9d2f

   vendor = 'Intel Corporation'
   device = 'Sunrise Point-LP USB 3.0 xHCI Controller'
   class  = serial bus
   subclass   = USB

--HPS




hi Hans

i think i got some good logs for you

before the problem i ran this:

ugen0.10:  at usbus0, cfg=0 md=HOST 
spd=SUPER (5.0Gbps) pwr=ON (72mA)


# usbconfig -d ugen0.10 >> before
# usbconfig -d ugen0.10 dump_all_desc >> before
# usbconfig -d ugen0.10 dump_stats >> before_status

then, after the problem happened, I ran:

# usbconfig -d ugen0.10 >> after
# usbconfig -d ugen0.10 dump_all_desc >> after
# usbconfig -d ugen0.10 dump_stats >> after_status


just by looking i already see some problems comparing both

for example

before the problem we have:

--
ugen0.10:  at usbus0, cfg=0 md=HOST 
spd=SUPER (5.0Gbps) pwr=ON (72mA)
ugen0.10:  at usbus0, cfg=0 md=HOST 
spd=SUPER (5.0Gbps) pwr=ON (72mA)


   bLength = 0x0012
   bDescriptorType = 0x0001
   bcdUSB = 0x0300
   bDeviceClass = 0x  
   bDeviceSubClass = 0x
   bDeviceProtocol = 0x
   bMaxPacketSize0 = 0x0009
   idVendor = 0x2357
   idProduct = 0x0601
   bcdDevice = 0x3000

   iManufacturer = 0x0001  
   iProduct = 0x0002  
   iSerialNumber = 0x0006  <01>
   bNumConfigurations = 0x0002

Re: RES: TP-LINK USB no carrier after speed test

2022-09-28 Thread Hans Petter Selasky


On 9/28/22 11:47, Tomoaki AOKI wrote:

As I stated on Bug 237666 [1], I have a Titan Ridge TB3 bridge on my
ThinkPad P52. The relevant part of the HW probe is at comment 206 [2].
Is there any other info I can provide for Titan Ridge support?
(I have not tried the code yet.)

[1]https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=237666

[2]https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=237666#c206


I cannot promise anything, and I don't have an overview of which TB3 
controllers are compatible with each other. Maybe grepping for the PCI 
IDs in Linux will give some clues, since I don't have access to any 
Thunderbolt documentation myself!


--HPS

Re: RES: TP-LINK USB no carrier after speed test

2022-09-28 Thread Tomoaki AOKI

On Tue, 27 Sep 2022 20:17:54 +0200
Hans Petter Selasky  wrote:

> On 9/27/22 15:22, Hans Petter Selasky wrote:

   (snip)

> FYI: There is some experimental thunderbolt support at:
> 
> https://github.com/hselasky/usb4
> 
> But I'm not sure if it supports the hardware you've got.
> 
> --HPS

Thanks for the hard work and info.

As I stated on Bug 237666 [1], I have a Titan Ridge TB3 bridge on my
ThinkPad P52. The relevant part of the HW probe is at comment 206 [2].
Is there any other info I can provide for Titan Ridge support?
(I have not tried the code yet.)

[1] https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=237666

[2] https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=237666#c206

-- 
Tomoaki AOKI

Re: RES: TP-LINK USB no carrier after speed test

2022-09-28 Thread Hans Petter Selasky


On 9/28/22 11:07, Ivan Quitschal wrote:



On Tue, 27 Sep 2022, Hans Petter Selasky wrote:


On 9/27/22 15:22, Hans Petter Selasky wrote:

On 9/27/22 14:17, Ivan Quitschal wrote:



On Tue, 27 Sep 2022, Hans Petter Selasky wrote:


On 9/27/22 02:24, Alexander Motin wrote:

On 26.09.2022 17:29, Hans Petter Selasky wrote:
I've got a supposedly "broken" if_ure dongle from Alexander, but 
I'm unable to reproduce the if_ure hang on two different pieces 
of XHCI hardware, Intel based and AMD based, which I've got.


This leads me to believe there is a bug in the XHCI driver or 
hardware on your system.


Can you share the pciconf -lv output for your XHCI controllers?


I have two laptops of different generations reproducing this 
problem, but both are having Thunderbolt on the USB-C ports:


This is one (7th Gen Core i7):

xhci1@pci0:56:0:0:  class=0x0c0330 rev=0x02 hdr=0x00 
vendor=0x8086 device=0x15d4 subvendor=0x subdevice=0x

 vendor = 'Intel Corporation'
 device = 'JHL6540 Thunderbolt 3 USB Controller (C step) 
[Alpine Ridge 4C 2016]'

 class  = serial bus
 subclass   = USB
 bar   [10] = type Memory, range 32, base 0xc3f0, size 
65536, enabled

 cap 01[80] = powerspec 3  supports D0 D1 D2 D3  current D0
 cap 05[88] = MSI supports 8 messages, 64 bit enabled with 1 
message

 cap 10[c0] = PCI-Express 2 endpoint max data 128(128) RO NS
  max read 512
  link x4(x4) speed 2.5(2.5) ASPM disabled(L0s/L1) 
ClockPM disabled

 ecap 0003[100] = Serial 1 20ff910876f10c00
 ecap 0001[200] = AER 1 0 fatal 0 non-fatal 1 corrected
 ecap 0002[300] = VC 1 max VC0
 ecap 0004[400] = Power Budgeting 1
 ecap 000b[500] = Vendor [1] ID 1234 Rev 1 Length 216
 ecap 0018[600] = LTR 1
 ecap 0019[700] = PCIe Sec 1 lane errors 0

This is another (11th Gen Core i7);

xhci0@pci0:0:13:0:  class=0x0c0330 rev=0x01 hdr=0x00 
vendor=0x8086 device=0x9a13 subvendor=0x1028 subdevice=0x0991

 vendor = 'Intel Corporation'
 device = 'Tiger Lake-LP Thunderbolt 4 USB Controller'
 class  = serial bus
 subclass   = USB
 bar   [10] = type Memory, range 64, base 0x60552c, size 
65536, enabled

 cap 01[70] = powerspec 2  supports D0 D3  current D0
 cap 05[80] = MSI supports 8 messages, 64 bit enabled with 1 
message

 cap 09[90] = vendor (length 20) Intel cap 15 version 0
 cap 09[b0] = vendor (length 0) Intel cap 0 version 1

Does your system also have a Thunderbolt chip, or do you use the native 
Intel chipset's XHCI?


Also, when running the stress test and you see the traffic stops, 
what happens if you run this command as root on the ugen which 
the if_ure belongs to:


usbconfig -d ugenX.Y dump_string 0

Does the traffic resume?


Nope. Of the 4 times traffic stopped, the command reported an error 
2 times and completed successfully 2 times, but in neither case did 
traffic recover.  Only a reset recovered it.




Hi Alexander,

Could you run "usbdump -d X.Y" at the same time to capture all the 
errors?


Looking especially for USB_ERR_TIMEOUT .

I have this:

xhci0@pci0:3:0:3:    class=0x0c0330 rev=0x00 hdr=0x00 vendor=0x1022 
device=0x15e0 subvendor=0x1849 subdevice=0x

   vendor = 'Advanced Micro Devices, Inc. [AMD]'
   device = 'Raven USB 3.1'
   class  = serial bus
   subclass   = USB

xhci0@pci0:0:20:0:    class=0x0c0330 rev=0x21 hdr=0x00 
vendor=0x8086 device=0x9d2f subvendor=0x8086 subdevice=0x9d2f

   vendor = 'Intel Corporation'
   device = 'Sunrise Point-LP USB 3.0 xHCI Controller'
   class  = serial bus
   subclass   = USB

--HPS




hi Hans

i think i got some good logs for you

before the problem i ran this:

ugen0.10:  at usbus0, cfg=0 md=HOST 
spd=SUPER (5.0Gbps) pwr=ON (72mA)


# usbconfig -d ugen0.10 >> before
# usbconfig -d ugen0.10 dump_all_desc >> before
# usbconfig -d ugen0.10 dump_stats >> before_status

then, after the problem happened, I ran:

# usbconfig -d ugen0.10 >> after
# usbconfig -d ugen0.10 dump_all_desc >> after
# usbconfig -d ugen0.10 dump_stats >> after_status


just by looking i already see some problems comparing both

for example

before the problem we have:

--
ugen0.10:  at usbus0, cfg=0 md=HOST 
spd=SUPER (5.0Gbps) pwr=ON (72mA)
ugen0.10:  at usbus0, cfg=0 md=HOST 
spd=SUPER (5.0Gbps) pwr=ON (72mA)


   bLength = 0x0012
   bDescriptorType = 0x0001
   bcdUSB = 0x0300
   bDeviceClass = 0x  
   bDeviceSubClass = 0x
   bDeviceProtocol = 0x
   bMaxPacketSize0 = 0x0009
   idVendor = 0x2357
   idProduct = 0x0601
   bcdDevice = 0x3000

   iManufacturer = 0x0001  
   iProduct = 0x0002  
   iSerialNumber = 0x0006  <01>
   bNumConfigurations = 0x0002



after the problem

--
u

RES: RES: TP-LINK USB no carrier after speed test

2022-09-28 Thread Ivan Quitschal

> 
> FYI: There is some experimental thunderbolt support at:
> 
> https://github.com/hselasky/usb4
> 
> But I'm not sure if it supports the hardware you've got.
> 
> --HPS


Hi Hans

Should I hold off on applying this Thunderbolt business, at least until 
you analyze the log I sent you in the last email, or should I go for it?

Thanks

--tzk

Re: RES: TP-LINK USB no carrier after speed test


On 9/27/22 15:22, Hans Petter Selasky wrote:

On 9/27/22 14:17, Ivan Quitschal wrote:



On Tue, 27 Sep 2022, Hans Petter Selasky wrote:


On 9/27/22 02:24, Alexander Motin wrote:

On 26.09.2022 17:29, Hans Petter Selasky wrote:
I've got a supposedly "broken" if_ure dongle from Alexander, but 
I'm unable to reproduce the if_ure hang on two different pieces of 
XHCI hardware, Intel based and AMD based, which I've got.


This leads me to believe there is a bug in the XHCI driver or 
hardware on your system.


Can you share the pciconf -lv output for your XHCI controllers?


I have two laptops of different generations reproducing this 
problem, but both are having Thunderbolt on the USB-C ports:


This is one (7th Gen Core i7):

xhci1@pci0:56:0:0:  class=0x0c0330 rev=0x02 hdr=0x00 
vendor=0x8086 device=0x15d4 subvendor=0x subdevice=0x

 vendor = 'Intel Corporation'
 device = 'JHL6540 Thunderbolt 3 USB Controller (C step) 
[Alpine Ridge 4C 2016]'

 class  = serial bus
 subclass   = USB
 bar   [10] = type Memory, range 32, base 0xc3f0, size 
65536, enabled

 cap 01[80] = powerspec 3  supports D0 D1 D2 D3  current D0
 cap 05[88] = MSI supports 8 messages, 64 bit enabled with 1 
message

 cap 10[c0] = PCI-Express 2 endpoint max data 128(128) RO NS
  max read 512
  link x4(x4) speed 2.5(2.5) ASPM disabled(L0s/L1) 
ClockPM disabled

 ecap 0003[100] = Serial 1 20ff910876f10c00
 ecap 0001[200] = AER 1 0 fatal 0 non-fatal 1 corrected
 ecap 0002[300] = VC 1 max VC0
 ecap 0004[400] = Power Budgeting 1
 ecap 000b[500] = Vendor [1] ID 1234 Rev 1 Length 216
 ecap 0018[600] = LTR 1
 ecap 0019[700] = PCIe Sec 1 lane errors 0

This is another (11th Gen Core i7);

xhci0@pci0:0:13:0:  class=0x0c0330 rev=0x01 hdr=0x00 
vendor=0x8086 device=0x9a13 subvendor=0x1028 subdevice=0x0991

 vendor = 'Intel Corporation'
 device = 'Tiger Lake-LP Thunderbolt 4 USB Controller'
 class  = serial bus
 subclass   = USB
 bar   [10] = type Memory, range 64, base 0x60552c, size 
65536, enabled

 cap 01[70] = powerspec 2  supports D0 D3  current D0
 cap 05[80] = MSI supports 8 messages, 64 bit enabled with 1 
message

 cap 09[90] = vendor (length 20) Intel cap 15 version 0
 cap 09[b0] = vendor (length 0) Intel cap 0 version 1

Does your system also have a Thunderbolt chip, or do you use the native 
Intel chipset's XHCI?


Also, when running the stress test and you see the traffic stops, 
what happens if you run this command as root on the ugen which the 
if_ure belongs to:


usbconfig -d ugenX.Y dump_string 0

Does the traffic resume?


Nope. Of the 4 times traffic stopped, the command reported an error 
2 times and completed successfully 2 times, but in neither case did 
traffic recover.  Only a reset recovered it.




Hi Alexander,

Could you run "usbdump -d X.Y" at the same time to capture all the 
errors?


Looking especially for USB_ERR_TIMEOUT .

I have this:

xhci0@pci0:3:0:3:    class=0x0c0330 rev=0x00 hdr=0x00 vendor=0x1022 
device=0x15e0 subvendor=0x1849 subdevice=0x

   vendor = 'Advanced Micro Devices, Inc. [AMD]'
   device = 'Raven USB 3.1'
   class  = serial bus
   subclass   = USB

xhci0@pci0:0:20:0:    class=0x0c0330 rev=0x21 hdr=0x00 vendor=0x8086 
device=0x9d2f subvendor=0x8086 subdevice=0x9d2f

   vendor = 'Intel Corporation'
   device = 'Sunrise Point-LP USB 3.0 xHCI Controller'
   class  = serial bus
   subclass   = USB

--HPS




hi Hans

i think i got some good logs for you

before the problem i ran this:

ugen0.10:  at usbus0, cfg=0 md=HOST 
spd=SUPER (5.0Gbps) pwr=ON (72mA)


# usbconfig -d ugen0.10 >> before
# usbconfig -d ugen0.10 dump_all_desc >> before
# usbconfig -d ugen0.10 dump_stats >> before_status

then, after the problem happened, I ran:

# usbconfig -d ugen0.10 >> after
# usbconfig -d ugen0.10 dump_all_desc >> after
# usbconfig -d ugen0.10 dump_stats >> after_status


just by looking i already see some problems comparing both

for example

before the problem we have:

--
ugen0.10:  at usbus0, cfg=0 md=HOST 
spd=SUPER (5.0Gbps) pwr=ON (72mA)
ugen0.10:  at usbus0, cfg=0 md=HOST 
spd=SUPER (5.0Gbps) pwr=ON (72mA)


   bLength = 0x0012
   bDescriptorType = 0x0001
   bcdUSB = 0x0300
   bDeviceClass = 0x  
   bDeviceSubClass = 0x
   bDeviceProtocol = 0x
   bMaxPacketSize0 = 0x0009
   idVendor = 0x2357
   idProduct = 0x0601
   bcdDevice = 0x3000

   iManufacturer = 0x0001  
   iProduct = 0x0002  
   iSerialNumber = 0x0006  <01>
   bNumConfigurations = 0x0002



after the problem

--
ugen0.10:  at usbus0, cfg=0 md=HOST 
spd=SUPER (5.0Gbps) pwr=ON (72mA)
ugen0.10:  at usbus0, cfg=0 m

Re: RES: TP-LINK USB no carrier after speed test


On 9/27/22 14:17, Ivan Quitschal wrote:



On Tue, 27 Sep 2022, Hans Petter Selasky wrote:


On 9/27/22 02:24, Alexander Motin wrote:

On 26.09.2022 17:29, Hans Petter Selasky wrote:
I've got a supposedly "broken" if_ure dongle from Alexander, but I'm 
unable to reproduce the if_ure hang on two different pieces of XHCI 
hardware, Intel based and AMD based, which I've got.


This leads me to believe there is a bug in the XHCI driver or 
hardware on your system.


Can you share the pciconf -lv output for your XHCI controllers?


I have two laptops of different generations reproducing this problem, 
but both are having Thunderbolt on the USB-C ports:


This is one (7th Gen Core i7):

xhci1@pci0:56:0:0:  class=0x0c0330 rev=0x02 hdr=0x00 
vendor=0x8086 device=0x15d4 subvendor=0x subdevice=0x

 vendor = 'Intel Corporation'
 device = 'JHL6540 Thunderbolt 3 USB Controller (C step) 
[Alpine Ridge 4C 2016]'

 class  = serial bus
 subclass   = USB
 bar   [10] = type Memory, range 32, base 0xc3f0, size 65536, 
enabled

 cap 01[80] = powerspec 3  supports D0 D1 D2 D3  current D0
 cap 05[88] = MSI supports 8 messages, 64 bit enabled with 1 message
 cap 10[c0] = PCI-Express 2 endpoint max data 128(128) RO NS
  max read 512
  link x4(x4) speed 2.5(2.5) ASPM disabled(L0s/L1) 
ClockPM disabled

 ecap 0003[100] = Serial 1 20ff910876f10c00
 ecap 0001[200] = AER 1 0 fatal 0 non-fatal 1 corrected
 ecap 0002[300] = VC 1 max VC0
 ecap 0004[400] = Power Budgeting 1
 ecap 000b[500] = Vendor [1] ID 1234 Rev 1 Length 216
 ecap 0018[600] = LTR 1
 ecap 0019[700] = PCIe Sec 1 lane errors 0

This is another (11th Gen Core i7);

xhci0@pci0:0:13:0:  class=0x0c0330 rev=0x01 hdr=0x00 
vendor=0x8086 device=0x9a13 subvendor=0x1028 subdevice=0x0991

 vendor = 'Intel Corporation'
 device = 'Tiger Lake-LP Thunderbolt 4 USB Controller'
 class  = serial bus
 subclass   = USB
 bar   [10] = type Memory, range 64, base 0x60552c, size 
65536, enabled

 cap 01[70] = powerspec 2  supports D0 D3  current D0
 cap 05[80] = MSI supports 8 messages, 64 bit enabled with 1 message
 cap 09[90] = vendor (length 20) Intel cap 15 version 0
 cap 09[b0] = vendor (length 0) Intel cap 0 version 1

Does your system also have a Thunderbolt chip, or do you use the native 
Intel chipset's XHCI?


Also, when running the stress test and you see the traffic stops, 
what happens if you run this command as root on the ugen which the 
if_ure belongs to:


usbconfig -d ugenX.Y dump_string 0

Does the traffic resume?


Nope. Of the 4 times traffic stopped, the command reported an error 
2 times and completed successfully 2 times, but in neither case did 
traffic recover.  Only a reset recovered it.




Hi Alexander,

Could you run "usbdump -d X.Y" at the same time to capture all the 
errors?


Looking especially for USB_ERR_TIMEOUT .

I have this:

xhci0@pci0:3:0:3:    class=0x0c0330 rev=0x00 hdr=0x00 vendor=0x1022 
device=0x15e0 subvendor=0x1849 subdevice=0x

   vendor = 'Advanced Micro Devices, Inc. [AMD]'
   device = 'Raven USB 3.1'
   class  = serial bus
   subclass   = USB

xhci0@pci0:0:20:0:    class=0x0c0330 rev=0x21 hdr=0x00 vendor=0x8086 
device=0x9d2f subvendor=0x8086 subdevice=0x9d2f

   vendor = 'Intel Corporation'
   device = 'Sunrise Point-LP USB 3.0 xHCI Controller'
   class  = serial bus
   subclass   = USB

--HPS




hi Hans

i think i got some good logs for you

before the problem i ran this:

ugen0.10:  at usbus0, cfg=0 md=HOST 
spd=SUPER (5.0Gbps) pwr=ON (72mA)


# usbconfig -d ugen0.10 >> before
# usbconfig -d ugen0.10 dump_all_desc >> before
# usbconfig -d ugen0.10 dump_stats >> before_status

then, after the problem happened, I ran:

# usbconfig -d ugen0.10 >> after
# usbconfig -d ugen0.10 dump_all_desc >> after
# usbconfig -d ugen0.10 dump_stats >> after_status


just by looking i already see some problems comparing both

for example

before the problem we have:

--
ugen0.10:  at usbus0, cfg=0 md=HOST 
spd=SUPER (5.0Gbps) pwr=ON (72mA)
ugen0.10:  at usbus0, cfg=0 md=HOST 
spd=SUPER (5.0Gbps) pwr=ON (72mA)


   bLength = 0x0012
   bDescriptorType = 0x0001
   bcdUSB = 0x0300
   bDeviceClass = 0x  
   bDeviceSubClass = 0x
   bDeviceProtocol = 0x
   bMaxPacketSize0 = 0x0009
   idVendor = 0x2357
   idProduct = 0x0601
   bcdDevice = 0x3000

   iManufacturer = 0x0001  
   iProduct = 0x0002  
   iSerialNumber = 0x0006  <01>
   bNumConfigurations = 0x0002



after the problem

--
ugen0.10:  at usbus0, cfg=0 md=HOST 
spd=SUPER (5.0Gbps) pwr=ON (72mA)
ugen0.10:  at usbus0, cfg=0 md=HOST 
spd=SUPER (5.0Gbps) pwr=ON (72mA)

Re: RES: TP-LINK USB no carrier after speed test

2022-09-27 Thread Ivan Quitschal




On Tue, 27 Sep 2022, Hans Petter Selasky wrote:


On 9/27/22 02:24, Alexander Motin wrote:

On 26.09.2022 17:29, Hans Petter Selasky wrote:
I've got a supposedly "broken" if_ure dongle from Alexander, but I'm 
unable to reproduce the if_ure hang on two different pieces of XHCI 
hardware, Intel based and AMD based, which I've got.


This leads me to believe there is a bug in the XHCI driver or hardware on 
your system.


Can you share the pciconf -lv output for your XHCI controllers?


I have two laptops of different generations reproducing this problem, but 
both are having Thunderbolt on the USB-C ports:


This is one (7th Gen Core i7):

xhci1@pci0:56:0:0:  class=0x0c0330 rev=0x02 hdr=0x00 vendor=0x8086 
device=0x15d4 subvendor=0x subdevice=0x

     vendor = 'Intel Corporation'
     device = 'JHL6540 Thunderbolt 3 USB Controller (C step) [Alpine 
Ridge 4C 2016]'

     class  = serial bus
     subclass   = USB
     bar   [10] = type Memory, range 32, base 0xc3f0, size 65536, 
enabled

     cap 01[80] = powerspec 3  supports D0 D1 D2 D3  current D0
     cap 05[88] = MSI supports 8 messages, 64 bit enabled with 1 message
     cap 10[c0] = PCI-Express 2 endpoint max data 128(128) RO NS
  max read 512
  link x4(x4) speed 2.5(2.5) ASPM disabled(L0s/L1) ClockPM 
disabled

     ecap 0003[100] = Serial 1 20ff910876f10c00
     ecap 0001[200] = AER 1 0 fatal 0 non-fatal 1 corrected
     ecap 0002[300] = VC 1 max VC0
     ecap 0004[400] = Power Budgeting 1
     ecap 000b[500] = Vendor [1] ID 1234 Rev 1 Length 216
     ecap 0018[600] = LTR 1
     ecap 0019[700] = PCIe Sec 1 lane errors 0

This is another (11th Gen Core i7);

xhci0@pci0:0:13:0:  class=0x0c0330 rev=0x01 hdr=0x00 vendor=0x8086 
device=0x9a13 subvendor=0x1028 subdevice=0x0991

     vendor = 'Intel Corporation'
     device = 'Tiger Lake-LP Thunderbolt 4 USB Controller'
     class  = serial bus
     subclass   = USB
     bar   [10] = type Memory, range 64, base 0x60552c, size 65536, 
enabled

     cap 01[70] = powerspec 2  supports D0 D3  current D0
     cap 05[80] = MSI supports 8 messages, 64 bit enabled with 1 message
     cap 09[90] = vendor (length 20) Intel cap 15 version 0
     cap 09[b0] = vendor (length 0) Intel cap 0 version 1

Does your system also have a Thunderbolt chip, or do you use the native 
Intel chipset's XHCI?


Also, when running the stress test and you see the traffic stops, what 
happens if you run this command as root on the ugen which the if_ure 
belongs to:


usbconfig -d ugenX.Y dump_string 0

Does the traffic resume?


Nope. Of the 4 times traffic stopped, the command reported an error 
2 times and completed successfully 2 times, but in neither case did 
traffic recover.  Only a reset recovered it.




Hi Alexander,

Could you run "usbdump -d X.Y" at the same time to capture all the errors?

Looking especially for USB_ERR_TIMEOUT .

I have this:

xhci0@pci0:3:0:3:	class=0x0c0330 rev=0x00 hdr=0x00 vendor=0x1022 
device=0x15e0 subvendor=0x1849 subdevice=0x

   vendor = 'Advanced Micro Devices, Inc. [AMD]'
   device = 'Raven USB 3.1'
   class  = serial bus
   subclass   = USB

xhci0@pci0:0:20:0:	class=0x0c0330 rev=0x21 hdr=0x00 vendor=0x8086 
device=0x9d2f subvendor=0x8086 subdevice=0x9d2f

   vendor = 'Intel Corporation'
   device = 'Sunrise Point-LP USB 3.0 xHCI Controller'
   class  = serial bus
   subclass   = USB

--HPS




hi Hans

i think i got some good logs for you

before the problem i ran this:

ugen0.10:  at usbus0, cfg=0 md=HOST spd=SUPER 
(5.0Gbps) pwr=ON (72mA)


# usbconfig -d ugen0.10 >> before
# usbconfig -d ugen0.10 dump_all_desc >> before
# usbconfig -d ugen0.10 dump_stats >> before_status

then, after the problem happened, I ran:

# usbconfig -d ugen0.10 >> after
# usbconfig -d ugen0.10 dump_all_desc >> after
# usbconfig -d ugen0.10 dump_stats >> after_status
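A before/after comparison like the one above can be made mechanical with diff(1). The following is only a sketch: the helper name compare_dumps is hypothetical, and it assumes the two capture files were written by the usbconfig commands shown in this message.

```sh
#!/bin/sh
# compare_dumps BEFORE AFTER
# Print a unified diff of two capture files, or a short note when they
# are identical. (compare_dumps is a hypothetical helper name.)
compare_dumps() {
    if diff -u "$1" "$2"; then
        echo "no differences between $1 and $2"
    fi
}

# Example, using the file names from the commands above:
#   compare_dumps before after
#   compare_dumps before_status after_status
```

Note that the commands above append with `>>`, so repeating a capture adds to the same file; starting from empty files keeps the diff readable.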


just by looking i already see some problems comparing both

for example

before the problem we have:

--
ugen0.10:  at usbus0, cfg=0 md=HOST spd=SUPER 
(5.0Gbps) pwr=ON (72mA)
ugen0.10:  at usbus0, cfg=0 md=HOST spd=SUPER 
(5.0Gbps) pwr=ON (72mA)


  bLength = 0x0012
  bDescriptorType = 0x0001
  bcdUSB = 0x0300
  bDeviceClass = 0x  
  bDeviceSubClass = 0x
  bDeviceProtocol = 0x
  bMaxPacketSize0 = 0x0009
  idVendor = 0x2357
  idProduct = 0x0601
  bcdDevice = 0x3000

  iManufacturer = 0x0001  
  iProduct = 0x0002  
  iSerialNumber = 0x0006  <01>
  bNumConfigurations = 0x0002



after the problem

--
ugen0.10:  at usbus0, cfg=0 md=HOST spd=SUPER 
(5.0Gbps) pwr=ON (72mA)
ugen0.10:  at usbus0, cfg=0 md=HOST spd=SUPER 
(5.0Gbps) pwr=ON (72mA)


  bLength = 0x0012
  bDescriptorType = 0x0001
  bcdUSB = 0x0300
  bDev

Re: RES: TP-LINK USB no carrier after speed test


On 9/27/22 02:24, Alexander Motin wrote:

On 26.09.2022 17:29, Hans Petter Selasky wrote:
I've got a supposedly "broken" if_ure dongle from Alexander, but I'm 
unable to reproduce the if_ure hang on two different pieces of XHCI 
hardware, Intel based and AMD based, which I've got.


This leads me to believe there is a bug in the XHCI driver or hardware 
on your system.


Can you share the pciconf -lv output for your XHCI controllers?


I have two laptops of different generations reproducing this problem, 
but both are having Thunderbolt on the USB-C ports:


This is one (7th Gen Core i7):

xhci1@pci0:56:0:0:  class=0x0c0330 rev=0x02 hdr=0x00 vendor=0x8086 
device=0x15d4 subvendor=0x subdevice=0x

     vendor = 'Intel Corporation'
     device = 'JHL6540 Thunderbolt 3 USB Controller (C step) [Alpine 
Ridge 4C 2016]'

     class  = serial bus
     subclass   = USB
     bar   [10] = type Memory, range 32, base 0xc3f0, size 65536, 
enabled

     cap 01[80] = powerspec 3  supports D0 D1 D2 D3  current D0
     cap 05[88] = MSI supports 8 messages, 64 bit enabled with 1 message
     cap 10[c0] = PCI-Express 2 endpoint max data 128(128) RO NS
  max read 512
  link x4(x4) speed 2.5(2.5) ASPM disabled(L0s/L1) 
ClockPM disabled

     ecap 0003[100] = Serial 1 20ff910876f10c00
     ecap 0001[200] = AER 1 0 fatal 0 non-fatal 1 corrected
     ecap 0002[300] = VC 1 max VC0
     ecap 0004[400] = Power Budgeting 1
     ecap 000b[500] = Vendor [1] ID 1234 Rev 1 Length 216
     ecap 0018[600] = LTR 1
     ecap 0019[700] = PCIe Sec 1 lane errors 0

This is another (11th Gen Core i7);

xhci0@pci0:0:13:0:  class=0x0c0330 rev=0x01 hdr=0x00 vendor=0x8086 
device=0x9a13 subvendor=0x1028 subdevice=0x0991

     vendor = 'Intel Corporation'
     device = 'Tiger Lake-LP Thunderbolt 4 USB Controller'
     class  = serial bus
     subclass   = USB
     bar   [10] = type Memory, range 64, base 0x60552c, size 65536, 
enabled

     cap 01[70] = powerspec 2  supports D0 D3  current D0
     cap 05[80] = MSI supports 8 messages, 64 bit enabled with 1 message
     cap 09[90] = vendor (length 20) Intel cap 15 version 0
     cap 09[b0] = vendor (length 0) Intel cap 0 version 1

Does your system also have a Thunderbolt chip, or do you use the native 
Intel chipset's XHCI?


Also, when running the stress test and you see the traffic stops, what 
happens if you run this command as root on the ugen which the if_ure 
belongs to:


usbconfig -d ugenX.Y dump_string 0

Does the traffic resume?


Nope. Of the 4 times traffic stopped, the command reported an error 
2 times and completed successfully 2 times, but in neither case did 
traffic recover.  Only a reset recovered it.




Hi Alexander,

Could you run "usbdump -d X.Y" at the same time to capture all the errors?

Looking especially for USB_ERR_TIMEOUT .

I have this:

xhci0@pci0:3:0:3:	class=0x0c0330 rev=0x00 hdr=0x00 vendor=0x1022 
device=0x15e0 subvendor=0x1849 subdevice=0x

vendor = 'Advanced Micro Devices, Inc. [AMD]'
device = 'Raven USB 3.1'
class  = serial bus
subclass   = USB

xhci0@pci0:0:20:0:	class=0x0c0330 rev=0x21 hdr=0x00 vendor=0x8086 
device=0x9d2f subvendor=0x8086 subdevice=0x9d2f

vendor = 'Intel Corporation'
device = 'Sunrise Point-LP USB 3.0 xHCI Controller'
class  = serial bus
subclass   = USB

--HPS

Re: RES: RES: TP-LINK USB no carrier after speed test


On 9/27/22 00:25, Ivan Quitschal wrote:

Hi Hans,
How do you want me to run those tests for you? With or without any of 
your patches? With the current code in git?


Without any patches.

--HPS

Re: RES: TP-LINK USB no carrier after speed test

2022-09-26 Thread Alexander Motin


On 26.09.2022 17:29, Hans Petter Selasky wrote:
I've got a supposedly "broken" if_ure dongle from Alexander, but I'm 
unable to reproduce the if_ure hang on two different pieces of XHCI 
hardware, Intel based and AMD based, which I've got.


This leads me to believe there is a bug in the XHCI driver or hardware 
on your system.


Can you share the pciconf -lv output for your XHCI controllers?


I have two laptops of different generations reproducing this problem, 
but both are having Thunderbolt on the USB-C ports:


This is one (7th Gen Core i7):

xhci1@pci0:56:0:0:  class=0x0c0330 rev=0x02 hdr=0x00 vendor=0x8086 
device=0x15d4 subvendor=0x subdevice=0x

vendor = 'Intel Corporation'
device = 'JHL6540 Thunderbolt 3 USB Controller (C step) [Alpine 
Ridge 4C 2016]'

class  = serial bus
subclass   = USB
bar   [10] = type Memory, range 32, base 0xc3f0, size 65536, 
enabled

cap 01[80] = powerspec 3  supports D0 D1 D2 D3  current D0
cap 05[88] = MSI supports 8 messages, 64 bit enabled with 1 message
cap 10[c0] = PCI-Express 2 endpoint max data 128(128) RO NS
 max read 512
 link x4(x4) speed 2.5(2.5) ASPM disabled(L0s/L1) 
ClockPM disabled

ecap 0003[100] = Serial 1 20ff910876f10c00
ecap 0001[200] = AER 1 0 fatal 0 non-fatal 1 corrected
ecap 0002[300] = VC 1 max VC0
ecap 0004[400] = Power Budgeting 1
ecap 000b[500] = Vendor [1] ID 1234 Rev 1 Length 216
ecap 0018[600] = LTR 1
ecap 0019[700] = PCIe Sec 1 lane errors 0

This is another (11th Gen Core i7);

xhci0@pci0:0:13:0:  class=0x0c0330 rev=0x01 hdr=0x00 vendor=0x8086 
device=0x9a13 subvendor=0x1028 subdevice=0x0991

vendor = 'Intel Corporation'
device = 'Tiger Lake-LP Thunderbolt 4 USB Controller'
class  = serial bus
subclass   = USB
bar   [10] = type Memory, range 64, base 0x60552c, size 65536, 
enabled

cap 01[70] = powerspec 2  supports D0 D3  current D0
cap 05[80] = MSI supports 8 messages, 64 bit enabled with 1 message
cap 09[90] = vendor (length 20) Intel cap 15 version 0
cap 09[b0] = vendor (length 0) Intel cap 0 version 1
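
Both listings report class=0x0c0330. As a reading aid, the 24-bit value splits into base class, subclass and programming interface; in the PCI class code assignments 0x0c is a serial bus controller, 0x03 is USB, and 0x30 is the xHCI programming interface. A minimal sketch (decode_pci_class is a hypothetical helper name, not part of pciconf):

```sh
#!/bin/sh
# decode_pci_class CLASS
# Split a 24-bit PCI class value, as printed by pciconf -l, into its
# three one-byte fields. (Hypothetical helper for illustration.)
decode_pci_class() {
    class=$(( $1 ))
    printf 'base=0x%02x sub=0x%02x progif=0x%02x\n' \
        $(( (class >> 16) & 0xff )) \
        $(( (class >> 8) & 0xff )) \
        $(( class & 0xff ))
}

decode_pci_class 0x0c0330   # prints: base=0x0c sub=0x03 progif=0x30
```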

Does your system also have a Thunderbolt chip, or do you use the native 
Intel chipset's XHCI?


Also, when running the stress test and you see the traffic stops, what 
happens if you run this command as root on the ugen which the if_ure 
belongs to:


usbconfig -d ugenX.Y dump_string 0

Does the traffic resume?


Nope. Of the 4 times traffic stopped, the command reported an error 
2 times and completed successfully 2 times, but in neither case did 
traffic recover.  Only a reset recovered it.


--
Alexander Motin

Re: RES: TP-LINK USB no carrier after speed test

2022-09-26 Thread Ivan Quitschal




On Mon, 26 Sep 2022, Hans Petter Selasky wrote:


On 9/26/22 21:28, Alexander Motin wrote:

Ivan,

On 26.09.2022 13:11, Ivan Quitschal wrote:
Bad news, I'm afraid: the problem occurred on the first attempt at 
speedtest.net.
I'm really trying to help by analyzing this code myself, but the 
problem is that I'm far from an expert on network protocols, if it is a 
network problem at all. It seems to me more like a USB protocol limit issue 
or something. Just FYI, limiting that first constant to 2048 still 
limits my upload to 90 Mbps, and also still solves the issue, so there has 
to be something about it.


In my tests I found that reducing URE_MAX_TX from 4 to 1 actually helps 
a lot more, without such a dramatic performance decrease.  It is likely 
only a workaround, though, and does not explain the cause, so I hope Hans 
has more ideas for us to test. ;)




Hi,

I've got a supposedly "broken" if_ure dongle from Alexander, but I'm unable 
to reproduce the if_ure hang on two different pieces of XHCI hardware, Intel 
based and AMD based, which I've got.


This leads me to believe there is a bug in the XHCI driver or hardware on 
your system.


Can you share the pciconf -lv output for your XHCI controllers?

Also, when running the stress test and you see the traffic stops, what 
happens if you run this command as root on the ugen which the if_ure belongs 
to:


usbconfig -d ugenX.Y dump_string 0

Does the traffic resume?

--HPS



hi Hans

Without any patches, using the current code in the repository:


pciconf -lv
xhci0@pci0:0:20:0:  class=0x0c0330 rev=0x20 hdr=0x00 vendor=0x8086 device=0xa0ed 
subvendor=0x1028 subdevice=0x0ab0

vendor = 'Intel Corporation'
device = 'Tiger Lake-LP USB 3.2 Gen 2x1 xHCI Host Controller'
class  = serial bus
subclass   = USB

I ran the stress test, hit the problem, and then tried the following:

[root@tzk-inspiron ~ ]# usbconfig -d ugen0.6 dump_string 0
STRING_0x00 = 0x04, 0x03, 0x09, 0x04
[root@tzk-inspiron ~ ]#
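
For reference, the four bytes returned above form USB string descriptor 0, the table of supported language IDs: a length byte, the descriptor type (0x03 for a string descriptor), and then little-endian 16-bit LANGIDs. The sketch below decodes them; decode_string0 is a hypothetical helper name, not a usbconfig feature.

```sh
#!/bin/sh
# decode_string0 BYTE...
# Decode the bytes of USB string descriptor 0 (the LANGID table).
# (Hypothetical helper for illustration.)
decode_string0() {
    len=$(( $1 ))
    dtype=$(( $2 ))
    shift 2
    printf 'bLength=%d bDescriptorType=0x%02x\n' "$len" "$dtype"
    # The remaining bytes are little-endian 16-bit language IDs.
    while [ "$#" -ge 2 ]; do
        printf 'wLANGID=0x%04x\n' $(( $1 | ($2 << 8) ))
        shift 2
    done
}

decode_string0 0x04 0x03 0x09 0x04
# prints:
#   bLength=4 bDescriptorType=0x03
#   wLANGID=0x0409
```

LANGID 0x0409 is English (United States). A well-formed reply here suggests the device was still answering control requests while the network traffic was stalled.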

Nothing happened; still no carrier. To get the connection back I had to run:
[root@tzk-inspiron ~ ]# usbconfig -d ugen0.6 reset

--tzk
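
Since only a reset brought the link back in this report, the manual recovery could be scripted as a stopgap. This is a sketch under stated assumptions, not a fix: the interface name ue0 is an assumption, ugen0.6 is the address from this message and will differ per system, and needs_reset is a hypothetical helper that simply looks for the "status: no carrier" line in ifconfig output.

```sh
#!/bin/sh
# needs_reset
# Read `ifconfig <iface>` output on stdin and succeed when the link
# reports "no carrier". (Hypothetical helper for illustration.)
needs_reset() {
    grep -q 'status: no carrier'
}

# Hypothetical usage (interface name and ugen address are assumptions):
#   if ifconfig ue0 | needs_reset; then
#       usbconfig -d ugen0.6 reset
#   fi
```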

RES: RES: TP-LINK USB no carrier after speed test

2022-09-26 Thread Ivan Quitschal



> -Original Message-
> From: Hans Petter Selasky 
> Sent: Monday, September 26, 2022 18:29
> To: Alexander Motin ; Ivan Quitschal
> 
> Cc: freebsd-current@freebsd.org; freebsd-...@freebsd.org
> Subject: Re: RES: TP-LINK USB no carrier after speed test
> 
> On 9/26/22 21:28, Alexander Motin wrote:
> > Ivan,
> >
> > On 26.09.2022 13:11, Ivan Quitschal wrote:
> >> Bad news, I'm afraid: the problem occurred on the first attempt at
> >> speedtest.net.
> >> I'm really trying to help by analyzing this code myself, but the
> >> problem is that I'm far from an expert on network protocols, if it
> >> is a network problem at all. It seems to me more like a USB protocol
> >> limit issue or something. Just FYI, limiting that first constant to
> >> 2048 still limits my upload to 90 Mbps, and it also still solves the
> >> issue, so there obviously has to be something about it.
> >
> > In my tests I found that reducing URE_MAX_TX from 4 to 1 actually
> > helps a lot more, without such a dramatic performance decrease.  It
> > is likely only a workaround and does not explain the cause, though,
> > so I hope Hans has more ideas for us to test. ;)
> >
> 
> Hi,
> 
> I've got a supposedly "broken" if_ure dongle from Alexander, but I'm unable to
> reproduce the if_ure hang on two different pieces of XHCI hardware, Intel
> based and AMD based, which I've got.
> 
> This leads me to believe there is a bug in the XHCI driver or hardware on your
> system.
> 
> Can you share the pciconf -lv output for your XHCI controllers?
> 
> Also, when running the stress test and you see the traffic stops, what
> happens if you run this command as root on the ugen which the if_ure
> belongs to:
> 
> usbconfig -d ugenX.Y dump_string 0
> 
> Does the traffic resume?
> 
> --HPS

Hi Hans,
how do you want me to run those tests for you? With or without any of your 
patches? With the current code in git?

Hi Alexander,
I did what you suggested, and the opposite happened: the upload got back to 
300 Mbps, and it was the download that dropped, to 200 instead of 600, hehe.

--tzk

Re: RES: TP-LINK USB no carrier after speed test

2022-09-26 Thread Hans Petter Selasky


On 9/26/22 21:28, Alexander Motin wrote:

Ivan,

On 26.09.2022 13:11, Ivan Quitschal wrote:
Bad news, I'm afraid: the problem occurred on the first attempt at 
speedtest.net.
I'm really trying to help by analyzing this code myself, but the 
problem is that I'm far from an expert on network protocols, if it is 
a network problem at all. It seems to me more like a USB protocol 
limit issue or something. Just FYI, limiting that first constant to 
2048 still limits my upload to 90 Mbps, and it also still solves the 
issue, so there obviously has to be something about it.


In my tests I found that reducing URE_MAX_TX from 4 to 1 actually 
helps a lot more, without such a dramatic performance decrease.  It is 
likely only a workaround and does not explain the cause, though, so I 
hope Hans has more ideas for us to test. ;)




Hi,

I've got a supposedly "broken" if_ure dongle from Alexander, but I'm 
unable to reproduce the if_ure hang on two different pieces of XHCI 
hardware, Intel based and AMD based, which I've got.


This leads me to believe there is a bug in the XHCI driver or hardware 
on your system.


Can you share the pciconf -lv output for your XHCI controllers?

Also, when running the stress test and you see the traffic stops, what 
happens if you run this command as root on the ugen which the if_ure 
belongs to:


usbconfig -d ugenX.Y dump_string 0

Does the traffic resume?

--HPS

Re: RES: TP-LINK USB no carrier after speed test

2022-09-26 Thread Alexander Motin


Ivan,

On 26.09.2022 13:11, Ivan Quitschal wrote:

Bad news, I'm afraid: the problem occurred on the first attempt at speedtest.net.
I'm really trying to help by analyzing this code myself, but the 
problem is that I'm far from an expert on network protocols, if it is a 
network problem at all. It seems to me more like a USB protocol limit 
issue or something. Just FYI, limiting that first constant to 2048 still 
limits my upload to 90 Mbps, and it also still solves the issue, so there 
obviously has to be something about it.


In my tests I found that reducing URE_MAX_TX from 4 to 1 actually 
helps a lot more, without such a dramatic performance decrease.  It is 
likely only a workaround and does not explain the cause, though, so I 
hope Hans has more ideas for us to test. ;)


--
Alexander Motin

Re: RES: TP-LINK USB no carrier after speed test

2022-09-26 Thread Ivan Quitschal





On Mon, 26 Sep 2022, Hans Petter Selasky wrote:


Hi Ivan,

Can you revert all if_ure patches, and try this one instead.

--HPS




hi Hans

Bad news, I'm afraid: the problem occurred on the first attempt at speedtest.net.
I'm really trying to help by analyzing this code myself, but the problem is 
that I'm far from an expert on network protocols, if it is a network problem 
at all. It seems to me more like a USB protocol limit issue or something. Just 
FYI, limiting that first constant to 2048 still limits my upload to 90 Mbps, 
and it also still solves the issue, so there obviously has to be something 
about it.


I don't remember if I told you this, but the TP-LINK adapter is currently 
plugged into a USB 3.2 port anyway.


Is there anything I could do to help you here? Just let me know.

thanks

--tzk

Re: RES: TP-LINK USB no carrier after speed test

2022-09-26 Thread Hans Petter Selasky


Hi Ivan,

Can you revert all if_ure patches, and try this one instead.

--HPS

diff --git a/sys/dev/usb/controller/xhci.c b/sys/dev/usb/controller/xhci.c
index 045be9a40b99..09aefb02687d 100644
--- a/sys/dev/usb/controller/xhci.c
+++ b/sys/dev/usb/controller/xhci.c
@@ -2848,8 +2848,16 @@ xhci_transfer_insert(struct usb_xfer *xfer)
 
 	/* check if already inserted */
 	if (xfer->flags_int.bandwidth_reclaimed) {
-		DPRINTFN(8, "Already in schedule\n");
-		return (0);
+		DPRINTFN(8, "Already in schedule (ringin doorbell only)\n");
+
+		/*
+		 * Apparently there may be a race with multi
+		 * buffering, that the hardware doesn't see the new
+		 * chain bit value and stops the endpoint
+		 * execution. Fix this by ringing the doorbell after
+		 * each and every job that has been completed.
+		 */
+		goto ring_doorbell;
 	}
 
 	pepext = xhci_get_endpoint_ext(xfer->xroot->udev,
@@ -2966,6 +2974,7 @@ xhci_transfer_insert(struct usb_xfer *xfer)
 
 	xfer->flags_int.bandwidth_reclaimed = 1;
 
+ring_doorbell:
 	xhci_endpoint_doorbell(xfer);
 
 	return (0);
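
For anyone who wants to try a patch like the one above, the steps might look roughly like the following sketch. It only prints the commands rather than running them. The function name `kernel_patch_plan`, the source location `/usr/src`, and the idea of saving the diff to a file are assumptions, not something stated in this thread; `git apply --check` is used as a dry run before the real apply, and `make kernel` builds and installs the kernel.

```shell
#!/bin/sh
# kernel_patch_plan SRCDIR PATCHFILE: print the steps for trying a kernel
# patch. Nothing is executed; review the output and run it by hand as root.
# PATCHFILE should be an absolute path, because git -C changes directory.
kernel_patch_plan() {
    src=$1
    patch=$2
    printf 'git -C %s apply --check %s\n' "$src" "$patch"   # dry run first
    printf 'git -C %s apply %s\n' "$src" "$patch"           # real apply
    printf 'make -C %s kernel\n' "$src"                     # build + install
    printf 'shutdown -r now\n'                              # boot new kernel
}
```

Calling `kernel_patch_plan /usr/src /tmp/xhci.diff` lists the four commands in order; the patch file name here is only an example.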

Re: RES: TP-LINK USB no carrier after speed test

2022-09-20 Thread Ivan Quitschal





On Mon, 19 Sep 2022, Ivan Quitschal wrote:




On Mon, 19 Sep 2022, Hans Petter Selasky wrote:


Hi Ivan,

Can you also test this USB kernel patch? And revert your if_ure.c patch?

--HPS



hi Hans,

It *almost* worked: everything was perfect, full speed 600/300, for the 
first 5 tests (on speedtest.net). On the 6th test the same problem happened, 
unfortunately.


thanks

--tzk




hi Hans

Today I tested again and the problem occurred right away, on the first attempt :(
But the problem is definitely related to that constant and to upload.
I set it back to 2048 and the problem never happened again. Of course, the 
upload speed also dropped back to 90 Mbps (instead of 300).


thanks

--tzk

Re: RES: TP-LINK USB no carrier after speed test

2022-09-19 Thread Ivan Quitschal





On Mon, 19 Sep 2022, Hans Petter Selasky wrote:


Hi Ivan,

Can you also test this USB kernel patch? And revert your if_ure.c patch?

--HPS



hi Hans,

It *almost* worked: everything was perfect, full speed 600/300, for the first 
5 tests (on speedtest.net). On the 6th test the same problem happened, 
unfortunately.


thanks

--tzk

Re: RES: TP-LINK USB no carrier after speed test

2022-09-19 Thread Hans Petter Selasky


Hi Ivan,

Can you also test this USB kernel patch? And revert your if_ure.c patch?

--HPS

diff --git a/sys/dev/usb/usb_transfer.c b/sys/dev/usb/usb_transfer.c
index 20ed2c897aac..757697926106 100644
--- a/sys/dev/usb/usb_transfer.c
+++ b/sys/dev/usb/usb_transfer.c
@@ -419,6 +419,7 @@ usbd_get_max_frame_length(const struct usb_endpoint_descriptor *edesc,
 
 		switch (type) {
 		case UE_CONTROL:
+		case UE_BULK:
 			max_packet_count = 1;
 			break;
 		case UE_ISOCHRONOUS:
@@ -529,6 +530,7 @@ usbd_transfer_setup_sub(struct usb_setup_params *parm)
 
 		switch (type) {
 		case UE_CONTROL:
+		case UE_BULK:
 			xfer->max_packet_count = 1;
 			break;
 		case UE_ISOCHRONOUS:

Re: RES: TP-LINK USB no carrier after speed test

2022-09-19 Thread Hans Petter Selasky


On 9/18/22 13:50, Ivan Quitschal wrote:

Hi Hans

just a heads up: it worked. I tested a thousand times and the problem no 
longer occurs after I changed the constant to 2048.


But the upload speed was affected a little, I believe.
Instead of 600/300, I'm now getting 600/90.

But that's fine, it's way better now.

Should this bug be in Bugzilla for the ure driver as well, as we have for axge?

thanks

--tzk


Hi Ivan,

I got one of those if_ure adapters at my hand, and will test it a bit 
before concluding. Stay tuned and thanks for your testing efforts!


--HPS

Re: RES: TP-LINK USB no carrier after speed test

2022-09-18 Thread Ivan Quitschal




On Fri, 16 Sep 2022, Ivan Quitschal wrote:




On Fri, 16 Sep 2022, Hans Petter Selasky wrote:


On 9/16/22 16:31, Ivan Quitschal wrote:




-Original Message-
From: Hans Petter Selasky 
Sent: Friday, September 16, 2022 10:40
To: Ivan Quitschal 
Cc: freebsd-current@freebsd.org
Subject: Re: TP-LINK USB no carrier after speed test

On 9/16/22 14:18, Ivan Quitschal wrote:



On Fri, 16 Sep 2022, Hans Petter Selasky wrote:


On 9/16/22 08:34, Hans Petter Selasky wrote:

On 9/16/22 08:20, Hans Petter Selasky wrote:

On 9/15/22 17:36, Ivan Quitschal wrote:



On Thu, 15 Sep 2022, Ivan Quitschal wrote:




On Thu, 15 Sep 2022, Hans Petter Selasky wrote:


On 9/15/22 17:18, Hans Petter Selasky wrote:

On 9/15/22 17:16, Ivan Quitschal wrote:


Hi All

Does anybody have any idea what could be happening here?
I have a Dell Inspiron 3511 laptop and everything works just
fine, literally everything, even iwlwifi0.

But in order to use my full 600 Mbps, I don't use the wireless
but a TP-LINK USB Ethernet adapter attached as "ue0".

ugen0.6:  at usbus0, cfg=0
md=HOST spd=HIGH (480Mbps) pwr=ON (200mA)


But something really strange is happening: every time I open
Chromium and run a speed test (speedtest.net or any other),
the Ethernet interface dies at the end of the test. It changes
from full-duplex to half-duplex/no carrier, and the only way to
get the connection back through ue0 is to reboot the whole
machine.
Not even a "service netif restart" does anything.

If anyone has any idea why that is, it would be appreciated.



Hi,

I think it is some new features they use in the USB data protocol
which we don't support. Check the Linux code.

By the way, does:

usbconfig -d 0.6 reset

recover the device?

--HPS



Hi,

Search for axge on bugzilla:

I suspect you are using this chipset:

Try:

usbconfig show_ifdrv

To know for sure.

Also see:

https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugs.freebsd.org%2Fbugzilla%2Fshow_bug.cgi%3Fid%3D210488&data=05%7C01%7C%7C7d0b875611fa4c22aa6808da97e8f75a%7C84df9e7fe9f640afb435%7C1%7C0%7C637989324107373931%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=o%2B12TNKJ4bcBBj1b4r4TT1fxyEalCMDMjOepy3MZm5c%3D&reserved=0

--HPS
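
To read the attached driver out of the `usbconfig show_ifdrv` output without scanning it by eye, something like the sketch below may help. The helper name `driver_for_ugen` is hypothetical, and the line shape it expects (`ugen0.6.0: ure0: <description>`) is an assumption about how the interface lines are printed.

```shell
#!/bin/sh
# driver_for_ugen UGEN: read `usbconfig show_ifdrv` output on stdin and
# print the driver instance names (for example "ure0") attached to UGEN.
# Assumed line shape for interface lines: "ugen0.6.0: ure0: <description>".
driver_for_ugen() {
    awk -v dev="$1" '
        # Interface lines start with "<ugen>.<ifindex>:"; field 2 is "drvN:".
        # The device summary line ("ugen0.6: <...> at usbus0") is skipped.
        index($1, dev ".") == 1 { sub(/:$/, "", $2); print $2 }
    '
}
```

A possible invocation is `usbconfig show_ifdrv | driver_for_ugen ugen0.6`, which would print `ure0` for an adapter handled by if_ure.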




Hi Hans,

Actually, the driver I use is not axge (I thought it would be when
I bought the USB adapter).

This is the module I'm using:

if_ure.ko

And thank you, yes, resetting the USB device with your command
worked just fine.
I got the connection back after doing this:

usbconfig -d 0.6 reset

Do we have a bug here, then?

thank you

--tzk



Oh, I forgot to mention that the ure driver always freezes in the
middle of the upload test, not during the download test.

I don't know if that helps.

thanks

--tzk


Hi Ivan,

Yes, there seems to be a problem there. I need to look at the driver.
Maybe it is simply sending more data than the device can handle!

--HPS



Hi,

Try lowering this constant to 8192:

sys/dev/usb/net/if_urereg.h:#define    URE_TX_BUFSZ    16384

Then recompile and install if_ure:

make -C sys/modules/usb/ure KMODDIR=/boot/kernel all install

--HPS
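
The edit to the constant can also be scripted, which makes it easier to try several values in a row. The following is a minimal sketch, not part of the original instructions: the function name `set_tx_bufsz` is hypothetical, and it assumes the header contains a single line of the form `#define URE_TX_BUFSZ <number>`, as quoted above.

```shell
#!/bin/sh
# set_tx_bufsz HEADER VALUE: rewrite the URE_TX_BUFSZ #define in HEADER.
# A temporary file is used instead of `sed -i`, whose syntax differs
# between BSD and GNU sed.
set_tx_bufsz() {
    hdr=$1
    val=$2
    tmp="$hdr.tmp.$$"
    # Keep "#define<ws>URE_TX_BUFSZ<ws>" and replace only the number.
    sed "s/^\(#define[[:space:]]\{1,\}URE_TX_BUFSZ[[:space:]]\{1,\}\)[0-9]\{1,\}/\1$val/" \
        "$hdr" > "$tmp" && mv "$tmp" "$hdr"
}
```

From the top of the source tree this would be called as `set_tx_bufsz sys/dev/usb/net/if_urereg.h 8192`, followed by the `make -C sys/modules/usb/ure KMODDIR=/boot/kernel all install` command given above.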



You can also try other values, like subtracting one.

--HPS




hi Hans,

It worked, but at a cost.

I tried 8192 and it survives 5 speed tests in a row, then freezes on
the 6th.

Then I tried 4096, with no success.

Then I tried a buffer size of 2048 and it is working now. I ran several
tests in a row and the connection keeps working just fine, but I noticed
that the speed also dropped.

With the same server, testing on Windows, I get:
600 download / 300 upload

And on FreeBSD with the 2048 buffer size I get:
300 download / 150 upload


So it worked at a cost, like I said: I had to give up half of my
bandwidth.

What do you think?

and thank you again

--tzk


Hi,

Can you try this instead:

usbconfig -d X.Y set_config 1

X.Y are the numbers after ugenX.Y

Then it will use a different driver.

--HPS


Hi Hans,

After the usbconfig -d X.Y set_config 1, the interface doesn't work 
anymore, how can I undo this?


Thanks

Ivan



usbconfig -d X.Y set_config 0

or

usbconfig -d X.Y reset

Did you check dmesg?

--HPS



hi Hans,

I had to reboot my router, but I got my interface running again,
and I have good news:
after that, still using the 2048 buffer size, this time I got 600 Mbps 
download / 300 upload, just like that.

As far as I can see, these two things were not related.

I guess that's it; it looks like it's solved. I will keep monitoring here, 
but so far so good.


thank you a lot as always

--tzk



Hi Hans

just a heads up: it worked. I tested a thousand times and the problem no 
longer occurs after I changed the constant to 2048.


But the upload speed was affected a little, I believe.
Instead of 600/300, I'm now getting 600/90.

But that's fine, it's way better now.

Should this bug be in Bugzilla for the ure driver as well, as we have for axge?

thanks

--tzk
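
To put the figures from this thread in proportion, the retained share of the baseline can be worked out with integer arithmetic. This is just an illustrative sketch; the helper name `pct_of` is hypothetical, and the 300 Mbps upload baseline is the Windows figure reported earlier in the thread.

```shell
#!/bin/sh
# pct_of MEASURED BASELINE: print the integer percentage of the baseline
# that is retained (rounded down).
pct_of() {
    echo $(( $1 * 100 / $2 ))
}
```

With the numbers above, `pct_of 90 300` prints 30 for the upload with the 2048 buffer, and `pct_of 150 300` prints 50 for the earlier 300/150 result.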

Re: RES: TP-LINK USB no carrier after speed test

2022-09-16 Thread Ivan Quitschal




On Fri, 16 Sep 2022, Hans Petter Selasky wrote:


On 9/16/22 16:31, Ivan Quitschal wrote:





Hi Hans,

After the usbconfig -d X.Y set_config 1, the interface doesn't work 
anymore, how can I undo this?


Thanks

Ivan



usbconfig -d X.Y set_config 0

or

usbconfig -d X.Y reset

Did you check dmesg?

--HPS



hi Hans,

I had to reboot my router, but I got my interface running again,
and I have good news:
after that, still using the 2048 buffer size, this time I got 600 Mbps 
download / 300 upload, just like that.

As far as I can see, these two things were not related.

I guess that's it; it looks like it's solved. I will keep monitoring here, but 
so far so good.


thank you a lot as always

--tzk

Re: RES: TP-LINK USB no carrier after speed test

2022-09-16 Thread Hans Petter Selasky


On 9/16/22 16:31, Ivan Quitschal wrote:





Hi Hans,

After the usbconfig -d X.Y set_config 1, the interface doesn't work anymore, 
how can I undo this?

Thanks

Ivan



usbconfig -d X.Y set_config 0

or

usbconfig -d X.Y reset

Did you check dmesg?

--HPS

RES: TP-LINK USB no carrier after speed test

2022-09-16 Thread Ivan Quitschal




> -Original Message-
> From: Hans Petter Selasky 
> Sent: Friday, September 16, 2022 10:40
> To: Ivan Quitschal 
> Cc: freebsd-current@freebsd.org
> Subject: Re: TP-LINK USB no carrier after speed test
> 
> On 9/16/22 14:18, Ivan Quitschal wrote:
> >
> >
> > On Fri, 16 Sep 2022, Hans Petter Selasky wrote:
> >
> >> On 9/16/22 08:34, Hans Petter Selasky wrote:
> >>> On 9/16/22 08:20, Hans Petter Selasky wrote:
> >>>> On 9/15/22 17:36, Ivan Quitschal wrote:
> >>>>>
> >>>>>
> >>>>> On Thu, 15 Sep 2022, Ivan Quitschal wrote:
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Thu, 15 Sep 2022, Hans Petter Selasky wrote:
> >>>>>>
> >>>>>>> On 9/15/22 17:18, Hans Petter Selasky wrote:
> >>>>>>>> On 9/15/22 17:16, Ivan Quitschal wrote:
> >>>>>>>>>
> >>>>>>>>> Hi All
> >>>>>>>>>
> >>>>>>>>> Does anybody have any idea what could be happening here?
> >>>>>>>>> I have a Dell Inspiron 3511 laptop and everything works just
> >>>>>>>>> fine, literally everything, even iwlwifi0.
> >>>>>>>>>
> >>>>>>>>> But in order to use my full 600 Mbps, I don't use the wireless
> >>>>>>>>> but a TP-LINK USB Ethernet adapter attached as "ue0".
> >>>>>>>>>
> >>>>>>>>> ugen0.6:  at usbus0, cfg=0
> >>>>>>>>> md=HOST spd=HIGH (480Mbps) pwr=ON (200mA)
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> But something really strange is happening: every time I open
> >>>>>>>>> Chromium and run a speed test (speedtest.net or any other),
> >>>>>>>>> the Ethernet interface dies at the end of the test. It changes
> >>>>>>>>> from full-duplex to half-duplex/no carrier, and the only way to
> >>>>>>>>> get the connection back through ue0 is to reboot the whole
> >>>>>>>>> machine.
> >>>>>>>>> Not even a "service netif restart" does anything.
> >>>>>>>>>
> >>>>>>>>> If anyone has any idea why that is, it would be appreciated.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I think it is some new features they use in the USB data protocol
> >>>>>>>> which we don't support. Check the Linux code.
> >>>>>>>>
> >>>>>>>> By the way, does:
> >>>>>>>>
> >>>>>>>> usbconfig -d 0.6 reset
> >>>>>>>>
> >>>>>>>> recover the device?
> >>>>>>>>
> >>>>>>>> --HPS
> >>>>>>>>
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> Search for axge on bugzilla:
> >>>>>>>
> >>>>>>> I suspect you are using this chipset:
> >>>>>>>
> >>>>>>> Try:
> >>>>>>>
> >>>>>>> usbconfig show_ifdrv
> >>>>>>>
> >>>>>>> To know for sure.
> >>>>>>>
> >>>>>>> Also see:
> >>>>>>>
> >>>>>>> https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugs.freebsd.org%2Fbugzilla%2Fshow_bug.cgi%3Fid%3D210488&data=05%7C01%7C%7C7d0b875611fa4c22aa6808da97e8f75a%7C84df9e7fe9f640afb435%7C1%7C0%7C637989324107373931%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=o%2B12TNKJ4bcBBj1b4r4TT1fxyEalCMDMjOepy3MZm5c%3D&reserved=0
> >>>>>>>
> >>>>>>> --HPS
> >>>>

Re: TP-LINK USB no carrier after speed test

2022-09-16 Thread Hans Petter Selasky


Hi,

I compared the Linux code and the FreeBSD code, and the Linux code has 
firmware upload support for this device. Maybe implementing that will 
fix some issues. Will come back to this after EuroBSDcon :-)


--HPS

Re: TP-LINK USB no carrier after speed test

2022-09-16 Thread Hans Petter Selasky

On 9/16/22 14:18, Ivan Quitschal wrote:

hi Hans,

It worked, but at a cost.

I tried 8192 and it survives 5 speed tests in a row, then freezes on
the 6th.

Then I tried 4096, with no success.

Then I tried a buffer size of 2048 and it is working now. I ran several
tests in a row and the connection keeps working just fine, but I noticed
that the speed also dropped.

With the same server, testing on Windows, I get:
600 download / 300 upload

And on FreeBSD with the 2048 buffer size I get:
300 download / 150 upload

So it worked at a cost, like I said: I had to give up half of my bandwidth.

What do you think?

and thank you again

--tzk

Hi,

Can you try this instead:

usbconfig -d X.Y set_config 1

X.Y are the numbers after ugenX.Y

Then it will use a different driver.

--HPS

Re: TP-LINK USB no carrier after speed test

2022-09-16 Thread Ivan Quitschal

On Fri, 16 Sep 2022, Hans Petter Selasky wrote:

On 9/16/22 08:34, Hans Petter Selasky wrote:

On 9/16/22 08:20, Hans Petter Selasky wrote:

On 9/15/22 17:36, Ivan Quitschal wrote:

On Thu, 15 Sep 2022, Ivan Quitschal wrote:

On Thu, 15 Sep 2022, Hans Petter Selasky wrote:

On 9/15/22 17:18, Hans Petter Selasky wrote:

On 9/15/22 17:16, Ivan Quitschal wrote:

Hi All

Does anybody have any idea what could be happening here?.
I have a laptop DELL INSPIRON 3511 and everything works just fine,
literally everything. even the iwlwifi0.

But in order to use my full 600mbps, i dont use the wireless but a
TP-LINK USB ethernet connected on "ue0"

ugen0.6: at usbus0, cfg=0 md=HOST
spd=HIGH (480Mbps) pwr=ON (200mA)

but something really strange is happening .. everytime i open the
chromium e do a speedtest (could be speedtest.net or any other) , at
the end of the test the eth interface dies .. it changes from
full-duplex to half-duplex/no carrier and the only way to get the
internet back thru ue0 is by rebooting the whole thing.

not even a "service netif restart" does anything

if anyone has any ideas why is that , would be appreciated

Hi,

I think it some new features they use in the USB data protocol which
we don't support. Check the Linux code.

Between does:

usbconfig -d 0.6 reset

recover the device?

--HPS

Hi,

Search for axge on bugzilla:

I suspect you are using this chipset:

Try:

usbconfig show_ifdrv

To know for sure.

Also see:

--HPS

Hi Hans,

actually the driver i use is not agxe (i thought it would be by the time
i

bought the usbcard)

this is the module im using

if_ure.ko

and thank you , yes, reseting the usb entry with your command worked
just fine.

i got the internet back after doing this

usbconfig -d 0.6 reset

do we have a bug here then?

thank you

--tzk

oh, i forgot to mention that the ure driver freezes not during the
download test but in the middle of the upload, always

dont know if that helps

thanks

--tzk

Hi Ivan,

Yes, there seems to be a problem there. I need to look at the driver. Maybe
it is simply sending more data than the device can handle!

--HPS

Hi,

Try lowering this constant to 8192:

sys/dev/usb/net/if_urereg.h:#define URE_TX_BUFSZ 16384

Then recompile and install if_ure:

make -C sys/modules/usb/ure KMODDIR=/boot/kernel all install

--HPS

You can also try other values, like subtracting one.

--HPS

hi Hans,

it worked but with a cost.

i tried 8192 and it works for 5 tests (5 speed tests in a row), then freezes on the 6th.

then i tried 4096, no success

then i tried bufsize 2048 and it's working now. i did several tests in a row and
the internet keeps working just fine, but i noticed that the speed also dropped.

with the same server , testing on windows, it goes like:
600 download / 300 upload

and on freebsd with 2048 bufsize it goes like:
300 download / 150 upload

so it worked with a cost like i said, i had to give up half of my bandwidth

what do you think ?

and thank you again

--tzk
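
The buffer sizes tried above can be compared with a small calculation. The following is a minimal sketch, assuming each outgoing frame is preceded by an 8-byte descriptor and padded to an 8-byte boundary inside the TX buffer. Both numbers, and every sketch_ / SKETCH_ name, are illustrative assumptions and are not taken from if_ure. It only shows how many full-size Ethernet frames would fit in one buffer of a given size under those assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed per-frame overhead; illustrative only, NOT taken from if_ure. */
#define SKETCH_HDR_LEN	8u	/* assumed descriptor in front of each frame */
#define SKETCH_ALIGN	8u	/* assumed padding boundary between frames */

/* Round n up to the next multiple of align (align must be a power of two). */
static size_t
sketch_roundup(size_t n, size_t align)
{
	return ((n + align - 1) & ~(align - 1));
}

/*
 * Number of frames of frame_len bytes that fit into one TX buffer of
 * bufsz bytes, given the assumed header and alignment above.
 */
static size_t
sketch_frames_per_buffer(size_t bufsz, size_t frame_len)
{
	size_t slot;

	assert(frame_len > 0);
	slot = sketch_roundup(frame_len + SKETCH_HDR_LEN, SKETCH_ALIGN);
	return (bufsz / slot);
}
```

Under these assumptions, 1514-byte frames fit 10, 5, 2 and 1 times into buffers of 16384, 8192, 4096 and 2048 bytes. That would mean more USB transfers for the same amount of data with the smaller buffers, but it does not by itself account for the measured speeds.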

Re: TP-LINK USB no carrier after speed test



You can also try other values, like subtracting one.

--HPS

Re: TP-LINK USB no carrier after speed test



Hi,

Try lowering this constant to 8192:

sys/dev/usb/net/if_urereg.h:#define URE_TX_BUFSZ 16384

Then recompile and install if_ure:

make -C sys/modules/usb/ure KMODDIR=/boot/kernel all install

--HPS

Re: TP-LINK USB no carrier after speed test



Hi Ivan,

Yes, there seems to be a problem there. I need to look at the driver.
Maybe it is simply sending more data than the device can handle!


--HPS

Re: TP-LINK USB no carrier after speed test

2022-09-15 Thread void


On Thu, Sep 15, 2022 at 01:45:11PM -0300, Ivan Quitschal wrote:


capabilities=68009b
ether 54:af:97:86:be:2c
inet 192.168.0.35 netmask 0xff00 broadcast 192.168.0.255
media: Ethernet 1000baseT 
status: active
supported media:
media autoselect
media 1000baseT mediaopt full-duplex,master
media 1000baseT mediaopt full-duplex
media 100baseTX mediaopt full-duplex
media 100baseTX
media 10baseT/UTP mediaopt full-duplex
media 10baseT/UTP
media none
nd6 options=29


In /etc/rc.conf, is it autoselected (so no mediaopt line) 
or are you specifying the media 1000baseT mediaopt full-duplex,master ?


I'm asking because some network devices sometimes seem to *require* 
the speed to be specified because they don't play well autonegotiating.

--

Re: TP-LINK USB no carrier after speed test






hi Hans

i've seen you made 2 patches for the ure driver which look a little like the
problem im having here.


https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=256675

problem is, it's not compiling any longer, the code must have changed since you
made the patch.


regarding the "axge" bugzilla you sent me , THATS EXACTLY the problem im
having. The workaround for the guy's problem was doing this:


# ifconfig ue0 media 1000baseT mediaopt flow

problem is, my ure/ue0 interface does not have that option "flow"

-
[tzk@tzk-inspiron ~ ]$ ifconfig -m ue0
ue0: flags=8843 metric 0 mtu 1500

options=68009b

capabilities=68009b
ether 54:af:97:86:be:2c
inet 192.168.0.35 netmask 0xff00 broadcast 192.168.0.255
media: Ethernet 1000baseT 
status: active
supported media:
media autoselect
media 1000baseT mediaopt full-duplex,master
media 1000baseT mediaopt full-duplex
media 100baseTX mediaopt full-duplex
media 100baseTX
media 10baseT/UTP mediaopt full-duplex
media 10baseT/UTP
media none
nd6 options=29


any ideas or any other patch you made ?
appreciate any insights

thanks in advance

--tzk

Re: TP-LINK USB no carrier after speed test






oh, i forgot to mention that the ure driver freezes not during the 
download test but in the middle of the upload, always


dont know if that helps

thanks

--tzk

Re: TP-LINK USB no carrier after speed test






Hi Hans,

actually the driver i use is not axge (i thought it would be by the time i
bought the usbcard)

this is the module im using

if_ure.ko

and thank you , yes, resetting the usb entry with your command worked just fine.
i got the internet back after doing this

usbconfig -d 0.6 reset

do we have a bug here then?

thank you

--tzk

Re: TP-LINK USB no carrier after speed test



Hi,

Search for axge on bugzilla:

I suspect you are using this chipset:

Try:

usbconfig show_ifdrv

To know for sure.

Also see:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=210488

--HPS

Re: TP-LINK USB no carrier after speed test



Hi,

I think it is some new feature they use in the USB data protocol which we
don't support. Check the Linux code.


In the meantime, does:

usbconfig -d 0.6 reset

recover the device?

--HPS

TP-LINK USB no carrier after speed test




Hi All

Does anybody have any idea what could be happening here?.
I have a laptop DELL INSPIRON 3511 and everything works just fine, literally 
everything. even the iwlwifi0.


But in order to use my full 600mbps, i dont use the wireless but a TP-LINK USB 
ethernet connected on "ue0"


ugen0.6:  at usbus0, cfg=0 md=HOST spd=HIGH 
(480Mbps) pwr=ON (200mA)



but something really strange is happening .. every time i open chromium and do
a speedtest (could be speedtest.net or any other), at the end of the test the
eth interface dies .. it changes from full-duplex to half-duplex/no carrier and
the only way to get the internet back thru ue0 is by rebooting the whole thing.

not even a "service netif restart" does anything

if anyone has any ideas why is that , would be appreciated

thanks

--tzk

Re: test-includes breaks buildworld when WITHOUT_PF is set in src.conf

2022-02-09 Thread Gary Jennejohn

On Wed, 09 Feb 2022 11:08:44 +0100
Kristof Provost  wrote:

> On 9 Feb 2022, at 10:57, Gary Jennejohn wrote:
> > test-includes uses pf.h when checking usage of pfvar.h.
> >
> > But, these lines in include/Makefile remove pf.h when WITHOUT_PF is
> > set in src.conf:
> >
> > .if ${MK_PF} != "no"
> >  INCSGROUPS+=   PF
> > .endif
> >
> > This breaks buildworld.  The error message:
> >
> > In file included from net_pfvar.c:1:
> > /usr/obj/usr/src/amd64.amd64/tmp/usr/include/net/pfvar.h:65:10: fatal error:
> > 'netpfil/pf/pf.h' file not found
> > #include <netpfil/pf/pf.h>
> >          ^
> > 1 error generated.
> > --- net_pfvar.o ---
> > *** [net_pfvar.o] Error code 1
> >
> > make[3]: stopped in /usr/src/tools/build/test-includes
> > .ERROR_TARGET='net_pfvar.o'
> >
> > Removing the .if/.endif fixes it for me, although there may be a better
> > way to avoid the error.
> >  
> Warner's working on a better fix. See https://reviews.freebsd.org/D34009 for 
> the discussion.
> 

Thanks for the info.

-- 
Gary Jennejohn

Re: test-includes breaks buildworld when WITHOUT_PF is set in src.conf

2022-02-09 Thread Kristof Provost

On 9 Feb 2022, at 10:57, Gary Jennejohn wrote:
> test-includes uses pf.h when checking usage of pfvar.h.
>
> But, these lines in include/Makefile remove pf.h when WITHOUT_PF is
> set in src.conf:
>
> .if ${MK_PF} != "no"
>  INCSGROUPS+=   PF
> .endif
>
> This breaks buildworld.  The error message:
>
> In file included from net_pfvar.c:1:
> /usr/obj/usr/src/amd64.amd64/tmp/usr/include/net/pfvar.h:65:10: fatal error:
> 'netpfil/pf/pf.h' file not found
> #include <netpfil/pf/pf.h>
>          ^
> 1 error generated.
> --- net_pfvar.o ---
> *** [net_pfvar.o] Error code 1
>
> make[3]: stopped in /usr/src/tools/build/test-includes
> .ERROR_TARGET='net_pfvar.o'
>
> Removing the .if/.endif fixes it for me, although there may be a better
> way to avoid the error.
>
Warner’s working on a better fix. See https://reviews.freebsd.org/D34009 for 
the discussion.

Kristof

test-includes breaks buildworld when WITHOUT_PF is set in src.conf

2022-02-09 Thread Gary Jennejohn

test-includes uses pf.h when checking usage of pfvar.h.

But, these lines in include/Makefile remove pf.h when WITHOUT_PF is
set in src.conf:

.if ${MK_PF} != "no"
 INCSGROUPS+=   PF
.endif

This breaks buildworld.  The error message:

In file included from net_pfvar.c:1:
/usr/obj/usr/src/amd64.amd64/tmp/usr/include/net/pfvar.h:65:10: fatal error:
'netpfil/pf/pf.h' file not found
#include <netpfil/pf/pf.h>
         ^
1 error generated.
--- net_pfvar.o ---
*** [net_pfvar.o] Error code 1

make[3]: stopped in /usr/src/tools/build/test-includes
.ERROR_TARGET='net_pfvar.o'

Removing the .if/.endif fixes it for me, although there may be a better
way to avoid the error.

-- 
Gary Jennejohn

Re: kyua run under WITH_ASAN= built world reports a global-buffer-overflow during cpio test.

2022-01-12 Thread Mark Millard

On 2022-Jan-12, at 01:54, Mark Millard  wrote:

> For the below it appears that the report from ASAN is accurate.
> 
> ==85511==ERROR: AddressSanitizer: global-buffer-overflow on address 
> 0x010753ca at pc 0x01139bda bp 0x7fffc2b0 sp 0x7fffc2a8
> READ of size 1 at 0x010753ca thread T0
>#0 0x1139bd9 in hexdump 
> /usr/main-src/contrib/libarchive/test_utils/test_main.c:875:35
>#1 0x113b73c in assertion_text_file_contents 
> /usr/main-src/contrib/libarchive/test_utils/test_main.c:1182:3
>#2 0x1125d46 in basic_cpio 
> /usr/main-src/contrib/libarchive/cpio/test/test_basic.c:84:2
>#3 0x11259dc in test_basic 
> /usr/main-src/contrib/libarchive/cpio/test/test_basic.c:229:2
>#4 0x1144943 in test_run 
> /usr/main-src/contrib/libarchive/test_utils/test_main.c:3561:2
>#5 0x1144943 in main 
> /usr/main-src/contrib/libarchive/test_utils/test_main.c:4062:9
> 
> 0x010753ca is located 54 bytes to the left of global variable '<string literal>' defined in 
> '/usr/main-src/contrib/libarchive/cpio/test/test_basic.c:229:13' (0x1075400) 
> of size 5
>  '<string literal>' is ascii string 'copy'
> 0x010753ca is located 22 bytes to the left of global variable '<string literal>' defined in 
> '/usr/main-src/contrib/libarchive/cpio/test/test_basic.c:228:38' (0x10753e0) 
> of size 9
>  '<string literal>' is ascii string '1 block
> '
> 0x010753ca is located 0 bytes to the right of global variable '<string literal>' defined in 
> '/usr/main-src/contrib/libarchive/cpio/test/test_basic.c:220:18' (0x10753c0) 
> of size 10
>  '<string literal>' is ascii string '2 blocks
> '
> SUMMARY: AddressSanitizer: global-buffer-overflow 
> /usr/main-src/contrib/libarchive/test_utils/test_main.c:875:35 in hexdump
> Shadow bytes around the buggy address:
>  0x4020ea20: f9 f9 f9 f9 02 f9 f9 f9 00 01 f9 f9 00 02 f9 f9
>  0x4020ea30: 00 00 00 00 00 00 02 f9 f9 f9 f9 f9 00 f9 f9 f9
>  0x4020ea40: 00 01 f9 f9 00 00 00 00 00 00 01 f9 f9 f9 f9 f9
>  0x4020ea50: 06 f9 f9 f9 07 f9 f9 f9 00 00 00 00 00 07 f9 f9
>  0x4020ea60: f9 f9 f9 f9 04 f9 f9 f9 05 f9 f9 f9 00 00 00 00
> =>0x4020ea70: 00 05 f9 f9 f9 f9 f9 f9 00[02]f9 f9 00 01 f9 f9
>  0x4020ea80: 05 f9 f9 f9 01 f9 f9 f9 00 01 f9 f9 00 05 f9 f9
>  0x4020ea90: 00 02 f9 f9 00 f9 f9 f9 00 02 f9 f9 07 f9 f9 f9
>  0x4020eaa0: 00 01 f9 f9 07 f9 f9 f9 00 02 f9 f9 00 02 f9 f9
>  0x4020eab0: 00 03 f9 f9 00 01 f9 f9 00 04 f9 f9 00 00 00 00
>  0x4020eac0: 00 00 00 03 f9 f9 f9 f9 00 00 00 f9 f9 f9 f9 f9
> Shadow byte legend (one shadow byte represents 8 application bytes):
>  Addressable:   00
>  Partially addressable: 01 02 03 04 05 06 07 
>  Heap left redzone:   fa
>  Freed heap region:   fd
>  Stack left redzone:  f1
>  Stack mid redzone:   f2
>  Stack right redzone: f3
>  Stack after return:  f5
>  Stack use after scope:   f8
>  Global redzone:  f9
>  Global init order:   f6
>  Poisoned by user:f7
>  Container overflow:  fc
>  Array cookie:ac
>  Intra object redzone:bb
>  ASan internal:   fe
>  Left alloca redzone: ca
>  Right alloca redzone:cb
> ==85511==ABORTING
> 
> Well, contrib/libarchive/cpio/test/test_basic.c:84 is doing:
> 
>assertTextFileContents(se, "pack.err");
> 
> which involves, in turn:
> 
> int
> assertion_text_file_contents(const char *filename, int line, const char 
> *buff, const char *fn)
> {
> . . .
>s = (int)strlen(buff);
>contents = malloc(s * 2 + 128);
>n = (int)fread(contents, 1, s * 2 + 128 - 1, f);
> . . .
>if (n > 0) {
>hexdump(contents, buff, n, 0);
> . . .
> 
> Nothing about the code seems to constrain n to fit the
> size of the space for "pack.err" (9 bytes of "global"
> space).
> 
> The report is for the ref[i + j] in the code:
> 
> static void
> hexdump(const char *p, const char *ref, size_t l, size_t offset)
> {
> . . .
>for (j = 0; j < 16 && i + j < l; j++) {
>if (ref != NULL && p[i + j] != ref[i + j])
> . . .
> 
> where ref points to the space for "pack.err" and l was
> given a copy of the value of n in the previously shown
> code.
> 
> The i + j < l constraint need not avoid the code doing
> ref[i + j] in a way that reaches outside the space for
> "pack.err" --because of the supplied value of n (a.k.a. l)
> not being sufficient to respect the space for "pack.err".


The pair of frames below shows up in 13 reports:

   #0 0x1139bd9 in hexdump 
/usr/main-src/contrib/libarchive/test_utils/test_main.c:875:35
   #1 0x113b73c in assertion_text_file_contents 
/usr/main-src/contrib/libarchive/test_utils/test_main.c:1182:3

So the above notes are just an illustration of a more
general issue with the assertion_text_file_contents
use of "hexdump(contents, buff, n, 0)".

===
Mark Millard
marklmi at yahoo.com

kyua run under WITH_ASAN= built world reports a global-buffer-overflow during cpio test.

2022-01-12 Thread Mark Millard

For the below it appears that the report from ASAN is accurate.

==85511==ERROR: AddressSanitizer: global-buffer-overflow on address 
0x010753ca at pc 0x01139bda bp 0x7fffc2b0 sp 0x7fffc2a8
READ of size 1 at 0x010753ca thread T0
#0 0x1139bd9 in hexdump 
/usr/main-src/contrib/libarchive/test_utils/test_main.c:875:35
#1 0x113b73c in assertion_text_file_contents 
/usr/main-src/contrib/libarchive/test_utils/test_main.c:1182:3
#2 0x1125d46 in basic_cpio 
/usr/main-src/contrib/libarchive/cpio/test/test_basic.c:84:2
#3 0x11259dc in test_basic 
/usr/main-src/contrib/libarchive/cpio/test/test_basic.c:229:2
#4 0x1144943 in test_run 
/usr/main-src/contrib/libarchive/test_utils/test_main.c:3561:2
#5 0x1144943 in main 
/usr/main-src/contrib/libarchive/test_utils/test_main.c:4062:9

0x010753ca is located 54 bytes to the left of global variable '<string literal>' defined in 
'/usr/main-src/contrib/libarchive/cpio/test/test_basic.c:229:13' (0x1075400) of 
size 5
  '<string literal>' is ascii string 'copy'
0x010753ca is located 22 bytes to the left of global variable '<string literal>' defined in 
'/usr/main-src/contrib/libarchive/cpio/test/test_basic.c:228:38' (0x10753e0) of 
size 9
  '<string literal>' is ascii string '1 block
'
0x010753ca is located 0 bytes to the right of global variable '<string literal>' defined in 
'/usr/main-src/contrib/libarchive/cpio/test/test_basic.c:220:18' (0x10753c0) of 
size 10
  '<string literal>' is ascii string '2 blocks
'
SUMMARY: AddressSanitizer: global-buffer-overflow 
/usr/main-src/contrib/libarchive/test_utils/test_main.c:875:35 in hexdump
Shadow bytes around the buggy address:
  0x4020ea20: f9 f9 f9 f9 02 f9 f9 f9 00 01 f9 f9 00 02 f9 f9
  0x4020ea30: 00 00 00 00 00 00 02 f9 f9 f9 f9 f9 00 f9 f9 f9
  0x4020ea40: 00 01 f9 f9 00 00 00 00 00 00 01 f9 f9 f9 f9 f9
  0x4020ea50: 06 f9 f9 f9 07 f9 f9 f9 00 00 00 00 00 07 f9 f9
  0x4020ea60: f9 f9 f9 f9 04 f9 f9 f9 05 f9 f9 f9 00 00 00 00
=>0x4020ea70: 00 05 f9 f9 f9 f9 f9 f9 00[02]f9 f9 00 01 f9 f9
  0x4020ea80: 05 f9 f9 f9 01 f9 f9 f9 00 01 f9 f9 00 05 f9 f9
  0x4020ea90: 00 02 f9 f9 00 f9 f9 f9 00 02 f9 f9 07 f9 f9 f9
  0x4020eaa0: 00 01 f9 f9 07 f9 f9 f9 00 02 f9 f9 00 02 f9 f9
  0x4020eab0: 00 03 f9 f9 00 01 f9 f9 00 04 f9 f9 00 00 00 00
  0x4020eac0: 00 00 00 03 f9 f9 f9 f9 00 00 00 f9 f9 f9 f9 f9
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:   00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:   fa
  Freed heap region:   fd
  Stack left redzone:  f1
  Stack mid redzone:   f2
  Stack right redzone: f3
  Stack after return:  f5
  Stack use after scope:   f8
  Global redzone:  f9
  Global init order:   f6
  Poisoned by user:f7
  Container overflow:  fc
  Array cookie:ac
  Intra object redzone:bb
  ASan internal:   fe
  Left alloca redzone: ca
  Right alloca redzone:cb
==85511==ABORTING

Well, contrib/libarchive/cpio/test/test_basic.c:84 is doing:

assertTextFileContents(se, "pack.err");

which involves, in turn:

int
assertion_text_file_contents(const char *filename, int line, const char *buff, 
const char *fn)
{
. . .
s = (int)strlen(buff);
contents = malloc(s * 2 + 128);
n = (int)fread(contents, 1, s * 2 + 128 - 1, f);
. . .
if (n > 0) {
hexdump(contents, buff, n, 0);
. . .

Nothing about the code seems to constrain n to fit the
size of the space for "pack.err" (9 bytes of "global"
space).

The report is for the ref[i + j] in the code:

static void
hexdump(const char *p, const char *ref, size_t l, size_t offset)
{
. . .
for (j = 0; j < 16 && i + j < l; j++) {
if (ref != NULL && p[i + j] != ref[i + j])
. . .

where ref points to the space for "pack.err" and l was
given a copy of the value of n in the previously shown
code.

The i + j < l test only bounds the index by the number of
bytes read from the file. It does not keep ref[i + j] inside
the space for "pack.err": the supplied value of n (a.k.a. l)
can be larger than that space.
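
For illustration only, here is a minimal sketch of a comparison loop that also respects the size of the reference space. It is not the libarchive code and not a proposed patch: the function name count_mismatches and the extra ref_len parameter are hypothetical (the real hexdump() has no such argument), and treating bytes past the end of the reference as mismatches is just one possible policy.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the bytes in p[0..l) that differ from ref, never reading ref at
 * or beyond ref_len.  Bytes of p that lie past the end of ref are counted
 * as mismatches.  ref_len is a hypothetical extra parameter.
 */
size_t
count_mismatches(const char *p, size_t l, const char *ref, size_t ref_len)
{
	size_t i, mismatches = 0;

	assert(p != NULL || l == 0);
	if (ref == NULL)
		return (0);	/* nothing to compare against */
	for (i = 0; i < l; i++) {
		/* Test the bound first so ref[i] is never read out of range. */
		if (i >= ref_len || p[i] != ref[i])
			mismatches++;
	}
	return (mismatches);
}
```

A caller would pass the length of the reference text for ref_len (for example strlen(buff) in the assertion code shown earlier), so l can exceed ref_len without ref being read out of bounds.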


===
Mark Millard
marklmi at yahoo.com

Re: FYI: An example ASAN failure report during kyua test -k /usr/tests/Kyuafile (info for some more examples)

2022-01-09 Thread Mark Millard

On 2022-Jan-9, at 13:47, Mark Millard  wrote:

> On 2022-Jan-7, at 03:39, Mark Millard  wrote:
> 
>> Having done a buildworld with both WITH_ASAN= and WITH_UBSAN=
>> after finding what to control to allow the build, I installed
>> it in a directory tree for chroot use and have
>> "kyua test -k /usr/tests/Kyuafile" running.
>> 
>> I see evidence of one AddressSanitizer report. (kyua is still
>> running.) The context is:
>> 
>> # more 
>> /usr/obj/DESTDIRs/main-amd64-xSAN-chroot/tmp/kyua.FKD2vh/434/stdout.txt 
>> Executing command [ mkdir /tmp/kyua.FKD2vh/434/work/mntpt ]
>> mount -t tmpfs -o size=10M tmpfs /tmp/kyua.FKD2vh/434/work/mntpt
>> Executing command [ touch a ]
>> Executing command [ rm a ]
>> Executing command [ dd if=/dev/zero of=a bs=1m count=15 ]
>> Executing command [ rm a ]
>> 
>> # more 
>> /usr/obj/DESTDIRs/main-amd64-xSAN-chroot/tmp/kyua.FKD2vh/434/stderr.txt 
>> =
>> ==14384==ERROR: AddressSanitizer: stack-buffer-overflow on address 
>> 0x7fffa948 at pc 0x000801f38f5a bp 0x7fffa830 sp 0x7fffa828
>> WRITE of size 8 at 0x7fffa948 thread T0
>>   #0 0x801f38f59 in strtoimax_l 
>> /usr/main-src/lib/libc/stdlib/strtoimax.c:148:11
>>   #1 0x10de6c8 in strtoimax 
>> /usr/main-src/contrib/llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_common_interceptors.inc:3441:18
>>   #2 0x11a4723 in getq /usr/main-src/bin/test/test.c:560:6
>>   #3 0x11a4523 in intcmp /usr/main-src/bin/test/test.c:584:7
>>   #4 0x11a4523 in binop /usr/main-src/bin/test/test.c:351:10
>>   #5 0x11a2f06 in primary /usr/main-src/bin/test/test.c:317:10
>>   #6 0x11a2f06 in nexpr /usr/main-src/bin/test/test.c:275:9
>>   #7 0x11a28cb in aexpr /usr/main-src/bin/test/test.c:261:8
>>   #8 0x11a2a03 in aexpr /usr/main-src/bin/test/test.c:263:10
>>   #9 0x11a228b in oexpr /usr/main-src/bin/test/test.c:247:8
>>   #10 0x11a1fcf in testcmd /usr/main-src/bin/test/test.c:224:10
>>   #11 0x1145289 in evalcommand /usr/main-src/bin/sh/eval.c:1107:16
>>   #12 0x113eeb7 in evaltree /usr/main-src/bin/sh/eval.c:289:4
>>   #13 0x113fb34 in evaltree /usr/main-src/bin/sh/eval.c:225:4
>>   #14 0x113f86b in evaltree /usr/main-src/bin/sh/eval.c:212:4
>>   #15 0x1144d89 in evalcommand /usr/main-src/bin/sh/eval.c:1053:3
>>   #16 0x113eeb7 in evaltree /usr/main-src/bin/sh/eval.c:289:4
>>   #17 0x113fc55 in evaltree /usr/main-src/bin/sh/eval.c:241:4
>>   #18 0x1144d89 in evalcommand /usr/main-src/bin/sh/eval.c:1053:3
>>   #19 0x113eeb7 in evaltree /usr/main-src/bin/sh/eval.c:289:4
>>   #20 0x1144d89 in evalcommand /usr/main-src/bin/sh/eval.c:1053:3
>>   #21 0x113eeb7 in evaltree /usr/main-src/bin/sh/eval.c:289:4
>>   #22 0x113eb88 in evalstring /usr/main-src/bin/sh/eval.c
>>   #23 0x1179727 in main /usr/main-src/bin/sh/main.c:171:3
>> 
>> Address 0x7fffa948 is located in stack of thread T0 at offset 264 in 
>> frame
>>   #0 0x801f387ff in strtoimax_l /usr/main-src/lib/libc/stdlib/strtoimax.c:58
>> 
>> This frame has 1 object(s):
>>   [32, 36) '__limit.i.i.i' <== Memory access at offset 264 overflows this 
>> variable
>> HINT: this may be a false positive if your program uses some custom stack 
>> unwind mechanism, swapcontext or vfork
>> (longjmp and C++ exceptions *are* supported)
>> SUMMARY: AddressSanitizer: stack-buffer-overflow 
>> /usr/main-src/lib/libc/stdlib/strtoimax.c:148:11 in strtoimax_l
>> Shadow bytes around the buggy address:
>> 0x44d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x44e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x44f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 0x4500: f1 f1 f1 f1 00 00 00 00 f1 f1 f1 f1 f8 f3 f3 f3
>> 0x4510: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> =>0x4520: 00 00 00 00 f3 f3 f3 f3 f3[f3]f3 f3 00 00 00 00
>> 0x4530: f1 f1 f1 f1 00 f3 f3 f3 00 00 00 00 00 00 00 00
>> 0x4540: f1 f1 f1 f1 00 f2 f2 f2 00 f3 f3 f3 00 00 00 00
>> 0x4550: f1 f1 f1 f1 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
>> 0x4560: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
>> 0x4570: f2 f2 f2 f2 f2 f2 f2 f2 f8 f8 f8 f8 f8 f8 f8 f8
>> Shadow byte legend (one shadow byte represents 8 application bytes):
>> Addressable:   00
>> Partially addressable: 01 02 03 04 05 06 07 
>> Heap left redzone:   fa
>> Freed heap region:   fd
>> Stack left redzone:  f1
>> Stack mid redzone:   f2
>> Stack right r

Re: FYI: An example ASAN failure report during kyua test -k /usr/tests/Kyuafile (info for some more examples)

2022-01-09 Thread Mark Millard

On 2022-Jan-7, at 03:39, Mark Millard  wrote:

> Having done a buildworld with both WITH_ASAN= and WITH_UBSAN=
> after finding what to control to allow the build, I installed
> it in a directory tree for chroot use and have
> "kyua test -k /usr/tests/Kyuafile" running.
> 
> I see evidence of one AddressSanitizer report. (kyua is still
> running.) The context is:
> 
> # more 
> /usr/obj/DESTDIRs/main-amd64-xSAN-chroot/tmp/kyua.FKD2vh/434/stdout.txt 
> Executing command [ mkdir /tmp/kyua.FKD2vh/434/work/mntpt ]
> mount -t tmpfs -o size=10M tmpfs /tmp/kyua.FKD2vh/434/work/mntpt
> Executing command [ touch a ]
> Executing command [ rm a ]
> Executing command [ dd if=/dev/zero of=a bs=1m count=15 ]
> Executing command [ rm a ]
> 
> # more 
> /usr/obj/DESTDIRs/main-amd64-xSAN-chroot/tmp/kyua.FKD2vh/434/stderr.txt 
> =
> ==14384==ERROR: AddressSanitizer: stack-buffer-overflow on address 
> 0x7fffa948 at pc 0x000801f38f5a bp 0x7fffa830 sp 0x7fffa828
> WRITE of size 8 at 0x7fffa948 thread T0
>    #0 0x801f38f59 in strtoimax_l 
> /usr/main-src/lib/libc/stdlib/strtoimax.c:148:11
>    #1 0x10de6c8 in strtoimax 
> /usr/main-src/contrib/llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_common_interceptors.inc:3441:18
>    #2 0x11a4723 in getq /usr/main-src/bin/test/test.c:560:6
>    #3 0x11a4523 in intcmp /usr/main-src/bin/test/test.c:584:7
>    #4 0x11a4523 in binop /usr/main-src/bin/test/test.c:351:10
>    #5 0x11a2f06 in primary /usr/main-src/bin/test/test.c:317:10
>    #6 0x11a2f06 in nexpr /usr/main-src/bin/test/test.c:275:9
>    #7 0x11a28cb in aexpr /usr/main-src/bin/test/test.c:261:8
>    #8 0x11a2a03 in aexpr /usr/main-src/bin/test/test.c:263:10
>    #9 0x11a228b in oexpr /usr/main-src/bin/test/test.c:247:8
>    #10 0x11a1fcf in testcmd /usr/main-src/bin/test/test.c:224:10
>    #11 0x1145289 in evalcommand /usr/main-src/bin/sh/eval.c:1107:16
>    #12 0x113eeb7 in evaltree /usr/main-src/bin/sh/eval.c:289:4
>    #13 0x113fb34 in evaltree /usr/main-src/bin/sh/eval.c:225:4
>    #14 0x113f86b in evaltree /usr/main-src/bin/sh/eval.c:212:4
>    #15 0x1144d89 in evalcommand /usr/main-src/bin/sh/eval.c:1053:3
>    #16 0x113eeb7 in evaltree /usr/main-src/bin/sh/eval.c:289:4
>    #17 0x113fc55 in evaltree /usr/main-src/bin/sh/eval.c:241:4
>    #18 0x1144d89 in evalcommand /usr/main-src/bin/sh/eval.c:1053:3
>    #19 0x113eeb7 in evaltree /usr/main-src/bin/sh/eval.c:289:4
>    #20 0x1144d89 in evalcommand /usr/main-src/bin/sh/eval.c:1053:3
>    #21 0x113eeb7 in evaltree /usr/main-src/bin/sh/eval.c:289:4
>    #22 0x113eb88 in evalstring /usr/main-src/bin/sh/eval.c
>    #23 0x1179727 in main /usr/main-src/bin/sh/main.c:171:3
> 
> Address 0x7fffa948 is located in stack of thread T0 at offset 264 in frame
>    #0 0x801f387ff in strtoimax_l /usr/main-src/lib/libc/stdlib/strtoimax.c:58
> 
>  This frame has 1 object(s):
>    [32, 36) '__limit.i.i.i' <== Memory access at offset 264 overflows this 
> variable
> HINT: this may be a false positive if your program uses some custom stack 
> unwind mechanism, swapcontext or vfork
>  (longjmp and C++ exceptions *are* supported)
> SUMMARY: AddressSanitizer: stack-buffer-overflow 
> /usr/main-src/lib/libc/stdlib/strtoimax.c:148:11 in strtoimax_l
> Shadow bytes around the buggy address:
>  0x44d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>  0x44e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>  0x44f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>  0x4500: f1 f1 f1 f1 00 00 00 00 f1 f1 f1 f1 f8 f3 f3 f3
>  0x4510: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> =>0x4520: 00 00 00 00 f3 f3 f3 f3 f3[f3]f3 f3 00 00 00 00
>  0x4530: f1 f1 f1 f1 00 f3 f3 f3 00 00 00 00 00 00 00 00
>  0x4540: f1 f1 f1 f1 00 f2 f2 f2 00 f3 f3 f3 00 00 00 00
>  0x4550: f1 f1 f1 f1 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
>  0x4560: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
>  0x4570: f2 f2 f2 f2 f2 f2 f2 f2 f8 f8 f8 f8 f8 f8 f8 f8
> Shadow byte legend (one shadow byte represents 8 application bytes):
>  Addressable:   00
>  Partially addressable: 01 02 03 04 05 06 07 
>  Heap left redzone:   fa
>  Freed heap region:   fd
>  Stack left redzone:  f1
>  Stack mid redzone:   f2
>  Stack right redzone: f3
>  Stack after return:  f5
>  Stack use after scope:   f8
>  Global redzone:  f9
>  Global init order:   f6
>  Poisoned by user:f7
>  Container overflow:  fc
>  Array cookie:ac
>  Intra object redzone:bb
>  ASan internal:   fe
>  Left alloca re

Re: FYI: An example type of UBSAN failure during kyua test -k /usr/tests/Kyuafile

On 2022-Jan-7, at 04:31, Stefan Esser  wrote:

> Am 07.01.22 um 12:49 schrieb Mark Millard:
>> Having done a buildworld with both WITH_ASAN= and WITH_UBSAN=
>> after finding what to control to allow the build, I installed
>> it in a directory tree for chroot use and have
>> "kyua test -k /usr/tests/Kyuafile" running.
>> 
>> I see evidence of various examples of one type of undefined
>> behavior: "applying zero offset to null pointer"
>> 
>> # more 
>> /usr/obj/DESTDIRs/main-amd64-xSAN-chroot/tmp/kyua.FKD2vh/356/stderr.txt 
>> /usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero 
>> offset to null pointer
>> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
>> /usr/main-src/lib/libc/stdio/fread.c:133:10 in 
>> /usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero 
>> offset to null pointer
>> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
>> /usr/main-src/lib/libc/stdio/fread.c:133:10 in 
>> /usr/main-src/usr.bin/sed/process.c:715:18: runtime error: applying zero 
>> offset to null pointer
>> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
>> /usr/main-src/usr.bin/sed/process.c:715:18 in 
>> /usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero 
>> offset to null pointer
>> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
>> /usr/main-src/lib/libc/stdio/fread.c:133:10 in 
>> Fail: stderr not empty
>> --- /dev/null   2022-01-07 10:29:57.182903000 +
>> +++ /tmp/kyua.FKD2vh/356/work/check.Mk9llD/stderr   2022-01-07 
>> 10:29:57.17310 +
>> @@ -0,0 +1,2 @@
>> +/usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero 
>> offset to null pointer
>> +SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
>> /usr/main-src/lib/libc/stdio/fread.c:133:10 in 
>> Files left in work directory after failure: mntpt, mounterr
>> 
>> 
>> In general the lib/libc/stdio/fread.c:133:10 example seems to
>> be in a place that would make it fairly common.
> 
> Interesting find:
> 
>     while (resid > (r = fp->_r)) {
>         (void)memcpy((void *)p, (void *)fp->_p, (size_t)r);
>         fp->_p += r; /* line 133 */
>         /* fp->_r = 0 ... done in __srefill */
>         p += r;
>         resid -= r;
> 
> If fp->_p == NULL in line 133, then NULL has been passed as source address
> in memcpy() in the line above, and I'd think that is undefined behavior,
> even if a length of 0 is passed at the same time.

My copy of ISO/IEC 9899:2011 (E) only explicitly mentions such a
limitation for the memcpy_s variant.  It does say "[t]he memcpy
function returns the value of s1". The only mentioned "behavior
is undefined" is for copying between objects that overlap.

But there is more general wording in 7.24.1 (of 7.24, String
handling):

QUOTE
Where an argument declared as size_t n specifies the length
of the array for a function, n can have the value zero on a
call to that function. Unless explicitly stated otherwise in
the description of a particular function in this subclause,
pointer arguments on such a call still shall have valid values,
as described in 7.1.4. On such a call, . . . a function that
copies characters copies zero characters.
END QUOTE

But I've not noticed anything in 7.1.4 that is that explicit
about NULL arguments with zero sizes or that bans NULL
arguments in any generality.

In other words, I believe that the lack of a report for memcpy's
argument values is consistent with what ISO/IEC 9899:2011
is explicit about for such things.

I've not tried going through POSIX material or any other
potential standards.

> Maybe the code block quoted above (line 132 to 136) should be made wrapped
> into "if (r > 0) {}"?
> 

===
Mark Millard
marklmi at yahoo.com

Re: FYI: An example type of UBSAN failure during kyua test -k /usr/tests/Kyuafile

On 2022-Jan-7, at 05:08, Mark Millard  wrote:

> On 2022-Jan-7, at 03:49, Mark Millard  wrote:
> 
>> Having done a buildworld with both WITH_ASAN= and WITH_UBSAN=
>> after finding what to control to allow the build, I installed
>> it in a directory tree for chroot use and have
>> "kyua test -k /usr/tests/Kyuafile" running.
>> 
>> I see evidence of various examples of one type of undefined
>> behavior: "applying zero offset to null pointer"
>> 
>> # more 
>> /usr/obj/DESTDIRs/main-amd64-xSAN-chroot/tmp/kyua.FKD2vh/356/stderr.txt 
>> /usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero 
>> offset to null pointer
>> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
>> /usr/main-src/lib/libc/stdio/fread.c:133:10 in 
>> /usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero 
>> offset to null pointer
>> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
>> /usr/main-src/lib/libc/stdio/fread.c:133:10 in 
>> /usr/main-src/usr.bin/sed/process.c:715:18: runtime error: applying zero 
>> offset to null pointer
>> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
>> /usr/main-src/usr.bin/sed/process.c:715:18 in 
>> /usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero 
>> offset to null pointer
>> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
>> /usr/main-src/lib/libc/stdio/fread.c:133:10 in 
>> Fail: stderr not empty
>> --- /dev/null   2022-01-07 10:29:57.182903000 +
>> +++ /tmp/kyua.FKD2vh/356/work/check.Mk9llD/stderr   2022-01-07 
>> 10:29:57.17310 +
>> @@ -0,0 +1,2 @@
>> +/usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero 
>> offset to null pointer
>> +SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
>> /usr/main-src/lib/libc/stdio/fread.c:133:10 in 
>> Files left in work directory after failure: mntpt, mounterr
>> 
>> 
>> In general the lib/libc/stdio/fread.c:133:10 example seems to
>> be in a place that would make it fairly common.
>> 
>> usr.bin/sed/process.c:715:18 is more limited: just sed use.
>> 
> 
> kyua ran to completion. This note is focused on UBSAN reports.
> 
> By far the most common UBSAN report is for the
> lib/libc/stdio/fread.c:133:10 code.
> 
> Another somewhat common UBSAN report is:
> 
> Standard error:
> /usr/main-src/usr.bin/cut/cut.c:458:7: runtime error: addition of unsigned 
> offset to 0x6210010d overflowed to 0x6210010c
> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
> /usr/main-src/usr.bin/cut/cut.c:458:7 in 
> Fail: incorrect exit status: 1, expected: 0
> 
> 
> There is at least one example of:
> 
> Standard error:
> ld-elf.so.1: /lib/libthr.so.3: Undefined symbol 
> "__asan_option_detect_stack_use_after_return"
> 
> 
> Some more zero offsets to null are:
> 
> +/usr/main-src/bin/sh/jobs.c:590:35: runtime error: applying zero offset to 
> null pointer
> +SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
> /usr/main-src/bin/sh/jobs.c:590:35 in 
> +/usr/main-src/bin/sh/jobs.c:601:22: runtime error: applying zero offset to 
> null pointer
> +SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
> /usr/main-src/bin/sh/jobs.c:601:22 in 
> +/usr/main-src/contrib/xz/src/liblzma/common/common.c:292:16: runtime error: 
> applying zero offset to null pointer
> +SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
> /usr/main-src/contrib/xz/src/liblzma/common/common.c:292:16 in 
> 
> +/usr/main-src/usr.sbin/makefs/ffs.c:1053:35: runtime error: applying zero 
> offset to null pointer
> +SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
> /usr/main-src/usr.sbin/makefs/ffs.c:1053:35 in 
> Files left in work directory after failure: dir, ufs.img
> 
> 
> contrib/libxo/libxo/xo_buf.h has examples of non-zero offsets:
> 
> +/usr/main-src/contrib/libxo/libxo/xo_buf.h:116:22: runtime error: applying 
> non-zero offset 4 to null pointer
> +SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
> /usr/main-src/contrib/libxo/libxo/xo_buf.h:116:22 in 
> +/usr/main-src/contrib/libxo/libxo/xo_buf.h:116:44: runtime error: applying 
> zero offset to null pointer
> +SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
> /usr/main-src/contrib/libxo/libxo/xo_buf.h:116:44 in 
> +/usr/main-src/contrib/libxo/libxo/xo_buf.h:120:29: runtime error: applying 
> non-zero offset 4 to null pointer
> +SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
> /usr/main-src/contrib/libxo/libxo/xo_buf.h:120:29 in 
> 
> As does contrib/openzfs/module/nvpair/nvpair.c :
> 
>

Re: FYI: An example type of UBSAN failure during kyua test -k /usr/tests/Kyuafile

On 2022-Jan-7, at 03:49, Mark Millard  wrote:

> Having done a buildworld with both WITH_ASAN= and WITH_UBSAN=
> after finding what to control to allow the build, I installed
> it in a directory tree for chroot use and have
> "kyua test -k /usr/tests/Kyuafile" running.
> 
> I see evidence of various examples of one type of undefined
> behavior: "applying zero offset to null pointer"
> 
> # more 
> /usr/obj/DESTDIRs/main-amd64-xSAN-chroot/tmp/kyua.FKD2vh/356/stderr.txt 
> /usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero 
> offset to null pointer
> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
> /usr/main-src/lib/libc/stdio/fread.c:133:10 in 
> /usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero 
> offset to null pointer
> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
> /usr/main-src/lib/libc/stdio/fread.c:133:10 in 
> /usr/main-src/usr.bin/sed/process.c:715:18: runtime error: applying zero 
> offset to null pointer
> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
> /usr/main-src/usr.bin/sed/process.c:715:18 in 
> /usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero 
> offset to null pointer
> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
> /usr/main-src/lib/libc/stdio/fread.c:133:10 in 
> Fail: stderr not empty
> --- /dev/null   2022-01-07 10:29:57.182903000 +
> +++ /tmp/kyua.FKD2vh/356/work/check.Mk9llD/stderr   2022-01-07 
> 10:29:57.17310 +
> @@ -0,0 +1,2 @@
> +/usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero 
> offset to null pointer
> +SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
> /usr/main-src/lib/libc/stdio/fread.c:133:10 in 
> Files left in work directory after failure: mntpt, mounterr
> 
> 
> In general the lib/libc/stdio/fread.c:133:10 example seems to
> be in a place that would make it fairly common.
> 
> usr.bin/sed/process.c:715:18 is more limited: just sed use.
> 

kyua ran to completion. This note is focused on UBSAN reports.

By far the most common UBSAN report is for the
lib/libc/stdio/fread.c:133:10 code.

Another somewhat common UBSAN report is:

Standard error:
/usr/main-src/usr.bin/cut/cut.c:458:7: runtime error: addition of unsigned 
offset to 0x6210010d overflowed to 0x6210010c
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
/usr/main-src/usr.bin/cut/cut.c:458:7 in 
Fail: incorrect exit status: 1, expected: 0


There is at least one example of:

Standard error:
ld-elf.so.1: /lib/libthr.so.3: Undefined symbol 
"__asan_option_detect_stack_use_after_return"


Some more zero offsets to null are:

+/usr/main-src/bin/sh/jobs.c:590:35: runtime error: applying zero offset to 
null pointer
+SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
/usr/main-src/bin/sh/jobs.c:590:35 in 
+/usr/main-src/bin/sh/jobs.c:601:22: runtime error: applying zero offset to 
null pointer
+SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
/usr/main-src/bin/sh/jobs.c:601:22 in 
+/usr/main-src/contrib/xz/src/liblzma/common/common.c:292:16: runtime error: 
applying zero offset to null pointer
+SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
/usr/main-src/contrib/xz/src/liblzma/common/common.c:292:16 in 

+/usr/main-src/usr.sbin/makefs/ffs.c:1053:35: runtime error: applying zero 
offset to null pointer
+SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
/usr/main-src/usr.sbin/makefs/ffs.c:1053:35 in 
Files left in work directory after failure: dir, ufs.img


contrib/libxo/libxo/xo_buf.h has examples of non-zero offsets:

+/usr/main-src/contrib/libxo/libxo/xo_buf.h:116:22: runtime error: applying 
non-zero offset 4 to null pointer
+SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
/usr/main-src/contrib/libxo/libxo/xo_buf.h:116:22 in 
+/usr/main-src/contrib/libxo/libxo/xo_buf.h:116:44: runtime error: applying 
zero offset to null pointer
+SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
/usr/main-src/contrib/libxo/libxo/xo_buf.h:116:44 in 
+/usr/main-src/contrib/libxo/libxo/xo_buf.h:120:29: runtime error: applying 
non-zero offset 4 to null pointer
+SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
/usr/main-src/contrib/libxo/libxo/xo_buf.h:120:29 in 

As does contrib/openzfs/module/nvpair/nvpair.c :

/usr/main-src/sys/contrib/openzfs/module/nvpair/nvpair.c:3129:49: runtime 
error: applying non-zero offset 4 to null pointer
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
/usr/main-src/sys/contrib/openzfs/module/nvpair/nvpair.c:3129:49 in 


There is a:

+/usr/main-src/bin/sh/arith_yacc.c:193:10: runtime error: negation of 
-9223372036854775808 cannot be represented in type 'arith_t' (aka 'long'); cast 
to an unsigned type to negate this value to itself
+SUMMARY: UndefinedBehaviorSanitizer: undefined-behavi

Re: FYI: An example type of UBSAN failure during kyua test -k /usr/tests/Kyuafile

2022-01-07 Thread Stefan Esser

Am 07.01.22 um 12:49 schrieb Mark Millard:
> Having done a buildworld with both WITH_ASAN= and WITH_UBSAN=
> after finding what to control to allow the build, I installed
> it in a directory tree for chroot use and have
> "kyua test -k /usr/tests/Kyuafile" running.
> 
> I see evidence of various examples of one type of undefined
> behavior: "applying zero offset to null pointer"
> 
> # more 
> /usr/obj/DESTDIRs/main-amd64-xSAN-chroot/tmp/kyua.FKD2vh/356/stderr.txt 
> /usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero 
> offset to null pointer
> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
> /usr/main-src/lib/libc/stdio/fread.c:133:10 in 
> /usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero 
> offset to null pointer
> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
> /usr/main-src/lib/libc/stdio/fread.c:133:10 in 
> /usr/main-src/usr.bin/sed/process.c:715:18: runtime error: applying zero 
> offset to null pointer
> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
> /usr/main-src/usr.bin/sed/process.c:715:18 in 
> /usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero 
> offset to null pointer
> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
> /usr/main-src/lib/libc/stdio/fread.c:133:10 in 
> Fail: stderr not empty
> --- /dev/null   2022-01-07 10:29:57.182903000 +
> +++ /tmp/kyua.FKD2vh/356/work/check.Mk9llD/stderr   2022-01-07 
> 10:29:57.17310 +
> @@ -0,0 +1,2 @@
> +/usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero 
> offset to null pointer
> +SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
> /usr/main-src/lib/libc/stdio/fread.c:133:10 in 
> Files left in work directory after failure: mntpt, mounterr
> 
> 
> In general the lib/libc/stdio/fread.c:133:10 example seems to
> be in a place that would make it fairly common.

Interesting find:

	while (resid > (r = fp->_r)) {
		(void)memcpy((void *)p, (void *)fp->_p, (size_t)r);
		fp->_p += r; /* line 133 */
		/* fp->_r = 0 ... done in __srefill */
		p += r;
		resid -= r;

If fp->_p == NULL in line 133, then NULL has been passed as source address
in memcpy() in the line above, and I'd think that is undefined behavior,
even if a length of 0 is passed at the same time.

Maybe the code block quoted above (lines 132 to 136) should be wrapped
in "if (r > 0) {}"?
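
A minimal sketch of what such a guard could look like, outside of libc. Everything here is an assumption for illustration: struct src_buf and drain_buffered are hypothetical stand-ins for the two FILE fields and the loop body quoted above, and unlike fread.c (which leaves resetting fp->_r to __srefill) the sketch resets the count itself so that it is self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the two FILE fields used by the quoted loop. */
struct src_buf {
	const unsigned char *ptr;	/* role of fp->_p; may be NULL */
	int avail;			/* role of fp->_r */
};

/*
 * Move everything currently buffered in sb to dst and return the number
 * of bytes moved.  When nothing is buffered, neither memcpy() nor the
 * pointer arithmetic is reached, so a NULL sb->ptr is never an operand.
 */
size_t
drain_buffered(struct src_buf *sb, unsigned char *dst)
{
	size_t n;

	assert(sb != NULL && sb->avail >= 0);
	n = (size_t)sb->avail;
	if (n > 0) {
		assert(sb->ptr != NULL && dst != NULL);
		memcpy(dst, sb->ptr, n);
		sb->ptr += n;
		sb->avail = 0;	/* the real code leaves this to __srefill */
	}
	return (n);
}
```

With an empty buffer (ptr == NULL, avail == 0) the function returns 0 and touches nothing, which is the case the "applying zero offset to null pointer" report is about.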

Regards, STefan



FYI: An example type of UBSAN failure during kyua test -k /usr/tests/Kyuafile

Having done a buildworld with both WITH_ASAN= and WITH_UBSAN=
after finding what to control to allow the build, I installed
it in a directory tree for chroot use and have
"kyua test -k /usr/tests/Kyuafile" running.

I see evidence of various examples of one type of undefined
behavior: "applying zero offset to null pointer"

# more /usr/obj/DESTDIRs/main-amd64-xSAN-chroot/tmp/kyua.FKD2vh/356/stderr.txt 
/usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero 
offset to null pointer
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
/usr/main-src/lib/libc/stdio/fread.c:133:10 in 
/usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero 
offset to null pointer
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
/usr/main-src/lib/libc/stdio/fread.c:133:10 in 
/usr/main-src/usr.bin/sed/process.c:715:18: runtime error: applying zero offset 
to null pointer
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
/usr/main-src/usr.bin/sed/process.c:715:18 in 
/usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero 
offset to null pointer
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
/usr/main-src/lib/libc/stdio/fread.c:133:10 in 
Fail: stderr not empty
--- /dev/null   2022-01-07 10:29:57.182903000 +
+++ /tmp/kyua.FKD2vh/356/work/check.Mk9llD/stderr   2022-01-07 
10:29:57.17310 +
@@ -0,0 +1,2 @@
+/usr/main-src/lib/libc/stdio/fread.c:133:10: runtime error: applying zero 
offset to null pointer
+SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
/usr/main-src/lib/libc/stdio/fread.c:133:10 in 
Files left in work directory after failure: mntpt, mounterr


In general the lib/libc/stdio/fread.c:133:10 example seems to
be in a place that would make it fairly common.

usr.bin/sed/process.c:715:18 is more limited: just sed use.

===
Mark Millard
marklmi at yahoo.com

FYI: An example ASAN failure report during kyua test -k /usr/tests/Kyuafile