Re: bsd.rd hangs on boot; bsd.mp works

2020-12-28 Thread James Cook
On Mon, Dec 28, 2020 at 06:09:57PM +1100, Jonathan Gray wrote:
> On Mon, Dec 28, 2020 at 04:31:09AM +, James Cook wrote:
> > On Mon, Dec 28, 2020 at 01:19:19PM +1100, Jonathan Gray wrote:
> > > On Sun, Dec 27, 2020 at 07:48:58PM +, James Cook wrote:
> > > > > from your earlier bsd.mp dmesg:
> > > > > 
> > > > > cpu0: Enhanced SpeedStep 16268 MHz: speeds: 1601, 1600, 1500, 1400, 
> > > > > 1300, 1200, 1100, 1000, 900, 800, 700, 600, 500, 400 MHz
> > > > > 
> > > > > 1601 is variable/turbo speed the others are fixed.  When running in
> > > > > turbo mode getting the current frequency involves msrs which
> > > > > hw.cpuspeed doesn't do. The initial 'x MHz:' is from cpuspeed not from
> > > > > the acpi table.
> > > > > 
> > > > > On modern machines these values are from acpicpu(4)/acpi _PSS.
> > > > > 
> > > > > Can you show the output of running with the following diff to dump the
> > > > > performance counter control values?
> > > > > 
> > > > > MSR_PERF_FIXED_CTR_CTRL 0x38d 
> > > > > MSR_PERF_GLOBAL_CTRL 0x38f
> > > > > 
> > > > > on a broadwell laptop this shows
> > > > > 
> > > > > cpu0 cpu_freq_ctr cpuid 0x0a eax 0x7300403 ebx 0x0 ecx 0x0 edx 0x603
> > > > > cpu0 cpu_freq_ctr perf ver 3 gp ctrs 4 fixed 3
> > > > > cpu0 cpu_freq_ctr MSR_PERF_FIXED_CTR_CTRL 0x0
> > > > > cpu0 cpu_freq_ctr MSR_PERF_GLOBAL_CTRL 0xf
> > > > > cpu1 cpu_freq_ctr cpuid 0x0a eax 0x7300403 ebx 0x0 ecx 0x0 edx 0x603
> > > > > cpu1 cpu_freq_ctr perf ver 3 gp ctrs 4 fixed 3
> > > > > cpu1 cpu_freq_ctr MSR_PERF_FIXED_CTR_CTRL 0x0
> > > > > cpu1 cpu_freq_ctr MSR_PERF_GLOBAL_CTRL 0xf
> > > > > cpu2 cpu_freq_ctr cpuid 0x0a eax 0x7300403 ebx 0x0 ecx 0x0 edx 0x603
> > > > > cpu2 cpu_freq_ctr perf ver 3 gp ctrs 4 fixed 3
> > > > > cpu2 cpu_freq_ctr MSR_PERF_FIXED_CTR_CTRL 0x0
> > > > > cpu2 cpu_freq_ctr MSR_PERF_GLOBAL_CTRL 0xf
> > > > > cpu3 cpu_freq_ctr cpuid 0x0a eax 0x7300403 ebx 0x0 ecx 0x0 edx 0x603
> > > > > cpu3 cpu_freq_ctr perf ver 3 gp ctrs 4 fixed 3
> > > > > cpu3 cpu_freq_ctr MSR_PERF_FIXED_CTR_CTRL 0x0
> > > > > cpu3 cpu_freq_ctr MSR_PERF_GLOBAL_CTRL 0xf
> > > > 
> > > > Here's dmesg with that patch.
> > > 
> > > Thanks, here is another diff to try.  It sets the bit to be enabled
> > > not just for ring 0, fixes clearing a value and shows the values
> > > read out of the counter.
> > > 
> > > Index: sys/arch/amd64/amd64/identcpu.c
> > > ===
> > > RCS file: /cvs/src/sys/arch/amd64/amd64/identcpu.c,v
> > > retrieving revision 1.117
> > > diff -u -p -r1.117 identcpu.c
> > > --- sys/arch/amd64/amd64/identcpu.c   13 Sep 2020 05:57:28 -  
> > > 1.117
> > > +++ sys/arch/amd64/amd64/identcpu.c   28 Dec 2020 02:03:27 -
> > > @@ -411,6 +411,9 @@ via_update_sensor(void *args)
> > >  }
> > >  #endif
> > >  
> > > +#define MSR_PERF_GLOBAL_STATUS 0x38e
> > > +#define MSR_PERF_GLOBAL_INUSE 0x392
> > > +
> > >  uint64_t
> > >  cpu_freq_ctr(struct cpu_info *ci)
> > >  {
> > > @@ -421,13 +424,22 @@ cpu_freq_ctr(struct cpu_info *ci)
> > >   CPUIDEDX_NUM_FC(cpu_perf_edx) <= 1)
> > >   return (0);
> > >  
> > > + if ((cpu_perf_eax & CPUIDEAX_VERID) > 3) {
> > > + msr = rdmsr(MSR_PERF_GLOBAL_INUSE);
> > > + printf("%s %s MSR_PERF_GLOBAL_INUSE 0x%llx\n", 
> > > ci->ci_dev->dv_xname,
> > > + __func__, msr);
> > > + }
> > > + msr = rdmsr(MSR_PERF_GLOBAL_STATUS);
> > > + printf("%s %s MSR_PERF_GLOBAL_STATUS before 0x%llx\n", 
> > > ci->ci_dev->dv_xname,
> > > + __func__, msr);
> > > +
> > >   msr = rdmsr(MSR_PERF_FIXED_CTR_CTRL);
> > >   if (msr & MSR_PERF_FIXED_CTR_FC(1, MSR_PERF_FIXED_CTR_FC_MASK)) {
> > >   /* some hypervisor is dicking us around */
> > >   return (0);
> > >   }
> > >  
> > > - msr |= MSR_PERF_FIXED_CTR_FC(1, MSR_PERF_FIXED_CTR_FC_1);
> > > + msr |= MSR_PERF_FIXED_CTR_FC(1, MSR_PERF_FIXED_CTR_FC_ANY);
> > >   wrmsr(MSR_PERF_FIXED_CTR_CTRL, msr);
> > >  
> > >   msr = rdmsr(MSR_PERF_GLOBAL_CTRL) | MSR_PERF_GLOBAL_CTR1_EN;
> > > @@ -437,13 +449,20 @@ cpu_freq_ctr(struct cpu_info *ci)
> > >   delay(10);
> > >   count = rdmsr(MSR_PERF_FIXED_CTR1);
> > >  
> > > + msr = rdmsr(MSR_PERF_GLOBAL_STATUS);
> > > + printf("%s %s MSR_PERF_GLOBAL_STATUS after 0x%llx\n", 
> > > ci->ci_dev->dv_xname,
> > > + __func__, msr);
> > > +
> > >   msr = rdmsr(MSR_PERF_FIXED_CTR_CTRL);
> > > - msr &= MSR_PERF_FIXED_CTR_FC(1, MSR_PERF_FIXED_CTR_FC_MASK);
> > > + msr &= ~MSR_PERF_FIXED_CTR_FC(1, MSR_PERF_FIXED_CTR_FC_MASK);
> > >   wrmsr(MSR_PERF_FIXED_CTR_CTRL, msr);
> > >  
> > >   msr = rdmsr(MSR_PERF_GLOBAL_CTRL);
> > >   msr &= ~MSR_PERF_GLOBAL_CTR1_EN;
> > >   wrmsr(MSR_PERF_GLOBAL_CTRL, msr);
> > > +
> > > + printf("%s %s count %lld last_count %lld freq %lld\n", 
> > > ci->ci_dev->dv_xname,
> > > + __func__, count, last_count, ((count - last_count) * 10));
> > >  
> > >   return ((count - last_count) * 10);
> > >  }
> > 
> > Here's the end of /var/log/messages, recording a boot with the 

Re: Unable to use Canon MG3250 USB scanner more than once

2020-12-28 Thread Mikolaj Kucharski
On Mon, Dec 28, 2020 at 08:55:33PM +, miko...@kucharski.name wrote:
> 
> I've started adding printfs to xhci_device_generic_transfer() but that
> slowed code path to the point that problem disappeared and scanner
> started to work reliably, but (obviously) scan got really slow, from
> couple of seconds to about 4 minutes.
> 

..and printfs to xhci_device_generic_start()

-- 
Regards,
 Mikolaj



Unable to use Canon MG3250 USB scanner more than once

2020-12-28 Thread mikolaj
>Synopsis:  After first successful scan on Canon MG3250 scanner, all next 
>ugenread()s timeout
>Category:  kernel
>Environment:
System  : OpenBSD 6.8
Details : OpenBSD 6.8-current (GENERIC.MP) #246: Sun Dec 27 
18:17:02 MST 2020
 
dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP

Architecture: OpenBSD.amd64
Machine : amd64
>Description:
When I use scanner first time via scanimage:

scanimage --device-name pixma:04A91762_860FE4 \
--format png --mode gray \
--resolution 300 --output-file scan.png

above command works and PNG file is properly saved with scanned
image. When I want to scan again command fails:

scanimage: sane_read: Operation was canceled

When increasing libusb debug logs example iteration of failed attempt
to scan looks as follows:

[ 0.584877] [0003ef68] libusb: debug [libusb_alloc_transfer] transfer 
0xea40237d150
[ 0.584902] [0003ef68] libusb: debug [libusb_submit_transfer] transfer 
0xea40237d150
[ 0.584921] [0003ef68] libusb: debug [obsd_submit_transfer] 
[ 0.584979] [0003ef68] libusb: debug [_access_endpoint] endpoint 8 mode 0
[ 1.579537] [0003ef68] libusb: debug [_errno_to_libusb] error: Operation timed 
out (60)
[ 1.579601] [0003ef68] libusb: debug [libusb_free_transfer] transfer 
0xea40237d150
[ 1.579696] [0003ef68] libusb: debug [libusb_clear_halt] endpoint 88
[ 1.579862] [0003ef68] libusb: debug [obsd_clear_halt] 

On the kernel side my debug printfs look as follow:

2020-12-27T12:29:39.939Z pce-0067 /bsd: MMM>> dv_xname=ugen0 endpt=8 flag=0 
[ugenread()|ugen.c|644]
2020-12-27T12:29:39.940Z pce-0067 /bsd: MMM dv_xname=ugen0 endpt=8 
[ugen_do_read()|ugen.c|484]
2020-12-27T12:29:39.940Z pce-0067 /bsd: MMM dv_xname=ugen0 endpt=8 xfertype=2 
[ugen_do_read()|ugen.c|503]
2020-12-27T12:29:39.940Z pce-0067 /bsd: MMM start UE_BULK 
[ugen_do_read()|ugen.c|549]
2020-12-27T12:29:42.997Z pce-0067 /bsd: MMM ! polling=0 flags=16 err=0 
xfer->flags=6 xfer->status=15 xfer->done=1 [usbd_transfer()|usbdi.c|375]
2020-12-27T12:29:42.998Z pce-0067 /bsd: MMM final polling=0 xfer->flags=6 
xfer->status=15 [usbd_transfer()|usbdi.c|389]
2020-12-27T12:29:42.998Z pce-0067 /bsd: MMM dv_xname=ugen0 endpt=8 err=15 
n=1024 timeout=3000 flags=6 [ugen_do_read()|ugen.c|565]
2020-12-27T12:29:42.998Z pce-0067 /bsd: MMM end UE_BULK 
[ugen_do_read()|ugen.c|582]
2020-12-27T12:29:42.998Z pce-0067 /bsd: MMM<< dv_xname=ugen0 endpt=8 error=60 
[ugenread()|ugen.c|647]

It maybe hard to see from above but the core of it is that in usbdi.c
file, in function usbd_transfer():

 358} else {
 359while (!xfer->done) {
 360flags = PRIBIO|(xfer->flags & USBD_CATCH ? 
PCATCH : 0);
 361
 362err = tsleep_nsec(xfer, flags, "usbsyn", 
INFSLP);
 363if (err && !xfer->done) {
 364usbd_abort_pipe(pipe);
 365if (err == EINTR)
 366xfer->status = USBD_INTERRUPTED;
 367else
 368xfer->status = USBD_TIMEOUT;
 369}
 370}
 371}

above function tsleep_nsec() on line 362 times out and xfer->status is set
to USBD_TIMEOUT on line 368. When tsleep_nsec() times out,
xhci_timeout() and xhci_timeout_task() from xhci.c are executed.

I've started adding printfs to xhci_device_generic_transfer() but that
slowed code path to the point that problem disappeared and scanner
started to work reliably, but (obviously) scan got really slow, from
couple of seconds to about 4 minutes.

This is where I got stuck as I got out of ideas how to drill this down
further in efficient manner. Decided to look on the mailing list and
found:

 - https://marc.info/?t=15815036211=1=2
 - https://marc.info/?t=15872902191=1=2

In above email threads I found patch quoted below in the Fix section.

>How-To-Repeat:
- boot OpenBSD
- power on scanner
- use scanimage to scan a picture
- when first scan succeeds
- any another scan fails with tsleep_nsec() on line 362 from usbdi.c 
function usbd_transfer() is timing out
- unplugging from USB scanner is needed to be able to scan again
- or power cycle scanner
- or power cycle OpenBSD
- basically ugen(4) detach / attach is needed

>Fix:
Per email threads on tech@ from above marc.info links below is
not proper solution, but a workaround:

--8<--
Index: dev/usb/ugen.c
===
RCS file: /cvs/src/sys/dev/usb/ugen.c,v
retrieving revision 1.109
diff -u -p -u -r1.109 ugen.c
--- dev/usb/ugen.c  25 Dec 2020 12:59:52 -  1.109
+++ dev/usb/ugen.c  27 Dec 2020 22:38:59 -
@@ -441,6 +442,7 @@ 

Re: Memory leak with getaddrinfo()

2020-12-28 Thread Otto Moerbeek
On Mon, Dec 28, 2020 at 08:54:46PM +0100, Theo Buehler wrote:

> On Sun, Dec 27, 2020 at 08:44:52PM +0100, Otto Moerbeek wrote:
> > Hi,
> > 
> > So here the diff that just fixes the mem leak on thread exit. It does
> > not contains the TLS init part, I'd like to do that differently than
> > what I did in the version I posted earlier.
> > 
> > As stated before, this makes the getaddrino(3) test program run in
> > constant memory.
> > 
> > ok? 
> 
> Please wrap this line at ,void:
> 
> > -_thread_tag_storage(void **tag, void *storage, size_t sz, void *err)
> > +_thread_tag_storage(void **tag, void *storage, size_t sz, void (*dt)(void 
> > *),void *err)
> 
> This diff made firefox 84.0.1 crash frequently on my main laptop (about
> once every 10-15 minutes) while iridium was still stable and I could not
> observe any other fallout.
> 
> The backtrace always looked like this:

OK, the two problems cannot be solved independently. I'll come up with
a new diff 

-Otto

> 
> #0  thrkill () at /tmp/-:3
> #1  0x00ae9eab90ee in nsProfileLock::FatalSignalHandler (signo=11, 
> info=0xae12f05420, context=0xae12f05330)
> at 
> /usr/obj/ports/firefox-84.0.1/firefox-84.0.1/toolkit/profile/nsProfileLock.cpp:168
> #2  0x00ae9f5aada7 in WasmTrapHandler (signum=11, info=0xae12f05420, 
> context=0xae12f05330) at 
> /usr/obj/ports/firefox-84.0.1/firefox-84.0.1/js/src/wasm/WasmSignalHandlers.cpp:980
> #3  
> #4  _asr_ctx_unref (ac=0xdfdfdfdfdfdfdfdf) at /usr/src/lib/libc/asr/asr.c:401
> #5  _asr_resolver_done_tp (arg=0xadc8b2eec0) at 
> /usr/src/lib/libc/asr/asr.c:141
> #6  0x00ae5f6d0025 in _rthread_tls_destructors (thread=0xae81077440) at 
> /usr/src/lib/libc/thread/rthread_tls.c:182
> #7  0x00ae5f65ec13 in _libc_pthread_exit (retval=) at 
> /usr/src/lib/libc/thread/rthread.c:150
> #8  0x00ae39f038d9 in _rthread_start (v=) at 
> /usr/src/lib/librthread/rthread.c:97
> #9  0x00ae5f6e1c8a in __tfork_thread () at 
> /usr/src/lib/libc/arch/amd64/sys/tfork_thread.S:84
> (gdb)
> 



Re: Memory leak with getaddrinfo()

2020-12-28 Thread Theo Buehler
On Sun, Dec 27, 2020 at 08:44:52PM +0100, Otto Moerbeek wrote:
> Hi,
> 
> So here the diff that just fixes the mem leak on thread exit. It does
> not contains the TLS init part, I'd like to do that differently than
> what I did in the version I posted earlier.
> 
> As stated before, this makes the getaddrino(3) test program run in
> constant memory.
> 
> ok? 

Please wrap this line at ,void:

> -_thread_tag_storage(void **tag, void *storage, size_t sz, void *err)
> +_thread_tag_storage(void **tag, void *storage, size_t sz, void (*dt)(void 
> *),void *err)

This diff made firefox 84.0.1 crash frequently on my main laptop (about
once every 10-15 minutes) while iridium was still stable and I could not
observe any other fallout.

The backtrace always looked like this:

#0  thrkill () at /tmp/-:3
#1  0x00ae9eab90ee in nsProfileLock::FatalSignalHandler (signo=11, 
info=0xae12f05420, context=0xae12f05330)
at 
/usr/obj/ports/firefox-84.0.1/firefox-84.0.1/toolkit/profile/nsProfileLock.cpp:168
#2  0x00ae9f5aada7 in WasmTrapHandler (signum=11, info=0xae12f05420, 
context=0xae12f05330) at 
/usr/obj/ports/firefox-84.0.1/firefox-84.0.1/js/src/wasm/WasmSignalHandlers.cpp:980
#3  
#4  _asr_ctx_unref (ac=0xdfdfdfdfdfdfdfdf) at /usr/src/lib/libc/asr/asr.c:401
#5  _asr_resolver_done_tp (arg=0xadc8b2eec0) at /usr/src/lib/libc/asr/asr.c:141
#6  0x00ae5f6d0025 in _rthread_tls_destructors (thread=0xae81077440) at 
/usr/src/lib/libc/thread/rthread_tls.c:182
#7  0x00ae5f65ec13 in _libc_pthread_exit (retval=) at 
/usr/src/lib/libc/thread/rthread.c:150
#8  0x00ae39f038d9 in _rthread_start (v=) at 
/usr/src/lib/librthread/rthread.c:97
#9  0x00ae5f6e1c8a in __tfork_thread () at 
/usr/src/lib/libc/arch/amd64/sys/tfork_thread.S:84
(gdb)



Re: X11 blanking problem related to page flipping?

2020-12-28 Thread Laurence Tratt
On Sat, Nov 21, 2020 at 10:16:23AM +0100, Antoine Jacoutot wrote:

>> I've been having an intermittent problem for some months with my X11
>> screen blanking (i.e. going totally black -- however, the mouse still
>> displays and changes shape depending on what it's hovering over!).
>> Switching to Ctrl+Alt+F1 and back fixes it, though once in a while causes
>> X11 to crash. Sometimes this doesn't happen for a few days, at which point
>> I think I've fixed it. Sometimes it happens several times in 10 minutes.
>> There is no obvious trigger for it: I might have a web browser loaded or
>> not, or ffmpeg running or not etc etc.
> I have been seeing the same. I can usually reproduce it by having lots of
> chromium tabs with heavy web sites opened (like clicking on 5 or 6 links
> from google news).

A follow-up to this, since the problem still occurs. I've been playing around
with not running Chrome or Iridium, and I think Antoine has put his finger on
the nub of the problem: if I don't use Chrome or Iridium, the problem never
manifests. At some point after using them, I then have intermittent blanking,
which sometimes needs a console switch to fix, and perhaps 1/3 of the time a
console fix kills X. I don't know what Chrome or Iridium are doing that
triggers this problem, but nothing else I run seems to trigger it.


Laurie