Re: openssl3+postfix issue (ca md too weak)

2023-11-13 Thread Taylor R Campbell
[trimming tech-crypto from cc because this is a policy and
configuration issue, not a cryptography issue]

> Date: Mon, 13 Nov 2023 20:34:04 +0100
> From: Manuel Bouyer 
> 
> I'm facing an issue with postfix+openssl3 which may be critical (depending
> on how it can be fixed).
> 
> Now my postfix setup fails to send mails with
> Nov 13 20:20:53 comore postfix/smtp[6449]: warning: TLS library problem: 
> error:0A00018E:SSL routines::ca md too 
> weak:/usr/src/crypto/external/bsd/openssl/dist/ssl/statem/statem_lib.c:984:

1. This says `warning'; does the mail actually fail to go through, or
   are you just alarmed by the warning?

2. Can you describe your mail topology?

3. Can you describe the postfix configuration on every node involved
   in the topology?

  diff -u <(postconf -d) <(postconf)

   (not sure if there's an easier way to show the non-default
   settings; if there is, feel free to use that instead)

4. Can you share master.cf on every node involved if it's not the
   default?

5. If you connect to the server with `openssl s_client', what happens?
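   For example, something like this, with the real destination host in
   place of mx.example.net:

	openssl s_client -connect mx.example.net:25 -starttls smtp

   and look at the certificate chain and the `Verify return code' line
   it prints.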

> So, as far as I understand, we end up with a postfix installation which
> can't talk to servers with valid certificates.

Unless anything has changed in the past couple years, I don't think
there is any widespread deployment of SMTP TLS server authentication
that means anything for general MTAs -- at best, TLS in SMTP serves as
opportunistic encryption to defend against passive eavesdroppers.

So I assume you must be talking about your own personal SMTP relay, or
your own personal submission endpoint, or something like that, meaning
this is not a general problem with sending mail on the internet.

(If a sender demanded that any receiving MTA show a valid certificate,
it would be a very lonely sender in the real world today -- there's no
mechanism (that I know of) for a sender to know which receiving MTAs
are expected to answer TLS with valid certificates, like writing http
vs https in the URL bar of a browser.)


Re: Call for testing: certctl, postinstall, TLS trust anchors

2023-10-11 Thread Taylor R Campbell
Correcting a small error in the previous message:


> Date: Wed, 11 Oct 2023 18:47:02 +
> From: Taylor R Campbell 
> 
> Note: The formal PKIX language has a way for a CA certificate to
> express that the CA it represents is authorized to sign certificates
> for TLS server authentication.

Actually, it can't even express that, as far as I know.

The certificate can say it is authorized to sign certificates (basic
constraints: CA=TRUE, extended key usage: cert sign), or it is
authorized to authenticate TLS servers (extended key usage: server
auth).  But it can't say it is authorized to sign certificates only
for entities authorized to authenticate TLS servers.

That is, it can't be _restricted_ from doing that in the X.509
language, so _any_ CA can always sign certificates for _any_ purpose.
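(For reference, you can see what a given CA certificate formally
asserts with something like the following -- the path is a placeholder
for whichever trust anchor you're curious about:

	openssl x509 -noout -text -in /etc/openssl/certs/SomeCA.pem |
	    grep -A1 -E 'Basic Constraints|Key Usage'

which will typically show `CA:TRUE' under Basic Constraints, and may
or may not show an Extended Key Usage section at all.)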


Re: Call for testing: certctl, postinstall, TLS trust anchors

2023-10-11 Thread Taylor R Campbell
s so that existing
  applications get Mozilla's trusted TLS CAs out of the box for
  modern TLS validation.  This is the critical thing that matters
  most.

   => There are lots of grey areas and edge cases that will never
  affect 99% of users and can't be resolved without additional
  application engineering and possibly changes to the OpenSSL API,
  like restricting Tubitak Kamu SM to *.tr domains or moving the
  expiration date sooner.  I punted on those by failing closed and
  treating them as untrusted.

Note: The formal PKIX language has a way for a CA certificate to
express that the CA it represents is authorized to sign certificates
for TLS server authentication.  But it has no way for an application
(or operating system installation) to restrict a CA or its
subsidiaries from doing that.  That's additional information provided
by, e.g., Mozilla certdata.txt annotations, or by the separation of
select trust anchors into /etc/openssl/certs (vs, say,
/etc/smime/certs), which exists only outside the PKIX certificate
language.


Some other point-by-point replies below:


> Date: Mon, 09 Oct 2023 21:04:06 -0400
> From: Greg Troxel 
> 
> Taylor R Campbell  writes:
> 
> > I'm prioritizing effort on TLS, but I _installed_ the email
> > certificates as pem files under /usr/share so that they're available
> > in case someone wants to do something with them like declare
> > /etc/smime/certs as the place to find the trust anchors and configure
> > them with certctl(8) using a different config file.
> 
> This is the part I don't get, but I need to look at pkix, to see if
> there is standards support for this "different set of trust anchors
> depending on flavor".

It would be absolutely bonkers for all applications to be required to
use a single global set of trust anchors for all purposes.

I can't imagine that even the most ardent PKIX ideologues would
suggest that all applications should share the same set of trust
anchors for all purposes and let the CAs and their subsidiaries
themselves dictate what they're authorized to do.

For example, if we were using X.509 certificates and CMS signatures on
automated binary updates for NetBSD, this would be a non-starter.
We're not going to have an automated system where we'll accept a
signature from a public key with a certificate issued by just any
public CA that formally self-asserts code-signing authority.  Instead,
we'd just have trust anchors _for that purpose_ in /etc/nbsdupdate or
whatever.

> so one has to parse DISTRUST_AFTER at build time, and the contents will
> change later?  Or is our approach to say "if it's DISTRUST_AFTER at all,
> just say no"?  Or something else?

I applied the rule `if it's DISTRUST_AFTER at all, just say no',
because there is no way to transmit that information to applications
via the OpenSSL API.

> The use of "cache" is puzzling, as this is creating the directory that
> openssl uses, and there is no mechanism to fault in things.  It feels
> more like "certctl reads source paths and state from untrust, and
> prepares a directory of trust anchors".  But maybe it's just me, and my
> words confuse more than one person!

I'm drawing a distinction between precious configuration
(/etc/openssl/certs.conf, /etc/openssl/untrusted) and disposable cache
(/etc/openssl/certs).

The cache isn't lazily filled -- it has to be there for applications
to use it, and it's privileged in /etc so it doesn't make sense for
unprivileged applications like ftp(1) to fill it on-demand.

But there is nothing precious in it: you can blow it away and rebuild
it and nothing will be lost.

In these respects it is like the fontconfig cache.
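For example, it should always be safe to blow it away and regenerate
it with the rehash subcommand described in certctl(8):

	rm -f /etc/openssl/certs/*
	certctl rehash

which rebuilds the directory from certs.conf and the untrusted list.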

> >> What's not obvious and is part of the interface is that if I untrust a
> >> cert (e.g. because I don't believe that the CA is sound), then when that
> >> CA cert is updated in the mozilla set, will be untrust persist, or not?
> >> I am guessing not, and this is hard to fix but I think it's important to
> >> be clear on "what does it mean to untrust a cert", because people may
> >> think it does what they want instead of more narrowly what it says.
> >
> > If you do `certctl untrust TrustCor_ECA-1.pem', then it will remain
> > untrusted on update.
> 
> And if TrustCor_ECA-3.pem is added (because in this example mozilla
> thinks they are ok and I don't) as part of normal
> new-root-cert-every-10-years, then that will get picked up.  So it is
> really removing a particular named cert, not removing a CA.

If you can suggest a mechanism to `remove a CA' that works more
reliably, I'm all ears.

The obvious things to store are:

- CA name according to certdata.txt (which is uniquified as a file
  name),
- subject name or hash,
- public key or hash.

None of these uniquely

Re: Call for testing: certctl, postinstall, TLS trust anchors

2023-10-08 Thread Taylor R Campbell
> Date: Sun, 08 Oct 2023 10:54:13 -0400
> From: Greg Troxel 
> 
> (I've been putting off thinking about and dealing with this due to
> juggling too many other things.)

No worries, glad you found some time to review it!

> Taylor R Campbell  writes:
> 
> > The new certctl(8) tool is provided to manage the TLS trust anchors
> > configured in /etc/openssl/certs with a simple way to change the
> > source of trust anchors or distrust individual ones -- and with a
> > manual override, if you would rather use another mechanism to do it,
> > like the commands available in the security/mozilla-rootcerts or
> > security/ca-certificates packages, or the special-purpose
> > security/mozilla-rootcerts-openssl package.
> 
> This says "TLS trust anchors", but I wonder if that's accurate.  Isn't
> this "pkix trust anchors, for which the most common case is TLS"?  I
> have not dug in to the openssl library calls, but my impression is that
> openssl the installed software does pkix validation in general, and the
> installed trust anchors will be used for invocations to validate pkix certs
> separately from TLS.

If you know of applications that use /etc/openssl/certs on NetBSD (or
/etc/ssl/certs on FreeBSD, or /etc/pki/tls/certs on Fedora), or SSLDIR
in pkgsrc, to find trust anchors for anything other than TLS, I'd like
to hear about it.

My guess is that such cases are very rare if they exist at all, and
they are totally dominated by applications that rely on it for TLS
validation.

> After reading, I think what's going on is
> 
>   1) mozilla rootcert situation is a bit of a mess semantically

Not really.  The CAs are clearly marked in certdata.txt for different
purposes.  I imported the content according to the metadata about
Mozilla's trust; more below.

>   2) certctl is installing the subset that is intended for TLS

Correct.

You can also use the same certctl(8) program to install trust anchors
in another path for other purposes, like /etc/smime/certs.  I made
sure it would work (both for testing and for other possible uses) but
didn't lift a finger to make it happen because TLS is critical and
everything else is niche.

>   3) the installed certs will be used for all uses, not just TLS
>   (e.g. SMIME), and because certs intended for SMIME but not "server"
>   are missing, the wrong thing will happen sometimes, but because many
>   CAs do both (?) it will often be ok.

See above: if you know of applications that rely on /etc/openssl/certs
for S/MIME, and it's not just a joke (which most open-ended
interorganizational use of S/MIME that I'm aware of is --
intraorganizational uses managed by a corporate IT department purely
for internal or partner use aside), I'm curious to see.

>   4) This is all not obvious, and
>  a) It's not the least bit clear that the right thing is happening.
>  b) I expect ~nobody to understand this.

Is any of the text in the certctl(8) or hier(7) man pages unclear
about this?  I tried to clarify the purpose of /etc/openssl/certs for
TLS trust anchors specifically in that text.

> Looking in /usr/share/certs/mozilla, it continues to be non-obvious.
> The idea that 'all' has "untrusted CAs" seems crazy; if they aren't
> trusted, why are they in the root set, which is by definition the set of
> CAs which meet the rules and are therefore trustworthy?

`all' has everything in certdata.txt.

`server' has only what Mozilla has chosen to trust for TLS.

`email' has only what Mozilla has chosen to trust for S/MIME.

> I see code is empty.  I'm going to ignore this.

Yes.  The certdata.txt format has a way to say that trust anchors are
fit for code-signing, so for completeness I exposed that via the
directory /usr/share/certs/mozilla/code, but (a) there are no trust
anchors in certdata.txt that use it, and (b) there is nothing in
NetBSD that would use such trust anchors anyway.

> With a bit of ls/sort/uniq/comm, I see that there are certs in all that
> do not appear in email or server:
> 
>   Explicitly_Distrust_DigiNotar_Root_CA.pem
>   Symantec_Class_1_Public_Primary_Certification_Authority_-_G6.pem
>   Symantec_Class_2_Public_Primary_Certification_Authority_-_G6.pem
>   TUBITAK_Kamu_SM_SSL_Kok_Sertifikasi_-_Surum_1.pem
>   TrustCor_ECA-1.pem
>   TrustCor_RootCert_CA-1.pem
>   TrustCor_RootCert_CA-2.pem
>   Verisign_Class_1_Public_Primary_Certification_Authority_-_G3.pem
>   Verisign_Class_2_Public_Primary_Certification_Authority_-_G3.pem

That looks correct.  It matches what I see in certdata.txt and the
special annotation about Tubitak Kamu SM at
<https://wiki.mozilla.org/CA/Additional_Trust_Changes>.  See
external/mpl/mozilla-certdata/share/Makefile for notes on how this
particular sausage is made.

These exclusions also match my knowledge of the history:

- DigiNotar was

Re: cgd questions

2023-10-08 Thread Taylor R Campbell
> Date: Sun, 1 Oct 2023 14:50:20 +0200
> From: Thomas Klausner 
> 
> I tried finding this in the man page, but it wasn't fully clear to me.
> 
> When I pick up a cgd disk and want to use it on a NetBSD system to
> which it was not connected before, what do I need?
> 
> - the passphrase
> - the /etc/cgd/foo file?

Correct.

> If you need the /etc/cgd/foo file too, how do people handle those for
> cgds used as backup disks?

Save a copy somewhere else, e.g. printed out on paper or stored in a
USB flash drive.
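For example, with the device and mount point as placeholders:

	mount /dev/sd0e /mnt
	cp -p /etc/cgd/foo /mnt/cgd-foo-params
	umount /mnt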

> The other question is that the cgd man page says that some ciphers are
> obsolete. How can I switch from an obsolete cipher to a new one - is
> the only method to make a new cgd with the new cipher and copy the
> data manually?

Correct.

Perhaps it would be worthwhile to write a utility that does the
conversion incrementally in-place.  But it at least requires some
intermediate storage on another file system to do this safely, so it's
not entirely trivial -- cgd(4) itself uses no additional writable
storage, so there's nowhere for it to record how much of the volume
has been re-encrypted or to store a write-ahead or roll-back log of
each block it is re-encrypting in case of interruption.
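The manual conversion is roughly the following sketch -- device names,
mount points, and the choice of adiantum are illustrative; see
cgdconfig(8) and dump(8):

	# generate parameters and configure a new cgd on a second disk
	cgdconfig -g -o /etc/cgd/dk1 adiantum
	cgdconfig cgd1 /dev/dk1 /etc/cgd/dk1
	# label and newfs cgd1, mount it at /newmnt, then copy:
	dump -0 -f - /oldmnt | (cd /newmnt && restore -rf -)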


> Date: Sun, 01 Oct 2023 09:31:03 -0400
> From: Greg Troxel 
> 
> Yes, you need the /etc/cgd/foo file because the passphrase is salted,
> and you might need an iv depending on iv method.  IMHO this is a design
> bug in cgd.  At least as a normal path, one should be able to access
> with just the passphrase.

On the contrary, this was a deliberate feature of cgd(4), precisely so
that you can store the parameters file separate from the disk, e.g. on
a small USB flash drive -- and, more importantly, so that you can
plausibly destroy the parameters file and safely recycle the disk.

See the Usenix 2003 paper on cgd
https://www.usenix.org/legacy/event/usenix03/tech/freenix03/full_papers/dowdeswell/dowdeswell.pdf
for details.

This design choice also obviates the need for cgd(4) to have its own
custom writable file system format and tooling like LUKS invented just
to store the parameters (which you have to store somewhere); instead
if you really want to store it side by side with the ciphertext, you
can do so with a small ffs partition on the disk, as you described.


> Date: Tue, 03 Oct 2023 15:53:33 -0400
> From: Greg Troxel 
> 
> Thomas Klausner  writes:
> 
> > IIUC the cgdconfig man page correctly, this is how you do that:
> >
> >  To create a new parameters file that will generate the same key as an 
> > old
> >  parameters file:
> >
> >  # cgdconfig -G -o newparamsfile oldparamsfile
> >  old file's passphrase:
> >  new file's passphrase:
> 
> I think what that does is encrypt the old key using the new passphrase,
> and store that encrypted key in the config file.  Thus you haven't
> changed the key, but you have a config file that allows decryption with
> a new passphrase.  That's good to give a second person access, but it
> doesn't revoke the first passphrase's access, if I understood correctly.

This revokes the first password's access if you also erase
oldparamsfile.
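Concretely, something like the following -- rm -P overwrites the file
before unlinking it, for whatever that is worth on the underlying
storage:

	cgdconfig -G -o /etc/cgd/foo.new /etc/cgd/foo
	rm -P /etc/cgd/foo
	mv /etc/cgd/foo.new /etc/cgd/foo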

After all, the main value of cgd(4) -- aside from protecting from
theft of your disk while your laptop is turned off -- is that if you
can erase a short parameters file, that's enough to confidently erase
the disk before recycling it without having to worry about leaking
data through sector reassignment or wear-levelling or anything.

Note: The cgd parameters file does not contain the cgd(4) disk
encryption key -- not even an encrypted key.  It doesn't even contain
enough information to confirm a guess about the password.

So if you lose a USB flash drive with the parameters file, but you
don't lose the disk itself, it is useless to an adversary on its own
to guess your password.

(What it contains -- for password-based parameters -- is a random salt
that must be combined with the password to derive the disk encryption
key.)


> Date: Mon, 2 Oct 2023 09:46:10 +0200
> From: Thomas Klausner 
> 
> I have a USB Disk with ffs-on-cgd.  I unmounted the ffs but forgot
> unconfiguring the cgd before unplugging the disk.
> 
> Can this cause problems? What kinds?

It is unlikely to lead to any sort of data loss, because if you've
unmounted ffs, then it has closed the block device and thereby
synchronously flushed all pending writes.

If attempts to read from it or write to it fail in any way other than
returning EIO (e.g., crashing the system), or if you're unable to
unconfigure it with `cgdconfig -u', please file a PR and I'll take a
look.

cgd(4) _should_ handle this scenario gracefully.  But I say that
normatively or aspirationally rather than informatively of the status
quo.  We should find a way to automatically test this scenario too,
which I'm sure we don't right now.


Re: kerberos issues with 10.0_BETA post openssl update

2023-09-06 Thread Taylor R Campbell
> Date: Wed, 6 Sep 2023 10:39:34 +
> From: Taylor R Campbell 
> 
> A possible workaround is to set:
> 
>   [libdefaults]
>   k5login_directory = /root
> 
> However, that applies to _all_ kuserok checks for _all_ users, not
> just the pam_ksu one for root, so it will probably break other things.
> I'm not sure there is a way in the config file to specify it just for
> pam_ksu or just for root.

Here's a workaround you could test with no code changes that shouldn't
break other applications: move /root/.k5login to /etc/k5login.d/root,
and set

[libdefaults]
kuserok = USER-K5LOGIN SYSTEM-K5LOGIN SIMPLE DENY

in /etc/krb5.conf.  Still worth finding a code fix for pam_ksu, but
you can try this workaround in the mean time.
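(Concretely, the move is just:

	mkdir -p /etc/k5login.d
	mv /root/.k5login /etc/k5login.d/root

plus the krb5.conf change above.)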


Re: kerberos issues with 10.0_BETA post openssl update

2023-09-06 Thread Taylor R Campbell
> Date: Wed, 6 Sep 2023 16:41:00 +1200
> From: Mark Davies 
> 
> OK, so revision 1.10 of pam_ksu.c adds a call to 
> krb5_set_home_dir_access(NULL, FALSE);
> which causes the subsequent call to krb5_kuserok() to return false when 
> previously it would return true causing the whole pam_ksu to bail.
> 
> krb5_kuserok() is presumably now returning false because if it can't 
> access the homedir it can't read /root/.k5login to see that 
> mark/r...@ecs.vuw.ac.nz is allowed.

The reason for revision 1.10 is that pam_ksu had a gaping security
hole without it, allowing the calling user to totally control the krb5
context by specifying ~/.krb5/config in the _calling user's_ home
directory and thereby spoof authentication decisions:

https://ftp.netbsd.org/pub/NetBSD/security/advisories/NetBSD-SA2023-005.txt.asc

I verified that the hole was there without the change, and I verified
that the change plugged the hole.

However, I think you're right that the change causes it to take this
path in krb5_kuserok to block access to the _target user's_ home
directory too:

profile_dir = k5login_dir;
if (profile_dir == NULL) {
    /* Don't deadlock with gssd or anything of the sort */
    if (!_krb5_homedir_access(context))
        return KRB5_PLUGIN_NO_HANDLE;

    if (rk_getpwnam_r(luser, &pw, pwbuf, sizeof(pwbuf), &pwd) != 0) {
        krb5_set_error_message(context, errno, "User unknown (getpwnam_r())");
        return KRB5_PLUGIN_NO_HANDLE;
    }
    if (pwd == NULL) {
        krb5_set_error_message(context, errno, "User unknown (getpwnam())");
        return KRB5_PLUGIN_NO_HANDLE;
    }
    profile_dir = pwd->pw_dir;
}

A possible workaround is to set:

[libdefaults]
k5login_directory = /root

However, that applies to _all_ kuserok checks for _all_ users, not
just the pam_ksu one for root, so it will probably break other things.
I'm not sure there is a way in the config file to specify it just for
pam_ksu or just for root.

Perhaps it would be appropriate to add these lines in pam_ksu.c (or
possibly just the first one):

goto out;
}
+   krb5_set_home_dir_access(context, TRUE);
PAM_LOG("kuserok: %s -> %s", su_principal_name, user);
rv = krb5_kuserok(context, su_principal, user);
+   krb5_set_home_dir_access(context, FALSE);

At that point, the config file should have been parsed already, so the
calling user's ~/.krb5/config can't hurt anything.  But I haven't
audited this path.  So I don't know if it's safe.


Re: kerberos issues with 10.0_BETA post openssl update

2023-09-05 Thread Taylor R Campbell
> Date: Wed, 6 Sep 2023 09:54:16 +1200
> From: Mark Davies 
> 
> OK, found a simpler reproducible crash.  Run "kadmin -l" on a kdc and 
> try to change a password.
> 
> xen2# kadmin -l
> kadmin> passwd ecsproctor
> ecsproc...@ecs.vuw.ac.nz's Password:
> Verifying - ecsproc...@ecs.vuw.ac.nz's Password:
> Segmentation fault (core dumped)
> 
> Here is the backtrace
> 
> Core was generated by `kadmin'.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x in ?? ()
> (gdb) where
> #0  0x in ?? ()
> #1  0x7f11ca0423d4 in ARCFOUR_string_to_key (context=0x7f11cafc7000, 
> enctype=KRB5_ENCTYPE_ARCFOUR_HMAC_MD5,
>  password=..., salt=..., opaque=..., key=0x7f11caf514d8)
>  at 
> /src/work/10/src/crypto/external/bsd/heimdal/dist/lib/krb5/salt-arcfour.c:83

This looks like a jump to null in the RC4 logic using EVP_md4().

For EVP_rc4 we have a hack in Heimdal to do

EVP_CIPHER_fetch(NULL, "rc4", "provider=legacy")

but I'm not sure it actually works -- I can't get it to do anything in
a test program without also calling OSSL_PROVIDER_load("legacy"), at
which point it becomes unnecessary -- and we don't do it for MD4.

So if we can convince Heimdal to call OSSL_PROVIDER_load("legacy") at
some point on startup, I bet that will fix it.
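As a quick check from userland -- this only shows whether the legacy
provider and its MD4/RC4 implementations are available at all, not
whether Heimdal actually loads them:

	openssl list -providers
	openssl list -digest-algorithms -provider legacy | grep -i md4
	openssl list -cipher-algorithms -provider legacy | grep -i rc4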

It looks like the EVP_CIPHER_fetch hack (or EVP_MD_fetch hack) is also
a memory leak, according to the OpenSSL documentation:

   These functions usually have the name APINAME_fetch, where APINAME
   is the name of the operation.  For example EVP_MD_fetch(3) can be
   used to explicitly fetch a digest algorithm implementation.  The
   user is responsible for freeing the object returned from the
   APINAME_fetch function using APINAME_free when it is no longer
   needed.

So I'm not sure we should be using it at all.

> as to the su issue, I think that is a separate problem related to 
> revision 1.10 of pam_ksu.c.  If I revert that then su works.

Got a stack trace for that?


Re: kerberos issues with 10.0_BETA post openssl update

2023-09-04 Thread Taylor R Campbell
> Date: Mon, 4 Sep 2023 16:46:35 +1200
> From: Mark Davies 
> 
> Having updated from a 10.0_BETA built in march to one built couple
> of weeks ago (post the openssl3 merge) I'm now seeing various
> kerberos issues that I wasn't seeing before.

Can you share `ldd /usr/libexec/kadmind' on the machine where it's
crashing?  Wondering whether it's mixing shlib versions in one address
space, like https://gnats.netbsd.org/57603 (though that's not the same
issue because you're only using things in base).

Can you install the debug set on one of the affected systems where you
can reproduce a problem to get more information out of the stack
traces?


Call for testing: certctl, postinstall, TLS trust anchors

2023-09-03 Thread Taylor R Campbell
We're preparing to ship TLS trust anchors in base and configure them
so that applications like ftp(1) and pkg_add(1) can do TLS validation
out of the box.

The new certctl(8) tool is provided to manage the TLS trust anchors
configured in /etc/openssl/certs with a simple way to change the
source of trust anchors or distrust individual ones -- and with a
manual override, if you would rather use another mechanism to do it,
like the commands available in the security/mozilla-rootcerts or
security/ca-certificates packages, or the special-purpose
security/mozilla-rootcerts-openssl package.

I've added some logic in postinstall(8) to handle the transition when
you update.  Tried to anticipate all reasonable paths into the update,
and handle them all gracefully.  But no doubt I missed something.

So it would be helpful if you could test updating NetBSD in whatever
way you do it (sysinst, untar/etcupdate/postinstall, etcmanage,
something even more bespoke), and let me know if anything goes wrong
with the TLS trust anchors:

1. Does postinstall work smoothly for you?

2. Does it blow away any configuration you had?  (I don't think it
   should, but if you back up /etc you should be able to see.)

3. Do you end up with the trust anchors you expected?

4. Are the answers obvious or do you have to go digging?

5. Do you hit any messages or warnings or failures that you don't
   understand?

6. If you previously used mozilla-rootcerts, ca-certificates, or
   something else, and you want to switch to certctl(8), is it obvious
   what you need to do?  If not, where did you consult to find what
   you need to do, where you didn't find the answer?


Repository conversion is temporarily suspended

2023-08-31 Thread Taylor R Campbell
The automatic conversion of the NetBSD CVS repository to Fossil, Hg,
and Git is temporarily suspended until 2023-09-08.

The rack space donated to TNF for the infrastructure is undergoing
renovation from 2023-08-28 to 2023-09-08.

Apologies for the inconvenience.


git workflow across forced updates

2023-08-22 Thread Taylor R Campbell
This is my workflow to handle forced updates in the git repository
conversion as smoothly as regular updates:

1. When I start working on a change, say for PR 12345, I create two
   branches with a short name describing the change:

git branch pr12345-mumblefix-base origin/trunk
git checkout -b pr12345-mumblefix pr12345-mumblefix-base

2. I commit some changes on the branch with `git commit -i' to
   interactively select the hunks to commit so I can keep separate
   functional (or non-functional) changes.

   Sometimes I want to amend a previous commit, so I find the commit
   hash, say 1234abcd, and make a fixup commit:

git commit --fixup 1234abcd

   This makes a magic commit message `fixup! <subject of commit 1234abcd>'.
   Then from time to time I rebase to apply the fixups:

git rebase -i --autosquash pr12345-mumblefix-base

3. Sometimes I push changes from the branch to the CVS repository with
   git cvsexportcommit:

git cvsexportcommit -c -p -v -w ~/cvs/src 1234abcd

   If the change is near an RCS id line ($NetBSD: ...$), it is
   sometimes necessary to omit `-p' so the patch is allowed to apply
   with smaller context, but I manually eyeball it before I do that,
   just in case.

4. When I want to update the repository, I use the attached script to
   create a new branch pr12345-mumblefix-v2 on a new base
   pr12345-mumblefix-v2-base with all the same commits:

/path/to/git-update.sh pr12345-mumblefix origin/trunk

   or, next time, to make pr12345-mumblefix-v3:

/path/to/git-update.sh pr12345-mumblefix-v2 origin/trunk

   Of course, there might be merge conflicts to resolve.  Sometimes if
   I've already committed changes git will figure that out; sometimes
   git won't figure it out but it's just a matter of skipping a commit
   with `git rebase --skip'.

   It's not highly polished -- if you want to give up on resolving the
   merge conflicts, you can do `git rebase --abort', but git-update.sh
   won't clean up the branches it already created.  Nevertheless, it's
   been very handy for my development needs.

   (You can also put it in your PATH and then say `git update' instead
   of `/path/to/git-update.sh'.  Not sure if git plans to add a
   standard `git update' command, though.)

5. When I'm done and I've pushed all the changes to CVS and the branch
   is now empty on update, so pr12345-mumblefix-vN is the same as
   pr12345-mumblefix-vN-base, I rename all the branches to be
   closed/pr12345-mumblefix-vN and closed/pr12345-mumblefix-vN-base.
   Not perfect but easy to ignore in `git branch' output.

   (You can also just delete the branches, if the proposition of
   deleting history doesn't make your skin crawl like it does for me.)
#!/bin/sh

# Copyright (c) 2023 Taylor R. Campbell
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# 1. Redistributions of source code must retain the above copyright
#notice, this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright
#notice, this list of conditions and the following disclaimer in the
#documentation and/or other materials provided with the distribution.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY EXPRESS
# OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN
# NO EVENT SHALL THE COPYRIGHT HOLDERS BE LIABLE FOR ANY DIRECT, INDIRECT,
# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA,
# OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
# LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
# NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE,
# EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

set -Ceu

run()
{
echo '#' "$@"
${DRY_RUN+:} "$@"
}

oldbranch=$1
newbase=${2-origin/trunk}

case $oldbranch in
*-v[0-9] | *-v[0-9][0-9])
prefix=${1%-v*}
oldversion=${1##*-v}
newversion=$((oldversion + 1))
;;
*)
prefix=$1
newversion=2
;;
esac
newbranch=${prefix}-v${newversion}

if git rev-parse --verify --quiet ${newbranch}-base >/dev/null; then
printf >&2 'git-update: new base %s already exists\n' ${newbranch}-base
exit 1
fi

if git rev-parse --verify --quiet ${newbranch} >/dev/null; then
printf >&2 'git-update: new branch %s already exists\n' ${newbranch}
exit 1
fi

if ! run git diff --stat --exit-code; then
printf >&2 'git-update: dirty working tree, aborting\n'
exit 1
fi

if ! run git diff --cached --stat --exit-code; then
printf >&2 'git-upd

Re: builds fail w/undef ref in rump tests

2023-08-20 Thread Taylor R Campbell
> Date: Sun, 20 Aug 2023 09:45:20 -0500 (CDT)
> From: "John D. Baker" 
> 
> On Sun, 20 Aug 2023, Taylor R Campbell wrote:
> 
> > That's odd, can you share this ident(1) output?
> > 
> > cd $OBJDIR/tests/rump
> > ident kernspace/workqueue.o kernspace/libkernspace.a
> 
> kernspace/workqueue.o:
>  $NetBSD: workqueue.c,v 1.10 2023/08/10 22:20:20 riastradh Exp $
> 
> kernspace/libkernspace.a:
>  $NetBSD: sendsig.c,v 1.2 2020/05/23 23:42:44 ad Exp $
>  $NetBSD: workqueue.c,v 1.10 2023/08/10 22:20:20 riastradh Exp $

Very weird!  Can you share the output of these?

cd $OBJDIR/tests/rump
nm -g --defined-only kernspace/workqueue.o
nm -g --undefined-only rumpkern/t_workqueue.o

kernspace/workqueue.c is supposed to define
rumptest_workqueue_wait_pause as of revision 1.9 (and it should stay
defined in 1.10).

Unless maybe you have some other libkernspace somewhere else that is
somehow getting pulled in instead?
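Maybe also check for stray copies lying around, e.g. (path taken from
your build log):

	find /r0/build/current/obj -name 'libkernspace.*'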


Re: builds fail w/undef ref in rump tests

2023-08-20 Thread Taylor R Campbell
> Date: Sun, 20 Aug 2023 14:11:02 +0200
> From: Martin Husemann 
> 
> On Sun, Aug 20, 2023 at 12:08:01PM +0000, Taylor R Campbell wrote:
> > > /r0/build/current/tools/amd64/bin/mips64el--netbsd-gcc
> > > --sysroot=/r0/build/current/DEST/evbmips64-el -Wl,--warn-shared-textrel   
> > > -pie  -shared-libgcc  -o t_workqueue  t_workqueue.o  
> > > -Wl,-rpath-link,/r0/build/current/DEST/evbmips64-el/lib  -L=/lib  
> > > -lrumpvfs -lrumpvfs_nofifofs -lrump -lrumpuser -lpthread -lrump 
> > > -L/r0/build/current/obj/mips64el/tests/rump/kernspace -lkernspace
> > > -latf-c
> > > /r0/build/current/tools/amd64/lib/gcc/mips64el--netbsd/10.5.0/../../../../mips64el--netbsd/bin/ld:
> > >  t_workqueue.o: in function `atfu_workqueue_wait_pause_body':
> > > /x/current/src/tests/rump/rumpkern/t_workqueue.c:96: undefined reference 
> > > to `rumptest_workqueue_wait_pause'
> 
> Is this with an update build? I think I had to clean some test obj dirs after
> the latest changes to workqueue (and probably forgot to note it in UPDATING)

I changed tests/rump/rumpkern/Makefile to use PROGDPLIBS so that
t_workqueue should get relinked correctly even in an update build, in
order to avoid needing a note in UPDATING.  Curious to understand how
this could happen?  Surely we should be able to get this working
without manual intervention.


Re: builds fail w/undef ref in rump tests

2023-08-20 Thread Taylor R Campbell
> From: "John D. Baker" 
> Date: Sat, 19 Aug 2023 11:28:32 -0500 (CDT)
> 
> Now that the MKCROSSGDB and new libpcap issues seem to have been ironed
> out, my builds (all arches) have resumed failing due to the following
> (evbmips-mips64el shown, sparc, macppc, dreamcast, i386, amd64 show the
> same failure):
> ...
> /r0/build/current/tools/amd64/bin/mips64el--netbsd-gcc
> --sysroot=/r0/build/current/DEST/evbmips64-el -Wl,--warn-shared-textrel   
> -pie  -shared-libgcc  -o t_workqueue  t_workqueue.o  
> -Wl,-rpath-link,/r0/build/current/DEST/evbmips64-el/lib  -L=/lib  -lrumpvfs 
> -lrumpvfs_nofifofs -lrump -lrumpuser -lpthread -lrump 
> -L/r0/build/current/obj/mips64el/tests/rump/kernspace -lkernspace -latf-c
> /r0/build/current/tools/amd64/lib/gcc/mips64el--netbsd/10.5.0/../../../../mips64el--netbsd/bin/ld:
>  t_workqueue.o: in function `atfu_workqueue_wait_pause_body':
> /x/current/src/tests/rump/rumpkern/t_workqueue.c:96: undefined reference to 
> `rumptest_workqueue_wait_pause'

That's odd, can you share this ident(1) output?

cd $OBJDIR/tests/rump
ident kernspace/workqueue.o kernspace/libkernspace.a

The definition has been there for a couple weeks now.


Re: 10.99.7 panic: defibrillate

2023-08-14 Thread Taylor R Campbell
> Date: Mon, 14 Aug 2023 18:16:49 +0200
> From: Thomas Klausner 
> 
> On Mon, Aug 14, 2023 at 12:41:06PM +0200, Thomas Klausner wrote:
> > I had followed your suggestion and bumped the heartbeat limit from 15
> > to 300, but today it paniced again.
> > 
> > panic: cpu8: found cpu9 heart stopped beating and unresponsive
> > 
> > I have a core dump in case you want any particular details.
> > 
> > I've now switched set it to 0.
> 
> and had a hard hang less than half a day later.
> 
> This hasn't been happening in 10.99.5 (at least not with that
> frequency), which had uptimes of weeks, so either the heartbeat code
> introduced additional problems (even if disabled this way) or
> something else got worse, or I am really really unlucky right now.

Welp.

I don't think simply having the heartbeat(9) code around would cause a
hang -- it's new code, which is higher-risk, but the design of the
code is very low-risk (all loops are bounded; interrupt handler and
soft interrupt handler are short and easy to audit for bounded
latency; each CPU only writes to its own per-CPU state).  I think it's
more likely something else changed.

Looks like it's time to bisect over the time since your last good
build, and see if you can make it a whole day without panicking?

874 commits since I bumped 10.99.5 (which was incidentally when I
introduced heartbeat(9)), so...it should only take a week or two if
the problem takes half a day to reproduce!


MKCROSSGDB=yes broken in new gdb?

2023-08-13 Thread Taylor R Campbell
$ time nice -n +10 env -i PATH=/bin:/usr/bin ./build.sh -O ../obj.arm64 -T 
../obj.arm64/tooldir -U -u -m evbarm64 -j4 -N1 -V MKCROSSGDB=yes -V MKDEBUG=yes 
-V USE_PIGZGZIP=yes release
[...]
In file included from 
/home/riastradh/netbsd/current/src/tools/gdb/../../external/gpl3/gdb/dist/gdb/gdbtypes.h:58:0,
 from 
/home/riastradh/netbsd/current/src/tools/gdb/../../external/gpl3/gdb/dist/gdb/symtab.h:28,
 from 
/home/riastradh/netbsd/current/src/tools/gdb/../../external/gpl3/gdb/dist/gdb/d-lang.c:21:
/home/riastradh/netbsd/current/src/tools/gdb/../../external/gpl3/gdb/dist/gdb/gmp-utils.h:29:10:
 fatal error: gmp.h: No such file or directory
 #include <gmp.h>
  ^~~
compilation terminated.
nbgmake[1]: *** [d-lang.o] Error 1


Re: 10.99.7 panic: defibrillate

2023-08-13 Thread Taylor R Campbell
> Date: Sun, 13 Aug 2023 06:16:51 -0400
> From: Greg Troxel 
> 
> Would it be useful for heartbeat to have a just-log-don't-panic option?

Worth considering, but...

> It feels like are in a state where we know there is a problem somewhere,
> and we don't know if it is in heartbeat, the kernel, or hardware.

...in this case it is already clear that under heavy disk I/O,
something is either holding onto a spin lock or starving softints and
threads at priority below softbio for much too long.

Holding a spin lock -- or otherwise running at raised IPL -- for 5sec
is already enough to violate the contract of the timecounter at
hz=100, which can lead to monotonic time going backwards, which breaks
all kinds of things but maybe only in subtle ways that are extremely
hard to diagnose retrospectively.

I thought all the uvm aiodone business was supposed to be deferred to
workqueue context (which would not hold up heartbeats), but it looks
like we have a path (softbio) biodone -> biodone2 -> uvm_aio_aiodone
-> uvm_pagermapout -> vm_map_lock -> cv_wait which is forbidden in
softint context (and should really trip a KASSERT).  This might not be
the problem but it's evidence that the code path is on shaky grounds.

> I would not want to run a watchdog that reboots the system unless the FP
> rate is well under once per year, and really under 0.2/year.  Having
> this logged instead of panicing would make it more comfortable to turn
> on.  Probably it should be default to not panic, if this turns into
> enough reports that it seems to have significantly non-zero probability.

So far the only reports I've seen have been true alarms about
something being broken.  Most of the problems that this will catch
would otherwise manifest as `NetBSD stopped responding and I wasn't
able to get a core dump' (leading to useless undiagnosable PRs), not
as `huh, I saw this weird detailed log message', so the diagnostic
value of the heartbeat panic in those circumstances is very high.

Note that a hardware watchdog timer is a little bit different: it will
usually just reset the machine, giving no opportunity for diagnostics
like a crash dump.

> (Presumably atf runs on real hw survive HEARTBEAT though, so whatever is
> happening seems low probability to start with.)

Right.  My guess is that this may be related to problems that we've
been trying to diagnose regarding extreme delays at shutdown after
heavy disk I/O, which we need more information to figure out.
Possibly related to the yamt-pagecache merge, possibly related to the
zfs pagedaemon thrashing.


Re: 10.99.7 panic: defibrillate

2023-08-13 Thread Taylor R Campbell
> Date: Sun, 13 Aug 2023 06:16:53 +
> From: Taylor R Campbell 
> 
> > Date: Sun, 13 Aug 2023 08:03:00 +0200
> > From: Thomas Klausner 
> > 
> > So it happened again, no bulk build this time, just qt5-qtwebengine in
> > a sandbox.
> > 
> > panic: cpu0: softints stuck for 16 seconds
> 
> You could try `no options HEARTBEAT' in your kernel and see if
> anything is stuck after a day of builds -- the same crash commands
> might show something is up.

Or, better, without rebuilding a kernel:

sysctl -w kern.heartbeat.max_period=300

to extend the period to five minutes (300 seconds), or set it to 0 to
disable the checks at runtime.

Another diagnostic that might be helpful -- during load:

# lockstat -o lockstat.out sleep 15

and share the output.  If it fails, try a lower sleep time like 10 or
5 or 1.


Re: 10.99.7 panic: defibrillate

2023-08-13 Thread Taylor R Campbell
> Date: Sun, 13 Aug 2023 08:03:00 +0200
> From: Thomas Klausner 
> 
> So it happened again, no bulk build this time, just qt5-qtwebengine in
> a sandbox.
> 
> panic: cpu0: softints stuck for 16 seconds

You could try `no options HEARTBEAT' in your kernel and see if
anything is stuck after a day of builds -- the same crash commands
might show something is up.  Maybe also enter ddb with C-A-ESC or
hw.cnmagic and, for each CPU N, do `mach cpu N' and then `bt' in case
one core is stuck at raised IPL or something.

> Using an older gdb I get:
> (gdb) bt
> [...]
> #7  0x8023c2a4 in trap (frame=0xd3a110933ad0) at 
> /usr/src/sys/arch/amd64/amd64/trap.c:315
> #8  0x80234ad4 in alltraps ()
> #9  0x0003 in ?? ()

We need to fix gdb stack traces through kernel trap frames...

Can you try with crash(8)?

# crash -N netbsd.40 -M netbsd.40.core
crash> bt
crash> ps
crash> ps/w
crash> show all tstiles


Re: 10.99.7 panic: defibrillate

2023-08-12 Thread Taylor R Campbell
> Date: Sat, 12 Aug 2023 17:28:27 +0200
> From: Thomas Klausner 
> 
> I just got a new panic in 10.99.7 after running a pbulk for less than
> a day (after updating from 10.99.5, which was stable for weeks).
> ...
> vpanic() at netbsd:vpanic+0x173
> panic() at netbsd:panic+0x3c
> defibrillate() at netbsd:defibrillate+0xe3
> hardclock() at netbsd:hardclock+0x8b
> Xresume_lapic_ltimer() at netbsd:Xresume_lapic_ltimer+0x1e
> --- interrupt ---
> pmap_tlb_shootnow() at netbsd:pmap_tlb_shootnow+0x1f7
> ...

This panic means that one CPU has detected that another CPU has failed
to run either the hardclock interrupt handler or the SOFTINT_CLOCK
softints in over 15 seconds, and triggered an interprocessor interrupt
in an attempt to panic rather than stay stuck where it appears to be
stuck -- here, pmap_tlb_shootnow.

Normally the hardclock interrupt handler runs every 10ms (or 1/hz sec;
default hz=100), and softints run reasonably promptly, so failing to
do this for 15 sec is extremely unusual and likely indicates a CPU is
wedged and unable to make progress.  For example, something may be
stuck in an infinite loop with a spin lock held or spl raised, which
blocks interrupts.

(The HEARTBEAT option, under which CPUs check one another for
progress, is new as of last month.  The problems it uncovers would
likely have manifested as a silent, unresponsive hang before.)

1. Did you notice anything sluggish before the crash?

2. Can you start another bulk build and run the following dtrace
   script for a while and share the final output?

dtrace -x cleanrate=50hz -n '
fbt::pmap_tlb_shootnow:entry,
fbt::uvm_pagermapout:entry {
self->starttime[probefunc] = timestamp
}
fbt::pmap_tlb_shootnow:return,
fbt::uvm_pagermapout:return /self->starttime[probefunc]/ {
@[probefunc] = quantize(timestamp -
self->starttime[probefunc]);
self->starttime[probefunc] = 0
}
tick-60s {
printa(@)
}
'

You may need to modload dtrace_fbt and dtrace_profile first.  The
tick-60s probe will print the current state of data collection once a
minute, showing a histogram of the time spent in the functions
pmap_tlb_shootnow and uvm_pagermapout.

If it says something like

dtrace: 429 dynamic variable drops with non-empty dirty list

then just hit ^C and save the last output.

> Sorry, no crash dump available.

3. Do you just not have a dump device, or are crash dumps broken
   altogether?  Can you test with sysctl debug.crashme?  (sysctl -w
   debug.crashme_enable=1, sysctl -w debug.crashme.panic=1)


Re: What to do about "WARNING: negative runtime; monotonic clock has gone backwards"

2023-08-01 Thread Taylor R Campbell
> Date: Tue, 01 Aug 2023 16:02:17 -0400
> From: Brad Spencer 
> 
> Taylor R Campbell  writes:
> 
> > So I just added a printf to the kernel in case this jump happens.  Can
> > you update to xen_clock.c 1.15 (and sys/arch/x86/include/cpu.h 1.135)
> > and try again?
> 
> Sure...

Correction: xen_clock.c 1.16 and sys/arch/x86/include/cpu.h 1.136
(missed a spot).

> >> If the dtrace does continue to run, sometimes, it is impossible to exit
> >> with CTRL-C.  The process seems stuck in this:
> >> 
> >> [ 4261.7158728] load: 2.64  cmd: dtrace 3295 [xclocv] 0.01u 0.02s 0% 7340k
> >
> > Interesting.  If this is reproducible, can you enter crash or ddb and
> > get a stack trace for the dtrace process, as well as output from ps,
> > ps/w, and `show all tstiles'?
> 
> It appears to be reproduceable..  in the sense that I encountered it a
> couple of times doing exactly the same workload test.  I am more or less
> completely unsure as to what the trigger is, however.  I probably should
> have mentioned, but when this happened the last time, I did have other
> newly created processes hang in tstile (the one in particular that I
> noticed was 'fortune' from a ssh attempt .. it got stuck on login and
> when I did a CTRL-T tstile was shown).

`show all tstiles' output in crash or ddb would definitely be helpful
here.

> I also probably should have mentioned that the DOM0 (NOT the DOMU) that
> the target system is running under has HZ set to 1000.  This is mostly
> to help keep the ntpd and chronyd happy on the Xen guests.  If the DOM0
> is left at 100 the drift can be too much on the DOMU systems.  Been
> running like this for a long time...

Interesting.  Why would the dom0's HZ choice make a difference?  Nothing
in the guest should depend substantively on the host's tick rate.

A NetBSD XEN3_DOM0 kernel periodically updates the hypervisor with a
real-time clock derived from NTP (the `timepush' callout in
xen_clock.c), but the period between updates is 53 sec + 3 ticks, and
it's hard to imagine that setting the clock every 53.03 sec vs every
53.003 sec should make any difference for whether guests drift.

The resolution of the real-time clock sent to NTP is 1/hz, because
resettodr uses getmicrotime instead of microtime, but while that might
introduce jitter from rounding, I'm not sure it should cause
persistent drift in one direction or the other and I don't think
guests are likely to periodically query the Xen wall clock time often
enough for this jitter to matter.

Does the dom0 have any substantive continuous influence on domU
scheduling and timing?  I always assumed the hypervisor would have all
the say in that.

As an aside, I wonder whether it's even worthwhile to run ntpd or
chronyd on the domU instead of just letting the dom0 set it and
arranging to do the equivalent of inittodr periodically in the domU?

Can you try the attached patch on a dom0 and see if you still observe
drift?
>From 639672fef2a0a6959161b6e882e16128b1340c69 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell 
Date: Tue, 1 Aug 2023 21:58:45 +
Subject: [PATCH] WIP: push real-time clock directly from nanotime to
 hypervisor

Bypass resettodr's use of getmicrotime, and any overhead from
rtc_set_ymdhms.
---
 sys/arch/xen/xen/xen_clock.c | 25 +
 1 file changed, 25 insertions(+)

diff --git a/sys/arch/xen/xen/xen_clock.c b/sys/arch/xen/xen/xen_clock.c
index e36ec38ba758..fc2fa95490c4 100644
--- a/sys/arch/xen/xen/xen_clock.c
+++ b/sys/arch/xen/xen/xen_clock.c
@@ -988,8 +988,26 @@ fail:  sysctl_teardown();
 static void
 xen_timepush_intr(void *cookie)
 {
+#if 1
+   struct timespec now;
+   int error;
 
+   nanotime(&now);
+   error = HYPERVISOR_platform_op(&(xen_platform_op_t) {
+   .cmd = XENPF_settime,
+   .u = {
+   .settime = {
+   .secs = now.tv_sec,
+   .nsecs = now.tv_nsec,
+   .system_time = xen_global_systime_ns(),
+   },
+   },
+   });
+   if (error)
+   printf("%s: XENPF_settime failed: %d\n", __func__, error);
+#else
resettodr();
+#endif
if (xen_timepush.ticks)
callout_schedule(&xen_timepush.ch, xen_timepush.ticks);
 }
@@ -1090,15 +1108,22 @@ xen_rtc_set(struct todr_chip_handle *todr, struct 
timeval *tvp)
clock_secs_to_ymdhms(tvp->tv_sec, &dt);
rtc_set_ymdhms(NULL, &dt);
 
+#if 1
+   __USE(op);
+   __USE(systime_ns);
+   return 0;
+#else
/* Get the global system time so we can preserve it.  */
systime_ns = xen_global_systime_ns();
 
/* Set the hypervisor wall clock time.  */
+   /* XXX zero it first? (`mbz' field probab

Re: What to do about "WARNING: negative runtime; monotonic clock has gone backwards"

2023-08-01 Thread Taylor R Campbell
> Date: Mon, 31 Jul 2023 12:47:20 -0400
> 
> # dtrace -x nolibs -n 'sdt:xen:hardclock:jump { @ = quantize(arg1 - arg0) } 
> sdt:xen:hardclock:jump /arg2 >= 430/ { printf("hardclock jump violated 
> timecounter contract") }'
> dtrace: description 'sdt:xen:hardclock:jump ' matched 2 probes
> dtrace: processing aborted: Abort due to systemic unresponsiveness

Well!  dtrace might be unhappy if the timecounter is broken too, heh.
So I just added a printf to the kernel in case this jump happens.  Can
you update to xen_clock.c 1.15 (and sys/arch/x86/include/cpu.h 1.135)
and try again?

> The system is fine just after a reboot, it certainly seems to be a
> requirment that a fair bit of work must be done before it gets into a
> bad state.
> 
> If the dtrace does continue to run, sometimes, it is impossible to exit
> with CTRL-C.  The process seems stuck in this:
> 
> [ 4261.7158728] load: 2.64  cmd: dtrace 3295 [xclocv] 0.01u 0.02s 0% 7340k

Interesting.  If this is reproducible, can you enter crash or ddb and
get a stack trace for the dtrace process, as well as output from ps,
ps/w, and `show all tstiles'?


Re: What to do about "WARNING: negative runtime; monotonic clock has gone backwards"

2023-07-30 Thread Taylor R Campbell
> Date: Sun, 30 Jul 2023 14:56:53 -0400
> From: Brad Spencer 
> 
> Taylor R Campbell  writes:
> 
> > Can you please try running with the attached patch and share the
> > warnings it produces?  Should give slightly more information.
> 
> Caught another one.  As far as I know the system is up to date with all
> of the requested patches:
> 
> [ 19419.647972] WARNING: lwp 16 (system idle/1) flags 0xa020: timecounter 
> went backwards from (19420 + 0x9e37cf0149d8f7bb/2^64) sec at 
> netbsd:mi_switch+0x11e on cpu1 to (19419 + 0xad917b77bd0a7cd3/2^64) sec at 
> netbsd:mi_switch+0x11e on cpu1

Can you run this dtrace script for a while (say, for a day, or from
start of boot until you see the WARNING above which only happens once
per boot), and then hit ^C?

dtrace -x nolibs -n 'sdt:xen:hardclock:jump { @ = quantize(arg1 - arg0) } 
sdt:xen:hardclock:jump /arg2 >= 430/ { printf("hardclock jump violated 
timecounter contract") }'

If my hypothesis is correct, you can just leave this running over any
particular workload and you'll get:

(a) a message printed whenever the hardclock delay is too long, and
(b) when you hit ^C at the end, a histogram of all the >1-tick
hardclock jump delays.

(Avoiding the tick-10s probe, like I used in the last dtrace
suggestion, means you won't get updates printed every 10sec to your
terminal -- you'll have to hit ^C to see the results -- but as an
upside it won't instantly crash your kernel owing to the Xen/!Xen
module ABI mismatch for CLKF_USERMODE/PC.)


Re: What to do about "WARNING: negative runtime; monotonic clock has gone backwards"

2023-07-28 Thread Taylor R Campbell
> Date: Fri, 28 Jul 2023 09:42:30 -0400
> From: Brad Spencer 
> 
> Thanks, that allowed the dtrace to execute, but I never have time to
> execute the second probe, as this kernel panic occures within a few
> seconds of the first probe being run (probably on the order of 4 - 5
> seconds):
> 
> [ 213.9904671] fatal page fault in supervisor mode
> [ 213.9904671] trap type 6 code 0 rip 0x824d29ad cs 0xe030 rflags 
> 0x10283 cr2 0xd8 ilevel 0x7 rsp 0x92032de8db58
> [ 213.9904671] curlwp 0x92001784e9c0 pid 0.16 lowest kstack 
> 0x92032de892c0
> kernel: page fault trap, code=0
> Stopped in pid 0.16 (system) at cyclic:cyclic_clock+0x86:   movq d8(%rbx),%rax
> cyclic_clock() at cyclic:cyclic_clock+0x86

Oy...  I bet this is the following logic in cyclic_clock:

if (TRAPF_USERMODE(frame)) {
c->cpu_profile_pc = 0;
c->cpu_profile_upc = TRAPF_PC(frame);
} else {
c->cpu_profile_pc = TRAPF_PC(frame);
c->cpu_profile_upc = 0;
}

TRAPF_USERMODE and TRAPF_PC are aliases for CLKF_USERMODE and
CLKF_PC, respectively, which are defined differently on Xen and !Xen.

But the module is built assuming !Xen, and the module ABI is supposed
to be the same, so these are forbidden in modules.

I guess we need to add an out-of-line function that modules can call
for this.

Or, alternatively (though not for pullup to 9 or 10) -- perhaps we can
nix the clockframe nonsense once and for all and use struct cpu_info
members unconditionally, which would eliminate the grody hack in the
i8254 interrupt handler.


Re: What to do about "WARNING: negative runtime; monotonic clock has gone backwards"

2023-07-28 Thread Taylor R Campbell
> Date: Fri, 28 Jul 2023 07:42:04 -0400
> From: Brad Spencer 
> 
> dtrace: invalid probe specifier sdt:xen:hardclock:jump { @ = quantize(arg1 - 
> arg0) } tick-10s { printa(@) }: probe description :::tick-10s does not match 
> any probes

modload dtrace_profile


Re: What to do about "WARNING: negative runtime; monotonic clock has gone backwards"

2023-07-28 Thread Taylor R Campbell
> Date: Thu, 27 Jul 2023 11:17:30 -0400
> From: Brad Spencer 
> 
> On the system that I have that exhibits the negative runtime problem, it
> may very well be the case that hardclocks are missed for 4.3sec.  The
> system has to have been up for a while and busy as a prereq., but if I
> then run:
> 
> dtrace -x nolibs -n 'sdt:xen:clock:, sdt:xen:hardclock:, sdt:xen:timecounter: 
> { printf("%d %d %d %d %d %d %d %d", arg0, arg1, arg2, arg3, arg4, arg5, arg6, 
> arg7) }'

Update xen_clock.c, and try this now (one in each of two terminals):

dtrace -x nolibs -n 'sdt:xen:hardclock:jump { @ = quantize(arg1 - arg0) } 
tick-10s { printa(@) }'

dtrace -x nolibs -n 'sdt:xen:hardclock:jump /arg2 >= 430/ { printf("TIMECOUNTER 
BAD: hardclock delayed by %d ns (%d tick)", arg1 - arg0, arg2) }'

Let it run for a while until the second one fires, or until you get
bored, then share the last histogram from the first one?

(Note: I added a probe sdt:xen:hardclock:tick which fires on every
tick, so the original dtrace script will be a little too noisy.  If
you want to do it again for some reason, narrow the scope from
`sdt:xen:hardclock:' to just `sdt:xen:hardclock:missed,
sdt:xen:hardclock:jump'.)


Re: What to do about "WARNING: negative runtime; monotonic clock has gone backwards"

2023-07-27 Thread Taylor R Campbell
> Date: Thu, 27 Jul 2023 15:05:23 +1000
> from: matthew green 
> 
> one problem i've seen in kern_tc.c when the timecounter returns
> a smaller value is that tc_delta() ends up returning a very large
> (underflowed) value, and that makes the consumers of it do a very
> wrong thing.  eg, -2 becomes 2^32-2, and then eg in binuptime:
> 
> 477 bintime_addx(bt, th->th_scale * tc_delta(th));
> 
> or in tc_windup():
> 
> 933 delta = tc_delta(th);
> 938 th->th_offset_count += delta;
> 939 bintime_addx(&th->th_offset, th->th_scale * delta);
> 
> i "fixed" the time goes backwards on sparc issue a few years ago
> with this change, which avoids the above issue:
> 
>http://mail-index.netbsd.org/source-changes/2018/01/12/msg091064.html
> 
> but i really think that the way tc_delta() can underflow is a
> bad problem we should fix properly, i just wasn't sure of the
> right way to do it.

I don't understand, what do you mean by underflow here?

Part of the API contract of a k-bit timecounter(9) is that the
underlying clock must not have a frequency higher than f * 2^k / 2,
where f is the frequency of tc_windup calls.[*]

For example, if f = 100 Hz (i.e., hz=100), and k = 32 (as we use),
then the maximum timecounter frequency is 100 Hz * 2^32 / 2 ~= 214
GHz.  Even if f = 10 Hz, this is 21.4 GHz.

Under this premise, in the duration between two tc_windup calls,
consecutive calls to get_timecount() mod 2^k can't differ by more than
2^k / 2.  And each call to tc_windup resets th->th_offset_count :=
get_timecount().

So no matter how many times you call tc_delta(th) within that time,
(get_timecount() - th->th_offset_count) mod 2^k can't wrap around,
i.e., a sequence of calls must yield a nondecreasing sequence of k-bit
integers.

I don't know what the sparc timecounter frequency is, but the Xen
system timecounter returns units of nanoseconds, i.e., runs at 1 GHz,
well within these bounds.  So this kind of wraparound leading to
apparently negative runtime -- that is, l->l_stime going backwards --
should not be possible as long as we are calling tc_windup() at a
frequency of at least 1 GHz / (2^k / 2) = 0.47 Hz.

That said, at a 32-bit timecounter frequency of 1 GHz, if there is a
period of about 2^32 / 1 GHz ~= 4.3sec during which we miss all
consecutive hardclock ticks, that would violate the timecounter(9)
assumptions, and tc_delta(th) may go backwards if that happens.
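Concretely, with hz=100, k=32, and the 1 GHz Xen system counter:

	$ echo 'scale=2; 100 * 2^32 / (2 * 10^9)' | bc
	214.74
	$ echo 'scale=2; 2^32 / 10^9' | bc
	4.29

i.e., the ~214 GHz bound and the ~4.3 sec window mentioned above.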

So I think we need to find out why we're missing Xen hardclock timer
interrupts.  Should also make the dtrace probe show exactly how many
hardclock ticks in a batch happened, and should raise an alarm (with
or without dtrace) if it exceeds a threshold.


[*] Actually the limit is closer to f * 2^k, not f * 2^k / 2, but
there probably has to be a little slop for the computational
overhead of tc_windup to ensure the timehands are updated before
tc_delta would wrap around; a factor of two gives a comfortable
margin of error here.


Re: What to do about "WARNING: negative runtime; monotonic clock has gone backwards"

2023-07-25 Thread Taylor R Campbell
Can you please try running with the attached patch and share the
warnings it produces?  Should give slightly more information.
>From b6f360d9b1fdc418105fcc41b41f1206ca04334d Mon Sep 17 00:00:00 2001
From: Taylor R Campbell 
Date: Wed, 26 Jul 2023 01:36:28 +
Subject: [PATCH] WIP: attribution for l_stime updates

---
 sys/kern/init_main.c|  4 +++-
 sys/kern/kern_idle.c|  4 +++-
 sys/kern/kern_softint.c |  4 +++-
 sys/kern/kern_synch.c   | 53 -
 4 files changed, 61 insertions(+), 4 deletions(-)

diff --git a/sys/kern/init_main.c b/sys/kern/init_main.c
index 1d2dbd742fed..bc43f7dd18e9 100644
--- a/sys/kern/init_main.c
+++ b/sys/kern/init_main.c
@@ -254,6 +254,8 @@ static void configure2(void);
 static void configure3(void);
 void main(void);
 
+void updatestime(struct lwp *, const struct bintime *); /* XXX */
+
 /*
  * System startup; initialize the world, create process 0, mount root
  * filesystem, and fork to create init and pagedaemon.  Most of the
@@ -728,7 +730,7 @@ main(void)
mutex_exit(p->p_lock);
}
	mutex_exit(&proc_lock);
-   binuptime(&curlwp->l_stime);
+   updatestime(curlwp, NULL);
 
for (CPU_INFO_FOREACH(cii, ci)) {
ci->ci_schedstate.spc_lastmod = time_second;
diff --git a/sys/kern/kern_idle.c b/sys/kern/kern_idle.c
index dc1fc194f491..b5c6f7a4d90e 100644
--- a/sys/kern/kern_idle.c
+++ b/sys/kern/kern_idle.c
@@ -41,6 +41,8 @@ __KERNEL_RCSID(0, "$NetBSD: kern_idle.c,v 1.34 2020/09/05 
16:30:12 riastradh Exp
 
 #include <uvm/uvm_extern.h>	/* uvm_idle */
 
+void updatestime(struct lwp *, const struct bintime *); /* XXX */
+
 void
 idle_loop(void *dummy)
 {
@@ -53,7 +55,7 @@ idle_loop(void *dummy)
KASSERT(lwp_locked(l, spc->spc_lwplock));
kcpuset_atomic_set(kcpuset_running, cpu_index(ci));
/* Update start time for this thread. */
-   binuptime(&l->l_stime);
+   updatestime(l, NULL);
spc->spc_flags |= SPCF_RUNNING;
KASSERT((l->l_pflag & LP_RUNNING) != 0);
l->l_stat = LSIDL;
diff --git a/sys/kern/kern_softint.c b/sys/kern/kern_softint.c
index 5aa4c786037c..370d22bb94ca 100644
--- a/sys/kern/kern_softint.c
+++ b/sys/kern/kern_softint.c
@@ -793,6 +793,8 @@ softint_thread(void *cookie)
panic("softint_thread");
 }
 
+void updatestime(struct lwp *, const struct bintime *); /* XXX */
+
 /*
  * softint_dispatch:
  *
@@ -841,7 +843,7 @@ softint_dispatch(lwp_t *pinned, int s)
 * for it.
 */
if (timing) {
-   binuptime(&l->l_stime);
+   updatestime(l, NULL);
membar_producer();  /* for calcru */
l->l_pflag |= LP_TIMEINTR;
}
diff --git a/sys/kern/kern_synch.c b/sys/kern/kern_synch.c
index 6695205ac900..13862681c9fb 100644
--- a/sys/kern/kern_synch.c
+++ b/sys/kern/kern_synch.c
@@ -545,6 +545,57 @@ updatertime(lwp_t *l, const struct bintime *now)
	bintime_sub(&l->l_rtime, &l->l_stime);
 }
 
+void updatestime(struct lwp *, const struct bintime *); /* XXX */
+void
+updatestime(struct lwp *l, const struct bintime *now)
+{
+   static bool backwards = false;
+   struct bintime bt;
+   void *hereaddr = __builtin_return_address(0);
+   unsigned herecpu = curcpu()->ci_index;
+
+   if (now == NULL) {
+   binuptime(&bt);
+   now = &bt;
+   }
+
+   if (__predict_false(bintimecmp(now, &l->l_stime, <)) && !backwards) {
+   char heresym[128], theresym[128];
+   void *thereaddr = l->l_sched.info; /* XXX */
+   unsigned therecpu = l->l___rsvd1;
+
+#ifdef DDB
+   db_symstr(heresym, sizeof(heresym),
+   (db_expr_t)(intptr_t)hereaddr, DB_STGY_PROC);
+   db_symstr(theresym, sizeof(theresym),
+   (db_expr_t)(intptr_t)thereaddr, DB_STGY_PROC);
+#else
+   snprintf(heresym, sizeof(heresym), "%p", hereaddr);
+   snprintf(theresym, sizeof(theresym), "%p", thereaddr);
+#endif
+   backwards = true;
+   printf("WARNING: lwp %ld (%s%s%s) flags 0x%x:"
+   " timecounter went backwards"
+   " from (%jd + 0x%016"PRIx64"/2^64) sec"
+   " at %s on cpu%u"
+   " to (%jd + 0x%016"PRIx64"/2^64) sec"
+   " at %s on cpu%u\n",
+   (long)l->l_lid,
+   l->l_proc->p_comm,
+   l->l_name ? " " : "",
+   l->l_name ? l->l_name : "",
+   l->l_pflag,
+   (intmax_t)l->l_stime.sec, l->l_stime.frac,
+   theresym, therecpu,
+   (intmax_t)now->sec, now->frac,
+   heresym, herecp

Re: What to do about "WARNING: negative runtime; monotonic clock has gone backwards"

2023-07-20 Thread Taylor R Campbell
> Date: Thu, 20 Jul 2023 21:50:26 -0400
> From: Brad Spencer 
> 
> With a DOMU kernel compiled with KDTRACE_HOOKS I get the following with
> either of those dtrace probes on the DOMU:
> 
> dtrace -n 'sdt:xen:clock:, sdt:xen:hardclock:, sdt:xen:timecounter: { 
> printf("%d %d %d %d %d %d %d", arg0, arg1, arg2, arg3, arg4, arg5, arg6, 
> arg7) }'
> dtrace: invalid probe specifier sdt:xen:clock:, sdt:xen:hardclock:, 
> sdt:xen:timecounter: { printf("%d %d %d %d %d %d %d", arg0, arg1, arg2, arg3, 
> arg4, arg5, arg6, arg7) }: "/usr/lib/dtrace/psinfo.d", line 46: syntax error 
> near "u_int"

Can you please file a PR about that, and pass `-x nolibs' to dtrace in
the mean time?

dtrace -x nolibs -n 'sdt:xen:clock:...'


Re: What to do about "WARNING: negative runtime; monotonic clock has gone backwards"

2023-07-13 Thread Taylor R Campbell
> Date: Thu, 13 Jul 2023 13:39:43 +
> From: Taylor R Campbell 
> 
> > Date: Sat, 08 Jul 2023 14:34:56 -0400
> > From: Brad Spencer 
> > 
> > [ 1792486.921759] kobj_checksyms, 988: [dtrace]: linker error: symbol 
> > `dtrace_invop_calltrap_addr' not found
> > [ 1792486.921759] kobj_checksyms, 988: [dtrace]: linker error: symbol 
> > `dtrace_invop_jump_addr' not found
> > [ 1792486.921759] kobj_checksyms, 988: [dtrace]: linker error: symbol 
> > `dtrace_trap_func' not found
> > [ 1792486.921759] WARNING: module error: unable to affix module `dtrace', 
> > error 8
> 
> Looks like nobody has wired up dtrace to Xen!  That's a pretty serious
> regression of Xen vs native x86.  Someone needs to hook these up.

Correction: Might just be that you need to build a kernel with
`options KDTRACE_HOOKS', which is apparently not the default in Xen.

Can you try that?

Maybe we should flip that on by default.


Re: What to do about "WARNING: negative runtime; monotonic clock has gone backwards"

2023-07-13 Thread Taylor R Campbell
> Date: Sat, 08 Jul 2023 14:34:56 -0400
> From: Brad Spencer 
> 
> Taylor R Campbell  writes:
> 
> > Can you either:
> 
> Yes, I can perform as much of this as needed after I get some other
> stuff in life dealt with more towards the end of the month.  I really
> won't have any time before then.

No worries!

> > 1. share the output of `vmstat -e | grep -e tsc -e systime -e
> >hardclock' after you get the console warning;
> 
> The DOMU currently only has 1 vcpu, but here is the output now:
> 
> vcpu0 raw systime went backwards  465790 intr
> 
> When I have real time later I will force the negative runtime to happen
> and run the above again.

This is evidence that the hypervisor is doing something wrong with the
clock it exposes to the guest.  However, on a single-vCPU system, we
work around this by noting the last Xen systime recorded on the
current vCPU, and pretending the clock just hadn't changed since then.

On a multi-vCPU system, we also try to work around it by recording a
clock skew in xen_global_systime_ns and applying it to ensure the
timestamp is monotonic, but perhaps that's not working right -- or
perhaps it is working for 64-bit timestamps, but the jumps are so
large that they wrap around the 32-bit timecounter arithmetic.
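
Roughly, the multi-vCPU workaround has the following shape (a sketch
with made-up names -- the real logic is in sys/arch/xen/xen/xen_clock.c
and also involves the recorded skew mentioned above): each vCPU reads
its raw Xen systime and then advances a shared global value, so no
caller ever observes the result going backwards.

#include <sys/types.h>
#include <sys/atomic.h>

static volatile uint64_t global_systime_ns;	/* hypothetical */

static uint64_t
systime_monotonic(uint64_t raw_local_ns)
{
	uint64_t seen;

	do {
		seen = global_systime_ns;
		if (raw_local_ns <= seen)
			return seen;	/* clamp: never go backwards */
	} while (atomic_cas_64(&global_systime_ns, seen, raw_local_ns)
	    != seen);

	return raw_local_ns;
}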

> > 2. run
> >
> >dtrace -n 'sdt:xen:clock: { printf("%d %d %d %d %d %d %d",
> >arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7) }'

Note: this should now be

dtrace -n 'sdt:xen:clock:, sdt:xen:hardclock:, sdt:xen:timecounter: { 
printf("%d %d %d %d %d %d %d", arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7) 
}'

> >on the system, and leave it running with output directed to a file,
> >and share the output when you see the console warning; or
> 
> The DOMU is a 9.3_STABLE from around November 8th and when I attempted
> to run the above dtrace it didn't work.  I got this in the messsages:
> 
> [ 1792486.921759] kobj_checksyms, 988: [dtrace]: linker error: symbol 
> `dtrace_invop_calltrap_addr' not found
> [ 1792486.921759] kobj_checksyms, 988: [dtrace]: linker error: symbol 
> `dtrace_invop_jump_addr' not found
> [ 1792486.921759] kobj_checksyms, 988: [dtrace]: linker error: symbol 
> `dtrace_trap_func' not found
> [ 1792486.921759] WARNING: module error: unable to affix module `dtrace', 
> error 8

Looks like nobody has wired up dtrace to Xen!  That's a pretty serious
regression of Xen vs native x86.  Someone needs to hook these up.

In the mean time, I've add a little more diagnostics to HEAD -- if you
can boot a current kernel, that might help, or I could try to make the
corresponding changes on netbsd-9.

https://mail-index.netbsd.org/source-changes/2023/07/13/msg145973.html
https://mail-index.netbsd.org/source-changes/2023/07/13/msg145974.html


New ddb command: show all tstiles

2023-07-09 Thread Taylor R Campbell
We often get reports of the form `system is hung with processes stuck
in tstile again, must be the same problem as '.

Unfortunately, these are usually _not_ the same problem, or even
related in more than a superficial way, because `processes stuck in
tstile' just means `waiting for a lock'.  And there are myriad ways
the system might deadlock in that state -- totally independent locking
bugs in any subsystem can manifest that way.

Fortunately, there's a new ddb command to help track this down and
tease apart _which_ subsystems have locking bugs causing the tstiles:

show all tstiles
show all tstiles/t

This goes through each thread (lwp) that's waiting for a lock and
prints

(a) the lock address,
(b) if the lock is owned by another lwp, who owns it, and
(c) with the `/t' modifier, a stack trace of the owner.

And, unlike `show all locks', it works in non-LOCKDEBUG kernels, so if
a production system locks up, `show all tstiles' may still be useful.
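
As a concrete example of the kind of bug this helps tease apart,
consider a textbook ABBA deadlock (a sketch with hypothetical locks,
not code from any real subsystem):

kmutex_t lock_a, lock_b;

void
thread_one(void)
{

	mutex_enter(&lock_a);
	mutex_enter(&lock_b);	/* sleeps forever if thread_two holds lock_b */
	mutex_exit(&lock_b);
	mutex_exit(&lock_a);
}

void
thread_two(void)
{

	mutex_enter(&lock_b);
	mutex_enter(&lock_a);	/* sleeps forever if thread_one holds lock_a */
	mutex_exit(&lock_a);
	mutex_exit(&lock_b);
}

Both threads sleep forever, everything else that needs either lock
piles up in turnstiles behind them, and `show all tstiles/t' shows the
two owners and their stack traces -- exactly what you need to see
which subsystems' lock ordering is at fault.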

So next time the system locks up with processes stuck in tstile, try
`show all tstiles/t' and see if that helps identify the deadlock!


Re: What to do about "WARNING: negative runtime; monotonic clock has gone backwards"

2023-07-08 Thread Taylor R Campbell
> Date: Wed, 04 Jan 2023 14:43:25 -0500
> From: Brad Spencer 
> 
> So...  I have a PV+PVSHIM DOMU running a pretty recent 9.x on a DOM0
> running a 9.99.xx kernel.  The DOM0 is not large, a 4 processor E-2224
> with 32GB of memory.  The DOMU has 2 VCPUs and 8GB of memory.  About
> every day a very particular DOMU tosses the:
> 
> WARNING: negative runtime; monotonic clock has gone backwards

Does this still happen?

Can you either:

1. share the output of `vmstat -e | grep -e tsc -e systime -e
   hardclock' after you get the console warning;

2. run

   dtrace -n 'sdt:xen:clock: { printf("%d %d %d %d %d %d %d",
   arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7) }'

   on the system, and leave it running with output directed to a file,
   and share the output when you see the console warning; or

3. put `#define XEN_CLOCK_DEBUG 1' in sys/arch/xen/xen/xen_clock.c and
   build a new kernel, and share the dmesg output when you get the
   console warning?

This should tell us whether it's the Xen host's fault or something
wrong in NetBSD.


Re: Call for testing: New kernel heartbeat(9) checks

2023-07-07 Thread Taylor R Campbell
> Date: Fri, 7 Jul 2023 17:56:42 +0200
> From: Manuel Bouyer 
> 
> On Fri, Jul 07, 2023 at 01:11:54PM +0000, Taylor R Campbell wrote:
> > - The magic numbers for debug.crashme.spl_spinout are for evbarm.
> >   On x86, use IPL_SCHED=7, IPL_VM=6, and IPL_SOFTCLOCK=1.

Correction: IPL_SOFTCLOCK=2.

> > 1.  cpuctl offline 0
> > sleep 20
> > cpuctl online 0
> 
> With this I get a panic on Xen:
> [ 225.4605386] panic: kernel diagnostic assertion "kpreempt_disabled()" 
> failed: file "/dsk/l1/misc/bouyer/HEAD/clean/src/sys/kern/kern_heartbeat.c", 
> line 158
> [...]
> [  53.5704682] panic: kernel diagnostic assertion "kpreempt_disabled()" 
> failed: file "/dsk/l1/misc/bouyer/HEAD/clean/src/sys/kern/kern_heartbeat.c", 
> line 158

This was a mistake that arose because I was testing on aarch64 where
kpreempt_disabled() is always true.  Update and try again, please!

sys/kern/kern_heartbeat.c 1.2
sys/kern/subr_xcall.c 1.36

> > 4.  sysctl -w debug.crashme_enable=1
> > sysctl -w debug.crashme.spl_spinout=1   # IPL_SOFTCLOCK
> > # verify system panics after 15sec
> 
> my sysctl command did hang, but the system didn't panic

Right -- I made a mistake in my call for testing.  On x86,
IPL_SOFTCLOCK is 2, not 1, which is IPL_PREEMPT, a special ipl that
doesn't apply here.  So use this instead on x86:

sysctl -w debug.crashme.spl_spinout=2

(Not sure if it's different on Xen -- if it is, use whatever
IPL_SOFTCLOCK is there.)

> > 5.  sysctl -w debug.crashme_enable=1
> > sysctl -w debug.crashme.spl_spinout=6   # IPL_SCHED
> > # verify system panics after 15sec
> 
> This one did panic

Great!

> > 6.  cpuctl offline 0
> > sysctl -w debug.crashme_enable=1
> > sysctl -w debug.crashme.spl_spinout=1   # IPL_SOFTCLOCK
> > # verify system panics after 15sec
> 
> my sysctl command did hang, but the system didn't panic

Same as with (4), use 2 instead of 1 here (or whatever is the right
value on Xen).

> > 7.  cpuctl offline 0
> > sysctl -w debug.crashme_enable=1
> > sysctl -w debug.crashme.spl_spinout=5   # IPL_VM
> > # verify system panics after 15sec
> 
> and this one did panic

Great, thanks!


Call for testing: New kernel heartbeat(9) checks

2023-07-07 Thread Taylor R Campbell
FYI: In 10.99.5, I just added a new kernel diagnostic subsystem called
heartbeat(9) that will make the system crash rather than hang when
CPUs are stuck in certain ways that hardware watchdog timers can't
detect (or on systems without hardware watchdog timers).

It's optional for now, but it's small and I'd like to make it
mandatory in the future.  If you'd like to try it out, add the
following two lines to your kernel config:

options HEARTBEAT
options HEARTBEAT_MAX_PERIOD_DEFAULT=15

You can disable it with `sysctl -w kern.heartbeat.max_period=0' at
runtime, or use that knob to change the maximum period before the
system will crash if not all (online) CPUs have made progress.
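
Conceptually the check looks something like this sketch (hypothetical
names and fields -- the real logic lives in sys/kern/kern_heartbeat.c
and is driven from the hardclock path):

struct heartbeat {			/* hypothetical per-CPU state */
	unsigned	hb_count;	/* ticks counted by that CPU */
	unsigned	hb_seen;	/* last count the checker observed */
	unsigned	hb_stale;	/* checker ticks since it advanced */
};

void
heartbeat_check(struct heartbeat *patient, unsigned max_period_ticks)
{

	if (patient->hb_count == patient->hb_seen) {
		if (++patient->hb_stale >= max_period_ticks)
			panic("heartbeat: CPU stopped making progress");
	} else {
		patient->hb_seen = patient->hb_count;
		patient->hb_stale = 0;
	}
}

The idea is that each online CPU bumps its own count on every tick and
periodically cross-checks another CPU, so a CPU that stops making
progress gets noticed and the system panics with a usable backtrace
instead of silently hanging.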


Here are some manual tests that you can use to exercise it -- these
are manual tests, not automatic tests, because some will deliberately
crash the kernel to make sure the diagnostic works, and the others, if
broken, will also crash the kernel.

Notes:
- The magic numbers for debug.crashme.spl_spinout are for evbarm.
  On x86, use IPL_SCHED=7, IPL_VM=6, and IPL_SOFTCLOCK=1.
  For other architectures, consult the source for the numbers to use.
- If you're on a single-CPU system, skip the cpuctl offline/online
  tests and just do (4) and (5).
- If you're on a >2-CPU system, then for the cpuctl offline/online
  tests, try offlining all CPUs but one at a time.

1.  cpuctl offline 0
sleep 20
cpuctl online 0

2.  cpuctl offline 1
sleep 20
cpuctl online 1

3.  cpuctl offline 0
sysctl -w kern.heartbeat.max_period=5
sleep 10
sysctl -w kern.heartbeat.max_period=0
sleep 10
sysctl -w kern.heartbeat.max_period=15
sleep 20
cpuctl online 0

4.  sysctl -w debug.crashme_enable=1
sysctl -w debug.crashme.spl_spinout=1   # IPL_SOFTCLOCK
# verify system panics after 15sec

5.  sysctl -w debug.crashme_enable=1
sysctl -w debug.crashme.spl_spinout=6   # IPL_SCHED
# verify system panics after 15sec

6.  cpuctl offline 0
sysctl -w debug.crashme_enable=1
sysctl -w debug.crashme.spl_spinout=1   # IPL_SOFTCLOCK
# verify system panics after 15sec

7.  cpuctl offline 0
sysctl -w debug.crashme_enable=1
sysctl -w debug.crashme.spl_spinout=5   # IPL_VM
# verify system panics after 15sec


Re: AMDGPU Driver patches/bugs

2023-02-21 Thread Taylor R Campbell
> Date: Tue, 21 Feb 2023 13:20:13 -0800
> From: Jeff Frasca 
> 
> I was going to try the radeon driver again, because I want to see if
> my wayland compositor works better against it than the AMDGPU driver
> (I'm getting some weird corruption problems with my compositor that
> do not happen under Linux, but that's probably my code).

We have seen other weird minor graphics corruption problems with X,
even with xcompmgr or picom running.  I probably made another stupid
bug, maybe in cacheability attributes or something, buried somewhere
in the megabytes of diffs...

> I'll send the FP stuff to tech-kern and CC you.  For the PRs, that's
> the sendpr.cgi on netbsd.org, right?

Sure, https://gnats.NetBSD.org -> report a new bug, which goes to some
sendpr.cgi URL, or send-pr(1) if you like to do things from a local
mailer (but it requires a working local mailer which can make outgoing
SMTP connections, often blocked on residential networks).


Re: AMDGPU Driver patches/bugs

2023-02-21 Thread Taylor R Campbell
> Date: Mon, 20 Feb 2023 23:15:35 -0800
> From: Jeff Frasca 
> 
> I have two machines with AMD graphics hardware: a laptop with a raven
> ridge APU (GCN 5) and a desktop with a kaveri APU (GCN 2).
> [...]
> After upgrading my system in place from tarballs, and compiling a
> custom kernel with the AMDGPU drivers instead of the radeon drivers, I
> was pleasantly surprised to find that the frame buffer worked with the
> newer, less tested drivers!

Cool!

By the way, you may be able to just add `load amdgpu' to boot.cfg
instead of compiling a custom kernel with amdgpu.

Loading amdgpu as a module also has the advantage that it doesn't
break dtrace (due to annoying technical restrictions in CTF which the
kernel violates when amdgpu is statically linked because it is so
large).  You can rebuild just the module with:

cd src/sys/modules/amdgpu
$TOOLDIR/bin/nbmake-$ARCH -j4 dependall
$TOOLDIR/bin/nbmake-$ARCH -j4 install

> The problem with the doorbell code is that the Linux code uses
> adev->doorbell.ptr + index to get the address to write to.  ptr is
> ultimately a pointer to a 32 bit wide value (rather than the 64 bit
> wide value it actually is :-/ ), so the compiler's pointer math
> multiplies index by 4 instead of 8, as the NetBSD dev who wrote the
> code would have expected.

Amazing!  I must have stared at that code for hours trying to track
down the ring test failures, without realizing that the pointer was
typed 32-bit instead of 64-bit.
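
For anyone following along, the pitfall is just C pointer arithmetic
scaling by the size of the pointee -- a tiny user-space illustration
(made-up names, not the amdgpu code):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t doorbell[8];			/* 64-bit doorbell registers */
	uint64_t *p64 = doorbell;
	uint32_t *p32 = (uint32_t *)doorbell;	/* mistyped pointer */
	unsigned index = 3;

	/* correctly typed: index 3 -> byte offset 3*8 = 24 */
	printf("%td\n", (char *)(p64 + index) - (char *)doorbell);
	/* mistyped as 32-bit: the same expression -> byte offset 3*4 = 12 */
	printf("%td\n", (char *)(p32 + index) - (char *)doorbell);
	return 0;
}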

...I don't suppose you have another trick up your sleeve for the
radeon driver, do you?  We've also been seeing intermittent ring test
failures at boot, but it doesn't use any 64-bit doorbells, so this
trick doesn't work, alas.

> (The driver blows up spectacularly shortly thereafter by causing a
> floating point exception in kernel mode.  I don't have a full fix for
> that yet.  The thing I did try that seems to get further causes the
> screen to go blank.  I have a plan for debugging this, but I haven't
> gotten there yet.)

If you have a stack trace or crash dump I might be able to help.  The
amdgpu driver apparently uses FP/SIMD instructions in the kernel, and
I wired it up to NetBSD's mechanism for allowing it to do that, but I
don't know if I've ever seen those parts of the code get hit and
perhaps I missed something.

> I've attached patches.  Should I open a bug?  Send these to the kernel
> mailing list?

Patches applied, thanks!  I tweaked them a little bit, including to
fix an arithmetic overflow bug that you had copied & pasted from one
Taylor R Campbell, riastr...@netbsd.org, in kern_ksyms.c...oops.  (Fix
also applied in kern_ksyms.c now.)

Feel free to file PRs with patches and/or cc me and tech-kern -- I
don't always follow current-users.


Re: specfs/spec_vnops.c diagnostic assertion panic

2022-08-13 Thread Taylor R Campbell
> Date: Sat, 13 Aug 2022 19:47:54 +0700
> From: Robert Elz 
> 
> However, since we now know the issue we have been looking at does involve
> the raw devices, not the block ones, I'm not sure what is the point of
> reverting that specfs_vnode.c patch, which only affects the block device
> open.  If that is needed, we might as well keep it, right?  It shouldn't
> affect current testing either way.

When _userland_ opens the raw /dev/rdkN _character_ device, for a
wedge on (say) raid0, the _kernel_ will do the equivalent of opening
the /dev/raid0 _block_ device.  All I/O by userland through /dev/rdkN
goes through the block device in the kernel.

Normally, the paths in the kernel for open/close on /dev/rdkN arrange
to open the block device only once at a time, and serialize the
opening and closing the block device under a lock -- well, except they
_don't_ serialize closing the block device under that lock.

So if, say, fsck opens and closes /dev/rdkN, and dkctl opens /dev/rdkM
at about the same time, dkctl might race to open the block device (in
the kernel) before fsck has finished closing it (again, in the
kernel).  That's the race that the patch to dk.c avoids.

The patch to spec_vnops.c is necessary to make spec_open gracefully
return EBUSY instead of crashing the kernel when this happens.  The
patch to dk.c is necessary to serialize the /dev/rdkN open/close logic
so that it never hits this case at all when opening the raid0 block
device -- and thus never spuriously fails with EBUSY _or_ crashes the
kernel.
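
The shape of the dk.c change is roughly this (a sketch with made-up
names, not the committed diff): do the parent close under the same
lock that serializes first-open, so a concurrent open of a sibling
wedge waits for the close to finish instead of hitting EBUSY or
tripping an assertion.

static void
wedge_lastclose(struct wedge_softc *sc)	/* hypothetical */
{
	struct parent_disk *pd = sc->sc_parent;

	mutex_enter(&pd->pd_openlock);	/* also taken on first open */
	if (--pd->pd_opencnt == 0)
		(void)vn_close(pd->pd_rawvp, FREAD | FWRITE, NOCRED);
	mutex_exit(&pd->pd_openlock);
}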


Re: specfs/spec_vnops.c diagnostic assertion panic

2022-08-12 Thread Taylor R Campbell
> Date: Sat, 13 Aug 2022 05:59:01 +0700
> From: Robert Elz 
> 
> Why is fsck running on the block device though?   And devpubd too?   Given
> the reference to cdev_close() I'd assumed it was a char (raw) device that
> was being used, which would be what fsck certainly should be using.  But
> I see that 02-wedgenames uses the block /dev/dnN device .. should fix that.)

This part of my hypothesis was not quite right -- it looks like there
is no race between fsck and dkctl per se, because they both use the
character device like they should as confirmed by ktrace.

Revised hypothesis:

- fsck opens /dev/rdkN, and
- dkctl opens /dev/rdkM,

where dkN and dkM are different wedges on the same parent.

02-wedgenames doesn't do `dkctl raid0 listwedges'; rather, it does
`dkctl dkN getwedgeinfo', which opens the character device /dev/rdkN
as it should.  And fsck is still opening a character device as it
should.

Why is this a problem?  When you open or close /dev/rdkN or /dev/rdkM,
that goes through dkopen and dkclose in the kernel.  On first open,
dkopen opens the parent disk's _block device_ via dk_open_parent, and
on last close, dkclose closes it via dklastclose -> dk_close_parent.

However, while dkclose -> dklastclose -> dk_close_parent is in
progress, currently nothing prevents a concurrent dkopen on another
wedge of the same parent from trying to open the parent afresh.

And although dk_open_parent is serialized by the mutex dk_rawlock,
dk_close_parent was _deliberately_ made to run outside dk_rawlock back
in 2010:

https://mail-index.netbsd.org/source-changes/2010/08/04/msg012270.html
https://mail-index.netbsd.org/tech-kern/2010/07/27/msg008612.html

Judging by the stack trace there (mutex_destroy -> [presumably
disk_destroy elided by tail call optimization] -> raidclose) was
actually evidence of a bug that had already been fixed a year earlier,
but possibly not in the kernel bouyer@ was running:

https://mail-index.netbsd.org/source-changes/2009/07/23/msg223381.html

This change by dyoung@ moved the disk_destroy call from raidclose to
raid_detach where it more properly belongs.  Note also that there has
been _another_ call to dk_close_parent, in dkwedge_read, for the past
seven years now.

So I think we should move the call to dk_close_parent in dklastclose
back under dk_rawlock -- and also under dk_openlock.  This will allow
us to refactor the code to keep the mutex_enter/exit bracketing much
more obvious.

(Side note: I think we should eliminate dk_rawlock and just use
dk_openlock; they appear to be redundant, except in dkwedge_read, but
I'm not sure what logic uses _only_ dk_openlock and _not_ dk_rawlock
which is fruitfully concurrent with dkwedge_read.  Certainly
dkwedge_discover can't be called with either one held, and I don't see
why concurrency between dkwedge_read/add/list/delall/detach is
important.)

Can you try _reverting_ specfs_blockopen.patch, and _applying_ the
attached dkopenclose.patch, and see if you can reproduce any crash?

(I believe specfs_blockopen.patch is still necessary for other reasons
but it gets in the way of testing my hypothesis that dk_close_parent
needs to be serialized with dk_open_parent anyway.)
>From 5eaacbd7d9d6d6b3ccfea5564aebf9548f5b7564 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell 
Date: Fri, 12 Aug 2022 22:48:25 +
Subject: [PATCH] dk(9): Serialize closing parent's dk_rawvp with opening it.

Otherwise, the following events might happen:

- process 123 had /dev/rdkN open, starts close, enters dk_close_parent
- process 456 opens /dev/rdkM (same parent, different wedge), calls
  dk_open_parent

At this point, the block device hasn't yet closed, so dk_open_parent
will fail with EBUSY.  This is incorrect -- the chardev is never
supposed to fail with EBUSY, and dkopen/dkclose carefully manage
state to avoid opening the block device while it's still open.  The
problem is that dkopen in process 456 didn't wait for vn_close
in process 123 to finish before calling VOP_OPEN.

(Note: If it were the _same_ chardev /dev/rdkN in both processes,
then spec_open/close would prevent this.  But since it's a
_different_ chardev, spec_open/close assume that concurrency is OK,
and it's the driver's responsibility to serialize access to the
parent disk which, unbeknownst to spec_open/close, is shared between
dkN and dkM.)

It appears that the vn_close call was previously moved outside
dk_rawlock in 2010 to work around an unrelated bug in raidframe that
had already been fixed in HEAD:

Crash pointing to dk_rawlock and raidclose:
https://mail-index.netbsd.org/tech-kern/2010/07/27/msg008612.html

Change working around that crash:
https://mail-index.netbsd.org/source-changes/2010/08/04/msg012270.html

Change removing raidclose -> mutex_destroy(_rawlock) path:
https://mail-index.netbsd.org/source-changes/2009/07/23/msg223381.html
---
 sys/dev/dkwedge/dk.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/sys/dev/dkwedge/dk.c

Re: specfs/spec_vnops.c diagnostic assertion panic

2022-08-12 Thread Taylor R Campbell
Here's a hypothesis about what happened.

- You have a RAID volume, say raid0, with a GPT partitioning it.

- raid0 is configured with an explicit /etc/raid0.conf file, rather
  than with an autoconfigured label.

- You have devpubd=YES in /etc/rc.conf.

On boot, the following sequence of events occurs:

1. /etc/rc.d/devpubd launches devpubd, which synchronously enumerates
   devices and invokes hooks for all the devices it finds.  This
   _excludes_ raid0 because it hasn't been configured yet.

2. /etc/rc.d/raidframe configures raid0 from /etc/raid0.conf.

3. /etc/rc.d/fsck starts to run.

At this point, two things happen concurrently:

(a) /etc/rc.d/fsck runs fsck on dkN (some wedge of raid0)
(b) devpubd wakes and runs `dkctl raid0 listwedges' in 02-wedgenames

fsck and dkctl race to work on raid0 -- the block device.  Sometimes
this happens without trouble.  Sometimes one will try to open it while
the other is still using it, and the open will fail with EBUSY.  But
sometimes one tries to open while the other has started, but not yet
finished, closing it -- and that's when the crash happens.  With my
last patch, it should just fail with EBUSY in that case too.

Now there's a higher-level issue here -- once we fix the kernel crash,
dkctl as called via devpubd -> 02-wedgenames might lose the race with
fsck and then fail to create the wedge, or fsck might lose the race
and, well, fail to fsck your file system, both of which might be bad.

So we need to find a way to deal with this race even after we fix the
kernel crash.


Re: specfs/spec_vnops.c diagnostic assertion panic

2022-08-12 Thread Taylor R Campbell
> Date: Sat, 13 Aug 2022 02:13:23 +0700
> From: Robert Elz 
> 
> Apart from dkctl() and the two fsck_ffs processes, there was
> the parent fsck (just in wait) and (or course) init, and a whole
> set of sh processes (many in pipe_rd or similar) - which will be
> the infrastructure used for running rc scripts.   There's also devpubd
> (in wait) and sleep (in nanoslp).   That's it (and of course, lots and
> lots of kernel threads - things with pid==0).

This is interesting -- I think what happened is that fsck and the
dkctl in devpubd's hooks/02-wedgenames may have run concurrently.
(Can't find anything else that might run dkctl asynchronously at
boot.)

Might help to know what the process command-lines were for fsck and
dkctl if you catch it again (possibly without the patch I just sent in
a followup, in case that fixes the kernel crash).

Can you share dmesg, `drvctl -lt' output, and /etc/rc.conf (or any
other /etc/rc* configuration)?  Can you describe all the physical and
logical storage devices on your system?  (Privately is fine if you
prefer.)

Do you have any removable media that get detected asynchronously at
boot so devpubd might have finished starting up and started listening
for events in the background by the time that the medium appears but
just before fsck starts opening disks?


Re: specfs/spec_vnops.c diagnostic assertion panic

2022-08-12 Thread Taylor R Campbell
Can you try the attached patch and see if (a) it still panics, or if
(b) there are any other adverse consequences like fsck failing?
>From 7ade169092246e8870aaab2973d8bb252e625d89 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell 
Date: Fri, 12 Aug 2022 20:01:50 +
Subject: [PATCH] specfs: Refuse to open a closing-in-progress block device.

We could wait for close to complete, but let's test to see if this is
what is leading to the inconsistent states and later figure out
whether blocking or immediate failure is better.
---
 sys/miscfs/specfs/spec_vnops.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/sys/miscfs/specfs/spec_vnops.c b/sys/miscfs/specfs/spec_vnops.c
index 8b4ed335880a..e6295f9786de 100644
--- a/sys/miscfs/specfs/spec_vnops.c
+++ b/sys/miscfs/specfs/spec_vnops.c
@@ -789,8 +789,13 @@ spec_open(void *v)
 *
 * Treat zero opencnt with non-NULL mountpoint as open.
 * This may happen after forced detach of a mounted device.
+*
+* Also treat sd_closing, meaning there is a concurrent
+* close in progress, as still open.
 */
-   if (sd->sd_opencnt != 0 || sd->sd_mountpoint != NULL) {
+   if (sd->sd_opencnt != 0 ||
+   sd->sd_mountpoint != NULL ||
+   sd->sd_closing) {
error = EBUSY;
break;
}


Re: specfs/spec_vnops.c diagnostic assertion panic

2022-08-12 Thread Taylor R Campbell
I added a couple more assertions (spec_vnops.c 1.213).

Attached is an updated patch with the ABI change to record who was
closing when it shouldn't be possible.
>From 4df657b9ca2112c556eb7aaf7b6ed5b6912603c1 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell 
Date: Sat, 16 Apr 2022 11:45:06 +
Subject: [PATCH] specfs: Identify which lwp is closing for assertions.

May help with diagnosing:

https://mail-index.netbsd.org/current-users/2022/08/09/msg042800.html
https://syzkaller.appspot.com/bug?id=47c67ab6d3a87514d0707882a9ad6671beaa8642

XXX kernel ABI change
---
 sys/miscfs/specfs/spec_vnops.c | 14 --
 sys/miscfs/specfs/specdev.h|  2 +-
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/sys/miscfs/specfs/spec_vnops.c b/sys/miscfs/specfs/spec_vnops.c
index e1e2cbb84071..8b4ed335880a 100644
--- a/sys/miscfs/specfs/spec_vnops.c
+++ b/sys/miscfs/specfs/spec_vnops.c
@@ -397,7 +397,7 @@ spec_node_init(vnode_t *vp, dev_t rdev)
sd->sd_bdevvp = NULL;
sd->sd_iocnt = 0;
sd->sd_opened = false;
-   sd->sd_closing = false;
+   sd->sd_closing = NULL;
sn->sn_dev = sd;
sd = NULL;
} else {
@@ -967,7 +967,8 @@ spec_open(void *v)
KASSERTMSG(sn->sn_opencnt <= sd->sd_opencnt,
"sn_opencnt=%u > sd_opencnt=%u",
sn->sn_opencnt, sd->sd_opencnt);
-   KASSERT(!sd->sd_closing);
+   KASSERTMSG(sd->sd_closing == NULL, "sd_closing=%p",
+   sd->sd_closing);
sd->sd_opened = true;
} else if (sd->sd_opencnt == 1 && sd->sd_opened) {
/*
@@ -1698,9 +1699,10 @@ spec_close(void *v)
if (count == 0) {
KASSERTMSG(sn->sn_opencnt == 0, "sn_opencnt=%u",
sn->sn_opencnt);
-   KASSERT(!sd->sd_closing);
+   KASSERTMSG(sd->sd_closing == NULL, "sd_closing=%p",
+   sd->sd_closing);
sd->sd_opened = false;
-   sd->sd_closing = true;
+   sd->sd_closing = curlwp;
}
	mutex_exit(&device_lock);
 
@@ -1750,8 +1752,8 @@ spec_close(void *v)
 */
	mutex_enter(&device_lock);
KASSERT(!sd->sd_opened);
-   KASSERT(sd->sd_closing);
-   sd->sd_closing = false;
+   KASSERTMSG(sd->sd_closing == curlwp, "sd_closing=%p", sd->sd_closing);
+   sd->sd_closing = NULL;
	cv_broadcast(&specfs_iocv);
	mutex_exit(&device_lock);
 
diff --git a/sys/miscfs/specfs/specdev.h b/sys/miscfs/specfs/specdev.h
index c55ab7aaa90a..4f3d315b5053 100644
--- a/sys/miscfs/specfs/specdev.h
+++ b/sys/miscfs/specfs/specdev.h
@@ -80,7 +80,7 @@ typedef struct specdev {
dev_t   sd_rdev;
volatile u_int  sd_iocnt;   /* # bdev/cdev_* operations active */
boolsd_opened;  /* true if successfully opened */
-   boolsd_closing; /* true when bdev/cdev_close ongoing */
+   struct lwp  *sd_closing;/* true when bdev/cdev_close ongoing */
 } specdev_t;
 
 /*


Re: specfs/spec_vnops.c diagnostic assertion panic

2022-08-12 Thread Taylor R Campbell
> Date: Fri, 12 Aug 2022 23:28:13 +0700
> From: Robert Elz 
> 
> I think I am going to try backing out the patch, and just run with your
> (Taylor's) updated specfs_vnops.c in case there is something in the patch
> which is altering the behaviour (more than just the nature of the sd_closing
> field type).

This is extremely unlikely.  You might try removing the assertion that
it tripped on, KASSERT(!sd->sd_opened), so that it has the opportunity
to trip on the new assertion with more diagnostic information in it
instead.


Re: specfs/spec_vnops.c diagnostic assertion panic

2022-08-11 Thread Taylor R Campbell
> Date: Thu, 11 Aug 2022 13:01:34 +
> From: Taylor R Campbell 
> 
> If that still doesn't help, you could try the attached patch (which
> I'm not committing at the moment because it changes the kernel ABI).

...patch actually attached now
>From 11428c92e7bc51a35942c2024b95c756af8c9fc6 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell 
Date: Sat, 16 Apr 2022 11:45:06 +
Subject: [PATCH] specfs: Identify which lwp is closing for assertions.

May help with diagnosing:

https://mail-index.netbsd.org/current-users/2022/08/09/msg042800.html
https://syzkaller.appspot.com/bug?id=47c67ab6d3a87514d0707882a9ad6671beaa8642

XXX kernel ABI change
---
 sys/miscfs/specfs/spec_vnops.c | 11 ++-
 sys/miscfs/specfs/specdev.h|  2 +-
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/sys/miscfs/specfs/spec_vnops.c b/sys/miscfs/specfs/spec_vnops.c
index 762939696c17..c386a49fe034 100644
--- a/sys/miscfs/specfs/spec_vnops.c
+++ b/sys/miscfs/specfs/spec_vnops.c
@@ -397,7 +397,7 @@ spec_node_init(vnode_t *vp, dev_t rdev)
sd->sd_bdevvp = NULL;
sd->sd_iocnt = 0;
sd->sd_opened = false;
-   sd->sd_closing = false;
+   sd->sd_closing = NULL;
sn->sn_dev = sd;
sd = NULL;
} else {
@@ -1686,9 +1686,10 @@ spec_close(void *v)
if (count == 0) {
KASSERTMSG(sn->sn_opencnt == 0, "sn_opencnt=%u",
sn->sn_opencnt);
-   KASSERT(!sd->sd_closing);
+   KASSERTMSG(sd->sd_closing == NULL, "sd_closing=%p",
+   sd->sd_closing);
sd->sd_opened = false;
-   sd->sd_closing = true;
+   sd->sd_closing = curlwp;
}
	mutex_exit(&device_lock);
 
@@ -1738,8 +1739,8 @@ spec_close(void *v)
 */
	mutex_enter(&device_lock);
KASSERT(!sd->sd_opened);
-   KASSERT(sd->sd_closing);
-   sd->sd_closing = false;
+   KASSERTMSG(sd->sd_closing == curlwp, "sd_closing=%p", sd->sd_closing);
+   sd->sd_closing = NULL;
	cv_broadcast(&specfs_iocv);
	mutex_exit(&device_lock);
 
diff --git a/sys/miscfs/specfs/specdev.h b/sys/miscfs/specfs/specdev.h
index c55ab7aaa90a..4f3d315b5053 100644
--- a/sys/miscfs/specfs/specdev.h
+++ b/sys/miscfs/specfs/specdev.h
@@ -80,7 +80,7 @@ typedef struct specdev {
dev_t   sd_rdev;
volatile u_int  sd_iocnt;   /* # bdev/cdev_* operations active */
boolsd_opened;  /* true if successfully opened */
-   boolsd_closing; /* true when bdev/cdev_close ongoing */
+   struct lwp  *sd_closing;/* true when bdev/cdev_close ongoing */
 } specdev_t;
 
 /*


Re: specfs/spec_vnops.c diagnostic assertion panic

2022-08-11 Thread Taylor R Campbell
I've seen something like that in syzkaller before:

https://syzkaller.appspot.com/bug?id=47c67ab6d3a87514d0707882a9ad6671beaa8642

If you cvs up spec_vnops.c to 1.211, you may get a different assertion
firing, or not; which assertion fires now might potentially help to
diagnose the problem.

If that still doesn't help, you could try the attached patch (which
I'm not committing at the moment because it changes the kernel ABI).
With the patch, it may hit the following panic:

panic: kernel diagnostic assertion "sd->sd_closing == NULL" failed: file 
"/syzkaller/jobs/netbsd/kernel/sys/miscfs/specfs/spec_vnops.c", line 1675 
sd_closing=0xbc0013315500

If so, the `sd_closing=0xbc0013315500' part gives you an lwp
address, which you can feed to `bt/a' to get a stack trace:

db> bt/a 0xbc0013315500

Finding out what that lwp is up to might help diagnose the problem.


Re: Virtio Viocon driver - possible to backport from OpenBSD?

2022-08-09 Thread Taylor R Campbell
The attached patch adds viocon(4) to NetBSD.  Lightly tested under
qemu.  Review welcome!  I'm not very familiar with tty drivers so I'm
hoping someone who is will take a look and see if I did anything the
wrong way in copying from OpenBSD.

As in OpenBSD, it uses new device nodes /dev/ttyVI[0-9][0-9], where
the first digit is the unit number and the second digit is the port
number.  The port number is currently always zero but in principle it
might change.  These are now unconditionally included in the MI
MAKEDEV.tmpl, and ttyVI00 through ttyVI30 are created by default.
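
In other words the minor number packs 16 ports per unit, matching the
MAKEDEV arithmetic in the patch below: ttyVI12 would be unit 1, port 2,
minor 16*1 + 2 = 18.  Illustrative helpers only (the driver's own
macros may be spelled differently):

static int
viocon_minor_of(int unit, int port)
{

	return 16 * unit + port;	/* ttyVI12 -> 18 */
}

static int
viocon_unit_of(int minor)
{

	return minor / 16;
}

static int
viocon_port_of(int minor)
{

	return minor % 16;
}
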
>From 50c5c4d195f82ad7b665a7e0adfa922a448e9cfb Mon Sep 17 00:00:00 2001
From: Taylor R Campbell 
Date: Tue, 9 Aug 2022 14:09:38 +
Subject: [PATCH] viocon(4): New virtio tty driver imported from OpenBSD.

viocon* at virtio?

Tested under qemu with:

qemu-system-aarch64 ... \
  -device virtio-serial \
  -chardev socket,path=/tmp/ttyVI00,server=on,wait=off,id=ttyVI00 \
  -device virtconsole,chardev=ttyVI00,name=org.NetBSD.dev.ttyVI00 \
  ...
---
 distrib/sets/lists/man/mi   |   3 +
 etc/MAKEDEV.tmpl|   9 +
 share/man/man4/Makefile |   2 +-
 share/man/man4/viocon.4 |  66 
 sys/conf/majors |   1 +
 sys/dev/virtio/files.virtio |   4 +
 sys/dev/virtio/viocon.c | 632 
 7 files changed, 716 insertions(+), 1 deletion(-)
 create mode 100644 share/man/man4/viocon.4
 create mode 100644 sys/dev/virtio/viocon.c

diff --git a/distrib/sets/lists/man/mi b/distrib/sets/lists/man/mi
index 3b354d653ca0..c56263b6dffc 100644
--- a/distrib/sets/lists/man/mi
+++ b/distrib/sets/lists/man/mi
@@ -2047,6 +2047,7 @@
 ./usr/share/man/cat4/video.0   man-sys-catman  .cat
 ./usr/share/man/cat4/vinum.0   man-obsoleteobsolete
 ./usr/share/man/cat4/vio9p.0   man-sys-catman  .cat
+./usr/share/man/cat4/viocon.0  man-sys-catman  .cat
 ./usr/share/man/cat4/vioif.0   man-sys-catman  .cat
 ./usr/share/man/cat4/viomb.0   man-sys-catman  .cat
 ./usr/share/man/cat4/viornd.0  man-sys-catman  .cat
@@ -5233,6 +5234,7 @@
 ./usr/share/man/html4/viaide.html  man-sys-htmlman html
 ./usr/share/man/html4/video.html   man-sys-htmlman html
 ./usr/share/man/html4/vio9p.html   man-sys-htmlman html
+./usr/share/man/html4/viocon.html  man-sys-htmlman html
 ./usr/share/man/html4/vioif.html   man-sys-htmlman html
 ./usr/share/man/html4/viomb.html   man-sys-htmlman html
 ./usr/share/man/html4/viornd.html  man-sys-htmlman html
@@ -8351,6 +8353,7 @@
 ./usr/share/man/man4/video.4   man-sys-man .man
 ./usr/share/man/man4/vinum.4   man-obsoleteobsolete
 ./usr/share/man/man4/vio9p.4   man-sys-man .man
+./usr/share/man/man4/viocon.4  man-sys-man .man
 ./usr/share/man/man4/vioif.4   man-sys-man .man
 ./usr/share/man/man4/viomb.4   man-sys-man .man
 ./usr/share/man/man4/viornd.4  man-sys-man .man
diff --git a/etc/MAKEDEV.tmpl b/etc/MAKEDEV.tmpl
index 51fb821f244b..6576679045a8 100644
--- a/etc/MAKEDEV.tmpl
+++ b/etc/MAKEDEV.tmpl
@@ -148,6 +148,7 @@
 #  dmz*UNIBUS DMZ32 (vax)
 #  dl* UNIBUS DL11 (vax)
 #  xencons Xen virtual console
+#  ttyVI?? viocon(4)
 #
 # Terminal multiplexors:
 #  dc* 4 channel serial interface (keyboard, mouse, modem, printer)
@@ -849,6 +850,7 @@ all)
makedev qemufwcfg
makedev sht3xtemp0
makedev scmd0
+   makedev ttyVI00 ttyVI10 ttyVI20 ttyVI30
makedev local # do this last
;;
 
@@ -2272,6 +2274,13 @@ scmd[0-9]*)
mkdev scmd$unit c %scmd_chr% $unit 666
;;
 
+ttyVI[0-9][0-9])
+   port=${i#ttyVI?}
+   devunit=${i%$port}
+   unit=${devunit#ttyVI}
+   mkdev ttyVI$unit$port c %viocon_chr% $((16*$unit + $port))
+   ;;
+
 midevend)
 %MI_DEVICES_END%
 local)
diff --git a/share/man/man4/Makefile b/share/man/man4/Makefile
index d98073dccd63..897dca9ebd64 100644
--- a/share/man/man4/Makefile
+++ b/share/man/man4/Makefile
@@ -68,7 +68,7 @@ MAN=  aac.4 ac97.4 acardide.4 aceride.4 acphy.4 \
uark.4 ubsec.4 udp.4 uep.4 ug.4 uha.4 uk.4 ukphy.4 umb.4 \
unix.4 userconf.4 \
vald.4 valz.4 veriexec.4 vga.4 vge.4 viaide.4 video.4 \
-   vio9p.4 vioif.4 viomb.4 viornd.4 vioscsi.4 virt.4 virtio.4 \
+   vio9p.4 viocon.4 vioif.4 viomb.4 viornd.4 vioscsi.4 virt.4 virtio.4 \
vether.4 vlan.4 vmmon.4 vmnet.4 vmt.4 vmx.4 vnd.4 voodoofb.4 vr.4 \
vte.4 \
wapbl.4 wb.4 wbsio.4 wd.4 wdc.4 wg.4 wi.4 wm.4 wpi.4 \
diff --git a/share/man/man4/viocon.4 b/share/man/man4/viocon.4
new f

Re: kernel deadlock on fstchg with vnd

2022-05-31 Thread Taylor R Campbell
> Date: Mon, 30 May 2022 14:33:42 +0200
> From: "J. Hannken-Illjes" 
> 
> >>  1767  /* Nuke the vnodes for any open instances */
> >>  1768  for (i = 0; i < MAXPARTITIONS; i++) {
> >>  1769  mn = DISKMINOR(device_unit(vnd->sc_dev), i);
> >>  1770  vdevgone(bmaj, mn, mn, VBLK);
> >>  1771  if (mn != myminor) /* XXX avoid to kill own vnode */
> >>  1772  vdevgone(cmaj, mn, mn, VCHR);
> >>  1773  }
> >> 
> >> The "skip myself" on lines 1771/1772 is responsible for this behaviour.
> > 
> > Yes and doing the same for block devices avoids the issue.
> > But Taylor is reluctant to commit this hack.
> 
> And he is right.  It smells fishy to detach a (pseudo) device from
> an open instance of itself, either with ioctl or close.
> 
> Why do we detach on last close -- isn't it sufficient to detach
> either explicit with drvctl(8) or on module unload?
> 
> The attached diff moves vdevgone() to vnd_detach() and no longer
> detaches on last close -- comments?

Sorry, I got detached from this thread and committed the workaround
already before I noticed your patch.

The idea of these devices -- vnd, as well as cgd and ccd, and possibly
others I'm not thinking of -- is that they are created on-demand, and
persist only as long as they are in use.

With your patch, there is no path in normal vnconfig usage that
detaches the device; operators would have to go out of their way to
use `drvctl -d' (and drvctl itself is racy with detach -- that also
needs to be fixed) to free up any resources used by no-longer-used
vnd(4) autoconf instances.

That's not to say the current state of affairs is good -- it's
obviously broken in some ways (either the revoke is unnecessary, or
(a) it's necessary, and (b) failing to revoke the chardev is
insufficient because the chardev can have more than one open), and
every driver handles the state differently, so the autoconf API we
have is not serving the needs of drivers to allocate minor numbers.

I think that it would probably be better if we had a matching pair of
operations, say spawn and unspawn, that can be used like so:

xyzopen(dev_t dev)
{
device_t self;
struct xyz_softc *sc;

self = config_pseudo_spawn(xyzunit(dev));
sc = device_private(self);
...
}

xyzclose(dev_t dev)  // d_close callback, invoked on last close
{
device_t self = device_lookup(&xyz_cd, xyzunit(dev));
struct xyz_softc *sc = device_private(self);

if (!sc->sc_configured)
config_pseudo_unspawn(self);
}

This would need to be integrated into autoconf/specfs to avoid races
in config_pseudo_spawn, sc->sc_configured, and config_pseudo_unspawn,
and if there's any revoke involved, the unspawn might be scheduled
asynchronously to make it work.  Or, maybe spawn/unspawn should be
done in ioctl VNDIOCSET and VNDIOCCLR, and open/close can be empty
stubs.

But I haven't yet thought through the details of a good way to deal
with these pseudo-devices -- I put it off during the recent
open/detach rototill.


Re: uvideo uvm_fault panic

2022-05-14 Thread Taylor R Campbell
> Date: Sat, 14 May 2022 15:14:50 +
> From: sc.dy...@gmail.com
> 
> On 2022/05/14 13:47, Taylor R Campbell wrote:
> > Can you please try the two attached patches?
> > 
> > 1. uvideobadstream.patch should fix the _crash_ when you try to open a
> >video stream on a device that the driver deemed to have a bad
> >descriptor.  Try this one first, if you have time -- it prevents a
> >malicious USB device from causing this kernel crash.
> 
> Yes, it fixes crash -- as it prevents attaching video(4)...
> 
> > 2. uvideosizeof.patch should fix the sizeof calculations so that the
> >driver stops rejecting your device's descriptor.  This one should
> >make your device work again.
> 
> It works well again.

Thanks, committed!


Re: entropy startup panic

2022-03-24 Thread Taylor R Campbell
> Date: Thu, 24 Mar 2022 14:23:09 +0100
> From: Thomas Klausner 
> 
> On Thu, Mar 24, 2022 at 01:03:11PM +0000, Taylor R Campbell wrote:
> > > Date: Thu, 24 Mar 2022 11:29:16 +0100
> > > From: Thomas Klausner 
> > > 
> > > entropy_account_cpu() at netbsd:entropy_account_cpy+0x211
> > 
> > Errr...  I might have misdiagnosed this -- do you have the panic
> > message with a line number for the assertion?
> 
> Sorry, no - it wasn't on the screen, and not in the messagebuffer
> after the reboot either.

OK, I disassembled kern_entropy.o in a GENERIC kernel and it looks
like entropy_account_cpu+0x211 is the assertion I thought it was
(which would make sense), so try again with kern_entropy.c 1.54!


Re: entropy startup panic

2022-03-24 Thread Taylor R Campbell
> Date: Thu, 24 Mar 2022 11:29:16 +0100
> From: Thomas Klausner 
> 
> I've just updated from a 3 days old kernel to todays, and it didn't
> finish booting up.
> 
> The console output, hand-copied:
> 
> breakpoint() at netbsd:breakpoint
> vpanic() at netbsd:vpanic+0x156
> kern_assert() at netbsd:kern_assert+0x4b
> entropy_account_cpu() at netbsd:entropy_account_cpy+0x211
> entropy_softintr() at netbsd:entropy_softintr+0x4e
> xc_broadcast() at netbsd:xc_broadcast+0x233
> rnd_init_softint() at netbsd:rnd_init_softint+0x95
> main() at netbsd:main+0x331

Interesting!  I see the bug, and I've committed what I think should
fix it in kern_entropy.c 1.54, but I'm curious what assortment of
entropy sources led you to trip over this -- can you send me dmesg and
rndctl -l?


Re: HEADS UP: Merging drm update

2022-01-03 Thread Taylor R Campbell
> Date: Mon, 03 Jan 2022 22:08:26 +0900
> From: Ryo ONODERA 
> 
> Ryo ONODERA  writes:
> 
> > __engines_record_defaults
> > intel_gt_wait_for_idle
> > intel_gt_retire_requests_timeout
> > dma_fence_wait_timeout
> > i915_fence_wait (via *fence->ops->wait)
> > i915_request_wait
> >
> > In i915_request_wait, DRM_SPIN_TIMED_WAIT_UNTIL sets timeout=0
> > and i915_request_wait returns timeout=-ETIME.

Thanks, I think there's something going wrong either with submitting
requests or noticing that they're submitted, but it only happens on
some machines.  It might be related to some of the other panics around
requests, like the cv_is_valid panic in signal_irq_work (56561), or it
might be that something is mapped wrong or there's a missing cache flush /
bus_dmamap_sync...  Will investigate when I have a chance.


Re: HEADS UP: Merging drm update (Lenovo X230 mode switch issue in UEFI mode only, BIOS works)

2021-12-31 Thread Taylor R Campbell
> Date: Fri, 31 Dec 2021 11:30:32 +0100
> From: Matthias Petermann 
> 
> - When I boot current in UEFI mode, after the mode switch it only 
> displays a blank screen with a white background. After that, within a 
> few seconds, a kind of randomly structured dark spot develops from the 
> center of the screen, which then stretches to the edge of the screen [1].

Can you get dmesg from the previous boot if you reboot back in BIOS
mode?  Or can you ssh in and get it, or get it over a serial console?


Re: HEADS UP: Merging drm update

2021-12-28 Thread Taylor R Campbell
> Date: Tue, 28 Dec 2021 11:34:43 +0900
> From: Ryo ONODERA 
> 
> intel_gt_pm_fini() at netbsd:intel_gt_pm_fini+0x18
> intel_gt_init() at netbsd:intel_gt_init+0x6ad
> i915_gem_init() at netbsd:i915_gem_init+0x14d
> i915_driver_probe() at netbsd:i915_driver_probe+0x949
> i915drmkms_attach_real() at netbsd:i915drmkms_attach_real+0x4c
> config_mountroot_thread() at netbsd:config_mountroot_thread+0x60

So intel_gt_init is failing on boot, and the driver has decided to
give up -- and the proximate cause of the crash is that one of the error
branches is screwy.  But while it would be nice to fix the error
branches, it's more important to find out why we're reaching them in the
first place.

Can you get a line number for intel_gt_init+0x6ad, and can you also
insert prints into every error branch of intel_gt_init to find out
which one it is and how it fails?  And maybe do that recursively in
whichever branch does fail?
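
Something as blunt as this in each error branch is fine (a sketch --
the real step functions and labels in intel_gt_init will differ):

	err = some_gt_init_step(gt);	/* hypothetical step */
	if (err) {
		printf("%s:%d: step failed: %d\n",
		    __func__, __LINE__, err);
		goto out;		/* hypothetical error path */
	}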


Re: HEADS UP: Merging drm update

2021-12-27 Thread Taylor R Campbell
> Date: Tue, 28 Dec 2021 08:41:16 +0900
> From: Ryo ONODERA 
> 
> panic: mutex_vector_enter,518: uninitialized lock (lock=0xd80023501458, 
> from=807c091c)
> cpu0: Begin traceback...
> vpanic() at netbsd:vpanic+0x156
> panic() at netbsd:panic+0x3c
> lockdebug_wantlock() at netbsd:lockdebug_wantlock+0x180
> mutex_enter() at netbsd:mutex_enter+0x213
> intel_rps_mark_interactive() at netbsd:intel_rps_mark_interactive+0x1e

Weird -- can you insert db_stacktrace() into intel_rps_init_early and
intel_rps_fini?  This looks like i915->gt.rps.power.mutex, judging by
the stack trace, but I don't see how we can get here without first
initializing the mutex.


Re: HEADS UP: Merging drm update

2021-12-27 Thread Taylor R Campbell
> Date: Mon, 27 Dec 2021 12:34:10 +0900
> From: Ryo ONODERA 
> 
> panic: kernel diagnostic assertion "!(!i915_request_completed(rq))" failed: 
> file "/usr/src/sys/external/bsd/drm2/dist/drm/i915/gt/intel_gt.c", line 475 

I committed a change that I think will fix the particular bug leading
to this assertion, but I suspect it won't fix the underlying issue.
Nevertheless, can you cvs up and try again?


Re: HEADS UP: Merging drm update

2021-12-26 Thread Taylor R Campbell
Can you try the attached patch?  Also, if you haven't already, can you
try running a DIAGNOSTIC/DEBUG/LOCKDEBUG kernel?
>From 025745d6f43aff21c88bc9e3b393ce146fc9af04 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell 
Date: Sun, 26 Dec 2021 16:05:20 +
Subject: [PATCH] drm: Fix locking around accurate vblank counts.

- Make drm_crtc_accurate_vblank_count require the caller to hold the
  event lock, rather than take it internally.

- Fix locking around drm_crtc_accurate_vblank_count and related
  operations in amdgpu and nouveau interrupt handlers.

- Use drm_crtc_vblank_put_locked, not drm_crtc_vblank_put, when we
  already hold the event lock.
---
 .../dist/drm/amd/display/amdgpu_dm/amdgpu_dm.c|  2 +-
 .../drm/amd/display/amdgpu_dm/amdgpu_dm_crc.c |  2 ++
 sys/external/bsd/drm2/dist/drm/drm_vblank.c   | 15 ++-
 .../drm/nouveau/dispnv50/nouveau_dispnv50_disp.c  |  2 +-
 4 files changed, 6 insertions(+), 15 deletions(-)

diff --git a/sys/external/bsd/drm2/dist/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/sys/external/bsd/drm2/dist/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index c8cbf7a6f68e..d0386f514457 100644
--- a/sys/external/bsd/drm2/dist/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/sys/external/bsd/drm2/dist/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -361,7 +361,7 @@ static void dm_pflip_high_irq(void *interrupt_params)
	drm_crtc_send_vblank_event(&amdgpu_crtc->base, e);
 
/* Event sent, so done with vblank for this flip */
-   drm_crtc_vblank_put(&amdgpu_crtc->base);
+   drm_crtc_vblank_put_locked(&amdgpu_crtc->base);
}
} else if (e) {
/* VRR active and inside front-porch: vblank count and
diff --git 
a/sys/external/bsd/drm2/dist/drm/amd/display/amdgpu_dm/amdgpu_dm_crc.c 
b/sys/external/bsd/drm2/dist/drm/amd/display/amdgpu_dm/amdgpu_dm_crc.c
index 79cb94ea67dc..5a154037c524 100644
--- a/sys/external/bsd/drm2/dist/drm/amd/display/amdgpu_dm/amdgpu_dm_crc.c
+++ b/sys/external/bsd/drm2/dist/drm/amd/display/amdgpu_dm/amdgpu_dm_crc.c
@@ -319,7 +319,9 @@ void amdgpu_dm_crtc_handle_crc_irq(struct drm_crtc *crtc)
   &crcs[0], &crcs[1], &crcs[2]))
return;
 
+   spin_lock(&crtc->dev->event_lock);
drm_crtc_add_crc_entry(crtc, true,
   drm_crtc_accurate_vblank_count(crtc), 
crcs);
+   spin_unlock(&crtc->dev->event_lock);
}
 }
diff --git a/sys/external/bsd/drm2/dist/drm/drm_vblank.c 
b/sys/external/bsd/drm2/dist/drm/drm_vblank.c
index 9ced81bd7ed8..ec91bbe9391b 100644
--- a/sys/external/bsd/drm2/dist/drm/drm_vblank.c
+++ b/sys/external/bsd/drm2/dist/drm/drm_vblank.c
@@ -337,7 +337,7 @@ static u64 drm_vblank_count(struct drm_device *dev, 
unsigned int pipe)
  * This is mostly useful for hardware that can obtain the scanout position, but
  * doesn't have a hardware frame counter.
  */
-static u64 drm_crtc_accurate_vblank_count_locked(struct drm_crtc *crtc)
+u64 drm_crtc_accurate_vblank_count(struct drm_crtc *crtc)
 {
struct drm_device *dev = crtc->dev;
unsigned int pipe = drm_crtc_index(crtc);
@@ -358,17 +358,6 @@ static u64 drm_crtc_accurate_vblank_count_locked(struct 
drm_crtc *crtc)
 
return vblank;
 }
-
-u64 drm_crtc_accurate_vblank_count(struct drm_crtc *crtc)
-{
-   u64 vblank;
-
-   spin_lock(&crtc->dev->event_lock);
-   vblank = drm_crtc_accurate_vblank_count_locked(crtc);
-   spin_unlock(&crtc->dev->event_lock);
-
-   return vblank;
-}
 EXPORT_SYMBOL(drm_crtc_accurate_vblank_count);
 
 static void __disable_vblank(struct drm_device *dev, unsigned int pipe)
@@ -972,7 +961,7 @@ void drm_crtc_arm_vblank_event(struct drm_crtc *crtc,
	assert_spin_locked(&dev->event_lock);
 
e->pipe = pipe;
-   e->sequence = drm_crtc_accurate_vblank_count_locked(crtc) + 1;
+   e->sequence = drm_crtc_accurate_vblank_count(crtc) + 1;
	list_add_tail(&e->base.link, &dev->vblank_event_list);
 }
 EXPORT_SYMBOL(drm_crtc_arm_vblank_event);
diff --git 
a/sys/external/bsd/drm2/dist/drm/nouveau/dispnv50/nouveau_dispnv50_disp.c 
b/sys/external/bsd/drm2/dist/drm/nouveau/dispnv50/nouveau_dispnv50_disp.c
index b9423cdc84fb..f49bb2ddf56d 100644
--- a/sys/external/bsd/drm2/dist/drm/nouveau/dispnv50/nouveau_dispnv50_disp.c
+++ b/sys/external/bsd/drm2/dist/drm/nouveau/dispnv50/nouveau_dispnv50_disp.c
@@ -2136,9 +2136,9 @@ nv50_disp_atomic_commit_tail(struct drm_atomic_state 
*state)
if (new_crtc_state->event) {
unsigned long flags;
/* Get correct count/ts if racing with vblank irq */
+   spin_lock_irqsave(&crtc->dev->event_lock, flags);
if (new_crtc_state->active)
drm_crtc_accurate_vblank_count(crtc);
-   spin_lock_irqsave(&crtc->dev->

Re: HEADS UP: Merging drm update

2021-12-25 Thread Taylor R Campbell
> Date: Sun, 26 Dec 2021 02:37:31 +0900
> From: Ryo ONODERA 
> 
> If panic is removed, my LCD turns black finally and stay black forever
> and does not reach to USB serial console.
> [...]
> However I am not sure because I cannot get any message.
> 
> Do you have an idea to investigate deeper?

Does your laptop have anything like Intel AMT which you could use to
get serial-over-LAN?

Maybe you can put in a panic later, after the display changes, and
build your kernel with DDB_COMMANDONENTER="bt;sync", so you might be
able to either (a) get a crash dump or (b) see the last boot's dmesg
when the machine resets (and you then boot an older kernel)?


Re: HEADS UP: Merging drm update

2021-12-24 Thread Taylor R Campbell
Better yet -- can you try the attached patch?
>From e484fe666999730543f490ce6084486f7d7ce524 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell 
Date: Fri, 24 Dec 2021 11:12:43 +
Subject: [PATCH] i915: Use AcpiOsMapMemory, not bus_space_map, for opregion.

Needed because this appears in firmware-type memory mappings, which
are excluded from bus_space_map.

XXX pullup-9 (via manual patch since the code has changed a bit)
---
 .../dist/drm/i915/display/intel_opregion.c| 34 +--
 .../dist/drm/i915/display/intel_opregion.h| 14 
 2 files changed, 8 insertions(+), 40 deletions(-)

diff --git a/sys/external/bsd/drm2/dist/drm/i915/display/intel_opregion.c 
b/sys/external/bsd/drm2/dist/drm/i915/display/intel_opregion.c
index 4b0412828d49..b2f986bd74c0 100644
--- a/sys/external/bsd/drm2/dist/drm/i915/display/intel_opregion.c
+++ b/sys/external/bsd/drm2/dist/drm/i915/display/intel_opregion.c
@@ -952,15 +952,7 @@ int intel_opregion_setup(struct drm_i915_private *dev_priv)
INIT_WORK(&opregion->asle_work, asle_work);
 
 #ifdef __NetBSD__
-   opregion->bst = pdev->pd_pa.pa_memt;
-   err = -bus_space_map(opregion->bst, asls, OPREGION_SIZE,
-   BUS_SPACE_MAP_LINEAR|BUS_SPACE_MAP_CACHEABLE,
-   &opregion->asls_bsh);
-   if (err) {
-   DRM_DEBUG_DRIVER("Failed to map opregion: %d\n", err);
-   return err;
-   }
-   base = bus_space_vaddr(opregion->bst, opregion->asls_bsh);
+   base = AcpiOsMapMemory(asls, OPREGION_SIZE);
 #else
base = memremap(asls, OPREGION_SIZE, MEMREMAP_WB);
 #endif
@@ -1035,14 +1027,7 @@ int intel_opregion_setup(struct drm_i915_private 
*dev_priv)
}
 
 #ifdef __NetBSD__
-   if (bus_space_map(opregion->bst, rvda,
-   opregion->asle->rvds,
-   BUS_SPACE_MAP_LINEAR|BUS_SPACE_MAP_CACHEABLE,
-   &opregion->rvda_bsh))
-   opregion->rvda = NULL;
-   else
-   opregion->rvda = bus_space_vaddr(opregion->bst,
-   opregion->rvda_bsh);
+   opregion->rvda = AcpiOsMapMemory(rvda, opregion->asle->rvds);
 #else
opregion->rvda = memremap(rvda, opregion->asle->rvds,
  MEMREMAP_WB);
@@ -1058,11 +1043,8 @@ int intel_opregion_setup(struct drm_i915_private 
*dev_priv)
} else {
DRM_DEBUG_KMS("Invalid VBT in ACPI OpRegion (RVDA)\n");
 #ifdef __NetBSD__
-   if (opregion->rvda) {
-   bus_space_unmap(opregion->bst,
-   opregion->rvda_bsh,
-   opregion->asle->rvds);
-   }
+   AcpiOsUnmapMemory(opregion->rvda,
+   opregion->asle->rvds);
 #else
memunmap(opregion->rvda);
 #endif
@@ -1094,7 +1076,7 @@ out:
 
 err_out:
 #ifdef __NetBSD__
-   bus_space_unmap(opregion->bst, opregion->asls_bsh, OPREGION_SIZE);
+   AcpiOsUnmapMemory(base, OPREGION_SIZE);
 #else
memunmap(base);
 #endif
@@ -1251,14 +1233,14 @@ void intel_opregion_unregister(struct drm_i915_private 
*i915)
 
/* just clear all opregion memory pointers now */
 #ifdef __NetBSD__
-   bus_space_unmap(opregion->bst, opregion->asls_bsh, OPREGION_SIZE);
+   size_t rvds = opregion->asle->rvds;
+   AcpiOsUnmapMemory(opregion->header, OPREGION_SIZE);
 #else
memunmap(opregion->header);
 #endif
if (opregion->rvda) {
 #ifdef __NetBSD__
-   bus_space_unmap(opregion->bst, opregion->rvda_bsh,
-   opregion->asle->rvds);
+   AcpiOsUnmapMemory(opregion->rvda, rvds);
 #else
memunmap(opregion->rvda);
 #endif
diff --git a/sys/external/bsd/drm2/dist/drm/i915/display/intel_opregion.h 
b/sys/external/bsd/drm2/dist/drm/i915/display/intel_opregion.h
index a04f1c9d9f74..879c4135f670 100644
--- a/sys/external/bsd/drm2/dist/drm/i915/display/intel_opregion.h
+++ b/sys/external/bsd/drm2/dist/drm/i915/display/intel_opregion.h
@@ -38,17 +38,7 @@ struct opregion_acpi;
 struct opregion_swsci;
 struct opregion_asle;
 
-#ifdef __NetBSD__  /* XXX acpi iomem */
-#  include 
-#  define  __iomem __acpi_iomem
-#endif
-
 struct intel_opregion {
-#ifdef __NetBSD__
-   bus_space_tag_t bst;
-   bus_space_handle_t asls_bsh;
-   bus_space_handle_t rvda_bsh;
-#endif
struct opregion_header *header;
struct opregion_acpi *acpi;
struct opregion_swsci *swsci;
@@ -66,10 +56,6 @@ struct intel_opregion {
 
 #define OPREGION_SIZE    (8 * 1024)
 
-#ifdef __NetBSD__  /* XXX acpi iomem */
-#  undef   __iomem
-#endif
-
 #ifdef CONFIG_ACPI
 
 int intel_opregion_setup(struct drm_i915_private *dev_priv);


Re: HEADS UP: Merging drm update

2021-12-24 Thread Taylor R Campbell
> Date: Thu, 23 Dec 2021 17:26:49 +
> From: Taylor R Campbell 
> 
> > Date: Fri, 24 Dec 2021 01:53:03 +0900
> > From: Ryo ONODERA 
> > 
> > And with this patch, I have gotten the following dmesg:
> > This has no bus_space_map and extent_alloc_subregion1...
> 
> OK, can you try the attached patch and see if it gives us any clues in
> dmesg?  This prints a stack trace any time subr_extent.c writes to a
> struct extent_region and the region now covers the relevant space.

The extent regions might be created too early in x86_parse_clusters,
before dmesg starts recording.

Can you dump bootinfo, e.g. from crash(8)?  Running this on an older
working kernel will do -- doesn't have to be from the broken one.

x/bx bootinfo,1000

(bootinfo might be more than 0x1000 bytes long, but this will probably
suffice to find what covers ASLS=0x63ec5018.)


Re: HEADS UP: Merging drm update

2021-12-23 Thread Taylor R Campbell
> Date: Fri, 24 Dec 2021 01:53:03 +0900
> From: Ryo ONODERA 
> 
> And with this patch, I have gotten the following dmesg:
> This has no bus_space_map and extent_alloc_subregion1...

OK, can you try the attached patch and see if it gives us any clues in
dmesg?  This prints a stack trace any time subr_extent.c writes to a
struct extent_region and the region now covers the relevant space.
diff --git a/sys/kern/subr_extent.c b/sys/kern/subr_extent.c
index 05962829e9a9..e7a461c6f265 100644
--- a/sys/kern/subr_extent.c
+++ b/sys/kern/subr_extent.c
@@ -98,6 +98,22 @@ panic(a ...) printf(a)
 #define KASSERT(exp)
 #endif
 
+#include 
+#define BADSTART        ((unsigned long)0x63ec5018)
+#define OPREGION_SIZE   (8 * 1024)
+#define BADEND          (BADSTART + OPREGION_SIZE)
+static void
+dumpex(const struct extent_region *rp, const char *file, int line)
+{
+
+   if (BADEND < rp->er_start || rp->er_end <= BADSTART)
+   return;
+   printf("%s:%d extent_region @ %p [0x%lx, 0x%lx)\n", file, line, rp,
+   rp->er_start, rp->er_end);
+   db_stacktrace();
+}
+#define DUMPEX(rp)      dumpex(rp, __FILE__, __LINE__)
+
 static struct pool expool;
 
 /*
@@ -373,6 +389,7 @@ extent_insert_and_optimize(struct extent *ex, u_long start, 
u_long size,
 * We can coalesce.  Prepend us to the first region.
 */
LIST_FIRST(&ex->ex_regions)->er_start = start;
+   DUMPEX(LIST_FIRST(&ex->ex_regions));
extent_free_region_descriptor(ex, rp);
return;
}
@@ -383,6 +400,7 @@ extent_insert_and_optimize(struct extent *ex, u_long start, 
u_long size,
 */
rp->er_start = start;
rp->er_end = start + (size - 1);
+   DUMPEX(rp);
LIST_INSERT_HEAD(&ex->ex_regions, rp, er_link);
return;
}
@@ -402,6 +420,7 @@ extent_insert_and_optimize(struct extent *ex, u_long start, 
u_long size,
 * note of it.
 */
after->er_end = start + (size - 1);
+   DUMPEX(after);
appended = 1;
}
 
@@ -420,6 +439,7 @@ extent_insert_and_optimize(struct extent *ex, u_long start, 
u_long size,
 * Yup, we can free it up.
 */
after->er_end = LIST_NEXT(after, er_link)->er_end;
+   DUMPEX(after);
nextr = LIST_NEXT(after, er_link);
LIST_REMOVE(nextr, er_link);
extent_free_region_descriptor(ex, nextr);
@@ -428,6 +448,7 @@ extent_insert_and_optimize(struct extent *ex, u_long start, 
u_long size,
 * Nope, just prepend us to the next region.
 */
LIST_NEXT(after, er_link)->er_start = start;
+   DUMPEX(LIST_NEXT(after, er_link));
}
 
extent_free_region_descriptor(ex, rp);
@@ -452,6 +473,7 @@ extent_insert_and_optimize(struct extent *ex, u_long start, 
u_long size,
 */
rp->er_start = start;
rp->er_end = start + (size - 1);
+   DUMPEX(rp);
LIST_INSERT_AFTER(after, rp, er_link);
 }
 
@@ -1118,12 +1140,14 @@ extent_free(struct extent *ex, u_long start, u_long 
size, int flags)
/* Case 2. */
if ((start == rp->er_start) && (end < rp->er_end)) {
rp->er_start = (end + 1);
+   DUMPEX(rp);
goto done;
}
 
/* Case 3. */
if ((start > rp->er_start) && (end == rp->er_end)) {
rp->er_end = (start - 1);
+   DUMPEX(rp);
goto done;
}
 
@@ -1132,9 +1156,11 @@ extent_free(struct extent *ex, u_long start, u_long 
size, int flags)
/* Fill in new descriptor. */
nrp->er_start = end + 1;
nrp->er_end = rp->er_end;
+   DUMPEX(nrp);
 
/* Adjust current descriptor. */
rp->er_end = start - 1;
+   DUMPEX(rp);
 
/* Insert new descriptor after current. */
LIST_INSERT_AFTER(rp, nrp, er_link);


Re: HEADS UP: Merging drm update

2021-12-23 Thread Taylor R Campbell
> Date: Thu, 23 Dec 2021 14:56:19 +
> From: Taylor R Campbell 
> 
> I'm wondering whether the two bus_space_maps in intel_opregion_setup
> overlap, and whether one needs to be a bus_space_subregion or
> something.

Never mind, that's a red herring and obviously not what's happening
here.

Maybe it's bus_space_alloc that's taking the address, not
bus_space_map or bus_space_reserve at all?  Can you put the same
db_stacktrace treatment into extent_alloc_subregion1?  Maybe every
time an extent_region is inserted into the list or every time anything
writes to er_start, print its address, start, end, and stack trace?


Re: HEADS UP: Merging drm update

2021-12-23 Thread Taylor R Campbell
> Date: Thu, 23 Dec 2021 20:53:00 +0900
> From: Ryo ONODERA 
> 
> I have added panic()s to extent_alloc_region and
> extent_insert_and_optimize.
> And the kernel with this change does not display anything at all.
> After bootloader, my LCD turns black and stays black forever.
> I have removed panic()s after db_stacktrace()s and added only
> db_stacktrace() to extent_alloc_region and extent_insert_and_optimize.
> And I have gotten the following dmesg.
> (I feel that this has no new information.)

Can you use pcictl to dump the pci config registers of the graphics
device, and dump the content of whatever address is in register 0xfc?

Something like:

# pcictl pci0 dump -b 0 -d 2 -f 0
PCI configuration registers:
  Common header:
0x00: 0x01668086 0x0097 0x0309 0x

Vendor Name: Intel (0x8086)
Device Name: Ivy Bridge Integrated Graphics Device (0x0166)
...
# pcictl pci0 read -b 0 -d 2 -f 0 0xfc
daf4f018
# dd if=/dev/mem iseek=$((0xdaf4f018)) bs=1 count=8192 | hexdump -C

Separately, if you can get dmesgs out -- can you print asls and rvda in
intel_opregion_setup?

I'm wondering whether the two bus_space_maps in intel_opregion_setup
overlap, and whether one needs to be a bus_space_subregion or
something.


Re: HEADS UP: Merging drm update

2021-12-22 Thread Taylor R Campbell
> Date: Wed, 22 Dec 2021 18:58:51 +
> From: Taylor R Campbell 
> 
> > Date: Thu, 23 Dec 2021 00:57:13 +0900
> > From: Ryo ONODERA 
> > 
> > Ryo ONODERA  writes:
> > 
> > > I have no useful information or thought yet…
> > >
> > > I have added check for BADADDR+OPREGION_SIZE in bus_space_reserve.
> > > And I have found that no one uses that address.
> > 
> > I have added similar logic to extent_alloc_region() and
> > extent_insert_and_optimize(). And I have found that someone
> > reserves 0x6375c000 to 0x63f3.
> > However I cannot find who reserves this range.
> > i915drmkms wants 0x63ec5018 to 0x63ec7017 and this is overlapped
> > with 0x6375c000 to 0x63f3.
> 
> Can you use db_stacktrace() in extent_alloc_region and
> extent_insert_and_optimize when the region covers that address?  That
> might help narrow it down.

Can you make it stop panicking here, but still print the messages and
get the dmesg at the end?  (If you need to make it panic to get the
dmesg, maybe make it panic in drmfb_attach?)  It may be that i915
tries to map it twice but we didn't notice because we were looking for
any other driver before i915.


Re: HEADS UP: Merging drm update

2021-12-22 Thread Taylor R Campbell
> Date: Thu, 23 Dec 2021 00:57:13 +0900
> From: Ryo ONODERA 
> 
> Ryo ONODERA  writes:
> 
> > I have no useful information or thought yet…
> >
> > I have added check for BADADDR+OPREGION_SIZE in bus_space_reserve.
> > And I have found that no one uses that address.
> 
> I have added similar logic to extent_alloc_region() and
> extent_insert_and_optimize(). And I have found that someone
> reserves 0x6375c000 to 0x63f3.
> However I cannot find who reserves this range.
> i915drmkms wants 0x63ec5018 to 0x63ec7017 and this is overlapped
> with 0x6375c000 to 0x63f3.

Can you use db_stacktrace() in extent_alloc_region and
extent_insert_and_optimize when the region covers that address?  That
might help narrow it down.


Re: HEADS UP: Merging drm update

2021-12-21 Thread Taylor R Campbell
> Date: Tue, 21 Dec 2021 22:47:34 +0900
> From: Ryo ONODERA 
> 
> I think that "Cannot find any crtc or sizes" may be related to
> "Failed to find VBIOS tables (VBT)".
> I have added some printf lines to some functions invoked before
> intel_bios_init in
> sys/external/bsd/drm2/dist/drm/i915/display/intel_bios.c .
> 
> And bus_space_map (line 956)
> in intel_opregion_setup in 
> sys/external/bsd/drm2/dist/drm/i915/display/intel_opregion.c
> failed and return value 35.

Can you do the following diagnostics?

1. Print what the bus address is when bus_space_map fails here; call
   it BADADDR.

2. Add a fragment to bus_space_reserve in x86/bus_space.c:

   #include 

if (bpa <= BADADDR && BADADDR < bpa + size)
db_stacktrace();

Then share the dmesg output on boot with this change to
bus_space_reserve?

This way we can track down who's reserving the registers that
intel_opregion.c wants.


Re: HEADS UP: Merging drm update

2021-12-20 Thread Taylor R Campbell
> Date: Mon, 20 Dec 2021 11:35:45 +0900
> From: Kengo NAKAHARA 
> 
> GENERIC kernel without DIAGNOSTIC option fails to build.
> Could you apply the following patch?
>  
> https://github.com/knakahara/netbsd-src/commit/b1c93870ef5689201b6eb7e08811bc40e3e1543e

Thanks!  I did it slightly differently, to reduce the amount of
upstream patching we need to do.  OK now?


Re: HEADS UP: Merging drm update

2021-12-19 Thread Taylor R Campbell
> Date: Sun, 19 Dec 2021 23:51:58 +
> From: Taylor R Campbell 
> 
> > Date: Sun, 19 Dec 2021 21:41:14 +0100
> > From: Benny Siegert 
> > 
> > Meanwhile, I built a new kernel from HEAD on my Pinebook Pro. There is
> > a bit of a pause when DRM is initialized during boot, and X is
> > basically unusable -- it takes a minute or so to draw the login screen
> > (slim).
> > [...]
> > This is not working as intended, I suppose? :)
> 
> Oops -- the rockchip vblank logic got lost in the shuffle to elide
> debug commits from the commitbomb.  The necessary logic is in
> <https://github.com/riastradh/netbsd-src/commit/efbfc9d5c93034cf976dbd595e3e6ad8b1cbf586>
> but I need to dust it off and commit it properly with a little less
> debugging crud!

Please update and try again?  (I've only compile-tested the changes,
will take a closer look tomorrow if it doesn't fix the problem.)


Re: HEADS UP: Merging drm update

2021-12-19 Thread Taylor R Campbell
> Date: Sun, 19 Dec 2021 21:41:14 +0100
> From: Benny Siegert 
> 
> Meanwhile, I built a new kernel from HEAD on my Pinebook Pro. There is
> a bit of a pause when DRM is initialized during boot, and X is
> basically unusable -- it takes a minute or so to draw the login screen
> (slim).
> [...]
> This is not working as intended, I suppose? :)

Oops -- the rockchip vblank logic got lost in the shuffle to elide
debug commits from the commitbomb.  The necessary logic is in
<https://github.com/riastradh/netbsd-src/commit/efbfc9d5c93034cf976dbd595e3e6ad8b1cbf586>
but I need to dust it off and commit it properly with a little less
debugging crud!


Re: HEADS UP: Merging drm update

2021-12-19 Thread Taylor R Campbell
> Date: Sun, 19 Dec 2021 01:42:53 +
> From: Taylor R Campbell 
> 
> > Date: Sat, 18 Dec 2021 17:06:13 +
> > From: Taylor R Campbell 
> > 
> > I'm planning to merge the drm update this weekend -- a cvs import and
> > merge commit, plus about 1300 commits on top of that from the git
> > repository.
> 
> The update is in progress, but my commitbomb script isn't perfect and
> sometimes requires manual intervention, which won't happen while I'm
> asleep.  The tree might not build for a few hours in that event.
> Apologies in advance!

The commitbomb is done -- drm update merged.  There may be build
fallout; let me know if so and I'll try to take care of it.


Re: HEADS UP: Merging drm update

2021-12-18 Thread Taylor R Campbell
> Date: Sat, 18 Dec 2021 17:06:13 +
> From: Taylor R Campbell 
> 
> I'm planning to merge the drm update this weekend -- a cvs import and
> merge commit, plus about 1300 commits on top of that from the git
> repository.

The update is in progress, but my commitbomb script isn't perfect and
sometimes requires manual intervention, which won't happen while I'm
asleep.  The tree might not build for a few hours in that event.
Apologies in advance!


HEADS UP: Merging drm update

2021-12-18 Thread Taylor R Campbell
I'm planning to merge the drm update this weekend -- a cvs import and
merge commit, plus about 1300 commits on top of that from the git
repository.

Please don't commit anything to the following subtrees until done --
if you do, your changes will be lost:

sys/external/bsd/drm2
sys/external/bsd/common

Please don't touch the following additional files outside those
subtrees:

sys/arch/arm/nvidia/tegra_drm.c
sys/arch/arm/nvidia/tegra_drm.h
sys/arch/arm/nvidia/tegra_drm_fb.c
sys/arch/arm/nvidia/tegra_drm_mode.c
sys/arch/arm/nvidia/tegra_fb.c
sys/arch/arm/nvidia/tegra_nouveau.c
sys/arch/arm/nxp/imx6_dwhdmi.c
sys/arch/arm/rockchip/rk_anxdp.c
sys/arch/arm/rockchip/rk_drm.c
sys/arch/arm/rockchip/rk_drm.h
sys/arch/arm/rockchip/rk_dwhdmi.c
sys/arch/arm/rockchip/rk_fb.c
sys/arch/arm/rockchip/rk_vop.c
sys/arch/arm/sunxi/sunxi_drm.c
sys/arch/arm/sunxi/sunxi_drm.h
sys/arch/arm/sunxi/sunxi_dwhdmi.c
sys/arch/arm/sunxi/sunxi_fb.c
sys/arch/arm/sunxi/sunxi_hdmiphy.h
sys/arch/arm/sunxi/sunxi_lcdc.c
sys/arch/arm/sunxi/sunxi_mixer.c
sys/arch/arm/ti/ti_fb.c
sys/arch/arm/ti/ti_lcdc.c
sys/arch/arm/ti/ti_lcdc.h
sys/dev/fdt/fdt_panel.c
sys/dev/fdt/hdmi_connector.c
sys/dev/i2c/anxedp.c
sys/dev/i2c/tda19988.c
sys/dev/ic/anx_dp.c
sys/dev/ic/anx_dp.h
sys/dev/ic/dw_hdmi.c
sys/dev/ic/dw_hdmi.h
sys/dev/ic/dw_hdmi_phy.c
sys/dev/videomode/edidvar.h
sys/modules/*drmkms*
sys/sys/agpio.h


Re: zpool import skips wedges due to a race condition

2021-09-11 Thread Taylor R Campbell
> Date: Tue, 7 Sep 2021 22:30:45 +0100
> From: Alexander Nasonov 
> 
> When I run zfs import, it launches 32 threads and opens 32 disks in
> parallel, including cgd1 and dk24. But it can't open dk24 while
> cgd1 is still open (it fails with EBUSY).

I have the attached patch in my tree from a while ago to scan
non-wedge disks first, then wedges.  For some reason I didn't commit
it but I forget why.  (Well, it's a little kludgey to look at the
pathname to distinguish dkN devices, but...)
>From dbee261ca7b5837946e93ec601acb04ae0b0a8b1 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell 
Date: Sat, 29 Feb 2020 18:19:45 +
Subject: [PATCH] Scan primary disks first; then dkwedges.

---
 .../dist/lib/libzfs/common/libzfs_import.c| 23 ++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/external/cddl/osnet/dist/lib/libzfs/common/libzfs_import.c 
b/external/cddl/osnet/dist/lib/libzfs/common/libzfs_import.c
index b935e49076a5..e8d29b86f4b2 100644
--- a/external/cddl/osnet/dist/lib/libzfs/common/libzfs_import.c
+++ b/external/cddl/osnet/dist/lib/libzfs/common/libzfs_import.c
@@ -1175,6 +1175,9 @@ zpool_find_import_impl(libzfs_handle_t *hdl, importargs_t 
*iarg)
config_entry_t *ce, *cenext;
name_entry_t *ne, *nenext;
avl_tree_t slice_cache;
+#ifdef __NetBSD__
+   avl_tree_t dkwedge_cache;
+#endif
rdsk_node_t *slice;
void *cookie;
 
@@ -1266,6 +1269,8 @@ zpool_find_import_impl(libzfs_handle_t *hdl, importargs_t 
*iarg)
}
 #endif
 #ifdef __NetBSD__
+   avl_create(&dkwedge_cache, slice_cache_compare,
+   sizeof (rdsk_node_t), offsetof(rdsk_node_t, rn_node));
if (strcmp(rdsk, "/dev/") == 0) {
static const char mib_name[] = "hw.disknames";
size_t len;
@@ -1291,7 +1296,10 @@ zpool_find_import_impl(libzfs_handle_t *hdl, 
importargs_t *iarg)
slice->rn_dfd = dfd;
slice->rn_hdl = hdl;
slice->rn_nozpool = B_FALSE;
-   avl_add(&slice_cache, slice);
+   if (strncmp(name, "dk", 2) == 0)
+   avl_add(&dkwedge_cache, slice);
+   else
+   avl_add(&slice_cache, slice);
}
free(disknames);
 
@@ -1333,6 +1341,16 @@ skipdir:
AVL_AFTER)))
(void) tpool_dispatch(t, zpool_open_func, slice);
tpool_wait(t);
+#ifdef __NetBSD__
+   for (slice = avl_first(&dkwedge_cache); slice;
+   (slice = avl_walk(&dkwedge_cache, slice,
+   AVL_AFTER)))
+   (void) tpool_dispatch(t, zpool_open_func, slice);
+   tpool_wait(t);
+   while ((slice = avl_destroy_nodes(&dkwedge_cache,
+   &cookie)) != NULL)
+   avl_add(&slice_cache, slice);
+#endif
tpool_destroy(t);
 
cookie = NULL;
@@ -1373,6 +1391,9 @@ skipdir:
free(slice->rn_name);
free(slice);
}
+#ifdef __NetBSD__
+   avl_destroy(&dkwedge_cache);
+#endif
avl_destroy(&slice_cache);
 
(void) closedir(dirp);


Re: Script command under NetBSD-current

2021-06-16 Thread Taylor R Campbell
> Date: Mon, 14 Jun 2021 20:01:49 +
> From: RVP 
> 
> #0  0x7f7fbfa0ab3a in ___lwp_park60 () from /usr/libexec/ld.elf_so
> #1 0x7f7fbfa0265d in _rtld_exclusive_enter () from /usr/libexec/ld.elf_so
> #2  0x7f7fbfa03125 in _rtld_exit () from /usr/libexec/ld.elf_so
> #3  0x79097fb6bb1f in __cxa_finalize () from /usr/lib/libc.so.12
> #4  0x79097fb6b73d in exit () from /usr/lib/libc.so.12
> #5  0x01401771 in done ()
> #6  0x01401853 in finish ()
> #7  

This indicates a bug in the script(1) program -- it calls exit() in a
signal handler, but exit() is not async-signal-safe.

script(1) should be changed to use only async-signal-safe functions in
its signal handlers -- e.g., by just setting a flag in the signal
handler and either handling EINTR after every blocking syscall or
running with signals masked except during pselect/pollts loop.
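
Roughly the shape I have in mind -- just a sketch, not script(1)'s
actual code, and all of the names below are made up:

    #include <sys/select.h>

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static volatile sig_atomic_t dieflag;

    static void
    onsig(int signo)
    {

            (void)signo;
            dieflag = 1;    /* async-signal-safe: just make a note */
    }

    static void
    copyloop(int fd)
    {
            struct sigaction sa;
            sigset_t block, waitmask;
            fd_set rfds;
            char buf[BUFSIZ];

            memset(&sa, 0, sizeof sa);
            sa.sa_handler = onsig;
            sigemptyset(&sa.sa_mask);
            sa.sa_flags = 0;        /* no SA_RESTART -- we want EINTR */
            sigaction(SIGCHLD, &sa, NULL);
            sigaction(SIGTERM, &sa, NULL);

            sigemptyset(&block);
            sigaddset(&block, SIGCHLD);
            sigaddset(&block, SIGTERM);
            sigprocmask(SIG_BLOCK, &block, &waitmask); /* blocked everywhere... */
            sigdelset(&waitmask, SIGCHLD);
            sigdelset(&waitmask, SIGTERM);

            while (!dieflag) {
                    FD_ZERO(&rfds);
                    FD_SET(fd, &rfds);
                    /* ...except while atomically waiting here. */
                    if (pselect(fd + 1, &rfds, NULL, NULL, NULL,
                        &waitmask) == -1)
                            continue;  /* EINTR: loop, check dieflag */
                    if (FD_ISSET(fd, &rfds))
                            (void)read(fd, buf, sizeof buf);
            }

            /* Do the exit()/cleanup work here, in ordinary context. */
    }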

I don't know why it's different in netbsd-9 and current, but it was
broken in netbsd-9 before, and there were some changes to some of the
logic which could trigger race conditions differently in current.


> Date: Tue, 15 Jun 2021 08:15:12 +
> From: RVP 
> 
> The small patch below fixes it for me.
> 
> --- START PATCH ---
> --- libexec/ld.elf_so.orig/rtld.c 2020-09-22 00:41:27.0 +
> +++ libexec/ld.elf_so/rtld.c  2021-06-15 08:11:34.301709238 +
> @@ -1750,6 +1750,8 @@
>   sigdelset(, SIGTRAP); /* Allow the debugger */
>   sigprocmask(SIG_BLOCK, , mask);
> 
> + membar_enter();

This may change some timing with the effect of rejiggering a race
condition, but it doesn't meaningfully affect the semantics, and
certainly won't prevent a deadlock from calling exit in the signal
handler if it interrupts lazy symbol binding.

The corresponding membar_enter in _rtld_shared_enter at the beginning
doesn't make sense and should be removed.  In particular, the order
generally needs to be something like:

mumblefrotz_enter:
atomic_r/m/w(lock stuff);
membar_enter();

body of critical section;

mumblefrotz_exit:
membar_exit();
atomic_r/m/w(lock stuff);

Putting another membar_enter _before_ the atomic_r/m/w(lock stuff) in
mumblefrotz_enter doesn't really do anything.
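
Concretely, with a toy spin lock (an illustration of the ordering
only, not ld.elf_so's actual code):

    #include <sys/atomic.h>

    static volatile unsigned int mumblefrotz_lock;

    static void
    mumblefrotz_enter(void)
    {

            while (atomic_cas_uint(&mumblefrotz_lock, 0, 1) != 0)
                    continue;       /* spin until we own the lock */
            membar_enter();         /* order critical section after the acquire */
    }

    static void
    mumblefrotz_exit(void)
    {

            membar_exit();          /* order critical section before the release */
            (void)atomic_swap_uint(&mumblefrotz_lock, 0);
    }

The membar_enter pairs with the membar_exit; together they keep the
body of the critical section between the acquire and the release.  A
barrier placed before the acquire has nothing to pair with.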


Re: Automated report: NetBSD-current/i386 install success

2021-06-13 Thread Taylor R Campbell
> Date: Sun, 13 Jun 2021 09:39:22 +0300
> From: Andreas Gustafsson 
> 
> The NetBSD Test Fixture wrote:
> > The NetBSD-current/i386 install is working again.
> 
> So it is, but the system now panics during the ATF tests:
> 
>   dev/fss/t_fss (38/899): 1 test cases
>   basic: [ 148.2076287] panic: kernel diagnostic assertion 
> "KERNEL_LOCKED_P()" failed: file 
> "/tmp/build/2021.06.13.03.09.20-i386/src/sys/kern/subr_autoconf.c", line 1972 

Will fix shortly, sorry for the hiccup!


Re: regarding the changes to kernel entropy gathering

2021-04-06 Thread Taylor R Campbell
> Date: Mon, 05 Apr 2021 10:58:58 +0700
> From: Robert Elz 
> 
> I understand that some people desire highly secure systems (I'm not
> convinced that anyone running NetBSD can really justify that desire,
> but that's beside the point) and that's fine - make the system be able
> to be as secure as possible, just don't require me to enable it, and
> don't make it impossible or even difficuly to disable it - and allow
> some kind of middle ground, not just "perfectly secure" and "hopeless".

The main issue that hits people is that the traditional mechanism by
which the OS reports a potential security problem with entropy is for
it to make applications silently hang -- and the issue is getting
worse now that getrandom() is more widely used, e.g. in Python when
you do `import multiprocessing'.

Based on experience over the past year with a meaningful criterion for
_detecting_ potential problems, I don't think that's a useful
mechanism for _reporting_ them, which is why I added several other
mechanisms -- a line in the /etc/security report, an `entropy' knob in
/etc/rc.conf to wait or fail to single-user (default: neither) -- and
proposed to remove the blocking behaviour of getrandom() in favour of
focusing on feedback in system integration:

https://mail-index.netbsd.org/tech-userlevel/2021/01/11/msg012807.html

(main discussion after all the noise starts here:
https://mail-index.netbsd.org/tech-userlevel/2021/01/15/msg012846.html)

But I ran out of steam to continue the discussion at the time.


Re: regarding the changes to kernel entropy gathering

2021-04-06 Thread Taylor R Campbell
> Date: Mon, 5 Apr 2021 01:16:56 + (UTC)
> From: RVP 
> 
> Then, the issue here is one of predictability. NetBSD doesn't want,
> for extremely valid reason, to incorporate any perturbation sources
> which have been pooh-poohed in the technical literature.

Why do you say that?  We do incorporate many sources that are not
well-studied -- every keystroke, for example, and the CPU cycle
counter at the time of the keystroke, affects the output of
/dev/urandom.

We just don't _count_ these sources for the purpose of detecting and
reporting the potential security problem of low entropy, because we
don't have a good model by which we can confidently claim they are
actually unpredictable to an adversary.

> Or, perhaps statistical tests of the raw in-kernel sources will
> demonstrate exactly why things like timing jitter have been
> pooh-poohed in the literature?

What do you mean by `things like timing jitter have been pooh-poohed
in the literature'?  Timing jitter in ring oscillators arising from
thermal noise in the silicon is the main source of entropy in most
on-die hardware RNGs on the market that I'm aware of.  This design is
reasonably well-studied in the literature.

We even try to use something like it by default in NetBSD when nothing
else is available -- we use the periodic hardclock interrupt timer to
sample the CPU cycle counter, in the hope that these clocks are driven
by independent oscillators with some timing jitter.

However, this is a generic mechanism that works on any machine -- and
on some machines, there is no CPU cycle counter so it might fall back
to using the hardclock timer as a substitute, defeating the purpose.
And even if there is a CPU cycle counter, it might be clocked by the
same oscillator as the hardclock timer, in which case there's no
independent jitter to sample.  So we conservatively
count the hardclock entropy source as having zero entropy.


Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Taylor R Campbell
> Date: Sun, 4 Apr 2021 21:24:56 + (UTC)
> From: RVP 
> 
> I think running the /dev/random bit-stream through some statistical
> tests, (both on RDRAND/RDSEED-based and estimator-based as in your
> patch) would be useful here.

No, because the output of /dev/random and /dev/urandom is the output
of a pseudorandom number generator that meets modern standards of
security.

If anyone had _ever_ published statistical tests that the PRNG failed
in a detectable way, then (a) this would be an earthshattering
development in the cryptography literature, which would be hotly
discussed in much more significant forums than NetBSD mailing lists,
and (b) we would stop using this PRNG and switch to another one.

(Device-dependent health tests do make sense in the HWRNG device
driver, to detect broken devices before we treat them as having
entropy, which is why we do them wherever we can, e.g. to detect the
AMD RDRAND bugs.)


Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Taylor R Campbell
> Date: Sun, 04 Apr 2021 12:58:09 -0700
> From: "Greg A. Woods" 
> References: 
>   <20210404094958.692f360...@jupiter.mumble.net>
> 
> At Sun, 4 Apr 2021 09:49:58 +, Taylor R Campbell  
> wrote:
> Subject: Re: regarding the changes to kernel entropy gathering
> >
> > Your change _creates_ the lie that every bit of data entered this way
> > is drawn from a source with independent uniform distribution.
> 
> No, my change _allows_ the administrator to decide which devices can be
> used as estimating/counting entropy sources.  For example I know that
> many of the devices on almost all of my machines (virtual or otherwise)
> are equally good sources of entropy for their uses.

If you know this (and this is something I certainly can't confidently
assert!), you can write 32 bytes to /dev/random, save a seed, and be
done with it.

But users who don't go messing around with obscure rndctl settings in
rc.conf will be proverbially shot in the foot by this change -- except
they won't notice because there is practically guaranteed to be no
feedback whatsoever for a security disaster until their systems turn
up in a paper published at Usenix like <https://factorable.net/>.

What your change does is equivalent to going around to every device
driver that previously said `this provides zero entropy, or I don't
know how much entropy it provides' and replacing that claim by `this
is a sample of an independent and perfectly uniform random string of
bits', which is a much stronger (and falser) claim than even the old
`entropy estimation' confabulation that NetBSD used to do.


Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Taylor R Campbell
> Date: Sun, 4 Apr 2021 11:14:31 -0700
> From: John Nemeth 
> 
>  I understand the need for good random sources, and won't argue
> it.  My question is, how can we tell what random sources a system
> actually has, i.e. is there some flag that cpuctl identify shows
> when a system has RDRAND/RDSEED?

# cpuctl identify 0 | grep -e RDRAND -e RDSEED
cpu0: features1 0x7fbae3bf

>   Are there other sources that can
> be positively identified as providing randomness?

`rndctl -l' will tell you whether any sources you have on your system
have provided any entropy.  The system generally tries to read from
HWRNGs as soon as possible at boot, so unless something is wrong you
will see such sources listed next to nonzero bits of entropy in
`rndctl -l' as soon as you can run that.

You can grep the code for rnd_add_data and rnd_add_data_sync to find
the drivers that pass nonzero values as the last argument, which is
the number of bits of entropy in the process that generated the sample
being fed in.
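
The driver side looks roughly like this -- a sketch against the rnd(9)
API, not any particular driver, and the example_* names are invented:

    #include <sys/param.h>
    #include <sys/rndsource.h>

    static krndsource_t example_source;

    static void
    example_attach(void)
    {

            rnd_attach_source(&example_source, "example", RND_TYPE_RNG,
                RND_FLAG_DEFAULT);
    }

    static void
    example_feed(const void *buf, size_t len, bool is_real_hwrng)
    {

            /*
             * Last argument = bits of entropy claimed for this sample.
             * A genuine HWRNG can claim full entropy (len*NBBY bits);
             * a source with no model passes 0 -- the data still stirs
             * the pool, it just doesn't count toward unblocking.
             */
            rnd_add_data(&example_source, buf, len,
                is_real_hwrng ? len * NBBY : 0);
    }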

Lots of SoCs have on-board RNGs these days; there are Intel and ARM
CPU instructions (no ARMv8.5 hardware yet that I know of, but we're
ready for its RNG!); some crypto decelerators like tpm(4), ubsec(4),
and hifn(4) have RNGs; and there are some dedicated RNG devices like
ualea(4).


Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Taylor R Campbell
> Date: Sun, 04 Apr 2021 23:47:10 +0700
> From: Robert Elz 
> 
> Date:Sun, 4 Apr 2021 15:28:13 +
>     From:    Taylor R Campbell 
> Message-ID:  <20210404152814.3c56360...@jupiter.mumble.net>
> 
>   | you can let NetBSD take care of it automatically
>   | on subsequent boots by running `/etc/rc.d/random_seed stop' to save a
>   | seed to disk.)
> 
> Is that file encrypted?   If it is, where does the decryption key come from?

No.  If the adversary can bypass file system privileges to read (and,
most likely, write) arbitrary data on your disk, that adversary is
outside the threat model and you need to do something else to deal
with such an adversary -- for example, by encrypting your disk with
cgd(4).

If the random seed is stored on a networked file system, then NetBSD
incorporates it but treats it as having zero entropy since a network
eavesdropper could read it.

(Side note: Disclosure of the _current_ seed file does not enable
retroactive recovery of _past_ secrets generated from /dev/urandom.)

> I think I'd prefer possibly insecure, but difficult to obtain from outside
> like disk drive interrupt timing low order bits than that.   Regardless of
> how unproven that method might be.

The seed is hashed together with disk drive interrupt timings, so you
get security from either or both, along with RDRAND/RDSEED output if
available.  The kernel also uses the hardclock timer to periodically
sample a cycle counter as a makeshift ring oscillator, if nothing else
is available.  Every device listed in `rndctl -l' has the opportunity
to influence the output of /dev/urandom; for security it is enough
that any one of them be adequately unpredictable.
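
As a toy illustration of that principle -- this is nothing like the
real pool construction, just a hash for the sake of the argument --
the output below is unpredictable as long as at least one of the
inputs is, no matter what the others contain:

    #include <sys/types.h>

    #include <sha2.h>
    #include <stdint.h>

    void
    combine_sources(const void *seed, size_t seedlen,
        const void *disk_timings, size_t disklen,
        const void *rdseed_output, size_t rdseedlen,
        uint8_t out[SHA256_DIGEST_LENGTH])
    {
            SHA256_CTX ctx;

            /* Hash everything together; one good input is enough. */
            SHA256_Init(&ctx);
            SHA256_Update(&ctx, seed, seedlen);
            SHA256_Update(&ctx, disk_timings, disklen);
            SHA256_Update(&ctx, rdseed_output, rdseedlen);
            SHA256_Final(out, &ctx);
    }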

> And what's the scheme for cheap low-end devices that have no
> writable storage?  (The proverbial internet toaster, for example).

If the device has a HWRNG, great.  If not, this may be a difficult
problem for the system engineer designing the device to solve.
There's no free lunch.

> Lastly, why would anyone presume that RDRAND generates less predictable
> bits (less predictable to someone who knows how it works) than any of
> the other methods that are used.

We generally take the vendor's word in hardware documentation about
what's actually behind the device registers, just like we do for any
other device we have a driver for (e.g., we generally expect devices
not to leak the contents of arbitrary RAM across the network).

In the case of Intel RDRAND, the design is documented in

Intel Digital Random Number Generator Software Implementation
Guide, Revision 2.1, October 17, 2019.

https://web.archive.org/web/20190404022757/https://software.intel.com/sites/default/files/managed/98/4a/DRNG_Software_Implementation_Guide_2.1.pdf

The design samples from ring oscillators -- two independently clocked
circuits, one of which rapidly alternates between states and the other
of which periodically samples the first, so the state is determined by
the random timing jitter between the two clocks, influenced by
unpredictable thermal noise in the silicon.  This class of designs is
reasonably well-studied in the literature -- much moreso than, e.g.,
disk seek times or network packet times.

We also incorporate new information about the devices if available.
For example, on certain AMD CPUs, RDRAND and/or RDSEED sometimes gets
into bad states, which the device driver checks for; in this event the
driver counts zero entropy.

We sometimes do a bit of science on the devices too.  See
https://nxr.netbsd.org/xref/src/sys/arch/mips/cavium/dev/octeon_rnm.c
for some notes on the Cavium Octeon HWRNG, and
https://nxr.netbsd.org/xref/src/sys/arch/arm/sunxi/sun8i_crypto.c for
an example of a driver for a thing that might be an OK HWRNG but is
not clearly endorsed by vendor documentation and has a very nonuniform
distribution so we obviously need more than 256 bits of sample
material to get 256 bits of entropy -- but how much more is not clear,
which is why it counts as zero entropy for now.

> If we want really good security, I'd submit we need to disable
> the random seed file, and RDRAND (and anything similar) until we
> have proof that they're perfect.

...right

FYI: If you really want, you can stop the kernel from counting any
particular entropy source by setting rndctl_flags in /etc/rc.conf:

rndctl_flags="-E -d rdrand; -E -d rdseed; -E -d rdrand/rdseed; -E -d seed"

Then any attempt to read from /dev/random will block until some other
entropy source has provided enough entropy.


Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Taylor R Campbell
> Date: Sun, 4 Apr 2021 10:40:22 -0400 (EDT)
> From: Mouse 
> 
> > What NetBSD-current is telling you on your Xen system, on a CPU
> > predating RDRAND/RDSEED, is the unfortunate truth that there is no
> > reliable source of entropy available in your system --
> 
> Not quite.  That there is nothing which NetBSD, independent of the
> sysadmin, is confident is a reliable source of entropy.
> 
> It's entirely possible one or more of those sources actually does
> supply usable entropy, but NetBSD doesn't realize that (and, as I
> understand it, provides no way for the sysadmin to fix that, short of
> hacking on the source).

If the sysadmin knows something NetBSD doesn't, it is easy for the
sysadmin to convince NetBSD to unblock by writing 32 bytes to
/dev/random as root.  No need to hack any source.

(This shouldn't be an automatic recipe, though, because it depends on
specific knowledge of the system in question which the authors of the
device drivers and the rest of the software didn't know.  And if you
have done that once, you can let NetBSD take care of it automatically
on subsequent boots by running `/etc/rc.d/random_seed stop' to save a
seed to disk.)
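
The mechanics are nothing special -- just an ordinary write(2) to
/dev/random as root.  A sketch, where the input path is only a
placeholder for wherever your 32 trusted bytes actually come from:

    #include <err.h>
    #include <fcntl.h>
    #include <unistd.h>

    int
    main(void)
    {
            unsigned char buf[32];
            int in, out;

            if ((in = open("/path/to/trusted-bytes", O_RDONLY)) == -1)
                    err(1, "open input");
            if (read(in, buf, sizeof buf) != (ssize_t)sizeof buf)
                    errx(1, "short read");
            if ((out = open("/dev/random", O_WRONLY)) == -1)
                    err(1, "open /dev/random");
            if (write(out, buf, sizeof buf) != (ssize_t)sizeof buf)
                    err(1, "write");
            return 0;
    }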


Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Taylor R Campbell
> Date: Sat, 03 Apr 2021 12:24:29 -0700
> From: "Greg A. Woods" 
> 
> Updating a system, even on -current, shouldn't create a long-lived
> situation where the system documentation and the behaviour and actions
> of system commands is completely out of sync with the behaviour of the
> kernel, and in fact lies to the administrator about the abilities of the
> system.

It would help if you could identify specifically what you are calling
a lie.

> @@ -1754,21 +1766,21 @@
>  rnd_add_uint32(struct krndsource *rs, uint32_t value)
>  {
> 
> - rnd_add_data(rs, &value, sizeof value, 0);
> + rnd_add_data(rs, &value, sizeof value, sizeof value * NBBY);
>  }

The rnd_add_uint32 function is used by drivers to feed in data from
sources _with no known model for their entropy_.  It's how drivers
toss in data that might be helpful but might totally predictable, and
the driver has no way to know.

Your change _creates_ the lie that every bit of data entered this way
is drawn from a source with independent uniform distribution.

What NetBSD-current is telling you on your Xen system, on a CPU
predating RDRAND/RDSEED, is the unfortunate truth that there is no
reliable source of entropy available in your system -- annoying, yes,
but when you talk about `matters so important as system security and
integrity' you might prefer to hear about this rather than have it
swept under the rug.

What your patch does is shoot yourself in the foot by fantasizing that
_every_ source using rnd_add_uint32, even if it predictably always
supplies all-zero bits, has the maximum entropy possible.


Re: nothing contributing entropy in Xen domUs? (causing python3.7 rebuild to get stuck in kernel in "entropy" during an "import" statement)

2021-03-30 Thread Taylor R Campbell
> Date: Tue, 30 Mar 2021 16:23:43 -0700
> From: "Greg A. Woods" 
> 
> At Tue, 30 Mar 2021 23:53:43 +0200, Manuel Bouyer  
> wrote:
> > On Tue, Mar 30, 2021 at 02:40:18PM -0700, Greg A. Woods wrote:
> > > Perhaps the answer is that nothing seems to be contributing anything to
> > > the entropy pool.  No matter what device I exercise, none of the numbers
> > > in the following changes:
> >
> > yes, it's been this way since the rnd rototill. Virtual devices are
> > not trusted.
> >
> > The only way is to manually seed the pool.
> 
> Ah, so that is definitely not what I expected!

This is false.  If the VM host provided a viornd(4) device then NetBSD
would automatically collect, and count, entropy from the host, with no
manual intervention.

> Finally, if the system isn't actually collecting entropy from a device,
> then why the heck does it allow me to think it is (i.e. by allowing me
> to enable it and show it as enabled and collecting via "rndctl -l")?

The system does collect samples from all those devices.  However, they
are not designed to be unpredictable and there is no good reliable
model for just how unpredictable they are, so the system doesn't
_count_ anything from them.  See https://man.NetBSD.org/entropy.4 for
a high-level overview.

In the past we used an essentially meaningless model, designed in a
vacuum without reference to any information about the physics of the
sources of the samples (and the same model for all sources), for
fabricating entropy estimates by examining the sample data.  This
practice no longer happens.


Re: nothing contributing entropy in Xen domUs? (causing python3.7 rebuild to get stuck in kernel in "entropy" during an "import" statement)

2021-03-30 Thread Taylor R Campbell
> Date: Tue, 30 Mar 2021 23:53:43 +0200
> From: Manuel Bouyer 
> 
> On Tue, Mar 30, 2021 at 02:40:18PM -0700, Greg A. Woods wrote:
> > [...]
> > 
> > Perhaps the answer is that nothing seems to be contributing anything to
> > the entropy pool.  No matter what device I exercise, none of the numbers
> > in the following changes:
> 
> yes, it's been this way since the rnd rototill. Virtual devices are
> not trusted.
> 
> The only way is to manually seed the pool.

This is false.  The virtual RNG drivers (viornd(4) [1], rump
hyperentropy [2], maybe others) all assume the VM host provides
samples with full entropy.  This has always been the case, and this
didn't change at all in the rototill last year.

There are no virtual RNG devices on the system in question, according
to the quoted `rndctl -l' output.  Perhaps the VM host needs to be
taught to expose a virtio-rng device to the guest?


[1] https://nxr.netbsd.org/xref/src/sys/dev/pci/viornd.c#245
[2] https://nxr.netbsd.org/xref/src/sys/rump/librump/rumpkern/hyperentropy.c#57


P.S.  Further discussion about Python, getrandom, and system
integration:
https://mail-index.netbsd.org/tech-userlevel/2021/01/11/msg012807.html


Re: [HEADS UP] pkgsrc default database directory changed

2020-12-13 Thread Taylor R Campbell
> Date: Sun, 13 Dec 2020 20:11:53 +0100
> From: Thomas Klausner 
> 
> On Sun, Dec 13, 2020 at 02:18:32PM +0100, Rhialto wrote:
> > A problem with pkg.refcount might be that files in that directory
> > contain absolute pathnames starting with /var/db/pkg. E.g.:
> > 
> > $ cat /var/db/pkg.refcount/groups/mail/cyrus-sasl-2.1.27nb1 
> > /var/db/pkg/cyrus-sasl-2.1.27nb1
> > 
> > $ cat 
> > /var/db/pkg.refcount/files/usr/pkg/lib/perl5/vendor_perl/5.32.0/XML/SAX/ParserDetails.ini/p5-XML-SAX-1.02
> >  
> > /var/db/pkg/p5-XML-SAX-1.02
> > 
> > $ cat /var/db/pkg.refcount/users/smmsp/sendmail-8.15.2nb9 
> > /var/db/pkg/sendmail-8.15.2nb9
> 
> That looks like a design flaw :(
> 
> IIUC the effect will be that some empty directories are not deleted
> when the last packages using them are deleted.
> 
> Is anyone willing to write up a shell script to fix them?

Before we distribute more band-aids on this that might require manual
intervention, maybe we could step back and assess the situation to
make sure the issues are being comprehensively addressed, and perhaps
find a way to do it that is safe, idempotent, and requires no manual
intervention by users?

We have a lot of different types of users who might be affected by
this, and I worry that the pkgdb migration had already gotten out of
hand before we multiplied the cases that are out there with updates to
base and pullups to -8 and -9:

- netbsd-{8,9,HEAD} users using base pkg_install exclusively who have
  updated
  . base and pkgsrc
  . base
  . pkgsrc
  . neither

- netbsd-{8,9,HEAD} users using bootstrapped pkg_install exclusively
  who {have,haven't} updated pkgsrc

- netbsd-{8,9,HEAD} users using base pkg_install for main packages but
  bootstrapped pkg_install for /home/user/pkg or /usr/pbulk or
  similar, who have updated
  . base and pkgsrc
  . base
  . pkgsrc
  . neither

All these need to be multiplied by the different versions of the
manual intervention we've publicly suggested so far.  Did I miss any
important cases that we're likely to encounter?


Re: Entropy problem [was Re: CVS Problem (again) ,Slightly lesser old code, but old still [was Re: Console problem ,older code]]

2020-12-05 Thread Taylor R Campbell
> Date: Sat, 5 Dec 2020 15:19:15 -0800
> From: Germain Le Chapelain 
> 
> It's building firefox again,
> 
> I haven't yet tried any work-around mentioned in the email
> introducing the change But I am confused about the issue, why this
> arises.
> 
> I thought the way randomness was implemented is that you would
> initialize one seed-number to an amd64 and it would pass you back
> random #s, anytime and at any rate, *guaranteed* to be random,
> provided that for sure your initial # was random...

This is roughly right -- once you have a good seed stored in
/var/db/entropy-file, it serves to generate any amount of random
output, and it generally will be kept updated from boot to boot.

The issue is that certain software tries to avoid a security
catastrophe where you _haven't yet gotten a good enough seed_ before
you generate cryptographic keys that matter, like in
<https://factorable.net/>.

If your CPU has RDRAND/RDSEED, we assume it works (unless it gives
consecutive repeated 256-bit outputs, in which case we decide it's
broken), and nothing should ever block (unless you explicitly turn
that on for debugging purposes).  (If your CPU has RDRAND/RDSEED but
it's still blocking, let me know and we can try to diagnose that --
you can check with `cpuctl identify 0'.)

But if you don't have RDRAND/RDSEED, and you don't have a seed stored
on disk, and you don't have any other hardware RNG, and a program
running on NetBSD asks to _wait until it is seeded_ -- then, well, it
will hang.

In NetBSD<=9 (and in Linux and other systems), the kernel would stop
hanging after a while on the basis of a metric so meaningless it is
tantamount to a lie; NetBSD-current no longer incorporates that lie.
That said, in many contexts hanging is confusing and not helpful, and
I would like to address it better before we get to NetBSD 10 (e.g.,
 for some ideas).  But that's what it
does right now.

I'm not sure exactly what symptom you're experiencing at the moment,
but it might be compounded by an issue with Python, which is the
following:

1. Many years ago, in the early 2000s, Python added a function
   os.urandom(n) which returns a string of n bytes chosen uniformly at
   random _without ever blocking_ -- in other words, the contract was:
   if the system doesn't have a good enough seed, too bad; the caller
   of os.urandom(n) accepts responsibility for the security
   consequences (for example, the caller is leaving it to a system
   engineer to put the parts together so it's not an issue).

2. A lot of code was written using Python os.urandom(n), including the
   Python dict hash table randomization, the Python random module, and
   the Python multiprocessing module.

3. A few years ago, Guido deliberately changed the semantics of
   os.urandom(n) in Python by accepting PEP 524:

   https://www.python.org/dev/peps/pep-0524/
   https://mail.python.org/pipermail/security-sig/2016-August/000101.html

   Changing os.urandom(n) so it _does_ block, after over a decade of
   an API contract where it _never_ blocks, had the side effect that
   Python programs would almost all hang early at boot even if they
   don't appear to do any cryptography themselves.

4. To mitigate this symptom, Python was tweaked a tiny bit inside so
   that the random module and the dict hash table randomization would
   be initialized without blocking -- but _nothing else in Python_ has
   access to this path (without specifically opening /dev/urandom or
   similar).  That includes the multiprocessing module, which is used
   by, e.g., the build system meson.

Now we have a kind of nasty dilemma with Python:

- We could leave it as is, and Python will just block on some systems
  even if it's doing builds and not talking over the internet or
  anything.

- We could patch os.urandom(n) so that it returns to the old
  never-block semantics, and (a) risk factorable.net-type
  vulnerabilities in Python programs that assume os.urandom(n) _will_
  block until seeded, and (b) risk a bad reputation for breaking
  Python security like Debian did with OpenSSL.

Neither of these options is appealing -- I'm still working on finding
a way to finesse it that will work out better for everyone in the end.


Re: "wireguard" implementation improperly merged and needs revert

2020-08-22 Thread Taylor R Campbell
[followups to tech-net to reduce cross-posting noise]

Hi, Jason!

Sorry about not reaching out.  The history is that the code has been
kicking around the NetBSD world since Ozaki-san first wrote it in
2018, without getting merged into src.  It felt a shame to let it
wallow in that state indefinitely, and it seemed to be in pretty good
shape when I reviewed it this year, with a few small issues I saw, so
I dusted it off and merged it.

I would be happy to hear specific criticism of the code and its
implementation flaws and violations, and/or pointers to documentation
of the certain set of behaviours and security criteria that you expect
implementations to adhere to.  Also happy to help answer questions
about and navigate the NetBSD network stack if you want to review it
yourself.

As far as I know, Ozaki-san wrote the code following the WireGuard
protocol paper.  There are a few XXX comments in the code that should
be addressed, and there are some issues I know of that I have a small
TODO list for but didn't seem critical enough to block committing the
initial work:

[ ] self-tests for crypto
[ ] fix libssh dependency
[ ] dtrace probes
[ ] lockless radix tree lookups for peers
[ ] take advantage of sys/crypto/chacha 
[ ] modularize
[ ] split sliding window out
[ ] rename wgconfig(8) -> wg(8) and make interface compatible

For now, users have to go out of their way to enable the experimental
wg(4) interface, and I didn't have any specific timeline planned for
enabling it in GENERIC kernels -- wasn't likely to have been before
September 1st anyway and I'm happy to commit to holding off on that
until we've had a chance to discuss further in September.  Does that
work?

Thanks,
-Riastradh


WireGuard in NetBSD

2020-08-20 Thread Taylor R Campbell
Back in 2018, ozaki-r@ wrote an in-kernel implementation of WireGuard
 for NetBSD -- a point-to-point roaming-
capable virtual private network tunnel with modern cryptography.

Today I imported Ozaki-san's WireGuard code into NetBSD proper.
Here's an example of how to use it, taken from the new wg(4) man page.
You'll need to build a kernel config with `pseudo-device wg' in it --
it's not in any GENERIC kernels yet, and there's no loadable module
yet.  (Both of these will change -- also, you can try it without any
kernel changes using the wg-userspace(8) tool which runs in userland
with a rump server and tun(4); see the man page for details.)

Typical network topology:

 wm0 = 1.2.3.4   bge0 = 4.3.2.1

 Stationary server: Roaming client:
 +-++-+
 |A||B|
 |-||-|
 |[wm0]-internet[bge0]|
 |[wg0] port 1234 - - - (tunnel) - - - - - - [wg0]|
 |   10.0.1.0  |   10.0.1.1   |
 | |   || |
 +--[wm1]--+  +-+   +-+
  |   | VPN 10.0.1.0/24 |
  |   +-+
 +-+
 | LAN 10.0.0.0/24 |
 +-+

Generate key pairs on A and B:

 A# wg-keygen > /etc/wireguard/wg0
 A# wg-keygen --pub < /etc/wireguard/wg0 > /etc/wireguard/wg0.pub
 A# cat /etc/wireguard/wg0.pub
 N+B4Nelg+4ysvbLW3qenxIwrJVE9MdjMyqrIisH7V0Y=

 B# wg-keygen > /etc/wireguard/wg0
 B# wg-keygen --pub < /etc/wireguard/wg0 > /etc/wireguard/wg0.pub
 B# cat /etc/wireguard/wg0.pub
 X7EGm3T3IfodBcyilkaC89j0SH3XD6+/pwvp7Dgp5SU=

Configure A to listen on port 1234 and allow connections from B to
appear in the 10.0.1.0/24 subnet:

 A# ifconfig wg0 create 10.0.1.0/24
 A# wgconfig wg0 set private-key /etc/wireguard/wg0
 A# wgconfig wg0 set listen-port 1234
 A# wgconfig wg0 add peer B \
 X7EGm3T3IfodBcyilkaC89j0SH3XD6+/pwvp7Dgp5SU= \
 --allowed-ips=10.0.1.1/32
 A# ifconfig wg0 up
 A# ifconfig wg0
 wg0: flags=0x51 mtu 1420
 inet 10.0.1.0/24 ->  flags 0

Configure B to connect to A at 1.2.3.4 on port 1234 and the packets
can begin to flow:

 B# ifconfig wg0 create 10.0.1.1/24
 B# wgconfig wg0 set private-key /etc/wireguard/wg0
 B# wgconfig wg0 add peer A \
 N+B4Nelg+4ysvbLW3qenxIwrJVE9MdjMyqrIisH7V0Y= \
 --allowed-ips=10.0.1.0/32 \
 --endpoint=1.2.3.4:1234
 B# ifconfig wg0 up
 B# ifconfig wg0
 wg0: flags=0x51 mtu 1420
 inet 10.0.1.1/24 ->  flags 0
 B# ping -n 10.0.1.0
 PING 10.0.1.0 (10.0.1.0): 56 data bytes
 64 bytes from 10.0.1.0: icmp_seq=0 ttl=255 time=2.721110 ms


Re: Automated report: NetBSD-current/i386 build failure

2020-07-26 Thread Taylor R Campbell
> Date: Sun, 26 Jul 2020 15:20:05 +0300
> From: Andreas Gustafsson 
> 
> The build is still failing, current error as of 2020.07.26.09.17.24:
> 
>   ===  1 extra files in DESTDIR  =
>   Files in DESTDIR but missing from flist.
>   File is obsolete or flist is out of date ?
>   --
>   ./usr/libdata/debug/usr/tests/sys/crypto/chacha
>   =  end of 1 extra files  ===

Should be fixed now.


Re: VIA Padlock on AMD64

2020-07-25 Thread Taylor R Campbell
> Date: Fri, 24 Jul 2020 00:35:19 +0300
> From: Andrius V 
> 
> > > Cool!  Is it reproducible?  You can trigger reading from the RNG with:
> > >
> > > sysctl kern.entropy.gather=1
> 
> Tested, no messages are triggered for amd64, but sysctl
> kern.entropy.gather value is always 0:
> sysctl -w kern.entropy.gather=1
> kern.entropy.gather: 0 -> 1
> 
> For i386, same, no messages, but kern.entropy.gather always has
> changing negative numbers.

kern.entropy.gather is just a trigger that you can write to in order
to cause an effect; the number read out of it is meaningless.

I committed a change to fix the interpretation of the status code
returned according to the documentation:

https://mail-index.NetBSD.org/source-changes/2020/07/25/msg119718.html

Let me know if you see the symptom again!

(I should maybe do some more diagnostic tests on real hardware to
verify how it behaves in case the code was mostly right before and the
documentation is wrong.)


Re: VIA Padlock on AMD64

2020-07-21 Thread Taylor R Campbell
> Date: Wed, 22 Jul 2020 03:17:21 +0300
> From: Andrius V 
> 
> Not sure how useful to work on this though. Linux doesn't seem to have
> newer graphics IDs in VIA DRM driver too (possibly abandoned over new
> effort?). Currently Xorg server starts on VX900, but screen complains
> with "Input signal is out of range" error with or without viadrmums
> driver, so I can't see anything on the screen (maybe I need some
> manual xorg configuration). I don't remember if it was the case
> before, I usually used console only.

Is there any documentation for the graphics hardware?  Is hardware
available on the market in the US?  (Can you send me any?)

> Unrelated, but I think it appeared pretty recently in dmesg logs and
> related to your changes:
> [   4.0518619] aes: VIA ACE
> 
> [  11.7018582] cpu_rng via: failed repetition test
> T[  12.4718583] entropy: ready
> 
> Before it was:
> cpu_rng: VIA
> rnd: seeded with 256 bits

Cool!  Is it reproducible?  You can trigger reading from the RNG with:

sysctl kern.entropy.gather=1

I don't think I've changed anything in the past month or so that would
affect it -- more likely, this is a stochastic (and perhaps transient)
failure of the RNG hardware.  I also see a likely bug in the VIA
cpu_rng code judging by the Padlock documentation -- should be easy to
fix, once the air here cools down enough to boot up my VIA laptop
without setting fire to the table.


Re: VIA Padlock on AMD64

2020-07-20 Thread Taylor R Campbell
> Date: Mon, 20 Jul 2020 21:31:20 +
> From: Taylor R Campbell 
> 
> > Date: Tue, 21 Jul 2020 00:16:16 +0300
> > From: Andrius V 
> > 
> > After a bit more investigation, yes, it is kassert (kernel diagnostic
> > assertion "viadrm_pci_ids[i].subvendor == PCI_ANY_ID" ), doesn't
> > matter amd64/i386. 
> 
> Panic should be fixed in via_pci 1.3.

...excuse me, that's supposed to read: via_pci.c 1.4.


Re: VIA Padlock on AMD64

2020-07-20 Thread Taylor R Campbell
> Date: Tue, 21 Jul 2020 00:16:16 +0300
> From: Andrius V 
> 
> After a bit more investigation, yes, it is kassert (kernel diagnostic
> assertion "viadrm_pci_ids[i].subvendor == PCI_ANY_ID" ), doesn't
> matter amd64/i386. 

Panic should be fixed in via_pci 1.3.

>The cause seems pretty obvious: viadrv_PCI_IDS
> doesn't include some chrome graphics like 0x7122 from VX900 or 0x122
> from VX800, thus subvendor value is 0 by default, PCI_ANY_ID is 1 on
> the other hand, if I am not mistaken. Adding 0x7122 to the list allows
> viadrmums to attach and boot, but with some errors in the logs.

Can you share dmesg?

Do you have any older 32-bit VIA hardware to compare to see whether it
works better there?  (Haven't tried it on my 32-bit VIA laptop in a
while, and right now it's too hot to power that little space heater
on!)


Re: 9.99.69 panic - libcrypto changes?

2020-07-20 Thread Taylor R Campbell
> Date: Thu, 2 Jul 2020 23:09:16 +0100
> From: Chavdar Ivanov 
> 
> On amd64 9.99.69 from yesterday I get:
> 
> crash: _kvm_kvatop(0)
> Crash version 9.99.69, image version 9.99.69.
> Kernel compiled without options LOCKDEBUG.
> System panicked: fpudna from kernel, ip 0x802292af, trapframe
> 0xbe013c564a50

First couple of attempts weren't quite right, but I think there's a
good chance this is fixed in HEAD now with sys/arch/x86/x86/fpu.c
revision 1.72.


x86 in-kernel fpu bug

2020-07-20 Thread Taylor R Campbell
> Date: Mon, 20 Jul 2020 11:04:21 +0100
> From: Patrick Welche 
> 
> After a -current/amd64 update, a sudden outbreak of floating point exceptions:
> 
> /usr/src/tools/gcc/../../external/gpl3/gcc/dist/gcc/tree-ssa-operands.c:1348:1:
>  
> internal compiler error: Floating point exception 
>   
> 
> Fetching message headers...
> Floating point exception (core dumped) mutt
> 
> Any guesses?

There's a good chance this has been fixed in sys/arch/x86/x86/fpu.c
revision 1.72 -- can you update and try again with a new kernel?


Re: VIA Padlock on AMD64

2020-07-19 Thread Taylor R Campbell
> Date: Mon, 20 Jul 2020 08:16:34 +0300
> From: Andrius V 
> 
> Considering its limited usage, isn't the kernel module actually more
> useful, since you can add it to the boot config once you really need it?
> Unless the in-kernel functionality won't work with the kernel module for some
> reason. Regardless, I was simply looking from the perspective of i386,
> since it has the module available, why not just to move config for both? I
> can agree though, that probably there were close to zero users utilizing
> padlock over the years. Nevertheless, I have two VIA Nano boards, both
> successfully booted with a padlock engine attached.

The code is just not worth much, and I kind of plan to delete it
altogether once a few other things in the kernel AES stack are fixed.
If you're not doing in-kernel IPsec it's really not worth anything.

> Unfortunately, testing on three different boards didn't show much
> difference, even when specifically openssl's padlock engine was used
> (actually it showed worse results on bigger blocks). Seems like padlock
> instructions were not utilized in any case, at least if compared to some
> results posted on the internet. Either I was doing something wrong or
> openssl needs special patches. It was consistent with Linux results though.

You can tell whether openssl is doing the cryptography in the kernel
via /dev/crypto (and therefore potentially using the via_padlock.c
module) as follows:

% openssl speed aes-256-cbc
Doing aes-256 cbc for 3s on 16 size blocks: 14899528 aes-256 cbc's in 2.98s
Doing aes-256 cbc for 3s on 64 size blocks: 4030412 aes-256 cbc's in 2.97s
[...]
% openssl speed -evp aes-256-cbc
Doing aes-256-cbc for 3s on 16 size blocks: 1732147 aes-256-cbc's in 0.76s
Doing aes-256-cbc for 3s on 64 size blocks: 1373417 aes-256-cbc's in 0.43s
[...]

Note that the real time is about three seconds for each line, but the
last number shows user time -- it's about three seconds _without_
`-evp', but under one second _with_ `-evp', which means that the
computation is happening in the kernel.

Of course, you will probably find that the throughput is not actually
improved -- that's because the /dev/crypto userland interface is a
piece of garbage that's useful only to check the results of crypto
devices.  Which is why I said that the module is useful only for
in-kernel IPsec.

> > > > On the side note, same goes to viadrmums module (it's i386 only now)
> > > > but at least on VX900 I ended up with the crash, so I guess it may be
> > > > incompatible with amd64 (though it builds successfully). Will try to
> > > > test VX800 later on.
> >
> > I haven't tried viadrmums in a while.  What was the crash?
> 
> I am planning to submit a PR once I retest the driver with the latest
> images. Actually, it may not be amd64-specific after all. The crash happens
> during boot in the match function; likely the newer graphics are unsupported
> and trigger some kassert?

Can't say without the specific kassert (and ideally stack trace too).
I don't see anything obvious in viadrm_match that would be
problematic.


Re: VIA Padlock on AMD64

2020-07-19 Thread Taylor R Campbell
> Date: Sun, 19 Jul 2020 14:16:25 +0300
> From: Andrius V 
> 
> I just noticed that padlock driver was made to compile on amd64
> recently. I believe it should be included in modules Makefile
> (/sys/modules/Makefile) as well (as per my git diff in previous
> messages).

I guess we could do that.  It's not really that useful -- the only
application it's good for is in-kernel IPsec.  OpenSSL should already
be able to take advantage of the CPU support in userland -- no kernel
driver needed.

(I also don't have any 64-bit VIA hardware to test it on, which is why
I added it only to amd64/ALL to make compile-testing easier.)
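
Incidentally, userland can probe for the PadLock units itself through the
Centaur CPUID leaves, which is presumably how OpenSSL's padlock engine finds
them without any kernel help.  A rough sketch -- the bit positions are taken
from VIA's PadLock documentation, so treat them as assumptions to verify
against your hardware:

#include <stdio.h>
#include <cpuid.h>	/* GCC/Clang __cpuid() macro */

/*
 * Probe the Centaur extended CPUID leaves for PadLock features:
 * leaf 0xC0000001 EDX is assumed to carry present/enabled bit pairs
 * for the RNG (bits 2/3) and ACE (bits 6/7).
 * (A robust probe would also check the CPU vendor string first.)
 */
int
main(void)
{
	unsigned eax, ebx, ecx, edx;

	__cpuid(0xC0000000, eax, ebx, ecx, edx);
	if (eax < 0xC0000001) {
		printf("no Centaur extended CPUID leaves\n");
		return 0;
	}
	__cpuid(0xC0000001, eax, ebx, ecx, edx);
	printf("RNG (xstore): %s\n",
	    (edx & 0x0c) == 0x0c ? "present and enabled" : "no");
	printf("ACE (xcrypt): %s\n",
	    (edx & 0xc0) == 0xc0 ? "present and enabled" : "no");
	return 0;
}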

> > I created a patch to show the intention. It builds and "padlock"
> > attaches to the CPU. On another hand, I failed to find any application
> > which would be taking advantage of padlock (including i386). Neither
> > openssl (tried speed tests), nor openssh (some commands I found in
> > internet)  showed any difference with padlock module or without it.
> > Maybe some patching is needed but it seems hardware acceleration is
> > ignored with these apps. If somebody know a good way to test it on
> > NetBSD, I would be glad to hear it.

Try comparing

openssl speed aes-256-cbc
openssl speed -evp aes-256-cbc

and particularly note the user time spent.

> > On the side note, same goes to viadrmums module (it's i386 only now)
> > but at least on VX900 I ended up with the crash, so I guess it may be
> > incompatible with amd64 (though it builds successfully). Will try to
> > test VX800 later on.

I haven't tried viadrmums in a while.  What was the crash?


Re: 9.99.69 panic - libcrypto changes?

2020-07-06 Thread Taylor R Campbell
> Date: Mon, 6 Jul 2020 13:49:26 +0100
> From: Chavdar Ivanov 
> 
> On Mon, 6 Jul 2020 at 02:14, Taylor R Campbell  wrote:
> >
> > This and the earlier panic you reported may be fixed by
> >
> > https://mail-index.netbsd.org/source-changes/2020/07/06/msg119081.html
> 
> I upgraded this morning. Unfortunately it did not fix it for me,
> though I did not get a panic. I got a deep CPU loop with a fully
> spinning fan and no reaction to the attempt to break into the
> debugger, so I had to hit the power button.

Well, that's unfortunate!

I am trying to reproduce on a wifi network with WPA2, verified by the
`ifconfig -v' counters to be using software CCMP, while simultaneously
doing heavy cgd I/O, and so far I haven't been able to reproduce.
I'll keep investigating.

The attached patch might help with breaking into the debugger, if
you'd like to give it a try -- won't fix any underlying problem, but
may help us to find out what the problem is.
diff -r 622b4cb96b07 sys/arch/x86/x86/fpu.c
--- a/sys/arch/x86/x86/fpu.c	Sun Jul 05 14:24:05 2020 +
+++ b/sys/arch/x86/x86/fpu.c	Mon Jul 06 15:49:34 2020 +
@@ -383,9 +383,10 @@ fpu_kern_enter(void)
struct cpu_info *ci;
int s;
 
-   s = splhigh();
+   s = splnet();
 
ci = curcpu();
+   KASSERTMSG(ci->ci_ilevel <= IPL_NET, "ilevel=%d", ci->ci_ilevel);
KASSERT(ci->ci_kfpu_spl == -1);
ci->ci_kfpu_spl = s;
 

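
Spelled out with comments, the patched entry path reads roughly like this
(reconstructed from the hunk above; the rest of the function is unchanged
and omitted):

void
fpu_kern_enter(void)
{
	struct cpu_info *ci;
	int s;

	/*
	 * Raise to splnet() rather than splhigh(): leaving interrupts
	 * above IPL_NET enabled is what may make it possible to break
	 * into the debugger if something spins inside a kernel-FPU
	 * section.  Diagnostic aid only, not a fix.
	 */
	s = splnet();

	ci = curcpu();
	/* Catch callers that enter already running above IPL_NET. */
	KASSERTMSG(ci->ci_ilevel <= IPL_NET, "ilevel=%d", ci->ci_ilevel);
	/* No nesting: the saved spl slot must be free on this CPU. */
	KASSERT(ci->ci_kfpu_spl == -1);
	ci->ci_kfpu_spl = s;	/* restored when the kernel-FPU section ends */

	/* ... remainder of fpu_kern_enter() unchanged ... */
}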
