On 22/4/18 10:36 pm, Rodney W. Grimes wrote:
Konstantin Belousov wrote:
On Sat, Apr 21, 2018 at 11:30:55PM +0000, Rick Macklem wrote:
Konstantin Belousov wrote:
On Sat, Apr 21, 2018 at 07:21:58PM +0000, Rick Macklem wrote:
I decided to start a new thread on current related to SCHED_ULE, since I see
more than just performance degradation, and with a recent current kernel.
(I cc'd a couple of the people discussing performance problems in freebsd-stable
recently under the subject line "Re: kern.sched.quantum: Creepy, sadistic
scheduler".)

When testing a pNFS server on a single core i386 with 256Mbytes using a Dec. 2017
current/head kernel, I would see about a 30% performance degradation (elapsed
run time for a kernel build over NFSv4.1) when the server kernel was built with
options SCHED_ULE
instead of
options SCHED_4BSD
So, now that I have decreased the number of nfsd kernel threads to 32, it works
with both schedulers and with essentially the same performance (i.e., the 30%
performance degradation has disappeared).

Now, with a kernel from a couple of days ago, the
options SCHED_ULE
kernel becomes unusable shortly after starting testing.
I have seen two variants of this:
- Became essentially hung. All I could do was ping the machine from the network.
- Reported "vm_thread_new: kstack allocation failed
   and then any attempt to do anything gets "No more processes".
This is strange.  It usually means that you get KVA either exhausted or
severely fragmented.
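As a hedged aside, one quick check on a live system: the vm.kvm_size and
vm.kvm_free sysctls, where present on the platform, report total and remaining
kernel VA, e.g.

    sysctl vm.kvm_size vm.kvm_free    # total / free kernel virtual address space

which should show the headroom shrinking as the kernel thread count climbs.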
Yes. I reduced the number of nfsd threads from 256->32 and the SCHED_ULE
kernel is working ok now. I haven't done enough to compare performance yet.
Maybe I'll post again when I have some numbers.
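For reference, the nfsd thread count can also be pinned down at boot time; a
hedged sketch using the standard rc.conf knob (the exact flags should be checked
against nfsd(8)):

    # /etc/rc.conf
    nfs_server_enable="YES"
    nfs_server_flags="-u -t -n 32"    # cap the server at 32 nfsd threads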

Enter ddb; it should be operational since pings are being answered.  Try to see
where the threads are stuck.
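Roughly, and hedged since the commands available depend on the kernel options:
with KDB/DDB compiled in, one can break into the debugger and look around with
something like

    sysctl debug.kdb.enter=1   # from a root shell, drop into ddb

    db> ps                     # wchan shows what each thread is sleeping on
    db> show allchains         # lock chains, if something is deadlocked
    db> alltrace               # stack traces for every thread (long output)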
I didn't do this, since reducing the number of kernel threads seems to have fixed
the problem. For the pNFS server, the nfsd threads will spawn additional kernel
threads to do proxies to the mirrored DS servers.

With the only difference being a kernel built with
options SCHED_4BSD
everything works and performs the same as the Dec 2017 kernel.

I can try rolling back through the revisions, but it would be nice if someone
could suggest where to start, because it takes a couple of hours to build a
kernel on this system.

So, something has made things worse for a head/current kernel this winter, rick
There are at least two potentially relevant changes.

First is r326758 Dec 11 which bumped KSTACK_PAGES on i386 to 4.
I've been running this machine with KSTACK_PAGES=4 for some time, so no change.
W.r.t. Rodney Grimes' comments about this (which didn't end up in this message
in the thread):
I didn't see any instability when using KSTACK_PAGES=4 until this cropped up and
seemed to be scheduler related (but not really, it seems).
I bumped it to KSTACK_PAGES=4 because I needed that for the pNFS Metadata
Server code.
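For reference, that is the usual kernel config knob:

    options         KSTACK_PAGES=4    # per-thread kernel stack size, in pages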

Yes, NFS does use quite a bit of kernel stack. Unfortunately, it isn't one big
item getting allocated on the stack, but many moderate-sized ones.
(A part of it is multiple instances of "struct vattr", some buried in "struct nfsvattr",
that NFS needs to use. I don't think these are large enough to justify malloc/free,
but it has to use several of them.)

One thing I did try fixing was about 6 cases where "struct nfsstate" ended up on
the stack. I changed the code to malloc/free them and then, when testing, to
my surprise I saw a 20% performance hit and shelved the patch.
Now that I know that the server was running near its limit, I might try this one
again, to see if the performance hit doesn't occur when the machine has adequate
memory. If the performance hit goes away, I could commit this, but it wouldn't
have that much effect on the kstack usage. (It's interesting how this patch ended
up related to the issue this thread discussed.)
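For what it's worth, the shape of that patch is roughly the standard malloc(9)
substitution sketched below. This is only a hedged illustration with placeholder
names and fields; the real struct nfsstate and the functions that use it live in
the NFS server code and look different.

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/malloc.h>

    /* Placeholder stand-in for the real struct nfsstate. */
    struct nfsstate_sketch {
            char            ls_owner[128];
            uint32_t        ls_flags;
    };

    static int
    nfsstate_offstack_example(void)
    {
            struct nfsstate_sketch *stp;
            int error = 0;

            /*
             * Before: "struct nfsstate_sketch st;" consumed its full size on
             * the kernel stack for the life of the call.  After: allocate it
             * with malloc(9) and free it on the way out.  M_WAITOK is fine in
             * nfsd thread context, so the allocation cannot fail.
             */
            stp = malloc(sizeof(*stp), M_TEMP, M_WAITOK | M_ZERO);

            /* ... use *stp exactly as the on-stack copy was used ... */

            free(stp, M_TEMP);
            return (error);
    }

The trade-off is an extra malloc/free pair per call, which is presumably where
the 20% hit showed up on a memory-starved machine.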
Anything we can do to help relieve KSTACK usage, especially on i386,
is helpful.  There is a thread from quite some time back where someone
came up with a compile-time static "this function uses X bytes of
local stack" check, and a bit of cleanup was done.  We should pursue
this issue further.

That was me.

Use -Wframe-larger-than=<arg>
<https://clang.llvm.org/docs/ClangCommandLineReference.html#cmdoption-clang-wframe-larger-than>
and set it to something like 512 bytes (obviously you have to make warnings
non-fatal as well).
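A hedged illustration of what that looks like in practice; some_nfs_file.c is a
stand-in name, and wiring the flag into a full kernel build is left out here
(the warning flag itself is the real clang/gcc option):

    # check one file at a time, 512-byte threshold, warnings non-fatal
    cc -c -Wframe-larger-than=512 -Wno-error some_nfs_file.c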

My experience with the i386/KSTACK issues was attempting to do installs
from snapshot .iso's; I usually had to change to a custom kernel without
INVARIANTS and WITNESS, or reduce KSTACK to 2 and suffer the small-stack
problem (i.e., don't use NFS during the install).  Neither was very pleasant.
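A hedged sketch of that kind of stripped-down kernel config, using stock
config(5) syntax with an illustrative ident:

    include         GENERIC
    ident           LEAN-I386

    nooptions       INVARIANTS
    nooptions       INVARIANT_SUPPORT
    nooptions       WITNESS
    nooptions       WITNESS_SKIPSPIN
    options         KSTACK_PAGES=2    # small stacks again; the NFS caveat applies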

I have found it impractical to run the 4-page KSTACK in production
VMs using i386 due to the memory requirements.  I run many very lean
i386 VMs with 64MB of memory.  I suspect our user base also has
many people doing this, and it would be to our advantage to try
to reduce our kernel stack needs.


Second is r332489 (Apr 13), which introduced the 4/4G KVA/UVA split.
Could this change have resulted in the system being able to allocate fewer
kernel threads/stacks for some reason?
Well, it could, as anything can be buggy. But the intent of the change
was to give 4G KVA, and it did.
Righto. No concern here. I suspect the Dec. 2017 kernel was close to the limit
(see the performance issue that went away, noted above), and any change could
have pushed it across the line, I think.

Consequences of the first one are obvious: it is much harder to find
a place to map the stack.  The second change, on the other hand, provides
almost a full 4G for KVA and should have mostly compensated for the negative
effects of the first.

And, I cannot see how changing the scheduler would fix or even affect that
behaviour.
My hunch is that the system was running near its limit for kernel threads/stacks.
Then, somehow, the timing caused by SCHED_ULE resulted in the nfsd trying to
reach a higher peak number of threads and hitting the limit.
SCHED_4BSD happened to result in timing such that it stayed just below the
limit and worked.
I can think of a couple of things that might affect this:
1 - If SCHED_ULE doesn't get around to terminating kernel threads as quickly,
       then they wouldn't release their resources before more new ones
       are spawned.
The scheduler has nothing to do with thread termination.  It might
select running threads in a way that causes the undesired pattern to
appear, which might create some amount of backlog for termination, but
I doubt it.

2 - If SCHED_ULE handles the nfsd threads in a more "bursty" way, then the burst
       could try to spawn more mirror DS worker threads at about the same time.
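One hedged way to check that theory without ddb would be to watch the nfsd
thread count from userland while the test runs, alongside the vm.kvm_free
check mentioned earlier:

    # one line per nfsd thread (plus a header); watch the count climb under load
    procstat -t $(pgrep nfsd)

If the thread count peaks noticeably higher under SCHED_ULE just before the
kstack allocation failures appear, that would support the burstiness idea.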

Anyhow, thanks for the help, rick
Have a good day, rick