Hi everyone,

I recently upgraded my main amd64 server from 10.3-stable (r302011) to 11.0-stable (r308099). It went smoothly except for one big issue: certain applications (but not the system as a whole) respond very sluggishly, and video playback of any kind is extremely choppy.
The system is under very light load, and I see no evidence of abnormal interrupt latency or interrupt load. More interestingly, if I place the system under full load (~0.0% idle) the problem *disappears* and playback/responsiveness are smooth and quick.

Running ktrace on some of the affected apps points me at the problem: huge variance in the amount of time spent in the nanosleep system call. A sleep of, say, 5ms might take anywhere from 5ms to ~500ms from entry to return of the syscall. OTOH, anything that is CPU-bound or that waits on condvars or I/O interrupts seems to work fine, so this doesn't seem to be an issue with overall system latency.

I can repro this with a simple program that just does a 3ms usleep in a tight loop (i.e. roughly the amount of time a video player would sleep between frames @ 30fps). At light load ktrace will show the huge nanosleep variance; under heavy load every nanosleep will complete in almost exactly 3ms.

FWIW, I don't see this on -current, although right now all my -current images are VMs on different HW so that might not mean anything.

I'm not aware of any recent timer- or scheduler-specific changes, so I'm wondering if perhaps the recent IPI or taskqueue changes might be somehow to blame. I'm not especially familiar w/ the relevant parts of the kernel, so any guidance on where I should focus my debugging efforts would be much appreciated.

Thanks,
Jason