Hi,

I've observed this issue previously on an old 3.10 branch but wrote it
off because I could not reproduce it in any meaningful way. Currently I
am seeing it on a 3.10 branch where the well-known KVM-related and
RCU-related issues are more or less all patched.

How the problematic state is reached:
 - run a hypervisor for a really long time: it took a year and a half
for the issue to show up on the old branch mentioned above, but on the
newer kernel, probably due to higher load, it took roughly half a
year,
 - suddenly a single VM takes a lock and becomes unresponsive while
all its threads show the Running state. Under this lock the VM is
neither killable via SIGKILL nor freezable via the freezer cgroup. The
only obvious symptom is that it no longer consumes any CPU cycles
(no counter in sched info advances), and of course it cannot be
debugged anymore. As a result, it is nearly impossible to tell at a
glance where the lock sits, since there are no distinctive processes
that are at least sleeping and could be ruled out.
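
For anyone who wants to check for the same symptom, here is a minimal
sketch (not the tool I used) that samples every thread of a process
twice and reports those that stay in state R while their on-CPU time
does not move. It assumes the kernel exposes
/proc/<pid>/task/<tid>/schedstat, whose first field is the cumulative
on-CPU time in nanoseconds; the function names and the 5 second
interval are made up for illustration.

```python
#!/usr/bin/env python3
"""Sketch: flag threads that report state R but gain no on-CPU time."""
import os
import sys
import time


def parse_state(status_text):
    """Return the one-letter state from the text of a /proc status file."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            return line.split()[1]
    return None


def parse_run_ns(schedstat_text):
    """First schedstat field: cumulative on-CPU time in nanoseconds."""
    return int(schedstat_text.split()[0])


def stalled_threads(before, after):
    """Thread IDs that are 'R' in both samples with unchanged on-CPU time.

    Each sample maps tid -> (state, run_ns).
    """
    return sorted(
        tid
        for tid, (state, run_ns) in after.items()
        if tid in before and state == "R" and before[tid] == (state, run_ns)
    )


def sample(pid):
    """Read state and on-CPU time for every thread of pid."""
    result = {}
    task_dir = "/proc/%d/task" % pid
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "status")) as f:
                state = parse_state(f.read())
            with open(os.path.join(task_dir, tid, "schedstat")) as f:
                run_ns = parse_run_ns(f.read())
        except OSError:
            continue  # thread exited between listdir() and open()
        result[int(tid)] = (state, run_ns)
    return result


# Usage: script.py <pid>
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1].isdigit():
    pid = int(sys.argv[1])
    first = sample(pid)
    time.sleep(5)  # arbitrary sampling interval
    print(stalled_threads(first, sample(pid)))
```

A thread that is merely waiting on a runqueue for the whole interval
would also be reported, so a longer interval or repeated runs are
needed before drawing conclusions.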

It looks like I may have hit a pure scheduler issue, so if nothing
in the attached recursive stack/status dump rings a bell, I'd
CC the scheduler folks. The timer/RCU configs are attached for
convenience.
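
A per-thread dump of this kind can be collected along the following
lines. This is only a sketch of one way to do it, not necessarily how
the attached files were produced: it assumes the per-thread files
/proc/<pid>/task/<tid>/stack and /proc/<pid>/task/<tid>/status are
readable (the stack file normally needs root), and the helper name and
the proc_root parameter are hypothetical.

```python
import os


def dump_per_thread(pid, name, proc_root="/proc"):
    """Concatenate <proc_root>/<pid>/task/<tid>/<name> for every thread."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    chunks = []
    # Sort numerically so the output order is stable between runs.
    for tid in sorted(os.listdir(task_dir), key=int):
        path = os.path.join(task_dir, tid, name)
        try:
            with open(path) as f:
                body = f.read()
        except OSError as exc:
            # Keep going: one unreadable thread should not hide the rest.
            body = "<unreadable: %s>\n" % exc.strerror
        chunks.append("==> %s <==\n%s" % (path, body))
    return "\n".join(chunks)
```

Calling dump_per_thread(pid, "stack") and dump_per_thread(pid,
"status") for the stuck process gives two text blobs that can be saved
next to each other.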

Thanks for looking into this!

stack:
http://xdel.ru/downloads/vm-sched-hang/stack.txt
status:
http://xdel.ru/downloads/vm-sched-hang/status.txt
config:
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y

# RCU Subsystem
CONFIG_TREE_RCU=y
# CONFIG_PREEMPT_RCU is not set
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_USER_QS=y
CONFIG_RCU_FANOUT=64
CONFIG_RCU_FANOUT_LEAF=16
# CONFIG_RCU_FANOUT_EXACT is not set
# CONFIG_RCU_FAST_NO_HZ is not set
# CONFIG_TREE_RCU_TRACE is not set
CONFIG_RCU_NOCB_CPU=y
# CONFIG_RCU_NOCB_CPU_NONE is not set
# CONFIG_RCU_NOCB_CPU_ZERO is not set
CONFIG_RCU_NOCB_CPU_ALL=y
# RCU Debugging
# CONFIG_SPARSE_RCU_POINTER is not set
# CONFIG_RCU_TORTURE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=21
# CONFIG_RCU_CPU_STALL_INFO is not set
# CONFIG_RCU_TRACE is not set
