Khoa and I have been discussing a workload that triggers softlockups and
hung task warnings inside the guest.  These warnings can pop up due to
bugs in the guest Linux kernel but they can also be triggered by the
hypervisor if vcpus are not being scheduled at reasonable times.

I've wanted a tool that reports high stealtimes and includes the last
vmexit reason.  This allows us to figure out if specific I/O device
emulation is taking too long or if other factors like host memory
pressure are degrading guest performance.

Here is a first sketch of such a tool.  It's a perf-script(1) Python
script which can be used to analyze perf.data files recorded with
kvm:kvm_entry and kvm:kvm_exit events.

Stealtimes exceeding a threshold will be flagged up:

  $ perf script -s /absolute/path/to/stealtime.py
  100466.1062486786 9690: steal time 00000.029318914 secs,
                          exit_reason IO_INSTRUCTION,
                          guest_rip 0xffffffff81278f02,
                          isa 1, info1 0xcf80003, info2 0x0

The example above shows an I/O access to 0xcf8 (PCI Configuration Space
Address port) that took about 28 milliseconds.  The host pid was 9690; this
can be used to investigate the QEMU vcpu thread.  The guest rip can be used
to investigate guest code that triggered this vmexit.

Given this information, it becomes possible to debug QEMU to figure out
why vmexit handling is taking too long.  It might be due to global mutex
contention if another thread holds the global mutex while blocking.
This sort of investigation needs to be done manually today but it might
be possible to add perf event handlers to watch for global mutex
contention inside QEMU and automatically identify the culprit.

Stalls inside the kvm kernel module can also be investigated since
kvm:kvm_exit events are triggered when they happen too.

I wanted to share in case it is useful for others.  Suggestions for
better approaches welcome!

Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>

---

#!/usr/bin/env python
# perf script event handlers, generated by perf script -g python
# Licensed under the terms of the GNU GPL License version 2

# Script to print steal times longer than a given threshold
#
# To collect trace data:
# $ perf record -a -e kvm:kvm_entry -e kvm:kvm_exit
#
# To print results from trace data:
#
# $ perf script -s /absolute/path/to/stealtime.py
# 100466.1062486786 9690: steal time 00000.029318914 secs,
#                         exit_reason IO_INSTRUCTION,
#                         guest_rip 0xffffffff81278f02,
#                         isa 1, info1 0xcf80003, info2 0x0
#
# The example above shows an I/O access to 0xcf8 (PCI Configuration Space
# Address port) that took about 28 milliseconds.  The host pid was 9690; this
# can be used to investigate the QEMU vcpu thread.  The guest rip can be used
# to investigate guest code that triggered this vmexit.

# Print steal times longer than this threshold in milliseconds
THRESHOLD_MS = 100

import os
import sys

sys.path.append(os.environ['PERF_EXEC_PATH'] + \
        '/scripts/python/Perf-Trace-Util/lib/Perf/Trace')

from perf_trace_context import *
from Core import *

vcpu_threads = {}

def trace_begin():
    print 'argv:', str(sys.argv)

def trace_end():
        pass

def kvm__kvm_exit(event_name, context, common_cpu,
        common_secs, common_nsecs, common_pid, common_comm,
        exit_reason, guest_rip, isa, info1, 
        info2):

    if common_pid in vcpu_threads:
        last = vcpu_threads[common_pid]
        assert last[0] == 'kvm__kvm_entry'
        while last[2] > common_nsecs:
            common_secs -= 1
            common_nsecs += 1 * 1000 * 1000 * 1000
        delta_secs = common_secs - last[1]
        delta_nsecs = common_nsecs - last[2]

    vcpu_threads[common_pid] = (event_name, common_secs, common_nsecs, 
exit_reason, guest_rip, isa, info1, info2)

def kvm__kvm_entry(event_name, context, common_cpu,
        common_secs, common_nsecs, common_pid, common_comm,
        vcpu_id):

    if common_pid in vcpu_threads:
        last = vcpu_threads[common_pid]
        assert last[0] == 'kvm__kvm_exit'
        while last[2] > common_nsecs:
            common_secs -= 1
            common_nsecs += 1 * 1000 * 1000 * 1000
        delta_secs = common_secs - last[1]
        delta_nsecs = common_nsecs - last[2]
        if delta_secs > 0 or delta_nsecs > THRESHOLD_MS * 1000 * 1000:
            print '%05u.%09u %u: steal time %05u.%09u secs, exit_reason %s, 
guest_rip %#x, isa %d, info1 %#x, info2 %#x' % (
                    last[1], last[2],
                    common_pid,
                    delta_secs, delta_nsecs,
                    symbol_str("kvm__kvm_exit", "exit_reason", last[3]),
                    last[4],
                    last[5],
                    last[6],
                    last[7])

    vcpu_threads[common_pid] = (event_name, common_secs, common_nsecs)

def trace_unhandled(event_name, context, event_fields_dict):
                print ' '.join(['%s=%s'%(k,str(v))for k,v in 
sorted(event_fields_dict.items())])
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to