Hello Roger,
        Thank you for the time and patience you devoted to reading my messages 
and e-mails. I really appreciate that.
        After further analysis, the cause turned out to be a user process named 
monitor, which frequently read and wrote xenstore in my test VM. The monitor 
was suspended when stop_all_proc was called, and I found that it still held 
the xs.request_mutex lock, so the subsequent xs_write and xs_suspend calls 
could never acquire the lock. That is why the VM hung.
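        The pattern can be shown with a minimal userland analogue (a sketch 
only: request_mutex, monitor_write and the pause() call stand in for 
xs.request_mutex, the monitor's xenstore write and the SIGSTOP delivered by 
stop_all_proc; this is not the kernel code):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t request_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the monitor: takes the request lock and is then "stopped"
 * (simulated with pause()) before it can release it. */
static void *
monitor_write(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&request_mutex);
	printf("monitor: holding request_mutex, now stopped\n");
	pause();			/* simulates stop_all_proc's SIGSTOP */
	pthread_mutex_unlock(&request_mutex);
	return (NULL);
}

int
main(void)
{
	pthread_t monitor;

	pthread_create(&monitor, NULL, monitor_write, NULL);
	sleep(1);			/* let the monitor take the lock */

	/* Stand-in for xs_suspend: blocks forever, i.e. the hang. */
	printf("suspend: waiting for request_mutex...\n");
	pthread_mutex_lock(&request_mutex);
	printf("suspend: never reached\n");
	return (0);
}

Built with "cc demo.c -lpthread", the second lock attempt never returns, 
which matches the hang I observed.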
        I can't use git send-email, so I have attached the patch directly. In 
this patch, the xs.request_mutex lock is taken ahead of stop_all_proc, and 
xs_dev_write in the xenstore device has to wait until stop_all_proc has 
finished; a rough sketch of this ordering follows below. In my test 
environment, the VM stayed alive after live migrating back and forth 1000 
times.
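        In outline the change looks like this (pseudocode only; the real 
change is in the attached patch, and xs_lock/xs_unlock are hypothetical 
wrappers around xs.request_mutex):

	/* Suspend path (dev/xen/control/control.c), sketch: take the
	 * xenstore request lock before stopping user processes, so that
	 * no process can be left stopped while holding it. */
	xs_lock();		/* hypothetical: sx_xlock(&xs.request_mutex) */
	stop_all_proc();	/* monitor is now stopped outside the lock */
	/* ... power_suspend, sched_bind, etc. ... */
	resume_all_proc();
	xs_unlock();		/* hypothetical: sx_xunlock(&xs.request_mutex) */

With this ordering, xs_dev_write simply blocks on the lock until the suspend 
path has finished.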
        I would like to know what you think of this patch. Please let me know 
if you have any questions. Thanks!

    Note: The patch is based on the origin/release/10.2.0 branch.

-----Original Message-----
From: Roger Pau Monne [mailto:roger....@citrix.com] 
Sent: December 16, 2016 0:47
To: Liuyingdong
Cc: Zhaojun (Euler); Suoben
Subject: Re: why a FreeBSD 10.2 VM hang under high pressure?

On Thu, Dec 15, 2016 at 01:58:33PM +0000, Liuyingdong wrote:
> Hello,
>       The migration process didn't finish successfully; the VM hung on the 
> sending side. 
>       Because I had introduced a suspend-cancel mechanism in my test 
> environment, I expected the VM to stay alive on the sending side if the 
> libxl suspend timed out. Since memory pressure on the sending side was very 
> high, the VM dirtied pages faster than they could be copied, which would 
> have resulted in a much longer downtime, so the prediction algorithm 
> cancelled the live migration. In my test environment, the VM was alive on 
> the sending side as I expected.

Right, this is thanks to your recover-from-failed-suspension patch; I will try 
to review it ASAP.

>     I confirmed that stop_all_proc had finished by printing a log:
> =====================log start===============================
> [===test===]->xctrl_suspend
> [===test===]start to stop_all_proc
> [===1209===stop]p_comm=sftp-server,pid=3158,r=0
> [===1209===stop]p_comm=csh,pid=3156,r=0
> [===1209===stop]p_comm=sshd,pid=3152,r=0
> [===1209===stop]p_comm=csh,pid=836,r=0
> [===1209===stop]p_comm=sshd,pid=833,r=0
> [===1209===stop]p_comm=getty,pid=715,r=0
> [===1209===stop]p_comm=getty,pid=714,r=0
> [===1209===stop]p_comm=getty,pid=713,r=0
> [===1209===stop]p_comm=getty,pid=712,r=0
> [===1209===stop]p_comm=getty,pid=711,r=0
> [===1209===stop]p_comm=getty,pid=710,r=0
> [===1209===stop]p_comm=getty,pid=709,r=0
> [===1209===stop]p_comm=getty,pid=708,r=0
> [===1209===stop]p_comm=getty,pid=707,r=0
> [===1209===stop]p_comm=cron,pid=663,r=0
> [===1209===stop]p_comm=sendmail,pid=659,r=0
> [===1209===stop]p_comm=sendmail,pid=656,r=0
> [===1209===stop]p_comm=sshd,pid=653,r=0
> [===1209===stop]p_comm=uvp-monitor,pid=559,r=0
> [===1209===stop]p_comm=syslogd,pid=455,r=0
> [===1209===stop]p_comm=devd,pid=381,r=0
> [===1209===stop]p_comm=dhclient,pid=380,r=0
> [===1209===stop]p_comm=dhclient,pid=318,r=0
> [===test===]finish to stop_all_proc
> [===test===]start to power_suspend
> [===test===]finish to power_suspend
> [===test===]start to sched_bind
> [===test===]finish to sched_bind
> [===test===]running on CPU#0
> ...
> ======================log end==============================
> 
>       I debugged remotely and the stack information is shown below. To my 
> surprise, acpi_cpu.c:1038 is in acpi_cpu_c1(), but the next frame suddenly 
> jumped to the pmap_kenter() function. I guess the VM was doing the 
> crashdump map after acpi_cpu_c1().
> What's your opinion about it?

See below for a little bit of information about this trace. I'm a little bit 
puzzled here: did this crash/hang happen with your resume-from-failed-suspension 
patch applied, or is this trace from before applying your patch?

> ======================stack start==============================
> (kgdb) target remote 9.61.1.166:123456
> Remote debugging using 9.61.1.166:123456
> [New Thread 0] [Switching to Thread 0]
> 0xffffffff80d2edc6 in map_table (pa=18446744073709551614, offset=0, 
>     sig=0xf 
> "$0f00000$2430663$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343"...)
>     at /usr/src/sys/amd64/acpica/acpi_machdep.c:135
> 135                     pmap_kenter(va, pa);

Hm, I don't think so; here the pmap_kenter happens in the function map_table, 
which would mean that acpi_cpu_idle was maybe trying to fetch some information 
from an ACPI table?

> Current language:  auto; currently minimal
> (kgdb) bt
> #0  0xffffffff80d2edc6 in map_table (pa=18446744073709551614, offset=0, 
>     sig=0xf 
> "$0f00000$2430663$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343"...)
>     at /usr/src/sys/amd64/acpica/acpi_machdep.c:135
> #1  0xffffffff8036ca3a in acpi_cpu_idle (sbt=<value optimized out>) at 
> /usr/src/sys/dev/acpica/acpi_cpu.c:1038

This doesn't seem to make much sense: the trace points to a call to 
acpi_cpu_c1 (which is basically a hlt instruction), yet the debugger seems to 
think it's calling map_table. So either the symbols are out-of-sync or there's 
something weird going on here.
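
(For reference, and as far as I can tell from the 10.x sources, acpi_cpu_c1() 
in sys/dev/acpica/acpi_cpu.c is essentially just:

	static void
	acpi_cpu_c1(void)
	{
		__asm __volatile("sti; hlt");
	}

so it should never end up in map_table by itself.)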

> #2  0xffffffff80d352ff in cpu_idle (busy=-2123763712) at 
> /usr/src/sys/amd64/amd64/machdep.c:828
> #3  0xffffffff80d353a0 in cpu_idle_wakeup (cpu=53412) at 
> /usr/src/sys/amd64/amd64/machdep.c:852
> #4  0xffffffff80975ee5 in sched_idletd (dummy=<value optimized out>) 
> at /usr/src/sys/kern/sched_ule.c:808
> #5  0xffffffff80912e7a in fork_exit (callout=0xffffffff80975d10 
> <sched_idletd+160>, arg=0x0, frame=0xfffffe007afadc00)
>     at /usr/src/sys/kern/kern_fork.c:1018
> #6  0xffffffff80d31cde in Xint0x80_syscall () at ia32_exception.S:72
> #7  0x0000000000000000 in ?? ()
> (kgdb) info threads 
>   4 Thread 3  0xffffffff80d2edc6 in map_table (pa=18446744073709551607, 
> offset=0, 
>     sig=0xf 
> "$0f00000$2430663$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343"...)
>     at /usr/src/sys/amd64/acpica/acpi_machdep.c:135
>   3 Thread 2  0xffffffff80d2edc6 in map_table (pa=18446744073709551611, 
> offset=0, 
>     sig=0xf 
> "$0f00000$2430663$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343"...)
>     at /usr/src/sys/amd64/acpica/acpi_machdep.c:135
>   2 Thread 1  0xffffffff80d2edc6 in map_table (pa=18446744073709551613, 
> offset=0, 
>     sig=0xf 
> "$0f00000$2430663$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343"...)
>     at /usr/src/sys/amd64/acpica/acpi_machdep.c:135
> * 1 Thread 0  0xffffffff80d2edc6 in map_table (pa=18446744073709551614, 
> offset=0, 
>     sig=0xf 
> "$0f00000$2430663$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343$2432343"...)
>     at /usr/src/sys/amd64/acpica/acpi_machdep.c:135
> ======================stack end==============================

Roger.

Attachment: monitor need wait until stop_all_proc has finished.patch
