Re: [Qemu-devel] [RFC 00/13] Live memory snapshot based on userfaultfd
On 2017/3/1 0:14, Andrea Arcangeli wrote:
> Hello,
>
> On Tue, Feb 28, 2017 at 09:48:26AM +0800, Hailiang Zhang wrote:
> > Yes, the current implementation of live snapshot supports tcg but
> > not kvm mode, for the reason I mentioned above. If you try to
> > implement it, I think you need to start from userfaultfd supporting
> > KVM. There is a scenario for it, but I'm blocked by other things
> > these days. I'd be glad to discuss it with you if you plan to do it.
>
> Yes, there were other urgent userfaultfd features needed by QEMU and
> CRIU queued for merging (hugetlbfs/shmem/non-cooperative support) and
> they're all included upstream now. Now that such work is finished,
> fixing the WP support to work with KVM and to provide full accuracy
> will be the next thing to do.

Great, looking forward to it. Thanks.

> Thanks,
> Andrea
Re: [Qemu-devel] [RFC 00/13] Live memory snapshot based on userfaultfd
Thanks a lot Hailiang

On 28/02/2017 02:48, Hailiang Zhang wrote:
> Hi,
>
> On 2017/2/27 23:37, Christian Pinto wrote:
> > Hello Hailiang,
> >
> > are there any updates on this patch series? Are you planning to
> > release a new version?
>
> No, userfaultfd still does not support write-protect for KVM.
> You can see the newest discussion about it here:
> https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg01127.html

Yes, I have read that part of the discussion and quickly managed to
reproduce the "Bad address" error on ARMv8.

> > You say there are some issues with the current snapshot-v2 version;
> > which issues were you referring to? On my side the only problem I
> > have seen was that the live snapshot was not working on ARMv8, but
> > I have fixed that and managed to successfully snapshot and restore
> > a QEMU ARMv8 tcg machine on an ARMv8 host. I will gladly contribute
> > these fixes once you release a new version of the patches.
>
> Yes, the current implementation of live snapshot supports tcg but not
> kvm mode, for the reason I mentioned above. If you try to implement
> it, I think you need to start from userfaultfd supporting KVM. There
> is a scenario for it, but I'm blocked by other things these days. I'd
> be glad to discuss it with you if you plan to do it.

I will have a deeper look at why userfault is not yet working with KVM
and get back on this thread for feedback/suggestions.

Thanks,

Christian
Re: [Qemu-devel] [RFC 00/13] Live memory snapshot based on userfaultfd
Hi Andrea,

I noticed that you call the change_protection() helper from mprotect to
realize the write-protect capability for userfault. But I doubt
mprotect can work properly with KVM: if the shadow page table (spte)
used by the VM is already established in EPT, change_protection() does
not remove its write permission but only invalidates the host page
table and the shadow page table (KVM registers
invalidate_page/invalidate_range_start).

I investigated KSM. Since it can merge pages that are used by a VM, it
needs to remove the write permission of those pages too, and its
process is not the same as mprotect's. It has a helper,
write_protect_page(), which finally calls the change_pte hook in KVM,
and that removes the page's write permission in the EPT page table.
The code path is:

write_protect_page
 -> set_pte_at_notify
  -> mmu_notifier_change_pte
   -> mn->ops->change_pte
    -> kvm_mmu_notifier_change_pte

(If I'm wrong, please let me know :) )

So IMHO we can make userfault support KVM by referring to KSM. I will
investigate it deeply and try to implement it, but I'm not very
familiar with the memory subsystem in the kernel, so it will take me
some time to study it first... I'd like to know if you have any plan
about supporting KVM for userfault?

Thanks,
Hailiang

On 2016/9/18 10:14, Hailiang Zhang wrote:
> Hi Andrea,
>
> Any comments? Thanks.
>
> On 2016/9/6 11:39, Hailiang Zhang wrote:
> > Hi Andrea,
> >
> > I tested the new live memory snapshot with --enable-kvm; it doesn't
> > work. To make things simple, I simplified the code, leaving only
> > the parts that test the write-protect capability. You can find the
> > code at
> > https://github.com/coloft/qemu/tree/test-userfault-write-protect
> > and reproduce the problem easily with it.
Re: [Qemu-devel] [RFC 00/13] Live memory snapshot based on userfaultfd
Hi Andrea,

I tested the new live memory snapshot with --enable-kvm; it doesn't
work. To make things simple, I simplified the code, leaving only the
parts that test the write-protect capability. You can find the code at
https://github.com/coloft/qemu/tree/test-userfault-write-protect
and reproduce the problem easily with it.

The test result is as follows:

[root@localhost qemu]# x86_64-softmmu/qemu-system-x86_64 --enable-kvm \
  -drive file=/mnt/sdb/win7/win7.qcow2,if=none,id=drive-ide0-0-1,format=qcow2,cache=none \
  -device ide-hd,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1 \
  -vnc :7 -m 8192 -smp 1 -netdev tap,id=bn0 \
  -device virtio-net-pci,id=net-pci0,netdev=bn0 --monitor stdio
QEMU 2.6.95 monitor - type 'help' for more information
(qemu) migrate file:/home/xxx
qemu-system-x86_64: postcopy_ram_fault_thread: 7f07fb92a000 fault and remove write protect!
[... the same line repeats many times for the same address ...]
error: kvm run failed Bad address
EAX=0004 EBX= ECX=83b2ac20 EDX=c022
ESI=85fe33f4 EDI=c020 EBP=83b2abcc ESP=83b2abc0
EIP=8bd2ff0c EFL=00010293 [--S-A-C] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0023 00c0f300 DPL=3 DS [-WA]
CS =0008 00c09b00 DPL=0 CS32 [-RA]
SS =0010 00c09300 DPL=0 DS [-WA]
DS =0023 00c0f300 DPL=3 DS [-WA]
FS =0030 83b2dc00 3748 00409300 DPL=0 DS [-WA]
GS = LDT=
TR =0028 801e2000 20ab 8b00 DPL=0 TSS32-busy
GDT= 80b95000 03ff
IDT= 80b95400 07ff
CR0=8001003b CR2=030b5000 CR3=00185000 CR4=06f8
DR0= DR1= DR2= DR3=
DR6=0ff0 DR7=0400
EFER=0800
Code=8b ff 55 8b ec 53 56 8b 75 08 57 8b 7e 34 56 e8 30 f7 ff ff <6a> 00 57 8a d8 e8 96 14 00 00 6a 04 83 c7 02 57 e8 8b 14 00 00 5f c6 46 5b 00 5e 8a c3 5b

I investigated the kvm and userfault code. We use the MMU notifier to
integrate KVM with the Linux memory management. Here, for userfault
write-protect, the function call path is:

userfaultfd_ioctl
 -> userfaultfd_writeprotect
  -> mwriteprotect_range
   -> change_protection (directly calls the mprotect helper here)
    -> change_protection_range
     -> change_pud_range
      -> change_pmd_range
       -> mmu_notifier_invalidate_range_start(mm, mni_start, end)
        -> kvm_mmu_notifier_invalidate_range_start (KVM module)

OK, here we remove the entry from the spte (if we use EPT hardware, we
remove the page table entry for it). That's why we get fault
notifications for the VM. And it seems that we can't fix up the
userfault (remove the page's write protection) through this call path.

My question is: for the userfault write-protect capability, why do we
remove the page table entry instead of marking it read-only? Actually,
for KVM we have an MMU notifier (kvm_mmu_notifier_change_pte) to do
this. We can use it to remove write permission from the KVM page
table, just like KVM dirty-log tracking does; see the function
__rmap_write_protect() in KVM.

Another question: does mprotect() work normally with KVM? (I didn't
test it.) I think KSM and swap can work with KVM properly.

Besides, there seems to be a bug in userfault write-protect: we use
UFFDIO_COPY_MODE_DONTWAKE in userfaultfd_writeprotect; should it be
UFFDIO_WRITEPROTECT_MODE_DONTWAKE there?

static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
Re: [Qemu-devel] [RFC 00/13] Live memory snapshot based on userfaultfd
Hi,

I updated this series but didn't post it, because there were some
problems while I tested the snapshot function. I didn't know whether
it was a userfaultfd issue or not, and I don't have time to
investigate it this month. I have put the patches on github:

https://github.com/coloft/qemu/tree/snapshot-v2

Anyone who wants to test and modify them is welcome!

Besides, will you join LinuxCon or KVM Forum in Canada?
I hope to see you there if you attend ;)

Thanks,
Hailiang
Re: [Qemu-devel] [RFC 00/13] Live memory snapshot based on userfaultfd
Hello everyone,

I've an aa.git tree up to date on the master & userfault branches
(master includes other pending VM stuff; the userfault branch only
contains userfault enhancements):

https://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/log/?h=userfault

I didn't have time to test KVM live memory snapshot on it yet as I'm
still working to improve it. Did anybody test it? However, I'd be
happy to take any bug reports and quickly solve anything that isn't
working right with the shadow MMU.

I already got a positive report for another usage of the uffd WP
support:

https://medium.com/@MartinCracauer/generational-garbage-collection-write-barriers-write-protection-and-userfaultfd-2-8b0e796b8f7f

The last few things I'm working on to finish the WP support are:

1) a pte_swp_mkuffd_wp equivalent of pte_swp_mksoft_dirty, to mark, in
a vma with VM_UFFD_WP set in vma->vm_flags, which swap entries were
generated while the pte was write-protected.

2) to avoid all false positives, an equivalent of pte_mksoft_dirty is
needed too... and that requires spare software bits in the pte, which
are available on x86. I also considered taking over the soft_dirty
bit, but then you couldn't do checkpoint/restore of a JIT/to-native
compiler that uses the uffd WP support, so it wasn't ideal. Perhaps it
would be ok, as an incremental patch, to make the two options mutually
exclusive, to defer the arch changes that pte_mkuffd_wp would require
until later.

3) prevent UFFDIO_ZEROPAGE if registering WP|MISSING, or trigger a CoW
in userfaultfd_writeprotect.

4) a WP selftest.

In theory things should work ok already if the userland code is
tolerant of false positives through swap and after fork() and KSM. For
a usage like snapshotting, false positives shouldn't be an issue (in
the worst case it'll just run slower if you swap), and point 3) above
also isn't an issue because it's going to register into uffd with WP
only.

The current status includes:

1) WP support for anon (with false positives... work in progress)
2) MISSING support for tmpfs and hugetlbfs
3) non-cooperative support

Thanks,
Andrea
Re: [Qemu-devel] [RFC 00/13] Live memory snapshot based on userfaultfd
On 2016/7/14 19:43, Dr. David Alan Gilbert wrote: * Hailiang Zhang (zhang.zhanghaili...@huawei.com) wrote: On 2016/7/14 2:02, Dr. David Alan Gilbert wrote: * zhanghailiang (zhang.zhanghaili...@huawei.com) wrote: For now, we still didn't support live memory snapshot, we have discussed a scheme which based on userfaultfd long time ago. You can find the discussion by the follow link: https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg01779.html The scheme is based on userfaultfd's write-protect capability. The userfaultfd write protection feature is available here: http://www.spinics.net/lists/linux-mm/msg97422.html I've (finally!) had a brief look through this, I like the idea. I've not bothered with minor cleanup like comments on them; I'm sure those will happen later; some larger scale things to think about are: a) I wonder if it's really best to put that much code into the postcopy function; it might be but I can see other userfault uses as well. Yes, it is better to extract common codes into public functions. b) I worry a bit about the size of the copies you create during setup and I don't really understand why you can't start sending those pages Because we save device state and ram in the same snapshot_thread, if the process of saving device is blocked by writing pages, we can remove the write-protect in 'postcopy/fault' thread, but can't send it immediately. Don't you write the devices to a buffer? If so then you perhaps you could split writing into that buffer into a separate thread. Hmm, it may work in this way. immediately - but then I worry aobut the relative order of when pages data should be sent compared to the state of devices view of RAM. c) Have you considered also using userfault for loading the snapshot - I know there was someone on #qemu a while ago who was talking about using it as a way to quickly reload from a migration image. I didn't notice such talking before, maybe i missed it. Could you please send me the link ? 
I don't think there are any public docs about it; this was a conversation with Christoph Seifert on #qemu about May last year. Got it. But I do consider the scenario of quick snapshot restoring. The difficulty here is how we can quickly find the position of a particular page. That is, while the VM is accessing one page, we need to find its position in the snapshot file and read it into memory. Considering compatibility, we hope we can still re-use all migration capabilities.
My rough idea about the scenario is:
1. Use an array to record the beginning position of all VM's pages. Use the offset as the index for the array, just like the migration bitmaps.
2. Save the data of the array into another file in a special format.
3. Also record the position of the device state data in the snapshot file. (Or we can put the device state data at the head of the snapshot file.)
4. While restoring the snapshot, reload the array first, and then read the device state.
5. Set all pages to MISS status.
6. Resume the VM.
7. The next process is like how postcopy incoming does it.
I'm not sure if this scenario is practicable or not. We need further discussion. :) Yes; I can think of a few different ways to do (2):
a) We could just store it at the end of the snapshot file (and know that it's at the end - I think the json format description did a similar trick). Yes, this is a better idea.
b) We wouldn't need the 4 byte headers on the pages we currently send.
c) Juan's idea of having multiple fd's for migration streams might also fit, with the RAM data in the separate file.
d) But if we know it's a file (not a network stream) then should we treat it specially and just use a sparse file of the same size as RAM, and just pwrite() the data into the right offset? Yes, this is the simplest way to save the snapshot file; the disadvantage is that we can't directly reuse the current migration incoming path to restore the VM (no quick restore). We need to modify the current restore process. I'm not sure which way is better.
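Dave's option (d) above - a sparse file the same size as RAM, with each page pwrite()n at its natural offset - can be sketched in a few lines. This is an illustrative model only (Python rather than QEMU's C, a hypothetical /tmp/snap.img file, and an assumed 4 KiB page size), not the patchset's actual snapshot format:

```python
import os

PAGE_SIZE = 4096  # assumed guest page size

def save_page(snap_fd, page_index, data):
    # The file offset equals page_index * PAGE_SIZE, so no separate
    # index array is needed and never-written pages remain file holes.
    assert len(data) == PAGE_SIZE
    os.pwrite(snap_fd, data, page_index * PAGE_SIZE)

def load_page(snap_fd, page_index):
    # Restore side: seek straight to the page, no stream parsing.
    return os.pread(snap_fd, PAGE_SIZE, page_index * PAGE_SIZE)

# Demo: snapshot a tiny 4-page "RAM" into a sparse file.
ram = {0: b"A" * PAGE_SIZE, 3: b"B" * PAGE_SIZE}  # pages 1 and 2 never dirtied
fd = os.open("/tmp/snap.img", os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
os.ftruncate(fd, 4 * PAGE_SIZE)  # size the file to the whole RAM, all holes
for idx, page in ram.items():
    save_page(fd, idx, page)
assert load_page(fd, 3) == b"B" * PAGE_SIZE
assert load_page(fd, 1) == b"\0" * PAGE_SIZE  # hole reads back as zeros
os.close(fd)
```

The offset-equals-position property is what would make the postcopy-style quick restore possible: on a MISS fault for page N, the loader can pread() exactly that page without scanning a stream.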
But it's worth a try. Hailiang Dave Hailiang Dave
The process of this live memory scheme is like below:
1. Pause VM
2. Enable write-protect fault notification by using userfaultfd to mark VM's memory as write-protected (readonly).
3. Save VM's static state (here, the device state) to the snapshot file
4. Resume VM; the VM is going to run.
5. The snapshot thread begins to save VM's live state (here, RAM) into the snapshot file.
6. During this time, all attempts to write VM's memory will be blocked by the kernel, and the kernel will wake up the fault treating thread in qemu to process this write-protect fault. The fault treating thread will deliver this page's address to the snapshot thread.
7. The snapshot thread gets this address, saves this page into the snapshot file, and then removes the write-protect by using the userfaultfd API; after that, the actions of writing will be
Re: [Qemu-devel] [RFC 00/13] Live memory snapshot based on userfaultfd
* Hailiang Zhang (zhang.zhanghaili...@huawei.com) wrote: > On 2016/7/14 2:02, Dr. David Alan Gilbert wrote: > > * zhanghailiang (zhang.zhanghaili...@huawei.com) wrote: > > > For now, we still didn't support live memory snapshot, we have discussed > > > a scheme which is based on userfaultfd long time ago. > > > You can find the discussion by the following link: > > > https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg01779.html > > > > > > The scheme is based on userfaultfd's write-protect capability. > > > The userfaultfd write protection feature is available here: > > > http://www.spinics.net/lists/linux-mm/msg97422.html > > > > I've (finally!) had a brief look through this, I like the idea. > > I've not bothered with minor cleanup like comments on them; > > I'm sure those will happen later; some larger scale things to think > > about are: > >a) I wonder if it's really best to put that much code into the postcopy > > function; it might be but I can see other userfault uses as well. > > Yes, it is better to extract common code into public functions. > > >b) I worry a bit about the size of the copies you create during setup > > and I don't really understand why you can't start sending those pages > > Because we save device state and ram in the same snapshot_thread, if the > process > of saving device is blocked by writing pages, we can remove the write-protect > in > 'postcopy/fault' thread, but can't send it immediately. Don't you write the devices to a buffer? If so, then perhaps you could split writing into that buffer into a separate thread. > > immediately - but then I worry about the relative order of when pages > > data should be sent compared to the state of devices view of RAM. > >c) Have you considered also using userfault for loading the snapshot - I > > know there was someone on #qemu a while ago who was talking about using > > it as a way to quickly reload from a migration image. > > > > I didn't notice such talking before, maybe I missed it.
> Could you please send me the link? I don't think there are any public docs about it; this was a conversation with Christoph Seifert on #qemu about May last year. > But I do consider the scenario of quickly snapshot restoring. > And the difficulty here is how can we quickly find the position > of the special page. That is, while VM is accessing one page, we > need to find its position in snapshot file and read it into memory. > Consider the compatibility, we hope we can still re-use all migration > capabilities. > > My rough idea about the scenario is: > 1. Use an array to record the beginning position of all VM's pages. > Use the offset as the index for the array, just like migration bitmaps. > 2. Save the data of the array into another file in a special format. > 3. Also record the position of device state data in snapshot file. > (Or we can put the device state data at the head of snapshot file) > 4. While restoring the snapshot, reload the array first, and then read > the device state. > 5. Set all pages to MISS status. > 6. Resume VM to run > 7. The next process is like how postcopy incoming does. > > I'm not sure if this scenario is practicable or not. We need further > discussion. :) Yes; I can think of a few different ways to do (2): a) We could just store it at the end of the snapshot file (and know that it's at the end - I think the json format description did a similar trick). b) We wouldn't need the 4 byte headers on the pages we currently send. c) Juan's idea of having multiple fd's for migration streams might also fit, with the RAM data in the separate file. d) But if we know it's a file (not a network stream) then should we treat it specially and just use a sparse file of the same size as RAM, and just pwrite() the data into the right offset? Dave > > Hailiang > > > Dave > > > > > > > > The process of this live memory scheme is like below: > > > 1. Pause VM > > > 2.
Enable write-protect fault notification by using userfaultfd to > > > mark VM's memory to write-protect (readonly). > > > 3. Save VM's static state (here is device state) to snapshot file > > > 4. Resume VM, VM is going to run. > > > 5. Snapshot thread begins to save VM's live state (here is RAM) into > > > snapshot file. > > > 6. During this time, all the actions of writing VM's memory will be > > > blocked > > >by kernel, and kernel will wake up the fault treating thread in qemu to > > >process this write-protect fault. The fault treating thread will > > > deliver this > > >page's address to snapshot thread. > > > 7. snapshot thread gets this address, save this page into snapshot file, > > > and then remove the write-protect by using userfaultfd API, after > > > that, > > > the actions of writing will be recovered. > > > 8. Repeat step 5~7 until all VM's memory is saved to snapshot file > > > > > > Compared with the feature of 'migrate VM's state to
Re: [Qemu-devel] [RFC 00/13] Live memory snapshot based on userfaultfd
On 2016/7/14 2:02, Dr. David Alan Gilbert wrote: * zhanghailiang (zhang.zhanghaili...@huawei.com) wrote: For now, we still don't support live memory snapshot; we discussed a scheme based on userfaultfd a long time ago. You can find the discussion at the following link: https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg01779.html The scheme is based on userfaultfd's write-protect capability. The userfaultfd write protection feature is available here: http://www.spinics.net/lists/linux-mm/msg97422.html I've (finally!) had a brief look through this, I like the idea. I've not bothered with minor cleanup like comments on them; I'm sure those will happen later; some larger scale things to think about are: a) I wonder if it's really best to put that much code into the postcopy function; it might be, but I can see other userfault uses as well. Yes, it is better to extract common code into public functions. b) I worry a bit about the size of the copies you create during setup and I don't really understand why you can't start sending those pages Because we save device state and RAM in the same snapshot_thread: if the process of saving devices is blocked by writing pages, we can remove the write-protect in the 'postcopy/fault' thread, but can't send the page immediately. immediately - but then I worry about the relative order of when page data should be sent compared to the devices' view of RAM. c) Have you considered also using userfault for loading the snapshot - I know there was someone on #qemu a while ago who was talking about using it as a way to quickly reload from a migration image. I didn't notice that discussion before; maybe I missed it. Could you please send me the link? But I do consider the scenario of quick snapshot restoring. The difficulty here is how we can quickly find the position of a particular page. That is, while the VM is accessing one page, we need to find its position in the snapshot file and read it into memory.
Considering compatibility, we hope we can still re-use all migration capabilities.
My rough idea about the scenario is:
1. Use an array to record the beginning position of all VM's pages. Use the offset as the index for the array, just like the migration bitmaps.
2. Save the data of the array into another file in a special format.
3. Also record the position of the device state data in the snapshot file. (Or we can put the device state data at the head of the snapshot file.)
4. While restoring the snapshot, reload the array first, and then read the device state.
5. Set all pages to MISS status.
6. Resume the VM.
7. The next process is like how postcopy incoming does it.
I'm not sure if this scenario is practicable or not. We need further discussion. :) Hailiang Dave
The process of this live memory scheme is like below:
1. Pause VM
2. Enable write-protect fault notification by using userfaultfd to mark VM's memory as write-protected (readonly).
3. Save VM's static state (here, the device state) to the snapshot file
4. Resume VM; the VM is going to run.
5. The snapshot thread begins to save VM's live state (here, RAM) into the snapshot file.
6. During this time, all attempts to write VM's memory will be blocked by the kernel, and the kernel will wake up the fault treating thread in qemu to process this write-protect fault. The fault treating thread will deliver this page's address to the snapshot thread.
7. The snapshot thread gets this address, saves this page into the snapshot file, and then removes the write-protect by using the userfaultfd API; after that, writing is allowed again.
8. Repeat steps 5~7 until all VM's memory is saved to the snapshot file.
Compared with the feature of 'migrate VM's state to file', the main difference for live memory snapshot is that it has little time delay in catching VM's state. It captures the VM's state as of the moment the user's snapshot command is issued, just like taking a photo of the VM's state.
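The ordering guarantee behind steps 5-7 - a faulting write must cause the page to be saved before the protection is dropped, so the snapshot always reflects the pause-time contents - can be shown with a small single-threaded simulation. This is a toy model of the control flow only (no real userfaultfd, threads, or kernel involvement; all names are invented for illustration):

```python
# 'wp' is the set of still-write-protected page indices; a page leaves
# 'wp' exactly when its pause-time content has been saved. A guest write
# to a protected page first forces that page into the snapshot (step 7),
# then clears the protection, so later writes no longer fault (step 6).

class SnapshotSim:
    def __init__(self, ram):
        self.ram = list(ram)             # live guest pages (mutable)
        self.wp = set(range(len(ram)))   # step 2: write-protect everything
        self.snapshot = [None] * len(ram)

    def guest_write(self, idx, value):
        if idx in self.wp:               # step 6: write faults on protected page
            self._save(idx)              # step 7: save first, then unprotect
        self.ram[idx] = value            # write proceeds after unprotect

    def background_pass(self):
        for idx in sorted(self.wp.copy()):  # steps 5/8: walk remaining pages
            self._save(idx)

    def _save(self, idx):
        self.snapshot[idx] = self.ram[idx]
        self.wp.discard(idx)

sim = SnapshotSim(["p0", "p1", "p2"])
sim.guest_write(1, "p1'")   # guest dirties page 1 before the walker reaches it
sim.background_pass()
assert sim.snapshot == ["p0", "p1", "p2"]   # pause-time content, not "p1'"
assert sim.ram == ["p0", "p1'", "p2"]       # guest's write still landed
```

If the save and the unprotect in step 7 were reversed, the guest write could land before the copy and the snapshot would contain post-pause data - which is exactly the consistency bug the write-protect ordering prevents.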
For now, we only support the tcg accelerator, since userfaultfd does not support tracking write faults for KVM.
Usage:
1. Take a snapshot
#x86_64-softmmu/qemu-system-x86_64 -machine pc-i440fx-2.5,accel=tcg,usb=off -drive file=/mnt/windows/win7_install.qcow2.bak,if=none,id=drive-ide0-0-1,format=qcow2,cache=none -device ide-hd,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1 -vnc :7 -m 8192 -smp 1 -netdev tap,id=bn0 -device virtio-net-pci,id=net-pci0,netdev=bn0 --monitor stdio
Issue the snapshot command:
(qemu)migrate -d file:/home/Snapshot
2. Revert to the snapshot
#x86_64-softmmu/qemu-system-x86_64 -machine pc-i440fx-2.5,accel=tcg,usb=off -drive file=/mnt/windows/win7_install.qcow2.bak,if=none,id=drive-ide0-0-1,format=qcow2,cache=none -device ide-hd,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1 -vnc :7 -m 8192 -smp 1 -netdev tap,id=bn0 -device virtio-net-pci,id=net-pci0,netdev=bn0 --monitor stdio -incoming file:/home/Snapshot
NOTE: The userfaultfd write protection feature does not support THP for now,
Re: [Qemu-devel] [RFC 00/13] Live memory snapshot based on userfaultfd
* zhanghailiang (zhang.zhanghaili...@huawei.com) wrote: > For now, we still didn't support live memory snapshot, we have discussed > a scheme which is based on userfaultfd long time ago. > You can find the discussion by the following link: > https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg01779.html > > The scheme is based on userfaultfd's write-protect capability. > The userfaultfd write protection feature is available here: > http://www.spinics.net/lists/linux-mm/msg97422.html I've (finally!) had a brief look through this, I like the idea. I've not bothered with minor cleanup like comments on them; I'm sure those will happen later; some larger scale things to think about are: a) I wonder if it's really best to put that much code into the postcopy function; it might be but I can see other userfault uses as well. b) I worry a bit about the size of the copies you create during setup and I don't really understand why you can't start sending those pages immediately - but then I worry about the relative order of when pages data should be sent compared to the state of devices view of RAM. c) Have you considered also using userfault for loading the snapshot - I know there was someone on #qemu a while ago who was talking about using it as a way to quickly reload from a migration image. Dave > > The process of this live memory scheme is like below: > 1. Pause VM > 2. Enable write-protect fault notification by using userfaultfd to >mark VM's memory to write-protect (readonly). > 3. Save VM's static state (here is device state) to snapshot file > 4. Resume VM, VM is going to run. > 5. Snapshot thread begins to save VM's live state (here is RAM) into >snapshot file. > 6. During this time, all the actions of writing VM's memory will be blocked > by kernel, and kernel will wake up the fault treating thread in qemu to > process this write-protect fault. The fault treating thread will deliver > this > page's address to snapshot thread. > 7.
snapshot thread gets this address, save this page into snapshot file, >and then remove the write-protect by using userfaultfd API, after that, >the actions of writing will be recovered. > 8. Repeat step 5~7 until all VM's memory is saved to snapshot file > > Compared with the feature of 'migrate VM's state to file', > the main difference for live memory snapshot is it has little time delay for > catching VM's state. It just captures the VM's state while got users snapshot > command, just like take a photo of VM's state. > > For now, we only support tcg accelerator, since userfaultfd is not supporting > tracking write faults for KVM. > > Usage: > 1. Take a snapshot > #x86_64-softmmu/qemu-system-x86_64 -machine pc-i440fx-2.5,accel=tcg,usb=off > -drive > file=/mnt/windows/win7_install.qcow2.bak,if=none,id=drive-ide0-0-1,format=qcow2,cache=none > -device ide-hd,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1 -vnc :7 -m > 8192 -smp 1 -netdev tap,id=bn0 -device virtio-net-pci,id=net-pci0,netdev=bn0 > --monitor stdio > Issue snapshot command: > (qemu)migrate -d file:/home/Snapshot > 2.
Revert to the snapshot > #x86_64-softmmu/qemu-system-x86_64 -machine pc-i440fx-2.5,accel=tcg,usb=off > -drive > file=/mnt/windows/win7_install.qcow2.bak,if=none,id=drive-ide0-0-1,format=qcow2,cache=none > -device ide-hd,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1 -vnc :7 -m > 8192 -smp 1 -netdev tap,id=bn0 -device virtio-net-pci,id=net-pci0,netdev=bn0 > --monitor stdio -incoming file:/home/Snapshot > > NOTE: > The userfaultfd write protection feature does not support THP for now, > Before taking snapshot, please disable THP by: > echo never > /sys/kernel/mm/transparent_hugepage/enabled > > TODO: > - Reduce the influence for VM while taking snapshot > > zhanghailiang (13): > postcopy/migration: Split fault related state into struct > UserfaultState > migration: Allow the migrate command to work on file: urls > migration: Allow -incoming to work on file: urls > migration: Create a snapshot thread to realize saving memory snapshot > migration: implement initialization work for snapshot > QEMUSizedBuffer: Introduce two help functions for qsb > savevm: Split qemu_savevm_state_complete_precopy() into two helper > functions > snapshot: Save VM's device state into snapshot file > migration/postcopy-ram: fix some helper functions to support > userfaultfd write-protect > snapshot: Enable the write-protect notification capability for VM's > RAM > snapshot/migration: Save VM's RAM into snapshot file > migration/ram: Fix some helper functions' parameter to use > PageSearchStatus > snapshot: Remove page's write-protect and copy the content during > setup stage > > include/migration/migration.h | 41 +-- > include/migration/postcopy-ram.h | 9 +- > include/migration/qemu-file.h | 3 +- > include/qemu/typedefs.h | 1 + > include/sysemu/sysemu.h
Re: [Qemu-devel] [RFC 00/13] Live memory snapshot based on userfaultfd
Hello, On Tue, Jul 05, 2016 at 11:57:31AM +0200, Baptiste Reynal wrote: > Ok, if it is not on Andrea schedule I am willing to take the action, > at least for ARM/ARM64 support. A few days ago I released this update: https://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/
git clone -b master --reference linux git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
cd aa
git fetch
git reset --hard origin/master
The branch will be constantly rebased, so you will need to rebase or reset on origin/master after a fetch to get the updates. Features added:
1) WP support for anon (Shaohua, hugetlbfs has a FIXME)
2) non cooperative support (Pavel & Mike Rapoport)
3) hugetlbfs missing faults tracking (Mike Kravetz)
WP support and hugetlbfs required a couple of fixes; the non-cooperative support is as submitted, but I wonder if we should have a single non cooperative feature flag. I didn't advertise it yet because it's not well tested, and in fact I don't expect the WP mode to work fully as it should. However the kernel should run stable; I fixed enough bugs that it should not be possible to DoS or exploit the kernel with this patchset applied (unlike the original code submissions, which had race conditions and potentially kernel crashing bugs). The next thing I plan to work on is a bitflag in the swap entry for the WP tracking, so that WP tracking works correctly through swapins without false positives. It'll work like soft-dirty. It's possible that other things are still uncovered in the WP support. THP should be covered now (the callback was missing in the original submit but I fixed that). With KVM it's not entirely clear why it didn't work before, but it may require changes to the KVM code if this is not enough. KVM should not use gup(write=1) for read faults on shadow pagetables, so it has at least a chance to work. I'm also considering using a reserved bitflag in the mapped/present pte/trans_huge_pmds to track which virtual addresses have been wrprotected.
Without a reserved bitflag, fork() would inevitably lead to WP userfault false positives. I'm not sure if it's required or if it should be left up to userland to enforce that the pagetables don't become wrprotected (i.e. use MADV_DONTFORK like of course KVM already does). First we have to solve the false positives through swap anyway; the two should be orthogonal improvements. If you could test the live snapshotting patchset on my kernel master branch and report any issue or incremental fix against my branch, it'd be great. On my side I think I'll focus on testing by extending the testsuite inside the kernel to exercise WP tracking too. There are several other active users of the new userfaultfd features, including JIT garbage collection (which previously used mprotect and trapped SIGSEGV), distributed shared memory, SQL database robustness in hugetlbfs holes, and postcopy live migration of containers (a process using userfaultfd of its own being live migrated inside a container with the non-cooperative model isn't solved yet, though). Thanks, Andrea
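Why a bitflag in the swap entry matters can be illustrated with a toy page-table model: if the wrprotect bit lives only in the present pte, swap-out destroys it, so on swap-in the page must conservatively be treated as wrprotected again - a false positive for the WP tracking, exactly analogous to how soft-dirty keeps its bit across swap. This is a deliberately simplified model with invented names, not kernel code:

```python
# Model: each pte is {'present': bool, 'wp': bool}. With
# keep_bit_in_swap_entry=False, swap-out loses the wp bit, so the page
# must be remapped write-protected "just in case" -> spurious WP fault.
# With keep_bit_in_swap_entry=True, the real tracking state survives.

class PageTable:
    def __init__(self, keep_bit_in_swap_entry):
        self.keep = keep_bit_in_swap_entry
        self.pte = {}   # idx -> {'present': bool, 'wp': bool}

    def map(self, idx, wp):
        self.pte[idx] = {'present': True, 'wp': wp}

    def swap_out(self, idx):
        e = self.pte[idx]
        e['present'] = False
        if not self.keep:
            e['wp'] = True   # bit lost: must assume still write-protected

    def swap_in(self, idx):
        self.pte[idx]['present'] = True

    def write_faults(self, idx):
        # Would a guest write to this page raise a WP userfault?
        return self.pte[idx]['wp']

# Page 0 was already saved and un-wrprotected before being swapped out.
for keep, faults_after_swap in ((False, True), (True, False)):
    pt = PageTable(keep)
    pt.map(0, wp=False)
    pt.swap_out(0)
    pt.swap_in(0)
    assert pt.write_faults(0) == faults_after_swap
```

In the naive variant every swapped-out page re-faults as if the guest had written it, which is the false-positive problem the swap-entry bitflag is meant to eliminate.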
Re: [Qemu-devel] [RFC 00/13] Live memory snapshot based on userfaultfd
On 2016/7/5 17:57, Baptiste Reynal wrote: On Tue, Jul 5, 2016 at 3:49 AM, Hailiang Zhang wrote: On 2016/7/4 20:22, Baptiste Reynal wrote: On Thu, Jan 7, 2016 at 1:19 PM, zhanghailiang wrote: For now, we still don't support live memory snapshot; we discussed a scheme based on userfaultfd a long time ago. You can find the discussion at the following link: https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg01779.html The scheme is based on userfaultfd's write-protect capability. The userfaultfd write protection feature is available here: http://www.spinics.net/lists/linux-mm/msg97422.html
The process of this live memory scheme is like below:
1. Pause VM
2. Enable write-protect fault notification by using userfaultfd to mark VM's memory as write-protected (readonly).
3. Save VM's static state (here, the device state) to the snapshot file
4. Resume VM; the VM is going to run.
5. The snapshot thread begins to save VM's live state (here, RAM) into the snapshot file.
6. During this time, all attempts to write VM's memory will be blocked by the kernel, and the kernel will wake up the fault treating thread in qemu to process this write-protect fault. The fault treating thread will deliver this page's address to the snapshot thread.
7. The snapshot thread gets this address, saves this page into the snapshot file, and then removes the write-protect by using the userfaultfd API; after that, writing is allowed again.
8. Repeat steps 5~7 until all VM's memory is saved to the snapshot file.
Compared with the feature of 'migrate VM's state to file', the main difference for live memory snapshot is that it has little time delay in catching VM's state. It captures the VM's state as of the moment the user's snapshot command is issued, just like taking a photo of the VM's state. For now, we only support the tcg accelerator, since userfaultfd does not support tracking write faults for KVM. Usage: 1.
Take a snapshot #x86_64-softmmu/qemu-system-x86_64 -machine pc-i440fx-2.5,accel=tcg,usb=off -drive file=/mnt/windows/win7_install.qcow2.bak,if=none,id=drive-ide0-0-1,format=qcow2,cache=none -device ide-hd,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1 -vnc :7 -m 8192 -smp 1 -netdev tap,id=bn0 -device virtio-net-pci,id=net-pci0,netdev=bn0 --monitor stdio Issue snapshot command: (qemu)migrate -d file:/home/Snapshot 2. Revert to the snapshot #x86_64-softmmu/qemu-system-x86_64 -machine pc-i440fx-2.5,accel=tcg,usb=off -drive file=/mnt/windows/win7_install.qcow2.bak,if=none,id=drive-ide0-0-1,format=qcow2,cache=none -device ide-hd,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1 -vnc :7 -m 8192 -smp 1 -netdev tap,id=bn0 -device virtio-net-pci,id=net-pci0,netdev=bn0 --monitor stdio -incoming file:/home/Snapshot NOTE: The userfaultfd write protection feature does not support THP for now, Before taking snapshot, please disable THP by: echo never > /sys/kernel/mm/transparent_hugepage/enabled TODO: - Reduce the influence for VM while taking snapshot zhanghailiang (13): postcopy/migration: Split fault related state into struct UserfaultState migration: Allow the migrate command to work on file: urls migration: Allow -incoming to work on file: urls migration: Create a snapshot thread to realize saving memory snapshot migration: implement initialization work for snapshot QEMUSizedBuffer: Introduce two help functions for qsb savevm: Split qemu_savevm_state_complete_precopy() into two helper functions snapshot: Save VM's device state into snapshot file migration/postcopy-ram: fix some helper functions to support userfaultfd write-protect snapshot: Enable the write-protect notification capability for VM's RAM snapshot/migration: Save VM's RAM into snapshot file migration/ram: Fix some helper functions' parameter to use PageSearchStatus snapshot: Remove page's write-protect and copy the content during setup stage include/migration/migration.h | 41 +-- 
include/migration/postcopy-ram.h | 9 +- include/migration/qemu-file.h | 3 +- include/qemu/typedefs.h | 1 + include/sysemu/sysemu.h | 3 + linux-headers/linux/userfaultfd.h | 21 +++- migration/fd.c| 51 - migration/migration.c | 101 - migration/postcopy-ram.c | 229 -- migration/qemu-file-buf.c | 61 ++ migration/ram.c | 104 - migration/savevm.c| 90 --- trace-events | 1 + 13 files changed, 587 insertions(+), 128 deletions(-) -- 1.8.3.1 Hi, Hi Hailiang, Can I get the status of this patch series ? I cannot find a v2. Yes, I haven't updated it for long time, it is based on userfault-wp API in kernel, and Andrea didn't update the related patches until
Re: [Qemu-devel] [RFC 00/13] Live memory snapshot based on userfaultfd
On Tue, Jul 5, 2016 at 3:49 AM, Hailiang Zhang wrote: > On 2016/7/4 20:22, Baptiste Reynal wrote: >> >> On Thu, Jan 7, 2016 at 1:19 PM, zhanghailiang >> wrote: >>> >>> For now, we still didn't support live memory snapshot, we have discussed >>> a scheme which is based on userfaultfd long time ago. >>> You can find the discussion by the following link: >>> https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg01779.html >>> >>> The scheme is based on userfaultfd's write-protect capability. >>> The userfaultfd write protection feature is available here: >>> http://www.spinics.net/lists/linux-mm/msg97422.html >>> >>> The process of this live memory scheme is like below: >>> 1. Pause VM >>> 2. Enable write-protect fault notification by using userfaultfd to >>> mark VM's memory to write-protect (readonly). >>> 3. Save VM's static state (here is device state) to snapshot file >>> 4. Resume VM, VM is going to run. >>> 5. Snapshot thread begins to save VM's live state (here is RAM) into >>> snapshot file. >>> 6. During this time, all the actions of writing VM's memory will be >>> blocked >>>by kernel, and kernel will wake up the fault treating thread in qemu to >>>process this write-protect fault. The fault treating thread will >>> deliver this >>>page's address to snapshot thread. >>> 7. snapshot thread gets this address, save this page into snapshot file, >>> and then remove the write-protect by using userfaultfd API, after >>> that, >>> the actions of writing will be recovered. >>> 8. Repeat step 5~7 until all VM's memory is saved to snapshot file >>> >>> Compared with the feature of 'migrate VM's state to file', >>> the main difference for live memory snapshot is it has little time delay >>> for >>> catching VM's state. It just captures the VM's state while got users >>> snapshot >>> command, just like take a photo of VM's state. >>> >>> For now, we only support tcg accelerator, since userfaultfd is not >>> supporting >>> tracking write faults for KVM. 
>>> >>> Usage: >>> 1. Take a snapshot >>> #x86_64-softmmu/qemu-system-x86_64 -machine >>> pc-i440fx-2.5,accel=tcg,usb=off -drive >>> file=/mnt/windows/win7_install.qcow2.bak,if=none,id=drive-ide0-0-1,format=qcow2,cache=none >>> -device ide-hd,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1 -vnc :7 -m >>> 8192 -smp 1 -netdev tap,id=bn0 -device virtio-net-pci,id=net-pci0,netdev=bn0 >>> --monitor stdio >>> Issue snapshot command: >>> (qemu)migrate -d file:/home/Snapshot >>> 2. Revert to the snapshot >>> #x86_64-softmmu/qemu-system-x86_64 -machine >>> pc-i440fx-2.5,accel=tcg,usb=off -drive >>> file=/mnt/windows/win7_install.qcow2.bak,if=none,id=drive-ide0-0-1,format=qcow2,cache=none >>> -device ide-hd,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1 -vnc :7 -m >>> 8192 -smp 1 -netdev tap,id=bn0 -device virtio-net-pci,id=net-pci0,netdev=bn0 >>> --monitor stdio -incoming file:/home/Snapshot >>> >>> NOTE: >>> The userfaultfd write protection feature does not support THP for now, >>> Before taking snapshot, please disable THP by: >>> echo never > /sys/kernel/mm/transparent_hugepage/enabled >>> >>> TODO: >>> - Reduce the influence for VM while taking snapshot >>> >>> zhanghailiang (13): >>>postcopy/migration: Split fault related state into struct >>> UserfaultState >>>migration: Allow the migrate command to work on file: urls >>>migration: Allow -incoming to work on file: urls >>>migration: Create a snapshot thread to realize saving memory snapshot >>>migration: implement initialization work for snapshot >>>QEMUSizedBuffer: Introduce two help functions for qsb >>>savevm: Split qemu_savevm_state_complete_precopy() into two helper >>> functions >>>snapshot: Save VM's device state into snapshot file >>>migration/postcopy-ram: fix some helper functions to support >>> userfaultfd write-protect >>>snapshot: Enable the write-protect notification capability for VM's >>> RAM >>>snapshot/migration: Save VM's RAM into snapshot file >>>migration/ram: Fix some helper functions' 
parameter to use >>> PageSearchStatus >>>snapshot: Remove page's write-protect and copy the content during >>> setup stage >>> >>> include/migration/migration.h | 41 +-- >>> include/migration/postcopy-ram.h | 9 +- >>> include/migration/qemu-file.h | 3 +- >>> include/qemu/typedefs.h | 1 + >>> include/sysemu/sysemu.h | 3 + >>> linux-headers/linux/userfaultfd.h | 21 +++- >>> migration/fd.c| 51 - >>> migration/migration.c | 101 - >>> migration/postcopy-ram.c | 229 >>> -- >>> migration/qemu-file-buf.c | 61 ++ >>> migration/ram.c | 104 - >>> migration/savevm.c| 90 --- >>> trace-events
Re: [Qemu-devel] [RFC 00/13] Live memory snapshot based on userfaultfd
On 2016/7/4 20:22, Baptiste Reynal wrote: On Thu, Jan 7, 2016 at 1:19 PM, zhanghailiang wrote: For now, we still don't support live memory snapshot; we discussed a scheme based on userfaultfd a long time ago. You can find the discussion at the following link: https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg01779.html The scheme is based on userfaultfd's write-protect capability. The userfaultfd write protection feature is available here: http://www.spinics.net/lists/linux-mm/msg97422.html
The process of this live memory scheme is like below:
1. Pause VM
2. Enable write-protect fault notification by using userfaultfd to mark VM's memory as write-protected (readonly).
3. Save VM's static state (here, the device state) to the snapshot file
4. Resume VM; the VM is going to run.
5. The snapshot thread begins to save VM's live state (here, RAM) into the snapshot file.
6. During this time, all attempts to write VM's memory will be blocked by the kernel, and the kernel will wake up the fault treating thread in qemu to process this write-protect fault. The fault treating thread will deliver this page's address to the snapshot thread.
7. The snapshot thread gets this address, saves this page into the snapshot file, and then removes the write-protect by using the userfaultfd API; after that, writing is allowed again.
8. Repeat steps 5~7 until all VM's memory is saved to the snapshot file.
Compared with the feature of 'migrate VM's state to file', the main difference for live memory snapshot is that it has little time delay in catching VM's state. It captures the VM's state as of the moment the user's snapshot command is issued, just like taking a photo of the VM's state. For now, we only support the tcg accelerator, since userfaultfd does not support tracking write faults for KVM. Usage: 1.
Take a snapshot #x86_64-softmmu/qemu-system-x86_64 -machine pc-i440fx-2.5,accel=tcg,usb=off -drive file=/mnt/windows/win7_install.qcow2.bak,if=none,id=drive-ide0-0-1,format=qcow2,cache=none -device ide-hd,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1 -vnc :7 -m 8192 -smp 1 -netdev tap,id=bn0 -device virtio-net-pci,id=net-pci0,netdev=bn0 --monitor stdio Issue snapshot command: (qemu)migrate -d file:/home/Snapshot 2. Revert to the snapshot #x86_64-softmmu/qemu-system-x86_64 -machine pc-i440fx-2.5,accel=tcg,usb=off -drive file=/mnt/windows/win7_install.qcow2.bak,if=none,id=drive-ide0-0-1,format=qcow2,cache=none -device ide-hd,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1 -vnc :7 -m 8192 -smp 1 -netdev tap,id=bn0 -device virtio-net-pci,id=net-pci0,netdev=bn0 --monitor stdio -incoming file:/home/Snapshot NOTE: The userfaultfd write protection feature does not support THP for now, Before taking snapshot, please disable THP by: echo never > /sys/kernel/mm/transparent_hugepage/enabled TODO: - Reduce the influence for VM while taking snapshot zhanghailiang (13): postcopy/migration: Split fault related state into struct UserfaultState migration: Allow the migrate command to work on file: urls migration: Allow -incoming to work on file: urls migration: Create a snapshot thread to realize saving memory snapshot migration: implement initialization work for snapshot QEMUSizedBuffer: Introduce two help functions for qsb savevm: Split qemu_savevm_state_complete_precopy() into two helper functions snapshot: Save VM's device state into snapshot file migration/postcopy-ram: fix some helper functions to support userfaultfd write-protect snapshot: Enable the write-protect notification capability for VM's RAM snapshot/migration: Save VM's RAM into snapshot file migration/ram: Fix some helper functions' parameter to use PageSearchStatus snapshot: Remove page's write-protect and copy the content during setup stage include/migration/migration.h | 41 +-- 
include/migration/postcopy-ram.h | 9 +- include/migration/qemu-file.h | 3 +- include/qemu/typedefs.h | 1 + include/sysemu/sysemu.h | 3 + linux-headers/linux/userfaultfd.h | 21 +++- migration/fd.c| 51 - migration/migration.c | 101 - migration/postcopy-ram.c | 229 -- migration/qemu-file-buf.c | 61 ++ migration/ram.c | 104 - migration/savevm.c| 90 --- trace-events | 1 + 13 files changed, 587 insertions(+), 128 deletions(-) -- 1.8.3.1 Hi, Hi Hailiang, Can I get the status of this patch series ? I cannot find a v2. Yes, I haven't updated it for long time, it is based on userfault-wp API in kernel, and Andrea didn't update the related patches until recent days. I will update this series in the next one or two weeks. But it will only support TCG until userfault-wp API supports KVM. About TCG limitation, is
Re: [Qemu-devel] [RFC 00/13] Live memory snapshot based on userfaultfd
On Thu, Jan 7, 2016 at 1:19 PM, zhanghailiang wrote:
> For now, we still don't support live memory snapshot; we discussed
> a scheme based on userfaultfd some time ago.
> You can find the discussion at the following link:
> https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg01779.html
>
> The scheme is based on userfaultfd's write-protect capability.
> The userfaultfd write protection feature is available here:
> http://www.spinics.net/lists/linux-mm/msg97422.html
>
> The process of this live memory scheme is as follows:
> 1. Pause the VM.
> 2. Enable write-protect fault notification by using userfaultfd to
>    mark the VM's memory write-protected (read-only).
> 3. Save the VM's static state (here, the device state) to the snapshot file.
> 4. Resume the VM; it continues to run.
> 5. The snapshot thread begins to save the VM's live state (here, the RAM)
>    into the snapshot file.
> 6. During this time, any write to the VM's memory is blocked by the kernel,
>    and the kernel wakes up the fault-handling thread in QEMU to process the
>    write-protect fault. The fault-handling thread delivers the page's
>    address to the snapshot thread.
> 7. The snapshot thread takes this address, saves the page into the snapshot
>    file, and then removes the write protection using the userfaultfd API;
>    after that, the blocked write can proceed.
> 8. Repeat steps 5-7 until all of the VM's memory has been saved to the
>    snapshot file.
>
> Compared with the existing feature of migrating the VM's state to a file,
> the main difference of a live memory snapshot is that there is very little
> delay in capturing the VM's state: the state is captured at the moment the
> user issues the snapshot command, like taking a photo of the VM.
>
> For now, we only support the tcg accelerator, since userfaultfd does not
> support tracking write faults for KVM.
>
> Usage:
> 1. Take a snapshot
> #x86_64-softmmu/qemu-system-x86_64 -machine pc-i440fx-2.5,accel=tcg,usb=off
> -drive
> file=/mnt/windows/win7_install.qcow2.bak,if=none,id=drive-ide0-0-1,format=qcow2,cache=none
> -device ide-hd,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1 -vnc :7 -m
> 8192 -smp 1 -netdev tap,id=bn0 -device virtio-net-pci,id=net-pci0,netdev=bn0
> --monitor stdio
> Issue the snapshot command:
> (qemu) migrate -d file:/home/Snapshot
> 2. Revert to the snapshot
> #x86_64-softmmu/qemu-system-x86_64 -machine pc-i440fx-2.5,accel=tcg,usb=off
> -drive
> file=/mnt/windows/win7_install.qcow2.bak,if=none,id=drive-ide0-0-1,format=qcow2,cache=none
> -device ide-hd,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1 -vnc :7 -m
> 8192 -smp 1 -netdev tap,id=bn0 -device virtio-net-pci,id=net-pci0,netdev=bn0
> --monitor stdio -incoming file:/home/Snapshot
>
> NOTE:
> The userfaultfd write protection feature does not support THP for now.
> Before taking a snapshot, please disable THP with:
> echo never > /sys/kernel/mm/transparent_hugepage/enabled
>
> TODO:
> - Reduce the impact on the VM while taking a snapshot
>
> zhanghailiang (13):
>   postcopy/migration: Split fault related state into struct UserfaultState
>   migration: Allow the migrate command to work on file: urls
>   migration: Allow -incoming to work on file: urls
>   migration: Create a snapshot thread to realize saving memory snapshot
>   migration: implement initialization work for snapshot
>   QEMUSizedBuffer: Introduce two help functions for qsb
>   savevm: Split qemu_savevm_state_complete_precopy() into two helper functions
>   snapshot: Save VM's device state into snapshot file
>   migration/postcopy-ram: fix some helper functions to support userfaultfd write-protect
>   snapshot: Enable the write-protect notification capability for VM's RAM
>   snapshot/migration: Save VM's RAM into snapshot file
>   migration/ram: Fix some helper functions' parameter to use PageSearchStatus
>   snapshot: Remove page's write-protect and copy the content during setup stage
>
>  include/migration/migration.h     |  41 +--
>  include/migration/postcopy-ram.h  |   9 +-
>  include/migration/qemu-file.h     |   3 +-
>  include/qemu/typedefs.h           |   1 +
>  include/sysemu/sysemu.h           |   3 +
>  linux-headers/linux/userfaultfd.h |  21 +++-
>  migration/fd.c                    |  51 -
>  migration/migration.c             | 101 -
>  migration/postcopy-ram.c          | 229 --
>  migration/qemu-file-buf.c         |  61 ++
>  migration/ram.c                   | 104 -
>  migration/savevm.c                |  90 ---
>  trace-events                      |   1 +
>  13 files changed, 587 insertions(+), 128 deletions(-)
>
> --
> 1.8.3.1

Hi Hailiang,

Can I get the status of this patch series? I cannot find a v2. About the TCG limitation, is KVM support on a TODO list or is there a strong technical barrier?

Thanks,
Baptiste