On 1/27/2017 6:23 PM, Juan Quintela wrote:
> Jitendra Kolhe <jitendra.ko...@hpe.com> wrote:
>> Using the "-mem-prealloc" option for a very large guest leads to huge
>> guest start-up and migration times. This is because with
>> "-mem-prealloc" qemu tries to map every guest page (create address
>> translations) and make sure the pages are available during runtime.
>> virsh/libvirt seems to use the "-mem-prealloc" option by default when
>> the guest is configured to use huge pages. The patch maps all guest
>> pages simultaneously by spawning multiple threads. Given that the
>> problem is more prominent for large guests, the patch limits the
>> change to guests with at least 64GB of memory. The change is
>> currently limited to the QEMU library functions on POSIX-compliant
>> hosts only, as we are not sure whether the problem exists on win32.
>> Below are some stats with the "-mem-prealloc" option for a guest
>> configured to use huge pages.
>>
>> ------------------------------------------------------------------------
>> Idle Guest      | Start-up time | Migration time
>> ------------------------------------------------------------------------
>> Guest stats with 2M HugePage usage - single threaded (existing code)
>> ------------------------------------------------------------------------
>> 64 Core - 4TB   | 54m11.796s    | 75m43.843s
>                     ^^^^^^^^^^
>> 64 Core - 1TB   | 8m56.576s     | 14m29.049s
>> 64 Core - 256GB | 2m11.245s     | 3m26.598s
>> ------------------------------------------------------------------------
>> Guest stats with 2M HugePage usage - map guest pages using 8 threads
>> ------------------------------------------------------------------------
>> 64 Core - 4TB   | 5m1.027s      | 34m10.565s
>> 64 Core - 1TB   | 1m10.366s     | 8m28.188s
>> 64 Core - 256GB | 0m19.040s     | 2m10.148s
>> ------------------------------------------------------------------------
>> Guest stats with 2M HugePage usage - map guest pages using 16 threads
>> ------------------------------------------------------------------------
>> 64 Core - 4TB   | 1m58.970s     | 31m43.400s
>                     ^^^^^^^^^
>
> Impressive, not every day one gets a speedup of 20 O:-)
>
>> +static void *do_touch_pages(void *arg)
>> +{
>> +    PageRange *range = (PageRange *)arg;
>> +    char *start_addr = range->addr;
>> +    uint64_t numpages = range->numpages;
>> +    uint64_t hpagesize = range->hpagesize;
>> +    uint64_t i = 0;
>> +
>> +    for (i = 0; i < numpages; i++) {
>> +        memset(start_addr + (hpagesize * i), 0, 1);
>
> I would use range->addr and similar here directly, but it is just a
> question of taste.
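
Agreed, using range->addr directly reads fine too. For anyone following
the thread, the overall scheme looks roughly like the self-contained
sketch below. The PageRange fields and the one-byte-per-page memset
follow the quoted snippet; touch_all_pages() and the way it splits the
range across threads are only my illustration here, not the exact patch
code.

#include <pthread.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    char *addr;          /* start of this thread's slice of guest RAM */
    uint64_t numpages;   /* number of (huge)pages in the slice */
    uint64_t hpagesize;  /* page size in bytes */
} PageRange;

static void *do_touch_pages(void *arg)
{
    PageRange *range = arg;
    uint64_t i;

    /* Writing one byte into each page forces the kernel to allocate
     * and map the page now, rather than at first guest access. */
    for (i = 0; i < range->numpages; i++) {
        memset(range->addr + (range->hpagesize * i), 0, 1);
    }
    return NULL;
}

/* Split [area, area + numpages * hpagesize) across nthreads workers
 * and wait for all of them to finish touching their slice. */
static void touch_all_pages(char *area, uint64_t hpagesize,
                            uint64_t numpages, int nthreads)
{
    pthread_t threads[nthreads];
    PageRange ranges[nthreads];
    uint64_t per_thread = numpages / nthreads;
    int i;

    for (i = 0; i < nthreads; i++) {
        ranges[i].addr = area + i * per_thread * hpagesize;
        /* the last thread also picks up any remainder pages */
        ranges[i].numpages = (i == nthreads - 1)
                             ? numpages - i * per_thread
                             : per_thread;
        ranges[i].hpagesize = hpagesize;
        pthread_create(&threads[i], NULL, do_touch_pages, &ranges[i]);
    }
    for (i = 0; i < nthreads; i++) {
        pthread_join(threads[i], NULL);
    }
}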
Thanks for your response, will update my next patch.

>> -    /* MAP_POPULATE silently ignores failures */
>> -    for (i = 0; i < numpages; i++) {
>> -        memset(area + (hpagesize * i), 0, 1);
>> +    /* touch pages simultaneously for memory >= 64G */
>> +    if (memory < (1ULL << 36)) {
>
> 64GB guest already took quite a bit of time, I think I would put it
> always as min(num_vcpus, 16). So, we always execute the multiple
> thread codepath?

It sounds like a good idea to base the heuristic on the vCPU count; I
will add that in my next version. But shouldn't we also restrict it
based on guest RAM size, to avoid the overhead of spawning multiple
threads for thin guests? Concretely, I have something like the sketch
in the P.S. below in mind.

> But very nice, thanks.
>
> Later, Juan.

Thanks,
- Jitendra
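
P.S. A sketch of the heuristic under discussion; smp_cpus and
MAX_TOUCH_THREADS are placeholder names here, not the final patch:

#define MAX_TOUCH_THREADS 16

/* Pick the number of page-touching threads: a single thread for
 * guests below 64GB, where thread setup cost can outweigh the gain,
 * otherwise min(vCPUs, 16) as suggested. */
static int touch_thread_count(int smp_cpus, uint64_t memory)
{
    if (memory < (1ULL << 36)) {    /* 64GB */
        return 1;
    }
    return smp_cpus < MAX_TOUCH_THREADS ? smp_cpus : MAX_TOUCH_THREADS;
}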