> > > > > > > > > I think this is "very" wasteful. Assume the workload dirties pages randomly within the guest address space, and the transfer speed is constant. Intuitively, nearly half of the dirty pages produced in Iteration 1 are not really dirty. This means Iteration 2 takes twice as long as it would take to send only the really dirty pages.
> > > > > > > >
> > > > > > > > It makes sense; can you get some perf numbers to show what kinds of workloads get impacted the most? That would also help us figure out what kinds of speed improvements we can expect.
> > > > > > > >
> > > > > > > > Amit
> > > > > > >
> > > > > > > I have picked 6 workloads and collected the following statistics for every iteration (except the last stop-copy one) during precopy. These numbers are obtained with basic precopy migration, without capabilities like xbzrle or compression. The network for the migration is exclusive, with a separate network for the workloads; both are gigabit Ethernet. I use qemu-2.5.1.
> > > > > > >
> > > > > > > Three of them (booting, idle, web server) converged to the stop-copy phase with the given bandwidth and default downtime (300ms), while the other three (kernel compilation, zeusmp, memcached) did not.
> > > > > > >
> > > > > > > A page is "not-really-dirty" if it is written first and sent later (and not written again after that) during one iteration. I guess this happens less often during the later iterations than during the 1st iteration, because all the pages of the VM are sent to the dest node during the 1st iteration, while during the others only part of the pages are sent. So I think the "not-really-dirty" pages are produced mainly during the 1st iteration, and perhaps very few during the other iterations.
> > > > > > >
> > > > > > > If we could avoid resending the "not-really-dirty" pages, intuitively the time spent on Iteration 2 would be halved. This is a chain reaction: the dirty pages produced during Iteration 2 are halved, so the time spent on Iteration 3 is halved, then Iteration 4, 5...
> > > > > >
> > > > > > Yes; these numbers don't show how many of them are false dirty though.
> > > > > >
> > > > > > One problem is thinking about pages that have been redirtied: if the page is dirtied after the sync but before the network write, then it's the false-dirty that you're describing. However, if the page is written a few times, so that it would also have been written after the network write, then it isn't a false-dirty.
> > > > > >
> > > > > > You might be able to figure that out with some kernel tracing of when the dirtying happens, but it might be easier to write the fix!
> > > > > >
> > > > > > Dave
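Dave's distinction turns on where a guest write lands relative to the dirty-bitmap sync and the network send. A minimal sketch of one precopy iteration may make the timing clearer (the helpers below are hypothetical placeholders, not QEMU's actual functions):

    /* Illustrative sketch of one precopy iteration; the helper functions
     * are assumed to exist and are not QEMU's real API. */
    #include <stdbool.h>
    #include <stddef.h>

    extern size_t nb_pages;
    extern bool *dirty_bitmap;                 /* filled in by the sync below */
    extern void sync_dirty_bitmap(void);       /* snapshot the guest write log */
    extern void send_page(size_t page_index);  /* push one page over the wire */

    static void precopy_iteration(void)
    {
        sync_dirty_bitmap();

        for (size_t i = 0; i < nb_pages; i++) {
            if (!dirty_bitmap[i]) {
                continue;
            }
            /*
             * A guest write to page i that happens after sync_dirty_bitmap()
             * but before send_page(i) is already contained in the bytes we
             * send, yet it sets the dirty bit again, so the unchanged page
             * is resent next iteration: the "false dirty" case.
             *
             * A write that happens after send_page(i) really does make the
             * destination's copy stale, so the resend is needed.
             */
            send_page(i);
        }
    }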
> > > > > Hi, I have made some new progress now.
> > > > >
> > > > > To tell exactly how many false dirty pages there are in each iteration, I malloc a buffer as big as the whole VM memory. When a page is transferred to the dest node, it is copied into the buffer; during the next iteration, if a page is transferred again, it is compared to the old copy in the buffer, and the old copy is replaced for the next comparison if the page is really dirty. Thus we can get the exact number of false dirty pages (a sketch of this bookkeeping appears after the result summary below).
> > > > >
> > > > > This time I use 15 workloads to get the statistics. They are:
> > > > >
> > > > > 1. 11 benchmarks picked from the cpu2006 benchmark suite. They are all scientific computing workloads like Quantum Chromodynamics, Fluid Dynamics, etc. I picked these 11 because, compared to the others, they have a bigger memory footprint and a higher memory dirty rate, so most of them could not converge to stop-and-copy at the default migration speed (32MB/s).
> > > > > 2. kernel compilation
> > > > > 3. idle VM
> > > > > 4. Apache web server serving static content
> > > > >
> > > > > (The above workloads all run in a VM with 1 vcpu and 1GB memory, and the migration speed is the default 32MB/s.)
> > > > >
> > > > > 5. Memcached. The VM has 6 cpu cores and 6GB memory, and 4GB are used as the cache. After filling up the 4GB cache, a client writes the cache at a constant speed during migration. This time the migration speed is not limited and is up to the capability of 1Gbps Ethernet.
> > > > >
> > > > > To summarize the results first (the precise numbers are below):
> > > > >
> > > > > 1. 4 of the 15 workloads have a big proportion (>60%, even >80% during some iterations) of false dirty pages out of all dirty pages from iteration 2 onwards (and the big proportion persists in the following iterations). They are cpu2006.zeusmp, cpu2006.bzip2, cpu2006.mcf, and memcached.
> > > > > 2. 2 workloads (idle, webserver) spend most of the migration time on iteration 1; even though the proportion of false dirty pages is big from iteration 2 onwards, the room for optimization is small.
> > > > > 3. 1 workload (kernel compilation) only has a big proportion during iteration 2, not in the other iterations.
> > > > > 4. 8 workloads (the other 8 cpu2006 benchmarks) have a small proportion of false dirty pages from iteration 2 onwards, so the room for optimization is small for them too.
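For illustration, the shadow-buffer accounting described above might look roughly like this (a sketch only, not the actual instrumentation; it assumes 4 KiB pages and simple global counters):

    /* Sketch of per-page false-dirty accounting using a shadow copy of
     * guest RAM.  Illustrative only. */
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    static uint8_t *shadow;                 /* as big as the whole guest RAM */
    static uint64_t really_dirty, false_dirty;

    static void shadow_init(size_t nb_pages)
    {
        shadow = calloc(nb_pages, PAGE_SIZE);
    }

    /* Called for every page that is about to be (re)transmitted. */
    static void shadow_account(size_t page_index, const uint8_t *page)
    {
        uint8_t *old = shadow + page_index * PAGE_SIZE;

        if (memcmp(old, page, PAGE_SIZE) == 0) {
            false_dirty++;                  /* marked dirty, content unchanged */
        } else {
            really_dirty++;
            memcpy(old, page, PAGE_SIZE);   /* remember the last sent content */
        }
    }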
> > > > > Now I want to say a little more about why false dirty pages are produced. The first reason is what we have discussed before---the mechanism used to track dirty pages. I have also come up with another reason: a write to a memory page happens, but it does not change any content of the page. It is a "write but not dirty", yet the kernel still marks the page as dirty. Someone in our lab has run experiments to measure the proportion of "write but not dirty" operations using the cpu2006 benchmark suite. According to his results, most workloads have a small proportion (<10%) of "write but not dirty" out of all write operations, while a few workloads have a higher proportion (one even as high as 50%). We are not yet sure why "write but not dirty" happens; it just does.
> > > > >
> > > > > So these two reasons contribute to the false dirty pages. To optimize, I compute and store the SHA1 hash of each page before transferring it. The next time a page needs retransmission, its SHA1 hash is computed again and compared to the stored hash. If the hashes are equal, it is a false dirty page and we simply skip it; otherwise the page is transferred and the new hash replaces the old one for the next comparison.
> > > > >
> > > > > The reason to use a SHA1 hash rather than a byte-by-byte comparison is memory overhead. One SHA1 hash is 20 bytes, so we need extra memory of only 20/4096 (<1/200) of the whole VM memory, which is relatively small. As far as I know, SHA1 is widely used for deduplication in backup systems. There it has been shown that the probability of a hash collision is far smaller than that of a disk hardware fault, so it is treated as a secure hash: if the hashes of two chunks are the same, the contents are taken to be the same. So I think the SHA1 hash can replace byte-by-byte comparison for VM memory as well.
> > > > >
> > > > > Then I ran the same migration experiments using the SHA1 hash. For the 4 workloads with a big proportion of false dirty pages, the improvement is remarkable: without the optimization they either cannot converge to stop-and-copy or take a very long time to complete, while with the SHA1 hash method all of them now complete in a relatively short time. For the reasons given above, the other workloads do not get notable improvements from the optimization, so below I only show the exact numbers after optimization for the 4 workloads with remarkable improvements.
> > > > >
> > > > > Any comments or suggestions?
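A rough sketch of the per-page SHA1 check described in the mail above could look like this (illustrative only; it assumes 4 KiB pages and uses OpenSSL's SHA1(), which need not be what the actual patch uses):

    /* Sketch of skipping false dirty pages via a stored per-page SHA1 hash.
     * Illustrative only. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include <openssl/sha.h>

    #define PAGE_SIZE 4096

    /* One 20-byte slot per guest page: 20/4096 < 1/200 of guest RAM. */
    static uint8_t (*hashes)[SHA_DIGEST_LENGTH];

    static void hashes_init(size_t nb_pages)
    {
        hashes = calloc(nb_pages, SHA_DIGEST_LENGTH);
    }

    /* Returns true if the page can be skipped (false dirty). */
    static bool page_is_false_dirty(size_t page_index, const uint8_t *page)
    {
        uint8_t digest[SHA_DIGEST_LENGTH];

        SHA1(page, PAGE_SIZE, digest);

        if (memcmp(hashes[page_index], digest, sizeof(digest)) == 0) {
            return true;                         /* content unchanged: skip it */
        }
        memcpy(hashes[page_index], digest, sizeof(digest));
        return false;                            /* really dirty: send it */
    }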
> > > > Maybe you can compare the performance of your solution with that of XBZRLE to see which one is better. The merit of using SHA1 is that it can avoid the data copy that XBZRLE does, and it needs less buffer. How about the overhead of calculating the SHA1? Is it faster than copying a page?
> > > >
> > > > Liang
> > >
> > > Yes, XBZRLE is able to handle the false dirty pages. However, if we want to avoid transferring all of the false dirty pages using XBZRLE, we need a buffer as big as the whole VM memory, while SHA1 needs a much smaller buffer. Of course, if we do have a buffer as big as the whole VM memory, XBZRLE can transfer less data over the network than SHA1, because XBZRLE compresses similar pages. In a word, yes, the merit of using SHA1 is that it needs much less buffer, and it brings a nice improvement when there are many false dirty pages.
> >
> > The current implementation of XBZRLE begins to buffer pages from the second iteration. Maybe it's worth making it start from the first iteration, based on your finding.
> > >
> > > In terms of the overhead of calculating the SHA1 compared with transferring a page, it depends on the CPU and network performance. In my test environment (Intel Xeon E5620 @2.4GHz, 1Gbps Ethernet), I did not observe obvious extra computing overhead from calculating the SHA1, because the network throughput (shown by "info migrate") remains almost the same.
> >
> > You can check the CPU usage, or measure the time spent on a local live migration which uses SHA1/XBZRLE.
> >
> > Liang
>
> I compared SHA1 with XBZRLE. I use XBZRLE in two ways: 1. it begins to buffer pages from iteration 1; 2. as in the current implementation, it begins to buffer pages from iteration 2.
>
> I post the results of three workloads: cpu2006.zeusmp, cpu2006.mcf, memcached. I set the cache size to 256MB for zeusmp & mcf (they run in a VM with 1GB RAM), and to 1GB for memcached (it runs in a VM with 6GB RAM, and memcached takes 4GB as cache).
>
> As you can read from the data below, beginning to buffer pages from iteration 1 is better than the current implementation (from iteration 2), because the total migration time is shorter.
>
> SHA1 is better than XBZRLE with the cache sizes I chose, because it leads to shorter migration time and consumes far less memory overhead (<1/200 of the total VM memory).
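Liang's suggestion above to measure the SHA1 cost directly could be tried with a small stand-alone micro-benchmark along these lines (illustrative; it assumes OpenSSL and 4 KiB pages, and says nothing about the in-QEMU overhead):

    /* Rough micro-benchmark: SHA1 of a 4 KiB page vs memcpy of a 4 KiB page. */
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <openssl/sha.h>

    #define PAGE_SIZE 4096
    #define ROUNDS    (1 << 18)

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        static unsigned char src[PAGE_SIZE], dst[PAGE_SIZE], md[SHA_DIGEST_LENGTH];
        double t;

        t = now_sec();
        for (int i = 0; i < ROUNDS; i++) {
            memcpy(dst, src, PAGE_SIZE);
            src[0] ^= dst[1];       /* data dependence, discourages dead-code removal */
        }
        printf("memcpy: %.3f us/page\n", (now_sec() - t) * 1e6 / ROUNDS);

        t = now_sec();
        for (int i = 0; i < ROUNDS; i++) {
            SHA1(src, PAGE_SIZE, md);
            src[0] ^= md[0];
        }
        printf("SHA1:   %.3f us/page\n", (now_sec() - t) * 1e6 / ROUNDS);
        return 0;
    }

Comparing the two numbers with the roughly 33 us it takes to push 4 KiB over gigabit Ethernet gives a feel for whether hashing can keep up with the link.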
Hi Chunguang,

Have you tried using an XBZRLE cache size equal to the guest's RAM size? Is SHA1 faster in that case?

Thanks!
Liang