* Gonglei (arei.gong...@huawei.com) wrote:
> On 2015/7/30 19:56, Dr. David Alan Gilbert wrote:
> > * Jason Wang (jasow...@redhat.com) wrote:
> >>
> >> On 07/30/2015 04:03 PM, Dr. David Alan Gilbert wrote:
> >>> * Dong, Eddie (eddie.d...@intel.com) wrote:
> >>>>>> A question here, the packet comparing may be very tricky. For example,
> >>>>>> some protocol use random data to generate unpredictable id or
> >>>>>> something else. One example is ipv6_select_ident() in Linux. So COLO
> >>>>>> needs a mechanism to make sure PVM and SVM can generate same random
> >>>>>> data?
> >>>>> Good question, the random data connection is a big problem for COLO. At
> >>>>> present, it will trigger checkpoint processing because of the different
> >>>>> random data.
> >>>>> I don't think any mechanisms can assure two different machines generate
> >>>>> the same random data. If you have any ideas, pls tell us :)
> >>>>>
> >>>>> Frequent checkpoint can handle this scenario, but maybe will cause the
> >>>>> performance poor. :(
> >>>>>
> >>>> The assumption is that, after VM checkpoint, SVM and PVM have identical
> >>>> internal state, so the pattern used to generate random data has high
> >>>> possibility to generate identical data at short time, at least...
> >>> They do diverge pretty quickly though; I have simple examples which
> >>> reliably cause a checkpoint because of simple randomness in applications.
> >>>
> >>> Dave
> >>>
> >>
> >> And it will become even worse if hwrng is used in guest.
> >
> > Yes; it seems quite application dependent; (on IPv4) an ssh connection,
> > once established, tends to work well without triggering checkpoints;
> > and static web pages also work well. Examples of things that do cause
> > more checkpoints are, displaying guest statistics (e.g. running top
> > in that ssh) which is timing dependent, and dynamically generated
> > web pages that include a unique ID (bugzilla's password reset link in
> > it's front page was a fun one), I think also establishing
> > new encrypted connections cause the same randomness.
> >
> > However, it's worth remembering that COLO is trying to reduce the
> > number of checkpoints compared to a simple checkpointing world
> > which would be aiming to do a checkpoint ~100 times a second,
> > and for compute bound workloads, or ones that don't expose
> > the randomness that much, it can get checkpoints of a few seconds
> > in length which greatly reduces the overhead.
> >
>
> Yes. That's the truth.
> We can set two different modes for different scenarios. Maybe Named
> 1) frequent checkpoint mode for multi-connections and randomness scenarios
> and 2) non-frequent checkpoint mode for other scenarios.
>
> But that's the next plan, we are thinking about that.

I have some code that tries to automatically switch between those;
it measures the checkpoint lengths, and if they're consistently short
it sends a different message byte to the secondary at the start of the
checkpoint, so that it doesn't bother running. Every so often it
then flips back to a COLO checkpoint to see if the checkpoints
are still really fast.

Dave

>
> Regards,
> -Gonglei
>
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
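
For readers following the thread, the mode-switching heuristic described
above could look roughly like the Python sketch below. It is only an
illustration under stated assumptions, not the code referred to in the
message: the class name, the thresholds (short_ms, window, probe_every)
and the two message byte values are all hypothetical and do not come
from QEMU or the COLO patches.

```python
from collections import deque
from enum import Enum


class Mode(Enum):
    COLO = "colo"          # secondary runs; checkpoint on output mismatch
    PERIODIC = "periodic"  # secondary idle; plain periodic checkpoints


# Hypothetical message bytes sent to the secondary at checkpoint start.
MSG_COLO_CHECKPOINT = 0x01
MSG_PERIODIC_CHECKPOINT = 0x02


def start_byte(mode):
    """Return the (made-up) byte announcing the next checkpoint's mode."""
    return MSG_COLO_CHECKPOINT if mode is Mode.COLO else MSG_PERIODIC_CHECKPOINT


class ModeSwitcher:
    """Pick a checkpoint mode from recently measured checkpoint lengths."""

    def __init__(self, short_ms=100.0, window=5, probe_every=20):
        self.short_ms = short_ms        # assumed: below this counts as "short"
        self.window = window            # assumed: how many must be short in a row
        self.probe_every = probe_every  # assumed: periodic rounds between probes
        self.mode = Mode.COLO
        self.recent = deque(maxlen=window)
        self.periodic_count = 0

    def after_checkpoint(self, length_ms):
        """Record the last checkpoint length and return the next mode."""
        if self.mode is Mode.COLO:
            self.recent.append(length_ms)
            consistently_short = (
                len(self.recent) == self.window
                and all(length < self.short_ms for length in self.recent)
            )
            if consistently_short:
                # COLO is not paying off; stop running the secondary.
                self.mode = Mode.PERIODIC
                self.periodic_count = 0
        else:
            self.periodic_count += 1
            if self.periodic_count >= self.probe_every:
                # Flip back to COLO to see whether checkpoints are still short.
                self.mode = Mode.COLO
        return self.mode


if __name__ == "__main__":
    switcher = ModeSwitcher(short_ms=100.0, window=3, probe_every=2)
    for length in [2000, 10, 10, 10, 10, 10, 3000, 10]:
        mode = switcher.after_checkpoint(length)
        print(length, mode.value, hex(start_byte(mode)))
```

Because the window of recent lengths is kept across a probe, a single
short COLO checkpoint sends the sketch straight back to periodic mode,
while a long one keeps it in COLO until the window fills with short
checkpoints again. Whether that matches the real behaviour is not
stated in the thread.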