Peter Xu <pet...@redhat.com> wrote:
> On Thu, Jun 01, 2023 at 04:55:25PM +0100, Daniel P. Berrangé wrote:
>> On Thu, Jun 01, 2023 at 11:53:17AM -0400, Peter Xu wrote:
>> > On Thu, Jun 01, 2023 at 04:39:48PM +0100, Daniel P. Berrangé wrote:
>> > > On Thu, Jun 01, 2023 at 11:30:10AM -0400, Peter Xu wrote:
>> > > > Thanks for looking into this.. definitely worthwhile.
>> > > >
>> > > > On Wed, May 31, 2023 at 02:23:59PM +0100, Daniel P. Berrangé wrote:
>> > > > > There are 27 pre-copy live migration scenarios being tested. In all
>> > > > > of these we force non-convergence and run for one iteration, then
>> > > > > let it converge and wait for completion during the second (or
>> > > > > following) iterations. At the 3 mbps bandwidth limit the first
>> > > > > iteration takes a very long time (~30 seconds).
>> > > > >
>> > > > > While it is important to test the migration passes and convergence
>> > > > > logic, it is overkill to do this for all 27 pre-copy scenarios. The
>> > > > > TLS migration scenarios in particular are merely exercising
>> > > > > different code paths during connection establishment.
>> > > > >
>> > > > > To optimize the time taken, switch most of the test scenarios to
>> > > > > run non-live (i.e. guest CPUs paused) with no bandwidth limits.
>> > > > > This gives a massive speed up for most of the test scenarios.
>> > > > >
>> > > > > For test coverage the following scenarios are unchanged
>> > > >
>> > > > Curious how the ones below were chosen? I assume..
>> > >
>> > > Chosen based on whether they exercise code paths that are unique
>> > > and interesting during the RAM transfer phase.
>> > >
>> > > Essentially the goal is that if we have N% code coverage before this
>> > > patch, then we should still have the same N% code coverage after this
>> > > patch.
>> > >
>> > > The TLS tests exercise code paths that are unique during the migration
>> > > establishment phase. Once established they don't exercise anything
>> > > "interesting" during the RAM transfer phase. Thus we don't lose code
>> > > coverage by running the TLS tests non-live.
>> > >
>> > > > > * Precopy with UNIX sockets
>> > > >
>> > > > this one verifies dirty log.
>> > > >
>> > > > > * Precopy with UNIX sockets and dirty ring tracking
>> > > >
>> > > > ... dirty ring...
>> > > >
>> > > > > * Precopy with XBZRLE
>> > > >
>> > > > ... xbzrle I think needs a diff on old/new, makes sense.
>> > > >
>> > > > > * Precopy with UNIX compress
>> > > > > * Precopy with UNIX compress (nowait)
>> > > > > * Precopy with multifd
>> > > >
>> > > > What about the other three? Especially the two compression tests.
>> > >
>> > > The compress thread logic is unique/interesting during RAM transfer,
>> > > so it benefits from running live. The wait vs non-wait scenario tests
>> > > a distinct code path.
>> >
>> > I assume you mean e.g. compressing while the guest page is being
>> > modified, and we should survive that rather than crashing the
>> > compressor?
>>
>> No, I mean the compression code has a significant behaviour difference
>> between its two tests, because they toggle:
>>
>>   @compress-wait-thread: Controls behavior when all compression
>>   threads are currently busy. If true (default), wait for a free
>>   compression thread to become available; otherwise, send the page
>>   uncompressed. (Since 3.1)
>>
>> so we need to exercise the code path that falls back to sending
>> uncompressed, as well as the code path that waits for free threads.
>
> But then the question is why live is needed?
>
> IIUC, whether the wait path triggers has nothing directly to do with
> whether the VM is live or not, only with whether all compress threads
> are busy. IOW, IIUC, all compress paths will be tested even if
> non-live, as long as we feed enough pages to the compressor threads.
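For reference, the two code paths Daniel describes above come down to a
decision like the one below. This is a standalone sketch, not the actual
QEMU code; every name in it is made up for illustration. The point is
simply that the nowait variant is the only one that can take the
uncompressed fallback, which is why both test scenarios exist.

#include <stdbool.h>
#include <stdio.h>

enum send_mode { SEND_COMPRESSED, SEND_UNCOMPRESSED };

/* Decide what to do with a page given the compress-wait-thread setting. */
static enum send_mode pick_send_mode(bool all_threads_busy, bool wait_thread)
{
    if (all_threads_busy && !wait_thread) {
        /* nowait path: do not block, ship the page uncompressed */
        return SEND_UNCOMPRESSED;
    }
    /* either a thread is already free, or we block until one is */
    return SEND_COMPRESSED;
}

int main(void)
{
    printf("busy, wait:   %s\n", pick_send_mode(true, true) == SEND_COMPRESSED
           ? "compressed" : "uncompressed");
    printf("busy, nowait: %s\n", pick_send_mode(true, false) == SEND_COMPRESSED
           ? "compressed" : "uncompressed");
    printf("idle, nowait: %s\n", pick_send_mode(false, false) == SEND_COMPRESSED
           ? "compressed" : "uncompressed");
    return 0;
}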
Compressing pages of a live guest is even wrong: we never fixed this for
the compression code:

commit 007e179ef0e97eafda4c9ff2a9d665a1947c7c6d
Author: Ilya Leoshkevich <i...@linux.ibm.com>
Date:   Tue Jul 5 22:35:59 2022 +0200

    multifd: Copy pages before compressing them with zlib

    zlib_send_prepare() compresses pages of a running VM. zlib does not
    make any thread-safety guarantees with respect to changing deflate()
    input concurrently with deflate() [1].

Not that anyone is going to use any accelerator to run zlib when we are
compressing just 4k at a time. The Intel AT engine also had to move to
64 pages at a time to make a difference.

As said, I can't think of a single scenario where compression is a good
option.

Later, Juan.
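P.S. For anyone who wants the idea behind that commit in code: snapshot
the page into a private buffer before handing it to zlib, so deflate()
never sees input that the running guest is still modifying. A minimal
sketch follows; it is not the actual zlib_send_prepare() code, the
function and buffer names are made up, and the caller is assumed to have
already set up the stream with deflateInit().

#include <stdint.h>
#include <string.h>
#include <zlib.h>

#define GUEST_PAGE_SIZE 4096

/*
 * Compress one guest page.  The guest may keep dirtying guest_page while
 * we run, so copy it into scratch first: zlib requires that deflate()'s
 * input does not change while deflate() is running.
 */
static int compress_page_copy(z_stream *zs, const uint8_t *guest_page,
                              uint8_t *scratch, uint8_t *out,
                              unsigned int out_len)
{
    memcpy(scratch, guest_page, GUEST_PAGE_SIZE);   /* stable snapshot */

    zs->next_in = scratch;
    zs->avail_in = GUEST_PAGE_SIZE;
    zs->next_out = out;
    zs->avail_out = out_len;

    return deflate(zs, Z_SYNC_FLUSH);
}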