On Thu, Oct 19, 2023 at 12:28:39PM +0200, Juan Quintela wrote: > Fabiano Rosas <faro...@suse.de> wrote: > > We currently use the 'sem_sync' semaphore on the sending side: > > > > 1) to know when the multifd_send_thread() has finished sending the > > MULTIFD_FLAG_SYNC packet; > > > > This is unnecessary. Multifd sends packets one by one and completion > > is already bound by the 'sem' semaphore. The SYNC packet has nothing > > special that would require it to have a separate semaphore on the > > sending side. > > What migration basically does is: > > sync_dirty_bitmap() > while (too_much_dirty_memory) > foreach(dirty_page) > send(dirty_page) > sync_dirty_bitmap() > > I know, this is an over simplification, but it is enough to explain the > problem that this code tries to fix. > > Once that we have multifd, we can have several channels, each of one > going through a different network connection. Yes, networks are a black > box and there is no guarantees about how packets arrive on different > sockets. > > In one iteration, page 99 is dirty. We send it through channel 0. > We end the iteration and we synchronize the bitmap again. > We send the SYNC packet in both channels. > Page 99 is dirty again, and this time it gets sent through channel 1. > > What we want, we want the communication to be: > > Channel 0 migration thread Channel 1 > > (1) sent page 99 > (2) sync bitmap > (3) sent page 99 > > And we want destination to make sure that it never gets packet with page > 99 from channel 0 AFTER page 99 from channel 1. > > Notice, this is something that it is highly improbable to happen, but it > _can_ happen (and zero copy increases the probability of it). > > So we create this SYNC packet that does that: > > foreach (channel) > create_job_to_send_sync() packet > foreach (channel) > wait_until_it_has_sent_the_sync_packet() > > Notice that this is the main migration thread, it will net send new work > to any channel until it receives the sem_sync for every channel. > > Now, how do we deal on the target: > > foreach (channel) > wait(sem_sync) > foreach (channel) > send(sem_sync) > > So, trying to do my best at ASCII art, what happens when we end a round > of memory iteration > > MigrationThread(send) MigrationThread(recv) channel_n (send) > channel_n(recv) > > sync_bitmap() > foreach(channel) > create_job_with_SYNC > post(channel->sem) > wait(channel->sem) > write(SYNC) > > post(channel->sem_sync) > foreach(channel) > wait(channel->sem_sync) > > write(MULTIFD_FLUSH) > > read(SYNC) > > post(main->sem_sync) > > wait(channel->sem_sync) > read(MULTIFD_FLUSH) > foreach(channel) > wait(main->sem_sync) > foreach(channel) > post(channel->sem_sync)
Hmm, I think I missed the fact that the main thread also sends something (in this case, MULTIFD_FLUSH), when I was raising the question in the other thread on sync. Now I think it should work indeed. I'm not sure what is the case before this commit, though: commit 294e5a4034e81b3d8db03b4e0f691386f20d6ed3 Author: Juan Quintela <quint...@redhat.com> Date: Tue Jun 21 13:36:11 2022 +0200 multifd: Only flush once each full round of memory > > Interesting points: > > 1- We guarantee that packets inside the same channel are in order. > > 2- Migration send thread don't send a MULTIFD_FLUSH packet until every > channel has sent a SYNC packet IMHO this is not required? Main thread can send MULTIFD_FLUSH early too, I think the important thing is dest recv threads will always sync with this one, early or late. Then it's impossible that one new page appears earlier than another old version (in another channel) because when reading the new page dest threads must have digested the old. > > 3- After reception of a SYNC packet. A channel: > a -> communicates to the main migration thread that it has received > it (post(main->sem_sync)) > b -> it waits on (channel->sem_sync) > > 4- Main recv thread receives a MULTIFD_FLUSH > a -> waits for every channel to say that it has received a SYNC > packet > b -> communicates to every channel that they can continue. > > Send channels can send new data after the main channel does a > write(MULTIFD_FLUSH). But reception channels will not read it until the > main recv thread is sure that every reception channel has received a > SYNC, so we are sure that (in the previous example) page 99 from thread > 0 is already written. > > Is it clearer now? > > And yes, after discussion I will convert this email in documentation. Please do so, and I suggest we start to do more with docs/*.rst and not use wiki pages unless necessary, then doc is with code. We may consider create docs/migration/ and this can belong to multifd.rst. Thanks, -- Peter Xu