Given that Cyrus is unix processes, not threads, what mechanism are you proposing for the atomic here?
I'd be keen to see something shaped like a pull request against Cyrus showing how this would interact with the existing locking architecture.

Cheers,

Bron.

On Wed, Nov 13, 2019, at 16:12, Anatoli via Cyrus-devel wrote:
> > How do you know when the current write operations are finished?
>
> The pseudocode from my previous email:
>
> __start:
>
>     atomic_inc(write_operations_in_execution_counter)
>     atomic_load(write_barrier)
>
>     if (write_barrier == ON) {
>         atomic_dec(write_operations_in_execution_counter)
>         sleep(100ms)
>         goto __start
>     } else {
>         *perform_write_operation_with_its_own_locks()*
>         atomic_dec(write_operations_in_execution_counter)
>     }
>
> Here perform_write_operation_with_its_own_locks() is a function (one
> among many) that today performs a write operation (the algorithm
> implies that all occurrences of write operations should be wrapped
> with this pseudocode).
>
> Once a write function returns (meaning that particular write operation
> has completed), write_operations_in_execution_counter is decremented.
> When all operations have completed, the counter reaches 0, which the
> barrier thread detects; it then performs the sync/fsync so all files
> are fully written to disk and notifies the external process waiting to
> take a snapshot. Finally it turns off the barrier and the daemon
> resumes normal operation.
>
> This is similar to a single global lock as you describe it, and it
> behaves much like one, but it doesn't need to be acquired as such
> every time and it has practically zero overhead, so it could be placed
> inside high-frequency loops. It's similar to how modern lock-free
> concurrency is implemented. The point is that it's opportunistic
> locking, which can be implemented with very low overhead.
> And if the code inside perform_write_operation_with_its_own_locks()
> is guaranteed not to block on other
> perform_write_operation_with_its_own_locks() calls, then this
> implementation is lock-free while the barrier is OFF, i.e. there are
> no potential deadlocks. While it's ON there are also no deadlocks,
> but the operation as such isn't fully lock-free.
>
> Also, the way I propose to implement the barrier thread, it won't
> block the server for more than a configurable number of seconds no
> matter what (with this, the entire implementation would be lock-free
> (if we consider the snapshot part of the process), i.e. it guarantees
> progress in a finite amount of time, though some threads could
> starve):
>
> atomic_store(write_barrier, ON)
>
> __check_write_counter:
>
>     atomic_load(write_operations_in_execution_counter)
>
>     if (write_operations_in_execution_counter == 0) {
>         sync_data_to_disk()
>         signal_data_ready()
>         *wait_for_lock_release_with_timeout(5s)*
>         atomic_store(write_barrier, OFF)
>     } else {
>         sleep(1ms)
>         goto __check_write_counter
>     }
>
> Here the wait_for_lock_release_with_timeout(5s) function waits up to
> 5 seconds for the release-lock signal and then turns off the barrier
> whether or not the external operation (the snapshot-taking backup
> tool) has completed, so the server resumes normal operation once the
> 5s timeout expires.
>
> So while the barrier thread waits for the release-lock signal, the
> backup tool takes a snapshot and then sends the release-lock signal.
> The result of the signalling indicates whether the lock had already
> been released. If the backup tool receives the code indicating that
> the lock was released beforehand, the snapshot it took may be
> inconsistent.
>
> In this case the backup tool could try the operation again or proceed
> another way (e.g. notify the admin that the snapshot takes longer
> than the preconfigured lock-wait time).
> Again, this is opportunistic locking: we try to perform an operation
> without a guarantee of success, so we don't need to wait indefinitely,
> again providing a lock-free guarantee. If we succeed, all is good. If
> not, we try again or abandon the task with an error.
>
> And all of this would be external to Cyrus; it would be implemented
> in the backup utility.
>
> I guess the best way to start with this is to identify all places
> where data write operations occur (I suppose this is where the mail
> data and all sorts of databases are written). Once they are
> identified they could be tweaked a bit for better concurrency and
> lockability, and then we could analyse how to wrap them with a global
> lock/barrier.
>
> Regards,
> Anatoli
>
>
> On 12/11/19 06:20, Bron Gondwana wrote:
> > On Tue, Nov 12, 2019, at 14:50, Anatoli wrote:
> >> Bron,
> >>
> >> The proposed algo is a barrier before any single-lock. In itself
> >> it's a single lock, but the same code (the pseudocode for the
> >> *worker thread* in my previous mail) should be inserted at *every*
> >> single-lock/write operation location. If there's no need to pause,
> >> the overhead is non-existent. If a pause is requested, all worker
> >> threads would pause at the entrance to any single-lock/write code.
> >>
> >> It would make the entire Cyrus daemon complete all pending write
> >> operations and pause new ones. At this stage, if I understand it
> >> correctly, the data on disk would be in a consistent state, ready
> >> to take a snapshot or to perform some other operation.
> >
> > "complete all pending write operations and pause new ones"
> >
> > How do you know when the current write operations are finished?
> >
> >> Without that, if we just take a snapshot of the fs, it could happen
> >> that a) some files are not written entirely (i.e. caught in the
> >> middle of a write operation) or b) the contents of some files are
> >> newer than others, i.e. the logical write operation was not atomic
> >> (e.g. mail data is written but indexes are not yet updated, or
> >> something similar).
> >>
> >> Maybe I didn't understand you correctly. Do you mean that finishing
> >> all writes and pausing new ones is not enough to guarantee an
> >> integral state of the files on disk? If so, what would have to be
> >> done to guarantee it (i.e. to make it as if Cyrus had been shut
> >> down normally)?
> >
> > I mean that to finish all writes and pause new ones, you need to
> > know that the writes are finished. And not just writes, but sets of
> > writes that are held under a lock together. The way I know to do
> > this is a single global lock with the following properties:
> >
> > 1) every action which locks any file within Cyrus for writing takes
> > a SHARED global lock before it takes the write lock on the file.
> >
> > 2) the SHARED lock is held for the duration of the writes, and
> > released once the writes are finished.
> >
> > 3) the "backup utility" takes an EXCLUSIVE lock on the global lock,
> > which will only be granted once each write is finished. It then
> > takes a snapshot, and releases the EXCLUSIVE lock.
> >
> > This guarantees full consistency.
> >
> > The question that always exists for locks is "what granularity" -
> > too wide, and you hold the lock for a long time. Too narrow, and you
> > take and release it very frequently, adding overhead.
> >
> > My first and dumbest theory is to go quite wide - add the lock in
> > every runloop and command line utility such that it's held for the
> > entire run of the loop or the utility! Mostly these are done within
> > a fraction of a second. The one place that might be interesting is
> > FETCH 1:* RFC822.PEEK or similar in imapd, where we already have
> > some locking magic that holds a shared namelock on the mailbox to
> > stop repacking while it releases the index lock to allow other
> > actions on the mailbox in the meanwhile.
> > So we could go down a layer and only lock when we lock mailboxes or
> > cyrusdbs, and refcount the global lock. This seems more likely to
> > be a good layer, and not too horrible.
> >
> > The other thing is that we'll need to assert that the lock isn't
> > being held during each daemon's command loop, so that bugs don't
> > leak out to deadlock entire servers.
> >
> > And I think that's nearly it :)
> >
> > Bron.
> >
> > --
> > Bron Gondwana, CEO, Fastmail Pty Ltd
> > br...@fastmailteam.com
>

--
Bron Gondwana, CEO, Fastmail Pty Ltd
br...@fastmailteam.com