First, I’m pretty sure that, contrary to your claims, C++ does not support this. C++ doesn’t even support shared memory out of the box. The third-party Boost library does provide it, but only as long as you care solely about systems that correctly support POSIX shared memory, plus Windows, and as long as you either don’t care that your shared memory might be simulated with a memory-mapped file, or that it might have different persistence than you asked for. And then, whether you can actually allocate atomics inside shared memory and have them actually be atomic depends on whether atomic<T> for every one of your types is guaranteed lock-free, as opposed to just usually lock-free, which you have to test for either per-platform/compiler at build time, or at runtime. For most applications, that’s all good enough. But if requiring a not-100%-portable third-party library for both building and runtime, and writing an autoconf test to boot, is good enough for C++, why does your solution need to be in the stdlib rather than a PyPI library in Python?
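For concreteness, here is roughly what that lock-free test looks like in C++17. This is a minimal sketch: the static_assert is the build-time check an autoconf-style probe would automate, and is_lock_free() is the runtime fallback.

    #include <atomic>
    #include <cstdint>
    #include <cstdio>

    // Build-time check: fail the build if atomic<uint64_t> might be
    // implemented with a hidden lock on this platform/compiler, in which
    // case it cannot safely live in shared memory.
    static_assert(std::atomic<std::uint64_t>::is_always_lock_free,
                  "atomic<uint64_t> is not guaranteed lock-free here");

    int main() {
        // Runtime check: even when is_always_lock_free is false, a
        // particular object may still happen to be lock-free.
        std::atomic<std::uint64_t> counter{0};
        std::printf("lock-free at runtime: %d\n",
                    static_cast<int>(counter.is_lock_free()));
        return 0;
    }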
Meanwhile:

> On Sep 13, 2019, at 06:32, Vinay Sharma via Python-ideas
> <python-ideas@python.org> wrote:
>
> Let’s say I have a parent process which spawns lots of worker processes
> to serve some requests. Now the parent process wants to know the
> statistics about the kind of requests being served by these workers.
> For example, the average time to serve a request by all workers
> combined, number of bad requests, number of stalled requests, number of
> rerouted requests, etc.
>
> Now, the worker processes will make updates to these variables, which
> the parent can report, and accordingly adjust workers. And, instead of
> locking, it would be much more helpful and easier to use atomic values.

It definitely will not make it easier. Using atomics means that your statistics can be out of sync: you could, for example, see an up-to-the-nanosecond bad-requests count but an out-of-date total-requests count, and therefore calculate an incorrect request success rate (for an extreme example, it could be 1/0) and do the wrong thing. You can fix this by coming up with an ordering discipline for updates, and adding an extra value for total updates that lets you manage the potential error bounds, but that is hardly simple. By contrast, just grabbing a lock around every call to `update_stats` and `read_stats` is trivial (see the first sketch at the end of this message).

> The workers can also send data to parent, and then the parent will have
> the overhead of writing and synchronising, but this wouldn’t be very
> fast.

If the parent is the only writer and user of the data, why does the parent have any overhead for synchronizing? It can just use non-atomic writes, which are much faster, and there are fewer writes, too. Of course sending every update over a queue _is_ slow, almost certainly costing a lot more than what you save by not needing locks or atomics, so the overall cost is higher. But not for the reason you seem to think, and if you don’t know where the performance costs actually are, I’m not sure you’re in the right place to start micro-optimizing here.

Meanwhile, a single lock plus dozens of non-atomic writes is going to be a lot faster than dozens of atomic writes (especially if you don’t have a library flexible enough to let you use acquire-release semantics instead of the default sequentially-consistent semantics), as well as being simpler. Not to mention that dozens of atomics allocated near each other may well mean stalling the cache 5 times for each contention or false contention instead of just once; with a lock you don’t even need to consider that, much less figure out how to test for it and fix it.

When the lock isn’t acceptable, it’s because there’s too much contention on the lock, and there would have been too much contention on the atomic writes as well, so atomics aren’t the solution either. There are optimizations you can get from double-buffering the stats pages, using platform-specific calls to atomically sync a page at a time rather than a value at a time, etc., but ultimately you have to reduce that contention.

The traditional answer to this is to have multiple shm segments and have the parent (or a separate stats daemon) aggregate them. The simplest version takes this all the way to a single shm segment per child: now you only need a single atomic counter per dump, rather than every value being atomic (see the second sketch below). This is especially nice for a stats use case, where the child may be updating stats hundreds of times per second while the collector only checks each child every few seconds.
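To make the comparison concrete, here is a minimal sketch of the lock-based version. All the names (Stats, update_stats, read_stats) are illustrative, not from any existing library, and this is a same-process sketch with std::mutex; across processes you would use something like boost::interprocess::interprocess_mutex or a PTHREAD_PROCESS_SHARED pthread mutex instead, but the shape is identical.

    #include <mutex>

    // Illustrative stats record: every field is a plain, non-atomic value.
    struct Stats {
        unsigned long total_requests = 0;
        unsigned long bad_requests = 0;
        double total_service_time = 0.0;
    };

    std::mutex stats_mutex;
    Stats stats;

    // One cheap lock per update; everything inside is an ordinary store.
    void update_stats(bool bad, double service_time) {
        std::lock_guard<std::mutex> guard(stats_mutex);
        ++stats.total_requests;
        if (bad) ++stats.bad_requests;
        stats.total_service_time += service_time;
    }

    // Readers always get an internally consistent snapshot: bad_requests
    // can never run ahead of total_requests, so the 1/0 success-rate bug
    // described above simply cannot happen.
    Stats read_stats() {
        std::lock_guard<std::mutex> guard(stats_mutex);
        return stats;
    }

And here is a rough sketch of the one-segment-per-child scheme with a single atomic counter per dump: the classic seqlock pattern, assuming ChildStats is placed at the start of that child’s shared-memory segment and that atomic<uint64_t> passes the lock-free test from earlier. (Strictly speaking, the plain reads in the collector race with the writer under the C++ memory model; a production version would make the fields relaxed atomics, but that would obscure the shape.)

    #include <atomic>
    #include <cstdint>

    // Header of each child's shared-memory segment. Only `seq` is
    // atomic; the stats fields are ordinary values.
    struct ChildStats {
        std::atomic<std::uint64_t> seq{0};  // odd while a dump is in progress
        std::uint64_t total_requests = 0;
        std::uint64_t bad_requests = 0;
        double total_service_time = 0.0;
    };

    // A plain copy of the stats for the collector's side.
    struct Snapshot {
        std::uint64_t total_requests;
        std::uint64_t bad_requests;
        double total_service_time;
    };

    // Child (the only writer): make seq odd, write the dump, make it even.
    void publish(ChildStats& s, std::uint64_t total, std::uint64_t bad,
                 double time) {
        s.seq.fetch_add(1, std::memory_order_acquire);  // odd: dump started
        s.total_requests = total;
        s.bad_requests = bad;
        s.total_service_time = time;
        s.seq.fetch_add(1, std::memory_order_release);  // even: dump finished
    }

    // Collector: retry until it sees the same even sequence number on
    // both sides of the copy, meaning the copy is one consistent dump.
    bool snapshot(const ChildStats& s, Snapshot& out) {
        for (int tries = 0; tries < 1000; ++tries) {
            std::uint64_t before = s.seq.load(std::memory_order_acquire);
            if (before & 1) continue;                   // writer mid-dump
            out.total_requests = s.total_requests;
            out.bad_requests = s.bad_requests;
            out.total_service_time = s.total_service_time;
            std::atomic_thread_fence(std::memory_order_acquire);
            if (s.seq.load(std::memory_order_relaxed) == before)
                return true;
        }
        return false;  // writer too busy right now; try again later
    }

The child pays two atomic increments per dump no matter how many fields the dump contains, and all the retrying falls on the collector, which only wakes up every few seconds anyway.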
(If you want a shiny modern solution instead, this looks like one of the few cases where hardware transactional-memory (TM) support can probably speed things up relatively easily, which would be a fun project to work on…)