First, I’m pretty sure that, contrary to your claims, C++ does not support 
this. C++ doesn’t even support shared memory out of the box. The third-party 
Boost library does provide it, as long as you only care about systems that 
correctly support POSIX shared memory, plus Windows, and as long as you don’t 
mind that your shared memory might be simulated with a memory-mapped file, or 
that it might have different persistence than you asked for. And then, whether 
you can allocate atomics inside shared memory and have them actually be atomic 
depends on whether atomic<T> is guaranteed lock-free for every one of your 
types, as opposed to just usually lock-free, which you have to test per 
platform/compiler at build time, or at runtime. For most applications, that’s 
all good enough. But if requiring a not-100%-portable third-party library at 
both build time and runtime, and writing an autoconf test to boot, is good 
enough for C++, why does your solution need to be in the stdlib rather than a 
PyPI library in Python?

Meanwhile:

> On Sep 13, 2019, at 06:32, Vinay Sharma via Python-ideas 
> <python-ideas@python.org> wrote:
> 
> Let’s say I have a parent process which spawns lots of worker processes to 
> serve some requests. Now the parent process wants to know the statistics 
> about the kind of requests being served by these workers. For, example the 
> average time to serve a request by all workers combined, number of bad 
> requests, number of stalled requests, number of rerouted requests, etc.
> 
> Now, the worker processes will make updates to these variables, which the 
> parent can report, and accordingly adjust workers. And, instead of locking, 
> it would be much more helpful and easier to use atomic values. 

It definitely will not make it easier.

Using atomics means that your statistics can be out of sync with each other. 
You could, for example, see an up-to-the-nanosecond bad-requests count but an 
out-of-date total-requests count, and therefore calculate an incorrect request 
success rate (in an extreme case, 1/0) and do the wrong thing. You can fix this 
by coming up with an ordering discipline for updates, and by adding an extra 
total-updates counter that lets you bound the potential error, but that is 
hardly simple.
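
Here’s a minimal sketch of that skew, using multiprocessing.Value as a 
stand-in for per-value atomics (each counter is individually consistent, but 
the pair isn’t); the update and read orders are chosen to make the impossible 
state easy to observe:

```
# Sketch: two individually consistent counters can still give a torn pair.
# multiprocessing.Value stands in for the proposed atomics here.
import multiprocessing as mp

def worker(total, bad):
    for _ in range(100_000):
        with total.get_lock():
            total.value += 1      # total is always bumped first...
        with bad.get_lock():
            bad.value += 1        # ...so at every real instant bad <= total

def main():
    total = mp.Value('q', 0)
    bad = mp.Value('q', 0)
    p = mp.Process(target=worker, args=(total, bad))
    p.start()
    while p.is_alive():
        t = total.value           # stale by the time we read bad
        b = bad.value             # up-to-the-nanosecond
        if b > t:                 # a state that never actually existed
            print(f"torn snapshot: bad={b} > total={t}")
            break
    p.join()

if __name__ == '__main__':
    main()   # may take a few runs to catch, depending on scheduling
```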

By contrast, just grabbing a lock around every call to `update_stats` and 
`read_stats` is trivial.
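
Something like this, with nothing but stdlib pieces; `update_stats` and 
`read_stats` are the hypothetical helpers named above, and the shared objects 
are assumed to be inherited by (or explicitly passed to) the workers:

```
# Sketch: one lock around the whole update and the whole read gives every
# reader a consistent snapshot, at the cost of one acquire/release per call.
import multiprocessing as mp

FIELDS = ('total', 'bad', 'stalled', 'rerouted')
INDEX = {name: i for i, name in enumerate(FIELDS)}

lock = mp.Lock()
stats = mp.Array('q', len(FIELDS), lock=False)   # one shared int64 per field

def update_stats(**deltas):
    with lock:                                   # covers all fields at once
        for name, delta in deltas.items():
            stats[INDEX[name]] += delta

def read_stats():
    with lock:                                   # so this is a true snapshot
        return dict(zip(FIELDS, stats))
```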

> The workers can also send data to parent, and then the parent will have the 
> overhead of writing and synchronising, but this wouldn’t be very fast.

If the parent is the only writer or user of the data, why does the parent have 
any overhead for synchronizing? It can just use non-atomic writes, which are 
much faster, and there are fewer writes too. Of course sending every update 
over a queue _is_ slow, almost certainly costing more than whatever you save by 
not needing locks or atomics, so the overall cost is higher. But not for the 
reason you seem to think, and if you don’t know where the performance costs 
actually are, you’re not in a good position to start micro-optimizing here.
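
For comparison, a sketch of roughly the design you’re describing: workers push 
raw events over a queue, and the parent, as the only process that ever touches 
the aggregate, needs no locks or atomics at all; the cost is all in the queue. 
The event names and counts are made up for illustration:

```
# Sketch: the parent owns the aggregate outright, so a plain Counter with
# ordinary non-atomic updates is enough; the expense is the per-event queue
# traffic, not any synchronization in the parent.
import multiprocessing as mp
from collections import Counter

def worker(q):
    for _ in range(1000):
        q.put('total')           # one message per event: this is the slow part
        q.put('bad')
    q.put(None)                  # tell the parent this worker is finished

def main():
    q = mp.Queue()
    workers = [mp.Process(target=worker, args=(q,)) for _ in range(4)]
    for w in workers:
        w.start()
    stats = Counter()            # ordinary dict-like object, parent-only
    finished = 0
    while finished < len(workers):
        event = q.get()
        if event is None:
            finished += 1
        else:
            stats[event] += 1
    for w in workers:
        w.join()
    print(stats)

if __name__ == '__main__':
    main()
```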

Meanwhile, a single lock plus dozens of non-atomic writes is going to be a lot 
faster than dozens of atomic writes (especially if you don’t have a library 
flexible enough to let you use acquire-release semantics instead of fully 
sequentially-consistent semantics), as well as being simpler. Not to mention 
that dozens of atomics allocated near each other will likely span several cache 
lines, so each contention or false contention can stall the cache five times 
instead of just once; with a lock you don’t even need to consider that, much 
less figure out how to test for it and fix it.

When the lock isn’t acceptable, it’s because there’s too much contention on the 
lock, and there would have been too much contention on the atomic writes as 
well, so atomics aren’t the solution either. There are optimizations you can 
get from double-buffering the stats pages, using platform-specific calls to 
atomically sync a page at a time rather than a value at a time, etc., but 
ultimately you have to reduce that contention. The traditional answer is to 
have multiple shm segments and have the parent (or a separate stats daemon) 
aggregate them. The simplest version takes this all the way to one shm segment 
per child. Then you only need a single atomic counter per dump, rather than 
making each value atomic. This is especially nice for a stats use case, where a 
child may be updating stats hundreds of times per second but the collector is 
only checking each child every few seconds. 
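
A rough sketch of the shape of that protocol, one segment per child with a 
seqlock-style counter around each dump; pure Python doesn’t actually give you 
atomic or ordered stores for the counter (which is rather the point of this 
thread), so treat it as illustration only. The field list and segment name are 
made up:

```
# Sketch: one shared-memory segment per child. The child bumps a single
# sequence counter to odd before rewriting its stats and to even afterwards;
# the collector keeps a snapshot only if the counter was even and unchanged
# across the read.
import struct
from multiprocessing import shared_memory

FMT = 'qqqqq'                     # seq, total, bad, stalled, rerouted
SIZE = struct.calcsize(FMT)

def child_dump(shm, seq, total, bad, stalled, rerouted):
    struct.pack_into('q', shm.buf, 0, seq + 1)       # odd: dump in progress
    struct.pack_into('qqqq', shm.buf, 8, total, bad, stalled, rerouted)
    struct.pack_into('q', shm.buf, 0, seq + 2)       # even: dump complete
    return seq + 2

def collector_read(shm):
    while True:
        before = struct.unpack_from('q', shm.buf, 0)[0]
        values = struct.unpack_from('qqqq', shm.buf, 8)
        after = struct.unpack_from('q', shm.buf, 0)[0]
        if before == after and before % 2 == 0:      # consistent snapshot
            return dict(zip(('total', 'bad', 'stalled', 'rerouted'), values))

# The parent would create one such segment per child and pass the name down:
# shm = shared_memory.SharedMemory(create=True, size=SIZE, name='stats-child-0')
```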

(If you want a shiny modern solution instead, this looks like one of the few 
cases where hardware transactional memory support can probably speed things up 
relatively easily, which would be a fun project to work on…)