@Monster To avoid globals, simply use a root structure/object and pass pointers to it around.
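
A minimal C sketch of that root-object idea. The names `AppState`, `handle_request`, `requests_seen` and `root_demo` are made up for illustration and are not from your code:

```c
#include <assert.h>

/* Hypothetical root object: holds everything that would otherwise be global. */
typedef struct {
    int request_count;
    int verbose;
} AppState;

/* Functions receive the state explicitly instead of reaching for globals. */
void handle_request(AppState *app) {
    app->request_count += 1;
}

int requests_seen(const AppState *app) {
    return app->request_count;
}

/* Illustrative driver: the owner creates the root object once and
   hands the pointer down to whoever needs it. */
int root_demo(void) {
    AppState app = { .request_count = 0, .verbose = 0 };
    handle_request(&app);
    handle_request(&app);
    handle_request(&app);
    assert(app.verbose == 0);   /* untouched fields stay as initialized */
    return requests_seen(&app);
}
```

The nice side effect is that each thread (or each test) can get its own `AppState`, so sharing becomes an explicit decision.
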
To your second question: how atomics are implemented depends completely on your architecture. That's why the C standard exists: each program should behave correctly regardless of whether it's compiled for PowerPC, ADSP-XXXXX or MIPS, for instance.

Take a simple CPU such as the 6502 (I recommend starting with it if you want to do some basic research): a single instruction can write 8 bits to memory, so a 16-bit write needs at least two instructions. You are single-core here, but an interrupt can still arrive between those two instructions. And if there is a DMA controller on the same bus, things can get worse, because even a single write can be delayed. That's why some vendors provide special atomic instructions such as TAS (test-and-set) or CAS (compare-and-swap), which behave atomically on the data bus.

Because external memory is clocked much slower than the CPU, we have memory caches today. With multicore and caches things get complicated: each core has its own cache but still needs to read from and write to external memory. That access needs to be synchronized/serialized, but I don't know how it's done on the x86 architecture. Most of this technology is not open source (hidden, undocumented instructions are possible). On modern SoCs (GPUs or baseband chips) there is also a proprietary RTOS running behind the scenes. For at least two physical chips the HDL source is open: the P8X32A from Parallax (crazy, non-mainstream) and the RISC-V core ( [https://www.sifive.com/products/hifive1](https://www.sifive.com/products/hifive1) ). The TI Sitara family is not open, but all datasheets are public (BeagleBone, for instance). They are Linux-ready and Nim should also run on them (not tested yet).

It's always the same: a tradeoff between I/O bandwidth and CPU, and dealing with contention, regardless of whether your system is single-core, multicore, a database, or something distributed.
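
To make the CAS idea a bit more concrete, here is a rough sketch using C11 `<stdatomic.h>`. It assumes a C11 compiler. The helper names `cas_add` and `cas_demo` are invented for this example. What instruction the compiler actually emits depends on the target, as described above:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Atomically add `delta` to *value with a compare-and-swap retry loop.
   Returns the value that was stored. */
uint32_t cas_add(_Atomic uint32_t *value, uint32_t delta) {
    uint32_t expected = atomic_load(value);
    /* If another thread (or an interrupt handler) changed *value in the
       meantime, the exchange fails, `expected` is refreshed with the
       current value, and we try again. */
    while (!atomic_compare_exchange_weak(value, &expected, expected + delta)) {
        /* retry */
    }
    return expected + delta;
}

/* Illustrative single-threaded use. */
uint32_t cas_demo(void) {
    _Atomic uint32_t counter;
    atomic_init(&counter, 40);
    cas_add(&counter, 1);
    return cas_add(&counter, 1);
}
```

The read-modify-write only takes effect if nobody touched the value in between, which is exactly what two plain load/store instructions cannot promise.
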
If you go multicore and your architectural design is a bit lacking, your parallel code may end up much slower (or at least no faster) than the single-threaded version. And if you are on a 32-bit architecture and want 64-bit atomics, look into your C API. If it's there, you are lucky; if not, you have to implement it yourself (with locks).
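
A sketch of both halves of that, again assuming C11 `<stdatomic.h>` is available: the standard macro `ATOMIC_LLONG_LOCK_FREE` tells you whether the target has native 64-bit atomics, and `Locked64` is a hypothetical do-it-yourself fallback built on a test-and-set spinlock. All names except the `<stdatomic.h>` ones are made up:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* ATOMIC_LLONG_LOCK_FREE: 2 = always lock-free, 1 = sometimes, 0 = never. */
int native_64bit_atomics(void) {
    return ATOMIC_LLONG_LOCK_FREE == 2;
}

/* Hypothetical fallback: a 64-bit value guarded by a TAS spinlock. */
typedef struct {
    atomic_flag lock;
    uint64_t    value;
} Locked64;

void locked64_init(Locked64 *c, uint64_t initial) {
    atomic_flag_clear(&c->lock);
    c->value = initial;
}

uint64_t locked64_add(Locked64 *c, uint64_t delta) {
    /* Spin until test-and-set reports the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(&c->lock, memory_order_acquire)) {
        /* busy wait */
    }
    c->value += delta;              /* may be two 32-bit writes; lock protects it */
    uint64_t result = c->value;
    atomic_flag_clear_explicit(&c->lock, memory_order_release);
    return result;
}

/* Illustrative use: an add that carries across the 32-bit boundary. */
uint64_t locked64_demo(void) {
    Locked64 c;
    locked64_init(&c, 0xFFFFFFFFu);
    return locked64_add(&c, 1);
}
```

A spinlock like this is only meant as an illustration. On a single core with interrupts it can deadlock if an interrupt handler spins on a lock held by the code it interrupted, so there you would rather disable interrupts around the update.
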
