On Thu, 2016-11-17 at 12:12 -0800, Bin Fan wrote:
> On 11/14/2016 4:34 PM, Bin Fan wrote:
> > Hi All,
> >
> > I have an updated version of libatomic ABI specification draft. Please
> > take a look to see if it matches GCC implementation. The purpose of
> > this document is to establish an official GCC libatomic ABI, and allow
> > compatible compiler and runtime implementations on the affected
> > platforms.

Thanks for the update, and sorry for the late reply. Comments below.

> > - Rewrite section 3 to replace "lock-free" operations with "hardware
> > backed" instructions. The digest of this section is: 1) inlineable
> > atomics must be implemented with the hardware backed atomic
> > instructions. 2) for non-inlineable atomics, the compiler must
> > generate a runtime call, and the runtime support function is free to
> > use any implementation.

OK. I still think that using hardware-backed instructions for a
particular type requires that there is a true atomic load instruction
for that type. Emulating a load with an idempotent store (e.g.,
cmpxchg16b) is not useful, overall.

One could argue that an idempotent atomic HW store such as a cmpxchg16b
in a loop is indeed lock-free. However, IMO the intention behind
"lock-free" atomics in C and C++ is to offer atomics that are both
lock-free *and* as fast as one would assume for a fully HW-backed
solution for atomic accesses. This includes loads being cheaper than
stores, in particular under contention / concurrent accesses by several
threads.

I believe that "fast" is much more often the motivation for using
lock-free atomics than the actual "lock-free" property, that is, the
progress-guarantee aspect (which isn't even lock-free but
obstruction-free, see below). If we do see a sufficiently strong need
for lock-free atomics, we should build something just for that (e.g.,
if we remove the address-free requirement, we can support lock-free (in
the progress-guarantee sense) operations for a lot more types).

Also, while that previous issue is "just" a performance issue, the fact
that we could issue a store when calling atomic_load() is a correctness
issue, I think. One example is volatile atomic loads; while C/C++ don't
really constrain what a volatile load needs to be in the underlying
implementation, I think most users would assume that a load really
means a hardware load instruction of some sort, and nothing else.
cmpxchg16b conflicts with such an assumption. Another example is
read-only mapped memory.

Bottom line: we shouldn't rely solely on cmpxchg16b and similar.
(Though this doesn't necessarily mean that there can't be compiler
flags that enable its use.)

I think the ABI should set a baseline for each architecture, and the
baseline decides whether something is inlinable or not. Thus, the
x86_64 ABI would make __int128 operations not inlinable (because of the
issues with cmpxchg16b, see above). If users want to use capabilities
beyond the baseline, they can choose to use flags that alter/extend the
ABI. For example, if they use a flag that explicitly enables the use of
cmpxchg16b for atomics, they also need to use a libatomic
implementation built in the same way (if possible). This then creates a
new ABI(-variant), basically.

I ran a few tests on my x86_64 machine a few weeks ago, and I didn't
see cmpxchg16b being used. IIRC, I also looked at libatomic and didn't
see it (but I don't remember for sure). Either way, if I was wrong and
we are using cmpxchg16b for loads, this should be fixed. Ideally, this
should be fixed before the stage 3 deadline this Friday. Such a fix
might potentially break existing uses, but the earlier we fix this, the
better.

Section 3 Rationale, alternative 1: I'm wondering if the example is
correct. For a 4-byte-aligned type of size 3, the implementation cannot
simply use 4-byte hardware-backed atomics because this will inevitably
touch the 4th byte, I think, and the implementation can't know whether
this is padding or not. Or do we expect that things like packed structs
are disallowed?

N3.1: Why do you assume that 8-byte HW atomics are available on i386?
Is it because cmpxchg8b is available on the lowest i?86 CPUs we still
intend to support?

I'd also use "hardware-backed" instead of "hardware backed".

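To make the Section 3 Rationale concern concrete, here is a minimal
sketch in C of a layout where the 4th byte is not padding. The type
names (three, holder) and the helper fourth_byte_covered() are
hypothetical and only for illustration; the sketch assumes GCC-style
packed/aligned attributes. It does not perform an atomic access, it
only models which bytes a 4-byte-wide access at the object's address
would cover.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical 3-byte type, standing in for the "size 3" example. */
struct three { unsigned char b[3]; };

/* Hypothetical packed container (GCC/Clang attribute syntax assumed):
   t sits at a 4-byte-aligned address, and the byte directly after it
   is a live field rather than padding. */
struct __attribute__((packed, aligned(4))) holder {
    struct three t;
    unsigned char neighbor;
};

static_assert(sizeof(struct three) == 3, "three occupies exactly 3 bytes");
static_assert(sizeof(struct holder) == 4, "holder occupies exactly 4 bytes");
static_assert(offsetof(struct holder, neighbor) == 3, "neighbor is byte 3");

/* Models the footprint of a 4-byte-wide access starting at the address
   of h->t (which is also the address of *h) by copying 4 bytes, and
   returns the 4th byte that such an access would cover. */
unsigned char fourth_byte_covered(const struct holder *h)
{
    unsigned char window[4];
    memcpy(window, h, sizeof window);
    return window[3];
}
```

Calling fourth_byte_covered() on a holder whose neighbor field is 0xAB
returns 0xAB, i.e., a 4-byte hardware-backed operation on t would also
read (and, for a store or CAS, write) neighbor.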
> > - The Rationale section in section 3 is also revised to remove the
> > mentioning of "lock-free", but there is not major change of concept.
> >
> > - Add note N3.1 to emphasize the assumption of general hardware
> > supported atomic instruction
> >
> > - Add note N3.2 to discuss the issues of cmpxchg16b

See above.

> > - Add a paragraph in section 4.1 to specify memory_order_consume must
> > be implemented through memory_order_acquire. Section 4.2 emphasizes it
> > again.
> >
> > - The specification of each runtime functions mostly maps to the
> > corresponding generic functions in the C11 standard. Two functions are
> > worth noting:
> > 1) C11 atomic_compare_exchange compares and updates the "value" while
> > __atomic_compare_exchange functions in this ABI compare and update the
> > "memory", which implies the memcmp and memcpy semantics.

In Section 4, parts about atomic_compare_exchange: should there be a
back-reference to the memcmp point made earlier in the document?

> > 2) The specification of __atomic_is_lock_free allows both a per-object
> > result and a per-type result. A per-type implementation could pass
> > NULL, or a faked address as the address of the object. A per-object
> > implementation could pass the actual address of the object.

The __atomic_is_lock_free description should specify that "lock-free"
refers to the definition of "lock-free" in C++14, which includes
"address-free". I'm referring to C++14 specifically because this
contains an update which is relevant for (1) LL/SC-based architectures
(i.e., that "lock-free" is actually what is called obstruction-free in
the literature) and (2) for any libatomic implementation that wants to
use HW atomics for things like the example in Section 3's Rationale,
alternative 1 (see above).

This ABI also needs to specify how hardware-backed atomics are
implemented on a particular architecture.
For example, on architectures where there is more than one choice for
how to implement certain memory orders (e.g., ARM), the ABI should pick
one specific mapping. I guess this should be a note in Section 4, maybe
as a separate subsection and/or an additional note around the
memory_order enum description; I'd keep the note about implementing
something equivalent to C11/C++11 semantics. What we would document is
something like the possible mappings discussed here:
http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html

There are typos in Section 2.4.
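
Coming back to the memcmp/memcpy point about __atomic_compare_exchange:
a minimal sketch in C of why a back-reference may be worth having. It
assumes the GCC/Clang generic __atomic_compare_exchange builtin and a
target where an 8-byte compare-and-swap is inlined (e.g., x86_64). The
struct pair and the helpers make_pair() and cas_sees_padding() are
hypothetical names for illustration only. Two objects with equal
members but different padding bytes compare as different "memory".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical 8-byte type with padding between c and i
   (GCC/Clang attribute syntax assumed for the alignment). */
struct __attribute__((aligned(8))) pair {
    char c;
    int i;
};

static_assert(sizeof(struct pair) == 8, "sketch assumes an 8-byte object");
static_assert(offsetof(struct pair, i) > sizeof(char), "sketch assumes padding");

/* Builds the object representation byte by byte so that the padding
   bytes hold a known value (pad) instead of unspecified contents. */
static void make_pair(struct pair *p, char c, int i, unsigned char pad)
{
    unsigned char bytes[sizeof *p];
    memset(bytes, pad, sizeof bytes);
    memcpy(bytes + offsetof(struct pair, c), &c, sizeof c);
    memcpy(bytes + offsetof(struct pair, i), &i, sizeof i);
    memcpy(p, bytes, sizeof bytes);
}

/* Returns 1 if the first CAS fails although the members are equal
   (only the padding differs) and the retry succeeds after "expected"
   has been overwritten with the object's bytes (memcpy semantics). */
int cas_sees_padding(void)
{
    struct pair obj, expected, desired;
    make_pair(&obj, 'a', 1, 0x00);
    make_pair(&expected, 'a', 1, 0xFF); /* same members, other padding */
    make_pair(&desired, 'b', 2, 0x00);

    int first = __atomic_compare_exchange(&obj, &expected, &desired, 0,
                                          __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    int second = __atomic_compare_exchange(&obj, &expected, &desired, 0,
                                           __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return !first && second && obj.c == 'b' && obj.i == 2;
}
```

With value semantics one might expect the first call to succeed; under
the memory semantics described in the draft it fails, updates expected,
and only the retry succeeds.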