On Thu, 2016-11-17 at 12:12 -0800, Bin Fan wrote:
> On 11/14/2016 4:34 PM, Bin Fan wrote:
> > Hi All,
> >
> > I have an updated version of libatomic ABI specification draft. Please 
> > take a look to see if it matches GCC implementation. The purpose of 
> > this document is to establish an official GCC libatomic ABI, and allow 
> > compatible compiler and runtime implementations on the affected 
> > platforms.

Thanks for the update, and sorry for the late reply.  Comments below.

> > - Rewrite section 3 to replace "lock-free" operations with "hardware 
> > backed" instructions. The digest of this section is: 1) inlineable 
> > atomics must be implemented with the hardware backed atomic 
> > instructions. 2) for non-inlineable atomics, the compiler must 
> > generate a runtime call, and the runtime support function is free to 
> > use any implementation.

OK.

I still think that using hardware-backed instructions for a particular
type requires that there is a true atomic load instruction for that
type.  Emulating a load with an idempotent store (eg, cmpxchg16b) is not
useful, overall.

One could argue that an idempotent atomic HW store such as a cmpxchg16b
in a loop is indeed lock-free.  However, IMO the intention behind
"lock-free" atomics in C and C++ is to offer atomics that are both
lock-free *and* as fast as one would assume for a fully HW-backed
solution for atomic accesses.  This includes that loads must be cheaper
than stores, in particular under contention / concurrent accesses by
several threads.
I believe that "fast" is much more often part of the motivation for
using lock-free atomics than the actual "lock-free", ie, the
progress-guarantee aspect (which isn't even lock-free but
obstruction-free, see below).  If we do see a sufficiently strong need
for lock-free atomics, we should build something just for that (eg,
if we remove the address-free requirement, we can support lock-free (in
the progress-guarantee sense) operations for a lot more types).

Also, while that previous issue is "just" a performance issue, the fact
that we could issue a store when calling atomic_load() is a
correctness issue, I think.
One example is volatile atomic loads; while C/C++ don't really
constrain what a volatile load needs to be in the underlying
implementation, I think most users would assume that a load really means
a hardware load instruction of some sort, and nothing else.  cmpxchg16b
conflicts with such an assumption.
Another example is read-only mapped memory.

Bottom line: we shouldn't rely solely on cmpxchg16b and similar.
(Though this doesn't necessarily mean that there can't be compiler flags
that enable its use.)
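
To make the concern concrete, here is a minimal sketch in C11 of a load
emulated with a compare-and-swap.  It uses a 64-bit object and the
standard atomic_compare_exchange_strong so that it builds without
special flags; the function names load_via_cas and load_plain are
hypothetical, and this is not claimed to be what GCC or libatomic
emits.  The cmpxchg16b case has the same shape with a 16-byte operand.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper: a "load" implemented as compare-and-swap.
   From the hardware's point of view this is a read-modify-write,
   so it needs write access to the location. */
static uint64_t load_via_cas(_Atomic uint64_t *p)
{
    uint64_t expected = 0;
    /* If *p == 0, this stores 0 back (an idempotent store).
       Otherwise it fails and copies the current value of *p into
       expected.  Either way, expected ends up holding the value. */
    atomic_compare_exchange_strong(p, &expected, 0);
    return expected;
}

/* A true atomic load, for comparison. */
static uint64_t load_plain(_Atomic uint64_t *p)
{
    return atomic_load(p);
}
```

Both functions return the same value, but only the second one is safe
on a read-only mapping and only the second one matches what a user
would expect from a volatile load.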


I think the ABI should set a baseline for each architecture, and the
baseline decides whether something is inlinable or not.  Thus, the
x86_64 ABI would make __int128 operations not inlinable (because of the
issues with cmpxchg16b, see above).

If users want to use capabilities beyond the baseline, they can choose
to use flags that alter/extend the ABI.  For example, if they use a flag
that explicitly enables the use of cmpxchg16b for atomics, they also
need to use a libatomic implementation built in the same way (if
possible).  This then creates a new ABI(-variant), basically.
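
As a rough sketch of how code could tell which variant it was built
under: GCC predefines __GCC_HAVE_SYNC_COMPARE_AND_SWAP_16 when 16-byte
compare-and-swap is enabled for the target (eg, with -mcx16 on x86_64).
The function name below is hypothetical, and whether this macro is the
right switch for an ABI variant is an assumption, not something the
draft specifies.

```c
#include <stdbool.h>

/* Hypothetical helper: reports whether this translation unit was
   compiled with 16-byte compare-and-swap enabled, ie, beyond the
   baseline discussed above. */
static bool built_with_16byte_cas(void)
{
#ifdef __GCC_HAVE_SYNC_COMPARE_AND_SWAP_16
    return true;
#else
    return false;
#endif
}
```

A libatomic build and the code calling into it would have to agree on
this answer to be part of the same ABI variant.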


I ran a few tests on my x86_64 machine a few weeks ago, and I didn't
see cmpxchg16b being used.  IIRC, I also looked at libatomic and didn't
see it (but I don't remember for sure).  Either way, if I was wrong
and we are using cmpxchg16b for loads, this should be fixed.
Ideally, this should be fixed before the stage 3 deadline this Friday.
Such a fix might potentially break existing uses, but the earlier we fix
this, the better.


Section 3 Rationale, alternative 1: I'm wondering if the example is
correct.  For a 4-byte-aligned type of size 3, the implementation cannot
simply use 4-byte hardware-backed atomics because this will inevitably
touch the 4th byte I think, and the implementation can't know whether
this is padding or not.  Or do we expect that things like packed structs
are disallowed?
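
A small sketch of the layout I have in mind, in C11.  The type and
member names are hypothetical.  The 3-byte object is 4-byte aligned,
yet the byte right after it belongs to a different object, so a 4-byte
atomic access at its address would also read and write that neighbour.

```c
#include <stddef.h>

/* A type of size 3 with alignment 1. */
struct rgb { unsigned char r, g, b; };

/* The member "color" is forced to 4-byte alignment, but it still
   occupies only 3 bytes, so "neighbour" lands at offset 3. */
struct container {
    _Alignas(4) struct rgb color;
    unsigned char neighbour;
};

/* Returns 1 if a 4-byte access starting at &color (offsets 0..3)
   would overlap the neighbouring member, 0 otherwise. */
static int four_byte_access_touches_neighbour(void)
{
    return offsetof(struct container, neighbour) < 4;
}
```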

N3.1:  Why do you assume that 8-byte HW atomics are available on i386?
Because cmpxchg8b is available for CPUs that are the lowest i?86 we
still intend to support?

I'd also use "hardware-backed" instead of "hardware backed".

> > - The Rationale section in section 3 is also revised to remove the 
> > mentioning of "lock-free", but there is not major change of concept.
> >
> > - Add note N3.1 to emphasize the assumption of general hardware 
> > supported atomic instruction
> >
> > - Add note N3.2 to discuss the issues of cmpxchg16b

See above.

> > - Add a paragraph in section 4.1 to specify memory_order_consume must 
> > be implemented through memory_order_acquire. Section 4.2 emphasizes it 
> > again.
> >
> > - The specification of each runtime functions mostly maps to the 
> > corresponding generic functions in the C11 standard. Two functions are 
> > worth noting:
> > 1) C11 atomic_compare_exchange compares and updates the "value" while 
> > __atomic_compare_exchange functions in this ABI compare and update the 
> > "memory", which implies the memcmp and memcpy semantics.

In Section 4, parts about atomic_compare_exchange: should there be a
back-reference to the memcmp point made earlier in the document?
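
For readers who want the memcmp point spelled out, here is a
single-threaded model of the "compare and update the memory"
semantics.  It is NOT atomic and is not libatomic's implementation; the
names cas_by_memory, make_repr and struct padded are hypothetical.  It
only shows that two objects whose members are equal can still fail the
comparison when their padding bytes differ.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* On common ABIs there is padding between tag and value. */
struct padded { char tag; int value; };

/* Model of the memory-based semantics (single-threaded, not atomic):
   compare object representations, then copy one way or the other. */
static bool cas_by_memory(void *obj, void *expected,
                          const void *desired, size_t n)
{
    if (memcmp(obj, expected, n) == 0) {
        memcpy(obj, desired, n);
        return true;
    }
    memcpy(expected, obj, n);   /* report what was actually there */
    return false;
}

/* Builds an object representation of struct padded in buf, with all
   padding bytes set to fill. */
static void make_repr(unsigned char *buf, unsigned char fill,
                      char tag, int value)
{
    memset(buf, fill, sizeof(struct padded));
    memcpy(buf + offsetof(struct padded, tag), &tag, sizeof tag);
    memcpy(buf + offsetof(struct padded, value), &value, sizeof value);
}
```

With an object and an expected value that hold the same members but
different padding, the first call fails and overwrites expected; a
retry with the updated expected then succeeds.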

> > 2) The specification of __atomic_is_lock_free allows both a per-object 
> > result and a per-type result. A per-type implementation could pass 
> > NULL, or a faked address as the address of the object. A per-object 
> > implementation could pass the actual address of the object.

The __atomic_is_lock_free description should specify that "lock-free"
refers to the definition of "lock-free" in C++14, which includes
"address-free".  I'm referring to C++14 specifically because this
contains an update which is relevant for (1) LL/SC-based architectures
(ie, that "lock-free" is actually what is called obstruction-free in the
literature) and (2) for any libatomic implementation that wants to use
HW atomics for things like the example in Section 3's Rationale,
alternative 1 (see above).
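
As an illustration of the two calling styles quoted above, here is a
sketch using the GCC/Clang builtin __atomic_is_lock_free with a
constant size.  The wrapper names are hypothetical.  For a naturally
aligned int I would expect both queries to be answered at compile time
and to agree; for other sizes the call may end up in libatomic.

```c
#include <stdbool.h>

/* Per-type query: no object address is passed. */
static bool int_lock_free_per_type(void)
{
    return __atomic_is_lock_free(sizeof(int), 0);
}

/* Per-object query: the actual address of the object is passed. */
static bool int_lock_free_per_object(int *obj)
{
    return __atomic_is_lock_free(sizeof(int), obj);
}
```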


This ABI also needs to specify how hardware-backed atomics are
implemented on a particular architecture.  For example, on architectures
where there is more than one choice for how to implement certain memory
orders (eg, ARM), the ABI should pick a certain mapping.  I guess this
should be a note in Section 4, maybe as a separate subsection and/or an
additional note around the memory_order enum description; I'd keep the
note about implementing something equivalent to C11/C++11 semantics.
What we would document is something like the possible mappings discussed
here: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
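
To illustrate why the ABI has to pick one mapping, here is a sketch
with a seq_cst store and load in C11.  The instruction sequences in the
comments are how I recall the x86 entries on that page and should be
checked against it; the function names are hypothetical.

```c
#include <stdatomic.h>

static _Atomic int flag;

/* Mapping A: MOV; MFENCE (or XCHG).   Mapping B: plain MOV. */
static void store_sc(int v)
{
    atomic_store_explicit(&flag, v, memory_order_seq_cst);
}

/* Mapping A: plain MOV.   Mapping B: MFENCE; MOV (or LOCK XADD 0). */
static int load_sc(void)
{
    return atomic_load_explicit(&flag, memory_order_seq_cst);
}
```

Each mapping is fine on its own.  But if one object file used mapping
B for the store and another used mapping A for the load, there would
be no fence between a seq_cst store and a later seq_cst load, which
breaks sequential consistency.  That is why compilers and runtimes
sharing an ABI need to agree on the mapping.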


There are typos in Section 2.4.
