If you do excessive refcounting (ARC) and care about performance you actually need to let the implementor of C decide where the RC_counter is embedded...
C providing facility to be refcountable do not equate with C's user want refcounting. It means that if C's user want it to be refcounted, it is gonna be faster.
