On 2016-04-20 22:24, Peter Zijlstra wrote:
> On Wed, Apr 20, 2016 at 09:24:00PM +0800, Pan Xinhui wrote:
> 
>> +#define __XCHG_GEN(cmp, type, sfx, skip, v)                         \
>> +static __always_inline unsigned long                                \
>> +__cmpxchg_u32##sfx(v unsigned int *p, unsigned long old,            \
>> +                     unsigned long new);                            \
>> +static __always_inline u32                                          \
>> +__##cmp##xchg_##type##sfx(v void *ptr, u32 old, u32 new)            \
>> +{                                                                   \
>> +    int size = sizeof (type);                                       \
>> +    int off = (unsigned long)ptr % sizeof(u32);                     \
>> +    volatile u32 *p = ptr - off;                                    \
>> +    int bitoff = BITOFF_CAL(size, off);                             \
>> +    u32 bitmask = ((0x1 << size * BITS_PER_BYTE) - 1) << bitoff;    \
>> +    u32 oldv, newv, tmp;                                            \
>> +    u32 ret;                                                        \
>> +    oldv = READ_ONCE(*p);                                           \
>> +    do {                                                            \
>> +            ret = (oldv & bitmask) >> bitoff;                       \
>> +            if (skip && ret != old)                                 \
>> +                    break;                                          \
>> +            newv = (oldv & ~bitmask) | (new << bitoff);             \
>> +            tmp = oldv;                                             \
>> +            oldv = __cmpxchg_u32##sfx((v u32*)p, oldv, newv);       \
>> +    } while (tmp != oldv);                                          \
>> +    return ret;                                                     \
>> +}
> 
> So for an LL/SC based arch using cmpxchg() like that is sub-optimal.
> 
> Why did you choose to write it entirely in C?
> 
Yes, you are right: the C version does more loads and stores.
However, xchg_u8/u16 is currently used only by qspinlock, and I did not see
any performance regression there.
So I just wrote it in C, for simplicity. :)

Of course, I have run xchg tests.
We ran code like xchg((u8*)&v, j++); in several threads,
and the results are:
[  768.374264] use time[1550072]ns in xchg_u8_asm
[  768.377102] use time[2826802]ns in xchg_u8_c
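
For readers who want to try a similar measurement outside the kernel, here is
a rough userspace sketch of this kind of test. It is not the test module that
produced the numbers above: it assumes POSIX threads and C11 atomics, uses
atomic_exchange() as a stand-in for the kernel's xchg(), and the names
time_xchg_u8(), worker() and shared_byte are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

#define MAX_THREADS 16

/* Shared byte that every thread exchanges into (stand-in for "v"). */
static _Atomic uint8_t shared_byte;

struct worker_arg {
    long iters;
};

/* Each thread repeatedly exchanges an incrementing byte value. */
static void *worker(void *p)
{
    const struct worker_arg *arg = p;
    uint8_t j = 0;

    for (long i = 0; i < arg->iters; i++)
        atomic_exchange(&shared_byte, j++);
    return NULL;
}

/* Returns elapsed wall-clock nanoseconds, or -1 on failure. */
static int64_t time_xchg_u8(int nthreads, long iters)
{
    pthread_t tid[MAX_THREADS];
    struct worker_arg arg = { .iters = iters };
    struct timespec t0, t1;
    int started = 0;

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (; started < nthreads; started++)
        if (pthread_create(&tid[started], NULL, worker, &arg) != 0)
            break;
    /* Join only the threads that were actually created. */
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (started != nthreads)
        return -1;
    return (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL
           + (t1.tv_nsec - t0.tv_nsec);
}
```

Calling time_xchg_u8(4, 1000000) would time four threads doing one million
exchanges each. The measured time includes thread creation and join, so only
comparisons between runs with the same parameters are meaningful.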

I think this is because there is one more load in the C version.
If possible, we could move this code into asm-generic/.
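
To make the extra load concrete, here is a minimal userspace sketch of a
byte-wide exchange built on a 32-bit compare-and-swap, in the same spirit as
the quoted macro. It is an illustration, not the kernel code: it uses C11
atomics instead of __cmpxchg_u32(), it takes the aligned containing word plus
a byte offset instead of casting a byte pointer, and byte_shift(), get_byte()
and xchg_u8_emulated() are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Bit position of byte `off` inside its aligned 32-bit word, probed at
 * run time so the sketch works on little- and big-endian hosts. */
static unsigned int byte_shift(unsigned int off)
{
    const uint32_t probe = 1;
    int little_endian = (*(const unsigned char *)&probe == 1);

    return (little_endian ? off : 3u - off) * 8u;
}

/* Extract byte `off` from a 32-bit word value. */
static uint8_t get_byte(uint32_t word, unsigned int off)
{
    return (uint8_t)(word >> byte_shift(off));
}

/* Exchange one byte of *word using only a 32-bit compare-and-swap. */
static uint8_t xchg_u8_emulated(_Atomic uint32_t *word, unsigned int off,
                                uint8_t newval)
{
    unsigned int shift;
    uint32_t mask, oldv, newv;

    assert(off < sizeof(uint32_t));
    shift = byte_shift(off);
    mask = (uint32_t)0xff << shift;

    /* Initial plain load; the compare-and-swap below reads the word
     * again, which is the extra load compared with a single ll/sc loop. */
    oldv = atomic_load_explicit(word, memory_order_relaxed);
    do {
        newv = (oldv & ~mask) | ((uint32_t)newval << shift);
        /* On failure, oldv is refreshed with the current word value. */
    } while (!atomic_compare_exchange_weak(word, &oldv, newv));

    return get_byte(oldv, off);
}
```

With a word initialised to 0, xchg_u8_emulated(&w, 1, 0xab) returns the old
byte (0) and leaves the other three bytes of w untouched.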

thanks
xinhui
