Merged, Maxim.
On 12/14/16 10:58, Savolainen, Petri (Nokia - FI/Espoo) wrote: > > >> -----Original Message----- >> From: Maxim Uvarov [mailto:[email protected]] >> Sent: Tuesday, December 13, 2016 4:06 PM >> To: Savolainen, Petri (Nokia - FI/Espoo) <petri.savolainen@nokia-bell- >> labs.com>; [email protected] >> Subject: Re: [lng-odp] [API-NEXT PATCH] linux-gen: schedule_sp: use ring >> as priority queue >> >> On 12/13/16 10:50, Savolainen, Petri (Nokia - FI/Espoo) wrote: >>> >>>>> +static void add_group(sched_group_t *sched_group, int thr, int group) >>>>> +{ >>>>> + int num; >>>>> + uint32_t gen_cnt; >>>>> + thr_group_t *thr_group = &sched_group->s.thr[thr]; >>>>> + >>>>> + num = thr_group->num_group; >>>>> + thr_group->group[num] = group; >>>>> + thr_group->num_group = num + 1; >>>>> + gen_cnt = odp_atomic_load_u32(&thr_group->gen_cnt); >>>>> + odp_atomic_store_u32(&thr_group->gen_cnt, gen_cnt + 1); >>>> >>>> >>>> odp_atomic_inc_u32() >>> >>> Atomic, in-memory increment would be waste here since this operation is >> done while holding a lock. The variable is atomic since it's checked with >> load_acq in sched_cmd() (when not holding a lock). >>> >>> >>>> >>>> >>>>> +} >>>>> + >>>>> +static void remove_group(sched_group_t *sched_group, int thr, int >>>> group) >>>>> +{ >>>>> + int i, num; >>>>> + int found = 0; >>>>> + thr_group_t *thr_group = &sched_group->s.thr[thr]; >>>>> + >>>>> + num = thr_group->num_group; >>>>> + >>>>> + for (i = 0; i < num; i++) { >>>>> + if (thr_group->group[i] == group) { >>>>> + found = 1; >>>>> + >>>>> + for (; i < num - 1; i++) >>>>> + thr_group->group[i] = thr_group->group[i + 1]; >>>>> + >>>>> + break; >>>>> + } >>>>> + } >>>>> + >>>>> + if (found) { >>>>> + uint32_t gen_cnt; >>>>> + >>>>> + thr_group->num_group = num - 1; >>>>> + gen_cnt = odp_atomic_load_u32(&thr_group->gen_cnt); >>>>> + odp_atomic_store_u32(&thr_group->gen_cnt, gen_cnt + 1); >>>> >>>> odp_atomic_inc_u32() >>>> >>>> and you can remove if found by placing inc and break above. That should >>>> fit in 80 characters. >>> >>> >>> Same thing here. Load + store is more performant than an unnecessary >> atomic, in-memory inc. >>> >>> -Petri >>> >> >> I think it depends. On ABI compat mode that will be function and you >> need 2 stores in stack. >> >> And what is faster depends on implementation of internal >> __atomic_fetch_add call, which may vary on gcc version. I never measured >> what is faster, but it's easy to check... >> >> Maxim. > > > No, it does not depend on GCC. Those are plain load and store instructions, > so there's no atomicity guarantees needed. Also atomics can and should be > inlined even in ABI compat mode (which is CPU arch specific anyway). I think > those are inlined already using GCC __atomics but for strict ABI compat we > should write those in assembly. > > The atomic_inc implements *in-memory* atomic increment, which needs much more > synchronization than a load or a store (which does not provide any sync). So, > load + store == always about 2 CPU cycles. Atomic_inc may be e.g. 30 cycles, > depending on the HW implementation. The main point is that this code section > is synchronized with a lock already and it does not need any extra > (in-memory) atomic synchronization because of that. > > The variable is atomic only because it's checked with load_acq in sched_cmd(). > > -Petri > >
