On 6 April 2017 at 20:51, Maxim Uvarov <[email protected]> wrote:

> On 04/06/17 13:35, Ola Liljedahl wrote:
> > On 5 April 2017 at 23:39, Maxim Uvarov <[email protected]> wrote:
> >> On 04/05/17 17:30, Ola Liljedahl wrote:
> >>> On 5 April 2017 at 14:50, Maxim Uvarov <[email protected]>
> wrote:
> >>>> On 04/05/17 06:57, Honnappa Nagarahalli wrote:
> >>>>> This can go into master/api-next as an independent patch. Agree?
> >>>>
> >>>> agree. If we accept implementation where events can be 'delayed'
> >>> Probably all platforms with HW queues.
> >>>
> >>>> than it
> >>>> looks like we missed some api to sync queues.
> >>> When would those API's be used?
> >>>
> >>
> >> might be in case like that. Might be it's not needed in real world
> >> application.
> > This was a test program. I don't see the same situation occurring in a
> > real world application. I could be wrong.
> >
> >>
> >> My point that if situation of postpone event is accepted that we need
> >> document that in api doxygen comment.
> > I think the asynchronous behaviour is the default. ODP is a hardware
> > abstraction. HW is often asynchronous, writes are posted etc. Ensuring
> > synchronous behaviour costs performance.
> >
> > Single-threaded software is "synchronous", writes are immediately
> > visible to the thread. But as soon as you go multi-threaded and don't
> > use locks to access shared resources, software also becomes
> > "asynchronous" (don't know if it is the right word here). Only if you
> > use locks to synchronise accesses to shared memory you return to some
> > form of sequential consistency (all threads see updates in the same
> > order). You don't want to use locks, that quickly creates scalability
> > bottlenecks.
> >
> > Since the scalable scheduler does its best to avoid locks
> > (non-scalable) and sequential consistency (slow), instead utilising
> > lock-less and lock-free algorithms and weak memory ordering (e.g.
> > acquire/release), it exposes the underlying hardware characteristics.
> >
>
> Ola, I think you better understand how week memory ordering works. In
> this case I understand that hardware can 'delay' events in queue and
> make them not visible just after queueing for some reason. And it's not
> possible to solve in implementation. If we speak totally about software
> I would understand if one thread did queue and other dequeue. Or case if
> you queued X and dequeued Y. But in that case if each thread queued 1
> and dequeued 1 in each thread. Which look like if you store in one
> thread some variable then you need several loads to get value which was
> stored. Is that right behaviour of week ordering?
>

The new ring buffer design uses separate read & write (head & tail)
pointers for producers (enqueue) and consumers (dequeue) (DPDK/BSD ring
buffer style but with different names). Enqueue checks if there is space
(load prod_write, load-acquire prod_read), updates prod_write (CAS-relaxed
prod_write), stores events in the ring array, waits for any previous
producers to release their updates (load cons_write) and eventually
releases its own updates (store-release cons_write) so that they can be
seen by consumers.
Dequeue works in an equivalent way (load-acquire on cons_write, CAS on
cons_read, then DMB ISHLD before store to prod_read, DMB ISHLD is enough to
ensure that the read of of the ring buffer has completed before dequeue
"releases" the ring buffer slots).

See slide 9 in my Linaro Connect presentation:
https://docs.google.com/presentation/d/1BqAdni4aP4aHOqO6fNO39-0MN9zOntI-2ZnVTUXBNSQ/edit#slide=id.g1d00c08a90_0_108

One aspect of having separate metadata (head & tail pointers) for producers
and for consumers is that producers and consumer can have a different view
of the queue state, i.e. one thread can see the queue as empty while
another thread sees it as non-empty. I call this Schrödinger's Queue
Conundrum when I talked about it in my internal scheduler prototype design
presentation back in Las Vegas.

The scalable scheduler puts all of these ring buffer metadata variables in
separate cache lines for maximum scalability. They are all targets for
stores and only one cache (CPU) at a time can write to a cache line (per
MOESI and similar cache coherency protocols). Combine this with relaxed
memory ordering where plain loads and stores can be reordered in any way,
prefetching of cache lines and store buffers and the CPU might see updates
(stores) from other CPU's in unexpected order (a CPU always sees its own
stores immediately). If load-acquire and store-release are used when
necessary, it is guaranteed that the CPU does not see invalid (stale) data.

In the scenario here we have multiple threads enqueuing and dequeuing
events (one at a time) to/from the same queue and not being synchronised
with each other (the queue is the only shared object but as explained
above, the synchronisation is very loose). I haven't figured out the exact
situation where this problem occurs but I am not surprised.

OK let's try to understand:
TA enqueues 1st event (updates prod_write=1, cons_write=1)
TB enqueues 2nd event (updates prod_write=2, cons_write=2)
TB dequeues 1st event (updates cons_read=1, prod_read=1)
TA calls dequeue but cons_read and cons_write can be fetched in any order
(e.g. cons_read=1, cons_write=1) and TA draws the conclusion that there are
no elements in the ring (cons_read==cons_write). TB-enqueue's update of
cons_write=2 isn't always immediately seen by TA-dequeue, TA loaded
cons_write when it still had the value 1. Relaxed (loads and) stores can be
reordered on ARM (not on x86 which implements TSO - Total Store Order, at
least this is what software sees).

You could probably (well I am sure) avoid the situation by making the
memory ordering stricter but that would degrade performance. The code
currently uses the minimal orderings that are required to guarantee the
correctness of the ring buffer reads (in dequeue) and writes (in enqueue)
and nothing more. If you retry the dequeue operation a few times, TA
eventually sees the updated cons_write (as written by TB-enqueue) and
correctly determines that there is an event that can be dequeued
(cons_read=1, cons_write=2, cons_read != cons_write)).

Possibly, the more threads involved, the larger the probability of loads
and stores getting reordered because they must be snooped from different
caches (different CPU's) and thus the cache coherency transactions will
have different latency,


> Maxim.
>
>
> >>
> >> Maxim.
> >>
> >>>>
> >>>> But I do not see why we need this patch. On the same cpu test queue 1
> >>>> event and after that dequeue 1 event:
> >>>>
> >>>>         for (i = 0; i < QUEUE_ROUNDS; i++) {
> >>>>                 ev = odp_buffer_to_event(buf);
> >>>>
> >>>>                 if (odp_queue_enq(queue, ev)) {
> >>>>                         LOG_ERR("  [%i] Queue enqueue failed.\n",
> thr);
> >>>>                         odp_buffer_free(buf);
> >>>>                         return -1;
> >>>>                 }
> >>>>
> >>>>                 ev = odp_queue_deq(queue);
> >>>>
> >>>>                 buf = odp_buffer_from_event(ev);
> >>>>
> >>>>                 if (!odp_buffer_is_valid(buf)) {
> >>>>                         LOG_ERR("  [%i] Queue empty.\n", thr);
> >>>>                         return -1;
> >>>>                 }
> >>>>         }
> >>>>
> >>>> Where this exactly event can be delayed?
> >>> In the memory system.
> >>>
> >>>>
> >>>> If other threads do the same - then all do enqueue 1 event first and
> >>>> then dequeue one event. I can understand problem with queueing on one
> >>>> cpu and dequeuing on other cpu. But on the same cpu is has to always
> >>>> work. Isn't it?
> >>> No.
> >>>
> >>>>
> >>>> Maxim.
> >>>>
> >>>>>
> >>>>> On 4 April 2017 at 21:22, Brian Brooks <[email protected]> wrote:
> >>>>>> On 04/04 17:26:12, Bill Fischofer wrote:
> >>>>>>> On Tue, Apr 4, 2017 at 3:37 PM, Brian Brooks <[email protected]>
> wrote:
> >>>>>>>> On 04/04 21:59:15, Maxim Uvarov wrote:
> >>>>>>>>> On 04/04/17 21:47, Brian Brooks wrote:
> >>>>>>>>>> Signed-off-by: Ola Liljedahl <[email protected]>
> >>>>>>>>>> Reviewed-by: Brian Brooks <[email protected]>
> >>>>>>>>>> Reviewed-by: Honnappa Nagarahalli <[email protected]
> >
> >>>>>>>>>> Reviewed-by: Kevin Wang <[email protected]>
> >>>>>>>>>> ---
> >>>>>>>>>>  test/common_plat/performance/odp_scheduling.c | 12
> ++++++++++--
> >>>>>>>>>>  1 file changed, 10 insertions(+), 2 deletions(-)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/test/common_plat/performance/odp_scheduling.c
> b/test/common_plat/performance/odp_scheduling.c
> >>>>>>>>>> index c74a0713..38e76257 100644
> >>>>>>>>>> --- a/test/common_plat/performance/odp_scheduling.c
> >>>>>>>>>> +++ b/test/common_plat/performance/odp_scheduling.c
> >>>>>>>>>> @@ -273,7 +273,7 @@ static int test_plain_queue(int thr,
> test_globals_t *globals)
> >>>>>>>>>>     test_message_t *t_msg;
> >>>>>>>>>>     odp_queue_t queue;
> >>>>>>>>>>     uint64_t c1, c2, cycles;
> >>>>>>>>>> -   int i;
> >>>>>>>>>> +   int i, j;
> >>>>>>>>>>
> >>>>>>>>>>     /* Alloc test message */
> >>>>>>>>>>     buf = odp_buffer_alloc(globals->pool);
> >>>>>>>>>> @@ -307,7 +307,15 @@ static int test_plain_queue(int thr,
> test_globals_t *globals)
> >>>>>>>>>>                     return -1;
> >>>>>>>>>>             }
> >>>>>>>>>>
> >>>>>>>>>> -           ev = odp_queue_deq(queue);
> >>>>>>>>>> +           /* When enqueue and dequeue are decoupled (e.g. not
> using a
> >>>>>>>>>> +            * common lock), an enqueued event may not be
> immediately
> >>>>>>>>>> +            * visible to dequeue. So we just try again for a
> while. */
> >>>>>>>>>> +           for (j = 0; j < 100; j++) {
> >>>>>>>>>
> >>>>>>>>> where 100 number comes from?
> >>>>>>>>
> >>>>>>>> It is the retry count. Perhaps it could be a bit lower, or a bit
> higher, but
> >>>>>>>> it works well.
> >>>>>>>
> >>>>>>> Actually, it's incorrect. What happens if all 100 retries fail?
> You'll
> >>>>>>> call odp_buffer_from_event() for ODP_EVENT_INVALID, which is
> >>>>>>> undefined.
> >>>>>>
> >>>>>> Incorrect? :) The point is that an event may not be immediately
> available
> >>>>>> to dequeue after it has been enqueued. This is due to the way that
> a concurrent
> >>>>>> ring buffer behaves in a multi-threaded environment. The approach
> here is
> >>>>>> just to retry the dequeue a couple times (100 times actually)
> before moving
> >>>>>> on to the rest of code. Perhaps 100 times is too many times, but
> some amount
> >>>>>> of retry is needed.
> >>>>>>
> >>>>>> If this is not desirable, then I think it would be more accurate to
> consider
> >>>>>> odp_queue_enq() / odp_queue_deq() as async APIs -or- MT-unsafe
> (must be called
> >>>>>> from one thread at a time in order to ensure the behavior that an
> event is
> >>>>>> immediately available for dequeue once it has been enqueued).
> >>>>>>
> >>>>>>>>
> >>>>>>>>> Maxim.
> >>>>>>>>>
> >>>>>>>>>> +                   ev = odp_queue_deq(queue);
> >>>>>>>>>> +                   if (ev != ODP_EVENT_INVALID)
> >>>>>>>>>> +                           break;
> >>>>>>>>>> +                   odp_cpu_pause();
> >>>>>>>>>> +           }
> >>>>>>>>>>
> >>>>>>>>>>             buf = odp_buffer_from_event(ev);
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>
> >>
>
>

Reply via email to