Re: [lng-odp] [API-NEXT PATCH v2 07/16] test: odp_scheduling: Handle dequeueing from a concurrent queue

Ola Liljedahl Fri, 07 Apr 2017 05:43:58 -0700

On 7 April 2017 at 01:33, Bill Fischofer <[email protected]> wrote:


> On Thu, Apr 6, 2017 at 1:51 PM, Maxim Uvarov <[email protected]>
> wrote:
> > On 04/06/17 13:35, Ola Liljedahl wrote:
> >> On 5 April 2017 at 23:39, Maxim Uvarov <[email protected]> wrote:
> >>> On 04/05/17 17:30, Ola Liljedahl wrote:
> >>>> On 5 April 2017 at 14:50, Maxim Uvarov <[email protected]>
> wrote:
> >>>>> On 04/05/17 06:57, Honnappa Nagarahalli wrote:
> >>>>>> This can go into master/api-next as an independent patch. Agree?
> >>>>>
> >>>>> agree. If we accept implementation where events can be 'delayed'
> >>>> Probably all platforms with HW queues.
> >>>>
> >>>>> than it
> >>>>> looks like we missed some api to sync queues.
> >>>> When would those API's be used?
> >>>>
> >>>
> >>> might be in case like that. Might be it's not needed in real world
> >>> application.
> >> This was a test program. I don't see the same situation occurring in a
> >> real world application. I could be wrong.
> >>
> >>>
> >>> My point that if situation of postpone event is accepted that we need
> >>> document that in api doxygen comment.
> >> I think the asynchronous behaviour is the default. ODP is a hardware
> >> abstraction. HW is often asynchronous, writes are posted etc. Ensuring
> >> synchronous behaviour costs performance.
> >>
> >> Single-threaded software is "synchronous", writes are immediately
> >> visible to the thread. But as soon as you go multi-threaded and don't
> >> use locks to access shared resources, software also becomes
> >> "asynchronous" (don't know if it is the right word here). Only if you
> >> use locks to synchronise accesses to shared memory you return to some
> >> form of sequential consistency (all threads see updates in the same
> >> order). You don't want to use locks, that quickly creates scalability
> >> bottlenecks.
> >>
> >> Since the scalable scheduler does its best to avoid locks
> >> (non-scalable) and sequential consistency (slow), instead utilising
> >> lock-less and lock-free algorithms and weak memory ordering (e.g.
> >> acquire/release), it exposes the underlying hardware characteristics.
> >>
> >
> > Ola, I think you better understand how week memory ordering works. In
> > this case I understand that hardware can 'delay' events in queue and
> > make them not visible just after queueing for some reason. And it's not
> > possible to solve in implementation. If we speak totally about software
> > I would understand if one thread did queue and other dequeue. Or case if
> > you queued X and dequeued Y. But in that case if each thread queued 1
> > and dequeued 1 in each thread. Which look like if you store in one
> > thread some variable then you need several loads to get value which was
> > stored. Is that right behaviour of week ordering?
>
> There are some good online articles that explain the issues well. See,
> for example [1]  that explains the types of barriers used to control
> memory orderings and [2] that explains how these relate to strong vs.
> weak memory models.
>
> --
> [1] http://preshing.com/20120710/memory-barriers-are-like-
> source-control-operations/
> [2] http://preshing.com/20120930/weak-vs-strong-memory-models/
>
> Preshing has a lot of good posts.
Herb Sutter has a good presentation here:
https://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2


> >
> > Maxim.
> >
> >
> >>>
> >>> Maxim.
> >>>
> >>>>>
> >>>>> But I do not see why we need this patch. On the same cpu test queue 1
> >>>>> event and after that dequeue 1 event:
> >>>>>
> >>>>>         for (i = 0; i < QUEUE_ROUNDS; i++) {
> >>>>>                 ev = odp_buffer_to_event(buf);
> >>>>>
> >>>>>                 if (odp_queue_enq(queue, ev)) {
> >>>>>                         LOG_ERR("  [%i] Queue enqueue failed.\n",
> thr);
> >>>>>                         odp_buffer_free(buf);
> >>>>>                         return -1;
> >>>>>                 }
> >>>>>
> >>>>>                 ev = odp_queue_deq(queue);
> >>>>>
> >>>>>                 buf = odp_buffer_from_event(ev);
> >>>>>
> >>>>>                 if (!odp_buffer_is_valid(buf)) {
> >>>>>                         LOG_ERR("  [%i] Queue empty.\n", thr);
> >>>>>                         return -1;
> >>>>>                 }
> >>>>>         }
> >>>>>
> >>>>> Where this exactly event can be delayed?
> >>>> In the memory system.
> >>>>
> >>>>>
> >>>>> If other threads do the same - then all do enqueue 1 event first and
> >>>>> then dequeue one event. I can understand problem with queueing on one
> >>>>> cpu and dequeuing on other cpu. But on the same cpu is has to always
> >>>>> work. Isn't it?
> >>>> No.
> >>>>
> >>>>>
> >>>>> Maxim.
> >>>>>
> >>>>>>
> >>>>>> On 4 April 2017 at 21:22, Brian Brooks <[email protected]>
> wrote:
> >>>>>>> On 04/04 17:26:12, Bill Fischofer wrote:
> >>>>>>>> On Tue, Apr 4, 2017 at 3:37 PM, Brian Brooks <
> [email protected]> wrote:
> >>>>>>>>> On 04/04 21:59:15, Maxim Uvarov wrote:
> >>>>>>>>>> On 04/04/17 21:47, Brian Brooks wrote:
> >>>>>>>>>>> Signed-off-by: Ola Liljedahl <[email protected]>
> >>>>>>>>>>> Reviewed-by: Brian Brooks <[email protected]>
> >>>>>>>>>>> Reviewed-by: Honnappa Nagarahalli <
> [email protected]>
> >>>>>>>>>>> Reviewed-by: Kevin Wang <[email protected]>
> >>>>>>>>>>> ---
> >>>>>>>>>>>  test/common_plat/performance/odp_scheduling.c | 12
> ++++++++++--
> >>>>>>>>>>>  1 file changed, 10 insertions(+), 2 deletions(-)
> >>>>>>>>>>>
> >>>>>>>>>>> diff --git a/test/common_plat/performance/odp_scheduling.c
> b/test/common_plat/performance/odp_scheduling.c
> >>>>>>>>>>> index c74a0713..38e76257 100644
> >>>>>>>>>>> --- a/test/common_plat/performance/odp_scheduling.c
> >>>>>>>>>>> +++ b/test/common_plat/performance/odp_scheduling.c
> >>>>>>>>>>> @@ -273,7 +273,7 @@ static int test_plain_queue(int thr,
> test_globals_t *globals)
> >>>>>>>>>>>     test_message_t *t_msg;
> >>>>>>>>>>>     odp_queue_t queue;
> >>>>>>>>>>>     uint64_t c1, c2, cycles;
> >>>>>>>>>>> -   int i;
> >>>>>>>>>>> +   int i, j;
> >>>>>>>>>>>
> >>>>>>>>>>>     /* Alloc test message */
> >>>>>>>>>>>     buf = odp_buffer_alloc(globals->pool);
> >>>>>>>>>>> @@ -307,7 +307,15 @@ static int test_plain_queue(int thr,
> test_globals_t *globals)
> >>>>>>>>>>>                     return -1;
> >>>>>>>>>>>             }
> >>>>>>>>>>>
> >>>>>>>>>>> -           ev = odp_queue_deq(queue);
> >>>>>>>>>>> +           /* When enqueue and dequeue are decoupled (e.g.
> not using a
> >>>>>>>>>>> +            * common lock), an enqueued event may not be
> immediately
> >>>>>>>>>>> +            * visible to dequeue. So we just try again for a
> while. */
> >>>>>>>>>>> +           for (j = 0; j < 100; j++) {
> >>>>>>>>>>
> >>>>>>>>>> where 100 number comes from?
> >>>>>>>>>
> >>>>>>>>> It is the retry count. Perhaps it could be a bit lower, or a bit
> higher, but
> >>>>>>>>> it works well.
> >>>>>>>>
> >>>>>>>> Actually, it's incorrect. What happens if all 100 retries fail?
> You'll
> >>>>>>>> call odp_buffer_from_event() for ODP_EVENT_INVALID, which is
> >>>>>>>> undefined.
> >>>>>>>
> >>>>>>> Incorrect? :) The point is that an event may not be immediately
> available
> >>>>>>> to dequeue after it has been enqueued. This is due to the way that
> a concurrent
> >>>>>>> ring buffer behaves in a multi-threaded environment. The approach
> here is
> >>>>>>> just to retry the dequeue a couple times (100 times actually)
> before moving
> >>>>>>> on to the rest of code. Perhaps 100 times is too many times, but
> some amount
> >>>>>>> of retry is needed.
> >>>>>>>
> >>>>>>> If this is not desirable, then I think it would be more accurate
> to consider
> >>>>>>> odp_queue_enq() / odp_queue_deq() as async APIs -or- MT-unsafe
> (must be called
> >>>>>>> from one thread at a time in order to ensure the behavior that an
> event is
> >>>>>>> immediately available for dequeue once it has been enqueued).
> >>>>>>>
> >>>>>>>>>
> >>>>>>>>>> Maxim.
> >>>>>>>>>>
> >>>>>>>>>>> +                   ev = odp_queue_deq(queue);
> >>>>>>>>>>> +                   if (ev != ODP_EVENT_INVALID)
> >>>>>>>>>>> +                           break;
> >>>>>>>>>>> +                   odp_cpu_pause();
> >>>>>>>>>>> +           }
> >>>>>>>>>>>
> >>>>>>>>>>>             buf = odp_buffer_from_event(ev);
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>
> >>>
> >
>

Re: [lng-odp] [API-NEXT PATCH v2 07/16] test: odp_scheduling: Handle dequeueing from a concurrent queue

Reply via email to