On Thu, Apr 6, 2017 at 1:51 PM, Maxim Uvarov <[email protected]> wrote: > On 04/06/17 13:35, Ola Liljedahl wrote: >> On 5 April 2017 at 23:39, Maxim Uvarov <[email protected]> wrote: >>> On 04/05/17 17:30, Ola Liljedahl wrote: >>>> On 5 April 2017 at 14:50, Maxim Uvarov <[email protected]> wrote: >>>>> On 04/05/17 06:57, Honnappa Nagarahalli wrote: >>>>>> This can go into master/api-next as an independent patch. Agree? >>>>> >>>>> agree. If we accept implementation where events can be 'delayed' >>>> Probably all platforms with HW queues. >>>> >>>>> than it >>>>> looks like we missed some api to sync queues. >>>> When would those API's be used? >>>> >>> >>> might be in case like that. Might be it's not needed in real world >>> application. >> This was a test program. I don't see the same situation occurring in a >> real world application. I could be wrong. >> >>> >>> My point that if situation of postpone event is accepted that we need >>> document that in api doxygen comment. >> I think the asynchronous behaviour is the default. ODP is a hardware >> abstraction. HW is often asynchronous, writes are posted etc. Ensuring >> synchronous behaviour costs performance. >> >> Single-threaded software is "synchronous", writes are immediately >> visible to the thread. But as soon as you go multi-threaded and don't >> use locks to access shared resources, software also becomes >> "asynchronous" (don't know if it is the right word here). Only if you >> use locks to synchronise accesses to shared memory you return to some >> form of sequential consistency (all threads see updates in the same >> order). You don't want to use locks, that quickly creates scalability >> bottlenecks. >> >> Since the scalable scheduler does its best to avoid locks >> (non-scalable) and sequential consistency (slow), instead utilising >> lock-less and lock-free algorithms and weak memory ordering (e.g. >> acquire/release), it exposes the underlying hardware characteristics. >> > > Ola, I think you better understand how week memory ordering works. In > this case I understand that hardware can 'delay' events in queue and > make them not visible just after queueing for some reason. And it's not > possible to solve in implementation. If we speak totally about software > I would understand if one thread did queue and other dequeue. Or case if > you queued X and dequeued Y. But in that case if each thread queued 1 > and dequeued 1 in each thread. Which look like if you store in one > thread some variable then you need several loads to get value which was > stored. Is that right behaviour of week ordering?
There are some good online articles that explain the issues well. See, for example [1] that explains the types of barriers used to control memory orderings and [2] that explains how these relate to strong vs. weak memory models. -- [1] http://preshing.com/20120710/memory-barriers-are-like-source-control-operations/ [2] http://preshing.com/20120930/weak-vs-strong-memory-models/ > > Maxim. > > >>> >>> Maxim. >>> >>>>> >>>>> But I do not see why we need this patch. On the same cpu test queue 1 >>>>> event and after that dequeue 1 event: >>>>> >>>>> for (i = 0; i < QUEUE_ROUNDS; i++) { >>>>> ev = odp_buffer_to_event(buf); >>>>> >>>>> if (odp_queue_enq(queue, ev)) { >>>>> LOG_ERR(" [%i] Queue enqueue failed.\n", thr); >>>>> odp_buffer_free(buf); >>>>> return -1; >>>>> } >>>>> >>>>> ev = odp_queue_deq(queue); >>>>> >>>>> buf = odp_buffer_from_event(ev); >>>>> >>>>> if (!odp_buffer_is_valid(buf)) { >>>>> LOG_ERR(" [%i] Queue empty.\n", thr); >>>>> return -1; >>>>> } >>>>> } >>>>> >>>>> Where this exactly event can be delayed? >>>> In the memory system. >>>> >>>>> >>>>> If other threads do the same - then all do enqueue 1 event first and >>>>> then dequeue one event. I can understand problem with queueing on one >>>>> cpu and dequeuing on other cpu. But on the same cpu is has to always >>>>> work. Isn't it? >>>> No. >>>> >>>>> >>>>> Maxim. >>>>> >>>>>> >>>>>> On 4 April 2017 at 21:22, Brian Brooks <[email protected]> wrote: >>>>>>> On 04/04 17:26:12, Bill Fischofer wrote: >>>>>>>> On Tue, Apr 4, 2017 at 3:37 PM, Brian Brooks <[email protected]> >>>>>>>> wrote: >>>>>>>>> On 04/04 21:59:15, Maxim Uvarov wrote: >>>>>>>>>> On 04/04/17 21:47, Brian Brooks wrote: >>>>>>>>>>> Signed-off-by: Ola Liljedahl <[email protected]> >>>>>>>>>>> Reviewed-by: Brian Brooks <[email protected]> >>>>>>>>>>> Reviewed-by: Honnappa Nagarahalli <[email protected]> >>>>>>>>>>> Reviewed-by: Kevin Wang <[email protected]> >>>>>>>>>>> --- >>>>>>>>>>> test/common_plat/performance/odp_scheduling.c | 12 ++++++++++-- >>>>>>>>>>> 1 file changed, 10 insertions(+), 2 deletions(-) >>>>>>>>>>> >>>>>>>>>>> diff --git a/test/common_plat/performance/odp_scheduling.c >>>>>>>>>>> b/test/common_plat/performance/odp_scheduling.c >>>>>>>>>>> index c74a0713..38e76257 100644 >>>>>>>>>>> --- a/test/common_plat/performance/odp_scheduling.c >>>>>>>>>>> +++ b/test/common_plat/performance/odp_scheduling.c >>>>>>>>>>> @@ -273,7 +273,7 @@ static int test_plain_queue(int thr, >>>>>>>>>>> test_globals_t *globals) >>>>>>>>>>> test_message_t *t_msg; >>>>>>>>>>> odp_queue_t queue; >>>>>>>>>>> uint64_t c1, c2, cycles; >>>>>>>>>>> - int i; >>>>>>>>>>> + int i, j; >>>>>>>>>>> >>>>>>>>>>> /* Alloc test message */ >>>>>>>>>>> buf = odp_buffer_alloc(globals->pool); >>>>>>>>>>> @@ -307,7 +307,15 @@ static int test_plain_queue(int thr, >>>>>>>>>>> test_globals_t *globals) >>>>>>>>>>> return -1; >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> - ev = odp_queue_deq(queue); >>>>>>>>>>> + /* When enqueue and dequeue are decoupled (e.g. not >>>>>>>>>>> using a >>>>>>>>>>> + * common lock), an enqueued event may not be >>>>>>>>>>> immediately >>>>>>>>>>> + * visible to dequeue. So we just try again for a >>>>>>>>>>> while. */ >>>>>>>>>>> + for (j = 0; j < 100; j++) { >>>>>>>>>> >>>>>>>>>> where 100 number comes from? >>>>>>>>> >>>>>>>>> It is the retry count. Perhaps it could be a bit lower, or a bit >>>>>>>>> higher, but >>>>>>>>> it works well. >>>>>>>> >>>>>>>> Actually, it's incorrect. What happens if all 100 retries fail? You'll >>>>>>>> call odp_buffer_from_event() for ODP_EVENT_INVALID, which is >>>>>>>> undefined. >>>>>>> >>>>>>> Incorrect? :) The point is that an event may not be immediately >>>>>>> available >>>>>>> to dequeue after it has been enqueued. This is due to the way that a >>>>>>> concurrent >>>>>>> ring buffer behaves in a multi-threaded environment. The approach here >>>>>>> is >>>>>>> just to retry the dequeue a couple times (100 times actually) before >>>>>>> moving >>>>>>> on to the rest of code. Perhaps 100 times is too many times, but some >>>>>>> amount >>>>>>> of retry is needed. >>>>>>> >>>>>>> If this is not desirable, then I think it would be more accurate to >>>>>>> consider >>>>>>> odp_queue_enq() / odp_queue_deq() as async APIs -or- MT-unsafe (must be >>>>>>> called >>>>>>> from one thread at a time in order to ensure the behavior that an event >>>>>>> is >>>>>>> immediately available for dequeue once it has been enqueued). >>>>>>> >>>>>>>>> >>>>>>>>>> Maxim. >>>>>>>>>> >>>>>>>>>>> + ev = odp_queue_deq(queue); >>>>>>>>>>> + if (ev != ODP_EVENT_INVALID) >>>>>>>>>>> + break; >>>>>>>>>>> + odp_cpu_pause(); >>>>>>>>>>> + } >>>>>>>>>>> >>>>>>>>>>> buf = odp_buffer_from_event(ev); >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>> >>> >
