On 7 April 2017 at 01:33, Bill Fischofer <[email protected]> wrote:
> On Thu, Apr 6, 2017 at 1:51 PM, Maxim Uvarov <[email protected]> > wrote: > > On 04/06/17 13:35, Ola Liljedahl wrote: > >> On 5 April 2017 at 23:39, Maxim Uvarov <[email protected]> wrote: > >>> On 04/05/17 17:30, Ola Liljedahl wrote: > >>>> On 5 April 2017 at 14:50, Maxim Uvarov <[email protected]> > wrote: > >>>>> On 04/05/17 06:57, Honnappa Nagarahalli wrote: > >>>>>> This can go into master/api-next as an independent patch. Agree? > >>>>> > >>>>> agree. If we accept implementation where events can be 'delayed' > >>>> Probably all platforms with HW queues. > >>>> > >>>>> than it > >>>>> looks like we missed some api to sync queues. > >>>> When would those API's be used? > >>>> > >>> > >>> might be in case like that. Might be it's not needed in real world > >>> application. > >> This was a test program. I don't see the same situation occurring in a > >> real world application. I could be wrong. > >> > >>> > >>> My point that if situation of postpone event is accepted that we need > >>> document that in api doxygen comment. > >> I think the asynchronous behaviour is the default. ODP is a hardware > >> abstraction. HW is often asynchronous, writes are posted etc. Ensuring > >> synchronous behaviour costs performance. > >> > >> Single-threaded software is "synchronous", writes are immediately > >> visible to the thread. But as soon as you go multi-threaded and don't > >> use locks to access shared resources, software also becomes > >> "asynchronous" (don't know if it is the right word here). Only if you > >> use locks to synchronise accesses to shared memory you return to some > >> form of sequential consistency (all threads see updates in the same > >> order). You don't want to use locks, that quickly creates scalability > >> bottlenecks. > >> > >> Since the scalable scheduler does its best to avoid locks > >> (non-scalable) and sequential consistency (slow), instead utilising > >> lock-less and lock-free algorithms and weak memory ordering (e.g. > >> acquire/release), it exposes the underlying hardware characteristics. > >> > > > > Ola, I think you better understand how week memory ordering works. In > > this case I understand that hardware can 'delay' events in queue and > > make them not visible just after queueing for some reason. And it's not > > possible to solve in implementation. If we speak totally about software > > I would understand if one thread did queue and other dequeue. Or case if > > you queued X and dequeued Y. But in that case if each thread queued 1 > > and dequeued 1 in each thread. Which look like if you store in one > > thread some variable then you need several loads to get value which was > > stored. Is that right behaviour of week ordering? > > There are some good online articles that explain the issues well. See, > for example [1] that explains the types of barriers used to control > memory orderings and [2] that explains how these relate to strong vs. > weak memory models. > > -- > [1] http://preshing.com/20120710/memory-barriers-are-like- > source-control-operations/ > [2] http://preshing.com/20120930/weak-vs-strong-memory-models/ > > Preshing has a lot of good posts. Herb Sutter has a good presentation here: https://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2 > > > > Maxim. > > > > > >>> > >>> Maxim. > >>> > >>>>> > >>>>> But I do not see why we need this patch. On the same cpu test queue 1 > >>>>> event and after that dequeue 1 event: > >>>>> > >>>>> for (i = 0; i < QUEUE_ROUNDS; i++) { > >>>>> ev = odp_buffer_to_event(buf); > >>>>> > >>>>> if (odp_queue_enq(queue, ev)) { > >>>>> LOG_ERR(" [%i] Queue enqueue failed.\n", > thr); > >>>>> odp_buffer_free(buf); > >>>>> return -1; > >>>>> } > >>>>> > >>>>> ev = odp_queue_deq(queue); > >>>>> > >>>>> buf = odp_buffer_from_event(ev); > >>>>> > >>>>> if (!odp_buffer_is_valid(buf)) { > >>>>> LOG_ERR(" [%i] Queue empty.\n", thr); > >>>>> return -1; > >>>>> } > >>>>> } > >>>>> > >>>>> Where this exactly event can be delayed? > >>>> In the memory system. > >>>> > >>>>> > >>>>> If other threads do the same - then all do enqueue 1 event first and > >>>>> then dequeue one event. I can understand problem with queueing on one > >>>>> cpu and dequeuing on other cpu. But on the same cpu is has to always > >>>>> work. Isn't it? > >>>> No. > >>>> > >>>>> > >>>>> Maxim. > >>>>> > >>>>>> > >>>>>> On 4 April 2017 at 21:22, Brian Brooks <[email protected]> > wrote: > >>>>>>> On 04/04 17:26:12, Bill Fischofer wrote: > >>>>>>>> On Tue, Apr 4, 2017 at 3:37 PM, Brian Brooks < > [email protected]> wrote: > >>>>>>>>> On 04/04 21:59:15, Maxim Uvarov wrote: > >>>>>>>>>> On 04/04/17 21:47, Brian Brooks wrote: > >>>>>>>>>>> Signed-off-by: Ola Liljedahl <[email protected]> > >>>>>>>>>>> Reviewed-by: Brian Brooks <[email protected]> > >>>>>>>>>>> Reviewed-by: Honnappa Nagarahalli < > [email protected]> > >>>>>>>>>>> Reviewed-by: Kevin Wang <[email protected]> > >>>>>>>>>>> --- > >>>>>>>>>>> test/common_plat/performance/odp_scheduling.c | 12 > ++++++++++-- > >>>>>>>>>>> 1 file changed, 10 insertions(+), 2 deletions(-) > >>>>>>>>>>> > >>>>>>>>>>> diff --git a/test/common_plat/performance/odp_scheduling.c > b/test/common_plat/performance/odp_scheduling.c > >>>>>>>>>>> index c74a0713..38e76257 100644 > >>>>>>>>>>> --- a/test/common_plat/performance/odp_scheduling.c > >>>>>>>>>>> +++ b/test/common_plat/performance/odp_scheduling.c > >>>>>>>>>>> @@ -273,7 +273,7 @@ static int test_plain_queue(int thr, > test_globals_t *globals) > >>>>>>>>>>> test_message_t *t_msg; > >>>>>>>>>>> odp_queue_t queue; > >>>>>>>>>>> uint64_t c1, c2, cycles; > >>>>>>>>>>> - int i; > >>>>>>>>>>> + int i, j; > >>>>>>>>>>> > >>>>>>>>>>> /* Alloc test message */ > >>>>>>>>>>> buf = odp_buffer_alloc(globals->pool); > >>>>>>>>>>> @@ -307,7 +307,15 @@ static int test_plain_queue(int thr, > test_globals_t *globals) > >>>>>>>>>>> return -1; > >>>>>>>>>>> } > >>>>>>>>>>> > >>>>>>>>>>> - ev = odp_queue_deq(queue); > >>>>>>>>>>> + /* When enqueue and dequeue are decoupled (e.g. > not using a > >>>>>>>>>>> + * common lock), an enqueued event may not be > immediately > >>>>>>>>>>> + * visible to dequeue. So we just try again for a > while. */ > >>>>>>>>>>> + for (j = 0; j < 100; j++) { > >>>>>>>>>> > >>>>>>>>>> where 100 number comes from? > >>>>>>>>> > >>>>>>>>> It is the retry count. Perhaps it could be a bit lower, or a bit > higher, but > >>>>>>>>> it works well. > >>>>>>>> > >>>>>>>> Actually, it's incorrect. What happens if all 100 retries fail? > You'll > >>>>>>>> call odp_buffer_from_event() for ODP_EVENT_INVALID, which is > >>>>>>>> undefined. > >>>>>>> > >>>>>>> Incorrect? :) The point is that an event may not be immediately > available > >>>>>>> to dequeue after it has been enqueued. This is due to the way that > a concurrent > >>>>>>> ring buffer behaves in a multi-threaded environment. The approach > here is > >>>>>>> just to retry the dequeue a couple times (100 times actually) > before moving > >>>>>>> on to the rest of code. Perhaps 100 times is too many times, but > some amount > >>>>>>> of retry is needed. > >>>>>>> > >>>>>>> If this is not desirable, then I think it would be more accurate > to consider > >>>>>>> odp_queue_enq() / odp_queue_deq() as async APIs -or- MT-unsafe > (must be called > >>>>>>> from one thread at a time in order to ensure the behavior that an > event is > >>>>>>> immediately available for dequeue once it has been enqueued). > >>>>>>> > >>>>>>>>> > >>>>>>>>>> Maxim. > >>>>>>>>>> > >>>>>>>>>>> + ev = odp_queue_deq(queue); > >>>>>>>>>>> + if (ev != ODP_EVENT_INVALID) > >>>>>>>>>>> + break; > >>>>>>>>>>> + odp_cpu_pause(); > >>>>>>>>>>> + } > >>>>>>>>>>> > >>>>>>>>>>> buf = odp_buffer_from_event(ev); > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>> > >>> > > >
