On Thu, Apr 6, 2017 at 1:51 PM, Maxim Uvarov <[email protected]> wrote:
> On 04/06/17 13:35, Ola Liljedahl wrote:
>> On 5 April 2017 at 23:39, Maxim Uvarov <[email protected]> wrote:
>>> On 04/05/17 17:30, Ola Liljedahl wrote:
>>>> On 5 April 2017 at 14:50, Maxim Uvarov <[email protected]> wrote:
>>>>> On 04/05/17 06:57, Honnappa Nagarahalli wrote:
>>>>>> This can go into master/api-next as an independent patch. Agree?
>>>>>
>>>>> agree. If we accept implementation where events can be 'delayed'
>>>> Probably all platforms with HW queues.
>>>>
>>>>> than it
>>>>> looks like we missed some api to sync queues.
>>>> When would those API's be used?
>>>>
>>>
>>> might be in case like that. Might be it's not needed in real world
>>> application.
>> This was a test program. I don't see the same situation occurring in a
>> real world application. I could be wrong.
>>
>>>
>>> My point that if situation of postpone event is accepted that we need
>>> document that in api doxygen comment.
>> I think the asynchronous behaviour is the default. ODP is a hardware
>> abstraction. HW is often asynchronous, writes are posted etc. Ensuring
>> synchronous behaviour costs performance.
>>
>> Single-threaded software is "synchronous", writes are immediately
>> visible to the thread. But as soon as you go multi-threaded and don't
>> use locks to access shared resources, software also becomes
>> "asynchronous" (don't know if it is the right word here). Only if you
>> use locks to synchronise accesses to shared memory you return to some
>> form of sequential consistency (all threads see updates in the same
>> order). You don't want to use locks, that quickly creates scalability
>> bottlenecks.
>>
>> Since the scalable scheduler does its best to avoid locks
>> (non-scalable) and sequential consistency (slow), instead utilising
>> lock-less and lock-free algorithms and weak memory ordering (e.g.
>> acquire/release), it exposes the underlying hardware characteristics.
>>
>
> Ola, I think you better understand how week memory ordering works. In
> this case I understand that hardware can 'delay' events in queue and
> make them not visible just after queueing for some reason. And it's not
> possible to solve in implementation. If we speak totally about software
> I would understand if one thread did queue and other dequeue. Or case if
> you queued X and dequeued Y. But in that case if each thread queued 1
> and dequeued 1 in each thread. Which look like if you store in one
> thread some variable then you need several loads to get value which was
> stored. Is that right behaviour of week ordering?

There are some good online articles that explain the issues well. See,
for example [1]  that explains the types of barriers used to control
memory orderings and [2] that explains how these relate to strong vs.
weak memory models.

--
[1] 
http://preshing.com/20120710/memory-barriers-are-like-source-control-operations/
[2] http://preshing.com/20120930/weak-vs-strong-memory-models/

>
> Maxim.
>
>
>>>
>>> Maxim.
>>>
>>>>>
>>>>> But I do not see why we need this patch. On the same cpu test queue 1
>>>>> event and after that dequeue 1 event:
>>>>>
>>>>>         for (i = 0; i < QUEUE_ROUNDS; i++) {
>>>>>                 ev = odp_buffer_to_event(buf);
>>>>>
>>>>>                 if (odp_queue_enq(queue, ev)) {
>>>>>                         LOG_ERR("  [%i] Queue enqueue failed.\n", thr);
>>>>>                         odp_buffer_free(buf);
>>>>>                         return -1;
>>>>>                 }
>>>>>
>>>>>                 ev = odp_queue_deq(queue);
>>>>>
>>>>>                 buf = odp_buffer_from_event(ev);
>>>>>
>>>>>                 if (!odp_buffer_is_valid(buf)) {
>>>>>                         LOG_ERR("  [%i] Queue empty.\n", thr);
>>>>>                         return -1;
>>>>>                 }
>>>>>         }
>>>>>
>>>>> Where this exactly event can be delayed?
>>>> In the memory system.
>>>>
>>>>>
>>>>> If other threads do the same - then all do enqueue 1 event first and
>>>>> then dequeue one event. I can understand problem with queueing on one
>>>>> cpu and dequeuing on other cpu. But on the same cpu is has to always
>>>>> work. Isn't it?
>>>> No.
>>>>
>>>>>
>>>>> Maxim.
>>>>>
>>>>>>
>>>>>> On 4 April 2017 at 21:22, Brian Brooks <[email protected]> wrote:
>>>>>>> On 04/04 17:26:12, Bill Fischofer wrote:
>>>>>>>> On Tue, Apr 4, 2017 at 3:37 PM, Brian Brooks <[email protected]> 
>>>>>>>> wrote:
>>>>>>>>> On 04/04 21:59:15, Maxim Uvarov wrote:
>>>>>>>>>> On 04/04/17 21:47, Brian Brooks wrote:
>>>>>>>>>>> Signed-off-by: Ola Liljedahl <[email protected]>
>>>>>>>>>>> Reviewed-by: Brian Brooks <[email protected]>
>>>>>>>>>>> Reviewed-by: Honnappa Nagarahalli <[email protected]>
>>>>>>>>>>> Reviewed-by: Kevin Wang <[email protected]>
>>>>>>>>>>> ---
>>>>>>>>>>>  test/common_plat/performance/odp_scheduling.c | 12 ++++++++++--
>>>>>>>>>>>  1 file changed, 10 insertions(+), 2 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/test/common_plat/performance/odp_scheduling.c 
>>>>>>>>>>> b/test/common_plat/performance/odp_scheduling.c
>>>>>>>>>>> index c74a0713..38e76257 100644
>>>>>>>>>>> --- a/test/common_plat/performance/odp_scheduling.c
>>>>>>>>>>> +++ b/test/common_plat/performance/odp_scheduling.c
>>>>>>>>>>> @@ -273,7 +273,7 @@ static int test_plain_queue(int thr, 
>>>>>>>>>>> test_globals_t *globals)
>>>>>>>>>>>     test_message_t *t_msg;
>>>>>>>>>>>     odp_queue_t queue;
>>>>>>>>>>>     uint64_t c1, c2, cycles;
>>>>>>>>>>> -   int i;
>>>>>>>>>>> +   int i, j;
>>>>>>>>>>>
>>>>>>>>>>>     /* Alloc test message */
>>>>>>>>>>>     buf = odp_buffer_alloc(globals->pool);
>>>>>>>>>>> @@ -307,7 +307,15 @@ static int test_plain_queue(int thr, 
>>>>>>>>>>> test_globals_t *globals)
>>>>>>>>>>>                     return -1;
>>>>>>>>>>>             }
>>>>>>>>>>>
>>>>>>>>>>> -           ev = odp_queue_deq(queue);
>>>>>>>>>>> +           /* When enqueue and dequeue are decoupled (e.g. not 
>>>>>>>>>>> using a
>>>>>>>>>>> +            * common lock), an enqueued event may not be 
>>>>>>>>>>> immediately
>>>>>>>>>>> +            * visible to dequeue. So we just try again for a 
>>>>>>>>>>> while. */
>>>>>>>>>>> +           for (j = 0; j < 100; j++) {
>>>>>>>>>>
>>>>>>>>>> where 100 number comes from?
>>>>>>>>>
>>>>>>>>> It is the retry count. Perhaps it could be a bit lower, or a bit 
>>>>>>>>> higher, but
>>>>>>>>> it works well.
>>>>>>>>
>>>>>>>> Actually, it's incorrect. What happens if all 100 retries fail? You'll
>>>>>>>> call odp_buffer_from_event() for ODP_EVENT_INVALID, which is
>>>>>>>> undefined.
>>>>>>>
>>>>>>> Incorrect? :) The point is that an event may not be immediately 
>>>>>>> available
>>>>>>> to dequeue after it has been enqueued. This is due to the way that a 
>>>>>>> concurrent
>>>>>>> ring buffer behaves in a multi-threaded environment. The approach here 
>>>>>>> is
>>>>>>> just to retry the dequeue a couple times (100 times actually) before 
>>>>>>> moving
>>>>>>> on to the rest of code. Perhaps 100 times is too many times, but some 
>>>>>>> amount
>>>>>>> of retry is needed.
>>>>>>>
>>>>>>> If this is not desirable, then I think it would be more accurate to 
>>>>>>> consider
>>>>>>> odp_queue_enq() / odp_queue_deq() as async APIs -or- MT-unsafe (must be 
>>>>>>> called
>>>>>>> from one thread at a time in order to ensure the behavior that an event 
>>>>>>> is
>>>>>>> immediately available for dequeue once it has been enqueued).
>>>>>>>
>>>>>>>>>
>>>>>>>>>> Maxim.
>>>>>>>>>>
>>>>>>>>>>> +                   ev = odp_queue_deq(queue);
>>>>>>>>>>> +                   if (ev != ODP_EVENT_INVALID)
>>>>>>>>>>> +                           break;
>>>>>>>>>>> +                   odp_cpu_pause();
>>>>>>>>>>> +           }
>>>>>>>>>>>
>>>>>>>>>>>             buf = odp_buffer_from_event(ev);
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>
>>>
>

Reply via email to