> 
> This code still has the following in do_schedule():
> 
> ordered = sched_cb_queue_is_ordered(qi);
> 
> /* Do not cache ordered events locally to improve
> * parallelism. Ordered context can only be released
> * when the local cache is empty. */
> if (ordered && max_num < MAX_DEQ)
>     max_deq = max_num;
> 
> To do what the comment says this should be changed to:
> 
> if (ordered)
>    max_deq = 1;
> 
> Because in the case of ordered queues, you want to schedule
> consecutive events to separate threads to allow them to be processed
> in parallel. Allowing multiple events to be processed by a single
> thread introduces head-of-line blocking.
> 
> Of course, if you make this change I suspect some of the performance
> gains measured in the simple test cases we have with this
> implementation will go away since I suspect a good portion of those
> gains is due to effectively turning ordered queues back into atomic
> queues, which is what this sort of event batching with limited numbers
> of events does.
> 

This comment: "Do not cache ordered events locally..." refers to scheduler 
local event stash (odp_event_t ev_stash[MAX_DEQ]), which is not used with 
ordered queues. When application requests N events, up to N (or MAX_DEQ if N > 
MAX_DEQ) events will be dequeued. There is no reason to to fix this to one in 
scheduler. If application is worried about head of line blocking, it can itself 
limit N to 1. There are various applications and use cases. An application may 
use many ordered queues, so that ordering is guaranteed, but parallelism is 
maximized (as in the common case each CPU will process events from different 
queues). Another application with a single, fat queue (and varying event size) 
may be more concerned on latency and ask only single event per a schedule call.
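As a sketch only (assuming the public ODP scheduler API; the function names 
are illustrative, and the worker loop, initialization and error handling are 
elided), the application-side choice looks like this:

    #include <odp_api.h>

    #define BURST 32

    /* Throughput-oriented worker: request up to BURST events per call. */
    static void worker_burst(void)
    {
            odp_event_t events[BURST];
            odp_queue_t from;
            int i, num;

            num = odp_schedule_multi(&from, ODP_SCHED_WAIT, events, BURST);

            for (i = 0; i < num; i++) {
                    /* ... process events[i] ... */
                    odp_event_free(events[i]);
            }
    }

    /* Latency-oriented worker: limits N to 1 itself, avoiding
     * head-of-line blocking on a single fat ordered queue. */
    static void worker_single(void)
    {
            odp_queue_t from;
            odp_event_t ev = odp_schedule(&from, ODP_SCHED_WAIT);

            if (ev != ODP_EVENT_INVALID) {
                    /* ... process ev ... */
                    odp_event_free(ev);
            }
    }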

These things should not be speculated about, but measured. That's why the 
ordered queue performance test was developed and sent to the list already over 
a month ago. It uses a few fat input queues and does a considerable amount of 
processing per ordered event. It demonstrates a 1.15 - 2.6x speedup (1 - 12 
cores) compared to the old implementation. Additionally, the new 
implementation scales almost linearly and much better than the old one. These 
are results for N=32 in the application's schedule_multi call. When the 
application limits N to 1, the throughput is halved, not increased.

-Petri
