This is a good discussion. The key is to recognize that the ordering
provided by ordered queues is of contexts, not events. When the
scheduler dispatches an event from an ordered queue, it creates an
ordered context that the receiving thread then runs in. That context
is ordered with respect to all other contexts originating from the
same ordered queue and persists until the next odp_schedule() call or
until an explicit odp_schedule_release_ordered() call is made, though
the latter is defined as merely a "hint" that may or may not actually
release the ordered context.
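To make the lifecycle concrete, here is a sketch of a worker loop using the standard ODP scheduling API (process() and dst_queue are placeholders assumed to be defined by the application; this is illustrative, not from either patch):

```c
#include <odp_api.h>

extern void process(odp_event_t ev);  /* application work, assumed */

/* Sketch of an ordered-context lifetime in a worker loop; dst_queue
 * is assumed to be an application destination queue. */
void worker(odp_queue_t dst_queue)
{
	while (1) {
		/* Dispatch from an ordered queue: this thread now runs
		 * in a new ordered context. */
		odp_event_t ev = odp_schedule(NULL, ODP_SCHED_WAIT);

		process(ev);                  /* work inside the context */
		odp_queue_enq(dst_queue, ev); /* enqueue is order-protected */

		/* Hint that no further order-protected work follows; the
		 * implementation may or may not release the context here.
		 * Otherwise it persists until the next odp_schedule()
		 * call at the top of the loop. */
		odp_schedule_release_ordered();
	}
}
```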
An ordered context protects all enqueues made by the thread running in
that context. That means that no matter how few or how many
odp_queue_enq() calls that thread makes, all of these calls will be
ordered with respect to the corresponding calls made by other threads
running in other ordered contexts originating from the same ordered
queue. Within the thread itself, the relative order of its
odp_queue_enq() calls determines the order of the events generated by
this ordered context, since a single thread can only make one such
call at a time.
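For instance, mirroring the A1/A2/A0 example quoted below, a handler might look like this (a sketch; "pool" and "atomic_q" are assumed to be created elsewhere):

```c
#include <odp_api.h>

/* Sketch: all three enqueues below happen inside one ordered context,
 * so downstream they appear as A1, A2, A0 -- a block that is ordered
 * relative to other contexts from the same ordered queue. */
void handle(odp_event_t a0, odp_pool_t pool, odp_queue_t atomic_q)
{
	odp_packet_t a1 = odp_packet_alloc(pool, 64);
	odp_packet_t a2 = odp_packet_alloc(pool, 64);

	odp_queue_enq(atomic_q, odp_packet_to_event(a1)); /* A1 */
	odp_queue_enq(atomic_q, odp_packet_to_event(a2)); /* A2 */
	odp_queue_enq(atomic_q, a0);                      /* A0 (original) */
}
```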
Ordered locks add an interesting wrinkle to this because they make it
possible for ordered critical sections to exist within the overall
parallel processing of these ordered contexts. If an ordered queue has
N ordered locks then there are N possible ordered synchronization
points that can be entered within the lifespan of a given ordered
context. Note that there is no requirement that the thread actually
"visit" any of these points. If another thread in a later ordered
context is waiting on ordered lock i, it will be delayed until either
this thread enters and then exits ordered lock i, or releases its
ordered context altogether, since the release is definitive proof that
this thread can never attempt to enter that lock at some later point.
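A sketch of such an ordered critical section, using ordered lock index 0 (the queue must have been created with at least one ordered lock; the work functions are placeholders):

```c
#include <odp_api.h>

extern void parallel_work(odp_event_t ev);  /* assumed app functions */
extern void in_order_work(odp_event_t ev);

/* Sketch: an in-order critical section inside otherwise parallel
 * processing of ordered contexts. */
void handle(odp_event_t ev)
{
	parallel_work(ev);          /* runs in parallel across contexts */

	odp_schedule_order_lock(0); /* waits until every earlier ordered
	                             * context has passed this point or
	                             * released its context entirely */
	in_order_work(ev);          /* executes in ingress order */
	odp_schedule_order_unlock(0);

	parallel_work(ev);          /* parallel again */
}
```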
One of the key differences between the current ordered queue code and
Matias' patch is the question of when the internal structure (queue
element) is inspected when performing an odp_queue_enq() call. Because
queue elements are shared objects, they are protected by an internal
lock that must be acquired prior to manipulating that structure. The
current code acquires that lock as part of the entry processing to
odp_queue_enq() and then determines whether this thread's context
represents the current queue order to determine whether the enqueue
operation can be completed. If not, then the event is placed in the
ordered queue's reorder queue where it will be processed at some later
time when the queue's order "catches up" to this thread's order. In
Matias' code, by contrast, odp_queue_enq() first determines if its
order is current and if not it does not lock the queue element but
rather temporarily stores the event into a local stash. The advantage
here is that we avoid locking the queue element if we can't really
complete the operation. The disadvantage is that the thread cannot
release its order until its stash has been emptied, which means that
it must stall, rather than moving on to process another ordered
context, when attempting to release its current ordered context.
I suspect further performance improvement can be made by combining
these two ideas such that out of order enqueues are stashed locally
and then flushed to the origin ordered queue's reorder queue upon
order release as needed to avoid context release blockage. The real
question is how many enqueues are expected within a given ordered
context? The assumption has been that the average would be close to
one, but obviously a stash becomes more beneficial the larger that
number grows.
On Thu, Dec 1, 2016 at 1:32 AM, Elo, Matias (Nokia - FI/Espoo)
> On 1 Dec 2016, at 6:58, Yi He <yi...@linaro.org> wrote:
> Hi, thanks Matias
> The example is very helpful, one question follow it:
> Previously I understand atomic queue like multiple producer/single consumer
> behaviour, so producers (enqueue threads) can still run in parallel?
> Yes. This actually touches another subject which is queue thread safety.
> odp_queue_create() takes 'odp_queue_param_t *param’ argument, which defines
> enqueue and dequeue modes (ODP_QUEUE_OP_MT (default) /
> ODP_QUEUE_OP_MT_UNSAFE / ODP_QUEUE_OP_DISABLED). Currently, there is always
> internal locking when doing enqueue in linux-generic so the argument has no
> real effect.
> But in the example the two producer threads behave sequentially while
> enqueueing to the atomic queue?
> A1, A2 (new allocated), A0 (original), B1, B2 (new allocated), B0 (original)
> The threads processing events from an ordered queue have to synchronise
> their enqueue operations somehow to guarantee event order in the atomic
> queue. How this is actually done, is ODP implementation dependent.
> For example in the new ordered queue implementation I try to avoid
> synchronising/blocking as long as possible by locally caching enqueue
> operations when it isn’t the thread’s turn to enqueue.
> Best Regards, Yi
> On 28 November 2016 at 20:40, Elo, Matias (Nokia - FI/Espoo)
> <matias....@nokia-bell-labs.com> wrote:
>> On 28 Nov 2016, at 12:19, Yi He <yi...@linaro.org> wrote:
>> Thanks Matias
>> Very clearly documented:
>> * Event ordering is based on the
>> * received event(s), but also other (newly allocated or stored) events
>> * ordered when enqueued within the same ordered context.
>> One question is that for scheduled events, their order is clearly
>> determined by their original sequence in the receiving queue, but what is
>> the relationship and sequence between newly allocated events and the
>> received ones, especially if the allocation happens in multiple threads?
>> The events enqueued within the same ordered context should be received in
>> the same order as they were enqueued = FIFO (i.e. if the newly allocated
>> events were enqueued before the original event, the new events will be
>> received first). The order holds for the whole ordered context.
>> For example:
>> Two threads (A & B) schedule work from the same ordered queue and enqueue
>> processed events to a common atomic queue. Both threads call odp_schedule()
>> and receive a single event -> A gets ordered context 1 and B gets 2. While
>> processing the event, thread A allocates two new events (A1&A2). Finally, it
>> enqueues them to the atomic queue in the following order: A1, A2, A0
>> (original event). Thread B does the same thing (B1, B2, B0).
>> As the ordering holds for the whole ordered context, the events are
>> received in the atomic queue in the following order: A1, A2, A0, B1, B2,
>> B0, independent of which thread performs its enqueue operations first.