This is a good discussion. The key is to recognize that the ordering provided by ordered queues is of contexts, not events. When the scheduler dispatches an event from an ordered queue, it creates an ordered context that the receiving thread then runs in. That context is ordered with respect to all other contexts originating from the same ordered queue, and it persists until the next odp_schedule() call or until an explicit odp_schedule_release_ordered() call is made, though the latter is defined as only a "hint" that may or may not actually release the ordered context.
An ordered context protects all enqueues made by the thread running in that context. That means that no matter how few or how many odp_queue_enq() calls that thread makes, all of these calls will be ordered with respect to the corresponding calls made by other threads running in other ordered contexts originating from the same ordered queue. Within the thread itself, the relative order of its odp_queue_enq() calls determines the order of events generated by its ordered context, since a single thread can only make one such call at a time.

Ordered locks add an interesting wrinkle to this because they make it possible for ordered critical sections to exist within the otherwise parallel processing of these ordered contexts. If an ordered queue has N ordered locks, then there are N possible ordered synchronization points that can be entered within the lifespan of a given ordered context. Note that there is no requirement that the thread actually "visit" any of these points. If another thread in a later ordered context is waiting on ordered lock i within its context, it will be delayed until either this thread enters and then exits ordered lock i, or this thread releases its ordered context altogether, since the latter is definitive proof that it can never attempt to enter that lock at some later point.

One of the key differences between the current ordered queue code and Matias' patch is when the internal structure (queue element) is inspected during an odp_queue_enq() call. Because queue elements are shared objects, they are protected by an internal lock that must be acquired before manipulating the structure. The current code acquires that lock as part of the entry processing of odp_queue_enq() and then determines whether this thread's context represents the current queue order, i.e., whether the enqueue operation can be completed.
If not, the event is placed in the ordered queue's reorder queue, where it will be processed at some later time when the queue's order "catches up" to this thread's order. In Matias' code, by contrast, odp_queue_enq() first determines whether its order is current and, if not, does not lock the queue element but instead temporarily stores the event in a local stash. The advantage here is that we avoid locking the queue element when we can't actually complete the operation. The disadvantage is that the thread cannot release its order until its stash has been emptied, which means that when attempting to release its current ordered context it must stall rather than continue processing another ordered context.

I suspect further performance improvement can be made by combining these two ideas, such that out-of-order enqueues are stashed locally and then flushed to the originating ordered queue's reorder queue upon order release, as needed, to avoid context release blockage. The real question is how many enqueues are expected within a given ordered context. The assumption has been that the average would be close to one, but obviously a stash becomes more beneficial the larger that number becomes.

On Thu, Dec 1, 2016 at 1:32 AM, Elo, Matias (Nokia - FI/Espoo)
<matias....@nokia-bell-labs.com> wrote:
>
> On 1 Dec 2016, at 6:58, Yi He <yi...@linaro.org> wrote:
>
> Hi, thanks Matias
>
> The example is very helpful, one question following it:
>
> Previously I understood atomic queues as having multiple producer/single
> consumer behaviour, so producers (enqueue threads) can still run in
> parallel?
>
>
> Yes. This actually touches another subject, which is queue thread safety.
> odp_queue_create() takes an 'odp_queue_param_t *param' argument, which
> defines enqueue and dequeue modes (ODP_QUEUE_OP_MT (default) /
> ODP_QUEUE_OP_MT_UNSAFE / ODP_QUEUE_OP_DISABLED). Currently, there is always
> internal locking when doing enqueue in linux-generic, so the argument has
> no real effect.
>
>
> But in the example the two producer threads behave sequentially while
> enqueueing to the atomic queue?
>
> A1, A2 (newly allocated), A0 (original), B1, B2 (newly allocated), B0
> (original)
>
>
> The threads processing events from an ordered queue have to synchronise
> their enqueue operations somehow to guarantee event order in the atomic
> queue. How this is actually done is ODP implementation dependent.
>
> For example, in the new ordered queue implementation I try to avoid
> synchronising/blocking as long as possible by locally caching enqueue
> operations if it isn't the thread's turn to do the enqueue.
>
> -Matias
>
>
> Best Regards, Yi
>
> On 28 November 2016 at 20:40, Elo, Matias (Nokia - FI/Espoo)
> <matias....@nokia-bell-labs.com> wrote:
>>
>>
>> On 28 Nov 2016, at 12:19, Yi He <yi...@linaro.org> wrote:
>>
>> Thanks Matias
>>
>> Very clearly documented:
>> "
>> * Event ordering is based on the
>> * received event(s), but also other (newly allocated or stored) events
>> are
>> * ordered when enqueued within the same ordered context.
>> "
>> One question: for scheduled events, their order is clearly determined by
>> their original sequence in the receiving queue, but what is the
>> relationship and sequence between newly allocated events and the received
>> ones, especially if the allocation happens in multiple threads?
>>
>>
>>
>> The events enqueued within the same ordered context should be received in
>> the same order as they were enqueued = FIFO (i.e. if the newly allocated
>> events were enqueued before the original event, the new events will be
>> received first). The order holds for the whole ordered context.
>>
>> For example:
>>
>> Two threads (A & B) schedule work from the same ordered queue and enqueue
>> processed events to a common atomic queue. Both threads call
>> odp_schedule() and receive a single event -> A gets ordered context 1 and
>> B gets 2. While processing the event, thread A allocates two new events
>> (A1 & A2).
Finally, it
>> enqueues them to the atomic queue in the following order: A1, A2, A0
>> (the original event). Thread B does the same thing (B1, B2, B0).
>>
>> As the ordering holds for the whole ordered context, the events are
>> received in the atomic queue in the following order: A1, A2, A0, B1, B2,
>> B0, independent of which of the threads performs the enqueue operations
>> first.
>>
>> -Matias
>>
>
>