Bill Fischofer(Bill-Fischofer-Linaro) replied on github web page: platform/linux-generic/odp_queue.c line 95 @@ -143,8 +150,10 @@ static int queue_capability(odp_queue_capability_t *capa) capa->max_sched_groups = sched_fn->num_grps(); capa->sched_prios = odp_schedule_num_prio(); capa->plain.max_num = capa->max_queues; + capa->plain.max_size = CONFIG_QUEUE_SIZE; capa->plain.nonblocking = ODP_BLOCKING; capa->sched.max_num = capa->max_queues; + capa->sched.max_size = CONFIG_QUEUE_SIZE;
Comment: Agreed. > Bill Fischofer(Bill-Fischofer-Linaro) wrote: > Correct, as noted earlier. I withdraw that comment. >> Bill Fischofer(Bill-Fischofer-Linaro) wrote: >> Presumably the compiler will see the overlap and optimize away the >> redundancy, so I assume the performance impact will be nil here. >>> Bill Fischofer(Bill-Fischofer-Linaro) wrote: >>> Since you're allowing `head` and `tail` to run over the entire 32-bit >>> range, you're correct that you can completely fill the ring. I did miss >>> that point. However, as shown above you still need this to be: >>> >>> ``` >>> num = size - abs(tail - head); >>> ``` >>> To avoid problems at the 32-bit wrap boundary. >>>> Bill Fischofer(Bill-Fischofer-Linaro) wrote: >>>> You're computing in 32 bits not 8 bits, and your ring size is less than >>>> 2^32 elements. Consider the following test program: >>>> ``` >>>> #include <stdlib.h> >>>> #include <stdio.h> >>>> #include <stdint.h> >>>> #include <inttypes.h> >>>> >>>> void main() { >>>> uint32_t head[4] = {0, 1, 2, 3}; >>>> uint32_t tail[4] = {0, 0xffffffff, 0xfffffffe, 0xfffffffd}; >>>> uint32_t mask = 4095; >>>> uint32_t result; >>>> int i; >>>> >>>> for (i = 0; i < 4; i++) { >>>> printf("head = %" PRIu32 " tail = %" PRIu32 ":\n", >>>> head[i], tail[i]); >>>> >>>> result = tail[i] - head[i]; >>>> printf("tail - head = %" PRIu32 "\n", result); >>>> >>>> result = (tail[i] - head[i]) & mask; >>>> printf("(tail - head) & mask = %" PRIu32 "\n", result); >>>> >>>> result = abs(tail[i] - head[i]); >>>> printf("abs(tail - head) = %" PRIu32 "\n\n", result); >>>> } >>>> } >>>> ``` >>>> in theory `tail - head` should be the number of elements in the ring, in >>>> this case 0, 2, 4, and 6. But running this test program gives the >>>> following output: >>>> ``` >>>> head = 0 tail = 0: >>>> tail - head = 0 >>>> (tail - head) & mask = 0 >>>> abs(tail - head) = 0 >>>> >>>> head = 1 tail = 4294967295: >>>> tail - head = 4294967294 >>>> (tail - head) & mask = 4094 >>>> abs(tail - head) = 2 >>>> >>>> head = 2 tail = 4294967294: >>>> tail - head = 4294967292 >>>> (tail - head) & mask = 4092 >>>> abs(tail - head) = 4 >>>> >>>> head = 3 tail = 4294967293: >>>> tail - head = 4294967290 >>>> (tail - head) & mask = 4090 >>>> abs(tail - head) = 6 >>>> ``` >>>> Since you're allowing head to run free over the 32-bit range of the >>>> variable, when the 32-bits rolls over you'll get a large positive number, >>>> not the small one you need to stay within the ring bounds. The alternative >>>> is to mask `head` and `tail` as you increment them, but then you run into >>>> the effective range issue. >>>>> Bill Fischofer(Bill-Fischofer-Linaro) wrote: >>>>> OK >>>>>> Bill Fischofer(Bill-Fischofer-Linaro) wrote: >>>>>> OK, the `ODP_ASSERT()` would still be useful for debugging. >>>>>>> Bill Fischofer(Bill-Fischofer-Linaro) wrote: >>>>>>> That functionality is not obvious from the name. It either implies that >>>>>>> one of the input arguments is written (not true here) or the reader >>>>>>> might assume that it is an expression without side-effect and should be >>>>>>> deleted (what I originally thought when reading it). You should pick a >>>>>>> routine name that makes it clear it's actually doing something real, in >>>>>>> this case performing prefetch processing. >>>>>>>> Bill Fischofer(Bill-Fischofer-Linaro) wrote: >>>>>>>> Elsewhere you write `if (ring_st_is_empty(...))`, not `if >>>>>>>> (ring_st_is_empty(...) == 1)` so this is inconsistent. >>>>>>>>> Petri Savolainen(psavol) wrote: >>>>>>>>> Didn't try larger than 32. 32 is already quite large from QoS point >>>>>>>>> of view. >>>>>>>>> >>>>>>>>> I'm planning to use config file for run time tunning, so this hard >>>>>>>>> coding may change in that phase. >>>>>>>>>> Petri Savolainen(psavol) wrote: >>>>>>>>>> One entry is not lost. >>>>>>>>>>> Petri Savolainen(psavol) wrote: >>>>>>>>>>> One entry is not lost. User provided size if not (currently) used. >>>>>>>>>>> Queue size is always 4k. >>>>>>>>>>>> Petri Savolainen(psavol) wrote: >>>>>>>>>>>> One entry is not lost. >>>>>>>>>>>>> Petri Savolainen(psavol) wrote: >>>>>>>>>>>>> One entry is not lost. >>>>>>>>>>>>>> Petri Savolainen(psavol) wrote: >>>>>>>>>>>>>> OK, added checks in v2. >>>>>>>>>>>>>>> Petri Savolainen(psavol) wrote: >>>>>>>>>>>>>>> OK. Compiler probably did that already, but changed in v2. >>>>>>>>>>>>>>>> Petri Savolainen(psavol) wrote: >>>>>>>>>>>>>>>> Tail and head indexes are (masked from) uint32_t and do not >>>>>>>>>>>>>>>> wrap around when the ring is full. I think you assume that the >>>>>>>>>>>>>>>> store index is 0...size-1, while it's full uint32_t which is >>>>>>>>>>>>>>>> then masked to get the actual index. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> For example: >>>>>>>>>>>>>>>> size = 100; >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Empty: >>>>>>>>>>>>>>>> head = 100 >>>>>>>>>>>>>>>> tail = 100 >>>>>>>>>>>>>>>> num = 100 - 100 = 0 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Full: >>>>>>>>>>>>>>>> head = 100 >>>>>>>>>>>>>>>> tail = 200 >>>>>>>>>>>>>>>> num = 200 - 100 = 100 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Wrap uint32_t + full: >>>>>>>>>>>>>>>> head = 0xFFFFFF9C >>>>>>>>>>>>>>>> tail = 0 >>>>>>>>>>>>>>>> num = 0 - 0xFFFFFF9C = 0x64 = 100 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> So, no abs() needed. Ring size can be 4096, instead of 4095. >>>>>>>>>>>>>>>>> Petri Savolainen(psavol) wrote: >>>>>>>>>>>>>>>>> It's already documented 5 lines above: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> /* Initialize ring. Ring size must be a power of two. */ >>>>>>>>>>>>>>>>> static inline void ring_st_init(ring_st_t *ring, uint32_t >>>>>>>>>>>>>>>>> *data, uint32_t size) >>>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>>>> Petri Savolainen(psavol) wrote: >>>>>>>>>>>>>>>>>> This function converts 32 bit buffer indexes to buffer >>>>>>>>>>>>>>>>>> header pointers. The counter operation is >>>>>>>>>>>>>>>>>> buffer_index_from_buf(). The prefetch is a side effect of >>>>>>>>>>>>>>>>>> the function, which may be changed/moved any time if it's >>>>>>>>>>>>>>>>>> found out that there's a place for prefetching. I actually >>>>>>>>>>>>>>>>>> plan to test if number of prefetches should be limited as >>>>>>>>>>>>>>>>>> e.g. 32 consecutive prefetches may be too much for some CPU >>>>>>>>>>>>>>>>>> architectures. >>>>>>>>>>>>>>>>>>> Petri Savolainen(psavol) wrote: >>>>>>>>>>>>>>>>>>> I prefer style where '== 0' is used instead of '!'. >>>>>>>>>>>>>>>>>>> Especially, when the if clause is as complex as this and >>>>>>>>>>>>>>>>>>> there's danger for reader to miss the '!' sign. >>>>>>>>>>>>>>>>>>>> Petri Savolainen(psavol) wrote: >>>>>>>>>>>>>>>>>>>> It's there to ensure that all bits are zero also when >>>>>>>>>>>>>>>>>>>> someone would modify the bitfield from two to three fields >>>>>>>>>>>>>>>>>>>> later on. Similarly to memset() zero is used for struct >>>>>>>>>>>>>>>>>>>> inits. >>>>>>>>>>>>>>>>>>>>> Petri Savolainen(psavol) wrote: >>>>>>>>>>>>>>>>>>>>> There's no need for abs(). Since it's all uint32_t >>>>>>>>>>>>>>>>>>>>> variables, wrap a round is handled already. >>>>>>>>>>>>>>>>>>>>> An example in 8bits: >>>>>>>>>>>>>>>>>>>>> 0xff - 0xfd = 0x02 >>>>>>>>>>>>>>>>>>>>> 0x00 - 0xfe = 0x02 >>>>>>>>>>>>>>>>>>>>> 0x01 - 0xff = 0x02 >>>>>>>>>>>>>>>>>>>>> 0x02 - 0x00 = 0x02 >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> This passes both gcc and clang, and is used already in >>>>>>>>>>>>>>>>>>>>> the other ring implementation see ring_deq_multi(). >>>>>>>>>>>>>>>>>>>>>> Petri Savolainen(psavol) wrote: >>>>>>>>>>>>>>>>>>>>>> I prefer style with blank line in the end of a typedef, >>>>>>>>>>>>>>>>>>>>>> since it's easier to spot the type name (as it's not >>>>>>>>>>>>>>>>>>>>>> mixed into struct field names). Checkpatch passes so >>>>>>>>>>>>>>>>>>>>>> this style should be OK. >>>>>>>>>>>>>>>>>>>>>>> Bill Fischofer(Bill-Fischofer-Linaro) wrote: >>>>>>>>>>>>>>>>>>>>>>> Does this mean that sizes larger than 32 have no added >>>>>>>>>>>>>>>>>>>>>>> performance benefit? >>>>>>>>>>>>>>>>>>>>>>>> Bill Fischofer(Bill-Fischofer-Linaro) wrote: >>>>>>>>>>>>>>>>>>>>>>>> Must use `CONFIG_QUEUE_SIZE - 1` here, as noted >>>>>>>>>>>>>>>>>>>>>>>> earlier, if we're not going to use the user-supplied >>>>>>>>>>>>>>>>>>>>>>>> queue size. >>>>>>>>>>>>>>>>>>>>>>>>> Bill Fischofer(Bill-Fischofer-Linaro) wrote: >>>>>>>>>>>>>>>>>>>>>>>>> Given its name, this looks like an extraneous >>>>>>>>>>>>>>>>>>>>>>>>> statement that should be deleted. Renaming this to >>>>>>>>>>>>>>>>>>>>>>>>> something like `prefetch_dequeued_bufs()` would make >>>>>>>>>>>>>>>>>>>>>>>>> the intent clearer here. >>>>>>>>>>>>>>>>>>>>>>>>>> Bill Fischofer(Bill-Fischofer-Linaro) wrote: >>>>>>>>>>>>>>>>>>>>>>>>>> `if (!ring_st_is_empty(&queue->s.ring_st))` seems >>>>>>>>>>>>>>>>>>>>>>>>>> more natural here. >>>>>>>>>>>>>>>>>>>>>>>>>>> Bill Fischofer(Bill-Fischofer-Linaro) wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>> Change to `if (param->size >= CONFIG_QUEUE_SIZE)` >>>>>>>>>>>>>>>>>>>>>>>>>>> to handle the effective queue capacity. The >>>>>>>>>>>>>>>>>>>>>>>>>>> user-supplied `size` should then be set to >>>>>>>>>>>>>>>>>>>>>>>>>>> `ROUNDUP_POWER2_U32(size) - 1` for the masking to >>>>>>>>>>>>>>>>>>>>>>>>>>> work properly. >>>>>>>>>>>>>>>>>>>>>>>>>>>> Bill Fischofer(Bill-Fischofer-Linaro) wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>> Same comment here as for plain queues. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Bill Fischofer(Bill-Fischofer-Linaro) wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>> As noted earlier, due to "losing" one entry to >>>>>>>>>>>>>>>>>>>>>>>>>>>>> distinguish queue empty/full, this should be >>>>>>>>>>>>>>>>>>>>>>>>>>>>> returned as `CONFIG_QUEUE_SIZE - 1`, and we also >>>>>>>>>>>>>>>>>>>>>>>>>>>>> need to ensure that `CONFIG_QUEUE_SIZE` is itself >>>>>>>>>>>>>>>>>>>>>>>>>>>>> a power of 2. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Bill Fischofer(Bill-Fischofer-Linaro) wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Since you're initializing `index.pool` and >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `index.buffer` there's no need to set >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `index.u32` here. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Bill Fischofer(Bill-Fischofer-Linaro) wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> We originally had this index partitioning based >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> on `ODP_CONFIG_POOLS`. Do we want to return to >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that here? >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If not, we at least need an >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `ODP_STATIC_ASSERT()` to ensure that >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `ODP_CONFIG_POOLS < 256` or else bad things >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> will happen here. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Bill Fischofer(Bill-Fischofer-Linaro) wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> This routine can be optimized to: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ``` >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> return ring->head == ring->tail; >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ``` >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Bill Fischofer(Bill-Fischofer-Linaro) wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Your invariant is the queue is empty when >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `head == tail` therefore the queue is full >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> when `abs(tail - head) == mask`, so the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> correct calculation here is: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `num = mask - abs(tail - head);` >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The effect is that a queue can only hold >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `size - 1` elements, otherwise you cannot >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> distinguish between a full and an empty queue >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> without another bit of metadata, which is a >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cost you're trying to avoid. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> This is somewhat problematic if the caller is >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> trying to be "optimal" by specifying a power >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of two in the `size` parameter of the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `odp_queue_param_t` passed to >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `odp_queue_create()`. For this reason we may >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wish to return a `max_size` of a power of 2 - >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1 in `odp_queue_capability()` as part of this >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> patch series. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Bill Fischofer(Bill-Fischofer-Linaro) wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> This only works if `size` is a power of 2. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Should be documented as such, since this is >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> an internal routine. In this case an >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `ODP_ASSERT(size == >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ROUNDUP_POWER2_U32(size))` for this >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> requirement would be a useful debugging aid. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Bill Fischofer(Bill-Fischofer-Linaro) wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> should be `num = abs(tail - head);` to deal >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> with wrap arounds, otherwise may be >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> misinterpreted as overly large since it's >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `uint32_t`. Note that GCC and clang >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> recognize `abs()` and treat it as a >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> builtin, so there's no actual `stdlib.h` >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> call here. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Bill Fischofer(Bill-Fischofer-Linaro) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Extra blank line should be removed (nit). https://github.com/Linaro/odp/pull/492#discussion_r170150984 updated_at 2018-02-23 02:21:54