git bisect shows this problem starts surfacing with this commit:
eea929d61e25106adc2598448c865f40e4a6f13b is the first bad commit
commit eea929d61e25106adc2598448c865f40e4a6f13b
Author: Petri Savolainen <[email protected]>
Date: Thu Nov 10 13:07:39 2016 +0200
linux-gen: pool: reimplement pool with ring
Used the ring data structure to implement pool. Also
buffer structure was simplified to enable future driver
interface. Every buffer includes a packet header, so each
buffer can be used as a packet head or segment. Segmentation
was disabled and segment size was fixed to a large number
(64kB) to limit the number of modification in the commit.
Signed-off-by: Petri Savolainen <[email protected]>
I don't think the issue is necessarily with this patch but rather that the
efficiency improvements are probably exposing a latent race condition
elsewhere in ordered queue handling. This needs further investigation.
On Wed, Nov 16, 2016 at 3:01 PM, Bill Fischofer <[email protected]>
wrote:
> Trying to reproduce this I'm seeing sporadic failures in the scheduler
> validation test that don't seem to appear in the base api-next branch.
> Issue seems to be failures in the ordered queue tests:
>
> Test: scheduler_test_multi_mq_mt_prio_n
> ...linux.c:273:odpthread_run_start_routine():helper:
> ODP worker thread started as linux pthread. (pid=6274)
> linux.c:273:odpthread_run_start_routine():helper: ODP worker thread
> started as linux pthread. (pid=6274)
> linux.c:273:odpthread_run_start_routine():helper: ODP worker thread
> started as linux pthread. (pid=6274)
> linux.c:273:odpthread_run_start_routine():helper: ODP worker thread
> started as linux pthread. (pid=6274)
> passed
> Test: scheduler_test_multi_mq_mt_prio_a
> ...linux.c:273:odpthread_run_start_routine():helper:
> ODP worker thread started as linux pthread. (pid=6274)
> linux.c:273:odpthread_run_start_routine():helper: ODP worker thread
> started as linux pthread. (pid=6274)
> linux.c:273:odpthread_run_start_routine():helper: ODP worker thread
> started as linux pthread. (pid=6274)
> linux.c:273:odpthread_run_start_routine():helper: ODP worker thread
> started as linux pthread. (pid=6274)
> passed
> Test: scheduler_test_multi_mq_mt_prio_o
> ...linux.c:273:odpthread_run_start_routine():helper:
> ODP worker thread started as linux pthread. (pid=6274)
> linux.c:273:odpthread_run_start_routine():helper: ODP worker thread
> started as linux pthread. (pid=6274)
> linux.c:273:odpthread_run_start_routine():helper: ODP worker thread
> started as linux pthread. (pid=6274)
> linux.c:273:odpthread_run_start_routine():helper: ODP worker thread
> started as linux pthread. (pid=6274)
> FAILED
> 1. scheduler.c:871 - bctx->sequence == seq
> 2. scheduler.c:871 - bctx->sequence == seq
> Test: scheduler_test_multi_1q_mt_a_excl
> ...linux.c:273:odpthread_run_start_routine():helper:
> ODP worker thread started as linux pthread. (pid=6274)
>
> We had seen these earlier but they were never consistently reproducible.
> Petri: are you able to recreate this on your local systems?
>
>
> On Wed, Nov 16, 2016 at 2:03 PM, Maxim Uvarov <[email protected]>
> wrote:
>
>> I cannot test this series patch by patch because the test runs fail in
>> different ways (one time it was TM, another time the kernel died, and
>> another time the OOM killer killed the tests and then the kernel hung).
>>
>> And for all patches test/common_plat/validation/api/pktio/pktio_main
>> hangs forever:
>>
>>
>> Program received signal SIGINT, Interrupt.
>> 0x00002afbe69ffb80 in __nanosleep_nocancel () at
>> ../sysdeps/unix/syscall-template.S:81
>> 81 in ../sysdeps/unix/syscall-template.S
>> (gdb) bt
>> #0 0x00002afbe69ffb80 in __nanosleep_nocancel () at
>> ../sysdeps/unix/syscall-template.S:81
>> #1 0x0000000000415ced in odp_pktin_recv_tmo (queue=...,
>> packets=packets@entry=0x7ffed64d8bd0, num=num@entry=1,
>> wait=wait@entry=18446744073709551615) at
>> ../../../platform/linux-generic/odp_packet_io.c:1584
>> #2 0x00000000004047fa in recv_packets_tmo (pktio=pktio@entry=0x2,
>> pkt_tbl=pkt_tbl@entry=0x7ffed64d9500,
>> seq_tbl=seq_tbl@entry=0x7ffed64d94b0, num=num@entry=1,
>> mode=mode@entry=RECV_TMO, tmo=tmo@entry=18446744073709551615, ns=ns@entry=0)
>> at ../../../../../../test/common_plat/validation/api/pktio/pktio.c:515
>> #3 0x00000000004075f8 in test_recv_tmo (mode=RECV_TMO) at
>> ../../../../../../test/common_plat/validation/api/pktio/pktio.c:940
>> #4 0x00002afbe61cc482 in run_single_test () from
>> /usr/local/lib/libcunit.so.1
>> #5 0x00002afbe61cc0b2 in run_single_suite () from
>> /usr/local/lib/libcunit.so.1
>> #6 0x00002afbe61c9d55 in CU_run_all_tests () from
>> /usr/local/lib/libcunit.so.1
>> #7 0x00002afbe61ce245 in basic_run_all_tests () from
>> /usr/local/lib/libcunit.so.1
>> #8 0x00002afbe61cdfe7 in CU_basic_run_tests () from
>> /usr/local/lib/libcunit.so.1
>> #9 0x0000000000409361 in odp_cunit_run () at
>> ../../../../test/common_plat/common/odp_cunit_common.c:298
>> #10 0x00002afbe6c2ff45 in __libc_start_main (main=0x403850 <main>,
>> argc=1, argv=0x7ffed64d9878, init=<optimized out>,
>> fini=<optimized out>, rtld_fini=<optimized out>,
>> stack_end=0x7ffed64d9868) at libc-start.c:287
>> #11 0x000000000040387e in _start ()
>> (gdb) up
>> #1 0x0000000000415ced in odp_pktin_recv_tmo (queue=...,
>> packets=packets@entry=0x7ffed64d8bd0, num=num@entry=1,
>> wait=wait@entry=18446744073709551615) at
>> ../../../platform/linux-generic/odp_packet_io.c:1584
>> 1584 nanosleep(&ts, NULL);
>> (gdb) p ts
>> $1 = {tv_sec = 0, tv_nsec = 1000}
>> (gdb) l
>> 1579 }
>> 1580
>> 1581 wait--;
>> 1582 }
>> 1583
>> 1584 nanosleep(&ts, NULL);
>> 1585 }
>> 1586 }
>> 1587
>> 1588 int odp_pktin_recv_mq_tmo(const odp_pktin_queue_t queues[],
>> unsigned num_q,
>> (gdb) up
>> #2 0x00000000004047fa in recv_packets_tmo (pktio=pktio@entry=0x2,
>> pkt_tbl=pkt_tbl@entry=0x7ffed64d9500,
>> seq_tbl=seq_tbl@entry=0x7ffed64d94b0, num=num@entry=1,
>> mode=mode@entry=RECV_TMO, tmo=tmo@entry=18446744073709551615, ns=ns@entry=0)
>> at ../../../../../../test/common_plat/validation/api/pktio/pktio.c:515
>> 515 n = odp_pktin_recv_tmo(pktin[0], pkt_tmp, num - num_rx,
>> (gdb) p num - num_rx
>> $2 = 1
>> (gdb) l
>> 510 /** Multiple odp_pktin_recv_tmo()/odp_pktin_recv_mq_tmo() calls may be
>> 511 * required to discard possible non-test packets. */
>> 512 do {
>> 513 ts1 = odp_time_global();
>> 514 if (mode == RECV_TMO)
>> 515 n = odp_pktin_recv_tmo(pktin[0], pkt_tmp, num - num_rx,
>> 516 tmo);
>> 517 else
>> 518 n = odp_pktin_recv_mq_tmo(pktin, (unsigned)num_q,
>> 519 from, pkt_tmp,
>> (gdb) p tmo
>> $3 = 18446744073709551615
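
For readers following the trace: the timeout printed above, 18446744073709551615, is UINT64_MAX. The listing shows a loop that decrements `wait` and sleeps 1000 ns per pass. Below is a minimal sketch of that kind of loop, not the ODP source. All names (`WAIT_FOREVER`, `try_recv_fn`, `recv_with_budget`, `poll_after`) are hypothetical, and skipping the decrement for the sentinel value is an assumption. Even if the sentinel were decremented, 2^64 passes at 1000 ns each is on the order of 585,000 years, so a call with this timeout returns only when a packet arrives.

```c
#define _POSIX_C_SOURCE 199309L  /* for nanosleep() under strict -std modes */
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical names throughout; this is an illustration, not ODP code. */
#define WAIT_FOREVER UINT64_MAX  /* 18446744073709551615, as printed by gdb */

/* Caller-supplied non-blocking receive: returns the number of items received. */
typedef int (*try_recv_fn)(void *ctx);

/* Retry a non-blocking receive until it returns non-zero or the retry
 * budget 'wait' is used up. Each empty pass sleeps 1000 ns, like the
 * nanosleep() call in the listing. With wait == WAIT_FOREVER the budget
 * is never decremented (an assumption), so the loop ends only when
 * try_recv() returns non-zero. */
static int recv_with_budget(try_recv_fn try_recv, void *ctx, uint64_t wait)
{
	struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000 };

	assert(try_recv != NULL);

	for (;;) {
		int n = try_recv(ctx);

		if (n != 0)
			return n;       /* something arrived (or an error code) */

		if (wait != WAIT_FOREVER) {
			if (wait == 0)
				return 0;   /* budget exhausted: timed out */
			wait--;
		}

		nanosleep(&ts, NULL);
	}
}

/* Example receive function: reports one item after *ctx empty polls. */
static int poll_after(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return 0;
	}
	return 1;
}
```

Calling `recv_with_budget(poll_after, &n, 2)` with `n = 5` times out and returns 0, while the same call with `WAIT_FOREVER` keeps polling until `poll_after` reports an item. If the test packet never shows up, a caller using the sentinel would sit in `nanosleep()` indefinitely, which matches where the backtrace was interrupted.
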
>>
>>
>> I applied the patches and ran the following script as root:
>> CLEANUP=0 GIT_URL=/opt/Linaro/odp3.git GIT_BRANCH=api-next ./build.sh
>>
>> Need more investigation into this issue... Not applied yet.
>>
>> Maxim.
>>
>> On 11/16/16 02:58, Bill Fischofer wrote:
>>
>>> Trying again as the repost doesn't seem to show up on the list either.
>>>
>>> For this series:
>>>
>>> Reviewed-and-tested-by: Bill Fischofer <[email protected]
>>> <mailto:[email protected]>>
>>>
>>> On Tue, Nov 15, 2016 at 5:55 PM, Bill Fischofer <
>>> [email protected] <mailto:[email protected]>> wrote:
>>>
>>> Reposting this since it doesn't seem to have made it to the
>>> mailing list.
>>>
>>> For this series:
>>>
>>> Reviewed-and-tested-by: Bill Fischofer <[email protected]
>>> <mailto:[email protected]>>
>>>
>>> On Tue, Nov 15, 2016 at 8:41 AM, Bill Fischofer
>>> <[email protected] <mailto:[email protected]>>
>>> wrote:
>>>
>>> For this series:
>>>
>>> Reviewed-and-tested-by: Bill Fischofer
>>> <[email protected] <mailto:[email protected]>>
>>>
>>> On Thu, Nov 10, 2016 at 5:07 AM, Petri Savolainen
>>> <[email protected]
>>> <mailto:[email protected]>> wrote:
>>>
>>> Pool performance is optimized by using a ring as the global buffer
>>> storage. The IPC build is disabled, since it needs large modifications
>>> due to its dependency on pool internals. The old pool implementation
>>> was based on locks and a linked list of buffer headers. The new
>>> implementation maintains a ring of buffer handles, which enables fast,
>>> burst-based allocs and frees. A ring also scales better with the number
>>> of CPUs than a list (enq and deq operations update opposite ends of the
>>> pool).
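
To illustrate the idea in the paragraph above, here is a minimal single-threaded sketch of a ring of buffer handles with burst enqueue (free) and dequeue (alloc). It is not the code from this series: the names (`ring_t`, `ring_enq_multi`, `ring_deq_multi`, `RING_SIZE`) are hypothetical, the handles are plain integers, and all synchronization between concurrent producers and consumers is left out, which a real pool would need.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch; single-threaded, no atomics or locking. */
#define RING_SIZE 8u                 /* must be a power of two */
#define RING_MASK (RING_SIZE - 1u)

typedef struct {
	uint32_t head;               /* next slot to dequeue (alloc side) */
	uint32_t tail;               /* next slot to enqueue (free side) */
	uint32_t handle[RING_SIZE];  /* stored buffer handles */
} ring_t;

static void ring_init(ring_t *r)
{
	assert((RING_SIZE & RING_MASK) == 0); /* power-of-two check */
	r->head = 0;
	r->tail = 0;
}

/* Number of handles currently stored; unsigned wrap-around is intended. */
static uint32_t ring_count(const ring_t *r)
{
	return r->tail - r->head;
}

/* Free path: store up to 'num' handles, return how many were stored. */
static uint32_t ring_enq_multi(ring_t *r, const uint32_t hdl[], uint32_t num)
{
	uint32_t room = RING_SIZE - ring_count(r);
	uint32_t i;

	if (num > room)
		num = room;

	for (i = 0; i < num; i++)
		r->handle[(r->tail + i) & RING_MASK] = hdl[i];

	r->tail += num;              /* only the tail end is updated */
	return num;
}

/* Alloc path: take up to 'num' handles, return how many were taken. */
static uint32_t ring_deq_multi(ring_t *r, uint32_t hdl[], uint32_t num)
{
	uint32_t avail = ring_count(r);
	uint32_t i;

	if (num > avail)
		num = avail;

	for (i = 0; i < num; i++)
		hdl[i] = r->handle[(r->head + i) & RING_MASK];

	r->head += num;              /* only the head end is updated */
	return num;
}
```

Enqueue touches only `tail` and dequeue touches only `head`, which is the "opposite ends" property the cover letter refers to. A burst moves several handles with one index update instead of one list operation per buffer.
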
>>>
>>> L2fwd link rate (%), 2 x 40GE, 64 byte packets
>>>
>>>         direct-                 parallel-               atomic-
>>> cpus    orig  direct    diff    orig  parall    diff    orig  atomic    diff
>>>    1     7 %     8 %     1 %     6 %     6 %     2 %   5.4 %   5.6 %     4 %
>>>    2    14 %    15 %     7 %     9 %     9 %     5 %     8 %     9 %     8 %
>>>    4    28 %    30 %     6 %    13 %    14 %    13 %    12 %    15 %    19 %
>>>    6    42 %    44 %     6 %    16 %    19 %    19 %     8 %    20 %   150 %
>>>    8    46 %    59 %    28 %    19 %    23 %    26 %    18 %    24 %    34 %
>>>   10    55 %    57 %     3 %    20 %    27 %    37 %     8 %    28 %   264 %
>>>   12    56 %    56 %    -1 %    22 %    31 %    43 %     7 %    32 %   357 %
>>>
>>> The maximum packet rate of the NICs is reached with 10-12 CPUs in
>>> direct mode. In all other cases performance improved. Scheduler-driven
>>> cases in particular had suffered from poor pool scalability.
>>>
>>> changed in v3:
>>> * rebased
>>> * ipc disabled with #ifdef
>>> * added support for multi-segment packets
>>> * API: added explicit limits for packet length in alloc calls
>>> * Corrected validation test and example application bugs
>>> found during
>>> segmentation implementation
>>>
>>> changed in v2:
>>> * rebased to api-next branch
>>> * added a comment that ring size must be larger than
>>> number of items in it
>>> * fixed clang build issue
>>> * added parens in align macro
>>>
>>> v1 reviews:
>>> Reviewed-by: Brian Brooks <[email protected]
>>> <mailto:[email protected]>>
>>>
>>>
>>>
>>>
>>> Petri Savolainen (19):
>>> linux-gen: ipc: disable build of ipc pktio
>>> linux-gen: pktio: do not free zero packets
>>> linux-gen: ring: created common ring implementation
>>> linux-gen: align: added round up power of two
>>> linux-gen: pool: reimplement pool with ring
>>> linux-gen: ring: added multi enq and deq
>>> linux-gen: pool: use ring multi enq and deq operations
>>> linux-gen: pool: optimize buffer alloc
>>> linux-gen: pool: clean up pool inlines functions
>>> linux-gen: pool: ptr instead of hdl in buffer_alloc_multi
>>> test: validation: buf: test alignment
>>> test: performance: crypto: use capability to select max packet
>>> test: correctly initialize pool parameters
>>> test: validation: packet: fix bugs in tailroom and concat tests
>>> linux-gen: packet: added support for segmented packets
>>> test: validation: packet: improved multi-segment alloc test
>>> api: packet: added limits for packet len on alloc
>>> linux-gen: packet: remove zero len support from alloc
>>> linux-gen: packet: enable multi-segment packets
>>>
>>> example/generator/odp_generator.c | 2 +-
>>> include/odp/api/spec/packet.h | 9 +-
>>> include/odp/api/spec/pool.h | 6 +
>>> platform/linux-generic/Makefile.am | 1 +
>>> .../include/odp/api/plat/packet_types.h | 6 +-
>>> .../include/odp/api/plat/pool_types.h | 6 -
>>> .../linux-generic/include/odp_align_internal.h | 34 +-
>>> .../linux-generic/include/odp_buffer_inlines.h | 167 +--
>>> .../linux-generic/include/odp_buffer_internal.h | 120 +-
>>> .../include/odp_classification_datamodel.h | 2 +-
>>> .../linux-generic/include/odp_config_internal.h | 55 +-
>>> .../linux-generic/include/odp_packet_internal.h | 87 +-
>>> platform/linux-generic/include/odp_pool_internal.h | 289 +---
>>> platform/linux-generic/include/odp_ring_internal.h | 176 +++
>>> .../linux-generic/include/odp_timer_internal.h | 4 -
>>> platform/linux-generic/odp_buffer.c | 22 +-
>>> platform/linux-generic/odp_classification.c | 25 +-
>>> platform/linux-generic/odp_crypto.c | 12 +-
>>> platform/linux-generic/odp_packet.c | 717 ++++++++--
>>> platform/linux-generic/odp_packet_io.c | 2 +-
>>> platform/linux-generic/odp_pool.c | 1440 ++++++++------------
>>> platform/linux-generic/odp_queue.c | 4 +-
>>> platform/linux-generic/odp_schedule.c | 102 +-
>>> platform/linux-generic/odp_schedule_ordered.c | 4 +-
>>> platform/linux-generic/odp_timer.c | 3 +-
>>> platform/linux-generic/pktio/dpdk.c | 10 +-
>>> platform/linux-generic/pktio/ipc.c | 3 +-
>>> platform/linux-generic/pktio/loop.c | 2 +-
>>> platform/linux-generic/pktio/netmap.c | 14 +-
>>> platform/linux-generic/pktio/socket.c | 17 +-
>>> platform/linux-generic/pktio/socket_mmap.c | 10 +-
>>> test/common_plat/performance/odp_crypto.c | 47 +-
>>> test/common_plat/performance/odp_pktio_perf.c | 2 +-
>>> test/common_plat/performance/odp_scheduling.c | 8 +-
>>> test/common_plat/validation/api/buffer/buffer.c | 113 +-
>>> test/common_plat/validation/api/crypto/crypto.c | 2 +-
>>> test/common_plat/validation/api/packet/packet.c | 96 +-
>>> test/common_plat/validation/api/pktio/pktio.c | 21 +-
>>> 38 files changed, 1745 insertions(+), 1895 deletions(-)
>>> create mode 100644 platform/linux-generic/include/odp_ring_internal.h
>>>
>>> --
>>> 2.8.1
>>>
>>>
>>>
>>>
>>>
>>
>