On Wed, Apr 01, 2015 at 05:47:51PM +0300, Maxim Uvarov wrote:
> On 04/01/15 16:57, Stuart Haslam wrote:
> >On Wed, Apr 01, 2015 at 04:30:59PM +0300, Maxim Uvarov wrote:
> >>Commit:
> >> 8fa64fb9fc4e43fbd9d3c226ed89229863bdb771
> >> linux-generic: scheduler: restructured queue and pktio integration
> >>Fix a race between schedule termination and polling of the input queue.
> >>This patch locks the pktio entry while polling, to prevent the linked
> >>queue from being destroyed in the middle of the poll.
> >>
> >Admittedly I've not looked in detail at the terminate sequence after the
> >scheduler changes, so don't really understand what you're fixing, but
> >this feels like a workaround rather than a fix. Shouldn't the pktin
> >queue have been removed from the scheduler before it's closed? What's
> >the sequence that leads to the problem?
> As I understand it, one thread goes into
> schedule() -> pktin_poll(sched_cmd->pe)
> and successfully does:
> odp_pktio_recv(entry->s.handle, pkt_tbl, QUEUE_MULTI_MAX);
>
> After that another thread calls terminate, and further down in the
> pktin_poll() code:
> queue_enq_multi(qentry, hdr_tbl, num_enq);
>
> qentry here is corrupted because it has been destroyed by the other thread.
>
I see from the trace that you can reproduce this with the odp_pktio.c
validation test, which is single threaded so there's no thread race.
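
For reference, the path that blows up looks roughly like this; this is a
paraphrased sketch pieced together from the backtrace below, not the actual
odp_packet_io.c code, and the queue_to_qentry() lookup is my guess at how
the enqueue target is resolved:

int pktin_poll(pktio_entry_t *entry)
{
	odp_packet_t pkt_tbl[QUEUE_MULTI_MAX];
	odp_buffer_hdr_t *hdr_tbl[QUEUE_MULTI_MAX];
	queue_entry_t *qentry;
	int num, num_enq = 0;

	if (odp_unlikely(is_free(entry)))
		return -1;

	/* 1. pull packets from the interface */
	num = odp_pktio_recv(entry->s.handle, pkt_tbl, QUEUE_MULTI_MAX);
	if (num < 0)
		return -1;

	/* ... convert pkt_tbl[] into hdr_tbl[] / num_enq (elided) ... */

	/* 2. enqueue onto the default input queue; qentry is resolved from
	 *    entry->s.inq_default, so if that queue has been removed or
	 *    destroyed in the meantime we dereference an invalid entry,
	 *    which is the ticketlock crash in the backtrace below. */
	qentry = queue_to_qentry(entry->s.inq_default);
	queue_enq_multi(qentry, hdr_tbl, num_enq);

	return 0;
}
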
> Because qentry is linked to the pktio entry, we have to lock the pktio
> entry for that queue, to make sure it is not modified while pktin_poll()
> is executing.
>
> I ran make check from the root to hit that problem (it occurs in about
> 1 out of 10 runs).
>
> I sent the backtrace to the mailing list some time ago:
>
> Core was generated by `./test/validation/odp_pktio'.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0 0x0000000000411dc4 in odp_atomic_fetch_inc_u32 (atom=0x2baaaadfff00)
> at ./include/odp/atomic.h:70
> 70 return __atomic_fetch_add(&atom->v, 1, __ATOMIC_RELAXED);
> (gdb) bt
> #0 0x0000000000411dc4 in odp_atomic_fetch_inc_u32 (atom=0x2baaaadfff00)
> at ./include/odp/atomic.h:70
> #1 0x0000000000411e8a in odp_ticketlock_lock
> (ticketlock=0x2baaaadfff00) at odp_ticketlock.c:28
> #2 0x000000000040f0f8 in queue_enq_multi (queue=0x2baaaadfff00,
> buf_hdr=0x7fff1fccb0b0, num=1) at odp_queue.c:376
> #3 0x000000000040987d in pktin_poll (entry=0x2aaaab200600) at
> odp_packet_io.c:713
> #4 0x0000000000410378 in schedule (out_queue=0x0,
> out_ev=0x7fff1fccb1d8, max_num=1, max_deq=4) at odp_schedule.c:455
> #5 0x000000000041050a in schedule_loop (out_queue=0x0, wait=1,
> out_ev=0x7fff1fccb1d8, max_num=1, max_deq=4) at odp_schedule.c:518
> #6 0x00000000004105a4 in odp_schedule (out_queue=0x0, wait=1) at
> odp_schedule.c:551
> #7 0x0000000000402b83 in destroy_inq (pktio=0x2) at odp_pktio.c:320
So the problem sequence is this (sketched in code after the list):
1. destroy_inq() calls odp_pktio_inq_remdef()
2. odp_pktio_inq_remdef() sets the pktio's inq_default to
   ODP_QUEUE_INVALID and does nothing to remove the queue from the
   scheduler (not sure if it should?)
3. destroy_inq() calls odp_schedule()
4. odp_schedule() dequeues the event to poll the pktin queue and then
calls pktin_poll()
5. pktin_poll() attempts to fetch some packets from the pktin, and if
   it receives any, attempts to enqueue them using inq_default,
   which by this point is ODP_QUEUE_INVALID.
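
In code terms, the teardown looks roughly like this (a paraphrased sketch
only, with step numbers matching the list above; the inq lookup, the drain
loop and the odp_event_free() call are my guesses at what the test does,
not copied from test/validation/odp_pktio.c):

static void destroy_inq_sketch(odp_pktio_t pktio)
{
	odp_queue_t inq = odp_pktio_inq_getdef(pktio);
	odp_event_t ev;

	/* steps 1-2: detach the default input queue; inq_default becomes
	 * ODP_QUEUE_INVALID, but the pktin poll command is still sitting
	 * in the scheduler */
	odp_pktio_inq_remdef(pktio);

	/* steps 3-5: drain the scheduler; the stale poll command runs
	 * pktin_poll(), which receives packets and tries to enqueue them
	 * to the now-invalid default input queue */
	while ((ev = odp_schedule(NULL, ODP_SCHED_NO_WAIT)) !=
	       ODP_EVENT_INVALID)
		odp_event_free(ev);

	odp_queue_destroy(inq);
}
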
So there's a fundamental breakage. I'm not sure yet how it should be
fixed, but this patch makes it go away:
diff --git a/platform/linux-generic/odp_packet_io.c b/platform/linux-generic/odp_packet_io.c
index 15335d1..ae1de3c 100644
--- a/platform/linux-generic/odp_packet_io.c
+++ b/platform/linux-generic/odp_packet_io.c
@@ -685,6 +685,9 @@ int pktin_poll(pktio_entry_t *entry)
 	if (odp_unlikely(is_free(entry)))
 		return -1;
 
+	if (entry->s.inq_default == ODP_QUEUE_INVALID)
+		return -1;
+
 	num = odp_pktio_recv(entry->s.handle, pkt_tbl, QUEUE_MULTI_MAX);
 	if (num < 0) {
The race is that you'll only get the crash if some packets were actually
received between the last time the validation test itself called
odp_schedule() and its call to destroy_inq(). You'd never see it on the
linux-generic "loop" interface as traffic doesn't just mysteriously
appear there.
--
Stuart.