On Thu, Apr 02, 2015 at 11:21:57AM +0300, Ciprian Barbu wrote:
> On Wed, Apr 1, 2015 at 7:23 PM, Maxim Uvarov <[email protected]> wrote:
> > On 04/01/15 18:29, Stuart Haslam wrote:
> >>
> >> On Wed, Apr 01, 2015 at 05:47:51PM +0300, Maxim Uvarov wrote:
> >>>
> >>> On 04/01/15 16:57, Stuart Haslam wrote:
> >>>>
> >>>> On Wed, Apr 01, 2015 at 04:30:59PM +0300, Maxim Uvarov wrote:
> >>>>>
> >>>>> Commit:
> >>>>>   8fa64fb9fc4e43fbd9d3c226ed89229863bdb771
> >>>>>   linux-generic: scheduler: restructured queue and pktio integration
> >>>>> introduced a race between schedule termination and polling the input
> >>>>> queue. This patch locks the pktio entry while polling, to prevent the
> >>>>> linked queue from being destroyed in the middle of the poll.
> >>>>>
> >>>> Admittedly I've not looked in detail at the terminate sequence after the
> >>>> scheduler changes, so don't really understand what you're fixing, but
> >>>> this feels like a workaround rather than a fix. Shouldn't the pktin
> >>>> queue have been removed from the scheduler before it's closed? What's
> >>>> the sequence that leads to the problem?
> >>>
> >>> As I understand it, one thread goes into
> >>> schedule()->pktin_poll(sched_cmd->pe)
> >>> and successfully does:
> >>> odp_pktio_recv(entry->s.handle, pkt_tbl, QUEUE_MULTI_MAX);
> >>>
> >>> After that, another thread calls terminate, and further down in the
> >>> pktin_poll code:
> >>> queue_enq_multi(qentry, hdr_tbl, num_enq);
> >>>
> >>> qentry is corrupted here because it has been destroyed by the other
> >>> thread.
> >>>
> >> I see from the trace that you can reproduce this with the odp_pktio.c
> >> validation test, which is single threaded so there's no thread race.
> >>
> >>> Because qentry is linked to the pktio entry, we have to lock the
> >>> pktio entry for that queue to make sure it is not modified while
> >>> pktin_poll is executing.
> >>>
> >>> I ran make check from the top-level directory to reproduce the
> >>> problem (it occurs in about 1 of 10 runs).
> >>>
> >>> I sent a backtrace to the mailing list some time ago:
> >>>
> >>> Core was generated by `./test/validation/odp_pktio'.
> >>> Program terminated with signal SIGSEGV, Segmentation fault.
> >>> #0  0x0000000000411dc4 in odp_atomic_fetch_inc_u32 (atom=0x2baaaadfff00)
> >>> at ./include/odp/atomic.h:70
> >>> 70        return __atomic_fetch_add(&atom->v, 1, __ATOMIC_RELAXED);
> >>> (gdb) bt
> >>> #0  0x0000000000411dc4 in odp_atomic_fetch_inc_u32 (atom=0x2baaaadfff00)
> >>> at ./include/odp/atomic.h:70
> >>> #1  0x0000000000411e8a in odp_ticketlock_lock
> >>> (ticketlock=0x2baaaadfff00) at odp_ticketlock.c:28
> >>> #2  0x000000000040f0f8 in queue_enq_multi (queue=0x2baaaadfff00,
> >>> buf_hdr=0x7fff1fccb0b0, num=1) at odp_queue.c:376
> >>> #3  0x000000000040987d in pktin_poll (entry=0x2aaaab200600) at
> >>> odp_packet_io.c:713
> >>> #4  0x0000000000410378 in schedule (out_queue=0x0,
> >>> out_ev=0x7fff1fccb1d8, max_num=1, max_deq=4) at odp_schedule.c:455
> >>> #5  0x000000000041050a in schedule_loop (out_queue=0x0, wait=1,
> >>> out_ev=0x7fff1fccb1d8, max_num=1, max_deq=4) at odp_schedule.c:518
> >>> #6  0x00000000004105a4 in odp_schedule (out_queue=0x0, wait=1) at
> >>> odp_schedule.c:551
> >>> #7  0x0000000000402b83 in destroy_inq (pktio=0x2) at odp_pktio.c:320
> >>
> >> So the problem sequence is this;
> >>
> >>   1. destroy_inq() calls odp_pktio_inq_remdef()
> >>   2. odp_pktio_inq_remdef() sets the pktio's inq_default to
> >>      ODP_QUEUE_INVALID and does nothing to remove the queue from
> >>      scheduler (not sure if it should?)
> >>   3. destroy_inq() calls odp_schedule()
> >>   4. odp_schedule() dequeues the event to poll the pktin queue and then
> >>      calls pktin_poll()
> >>   5. pktin_poll() attempts to fetch some packets from the pktin, and if
> >>      it receives any, attempts to enqueue them using inq_default,
> >>      which by this point is ODP_QUEUE_INVALID.
> >>
> >> So there's a fundamental breakage. I'm not sure yet how it should be
> >> fixed but this patch will make it go away;
> >
> > Yes, confirmed. Thanks for describing that. The patch fixes the current
> > problem, so I think it should go in before the next tag.
> > Please send it to the mailing list.
> 
> There are two issues here.
> 
> The first is the one Stuart's patch fixes: after removing the input
> queue from the pktio, it is possible that some thread will still call
> odp_schedule, which causes the scheduler to pull packets from the
> hardware with nowhere to enqueue them, because there is no input
> queue.
> 
> The second issue is that the pktio test suite first removes the input
> queue and then expects odp_schedule to return something. This logic is
> broken and should be fixed, and perhaps even documented.
> 

The logic isn't broken. It doesn't really expect anything to be returned;
all it's doing is receiving events and freeing them. It does this in
order to ensure the input queue is empty, which is required before
odp_queue_destroy() is called. It calls odp_pktio_inq_remdef() first,
which means "stop delivering incoming packets into the input queue",
then calls odp_schedule() until no events are received, in order to
flush out anything that arrived between the last odp_schedule() call
and the odp_pktio_inq_remdef() call.

If we are to fix anything it should be the restriction that queues must
be empty before they are destroyed.

--
Stuart.
_______________________________________________
lng-odp mailing list
[email protected]
https://lists.linaro.org/mailman/listinfo/lng-odp
