On 08/04 11:01:09, Bill Fischofer wrote:
> On Thu, Aug 4, 2016 at 10:59 AM, Mike Holmes <[email protected]> wrote:
>
> > On 4 August 2016 at 11:47, Bill Fischofer <[email protected]>
> > wrote:
> >
> >> On Thu, Aug 4, 2016 at 10:36 AM, Mike Holmes <[email protected]>
> >> wrote:
> >>
> >>> On my vanilla x86 I don't get any issues; keen to get this in and
> >>> have CI run it on lots of HW to see what happens. Many of the other
> >>> tests completely fail in process mode, so I think we will expose a
> >>> lot as we add them.
> >>>
> >>> On 4 August 2016 at 11:33, Bill Fischofer <[email protected]>
> >>> wrote:
> >>>
> >>>> On Thu, Aug 4, 2016 at 10:26 AM, Brian Brooks <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> Reviewed-by: Brian Brooks <[email protected]>
> >>>>>
> >>>>> On 08/04 09:18:14, Mike Holmes wrote:
> >>>>> > +ret=0
> >>>>> > +
> >>>>> > +run()
> >>>>> > +{
> >>>>> > +	echo odp_scheduling_run_proc starts with $1 worker threads
> >>>>> > +	echo =====================================================
> >>>>> > +
> >>>>> > +	$PERFORMANCE/odp_scheduling${EXEEXT} --odph_proc -c $1 || ret=1
> >>>>> > +}
> >>>>> > +
> >>>>> > +run 1
> >>>>> > +run 8
> >>>>> > +
> >>>>> > +exit $ret
> >>>>>
> >>>>> Seeing this randomly in both multithread and multiprocess modes:
> >>>>
> >>>> Before or after you apply this patch? What environment are you
> >>>> seeing these errors in? They should definitely not be happening.
> >>>>
> >>>>> ../../../odp/platform/linux-generic/odp_queue.c:328:odp_queue_destroy():queue "sched_00_07" not empty
> >>>>> ../../../odp/platform/linux-generic/odp_schedule.c:271:schedule_term_global():Queue not empty
> >>>>> ../../../odp/platform/linux-generic/odp_schedule.c:294:schedule_term_global():Pool destroy fail.
> >>>>> ../../../odp/platform/linux-generic/odp_init.c:188:_odp_term_global():ODP schedule term failed.
> >>>>> ../../../odp/platform/linux-generic/odp_queue.c:170:odp_queue_term_global():Not destroyed queue: sched_00_07
> >>>>> ../../../odp/platform/linux-generic/odp_init.c:195:_odp_term_global():ODP queue term failed.
> >>>>> ../../../odp/platform/linux-generic/odp_pool.c:149:odp_pool_term_global():Not destroyed pool: odp_sched_pool
> >>>>> ../../../odp/platform/linux-generic/odp_pool.c:149:odp_pool_term_global():Not destroyed pool: msg_pool
> >>>>> ../../../odp/platform/linux-generic/odp_init.c:202:_odp_term_global():ODP buffer pool term failed.
> >>>>> ~/odp_incoming/odp_build/test/common_plat/performance$ echo $?
> >>>>> 0
> >>>>>
> >> Looks like we have a real issue that somehow crept into master. I can
> >> sporadically reproduce these same errors on my x86 system. It looks
> >> like this is also present in the monarch_lts branch.
> >
> > I think that we agreed that Monarch would not support process mode
> > because we never tested for it, but for TgrM we need to start fixing
> > it.
>
> Unfortunately the issue Brian identified has nothing to do with process
> mode. This happens in regular pthread mode on all levels past v1.10.0.0
> as far as I can see.
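Note that in the transcript above `echo $?` prints 0 even though the teardown diagnostics were logged: the `run` wrapper only propagates the binary's exit status, and odp_scheduling itself exits 0 despite the failed termination. A hypothetical variant of the wrapper (not part of the patch; the log patterns matched are just the ones seen above) that also fails when those diagnostics appear on stderr:

```shell
#!/bin/sh
# Hypothetical run() wrapper: besides propagating the binary's exit
# status, also fail when ODP teardown diagnostics ("not empty",
# "term failed") appear on stderr, so the suite cannot report success
# while errors scroll by.
ret=0

run()
{
	workers=$1
	shift
	echo "odp_scheduling_run_proc starts with $workers worker threads"
	echo "====================================================="

	errlog=$(mktemp)
	"$@" 2>"$errlog" || ret=1
	if grep -Eq "not empty|term failed" "$errlog"; then
		cat "$errlog" >&2
		ret=1
	fi
	rm -f "$errlog"
}
```

In the real script this would be invoked as `run 1 $PERFORMANCE/odp_scheduling${EXEEXT} --odph_proc -c 1` (and again for 8), followed by `exit $ret`. Matching on log text is fragile, of course; the proper fix is for odp_scheduling to return non-zero when termination fails.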
The issue seems to emerge only under high event rates: the application asks the scheduler for more work and none is returned, yet there is still work sitting in the queue. Teardown then fails because the queue is not empty. There may be a disconnect between the scheduler and the queueing layer, or some other synchronization-related bug. I think I've seen something similar on an ARM platform, so it may be architecture-independent.
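Since the failure is sporadic, one way to chase it is to rerun the test until the teardown diagnostics appear. A minimal sketch (the `repro` helper and the 100-run cap are hypothetical; `-c 8` matches the worker count used in the test above):

```shell
#!/bin/sh
# Hypothetical repro helper: rerun the given binary until one iteration
# emits the "not empty"/"term failed" teardown diagnostics, or give up
# after 100 clean runs.
repro()
{
	bin=$1
	i=0
	while [ "$i" -lt 100 ]; do
		i=$((i + 1))
		if "$bin" -c 8 2>&1 | grep -Eq "not empty|term failed"; then
			echo "reproduced on iteration $i"
			return 1
		fi
	done
	echo "no failure in $i runs"
	return 0
}
```

Invoked as e.g. `repro ./odp_scheduling` from the performance test directory, this gives a rough feel for how often the race fires on a given machine.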
