On Thu, Aug 4, 2016 at 11:17 AM, Brian Brooks <[email protected]> wrote:
> On 08/04 11:01:09, Bill Fischofer wrote:
> > On Thu, Aug 4, 2016 at 10:59 AM, Mike Holmes <[email protected]> wrote:
> > >
> > > On 4 August 2016 at 11:47, Bill Fischofer <[email protected]> wrote:
> > >
> > >> On Thu, Aug 4, 2016 at 10:36 AM, Mike Holmes <[email protected]>
> > >> wrote:
> > >>
> > >>> On my vanilla x86 I don't get any issues. Keen to get this in and
> > >>> have CI run it on lots of HW to see what happens; many of the other
> > >>> tests completely fail in process mode, so I think we will expose a
> > >>> lot as we add them.
> > >>>
> > >>> On 4 August 2016 at 11:33, Bill Fischofer <[email protected]>
> > >>> wrote:
> > >>>
> > >>>> On Thu, Aug 4, 2016 at 10:26 AM, Brian Brooks
> > >>>> <[email protected]> wrote:
> > >>>>
> > >>>>> Reviewed-by: Brian Brooks <[email protected]>
> > >>>>>
> > >>>>> On 08/04 09:18:14, Mike Holmes wrote:
> > >>>>> > +ret=0
> > >>>>> > +
> > >>>>> > +run()
> > >>>>> > +{
> > >>>>> > +	echo odp_scheduling_run_proc starts with $1 worker threads
> > >>>>> > +	echo =====================================================
> > >>>>> > +
> > >>>>> > +	$PERFORMANCE/odp_scheduling${EXEEXT} --odph_proc -c $1 || ret=1
> > >>>>> > +}
> > >>>>> > +
> > >>>>> > +run 1
> > >>>>> > +run 8
> > >>>>> > +
> > >>>>> > +exit $ret
> > >>>>>
> > >>>>> Seeing this randomly in both multithread and multiprocess modes:
> > >>>>
> > >>>> Before or after you apply this patch? What environment are you
> > >>>> seeing these errors in? They should definitely not be happening.
> > >>>>
> > >>>>> ../../../odp/platform/linux-generic/odp_queue.c:328:odp_queue_destroy():queue "sched_00_07" not empty
> > >>>>> ../../../odp/platform/linux-generic/odp_schedule.c:271:schedule_term_global():Queue not empty
> > >>>>> ../../../odp/platform/linux-generic/odp_schedule.c:294:schedule_term_global():Pool destroy fail.
> > >>>>> ../../../odp/platform/linux-generic/odp_init.c:188:_odp_term_global():ODP schedule term failed.
> > >>>>> ../../../odp/platform/linux-generic/odp_queue.c:170:odp_queue_term_global():Not destroyed queue: sched_00_07
> > >>>>> ../../../odp/platform/linux-generic/odp_init.c:195:_odp_term_global():ODP queue term failed.
> > >>>>> ../../../odp/platform/linux-generic/odp_pool.c:149:odp_pool_term_global():Not destroyed pool: odp_sched_pool
> > >>>>> ../../../odp/platform/linux-generic/odp_pool.c:149:odp_pool_term_global():Not destroyed pool: msg_pool
> > >>>>> ../../../odp/platform/linux-generic/odp_init.c:202:_odp_term_global():ODP buffer pool term failed.
> > >>>>> ~/odp_incoming/odp_build/test/common_plat/performance$ echo $?
> > >>>>> 0
> > >>>>
> > >>
> > >> Looks like we have a real issue that somehow crept into master. I can
> > >> sporadically reproduce these same errors on my x86 system. It looks
> > >> like this is also present in the monarch_lts branch.
> > >>
> > >
> > > I think we agreed that Monarch would not support process mode because
> > > we never tested for it, but for TgrM we need to start fixing it.
> > >
> >
> > Unfortunately, the issue Brian identified has nothing to do with process
> > mode. It happens in regular pthread mode on all levels past v1.10.0.0 as
> > far as I can see.
>
> The issue seems to emerge only under high event rates. The application
> asks for more work, but none will be scheduled. However, there actually
> will be work in the queue, so the teardown will fail because the queue is
> not empty. There may be a disconnect between the scheduling and the
> queueing, or some other synchronization-related bug. I think I've seen
> something similar on an ARM platform, so it may be architecture
> independent.
>
> Well, now that I'm trying to find the root issue, it's proving elusive.
I was able to get this failure in < 10 runs before but now it doesn't want
to show itself. If you can repro this more readily, can you get a core dump
of the failure? I've been running with this patch:

---
From ba1fa0eb943fa7a3a3c9202b9e5bf5fc2ed5d1f4 Mon Sep 17 00:00:00 2001
From: Bill Fischofer <[email protected]>
Date: Thu, 4 Aug 2016 11:38:01 -0500
Subject: [PATCH] debug: abort on cleanup errors

Signed-off-by: Bill Fischofer <[email protected]>
---
 test/performance/odp_scheduling.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/test/performance/odp_scheduling.c b/test/performance/odp_scheduling.c
index c575b70..7505df9 100644
--- a/test/performance/odp_scheduling.c
+++ b/test/performance/odp_scheduling.c
@@ -785,6 +785,7 @@ int main(int argc, char *argv[])
 	char cpumaskstr[ODP_CPUMASK_STR_SIZE];
 	odp_pool_param_t params;
 	int ret = 0;
+	int rc = 0;
 	odp_instance_t instance;
 	odph_odpthread_params_t thr_params;
@@ -953,15 +954,17 @@ int main(int argc, char *argv[])
 		for (j = 0; j < QUEUES_PER_PRIO; j++) {
 			queue = globals->queue[i][j];
-			odp_queue_destroy(queue);
+			rc += odp_queue_destroy(queue);
 		}
 	}
-	odp_shm_free(shm);
-	odp_queue_destroy(plain_queue);
-	odp_pool_destroy(pool);
-	odp_term_local();
-	odp_term_global(instance);
+	rc += odp_shm_free(shm);
+	rc += odp_queue_destroy(plain_queue);
+	rc += odp_pool_destroy(pool);
+	rc += odp_term_local();
+	rc += odp_term_global(instance);
+
+	if (rc != 0)
+		abort();

 	return ret;
 }
--
2.7.4
