On Thu, Aug 4, 2016 at 11:17 AM, Brian Brooks <[email protected]>
wrote:

> On 08/04 11:01:09, Bill Fischofer wrote:
> > On Thu, Aug 4, 2016 at 10:59 AM, Mike Holmes <[email protected]>
> wrote:
> >
> > >
> > >
> > > On 4 August 2016 at 11:47, Bill Fischofer <[email protected]>
> > > wrote:
> > >
> > >>
> > >> On Thu, Aug 4, 2016 at 10:36 AM, Mike Holmes <[email protected]>
> > >> wrote:
> > >>
> > >>> On my vanilla x86 I don't get any issues. I'm keen to get this in and
> > >>> have CI run it on lots of HW to see what happens; many of the other
> > >>> tests completely fail in process mode, so I think we will expose a lot
> > >>> as we add them.
> > >>>
> > >>> On 4 August 2016 at 11:33, Bill Fischofer <[email protected]>
> > >>> wrote:
> > >>>
> > >>>>
> > >>>>
> > >>>> On Thu, Aug 4, 2016 at 10:26 AM, Brian Brooks <[email protected]>
> > >>>> wrote:
> > >>>>
> > >>>>> Reviewed-by: Brian Brooks <[email protected]>
> > >>>>>
> > >>>>> On 08/04 09:18:14, Mike Holmes wrote:
> > >>>>> > +ret=0
> > >>>>> > +
> > >>>>> > +run()
> > >>>>> > +{
> > >>>>> > +     echo odp_scheduling_run_proc starts with $1 worker threads
> > >>>>> > +     echo =====================================================
> > >>>>> > +
> > >>>>> > +     $PERFORMANCE/odp_scheduling${EXEEXT} --odph_proc -c $1 || ret=1
> > >>>>> > +}
> > >>>>> > +
> > >>>>> > +run 1
> > >>>>> > +run 8
> > >>>>> > +
> > >>>>> > +exit $ret
> > >>>>>
> > >>>>> Seeing this randomly in both multithread and multiprocess modes:
> > >>>>>
> > >>>>
> > >> Before or after you apply this patch? What environment are you seeing
> > >> these errors in? They should definitely not be happening.
> > >>>>
> > >>>>
> > >>>>>
> > >>>>> ../../../odp/platform/linux-generic/odp_queue.c:328:odp_queue_destroy():queue "sched_00_07" not empty
> > >>>>> ../../../odp/platform/linux-generic/odp_schedule.c:271:schedule_term_global():Queue not empty
> > >>>>> ../../../odp/platform/linux-generic/odp_schedule.c:294:schedule_term_global():Pool destroy fail.
> > >>>>> ../../../odp/platform/linux-generic/odp_init.c:188:_odp_term_global():ODP schedule term failed.
> > >>>>> ../../../odp/platform/linux-generic/odp_queue.c:170:odp_queue_term_global():Not destroyed queue: sched_00_07
> > >>>>> ../../../odp/platform/linux-generic/odp_init.c:195:_odp_term_global():ODP queue term failed.
> > >>>>> ../../../odp/platform/linux-generic/odp_pool.c:149:odp_pool_term_global():Not destroyed pool: odp_sched_pool
> > >>>>> ../../../odp/platform/linux-generic/odp_pool.c:149:odp_pool_term_global():Not destroyed pool: msg_pool
> > >>>>> ../../../odp/platform/linux-generic/odp_init.c:202:_odp_term_global():ODP buffer pool term failed.
> > >>>>> ~/odp_incoming/odp_build/test/common_plat/performance$ echo $?
> > >>>>> 0
> > >>>>>
> > >>>>>
> > >> Looks like we have a real issue that somehow crept into master. I can
> > >> sporadically reproduce these same errors on my x86 system. It looks like
> > >> this is also present in the monarch_lts branch.
> > >>
> > >
> > >
> > > I think that we agreed Monarch would not support process mode because
> > > we never tested for it, but for TgrM we need to start fixing it.
> > >
> >
> > Unfortunately the issue Brian identified has nothing to do with process
> > mode. This happens in regular pthread mode on all levels past v1.10.0.0
> > as far as I can see.
>
> The issue seems to emerge only under high event rates. The application asks
> for more work, but none is scheduled even though work actually remains in
> the queue, so teardown fails because the queue is not empty. There may be a
> disconnect between scheduling and queueing, or some other
> synchronization-related bug. I think I've seen something similar on an ARM
> platform, so it may be architecture independent.
>

Well, now that I'm trying to find the root issue, it's proving elusive. I
was able to get this failure in fewer than 10 runs before, but now it doesn't
want to show itself. If you can reproduce it more readily, can you get a core
dump of the failure? I've been running with this patch:

---
From ba1fa0eb943fa7a3a3c9202b9e5bf5fc2ed5d1f4 Mon Sep 17 00:00:00 2001
From: Bill Fischofer <[email protected]>
Date: Thu, 4 Aug 2016 11:38:01 -0500
Subject: [PATCH] debug: abort on cleanup errors

Signed-off-by: Bill Fischofer <[email protected]>
---
 test/performance/odp_scheduling.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/test/performance/odp_scheduling.c b/test/performance/odp_scheduling.c
index c575b70..7505df9 100644
--- a/test/performance/odp_scheduling.c
+++ b/test/performance/odp_scheduling.c
@@ -785,6 +785,7 @@ int main(int argc, char *argv[])
 	char cpumaskstr[ODP_CPUMASK_STR_SIZE];
 	odp_pool_param_t params;
 	int ret = 0;
+	int rc = 0;
 	odp_instance_t instance;
 	odph_odpthread_params_t thr_params;

@@ -953,15 +954,17 @@

 		for (j = 0; j < QUEUES_PER_PRIO; j++) {
 			queue = globals->queue[i][j];
-			odp_queue_destroy(queue);
+			rc += odp_queue_destroy(queue);
 		}
 	}

-	odp_shm_free(shm);
-	odp_queue_destroy(plain_queue);
-	odp_pool_destroy(pool);
-	odp_term_local();
-	odp_term_global(instance);
+	rc += odp_shm_free(shm);
+	rc += odp_queue_destroy(plain_queue);
+	rc += odp_pool_destroy(pool);
+	rc += odp_term_local();
+	rc += odp_term_global(instance);
+	if (rc != 0)
+		abort();

 	return ret;
 }
-- 
2.7.4
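
Since the failure is intermittent, one way to catch the core dump requested
above is to loop the test with core dumps enabled, so the abort() added by
the debug patch leaves a core file behind when cleanup fails. A rough sketch
(the repro helper, binary path, and iteration cap are placeholders, not part
of the patch or of ODP):

```shell
#!/bin/sh
# Hypothetical repro harness: re-runs a flaky test until it fails,
# with core dumps enabled, so an abort() leaves a core file to inspect.

repro() {
	cmd=$1	# test command to loop on
	max=$2	# safety cap on iterations
	i=0
	while [ "$i" -lt "$max" ]; do
		i=$((i + 1))
		if ! $cmd; then
			echo "failed on run $i"
			return 1
		fi
	done
	echo "no failure in $max runs"
	return 0
}

# Allow core files to be written, then loop the scheduler test
# (only if the binary is present in the current directory).
if [ -x "./odp_scheduling" ]; then
	ulimit -c unlimited
	repro "./odp_scheduling -c 8" 100
fi
```

With the debug patch applied, the abort() turns the previously silent
cleanup failure (exit status 0) into a SIGABRT, so the core should show the
state of the non-empty queue at teardown.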
