On 29.11.2006 [10:15:30 +1100], David Gibson wrote:
> On Tue, Nov 28, 2006 at 01:49:03PM -0800, Nishanth Aravamudan wrote:
> > On 16.11.2006 [10:50:13 +1100], David Gibson wrote:
> > > On Wed, Nov 15, 2006 at 01:09:07PM -0800, Nishanth Aravamudan wrote:
> > > > On 15.11.2006 [10:41:19 +1100], David Gibson wrote:
> > > > > On Tue, Nov 14, 2006 at 02:33:59PM -0800, Nishanth Aravamudan wrote:
> > > > > > On 14.11.2006 [12:21:36 -0800], Nishanth Aravamudan wrote:
> > > > > > > Hi all,
> > > > > > >
> > > > > > > I'm hitting a brick wall debugging the linkshare segfaults I'm 
> > > > > > > seeing.
> > > > > > >
> > > > > > > (These logs are from my 2-way x86_64, but I'm seeing similar 
> > > > > > > issues on a G5
> > > > > > > (ppc64):
> > > > > > >
> > > > > > > HUGETLB_SHARE=2 xB.linkshare 2 (32):    PASS
> > > > > > > HUGETLB_SHARE=2 xB.linkshare 2 (64):    PASS
> > > > > > > HUGETLB_SHARE=1 xB.linkshare 2 (32):    FAIL    2 of 2 children 
> > > > > > > exited abnormally
> > > > > > > HUGETLB_SHARE=1 xB.linkshare 2 (64):    FAIL    2 of 2 children 
> > > > > > > exited abnormally
> > > > > > > HUGETLB_SHARE=2 xBDT.linkshare 2 (32):  PASS
> > > > > > > HUGETLB_SHARE=2 xBDT.linkshare 2 (64):  PASS
> > > > > > > HUGETLB_SHARE=1 xBDT.linkshare 2 (32):  FAIL    2 of 2 children 
> > > > > > > exited abnormally
> > > > > > > HUGETLB_SHARE=1 xBDT.linkshare 2 (64):  FAIL    2 of 2 children 
> > > > > > > exited abnormally
> > > > > > >
> > > > > > > With all 4 failures being segmentation faults we caught.
> > > > > >
> > > > > > /me hangs head in shame.
> > > > > >
> > > > > > This is all probably just a stupid programming error on my part. 
> > > > > > I'll
> > > > > > have a fix, I think, once I return from class.
> > > > >
> > > > > Btw, some of the existing testcases (e.g. alloc-instantiate-race) use
> > > > > strsignal() and WTERMSIG() to give a more informative message when a
> > > > > child is killed by a signal.  It's probably a good idea to use that
> > > > > here too, so you can see they died with a SEGV at first glance.
> > > >
> > > > Yes, this is done with a verbose test run. If I were to do it via a
> > > > FAIL statement, we'd get 3 FAIL lines for every failing case. I
> > > > suppose I could add a FAIL_CONT() for this...
> > >
> > > Um.. I don't really follow you.
> >
> > If you look at the patch I sent previously, we do print out the signal
> > information with strsignal, but via verbose_printf(). If I were to do so
> > via a FAIL() line, we'd either only print out the signal for the first
> > child (since the testcase would fail immediately), or we'd have to add a
> > FAIL_CONT() or something to allow me to indicate failure without failing
> > the testcase immediately.
> 
> I don't really see that failing after the first child comes back with
> a signal is a problem.  I don't see any situations where the return
> code from the second child will be vital for debugging.

Ok, I can change it easily enough.

> In any case you mustn't call FAIL() directly from the child process.

While I agree this is messy in principle, I don't see any reason that it
can't be called. It's simply a wrapper for exit()? I see that it calls
cleanup() and there is a claim that cleanup() must be defined in the
tests but this is untrue AFAICT.

> > In any case, I've spent a good amount of time cleaning and fixing the
> > linkshare testcase yesterday and today. Here is what I have so far. We
> > are still getting segfaults on xBDT.linkshare 64-bit with
> > HUGETLB_SHARE=2, but I went and checked and it's not a testcase issue, I
> > don't think. We also will segfault, for instance, if xBDT.linkhuge
> > 64-bit is run manually with HUGETLB_SHARE=2 two times in a row. I am now
> > looking into the root cause of this failure.
> 
> What sort of system is this on?  I've certainly had 64-bit
> xBDT.linkshare work on some setups.

This is with the *reworked* linkshare. The version currently in the
library is broken in multiple ways.

> > The patch is pretty much ready for inclusion, I think, but I'd like one
> > more round of review.
> 
> Looks like a nice cleanup, but some further comments below.

<snip>

> > +static int child_process(char *self, int num_sharings, int index)
> >  {
> >     int i;
> > -   int shmid;
> > -   ino_t *shm;
> > +   ino_t ino;
> > +
> > +   get_link_string(self);
> > +
> > +   shmid = shmget(SHM_KEY, num_sharings * NUM_TESTS *
> > +                                                   sizeof(ino_t), 0666);
> > +   if (shmid < 0)
> > +           FAIL("Child's shmget failed: %s", strerror(errno));
> 
> As mentioned above, a direct FAIL() from the child will always cause
> trouble - what will the parent do.  Instead just do an exit() with
> non-zero code, and let the parent do the FAIL() when it gets the bad
> code back from wait().

The parent will see a non-zero return status and FAIL itself. FAIL() is
a wrapper for exit with non-zero code, so I'm not sure what the
difference is.

> 
> > +
> > +   shm = shmat(shmid, NULL, 0);
> > +   if (shm == (void *)-1)
> > +           FAIL("shmat failed: %s", strerror(errno));
> 
> And again.

I can make these changes to verbose_printf() and exit(RC_FAIL), but I'm
not sure what the difference is.

> > +   for (i = 0; i < NUM_TESTS; i++) {
> > +           if (!test_addr_huge(testtab + i)) {
> > +                   shm[index * NUM_TESTS + i] = 0;
> > +           } else {
> > +                   ino = do_test(testtab + i);
> > +                   if ((int)ino < 0) {
> > +                           shmdt(shm);
> > +                           exit(RC_FAIL);
> > +                   }
> > +                   shm[index * NUM_TESTS + i] = ino;
> > +           }
> > +   }
> > +   shmdt(shm);
> 
> Man I hate SysV shm.  I'm almost certain this communication page stuff
> would be less ugly if we did mmap()s directly.  Especially if we used
> an unlinked file, and passed the fd directly through to the children
> by clearing FD_CLOEXEC.

I was trying to minimize functional changes (beyond fixes) in this
version. You had mentioned the cleanup of a mmap()'d unlinked file
before and I added it to my todo then. But I'm trying to minimize
changes so that things stay working.

<snip>

> > -           children = (pid_t *)malloc(num_sharings * sizeof(pid_t));
> > -           if (!children)
> > +           children_pids = (pid_t *)malloc(num_sharings * sizeof(pid_t));
> > +           if (!children_pids)
> >                     FAIL("malloc failed: %s", strerror(errno));
> 
> Is there really any advantage to having a variable number of children.
> Things could be simplified considerably if we just cut this down to
> the minimal testcase of 2 children, hardcoded.

I don't know about "considerably", but fair enough. If no one else has
any objections, I can make this change. Note, that it may also be
interesting to run x{B,BDT}.linkhuge with HUGETLB_SHARE set
appropriately, consecutively, as it results in the same sort of thing
(but would only be a "Did it PASS()?" kind of test).

Thanks,
Nish

-- 
Nishanth Aravamudan <[EMAIL PROTECTED]>
IBM Linux Technology Center

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys - and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
Libhugetlbfs-devel mailing list
Libhugetlbfs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel

Reply via email to