Re: [Libhugetlbfs-devel] Need help debugging linkshare segfaults

David Gibson Wed, 29 Nov 2006 17:44:19 -0800

On Tue, Nov 28, 2006 at 09:56:51PM -0800, Nishanth Aravamudan wrote:
[snip]
> > I don't really see that failing after the first child comes back with
> > a signal is a problem.  I don't see any situations where the return
> > code from the second child will be vital for debugging.
> 
> Ok, I can change it easily enough.
> 
> > In any case you mustn't call FAIL() directly from the child process.
> 
> While I agree this is messy in principle, I don't see any reason that it
> can't be called. It's simply a wrapper for exit()? I see that it calls
> cleanup() and there is a claim that cleanup() must be defined in the
> tests but this is untrue AFAICT.


If you FAIL() from the child you'll get two separate fail messages
which makes a mess of the output.  Easiest way to make sure there's a
single, comprehensible fail message is to only ever FAIL() (or PASS()
or CONFIG()) from the top level process.

> > > In any case, I've spent a good amount of time cleaning and fixing the
> > > linkshare testcase yesterday and today. Here is what I have so far. We
> > > are still getting segfaults on xBDT.linkshare 64-bit with
> > > HUGETLB_SHARE=2, but I went and checked and it's not a testcase issue, I
> > > don't think. We also will segfault, for instance, if xBDT.linkhuge
> > > 64-bit is run manually with HUGETLB_SHARE=2 two times in a row. I am now
> > > looking into the root cause of this failure.
> > 
> > What sort of system is this on?  I've certainly had 64-bit
> > xBDT.linkshare work on some setups.
> 
> This is with the *reworked* linkshare. The version currently in the
> library is broken in multiple ways.

I thought you said you got failures with both the old and new versions..

> > > The patch is pretty much ready for inclusion, I think, but I'd like one
> > > more round of review.
> > 
> > Looks like a nice cleanup, but some further comments below.
> 
> <snip>
> 
> > > +static int child_process(char *self, int num_sharings, int index)
> > >  {
> > >   int i;
> > > - int shmid;
> > > - ino_t *shm;
> > > + ino_t ino;
> > > +
> > > + get_link_string(self);
> > > +
> > > + shmid = shmget(SHM_KEY, num_sharings * NUM_TESTS *
> > > +                                                 sizeof(ino_t), 0666);
> > > + if (shmid < 0)
> > > +         FAIL("Child's shmget failed: %s", strerror(errno));
> > 
> > As mentioned above, a direct FAIL() from the child will always cause
> > trouble - what will the parent do.  Instead just do an exit() with
> > non-zero code, and let the parent do the FAIL() when it gets the bad
> > code back from wait().
> 
> The parent will see a non-zero return status and FAIL itself. FAIL() is
> a wrapper for exit with non-zero code, so I'm not sure what the
> difference is.

Confusing multiple messages.

> > > +
> > > + shm = shmat(shmid, NULL, 0);
> > > + if (shm == (void *)-1)
> > > +         FAIL("shmat failed: %s", strerror(errno));
> > 
> > And again.
> 
> I can make these changes to verbose_printf() and exit(RC_FAIL), but I'm
> not sure what the difference is.
> 
> > > + for (i = 0; i < NUM_TESTS; i++) {
> > > +         if (!test_addr_huge(testtab + i)) {
> > > +                 shm[index * NUM_TESTS + i] = 0;
> > > +         } else {
> > > +                 ino = do_test(testtab + i);
> > > +                 if ((int)ino < 0) {
> > > +                         shmdt(shm);
> > > +                         exit(RC_FAIL);
> > > +                 }
> > > +                 shm[index * NUM_TESTS + i] = ino;
> > > +         }
> > > + }
> > > + shmdt(shm);
> > 
> > Man I hate SysV shm.  I'm almost certain this communication page stuff
> > would be less ugly if we did mmap()s directly.  Especially if we used
> > an unlinked file, and passed the fd directly through to the children
> > by clearing FD_CLOEXEC.
> 
> I was trying to minimize functional changes (beyond fixes) in this
> version. You had mentioned the cleanup of a mmap()'d unlinked file
> before and I added it to my todo then. But I'm trying to minimize
> changes so that things stay working.

Fair enough.

> > > -         children = (pid_t *)malloc(num_sharings * sizeof(pid_t));
> > > -         if (!children)
> > > +         children_pids = (pid_t *)malloc(num_sharings * sizeof(pid_t));
> > > +         if (!children_pids)
> > >                   FAIL("malloc failed: %s", strerror(errno));
> > 
> > Is there really any advantage to having a variable number of children.
> > Things could be simplified considerably if we just cut this down to
> > the minimal testcase of 2 children, hardcoded.
> 
> I don't know about "considerably", but fair enough. If no one else has
> any objections, I can make this change. Note, that it may also be
> interesting to run x{B,BDT}.linkhuge with HUGETLB_SHARE set
> appropriately, consecutively, as it results in the same sort of
> thing

That sounds like a good idea.  Makes sure HUGETLB_SHARE doesn't mess
anything up, even in a case which doesn't use sharing.

> (but would only be a "Did it PASS()?" kind of test).

Um.. there's another kind?

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys - and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
Libhugetlbfs-devel mailing list
Libhugetlbfs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel

Re: [Libhugetlbfs-devel] Need help debugging linkshare segfaults

Reply via email to