On 29.11.2006 [10:15:30 +1100], David Gibson wrote: > On Tue, Nov 28, 2006 at 01:49:03PM -0800, Nishanth Aravamudan wrote: > > On 16.11.2006 [10:50:13 +1100], David Gibson wrote: > > > On Wed, Nov 15, 2006 at 01:09:07PM -0800, Nishanth Aravamudan wrote: > > > > On 15.11.2006 [10:41:19 +1100], David Gibson wrote: > > > > > On Tue, Nov 14, 2006 at 02:33:59PM -0800, Nishanth Aravamudan wrote: > > > > > > On 14.11.2006 [12:21:36 -0800], Nishanth Aravamudan wrote: > > > > > > > Hi all, > > > > > > > > > > > > > > I'm hitting a brick wall debugging the linkshare segfaults I'm > > > > > > > seeing. > > > > > > > > > > > > > > (These logs are from my 2-way x86_64, but I'm seeing similar > > > > > > > issues on a G5 > > > > > > > (ppc64): > > > > > > > > > > > > > > HUGETLB_SHARE=2 xB.linkshare 2 (32): PASS > > > > > > > HUGETLB_SHARE=2 xB.linkshare 2 (64): PASS > > > > > > > HUGETLB_SHARE=1 xB.linkshare 2 (32): FAIL 2 of 2 children > > > > > > > exited abnormally > > > > > > > HUGETLB_SHARE=1 xB.linkshare 2 (64): FAIL 2 of 2 children > > > > > > > exited abnormally > > > > > > > HUGETLB_SHARE=2 xBDT.linkshare 2 (32): PASS > > > > > > > HUGETLB_SHARE=2 xBDT.linkshare 2 (64): PASS > > > > > > > HUGETLB_SHARE=1 xBDT.linkshare 2 (32): FAIL 2 of 2 children > > > > > > > exited abnormally > > > > > > > HUGETLB_SHARE=1 xBDT.linkshare 2 (64): FAIL 2 of 2 children > > > > > > > exited abnormally > > > > > > > > > > > > > > With all 4 failures being segmentation faults we caught. > > > > > > > > > > > > /me hangs head in shame. > > > > > > > > > > > > This is all probably just a stupid programming error on my part. > > > > > > I'll > > > > > > have a fix, I think, once I return from class. > > > > > > > > > > Btw, some of the existing testcases (e.g. alloc-instantiate-race) use > > > > > strsignal() and WTERMSIG() to give a more informative message when a > > > > > child is killed by a signal. It's probably a good idea to use that > > > > > here too, so you can see they died with a SEGV at first glance. > > > > > > > > Yes, this is done with a verbose test run. If I were to do it via a > > > > FAIL statement, we'd get 3 FAIL lines for every failing case. I > > > > suppose I could add a FAIL_CONT() for this... > > > > > > Um.. I don't really follow you. > > > > If you look at the patch I sent previously, we do print out the signal > > information with strsignal, but via verbose_printf(). If I were to do so > > via a FAIL() line, we'd either only print out the signal for the first > > child (since the testcase would fail immediately), or we'd have to add a > > FAIL_CONT() or something to allow me to indicate failure without failing > > the testcase immediately. > > I don't really see that failing after the first child comes back with > a signal is a problem. I don't see any situations where the return > code from the second child will be vital for debugging.
Ok, I can change it easily enough. > In any case you mustn't call FAIL() directly from the child process. While I agree this is messy in principle, I don't see any reason that it can't be called. It's simply a wrapper for exit()? I see that it calls cleanup() and there is a claim that cleanup() must be defined in the tests but this is untrue AFAICT. > > In any case, I've spent a good amount of time cleaning and fixing the > > linkshare testcase yesterday and today. Here is what I have so far. We > > are still getting segfaults on xBDT.linkshare 64-bit with > > HUGETLB_SHARE=2, but I went and checked and it's not a testcase issue, I > > don't think. We also will segfault, for instance, if xBDT.linkhuge > > 64-bit is run manually with HUGETLB_SHARE=2 two times in a row. I am now > > looking into the root cause of this failure. > > What sort of system is this on? I've certainly had 64-bit > xBDT.linkshare work on some setups. This is with the *reworked* linkshare. The version currently in the library is broken in multiple ways. > > The patch is pretty much ready for inclusion, I think, but I'd like one > > more round of review. > > Looks like a nice cleanup, but some further comments below. <snip> > > +static int child_process(char *self, int num_sharings, int index) > > { > > int i; > > - int shmid; > > - ino_t *shm; > > + ino_t ino; > > + > > + get_link_string(self); > > + > > + shmid = shmget(SHM_KEY, num_sharings * NUM_TESTS * > > + sizeof(ino_t), 0666); > > + if (shmid < 0) > > + FAIL("Child's shmget failed: %s", strerror(errno)); > > As mentioned above, a direct FAIL() from the child will always cause > trouble - what will the parent do. Instead just do an exit() with > non-zero code, and let the parent do the FAIL() when it gets the bad > code back from wait(). The parent will see a non-zero return status and FAIL itself. FAIL() is a wrapper for exit with non-zero code, so I'm not sure what the difference is. > > > + > > + shm = shmat(shmid, NULL, 0); > > + if (shm == (void *)-1) > > + FAIL("shmat failed: %s", strerror(errno)); > > And again. I can make these changes to verbose_printf() and exit(RC_FAIL), but I'm not sure what the difference is. > > + for (i = 0; i < NUM_TESTS; i++) { > > + if (!test_addr_huge(testtab + i)) { > > + shm[index * NUM_TESTS + i] = 0; > > + } else { > > + ino = do_test(testtab + i); > > + if ((int)ino < 0) { > > + shmdt(shm); > > + exit(RC_FAIL); > > + } > > + shm[index * NUM_TESTS + i] = ino; > > + } > > + } > > + shmdt(shm); > > Man I hate SysV shm. I'm almost certain this communication page stuff > would be less ugly if we did mmap()s directly. Especially if we used > an unlinked file, and passed the fd directly through to the children > by clearing FD_CLOEXEC. I was trying to minimize functional changes (beyond fixes) in this version. You had mentioned the cleanup of a mmap()'d unlinked file before and I added it to my todo then. But I'm trying to minimize changes so that things stay working. <snip> > > - children = (pid_t *)malloc(num_sharings * sizeof(pid_t)); > > - if (!children) > > + children_pids = (pid_t *)malloc(num_sharings * sizeof(pid_t)); > > + if (!children_pids) > > FAIL("malloc failed: %s", strerror(errno)); > > Is there really any advantage to having a variable number of children. > Things could be simplified considerably if we just cut this down to > the minimal testcase of 2 children, hardcoded. I don't know about "considerably", but fair enough. If no one else has any objections, I can make this change. Note, that it may also be interesting to run x{B,BDT}.linkhuge with HUGETLB_SHARE set appropriately, consecutively, as it results in the same sort of thing (but would only be a "Did it PASS()?" kind of test). Thanks, Nish -- Nishanth Aravamudan <[EMAIL PROTECTED]> IBM Linux Technology Center ------------------------------------------------------------------------- Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT & business topics through brief surveys - and earn cash http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV _______________________________________________ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel