I thought I should write down how I'm narrowing down the glibc issue and making it easier to debug.
We're talking about the glibc system mode testing, where a qemu VM is started, unfs3 is used to NFS mount $WORKDIR into the VM and then the glibc test suite is run in that environment. The command to run just this test is:

    oe-selftest -r glibc.GlibcSelfTestSystemEmulated -j 1 -K

The -j option means it creates a new build directory each time and the -K ensures it is kept.

You can find the guest VM with:

    ps ax | grep qemu-system-

You can then access it with something like:

    ssh [email protected]

We needed some way to identify when things were broken/hanging. The sign appears to be that unfsd has exited and is in a zombie state.

I did some horrible hacking of the command execution code in meta/lib/oeqa/utils/commands.py, adding:

    def logit(self, msg):
        with open("/tmp/rp.cmd.%s" % os.getpid(), "a+") as f:
            f.write(msg + "\n")

and then adding calls like:

+            self.logit("Command '%s' returned %d as exit code." % (self.cmd, self.status))
+            self.logit("Last 20 lines:\n%s" % lout)
             self.log.debug("Last 20 lines:\n%s" % lout)

which let me see the buffer overflow message from unfsd. Writing the data out directly when it happens ensures it isn't stuck in some pipe buffer we never see.

Now, with an idea of what breaks, we needed a faster reproducer, as the tests hang after about 4.5 hours, which isn't good for debugging. In the end, I wrote a script like this:

    while true
    do
        date
        tail log.do_check.* -n 4
        ps ax | grep unfs | grep -v grep
        sleep 2
    done

which monitored the log file and logged its last lines along with the time and the status of the unfs process. Left running for a while, this narrowed down the issue to being triggered by the misc/tst-syslog test.
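For anyone who would rather have the monitor stop by itself when unfsd dies, the same check could be sketched in Python roughly as below. This is only an illustration, not something from the actual debugging session: the names zombie_pids and watch are made up here, and it assumes the usual "ps ax" column layout (PID TTY STAT TIME COMMAND), where a zombie has a STAT value starting with "Z".

```python
import subprocess
import time


def zombie_pids(ps_output, name="unfsd"):
    """Return PIDs of processes whose command contains `name` and whose
    STAT column starts with 'Z' (zombie), given the text of `ps ax`."""
    pids = []
    for line in ps_output.splitlines():
        # Assumed column layout: PID TTY STAT TIME COMMAND
        fields = line.split(None, 4)
        if len(fields) < 5 or not fields[0].isdigit():
            continue  # header row or a line we cannot parse
        pid, _tty, stat, _time, command = fields
        if name in command and stat.startswith("Z"):
            pids.append(int(pid))
    return pids


def watch(name="unfsd", interval=2.0):
    """Poll `ps ax` until a zombie matching `name` shows up, then
    print a timestamp and return the matching PIDs."""
    while True:
        out = subprocess.run(
            ["ps", "ax"], capture_output=True, text=True, check=True
        ).stdout
        found = zombie_pids(out, name)
        if found:
            print(time.strftime("%c"), "zombie", name, found)
            return found
        time.sleep(interval)
```

Calling watch() blocks until a zombie unfsd appears, so the timestamp it prints can be lined up against the tail of log.do_check.* to see which test was running at that point.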
I then worked out how to run individual tests in the glibc test suite (make test t=misc/tst-syslog) and then hacked that into the glibc-testsuite recipe:

+    tar -xf ${COREBASE}/meta/recipes-core/glibc/glibc/test.tgz -C ${B}/
+
     oe_runmake -i \
         GPROF="${TARGET_PREFIX}gprof" \
         QEMU_SYSROOT="${RECIPE_SYSROOT}" \
@@ -28,7 +30,9 @@ do_check:append () {
         SSH_HOST_USER="${TOOLCHAIN_TEST_HOST_USER}" \
         SSH_HOST_PORT="${TOOLCHAIN_TEST_HOST_PORT}" \
         test-wrapper="${UNPACKDIR}/check-test-wrapper ${TOOLCHAIN_TEST_TARGET}" \
-        check
+        test t=misc/tst-syslog
+
+    sleep 10000

The tarball provides some files needed in the test directory (testroot.pristine and testroot.root). I'm not sure how to create those with a make command as yet, so I just hacked around it with a tarball.

With those changes, the oe-selftest hangs much faster. It still has an image build and lots of compiling in glibc but is nowhere near 4.5 hours now.

Next, I can likely extract much of this from the test into a manual environment to avoid the image creation overhead each time.

Obviously I've not started debugging the overflow itself yet but I can now trigger the issue in a much faster way.

Cheers,

Richard
