I thought I should write down how I'm narrowing down the glibc issue
and making it easier to debug.

We're talking about the glibc system mode testing, where a qemu VM is
started, unfs3 is used to NFS mount $WORKDIR into the VM and then the
glibc test suite is run in that environment.

The command to run just this test is:

oe-selftest -r glibc.GlibcSelfTestSystemEmulated -j 1 -K

The -j option means it creates a new build directory each time and the
-K ensures it is kept.

You can find the guest VM with ps ax | grep qemu-system-. You can then
access it with something like ssh [email protected]

We needed some way to identify when things were broken/hanging. That
appears to be that unfsd has exited and is in a zombie state. I did
some horrible hacking of the command execution code in
meta/lib/oeqa/utils/commands.py adding:

    def logit(self, msg):
        with open("/tmp/rp.cmd.%s" % os.getpid(), "a+") as f:
            f.write(msg + "\n")

Adding things like:

+        self.logit("Command '%s' returned %d as exit code." % (self.cmd, self.status))

+            self.logit("Last 20 lines:\n%s" % lout)
             self.log.debug("Last 20 lines:\n%s" % lout)
 

That let me see the buffer overflow message from unfsd. The helper
writes each message straight to a file as it happens, so the data
can't get stuck in some pipe buffer we never see.

Now, with an idea of what breaks, we needed a faster reproducer as the
tests hang after about 4.5 hours, which isn't good for debugging.

In the end, I wrote a script like this:

while true
do
    date
    tail log.do_check.* -n 4
    ps ax | grep unfs | grep -v grep
    sleep 2
done

which monitored the log file and logged the latest lines from it along
with the time and the status of the unfs process. Left running for a
while, this narrowed down the issue to being triggered by the
misc/tst-syslog test.
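
The zombie check could also be scripted instead of read off the ps
output by eye. The following is only a minimal sketch in Python, not
something from the tree: find_zombies() and unfsd_is_zombie() are
hypothetical names, and matching "unfs" against the command name is an
assumption based on the grep in the loop above.

```python
import subprocess


def find_zombies(ps_output, name="unfs"):
    """Return PIDs whose command contains `name` and whose state is Z.

    `ps_output` is expected to hold one process per line in the form
    "<pid> <stat> <comm>", as printed by `ps -e -o pid= -o stat= -o comm=`.
    """
    zombies = []
    for line in ps_output.splitlines():
        # Split into at most three fields so a comm such as
        # "unfsd <defunct>" stays in one piece.
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue
        pid, stat, comm = fields
        if name in comm and stat.startswith("Z"):
            zombies.append(int(pid))
    return zombies


def unfsd_is_zombie():
    """Run ps on the build host and report whether a zombie unfs process exists."""
    result = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "stat=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    )
    return bool(find_zombies(result.stdout))
```

Calling unfsd_is_zombie() from a polling loop would give the same
signal as the shell script, with the parsing kept in find_zombies() so
it can be checked against canned ps output.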

I then worked out how to run individual tests in the glibc test suite
(make test t=misc/tst-syslog) and then hacked that into the glibc-
testsuite recipe:
 
+    tar -xf ${COREBASE}/meta/recipes-core/glibc/glibc/test.tgz -C ${B}/
+
     oe_runmake -i \
         GPROF="${TARGET_PREFIX}gprof" \
         QEMU_SYSROOT="${RECIPE_SYSROOT}" \
@@ -28,7 +30,9 @@ do_check:append () {
         SSH_HOST_USER="${TOOLCHAIN_TEST_HOST_USER}" \
         SSH_HOST_PORT="${TOOLCHAIN_TEST_HOST_PORT}" \
         test-wrapper="${UNPACKDIR}/check-test-wrapper ${TOOLCHAIN_TEST_TARGET}" \
-        check
+        test t=misc/tst-syslog
+
+    sleep 10000

The tarball creates some needed files in the test directory
(testroot.pristine and testroot.root). I'm not sure how to create
those with a make command as yet, so I just hacked around it with a
tarball.

With those changes, the oe-selftest hangs much faster. It still has an
image build and lots of compiling in glibc but is nowhere near 4.5
hours now.

Next, I can likely extract much of this from the test into a manual
environment to avoid the image creation overhead each time.

Obviously I've not started debugging the overflow itself yet but I can
now trigger the issue in a much faster way.

Cheers,

Richard



