On Fri, 2023-09-15 at 10:58 -0400, Philip Balister wrote: > On 9/15/23 09:59, Peter Kjellerstedt wrote: > > > -----Original Message----- > > > From: [email protected] > > > <[email protected]> On Behalf Of Richard Purdie > > > Sent: den 15 september 2023 10:41 > > > To: [email protected]; openembedded-architecture > > > <[email protected]> > > > Cc: Philip Balister <[email protected]> > > > Subject: Re: [bitbake-devel] [PATCH] server/process: Disable the flush() > > > call in server logging > > > > > > On Thu, 2023-09-14 at 17:28 +0100, Richard Purdie via > > > lists.openembedded.org wrote: > > > > We've been chasing bitbake timeouts for a while and it was unclear > > > > where things > > > > were blocking on IO. It appears the flush() call in server logging can > > > > cause > > > > pauses up to minutes long on systems with slow (spinning) disks that > > > > are heavily > > > > loaded with IO. > > > > > > > > Since the flush() was added to aid debugging of other timing issues, we > > > > shouldn't > > > > need it now and it can be disabled. Leave a comment as a reminder of > > > > the pain this > > > > can cause. > > > > > > > > Signed-off-by: Richard Purdie <[email protected]> > > > > --- > > > > lib/bb/server/process.py | 3 ++- > > > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/lib/bb/server/process.py b/lib/bb/server/process.py > > > > index 40cb99bc97..94e93bc3a6 100644 > > > > --- a/lib/bb/server/process.py > > > > +++ b/lib/bb/server/process.py > > > > @@ -43,7 +43,8 @@ def currenttime(): > > > > > > > > def serverlog(msg): > > > > print(str(os.getpid()) + " " + currenttime() + " " + msg) > > > > - sys.stdout.flush() > > > > + #Seems a flush here triggers filesytem sync like behaviour and > > > > long hangs in the server > > > > + #sys.stdout.flush() > > > > > > > > # > > > > # When we have lockfile issues, try and find infomation about which > > > > process is > > > > > > Whilst this patch looks simple, it has all the signs of turning into > > > weeks of pain after looking at the results from testing. It looks like > > > we have a race in the system which this flush is masking. > > > > > > https://autobuilder.yoctoproject.org/typhoon/#/builders/127/builds/2082/steps/14/logs/stdio > > > > > > Looking at just one of the failures there (of the 6), we see this test > > > pass: > > > > > > 1: 20/57 355/537 (0.12s) (0 failed) > > > (recipetool.RecipetoolTests.test_recipetool_split_pkg_licenses) > > > > > > then the next test to run on that build directory is: > > > > > > 1: 21/57 359/537 (137.64s) (0 failed) > > > (signing.Signing.test_signing_packages) > > > > > > with an error which boils down to: > > > > > > FileNotFoundError: [Errno 2] No such file or directory: > > > '/tmp/recipetoolqaizvjzf8h' > > > > > > from > > > > > > run(cmd='git rev-parse --abbrev-ref HEAD', input=None, log=None, > > > extrafiles=[], **options={'cwd': '/tmp/recipetoolqaizvjzf8h', 'shell': > > > True}) > > > > > > i.e. it is trying to run a command in a directory which doesn't exist. > > > > > > Without looking, I know "recipetoolqa" tmpfiles will come from > > > RecipetoolTests so the previous build directory is being removed and > > > yet still is cwd for the server. > > > > > > I've not looked much into any of the other failures yet. > > > > > > Looking at the 5 different selftest failures: > > > > > > https://autobuilder.yoctoproject.org/typhoon/#/builders/127/builds/2082/steps/14/logs/stdio > > > > > > 2023-09-15 01:56:12,685 - oe-selftest - INFO - RESULTS - > > > archiver.Archiver.test_archiver_srpm_mode: FAILED (237.00s) > > > 2023-09-15 01:56:12,685 - oe-selftest - INFO - RESULTS - > > > overlayfs.OverlayFSEtcRunTimeTests.test_all_required_variables_set: FAILED > > > (13.25s) > > > 2023-09-15 01:56:12,685 - oe-selftest - INFO - RESULTS - > > > prservice.BitbakePrTests.test_pr_service_deb_arch_dep: FAILED (126.30s) > > > 2023-09-15 01:56:12,685 - oe-selftest - INFO - RESULTS - > > > signing.Signing.test_signing_packages: FAILED (137.64s) > > > 2023-09-15 01:56:12,685 - oe-selftest - INFO - RESULTS - > > > sstatetests.SStateHashSameSigs.test_sstate_32_64_same_hash: FAILED > > > (541.51s) > > > 2023-09-15 01:56:12,685 - oe-selftest - INFO - RESULTS - > > > sstatetests.SStateHashSameSigs3.test_sstate_multilib_or_not_native_samesig > > > s: FAILED (224.05s) > > > > > > https://autobuilder.yoctoproject.org/typhoon/#/builders/80/builds/5710/steps/14/logs/stdio > > > > > > 2023-09-14 16:00:20,642 - oe-selftest - INFO - RESULTS - > > > archiver.Archiver.test_archiver_srpm_mode: FAILED (59.78s) > > > 2023-09-14 16:00:20,642 - oe-selftest - INFO - RESULTS - > > > bblogging.BitBakeLogging.test_shell_loggingA: FAILED (15.71s) > > > 2023-09-14 16:00:20,642 - oe-selftest - INFO - RESULTS - > > > multiconfig.MultiConfig.test_multiconfig_reparse: FAILED (6.39s) > > > 2023-09-14 16:00:20,642 - oe-selftest - INFO - RESULTS - > > > prservice.BitbakePrTests.test_pr_service_rpm_arch_indep: FAILED (0.84s) > > > > > > https://autobuilder.yoctoproject.org/typhoon/#/builders/86/builds/5746/steps/14/logs/stdio > > > > > > 2023-09-14 23:33:02,720 - oe-selftest - INFO - RESULTS - > > > package.VersionOrdering.test_dpkg: ERROR > > > 2023-09-14 23:33:02,720 - oe-selftest - INFO - RESULTS - > > > package.VersionOrdering.test_opkg: ERROR > > > 2023-09-14 23:33:02,721 - oe-selftest - INFO - RESULTS - > > > package.VersionOrdering.test_rpm: ERROR > > > 2023-09-14 23:33:02,721 - oe-selftest - INFO - RESULTS - > > > archiver.Archiver.test_archiver_srpm_mode: FAILED (129.04s) > > > 2023-09-14 23:33:02,721 - oe-selftest - INFO - RESULTS - > > > bbtests.BitbakeTests.test_postfile: FAILED (2.06s) > > > 2023-09-14 23:33:02,721 - oe-selftest - INFO - RESULTS - > > > bbtests.BitbakeTests.test_prefile: FAILED (1.73s) > > > 2023-09-14 23:33:02,721 - oe-selftest - INFO - RESULTS - > > > devtool.DevtoolModifyTests.test_devtool_buildclean: FAILED (265.26s) > > > 2023-09-14 23:33:02,721 - oe-selftest - INFO - RESULTS - > > > overlayfs.OverlayFSTests.test_mount_unit_not_set: FAILED (7.83s) > > > 2023-09-14 23:33:02,721 - oe-selftest - INFO - RESULTS - > > > prservice.BitbakePrTests.test_import_export_override_db: FAILED (289.85s) > > > 2023-09-14 23:33:02,721 - oe-selftest - INFO - RESULTS - > > > sstatetests.SStateHashSameSigs3.test_sstate_multilib_or_not_native_samesig > > > s: FAILED (111.34s) > > > > > > https://autobuilder.yoctoproject.org/typhoon/#/builders/79/builds/5757/steps/15/logs/stdio > > > > > > 2023-09-14 22:55:12,332 - oe-selftest - INFO - RESULTS - > > > bblogging.BitBakeLogging.test_python_exec_func_python_loggingB: FAILED > > > (100.76s) > > > > > > https://autobuilder.yoctoproject.org/typhoon/#/builders/87/builds/5777/steps/14/logs/stdio > > > > > > 2023-09-14 23:00:48,622 - oe-selftest - INFO - RESULTS - > > > archiver.Archiver.test_archiver_srpm_mode: FAILED (71.69s) > > > 2023-09-14 23:00:48,622 - oe-selftest - INFO - RESULTS - > > > prservice.BitbakePrTests.test_pr_service_deb_arch_dep: FAILED (27.14s) > > > > > > so the failures are all over the place. > > > > > > We also had a wic test failure that when re-ran, didn't reproduce: > > > > > > https://autobuilder.yoctoproject.org/typhoon/#/builders/87/builds/5777/steps/14/logs/stdio > > > > > > and some toolchain test failures for qemuppc: > > > > > > https://autobuilder.yoctoproject.org/typhoon/#/builders/153/builds/529 > > > > > > and qemuarm64: > > > > > > https://autobuilder.yoctoproject.org/typhoon/#/builders/151/builds/530 > > > > > > The question at this point is what do people want me to do. We clearly > > > have a really nasty bug in here. The patch is "right" and we do need to > > > fix this. If I merge it, I suspect I'm going to end up having to chase > > > this down before we can release and I am going to struggle to find the > > > time to do it and I suspect my sanity will suffer. This does look to be > > > a significant issue though. > > > > > > Cheers, > > > > > > Richard > > > > While the problem the patch tried to fix is real, it seems a lot less > > problematic than the result of applying the patch. So from a release > > perspective, I would keep the current code and possibly add it as a > > known bug to the release notes. Then, after the release is out, start > > the work of solving the fallout from applying the patch, with the > > intention to have it solved for the first point release of Nanbield. > > > > While not ideal, it sounds a lot less stressful than applying the > > patch at this stage. > > I understand this exposes some "interesting" issues, but I tend to agree > with Peter here.
This is a 'bit' more than "interesting". If we ignore this, pretty much every bitbake failure will be blamed on "the cache problem" and we'll develop a reputation for reliability issues :(. The damage this issue can do is scary even if I doubt most users would hit it. Cheers, Richard
-=-=-=-=-=-=-=-=-=-=-=- Links: You receive all messages sent to this group. View/Reply Online (#1763): https://lists.openembedded.org/g/openembedded-architecture/message/1763 Mute This Topic: https://lists.openembedded.org/mt/101375777/21656 Group Owner: [email protected] Unsubscribe: https://lists.openembedded.org/g/openembedded-architecture/unsub [[email protected]] -=-=-=-=-=-=-=-=-=-=-=-
