On Fri, 2023-09-15 at 10:58 -0400, Philip Balister wrote:
> On 9/15/23 09:59, Peter Kjellerstedt wrote:
> > > -----Original Message-----
> > > From: [email protected] 
> > > <[email protected]> On Behalf Of Richard Purdie
> > > Sent: den 15 september 2023 10:41
> > > To: [email protected]; openembedded-architecture 
> > > <[email protected]>
> > > Cc: Philip Balister <[email protected]>
> > > Subject: Re: [bitbake-devel] [PATCH] server/process: Disable the flush() 
> > > call in server logging
> > > 
> > > On Thu, 2023-09-14 at 17:28 +0100, Richard Purdie via 
> > > lists.openembedded.org wrote:
> > > > We've been chasing bitbake timeouts for a while and it was unclear 
> > > > where things
> > > > were blocking on IO. It appears the flush() call in server logging can 
> > > > cause
> > > > pauses up to minutes long on systems with slow (spinning) disks that 
> > > > are heavily
> > > > loaded with IO.
> > > > 
> > > > Since the flush() was added to aid debugging of other timing issues, we 
> > > > shouldn't
> > > > need it now and it can be disabled. Leave a comment as a reminder of 
> > > > the pain this
> > > > can cause.
> > > > 
> > > > Signed-off-by: Richard Purdie <[email protected]>
> > > > ---
> > > >   lib/bb/server/process.py | 3 ++-
> > > >   1 file changed, 2 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/lib/bb/server/process.py b/lib/bb/server/process.py
> > > > index 40cb99bc97..94e93bc3a6 100644
> > > > --- a/lib/bb/server/process.py
> > > > +++ b/lib/bb/server/process.py
> > > > @@ -43,7 +43,8 @@ def currenttime():
> > > > 
> > > >   def serverlog(msg):
> > > >       print(str(os.getpid()) + " " +  currenttime() + " " + msg)
> > > > -    sys.stdout.flush()
> > > > +    #Seems a flush here triggers filesytem sync like behaviour and 
> > > > long hangs in the server
> > > > +    #sys.stdout.flush()
> > > > 
> > > >   #
> > > >   # When we have lockfile issues, try and find infomation about which 
> > > > process is
> > > 
> > > Whilst this patch looks simple, it has all the signs of turning into
> > > weeks of pain after looking at the results from testing. It looks like
> > > we have a race in the system which this flush is masking.
> > > 
> > > https://autobuilder.yoctoproject.org/typhoon/#/builders/127/builds/2082/steps/14/logs/stdio
> > > 
> > > Looking at just one of the failures there (of the 6), we see this test
> > > pass:
> > > 
> > > 1: 20/57 355/537 (0.12s) (0 failed)
> > > (recipetool.RecipetoolTests.test_recipetool_split_pkg_licenses)
> > > 
> > > then the next test to run on that build directory is:
> > > 
> > > 1: 21/57 359/537 (137.64s) (0 failed)
> > > (signing.Signing.test_signing_packages)
> > > 
> > > with an error which boils down to:
> > > 
> > > FileNotFoundError: [Errno 2] No such file or directory:
> > > '/tmp/recipetoolqaizvjzf8h'
> > > 
> > > from
> > > 
> > > run(cmd='git rev-parse --abbrev-ref HEAD', input=None, log=None,
> > > extrafiles=[], **options={'cwd': '/tmp/recipetoolqaizvjzf8h', 'shell':
> > > True})
> > > 
> > > i.e. it is trying to run a command in a directory which doesn't exist.
> > > 
> > > Without looking, I know "recipetoolqa" tmpfiles will come from
> > > RecipetoolTests so the previous build directory is being removed and
> > > yet still is cwd for the server.
> > > 
> > > I've not looked much into any of the other failures yet.
> > > 
> > > Looking at the 5 different selftest failures:
> > > 
> > > https://autobuilder.yoctoproject.org/typhoon/#/builders/127/builds/2082/steps/14/logs/stdio
> > > 
> > > 2023-09-15 01:56:12,685 - oe-selftest - INFO - RESULTS -
> > > archiver.Archiver.test_archiver_srpm_mode: FAILED (237.00s)
> > > 2023-09-15 01:56:12,685 - oe-selftest - INFO - RESULTS -
> > > overlayfs.OverlayFSEtcRunTimeTests.test_all_required_variables_set: FAILED
> > > (13.25s)
> > > 2023-09-15 01:56:12,685 - oe-selftest - INFO - RESULTS -
> > > prservice.BitbakePrTests.test_pr_service_deb_arch_dep: FAILED (126.30s)
> > > 2023-09-15 01:56:12,685 - oe-selftest - INFO - RESULTS -
> > > signing.Signing.test_signing_packages: FAILED (137.64s)
> > > 2023-09-15 01:56:12,685 - oe-selftest - INFO - RESULTS -
> > > sstatetests.SStateHashSameSigs.test_sstate_32_64_same_hash: FAILED
> > > (541.51s)
> > > 2023-09-15 01:56:12,685 - oe-selftest - INFO - RESULTS -
> > > sstatetests.SStateHashSameSigs3.test_sstate_multilib_or_not_native_samesig
> > > s: FAILED (224.05s)
> > > 
> > > https://autobuilder.yoctoproject.org/typhoon/#/builders/80/builds/5710/steps/14/logs/stdio
> > > 
> > > 2023-09-14 16:00:20,642 - oe-selftest - INFO - RESULTS -
> > > archiver.Archiver.test_archiver_srpm_mode: FAILED (59.78s)
> > > 2023-09-14 16:00:20,642 - oe-selftest - INFO - RESULTS -
> > > bblogging.BitBakeLogging.test_shell_loggingA: FAILED (15.71s)
> > > 2023-09-14 16:00:20,642 - oe-selftest - INFO - RESULTS -
> > > multiconfig.MultiConfig.test_multiconfig_reparse: FAILED (6.39s)
> > > 2023-09-14 16:00:20,642 - oe-selftest - INFO - RESULTS -
> > > prservice.BitbakePrTests.test_pr_service_rpm_arch_indep: FAILED (0.84s)
> > > 
> > > https://autobuilder.yoctoproject.org/typhoon/#/builders/86/builds/5746/steps/14/logs/stdio
> > > 
> > > 2023-09-14 23:33:02,720 - oe-selftest - INFO - RESULTS -
> > > package.VersionOrdering.test_dpkg: ERROR
> > > 2023-09-14 23:33:02,720 - oe-selftest - INFO - RESULTS -
> > > package.VersionOrdering.test_opkg: ERROR
> > > 2023-09-14 23:33:02,721 - oe-selftest - INFO - RESULTS -
> > > package.VersionOrdering.test_rpm: ERROR
> > > 2023-09-14 23:33:02,721 - oe-selftest - INFO - RESULTS -
> > > archiver.Archiver.test_archiver_srpm_mode: FAILED (129.04s)
> > > 2023-09-14 23:33:02,721 - oe-selftest - INFO - RESULTS -
> > > bbtests.BitbakeTests.test_postfile: FAILED (2.06s)
> > > 2023-09-14 23:33:02,721 - oe-selftest - INFO - RESULTS -
> > > bbtests.BitbakeTests.test_prefile: FAILED (1.73s)
> > > 2023-09-14 23:33:02,721 - oe-selftest - INFO - RESULTS -
> > > devtool.DevtoolModifyTests.test_devtool_buildclean: FAILED (265.26s)
> > > 2023-09-14 23:33:02,721 - oe-selftest - INFO - RESULTS -
> > > overlayfs.OverlayFSTests.test_mount_unit_not_set: FAILED (7.83s)
> > > 2023-09-14 23:33:02,721 - oe-selftest - INFO - RESULTS -
> > > prservice.BitbakePrTests.test_import_export_override_db: FAILED (289.85s)
> > > 2023-09-14 23:33:02,721 - oe-selftest - INFO - RESULTS -
> > > sstatetests.SStateHashSameSigs3.test_sstate_multilib_or_not_native_samesig
> > > s: FAILED (111.34s)
> > > 
> > > https://autobuilder.yoctoproject.org/typhoon/#/builders/79/builds/5757/steps/15/logs/stdio
> > > 
> > > 2023-09-14 22:55:12,332 - oe-selftest - INFO - RESULTS -
> > > bblogging.BitBakeLogging.test_python_exec_func_python_loggingB: FAILED
> > > (100.76s)
> > > 
> > > https://autobuilder.yoctoproject.org/typhoon/#/builders/87/builds/5777/steps/14/logs/stdio
> > > 
> > > 2023-09-14 23:00:48,622 - oe-selftest - INFO - RESULTS -
> > > archiver.Archiver.test_archiver_srpm_mode: FAILED (71.69s)
> > > 2023-09-14 23:00:48,622 - oe-selftest - INFO - RESULTS -
> > > prservice.BitbakePrTests.test_pr_service_deb_arch_dep: FAILED (27.14s)
> > > 
> > > so the failures are all over the place.
> > > 
> > > We also had a wic test failure that when re-ran, didn't reproduce:
> > > 
> > > https://autobuilder.yoctoproject.org/typhoon/#/builders/87/builds/5777/steps/14/logs/stdio
> > > 
> > > and some toolchain test failures for qemuppc:
> > > 
> > > https://autobuilder.yoctoproject.org/typhoon/#/builders/153/builds/529
> > > 
> > > and qemuarm64:
> > > 
> > > https://autobuilder.yoctoproject.org/typhoon/#/builders/151/builds/530
> > > 
> > > The question at this point is what do people want me to do. We clearly
> > > have a really nasty bug in here. The patch is "right" and we do need to
> > > fix this. If I merge it, I suspect I'm going to end up having to chase
> > > this down before we can release and I am going to struggle to find the
> > > time to do it and I suspect my sanity will suffer. This does look to be
> > > a significant issue though.
> > > 
> > > Cheers,
> > > 
> > > Richard
> > 
> > While the problem the patch tried to fix is real, it seems a lot less
> > problematic than the result of applying the patch. So from a release
> > perspective, I would keep the current code and possibly add it as a
> > known bug to the release notes. Then, after the release is out, start
> > the work of solving the fallout from applying the patch, with the
> > intention to have it solved for the first point release of Nanbield.
> > 
> > While not ideal, it sounds a lot less stressful than applying the
> > patch at this stage.
> 
> I understand this exposes some "interesting" issues, but I tend to agree 
> with Peter here.

This is a 'bit' more than "interesting". If we ignore this, pretty much
every bitbake failure will be blamed on "the cache problem" and we'll
develop a reputation for reliability issues :(.

The damage this issue can do is scary even if I doubt most users would
hit it.

Cheers,

Richard
-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#1763): 
https://lists.openembedded.org/g/openembedded-architecture/message/1763
Mute This Topic: https://lists.openembedded.org/mt/101375777/21656
Group Owner: [email protected]
Unsubscribe: https://lists.openembedded.org/g/openembedded-architecture/unsub 
[[email protected]]
-=-=-=-=-=-=-=-=-=-=-=-

Reply via email to