On Thu, 2023-09-14 at 17:28 +0100, Richard Purdie via
lists.openembedded.org wrote:
> We've been chasing bitbake timeouts for a while and it was unclear where 
> things
> were blocking on IO. It appears the flush() call in server logging can cause
> pauses up to minutes long on systems with slow (spinning) disks that are 
> heavily
> loaded with IO.
> 
> Since the flush() was added to aid debugging of other timing issues, we 
> shouldn't
> need it now and it can be disabled. Leave a comment as a reminder of the pain 
> this
> can cause.
> 
> Signed-off-by: Richard Purdie <[email protected]>
> ---
>  lib/bb/server/process.py | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/bb/server/process.py b/lib/bb/server/process.py
> index 40cb99bc97..94e93bc3a6 100644
> --- a/lib/bb/server/process.py
> +++ b/lib/bb/server/process.py
> @@ -43,7 +43,8 @@ def currenttime():
>  
>  def serverlog(msg):
>      print(str(os.getpid()) + " " +  currenttime() + " " + msg)
> -    sys.stdout.flush()
> +    #Seems a flush here triggers filesytem sync like behaviour and long 
> hangs in the server
> +    #sys.stdout.flush()
>  
>  #
>  # When we have lockfile issues, try and find infomation about which process 
> is

Whilst this patch looks simple, it has all the signs of turning into
weeks of pain after looking at the results from testing. It looks like
we have a race in the system which this flush is masking.

https://autobuilder.yoctoproject.org/typhoon/#/builders/127/builds/2082/steps/14/logs/stdio

Looking at just one of the failures there (of the 6), we see this test
pass:

1: 20/57 355/537 (0.12s) (0 failed) 
(recipetool.RecipetoolTests.test_recipetool_split_pkg_licenses)

then the next test to run on that build directory is:

1: 21/57 359/537 (137.64s) (0 failed) (signing.Signing.test_signing_packages)

with an error which boils down to:

FileNotFoundError: [Errno 2] No such file or directory: 
'/tmp/recipetoolqaizvjzf8h'

from 

run(cmd='git rev-parse --abbrev-ref HEAD', input=None, log=None, extrafiles=[], 
**options={'cwd': '/tmp/recipetoolqaizvjzf8h', 'shell': True})

i.e. it is trying to run a command in a directory which doesn't exist.

Without looking, I know "recipetoolqa" tmpfiles will come from
RecipetoolTests so the previous build directory is being removed and
yet still is cwd for the server.

I've not looked much into any of the other failures yet.

Looking at the 5 different selftest failures:

https://autobuilder.yoctoproject.org/typhoon/#/builders/127/builds/2082/steps/14/logs/stdio

2023-09-15 01:56:12,685 - oe-selftest - INFO - RESULTS - 
archiver.Archiver.test_archiver_srpm_mode: FAILED (237.00s)
2023-09-15 01:56:12,685 - oe-selftest - INFO - RESULTS - 
overlayfs.OverlayFSEtcRunTimeTests.test_all_required_variables_set: FAILED 
(13.25s)
2023-09-15 01:56:12,685 - oe-selftest - INFO - RESULTS - 
prservice.BitbakePrTests.test_pr_service_deb_arch_dep: FAILED (126.30s)
2023-09-15 01:56:12,685 - oe-selftest - INFO - RESULTS - 
signing.Signing.test_signing_packages: FAILED (137.64s)
2023-09-15 01:56:12,685 - oe-selftest - INFO - RESULTS - 
sstatetests.SStateHashSameSigs.test_sstate_32_64_same_hash: FAILED (541.51s)
2023-09-15 01:56:12,685 - oe-selftest - INFO - RESULTS - 
sstatetests.SStateHashSameSigs3.test_sstate_multilib_or_not_native_samesigs: 
FAILED (224.05s)

https://autobuilder.yoctoproject.org/typhoon/#/builders/80/builds/5710/steps/14/logs/stdio

2023-09-14 16:00:20,642 - oe-selftest - INFO - RESULTS - 
archiver.Archiver.test_archiver_srpm_mode: FAILED (59.78s)
2023-09-14 16:00:20,642 - oe-selftest - INFO - RESULTS - 
bblogging.BitBakeLogging.test_shell_loggingA: FAILED (15.71s)
2023-09-14 16:00:20,642 - oe-selftest - INFO - RESULTS - 
multiconfig.MultiConfig.test_multiconfig_reparse: FAILED (6.39s)
2023-09-14 16:00:20,642 - oe-selftest - INFO - RESULTS - 
prservice.BitbakePrTests.test_pr_service_rpm_arch_indep: FAILED (0.84s)

https://autobuilder.yoctoproject.org/typhoon/#/builders/86/builds/5746/steps/14/logs/stdio

2023-09-14 23:33:02,720 - oe-selftest - INFO - RESULTS - 
package.VersionOrdering.test_dpkg: ERROR
2023-09-14 23:33:02,720 - oe-selftest - INFO - RESULTS - 
package.VersionOrdering.test_opkg: ERROR
2023-09-14 23:33:02,721 - oe-selftest - INFO - RESULTS - 
package.VersionOrdering.test_rpm: ERROR
2023-09-14 23:33:02,721 - oe-selftest - INFO - RESULTS - 
archiver.Archiver.test_archiver_srpm_mode: FAILED (129.04s)
2023-09-14 23:33:02,721 - oe-selftest - INFO - RESULTS - 
bbtests.BitbakeTests.test_postfile: FAILED (2.06s)
2023-09-14 23:33:02,721 - oe-selftest - INFO - RESULTS - 
bbtests.BitbakeTests.test_prefile: FAILED (1.73s)
2023-09-14 23:33:02,721 - oe-selftest - INFO - RESULTS - 
devtool.DevtoolModifyTests.test_devtool_buildclean: FAILED (265.26s)
2023-09-14 23:33:02,721 - oe-selftest - INFO - RESULTS - 
overlayfs.OverlayFSTests.test_mount_unit_not_set: FAILED (7.83s)
2023-09-14 23:33:02,721 - oe-selftest - INFO - RESULTS - 
prservice.BitbakePrTests.test_import_export_override_db: FAILED (289.85s)
2023-09-14 23:33:02,721 - oe-selftest - INFO - RESULTS - 
sstatetests.SStateHashSameSigs3.test_sstate_multilib_or_not_native_samesigs: 
FAILED (111.34s)

https://autobuilder.yoctoproject.org/typhoon/#/builders/79/builds/5757/steps/15/logs/stdio

2023-09-14 22:55:12,332 - oe-selftest - INFO - RESULTS - 
bblogging.BitBakeLogging.test_python_exec_func_python_loggingB: FAILED (100.76s)

https://autobuilder.yoctoproject.org/typhoon/#/builders/87/builds/5777/steps/14/logs/stdio

2023-09-14 23:00:48,622 - oe-selftest - INFO - RESULTS - 
archiver.Archiver.test_archiver_srpm_mode: FAILED (71.69s)
2023-09-14 23:00:48,622 - oe-selftest - INFO - RESULTS - 
prservice.BitbakePrTests.test_pr_service_deb_arch_dep: FAILED (27.14s)

so the failures are all over the place.

We also had a wic test failure that when re-ran, didn't reproduce:

https://autobuilder.yoctoproject.org/typhoon/#/builders/87/builds/5777/steps/14/logs/stdio

and some toolchain test failures for qemuppc:

https://autobuilder.yoctoproject.org/typhoon/#/builders/153/builds/529

and qemuarm64:

https://autobuilder.yoctoproject.org/typhoon/#/builders/151/builds/530

The question at this point is what do people want me to do. We clearly
have a really nasty bug in here. The patch is "right" and we do need to
fix this. If I merge it, I suspect I'm going to end up having to chase
this down before we can release and I am going to struggle to find the
time to do it and I suspect my sanity will suffer. This does look to be
a significant issue though.

Cheers,

Richard

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#1759): 
https://lists.openembedded.org/g/openembedded-architecture/message/1759
Mute This Topic: https://lists.openembedded.org/mt/101375777/21656
Group Owner: [email protected]
Unsubscribe: https://lists.openembedded.org/g/openembedded-architecture/unsub 
[[email protected]]
-=-=-=-=-=-=-=-=-=-=-=-

Reply via email to