On 2021-03-08 12:33 p.m., Richard Purdie wrote:
On Mon, 2021-03-08 at 09:50 -0500, Randy MacLeod wrote:
On 2021-03-07 3:58 p.m., Sakib Sajal wrote:
+timeout $1 dd if=/dev/zero of=/tmp/foo bs=1024 count=$2 >/dev/null 2>&1
+
+if [ $? -ne 0 ]
+then
+ top -b -n 1
+else
+ echo "success"
Do we need this else part? It'll just fill up the logs?
+fi
+
What values have you tried and how did that work out?
On Friday, a build with 100KB and (was it?) .5 seconds didn't
see any event exceed the timeout, right? We also ran several builds
out of a shared local sstate-cache and didn't see any event exceed the
timeout, iirc.
It would really help me to have an idea of what we're proposing we configure
on the autobuilder and how to capture the result...
Cheers,
Richard
Hi Richard,
Randy and I could use the autobuilders directly and make the necessary
changes to run the tests/experiments. If you prefer to do it yourself,
read on for more details.
We have been working on a way to collect more data to deal with the
various intermittent failures that we've been having. We put together a
simple script that tries to write a specified amount of data to the
filesystem within a specified time. The script can be used to determine
whether there is I/O stress on the filesystem: if the write does not
finish in time, it captures the output of top to show which processes
are running.
To use the script to monitor the filesystem during a build, add this to
local.conf:
BB_HEARTBEAT_EVENT = "<interval at which data will be logged>"
BB_LOG_HOST_STAT_ON_INTERVAL = "1"
BB_LOG_HOST_STAT_CMDS = "oe-timeout-dd-test.sh <timeout> <no. of kilobytes to write>"
The logs are stored in the tmp-glibc/buildstats/*/host_stats files.
Sample log output:
Event Time: 1615158368.577454
Date: 2021-03-07 23:06:08.586541
oe-timeout-dd-test.sh 0.1 1000
success
Event Time: 1615158392.393126
Date: 2021-03-07 23:06:32.480006
oe-timeout-dd-test.sh 0.1 1000
success
Event Time: 1615159999.497806
Date: 2021-03-07 23:33:19.508317
oe-timeout-dd-test.sh 0.1 1000
top - 15:33:20 up 136 days, 8:11, 16 users, load average: 641.54, 792.91, 704.
....
The builds that exceeded the timeout can be found by:
grep "top " tmp-glibc/buildstats/*/host_stats
For example:
build$ grep -r "top " data_collect*/tmp-glibc/buildstats/
data_collect1/tmp-glibc/buildstats/20210307205832/host_stats:top - 16:22:44 up 136 days, 9:01, 16 users, load average: 1653.48, 774.33, 548
data_collect2/tmp-glibc/buildstats/20210307205836/host_stats:top - 15:33:20 up 136 days, 8:11, 16 users, load average: 641.54, 792.91, 704.
data_collect2/tmp-glibc/buildstats/20210307205836/host_stats:top - 16:21:50 up 136 days, 9:00, 16 users, load average: 859.48, 535.06, 464.
data_collect2/tmp-glibc/buildstats/20210307205836/host_stats:top - 16:23:50 up 136 days, 9:02, 16 users, load average: 1300.73, 863.23, 596
data_collect3/tmp-glibc/buildstats/20210307205840/host_stats:top - 14:57:51 up 136 days, 7:36, 16 users, load average: 343.42, 281.75, 243.
data_collect4/tmp-glibc/buildstats/20210307205840/host_stats:top - 15:07:42 up 136 days, 7:46, 16 users, load average: 607.03, 447.76, 324.
data_collect5/tmp-glibc/buildstats/20210307205840/host_stats:top - 14:57:35 up 136 days, 7:35, 16 users, load average: 304.50, 270.72, 239.
data_collect5/tmp-glibc/buildstats/20210307205840/host_stats:top - 15:04:53 up 136 days, 7:43, 16 users, load average: 491.61, 307.05, 262.
data_collect5/tmp-glibc/buildstats/20210307205840/host_stats:top - 16:27:34 up 136 days, 9:05, 16 users, load average: 457.19, 639.73, 563.
I ran 5 builds simultaneously using a shared sstate-cache with a 0.5s
timeout and a 100KB write; none of the builds triggered the I/O lag
capture.
I reran the test without the shared sstate-cache and all 5 builds
encountered I/O lag at least once, some 3 or 4 times, as shown in the
grep output above.
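If it helps, the per-build counts can be tallied rather than read off
the grep output. This is only a rough sketch: count_io_lag is a
hypothetical helper name (not part of the patch), and it assumes every
captured snapshot starts with a "top - " line, as in the sample log.

```shell
# count_io_lag: hypothetical helper, not part of the posted patch.
# Prints "<count> <file>" for every host_stats file found under the
# directories given as arguments.
count_io_lag() {
    find "$@" -type f -name host_stats | sort | while IFS= read -r f
    do
        # Assumption: each top capture begins with "top - HH:MM:SS up ...".
        # grep -c exits non-zero when the count is 0, hence the "|| true".
        n=$(grep -c '^top - ' "$f" || true)
        printf '%s %s\n' "$n" "$f"
    done
}
```

For the runs above it could be invoked as
count_io_lag data_collect*/tmp-glibc/buildstats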
Looking at the logs, it looked like the machine was swapping.
The <timeout> and <count> (number of kilobytes to write) values may need
to be adjusted for each machine.
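One possible way to pick values for a given machine is to try the same
write against a few candidate timeouts and see which ones it meets. The
sketch below is an illustration only: probe_timeouts is a hypothetical
helper name, it writes to a scratch file created with mktemp rather than
/tmp/foo, and it assumes a timeout command that accepts fractional
seconds (GNU coreutils does).

```shell
# probe_timeouts: hypothetical helper, not part of the posted patch.
# Usage: probe_timeouts <kilobytes> <timeout> [<timeout> ...]
# Prints "<timeout> ok" or "<timeout> exceeded" for each candidate.
probe_timeouts() {
    kilobytes="$1"
    shift
    target=$(mktemp)   # scratch file; an assumption, the patch uses /tmp/foo
    for limit in "$@"
    do
        # Same dd invocation as the script, with the candidate timeout.
        if timeout "$limit" dd if=/dev/zero of="$target" bs=1024 count="$kilobytes" >/dev/null 2>&1
        then
            echo "$limit ok"
        else
            echo "$limit exceeded"
        fi
    done
    rm -f "$target"
}
```

For example: probe_timeouts 100 0.05 0.1 0.5
The result is only meaningful if it is run while the machine is under a
representative build load; on an idle machine every candidate is likely
to pass.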
We found a bug in the data collection mechanism: if a script has not
returned within one second, it times out and is killed. This bug should
not prevent you from carrying out tests; I will send a fix.
The failure looks like this:
Error running command: oe-timeout-dd-test.sh 0.1 1000
Command '['oe-timeout-dd-test.sh', '0.1', '1000']' timed out after 1 seconds
Sincerely,
Sakib
View/Reply Online (#149122):
https://lists.openembedded.org/g/openembedded-core/message/149122