On 2021-03-08 12:33 p.m., Richard Purdie wrote:

On Mon, 2021-03-08 at 09:50 -0500, Randy MacLeod wrote:
On 2021-03-07 3:58 p.m., Sakib Sajal wrote:
+timeout $1 dd if=/dev/zero of=/tmp/foo bs=1024 count=$2 >/dev/null 2>&1
+
+if [ $? -ne 0 ]
+then
+        top -b -n 1
+else
+        echo "success"
Do we need this else part? It'll just fill up the logs?

+fi
+
What values have you tried and how did that work out?
On Friday, a build with 100 KB and (was it?) 0.5 seconds didn't
see any event exceed the timeout, right? We also ran several builds
out of a shared local sstate-cache and didn't see any event exceed the
timeout, iirc.
It would really help me to have an idea of what we're proposing we configure
on the autobuilder and how to capture the result...

Cheers,

Richard

Hi Richard,

Randy and I could use the autobuilders directly and make the necessary changes to run the tests/experiments. If you prefer to do it yourself, read on for more details.

We have been working on a way to collect more data to deal with the various intermittent failures that we've been having. We put together a simple script that tries to write a specified amount of data to the filesystem within a specified time. The script can be used to determine whether there is I/O stress on the filesystem; if there is, it captures the output of top to show which processes are running.
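
For reference, the check described above can be sketched as a single shell function. This is a minimal sketch and not the submitted script: the function name io_stress_probe and the use of mktemp for the scratch file are assumptions made here for illustration (the patch writes to /tmp/foo).

```shell
#!/bin/sh
# io_stress_probe is a hypothetical name; the patch itself is a standalone script.
# Try to write <kilobytes> blocks of 1024 bytes within <timeout> seconds.
io_stress_probe() {
    probe_timeout=$1   # seconds; GNU timeout accepts fractional values
    probe_kb=$2        # number of 1024-byte blocks to write
    probe_file=$(mktemp /tmp/io-probe.XXXXXX) || return 2

    if timeout "$probe_timeout" dd if=/dev/zero of="$probe_file" bs=1024 count="$probe_kb" >/dev/null 2>&1
    then
        echo "success"
        probe_status=0
    else
        # The write did not finish in time: record what the host is running.
        top -b -n 1
        probe_status=1
    fi
    rm -f "$probe_file"
    return "$probe_status"
}
```

For example, io_stress_probe 0.5 100 would attempt a 100 KB write with a 0.5 second limit.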


To use the script to monitor the filesystem during a build, add this to local.conf:

BB_HEARTBEAT_EVENT = "<interval at which data will be logged>"
BB_LOG_HOST_STAT_ON_INTERVAL = "1"
BB_LOG_HOST_STAT_CMDS = "oe-timeout-dd-test.sh <timeout> <no. of kilobytes to write>"

The logs are stored in the tmp-glibc/buildstats/*/host_stats files.

Sample log output:

Event Time: 1615158368.577454
Date: 2021-03-07 23:06:08.586541
oe-timeout-dd-test.sh 0.1 1000
success

Event Time: 1615158392.393126
Date: 2021-03-07 23:06:32.480006
oe-timeout-dd-test.sh 0.1 1000
success

Event Time: 1615159999.497806
Date: 2021-03-07 23:33:19.508317
oe-timeout-dd-test.sh 0.1 1000
top - 15:33:20 up 136 days,  8:11, 16 users,  load average: 641.54, 792.91, 704.
....
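
To see when the timeouts happened, the Date line that precedes each top dump can be printed. This is a minimal sketch, assuming the host_stats layout matches the sample above (a "Date:" line before each command's output); the function name list_timeout_dates is hypothetical.

```shell
#!/bin/sh
# list_timeout_dates is a hypothetical helper, not part of the patch.
# It prints the "Date:" value of every logged event whose output contains a top dump.
list_timeout_dates() {
    awk '
        /^Date: / { date = substr($0, 7) }   # remember the most recent event date
        /^top - / { print date }             # a top header means the dd write timed out
    ' "$@"
}
```

It could be run as list_timeout_dates tmp-glibc/buildstats/*/host_stats.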


The builds that exceeded the timeout can be found by:

grep "top " tmp-glibc/buildstats/*/host_stats

For example:

build$ grep -r  "top " data_collect*/tmp-glibc/buildstats/
data_collect1/tmp-glibc/buildstats/20210307205832/host_stats:top - 16:22:44 up 136 days,  9:01, 16 users,  load average: 1653.48, 774.33, 548
data_collect2/tmp-glibc/buildstats/20210307205836/host_stats:top - 15:33:20 up 136 days,  8:11, 16 users,  load average: 641.54, 792.91, 704.
data_collect2/tmp-glibc/buildstats/20210307205836/host_stats:top - 16:21:50 up 136 days,  9:00, 16 users,  load average: 859.48, 535.06, 464.
data_collect2/tmp-glibc/buildstats/20210307205836/host_stats:top - 16:23:50 up 136 days,  9:02, 16 users,  load average: 1300.73, 863.23, 596
data_collect3/tmp-glibc/buildstats/20210307205840/host_stats:top - 14:57:51 up 136 days,  7:36, 16 users,  load average: 343.42, 281.75, 243.
data_collect4/tmp-glibc/buildstats/20210307205840/host_stats:top - 15:07:42 up 136 days,  7:46, 16 users,  load average: 607.03, 447.76, 324.
data_collect5/tmp-glibc/buildstats/20210307205840/host_stats:top - 14:57:35 up 136 days,  7:35, 16 users,  load average: 304.50, 270.72, 239.
data_collect5/tmp-glibc/buildstats/20210307205840/host_stats:top - 15:04:53 up 136 days,  7:43, 16 users,  load average: 491.61, 307.05, 262.
data_collect5/tmp-glibc/buildstats/20210307205840/host_stats:top - 16:27:34 up 136 days,  9:05, 16 users,  load average: 457.19, 639.73, 563.
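
When comparing several builds, a per-file count is easier to read than the raw matches. A minimal sketch, assuming each timeout leaves exactly one line starting with "top - " in host_stats; the function name count_timeouts is hypothetical.

```shell
#!/bin/sh
# count_timeouts is a hypothetical helper: prints one "file:count" line per host_stats file.
count_timeouts() {
    # grep -c counts matching lines per file. Passing /dev/null as an extra
    # file forces the "file:" prefix even when only one file is given; its
    # own line is filtered out afterwards.
    grep -c '^top - ' "$@" /dev/null | grep -v '^/dev/null:'
}
```

It could be run as count_timeouts data_collect*/tmp-glibc/buildstats/*/host_stats. For the five builds in the grep output above, the counts should come out as 1, 3, 1, 1 and 3.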


I ran 5 builds simultaneously using a shared sstate-cache with a 0.5 s timeout and a 100 KB write; none of the builds triggered the I/O-lag capture. I reran the test without the shared sstate-cache and all 5 builds encountered I/O lag at least once, some 3 or 4 times, as shown in the grep output above.

The logs suggest that the machine was swapping.

The <timeout> and the <count> variables may need to be adjusted for each machine.
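
One way to pick a starting value for <timeout> on a given machine is to try a few candidates while the machine is idle. This is only a sketch of that idea, not something we have run on the autobuilders; the function name smallest_passing_timeout and the mktemp scratch file are assumptions.

```shell
#!/bin/sh
# smallest_passing_timeout is a hypothetical helper for choosing <timeout>.
# Usage: smallest_passing_timeout <kilobytes> <timeout>...
# Tries each timeout in the order given and prints the first one within which
# the dd write completes. Prints nothing and returns 1 if none is long enough.
smallest_passing_timeout() {
    cal_kb=$1
    shift
    cal_file=$(mktemp /tmp/io-cal.XXXXXX) || return 2
    for cal_timeout in "$@"
    do
        if timeout "$cal_timeout" dd if=/dev/zero of="$cal_file" bs=1024 count="$cal_kb" >/dev/null 2>&1
        then
            echo "$cal_timeout"
            rm -f "$cal_file"
            return 0
        fi
    done
    rm -f "$cal_file"
    return 1
}
```

For example, smallest_passing_timeout 100 0.05 0.1 0.5 1 reports the shortest listed timeout that a 100 KB write meets. The value configured for builds would presumably need some margin above that, so that only real stalls are logged.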

We found a bug in the data collection mechanism: if the script has not returned within one second, it times out and is killed. This bug should not prevent you from carrying out tests; I will send a fix.

The bug looks like the following:

Error running command: oe-timeout-dd-test.sh 0.1 1000
Command '['oe-timeout-dd-test.sh', '0.1', '1000']' timed out after 1 seconds


Sincerely,

Sakib
