Re: [OE-core] [PATCH] scripts/oe-timeout-dd-test.sh: add script

Sakib Sajal Mon, 08 Mar 2021 12:47:18 -0800


On 2021-03-08 12:33 p.m., Richard Purdie wrote:



On Mon, 2021-03-08 at 09:50 -0500, Randy MacLeod wrote:

On 2021-03-07 3:58 p.m., Sakib Sajal wrote:

+timeout $1 dd if=/dev/zero of=/tmp/foo bs=1024 count=$2 >/dev/null 2>&1
+
+if [ $? -ne 0 ]
+then
+        top -b -n 1
+else
+        echo "success"

Do we need this else part? It'll just fill up the logs?

+fi
+

What values have you tried and how did that work out?
On Friday, a build with 100KB and, was it, .5 seconds didn't
see any event exceed the timeout, right? We also ran several builds
out of a shared local sstate-cache and didn't see any event exceed the
timeout, iirc.

It would really help me to have an idea of what we're proposing we configure
on the autobuilder and how to capture the result...

Cheers,

Richard


Hi Richard,

Randy and I could use the autobuilders directly and make the necessary changes to run the tests/experiments. If you prefer to do it yourself, read on for more details.

We have been working on a way to collect more data to deal with the various intermittent failures that we've been having. We put together a simple script that tries to write a specified amount of data to the filesystem within a specified time. The script can be used to determine whether there is io stress on the filesystem; if there is, it captures the output of top to show which processes are running.

To use the script to monitor the filesystem during a build, add this to local.conf:


BB_HEARTBEAT_EVENT = "<interval at which data will be logged>"
BB_LOG_HOST_STAT_ON_INTERVAL = "1"

BB_LOG_HOST_STAT_CMDS = "oe-timeout-dd-test.sh <timeout> <no. of kilobytes to write>"


The logs are stored in the tmp-glibc/buildstats/*/host_stats files.

Sample log output:

Event Time: 1615158368.577454
Date: 2021-03-07 23:06:08.586541
oe-timeout-dd-test.sh 0.1 1000
success

Event Time: 1615158392.393126
Date: 2021-03-07 23:06:32.480006
oe-timeout-dd-test.sh 0.1 1000
success

Event Time: 1615159999.497806
Date: 2021-03-07 23:33:19.508317
oe-timeout-dd-test.sh 0.1 1000

top - 15:33:20 up 136 days, 8:11, 16 users, load average: 641.54, 792.91, 704.

....
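
To see when the timeout was exceeded, the records can be scanned for the top header. The following is only a rough sketch: list_timeout_dates is a hypothetical helper, not part of the patch, and it assumes host_stats records look like the sample above (a "Date:" line, followed by the command output, with the top header starting with "top -").

```shell
#!/bin/sh
# Hypothetical helper (not part of the patch): print the "Date:" line of
# every host_stats record whose command output contains a top header,
# i.e. every heartbeat at which the dd write did not finish in time.
# Assumes the record layout shown in the sample log.
list_timeout_dates() {
    awk '
        /^Date: / { date = $0 }     # remember the date of the current record
        /^top -/  { print date }    # a top header means this record timed out
    ' "$@"
}
```

For example: list_timeout_dates tmp-glibc/buildstats/*/host_stats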


The builds that exceeded the timeout can be found by:

grep "top " tmp-glibc/buildstats/*/host_stats

For example:

build$ grep -r  "top " data_collect*/tmp-glibc/buildstats/

data_collect1/tmp-glibc/buildstats/20210307205832/host_stats:top - 16:22:44 up 136 days, 9:01, 16 users, load average: 1653.48, 774.33, 548
data_collect2/tmp-glibc/buildstats/20210307205836/host_stats:top - 15:33:20 up 136 days, 8:11, 16 users, load average: 641.54, 792.91, 704.
data_collect2/tmp-glibc/buildstats/20210307205836/host_stats:top - 16:21:50 up 136 days, 9:00, 16 users, load average: 859.48, 535.06, 464.
data_collect2/tmp-glibc/buildstats/20210307205836/host_stats:top - 16:23:50 up 136 days, 9:02, 16 users, load average: 1300.73, 863.23, 596
data_collect3/tmp-glibc/buildstats/20210307205840/host_stats:top - 14:57:51 up 136 days, 7:36, 16 users, load average: 343.42, 281.75, 243.
data_collect4/tmp-glibc/buildstats/20210307205840/host_stats:top - 15:07:42 up 136 days, 7:46, 16 users, load average: 607.03, 447.76, 324.
data_collect5/tmp-glibc/buildstats/20210307205840/host_stats:top - 14:57:35 up 136 days, 7:35, 16 users, load average: 304.50, 270.72, 239.
data_collect5/tmp-glibc/buildstats/20210307205840/host_stats:top - 15:04:53 up 136 days, 7:43, 16 users, load average: 491.61, 307.05, 262.
data_collect5/tmp-glibc/buildstats/20210307205840/host_stats:top - 16:27:34 up 136 days, 9:05, 16 users, load average: 457.19, 639.73, 563.

I ran 5 builds simultaneously using a shared sstate-cache with a 0.5s timeout and a 100 KB write, which did not trigger the script's io lag detection for any build. I reran the test without the shared sstate-cache and all 5 builds encountered io lag at least once, some 3 or 4 times, as shown in the grep output above.
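
To get a per-build count instead of reading the raw grep output, something like the following might help. It is a minimal sketch: count_io_timeouts is a hypothetical helper, not part of the patch, and it assumes each timeout shows up as exactly one line starting with "top -" in host_stats.

```shell
#!/bin/sh
# Hypothetical helper (not part of the patch): for each host_stats file
# given, print "<count> <file>", where <count> is the number of lines
# starting with "top -" (assumed to be one per timed-out dd write).
count_io_timeouts() {
    for f in "$@"; do
        # Skip unmatched globs and missing files.
        [ -f "$f" ] || continue
        # grep -c prints 0 and exits non-zero when nothing matches,
        # so ignore its exit status.
        n=$(grep -c '^top -' "$f" || true)
        printf '%s %s\n' "$n" "$f"
    done
}
```

For example: count_io_timeouts data_collect*/tmp-glibc/buildstats/*/host_stats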


Looking at the logs, it looked like the machine was swapping.

The <timeout> and <count> values may need to be adjusted for each machine.

We found a bug in the data collection mechanism: if the script hasn't returned within a second, it times out and is killed. This bug should not prevent you from carrying out tests; I will send a fix.


The bug looks like the following:
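
One possible way to pick a starting value on a given machine is to try a few candidate timeouts on an otherwise idle system and take the smallest one that succeeds. This is only a sketch under assumptions: find_min_timeout is a hypothetical helper, it writes to a mktemp file rather than /tmp/foo, and it relies on a timeout command that accepts fractional seconds (as GNU coreutils does).

```shell
#!/bin/sh
# Hypothetical calibration helper (not part of the patch).
# Usage: find_min_timeout <kilobytes> <timeout> [<timeout> ...]
# Tries the candidate timeouts in the order given and prints the first
# one within which the dd write completes. Returns 1 if none is enough.
find_min_timeout() {
    count_kb=$1
    shift
    # The file is created in the default temp directory; the filesystem
    # holding it is the one being measured.
    tmpfile=$(mktemp)
    for t in "$@"; do
        # Same dd invocation as in the patch, with a different target file.
        if timeout "$t" dd if=/dev/zero of="$tmpfile" bs=1024 count="$count_kb" >/dev/null 2>&1; then
            echo "$t"
            rm -f "$tmpfile"
            return 0
        fi
    done
    rm -f "$tmpfile"
    return 1
}
```

For example: find_min_timeout 100 0.05 0.1 0.5 1

The value found on an idle machine is a lower bound; the timeout actually configured would need some margin above it so that normal build load does not trigger it.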

Error running command: oe-timeout-dd-test.sh 0.1 1000
Command '['oe-timeout-dd-test.sh', '0.1', '1000']' timed out after 1 seconds


Sincerely,

Sakib

