Hi,

On Tue, Nov 10, 2020 at 06:07:44PM -0500, Sakib Sajal wrote:
> There are a number of timeout and hang defects where
> it would be useful to collect statistics about what
> is running on a build host when that condition occurs.
> 
> This adds functionality to collect build system stats
> on a regular interval and/or on task failure. Both
> features are disabled by default.
> 
> To enable logging on a regular interval, set:
> BB_HEARTBEAT_EVENT = "<interval>"
> BB_LOG_HOST_STAT_ON_INTERVAL = <boolean>
> Logs are stored in ${BUILDSTATS_BASE}/<build_name>/host_stats
> 
> To enable logging on a task failure, set:
> BB_LOG_HOST_STAT_ON_FAILURE = "<boolean>"
> Logs are stored in ${BUILDSTATS_BASE}/<build_name>/build_stats
> 
> The list of commands, along with the desired options, need
> to be specified in the BB_LOG_HOST_STAT_CMDS variable
> delimited by ; as such:
> BB_LOG_HOST_STAT_CMDS = "command1 ; command2 ;... ;"

I can understand the motivation, and I have debugged crashing and hanging
build machines myself, but I would not have found this change useful. Do you
have more concrete examples of how this could be used?

Instead, I found that normal Linux server admin practices were best:

 * collect build machine kernel, journald and syslog logs to a remote host,
   e.g. with rsyslog
 * monitor CPU, memory, IO, network etc. performance, also to a remote host,
   e.g. with pcp.io tooling or collectd
 * collect bitbake build logs with system timestamps to a remote host, i.e.
   don't trust jenkins and its timestamps

With those, I have been able to find problems in Linux kernels, bugs in the
VMware cloud storage stack triggering IO hangs, stalls and eventually kernel
crashes, and broken HW such as memory. And of course basic things like full
disks, a full /tmp, and the kernel OOM killer kicking in when build slaves
ran out of RAM during a bitbake build. The latter led to either build machine
changes or tuning of the parallel build flags to also account for physical RAM.
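
As a rough illustration of tuning the parallel build flags against physical
RAM, here is a minimal Python sketch. The function names and the 2 GiB per job
budget are assumptions made for the example, not measured values; the right
budget depends on what is being built.

```python
import os


def parallel_jobs(cpus, ram_gib, gib_per_job=2.0):
    """Cap the number of parallel jobs by both CPU count and physical RAM.

    gib_per_job is an assumed per-job memory budget, not a measured value.
    """
    by_ram = int(ram_gib // gib_per_job)
    # Never go below one job, even on a very small machine.
    return max(1, min(cpus, by_ram))


def physical_ram_gib():
    """Physical RAM in GiB, read via sysconf (works on Linux)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 2**30


if __name__ == "__main__":
    jobs = parallel_jobs(os.cpu_count() or 1, physical_ram_gib())
    # Print a line that could be pasted into local.conf.
    print(f'PARALLEL_MAKE = "-j {jobs}"')
```

Note that BB_NUMBER_THREADS multiplies with PARALLEL_MAKE in the worst case, so
the same budget would have to be split between the two if both are raised.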

Without a full remote logging infrastructure, I could not have solved any of
these. Running individual commands is not enough when only the full kernel
dmesg of the affected machine can tell that the IO stack has a hang or an
Oops, or that a disk has been mounted read-only due to errors.
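
To make that concrete, a small Python sketch of the kind of scan that can be
run over kernel logs collected on the remote host. The patterns below are
illustrative assumptions only; the exact message wording varies between kernel
versions and filesystems, so treat them as a starting point.

```python
import re

# Illustrative patterns only; exact wording varies by kernel and filesystem.
PATTERNS = {
    "hung_task": re.compile(r"blocked for more than \d+ seconds"),
    "oops": re.compile(r"\bOops\b"),
    "oom_kill": re.compile(r"Out of memory: Kill"),
    "readonly_remount": re.compile(r"[Rr]emounting filesystem read-only"),
}


def scan_kernel_log(lines):
    """Return (label, line) pairs for lines matching a known failure pattern."""
    hits = []
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((label, line.rstrip("\n")))
                break  # report each line once, under the first matching label
    return hits
```

It could be fed an open file containing the kernel log forwarded from the
build machine, e.g. scan_kernel_log(open(path)) with whatever path the remote
log collector writes to.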

Cheers,

-Mikko