I've been trying to diagnose issues with some of the CI pipelines, and one blind spot in the reporting is the stage breakdown. We get timing info for the first couple of stages and maybe the last one, but the lion's share of the run is reported as one big stage in the middle. It'll say 45 minutes, or jump to an hour, and I can't tell why because the report isn't granular enough. Is that something we can improve? For example, this one jumped to almost 2 hours from a normal 45-minute run: http://jenkins.mxnet-ci.amazon-ml.com/job/restricted-website-build/986/timings/
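To make the ask concrete, here is a rough sketch of the kind of per-step timing I have in mind. It assumes the long stage is driven by a Python script we can edit, which may not be the case. The names `timed_step`, `timings`, and `slowest_first` are hypothetical and not part of our current CI code.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory log of (step name, elapsed seconds).
timings = []


@contextmanager
def timed_step(name, sink=timings):
    """Time the wrapped block and record the result, even if the block fails."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        sink.append((name, elapsed))
        # Print right away so the number lands in the console log next to the step.
        print(f"[timing] {name}: {elapsed:.1f}s", flush=True)


def slowest_first(sink=timings):
    """Return the recorded steps sorted from slowest to fastest."""
    return sorted(sink, key=lambda item: item[1], reverse=True)
```

Each sub-step of the big stage would then be wrapped as `with timed_step("some step name"):`, and `slowest_first()` printed at the end would show where the extra time went.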
I'd also like to see some data dumps, such as memory usage, disk space, and whatever other diagnostics we can get, at each stage.

Cheers,
Aaron
