On Fri, Jan 29, 2016 at 12:58 PM, j.nitsc...@ok.de <j.nitsc...@ok.de> wrote:
> On Thu, 28 Jan 2016 16:10:52 -0800 Kay Schenk wrote:
> >
> > On 01/14/2016 09:48 AM, Kay Schenk wrote:
> >> On Thu, Jan 14, 2016 at 4:04 AM, j.nitsc...@ok.de
> >> <j.nitsc...@ok.de> wrote:
> >>
> >> Hello,
> >>
> >> some may have noticed our linux-32 buildbot fails quite often. [1]
> >> Here an analysis: (tl;dr jump to solutions)
> >> * always fails in first buildbot step: svn updating
> >> * failed step takes around 6 minutes, a successful step uses ~37
> >>   minutes to complete
> >> * the commands in the step take much time and often a timeout
> >>   triggers
> >>
> >> The commands and their timeouts (seconds) are:
> >> 1) svn --version (1200)
> >> 2) rm -rf
> >>    /home/buildslave20/slave20/openoffice-linux32-nightly/build (120)
> >> 3) chmod -Rf u+rwx
> >>    /home/buildslave20/slave20/openoffice-linux32-nightly/build
> >>    (120) ah, why?
> >> 4) rm -rf
> >>    /home/buildslave20/slave20/openoffice-linux32-nightly/build
> >>    (120) huh, again?
> >> 5) svn info --xml --non-interactive --no-auth-cache (1200)
> >> 6) svn update --non-interactive --no-auth-cache (1200)
> >> 7) cp -R -P -p -v
> >>    /home/buildslave20/slave20/openoffice-linux32-nightly/source
> >>    /home/buildslave20/slave20/openoffice-linux32-nightly/build (120)
> >> 8) svn info --xml (1200)
> >>
> >> Their results:
> >> 1) Always finishes in ~15 seconds
> >> 2) No output, almost always fails with command timed out: 120
> >>    seconds without output, attempting to kill
> >> 3) No output, almost always fails with command timed out: 120
> >>    seconds without output, attempting to kill
> >> 4) No output, finishes sometimes.
> >>    *if we fail here the build process is stopped and this is the
> >>    reason for the frequent failures*
> >> 5) Local command completes in a sec.
> >> 6) Can take a while depending on source changes. Gives tons of
> >>    output, so timeout never triggers.
> >> 7) Takes *very* long (over 20 minutes) but never triggers timeout as
> >>    with '-v' the output spams the log.
> >> 8) Local command again takes a sec.
> >>
> >> Conclusions:
> >> *file operations don't have enough time to finish*
> >>
> >> Solutions:
> >> Edit 'svn updating' buildstep
> >> a) Remove rm and chmod commands and replace cp with
> >>    'rsync -q -t -p -r --delete
> >>    /home/buildslave20/slave20/openoffice-linux32-nightly/source
> >>    /home/buildslave20/slave20/openoffice-linux32-nightly/build'
> >>    This is much faster as very few copies are needed and its delete
> >>    is faster than the rm command. But increase the timeout anyway
> >>    just in case.
> >>    (*preferred* solution but needs rsync on the box)
> >> b) increase the timeouts and shut up cp by removing '-v'
> >> c) remove unversioned files when updating and build in this folder
> >> d) Make rm and chmod verbose by adding '-v' (or '-c' for chmod).
> >>    Spam the log even more, but the timeouts won't trigger.
> >>    Who doesn't like 50MB logfiles? Yes, the log for this step of
> >>    every successful build is over 50MB currently! Starting build
> >>    #127 [1] (before this build there was only a build folder but
> >>    no source)
> >>    Not a serious solution!
> >>
> >> *I suggest we fix this soon because the huge log files will blow
> >> up a server sooner or later.*
> >>
> >> Regards Jochen
> >>
> >> [1] https://ci.apache.org/builders/openoffice-linux32-nightly
> >>
> >> note: on linux64 buildbot the file operations are *much* faster. cp
> >> takes 90 secs; it isn't verbose but stays within the 120 sec
> >> timeout limit.
> >>
> >>
> >> Thanks for the suggestions, I will look into this.
> >>
> >
> > I just wanted to give a short update on this.
> >
> > * our Linux-32 and linux-64 buildbots use the same mechanisms for an
> > svn pull -- a "copy" -- so I left the 32-bit instructions as is
> 'copy' instructions differ in one detail
> Linux-32: cp -R -P -p -v
> /home/buildslave20/slave20/openoffice-linux32-nightly/source
> /home/buildslave20/slave20/openoffice-linux32-nightly/build
> Linux-64: cp -R -P -p
> /home/buildslave19/slave19/openofficeorg-nightly/source
> /home/buildslave19/slave19/openofficeorg-nightly/build
>
> *-v* needs to go to reduce the log size
>

OK. and thank you. Your eyes are better than mine! :)

> but we have to increase timeout further before we do this or copy will
> always fail
>
> https://ci.apache.org/builders/openoffice-linux32-nightly/builds/162/steps/svn/logs/stdio
> :

I will try it again. It's these odd remove commands that we don't seem to
control that seem to be a problem.

> > cp -R -P -p -v
> > /home/buildslave20/slave20/openoffice-linux32-nightly/source
> > /home/buildslave20/slave20/openoffice-linux32-nightly/build in dir
> > /home/buildslave20/slave20/openoffice-linux32-nightly (timeout 120 secs)
> ... humongous log ...
> > elapsedTime=1370.929525 program finished with exit code 0
> seems 1200 won't be enough, note that the timeout for cp was still 120
>
> On Thu, 28 Jan 2016 16:10:52 -0800 Kay Schenk wrote:
> > * I recently updated the timeout for the svn pull for linux-32 to
> > 1200 secs. To me it looked like this was set to 120 though it IS
> > supposed to default to 1200, but...
> timeouts in 'svn update' of build #162 (Jan 29 02:05) haven't changed
> from older builds
> >
> > * there are some other extra steps -- some removes -- that seem to
> > be tacked onto the svn step that are outside of our config commands
> > that ARE timing out and seem to NOT be governed by the total timeout
> > for this step, yet they time out in successful builds also.
> well, removes get another try after a chmod.
> so the first remove can time out without consequence
>
> when both removes fail the build fails, but succeeds the next day
> because most files are removed already
>

Yes, that is correct. I noticed this also. So we get builds every other day.

> > * there are some buildbot setup instructions that differ for our
> > linux-64 and linux-32 builds.
> maybe our instructions don't reach the buildbots or aren't updated?
> >
> > Detailed in:
> > My INFRA ticket to track Linux-32 buildbot problems:
> >
> > https://issues.apache.org/jira/browse/INFRA-10997
> >
> > So, still a mystery to me at this point.
> checking time frame for other tasks is a good idea
> the difference of the same cp on Linux-32 and Linux-64 looks too big
> Linux-32: elapsedTime=1370.929525
> Linux-64: elapsedTime=117.262038
>

Thanks again for your assistance...more tweaking to come...soon.

--
----------------------------------------------------------------------
MzK

"Though no one can go back and make a brand new start,
 anyone can start from now and make a brand new ending."
                                   -- Carl Bard
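
A minimal sketch of solution a) from the analysis quoted above, for anyone who wants to try the rsync variant before touching the buildbot config. It uses throwaway directories under a temp folder as stand-ins for the real .../source and .../build trees (that layout is hypothetical), and it assumes rsync is installed, which the thread notes may not be the case on the slave. Two details are assumptions worth checking on the box: the trailing slash on source/ is added here so that rsync fills build/ directly rather than creating build/source, and plain -r does not, as far as I know, copy symlinks the way 'cp -P' does, so '-l' may be needed as well.

```shell
#!/bin/sh
# Sketch only: temp directories stand in for the buildbot's
# .../source and .../build trees (hypothetical layout).
set -eu

# Option a) "needs rsync on the box"; skip quietly if it is missing.
command -v rsync >/dev/null 2>&1 || { echo "rsync not installed, skipping"; exit 0; }

work=$(mktemp -d)
mkdir -p "$work/source/sub" "$work/build"
echo "tracked" > "$work/source/sub/file.txt"   # file that should be copied
echo "stale"   > "$work/build/leftover.o"      # file that --delete should remove

# Flags as proposed in option a). The trailing slash on source/ makes
# rsync copy the directory's contents into build/ instead of creating
# build/source (assumption: that is the intended result, matching what
# cp -R does when build does not exist yet).
rsync -q -t -p -r --delete "$work/source/" "$work/build/"

# Show what ended up in the build tree.
ls -R "$work/build"
```

Because -q keeps rsync silent, a step like this would still need a timeout large enough for the first full copy; later runs only transfer what changed.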