Re: [build system] jenkins got itself wedged...
working on it. we'll have intermittent downtime the next ~30 mins.

On Sun, May 21, 2017 at 12:01 PM, shane knapp wrote:
> yeah. i noticed that and restarted it a few minutes ago. i'll have
> some time later this afternoon to take a closer look... :\
> [...]
Re: [build system] jenkins got itself wedged...
yeah. i noticed that and restarted it a few minutes ago. i'll have
some time later this afternoon to take a closer look... :\

On Sun, May 21, 2017 at 9:08 AM, Kazuaki Ishizaki wrote:
> It was working well these days. However, it seems to be slowly going
> down again...
>
> When I tried to view a console log (e.g.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77149/consoleFull),
> the server returned "proxy error."
>
> Regards,
> Kazuaki Ishizaki
> [...]
Re: [build system] jenkins got itself wedged...
It was working well these days. However, it seems to be slowly going
down again...

When I tried to view a console log (e.g.
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77149/consoleFull),
the server returned "proxy error."

Regards,
Kazuaki Ishizaki

From: shane knapp
To: Sean Owen
Cc: dev
Date: 2017/05/20 09:43
Subject: Re: [build system] jenkins got itself wedged...

last update of the week:

things are looking great... we're GCing happily and staying well
within our memory limits.

i'm going to do one more restart after the two pull request builds
finish to re-enable backups, and call it a weekend. :)

shane

On Fri, May 19, 2017 at 8:29 AM, shane knapp wrote:
> this is hopefully my final email on the subject... :)
>
> things seem to have settled down after my GC tuning, and system
> load/cpu usage/memory has been nice and flat all night. i'll continue
> to keep an eye on things, but it looks like we've weathered the worst
> part of the storm.
>
> On Thu, May 18, 2017 at 6:40 PM, shane knapp wrote:
>> after needing another restart this afternoon, i did some homework and
>> aggressively twiddled some GC settings[1]. since then, things have
>> definitely smoothed out w/regards to memory and cpu usage spikes.
>>
>> i've attached a screenshot of slightly happier-looking graphs.
>>
>> still keeping an eye on things, and hoping that i can go back to
>> being a lurker... ;)
>>
>> shane
>>
>> 1 - https://jenkins.io/blog/2016/11/21/gc-tuning/
>>
>> On Thu, May 18, 2017 at 11:20 AM, shane knapp wrote:
>>> ok, more updates:
>>>
>>> 1) i audited all of the builds, and found that the spark-*-compile-*
>>> and spark-*-test-* jobs were set to the identical cron time trigger,
>>> so josh rosen and i updated them to run at H/5 (instead of */5).
>>> load balancing ftw.
>>>
>>> 2) the jenkins master is now running on java8, which has moar bettar
>>> GC management under the hood.
>>>
>>> i'll be keeping an eye on this today, and if we start seeing GC
>>> overhead failures, i'll start doing more GC performance tuning.
>>> thankfully, cloudbees has a relatively decent guide that i'll be
>>> following here: https://jenkins.io/blog/2016/11/21/gc-tuning/
>>>
>>> shane
>>>
>>> On Thu, May 18, 2017 at 8:39 AM, shane knapp wrote:
>>>> yeah, i spoke too soon. jenkins is still misbehaving, but FINALLY
>>>> i'm getting some error messages in the logs... looks like jenkins
>>>> is thrashing on GC.
>>>>
>>>> now that i know what's up, i should be able to get this sorted
>>>> today.
>>>>
>>>> On Thu, May 18, 2017 at 12:39 AM, Sean Owen wrote:
>>>>> I'm not sure if it's related, but I still can't get Jenkins to
>>>>> test PRs. For example, triggering it through the
>>>>> spark-prs.appspot.com UI gives me...
>>>>>
>>>>> https://spark-prs.appspot.com/trigger-jenkins/18012
>>>>>
>>>>> Internal Server Error
>>>>>
>>>>> That might be from the appspot app, though?
>>>>>
>>>>> But posting "Jenkins test this please" on PRs doesn't seem to
>>>>> work, and I can't reach Jenkins:
>>>>>
>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/
>>>>>
>>>>> On Thu, May 18, 2017 at 12:44 AM, shane knapp wrote:
>>>>>> after another couple of restarts due to high load and system
>>>>>> unresponsiveness, i finally found the most likely culprit:
>>>>>>
>>>>>> a typo in the jenkins config where the java heap size was
>>>>>> configured. instead of -Xmx16g, we had -Dmx16G... which could
>>>>>> easily explain the random and non-deterministic system hangs
>>>>>> we've had over the past couple of years.
>>>>>>
>>>>>> anyways, it's been corrected and the master seems to be humming
>>>>>> along, for real this time, w/o issue. i'll continue to keep an
>>>>>> eye on this for the rest of the week, but things are looking
>>>>>> MUCH better now.
>>>>>>
>>>>>> sorry again for the interruptions in service.
>>>>>>
>>>>>> shane
>>>>>>
>>>>>> On Wed, May 17, 2017 at 9:59 AM, shane knapp wrote:
>>>>>>> ok, we're back up, system load looks cromulent, and we're
>>>>>>> happily building (again).
>>>>>>>
>>>>>>> shane
>>>>>>>
>>>>>>> On Wed, May 17, 2017 at 9:50 AM, shane knapp wrote:
>>>>>>>> i'm going to need to perform a quick reboot on the jenkins
>>>>>>>> master. it looks like it's hung again.
>>>>>>>>
>>>>>>>> sorry about this!
>>>>>>>>
>>>>>>>> shane
>>>>>>>>
>>>>>>>> On Tue, May 16, 2017 at 12:55 PM, shane knapp wrote:
>>>>>>>>> ...but just now i started getting alerts on system load,
>>>>>>>>> which was rather high. i had to kick jenkins again, and will
>>>>>>>>> keep an eye on the master and possibly reboot again if
>>>>>>>>> needed.
>>>>>>>>>
>>>>>>>>> sorry about the interruption of service...
>>>>>>>>>
>>>>>>>>> shane
>>>>>>>>>
>>>>>>>>> On Tue, May 16, 2017 at 8:18 AM, shane knapp wrote:
>>>>>>>>>> ...so i kicked it and it's now back up and happily building.
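A few notes for readers running their own jenkins masters. The GC tuning
shane describes follows the cloudbees guide linked in the thread
(https://jenkins.io/blog/2016/11/21/gc-tuning/). As a rough sketch of the
kind of java8 flags that guide discusses -- illustrative only, not the
exact settings applied to this master:

    # hypothetical jenkins master JVM options, assuming a 16g heap and
    # the G1 collector; where these go (JAVA_ARGS, JENKINS_JAVA_OPTIONS,
    # etc.) depends on your install
    -Xmx16g -Xms16g
    -XX:+UseG1GC
    -XX:+ExplicitGCInvokesConcurrent
    -XX:+ParallelRefProcEnabled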
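The H/5 change is also worth unpacking: in jenkins cron syntax, */5 fires
every job at the same wall-clock minutes, so identically-scheduled jobs
stampede the master together, while H hashes the job name into a stable
offset that spreads jobs across the window. Illustrative trigger specs
(not the actual spark job configs):

    # every job with this spec starts at :00, :05, :10, ... all at once
    */5 * * * *
    # still every 5 minutes, but each job gets a stable hashed offset
    H/5 * * * *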
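Likewise the heap-flag typo: -Xmx sets the JVM's maximum heap size, while
-D merely defines a system property, so -Dmx16G left the master running on
the default heap the whole time. A minimal sketch showing the difference
(HeapCheck is a hypothetical demo class, not part of the jenkins config):

    // HeapCheck.java -- compile, then run as:
    //   java -Xmx16g HeapCheck   -> max heap reports ~16 GiB
    //   java -Dmx16G HeapCheck   -> max heap reports the JVM default;
    //                               "mx16G" is just an empty system property
    public class HeapCheck {
        public static void main(String[] args) {
            long max = Runtime.getRuntime().maxMemory();
            System.out.printf("max heap: %.1f GiB%n",
                    max / (1024.0 * 1024 * 1024));
            // -Dmx16G defines a property literally named "mx16G"
            System.out.println("mx16G property present: "
                    + System.getProperties().containsKey("mx16G"));
        }
    }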