Re: [build system] jenkins got itself wedged...

2017-05-21 Thread shane knapp
working on it.  we'll have intermittent downtime the next ~30 mins.

On Sun, May 21, 2017 at 12:01 PM, shane knapp  wrote:
> yeah.  i noticed that and restarted it a few minutes ago.  i'll have
> some time later this afternoon to take a closer look...   :\
>
> On Sun, May 21, 2017 at 9:08 AM, Kazuaki Ishizaki  wrote:
>> It had been working well these past few days; however, it seems to be slowly going down again...
>>
>> When I try to view a console log (e.g.
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77149/consoleFull),
>> the server returns a "proxy error."
>>
>> Regards,
>> Kazuaki Ishizaki
>>
>>
>>
>> From:    shane knapp 
>> To:      Sean Owen 
>> Cc:      dev 
>> Date:    2017/05/20 09:43
>> Subject: Re: [build system] jenkins got itself wedged...
>> 
>>
>>
>>
>> last update of the week:
>>
>> things are looking great...  we're GCing happily and staying well
>> within our memory limits.
>>
>> i'm going to do one more restart after the two pull request builds
>> finish to re-enable backups, and call it a weekend.  :)
>>
>> shane
>>
>> On Fri, May 19, 2017 at 8:29 AM, shane knapp  wrote:
>>> this is hopefully my final email on the subject...   :)
>>>
> things seem to have settled down after my GC tuning, and system
>>> load/cpu usage/memory has been nice and flat all night.  i'll continue
>>> to keep an eye on things but it looks like we've weathered the worst
>>> part of the storm.
>>>
>>> On Thu, May 18, 2017 at 6:40 PM, shane knapp  wrote:
 after needing another restart this afternoon, i did some homework and
 aggressively twiddled some GC settings[1].  since then, things have
 definitely smoothed out w/regards to memory and cpu usage spikes.

 i've attached a screenshot of slightly happier looking graphs.

 still keeping an eye on things, and hoping that i can go back to being
 a lurker...  ;)

 shane

 1 - https://jenkins.io/blog/2016/11/21/gc-tuning/
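the thread never lists which GC settings were actually twiddled; per the linked jenkins.io guide, a tuned java8 Jenkins master launch looks roughly like the sketch below.  the flag names are real HotSpot options, but this particular selection, the values, and the log path are assumptions, not shane's actual config.

```java
// Illustrative only: the exact settings changed on the build master are
// not in the thread.  These are the kinds of flags the linked GC-tuning
// guide recommends for a java8 Jenkins master; the log path is hypothetical.
public class GcFlagsSketch {
    public static void main(String[] args) {
        String[] flags = {
            "-Xmx16g",                          // hard heap ceiling
            "-XX:+UseG1GC",                     // G1 collector
            "-XX:+ExplicitGCInvokesConcurrent", // System.gc() runs concurrently
            "-XX:+ParallelRefProcEnabled",      // parallel reference processing
            "-XX:+PrintGCDetails",              // java8-style GC logging...
            "-Xloggc:/var/log/jenkins/gc.log"   // ...written here for later tuning
        };
        for (String flag : flags) {
            System.out.println(flag);
        }
    }
}
```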

 On Thu, May 18, 2017 at 11:20 AM, shane knapp 
 wrote:
> ok, more updates:
>
> 1) i audited all of the builds, and found that the spark-*-compile-*
> and spark-*-test-* jobs were set to the identical cron time trigger,
> so josh rosen and i updated them to run at H/5 (instead of */5).  load
> balancing ftw.
>
> 2) the jenkins master is now running on java8, which has moar bettar
> GC management under the hood.
>
> i'll be keeping an eye on this today, and if we start seeing GC
> overhead failures, i'll start doing more GC performance tuning.
> thankfully, cloudbees has a relatively decent guide that i'll be
> following here:  https://jenkins.io/blog/2016/11/21/gc-tuning/
>
> shane
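a rough sketch of why the H/5 change above balances load: with "*/5" every job fires at minutes 0, 5, 10, ..., while "H" offsets each job by a stable hash of its name before stepping.  Jenkins' real "H" token uses its own hash function; `String.hashCode()` and the job names below are stand-ins for illustration.

```java
// Sketch of "H/5 * * * *" vs "*/5 * * * *" (not Jenkins' actual hashing).
// "*/5": every job triggers at minutes 0, 5, 10, ... -> thundering herd.
// "H/5": each job triggers at hash(name) % 5, then every 5 minutes,
//        so the compile and test jobs no longer all fire together.
public class HashedCronSketch {
    static int firstMinute(String jobName, int step) {
        return Math.floorMod(jobName.hashCode(), step);
    }

    public static void main(String[] args) {
        String[] jobs = {"spark-master-compile-maven",   // hypothetical names
                         "spark-master-test-sbt",
                         "spark-branch-2.1-test-maven"};
        for (String job : jobs) {
            System.out.println(job + " -> first trigger at minute "
                               + firstMinute(job, 5) + ", then every 5 min");
        }
    }
}
```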
>
> On Thu, May 18, 2017 at 8:39 AM, shane knapp 
> wrote:
>> yeah, i spoke too soon.  jenkins is still misbehaving, but FINALLY i'm
>> getting some error messages in the logs...   looks like jenkins is
>> thrashing on GC.
>>
>> now that i know what's up, i should be able to get this sorted today.
>>
>> On Thu, May 18, 2017 at 12:39 AM, Sean Owen  wrote:
>>> I'm not sure if it's related, but I still can't get Jenkins to test
>>> PRs. For
>>> example, triggering it through the spark-prs.appspot.com UI gives
>>> me...
>>>
>>> https://spark-prs.appspot.com/trigger-jenkins/18012
>>>
>>> Internal Server Error
>>>
>>> That might be from the appspot app though?
>>>
>>> But posting "Jenkins test this please" on PRs doesn't seem to work,
>>> and I
>>> can't reach Jenkins:
>>>
>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/
>>>
>>> On Thu, May 18, 2017 at 12:44 AM shane knapp 
>>> wrote:

 after another couple of restarts due to high load and system
 unresponsiveness, i finally found the most likely culprit:

 a typo in the jenkins config where the java heap size was configured.
 instead of -Xmx16g, we had -Dmx16G...  which could easily explain the
 random and non-deterministic system hangs we've had over the past
 couple of years.
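the typo above is easy to reproduce: "-D" only defines a JVM system property (here one literally named "mx16G"), so the heap stays at the platform default, while "-Xmx16g" actually caps it at 16 GB.  the class name below is hypothetical.

```java
// Demonstrates the -Dmx16G vs -Xmx16g mix-up.  "-Dmx16G" creates a
// system property named "mx16G" and nothing else; only "-Xmx" sizes
// the heap.  Run with each flag and compare the reported max heap.
public class HeapFlagDemo {
    public static void main(String[] args) {
        // Whatever -Xmx resolved to (the JVM default, if only -Dmx16G was given):
        long maxHeapBytes = Runtime.getRuntime().maxMemory();
        System.out.println("max heap bytes: " + maxHeapBytes);

        // Present (as an empty string) only when launched with -Dmx16G;
        // null otherwise -- either way it has no effect on memory.
        System.out.println("mx16G property: " + System.getProperty("mx16G"));
    }
}
// try: java -Dmx16G HeapFlagDemo    vs    java -Xmx16g HeapFlagDemo
```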

 anyways, it's been corrected and the master seems to be humming
 along,
 for real this time, w/o issue.  i'll continue to keep an eye on this
 for the rest of the week, but things are looking MUCH better now.

 sorry again for the interruptions in service.

 shane

 On Wed, May 17, 2017 at 9:59 AM, shane knapp 
 wrote:
 > ok, we're back up, system load looks cromulent and we're happily
 > building (again).
 >
 > shane
 >
 > On Wed, May 17, 2017 at 9:50 AM, shane knapp 
 > wrote:
 >> i'm going to need to perform a quick reboot on the jenkins master.
 >> it
 >> looks like it's hung again.
>>>

Re: [build system] jenkins got itself wedged...

2017-05-21 Thread Kazuaki Ishizaki
It had been working well these past few days; however, it seems to be slowly going down again...

When I try to view a console log (e.g.
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77149/consoleFull),
the server returns a "proxy error."

Regards,
Kazuaki Ishizaki



From:   shane knapp 
To: Sean Owen 
Cc: dev 
Date:   2017/05/20 09:43
Subject:Re: [build system] jenkins got itself wedged...



last update of the week:

things are looking great...  we're GCing happily and staying well
within our memory limits.

i'm going to do one more restart after the two pull request builds
finish to re-enable backups, and call it a weekend.  :)

shane

On Fri, May 19, 2017 at 8:29 AM, shane knapp  wrote:
> this is hopefully my final email on the subject...   :)
>
> things seem to have settled down after my GC tuning, and system
> load/cpu usage/memory has been nice and flat all night.  i'll continue
> to keep an eye on things but it looks like we've weathered the worst
> part of the storm.
>
> On Thu, May 18, 2017 at 6:40 PM, shane knapp  wrote:
>> after needing another restart this afternoon, i did some homework and
>> aggressively twiddled some GC settings[1].  since then, things have
>> definitely smoothed out w/regards to memory and cpu usage spikes.
>>
>> i've attached a screenshot of slightly happier looking graphs.
>>
>> still keeping an eye on things, and hoping that i can go back to being
>> a lurker...  ;)
>>
>> shane
>>
>> 1 - https://jenkins.io/blog/2016/11/21/gc-tuning/
>>
>> On Thu, May 18, 2017 at 11:20 AM, shane knapp  wrote:
>>> ok, more updates:
>>>
>>> 1) i audited all of the builds, and found that the spark-*-compile-*
>>> and spark-*-test-* jobs were set to the identical cron time trigger,
>>> so josh rosen and i updated them to run at H/5 (instead of */5).  load
>>> balancing ftw.
>>>
>>> 2) the jenkins master is now running on java8, which has moar bettar
>>> GC management under the hood.
>>>
>>> i'll be keeping an eye on this today, and if we start seeing GC
>>> overhead failures, i'll start doing more GC performance tuning.
>>> thankfully, cloudbees has a relatively decent guide that i'll be
>>> following here:  https://jenkins.io/blog/2016/11/21/gc-tuning/
>>>
>>> shane
>>>
>>> On Thu, May 18, 2017 at 8:39 AM, shane knapp  wrote:
 yeah, i spoke too soon.  jenkins is still misbehaving, but FINALLY i'm
 getting some error messages in the logs...   looks like jenkins is
 thrashing on GC.

 now that i know what's up, i should be able to get this sorted today.

 On Thu, May 18, 2017 at 12:39 AM, Sean Owen  wrote:
> I'm not sure if it's related, but I still can't get Jenkins to test PRs. For
> example, triggering it through the spark-prs.appspot.com UI gives me...
>
> https://spark-prs.appspot.com/trigger-jenkins/18012
>
> Internal Server Error
>
> That might be from the appspot app though?
>
> But posting "Jenkins test this please" on PRs doesn't seem to work, and I
> can't reach Jenkins:
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/

>
> On Thu, May 18, 2017 at 12:44 AM shane knapp  wrote:
>>
>> after another couple of restarts due to high load and system
>> unresponsiveness, i finally found what is the most likely culprit:
>>
>> a typo in the jenkins config where the java heap size was configured.
>> instead of -Xmx16g, we had -Dmx16G...  which could easily explain the
>> random and non-deterministic system hangs we've had over the past
>> couple of years.
>>
>> anyways, it's been corrected and the master seems to be humming along,
>> for real this time, w/o issue.  i'll continue to keep an eye on this
>> for the rest of the week, but things are looking MUCH better now.
>>
>> sorry again for the interruptions in service.
>>
>> shane
>>
>> On Wed, May 17, 2017 at 9:59 AM, shane knapp  wrote:
>> > ok, we're back up, system load looks cromulent and we're happily
>> > building (again).
>> >
>> > shane
>> >
>> > On Wed, May 17, 2017 at 9:50 AM, shane knapp  wrote:
>> >> i'm going to need to perform a quick reboot on the jenkins master.  it
>> >> looks like it's hung again.
>> >>
>> >> sorry about this!
>> >>
>> >> shane
>> >>
>> >> On Tue, May 16, 2017 at 12:55 PM, shane knapp  wrote:
>> >>> ...but just now i started getting alerts on system load, which was
>> >>> rather high.  i had to kick jenkins again, and will keep an eye on the
>> >>> master in case it needs another reboot.
>> >>>
>> >>> sorry about the interruption of service...
>> >>>
>> >>> shane
>> >>>
>> >>> On Tue, May 16, 2017 at 8:18 AM, shane knapp  wrote:
>>  ...so i kicked it and it's now back up and happily building.
>>
>> 