It had been working well for the past few days. However, it seems to be slowly going down again...

When I try to view a console log (e.g.
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77149/consoleFull),
the server returns a "proxy error."

Regards,
Kazuaki Ishizaki



From:   shane knapp <skn...@berkeley.edu>
To:     Sean Owen <so...@cloudera.com>
Cc:     dev <dev@spark.apache.org>
Date:   2017/05/20 09:43
Subject:        Re: [build system] jenkins got itself wedged...



last update of the week:

things are looking great...  we're GCing happily and staying well
within our memory limits.

i'm going to do one more restart after the two pull request builds
finish to re-enable backups, and call it a weekend.  :)

shane

On Fri, May 19, 2017 at 8:29 AM, shane knapp <skn...@berkeley.edu> wrote:
> this is hopefully my final email on the subject...   :)
>
> things seem to have settled down after my GC tuning, and system
> load/cpu usage/memory has been nice and flat all night.  i'll continue
> to keep an eye on things but it looks like we've weathered the worst
> part of the storm.
>
> On Thu, May 18, 2017 at 6:40 PM, shane knapp <skn...@berkeley.edu> wrote:
>> after needing another restart this afternoon, i did some homework and
>> aggressively twiddled some GC settings[1].  since then, things have
>> definitely smoothed out w/regards to memory and cpu usage spikes.
>>
>> i've attached a screenshot of slightly happier looking graphs.
>>
>> still keeping an eye on things, and hoping that i can go back to being
>> a lurker...  ;)
>>
>> shane
>>
>> 1 - https://jenkins.io/blog/2016/11/21/gc-tuning/
>>
>> On Thu, May 18, 2017 at 11:20 AM, shane knapp <skn...@berkeley.edu> wrote:
>>> ok, more updates:
>>>
>>> 1) i audited all of the builds, and found that the spark-*-compile-*
>>> and spark-*-test-* jobs were set to the identical cron time trigger,
>>> so josh rosen and i updated them to run at H/5 (instead of */5).  load
>>> balancing ftw.
>>>
>>> 2) the jenkins master is now running on java8, which has moar bettar
>>> GC management under the hood.
>>>
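For illustration, here is how the H/5 vs. */5 difference in (1) looks in a Jenkins "Build periodically" cron spec; the H ("hash") token gives each job a stable per-job offset within the window instead of firing every job at the same minute. This is a hypothetical snippet, not the actual spark-* job configs:

    # Jenkins job config -> Build Triggers -> "Build periodically"
    # before: every job fires at minutes 0, 5, 10, ... at the same time
    */5 * * * *
    # after: each job gets a stable, hashed offset inside every 5-minute window,
    # so jobs sharing this schedule no longer all start together
    H/5 * * * *
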
>>> i'll be keeping an eye on this today, and if we start seeing GC
>>> overhead failures, i'll start doing more GC performance tuning.
>>> thankfully, cloudbees has a relatively decent guide that i'll be
>>> following here:  https://jenkins.io/blog/2016/11/21/gc-tuning/
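For reference, GC options of the kind that guide covers look roughly like the following; treat this as an illustrative java 8 / G1 sketch, not necessarily the exact settings applied to this master:

    # appended to the jenkins master's JVM arguments (illustrative)
    -Xmx16g -Xms16g                     # pin the heap size
    -XX:+UseG1GC                        # use the G1 collector
    -XX:+ParallelRefProcEnabled         # process references in parallel
    -XX:+ExplicitGCInvokesConcurrent    # System.gc() runs a concurrent cycle instead of a full stop-the-world GC
    -XX:+PrintGCDetails -Xloggc:gc.log  # java 8 style GC logging for later analysis
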
>>>
>>> shane
>>>
>>> On Thu, May 18, 2017 at 8:39 AM, shane knapp <skn...@berkeley.edu> wrote:
>>>> yeah, i spoke too soon.  jenkins is still misbehaving, but FINALLY i'm
>>>> getting some error messages in the logs...   looks like jenkins is
>>>> thrashing on GC.
>>>>
>>>> now that i know what's up, i should be able to get this sorted today.
>>>>
>>>> On Thu, May 18, 2017 at 12:39 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>> I'm not sure if it's related, but I still can't get Jenkins to test PRs. For
>>>>> example, triggering it through the spark-prs.appspot.com UI gives me...
>>>>>
>>>>> https://spark-prs.appspot.com/trigger-jenkins/18012
>>>>>
>>>>> Internal Server Error
>>>>>
>>>>> That might be from the appspot app though?
>>>>>
>>>>> But posting "Jenkins test this please" on PRs doesn't seem to work, and I
>>>>> can't reach Jenkins:
>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/
>>>>>
>>>>> On Thu, May 18, 2017 at 12:44 AM shane knapp <skn...@berkeley.edu> wrote:
>>>>>>
>>>>>> after another couple of restarts due to high load and system
>>>>>> unresponsiveness, i finally found the most likely culprit:
>>>>>>
>>>>>> a typo in the jenkins config where the java heap size was configured.
>>>>>> instead of -Xmx16g, we had -Dmx16G...  which could easily explain the
>>>>>> random and non-deterministic system hangs we've had over the past
>>>>>> couple of years.
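For context on why that typo stayed hidden: -Xmx sets the maximum heap, while -Dmx16G just defines an unused system property named "mx16G", so the JVM silently kept its default heap ceiling. A sketch of the fix, assuming the stock /etc/sysconfig/jenkins layout (the actual file and surrounding options on this master may differ):

    # /etc/sysconfig/jenkins  (assumed location; varies by packaging)
    # broken: defines a harmless, unused system property; heap stays at the JVM default
    #JENKINS_JAVA_OPTIONS="-Dmx16G"
    # fixed: actually raises the maximum heap to 16 GB
    JENKINS_JAVA_OPTIONS="-Xmx16g"
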
>>>>>>
>>>>>> anyways, it's been corrected and the master seems to be humming along,
>>>>>> for real this time, w/o issue.  i'll continue to keep an eye on this
>>>>>> for the rest of the week, but things are looking MUCH better now.
>>>>>>
>>>>>> sorry again for the interruptions in service.
>>>>>>
>>>>>> shane
>>>>>>
>>>>>> On Wed, May 17, 2017 at 9:59 AM, shane knapp <skn...@berkeley.edu> wrote:
>>>>>> > ok, we're back up, system load looks cromulent and we're happily
>>>>>> > building (again).
>>>>>> >
>>>>>> > shane
>>>>>> >
>>>>>> > On Wed, May 17, 2017 at 9:50 AM, shane knapp 
<skn...@berkeley.edu>
>>>>>> > wrote:
>>>>>> >> i'm going to need to perform a quick reboot on the jenkins master.  it
>>>>>> >> looks like it's hung again.
>>>>>> >>
>>>>>> >> sorry about this!
>>>>>> >>
>>>>>> >> shane
>>>>>> >>
>>>>>> >> On Tue, May 16, 2017 at 12:55 PM, shane knapp <skn...@berkeley.edu> wrote:
>>>>>> >>> ...but just now i started getting alerts on system load, which was
>>>>>> >>> rather high.  i had to kick jenkins again, and will keep an eye on the
>>>>>> >>> master and possible need to reboot.
>>>>>> >>>
>>>>>> >>> sorry about the interruption of service...
>>>>>> >>>
>>>>>> >>> shane
>>>>>> >>>
>>>>>> >>> On Tue, May 16, 2017 at 8:18 AM, shane knapp <skn...@berkeley.edu> wrote:
>>>>>> >>>> ...so i kicked it and it's now back up and happily building.
>>>>>>
