ok, more updates:

1) i audited all of the builds, and found that the spark-*-compile-*
and spark-*-test-* jobs were set to the identical cron time trigger,
so josh rosen and i updated them to run at H/5 (instead of */5).  load
balancing ftw.

2) the jenkins master is now running on java8, which has moar bettar
GC management under the hood.
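
for anyone wondering why item 1 matters: in a jenkins cron spec, */5 fires on the same minute marks (0, 5, 10, ...) for every job, while H/5 keeps the five-minute interval but shifts each job by an offset derived from a hash of its name. here is a rough sketch of that idea. the job names are made up, and md5 is only a stand-in, not the hash jenkins actually uses.

```python
import hashlib


def star_minutes(step):
    """Minutes matched by */step: identical for every job."""
    return list(range(0, 60, step))


def hashed_minutes(job_name, step):
    """Minutes matched by an H/step-style trigger.

    The offset comes from a hash of the job name, so it is stable for a
    given job but differs between jobs. md5 is a stand-in (assumption),
    not the hash Jenkins really uses.
    """
    digest = hashlib.md5(job_name.encode("utf-8")).digest()
    offset = int.from_bytes(digest[:4], "big") % step
    return list(range(offset, 60, step))


if __name__ == "__main__":
    # Hypothetical job names, for illustration only.
    for job in ("spark-example-compile-job", "spark-example-test-job"):
        print(job, "*/5:", star_minutes(5)[:3], "H/5:", hashed_minutes(job, 5)[:3])
```

each job still runs twelve times an hour either way. the difference is only that the hashed starts are not forced onto the same minute marks.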

i'll be keeping an eye on this today, and if we start seeing GC
overhead failures, i'll start doing more GC performance tuning.
thankfully, cloudbees has a relatively decent guide that i'll be
following here:  https://jenkins.io/blog/2016/11/21/gc-tuning/
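
related to the -Xmx / -Dmx mix-up quoted further down: -D only defines a java system property, so -Dmx16G is accepted without complaint and leaves the max heap at the JVM default. below is a small sketch of a sanity check for that kind of typo. the function name and the example argument lists are hypothetical, and it only knows about -Xmx and -Xms.

```python
import re

# -Xmx / -Xms followed by a size such as 16g, 512m, 2048k, or a bare byte count.
HEAP_FLAG = re.compile(r"^-X(mx|ms)(\d+)([kKmMgG]?)$")
# A -D system property whose name looks like a mistyped heap flag, e.g. -Dmx16G.
SUSPECT_PROPERTY = re.compile(r"^-D(mx|ms)\d+[kKmMgG]?$")

UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def check_jvm_args(args):
    """Return (max_heap_bytes or None, list of suspicious -D arguments)."""
    max_heap = None
    suspects = []
    for arg in args:
        match = HEAP_FLAG.match(arg)
        if match:
            if match.group(1) == "mx":
                max_heap = int(match.group(2)) * UNITS[match.group(3).lower()]
        elif SUSPECT_PROPERTY.match(arg):
            suspects.append(arg)
    return max_heap, suspects


if __name__ == "__main__":
    # Example argument lists (assumed, not copied from the real config).
    print(check_jvm_args(["-Djava.awt.headless=true", "-Dmx16G"]))
    print(check_jvm_args(["-Djava.awt.headless=true", "-Xmx16g"]))
```

a None for the max heap plus a non-empty suspect list is the pattern that would have flagged the old config.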

shane

On Thu, May 18, 2017 at 8:39 AM, shane knapp <skn...@berkeley.edu> wrote:
> yeah, i spoke too soon.  jenkins is still misbehaving, but FINALLY i'm
> getting some error messages in the logs...   looks like jenkins is
> thrashing on GC.
>
> now that i know what's up, i should be able to get this sorted today.
>
> On Thu, May 18, 2017 at 12:39 AM, Sean Owen <so...@cloudera.com> wrote:
>> I'm not sure if it's related, but I still can't get Jenkins to test PRs. For
>> example, triggering it through the spark-prs.appspot.com UI gives me...
>>
>> https://spark-prs.appspot.com/trigger-jenkins/18012
>>
>> Internal Server Error
>>
>> That might be from the appspot app though?
>>
>> But posting "Jenkins test this please" on PRs doesn't seem to work, and I
>> can't reach Jenkins:
>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/
>>
>> On Thu, May 18, 2017 at 12:44 AM shane knapp <skn...@berkeley.edu> wrote:
>>>
>>> after another couple of restarts due to high load and system
>>> unresponsiveness, i finally found what is the most likely culprit:
>>>
>>> a typo in the jenkins config where the java heap size was configured.
>>> instead of -Xmx16g, we had -Dmx16G...  which could easily explain the
>>> random and non-deterministic system hangs we've had over the past
>>> couple of years.
>>>
>>> anyways, it's been corrected and the master seems to be humming along,
>>> for real this time, w/o issue.  i'll continue to keep an eye on this
>>> for the rest of the week, but things are looking MUCH better now.
>>>
>>> sorry again for the interruptions in service.
>>>
>>> shane
>>>
>>> On Wed, May 17, 2017 at 9:59 AM, shane knapp <skn...@berkeley.edu> wrote:
>>> > ok, we're back up, system load looks cromulent and we're happily
>>> > building (again).
>>> >
>>> > shane
>>> >
>>> > On Wed, May 17, 2017 at 9:50 AM, shane knapp <skn...@berkeley.edu>
>>> > wrote:
>>> >> i'm going to need to perform a quick reboot on the jenkins master.  it
>>> >> looks like it's hung again.
>>> >>
>>> >> sorry about this!
>>> >>
>>> >> shane
>>> >>
>>> >> On Tue, May 16, 2017 at 12:55 PM, shane knapp <skn...@berkeley.edu>
>>> >> wrote:
>>> >>> ...but just now i started getting alerts on system load, which was
>>> >>> rather high.  i had to kick jenkins again, and will keep an eye on the
>>> >>> master and possibly need to reboot.
>>> >>>
>>> >>> sorry about the interruption of service...
>>> >>>
>>> >>> shane
>>> >>>
>>> >>> On Tue, May 16, 2017 at 8:18 AM, shane knapp <skn...@berkeley.edu>
>>> >>> wrote:
>>> >>>> ...so i kicked it and it's now back up and happily building.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>
