Re: [build system] jenkins got itself wedged...

2017-05-21 Thread shane knapp
working on it.  we'll have intermittent downtime the next ~30 mins.

On Sun, May 21, 2017 at 12:01 PM, shane knapp <skn...@berkeley.edu> wrote:
> yeah.  i noticed that and restarted it a few minutes ago.  i'll have
> some time later this afternoon to take a closer look...   :\
>
> On Sun, May 21, 2017 at 9:08 AM, Kazuaki Ishizaki <ishiz...@jp.ibm.com> wrote:
>> It had been looking good these past few days. However, it seems to be slowly going down again...
>>
>> When I tried to view a console log (e.g.
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77149/consoleFull),
>> the server returned a "proxy error."
>>
>> Regards,
>> Kazuaki Ishizaki
>>
>>
>>
>> From: shane knapp <skn...@berkeley.edu>
>> To: Sean Owen <so...@cloudera.com>
>> Cc: dev <dev@spark.apache.org>
>> Date: 2017/05/20 09:43
>> Subject: Re: [build system] jenkins got itself wedged...
>> 
>>
>>
>>
>> last update of the week:
>>
>> things are looking great...  we're GCing happily and staying well
>> within our memory limits.
>>
>> i'm going to do one more restart after the two pull request builds
>> finish to re-enable backups, and call it a weekend.  :)
>>
>> shane
>>
>> On Fri, May 19, 2017 at 8:29 AM, shane knapp <skn...@berkeley.edu> wrote:
>>> this is hopefully my final email on the subject...   :)
>>>
>>> things seem to have settled down after my GC tuning, and system
>>> load/cpu usage/memory has been nice and flat all night.  i'll continue
>>> to keep an eye on things but it looks like we've weathered the worst
>>> part of the storm.
>>>
>>> On Thu, May 18, 2017 at 6:40 PM, shane knapp <skn...@berkeley.edu> wrote:
>>>> after needing another restart this afternoon, i did some homework and
>>>> aggressively twiddled some GC settings[1].  since then, things have
>>>> definitely smoothed out w/regards to memory and cpu usage spikes.
>>>>
>>>> i've attached a screenshot of slightly happier looking graphs.
>>>>
>>>> still keeping an eye on things, and hoping that i can go back to being
>>>> a lurker...  ;)
>>>>
>>>> shane
>>>>
>>>> 1 - https://jenkins.io/blog/2016/11/21/gc-tuning/
>>>>
>>>> On Thu, May 18, 2017 at 11:20 AM, shane knapp <skn...@berkeley.edu>
>>>> wrote:
>>>>> ok, more updates:
>>>>>
>>>>> 1) i audited all of the builds, and found that the spark-*-compile-*
>>>>> and spark-*-test-* jobs were set to the identical cron time trigger,
>>>>> so josh rosen and i updated them to run at H/5 (instead of */5).  load
>>>>> balancing ftw.
>>>>>
>>>>> 2) the jenkins master is now running on java8, which has moar bettar
>>>>> GC management under the hood.
>>>>>
>>>>> i'll be keeping an eye on this today, and if we start seeing GC
>>>>> overhead failures, i'll start doing more GC performance tuning.
>>>>> thankfully, cloudbees has a relatively decent guide that i'll be
>>>>> following here:  https://jenkins.io/blog/2016/11/21/gc-tuning/
>>>>>
>>>>> shane
>>>>>
>>>>> On Thu, May 18, 2017 at 8:39 AM, shane knapp <skn...@berkeley.edu>
>>>>> wrote:
>>>>>> yeah, i spoke too soon.  jenkins is still misbehaving, but FINALLY i'm
>>>>>> getting some error messages in the logs...   looks like jenkins is
>>>>>> thrashing on GC.
>>>>>>
>>>>>> now that i know what's up, i should be able to get this sorted today.
>>>>>>
>>>>>> On Thu, May 18, 2017 at 12:39 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>>>> I'm not sure if it's related, but I still can't get Jenkins to test
>>>>>>> PRs. For
>>>>>>> example, triggering it through the spark-prs.appspot.com UI gives
>>>>>>> me...
>>>>>>>
>>>>>>> https://spark-prs.appspot.com/trigger-jenkins/18012
>>>>>>>
>>>>>>> Internal Server Error
>>>>>>>
>>>>>>> That might be from the appspot app though?
>>>>>>>
>>>>>>> But posting "Jenkins test this please" on PRs doesn't seem to work,

Re: [build system] jenkins got itself wedged...

2017-05-21 Thread shane knapp
yeah.  i noticed that and restarted it a few minutes ago.  i'll have
some time later this afternoon to take a closer look...   :\

On Sun, May 21, 2017 at 9:08 AM, Kazuaki Ishizaki <ishiz...@jp.ibm.com> wrote:
> It had been looking good these past few days. However, it seems to be slowly going down again...
>
> When I tried to view a console log (e.g.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77149/consoleFull),
> the server returned a "proxy error."
>
> Regards,
> Kazuaki Ishizaki
>
>
>
> From: shane knapp <skn...@berkeley.edu>
> To: Sean Owen <so...@cloudera.com>
> Cc: dev <dev@spark.apache.org>
> Date: 2017/05/20 09:43
> Subject: Re: [build system] jenkins got itself wedged...
> 
>
>
>
> last update of the week:
>
> things are looking great...  we're GCing happily and staying well
> within our memory limits.
>
> i'm going to do one more restart after the two pull request builds
> finish to re-enable backups, and call it a weekend.  :)
>
> shane
>
> On Fri, May 19, 2017 at 8:29 AM, shane knapp <skn...@berkeley.edu> wrote:
>> this is hopefully my final email on the subject...   :)
>>
>> things seem to have settled down after my GC tuning, and system
>> load/cpu usage/memory has been nice and flat all night.  i'll continue
>> to keep an eye on things but it looks like we've weathered the worst
>> part of the storm.
>>
>> On Thu, May 18, 2017 at 6:40 PM, shane knapp <skn...@berkeley.edu> wrote:
>>> after needing another restart this afternoon, i did some homework and
>>> aggressively twiddled some GC settings[1].  since then, things have
>>> definitely smoothed out w/regards to memory and cpu usage spikes.
>>>
>>> i've attached a screenshot of slightly happier looking graphs.
>>>
>>> still keeping an eye on things, and hoping that i can go back to being
>>> a lurker...  ;)
>>>
>>> shane
>>>
>>> 1 - https://jenkins.io/blog/2016/11/21/gc-tuning/
>>>
>>> On Thu, May 18, 2017 at 11:20 AM, shane knapp <skn...@berkeley.edu>
>>> wrote:
>>>> ok, more updates:
>>>>
>>>> 1) i audited all of the builds, and found that the spark-*-compile-*
>>>> and spark-*-test-* jobs were set to the identical cron time trigger,
>>>> so josh rosen and i updated them to run at H/5 (instead of */5).  load
>>>> balancing ftw.
>>>>
>>>> 2) the jenkins master is now running on java8, which has moar bettar
>>>> GC management under the hood.
>>>>
>>>> i'll be keeping an eye on this today, and if we start seeing GC
>>>> overhead failures, i'll start doing more GC performance tuning.
>>>> thankfully, cloudbees has a relatively decent guide that i'll be
>>>> following here:  https://jenkins.io/blog/2016/11/21/gc-tuning/
>>>>
>>>> shane
>>>>
>>>> On Thu, May 18, 2017 at 8:39 AM, shane knapp <skn...@berkeley.edu>
>>>> wrote:
>>>>> yeah, i spoke too soon.  jenkins is still misbehaving, but FINALLY i'm
>>>>> getting some error messages in the logs...   looks like jenkins is
>>>>> thrashing on GC.
>>>>>
>>>>> now that i know what's up, i should be able to get this sorted today.
>>>>>
>>>>> On Thu, May 18, 2017 at 12:39 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>>> I'm not sure if it's related, but I still can't get Jenkins to test
>>>>>> PRs. For
>>>>>> example, triggering it through the spark-prs.appspot.com UI gives
>>>>>> me...
>>>>>>
>>>>>> https://spark-prs.appspot.com/trigger-jenkins/18012
>>>>>>
>>>>>> Internal Server Error
>>>>>>
>>>>>> That might be from the appspot app though?
>>>>>>
>>>>>> But posting "Jenkins test this please" on PRs doesn't seem to work,
>>>>>> and I
>>>>>> can't reach Jenkins:
>>>>>>
>>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/
>>>>>>
>>>>>> On Thu, May 18, 2017 at 12:44 AM shane knapp <skn...@berkeley.edu>
>>>>>> wrote:
>>>>>>>
>>>>>>> after another couple of restarts due to high load and system
>>>>>>> unresponsiveness, i finally [...]

Re: [build system] jenkins got itself wedged...

2017-05-21 Thread Kazuaki Ishizaki
It had been looking good these past few days. However, it seems to be slowly going down again...

When I tried to view a console log (e.g.
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77149/consoleFull),
the server returned a "proxy error."

Regards,
Kazuaki Ishizaki



From:   shane knapp <skn...@berkeley.edu>
To: Sean Owen <so...@cloudera.com>
Cc: dev <dev@spark.apache.org>
Date:   2017/05/20 09:43
Subject:    Re: [build system] jenkins got itself wedged...



last update of the week:

things are looking great...  we're GCing happily and staying well
within our memory limits.

i'm going to do one more restart after the two pull request builds
finish to re-enable backups, and call it a weekend.  :)

shane

On Fri, May 19, 2017 at 8:29 AM, shane knapp <skn...@berkeley.edu> wrote:
> this is hopefully my final email on the subject...   :)
>
> things seem to have settled down after my GC tuning, and system
> load/cpu usage/memory has been nice and flat all night.  i'll continue
> to keep an eye on things but it looks like we've weathered the worst
> part of the storm.
>
> On Thu, May 18, 2017 at 6:40 PM, shane knapp <skn...@berkeley.edu> wrote:
>> after needing another restart this afternoon, i did some homework and
>> aggressively twiddled some GC settings[1].  since then, things have
>> definitely smoothed out w/regards to memory and cpu usage spikes.
>>
>> i've attached a screenshot of slightly happier looking graphs.
>>
>> still keeping an eye on things, and hoping that i can go back to being
>> a lurker...  ;)
>>
>> shane
>>
>> 1 - https://jenkins.io/blog/2016/11/21/gc-tuning/
>>
>> On Thu, May 18, 2017 at 11:20 AM, shane knapp <skn...@berkeley.edu> wrote:
>>> ok, more updates:
>>>
>>> 1) i audited all of the builds, and found that the spark-*-compile-*
>>> and spark-*-test-* jobs were set to the identical cron time trigger,
>>> so josh rosen and i updated them to run at H/5 (instead of */5).  load
>>> balancing ftw.
>>>
>>> 2) the jenkins master is now running on java8, which has moar bettar
>>> GC management under the hood.
>>>
>>> i'll be keeping an eye on this today, and if we start seeing GC
>>> overhead failures, i'll start doing more GC performance tuning.
>>> thankfully, cloudbees has a relatively decent guide that i'll be
>>> following here:  https://jenkins.io/blog/2016/11/21/gc-tuning/
>>>
>>> shane
>>>
>>> On Thu, May 18, 2017 at 8:39 AM, shane knapp <skn...@berkeley.edu> wrote:
>>>> yeah, i spoke too soon.  jenkins is still misbehaving, but FINALLY i'm
>>>> getting some error messages in the logs...   looks like jenkins is
>>>> thrashing on GC.
>>>>
>>>> now that i know what's up, i should be able to get this sorted today.
>>>>
>>>> On Thu, May 18, 2017 at 12:39 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>> I'm not sure if it's related, but I still can't get Jenkins to test PRs. For
>>>>> example, triggering it through the spark-prs.appspot.com UI gives me...
>>>>>
>>>>> https://spark-prs.appspot.com/trigger-jenkins/18012
>>>>>
>>>>> Internal Server Error
>>>>>
>>>>> That might be from the appspot app though?
>>>>>
>>>>> But posting "Jenkins test this please" on PRs doesn't seem to work, and I
>>>>> can't reach Jenkins:
>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/
>>>>>
>>>>> On Thu, May 18, 2017 at 12:44 AM shane knapp <skn...@berkeley.edu> wrote:
>>>>>>
>>>>>> after another couple of restarts due to high load and system
>>>>>> unresponsiveness, i finally found what is the most likely culprit:
>>>>>>
>>>>>> a typo in the jenkins config where the java heap size was configured.
>>>>>> instead of -Xmx16g, we had -Dmx16G...  which could easily explain the
>>>>>> random and non-deterministic system hangs we've had over the past
>>>>>> couple of years.
>>>>>>
>>>>>> anyways, it's been corrected and the master seems to be humming along,
>>>>>> for real this time, w/o issue.  i'll continue to keep an eye on this
>>>>>> for the rest of the week, but things are looking MUCH better now.
>>>>>>
>>>>>> sorry again for the interruptions in service.

Re: [build system] jenkins got itself wedged...

2017-05-19 Thread shane knapp
last update of the week:

things are looking great...  we're GCing happily and staying well
within our memory limits.

i'm going to do one more restart after the two pull request builds
finish to re-enable backups, and call it a weekend.  :)

shane

On Fri, May 19, 2017 at 8:29 AM, shane knapp  wrote:
> this is hopefully my final email on the subject...   :)
>
> things seem to have settled down after my GC tuning, and system
> load/cpu usage/memory has been nice and flat all night.  i'll continue
> to keep an eye on things but it looks like we've weathered the worst
> part of the storm.
>
> On Thu, May 18, 2017 at 6:40 PM, shane knapp  wrote:
>> after needing another restart this afternoon, i did some homework and
>> aggressively twiddled some GC settings[1].  since then, things have
>> definitely smoothed out w/regards to memory and cpu usage spikes.
>>
>> i've attached a screenshot of slightly happier looking graphs.
>>
>> still keeping an eye on things, and hoping that i can go back to being
>> a lurker...  ;)
>>
>> shane
>>
>> 1 - https://jenkins.io/blog/2016/11/21/gc-tuning/
>>
>> On Thu, May 18, 2017 at 11:20 AM, shane knapp  wrote:
>>> ok, more updates:
>>>
>>> 1) i audited all of the builds, and found that the spark-*-compile-*
>>> and spark-*-test-* jobs were set to the identical cron time trigger,
>>> so josh rosen and i updated them to run at H/5 (instead of */5).  load
>>> balancing ftw.
>>>
>>> 2) the jenkins master is now running on java8, which has moar bettar
>>> GC management under the hood.
>>>
>>> i'll be keeping an eye on this today, and if we start seeing GC
>>> overhead failures, i'll start doing more GC performance tuning.
>>> thankfully, cloudbees has a relatively decent guide that i'll be
>>> following here:  https://jenkins.io/blog/2016/11/21/gc-tuning/
>>>
>>> shane
>>>
>>> On Thu, May 18, 2017 at 8:39 AM, shane knapp  wrote:
>>>> yeah, i spoke too soon.  jenkins is still misbehaving, but FINALLY i'm
>>>> getting some error messages in the logs...   looks like jenkins is
>>>> thrashing on GC.
>>>>
>>>> now that i know what's up, i should be able to get this sorted today.
>>>>
>>>> On Thu, May 18, 2017 at 12:39 AM, Sean Owen  wrote:
>>>>> I'm not sure if it's related, but I still can't get Jenkins to test PRs. For
>>>>> example, triggering it through the spark-prs.appspot.com UI gives me...
>>>>>
>>>>> https://spark-prs.appspot.com/trigger-jenkins/18012
>>>>>
>>>>> Internal Server Error
>>>>>
>>>>> That might be from the appspot app though?
>>>>>
>>>>> But posting "Jenkins test this please" on PRs doesn't seem to work, and I
>>>>> can't reach Jenkins:
>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/
>>>>>
>>>>> On Thu, May 18, 2017 at 12:44 AM shane knapp  wrote:
>>>>>>
>>>>>> after another couple of restarts due to high load and system
>>>>>> unresponsiveness, i finally found what is the most likely culprit:
>>>>>>
>>>>>> a typo in the jenkins config where the java heap size was configured.
>>>>>> instead of -Xmx16g, we had -Dmx16G...  which could easily explain the
>>>>>> random and non-deterministic system hangs we've had over the past
>>>>>> couple of years.
>>>>>>
>>>>>> anyways, it's been corrected and the master seems to be humming along,
>>>>>> for real this time, w/o issue.  i'll continue to keep an eye on this
>>>>>> for the rest of the week, but things are looking MUCH better now.
>>>>>>
>>>>>> sorry again for the interruptions in service.
>>>>>>
>>>>>> shane
>>>>>>
>>>>>> On Wed, May 17, 2017 at 9:59 AM, shane knapp  wrote:
>>>>>> > ok, we're back up, system load looks cromulent and we're happily
>>>>>> > building (again).
>>>>>> >
>>>>>> > shane
>>>>>> >
>>>>>> > On Wed, May 17, 2017 at 9:50 AM, shane knapp
>>>>>> > wrote:
>>>>>> >> i'm going to need to perform a quick reboot on the jenkins master.  it
>>>>>> >> looks like it's hung again.
>>>>>> >>
>>>>>> >> sorry about this!
>>>>>> >>
>>>>>> >> shane
>>>>>> >>
>>>>>> >> On Tue, May 16, 2017 at 12:55 PM, shane knapp
>>>>>> >> wrote:
>>>>>> >>> ...but just now i started getting alerts on system load, which was
>>>>>> >>> rather high.  i had to kick jenkins again, and will keep an eye on the
>>>>>> >>> master and possible need to reboot.
>>>>>> >>>
>>>>>> >>> sorry about the interruption of service...
>>>>>> >>>
>>>>>> >>> shane
>>>>>> >>>
>>>>>> >>> On Tue, May 16, 2017 at 8:18 AM, shane knapp
>>>>>> >>> wrote:
>>>>>> >>>> ...so i kicked it and it's now back up and happily building.
>>>>>>
>>>>>> -
>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>
>>>>>


Re: [build system] jenkins got itself wedged...

2017-05-18 Thread shane knapp
ok, more updates:

1) i audited all of the builds, and found that the spark-*-compile-*
and spark-*-test-* jobs were set to the identical cron time trigger,
so josh rosen and i updated them to run at H/5 (instead of */5).  load
balancing ftw.

2) the jenkins master is now running on java8, which has moar bettar
GC management under the hood.
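
(a rough sketch of why item 1 helps: in Jenkins cron syntax, `*/5` fires every job at minutes 0, 5, 10, ..., while `H/5` keeps the five-minute interval but shifts each job by an offset derived from a hash of its name. the hash used below (md5 modulo the step) and the job names are stand-ins for illustration only, not what Jenkins actually computes.)

```python
import hashlib
from collections import Counter


def fixed_minutes(step=5):
    """Minutes of the hour at which a `*/5`-style trigger fires."""
    return list(range(0, 60, step))


def hashed_minutes(job_name, step=5):
    """Minutes at which an `H/5`-style trigger fires for one job.

    The offset comes from a hash of the job name. md5 is an illustrative
    stand-in; Jenkins uses its own hashing.
    """
    digest = hashlib.md5(job_name.encode("utf-8")).hexdigest()
    offset = int(digest, 16) % step
    return list(range(offset, 60, step))


def peak_simultaneous_starts(schedules):
    """Largest number of jobs that start in the same minute."""
    counts = Counter(minute for minutes in schedules for minute in minutes)
    return max(counts.values())


# Hypothetical job names, loosely modelled on the spark-*-compile-* and
# spark-*-test-* patterns mentioned above.
jobs = [f"spark-example-{kind}-{i}"
        for kind in ("compile", "test")
        for i in range(10)]

star_peak = peak_simultaneous_starts([fixed_minutes() for _ in jobs])
hash_peak = peak_simultaneous_starts([hashed_minutes(job) for job in jobs])

print("peak simultaneous starts with */5:", star_peak)
print("peak simultaneous starts with H/5:", hash_peak)
```

with `*/5` every job lands on the same minute, so the peak equals the number of jobs; with a hashed offset the starts are spread over the five possible offsets.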

i'll be keeping an eye on this today, and if we start seeing GC
overhead failures, i'll start doing more GC performance tuning.
thankfully, cloudbees has a relatively decent guide that i'll be
following here:  https://jenkins.io/blog/2016/11/21/gc-tuning/

shane

On Thu, May 18, 2017 at 8:39 AM, shane knapp  wrote:
> yeah, i spoke too soon.  jenkins is still misbehaving, but FINALLY i'm
> getting some error messages in the logs...   looks like jenkins is
> thrashing on GC.
>
> now that i know what's up, i should be able to get this sorted today.
>
> On Thu, May 18, 2017 at 12:39 AM, Sean Owen  wrote:
>> I'm not sure if it's related, but I still can't get Jenkins to test PRs. For
>> example, triggering it through the spark-prs.appspot.com UI gives me...
>>
>> https://spark-prs.appspot.com/trigger-jenkins/18012
>>
>> Internal Server Error
>>
>> That might be from the appspot app though?
>>
>> But posting "Jenkins test this please" on PRs doesn't seem to work, and I
>> can't reach Jenkins:
>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/
>>
>> On Thu, May 18, 2017 at 12:44 AM shane knapp  wrote:
>>>
>>> after another couple of restarts due to high load and system
>>> unresponsiveness, i finally found what is the most likely culprit:
>>>
>>> a typo in the jenkins config where the java heap size was configured.
>>> instead of -Xmx16g, we had -Dmx16G...  which could easily explain the
>>> random and non-deterministic system hangs we've had over the past
>>> couple of years.
>>>
>>> anyways, it's been corrected and the master seems to be humming along,
>>> for real this time, w/o issue.  i'll continue to keep an eye on this
>>> for the rest of the week, but things are looking MUCH better now.
>>>
>>> sorry again for the interruptions in service.
>>>
>>> shane
>>>
>>> On Wed, May 17, 2017 at 9:59 AM, shane knapp  wrote:
>>> > ok, we're back up, system load looks cromulent and we're happily
>>> > building (again).
>>> >
>>> > shane
>>> >
>>> > On Wed, May 17, 2017 at 9:50 AM, shane knapp 
>>> > wrote:
>>> >> i'm going to need to perform a quick reboot on the jenkins master.  it
>>> >> looks like it's hung again.
>>> >>
>>> >> sorry about this!
>>> >>
>>> >> shane
>>> >>
>>> >> On Tue, May 16, 2017 at 12:55 PM, shane knapp 
>>> >> wrote:
>>> >>> ...but just now i started getting alerts on system load, which was
>>> >>> rather high.  i had to kick jenkins again, and will keep an eye on the
>>> >>> master and possible need to reboot.
>>> >>>
>>> >>> sorry about the interruption of service...
>>> >>>
>>> >>> shane
>>> >>>
>>> >>> On Tue, May 16, 2017 at 8:18 AM, shane knapp 
>>> >>> wrote:
>>> >>>> ...so i kicked it and it's now back up and happily building.
>>>
>>>
>>




Re: [build system] jenkins got itself wedged...

2017-05-18 Thread shane knapp
yeah, i spoke too soon.  jenkins is still misbehaving, but FINALLY i'm
getting some error messages in the logs...   looks like jenkins is
thrashing on GC.

now that i know what's up, i should be able to get this sorted today.
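
(for readers unfamiliar with the term: "thrashing on GC" means the JVM spends most of its wall-clock time in garbage collection rather than doing useful work. a minimal sketch of that ratio follows; the pause durations and the 50% threshold are made-up assumptions for illustration, not values taken from the jenkins logs.)

```python
def gc_overhead(pauses_ms, window_ms):
    """Fraction of a wall-clock window spent inside GC pauses."""
    if window_ms <= 0:
        raise ValueError("window_ms must be positive")
    return sum(pauses_ms) / window_ms


def is_thrashing(pauses_ms, window_ms, threshold=0.5):
    """True when GC pauses eat at least `threshold` of the window.

    The default threshold is an arbitrary illustrative choice.
    """
    return gc_overhead(pauses_ms, window_ms) >= threshold


# Hypothetical pause durations (milliseconds) seen in a 1-second window.
healthy = [20, 35, 15]
wedged = [400, 350, 200]

print("healthy overhead:", gc_overhead(healthy, 1000))
print("wedged overhead:", gc_overhead(wedged, 1000))
print("wedged is thrashing:", is_thrashing(wedged, 1000))
```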

On Thu, May 18, 2017 at 12:39 AM, Sean Owen  wrote:
> I'm not sure if it's related, but I still can't get Jenkins to test PRs. For
> example, triggering it through the spark-prs.appspot.com UI gives me...
>
> https://spark-prs.appspot.com/trigger-jenkins/18012
>
> Internal Server Error
>
> That might be from the appspot app though?
>
> But posting "Jenkins test this please" on PRs doesn't seem to work, and I
> can't reach Jenkins:
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/
>
> On Thu, May 18, 2017 at 12:44 AM shane knapp  wrote:
>>
>> after another couple of restarts due to high load and system
>> unresponsiveness, i finally found what is the most likely culprit:
>>
>> a typo in the jenkins config where the java heap size was configured.
>> instead of -Xmx16g, we had -Dmx16G...  which could easily explain the
>> random and non-deterministic system hangs we've had over the past
>> couple of years.
>>
>> anyways, it's been corrected and the master seems to be humming along,
>> for real this time, w/o issue.  i'll continue to keep an eye on this
>> for the rest of the week, but things are looking MUCH better now.
>>
>> sorry again for the interruptions in service.
>>
>> shane
>>
>> On Wed, May 17, 2017 at 9:59 AM, shane knapp  wrote:
>> > ok, we're back up, system load looks cromulent and we're happily
>> > building (again).
>> >
>> > shane
>> >
>> > On Wed, May 17, 2017 at 9:50 AM, shane knapp 
>> > wrote:
>> >> i'm going to need to perform a quick reboot on the jenkins master.  it
>> >> looks like it's hung again.
>> >>
>> >> sorry about this!
>> >>
>> >> shane
>> >>
>> >> On Tue, May 16, 2017 at 12:55 PM, shane knapp 
>> >> wrote:
>> >>> ...but just now i started getting alerts on system load, which was
>> >>> rather high.  i had to kick jenkins again, and will keep an eye on the
>> >>> master and possible need to reboot.
>> >>>
>> >>> sorry about the interruption of service...
>> >>>
>> >>> shane
>> >>>
>> >>> On Tue, May 16, 2017 at 8:18 AM, shane knapp 
>> >>> wrote:
>> >>>> ...so i kicked it and it's now back up and happily building.
>>
>>
>




Re: [build system] jenkins got itself wedged...

2017-05-18 Thread Sean Owen
I'm not sure if it's related, but I still can't get Jenkins to test PRs.
For example, triggering it through the spark-prs.appspot.com UI gives me...

https://spark-prs.appspot.com/trigger-jenkins/18012
Internal Server Error

That might be from the appspot app though?
But posting "Jenkins test this please" on PRs doesn't seem to work, and I
can't reach Jenkins:
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/

On Thu, May 18, 2017 at 12:44 AM shane knapp  wrote:

> after another couple of restarts due to high load and system
> unresponsiveness, i finally found what is the most likely culprit:
>
> a typo in the jenkins config where the java heap size was configured.
> instead of -Xmx16g, we had -Dmx16G...  which could easily explain the
> random and non-deterministic system hangs we've had over the past
> couple of years.
>
> anyways, it's been corrected and the master seems to be humming along,
> for real this time, w/o issue.  i'll continue to keep an eye on this
> for the rest of the week, but things are looking MUCH better now.
>
> sorry again for the interruptions in service.
>
> shane
>
> On Wed, May 17, 2017 at 9:59 AM, shane knapp  wrote:
> > ok, we're back up, system load looks cromulent and we're happily
> > building (again).
> >
> > shane
> >
> > On Wed, May 17, 2017 at 9:50 AM, shane knapp 
> wrote:
> >> i'm going to need to perform a quick reboot on the jenkins master.  it
> >> looks like it's hung again.
> >>
> >> sorry about this!
> >>
> >> shane
> >>
> >> On Tue, May 16, 2017 at 12:55 PM, shane knapp 
> wrote:
> >>> ...but just now i started getting alerts on system load, which was
> >>> rather high.  i had to kick jenkins again, and will keep an eye on the
> >>> master and possible need to reboot.
> >>>
> >>> sorry about the interruption of service...
> >>>
> >>> shane
> >>>
> >>> On Tue, May 16, 2017 at 8:18 AM, shane knapp 
> wrote:
> >>>> ...so i kicked it and it's now back up and happily building.
>
>
>


Re: [build system] jenkins got itself wedged...

2017-05-17 Thread shane knapp
after another couple of restarts due to high load and system
unresponsiveness, i finally found what is the most likely culprit:

a typo in the jenkins config where the java heap size was configured.
instead of -Xmx16g, we had -Dmx16G...  which could easily explain the
random and non-deterministic system hangs we've had over the past
couple of years.
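
(why the typo goes unnoticed: `-D<name>` is the JVM's syntax for defining a system property, so `-Dmx16G` is accepted as a property named `mx16G` and no maximum heap is set, leaving the JVM on its default. below is a small sketch of a checker that would flag this kind of slip; the function name, the regexes and the sample argument list are hypothetical and not part of jenkins or the JVM.)

```python
import re

# Real heap/stack sizing options look like -Xmx16g, -Xms4g, -Xmn512m, -Xss1m.
HEAP_FLAG = re.compile(r"^-X(mx|ms|mn|ss)(\d+)([kKmMgG]?)$")
# A -D option whose "property name" looks like a heap size is suspicious.
SUSPECT_PROPERTY = re.compile(r"^-D(mx|ms|mn|ss)\d+[kKmMgG]?$")


def check_jvm_args(args):
    """Return (sizing options found, list of warnings) for a JVM arg list."""
    sizing = {}
    warnings = []
    for arg in args:
        match = HEAP_FLAG.match(arg)
        if match:
            sizing[match.group(1)] = match.group(2) + match.group(3)
        elif SUSPECT_PROPERTY.match(arg):
            warnings.append(
                f"{arg} defines a system property; did you mean -X{arg[2:]}?"
            )
    if "mx" not in sizing:
        warnings.append("no -Xmx found; the JVM default max heap applies")
    return sizing, warnings


# Hypothetical argument lists modelled on the typo described above.
broken = ["-Dmx16G", "-Djava.awt.headless=true"]
fixed = ["-Xmx16g", "-Djava.awt.headless=true"]

for label, args in (("broken", broken), ("fixed", fixed)):
    sizing, warnings = check_jvm_args(args)
    print(label, sizing, warnings)
```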

anyways, it's been corrected and the master seems to be humming along,
for real this time, w/o issue.  i'll continue to keep an eye on this
for the rest of the week, but things are looking MUCH better now.

sorry again for the interruptions in service.

shane

On Wed, May 17, 2017 at 9:59 AM, shane knapp  wrote:
> ok, we're back up, system load looks cromulent and we're happily
> building (again).
>
> shane
>
> On Wed, May 17, 2017 at 9:50 AM, shane knapp  wrote:
>> i'm going to need to perform a quick reboot on the jenkins master.  it
>> looks like it's hung again.
>>
>> sorry about this!
>>
>> shane
>>
>> On Tue, May 16, 2017 at 12:55 PM, shane knapp  wrote:
>>> ...but just now i started getting alerts on system load, which was
>>> rather high.  i had to kick jenkins again, and will keep an eye on the
>>> master and possible need to reboot.
>>>
>>> sorry about the interruption of service...
>>>
>>> shane
>>>
>>> On Tue, May 16, 2017 at 8:18 AM, shane knapp  wrote:
>>>> ...so i kicked it and it's now back up and happily building.




Re: [build system] jenkins got itself wedged...

2017-05-17 Thread shane knapp
ok, we're back up, system load looks cromulent and we're happily
building (again).

shane

On Wed, May 17, 2017 at 9:50 AM, shane knapp  wrote:
> i'm going to need to perform a quick reboot on the jenkins master.  it
> looks like it's hung again.
>
> sorry about this!
>
> shane
>
> On Tue, May 16, 2017 at 12:55 PM, shane knapp  wrote:
>> ...but just now i started getting alerts on system load, which was
>> rather high.  i had to kick jenkins again, and will keep an eye on the
>> master and possible need to reboot.
>>
>> sorry about the interruption of service...
>>
>> shane
>>
>> On Tue, May 16, 2017 at 8:18 AM, shane knapp  wrote:
>>> ...so i kicked it and it's now back up and happily building.




Re: [build system] jenkins got itself wedged...

2017-05-17 Thread shane knapp
i'm going to need to perform a quick reboot on the jenkins master.  it
looks like it's hung again.

sorry about this!

shane

On Tue, May 16, 2017 at 12:55 PM, shane knapp  wrote:
> ...but just now i started getting alerts on system load, which was
> rather high.  i had to kick jenkins again, and will keep an eye on the
> master and possible need to reboot.
>
> sorry about the interruption of service...
>
> shane
>
> On Tue, May 16, 2017 at 8:18 AM, shane knapp  wrote:
>> ...so i kicked it and it's now back up and happily building.




Re: [build system] jenkins got itself wedged...

2017-05-16 Thread shane knapp
...but just now i started getting alerts on system load, which was
rather high.  i had to kick jenkins again, and will keep an eye on the
master and possible need to reboot.

sorry about the interruption of service...

shane

On Tue, May 16, 2017 at 8:18 AM, shane knapp  wrote:
> ...so i kicked it and it's now back up and happily building.




Re: [build system] jenkins got itself wedged...

2017-05-16 Thread Herman van Hövell tot Westerflier
Thanks Shane!

On Tue, May 16, 2017 at 5:18 PM, shane knapp  wrote:

> ...so i kicked it and it's now back up and happily building.
>
>
>


-- 

Herman van Hövell

Software Engineer

Databricks Inc.

hvanhov...@databricks.com

+31 6 420 590 27

databricks.com
