Thanks for looking into this Shane!

On Thu, Jan 5, 2017 at 1:19 PM, shane knapp <skn...@berkeley.edu> wrote:
> as of first thing this morning, here's the list of recent GC overhead
> build failures:
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70891/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70874/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70842/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70927/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70551/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70835/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70841/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70869/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70598/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70898/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70629/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70644/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70686/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70620/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70871/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70873/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70622/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70837/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70626/console
>
> i haven't really found anything that jumps out at me except perhaps
> auditing/upping the java memory limits across the build. this seems
> to be a massive shot in the dark, and time consuming, so let's just
> call this a "method of last resort".
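>
> (if we do end up going down that road, the cheapest first experiment
> is probably just handing the build jvms a bigger heap and watching the
> failure rate -- rough sketch below. whether our build/sbt wrapper
> actually honors SBT_OPTS, and what size to pick, are assumptions on
> my part, so treat it as a starting point and not a recipe. there's
> also -XX:-UseGCOverheadLimit, but that just hides the symptom.)
>
>     # hypothetical sketch: re-run the tests with a larger heap to see
>     # whether the "GC overhead limit exceeded" failures go away.
>     # SBT_OPTS and the 4g figure are assumptions, not audited values.
>     import os
>     import subprocess
>
>     env = dict(os.environ)
>     env["SBT_OPTS"] = "-Xmx4g"  # bump the heap handed to the sbt jvm
>     subprocess.run(["build/sbt", "test"], env=env, check=True)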
>
> looking more closely at the systems themselves, it looked to me that
> there was enough java "garbage" that had accumulated over the last 5
> months (since the last reboot) that system reboots would be a good
> first step.
>
> https://www.youtube.com/watch?v=nn2FB1P_Mn8
>
> over the course of this morning i've been sneaking in worker reboots
> during quiet times... the ganglia memory graphs look a lot better
> (free memory up, cached memory down!), and i'll keep an eye on things
> over the course of the next few days to see if the build failure
> frequency is affected.
>
> also, i might be scheduling quarterly system reboots if this indeed
> fixes the problem.
>
> shane
>
> On Wed, Jan 4, 2017 at 1:22 PM, shane knapp <skn...@berkeley.edu> wrote:
> > preliminary findings: seems to be transient, and affecting ~4% of
> > builds from late december until now (which is as far back as we keep
> > build records for the PRB builds):
> >
> > 408 builds total
> >  16 builds w/ GC overhead failures
> >
> > it's also happening across all workers at about the same rate.
> >
> > and best of all, there seems to be no pattern to which tests are
> > failing (different each time). i'll look a little deeper and decide
> > what to do next.
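> >
> > (if anyone wants to double-check the tally, something along these
> > lines against the jenkins json api should land close to the numbers
> > above -- a rough sketch only: the api caps how many builds it
> > returns by default, and i'm just string-matching the consoles.)
> >
> >     # rough sketch: count recent PRB builds whose console log shows
> >     # the GC overhead error. the endpoints are standard jenkins;
> >     # the build window and exact match string are approximations.
> >     import json
> >     from urllib.request import urlopen
> >
> >     BASE = ("https://amplab.cs.berkeley.edu/jenkins/job/"
> >             "SparkPullRequestBuilder")
> >     raw = urlopen(BASE + "/api/json?tree=builds[number,result]").read()
> >
> >     total = failed = 0
> >     for build in json.loads(raw.decode("utf-8"))["builds"]:
> >         if build["result"] is None:  # skip builds still in flight
> >             continue
> >         total += 1
> >         console = urlopen("%s/%d/consoleText"
> >                           % (BASE, build["number"])).read()
> >         if b"GC overhead limit exceeded" in console:
> >             failed += 1
> >
> >     print("%d of %d builds (%.1f%%) hit the gc limit"
> >           % (failed, total, 100.0 * failed / max(total, 1)))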
> >
> > On Tue, Jan 3, 2017 at 6:49 PM, shane knapp <skn...@berkeley.edu> wrote:
> >> nope, no changes to jenkins in the past few months. ganglia graphs
> >> show higher, but not worrying, memory usage on the workers when the
> >> jobs failed...
> >>
> >> i'll take a closer look later tonite/first thing tomorrow morning.
> >>
> >> shane
> >>
> >> On Tue, Jan 3, 2017 at 4:35 PM, Kay Ousterhout <k...@eecs.berkeley.edu> wrote:
> >>> I've noticed a bunch of the recent builds failing because of GC limits,
> >>> for seemingly unrelated changes (e.g. 70818, 70840, 70842). Shane, have
> >>> there been any recent changes in the build configuration that might be
> >>> causing this? Does anyone else have any ideas about what's going on here?
> >>>
> >>> -Kay