Thanks for looking into this Shane!

On Thu, Jan 5, 2017 at 1:19 PM, shane knapp <skn...@berkeley.edu> wrote:
> as of first thing this morning, here's the list of recent GC overhead
> build failures:
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70891/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70874/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70842/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70927/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70551/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70835/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70841/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70869/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70598/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70898/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70629/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70644/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70686/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70620/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70871/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70873/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70622/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70837/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70626/console
>
> i haven't really found anything that jumps out at me except perhaps
> auditing/upping the java memory limits across the build. this seems
> to be a massive shot in the dark, and time consuming, so let's just
> call this a "method of last resort".
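>
> (if we do end up going down that road, the cheapest first experiment
> is probably just handing the build jvms a bigger heap and watching the
> failure rate -- rough sketch below. whether our build/sbt wrapper
> actually honors SBT_OPTS, and what size to pick, are assumptions on
> my part, so treat it as a starting point and not a recipe. there's
> also -XX:-UseGCOverheadLimit, but that just hides the symptom.)
>
>     # hypothetical sketch: re-run the tests with a larger heap to see
>     # whether the "GC overhead limit exceeded" failures go away.
>     # SBT_OPTS and the 4g figure are assumptions, not audited values.
>     import os
>     import subprocess
>
>     env = dict(os.environ)
>     env["SBT_OPTS"] = "-Xmx4g"  # bump the heap handed to the sbt jvm
>     subprocess.run(["build/sbt", "test"], env=env, check=True)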
>
> looking more closely at the systems themselves, it looked to me that
> there was enough java "garbage" that had accumulated over the last 5
> months (since the last reboot) that system reboots would be a good
> first step.
>
> https://www.youtube.com/watch?v=nn2FB1P_Mn8
>
> over the course of this morning i've been sneaking in worker reboots
> during quiet times... the ganglia memory graphs look a lot better
> (free memory up, cached memory down!), and i'll keep an eye on things
> over the course of the next few days to see if the build failure
> frequency is affected.
>
> also, i might be scheduling quarterly system reboots if this indeed
> fixes the problem.
>
> shane
>
> On Wed, Jan 4, 2017 at 1:22 PM, shane knapp <skn...@berkeley.edu> wrote:
> > preliminary findings: seems to be transient, and affecting ~4% of
> > builds from late december until now (which is as far back as we keep
> > build records for the PRB builds):
> >
> > 408 builds total
> >  16 builds w/ GC overhead failures
> >
> > it's also happening across all workers at about the same rate.
> >
> > and best of all, there seems to be no pattern to which tests are
> > failing (different each time). i'll look a little deeper and decide
> > what to do next.
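> >
> > (if anyone wants to double-check the tally, something along these
> > lines against the jenkins json api should land close to the numbers
> > above -- a rough sketch only: the api caps how many builds it
> > returns by default, and i'm just string-matching the consoles.)
> >
> >     # rough sketch: count recent PRB builds whose console log shows
> >     # the GC overhead error. the endpoints are standard jenkins;
> >     # the build window and exact match string are approximations.
> >     import json
> >     from urllib.request import urlopen
> >
> >     BASE = ("https://amplab.cs.berkeley.edu/jenkins/job/"
> >             "SparkPullRequestBuilder")
> >     raw = urlopen(BASE + "/api/json?tree=builds[number,result]").read()
> >
> >     total = failed = 0
> >     for build in json.loads(raw.decode("utf-8"))["builds"]:
> >         if build["result"] is None:  # skip builds still in flight
> >             continue
> >         total += 1
> >         console = urlopen("%s/%d/consoleText"
> >                           % (BASE, build["number"])).read()
> >         if b"GC overhead limit exceeded" in console:
> >             failed += 1
> >
> >     print("%d of %d builds (%.1f%%) hit the gc limit"
> >           % (failed, total, 100.0 * failed / max(total, 1)))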
> >
> > On Tue, Jan 3, 2017 at 6:49 PM, shane knapp <skn...@berkeley.edu> wrote:
> >> nope, no changes to jenkins in the past few months. ganglia graphs
> >> show higher, but not worrying, memory usage on the workers when the
> >> jobs failed...
> >>
> >> i'll take a closer look later tonite/first thing tomorrow morning.
> >>
> >> shane
> >>
> >> On Tue, Jan 3, 2017 at 4:35 PM, Kay Ousterhout <k...@eecs.berkeley.edu> wrote:
> >>> I've noticed a bunch of the recent builds failing because of GC limits,
> >>> for seemingly unrelated changes (e.g. 70818, 70840, 70842). Shane, have
> >>> there been any recent changes in the build configuration that might be
> >>> causing this? Does anyone else have any ideas about what's going on here?
> >>>
> >>> -Kay