And now with screenshot! :)

Have a good weekend,
Niels


On Fri, Sep 08, 2017 at 05:41:30PM +0200, Niels de Vos wrote:
> On Fri, Sep 08, 2017 at 06:55:19AM -0700, Frank Filz wrote:
> > > On Fri, Sep 01, 2017 at 03:09:34PM -0700, Frank Filz wrote:
> > > > Lately, we have been plagued by a lot of intermittent test failures.
> > > >
> > > > I have seen intermittent failures in pynfs WRT14, WRT15, and WRT16.
> > > > These have not been resolved by the latest ntirpc pullup.
> > > >
> > > > Additionally, we see a lot of intermittent failures in the continuous
> > > > integration.
> > > >
> > > > A big issue with the CentOS CI is that it seems to have a fragile
> > > > setup, and sometimes doesn't even succeed in trying to build Ganesha,
> > > > and then fires a Verified -1. This makes it hard to evaluate what
> > > > patches are actually ready for integration.
> > > 
> > > We can look into this, but it helps if you can provide a link to the patch in
> > > GerritHub or the job in the CI.
> > 
> > Here's one merged last week with a Gluster CI Verify -1:
> > 
> > https://review.gerrithub.io/#/c/375463/
> > 
> > And just to preserve it in case... here's the log:
> > 
> > Triggered by Gerrit: https://review.gerrithub.io/375463 in silent mode.
> > [EnvInject] - Loading node environment variables.
> > Building remotely on nfs-ganesha-ci-slave01 (nfs-ganesha) in workspace
> > /home/nfs-ganesha/workspace/nfs-ganesha_trigger-fsal_gluster
> > [nfs-ganesha_trigger-fsal_gluster] $ /bin/sh -xe
> > /tmp/jenkins5031649144466335345.sh
> > + set +x
> >   % Total    % Received % Xferd  Average Speed   Time    Time     Time
> > Current
> >                                  Dload  Upload   Total   Spent    Left
> > Speed
> > 
> >   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--
> > 0
> >   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--
> > 0
> > 100  1735  100  1735    0     0   8723      0 --:--:-- --:--:-- --:--:--
> > 8718
> > Traceback (most recent call last):
> >   File "bootstrap.py", line 33, in <module>
> >     b=json.loads(dat)
> >   File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
> >     return _default_decoder.decode(s)
> >   File "/usr/lib64/python2.7/json/decoder.py", line 366, in decode
> >     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
> >   File "/usr/lib64/python2.7/json/decoder.py", line 384, in raw_decode
> >     raise ValueError("No JSON object could be decoded")
> > ValueError: No JSON object could be decoded
> > https://ci.centos.org/job/nfs-ganesha_trigger-fsal_gluster/3455//console :
> > FAILED
> > Build step 'Execute shell' marked build as failure
> > Finished: FAILURE
> > 
> > Which doesn't tell me much about why it failed, though it looks like a
> > failure that has nothing to do with Ganesha...
> 
> From #centos-devel on Freenode:
> 
> 15:49 < ndevos> bstinson: is 
> https://ci.centos.org/job/nfs-ganesha_trigger-fsal_gluster/3487/console a 
> known duffy problem? and how can the jobs work around this?
> 15:51 < bstinson> ndevos: you may be hitting the rate limit
> 15:52 < ndevos> bstinson: oh, that is possible, I guess... it might happen 
> when a series of patches get sent
> 15:53 < ndevos> bstinson: should I do a sleep and retry in case of such a 
> failure?
> 15:55 < bstinson> ndevos: yeah, that should work. we measure your usage over 
> 5 minutes
> 15:57 < ndevos> bstinson: ok, so sleeping 5 minutes, retry and loop should be
> acceptable?
> 15:59 < ndevos> bstinson: is there a particular message returned by duffy 
> when the rate limit is hit? the reply is not json, but maybe some error?
> 15:59 < ndevos> (in plain text format?)
> 15:59 < bstinson> yeah 5 minutes should be acceptable, it does return a plain 
> text error message
> 16:00 < bstinson> 'Deployment rate over quota, try again in a few minutes'
> 
> I added retry logic, which is now live and should be applied to all
> upcoming tests:
> 
> https://github.com/nfs-ganesha/ci-tests/commit/ed055058c7956ebb703464c742837a9ace797129
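> 
> Roughly, the change amounts to something like this (only a sketch with a
> placeholder Duffy URL and invented helper names, not the exact code from
> the commit above):
> 
>     # Sketch: retry when Duffy answers with a plain-text rate-limit
>     # message instead of JSON (Python 2.7, like bootstrap.py).
>     import json
>     import time
>     import urllib2
> 
>     DUFFY_URL = 'http://duffy.example/Node/get'  # placeholder, not the real endpoint
> 
>     def request_nodes(url, retries=6, delay=300):
>         """Ask Duffy for nodes, sleeping 5 minutes between attempts
>         while the deployment rate is over quota."""
>         for attempt in range(retries):
>             dat = urllib2.urlopen(url).read()
>             try:
>                 return json.loads(dat)
>             except ValueError:
>                 # Plain-text reply, e.g.
>                 # 'Deployment rate over quota, try again in a few minutes'
>                 print('attempt %d: %s' % (attempt + 1, dat.strip()))
>                 time.sleep(delay)
>         raise RuntimeError('Duffy kept refusing the node request')
> 
>     nodes = request_nodes(DUFFY_URL)
> 
> With something like that in place, the job should only report Verified -1
> when Duffy really is unavailable for a longer period.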
> 
> 
> > > > An additional issue with the CentOS CI is that the failure logs often
> > > > aren't preserved long enough to even diagnose the issue.
> > > 
> > > That is something we can change. Some jobs do not delete the results, but
> > > others seem to. How long (in days), or how many results would you like to
> > > keep?
> > 
> > I'd say they need to be kept at least a week. If we could have time-based
> > retention rather than number-of-results retention, I think that would help.
> 
> Some jobs seem to have been set to keep 7 days, max 7 jobs. It does not
> really cost us anything, so I'll change it to 14 days. A screenshot of
> these settings is attached. It may be that I missed updating a job, so
> let us know in case logs are deleted too early.
> 
> > At least after a week, it's reasonable to expect folks to rebase their
> > patches and re-submit, which would trigger a new run.
> > 
> > > > The result is that honestly, I mostly ignore the CentOS CI results.
> > > > They almost might as well not be run...
> > > 
> > > This is definitely not what we want, so let's fix the problems.
> > 
> > Yea, and thus my rant...
> 
> I really understand this; a CI should help identify problems, not
> introduce problems of its own. Let's try hard to make sure you don't
> need to rant about it much more :-)
> 
> > > > Let's talk about CI more on a near-term concall (it would help if
> > > > Niels and Jiffin could join a call to talk about this; our next call
> > > > might be too soon for that).
> > > 
> > > Tuesdays tend to be very busy for me, and I am not sure I can join the call
> > > next week. Arthy did some work on the jobs in the CentOS CI, she could
> > > probably work with Jiffin to make any changes that improve the experience
> > > for you. I'm happy to help out where I can too, of course :-)
> > 
> > If we can figure out another time to have a CI call, that would be helpful.
> > 
> > It would be good to pull in Patrice from CEA as well as anyone else who
> > cares.
> > 
> > It would really help if we could have someone with better time zone overlap
> > with me who could manage the CI stuff, but that may not be realistic.
> 
> We can sign up anyone in the NFS-Ganesha community to do this. It takes
> a little time to get familiar with the scripts and tools that are used,
> but once that has settled it is relatively straightforward.
> 
> Volunteers?
> 
> Cheers,
> Niels
> 
