> I re-ran the same test for 48 hours using the NFS 4.0 mount option, to the
> Ganesha NFS 2.4.1 server, with the client NFS fstab entry:
>
> ede-c2-gw01:/var/top /C2-NFS4 nfs4 rw,hard,noauto,vers=4.0 0 0
>
> and I have not seen any asserts or segfaults, so there is something going on
> when using vers=4.2 that is not seen with vers=4.0. When using vers=4.2, I
> normally see more than 20 asserts or segfaults per 24 hours when running my
> test case.
>
> I am going to re-run my tests using vers=4.1

There has been relatively little 4.2 testing done with Ganesha, so it wouldn't
surprise me if there is some issue there. If it turns out to be 4.2 only, then
we will need to examine what is different in the 4.2 flow. On the other hand,
if it shows up in 4.1, then likely culprits are the session code and the way we
handle state owner sequence checking (which is for 4.0 only) in conjunction
with stateid validation. There's enough complexity in trying to handle the two
different ways of validating stateful requests that I could easily see a
refcount bug showing up.

Frank

> On Wed, Nov 2, 2016 at 12:20 PM, Frank Filz <ffilz...@mindspring.com>
> wrote:
> > I'm playing with running Ganesha under valgrind and helgrind to see if
> > anything drops out from those.
> >
> > Unfortunately helgrind seems to show up a lot of data races that
> > either have no functional impact (stat collection that doesn't use
> > atomic ops), a ton in the ntirpc code, and it also seems to
> > misunderstand some atomic ops (I HAVE seen it complain before when
> > something is accessed using atomic ops, but sometimes while holding a
> > lock, and sometimes not, it decides the fact that there were unlocked
> > accesses causes a race even though the atomic op should guarantee).
> >
> > Frank
> >
> >> -----Original Message-----
> >> From: Malahal Naineni [mailto:mala...@gmail.com]
> >> Sent: Tuesday, October 25, 2016 11:22 PM
> >> To: Eric Eastman <eric.east...@keepertech.com>
> >> Cc: nfs-ganesha-devel@lists.sourceforge.net
> >> Subject: Re: [Nfs-ganesha-devel] assert in dec_state_owner_ref() with
> >> V2.4.0.3
> >>
> >> Please post if you have an easy reproducer. We will try to recreate
> >> and root cause it.
> >>
> >> On Wed, Oct 26, 2016 at 6:15 AM, Eric Eastman
> >> <eric.east...@keepertech.com> wrote:
> >> > A little more info on this issue.
> >> > I did a 24 hour run of my test
> >> > using the POSIX FSAL with an ext4 file system as the backstore, and
> >> > saw 9 asserts during this test run, all caused by the variable
> >> > "refcount" ending up at -1. The errors seem to be occurring while
> >> > running "rm -rf" on a directory with 1000 sub-directories, with
> >> > each having 11 files in it.
> >> >
> >> > This looks to me like a race condition and I am having trouble
> >> > finding the root cause by reading through the source code. There are
> >> > notes from commit e7307c5, dated Jan 5 2016, on "Resolve race
> >> > between get_state_owner and dec_state_owner_ref differently" so
> >> > this looks like an area where there have been issues before.
> >> >
> >> > If anyone has an idea on what the root problem is or where to look,
> >> > please let me know, as we cannot use Ganesha NFS if it is going to
> >> > assert during production.
> >> >
> >> > Thanks,
> >> > Eric
> >> >
> >> > On Thu, Oct 20, 2016 at 1:22 AM, Eric Eastman
> >> > <eric.east...@keepertech.com> wrote:
> >> >> While testing Ganesha NFS V2.4.0.3 using the CEPH FSAL to a ceph
> >> >> file system, I am seeing the ganesha.nfsd process die due to an
> >> >> assert call multiple times per hour. I have also seen it die at
> >> >> the same place in the code using the VFS FSAL with an ext4 file
> >> >> system, but it dies much less often.
> >> >>
> >> >> It is dying at line 917 in src/SAL/state_misc.c, which is called
> >> >> by src/SAL/state_misc.c at line 1010. The assert call is in
> >> >> dec_state_owner_ref() at the line:
> >> >>
> >> >> assert(refcount > 0);
> >> >>
> >> >> Looking at the core files and adding in some debugging code
> >> >> confirms that refcount is -1 when the assert call is made.
> >> >>
> >> >> It looks like the owner count is trying to go to -1 in
> >> >> uncache_nfs4_owner(), but as it occurs only occasionally, I think
> >> >> it is a race condition.
> >> >>
> >> >> Info on the build:
> >> >>
> >> >> Host OS is Ubuntu 14.04 with a 4.8.2 x86_64 kernel on an 8
> >> >> processor system
> >> >>
> >> >> Cmake command:
> >> >> # cmake -DCMAKE_INSTALL_PREFIX=/opt/keeper -DALLOCATOR=jemalloc
> >> >> -DUSE_ADMIN_TOOLS=ON -DUSE_DBUS=ON ../src
> >> >>
> >> >> # ganesha.nfsd -v
> >> >> ganesha.nfsd compiled on Oct 17 2016 at 16:50:18
> >> >> Release = V2.4.0.3
> >> >> Release comment = GANESHA file server is 64 bits
> >> >> compliant and supports NFS v3,4.0,4.1 (pNFS) and 9P
> >> >> Git HEAD = 0f55a9a97a4bf232fb0e42542e4ca7491fbf84ce
> >> >> Git Describe = V2.4.0.3-0-g0f55a9a
> >> >>
> >> >> # ceph -v
> >> >> ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
> >> >>
> >> >> # cat ganesha.conf
> >> >> LOG {
> >> >>     components {
> >> >>         ALL = INFO;
> >> >>     }
> >> >> }
> >> >>
> >> >> EXPORT_DEFAULTS {
> >> >>     SecType = none, sys;
> >> >>     Protocols = 3, 4;
> >> >>     Transports = TCP;
> >> >> }
> >> >>
> >> >> # define CephFS export
> >> >> EXPORT {
> >> >>     Export_ID = 42;
> >> >>     Path = /top;
> >> >>     Pseudo = /top;
> >> >>     Access_Type = RW;
> >> >>     Squash = No_Root_Squash;
> >> >>     FSAL {
> >> >>         Name = CEPH;
> >> >>     }
> >> >> }
> >> >>
> >> >> The VFS export for the ext4 tests was:
> >> >>
> >> >> # define CephFS export
> >> >> EXPORT {
> >> >>     Export_ID = 43;
> >> >>     Path = /var/top;
> >> >>     Pseudo = /var/top;
> >> >>     Access_Type = RW;
> >> >>     Squash = No_Root_Squash;
> >> >>     FSAL {
> >> >>         Name = VFS;
> >> >>     }
> >> >> }
> >> >>
> >> >> The test was 2 Ubuntu 14.04 NFS clients each having 6 processes,
> >> >> writing 11,000 256k files in separate directory trees with 11
> >> >> files per lowest level node. On each Ubuntu client, 3 processes
> >> >> wrote to a NFS 3 mount and 3 wrote to a NFS 4 mount. The files are
> >> >> then read and verified, deleted, and the test restarts.
> >> >>
> >> >> Regards,
> >> >> Eric

_______________________________________________
Nfs-ganesha-devel mailing list
Nfs-ganesha-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs-ganesha-devel
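
For anyone reading this thread later: the failure mode discussed above (a reference count that ends up at -1 and trips a check like assert(refcount > 0)) can be illustrated with a small stand-alone sketch. This is not Ganesha's code. The struct, the function names (owner_get, owner_put, replay_balanced, replay_unbalanced) and the "cached" flag are all hypothetical, and the bad interleaving is replayed sequentially so the outcome is deterministic. It only shows how two paths that each believe they hold the last reference drive an atomic counter negative; it says nothing about where the real bug is.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a state owner; not Ganesha's real struct. */
struct owner {
	atomic_int refcount;
	bool cached;	/* parked for reuse once the last reference is gone */
};

static void owner_init(struct owner *o)
{
	atomic_init(&o->refcount, 1);	/* creator holds one reference */
	o->cached = false;
}

/* Take a reference; an owner found in the cache is revived. */
static int owner_get(struct owner *o)
{
	o->cached = false;
	return atomic_fetch_add(&o->refcount, 1) + 1;
}

/* Drop a reference and return the new count. A negative result is the
 * state that a check such as assert(refcount > 0) would trip on. */
static int owner_put(struct owner *o)
{
	int now = atomic_fetch_sub(&o->refcount, 1) - 1;

	if (now == 0)
		o->cached = true;
	return now;
}

/* Every get is paired with exactly one put: the count ends at 0. */
static int replay_balanced(void)
{
	struct owner o;

	owner_init(&o);
	owner_get(&o);		/* second user arrives: count 2 */
	owner_put(&o);		/* second user leaves: count 1 */
	return owner_put(&o);	/* creator leaves: count 0, owner cached */
}

/* Two paths each believe they hold the last reference: the count ends at -1. */
static int replay_unbalanced(void)
{
	struct owner o;
	int first;

	owner_init(&o);
	first = owner_put(&o);	/* path A: last user releases, count 0 */
	assert(first == 0 && o.cached);
	return owner_put(&o);	/* path B: cache cleanup releases again, count -1 */
}
```

Calling replay_balanced() returns 0 and replay_unbalanced() returns -1. The atomic decrement itself never misbehaves in either case; the negative count comes purely from an extra release, which is why this kind of bug only shows up under particular timing.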