I am about halfway through the 48-hour test using NFS 4.1 on the
client. I have had multiple assert failures caused by "refcount = -1",
but so far I have not seen any segfault failures. So the assert failure
occurs with both 4.1 and 4.2, but not with 4.0.  If I don't see a
segfault failure by tomorrow, I am going to assume the segfault failure
only happens with 4.2.
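
For anyone following along, here is a minimal sketch of how a count can end
up at -1. It assumes the asserted value is the count after the decrement, and
it uses a hypothetical, simplified owner struct and function names (owner,
owner_get_ref, owner_put_ref); it is not the Ganesha code, and it replays the
two releases sequentially rather than reproducing the actual race.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical, simplified owner; NOT Ganesha's state owner structure. */
struct owner {
	atomic_int_fast32_t refcount;
};

/* Take a reference; returns the count after the increment. */
int32_t owner_get_ref(struct owner *o)
{
	return (int32_t)atomic_fetch_add(&o->refcount, 1) + 1;
}

/* Drop a reference; returns the count after the decrement.
 * A negative result means there were more puts than gets. */
int32_t owner_put_ref(struct owner *o)
{
	return (int32_t)atomic_fetch_sub(&o->refcount, 1) - 1;
}

/* Two code paths each believe they hold the last reference and both
 * release it. Each decrement is atomic on its own, but the pairing of
 * gets and puts is wrong, so the final count underflows. */
int32_t simulate_double_release(void)
{
	struct owner o;

	atomic_init(&o.refcount, 0);
	owner_get_ref(&o);		/* one reference taken: count is 1 */
	owner_put_ref(&o);		/* path A releases it: count is 0 */
	return owner_put_ref(&o);	/* path B releases it again: count is -1 */
}
```

A check like assert(refcount > 0) on the value returned by the second put
would fire in this sketch, which matches the symptom I am seeing, though not
necessarily the cause.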

Regards,
Eric


On Thu, Nov 10, 2016 at 1:27 PM, Frank Filz <ffilz...@mindspring.com> wrote:
>> I re-ran the same test for 48 hours using the NFS 4.0 mount option, to the
>> Ganesha NFS 2.4.1 server, with the client NFS fstab entry:
>>
>> ede-c2-gw01:/var/top /C2-NFS4 nfs4 rw,hard,noauto,vers=4.0  0 0
>>
>> and I have not seen any asserts or segfaults, so there is something going on
>> when using vers=4.2 that is not seen with vers=4.0. When using vers=4.2, I
>> normally see more than 20 asserts or segfaults per 24 hours when running my
>> test case.
>>
>> I am going to re-run my tests using vers=4.1
>
> There has been relatively little 4.2 testing done with Ganesha, so it
> wouldn't surprise me if there is some issue there.
>
> If it turns out to be 4.2 only, then we will need to examine what is 
> different in the 4.2 flow.
>
> On the other hand, if it shows up in 4.1, then likely culprits are the
> session code and the way we handle state owner sequence checking (which is
> for 4.0 only) in conjunction with stateid validation. There's enough
> complexity in trying to handle the two different ways of validating stateful
> requests that I could easily see a refcount bug showing up.
>
> Frank
>
>> On Wed, Nov 2, 2016 at 12:20 PM, Frank Filz <ffilz...@mindspring.com>
>> wrote:
>> > I'm playing with running Ganesha under valgrind and helgrind to see if
>> > anything drops out from those.
>> >
>> > Unfortunately helgrind reports a lot of data races that either have no
>> > functional impact (stat collection that doesn't use atomic ops) or are
>> > in the ntirpc code (a ton of them). It also seems to misunderstand some
>> > atomic ops: I HAVE seen it complain before when something is accessed
>> > using atomic ops, but sometimes while holding a lock and sometimes not.
>> > It decides that the unlocked accesses cause a race even though the
>> > atomic op should guarantee safety.
>> >
>> > Frank
>> >
>> >> -----Original Message-----
>> >> From: Malahal Naineni [mailto:mala...@gmail.com]
>> >> Sent: Tuesday, October 25, 2016 11:22 PM
>> >> To: Eric Eastman <eric.east...@keepertech.com>
>> >> Cc: nfs-ganesha-devel@lists.sourceforge.net
>> >> Subject: Re: [Nfs-ganesha-devel] assert in dec_state_owner_ref() with
>> >> V2.4.0.3
>> >>
>> >> Please post if you have an easy reproducer. We will try to recreate and
>> >> root cause it.
>> >>
>> >> On Wed, Oct 26, 2016 at 6:15 AM, Eric Eastman
>> >> <eric.east...@keepertech.com> wrote:
>> >> > A little more info on this issue.  I did a 24 hour run of my test
>> >> > using the POSIX FSAL with an ext4 file system as the backstore, and
>> >> > saw 9 asserts during this test run, all caused by the variable
>> >> > "refcount" ending up at -1.  The errors seem to be occurring while
>> >> > running "rm -rf" on a directory with 1000 sub-directories, with
>> >> > each having 11 files in it.
>> >> >
>> >> > This looks to me like a race condition, and I am having trouble
>> >> > finding the root cause by reading through the source code.  There are
>> >> > notes from commit e7307c5, dated Jan 5 2016, on "Resolve race
>> >> > between get_state_owner and dec_state_owner_ref differently", so
>> >> > this looks like an area where there have been issues before.
>> >> >
>> >> > If anyone has an idea on what the root problem is or where to look,
>> >> > please let me know, as we cannot use Ganesha NFS if it is going to
>> >> > assert during production.
>> >> >
>> >> > Thanks,
>> >> > Eric
>> >> >
>> >> > On Thu, Oct 20, 2016 at 1:22 AM, Eric Eastman
>> >> > <eric.east...@keepertech.com> wrote:
>> >> >> While testing Ganesha NFS V2.4.0.3 using the CEPH FSAL to a ceph
>> >> >> file system, I am seeing the ganesha.nfsd process die due to an
>> >> >> assert call multiple times per hour.  I have also seen it die at
>> >> >> the same place in the code using the VFS FSAL with an ext4 file
>> >> >> system, but it dies much less often.
>> >> >>
>> >> >> It is dying at line 917 in src/SAL/state_misc.c, which is called
>> >> >> by src/SAL/state_misc.c at line 1010.  The assert call is in
>> >> >> dec_state_owner_ref() at the line:
>> >> >>
>> >> >>        assert(refcount > 0);
>> >> >>
>> >> >> Looking at the core files and adding in some debugging code
>> >> >> confirms that refcount is -1 when the assert call is made.
>> >> >>
>> >> >> It looks like the owner count is trying to go to -1 in
>> >> >> uncache_nfs4_owner(), but as it occurs only occasionally, I think
>> >> >> it is a race condition.
>> >> >>
>> >> >> Info on the build:
>> >> >>
>> >> >> Host OS is Ubuntu 14.04 with a 4.8.2 x86_64 kernel on an
>> >> >> 8-processor system
>> >> >>
>> >> >> Cmake command:
>> >> >> # cmake -DCMAKE_INSTALL_PREFIX=/opt/keeper -DALLOCATOR=jemalloc -DUSE_ADMIN_TOOLS=ON -DUSE_DBUS=ON ../src
>> >> >>
>> >> >> # ganesha.nfsd -v
>> >> >> ganesha.nfsd compiled on Oct 17 2016 at 16:50:18
>> >> >> Release = V2.4.0.3
>> >> >> Release comment = GANESHA file server is 64 bits compliant and supports NFS v3,4.0,4.1 (pNFS) and 9P
>> >> >> Git HEAD = 0f55a9a97a4bf232fb0e42542e4ca7491fbf84ce
>> >> >> Git Describe = V2.4.0.3-0-g0f55a9a
>> >> >>
>> >> >> # ceph -v
>> >> >> ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>> >> >>
>> >> >> # cat ganesha.conf
>> >> >> LOG {
>> >> >>     components {
>> >> >>        ALL = INFO;
>> >> >>     }
>> >> >> }
>> >> >>
>> >> >> EXPORT_DEFAULTS {
>> >> >> SecType = none, sys;
>> >> >> Protocols = 3, 4;
>> >> >> Transports = TCP;
>> >> >> }
>> >> >>
>> >> >> # define CephFS export
>> >> >> EXPORT {
>> >> >>     Export_ID = 42;
>> >> >>     Path = /top;
>> >> >>     Pseudo = /top;
>> >> >>     Access_Type = RW;
>> >> >>     Squash = No_Root_Squash;
>> >> >>     FSAL {
>> >> >>         Name = CEPH;
>> >> >>     }
>> >> >> }
>> >> >>
>> >> >> The VFS export for the ext4 tests was:
>> >> >>
>> >> >> # define CephFS export
>> >> >> EXPORT {
>> >> >>     Export_ID = 43;
>> >> >>     Path = /var/top;
>> >> >>     Pseudo = /var/top;
>> >> >>     Access_Type = RW;
>> >> >>     Squash = No_Root_Squash;
>> >> >>     FSAL {
>> >> >>         Name = VFS;
>> >> >>     }
>> >> >> }
>> >> >>
>> >> >> The test used 2 Ubuntu 14.04 NFS clients, each running 6 processes
>> >> >> writing 11,000 256k files in separate directory trees with 11
>> >> >> files per lowest level node. On each Ubuntu client, 3 processes
>> >> >> wrote to an NFS 3 mount and 3 wrote to an NFS 4 mount. The files are
>> >> >> then read and verified, deleted, and the test restarts.
>> >> >>
>> >> >> Regards,
>> >> >> Eric
>> >> >
>> >> > _______________________________________________
>> >> > Nfs-ganesha-devel mailing list
>> >> > Nfs-ganesha-devel@lists.sourceforge.net
>> >> > https://lists.sourceforge.net/lists/listinfo/nfs-ganesha-devel