Regression testing setup/steps
===

fscache
-------

sudo apt-get -y install cachefilesd
echo 'RUN=yes' | sudo tee -a /etc/default/cachefilesd
sudo modprobe fscache
sudo systemctl start cachefilesd


nfs
---

sudo apt-get -y install nfs-kernel-server
sudo systemctl start nfs-kernel-server

sudo mkdir -p /{srv,mnt}/nfs-{test,scratch}

# use a different fsid for each export when they live on the same local filesystem
echo '/srv/nfs-test    127.0.0.1(rw,no_subtree_check,no_root_squash,fsid=0)' | sudo tee -a /etc/exports
echo '/srv/nfs-scratch 127.0.0.1(rw,no_subtree_check,no_root_squash,fsid=1)' | sudo tee -a /etc/exports
sudo exportfs -ra
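The fsid-uniqueness rule can be sanity-checked with a short pipeline. The exports content below is an inline sample mirroring the two lines added above, not read from the live /etc/exports:

```shell
# Flag duplicate fsid values across exports. Inline sample data; in real use,
# run the same grep/sort/uniq pipeline against /etc/exports instead.
exports='/srv/nfs-test    127.0.0.1(rw,no_subtree_check,no_root_squash,fsid=0)
/srv/nfs-scratch 127.0.0.1(rw,no_subtree_check,no_root_squash,fsid=1)'
dups=$(printf '%s\n' "$exports" | grep -o 'fsid=[0-9]*' | sort | uniq -d)
[ -z "$dups" ] && echo "fsids OK" || echo "DUPLICATE: $dups"
```

Against the live file: `grep -o 'fsid=[0-9]*' /etc/exports | sort | uniq -d` (empty output means no duplicates).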


xfs-tests
---------

sudo apt-get -y install automake gcc make git xfsprogs xfslibs-dev \
  uuid-dev uuid-runtime libtool-bin e2fsprogs libuuid1 attr libattr1-dev \
  libacl1-dev libaio-dev libgdbm-dev quota gawk fio dbench python sqlite3

git clone https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
cd xfstests-dev

git log --oneline -1 HEAD
  f3c1bca generic: Test that SEEK_HOLE can find a punched hole

make -j$(nproc); echo $?   # must be 0

sudo useradd fsgqa          # on most systems this also creates group 'fsgqa'
sudo groupadd -f fsgqa      # -f: succeed even if the group already exists
sudo useradd 123456-fsgqa

export TEST_DEV=127.0.0.1:/srv/nfs-test
export TEST_DIR=/mnt/nfs-test

export SCRATCH_DEV=127.0.0.1:/srv/nfs-scratch
export SCRATCH_MNT=/mnt/nfs-scratch

export TEST_FS_MOUNT_OPTS="-o fsc"  # for fscache / test dev
export NFS_MOUNT_OPTIONS="-o fsc"   # for fscache / scratch dev

cd ~/xfstests-dev 
sudo -E ./check -nfs -g quick 2>&1 | tee ~/xfs-tests.nfs.log.$(uname -r)
<...>
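Once check finishes, the pass/fail summary can be pulled out of the saved log. The log content below is a hypothetical sample excerpt (the summary-line format may vary slightly between xfstests versions):

```shell
# Extract summary lines from an xfstests log. Hypothetical sample excerpt;
# point the grep at ~/xfs-tests.nfs.log.$(uname -r) in real use.
log='generic/001 5s
generic/013 [failed, exit status 1]
Failures: generic/013
Failed 1 of 2 tests'
printf '%s\n' "$log" | grep -E '^(Ran:|Not run:|Failures:|Failed )'
```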

---

In another terminal, check that the NFS mounts indeed have the 'fsc'
(fscache) mount option:

$ mount | grep nfs | grep fsc
127.0.0.1:/srv/nfs-test on /mnt/nfs-test type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,fsc,local_lock=none,addr=127.0.0.1)
127.0.0.1:/srv/nfs-scratch on /mnt/nfs-scratch type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,fsc,local_lock=none,addr=127.0.0.1)
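A plain `grep fsc` would also match unrelated strings such as 'fscache' in a path, so an exact word match is safer. The mount line below is a trimmed copy of the output above:

```shell
# Exact-word match for the 'fsc' mount option ('fscache' will not match -w).
# Sample line trimmed from the 'mount' output shown above.
line='127.0.0.1:/srv/nfs-test on /mnt/nfs-test type nfs4 (rw,relatime,vers=4.2,fsc,local_lock=none)'
printf '%s\n' "$line" | grep -qw 'fsc' && echo "fsc enabled" || echo "fsc MISSING"
```

Against the live system: `mount | grep -w fsc` (both nfs4 lines should show up).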

And compare the fscache stats before and after the run:

$ cat /proc/fs/fscache/stats 
FS-Cache statistics
Cookies: idx=0 dat=0 spc=0
Objects: alc=0 nal=0 avl=0 ded=0
ChkAux : non=0 ok=0 upd=0 obs=0
Pages  : mrk=0 unc=0
Acquire: n=0 nul=0 noc=0 ok=0 nbf=0 oom=0
Lookups: n=0 neg=0 pos=0 crt=0 tmo=0
Invals : n=0 run=0
Updates: n=0 nul=0 run=0
Relinqs: n=0 nul=0 wcr=0 rtr=0
AttrChg: n=0 ok=0 nbf=0 oom=0 run=0
Allocs : n=0 ok=0 wt=0 nbf=0 int=0
Allocs : ops=0 owt=0 abt=0
Retrvls: n=0 ok=0 wt=0 nod=0 nbf=0 int=0 oom=0
Retrvls: ops=0 owt=0 abt=0
Stores : n=0 ok=0 agn=0 nbf=0 oom=0
Stores : ops=0 run=0 pgs=0 rxd=0 olm=0
VmScan : nos=0 gon=0 bsy=0 can=0 wt=0
Ops    : pend=0 run=0 enq=0 can=0 rej=0
Ops    : ini=0 dfr=0 rel=0 gc=0
CacheOp: alo=0 luo=0 luc=0 gro=0
CacheOp: inv=0 upo=0 dro=0 pto=0 atc=0 syn=0
CacheOp: rap=0 ras=0 alp=0 als=0 wrp=0 ucp=0 dsp=0

...

$ cat /proc/fs/fscache/stats
FS-Cache statistics
Cookies: idx=412 dat=2441632 spc=0
Objects: alc=8929 nal=0 avl=8741 ded=8928
ChkAux : non=0 ok=86 upd=0 obs=1123
Pages  : mrk=371441 unc=371441
Acquire: n=2442044 nul=0 noc=0 ok=2442044 nbf=0 oom=0
Lookups: n=8929 neg=8817 pos=112 crt=8817 tmo=0
Invals : n=152 run=152
Updates: n=0 nul=0 run=152
Relinqs: n=2442044 nul=0 wcr=0 rtr=0
AttrChg: n=0 ok=0 nbf=0 oom=0 run=0
Allocs : n=0 ok=0 wt=0 nbf=0 int=0
Allocs : ops=0 owt=0 abt=0
Retrvls: n=1498 ok=0 wt=195 nod=1498 nbf=0 int=0 oom=0
Retrvls: ops=1498 owt=575 abt=0
Stores : n=371145 ok=371145 agn=0 nbf=0 oom=0
Stores : ops=1117 run=372234 pgs=371118 rxd=371118 olm=0
VmScan : nos=49 gon=0 bsy=0 can=0 wt=0
Ops    : pend=575 run=2767 enq=372387 can=0 rej=0
Ops    : ini=372795 dfr=37 rel=372795 gc=37
CacheOp: alo=0 luo=0 luc=0 gro=0
CacheOp: inv=0 upo=0 dro=0 pto=0 atc=0 syn=0
CacheOp: rap=0 ras=0 alp=0 als=0 wrp=0 ucp=0 dsp=0
CacheEv: nsp=1123 stl=0 rtr=0 cul=0
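The before/after comparison can be narrowed to just the counters that moved. The two snippets below are trimmed excerpts of the snapshots above; in real use, save the full /proc/fs/fscache/stats to the two files before and after the run:

```shell
# Print only the counter lines that changed between two snapshots of
# /proc/fs/fscache/stats. Trimmed excerpts stand in for the full snapshots.
cat > /tmp/fscache.before <<'EOF'
Lookups: n=0 neg=0 pos=0 crt=0 tmo=0
AttrChg: n=0 ok=0 nbf=0 oom=0 run=0
Stores : n=0 ok=0 agn=0 nbf=0 oom=0
EOF
cat > /tmp/fscache.after <<'EOF'
Lookups: n=8929 neg=8817 pos=112 crt=8817 tmo=0
AttrChg: n=0 ok=0 nbf=0 oom=0 run=0
Stores : n=371145 ok=371145 agn=0 nbf=0 oom=0
EOF
diff /tmp/fscache.before /tmp/fscache.after | grep '^>'
```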

---

Note: on 4.15.0 kernels, some tests apparently run forever: generic/430,
generic/431, and generic/434 (same behavior on nfs+fscache, ext4, and xfs).
They were killed with 'sudo kill -TERM $(pidof xfs_io)'.
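Instead of killing xfs_io by hand, the runaway tests can be put under a time limit with coreutils timeout, which kills the command and exits with status 124 when the limit expires. Demonstrated here on a sleep, since reproducing the hang needs a 4.15 kernel; the 30-minute limit in the comment is an arbitrary choice:

```shell
# 'timeout' kills the command after the given duration and exits 124.
rc=0
timeout 1 sleep 5 || rc=$?
echo "exit=$rc"    # exit=124: the time limit fired
# Real use (limit value is arbitrary):
#   sudo -E timeout 1800 ./check generic/430 generic/431 generic/434
```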

# ref: https://wiki.linux-nfs.org/wiki/index.php/Xfstests

** Tags removed: verification-needed-cosmic
** Tags added: verification-done-cosmic

https://bugs.launchpad.net/bugs/1821395

Title:
  fscache: jobs might hang when fscache disk is full

Status in linux package in Ubuntu:
  Invalid
Status in linux source package in Bionic:
  Fix Committed
Status in linux source package in Cosmic:
  Fix Committed

Bug description:
  [Impact]

   * fscache issue where jobs get hung when fscache disk is full.

   * trivial upstream fix; already applied in X/D, required in B/C:
     commit c5a94f434c82 ("fscache: fix race between enablement and
     dropping of object").

  [Test Case]

   * Test kernel verified / regression-tested by reporter.

   * Apparently there's no simple test case,
     but these are the conditions to hit the problem:

     1) The active dataset size is equal to the cache disk size.
        The application reads the data over and over again.
     2) Disk is near full (90%+)
     3) cachefilesd in userspace is trying to cull the old objects
        while new objects are being looked up.
     4) new cachefiles are created and some fail with no disk space.
     5) race in dropping object state machine and
        deferred lookup state machine causes the hang.
     6) HUNG in fscache_wait_for_deferred_lookup for
        clear bit FSCACHE_COOKIE_LOOKING_UP cookie->flags.

  [Regression Potential]

   * Low; contained in fscache; no further fixes applied upstream.

   * This patch is applied in a stable tree (linux-4.4.y).

  [Original Description]

  A user reported an fscache issue where jobs get hung when the fscache
  disk is full.

  After investigation, it was found to be an issue already reported and
  fixed upstream by commit c5a94f434c82 ("fscache: fix race between
  enablement and dropping of object").

  This patch is required in Bionic and Cosmic, and it's applied in
  Xenial (via stable) and Disco.

  Apparently there's no simple test case, but these are the conditions
  to hit the problem:

  1) The active dataset size is equal to the cache disk size.
     The application reads the data over and over again.
  2) Disk is near full (90%+)
  3) cachefilesd in userspace is trying to cull the old objects
     while new objects are being looked up.
  4) new cachefiles are created and some fail with no disk space.
  5) race in dropping object state machine and
     deferred lookup state machine causes the hang.
  6) HUNG in fscache_wait_for_deferred_lookup for
     clear bit FSCACHE_COOKIE_LOOKING_UP cookie->flags.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1821395/+subscriptions
