Hello, are you using InfiniBand? If so, what are the peer credit settings?
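For reference, credit exhaustion shows up as a negative minimum credit counter in the peers output. A minimal sketch of scanning a saved dump for that condition (the column positions and sample values below are illustrative, taken from a typical o2ib client, and may differ between Lustre versions):

```shell
# Hedged sketch: flag peers whose minimum tx credit count went negative,
# which suggests the send queue outran the configured peer_credits.
# The dump below is fabricated sample data; on a real node you would save
# /proc/sys/lnet/peers (or `lctl get_param peers` on newer releases) instead.
cat > /tmp/lnet.peers <<'EOF'
nid                      refs state  last   max   rtr   min    tx   min queue
10.13.136.30@o2ib           1    NA    -1     8     8     8     8   -56     0
10.13.136.23@o2ib           1    NA    -1     8     8     8     8     3     0
EOF
# In this layout, field 9 is the lowest tx credit value ever seen for the
# peer; adjust the field number if your output format differs.
awk 'NR > 1 && $9 < 0 { print $1, "min tx credits:", $9 }' /tmp/lnet.peers
```

A persistently negative minimum is usually addressed by raising the ko2iblnd `peer_credits` (and related `concurrent_sends`) module options, which must match on both ends of the connection.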
cat /proc/sys/lnet/nis
cat /proc/sys/lnet/peers

On 12/3/17 8:38 AM, E.S. Rosenberg wrote:
> Did you find the problem? Were there any useful suggestions off-list?
>
> On Wed, Nov 29, 2017 at 1:34 PM, Charles A Taylor <chas...@ufl.edu> wrote:
>
> > We have a genomics pipeline app (supernova) that fails consistently due
> > to the client being evicted on the OSSs with a "lock callback timer
> > expired". I doubled "ldlm_enqueue_min" across the cluster, but then the
> > timer simply expired after 200s rather than 100s, so I don't think that
> > is the answer. The syslog/dmesg on the client shows no signs of
> > distress, and it is a "bigmem" machine with 1 TB of RAM.
> >
> > The eviction appears to come while the application is processing a
> > large number (~300) of data "chunks" (i.e. files) which occur in pairs:
> >
> > -rw-r--r-- 1 chasman ufhpc  24 Nov 28 23:31 ./Tdtest915/ASSEMBLER_CS/_ASSEMBLER/_ASM_SN/SHARD_ASM/fork0/join/files/chunk233.sedge_bcs
> > -rw-r--r-- 1 chasman ufhpc 34M Nov 28 23:31 ./Tdtest915/ASSEMBLER_CS/_ASSEMBLER/_ASM_SN/SHARD_ASM/fork0/join/files/chunk233.sedge_asm
> >
> > I assume the 24-byte file is metadata (an index or some such) and the
> > 34M file is the actual data, but I'm just guessing since I'm completely
> > unfamiliar with the application.
> >
> > The write error is
> >
> > #define ENOTCONN 107 /* Transport endpoint is not connected */
> >
> > which occurs after the OSS eviction. This was reproducible under
> > 2.5.3.90 as well. We hoped that upgrading to 2.10.1 would resolve the
> > issue, but it has not.
> >
> > This is the first application (in 10 years) we have encountered that
> > consistently and reliably fails when run over Lustre. I'm not sure at
> > this point whether this is a bug or a tuning issue. If others have
> > encountered and overcome something like this, we'd be grateful to hear
> > from you.
> > Regards,
> >
> > Charles Taylor
> > UF Research Computing
> >
> > OSS:
> > --------------
> > Nov 28 23:41:41 ufrcoss28 kernel: LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 201s: evicting client at 10.13.136.74@o2ib ns: filter-testfs-OST002e_UUID lock: ffff880041717400/0x9bd23c8dc69323a1 lrc: 3/0,0 mode: PW/PW res: [0x7ef2:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 4096->1802239) flags: 0x60000400010020 nid: 10.13.136.74@o2ib remote: 0xe54f26957f2ac591 expref: 45 pid: 6836 timeout: 6488120506 lvb_type: 0
> >
> > Client:
> > --------------
> > Nov 28 23:41:42 s5a-s23 kernel: LustreError: 11-0: testfs-OST002e-osc-ffff88c053fe3800: operation ost_write to node 10.13.136.30@o2ib failed: rc = -107
> > Nov 28 23:41:42 s5a-s23 kernel: Lustre: testfs-OST002e-osc-ffff88c053fe3800: Connection to testfs-OST002e (at 10.13.136.30@o2ib) was lost; in progress operations using this service will wait for recovery to complete
> > Nov 28 23:41:42 s5a-s23 kernel: LustreError: 167-0: testfs-OST002e-osc-ffff88c053fe3800: This client was evicted by testfs-OST002e; in progress operations using this service will fail.
> > Nov 28 23:41:42 s5a-s23 kernel: LustreError: 11-0: testfs-OST002c-osc-ffff88c053fe3800: operation ost_punch to node 10.13.136.30@o2ib failed: rc = -107
> > Nov 28 23:41:42 s5a-s23 kernel: Lustre: testfs-OST002c-osc-ffff88c053fe3800: Connection to testfs-OST002c (at 10.13.136.30@o2ib) was lost; in progress operations using this service will wait for recovery to complete
> > Nov 28 23:41:42 s5a-s23 kernel: LustreError: 167-0: testfs-OST002c-osc-ffff88c053fe3800: This client was evicted by testfs-OST002c; in progress operations using this service will fail.
> > Nov 28 23:41:47 s5a-s23 kernel: LustreError: 11-0: testfs-OST0000-osc-ffff88c053fe3800: operation ost_statfs to node 10.13.136.23@o2ib failed: rc = -107
> > Nov 28 23:41:47 s5a-s23 kernel: Lustre: testfs-OST0000-osc-ffff88c053fe3800: Connection to testfs-OST0000 (at 10.13.136.23@o2ib) was lost; in progress operations using this service will wait for recovery to complete
> > Nov 28 23:41:47 s5a-s23 kernel: LustreError: 167-0: testfs-OST0004-osc-ffff88c053fe3800: This client was evicted by testfs-OST0004; in progress operations using this service will fail.
> > Nov 28 23:43:11 s5a-s23 kernel: Lustre: testfs-OST0006-osc-ffff88c053fe3800: Connection restored to 10.13.136.24@o2ib (at 10.13.136.24@o2ib)
> > Nov 28 23:43:38 s5a-s23 kernel: Lustre: testfs-OST002c-osc-ffff88c053fe3800: Connection restored to 10.13.136.30@o2ib (at 10.13.136.30@o2ib)
> > Nov 28 23:43:45 s5a-s23 kernel: Lustre: testfs-OST0000-osc-ffff88c053fe3800: Connection restored to 10.13.136.23@o2ib (at 10.13.136.23@o2ib)
> > Nov 28 23:43:48 s5a-s23 kernel: Lustre: testfs-OST0004-osc-ffff88c053fe3800: Connection restored to 10.13.136.23@o2ib (at 10.13.136.23@o2ib)
> > Nov 28 23:43:48 s5a-s23 kernel: Lustre: Skipped 3 previous similar messages
> > Nov 28 23:43:55 s5a-s23 kernel: Lustre: testfs-OST0007-osc-ffff88c053fe3800: Connection restored to 10.13.136.24@o2ib (at 10.13.136.24@o2ib)
> > Nov 28 23:43:55 s5a-s23 kernel: Lustre: Skipped 4 previous similar messages
> >
> > Some Details:
> > -------------------
> > OS:      RHEL 7.4 (Linux ufrcoss28.ufhpc 3.10.0-693.2.2.el7_lustre.x86_64)
> > Lustre:  2.10.1 (lustre-2.10.1-1.el7.x86_64)
> > Client:  2.10.1
> >          1 TB RAM
> >          Mellanox ConnectX-3 IB/VPI HCAs
> >          Linux s5a-s23.ufhpc 2.6.32-696.13.2.el6.x86_64
> >          MOFED 3.2.2 IB stack
> >          Lustre 2.10.1
> > Servers: 10 HA OSS pairs (20 OSSs)
> >          128 GB RAM
> >          6 OSTs (8+2 RAID-6) per OSS
> >          Mellanox ConnectX-3 IB/VPI HCAs
> >          RedHat EL7 Native IB stack (i.e.
not MOFED)
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org