While I'm not the right person to interpret the stack trace, I can decipher 
JIRA, and the state of LU-12462 is that the fix has already landed for the 
upcoming 2.12.4 release. So, if you have a good reproducer, you could always 
test a single client on the tip of b2_12 (either building from git or else 
grabbing the latest build from https://build.whamcloud.com/job/lustre-b2_12/). 
What's there now is close to the finished article, and this will let you know 
whether moving to 2.12.4 when it comes out will resolve this issue for you. 
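For what it's worth, building from git is roughly the following; this is a 
hedged sketch only (the repository URL is the standard Whamcloud one, but the 
configure options are illustrative and may need adjusting for your kernel and 
distribution):

```shell
# Sketch: build a 2.12.4-candidate Lustre client from the tip of b2_12.
# Assumes kernel-devel and the usual Lustre build dependencies are installed.
git clone git://git.whamcloud.com/fs/lustre-release.git
cd lustre-release
git checkout b2_12
sh ./autogen.sh
./configure --disable-server   # client-only build
make rpms                      # or plain 'make' for an in-tree build
```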

On 2020-01-10, 7:42 AM, "lustre-discuss on behalf of Christopher Mountford" 
<[email protected] on behalf of [email protected]> 
wrote:

    Hi,
    
    We just switched to a new 2.12.3 Lustre storage system on our local HPC 
cluster and have seen a number of client node crashes - all leaving a similar 
syslog entry:
    
    Jan 10 13:21:08 spectre15 kernel: LustreError: 
24567:0:(osc_cache.c:3062:osc_cache_writeback_range()) extent 
ffff9238ac5133f0@{[0 -> 255/255], 
[1|0|-|cache|wiY|ffff9238a7370f00],[1703936|89|+|-|ffff9238733e7180|256|        
  (null)]}
    Jan 10 13:21:08 spectre15 kernel: LustreError: 
24567:0:(osc_cache.c:3062:osc_cache_writeback_range()) ### extent: 
ffff9238ac5133f0 ns: alice3-OST0019-osc-ffff9248e7337800 lock: 
ffff9238733e7180/0x3c4db4a67c3d39eb lrc: 2/0,0 mode: PW/PW res: 
[0x740000400:0x132478:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 
65536->262143) flags: 0x20000000000 nid: local remote: 0xda6676eba5fdbd0c 
expref: -99 pid: 24499 timeout: 0 lvb_type: 1
    Jan 10 13:21:08 spectre15 kernel: LustreError: 
24567:0:(osc_cache.c:1241:osc_extent_tree_dump0()) Dump object ffff9238a7370f00 
extents at osc_cache_writeback_range:3062, mppr: 256.
    Jan 10 13:21:08 spectre15 kernel: LustreError: 
24567:0:(osc_cache.c:1246:osc_extent_tree_dump0()) extent ffff9238ac5133f0@{[0 
-> 255/255], [1|0|-|cache|wiY|ffff9238a7370f00], 
[1703936|89|+|-|ffff9238733e7180|256|          (null)]} in tree 1.
    Jan 10 13:21:08 spectre15 kernel: LustreError: 
24567:0:(osc_cache.c:1246:osc_extent_tree_dump0()) ### extent: ffff9238ac5133f0 
ns: alice3-OST0019-osc-ffff9248e7337800 lock: 
ffff9238733e7180/0x3c4db4a67c3d39eb lrc: 2/0,0 mode: PW/PW res: 
[0x740000400:0x132478:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 
65536->262143) flags: 0x20000000000 nid: local remote: 0xda6676eba5fdbd0c 
expref: -99 pid: 24499 timeout: 0 lvb_type: 1
    Jan 10 13:21:08 spectre15 kernel: LustreError: 
24567:0:(osc_cache.c:3062:osc_cache_writeback_range()) ASSERTION( ext->oe_start 
>= start && ext->oe_end <= end ) failed:
    Jan 10 13:21:08 spectre15 kernel: LustreError: 
24567:0:(osc_cache.c:3062:osc_cache_writeback_range()) LBUG
    Jan 10 13:21:08 spectre15 kernel: Pid: 24567, comm: rm 
3.10.0-1062.9.1.el7.x86_64 #1 SMP Fri Dec 6 15:49:49 UTC 2019
    Jan 10 13:21:08 spectre15 kernel: Call Trace:
    Jan 10 13:21:08 spectre15 kernel: [<ffffffffc0e167cc>] 
libcfs_call_trace+0x8c/0xc0 [libcfs]
    Jan 10 13:21:08 spectre15 kernel: [<ffffffffc0e1687c>] 
lbug_with_loc+0x4c/0xa0 [libcfs]
    Jan 10 13:21:08 spectre15 kernel: [<ffffffffc13b1a5d>] 
osc_cache_writeback_range+0xacd/0x1260 [osc]
    Jan 10 13:21:08 spectre15 kernel: [<ffffffffc13a07f5>] 
osc_io_fsync_start+0x85/0x1a0 [osc]
    Jan 10 13:21:08 spectre15 kernel: [<ffffffffc105e388>] 
cl_io_start+0x68/0x130 [obdclass]
    Jan 10 13:21:08 spectre15 kernel: [<ffffffffc133e537>] 
lov_io_call.isra.7+0x87/0x140 [lov]
    Jan 10 13:21:08 spectre15 kernel: [<ffffffffc133e6f6>] 
lov_io_start+0x56/0x150 [lov]
    Jan 10 13:21:08 spectre15 kernel: [<ffffffffc105e388>] 
cl_io_start+0x68/0x130 [obdclass]
    Jan 10 13:21:08 spectre15 kernel: [<ffffffffc106055c>] 
cl_io_loop+0xcc/0x1c0 [obdclass]
    Jan 10 13:21:08 spectre15 kernel: [<ffffffffc1473d3b>] 
cl_sync_file_range+0x2db/0x380 [lustre]
    Jan 10 13:21:08 spectre15 kernel: [<ffffffffc148ba90>] 
ll_delete_inode+0x160/0x230 [lustre]
    Jan 10 13:21:08 spectre15 kernel: [<ffffffff88668544>] evict+0xb4/0x180
    Jan 10 13:21:08 spectre15 kernel: [<ffffffff8866896c>] iput+0xfc/0x190
    Jan 10 13:21:08 spectre15 kernel: [<ffffffff8865cbde>] 
do_unlinkat+0x1ae/0x2d0
    Jan 10 13:21:08 spectre15 kernel: [<ffffffff8865dc5b>] 
SyS_unlinkat+0x1b/0x40
    Jan 10 13:21:08 spectre15 kernel: [<ffffffff88b8dede>] 
system_call_fastpath+0x25/0x2a
    Jan 10 13:21:08 spectre15 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
    Jan 10 13:21:08 spectre15 kernel: Kernel panic - not syncing: LBUG
    
    
    We were able to reproduce the error on a test system - it appears to be 
caused by removing multiple files with a single rm -f *. Strangely, repeating 
this but deleting the files one at a time is fine (both results are 
reproducible). Only files with a Data-on-MDT layout cause the crash.
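    For completeness, a sketch of the reproducer as we understand it (the 
mount point, layout parameters, and file counts below are illustrative 
assumptions, not our exact values):

```shell
# Hypothetical reproducer sketch - paths, sizes and counts are assumptions.
# Create files with a Data-on-MDT (DoM) first component, then bulk-unlink them.
mkdir -p /mnt/lustre/domtest
lfs setstripe -E 64K -L mdt -E -1 /mnt/lustre/domtest  # DoM layout on the dir
for i in $(seq 1 32); do
    dd if=/dev/zero of=/mnt/lustre/domtest/file$i bs=4k count=4 2>/dev/null
done
rm -f /mnt/lustre/domtest/*  # bulk unlink - this triggered the LBUG on 2.12.3
```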
    
    We have been using the 2.12.3 client (with 2.10.7 servers) since December 
without issue. The problem has only been occurring since we moved to a new 
Lustre 2.12.3 filesystem which has Data-on-MDT enabled. We have confirmed that 
deleting files which do not have a Data-on-MDT layout does not cause the above 
problem.
    
    This looks to me like LU-12462 
(https://jira.whamcloud.com/browse/LU-12462); however, that issue only seems 
to be known to affect 2.13.0 (and 2.12.4), not 2.12.3. I'm not familiar with 
JIRA though, so I could be reading this wrong!
    
    Any suggestions on how best to report/resolve this?
    
    We have repeated the tests using a 2.13.0 test client and do not see any 
crashes on that client (LU-12462 is marked as fixed in 2.13).
    
    Regards,
    Christopher.
    
    
    -- 
    # Dr. Christopher Mountford
    # System specialist - Research Computing/HPC
    # 
    # IT services,
    #     University of Leicester, University Road, 
    #     Leicester, LE1 7RH, UK 
    #
    # t: 0116 252 3471
    # e: [email protected]
    
    _______________________________________________
    lustre-discuss mailing list
    [email protected]
    http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
    
