Thanks, will take a look.

Are there any other areas I should be looking at? Should I be applying any Lustre tuning?

Thanks

________________________________
From: Oral, H. <[email protected]>
Sent: Monday, October 28, 2019 7:06:41 PM
To: Louis Allen <[email protected]>; Carlson, Timothy S 
<[email protected]>; [email protected] 
<[email protected]>
Subject: Re: [EXTERNAL] Re: [lustre-discuss] Lustre Timeouts/Filesystem Hanging

For inspecting client side I/O, you can use Darshan.
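
A minimal sketch of how a Darshan log could be inspected afterwards, assuming the job was run with Darshan's runtime instrumentation enabled; the pydarshan package usage and the log path are hypothetical stand-ins and the exact API may differ by version:

    # Hypothetical pydarshan sketch: summarize POSIX activity from a Darshan log.
    import darshan

    LOG = "/tmp/app_id12345.darshan"   # hypothetical per-job log file

    report = darshan.DarshanReport(LOG, read_all=True)

    # Which instrumentation modules captured data (POSIX, MPI-IO, STDIO, ...).
    print(list(report.modules.keys()))

    # Per-record POSIX counters; a large POSIX_OPENS count with tiny
    # POSIX_BYTES_READ would point at an open/read-small/close pattern.
    posix = report.records["POSIX"].to_df()
    print(posix["counters"][["POSIX_OPENS", "POSIX_READS", "POSIX_BYTES_READ"]].sum())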

Thanks,

Sarp

--
Sarp Oral, PhD

National Center for Computational Sciences
Oak Ridge National Laboratory
[email protected]
865-574-2173


On 10/28/19, 1:58 PM, "lustre-discuss on behalf of Louis Allen" 
<[email protected] on behalf of [email protected]> 
wrote:


    Thanks for the reply, Tim.


    Are there any tools I can use to see if that is the cause?


    Could any tuning possibly help the situation?


    Thanks





    ________________________________________
    From: Carlson, Timothy S <[email protected]>
    Sent: Monday, 28 October 2019, 17:24
    To: Louis Allen; [email protected]
    Subject: RE: Lustre Timeouts/Filesystem Hanging


    In my experience, this is almost always related to some code doing really bad I/O. Let's say you have a 1,000-rank MPI code doing open/read-4k/close on a few specific files on that OST. That will make for a bad day.
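
    Purely as an illustration of that pattern (hypothetical, not taken from the poster's workload), a loop like the following mpi4py sketch hits one file with thousands of tiny, uncached reads per rank:

        # Hypothetical anti-pattern: every MPI rank repeatedly opens, reads
        # 4 KiB from, and closes the same file on the Lustre mount.
        from mpi4py import MPI

        PATH = "/lustre/fs/shared_input.dat"   # assumed path, for illustration only
        rank = MPI.COMM_WORLD.Get_rank()

        for _ in range(100_000):
            with open(PATH, "rb") as f:            # fresh open every iteration
                f.seek((rank * 4096) % (1 << 30))  # scattered 4 KiB offsets
                f.read(4096)                       # tiny read, then immediate close
            # Closing each time defeats client-side caching and readahead,
            # so every iteration turns into fresh server-side work on that OST.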

    The other place you can see this, and this isn't your case, is when ZFS refuses to give up on a disk that is failing, and your overall I/O suffers from ZFS continuing to try to read from a disk that it should just kick out.

    Tim


    From: lustre-discuss <[email protected]>
    On Behalf Of Louis Allen
    Sent: Monday, October 28, 2019 10:16 AM
    To: [email protected]
    Subject: [lustre-discuss] Lustre Timeouts/Filesystem Hanging



    Hello,



    Lustre (2.12) seems to be hanging quite frequently (5+ times a day) for us, and one of the OSS servers (out of 4) is reporting an extremely high load average (150+), but the CPU usage of that server is actually very low - so it must be related to something else - possibly CPU_IO_WAIT.
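
    One quick way to confirm the iowait suspicion on that OSS would be something like this psutil sketch (hypothetical, not part of the current setup):

        # Hypothetical check: is the 150+ load average iowait rather than CPU work?
        import psutil

        sample = psutil.cpu_times_percent(interval=5)   # 5-second sample (Linux exposes iowait)
        print(f"user={sample.user}%  system={sample.system}%  "
              f"iowait={sample.iowait}%  idle={sample.idle}%")

        # 1/5/15-minute load averages for comparison with the figure above.
        print(psutil.getloadavg())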



    On the OSS server where we are seeing the high load averages, we can also see multiple LustreError messages in /var/log/messages:



    Oct 28 11:22:23 pazlustreoss001 kernel: LNet: Service thread pid 2403 was 
inactive for 200.08s. The thread might be hung, or it might only be slow and 
will resume later. Dumping the stack trace
     for debugging purposes:
    Oct 28 11:22:23 pazlustreoss001 kernel: LNet: Skipped 4 previous similar 
messages

    Oct 28 11:22:23 pazlustreoss001 kernel: Pid: 2403, comm: ll_ost00_068 
3.10.0-957.10.1.el7_lustre.x86_64 #1 SMP Sun May 26 21:48:35 UTC 2019

    Oct 28 11:22:23 pazlustreoss001 kernel: Call Trace:

    Oct 28 11:22:23 pazlustreoss001 kernel: [<ffffffffc03747c5>] 
jbd2_log_wait_commit+0xc5/0x140 [jbd2]

    Oct 28 11:22:23 pazlustreoss001 kernel: [<ffffffffc0375e52>] 
jbd2_complete_transaction+0x52/0xa0 [jbd2]

    Oct 28 11:22:23 pazlustreoss001 kernel: [<ffffffffc0732da2>] 
ldiskfs_sync_file+0x2e2/0x320 [ldiskfs]

    Oct 28 11:22:23 pazlustreoss001 kernel: [<ffffffffa52760b0>] 
vfs_fsync_range+0x20/0x30

    Oct 28 11:22:23 pazlustreoss001 kernel: [<ffffffffc0c8b651>] 
osd_object_sync+0xb1/0x160 [osd_ldiskfs]

    Oct 28 11:22:23 pazlustreoss001 kernel: [<ffffffffc0ab48a7>] 
tgt_sync+0xb7/0x270 [ptlrpc]

    Oct 28 11:22:23 pazlustreoss001 kernel: [<ffffffffc0dc3731>] 
ofd_sync_hdl+0x111/0x530 [ofd]

    Oct 28 11:22:23 pazlustreoss001 kernel: [<ffffffffc0aba1da>] 
tgt_request_handle+0xaea/0x1580 [ptlrpc]

    Oct 28 11:22:23 pazlustreoss001 kernel: [<ffffffffc0a5f80b>] 
ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]

    Oct 28 11:22:23 pazlustreoss001 kernel: [<ffffffffc0a6313c>] 
ptlrpc_main+0xafc/0x1fc0 [ptlrpc]

    Oct 28 11:22:23 pazlustreoss001 kernel: [<ffffffffa50c1c71>] 
kthread+0xd1/0xe0

    Oct 28 11:22:23 pazlustreoss001 kernel: [<ffffffffa5775c37>] 
ret_from_fork_nospec_end+0x0/0x39

    Oct 28 11:22:23 pazlustreoss001 kernel: [<ffffffffffffffff>] 
0xffffffffffffffff

    Oct 28 11:22:23 pazlustreoss001 kernel: LustreError: dumping log to 
/tmp/lustre-log.1572261743.2403

    Oct 28 11:22:23 pazlustreoss001 kernel: Pid: 2292, comm: ll_ost03_043 
3.10.0-957.10.1.el7_lustre.x86_64 #1 SMP Sun May 26 21:48:35 UTC 2019

    Oct 28 11:22:23 pazlustreoss001 kernel: Call Trace:

    Oct 28 11:22:23 pazlustreoss001 kernel: [<ffffffffc03747c5>] 
jbd2_log_wait_commit+0xc5/0x140 [jbd2]

    Oct 28 11:22:23 pazlustreoss001 kernel: [<ffffffffc0375e52>] 
jbd2_complete_transaction+0x52/0xa0 [jbd2]

    Oct 28 11:22:23 pazlustreoss001 kernel: [<ffffffffc0732da2>] 
ldiskfs_sync_file+0x2e2/0x320 [ldiskfs]

    Oct 28 11:22:23 pazlustreoss001 kernel: [<ffffffffa52760b0>] 
vfs_fsync_range+0x20/0x30

    Oct 28 11:22:23 pazlustreoss001 kernel: [<ffffffffc0c8b651>] 
osd_object_sync+0xb1/0x160 [osd_ldiskfs]

    Oct 28 11:22:23 pazlustreoss001 kernel: [<ffffffffc0ab48a7>] 
tgt_sync+0xb7/0x270 [ptlrpc]

    Oct 28 11:22:23 pazlustreoss001 kernel: [<ffffffffc0dc3731>] 
ofd_sync_hdl+0x111/0x530 [ofd]

    Oct 28 11:22:23 pazlustreoss001 kernel: [<ffffffffc0aba1da>] 
tgt_request_handle+0xaea/0x1580 [ptlrpc]

    Oct 28 11:22:23 pazlustreoss001 kernel: [<ffffffffc0a5f80b>] 
ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]

    Oct 28 11:22:23 pazlustreoss001 kernel: LNet: Service thread pid 2403 
completed after 200.29s. This indicates the system was overloaded (too many 
service threads, or there were not enough hardware
     resources).

    Oct 28 11:22:23 pazlustreoss001 kernel: LNet: Skipped 48 previous similar 
messages

    Oct 28 11:22:23 pazlustreoss001 kernel: [<ffffffffc0a6313c>] 
ptlrpc_main+0xafc/0x1fc0 [ptlrpc]

    Oct 28 11:22:23 pazlustreoss001 kernel: [<ffffffffa50c1c71>] 
kthread+0xd1/0xe0

    Oct 28 11:22:23 pazlustreoss001 kernel: [<ffffffffa5775c37>] 
ret_from_fork_nospec_end+0x0/0x39

    Oct 28 11:22:23 pazlustreoss001 kernel: [<ffffffffffffffff>] 
0xffffffffffffffff

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
