Please help!  Got another panic today, while migrating directories:


[Tue Jul 23 16:04:18 2019] LustreError: 52142:0:(service.c:189:ptlrpc_save_lock()) ASSERTION( rs->rs_nlocks < 8 ) failed:
[Tue Jul 23 16:04:18 2019] LustreError: 52142:0:(service.c:189:ptlrpc_save_lock()) LBUG
[Tue Jul 23 16:04:18 2019] Pid: 52142, comm: mdt00_002 3.10.0-957.10.1.el7_lustre.x86_64 #1 SMP Tue Apr 30 22:18:15 UTC 2019

Message from syslogd@cmal14lb27 at Jul 23 16:04:15 ...
 kernel:[101545.117309] LustreError: 52142:0:(service.c:189:ptlrpc_save_lock()) ASSERTION( rs->rs_nlocks < 8 ) failed:

Message from syslogd@cmal14lb27 at Jul 23 16:04:15 ...
 kernel:[101545.119130] LustreError: 52142:0:(service.c:189:ptlrpc_save_lock()) LBUG

Message from syslogd@cmal14lb27 at Jul 23 16:04:15 ...
 kernel:LustreError: 52142:0:(service.c:189:ptlrpc_save_lock()) ASSERTION( rs->rs_nlocks < 8 ) failed:

Message from syslogd@cmal14lb27 at Jul 23 16:04:15 ...
 kernel:LustreError: 52142:0:(service.c:189:ptlrpc_save_lock()) LBUG
[Tue Jul 23 16:04:18 2019] Call Trace:
[Tue Jul 23 16:04:18 2019] [<ffffffffc0f717cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[Tue Jul 23 16:04:18 2019] [<ffffffffc0f7187c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[Tue Jul 23 16:04:18 2019] [<ffffffffc12f9c41>] ptlrpc_save_lock+0xc1/0xd0 [ptlrpc]
[Tue Jul 23 16:04:18 2019] [<ffffffffc1892b0b>] mdt_save_lock+0x20b/0x360 [mdt]
[Tue Jul 23 16:04:18 2019] [<ffffffffc1892cbc>] mdt_object_unlock+0x5c/0x3c0 [mdt]
[Tue Jul 23 16:04:18 2019] [<ffffffffc18aba52>] mdt_reint_striped_unlock+0x1a2/0x2f0 [mdt]
[Tue Jul 23 16:04:18 2019] [<ffffffffc18abbc8>] mdt_migrate_object_unlock+0x28/0x60 [mdt]
[Tue Jul 23 16:04:18 2019] [<ffffffffc18b0544>] mdt_reint_migrate+0x934/0x1310 [mdt]
[Tue Jul 23 16:04:18 2019] [<ffffffffc18b0fa3>] mdt_reint_rec+0x83/0x210 [mdt]
[Tue Jul 23 16:04:18 2019] [<ffffffffc188f1b3>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[Tue Jul 23 16:04:18 2019] [<ffffffffc189a497>] mdt_reint+0x67/0x140 [mdt]
[Tue Jul 23 16:04:18 2019] [<ffffffffc1359e5a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[Tue Jul 23 16:04:18 2019] [<ffffffffc12ff80b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[Tue Jul 23 16:04:18 2019] [<ffffffffc130313c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[Tue Jul 23 16:04:18 2019] [<ffffffff9fec1c71>] kthread+0xd1/0xe0
[Tue Jul 23 16:04:18 2019] [<ffffffffa0575c37>] ret_from_fork_nospec_end+0x0/0x39
[Tue Jul 23 16:04:18 2019] [<ffffffffffffffff>] 0xffffffffffffffff
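For anyone triaging a trace like this one, a minimal Python sketch for pulling out the failed assertion and the chain of functions may help. The regular expressions, the `summarize` helper and the shortened `sample_log` are illustrative assumptions based on the format of these particular messages; none of this is part of any Lustre tooling.

```python
import re

# One call-trace frame: "[<address>] symbol+0xoff/0xsize [module]" (module is optional).
FRAME = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:[ \t]+\[(\w+)\])?"
)
# LustreError assertion: "(file:line:func()) ASSERTION( expr ) failed"
ASSERT = re.compile(
    r"\((\S+?):(\d+):(\w+)\(\)\)\s+ASSERTION\(\s*(.*?)\s*\)\s+failed"
)


def summarize(log):
    """Return (assertion, frames) extracted from a kernel log excerpt.

    assertion is a dict (or None if no assertion line is found);
    frames is a list of (symbol, module) tuples, innermost frame first.
    """
    found = ASSERT.search(log)
    assertion = None
    if found:
        assertion = {
            "file": found.group(1),
            "line": int(found.group(2)),
            "func": found.group(3),
            "expr": found.group(4),
        }
    frames = [(m.group(1), m.group(2)) for m in FRAME.finditer(log)]
    return assertion, frames


# Shortened excerpt of the trace in this message, used only as sample input.
sample_log = """\
[Tue Jul 23 16:04:18 2019] LustreError: 52142:0:(service.c:189:ptlrpc_save_lock()) ASSERTION( rs->rs_nlocks < 8 ) failed:
[Tue Jul 23 16:04:18 2019] [<ffffffffc12f9c41>] ptlrpc_save_lock+0xc1/0xd0 [ptlrpc]
[Tue Jul 23 16:04:18 2019] [<ffffffffc1892b0b>] mdt_save_lock+0x20b/0x360 [mdt]
[Tue Jul 23 16:04:18 2019] [<ffffffffc18b0544>] mdt_reint_migrate+0x934/0x1310 [mdt]
[Tue Jul 23 16:04:18 2019] [<ffffffff9fec1c71>] kthread+0xd1/0xe0
"""

if __name__ == "__main__":
    assertion, frames = summarize(sample_log)
    print(assertion)
    for symbol, module in frames:
        print(symbol, module or "(kernel)")
```

Read this way, the trace shows the assertion firing in ptlrpc_save_lock while mdt_reint_migrate is releasing its locks.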

    Regards,

        Jonny


------------------------------------------------------------------------
globo.com       
*João Carlos Mendes Luís*
*Senior DevOps Engineer*
[email protected]
+55-21-2483-6893
+55-21-99218-1222


On 22/07/2019 23:35, Joao Carlos Mendes Luis wrote:
On 7/22/19 11:10 PM, Andreas Dilger wrote:
If you are trying to delete MDT0000 then that is definitely not implemented yet...


No, no, no...


This was my first idea, but then I understood that the root directory is always on MDT0, so I had to migrate it to another server (after having created two more, and crashed during migration).

I will later try to migrate to another server, and then delete MDT2.  But first I need to finish this lfsck...   :-(


These "NOT IMPLEMETED (sic)" messages are just from running lfsck_start -A



Cheers, Andreas

On Jul 22, 2019, at 16:08, João Carlos Mendes Luís <[email protected]> wrote:

Hi,

    I'm running some lab tests with Lustre 2.12.2 on Oracle Linux Server release 7.6.  The last test I did was about migration and MDT splitting.  I started with one MGS+MDS node and two OSS nodes, and one of the tests was to create two more MDSs and migrate data between them until, after some time, I could delete the original MDS. But something happened in the middle and the servers panicked/rebooted.

    I am now facing what appears to be an lfsck bug.  After many other tests, I ran lfsck_start, and after some time got this message on the nodes:

MGS/MDS0:

[Mon Jul 22 17:42:25 2019] LustreError: 24107:0:(osd_index.c:1872:osd_index_it_get()) NOT IMPLEMETED YET (move to 0x2481000002000000)

OSS1/MDS1

[Mon Jul 22 17:40:29 2019] LustreError: 31558:0:(osd_index.c:1872:osd_index_it_get()) NOT IMPLEMETED YET (move to 0xa41300c002000000)

OST2/MDS2

[Mon Jul 22 17:40:32 2019] LustreError: 8935:0:(osd_index.c:1872:osd_index_it_get()) NOT IMPLEMETED YET (move to 0xa013000003000000)


    And for the current lfsck status, I ran:

lctl get_param *.*.lfsck* | grep -E 'status|\.lfsck_lay|\.lfsck_name'

MGS/MDS0:

mdd.mirror01-MDT0000.lfsck_layout=
status: completed
mdd.mirror01-MDT0000.lfsck_namespace=
status: partial

OSS1/MDS1

mdd.mirror01-MDT0001.lfsck_layout=
status: completed
mdd.mirror01-MDT0001.lfsck_namespace=
status: partial
obdfilter.mirror01-OST0065.lfsck_layout=
status: completed

OST2/MDS2

mdd.mirror01-MDT0002.lfsck_layout=
status: completed
mdd.mirror01-MDT0002.lfsck_namespace=
status: partial
obdfilter.mirror01-OST0066.lfsck_layout=
status: completed
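
With several targets per node, the filtered output gets long quickly. As a rough sketch, assuming the output keeps the "parameter=" line followed by a "status:" line as shown here, a few lines of Python can list only the lfsck components that did not finish. The `incomplete_lfsck` helper and the `sample` text are hypothetical names for illustration; the input would be whatever the lctl/grep pipeline printed on each node.

```python
import re

PARAM = re.compile(r"^(\S+\.lfsck_\w+)=$")   # e.g. mdd.mirror01-MDT0000.lfsck_namespace=
STATUS = re.compile(r"^status:\s*(\S+)$")    # e.g. status: partial


def incomplete_lfsck(text):
    """Return {parameter: status} for every lfsck component not 'completed'."""
    results = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        param = PARAM.match(line)
        if param:
            current = param.group(1)
            continue
        status = STATUS.match(line)
        if status and current is not None:
            if status.group(1) != "completed":
                results[current] = status.group(1)
            current = None
    return results


# Sample input copied from the MDT0001 node output in this message.
sample = """\
mdd.mirror01-MDT0001.lfsck_layout=
status: completed
mdd.mirror01-MDT0001.lfsck_namespace=
status: partial
obdfilter.mirror01-OST0065.lfsck_layout=
status: completed
"""

if __name__ == "__main__":
    for name, status in incomplete_lfsck(sample).items():
        print(f"{name}: {status}")
```

On all three nodes here, only the lfsck_namespace component of each MDT would be reported, each with status "partial".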

    Is this a known bug?  How do I fix these "partial" lfsck runs?

    Thanks for any help,


        Jonny




_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


    Regards,

        Jonny

--
João Carlos Mendes Luís
Globo.COM - +55-21-2483-6893
