[ceph-users] Re: MDS crash

Eugen Block Sun, 21 Apr 2024 03:19:12 -0700

What’s the output of:

ceph tell mds.0 damage ls


Quoting alexey.gerasi...@opencascade.com:

Dear colleagues, hope that anybody can help us.
The starting point: a Ceph cluster v15.2 (installed and managed by Proxmox) with 3 nodes on physical servers rented from a cloud provider. CephFS is also deployed.
Yesterday we discovered that some of our applications had stopped working. During the investigation we found that the problem was with Ceph, more precisely with CephFS: the MDS daemons had suddenly crashed. We tried to restart them and found that they crashed again immediately after starting. The crash information:
2024-04-17T17:47:42.841+0000 7f959ced9700  1 mds.0.29134 recovery_done -- successful recovery!
2024-04-17T17:47:42.853+0000 7f959ced9700  1 mds.0.29134 active_start
2024-04-17T17:47:42.881+0000 7f959ced9700  1 mds.0.29134 cluster recovered.
2024-04-17T17:47:43.825+0000 7f959aed5700 -1 ./src/mds/OpenFileTable.cc: In function 'void OpenFileTable::commit(MDSContext*, uint64_t, int)' thread 7f959aed5700 time 2024-04-17T17:47:43.831243+0000
./src/mds/OpenFileTable.cc: 549: FAILED ceph_assert(count > 0)
Over the next hours we read tons of articles, studied the documentation, and checked the overall state of the Ceph cluster with various diagnostic commands, but didn't find anything wrong. In the evening we decided to upgrade to v16, and finally to v17.2.7. Unfortunately, that didn't solve the problem; the MDS continues to crash with the same error. The only difference we found is "1 MDSs report damaged metadata" in the output of ceph -s (see below).
I supposed that it might be a well-known bug, but couldn't find a matching one on https://tracker.ceph.com - there are several bugs associated with the file OpenFileTable.cc, but none related to ceph_assert(count > 0).
We also tried to check the source code of OpenFileTable.cc; here is a fragment of it, in function OpenFileTable::_journal_finish:
      int omap_idx = anchor.omap_idx;
      unsigned& count = omap_num_items.at(omap_idx);
      ceph_assert(count > 0);
So we guess that the object map is empty for some object in Ceph, which is unexpected behavior. But again, we found nothing wrong in our cluster…
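To make the guess above concrete, here is a minimal Python sketch of the invariant that the quoted fragment checks. It is a toy model, not Ceph code: the class OpenFileTableModel and its methods are made up for illustration, and it assumes that the counter is decremented right after the check, which the fragment does not show.

```python
class OpenFileTableModel:
    """Toy model (hypothetical, not Ceph code) of per-omap-object item counters."""

    def __init__(self, num_omap_objects):
        # One counter per omap object, mirroring omap_num_items in the fragment.
        self.omap_num_items = [0] * num_omap_objects

    def add_anchor(self, omap_idx):
        # Assumption: storing an item in an omap object bumps its counter.
        self.omap_num_items[omap_idx] += 1

    def remove_anchor(self, omap_idx):
        count = self.omap_num_items[omap_idx]
        # Same condition as ceph_assert(count > 0) in the fragment.
        if count <= 0:
            raise AssertionError("omap object %d has no items left" % omap_idx)
        # Assumption: the real code decrements the counter after the check.
        self.omap_num_items[omap_idx] = count - 1


table = OpenFileTableModel(num_omap_objects=1)
table.add_anchor(0)
table.remove_anchor(0)      # fine: counter goes 1 -> 0
try:
    table.remove_anchor(0)  # counter is already 0, so the check fails
    second_removal_failed = False
except AssertionError:
    second_removal_failed = True
```

In this model the check fires when a removal is recorded for an omap object whose in-memory counter is already zero, that is, when the counter and the items it is supposed to count disagree. Whether that is what happened in this cluster is not something the sketch can tell.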
Next, we started with the https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/ article: we tried to reset the journal (even though it was OK all the time) and to wipe the sessions using the cephfs-table-tool all reset session command. No result… Now I have decided to continue following this article and run the cephfs-data-scan scan_extents command; it is running right now. But I doubt that it will solve the issue, because there is no problem with our objects in Ceph.
Is it a new bug, or something else? Any idea is welcome!

The important outputs:

----- ceph -s
  cluster:
    id:     4cd1c477-c8d0-4855-a1f1-cb71d89427ed
    health: HEALTH_ERR
            1 MDSs report damaged metadata
            insufficient standby MDS daemons available
            83 daemons have recently crashed
            3 mgr modules have recently crashed

  services:
    mon: 3 daemons, quorum asrv-dev-stor-2,asrv-dev-stor-3,asrv-dev-stor-1 (age 22h)
    mgr: asrv-dev-stor-2(active, since 22h), standbys: asrv-dev-stor-1
    mds: 1/1 daemons up
    osd: 18 osds: 18 up (since 22h), 18 in (since 29h)

  data:
    volumes: 1/1 healthy
    pools:   5 pools, 289 pgs
    objects: 29.72M objects, 5.6 TiB
    usage:   21 TiB used, 47 TiB / 68 TiB avail
    pgs:     287 active+clean
             2   active+clean+scrubbing+deep

  io:
    client:   2.5 KiB/s rd, 172 KiB/s wr, 261 op/s rd, 195 op/s wr

-----ceph fs dump
e29480
enable_multiple, ever_enabled_multiple: 0,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1

Filesystem 'cephfs' (1)
fs_name cephfs
epoch   29480
flags   12 joinable allow_snaps allow_multimds_snaps
created 2022-11-25T15:56:08.507407+0000
modified        2024-04-18T16:52:29.970504+0000
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   1099511627776
required_client_features        {}
last_failure    0
last_failure_osd_epoch  14728
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in      0
up      {0=156636152}
failed
damaged
stopped
data_pools      [5]
metadata_pool   6
inline_data     disabled
balancer
standby_count_wanted    1
[mds.asrv-dev-stor-1{0:156636152} state up:active seq 6 laggy since 2024-04-18T16:52:29.970479+0000 addr [v2:172.22.2.91:6800/2487054023,v1:172.22.2.91:6801/2487054023] compat {c=[1],r=[1],i=[7ff]}]
-----cephfs-journal-tool --rank=cephfs:0 journal inspect
Overall journal integrity: OK

-----ceph pg dump summary
version 41137
stamp 2024-04-18T21:17:59.133536+0000
last_osdmap_epoch 0
last_pg_scan 0
PG_STAT  OBJECTS   MISSING_ON_PRIMARY  DEGRADED  MISPLACED  UNFOUND  BYTES          OMAP_BYTES*  OMAP_KEYS*  LOG      DISK_LOG
sum      29717605  0                   0         0          0        6112544251872  13374192956  28493480    1806575  1806575
OSD_STAT  USED    AVAIL   USED_RAW  TOTAL
sum       21 TiB  47 TiB    21 TiB  68 TiB

-----ceph pg dump pools
POOLID  OBJECTS   MISSING_ON_PRIMARY  DEGRADED  MISPLACED  UNFOUND  BYTES          OMAP_BYTES*  OMAP_KEYS*  LOG     DISK_LOG
8       31771     0                   0         0          0        131337887503   2482         140         401246  401246
7       839707    0                   0         0          0        3519034650971  736          61          399328  399328
6       1319576   0                   0         0          0        421044421      13374189738  28493279    206749  206749
5       27526539  0                   0         0          0        2461702171417  0            0           792165  792165
2       12        0                   0         0          0        48497560       0            0           6991    6991
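
As a quick sanity check, the per-pool rows of the ceph pg dump pools output can be added up and compared with the sum row of ceph pg dump summary. A minimal Python sketch, with the numbers copied by hand from the two outputs above (the variable names are made up; the LOG and DISK_LOG columns are left out):

```python
# pool id -> (OBJECTS, BYTES, OMAP_BYTES, OMAP_KEYS),
# copied by hand from the "ceph pg dump pools" output.
pools = {
    8: (31771, 131337887503, 2482, 140),
    7: (839707, 3519034650971, 736, 61),
    6: (1319576, 421044421, 13374189738, 28493279),
    5: (27526539, 2461702171417, 0, 0),
    2: (12, 48497560, 0, 0),
}
# The "sum" row of "ceph pg dump summary", same four columns.
summary = (29717605, 6112544251872, 13374192956, 28493480)

# Add up each column over all pools and compare with the summary row.
totals = tuple(sum(column) for column in zip(*pools.values()))
totals_match = totals == summary

# Share of all omap keys held by pool 6 (the metadata_pool in "ceph fs dump").
metadata_pool_key_share = pools[6][3] / summary[3]

print("per-pool totals match summary:", totals_match)
print("omap key share of pool 6: %.6f" % metadata_pool_key_share)
```

The four columns add up to the summary values, and nearly all omap keys sit in pool 6, the CephFS metadata pool, which fits the "dirfrag is stored in omap" flag in the fs dump. This only shows that the statistics are self-consistent; it says nothing about the damaged metadata itself.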
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


