Yes, I've seen this problem quite frequently of late, running MDS v13.2.10. 
It seems to depend on client behavior - a lot of xlock contention on some 
directory, although it's hard to pin down which client is doing what. The 
only remedy has been to fail over the MDS.
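
For reference, this is roughly what that looks like on our end (the mds name
below is a placeholder for whatever your active daemon is called):

    # look for long-stuck ops waiting on xlocks, and which sessions hold them
    ceph daemon mds.<name> dump_ops_in_flight
    ceph daemon mds.<name> session ls
    # kick the rank over to the standby-replay daemon
    ceph mds fail <name>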

1k - 4k clients
2M requests/replies (not sure what the window is)
40GB of MDS cache
1 active MDS, 1 standby-replay
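
(For the numbers above I'm mostly going by ceph fs status plus the MDS admin
socket; roughly, with the mds name again a placeholder:)

    ceph fs status                                # clients, req/s, cache per rank
    ceph daemon mds.<name> perf dump mds          # request/reply counters
    ceph config get mds mds_cache_memory_limit    # the 40GB cache limit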

Something tells me that I need a multi-active MDS setup, but in my experience 
every MDS crash in that setup has ended with clearing the journal. It could be 
that this cluster was upgraded from older releases of Ceph and years of 
clearing the journal have led to unrecoverable damage.
For now we're hanging on with 1 active MDS, and there are plans to move to 
radosgw.
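
If we do give multi-active another shot before that migration, enabling it is
just a filesystem setting; it's the recovery after a crash that has hurt us
(fs name and rank below are placeholders, and the journal reset is destructive
- export a backup first):

    # allow a second active rank
    ceph fs set <fs_name> max_mds 2
    # back up and then wipe the damaged journal - this is the "clearing" I mean
    cephfs-journal-tool --rank=<fs_name>:<rank> journal export backup.bin
    cephfs-journal-tool --rank=<fs_name>:<rank> journal reset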