Thank you, John! That was exactly the bug we were hitting. My Google-fu didn't lead me to this one.
On Wed, Apr 15, 2015 at 4:16 PM, John Spray <john.sp...@redhat.com> wrote:

> On 15/04/2015 20:02, Kyle Hutson wrote:
>
>> I upgraded to 0.94.1 from 0.94 on Monday, and everything had been going
>> pretty well.
>>
>> Then, about noon today, we had an mds crash. And then the failover mds
>> crashed. And this cascaded through all 4 mds servers we have.
>>
>> If I try to start it ('service ceph start mds' on CentOS 7.1), it appears
>> to be OK for a little while. ceph -w goes through 'replay', 'reconnect',
>> 'rejoin', 'clientreplay' and 'active', but nearly immediately after
>> getting to 'active' it crashes again.
>>
>> I have the mds log at
>> http://people.beocat.cis.ksu.edu/~kylehutson/ceph-mds.hobbit01.log
>>
>> Some possibly, but not necessarily, useful background info:
>> - Yesterday we took our erasure-coded pool and increased both pg_num and
>> pgp_num from 2048 to 4096. We still have several objects misplaced (~17%),
>> but those seem to be continuing to clean themselves up.
>> - We are in the midst of a large (300+ TB) rsync from our old (non-Ceph)
>> filesystem to this filesystem.
>> - Before we realized the mds crashes, we had just changed the size of our
>> metadata pool from 2 to 4.
>
> It looks like you're seeing http://tracker.ceph.com/issues/10449, which
> is a situation where the SessionMap object becomes too big for the MDS to
> save. The cause of it in that case was stuck requests from a misbehaving
> client running a slightly older kernel.
>
> Assuming you're using the kernel client and having a similar problem, you
> could try to work around this situation by forcibly unmounting the clients
> while the MDS is offline, such that during clientreplay the MDS will remove
> them from the SessionMap after timing out, and then the next time it tries
> to save the map it won't be oversized. If that works, you could then look
> into getting newer kernels on the clients to avoid hitting the issue again
> -- the #10449 ticket has some pointers about which kernel changes were
> relevant.
>
> Cheers,
> John
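
For the archives, the workaround boils down to something like this (assuming
the kernel clients mount CephFS at /mnt/ceph -- substitute your own mount
point):

    # On each client, while the MDS is stopped:
    umount -f /mnt/ceph   # force the unmount; if it blocks, try 'umount -l /mnt/ceph'

    # Note the client kernel version, to compare against the fixes
    # referenced in the #10449 ticket:
    uname -r

    # Then restart the MDS and watch the recovery; the stale sessions
    # should time out and drop out of the SessionMap:
    service ceph start mds
    ceph -w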