Thank you, John!

That was exactly the bug we were hitting. My Google-fu didn't lead me to
this one.

On Wed, Apr 15, 2015 at 4:16 PM, John Spray <john.sp...@redhat.com> wrote:

> On 15/04/2015 20:02, Kyle Hutson wrote:
>
>> I upgraded to 0.94.1 from 0.94 on Monday, and everything had been going
>> pretty well.
>>
>> Then, about noon today, we had an mds crash. And then the failover mds
>> crashed. And this cascaded through all 4 mds servers we have.
>>
>> If I try to start it ('service ceph start mds' on CentOS 7.1), it appears
>> to be OK for a little while: in 'ceph -w' the MDS goes through 'replay',
>> 'reconnect', 'rejoin', 'clientreplay', and 'active', but almost immediately
>> after reaching 'active' it crashes again.
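>>
>> Roughly, the restart-and-watch sequence looks like this (just a sketch of
>> the commands already mentioned above):
>>
>>   service ceph start mds    # sysvinit wrapper on CentOS 7.1
>>   ceph -w                   # watch cluster events while the MDS comes up
>>   ceph mds stat             # shows the current MDS state (replay, reconnect,
>>                             # rejoin, clientreplay, active)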
>>
>> I have the mds log at
>> http://people.beocat.cis.ksu.edu/~kylehutson/ceph-mds.hobbit01.log
>>
>> Some possibly, but not necessarily, useful background info:
>> - Yesterday we took our erasure coded pool and increased both pg_num and
>> pgp_num from 2048 to 4096. We still have several objects misplaced (~17%),
>> but those seem to be continuing to clean themselves up.
>> - We are in the midst of a large (300+ TB) rsync from our old (non-ceph)
>> filesystem to this filesystem.
>> - Before we noticed the mds crashes, we had just changed the size of our
>> metadata pool from 2 to 4 (rough commands for both pool changes are
>> sketched after this list).
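>>
>> Roughly, those two pool changes correspond to commands like these (the
>> pool names are placeholders, not our real ones):
>>
>>   ceph osd pool set <ec-pool> pg_num 4096
>>   ceph osd pool set <ec-pool> pgp_num 4096
>>   ceph osd pool set <metadata-pool> size 4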
>>
>
> It looks like you're seeing http://tracker.ceph.com/issues/10449, which
> is a situation where the SessionMap object becomes too big for the MDS to
> save. The cause of it in that case was stuck requests from a misbehaving
> client running a slightly older kernel.
>
> Assuming you're using the kernel client and having a similar problem, you
> could try to work around this situation by forcibly unmounting the clients
> while the MDS is offline, such that during clientreplay the MDS will remove
> them from the SessionMap after timing out, and then next time it tries to
> save the map it won't be oversized.  If that works, you could then look
> into getting newer kernels on the clients to avoid hitting the issue again
> -- the #10449 ticket has some pointers about which kernel changes were
> relevant.
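>
> A rough sketch of that workaround (the mount point is an assumption, and
> 'mds.hobbit01' is just the daemon name from the log; adjust both for your
> setup):
>
>   # on each client, while the MDS is stopped:
>   umount -f /mnt/cephfs       # force unmount ('umount -l' if it hangs)
>   uname -r                    # note the kernel version for the #10449 checks
>
>   # then start the MDS again and let the stale sessions time out:
>   service ceph start mds
>   ceph daemon mds.hobbit01 session ls   # list remaining client sessions,
>                                         # if available in your 0.94 build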
>
> Cheers,
> John
>