Re: [ceph-users] MDS crash (Mimic 13.2.2 / 13.2.4 ) elist.h: 39: FAILED assert(!is_on_list())

2019-02-11 Thread Yan, Zheng
On Mon, Feb 11, 2019 at 8:01 PM Jake Grimmett  wrote:
>
> Hi Zheng,
>
> Many, many thanks for your help...
>
> Your suggestion of setting large values for mds_cache_size and
> mds_cache_memory_limit stopped our MDS crashing :)
>
> The values in ceph.conf are now:
>
> mds_cache_size = 8589934592
> mds_cache_memory_limit = 17179869184
>
> Should these values be left in our configuration?

No, you'd better change them back to their original values.
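
You can either remove the two lines from ceph.conf and restart the MDS,
or put the values back at runtime through the admin socket. Roughly
something like this (the numbers below are the Mimic defaults as far as
I remember - 1 GiB for mds_cache_memory_limit and 0, i.e. unlimited, for
mds_cache_size - substitute whatever you were running with before, and
replace <name> with your MDS id):

ceph daemon mds.<name> config set mds_cache_memory_limit 1073741824
ceph daemon mds.<name> config set mds_cache_size 0

You can double-check the running values afterwards with
"ceph daemon mds.<name> config show".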

>
> again thanks for the assistance,
>
> Jake
>
> On 2/11/19 8:17 AM, Yan, Zheng wrote:
> > On Sat, Feb 9, 2019 at 12:36 AM Jake Grimmett  
> > wrote:
> >>
> >> Dear All,
> >>
> >> Unfortunately the MDS has crashed on our Mimic cluster...
> >>
> >> The first symptom was rsync giving:
> >> "No space left on device (28)"
> >> when trying to rename or delete files.
> >>
> >> This prompted me to try restarting the MDS, as it was reported as laggy.
> >>
> >> Restarting the MDS shows this error in the log before the crash:
> >>
> >> elist.h: 39: FAILED assert(!is_on_list())
> >>
> >> A full MDS log showing the crash is here:
> >>
> >> http://p.ip.fi/iWlz
> >>
> >> I've tried upgrading the cluster to 13.2.4, but the MDS still crashes...
> >>
> >> The cluster has 10 nodes and 254 OSDs, uses EC for the data pool and
> >> 3x replication for the metadata pool. We have a single active MDS,
> >> with two failover MDS daemons.
> >>
> >> We have ~2PB of CephFS data here, all of which is currently
> >> inaccessible; any and all advice gratefully received :)
> >>
> >
> > Add mds_cache_size and mds_cache_memory_limit to ceph.conf and set
> > them to very large values before starting the MDS. If the MDS does not
> > crash, restore mds_cache_size and mds_cache_memory_limit to their
> > original values (via the admin socket) after the MDS has been active
> > for 10 seconds.
> >
> > If the MDS still crashes, try compiling ceph-mds with the following patch:
> >
> > diff --git a/src/mds/CDir.cc b/src/mds/CDir.cc
> > index d3461fba2e..c2731e824c 100644
> > --- a/src/mds/CDir.cc
> > +++ b/src/mds/CDir.cc
> > @@ -508,6 +508,8 @@ void CDir::remove_dentry(CDentry *dn)
> >    // clean?
> >    if (dn->is_dirty())
> >      dn->mark_clean();
> > +  if (inode->is_stray())
> > +    dn->item_stray.remove_myself();
> >
> >    if (dn->state_test(CDentry::STATE_BOTTOMLRU))
> >      cache->bottom_lru.lru_remove(dn);
> >
> >
> >> best regards,
> >>
> >> Jake


Re: [ceph-users] MDS crash (Mimic 13.2.2 / 13.2.4 ) elist.h: 39: FAILED assert(!is_on_list())

2019-02-11 Thread Jake Grimmett
Hi Zheng,

Sorry - I've just re-read your email and saw your instruction to restore
the mds_cache_size and mds_cache_memory_limit to original values if the
MDS does not crash - I have now done this...

thanks again for your help,

best regards,

Jake

On 2/11/19 12:01 PM, Jake Grimmett wrote:
> Hi Zheng,
> 
> Many, many thanks for your help...
> 
> Your suggestion of setting large values for mds_cache_size and
> mds_cache_memory_limit stopped our MDS crashing :)
> 
> The values in ceph.conf are now:
> 
> mds_cache_size = 8589934592
> mds_cache_memory_limit = 17179869184
> 
> Should these values be left in our configuration?
> 
> again thanks for the assistance,
> 
> Jake
> 
> On 2/11/19 8:17 AM, Yan, Zheng wrote:
>> On Sat, Feb 9, 2019 at 12:36 AM Jake Grimmett  wrote:
>>>
>>> Dear All,
>>>
>>> Unfortunately the MDS has crashed on our Mimic cluster...
>>>
>>> The first symptom was rsync giving:
>>> "No space left on device (28)"
>>> when trying to rename or delete files.
>>>
>>> This prompted me to try restarting the MDS, as it was reported as laggy.
>>>
>>> Restarting the MDS shows this error in the log before the crash:
>>>
>>> elist.h: 39: FAILED assert(!is_on_list())
>>>
>>> A full MDS log showing the crash is here:
>>>
>>> http://p.ip.fi/iWlz
>>>
>>> I've tried upgrading the cluster to 13.2.4, but the MDS still crashes...
>>>
>>> The cluster has 10 nodes and 254 OSDs, uses EC for the data pool and
>>> 3x replication for the metadata pool. We have a single active MDS,
>>> with two failover MDS daemons.
>>>
>>> We have ~2PB of CephFS data here, all of which is currently
>>> inaccessible; any and all advice gratefully received :)
>>>
>>
>> Add mds_cache_size and mds_cache_memory_limit to ceph.conf and set
>> them to very large values before starting the MDS. If the MDS does not
>> crash, restore mds_cache_size and mds_cache_memory_limit to their
>> original values (via the admin socket) after the MDS has been active
>> for 10 seconds.
>>
>> If the MDS still crashes, try compiling ceph-mds with the following patch:
>>
>> diff --git a/src/mds/CDir.cc b/src/mds/CDir.cc
>> index d3461fba2e..c2731e824c 100644
>> --- a/src/mds/CDir.cc
>> +++ b/src/mds/CDir.cc
>> @@ -508,6 +508,8 @@ void CDir::remove_dentry(CDentry *dn)
>>    // clean?
>>    if (dn->is_dirty())
>>      dn->mark_clean();
>> +  if (inode->is_stray())
>> +    dn->item_stray.remove_myself();
>>
>>    if (dn->state_test(CDentry::STATE_BOTTOMLRU))
>>      cache->bottom_lru.lru_remove(dn);
>>
>>
>>> best regards,
>>>
>>> Jake


Re: [ceph-users] MDS crash (Mimic 13.2.2 / 13.2.4 ) elist.h: 39: FAILED assert(!is_on_list())

2019-02-11 Thread Jake Grimmett
Hi Zheng,

Many, many thanks for your help...

Your suggestion of setting large values for mds_cache_size and
mds_cache_memory_limit stopped our MDS crashing :)

The values in ceph.conf are now:

mds_cache_size = 8589934592
mds_cache_memory_limit = 17179869184

Should these values be left in our configuration?

again thanks for the assistance,

Jake

On 2/11/19 8:17 AM, Yan, Zheng wrote:
> On Sat, Feb 9, 2019 at 12:36 AM Jake Grimmett  wrote:
>>
>> Dear All,
>>
>> Unfortunately the MDS has crashed on our Mimic cluster...
>>
>> The first symptom was rsync giving:
>> "No space left on device (28)"
>> when trying to rename or delete files.
>>
>> This prompted me to try restarting the MDS, as it was reported as laggy.
>>
>> Restarting the MDS shows this error in the log before the crash:
>>
>> elist.h: 39: FAILED assert(!is_on_list())
>>
>> A full MDS log showing the crash is here:
>>
>> http://p.ip.fi/iWlz
>>
>> I've tried upgrading the cluster to 13.2.4, but the MDS still crashes...
>>
>> The cluster has 10 nodes and 254 OSDs, uses EC for the data pool and
>> 3x replication for the metadata pool. We have a single active MDS,
>> with two failover MDS daemons.
>>
>> We have ~2PB of CephFS data here, all of which is currently
>> inaccessible; any and all advice gratefully received :)
>>
> 
> Add mds_cache_size and mds_cache_memory_limit to ceph.conf and set
> them to very large values before starting the MDS. If the MDS does not
> crash, restore mds_cache_size and mds_cache_memory_limit to their
> original values (via the admin socket) after the MDS has been active
> for 10 seconds.
>
> If the MDS still crashes, try compiling ceph-mds with the following patch:
> 
> diff --git a/src/mds/CDir.cc b/src/mds/CDir.cc
> index d3461fba2e..c2731e824c 100644
> --- a/src/mds/CDir.cc
> +++ b/src/mds/CDir.cc
> @@ -508,6 +508,8 @@ void CDir::remove_dentry(CDentry *dn)
>    // clean?
>    if (dn->is_dirty())
>      dn->mark_clean();
> +  if (inode->is_stray())
> +    dn->item_stray.remove_myself();
>
>    if (dn->state_test(CDentry::STATE_BOTTOMLRU))
>      cache->bottom_lru.lru_remove(dn);
> 
> 
>> best regards,
>>
>> Jake


Re: [ceph-users] MDS crash (Mimic 13.2.2 / 13.2.4 ) elist.h: 39: FAILED assert(!is_on_list())

2019-02-11 Thread Yan, Zheng
On Sat, Feb 9, 2019 at 12:36 AM Jake Grimmett  wrote:
>
> Dear All,
>
> Unfortunately the MDS has crashed on our Mimic cluster...
>
> The first symptom was rsync giving:
> "No space left on device (28)"
> when trying to rename or delete files.
>
> This prompted me to try restarting the MDS, as it was reported as laggy.
>
> Restarting the MDS shows this error in the log before the crash:
>
> elist.h: 39: FAILED assert(!is_on_list())
>
> A full MDS log showing the crash is here:
>
> http://p.ip.fi/iWlz
>
> I've tried upgrading the cluster to 13.2.4, but the MDS still crashes...
>
> The cluster has 10 nodes and 254 OSDs, uses EC for the data pool and
> 3x replication for the metadata pool. We have a single active MDS,
> with two failover MDS daemons.
>
> We have ~2PB of CephFS data here, all of which is currently
> inaccessible; any and all advice gratefully received :)
>

Add mds_cache_size and mds_cache_memory_limit to ceph.conf and set
them to very large values before starting the MDS. If the MDS does not
crash, restore mds_cache_size and mds_cache_memory_limit to their
original values (via the admin socket) after the MDS has been active
for 10 seconds.
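
For example, something along these lines in ceph.conf (the exact numbers
are not important, just make them much larger than your current settings;
mds_cache_memory_limit is in bytes and mds_cache_size is an inode count,
if I remember correctly):

[mds]
mds_cache_size = 8589934592
mds_cache_memory_limit = 17179869184

Once the MDS has stayed active for ~10 seconds, put the old values back
with "ceph daemon mds.<name> config set ..." on the admin socket
(<name> is your MDS id).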

If the MDS still crashes, try compiling ceph-mds with the following patch:

diff --git a/src/mds/CDir.cc b/src/mds/CDir.cc
index d3461fba2e..c2731e824c 100644
--- a/src/mds/CDir.cc
+++ b/src/mds/CDir.cc
@@ -508,6 +508,8 @@ void CDir::remove_dentry(CDentry *dn)
   // clean?
   if (dn->is_dirty())
     dn->mark_clean();
+  if (inode->is_stray())
+    dn->item_stray.remove_myself();

   if (dn->state_test(CDentry::STATE_BOTTOMLRU))
     cache->bottom_lru.lru_remove(dn);
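
For reference, the assertion in the subject comes from the destructor of
the intrusive-list item embedded in each dentry. Roughly, paraphrased
from src/include/elist.h (a sketch, not the exact source):

// Paraphrased sketch of the elist item; the real code lives in src/include/elist.h.
#include <cassert>

struct elist_item {
  elist_item *_prev, *_next;
  elist_item() : _prev(this), _next(this) {}
  ~elist_item() { assert(!is_on_list()); }   // roughly the check that fires at elist.h:39
  bool empty() const { return _prev == this; }
  bool is_on_list() const { return !empty(); }
  void remove_myself() {                     // unlink from whatever list we are on
    _prev->_next = _next;
    _next->_prev = _prev;
    _prev = _next = this;
  }
};

CDentry::item_stray is such an item; if a stray dentry is removed while
item_stray is still linked, the destructor asserts when the dentry is
freed. The extra remove_myself() call in the patch unlinks it first,
which is why it should stop the crash.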


> best regards,
>
> Jake


[ceph-users] MDS crash (Mimic 13.2.2 / 13.2.4 ) elist.h: 39: FAILED assert(!is_on_list())

2019-02-08 Thread Jake Grimmett
Dear All,

Unfortunately the MDS has crashed on our Mimic cluster...

The first symptom was rsync giving:
"No space left on device (28)"
when trying to rename or delete files.

This prompted me to try restarting the MDS, as it was reported as laggy.

Restarting the MDS shows this error in the log before the crash:

elist.h: 39: FAILED assert(!is_on_list())

A full MDS log showing the crash is here:

http://p.ip.fi/iWlz

I've tried upgrading the cluster to 13.2.4, but the MDS still crashes...

The cluster has 10 nodes and 254 OSDs, uses EC for the data pool and
3x replication for the metadata pool. We have a single active MDS,
with two failover MDS daemons.

We have ~2PB of CephFS data here, all of which is currently
inaccessible; any and all advice gratefully received :)

best regards,

Jake
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com