Re: [ceph-users] ceph-fuse segfaults ( jewel 10.2.2)

2016-07-04 Thread Goncalo Borges

Will do, Brad. From your answer it should be a safe thing to do.

Will report later.

Thanks for the help

Cheers

Goncalo



On 07/05/2016 02:42 PM, Brad Hubbard wrote:

On Tue, Jul 5, 2016 at 1:34 PM, Patrick Donnelly  wrote:

Hi Goncalo,

I believe this segfault may be the one fixed here:

https://github.com/ceph/ceph/pull/10027

Ah, nice one Patrick.

Goncalo, the patch is fairly simple, just the addition of a lock on two lines
to resolve the race. Could you try recompiling with those changes and let us
know how it goes?
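
Roughly, something like this should do it -- the branch name here is arbitrary
and the autotools steps are just one way to build jewel from source, so use
your normal packaging workflow if you have one:

  git clone --recursive https://github.com/ceph/ceph.git && cd ceph
  git checkout -b wip-fuse-race-fix v10.2.2
  # GitHub serves every PR as a patch series; this may need minor fix-ups
  # if it does not apply cleanly on the tag
  curl -L https://github.com/ceph/ceph/pull/10027.patch | git am
  ./install-deps.sh
  ./autogen.sh && ./configure && make -j"$(nproc)"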

Cheers,
Brad


(Sorry for brief top-post. I'm on mobile.)

On Jul 4, 2016 9:16 PM, "Goncalo Borges" 
wrote:

Dear All...

We have recently migrated all our ceph infrastructure from 9.2.0 to
10.2.2.

We are currently using ceph-fuse to mount cephfs in a number of clients.

ceph-fuse 10.2.2 client is segfaulting in some situations. One of the
scenarios where ceph-fuse segfaults is when a user submits a parallel (MPI)
application requesting 4 hosts with 4 cores each (16 instances in total).
According to the user, each instance has its own dedicated inputs and
outputs.

Please note that if we go back to ceph-fuse 9.2.0 client everything works
fine.

The ceph-fuse 10.2.2 client segfault is the following (we were able to
capture it mounting ceph-fuse in debug mode):

2016-07-04 21:21:00.074087 7f6aed92be40  0 ceph version 10.2.2
(45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-fuse, pid 7346
ceph-fuse[7346]: starting ceph client
2016-07-04 21:21:00.107816 7f6aed92be40 -1 init, newargv = 0x7f6af8c12320
newargc=11
ceph-fuse[7346]: starting fuse
*** Caught signal (Segmentation fault) **
  in thread 7f69d7fff700 thread_name:ceph-fuse
  ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
  1: (()+0x297ef2) [0x7f6aedbecef2]
  2: (()+0x3b88c0f7e0) [0x7f6aec64b7e0]
  3: (Client::get_root_ino()+0x10) [0x7f6aedaf0330]
  4: (CephFuse::Handle::make_fake_ino(inodeno_t, snapid_t)+0x175)
[0x7f6aedaee035]
  5: (()+0x199891) [0x7f6aedaee891]
  6: (()+0x15b76) [0x7f6aed50db76]
  7: (()+0x12aa9) [0x7f6aed50aaa9]
  8: (()+0x3b88c07aa1) [0x7f6aec643aa1]
  9: (clone()+0x6d) [0x7f6aeb8d193d]
2016-07-05 10:09:14.045131 7f69d7fff700 -1 *** Caught signal
(Segmentation fault) **
  in thread 7f69d7fff700 thread_name:ceph-fuse

  ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
  1: (()+0x297ef2) [0x7f6aedbecef2]
  2: (()+0x3b88c0f7e0) [0x7f6aec64b7e0]
  3: (Client::get_root_ino()+0x10) [0x7f6aedaf0330]
  4: (CephFuse::Handle::make_fake_ino(inodeno_t, snapid_t)+0x175)
[0x7f6aedaee035]
  5: (()+0x199891) [0x7f6aedaee891]
  6: (()+0x15b76) [0x7f6aed50db76]
  7: (()+0x12aa9) [0x7f6aed50aaa9]
  8: (()+0x3b88c07aa1) [0x7f6aec643aa1]
  9: (clone()+0x6d) [0x7f6aeb8d193d]
  NOTE: a copy of the executable, or `objdump -rdS ` is needed
to interpret this.



The full dump is quite long. Here are the very last bits of it. Let me
know if you need the full dump.

--- begin dump of recent events ---
  -> 2016-07-05 10:09:13.956502 7f6a5700  3 client.464559
_getxattr(137c789, "security.capability", 0) = -61
  -9998> 2016-07-05 10:09:13.956507 7f6aa96fa700  3 client.464559 ll_write
0x7f6a08028be0 137c78c 20094~34
  -9997> 2016-07-05 10:09:13.956527 7f6aa96fa700  3 client.464559 ll_write
0x7f6a08028be0 20094~34 = 34
  -9996> 2016-07-05 10:09:13.956535 7f69d7fff700  3 client.464559 ll_write
0x7f6a100145f0 137c78d 28526~34
  -9995> 2016-07-05 10:09:13.956553 7f69d7fff700  3 client.464559 ll_write
0x7f6a100145f0 28526~34 = 34
  -9994> 2016-07-05 10:09:13.956561 7f6ac0dfa700  3 client.464559
ll_forget 137c78c 1
  -9993> 2016-07-05 10:09:13.956569 7f6a5700  3 client.464559
ll_forget 137c789 1
  -9992> 2016-07-05 10:09:13.956577 7f6a5ebfd700  3 client.464559 ll_write
0x7f6a94006350 137c789 22010~216
  -9991> 2016-07-05 10:09:13.956594 7f6a5ebfd700  3 client.464559 ll_write
0x7f6a94006350 22010~216 = 216
  -9990> 2016-07-05 10:09:13.956603 7f6aa8cf9700  3 client.464559
ll_getxattr 137c78c.head security.capability size 0
  -9989> 2016-07-05 10:09:13.956609 7f6aa8cf9700  3 client.464559
_getxattr(137c78c, "security.capability", 0) = -61



   -160> 2016-07-05 10:09:14.043687 7f69d7fff700  3 client.464559
_getxattr(137c78a, "security.capability", 0) = -61
   -159> 2016-07-05 10:09:14.043694 7f6ac0dfa700  3 client.464559 ll_write
0x7f6a08042560 137c78b 11900~34
   -158> 2016-07-05 10:09:14.043712 7f6ac0dfa700  3 client.464559 ll_write
0x7f6a08042560 11900~34 = 34
   -157> 2016-07-05 10:09:14.043722 7f6ac17fb700  3 client.464559
ll_getattr 11e9c80.head
   -156> 2016-07-05 10:09:14.043727 7f6ac17fb700  3 client.464559
ll_getattr 11e9c80.head = 0
   -155> 2016-07-05 10:09:14.043734 7f69d7fff700  3 client.464559
ll_forget 137c78a 1
   -154> 2016-07-05 10:09:14.043738 7f6a5ebfd700  3 client.464559 ll_write
0x7f6a140d5930 137c78a 18292~34
   -153> 2016-07-05 10:09:14.043759 7f6a5ebfd700  3 client.464559 

Re: [ceph-users] ceph-fuse segfaults ( jewel 10.2.2)

2016-07-04 Thread Goncalo Borges

Hi Brad, Shinobu, Patrick...

Indeed if I run with 'debug client = 20' it seems I get a very similar 
log to what Patrick has in the patch. However it is difficult for me to 
really say if it is exactly the same thing.


One thing I could try is simply to apply the fix in the source code and 
recompile. Is this something safe to do?



Cheers

Goncalo


On 07/05/2016 01:34 PM, Patrick Donnelly wrote:


Hi Goncalo,

I believe this segfault may be the one fixed here:

https://github.com/ceph/ceph/pull/10027

(Sorry for brief top-post. I'm on mobile.)

On Jul 4, 2016 9:16 PM, "Goncalo Borges" > wrote:

>
> Dear All...
>
> We have recently migrated all our ceph infrastructure from 9.2.0 to 
10.2.2.

>
> We are currently using ceph-fuse to mount cephfs in a number of 
clients.

>
> ceph-fuse 10.2.2 client is segfaulting in some situations. One of 
the scenarios where ceph-fuse segfaults is when a user submits a 
parallel (mpi) application requesting 4 hosts with 4 cores each (16 
instances in total) . According to the user, each instance has its own 
dedicated inputs and outputs.

>
> Please note that if we go back to ceph-fuse 9.2.0 client everything 
works fine.

>
> The ceph-fuse 10.2.2 client segfault is the following (we were able 
to capture it mounting ceph-fuse in debug mode):

>>
>> 2016-07-04 21:21:00.074087 7f6aed92be40  0 ceph version 10.2.2 
(45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-fuse, pid 7346

>> ceph-fuse[7346]: starting ceph client
>> 2016-07-04 21:21:00.107816 7f6aed92be40 -1 init, newargv = 
0x7f6af8c12320 newargc=11

>> ceph-fuse[7346]: starting fuse
>> *** Caught signal (Segmentation fault) **
>>  in thread 7f69d7fff700 thread_name:ceph-fuse
>>  ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>>  1: (()+0x297ef2) [0x7f6aedbecef2]
>>  2: (()+0x3b88c0f7e0) [0x7f6aec64b7e0]
>>  3: (Client::get_root_ino()+0x10) [0x7f6aedaf0330]
>>  4: (CephFuse::Handle::make_fake_ino(inodeno_t, snapid_t)+0x175) 
[0x7f6aedaee035]

>>  5: (()+0x199891) [0x7f6aedaee891]
>>  6: (()+0x15b76) [0x7f6aed50db76]
>>  7: (()+0x12aa9) [0x7f6aed50aaa9]
>>  8: (()+0x3b88c07aa1) [0x7f6aec643aa1]
>>  9: (clone()+0x6d) [0x7f6aeb8d193d]
>> 2016-07-05 10:09:14.045131 7f69d7fff700 -1 *** Caught signal 
(Segmentation fault) **

>>  in thread 7f69d7fff700 thread_name:ceph-fuse
>>
>>  ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>>  1: (()+0x297ef2) [0x7f6aedbecef2]
>>  2: (()+0x3b88c0f7e0) [0x7f6aec64b7e0]
>>  3: (Client::get_root_ino()+0x10) [0x7f6aedaf0330]
>>  4: (CephFuse::Handle::make_fake_ino(inodeno_t, snapid_t)+0x175) 
[0x7f6aedaee035]

>>  5: (()+0x199891) [0x7f6aedaee891]
>>  6: (()+0x15b76) [0x7f6aed50db76]
>>  7: (()+0x12aa9) [0x7f6aed50aaa9]
>>  8: (()+0x3b88c07aa1) [0x7f6aec643aa1]
>>  9: (clone()+0x6d) [0x7f6aeb8d193d]
>>  NOTE: a copy of the executable, or `objdump -rdS ` is 
needed to interpret this.

>>
>>
> The full dump is quite long. Here are the very last bits of it. Let 
me know if you need the full dump.

>>
>> --- begin dump of recent events ---
>>  -> 2016-07-05 10:09:13.956502 7f6a5700  3 client.464559 
_getxattr(137c789, "security.capability", 0) = -61
>>  -9998> 2016-07-05 10:09:13.956507 7f6aa96fa700  3 client.464559 
ll_write 0x7f6a08028be0 137c78c 20094~34
>>  -9997> 2016-07-05 10:09:13.956527 7f6aa96fa700  3 client.464559 
ll_write 0x7f6a08028be0 20094~34 = 34
>>  -9996> 2016-07-05 10:09:13.956535 7f69d7fff700  3 client.464559 
ll_write 0x7f6a100145f0 137c78d 28526~34
>>  -9995> 2016-07-05 10:09:13.956553 7f69d7fff700  3 client.464559 
ll_write 0x7f6a100145f0 28526~34 = 34
>>  -9994> 2016-07-05 10:09:13.956561 7f6ac0dfa700  3 client.464559 
ll_forget 137c78c 1
>>  -9993> 2016-07-05 10:09:13.956569 7f6a5700  3 client.464559 
ll_forget 137c789 1
>>  -9992> 2016-07-05 10:09:13.956577 7f6a5ebfd700  3 client.464559 
ll_write 0x7f6a94006350 137c789 22010~216
>>  -9991> 2016-07-05 10:09:13.956594 7f6a5ebfd700  3 client.464559 
ll_write 0x7f6a94006350 22010~216 = 216
>>  -9990> 2016-07-05 10:09:13.956603 7f6aa8cf9700  3 client.464559 
ll_getxattr 137c78c.head security.capability size 0
>>  -9989> 2016-07-05 10:09:13.956609 7f6aa8cf9700  3 client.464559 
_getxattr(137c78c, "security.capability", 0) = -61

>>
>> 
>>
>>   -160> 2016-07-05 10:09:14.043687 7f69d7fff700  3 client.464559 
_getxattr(137c78a, "security.capability", 0) = -61
>>   -159> 2016-07-05 10:09:14.043694 7f6ac0dfa700  3 client.464559 
ll_write 0x7f6a08042560 137c78b 11900~34
>>   -158> 2016-07-05 10:09:14.043712 7f6ac0dfa700  3 client.464559 
ll_write 0x7f6a08042560 11900~34 = 34
>>   -157> 2016-07-05 10:09:14.043722 7f6ac17fb700  3 client.464559 
ll_getattr 11e9c80.head
>>   -156> 2016-07-05 10:09:14.043727 7f6ac17fb700  3 client.464559 
ll_getattr 11e9c80.head = 0
>>   -155> 2016-07-05 10:09:14.043734 7f69d7fff700  3 client.464559 
ll_forget 137c78a 1
>>   

Re: [ceph-users] ceph-fuse segfaults ( jewel 10.2.2)

2016-07-04 Thread Brad Hubbard
On Tue, Jul 5, 2016 at 1:34 PM, Patrick Donnelly  wrote:
> Hi Goncalo,
>
> I believe this segfault may be the one fixed here:
>
> https://github.com/ceph/ceph/pull/10027

Ah, nice one Patrick.

Goncalo, the patch is fairly simple, just the addition of a lock on two lines
to resolve the race. Could you try recompiling with those changes and let us
know how it goes?

Cheers,
Brad

>
> (Sorry for brief top-post. Im on mobile.)
>
> On Jul 4, 2016 9:16 PM, "Goncalo Borges" 
> wrote:
>>
>> Dear All...
>>
>> We have recently migrated all our ceph infrastructure from 9.2.0 to
>> 10.2.2.
>>
>> We are currently using ceph-fuse to mount cephfs in a number of clients.
>>
>> ceph-fuse 10.2.2 client is segfaulting in some situations. One of the
>> scenarios where ceph-fuse segfaults is when a user submits a parallel (mpi)
>> application requesting 4 hosts with 4 cores each (16 instances in total) .
>> According to the user, each instance has its own dedicated inputs and
>> outputs.
>>
>> Please note that if we go back to ceph-fuse 9.2.0 client everything works
>> fine.
>>
>> The ceph-fuse 10.2.2 client segfault is the following (we were able to
>> capture it mounting ceph-fuse in debug mode):
>>>
>>> 2016-07-04 21:21:00.074087 7f6aed92be40  0 ceph version 10.2.2
>>> (45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-fuse, pid 7346
>>> ceph-fuse[7346]: starting ceph client
>>> 2016-07-04 21:21:00.107816 7f6aed92be40 -1 init, newargv = 0x7f6af8c12320
>>> newargc=11
>>> ceph-fuse[7346]: starting fuse
>>> *** Caught signal (Segmentation fault) **
>>>  in thread 7f69d7fff700 thread_name:ceph-fuse
>>>  ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>>>  1: (()+0x297ef2) [0x7f6aedbecef2]
>>>  2: (()+0x3b88c0f7e0) [0x7f6aec64b7e0]
>>>  3: (Client::get_root_ino()+0x10) [0x7f6aedaf0330]
>>>  4: (CephFuse::Handle::make_fake_ino(inodeno_t, snapid_t)+0x175)
>>> [0x7f6aedaee035]
>>>  5: (()+0x199891) [0x7f6aedaee891]
>>>  6: (()+0x15b76) [0x7f6aed50db76]
>>>  7: (()+0x12aa9) [0x7f6aed50aaa9]
>>>  8: (()+0x3b88c07aa1) [0x7f6aec643aa1]
>>>  9: (clone()+0x6d) [0x7f6aeb8d193d]
>>> 2016-07-05 10:09:14.045131 7f69d7fff700 -1 *** Caught signal
>>> (Segmentation fault) **
>>>  in thread 7f69d7fff700 thread_name:ceph-fuse
>>>
>>>  ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>>>  1: (()+0x297ef2) [0x7f6aedbecef2]
>>>  2: (()+0x3b88c0f7e0) [0x7f6aec64b7e0]
>>>  3: (Client::get_root_ino()+0x10) [0x7f6aedaf0330]
>>>  4: (CephFuse::Handle::make_fake_ino(inodeno_t, snapid_t)+0x175)
>>> [0x7f6aedaee035]
>>>  5: (()+0x199891) [0x7f6aedaee891]
>>>  6: (()+0x15b76) [0x7f6aed50db76]
>>>  7: (()+0x12aa9) [0x7f6aed50aaa9]
>>>  8: (()+0x3b88c07aa1) [0x7f6aec643aa1]
>>>  9: (clone()+0x6d) [0x7f6aeb8d193d]
>>>  NOTE: a copy of the executable, or `objdump -rdS ` is needed
>>> to interpret this.
>>>
>>>
>> The full dump is quite long. Here are the very last bits of it. Let me
>> know if you need the full dump.
>>>
>>> --- begin dump of recent events ---
>>>  -> 2016-07-05 10:09:13.956502 7f6a5700  3 client.464559
>>> _getxattr(137c789, "security.capability", 0) = -61
>>>  -9998> 2016-07-05 10:09:13.956507 7f6aa96fa700  3 client.464559 ll_write
>>> 0x7f6a08028be0 137c78c 20094~34
>>>  -9997> 2016-07-05 10:09:13.956527 7f6aa96fa700  3 client.464559 ll_write
>>> 0x7f6a08028be0 20094~34 = 34
>>>  -9996> 2016-07-05 10:09:13.956535 7f69d7fff700  3 client.464559 ll_write
>>> 0x7f6a100145f0 137c78d 28526~34
>>>  -9995> 2016-07-05 10:09:13.956553 7f69d7fff700  3 client.464559 ll_write
>>> 0x7f6a100145f0 28526~34 = 34
>>>  -9994> 2016-07-05 10:09:13.956561 7f6ac0dfa700  3 client.464559
>>> ll_forget 137c78c 1
>>>  -9993> 2016-07-05 10:09:13.956569 7f6a5700  3 client.464559
>>> ll_forget 137c789 1
>>>  -9992> 2016-07-05 10:09:13.956577 7f6a5ebfd700  3 client.464559 ll_write
>>> 0x7f6a94006350 137c789 22010~216
>>>  -9991> 2016-07-05 10:09:13.956594 7f6a5ebfd700  3 client.464559 ll_write
>>> 0x7f6a94006350 22010~216 = 216
>>>  -9990> 2016-07-05 10:09:13.956603 7f6aa8cf9700  3 client.464559
>>> ll_getxattr 137c78c.head security.capability size 0
>>>  -9989> 2016-07-05 10:09:13.956609 7f6aa8cf9700  3 client.464559
>>> _getxattr(137c78c, "security.capability", 0) = -61
>>>
>>> 
>>>
>>>   -160> 2016-07-05 10:09:14.043687 7f69d7fff700  3 client.464559
>>> _getxattr(137c78a, "security.capability", 0) = -61
>>>   -159> 2016-07-05 10:09:14.043694 7f6ac0dfa700  3 client.464559 ll_write
>>> 0x7f6a08042560 137c78b 11900~34
>>>   -158> 2016-07-05 10:09:14.043712 7f6ac0dfa700  3 client.464559 ll_write
>>> 0x7f6a08042560 11900~34 = 34
>>>   -157> 2016-07-05 10:09:14.043722 7f6ac17fb700  3 client.464559
>>> ll_getattr 11e9c80.head
>>>   -156> 2016-07-05 10:09:14.043727 7f6ac17fb700  3 client.464559
>>> ll_getattr 11e9c80.head = 0
>>>   -155> 2016-07-05 10:09:14.043734 7f69d7fff700  3 client.464559
>>> ll_forget 137c78a 1
>>>   -154> 

Re: [ceph-users] ceph-fuse segfaults ( jewel 10.2.2)

2016-07-04 Thread Patrick Donnelly
Hi Goncalo,

I believe this segfault may be the one fixed here:

https://github.com/ceph/ceph/pull/10027

(Sorry for brief top-post. I'm on mobile.)

On Jul 4, 2016 9:16 PM, "Goncalo Borges" 
wrote:
>
> Dear All...
>
> We have recently migrated all our ceph infrastructure from 9.2.0 to
10.2.2.
>
> We are currently using ceph-fuse to mount cephfs in a number of clients.
>
> ceph-fuse 10.2.2 client is segfaulting in some situations. One of the
scenarios where ceph-fuse segfaults is when a user submits a parallel (mpi)
application requesting 4 hosts with 4 cores each (16 instances in total) .
According to the user, each instance has its own dedicated inputs and
outputs.
>
> Please note that if we go back to ceph-fuse 9.2.0 client everything works
fine.
>
> The ceph-fuse 10.2.2 client segfault is the following (we were able to
capture it mounting ceph-fuse in debug mode):
>>
>> 2016-07-04 21:21:00.074087 7f6aed92be40  0 ceph version 10.2.2
(45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-fuse, pid 7346
>> ceph-fuse[7346]: starting ceph client
>> 2016-07-04 21:21:00.107816 7f6aed92be40 -1 init, newargv =
0x7f6af8c12320 newargc=11
>> ceph-fuse[7346]: starting fuse
>> *** Caught signal (Segmentation fault) **
>>  in thread 7f69d7fff700 thread_name:ceph-fuse
>>  ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>>  1: (()+0x297ef2) [0x7f6aedbecef2]
>>  2: (()+0x3b88c0f7e0) [0x7f6aec64b7e0]
>>  3: (Client::get_root_ino()+0x10) [0x7f6aedaf0330]
>>  4: (CephFuse::Handle::make_fake_ino(inodeno_t, snapid_t)+0x175)
[0x7f6aedaee035]
>>  5: (()+0x199891) [0x7f6aedaee891]
>>  6: (()+0x15b76) [0x7f6aed50db76]
>>  7: (()+0x12aa9) [0x7f6aed50aaa9]
>>  8: (()+0x3b88c07aa1) [0x7f6aec643aa1]
>>  9: (clone()+0x6d) [0x7f6aeb8d193d]
>> 2016-07-05 10:09:14.045131 7f69d7fff700 -1 *** Caught signal
(Segmentation fault) **
>>  in thread 7f69d7fff700 thread_name:ceph-fuse
>>
>>  ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>>  1: (()+0x297ef2) [0x7f6aedbecef2]
>>  2: (()+0x3b88c0f7e0) [0x7f6aec64b7e0]
>>  3: (Client::get_root_ino()+0x10) [0x7f6aedaf0330]
>>  4: (CephFuse::Handle::make_fake_ino(inodeno_t, snapid_t)+0x175)
[0x7f6aedaee035]
>>  5: (()+0x199891) [0x7f6aedaee891]
>>  6: (()+0x15b76) [0x7f6aed50db76]
>>  7: (()+0x12aa9) [0x7f6aed50aaa9]
>>  8: (()+0x3b88c07aa1) [0x7f6aec643aa1]
>>  9: (clone()+0x6d) [0x7f6aeb8d193d]
>>  NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.
>>
>>
> The full dump is quite long. Here are the very last bits of it. Let me
know if you need the full dump.
>>
>> --- begin dump of recent events ---
>>  -> 2016-07-05 10:09:13.956502 7f6a5700  3 client.464559
_getxattr(137c789, "security.capability", 0) = -61
>>  -9998> 2016-07-05 10:09:13.956507 7f6aa96fa700  3 client.464559
ll_write 0x7f6a08028be0 137c78c 20094~34
>>  -9997> 2016-07-05 10:09:13.956527 7f6aa96fa700  3 client.464559
ll_write 0x7f6a08028be0 20094~34 = 34
>>  -9996> 2016-07-05 10:09:13.956535 7f69d7fff700  3 client.464559
ll_write 0x7f6a100145f0 137c78d 28526~34
>>  -9995> 2016-07-05 10:09:13.956553 7f69d7fff700  3 client.464559
ll_write 0x7f6a100145f0 28526~34 = 34
>>  -9994> 2016-07-05 10:09:13.956561 7f6ac0dfa700  3 client.464559
ll_forget 137c78c 1
>>  -9993> 2016-07-05 10:09:13.956569 7f6a5700  3 client.464559
ll_forget 137c789 1
>>  -9992> 2016-07-05 10:09:13.956577 7f6a5ebfd700  3 client.464559
ll_write 0x7f6a94006350 137c789 22010~216
>>  -9991> 2016-07-05 10:09:13.956594 7f6a5ebfd700  3 client.464559
ll_write 0x7f6a94006350 22010~216 = 216
>>  -9990> 2016-07-05 10:09:13.956603 7f6aa8cf9700  3 client.464559
ll_getxattr 137c78c.head security.capability size 0
>>  -9989> 2016-07-05 10:09:13.956609 7f6aa8cf9700  3 client.464559
_getxattr(137c78c, "security.capability", 0) = -61
>>
>> 
>>
>>   -160> 2016-07-05 10:09:14.043687 7f69d7fff700  3 client.464559
_getxattr(137c78a, "security.capability", 0) = -61
>>   -159> 2016-07-05 10:09:14.043694 7f6ac0dfa700  3 client.464559
ll_write 0x7f6a08042560 137c78b 11900~34
>>   -158> 2016-07-05 10:09:14.043712 7f6ac0dfa700  3 client.464559
ll_write 0x7f6a08042560 11900~34 = 34
>>   -157> 2016-07-05 10:09:14.043722 7f6ac17fb700  3 client.464559
ll_getattr 11e9c80.head
>>   -156> 2016-07-05 10:09:14.043727 7f6ac17fb700  3 client.464559
ll_getattr 11e9c80.head = 0
>>   -155> 2016-07-05 10:09:14.043734 7f69d7fff700  3 client.464559
ll_forget 137c78a 1
>>   -154> 2016-07-05 10:09:14.043738 7f6a5ebfd700  3 client.464559
ll_write 0x7f6a140d5930 137c78a 18292~34
>>   -153> 2016-07-05 10:09:14.043759 7f6a5ebfd700  3 client.464559
ll_write 0x7f6a140d5930 18292~34 = 34
>>   -152> 2016-07-05 10:09:14.043767 7f6ac17fb700  3 client.464559
ll_forget 11e9c80 1
>>   -151> 2016-07-05 10:09:14.043784 7f6aa8cf9700  3 client.464559
ll_flush 0x7f6a00049fe0 11e9c80
>>   -150> 2016-07-05 10:09:14.043794 7f6aa8cf9700  3 client.464559
ll_getxattr 

Re: [ceph-users] ceph-fuse segfaults ( jewel 10.2.2)

2016-07-04 Thread Brad Hubbard
On Tue, Jul 5, 2016 at 12:13 PM, Shinobu Kinjo  wrote:
> Can you reproduce with debug client = 20?

In addition to this I would suggest making sure you have debug symbols
in your build
and capturing a core file.

You can do that by setting "ulimit -c unlimited" in the environment
where ceph-fuse is running.

Once you have a core file you can do the following.

$ gdb /path/to/ceph-fuse core.
(gdb) thread apply all bt full

This looks like it might be a race and that might help us identify the
threads involved.
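
End to end, the capture might look something like this (the paths are
assumptions, and where the core file lands depends on your kernel.core_pattern):

  ulimit -c unlimited                        # in the shell that will start ceph-fuse
  ceph-fuse -m <mon-host>:6789 /mnt/cephfs   # then reproduce the crash
  # once it has dumped core, pull full backtraces from every thread:
  gdb -batch -ex 'thread apply all bt full' /usr/bin/ceph-fuse /path/to/core > ceph-fuse-bt.txt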

HTH,
Brad

>
> On Tue, Jul 5, 2016 at 10:16 AM, Goncalo Borges
>  wrote:
>>
>> Dear All...
>>
>> We have recently migrated all our ceph infrastructure from 9.2.0 to
>> 10.2.2.
>>
>> We are currently using ceph-fuse to mount cephfs in a number of clients.
>>
>> ceph-fuse 10.2.2 client is segfaulting in some situations. One of the
>> scenarios where ceph-fuse segfaults is when a user submits a parallel (mpi)
>> application requesting 4 hosts with 4 cores each (16 instances in total) .
>> According to the user, each instance has its own dedicated inputs and
>> outputs.
>>
>> Please note that if we go back to ceph-fuse 9.2.0 client everything works
>> fine.
>>
>> The ceph-fuse 10.2.2 client segfault is the following (we were able to
>> capture it mounting ceph-fuse in debug mode):
>>
>> 2016-07-04 21:21:00.074087 7f6aed92be40  0 ceph version 10.2.2
>> (45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-fuse, pid 7346
>> ceph-fuse[7346]: starting ceph client
>> 2016-07-04 21:21:00.107816 7f6aed92be40 -1 init, newargv = 0x7f6af8c12320
>> newargc=11
>> ceph-fuse[7346]: starting fuse
>> *** Caught signal (Segmentation fault) **
>>  in thread 7f69d7fff700 thread_name:ceph-fuse
>>  ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>>  1: (()+0x297ef2) [0x7f6aedbecef2]
>>  2: (()+0x3b88c0f7e0) [0x7f6aec64b7e0]
>>  3: (Client::get_root_ino()+0x10) [0x7f6aedaf0330]
>>  4: (CephFuse::Handle::make_fake_ino(inodeno_t, snapid_t)+0x175)
>> [0x7f6aedaee035]
>>  5: (()+0x199891) [0x7f6aedaee891]
>>  6: (()+0x15b76) [0x7f6aed50db76]
>>  7: (()+0x12aa9) [0x7f6aed50aaa9]
>>  8: (()+0x3b88c07aa1) [0x7f6aec643aa1]
>>  9: (clone()+0x6d) [0x7f6aeb8d193d]
>> 2016-07-05 10:09:14.045131 7f69d7fff700 -1 *** Caught signal (Segmentation
>> fault) **
>>  in thread 7f69d7fff700 thread_name:ceph-fuse
>>
>>  ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>>  1: (()+0x297ef2) [0x7f6aedbecef2]
>>  2: (()+0x3b88c0f7e0) [0x7f6aec64b7e0]
>>  3: (Client::get_root_ino()+0x10) [0x7f6aedaf0330]
>>  4: (CephFuse::Handle::make_fake_ino(inodeno_t, snapid_t)+0x175)
>> [0x7f6aedaee035]
>>  5: (()+0x199891) [0x7f6aedaee891]
>>  6: (()+0x15b76) [0x7f6aed50db76]
>>  7: (()+0x12aa9) [0x7f6aed50aaa9]
>>  8: (()+0x3b88c07aa1) [0x7f6aec643aa1]
>>  9: (clone()+0x6d) [0x7f6aeb8d193d]
>>  NOTE: a copy of the executable, or `objdump -rdS ` is needed
>> to interpret this.
>>
>>
>> The full dump is quite long. Here are the very last bits of it. Let me
>> know if you need the full dump.
>>
>> --- begin dump of recent events ---
>>  -> 2016-07-05 10:09:13.956502 7f6a5700  3 client.464559
>> _getxattr(137c789, "security.capability", 0) = -61
>>  -9998> 2016-07-05 10:09:13.956507 7f6aa96fa700  3 client.464559 ll_write
>> 0x7f6a08028be0 137c78c 20094~34
>>  -9997> 2016-07-05 10:09:13.956527 7f6aa96fa700  3 client.464559 ll_write
>> 0x7f6a08028be0 20094~34 = 34
>>  -9996> 2016-07-05 10:09:13.956535 7f69d7fff700  3 client.464559 ll_write
>> 0x7f6a100145f0 137c78d 28526~34
>>  -9995> 2016-07-05 10:09:13.956553 7f69d7fff700  3 client.464559 ll_write
>> 0x7f6a100145f0 28526~34 = 34
>>  -9994> 2016-07-05 10:09:13.956561 7f6ac0dfa700  3 client.464559 ll_forget
>> 137c78c 1
>>  -9993> 2016-07-05 10:09:13.956569 7f6a5700  3 client.464559 ll_forget
>> 137c789 1
>>  -9992> 2016-07-05 10:09:13.956577 7f6a5ebfd700  3 client.464559 ll_write
>> 0x7f6a94006350 137c789 22010~216
>>  -9991> 2016-07-05 10:09:13.956594 7f6a5ebfd700  3 client.464559 ll_write
>> 0x7f6a94006350 22010~216 = 216
>>  -9990> 2016-07-05 10:09:13.956603 7f6aa8cf9700  3 client.464559
>> ll_getxattr 137c78c.head security.capability size 0
>>  -9989> 2016-07-05 10:09:13.956609 7f6aa8cf9700  3 client.464559
>> _getxattr(137c78c, "security.capability", 0) = -61
>>
>> 
>>
>>   -160> 2016-07-05 10:09:14.043687 7f69d7fff700  3 client.464559
>> _getxattr(137c78a, "security.capability", 0) = -61
>>   -159> 2016-07-05 10:09:14.043694 7f6ac0dfa700  3 client.464559 ll_write
>> 0x7f6a08042560 137c78b 11900~34
>>   -158> 2016-07-05 10:09:14.043712 7f6ac0dfa700  3 client.464559 ll_write
>> 0x7f6a08042560 11900~34 = 34
>>   -157> 2016-07-05 10:09:14.043722 7f6ac17fb700  3 client.464559
>> ll_getattr 11e9c80.head
>>   -156> 2016-07-05 10:09:14.043727 7f6ac17fb700  3 client.464559
>> ll_getattr 11e9c80.head = 0
>>   -155> 2016-07-05 10:09:14.043734 7f69d7fff700  3 client.464559 

Re: [ceph-users] ceph-fuse segfaults ( jewel 10.2.2)

2016-07-04 Thread Shinobu Kinjo
Can you reproduce with debug client = 20?
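
(One way to turn that on for ceph-fuse, assuming the usual config path on the
client node, is something like the following, then remount so the new settings
are picked up:

  cat >> /etc/ceph/ceph.conf <<'EOF'
  [client]
      debug client = 20
      log file = /var/log/ceph/ceph-fuse.$name.$pid.log
  EOF
)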

On Tue, Jul 5, 2016 at 10:16 AM, Goncalo Borges <
goncalo.bor...@sydney.edu.au> wrote:

> Dear All...
>
> We have recently migrated all our ceph infrastructure from 9.2.0 to 10.2.2.
>
> We are currently using ceph-fuse to mount cephfs in a number of clients.
>
> ceph-fuse 10.2.2 client is segfaulting in some situations. One of the
> scenarios where ceph-fuse segfaults is when a user submits a parallel (mpi)
> application requesting 4 hosts with 4 cores each (16 instances in total) .
> According to the user, each instance has its own dedicated inputs and
> outputs.
>
> Please note that if we go back to ceph-fuse 9.2.0 client everything works
> fine.
>
> The ceph-fuse 10.2.2 client segfault is the following (we were able to
> capture it mounting ceph-fuse in debug mode):
>
> 2016-07-04 21:21:00.074087 7f6aed92be40  0 ceph version 10.2.2
> (45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-fuse, pid 7346
> ceph-fuse[7346]: starting ceph client
> 2016-07-04 21:21:00.107816 7f6aed92be40 -1 init, newargv = 0x7f6af8c12320
> newargc=11
> ceph-fuse[7346]: starting fuse
> *** Caught signal (Segmentation fault) **
>  in thread 7f69d7fff700 thread_name:ceph-fuse
>  ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>  1: (()+0x297ef2) [0x7f6aedbecef2]
>  2: (()+0x3b88c0f7e0) [0x7f6aec64b7e0]
>  3: (Client::get_root_ino()+0x10) [0x7f6aedaf0330]
>  4: (CephFuse::Handle::make_fake_ino(inodeno_t, snapid_t)+0x175)
> [0x7f6aedaee035]
>  5: (()+0x199891) [0x7f6aedaee891]
>  6: (()+0x15b76) [0x7f6aed50db76]
>  7: (()+0x12aa9) [0x7f6aed50aaa9]
>  8: (()+0x3b88c07aa1) [0x7f6aec643aa1]
>  9: (clone()+0x6d) [0x7f6aeb8d193d]
> 2016-07-05 10:09:14.045131 7f69d7fff700 -1 *** Caught signal (Segmentation
> fault) **
>  in thread 7f69d7fff700 thread_name:ceph-fuse
>
>  ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>  1: (()+0x297ef2) [0x7f6aedbecef2]
>  2: (()+0x3b88c0f7e0) [0x7f6aec64b7e0]
>  3: (Client::get_root_ino()+0x10) [0x7f6aedaf0330]
>  4: (CephFuse::Handle::make_fake_ino(inodeno_t, snapid_t)+0x175)
> [0x7f6aedaee035]
>  5: (()+0x199891) [0x7f6aedaee891]
>  6: (()+0x15b76) [0x7f6aed50db76]
>  7: (()+0x12aa9) [0x7f6aed50aaa9]
>  8: (()+0x3b88c07aa1) [0x7f6aec643aa1]
>  9: (clone()+0x6d) [0x7f6aeb8d193d]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed
> to interpret this.
>
> The full dump is quite long. Here are the very last bits of it. Let me
> know if you need the full dump.
>
> --- begin dump of recent events ---
>  -> 2016-07-05 10:09:13.956502 7f6a5700  3 client.464559
> _getxattr(137c789, "security.capability", 0) = -61
>  -9998> 2016-07-05 10:09:13.956507 7f6aa96fa700  3 client.464559 ll_write
> 0x7f6a08028be0 137c78c 20094~34
>  -9997> 2016-07-05 10:09:13.956527 7f6aa96fa700  3 client.464559 ll_write
> 0x7f6a08028be0 20094~34 = 34
>  -9996> 2016-07-05 10:09:13.956535 7f69d7fff700  3 client.464559 ll_write
> 0x7f6a100145f0 137c78d 28526~34
>  -9995> 2016-07-05 10:09:13.956553 7f69d7fff700  3 client.464559 ll_write
> 0x7f6a100145f0 28526~34 = 34
>  -9994> 2016-07-05 10:09:13.956561 7f6ac0dfa700  3 client.464559 ll_forget
> 137c78c 1
>  -9993> 2016-07-05 10:09:13.956569 7f6a5700  3 client.464559 ll_forget
> 137c789 1
>  -9992> 2016-07-05 10:09:13.956577 7f6a5ebfd700  3 client.464559 ll_write
> 0x7f6a94006350 137c789 22010~216
>  -9991> 2016-07-05 10:09:13.956594 7f6a5ebfd700  3 client.464559 ll_write
> 0x7f6a94006350 22010~216 = 216
>  -9990> 2016-07-05 10:09:13.956603 7f6aa8cf9700  3 client.464559
> ll_getxattr 137c78c.head security.capability size 0
>  -9989> 2016-07-05 10:09:13.956609 7f6aa8cf9700  3 client.464559
> _getxattr(137c78c, "security.capability", 0) = -61
>
> 
>
>   -160> 2016-07-05 10:09:14.043687 7f69d7fff700  3 client.464559
> _getxattr(137c78a, "security.capability", 0) = -61
>   -159> 2016-07-05 10:09:14.043694 7f6ac0dfa700  3 client.464559 ll_write
> 0x7f6a08042560 137c78b 11900~34
>   -158> 2016-07-05 10:09:14.043712 7f6ac0dfa700  3 client.464559 ll_write
> 0x7f6a08042560 11900~34 = 34
>   -157> 2016-07-05 10:09:14.043722 7f6ac17fb700  3 client.464559
> ll_getattr 11e9c80.head
>   -156> 2016-07-05 10:09:14.043727 7f6ac17fb700  3 client.464559
> ll_getattr 11e9c80.head = 0
>   -155> 2016-07-05 10:09:14.043734 7f69d7fff700  3 client.464559 ll_forget
> 137c78a 1
>   -154> 2016-07-05 10:09:14.043738 7f6a5ebfd700  3 client.464559 ll_write
> 0x7f6a140d5930 137c78a 18292~34
>   -153> 2016-07-05 10:09:14.043759 7f6a5ebfd700  3 client.464559 ll_write
> 0x7f6a140d5930 18292~34 = 34
>   -152> 2016-07-05 10:09:14.043767 7f6ac17fb700  3 client.464559 ll_forget
> 11e9c80 1
>   -151> 2016-07-05 10:09:14.043784 7f6aa8cf9700  3 client.464559 ll_flush
> 0x7f6a00049fe0 11e9c80
>   -150> 2016-07-05 10:09:14.043794 7f6aa8cf9700  3 client.464559
> ll_getxattr 137c78a.head security.capability size 0
>   -149> 2016-07-05 10:09:14.043799 7f6aa8cf9700  3 

[ceph-users] ceph-fuse segfaults ( jewel 10.2.2)

2016-07-04 Thread Goncalo Borges

Dear All...

We have recently migrated all our ceph infrastructure from 9.2.0 to 10.2.2.

We are currently using ceph-fuse to mount cephfs in a number of clients.

ceph-fuse 10.2.2 client is segfaulting in some situations. One of the 
scenarios where ceph-fuse segfaults is when a user submits a parallel 
(MPI) application requesting 4 hosts with 4 cores each (16 instances in 
total). According to the user, each instance has its own dedicated 
inputs and outputs.


Please note that if we go back to ceph-fuse 9.2.0 client everything 
works fine.


The ceph-fuse 10.2.2 client segfault is the following (we were able to 
capture it mounting ceph-fuse in debug mode):


   2016-07-04 21:21:00.074087 7f6aed92be40  0 ceph version 10.2.2
   (45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-fuse, pid 7346
   ceph-fuse[7346]: starting ceph client
   2016-07-04 21:21:00.107816 7f6aed92be40 -1 init, newargv =
   0x7f6af8c12320 newargc=11
   ceph-fuse[7346]: starting fuse
   *** Caught signal (Segmentation fault) **
 in thread 7f69d7fff700 thread_name:ceph-fuse
 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (()+0x297ef2) [0x7f6aedbecef2]
 2: (()+0x3b88c0f7e0) [0x7f6aec64b7e0]
 3: (Client::get_root_ino()+0x10) [0x7f6aedaf0330]
 4: (CephFuse::Handle::make_fake_ino(inodeno_t, snapid_t)+0x175)
   [0x7f6aedaee035]
 5: (()+0x199891) [0x7f6aedaee891]
 6: (()+0x15b76) [0x7f6aed50db76]
 7: (()+0x12aa9) [0x7f6aed50aaa9]
 8: (()+0x3b88c07aa1) [0x7f6aec643aa1]
 9: (clone()+0x6d) [0x7f6aeb8d193d]
   2016-07-05 10:09:14.045131 7f69d7fff700 -1 *** Caught signal
   (Segmentation fault) **
 in thread 7f69d7fff700 thread_name:ceph-fuse

 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (()+0x297ef2) [0x7f6aedbecef2]
 2: (()+0x3b88c0f7e0) [0x7f6aec64b7e0]
 3: (Client::get_root_ino()+0x10) [0x7f6aedaf0330]
 4: (CephFuse::Handle::make_fake_ino(inodeno_t, snapid_t)+0x175)
   [0x7f6aedaee035]
 5: (()+0x199891) [0x7f6aedaee891]
 6: (()+0x15b76) [0x7f6aed50db76]
 7: (()+0x12aa9) [0x7f6aed50aaa9]
 8: (()+0x3b88c07aa1) [0x7f6aec643aa1]
 9: (clone()+0x6d) [0x7f6aeb8d193d]
 NOTE: a copy of the executable, or `objdump -rdS ` is
   needed to interpret this.


The full dump is quite long. Here are the very last bits of it. Let me 
know if you need the full dump.


   --- begin dump of recent events ---
 -> 2016-07-05 10:09:13.956502 7f6a5700  3 client.464559
   _getxattr(137c789, "security.capability", 0) = -61
 -9998> 2016-07-05 10:09:13.956507 7f6aa96fa700  3 client.464559
   ll_write 0x7f6a08028be0 137c78c 20094~34
 -9997> 2016-07-05 10:09:13.956527 7f6aa96fa700  3 client.464559
   ll_write 0x7f6a08028be0 20094~34 = 34
 -9996> 2016-07-05 10:09:13.956535 7f69d7fff700  3 client.464559
   ll_write 0x7f6a100145f0 137c78d 28526~34
 -9995> 2016-07-05 10:09:13.956553 7f69d7fff700  3 client.464559
   ll_write 0x7f6a100145f0 28526~34 = 34
 -9994> 2016-07-05 10:09:13.956561 7f6ac0dfa700  3 client.464559
   ll_forget 137c78c 1
 -9993> 2016-07-05 10:09:13.956569 7f6a5700  3 client.464559
   ll_forget 137c789 1
 -9992> 2016-07-05 10:09:13.956577 7f6a5ebfd700  3 client.464559
   ll_write 0x7f6a94006350 137c789 22010~216
 -9991> 2016-07-05 10:09:13.956594 7f6a5ebfd700  3 client.464559
   ll_write 0x7f6a94006350 22010~216 = 216
 -9990> 2016-07-05 10:09:13.956603 7f6aa8cf9700  3 client.464559
   ll_getxattr 137c78c.head security.capability size 0
 -9989> 2016-07-05 10:09:13.956609 7f6aa8cf9700  3 client.464559
   _getxattr(137c78c, "security.capability", 0) = -61

   

  -160> 2016-07-05 10:09:14.043687 7f69d7fff700  3 client.464559
   _getxattr(137c78a, "security.capability", 0) = -61
  -159> 2016-07-05 10:09:14.043694 7f6ac0dfa700  3 client.464559
   ll_write 0x7f6a08042560 137c78b 11900~34
  -158> 2016-07-05 10:09:14.043712 7f6ac0dfa700  3 client.464559
   ll_write 0x7f6a08042560 11900~34 = 34
  -157> 2016-07-05 10:09:14.043722 7f6ac17fb700  3 client.464559
   ll_getattr 11e9c80.head
  -156> 2016-07-05 10:09:14.043727 7f6ac17fb700  3 client.464559
   ll_getattr 11e9c80.head = 0
  -155> 2016-07-05 10:09:14.043734 7f69d7fff700  3 client.464559
   ll_forget 137c78a 1
  -154> 2016-07-05 10:09:14.043738 7f6a5ebfd700  3 client.464559
   ll_write 0x7f6a140d5930 137c78a 18292~34
  -153> 2016-07-05 10:09:14.043759 7f6a5ebfd700  3 client.464559
   ll_write 0x7f6a140d5930 18292~34 = 34
  -152> 2016-07-05 10:09:14.043767 7f6ac17fb700  3 client.464559
   ll_forget 11e9c80 1
  -151> 2016-07-05 10:09:14.043784 7f6aa8cf9700  3 client.464559
   ll_flush 0x7f6a00049fe0 11e9c80
  -150> 2016-07-05 10:09:14.043794 7f6aa8cf9700  3 client.464559
   ll_getxattr 137c78a.head security.capability size 0
  -149> 2016-07-05 10:09:14.043799 7f6aa8cf9700  3 client.464559
   

Re: [ceph-users] Reply: Fwd: how to fix the mds damaged issue

2016-07-04 Thread Shinobu Kinjo
Reproduce with 'debug mds = 20' and 'debug ms = 20'.
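
Roughly, and combined with John's "ceph mds repaired 0" suggestion quoted
below (the conf snippet and the init command are assumptions -- adjust for
your mds id and init system):

  cat >> /etc/ceph/ceph.conf <<'EOF'
  [mds]
      debug mds = 20
      debug ms = 20
  EOF
  ceph mds repaired 0          # clear the damaged flag on rank 0
  start ceph-mds-all           # upstart, as used above; or: systemctl start ceph-mds@<id>
  tail -f /var/log/ceph/ceph-mds.*.log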

 shinobu

On Mon, Jul 4, 2016 at 9:42 PM, Lihang  wrote:

> Thank you very much for your advice. The command "ceph mds repaired 0"
> worked fine in my cluster; the cluster state became HEALTH_OK and the cephfs
> state returned to normal as well. But the monitor and mds log files just
> record the replay and recovery process without pointing out anything
> abnormal, and I no longer have the logs from when this issue happened, so I
> have not found the root cause of this issue yet. I'll try to reproduce it.
> Thank you very much again!
> fisher
>
> -Original Message-
> From: John Spray [mailto:jsp...@redhat.com]
> Sent: 4 July 2016 17:49
> To: lihang 12398 (RD)
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Fwd: how to fix the mds damaged issue
>
> On Sun, Jul 3, 2016 at 8:06 AM, Lihang  wrote:
> > root@BoreNode2:~# ceph -v
> >
> > ceph version 10.2.0
> >
> >
> >
> > From: lihang 12398 (RD)
> > Sent: 3 July 2016 14:47
> > To: ceph-users@lists.ceph.com
> > Cc: Ceph Development; 'uker...@gmail.com'; zhengbin 08747 (RD);
> > xusangdi 11976 (RD)
> > Subject: how to fix the mds damaged issue
> >
> >
> >
> > Hi, my ceph cluster mds is damaged and the cluster is degraded after
> > our machines lost power suddenly. The cluster is now "HEALTH_ERR" and
> > cannot recover to health by itself, even after I reboot the storage
> > node or restart the ceph cluster. I then used the following commands
> > to remove the damaged mds, but the removal failed and the issue
> > persists. The other two mds are in the standby state. Can anyone tell
> > me how to fix this issue and find out what happened in my cluster?
> >
> > The process I used to remove the damaged mds on the storage node was as follows.
> >
> > 1> Execute the "stop ceph-mds-all" command on the damaged mds node
> >
> > 2>  ceph mds rmfailed 0 --yes-i-really-mean-it
>
> rmfailed is not something you want to use in these circumstances.
>
> > 3>  root@BoreNode2:~# ceph  mds rm 0
> >
> > mds gid 0 dne
> >
> >
> >
> > The detailed status of my cluster is as follows:
> >
> > root@BoreNode2:~# ceph -s
> >
> >   cluster 98edd275-5df7-414f-a202-c3d4570f251c
> >
> >  health HEALTH_ERR
> >
> > mds rank 0 is damaged
> >
> > mds cluster is degraded
> >
> >  monmap e1: 3 mons at
> > {BoreNode2=172.16.65.141:6789/0,BoreNode3=172.16.65.142:6789/0,BoreNod
> > e4=172.16.65.143:6789/0}
> >
> > election epoch 1010, quorum 0,1,2
> > BoreNode2,BoreNode3,BoreNode4
> >
> >   fsmap e168: 0/1/1 up, 3 up:standby, 1 damaged
> >
> >  osdmap e338: 8 osds: 8 up, 8 in
> >
> > flags sortbitwise
> >
> >   pgmap v17073: 1560 pgs, 5 pools, 218 kB data, 32 objects
> >
> > 423 MB used, 3018 GB / 3018 GB avail
> >
> > 1560 active+clean
>
> When an MDS rank is marked as damaged, that means something invalid was
> found when reading from the pool storing metadata objects.  The next step
> is to find out what that was.  Look in the MDS log and in ceph.log from the
> time when it went damaged, to find the most specific error message you can.
>
> If you do not have the logs and want to have the MDS try operating again
> (to reproduce whatever condition caused it to be marked damaged), you can
> enable it by using "ceph mds repaired 0", then start the daemon and see how
> it is failing.
>
> John
>
> > root@BoreNode2:~# ceph mds dump
> >
> > dumped fsmap epoch 168
> >
> > fs_name TudouFS
> >
> > epoch   156
> >
> > flags   0
> >
> > created 2016-04-02 02:48:11.150539
> >
> > modified    2016-04-03 03:04:57.347064
> >
> > tableserver 0
> >
> > root0
> >
> > session_timeout 60
> >
> > session_autoclose   300
> >
> > max_file_size   1099511627776
> >
> > last_failure0
> >
> > last_failure_osd_epoch  83
> >
> > compat  compat={},rocompat={},incompat={1=base v0.20,2=client
> > writeable ranges,3=default file layouts on dirs,4=dir inode in
> > separate object,5=mds uses versioned encoding,6=dirfrag is stored in
> > omap,8=file layout v2}
> >
> > max_mds 1
> >
> > in  0
> >
> > up  {}
> >
> > failed
> >
> > damaged 0
> >
> > stopped
> >
> > data_pools  4
> >
> > metadata_pool   3
> >
> > inline_data disabled
> >
> > --
> > ---
> > This e-mail and its attachments contain confidential information from
> > H3C, which is intended only for the person or entity whose address is
> > listed above. Any use of the information contained herein in any way
> > (including, but not limited to, total or partial disclosure,
> > reproduction, or dissemination) by persons other than the intended
> > recipient(s) is prohibited. If you 

Re: [ceph-users] suse_enterprise_storage3_rbd_LIO_vmware_performance_bad

2016-07-04 Thread Alex Gorbachev
Hi Nick,


On Fri, Jul 1, 2016 at 2:11 PM, Nick Fisk  wrote:



> However, there are a number of pain points with iSCSI + ESXi + RBD and they 
> all mainly centre on write latency. It seems VMFS was designed around the 
> fact that Enterprise storage arrays service writes in 10-100us, whereas Ceph 
> will service them in 2-10ms.
>
> 1. Thin Provisioning makes things slow. I believe the main cause is that when 
> growing and zeroing the new blocks, metadata needs to be updated and the 
> block zero'd. Both issue small IO which would normally not be a problem, but 
> with Ceph it becomes a bottleneck to overall IO on the datastore.
>
> 2. Snapshots effectively turn all IO into 64kb IO's. Again a traditional SAN 
> will coalesce these back into a stream of larger IO's before committing to 
> disk. However with Ceph each IO takes 2-10ms and so everything seems slow. 
> The future feature of persistent RBD cache may go a long way to helping with 
> this.

Are you referring to ESXi snapshots?  Specifically, if a VM is running
off a snapshot 
(https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1015180),
its IO will drop to 64KB "grains"?

> 3. >2TB VMDK's with snapshots use a different allocation mode, which happens 
> in 4kb chunks instead of 64kb ones. This makes the problem 16 times worse 
> than above.
>
> 4. Any of the above will also apply when migrating machines around, so VM's 
> can takes hours/days to move.
>
> 5. If you use FILEIO, you can't use thin provisioning. If you use BLOCKIO, 
> you get thin provisioning, but no pagecache or readahead, so performance can 
> nose dive if this is needed.

Would FILEIO not also leverage the Linux scheduler to do IO coalescing and
help with (2), since FILEIO also uses the dirty-flush mechanism in the page
cache (and makes IO somewhat crash-unsafe at the same time)?

> 6. iSCSI is very complicated (especially ALUA) and sensitive. Get used to 
> seeing APD/PDL even when you think you have finally got everything working 
> great.

We were used to seeing APD/PDL all the time with LIO, but have seen pretty
much none with SCST > 3.1.  Most of the ESXi problems are just with high
latency periods, which are not a problem for the hypervisor itself, but
rather for the databases or applications inside the VMs.

Thanks,
Alex

>
>
> Normal IO from eager zeroed VM's with no snapshots, however should perform 
> ok. So depends what your workload is.
>
>
> And then comes NFS. It's very easy to setup, very easy to configure for HA, 
> and works pretty well overall. You don't seem to get any of the IO size 
> penalties when using snapshots. If you mount with discard, thin provisioning 
> is done by Ceph. You can defragment the FS on the proxy node and several 
> other things that you can't do with VMFS. Just make sure you run the server 
> in sync mode to avoid data loss.
>
> The only downside is that every IO causes an IO to the FS and one to the FS 
> journal, so you effectively double your IO. But if your Ceph backend can 
> support it, then it shouldn't be too much of a problem.
>
> Now to the original poster, assuming the iSCSI node is just kernel mounting 
> the RBD, I would run iostat on it, to try and see what sort of latency you 
> are seeing at that point. Also do the same with esxtop +u, and look at the 
> write latency there, both whilst running the fio in the VM. This should 
> hopefully let you see if there is just a gradual increase as you go from hop 
> to hop or if there is an obvious culprit.
>
> Can you also confirm your kernel version?
>
> With 1GB networking I think you will struggle to get your write latency much 
> below 10-15ms, but from your example ~30ms is still a bit high. I wonder if 
> the default queue depths on your iSCSI target are too low as well?
>
> Nick
>
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Oliver Dzombic
>> Sent: 01 July 2016 09:27
>> To: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users]
>> suse_enterprise_storage3_rbd_LIO_vmware_performance_bad
>>
>> Hi,
>>
>> my experience:
>>
>> ceph + iscsi ( multipath ) + vmware == worst
>>
>> Better you search for another solution.
>>
>> vmware + nfs + vmware might have a much better performance.
>>
>> 
>>
>> If you are able to get vmware running with iscsi and ceph, i would be
>> >>very<< interested in what/how you did that.
>>
>> --
>> Mit freundlichen Gruessen / Best regards
>>
>> Oliver Dzombic
>> IP-Interactive
>>
>> mailto:i...@ip-interactive.de
>>
>> Anschrift:
>>
>> IP Interactive UG ( haftungsbeschraenkt ) Zum Sonnenberg 1-3
>> 63571 Gelnhausen
>>
>> HRB 93402 beim Amtsgericht Hanau
>> Geschäftsführung: Oliver Dzombic
>>
>> Steuer Nr.: 35 236 3622 1
>> UST ID: DE274086107
>>
>>
>> On 01.07.2016 at 07:04, mq wrote:
>> > Hi list
>> > I have tested suse enterprise storage3 using 2 iscsi  gateway attached
>> > to  vmware. The performance is bad.  I have turn off  VAAI 

Re: [ceph-users] Is anyone seeing issues with task_numa_find_cpu?

2016-07-04 Thread Alex Gorbachev
On Wed, Jun 29, 2016 at 5:41 AM, Campbell Steven  wrote:
> Hi Alex/Stefan,
>
> I'm in the middle of testing 4.7rc5 on our test cluster to confirm
> once and for all that this particular issue has been completely resolved by
> Peter's recent patch to sched/fair.c referred to by Stefan above. For
> us anyway the patches that Stefan applied did not solve the issue and
> neither did any 4.5.x or 4.6.x released kernel thus far; hopefully it
> does the trick for you. We could get about 4 hours of uptime before
> things went haywire for us.
>
> It's interesting how it seems the CEPH workload triggers this bug so
> well as it's quite a long standing issue that's only just been
> resolved, another user chimed in on the lkml thread a couple of days
> ago as well and again his trace had ceph-osd in it as well.
>
> https://lkml.org/lkml/headers/2016/6/21/491
>
> Campbell

Campbell, any luck with testing 4.7rc5?  rc6 came out just now, and I
am having trouble booting it on an ubuntu box due to some other
unrelated problem.  So dropping to kernel 4.2.0 for now, which does
not seem to have this load related problem.

I looked at the fair.c code in kernel source tree 4.4.14 and it is
quite different than Peter's patch (assuming 4.5.x source), so the
patch does not apply cleanly.  Maybe another 4.4.x kernel will get the
update.
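
For anyone trying the same thing, the rough shape in a mainline kernel
checkout would be something like the below -- the short hashes are the ones
Stefan lists further down, and as noted they may not apply cleanly on a
4.4.x base:

  cd linux && git checkout -b sched-fixes v4.4.14
  git cherry-pick 2b8c41daba32   # sched/fair: Initiate a new task's util avg to a bounded value
  git cherry-pick 40ed9cba24bb   # sched/fair: Fix post_init_entity_util_avg() serialization
  # plus the third fix from https://lkml.org/lkml/diff/2016/6/22/102/1,
  # applied with git am or patch -p1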

Thanks,
Alex



>
> On 29 June 2016 at 18:29, Stefan Priebe - Profihost AG
>  wrote:
>>
>> On 29.06.2016 at 04:30, Alex Gorbachev wrote:
>>> Hi Stefan,
>>>
>>> On Tue, Jun 28, 2016 at 1:46 PM, Stefan Priebe - Profihost AG
>>>  wrote:
 Please be aware that you may need even more patches. Overall this needs 3
 patches. Where the first two try to fix a bug and the 3rd one fixes the
 fixes + even more bugs related to the scheduler. I've no idea on which 
 patch
 level Ubuntu is.
>>>
>>> Stefan, would you be able to please point to the other two patches
>>> beside https://lkml.org/lkml/diff/2016/6/22/102/1 ?
>>
>> Sorry sure yes:
>>
>> 1. 2b8c41daba32 ("sched/fair: Initiate a new task's util avg to a
>> bounded value")
>>
>> 2.) 40ed9cba24bb7e01cc380a02d3f04065b8afae1d ("sched/fair: Fix
>> post_init_entity_util_avg() serialization")
>>
>> 3.) the one listed at lkml.
>>
>> Stefan
>>
>>>
>>> Thank you,
>>> Alex
>>>

 Stefan

 Excuse my typo sent from my mobile phone.

 On 28.06.2016 at 17:59, Tim Bishop wrote:

 Yes - I noticed this today on Ubuntu 16.04 with the default kernel. No
 useful information to add other than it's not just you.

 Tim.

 On Tue, Jun 28, 2016 at 11:05:40AM -0400, Alex Gorbachev wrote:

 After upgrading to kernel 4.4.13 on Ubuntu, we are seeing a few of

 these issues where an OSD would fail with the stack below.  I logged a

 bug at https://bugzilla.kernel.org/show_bug.cgi?id=121101 and there is

 a similar description at https://lkml.org/lkml/2016/6/22/102, but the

 odd part is we have turned off CFQ and blk-mq/scsi-mq and are using

 just the noop scheduler.


 Does the ceph kernel code somehow use the fair scheduler code block?


 Thanks

 --

 Alex Gorbachev

 Storcium


 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.684974] CPU: 30 PID:

 10403 Comm: ceph-osd Not tainted 4.4.13-040413-generic #201606072354

 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.684991] Hardware name:

 Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.2

 03/04/2015

 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685009] task:

 880f79df8000 ti: 880f79fb8000 task.ti: 880f79fb8000

 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685024] RIP:

 0010:[]  []

 task_numa_find_cpu+0x22e/0x6f0

 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685051] RSP:

 0018:880f79fbb818  EFLAGS: 00010206

 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685063] RAX:

  RBX: 880f79fbb8b8 RCX: 

 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685076] RDX:

  RSI:  RDI: 8810352d4800

 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685107] RBP:

 880f79fbb880 R08: 0001020cf87c R09: 00ff00ff

 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685150] R10:

 0009 R11: 0006 R12: 8807c3adc4c0

 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685194] R13:

 0006 R14: 033e R15: fec7

 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685238] FS:

 7f30e46b8700() GS:88105f58()

 knlGS:

 Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685283] CS:  0010 DS:

  ES: 

Re: [ceph-users] mds standby + standby-reply upgrade

2016-07-04 Thread Dzianis Kahanovich
Gregory Farnum writes:
> On Thu, Jun 30, 2016 at 1:03 PM, Dzianis Kahanovich  wrote:
>> Upgraded infernalis->jewel (git, Gentoo). Upgrade passed over global
>> stop/restart everything oneshot.
>>
>> Infernalis: e5165: 1/1/1 up {0=c=up:active}, 1 up:standby-replay, 1 
>> up:standby
>>
>> Now after upgrade start and next mon restart, active monitor falls with
>> "assert(info.state == MDSMap::STATE_STANDBY)" (even without running mds) . 
>> Fixed:
>>
>> --- a/src/mon/MDSMonitor.cc 2016-06-27 21:26:26.0 +0300
>> +++ b/src/mon/MDSMonitor.cc 2016-06-28 10:44:32.0 +0300
>> @@ -2793,7 +2793,11 @@ bool MDSMonitor::maybe_promote_standby(s
>>  for (const auto &j : pending_fsmap.standby_daemons) {
>>    const auto &gid = j.first;
>>    const auto &info = j.second;
>> -  assert(info.state == MDSMap::STATE_STANDBY);
>> +//  assert(info.state == MDSMap::STATE_STANDBY);
>> +  if (info.state != MDSMap::STATE_STANDBY) {
>> +dout(0) << "gid " << gid << " ex-assert(info.state ==
>> MDSMap::STATE_STANDBY) " << do_propose << dendl;
>> +   return do_propose;
>> +  }
>>
>>if (!info.standby_replay) {
>>  continue;
>>
>>
>> Now: e5442: 1/1/1 up {0=a=up:active}, 1 up:standby
>> - but really there are 3 mds (active, replay, standby).
>>
>> # ceph mds dump
>> dumped fsmap epoch 5442
>> fs_name cephfs
>> epoch   5441
>> flags   0
>> created 2016-04-10 23:44:38.858769
>> modified    2016-06-27 23:08:26.211880
>> tableserver 0
>> root0
>> session_timeout 60
>> session_autoclose   300
>> max_file_size   1099511627776
>> last_failure5239
>> last_failure_osd_epoch  18473
>> compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable
>> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds 
>> uses
>> versioned encoding,6=dirfrag is stored in omap,8=no anchor table}
>> max_mds 1
>> in  0
>> up  {0=3104110}
>> failed
>> damaged
>> stopped
>> data_pools  5
>> metadata_pool   6
>> inline_data disabled
>> 3104110:10.227.227.103:6800/14627 'a' mds.0.5436 up:active seq 30
>> 3084126:10.227.227.104:6800/24069 'c' mds.0.0 up:standby-replay seq 1
>>
>>
>> If standby-replay false - all OK: 1/1/1 up {0=a=up:active}, 2 up:standby
>>
>> How to fix this 3-mds behaviour?
> 
> Ah, you hit a known bug with that assert. I thought the fix was
> already in the latest point release; are you behind?
> -Greg
> 

Checked the logs - observed in version 10.2.2-45-g9aafefe
(9aafefeab6b0f01d7467f70cb2f1b16ae88340e8) - the latest git jewel branch as of 27.06.
Which point release has the fix?

-- 
WBR, Dzianis Kahanovich AKA Denis Kaganovich, http://mahatma.bspu.by/


Re: [ceph-users] suse_enterprise_storage3_rbd_LIO_vmware_performance_bad

2016-07-04 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Lars Marowsky-Bree
> Sent: 04 July 2016 11:36
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users]
> suse_enterprise_storage3_rbd_LIO_vmware_performance_bad
> 
> On 2016-07-01T19:11:34, Nick Fisk  wrote:
> 
> > To summarise,
> >
> > LIO is just not working very well at the moment because of the ABORT
> > Tasks problem, this will hopefully be fixed at some point. I'm not
> > sure if SUSE works around this, but see below for other pain points
> > with RBD + ESXi + iSCSI
> 
> Yes, the SUSE kernel has recent backports that fix these bugs. And there's
> obviously on-going work to improve the performance and code.
> 
> That's not to say that I'd advocate iSCSI as a primary access mechanism for
> Ceph. But the need to interface from non-Linux systems to a Ceph cluster is
> unfortunately very real.
> 
> > With 1GB networking I think you will struggle to get your write latency
> > much below 10-15ms, but from your example ~30ms is still a bit high. I
> > wonder if the default queue depths on your iSCSI target are too low as well?
> 
> Thanks for all the insights on the performance issues. You're really quite
> spot on.

Thanks, it's been a painful experience working through them all, but have
learnt a lot along the way.

> 
> The main concern here obviously is that the same 2x1GbE network is carrying
> both the client/ESX traffic, the iSCSI target to OSD traffic, and the OSD
> backend traffic. That is not advisable.
> 
> 
> Regards,
> Lars
> 
> --
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
> HRB 21284 (AG Nürnberg) "Experience is the name everyone gives to their
> mistakes." -- Oscar Wilde
> 


Re: [ceph-users] RBD mirroring between a IPv6 and IPv4 Cluster

2016-07-04 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Wido den Hollander
> Sent: 04 July 2016 14:34
> To: ceph-users@lists.ceph.com; n...@fisk.me.uk
> Subject: Re: [ceph-users] RBD mirroring between a IPv6 and IPv4 Cluster
> 
> 
> > On 4 July 2016 at 9:25, Nick Fisk wrote:
> >
> >
> > Hi All,
> >
> > Quick question. I'm currently in the process of getting ready to
> > deploy a 2nd cluster, which at some point in the next 12 months, I
> > will want to enable RBD mirroring between the new and existing
> > clusters. I'm leaning towards deploying this new cluster with IPv6,
> > because Wido says so ;-)
> >
> 
> Good job! More IPv6 is better :)
> 
> > Question is, will RBD mirroring still be possible between the two? I
> > know you can't dual stack the core Ceph components, but does RBD
> > mirroring have the same limitations?
> >
> 
> I haven't touched it yet, but looking at the docs it seems it will:
> http://docs.ceph.com/docs/master/rbd/rbd-mirroring/
> 
> "The cluster name in the following examples corresponds to a Ceph
> configuration file of the same name (e.g. /etc/ceph/remote.conf). See the
> ceph-conf documentation for how to configure multiple clusters."
> 
> So, in 'remote.conf' you can add the IPv6 addresses of a cluster running
on
> IPv6 and on 'ceph.conf' the IPv4 addresses.
> 
> The rbd-mirror daemon will eventually talk to librbd/librados which will
act as
> a 'proxy' between the two clusters.
> 
> I think it works, but that's just based on reading the docs and prior
> knowledge.

Ok, thanks Wido. It will be a while before we have this all set up, but I
will report back to confirm.

> 
> Wido
> 
> > Thanks,
> > Nick
> >


Re: [ceph-users] Ceph Rebalance Issue

2016-07-04 Thread Wido den Hollander

> On 3 July 2016 at 11:34, Roozbeh Shafiee wrote:
> 
> 
> Actually I tried all the approaches I found in the Ceph docs and on the
> mailing lists, but none of them had any effect. As a last resort I changed
> pg/pgp.
> 
> Anyway… What can I do as the best way to solve this problem?
> 

Did you try to restart some of the OSDs on which recovery is hanging? Does that 
help anything?
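
(On Hammer with CentOS 7 that is usually something like the below; adjust for
how your OSDs are actually managed.)

  /etc/init.d/ceph restart osd.27     # sysvinit wrapper shipped with Hammer
  # or, on setups that already have systemd units:
  systemctl restart ceph-osd@27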

Wido

> Thanks
> 
> > On Jul 3, 2016, at 1:43 PM, Wido den Hollander  wrote:
> > 
> > 
> >> On 3 July 2016 at 11:02, Roozbeh Shafiee wrote:
> >> 
> >> 
> >> Yes, you're right, but I had 0 objects/s recovery last night. When I
> >> changed pg/pgp from 1400 to 2048, rebalancing sped up but the rebalancing
> >> percentage went back to 53%.
> >> 
> > 
> > Why did you change that? I would not change that value while a cluster is 
> > still in recovery.
> > 
> >> I have run into this situation again and again since I dropped the failed
> >> OSD and increased pg/pgp, but each time rebalancing stops at 0 objects/s
> >> with a low transfer speed.
> >> 
> > 
> > Hard to judge at this point. You might want to try and restart osd.27 and 
> > see if that gets things going again. It seems to be involved in many PGs 
> > which are in 'backfilling' state.
> > 
> > Wido
> > 
> >> Thanks
> >> 
> >>> On Jul 3, 2016, at 1:25 PM, Wido den Hollander  wrote:
> >>> 
> >>> 
>  On 3 July 2016 at 10:50, Roozbeh Shafiee  wrote:
>  
>  
>  Thanks for quick response, Wido
>  
>  the "ceph -s" output has pasted here:
>  http://pastie.org/10897747
>  
>  and this is output of “ceph health detail”:
>  http://pastebin.com/vMeURWC9
>  
> >>> 
> >>> It seems the cluster is still backfilling PGs and your 'ceph -s' shows so: 
> >>> 'recovery io 62375 kB/s, 15 objects/s'
> >>> 
> >>> It will just take some time before it finishes.
> >>> 
> >>> Wido
> >>> 
>  Thank you
>  
> > On Jul 3, 2016, at 1:10 PM, Wido den Hollander  wrote:
> > 
> > 
> >> On 3 July 2016 at 10:34, Roozbeh Shafiee  wrote:
> >> 
> >> 
> >> Hi list,
> >> 
> >> A few days ago one of my OSDs failed and I dropped it out, but I have 
> >> had HEALTH_WARN ever since. After turning off the OSD, the self-healing 
> >> system started to rebalance data between the other OSDs.
> >> 
> >> My question is: at the end of rebalancing, the process doesn’t complete 
> >> and I get this message at the end of the “ceph -s” output:
> >> 
> >> recovery io 1456 KB/s, 0 object/s
> >> 
> > 
> > Could you post the exact output of 'ceph -s'?
> > 
> > There is something more which needs to be shown.
> > 
> > 'ceph health detail' also might tell you more.
> > 
> > Wido
> > 
> >> how can I get back to HEALTH_OK again?
> >> 
> >> My cluster details are:
> >> 
> >> - 27 OSDs
> >> - 3 MONs
> >> - 2048 pg/pgs
> >> - Each OSD has 4 TB of space
> >> - CentOS 7.2 with 3.10 linux kernel
> >> - Ceph Hammer version
> >> 
> >> Thank you,
> >> Roozbeh___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>  
> >> 
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD mirroring between an IPv6 and IPv4 Cluster

2016-07-04 Thread Wido den Hollander

> On 4 July 2016 at 9:25, Nick Fisk  wrote:
> 
> 
> Hi All,
> 
> Quick question. I'm currently in the process of getting ready to deploy a
> 2nd cluster, which at some point in the next 12 months, I will want to
> enable RBD mirroring between the new and existing clusters. I'm leaning
> towards deploying this new cluster with IPv6, because Wido says so ;-) 
> 

Good job! More IPv6 is better :)

> Question is, will RBD mirroring still be possible between the two? I know
> you can't dual stack the core Ceph components, but does RBD mirroring have
> the same limitations?
> 

I haven't touched it yet, but looking at the docs it seems it will: 
http://docs.ceph.com/docs/master/rbd/rbd-mirroring/

"The cluster name in the following examples corresponds to a Ceph configuration 
file of the same name (e.g. /etc/ceph/remote.conf). See the ceph-conf 
documentation for how to configure multiple clusters."

So, in 'remote.conf' you can add the IPv6 addresses of a cluster running on 
IPv6 and on 'ceph.conf' the IPv4 addresses.
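
As a rough sketch of what that could look like (addresses and names are made
up here, and this is purely based on my reading of the docs, not something I
have tested):

  # /etc/ceph/ceph.conf  -- local IPv4 cluster
  [global]
  mon_host = 192.168.0.1, 192.168.0.2, 192.168.0.3

  # /etc/ceph/remote.conf -- remote IPv6 cluster
  [global]
  ms_bind_ipv6 = true
  mon_host = [2001:db8::1], [2001:db8::2], [2001:db8::3]

  # then point mirroring at the remote cluster, e.g.
  rbd --cluster ceph mirror pool peer add rbd client.remote@remote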

The rbd-mirror daemon will eventually talk to librbd/librados which will act as 
a 'proxy' between the two clusters.

I think it works, but that's just based on reading the docs and prior knowledge.

Wido

> Thanks,
> Nick
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Re: Fwd: how to fix the mds damaged issue

2016-07-04 Thread Lihang
Thank you very much for your advice. The command "ceph mds repaired 0" worked 
fine in my cluster: the cluster state became HEALTH_OK and the cephfs state 
returned to normal as well. However, the monitor and mds log files only record 
the replay and recovery process without pointing out anything abnormal, and I 
no longer have the logs from when this issue happened, so I have not found the 
root cause of this issue yet. I will try to reproduce it. Thank you very much 
again!
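
In case it helps with reproducing it, one thing worth doing beforehand (just a
sketch; the verbosity levels are only a guess at something reasonably chatty)
is to raise the mds debug level so the next failure is captured in the logs:

  ceph tell mds.0 injectargs '--debug_mds 10 --debug_journaler 10'

  # or persistently in ceph.conf on the mds nodes:
  [mds]
  debug mds = 10
  debug journaler = 10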
fisher

-----Original Message-----
From: John Spray [mailto:jsp...@redhat.com]
Sent: 4 July 2016 17:49
To: lihang 12398 (RD)
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Fwd: how to fix the mds damaged issue

On Sun, Jul 3, 2016 at 8:06 AM, Lihang  wrote:
> root@BoreNode2:~# ceph -v
>
> ceph version 10.2.0
>
>
>
> From: lihang 12398 (RD)
> Sent: 3 July 2016 14:47
> To: ceph-users@lists.ceph.com
> Cc: Ceph Development; 'uker...@gmail.com'; zhengbin 08747 (RD); 
> xusangdi 11976 (RD)
> Subject: how to fix the mds damaged issue
>
>
>
> Hi, my ceph cluster mds is damaged and the cluster is degraded after 
> a sudden power failure in our machine room. The cluster is now 
> "HEALTH_ERR" and cannot recover to a healthy state by itself, even 
> after I reboot the storage nodes or restart the whole ceph cluster. 
> After that I also used the following commands to remove the damaged 
> mds, but the damaged mds could not be removed and the issue still 
> exists. The other two mds are in standby state. Can anyone tell me how 
> to fix this issue and find out what happened in my cluster?
>
> The process I used to remove the damaged mds on the storage node was as follows.
>
> 1> Execute the "stop ceph-mds-all" command on the damaged mds node
>
> 2>  ceph mds rmfailed 0 --yes-i-really-mean-it

rmfailed is not something you want to use in these circumstances.

> 3>  root@BoreNode2:~# ceph  mds rm 0
>
> mds gid 0 dne
>
>
>
> The detailed status of my cluster is as follows:
>
> root@BoreNode2:~# ceph -s
>
>   cluster 98edd275-5df7-414f-a202-c3d4570f251c
>
>  health HEALTH_ERR
>
> mds rank 0 is damaged
>
> mds cluster is degraded
>
>  monmap e1: 3 mons at
> {BoreNode2=172.16.65.141:6789/0,BoreNode3=172.16.65.142:6789/0,BoreNod
> e4=172.16.65.143:6789/0}
>
> election epoch 1010, quorum 0,1,2 
> BoreNode2,BoreNode3,BoreNode4
>
>   fsmap e168: 0/1/1 up, 3 up:standby, 1 damaged
>
>  osdmap e338: 8 osds: 8 up, 8 in
>
> flags sortbitwise
>
>   pgmap v17073: 1560 pgs, 5 pools, 218 kB data, 32 objects
>
> 423 MB used, 3018 GB / 3018 GB avail
>
> 1560 active+clean

When an MDS rank is marked as damaged, that means something invalid was found 
when reading from the pool storing metadata objects.  The next step is to find 
out what that was.  Look in the MDS log and in ceph.log from the time when it 
went damaged, to find the most specific error message you can.

If you do not have the logs and want to have the MDS try operating again (to 
reproduce whatever condition caused it to be marked damaged), you can enable it 
by using "ceph mds repaired 0", then start the daemon and see how it is failing.

John

> root@BoreNode2:~# ceph mds dump
>
> dumped fsmap epoch 168
>
> fs_name TudouFS
>
> epoch   156
>
> flags   0
>
> created 2016-04-02 02:48:11.150539
>
> modified        2016-04-03 03:04:57.347064
>
> tableserver 0
>
> root0
>
> session_timeout 60
>
> session_autoclose   300
>
> max_file_size   1099511627776
>
> last_failure0
>
> last_failure_osd_epoch  83
>
> compat  compat={},rocompat={},incompat={1=base v0.20,2=client 
> writeable ranges,3=default file layouts on dirs,4=dir inode in 
> separate object,5=mds uses versioned encoding,6=dirfrag is stored in 
> omap,8=file layout v2}
>
> max_mds 1
>
> in  0
>
> up  {}
>
> failed
>
> damaged 0
>
> stopped
>
> data_pools  4
>
> metadata_pool   3
>
> inline_data disabled
>
> --
> ---
> This e-mail and its attachments contain confidential information from 
> H3C, which is intended only for the person or entity whose address is 
> listed above. Any use of the information contained herein in any way 
> (including, but not limited to, total or partial disclosure, 
> reproduction, or dissemination) by persons other than the intended
> recipient(s) is prohibited. If you receive this e-mail in error, 
> please notify the sender by phone or email immediately and delete it!
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] suse_enterprise_storage3_rbd_LIO_vmware_performance_bad

2016-07-04 Thread Lars Marowsky-Bree
On 2016-07-01T19:11:34, Nick Fisk  wrote:

> To summarise,
> 
> LIO is just not working very well at the moment because of the ABORT Tasks 
> problem, this will hopefully be fixed at some point. I'm not sure if SUSE 
> works around this, but see below for other pain points with RBD + ESXi + iSCSI

Yes, the SUSE kernel has recent backports that fix these bugs. And
there's obviously on-going work to improve the performance and code.

That's not to say that I'd advocate iSCSI as a primary access mechanism
for Ceph. But the need to interface from non-Linux systems to a Ceph
cluster is unfortunately very real.

> With 1GB networking I think you will struggle to get your write latency much 
> below 10-15ms, but from your example ~30ms is still a bit high. I wonder if 
> the default queue depths on your iSCSI target are too low as well?

Thanks for all the insights on the performance issues. You're really
quite spot on.

The main concern here obviously is that the same 2x1GbE network is
carrying both the client/ESX traffic, the iSCSI target to OSD traffic,
and the OSD backend traffic. That is not advisable.


Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] suse_enterprise_storage3_rbd_LIO_vmware_performance_bad

2016-07-04 Thread Lars Marowsky-Bree
On 2016-07-01T17:18:19, Christian Balzer  wrote:

> First off, it's somewhat funny that you're testing the repackaged SUSE
> Ceph, but asking for help here (with Ceph being owned by Red Hat).

*cough* Ceph is not owned by RH. RH acquired the InkTank team and the
various trademarks, that's true (and, admittedly, I'm a bit envious
about that ;-), but Ceph itself is an Open Source project that is not
owned by a single company.

You may want to check out the growing contributions from other
companies and the active involvement by them in the Ceph community ;-)


Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: how to fix the mds damaged issue

2016-07-04 Thread John Spray
On Sun, Jul 3, 2016 at 8:06 AM, Lihang  wrote:
> root@BoreNode2:~# ceph -v
>
> ceph version 10.2.0
>
>
>
> From: lihang 12398 (RD)
> Sent: 3 July 2016 14:47
> To: ceph-users@lists.ceph.com
> Cc: Ceph Development; 'uker...@gmail.com'; zhengbin 08747 (RD); xusangdi
> 11976 (RD)
> Subject: how to fix the mds damaged issue
>
>
>
> Hi, my ceph cluster mds is damaged and the cluster is degraded after a sudden
> power failure in our machine room. The cluster is now "HEALTH_ERR" and cannot
> recover to a healthy state by itself, even after I reboot the storage nodes or
> restart the whole ceph cluster. After that I also used the following commands
> to remove the damaged mds, but the damaged mds could not be removed and the
> issue still exists. The other two mds are in standby state. Can anyone tell me
> how to fix this issue and find out what happened in my cluster?
>
> The process I used to remove the damaged mds on the storage node was as follows.
>
> 1> Execute the "stop ceph-mds-all" command on the damaged mds node
>
> 2>  ceph mds rmfailed 0 --yes-i-really-mean-it

rmfailed is not something you want to use in these circumstances.

> 3>  root@BoreNode2:~# ceph  mds rm 0
>
> mds gid 0 dne
>
>
>
> The detailed status of my cluster is as follows:
>
> root@BoreNode2:~# ceph -s
>
>   cluster 98edd275-5df7-414f-a202-c3d4570f251c
>
>  health HEALTH_ERR
>
> mds rank 0 is damaged
>
> mds cluster is degraded
>
>  monmap e1: 3 mons at
> {BoreNode2=172.16.65.141:6789/0,BoreNode3=172.16.65.142:6789/0,BoreNode4=172.16.65.143:6789/0}
>
> election epoch 1010, quorum 0,1,2 BoreNode2,BoreNode3,BoreNode4
>
>   fsmap e168: 0/1/1 up, 3 up:standby, 1 damaged
>
>  osdmap e338: 8 osds: 8 up, 8 in
>
> flags sortbitwise
>
>   pgmap v17073: 1560 pgs, 5 pools, 218 kB data, 32 objects
>
> 423 MB used, 3018 GB / 3018 GB avail
>
> 1560 active+clean

When an MDS rank is marked as damaged, that means something invalid
was found when reading from the pool storing metadata objects.  The
next step is to find out what that was.  Look in the MDS log and in
ceph.log from the time when it went damaged, to find the most specific
error message you can.

If you do not have the logs and want to have the MDS try operating
again (to reproduce whatever condition caused it to be marked
damaged), you can enable it by using "ceph mds repaired 0", then start
the daemon and see how it is failing.
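
A rough sequence for that (assuming rank 0 is the damaged one, systemd-managed
daemons and default log locations; adjust names to your setup):

  # look for the original failure first
  grep -i damaged /var/log/ceph/ceph.log
  grep -iE 'ERR|error' /var/log/ceph/ceph-mds.*.log

  # clear the damaged flag and bring an MDS back up
  ceph mds repaired 0
  systemctl start ceph-mds@BoreNode2    # or however your mds daemons are started
  ceph -s                               # watch the fsmap go through replay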

John

> root@BoreNode2:~# ceph mds dump
>
> dumped fsmap epoch 168
>
> fs_name TudouFS
>
> epoch   156
>
> flags   0
>
> created 2016-04-02 02:48:11.150539
>
> modified        2016-04-03 03:04:57.347064
>
> tableserver 0
>
> root0
>
> session_timeout 60
>
> session_autoclose   300
>
> max_file_size   1099511627776
>
> last_failure0
>
> last_failure_osd_epoch  83
>
> compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
> uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
>
> max_mds 1
>
> in  0
>
> up  {}
>
> failed
>
> damaged 0
>
> stopped
>
> data_pools  4
>
> metadata_pool   3
>
> inline_data disabled
>
> -
> This e-mail and its attachments contain confidential information from H3C,
> which is
> intended only for the person or entity whose address is listed above. Any
> use of the
> information contained herein in any way (including, but not limited to,
> total or partial
> disclosure, reproduction, or dissemination) by persons other than the
> intended
> recipient(s) is prohibited. If you receive this e-mail in error, please
> notify the sender
> by phone or email immediately and delete it!
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds0: Behind on trimming (58621/30)

2016-07-04 Thread Kenneth Waegeman



On 01/07/16 16:01, Yan, Zheng wrote:

On Fri, Jul 1, 2016 at 6:59 PM, John Spray  wrote:

On Fri, Jul 1, 2016 at 11:35 AM, Kenneth Waegeman
 wrote:

Hi all,

While syncing a lot of files to cephfs, our mds cluster went haywire: the
mdses have a lot of segments behind on trimming: (58621/30).
Because of this the mds cluster gets degraded. RAM usage is about 50GB. The
mdses were respawning and replaying continuously, and I had to stop all
syncs, unmount all clients and increase the beacon_grace to keep the
cluster up.

[root@mds03 ~]# ceph status
 cluster 92bfcf0a-1d39-43b3-b60f-44f01b630e47
  health HEALTH_WARN
 mds0: Behind on trimming (58621/30)
  monmap e1: 3 mons at
{mds01=10.141.16.1:6789/0,mds02=10.141.16.2:6789/0,mds03=10.141.16.3:6789/0}
 election epoch 170, quorum 0,1,2 mds01,mds02,mds03
   fsmap e78658: 1/1/1 up {0=mds03=up:active}, 2 up:standby
  osdmap e19966: 156 osds: 156 up, 156 in
 flags sortbitwise
   pgmap v10213164: 4160 pgs, 4 pools, 253 TB data, 203 Mobjects
 357 TB used, 516 TB / 874 TB avail
 4151 active+clean
5 active+clean+scrubbing
4 active+clean+scrubbing+deep
   client io 0 B/s rd, 0 B/s wr, 63 op/s rd, 844 op/s wr
   cache io 68 op/s promote


Now it finally is up again, it is trimming very slowly (+-120 segments /
min)

Hmm, so it sounds like something was wrong that got cleared by either
the MDS restart or the client unmount, and now it's trimming at a
healthier rate.

What client (kernel or fuse, and version)?

Can you confirm that the RADOS cluster itself was handling operations
reasonably quickly?  Is your metadata pool using the same drives as
your data?  Were the OSDs saturated with IO?

While the cluster was accumulating untrimmed segments, did you also
have a "client xyz failing to advanced oldest_tid" warning?

This does not prevent the MDS from trimming log segments.


It would be good to clarify whether the MDS was trimming slowly, or
not at all.  If you can reproduce this situation, get it to a "behind
on trimming" state, and the stop the client IO (but leave it mounted).
See if the (x/30) number stays the same.  Then, does it start to
decrease when you unmount the client?  That would indicate a
misbehaving client.

Behind on trimming on a single-MDS cluster should be caused by either
slow rados operations or the MDS trimming too few log segments on each tick.

Kenneth, could you try setting mds_log_max_expiring to a large value
(such as 200)?
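
For example (a minimal sketch, assuming the active MDS is mds03 and the admin
socket is in the default location; the option can also be set permanently in
ceph.conf under [mds]):

  ceph daemon mds.mds03 config set mds_log_max_expiring 200
  # or from an admin node:
  ceph tell mds.0 injectargs '--mds_log_max_expiring 200'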
I've set the mds_log_max_expiring to 200 right now. Should I see 
something instantly?


This weekend, the trimming did not continue and something happened to
the cluster:


mds.0.cache.dir(1000da74e85) commit error -2 v 2466977
log_channel(cluster) log [ERR] : failed to commit dir 1000da74e85 
object, errno -2
mds.0.78429 unhandled write error (2) No such file or directory, force 
readonly...

mds.0.cache force file system read-only
log_channel(cluster) log [WRN] : force file system read-only

and ceph health reported:
mds0: MDS in read-only mode

I restarted it and it is trimming again.
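
If it happens again, it may be worth checking whether the directory object the
MDS complained about actually exists in the metadata pool, something like this
(a sketch only; I'm assuming the metadata pool is called 'metadata' and the
usual <inode>.<frag> object naming):

  rados -p metadata stat 1000da74e85.00000000
  rados -p metadata listomapkeys 1000da74e85.00000000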


Thanks again!
Kenneth

Regards
Yan, Zheng


John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD mirroring between an IPv6 and IPv4 Cluster

2016-07-04 Thread Nick Fisk
Hi All,

Quick question. I'm currently in the process of getting ready to deploy a
2nd cluster, which at some point in the next 12 months, I will want to
enable RBD mirroring between the new and existing clusters. I'm leaning
towards deploying this new cluster with IPv6, because Wido says so ;-) 

Question is, will RBD mirroring still be possible between the two? I know
you can't dual stack the core Ceph components, but does RBD mirroring have
the same limitations?

Thanks,
Nick

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] suse_enterprise_storage3_rbd_LIO_vmware_performance_bad

2016-07-04 Thread Nick Fisk
> -Original Message-
> From: mq [mailto:maoqi1...@126.com]
> Sent: 04 July 2016 08:13
> To: Nick Fisk 
> Subject: Re: [ceph-users]
> suse_enterprise_storage3_rbd_LIO_vmware_performance_bad
> 
> Hi  Nick
> i have tested NFS: since NFS cannot use Eager Zeroed Thick Provision mode,
> i used the default thin provisioning in vsphere.
> first test: fio result: 4k randwrite iops 538, latency 59ms.
> second test: formatted the sdb; fio result: 4k randwrite iops 746, latency
> 48ms.
> 
> the NFS performance is half of LIO

NFS will always have a penalty compared to VMFS on iSCSI because of the extra 
journal write, but as you saw in your LIO test, you have to conform to certain 
criteria; this may or may not be a problem.

Just one thing comes to mind though. How many NFS server threads are you 
running? By default I think most OS's only spin up 8, which is far too low. If 
you run fio at 32 depth against the defaults, you will see really low 
performance as IO's queue up. Try setting the NFS server threads to something 
like 128.
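
For example (a sketch; the config file location differs per distro, and this
assumes the kernel NFS server):

  rpc.nfsd 128                        # one-off, on a running server

  # persistent on RHEL/CentOS, in /etc/sysconfig/nfs:
  RPCNFSDCOUNT=128

  # persistent on Debian/Ubuntu, in /etc/default/nfs-kernel-server:
  RPCNFSDCOUNT=128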

Another thing to keep in mind (as I have just been finding out) is that it's 
important to set an extent size hint on the XFS filesystem on the NFS server, 
otherwise you will get lots of fragmentation.

Eg.
xfs_io -c "extsize 16M" /mountpoint
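
If I remember right, running the same command without a value prints the
current hint back, so you can check it took effect:

xfs_io -c extsize /mountpoint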

> 
> Regards
> MQ
> On 4 July 2016, at 2:07 PM, mq  wrote:
> 
> Hi Nick
> 
> kernel v: 3.12.49-11-default
> after changing the vsphere virtual disk configuration to Eager Zeroed Thick
> Provision mode, the performance in the vm is ok. fio result: 4k randwrite iops
> 1600, latency 8ms. 1M seq write bw 100MB/s. But cloning a 200G vm needs
> 30min.
> 
> by the way, i want to test bcache/flashcache+OSD or a cache tier; do you have
> any suggestions you can give me?
> 
> i will try NFS next day.
> 
> Regards
> 
> On 2 July 2016, at 2:11 AM, Nick Fisk  wrote:
> 
> To summarise,
> 
> LIO is just not working very well at the moment because of the ABORT Tasks
> problem, this will hopefully be fixed at some point. I'm not sure if SUSE 
> works
> around this, but see below for other pain points with RBD + ESXi + iSCSI
> 
> TGT is easy to get going, but performance isn't the best and failover is an
> absolute pain as TGT won't stop if it has ongoing IO. You normally end up in a
> complete mess if you try and do HA, unless you can cover a number of
> different failure scenarios.
> 
> SCST probably works the best at the moment. Yes, you have to compile it
> into a new kernel, but it performs well, doesn't fall over, supports the VAAI
> extensions and can be configured HA in an ALUA or VIP failover modes.
> There might be a couple of corner cases with the ALUA mode with
> Active/Standby paths, with possible data corruption that need to be
> tested/explored.
> 
> However, there are a number of pain points with iSCSI + ESXi + RBD and they
> all mainly centre on write latency. It seems VMFS was designed around the
> fact that Enterprise storage arrays service writes in 10-100us, whereas Ceph
> will service them in 2-10ms.
> 
> 1. Thin Provisioning makes things slow. I believe the main cause is that when
> growing and zeroing the new blocks, metadata needs to be updated and the
> block zero'd. Both issue small IO which would normally not be a problem, but
> with Ceph it becomes a bottleneck to overall IO on the datastore.
> 
> 2. Snapshots effectively turn all IO into 64kb IO's. Again a traditional SAN 
> will
> coalesce these back into a stream of larger IO's before committing to disk.
> However with Ceph each IO takes 2-10ms and so everything seems slow. The
> future feature of persistent RBD cache may go a long way to helping with
> this.
> 
> 3. >2TB VMDK's with snapshots use a different allocation mode, which
> happens in 4kb chunks instead of 64kb ones. This makes the problem 16
> times worse than above.
> 
> 4. Any of the above will also apply when migrating machines around, so VM's
> can take hours/days to move.
> 
> 5. If you use FILEIO, you can't use thin provisioning. If you use BLOCKIO, you
> get thin provisioning, but no pagecache or readahead, so performance can
> nose dive if this is needed.
> 
> 6. iSCSI is very complicated (especially ALUA) and sensitive. Get used to
> seeing APD/PDL even when you think you have finally got everything
> working great.
> 
> 
> Normal IO from eager zeroed VM's with no snapshots, however should
> perform ok. So depends what your workload is.
> 
> 
> And then comes NFS. It's very easy to setup, very easy to configure for HA,
> and works pretty well overall. You don't seem to get any of the IO size
> penalties when using snapshots. If you mount with discard, thin provisioning
> is done by Ceph. You can defragment the FS on the proxy node and several
> other things that you can't do with VMFS. Just make sure you run the server
> in sync mode to avoid data loss.
> 
> The only downside is that every IO causes an IO to the FS and one to the FS
> journal, so you effectively double your IO. But if your Ceph