Re: [ceph-users] Potential OSD deadlock?

2015-10-16 Thread Max A. Krasilnikov
Hello!

On Fri, Oct 09, 2015 at 01:45:42PM +0200, jan wrote:

> Have you tried running iperf between the nodes? Capturing a pcap of the 
> (failing) Ceph comms from both sides could help narrow it down.
> Is there any SDN layer involved that could add overhead/padding to the frames?

> What about some intermediate MTU like 8000 - does that work?
> Oh and if there's any bonding/trunking involved, beware that you need to set 
> the same MTU and offloads on all interfaces on certains kernels - flags like 
> MTU/offloads should propagate between the master/slave interfaces but in 
> reality it's not the case and they get reset even if you unplug/replug the 
> ethernet cable.

I'm sorry it took me so long to answer, but I have fixed the problem with jumbo
frames with sysctl:
#
net.ipv4.tcp_moderate_rcvbuf = 0
#
net.ipv4.tcp_rmem = 1024000 8738000 1677721600
net.ipv4.tcp_wmem = 1024000 8738000 1677721600
net.ipv4.tcp_mem = 1024000 8738000 1677721600
net.core.rmem_max = 1677721600
net.core.rmem_default = 167772160
net.core.wmem_max = 1677721600
net.core.wmem_default = 167772160

And now I can load my cluster without any slow requests. The essential setting
is net.ipv4.tcp_moderate_rcvbuf = 0; all the others are just additional tuning.
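
For reference, a sketch of how these settings could be applied and persisted on
each node (the /etc/sysctl.d file name is only an example, not something from
this thread):

# Apply the key setting immediately: disable TCP receive-buffer auto-tuning
sysctl -w net.ipv4.tcp_moderate_rcvbuf=0

# Persist the whole set across reboots
cat > /etc/sysctl.d/90-ceph-net.conf <<'EOF'
net.ipv4.tcp_moderate_rcvbuf = 0
net.ipv4.tcp_rmem = 1024000 8738000 1677721600
net.ipv4.tcp_wmem = 1024000 8738000 1677721600
net.ipv4.tcp_mem = 1024000 8738000 1677721600
net.core.rmem_max = 1677721600
net.core.rmem_default = 167772160
net.core.wmem_max = 1677721600
net.core.wmem_default = 167772160
EOF
sysctl --system   # reload all sysctl.d files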

> Jan

>> On 09 Oct 2015, at 13:21, Max A. Krasilnikov  wrote:
>> 
>> Hello!
>> 
>> On Fri, Oct 09, 2015 at 11:05:59AM +0200, jan wrote:
>> 
>>> Are there any errors on the NICs? (ethtool -s ethX)
>> 
>> No errors. Neither on nodes, nor on switches.
>> 
>>> Also take a look at the switch and look for flow control statistics - do 
>>> you have flow control enabled or disabled?
>> 
>> flow control disabled everywhere.
>> 
>>> We had to disable flow control as it would pause all IO on the port 
>>> whenever any path got congested which you don't want to happen with a 
>>> cluster like Ceph. It's better to let the frame drop/retransmit in this 
>>> case (and you should size it so it doesn't happen in any case).
>>> And how about NIC offloads? Do they play nice with jumbo frames? I wouldn't 
>>> put my money on that...
>> 
>> I tried to completely disable all offloads and setting mtu back to 9000 
>> after.
>> No luck.
>> I am speaking with my NOC about MTU in 10G network. If I have update, I will
>> write here. I can hardly beleave that it is ceph side, but nothing is
>> impossible.
>> 
>>> Jan
>> 
>> 
 On 09 Oct 2015, at 10:48, Max A. Krasilnikov  wrote:
 
 Hello!
 
 On Thu, Oct 08, 2015 at 11:44:09PM -0600, robert wrote:
 
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
 
> Sage,
 
> After trying to bisect this issue (all test moved the bisect towards
> Infernalis) and eventually testing the Infernalis branch again, it
> looks like the problem still exists although it is handled a tad
> better in Infernalis. I'm going to test against Firefly/Giant next
> week and then try and dive into the code to see if I can expose any
> thing.
 
> If I can do anything to provide you with information, please let me know.
 
 I have fixed my troubles by setting MTU back to 1500 from 9000 in 2x10G 
 network
 between nodes (2x Cisco Nexus 5020, one link per switch, LACP, linux 
 bounding
 driver: bonding mode=4 lacp_rate=1 xmit_hash_policy=1 miimon=100, Intel 
 82599ES
 Adapter, non-intel sfp+). When setting it to 9000 on nodes and 9216 on 
 Nexus 5020
 switch with Jumbo frames enabled i have performance drop and slow 
 requests. When
 setting 1500 on nodes and not touching Nexus all problems are fixed.
 
 I have rebooted all my ceph services when changing MTU and changing things 
 to
 9000 and 1500 several times in order to be sure. It is reproducable in my
 environment.
 
> Thanks,
> -BEGIN PGP SIGNATURE-
> Version: Mailvelope v1.2.0
> Comment: https://www.mailvelope.com
 
> wsFcBAEBCAAQBQJWF1QlCRDmVDuy+mK58QAAWLgP/2l+TkcpeKihDxF8h/kw
> YFffNWODNfOMq8FVDQkQceo2mFCFc29JnBYiAeqW+XPelwuU5S86LG998aUB
> BvIU4EHaJNJ31X1NCIA7nwi8rXlFYfSG2qQn58+IzqZoWCQM5vD/THISV1rP
> qQKtoOAEuRxz+vOAJGI1A1xJSOiFwTRjs4LjE1zYjSP26LdEF61D/lb+AVzV
> ufxi/ci6mAla/4VTAH4VqEviDgC8AbAZnWFGfUPcTUxJQS99kFrfjJnWvgyF
> V9EmWtQCvhRO74hQLBqspOwdAxEJesPfGcJT1LjR0eEAMWvbGPtaqbSFAEWa
> jjyy5wP9+4NnGLdhba6UBtLphjqTcl0e2vVwRj0zLhI14moAOlbhIKmZ1Dt+
> 1P6vfgOUGvO76xgDMwrVKRoQgWJO/0Tup9+oqInnNYgf4W+ZWsLgLgo7ETAF
> VcI7LP1wkwAI3lz5YphY/TnKNGs6i+wVjKBamOt3R1yz9WeylaG0T6xgGHrs
> VugrRSUuO+ND9+mE5EsUgITCZoaavXJESJMb30XkK6hYGB+T/q+hBafc6Wle
> Jgs+aT2m1erdSyZn0ZC9a6CjWmwJXY6FCSGhE53BbefBxmCFxn+8tVav+Q8W
> 7s14TntP6ex4ca7eTwGuSXC9FU5fAVa+3+3aXDAC1QPAkeVkXyB716W1XG6b
> BCFo
> =GJL4
> -END PGP SIGNATURE-
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
 
 
> On Wed, Oct 7, 2015 at 1:25 PM, Robert LeBlanc 

Re: [ceph-users] Potential OSD deadlock?

2015-10-14 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

It seems in our situation the cluster is just busy, usually with
really small RBD I/O. We have gotten things to where it doesn't happen
as much in a steady state, but when we have an OSD fail (mostly from
an XFS log bug we hit at least once a week), it is very painful as the
OSD exits and re-enters the cluster. We are working to split the PGs a
couple-fold, but this is a painful process for the reasons
mentioned in the tracker. Matt Benjamin and Sam Just had a discussion
on IRC about getting the other primaries to throttle back when such a
situation occurs so that each primary OSD has some time to service
client I/O and to push back on the clients to slow down in these
situations.

In our case a single OSD can lock up a VM for a very long time while
others are happily going about their business. Instead of looking like
the cluster is out of I/O, it looks like there is an error. If
pressure is pushed back to clients, it would show up as all of the
clients slowing down a little instead of one or two just hanging for
even over 1,000 seconds.

My thought is that each OSD should have some percentage of time given
to servicing client I/O, whereas now it seems that replica I/O can
completely starve client I/O. I understand why replica traffic needs a
higher priority, but I think some balance needs to be attained.

Thanks,
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.2.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWHne4CRDmVDuy+mK58QAAwYUP/RzTrmsYV7Vi6e64Yikh
YMMI4Cxt4mBWbTIOsb8iRY98EkqhUWd/kz45OoFQgwE4hS3O5Lksf3u0pcmS
I+Gz6jQ4/K0B6Mc3Rt19ofD1cA9s6BLnHSqTFZEUVapiHftj84ewIRLts9dg
YCJJeaaOV8fu07oZvnumRTAKOzWPyQizQKBGx7nujIg13Us0st83C8uANzoX
hKvlA2qVMXO4rLgR7nZMcgj+X+/79v7MDycM3WP/Q21ValsNfETQVhN+XxC8
D/IUfX4/AKUEuF4WBEck4Z/Wx9YD+EvpLtQVLy21daazRApWES/iy089F63O
k9RHp189c4WCduFBaTvZj2cdekAq/Wl50O1AdafYFptWqYhw+aKpihI+yMrX
+LhWgoYALD6wyXr0KVDZZszIRZbO/PSjct8z13aXBJoJm9r0Vyazfhi9jNW9
Z/1GD7gv5oHymf7eR9u7T8INdjNzn6Qllj7XCyZfQv5TYxsRWMZxf5vEkpMB
nAYANoZcNs4ZSIy+OdFOb6nM66ujrytWL1DqWusJUEM/GauBw0fxnQ/i+pMy
XU8gYbG1um5YY8jrtvvkhnbHdeO/k24/cH7MGslxeezBPnMNzmqj3qVdiX1H
EBbyBBtp8OF+pKExrmZc2w01W/Nxl6GbVoG+IKJ61FgwKOXEiMwb0wv5mu30
eP3D
=R0O9
-END PGP SIGNATURE-

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Oct 14, 2015 at 12:00 AM, Haomai Wang  wrote:
> On Wed, Oct 14, 2015 at 1:03 AM, Sage Weil  wrote:
>> On Mon, 12 Oct 2015, Robert LeBlanc wrote:
>>> -BEGIN PGP SIGNED MESSAGE-
>>> Hash: SHA256
>>>
>>> After a weekend, I'm ready to hit this from a different direction.
>>>
>>> I replicated the issue with Firefly so it doesn't seem an issue that
>>> has been introduced or resolved in any nearby version. I think overall
>>> we may be seeing [1] to a great degree. From what I can extract from
>>> the logs, it looks like in situations where OSDs are going up and
>>> down, I see I/O blocked at the primary OSD waiting for peering and/or
>>> the PG to become clean before dispatching the I/O to the replicas.
>>>
>>> In an effort to understand the flow of the logs, I've attached a small
>>> 2 minute segment of a log I've extracted what I believe to be
>>> important entries in the life cycle of an I/O along with my
>>> understanding. If someone would be kind enough to help my
>>> understanding, I would appreciate it.
>>>
>>> 2015-10-12 14:12:36.537906 7fb9d2c68700 10 -- 192.168.55.16:6800/11295
>>> >> 192.168.55.12:0/2013622 pipe(0x26c9 sd=47 :6800 s=2 pgs=2 cs=1
>>> l=1 c=0x32c85440).reader got message 19 0x2af81700
>>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.003a
>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
>>>
>>> - ->Messenger has recieved the message from the client (previous
>>> entries in the 7fb9d2c68700 thread are the individual segments that
>>> make up this message).
>>>
>>> 2015-10-12 14:12:36.537963 7fb9d2c68700  1 -- 192.168.55.16:6800/11295
>>> <== client.6709 192.168.55.12:0/2013622 19 
>>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.003a
>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
>>>  235+0+4194304 (2317308138 0 2001296353) 0x2af81700 con 0x32c85440
>>>
>>> - ->OSD process acknowledges that it has received the write.
>>>
>>> 2015-10-12 14:12:36.538096 7fb9d2c68700 15 osd.4 44 enqueue_op
>>> 0x3052b300 prio 63 cost 4194304 latency 0.012371
>>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.003a
>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
>>>
>>> - ->Not sure excatly what is going on here, the op is being enqueued 
>>> somewhere..
>>>
>>> 2015-10-12 14:13:06.542819 7fb9e2d3a700 10 osd.4 44 dequeue_op
>>> 0x3052b300 

Re: [ceph-users] Potential OSD deadlock?

2015-10-14 Thread Sage Weil
On Wed, 14 Oct 2015, Robert LeBlanc wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
> 
> It seems in our situation the cluster is just busy, usually with
> really small RBD I/O. We have gotten things to where it doesn't happen
> as much in a steady state, but when we have an OSD fail (mostly from
> an XFS log bug we hit at least once a week), it is very painful as the
> OSD exits and enters the cluster. We are working to split the PGs a
> couple of fold, but this is a painful process for the reasons
> mentioned in the tracker. Matt Benjamin and Sam Just had a discussion
> on IRC about getting the other primaries to throttle back when such a
> situation occurs so that each primary OSD has some time to service
> client I/O and to push back on the clients to slow down in these
> situations.
> 
> In our case a single OSD can lock up a VM for a very long time while
> others are happily going about their business. Instead of looking like
> the cluster is out of I/O, it looks like there is an error. If
> pressure is pushed back to clients, it would show up as all of the
> clients slowing down a little instead of one or two just hanging for
> even over 1,000 seconds.

This 1000 seconds figure is very troubling.  Do you have logs?  I suspect 
this is a different issue than the prioritization one in the log from the 
other day (which only waited about 30s for higher-priority replica 
requests).

> My thoughts is that each OSD should have some percentage to time given
> to servicing client I/O whereas now it seems that replica I/O can
> completely starve client I/O. I understand why replica traffic needs a
> higher priority, but I think some balance needs to be attained.

We currently do 'fair' prioritized queueing with a token bucket filter 
only for requests with priorities <= 63.  Simply increasing this threshold 
so that it covers replica requests might be enough.  But... we'll be 
starting client requests locally at the expense of in-progress client 
writes elsewhere.  Given that the amount of (our) client-related work we 
do is always bounded by the msgr throttle, I think this is okay since we 
only make the situation worse by a fixed factor.  (We still don't address 
the possibility that we are a replica for every other OSD in the system and 
could be flooded by N*(max client ops per OSD).)

It's this line:

https://github.com/ceph/ceph/blob/master/src/osd/OSD.cc#L8334
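
For context, a sketch of how the priority mix could be inspected on a live OSD
through the admin socket (osd.4 here is just the OSD from the log excerpt; the
option names are the Hammer-era defaults and should be double-checked against
your build):

# Priority-related settings currently in effect on this OSD
ceph daemon osd.4 config show | grep -i priority

# Client ops enter the queue at osd_client_op_priority (63 by default),
# i.e. inside the <= 63 token-bucket cutoff discussed above
ceph daemon osd.4 config get osd_client_op_priority
ceph daemon osd.4 config get osd_recovery_op_priority

# Recently completed slow ops, with per-op queue/commit timings
ceph daemon osd.4 dump_historic_ops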

sage



> 
> Thanks,
> -BEGIN PGP SIGNATURE-
> Version: Mailvelope v1.2.0
> Comment: https://www.mailvelope.com
> 
> wsFcBAEBCAAQBQJWHne4CRDmVDuy+mK58QAAwYUP/RzTrmsYV7Vi6e64Yikh
> YMMI4Cxt4mBWbTIOsb8iRY98EkqhUWd/kz45OoFQgwE4hS3O5Lksf3u0pcmS
> I+Gz6jQ4/K0B6Mc3Rt19ofD1cA9s6BLnHSqTFZEUVapiHftj84ewIRLts9dg
> YCJJeaaOV8fu07oZvnumRTAKOzWPyQizQKBGx7nujIg13Us0st83C8uANzoX
> hKvlA2qVMXO4rLgR7nZMcgj+X+/79v7MDycM3WP/Q21ValsNfETQVhN+XxC8
> D/IUfX4/AKUEuF4WBEck4Z/Wx9YD+EvpLtQVLy21daazRApWES/iy089F63O
> k9RHp189c4WCduFBaTvZj2cdekAq/Wl50O1AdafYFptWqYhw+aKpihI+yMrX
> +LhWgoYALD6wyXr0KVDZZszIRZbO/PSjct8z13aXBJoJm9r0Vyazfhi9jNW9
> Z/1GD7gv5oHymf7eR9u7T8INdjNzn6Qllj7XCyZfQv5TYxsRWMZxf5vEkpMB
> nAYANoZcNs4ZSIy+OdFOb6nM66ujrytWL1DqWusJUEM/GauBw0fxnQ/i+pMy
> XU8gYbG1um5YY8jrtvvkhnbHdeO/k24/cH7MGslxeezBPnMNzmqj3qVdiX1H
> EBbyBBtp8OF+pKExrmZc2w01W/Nxl6GbVoG+IKJ61FgwKOXEiMwb0wv5mu30
> eP3D
> =R0O9
> -END PGP SIGNATURE-
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Wed, Oct 14, 2015 at 12:00 AM, Haomai Wang  wrote:
> > On Wed, Oct 14, 2015 at 1:03 AM, Sage Weil  wrote:
> >> On Mon, 12 Oct 2015, Robert LeBlanc wrote:
> >>> -BEGIN PGP SIGNED MESSAGE-
> >>> Hash: SHA256
> >>>
> >>> After a weekend, I'm ready to hit this from a different direction.
> >>>
> >>> I replicated the issue with Firefly so it doesn't seem an issue that
> >>> has been introduced or resolved in any nearby version. I think overall
> >>> we may be seeing [1] to a great degree. From what I can extract from
> >>> the logs, it looks like in situations where OSDs are going up and
> >>> down, I see I/O blocked at the primary OSD waiting for peering and/or
> >>> the PG to become clean before dispatching the I/O to the replicas.
> >>>
> >>> In an effort to understand the flow of the logs, I've attached a small
> >>> 2 minute segment of a log I've extracted what I believe to be
> >>> important entries in the life cycle of an I/O along with my
> >>> understanding. If someone would be kind enough to help my
> >>> understanding, I would appreciate it.
> >>>
> >>> 2015-10-12 14:12:36.537906 7fb9d2c68700 10 -- 192.168.55.16:6800/11295
> >>> >> 192.168.55.12:0/2013622 pipe(0x26c9 sd=47 :6800 s=2 pgs=2 cs=1
> >>> l=1 c=0x32c85440).reader got message 19 0x2af81700
> >>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.003a
> >>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) 

Re: [ceph-users] Potential OSD deadlock?

2015-10-14 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

I'm sure I have a log of a 1,000-second block somewhere; I'll have to
look around for it.

I'll try turning that knob and see what happens. I'll come back with
the results.

Thanks,

- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Oct 14, 2015 at 11:08 AM, Sage Weil  wrote:
> On Wed, 14 Oct 2015, Robert LeBlanc wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> It seems in our situation the cluster is just busy, usually with
>> really small RBD I/O. We have gotten things to where it doesn't happen
>> as much in a steady state, but when we have an OSD fail (mostly from
>> an XFS log bug we hit at least once a week), it is very painful as the
>> OSD exits and enters the cluster. We are working to split the PGs a
>> couple of fold, but this is a painful process for the reasons
>> mentioned in the tracker. Matt Benjamin and Sam Just had a discussion
>> on IRC about getting the other primaries to throttle back when such a
>> situation occurs so that each primary OSD has some time to service
>> client I/O and to push back on the clients to slow down in these
>> situations.
>>
>> In our case a single OSD can lock up a VM for a very long time while
>> others are happily going about their business. Instead of looking like
>> the cluster is out of I/O, it looks like there is an error. If
>> pressure is pushed back to clients, it would show up as all of the
>> clients slowing down a little instead of one or two just hanging for
>> even over 1,000 seconds.
>
> This 1000 seconds figure is very troubling.  Do you have logs?  I suspect
> this is a different issue than the prioritization one in the log from the
> other day (which only waited about 30s for higher-priority replica
> requests).
>
>> My thoughts is that each OSD should have some percentage to time given
>> to servicing client I/O whereas now it seems that replica I/O can
>> completely starve client I/O. I understand why replica traffic needs a
>> higher priority, but I think some balance needs to be attained.
>
> We currently do 'fair' prioritized queueing with a token bucket filter
> only for requests with priorities <= 63.  Simply increasing this threshold
> so that it covers replica requests might be enough.  But... we'll be
> starting client requests locally at the expense of in-progress client
> writes elsewhere.  Given that the amount of (our) client-related work we
> do is always bounded by the msgr throttle, I think this is okay since we
> only make the situation worse by a fixed factor.  (We still don't address
> the possibilty that we are replica for every other osd in the system and
> could be flooded by N*(max client ops per osd).
>
> It's this line:
>
> https://github.com/ceph/ceph/blob/master/src/osd/OSD.cc#L8334
>
> sage
>
>
>
>>
>> Thanks,
>> -BEGIN PGP SIGNATURE-
>> Version: Mailvelope v1.2.0
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJWHne4CRDmVDuy+mK58QAAwYUP/RzTrmsYV7Vi6e64Yikh
>> YMMI4Cxt4mBWbTIOsb8iRY98EkqhUWd/kz45OoFQgwE4hS3O5Lksf3u0pcmS
>> I+Gz6jQ4/K0B6Mc3Rt19ofD1cA9s6BLnHSqTFZEUVapiHftj84ewIRLts9dg
>> YCJJeaaOV8fu07oZvnumRTAKOzWPyQizQKBGx7nujIg13Us0st83C8uANzoX
>> hKvlA2qVMXO4rLgR7nZMcgj+X+/79v7MDycM3WP/Q21ValsNfETQVhN+XxC8
>> D/IUfX4/AKUEuF4WBEck4Z/Wx9YD+EvpLtQVLy21daazRApWES/iy089F63O
>> k9RHp189c4WCduFBaTvZj2cdekAq/Wl50O1AdafYFptWqYhw+aKpihI+yMrX
>> +LhWgoYALD6wyXr0KVDZZszIRZbO/PSjct8z13aXBJoJm9r0Vyazfhi9jNW9
>> Z/1GD7gv5oHymf7eR9u7T8INdjNzn6Qllj7XCyZfQv5TYxsRWMZxf5vEkpMB
>> nAYANoZcNs4ZSIy+OdFOb6nM66ujrytWL1DqWusJUEM/GauBw0fxnQ/i+pMy
>> XU8gYbG1um5YY8jrtvvkhnbHdeO/k24/cH7MGslxeezBPnMNzmqj3qVdiX1H
>> EBbyBBtp8OF+pKExrmZc2w01W/Nxl6GbVoG+IKJ61FgwKOXEiMwb0wv5mu30
>> eP3D
>> =R0O9
>> -END PGP SIGNATURE-
>> 
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Wed, Oct 14, 2015 at 12:00 AM, Haomai Wang  wrote:
>> > On Wed, Oct 14, 2015 at 1:03 AM, Sage Weil  wrote:
>> >> On Mon, 12 Oct 2015, Robert LeBlanc wrote:
>> >>> -BEGIN PGP SIGNED MESSAGE-
>> >>> Hash: SHA256
>> >>>
>> >>> After a weekend, I'm ready to hit this from a different direction.
>> >>>
>> >>> I replicated the issue with Firefly so it doesn't seem an issue that
>> >>> has been introduced or resolved in any nearby version. I think overall
>> >>> we may be seeing [1] to a great degree. From what I can extract from
>> >>> the logs, it looks like in situations where OSDs are going up and
>> >>> down, I see I/O blocked at the primary OSD waiting for peering and/or
>> >>> the PG to become clean before dispatching the I/O to the replicas.
>> >>>
>> >>> In an effort to understand the flow of the logs, I've attached a small
>> >>> 2 minute segment of a log I've extracted what I believe to be
>> >>> important entries in the life cycle of an I/O along with my
>> >>> understanding. If someone would be kind enough to help my
>> >>> understanding, I 

Re: [ceph-users] Potential OSD deadlock?

2015-10-14 Thread Haomai Wang
On Wed, Oct 14, 2015 at 1:03 AM, Sage Weil  wrote:
> On Mon, 12 Oct 2015, Robert LeBlanc wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> After a weekend, I'm ready to hit this from a different direction.
>>
>> I replicated the issue with Firefly so it doesn't seem an issue that
>> has been introduced or resolved in any nearby version. I think overall
>> we may be seeing [1] to a great degree. From what I can extract from
>> the logs, it looks like in situations where OSDs are going up and
>> down, I see I/O blocked at the primary OSD waiting for peering and/or
>> the PG to become clean before dispatching the I/O to the replicas.
>>
>> In an effort to understand the flow of the logs, I've attached a small
>> 2 minute segment of a log I've extracted what I believe to be
>> important entries in the life cycle of an I/O along with my
>> understanding. If someone would be kind enough to help my
>> understanding, I would appreciate it.
>>
>> 2015-10-12 14:12:36.537906 7fb9d2c68700 10 -- 192.168.55.16:6800/11295
>> >> 192.168.55.12:0/2013622 pipe(0x26c9 sd=47 :6800 s=2 pgs=2 cs=1
>> l=1 c=0x32c85440).reader got message 19 0x2af81700
>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.003a
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
>>
>> - ->Messenger has recieved the message from the client (previous
>> entries in the 7fb9d2c68700 thread are the individual segments that
>> make up this message).
>>
>> 2015-10-12 14:12:36.537963 7fb9d2c68700  1 -- 192.168.55.16:6800/11295
>> <== client.6709 192.168.55.12:0/2013622 19 
>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.003a
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
>>  235+0+4194304 (2317308138 0 2001296353) 0x2af81700 con 0x32c85440
>>
>> - ->OSD process acknowledges that it has received the write.
>>
>> 2015-10-12 14:12:36.538096 7fb9d2c68700 15 osd.4 44 enqueue_op
>> 0x3052b300 prio 63 cost 4194304 latency 0.012371
>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.003a
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
>>
>> - ->Not sure excatly what is going on here, the op is being enqueued 
>> somewhere..
>>
>> 2015-10-12 14:13:06.542819 7fb9e2d3a700 10 osd.4 44 dequeue_op
>> 0x3052b300 prio 63 cost 4194304 latency 30.017094
>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.003a
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v
>> 5 pg pg[0.29( v 44'703 (0'0,44'703] local-les=40 n=641 ec=1 les/c
>> 40/44 32/32/10) [4,5,0] r=0 lpr=32 crt=44'700 lcod 44'702 mlcod 44'702
>> active+clean]
>>
>> - ->The op is dequeued from this mystery queue 30 seconds later in a
>> different thread.
>
> ^^ This is the problem.  Everything after this looks reasonable.  Looking
> at the other dequeue_op calls over this period, it looks like we're just
> overwhelmed with higher priority requests.  New clients are 63, while
> osd_repop (replicated write from another primary) are 127 and replies from
> our own replicated ops are 196.  We do process a few other prio 63 items,
> but you'll see that their latency is also climbing up to 30s over this
> period.
>
> The question is why we suddenly get a lot of them.. maybe the peering on
> other OSDs just completed so we get a bunch of these?  It's also not clear
> to me what makes osd.4 or this op special.  We expect a mix of primary and
> replica ops on all the OSDs, so why would we suddenly have more of them
> here

I guess the bug tracker issue (http://tracker.ceph.com/issues/13482) is
related to this thread.

So does this mean that there is a livelock between client ops and repops?
We permit all clients to issue too many client ops, which bottlenecks some
OSDs, while other OSDs may still be idle enough to accept more client ops.
Eventually, all OSDs get stuck behind the bottlenecked OSD. That seems
plausible, but why does it last so long?

>
> sage
>
>
>>
>> 2015-10-12 14:13:06.542912 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
>> v 44'703 (0'0,44'703] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>> [4,5,0] r=0 lpr=32 crt=44'700 lcod 44'702 mlcod 44'702 active+clean]
>> do_op osd_op(client.6709.0:67 rbd_data.103c74b0dc51.003a
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
>> may_write -> write-ordered flags ack+ondisk+write+known_if_redirected
>>
>> - ->Not sure what this message is. Look up of secondary OSDs?
>>
>> 2015-10-12 14:13:06.544999 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
>> v 44'703 (0'0,44'703] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>> [4,5,0] r=0 lpr=32 crt=44'700 lcod 44'702 mlcod 44'702 

Re: [ceph-users] Potential OSD deadlock?

2015-10-13 Thread Sage Weil
On Mon, 12 Oct 2015, Robert LeBlanc wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
> 
> After a weekend, I'm ready to hit this from a different direction.
> 
> I replicated the issue with Firefly so it doesn't seem an issue that
> has been introduced or resolved in any nearby version. I think overall
> we may be seeing [1] to a great degree. From what I can extract from
> the logs, it looks like in situations where OSDs are going up and
> down, I see I/O blocked at the primary OSD waiting for peering and/or
> the PG to become clean before dispatching the I/O to the replicas.
> 
> In an effort to understand the flow of the logs, I've attached a small
> 2 minute segment of a log I've extracted what I believe to be
> important entries in the life cycle of an I/O along with my
> understanding. If someone would be kind enough to help my
> understanding, I would appreciate it.
> 
> 2015-10-12 14:12:36.537906 7fb9d2c68700 10 -- 192.168.55.16:6800/11295
> >> 192.168.55.12:0/2013622 pipe(0x26c9 sd=47 :6800 s=2 pgs=2 cs=1
> l=1 c=0x32c85440).reader got message 19 0x2af81700
> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.003a
> [set-alloc-hint object_size 4194304 write_size 4194304,write
> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
> 
> - ->Messenger has received the message from the client (previous
> entries in the 7fb9d2c68700 thread are the individual segments that
> make up this message).
> 
> 2015-10-12 14:12:36.537963 7fb9d2c68700  1 -- 192.168.55.16:6800/11295
> <== client.6709 192.168.55.12:0/2013622 19 
> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.003a
> [set-alloc-hint object_size 4194304 write_size 4194304,write
> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
>  235+0+4194304 (2317308138 0 2001296353) 0x2af81700 con 0x32c85440
> 
> - ->OSD process acknowledges that it has received the write.
> 
> 2015-10-12 14:12:36.538096 7fb9d2c68700 15 osd.4 44 enqueue_op
> 0x3052b300 prio 63 cost 4194304 latency 0.012371
> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.003a
> [set-alloc-hint object_size 4194304 write_size 4194304,write
> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
> 
> - ->Not sure exactly what is going on here, the op is being enqueued 
> somewhere..
> 
> 2015-10-12 14:13:06.542819 7fb9e2d3a700 10 osd.4 44 dequeue_op
> 0x3052b300 prio 63 cost 4194304 latency 30.017094
> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.003a
> [set-alloc-hint object_size 4194304 write_size 4194304,write
> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v
> 5 pg pg[0.29( v 44'703 (0'0,44'703] local-les=40 n=641 ec=1 les/c
> 40/44 32/32/10) [4,5,0] r=0 lpr=32 crt=44'700 lcod 44'702 mlcod 44'702
> active+clean]
> 
> - ->The op is dequeued from this mystery queue 30 seconds later in a
> different thread.

^^ This is the problem.  Everything after this looks reasonable.  Looking 
at the other dequeue_op calls over this period, it looks like we're just 
overwhelmed with higher priority requests.  New clients are 63, while 
osd_repop (replicated write from another primary) are 127 and replies from 
our own replicated ops are 196.  We do process a few other prio 63 items, 
but you'll see that their latency is also climbing up to 30s over this 
period.

The question is why we suddenly get a lot of them... maybe the peering on 
other OSDs just completed, so we get a bunch of these?  It's also not clear 
to me what makes osd.4 or this op special.  We expect a mix of primary and 
replica ops on all the OSDs, so why would we suddenly have more of them 
here?
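
As a rough way to quantify that, here is a sketch of a one-liner that tallies
the dequeue_op lines by priority, assuming the exact log format quoted above
and a log file name of ceph-osd.4.log (both are just placeholders):

# Ops and average queue latency per priority, from the dequeue_op debug lines
grep ' dequeue_op ' ceph-osd.4.log | awk '
  { for (i = 1; i <= NF; i++) {
      if ($i == "prio")    p = $(i+1);
      if ($i == "latency") l = $(i+1);
    }
    n[p]++; s[p] += l }
  END { for (p in n) printf "prio %s: %d ops, avg latency %.3fs\n", p, n[p], s[p]/n[p] }'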

sage


> 
> 2015-10-12 14:13:06.542912 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
> v 44'703 (0'0,44'703] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
> [4,5,0] r=0 lpr=32 crt=44'700 lcod 44'702 mlcod 44'702 active+clean]
> do_op osd_op(client.6709.0:67 rbd_data.103c74b0dc51.003a
> [set-alloc-hint object_size 4194304 write_size 4194304,write
> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
> may_write -> write-ordered flags ack+ondisk+write+known_if_redirected
> 
> - ->Not sure what this message is. Look up of secondary OSDs?
> 
> 2015-10-12 14:13:06.544999 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
> v 44'703 (0'0,44'703] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
> [4,5,0] r=0 lpr=32 crt=44'700 lcod 44'702 mlcod 44'702 active+clean]
> new_repop rep_tid 17815 on osd_op(client.6709.0:67
> rbd_data.103c74b0dc51.003a [set-alloc-hint object_size
> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
> ack+ondisk+write+known_if_redirected e44) v5
> 
> - ->Dispatch write to secondary OSDs?
> 
> 2015-10-12 14:13:06.545116 7fb9e2d3a700  1 -- 192.168.55.16:6801/11295
> --> 192.168.55.15:6801/32036 -- osd_repop(client.6709.0:67 0.29
> 474a01a9/rbd_data.103c74b0dc51.003a/head//0 v 44'704) v1
> -- ?+4195078 0x238fd600 con 0x32bcb5a0
> 
> - ->OSD dispatch write to OSD.0.

Re: [ceph-users] Potential OSD deadlock?

2015-10-09 Thread Jan Schermer
Are there any errors on the NICs? (ethtool -S ethX)
Also take a look at the switch and look for flow control statistics - do you 
have flow control enabled or disabled?
We had to disable flow control as it would pause all IO on the port whenever 
any path got congested, which you don't want to happen with a cluster like Ceph. 
It's better to let the frame drop/retransmit in this case (and you should size 
the network so it doesn't happen in any case).
And how about NIC offloads? Do they play nice with jumbo frames? I wouldn't put 
my money on that...
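
A sketch of those checks from the command line (eth2 is a placeholder for the
actual 10G interface; the exact counter names vary by driver):

# Per-NIC error and drop counters (look for rx_errors, rx_missed, tx_dropped, pause counters)
ethtool -S eth2 | egrep -i 'err|drop|miss|pause'

# Flow control (pause frame) negotiation state
ethtool -a eth2

# Offload settings that may interact badly with jumbo frames
ethtool -k eth2 | grep -i offload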

Jan


> On 09 Oct 2015, at 10:48, Max A. Krasilnikov  wrote:
> 
> Hello!
> 
> On Thu, Oct 08, 2015 at 11:44:09PM -0600, robert wrote:
> 
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
> 
>> Sage,
> 
>> After trying to bisect this issue (all test moved the bisect towards
>> Infernalis) and eventually testing the Infernalis branch again, it
>> looks like the problem still exists although it is handled a tad
>> better in Infernalis. I'm going to test against Firefly/Giant next
>> week and then try and dive into the code to see if I can expose any
>> thing.
> 
>> If I can do anything to provide you with information, please let me know.
> 
> I have fixed my troubles by setting MTU back to 1500 from 9000 in 2x10G 
> network
> between nodes (2x Cisco Nexus 5020, one link per switch, LACP, linux bounding
> driver: bonding mode=4 lacp_rate=1 xmit_hash_policy=1 miimon=100, Intel 
> 82599ES
> Adapter, non-intel sfp+). When setting it to 9000 on nodes and 9216 on Nexus 
> 5020
> switch with Jumbo frames enabled i have performance drop and slow requests. 
> When
> setting 1500 on nodes and not touching Nexus all problems are fixed.
> 
> I have rebooted all my ceph services when changing MTU and changing things to
> 9000 and 1500 several times in order to be sure. It is reproducable in my
> environment.
> 
>> Thanks,
>> -BEGIN PGP SIGNATURE-
>> Version: Mailvelope v1.2.0
>> Comment: https://www.mailvelope.com
> 
>> wsFcBAEBCAAQBQJWF1QlCRDmVDuy+mK58QAAWLgP/2l+TkcpeKihDxF8h/kw
>> YFffNWODNfOMq8FVDQkQceo2mFCFc29JnBYiAeqW+XPelwuU5S86LG998aUB
>> BvIU4EHaJNJ31X1NCIA7nwi8rXlFYfSG2qQn58+IzqZoWCQM5vD/THISV1rP
>> qQKtoOAEuRxz+vOAJGI1A1xJSOiFwTRjs4LjE1zYjSP26LdEF61D/lb+AVzV
>> ufxi/ci6mAla/4VTAH4VqEviDgC8AbAZnWFGfUPcTUxJQS99kFrfjJnWvgyF
>> V9EmWtQCvhRO74hQLBqspOwdAxEJesPfGcJT1LjR0eEAMWvbGPtaqbSFAEWa
>> jjyy5wP9+4NnGLdhba6UBtLphjqTcl0e2vVwRj0zLhI14moAOlbhIKmZ1Dt+
>> 1P6vfgOUGvO76xgDMwrVKRoQgWJO/0Tup9+oqInnNYgf4W+ZWsLgLgo7ETAF
>> VcI7LP1wkwAI3lz5YphY/TnKNGs6i+wVjKBamOt3R1yz9WeylaG0T6xgGHrs
>> VugrRSUuO+ND9+mE5EsUgITCZoaavXJESJMb30XkK6hYGB+T/q+hBafc6Wle
>> Jgs+aT2m1erdSyZn0ZC9a6CjWmwJXY6FCSGhE53BbefBxmCFxn+8tVav+Q8W
>> 7s14TntP6ex4ca7eTwGuSXC9FU5fAVa+3+3aXDAC1QPAkeVkXyB716W1XG6b
>> BCFo
>> =GJL4
>> -END PGP SIGNATURE-
>> 
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
>> On Wed, Oct 7, 2015 at 1:25 PM, Robert LeBlanc  wrote:
>>> -BEGIN PGP SIGNED MESSAGE-
>>> Hash: SHA256
>>> 
>>> We forgot to upload the ceph.log yesterday. It is there now.
>>> - 
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> 
>>> 
>>> On Tue, Oct 6, 2015 at 5:40 PM, Robert LeBlanc  wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA256
 
 I upped the debug on about everything and ran the test for about 40
 minutes. I took OSD.19 on ceph1 doen and then brought it back in.
 There was at least one op on osd.19 that was blocked for over 1,000
 seconds. Hopefully this will have something that will cast a light on
 what is going on.
 
 We are going to upgrade this cluster to Infernalis tomorrow and rerun
 the test to verify the results from the dev cluster. This cluster
 matches the hardware of our production cluster but is not yet in
 production so we can safely wipe it to downgrade back to Hammer.
 
 Logs are located at http://dev.v3trae.net/~jlavoy/ceph/logs/
 
 Let me know what else we can do to help.
 
 Thanks,
 -BEGIN PGP SIGNATURE-
 Version: Mailvelope v1.2.0
 Comment: https://www.mailvelope.com
 
 wsFcBAEBCAAQBQJWFFwACRDmVDuy+mK58QAAs/UP/1L+y7DEfHqD/5OpkiNQ
 xuEEDm7fNJK58tLRmKsCrDrsFUvWCjiqUwboPg/E40e2GN7Lt+VkhMUEUWoo
 e3L20ig04c8Zu6fE/SXX3lnvayxsWTPcMnYI+HsmIV9E/efDLVLEf6T4fvXg
 5dKLiqQ8Apu+UMVfd1+aKKDdLdnYlgBCZcIV9AQe1GB8X2VJJhmNWh6TQ3Xr
 gNXDexBdYjFBLu84FXOITd3ZtyUkgx/exCUMmwsJSc90jduzipS5hArvf7LN
 HD6m1gBkZNbfWfc/4nzqOQnKdY1pd9jyoiQM70jn0R5b2BlZT0wLjiAJm+07
 eCCQ99TZHFyeu1LyovakrYncXcnPtP5TfBFZW952FWQugupvxPCcaduz+GJV
 OhPAJ9dv90qbbGCO+8kpTMAD1aHgt/7+0/hKZTg8WMHhua68SFCXmdGAmqje
 IkIKswIAX4/uIoo5mK4TYB5HdEMJf9DzBFd+1RzzfRrrRalVkBfsu5ChFTx3
 mu5LAMwKTslvILMxAct0JwnwkOX5Gd+OFvmBRdm16UpDaDTQT2DfykylcmJd
 Cf9rPZxUv0ZHtZyTTyP2e6vgrc7UM/Ie5KonABxQ11mGtT8ysra3c9kMhYpw
 

Re: [ceph-users] Potential OSD deadlock?

2015-10-09 Thread Max A. Krasilnikov
Hello!

On Thu, Oct 08, 2015 at 11:44:09PM -0600, robert wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256

> Sage,

> After trying to bisect this issue (all test moved the bisect towards
> Infernalis) and eventually testing the Infernalis branch again, it
> looks like the problem still exists although it is handled a tad
> better in Infernalis. I'm going to test against Firefly/Giant next
> week and then try and dive into the code to see if I can expose any
> thing.

> If I can do anything to provide you with information, please let me know.

I have fixed my troubles by setting the MTU back to 1500 from 9000 on the 2x10G
network between nodes (2x Cisco Nexus 5020, one link per switch, LACP, Linux
bonding driver: bonding mode=4 lacp_rate=1 xmit_hash_policy=1 miimon=100, Intel
82599ES adapter, non-Intel SFP+). When setting it to 9000 on the nodes and 9216
on the Nexus 5020 switches with jumbo frames enabled, I get a performance drop
and slow requests. When setting 1500 on the nodes and not touching the Nexus,
all problems are fixed.

I have rebooted all my Ceph services when changing the MTU, and switched between
9000 and 1500 several times in order to be sure. It is reproducible in my
environment.
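
For reference, a sketch of the bonding/MTU setup being described (bond0, eth2
and eth3 are placeholder names; the module options are the ones listed above):

# 802.3ad bonding as described: mode 4, fast LACP, layer3+4 hashing, 100 ms link monitoring
modprobe bonding mode=4 lacp_rate=1 xmit_hash_policy=1 miimon=100

# For the jumbo-frame test, raise the MTU on the slaves and the bond itself;
# the switch ports must also accept larger frames (9216 on the Nexus side)
ip link set dev eth2 mtu 9000
ip link set dev eth3 mtu 9000
ip link set dev bond0 mtu 9000

# Verify it stuck on every interface, e.g. after a cable pull or re-enslave
ip link show | egrep 'bond0|eth2|eth3'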

> Thanks,
> -BEGIN PGP SIGNATURE-
> Version: Mailvelope v1.2.0
> Comment: https://www.mailvelope.com

> wsFcBAEBCAAQBQJWF1QlCRDmVDuy+mK58QAAWLgP/2l+TkcpeKihDxF8h/kw
> YFffNWODNfOMq8FVDQkQceo2mFCFc29JnBYiAeqW+XPelwuU5S86LG998aUB
> BvIU4EHaJNJ31X1NCIA7nwi8rXlFYfSG2qQn58+IzqZoWCQM5vD/THISV1rP
> qQKtoOAEuRxz+vOAJGI1A1xJSOiFwTRjs4LjE1zYjSP26LdEF61D/lb+AVzV
> ufxi/ci6mAla/4VTAH4VqEviDgC8AbAZnWFGfUPcTUxJQS99kFrfjJnWvgyF
> V9EmWtQCvhRO74hQLBqspOwdAxEJesPfGcJT1LjR0eEAMWvbGPtaqbSFAEWa
> jjyy5wP9+4NnGLdhba6UBtLphjqTcl0e2vVwRj0zLhI14moAOlbhIKmZ1Dt+
> 1P6vfgOUGvO76xgDMwrVKRoQgWJO/0Tup9+oqInnNYgf4W+ZWsLgLgo7ETAF
> VcI7LP1wkwAI3lz5YphY/TnKNGs6i+wVjKBamOt3R1yz9WeylaG0T6xgGHrs
> VugrRSUuO+ND9+mE5EsUgITCZoaavXJESJMb30XkK6hYGB+T/q+hBafc6Wle
> Jgs+aT2m1erdSyZn0ZC9a6CjWmwJXY6FCSGhE53BbefBxmCFxn+8tVav+Q8W
> 7s14TntP6ex4ca7eTwGuSXC9FU5fAVa+3+3aXDAC1QPAkeVkXyB716W1XG6b
> BCFo
> =GJL4
> -END PGP SIGNATURE-
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


> On Wed, Oct 7, 2015 at 1:25 PM, Robert LeBlanc  wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> We forgot to upload the ceph.log yesterday. It is there now.
>> - 
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Tue, Oct 6, 2015 at 5:40 PM, Robert LeBlanc  wrote:
>>> -BEGIN PGP SIGNED MESSAGE-
>>> Hash: SHA256
>>>
>>> I upped the debug on about everything and ran the test for about 40
>>> minutes. I took OSD.19 on ceph1 doen and then brought it back in.
>>> There was at least one op on osd.19 that was blocked for over 1,000
>>> seconds. Hopefully this will have something that will cast a light on
>>> what is going on.
>>>
>>> We are going to upgrade this cluster to Infernalis tomorrow and rerun
>>> the test to verify the results from the dev cluster. This cluster
>>> matches the hardware of our production cluster but is not yet in
>>> production so we can safely wipe it to downgrade back to Hammer.
>>>
>>> Logs are located at http://dev.v3trae.net/~jlavoy/ceph/logs/
>>>
>>> Let me know what else we can do to help.
>>>
>>> Thanks,
>>> -BEGIN PGP SIGNATURE-
>>> Version: Mailvelope v1.2.0
>>> Comment: https://www.mailvelope.com
>>>
>>> wsFcBAEBCAAQBQJWFFwACRDmVDuy+mK58QAAs/UP/1L+y7DEfHqD/5OpkiNQ
>>> xuEEDm7fNJK58tLRmKsCrDrsFUvWCjiqUwboPg/E40e2GN7Lt+VkhMUEUWoo
>>> e3L20ig04c8Zu6fE/SXX3lnvayxsWTPcMnYI+HsmIV9E/efDLVLEf6T4fvXg
>>> 5dKLiqQ8Apu+UMVfd1+aKKDdLdnYlgBCZcIV9AQe1GB8X2VJJhmNWh6TQ3Xr
>>> gNXDexBdYjFBLu84FXOITd3ZtyUkgx/exCUMmwsJSc90jduzipS5hArvf7LN
>>> HD6m1gBkZNbfWfc/4nzqOQnKdY1pd9jyoiQM70jn0R5b2BlZT0wLjiAJm+07
>>> eCCQ99TZHFyeu1LyovakrYncXcnPtP5TfBFZW952FWQugupvxPCcaduz+GJV
>>> OhPAJ9dv90qbbGCO+8kpTMAD1aHgt/7+0/hKZTg8WMHhua68SFCXmdGAmqje
>>> IkIKswIAX4/uIoo5mK4TYB5HdEMJf9DzBFd+1RzzfRrrRalVkBfsu5ChFTx3
>>> mu5LAMwKTslvILMxAct0JwnwkOX5Gd+OFvmBRdm16UpDaDTQT2DfykylcmJd
>>> Cf9rPZxUv0ZHtZyTTyP2e6vgrc7UM/Ie5KonABxQ11mGtT8ysra3c9kMhYpw
>>> D6hcAZGtdvpiBRXBC5gORfiFWFxwu5kQ+daUhgUIe/O/EWyeD0rirZoqlLnZ
>>> EDrG
>>> =BZVw
>>> -END PGP SIGNATURE-
>>> 
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>
>>>
>>> On Tue, Oct 6, 2015 at 2:36 PM, Robert LeBlanc  wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA256

 On my second test (a much longer one), it took nearly an hour, but a
 few messages have popped up over a 20 window. Still far less than I
 have been seeing.
 - 
 Robert LeBlanc
 PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


 On Tue, Oct 6, 2015 at 2:00 PM, Robert LeBlanc  wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> I'll 

Re: [ceph-users] Potential OSD deadlock?

2015-10-09 Thread Dzianis Kahanovich
An additional issue with Intel NICs: some of them (the I*GB series handled by the
igb driver, not e1000e) are multiqueue, so their default qdisc is "mq", not
"pfifo_fast". I have half of the cluster on e1000e and half on igb (each node with
2x NICs, bonding+bridge, no jumbo frames, txqueuelen 2000). On my multiqueue NICs,
irqbalance produces massive network drops (visible with a simple ping), so now I
kill irqbalance on every node.


But some people (myself included) replace the default qdisc with something else. I
use prio + 3x pfifo (limit 2000), with all non-cluster src+dst traffic filtered
into class 3. On single-queue NICs (e1000e) there is 1 pfifo per NIC. On multiqueue
NICs it MAY be 1 pfifo per NIC, but I use 1 pfifo per mq class - 8 per NIC on 8
cores.
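
A sketch of that prio + pfifo layout on a single-queue NIC (eth0 and the
192.168.55.0/24 cluster network are placeholders borrowed from the log excerpts
in this thread, not the actual values used here):

# Root prio qdisc with 3 bands, each backed by a pfifo limited to 2000 packets
tc qdisc add dev eth0 root handle 1: prio bands 3
tc qdisc add dev eth0 parent 1:1 handle 10: pfifo limit 2000
tc qdisc add dev eth0 parent 1:2 handle 20: pfifo limit 2000
tc qdisc add dev eth0 parent 1:3 handle 30: pfifo limit 2000

# Keep cluster traffic in band 1 and push everything else into band 3
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip dst 192.168.55.0/24 flowid 1:1
tc filter add dev eth0 parent 1: protocol ip prio 2 u32 match ip dst 0.0.0.0/0 flowid 1:3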


Depending on other things (SMP, NUMA, balancers, task scheduler details), these
settings can be significant too.


PS Last detail: all e1000e: e1000e.InterruptThrottleRate=1,1

Max A. Krasilnikov writes:

Hello!

On Thu, Oct 08, 2015 at 11:44:09PM -0600, robert wrote:


-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256



Sage,



After trying to bisect this issue (all test moved the bisect towards
Infernalis) and eventually testing the Infernalis branch again, it
looks like the problem still exists although it is handled a tad
better in Infernalis. I'm going to test against Firefly/Giant next
week and then try and dive into the code to see if I can expose any
thing.



If I can do anything to provide you with information, please let me know.


I have fixed my troubles by setting MTU back to 1500 from 9000 in 2x10G network
between nodes (2x Cisco Nexus 5020, one link per switch, LACP, linux bounding
driver: bonding mode=4 lacp_rate=1 xmit_hash_policy=1 miimon=100, Intel 82599ES
Adapter, non-intel sfp+). When setting it to 9000 on nodes and 9216 on Nexus 
5020
switch with Jumbo frames enabled i have performance drop and slow requests. When
setting 1500 on nodes and not touching Nexus all problems are fixed.

I have rebooted all my ceph services when changing MTU and changing things to
9000 and 1500 several times in order to be sure. It is reproducable in my
environment.


Thanks,
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.2.0
Comment: https://www.mailvelope.com



wsFcBAEBCAAQBQJWF1QlCRDmVDuy+mK58QAAWLgP/2l+TkcpeKihDxF8h/kw
YFffNWODNfOMq8FVDQkQceo2mFCFc29JnBYiAeqW+XPelwuU5S86LG998aUB
BvIU4EHaJNJ31X1NCIA7nwi8rXlFYfSG2qQn58+IzqZoWCQM5vD/THISV1rP
qQKtoOAEuRxz+vOAJGI1A1xJSOiFwTRjs4LjE1zYjSP26LdEF61D/lb+AVzV
ufxi/ci6mAla/4VTAH4VqEviDgC8AbAZnWFGfUPcTUxJQS99kFrfjJnWvgyF
V9EmWtQCvhRO74hQLBqspOwdAxEJesPfGcJT1LjR0eEAMWvbGPtaqbSFAEWa
jjyy5wP9+4NnGLdhba6UBtLphjqTcl0e2vVwRj0zLhI14moAOlbhIKmZ1Dt+
1P6vfgOUGvO76xgDMwrVKRoQgWJO/0Tup9+oqInnNYgf4W+ZWsLgLgo7ETAF
VcI7LP1wkwAI3lz5YphY/TnKNGs6i+wVjKBamOt3R1yz9WeylaG0T6xgGHrs
VugrRSUuO+ND9+mE5EsUgITCZoaavXJESJMb30XkK6hYGB+T/q+hBafc6Wle
Jgs+aT2m1erdSyZn0ZC9a6CjWmwJXY6FCSGhE53BbefBxmCFxn+8tVav+Q8W
7s14TntP6ex4ca7eTwGuSXC9FU5fAVa+3+3aXDAC1QPAkeVkXyB716W1XG6b
BCFo
=GJL4
-END PGP SIGNATURE-

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1




On Wed, Oct 7, 2015 at 1:25 PM, Robert LeBlanc  wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

We forgot to upload the ceph.log yesterday. It is there now.
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Oct 6, 2015 at 5:40 PM, Robert LeBlanc  wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

I upped the debug on about everything and ran the test for about 40
minutes. I took OSD.19 on ceph1 doen and then brought it back in.
There was at least one op on osd.19 that was blocked for over 1,000
seconds. Hopefully this will have something that will cast a light on
what is going on.

We are going to upgrade this cluster to Infernalis tomorrow and rerun
the test to verify the results from the dev cluster. This cluster
matches the hardware of our production cluster but is not yet in
production so we can safely wipe it to downgrade back to Hammer.

Logs are located at http://dev.v3trae.net/~jlavoy/ceph/logs/

Let me know what else we can do to help.

Thanks,
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.2.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWFFwACRDmVDuy+mK58QAAs/UP/1L+y7DEfHqD/5OpkiNQ
xuEEDm7fNJK58tLRmKsCrDrsFUvWCjiqUwboPg/E40e2GN7Lt+VkhMUEUWoo
e3L20ig04c8Zu6fE/SXX3lnvayxsWTPcMnYI+HsmIV9E/efDLVLEf6T4fvXg
5dKLiqQ8Apu+UMVfd1+aKKDdLdnYlgBCZcIV9AQe1GB8X2VJJhmNWh6TQ3Xr
gNXDexBdYjFBLu84FXOITd3ZtyUkgx/exCUMmwsJSc90jduzipS5hArvf7LN
HD6m1gBkZNbfWfc/4nzqOQnKdY1pd9jyoiQM70jn0R5b2BlZT0wLjiAJm+07
eCCQ99TZHFyeu1LyovakrYncXcnPtP5TfBFZW952FWQugupvxPCcaduz+GJV
OhPAJ9dv90qbbGCO+8kpTMAD1aHgt/7+0/hKZTg8WMHhua68SFCXmdGAmqje
IkIKswIAX4/uIoo5mK4TYB5HdEMJf9DzBFd+1RzzfRrrRalVkBfsu5ChFTx3
mu5LAMwKTslvILMxAct0JwnwkOX5Gd+OFvmBRdm16UpDaDTQT2DfykylcmJd
Cf9rPZxUv0ZHtZyTTyP2e6vgrc7UM/Ie5KonABxQ11mGtT8ysra3c9kMhYpw
D6hcAZGtdvpiBRXBC5gORfiFWFxwu5kQ+daUhgUIe/O/EWyeD0rirZoqlLnZ
EDrG
=BZVw
-END PGP SIGNATURE-

Robert LeBlanc

Re: [ceph-users] Potential OSD deadlock?

2015-10-09 Thread Max A. Krasilnikov
Hello!

On Fri, Oct 09, 2015 at 11:05:59AM +0200, jan wrote:

> Are there any errors on the NICs? (ethtool -s ethX)

No errors. Neither on nodes, nor on switches.

> Also take a look at the switch and look for flow control statistics - do you 
> have flow control enabled or disabled?

flow control disabled everywhere.

> We had to disable flow control as it would pause all IO on the port whenever 
> any path got congested which you don't want to happen with a cluster like 
> Ceph. It's better to let the frame drop/retransmit in this case (and you 
> should size it so it doesn't happen in any case).
> And how about NIC offloads? Do they play nice with jumbo frames? I wouldn't 
> put my money on that...

I tried to completely disable all offloads and then set the MTU back to 9000.
No luck.
I am speaking with my NOC about the MTU in the 10G network. If I have an update,
I will write here. I can hardly believe that it is on the Ceph side, but nothing
is impossible.
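
For reference, a sketch of what that looks like with ethtool/ip (eth2 is a
placeholder; the exact set of offloads available depends on the driver):

# Disable the usual offloads (TSO/GSO/GRO/LRO and checksum offload)
ethtool -K eth2 tso off gso off gro off lro off rx off tx off

# Then raise the MTU again (repeat on the other slave and the bond master)
ip link set dev eth2 mtu 9000

# Confirm what is actually in effect
ethtool -k eth2
ip link show dev eth2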

> Jan


>> On 09 Oct 2015, at 10:48, Max A. Krasilnikov  wrote:
>> 
>> Hello!
>> 
>> On Thu, Oct 08, 2015 at 11:44:09PM -0600, robert wrote:
>> 
>>> -BEGIN PGP SIGNED MESSAGE-
>>> Hash: SHA256
>> 
>>> Sage,
>> 
>>> After trying to bisect this issue (all test moved the bisect towards
>>> Infernalis) and eventually testing the Infernalis branch again, it
>>> looks like the problem still exists although it is handled a tad
>>> better in Infernalis. I'm going to test against Firefly/Giant next
>>> week and then try and dive into the code to see if I can expose any
>>> thing.
>> 
>>> If I can do anything to provide you with information, please let me know.
>> 
>> I have fixed my troubles by setting MTU back to 1500 from 9000 in 2x10G 
>> network
>> between nodes (2x Cisco Nexus 5020, one link per switch, LACP, linux bounding
>> driver: bonding mode=4 lacp_rate=1 xmit_hash_policy=1 miimon=100, Intel 
>> 82599ES
>> Adapter, non-intel sfp+). When setting it to 9000 on nodes and 9216 on Nexus 
>> 5020
>> switch with Jumbo frames enabled i have performance drop and slow requests. 
>> When
>> setting 1500 on nodes and not touching Nexus all problems are fixed.
>> 
>> I have rebooted all my ceph services when changing MTU and changing things to
>> 9000 and 1500 several times in order to be sure. It is reproducable in my
>> environment.
>> 
>>> Thanks,
>>> -BEGIN PGP SIGNATURE-
>>> Version: Mailvelope v1.2.0
>>> Comment: https://www.mailvelope.com
>> 
>>> wsFcBAEBCAAQBQJWF1QlCRDmVDuy+mK58QAAWLgP/2l+TkcpeKihDxF8h/kw
>>> YFffNWODNfOMq8FVDQkQceo2mFCFc29JnBYiAeqW+XPelwuU5S86LG998aUB
>>> BvIU4EHaJNJ31X1NCIA7nwi8rXlFYfSG2qQn58+IzqZoWCQM5vD/THISV1rP
>>> qQKtoOAEuRxz+vOAJGI1A1xJSOiFwTRjs4LjE1zYjSP26LdEF61D/lb+AVzV
>>> ufxi/ci6mAla/4VTAH4VqEviDgC8AbAZnWFGfUPcTUxJQS99kFrfjJnWvgyF
>>> V9EmWtQCvhRO74hQLBqspOwdAxEJesPfGcJT1LjR0eEAMWvbGPtaqbSFAEWa
>>> jjyy5wP9+4NnGLdhba6UBtLphjqTcl0e2vVwRj0zLhI14moAOlbhIKmZ1Dt+
>>> 1P6vfgOUGvO76xgDMwrVKRoQgWJO/0Tup9+oqInnNYgf4W+ZWsLgLgo7ETAF
>>> VcI7LP1wkwAI3lz5YphY/TnKNGs6i+wVjKBamOt3R1yz9WeylaG0T6xgGHrs
>>> VugrRSUuO+ND9+mE5EsUgITCZoaavXJESJMb30XkK6hYGB+T/q+hBafc6Wle
>>> Jgs+aT2m1erdSyZn0ZC9a6CjWmwJXY6FCSGhE53BbefBxmCFxn+8tVav+Q8W
>>> 7s14TntP6ex4ca7eTwGuSXC9FU5fAVa+3+3aXDAC1QPAkeVkXyB716W1XG6b
>>> BCFo
>>> =GJL4
>>> -END PGP SIGNATURE-
>>> 
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> 
>> 
>>> On Wed, Oct 7, 2015 at 1:25 PM, Robert LeBlanc  wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA256
 
 We forgot to upload the ceph.log yesterday. It is there now.
 - 
 Robert LeBlanc
 PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
 
 
 On Tue, Oct 6, 2015 at 5:40 PM, Robert LeBlanc  wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
> 
> I upped the debug on about everything and ran the test for about 40
> minutes. I took OSD.19 on ceph1 doen and then brought it back in.
> There was at least one op on osd.19 that was blocked for over 1,000
> seconds. Hopefully this will have something that will cast a light on
> what is going on.
> 
> We are going to upgrade this cluster to Infernalis tomorrow and rerun
> the test to verify the results from the dev cluster. This cluster
> matches the hardware of our production cluster but is not yet in
> production so we can safely wipe it to downgrade back to Hammer.
> 
> Logs are located at http://dev.v3trae.net/~jlavoy/ceph/logs/
> 
> Let me know what else we can do to help.
> 
> Thanks,
> -BEGIN PGP SIGNATURE-
> Version: Mailvelope v1.2.0
> Comment: https://www.mailvelope.com
> 
> wsFcBAEBCAAQBQJWFFwACRDmVDuy+mK58QAAs/UP/1L+y7DEfHqD/5OpkiNQ
> xuEEDm7fNJK58tLRmKsCrDrsFUvWCjiqUwboPg/E40e2GN7Lt+VkhMUEUWoo
> e3L20ig04c8Zu6fE/SXX3lnvayxsWTPcMnYI+HsmIV9E/efDLVLEf6T4fvXg
> 

Re: [ceph-users] Potential OSD deadlock?

2015-10-09 Thread Jan Schermer
Have you tried running iperf between the nodes? Capturing a pcap of the 
(failing) Ceph comms from both sides could help narrow it down.
Is there any SDN layer involved that could add overhead/padding to the frames?

What about some intermediate MTU like 8000 - does that work?
Oh, and if there's any bonding/trunking involved, beware that you need to set 
the same MTU and offloads on all interfaces on certain kernels - flags like 
MTU/offloads should propagate between the master/slave interfaces, but in 
reality that's not always the case and they get reset even if you unplug/replug 
the ethernet cable.
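
A sketch of those checks (node names, bond0/eth2/eth3 and the capture filter are
placeholders):

# Raw TCP throughput between two nodes
iperf -s                        # on the receiving node
iperf -c storage-node2 -t 30    # on the sending node

# Capture the Ceph traffic on both sides for comparison (OSDs use 6800-7300 by default)
tcpdump -ni bond0 -s 0 -w ceph-node1.pcap 'tcp portrange 6800-7300'

# Check that a full-size jumbo frame makes it end to end without fragmentation
# (8972 bytes of ICMP payload + 8 ICMP + 20 IP = 9000)
ping -M do -s 8972 storage-node2

# Verify the bond master and every slave agree on MTU and offload flags
ip link show bond0; ip link show eth2; ip link show eth3
diff <(ethtool -k eth2 | tail -n +2) <(ethtool -k eth3 | tail -n +2)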

Jan

> On 09 Oct 2015, at 13:21, Max A. Krasilnikov  wrote:
> 
> Hello!
> 
> On Fri, Oct 09, 2015 at 11:05:59AM +0200, jan wrote:
> 
>> Are there any errors on the NICs? (ethtool -s ethX)
> 
> No errors. Neither on nodes, nor on switches.
> 
>> Also take a look at the switch and look for flow control statistics - do you 
>> have flow control enabled or disabled?
> 
> flow control disabled everywhere.
> 
>> We had to disable flow control as it would pause all IO on the port whenever 
>> any path got congested which you don't want to happen with a cluster like 
>> Ceph. It's better to let the frame drop/retransmit in this case (and you 
>> should size it so it doesn't happen in any case).
>> And how about NIC offloads? Do they play nice with jumbo frames? I wouldn't 
>> put my money on that...
> 
> I tried to completely disable all offloads and setting mtu back to 9000 after.
> No luck.
> I am speaking with my NOC about MTU in 10G network. If I have update, I will
> write here. I can hardly beleave that it is ceph side, but nothing is
> impossible.
> 
>> Jan
> 
> 
>>> On 09 Oct 2015, at 10:48, Max A. Krasilnikov  wrote:
>>> 
>>> Hello!
>>> 
>>> On Thu, Oct 08, 2015 at 11:44:09PM -0600, robert wrote:
>>> 
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA256
>>> 
 Sage,
>>> 
 After trying to bisect this issue (all test moved the bisect towards
 Infernalis) and eventually testing the Infernalis branch again, it
 looks like the problem still exists although it is handled a tad
 better in Infernalis. I'm going to test against Firefly/Giant next
 week and then try and dive into the code to see if I can expose any
 thing.
>>> 
 If I can do anything to provide you with information, please let me know.
>>> 
>>> I have fixed my troubles by setting MTU back to 1500 from 9000 in 2x10G 
>>> network
>>> between nodes (2x Cisco Nexus 5020, one link per switch, LACP, linux 
>>> bounding
>>> driver: bonding mode=4 lacp_rate=1 xmit_hash_policy=1 miimon=100, Intel 
>>> 82599ES
>>> Adapter, non-intel sfp+). When setting it to 9000 on nodes and 9216 on 
>>> Nexus 5020
>>> switch with Jumbo frames enabled i have performance drop and slow requests. 
>>> When
>>> setting 1500 on nodes and not touching Nexus all problems are fixed.
>>> 
>>> I have rebooted all my ceph services when changing MTU and changing things 
>>> to
>>> 9000 and 1500 several times in order to be sure. It is reproducable in my
>>> environment.
>>> 
 Thanks,
 -BEGIN PGP SIGNATURE-
 Version: Mailvelope v1.2.0
 Comment: https://www.mailvelope.com
>>> 
 wsFcBAEBCAAQBQJWF1QlCRDmVDuy+mK58QAAWLgP/2l+TkcpeKihDxF8h/kw
 YFffNWODNfOMq8FVDQkQceo2mFCFc29JnBYiAeqW+XPelwuU5S86LG998aUB
 BvIU4EHaJNJ31X1NCIA7nwi8rXlFYfSG2qQn58+IzqZoWCQM5vD/THISV1rP
 qQKtoOAEuRxz+vOAJGI1A1xJSOiFwTRjs4LjE1zYjSP26LdEF61D/lb+AVzV
 ufxi/ci6mAla/4VTAH4VqEviDgC8AbAZnWFGfUPcTUxJQS99kFrfjJnWvgyF
 V9EmWtQCvhRO74hQLBqspOwdAxEJesPfGcJT1LjR0eEAMWvbGPtaqbSFAEWa
 jjyy5wP9+4NnGLdhba6UBtLphjqTcl0e2vVwRj0zLhI14moAOlbhIKmZ1Dt+
 1P6vfgOUGvO76xgDMwrVKRoQgWJO/0Tup9+oqInnNYgf4W+ZWsLgLgo7ETAF
 VcI7LP1wkwAI3lz5YphY/TnKNGs6i+wVjKBamOt3R1yz9WeylaG0T6xgGHrs
 VugrRSUuO+ND9+mE5EsUgITCZoaavXJESJMb30XkK6hYGB+T/q+hBafc6Wle
 Jgs+aT2m1erdSyZn0ZC9a6CjWmwJXY6FCSGhE53BbefBxmCFxn+8tVav+Q8W
 7s14TntP6ex4ca7eTwGuSXC9FU5fAVa+3+3aXDAC1QPAkeVkXyB716W1XG6b
 BCFo
 =GJL4
 -END PGP SIGNATURE-
 
 Robert LeBlanc
 PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> 
>>> 
 On Wed, Oct 7, 2015 at 1:25 PM, Robert LeBlanc  
 wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
> 
> We forgot to upload the ceph.log yesterday. It is there now.
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Tue, Oct 6, 2015 at 5:40 PM, Robert LeBlanc  wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>> 
>> I upped the debug on about everything and ran the test for about 40
>> minutes. I took OSD.19 on ceph1 doen and then brought it back in.
>> There was at least one op on osd.19 that was blocked for over 1,000
>> seconds. Hopefully this will have something that will cast a light on

Re: [ceph-users] Potential OSD deadlock?

2015-10-09 Thread Max A. Krasilnikov
Hello!

On Fri, Oct 09, 2015 at 01:45:42PM +0200, jan wrote:

> Have you tried running iperf between the nodes? Capturing a pcap of the 
> (failing) Ceph comms from both sides could help narrow it down.
> Is there any SDN layer involved that could add overhead/padding to the frames?

No other layers, only the 2x Nexus 5020 with virtual port channels. Everything
else I will check on Monday.

> What about some intermediate MTU like 8000 - does that work?

Not tested. I will.

> Oh and if there's any bonding/trunking involved, beware that you need to set 
> the same MTU and offloads on all interfaces on certains kernels - flags like 
> MTU/offloads should propagate between the master/slave interfaces but in 
> reality it's not the case and they get reset even if you unplug/replug the 
> ethernet cable.

Yes, I understand that :) I set the parameters on both interfaces and checked
them with "ip link".

> Jan

>> On 09 Oct 2015, at 13:21, Max A. Krasilnikov  wrote:
>> 
>> Hello!
>> 
>> On Fri, Oct 09, 2015 at 11:05:59AM +0200, jan wrote:
>> 
>>> Are there any errors on the NICs? (ethtool -s ethX)
>> 
>> No errors. Neither on nodes, nor on switches.
>> 
>>> Also take a look at the switch and look for flow control statistics - do 
>>> you have flow control enabled or disabled?
>> 
>> flow control disabled everywhere.
>> 
>>> We had to disable flow control as it would pause all IO on the port 
>>> whenever any path got congested which you don't want to happen with a 
>>> cluster like Ceph. It's better to let the frame drop/retransmit in this 
>>> case (and you should size it so it doesn't happen in any case).
>>> And how about NIC offloads? Do they play nice with jumbo frames? I wouldn't 
>>> put my money on that...
>> 
>> I tried to completely disable all offloads and setting mtu back to 9000 
>> after.
>> No luck.
>> I am speaking with my NOC about MTU in 10G network. If I have update, I will
>> write here. I can hardly beleave that it is ceph side, but nothing is
>> impossible.
>> 
>>> Jan
>> 
>> 
 On 09 Oct 2015, at 10:48, Max A. Krasilnikov  wrote:
 
 Hello!
 
 On Thu, Oct 08, 2015 at 11:44:09PM -0600, robert wrote:
 
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
 
> Sage,
 
> After trying to bisect this issue (all test moved the bisect towards
> Infernalis) and eventually testing the Infernalis branch again, it
> looks like the problem still exists although it is handled a tad
> better in Infernalis. I'm going to test against Firefly/Giant next
> week and then try and dive into the code to see if I can expose any
> thing.
 
> If I can do anything to provide you with information, please let me know.
 
 I have fixed my troubles by setting the MTU back to 1500 from 9000 in the 2x10G
 network between nodes (2x Cisco Nexus 5020, one link per switch, LACP, Linux
 bonding driver: bonding mode=4 lacp_rate=1 xmit_hash_policy=1 miimon=100, Intel
 82599ES adapter, non-Intel SFP+). When setting it to 9000 on the nodes and 9216 on
 the Nexus 5020 switches with jumbo frames enabled I get a performance drop and
 slow requests. When setting 1500 on the nodes and not touching the Nexus, all
 problems are fixed.
 
 I have rebooted all my ceph services when changing the MTU, and changed things to
 9000 and 1500 several times in order to be sure. It is reproducible in my
 environment.
 
> Thanks,
> -BEGIN PGP SIGNATURE-
> Version: Mailvelope v1.2.0
> Comment: https://www.mailvelope.com
 
> wsFcBAEBCAAQBQJWF1QlCRDmVDuy+mK58QAAWLgP/2l+TkcpeKihDxF8h/kw
> YFffNWODNfOMq8FVDQkQceo2mFCFc29JnBYiAeqW+XPelwuU5S86LG998aUB
> BvIU4EHaJNJ31X1NCIA7nwi8rXlFYfSG2qQn58+IzqZoWCQM5vD/THISV1rP
> qQKtoOAEuRxz+vOAJGI1A1xJSOiFwTRjs4LjE1zYjSP26LdEF61D/lb+AVzV
> ufxi/ci6mAla/4VTAH4VqEviDgC8AbAZnWFGfUPcTUxJQS99kFrfjJnWvgyF
> V9EmWtQCvhRO74hQLBqspOwdAxEJesPfGcJT1LjR0eEAMWvbGPtaqbSFAEWa
> jjyy5wP9+4NnGLdhba6UBtLphjqTcl0e2vVwRj0zLhI14moAOlbhIKmZ1Dt+
> 1P6vfgOUGvO76xgDMwrVKRoQgWJO/0Tup9+oqInnNYgf4W+ZWsLgLgo7ETAF
> VcI7LP1wkwAI3lz5YphY/TnKNGs6i+wVjKBamOt3R1yz9WeylaG0T6xgGHrs
> VugrRSUuO+ND9+mE5EsUgITCZoaavXJESJMb30XkK6hYGB+T/q+hBafc6Wle
> Jgs+aT2m1erdSyZn0ZC9a6CjWmwJXY6FCSGhE53BbefBxmCFxn+8tVav+Q8W
> 7s14TntP6ex4ca7eTwGuSXC9FU5fAVa+3+3aXDAC1QPAkeVkXyB716W1XG6b
> BCFo
> =GJL4
> -END PGP SIGNATURE-
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
 
 
> On Wed, Oct 7, 2015 at 1:25 PM, Robert LeBlanc  
> wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>> 
>> We forgot to upload the ceph.log yesterday. It is there now.
>> - 
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> 
>> 
>> 

Re: [ceph-users] Potential OSD deadlock?

2015-10-08 Thread Dzianis Kahanovich
I have a probably similar situation on latest hammer & 4.1+ kernels on spinning
OSDs (journal on a leased partition on the same HDD): eventual slow requests, etc. Try:

1) even with the journal on a leased partition - "journal aio = false";
2) single-queue "noop" scheduler (OSDs);
3) reduce nr_requests to 32 (OSDs);
4) remove all other queue "tunes";
5) killall irqbalance (& any balancers except in-kernel NUMA auto-balancing);
6) net.ipv4.tcp_congestion_control = scalable
7) net.ipv4.tcp_notsent_lowat = 131072
8) vm.zone_reclaim_mode = 7

These are neutral, fairness-oriented settings.
If everything is fixed, play with other values (congestion control "yeah", etc).

Also, I put all active processes (ceph daemons & qemu) at a single RR priority
level ("chrt -par 3 $pid", etc).
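A consolidated sketch of the settings above, assuming sdb is one of the OSD data
disks; the ceph.conf change needs an OSD restart and the sysctls need to be made
persistent separately:

# ceph.conf, [osd] section:
#   journal aio = false
echo noop > /sys/block/sdb/queue/scheduler    # 2) single-queue noop scheduler
echo 32   > /sys/block/sdb/queue/nr_requests  # 3) reduce nr_requests
killall irqbalance                            # 5) keep only in-kernel NUMA balancing
sysctl -w net.ipv4.tcp_congestion_control=scalable   # 6) needs the tcp_scalable module
sysctl -w net.ipv4.tcp_notsent_lowat=131072          # 7)
sysctl -w vm.zone_reclaim_mode=7                     # 8)
# single RR priority level for the ceph daemons (and qemu, if present)
for p in $(pidof ceph-osd); do chrt -a -r -p 3 $p; done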


Robert LeBlanc writes:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

We have had two situations where I/O just seems to be indefinitely
blocked on our production cluster today (0.94.3). In the case this
morning it was just normal I/O traffic, no recovery or backfill. In the
case this evening, we were backfilling to some new OSDs. I would have
loved to bump up the debugging to get an idea of what was going on, but
time was exhausted. During the incident this evening I was able to do
some additional troubleshooting, but got really anxious after I/O had
been blocked for 10 minutes and ops were getting hot around the collar.

Here are the important parts of the logs:
[osd.30]
2015-09-18 23:05:36.188251 7efed0ef0700  0 log_channel(cluster) log
[WRN] : slow request 30.662958 seconds old,
  received at 2015-09-18 23:05:05.525220: osd_op(client.3117179.0:18654441
  rbd_data.1099d2f67aaea.0f62 [set-alloc-hint object_size
8388608 write_size 8388608,write 1048576~643072] 4.5ba1672c
ack+ondisk+write+known_if_redirected e55919)
  currently waiting for subops from 32,70,72

[osd.72]
2015-09-18 23:05:19.302985 7f3fa19f8700  0 log_channel(cluster) log
[WRN] : slow request 30.200408 seconds old,
  received at 2015-09-18 23:04:49.102519: osd_op(client.4267090.0:3510311
  rbd_data.3f41d41bd65b28.9e2b [set-alloc-hint object_size
4194304 write_size 4194304,write 1048576~421888] 17.40adcada
ack+ondisk+write+known_if_redirected e55919)
  currently waiting for subops from 2,30,90

The other OSDs listed (32,70,2,90) did not have any errors in the logs
about blocked I/O. It seems that osd.30 was waiting for osd.72 and
vice versa. I looked at top and iostat on these two hosts, and the OSD
processes and disk I/O were pretty idle.

I know that this isn't a lot to go on. Our cluster is under very heavy
load and we get several blocked I/Os every hour, but they usually
clear up within 15 seconds. We seem to get I/O blocked when the op
latency of the cluster goes above 1 (average from all OSDs as seen by
Graphite).

Has anyone seen this infinite blocked I/O? Bouncing osd.72 immediately
cleared all the blocked I/O and then it was fine after rejoining the
cluster. Which logs, and at what level, would be most beneficial to
increase for troubleshooting in this case?

I hope this makes sense, it has been a long day.
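
For reference, a minimal sketch of checking per-OSD op latency outside of Graphite,
using the admin socket; osd.30 is just the example from the log above and counter
names can differ slightly between releases:

# dump the perf counters of one OSD and pick out the op latency counters
ceph daemon osd.30 perf dump | python -m json.tool | grep -A 3 '"op_latency"'
# equivalent form if only the socket path is known
ceph --admin-daemon /var/run/ceph/ceph-osd.30.asok perf dump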

- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.1.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJV/QiuCRDmVDuy+mK58QAAfskP/A0+RRAtq49pwfJcmuaV
LKMsdaOFu0WL1zNLgnj4KOTR1oYyEShXW3Xn0axw1C2U2qXkJQfvMyQ7PTj7
cKqNeZl7rcgwkgXlij1hPYs9tjsetjYXBmmui+CqbSyNNo95aPrtUnWPcYnc
K7blP6wuv7p0ddaF8wgw3Jf0GhzlHyykvVlxLYjQWwBh1CTrSzNWcEiHz5NE
9Y/GU5VZn7o8jeJDh6tQGgSbUjdk4NM2WuhyWNEP1klV+x1P51krXYDR7cNC
DSWaud1hNtqYdquVPzx0UCcUVR0JfVlEX26uxRLgNd0dDkq+CRXIGhakVU75
Yxf8jwVdbAg1CpGtgHx6bWyho2rrsTzxeul8AFLWtELfod0e5nLsSUfQuQ2c
MXrIoyHUcs7ySP3ozazPOdxwBEpiovUZOBy1gl2sCSGvYsmYokHEO0eop2rl
kVS4dSAvDezmDhWumH60Y661uzySBGtrMlV/u3nw8vfvLhEAbuE+lLybMmtY
nJvJIzbTqFzxaeX4PTWcUhXRNaPp8PDS5obmx5Fpn+AYOeLet/S1Alz1qNM2
4w34JKwKO92PtDYqzA6cj628fltdLkxFNoz7DFfqxr80DM7ndLukmSkPY+Oq
qYOQMoownMnHuL0IrC9Jo8vK07H8agQyLF8/m4c3oTqnzZhh/rPRlPfyHEio
Roj5
=ut4B
-END PGP SIGNATURE-




--
WBR, Dzianis Kahanovich AKA Denis Kaganovich, http://mahatma.bspu.unibel.by/


Re: [ceph-users] Potential OSD deadlock?

2015-10-08 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Sage,

After trying to bisect this issue (all tests moved the bisect towards
Infernalis) and eventually testing the Infernalis branch again, it
looks like the problem still exists, although it is handled a tad
better in Infernalis. I'm going to test against Firefly/Giant next
week and then try to dive into the code to see if I can expose
anything.
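
A rough sketch of the bisect being described; the refs and the build-and-test step
are placeholders, not a tested recipe:

cd ceph
git bisect start
git bisect bad v0.94        # first Hammer tag that shows the problem (placeholder)
git bisect good v0.87.2     # a Giant tag believed to be good (placeholder)
# at each step: build, install on the dev cluster, run the reproducer, then
git bisect good             # or: git bisect bad
git bisect skip             # for commits that do not build cleanly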

If I can do anything to provide you with information, please let me know.

Thanks,
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.2.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWF1QlCRDmVDuy+mK58QAAWLgP/2l+TkcpeKihDxF8h/kw
YFffNWODNfOMq8FVDQkQceo2mFCFc29JnBYiAeqW+XPelwuU5S86LG998aUB
BvIU4EHaJNJ31X1NCIA7nwi8rXlFYfSG2qQn58+IzqZoWCQM5vD/THISV1rP
qQKtoOAEuRxz+vOAJGI1A1xJSOiFwTRjs4LjE1zYjSP26LdEF61D/lb+AVzV
ufxi/ci6mAla/4VTAH4VqEviDgC8AbAZnWFGfUPcTUxJQS99kFrfjJnWvgyF
V9EmWtQCvhRO74hQLBqspOwdAxEJesPfGcJT1LjR0eEAMWvbGPtaqbSFAEWa
jjyy5wP9+4NnGLdhba6UBtLphjqTcl0e2vVwRj0zLhI14moAOlbhIKmZ1Dt+
1P6vfgOUGvO76xgDMwrVKRoQgWJO/0Tup9+oqInnNYgf4W+ZWsLgLgo7ETAF
VcI7LP1wkwAI3lz5YphY/TnKNGs6i+wVjKBamOt3R1yz9WeylaG0T6xgGHrs
VugrRSUuO+ND9+mE5EsUgITCZoaavXJESJMb30XkK6hYGB+T/q+hBafc6Wle
Jgs+aT2m1erdSyZn0ZC9a6CjWmwJXY6FCSGhE53BbefBxmCFxn+8tVav+Q8W
7s14TntP6ex4ca7eTwGuSXC9FU5fAVa+3+3aXDAC1QPAkeVkXyB716W1XG6b
BCFo
=GJL4
-END PGP SIGNATURE-

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Oct 7, 2015 at 1:25 PM, Robert LeBlanc  wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> We forgot to upload the ceph.log yesterday. It is there now.
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Tue, Oct 6, 2015 at 5:40 PM, Robert LeBlanc  wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> I upped the debug on about everything and ran the test for about 40
>> minutes. I took OSD.19 on ceph1 down and then brought it back in.
>> There was at least one op on osd.19 that was blocked for over 1,000
>> seconds. Hopefully this will have something that will cast a light on
>> what is going on.
>>
>> We are going to upgrade this cluster to Infernalis tomorrow and rerun
>> the test to verify the results from the dev cluster. This cluster
>> matches the hardware of our production cluster but is not yet in
>> production so we can safely wipe it to downgrade back to Hammer.
>>
>> Logs are located at http://dev.v3trae.net/~jlavoy/ceph/logs/
>>
>> Let me know what else we can do to help.
>>
>> Thanks,
>> -BEGIN PGP SIGNATURE-
>> Version: Mailvelope v1.2.0
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJWFFwACRDmVDuy+mK58QAAs/UP/1L+y7DEfHqD/5OpkiNQ
>> xuEEDm7fNJK58tLRmKsCrDrsFUvWCjiqUwboPg/E40e2GN7Lt+VkhMUEUWoo
>> e3L20ig04c8Zu6fE/SXX3lnvayxsWTPcMnYI+HsmIV9E/efDLVLEf6T4fvXg
>> 5dKLiqQ8Apu+UMVfd1+aKKDdLdnYlgBCZcIV9AQe1GB8X2VJJhmNWh6TQ3Xr
>> gNXDexBdYjFBLu84FXOITd3ZtyUkgx/exCUMmwsJSc90jduzipS5hArvf7LN
>> HD6m1gBkZNbfWfc/4nzqOQnKdY1pd9jyoiQM70jn0R5b2BlZT0wLjiAJm+07
>> eCCQ99TZHFyeu1LyovakrYncXcnPtP5TfBFZW952FWQugupvxPCcaduz+GJV
>> OhPAJ9dv90qbbGCO+8kpTMAD1aHgt/7+0/hKZTg8WMHhua68SFCXmdGAmqje
>> IkIKswIAX4/uIoo5mK4TYB5HdEMJf9DzBFd+1RzzfRrrRalVkBfsu5ChFTx3
>> mu5LAMwKTslvILMxAct0JwnwkOX5Gd+OFvmBRdm16UpDaDTQT2DfykylcmJd
>> Cf9rPZxUv0ZHtZyTTyP2e6vgrc7UM/Ie5KonABxQ11mGtT8ysra3c9kMhYpw
>> D6hcAZGtdvpiBRXBC5gORfiFWFxwu5kQ+daUhgUIe/O/EWyeD0rirZoqlLnZ
>> EDrG
>> =BZVw
>> -END PGP SIGNATURE-
>> 
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Tue, Oct 6, 2015 at 2:36 PM, Robert LeBlanc  wrote:
>>> -BEGIN PGP SIGNED MESSAGE-
>>> Hash: SHA256
>>>
>>> On my second test (a much longer one), it took nearly an hour, but a
>>> few messages have popped up over a 20 window. Still far less than I
>>> have been seeing.
>>> - 
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>
>>>
>>> On Tue, Oct 6, 2015 at 2:00 PM, Robert LeBlanc  wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA256

 I'll capture another set of logs. Is there any other debugging you
 want turned up? I've seen the same thing where I see the message
 dispatched to the secondary OSD, but the message just doesn't show up
 for 30+ seconds in the secondary OSD logs.
 - 
 Robert LeBlanc
 PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


 On Tue, Oct 6, 2015 at 1:34 PM, Sage Weil  wrote:
> On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> I can't think of anything. In my dev cluster the only thing that has
>> changed is the Ceph versions (no reboot). What I like is even though
>> the disks are 100% utilized, it is performing as I expect now. Client
>> I/O is slightly degraded during the recovery, but no 

Re: [ceph-users] Potential OSD deadlock?

2015-10-07 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

We forgot to upload the ceph.log yesterday. It is there now.
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Oct 6, 2015 at 5:40 PM, Robert LeBlanc  wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> I upped the debug on about everything and ran the test for about 40
> minutes. I took OSD.19 on ceph1 down and then brought it back in.
> There was at least one op on osd.19 that was blocked for over 1,000
> seconds. Hopefully this will have something that will cast a light on
> what is going on.
>
> We are going to upgrade this cluster to Infernalis tomorrow and rerun
> the test to verify the results from the dev cluster. This cluster
> matches the hardware of our production cluster but is not yet in
> production so we can safely wipe it to downgrade back to Hammer.
>
> Logs are located at http://dev.v3trae.net/~jlavoy/ceph/logs/
>
> Let me know what else we can do to help.
>
> Thanks,
> -BEGIN PGP SIGNATURE-
> Version: Mailvelope v1.2.0
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJWFFwACRDmVDuy+mK58QAAs/UP/1L+y7DEfHqD/5OpkiNQ
> xuEEDm7fNJK58tLRmKsCrDrsFUvWCjiqUwboPg/E40e2GN7Lt+VkhMUEUWoo
> e3L20ig04c8Zu6fE/SXX3lnvayxsWTPcMnYI+HsmIV9E/efDLVLEf6T4fvXg
> 5dKLiqQ8Apu+UMVfd1+aKKDdLdnYlgBCZcIV9AQe1GB8X2VJJhmNWh6TQ3Xr
> gNXDexBdYjFBLu84FXOITd3ZtyUkgx/exCUMmwsJSc90jduzipS5hArvf7LN
> HD6m1gBkZNbfWfc/4nzqOQnKdY1pd9jyoiQM70jn0R5b2BlZT0wLjiAJm+07
> eCCQ99TZHFyeu1LyovakrYncXcnPtP5TfBFZW952FWQugupvxPCcaduz+GJV
> OhPAJ9dv90qbbGCO+8kpTMAD1aHgt/7+0/hKZTg8WMHhua68SFCXmdGAmqje
> IkIKswIAX4/uIoo5mK4TYB5HdEMJf9DzBFd+1RzzfRrrRalVkBfsu5ChFTx3
> mu5LAMwKTslvILMxAct0JwnwkOX5Gd+OFvmBRdm16UpDaDTQT2DfykylcmJd
> Cf9rPZxUv0ZHtZyTTyP2e6vgrc7UM/Ie5KonABxQ11mGtT8ysra3c9kMhYpw
> D6hcAZGtdvpiBRXBC5gORfiFWFxwu5kQ+daUhgUIe/O/EWyeD0rirZoqlLnZ
> EDrG
> =BZVw
> -END PGP SIGNATURE-
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Tue, Oct 6, 2015 at 2:36 PM, Robert LeBlanc  wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> On my second test (a much longer one), it took nearly an hour, but a
>> few messages have popped up over a 20 window. Still far less than I
>> have been seeing.
>> - 
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Tue, Oct 6, 2015 at 2:00 PM, Robert LeBlanc  wrote:
>>> -BEGIN PGP SIGNED MESSAGE-
>>> Hash: SHA256
>>>
>>> I'll capture another set of logs. Is there any other debugging you
>>> want turned up? I've seen the same thing where I see the message
>>> dispatched to the secondary OSD, but the message just doesn't show up
>>> for 30+ seconds in the secondary OSD logs.
>>> - 
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>
>>>
>>> On Tue, Oct 6, 2015 at 1:34 PM, Sage Weil  wrote:
 On Tue, 6 Oct 2015, Robert LeBlanc wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> I can't think of anything. In my dev cluster the only thing that has
> changed is the Ceph versions (no reboot). What I like is even though
> the disks are 100% utilized, it is performing as I expect now. Client
> I/O is slightly degraded during the recovery, but no blocked I/O when
> the OSD boots or during the recovery period. This is with
> max_backfills set to 20, one backfill max in our production cluster is
> painful on OSD boot/recovery. I was able to reproduce this issue on
> our dev cluster very easily and very quickly with these settings. So
> far two tests and an hour later, only the blocked I/O when the OSD is
> marked out. We would love to see that go away too, but this is far
 (me too!)
> better than what we have now. This dev cluster also has
> osd_client_message_cap set to default (100).
>
> We need to stay on the Hammer version of Ceph and I'm willing to take
> the time to bisect this. If this is not a problem in Firefly/Giant,
> would you prefer a bisect to find the introduction of the problem
> (Firefly/Giant -> Hammer) or the introduction of the resolution
> (Hammer -> Infernalis)? Do you have some hints to reduce hitting a
> commit that prevents a clean build as that is my most limiting factor?

 Nothing comes to mind.  I think the best way to find this is still to see
 it happen in the logs with hammer.  The frustrating thing with that log
 dump you sent is that although I see plenty of slow request warnings in
 the osd logs, I don't see the requests arriving.  Maybe the logs weren't
 turned up for long enough?

 sage



> Thanks,
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Tue, Oct 6, 

Re: [ceph-users] Potential OSD deadlock?

2015-10-06 Thread Sage Weil
On Tue, 6 Oct 2015, Robert LeBlanc wrote:
> Thanks for your time Sage. It sounds like a few people may be helped if you
> can find something.
> 
> I did a recursive chown as in the instructions (although I didn't know about
> the doc at the time). I did an osd debug at 20/20 but didn't see anything.
> I'll also do ms and make the logs available. I'll also review the document
> to make sure I didn't miss anything else.

Oh.. I bet you didn't upgrade the osds to 0.94.4 (or latest hammer build) 
first.  They won't be allowed to boot until that happens... all upgrades 
must stop at 0.94.4 first.  And that isn't released yet.. we'll try to 
do that today.  In the meantime, you can use the hammer gitbuilder 
build...
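
A quick sketch of verifying what each daemon is actually running before continuing
the upgrade; the osd/mon ids are examples:

ceph --version              # version of the installed packages on this node
ceph daemon osd.0 version   # version the running daemon reports over its admin socket
ceph daemon mon.a version   # same for a monitor (mon id assumed to be "a")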

sage


> 
> Robert LeBlanc
> 
> Sent from a mobile device please excuse any typos.
> 
> On Oct 6, 2015 6:37 AM, "Sage Weil"  wrote:
>   On Mon, 5 Oct 2015, Robert LeBlanc wrote:
>   > -BEGIN PGP SIGNED MESSAGE-
>   > Hash: SHA256
>   >
>   > With some off-list help, we have adjusted
>   > osd_client_message_cap=1. This seems to have helped a bit
>   and we
>   > have seen some OSDs have a value up to 4,000 for client
>   messages. But
>   > it does not solve the problem with the blocked I/O.
>   >
>   > One thing that I have noticed is that almost exactly 30
>   seconds elapse
>   > between an OSD boots and the first blocked I/O message. I
>   don't know
>   > if the OSD doesn't have time to get its brain right about a
>   PG before
>   > it starts servicing it or what exactly.
> 
>   I'm downloading the logs from yesterday now; sorry it's taking
>   so long.
> 
>   > On another note, I tried upgrading our CentOS dev cluster from
>   Hammer
>   > to master and things didn't go so well. The OSDs would not
>   start
>   > because /var/lib/ceph was not owned by ceph. I chowned the
>   directory
>   > and all OSDs and the OSD then started, but never became active
>   in the
>   > cluster. It just sat there after reading all the PGs. There
>   were
>   > sockets open to the monitor, but no OSD to OSD sockets. I
>   tried
>   > downgrading to the Infernalis branch and still no luck getting
>   the
>   > OSDs to come up. The OSD processes were idle after the initial
>   boot.
>   > All packages were installed from gitbuilder.
> 
>   Did you chown -R ?
> 
>          
> https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgrading-from-hammer
> 
>   My guess is you only chowned the root dir, and the OSD didn't
>   throw
>   an error when it encountered the other files?  If you can
>   generate a debug
>   osd = 20 log, that would be helpful.. thanks!
> 
>   sage
> 
> 
>   >
>   > Thanks,
>   > -BEGIN PGP SIGNATURE-
>   > Version: Mailvelope v1.2.0
>   > Comment: https://www.mailvelope.com
>   >
>   > wsFcBAEBCAAQBQJWE0F5CRDmVDuy+mK58QAAaCYQAJuFcCvRUJ46k0rYrMcc
>   > YlrSrGwS57GJS/JjaFHsvBV7KTobEMNeMkSv4PTGpwylNV9Dx4Ad74DDqX4g
>   > 6hZDe0rE+uEI7tW9Lqp+MN7eaU2lDuwLt/pOzZI14jTskUYTlNi3HjlN67mQ
>   > aiX1rbrJL6FFkuMOn/YqHpMbxI5ZOUZc1s7RDhASOPIs4z/CxpDfluW6fZA/
>   > y8C+pW6zzS9U/6jZwtGhBq4dvDBO41Lxb9WOehD8Aa/Qt6XNDzGw2KEkEkw7
>   > 8dBc7UFa2Wx3Tnzy238a/nKhtz6O6OrHsroA+HGWwCoxPWjOsz/xOoOmfwp+
>   > ALkY3id+t2uJEqzbL8/MgJ2RV1A+AZ7W1VWIJUOkDz0wR+KxQsxduHoD6rQy
>   > zg0fj2KSAlmVusYOPM1s1+jBsqNF3wcNxpbRoVuFqk0xMgGPrIdUNdZHg6bs
>   > D5sfkjNKexFe0ifFJ0cfv6UaGIKv4dK2eq3jUKgXHfh/qZmJbEB+zHaqJNyg
>   > CN6w6xu1FHLeVobKAWe5ZzKY5lxw6b8YG+ce/E2dvW73gSASPTvtv68gaT04
>   > 2SPF9Ql0fERL5EDY9Pc4MHpQVcS0XxxJA69CgnWgaG6fzq2eY7fALeMBVWlB
>   > fRj3zQwqJls/X8JZ3c4P4G0R6DP9bmMwGr++oYc3gWGrvgzxw3N7+ornd0jd
>   > GdXC
>   > =Aigq
>   > -END PGP SIGNATURE-
>   > 
>   > Robert LeBlanc
>   > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62
>   B9F1
>   >
>   >
>   > On Sun, Oct 4, 2015 at 3:04 PM, Robert LeBlanc
>    wrote:
>   > > -BEGIN PGP SIGNED MESSAGE-
>   > > Hash: SHA256
>   > >
>   > > I have eight nodes running the fio job rbd_test_real to
>   different RBD
>   > > volumes. I've included the CRUSH map in the tarball.
>   > >
>   > > I stopped one OSD process and marked it out. I let it
>   recover for a
>   > > few minutes and then I started the process again and marked
>   it in. I
>   > > started getting block I/O messages during the recovery.
>   > >
>   > > The logs are located at
>   http://162.144.87.113/files/ushou1.tar.xz
>   > >
>   > > Thanks,
>   > > -BEGIN PGP SIGNATURE-
>   > > Version: Mailvelope v1.2.0
>   > > Comment: https://www.mailvelope.com
>   > >
>   > > 

Re: [ceph-users] Potential OSD deadlock?

2015-10-06 Thread Robert LeBlanc
Thanks for your time Sage. It sounds like a few people may be helped if you
can find something.

I did a recursive chown as in the instructions (although I didn't know
about the doc at the time). I did an osd debug at 20/20 but didn't see
anything. I'll also do ms and make the logs available. I'll also review the
document to make sure I didn't miss anything else.

Robert LeBlanc

Sent from a mobile device please excuse any typos.
On Oct 6, 2015 6:37 AM, "Sage Weil"  wrote:

> On Mon, 5 Oct 2015, Robert LeBlanc wrote:
> > -BEGIN PGP SIGNED MESSAGE-
> > Hash: SHA256
> >
> > With some off-list help, we have adjusted
> > osd_client_message_cap=1. This seems to have helped a bit and we
> > have seen some OSDs have a value up to 4,000 for client messages. But
> > it does not solve the problem with the blocked I/O.
> >
> > One thing that I have noticed is that almost exactly 30 seconds elapse
> > between an OSD boots and the first blocked I/O message. I don't know
> > if the OSD doesn't have time to get its brain right about a PG before
> > it starts servicing it or what exactly.
>
> I'm downloading the logs from yesterday now; sorry it's taking so long.
>
> > On another note, I tried upgrading our CentOS dev cluster from Hammer
> > to master and things didn't go so well. The OSDs would not start
> > because /var/lib/ceph was not owned by ceph. I chowned the directory
> > and all OSDs and the OSD then started, but never became active in the
> > cluster. It just sat there after reading all the PGs. There were
> > sockets open to the monitor, but no OSD to OSD sockets. I tried
> > downgrading to the Infernalis branch and still no luck getting the
> > OSDs to come up. The OSD processes were idle after the initial boot.
> > All packages were installed from gitbuilder.
>
> Did you chown -R ?
>
>
> https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgrading-from-hammer
>
> My guess is you only chowned the root dir, and the OSD didn't throw
> an error when it encountered the other files?  If you can generate a debug
> osd = 20 log, that would be helpful.. thanks!
>
> sage
>
>
> >
> > Thanks,
> > -BEGIN PGP SIGNATURE-
> > Version: Mailvelope v1.2.0
> > Comment: https://www.mailvelope.com
> >
> > wsFcBAEBCAAQBQJWE0F5CRDmVDuy+mK58QAAaCYQAJuFcCvRUJ46k0rYrMcc
> > YlrSrGwS57GJS/JjaFHsvBV7KTobEMNeMkSv4PTGpwylNV9Dx4Ad74DDqX4g
> > 6hZDe0rE+uEI7tW9Lqp+MN7eaU2lDuwLt/pOzZI14jTskUYTlNi3HjlN67mQ
> > aiX1rbrJL6FFkuMOn/YqHpMbxI5ZOUZc1s7RDhASOPIs4z/CxpDfluW6fZA/
> > y8C+pW6zzS9U/6jZwtGhBq4dvDBO41Lxb9WOehD8Aa/Qt6XNDzGw2KEkEkw7
> > 8dBc7UFa2Wx3Tnzy238a/nKhtz6O6OrHsroA+HGWwCoxPWjOsz/xOoOmfwp+
> > ALkY3id+t2uJEqzbL8/MgJ2RV1A+AZ7W1VWIJUOkDz0wR+KxQsxduHoD6rQy
> > zg0fj2KSAlmVusYOPM1s1+jBsqNF3wcNxpbRoVuFqk0xMgGPrIdUNdZHg6bs
> > D5sfkjNKexFe0ifFJ0cfv6UaGIKv4dK2eq3jUKgXHfh/qZmJbEB+zHaqJNyg
> > CN6w6xu1FHLeVobKAWe5ZzKY5lxw6b8YG+ce/E2dvW73gSASPTvtv68gaT04
> > 2SPF9Ql0fERL5EDY9Pc4MHpQVcS0XxxJA69CgnWgaG6fzq2eY7fALeMBVWlB
> > fRj3zQwqJls/X8JZ3c4P4G0R6DP9bmMwGr++oYc3gWGrvgzxw3N7+ornd0jd
> > GdXC
> > =Aigq
> > -END PGP SIGNATURE-
> > 
> > Robert LeBlanc
> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >
> >
> > On Sun, Oct 4, 2015 at 3:04 PM, Robert LeBlanc 
> wrote:
> > > -BEGIN PGP SIGNED MESSAGE-
> > > Hash: SHA256
> > >
> > > I have eight nodes running the fio job rbd_test_real to different RBD
> > > volumes. I've included the CRUSH map in the tarball.
> > >
> > > I stopped one OSD process and marked it out. I let it recover for a
> > > few minutes and then I started the process again and marked it in. I
> > > started getting block I/O messages during the recovery.
> > >
> > > The logs are located at http://162.144.87.113/files/ushou1.tar.xz
> > >
> > > Thanks,
> > > -BEGIN PGP SIGNATURE-
> > > Version: Mailvelope v1.2.0
> > > Comment: https://www.mailvelope.com
> > >
> > > wsFcBAEBCAAQBQJWEZRcCRDmVDuy+mK58QAALbEQAK5pFiixJarUdLm50zp/
> > > 3AGgGBPrieExKmoZZLCoMGfOLfxZDbN2ybtopKDQDfrTqndE/6Xi9UXqTOdW
> > > jDc9U1wusgG0CKPsY1SMYnB9akvaDwtdh5q5k4VpN2zsG9R6lRojHeNQR3Nf
> > > 56QevJL4/e5lC3sLhVnxXXi2XKnHCVOHT+PYgNour2ZWt6OTLoFFxuSU3zLN
> > > OtfXgrFiiNF0mrDpm0gg2l8a8N5SwP9mM233S2U/JiGAqsqoqkfd0okjDenC
> > > ksesU/n7zordFpfLN3yjL6+X9pQ4YA6otZrq4wWtjWKO/H0b+6iIsf/AE131
> > > R6a4Vufndpd3Ce+FNfM+iu3FmKk0KVfDAaF/tIP6S6XUzGVMAbpvpmqNL17o
> > > boh3wPZEyK+7KiF4Qlt2KoI/FV24Yj8XiyMnKin3MbMYbammb4ER977VH7iI
> > > sZyelNPSsYmmw/MF+AkA5KVgzQ4DAPflaejIgC5uw3dYKrn2AQE5CE9nN8Gz
> > > GVVaGItu1Bvrz21QoT9o5v0dZ85zttFvtrKIYgSi4mdpC6XkzUbg9s9EB1/T
> > > SEY+fau7W7TtiLpzCAIQ3zDvgsvkx2P6tKg5U8e93LVv9B+YI8i8mUxxv1j5
> > > PHFi7KTgRUPm1FPMJDSyzvOgqyMj9AzaESl1Na6k529ILFIcyfko0niTT1oZ
> > > 3EPx
> > > =UDIV
> > > -END PGP SIGNATURE-
> > >
> > > 
> > > Robert LeBlanc
> > > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> > >
> > >
> > > On Sun, Oct 4, 2015 at 7:48 AM, Sage Weil 

Re: [ceph-users] Potential OSD deadlock?

2015-10-06 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

This was from the monitor (I can't bring it up with Hammer now; the complete
cluster is down, but this is only my lab, so no urgency).

I got it up and running this way:
1. Upgrade the mon node to Infernalis and started the mon.
2. Downgraded the OSDs to the to-be-0.94.4 build and started them up.
3. Upgraded the OSD node to Infernalis
4. Stopped the OSD processes
5. Chowned the files that were updated by downgrading (find
/var/lib/ceph -user root -exec chown ceph. {} +)
6. Ran ceph-disk activate

The OSDs then came up and into the cluster.

I had tried ceph-disk activate on the nodes while downgraded to 0.94.4
and the monitor was down. It took a while to time out searching for the
monitor; based on the last OSDs I started, it seemed to do enough to
allow the OSDs to join an Infernalis monitor. (My monitor is running on
an OSD node; I tried starting these OSDs just in case I could skip
zapping the disks and having to backfill them. It worked just fine.)

Hopefully this helps someone who runs into this after the fact.
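
The same sequence as a shell sketch; the package upgrade and daemon stop steps are
left as comments because they depend on the distro and init system:

# 1/3: upgrade mon and OSD node packages to Infernalis (distro-specific)
# 4:   stop the OSD daemons (e.g. systemctl stop ceph-osd.target, or the upstart jobs)
# 5:   fix ownership of anything the downgrade left owned by root
find /var/lib/ceph -user root -exec chown ceph:ceph {} +
# 6:   bring the OSDs back
ceph-disk activate-all      # or ceph-disk activate /dev/<data-partition> per OSD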
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.2.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWFAkLCRDmVDuy+mK58QAAp3AP/RptIt2yPrmL1EXzvl4V
N3Q69NE17ac9xb5ruxN/LqNMyZAE85UKhzkTFi2NdSMJzYygL3hpgLBqEOpF
5VhaKoaW/H/gfrXVTGt5reFySMvDZEA/9hqF9KLQggemRRebAv6DIHb8wLTO
OLHF/XSsi+JALlIx2a04OSFZQ2M9rPTmOGneZ63T0YoPK5XQVJgQT9D4h60+
IeSn9Drh+HPJQag1E6cuh9ixOofJAP9grAnGBqy4XWznMFMDYxaKYovS5Nkg
yt1ukH6R23dYNnIklVnpK3MmnU6JSnWyCraiolVb/Ddjd6D/wart95aClwHo
EmvirdctCk3mbfG/2MjcUO8UII9Dk0xs7ck/nqyDBatRcOCGOdn1SVCOT+0Q
N3dDFeEY7FoLZf0g9YYmtTnYtE5TQ0fJGOAwvJQeupJESMlXohXAQHgxKg0H
ksjmLrY1OTFdFMeS5P3sHHzN6qNGDKJyG6aURB6rN2xexTITQjl9ZZ/o9bfZ
vEldAdXp4B1TBlPqkVfHGkTrMezEOwOi0kGAChkflFu2nQB6LvKDKLiHjpKv
87MCaB97FvoUdZpGnbcuft6NU4lWH/ynVLLY8fOKb2/x1GEKYNz3cGKx3M0u
S5wRtYvOOZEwb/B2bhvCSNtb2FgqpS4INfgKn+334ibd2X1o42oj52SA1lz/
baNh
=LmYc
-END PGP SIGNATURE-

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Oct 6, 2015 at 10:19 AM, Sage Weil  wrote:
> On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>> I downgraded to the hammer gitbuilder branch, but it looks like I've
>> passed the point of no return:
>>
>> 2015-10-06 09:44:52.210873 7fd3dd8b78c0 -1 ERROR: on disk data
>> includes unsupported features:
>> compat={},rocompat={},incompat={7=support shec erasure code}
>> 2015-10-06 09:44:52.210922 7fd3dd8b78c0 -1 error checking features:
>> (1) Operation not permitted
>
> In that case, mark all osds down, upgrade again, and they'll be
> allowed to start.  The restriction is that each osd can't go backwards,
> and post-hammer osds can't talk to pre-hammer osds.
>
> sage
>
>>
>> 
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Tue, Oct 6, 2015 at 8:38 AM, Sage Weil  wrote:
>> > On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>> >> Thanks for your time Sage. It sounds like a few people may be helped if 
>> >> you
>> >> can find something.
>> >>
>> >> I did a recursive chown as in the instructions (although I didn't know 
>> >> about
>> >> the doc at the time). I did an osd debug at 20/20 but didn't see anything.
>> >> I'll also do ms and make the logs available. I'll also review the document
>> >> to make sure I didn't miss anything else.
>> >
>> > Oh.. I bet you didn't upgrade the osds to 0.94.4 (or latest hammer build)
>> > first.  They won't be allowed to boot until that happens... all upgrades
>> > must stop at 0.94.4 first.  And that isn't released yet.. we'll try to
>> > do that today.  In the meantime, you can use the hammer gitbuilder
>> > build...
>> >
>> > sage
>> >
>> >
>> >>
>> >> Robert LeBlanc
>> >>
>> >> Sent from a mobile device please excuse any typos.
>> >>
>> >> On Oct 6, 2015 6:37 AM, "Sage Weil"  wrote:
>> >>   On Mon, 5 Oct 2015, Robert LeBlanc wrote:
>> >>   > -BEGIN PGP SIGNED MESSAGE-
>> >>   > Hash: SHA256
>> >>   >
>> >>   > With some off-list help, we have adjusted
>> >>   > osd_client_message_cap=1. This seems to have helped a bit
>> >>   and we
>> >>   > have seen some OSDs have a value up to 4,000 for client
>> >>   messages. But
>> >>   > it does not solve the problem with the blocked I/O.
>> >>   >
>> >>   > One thing that I have noticed is that almost exactly 30
>> >>   seconds elapse
>> >>   > between an OSD boots and the first blocked I/O message. I
>> >>   don't know
>> >>   > if the OSD doesn't have time to get its brain right about a
>> >>   PG before
>> >>   > it starts servicing it or what exactly.
>> >>
>> >>   I'm downloading the logs from yesterday now; sorry it's taking
>> >>   so long.
>> >>
>> >>   > On another note, I tried upgrading our CentOS dev cluster from
>> >>   Hammer
>> >>   > to master and things didn't go so well. The OSDs would not
>> >>   start
>> 

Re: [ceph-users] Potential OSD deadlock?

2015-10-06 Thread Sage Weil
On Tue, 6 Oct 2015, Robert LeBlanc wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
> 
> OK, an interesting point. Running ceph version 9.0.3-2036-g4f54a0d
> (4f54a0dd7c4a5c8bdc788c8b7f58048b2a28b9be) looks a lot better. I got
> messages when the OSD was marked out:
> 
> 2015-10-06 11:52:46.961040 osd.13 192.168.55.12:6800/20870 81 :
> cluster [WRN] 17 slow requests, 3 included below; oldest blocked for >
> 34.476006 secs
> 2015-10-06 11:52:46.961056 osd.13 192.168.55.12:6800/20870 82 :
> cluster [WRN] slow request 32.913474 seconds old, received at
> 2015-10-06 11:52:14.047475: osd_op(client.600962.0:474
> rbd_data.338102ae8944a.5270 [read 3302912~4096] 8.c74a4538
> ack+read+known_if_redirected e58744) currently waiting for peered
> 2015-10-06 11:52:46.961066 osd.13 192.168.55.12:6800/20870 83 :
> cluster [WRN] slow request 32.697545 seconds old, received at
> 2015-10-06 11:52:14.263403: osd_op(client.600960.0:583
> rbd_data.3380f74b0dc51.0001ee75 [read 1016832~4096] 8.778d1be3
> ack+read+known_if_redirected e58744) currently waiting for peered
> 2015-10-06 11:52:46.961074 osd.13 192.168.55.12:6800/20870 84 :
> cluster [WRN] slow request 32.668006 seconds old, received at
> 2015-10-06 11:52:14.292942: osd_op(client.600955.0:571
> rbd_data.3380f74b0dc51.00019b09 [read 1034240~4096] 8.e87a6f58
> ack+read+known_if_redirected e58744) currently waiting for peered
> 
> But I'm not seeing the blocked messages when the OSD came back in. The
> OSD spindles have been running at 100% during this test. I have seen
> slowed I/O from the clients as expected from the extra load, but so
> far no blocked messages. I'm going to run some more tests.

Good to hear.

FWIW I looked through the logs and all of the slow request no flag point 
messages came from osd.163... and the logs don't show when they arrived.  
My guess is this OSD has a slower disk than the others, or something else 
funny is going on?

I spot checked another OSD at random (60) where I saw a slow request.  It 
was stuck peering for 10s of seconds... waiting on a pg log message from 
osd.163.
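
A sketch of how the slower-disk theory on osd.163 could be checked; /dev/sdX stands
in for whatever data disk backs that OSD:

iostat -x 1                          # watch await / %util on the disk behind osd.163
smartctl -a /dev/sdX | egrep -i 'realloc|pending|uncorrect'   # look for a failing drive
ceph daemon osd.163 dump_historic_ops | head -n 50            # slowest recent ops on that OSD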

sage


> 
> -BEGIN PGP SIGNATURE-
> Version: Mailvelope v1.2.0
> Comment: https://www.mailvelope.com
> 
> wsFcBAEBCAAQBQJWFAzRCRDmVDuy+mK58QAASRYP/jrbKy5mptq/cSqJvB47
> F/gEatsqU4/TwyIJg137DQTkONbHKnLgCZqsJLnCZRH8fFqtvY6g/Q/AA7Ks
> ouo5gvbjKM7pOm/uUn8kU44Xe15f/bkVHvWBECZzg8YJwinPAisp5R0m1HBC
> HLvsbeqV00m72TyfsZX4aj7lHdyvcdcIH2EVgX/db092VVXczK4q2gRoNr0Y
> 77BEr2Y/gPj5LM4b/aDG5AWY8dJZRlNz+B1CyLS+kIDXSaAbzul2UbAG6jNE
> KJEVxndMPfHLIdwg55+q8VTMIjqXcCM47cQhWFrKChgVD8byJxpc6E0TqOxs
> 1gtNE8AILoCSYKnwQZan+TBDGxki7rQxzMdNI+NLfhy1Mwd3lSCPsDtD7W/i
> tzNTr6aGz+wr+OPDQV5zrzLaPZYF3FLWN4n6RYNfnDramYzD76v+7kjdW4dE
> 5UVCtE7KGLCZ21fu6sln1b9q6lYXNtohAmAunIdqpo3FmHusRySyZzYKu1+9
> zg/LHiArD/ddjkPxVWCTFBS17g/bESRcv2MsA30GS8J6k1zlQaLX5KeGg6Ql
> WJSmW8gFfEbXj/7JTrVtQWTdgjsegaySFnDisTWUR/hEM/NuKii4xfjI32M/
> luUMXHZ8lTHk9C8MfZcpyPGvwp2FliD9LqaWOVPWtWZJcerEWcZVlEApg4qb
> fo5a
> =ahEi
> -END PGP SIGNATURE-
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Tue, Oct 6, 2015 at 6:37 AM, Sage Weil  wrote:
> > On Mon, 5 Oct 2015, Robert LeBlanc wrote:
> >> -BEGIN PGP SIGNED MESSAGE-
> >> Hash: SHA256
> >>
> >> With some off-list help, we have adjusted
> >> osd_client_message_cap=1. This seems to have helped a bit and we
> >> have seen some OSDs have a value up to 4,000 for client messages. But
> >> it does not solve the problem with the blocked I/O.
> >>
> >> One thing that I have noticed is that almost exactly 30 seconds elapse
> >> between an OSD boots and the first blocked I/O message. I don't know
> >> if the OSD doesn't have time to get its brain right about a PG before
> >> it starts servicing it or what exactly.
> >
> > I'm downloading the logs from yesterday now; sorry it's taking so long.
> >
> >> On another note, I tried upgrading our CentOS dev cluster from Hammer
> >> to master and things didn't go so well. The OSDs would not start
> >> because /var/lib/ceph was not owned by ceph. I chowned the directory
> >> and all OSDs and the OSD then started, but never became active in the
> >> cluster. It just sat there after reading all the PGs. There were
> >> sockets open to the monitor, but no OSD to OSD sockets. I tried
> >> downgrading to the Infernalis branch and still no luck getting the
> >> OSDs to come up. The OSD processes were idle after the initial boot.
> >> All packages were installed from gitbuilder.
> >
> > Did you chown -R ?
> >
> > 
> > https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgrading-from-hammer
> >
> > My guess is you only chowned the root dir, and the OSD didn't throw
> > an error when it encountered the other files?  If you can generate a debug
> > osd = 20 log, that would be helpful.. thanks!
> >
> > sage
> >
> >
> >>
> >> Thanks,
> >> -BEGIN PGP SIGNATURE-
> >> Version: 

Re: [ceph-users] Potential OSD deadlock?

2015-10-06 Thread Sage Weil
On Tue, 6 Oct 2015, Robert LeBlanc wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
> 
> I can't think of anything. In my dev cluster the only thing that has
> changed is the Ceph versions (no reboot). What I like is even though
> the disks are 100% utilized, it is performing as I expect now. Client
> I/O is slightly degraded during the recovery, but no blocked I/O when
> the OSD boots or during the recovery period. This is with
> max_backfills set to 20, one backfill max in our production cluster is
> painful on OSD boot/recovery. I was able to reproduce this issue on
> our dev cluster very easily and very quickly with these settings. So
> far two tests and an hour later, only the blocked I/O when the OSD is
> marked out. We would love to see that go away too, but this is far
(me too!)
> better than what we have now. This dev cluster also has
> osd_client_message_cap set to default (100).
> 
> We need to stay on the Hammer version of Ceph and I'm willing to take
> the time to bisect this. If this is not a problem in Firefly/Giant,
> would you prefer a bisect to find the introduction of the problem
> (Firefly/Giant -> Hammer) or the introduction of the resolution
> (Hammer -> Infernalis)? Do you have some hints to reduce hitting a
> commit that prevents a clean build as that is my most limiting factor?

Nothing comes to mind.  I think the best way to find this is still to see 
it happen in the logs with hammer.  The frustrating thing with that log 
dump you sent is that although I see plenty of slow request warnings in 
the osd logs, I don't see the requests arriving.  Maybe the logs weren't 
turned up for long enough?
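
One way to check whether a given request ever shows up as arriving in the hammer
logs; the op id below is taken from the slow request warnings earlier in the thread
and the log path is the default one:

# follow one op end to end on the OSDs that mention it
grep -n 'client.600962.0:474' /var/log/ceph/ceph-osd.*.log
# with debug ms = 1 the actual message receive should be visible as well,
# so the gap between receive and the slow request warning can be measured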

sage



> Thanks,
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Tue, Oct 6, 2015 at 12:32 PM, Sage Weil  wrote:
> > On Tue, 6 Oct 2015, Robert LeBlanc wrote:
> >> -BEGIN PGP SIGNED MESSAGE-
> >> Hash: SHA256
> >>
> >> OK, an interesting point. Running ceph version 9.0.3-2036-g4f54a0d
> >> (4f54a0dd7c4a5c8bdc788c8b7f58048b2a28b9be) looks a lot better. I got
> >> messages when the OSD was marked out:
> >>
> >> 2015-10-06 11:52:46.961040 osd.13 192.168.55.12:6800/20870 81 :
> >> cluster [WRN] 17 slow requests, 3 included below; oldest blocked for >
> >> 34.476006 secs
> >> 2015-10-06 11:52:46.961056 osd.13 192.168.55.12:6800/20870 82 :
> >> cluster [WRN] slow request 32.913474 seconds old, received at
> >> 2015-10-06 11:52:14.047475: osd_op(client.600962.0:474
> >> rbd_data.338102ae8944a.5270 [read 3302912~4096] 8.c74a4538
> >> ack+read+known_if_redirected e58744) currently waiting for peered
> >> 2015-10-06 11:52:46.961066 osd.13 192.168.55.12:6800/20870 83 :
> >> cluster [WRN] slow request 32.697545 seconds old, received at
> >> 2015-10-06 11:52:14.263403: osd_op(client.600960.0:583
> >> rbd_data.3380f74b0dc51.0001ee75 [read 1016832~4096] 8.778d1be3
> >> ack+read+known_if_redirected e58744) currently waiting for peered
> >> 2015-10-06 11:52:46.961074 osd.13 192.168.55.12:6800/20870 84 :
> >> cluster [WRN] slow request 32.668006 seconds old, received at
> >> 2015-10-06 11:52:14.292942: osd_op(client.600955.0:571
> >> rbd_data.3380f74b0dc51.00019b09 [read 1034240~4096] 8.e87a6f58
> >> ack+read+known_if_redirected e58744) currently waiting for peered
> >>
> >> But I'm not seeing the blocked messages when the OSD came back in. The
> >> OSD spindles have been running at 100% during this test. I have seen
> >> slowed I/O from the clients as expected from the extra load, but so
> >> far no blocked messages. I'm going to run some more tests.
> >
> > Good to hear.
> >
> > FWIW I looked through the logs and all of the slow request no flag point
> > messages came from osd.163... and the logs don't show when they arrived.
> > My guess is this OSD has a slower disk than the others, or something else
> > funny is going on?
> >
> > I spot checked another OSD at random (60) where I saw a slow request.  It
> > was stuck peering for 10s of seconds... waiting on a pg log message from
> > osd.163.
> >
> > sage
> >
> >
> >>
> >> -BEGIN PGP SIGNATURE-
> >> Version: Mailvelope v1.2.0
> >> Comment: https://www.mailvelope.com
> >>
> >> wsFcBAEBCAAQBQJWFAzRCRDmVDuy+mK58QAASRYP/jrbKy5mptq/cSqJvB47
> >> F/gEatsqU4/TwyIJg137DQTkONbHKnLgCZqsJLnCZRH8fFqtvY6g/Q/AA7Ks
> >> ouo5gvbjKM7pOm/uUn8kU44Xe15f/bkVHvWBECZzg8YJwinPAisp5R0m1HBC
> >> HLvsbeqV00m72TyfsZX4aj7lHdyvcdcIH2EVgX/db092VVXczK4q2gRoNr0Y
> >> 77BEr2Y/gPj5LM4b/aDG5AWY8dJZRlNz+B1CyLS+kIDXSaAbzul2UbAG6jNE
> >> KJEVxndMPfHLIdwg55+q8VTMIjqXcCM47cQhWFrKChgVD8byJxpc6E0TqOxs
> >> 1gtNE8AILoCSYKnwQZan+TBDGxki7rQxzMdNI+NLfhy1Mwd3lSCPsDtD7W/i
> >> tzNTr6aGz+wr+OPDQV5zrzLaPZYF3FLWN4n6RYNfnDramYzD76v+7kjdW4dE
> >> 5UVCtE7KGLCZ21fu6sln1b9q6lYXNtohAmAunIdqpo3FmHusRySyZzYKu1+9
> >> zg/LHiArD/ddjkPxVWCTFBS17g/bESRcv2MsA30GS8J6k1zlQaLX5KeGg6Ql
> >> WJSmW8gFfEbXj/7JTrVtQWTdgjsegaySFnDisTWUR/hEM/NuKii4xfjI32M/
> >> 

Re: [ceph-users] Potential OSD deadlock?

2015-10-06 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

OK, an interesting point. Running ceph version 9.0.3-2036-g4f54a0d
(4f54a0dd7c4a5c8bdc788c8b7f58048b2a28b9be) looks a lot better. I got
messages when the OSD was marked out:

2015-10-06 11:52:46.961040 osd.13 192.168.55.12:6800/20870 81 :
cluster [WRN] 17 slow requests, 3 included below; oldest blocked for >
34.476006 secs
2015-10-06 11:52:46.961056 osd.13 192.168.55.12:6800/20870 82 :
cluster [WRN] slow request 32.913474 seconds old, received at
2015-10-06 11:52:14.047475: osd_op(client.600962.0:474
rbd_data.338102ae8944a.5270 [read 3302912~4096] 8.c74a4538
ack+read+known_if_redirected e58744) currently waiting for peered
2015-10-06 11:52:46.961066 osd.13 192.168.55.12:6800/20870 83 :
cluster [WRN] slow request 32.697545 seconds old, received at
2015-10-06 11:52:14.263403: osd_op(client.600960.0:583
rbd_data.3380f74b0dc51.0001ee75 [read 1016832~4096] 8.778d1be3
ack+read+known_if_redirected e58744) currently waiting for peered
2015-10-06 11:52:46.961074 osd.13 192.168.55.12:6800/20870 84 :
cluster [WRN] slow request 32.668006 seconds old, received at
2015-10-06 11:52:14.292942: osd_op(client.600955.0:571
rbd_data.3380f74b0dc51.00019b09 [read 1034240~4096] 8.e87a6f58
ack+read+known_if_redirected e58744) currently waiting for peered

But I'm not seeing the blocked messages when the OSD came back in. The
OSD spindles have been running at 100% during this test. I have seen
slowed I/O from the clients as expected from the extra load, but so
far no blocked messages. I'm going to run some more tests.
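
For watching this live during a test, two stock commands (nothing here is specific
to this cluster):

ceph -w | grep -i 'slow request'    # stream cluster log warnings as they appear
ceph health detail                  # summary of currently blocked/slow requests per OSD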

-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.2.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWFAzRCRDmVDuy+mK58QAASRYP/jrbKy5mptq/cSqJvB47
F/gEatsqU4/TwyIJg137DQTkONbHKnLgCZqsJLnCZRH8fFqtvY6g/Q/AA7Ks
ouo5gvbjKM7pOm/uUn8kU44Xe15f/bkVHvWBECZzg8YJwinPAisp5R0m1HBC
HLvsbeqV00m72TyfsZX4aj7lHdyvcdcIH2EVgX/db092VVXczK4q2gRoNr0Y
77BEr2Y/gPj5LM4b/aDG5AWY8dJZRlNz+B1CyLS+kIDXSaAbzul2UbAG6jNE
KJEVxndMPfHLIdwg55+q8VTMIjqXcCM47cQhWFrKChgVD8byJxpc6E0TqOxs
1gtNE8AILoCSYKnwQZan+TBDGxki7rQxzMdNI+NLfhy1Mwd3lSCPsDtD7W/i
tzNTr6aGz+wr+OPDQV5zrzLaPZYF3FLWN4n6RYNfnDramYzD76v+7kjdW4dE
5UVCtE7KGLCZ21fu6sln1b9q6lYXNtohAmAunIdqpo3FmHusRySyZzYKu1+9
zg/LHiArD/ddjkPxVWCTFBS17g/bESRcv2MsA30GS8J6k1zlQaLX5KeGg6Ql
WJSmW8gFfEbXj/7JTrVtQWTdgjsegaySFnDisTWUR/hEM/NuKii4xfjI32M/
luUMXHZ8lTHk9C8MfZcpyPGvwp2FliD9LqaWOVPWtWZJcerEWcZVlEApg4qb
fo5a
=ahEi
-END PGP SIGNATURE-

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Oct 6, 2015 at 6:37 AM, Sage Weil  wrote:
> On Mon, 5 Oct 2015, Robert LeBlanc wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> With some off-list help, we have adjusted
>> osd_client_message_cap=1. This seems to have helped a bit and we
>> have seen some OSDs have a value up to 4,000 for client messages. But
>> it does not solve the problem with the blocked I/O.
>>
>> One thing that I have noticed is that almost exactly 30 seconds elapse
>> between an OSD boots and the first blocked I/O message. I don't know
>> if the OSD doesn't have time to get its brain right about a PG before
>> it starts servicing it or what exactly.
>
> I'm downloading the logs from yesterday now; sorry it's taking so long.
>
>> On another note, I tried upgrading our CentOS dev cluster from Hammer
>> to master and things didn't go so well. The OSDs would not start
>> because /var/lib/ceph was not owned by ceph. I chowned the directory
>> and all OSDs and the OSD then started, but never became active in the
>> cluster. It just sat there after reading all the PGs. There were
>> sockets open to the monitor, but no OSD to OSD sockets. I tried
>> downgrading to the Infernalis branch and still no luck getting the
>> OSDs to come up. The OSD processes were idle after the initial boot.
>> All packages were installed from gitbuilder.
>
> Did you chown -R ?
>
> 
> https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgrading-from-hammer
>
> My guess is you only chowned the root dir, and the OSD didn't throw
> an error when it encountered the other files?  If you can generate a debug
> osd = 20 log, that would be helpful.. thanks!
>
> sage
>
>
>>
>> Thanks,
>> -BEGIN PGP SIGNATURE-
>> Version: Mailvelope v1.2.0
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJWE0F5CRDmVDuy+mK58QAAaCYQAJuFcCvRUJ46k0rYrMcc
>> YlrSrGwS57GJS/JjaFHsvBV7KTobEMNeMkSv4PTGpwylNV9Dx4Ad74DDqX4g
>> 6hZDe0rE+uEI7tW9Lqp+MN7eaU2lDuwLt/pOzZI14jTskUYTlNi3HjlN67mQ
>> aiX1rbrJL6FFkuMOn/YqHpMbxI5ZOUZc1s7RDhASOPIs4z/CxpDfluW6fZA/
>> y8C+pW6zzS9U/6jZwtGhBq4dvDBO41Lxb9WOehD8Aa/Qt6XNDzGw2KEkEkw7
>> 8dBc7UFa2Wx3Tnzy238a/nKhtz6O6OrHsroA+HGWwCoxPWjOsz/xOoOmfwp+
>> ALkY3id+t2uJEqzbL8/MgJ2RV1A+AZ7W1VWIJUOkDz0wR+KxQsxduHoD6rQy
>> zg0fj2KSAlmVusYOPM1s1+jBsqNF3wcNxpbRoVuFqk0xMgGPrIdUNdZHg6bs
>> D5sfkjNKexFe0ifFJ0cfv6UaGIKv4dK2eq3jUKgXHfh/qZmJbEB+zHaqJNyg
>> 

Re: [ceph-users] Potential OSD deadlock?

2015-10-06 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

I can't think of anything. In my dev cluster the only thing that has
changed is the Ceph versions (no reboot). What I like is even though
the disks are 100% utilized, it is performing as I expect now. Client
I/O is slightly degraded during the recovery, but no blocked I/O when
the OSD boots or during the recovery period. This is with
max_backfills set to 20, one backfill max in our production cluster is
painful on OSD boot/recovery. I was able to reproduce this issue on
our dev cluster very easily and very quickly with these settings. So
far two tests and an hour later, only the blocked I/O when the OSD is
marked out. We would love to see that go away too, but this is far
better than what we have now. This dev cluster also has
osd_client_message_cap set to default (100).

We need to stay on the Hammer version of Ceph and I'm willing to take
the time to bisect this. If this is not a problem in Firefly/Giant,
would you prefer a bisect to find the introduction of the problem
(Firefly/Giant -> Hammer) or the introduction of the resolution
(Hammer -> Infernalis)? Do you have some hints to reduce hitting a
commit that prevents a clean build as that is my most limiting factor?
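
A sketch of how these two knobs are typically flipped at runtime; the values are
the ones mentioned above:

ceph tell osd.* injectargs '--osd-max-backfills 20'         # dev cluster test setting
ceph tell osd.* injectargs '--osd-client-message-cap 100'   # i.e. left at the default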

Thanks,
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Oct 6, 2015 at 12:32 PM, Sage Weil  wrote:
> On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> OK, an interesting point. Running ceph version 9.0.3-2036-g4f54a0d
>> (4f54a0dd7c4a5c8bdc788c8b7f58048b2a28b9be) looks a lot better. I got
>> messages when the OSD was marked out:
>>
>> 2015-10-06 11:52:46.961040 osd.13 192.168.55.12:6800/20870 81 :
>> cluster [WRN] 17 slow requests, 3 included below; oldest blocked for >
>> 34.476006 secs
>> 2015-10-06 11:52:46.961056 osd.13 192.168.55.12:6800/20870 82 :
>> cluster [WRN] slow request 32.913474 seconds old, received at
>> 2015-10-06 11:52:14.047475: osd_op(client.600962.0:474
>> rbd_data.338102ae8944a.5270 [read 3302912~4096] 8.c74a4538
>> ack+read+known_if_redirected e58744) currently waiting for peered
>> 2015-10-06 11:52:46.961066 osd.13 192.168.55.12:6800/20870 83 :
>> cluster [WRN] slow request 32.697545 seconds old, received at
>> 2015-10-06 11:52:14.263403: osd_op(client.600960.0:583
>> rbd_data.3380f74b0dc51.0001ee75 [read 1016832~4096] 8.778d1be3
>> ack+read+known_if_redirected e58744) currently waiting for peered
>> 2015-10-06 11:52:46.961074 osd.13 192.168.55.12:6800/20870 84 :
>> cluster [WRN] slow request 32.668006 seconds old, received at
>> 2015-10-06 11:52:14.292942: osd_op(client.600955.0:571
>> rbd_data.3380f74b0dc51.00019b09 [read 1034240~4096] 8.e87a6f58
>> ack+read+known_if_redirected e58744) currently waiting for peered
>>
>> But I'm not seeing the blocked messages when the OSD came back in. The
>> OSD spindles have been running at 100% during this test. I have seen
>> slowed I/O from the clients as expected from the extra load, but so
>> far no blocked messages. I'm going to run some more tests.
>
> Good to hear.
>
> FWIW I looked through the logs and all of the slow request no flag point
> messages came from osd.163... and the logs don't show when they arrived.
> My guess is this OSD has a slower disk than the others, or something else
> funny is going on?
>
> I spot checked another OSD at random (60) where I saw a slow request.  It
> was stuck peering for 10s of seconds... waiting on a pg log message from
> osd.163.
>
> sage
>
>
>>
>> -BEGIN PGP SIGNATURE-
>> Version: Mailvelope v1.2.0
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJWFAzRCRDmVDuy+mK58QAASRYP/jrbKy5mptq/cSqJvB47
>> F/gEatsqU4/TwyIJg137DQTkONbHKnLgCZqsJLnCZRH8fFqtvY6g/Q/AA7Ks
>> ouo5gvbjKM7pOm/uUn8kU44Xe15f/bkVHvWBECZzg8YJwinPAisp5R0m1HBC
>> HLvsbeqV00m72TyfsZX4aj7lHdyvcdcIH2EVgX/db092VVXczK4q2gRoNr0Y
>> 77BEr2Y/gPj5LM4b/aDG5AWY8dJZRlNz+B1CyLS+kIDXSaAbzul2UbAG6jNE
>> KJEVxndMPfHLIdwg55+q8VTMIjqXcCM47cQhWFrKChgVD8byJxpc6E0TqOxs
>> 1gtNE8AILoCSYKnwQZan+TBDGxki7rQxzMdNI+NLfhy1Mwd3lSCPsDtD7W/i
>> tzNTr6aGz+wr+OPDQV5zrzLaPZYF3FLWN4n6RYNfnDramYzD76v+7kjdW4dE
>> 5UVCtE7KGLCZ21fu6sln1b9q6lYXNtohAmAunIdqpo3FmHusRySyZzYKu1+9
>> zg/LHiArD/ddjkPxVWCTFBS17g/bESRcv2MsA30GS8J6k1zlQaLX5KeGg6Ql
>> WJSmW8gFfEbXj/7JTrVtQWTdgjsegaySFnDisTWUR/hEM/NuKii4xfjI32M/
>> luUMXHZ8lTHk9C8MfZcpyPGvwp2FliD9LqaWOVPWtWZJcerEWcZVlEApg4qb
>> fo5a
>> =ahEi
>> -END PGP SIGNATURE-
>> 
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Tue, Oct 6, 2015 at 6:37 AM, Sage Weil  wrote:
>> > On Mon, 5 Oct 2015, Robert LeBlanc wrote:
>> >> -BEGIN PGP SIGNED MESSAGE-
>> >> Hash: SHA256
>> >>
>> >> With some off-list help, we have adjusted
>> >> osd_client_message_cap=1. This seems to have helped a bit and we
>> >> have seen some OSDs have a value up to 4,000 for client messages. But
>> >> it does not solve 

Re: [ceph-users] Potential OSD deadlock?

2015-10-06 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

On my second test (a much longer one), it took nearly an hour, but a
few messages have popped up over a 20 window. Still far less than I
have been seeing.
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Oct 6, 2015 at 2:00 PM, Robert LeBlanc  wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> I'll capture another set of logs. Is there any other debugging you
> want turned up? I've seen the same thing where I see the message
> dispatched to the secondary OSD, but the message just doesn't show up
> for 30+ seconds in the secondary OSD logs.
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Tue, Oct 6, 2015 at 1:34 PM, Sage Weil  wrote:
>> On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>>> -BEGIN PGP SIGNED MESSAGE-
>>> Hash: SHA256
>>>
>>> I can't think of anything. In my dev cluster the only thing that has
>>> changed is the Ceph versions (no reboot). What I like is even though
>>> the disks are 100% utilized, it is performing as I expect now. Client
>>> I/O is slightly degraded during the recovery, but no blocked I/O when
>>> the OSD boots or during the recovery period. This is with
>>> max_backfills set to 20, one backfill max in our production cluster is
>>> painful on OSD boot/recovery. I was able to reproduce this issue on
>>> our dev cluster very easily and very quickly with these settings. So
>>> far two tests and an hour later, only the blocked I/O when the OSD is
>>> marked out. We would love to see that go away too, but this is far
>> (me too!)
>>> better than what we have now. This dev cluster also has
>>> osd_client_message_cap set to default (100).
>>>
>>> We need to stay on the Hammer version of Ceph and I'm willing to take
>>> the time to bisect this. If this is not a problem in Firefly/Giant,
>>> would you prefer a bisect to find the introduction of the problem
>>> (Firefly/Giant -> Hammer) or the introduction of the resolution
>>> (Hammer -> Infernalis)? Do you have some hints to reduce hitting a
>>> commit that prevents a clean build as that is my most limiting factor?
>>
>> Nothing comes to mind.  I think the best way to find this is still to see
>> it happen in the logs with hammer.  The frustrating thing with that log
>> dump you sent is that although I see plenty of slow request warnings in
>> the osd logs, I don't see the requests arriving.  Maybe the logs weren't
>> turned up for long enough?
>>
>> sage
>>
>>
>>
>>> Thanks,
>>> - 
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>
>>>
>>> On Tue, Oct 6, 2015 at 12:32 PM, Sage Weil  wrote:
>>> > On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>>> >> -BEGIN PGP SIGNED MESSAGE-
>>> >> Hash: SHA256
>>> >>
>>> >> OK, an interesting point. Running ceph version 9.0.3-2036-g4f54a0d
>>> >> (4f54a0dd7c4a5c8bdc788c8b7f58048b2a28b9be) looks a lot better. I got
>>> >> messages when the OSD was marked out:
>>> >>
>>> >> 2015-10-06 11:52:46.961040 osd.13 192.168.55.12:6800/20870 81 :
>>> >> cluster [WRN] 17 slow requests, 3 included below; oldest blocked for >
>>> >> 34.476006 secs
>>> >> 2015-10-06 11:52:46.961056 osd.13 192.168.55.12:6800/20870 82 :
>>> >> cluster [WRN] slow request 32.913474 seconds old, received at
>>> >> 2015-10-06 11:52:14.047475: osd_op(client.600962.0:474
>>> >> rbd_data.338102ae8944a.5270 [read 3302912~4096] 8.c74a4538
>>> >> ack+read+known_if_redirected e58744) currently waiting for peered
>>> >> 2015-10-06 11:52:46.961066 osd.13 192.168.55.12:6800/20870 83 :
>>> >> cluster [WRN] slow request 32.697545 seconds old, received at
>>> >> 2015-10-06 11:52:14.263403: osd_op(client.600960.0:583
>>> >> rbd_data.3380f74b0dc51.0001ee75 [read 1016832~4096] 8.778d1be3
>>> >> ack+read+known_if_redirected e58744) currently waiting for peered
>>> >> 2015-10-06 11:52:46.961074 osd.13 192.168.55.12:6800/20870 84 :
>>> >> cluster [WRN] slow request 32.668006 seconds old, received at
>>> >> 2015-10-06 11:52:14.292942: osd_op(client.600955.0:571
>>> >> rbd_data.3380f74b0dc51.00019b09 [read 1034240~4096] 8.e87a6f58
>>> >> ack+read+known_if_redirected e58744) currently waiting for peered
>>> >>
>>> >> But I'm not seeing the blocked messages when the OSD came back in. The
>>> >> OSD spindles have been running at 100% during this test. I have seen
>>> >> slowed I/O from the clients as expected from the extra load, but so
>>> >> far no blocked messages. I'm going to run some more tests.
>>> >
>>> > Good to hear.
>>> >
>>> > FWIW I looked through the logs and all of the slow request no flag point
>>> > messages came from osd.163... and the logs don't show when they arrived.
>>> > My guess is this OSD has a slower disk than the others, or something else
>>> > funny is going on?
>>> >
>>> > I spot checked another OSD at random 

Re: [ceph-users] Potential OSD deadlock?

2015-10-06 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

I'll capture another set of logs. Is there any other debugging you
want turned up? I've seen the same thing where I see the message
dispatched to the secondary OSD, but the message just doesn't show up
for 30+ seconds in the secondary OSD logs.
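
A sketch of the debug settings usually requested for this kind of trace (debug osd
and debug ms, as discussed earlier in the thread); exact levels are whatever Sage
asks for:

ceph tell osd.* injectargs '--debug-osd 20 --debug-ms 1'    # turn up for the test window
ceph tell osd.* injectargs '--debug-osd 0/5 --debug-ms 0/5' # back towards defaults afterwards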
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Oct 6, 2015 at 1:34 PM, Sage Weil  wrote:
> On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> I can't think of anything. In my dev cluster the only thing that has
>> changed is the Ceph versions (no reboot). What I like is even though
>> the disks are 100% utilized, it is performing as I expect now. Client
>> I/O is slightly degraded during the recovery, but no blocked I/O when
>> the OSD boots or during the recovery period. This is with
>> max_backfills set to 20, one backfill max in our production cluster is
>> painful on OSD boot/recovery. I was able to reproduce this issue on
>> our dev cluster very easily and very quickly with these settings. So
>> far two tests and an hour later, only the blocked I/O when the OSD is
>> marked out. We would love to see that go away too, but this is far
> (me too!)
>> better than what we have now. This dev cluster also has
>> osd_client_message_cap set to default (100).
>>
>> We need to stay on the Hammer version of Ceph and I'm willing to take
>> the time to bisect this. If this is not a problem in Firefly/Giant,
>> would you prefer a bisect to find the introduction of the problem
>> (Firefly/Giant -> Hammer) or the introduction of the resolution
>> (Hammer -> Infernalis)? Do you have some hints to reduce hitting a
>> commit that prevents a clean build as that is my most limiting factor?
>
> Nothing comes to mind.  I think the best way to find this is still to see
> it happen in the logs with hammer.  The frustrating thing with that log
> dump you sent is that although I see plenty of slow request warnings in
> the osd logs, I don't see the requests arriving.  Maybe the logs weren't
> turned up for long enough?
>
> sage
>
>
>
>> Thanks,
>> - 
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Tue, Oct 6, 2015 at 12:32 PM, Sage Weil  wrote:
>> > On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>> >> -BEGIN PGP SIGNED MESSAGE-
>> >> Hash: SHA256
>> >>
>> >> OK, an interesting point. Running ceph version 9.0.3-2036-g4f54a0d
>> >> (4f54a0dd7c4a5c8bdc788c8b7f58048b2a28b9be) looks a lot better. I got
>> >> messages when the OSD was marked out:
>> >>
>> >> 2015-10-06 11:52:46.961040 osd.13 192.168.55.12:6800/20870 81 :
>> >> cluster [WRN] 17 slow requests, 3 included below; oldest blocked for >
>> >> 34.476006 secs
>> >> 2015-10-06 11:52:46.961056 osd.13 192.168.55.12:6800/20870 82 :
>> >> cluster [WRN] slow request 32.913474 seconds old, received at
>> >> 2015-10-06 11:52:14.047475: osd_op(client.600962.0:474
>> >> rbd_data.338102ae8944a.5270 [read 3302912~4096] 8.c74a4538
>> >> ack+read+known_if_redirected e58744) currently waiting for peered
>> >> 2015-10-06 11:52:46.961066 osd.13 192.168.55.12:6800/20870 83 :
>> >> cluster [WRN] slow request 32.697545 seconds old, received at
>> >> 2015-10-06 11:52:14.263403: osd_op(client.600960.0:583
>> >> rbd_data.3380f74b0dc51.0001ee75 [read 1016832~4096] 8.778d1be3
>> >> ack+read+known_if_redirected e58744) currently waiting for peered
>> >> 2015-10-06 11:52:46.961074 osd.13 192.168.55.12:6800/20870 84 :
>> >> cluster [WRN] slow request 32.668006 seconds old, received at
>> >> 2015-10-06 11:52:14.292942: osd_op(client.600955.0:571
>> >> rbd_data.3380f74b0dc51.00019b09 [read 1034240~4096] 8.e87a6f58
>> >> ack+read+known_if_redirected e58744) currently waiting for peered
>> >>
>> >> But I'm not seeing the blocked messages when the OSD came back in. The
>> >> OSD spindles have been running at 100% during this test. I have seen
>> >> slowed I/O from the clients as expected from the extra load, but so
>> >> far no blocked messages. I'm going to run some more tests.
>> >
>> > Good to hear.
>> >
>> > FWIW I looked through the logs and all of the slow request no flag point
>> > messages came from osd.163... and the logs don't show when they arrived.
>> > My guess is this OSD has a slower disk than the others, or something else
>> > funny is going on?
>> >
>> > I spot checked another OSD at random (60) where I saw a slow request.  It
>> > was stuck peering for 10s of seconds... waiting on a pg log message from
>> > osd.163.
>> >
>> > sage
>> >
>> >
>> >>

Re: [ceph-users] Potential OSD deadlock?

2015-10-06 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

I upped the debug on about everything and ran the test for about 40
minutes. I took OSD.19 on ceph1 down and then brought it back in.
There was at least one op on osd.19 that was blocked for over 1,000
seconds. Hopefully this will have something that will shed some light on
what is going on.

We are going to upgrade this cluster to Infernalis tomorrow and rerun
the test to verify the results from the dev cluster. This cluster
matches the hardware of our production cluster but is not yet in
production so we can safely wipe it to downgrade back to Hammer.

Logs are located at http://dev.v3trae.net/~jlavoy/ceph/logs/

Let me know what else we can do to help.

Thanks,

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Oct 6, 2015 at 2:36 PM, Robert LeBlanc  wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> On my second test (a much longer one), it took nearly an hour, but a
> few messages have popped up over a 20 window. Still far less than I
> have been seeing.
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Tue, Oct 6, 2015 at 2:00 PM, Robert LeBlanc  wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> I'll capture another set of logs. Is there any other debugging you
>> want turned up? I've seen the same thing where I see the message
>> dispatched to the secondary OSD, but the message just doesn't show up
>> for 30+ seconds in the secondary OSD logs.
>> - 
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Tue, Oct 6, 2015 at 1:34 PM, Sage Weil  wrote:
>>> On Tue, 6 Oct 2015, Robert LeBlanc wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA256

 I can't think of anything. In my dev cluster the only thing that has
 changed is the Ceph versions (no reboot). What I like is even though
 the disks are 100% utilized, it is performing as I expect now. Client
 I/O is slightly degraded during the recovery, but no blocked I/O when
 the OSD boots or during the recovery period. This is with
 max_backfills set to 20; one backfill max in our production cluster is
 painful on OSD boot/recovery. I was able to reproduce this issue on
 our dev cluster very easily and very quickly with these settings. So
 far two tests and an hour later, only the blocked I/O when the OSD is
 marked out. We would love to see that go away too, but this is far
>>> (me too!)
 better than what we have now. This dev cluster also has
 osd_client_message_cap set to default (100).

 We need to stay on the Hammer version of Ceph and I'm willing to take
 the time to bisect this. If this is not a problem in Firefly/Giant,
 would you prefer a bisect to find the introduction of the problem
 (Firefly/Giant -> Hammer) or the introduction of the resolution
 (Hammer -> Infernalis)? Do you have some hints to reduce hitting a
 commit that prevents a clean build as that is my most limiting factor?
>>>
>>> Nothing comes to mind.  I think the best way to find this is still to see
>>> it happen in the logs with hammer.  The frustrating thing with that log
>>> dump you sent is that although I see plenty of slow request warnings in
>>> the osd logs, I don't see the requests arriving.  Maybe the logs weren't
>>> turned up for long enough?
>>>
>>> sage
>>>
>>>
>>>
 Thanks,
 - 
 Robert LeBlanc
 PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


 On Tue, Oct 6, 2015 at 12:32 PM, Sage Weil  wrote:
 > On Tue, 6 Oct 2015, Robert LeBlanc wrote:
 >> -BEGIN PGP SIGNED MESSAGE-
 >> Hash: SHA256
 >>
 >> OK, an interesting point. Running ceph version 9.0.3-2036-g4f54a0d
 >> (4f54a0dd7c4a5c8bdc788c8b7f58048b2a28b9be) looks a lot better. I got
 >> messages when the OSD was marked out:
 >>
 >> 2015-10-06 

Re: [ceph-users] Potential OSD deadlock?

2015-10-06 Thread Max A. Krasilnikov
Hello!

On Mon, Oct 05, 2015 at 09:35:26PM -0600, robert wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256

> With some off-list help, we have adjusted
> osd_client_message_cap=1. This seems to have helped a bit and we
> have seen some OSDs have a value up to 4,000 for client messages. But
> it does not solve the problem with the blocked I/O.

> One thing that I have noticed is that almost exactly 30 seconds elapse
> between an OSD boots and the first blocked I/O message. I don't know
> if the OSD doesn't have time to get its brain right about a PG before
> it starts servicing it or what exactly.

I have problems like yours in my cluster. All of them can be fixed by
restarting some OSDs, but I cannot keep restarting all my OSDs all the time.
The problem occurs when a client is writing to an RBD volume or when a volume
is recovering. A typical message is (this one was during recovery):

[WRN] slow request 30.929654 seconds old, received at 2015-10-06 
13:00:41.412329: osd_op(client.1068613.0:192715 
rbd_data.dc7650539e6a.0820 [set-alloc-hint object_size 4194304 
write_size 4194304,write 3371008~4096] 5.d66fd55d snapc c=[c] 
ack+ondisk+write+known_if_redirected e4009) currently waiting for subops from 51

Restarting osd.51 in such a scenario fixes the problem.

There are no slow requests while IO on the systems is low, only when I do
something like uploading an image.

Some time ago I had too many OSDs that were created but never used. Back then,
when going down for a restart, those OSDs did not inform the mon about it.
Removing the unused OSD entries fixed that issue, but when doing ceph crush dump
I can still see them. Maybe that is the root of the problem? I tried to do
getcrushmap/edit/setcrushmap, but the entries are still in place.

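For reference, the usual round trip for hand-editing the CRUSH map looks roughly
like this (file names are just placeholders):

  ceph osd getcrushmap -o crushmap.bin        # dump the compiled map
  crushtool -d crushmap.bin -o crushmap.txt   # decompile to text
  # edit crushmap.txt and remove the stale entries
  crushtool -c crushmap.txt -o crushmap.new   # recompile
  ceph osd setcrushmap -i crushmap.new        # inject the edited map

If the stale OSDs are also still present in the OSD map, something like
ceph osd crush remove osd.N followed by ceph osd rm N (for each stale id N)
may be needed before the entries really go away.
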
Maybe my experience will help you to find the answer. I hope it will fix my
problems :)

-- 
WBR, Max A. Krasilnikov


Re: [ceph-users] Potential OSD deadlock?

2015-10-06 Thread Sage Weil
On Mon, 5 Oct 2015, Robert LeBlanc wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
> 
> With some off-list help, we have adjusted
> osd_client_message_cap=1. This seems to have helped a bit and we
> have seen some OSDs have a value up to 4,000 for client messages. But
> it does not solve the problem with the blocked I/O.
> 
> One thing that I have noticed is that almost exactly 30 seconds elapse
> between an OSD boots and the first blocked I/O message. I don't know
> if the OSD doesn't have time to get its brain right about a PG before
> it starts servicing it or what exactly.

I'm downloading the logs from yesterday now; sorry it's taking so long.

> On another note, I tried upgrading our CentOS dev cluster from Hammer
> to master and things didn't go so well. The OSDs would not start
> because /var/lib/ceph was not owned by ceph. I chowned the directory
> and all OSDs and the OSD then started, but never became active in the
> cluster. It just sat there after reading all the PGs. There were
> sockets open to the monitor, but no OSD to OSD sockets. I tried
> downgrading to the Infernalis branch and still no luck getting the
> OSDs to come up. The OSD processes were idle after the initial boot.
> All packages were installed from gitbuilder.

Did you chown -R ?


https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgrading-from-hammer

My guess is you only chowned the root dir, and the OSD didn't throw 
an error when it encountered the other files?  If you can generate a debug 
osd = 20 log, that would be helpful.. thanks!
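
As a rough sketch (assuming the systemd packaging and the default paths; osd.12
is just an example id), the recursive fix-up would look something like:

  systemctl stop ceph-osd@12
  chown -R ceph:ceph /var/lib/ceph /var/log/ceph
  systemctl start ceph-osd@12

with debug osd = 20 in the [osd] section of ceph.conf before the restart, so the
log gets captured.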

sage


> 
> Thanks,
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Sun, Oct 4, 2015 at 3:04 PM, Robert LeBlanc  wrote:
> > -BEGIN PGP SIGNED MESSAGE-
> > Hash: SHA256
> >
> > I have eight nodes running the fio job rbd_test_real to different RBD
> > volumes. I've included the CRUSH map in the tarball.
> >
> > I stopped one OSD process and marked it out. I let it recover for a
> > few minutes and then I started the process again and marked it in. I
> > started getting block I/O messages during the recovery.
> >
> > The logs are located at http://162.144.87.113/files/ushou1.tar.xz
> >
> > Thanks,
> >
> > 
> > Robert LeBlanc
> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >
> >
> > On Sun, Oct 4, 2015 at 7:48 AM, Sage Weil  wrote:
> >> On Sat, 3 Oct 2015, Robert LeBlanc wrote:
> >>> -BEGIN PGP SIGNED MESSAGE-
> >>> Hash: SHA256
> >>>
> >>> We are still struggling with this and have tried a lot of different
> >>> things. Unfortunately, Inktank (now Red Hat) no longer provides
> >>> consulting services for non-Red Hat systems. If there are some
> >>> certified Ceph consultants in the US that we can do both remote and
> >>> on-site engagements, please let us know.
> >>>
> >>> This certainly seems to be network related, but somewhere in the
> >>> kernel. We have tried increasing the network and TCP buffers, number
> >>> of TCP sockets, reduced the FIN_WAIT2 state. There is about 25% 

Re: [ceph-users] Potential OSD deadlock?

2015-10-06 Thread Ken Dreyer
On Tue, Oct 6, 2015 at 8:38 AM, Sage Weil  wrote:
> Oh.. I bet you didn't upgrade the osds to 0.94.4 (or latest hammer build)
> first.  They won't be allowed to boot until that happens... all upgrades
> must stop at 0.94.4 first.

This sounds pretty crucial. Are there Redmine ticket(s) for it?

- Ken


Re: [ceph-users] Potential OSD deadlock?

2015-10-06 Thread Sage Weil
On Tue, 6 Oct 2015, Robert LeBlanc wrote:
> I downgraded to the hammer gitbuilder branch, but it looks like I've
> passed the point of no return:
> 
> 2015-10-06 09:44:52.210873 7fd3dd8b78c0 -1 ERROR: on disk data
> includes unsupported features:
> compat={},rocompat={},incompat={7=support shec erasure code}
> 2015-10-06 09:44:52.210922 7fd3dd8b78c0 -1 error checking features:
> (1) Operation not permitted

In that case, mark all osds down, upgrade again, and they'll be 
allowed to start.  The restriction is that each osd can't go backwards, 
and post-hammer osds can't talk to pre-hammer osds.
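
A minimal sketch of that sequence (the ids come from the cluster itself; the
package steps are omitted):

  for id in $(ceph osd ls); do ceph osd down $id; done   # mark every osd down
  # reinstall 0.94.4 / the latest hammer build and restart the osds, then:
  ceph tell osd.* version                                # confirm what is actually running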

sage

> 
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Tue, Oct 6, 2015 at 8:38 AM, Sage Weil  wrote:
> > On Tue, 6 Oct 2015, Robert LeBlanc wrote:
> >> Thanks for your time Sage. It sounds like a few people may be helped if you
> >> can find something.
> >>
> >> I did a recursive chown as in the instructions (although I didn't know 
> >> about
> >> the doc at the time). I did an osd debug at 20/20 but didn't see anything.
> >> I'll also do ms and make the logs available. I'll also review the document
> >> to make sure I didn't miss anything else.
> >
> > Oh.. I bet you didn't upgrade the osds to 0.94.4 (or latest hammer build)
> > first.  They won't be allowed to boot until that happens... all upgrades
> > must stop at 0.94.4 first.  And that isn't released yet.. we'll try to
> > do that today.  In the meantime, you can use the hammer gitbuilder
> > build...
> >
> > sage
> >
> >
> >>
> >> Robert LeBlanc
> >>
> >> Sent from a mobile device please excuse any typos.
> >>
> >> On Oct 6, 2015 6:37 AM, "Sage Weil"  wrote:
> >>   On Mon, 5 Oct 2015, Robert LeBlanc wrote:
> >>   > -BEGIN PGP SIGNED MESSAGE-
> >>   > Hash: SHA256
> >>   >
> >>   > With some off-list help, we have adjusted
> >>   > osd_client_message_cap=1. This seems to have helped a bit
> >>   and we
> >>   > have seen some OSDs have a value up to 4,000 for client
> >>   messages. But
> >>   > it does not solve the problem with the blocked I/O.
> >>   >
> >>   > One thing that I have noticed is that almost exactly 30
> >>   seconds elapse
> >>   > between an OSD boots and the first blocked I/O message. I
> >>   don't know
> >>   > if the OSD doesn't have time to get its brain right about a
> >>   PG before
> >>   > it starts servicing it or what exactly.
> >>
> >>   I'm downloading the logs from yesterday now; sorry it's taking
> >>   so long.
> >>
> >>   > On another note, I tried upgrading our CentOS dev cluster from
> >>   Hammer
> >>   > to master and things didn't go so well. The OSDs would not
> >>   start
> >>   > because /var/lib/ceph was not owned by ceph. I chowned the
> >>   directory
> >>   > and all OSDs and the OSD then started, but never became active
> >>   in the
> >>   > cluster. It just sat there after reading all the PGs. There
> >>   were
> >>   > sockets open to the monitor, but no OSD to OSD sockets. I
> >>   tried
> >>   > downgrading to the Infernalis branch and still no luck getting
> >>   the
> >>   > OSDs to come up. The OSD processes were idle after the initial
> >>   boot.
> >>   > All packages were installed from gitbuilder.
> >>
> >>   Did you chown -R ?
> >>
> >>  
> >> https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgradin
> >>   g-from-hammer
> >>
> >>   My guess is you only chowned the root dir, and the OSD didn't
> >>   throw
> >>   an error when it encountered the other files?  If you can
> >>   generate a debug
> >>   osd = 20 log, that would be helpful.. thanks!
> >>
> >>   sage
> >>
> >>
> >>   >
> >>   > Thanks,

Re: [ceph-users] Potential OSD deadlock?

2015-10-06 Thread Robert LeBlanc
I downgraded to the hammer gitbuilder branch, but it looks like I've
passed the point of no return:

2015-10-06 09:44:52.210873 7fd3dd8b78c0 -1 ERROR: on disk data
includes unsupported features:
compat={},rocompat={},incompat={7=support shec erasure code}
2015-10-06 09:44:52.210922 7fd3dd8b78c0 -1 error checking features:
(1) Operation not permitted


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Oct 6, 2015 at 8:38 AM, Sage Weil  wrote:
> On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>> Thanks for your time Sage. It sounds like a few people may be helped if you
>> can find something.
>>
>> I did a recursive chown as in the instructions (although I didn't know about
>> the doc at the time). I did an osd debug at 20/20 but didn't see anything.
>> I'll also do ms and make the logs available. I'll also review the document
>> to make sure I didn't miss anything else.
>
> Oh.. I bet you didn't upgrade the osds to 0.94.4 (or latest hammer build)
> first.  They won't be allowed to boot until that happens... all upgrades
> must stop at 0.94.4 first.  And that isn't released yet.. we'll try to
> do that today.  In the meantime, you can use the hammer gitbuilder
> build...
>
> sage
>
>
>>
>> Robert LeBlanc
>>
>> Sent from a mobile device please excuse any typos.
>>
>> On Oct 6, 2015 6:37 AM, "Sage Weil"  wrote:
>>   On Mon, 5 Oct 2015, Robert LeBlanc wrote:
>>   > -BEGIN PGP SIGNED MESSAGE-
>>   > Hash: SHA256
>>   >
>>   > With some off-list help, we have adjusted
>>   > osd_client_message_cap=1. This seems to have helped a bit
>>   and we
>>   > have seen some OSDs have a value up to 4,000 for client
>>   messages. But
>>   > it does not solve the problem with the blocked I/O.
>>   >
>>   > One thing that I have noticed is that almost exactly 30
>>   seconds elapse
>>   > between an OSD boots and the first blocked I/O message. I
>>   don't know
>>   > if the OSD doesn't have time to get it's brain right about a
>>   > if the OSD doesn't have time to get its brain right about a
>>   > it starts servicing it or what exactly.
>>
>>   I'm downloading the logs from yesterday now; sorry it's taking
>>   so long.
>>
>>   > On another note, I tried upgrading our CentOS dev cluster from
>>   Hammer
>>   > to master and things didn't go so well. The OSDs would not
>>   start
>>   > because /var/lib/ceph was not owned by ceph. I chowned the
>>   directory
>>   > and all OSDs and the OSD then started, but never became active
>>   in the
>>   > cluster. It just sat there after reading all the PGs. There
>>   were
>>   > sockets open to the monitor, but no OSD to OSD sockets. I
>>   tried
>>   > downgrading to the Infernalis branch and still no luck getting
>>   the
>>   > OSDs to come up. The OSD processes were idle after the initial
>>   boot.
>>   > All packages were installed from gitbuilder.
>>
>>   Did you chown -R ?
>>
>>  
>> https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgradin
>>   g-from-hammer
>>
>>   My guess is you only chowned the root dir, and the OSD didn't
>>   throw
>>   an error when it encountered the other files?  If you can
>>   generate a debug
>>   osd = 20 log, that would be helpful.. thanks!
>>
>>   sage
>>
>>
>>   >
>>   > Thanks,
>>   > 
>>   > Robert LeBlanc
>>   > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62
>>   B9F1
>>   >
>>   >
>>   > On Sun, Oct 4, 2015 at 3:04 PM, Robert LeBlanc
>>    wrote:
>>   > > -BEGIN PGP SIGNED MESSAGE-
>>   > > Hash: SHA256
>>   > >
>>   > > I have eight nodes running the fio job rbd_test_real to
>>   different RBD
>>   > > 

Re: [ceph-users] Potential OSD deadlock?

2015-10-05 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

With some off-list help, we have adjusted
osd_client_message_cap=1. This seems to have helped a bit and we
have seen some OSDs have a value up to 4,000 for client messages. But
it does not solve the problem with the blocked I/O.
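
For reference, that knob can be changed at runtime roughly like this (the value
below is only an illustration, not a recommendation):

  ceph tell osd.* injectargs '--osd_client_message_cap 5000'

or persistently via the [osd] section of ceph.conf (osd client message cap = 5000).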

One thing that I have noticed is that almost exactly 30 seconds elapse
between when an OSD boots and the first blocked I/O message. I don't know
if the OSD doesn't have time to get its brain right about a PG before
it starts servicing it or what exactly.

On another note, I tried upgrading our CentOS dev cluster from Hammer
to master and things didn't go so well. The OSDs would not start
because /var/lib/ceph was not owned by ceph. I chowned the directory
and all the OSD directories, and the OSDs then started but never became active
in the cluster. They just sat there after reading all the PGs. There were
sockets open to the monitor, but no OSD to OSD sockets. I tried
downgrading to the Infernalis branch and still no luck getting the
OSDs to come up. The OSD processes were idle after the initial boot.
All packages were installed from gitbuilder.

Thanks,

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Sun, Oct 4, 2015 at 3:04 PM, Robert LeBlanc  wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> I have eight nodes running the fio job rbd_test_real to different RBD
> volumes. I've included the CRUSH map in the tarball.
>
> I stopped one OSD process and marked it out. I let it recover for a
> few minutes and then I started the process again and marked it in. I
> started getting block I/O messages during the recovery.
>
> The logs are located at http://162.144.87.113/files/ushou1.tar.xz
>
> Thanks,
>
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Sun, Oct 4, 2015 at 7:48 AM, Sage Weil  wrote:
>> On Sat, 3 Oct 2015, Robert LeBlanc wrote:
>>> -BEGIN PGP SIGNED MESSAGE-
>>> Hash: SHA256
>>>
>>> We are still struggling with this and have tried a lot of different
>>> things. Unfortunately, Inktank (now Red Hat) no longer provides
>>> consulting services for non-Red Hat systems. If there are some
>>> certified Ceph consultants in the US that we can do both remote and
>>> on-site engagements, please let us know.
>>>
>>> This certainly seems to be network related, but somewhere in the
>>> kernel. We have tried increasing the network and TCP buffers, number
>>> of TCP sockets, reduced the FIN_WAIT2 state. There is about 25% idle
>>> on the boxes, the disks are busy, but not constantly at 100% (they
>>> cycle from <10% up to 100%, but not 100% for more than a few seconds
>>> at a time). There seems to be no reasonable explanation why I/O is
>>> blocked pretty frequently longer than 30 seconds. We have verified
>>> Jumbo frames by pinging from/to each node with 9000 byte packets. The
>>> network admins have verified that packets are not being dropped in the
>>> switches for these nodes. We have tried different kernels including
>>> the recent Google patch to cubic. This is showing up on three cluster
>>> (two Ethernet and one IPoIB). I booted one 

Re: [ceph-users] Potential OSD deadlock?

2015-10-05 Thread Josef Johansson
Hi,

Looking over the disks etc. and comparing to our setup, we've got somewhat different
hardware, but they should be comparable. We're running Hitachi 4TB (HUS724040AL),
Intel DC S3700 and SAS3008 instead.

In our old cluster (almost the same hardware in new and old) we overloaded the
cluster and had to wait three nights before the last new disk was added; next
time we’ll turn down the recovery ratios and let it run during the day. Now we use
nobackfill so that backfill only runs during the night. We also had to turn off
deep-scrub during the day to give client IO more headroom.

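Both of those are plain cluster flags, so gating them on a nightly window can be
scripted roughly like this (scheduling via cron or similar is left out):

  ceph osd set nobackfill        # during the day
  ceph osd set nodeep-scrub
  ceph osd unset nobackfill      # at night, let backfill and deep scrubs run again
  ceph osd unset nodeep-scrub
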
On the new cluster we still run everything during the day, but I feel that the
article Intel wrote, where they specified that a cluster could handle X clients, is
pretty accurate. What I couldn’t figure out was why the system was not loaded more
than it was, but it _feels_ like mixed read/write pushes the latency beyond a
certain point where the clients start to suffer, and that it’s not possible to
see this through iostat and friends.

I hope the debug logs can track down what the latency is due to. I'm keeping tabs on
how this turns out.

I believe these guys do ceph consulting as well. 
https://www.hastexo.com/knowledge/storage-io/ceph-rados 


Regards,
Josef

> On 04 Oct 2015, at 18:13, Robert LeBlanc  wrote:
> 
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
> 
> These are Toshiba MG03ACA400 drives.
> 
> sd{a,b} are 4TB on 00:1f.2 SATA controller: Intel Corporation C600/X79 series 
> chipset 6-Port SATA AHCI Controller (rev 05) at 3.0 Gb
> sd{c,d} are 4TB on 00:1f.2 SATA controller: Intel Corporation C600/X79 series 
> chipset 6-Port SATA AHCI Controller (rev 05) at 6.0 Gb
> sde is SATADOM with OS install
> sd{f..i,l,m} are 4TB on 01:00.0 Serial Attached SCSI controller: LSI Logic / 
> Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
> sd{j,k} are 240 GB Intel SSDSC2BB240G4 on 01:00.0 Serial Attached SCSI 
> controller: LSI Logic / Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2 
> (rev 05)
> 
> There is probably some performance optimization that we can do in this area, 
> however unless I'm missing something, I don't see anything that should cause 
> I/O to take 30-60+ seconds to complete from a disk standpoint.
> 
> [root@ceph1 ~]# for i in {{a..d},{f..i},{l,m}}; do echo -n "sd${i}1: "; 
> xfs_db -c frag -r /dev/sd${i}1; done  
>   
> 
> sda1: actual 924229, ideal 414161, fragmentation factor 55.19%
> sdb1: actual 1703083, ideal 655321, fragmentation factor 61.52%
> sdc1: actual 2161827, ideal 746418, fragmentation factor 65.47%
> sdd1: actual 1807008, ideal 654214, fragmentation factor 63.80%
> sdf1: actual 735471, ideal 311837, fragmentation factor 57.60%
> sdg1: actual 1463859, ideal 507362, fragmentation factor 65.34%
> sdh1: actual 1684905, ideal 556571, fragmentation factor 66.97%
> sdi1: actual 1833980, ideal 608499, fragmentation factor 66.82%
> sdl1: actual 1641128, ideal 554364, fragmentation factor 66.22%
> sdm1: actual 2032644, ideal 697129, fragmentation factor 65.70%
> 
> 
> [root@ceph1 ~]# iostat -xd 2
> Linux 4.2.1-1.el7.elrepo.x86_64 (ceph1)   10/04/2015  _x86_64_(16 
> CPU)
> 
> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
> avgqu-sz   await r_await w_await  svctm  %util
> sda   0.09 2.069.24   36.18   527.28  1743.71   100.00
>  8.96  197.32   17.50  243.23   4.07  18.47
> sdb   0.17 3.61   16.70   74.44   949.65  2975.3086.13
>  6.74   73.95   23.94   85.16   4.31  39.32
> sdc   0.14 4.67   15.69   87.80   818.02  3860.1190.41
>  9.56   92.38   26.73  104.11   4.44  45.91
> sdd   0.17 3.437.16   69.13   480.96  2847.4287.25
>  4.80   62.89   30.00   66.30   4.33  33.00
> sde   0.01 1.130.340.99 8.3512.0130.62
>  0.017.372.649.02   1.64   0.22
> sdj   0.00 1.220.01  348.22 0.03 11302.6564.91
>  0.230.660.140.66   0.15   5.15
> sdk   0.00 1.990.01  369.94 0.03 12876.7469.61
>  0.260.710.130.71   0.16   5.75
> sdf   0.01 1.791.55   31.1239.64  1431.3790.06
>  4.07  124.67   16.25  130.05   3.11  10.17
> sdi   0.22 3.17   23.92   72.90  1386.45  2676.2883.93
>  7.75   80.00   24.31   98.27   4.31  41.77
> sdm   0.16 3.10   17.63   72.84   986.29  2767.2482.98
>  6.57   72.64   23.67   84.50   4.23  38.30
> sdl   0.11 3.01   12.10   55.14   660.85  2361.4089.89
> 17.87  265.80   21.64  319.36   4.08  27.45
> sdg   0.08 2.459.75   53.90   489.67  1929.4276.01
> 17.27  271.30   20.77  316.61   3.98  25.33
> sdh   

Re: [ceph-users] Potential OSD deadlock?

2015-10-04 Thread Josef Johansson
Hi,

I don't know what brand those 4TB spindles are, but I know that mine are
very bad at doing writes at the same time as reads, especially small mixed
reads and writes.

This has an absurdly bad effect when doing maintenance on Ceph. That being
said, we see a big difference between dumpling and hammer in performance
on these drives, most likely because hammer is able to read/write degraded PGs.

We have run into two different problems along the way. The first was
blocked requests, where we had to upgrade from 64GB of memory on each node to
256GB. We thought that it was the only safe buy to make things better.

I believe it worked because more reads were cached, so we had less mixed
read/write on the nodes, thus giving the spindles more room to breathe. It
was a shot in the dark at the time, but the price is not that high to
just try it out, compared to 6 people working on it. I believe the IO on
disk was not huge either, but what kills the disks is high latency. How much
bandwidth are the disks using? Ours was very low, 3-5MB/s.

The second problem was fragmentation hitting 70%; lowering that to 6%
made a lot of difference. Depending on the IO pattern this grows differently.

TL;DR: reads kill the 4TB spindles.

Hope you guys get out of the woods.
/Josef
On 3 Oct 2015 10:10 pm, "Robert LeBlanc"  wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> We are still struggling with this and have tried a lot of different
> things. Unfortunately, Inktank (now Red Hat) no longer provides
> consulting services for non-Red Hat systems. If there are some
> certified Ceph consultants in the US that we can do both remote and
> on-site engagements, please let us know.
>
> This certainly seems to be network related, but somewhere in the
> kernel. We have tried increasing the network and TCP buffers, number
> of TCP sockets, reduced the FIN_WAIT2 state. There is about 25% idle
> on the boxes, the disks are busy, but not constantly at 100% (they
> cycle from <10% up to 100%, but not 100% for more than a few seconds
> at a time). There seems to be no reasonable explanation why I/O is
> blocked pretty frequently longer than 30 seconds. We have verified
> Jumbo frames by pinging from/to each node with 9000 byte packets. The
> network admins have verified that packets are not being dropped in the
> switches for these nodes. We have tried different kernels including
> the recent Google patch to cubic. This is showing up on three cluster
> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
> (from CentOS 7.1) with similar results.
>
> The messages seem slightly different:
> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
> 100.087155 secs
> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
> cluster [WRN] slow request 30.041999 seconds old, received at
> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
> rbd_data.13fdcb2ae8944a.0001264f [read 975360~4096]
> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
> points reached
>
> I don't know what "no flag points reached" means.
>
> The problem is most pronounced when we have to reboot an OSD node (1
> of 13), we will have hundreds of I/O blocked for some times up to 300
> seconds. It takes a good 15 minutes for things to settle down. The
> production cluster is very busy doing normally 8,000 I/O and peaking
> at 15,000. This is all 4TB spindles with SSD journals and the disks
> are between 25-50% full. We are currently splitting PGs to distribute
> the load better across the disks, but we are having to do this 10 PGs
> at a time as we get blocked I/O. We have max_backfills and
> max_recovery set to 1, client op priority is set higher than recovery
> priority. We tried increasing the number of op threads but this didn't
> seem to help. It seems as soon as PGs are finished being checked, they
> become active and could be the cause for slow I/O while the other PGs
> are being checked.
>
> What I don't understand is that the messages are delayed. As soon as
> the message is received by Ceph OSD process, it is very quickly
> committed to the journal and a response is sent back to the primary
> OSD which is received very quickly as well. I've adjust
> min_free_kbytes and it seems to keep the OSDs from crashing, but
> doesn't solve the main problem. We don't have swap and there is 64 GB
> of RAM per nodes for 10 OSDs.
>
> Is there something that could cause the kernel to get a packet but not
> be able to dispatch it to Ceph such that it could be explaining why we
> are seeing these blocked I/O for 30+ seconds. Is there some pointers
> to tracing Ceph messages from the network buffer through the kernel to
> the Ceph process?
>
> We can really use some pointers no matter how outrageous. We've have
> over 6 people looking into this for weeks now and just can't think of
> anything else.
>
> Thanks,
> -BEGIN PGP 

Re: [ceph-users] Potential OSD deadlock?

2015-10-04 Thread Sage Weil
On Sat, 3 Oct 2015, Robert LeBlanc wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
> 
> We are still struggling with this and have tried a lot of different
> things. Unfortunately, Inktank (now Red Hat) no longer provides
> consulting services for non-Red Hat systems. If there are some
> certified Ceph consultants in the US that we can do both remote and
> on-site engagements, please let us know.
> 
> This certainly seems to be network related, but somewhere in the
> kernel. We have tried increasing the network and TCP buffers, number
> of TCP sockets, reduced the FIN_WAIT2 state. There is about 25% idle
> on the boxes, the disks are busy, but not constantly at 100% (they
> cycle from <10% up to 100%, but not 100% for more than a few seconds
> at a time). There seems to be no reasonable explanation why I/O is
> blocked pretty frequently longer than 30 seconds. We have verified
> Jumbo frames by pinging from/to each node with 9000 byte packets. The
> network admins have verified that packets are not being dropped in the
> switches for these nodes. We have tried different kernels including
> the recent Google patch to cubic. This is showing up on three cluster
> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
> (from CentOS 7.1) with similar results.
> 
> The messages seem slightly different:
> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
> 100.087155 secs
> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
> cluster [WRN] slow request 30.041999 seconds old, received at
> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
> rbd_data.13fdcb2ae8944a.0001264f [read 975360~4096]
> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
> points reached
> 
> I don't know what "no flag points reached" means.

Just that the op hasn't been marked as reaching any interesting points 
(op->mark_*() calls).

Is it possible to gather a log with debug ms = 20 and debug osd = 20?
It's extremely verbose but it'll let us see where the op is getting
blocked.  If you see the "slow request" message it means the op is
received by ceph (that's when the clock starts), so I suspect it's not
something we can blame on the network stack.
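
One low-friction way to flip that on temporarily, assuming the injectargs route
is acceptable here, is roughly:

  ceph tell osd.* injectargs '--debug_ms 20 --debug_osd 20'
  # ... reproduce the blocked I/O, then dial it back down:
  ceph tell osd.* injectargs '--debug_ms 0 --debug_osd 1'

(the trailing values are only typical defaults; restore whatever was set before).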

sage


> 
> The problem is most pronounced when we have to reboot an OSD node (1
> of 13), we will have hundreds of I/O blocked for some times up to 300
> seconds. It takes a good 15 minutes for things to settle down. The
> production cluster is very busy doing normally 8,000 I/O and peaking
> at 15,000. This is all 4TB spindles with SSD journals and the disks
> are between 25-50% full. We are currently splitting PGs to distribute
> the load better across the disks, but we are having to do this 10 PGs
> at a time as we get blocked I/O. We have max_backfills and
> max_recovery set to 1, client op priority is set higher than recovery
> priority. We tried increasing the number of op threads but this didn't
> seem to help. It seems as soon as PGs are finished being checked, they
> become active and could be the cause for slow I/O while the other PGs
> are being checked.
> 
> What I don't understand is that the messages are delayed. As soon as
> the message is received by Ceph OSD process, it is very quickly
> committed to the journal and a response is sent back to the primary
> OSD which is received very quickly as well. I've adjust
> min_free_kbytes and it seems to keep the OSDs from crashing, but
> doesn't solve the main problem. We don't have swap and there is 64 GB
> of RAM per nodes for 10 OSDs.
> 
> Is there something that could cause the kernel to get a packet but not
> be able to dispatch it to Ceph such that it could be explaining why we
> are seeing these blocked I/O for 30+ seconds. Is there some pointers
> to tracing Ceph messages from the network buffer through the kernel to
> the Ceph process?
> 
> We can really use some pointers no matter how outrageous. We've have
> over 6 people looking into this for weeks now and just can't think of
> anything else.
> 
> Thanks,

Re: [ceph-users] Potential OSD deadlock?

2015-10-04 Thread Alex Gorbachev
We had multiple issues with 4TB drives and delays.  Here is the
configuration that works for us fairly well on Ubuntu (but we are about to
significantly increase the IO load so this may change).

NTP: always use NTP and make sure it is working - Ceph is very sensitive to
time being precise
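
A quick sanity check, depending on which daemon is in use:

  ntpq -p           # ntpd: offsets/jitter should be small
  chronyc sources   # chrony equivalent

and ceph health will warn about monitor clock skew if time drifts too far.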

/etc/default/grub:

GRUB_CMDLINE_LINUX_DEFAULT="elevator=noop nomodeset splash=silent
vga=normal net.ifnames=0 biosdevname=0 scsi_mod.use_blk_mq=Y"

blk_mq really helps with spreading the IO load over multiple cores.

I used to use intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll, but
it seems allowing idle states can actually improve performance by running the
CPUs cooler, so I will likely remove this soon.

chmod -x /etc/init.d/ondemand - in order to prevent CPU throttling

use Mellanox OFED on pre-4.x kernels

check your flow control settings on server and switch using ethtool

test network performance with iperf
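
For example (interface name and peer host are placeholders):

  ethtool -a eth0                                # pause/flow-control settings
  ethtool -S eth0 | grep -iE 'err|drop|pause'    # per-NIC error/drop/pause counters
  iperf -s                                       # on one node
  iperf -c <peer> -P 4 -t 30                     # from another node, a few parallel streams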

disable firewall rules or just uninstall firewall (e.g. ufw)

Turn off any virtualization technology (VT-d etc.) in the BIOS, and (see note
above re C-states) maybe also disable power-saving features

/etc/sysctl.conf:
kernel.pid_max = 4194303
vm.swappiness=1
vm.min_free_kbytes=1048576

Hope this helps.

Alex


On Sun, Oct 4, 2015 at 2:16 AM, Josef Johansson  wrote:

> Hi,
>
> I don't know what brand those 4TB spindles are, but I know that mine are
> very bad at doing writes at the same time as reads, especially small mixed
> reads and writes.
>
> This has an absurdly bad effect when doing maintenance on Ceph. That being
> said, we see a big difference between dumpling and hammer in performance
> on these drives, most likely because hammer is able to read/write degraded PGs.
>
> We have run into two different problems along the way. The first was
> blocked requests, where we had to upgrade from 64GB of memory on each node to
> 256GB. We thought that it was the only safe buy to make things better.
>
> I believe it worked because more reads were cached, so we had less mixed
> read/write on the nodes, thus giving the spindles more room to breathe. It
> was a shot in the dark at the time, but the price is not that high to
> just try it out, compared to 6 people working on it. I believe the IO on
> disk was not huge either, but what kills the disks is high latency. How much
> bandwidth are the disks using? Ours was very low, 3-5MB/s.
>
> The second problem was fragmentation hitting 70%; lowering that to 6%
> made a lot of difference. Depending on the IO pattern this grows differently.
>
> TL;DR: reads kill the 4TB spindles.
>
> Hope you guys get out of the woods.
> /Josef
> On 3 Oct 2015 10:10 pm, "Robert LeBlanc"  wrote:
>
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> We are still struggling with this and have tried a lot of different
>> things. Unfortunately, Inktank (now Red Hat) no longer provides
>> consulting services for non-Red Hat systems. If there are some
>> certified Ceph consultants in the US that we can do both remote and
>> on-site engagements, please let us know.
>>
>> This certainly seems to be network related, but somewhere in the
>> kernel. We have tried increasing the network and TCP buffers, number
>> of TCP sockets, reduced the FIN_WAIT2 state. There is about 25% idle
>> on the boxes, the disks are busy, but not constantly at 100% (they
>> cycle from <10% up to 100%, but not 100% for more than a few seconds
>> at a time). There seems to be no reasonable explanation why I/O is
>> blocked pretty frequently longer than 30 seconds. We have verified
>> Jumbo frames by pinging from/to each node with 9000 byte packets. The
>> network admins have verified that packets are not being dropped in the
>> switches for these nodes. We have tried different kernels including
>> the recent Google patch to cubic. This is showing up on three cluster
>> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
>> (from CentOS 7.1) with similar results.
>>
>> The messages seem slightly different:
>> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
>> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
>> 100.087155 secs
>> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
>> cluster [WRN] slow request 30.041999 seconds old, received at
>> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
>> rbd_data.13fdcb2ae8944a.0001264f [read 975360~4096]
>> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
>> points reached
>>
>> I don't know what "no flag points reached" means.
>>
>> The problem is most pronounced when we have to reboot an OSD node (1
>> of 13), we will have hundreds of I/O blocked for some times up to 300
>> seconds. It takes a good 15 minutes for things to settle down. The
>> production cluster is very busy doing normally 8,000 I/O and peaking
>> at 15,000. This is all 4TB spindles with SSD journals and the disks
>> are between 25-50% full. We are currently splitting PGs to distribute

Re: [ceph-users] Potential OSD deadlock?

2015-10-04 Thread Josef Johansson
I would start with defragging the drives; the good part is that you can just
run the defrag with the time parameter and it will cover all available XFS
drives.
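
The tool in question is xfs_fsr; a time-boxed pass over every mounted XFS
filesystem might look like (the duration is only an example):

  xfs_fsr -t 7200 -v    # reorganize for at most two hours, verbosely
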
On 4 Oct 2015 6:13 pm, "Robert LeBlanc"  wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> These are Toshiba MG03ACA400 drives.
>
> sd{a,b} are 4TB on 00:1f.2 SATA controller: Intel Corporation C600/X79 series 
> chipset 6-Port SATA AHCI Controller (rev 05) at 3.0 Gb
> sd{c,d} are 4TB on 00:1f.2 SATA controller: Intel Corporation C600/X79 series 
> chipset 6-Port SATA AHCI Controller (rev 05) at 6.0 Gb
> sde is SATADOM with OS install
> sd{f..i,l,m} are 4TB on 01:00.0 Serial Attached SCSI controller: LSI Logic / 
> Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
> sd{j,k} are 240 GB Intel SSDSC2BB240G4 on 01:00.0 Serial Attached SCSI 
> controller: LSI Logic / Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2 
> (rev 05)
>
> There is probably some performance optimization that we can do in this area, 
> however unless I'm missing something, I don't see anything that should cause 
> I/O to take 30-60+ seconds to complete from a disk standpoint.
>
> [root@ceph1 ~]# for i in {{a..d},{f..i},{l,m}}; do echo -n "sd${i}1: "; 
> xfs_db -c frag -r /dev/sd${i}1; done
> sda1: actual 924229, ideal 414161, fragmentation factor 55.19%
> sdb1: actual 1703083, ideal 655321, fragmentation factor 61.52%
> sdc1: actual 2161827, ideal 746418, fragmentation factor 65.47%
> sdd1: actual 1807008, ideal 654214, fragmentation factor 63.80%
> sdf1: actual 735471, ideal 311837, fragmentation factor 57.60%
> sdg1: actual 1463859, ideal 507362, fragmentation factor 65.34%
> sdh1: actual 1684905, ideal 556571, fragmentation factor 66.97%
> sdi1: actual 1833980, ideal 608499, fragmentation factor 66.82%
> sdl1: actual 1641128, ideal 554364, fragmentation factor 66.22%
> sdm1: actual 2032644, ideal 697129, fragmentation factor 65.70%
>
>
> [root@ceph1 ~]# iostat -xd 2
> Linux 4.2.1-1.el7.elrepo.x86_64 (ceph1)   10/04/2015  _x86_64_(16 
> CPU)
>
> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
> avgqu-sz   await r_await w_await  svctm  %util
> sda   0.09 2.069.24   36.18   527.28  1743.71   100.00
>  8.96  197.32   17.50  243.23   4.07  18.47
> sdb   0.17 3.61   16.70   74.44   949.65  2975.3086.13
>  6.74   73.95   23.94   85.16   4.31  39.32
> sdc   0.14 4.67   15.69   87.80   818.02  3860.1190.41
>  9.56   92.38   26.73  104.11   4.44  45.91
> sdd   0.17 3.437.16   69.13   480.96  2847.4287.25
>  4.80   62.89   30.00   66.30   4.33  33.00
> sde   0.01 1.130.340.99 8.3512.0130.62
>  0.017.372.649.02   1.64   0.22
> sdj   0.00 1.220.01  348.22 0.03 11302.6564.91
>  0.230.660.140.66   0.15   5.15
> sdk   0.00 1.990.01  369.94 0.03 12876.7469.61
>  0.260.710.130.71   0.16   5.75
> sdf   0.01 1.791.55   31.1239.64  1431.3790.06
>  4.07  124.67   16.25  130.05   3.11  10.17
> sdi   0.22 3.17   23.92   72.90  1386.45  2676.2883.93
>  7.75   80.00   24.31   98.27   4.31  41.77
> sdm   0.16 3.10   17.63   72.84   986.29  2767.2482.98
>  6.57   72.64   23.67   84.50   4.23  38.30
> sdl   0.11 3.01   12.10   55.14   660.85  2361.4089.89
> 17.87  265.80   21.64  319.36   4.08  27.45
> sdg   0.08 2.459.75   53.90   489.67  1929.4276.01
> 17.27  271.30   20.77  316.61   3.98  25.33
> sdh   0.10 2.76   11.28   60.97   600.10  2114.4875.14
>  1.70   23.55   22.92   23.66   4.10  29.60
>
> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
> avgqu-sz   await r_await w_await  svctm  %util
> sda   0.00 0.000.500.00   146.00 0.00   584.00
>  0.01   16.00   16.000.00  16.00   0.80
> sdb   0.00 0.509.00  119.00  2036.00  2578.0072.09
>  0.685.507.065.39   2.36  30.25
> sdc   0.00 4.00   34.00  129.00   494.00  6987.7591.80
>  1.70   10.44   17.008.72   4.44  72.40
> sdd   0.00 1.501.50   95.5074.00  2396.5050.94
>  0.858.75   23.338.52   7.53  73.05
> sde   0.0037.00   11.001.0046.00   152.0033.00
>  0.011.000.645.00   0.54   0.65
> sdj   0.00 0.500.00  970.50 0.00 12594.0025.95
>  0.090.090.000.09   0.08   8.20
> sdk   0.00 0.000.00  977.50 0.00 12016.0024.59
>  0.100.100.000.10   0.09   8.90
> sdf   0.00 0.500.50   37.50 2.00   230.2512.22
>  9.63   10.588.00   10.61   1.79   6.80
> sdi  

Re: [ceph-users] Potential OSD deadlock?

2015-10-04 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

These are Toshiba MG03ACA400 drives.

sd{a,b} are 4TB on 00:1f.2 SATA controller: Intel Corporation C600/X79
series chipset 6-Port SATA AHCI Controller (rev 05) at 3.0 Gb
sd{c,d} are 4TB on 00:1f.2 SATA controller: Intel Corporation C600/X79
series chipset 6-Port SATA AHCI Controller (rev 05) at 6.0 Gb
sde is SATADOM with OS install
sd{f..i,l,m} are 4TB on 01:00.0 Serial Attached SCSI controller: LSI
Logic / Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
sd{j,k} are 240 GB Intel SSDSC2BB240G4 on 01:00.0 Serial Attached SCSI
controller: LSI Logic / Symbios Logic SAS2308 PCI-Express Fusion-MPT
SAS-2 (rev 05)

There is probably some performance optimization that we can do in this
area, however unless I'm missing something, I don't see anything that
should cause I/O to take 30-60+ seconds to complete from a disk
standpoint.

[root@ceph1 ~]# for i in {{a..d},{f..i},{l,m}}; do echo -n "sd${i}1: "; xfs_db -c frag -r /dev/sd${i}1; done
sda1: actual 924229, ideal 414161, fragmentation factor 55.19%
sdb1: actual 1703083, ideal 655321, fragmentation factor 61.52%
sdc1: actual 2161827, ideal 746418, fragmentation factor 65.47%
sdd1: actual 1807008, ideal 654214, fragmentation factor 63.80%
sdf1: actual 735471, ideal 311837, fragmentation factor 57.60%
sdg1: actual 1463859, ideal 507362, fragmentation factor 65.34%
sdh1: actual 1684905, ideal 556571, fragmentation factor 66.97%
sdi1: actual 1833980, ideal 608499, fragmentation factor 66.82%
sdl1: actual 1641128, ideal 554364, fragmentation factor 66.22%
sdm1: actual 2032644, ideal 697129, fragmentation factor 65.70%


[root@ceph1 ~]# iostat -xd 2
Linux 4.2.1-1.el7.elrepo.x86_64 (ceph1)   10/04/2015   _x86_64_   (16 CPU)

Device:  rrqm/s  wrqm/s    r/s     w/s    rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda        0.09    2.06   9.24   36.18   527.28   1743.71   100.00     8.96  197.32   17.50  243.23   4.07  18.47
sdb        0.17    3.61  16.70   74.44   949.65   2975.30    86.13     6.74   73.95   23.94   85.16   4.31  39.32
sdc        0.14    4.67  15.69   87.80   818.02   3860.11    90.41     9.56   92.38   26.73  104.11   4.44  45.91
sdd        0.17    3.43   7.16   69.13   480.96   2847.42    87.25     4.80   62.89   30.00   66.30   4.33  33.00
sde        0.01    1.13   0.34    0.99     8.35     12.01    30.62     0.01    7.37    2.64    9.02   1.64   0.22
sdj        0.00    1.22   0.01  348.22     0.03  11302.65    64.91     0.23    0.66    0.14    0.66   0.15   5.15
sdk        0.00    1.99   0.01  369.94     0.03  12876.74    69.61     0.26    0.71    0.13    0.71   0.16   5.75
sdf        0.01    1.79   1.55   31.12    39.64   1431.37    90.06     4.07  124.67   16.25  130.05   3.11  10.17
sdi        0.22    3.17  23.92   72.90  1386.45   2676.28    83.93     7.75   80.00   24.31   98.27   4.31  41.77
sdm        0.16    3.10  17.63   72.84   986.29   2767.24    82.98     6.57   72.64   23.67   84.50   4.23  38.30
sdl        0.11    3.01  12.10   55.14   660.85   2361.40    89.89    17.87  265.80   21.64  319.36   4.08  27.45
sdg        0.08    2.45   9.75   53.90   489.67   1929.42    76.01    17.27  271.30   20.77  316.61   3.98  25.33
sdh        0.10    2.76  11.28   60.97   600.10   2114.48    75.14     1.70   23.55   22.92   23.66   4.10  29.60

Device:  rrqm/s  wrqm/s    r/s     w/s    rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda        0.00    0.00   0.50    0.00   146.00      0.00   584.00     0.01   16.00   16.00    0.00  16.00   0.80
sdb        0.00    0.50   9.00  119.00  2036.00   2578.00    72.09     0.68    5.50    7.06    5.39   2.36  30.25
sdc        0.00    4.00  34.00  129.00   494.00   6987.75    91.80     1.70   10.44   17.00    8.72   4.44  72.40
sdd        0.00    1.50   1.50   95.50    74.00   2396.50    50.94     0.85    8.75   23.33    8.52   7.53  73.05
sde        0.00   37.00  11.00    1.00    46.00    152.00    33.00     0.01    1.00    0.64    5.00   0.54   0.65
sdj        0.00    0.50   0.00  970.50     0.00  12594.00    25.95     0.09    0.09    0.00    0.09   0.08   8.20
sdk        0.00    0.00   0.00  977.50     0.00  12016.00    24.59     0.10    0.10    0.00    0.10   0.09   8.90
sdf        0.00    0.50   0.50   37.50     2.00    230.25    12.22     9.63   10.58    8.00   10.61   1.79   6.80
sdi        2.00    0.00  10.50    0.00  2528.00      0.00   481.52     0.10    9.33    9.33    0.00   7.76   8.15
sdm        0.00    0.50  15.00  116.00   546.00    833.25    21.06     0.94    7.17   14.03    6.28   4.13  54.15
sdl        0.00    0.00   3.00    0.00    26.00      0.00    17.33     0.02    7.50    7.50    0.00   7.50   2.25
sdg        0.00    3.50   1.00   64.50     4.00   2929.25    89.56     0.40    6.04    9.00    5.99   3.42  22.40
sdh   0.50 

Re: [ceph-users] Potential OSD deadlock?

2015-10-04 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

I have eight nodes running the fio job rbd_test_real to different RBD
volumes. I've included the CRUSH map in the tarball.

I stopped one OSD process and marked it out. I let it recover for a
few minutes and then I started the process again and marked it in. I
started getting blocked I/O messages during the recovery.

The logs are located at http://162.144.87.113/files/ushou1.tar.xz

Thanks,
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.2.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWEZRcCRDmVDuy+mK58QAALbEQAK5pFiixJarUdLm50zp/
3AGgGBPrieExKmoZZLCoMGfOLfxZDbN2ybtopKDQDfrTqndE/6Xi9UXqTOdW
jDc9U1wusgG0CKPsY1SMYnB9akvaDwtdh5q5k4VpN2zsG9R6lRojHeNQR3Nf
56QevJL4/e5lC3sLhVnxXXi2XKnHCVOHT+PYgNour2ZWt6OTLoFFxuSU3zLN
OtfXgrFiiNF0mrDpm0gg2l8a8N5SwP9mM233S2U/JiGAqsqoqkfd0okjDenC
ksesU/n7zordFpfLN3yjL6+X9pQ4YA6otZrq4wWtjWKO/H0b+6iIsf/AE131
R6a4Vufndpd3Ce+FNfM+iu3FmKk0KVfDAaF/tIP6S6XUzGVMAbpvpmqNL17o
boh3wPZEyK+7KiF4Qlt2KoI/FV24Yj8XiyMnKin3MbMYbammb4ER977VH7iI
sZyelNPSsYmmw/MF+AkA5KVgzQ4DAPflaejIgC5uw3dYKrn2AQE5CE9nN8Gz
GVVaGItu1Bvrz21QoT9o5v0dZ85zttFvtrKIYgSi4mdpC6XkzUbg9s9EB1/T
SEY+fau7W7TtiLpzCAIQ3zDvgsvkx2P6tKg5U8e93LVv9B+YI8i8mUxxv1j5
PHFi7KTgRUPm1FPMJDSyzvOgqyMj9AzaESl1Na6k529ILFIcyfko0niTT1oZ
3EPx
=UDIV
-END PGP SIGNATURE-


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Sun, Oct 4, 2015 at 7:48 AM, Sage Weil  wrote:
> On Sat, 3 Oct 2015, Robert LeBlanc wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> We are still struggling with this and have tried a lot of different
>> things. Unfortunately, Inktank (now Red Hat) no longer provides
>> consulting services for non-Red Hat systems. If there are some
>> certified Ceph consultants in the US that we can do both remote and
>> on-site engagements, please let us know.
>>
>> This certainly seems to be network related, but somewhere in the
>> kernel. We have tried increasing the network and TCP buffers, number
>> of TCP sockets, reduced the FIN_WAIT2 state. There is about 25% idle
>> on the boxes, the disks are busy, but not constantly at 100% (they
>> cycle from <10% up to 100%, but not 100% for more than a few seconds
>> at a time). There seems to be no reasonable explanation why I/O is
>> blocked pretty frequently longer than 30 seconds. We have verified
>> Jumbo frames by pinging from/to each node with 9000 byte packets. The
>> network admins have verified that packets are not being dropped in the
>> switches for these nodes. We have tried different kernels including
>> the recent Google patch to cubic. This is showing up on three cluster
>> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
>> (from CentOS 7.1) with similar results.
>>
>> The messages seem slightly different:
>> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
>> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
>> 100.087155 secs
>> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
>> cluster [WRN] slow request 30.041999 seconds old, received at
>> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
>> rbd_data.13fdcb2ae8944a.0001264f [read 975360~4096]
>> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
>> points reached
>>
>> I don't know what "no flag points reached" means.
>
> Just that the op hasn't been marked as reaching any interesting points
> (op->mark_*() calls).
>
> Is it possible to gather a lot with debug ms = 20 and debug osd = 20?
> It's extremely verbose but it'll let us see where the op is getting
> blocked.  If you see the "slow request" message it means the op in
> received by ceph (that's when the clock starts), so I suspect it's not
> something we can blame on the network stack.
>
> sage
>
>
>>
>> The problem is most pronounced when we have to reboot an OSD node (1
>> of 13), we will have hundreds of I/O blocked for some times up to 300
>> seconds. It takes a good 15 minutes for things to settle down. The
>> production cluster is very busy doing normally 8,000 I/O and peaking
>> at 15,000. This is all 4TB spindles with SSD journals and the disks
>> are between 25-50% full. We are currently splitting PGs to distribute
>> the load better across the disks, but we are having to do this 10 PGs
>> at a time as we get blocked I/O. We have max_backfills and
>> max_recovery set to 1, client op priority is set higher than recovery
>> priority. We tried increasing the number of op threads but this didn't
>> seem to help. It seems as soon as PGs are finished being checked, they
>> become active and could be the cause for slow I/O while the other PGs
>> are being checked.
>>
>> What I don't understand is that the messages are delayed. As soon as
>> the message is received by Ceph OSD process, it is very quickly
>> committed to the journal and a response is sent back to the primary
>> OSD which is received very quickly as well. 
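
For anyone wanting to act on Sage's suggestion above, a minimal sketch of raising the OSD debug levels at runtime and putting them back afterwards (this assumes a Hammer-era ceph CLI with admin access; the 0/5 values are the usual defaults, so adjust if yours differ):

# raise messenger and OSD debugging on every OSD
ceph tell osd.* injectargs '--debug-ms 20 --debug-osd 20'
# ... reproduce a slow request, then collect /var/log/ceph/ceph-osd.*.log ...
# drop the levels back down afterwards
ceph tell osd.* injectargs '--debug-ms 0/5 --debug-osd 0/5'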

Re: [ceph-users] Potential OSD deadlock?

2015-10-03 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

We are still struggling with this and have tried a lot of different
things. Unfortunately, Inktank (now Red Hat) no longer provides
consulting services for non-Red Hat systems. If there are any
certified Ceph consultants in the US with whom we can do both remote
and on-site engagements, please let us know.

This certainly seems to be network related, but somewhere in the
kernel. We have tried increasing the network and TCP buffers and the
number of TCP sockets, and reducing the FIN_WAIT2 timeout. There is
about 25% idle on the boxes; the disks are busy, but not constantly at
100% (they cycle from <10% up to 100%, but not 100% for more than a
few seconds at a time). There seems to be no reasonable explanation
for why I/O is blocked quite frequently for longer than 30 seconds. We
have verified jumbo frames by pinging from/to each node with 9000-byte
packets (a sketch of such a check is below). The network admins have
verified that packets are not being dropped in the switches for these
nodes. We have tried different kernels, including the recent Google
patch to CUBIC. This is showing up on three clusters (two Ethernet and
one IPoIB). I booted one cluster into Debian Jessie (from CentOS 7.1)
with similar results.
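
For reference, a don't-fragment ping sized just under the MTU is a quick way to make that jumbo-frame check reproducible (a sketch; 8972 = 9000 minus 28 bytes of IP/ICMP headers, and the peer name is a placeholder):

ping -M do -s 8972 -c 5 <peer-node>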

The messages seem slightly different:
2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
100.087155 secs
2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
cluster [WRN] slow request 30.041999 seconds old, received at
2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
rbd_data.13fdcb2ae8944a.0001264f [read 975360~4096]
11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
points reached

I don't know what "no flag points reached" means.

The problem is most pronounced when we have to reboot an OSD node (1
of 13): we will have hundreds of I/Os blocked, sometimes for up to 300
seconds, and it takes a good 15 minutes for things to settle down. The
production cluster is very busy, normally doing 8,000 IOPS and peaking
at 15,000. This is all 4TB spindles with SSD journals, and the disks
are between 25-50% full. We are currently splitting PGs to distribute
the load better across the disks, but we are having to do this 10 PGs
at a time because we get blocked I/O. We have max_backfills and
max_recovery set to 1, and client op priority is set higher than
recovery priority (the corresponding settings are sketched below). We
tried increasing the number of op threads but this didn't seem to
help. It seems that as soon as PGs are finished being checked, they
become active, which could be the cause of slow I/O while the other
PGs are being checked.
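
For reference, the throttling described above roughly corresponds to ceph.conf settings like the following (a sketch; the option names are the Hammer-era ones, "max_recovery" is read here as osd recovery max active, and the priority values shown are common choices rather than values confirmed in this thread):

[osd]
osd max backfills = 1
osd recovery max active = 1
osd client op priority = 63
osd recovery op priority = 1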

What I don't understand is why the messages are delayed. As soon as
the message is received by the Ceph OSD process, it is very quickly
committed to the journal and a response is sent back to the primary
OSD, which is received very quickly as well. I've adjusted
min_free_kbytes and it seems to keep the OSDs from crashing, but it
doesn't solve the main problem. We don't have swap, and there is 64 GB
of RAM per node for 10 OSDs.
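
For reference, a sketch of the min_free_kbytes adjustment (the value below is only an example; the thread does not say what was actually used):

# check the current reserve
sysctl vm.min_free_kbytes
# raise it, e.g. to 512 MB, and mirror the change in /etc/sysctl.conf to persist it
sysctl -w vm.min_free_kbytes=524288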

Is there something that could cause the kernel to receive a packet
but not be able to dispatch it to Ceph, which could explain why we are
seeing this blocked I/O for 30+ seconds? Are there any pointers to
tracing Ceph messages from the network buffer through the kernel to
the Ceph process?

We could really use some pointers, no matter how outrageous. We've
had over six people looking into this for weeks now and just can't
think of anything else.

Thanks,
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.1.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWEDY1CRDmVDuy+mK58QAARgoP/RcoL1qVmg7qbQrzStar
NK80bqYGeYHb26xHbt1fZVgnZhXU0nN0Dv4ew0e/cYJLELSO2KCeXNfXN6F1
prZuzYagYEyj1Q1TOo+4h/nOQRYsTwQDdFzbHb/OUDN55C0QGZ29DjEvrqP6
K5l6sAQzvQDpUEEIiOCkS6pH59ira740nSmnYkEWhr1lxF/hMjb6fFlfCFe2
h1djM0GfY7vBHFGgI3jkw0BL5AQnWe+SCcCiKZmxY6xiR70FWl3XqK5M+nxm
iq74y7Dv6cpenit6boMr6qtOeIt+8ko85hVMh09Hkaqz/m2FzxAKLcahzkGF
Fh/M6YBzgnX7QBURTC4YQT/FVyDTW3JMuT3RKQdaX6c0iiOsVdkE+iyidWyY
Hr1KzWU23Ur9yBfZ39Y43jrsSiAEwHnKjSqMowSGljdTysNEAAZQhlqZIoHb
JlgpB39ugkHI1H5fZ5b2SIDz32/d5ywG4Gay9Rk6hp8VanvIrBbev+JYEoYT
8/WX+fhueHt4dqUYWIl3HZ0CEzbXbug0xmFvhrbmL2f3t9XOkDZRbAjlYrGm
lswiJMDueY8JkxSnPvCQrHXqjbCcy9rMG7nTnLFz98rTcHNCwtpv0qVYhheg
4YRNRVMbfNP/6xsJvG1wVOSQPwxZSPqJh42pDqMRePJl3Zn66MTx5wvdNDpk
l7OF
=OI++
-END PGP SIGNATURE-

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc  wrote:
> We dropped the replication on our cluster from 4 to 3 and it looks
> like all the blocked I/O has stopped (no entries in the log for the
> last 12 hours). This makes me believe that there is some issue with
> the number of sockets or some other TCP issue. We have not messed with
> Ephemeral ports and TIME_WAIT at this point. There are 130 OSDs, 8 KVM
> hosts hosting about 150 VMs. Open files is set at 32K for the OSD
> processes 

Re: [ceph-users] Potential OSD deadlock?

2015-09-25 Thread Robert LeBlanc
We dropped the replication on our cluster from 4 to 3 and it looks
like all the blocked I/O has stopped (no entries in the log for the
last 12 hours). This makes me believe that there is some issue with
the number of sockets or some other TCP issue. We have not messed with
ephemeral ports and TIME_WAIT at this point. There are 130 OSDs and 8
KVM hosts hosting about 150 VMs. The open files limit is set at 32K
for the OSD processes and 16K system-wide.

Does this seem like the right spot to be looking? What are some
configuration items we should be looking at?
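
A few generic commands that may help answer that question (a sketch; nothing here is specific to this cluster):

# socket state summary -- look for large TIME-WAIT / orphan counts
ss -s
# ephemeral port range and FIN/orphan related tunables
sysctl net.ipv4.ip_local_port_range net.ipv4.tcp_fin_timeout net.ipv4.tcp_max_orphans
# effective open-file limit of a running OSD process
grep 'open files' /proc/$(pidof -s ceph-osd)/limits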

Thanks,

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc  wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> We were able to only get ~17Gb out of the XL710 (heavily tweaked)
> until we went to the 4.x kernel where we got ~36Gb (no tweaking). It
> seems that there were some major reworks in the network handling in
> the kernel to efficiently handle that network rate. If I remember
> right we also saw a drop in CPU utilization. I'm starting to think
> that we did see packet loss while congesting our ISLs in our initial
> testing, but we could not tell where the dropping was happening. We
> saw some on the switches, but it didn't seem to be bad if we weren't
> trying to congest things. We probably already saw this issue, just
> didn't know it.
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
>> FWIW, we've got some 40GbE Intel cards in the community performance cluster
>> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
>> drivers might cause problems though.
>>
>> Here's ifconfig from one of the nodes:
>>
>> ens513f1: flags=4163  mtu 1500
>> inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
>> inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20
>> ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
>> RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
>> RX errors 0  dropped 0  overruns 0  frame 0
>> TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
>> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>
>> Mark
>>
>>
>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
>>>
>>> -BEGIN PGP SIGNED MESSAGE-
>>> Hash: SHA256
>>>
>>> OK, here is the update on the saga...
>>>
>>> I traced some more of blocked I/Os and it seems that communication
>>> between two hosts seemed worse than others. I did a two way ping flood
>>> between the two hosts using max packet sizes (1500). After 1.5M
>>> packets, no lost pings. Then then had the ping flood running while I
>>> put Ceph load on the cluster and the dropped pings started increasing
>>> after stopping the Ceph workload the pings stopped dropping.
>>>
>>> I then ran iperf between all the nodes with the same results, so that
>>> ruled out Ceph to a large degree. I then booted in the the
>>> 3.10.0-229.14.1.el7.x86_64 kernel and with an hour test so far there
>>> hasn't been any dropped pings or blocked I/O. Our 40 Gb NICs really
>>> need the network enhancements in the 4.x series to work well.
>>>
>>> Does this sound familiar to anyone? I'll probably start bisecting the
>>> kernel to see where this issue in introduced. Both of the clusters
>>> with this issue are running 4.x, other than that, they are pretty
>>> differing hardware and network configs.
>>>
>>> Thanks,
>>> -BEGIN PGP SIGNATURE-
>>> Version: Mailvelope v1.1.0
>>> Comment: https://www.mailvelope.com
>>>
>>> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
>>> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
>>> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
>>> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
>>> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
>>> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
>>> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
>>> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
>>> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
>>> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
>>> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
>>> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
>>> 4OEo
>>> =P33I
>>> -END PGP SIGNATURE-
>>> 
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>
>>>
>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
>>> wrote:

 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA256

 This is IPoIB and we have the MTU set to 64K. There was some issues
 pinging hosts with "No buffer space available" (hosts are 

Re: [ceph-users] Potential OSD deadlock?

2015-09-23 Thread Mark Nelson
FWIW, we've got some 40GbE Intel cards in the community performance 
cluster on a Mellanox 40GbE switch that appear (knock on wood) to be 
running fine with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from 
Intel that older drivers might cause problems though.


Here's ifconfig from one of the nodes:

ens513f1: flags=4163  mtu 1500
inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20
ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
RX errors 0  dropped 0  overruns 0  frame 0
TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Mark

On 09/23/2015 01:48 PM, Robert LeBlanc wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

OK, here is the update on the saga...

I traced some more of blocked I/Os and it seems that communication
between two hosts seemed worse than others. I did a two way ping flood
between the two hosts using max packet sizes (1500). After 1.5M
packets, no lost pings. Then then had the ping flood running while I
put Ceph load on the cluster and the dropped pings started increasing
after stopping the Ceph workload the pings stopped dropping.

I then ran iperf between all the nodes with the same results, so that
ruled out Ceph to a large degree. I then booted in the the
3.10.0-229.14.1.el7.x86_64 kernel and with an hour test so far there
hasn't been any dropped pings or blocked I/O. Our 40 Gb NICs really
need the network enhancements in the 4.x series to work well.

Does this sound familiar to anyone? I'll probably start bisecting the
kernel to see where this issue in introduced. Both of the clusters
with this issue are running 4.x, other than that, they are pretty
differing hardware and network configs.

Thanks,
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.1.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
/XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
4OEo
=P33I
-END PGP SIGNATURE-

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc  wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

This is IPoIB and we have the MTU set to 64K. There was some issues
pinging hosts with "No buffer space available" (hosts are currently
configured for 4GB to test SSD caching rather than page cache). I
found that MTU under 32K worked reliable for ping, but still had the
blocked I/O.

I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
the blocked I/O.
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:

On Tue, 22 Sep 2015, Samuel Just wrote:

I looked at the logs, it looks like there was a 53 second delay
between when osd.17 started sending the osd_repop message and when
osd.13 started reading it, which is pretty weird.  Sage, didn't we
once see a kernel issue which caused some messages to be mysteriously
delayed for many 10s of seconds?


Every time we have seen this behavior and diagnosed it in the wild it has
been a network misconfiguration.  Usually related to jumbo frames.

sage




What kernel are you running?
-Sam

On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

OK, looping in ceph-devel to see if I can get some more eyes. I've
extracted what I think are important entries from the logs for the
first blocked request. NTP is running all the servers so the logs
should be close in terms of time. Logs for 12:50 to 13:00 are
available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz

2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
2015-09-22 

Re: [ceph-users] Potential OSD deadlock?

2015-09-23 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

We were able to only get ~17Gb out of the XL710 (heavily tweaked)
until we went to the 4.x kernel where we got ~36Gb (no tweaking). It
seems that there were some major reworks in the network handling in
the kernel to efficiently handle that network rate. If I remember
right we also saw a drop in CPU utilization. I'm starting to think
that we did see packet loss while congesting our ISLs in our initial
testing, but we could not tell where the dropping was happening. We
saw some on the switches, but it didn't seem to be bad if we weren't
trying to congest things. We probably already saw this issue, just
didn't know it.
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
> FWIW, we've got some 40GbE Intel cards in the community performance cluster
> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
> drivers might cause problems though.
>
> Here's ifconfig from one of the nodes:
>
> ens513f1: flags=4163  mtu 1500
> inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
> inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20
> ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
> RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
> RX errors 0  dropped 0  overruns 0  frame 0
> TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>
> Mark
>
>
> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
>>
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> OK, here is the update on the saga...
>>
>> I traced some more of blocked I/Os and it seems that communication
>> between two hosts seemed worse than others. I did a two way ping flood
>> between the two hosts using max packet sizes (1500). After 1.5M
>> packets, no lost pings. Then then had the ping flood running while I
>> put Ceph load on the cluster and the dropped pings started increasing
>> after stopping the Ceph workload the pings stopped dropping.
>>
>> I then ran iperf between all the nodes with the same results, so that
>> ruled out Ceph to a large degree. I then booted in the the
>> 3.10.0-229.14.1.el7.x86_64 kernel and with an hour test so far there
>> hasn't been any dropped pings or blocked I/O. Our 40 Gb NICs really
>> need the network enhancements in the 4.x series to work well.
>>
>> Does this sound familiar to anyone? I'll probably start bisecting the
>> kernel to see where this issue in introduced. Both of the clusters
>> with this issue are running 4.x, other than that, they are pretty
>> differing hardware and network configs.
>>
>> Thanks,
>> -BEGIN PGP SIGNATURE-
>> Version: Mailvelope v1.1.0
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
>> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
>> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
>> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
>> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
>> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
>> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
>> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
>> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
>> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
>> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
>> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
>> 4OEo
>> =P33I
>> -END PGP SIGNATURE-
>> 
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
>> wrote:
>>>
>>> -BEGIN PGP SIGNED MESSAGE-
>>> Hash: SHA256
>>>
>>> This is IPoIB and we have the MTU set to 64K. There was some issues
>>> pinging hosts with "No buffer space available" (hosts are currently
>>> configured for 4GB to test SSD caching rather than page cache). I
>>> found that MTU under 32K worked reliable for ping, but still had the
>>> blocked I/O.
>>>
>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
>>> the blocked I/O.
>>> - 
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>
>>>
>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:

 On Tue, 22 Sep 2015, Samuel Just wrote:
>
> I looked at the logs, it looks like there was a 53 second delay
> between when osd.17 started sending the osd_repop message and when
> osd.13 started reading it, which is pretty weird.  Sage, didn't we
> once see a kernel issue which caused some messages to be mysteriously
> delayed for many 10s of seconds?


 Every 

Re: [ceph-users] Potential OSD deadlock?

2015-09-23 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

OK, here is the update on the saga...

I traced some more of the blocked I/Os and it seems that
communication between two hosts was worse than between others. I did
a two-way ping flood between the two hosts using the max packet size
(1500). After 1.5M packets, no lost pings. I then had the ping flood
running while I put Ceph load on the cluster, and the dropped pings
started increasing; after stopping the Ceph workload the pings
stopped dropping.
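
For reference, the two-way flood test described above can be reproduced with something like the following, run as root on each host toward the other (1472 = 1500 minus 28 bytes of IP/ICMP headers; the hostname is a placeholder):

ping -f -s 1472 <other-host>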

I then ran iperf between all the nodes with the same results, so that
ruled out Ceph to a large degree. I then booted into the
3.10.0-229.14.1.el7.x86_64 kernel, and with an hour of testing so far
there haven't been any dropped pings or blocked I/O. Our 40 Gb NICs
really need the network enhancements in the 4.x series to work well.

Does this sound familiar to anyone? I'll probably start bisecting the
kernel to see where this issue is introduced. Both of the clusters
with this issue are running 4.x; other than that, they have pretty
different hardware and network configs.
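
A minimal sketch of that kernel bisect, assuming upstream tags (the EL7 3.10 kernel is heavily patched, so the exact good/bad versions here are assumptions) and omitting the build/boot/test cycle between steps:

git bisect start
git bisect bad v4.2      # a kernel that shows the blocked I/O
git bisect good v3.10    # a kernel that does not
# build, boot and test the suggested commit, then mark it and repeat:
git bisect good          # or: git bisect bad
git bisect reset         # when finished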

Thanks,
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.1.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
/XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
4OEo
=P33I
-END PGP SIGNATURE-

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc  wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> This is IPoIB and we have the MTU set to 64K. There was some issues
> pinging hosts with "No buffer space available" (hosts are currently
> configured for 4GB to test SSD caching rather than page cache). I
> found that MTU under 32K worked reliable for ping, but still had the
> blocked I/O.
>
> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
> the blocked I/O.
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
>> On Tue, 22 Sep 2015, Samuel Just wrote:
>>> I looked at the logs, it looks like there was a 53 second delay
>>> between when osd.17 started sending the osd_repop message and when
>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
>>> once see a kernel issue which caused some messages to be mysteriously
>>> delayed for many 10s of seconds?
>>
>> Every time we have seen this behavior and diagnosed it in the wild it has
>> been a network misconfiguration.  Usually related to jumbo frames.
>>
>> sage
>>
>>
>>>
>>> What kernel are you running?
>>> -Sam
>>>
>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
>>> > -BEGIN PGP SIGNED MESSAGE-
>>> > Hash: SHA256
>>> >
>>> > OK, looping in ceph-devel to see if I can get some more eyes. I've
>>> > extracted what I think are important entries from the logs for the
>>> > first blocked request. NTP is running all the servers so the logs
>>> > should be close in terms of time. Logs for 12:50 to 13:00 are
>>> > available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>>> >
>>> > 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
>>> > 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
>>> > 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
>>> > 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
>>> > 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
>>> > 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
>>> > 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
>>> > 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
>>> > 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
>>> > 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
>>> >
>>> > In the logs I can see that osd.17 dispatches the I/O to osd.13 and
>>> > osd.16 almost silmutaniously. osd.16 seems to get the I/O right away,
>>> > but for some reason osd.13 doesn't get the message until 53 seconds
>>> > later. osd.17 seems happy to just wait and doesn't resend the data
>>> > (well, I'm not 100% sure how to tell which entries are the actual data
>>> > transfer).
>>> >
>>> > It looks like osd.17 is receiving responses to start the communication
>>> > with osd.13, but the op 

Re: [ceph-users] Potential OSD deadlock?

2015-09-22 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

I'm starting to wonder if this has to do with some OSDs getting full
or the 0.94.3 code. Earlier this afternoon, I cleared out my test
cluster so there were no pools. I created a new rbd pool and started
filling it with six 1TB fio jobs at replication 3, with six spindles
over six servers. It was running 0.94.2 at the time. After several
hours of writes, we had the new patched 0.93.3 binaries ready for
testing, so I rolled the update onto the test cluster while the fio
jobs were running.
There were a few blocked I/Os as the services were restarted (nothing
I'm concerned about). Now that the OSDs are about 60% full, the
blocked I/O is becoming very frequent even with the backports. The
write bandwidth was consistently at 200 MB/s until this point; now it
is fluctuating between 200 MB/s and 75 MB/s, mostly hovering around
100 MB/s. Our production cluster uses XFS on the OSDs; this test
cluster is EXT4.
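
For reference, the fullness numbers quoted above can be checked with something like this (the mount path assumes a default ceph-disk style layout):

ceph df
df -h /var/lib/ceph/osd/ceph-*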

I'll see if I can go back to 0.94.2 and fill the cluster up again.
Going back to 0.94.2 and 0.94.0 still has the issue (although I didn't
refill the cluster; I didn't delete what was already there). I'm
building the latest hammer-backports now to see if it resolves the
issue.
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.1.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWAPioCRDmVDuy+mK58QAAOIwP/3D86CWYlgozKBNlsuIv
AT30S7ZrqDZmxygaJQ9PZgSyQlgQuXpDLL4CnVtbUNd+dgz91i7CVecVGj3h
/jrFwrH063yPD1r3nMmSdc2GTTIahH1JhvzpWqcP9pkmuGHoYlWqteYnosfn
ptOjJI57AFw/goxcJLUExLfdp+L/3GkHNoMMKtJXZX7OIEWdkMj1f9jBGEK6
tJ3AGbbpL6eZGB/KFDObHwCEjfwouTkRk0wNh0luDAU9QlBokmcKS134Ht2C
kRtggOMlXxOKaQiXKZHZL7TUEgvlwldpS01rgDLnNOn3AHZMiAoaC2noFDDS
48ZnbkJgdqpMX2nMFcbwh4zdWOmRRcFqNXuA/t4m0UrZwRCWlSwcVPxDqbHr
00kjDMFtlbov1NWfDXfcMF32qSdsfVaDAwjCmMct1IEn3EXYKYeYA8GUePia
+A9FvUezeYSELWxk59Hirk69A39wNsA40lrMbFzIOkp8CLLuKiHSKs8dTFtJ
CaIPMwZDElcKJDKXPEMu260/GIcJmERUZXPayIQp2Attgx3/gvDpU3crWN7C
49dqnPOVqm6+f+ciUBVwIgQ7Xbbqom+yc1jxlvmpMW1C5iu9vjH/mvO42N/c
e+R0/SgCJnDQU4tYppYadA8vKA/e9JyjMfBlbTW0urxHQlkNqohFY9G+edLW
Zkxf
=kYQ2
-END PGP SIGNATURE-

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Sep 21, 2015 at 4:33 PM, Gregory Farnum  wrote:
> On Mon, Sep 21, 2015 at 3:14 PM, Robert LeBlanc  wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> In my lab cluster I can saturate the disks and I'm not seeing any of
>> the blocked I/Os from the Ceph side, although the client shows that
>> I/O stops for a while. I'm not convinced that it is load related.
>>
>> I was looking through the logs using the technique you described as
>> well as looking for the associated PG. There is a lot of data to go
>> through and it is taking me some time.
>>
>> We are rolling some of the backports for 0.94.4 into a build, one for
>> the PG split problem, and 5 others that might help. One that I'm
>> really hopeful about is http://tracker.ceph.com/issues/12843, but I'm
>> not sure the messages we are seeing are exactly related. We are
>> planning to roll the new binaries tomorrow night. I'll update this
>> thread after the new code has been rolled.
>
> Ah, yep, I didn't realize we still had any of those in hammer. That
> bug is indeed a good bet for what you're seeing.
> -Greg
>
>>
>> Thanks,
>> -BEGIN PGP SIGNATURE-
>> Version: Mailvelope v1.1.0
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJWAIE9CRDmVDuy+mK58QAAyS8P/21+0Y+QhsByqgu/bTiS
>> 3dG6hNMyElXFyuWXievqqvyvaak7Y/nkVhC+oII1glujWFRRTL+61K4Qq8oo
>> abFBtFVSRkkQpg0BCuHH0LsbXwyK7bmiSTZted2/XzZfJdcuQcDCVXZ0K3En
>> LLWn0PvDj7OBnLexAAKAMF91a8gCnjuKq3AJnEYxQBeI/Fv58cpfERAiYa+W
>> Fl6jBKPboJr8sgbQ87k6hu4aLuHGepliFJlUO3XPTvuD4WQ6Ak1HAD+KtmXd
>> i8GYOZK9ukMQs8YavO8GqVAiZvUcuIGHVf502fP0v+7SR/s/9OY6Loo00/kK
>> QdG0+mgV0o60AZ4r/setlsd7Uo3l9u4ra9n3D2RUtSJZRvcBK2HweeMiit4u
>> FgA5dcx0lRFd6IluxZstgZlQiyxggIWHUgoQYFashtNWu/bl8bXn+gzK0GxO
>> mWZqaeKBMauBWwLADIX1Q+VYBSvZWqFCfKGUawQ4bRnyz7zlHXQANlL1t7iF
>> /QakoriydMW3l2WPftk4kDt4egFGhxxrCRZfA0TnVNx1DOLE9vRBKXKgTr0j
>> miB0Ca9v9DQzVnTWhPCTfb8UdEHzozMTMEv30V3nskafPolsRJmjO04C1K7e
>> 61R+cawG02J0RQqFMMNj3X2Gnbp/CC6JzUpQ5JPvNrvO34lcTYBWkdfwtolg
>> 9ExB
>> =hAcJ
>> -END PGP SIGNATURE-
>> 
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Mon, Sep 21, 2015 at 4:00 PM, Gregory Farnum  wrote:
>>> So it sounds like you've got two different things here:
>>> 1) You get a lot of slow operations that show up as warnings.
>>>
>>> 2) Rarely, you get blocked op warnings that don't seem to go away
>>> until the cluster state changes somehow.
>>>
>>> (2) is the interesting one. Since you say the cluster is under heavy
>>> load, I presume (1) is just you overloading your servers and getting
>>> some hot spots that take time to clear up.
>>>
>>> There have been bugs in the past where slow op warnings weren't
>>> getting 

Re: [ceph-users] Potential OSD deadlock?

2015-09-22 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

This is IPoIB and we have the MTU set to 64K. There were some issues
pinging hosts with "No buffer space available" (hosts are currently
configured for 4GB to test SSD caching rather than page cache). I
found that an MTU under 32K worked reliably for ping, but we still had
the blocked I/O.

I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
the blocked I/O.
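
For reference, the IPoIB mode and MTU can be inspected and lowered with something like the following (the interface name ib0 is an assumption):

cat /sys/class/net/ib0/mode      # 'connected' allows the ~64K MTU, 'datagram' does not
ip link show ib0 | grep mtu
ip link set ib0 mtu 1500         # the value tested above
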
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
> On Tue, 22 Sep 2015, Samuel Just wrote:
>> I looked at the logs, it looks like there was a 53 second delay
>> between when osd.17 started sending the osd_repop message and when
>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
>> once see a kernel issue which caused some messages to be mysteriously
>> delayed for many 10s of seconds?
>
> Every time we have seen this behavior and diagnosed it in the wild it has
> been a network misconfiguration.  Usually related to jumbo frames.
>
> sage
>
>
>>
>> What kernel are you running?
>> -Sam
>>
>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
>> > -BEGIN PGP SIGNED MESSAGE-
>> > Hash: SHA256
>> >
>> > OK, looping in ceph-devel to see if I can get some more eyes. I've
>> > extracted what I think are important entries from the logs for the
>> > first blocked request. NTP is running all the servers so the logs
>> > should be close in terms of time. Logs for 12:50 to 13:00 are
>> > available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>> >
>> > 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
>> > 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
>> > 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
>> > 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
>> > 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
>> > 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
>> > 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
>> > 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
>> > 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
>> > 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
>> >
>> > In the logs I can see that osd.17 dispatches the I/O to osd.13 and
>> > osd.16 almost silmutaniously. osd.16 seems to get the I/O right away,
>> > but for some reason osd.13 doesn't get the message until 53 seconds
>> > later. osd.17 seems happy to just wait and doesn't resend the data
>> > (well, I'm not 100% sure how to tell which entries are the actual data
>> > transfer).
>> >
>> > It looks like osd.17 is receiving responses to start the communication
>> > with osd.13, but the op is not acknowledged until almost a minute
>> > later. To me it seems that the message is getting received but not
>> > passed to another thread right away or something. This test was done
>> > with an idle cluster, a single fio client (rbd engine) with a single
>> > thread.
>> >
>> > The OSD servers are almost 100% idle during these blocked I/O
>> > requests. I think I'm at the end of my troubleshooting, so I can use
>> > some help.
>> >
>> > Single Test started about
>> > 2015-09-22 12:52:36
>> >
>> > 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
>> > cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>> > 30.439150 secs
>> > 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
>> > cluster [WRN] slow request 30.439150 seconds old, received at
>> > 2015-09-22 12:55:06.487451:
>> >  osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0545
>> > [set-alloc-hint object_size 4194304 write_size 4194304,write
>> > 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
>> >  currently waiting for subops from 13,16
>> > 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
>> > [WRN] 2 slow requests, 2 included below; oldest blocked for >
>> > 30.379680 secs
>> > 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
>> > [WRN] slow request 30.291520 seconds old, received at 2015-09-22
>> > 12:55:06.406303:
>> >  osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0541
>> > [set-alloc-hint object_size 4194304 write_size 4194304,write
>> > 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
>> >  currently waiting for subops from 13,17
>> > 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
>> > [WRN] slow request 30.379680 seconds old, received at 2015-09-22
>> > 12:55:06.318144:
>> >  osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.053f
>> > [set-alloc-hint object_size 4194304 write_size 4194304,write
>> > 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
>> >  currently waiting for subops from 13,14
>> > 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
>> > cluster 

Re: [ceph-users] Potential OSD deadlock?

2015-09-22 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

OK, looping in ceph-devel to see if I can get some more eyes. I've
extracted what I think are important entries from the logs for the
first blocked request. NTP is running on all the servers, so the logs
should be close in terms of time. Logs for 12:50 to 13:00 are
available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz

2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0

In the logs I can see that osd.17 dispatches the I/O to osd.13 and
osd.16 almost simultaneously. osd.16 seems to get the I/O right away,
but for some reason osd.13 doesn't get the message until 53 seconds
later. osd.17 seems happy to just wait and doesn't resend the data
(well, I'm not 100% sure how to tell which entries are the actual data
transfer).
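
For reference, a timeline like the one above can be reassembled by grepping each OSD's log (on its own host) for the client op id, e.g. for the first slow request shown below (paths assume the default log location):

grep 'client.250874.0:1388' /var/log/ceph/ceph-osd.17.log
grep 'client.250874.0:1388' /var/log/ceph/ceph-osd.13.log
grep 'client.250874.0:1388' /var/log/ceph/ceph-osd.16.log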

It looks like osd.17 is receiving responses to start the communication
with osd.13, but the op is not acknowledged until almost a minute
later. To me it seems that the message is getting received but not
passed to another thread right away or something. This test was done
with an idle cluster, a single fio client (rbd engine) with a single
thread.

The OSD servers are almost 100% idle during these blocked I/O
requests. I think I'm at the end of my troubleshooting, so I can use
some help.

Single Test started about
2015-09-22 12:52:36

2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
30.439150 secs
2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
cluster [WRN] slow request 30.439150 seconds old, received at
2015-09-22 12:55:06.487451:
 osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0545
[set-alloc-hint object_size 4194304 write_size 4194304,write
0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
 currently waiting for subops from 13,16
2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
[WRN] 2 slow requests, 2 included below; oldest blocked for >
30.379680 secs
2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
[WRN] slow request 30.291520 seconds old, received at 2015-09-22
12:55:06.406303:
 osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0541
[set-alloc-hint object_size 4194304 write_size 4194304,write
0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
 currently waiting for subops from 13,17
2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
[WRN] slow request 30.379680 seconds old, received at 2015-09-22
12:55:06.318144:
 osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.053f
[set-alloc-hint object_size 4194304 write_size 4194304,write
0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
 currently waiting for subops from 13,14
2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
30.954212 secs
2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
cluster [WRN] slow request 30.954212 seconds old, received at
2015-09-22 12:57:33.044003:
 osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.070d
[set-alloc-hint object_size 4194304 write_size 4194304,write
0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
 currently waiting for subops from 16,17
2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
30.704367 secs
2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
cluster [WRN] slow request 30.704367 seconds old, received at
2015-09-22 12:57:33.055404:
 osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.070e
[set-alloc-hint object_size 4194304 write_size 4194304,write
0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
 currently waiting for subops from 13,17

Server   IP addr  OSD
nodev  - 192.168.55.11 - 12
nodew  - 192.168.55.12 - 13
nodex  - 192.168.55.13 - 16
nodey  - 192.168.55.14 - 17
nodez  - 192.168.55.15 - 14
nodezz - 192.168.55.16 - 15

fio job:
[rbd-test]
readwrite=write
blocksize=4M
#runtime=60
name=rbd-test
#readwrite=randwrite
#bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
#rwmixread=72
#norandommap
#size=1T
#blocksize=4k
ioengine=rbd
rbdname=test2
pool=rbd
clientname=admin
iodepth=8
#numjobs=4
#thread
#group_reporting
#time_based
#direct=1
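
For reference, a job file like the one above would typically be saved (e.g. as rbd-test.fio, a placeholder name) and run as shown below; this assumes fio was built with rbd engine support and that the client can read the admin keyring:

fio rbd-test.fio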

Re: [ceph-users] Potential OSD deadlock?

2015-09-22 Thread Samuel Just
I looked at the logs, it looks like there was a 53 second delay
between when osd.17 started sending the osd_repop message and when
osd.13 started reading it, which is pretty weird.  Sage, didn't we
once see a kernel issue which caused some messages to be mysteriously
delayed for many 10s of seconds?

What kernel are you running?
-Sam

On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> OK, looping in ceph-devel to see if I can get some more eyes. I've
> extracted what I think are important entries from the logs for the
> first blocked request. NTP is running all the servers so the logs
> should be close in terms of time. Logs for 12:50 to 13:00 are
> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>
> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
>
> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
> osd.16 almost silmutaniously. osd.16 seems to get the I/O right away,
> but for some reason osd.13 doesn't get the message until 53 seconds
> later. osd.17 seems happy to just wait and doesn't resend the data
> (well, I'm not 100% sure how to tell which entries are the actual data
> transfer).
>
> It looks like osd.17 is receiving responses to start the communication
> with osd.13, but the op is not acknowledged until almost a minute
> later. To me it seems that the message is getting received but not
> passed to another thread right away or something. This test was done
> with an idle cluster, a single fio client (rbd engine) with a single
> thread.
>
> The OSD servers are almost 100% idle during these blocked I/O
> requests. I think I'm at the end of my troubleshooting, so I can use
> some help.
>
> Single Test started about
> 2015-09-22 12:52:36
>
> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> 30.439150 secs
> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
> cluster [WRN] slow request 30.439150 seconds old, received at
> 2015-09-22 12:55:06.487451:
>  osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0545
> [set-alloc-hint object_size 4194304 write_size 4194304,write
> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
>  currently waiting for subops from 13,16
> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
> [WRN] 2 slow requests, 2 included below; oldest blocked for >
> 30.379680 secs
> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
> 12:55:06.406303:
>  osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0541
> [set-alloc-hint object_size 4194304 write_size 4194304,write
> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
>  currently waiting for subops from 13,17
> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
> 12:55:06.318144:
>  osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.053f
> [set-alloc-hint object_size 4194304 write_size 4194304,write
> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
>  currently waiting for subops from 13,14
> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> 30.954212 secs
> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
> cluster [WRN] slow request 30.954212 seconds old, received at
> 2015-09-22 12:57:33.044003:
>  osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.070d
> [set-alloc-hint object_size 4194304 write_size 4194304,write
> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
>  currently waiting for subops from 16,17
> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> 30.704367 secs
> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
> cluster [WRN] slow request 30.704367 seconds old, received at
> 2015-09-22 12:57:33.055404:
>  osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.070e
> [set-alloc-hint object_size 4194304 write_size 4194304,write
> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected 

Re: [ceph-users] Potential OSD deadlock?

2015-09-22 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

4.2.0-1.el7.elrepo.x86_64
- - 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Sep 22, 2015 at 3:41 PM, Samuel Just  wrote:
> I looked at the logs, it looks like there was a 53 second delay
> between when osd.17 started sending the osd_repop message and when
> osd.13 started reading it, which is pretty weird.  Sage, didn't we
> once see a kernel issue which caused some messages to be mysteriously
> delayed for many 10s of seconds?
>
> What kernel are you running?
> -Sam
>
> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> OK, looping in ceph-devel to see if I can get some more eyes. I've
>> extracted what I think are important entries from the logs for the
>> first blocked request. NTP is running all the servers so the logs
>> should be close in terms of time. Logs for 12:50 to 13:00 are
>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>>
>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
>>
>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
>> osd.16 almost silmutaniously. osd.16 seems to get the I/O right away,
>> but for some reason osd.13 doesn't get the message until 53 seconds
>> later. osd.17 seems happy to just wait and doesn't resend the data
>> (well, I'm not 100% sure how to tell which entries are the actual data
>> transfer).
>>
>> It looks like osd.17 is receiving responses to start the communication
>> with osd.13, but the op is not acknowledged until almost a minute
>> later. To me it seems that the message is getting received but not
>> passed to another thread right away or something. This test was done
>> with an idle cluster, a single fio client (rbd engine) with a single
>> thread.
>>
>> The OSD servers are almost 100% idle during these blocked I/O
>> requests. I think I'm at the end of my troubleshooting, so I can use
>> some help.
>>
>> Single Test started about
>> 2015-09-22 12:52:36
>>
>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>> 30.439150 secs
>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
>> cluster [WRN] slow request 30.439150 seconds old, received at
>> 2015-09-22 12:55:06.487451:
>>  osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0545
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
>>  currently waiting for subops from 13,16
>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
>> 30.379680 secs
>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
>> 12:55:06.406303:
>>  osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0541
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
>>  currently waiting for subops from 13,17
>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
>> 12:55:06.318144:
>>  osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.053f
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
>>  currently waiting for subops from 13,14
>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>> 30.954212 secs
>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
>> cluster [WRN] slow request 30.954212 seconds old, received at
>> 2015-09-22 12:57:33.044003:
>>  osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.070d
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
>>  currently waiting for subops from 16,17
>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>> 30.704367 

Re: [ceph-users] Potential OSD deadlock?

2015-09-22 Thread Sage Weil
On Tue, 22 Sep 2015, Samuel Just wrote:
> I looked at the logs, it looks like there was a 53 second delay
> between when osd.17 started sending the osd_repop message and when
> osd.13 started reading it, which is pretty weird.  Sage, didn't we
> once see a kernel issue which caused some messages to be mysteriously
> delayed for many 10s of seconds?

Every time we have seen this behavior and diagnosed it in the wild it has 
been a network misconfiguration.  Usually related to jumbo frames.

sage


> 
> What kernel are you running?
> -Sam
> 
> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
> > -BEGIN PGP SIGNED MESSAGE-
> > Hash: SHA256
> >
> > OK, looping in ceph-devel to see if I can get some more eyes. I've
> > extracted what I think are important entries from the logs for the
> > first blocked request. NTP is running all the servers so the logs
> > should be close in terms of time. Logs for 12:50 to 13:00 are
> > available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
> >
> > 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
> > 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
> > 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
> > 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
> > 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
> > 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
> > 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
> > 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
> > 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
> > 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
> >
> > In the logs I can see that osd.17 dispatches the I/O to osd.13 and
> > osd.16 almost silmutaniously. osd.16 seems to get the I/O right away,
> > but for some reason osd.13 doesn't get the message until 53 seconds
> > later. osd.17 seems happy to just wait and doesn't resend the data
> > (well, I'm not 100% sure how to tell which entries are the actual data
> > transfer).
> >
> > It looks like osd.17 is receiving responses to start the communication
> > with osd.13, but the op is not acknowledged until almost a minute
> > later. To me it seems that the message is getting received but not
> > passed to another thread right away or something. This test was done
> > with an idle cluster, a single fio client (rbd engine) with a single
> > thread.
> >
> > The OSD servers are almost 100% idle during these blocked I/O
> > requests. I think I'm at the end of my troubleshooting, so I can use
> > some help.
> >
> > Single Test started about
> > 2015-09-22 12:52:36
> >
> > 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
> > cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> > 30.439150 secs
> > 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
> > cluster [WRN] slow request 30.439150 seconds old, received at
> > 2015-09-22 12:55:06.487451:
> >  osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0545
> > [set-alloc-hint object_size 4194304 write_size 4194304,write
> > 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
> >  currently waiting for subops from 13,16
> > 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
> > [WRN] 2 slow requests, 2 included below; oldest blocked for >
> > 30.379680 secs
> > 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
> > [WRN] slow request 30.291520 seconds old, received at 2015-09-22
> > 12:55:06.406303:
> >  osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0541
> > [set-alloc-hint object_size 4194304 write_size 4194304,write
> > 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
> >  currently waiting for subops from 13,17
> > 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
> > [WRN] slow request 30.379680 seconds old, received at 2015-09-22
> > 12:55:06.318144:
> >  osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.053f
> > [set-alloc-hint object_size 4194304 write_size 4194304,write
> > 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
> >  currently waiting for subops from 13,14
> > 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
> > cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> > 30.954212 secs
> > 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
> > cluster [WRN] slow request 30.954212 seconds old, received at
> > 2015-09-22 12:57:33.044003:
> >  osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.070d
> > [set-alloc-hint object_size 4194304 write_size 4194304,write
> > 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
> >  currently waiting for subops from 16,17
> > 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
> > cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> > 

Re: [ceph-users] Potential OSD deadlock?

2015-09-22 Thread Gregory Farnum
On Mon, Sep 21, 2015 at 11:43 PM, Robert LeBlanc  wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> I'm starting to wonder if this has to do with some OSDs getting full
> or the 0.94.3 code. Earlier this afternoon, I cleared out my test
> cluster so there were no pools. I created a new rbd pool and started
> filling it with six 1TB fio jobs at replication 3, with 6 spindles over
> six servers. It was running 0.94.2 at the time. After several hours of
> writes, we had the new patched 0.94.3 binaries ready for testing, so I
> rolled the update onto the test cluster while the fio jobs were running.
> There were a few blocked I/Os as the services were restarted (nothing
> I'm concerned about). Now that the OSDs are about 60% full, the
> blocked I/O is becoming very frequent even with the backports. The
> write bandwidth was consistently at 200 MB/s until this point; now it
> fluctuates between 200 MB/s and 75 MB/s, mostly hovering around
> 100 MB/s. Our production cluster uses XFS on the OSDs; this test
> cluster is EXT4.
>
> I'll see if I can go back to 0.94.2 and fill the cluster up again.
> Going back to 0.94.2 and 0.94.0 still shows the issue (although I
> didn't refill the cluster; I didn't delete what was already there).
> I'm building the latest hammer-backports now to see if it resolves
> the issue.

You're probably running into the FileStore collection splitting and
that's what is slowing things down in that testing.
-Greg


Re: [ceph-users] Potential OSD deadlock?

2015-09-22 Thread Robert LeBlanc

Is there some way to tell in the logs that this is happening? I'm not
seeing much I/O or CPU usage during these times. Is there some way to
prevent the splitting? Is there a negative side effect to doing so?
We've had I/O block for over 900 seconds, and as soon as the sessions
are aborted, they are reestablished and complete immediately.

The fio test is just a sequential write; starting it over (rewriting
from the beginning) still causes the issue. My suspicion was that it
would not have to create new files and therefore would not split
collections. This is on my test cluster with no other load.

I'll be doing a lot of testing today. Which log options and depths
would be the most helpful for tracking this issue down?

Thanks,

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Sep 22, 2015 at 8:09 AM, Gregory Farnum  wrote:
> On Mon, Sep 21, 2015 at 11:43 PM, Robert LeBlanc  wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> I'm starting to wonder if this has to do with some OSDs getting full
>> or the 0.94.3 code. Earlier this afternoon, I cleared out my test
>> cluster so there were no pools. I created a new rbd pool and started
>> filling it with six 1TB fio jobs at replication 3, with 6 spindles over
>> six servers. It was running 0.94.2 at the time. After several hours of
>> writes, we had the new patched 0.94.3 binaries ready for testing, so I
>> rolled the update onto the test cluster while the fio jobs were running.
>> There were a few blocked I/Os as the services were restarted (nothing
>> I'm concerned about). Now that the OSDs are about 60% full, the
>> blocked I/O is becoming very frequent even with the backports. The
>> write bandwidth was consistently at 200 MB/s until this point; now it
>> fluctuates between 200 MB/s and 75 MB/s, mostly hovering around
>> 100 MB/s. Our production cluster uses XFS on the OSDs; this test
>> cluster is EXT4.
>>
>> I'll see if I can go back to 0.94.2 and fill the cluster up again.
>> Going back to 0.94.2 and 0.94.0 still shows the issue (although I
>> didn't refill the cluster; I didn't delete what was already there).
>> I'm building the latest hammer-backports now to see if it resolves
>> the issue.
>
> You're probably running into the FileStore collection splitting and
> that's what is slowing things down in that testing.
> -Greg


Re: [ceph-users] Potential OSD deadlock?

2015-09-22 Thread Gregory Farnum
On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> Is there some way to tell in the logs that this is happening?

You can search for the (mangled) name _split_collection
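For example, against the log you already posted, something along these
lines should show whether (and how often) splits are happening; the
filename is just taken from your earlier message, adjust as needed:

xzcat ceph-osd.112.log.xz | grep -c _split_collection
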
> I'm not
> seeing much I/O or CPU usage during these times. Is there some way to
> prevent the splitting? Is there a negative side effect to doing so?

Bump up the split and merge thresholds. You can search the list for
this, it was discussed not too long ago.
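For reference, those knobs live in the [osd] section of ceph.conf; a
sketch along these lines (the numbers are just an example, not a
recommendation, and the OSDs need a restart to pick them up):

[osd]
# a PG directory is split roughly once it holds
# merge threshold * 16 * split multiple files
filestore merge threshold = 40
filestore split multiple = 8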

> We've had I/O block for over 900 seconds, and as soon as the sessions
> are aborted, they are reestablished and complete immediately.
>
> The fio test is just a sequential write; starting it over (rewriting
> from the beginning) still causes the issue. My suspicion was that it
> would not have to create new files and therefore would not split
> collections. This is on my test cluster with no other load.

Hmm, that does make it seem less likely if you're really not creating
new objects, i.e. if you're actually running fio in such a way that it's
not allocating new FS blocks (this is probably hard to set up?).

>
> I'll be doing a lot of testing today. Which log options and depths
> would be the most helpful for tracking this issue down?

If you want to go log diving, "debug osd = 20", "debug filestore = 20",
and "debug ms = 1" are what the OSD guys like to see. That should spit
out everything you need to track exactly what each Op is doing.
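If it's easier, you can usually flip those on at runtime rather than
editing ceph.conf and restarting, e.g. for a single OSD (osd.13 here
just because it's one from the earlier trace):

ceph tell osd.13 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'
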
-Greg


Re: [ceph-users] Potential OSD deadlock?

2015-09-21 Thread Gregory Farnum
So it sounds like you've got two different things here:
1) You get a lot of slow operations that show up as warnings.

2) Rarely, you get blocked op warnings that don't seem to go away
until the cluster state changes somehow.

(2) is the interesting one. Since you say the cluster is under heavy
load, I presume (1) is just you overloading your servers and getting
some hot spots that take time to clear up.

There have been bugs in the past where slow op warnings weren't
getting removed when they should have. I don't *think* any are in
.94.3 but could be wrong. Have you observed these from the other
direction, where a client has blocked operations?
If you want to go through the logs yourself, you should try to find
all the lines about one of the operations which seems to be blocked.
They aren't the most readable, but if you grep for the operation ID
(client.4267090.0:3510311) and then, once you're in the right area,
look at what the threads processing it are doing, you should get some
idea of where things are going wrong that you can share.
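Something as simple as this is usually a good starting point (log path
assumed to be the default; adjust to wherever your OSD logs actually
live):

grep 'client.4267090.0:3510311' /var/log/ceph/ceph-osd.72.log
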
-Greg

On Sun, Sep 20, 2015 at 10:43 PM, Robert LeBlanc  wrote:
> We set the logging on an OSD that had problems pretty frequently, but
> cleared up in less than 30 seconds. The logs are at
> http://162.144.87.113/files/ceph-osd.112.log.xz and are uncompressed
> at 8.6GB. Some of the messages we were seeing in ceph -w are:
>
> 2015-09-20 20:55:44.029041 osd.112 [WRN] 10 slow requests, 10 included
> below; oldest blocked for > 30.132696 secs
> 2015-09-20 20:55:44.029047 osd.112 [WRN] slow request 30.132696
> seconds old, received at 2015-09-20 20:55:13.896286:
> osd_op(client.3289538.0:62497509
> rbd_data.29b9ae3f960770.0200 [stat,set-alloc-hint
> object_size 8388608 write_size 8388608,write 2588672~4096] 17.118f0c67
> ack+ondisk+write+known_if_redirected e57590) currently reached_pg
> 2015-09-20 20:55:44.029051 osd.112 [WRN] slow request 30.132619
> seconds old, received at 2015-09-20 20:55:13.896363:
> osd_op(client.3289538.0:62497510
> rbd_data.29b9ae3f960770.0200 [stat,set-alloc-hint
> object_size 8388608 write_size 8388608,write 2908160~12288]
> 17.118f0c67 ack+ondisk+write+known_if_redirected e57590) currently
> waiting for rw locks
> 2015-09-20 20:55:44.029054 osd.112 [WRN] slow request 30.132520
> seconds old, received at 2015-09-20 20:55:13.896462:
> osd_op(client.3289538.0:62497511
> rbd_data.29b9ae3f960770.0200 [stat,set-alloc-hint
> object_size 8388608 write_size 8388608,write 2949120~4096] 17.118f0c67
> ack+ondisk+write+known_if_redirected e57590) currently waiting for rw
> locks
> 2015-09-20 20:55:44.029058 osd.112 [WRN] slow request 30.132415
> seconds old, received at 2015-09-20 20:55:13.896567:
> osd_op(client.3289538.0:62497512
> rbd_data.29b9ae3f960770.0200 [stat,set-alloc-hint
> object_size 8388608 write_size 8388608,write 2957312~4096] 17.118f0c67
> ack+ondisk+write+known_if_redirected e57590) currently waiting for rw
> locks
> 2015-09-20 20:55:44.029061 osd.112 [WRN] slow request 30.132302
> seconds old, received at 2015-09-20 20:55:13.896680:
> osd_op(client.3289538.0:62497513
> rbd_data.29b9ae3f960770.0200 [stat,set-alloc-hint
> object_size 8388608 write_size 8388608,write 2998272~4096] 17.118f0c67
> ack+ondisk+write+known_if_redirected e57590) currently waiting for rw
> locks
> 2015-09-20 20:55:45.029290 osd.112 [WRN] 9 slow requests, 5 included
> below; oldest blocked for > 31.132843 secs
> 2015-09-20 20:55:45.029298 osd.112 [WRN] slow request 31.132447
> seconds old, received at 2015-09-20 20:55:13.896759:
> osd_op(client.3289538.0:62497514
> rbd_data.29b9ae3f960770.0200 [stat,set-alloc-hint
> object_size 8388608 write_size 8388608,write 3035136~4096] 17.118f0c67
> ack+ondisk+write+known_if_redirected e57590) currently waiting for rw
> locks
> 2015-09-20 20:55:45.029303 osd.112 [WRN] slow request 31.132362
> seconds old, received at 2015-09-20 20:55:13.896845:
> osd_op(client.3289538.0:62497515
> rbd_data.29b9ae3f960770.0200 [stat,set-alloc-hint
> object_size 8388608 write_size 8388608,write 3047424~4096] 17.118f0c67
> ack+ondisk+write+known_if_redirected e57590) currently waiting for rw
> locks
> 2015-09-20 20:55:45.029309 osd.112 [WRN] slow request 31.132276
> seconds old, received at 2015-09-20 20:55:13.896931:
> osd_op(client.3289538.0:62497516
> rbd_data.29b9ae3f960770.0200 [stat,set-alloc-hint
> object_size 8388608 write_size 8388608,write 3072000~4096] 17.118f0c67
> ack+ondisk+write+known_if_redirected e57590) currently waiting for rw
> locks
> 2015-09-20 20:55:45.029315 osd.112 [WRN] slow request 31.132199
> seconds old, received at 2015-09-20 20:55:13.897008:
> osd_op(client.3289538.0:62497517
> rbd_data.29b9ae3f960770.0200 [stat,set-alloc-hint
> object_size 8388608 write_size 8388608,write 3211264~4096] 17.118f0c67
> ack+ondisk+write+known_if_redirected e57590) currently waiting for rw
> locks
> 2015-09-20 

Re: [ceph-users] Potential OSD deadlock?

2015-09-21 Thread Robert LeBlanc

In my lab cluster I can saturate the disks and I'm not seeing any of
the blocked I/Os from the Ceph side, although the client shows that
I/O stops for a while. I'm not convinced that it is load related.

I was looking through the logs using the technique you described as
well as looking for the associated PG. There is a lot of data to go
through and it is taking me some time.

We are rolling some of the backports for 0.94.4 into a build: one for
the PG split problem and five others that might help. One that I'm
really hopeful about is http://tracker.ceph.com/issues/12843, but I'm
not sure the messages we are seeing are exactly related. We are
planning to roll the new binaries tomorrow night. I'll update this
thread after the new code has been rolled.

Thanks,

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Sep 21, 2015 at 4:00 PM, Gregory Farnum  wrote:
> So it sounds like you've got two different things here:
> 1) You get a lot of slow operations that show up as warnings.
>
> 2) Rarely, you get blocked op warnings that don't seem to go away
> until the cluster state changes somehow.
>
> (2) is the interesting one. Since you say the cluster is under heavy
> load, I presume (1) is just you overloading your servers and getting
> some hot spots that take time to clear up.
>
> There have been bugs in the past where slow op warnings weren't
> getting removed when they should have. I don't *think* any are in
> .94.3 but could be wrong. Have you observed these from the other
> direction, where a client has blocked operations?
> If you want to go through the logs yourself, you should try to find
> all the lines about one of the operations which seems to be blocked.
> They aren't the most readable, but if you grep for the operation ID
> (client.4267090.0:3510311) and then, once you're in the right area,
> look at what the threads processing it are doing, you should get some
> idea of where things are going wrong that you can share.
> -Greg
>
> On Sun, Sep 20, 2015 at 10:43 PM, Robert LeBlanc  wrote:
>> We set the logging on an OSD that had problems pretty frequently, but
>> cleared up in less than 30 seconds. The logs are at
>> http://162.144.87.113/files/ceph-osd.112.log.xz and are uncompressed
>> at 8.6GB. Some of the messages we were seeing in ceph -w are:
>>
>> 2015-09-20 20:55:44.029041 osd.112 [WRN] 10 slow requests, 10 included
>> below; oldest blocked for > 30.132696 secs
>> 2015-09-20 20:55:44.029047 osd.112 [WRN] slow request 30.132696
>> seconds old, received at 2015-09-20 20:55:13.896286:
>> osd_op(client.3289538.0:62497509
>> rbd_data.29b9ae3f960770.0200 [stat,set-alloc-hint
>> object_size 8388608 write_size 8388608,write 2588672~4096] 17.118f0c67
>> ack+ondisk+write+known_if_redirected e57590) currently reached_pg
>> 2015-09-20 20:55:44.029051 osd.112 [WRN] slow request 30.132619
>> seconds old, received at 2015-09-20 20:55:13.896363:
>> osd_op(client.3289538.0:62497510
>> rbd_data.29b9ae3f960770.0200 [stat,set-alloc-hint
>> object_size 8388608 write_size 8388608,write 2908160~12288]
>> 17.118f0c67 ack+ondisk+write+known_if_redirected e57590) currently
>> waiting for rw locks
>> 2015-09-20 20:55:44.029054 osd.112 [WRN] slow request 30.132520
>> seconds old, received at 2015-09-20 20:55:13.896462:
>> osd_op(client.3289538.0:62497511
>> rbd_data.29b9ae3f960770.0200 [stat,set-alloc-hint
>> object_size 8388608 write_size 8388608,write 2949120~4096] 17.118f0c67
>> ack+ondisk+write+known_if_redirected e57590) currently waiting for rw
>> locks
>> 2015-09-20 20:55:44.029058 osd.112 [WRN] slow request 30.132415
>> seconds old, received at 2015-09-20 20:55:13.896567:
>> osd_op(client.3289538.0:62497512
>> rbd_data.29b9ae3f960770.0200 [stat,set-alloc-hint
>> object_size 8388608 write_size 8388608,write 2957312~4096] 17.118f0c67
>> ack+ondisk+write+known_if_redirected e57590) currently waiting for rw
>> locks
>> 2015-09-20 20:55:44.029061 osd.112 [WRN] slow request 30.132302

Re: [ceph-users] Potential OSD deadlock?

2015-09-20 Thread Robert LeBlanc

I was able to catch the tail end of one of these and increased the
logging on it. I had to kill it a minute or two after the logging was
increased because of the time of the day.

I've put the logs at https://robert.leblancnet.us/ceph-osd.8.log.xz .

Thanks,
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Sun, Sep 20, 2015 at 9:03 AM, Robert LeBlanc  wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> We had another incident of 100 long blocked I/Os this morning, but I
> didn't get to it in time. It wound up clearing itself after almost
> 1,000 seconds. One interesting note is that the blocked I/O kept
> creeping up until I saw a bunch of entries in the log like:
>
> 2015-09-20 08:20:01.870141 7f5fbe05c700  0 -- 10.217.72.12:6812/1468
>>> 10.217.72.35:0/4027675 pipe(0x2f7a3000 sd=363 :6812 s=0 pgs=0 cs=0
> l=1 c=0x1bd40840).accept replacing existing (lossy) channel (new one
> lossy=1)
> 2015-09-20 08:20:02.061539 7f5f43de3700  0 -- 10.217.72.12:6812/1468
>>> 10.217.72.33:0/2012691 pipe(0x28857000 sd=408 :6812 s=0 pgs=0 cs=0
> l=1 c=0x1bd43020).accept replacing existing (lossy) channel (new one
> lossy=1)
> 2015-09-20 08:20:02.817884 7f5fb2ca9700  0 -- 10.217.72.12:6812/1468
>>> 10.217.72.33:0/2040360 pipe(0x283ff000 sd=605 :6812 s=0 pgs=0 cs=0
> l=1 c=0x1bd402c0).accept replacing existing (lossy) channel (new one
> lossy=1)
>
> after almost 100 of these in about a 2 second period, the I/O starts
> draining over the next 3-4 seconds. What does this message mean? 1468
> appears to be the PID of one of the OSD processes, but the number in
> the same position on the dest is not a PID. What would cause a channel
> to be dropped and recreated?
>
> There are 10 OSDs on this host, but only two other OSDs show anything
> and only two messages (but it is many seconds AFTER the problem clears
> up).
>
> OSD.128
>
> 2015-09-20 08:25:32.331268 7fe92e7e0700  0 -- 10.217.72.12:6816/7060
> submit_message osd_op_reply(5553069
> rbd_data.3e30b0275d493f.d800 [set-alloc-hint object_size
> 4194304 write_size 4194304,write 143360~4096] v57084'16557680
> uv16557680 ack = 0) v6 remote, 10.217.72.33:0/4005620, failed lossy
> con, dropping message 0xa099180
> 2015-09-20 08:25:32.331401 7fe92e7e0700  0 -- 10.217.72.12:6816/7060
> submit_message osd_op_reply(5553069
> rbd_data.3e30b0275d493f.d800 [set-alloc-hint object_size
> 4194304 write_size 4194304,write 143360~4096] v57084'16557680
> uv16557680 ondisk = 0) v6 remote, 10.217.72.33:0/4005620, failed lossy
> con, dropping message 0x16217600
>
> OSD.121
>
> 2015-09-20 08:29:15.192055 7f51c07e2700  0 -- 10.217.72.12:6802/25568
> submit_message osd_op_reply(1483497
> rbd_data.3fc0847f32a8f4.3200 [set-alloc-hint object_size
> 4194304 write_size 4194304,write 135168~4096] v57086'14949171
> uv14949171 ack = 0) v6 remote, 10.217.72.31:0/5002159, failed lossy
> con, dropping message 0x1dd8a840
> 2015-09-20 08:29:15.192612 7f51c07e2700  0 -- 10.217.72.12:6802/25568
> submit_message osd_op_reply(1483497
> rbd_data.3fc0847f32a8f4.3200 [set-alloc-hint object_size
> 4194304 write_size 4194304,write 135168~4096] v57086'14949171
> uv14949171 ondisk = 0) v6 remote, 10.217.72.31:0/5002159, failed lossy
> con, dropping message 0x1476ec00
>
> Our Ceph processes start with some real high limits:
> root  1465  0.0  0.0 127700  3492 ?Ss   Jun11   0:00
> /bin/bash -c ulimit -n 32768; /usr/bin/ceph-osd -i 126 --pid-file
> /var/run/ceph/osd.126.pid -c /etc/ceph/ceph.conf --cluster ceph -f
> root  1468  9.7  1.4 4849560 923536 ?  Sl   Jun11 14190:59
> /usr/bin/ceph-osd -i 126 --pid-file /var/run/ceph/osd.126.pid -c
> /etc/ceph/ceph.conf --cluster ceph -f
>
> [root@ceph2 1468]# cat /etc/security/limits.d/90-local.conf
> *   -   nofile  16384
>
> We have 130 OSDs on 13 hosts. We also have ~50 KVM VMs running on this storage.
>
> cluster 48de182b-5488-42bb-a6d2-62e8e47b435c
>  health HEALTH_WARN
> 198 pgs backfill
> 4 pgs backfilling
> 169 pgs degraded
> 150 pgs recovery_wait
> 169 pgs stuck degraded
> 352 pgs stuck unclean
> 12 pgs stuck undersized
> 12 pgs undersized
> recovery 161065/41285858 objects degraded (0.390%)
> recovery 2871014/41285858 objects misplaced (6.954%)
> noscrub,nodeep-scrub flag(s) set
>  monmap e2: 3 mons at
> {mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0}
> election epoch 180, quorum 0,1,2 mon1,mon2,mon3
>  osdmap e57086: 130 osds: 130 up, 130 in; 270 remapped pgs
> flags noscrub,nodeep-scrub
>   pgmap v10921036: 2308 pgs, 3 pools, 39046 GB data, 9735 kobjects
> 151 TB used, 320 TB / 472 TB avail
> 161065/41285858 objects degraded (0.390%)
> 2871014/41285858 

Re: [ceph-users] Potential OSD deadlock?

2015-09-20 Thread Robert LeBlanc

We had another incident of 100 long blocked I/Os this morning, but I
didn't get to it in time. It wound up clearing itself after almost
1,000 seconds. One interesting note is that the blocked I/O kept
creeping up until I saw a bunch of entries in the log like:

2015-09-20 08:20:01.870141 7f5fbe05c700  0 -- 10.217.72.12:6812/1468
>> 10.217.72.35:0/4027675 pipe(0x2f7a3000 sd=363 :6812 s=0 pgs=0 cs=0
l=1 c=0x1bd40840).accept replacing existing (lossy) channel (new one
lossy=1)
2015-09-20 08:20:02.061539 7f5f43de3700  0 -- 10.217.72.12:6812/1468
>> 10.217.72.33:0/2012691 pipe(0x28857000 sd=408 :6812 s=0 pgs=0 cs=0
l=1 c=0x1bd43020).accept replacing existing (lossy) channel (new one
lossy=1)
2015-09-20 08:20:02.817884 7f5fb2ca9700  0 -- 10.217.72.12:6812/1468
>> 10.217.72.33:0/2040360 pipe(0x283ff000 sd=605 :6812 s=0 pgs=0 cs=0
l=1 c=0x1bd402c0).accept replacing existing (lossy) channel (new one
lossy=1)

after almost 100 of these in about a 2 second period, the I/O starts
draining over the next 3-4 seconds. What does this message mean? 1468
appears to be the PID of one of the OSD processes, but the number in
the same position on the dest is not a PID. What would cause a channel
to be dropped and recreated?

There are 10 OSDs on this host, but only two other OSDs show anything
and only two messages (but it is many seconds AFTER the problem clears
up).

OSD.128

2015-09-20 08:25:32.331268 7fe92e7e0700  0 -- 10.217.72.12:6816/7060
submit_message osd_op_reply(5553069
rbd_data.3e30b0275d493f.d800 [set-alloc-hint object_size
4194304 write_size 4194304,write 143360~4096] v57084'16557680
uv16557680 ack = 0) v6 remote, 10.217.72.33:0/4005620, failed lossy
con, dropping message 0xa099180
2015-09-20 08:25:32.331401 7fe92e7e0700  0 -- 10.217.72.12:6816/7060
submit_message osd_op_reply(5553069
rbd_data.3e30b0275d493f.d800 [set-alloc-hint object_size
4194304 write_size 4194304,write 143360~4096] v57084'16557680
uv16557680 ondisk = 0) v6 remote, 10.217.72.33:0/4005620, failed lossy
con, dropping message 0x16217600

OSD.121

2015-09-20 08:29:15.192055 7f51c07e2700  0 -- 10.217.72.12:6802/25568
submit_message osd_op_reply(1483497
rbd_data.3fc0847f32a8f4.3200 [set-alloc-hint object_size
4194304 write_size 4194304,write 135168~4096] v57086'14949171
uv14949171 ack = 0) v6 remote, 10.217.72.31:0/5002159, failed lossy
con, dropping message 0x1dd8a840
2015-09-20 08:29:15.192612 7f51c07e2700  0 -- 10.217.72.12:6802/25568
submit_message osd_op_reply(1483497
rbd_data.3fc0847f32a8f4.3200 [set-alloc-hint object_size
4194304 write_size 4194304,write 135168~4096] v57086'14949171
uv14949171 ondisk = 0) v6 remote, 10.217.72.31:0/5002159, failed lossy
con, dropping message 0x1476ec00

Our Ceph processes start with some real high limits:
root  1465  0.0  0.0 127700  3492 ?Ss   Jun11   0:00
/bin/bash -c ulimit -n 32768; /usr/bin/ceph-osd -i 126 --pid-file
/var/run/ceph/osd.126.pid -c /etc/ceph/ceph.conf --cluster ceph -f
root  1468  9.7  1.4 4849560 923536 ?  Sl   Jun11 14190:59
/usr/bin/ceph-osd -i 126 --pid-file /var/run/ceph/osd.126.pid -c
/etc/ceph/ceph.conf --cluster ceph -f

[root@ceph2 1468]# cat /etc/security/limits.d/90-local.conf
*   -   nofile  16384

We have 130 OSDs on 13 hosts. We also have ~50 KVM VMs running on this storage.

cluster 48de182b-5488-42bb-a6d2-62e8e47b435c
 health HEALTH_WARN
198 pgs backfill
4 pgs backfilling
169 pgs degraded
150 pgs recovery_wait
169 pgs stuck degraded
352 pgs stuck unclean
12 pgs stuck undersized
12 pgs undersized
recovery 161065/41285858 objects degraded (0.390%)
recovery 2871014/41285858 objects misplaced (6.954%)
noscrub,nodeep-scrub flag(s) set
 monmap e2: 3 mons at
{mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0}
election epoch 180, quorum 0,1,2 mon1,mon2,mon3
 osdmap e57086: 130 osds: 130 up, 130 in; 270 remapped pgs
flags noscrub,nodeep-scrub
  pgmap v10921036: 2308 pgs, 3 pools, 39046 GB data, 9735 kobjects
151 TB used, 320 TB / 472 TB avail
161065/41285858 objects degraded (0.390%)
2871014/41285858 objects misplaced (6.954%)
1956 active+clean
 183 active+remapped+wait_backfill
  82 active+recovery_wait+degraded
  67 active+recovery_wait+degraded+remapped
   9 active+undersized+degraded+remapped+wait_backfill
   6 active+degraded+remapped+wait_backfill
   2 active+undersized+degraded+remapped+backfilling
   2 active+degraded+remapped+backfilling
   1 active+recovery_wait+undersized+degraded+remapped
recovery io 25770 kB/s, 6 objects/s
  client io 78274 kB/s rd, 119 MB/s wr, 5949 op/s

Any 

Re: [ceph-users] Potential OSD deadlock?

2015-09-20 Thread Robert LeBlanc
We set the logging on an OSD that had problems pretty frequently, but
cleared up in less than 30 seconds. The logs are at
http://162.144.87.113/files/ceph-osd.112.log.xz and are uncompressed
at 8.6GB. Some of the messages we were seeing in ceph -w are:

2015-09-20 20:55:44.029041 osd.112 [WRN] 10 slow requests, 10 included
below; oldest blocked for > 30.132696 secs
2015-09-20 20:55:44.029047 osd.112 [WRN] slow request 30.132696
seconds old, received at 2015-09-20 20:55:13.896286:
osd_op(client.3289538.0:62497509
rbd_data.29b9ae3f960770.0200 [stat,set-alloc-hint
object_size 8388608 write_size 8388608,write 2588672~4096] 17.118f0c67
ack+ondisk+write+known_if_redirected e57590) currently reached_pg
2015-09-20 20:55:44.029051 osd.112 [WRN] slow request 30.132619
seconds old, received at 2015-09-20 20:55:13.896363:
osd_op(client.3289538.0:62497510
rbd_data.29b9ae3f960770.0200 [stat,set-alloc-hint
object_size 8388608 write_size 8388608,write 2908160~12288]
17.118f0c67 ack+ondisk+write+known_if_redirected e57590) currently
waiting for rw locks
2015-09-20 20:55:44.029054 osd.112 [WRN] slow request 30.132520
seconds old, received at 2015-09-20 20:55:13.896462:
osd_op(client.3289538.0:62497511
rbd_data.29b9ae3f960770.0200 [stat,set-alloc-hint
object_size 8388608 write_size 8388608,write 2949120~4096] 17.118f0c67
ack+ondisk+write+known_if_redirected e57590) currently waiting for rw
locks
2015-09-20 20:55:44.029058 osd.112 [WRN] slow request 30.132415
seconds old, received at 2015-09-20 20:55:13.896567:
osd_op(client.3289538.0:62497512
rbd_data.29b9ae3f960770.0200 [stat,set-alloc-hint
object_size 8388608 write_size 8388608,write 2957312~4096] 17.118f0c67
ack+ondisk+write+known_if_redirected e57590) currently waiting for rw
locks
2015-09-20 20:55:44.029061 osd.112 [WRN] slow request 30.132302
seconds old, received at 2015-09-20 20:55:13.896680:
osd_op(client.3289538.0:62497513
rbd_data.29b9ae3f960770.0200 [stat,set-alloc-hint
object_size 8388608 write_size 8388608,write 2998272~4096] 17.118f0c67
ack+ondisk+write+known_if_redirected e57590) currently waiting for rw
locks
2015-09-20 20:55:45.029290 osd.112 [WRN] 9 slow requests, 5 included
below; oldest blocked for > 31.132843 secs
2015-09-20 20:55:45.029298 osd.112 [WRN] slow request 31.132447
seconds old, received at 2015-09-20 20:55:13.896759:
osd_op(client.3289538.0:62497514
rbd_data.29b9ae3f960770.0200 [stat,set-alloc-hint
object_size 8388608 write_size 8388608,write 3035136~4096] 17.118f0c67
ack+ondisk+write+known_if_redirected e57590) currently waiting for rw
locks
2015-09-20 20:55:45.029303 osd.112 [WRN] slow request 31.132362
seconds old, received at 2015-09-20 20:55:13.896845:
osd_op(client.3289538.0:62497515
rbd_data.29b9ae3f960770.0200 [stat,set-alloc-hint
object_size 8388608 write_size 8388608,write 3047424~4096] 17.118f0c67
ack+ondisk+write+known_if_redirected e57590) currently waiting for rw
locks
2015-09-20 20:55:45.029309 osd.112 [WRN] slow request 31.132276
seconds old, received at 2015-09-20 20:55:13.896931:
osd_op(client.3289538.0:62497516
rbd_data.29b9ae3f960770.0200 [stat,set-alloc-hint
object_size 8388608 write_size 8388608,write 3072000~4096] 17.118f0c67
ack+ondisk+write+known_if_redirected e57590) currently waiting for rw
locks
2015-09-20 20:55:45.029315 osd.112 [WRN] slow request 31.132199
seconds old, received at 2015-09-20 20:55:13.897008:
osd_op(client.3289538.0:62497517
rbd_data.29b9ae3f960770.0200 [stat,set-alloc-hint
object_size 8388608 write_size 8388608,write 3211264~4096] 17.118f0c67
ack+ondisk+write+known_if_redirected e57590) currently waiting for rw
locks
2015-09-20 20:55:45.029326 osd.112 [WRN] slow request 31.132127
seconds old, received at 2015-09-20 20:55:13.897079:
osd_op(client.3289538.0:62497518
rbd_data.29b9ae3f960770.0200 [stat,set-alloc-hint
object_size 8388608 write_size 8388608,write 3235840~4096] 17.118f0c67
ack+ondisk+write+known_if_redirected e57590) currently waiting for rw
locks

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Sun, Sep 20, 2015 at 7:02 PM, Robert LeBlanc  wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> I was able to catch the tail end of one of these and increased the
> logging on it. I had to kill it a minute or two after the logging was
> increased because of the time of the day.
>
> I've put the logs at https://robert.leblancnet.us/ceph-osd.8.log.xz .
>
> Thanks,
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Sun, Sep 20, 2015 at 9:03 AM, Robert LeBlanc  wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> We had another incident of 100 long blocked I/Os this morning, but I
>> didn't get to it in time. It wound up clearing itself after almost
>> 1,000 seconds. One interesting note is that the blocked I/O kept

[ceph-users] Potential OSD deadlock?

2015-09-19 Thread Robert LeBlanc

We have had two situations where I/O just seems to be indefinitely
blocked on our production cluster today (0.94.3). In the case this
morning, it was just normal I/O traffic, no recovery or backfill. In
the case this evening, we were backfilling to some new OSDs. I would
have loved to bump up the debugging to get an idea of what was going
on, but time was exhausted. During the incident this evening I was able
to do some additional troubleshooting, but I got really anxious after
I/O had been blocked for 10 minutes and ops were getting hot under the
collar.

Here are the important parts of the logs:
[osd.30]
2015-09-18 23:05:36.188251 7efed0ef0700  0 log_channel(cluster) log
[WRN] : slow request 30.662958 seconds old,
 received at 2015-09-18 23:05:05.525220: osd_op(client.3117179.0:18654441
 rbd_data.1099d2f67aaea.0f62 [set-alloc-hint object_size
8388608 write_size 8388608,write 1048576~643072] 4.5ba1672c
ack+ondisk+write+known_if_redirected e55919)
 currently waiting for subops from 32,70,72

[osd.72]
2015-09-18 23:05:19.302985 7f3fa19f8700  0 log_channel(cluster) log
[WRN] : slow request 30.200408 seconds old,
 received at 2015-09-18 23:04:49.102519: osd_op(client.4267090.0:3510311
 rbd_data.3f41d41bd65b28.9e2b [set-alloc-hint object_size
4194304 write_size 4194304,write 1048576~421888] 17.40adcada
ack+ondisk+write+known_if_redirected e55919)
 currently waiting for subops from 2,30,90

The other OSDs listed (32,70,2,90) did not have any errors in the logs
about blocked I/O. It seems that osd.30 was waiting for osd.72 and
vice versa. I looked at top and iostat on these two hosts, and the OSD
processes and disk I/O were pretty idle.

I know that this isn't a lot to go on. Our cluster is under very heavy
load and we get several blocked I/Os every hour, but they usually
clear up within 15 seconds. We seem to get I/O blocked when the op
latency of the cluster goes above 1 (average from all OSDs as seen by
Graphite).

Has anyone seen this kind of indefinitely blocked I/O? Bouncing osd.72
immediately cleared all the blocked I/O, and it was fine after
rejoining the cluster. Which logs, and at what level, would be most
beneficial to increase in this case for troubleshooting?
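(To be clear, by "bouncing" I just mean restarting the OSD daemon on
its host, something along the lines of the sysvinit-style command
below; adjust for however your OSDs are actually managed.)

service ceph restart osd.72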

I hope this makes sense, it has been a long day.

- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1