I know writing to min_size replicas as sync and the remaining (size - min_size) as async has been discussed before and would help here. From what I understand, it would require a lot of code changes and goes against Ceph's strong consistency model. I'm not sure it will ever be implemented, although I do love this idea as a way to fight tail latency.

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
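The min_size idea above can be sketched as a toy model: acknowledge the client as soon as min_size replicas confirm the write, and let the remaining (size - min_size) replica writes finish in the background. This is purely illustrative of the latency argument; it is not how Ceph's OSD replication code actually works, and all names here are made up.

```python
import threading
import time

def replicated_write(replicas, payload, min_size):
    """Toy model: return to the caller once min_size replicas have
    acknowledged the write; the tail replicas complete asynchronously.
    (Illustrative only -- not Ceph's actual replication path.)"""
    acked = threading.Semaphore(0)
    done = []

    def write_one(replica):
        replica(payload)          # may be slow on a node with a flaky link
        done.append(replica)
        acked.release()

    threads = [threading.Thread(target=write_one, args=(r,), daemon=True)
               for r in replicas]
    for t in threads:
        t.start()
    # Block only until min_size replicas confirm; a single replica stuck
    # behind a lossy GBIC no longer adds to client-visible latency.
    for _ in range(min_size):
        acked.acquire()
    return len(done)  # >= min_size at this point
```

The catch the thread alludes to: once a client is acknowledged before all replicas are durable, reads and recovery must cope with replicas that temporarily diverge, which is exactly the consistency-model cost Robert mentions.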
On Thu, Aug 27, 2015 at 12:48 PM, Jan Schermer wrote:
> Don't kick out the node, just deal with it gracefully and without
> interruption... if the IO reached the quorum number of OSDs then there's no
> need to block anymore, just queue it. Reads can be mirrored or retried (much
> quicker, because making writes idempotent, ordered and async is pretty hard
> and expensive).
> If there's an easy way to detect an unreliable OSD that flaps - great, let's
> have a warning in ceph health.
>
> Jan
>
>> On 27 Aug 2015, at 20:43, Robert LeBlanc wrote:
>>
>> This has been discussed a few times. The consensus seems to be to make
>> sure error rates of NICs and other such metrics are included in your
>> monitoring solution. It would also be good to perform periodic network
>> tests, like a full-size ping with the don't-fragment bit set between all
>> nodes, and have your monitoring solution report that as well.
>>
>> Although I would like to see such a feature in Ceph, the concern is
>> that such a feature could quickly get out of hand, and that something
>> else that is really designed for it should do it. I can understand
>> where they are coming from in that regard, but having Ceph kick out a
>> misbehaving node quickly is appealing as well (there would have to be
>> a way to specify that only so many nodes could be kicked out).
>> ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>>
>> On Thu, Aug 27, 2015 at 9:37 AM, Christoph Adomeit wrote:
>>> Hello Ceph Users,
>>>
>>> yesterday I had a defective GBIC in 1 node of my 10-node Ceph cluster.
>>>
>>> The GBIC was working somehow but had 50% packet loss. Some packets went
>>> through, some did not.
>>>
>>> What happened was that the whole cluster did not service requests in
>>> time; there were lots of timeouts and so on until the problem was
>>> isolated. Monitors and OSDs were asked for data but did not answer, or
>>> answered late.
>>>
>>> I am wondering: here we have a highly redundant network setup and a
>>> highly redundant piece of software, but a small network fault brings
>>> down the whole cluster.
>>>
>>> Is there anything that can be configured or changed in Ceph so that
>>> availability becomes better in the case of a flapping network?
>>>
>>> I understand it is not a Ceph problem but a network problem, but maybe
>>> something can be learned from such incidents?
>>>
>>> Thanks
>>> Christoph
>>> --
>>> Christoph Adomeit
>>> GATWORKS GmbH
>>> Reststrauch 191
>>> 41199 Moenchengladbach
>>> Sitz: Moenchengladbach
>>> Amtsgericht Moenchengladbach, HRB 6303
>>> Geschaeftsfuehrer:
>>> Christoph Adomeit, Hans Wilhelm Terstappen
>>>
>>> [email protected] Internetloesungen vom Feinsten
>>> Fon. +49 2166 9149-32 Fax. +49 2166 9149-10
>>> _______________________________________________
>>> ceph-users mailing list
>>> [email protected]
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
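Robert's monitoring suggestion above (watch NIC error rates, and run periodic full-size don't-fragment pings such as `ping -M do -s 1472 <host>` on a 1500-byte MTU) can be sketched as a small check against the standard Linux sysfs counters. The counter file names are real kernel statistics; the function names, threshold, and interface handling are my own assumptions for illustration.

```python
from pathlib import Path

def read_nic_errors(iface, sysfs="/sys/class/net"):
    """Snapshot cumulative error/drop counters for a NIC from sysfs.
    (Linux-only; these statistics files are standard kernel names.)"""
    base = Path(sysfs) / iface / "statistics"
    return {name: int((base / name).read_text())
            for name in ("rx_errors", "tx_errors",
                         "rx_dropped", "tx_dropped")}

def nic_is_flapping(before, after, threshold=100):
    """Compare two counter snapshots taken some interval apart and flag
    the NIC if any error counter grew by more than `threshold`.
    The threshold is an arbitrary example value, not a recommendation."""
    return any(after[k] - before[k] > threshold for k in before)
```

A monitoring agent would call `read_nic_errors` on each node every minute and alert via `nic_is_flapping`, which is roughly the "something else that is really designed for it" Robert describes, rather than building the check into Ceph itself.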
