I've seen one misbehaving OSD stop all the I/O in a cluster... I've had
a situation where everything seemed fine with the OSD and its node, but the
cluster was grinding to a halt. There was no iowait, the disk wasn't busy,
no recoveries were running, the OSD was up+in, no scrubs... Restart the OSD
and everything recovers like magic...
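For what it's worth, the one symptom that usually does show up in that situation is per-OSD op latency (e.g. from "ceph osd perf" or your metrics system). A rough sketch of flagging the outlier, using hypothetical sample numbers rather than any real Ceph API:

```python
# Sketch: flag an OSD whose commit latency is a large multiple of the
# cluster median. The latencies here are hypothetical sample data; in
# practice they would come from "ceph osd perf" or a metrics pipeline.
from statistics import median

def flag_slow_osds(latencies_ms, factor=5.0, floor_ms=50.0):
    """Return OSD ids whose latency exceeds factor x median (and a floor)."""
    med = median(latencies_ms.values())
    return sorted(
        osd for osd, lat in latencies_ms.items()
        if lat > max(factor * med, floor_ms)
    )

perf = {0: 4.0, 1: 6.0, 2: 5.0, 3: 900.0, 4: 3.0}  # osd.3 is stuck
print(flag_slow_osds(perf))  # -> [3]
```

The floor keeps an otherwise healthy but slightly slower disk from tripping the alert on a very fast cluster.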

On Thu, Aug 27, 2015 at 8:38 PM, Robert LeBlanc <rob...@leblancnet.us>
wrote:

>
> +1
>
> :)
>
> - ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
> On Thu, Aug 27, 2015 at 1:16 PM, Jan Schermer  wrote:
> Well, there's no other way to get performance and SLAs as reliable as 
> traditional storage when what you work with is commodity hardware in a mesh-y 
> configuration.
> And we do like the idea of killing off traditional storage, right? I think the 
> 80s called already and wanted their SAN back...
>
> Jan
>
> > On 27 Aug 2015, at 21:01, Robert LeBlanc  wrote:
> >
> >
> > I know writing to min_size replicas synchronously and to the remaining
> > size-min_size replicas asynchronously has been discussed before and would
> > help here. From what I understand it would require a lot of code changes
> > and goes against Ceph's strong consistency model. I'm not sure it will
> > ever be implemented, although I do love this idea as a way to cut tail
> > latency.
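To make the proposal concrete, here is a toy model of that write path. It is a simulation only, not Ceph code: real Ceph acknowledges a write only after every replica in the acting set commits.

```python
# Toy model of the proposal: acknowledge a write once min_size replicas
# have committed, and let the remaining size - min_size replicas catch up
# asynchronously. Simulation only -- Ceph itself acks a write after *all*
# replicas in the acting set commit.
def write_with_quorum(replica_commit, size=3, min_size=2):
    """replica_commit: list of booleans, True if that replica committed.

    Returns (acked, pending): acked means the client got its ack after
    min_size commits; pending counts replicas still catching up.
    """
    committed = sum(1 for ok in replica_commit[:size] if ok)
    acked = committed >= min_size
    pending = size - committed if acked else 0
    return acked, pending

# One slow replica no longer stalls the client:
print(write_with_quorum([True, True, False]))  # -> (True, 1)
```

The tail-latency benefit falls out of the model: the client's latency becomes that of the min_size-th fastest replica instead of the slowest one.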
> >
> >
> > On Thu, Aug 27, 2015 at 12:48 PM, Jan Schermer  wrote:
> >> Don't kick out the node, just deal with it gracefully and without 
> >> interruption... if the IO has reached the quorum number of OSDs then there's 
> >> no need to block any more, just queue it. Reads can be mirrored or retried 
> >> (which is much quicker, because making writes idempotent, ordered and async 
> >> is pretty hard and expensive).
> >> If there's an easy way to detect an unreliable OSD that flaps - great, let's 
> >> have a warning in ceph health.
> >>
> >> Jan
> >>
> >>> On 27 Aug 2015, at 20:43, Robert LeBlanc  wrote:
> >>>
> >>>
> >>> This has been discussed a few times. The consensus seems to be to make
> >>> sure error rates of NICs and other such metrics are included in your
> >>> monitoring solution. It would also be good to perform periodic network
> >>> tests, like a full-size ping with the don't-fragment flag set between
> >>> all nodes, and have your monitoring solution report on those as well.
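A rough sketch of such a periodic probe: send UDP datagrams and count replies to estimate packet loss. The echo server here runs in-process purely for demonstration; between real nodes you would probe the peer's address, ideally with payloads near your MTU to catch fragmentation problems.

```python
# Rough sketch of a network probe: send UDP datagrams and count replies
# to estimate packet loss. The echo server runs in-process here only so
# the example is self-contained; real probes would target a peer node.
import socket
import threading

def echo_server(sock, count):
    for _ in range(count):
        data, addr = sock.recvfrom(65535)
        sock.sendto(data, addr)

def measure_loss(target, probes=20, payload=1400, timeout=0.5):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(timeout)
    lost = 0
    for i in range(probes):
        msg = i.to_bytes(4, "big") + b"\x00" * (payload - 4)
        s.sendto(msg, target)
        try:
            reply, _ = s.recvfrom(65535)
            if reply != msg:
                lost += 1
        except socket.timeout:
            lost += 1
    s.close()
    return 100.0 * lost / probes

srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
srv.bind(("127.0.0.1", 0))
threading.Thread(target=echo_server, args=(srv, 20), daemon=True).start()
loss = measure_loss(srv.getsockname(), probes=20)
print("packet loss: %.0f%%" % loss)
```

Feeding the loss percentage into the same alerting pipeline as your NIC error counters would have flagged a half-dead GBIC long before the cluster noticed.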
> >>>
> >>> Although I would like to see such a feature in Ceph, the concern is
> >>> that such a feature could quickly get out of hand, and that it should
> >>> be handled by something that is really designed for the job. I can
> >>> understand where they are coming from in that regard, but having Ceph
> >>> kick out a misbehaving node quickly is appealing as well (there would
> >>> have to be a cap on how many nodes could be kicked out).
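That "only so many nodes" guard might look something like the sketch below. It is pure illustration, not any real Ceph interface: the point is that a monitoring bug can never evict more than a fixed fraction of the cluster.

```python
# Sketch of a guarded auto-eviction policy: mark a misbehaving OSD out
# only while the fraction already marked out stays under a cap. Purely
# illustrative; none of this is a real Ceph interface.
def can_mark_out(total_osds, already_out, max_out_fraction=0.1):
    """True if one more OSD may be auto-marked out under the cap."""
    return (already_out + 1) / total_osds <= max_out_fraction

def evict_misbehaving(total_osds, already_out, suspects, cap=0.1):
    evicted = []
    for osd in suspects:
        if can_mark_out(total_osds, already_out + len(evicted), cap):
            evicted.append(osd)
    return evicted

# 30 OSDs, cap 10%: at most 3 may be auto-evicted even if 5 look bad.
print(evict_misbehaving(30, 0, [3, 7, 11, 19, 23]))  # -> [3, 7, 11]
```

Anything past the cap would stay in and only raise a health warning, leaving the judgment call to an operator.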
> >>>
> >>>
> >>> On Thu, Aug 27, 2015 at 9:37 AM, Christoph Adomeit  wrote:
> >>>> Hello Ceph Users,
> >>>>
> >>>> yesterday I had a defective GBIC in one node of my 10-node Ceph cluster.
> >>>>
> >>>> The GBIC was working somehow but had 50% packet loss. Some packets went 
> >>>> through, some did not.
> >>>>
> >>>> What happened was that the whole cluster did not service requests in 
> >>>> time; there were lots of timeouts and so on until the problem was 
> >>>> isolated. Monitors and OSDs were asked for data but did not answer, 
> >>>> or answered late.
> >>>>
> >>>> I am wondering: here we have a highly redundant network setup and a 
> >>>> highly redundant piece of software, yet a small network fault brings 
> >>>> down the whole cluster.
> >>>>
> >>>> Is there anything that can be configured or changed in Ceph so that 
> >>>> availability becomes better in the case of a flapping network?
> >>>>
> >>>> I understand it is not a Ceph problem but a network problem, but maybe 
> >>>> something can be learned from such incidents?
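There are a few ceph.conf knobs that at least bound the damage from a flapping link. A hedged sketch (option names from the Hammer-era docs; double-check the values against your release):

```ini
[mon]
; require DOWN reports from more than one peer before believing them,
; so a single OSD with a bad link can't mark its healthy peers down
mon osd min down reporters = 2

; how long an OSD may stay down before it is marked out and
; recovery starts (seconds)
mon osd down out interval = 300

[osd]
; how long to wait for a heartbeat reply before reporting a peer down;
; lower values detect a dead link faster but flap more on a lossy one
osd heartbeat grace = 20
```

None of these fix a half-dead GBIC, but requiring multiple down reporters keeps the one node that can't hear anybody from taking the rest of the cluster down with it.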
> >>>>
> >>>> Thanks
> >>>> Christoph
> >>>> --
> >>>> Christoph Adomeit
> >>>> GATWORKS GmbH
> >>>> Reststrauch 191
> >>>> 41199 Moenchengladbach
> >>>> Sitz: Moenchengladbach
> >>>> Amtsgericht Moenchengladbach, HRB 6303
> >>>> Geschaeftsfuehrer:
> >>>> Christoph Adomeit, Hans Wilhelm Terstappen
> >>>>
> >>>> christoph.adom...@gatworks.de     Internetloesungen vom Feinsten
> >>>> Fon. +49 2166 9149-32                      Fax. +49 2166 9149-10
>
> >>>> _______________________________________________
> >>>> ceph-users mailing list
> >>>> ceph-users@lists.ceph.com
> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>
> >>
> >
>
>