Re: [ceph-users] problem returning mon back to cluster
Hi, just wanted to add some info:

1) I was able to work around the problem (as advised by Harald) by increasing mon_lease to 50s, waiting for the monitor to join the cluster (it took hours!) and decreasing it again.

2) since then we got hit by the same problem on a different cluster: same symptoms, same workaround.

3) I was able to reproduce the problem 100% reliably on a cleanly installed Ceph environment in virtual machines with the same addressing and copied monitor data. If any of the developers is interested, I could give direct SSH access.

shall I file a bug report for this?

thanks

nik

On Tue, Oct 15, 2019 at 07:17:38AM +0200, Nikola Ciprich wrote:
> [snip: full quote of the earlier messages in this thread]
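For reference, the workaround described above can be sketched as a few commands run from a mon node. This is a sketch, not the exact commands used in the thread; it assumes systemd-managed monitors and the usual admin keyring:

```sh
# Raise the mon lease from the default 5s to 50s at runtime on all
# running monitors, so paxos elections survive the long sync.
ceph tell mon.* injectargs '--mon_lease 50'

# Start the problematic monitor and wait for it to finish synchronizing
# and join quorum (this took hours in the case described above).
systemctl start ceph-mon@$(hostname -s)

# Once quorum is stable again, restore the default lease.
ceph tell mon.* injectargs '--mon_lease 5'
```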
Re: [ceph-users] problem returning mon back to cluster
On Tue, Oct 15, 2019 at 06:50:31AM +0200, Nikola Ciprich wrote:
> yup, I forgot to mention this.. it doesn't seem to be too big, just about
> 100MB. I also noticed that while the third monitor tries to join the cluster,
> the leader starts flapping between "leader" and "electing", so I suppose it's
> a quorum-forming problem.. I tried bumping debug_ms and debug_paxos but
> couldn't make heads or tails of it.. I can paste the logs somewhere if that
> can help

btw I just noticed that on the test cluster, the third mon finally managed to join
the cluster and quorum formed.. after more than 6 hours.. knowing that during
that time, client IO is blocked, it's pretty scary

now I can stop/start monitors without problems on it.. so it somehow got
"fixed"

still dunno what to do with this production cluster though, so I'll just prepare
the test environment again and try digging more into it

BR

nik

> [snip: full quote of the earlier messages in this thread]
Re: [ceph-users] problem returning mon back to cluster
On Mon, Oct 14, 2019 at 11:52:55PM +0200, Paul Emmerich wrote:
> How big is the mon's DB? As in, just the total size of the directory you
> copied.
>
> FWIW I recently had to perform mon surgery on a 14.2.4 (or was it
> 14.2.2?) cluster with an 8 GB mon store, and I encountered no such problems
> while syncing a new mon, which took 10 minutes or so.

Hi Paul,

yup, I forgot to mention this.. it doesn't seem to be too big, just about
100MB. I also noticed that while the third monitor tries to join the cluster,
the leader starts flapping between "leader" and "electing", so I suppose it's
a quorum-forming problem.. I tried bumping debug_ms and debug_paxos but
couldn't make heads or tails of it.. I can paste the logs somewhere if that
can help

BR

nik

> [snip: Paul's signature and the full quote of the earlier messages in this thread]
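To answer a question like Paul's about the mon store size, `du -sh` on the mon data directory is enough; the same measurement in Python looks roughly like the sketch below. The store path shown is the default location for a mon named after the host and may differ on your deployment:

```python
import os

def dir_size_bytes(path: str) -> int:
    """Total size of all regular files under `path`, like `du -sb`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            # skip symlinks so nothing is counted twice
            if os.path.isfile(fp) and not os.path.islink(fp):
                total += os.path.getsize(fp)
    return total

if __name__ == "__main__":
    # Default mon store location; adjust the mon name for your host.
    store = "/var/lib/ceph/mon/ceph-" + os.uname().nodename
    print("%.1f MiB" % (dir_size_bytes(store) / 1024 / 1024))
```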
Re: [ceph-users] problem returning mon back to cluster
On Mon, Oct 14, 2019 at 04:31:22PM +0200, Nikola Ciprich wrote:
> [snip: full quote of the earlier messages in this thread]

just to add a quick update: I was able to reproduce the issue by transferring the
monitor directories to a test environment with the same IP addressing, so I can
safely play with it now.

increasing the lease timeout didn't help me fix the problem,
but at least I seem to be able to use ceph -s now.

few things I noticed in the meantime:

- when I start the problematic monitor, monitor slow ops start to appear for the
  quorum leader, and the count slowly increases:

    44 slow ops, oldest one blocked for 130 sec, mon.nodev1c has slow ops

- removing and recreating the monitor didn't help

- checking mon_status of the problematic monitor shows it remains in the
  "synchronizing" state

I tried increasing debug_ms and debug_paxos but didn't see anything useful
there.. will report further when I've got something. If anyone has an idea in
the meantime, please let me know.

BR

nik

--
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 00 Ostrava

tel.: +420 591 166 214
fax: +420 596 621 273
mobil: +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
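The "synchronizing" state and the leader's slow ops mentioned above can be watched live through the mon admin sockets. A sketch, assuming you run it on the node hosting each monitor and the mon id matches the short hostname (the `ops` admin-socket command may not exist on very old releases):

```sh
# Watch the sync state of the joining monitor via its admin socket.
watch -n 5 "ceph daemon mon.$(hostname -s) mon_status | grep '\"state\"'"

# On the leader's node, dump the in-flight (slow) monitor ops.
ceph daemon mon.$(hostname -s) ops
```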
Re: [ceph-users] problem returning mon back to cluster
On Mon, Oct 14, 2019 at 01:40:19PM +0200, Harald Staub wrote:
> Probably same problem here. When I try to add another MON, "ceph
> health" becomes mostly unresponsive. One of the existing ceph-mon
> processes uses 100% CPU for several minutes. Tried it on 2 test
> clusters (14.2.4, 3 MONs, 5 storage nodes with around 2 hdd osds
> each). To avoid errors like "lease timeout", I temporarily increase
> "mon lease" from 5 to 50 seconds.
>
> Not sure how bad it is from a customer PoV. But it is a problem by
> itself to be several minutes without "ceph health" when there is an
> increased risk of losing the quorum ...

Hi Harry,

thanks a lot for your reply! I'm not sure we're experiencing the same issue;
I don't have it on any other cluster.. when this happens to you, does
only "ceph health" stop working, or does it also block all client IO?

BR

nik

> [snip: full quote of the original post]
[ceph-users] problem returning mon back to cluster
dear ceph users and developers,

on one of our production clusters, we got into a pretty unpleasant situation.

After rebooting one of the nodes, when trying to start the monitor, the whole
cluster seems to hang, including IO, ceph -s etc. When this mon is stopped
again, everything seems to continue. Trying to spawn a new monitor leads to the
same problem (even on a different node).

I had to give up after minutes of outage, since it's unacceptable. I think we
had this problem once in the past on this cluster, but after some (much
shorter) time, the monitor joined and it has worked fine since then.

All cluster nodes are CentOS 7 machines, I have 3 monitors (so 2 are now
running), and I'm using ceph 13.2.6.

The network connection seems to be fine.

Has anyone seen a similar problem? I'd be very grateful for tips on how to
debug and solve this..

for those interested, here's a log of one of the running monitors with
debug_mon set to 10/10:

https://storage.lbox.cz/public/d258d0

if I can provide more info, please let me know

with best regards

nikola ciprich
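The debug_mon 10/10 log mentioned above can be captured with runtime log-level changes; a sketch of the commands (the levels shown after "back to defaults" are the stock defaults, which may differ if you have overrides in ceph.conf):

```sh
# Raise monitor debug logging at runtime on all running mons.
ceph tell mon.* injectargs '--debug_mon 10/10'

# For paxos/messenger detail as well:
ceph tell mon.* injectargs '--debug_paxos 10 --debug_ms 1'

# Back to defaults once the log is captured (logs grow quickly).
ceph tell mon.* injectargs '--debug_mon 1/5 --debug_paxos 1/5 --debug_ms 0/5'
```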
[ceph-users] unable to manually flush cache: failed to flush /xxx: (2) No such file or directory
Hi,

we're having an issue on one of our clusters: while trying to remove a cache
tier, manually flushing the cache always ends up with errors:

rados -p ssd-cache cache-flush-evict-all
.
.
.
failed to flush /rb.0.965780.238e1f29.1641: (2) No such file or directory
rb.0.965780.238e1f29.02c8
failed to flush /rb.0.965780.238e1f29.02c8: (2) No such file or directory
rb.0.965780.238e1f29.9113
failed to flush /rb.0.965780.238e1f29.9113: (2) No such file or directory
rb.0.965780.238e1f29.9b0f
failed to flush /rb.0.965780.238e1f29.9b0f: (2) No such file or directory
rb.0.965780.238e1f29.62b6
failed to flush /rb.0.965780.238e1f29.62b6: (2) No such file or directory
rb.0.965780.238e1f29.030c
.
.
.

The cluster is healthy, running 13.2.5.

any idea what might be wrong? should I provide more details, please let me know

BR

nik
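For context, the usual sequence for draining and removing a writeback cache tier looks like the sketch below. The backing pool name `data` is a placeholder (the thread only names the cache pool, `ssd-cache`):

```sh
# Stop new objects from entering the cache, then drain it.
ceph osd tier cache-mode ssd-cache forward --yes-i-really-mean-it
rados -p ssd-cache cache-flush-evict-all

# Once the cache pool is empty, detach and remove the tier.
ceph osd tier remove-overlay data
ceph osd tier remove data ssd-cache
```

The "(2) No such file or directory" errors above happen during the second step, so the overlay cannot be removed until they are resolved.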
[ceph-users] rbd: error processing image xxx (2) No such file or directory
Hi,

on one of my clusters, I'm getting an error message which is making me a bit
nervous.. while listing the contents of a pool, I get an error for one of the
images:

[root@node1 ~]# rbd ls -l nvme > /dev/null
rbd: error processing image xxx: (2) No such file or directory

[root@node1 ~]# rbd info nvme/xxx
rbd image 'xxx':
        size 60 GiB in 15360 objects
        order 22 (4 MiB objects)
        id: 132773d6deb56
        block_name_prefix: rbd_data.132773d6deb56
        format: 2
        features: layering, operations
        op_features: snap-trash
        flags:
        create_timestamp: Wed Aug 29 12:25:13 2018

The volume contains production data and seems to be working correctly (it's
used by a VM).

is this something to worry about? what is the snap-trash feature? I wasn't
able to google much about it..

I'm running ceph 13.2.4 on CentOS 7. I'd be grateful for any help.

BR

nik
Re: [ceph-users] bad crc/signature errors
> > Hi Ilya,
> >
> > hmm, OK, I'm not sure now whether this is the bug I'm
> > experiencing.. I've had a read_partial_message / bad crc/signature
> > occurrence on a second cluster in a short period, even though
> > we've been on the same ceph version (12.2.5) for quite a long time
> > (almost since its release), so it's starting to pain me.. I suppose this
> > must have been caused by some kernel update (we're currently sticking
> > to 4.14.x and have lately been upgrading to 4.14.50)
>
> These "bad crc/signature" errors are usually the sign of faulty hardware.
>
> What was the last "good" kernel and the first "bad" kernel?
>
> You said "on the second cluster". How is it different from the first?
> Are you using the kernel client with both? Is there Xen involved?

it's complicated.. both those clusters are fairly new, running kernel 4.14.50
and ceph 12.2.5. Xen is not involved, but KVM is. I think they were already
installed with this kernel.

I was thinking about it, and the main difference compared to our other (and
older) clusters is that krbd is used much more: before, we were using krbd
only for postgres, and qemu-kvm accessed RBD volumes using librbd. On the new
clusters where the problems occurred, all volumes are accessed using krbd,
since it performs much better..

so we'll just revert to librbd and I'll try to find a way to reproduce. If I
find one, we can talk about a bisect, but it's possible the problem has been
here for a long time and just didn't occur because we didn't use krbd
heavily.. I think we can rule out a hardware problem here, though.

> Thanks,
>
> Ilya
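The two attachment styles being switched between above look roughly like this (a sketch; pool, image, and client id are placeholders, and real qemu invocations carry many more options):

```sh
# librbd: qemu opens the image through librbd directly.
qemu-system-x86_64 ... -drive format=raw,file=rbd:nvme/vm-disk:id=admin

# krbd: map the image to a kernel block device first, then hand that to qemu.
rbd map nvme/vm-disk
qemu-system-x86_64 ... -drive format=raw,file=/dev/rbd/nvme/vm-disk
```

With krbd the IO path goes through the kernel client, which is why a kernel-side crc bug would only show up on clusters using the second style.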
Re: [ceph-users] bad crc/signature errors
Hi Ilya,

hmm, OK, I'm not sure now whether this is the bug I'm experiencing.. I've had a
read_partial_message / bad crc/signature occurrence on a second cluster in a
short period, even though we've been on the same ceph version (12.2.5) for
quite a long time (almost since its release), so it's starting to pain me.. I
suppose this must have been caused by some kernel update (we're currently
sticking to 4.14.x and have lately been upgrading to 4.14.50)

not sure whether this is of some use..

BR

nik

On Mon, Aug 13, 2018 at 03:22:21PM +0200, Ilya Dryomov wrote:
> On Mon, Aug 13, 2018 at 2:49 PM Nikola Ciprich wrote:
> > [snip]
>
> Those are completely different crc errors. The ones Paul is talking
> about occur in bluestore when fetching data from the underlying disk.
> When they occur, there is no data to reply with to the client. Paul's
> pull request is working around that (likely a bug in the core kernel)
> by adding up to two retries.
>
> The ones this thread is about occur on the client side when receiving
> a reply from the OSD. The retry logic is already there: the connection
> is cut, the client reconnects and resends the OSD request.
>
> Thanks,
>
> Ilya
Re: [ceph-users] bad crc/signature errors
Hi Paul,

thanks, I'll give it a try.. do you think this might head to upstream soon? for
some reason I can't view the review comments for this patch on github.. is some
new version of this patch on the way, or can I try to apply this one to the
latest luminous?

thanks a lot!

nik

On Fri, Aug 10, 2018 at 06:05:26PM +0200, Paul Emmerich wrote:
> I've built a work-around here:
> https://github.com/ceph/ceph/pull/23273
>
> Paul
>
> [snip: full quote of the earlier messages in this thread and Paul's signature]
Re: [ceph-users] Re : Re : Re : bad crc/signature errors
Hi,

did this ever come to some conclusion? I've recently started seeing those
messages on one luminous cluster and I'm not sure whether they are dangerous
or not..

BR

nik

On Fri, Oct 06, 2017 at 05:37:00PM +0200, Olivier Bonvalet wrote:
> On Thursday, October 5, 2017 at 21:52 +0200, Ilya Dryomov wrote:
> > On Thu, Oct 5, 2017 at 6:05 PM, Olivier Bonvalet wrote:
> > > On Thursday, October 5, 2017 at 17:03 +0200, Ilya Dryomov wrote:
> > > > When did you start seeing these errors? Can you correlate that
> > > > to a ceph or kernel upgrade? If not, and if you don't see other
> > > > issues, I'd write it off as faulty hardware.
> > >
> > > Well... I have one hypervisor (Xen 4.6 and kernel Linux 4.1.13), which
> >
> > Is that 4.1.13 or 4.13.1?
>
> Linux 4.1.13. The old Debian 8, with Xen 4.6 from upstream.
>
> > > has had the problem for a long time, at least since 1 month (I don't
> > > have older logs).
> > >
> > > But on other hypervisors (Xen 4.8 with Linux 4.9.x), I don't have the
> > > problem. And it was when I upgraded those hypervisors to Linux 4.13.x
> > > that the "bad crc" errors appeared.
> > >
> > > Note: I upgraded the kernels on the Xen 4.8 hypervisors because some
> > > DISCARD commands over RBD were blocking ("fstrim" works, but not
> > > "lvremove" with discard enabled). After upgrading to Linux 4.13.3,
> > > DISCARD works again on Xen 4.8.
> >
> > Which kernel did you upgrade from to 4.13.3 exactly?
>
> 4.9.47 or 4.9.52, I don't have more precise data about this.
Re: [ceph-users] krbd vs librbd performance with qemu
> > opts="--randrepeat=1 --ioengine=rbd --direct=1 --numjobs=${numjobs}
> > --gtod_reduce=1 --name=test --pool=${pool} --rbdname=${vol} --invalidate=0
> > --bs=4k --iodepth=64 --time_based --runtime=$time --group_reporting"
>
> So that "--numjobs" parameter is what I was referring to when I said
> multiple jobs will cause a huge performance hit. This causes fio to open the
> same image X times, so with (nearly) each write operation, the
> exclusive-lock is being moved from client to client. Instead of multiple
> jobs against the same image, you should use multiple images.

ah, I see, I didn't realize that... thanks a lot for the valuable info!

n.
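Jason's advice above (one image per job instead of numjobs on a single image) can be sketched as a fio job file. The image names `bench1`/`bench2` and `clientname=admin` are placeholders; each job then holds the exclusive lock on its own image, so the lock never ping-pongs between clients:

```ini
; one fio job per RBD image, instead of numjobs against a single image
[global]
ioengine=rbd
clientname=admin
pool=nvme
direct=1
bs=4k
iodepth=64
rw=randwrite
runtime=30
time_based
group_reporting

[img1]
rbdname=bench1

[img2]
rbdname=bench2
```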
Re: [ceph-users] krbd vs librbd performance with qemu
> Care to share your "bench-rbd" script (on pastebin or similar)? sure, no problem.. it's so short I hope nobody will get offended if I paste it right here :)
#!/bin/bash
#export LD_PRELOAD="/usr/lib64/libtcmalloc.so.4"
numjobs=8
pool=nvme
vol=xxx
time=30
opts="--randrepeat=1 --ioengine=rbd --direct=1 --numjobs=${numjobs} --gtod_reduce=1 --name=test --pool=${pool} --rbdname=${vol} --invalidate=0 --bs=4k --iodepth=64 --time_based --runtime=$time --group_reporting"
sopts="--randrepeat=1 --ioengine=rbd --direct=1 --numjobs=1 --gtod_reduce=1 --name=test --pool=${pool} --rbdname=${vol} --invalidate=0 --bs=256k --iodepth=64 --time_based --runtime=$time --group_reporting"
#fio $sopts --readwrite=read --output=rbd-fio-seqread.log
echo
#fio $sopts --readwrite=write --output=rbd-fio-seqwrite.log
echo
fio $opts --readwrite=randread --output=rbd-fio-randread.log
echo
fio $opts --readwrite=randwrite --output=rbd-fio-randwrite.log
echo
hope it's of some use.. n.
Re: [ceph-users] krbd vs librbd performance with qemu
> What's the output from "rbd info nvme/centos7"? that was it! the parent had some unsupported features enabled, therefore the child could not be mapped.. so the error message is a bit confusing, but now, after disabling the features on the parent, it works for me, thanks! > Odd. The exclusive-lock code is only executed once (in general) upon the > first write IO (or immediately upon mapping the image if the "exclusive" > option is passed to the kernel). Therefore, it should have zero impact on > IO performance. hmm, then I might have found a bug.. [root@v4a bench1]# sh bench-rbd Jobs: 8 (f=8): [r(8)][100.0%][r=671MiB/s,w=0KiB/s][r=172k,w=0 IOPS][eta 00m:00s] Jobs: 8 (f=8): [w(8)][100.0%][r=0KiB/s,w=230MiB/s][r=0,w=58.8k IOPS][eta 00m:00s] [root@v4a bench1]# rbd feature enable nvme/xxx exclusive-lock [root@v4a bench1]# sh bench-rbd Jobs: 8 (f=8): [r(8)][100.0%][r=651MiB/s,w=0KiB/s][r=167k,w=0 IOPS][eta 00m:00s] Jobs: 8 (f=8): [w(8)][100.0%][r=0KiB/s,w=45.9MiB/s][r=0,w=11.7k IOPS][eta 00m:00s] (as you can see, the performance impact is even worse..) I guess I should create a bug report for this one? nik > -- > Jason
Re: [ceph-users] krbd vs librbd performance with qemu
Hi Jason, > Just to clarify: modern / rebased krbd block drivers definitely support > layering. The only missing features right now are object-map/fast-diff, > deep-flatten, and journaling (for RBD mirroring). I thought so as well, but at least mapping a clone does not work for me even under 4.17.6: [root@v4a ~]# rbd map nvme/xxx rbd: sysfs write failed RBD image feature set mismatch. You can disable features unsupported by the kernel with "rbd feature disable nvme/xxx". In some cases useful info is found in syslog - try "dmesg | tail". rbd: map failed: (6) No such device or address (note the incorrect hint on how this is supposed to be fixed: a "feature disable" command without any feature listed) dmesg output: [ +3.919281] rbd: image xxx: WARNING: kernel layering is EXPERIMENTAL! [ +0.001266] rbd: id 36dde238e1f29: image uses unsupported features: 0x38 [root@v4a ~]# rbd info nvme/xxx rbd image 'xxx': size 20480 MB in 5120 objects order 22 (4096 kB objects) block_name_prefix: rbd_data.6a71313887ee0 format: 2 features: layering flags: create_timestamp: Wed Jun 20 13:46:38 2018 parent: nvme/centos7@template overlap: 20480 MB is 4.18-rc5 worth giving a try? > If you are running multiple fio jobs against the same image (or have the > krbd device mapped to multiple hosts w/ active IO), then I would expect a > huge performance hit since the lock needs to be transitioned between > clients. nope, only one running fio instance, no users on the other node.. BR nik
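The "unsupported features: 0x38" line in the dmesg output can be decoded against the RBD image feature bits. The bit values below are the RBD_FEATURE_* flags as librbd defined them at the time (worth double-checking against your ceph version); a small sketch:

```python
# Decode the "image uses unsupported features: 0x38" mask from dmesg.
# Bit values are the RBD image feature flags (RBD_FEATURE_* constants);
# assumed from the librbd headers of that era.
RBD_FEATURES = {
    0x01: "layering",
    0x02: "striping",
    0x04: "exclusive-lock",
    0x08: "object-map",
    0x10: "fast-diff",
    0x20: "deep-flatten",
    0x40: "journaling",
}

def decode_features(mask):
    """Return the names of the feature bits set in the given mask."""
    return [name for bit, name in sorted(RBD_FEATURES.items()) if mask & bit]

# 0x38 = object-map + fast-diff + deep-flatten, i.e. exactly features
# krbd did not support -- here inherited from the parent image.
print(decode_features(0x38))
```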
[ceph-users] krbd vs librbd performance with qemu
Hi, historically I've found many discussions about this topic over the last few years, but it seems to me to be still a bit unresolved, so I'd like to open the question again.. In an all-flash deployment, under luminous 12.2.5 and qemu 12.2.0 using librbd, I'm getting much worse IOPS results than with krbd and direct block device access.. I'm testing on the same 100GB RBD volume, notable ceph settings: client rbd cache disabled osd_enable_op_tracker = False osd_op_num_shards = 64 osd_op_num_threads_per_shard = 1 osds are running bluestore, 2 replicas (it's just for testing) when I run fio using librbd directly, I'm getting ~160k reads/s and ~60k writes/s which is not that bad. however when I run fio on a block device inside a VM (qemu using librbd), I'm getting only 60/40k op/s which is a huge loss.. when I use a VM with block access to a krbd-mapped device, numbers are much better, I'm getting something like 115/40k op/s which is not ideal, but still much better.. I've tried many optimisations and configuration variants (multiple queues, threads vs native aio etc), but krbd still performs much much better.. My question is whether this is expected, or should both access methods give more similar results? If possible, I'd like to stick to librbd (especially because krbd still lacks layering support, but there are more reasons) interestingly, when I compare direct ceph access with fio, librbd performs better than krbd, but this doesn't concern me that much.. another question: during the tests, I noticed that enabling the exclusive-lock feature degrades write iops a lot as well, is this expected? (the performance falls to something like 50%) I'm doing the tests on a small 2-node cluster, VMs are running directly on the ceph nodes, all is centos 7 with a 4.14 kernel. (I know it's not recommended to run VMs directly on ceph nodes, but for small deployments it's necessary for us) if I could provide more details, I'll be happy to do so BR nik -- - Ing. Nikola CIPRICH LinuxBox.cz, s.r.o.
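The tuning options listed above would look roughly like the following ceph.conf fragment. This is a sketch only: section placement is an assumption, and the exact option names should be checked against your release before use.

```ini
[client]
# "client rbd cache disabled" from the post above
rbd cache = false

[osd]
# op tracker / sharding tweaks mentioned in the post
osd_enable_op_tracker = false
osd_op_num_shards = 64
osd_op_num_threads_per_shard = 1
```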
28.rijna 168, 709 00 Ostrava tel.: +420 591 166 214 fax:+420 596 621 273 mobil: +420 777 093 799 www.linuxbox.cz mobil servis: +420 737 238 656 email servis: ser...@linuxbox.cz - ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] luminous - 12.2.1 - stale RBD locks after client crash
Hello Jason, you're right! I've done the upgrade according to the docs you've mentioned, but I must have overlooked this step with caps completely.. thanks a lot for the help! with best regards nik On Wed, Nov 22, 2017 at 07:52:31AM -0500, Jason Dillaman wrote: > See previous threads about this subject [1][2] and see step 6 in the > upgrade notes [3]. > > [1] > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/020722.html > [2] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg41718.html > [3] > http://docs.ceph.com/docs/master/release-notes/#upgrade-from-jewel-or-kraken > > On Wed, Nov 22, 2017 at 2:50 AM, Nikola Ciprich > <nikola.cipr...@linuxbox.cz> wrote: > > Hello ceph users and developers, > > > > I've stumbled upon a bit strange problem with Luminous. > > > > One of our servers running multiple QEMU clients crashed. > > When we tried restarting those on another cluster node, > > we got lots of fsck errors, disks seemed to return "physical" > > block errors. I figured this out to be stale RBD locks on volumes > > from the crashed machine. When I removed the locks, everything > > started to work. (for some volumes, I was fixing those the next > > day after the crash, so it was >10-15 hours later) > > > > My question is, is this a bug or a feature? I mean, after the client > > crashes, should locks somehow expire, or do they need to be removed > > by hand? I don't remember having this issue with older ceph versions, > > but I suppose we didn't have the exclusive-lock feature enabled.. > > > > I'll be very grateful for any reply > > > > with best regards > > > > nik > > -- > > - > > Ing. Nikola CIPRICH > > LinuxBox.cz, s.r.o.
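The overlooked caps step boils down to giving RBD clients a mon cap that permits blacklisting dead lock holders, so that locks left by crashed clients get cleaned up automatically. A sketch of the command (client and pool names are made up; the command is printed rather than executed, since it needs a live cluster and an admin keyring):

```shell
#!/bin/sh
# Sketch of the "step 6" caps fix from the Luminous upgrade notes:
# switch the client's mon cap to 'profile rbd' so it may blacklist
# stale lock holders. client.qemu and pool "volumes" are hypothetical.
client="client.qemu"
pool="volumes"

cmd="ceph auth caps $client mon 'profile rbd' osd 'profile rbd pool=$pool'"

# Print only; run this for real against your cluster after checking
# the client's existing caps with "ceph auth get $client".
echo "$cmd"
```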
> > 28.rijna 168, 709 00 Ostrava > > > > tel.: +420 591 166 214 > > fax:+420 596 621 273 > > mobil: +420 777 093 799 > > www.linuxbox.cz > > > > mobil servis: +420 737 238 656 > > email servis: ser...@linuxbox.cz > > --------- > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > -- > Jason > -- - Ing. Nikola CIPRICH LinuxBox.cz, s.r.o. 28.rijna 168, 709 00 Ostrava tel.: +420 591 166 214 fax:+420 596 621 273 mobil: +420 777 093 799 www.linuxbox.cz mobil servis: +420 737 238 656 email servis: ser...@linuxbox.cz - ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] luminous - 12.2.1 - stale RBD locks after client crash
Hello ceph users and developers, I've stumbled upon a bit strange problem with Luminous. One of our servers running multiple QEMU clients crashed. When we tried restarting those on another cluster node, we got lots of fsck errors, disks seemed to return "physical" block errors. I figured this out to be stale RBD locks on volumes from the crashed machine. When I removed the locks, everything started to work. (for some volumes, I was fixing those the next day after the crash, so it was >10-15 hours later) My question is, is this a bug or a feature? I mean, after the client crashes, should locks somehow expire, or do they need to be removed by hand? I don't remember having this issue with older ceph versions, but I suppose we didn't have the exclusive-lock feature enabled.. I'll be very grateful for any reply with best regards nik -- ----- Ing. Nikola CIPRICH LinuxBox.cz, s.r.o. 28.rijna 168, 709 00 Ostrava tel.: +420 591 166 214 fax:+420 596 621 273 mobil: +420 777 093 799 www.linuxbox.cz mobil servis: +420 737 238 656 email servis: ser...@linuxbox.cz -
Re: [ceph-users] jewel - recovery keeps stalling (continues after restarting OSDs)
Hi, I tried balancing number of OSDs per node, set their weights the same, increased op recovery priority, but it still takes ages to recover.. I've got my cluster OK now, so I'll try switching to kraken to see if it behaves better.. nik On Mon, Aug 07, 2017 at 11:36:10PM +0800, cgxu wrote: > I encountered same issue today and I solved problem by adjusting "osd > recovery op priority” to 63 temporarily. > > It looks like recovery PUSH/PULL op starved in op_wq prioritized queue and > I’ve never experienced in hammer version. > > Any other idea? > > > > Hi, > > > > I'm trying to find reason for strange recovery issues I'm seeing on > > our cluster.. > > > > it's mostly idle, 4 node cluster with 26 OSDs evenly distributed > > across nodes. jewel 10.2.9 > > > > the problem is that after some disk replaces and data moves, recovery > > is progressing extremely slowly.. pgs seem to be stuck in > > active+recovering+degraded > > state: > > > > [root@v1d ~]# ceph -s > > cluster a5efbc87-3900-4c42-a977-8c93f7aa8c33 > > health HEALTH_WARN > > 159 pgs backfill_wait > > 4 pgs backfilling > > 259 pgs degraded > > 12 pgs recovering > > 113 pgs recovery_wait > > 215 pgs stuck degraded > > 266 pgs stuck unclean > > 140 pgs stuck undersized > > 151 pgs undersized > > recovery 37788/2327775 objects degraded (1.623%) > > recovery 23854/2327775 objects misplaced (1.025%) > > noout,noin flag(s) set > > monmap e21: 3 mons at > > {v1a=10.0.0.1:6789/0,v1b=10.0.0.2:6789/0,v1c=10.0.0.3:6789/0} > > election epoch 6160, quorum 0,1,2 v1a,v1b,v1c > > fsmap e817: 1/1/1 up {0=v1a=up:active}, 1 up:standby > > osdmap e76002: 26 osds: 26 up, 26 in; 185 remapped pgs > > flags noout,noin,sortbitwise,require_jewel_osds > > pgmap v80995844: 3200 pgs, 4 pools, 2876 GB data, 757 kobjects > > 9215 GB used, 35572 GB / 45365 GB avail > > 37788/2327775 objects degraded (1.623%) > > 23854/2327775 objects misplaced (1.025%) > > 2912 active+clean > > 130 active+undersized+degraded+remapped+wait_backfill > > 97 
active+recovery_wait+degraded > > 29 active+remapped+wait_backfill > > 12 active+recovery_wait+undersized+degraded+remapped > >6 active+recovering+degraded > >5 active+recovering+undersized+degraded+remapped > >4 active+undersized+degraded+remapped+backfilling > >4 active+recovery_wait+degraded+remapped > >1 active+recovering+degraded+remapped > > client io 2026 B/s rd, 146 kB/s wr, 9 op/s rd, 21 op/s wr > > > > > > when I restart affected OSDs, it bumps the recovery, but then another > > PGs get stuck.. All OSDs were restarted multiple times, none are even close > > to > > nearfull, I just cant find what I'm doing wrong.. > > > > possibly related OSD options: > > > > osd max backfills = 4 > > osd recovery max active = 15 > > debug osd = 0/0 > > osd op threads = 4 > > osd backfill scan min = 4 > > osd backfill scan max = 16 > > > > Any hints would be greatly appreciated > > > > thanks > > > > nik > > > > > > -- > > - > > Ing. Nikola CIPRICH > > LinuxBox.cz, s.r.o. > > 28.rijna 168, 709 00 Ostrava > > > > tel.: +420 591 166 214 > > fax:+420 596 621 273 > > mobil: +420 777 093 799 > > www.linuxbox.cz <http://www.linuxbox.cz/> > > > > mobil servis: +420 737 238 656 > > email servis: ser...@linuxbox.cz > > - > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com> > > > -- - Ing. Nikola CIPRICH LinuxBox.cz, s.r.o. 28.rijna 168, 709 00 Ostrava tel.: +420 591 166 214 fax:+420 596 621 273 mobil: +420 777 093 799 www.linuxbox.cz mobil servis: +420 737 238 656 email servis: ser...@linuxbox.cz - ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] jewel - recovery keeps stalling (continues after restarting OSDs)
t;, "items": [ { "id": 1, "weight": 104857, "pos": 0 }, { "id": 3, "weight": 117964, "pos": 1 }, { "id": 9, "weight": 104857, "pos": 2 }, { "id": 11, "weight": 117964, "pos": 3 }, { "id": 24, "weight": 235929, "pos": 4 } ] }, { "id": -6, "name": "v1c", "type_id": 1, "type_name": "host", "weight": 511178, "alg": "straw2", "hash": "rjenkins1", "items": [ { "id": 14, "weight": 104857, "pos": 0 }, { "id": 15, "weight": 117964, "pos": 1 }, { "id": 16, "weight": 91750, "pos": 2 }, { "id": 18, "weight": 91750, "pos": 3 }, { "id": 17, "weight": 104857, "pos": 4 } ] }, { "id": -7, "name": "v1d-ssd", "type_id": 1, "type_name": "host", "weight": 14417, "alg": "straw", "hash": "rjenkins1", "items": [ { "id": 19, "weight": 14417, "pos": 0 } ] }, { "id": -9, "name": "v1c-ssd", "type_id": 1, "type_name": "host", "weight": 26214, "alg": "straw2", "hash": "rjenkins1", "items": [ { "id": 10, "weight": 26214, "pos": 0 } ] }, { "id": -10, "name": "v1a-ssd", "type_id": 1, "type_name": "host", "weight": 39320, "alg": "straw2", "hash": "rjenkins1", "items": [ { "id": 5, "weight": 19660, "pos": 0 }, { "id": 26, "weight": 19660, "pos": 1 } ] }, { "id": -11, "name": "v1b-ssd", "type_id": 1, "type_name": "host", "weight": 22282, "alg": "straw2", "hash": "rjenkins1", "items": [ { "id": 13, "weight": 22282, "pos": 0 } ] } ], "rules": [ { "rule_id": 0, "rule_name": "replicated_ruleset", "ruleset": 0, "type": 1, "min_size": 1, "max_size": 10, "steps": [ { "op": "take", "item": -1, "item_name": "default" }, { "op": "chooseleaf_firstn",
Re: [ceph-users] jewel - recovery keeps stalling (continues after restarting OSDs)
On Fri, Jul 28, 2017 at 05:43:14PM +0800, linghucongsong wrote: > > > It look like the osd in your cluster is not all the same size. > > can you show ceph osd df output? you're right, they're not.. here's the output: [root@v1b ~]# ceph osd df tree ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS TYPE NAME -2 1.55995- 1706G 883G 805G 51.78 2.55 0 root ssd -9 0.3- 393G 221G 171G 56.30 2.78 0 host v1c-ssd 10 0.3 1.0 393G 221G 171G 56.30 2.78 98 osd.10 -10 0.59998- 683G 275G 389G 40.39 1.99 0 host v1a-ssd 5 0.2 1.0 338G 151G 187G 44.77 2.21 65 osd.5 26 0.2 1.0 344G 124G 202G 36.07 1.78 52 osd.26 -11 0.34000- 338G 219G 119G 64.68 3.19 0 host v1b-ssd 13 0.34000 1.0 338G 219G 119G 64.68 3.19 96 osd.13 -7 0.21999- 290G 166G 123G 57.43 2.83 0 host v1d-ssd 19 0.21999 1.0 290G 166G 123G 57.43 2.83 73 osd.19 -1 39.29982- 43658G 8312G 34787G 19.04 0.94 0 root default -4 11.89995- 12806G 2422G 10197G 18.92 0.93 0 host v1a 6 1.5 1.0 1833G 358G 1475G 19.53 0.96 366 osd.6 8 1.7 1.0 1833G 313G 1519G 17.11 0.84 370 osd.8 2 1.5 1.0 1833G 320G 1513G 17.46 0.86 331 osd.2 0 1.7 1.0 1804G 431G 1373G 23.90 1.18 359 osd.0 4 1.5 1.0 1833G 294G 1539G 16.07 0.79 360 osd.4 25 3.5 1.0 3667G 704G 2776G 19.22 0.95 745 osd.25 -5 10.39995- 10914G 2154G 8573G 19.74 0.97 0 host v1b 1 1.5 1.0 1804G 350G 1454G 19.42 0.96 409 osd.1 3 1.7 1.0 1804G 360G 1444G 19.98 0.99 412 osd.3 9 1.5 1.0 1804G 331G 1473G 18.37 0.91 363 osd.9 11 1.7 1.0 1833G 367G 1465G 20.06 0.99 415 osd.11 24 3.5 1.0 3667G 744G 2736G 20.30 1.00 834 osd.24 -6 7.79996- 9051G 1769G 7282G 19.54 0.96 0 host v1c 14 1.5 1.0 1804G 370G 1433G 20.54 1.01 442 osd.14 15 1.7 1.0 1833G 383G 1450G 20.92 1.03 447 osd.15 16 1.3 1.0 1804G 295G 1508G 16.38 0.81 355 osd.16 18 1.3 1.0 1804G 366G 1438G 20.29 1.00 381 osd.18 17 1.5 1.0 1804G 353G 1451G 19.57 0.97 429 osd.17 -3 9.19997- 10885G 1965G 8733G 18.06 0.89 0 host v1d-sata 12 1.3 1.0 1804G 348G 1455G 19.32 0.95 365 osd.12 20 1.3 1.0 1804G 335G 1468G 18.60 0.92 371 osd.20 21 3.5 1.0 3667G 695G 2785G 
18.97 0.94 871 osd.21 22 1.3 1.0 1804G 281G 1522G 15.63 0.77 326 osd.22 23 1.3 1.0 1804G 303G 1500G 16.83 0.83 321 osd.23 TOTAL 45365G 9195G 35592G 20.27 MIN/MAX VAR: 0.77/3.19 STDDEV: 14.69 apart from replacing OSDs, how can I help it? > > > At 2017-07-28 17:24:29, "Nikola Ciprich" <nikola.cipr...@linuxbox.cz> wrote: > >I forgot to add that OSD daemons really seem to be idle, no disk > >activity, no CPU usage.. it just looks to me like some kind of > >deadlock, as they were waiting for each other.. > > > >and so I'm trying to get last 1.5% of misplaced / degraded PGs > >for almost a week.. > > > > > >On Fri, Jul 28, 2017 at 10:56:02AM +0200, Nikola Ciprich wrote: > >> Hi, > >> > >> I'm trying to find reason for strange recovery issues I'm seeing on > >> our cluster.. > >> > >> it's mostly idle, 4 node cluster with 26 OSDs evenly distributed > >> across nodes. jewel 10.2.9 > >> > >> the problem is that after some disk replaces and data moves, recovery > >> is progressing extremely slowly.. pgs seem to be stuck in > >> active+recovering+degraded > >> state: > >> > >> [root@v1d ~]# ceph -s > >> cluster a5efbc87-3900-4c42-a977-8c93f7aa8c33 > >> health HEALTH_WARN > >> 159 pgs backfill_wait > >> 4 pgs backfilling > >> 259 pgs degraded > >> 12 pgs recovering > >> 113 pgs recovery_wait > >> 215 pgs stuck degraded > >> 266 pgs stuck unclean > >> 140 pgs stuck undersized > >> 151 pgs undersized > >> recovery 37788/2327775 objects degraded (1.623%) > >> recovery 23854/2327775 objects misplaced (1.
Re: [ceph-users] jewel - recovery keeps stalling (continues after restarting OSDs)
I forgot to add that OSD daemons really seem to be idle, no disk activity, no CPU usage.. it just looks to me like some kind of deadlock, as they were waiting for each other.. and so I'm trying to get last 1.5% of misplaced / degraded PGs for almost a week.. On Fri, Jul 28, 2017 at 10:56:02AM +0200, Nikola Ciprich wrote: > Hi, > > I'm trying to find reason for strange recovery issues I'm seeing on > our cluster.. > > it's mostly idle, 4 node cluster with 26 OSDs evenly distributed > across nodes. jewel 10.2.9 > > the problem is that after some disk replaces and data moves, recovery > is progressing extremely slowly.. pgs seem to be stuck in > active+recovering+degraded > state: > > [root@v1d ~]# ceph -s > cluster a5efbc87-3900-4c42-a977-8c93f7aa8c33 > health HEALTH_WARN > 159 pgs backfill_wait > 4 pgs backfilling > 259 pgs degraded > 12 pgs recovering > 113 pgs recovery_wait > 215 pgs stuck degraded > 266 pgs stuck unclean > 140 pgs stuck undersized > 151 pgs undersized > recovery 37788/2327775 objects degraded (1.623%) > recovery 23854/2327775 objects misplaced (1.025%) > noout,noin flag(s) set > monmap e21: 3 mons at > {v1a=10.0.0.1:6789/0,v1b=10.0.0.2:6789/0,v1c=10.0.0.3:6789/0} > election epoch 6160, quorum 0,1,2 v1a,v1b,v1c > fsmap e817: 1/1/1 up {0=v1a=up:active}, 1 up:standby > osdmap e76002: 26 osds: 26 up, 26 in; 185 remapped pgs > flags noout,noin,sortbitwise,require_jewel_osds > pgmap v80995844: 3200 pgs, 4 pools, 2876 GB data, 757 kobjects > 9215 GB used, 35572 GB / 45365 GB avail > 37788/2327775 objects degraded (1.623%) > 23854/2327775 objects misplaced (1.025%) > 2912 active+clean > 130 active+undersized+degraded+remapped+wait_backfill > 97 active+recovery_wait+degraded > 29 active+remapped+wait_backfill > 12 active+recovery_wait+undersized+degraded+remapped >6 active+recovering+degraded >5 active+recovering+undersized+degraded+remapped >4 active+undersized+degraded+remapped+backfilling >4 active+recovery_wait+degraded+remapped >1 
active+recovering+degraded+remapped > client io 2026 B/s rd, 146 kB/s wr, 9 op/s rd, 21 op/s wr > > > when I restart affected OSDs, it bumps the recovery, but then another > PGs get stuck.. All OSDs were restarted multiple times, none are even close to > nearfull, I just cant find what I'm doing wrong.. > > possibly related OSD options: > > osd max backfills = 4 > osd recovery max active = 15 > debug osd = 0/0 > osd op threads = 4 > osd backfill scan min = 4 > osd backfill scan max = 16 > > Any hints would be greatly appreciated > > thanks > > nik > > > -- > - > Ing. Nikola CIPRICH > LinuxBox.cz, s.r.o. > 28.rijna 168, 709 00 Ostrava > > tel.: +420 591 166 214 > fax:+420 596 621 273 > mobil: +420 777 093 799 > www.linuxbox.cz > > mobil servis: +420 737 238 656 > email servis: ser...@linuxbox.cz > ----- > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > -- - Ing. Nikola CIPRICH LinuxBox.cz, s.r.o. 28.rijna 168, 709 00 Ostrava tel.: +420 591 166 214 fax:+420 596 621 273 mobil: +420 777 093 799 www.linuxbox.cz mobil servis: +420 737 238 656 email servis: ser...@linuxbox.cz - ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] jewel - recovery keeps stalling (continues after restarting OSDs)
Hi, I'm trying to find reason for strange recovery issues I'm seeing on our cluster.. it's mostly idle, 4 node cluster with 26 OSDs evenly distributed across nodes. jewel 10.2.9 the problem is that after some disk replaces and data moves, recovery is progressing extremely slowly.. pgs seem to be stuck in active+recovering+degraded state: [root@v1d ~]# ceph -s cluster a5efbc87-3900-4c42-a977-8c93f7aa8c33 health HEALTH_WARN 159 pgs backfill_wait 4 pgs backfilling 259 pgs degraded 12 pgs recovering 113 pgs recovery_wait 215 pgs stuck degraded 266 pgs stuck unclean 140 pgs stuck undersized 151 pgs undersized recovery 37788/2327775 objects degraded (1.623%) recovery 23854/2327775 objects misplaced (1.025%) noout,noin flag(s) set monmap e21: 3 mons at {v1a=10.0.0.1:6789/0,v1b=10.0.0.2:6789/0,v1c=10.0.0.3:6789/0} election epoch 6160, quorum 0,1,2 v1a,v1b,v1c fsmap e817: 1/1/1 up {0=v1a=up:active}, 1 up:standby osdmap e76002: 26 osds: 26 up, 26 in; 185 remapped pgs flags noout,noin,sortbitwise,require_jewel_osds pgmap v80995844: 3200 pgs, 4 pools, 2876 GB data, 757 kobjects 9215 GB used, 35572 GB / 45365 GB avail 37788/2327775 objects degraded (1.623%) 23854/2327775 objects misplaced (1.025%) 2912 active+clean 130 active+undersized+degraded+remapped+wait_backfill 97 active+recovery_wait+degraded 29 active+remapped+wait_backfill 12 active+recovery_wait+undersized+degraded+remapped 6 active+recovering+degraded 5 active+recovering+undersized+degraded+remapped 4 active+undersized+degraded+remapped+backfilling 4 active+recovery_wait+degraded+remapped 1 active+recovering+degraded+remapped client io 2026 B/s rd, 146 kB/s wr, 9 op/s rd, 21 op/s wr when I restart affected OSDs, it bumps the recovery, but then another PGs get stuck.. All OSDs were restarted multiple times, none are even close to nearfull, I just cant find what I'm doing wrong.. 
possibly related OSD options: osd max backfills = 4 osd recovery max active = 15 debug osd = 0/0 osd op threads = 4 osd backfill scan min = 4 osd backfill scan max = 16 Any hints would be greatly appreciated thanks nik -- - Ing. Nikola CIPRICH LinuxBox.cz, s.r.o. 28.rijna 168, 709 00 Ostrava tel.: +420 591 166 214 fax:+420 596 621 273 mobil: +420 777 093 799 www.linuxbox.cz mobil servis: +420 737 238 656 email servis: ser...@linuxbox.cz - ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
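The degraded and misplaced figures in the "ceph -s" output above are simply the listed counts as fractions of the total object replicas. A quick sanity check of those ratios (pure arithmetic on the numbers shown in the status output):

```python
# Reproduce the percentages "ceph -s" printed above:
#   recovery 37788/2327775 objects degraded (1.623%)
#   recovery 23854/2327775 objects misplaced (1.025%)
degraded, misplaced, total = 37788, 23854, 2327775

degraded_pct = round(degraded / total * 100, 3)
misplaced_pct = round(misplaced / total * 100, 3)

print(degraded_pct, misplaced_pct)  # matches the 1.623 / 1.025 shown above
```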
[ceph-users] after jewel 10.2.2->10.2.7 upgrade, one of OSD crashes on OSDMap::decode
Hi, I've upgraded a tiny jewel cluster from 10.2.2 to 10.2.7 and now one of the OSDs fails to start.. here's (hopefully) the important part of the backtrace: 2017-05-01 19:54:17.627262 7fb2bbf78800 10 filestore(/var/lib/ceph/osd/ceph-1) stat meta/#-1:c0371625:::snapmapper:0# = 0 (size 0) 2017-05-01 19:54:17.627440 7fb2bbf78800 0 cls/hello/cls_hello.cc:305: loading cls_hello 2017-05-01 19:54:17.629044 7fb2bbf78800 0 cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan 2017-05-01 19:54:17.630656 7fb2bbf78800 15 filestore(/var/lib/ceph/osd/ceph-1) read meta/#-1:3294e826:::osdmap.53:0# 0~0 2017-05-01 19:54:17.630674 7fb2bbf78800 10 filestore(/var/lib/ceph/osd/ceph-1) FileStore::read meta/#-1:3294e826:::osdmap.53:0# 0~0/0 terminate called after throwing an instance of 'ceph::buffer::end_of_buffer' what(): buffer::end_of_buffer *** Caught signal (Aborted) ** in thread 7fb2bbf78800 thread_name:ceph-osd ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185) 1: (()+0x91d8ea) [0x5609e9f938ea] 2: (()+0xf370) [0x7fb2ba6ca370] 3: (gsignal()+0x37) [0x7fb2b8c8b1d7] 4: (abort()+0x148) [0x7fb2b8c8c8c8] 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fb2b958f9d5] 6: (()+0x5e946) [0x7fb2b958d946] 7: (()+0x5e973) [0x7fb2b958d973] 8: (()+0x5eb93) [0x7fb2b958db93] 9: (ceph::buffer::list::iterator_impl::copy(unsigned int, char*)+0xa5) [0x5609ea09e425] 10: (OSDMap::decode(ceph::buffer::list::iterator&)+0x6d) [0x5609ea055a9d] 11: (OSDMap::decode(ceph::buffer::list&)+0x2e) [0x5609ea056d9e] 12: (OSDService::try_get_map(unsigned int)+0x4ac) [0x5609e9a0882c] 13: (OSDService::get_map(unsigned int)+0xe) [0x5609e9a6b5fe] 14: (OSD::init()+0x1fe2) [0x5609e9a1e782] 15: (main()+0x2c55) [0x5609e9981dc5] 16: (__libc_start_main()+0xf5) [0x7fb2b8c77b35] 17: (()+0x3561e7) [0x5609e99cc1e7] 2017-05-01 19:54:17.632871 7fb2bbf78800 -1 *** Caught signal (Aborted) ** in thread 7fb2bbf78800 thread_name:ceph-osd full osd log is here: http://nik.lbox.cz/download/osd-crash.txt I've found some older
discussions and reports of similar problems, but none for current versions, especially 10.2.7. the cluster is very small (just 2+2 OSDs, 3 mons, no MDS), was installed as 10.2.2, therefore no upgrade from hammer or so.. OS is centos7 based, 4.4.52 x86_64 kernel.. If anyone is interested in it, I can provide more info if needed, otherwise I'll reformat the OSD to get it back into an OK state.. BR nik -- ----- Ing. Nikola CIPRICH LinuxBox.cz, s.r.o. 28.rijna 168, 709 00 Ostrava tel.: +420 591 166 214 fax:+420 596 621 273 mobil: +420 777 093 799 www.linuxbox.cz mobil servis: +420 737 238 656 email servis: ser...@linuxbox.cz -
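The backtrace points at a corrupted on-disk copy of osdmap epoch 53 (the OSDMap::decode throwing end_of_buffer). The usual recovery, short of reformatting, is to fetch a good copy of that epoch from the monitors and inject it into the broken OSD. A sketch only, printed rather than executed: the --op set-osdmap operation of ceph-objectstore-tool is an assumption here — it exists in newer releases but may not in jewel, where the alternative was copying the osdmap object file from a healthy OSD's current/meta directory.

```shell
#!/bin/sh
# Dry-run sketch of repairing a corrupted osdmap epoch on one OSD.
# The OSD must be stopped before running the objectstore-tool step;
# check that your ceph-objectstore-tool supports --op set-osdmap.
epoch=53
osd_path=/var/lib/ceph/osd/ceph-1

steps="ceph osd getmap $epoch -o /tmp/osdmap.$epoch
ceph-objectstore-tool --data-path $osd_path --op set-osdmap --file /tmp/osdmap.$epoch"

# Printed only; needs a live cluster and a stopped OSD to run for real.
echo "$steps"
```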
Re: [ceph-users] hammer - lost object after just one OSD failure?
Hi Gregory, thanks a lot for the reply. > > Is OSD 0 the one which had a failing hard drive? And OSD 10 is > supposed to be fine? yes, OSD 0 crashed due to disk errors, the rest of the cluster was without problems, no crashes, no restarts.. that's why it scared me a bit.. a pity I purged the lost placement groups, maybe we could have dug up some more debug info... I'll keep torturing the cluster, watch it carefully and report if something similar happens again.. I suppose we can't do much more till then... BR nik > > In general what you're saying does make it sound like something under > the Ceph code lost objects, but if one of those OSDs has never had a > problem I'm not sure what it could be. > > (The most common failure mode is power loss while the user has > barriers turned off, or a RAID card misconfigured, or similar.) > -Greg > > > > > I'd be grateful for any info > > > > br > > > > nik -- - Ing. Nikola CIPRICH LinuxBox.cz, s.r.o. 28.rijna 168, 709 00 Ostrava tel.: +420 591 166 214 fax:+420 596 621 273 mobil: +420 777 093 799 www.linuxbox.cz mobil servis: +420 737 238 656 email servis: ser...@linuxbox.cz -
[ceph-users] hammer - lost object after just one OSD failure?
Hi, I was doing some performance tuning on a test cluster of just 2 nodes (10 OSDs each). I have a test pool with 2 replicas (size=2, min_size=2). then one of the OSDs crashed due to a failing hard drive. All remaining OSDs were fine, but health status reported one lost object.. here's the detail: "recovery_state": [ { "name": "Started\/Primary\/Active", "enter_time": "2016-05-04 07:59:10.706866", "might_have_unfound": [ { "osd": "0", "status": "osd is down" }, { "osd": "10", "status": "already probed" } ], it was no important data, so I just discarded it as I don't need to recover it, but now I'm wondering what the cause of all this is.. I have min_size set to 2 and I thought that writes are confirmed after they reach all target OSD journals, no? Is there something specific I should check? Maybe I have some bug in my configuration? Or how else could this object be lost? I'd be grateful for any info br nik -- - Ing. Nikola CIPRICH LinuxBox.cz, s.r.o. 28.rijna 168, 709 00 Ostrava tel.: +420 591 166 214 fax:+420 596 621 273 mobil: +420 777 093 799 www.linuxbox.cz mobil servis: +420 737 238 656 email servis: ser...@linuxbox.cz -
[ceph-users] can't get rid of stale+active+clean pgs by any means
Hi, I'm still struggling with health problems of my cluster.. I still have 2 stale+active+clean and one creating pg.. I've just stopped all nodes and started them all again, and those pgs still remain.. I think I've read all related discussions and docs, and tried virtually everything I thought could help (and be safe). Querying those stale pgs hangs, the OSDs which should be acting for them are running.. I can't figure out what could be wrong.. does anyone have an idea what to try? I'm running latest hammer (0.94.5) on centos 6..

thanks a lot in advance

cheers

nik
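When chasing stale PGs like this, the first step is usually to map each stuck PG back to the OSDs that were last acting for it, so those daemons can be restarted. A sketch that parses `ceph health detail`-style lines (the sample lines are taken from a later message in this thread; the Hammer-era wording is assumed):

```python
import re

# Lines shaped like Hammer's `ceph health detail` output for stuck PGs.
health_detail = """\
pg 6.11 is stuck stale for 79285.647847, current state stale+active+clean, last acting [4,10,8]
pg 3.198 is stuck stale for 79367.532437, current state stale+active+clean, last acting [8,13]
"""

PAT = re.compile(r"pg (\S+) is stuck (\w+) .* last acting \[([\d,]+)\]")

def acting_osds(text):
    """Map each stuck pgid to the OSDs that were last acting for it."""
    result = {}
    for line in text.splitlines():
        m = PAT.search(line)
        if m:
            pgid, _state, acting = m.groups()
            result[pgid] = [int(o) for o in acting.split(",")]
    return result

# Union of all OSDs involved -- the restart candidates.
print(sorted({o for osds in acting_osds(health_detail).values() for o in osds}))  # -> [4, 8, 10, 13]
```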
Re: [ceph-users] hammer-0.94.5 + kernel-4.1.15 - cephfs stuck
On 4 February 2016 08:33:55 CET, Gregory Farnum <gfar...@redhat.com> wrote:
>The quick and dirty cleanup is to restart the OSDs hosting those PGs.
>They might have gotten some stuck ops which didn't get woken up; a few
>bugs like that have gone by and are resolved in various stable
>branches (I'm not sure what release binaries they're in).

That's what I thought, so I tried restarting all OSDs already.. But those stuck PGs still remain. The version I'm running is 0.94.5.

Nik

>On Wed, Feb 3, 2016 at 11:32 PM, Nikola Ciprich
><nikola.cipr...@linuxbox.cz> wrote:
>>> Yeah, these inactive PGs are basically guaranteed to be the cause of
>>> the problem. There are lots of threads about getting PGs healthy
>>> again; you should dig around the archives and the documentation
>>> troubleshooting page(s). :)
>>> -Greg
>>
>> Hello Gregory,
>>
>> well, I wouldn't doubt it, but when the problems started, the only
>> unclean pgs were some remapped, none inactive, so I guess it must've
>> been something else..
>>
>> but I'm now struggling to get rid of those inactive ones of course..
>> however I've not been successful so far, I've probably read all
>> the related docs and discussions and still haven't found a similar
>> problem..
>>
>> pg 6.11 is stuck stale for 79285.647847, current state stale+active+clean, last acting [4,10,8]
>> pg 3.198 is stuck stale for 79367.532437, current state stale+active+clean, last acting [8,13]
>>
>> those two are stale for some reason.. but OSDs 4, 8, 10, 13 are running, there
>> are no network problems.. PG query on those just hangs..
>>
>> I'm running out of ideas here..
>>
>> nik

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [ceph-users] hammer-0.94.5 + kernel-4.1.15 - cephfs stuck
Hello Gregory,

in the meantime, I managed to break it further :( I tried getting rid of the active+remapped pgs and got some undersized ones instead.. not sure whether this can be related.. anyways here's the status:

ceph -s
    cluster ff21618e-5aea-4cfe-83b6-a0d2d5b4052a
     health HEALTH_WARN
            3 pgs degraded
            2 pgs stale
            3 pgs stuck degraded
            1 pgs stuck inactive
            2 pgs stuck stale
            242 pgs stuck unclean
            3 pgs stuck undersized
            3 pgs undersized
            recovery 65/3374343 objects degraded (0.002%)
            recovery 186187/3374343 objects misplaced (5.518%)
            mds0: Behind on trimming (155/30)
     monmap e3: 3 mons at {remrprv1a=10.0.0.1:6789/0,remrprv1b=10.0.0.2:6789/0,remrprv1c=10.0.0.3:6789/0}
            election epoch 522, quorum 0,1,2 remrprv1a,remrprv1b,remrprv1c
     mdsmap e342: 1/1/1 up {0=remrprv1c=up:active}, 2 up:standby
     osdmap e4385: 21 osds: 21 up, 21 in; 238 remapped pgs
      pgmap v18679192: 1856 pgs, 7 pools, 4223 GB data, 1103 kobjects
            12947 GB used, 22591 GB / 35538 GB avail
            65/3374343 objects degraded (0.002%)
            186187/3374343 objects misplaced (5.518%)
                1612 active+clean
                 238 active+remapped
                   3 active+undersized+degraded
                   2 stale+active+clean
                   1 creating
  client io 0 B/s rd, 40830 B/s wr, 17 op/s

> What's the full output of "ceph -s"? Have you looked at the MDS admin
> socket at all — what state does it say it's in?

[root@remrprv1c ceph]# ceph --admin-daemon /var/run/ceph/ceph-mds.remrprv1c.asok dump_ops_in_flight
{
    "ops": [
        {
            "description": "client_request(client.3052096:83 getattr Fs #1000288 2016-02-03 10:10:46.361591 RETRY=1)",
            "initiated_at": "2016-02-03 10:23:25.791790",
            "age": 3963.093615,
            "duration": 9.519091,
            "type_data": [
                "failed to rdlock, waiting",
                "client.3052096:83",
                "client_request",
                {
                    "client": "client.3052096",
                    "tid": 83
                },
                [
                    {
                        "time": "2016-02-03 10:23:25.791790",
                        "event": "initiated"
                    },
                    {
                        "time": "2016-02-03 10:23:35.310881",
                        "event": "failed to rdlock, waiting"
                    }
                ]
            ]
        }
    ],
    "num_ops": 1
}

seems there's some lock stuck here..
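The `dump_ops_in_flight` output above can also be filtered mechanically, which helps when there are many ops rather than one; a sketch assuming the Hammer-era field names shown in the dump:

```python
import json

# Shaped like the admin-socket `dump_ops_in_flight` output quoted above
# (trimmed to the fields this sketch uses; values from the post).
ops = json.loads("""
{
  "ops": [
    {
      "description": "client_request(client.3052096:83 getattr Fs #1000288 RETRY=1)",
      "initiated_at": "2016-02-03 10:23:25.791790",
      "age": 3963.093615,
      "type_data": ["failed to rdlock, waiting", "client.3052096:83"]
    }
  ],
  "num_ops": 1
}
""")

def stuck_ops(dump, min_age=60.0):
    """Return (description, age) for ops in flight longer than min_age seconds."""
    return [(op["description"], op["age"])
            for op in dump["ops"] if op["age"] > min_age]

for desc, age in stuck_ops(ops):
    print(f"{age:8.1f}s  {desc}")
```

An op an hour old that is still "failed to rdlock, waiting" is exactly the kind of stuck lock this thread is about.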
Killing the stuck client (it's postgres trying to access a cephfs file) doesn't help..

> -Greg
>
> > My question here is:
> >
> > 1) is there some known issue with hammer 0.94.5 or kernel 4.1.15
> > which could lead to cephfs hangs?
> >
> > 2) what can I do to debug what is the cause of this hang?
> >
> > 3) is there a way to recover this without hard resetting the
> > node with the hung cephfs mount?
> >
> > If I could provide more information, please let me know
> >
> > I'd really appreciate any help
> >
> > with best regards
> >
> > nik
[ceph-users] hammer - remapped / undersized pgs + related questions
    1 creating
  client io 14830 B/s rd, 269 kB/s wr, 94 op/s

I'd be very grateful for any help with those..

with best regards

nik
[ceph-users] placement group lost by using force_create_pg ?
Hello cephers, I think I've got into a pretty bad situation :( I mistakenly ran force_create_pg on one placement group in a live cluster. Now it's stuck in the creating state. I suppose the placement group content is lost, right? Is there a way to recover it? Or at least a way to find out which objects are affected by it? I've only found ways to find which placement group objects belong to, but not the other direction (apart from trying all objects). some data are in rbd objects, some on cephfs... is there a way to help? it'd be really appreciated... thanks a lot in advance

with best regards

nikola ciprich
Re: [ceph-users] hammer-0.94.5 + kernel-4.1.15 - cephfs stuck
> Yeah, these inactive PGs are basically guaranteed to be the cause of
> the problem. There are lots of threads about getting PGs healthy
> again; you should dig around the archives and the documentation
> troubleshooting page(s). :)
> -Greg

Hello Gregory,

well, I wouldn't doubt it, but when the problems started, the only unclean pgs were some remapped, none inactive, so I guess it must've been something else..

but I'm now struggling to get rid of those inactive ones of course.. however I've not been successful so far, I've probably read all the related docs and discussions and still haven't found a similar problem..

pg 6.11 is stuck stale for 79285.647847, current state stale+active+clean, last acting [4,10,8]
pg 3.198 is stuck stale for 79367.532437, current state stale+active+clean, last acting [8,13]

those two are stale for some reason.. but OSDs 4, 8, 10, 13 are running, there are no network problems.. PG query on those just hangs..

I'm running out of ideas here..

nik
Re: [ceph-users] sync writes - expected performance?
Hello Mark,

thanks for your explanation, it all makes sense. I've done some measuring on the google and amazon clouds as well and really, those numbers seem to be pretty good. I'll be playing with fine-tuning a little bit more, but overall performance really seems to be quite nice.

Thanks to all of you for your replies guys!

nik

On Mon, Dec 14, 2015 at 11:03:16AM -0600, Mark Nelson wrote:
> On 12/14/2015 04:49 AM, Nikola Ciprich wrote:
> >Hello,
> >
> >I'm doing some measuring on a test (3 nodes) cluster and see a strange
> >performance drop for sync writes..
> >
> >I'm using an SSD for both journalling and OSD. It should be suitable for
> >journal use, giving about 16.1K IOPS (67MB/s) for sync IO.
> >
> >(measured using fio --filename=/dev/xxx --direct=1 --sync=1 --rw=write
> >--bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting
> >--name=journal-test)
> >
> >On top of this cluster, I have a KVM guest running (using the qemu librbd
> >backend). Overall performance seems to be quite good, but the problem is
> >when I try to measure sync IO performance inside the guest.. I'm getting
> >only about 600 IOPS, which I think is quite poor.
> >
> >The problem is, I don't see any bottleneck, OSD daemons don't seem to be
> >hanging on IO, nor hogging CPU, the qemu process is also not too heavily
> >loaded..
> >
> >I'm using hammer 0.94.5 on top of centos 6 (4.1 kernel), all debugging
> >disabled.
> >
> >my question is, what results can I expect for synchronous writes? I
> >understand there will always be some performance drop, but 600 IOPS on top
> >of storage which can give as much as 16K IOPS seems too little..
>
> So basically what this comes down to is latency. Since you get 16K IOPS for
> O_DSYNC writes on the SSD, there's a good chance that it has a
> super-capacitor on board and can basically acknowledge a write as complete
> as soon as it hits the on-board cache rather than when it's written to
> flash.
> Figure that 16K O_DSYNC IOPs means that each IO is completing in around
> 0.06ms on average. That's very fast! At 600 IOPs for O_DSYNC writes on your
> guest, you're looking at about 1.6ms per IO on average.
>
> So how do we account for the difference? Let's start out by looking at a
> quick example of network latency (this is between two random machines in one
> of our labs at Red Hat):
>
> >64 bytes from gqas008: icmp_seq=1 ttl=64 time=0.583 ms
> >64 bytes from gqas008: icmp_seq=2 ttl=64 time=0.219 ms
> >64 bytes from gqas008: icmp_seq=3 ttl=64 time=0.224 ms
> >64 bytes from gqas008: icmp_seq=4 ttl=64 time=0.200 ms
> >64 bytes from gqas008: icmp_seq=5 ttl=64 time=0.196 ms
>
> now consider that when you do a write in ceph, you write to the primary OSD
> which then writes out to the replica OSDs. Every replica IO has to complete
> before the primary will send the acknowledgment to the client (ie you have
> to add the latency of the worst of the replica writes!). In your case, the
> network latency alone is likely dramatically increasing IO latency vs raw
> SSD O_DSYNC writes. Now add in the time to process crush mappings, look up
> directory and inode metadata on the filesystem where objects are stored
> (assuming it's not cached), and other processing time, and the 1.6ms latency
> for the guest writes starts to make sense.
>
> Can we improve things? Likely yes. There's various areas in the code where
> we can trim latency away, implement alternate OSD backends, and potentially
> use alternate network technology like RDMA to reduce network latency. The
> thing to remember is that when you are talking about O_DSYNC writes, even
> very small increases in latency can have dramatic effects on performance.
> Every fraction of a millisecond has huge ramifications.
>
> > Has anyone done similar measuring?
> >
> > thanks a lot in advance!
> > BR
> >
> > nik
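Mark's arithmetic above can be checked directly: at iodepth=1, IOPS is just the reciprocal of the average per-IO latency, so the two numbers from this thread convert as follows (a trivial sketch, figures taken from the posts above):

```python
def iops_to_latency_ms(iops):
    """At iodepth=1 each IO must finish before the next starts,
    so average latency is simply 1/IOPS (in ms here)."""
    return 1000.0 / iops

# Numbers from the thread: raw SSD O_DSYNC writes vs. writes inside the guest.
ssd_lat = iops_to_latency_ms(16000)   # ~0.06 ms per IO
guest_lat = iops_to_latency_ms(600)   # ~1.67 ms per IO

# A single ~0.2 ms network round trip per replica write (the ping figures
# above) already dwarfs the raw SSD latency, before any OSD processing.
print(f"SSD: {ssd_lat:.3f} ms, guest: {guest_lat:.3f} ms, "
      f"ratio: {guest_lat / ssd_lat:.0f}x")
```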
[ceph-users] sync writes - expected performance?
Hello,

I'm doing some measuring on a test (3 nodes) cluster and see a strange performance drop for sync writes..

I'm using an SSD for both journalling and OSD. It should be suitable for journal use, giving about 16.1K IOPS (67MB/s) for sync IO.

(measured using fio --filename=/dev/xxx --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test)

On top of this cluster, I have a KVM guest running (using the qemu librbd backend). Overall performance seems to be quite good, but the problem is when I try to measure sync IO performance inside the guest.. I'm getting only about 600 IOPS, which I think is quite poor.

The problem is, I don't see any bottleneck, OSD daemons don't seem to be hanging on IO, nor hogging CPU, the qemu process is also not too heavily loaded..

I'm using hammer 0.94.5 on top of centos 6 (4.1 kernel), all debugging disabled.

my question is, what results can I expect for synchronous writes? I understand there will always be some performance drop, but 600 IOPS on top of storage which can give as much as 16K IOPS seems too little..

Has anyone done similar measuring?

thanks a lot in advance!

BR

nik
Re: [ceph-users] SSD pool and SATA pool
I'm not a ceph expert, but I needed to use

    osd crush update on start = false

in the [osd] config section..

BR

nik

On Tue, Nov 17, 2015 at 08:53:37PM +, Michael Kuriger wrote:
> Hey everybody,
> I have 10 servers, each with 2 SSD drives, and 8 SATA drives. Is it possible
> to create 2 pools, one made up of SSD and one made up of SATA? I tried
> manually editing the crush map to do it, but the configuration doesn’t seem
> to persist reboots. Any help would be very appreciated.
>
> Thanks!
>
> Mike
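For context, a minimal sketch of what that setting looks like in ceph.conf. The `[osd]` line is the one quoted above; the surrounding comments describe the usual pre-Jewel approach (no device-class support yet, so SSD/SATA separation is done with hand-edited CRUSH buckets and rules — bucket and rule names would be your own):

```ini
# ceph.conf on every OSD node: stop Ceph from moving OSDs back into the
# default host bucket on daemon start, so a hand-edited CRUSH map
# (e.g. separate ssd/sata roots with their own rules) survives reboots.
[osd]
osd crush update on start = false
```

With this set, OSDs stay wherever the edited CRUSH map places them, and each pool can then be pointed at the SSD or SATA rule with `ceph osd pool set <pool> crush_ruleset <id>` (ruleset naming as in Hammer-era releases).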
[ceph-users] python binding - snap rollback - progress reporting
Hello, I'd like to ask - I'm using the python RBD/rados bindings. Everything works well for me; the only thing I'd like to improve is snapshot rollback: as the operation is quite time consuming, I would like to report its progress. is this somehow possible? even at the cost of implementing the whole rollback operation myself? thanks a lot in advance! BR nik
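As far as I can tell, the python binding's `Image.rollback_to_snap()` of that era takes no progress callback, but a manual rollback — reading the image at the snapshot and writing the data back chunk by chunk — makes progress reporting trivial. A sketch with a generic, testable copy loop; the rbd wiring in the comment is an untested assumption (names `ioctx`, `name`, `snap` are placeholders):

```python
def copy_with_progress(read_fn, write_fn, total, chunk=4 * 1024 * 1024,
                       report=lambda pct: None):
    """Copy `total` bytes using read_fn(offset, length) / write_fn(data, offset),
    calling report(percent_complete) after each chunk."""
    offset = 0
    while offset < total:
        length = min(chunk, total - offset)
        write_fn(read_fn(offset, length), offset)
        offset += length
        report(100.0 * offset / total)
    return offset

# Hypothetical rbd wiring (untested sketch -- a manual rollback that copies
# the snapshot contents over the head image):
#
# import rados, rbd
# with rbd.Image(ioctx, name, snapshot=snap, read_only=True) as src, \
#      rbd.Image(ioctx, name) as dst:
#     copy_with_progress(src.read, dst.write, src.size(),
#                        report=lambda p: print(f"rollback {p:5.1f}%"))
```

Note this copies every byte, so it is slower than a native rollback that only touches changed objects; the trade-off is exactly the progress visibility asked about.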
[ceph-users] qemu (or librbd in general) - very high load on client side
Hello dear ceph developers and users,

I've spent some time tuning and measuring our ceph cluster performance, and noticed a quite strange thing.. I've been using fio (using both the rbd engine on hosts and the direct block (aio) engine inside qemu-kvm guests (qemu connected to the ceph storage using rbd)) and I noticed the client part always generates a huge amount of CPU load, and therefore the CLIENT seems to be the bottleneck.

For example, when I measure direct SSD performance on one of the ceph OSDs, I'm getting 100k IOPS (which is OK, according to SSD specs) using fio, but when I measure performance of a ceph SSD pool volume, it's much worse. I'd understand if the bottleneck were the ceph-osd processes (or some other ceph component), but it seems to me fio using the rbd engine is the problem here (it's able to eat 6 CPU cores itself).

It seems to be very similar when using qemu to access the ceph storage - it shows very high cpu utilisation (I'm using virtio-scsi for guest disk emulation). This behaviour is the same for both random and sequential IO.

Preloading libtcmalloc helps fio (and I also tried compiling qemu with libtcmalloc, it also helps), but still it seems to me that there could be something wrong in librbd..

Has anyone else noticed this behaviour? I noticed in some mail threads that disabling cephx authentication can help a lot, but I don't really like this idea and haven't tried it yet..

with best regards

nik
Re: [ceph-users] very different performance on two volumes in the same pool #2
On Mon, May 11, 2015 at 06:07:21AM +, Somnath Roy wrote:
> Yes, you need to run fio clients on a separate box, it will take quite a bit
> of cpu. Stopping OSDs on other nodes, rebalancing will start. Have you waited
> for the cluster to go to the active+clean state ? If you are running while
> rebalancing is going on, the performance will be impacted.

I set noout, so there was no rebalancing, I forgot to mention that..

> ~110% cpu util seems pretty low. Try to run fio_rbd with more num_jobs (say 3
> or 4 or more), io_depth=64 is fine, and see if it improves performance or not.

ok, increasing jobs to 4 seems to squeeze a bit more from the cluster, about 43.3K iops.. OSD cpu util jumps to ~300% on both alive nodes, so there still seem to be some reserves..

> Also, since you have 3 OSDs (3 nodes?), I would suggest tweaking the following
> settings:
>
> osd_op_num_threads_per_shard
> osd_op_num_shards
>
> May be (1,10 / 1,15 / 2,10 ?).

tried all those combinations, but it makes almost no difference.. do you think I could get more than those 43k?

one more thing that makes me wonder a bit is this line I can see in perf:

2.21% libsoftokn3.so [.] 0x0001ebb2

I suppose this has something to do with resolving; 2.2% seems quite a lot to me.. Should I be worried about it? Does it make sense to enable kernel DNS resolving support in ceph?

thanks for your time Somnath!

nik

> Thanks Regards
> Somnath
>
> -Original Message-
> From: Nikola Ciprich [mailto:nikola.cipr...@linuxbox.cz]
> Sent: Sunday, May 10, 2015 10:33 PM
> To: Somnath Roy
> Cc: ceph-users; n...@linuxbox.cz
> Subject: Re: [ceph-users] very different performance on two volumes in the same pool #2
>
> On Mon, May 11, 2015 at 05:20:25AM +, Somnath Roy wrote:
> > Two things..
> > 1. You should always use SSD drives for benchmarking after preconditioning it.
>
> well, I don't really understand... ?
>
> > 2. After creating and mapping the rbd lun, you need to write data first to
> > read it afterward, otherwise fio output will be misleading.
In fact, I think you will see the IO is not even hitting the cluster (check with ceph -s)

yes, so this confirms my conjecture. ok.

Now, if you are saying it's a 3 OSD setup, yes, ~23K is pretty low. Check the following.

1. Check whether the client or OSD node cpu is saturating or not.

On the OSD nodes, I can see ceph-osd CPU utilisation of ~110%. On the client node (which is one of the OSD nodes as well), I can see fio eating quite a lot of CPU cycles.. I tried stopping ceph-osd on this node (thus only two nodes are serving data) and performance got a bit higher, to ~33k IOPS. But still I think it's not very good..

2. With 4K, hope network BW is fine

I think it's ok..

3. Number of PGs/pool should be ~128 or so.

I'm using pg_num 128

4. If you are using krbd, you might want to try the latest krbd module where the TCP_NODELAY problem is fixed. If you don't want that complexity, try with fio-rbd.

I'm not using krbd (only for writing data to the volume); for benchmarking, I'm using fio-rbd.

anything else I could check?

Hope this helps,
Thanks Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Nikola Ciprich
Sent: Sunday, May 10, 2015 9:43 PM
To: ceph-users
Cc: n...@linuxbox.cz
Subject: [ceph-users] very different performance on two volumes in the same pool #2

Hello ceph developers and users,

some time ago, I posted here a question regarding very different performance for two volumes in one pool (backed by SSD drives). After some examination, I probably got to the root of the problem..
However once the volume is written to (ie when I map it using rbd map and dd whole volume with some random data), and repeat the benchmark, random performance drops to ~23k IOPS. This leads me to conjecture that for unwritten (sparse) volumes, read is just a noop, simply returning zeroes without really having to read data from physical storage, and thus showing nice performance, but once the volume is written, performance drops due to need to physically read the data, right? However I'm a bit unhappy about the performance drop, the pool is backed by 3 SSD drives (each having random io performance of 100k iops) on three nodes, and object size is set to 3. Cluster is completely idle, nodes are quad core Xeons E3-1220 v3 @ 3.10GHz, 32GB RAM each, centos 6, kernel 3.18.12, ceph 0.94.1. I'm using libtcmalloc (I even tried upgrading gperftools-libs to 2.4) Nodes
Re: [ceph-users] very different performance on two volumes in the same pool #2
On Mon, May 11, 2015 at 05:20:25AM +, Somnath Roy wrote:
> Two things..
> 1. You should always use SSD drives for benchmarking after preconditioning it.

well, I don't really understand... ?

> 2. After creating and mapping the rbd lun, you need to write data first to
> read it afterward, otherwise fio output will be misleading. In fact, I think
> you will see the IO is not even hitting the cluster (check with ceph -s)

yes, so this confirms my conjecture. ok.

> Now, if you are saying it's a 3 OSD setup, yes, ~23K is pretty low. Check the
> following.
> 1. Check whether the client or OSD node cpu is saturating or not.

On the OSD nodes, I can see ceph-osd CPU utilisation of ~110%. On the client node (which is one of the OSD nodes as well), I can see fio eating quite a lot of CPU cycles.. I tried stopping ceph-osd on this node (thus only two nodes are serving data) and performance got a bit higher, to ~33k IOPS. But still I think it's not very good..

> 2. With 4K, hope network BW is fine

I think it's ok..

> 3. Number of PGs/pool should be ~128 or so.

I'm using pg_num 128

> 4. If you are using krbd, you might want to try the latest krbd module where
> the TCP_NODELAY problem is fixed. If you don't want that complexity, try with
> fio-rbd.

I'm not using krbd (only for writing data to the volume); for benchmarking, I'm using fio-rbd.

anything else I could check?

> Hope this helps,
> Thanks Regards
> Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Nikola Ciprich
Sent: Sunday, May 10, 2015 9:43 PM
To: ceph-users
Cc: n...@linuxbox.cz
Subject: [ceph-users] very different performance on two volumes in the same pool #2

Hello ceph developers and users,

some time ago, I posted here a question regarding very different performance for two volumes in one pool (backed by SSD drives). After some examination, I probably got to the root of the problem..
When I create a fresh volume (ie rbd create --image-format 2 --size 51200 ssd/test) and run a random io fio benchmark

fio --randrepeat=1 --ioengine=rbd --direct=1 --gtod_reduce=1 --name=test --pool=ssd3r --rbdname=${rbdname} --invalidate=1 --bs=4k --iodepth=64 --readwrite=randread

I get very nice performance of up to 200k IOPS.

However once the volume is written to (ie when I map it using rbd map and dd the whole volume with some random data), and I repeat the benchmark, random performance drops to ~23k IOPS. This leads me to the conjecture that for unwritten (sparse) volumes, a read is just a noop, simply returning zeroes without really having to read data from physical storage, and thus showing nice performance, but once the volume is written, performance drops due to the need to physically read the data, right?

However I'm a bit unhappy about the performance drop; the pool is backed by 3 SSD drives (each having random io performance of 100k iops) on three nodes, and the pool size (replica count) is set to 3. The cluster is completely idle, nodes are quad core Xeons E3-1220 v3 @ 3.10GHz, 32GB RAM each, centos 6, kernel 3.18.12, ceph 0.94.1. I'm using libtcmalloc (I even tried upgrading gperftools-libs to 2.4). Nodes are connected using 10gb ethernet, with jumbo frames enabled.

I tried tuning the following values:

osd_op_threads = 5
filestore_op_threads = 4
osd_op_num_threads_per_shard = 1
osd_op_num_shards = 25
filestore_fd_cache_size = 64
filestore_fd_cache_shards = 32

I don't see anything special in perf:

 5.43%  [kernel]              [k] acpi_processor_ffh_cstate_enter
 2.93%  libtcmalloc.so.4.2.6  [.] 0x00017d2c
 2.45%  libpthread-2.12.so    [.] pthread_mutex_lock
 2.37%  libpthread-2.12.so    [.] pthread_mutex_unlock
 2.33%  [kernel]              [k] do_raw_spin_lock
 2.00%  libsoftokn3.so        [.] 0x0001f455
 1.96%  [kernel]              [k] __switch_to
 1.32%  [kernel]              [k] __schedule
 1.24%  libstdc++.so.6.0.13   [.] std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char...
 1.24%  libc-2.12.so          [.] memcpy
 1.19%  libtcmalloc.so.4.2.6  [.] operator delete(void*)
 1.16%  [kernel]              [k] __d_lookup_rcu
 1.09%  libstdc++.so.6.0.13   [.] 0x0007d6be
 0.93%  libstdc++.so.6.0.13   [.] std::basic_streambuf<char, std::char_traits<char> >::xsputn(char const*, long)
 0.93%  ceph-osd              [.] crush_hash32_3
 0.85%  libc-2.12.so          [.] vfprintf
 0.84%  libc-2.12.so          [.] __strlen_sse42
 0.80%  [kernel]              [k] get_futex_key_refs
 0.80%  libpthread-2.12.so    [.] pthread_mutex_trylock
 0.78%  libtcmalloc.so.4.2.6  [.] tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)
 0.71%  libstdc++.so.6.0.13   [.] std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)
 0.68%  ceph-osd              [.] ceph::log::Log::flush
[ceph-users] very different performance on two volumes in the same pool #2
Hello ceph developers and users,

some time ago, I posted here a question regarding very different performance for two volumes in one pool (backed by SSD drives). After some examination, I probably got to the root of the problem..

When I create a fresh volume (ie rbd create --image-format 2 --size 51200 ssd/test) and run a random io fio benchmark

fio --randrepeat=1 --ioengine=rbd --direct=1 --gtod_reduce=1 --name=test --pool=ssd3r --rbdname=${rbdname} --invalidate=1 --bs=4k --iodepth=64 --readwrite=randread

I get very nice performance of up to 200k IOPS.

However once the volume is written to (ie when I map it using rbd map and dd the whole volume with some random data), and I repeat the benchmark, random performance drops to ~23k IOPS. This leads me to the conjecture that for unwritten (sparse) volumes, a read is just a noop, simply returning zeroes without really having to read data from physical storage, and thus showing nice performance, but once the volume is written, performance drops due to the need to physically read the data, right?

However I'm a bit unhappy about the performance drop; the pool is backed by 3 SSD drives (each having random io performance of 100k iops) on three nodes, and the pool size (replica count) is set to 3. The cluster is completely idle, nodes are quad core Xeons E3-1220 v3 @ 3.10GHz, 32GB RAM each, centos 6, kernel 3.18.12, ceph 0.94.1. I'm using libtcmalloc (I even tried upgrading gperftools-libs to 2.4). Nodes are connected using 10gb ethernet, with jumbo frames enabled.

I tried tuning the following values:

osd_op_threads = 5
filestore_op_threads = 4
osd_op_num_threads_per_shard = 1
osd_op_num_shards = 25
filestore_fd_cache_size = 64
filestore_fd_cache_shards = 32

I don't see anything special in perf:

 5.43%  [kernel]              [k] acpi_processor_ffh_cstate_enter
 2.93%  libtcmalloc.so.4.2.6  [.] 0x00017d2c
 2.45%  libpthread-2.12.so    [.] pthread_mutex_lock
 2.37%  libpthread-2.12.so    [.] pthread_mutex_unlock
 2.33%  [kernel]              [k] do_raw_spin_lock
 2.00%  libsoftokn3.so        [.] 0x0001f455
 1.96%  [kernel]              [k] __switch_to
 1.32%  [kernel]              [k] __schedule
 1.24%  libstdc++.so.6.0.13   [.] std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char...
 1.24%  libc-2.12.so          [.] memcpy
 1.19%  libtcmalloc.so.4.2.6  [.] operator delete(void*)
 1.16%  [kernel]              [k] __d_lookup_rcu
 1.09%  libstdc++.so.6.0.13   [.] 0x0007d6be
 0.93%  libstdc++.so.6.0.13   [.] std::basic_streambuf<char, std::char_traits<char> >::xsputn(char const*, long)
 0.93%  ceph-osd              [.] crush_hash32_3
 0.85%  libc-2.12.so          [.] vfprintf
 0.84%  libc-2.12.so          [.] __strlen_sse42
 0.80%  [kernel]              [k] get_futex_key_refs
 0.80%  libpthread-2.12.so    [.] pthread_mutex_trylock
 0.78%  libtcmalloc.so.4.2.6  [.] tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)
 0.71%  libstdc++.so.6.0.13   [.] std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)
 0.68%  ceph-osd              [.] ceph::log::Log::flush()
 0.66%  libtcmalloc.so.4.2.6  [.] tc_free
 0.63%  [kernel]              [k] resched_curr
 0.63%  [kernel]              [k] page_fault
 0.62%  libstdc++.so.6.0.13   [.] std::string::reserve(unsigned long)

I'm running the benchmark directly on one of the nodes, which I know is not optimal, but it's still able to give those 200k iops for an empty volume, so I guess it shouldn't be a problem..

Another story is the random write performance, which is totally poor, but I'd like to deal with the read performance first.. so my question is, are those numbers normal? If not, what should I check?

I'll be very grateful for all the hints I could get..

thanks a lot in advance

nik
Re: [ceph-users] very different performance on two volumes in the same pool
Hello Somnath,

> Thanks for the perf data.. It seems innocuous.. I am not seeing a single tcmalloc trace, are you running with tcmalloc by the way ?

according to ldd, it seems I have it compiled in, yes:

[root@vfnphav1a ~]# ldd /usr/bin/ceph-osd
...
        libtcmalloc.so.4 => /usr/lib64/libtcmalloc.so.4 (0x7f7a3756e000)
...

> What about my other question, is the performance of slow volume increasing if you stop IO on the other volume ?

I don't have any other ceph users, actually the whole cluster is idle..

> Are you using default ceph.conf ? Probably, you want to try with different osd_op_num_shards (maybe = 10, based on your osd server config) and osd_op_num_threads_per_shard (maybe = 1). Also, you may want to see the effect of osd_enable_op_tracker = false

I guess I'm using pretty default settings, with a few changes probably not much related:

[osd]
osd crush update on start = false

[client]
rbd cache = true
rbd cache writethrough until flush = true

[mon]
debug paxos = 0

I now tried setting

throttler perf counter = false
osd enable op tracker = false
osd_op_num_threads_per_shard = 1
osd_op_num_shards = 10

and restarting all ceph servers.. but it seems to make no big difference..

> Are you seeing similar resource consumption on both the servers while IO is going on ?

yes, on all three nodes, ceph-osd seems to be consuming lots of CPU during the benchmark.

> Need some information about your client, are the volumes exposed with krbd or running with librbd environment ? If krbd and with same physical box, hope you mapped the images with 'noshare' enabled.

I'm using fio with the rbd engine, so I guess no krbd-related stuff is in use here?

> Too many questions :-) But, this may give some indication what is going on there. :-)

hopefully my answers are not too confused, I'm still pretty new to ceph..
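For reference, the overrides tried above collected into one ceph.conf fragment. The [osd] section placement is my assumption; the values are the ones suggested in this thread and may well not be optimal for other setups:

```
[osd]
throttler perf counter = false
osd enable op tracker = false
osd_op_num_threads_per_shard = 1
osd_op_num_shards = 10
```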
BR

nik

Thanks & Regards
Somnath

-----Original Message-----
From: Nikola Ciprich [mailto:nikola.cipr...@linuxbox.cz]
Sent: Sunday, April 26, 2015 7:32 AM
To: Somnath Roy
Cc: ceph-users@lists.ceph.com; n...@linuxbox.cz
Subject: Re: [ceph-users] very different performance on two volumes in the same pool

Hello Somnath,

On Fri, Apr 24, 2015 at 04:23:19PM +0000, Somnath Roy wrote:
> This could be again because of tcmalloc issue I reported earlier. Two things to observe.
> 1. Is the performance improving if you stop IO on other volume ? If so, it could be different issue.

there is no other IO.. only cephfs mounted, but no users of it.

> 2. Run perf top in the OSD node and see if tcmalloc traces are popping up.

don't see anything special:

  3.34%  libc-2.12.so          [.] _int_malloc
  2.87%  libc-2.12.so          [.] _int_free
  2.79%  [vdso]                [.] __vdso_gettimeofday
  2.67%  libsoftokn3.so        [.] 0x0001fad9
  2.34%  libfreeblpriv3.so     [.] 0x000355e6
  2.33%  libpthread-2.12.so    [.] pthread_mutex_unlock
  2.19%  libpthread-2.12.so    [.] pthread_mutex_lock
  1.80%  libc-2.12.so          [.] malloc
  1.43%  [kernel]              [k] do_raw_spin_lock
  1.42%  libc-2.12.so          [.] memcpy
  1.23%  [kernel]              [k] __switch_to
  1.19%  [kernel]              [k] acpi_processor_ffh_cstate_enter
  1.09%  libc-2.12.so          [.] malloc_consolidate
  1.08%  [kernel]              [k] __schedule
  1.05%  libtcmalloc.so.4.1.0  [.] 0x00017e6f
  0.98%  libc-2.12.so          [.] vfprintf
  0.83%  libstdc++.so.6.0.13   [.] std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)
  0.76%  libstdc++.so.6.0.13   [.] 0x0008092a
  0.73%  libc-2.12.so          [.] __memset_sse2
  0.72%  libc-2.12.so          [.] __strlen_sse42
  0.70%  libstdc++.so.6.0.13   [.] std::basic_streambuf<char, std::char_traits<char> >::xsputn(char const*, long)
  0.68%  libpthread-2.12.so    [.] pthread_mutex_trylock
  0.67%  librados.so.2.0.0     [.] ceph_crc32c_sctp
  0.63%  libpython2.6.so.1.0   [.] 0x0007d823
  0.55%  libnss3.so            [.] 0x00056d2a
  0.52%  libc-2.12.so          [.] free
  0.50%  libstdc++.so.6.0.13   [.] std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)

should I check anything else?

BR

nik

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Nikola Ciprich
Sent: Friday, April 24, 2015 7:10 AM
To: ceph-users@lists.ceph.com
Cc: n...@linuxbox.cz
Subject: [ceph-users] very different performance on two volumes in the same pool

Hello, I'm trying to solve a bit mysterious situation: I've got a 3-node CEPH cluster
Re: [ceph-users] 3.18.11 - RBD triggered deadlock?
tcp        0       0 10.0.0.1:6809    10.0.0.1:59692   ESTABLISHED 20182/ceph-osd
tcp        0 4163543 10.0.0.1:59692   10.0.0.1:6809    ESTABLISHED -

> You got bitten by a recently fixed regression. It's never been a good idea to co-locate kernel client with osds, and we advise not to do it. However it happens to work most of the time, so you can do it if you really want to. That "happens to work" part got accidentally broken in 3.18 and was fixed in 4.0, 3.19.5 and 3.18.12. You are running 3.18.11, so you are going to need to upgrade.

tried upgrading to 3.18.12 and I can no longer reproduce the issue. Thanks a lot!

BR

nik

> Thanks,
> Ilya
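The tell-tale sign in the netstat output above is the huge Send-Q on the kernel client's socket towards the OSD: ~4 MB of data queued but never delivered, which is what the deadlock looks like from the outside. A small sketch of pulling that number out of such a line (the sample line is the one from this message):

```python
# Parse a netstat -tanp line; field 2 is Recv-Q, field 3 is Send-Q.
# A large, non-draining Send-Q towards an OSD port is the symptom of the
# co-location deadlock discussed above.
line = "tcp 0 4163543 10.0.0.1:59692 10.0.0.1:6809 ESTABLISHED -"
fields = line.split()
recv_q, send_q = int(fields[1]), int(fields[2])
print(f"Recv-Q = {recv_q}, Send-Q = {send_q} bytes")
```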
Re: [ceph-users] very different performance on two volumes in the same pool
Hello Somnath,

On Fri, Apr 24, 2015 at 04:23:19PM +0000, Somnath Roy wrote:
> This could be again because of tcmalloc issue I reported earlier. Two things to observe.
> 1. Is the performance improving if you stop IO on other volume ? If so, it could be different issue.

there is no other IO.. only cephfs mounted, but no users of it.

> 2. Run perf top in the OSD node and see if tcmalloc traces are popping up.

don't see anything special:

  3.34%  libc-2.12.so          [.] _int_malloc
  2.87%  libc-2.12.so          [.] _int_free
  2.79%  [vdso]                [.] __vdso_gettimeofday
  2.67%  libsoftokn3.so        [.] 0x0001fad9
  2.34%  libfreeblpriv3.so     [.] 0x000355e6
  2.33%  libpthread-2.12.so    [.] pthread_mutex_unlock
  2.19%  libpthread-2.12.so    [.] pthread_mutex_lock
  1.80%  libc-2.12.so          [.] malloc
  1.43%  [kernel]              [k] do_raw_spin_lock
  1.42%  libc-2.12.so          [.] memcpy
  1.23%  [kernel]              [k] __switch_to
  1.19%  [kernel]              [k] acpi_processor_ffh_cstate_enter
  1.09%  libc-2.12.so          [.] malloc_consolidate
  1.08%  [kernel]              [k] __schedule
  1.05%  libtcmalloc.so.4.1.0  [.] 0x00017e6f
  0.98%  libc-2.12.so          [.] vfprintf
  0.83%  libstdc++.so.6.0.13   [.] std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)
  0.76%  libstdc++.so.6.0.13   [.] 0x0008092a
  0.73%  libc-2.12.so          [.] __memset_sse2
  0.72%  libc-2.12.so          [.] __strlen_sse42
  0.70%  libstdc++.so.6.0.13   [.] std::basic_streambuf<char, std::char_traits<char> >::xsputn(char const*, long)
  0.68%  libpthread-2.12.so    [.] pthread_mutex_trylock
  0.67%  librados.so.2.0.0     [.] ceph_crc32c_sctp
  0.63%  libpython2.6.so.1.0   [.] 0x0007d823
  0.55%  libnss3.so            [.] 0x00056d2a
  0.52%  libc-2.12.so          [.] free
  0.50%  libstdc++.so.6.0.13   [.] std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)

should I check anything else?
BR

nik

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Nikola Ciprich
Sent: Friday, April 24, 2015 7:10 AM
To: ceph-users@lists.ceph.com
Cc: n...@linuxbox.cz
Subject: [ceph-users] very different performance on two volumes in the same pool

Hello,

I'm trying to solve a bit mysterious situation: I've got a 3-node CEPH cluster, and a pool made of 3 OSDs (one on each node); the OSDs are 1TB SSD drives. The pool has 3 replicas set.

I'm measuring random IO performance using fio:

fio --randrepeat=1 --ioengine=rbd --direct=1 --gtod_reduce=1 --name=test --pool=ssd3r --rbdname=${rbdname} --invalidate=1 --bs=4k --iodepth=64 --readwrite=randread --output=randio.log

it's giving very nice performance of ~186K IOPS for random read. The problem is, I've got one volume on which it gives only ~20K IOPS and I can't figure out why. It's created using python, so I first suspected it could be similar to the missing-layering problem I was consulting here a few days ago, but when I tried reproducing it, I'm getting ~180K IOPS even for other volumes created using python. So there is only this one problematic, others are fine. Since there is only one SSD in each box and I'm using 3 replicas, there should not be any difference in physical storage used between volumes..

I'm using hammer, 0.94.1, fio 2.2.6. Here's the RBD info:

slow volume:

[root@vfnphav1a fio]# rbd info ssd3r/vmtst23-6
rbd image 'vmtst23-6':
        size 30720 MB in 7680 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.1376d82ae8944a
        format: 2
        features:
        flags:

fast volume:

[root@vfnphav1a fio]# rbd info ssd3r/vmtst23-7
rbd image 'vmtst23-7':
        size 30720 MB in 7680 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.13d01d2ae8944a
        format: 2
        features:
        flags:

any idea on what could be wrong here?

thanks a lot in advance!

BR

nik
Re: [ceph-users] 3.18.11 - RBD triggered deadlock?
> It seems you just grepped for ceph-osd - that doesn't include sockets opened by the kernel client, which is what I was after. Paste the entire netstat?

ouch, bummer! here are the full netstats, sorry about the delay..

http://nik.lbox.cz/download/ceph/

BR

nik

> Thanks,
> Ilya
[ceph-users] very different performance on two volumes in the same pool
Hello,

I'm trying to solve a bit mysterious situation: I've got a 3-node CEPH cluster, and a pool made of 3 OSDs (one on each node); the OSDs are 1TB SSD drives. The pool has 3 replicas set.

I'm measuring random IO performance using fio:

fio --randrepeat=1 --ioengine=rbd --direct=1 --gtod_reduce=1 --name=test --pool=ssd3r --rbdname=${rbdname} --invalidate=1 --bs=4k --iodepth=64 --readwrite=randread --output=randio.log

it's giving very nice performance of ~186K IOPS for random read. The problem is, I've got one volume on which it gives only ~20K IOPS and I can't figure out why. It's created using python, so I first suspected it could be similar to the missing-layering problem I was consulting here a few days ago, but when I tried reproducing it, I'm getting ~180K IOPS even for other volumes created using python. So there is only this one problematic, others are fine. Since there is only one SSD in each box and I'm using 3 replicas, there should not be any difference in physical storage used between volumes..

I'm using hammer, 0.94.1, fio 2.2.6. Here's the RBD info:

slow volume:

[root@vfnphav1a fio]# rbd info ssd3r/vmtst23-6
rbd image 'vmtst23-6':
        size 30720 MB in 7680 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.1376d82ae8944a
        format: 2
        features:
        flags:

fast volume:

[root@vfnphav1a fio]# rbd info ssd3r/vmtst23-7
rbd image 'vmtst23-7':
        size 30720 MB in 7680 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.13d01d2ae8944a
        format: 2
        features:
        flags:

any idea on what could be wrong here?

thanks a lot in advance!

BR

nik
[ceph-users] 3.18.11 - RBD triggered deadlock?
] [81050dc4] ? do_exit+0x6e4/0xaa0
Apr 24 17:09:45 vfnphav1a kernel: [340711.180987] [8106a8b0] ? __init_kthread_worker+0x40/0x40
Apr 24 17:09:45 vfnphav1a kernel: [340711.187757] [81498d88] ret_from_fork+0x58/0x90
Apr 24 17:09:45 vfnphav1a kernel: [340711.193652] [8106a8b0] ? __init_kthread_worker+0x40/0x40

the process started running after some time, but it's excruciatingly slow, with speeds of about 40KB/s. All ceph processes seem to be mostly idle. From the backtrace I'm not sure whether this couldn't be a network adapter problem, since I see some bnx2x_ locking functions, but the network seems to be running fine otherwise and I didn't have any issues till I tried heavily using RBD..

If I could provide some more information, please let me know.

BR

nik
Re: [ceph-users] 3.18.11 - RBD triggered deadlock?
> Does this mean rbd device is mapped on a node that also runs one or more osds?

yes.. I know it's not the best practice, but it's just a test cluster..

> Can you watch osd sockets in netstat for a while and describe what you are seeing or forward a few representative samples?

sure, here it is: http://nik.lbox.cz/download/netstat-osd.log

it doesn't seem to change at all. (just to be exact, there are 3 OSDs on each node, 2 of which are SATA drives that are not used in this pool). there are currently no other ceph users apart from this testing RBD.

I'll have to get off the computer for today in a few minutes, so I won't be able to help much today, but I'll be able to send whatever you need tomorrow or whenever you wish.

n.

> Thanks,
> Ilya
Re: [ceph-users] hammer (0.94.1) - still getting feature set mismatch for cephfs mount requests
Your crushmap has straw2 buckets (alg straw2). That's going to be supported in the 4.1 kernel - when 3.18 was released none of the straw2 stuff existed.

I see.. maybe this is a bit too radical a setting for the optimal preset?

Well, it depends on how you look at it. Generally "optimal" is something that is the best or most desirable, and for a hammer cluster it's going to be hammer tunables ;) You have to remember that the kernel client is just another client as far as ceph is concerned.

yes, this makes sense, and it's pretty easy to fix in case of need.

thanks for your time!

Thanks,
Ilya
[ceph-users] hammer (0.94.1) - still getting feature set mismatch for cephfs mount requests
Hello,

I'm quite new to ceph, so please forgive my ignorance. Yesterday I deployed a small test cluster (3 nodes, 2 SATA + 1 SSD OSD per node). I enabled the MDS server, created cephfs data + metadata pools and created the filesystem. However upon mount requests, I'm getting the following error:

[Apr20 10:09] libceph: mon0 10.0.0.1:6789 feature set mismatch, my 2b84a042aca server's 102b84a042aca, missing 1

These two threads seem related to me:

http://www.spinics.net/lists/ceph-users/msg17406.html (protocol feature mismatch after upgrading to Hammer)

and

http://www.spinics.net/lists/ceph-users/msg17445.html (crush issues in v0.94 hammer)

but I'm using 0.94.1 on all nodes (and the 3.18.11 kernel) and am still getting those errors, which to my understanding I shouldn't be..

What should I check please? In case it could help, my crushmap can be checked here: http://nik.lbox.cz/download/ceph/crushmap.txt

with best regards

nik
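Even with the "missing" value truncated in the log line above, the missing bit can be recovered from the two feature masks themselves. A sketch of the arithmetic (client and server masks taken from the dmesg line; if I read ceph's feature headers right, bit 48 is CEPH_FEATURE_CRUSH_V4, i.e. straw2 support, which matches the straw2 diagnosis elsewhere in this thread):

```python
# Feature masks from "feature set mismatch, my 2b84a042aca server's 102b84a042aca".
client = 0x2b84a042aca
server = 0x102b84a042aca

missing = server & ~client        # bits the server requires but the client lacks
print(hex(missing))               # 0x1000000000000
print(missing.bit_length() - 1)   # bit position: 48
```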
Re: [ceph-users] hammer (0.94.1) - still getting feature set mismatch for cephfs mount requests
Hello Ilya,

> Have you set your crush tunables to hammer?

I've set crush tunables to optimal (therefore I guess they got set to hammer).

> Your crushmap has straw2 buckets (alg straw2). That's going to be supported in the 4.1 kernel - when 3.18 was released none of the straw2 stuff existed.

I see.. maybe this is a bit too radical a setting for the optimal preset?

> You should be able to change alg straw2 to alg straw and that should make it work with the 3.18 kernel.

It indeed helped! Thanks!

BR

nik

> Thanks,
> Ilya
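The edit Ilya suggests is done on the decompiled crushmap (crushtool -d compiled.map -o map.txt to decompile, crushtool -c map.txt -o new.map to recompile, then ceph osd setcrushmap -i new.map to inject it). The text substitution itself is trivial; the bucket below is a made-up example, not from this cluster:

```python
# Minimal sketch of the straw2 -> straw edit on a decompiled crushmap.
# With a live cluster this text would come from "crushtool -d".
crushmap_txt = """\
host node1 {
        id -2
        alg straw2
        hash 0  # rjenkins1
}
"""

patched = crushmap_txt.replace("alg straw2", "alg straw")
print(patched)
```

Note this trades straw2's better rebalancing behaviour for compatibility with pre-4.1 kernel clients.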
[ceph-users] hammer (0.94.1) - image must support layering(38) Function not implemented on v2 image
Hello,

I'd like to ask about another problem I've stumbled upon.. I've got a format 2 image + snapshot, and while trying to protect the snapshot I'm getting the following error:

[root@vfnphav1a ~]# rbd ls -l ssd2r
NAME                            SIZE PARENT FMT PROT LOCK
fio_test                       4096M          2
template-win2k8-20150420      40960M          2
template-win2k8-20150420@snap 40960M          2

[root@vfnphav1a ~]# rbd snap protect ssd2r/template-win2k8-20150420@snap
rbd: protecting snap failed:
2015-04-20 16:47:31.587489 7f5e9e4fa760 -1 librbd: snap_protect: image must support layering(38) Function not implemented

am I doing something wrong?

thanks a lot in advance for reply

BR

nik
Re: [ceph-users] hammer (0.94.1) - image must support layering(38) Function not implemented on v2 image
Hello Jason,

On Mon, Apr 20, 2015 at 01:48:14PM -0400, Jason Dillaman wrote:
> Can you please run 'rbd info' on template-win2k8-20150420 and template-win2k8-20150420@snap? I just want to verify which RBD features are currently enabled on your images. Have you overridden the value of rbd_default_features in your ceph.conf? Did you use the new rbd CLI option '--image-features' when creating the image?

sure, now I can see the difference. This is the image created using rbd create:

[root@vfnphav1a python-rbd]# rbd info ssd2r/template-win2k8-20150420
rbd image 'template-win2k8-20150420':
        size 40960 MB in 10240 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.abc32ae8944a
        format: 2
        features: layering
        flags:

and this is the image created using the python script:

[root@vfnphav1a python-rbd]# rbd info ssd2r/template-win2k8-20150420_
rbd image 'template-win2k8-20150420_':
        size 40960 MB in 10240 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.5e6236db3ab3
        format: 2
        features:
        flags:

I haven't used any --image-features, nor do I have rbd_default_features in ceph.conf. Apparently the problem is the missing layering feature: the python rbd create method does not enable layering, although the v2 format is used. When I added the rbd.RBD_FEATURE_LAYERING flag, I can properly protect created snapshots. Problem solved for me :)

Maybe the question is whether layering should be enabled by default, but now that I know what the problem is, it's no big deal..

thanks a lot for your time!
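For anyone hitting the same thing: the difference is visible in the "features" field of rbd info (layering vs nothing). A small sketch of the bit involved; the feature value is taken from ceph's flag for layering (bit 0), and the create call at the end is the variant described above, shown as a comment since it needs a live cluster:

```python
# In librbd, layering is feature bit 0; the python binding exposes it as
# rbd.RBD_FEATURE_LAYERING. The python-created image above ended up with
# features = 0, while "rbd create" produced features = 1 (layering).
RBD_FEATURE_LAYERING = 1 << 0

def supports_layering(features):
    return bool(features & RBD_FEATURE_LAYERING)

print(supports_layering(0))                     # python-created image: False
print(supports_layering(RBD_FEATURE_LAYERING))  # rbd-create image: True

# With a live cluster, the fix described above would look something like:
# rbdi.create(ioctx, 'test', 1024**2, old_format=False,
#             features=rbd.RBD_FEATURE_LAYERING)
```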
BR

nikola ciprich

--
Jason Dillaman
Red Hat
dilla...@redhat.com
http://www.redhat.com

----- Original Message -----
From: Nikola Ciprich nikola.cipr...@linuxbox.cz
To: Jason Dillaman dilla...@redhat.com
Cc: ceph-users@lists.ceph.com
Sent: Monday, April 20, 2015 12:41:26 PM
Subject: Re: [ceph-users] hammer (0.94.1) - image must support layering(38) Function not implemented on v2 image

Hello Jason,

here it is:

[root@vfnphav1a ceph]# rbd snap protect ssd2r/template-win2k8-20150420_@snap
2015-04-20 18:33:43.635427 7fa0344ca760 20 librbd::ImageCtx: enabling caching...
2015-04-20 18:33:43.635458 7fa0344ca760 20 librbd::ImageCtx: Initial cache settings: size=33554432 num_objects=10 max_dirty=25165824 target_dirty=16777216 max_dirty_age=1
2015-04-20 18:33:43.635497 7fa0344ca760 20 librbd: open_image: ictx = 0x4672010 name = 'template-win2k8-20150420_' id = '' snap_name = ''
2015-04-20 18:33:43.636792 7fa0344ca760 20 librbd: detect format of template-win2k8-20150420_ : new
2015-04-20 18:33:43.637901 7fa0344ca760 10 librbd::ImageCtx: cache bytes 33554432 order 22 - about 42 objects
2015-04-20 18:33:43.637906 7fa0344ca760 10 librbd::ImageCtx: init_layout stripe_unit 4194304 stripe_count 1 object_size 4194304 prefix rbd_data.5e6236db3ab3 format rbd_data.5e6236db3ab3.%016llx
2015-04-20 18:33:43.637932 7fa0344ca760 10 librbd::ImageWatcher: registering image watcher
2015-04-20 18:33:43.643651 7fa0344ca760 20 librbd: ictx_refresh 0x4672010
2015-04-20 18:33:43.645062 7fa0344ca760 20 librbd: new snapshot id=6 name=snap size=42949672960 features=0
2015-04-20 18:33:43.645075 7fa0344ca760 20 librbd::ImageCtx: finished flushing cache
2015-04-20 18:33:43.645083 7fa0344ca760 20 librbd: snap_protect 0x4672010 snap
2015-04-20 18:33:43.645089 7fa0344ca760 20 librbd: ictx_check 0x4672010
2015-04-20 18:33:43.645090 7fa0344ca760 -1 librbd: snap_protect: image must support layering
rbd: protecting snap failed: (38) Function not implemented
2015-04-20 18:33:43.645115 7fa0344ca760 20 librbd: close_image 0x4672010
2015-04-20 18:33:43.645117 7fa0344ca760 10 librbd::ImageCtx: canceling async requests: count=0
2015-04-20 18:33:43.645148 7fa0344ca760 10 librbd::ImageWatcher: unregistering image watcher

In the meantime, I realised what could be the difference here.. the image I've got trouble protecting a snapshot of is created using the python rbd binding.. here's a simple script to reproduce:

#!/usr/bin/python
import rados
import rbd

rc = rados.Rados(conffile='/etc/ceph/ceph.conf')
rc.connect()
ioctx = rc.open_ioctx('ssd2r')
rbdi = rbd.RBD()
rbdi.create(ioctx, 'test', 1024**2, old_format=False)

will it help?

BR

nik

On Mon, Apr 20, 2015 at 11:35:07AM -0400, Jason Dillaman wrote:
> Can you add debug rbd = 20 to your ceph.conf, re-run the command, and provide a link to the generated librbd log messages?
>
> Thanks,
>
> Jason Dillaman
> Red Hat
> dilla...@redhat.com
> http://www.redhat.com
>
> ----- Original Message -----
> From: Nikola Ciprich nikola.cipr...@linuxbox.cz
> To: ceph-users@lists.ceph.com
> Sent: Monday, April 20, 2015 10:54:17 AM
> Subject: [ceph-users] hammer (0.94.1) - image must support layering(38) Function not implemented on v2 image
>
> Hello, I'd like to ask about another problem I've stumbled upon.. I've got format 2 image
Re: [ceph-users] hammer (0.94.1) - image must support layering(38) Function not implemented on v2 image
Hello Jason,

here it is:

[root@vfnphav1a ceph]# rbd snap protect ssd2r/template-win2k8-20150420_@snap
2015-04-20 18:33:43.635427 7fa0344ca760 20 librbd::ImageCtx: enabling caching...
2015-04-20 18:33:43.635458 7fa0344ca760 20 librbd::ImageCtx: Initial cache settings: size=33554432 num_objects=10 max_dirty=25165824 target_dirty=16777216 max_dirty_age=1
2015-04-20 18:33:43.635497 7fa0344ca760 20 librbd: open_image: ictx = 0x4672010 name = 'template-win2k8-20150420_' id = '' snap_name = ''
2015-04-20 18:33:43.636792 7fa0344ca760 20 librbd: detect format of template-win2k8-20150420_ : new
2015-04-20 18:33:43.637901 7fa0344ca760 10 librbd::ImageCtx: cache bytes 33554432 order 22 - about 42 objects
2015-04-20 18:33:43.637906 7fa0344ca760 10 librbd::ImageCtx: init_layout stripe_unit 4194304 stripe_count 1 object_size 4194304 prefix rbd_data.5e6236db3ab3 format rbd_data.5e6236db3ab3.%016llx
2015-04-20 18:33:43.637932 7fa0344ca760 10 librbd::ImageWatcher: registering image watcher
2015-04-20 18:33:43.643651 7fa0344ca760 20 librbd: ictx_refresh 0x4672010
2015-04-20 18:33:43.645062 7fa0344ca760 20 librbd: new snapshot id=6 name=snap size=42949672960 features=0
2015-04-20 18:33:43.645075 7fa0344ca760 20 librbd::ImageCtx: finished flushing cache
2015-04-20 18:33:43.645083 7fa0344ca760 20 librbd: snap_protect 0x4672010 snap
2015-04-20 18:33:43.645089 7fa0344ca760 20 librbd: ictx_check 0x4672010
2015-04-20 18:33:43.645090 7fa0344ca760 -1 librbd: snap_protect: image must support layering
rbd: protecting snap failed: (38) Function not implemented
2015-04-20 18:33:43.645115 7fa0344ca760 20 librbd: close_image 0x4672010
2015-04-20 18:33:43.645117 7fa0344ca760 10 librbd::ImageCtx: canceling async requests: count=0
2015-04-20 18:33:43.645148 7fa0344ca760 10 librbd::ImageWatcher: unregistering image watcher

In the meantime, I realised what could be the difference here.. the image I've got trouble protecting a snapshot of is created using the python rbd binding..
here's a simple script to reproduce:

#!/usr/bin/python
import rados
import rbd

rc = rados.Rados(conffile='/etc/ceph/ceph.conf')
rc.connect()
ioctx = rc.open_ioctx('ssd2r')
rbdi = rbd.RBD()
rbdi.create(ioctx, 'test', 1024**2, old_format=False)

will it help?

BR

nik

On Mon, Apr 20, 2015 at 11:35:07AM -0400, Jason Dillaman wrote:
> Can you add debug rbd = 20 to your ceph.conf, re-run the command, and provide a link to the generated librbd log messages?
>
> Thanks,
>
> Jason Dillaman
> Red Hat
> dilla...@redhat.com
> http://www.redhat.com
>
> ----- Original Message -----
> From: Nikola Ciprich nikola.cipr...@linuxbox.cz
> To: ceph-users@lists.ceph.com
> Sent: Monday, April 20, 2015 10:54:17 AM
> Subject: [ceph-users] hammer (0.94.1) - image must support layering(38) Function not implemented on v2 image
>
> Hello, I'd like to ask about another problem I've stumbled upon.. I've got format 2 image + snapshot, and while trying to protect snapshot I'm getting following error:
>
> [root@vfnphav1a ~]# rbd ls -l ssd2r
> NAME                            SIZE PARENT FMT PROT LOCK
> fio_test                       4096M          2
> template-win2k8-20150420      40960M          2
> template-win2k8-20150420@snap 40960M          2
>
> [root@vfnphav1a ~]# rbd snap protect ssd2r/template-win2k8-20150420@snap
> rbd: protecting snap failed:
> 2015-04-20 16:47:31.587489 7f5e9e4fa760 -1 librbd: snap_protect: image must support layering(38) Function not implemented
>
> am I doing something wrong?
>
> thanks a lot in advance for reply
>
> BR
>
> nik