[ceph-users] Mon crashes virtual void LogMonitor::update_from_paxos(bool*)
Hey all,

One of my mons has been having a rough time for the last day or so. It started with a crash and restart I didn't notice about a day ago, and now it won't start. Where it crashes has changed over time, but it is now stuck on the last error below. I've tried to get some more information out of it with debug logging and gdb, but I haven't seen anything that makes the root cause obvious. Right now it is crashing at line 103 in https://github.com/ceph/ceph/blob/mimic/src/mon/LogMonitor.cc#L103. This is part of the mon preinit step. Best that I can tell right now, it is having a problem with a map version. I'm considering rebuilding the mon's store, though I don't see any clear signs of corruption. It bails at assert(err == 0):

    // walk through incrementals
    while (version > summary.version) {
      bufferlist bl;
      int err = get_version(summary.version+1, bl);
      assert(err == 0);
      assert(bl.length());

Has anyone seen similar or have any ideas?

ceph 13.2.8

Thanks!
Kevin

The first crash/restart:

Jan 14 20:47:11 sephmon5 ceph-mon: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.8/rpm/el7/BUILD/ceph-13.2.8/src/mon/Monitor.cc: In function 'bool Monitor::_scrub(ScrubResult*, std::pair, std::basic_string >*, int*)' thread 7f5b54680700 time 2020-01-14 20:47:11.618368
Jan 14 20:47:11 sephmon5 ceph-mon: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.8/rpm/el7/BUILD/ceph-13.2.8/src/mon/Monitor.cc: 5225: FAILED assert(err == 0)
Jan 14 20:47:11 sephmon5 ceph-mon: ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
Jan 14 20:47:11 sephmon5 ceph-mon: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14b) [0x7f5b6440b87b]
Jan 14 20:47:11 sephmon5 ceph-mon: 2: (()+0x26fa07) [0x7f5b6440ba07]
Jan 14 20:47:11 sephmon5 ceph-mon: 3: (Monitor::_scrub(ScrubResult*, std::pair*, int*)+0xfa6) [0x55c3230a1896]
Jan 14 20:47:11 sephmon5 ceph-mon: 4: (Monitor::handle_scrub(boost::intrusive_ptr)+0x25e) [0x55c3230aa01e]
Jan 14 20:47:11 sephmon5 ceph-mon: 5: (Monitor::dispatch_op(boost::intrusive_ptr)+0xcaf) [0x55c3230c73ff]
Jan 14 20:47:11 sephmon5 ceph-mon: 6: (Monitor::_ms_dispatch(Message*)+0x732) [0x55c3230c8152]
Jan 14 20:47:11 sephmon5 ceph-mon: 7: (Monitor::ms_dispatch(Message*)+0x23) [0x55c3230edcc3]
Jan 14 20:47:11 sephmon5 ceph-mon: 8: (DispatchQueue::entry()+0xb7a) [0x7f5b644ca24a]
Jan 14 20:47:11 sephmon5 ceph-mon: 9: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f5b645684bd]
Jan 14 20:47:11 sephmon5 ceph-mon: 10: (()+0x7e65) [0x7f5b63749e65]
Jan 14 20:47:11 sephmon5 ceph-mon: 11: (clone()+0x6d) [0x7f5b6025d88d]

Then a couple more crashes/restarts about 11 hours later with this trace:

-10001> 2020-01-15 09:36:35.796 7f9600fc7700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.8/rpm/el7/BUILD/ceph-13.2.8/src/mon/LogMonitor.cc: In function 'void LogMonitor::_create_sub_incremental(MLog*, int, version_t)' thread 7f9600fc7700 time 2020-01-15 09:36:35.796354
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.8/rpm/el7/BUILD/ceph-13.2.8/src/mon/LogMonitor.cc: 673: FAILED assert(err == 0)
ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14b) [0x7f9610d5287b]
2: (()+0x26fa07) [0x7f9610d52a07]
3: (LogMonitor::_create_sub_incremental(MLog*, int, unsigned long)+0xb54) [0x55aeb09e2f94]
4: (LogMonitor::check_sub(Subscription*)+0x506) [0x55aeb09e3806]
5: (Monitor::handle_subscribe(boost::intrusive_ptr)+0x10ed) [0x55aeb098973d]
6: (Monitor::dispatch_op(boost::intrusive_ptr)+0x3cd) [0x55aeb09b0b1d]
7: (Monitor::_ms_dispatch(Message*)+0x732) [0x55aeb09b2152]
8: (Monitor::ms_dispatch(Message*)+0x23) [0x55aeb09d7cc3]
9: (DispatchQueue::entry()+0xb7a) [0x7f9610e1124a]
10: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f9610eaf4bd]
11: (()+0x7e65) [0x7f9610090e65]
12: (clone()+0x6d) [0x7f960cba488d]
-10001> 2020-01-15 09:36:35.797 7f95fffc5700 1 -- 10.1.9.205:6789/0 >> - conn(0x55aec5dd0600 :6789 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=47 -
-10001> 2020-01-15 09:36:35.798 7f9600fc7700 -1 *** Caught signal (Aborted) ** in thread 7f9600fc7700 thread_name:ms_dispatch

And now the mon no longer starts with this trace:

-261> 2020-01-15 16:36:46.084 7f0946674a00 10 mon.sephmon5@-1(probing).paxosservice(logm 0..86521000) refresh
-261> 2020-01-15 16:36:46.084 7f0946674a00 10 mon.sephmon5@-1(probing).log v86521000 update_from_paxos
-261> 2020-01-15 16:36:46.084 7
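If the other mons are healthy and in quorum, one recovery path is to not debug this store at all: throw away the bad mon's data and let it resync from its peers. A rough sketch, assuming default mon data paths and systemd unit names (adjust for your deployment):

    systemctl stop ceph-mon@sephmon5
    mv /var/lib/ceph/mon/ceph-sephmon5 /var/lib/ceph/mon/ceph-sephmon5.bak
    mkdir /var/lib/ceph/mon/ceph-sephmon5
    ceph auth get mon. -o /tmp/mon.keyring        # fetch the mon keyring from the quorum
    ceph mon getmap -o /tmp/monmap                # fetch the current monmap
    ceph-mon -i sephmon5 --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
    chown -R ceph:ceph /var/lib/ceph/mon/ceph-sephmon5
    systemctl start ceph-mon@sephmon5

On startup the rebuilt mon should pull a full copy of the store from the quorum, sidestepping whatever is wrong with logm version 86521000 in the old store.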
[ceph-users] January Ceph Science Group Virtual Meeting
Hello,

We will be having a Ceph science/research/big cluster call on Wednesday January 22nd. If anyone wants to discuss something specific they can add it to the pad linked below. If you have questions or comments you can contact me. This is an informal open call of community members, mostly from hpc/htc/research environments, where we discuss whatever is on our minds regarding ceph. Updates, outages, features, maintenance, etc... there is no set presenter but I do attempt to keep the conversation lively.

https://pad.ceph.com/p/Ceph_Science_User_Group_20200122

Ceph calendar event details:

January 22, 2020
9am US Central
4pm Central European

We try to keep it to an hour or less.

Description: Main pad for discussions: https://pad.ceph.com/p/Ceph_Science_User_Group_Index
Meetings will be recorded and posted to the Ceph Youtube channel.
To join the meeting on a computer or mobile phone: https://bluejeans.com/908675367?src=calendarLink
To join from a Red Hat Deskphone or Softphone, dial: 84336.
Connecting directly from a room system?
1.) Dial: 199.48.152.152 or bjn.vc
2.) Enter Meeting ID: 908675367
Just want to dial in on your phone?
1.) Dial one of the following numbers: 408-915-6466 (US)
See all numbers: https://www.redhat.com/en/conference-numbers
2.) Enter Meeting ID: 908675367
3.) Press #
Want to test your video connection? https://bluejeans.com/111

Kevin

--
Kevin Hrpcek
NASA VIIRS Atmosphere SIPS
Space Science & Engineering Center
University of Wisconsin-Madison
[ceph-users] Ceph Science User Group Call October
Hello,

This Wednesday we'll have a ceph science user group call. This is an informal conversation focused on using ceph in htc/hpc and scientific research environments.

Call details copied from the event:

Wednesday October 23rd
14:00 UTC
4:00PM Central European
10:00AM Eastern American

Main pad for discussions: https://pad.ceph.com/p/Ceph_Science_User_Group_Index
Meetings will be recorded and posted to the Ceph Youtube channel.
To join the meeting on a computer or mobile phone: https://bluejeans.com/908675367?src=calendarLink
To join from a Red Hat Deskphone or Softphone, dial: 84336.
Connecting directly from a room system?
1.) Dial: 199.48.152.152 or bjn.vc
2.) Enter Meeting ID: 908675367
Just want to dial in on your phone?
1.) Dial one of the following numbers: 408-915-6466 (US)
See all numbers: https://www.redhat.com/en/conference-numbers
2.) Enter Meeting ID: 908675367
3.) Press #
Want to test your video connection? https://bluejeans.com/111

--
Kevin Hrpcek
NASA VIIRS Atmosphere SIPS
Space Science & Engineering Center
University of Wisconsin-Madison
Re: [ceph-users] slow ops for mon slowly increasing
OK, looks like clock skew is the problem. I thought this was caused by the reboot, but it did not fix itself after some minutes (mon3 was 6 seconds ahead). After forcing a time sync from the same server, it seems to be solved now.

Kevin

On Fri, Sep 20, 2019 at 07:33, Kevin Olbrich wrote:

> Hi!
>
> Today some OSDs went down, a temporary problem that was solved easily.
> The mimic cluster is working and all OSDs are complete, all active+clean.
>
> Completely new for me is this:
>
> 25 slow ops, oldest one blocked for 219 sec, mon.mon03 has slow ops
>
> The cluster itself looks fine, monitoring for the VMs that use RBD is fine.
>
> I thought that might be https://tracker.ceph.com/issues/24531 but I've
> restarted the mon service (and the node as a whole) and both did not help.
> The slow ops slowly increase.
>
> Example:
>
> {
>     "description": "auth(proto 0 30 bytes epoch 0)",
>     "initiated_at": "2019-09-20 05:31:52.295858",
>     "age": 7.851164,
>     "duration": 7.900068,
>     "type_data": {
>         "events": [
>             { "time": "2019-09-20 05:31:52.295858", "event": "initiated" },
>             { "time": "2019-09-20 05:31:52.295858", "event": "header_read" },
>             { "time": "2019-09-20 05:31:52.295864", "event": "throttled" },
>             { "time": "2019-09-20 05:31:52.295875", "event": "all_read" },
>             { "time": "2019-09-20 05:31:52.296075", "event": "dispatched" },
>             { "time": "2019-09-20 05:31:52.296089", "event": "mon:_ms_dispatch" },
>             { "time": "2019-09-20 05:31:52.296097", "event": "mon:dispatch_op" },
>             { "time": "2019-09-20 05:31:52.296098", "event": "psvc:dispatch" },
>             { "time": "2019-09-20 05:31:52.296172", "event": "auth:wait_for_readable" },
>             { "time": "2019-09-20 05:31:52.296177", "event": "auth:wait_for_readable/paxos" },
>             { "time": "2019-09-20 05:31:52.296232", "event": "paxos:wait_for_readable" }
>         ],
>         "info": {
>             "seq": 1708,
>             "src_is_mon": false,
>             "source": "client.? [fd91:462b:4243:47e::1:3]:0/2365414961",
>             "forwarded_to_leader": false
>         }
>     }
> },
> {
>     "description": "auth(proto 0 30 bytes epoch 0)",
>     "initiated_at": "2019-09-20 05:31:52.314892",
>     "age": 7.832131,
>     "duration": 7.881230,
>     "type_data": {
>         "events": [
>             { "time": "2019-09-20 05:31:52.314892", "event": "initiated" },
>             { "time": "2019-09-20 05:31:52.314892", "event": "header_read" },
>             { "time": "2019-09-20 05:31:52.3
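Since the root cause here was a stepped clock, a minimal sketch of forcing the skewed mon host back in sync and verifying it — assuming chrony and a placeholder NTP server (ntpd users would use ntpdate or ntpd -gq instead):

    systemctl stop chronyd
    chronyd -q 'server <your-ntp-server> iburst'   # one-shot step of the clock
    systemctl start chronyd
    ceph time-sync-status                          # confirm the mons agree there is no skew

Once the skew is gone, the mon's slow ops should stop accumulating and drain on their own.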
[ceph-users] slow ops for mon slowly increasing
"event": "mon:dispatch_op" }, { "time": "2019-09-20 05:31:52.315083", "event": "psvc:dispatch" }, { "time": "2019-09-20 05:31:52.315161", "event": "auth:wait_for_readable" }, { "time": "2019-09-20 05:31:52.315167", "event": "auth:wait_for_readable/paxos" }, { "time": "2019-09-20 05:31:52.315230", "event": "paxos:wait_for_readable" } ], "info": { "seq": 1709, "src_is_mon": false, "source": "client.? [fd91:462b:4243:47e::1:3]:0/997594187", "forwarded_to_leader": false } } } This is a new situation for me. What am I supposed to do in this case? Thank you! Kind regards Kevin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph Scientific Computing User Group
The first ceph + htc/hpc/science virtual user group meeting is tomorrow, Wednesday August 28th, at 10:30am US eastern / 4:30pm EU central time. Duration will be kept to <= 1 hour. I'd like this to be conducted as a user group and not only one person talking/presenting. For this first meeting I'd like to get input from everyone on the call regarding what field they are in and how ceph is used as a solution for their implementation. We'll see where it goes from there. Use the pad link below to get to a url for live meeting notes.

Meeting connection details from the ceph community calendar:

Description: Main pad for discussions: https://pad.ceph.com/p/Ceph_Science_User_Group_Index
Meetings will be recorded and posted to the Ceph Youtube channel.
To join the meeting on a computer or mobile phone: https://bluejeans.com/908675367?src=calendarLink
To join from a Red Hat Deskphone or Softphone, dial: 84336.
Connecting directly from a room system?
1.) Dial: 199.48.152.152 or bjn.vc
2.) Enter Meeting ID: 908675367
Just want to dial in on your phone?
1.) Dial one of the following numbers: 408-915-6466 (US)
See all numbers: https://www.redhat.com/en/conference-numbers
2.) Enter Meeting ID: 908675367
3.) Press #
Want to test your video connection? https://bluejeans.com/111

Kevin

On 8/2/19 12:08 PM, Mike Perez wrote:

We have scheduled the next meeting on the community calendar for August 28 at 14:30 UTC. Each meeting will then take place on the last Wednesday of each month. Here's the pad to collect agenda/notes: https://pad.ceph.com/p/Ceph_Science_User_Group_Index

-- Mike Perez (thingee)

On Tue, Jul 23, 2019 at 10:40 AM Kevin Hrpcek <kevin.hrp...@ssec.wisc.edu> wrote:

Update

We're going to hold off until August for this so we can promote it on the Ceph twitter with more notice. Sorry for the inconvenience if you were planning on the meeting tomorrow. Keep a watch on the list, twitter, or ceph calendar for updates.

Kevin

On 7/5/19 11:15 PM, Kevin Hrpcek wrote:

We've had some positive feedback and will be moving forward with this user group. The first virtual user group meeting is planned for July 24th at 4:30pm central European time / 10:30am American eastern time. We will keep it to an hour in length. The plan is to use the ceph bluejeans video conferencing and it will be put on the ceph community calendar. I will send out links when it is closer to the 24th.

The goal of this user group is to promote conversations and sharing ideas for how ceph is used in the scientific/hpc/htc communities. Please be willing to discuss your use cases, cluster configs, problems you've had, shortcomings in ceph, etc... Not everyone pays attention to the ceph lists so feel free to share the meeting information with others you know that may be interested in joining in. Contact me if you have questions, comments, suggestions, or want to volunteer a topic for meetings.

I will be brainstorming some conversation starters, but it would also be interesting to have people give a deep dive into their use of ceph and what they have built around it to support the science being done at their facility.

Kevin

On 6/17/19 10:43 AM, Kevin Hrpcek wrote:

Hey all,

At cephalocon some of us who work in scientific computing got together for a BoF and had a good conversation. There was some interest in finding a way to continue the conversation focused on ceph in scientific computing and htc/hpc environments. We are considering putting together a monthly video conference user group meeting to facilitate sharing thoughts and ideas for this part of the ceph community. At cephalocon we mostly had teams present from the EU, so I'm interested in hearing how much community interest there is in a ceph+science/HPC/HTC user group meeting. It will be impossible to pick a time that works well for everyone, but initially we considered something later in the work day for EU countries. Reply to me if you're interested and please include your timezone.

Kevin
Re: [ceph-users] How to add 100 new OSDs...
I change the crush weights. My 4 second sleep doesn't let peering finish for each one before continuing. I'd test with some small steps to get an idea of how much remaps when increasing the weight by $x. I've found my cluster is comfortable with +1 increases... also, it takes a while to get to a weight of 11 if I do anything smaller.

for i in {264..311}; do ceph osd crush reweight osd.${i} 11.0; sleep 4; done

Kevin

On 7/24/19 12:33 PM, Xavier Trilla wrote:

Hi Kevin,

Yeah, that makes a lot of sense, and looks even safer than adding OSDs one by one. What do you change, the crush weight? Or the reweight? (I guess you change the crush weight, am I right?)

Thanks!

On 24 Jul 2019, at 19:17, Kevin Hrpcek <kevin.hrp...@ssec.wisc.edu> wrote:

I often add 50+ OSDs at a time and my cluster is all NLSAS. Here is what I do; you can obviously change the weight increase steps to what you are comfortable with. This has worked well for me and my workloads. I've sometimes seen peering take longer if I do steps too quickly, but I don't run any mission critical has-to-be-up-100% stuff and I usually don't notice if a pg takes a while to peer.

Add all OSDs with an initial weight of 0. (nothing gets remapped)
Ensure cluster is healthy.
Use a for loop to increase weight on all new OSDs to 0.5 with a generous sleep between each for peering.
Let the cluster balance and get healthy or close to healthy.
Then repeat the previous 2 steps, increasing weight by +0.5 or +1.0 until I am at the desired weight.

Kevin

On 7/24/19 11:44 AM, Xavier Trilla wrote:

Hi,

What would be the proper way to add 100 new OSDs to a cluster? I have to add 100 new OSDs to our current >300 OSD cluster, and I would like to know how you do it.

Usually, we add them quite slowly. Our cluster is a pure SSD/NVMe one, and it can handle plenty of load, but for the sake of safety -it hosts thousands of VMs via RBD- we usually add them one by one, waiting for a long time between adding each OSD. Obviously this leads to PLENTY of data movement, as each time the cluster geometry changes, data is migrated among all the OSDs. But with the kind of load we have, if we add several OSDs at the same time, some PGs can get stuck for a while, while they peer to the new OSDs.

Now that I have to add >100 new OSDs I was wondering if somebody has some suggestions.

Thanks!
Xavier.
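A stepped variant of that loop that pauses for the cluster to settle between each +1 increase — a sketch only; the HEALTH_OK grep is a crude gate, and the osd id range and target weight of 11 mirror the example above:

    for w in $(seq 1 11); do
      for i in {264..311}; do
        ceph osd crush reweight osd.${i} ${w}
        sleep 4
      done
      # wait for recovery to settle before the next +1 step
      while ! ceph health | grep -q HEALTH_OK; do sleep 60; done
    done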
Re: [ceph-users] How to add 100 new OSDs...
I often add 50+ OSDs at a time and my cluster is all NLSAS. Here is what I do; you can obviously change the weight increase steps to what you are comfortable with. This has worked well for me and my workloads. I've sometimes seen peering take longer if I do steps too quickly, but I don't run any mission critical has-to-be-up-100% stuff and I usually don't notice if a pg takes a while to peer.

Add all OSDs with an initial weight of 0. (nothing gets remapped)
Ensure cluster is healthy.
Use a for loop to increase weight on all new OSDs to 0.5 with a generous sleep between each for peering.
Let the cluster balance and get healthy or close to healthy.
Then repeat the previous 2 steps, increasing weight by +0.5 or +1.0 until I am at the desired weight.

Kevin

On 7/24/19 11:44 AM, Xavier Trilla wrote:

Hi,

What would be the proper way to add 100 new OSDs to a cluster? I have to add 100 new OSDs to our current >300 OSD cluster, and I would like to know how you do it.

Usually, we add them quite slowly. Our cluster is a pure SSD/NVMe one, and it can handle plenty of load, but for the sake of safety -it hosts thousands of VMs via RBD- we usually add them one by one, waiting for a long time between adding each OSD. Obviously this leads to PLENTY of data movement, as each time the cluster geometry changes, data is migrated among all the OSDs. But with the kind of load we have, if we add several OSDs at the same time, some PGs can get stuck for a while, while they peer to the new OSDs.

Now that I have to add >100 new OSDs I was wondering if somebody has some suggestions.

Thanks!
Xavier.
Re: [ceph-users] Ceph Scientific Computing User Group
Update

We're going to hold off until August for this so we can promote it on the Ceph twitter with more notice. Sorry for the inconvenience if you were planning on the meeting tomorrow. Keep a watch on the list, twitter, or ceph calendar for updates.

Kevin

On 7/5/19 11:15 PM, Kevin Hrpcek wrote:

We've had some positive feedback and will be moving forward with this user group. The first virtual user group meeting is planned for July 24th at 4:30pm central European time / 10:30am American eastern time. We will keep it to an hour in length. The plan is to use the ceph bluejeans video conferencing and it will be put on the ceph community calendar. I will send out links when it is closer to the 24th.

The goal of this user group is to promote conversations and sharing ideas for how ceph is used in the scientific/hpc/htc communities. Please be willing to discuss your use cases, cluster configs, problems you've had, shortcomings in ceph, etc... Not everyone pays attention to the ceph lists so feel free to share the meeting information with others you know that may be interested in joining in. Contact me if you have questions, comments, suggestions, or want to volunteer a topic for meetings. I will be brainstorming some conversation starters, but it would also be interesting to have people give a deep dive into their use of ceph and what they have built around it to support the science being done at their facility.

Kevin

On 6/17/19 10:43 AM, Kevin Hrpcek wrote:

Hey all,

At cephalocon some of us who work in scientific computing got together for a BoF and had a good conversation. There was some interest in finding a way to continue the conversation focused on ceph in scientific computing and htc/hpc environments. We are considering putting together a monthly video conference user group meeting to facilitate sharing thoughts and ideas for this part of the ceph community. At cephalocon we mostly had teams present from the EU, so I'm interested in hearing how much community interest there is in a ceph+science/HPC/HTC user group meeting. It will be impossible to pick a time that works well for everyone, but initially we considered something later in the work day for EU countries. Reply to me if you're interested and please include your timezone.

Kevin
Re: [ceph-users] Ceph Scientific Computing User Group
We've had some positive feedback and will be moving forward with this user group. The first virtual user group meeting is planned for July 24th at 4:30pm central European time / 10:30am American eastern time. We will keep it to an hour in length. The plan is to use the ceph bluejeans video conferencing and it will be put on the ceph community calendar. I will send out links when it is closer to the 24th.

The goal of this user group is to promote conversations and sharing ideas for how ceph is used in the scientific/hpc/htc communities. Please be willing to discuss your use cases, cluster configs, problems you've had, shortcomings in ceph, etc... Not everyone pays attention to the ceph lists so feel free to share the meeting information with others you know that may be interested in joining in. Contact me if you have questions, comments, suggestions, or want to volunteer a topic for meetings. I will be brainstorming some conversation starters, but it would also be interesting to have people give a deep dive into their use of ceph and what they have built around it to support the science being done at their facility.

Kevin

On 6/17/19 10:43 AM, Kevin Hrpcek wrote:

Hey all,

At cephalocon some of us who work in scientific computing got together for a BoF and had a good conversation. There was some interest in finding a way to continue the conversation focused on ceph in scientific computing and htc/hpc environments. We are considering putting together a monthly video conference user group meeting to facilitate sharing thoughts and ideas for this part of the ceph community. At cephalocon we mostly had teams present from the EU, so I'm interested in hearing how much community interest there is in a ceph+science/HPC/HTC user group meeting. It will be impossible to pick a time that works well for everyone, but initially we considered something later in the work day for EU countries. Reply to me if you're interested and please include your timezone.

Kevin
[ceph-users] Ceph Scientific Computing User Group
Hey all,

At cephalocon some of us who work in scientific computing got together for a BoF and had a good conversation. There was some interest in finding a way to continue the conversation focused on ceph in scientific computing and htc/hpc environments. We are considering putting together a monthly video conference user group meeting to facilitate sharing thoughts and ideas for this part of the ceph community.

At cephalocon we mostly had teams present from the EU, so I'm interested in hearing how much community interest there is in a ceph+science/HPC/HTC user group meeting. It will be impossible to pick a time that works well for everyone, but initially we considered something later in the work day for EU countries. Reply to me if you're interested and please include your timezone.

Kevin
Re: [ceph-users] QEMU/KVM client compatibility
On Tue, May 28, 2019 at 10:20, Wido den Hollander wrote:

>
> On 5/28/19 10:04 AM, Kevin Olbrich wrote:
> > Hi Wido,
> >
> > thanks for your reply!
> >
> > For CentOS 7, this means I can switch over to the "rpm-nautilus/el7"
> > repository and Qemu uses a nautilus compatible client?
> > I just want to make sure I understand correctly.
> >
>
> Yes, that is correct. Keep in mind though that you will need to
> Stop/Start the VMs or (Live) Migrate them to a different hypervisor for
> the new packages to be loaded.
>

Actually the hosts are Fedora 29, which I need to re-deploy with Fedora 30 to get nautilus on the clients. I just wanted to understand how this works. I always reboot the whole machine after such a large change to make sure it works.

Thank you for your time!

> Wido
>
> > Thank you very much!
> >
> > Kevin
> >
> > On Tue, May 28, 2019 at 09:46, Wido den Hollander <w...@42on.com> wrote:
> >
> >     On 5/28/19 7:52 AM, Kevin Olbrich wrote:
> >     > Hi!
> >     >
> >     > How can I determine which client compatibility level (luminous, mimic,
> >     > nautilus, etc.) is supported in Qemu/KVM?
> >     > Does it depend on the version of ceph packages on the system? Or do I
> >     > need a recent version Qemu/KVM?
> >
> >     This is mainly related to librados and librbd on your system. Qemu talks
> >     to librbd which then talks to librados.
> >
> >     Qemu -> librbd -> librados -> Ceph cluster
> >
> >     So make sure you keep the librbd and librados packages updated on your
> >     hypervisor.
> >
> >     When upgrading them make sure you either Stop/Start or Live Migrate
> >     the VMs to a different hypervisor so the VMs are initiated with the new
> >     code.
> >
> >     Wido
> >
> >     > Which component defines which client level will be supported?
> >     >
> >     > Thank you very much!
> >     >
> >     > Kind regards
> >     > Kevin
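To see what client code a hypervisor is actually running, and what the cluster sees from connected clients, the following is usually enough — a sketch assuming an RPM-based host with the standard package names:

    rpm -q librbd1 librados2   # client library versions on the hypervisor
    ceph versions              # ceph release of every daemon in the cluster
    ceph features              # release/feature level of currently connected clients

`ceph features` in particular shows the oldest client release still connecting, which is worth checking before raising the cluster's minimum client requirements.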
Re: [ceph-users] QEMU/KVM client compatibility
Hi Wido,

thanks for your reply!

For CentOS 7, this means I can switch over to the "rpm-nautilus/el7" repository and Qemu uses a nautilus compatible client? I just want to make sure I understand correctly.

Thank you very much!

Kevin

On Tue, May 28, 2019 at 09:46, Wido den Hollander wrote:

>
> On 5/28/19 7:52 AM, Kevin Olbrich wrote:
> > Hi!
> >
> > How can I determine which client compatibility level (luminous, mimic,
> > nautilus, etc.) is supported in Qemu/KVM?
> > Does it depend on the version of ceph packages on the system? Or do I
> > need a recent version Qemu/KVM?
>
> This is mainly related to librados and librbd on your system. Qemu talks
> to librbd which then talks to librados.
>
> Qemu -> librbd -> librados -> Ceph cluster
>
> So make sure you keep the librbd and librados packages updated on your
> hypervisor.
>
> When upgrading them make sure you either Stop/Start or Live Migrate the
> VMs to a different hypervisor so the VMs are initiated with the new code.
>
> Wido
>
> > Which component defines which client level will be supported?
> >
> > Thank you very much!
> >
> > Kind regards
> > Kevin
[ceph-users] QEMU/KVM client compatibility
Hi!

How can I determine which client compatibility level (luminous, mimic, nautilus, etc.) is supported in Qemu/KVM? Does it depend on the version of ceph packages on the system? Or do I need a recent version of Qemu/KVM? Which component defines which client level will be supported?

Thank you very much!

Kind regards
Kevin
Re: [ceph-users] Major ceph disaster
ok, this just gives me:

error getting xattr ec31/10004dfce92./parent: (2) No such file or directory

Does this mean that the lost object isn't even a file that appears in the ceph directory? Maybe a leftover of a file that has not been deleted properly? It wouldn't be an issue to mark the object as lost in that case.

On 24.05.19 5:08 PM, Robert LeBlanc wrote:

You need to use the first stripe of the object as that is the only one with the metadata. Try "rados -p ec31 getxattr 10004dfce92. parent" instead.

Robert LeBlanc
Sent from a mobile device, please excuse any typos.

On Fri, May 24, 2019, 4:42 AM Kevin Flöh <kevin.fl...@kit.edu> wrote:

Hi,

we already tried "rados -p ec31 getxattr 10004dfce92.003d parent" but this is just hanging forever if we are looking for unfound objects. It works fine for all other objects. We also tried scanning the ceph directory with find -inum 1099593404050 (decimal of 10004dfce92) and found nothing. This is also working for non-unfound objects. Is there another way to find the corresponding file?

On 24.05.19 11:12 AM, Burkhard Linke wrote:

Hi,

On 5/24/19 9:48 AM, Kevin Flöh wrote:

We got the object ids of the missing objects with ceph pg 1.24c list_missing:

{
  "offset": { "oid": "", "key": "", "snapid": 0, "hash": 0, "max": 0, "pool": -9223372036854775808, "namespace": "" },
  "num_missing": 1,
  "num_unfound": 1,
  "objects": [
    { "oid": { "oid": "10004dfce92.003d", "key": "", "snapid": -2, "hash": 90219084, "max": 0, "pool": 1, "namespace": "" },
      "need": "46950'195355",
      "have": "0'0",
      "flags": "none",
      "locations": [ "36(3)", "61(2)" ] }
  ],
  "more": false
}

we want to give up those objects with:

ceph pg 1.24c mark_unfound_lost revert

But first we would like to know which file(s) is affected. Is there a way to map the object id to the corresponding file?

The object name is composed of the file inode id and the chunk within the file. The first chunk has some metadata you can use to retrieve the filename. See the 'CephFS object mapping' thread on the mailing list for more information.

Regards,
Burkhard
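For reference, the full recipe from the 'CephFS object mapping' thread is to pull the parent xattr off the first stripe and decode it as an inode_backtrace_t — a sketch, assuming the first stripe of this inode is named 10004dfce92.00000000:

    rados -p ec31 getxattr 10004dfce92.00000000 parent > parent.bin
    ceph-dencoder type inode_backtrace_t import parent.bin decode dump_json

The decoded "ancestors" array contains the file name and its parent directory inodes. As seen above, though, if the first stripe is itself lost or unfound, the getxattr has nothing to read, which supports the leftover-object theory.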
Re: [ceph-users] Major ceph disaster
Hi, we already tried "rados -p ec31 getxattr 10004dfce92.003d parent" but this is just hanging forever if we are looking for unfound objects. It works fine for all other objects. We also tried scanning the ceph directory with find -inum 1099593404050 (decimal of 10004dfce92) and found nothing. This is also working for non unfound objects. Is there another way to find the corresponding file? On 24.05.19 11:12 vorm., Burkhard Linke wrote: Hi, On 5/24/19 9:48 AM, Kevin Flöh wrote: We got the object ids of the missing objects with|ceph pg 1.24c list_missing:| |{ "offset": { "oid": "", "key": "", "snapid": 0, "hash": 0, "max": 0, "pool": -9223372036854775808, "namespace": "" }, "num_missing": 1, "num_unfound": 1, "objects": [ { "oid": { "oid": "10004dfce92.003d", "key": "", "snapid": -2, "hash": 90219084, "max": 0, "pool": 1, "namespace": "" }, "need": "46950'195355", "have": "0'0", "flags": "none", "locations": [ "36(3)", "61(2)" ] } ], "more": false } | |we want to give up those objects with:| ceph pg 1.24c mark_unfound_lost revert But first we would like to know which file(s) is affected. Is there a way to map the object id to the corresponding file? The object name is composed of the file inode id and the chunk within the file. The first chunk has some metadata you can use to retrieve the filename. See the 'CephFS object mapping' thread on the mailing list for more information. Regards, Burkhard ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Major ceph disaster
We got the object ids of the missing objects with ceph pg 1.24c list_missing:

{
  "offset": { "oid": "", "key": "", "snapid": 0, "hash": 0, "max": 0, "pool": -9223372036854775808, "namespace": "" },
  "num_missing": 1,
  "num_unfound": 1,
  "objects": [
    { "oid": { "oid": "10004dfce92.003d", "key": "", "snapid": -2, "hash": 90219084, "max": 0, "pool": 1, "namespace": "" },
      "need": "46950'195355",
      "have": "0'0",
      "flags": "none",
      "locations": [ "36(3)", "61(2)" ] }
  ],
  "more": false
}

we want to give up those objects with:

ceph pg 1.24c mark_unfound_lost revert

But first we would like to know which file(s) is affected. Is there a way to map the object id to the corresponding file?

On 23.05.19 3:52 PM, Alexandre Marangone wrote:

The PGs will stay active+recovery_wait+degraded until you solve the unfound objects issue. You can follow this doc to look at which objects are unfound[1] and if no other recourse mark them lost.

[1] http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#unfound-objects

On Thu, May 23, 2019 at 5:47 AM Kevin Flöh <kevin.fl...@kit.edu> wrote:

thank you for this idea, it has improved the situation. Nevertheless, there are still 2 PGs in recovery_wait. ceph -s gives me:

  cluster:
    id:     23e72372-0d44-4cad-b24f-3641b14b86f4
    health: HEALTH_WARN
            3/125481112 objects unfound (0.000%)
            Degraded data redundancy: 3/497011315 objects degraded (0.000%), 2 pgs degraded

  services:
    mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
    mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu
    mds: cephfs-1/1/1 up {0=ceph-node03.etp.kit.edu=up:active}, 3 up:standby
    osd: 96 osds: 96 up, 96 in

  data:
    pools:   2 pools, 4096 pgs
    objects: 125.48M objects, 259TiB
    usage:   370TiB used, 154TiB / 524TiB avail
    pgs:     3/497011315 objects degraded (0.000%)
             3/125481112 objects unfound (0.000%)
             4083 active+clean
             10   active+clean+scrubbing+deep
             2    active+recovery_wait+degraded
             1    active+clean+scrubbing

  io:
    client: 318KiB/s rd, 77.0KiB/s wr, 190op/s rd, 0op/s wr

and ceph health detail:

HEALTH_WARN 3/125481112 objects unfound (0.000%); Degraded data redundancy: 3/497011315 objects degraded (0.000%), 2 pgs degraded
OBJECT_UNFOUND 3/125481112 objects unfound (0.000%)
    pg 1.24c has 1 unfound objects
    pg 1.779 has 2 unfound objects
PG_DEGRADED Degraded data redundancy: 3/497011315 objects degraded (0.000%), 2 pgs degraded
    pg 1.24c is active+recovery_wait+degraded, acting [32,4,61,36], 1 unfound
    pg 1.779 is active+recovery_wait+degraded, acting [50,4,77,62], 2 unfound

also the status changed from HEALTH_ERR to HEALTH_WARN. We also did ceph osd down for all OSDs of the degraded PGs. Do you have any further suggestions on how to proceed?

On 23.05.19 11:08 AM, Dan van der Ster wrote:
> I think those osds (1, 11, 21, 32, ...) need a little kick to re-peer
> their degraded PGs.
>
> Open a window with `watch ceph -s`, then in another window slowly do
>
> ceph osd down 1
> # then wait a minute or so for that osd.1 to re-peer fully.
> ceph osd down 11
> ...
>
> Continue that for each of the osds with stuck requests, or until there
> are no more recovery_wait/degraded PGs.
>
> After each `ceph osd down...`, you should expect to see several PGs
> re-peer, and then ideally the slow requests will disappear and the
> degraded PGs will become active+clean.
> If anything else happens, you should stop and let us know.
>
> -- dan
>
> On Thu, May 23, 2019 at 10:59 AM Kevin Flöh <kevin.fl...@kit.edu> wrote:
>> This is the current status of ceph:
>>
>> cluster:
>>   id: 23e72372-0d
Re: [ceph-users] Major ceph disaster
thank you for this idea, it has improved the situation. Nevertheless, there are still 2 PGs in recovery_wait. ceph -s gives me:

  cluster:
    id:     23e72372-0d44-4cad-b24f-3641b14b86f4
    health: HEALTH_WARN
            3/125481112 objects unfound (0.000%)
            Degraded data redundancy: 3/497011315 objects degraded (0.000%), 2 pgs degraded

  services:
    mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
    mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu
    mds: cephfs-1/1/1 up {0=ceph-node03.etp.kit.edu=up:active}, 3 up:standby
    osd: 96 osds: 96 up, 96 in

  data:
    pools:   2 pools, 4096 pgs
    objects: 125.48M objects, 259TiB
    usage:   370TiB used, 154TiB / 524TiB avail
    pgs:     3/497011315 objects degraded (0.000%)
             3/125481112 objects unfound (0.000%)
             4083 active+clean
             10   active+clean+scrubbing+deep
             2    active+recovery_wait+degraded
             1    active+clean+scrubbing

  io:
    client: 318KiB/s rd, 77.0KiB/s wr, 190op/s rd, 0op/s wr

and ceph health detail:

HEALTH_WARN 3/125481112 objects unfound (0.000%); Degraded data redundancy: 3/497011315 objects degraded (0.000%), 2 pgs degraded
OBJECT_UNFOUND 3/125481112 objects unfound (0.000%)
    pg 1.24c has 1 unfound objects
    pg 1.779 has 2 unfound objects
PG_DEGRADED Degraded data redundancy: 3/497011315 objects degraded (0.000%), 2 pgs degraded
    pg 1.24c is active+recovery_wait+degraded, acting [32,4,61,36], 1 unfound
    pg 1.779 is active+recovery_wait+degraded, acting [50,4,77,62], 2 unfound

also the status changed from HEALTH_ERR to HEALTH_WARN. We also did ceph osd down for all OSDs of the degraded PGs. Do you have any further suggestions on how to proceed?

On 23.05.19 11:08 AM, Dan van der Ster wrote:

I think those osds (1, 11, 21, 32, ...) need a little kick to re-peer their degraded PGs.

Open a window with `watch ceph -s`, then in another window slowly do

ceph osd down 1
# then wait a minute or so for that osd.1 to re-peer fully.
ceph osd down 11
...

Continue that for each of the osds with stuck requests, or until there are no more recovery_wait/degraded PGs.

After each `ceph osd down...`, you should expect to see several PGs re-peer, and then ideally the slow requests will disappear and the degraded PGs will become active+clean. If anything else happens, you should stop and let us know.

-- dan

On Thu, May 23, 2019 at 10:59 AM Kevin Flöh wrote:

This is the current status of ceph:

  cluster:
    id:     23e72372-0d44-4cad-b24f-3641b14b86f4
    health: HEALTH_ERR
            9/125481144 objects unfound (0.000%)
            Degraded data redundancy: 9/497011417 objects degraded (0.000%), 7 pgs degraded
            9 stuck requests are blocked > 4096 sec. Implicated osds 1,11,21,32,43,50,65

  services:
    mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
    mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu
    mds: cephfs-1/1/1 up {0=ceph-node03.etp.kit.edu=up:active}, 3 up:standby
    osd: 96 osds: 96 up, 96 in

  data:
    pools:   2 pools, 4096 pgs
    objects: 125.48M objects, 259TiB
    usage:   370TiB used, 154TiB / 524TiB avail
    pgs:     9/497011417 objects degraded (0.000%)
             9/125481144 objects unfound (0.000%)
             4078 active+clean
             11   active+clean+scrubbing+deep
             7    active+recovery_wait+degraded

  io:
    client: 211KiB/s rd, 46.0KiB/s wr, 158op/s rd, 0op/s wr

On 23.05.19 10:54 AM, Dan van der Ster wrote:

What's the full ceph status? Normally recovery_wait just means that the relevant osd's are busy recovering/backfilling another PG.

On Thu, May 23, 2019 at 10:53 AM Kevin Flöh wrote:

Hi,

we have set the PGs to recover and now they are stuck in active+recovery_wait+degraded, and instructing them to deep-scrub does not change anything. Hence, the rados report is empty. Is there a way to stop the recovery wait to start the deep-scrub and get the output? I guess the recovery_wait might be caused by missing objects. Do we need to delete them first to get the recovery going?

Kevin

On 22.05.19 6:03 PM, Robert LeBlanc wrote:

On Wed, May 22, 2019 at 4:31 AM Kevin Flöh wrote:

Hi,

thank you, it worked. The PGs are not incomplete anymore. Still we have another problem: there are 7 PGs inconsistent and a ceph pg repair is not doing anything. I just get "instructing pg 1.5dd on osd.24 to repair" and nothing happens. Does somebody know how we can get the PGs to repair?

Regards,
Kevin

Kevin, I just fixed an inconsistent PG yesterday. You will need to figure out why they are inconsistent. Do these steps and then we can figure out how to proceed.

1. Do a deep-scrub on each PG that is inconsistent. (This may fix some of them)
2. Print out the inconsistent report for each inconsistent PG. `rados list-inconsistent-obj --forma
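Dan's kick-each-osd procedure from the quoted message can also be scripted. A sketch using the implicated OSDs from the status above — the fixed sleep is a crude stand-in for watching `ceph -s` until each re-peer finishes:

    for osd in 1 11 21 32 43 50 65; do
      ceph osd down ${osd}   # marks the osd down; it rejoins and re-peers its PGs
      sleep 60
    done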
Re: [ceph-users] Major ceph disaster
This is the current status of ceph:

  cluster:
    id:     23e72372-0d44-4cad-b24f-3641b14b86f4
    health: HEALTH_ERR
            9/125481144 objects unfound (0.000%)
            Degraded data redundancy: 9/497011417 objects degraded (0.000%), 7 pgs degraded
            9 stuck requests are blocked > 4096 sec. Implicated osds 1,11,21,32,43,50,65

  services:
    mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
    mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu
    mds: cephfs-1/1/1 up {0=ceph-node03.etp.kit.edu=up:active}, 3 up:standby
    osd: 96 osds: 96 up, 96 in

  data:
    pools:   2 pools, 4096 pgs
    objects: 125.48M objects, 259TiB
    usage:   370TiB used, 154TiB / 524TiB avail
    pgs:     9/497011417 objects degraded (0.000%)
             9/125481144 objects unfound (0.000%)
             4078 active+clean
             11   active+clean+scrubbing+deep
             7    active+recovery_wait+degraded

  io:
    client: 211KiB/s rd, 46.0KiB/s wr, 158op/s rd, 0op/s wr

On 23.05.19 10:54 AM, Dan van der Ster wrote:

What's the full ceph status? Normally recovery_wait just means that the relevant osd's are busy recovering/backfilling another PG.

On Thu, May 23, 2019 at 10:53 AM Kevin Flöh wrote:

Hi,

we have set the PGs to recover and now they are stuck in active+recovery_wait+degraded, and instructing them to deep-scrub does not change anything. Hence, the rados report is empty. Is there a way to stop the recovery wait to start the deep-scrub and get the output? I guess the recovery_wait might be caused by missing objects. Do we need to delete them first to get the recovery going?

Kevin

On 22.05.19 6:03 PM, Robert LeBlanc wrote:

On Wed, May 22, 2019 at 4:31 AM Kevin Flöh wrote:

Hi,

thank you, it worked. The PGs are not incomplete anymore. Still we have another problem: there are 7 PGs inconsistent and a ceph pg repair is not doing anything. I just get "instructing pg 1.5dd on osd.24 to repair" and nothing happens. Does somebody know how we can get the PGs to repair?

Regards,
Kevin

Kevin, I just fixed an inconsistent PG yesterday. You will need to figure out why they are inconsistent. Do these steps and then we can figure out how to proceed.

1. Do a deep-scrub on each PG that is inconsistent. (This may fix some of them)
2. Print out the inconsistent report for each inconsistent PG. `rados list-inconsistent-obj --format=json-pretty`
3. You will want to look at the error messages and see if all the shards have the same data.

Robert LeBlanc
Re: [ceph-users] Major ceph disaster
Hi,

we have set the PGs to recover and now they are stuck in active+recovery_wait+degraded, and instructing them to deep-scrub does not change anything. Hence, the rados report is empty. Is there a way to stop the recovery wait to start the deep-scrub and get the output? I guess the recovery_wait might be caused by missing objects. Do we need to delete them first to get the recovery going?

Kevin

On 22.05.19 6:03 PM, Robert LeBlanc wrote:

On Wed, May 22, 2019 at 4:31 AM Kevin Flöh <kevin.fl...@kit.edu> wrote:

Hi,

thank you, it worked. The PGs are not incomplete anymore. Still we have another problem: there are 7 PGs inconsistent and a ceph pg repair is not doing anything. I just get "instructing pg 1.5dd on osd.24 to repair" and nothing happens. Does somebody know how we can get the PGs to repair?

Regards,
Kevin

Kevin, I just fixed an inconsistent PG yesterday. You will need to figure out why they are inconsistent. Do these steps and then we can figure out how to proceed.

1. Do a deep-scrub on each PG that is inconsistent. (This may fix some of them)
2. Print out the inconsistent report for each inconsistent PG. `rados list-inconsistent-obj --format=json-pretty`
3. You will want to look at the error messages and see if all the shards have the same data.

Robert LeBlanc
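Concretely, Robert's steps 1 and 2 for the PG named in the thread would look like the following — rados takes the pool name for list-inconsistent-pg and the PG id for list-inconsistent-obj; ec31 is the pool used elsewhere in this thread:

    rados list-inconsistent-pg ec31                           # which PGs in the pool are flagged inconsistent
    ceph pg deep-scrub 1.5dd                                  # step 1: re-run the deep scrub
    rados list-inconsistent-obj 1.5dd --format=json-pretty    # step 2: dump the inconsistency report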
Re: [ceph-users] Major ceph disaster
Hi,

thank you, it worked. The PGs are not incomplete anymore. Still we have another problem: there are 7 PGs inconsistent and a ceph pg repair is not doing anything. I just get "instructing pg 1.5dd on osd.24 to repair" and nothing happens. Does somebody know how we can get the PGs to repair?

Regards,
Kevin

On 21.05.19 4:52 PM, Wido den Hollander wrote:

On 5/21/19 4:48 PM, Kevin Flöh wrote:

Hi,

we gave up on the incomplete pgs since we do not have enough complete shards to restore them. What is the procedure to get rid of these pgs?

You need to start with marking the OSDs as 'lost' and then you can force_create_pg to get the PGs back (empty).

Wido

regards,
Kevin

On 20.05.19 9:22 AM, Kevin Flöh wrote:

Hi Frederic,

we do not have access to the original OSDs. We exported the remaining shards of the two pgs, but we are only left with two shards (of reasonable size) per pg. The rest of the shards displayed by ceph pg query are empty. I guess marking the OSD as complete doesn't make sense then.

Best,
Kevin

On 17.05.19 2:36 PM, Frédéric Nass wrote:

On 14/05/2019 at 10:04, Kevin Flöh wrote:

On 13.05.19 11:21 PM, Dan van der Ster wrote:

Presumably the 2 OSDs you marked as lost were hosting those incomplete PGs? It would be useful to double confirm that: check with `ceph pg query` and `ceph pg dump`. (If so, this is why the ignore history les thing isn't helping; you don't have the minimum 3 stripes up for those 3+1 PGs.)

yes, but as written in my other mail, we still have enough shards, at least I think so.

If those "lost" OSDs by some miracle still have the PG data, you might be able to export the relevant PG stripes with the ceph-objectstore-tool. I've never tried this myself, but there have been threads in the past where people export a PG from a nearly dead hdd, import to another OSD, then backfilling works.

guess that is not possible.

Hi Kevin,

You want to make sure of this. Unless you recreated the OSDs 4 and 23 and had new data written on them, they should still host the data you need. What Dan suggested (export the 7 inconsistent PGs and import them on a healthy OSD) seems to be the only way to recover your lost data, as with 4 hosts and 2 OSDs lost, you're left with 2 chunks of data/parity when you actually need 3 to access it. Reducing min_size to 3 will not help. Have a look here:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-July/019673.html
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/023736.html

This is probably the best way you want to follow from now on.

Regards,
Frédéric.

If OTOH those PGs are really lost forever, and someone else should confirm what I say here, I think the next step would be to force recreate the incomplete PGs then run a set of cephfs scrub/repair disaster recovery cmds to recover what you can from the cephfs.

-- dan

would this let us recover at least some of the data on the pgs? If not we would just set up a new ceph directly without fixing the old one and copy whatever is left.

Best regards,
Kevin

On Mon, May 13, 2019 at 4:20 PM Kevin Flöh wrote:

Dear ceph experts,

we have several (maybe related) problems with our ceph cluster, let me first show you the current ceph status:

  cluster:
    id:     23e72372-0d44-4cad-b24f-3641b14b86f4
    health: HEALTH_ERR
            1 MDSs report slow metadata IOs
            1 MDSs report slow requests
            1 MDSs behind on trimming
            1/126319678 objects unfound (0.000%)
            19 scrub errors
            Reduced data availability: 2 pgs inactive, 2 pgs incomplete
            Possible data damage: 7 pgs inconsistent
            Degraded data redundancy: 1/500333881 objects degraded (0.000%), 1 pg degraded
            118 stuck requests are blocked > 4096 sec. Implicated osds 24,32,91

  services:
    mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
    mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu
    mds: cephfs-1/1/1 up {0=ceph-node02.etp.kit.edu=up:active}, 3 up:standby
    osd: 96 osds: 96 up, 96 in

  data:
    pools:   2 pools, 4096 pgs
    objects: 126.32M objects, 260TiB
    usage:   372TiB used, 152TiB / 524TiB avail
    pgs:     0.049% pgs not active
             1/500333881 objects degraded (0.000%)
             1/126319678 objects unfound (0.000%)
             4076 active+clean
             10   active+clean+scrubbing+deep
             7    active+clean+inconsistent
             2    incomplete
             1    active+recovery_wait+degraded

  io:
    client: 449KiB/s rd, 42.9KiB/s wr, 152op/s rd, 0op/s wr

and ceph health detail:

HEALTH_ERR 1 MDSs report slow metadata IOs; 1 MDSs report slow requests; 1 MDSs behind on trimming; 1/126319687 objects unfound (0.000%); 19 scrub errors; Reduced data availability: 2 pgs inactive, 2 pgs incomplete; Possible data dam
Re: [ceph-users] Major ceph disaster
Hi,

we gave up on the incomplete pgs since we do not have enough complete shards to restore them. What is the procedure to get rid of these pgs?

regards,
Kevin

On 20.05.19 9:22 AM, Kevin Flöh wrote:

Hi Frederic,

we do not have access to the original OSDs. We exported the remaining shards of the two pgs, but we are only left with two shards (of reasonable size) per pg. The rest of the shards displayed by ceph pg query are empty. I guess marking the OSD as complete doesn't make sense then.

Best,
Kevin

On 17.05.19 2:36 PM, Frédéric Nass wrote:

On 14/05/2019 at 10:04, Kevin Flöh wrote:

On 13.05.19 11:21 PM, Dan van der Ster wrote:

Presumably the 2 OSDs you marked as lost were hosting those incomplete PGs? It would be useful to double confirm that: check with `ceph pg query` and `ceph pg dump`. (If so, this is why the ignore history les thing isn't helping; you don't have the minimum 3 stripes up for those 3+1 PGs.)

yes, but as written in my other mail, we still have enough shards, at least I think so.

If those "lost" OSDs by some miracle still have the PG data, you might be able to export the relevant PG stripes with the ceph-objectstore-tool. I've never tried this myself, but there have been threads in the past where people export a PG from a nearly dead hdd, import to another OSD, then backfilling works.

guess that is not possible.

Hi Kevin,

You want to make sure of this. Unless you recreated the OSDs 4 and 23 and had new data written on them, they should still host the data you need. What Dan suggested (export the 7 inconsistent PGs and import them on a healthy OSD) seems to be the only way to recover your lost data, as with 4 hosts and 2 OSDs lost, you're left with 2 chunks of data/parity when you actually need 3 to access it. Reducing min_size to 3 will not help. Have a look here:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-July/019673.html
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/023736.html

This is probably the best way you want to follow from now on.

Regards,
Frédéric.

If OTOH those PGs are really lost forever, and someone else should confirm what I say here, I think the next step would be to force recreate the incomplete PGs then run a set of cephfs scrub/repair disaster recovery cmds to recover what you can from the cephfs.

-- dan

would this let us recover at least some of the data on the pgs? If not we would just set up a new ceph directly without fixing the old one and copy whatever is left.

Best regards,
Kevin

On Mon, May 13, 2019 at 4:20 PM Kevin Flöh wrote:

Dear ceph experts,

we have several (maybe related) problems with our ceph cluster, let me first show you the current ceph status:

  cluster:
    id:     23e72372-0d44-4cad-b24f-3641b14b86f4
    health: HEALTH_ERR
            1 MDSs report slow metadata IOs
            1 MDSs report slow requests
            1 MDSs behind on trimming
            1/126319678 objects unfound (0.000%)
            19 scrub errors
            Reduced data availability: 2 pgs inactive, 2 pgs incomplete
            Possible data damage: 7 pgs inconsistent
            Degraded data redundancy: 1/500333881 objects degraded (0.000%), 1 pg degraded
            118 stuck requests are blocked > 4096 sec. Implicated osds 24,32,91

  services:
    mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
    mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu
    mds: cephfs-1/1/1 up {0=ceph-node02.etp.kit.edu=up:active}, 3 up:standby
    osd: 96 osds: 96 up, 96 in

  data:
    pools:   2 pools, 4096 pgs
    objects: 126.32M objects, 260TiB
    usage:   372TiB used, 152TiB / 524TiB avail
    pgs:     0.049% pgs not active
             1/500333881 objects degraded (0.000%)
             1/126319678 objects unfound (0.000%)
             4076 active+clean
             10   active+clean+scrubbing+deep
             7    active+clean+inconsistent
             2    incomplete
             1    active+recovery_wait+degraded

  io:
    client: 449KiB/s rd, 42.9KiB/s wr, 152op/s rd, 0op/s wr

and ceph health detail:

HEALTH_ERR 1 MDSs report slow metadata IOs; 1 MDSs report slow requests; 1 MDSs behind on trimming; 1/126319687 objects unfound (0.000%); 19 scrub errors; Reduced data availability: 2 pgs inactive, 2 pgs incomplete; Possible data damage: 7 pgs inconsistent; Degraded data redundancy: 1/500333908 objects degraded (0.000%), 1 pg degraded; 118 stuck requests are blocked > 4096 sec. Implicated osds 24,32,91
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
    mdsceph-node02.etp.kit.edu(mds.0): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 351193 secs
MDS_SLOW_REQUEST 1 MDSs report slow requests
    mdsceph-node02.etp.kit.edu(mds.0): 4 slow requests are blocked > 30 sec
MDS_TRIM 1 MDSs behind on trimming
    mdsceph-no
Re: [ceph-users] Major ceph disaster
Hi Frederic, we do not have access to the original OSDs. We exported the remaining shards of the two pgs but we are only left with two shards (of reasonable size) per pg. The rest of the shards displayed by ceph pg query are empty. I guess marking the OSD as complete doesn't make sense then. Best, Kevin On 17.05.19 2:36 nachm., Frédéric Nass wrote: Le 14/05/2019 à 10:04, Kevin Flöh a écrit : On 13.05.19 11:21 nachm., Dan van der Ster wrote: Presumably the 2 OSDs you marked as lost were hosting those incomplete PGs? It would be useful to double confirm that: check with `ceph pg query` and `ceph pg dump`. (If so, this is why the ignore history les thing isn't helping; you don't have the minimum 3 stripes up for those 3+1 PGs.) yes, but as written in my other mail, we still have enough shards, at least I think so. If those "lost" OSDs by some miracle still have the PG data, you might be able to export the relevant PG stripes with the ceph-objectstore-tool. I've never tried this myself, but there have been threads in the past where people export a PG from a nearly dead hdd, import to another OSD, then backfilling works. guess that is not possible. Hi Kevin, You want to make sure of this. Unless you recreated the OSDs 4 and 23 and had new data written on them, they should still host the data you need. What Dan suggested (export the 7 inconsistent PGs and import them on a healthy OSD) seems to be the only way to recover your lost data, as with 4 hosts and 2 OSDs lost, you're left with 2 chunks of data/parity when you actually need 3 to access it. Reducing min_size to 3 will not help. Have a look here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-July/019673.html http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/023736.html This is probably the best way you want to follow form now on. Regards, Frédéric. If OTOH those PGs are really lost forever, and someone else should confirm what I say here, I think the next step would be to force recreate the incomplete PGs then run a set of cephfs scrub/repair disaster recovery cmds to recover what you can from the cephfs. -- dan would this let us recover at least some of the data on the pgs? If not we would just set up a new ceph directly without fixing the old one and copy whatever is left. Best regards, Kevin On Mon, May 13, 2019 at 4:20 PM Kevin Flöh wrote: Dear ceph experts, we have several (maybe related) problems with our ceph cluster, let me first show you the current ceph status: cluster: id: 23e72372-0d44-4cad-b24f-3641b14b86f4 health: HEALTH_ERR 1 MDSs report slow metadata IOs 1 MDSs report slow requests 1 MDSs behind on trimming 1/126319678 objects unfound (0.000%) 19 scrub errors Reduced data availability: 2 pgs inactive, 2 pgs incomplete Possible data damage: 7 pgs inconsistent Degraded data redundancy: 1/500333881 objects degraded (0.000%), 1 pg degraded 118 stuck requests are blocked > 4096 sec. 
Implicated osds 24,32,91 services: mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02 mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu mds: cephfs-1/1/1 up {0=ceph-node02.etp.kit.edu=up:active}, 3 up:standby osd: 96 osds: 96 up, 96 in data: pools: 2 pools, 4096 pgs objects: 126.32M objects, 260TiB usage: 372TiB used, 152TiB / 524TiB avail pgs: 0.049% pgs not active 1/500333881 objects degraded (0.000%) 1/126319678 objects unfound (0.000%) 4076 active+clean 10 active+clean+scrubbing+deep 7 active+clean+inconsistent 2 incomplete 1 active+recovery_wait+degraded io: client: 449KiB/s rd, 42.9KiB/s wr, 152op/s rd, 0op/s wr and ceph health detail: HEALTH_ERR 1 MDSs report slow metadata IOs; 1 MDSs report slow requests; 1 MDSs behind on trimming; 1/126319687 objects unfound (0.000%); 19 scrub errors; Reduced data availability: 2 pgs inactive, 2 pgs incomplete; Possible data damage: 7 pgs inconsistent; Degraded data redundancy: 1/500333908 objects degraded (0.000%), 1 pg degraded; 118 stuck requests are blocked > 4096 sec. Implicated osds 24,32,91 MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs mdsceph-node02.etp.kit.edu(mds.0): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 351193 secs MDS_SLOW_REQUEST 1 MDSs report slow requests mdsceph-node02.etp.kit.edu(mds.0): 4 slow requests are blocked > 30 sec MDS_TRIM 1 MDSs behind on trimming mdsceph-node02.etp.kit.edu(mds.0): Behind on trimming (46034/128) max_segments: 128, num_segments: 46034 OBJECT_UNFOUND 1/126319687 objects unfound (0.000%) pg 1.24c has 1 unfound objects OSD_SCRUB_ERRORS 19 scrub errors P
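A minimal sketch of the export/import path Dan and Frédéric describe, with osd ids, the pg id and the EC shard suffix (s1 meaning shard 1) all placeholders; the source OSD must be stopped while ceph-objectstore-tool runs:
systemctl stop ceph-osd@4
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 --pgid 1.5dds1 --op export --file /root/1.5dd.s1.export
systemctl stop ceph-osd@99
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-99 --op import --file /root/1.5dd.s1.export
After restarting the target OSD, peering can pick the imported shard up.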
Re: [ceph-users] Major ceph disaster
We tried to export the shards from the OSDs but there are only two shards left for each of the pgs, so we decided to give up these pgs. Will the files of these pgs be deleted from the mds or do we have to delete them manually. Is this the correct command to mark the pgs as lost: ceph pg {pg-id} mark_unfound_lost revert|delete Cheers, Kevin On 15.05.19 8:55 vorm., Kevin Flöh wrote: The hdds of OSDs 4 and 23 are completely lost, we cannot access them in any way. Is it possible to use the shards which are maybe stored on working OSDs as shown in the all_participants list? On 14.05.19 5:24 nachm., Dan van der Ster wrote: On Tue, May 14, 2019 at 5:13 PM Kevin Flöh wrote: ok, so now we see at least a diffrence in the recovery state: "recovery_state": [ { "name": "Started/Primary/Peering/Incomplete", "enter_time": "2019-05-14 14:15:15.650517", "comment": "not enough complete instances of this PG" }, { "name": "Started/Primary/Peering", "enter_time": "2019-05-14 14:15:15.243756", "past_intervals": [ { "first": "49767", "last": "59580", "all_participants": [ { "osd": 2, "shard": 0 }, { "osd": 4, "shard": 1 }, { "osd": 23, "shard": 2 }, { "osd": 24, "shard": 0 }, { "osd": 72, "shard": 1 }, { "osd": 79, "shard": 3 } ], "intervals": [ { "first": "59562", "last": "59563", "acting": "4(1),24(0),79(3)" }, { "first": "59564", "last": "59567", "acting": "23(2),24(0),79(3)" }, { "first": "59570", "last": "59574", "acting": "4(1),23(2),79(3)" }, { "first": "59577", "last": "59580", "acting": "4(1),23(2),24(0)" } ] } ], "probing_osds": [ "2(0)", "4(1)", "23(2)", "24(0)", "72(1)", "79(3)" ], "down_osds_we_would_probe": [], "peering_blocked_by": [] }, { "name": "Started", "enter_time": "2019-05-14 14:15:15.243663" } ], the peering does not seem to be blocked anymore. But still there is no recovery going on. Is there anything else we can try? What is the state of the hdd's which had osds 4 & 23? You may be able to use ceph-objectstore-tool to export those PG shards and import to another operable OSD. -- dan On 14.05.19 11:02 vorm., Dan van der Ster wrote: On Tue, May 14, 2019 at 10:59 AM Kevin Flöh wrote: On 14.05.19 10:08 vorm., Dan van der Ster wrote: On Tue, May 14, 2019 at 10:02 AM Kevin Flöh wrote: On 13.05.19 10:51 nachm., Lionel Bouton wrote: Le 13/05/2019 à 16:20, Kevin Flöh a écrit : Dear ceph experts, [...] We have 4 nodes with 24 osds each and use 3+1 erasure coding. [...] Here is what happened: One osd daemon could not be started and therefore we decided to mark the osd as lost a
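A note on the two modes: revert rolls unfound objects back to their previous version, while delete forgets them entirely; for an erasure-coded pool only delete is applicable, since there is no prior full copy to revert to. Roughly, for the unfound object named in the health detail:
ceph pg 1.24c mark_unfound_lost delete
This only handles unfound objects in otherwise-active pgs. The MDS will not delete files for you; paths whose data sat in the abandoned pgs remain in the tree and return I/O errors until removed or repaired.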
Re: [ceph-users] Major ceph disaster
ceph osd pool get ec31 min_size min_size: 3 On 15.05.19 9:09 vorm., Konstantin Shalygin wrote: ceph osd pool get ec31 min_size
Re: [ceph-users] Major ceph disaster
The hdds of OSDs 4 and 23 are completely lost, we cannot access them in any way. Is it possible to use the shards which are maybe stored on working OSDs as shown in the all_participants list? On 14.05.19 5:24 nachm., Dan van der Ster wrote: On Tue, May 14, 2019 at 5:13 PM Kevin Flöh wrote: ok, so now we see at least a diffrence in the recovery state: "recovery_state": [ { "name": "Started/Primary/Peering/Incomplete", "enter_time": "2019-05-14 14:15:15.650517", "comment": "not enough complete instances of this PG" }, { "name": "Started/Primary/Peering", "enter_time": "2019-05-14 14:15:15.243756", "past_intervals": [ { "first": "49767", "last": "59580", "all_participants": [ { "osd": 2, "shard": 0 }, { "osd": 4, "shard": 1 }, { "osd": 23, "shard": 2 }, { "osd": 24, "shard": 0 }, { "osd": 72, "shard": 1 }, { "osd": 79, "shard": 3 } ], "intervals": [ { "first": "59562", "last": "59563", "acting": "4(1),24(0),79(3)" }, { "first": "59564", "last": "59567", "acting": "23(2),24(0),79(3)" }, { "first": "59570", "last": "59574", "acting": "4(1),23(2),79(3)" }, { "first": "59577", "last": "59580", "acting": "4(1),23(2),24(0)" } ] } ], "probing_osds": [ "2(0)", "4(1)", "23(2)", "24(0)", "72(1)", "79(3)" ], "down_osds_we_would_probe": [], "peering_blocked_by": [] }, { "name": "Started", "enter_time": "2019-05-14 14:15:15.243663" } ], the peering does not seem to be blocked anymore. But still there is no recovery going on. Is there anything else we can try? What is the state of the hdd's which had osds 4 & 23? You may be able to use ceph-objectstore-tool to export those PG shards and import to another operable OSD. -- dan On 14.05.19 11:02 vorm., Dan van der Ster wrote: On Tue, May 14, 2019 at 10:59 AM Kevin Flöh wrote: On 14.05.19 10:08 vorm., Dan van der Ster wrote: On Tue, May 14, 2019 at 10:02 AM Kevin Flöh wrote: On 13.05.19 10:51 nachm., Lionel Bouton wrote: Le 13/05/2019 à 16:20, Kevin Flöh a écrit : Dear ceph experts, [...] We have 4 nodes with 24 osds each and use 3+1 erasure coding. [...] Here is what happened: One osd daemon could not be started and therefore we decided to mark the osd as lost and set it up from scratch. Ceph started recovering and then we lost another osd with the same behavior. We did the same as for the first osd. With 3+1 you only allow a single OSD failure per pg at a given time. You have 4096 pgs and 96 osds, having 2 OSD fail at the same time on 2 separate servers (assuming standard crush rules) is a death sentence for the data on some pgs using both of those
Re: [ceph-users] Major ceph disaster
Hi, since we have 3+1 ec I didn't try before. But when I run the command you suggested I get the following error: ceph osd pool set ec31 min_size 2 Error EINVAL: pool min_size must be between 3 and 4 On 14.05.19 6:18 nachm., Konstantin Shalygin wrote: peering does not seem to be blocked anymore. But still there is no recovery going on. Is there anything else we can try? Try to reduce min_size for problem pool as 'health detail' suggested: `ceph osd pool set ec31 min_size 2`. k
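That error is expected: for an erasure-coded pool, min_size is bounded by k and k+m, so a 3+1 profile admits only 3 or 4. The profile can be double-checked roughly like this (the profile name is whatever the first command prints):
ceph osd pool get ec31 erasure_code_profile
ceph osd erasure-code-profile get <profile-name>
With only 2 of the 3 required data chunks available, no min_size value can make those pgs serve I/O; the missing shard has to be restored, e.g. via the ceph-objectstore-tool route sketched earlier.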
Re: [ceph-users] Major ceph disaster
ok, so now we see at least a difference in the recovery state: "recovery_state": [ { "name": "Started/Primary/Peering/Incomplete", "enter_time": "2019-05-14 14:15:15.650517", "comment": "not enough complete instances of this PG" }, { "name": "Started/Primary/Peering", "enter_time": "2019-05-14 14:15:15.243756", "past_intervals": [ { "first": "49767", "last": "59580", "all_participants": [ { "osd": 2, "shard": 0 }, { "osd": 4, "shard": 1 }, { "osd": 23, "shard": 2 }, { "osd": 24, "shard": 0 }, { "osd": 72, "shard": 1 }, { "osd": 79, "shard": 3 } ], "intervals": [ { "first": "59562", "last": "59563", "acting": "4(1),24(0),79(3)" }, { "first": "59564", "last": "59567", "acting": "23(2),24(0),79(3)" }, { "first": "59570", "last": "59574", "acting": "4(1),23(2),79(3)" }, { "first": "59577", "last": "59580", "acting": "4(1),23(2),24(0)" } ] } ], "probing_osds": [ "2(0)", "4(1)", "23(2)", "24(0)", "72(1)", "79(3)" ], "down_osds_we_would_probe": [], "peering_blocked_by": [] }, { "name": "Started", "enter_time": "2019-05-14 14:15:15.243663" } ], the peering does not seem to be blocked anymore. But still there is no recovery going on. Is there anything else we can try? On 14.05.19 11:02 vorm., Dan van der Ster wrote: On Tue, May 14, 2019 at 10:59 AM Kevin Flöh wrote: On 14.05.19 10:08 vorm., Dan van der Ster wrote: On Tue, May 14, 2019 at 10:02 AM Kevin Flöh wrote: On 13.05.19 10:51 nachm., Lionel Bouton wrote: Le 13/05/2019 à 16:20, Kevin Flöh a écrit : Dear ceph experts, [...] We have 4 nodes with 24 osds each and use 3+1 erasure coding. [...] Here is what happened: One osd daemon could not be started and therefore we decided to mark the osd as lost and set it up from scratch. Ceph started recovering and then we lost another osd with the same behavior. We did the same as for the first osd. With 3+1 you only allow a single OSD failure per pg at a given time. You have 4096 pgs and 96 osds, having 2 OSD fail at the same time on 2 separate servers (assuming standard crush rules) is a death sentence for the data on some pgs using both of those OSD (the ones not fully recovered before the second failure). OK, so the 2 OSDs (4,23) failed shortly one after the other but we think that the recovery of the first was finished before the second failed. Nonetheless, both problematic pgs have been on both OSDs. We think, that we still have enough shards left. For one of the pgs, the recovery state looks like this: "recovery_state": [ { "name": "Started/Primary/Peering/Incomplete", "enter_time": "2019-05-09 16:11:48.625966", "comment": "n
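To see which shards physically exist, independent of what peering reports, ceph-objectstore-tool can list the pgs an OSD holds. A sketch, run with the respective OSD stopped and with the pg id as a placeholder:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 --op list-pgs | grep 1.5dd
Repeating this over the OSDs in all_participants shows whether a third shard survives anywhere and is worth exporting.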
Re: [ceph-users] Major ceph disaster
On 14.05.19 10:08 vorm., Dan van der Ster wrote: On Tue, May 14, 2019 at 10:02 AM Kevin Flöh wrote: On 13.05.19 10:51 nachm., Lionel Bouton wrote: Le 13/05/2019 à 16:20, Kevin Flöh a écrit : Dear ceph experts, [...] We have 4 nodes with 24 osds each and use 3+1 erasure coding. [...] Here is what happened: One osd daemon could not be started and therefore we decided to mark the osd as lost and set it up from scratch. Ceph started recovering and then we lost another osd with the same behavior. We did the same as for the first osd. With 3+1 you only allow a single OSD failure per pg at a given time. You have 4096 pgs and 96 osds, having 2 OSD fail at the same time on 2 separate servers (assuming standard crush rules) is a death sentence for the data on some pgs using both of those OSD (the ones not fully recovered before the second failure). OK, so the 2 OSDs (4,23) failed shortly one after the other but we think that the recovery of the first was finished before the second failed. Nonetheless, both problematic pgs have been on both OSDs. We think, that we still have enough shards left. For one of the pgs, the recovery state looks like this: "recovery_state": [ { "name": "Started/Primary/Peering/Incomplete", "enter_time": "2019-05-09 16:11:48.625966", "comment": "not enough complete instances of this PG" }, { "name": "Started/Primary/Peering", "enter_time": "2019-05-09 16:11:48.611171", "past_intervals": [ { "first": "49767", "last": "59313", "all_participants": [ { "osd": 2, "shard": 0 }, { "osd": 4, "shard": 1 }, { "osd": 23, "shard": 2 }, { "osd": 24, "shard": 0 }, { "osd": 72, "shard": 1 }, { "osd": 79, "shard": 3 } ], "intervals": [ { "first": "58860", "last": "58861", "acting": "4(1),24(0),79(3)" }, { "first": "58875", "last": "58877", "acting": "4(1),23(2),24(0)" }, { "first": "59002", "last": "59009", "acting": "4(1),23(2),79(3)" }, { "first": "59010", "last": "59012", "acting": "2(0),4(1),23(2),79(3)" }, { "first": "59197", "last": "59233", "acting": "23(2),24(0),79(3)" }, { "first": "59234", "last": "59313", "acting": "23(2),24(0),72(1),79(3)" } ] } ], "probing_osds": [ "2(0)", "4(1)", "23(2)", "24(0)", "72(1)", "79(3)" ], "down_osds_we_would_probe": [], "peering_blocked_by": [], &q
Re: [ceph-users] Major ceph disaster
On 13.05.19 11:21 nachm., Dan van der Ster wrote: Presumably the 2 OSDs you marked as lost were hosting those incomplete PGs? It would be useful to double confirm that: check with `ceph pg query` and `ceph pg dump`. (If so, this is why the ignore history les thing isn't helping; you don't have the minimum 3 stripes up for those 3+1 PGs.) yes, but as written in my other mail, we still have enough shards, at least I think so. If those "lost" OSDs by some miracle still have the PG data, you might be able to export the relevant PG stripes with the ceph-objectstore-tool. I've never tried this myself, but there have been threads in the past where people export a PG from a nearly dead hdd, import to another OSD, then backfilling works. guess that is not possible. If OTOH those PGs are really lost forever, and someone else should confirm what I say here, I think the next step would be to force recreate the incomplete PGs then run a set of cephfs scrub/repair disaster recovery cmds to recover what you can from the cephfs. -- dan would this let us recover at least some of the data on the pgs? If not we would just set up a new ceph directly without fixing the old one and copy whatever is left. Best regards, Kevin On Mon, May 13, 2019 at 4:20 PM Kevin Flöh wrote: Dear ceph experts, we have several (maybe related) problems with our ceph cluster, let me first show you the current ceph status: cluster: id: 23e72372-0d44-4cad-b24f-3641b14b86f4 health: HEALTH_ERR 1 MDSs report slow metadata IOs 1 MDSs report slow requests 1 MDSs behind on trimming 1/126319678 objects unfound (0.000%) 19 scrub errors Reduced data availability: 2 pgs inactive, 2 pgs incomplete Possible data damage: 7 pgs inconsistent Degraded data redundancy: 1/500333881 objects degraded (0.000%), 1 pg degraded 118 stuck requests are blocked > 4096 sec. Implicated osds 24,32,91 services: mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02 mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu mds: cephfs-1/1/1 up {0=ceph-node02.etp.kit.edu=up:active}, 3 up:standby osd: 96 osds: 96 up, 96 in data: pools: 2 pools, 4096 pgs objects: 126.32M objects, 260TiB usage: 372TiB used, 152TiB / 524TiB avail pgs: 0.049% pgs not active 1/500333881 objects degraded (0.000%) 1/126319678 objects unfound (0.000%) 4076 active+clean 10 active+clean+scrubbing+deep 7active+clean+inconsistent 2incomplete 1active+recovery_wait+degraded io: client: 449KiB/s rd, 42.9KiB/s wr, 152op/s rd, 0op/s wr and ceph health detail: HEALTH_ERR 1 MDSs report slow metadata IOs; 1 MDSs report slow requests; 1 MDSs behind on trimming; 1/126319687 objects unfound (0.000%); 19 scrub errors; Reduced data availability: 2 pgs inactive, 2 pgs incomplete; Possible data damage: 7 pgs inconsistent; Degraded data redundancy: 1/500333908 objects degraded (0.000%), 1 pg degraded; 118 stuck requests are blocked > 4096 sec. 
Implicated osds 24,32,91 MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs mdsceph-node02.etp.kit.edu(mds.0): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 351193 secs MDS_SLOW_REQUEST 1 MDSs report slow requests mdsceph-node02.etp.kit.edu(mds.0): 4 slow requests are blocked > 30 sec MDS_TRIM 1 MDSs behind on trimming mdsceph-node02.etp.kit.edu(mds.0): Behind on trimming (46034/128) max_segments: 128, num_segments: 46034 OBJECT_UNFOUND 1/126319687 objects unfound (0.000%) pg 1.24c has 1 unfound objects OSD_SCRUB_ERRORS 19 scrub errors PG_AVAILABILITY Reduced data availability: 2 pgs inactive, 2 pgs incomplete pg 1.5dd is incomplete, acting [24,4,23,79] (reducing pool ec31 min_size from 3 may help; search ceph.com/docs for 'incomplete') pg 1.619 is incomplete, acting [91,23,4,81] (reducing pool ec31 min_size from 3 may help; search ceph.com/docs for 'incomplete') PG_DAMAGED Possible data damage: 7 pgs inconsistent pg 1.17f is active+clean+inconsistent, acting [65,49,25,4] pg 1.1e0 is active+clean+inconsistent, acting [11,32,4,81] pg 1.203 is active+clean+inconsistent, acting [43,49,4,72] pg 1.5d3 is active+clean+inconsistent, acting [37,27,85,4] pg 1.779 is active+clean+inconsistent, acting [50,4,77,62] pg 1.77c is active+clean+inconsistent, acting [21,49,40,4] pg 1.7c3 is active+clean+inconsistent, acting [1,14,68,4] PG_DEGRADED Degraded data redundancy: 1/500333908 objects degraded (0.000%), 1 pg degraded pg 1.24c is active+recovery_wait+degraded, acting [32,4,61,36], 1 unfound REQUEST_STUCK 118 stuck requests are blocked > 4096 sec. Implicated osds 24,32,91 118 ops
Re: [ceph-users] Major ceph disaster
On 13.05.19 10:51 nachm., Lionel Bouton wrote: Le 13/05/2019 à 16:20, Kevin Flöh a écrit : Dear ceph experts, [...] We have 4 nodes with 24 osds each and use 3+1 erasure coding. [...] Here is what happened: One osd daemon could not be started and therefore we decided to mark the osd as lost and set it up from scratch. Ceph started recovering and then we lost another osd with the same behavior. We did the same as for the first osd. With 3+1 you only allow a single OSD failure per pg at a given time. You have 4096 pgs and 96 osds, having 2 OSD fail at the same time on 2 separate servers (assuming standard crush rules) is a death sentence for the data on some pgs using both of those OSD (the ones not fully recovered before the second failure). OK, so the 2 OSDs (4,23) failed shortly one after the other but we think that the recovery of the first was finished before the second failed. Nonetheless, both problematic pgs have been on both OSDs. We think, that we still have enough shards left. For one of the pgs, the recovery state looks like this: "recovery_state": [ { "name": "Started/Primary/Peering/Incomplete", "enter_time": "2019-05-09 16:11:48.625966", "comment": "not enough complete instances of this PG" }, { "name": "Started/Primary/Peering", "enter_time": "2019-05-09 16:11:48.611171", "past_intervals": [ { "first": "49767", "last": "59313", "all_participants": [ { "osd": 2, "shard": 0 }, { "osd": 4, "shard": 1 }, { "osd": 23, "shard": 2 }, { "osd": 24, "shard": 0 }, { "osd": 72, "shard": 1 }, { "osd": 79, "shard": 3 } ], "intervals": [ { "first": "58860", "last": "58861", "acting": "4(1),24(0),79(3)" }, { "first": "58875", "last": "58877", "acting": "4(1),23(2),24(0)" }, { "first": "59002", "last": "59009", "acting": "4(1),23(2),79(3)" }, { "first": "59010", "last": "59012", "acting": "2(0),4(1),23(2),79(3)" }, { "first": "59197", "last": "59233", "acting": "23(2),24(0),79(3)" }, { "first": "59234", "last": "59313", "acting": "23(2),24(0),72(1),79(3)" } ] } ], "probing_osds": [ "2(0)", "4(1)", "23(2)", "24(0)", "72(1)", "79(3)" ], "down_osds_we_would_probe": [], "peering_blocked_by": [], "peering_blocked_by_detail": [ { "detail": "peering_blocked_by_history_les_bound" } ] }, { "name": "Started",
[ceph-users] Major ceph disaster
Here is what happened: One osd daemon could not be started and therefore we decided to mark the osd as lost and set it up from scratch. Ceph started recovering and then we lost another osd with the same behavior. We did the same as for the first osd. And now we are stuck with 2 pgs in incomplete. Ceph pg query gives the following problem: "down_osds_we_would_probe": [], "peering_blocked_by": [], "peering_blocked_by_detail": [ { "detail": "peering_blocked_by_history_les_bound" } We already tried to set "osd_find_best_info_ignore_history_les": "true" for the affected osds, which had no effect. Furthermore, the cluster is behind on trimming by more than 40,000 segments and we have folders and files which cannot be deleted or moved. (which are not on the 2 incomplete pgs). Is there any way to solve these problems? Best regards, Kevin
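For reference, osd_find_best_info_ignore_history_les is consulted by the acting primary while it peers, so it is typically set in ceph.conf for the primary OSD of the stuck pg, roughly:
[osd.24]
osd_find_best_info_ignore_history_les = true
followed by a restart of that OSD (or a forced re-peer via ceph osd down 24). Setting it without a subsequent re-peer has no visible effect, which may be why none was observed here. It bypasses a peering safety check, so it should be removed again right afterwards.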
Re: [ceph-users] cluster is not stable
Are you sure that firewalld is stopped and disabled? It looked exactly like this when I missed one host in a test cluster. Kevin Am Di., 12. März 2019 um 09:31 Uhr schrieb Zhenshi Zhou : > Hi, > > I deployed a ceph cluster with good performance. But the logs > indicate that the cluster is not as stable as I think it should be. > > The log shows the monitors mark some osd as down periodly: > [image: image.png] > > I didn't find any useful information in osd logs. > > ceph version 13.2.4 mimic (stable) > OS version CentOS 7.6.1810 > kernel version 5.0.0-2.el7 > > Thanks.
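A quick way to rule that out on every node, assuming systemd hosts:
systemctl is-active firewalld
systemctl disable --now firewalld
Then probe a heartbeat port from another node, e.g. nc -zv <osd-host> 6800; monitors need 6789/tcp and OSDs use the 6800-7300/tcp range.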
Re: [ceph-users] Usage of devices in SSD pool vary very much
ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
dd 0.90999 1.0 932GiB 335GiB 597GiB 35.96 0.79 91
12 hdd 0.90999 1.0 932GiB 357GiB 575GiB 38.28 0.84 96
35 hdd 0.90970 1.0 932GiB 318GiB 614GiB 34.14 0.75 86
6 ssd 0.43700 1.0 447GiB 278GiB 170GiB 62.08 1.36 63
7 ssd 0.43700 1.0 447GiB 256GiB 191GiB 57.17 1.25 60
8 ssd 0.43700 1.0 447GiB 291GiB 156GiB 65.01 1.42 57
31 ssd 0.43660 1.0 447GiB 246GiB 201GiB 54.96 1.20 51
34 ssd 0.43660 1.0 447GiB 189GiB 258GiB 42.22 0.92 46
36 ssd 0.87329 1.0 894GiB 389GiB 506GiB 43.45 0.95 91
37 ssd 0.87329 1.0 894GiB 390GiB 504GiB 43.63 0.96 85
42 ssd 0.87329 1.0 894GiB 401GiB 493GiB 44.88 0.98 92
43 ssd 0.87329 1.0 894GiB 455GiB 439GiB 50.89 1.11 89
17 hdd 0.90999 1.0 932GiB 368GiB 563GiB 39.55 0.87 100
18 hdd 0.90999 1.0 932GiB 350GiB 582GiB 37.56 0.82 95
24 hdd 0.90999 1.0 932GiB 359GiB 572GiB 38.58 0.84 97
26 hdd 0.90999 1.0 932GiB 388GiB 544GiB 41.62 0.91 105
13 ssd 0.43700 1.0 447GiB 322GiB 125GiB 72.12 1.58 80
14 ssd 0.43700 1.0 447GiB 291GiB 156GiB 65.16 1.43 70
15 ssd 0.43700 1.0 447GiB 350GiB 96.9GiB 78.33 1.72 78 <--
16 ssd 0.43700 1.0 447GiB 268GiB 179GiB 60.05 1.31 71
23 hdd 0.90999 1.0 932GiB 364GiB 567GiB 39.08 0.86 98
25 hdd 0.90999 1.0 932GiB 391GiB 541GiB 41.92 0.92 106
27 hdd 0.90999 1.0 932GiB 393GiB 538GiB 42.21 0.92 106
28 hdd 0.90970 1.0 932GiB 467GiB 464GiB 50.14 1.10 126
19 ssd 0.43700 1.0 447GiB 310GiB 137GiB 69.36 1.52 76
20 ssd 0.43700 1.0 447GiB 316GiB 131GiB 70.66 1.55 76
21 ssd 0.43700 1.0 447GiB 323GiB 125GiB 72.13 1.58 80
22 ssd 0.43700 1.0 447GiB 283GiB 164GiB 63.39 1.39 69
38 ssd 0.43660 1.0 447GiB 146GiB 302GiB 32.55 0.71 46
39 ssd 0.43660 1.0 447GiB 142GiB 305GiB 31.84 0.70 43
40 ssd 0.87329 1.0 894GiB 407GiB 487GiB 45.53 1.00 98
41 ssd 0.87329 1.0 894GiB 353GiB 541GiB 39.51 0.87 102
TOTAL 29.9TiB 13.7TiB 16.3TiB 45.66
MIN/MAX VAR: 0.63/1.72 STDDEV: 13.59
Kevin
Am So., 6. Jan. 2019 um 07:34 Uhr schrieb Konstantin Shalygin :
> > On 1/5/19 4:17 PM, Kevin Olbrich wrote:
> > root@adminnode:~# ceph osd tree
> > ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
> > -1 30.82903 root default
> > -16 30.82903 datacenter dc01
> > -19 30.82903 pod dc01-agg01
> > -10 17.43365 rack dc01-rack02
> > -47.20665 host node1001
> >0 hdd 0.90999 osd.0 up 1.0 1.0
> >1 hdd 0.90999 osd.1 up 1.0 1.0
> >5 hdd 0.90999 osd.5 up 1.0 1.0
> > 29 hdd 0.90970 osd.29up 1.0 1.0
> > 32 hdd 0.90970 osd.32 down0 1.0
> > 33 hdd 0.90970 osd.33up 1.0 1.0
> >2 ssd 0.43700 osd.2 up 1.0 1.0
> >3 ssd 0.43700 osd.3 up 1.0 1.0
> >4 ssd 0.43700 osd.4 up 1.0 1.0
> > 30 ssd 0.43660 osd.30up 1.0 1.0
> > -76.29724 host node1002
> >9 hdd 0.90999 osd.9 up 1.0 1.0
> > 10 hdd 0.90999 osd.10up 1.0 1.0
> > 11 hdd 0.90999 osd.11up 1.0 1.0
> > 12 hdd 0.90999 osd.12up 1.0 1.0
> > 35 hdd 0.90970 osd.35up 1.0 1.0
> >6 ssd 0.43700 osd.6 up 1.0 1.0
> >7 ssd 0.43700 osd.7 up 1.0 1.0
> >8 ssd 0.43700 osd.8 up 1.0 1.0
> > 31 ssd 0.43660 osd.31up 1.0 1.0
> > -282.18318 host node1005
> > 34 ssd 0.43660 osd.34up 1.0 1.0
> > 36 ssd 0.87329 osd.36up 1.0 1.0
> > 37 ssd 0.87329 osd.37up 1.0 1.0
> > -291.74658 host node1006
> > 42 ssd 0.87329 osd.42up 1.0 1.0
> > 43 ssd 0.87329 osd.43up 1.0 1.0
> > -11 13.39537 rack dc01-rack03
> > -225.38794 host node100
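If all clients are at least Luminous, the upmap balancer usually evens out distributions like this far better than manual reweights; a hedged sketch for 12.2.x:
ceph osd set-require-min-compat-client luminous
ceph mgr module enable balancer
ceph balancer mode upmap
ceph balancer on
Where older clients rule out upmap, ceph osd reweight-by-utilization remains the cruder fallback.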
Re: [ceph-users] Rezising an online mounted ext4 on a rbd - failed
Am Sa., 26. Jan. 2019 um 13:43 Uhr schrieb Götz Reinicke : > > Hi, > > I have a fileserver which mounted a 4TB rbd, which is ext4 formatted. > > I grow that rbd and ext4 starting with an 2TB rbd that way: > > rbd resize testpool/disk01 --size 4194304 > > resize2fs /dev/rbd0 > > Today I wanted to extend that ext4 to 8 TB and did: > > rbd resize testpool/disk01 --size 8388608 > > resize2fs /dev/rbd0 > > => which gives an error: The filesystem is already 1073741824 blocks. Nothing > to do. > > > I bet I missed something very simple. Any hint? Thanks and regards . > Götz Try "partprobe" so the kernel re-reads the device size.
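To check whether the kernel actually sees the new size before retrying resize2fs, compare the cluster view with the block device view, e.g.:
rbd info testpool/disk01 | grep size
blockdev --getsize64 /dev/rbd0
If the two disagree, partprobe (or unmapping and remapping the rbd) should refresh the kernel's idea of the device size.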
Re: [ceph-users] Bluestore 32bit max_object_size limit
On 1/18/19 7:26 AM, Igor Fedotov wrote: Hi Kevin, On 1/17/2019 10:50 PM, KEVIN MICHAEL HRPCEK wrote: Hey, I recall reading about this somewhere but I can't find it in the docs or list archive and confirmation from a dev or someone who knows for sure would be nice. What I recall is that bluestore has a max 4GB file size limit based on the design of bluestore, not the osd_max_object_size setting. The bluestore source seems to suggest that by setting the OBJECT_MAX_SIZE to a 32bit max, giving an error if osd_max_object_size is > OBJECT_MAX_SIZE, and not writing the data if offset+length >= OBJECT_MAX_SIZE. So it seems like the in-OSD file size int can't exceed 32 bits which is 4GB, like FAT32. Am I correct or maybe I'm reading all this wrong..? You're correct, BlueStore doesn't support objects larger than OBJECT_MAX_SIZE (i.e. 4Gb). Thanks for confirming that! If bluestore has a hard 4GB object limit using radosstriper to break up an object would work, but does using an EC pool that breaks up the object to shards smaller than OBJECT_MAX_SIZE have the same effect as radosstriper to get around a 4GB limit? We use rados directly and would like to move to bluestore but we have some large objects <= 13G that may need attention if this 4GB limit does exist and an ec pool doesn't get around it. Theoretically object split using EC might help. But I'm not sure whether one needs to adjust osd_max_object_size greater than 4Gb to permit 13Gb object usage in an EC pool. If it's needed, then the osd_max_object_size <= OBJECT_MAX_SIZE constraint is violated and BlueStore wouldn't start. In my experience I had to increase osd_max_object_size from the 128M default (it changed a couple versions ago) to ~20G to be able to write our largest objects with some margin. Do you think there is another way to handle osd_max_object_size > OBJECT_MAX_SIZE so that bluestore will start and EC pools or striping can be used to write objects that are greater than OBJECT_MAX_SIZE but each stripe/shard ends up smaller than OBJECT_MAX_SIZE after striping or being in an ec pool? https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L88 #define OBJECT_MAX_SIZE 0xffffffff // 32 bits https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L4395 // sanity check(s) auto osd_max_object_size = cct->_conf.get_val("osd_max_object_size"); if (osd_max_object_size >= (size_t)OBJECT_MAX_SIZE) { derr << __func__ << " osd_max_object_size >= 0x" << std::hex << OBJECT_MAX_SIZE << "; BlueStore has hard limit of 0x" << OBJECT_MAX_SIZE << "." << std::dec << dendl; return -EINVAL; } https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L12331 if (offset + length >= OBJECT_MAX_SIZE) { r = -E2BIG; } else { _assign_nid(txc, o); r = _do_write(txc, c, o, offset, length, bl, fadvise_flags); txc->write_onode(o); } Thanks! Kevin -- Kevin Hrpcek NASA SNPP Atmosphere SIPS Space Science & Engineering Center University of Wisconsin-Madison Thanks, Igor
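As a back-of-the-envelope check of the EC idea: with k=4, a 13G logical object lands as roughly 13G/4 = 3.25G per shard, comfortably under the 4GB ceiling. The unresolved point above is that the startup sanity check quoted below compares the configured osd_max_object_size against OBJECT_MAX_SIZE, not the resulting shard sizes.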
[ceph-users] Bluestore 32bit max_object_size limit
Hey, I recall reading about this somewhere but I can't find it in the docs or list archive and confirmation from a dev or someone who knows for sure would be nice. What I recall is that bluestore has a max 4GB file size limit based on the design of bluestore, not the osd_max_object_size setting. The bluestore source seems to suggest that by setting the OBJECT_MAX_SIZE to a 32bit max, giving an error if osd_max_object_size is > OBJECT_MAX_SIZE, and not writing the data if offset+length >= OBJECT_MAX_SIZE. So it seems like the in-OSD file size int can't exceed 32 bits which is 4GB, like FAT32. Am I correct or maybe I'm reading all this wrong..? If bluestore has a hard 4GB object limit using radosstriper to break up an object would work, but does using an EC pool that breaks up the object to shards smaller than OBJECT_MAX_SIZE have the same effect as radosstriper to get around a 4GB limit? We use rados directly and would like to move to bluestore but we have some large objects <= 13G that may need attention if this 4GB limit does exist and an ec pool doesn't get around it. https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L88 #define OBJECT_MAX_SIZE 0xffffffff // 32 bits https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L4395 // sanity check(s) auto osd_max_object_size = cct->_conf.get_val("osd_max_object_size"); if (osd_max_object_size >= (size_t)OBJECT_MAX_SIZE) { derr << __func__ << " osd_max_object_size >= 0x" << std::hex << OBJECT_MAX_SIZE << "; BlueStore has hard limit of 0x" << OBJECT_MAX_SIZE << "." << std::dec << dendl; return -EINVAL; } https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L12331 if (offset + length >= OBJECT_MAX_SIZE) { r = -E2BIG; } else { _assign_nid(txc, o); r = _do_write(txc, c, o, offset, length, bl, fadvise_flags); txc->write_onode(o); } Thanks! Kevin -- Kevin Hrpcek NASA SNPP Atmosphere SIPS Space Science & Engineering Center University of Wisconsin-Madison
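For the radosstriper option mentioned above, the rados CLI exposes libradosstriper directly, so the behavior can be tested without writing code; a sketch with pool and object names as placeholders:
rados -p testpool --striper put bigobject ./13g_file
rados -p testpool --striper stat bigobject
The striper splits the logical object into fixed-size RADOS objects, keeping each on-disk object well under the BlueStore limit.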
Re: [ceph-users] pgs stuck in creating+peering state
Are you sure, no service like firewalld is running? Did you check that all machines have the same MTU and jumbo frames are enabled if needed? I had this problem when I first started with ceph and forgot to disable firewalld. Replication worked perfectly fine but the OSD was kicked out every few seconds. Kevin Am Do., 17. Jan. 2019 um 11:57 Uhr schrieb Johan Thomsen : > > Hi, > > I have a sad ceph cluster. > All my osds complain about failed reply on heartbeat, like so: > > osd.10 635 heartbeat_check: no reply from 192.168.160.237:6810 osd.42 > ever on either front or back, first ping sent 2019-01-16 > 22:26:07.724336 (cutoff 2019-01-16 22:26:08.225353) > > .. I've checked the network sanity all I can, and all ceph ports are > open between nodes both on the public network and the cluster network, > and I have no problems sending traffic back and forth between nodes. > I've tried tcpdump'ing and traffic is passing in both directions > between the nodes, but unfortunately I don't natively speak the ceph > protocol, so I can't figure out what's going wrong in the heartbeat > conversation. > > Still: > > # ceph health detail > > HEALTH_WARN nodown,noout flag(s) set; Reduced data availability: 1072 > pgs inactive, 1072 pgs peering > OSDMAP_FLAGS nodown,noout flag(s) set > PG_AVAILABILITY Reduced data availability: 1072 pgs inactive, 1072 pgs peering > pg 7.3cd is stuck inactive for 245901.560813, current state > creating+peering, last acting [13,41,1] > pg 7.3ce is stuck peering for 245901.560813, current state > creating+peering, last acting [1,40,7] > pg 7.3cf is stuck peering for 245901.560813, current state > creating+peering, last acting [0,42,9] > pg 7.3d0 is stuck peering for 245901.560813, current state > creating+peering, last acting [20,8,38] > pg 7.3d1 is stuck peering for 245901.560813, current state > creating+peering, last acting [10,20,42] >() > > > I've set "noout" and "nodown" to prevent all osd's from being removed > from the cluster. They are all running and marked as "up". > > # ceph osd tree > > ID CLASS WEIGHTTYPE NAME STATUS REWEIGHT PRI-AFF > -1 249.73434 root default > -25 166.48956 datacenter m1 > -2483.24478 pod kube1 > -3541.62239 rack 10 > -3441.62239 host ceph-sto-p102 > 40 hdd 7.27689 osd.40 up 1.0 1.0 > 41 hdd 7.27689 osd.41 up 1.0 1.0 > 42 hdd 7.27689 osd.42 up 1.0 1.0 >() > > I'm at a point where I don't know which options and what logs to check > anymore? > > Any debug hint would be very much appreciated. > > btw. I have no important data in the cluster (yet), so if the solution > is to drop all osd and recreate them, it's ok for now. But I'd really > like to know how the cluster ended in this state. > > /Johan > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
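Two quick checks for the MTU side of this, sketched for Linux nodes:
ip -o link show | grep -o 'mtu [0-9]*' | sort | uniq -c
ping -M do -s 8972 192.168.160.237
The first shows whether all interfaces agree on the MTU; the second sends a don't-fragment ping sized for a 9000-byte MTU (8972 bytes payload plus headers) to the peer from the heartbeat log, and fails if jumbo frames are only configured on part of the path.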
Re: [ceph-users] Problem with CephFS - No space left on device
It would but you should not: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014846.html Kevin Am Di., 8. Jan. 2019 um 15:35 Uhr schrieb Rodrigo Embeita : > > Thanks again Kevin. > If I reduce the size flag to a value of 2, that should fix the problem? > > Regards > > On Tue, Jan 8, 2019 at 11:28 AM Kevin Olbrich wrote: >> >> You use replication 3 failure-domain host. >> OSD 2 and 4 are full, thats why your pool is also full. >> You need to add two disks to pf-us1-dfs3 or swap one from the larger >> nodes to this one. >> >> Kevin >> >> Am Di., 8. Jan. 2019 um 15:20 Uhr schrieb Rodrigo Embeita >> : >> > >> > Hi Yoann, thanks for your response. >> > Here are the results of the commands. >> > >> > root@pf-us1-dfs2:/var/log/ceph# ceph osd df >> > ID CLASS WEIGHT REWEIGHT SIZEUSE AVAIL %USE VAR PGS >> > 0 hdd 7.27739 1.0 7.3 TiB 6.7 TiB 571 GiB 92.33 1.74 310 >> > 5 hdd 7.27739 1.0 7.3 TiB 5.6 TiB 1.7 TiB 77.18 1.45 271 >> > 6 hdd 7.27739 1.0 7.3 TiB 609 GiB 6.7 TiB 8.17 0.15 49 >> > 8 hdd 7.27739 1.0 7.3 TiB 2.5 GiB 7.3 TiB 0.030 42 >> > 1 hdd 7.27739 1.0 7.3 TiB 5.6 TiB 1.7 TiB 77.28 1.45 285 >> > 3 hdd 7.27739 1.0 7.3 TiB 6.9 TiB 371 GiB 95.02 1.79 296 >> > 7 hdd 7.27739 1.0 7.3 TiB 360 GiB 6.9 TiB 4.84 0.09 53 >> > 9 hdd 7.27739 1.0 7.3 TiB 4.1 GiB 7.3 TiB 0.06 0.00 38 >> > 2 hdd 7.27739 1.0 7.3 TiB 6.7 TiB 576 GiB 92.27 1.74 321 >> > 4 hdd 7.27739 1.0 7.3 TiB 6.1 TiB 1.2 TiB 84.10 1.58 351 >> >TOTAL 73 TiB 39 TiB 34 TiB 53.13 >> > MIN/MAX VAR: 0/1.79 STDDEV: 41.15 >> > >> > >> > root@pf-us1-dfs2:/var/log/ceph# ceph osd pool ls detail >> > pool 1 'poolcephfs' replicated size 3 min_size 2 crush_rule 0 object_hash >> > rjenkins pg_num 128 pgp_num 128 last_change 471 fla >> > gs hashpspool,full stripe_width 0 >> > pool 2 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash >> > rjenkins pg_num 256 pgp_num 256 last_change 471 lf >> > or 0/439 flags hashpspool,full stripe_width 0 application cephfs >> > pool 3 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 >> > object_hash rjenkins pg_num 256 pgp_num 256 last_change 47 >> > 1 lfor 0/448 flags hashpspool,full stripe_width 0 application cephfs >> > pool 4 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash >> > rjenkins pg_num 8 pgp_num 8 last_change 471 flags ha >> > shpspool,full stripe_width 0 application rgw >> > pool 5 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0 >> > object_hash rjenkins pg_num 8 pgp_num 8 last_change 47 >> > 1 flags hashpspool,full stripe_width 0 application rgw >> > pool 6 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 0 >> > object_hash rjenkins pg_num 8 pgp_num 8 last_change 471 f >> > lags hashpspool,full stripe_width 0 application rgw >> > pool 7 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0 >> > object_hash rjenkins pg_num 8 pgp_num 8 last_change 471 fl >> > ags hashpspool,full stripe_width 0 application rgw >> > >> > >> > root@pf-us1-dfs2:/var/log/ceph# ceph osd tree >> > ID CLASS WEIGHT TYPE NAMESTATUS REWEIGHT PRI-AFF >> > -1 72.77390 root default >> > -3 29.10956 host pf-us1-dfs1 >> > 0 hdd 7.27739 osd.0up 1.0 1.0 >> > 5 hdd 7.27739 osd.5up 1.0 1.0 >> > 6 hdd 7.27739 osd.6up 1.0 1.0 >> > 8 hdd 7.27739 osd.8up 1.0 1.0 >> > -5 29.10956 host pf-us1-dfs2 >> > 1 hdd 7.27739 osd.1up 1.0 1.0 >> > 3 hdd 7.27739 osd.3up 1.0 1.0 >> > 7 hdd 7.27739 osd.7up 1.0 1.0 >> > 9 hdd 7.27739 osd.9up 1.0 1.0 >> > -7 14.55478 host pf-us1-dfs3 >> > 2 hdd 7.27739 osd.2up 1.0 1.0 >> > 4 hdd 7.27739 osd.4up 1.0 1.0 >> > >> > >> > Thanks 
for your help guys. >> > >> > >> > On Tue, Jan 8, 2019 at 10:36 AM Yoann Moulin wrote: >> >> >> >> Hello, >> >> >> >> > Hi guys, I need your help. >> >> > I'm new with Cephfs and we started using it
Re: [ceph-users] Problem with CephFS - No space left on device
You use replication 3 failure-domain host. OSD 2 and 4 are full, thats why your pool is also full. You need to add two disks to pf-us1-dfs3 or swap one from the larger nodes to this one. Kevin Am Di., 8. Jan. 2019 um 15:20 Uhr schrieb Rodrigo Embeita : > > Hi Yoann, thanks for your response. > Here are the results of the commands. > > root@pf-us1-dfs2:/var/log/ceph# ceph osd df > ID CLASS WEIGHT REWEIGHT SIZEUSE AVAIL %USE VAR PGS > 0 hdd 7.27739 1.0 7.3 TiB 6.7 TiB 571 GiB 92.33 1.74 310 > 5 hdd 7.27739 1.0 7.3 TiB 5.6 TiB 1.7 TiB 77.18 1.45 271 > 6 hdd 7.27739 1.0 7.3 TiB 609 GiB 6.7 TiB 8.17 0.15 49 > 8 hdd 7.27739 1.0 7.3 TiB 2.5 GiB 7.3 TiB 0.030 42 > 1 hdd 7.27739 1.0 7.3 TiB 5.6 TiB 1.7 TiB 77.28 1.45 285 > 3 hdd 7.27739 1.0 7.3 TiB 6.9 TiB 371 GiB 95.02 1.79 296 > 7 hdd 7.27739 1.0 7.3 TiB 360 GiB 6.9 TiB 4.84 0.09 53 > 9 hdd 7.27739 1.0 7.3 TiB 4.1 GiB 7.3 TiB 0.06 0.00 38 > 2 hdd 7.27739 1.0 7.3 TiB 6.7 TiB 576 GiB 92.27 1.74 321 > 4 hdd 7.27739 1.0 7.3 TiB 6.1 TiB 1.2 TiB 84.10 1.58 351 >TOTAL 73 TiB 39 TiB 34 TiB 53.13 > MIN/MAX VAR: 0/1.79 STDDEV: 41.15 > > > root@pf-us1-dfs2:/var/log/ceph# ceph osd pool ls detail > pool 1 'poolcephfs' replicated size 3 min_size 2 crush_rule 0 object_hash > rjenkins pg_num 128 pgp_num 128 last_change 471 fla > gs hashpspool,full stripe_width 0 > pool 2 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash > rjenkins pg_num 256 pgp_num 256 last_change 471 lf > or 0/439 flags hashpspool,full stripe_width 0 application cephfs > pool 3 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 > object_hash rjenkins pg_num 256 pgp_num 256 last_change 47 > 1 lfor 0/448 flags hashpspool,full stripe_width 0 application cephfs > pool 4 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash > rjenkins pg_num 8 pgp_num 8 last_change 471 flags ha > shpspool,full stripe_width 0 application rgw > pool 5 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0 > object_hash rjenkins pg_num 8 pgp_num 8 last_change 47 > 1 flags hashpspool,full stripe_width 0 application rgw > pool 6 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 0 > object_hash rjenkins pg_num 8 pgp_num 8 last_change 471 f > lags hashpspool,full stripe_width 0 application rgw > pool 7 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0 > object_hash rjenkins pg_num 8 pgp_num 8 last_change 471 fl > ags hashpspool,full stripe_width 0 application rgw > > > root@pf-us1-dfs2:/var/log/ceph# ceph osd tree > ID CLASS WEIGHT TYPE NAMESTATUS REWEIGHT PRI-AFF > -1 72.77390 root default > -3 29.10956 host pf-us1-dfs1 > 0 hdd 7.27739 osd.0up 1.0 1.0 > 5 hdd 7.27739 osd.5up 1.0 1.0 > 6 hdd 7.27739 osd.6up 1.0 1.0 > 8 hdd 7.27739 osd.8up 1.0 1.0 > -5 29.10956 host pf-us1-dfs2 > 1 hdd 7.27739 osd.1up 1.0 1.0 > 3 hdd 7.27739 osd.3up 1.0 1.0 > 7 hdd 7.27739 osd.7up 1.0 1.0 > 9 hdd 7.27739 osd.9up 1.0 1.0 > -7 14.55478 host pf-us1-dfs3 > 2 hdd 7.27739 osd.2up 1.0 1.0 > 4 hdd 7.27739 osd.4up 1.0 1.0 > > > Thanks for your help guys. > > > On Tue, Jan 8, 2019 at 10:36 AM Yoann Moulin wrote: >> >> Hello, >> >> > Hi guys, I need your help. >> > I'm new with Cephfs and we started using it as file storage. >> > Today we are getting no space left on device but I'm seeing that we have >> > plenty space on the filesystem. >> > Filesystem Size Used Avail Use% Mounted on >> > 192.168.51.8,192.168.51.6,192.168.51.118:6789:/pagefreezer/smhosts 73T >> > 39T 35T 54% /mnt/cephfs >> > >> > We have 35TB of disk space. 
I've added 2 additional OSD disks with 7TB >> > each but I'm getting the error "No space left on device" every time that >> > I want to add a new file. >> > After adding the 2 additional OSD disks I'm seeing that the load is beign >> > distributed among the cluster. >> > Please I need your help. >> >> Could you give us the output of >> >> ceph osd df >> ceph osd pool ls detail >> ceph osd tree >> >> Best regards, >> >> -- >> Yoann Moulin >> EPFL IC-IT >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
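The arithmetic behind the advice above: with size 3 and a host failure domain over exactly 3 hosts, every pg must keep one copy on pf-us1-dfs3, so its 2 OSDs absorb the same data as the 4-OSD hosts and fill roughly twice as fast. As a stopgap until disks are added, something like
ceph osd reweight-by-utilization 110
can shift some pgs off the fullest OSDs, but it cannot change that host-level bound.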
Re: [ceph-users] Problem with CephFS - No space left on device
Looks like the same problem as mine: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-January/032054.html The free space shown is the cluster total, while Ceph is limited by the OSD with the least free space (the worst OSD). Please check your (re-)weights. Kevin Am Di., 8. Jan. 2019 um 14:32 Uhr schrieb Rodrigo Embeita : > Hi guys, I need your help. > I'm new with Cephfs and we started using it as file storage. > Today we are getting no space left on device but I'm seeing that we have > plenty space on the filesystem. > Filesystem Size Used Avail Use% Mounted on > 192.168.51.8,192.168.51.6,192.168.51.118:6789:/pagefreezer/smhosts 73T > 39T 35T 54% /mnt/cephfs > > We have 35TB of disk space. I've added 2 additional OSD disks with 7TB each > but I'm getting the error "No space left on device" every time that I want to > add a new file. > After adding the 2 additional OSD disks I'm seeing that the load is beign > distributed among the cluster. > Please I need your help. > > root@pf-us1-dfs1:/etc/ceph# ceph -s > cluster: >id: 609e9313-bdd3-449e-a23f-3db8382e71fb >health: HEALTH_ERR >2 backfillfull osd(s) >1 full osd(s) >7 pool(s) full >197313040/508449063 objects misplaced (38.807%) >Degraded data redundancy: 2/508449063 objects degraded (0.000%), 2 > pgs degraded >Degraded data redundancy (low space): 16 pgs backfill_toofull, 3 > pgs recovery_toofull > > services: >mon: 3 daemons, quorum pf-us1-dfs2,pf-us1-dfs1,pf-us1-dfs3 >mgr: pf-us1-dfs3(active), standbys: pf-us1-dfs2 >mds: pagefs-2/2/2 up {0=pf-us1-dfs3=up:active,1=pf-us1-dfs1=up:active}, 1 > up:standby >osd: 10 osds: 10 up, 10 in; 189 remapped pgs >rgw: 1 daemon active > > data: >pools: 7 pools, 416 pgs >objects: 169.5 M objects, 3.6 TiB >usage: 39 TiB used, 34 TiB / 73 TiB avail >pgs: 2/508449063 objects degraded (0.000%) > 197313040/508449063 objects misplaced (38.807%) > 224 active+clean > 168 active+remapped+backfill_wait > 16 active+remapped+backfill_wait+backfill_toofull > 5 active+remapped+backfilling > 2 active+recovery_toofull+degraded > 1 active+recovery_toofull > > io: >recovery: 1.1 MiB/s, 31 objects/s >
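A quick way to see that effect is comparing the pool and OSD views:
ceph df
ceph osd df tree
The MAX AVAIL column in ceph df is derived from the fullest OSD reachable by the pool's crush rule, not from the raw total, so one nearly full OSD caps the whole pool.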
Re: [ceph-users] Balancer=on with crush-compat mode
If I understand the balancer correctly, it balances PGs, not data. This worked perfectly fine in your case. I prefer a PG count of ~100 per OSD, you are at 30. Maybe it would help to bump the PGs. Kevin Am Sa., 5. Jan. 2019 um 14:39 Uhr schrieb Marc Roos : > > I have straw2, balancer=on, crush-compat and it gives worst spread over > my ssd drives (4 only) being used by only 2 pools. One of these pools > has pg 8. Should I increase this to 16 to create a better result, or > will it never be any better. > > For now I like to stick to crush-compat, so I can use a default centos7 > kernel. > > Luminous 12.2.8, 3.10.0-862.14.4.el7.x86_64, CentOS Linux release > 7.5.1804 (Core) > > > > [@c01 ~]# cat balancer-1-before.txt | egrep '^19|^20|^21|^30' > 19 ssd 0.48000 1.0 447GiB 164GiB 283GiB 36.79 0.93 31 > 20 ssd 0.48000 1.0 447GiB 136GiB 311GiB 30.49 0.77 32 > 21 ssd 0.48000 1.0 447GiB 215GiB 232GiB 48.02 1.22 30 > 30 ssd 0.48000 1.0 447GiB 151GiB 296GiB 33.72 0.86 27 > > [@c01 ~]# ceph osd df | egrep '^19|^20|^21|^30' > 19 ssd 0.48000 1.0 447GiB 157GiB 290GiB 35.18 0.87 30 > 20 ssd 0.48000 1.0 447GiB 125GiB 322GiB 28.00 0.69 30 > 21 ssd 0.48000 1.0 447GiB 245GiB 202GiB 54.71 1.35 30 > 30 ssd 0.48000 1.0 447GiB 217GiB 230GiB 48.46 1.20 30 > > [@c01 ~]# ceph osd pool ls detail | egrep 'fs_meta|rbd.ssd' > pool 19 'fs_meta' replicated size 3 min_size 2 crush_rule 5 object_hash > rjenkins pg_num 16 pgp_num 16 last_change 22425 lfor 0/9035 flags > hashpspool stripe_width 0 application cephfs > pool 54 'rbd.ssd' replicated size 3 min_size 2 crush_rule 5 object_hash > rjenkins pg_num 8 pgp_num 8 last_change 24666 flags hashpspool > stripe_width 0 application rbd > > [@c01 ~]# ceph df |egrep 'ssd|fs_meta' > fs_meta 19 170MiB 0.07 > 240GiB 2451382 > fs_data.ssd 33 0B 0 > 240GiB 0 > rbd.ssd 54 266GiB 52.57 > 240GiB 75902 > fs_data.ec21.ssd 55 0B 0 > 480GiB 0 >
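The arithmetic behind bumping pg_num: at pg_num 8 with size 3, the rbd.ssd pool has only 24 placements across 4 OSDs, so one pg more or less per OSD swings usage by several percent of the pool; 16 pgs halve that granularity. A hedged sketch (pg_num can be raised but never lowered on Luminous):
ceph osd pool set rbd.ssd pg_num 16
ceph osd pool set rbd.ssd pgp_num 16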
Re: [ceph-users] Usage of devices in SSD pool vary very much
osd.33 2 ssd 0.43700 1.0 447GiB 271GiB 176GiB 60.67 1.30 50 osd.2 3 ssd 0.43700 1.0 447GiB 249GiB 198GiB 55.62 1.19 58 osd.3 4 ssd 0.43700 1.0 447GiB 297GiB 150GiB 66.39 1.42 56 osd.4 30 ssd 0.43660 1.0 447GiB 236GiB 211GiB 52.85 1.13 48 osd.30 -76.29724- 6.29TiB 2.74TiB 3.55TiB 43.53 0.93 - host node1002 9 hdd 0.90999 1.0 932GiB 354GiB 578GiB 37.96 0.81 95 osd.9 10 hdd 0.90999 1.0 932GiB 357GiB 575GiB 38.28 0.82 96 osd.10 11 hdd 0.90999 1.0 932GiB 318GiB 613GiB 34.18 0.73 86 osd.11 12 hdd 0.90999 1.0 932GiB 373GiB 558GiB 40.09 0.86 100 osd.12 35 hdd 0.90970 1.0 932GiB 343GiB 588GiB 36.83 0.79 92 osd.35 6 ssd 0.43700 1.0 447GiB 269GiB 178GiB 60.20 1.29 60 osd.6 7 ssd 0.43700 1.0 447GiB 249GiB 198GiB 55.69 1.19 56 osd.7 8 ssd 0.43700 1.0 447GiB 286GiB 161GiB 63.95 1.37 56 osd.8 31 ssd 0.43660 1.0 447GiB 257GiB 190GiB 57.47 1.23 55 osd.31 -282.18318- 2.18TiB 968GiB 1.24TiB 43.29 0.93 - host node1005 34 ssd 0.43660 1.0 447GiB 202GiB 245GiB 45.14 0.97 47 osd.34 36 ssd 0.87329 1.0 894GiB 405GiB 489GiB 45.28 0.97 91 osd.36 37 ssd 0.87329 1.0 894GiB 361GiB 533GiB 40.38 0.87 79 osd.37 -291.74658- 1.75TiB 888GiB 900GiB 49.65 1.06 - host node1006 42 ssd 0.87329 1.0 894GiB 417GiB 477GiB 46.68 1.00 92 osd.42 43 ssd 0.87329 1.0 894GiB 471GiB 424GiB 52.63 1.13 90 osd.43 -11 13.39537- 13.4TiB 6.64TiB 6.75TiB 49.60 1.06 - rack dc01-rack03 -225.38794- 5.39TiB 2.70TiB 2.69TiB 50.14 1.07 - host node1003 17 hdd 0.90999 1.0 932GiB 371GiB 560GiB 39.83 0.85 100 osd.17 18 hdd 0.90999 1.0 932GiB 390GiB 542GiB 41.82 0.90 105 osd.18 24 hdd 0.90999 1.0 932GiB 352GiB 580GiB 37.77 0.81 94 osd.24 26 hdd 0.90999 1.0 932GiB 387GiB 545GiB 41.54 0.89 104 osd.26 13 ssd 0.43700 1.0 447GiB 319GiB 128GiB 71.32 1.53 77 osd.13 14 ssd 0.43700 1.0 447GiB 303GiB 144GiB 67.76 1.45 70 osd.14 15 ssd 0.43700 1.0 447GiB 361GiB 86.4GiB 80.67 1.73 77 osd.15 16 ssd 0.43700 1.0 447GiB 283GiB 164GiB 63.29 1.36 71 osd.16 -255.38765- 5.39TiB 2.83TiB 2.56TiB 52.55 1.13 - host node1004 23 hdd 0.90999 1.0 932GiB 382GiB 549GiB 41.05 0.88 102 osd.23 25 hdd 0.90999 1.0 932GiB 412GiB 520GiB 44.20 0.95 111 osd.25 27 hdd 0.90999 1.0 932GiB 385GiB 546GiB 41.36 0.89 103 osd.27 28 hdd 0.90970 1.0 932GiB 462GiB 469GiB 49.64 1.06 124 osd.28 19 ssd 0.43700 1.0 447GiB 314GiB 133GiB 70.22 1.51 75 osd.19 20 ssd 0.43700 1.0 447GiB 327GiB 120GiB 73.06 1.57 76 osd.20 21 ssd 0.43700 1.0 447GiB 324GiB 123GiB 72.45 1.55 77 osd.21 22 ssd 0.43700 1.0 447GiB 292GiB 156GiB 65.21 1.40 68 osd.22 -302.61978- 2.62TiB 1.11TiB 1.51TiB 42.43 0.91 - host node1007 38 ssd 0.43660 1.0 447GiB 165GiB 283GiB 36.82 0.79 46 osd.38 39 ssd 0.43660 1.0 447GiB 156GiB 292GiB 34.79 0.75 42 osd.39 40 ssd 0.87329 1.0 894GiB 429GiB 466GiB 47.94 1.03 98 osd.40 41 ssd 0.87329 1.0 894GiB 389GiB 505GiB 43.55 0.93 103 osd.41 TOTAL 29.9TiB 14.0TiB 16.0TiB 46.65 MIN/MAX VAR: 0.65/1.73 STDDEV: 13.30 = root@adminnode:~# ceph df && ceph -v GLOBAL: SIZEAVAIL RAW USED %RAW USED 29.9TiB 16.0TiB 14.0TiB 46.65 POOLS: NAME ID USED%USED MAX AVAIL OBJECTS rbd_vms_ssd 2 986GiB 49.83993GiB 262606 rbd_vms_hdd 3 3.76TiB 48.94 3.92TiB 992255 rbd_vms_ssd_014 372KiB 0662GiB 148 rbd_vms_ssd_01_ec 6 2.85TiB 68.81 1.29TiB 770506 ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable) Kevin Am Sa., 5. Jan. 2019 um 05:12 Uhr schrieb Konstantin Shalygin : > > On 1/5/19 1:51 AM, Kevin Olbrich wrote: > &
Re: [ceph-users] Help Ceph Cluster Down
Hi Arun, actually deleting was not a good idea, that's why I wrote that the OSDs should be "out". You have down PGs; that's because the data is on OSDs that are unavailable but known by the cluster. This can be checked by using "ceph pg 0.5 query" (change PG name). Because your PG count is so much oversized, the overdose limits get hit on every recovery on your cluster. I had the same problem on a medium cluster when I added too many new disks at once. You already got this info from Caspar earlier in this thread. https://ceph.com/planet/placement-groups-with-ceph-luminous-stay-in-activating-state/ https://blog.widodh.nl/2018/01/placement-groups-with-ceph-luminous-stay-in-activating-state/ The second link shows one of the config params you need to inject to all your OSDs like this: ceph tell osd.* injectargs --mon_max_pg_per_osd 1 This might help you get these PGs some sort of "active" (+recovery/+degraded/+inconsistent/etc.). The down PGs will most likely never come back. I would bet you will find OSD IDs in the acting set that are invalid, meaning that non-existent OSDs hold your data. I had a similar problem on a test cluster with erasure code pools where too many disks failed at the same time, you will then see negative values as OSD IDs. Maybe this helps a little bit. Kevin Am Sa., 5. Jan. 2019 um 00:20 Uhr schrieb Arun POONIA : > Hi Kevin, > > I tried deleting newly added server from Ceph Cluster and looks like Ceph is not recovering. I agree with unfound data but it doesn't say about unfound data. It says inactive/down for PGs and I can't bring them up. > > > [root@fre101 ~]# ceph health detail > 2019-01-04 15:17:05.711641 7f27b0f31700 -1 asok(0x7f27ac0017a0) > AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to > bind the UNIX domain socket to > '/var/run/ceph-guests/ceph-client.admin.129552.139808366139728.asok': (2) No > such file or directory > HEALTH_ERR 3 pools have many more objects per pg than average; > 523656/12393978 objects misplaced (4.225%); 6517 PGs pending on creation; > Reduced data availability: 6585 pgs inactive, 1267 pgs down, 2 pgs peering, > 2703 pgs stale; Degraded data redundancy: 86858/12393978 objects degraded > (0.701%), 717 pgs degraded, 21 pgs undersized; 99059 slow requests are > blocked > 32 sec; 4834 stuck requests are blocked > 4096 sec; too many PGs > per OSD (3003 > max 200) > MANY_OBJECTS_PER_PG 3 pools have many more objects per pg than average > pool glance-images objects per pg (10478) is more than 92.7257 times > cluster average (113) > pool vms objects per pg (4722) is more than 41.7876 times cluster average > (113) > pool volumes objects per pg (1220) is more than 10.7965 times cluster > average (113) > OBJECT_MISPLACED 523656/12393978 objects misplaced (4.225%) > PENDING_CREATING_PGS 6517 PGs pending on creation > osds > [osd.0,osd.1,osd.10,osd.11,osd.12,osd.13,osd.14,osd.15,osd.16,osd.17,osd.18,osd.19,osd.2,osd.20,osd.21,osd.22,osd.23,osd.24,osd.25,osd.26,osd.27,osd.28,osd.29,osd.3,osd.30,osd.31,osd.32,osd.33,osd.34,osd.35,osd.4,osd.5,osd.6,osd.7,osd.8,osd.9] > have pending PGs. 
> PG_AVAILABILITY Reduced data availability: 6585 pgs inactive, 1267 pgs down, > 2 pgs peering, 2703 pgs stale > pg 10.90e is stuck inactive for 94928.999109, current state activating, > last acting [2,6] > pg 10.913 is stuck inactive for 95094.175400, current state activating, > last acting [9,5] > pg 10.915 is stuck inactive for 94929.184177, current state activating, > last acting [30,26] > pg 11.907 is stuck stale for 9612.906582, current state > stale+active+clean, last acting [38,24] > pg 11.910 is stuck stale for 11822.359237, current state stale+down, last > acting [21] > pg 11.915 is stuck stale for 9612.906604, current state > stale+active+clean, last acting [38,31] > pg 11.919 is stuck inactive for 95636.716568, current state activating, > last acting [25,12] > pg 12.902 is stuck stale for 10810.497213, current state > stale+activating, last acting [36,14] > pg 13.901 is stuck stale for 94889.512234, current state > stale+active+clean, last acting [1,31] > pg 13.904 is stuck stale for 10745.279158, current state > stale+active+clean, last acting [37,8] > pg 13.908 is stuck stale for 10745.279176, current state > stale+active+clean, last acting [37,19] > pg 13.909 is stuck inactive for 95370.129659, current state activating, > last acting [34,19] > pg 13.90e is stuck inactive for 95370.379694, current state activating, > last acting [21,20] > pg 13.911 is stuck inactive for 98449.317873, current state activating, > last acting [25,22] > pg 13.914 is stuck stale for 11827.503651, current state sta
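A sketch of lifting the overdose limits far enough for peering to proceed; the values are examples and should comfortably exceed the ~3000 PGs per OSD reported above:
ceph tell mon.* injectargs '--mon_max_pg_per_osd 4000'
ceph tell osd.* injectargs '--mon_max_pg_per_osd 4000'
Persist the same values in ceph.conf so restarts keep them; the sustainable fix is recreating the oversized pools with sane pg counts once the cluster is stable again.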
Re: [ceph-users] Help Ceph Cluster Down
I don't think this will help you. Unfound means the cluster is unable to find the data anywhere (it's lost). It would be sufficient to shut down the new host - the OSDs will then be out. You can also force-heal the cluster, something like "do your best possible" (see the sketch after this message): ceph pg 2.5 mark_unfound_lost revert|delete Src: http://docs.ceph.com/docs/mimic/rados/troubleshooting/troubleshooting-pg/ Kevin On Fri, Jan 4, 2019 at 20:47 Arun POONIA wrote: > > Hi Kevin, > > Can I remove the newly added server from the Ceph cluster and see if it heals the cluster? > > When I check hard disk IOPS on the new server, they are very low compared to > the existing cluster servers. > > Indeed this is a critical cluster but I don't have the expertise to make it > flawless. > > Thanks > Arun > > On Fri, Jan 4, 2019 at 11:35 AM Kevin Olbrich wrote: >> >> If you really created and destroyed OSDs before the cluster healed >> itself, this data will be permanently lost (not found / inactive). >> Also your PG count is so much oversized, the calculation for peering >> will most likely break because this was never tested. >> >> If this is a critical cluster, I would start a new one and bring back >> the backups (using a better PG count). >> >> Kevin >> >> Am Fr., 4. Jan. 2019 um 20:25 Uhr schrieb Arun POONIA >> : >> > >> > Can anyone comment on this issue please, I can't seem to bring my cluster >> > healthy. >> > >> > On Fri, Jan 4, 2019 at 6:26 AM Arun POONIA >> > wrote: >> >> >> >> Hi Caspar, >> >> >> >> Number of IOPs are also quite low. It used to be around 1K plus on one of >> >> the pools (VMs), now it's close to 10-30. >> >> >> >> Thanks >> >> Arun >> >> >> >> On Fri, Jan 4, 2019 at 5:41 AM Arun POONIA >> >> wrote: >> >>> >> >>> Hi Caspar, >> >>> >> >>> Yes and no, numbers are going up and down. If I run the ceph -s command I >> >>> can see it decrease one time and later it increases again. I see there >> >>> are so many blocked/slow requests. Almost all the OSDs have slow >> >>> requests. Around 12% of PGs are inactive, not sure how to activate them >> >>> again. 
>> >>> >> >>> >> >>> [root@fre101 ~]# ceph health detail >> >>> 2019-01-04 05:39:23.860142 7fc37a3a0700 -1 asok(0x7fc3740017a0) >> >>> AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed >> >>> to bind the UNIX domain socket to >> >>> '/var/run/ceph-guests/ceph-client.admin.1066526.140477441513808.asok': >> >>> (2) No such file or directory >> >>> HEALTH_ERR 1 osds down; 3 pools have many more objects per pg than >> >>> average; 472812/12392654 objects misplaced (3.815%); 3610 PGs pending on >> >>> creation; Reduced data availability: 6578 pgs inactive, 1882 pgs down, >> >>> 86 pgs peering, 850 pgs stale; Degraded data redundancy: 216694/12392654 >> >>> objects degraded (1.749%), 866 pgs degraded, 16 pgs undersized; 116082 >> >>> slow requests are blocked > 32 sec; 551 stuck requests are blocked > >> >>> 4096 sec; too many PGs per OSD (2709 > max 200) >> >>> OSD_DOWN 1 osds down >> >>> osd.28 (root=default,host=fre119) is down >> >>> MANY_OBJECTS_PER_PG 3 pools have many more objects per pg than average >> >>> pool glance-images objects per pg (10478) is more than 92.7257 times >> >>> cluster average (113) >> >>> pool vms objects per pg (4717) is more than 41.7434 times cluster >> >>> average (113) >> >>> pool volumes objects per pg (1220) is more than 10.7965 times >> >>> cluster average (113) >> >>> OBJECT_MISPLACED 472812/12392654 objects misplaced (3.815%) >> >>> PENDING_CREATING_PGS 3610 PGs pending on creation >> >>> osds >> >>> [osd.0,osd.1,osd.10,osd.11,osd.14,osd.15,osd.17,osd.18,osd.19,osd.20,osd.21,osd.22,osd.23,osd.25,osd.26,osd.27,osd.28,osd.3,osd.30,osd.32,osd.33,osd.35,osd.36,osd.37,osd.38,osd.4,osd.5,osd.6,osd.7,osd.9] >> >>> have pending PGs. >> >>> PG_AVAILABILITY Reduced data availability: 6578 pgs inactive, 1882 pgs >> >>> down, 86 pgs peering, 850 pgs stale >> >>> pg 10.900 is down, acting [18] >> >>> pg 10.90e is stuck inactive for 60266.030164, current state >> >&g
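Following up on the mark_unfound_lost command above: it may help to see what is actually unfound before reverting or deleting (the PG ID is an example):

ceph health detail | grep unfound          # which PGs report unfound objects
ceph pg 2.5 list_unfound                   # list them ("list_missing" on older releases)
ceph pg 2.5 mark_unfound_lost revert       # roll back to a previous version where possible
ceph pg 2.5 mark_unfound_lost delete       # or forget the objects entirely

"revert" only works for objects that have an older version to fall back to; everything else needs "delete" (see the troubleshooting-pg link above).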
Re: [ceph-users] Help Ceph Cluster Down
If you really created and destroyed OSDs before the cluster healed itself, this data will be permanently lost (not found / inactive). Also your PG count is so much oversized that the calculation for peering will most likely break, because this was never tested. If this is a critical cluster, I would start a new one and bring back the backups (using a better PG count). Kevin On Fri, Jan 4, 2019 at 20:25 Arun POONIA wrote: > > Can anyone comment on this issue please, I can't seem to bring my cluster > healthy. > > On Fri, Jan 4, 2019 at 6:26 AM Arun POONIA > wrote: >> >> Hi Caspar, >> >> Number of IOPs are also quite low. It used to be around 1K plus on one of the pools >> (VMs), now it's close to 10-30. >> >> Thanks >> Arun >> >> On Fri, Jan 4, 2019 at 5:41 AM Arun POONIA >> wrote: >>> >>> Hi Caspar, >>> >>> Yes and no, numbers are going up and down. If I run the ceph -s command I can >>> see it decrease one time and later it increases again. I see there are so >>> many blocked/slow requests. Almost all the OSDs have slow requests. Around >>> 12% of PGs are inactive, not sure how to activate them again. >>> >>> >>> [root@fre101 ~]# ceph health detail >>> 2019-01-04 05:39:23.860142 7fc37a3a0700 -1 asok(0x7fc3740017a0) >>> AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed >>> to bind the UNIX domain socket to >>> '/var/run/ceph-guests/ceph-client.admin.1066526.140477441513808.asok': >>> (2) No such file or directory >>> HEALTH_ERR 1 osds down; 3 pools have many more objects per pg than >>> average; 472812/12392654 objects misplaced (3.815%); 3610 PGs pending on >>> creation; Reduced data availability: 6578 pgs inactive, 1882 pgs down, >>> 86 pgs peering, 850 pgs stale; Degraded data redundancy: 216694/12392654 >>> objects degraded (1.749%), 866 pgs degraded, 16 pgs undersized; 116082 >>> slow requests are blocked > 32 sec; 551 stuck requests are blocked > >>> 4096 sec; too many PGs per OSD (2709 > max 200) >>> OSD_DOWN 1 osds down >>> osd.28 (root=default,host=fre119) is down >>> MANY_OBJECTS_PER_PG 3 pools have many more objects per pg than average >>> pool glance-images objects per pg (10478) is more than 92.7257 times >>> cluster average (113) >>> pool vms objects per pg (4717) is more than 41.7434 times cluster >>> average (113) >>> pool volumes objects per pg (1220) is more than 10.7965 times >>> cluster average (113) >>> OBJECT_MISPLACED 472812/12392654 objects misplaced (3.815%) >>> PENDING_CREATING_PGS 3610 PGs pending on creation >>> osds >>> [osd.0,osd.1,osd.10,osd.11,osd.14,osd.15,osd.17,osd.18,osd.19,osd.20,osd.21,osd.22,osd.23,osd.25,osd.26,osd.27,osd.28,osd.3,osd.30,osd.32,osd.33,osd.35,osd.36,osd.37,osd.38,osd.4,osd.5,osd.6,osd.7,osd.9] >>> have pending PGs. 
>>> PG_AVAILABILITY Reduced data availability: 6578 pgs inactive, 1882 pgs >>> down, 86 pgs peering, 850 pgs stale >>> pg 10.900 is down, acting [18] >>> pg 10.90e is stuck inactive for 60266.030164, current state activating, >>> last acting [2,38] >>> pg 10.913 is stuck stale for 1887.552862, current state stale+down, >>> last acting [9] >>> pg 10.915 is stuck inactive for 60266.215231, current state activating, >>> last acting [30,38] >>> pg 11.903 is stuck inactive for 59294.465961, current state activating, >>> last acting [11,38] >>> pg 11.910 is down, acting [21] >>> pg 11.919 is down, acting [25] >>> pg 12.902 is stuck inactive for 57118.544590, current state activating, >>> last acting [36,14] >>> pg 13.8f8 is stuck inactive for 60707.167787, current state activating, >>> last acting [29,37] >>> pg 13.901 is stuck stale for 60226.543289, current state >>> stale+active+clean, last acting [1,31] >>> pg 13.905 is stuck inactive for 60266.050940, current state activating, >>> last acting [2,36] >>> pg 13.909 is stuck inactive for 60707.160714, current state activating, >>> last acting [34,36] >>> pg 13.90e is stuck inactive for 60707.410749, current state activating, >>> last acting [21,36] >>> pg 13.911 is down, acting [25] >>> pg 13.914 is stale+down, acting [29] >>> pg 13.917 is stuck stale for 580.224688, current state stale+down, last >>> acting [16] >>> pg 14.901 is stuck inactive for 60266.037762, current state >
Re: [ceph-users] Usage of devices in SSD pool vary very much
PS: Could be http://tracker.ceph.com/issues/36361 There is one HDD OSD that is out (which will not be replaced because the SSD pool will get the images and the hdd pool will be deleted). Kevin On Fri, Jan 4, 2019 at 19:46 Kevin Olbrich wrote: > > Hi! > > I did what you wrote but my MGRs started to crash again: > root@adminnode:~# ceph -s > cluster: > id: 086d9f80-6249-4594-92d0-e31b6a9c > health: HEALTH_WARN > no active mgr > 105498/6277782 objects misplaced (1.680%) > > services: > mon: 3 daemons, quorum mon01,mon02,mon03 > mgr: no daemons active > osd: 44 osds: 43 up, 43 in > > data: > pools: 4 pools, 1616 pgs > objects: 1.88M objects, 7.07TiB > usage: 13.2TiB used, 16.7TiB / 29.9TiB avail > pgs: 105498/6277782 objects misplaced (1.680%) > 1606 active+clean > 8 active+remapped+backfill_wait > 2 active+remapped+backfilling > > io: > client: 5.51MiB/s rd, 3.38MiB/s wr, 33 op/s rd, 317 op/s wr > recovery: 60.3MiB/s, 15 objects/s > > > MON 1 log: >-13> 2019-01-04 14:05:04.432186 7fec56a93700 4 mgr ms_dispatch > active mgrdigest v1 >-12> 2019-01-04 14:05:04.432194 7fec56a93700 4 mgr ms_dispatch mgrdigest > v1 >-11> 2019-01-04 14:05:04.822041 7fec434e1700 4 mgr[balancer] > Optimize plan auto_2019-01-04_14:05:04 >-10> 2019-01-04 14:05:04.822170 7fec434e1700 4 mgr get_config > get_config key: mgr/balancer/mode > -9> 2019-01-04 14:05:04.822231 7fec434e1700 4 mgr get_config > get_config key: mgr/balancer/max_misplaced > -8> 2019-01-04 14:05:04.822268 7fec434e1700 4 ceph_config_get > max_misplaced not found > -7> 2019-01-04 14:05:04.822444 7fec434e1700 4 mgr[balancer] Mode > upmap, max misplaced 0.05 > -6> 2019-01-04 14:05:04.822849 7fec434e1700 4 mgr[balancer] do_upmap > -5> 2019-01-04 14:05:04.822923 7fec434e1700 4 mgr get_config > get_config key: mgr/balancer/upmap_max_iterations > -4> 2019-01-04 14:05:04.822964 7fec434e1700 4 ceph_config_get > upmap_max_iterations not found > -3> 2019-01-04 14:05:04.823013 7fec434e1700 4 mgr get_config > get_config key: mgr/balancer/upmap_max_deviation > -2> 2019-01-04 14:05:04.823048 7fec434e1700 4 ceph_config_get > upmap_max_deviation not found > -1> 2019-01-04 14:05:04.823265 7fec434e1700 4 mgr[balancer] pools > ['rbd_vms_hdd', 'rbd_vms_ssd', 'rbd_vms_ssd_01', 'rbd_vms_ssd_01_ec'] > 0> 2019-01-04 14:05:04.836124 7fec434e1700 -1 > /build/ceph-12.2.8/src/osd/OSDMap.cc: In function 'int > OSDMap::calc_pg_upmaps(CephContext*, float, int, const std::set<long int>&, OSDMap::Incremental*)' thread 7fec434e1700 time 2019-01-04 > 14:05:04.832885 > /build/ceph-12.2.8/src/osd/OSDMap.cc: 4102: FAILED assert(target > 0) > > ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) > luminous (stable) > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x102) [0x558c3c0bb572] > 2: (OSDMap::calc_pg_upmaps(CephContext*, float, int, std::set<long, > std::less<long>, std::allocator<long> > const&, > OSDMap::Incremental*)+0x2801) [0x558c3c1c0ee1] > 3: (()+0x2f3020) [0x558c3bf5d020] > 4: (PyEval_EvalFrameEx()+0x8a51) [0x7fec5e832971] > 5: (PyEval_EvalCodeEx()+0x85c) [0x7fec5e96805c] > 6: (PyEval_EvalFrameEx()+0x6ffd) [0x7fec5e830f1d] > 7: (PyEval_EvalFrameEx()+0x7124) [0x7fec5e831044] > 8: (PyEval_EvalFrameEx()+0x7124) [0x7fec5e831044] > 9: (PyEval_EvalCodeEx()+0x85c) [0x7fec5e96805c] > 10: (()+0x13e370) [0x7fec5e8be370] > 11: (PyObject_Call()+0x43) [0x7fec5e891273] > 12: (()+0x1853ac) [0x7fec5e9053ac] > 13: (PyObject_Call()+0x43) [0x7fec5e891273] > 14: (PyObject_CallMethod()+0xf4) [0x7fec5e892444] > 15: (PyModuleRunner::serve()+0x5c) [0x558c3bf5a18c] > 16: 
(PyModuleRunner::PyModuleRunnerThread::entry()+0x1b8) [0x558c3bf5a998] > 17: (()+0x76ba) [0x7fec5d74c6ba] > 18: (clone()+0x6d) [0x7fec5c7b841d] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is > needed to interpret this. > > --- logging levels --- >0/ 5 none >0/ 1 lockdep >0/ 1 context >1/ 1 crush >1/ 5 mds >1/ 5 mds_balancer >1/ 5 mds_locker >1/ 5 mds_log >1/ 5 mds_log_expire >1/ 5 mds_migrator >0/ 1 buffer >0/ 1 timer >0/ 1 filer >0/ 1 striper >0/ 1 objecter >0/ 5 rados >0/ 5 rbd >0/ 5 rbd_mirror >0/ 5 rbd_replay >0/ 5 journaler >0/ 5 objectcacher >0/ 5 client >1/ 5 osd >0/ 5 optracker >0/ 5 objclass >1/ 3 filestore >1/ 3 journal >0/ 5 ms >1/ 5 mon >0/10 monc >
Re: [ceph-users] Usage of devices in SSD pool vary very much
3c07a5b4] 2: (()+0x11390) [0x7fec5d756390] 3: (gsignal()+0x38) [0x7fec5c6e6428] 4: (abort()+0x16a) [0x7fec5c6e802a] 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x558c3c0bb6fe] 6: (OSDMap::calc_pg_upmaps(CephContext*, float, int, std::set<long, std::less<long>, std::allocator<long> > const&, OSDMap::Incremental*)+0x2801) [0x558c3c1c0ee1] 7: (()+0x2f3020) [0x558c3bf5d020] 8: (PyEval_EvalFrameEx()+0x8a51) [0x7fec5e832971] 9: (PyEval_EvalCodeEx()+0x85c) [0x7fec5e96805c] 10: (PyEval_EvalFrameEx()+0x6ffd) [0x7fec5e830f1d] 11: (PyEval_EvalFrameEx()+0x7124) [0x7fec5e831044] 12: (PyEval_EvalFrameEx()+0x7124) [0x7fec5e831044] 13: (PyEval_EvalCodeEx()+0x85c) [0x7fec5e96805c] 14: (()+0x13e370) [0x7fec5e8be370] 15: (PyObject_Call()+0x43) [0x7fec5e891273] 16: (()+0x1853ac) [0x7fec5e9053ac] 17: (PyObject_Call()+0x43) [0x7fec5e891273] 18: (PyObject_CallMethod()+0xf4) [0x7fec5e892444] 19: (PyModuleRunner::serve()+0x5c) [0x558c3bf5a18c] 20: (PyModuleRunner::PyModuleRunnerThread::entry()+0x1b8) [0x558c3bf5a998] 21: (()+0x76ba) [0x7fec5d74c6ba] 22: (clone()+0x6d) [0x7fec5c7b841d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. --- begin dump of recent events --- 0> 2019-01-04 14:05:05.032479 7fec434e1700 -1 *** Caught signal (Aborted) ** in thread 7fec434e1700 thread_name:balancer ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable) 1: (()+0x4105b4) [0x558c3c07a5b4] 2: (()+0x11390) [0x7fec5d756390] 3: (gsignal()+0x38) [0x7fec5c6e6428] 4: (abort()+0x16a) [0x7fec5c6e802a] 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x558c3c0bb6fe] 6: (OSDMap::calc_pg_upmaps(CephContext*, float, int, std::set<long, std::less<long>, std::allocator<long> > const&, OSDMap::Incremental*)+0x2801) [0x558c3c1c0ee1] 7: (()+0x2f3020) [0x558c3bf5d020] 8: (PyEval_EvalFrameEx()+0x8a51) [0x7fec5e832971] 9: (PyEval_EvalCodeEx()+0x85c) [0x7fec5e96805c] 10: (PyEval_EvalFrameEx()+0x6ffd) [0x7fec5e830f1d] 11: (PyEval_EvalFrameEx()+0x7124) [0x7fec5e831044] 12: (PyEval_EvalFrameEx()+0x7124) [0x7fec5e831044] 13: (PyEval_EvalCodeEx()+0x85c) [0x7fec5e96805c] 14: (()+0x13e370) [0x7fec5e8be370] 15: (PyObject_Call()+0x43) [0x7fec5e891273] 16: (()+0x1853ac) [0x7fec5e9053ac] 17: (PyObject_Call()+0x43) [0x7fec5e891273] 18: (PyObject_CallMethod()+0xf4) [0x7fec5e892444] 19: (PyModuleRunner::serve()+0x5c) [0x558c3bf5a18c] 20: (PyModuleRunner::PyModuleRunnerThread::entry()+0x1b8) [0x558c3bf5a998] 21: (()+0x76ba) [0x7fec5d74c6ba] 22: (clone()+0x6d) [0x7fec5c7b841d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. --- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 rbd_mirror 0/ 5 rbd_replay 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 1/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 journal 0/ 5 ms 1/ 5 mon 0/10 monc 1/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 1 reserver 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/10 civetweb 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle 0/ 0 refs 1/ 5 xio 1/ 5 compressor 1/ 5 bluestore 1/ 5 bluefs 1/ 3 bdev 1/ 5 kstore 4/ 5 rocksdb 4/ 5 leveldb 4/ 5 memdb 1/ 5 kinetic 1/ 5 fuse 1/ 5 mgr 1/ 5 mgrc 1/ 5 dpdk 1/ 5 eventtrace -2/-2 (syslog threshold) -1/-1 (stderr threshold) max_recent 1 max_new 1000 log_file /var/log/ceph/ceph-mgr.mon01.ceph01.srvfarm.net.log --- end dump of recent events --- Kevin On Wed, Jan 2, 
2019 at 17:35 Konstantin Shalygin wrote: > > On a medium sized cluster with device-classes, I am experiencing a > problem with the SSD pool: > > root@adminnode:~# ceph osd df | grep ssd
> ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
> 2 ssd 0.43700 1.0 447GiB 254GiB 193GiB 56.77 1.28 50
> 3 ssd 0.43700 1.0 447GiB 208GiB 240GiB 46.41 1.04 58
> 4 ssd 0.43700 1.0 447GiB 266GiB 181GiB 59.44 1.34 55
> 30 ssd 0.43660 1.0 447GiB 222GiB 225GiB 49.68 1.12 49
> 6 ssd 0.43700 1.0 447GiB 238GiB 209GiB 53.28 1.20 59
> 7 ssd 0.43700 1.0 447GiB 228GiB 220GiB 50.88 1.14 56
> 8 ssd 0.43700 1.0 447GiB 269GiB 178GiB 60.16 1.35 57
> 31 ssd 0.43660 1.0 447GiB 231GiB 217GiB 51.58 1.16 56
> 34 ssd 0.43660 1.0 447GiB 186GiB 261GiB 41.65 0.94 49
> 36 ssd 0.87329 1.0 894GiB 364GiB 530GiB 40.68 0.92 91
> 37 ssd 0.87329 1.0 894GiB 321GiB 573GiB 35.95 0.81 78
> 42 ssd 0.87329 1.0 894GiB 375GiB 519GiB 41.91 0.94 92
> 43 ssd 0.87329 1.0 89
[ceph-users] TCP qdisc + congestion control / BBR
Hi! I wonder if changing the qdisc and congestion control (for example fq with Google BBR) on Ceph servers / clients has positive effects during high load. Google BBR: https://cloud.google.com/blog/products/gcp/tcp-bbr-congestion-control-comes-to-gcp-your-internet-just-got-faster I am running a lot of VMs with BBR but the hypervisors run fq_codel + cubic (the OSDs run Ubuntu defaults). Has anyone tested qdisc and congestion control settings? Kevin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
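For anyone wanting to try this: switching a Linux host to fq + BBR is only a module load and two sysctls (kernel 4.9+; these are the standard knobs, not values tested on Ceph nodes):

modprobe tcp_bbr
sysctl -w net.core.default_qdisc=fq
sysctl -w net.ipv4.tcp_congestion_control=bbr
sysctl net.ipv4.tcp_congestion_control    # verify

Persist the settings via /etc/sysctl.d/ if the results look good; existing TCP connections keep their old congestion control until they are re-established.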
[ceph-users] Usage of devices in SSD pool vary very much
Hi! On a medium sized cluster with device-classes, I am experiencing a problem with the SSD pool: root@adminnode:~# ceph osd df | grep ssd
ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
2 ssd 0.43700 1.0 447GiB 254GiB 193GiB 56.77 1.28 50
3 ssd 0.43700 1.0 447GiB 208GiB 240GiB 46.41 1.04 58
4 ssd 0.43700 1.0 447GiB 266GiB 181GiB 59.44 1.34 55
30 ssd 0.43660 1.0 447GiB 222GiB 225GiB 49.68 1.12 49
6 ssd 0.43700 1.0 447GiB 238GiB 209GiB 53.28 1.20 59
7 ssd 0.43700 1.0 447GiB 228GiB 220GiB 50.88 1.14 56
8 ssd 0.43700 1.0 447GiB 269GiB 178GiB 60.16 1.35 57
31 ssd 0.43660 1.0 447GiB 231GiB 217GiB 51.58 1.16 56
34 ssd 0.43660 1.0 447GiB 186GiB 261GiB 41.65 0.94 49
36 ssd 0.87329 1.0 894GiB 364GiB 530GiB 40.68 0.92 91
37 ssd 0.87329 1.0 894GiB 321GiB 573GiB 35.95 0.81 78
42 ssd 0.87329 1.0 894GiB 375GiB 519GiB 41.91 0.94 92
43 ssd 0.87329 1.0 894GiB 438GiB 456GiB 49.00 1.10 92
13 ssd 0.43700 1.0 447GiB 249GiB 198GiB 55.78 1.25 72
14 ssd 0.43700 1.0 447GiB 290GiB 158GiB 64.76 1.46 71
15 ssd 0.43700 1.0 447GiB 368GiB 78.6GiB 82.41 1.85 78 <
16 ssd 0.43700 1.0 447GiB 253GiB 194GiB 56.66 1.27 70
19 ssd 0.43700 1.0 447GiB 269GiB 178GiB 60.21 1.35 70
20 ssd 0.43700 1.0 447GiB 312GiB 135GiB 69.81 1.57 77
21 ssd 0.43700 1.0 447GiB 312GiB 135GiB 69.77 1.57 77
22 ssd 0.43700 1.0 447GiB 269GiB 178GiB 60.10 1.35 67
38 ssd 0.43660 1.0 447GiB 153GiB 295GiB 34.11 0.77 46
39 ssd 0.43660 1.0 447GiB 127GiB 320GiB 28.37 0.64 38
40 ssd 0.87329 1.0 894GiB 386GiB 508GiB 43.17 0.97 97
41 ssd 0.87329 1.0 894GiB 375GiB 520GiB 41.88 0.94 113
This leads to just 1.2TB free space (some GB away from a NEAR_FULL pool). Currently, the balancer plugin is off because it immediately crashed the MGR in the past (on 12.2.5). Since then I upgraded to 12.2.8 but did not re-enable the balancer. [I am unable to find the bugtracker ID] Would the balancer plugin correct this situation? What happens if all MGRs die like they did on 12.2.5 because of the plugin? Will the balancer take data from the most-unbalanced OSDs first? Otherwise an OSD may fill up beyond FULL, which would cause the whole pool to freeze (because the smallest OSD is taken into account for the free space calculation). This would be the worst case, as over 100 VMs would freeze, causing a lot of trouble. This is also the reason I did not try to enable the balancer again. Kind regards Kevin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
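One way to test the balancer without handing control to the automatic mode is the supervised plan workflow (commands from the luminous balancer module; "myplan" is an arbitrary name):

ceph balancer mode upmap        # upmap needs: ceph osd set-require-min-compat-client luminous
ceph balancer eval              # score of the current distribution (lower is better)
ceph balancer optimize myplan
ceph balancer eval myplan       # score the plan before touching anything
ceph balancer show myplan       # the individual upmap items it would apply
ceph balancer execute myplan    # only this step moves data

No data moves until "execute", so a questionable plan can simply be dropped again with "ceph balancer rm myplan".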
Re: [ceph-users] KVM+Ceph: Live migration of I/O-heavy VM
> > Assuming everything is on LVM including the root filesystem, only moving > > the boot partition will have to be done outside of LVM. > > Since the OP mentioned MS Exchange, I assume the VM is running windows. > You can do the same LVM-like trick in Windows Server via Disk Manager > though; add the new ceph RBD disk to the existing data volume as a > mirror; wait for it to sync, then break the mirror and remove the > original disk. Mirrors only work on dynamic disks, which are a pain to revert and cause lots of problems with backup solutions. I will keep this in mind, as it is still better than shutting down the whole VM. @all Thank you very much for your inputs. I will try some less important VMs first and then start the migration of the big one. Kind regards Kevin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] KVM+Ceph: Live migration of I/O-heavy VM
Hi! Currently I plan a migration of a large VM (MS Exchange, 300 mailboxes and 900GB DB) from qcow2 on ext4 (RAID1) to an all-flash Ceph luminous cluster (which already holds lots of images). The server has access to both local and cluster storage; I only need to live-migrate the storage, not the machine. I have never used live migration, as it can cause more issues; the VMs that were already migrated had planned downtime. Taking the VM offline and converting/importing with qemu-img would take some hours, but I would like to still serve clients, even if it is slower. The VM is I/O-heavy in terms of the old storage (LSI/Adaptec with BBU). There are two HDDs bound as RAID1 which are constantly under 30% - 60% load (this goes up to 100% during reboots, updates or login prime-time). What happens when either the local compute node or the ceph cluster fails (degraded)? Or the network is unavailable? Are all writes performed to both locations? Is this fail-safe? Or does the VM crash in the worst case, which can lead to a dirty shutdown for MS-EX DBs? The node currently has 4GB free RAM and 29GB listed as cache / available. These numbers need caution because we have "tuned" enabled, which causes de-duplication on RAM, and this host runs about 10 Windows VMs. During reboots or updates, RAM can get full again. Maybe I am too cautious about live storage migration, maybe I am not. What are your experiences or advice? Thank you very much! Kind regards Kevin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
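For the storage-only move, libvirt's blockcopy is the usual mechanism; a rough sketch, where the domain name, disk target and destination XML are assumptions rather than details from this thread:

# rbd-dest.xml describes the target disk, e.g.:
# <disk type='network' device='disk'>
#   <driver name='qemu'/>
#   <source protocol='rbd' name='rbd_vms_ssd_01/exchange'>
#     <host name='mon01' port='6789'/>
#   </source>
#   <auth username='libvirt'>
#     <secret type='ceph' uuid='...'/>
#   </auth>
# </disk>
virsh blockcopy exchange vda --xml rbd-dest.xml --wait --verbose --pivot

The guest keeps running during the copy; once the mirror is in sync, writes go to both sides, and --pivot switches the domain over to the RBD image. Aborting before the pivot leaves the VM untouched on the old qcow2, which partly answers the "what if something fails mid-way" question.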
Re: [ceph-users] Packages for debian in Ceph repo
I now had the time to test it and after installing this package, uploads to rbd are working perfectly. Thank you very much for sharing this! Kevin On Wed, Nov 7, 2018 at 15:36 Kevin Olbrich wrote: > Am Mi., 7. Nov. 2018 um 07:40 Uhr schrieb Nicolas Huillard < > nhuill...@dolomede.fr>: > >> >> > It lists rbd but still fails with the exact same error. >> >> I stumbled upon the exact same error, and since there was no answer >> anywhere, I figured it was a very simple problem: don't forget to >> install the qemu-block-extra package (Debian stretch) along with >> qemu-utils which contains the qemu-img command. >> This command is actually compiled with rbd support (hence the output >> above), but needs this extra package to pull actual support-code and >> dependencies... >> > > I have not been able to test this yet but this package was indeed missing > on my system! > Thank you for this hint! > > >> -- >> Nicolas Huillard >> > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Disabling write cache on SATA HDDs reduces write latency 7 times
I read the whole thread and it looks like the write cache should always be disabled, as in the worst case the performance is the same(?). This is based on this discussion. I will test some WD4002FYYZ which don't mention "media cache". Kevin On Tue, Nov 13, 2018 at 09:27 Виталий Филиппов < vita...@yourcmc.ru> wrote: > This may be the explanation: > > https://serverfault.com/questions/857271/better-performance-when-hdd-write-cache-is-disabled-hgst-ultrastar-7k6000-and > > Other manufacturers may have started to do the same, I suppose. > -- > With best regards, > Vitaliy Filippov ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
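For testing this per drive, the volatile write cache can be toggled and verified like so (sdX is a placeholder; hdparm settings are not persistent across power cycles, so production use needs a udev rule or rc script):

hdparm -W 0 /dev/sdX             # SATA: disable the volatile write cache
smartctl -g wcache /dev/sdX      # verify: "Write cache is: Disabled"
sdparm --clear WCE /dev/sdX      # SAS equivalent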
Re: [ceph-users] Ceph or Gluster for implementing big NAS
Hi Dan, ZFS without sync would be very much identical to ext2/ext4 without journals or XFS with barriers disabled. The ARC cache in ZFS is awesome, but disabling sync on ZFS is a very high risk (using ext4 with the KVM cache mode "unsafe" would be similar, I think). Also, ZFS only works as expected with the scheduler set to noop, as it is optimized to consume whole, non-shared devices. Just my 2 cents ;-) Kevin On Mon, Nov 12, 2018 at 15:08 Dan van der Ster < d...@vanderster.com> wrote: > We've done ZFS on RBD in a VM, exported via NFS, for a couple years. > It's very stable and if your use-case permits you can set zfs > sync=disabled to get very fast write performance that's tough to beat. > > But if you're building something new today and have *only* the NAS > use-case then it would make better sense to try CephFS first and see > if it works for you. > > -- Dan > > On Mon, Nov 12, 2018 at 3:01 PM Kevin Olbrich wrote: > > > > Hi! > > > > ZFS won't play nice on ceph. Best would be to mount CephFS directly with > the ceph-fuse driver on the endpoint. > > If you definitely want to put a storage gateway between the data and the > compute nodes, then go with nfs-ganesha which can export CephFS directly > without local ("proxy") mount. > > > > I had such a setup with nfs and switched to mount CephFS directly. If > using NFS with the same data, you must make sure your HA works well to > avoid data corruption. > > With ceph-fuse you directly connect to the cluster, one component less > that breaks. > > > > Kevin > > > > Am Mo., 12. Nov. 2018 um 12:44 Uhr schrieb Premysl Kouril < > premysl.kou...@gmail.com>: > >> > >> Hi, > >> > >> > >> We are planning to build NAS solution which will be primarily used via > NFS and CIFS and workloads ranging from various archival application to > more "real-time processing". The NAS will not be used as a block storage > for virtual machines, so the access really will always be file oriented. > >> > >> > >> We are considering primarily two designs and I'd like to kindly ask for > any thoughts, views, insights, experiences. > >> > >> > >> Both designs utilize "distributed storage software at some level". Both > designs would be built from commodity servers and should scale as we grow. > Both designs involve virtualization for instantiating "access virtual > machines" which will be serving the NFS and CIFS protocol - so in this > sense the access layer is decoupled from the data layer itself. > >> > >> > >> First design is based on a distributed filesystem like Gluster or > CephFS. We would deploy this software on those commodity servers and mount > the resultant filesystem on the "access virtual machines" and they would be > serving the mounted filesystem via NFS/CIFS. > >> > >> > >> Second design is based on distributed block storage using CEPH. So we > would build distributed block storage on those commodity servers, and then, > via virtualization (like OpenStack Cinder) we would allocate the block > storage into the access VM. Inside the access VM we would deploy ZFS which > would aggregate block storage into a single filesystem. And this filesystem > would be served via NFS/CIFS from the very same VM. 
> >> > >> > >> Any advices and insights highly appreciated > >> > >> > >> Cheers, > >> > >> Prema > >> > >> ___ > >> ceph-users mailing list > >> ceph-users@lists.ceph.com > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph or Gluster for implementing big NAS
Hi! ZFS won't play nice on Ceph. Best would be to mount CephFS directly with the ceph-fuse driver on the endpoint. If you definitely want to put a storage gateway between the data and the compute nodes, then go with nfs-ganesha, which can export CephFS directly without a local ("proxy") mount. I had such a setup with NFS and switched to mounting CephFS directly. If using NFS with the same data, you must make sure your HA works well to avoid data corruption. With ceph-fuse you connect directly to the cluster, one component less that breaks. Kevin On Mon, Nov 12, 2018 at 12:44 Premysl Kouril < premysl.kou...@gmail.com> wrote: > Hi, > > We are planning to build NAS solution which will be primarily used via NFS > and CIFS and workloads ranging from various archival application to more > "real-time processing". The NAS will not be used as a block storage for > virtual machines, so the access really will always be file oriented. > > We are considering primarily two designs and I'd like to kindly ask for > any thoughts, views, insights, experiences. > > Both designs utilize "distributed storage software at some level". Both > designs would be built from commodity servers and should scale as we grow. > Both designs involve virtualization for instantiating "access virtual > machines" which will be serving the NFS and CIFS protocol - so in this > sense the access layer is decoupled from the data layer itself. > > First design is based on a distributed filesystem like Gluster or CephFS. > We would deploy this software on those commodity servers and mount the > resultant filesystem on the "access virtual machines" and they would be > serving the mounted filesystem via NFS/CIFS. > > Second design is based on distributed block storage using CEPH. So we > would build distributed block storage on those commodity servers, and then, > via virtualization (like OpenStack Cinder) we would allocate the block > storage into the access VM. Inside the access VM we would deploy ZFS which > would aggregate block storage into a single filesystem. And this filesystem > would be served via NFS/CIFS from the very same VM. > > > Any advices and insights highly appreciated > > > Cheers, > > Prema > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
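To make the nfs-ganesha route concrete, a minimal export using the CephFS FSAL might look like this (ganesha.conf syntax; export ID, pseudo path and cephx user are placeholders, not from this thread):

EXPORT {
    Export_Id = 1;
    Path = /;                # path inside CephFS
    Pseudo = /cephfs;        # what NFS clients mount
    Access_Type = RW;
    FSAL {
        Name = CEPH;
        User_Id = "ganesha"; # cephx user with caps for MDS and OSDs
    }
}

nfs-ganesha talks to the cluster through libcephfs, so the gateway needs no kernel or FUSE mount of CephFS at all.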
Re: [ceph-users] ceph 12.2.9 release
On Wed, Nov 7, 2018 at 16:40 Gregory Farnum wrote: > On Wed, Nov 7, 2018 at 5:58 AM Simon Ironside > wrote: > >> >> >> On 07/11/2018 10:59, Konstantin Shalygin wrote: >> >> I wonder if there is any release announcement for ceph 12.2.9 that I >> missed. >> >> I just found the new packages on download.ceph.com, is this an >> official >> >> release? >> > >> > This is because 12.2.9 have a several bugs. You should avoid to use >> this >> > release and wait for 12.2.10 >> >> Argh! What's it doing in the repos then?? I've just upgraded to it! >> What are the bugs? Is there a thread about them? > > > If you’ve already upgraded and have no issues then you won’t have any > trouble going forward — except perhaps on the next upgrade, if you do it > while the cluster is unhealthy. > > I agree that it’s annoying when these issues make it out. We’ve had > ongoing discussions to try and improve the release process so it’s less > drawn-out and to prevent these upgrade issues from making it through > testing, but nobody has resolved it yet. If anybody has experience working > with deb repositories and handling releases, the Ceph upstream could use > some help... ;) > -Greg > >> >> We solve this problem by hosting two repos: one for staging and QA, and one for production. Every release goes to staging first (for example directly after building an SCM tag). If QA passes, the staging repo is turned into the prod one. Using symlinks, it would be possible to switch back if problems occur (a sketch follows below). Example: https://incoming.debian.org/ Currently I would be unable to deploy new nodes if I used the official mirrors, as apt is unable to install older versions (which does work on yum/dnf). That's why we are implementing "mirror-sync" / rsync with a copy of the repo and the desired packages until such a solution is available. Kevin >> Simon >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
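A sketch of the staging/production flip described above (paths and layout are invented for illustration; the point is that each release keeps its own directory and consumers follow a symlink):

# publish a new build into a versioned directory
rsync -a build/ /srv/repo/ceph-12.2.9/
ln -sfn /srv/repo/ceph-12.2.9 /srv/repo/staging      # QA hosts point here
# after QA signs off:
ln -sfn /srv/repo/ceph-12.2.9 /srv/repo/production   # prod apt sources point here
# rollback = flip the symlink back to the previous versioned directory

Because old versions stay on disk in their own directories, nodes can still be deployed against a known-good release even after a newer one is published.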
Re: [ceph-users] Packages for debian in Ceph repo
On Wed, Nov 7, 2018 at 07:40 Nicolas Huillard < nhuill...@dolomede.fr> wrote: > > > It lists rbd but still fails with the exact same error. > > I stumbled upon the exact same error, and since there was no answer > anywhere, I figured it was a very simple problem: don't forget to > install the qemu-block-extra package (Debian stretch) along with > qemu-utils, which contains the qemu-img command. > This command is actually compiled with rbd support (hence the output > above), but needs this extra package to pull actual support-code and > dependencies... > I have not been able to test this yet but this package was indeed missing on my system! Thank you for this hint! > -- > Nicolas Huillard > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph-deploy osd creation failed with multipath and dmcrypt
I met the same problem. I had to create a GPT table on each disk, create the first partition spanning the full space and then feed these to ceph-volume (should be similar for ceph-deploy); see the sketch after this message. Also I am not sure whether you can combine fs-type btrfs with bluestore (AFAIK this option is for filestore). Kevin On Tue, Nov 6, 2018 at 14:41 Pavan, Krish < krish.pa...@nuance.com> wrote: > Trying to create OSDs with multipath and dmcrypt failed. Any suggestions, please? > > ceph-deploy --overwrite-conf osd create ceph-store1:/dev/mapper/mpathr > --bluestore --dmcrypt -- failed > > ceph-deploy --overwrite-conf osd create ceph-store1:/dev/mapper/mpathr > --bluestore -- worked > > > > the logs for the failure: > > [ceph-store12][WARNIN] command: Running command: /usr/sbin/restorecon -R > /var/lib/ceph/osd-lockbox/e15f1adc-feff-4890-a617-adc473e7331e/magic.68428.tmp > > [ceph-store12][WARNIN] command: Running command: /usr/bin/chown -R > ceph:ceph > /var/lib/ceph/osd-lockbox/e15f1adc-feff-4890-a617-adc473e7331e/magic.68428.tmp > > [ceph-store12][WARNIN] Traceback (most recent call last): > > [ceph-store12][WARNIN] File "/usr/sbin/ceph-disk", line 9, in <module> > > [ceph-store12][WARNIN] load_entry_point('ceph-disk==1.0.0', > 'console_scripts', 'ceph-disk')() > > [ceph-store12][WARNIN] File > "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 5736, in run > > [ceph-store12][WARNIN] main(sys.argv[1:]) > > [ceph-store12][WARNIN] File > "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 5687, in main > > [ceph-store12][WARNIN] args.func(args) > > [ceph-store12][WARNIN] File > "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 2108, in main > > [ceph-store12][WARNIN] Prepare.factory(args).prepare() > > [ceph-store12][WARNIN] File > "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 2097, in prepare > > [ceph-store12][WARNIN] self._prepare() > > [ceph-store12][WARNIN] File > "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 2171, in _prepare > > [ceph-store12][WARNIN] self.lockbox.prepare() > > [ceph-store12][WARNIN] File > "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 2861, in prepare > > [ceph-store12][WARNIN] self.populate() > > [ceph-store12][WARNIN] File > "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 2818, in populate > > [ceph-store12][WARNIN] get_partition_base(self.partition.get_dev()), > > [ceph-store12][WARNIN] File > "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 844, in > get_partition_base > > [ceph-store12][WARNIN] raise Error('not a partition', dev) > > [ceph-store12][WARNIN] ceph_disk.main.Error: Error: not a partition: > /dev/dm-215 > > [ceph-store12][ERROR ] RuntimeError: command returned non-zero exit > status: 1 > > [ceph_deploy.osd][ERROR ] Failed to execute command: /usr/sbin/ceph-disk > -v prepare --dmcrypt --dmcrypt-key-dir /etc/ceph/dmcrypt-keys --bluestore > --cluster ceph --fs-type btrfs -- /dev/mapper/mpathr > > [ceph_deploy][ERROR ] GenericError: Failed to create 1 OSDs > > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
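The GPT preparation mentioned above is only two parted calls per device (using /dev/mapper/mpathr from Krish's mail; -s makes parted non-interactive):

parted -s /dev/mapper/mpathr mklabel gpt
parted -s /dev/mapper/mpathr mkpart primary 0% 100%
# then hand the partition (not the whole device) to the tooling, e.g.:
ceph-volume lvm prepare --bluestore --dmcrypt --data /dev/mapper/mpathr1

Whether the partition appears as mpathr1 or mpathr-part1 depends on the multipath/udev configuration, so check /dev/mapper after partitioning.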
Re: [ceph-users] Packages for debian in Ceph repo
Hi! Proxmox has support for rbd as they ship additional packages as well as ceph via their own repo. I ran your command and got this: > qemu-img version 2.8.1(Debian 1:2.8+dfsg-6+deb9u4) > Copyright (c) 2003-2016 Fabrice Bellard and the QEMU Project developers > Supported formats: blkdebug blkreplay blkverify bochs cloop dmg file ftp > ftps gluster host_cdrom host_device http https iscsi iser luks nbd nfs > null-aio null-co parallels qcow qcow2 qed quorum raw rbd replication > sheepdog ssh vdi vhdx vmdk vpc vvfat It lists rbd but still fails with the exact same error. Kevin Am Di., 30. Okt. 2018 um 17:14 Uhr schrieb David Turner < drakonst...@gmail.com>: > What version of qemu-img are you using? I found [1] this when poking > around on my qemu server when checking for rbd support. This version (note > it's proxmox) has rbd listed as a supported format. > > [1] > # qemu-img -V; qemu-img --help|grep rbd > qemu-img version 2.11.2pve-qemu-kvm_2.11.2-1 > Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers > Supported formats: blkdebug blkreplay blkverify bochs cloop dmg file ftp > ftps gluster host_cdrom host_device http https iscsi iser luks nbd null-aio > null-co parallels qcow qcow2 qed quorum raw rbd replication sheepdog > throttle vdi vhdx vmdk vpc vvfat zeroinit > On Tue, Oct 30, 2018 at 12:08 PM Kevin Olbrich wrote: > >> Is it possible to use qemu-img with rbd support on Debian Stretch? >> I am on Luminous and try to connect my image-buildserver to load images >> into a ceph pool. >> >> root@buildserver:~# qemu-img convert -p -O raw /target/test-vm.qcow2 >>> rbd:rbd_vms_ssd_01/test_vm >>> qemu-img: Unknown protocol 'rbd' >> >> >> Kevin >> >> Am Mo., 3. Sep. 2018 um 12:07 Uhr schrieb Abhishek Lekshmanan < >> abhis...@suse.com>: >> >>> arad...@tma-0.net writes: >>> >>> > Can anyone confirm if the Ceph repos for Debian/Ubuntu contain >>> packages for >>> > Debian? I'm not seeing any, but maybe I'm missing something... >>> > >>> > I'm seeing ceph-deploy install an older version of ceph on the nodes >>> (from the >>> > Debian repo) and then failing when I run "ceph-deploy osd ..." because >>> ceph- >>> > volume doesn't exist on the nodes. >>> > >>> The newer versions of Ceph (from mimic onwards) requires compiler >>> toolchains supporting c++17 which we unfortunately do not have for >>> stretch/jessie yet. >>> >>> - >>> Abhishek >>> ___ >>> ceph-users mailing list >>> ceph-users@lists.ceph.com >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>> >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Packages for debian in Ceph repo
Is it possible to use qemu-img with rbd support on Debian Stretch? I am on Luminous and am trying to connect my image buildserver to load images into a Ceph pool. root@buildserver:~# qemu-img convert -p -O raw /target/test-vm.qcow2 > rbd:rbd_vms_ssd_01/test_vm > qemu-img: Unknown protocol 'rbd' Kevin On Mon, Sep 3, 2018 at 12:07 Abhishek Lekshmanan < abhis...@suse.com> wrote: > arad...@tma-0.net writes: > > > Can anyone confirm if the Ceph repos for Debian/Ubuntu contain packages > for > > Debian? I'm not seeing any, but maybe I'm missing something... > > > > I'm seeing ceph-deploy install an older version of ceph on the nodes > (from the > > Debian repo) and then failing when I run "ceph-deploy osd ..." because > ceph- > > volume doesn't exist on the nodes. > > > The newer versions of Ceph (from mimic onwards) requires compiler > toolchains supporting c++17 which we unfortunately do not have for > stretch/jessie yet. > > - > Abhishek > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Command to check last change to rbd image?
Hi! Is there an easy way to check when an image was last modified? I want to make sure that the images I want to clean up were not used for a long time. Kind regards Kevin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
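Luminous/Mimic have no built-in "last modified" attribute for RBD images (modify/access timestamps only arrived in later releases), but one rough workaround is to inspect the mtimes of an image's backing objects; pool and image names below are placeholders:

prefix=$(rbd info rbd/myimage | awk '/block_name_prefix/{print $2}')
rados -p rbd ls | grep "^${prefix}" | while read obj; do
    rados -p rbd stat "$obj"      # prints mtime and size per object
done

The newest mtime approximates the last write, though listing every object is slow on big pools, so treat this as a spot check. "rbd status rbd/myimage" additionally shows current watchers, i.e. whether anything has the image open right now.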
Re: [ceph-users] nfs-ganesha version in Ceph repos
I had a similar problem: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/029698.html But even the recent 2.6.x releases were not working well for me (many, many segfaults). I am on the master branch (2.7.x) and that works well, with far fewer crashes. The cluster is 13.2.1/.2 with nfs-ganesha on a standalone VM. Kevin On Tue, Oct 9, 2018 at 19:39 Erik McCormick < emccorm...@cirrusseven.com> wrote: > On Tue, Oct 9, 2018 at 1:27 PM Erik McCormick > wrote: > > > > Hello, > > > > I'm trying to set up an nfs-ganesha server with the Ceph FSAL, and > > running into difficulties getting the current stable release running. > > The versions in the Luminous repo is stuck at 2.6.1, whereas the > > current stable version is 2.6.3. I've seen a couple of HA issues in > > pre 2.6.3 versions that I'd like to avoid. > > > > I should have been more specific that the ones I am looking for are for > Centos 7 > > > I've also been attempting to build my own from source, but banging my > > head against a wall as far as dependencies and config options are > > concerned. > > > > If anyone reading this has the ability to kick off a fresh build of > > the V2.6-stable branch with all the knobs turned properly for Ceph, or > > can point me to a set of cmake configs and scripts that might help me > > do it myself, I would be eternally grateful. > > > > Thanks, > > Erik > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Fastest way to find raw device from OSD-ID? (osd -> lvm lv -> lvm pv -> disk)
Hi Jakub, "ceph osd metadata X" this is perfect! This also lists multipath devices which I was looking for! Kevin Am Mo., 8. Okt. 2018 um 21:16 Uhr schrieb Jakub Jaszewski < jaszewski.ja...@gmail.com>: > Hi Kevin, > Have you tried ceph osd metadata OSDid ? > > Jakub > > pon., 8 paź 2018, 19:32 użytkownik Alfredo Deza > napisał: > >> On Mon, Oct 8, 2018 at 6:09 AM Kevin Olbrich wrote: >> > >> > Hi! >> > >> > Yes, thank you. At least on one node this works, the other node just >> freezes but this might by caused by a bad disk that I try to find. >> >> If it is freezing, you could maybe try running the command where it >> freezes? (ceph-volume will log it to the terminal) >> >> >> > >> > Kevin >> > >> > Am Mo., 8. Okt. 2018 um 12:07 Uhr schrieb Wido den Hollander < >> w...@42on.com>: >> >> >> >> Hi, >> >> >> >> $ ceph-volume lvm list >> >> >> >> Does that work for you? >> >> >> >> Wido >> >> >> >> On 10/08/2018 12:01 PM, Kevin Olbrich wrote: >> >> > Hi! >> >> > >> >> > Is there an easy way to find raw disks (eg. sdd/sdd1) by OSD id? >> >> > Before I migrated from filestore with simple-mode to bluestore with >> lvm, >> >> > I was able to find the raw disk with "df". >> >> > Now, I need to go from LVM LV to PV to disk every time I need to >> >> > check/smartctl a disk. >> >> > >> >> > Kevin >> >> > >> >> > >> >> > ___ >> >> > ceph-users mailing list >> >> > ceph-users@lists.ceph.com >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> >> > >> > >> > ___ >> > ceph-users mailing list >> > ceph-users@lists.ceph.com >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Fastest way to find raw device from OSD-ID? (osd -> lvm lv -> lvm pv -> disk)
Hi! Yes, thank you. At least on one node this works; the other node just freezes, but this might be caused by a bad disk that I am trying to find. Kevin On Mon, Oct 8, 2018 at 12:07 Wido den Hollander wrote: > Hi, > > $ ceph-volume lvm list > > Does that work for you? > > Wido > > On 10/08/2018 12:01 PM, Kevin Olbrich wrote: > > Hi! > > > > Is there an easy way to find raw disks (eg. sdd/sdd1) by OSD id? > > Before I migrated from filestore with simple-mode to bluestore with lvm, > > I was able to find the raw disk with "df". > > Now, I need to go from LVM LV to PV to disk every time I need to > > check/smartctl a disk. > > > > Kevin > > > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Fastest way to find raw device from OSD-ID? (osd -> lvm lv -> lvm pv -> disk)
Hi! Is there an easy way to find raw disks (e.g. sdd/sdd1) by OSD id? Before I migrated from filestore with simple-mode to bluestore with lvm, I was able to find the raw disk with "df". Now I need to go from LVM LV to PV to disk every time I need to check/smartctl a disk. Kevin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
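The manual chain described above, spelled out (the OSD id is an example):

ls -l /var/lib/ceph/osd/ceph-29/block   # symlink to the LV
lvs -o lv_name,vg_name,devices          # LV -> PV mapping
pvs                                     # PV -> physical disk

"ceph-volume lvm list" and "ceph osd metadata <id>" (discussed in the replies above) collapse these steps into a single command.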
Re: [ceph-users] After 13.2.2 upgrade: bluefs mount failed to replay log: (5) Input/output error
nt: (5) Input/output error 2018-10-08 10:32:17.434 7f6af518e1c0 20 bdev aio_wait 0x55a3a1edb8c0 done 2018-10-08 10:32:17.434 7f6af518e1c0 1 bdev(0x55a3a1d62a80 /var/lib/ceph/osd/ceph-46/block) close 2018-10-08 10:32:17.434 7f6af518e1c0 10 bdev(0x55a3a1d62a80 /var/lib/ceph/osd/ceph-46/block) _aio_stop 2018-10-08 10:32:17.568 7f6add7d3700 10 bdev(0x55a3a1d62a80 /var/lib/ceph/osd/ceph-46/block) _aio_thread end 2018-10-08 10:32:17.573 7f6af518e1c0 10 bdev(0x55a3a1d62a80 /var/lib/ceph/osd/ceph-46/block) _discard_stop 2018-10-08 10:32:17.573 7f6adcfd2700 20 bdev(0x55a3a1d62a80 /var/lib/ceph/osd/ceph-46/block) _discard_thread wake 2018-10-08 10:32:17.573 7f6adcfd2700 10 bdev(0x55a3a1d62a80 /var/lib/ceph/osd/ceph-46/block) _discard_thread finish 2018-10-08 10:32:17.573 7f6af518e1c0 10 bdev(0x55a3a1d62a80 /var/lib/ceph/osd/ceph-46/block) _discard_stop stopped 2018-10-08 10:32:17.573 7f6af518e1c0 1 bdev(0x55a3a1d62000 /var/lib/ceph/osd/ceph-46/block) close 2018-10-08 10:32:17.573 7f6af518e1c0 10 bdev(0x55a3a1d62000 /var/lib/ceph/osd/ceph-46/block) _aio_stop 2018-10-08 10:32:17.817 7f6ade7d5700 10 bdev(0x55a3a1d62000 /var/lib/ceph/osd/ceph-46/block) _aio_thread end 2018-10-08 10:32:17.822 7f6af518e1c0 10 bdev(0x55a3a1d62000 /var/lib/ceph/osd/ceph-46/block) _discard_stop 2018-10-08 10:32:17.822 7f6addfd4700 20 bdev(0x55a3a1d62000 /var/lib/ceph/osd/ceph-46/block) _discard_thread wake 2018-10-08 10:32:17.822 7f6addfd4700 10 bdev(0x55a3a1d62000 /var/lib/ceph/osd/ceph-46/block) _discard_thread finish 2018-10-08 10:32:17.822 7f6af518e1c0 10 bdev(0x55a3a1d62000 /var/lib/ceph/osd/ceph-46/block) _discard_stop stopped 2018-10-08 10:32:17.823 7f6af518e1c0 -1 osd.46 0 OSD:init: unable to mount object store 2018-10-08 10:32:17.823 7f6af518e1c0 -1 ** ERROR: osd init failed: (5) Input/output error Anything interesting here? I will try to export the down PGs from the disks (a sketch follows after this message). I got a bunch of new disks to replace them all. Most of the current disks are of the same age. Kevin On Wed, Oct 3, 2018 at 13:52 Paul Emmerich < paul.emmer...@croit.io> wrote: > There's "ceph-bluestore-tool repair/fsck" > > In your scenario, a few more log files would be interesting: try > setting debug bluefs to 20/20. And if that's not enough log try also > setting debug osd, debug bluestore, and debug bdev to 20/20. > > > > Paul > Am Mi., 3. Okt. 2018 um 13:48 Uhr schrieb Kevin Olbrich : > > > > The disks were deployed with ceph-deploy / ceph-volume using the default > style (lvm) and not simple-mode. > > > > The disks were provisioned as a whole, no resizing. I never touched the > disks after deployment. > > > > It is very strange that this first happened after the update, never met > such an error before. > > > > I found a BUG in the tracker, that also shows such an error with count > 0. That was closed with "can't reproduce" (don't have the link ready). For > me this seems like the data itself is fine and I just hit a bad transaction > in the replay (which maybe caused the crash in the first place). > > > > I need one of three disks back. Object corruption would not be a problem > (regarding drop of a journal), as this cluster hosts backups which will > fail validation and regenerate. Just marking the OSD lost does not seem to > be an option. > > > > Is there some sort of fsck for BlueFS? > > > > Kevin > > > > > > Igor Fedotov schrieb am Mi. 3. Okt. 
2018 um 13:01: > >> > >> I've seen somewhat similar behavior in a log from Sergey Malinin in > another thread ("mimic: 3/4 OSDs crashed...") > >> > >> He claimed it happened after LVM volume expansion. Isn't this the case > for you? > >> > >> Am I right that you use LVM volumes? > >> > >> > >> On 10/3/2018 11:22 AM, Kevin Olbrich wrote: > >> > >> Small addition: the failing disks are in the same host. > >> This is a two-host, failure-domain OSD cluster. > >> > >> > >> Am Mi., 3. Okt. 2018 um 10:13 Uhr schrieb Kevin Olbrich : > >>> > >>> Hi! > >>> > >>> Yesterday one of our (non-priority) clusters failed when 3 OSDs went > down (EC 8+2) together. > >>> This is strange as we did an upgrade from 13.2.1 to 13.2.2 one or two > hours before. > >>> They failed exactly at the same moment, rendering the cluster unusable > (CephFS). > >>> We are using CentOS 7 with latest updates and ceph repo. No cache > SSDs, no external journal / wal / db. > >>> > >>> OSD 29 (no disk failure in dmesg): > >>> 2018-10-03 09:47:15.074 7fb8835ce1c0 0 set uid:gid to 167:167 > (ceph:ceph) > >>> 2018-10-03 09:47:15.074 7fb8835ce1c0 0 ceph version 13.2.2 > (02899bfda8141
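Putting Paul's pointers into concrete commands (the OSD path is from this thread; <pgid> and the target OSD are placeholders; the OSD must be stopped first, and the export needs scratch space):

ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-46
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-46
# salvage a PG from the offline OSD:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-46 \
    --pgid <pgid> --op export --file /root/<pgid>.export
# import it into another (also stopped) OSD:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NN \
    --op import --file /root/<pgid>.export

No guarantee the export works in this situation: ceph-objectstore-tool has to open the same BlueFS that fails to replay, so fsck/repair may need to succeed first.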
Re: [ceph-users] After 13.2.2 upgrade: bluefs mount failed to replay log: (5) Input/output error
The disks were deployed with ceph-deploy / ceph-volume using the default style (lvm) and not simple-mode. The disks were provisioned as a whole, no resizing. I never touched the disks after deployment. It is very strange that this first happened after the update; I never met such an error before. I found a bug in the tracker that also shows such an error with count 0. It was closed with "can't reproduce" (I don't have the link ready). For me this seems like the data itself is fine and I just hit a bad transaction in the replay (which maybe caused the crash in the first place). I need one of three disks back. Object corruption would not be a problem (regarding the drop of a journal), as this cluster hosts backups which will fail validation and regenerate. Just marking the OSD lost does not seem to be an option. Is there some sort of fsck for BlueFS? Kevin Igor Fedotov wrote on Wed, Oct 3, 2018 at 13:01: > I've seen somewhat similar behavior in a log from Sergey Malinin in > another thread ("mimic: 3/4 OSDs crashed...") > > He claimed it happened after LVM volume expansion. Isn't this the case for > you? > > Am I right that you use LVM volumes? > > On 10/3/2018 11:22 AM, Kevin Olbrich wrote: > > Small addition: the failing disks are in the same host. > This is a two-host, failure-domain OSD cluster. > > > Am Mi., 3. Okt. 2018 um 10:13 Uhr schrieb Kevin Olbrich : > >> > >> Hi! > >> > >> Yesterday one of our (non-priority) clusters failed when 3 OSDs went down > >> (EC 8+2) together. > >> *This is strange as we did an upgrade from 13.2.1 to 13.2.2 one or two > >> hours before.* > >> They failed exactly at the same moment, rendering the cluster unusable > >> (CephFS). > >> We are using CentOS 7 with latest updates and ceph repo. No cache SSDs, no > >> external journal / wal / db. 
>> >> *OSD 29 (no disk failure in dmesg):* >> 2018-10-03 09:47:15.074 7fb8835ce1c0 0 set uid:gid to 167:167 (ceph:ceph) >> 2018-10-03 09:47:15.074 7fb8835ce1c0 0 ceph version 13.2.2 >> (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable), process >> ceph-osd, pid 20899 >> 2018-10-03 09:47:15.074 7fb8835ce1c0 0 pidfile_write: ignore empty >> --pid-file >> 2018-10-03 09:47:15.100 7fb8835ce1c0 0 load: jerasure load: lrc load: >> isa >> 2018-10-03 09:47:15.100 7fb8835ce1c0 1 bdev create path >> /var/lib/ceph/osd/ceph-29/block type kernel >> 2018-10-03 09:47:15.100 7fb8835ce1c0 1 bdev(0x561250a2 >> /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block >> 2018-10-03 09:47:15.100 7fb8835ce1c0 1 bdev(0x561250a2 >> /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932 >> GiB) block_size 4096 (4 KiB) rotational >> 2018-10-03 09:47:15.101 7fb8835ce1c0 1 >> bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes kv_min_ratio 1 > >> kv_ratio 0.5 >> 2018-10-03 09:47:15.101 7fb8835ce1c0 1 >> bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes cache_size 536870912 >> meta 0 kv 1 data 0 >> 2018-10-03 09:47:15.101 7fb8835ce1c0 1 bdev(0x561250a2 >> /var/lib/ceph/osd/ceph-29/block) close >> 2018-10-03 09:47:15.358 7fb8835ce1c0 1 >> bluestore(/var/lib/ceph/osd/ceph-29) _mount path /var/lib/ceph/osd/ceph-29 >> 2018-10-03 09:47:15.358 7fb8835ce1c0 1 bdev create path >> /var/lib/ceph/osd/ceph-29/block type kernel >> 2018-10-03 09:47:15.358 7fb8835ce1c0 1 bdev(0x561250a2 >> /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block >> 2018-10-03 09:47:15.359 7fb8835ce1c0 1 bdev(0x561250a2 >> /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932 >> GiB) block_size 4096 (4 KiB) rotational >> 2018-10-03 09:47:15.360 7fb8835ce1c0 1 >> bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes kv_min_ratio 1 > >> kv_ratio 0.5 >> 2018-10-03 09:47:15.360 7fb8835ce1c0 1 >> bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes cache_size 536870912 >> meta 0 kv 1 data 0 >> 2018-10-03 09:47:15.360 7fb8835ce1c0 1 bdev create path >> /var/lib/ceph/osd/ceph-29/block type kernel >> 2018-10-03 09:47:15.360 7fb8835ce1c0 1 bdev(0x561250a20a80 >> /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block >> 2018-10-03 09:47:15.360 7fb8835ce1c0 1 bdev(0x561250a20a80 >> /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932 >> GiB) block_size 4096 (4 KiB) rotational >> 2018-10-03 09:47:15.360 7fb8835ce1c0 1 bluefs add_block_device bdev 1 >> path /var/lib/ceph/osd/ceph-29/block size 932 GiB >> 2018-10-03 09:47:15.360 7fb8835ce1c0 1 bluefs mount >> 2018-10-03 09:47:15.538 7fb8835ce1c0 -1 bluefs _replay file wi
Re: [ceph-users] After 13.2.2 upgrade: bluefs mount failed to replay log: (5) Input/output error
Small addition: the failing disks are in the same host. This is a two-host, failure-domain OSD cluster. Am Mi., 3. Okt. 2018 um 10:13 Uhr schrieb Kevin Olbrich : > Hi! > > Yesterday one of our (non-priority) clusters failed when 3 OSDs went down > (EC 8+2) together. > *This is strange as we did an upgrade from 13.2.1 to 13.2.2 one or two > hours before.* > They failed exactly at the same moment, rendering the cluster unusable > (CephFS). > We are using CentOS 7 with latest updates and ceph repo. No cache SSDs, no > external journal / wal / db. > > *OSD 29 (no disk failure in dmesg):* > 2018-10-03 09:47:15.074 7fb8835ce1c0 0 set uid:gid to 167:167 (ceph:ceph) > 2018-10-03 09:47:15.074 7fb8835ce1c0 0 ceph version 13.2.2 > (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable), process > ceph-osd, pid 20899 > 2018-10-03 09:47:15.074 7fb8835ce1c0 0 pidfile_write: ignore empty > --pid-file > 2018-10-03 09:47:15.100 7fb8835ce1c0 0 load: jerasure load: lrc load: isa > 2018-10-03 09:47:15.100 7fb8835ce1c0 1 bdev create path > /var/lib/ceph/osd/ceph-29/block type kernel > 2018-10-03 09:47:15.100 7fb8835ce1c0 1 bdev(0x561250a2 > /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block > 2018-10-03 09:47:15.100 7fb8835ce1c0 1 bdev(0x561250a2 > /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932 > GiB) block_size 4096 (4 KiB) rotational > 2018-10-03 09:47:15.101 7fb8835ce1c0 1 > bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes kv_min_ratio 1 > > kv_ratio 0.5 > 2018-10-03 09:47:15.101 7fb8835ce1c0 1 > bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes cache_size 536870912 > meta 0 kv 1 data 0 > 2018-10-03 09:47:15.101 7fb8835ce1c0 1 bdev(0x561250a2 > /var/lib/ceph/osd/ceph-29/block) close > 2018-10-03 09:47:15.358 7fb8835ce1c0 1 > bluestore(/var/lib/ceph/osd/ceph-29) _mount path /var/lib/ceph/osd/ceph-29 > 2018-10-03 09:47:15.358 7fb8835ce1c0 1 bdev create path > /var/lib/ceph/osd/ceph-29/block type kernel > 2018-10-03 09:47:15.358 7fb8835ce1c0 1 bdev(0x561250a2 > /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block > 2018-10-03 09:47:15.359 7fb8835ce1c0 1 bdev(0x561250a2 > /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932 > GiB) block_size 4096 (4 KiB) rotational > 2018-10-03 09:47:15.360 7fb8835ce1c0 1 > bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes kv_min_ratio 1 > > kv_ratio 0.5 > 2018-10-03 09:47:15.360 7fb8835ce1c0 1 > bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes cache_size 536870912 > meta 0 kv 1 data 0 > 2018-10-03 09:47:15.360 7fb8835ce1c0 1 bdev create path > /var/lib/ceph/osd/ceph-29/block type kernel > 2018-10-03 09:47:15.360 7fb8835ce1c0 1 bdev(0x561250a20a80 > /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block > 2018-10-03 09:47:15.360 7fb8835ce1c0 1 bdev(0x561250a20a80 > /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932 > GiB) block_size 4096 (4 KiB) rotational > 2018-10-03 09:47:15.360 7fb8835ce1c0 1 bluefs add_block_device bdev 1 > path /var/lib/ceph/osd/ceph-29/block size 932 GiB > 2018-10-03 09:47:15.360 7fb8835ce1c0 1 bluefs mount > 2018-10-03 09:47:15.538 7fb8835ce1c0 -1 bluefs _replay file with link > count 0: file(ino 519 size 0x31e2f42 mtime 2018-10-02 12:24:22.632397 bdev > 1 allocated 320 extents > 
> [1:0x700820+10,1:0x700900+10,1:0x700910+10,1:0x700920+10,1:0x700930+10,1:0x700940+10,1:0x700950+10,1:0x700960+10,1:0x700970+10,1:0x700980+10,1:0x700990+10,1:0x7009a0+10,1:0x7009b0+10,1:0x7009c0+10,1:0x7009d0+10,1:0x7009e0+10,1:0x7009f0+10,1:0x700a00+10,1:0x700a10+10,1:0x700a20+10,1:0x700a30+10,1:0x700a40+10,1:0x700a50+10,1:0x700a60+10,1:0x700a70+10,1:0x700a80+10,1:0x700a90+10,1:0x700aa0+10,1:0x700ab0+10,1:0x700ac0+10,1:0x700ad0+10,1:0x700ae0+10,1:0x700af0+10,1:0x700b00+10,1:0x700b10+10,1:0x700b20+10,1:0x700b30+10,1:0x700b40+10,1:0x700b50+10,1:0x700b60+10,1:0x700b70+10,1:0x700b80+10,1:0x700b90+10,1:0x700ba0+10,1:0x700bb0+10,1:0x700bc0+10,1:0x700bd0+10,1:0x700be0+10,1:0x700bf0+10,1:0x700c00+10])
> 2018-10-03 09:47:15.538 7fb8835ce1c0 -1 bluefs mount failed to replay log: (5) Input/output error
> 2018-10-03 09:47:15.538 7fb8835ce1c0 1 stupidalloc 0x0x561250b8d030 shutdown
> 2018-10-03 09:47:15.538 7fb8835ce1c0 -1 bluestore(/var/lib/ceph/osd/ceph-29) _open_db failed bluefs mount: (
[ceph-users] After 13.2.2 upgrade: bluefs mount failed to replay log: (5) Input/output error
might be failed. *OSD 47 (same as above, seems not to have died, no dmesg trace):*
2018-10-03 10:02:25.221 7f4d54b611c0 0 set uid:gid to 167:167 (ceph:ceph)
2018-10-03 10:02:25.221 7f4d54b611c0 0 ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable), process ceph-osd, pid 8993
2018-10-03 10:02:25.221 7f4d54b611c0 0 pidfile_write: ignore empty --pid-file
2018-10-03 10:02:25.247 7f4d54b611c0 0 load: jerasure load: lrc load: isa
2018-10-03 10:02:25.248 7f4d54b611c0 1 bdev create path /var/lib/ceph/osd/ceph-46/block type kernel
2018-10-03 10:02:25.248 7f4d54b611c0 1 bdev(0x564072f96000 /var/lib/ceph/osd/ceph-46/block) open path /var/lib/ceph/osd/ceph-46/block
2018-10-03 10:02:25.248 7f4d54b611c0 1 bdev(0x564072f96000 /var/lib/ceph/osd/ceph-46/block) open size 1000198897664 (0xe8e080, 932 GiB) block_size 4096 (4 KiB) rotational
2018-10-03 10:02:25.249 7f4d54b611c0 1 bluestore(/var/lib/ceph/osd/ceph-46) _set_cache_sizes kv_min_ratio 1 > kv_ratio 0.5
2018-10-03 10:02:25.249 7f4d54b611c0 1 bluestore(/var/lib/ceph/osd/ceph-46) _set_cache_sizes cache_size 536870912 meta 0 kv 1 data 0
2018-10-03 10:02:25.249 7f4d54b611c0 1 bdev(0x564072f96000 /var/lib/ceph/osd/ceph-46/block) close
2018-10-03 10:02:25.503 7f4d54b611c0 1 bluestore(/var/lib/ceph/osd/ceph-46) _mount path /var/lib/ceph/osd/ceph-46
2018-10-03 10:02:25.504 7f4d54b611c0 1 bdev create path /var/lib/ceph/osd/ceph-46/block type kernel
2018-10-03 10:02:25.504 7f4d54b611c0 1 bdev(0x564072f96000 /var/lib/ceph/osd/ceph-46/block) open path /var/lib/ceph/osd/ceph-46/block
2018-10-03 10:02:25.504 7f4d54b611c0 1 bdev(0x564072f96000 /var/lib/ceph/osd/ceph-46/block) open size 1000198897664 (0xe8e080, 932 GiB) block_size 4096 (4 KiB) rotational
2018-10-03 10:02:25.505 7f4d54b611c0 1 bluestore(/var/lib/ceph/osd/ceph-46) _set_cache_sizes kv_min_ratio 1 > kv_ratio 0.5
2018-10-03 10:02:25.505 7f4d54b611c0 1 bluestore(/var/lib/ceph/osd/ceph-46) _set_cache_sizes cache_size 536870912 meta 0 kv 1 data 0
2018-10-03 10:02:25.505 7f4d54b611c0 1 bdev create path /var/lib/ceph/osd/ceph-46/block type kernel
2018-10-03 10:02:25.505 7f4d54b611c0 1 bdev(0x564072f96a80 /var/lib/ceph/osd/ceph-46/block) open path /var/lib/ceph/osd/ceph-46/block
2018-10-03 10:02:25.505 7f4d54b611c0 1 bdev(0x564072f96a80 /var/lib/ceph/osd/ceph-46/block) open size 1000198897664 (0xe8e080, 932 GiB) block_size 4096 (4 KiB) rotational
2018-10-03 10:02:25.505 7f4d54b611c0 1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-46/block size 932 GiB
2018-10-03 10:02:25.505 7f4d54b611c0 1 bluefs mount
2018-10-03 10:02:25.620 7f4d54b611c0 -1 bluefs _replay file with link count 0: file(ino 450 size 0x169964c mtime 2018-10-02 12:24:22.602432 bdev 1 allocated 170 extents [1:0x6fd950+10,1:0x6fd960+10,1:0x6fd970+10,1:0x6fd980+10,1:0x6fd990+10,1:0x6fd9a0+10,1:0x6fd9b0+10,1:0x6fd9c0+10,1:0x6fd9d0+10,1:0x6fd9e0+10,1:0x6fd9f0+10,1:0x6fda00+10,1:0x6fda10+10,1:0x6fda20+10,1:0x6fda30+10,1:0x6fda40+10,1:0x6fda50+10,1:0x6fda60+10,1:0x6fda70+10,1:0x6fda80+10,1:0x6fda90+10,1:0x6fdaa0+10,1:0x6fdab0+10])
2018-10-03 10:02:25.620 7f4d54b611c0 -1 bluefs mount failed to replay log: (5) Input/output error
2018-10-03 10:02:25.620 7f4d54b611c0 1 stupidalloc 0x0x564073102fc0 shutdown
2018-10-03 10:02:25.620 7f4d54b611c0 -1 bluestore(/var/lib/ceph/osd/ceph-46) _open_db failed bluefs mount: (5) Input/output error
2018-10-03 10:02:25.620 7f4d54b611c0 1 bdev(0x564072f96a80 /var/lib/ceph/osd/ceph-46/block) close
2018-10-03 10:02:25.763 7f4d54b611c0 1 bdev(0x564072f96000 /var/lib/ceph/osd/ceph-46/block) close
2018-10-03 10:02:26.010 7f4d54b611c0 -1 osd.46 0 OSD:init: unable to mount object store
2018-10-03 10:02:26.010 7f4d54b611c0 -1 ** ERROR: osd init failed: (5) Input/output error

We had failing disks in this cluster before but that was easily recovered by out + rebalance. For me, it seems like one disk died (there was large I/O on the cluster when this happened) and took two additional disks with it. It is very strange that this happened about two hours after the upgrade + reboot.

*Any recommendations?*
*I have 8 PGs down, the remaining are active and recovering / rebalancing.*

Kind regards
Kevin
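Before rebuilding an OSD that fails like this, it may be worth a read-only consistency check of the BlueStore metadata. A minimal sketch, assuming the OSD id from the log above and that the daemon is stopped (ceph-bluestore-tool ships with mimic; fsck does not modify the store, and bluefs-export copies the BlueFS files out for offline inspection):

systemctl stop ceph-osd@29
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-29
ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-29 --out-dir /tmp/bluefs-29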
Re: [ceph-users] Fwd: [Ceph-community] After Mimic upgrade OSD's stuck at booting.
Hey, don't lose hope. I just went through two 3-5 day outages after a mimic upgrade with no data loss. I'd recommend looking through the thread about it to see how close it is to your issue; from my point of view there seem to be some similarities. http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/029649.html.

At a similar point of desperation with my cluster I would shut all ceph processes down and bring them up in order. Doing this had my cluster almost healthy a few times until it fell over again due to mon issues, so solving any mon issues is the first priority. It seems like you may also benefit from setting mon_osd_cache_size to a very large number if you have enough memory on your mon servers. I'll hop on IRC today.

Kevin

On 09/25/2018 05:53 PM, by morphin wrote: After trying many things with a lot of help on IRC, my pool health is still in ERROR and I think I can't recover from this. https://paste.ubuntu.com/p/HbsFnfkYDT/ In the end, 2 of 3 mons crashed and started at the same time, and the pool went offline. Recovery takes more than 12 hours and it is way too slow; somehow recovery does not seem to be working. If I can reach my data I will re-create the pool easily. If I run the ceph-objectstore-tool script to regenerate the mon store.db, can I access the RBD pool again?

On Tue, 25 Sep 2018 at 20:03, by morphin <morphinwith...@gmail.com> wrote: Hi, the cluster is still down :( Up to now we have managed to stabilize the OSDs. 118 of 160 OSDs are stable and the cluster is still in the process of settling. Thanks to Be-El in the ceph IRC channel, who helped a lot to make the flapping OSDs stable. What we have learned so far is that the cause was the sudden death of 2 of our 3 monitor servers, and that when they come back, if they do not start one by one (each joining the cluster in turn), this can happen. The cluster can be unhealthy and it can take countless hours to come back. Right now here is our status: ceph -s: https://paste.ubuntu.com/p/6DbgqnGS7t/ health detail: https://paste.ubuntu.com/p/w4gccnqZjR/ Since the OSD disks are NL-SAS it can take up to 24 hours for an online cluster. What is more, we have been told that we would be extremely lucky if all the data is rescued. Most unhappily, our strategy is just to sit and wait :(. As soon as the peering and activating count drops to 300-500 pgs we will restart the stopped OSDs one by one; after each OSD we will wait for the cluster to settle down. The amount of data stored in the OSDs is 33 TB. Our main concern is to export our rbd pool data to a backup space; then we will start again with a clean one. I hope to verify our analysis with an expert. Any help or advice would be greatly appreciated.

On Tue, 25 Sep 2018 at 15:08, by morphin <morphinwith...@gmail.com> wrote: Reducing the recovery parameter values did not change much. There are a lot of OSDs still marked down. I don't know what I need to do after this point.
[osd]
osd recovery op priority = 63
osd client op priority = 1
osd recovery max active = 1
osd max scrubs = 1

ceph -s
  cluster:
    id:     89569e73-eb89-41a4-9fc9-d2a5ec5f4106
    health: HEALTH_ERR
            42 osds down
            1 host (6 osds) down
            61/8948582 objects unfound (0.001%)
            Reduced data availability: 3837 pgs inactive, 1822 pgs down, 1900 pgs peering, 6 pgs stale
            Possible data damage: 18 pgs recovery_unfound
            Degraded data redundancy: 457246/17897164 objects degraded (2.555%), 213 pgs degraded, 209 pgs undersized
            2554 slow requests are blocked > 32 sec
            3273 slow ops, oldest one blocked for 1453 sec, daemons [osd.0,osd.1,osd.10,osd.100,osd.101,osd.102,osd.103,osd.104,osd.105,osd.106]... have slow ops.

  services:
    mon: 3 daemons, quorum SRV-SEKUARK3,SRV-SBKUARK2,SRV-SBKUARK3
    mgr: SRV-SBKUARK2(active), standbys: SRV-SEKUARK2, SRV-SEKUARK3, SRV-SEKUARK4
    osd: 168 osds: 118 up, 160 in

  data:
    pools:   1 pools, 4096 pgs
    objects: 8.95 M objects, 17 TiB
    usage:   33 TiB used, 553 TiB / 586 TiB avail
    pgs:     93.677% pgs not active
             457246/17897164 objects degraded (2.555%)
             61/8948582 objects unfound (0.001%)
             1676 down
             1372 peering
             528  stale+peering
             164  active+undersized+degraded
             145  stale+down
             73   activating
             40   active+clean
             29   stale+activating
             17   active+recovery_unfound+undersized+degraded
             16   stale+active+clean
             16   stale+active+undersized+degraded
             9    activating+undersized+degraded
             3    active+recovery_wait+degraded
             2    activating+undersized
             2    activating+degraded
             1    creating+down
             1    stale+active+recovery_unfound+undersized+degraded
             1    stale+active+clean+scrubbing+deep
             1
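As a sketch, Kevin's "shut everything down and bring it up in order" advice above might look like this with systemd units (host names, unit instances, and the settle checks are illustrative, not from the thread):

# mons first, one host at a time; confirm quorum before starting the next
systemctl start ceph-mon@$(hostname -s)
ceph quorum_status | grep quorum_names
# then the mgrs
systemctl start ceph-mgr@$(hostname -s)
# then OSDs in small batches, letting peering settle between batches
systemctl start ceph-osd@0 ceph-osd@1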
Re: [ceph-users] Mimic upgrade failure
The cluster is healthy and stable. I'll leave a summary for the archive in case anyone else has a similar problem.

centos 7.5, ceph mimic 13.2.1, 3 mon/mgr/mds hosts, 862 osd (41 hosts)

This was all triggered by an unexpected ~1 min network blip on our 10Gbit switch. The ceph hosts lost connectivity to each other and obviously tried to remap everything once connectivity returned, and tons of OSDs were being marked down. This was made worse by the OSDs trying to use large amounts of memory while recovering and ending up swapping, hanging, and me ipmi resetting hosts. All of this caused a lot of osd map changes, and the mons will have stored all of them without trimming due to the unhealthy PGs.

I was able to get almost all PGs active and clean on a few occasions, but the cluster would fall over again after about 2 hours with cephx auth errors or OSDs trying to mark each other down (the mons seemed to not be rotating cephx auth keys). Setting 'osd_heartbeat_interval = 30' helped a bit, but I eventually disabled cluster-internal cephx auth with 'auth_cluster_required = none'. Setting that stopped the OSDs from falling over after 2 hours.

From the beginning of this, the MONs were running 100% on the ms_dispatch thread, constantly reelecting a leader every minute, and not holding a consistent quorum, with paxos lease_timeouts in the logs. The ms_dispatch thread was reading through the /var/lib/ceph/mon/mon-$hostname/store.db/*.sst files constantly, and strace showed this taking anywhere from 60 seconds to a couple minutes. This was almost all cpu user time and not much iowait. I think what was happening is that the mons failed health checks due to spending so much time constantly reading through the db, and that held up other mon tasks, which caused constant reelections.

We eventually reduced the MON reelections by finding that the average ms_dispatch sst read time on the rank 0 mon took 65 seconds and setting 'mon_lease = 75' so that the paxos lease would last longer than ms_dispatch running at 100%. I also greatly increased the rocksdb_cache_size and leveldb_cache_size on the mons to be big enough to cache the entire db, but that didn't seem to make much difference initially. After working with Sage, he set the mon_osd_cache_size = 20 (default 10). The huge mon_osd_cache_size let the mons cache all osd maps on the first read, and the ms_dispatch thread was able to use this cache instead of spinning at 100% rereading them every minute. This stopped the constant elections because the mon stopped failing health checks and was able to complete other tasks.

Lastly, there were some self-inflicted osd corruptions from the ipmi resets that needed to be dealt with to get all PGs active+clean, and the cephx change was rolled back to operate normally.

Sage, thanks again for your assistance with this.

Kevin

tl;dr Cache as much as you can.

On 09/24/2018 09:24 AM, Sage Weil wrote: Hi Kevin, Do you have an update on the state of the cluster? I've opened a ticket http://tracker.ceph.com/issues/36163 to track the likely root cause we identified, and have a PR open at https://github.com/ceph/ceph/pull/24247 Thanks! sage

On Thu, 20 Sep 2018, Sage Weil wrote: On Thu, 20 Sep 2018, KEVIN MICHAEL HRPCEK wrote: Top results when both were taken with ms_dispatch at 100%. The mon one changes a lot so I've included 3 snapshots of those. I'll update mon_osd_cache_size. After disabling auth_cluster_required and a cluster reboot I am having fewer problems keeping OSDs in the cluster since they seem to not be having auth problems around the 2 hour uptime mark.
The mons still have their problems but 859/861 OSDs are up with 2 crashing. I found a brief mention on a forum or somewhere that the mons will only trim their store.db when the cluster is healthy. If that's true, do you think it is likely that once all osds are healthy and I unset some no* cluster flags, the mons will be able to trim their db, with the result that ms_dispatch no longer takes too long to churn through the db? Our primary theory here is that ms_dispatch is taking too long, the mons reach a timeout, and then they reelect in a nonstop cycle.

It's the PGs that need to all get healthy (active+clean) before the osdmaps get trimmed. Other health warnings (e.g. about noout being set) aren't related.

ceph-mon 34.24% 34.24% libpthread-2.17.so [.] pthread_rwlock_rdlock + 34.00% 34.00% libceph-common.so.0 [.] crush_hash32_3

If this is the -g output you need to hit enter on lines like this to see the call graph... Or you can do 'perf record -g -p ' and then 'perf report --stdio' (or similar) to dump it all to a file, fully expanded. Thanks! sage

+5.01% 5.01% libceph-common.so.0 [.] ceph::decode >, std::less, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair > >, std::_Select1st > > >, std::less > >, std::_Select1st > > >, std::less::copy +0.79% 0.79% libceph-c
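Pulled together as a ceph.conf sketch, the mon-side tuning described in the summary above would look roughly like the following. The mon_osd_cache_size value here is the 500 Sage floats later in the thread; the exact final value isn't fully preserved above, so treat these numbers as starting points rather than a recipe:

[mon]
mon lease = 75                     # longer than the worst-case ms_dispatch sst read time
mon osd cache size = 500           # default 10; keeps osdmaps cached so ms_dispatch stops rereading sst files
rocksdb cache size = 1342177280    # ~1.3 GB, 10x the default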
Re: [ceph-users] data-pool option for qemu-img / ec pool
Hi Paul,

thanks for the hint, I just checked and it works perfectly. I found this guide: https://www.reddit.com/r/ceph/comments/72yc9m/ceph_openstack_with_ec/

This works well with one meta/data setup but not with multiple (like device-class based pools). The link above uses client-auth, is there a better way?

Kevin

On Sun, 23 Sep 2018 at 18:08, Paul Emmerich wrote:
>
> The usual trick for clients not supporting this natively is the option "rbd_default_data_pool" in ceph.conf which should also work here.
>
> Paul
>
> On Sun, 23 Sep 2018 at 18:03, Kevin Olbrich wrote:
> >
> > Hi!
> >
> > Is it possible to set data-pool for ec-pools on qemu-img?
> > For repl-pools I used "qemu-img convert" to convert from e.g. vmdk to raw and write to rbd/ceph directly.
> >
> > The rbd utility is able to do this for raw or empty images but without convert (converting 800G and writing it again would now take at least twice the time).
> >
> > Am I missing a parameter for qemu-kvm?
> >
> > Kind regards
> > Kevin
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
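For the multiple-data-pool case, one alternative to a client-wide default is a per-image data pool. A sketch with hypothetical pool/image names (rbd's --data-pool flag puts the data objects in the EC pool while metadata stays in the replicated pool, and qemu-img's -n flag skips target creation so the pre-created image is reused):

# one-time: EC pools need overwrites enabled for rbd
ceph osd pool set rbd_ec_data allow_ec_overwrites true

# per-image: create with the desired data pool, then convert into it
rbd create --size 800G --data-pool rbd_ec_data rbd_meta/vm01
qemu-img convert -p -n -f vmdk -O raw vm01.vmdk rbd:rbd_meta/vm01

The client-wide variant from Paul's answer would instead be "rbd default data pool = rbd_ec_data" in a [client] section of ceph.conf.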
[ceph-users] data-pool option for qemu-img / ec pool
Hi!

Is it possible to set data-pool for ec-pools on qemu-img? For repl-pools I used "qemu-img convert" to convert from e.g. vmdk to raw and write to rbd/ceph directly.

The rbd utility is able to do this for raw or empty images but without convert (converting 800G and writing it again would now take at least twice the time).

Am I missing a parameter for qemu-kvm?

Kind regards
Kevin
Re: [ceph-users] Mimic upgrade failure
denc_traits >, void> > +2.02% 2.02% libceph-common.so.0 [.] ceph::buffer::ptr::unused_tail_length +1.99% 1.99% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl::copy +1.67% 0.00% [unknown] [k] 1.64% 1.64% libstdc++.so.6.0.19 [.] std::_Rb_tree_insert_and_rebalance +1.57% 1.57% libtcmalloc.so.4.4.5 [.] operator new[] +1.56% 1.56% libceph-common.so.0 [.] ceph::buffer::ptr::copy_out +1.55% 1.55% libceph-common.so.0 [.] ceph::buffer::ptr::append 1.53% 1.53% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl::copy@plt +1.51% 1.51% [kernel] [k] rb_insert_color +1.36% 1.36% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl::advance +1.27% 1.27% libceph-common.so.0 [.] ceph::encode >, std::less, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair > >, std::_Select1st > > >, std::less' to see where all of the encoding activity is coming from? I see two possibilities (the mon attempts to cache encoded maps, and the MOSDMap message itself will also reencode if/when that fails). Also: mon_osd_cache_size = 10 by default... try making that 500 or something. sage On Wed, 19 Sep 2018, Kevin Hrpcek wrote: Majority of the clients are luminous with a few kraken stragglers. I looked at ceph features and 'ceph daemon mon.sephmon1 sessions'. Nothing is reporting as having mimic features, all mon,mgr,osd are running 13.2.1 but are reporting luminous features, and majority of the luminous clients are reporting jewel features. I shut down my compute cluster to get rid of majority of the clients that are reporting jewel features, and there is still a lot of time spent by ms_dispatch in ceph::decode >, std::less, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair > > 6.67% libceph-common.so.0 [.] ceph::buffer::ptr::release 5.35% libceph-common.so.0 [.] std::_Rb_tree > >, std::_Select1st > > >, std::less, mempoo 5.20% libceph-common.so.0 [.] ceph::buffer::ptr::append 5.12% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl::copy 4.66% libceph-common.so.0 [.] ceph::buffer::list::append 4.33% libstdc++.so.6.0.19 [.] std::_Rb_tree_increment 4.27% ceph-mon [.] std::_Rb_tree > >, std::_Select1st > > >, std::less, mempoo 4.18% libceph-common.so.0 [.] ceph::buffer::list::append 3.10% libceph-common.so.0 [.] ceph::decode >, denc_traits >, void> > 2.90% libceph-common.so.0 [.] ceph::encode >, std::less, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair > > 2.56% libceph-common.so.0 [.] ceph::buffer::ptr::ptr 2.50% libstdc++.so.6.0.19 [.] std::_Rb_tree_insert_and_rebalance 2.39% libceph-common.so.0 [.] ceph::buffer::ptr::copy_out 2.33% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl::advance 2.21% libtcmalloc.so.4.4.5 [.] tcmalloc::CentralFreeList::FetchFromOneSpans 1.97% libtcmalloc.so.4.4.5 [.] tcmalloc::CentralFreeList::ReleaseToSpans 1.60% libceph-common.so.0 [.] crc32_iscsi_00 1.42% libtcmalloc.so.4.4.5 [.] operator new[] 1.29% libceph-common.so.0 [.] ceph::buffer::ptr::unused_tail_length 1.28% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl::copy_shallow 1.25% libceph-common.so.0 [.] ceph::buffer::ptr::raw_length@plt 1.06% libceph-common.so.0 [.] ceph::buffer::ptr::end_c_str 1.06% libceph-common.so.0 [.] ceph::buffer::list::iterator::copy 0.99% libc-2.17.so [.] __memcpy_ssse3_back 0.94% libc-2.17.so [.] _IO_default_xsputn 0.89% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl::advance@plt 0.87% libtcmalloc.so.4.4.5 [.] tcmalloc::ThreadCache::ReleaseToCentralCache 0.76% libleveldb.so.1.0.7 [.] leveldb::FindFile 0.72% [vdso][.] 
__vdso_clock_gettime 0.67% ceph-mon [.] std::_Rb_tree > >, std::_Select1st > > >, std::less, mempoo 0.63% libtcmalloc.so.4.4.5 [.] tc_deletearray_nothrow 0.59% libceph-common.so.0 [.] ceph::buffer::list::iterator::advance 0.52% libceph-common.so.0 [.] ceph::buffer::list::iterator::get_current_ptr perf top ms_dispatch 11.88% libceph-common.so.0 [.] ceph::decode >, std::less, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair > > 11.23% [kernel] [k] system_call_after_swapgs 9.36% libceph-common.so.0 [.] crush_hash32_3 6.55% libceph-common.so.0 [.] crush_choose_indep 4.39% [kernel] [k] smp_call_function_many 3.17% libceph-common.so.0 [.] ceph::buffer::list::append 3.03% libceph-common.so.0 [.] ceph::buffer::list::append 3.02% libceph-common.so.0 [.] std::_Rb_tree > >, std::_
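A sketch of the fully-expanded capture Sage asks for above (the pid argument was lost in the archive; the 60-second sampling window is an arbitrary choice):

perf record -g -p $(pidof ceph-mon) -- sleep 60
perf report --stdio > /tmp/ceph-mon-callgraph.txt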
Re: [ceph-users] Mimic upgrade failure
The mons have a 300gb raid 1 on 10k sas. The /var lv is 44% full with the /var/lib/ceph/mon directory at 6.7gb. When ms_dispatch is running 100% it is all user time with iostat showing 0-2% utilization of the drive. I'm considering taking one the mon's raid 1 drives and dropping them into a server with better cpu to see if that makes a difference in the time it takes for ms_dispatch to do its thing. OSDs seem to be struggling to update their cephx auth key/ticket ~2hr after a cluster reboot. This morning I'm setting auth_cluster_required = none to see if that removes this issue until the cluster is stable again. Kevin On 09/20/2018 08:13 AM, David Turner wrote: Out of curiosity, what disks do you have your mons on and how does the disk usage, both utilization% and full%, look while this is going on? On Wed, Sep 19, 2018, 1:57 PM Kevin Hrpcek mailto:kevin.hrp...@ssec.wisc.edu>> wrote: Majority of the clients are luminous with a few kraken stragglers. I looked at ceph features and 'ceph daemon mon.sephmon1 sessions'. Nothing is reporting as having mimic features, all mon,mgr,osd are running 13.2.1 but are reporting luminous features, and majority of the luminous clients are reporting jewel features. I shut down my compute cluster to get rid of majority of the clients that are reporting jewel features, and there is still a lot of time spent by ms_dispatch in ceph::decode >, std::less, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair > > 6.67% libceph-common.so.0 [.] ceph::buffer::ptr::release 5.35% libceph-common.so.0 [.] std::_Rb_tree > >, std::_Select1st > > >, std::less, mempoo 5.20% libceph-common.so.0 [.] ceph::buffer::ptr::append 5.12% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl::copy 4.66% libceph-common.so.0 [.] ceph::buffer::list::append 4.33% libstdc++.so.6.0.19 [.] std::_Rb_tree_increment 4.27% ceph-mon [.] std::_Rb_tree > >, std::_Select1st > > >, std::less, mempoo 4.18% libceph-common.so.0 [.] ceph::buffer::list::append 3.10% libceph-common.so.0 [.] ceph::decode >, denc_traits >, void> > 2.90% libceph-common.so.0 [.] ceph::encode >, std::less, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair > > 2.56% libceph-common.so.0 [.] ceph::buffer::ptr::ptr 2.50% libstdc++.so.6.0.19 [.] std::_Rb_tree_insert_and_rebalance 2.39% libceph-common.so.0 [.] ceph::buffer::ptr::copy_out 2.33% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl::advance 2.21% libtcmalloc.so.4.4.5 [.] tcmalloc::CentralFreeList::FetchFromOneSpans 1.97% libtcmalloc.so.4.4.5 [.] tcmalloc::CentralFreeList::ReleaseToSpans 1.60% libceph-common.so.0 [.] crc32_iscsi_00 1.42% libtcmalloc.so.4.4.5 [.] operator new[] 1.29% libceph-common.so.0 [.] ceph::buffer::ptr::unused_tail_length 1.28% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl::copy_shallow 1.25% libceph-common.so.0 [.] ceph::buffer::ptr::raw_length@plt 1.06% libceph-common.so.0 [.] ceph::buffer::ptr::end_c_str 1.06% libceph-common.so.0 [.] ceph::buffer::list::iterator::copy 0.99% libc-2.17.so<http://libc-2.17.so> [.] __memcpy_ssse3_back 0.94% libc-2.17.so<http://libc-2.17.so> [.] _IO_default_xsputn 0.89% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl::advance@plt 0.87% libtcmalloc.so.4.4.5 [.] tcmalloc::ThreadCache::ReleaseToCentralCache 0.76% libleveldb.so.1.0.7 [.] leveldb::FindFile 0.72% [vdso][.] __vdso_clock_gettime 0.67% ceph-mon [.] std::_Rb_tree > >, std::_Select1st > > >, std::less, mempoo 0.63% libtcmalloc.so.4.4.5 [.] tc_deletearray_nothrow 0.59% libceph-common.so.0 [.] 
ceph::buffer::list::iterator::advance 0.52% libceph-common.so.0 [.] ceph::buffer::list::iterator::get_current_ptr perf top ms_dispatch 11.88% libceph-common.so.0 [.] ceph::decode >, std::less, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair > > 11.23% [kernel] [k] system_call_after_swapgs 9.36% libceph-common.so.0 [.] crush_hash32_3 6.55% libceph-common.so.0 [.] crush_choose_indep 4.39% [kernel] [k] smp_call_function_many 3.17% libceph-common.so.0 [.] ceph::buffer::list::append 3.03% libceph-common.so.0 [.] ceph::buffer::list::append 3.02% libceph-common.so.0 [.] std::_Rb_tree > >, std::_Select1st > > >, std::less, mempoo 2.92% libceph-common.so.0 [.] ceph::buffer::ptr::release 2.65% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl::advance 2.57% ceph-mon [.] std::_Rb_tree > >, std::_Select1st > > >, std::less, mempoo 2.27% libceph-common.so.0 [.] ceph::buffer::ptr::ptr 1.99% libstdc++.so.6.
Re: [ceph-users] Crush distribution with heterogeneous device classes and failure domain hosts
Thank you very much Paul.

Kevin

On Thu, 20 Sep 2018 at 15:19, Paul Emmerich <paul.emmer...@croit.io> wrote:
> Hi,
>
> device classes are internally represented as completely independent trees/roots; showing them in one tree is just syntactic sugar.
>
> For example, if you have a hierarchy like root --> host1, host2, host3 --> nvme/ssd/sata OSDs, then you'll actually have 3 trees:
>
> root~ssd -> host1~ssd, host2~ssd ...
> root~sata -> host~sata, ...
>
> Paul
>
> 2018-09-20 14:54 GMT+02:00 Kevin Olbrich :
> > Hi!
> >
> > Currently I have a cluster with four hosts and 4x HDDs + 4 SSDs per host.
> > I also have replication rules to distinguish between HDD and SSD (and failure-domain set to rack) which are mapped to pools.
> >
> > What happens if I add a heterogeneous host with 1x SSD and 1x NVMe (where NVMe will be a new device-class based rule)?
> >
> > Will the crush weight be calculated from the OSDs up to the failure-domain based on the crush rule?
> > The only crush-weights I know and see are those shown by "ceph osd tree".
> >
> > Kind regards
> > Kevin
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
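Following Paul's explanation, adding the NVMe-class rule from the original question is a one-liner; CRUSH builds the separate shadow tree for the class automatically. A sketch with example rule/pool names:

ceph osd crush rule create-replicated nvme_rule default host nvme
ceph osd pool set mypool crush_rule nvme_rule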
Re: [ceph-users] Crush distribution with heterogeneous device classes and failure domain hosts
To answer my own question: ceph osd crush tree --show-shadow

Sorry for the noise...

On Thu, 20 Sep 2018 at 14:54, Kevin Olbrich wrote:
> Hi!
>
> Currently I have a cluster with four hosts and 4x HDDs + 4 SSDs per host.
> I also have replication rules to distinguish between HDD and SSD (and failure-domain set to rack) which are mapped to pools.
>
> What happens if I add a heterogeneous host with 1x SSD and 1x NVMe (where NVMe will be a new device-class based rule)?
>
> Will the crush weight be calculated from the OSDs up to the failure-domain based on the crush rule?
> The only crush-weights I know and see are those shown by "ceph osd tree".
>
> Kind regards
> Kevin
[ceph-users] Crush distribution with heterogeneous device classes and failure domain hosts
Hi!

Currently I have a cluster with four hosts and 4x HDDs + 4 SSDs per host. I also have replication rules to distinguish between HDD and SSD (and failure-domain set to rack) which are mapped to pools.

What happens if I add a heterogeneous host with 1x SSD and 1x NVMe (where NVMe will be a new device-class based rule)?

Will the crush weight be calculated from the OSDs up to the failure-domain based on the crush rule? The only crush-weights I know and see are those shown by "ceph osd tree".

Kind regards
Kevin
Re: [ceph-users] Mimic upgrade failure
Majority of the clients are luminous with a few kraken stragglers. I looked at ceph features and 'ceph daemon mon.sephmon1 sessions'. Nothing is reporting as having mimic features, all mon,mgr,osd are running 13.2.1 but are reporting luminous features, and majority of the luminous clients are reporting jewel features. I shut down my compute cluster to get rid of majority of the clients that are reporting jewel features, and there is still a lot of time spent by ms_dispatch in ceph::decode My small mimic test cluster actually shows similar in it's features, mon,mgr,mds,osd all report luminous features yet have 13.2.1 installed, so maybe that is normal. Kevin On 09/19/2018 09:35 AM, Sage Weil wrote: It's hard to tell exactly from the below, but it looks to me like there is still a lot of OSDMap reencoding going on. Take a look at 'ceph features' output and see who in the cluster is using pre-luminous features.. I'm guessing all of the clients? For any of those sessions, fetching OSDMaps from the cluster will require reencoding. If it's all clients (well, non-OSDs), I think we could work around it by avoiding the reencode entirely (it is only really there for OSDs, which want a perfect OSDMap copy that will match the monitor's CRC). sage On Wed, 19 Sep 2018, KEVIN MICHAEL HRPCEK wrote: I set mon lease = 30 yesterday and it had no effect on the quorum election. To give you an idea of how much cpu ms_dispatch is using, from the last mon restart about 7.5 hours ago, the ms_dispatch thread has 5h 40m of cpu time. Below are 2 snippets from perf top. I took them while ms_dispatch was 100% of a core, the first is using the pid of the ceph-mon, the second is the pid of the ms_dispatch thread. The last thing is a snippet from stracing the ms_dispatch pid. It is running through all of the sst files. perf top ceph-mon Overhead Shared Object Symbol 17.71% libceph-common.so.0 [.] ceph::decode >, std::less, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair > > 6.67% libceph-common.so.0 [.] ceph::buffer::ptr::release 5.35% libceph-common.so.0 [.] std::_Rb_tree > >, std::_Select1st > > >, std::less, mempoo 5.20% libceph-common.so.0 [.] ceph::buffer::ptr::append 5.12% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl::copy 4.66% libceph-common.so.0 [.] ceph::buffer::list::append 4.33% libstdc++.so.6.0.19 [.] std::_Rb_tree_increment 4.27% ceph-mon [.] std::_Rb_tree > >, std::_Select1st > > >, std::less, mempoo 4.18% libceph-common.so.0 [.] ceph::buffer::list::append 3.10% libceph-common.so.0 [.] ceph::decode >, denc_traits >, void> > 2.90% libceph-common.so.0 [.] ceph::encode >, std::less, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair > > 2.56% libceph-common.so.0 [.] ceph::buffer::ptr::ptr 2.50% libstdc++.so.6.0.19 [.] std::_Rb_tree_insert_and_rebalance 2.39% libceph-common.so.0 [.] ceph::buffer::ptr::copy_out 2.33% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl::advance 2.21% libtcmalloc.so.4.4.5 [.] tcmalloc::CentralFreeList::FetchFromOneSpans 1.97% libtcmalloc.so.4.4.5 [.] tcmalloc::CentralFreeList::ReleaseToSpans 1.60% libceph-common.so.0 [.] crc32_iscsi_00 1.42% libtcmalloc.so.4.4.5 [.] operator new[] 1.29% libceph-common.so.0 [.] ceph::buffer::ptr::unused_tail_length 1.28% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl::copy_shallow 1.25% libceph-common.so.0 [.] ceph::buffer::ptr::raw_length@plt 1.06% libceph-common.so.0 [.] ceph::buffer::ptr::end_c_str 1.06% libceph-common.so.0 [.] ceph::buffer::list::iterator::copy 0.99% libc-2.17.so [.] 
__memcpy_ssse3_back 0.94% libc-2.17.so [.] _IO_default_xsputn 0.89% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl::advance@plt 0.87% libtcmalloc.so.4.4.5 [.] tcmalloc::ThreadCache::ReleaseToCentralCache 0.76% libleveldb.so.1.0.7 [.] leveldb::FindFile 0.72% [vdso][.] __vdso_clock_gettime 0.67% ceph-mon [.] std::_Rb_tree > >, std::_Select1st > > >, std::less, mempoo 0.63% libtcmalloc.so.4.4.5 [.] tc_deletearray_nothrow 0.59% libceph-common.so.0 [.] ceph::buffer::list::iterator::advance 0.52% libceph-common.so.0 [.] ceph::buffer::list::iterator::get_current_ptr perf top ms_dispatch 11.88% libceph-common.so.0 [.] ceph::decode >, std::less, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair > > 11.23% [kernel] [k] system_call_after_swapgs 9.36% libceph-common.so.0 [.] crush_hash32_3 6.55% libceph-common.so.0 [.] crush_choose_indep 4.39% [kernel] [k] smp_call_function_many 3.17% libceph-common.so.0 [.] ceph::buffer::list::append
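For anyone repeating the feature check described above, the commands referenced in this exchange are (mon name taken from the thread):

ceph features                        # feature bits and releases, grouped by daemon/client type
ceph versions                        # installed daemon versions across the cluster
ceph daemon mon.sephmon1 sessions    # per-connection details on one mon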
Re: [ceph-users] Mimic upgrade failure
t;, void> > 1.07% libtcmalloc.so.4.4.5 [.] operator new[] 1.02% libceph-common.so.0 [.] ceph::buffer::list::iterator::copy 1.01% libtcmalloc.so.4.4.5 [.] tc_posix_memalign 0.85% ceph-mon [.] ceph::buffer::ptr::release@plt 0.76% libceph-common.so.0 [.] ceph::buffer::ptr::copy_out@plt 0.74% libceph-common.so.0 [.] crc32_iscsi_00 strace munmap(0x7f2eda736000, 2463941) = 0 open("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299339.sst", O_RDONLY) = 429 stat("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299339.sst", {st_mode=S_IFREG|0644, st_size=1658656, ...}) = 0 mmap(NULL, 1658656, PROT_READ, MAP_SHARED, 429, 0) = 0x7f2eea87e000 close(429) = 0 munmap(0x7f2ea8c97000, 2468005) = 0 open("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299338.sst", O_RDONLY) = 429 stat("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299338.sst", {st_mode=S_IFREG|0644, st_size=2484001, ...}) = 0 mmap(NULL, 2484001, PROT_READ, MAP_SHARED, 429, 0) = 0x7f2eda74b000 close(429) = 0 munmap(0x7f2ee21dc000, 2472343) = 0 Kevin On 09/19/2018 06:50 AM, Sage Weil wrote: On Wed, 19 Sep 2018, KEVIN MICHAEL HRPCEK wrote: Sage, Unfortunately the mon election problem came back yesterday and it makes it really hard to get a cluster to stay healthy. A brief unexpected network outage occurred and sent the cluster into a frenzy and when I had it 95% healthy the mons started their nonstop reelections. In the previous logs I sent were you able to identify why the mons are constantly electing? The elections seem to be triggered by the below paxos message but do you know which lease timeout is being reached or why the lease isn't renewed instead of calling for an election? One thing I tried was to shutdown the entire cluster and bring up only the mon and mgr. The mons weren't able to hold their quorum with no osds running and the ceph-mon ms_dispatch thread runs at 100% for > 60s at a time. This is odd... with no other dameons running I'm not sure what would be eating up the CPU. Can you run a 'perf top -p `pidof ceph-mon`' (or similar) on the machine to see what the process is doing? You might need to install ceph-mon-dbg or ceph-debuginfo to get better symbols. 2018-09-19 03:56:21.729 7f4344ec1700 1 mon.sephmon2@1(peon).paxos(paxos active c 133382665..133383355) lease_timeout -- calling new election A workaround is probably to increase the lease timeout. Try setting mon_lease = 15 (default is 5... could also go higher than 15) in the ceph.conf for all of the mons. This is a bit of a band-aid but should help you keep the mons in quorum until we sort out what is going on. sage Thanks Kevin On 09/10/2018 07:06 AM, Sage Weil wrote: I took a look at the mon log you sent. A few things I noticed: - The frequent mon elections seem to get only 2/3 mons about half of the time. - The messages coming in a mostly osd_failure, and half of those seem to be recoveries (cancellation of the failure message). It does smell a bit like a networking issue, or some tunable that relates to the messaging layer. It might be worth looking at an OSD log for an osd that reported a failure and seeing what error code it coming up on the failed ping connection? That might provide a useful hint (e.g., ECONNREFUSED vs EMFILE or something). I'd also confirm that with nodown set the mon quorum stabilizes... sage On Mon, 10 Sep 2018, Kevin Hrpcek wrote: Update for the list archive. I went ahead and finished the mimic upgrade with the osds in a fluctuating state of up and down. 
The cluster did start to normalize a lot easier after everything was on mimic since the random mass OSD heartbeat failures stopped and the constant mon election problem went away. I'm still battling with the cluster reacting poorly to host reboots or small map changes, but I feel like my current pg:osd ratio may be playing a factor in that since we are 2x normal pg count while migrating data to new EC pools. I'm not sure of the root cause but it seems like the mix of luminous and mimic did not play well together for some reason. Maybe it has to do with the scale of my cluster, 871 osd, or maybe I've missed some some tuning as my cluster has scaled to this size. Kevin On 09/09/2018 12:49 PM, Kevin Hrpcek wrote: Nothing too crazy for non default settings. Some of those osd settings were in place while I was testing recovery speeds and need to be brought back closer to defaults. I was setting nodown before but it seems to mask the problem. While its good to stop the osdmap changes, OSDs would come up, get marked up, but at some point go down again (but the process is still running) and still stay up in the map. Then when I'd unset nodown the cluster would immediately mark 250+ osd down again and i'd be back where I started. This morning I went ahead and finished the osd upgrades to mimic to remove that variabl
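The mon_lease band-aid Sage describes, as a ceph.conf sketch (15 is his starting point here; earlier in the archive the thread lands on 75 once the sst read time was measured):

[mon]
mon lease = 15    # default 5

# then restart the mons, one host at a time
systemctl restart ceph-mon.target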
Re: [ceph-users] Mimic upgrade failure
Sage, Unfortunately the mon election problem came back yesterday and it makes it really hard to get a cluster to stay healthy. A brief unexpected network outage occurred and sent the cluster into a frenzy and when I had it 95% healthy the mons started their nonstop reelections. In the previous logs I sent were you able to identify why the mons are constantly electing? The elections seem to be triggered by the below paxos message but do you know which lease timeout is being reached or why the lease isn't renewed instead of calling for an election? One thing I tried was to shutdown the entire cluster and bring up only the mon and mgr. The mons weren't able to hold their quorum with no osds running and the ceph-mon ms_dispatch thread runs at 100% for > 60s at a time. 2018-09-19 03:56:21.729 7f4344ec1700 1 mon.sephmon2@1(peon).paxos(paxos active c 133382665..133383355) lease_timeout -- calling new election Thanks Kevin On 09/10/2018 07:06 AM, Sage Weil wrote: I took a look at the mon log you sent. A few things I noticed: - The frequent mon elections seem to get only 2/3 mons about half of the time. - The messages coming in a mostly osd_failure, and half of those seem to be recoveries (cancellation of the failure message). It does smell a bit like a networking issue, or some tunable that relates to the messaging layer. It might be worth looking at an OSD log for an osd that reported a failure and seeing what error code it coming up on the failed ping connection? That might provide a useful hint (e.g., ECONNREFUSED vs EMFILE or something). I'd also confirm that with nodown set the mon quorum stabilizes... sage On Mon, 10 Sep 2018, Kevin Hrpcek wrote: Update for the list archive. I went ahead and finished the mimic upgrade with the osds in a fluctuating state of up and down. The cluster did start to normalize a lot easier after everything was on mimic since the random mass OSD heartbeat failures stopped and the constant mon election problem went away. I'm still battling with the cluster reacting poorly to host reboots or small map changes, but I feel like my current pg:osd ratio may be playing a factor in that since we are 2x normal pg count while migrating data to new EC pools. I'm not sure of the root cause but it seems like the mix of luminous and mimic did not play well together for some reason. Maybe it has to do with the scale of my cluster, 871 osd, or maybe I've missed some some tuning as my cluster has scaled to this size. Kevin On 09/09/2018 12:49 PM, Kevin Hrpcek wrote: Nothing too crazy for non default settings. Some of those osd settings were in place while I was testing recovery speeds and need to be brought back closer to defaults. I was setting nodown before but it seems to mask the problem. While its good to stop the osdmap changes, OSDs would come up, get marked up, but at some point go down again (but the process is still running) and still stay up in the map. Then when I'd unset nodown the cluster would immediately mark 250+ osd down again and i'd be back where I started. This morning I went ahead and finished the osd upgrades to mimic to remove that variable. I've looked for networking problems but haven't found any. 2 of the mons are on the same switch. I've also tried combinations of shutting down a mon to see if a single one was the problem, but they keep electing no matter the mix of them that are up. Part of it feels like a networking problem but I haven't been able to find a culprit yet as everything was working normally before starting the upgrade. 
Other than the constant mon elections, yesterday I had the cluster 95% healthy 3 or 4 times, but it doesn't last long since at some point the OSDs start trying to fail each other through their heartbeats. 2018-09-09 17:37:29.079 7eff774f5700 1 mon.sephmon1@0(leader).osd e991282 prepare_failure osd.39 10.1.9.2:6802/168438 from osd.49 10.1.9.3:6884/317908 is reporting failure:1 2018-09-09 17:37:29.079 7eff774f5700 0 log_channel(cluster) log [DBG] : osd.39 10.1.9.2:6802/168438 reported failed by osd.49 10.1.9.3:6884/317908 2018-09-09 17:37:29.083 7eff774f5700 1 mon.sephmon1@0(leader).osd e991282 prepare_failure osd.93 10.1.9.9:6853/287469 from osd.372 10.1.9.13:6801/275806 is reporting failure:1 I'm working on getting things mostly good again with everything on mimic and will see if it behaves better. Thanks for your input on this David. [global] mon_initial_members = sephmon1, sephmon2, sephmon3 mon_host = 10.1.9.201,10.1.9.202,10.1.9.203 auth_cluster_required = cephx auth_service_required = cephx auth_client_required = cephx filestore_xattr_use_omap = true public_network = 10.1.0.0/16 osd backfill full ratio = 0.92 osd failsafe nearfull ratio = 0.90 osd max object size = 21474836480 mon max pg per osd = 350 [mon] mon warn on legacy crush tunables = false mon pg warn max per osd = 300 mon osd down out subtree limit = host mon osd nearfull ratio = 0.90 mon osd full ratio = 0.97 mon hea
[ceph-users] (no subject)
Hi!

Is the compressible hint / incompressible hint supported on qemu+kvm? http://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/

If not, only aggressive would work in this case for rbd, right?

Kind regards
Kevin
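For reference, the mode in question is a per-pool BlueStore setting; a sketch of forcing compression regardless of client hints (the pool name is an example):

ceph osd pool set rbd_data compression_mode aggressive
ceph osd pool set rbd_data compression_algorithm snappy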
Re: [ceph-users] Mimic upgrade failure
I couldn't find any sign of a networking issue at the OS or switches. No changes have been made in those to get the cluster stable again. I looked through a couple OSD logs and here is a selection of some of most frequent errors they were getting. Maybe something below is more obvious to you. 2018-09-09 18:17:33.245 7feb92079700 2 osd.84 991324 ms_handle_refused con 0x560e428b9800 session 0x560eb26b0060 2018-09-09 18:17:33.245 7feb9307b700 2 osd.84 991324 ms_handle_refused con 0x560ea639f000 session 0x560eb26b0060 2018-09-09 18:18:55.919 7feb9307b700 10 osd.84 991337 heartbeat_reset failed hb con 0x560e424a3600 for osd.20, reopening 2018-09-09 18:18:55.919 7feb9307b700 2 osd.84 991337 ms_handle_refused con 0x560e447df600 session 0x560e9ec37680 2018-09-09 18:18:55.919 7feb92079700 2 osd.84 991337 ms_handle_refused con 0x560e427a5600 session 0x560e9ec37680 2018-09-09 18:18:55.935 7feb92079700 10 osd.84 991337 heartbeat_reset failed hb con 0x560e40afcc00 for osd.18, reopening 2018-09-09 18:18:55.935 7feb92079700 2 osd.84 991337 ms_handle_refused con 0x560e44398c00 session 0x560e6a3a0620 2018-09-09 18:18:55.935 7feb9307b700 2 osd.84 991337 ms_handle_refused con 0x560e42f4ea00 session 0x560e6a3a0620 2018-09-09 18:18:55.939 7feb9307b700 10 osd.84 991337 heartbeat_reset failed hb con 0x560e424c1e00 for osd.9, reopening 2018-09-09 18:18:55.940 7feb9307b700 2 osd.84 991337 ms_handle_refused con 0x560ea4d09600 session 0x560e115e8120 2018-09-09 18:18:55.940 7feb92079700 2 osd.84 991337 ms_handle_refused con 0x560e424a3600 session 0x560e115e8120 2018-09-09 18:18:55.956 7febadf54700 20 osd.84 991337 share_map_peer 0x560e411ca600 already has epoch 991337 2018-09-09 18:24:59.595 7febae755700 10 osd.84 991362 new session 0x560e40b5ce00 con=0x560e42471800 addr=10.1.9.13:6836/2276068 2018-09-09 18:24:59.595 7febae755700 10 osd.84 991362 session 0x560e40b5ce00 osd.376 has caps osdcap[grant(*)] 'allow *' 2018-09-09 18:24:59.596 7feb9407d700 2 osd.84 991362 ms_handle_reset con 0x560e42471800 session 0x560e40b5ce00 2018-09-09 18:24:59.606 7feb9407d700 2 osd.84 991362 ms_handle_refused con 0x560e42d04600 session 0x560e10dfd000 2018-09-09 18:24:59.633 7febad753700 10 osd.84 991362 OSD::ms_get_authorizer type=osd 2018-09-09 18:24:59.633 7febad753700 10 osd.84 991362 ms_get_authorizer bailing, we are shutting down 2018-09-09 18:24:59.633 7febad753700 0 -- 10.1.9.9:6848/4287624 >> 10.1.9.12:6801/2269104 conn(0x560e42326a00 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=18630 cs=1 l=0).handle_connect_reply connect got BADAUTHORIZER 2018-09-09 18:22:56.434 7febadf54700 0 cephx: verify_authorizer could not decrypt ticket info: error: bad magic in decode_decrypt, 3995972256093848467 != 18374858748799134293 2018-09-09 18:22:56.434 7febadf54700 0 -- 10.1.9.9:6848/4287624 >> 10.1.9.12:6801/2269104 conn(0x560e41fad600 :6848 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer 2018-09-10 03:30:17.324 7ff0ab678700 -1 osd.84 992286 heartbeat_check: no reply from 10.1.9.28:6843 osd.578 since back 2018-09-10 03:15:35.358240 front 2018-09-10 03:15:47.879015 (cutoff 2018-09-10 03:29:17.326329) Kevin On 09/10/2018 07:06 AM, Sage Weil wrote: I took a look at the mon log you sent. A few things I noticed: - The frequent mon elections seem to get only 2/3 mons about half of the time. - The messages coming in a mostly osd_failure, and half of those seem to be recoveries (cancellation of the failure message). 
It does smell a bit like a networking issue, or some tunable that relates to the messaging layer. It might be worth looking at an OSD log for an osd that reported a failure and seeing what error code it coming up on the failed ping connection? That might provide a useful hint (e.g., ECONNREFUSED vs EMFILE or something). I'd also confirm that with nodown set the mon quorum stabilizes... sage On Mon, 10 Sep 2018, Kevin Hrpcek wrote: Update for the list archive. I went ahead and finished the mimic upgrade with the osds in a fluctuating state of up and down. The cluster did start to normalize a lot easier after everything was on mimic since the random mass OSD heartbeat failures stopped and the constant mon election problem went away. I'm still battling with the cluster reacting poorly to host reboots or small map changes, but I feel like my current pg:osd ratio may be playing a factor in that since we are 2x normal pg count while migrating data to new EC pools. I'm not sure of the root cause but it seems like the mix of luminous and mimic did not play well together for some reason. Maybe it has to do with the scale of my cluster, 871 osd, or maybe I've missed some some tuning as my cluster has scaled to this size. Kevin On 09/09/2018 12:49 PM, Kevin Hrpcek wrote: Nothing too crazy for non default settings. Some of those osd settings were in place while I was testing recovery speeds and need to be brought
[ceph-users] nfs-ganesha FSAL CephFS: nfs_health :DBUS :WARN :Health status is unhealthy
Hi!

Today one of our nfs-ganesha gateways experienced an outage and now crashes every time the client behind it tries to access the data. This is a Ceph Mimic cluster with nfs-ganesha from the ceph repos:

nfs-ganesha-2.6.2-0.1.el7.x86_64
nfs-ganesha-ceph-2.6.2-0.1.el7.x86_64

There were fixes for this problem in 2.6.3: https://github.com/nfs-ganesha/nfs-ganesha/issues/339

Can the build in the repos be compiled against this bugfix release? Thank you very much.

Kind regards
Kevin
Re: [ceph-users] Mimic upgrade failure
Update for the list archive.

I went ahead and finished the mimic upgrade with the osds in a fluctuating state of up and down. The cluster did start to normalize a lot easier after everything was on mimic, since the random mass OSD heartbeat failures stopped and the constant mon election problem went away. I'm still battling with the cluster reacting poorly to host reboots or small map changes, but I feel like my current pg:osd ratio may be playing a factor in that since we are at 2x the normal pg count while migrating data to new EC pools.

I'm not sure of the root cause, but it seems like the mix of luminous and mimic did not play well together for some reason. Maybe it has to do with the scale of my cluster, 871 osd, or maybe I've missed some tuning as my cluster has scaled to this size.

Kevin

On 09/09/2018 12:49 PM, Kevin Hrpcek wrote: Nothing too crazy for non default settings. Some of those osd settings were in place while I was testing recovery speeds and need to be brought back closer to defaults. I was setting nodown before but it seems to mask the problem. While its good to stop the osdmap changes, OSDs would come up, get marked up, but at some point go down again (but the process is still running) and still stay up in the map. Then when I'd unset nodown the cluster would immediately mark 250+ osd down again and i'd be back where I started. This morning I went ahead and finished the osd upgrades to mimic to remove that variable. I've looked for networking problems but haven't found any. 2 of the mons are on the same switch. I've also tried combinations of shutting down a mon to see if a single one was the problem, but they keep electing no matter the mix of them that are up. Part of it feels like a networking problem but I haven't been able to find a culprit yet as everything was working normally before starting the upgrade. Other than the constant mon elections, yesterday I had the cluster 95% healthy 3 or 4 times, but it doesn't last long since at some point the OSDs start trying to fail each other through their heartbeats. 2018-09-09 17:37:29.079 7eff774f5700 1 mon.sephmon1@0(leader).osd e991282 prepare_failure osd.39 10.1.9.2:6802/168438 from osd.49 10.1.9.3:6884/317908 is reporting failure:1 2018-09-09 17:37:29.079 7eff774f5700 0 log_channel(cluster) log [DBG] : osd.39 10.1.9.2:6802/168438 reported failed by osd.49 10.1.9.3:6884/317908 2018-09-09 17:37:29.083 7eff774f5700 1 mon.sephmon1@0(leader).osd e991282 prepare_failure osd.93 10.1.9.9:6853/287469 from osd.372 10.1.9.13:6801/275806 is reporting failure:1 I'm working on getting things mostly good again with everything on mimic and will see if it behaves better. Thanks for your input on this David.
[global] mon_initial_members = sephmon1, sephmon2, sephmon3 mon_host = 10.1.9.201,10.1.9.202,10.1.9.203 auth_cluster_required = cephx auth_service_required = cephx auth_client_required = cephx filestore_xattr_use_omap = true public_network = 10.1.0.0/16 osd backfill full ratio = 0.92 osd failsafe nearfull ratio = 0.90 osd max object size = 21474836480 mon max pg per osd = 350 [mon] mon warn on legacy crush tunables = false mon pg warn max per osd = 300 mon osd down out subtree limit = host mon osd nearfull ratio = 0.90 mon osd full ratio = 0.97 mon health preluminous compat warning = false osd heartbeat grace = 60 rocksdb cache size = 1342177280 [mds] mds log max segments = 100 mds log max expiring = 40 mds bal fragment size max = 20 mds cache memory limit = 4294967296 [osd] osd mkfs options xfs = -i size=2048 -d su=512k,sw=1 osd recovery delay start = 30 osd recovery max active = 5 osd max backfills = 3 osd recovery threads = 2 osd crush initial weight = 0 osd heartbeat interval = 30 osd heartbeat grace = 60 On 09/08/2018 11:24 PM, David Turner wrote: What osd/mon/etc config settings do you have that are not default? It might be worth utilizing nodown to stop osds from marking each other down and finish the upgrade to be able to set the minimum osd version to mimic. Stop the osds in a node, manually mark them down, start them back up in mimic. Depending on how bad things are, setting pause on the cluster to just finish the upgrade faster might not be a bad idea either. This should be a simple question, have you confirmed that there are no networking problems between the MONs while the elections are happening? On Sat, Sep 8, 2018, 7:52 PM Kevin Hrpcek <mailto:kevin.hrp...@ssec.wisc.edu>> wrote: Hey Sage, I've posted the file with my email address for the user. It is with debug_mon 20/20, debug_paxos 20/20, and debug ms 1/5. The mons are calling for elections about every minute so I let this run for a few elections and saw this node become the leader a couple times. Debug logs start around 23:27:30. I had managed to get about 850/857 osds up, but it seems that within the last 30 min it has all gone bad again due to the OSDs repor
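David's nodown suggestion above, spelled out as a sketch (the OSD id is an example; pause freezes all client IO, so use it with care):

ceph osd set nodown
ceph osd set pause                   # optional: freeze client IO for the duration
systemctl stop ceph-osd@39           # per host: stop, upgrade packages, then...
ceph osd down 39                     # mark down by hand, since nodown blocks the automatic mark
systemctl start ceph-osd@39
ceph osd unset pause
ceph osd unset nodown
ceph osd require-osd-release mimic   # once every OSD is running mimic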
Re: [ceph-users] Mimic upgrade failure
Nothing too crazy for non default settings. Some of those osd settings were in place while I was testing recovery speeds and need to be brought back closer to defaults. I was setting nodown before but it seems to mask the problem. While its good to stop the osdmap changes, OSDs would come up, get marked up, but at some point go down again (but the process is still running) and still stay up in the map. Then when I'd unset nodown the cluster would immediately mark 250+ osd down again and i'd be back where I started. This morning I went ahead and finished the osd upgrades to mimic to remove that variable. I've looked for networking problems but haven't found any. 2 of the mons are on the same switch. I've also tried combinations of shutting down a mon to see if a single one was the problem, but they keep electing no matter the mix of them that are up. Part of it feels like a networking problem but I haven't been able to find a culprit yet as everything was working normally before starting the upgrade. Other than the constant mon elections, yesterday I had the cluster 95% healthy 3 or 4 times, but it doesn't last long since at some point the OSDs start trying to fail each other through their heartbeats. 2018-09-09 17:37:29.079 7eff774f5700 1 mon.sephmon1@0(leader).osd e991282 prepare_failure osd.39 10.1.9.2:6802/168438 from osd.49 10.1.9.3:6884/317908 is reporting failure:1 2018-09-09 17:37:29.079 7eff774f5700 0 log_channel(cluster) log [DBG] : osd.39 10.1.9.2:6802/168438 reported failed by osd.49 10.1.9.3:6884/317908 2018-09-09 17:37:29.083 7eff774f5700 1 mon.sephmon1@0(leader).osd e991282 prepare_failure osd.93 10.1.9.9:6853/287469 from osd.372 10.1.9.13:6801/275806 is reporting failure:1 I'm working on getting things mostly good again with everything on mimic and will see if it behaves better. Thanks for your input on this David. [global] mon_initial_members = sephmon1, sephmon2, sephmon3 mon_host = 10.1.9.201,10.1.9.202,10.1.9.203 auth_cluster_required = cephx auth_service_required = cephx auth_client_required = cephx filestore_xattr_use_omap = true public_network = 10.1.0.0/16 osd backfill full ratio = 0.92 osd failsafe nearfull ratio = 0.90 osd max object size = 21474836480 mon max pg per osd = 350 [mon] mon warn on legacy crush tunables = false mon pg warn max per osd = 300 mon osd down out subtree limit = host mon osd nearfull ratio = 0.90 mon osd full ratio = 0.97 mon health preluminous compat warning = false osd heartbeat grace = 60 rocksdb cache size = 1342177280 [mds] mds log max segments = 100 mds log max expiring = 40 mds bal fragment size max = 20 mds cache memory limit = 4294967296 [osd] osd mkfs options xfs = -i size=2048 -d su=512k,sw=1 osd recovery delay start = 30 osd recovery max active = 5 osd max backfills = 3 osd recovery threads = 2 osd crush initial weight = 0 osd heartbeat interval = 30 osd heartbeat grace = 60 On 09/08/2018 11:24 PM, David Turner wrote: What osd/mon/etc config settings do you have that are not default? It might be worth utilizing nodown to stop osds from marking each other down and finish the upgrade to be able to set the minimum osd version to mimic. Stop the osds in a node, manually mark them down, start them back up in mimic. Depending on how bad things are, setting pause on the cluster to just finish the upgrade faster might not be a bad idea either. This should be a simple question, have you confirmed that there are no networking problems between the MONs while the elections are happening? 
On Sat, Sep 8, 2018, 7:52 PM Kevin Hrpcek <mailto:kevin.hrp...@ssec.wisc.edu>> wrote:

Hey Sage,

I've posted the file with my email address for the user. It is with debug_mon 20/20, debug_paxos 20/20, and debug ms 1/5. The mons are calling for elections about every minute, so I let this run for a few elections and saw this node become the leader a couple times. Debug logs start around 23:27:30.

I had managed to get about 850/857 osds up, but it seems that within the last 30 min it has all gone bad again due to the OSDs reporting each other as failed. We relaxed the osd_heartbeat_interval to 30 and osd_heartbeat_grace to 60 in an attempt to slow down how quickly OSDs are trying to fail each other. I'll put in the rocksdb_cache_size setting. Thanks for taking a look.

Kevin

On 09/08/2018 06:04 PM, Sage Weil wrote:
Hi Kevin,

I can't think of any major luminous->mimic changes off the top of my head that would impact CPU usage, but it's always possible there is something subtle. Can you ceph-post-file the full log from one of your mons (preferably the leader)?

You might try adjusting the rocksdb cache size.. try setting

rocksdb_cache_size = 1342177280   # 10x the default, ~1.3 GB

on the mons and restarting?

Thanks!
sage

On Sat, 8 Sep 2018, Kevin Hrpcek wrote:
Hello, I've had a Luminous ...
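A rough sketch of applying the debug levels and heartbeat settings mentioned above at runtime (the log path is an assumption; adjust for your own mon, and verify the injected settings take effect):

  ceph tell mon.\* injectargs '--debug_mon 20/20 --debug_paxos 20/20 --debug_ms 1/5'
  ceph tell osd.\* injectargs '--osd_heartbeat_interval 30 --osd_heartbeat_grace 60'
  # upload a log for the devs; prints an identifier to share on the list
  ceph-post-file -d "mon election debug log" /var/log/ceph/ceph-mon.sephmon1.log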
Re: [ceph-users] Mimic upgrade failure
Hey Sage,

I've posted the file with my email address for the user. It is with debug_mon 20/20, debug_paxos 20/20, and debug ms 1/5. The mons are calling for elections about every minute, so I let this run for a few elections and saw this node become the leader a couple times. Debug logs start around 23:27:30.

I had managed to get about 850/857 osds up, but it seems that within the last 30 min it has all gone bad again due to the OSDs reporting each other as failed. We relaxed the osd_heartbeat_interval to 30 and osd_heartbeat_grace to 60 in an attempt to slow down how quickly OSDs are trying to fail each other. I'll put in the rocksdb_cache_size setting. Thanks for taking a look.

Kevin

On 09/08/2018 06:04 PM, Sage Weil wrote:
Hi Kevin,

I can't think of any major luminous->mimic changes off the top of my head that would impact CPU usage, but it's always possible there is something subtle. Can you ceph-post-file the full log from one of your mons (preferably the leader)?

You might try adjusting the rocksdb cache size.. try setting

rocksdb_cache_size = 1342177280   # 10x the default, ~1.3 GB

on the mons and restarting?

Thanks!
sage

On Sat, 8 Sep 2018, Kevin Hrpcek wrote:
Hello,

I've had a Luminous -> Mimic upgrade go very poorly and my cluster is stuck with almost all pgs down. One problem is that the mons have started to re-elect a new quorum leader almost every minute. This is making it difficult to monitor the cluster and even run any commands on it, since at least half the time a ceph command times out or takes over a minute to return results. I've looked at the debug logs and it appears there is some timeout occurring with paxos of about a minute. The msg_dispatch thread of the mons is often running a core at 100% for about a minute (user time, no iowait). Running strace on it shows the process is going through all of the mon db files (about 6 GB in store.db/*.sst). Does anyone have an idea of what this timeout is or why my mons are always re-electing? One theory I have is that msg_dispatch can't process the SSTs fast enough and hits some timeout for a health check, and the mon drops itself from the quorum since it thinks it isn't healthy. I've been thinking of introducing a new mon to the cluster on hardware with a better cpu to see if that can process the SSTs within this timeout.

My cluster has the mons, mds, mgr and 30/41 osd servers on mimic, and 11/41 osd servers on luminous. The original problem started when I restarted the osds on one of the hosts. The cluster reacted poorly to them going down and went into a frenzy of taking down other osds and remapping. I eventually got that stable and the PGs were 90% good with the finish line in sight, and then the mons started their issue of re-electing every minute. Now I can't keep any decent amount of PGs up for more than a few hours. This started on Wednesday. Any help would be greatly appreciated.

Thanks,
Kevin

--Debug snippet from a mon at reelection time
2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).mds e14242 maybe_resize_cluster in 1 max 1
2018-09-07 20:08:08.655 7f57b92cd700 4 mon.sephmon2@1(leader).mds e14242 tick: resetting beacon timeouts due to mon delay (slow election?) of 59.8106s seconds
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(mdsmap 13504..14242) maybe_trim trim_to 13742 would only trim 238 < paxos_service_trim_min 250
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 auth
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 check_rotate updated rotating
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(auth 120594..120657) propose_pending
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 encode_pending v 120658
2018-09-07 20:08:08.655 7f57b92cd700 5 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) queue_pending_finisher 0x55dce8e5b370
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) trigger_propose not active, will propose later
2018-09-07 20:08:08.655 7f57b92cd700 4 mon.sephmon2@1(leader).mgr e2234 tick: resetting beacon timeouts due to mon delay (slow election?) of 59.8844s seconds
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(mgr 1513..2234) maybe_trim trim_to 1734 would only trim 221 < paxos_service_trim_min 250
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).health tick
2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).health check_member_health
2018-09-07 20:08:08.657 7f57bcdd0700 1 -- 10.1.9.202:6789/0 >> - conn(0x55dcee55be00 :6789 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=447 -
2018-09-07 20:08:08.657 7f57bcdd0700 10 mon.sephmon2@1(leader) e17 ms_verify_authorizer 10.1.9.32:6823/4007 osd protocol 0
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).health check_member_health avail 79% total 40 GiB, used 8.4 GiB, avail 32 GiB
[ceph-users] Mimic upgrade failure
Hello,

I've had a Luminous -> Mimic upgrade go very poorly and my cluster is stuck with almost all pgs down. One problem is that the mons have started to re-elect a new quorum leader almost every minute. This is making it difficult to monitor the cluster and even run any commands on it, since at least half the time a ceph command times out or takes over a minute to return results. I've looked at the debug logs and it appears there is some timeout occurring with paxos of about a minute. The msg_dispatch thread of the mons is often running a core at 100% for about a minute (user time, no iowait). Running strace on it shows the process is going through all of the mon db files (about 6 GB in store.db/*.sst). Does anyone have an idea of what this timeout is or why my mons are always re-electing? One theory I have is that msg_dispatch can't process the SSTs fast enough and hits some timeout for a health check, and the mon drops itself from the quorum since it thinks it isn't healthy. I've been thinking of introducing a new mon to the cluster on hardware with a better cpu to see if that can process the SSTs within this timeout.

My cluster has the mons, mds, mgr and 30/41 osd servers on mimic, and 11/41 osd servers on luminous. The original problem started when I restarted the osds on one of the hosts. The cluster reacted poorly to them going down and went into a frenzy of taking down other osds and remapping. I eventually got that stable and the PGs were 90% good with the finish line in sight, and then the mons started their issue of re-electing every minute. Now I can't keep any decent amount of PGs up for more than a few hours. This started on Wednesday. Any help would be greatly appreciated.

Thanks,
Kevin

--Debug snippet from a mon at reelection time
2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).mds e14242 maybe_resize_cluster in 1 max 1
2018-09-07 20:08:08.655 7f57b92cd700 4 mon.sephmon2@1(leader).mds e14242 tick: resetting beacon timeouts due to mon delay (slow election?) of 59.8106s seconds
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(mdsmap 13504..14242) maybe_trim trim_to 13742 would only trim 238 < paxos_service_trim_min 250
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 auth
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 check_rotate updated rotating
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(auth 120594..120657) propose_pending
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 encode_pending v 120658
2018-09-07 20:08:08.655 7f57b92cd700 5 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) queue_pending_finisher 0x55dce8e5b370
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) trigger_propose not active, will propose later
2018-09-07 20:08:08.655 7f57b92cd700 4 mon.sephmon2@1(leader).mgr e2234 tick: resetting beacon timeouts due to mon delay (slow election?) of 59.8844s seconds
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(mgr 1513..2234) maybe_trim trim_to 1734 would only trim 221 < paxos_service_trim_min 250
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).health tick
2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).health check_member_health
2018-09-07 20:08:08.657 7f57bcdd0700 1 -- 10.1.9.202:6789/0 >> - conn(0x55dcee55be00 :6789 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=447 -
2018-09-07 20:08:08.657 7f57bcdd0700 10 mon.sephmon2@1(leader) e17 ms_verify_authorizer 10.1.9.32:6823/4007 osd protocol 0
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).health check_member_health avail 79% total 40 GiB, used 8.4 GiB, avail 32 GiB
2018-09-07 20:08:08.662 7f57b92cd700 20 mon.sephmon2@1(leader).health check_leader_health
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(health 1534..1720) maybe_trim trim_to 1715 would only trim 181 < paxos_service_trim_min 250
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).config tick
2018-09-07 20:08:08.662 7f57b92cd700 20 mon.sephmon2@1(leader) e17 sync_trim_providers
2018-09-07 20:08:08.662 7f57b92cd700 -1 mon.sephmon2@1(leader) e17 get_health_metrics reporting 1940 slow ops, oldest is osd_failure(failed timeout osd.72 10.1.9.9:6800/68904 for 317sec e987498 v987498)
2018-09-07 20:08:08.662 7f57b92cd700 1 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) accept timeout, calling fresh election
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 bootstrap
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 sync_reset_requester
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 unregister_cluster_logger
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.seph
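A sketch of reproducing the thread-level observation described above (the TID is hypothetical; in my experience the dispatch thread shows up named ms_dispatch, but verify on your build):

  top -H -p $(pidof ceph-mon)                  # find the thread pegged at 100% user time
  strace -p 123456 -c                          # 123456 = the busy TID; -c summarizes syscalls
  ls -lh /var/lib/ceph/mon/*/store.db/*.sst    # gauge how much SST data it has to walk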
[ceph-users] SPDK/DPDK with Intel P3700 NVMe pool
Hi!

During our move from filestore to bluestore, we removed several Intel P3700 NVMe from the nodes. Is anyone running a SPDK/DPDK NVMe-only EC pool? Is it working well? The docs are very short about the setup: http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#spdk-usage

I would like to re-use these cards for high-end (max IO) pools for database VMs. Some notes or feedback about the setup (ceph-volume etc.) would be appreciated. Thank you.

Kind regards
Kevin
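For what it's worth, the linked doc boils down to building ceph with SPDK support and pointing bluestore at the NVMe by device selector; a rough sketch based on the mimic-era docs (the serial number is the docs' example value, not one of these cards):

  lspci -vvv -d 8086:0953 | grep "Device Serial Number"   # 8086:0953 = Intel P3700

  # ceph.conf, per OSD:
  [osd]
  bluestore_block_path = spdk:55cd2e404bd73932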
[ceph-users] HDD-only CephFS cluster with EC and without SSD/NVMe
Hi!

I am in the progress of moving a local ("large", 24x1TB) ZFS RAIDZ2 to CephFS. This storage is used for backup images (large sequential reads and writes). To save space and get a RAIDZ2 (RAID6) like setup, I am planning the following profile:

ceph osd erasure-code-profile set myprofile \
  k=3 \
  m=2 \
  ruleset-failure-domain=rack

Performance is not the first priority; this is why I do not plan to outsource WAL/DB (broken NVMe = broken OSDs is more administrative overhead than single OSDs). Disks are attached by SAS multipath; throughput in general is no problem, but I did not test with ceph yet.

Is anyone using CephFS + bluestore + ec 3/2 + without WAL/DB-dev, and is it working well?

Thank you.
Kevin
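Back-of-the-envelope numbers for that profile (plain arithmetic, nothing measured):

  k=3, m=2  ->  each object is cut into 3 data + 2 coding chunks, any 2 of which may be lost
  raw space overhead = (k+m)/k = 5/3 ~ 1.67x
  24 x 1 TB raw / 1.67 ~ 14.4 TB usable (before full ratios and bluestore overhead)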
Re: [ceph-users] Running 12.2.5 without problems, should I upgrade to 12.2.7 or wait for 12.2.8?
On Fri, 10 Aug 2018 at 19:29, wrote:
>
> On 30 July 2018 09:51:23 CEST, Micha Krause wrote:
> >Hi,
>
> Hi Micha,
>
> >I'm running 12.2.5 and I have no problems at the moment.
> >
> >However my servers report daily that they want to upgrade to 12.2.7;
> >is this safe or should I wait for 12.2.8?
>
> I guess you should upgrade to 12.2.7 as soon as you can, especially when

Why? As far as I understood, replicated pools for rbd are out of danger - .6 and .7 were mostly fixes for the known cases. We are not planning any upgrade from 12.2.5 atm. Please correct me if I am wrong.

Kevin

> Quote:
> The v12.2.5 release has a potential data corruption issue with erasure
> coded pools. If you ran v12.2.5 with erasure coding, please see below.
>
> See: https://ceph.com/releases/12-2-7-luminous-released/
>
> Hth
> - Mehmet
>
> >Are there any predictions when the 12.2.8 release will be available?
> >
> >Micha Krause
Re: [ceph-users] v12.2.7 Luminous released
Hi,

on upgrade from 12.2.4 to 12.2.5 the balancer module broke (mgr crashes minutes after the service started). The only solution was to disable the balancer (the service has been running fine since). Is this fixed in 12.2.7? I was unable to locate the bug in the bugtracker.

Kevin

2018-07-17 18:28 GMT+02:00 Abhishek Lekshmanan :
>
> This is the seventh bugfix release of the Luminous v12.2.x long term
> stable release series. This release contains several fixes for
> regressions in the v12.2.6 and v12.2.5 releases. We recommend that
> all users upgrade.
>
> *NOTE* The v12.2.6 release has serious known regressions; while 12.2.6
> wasn't formally announced in the mailing lists or blog, the packages
> were built and available on download.ceph.com since last week. If you
> installed this release, please see the upgrade procedure below.
>
> *NOTE* The v12.2.5 release has a potential data corruption issue with
> erasure coded pools. If you ran v12.2.5 with erasure coding, please see
> below.
>
> The full blog post along with the complete changelog is published at the
> official ceph blog at https://ceph.com/releases/12-2-7-luminous-released/
>
> Upgrading from v12.2.6
> ----------------------
>
> v12.2.6 included an incomplete backport of an optimization for
> BlueStore OSDs that avoids maintaining both the per-object checksum
> and the internal BlueStore checksum. Due to the accidental omission
> of a critical follow-on patch, v12.2.6 corrupts (fails to update) the
> stored per-object checksum value for some objects. This can result in
> an EIO error when trying to read those objects.
>
> #. If your cluster uses FileStore only, no special action is required.
>    This problem only affects clusters with BlueStore.
>
> #. If your cluster has only BlueStore OSDs (no FileStore), then you
>    should enable the following OSD option::
>
>      osd skip data digest = true
>
>    This will avoid setting and start ignoring the full-object digests
>    whenever the primary for a PG is BlueStore.
>
> #. If you have a mix of BlueStore and FileStore OSDs, then you should
>    enable the following OSD option::
>
>      osd distrust data digest = true
>
>    This will avoid setting and start ignoring the full-object digests
>    in all cases. This weakens the data integrity checks for
>    FileStore (although those checks were always only opportunistic).
>
> If your cluster includes BlueStore OSDs and was affected, deep scrubs
> will generate errors about mismatched CRCs for affected objects.
> Currently the repair operation does not know how to correct them
> (since all replicas do not match the expected checksum it does not
> know how to proceed). These warnings are harmless in the sense that
> IO is not affected and the replicas are all still in sync. The number
> of affected objects is likely to drop (possibly to zero) on their own
> over time as those objects are modified. We expect to include a scrub
> improvement in v12.2.8 to clean up any remaining objects.
>
> Additionally, see the notes below, which apply to both v12.2.5 and v12.2.6.
>
> Upgrading from v12.2.5 or v12.2.6
> ---------------------------------
>
> If you used v12.2.5 or v12.2.6 in combination with erasure coded
> pools, there is a small risk of corruption under certain workloads.
> Specifically, when:
>
> * An erasure coded pool is in use
> * The pool is busy with successful writes
> * The pool is also busy with updates that result in an error result to
>   the librados user. RGW garbage collection is the most common
>   example of this (it sends delete operations on objects that don't
>   always exist.)
> * Some OSDs are reasonably busy. One known example of such load is
>   FileStore splitting, although in principle any load on the cluster
>   could also trigger the behavior.
> * One or more OSDs restarts.
>
> This combination can trigger an OSD crash and possibly leave PGs in a state
> where they fail to peer.
>
> Notably, upgrading a cluster involves OSD restarts and as such may
> increase the risk of encountering this bug. For this reason, for
> clusters with erasure coded pools, we recommend the following upgrade
> procedure to minimize risk:
>
> 1. Install the v12.2.7 packages.
> 2. Temporarily quiesce IO to the cluster::
>
>      ceph osd pause
>
> 3. Restart all OSDs and wait for all PGs to become active.
> 4. Resume IO::
>
>      ceph osd unpause
>
> This will cause an availability outage for the duration of the OSD
> restarts. If this is unacceptable, a *more risky* alternative is to
> disable RGW garbage collection (the primary known cause of these rados
> operations) for the duration of the upgrade::
>
> 1. Set ``rgw_enable_gc_threads = false`` in ceph.conf ...
Re: [ceph-users] Periodically activating / peering on OSD add
PS: It's luminous 12.2.5!

Best regards,
Kevin Olbrich

2018-07-14 15:19 GMT+02:00 Kevin Olbrich :
> Hi,
>
> why do I see activating followed by peering during OSD add (refill)?
> I did not change pg(p)_num.
>
> Is this normal? From my other clusters, I don't think that happened...
>
> Kevin
[ceph-users] Periodically activating / peering on OSD add
Hi,

why do I see activating followed by peering during OSD add (refill)? I did not change pg(p)_num.

Is this normal? From my other clusters, I don't think that happened...

Kevin
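A couple of stock commands for watching those state transitions while the new OSD fills (nothing non-standard assumed):

  watch -n 2 ceph -s            # overall pg state counts as backfill proceeds
  ceph pg dump_stuck inactive   # list pgs stuck activating/peering, if any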
Re: [ceph-users] Bluestore and number of devices
You can keep the same layout as before. Most people place DB/WAL combined in one partition (similar to the journal on filestore).

Kevin

2018-07-13 12:37 GMT+02:00 Robert Stanford :
>
> I'm using filestore now, with 4 data devices per journal device.
>
> I'm confused by this: "BlueStore manages either one, two, or (in certain
> cases) three storage devices."
> (http://docs.ceph.com/docs/luminous/rados/configuration/bluestore-config-ref/)
>
> When I convert my journals to bluestore, will there still be four data
> devices (osds) per journal, or will they each require a dedicated journal
> drive now?
>
> Regards
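A sketch of what that combined layout looks like with ceph-volume (device names are hypothetical; when no --block.wal is given, the WAL lives inside the DB partition):

  # data on the HDD, DB+WAL on an NVMe partition
  ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1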
[ceph-users] mds daemon damaged
Sorry for the long posting, but trying to cover everything.

I woke up to find my cephfs filesystem down. This was in the logs:

2018-07-11 05:54:10.398171 osd.1 [ERR] 2.4 full-object read crc 0x6fc2f65a != expected 0x1c08241c on 2:292cf221:::200.:head

I had one standby MDS, but as far as I can tell it did not fail over. This was in the logs: (insufficient standby MDS daemons available)

Currently my ceph looks like this:

cluster:
  id: ..
  health: HEALTH_ERR
          1 filesystem is degraded
          1 mds daemon damaged

services:
  mon: 6 daemons, quorum ds26,ds27,ds2b,ds2a,ds28,ds29
  mgr: ids27(active)
  mds: test-cephfs-1-0/1/1 up , 3 up:standby, 1 damaged
  osd: 5 osds: 5 up, 5 in

data:
  pools: 3 pools, 202 pgs
  objects: 1013k objects, 4018 GB
  usage: 12085 GB used, 6544 GB / 18630 GB avail
  pgs: 201 active+clean
       1 active+clean+scrubbing+deep

io:
  client: 0 B/s rd, 0 op/s rd, 0 op/s wr

I started trying to get the damaged MDS back online based on this page: http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#disaster-recovery-experts

# cephfs-journal-tool journal export backup.bin
2018-07-12 13:35:15.675964 7f3e1389bf00 -1 Header 200. is unreadable
2018-07-12 13:35:15.675977 7f3e1389bf00 -1 journal_export: Journal not readable, attempt object-by-object dump with `rados`
Error ((5) Input/output error)

# cephfs-journal-tool event recover_dentries summary
Events by type:
2018-07-12 13:36:03.000590 7fc398a18f00 -1 Header 200. is unreadable
Errors: 0

cephfs-journal-tool journal reset - (I think this command might have worked)

Next up, tried to reset the filesystem:

ceph fs reset test-cephfs-1 --yes-i-really-mean-it

Each time, same errors:

2018-07-12 11:56:35.760449 mon.ds26 [INF] Health check cleared: MDS_DAMAGE (was: 1 mds daemon damaged)
2018-07-12 11:56:35.856737 mon.ds26 [INF] Standby daemon mds.ds27 assigned to filesystem test-cephfs-1 as rank 0
2018-07-12 11:56:35.947801 mds.ds27 [ERR] Error recovering journal 0x200: (5) Input/output error
2018-07-12 11:56:36.900807 mon.ds26 [ERR] Health check failed: 1 mds daemon damaged (MDS_DAMAGE)
2018-07-12 11:56:35.945544 osd.0 [ERR] 2.4 full-object read crc 0x6fc2f65a != expected 0x1c08241c on 2:292cf221:::200.:head
2018-07-12 12:00:00.000142 mon.ds26 [ERR] overall HEALTH_ERR 1 filesystem is degraded; 1 mds daemon damaged

Tried to 'fail' mds.ds27:

# ceph mds fail ds27
# failed mds gid 1929168

The command worked, but each time I run the reset command the same errors above appear.

Online searches say the object read error has to be removed, but there's no object listed. This web page is the closest to the issue: http://tracker.ceph.com/issues/20863 - it recommends fixing the error by hand. Tried running deep scrub on pg 2.4; it completes, but I still have the same issue above.

The final option is to attempt removing mds.ds27. If mds.ds29 was a standby and has data, it should become live. If it was not, I assume we will lose the filesystem at this point.

Why didn't the standby MDS fail over? Just looking for any way to recover the cephfs, thanks!
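For reference, the two operations described above as commands (the filesystem name is taken from the post; whether clearing the damaged flag helps depends on the underlying object error, per the disaster-recovery page linked above):

  ceph pg deep-scrub 2.4               # re-scrub the pg holding the journal header
  ceph mds repaired test-cephfs-1:0    # clear the damaged state for rank 0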
Re: [ceph-users] PGs stuck peering (looping?) after upgrade to Luminous.
Sounds a little bit like the problem I had on OSDs - see the thread "[ceph-users] Blocked requests activating+remapped after extending pg(p)_num":

- Kevin Olbrich: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026680.html
- Burkhard Linke: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026681.html
- Kevin Olbrich: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026682.html
- Kevin Olbrich: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026683.html
- Kevin Olbrich: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026685.html
- Kevin Olbrich: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026689.html
- Paul Emmerich: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026692.html
- Kevin Olbrich: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026695.html

I ended up restarting the OSDs which were stuck in that state and they immediately fixed themselves. It should also work to just "out" the problem OSDs and immediately mark them "in" again to fix it.

- Kevin

2018-07-11 20:30 GMT+02:00 Magnus Grönlund :
> Hi,
>
> Started to upgrade a ceph-cluster from Jewel (10.2.10) to Luminous (12.2.6)
>
> After upgrading and restarting the mons everything looked OK, the mons had
> quorum, all OSDs were up and in and all the PGs were active+clean.
> But before I had time to start upgrading the OSDs it became obvious that
> something had gone terribly wrong.
> All of a sudden 1600 out of 4100 PGs were inactive and 40% of the data
> was misplaced!
>
> The mons appear OK and all OSDs are still up and in, but a few hours later
> there were still 1483 pgs stuck inactive, essentially all of them in
> peering! Investigating one of the stuck PGs, it appears to be looping
> between "inactive", "remapped+peering" and "peering", and the epoch number
> is rising fast; see the attached pg query outputs.
>
> We really can't afford to lose the cluster or the data, so any help or
> suggestions on how to debug or fix this issue would be very, very
> appreciated!
> health: HEALTH_ERR
>   1483 pgs are stuck inactive for more than 60 seconds
>   542 pgs backfill_wait
>   14 pgs backfilling
>   11 pgs degraded
>   1402 pgs peering
>   3 pgs recovery_wait
>   11 pgs stuck degraded
>   1483 pgs stuck inactive
>   2042 pgs stuck unclean
>   7 pgs stuck undersized
>   7 pgs undersized
>   111 requests are blocked > 32 sec
>   10586 requests are blocked > 4096 sec
>   recovery 9472/11120724 objects degraded (0.085%)
>   recovery 1181567/11120724 objects misplaced (10.625%)
>   noout flag(s) set
>   mon.eselde02u32 low disk space
>
> services:
>   mon: 3 daemons, quorum eselde02u32,eselde02u33,eselde02u34
>   mgr: eselde02u32(active), standbys: eselde02u33, eselde02u34
>   osd: 111 osds: 111 up, 111 in; 800 remapped pgs
>        flags noout
>
> data:
>   pools: 18 pools, 4104 pgs
>   objects: 3620k objects, 13875 GB
>   usage: 42254 GB used, 160 TB / 201 TB avail
>   pgs: 1.876% pgs unknown
>        34.259% pgs not active
>        9472/11120724 objects degraded (0.085%)
>        1181567/11120724 objects misplaced (10.625%)
>        2062 active+clean
>        1221 peering
>        535 active+remapped+backfill_wait
>        181 remapped+peering
>        77 unknown
>        13 active+remapped+backfilling
>        7 active+undersized+degraded+remapped+backfill_wait
>        4 remapped
>        3 active+recovery_wait+degraded+remapped
>        1 active+degraded+remapped+backfilling
>
> io:
>   recovery: 298 MB/s, 77 objects/s
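The two fixes suggested above, spelled out as commands (the osd id is hypothetical):

  systemctl restart ceph-osd@12       # restart a stuck OSD in place
  # or bounce it in the map without touching the process:
  ceph osd out 12 && ceph osd in 12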
Re: [ceph-users] rbd lock remove unable to parse address
2018-07-10 14:37 GMT+02:00 Jason Dillaman :

> On Tue, Jul 10, 2018 at 2:37 AM Kevin Olbrich wrote:
>
>> 2018-07-10 0:35 GMT+02:00 Jason Dillaman :
>>
>>> Is the link-local address of "fe80::219:99ff:fe9e:3a86%eth0" at least
>>> present on the client computer you used? I would have expected the OSD to
>>> determine the client address, so it's odd that it was able to get a
>>> link-local address.
>>
>> Yes, it is. eth0 is part of bond0 which is a vlan trunk. Bond0.X is
>> attached to brX which has an ULA-prefix for the ceph cluster.
>> Eth0 has no address itself. In this case this must mean the address has
>> been carried down to the hardware interface.
>>
>> I am wondering why it uses link-local when there is an ULA-prefix
>> available.
>>
>> The address is available on brX on this client node.
>
> I'll open a tracker ticket to get that issue fixed, but in the meantime,
> you can run "rados -p rmxattr rbd_header. lock.rbd_lock" to
> remove the lock.

Worked perfectly, thank you very much!

>> - Kevin
>>
>>> On Mon, Jul 9, 2018 at 3:43 PM Kevin Olbrich wrote:
>>>
>>>> 2018-07-09 21:25 GMT+02:00 Jason Dillaman :
>>>>
>>>>> BTW -- are you running Ceph on a one-node computer? I thought IPv6
>>>>> addresses starting w/ fe80 were link-local addresses which would
>>>>> probably explain why an interface scope id was appended. The current
>>>>> IPv6 address parser stops reading after it encounters a non hex, colon
>>>>> character [1].
>>>>
>>>> No, this is a compute machine attached to the storage vlan where I
>>>> previously also had local disks.
>>>>
>>>>> On Mon, Jul 9, 2018 at 3:14 PM Jason Dillaman wrote:
>>>>>
>>>>>> Hmm ... it looks like there is a bug w/ RBD locks and IPv6 addresses
>>>>>> since it is failing to parse the address as valid. Perhaps it's
>>>>>> barfing on the "%eth0" scope id suffix within the address.
>>>>>>
>>>>>> On Mon, Jul 9, 2018 at 2:47 PM Kevin Olbrich wrote:
>>>>>>
>>>>>>> Hi!
>>>>>>>
>>>>>>> I tried to convert a qcow2 file to rbd and set the wrong pool.
>>>>>>> Immediately I stopped the transfer, but the image is stuck locked:
>>>>>>>
>>>>>>> Previously when that happened, I was able to remove the image after
>>>>>>> 30 secs.
>>>>>>>
>>>>>>> [root@vm2003 images1]# rbd -p rbd_vms_hdd lock list fpi_server02
>>>>>>> There is 1 exclusive lock on this image.
>>>>>>> Locker ID Address
>>>>>>> client.1195723 auto 93921602220416 [fe80::219:99ff:fe9e:3a86%eth0]:0/1200385089
>>>>>>>
>>>>>>> [root@vm2003 images1]# rbd -p rbd_vms_hdd lock rm fpi_server02
>>>>>>> "auto 93921602220416" client.1195723
>>>>>>> rbd: releasing lock failed: (22) Invalid argument
>>>>>>> 2018-07-09 20:45:19.080543 7f6c2c267d40 -1 librados: unable to parse
>>>>>>> address [fe80::219:99ff:fe9e:3a86%eth0]:0/1200385089
>>>>>>> 2018-07-09 20:45:19.080555 7f6c2c267d40 -1 librbd: unable to
>>>>>>> blacklist client: (22) Invalid argument
>>>>>>>
>>>>>>> The image is not in use anywhere!
>>>>>>>
>>>>>>> How can I force removal of all locks for this image?
>>>>>>>
>>>>>>> Kind regards,
>>>>>>> Kevin
>>>>>>
>>>>>> --
>>>>>> Jason
>>>>>
>>>>> [1] https://github.com/ceph/ceph/blob/master/src/msg/msg_types.cc#L108
>>>>>
>>>>> --
>>>>> Jason
>>>
>>> --
>>> Jason

--
Jason
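A sketch of the fix with placeholder names (the pool and image are from the thread; the image id <id> is hypothetical and has to be looked up first):

  rbd -p rbd_vms_hdd info fpi_server02 | grep block_name_prefix   # rbd_data.<id> reveals the id
  rados -p rbd_vms_hdd listxattr rbd_header.<id>                  # should list lock.rbd_lock
  rados -p rbd_vms_hdd rmxattr rbd_header.<id> lock.rbd_lock      # drop the stale lock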
Re: [ceph-users] rbd lock remove unable to parse address
2018-07-10 0:35 GMT+02:00 Jason Dillaman :

> Is the link-local address of "fe80::219:99ff:fe9e:3a86%eth0" at least
> present on the client computer you used? I would have expected the OSD to
> determine the client address, so it's odd that it was able to get a
> link-local address.

Yes, it is. eth0 is part of bond0 which is a vlan trunk. Bond0.X is attached to brX which has an ULA-prefix for the ceph cluster. Eth0 has no address itself. In this case this must mean the address has been carried down to the hardware interface.

I am wondering why it uses link-local when there is an ULA-prefix available.

The address is available on brX on this client node.

- Kevin

> On Mon, Jul 9, 2018 at 3:43 PM Kevin Olbrich wrote:
>
>> 2018-07-09 21:25 GMT+02:00 Jason Dillaman :
>>
>>> BTW -- are you running Ceph on a one-node computer? I thought IPv6
>>> addresses starting w/ fe80 were link-local addresses which would
>>> probably explain why an interface scope id was appended. The current
>>> IPv6 address parser stops reading after it encounters a non hex, colon
>>> character [1].
>>
>> No, this is a compute machine attached to the storage vlan where I
>> previously also had local disks.
>>
>>> On Mon, Jul 9, 2018 at 3:14 PM Jason Dillaman wrote:
>>>
>>>> Hmm ... it looks like there is a bug w/ RBD locks and IPv6 addresses
>>>> since it is failing to parse the address as valid. Perhaps it's barfing
>>>> on the "%eth0" scope id suffix within the address.
>>>>
>>>> On Mon, Jul 9, 2018 at 2:47 PM Kevin Olbrich wrote:
>>>>
>>>>> Hi!
>>>>>
>>>>> I tried to convert a qcow2 file to rbd and set the wrong pool.
>>>>> Immediately I stopped the transfer, but the image is stuck locked:
>>>>>
>>>>> Previously when that happened, I was able to remove the image after
>>>>> 30 secs.
>>>>>
>>>>> [root@vm2003 images1]# rbd -p rbd_vms_hdd lock list fpi_server02
>>>>> There is 1 exclusive lock on this image.
>>>>> Locker ID Address
>>>>> client.1195723 auto 93921602220416 [fe80::219:99ff:fe9e:3a86%eth0]:0/1200385089
>>>>>
>>>>> [root@vm2003 images1]# rbd -p rbd_vms_hdd lock rm fpi_server02 "auto
>>>>> 93921602220416" client.1195723
>>>>> rbd: releasing lock failed: (22) Invalid argument
>>>>> 2018-07-09 20:45:19.080543 7f6c2c267d40 -1 librados: unable to parse
>>>>> address [fe80::219:99ff:fe9e:3a86%eth0]:0/1200385089
>>>>> 2018-07-09 20:45:19.080555 7f6c2c267d40 -1 librbd: unable to blacklist
>>>>> client: (22) Invalid argument
>>>>>
>>>>> The image is not in use anywhere!
>>>>>
>>>>> How can I force removal of all locks for this image?
>>>>>
>>>>> Kind regards,
>>>>> Kevin
>>>>
>>>> --
>>>> Jason
>>>
>>> [1] https://github.com/ceph/ceph/blob/master/src/msg/msg_types.cc#L108
>>>
>>> --
>>> Jason

--
Jason