Re: [ceph-users] ceph-mgr fails to restart after upgrade to mimic
I can't think of why the upgrade would have broken your keys, but have you verified that the mons still have the correct mgr keys configured? 'ceph auth ls' should list an mgr.<id> key for each mgr with a key matching the contents of /var/lib/ceph/mgr/<cluster>-<id>/keyring on the mgr host, and some caps that should minimally include '[mon] allow profile mgr' and '[osd] allow *', I would think. Again, it seems unlikely that this would have broken with the upgrade if it had been working previously, but if you're seeing auth errors it might be something to check out.

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation <https://storagecraft.com>
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799

If you are not the intended recipient of this message or received it erroneously, please notify the sender and delete it, together with any attachments, and be advised that any dissemination or copying of this message is prohibited.

On Fri, 2019-01-04 at 07:26 -0700, Randall Smith wrote:

Greetings,

I'm upgrading my cluster from luminous to mimic. I've upgraded my monitors and am attempting to upgrade the mgrs. Unfortunately, after an upgrade the mgr daemon exits immediately with error code 1.

I've tried running ceph-mgr in debug mode to see what's happening, but the output (below) is a bit cryptic for me. It looks like authentication might be failing, but it was working prior to the upgrade. I do have "auth supported = cephx" in the global section of ceph.conf. What do I need to do to fix this? Thanks.
/usr/bin/ceph-mgr -f --cluster ceph --id 8 --setuser ceph --setgroup ceph -d --debug_ms 5
2019-01-04 07:01:38.457 7f808f83f700 2 Event(0x30c42c0 nevent=5000 time_id=1).set_owner idx=0 owner=140190140331776
2019-01-04 07:01:38.457 7f808f03e700 2 Event(0x30c4500 nevent=5000 time_id=1).set_owner idx=1 owner=140190131939072
2019-01-04 07:01:38.457 7f808e83d700 2 Event(0x30c4e00 nevent=5000 time_id=1).set_owner idx=2 owner=140190123546368
2019-01-04 07:01:38.457 7f809dd5b380 1 Processor -- start
2019-01-04 07:01:38.477 7f809dd5b380 1 -- - start start
2019-01-04 07:01:38.481 7f809dd5b380 1 -- - --> 192.168.253.147:6789/0 -- auth(proto 0 26 bytes epoch 0) v1 -- 0x32a6780 con 0
2019-01-04 07:01:38.481 7f809dd5b380 1 -- - --> 192.168.253.148:6789/0 -- auth(proto 0 26 bytes epoch 0) v1 -- 0x32a6a00 con 0
2019-01-04 07:01:38.481 7f808e83d700 1 -- 192.168.253.148:0/1359135487 learned_addr learned my addr 192.168.253.148:0/1359135487
2019-01-04 07:01:38.481 7f808e83d700 2 -- 192.168.253.148:0/1359135487 >> 192.168.253.148:6789/0 conn(0x332d500 :-1 s=STATE_CONNECTING_WAIT_ACK_SEQ pgs=0 cs=0 l=0)._process_connection got newly_acked_seq 0 vs out_seq 0
2019-01-04 07:01:38.481 7f808f03e700 2 -- 192.168.253.148:0/1359135487 >> 192.168.253.147:6789/0 conn(0x332ce00 :-1 s=STATE_CONNECTING_WAIT_ACK_SEQ pgs=0 cs=0 l=0)._process_connection got newly_acked_seq 0 vs out_seq 0
2019-01-04 07:01:38.481 7f808f03e700 5 -- 192.168.253.148:0/1359135487 >> 192.168.253.147:6789/0 conn(0x332ce00 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=74172 cs=1 l=1). rx mon.1 seq 1 0x30c5440 mon_map magic: 0 v1
2019-01-04 07:01:38.481 7f808e83d700 5 -- 192.168.253.148:0/1359135487 >> 192.168.253.148:6789/0 conn(0x332d500 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=74275 cs=1 l=1). rx mon.2 seq 1 0x30c5680 mon_map magic: 0 v1
2019-01-04 07:01:38.481 7f808f03e700 5 -- 192.168.253.148:0/1359135487 >> 192.168.253.147:6789/0 conn(0x332ce00 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=74172 cs=1 l=1). rx mon.1 seq 2 0x32a6780 auth_reply(proto 2 0 (0) Success) v1
2019-01-04 07:01:38.481 7f808e83d700 5 -- 192.168.253.148:0/1359135487 >> 192.168.253.148:6789/0 conn(0x332d500 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=74275 cs=1 l=1). rx mon.2 seq 2 0x32a6a00 auth_reply(proto 2 0 (0) Success) v1
2019-01-04 07:01:38.481 7f808e03c700 1 -- 192.168.253.148:0/1359135487 <== mon.1 192.168.253.147:6
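The key comparison Steve suggests can be scripted in a couple of commands, sketched here for the mgr id "8" shown in the debug run above (these assume a live cluster; the re-import step is only needed if the keys actually differ):

```shell
# Show the key and caps the mons have for this mgr...
ceph auth get mgr.8
# ...and the key the mgr daemon itself will present.
cat /var/lib/ceph/mgr/ceph-8/keyring

# If the two keys differ, re-import the local keyring so the
# mon-side entry matches it again.
ceph auth import -i /var/lib/ceph/mgr/ceph-8/keyring
```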
Re: [ceph-users] Balancer module not balancing perfectly
I ended up balancing my osdmap myself offline to figure out why the balancer couldn't do better. I had similar issues with osdmaptool, which of course is what I expected, but it's a lot easier to run osdmaptool in a debugger to see what's happening.

When I dug into the upmap code I discovered that my problem was due to the way that code balances OSDs. In my case the average PG count per OSD is 56.882, so as soon as any OSD had 56 PGs it wouldn't get any more, no matter what I used as my max deviation. I got into a state where each OSD had 56-61 PGs, and the upmap code wouldn't do any better because there were no "underfull" OSDs onto which to move PGs.

I made some changes to the osdmap code to ensure the computed "overfull" and "underfull" OSD lists were the same size, even if the least or most full OSDs were within the expected deviation, in order to allow those outside of the expected deviation some relief, and it worked nicely. I have two independent, production pools that were both in this state, and now every OSD across both pools has 56 or 57 PGs as expected. I intend to put together a pull request to push this upstream.

I haven't reviewed the balancer module code to see how it's doing things, but assuming it uses osdmaptool or the same upmap code as osdmaptool, this should also improve the balancer module.

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation

On Tue, 2018-11-06 at 12:23 +0700, Konstantin Shalygin wrote:

From the balancer module's code for v 12.2.7 I noticed [1] these lines which reference [2] these 2 config options for upmap.
You might try using more max iterations or a smaller max deviation to see if you can get a better balance in your cluster. I would try to start with [3] these commands/values and see if it improves your balance and/or allows you to generate a better map.

[1] https://github.com/ceph/ceph/blob/v12.2.7/src/pybind/mgr/balancer/module.py#L671-L672
[2] upmap_max_iterations (default 10)
    upmap_max_deviation (default .01)
[3] ceph config-key set mgr/balancer/upmap_max_iterations 50
    ceph config-key set mgr/balancer/upmap_max_deviation .005

This did not help my 12.2.8 cluster. While the first balancing iterations were running I decreased max_misplaced from the default 0.05 to 0.01, and after that balancing operations stopped. Since the cluster reached HEALTH_OK I have not seen any balancer runs. I tried lowering the balancer variables and restarting the mgr, but the message is still: "Error EALREADY: Unable to find further optimization, or distribution is already perfect"

# ceph config-key dump | grep balancer
"mgr/balancer/active": "1",
"mgr/balancer/max_misplaced": ".50",
"mgr/balancer/mode": "upmap",
"mgr/balancer/upmap_max_deviation": ".001",
"mgr/balancer/upmap_max_iterations": "100",

So maybe I need to delete the upmaps and start over?
ID  CLASS WEIGHT REWEIGHT SIZE    USE     AVAIL   %USE  VAR  PGS TYPE NAME
-1        414.0  -        445TiB  129TiB  316TiB  29.01 1.00 -   root default
-7        414.0  -        445TiB  129TiB  316TiB  29.01 1.00 -   datacenter rtcloud
-8        138.0  -        148TiB  42.9TiB 105TiB  28.93 1.00 -   rack rack2
-2        69.0   -        74.2TiB 21.5TiB 52.7TiB 28.93 1.00 -   host ceph-osd0
 0  hdd   5.0    1.0      5.46TiB 1.64TiB 3.82TiB 30.06 1.04 62  osd.0
 4  hdd   5.0    1.0      5.46TiB 1.65TiB 3.80TiB 30.29 1.04 64  osd.4
 7  hdd   5.0    1.0      5.46TiB 1.61TiB 3.85TiB 29.44 1.01 63  osd.7
 9  hdd   5.0    1.0      5.46TiB 1.68TiB 3.78TiB 30.77 1.06 63  osd.9
46  hdd   5.0    1.0      5.46TiB 1.68TiB 3.77TiB 30.86 1.06 65  osd.46
47  hdd   5.0    1.0      5.46TiB 1.68TiB 3.78TiB 30.73 1.06 66  osd.47
48  hdd   5.0    1.0      5.46TiB 1.65TiB 3.81TiB 30.22 1.04 66  osd.48
49  hdd   5.0    1.0      5.46TiB 1.71TiB 3.74TiB 31.41 1.08 65  osd.49
54  hdd   5.0    1.0      5.46TiB 1.64TiB 3.82TiB 30.08 1.04 65  osd.54
55  hdd   5.0    1.0      5.46TiB 1.65TiB 3.80TiB 30.30 1.04 64  osd.55
56  hdd   5.0    1.0      5.46TiB 1.66TiB 3.80TiB 30.35 1.05 64  osd.56
57  hdd   5
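Returning to the fix Steve describes at the top of this message, the idea can be sketched in a few lines of Python (a simplified greedy model for illustration only, not the actual osdmap C++ code; the function name and PG counts are made up): an OSD already sitting at the floor of the average (56 here) is still allowed to accept a PG as long as moving one strictly shrinks the spread.

```python
def rebalance(pg_counts, max_deviation=1):
    """Repeatedly move one PG from the fullest OSD to the emptiest one.

    Unlike the pre-fix upmap code, an OSD holding floor(average) PGs is
    not treated as off-limits; it can still receive PGs until the
    spread shrinks to max_deviation.
    """
    counts = dict(pg_counts)
    while True:
        hi = max(counts, key=counts.get)  # most full OSD
        lo = min(counts, key=counts.get)  # least full OSD
        if counts[hi] - counts[lo] <= max_deviation:
            return counts
        counts[hi] -= 1
        counts[lo] += 1

# 10 OSDs holding 569 PGs (average 56.9), starting in a 56-61 spread
# like the one described above.
start = dict(enumerate([56, 56, 56, 56, 56, 61, 57, 57, 57, 57]))
print(sorted(rebalance(start).values()))
# -> [56, 57, 57, 57, 57, 57, 57, 57, 57, 57]
```

With the "no OSD below the floor may receive PGs" rule this starting state would be stuck, since every OSD already has at least 56; relaxing it lets the 61-PG OSD drain down to the 56/57 spread.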
Re: [ceph-users] Balancer module not balancing perfectly
I think I pretty well have things figured out at this point, but I'm not sure how to proceed. The config-key settings were not effective because I had not restarted the active mgr after setting them. Once I restarted the mgr the settings became effective.

Once I had the config-key settings working I quickly discovered that they didn't make any difference, so I downloaded an osdmap and started trying to use osdmaptool offline to see if it would behave differently. It didn't, but when I specified '--debug-osd 20' on the osdmaptool command line things got interesting.

It looks like osdmaptool generates lists of overfull and underfull OSDs and then uses those lists to move PGs in order to achieve a perfect balance. In my case the expected PG count range per OSD is 56-57, but the actual range is 56-61. The problem seems to lie in the fact that all of my OSDs have at least 56 PGs and are therefore not considered underfull. The debug output from osdmaptool shows a decent list of overfull OSDs and an empty list of underfull OSDs, then says there is nothing to be done.

Perhaps the next step is to modify osdmaptool to allow OSDs that are not underfull, but that will not be made overfull by the move, to take new PGs? That seems like it should be the expected behavior in this scenario.

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation

-----Original Message-----
From: Steve Taylor
Sent: Tuesday, October 30, 2018 1:40 PM
To: drakonst...@gmail.com
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Balancer module not balancing perfectly

I was having a difficult time getting debug logs from the active mgr, but I finally got it.
Apparently injecting debug_mgr doesn't work, even when the change is reflected when you query the running config. Modifying the config file and restarting the mgr got it to log for me. Now that I have some debug logging, I think I may see the problem.

'ceph config-key dump'
...
"mgr/balancer/active": "1",
"mgr/balancer/max_misplaced": "1",
"mgr/balancer/mode": "upmap",
"mgr/balancer/upmap_max_deviation": "0.0001",
"mgr/balancer/upmap_max_iterations": "1000"

Mgr log excerpt:
2018-10-30 13:25:52.523117 7f08b47ff700 4 mgr[balancer] Optimize plan upmap-balance
2018-10-30 13:25:52.523135 7f08b47ff700 4 mgr get_config get_configkey: mgr/balancer/mode
2018-10-30 13:25:52.523141 7f08b47ff700 10 ceph_config_get mode found: upmap
2018-10-30 13:25:52.523144 7f08b47ff700 4 mgr get_config get_configkey: mgr/balancer/max_misplaced
2018-10-30 13:25:52.523145 7f08b47ff700 10 ceph_config_get max_misplaced found: 1
2018-10-30 13:25:52.523178 7f08b47ff700 4 mgr[balancer] Mode upmap, max misplaced 1.00
2018-10-30 13:25:52.523241 7f08b47ff700 20 mgr[balancer] unknown 0.00 degraded 0.00 inactive 0.00 misplaced 0
2018-10-30 13:25:52.523288 7f08b47ff700 4 mgr[balancer] do_upmap
2018-10-30 13:25:52.523296 7f08b47ff700 4 mgr get_config get_configkey: mgr/balancer/upmap_max_iterations
2018-10-30 13:25:52.523298 7f08b47ff700 4 ceph_config_get upmap_max_iterations not found
2018-10-30 13:25:52.523301 7f08b47ff700 4 mgr get_config get_configkey: mgr/balancer/upmap_max_deviation
2018-10-30 13:25:52.523305 7f08b47ff700 4 ceph_config_get upmap_max_deviation not found
2018-10-30 13:25:52.523339 7f08b47ff700 4 mgr[balancer] pools ['rbd-data']
2018-10-30 13:25:52.523350 7f08b47ff700 10 osdmap_calc_pg_upmaps osdmap 0x7f08b1884280 inc 0x7f0898bda800 max_deviation 0.01 max_iterations 10 pools 3
2018-10-30 13:25:52.579669 7f08bbffc700 4 mgr ms_dispatch active mgrdigest v1
2018-10-30 13:25:52.579671 7f08bbffc700 4 mgr ms_dispatch mgrdigest v1
2018-10-30 13:25:52.579673 7f08bbffc700 10 mgr handle_mgr_digest 1364
2018-10-30 13:25:52.579674 7f08bbffc700 10 mgr handle_mgr_digest 501
2018-10-30 13:25:52.579677 7f08bbffc700 10 mgr notify_all notify_all: notify_all mon_status
2018-10-30 13:25:52.579681 7f08bbffc700 10 mgr notify_all notify_all: notify_all health
2018-10-30 13:25:52.579683 7f08bbffc700 10 mgr notify_all notify_all: notify_all pg_summary
2018-10-30 13:25:52.579684 7f08bbffc700 10 mgr handle_mgr_digest done.
2018-10-30 13:25:52.603867 7f08b47ff700 10 osdmap_calc_pg_upmaps r = 0
2018-10-30 13:25:52.603982 7f08b47ff700 4 mgr[balancer] prepared 0/10 changes

The mgr claims that mgr/balancer/upmap_max_iterations and mgr/balancer/upmap_max_deviation aren't found in the config even though they have been set and appear in the config-key dump. It seems to be picking up the other config options correctly. Am I doing something wrong? I feel like I must have a typo or something, but I'm not seeing it.
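The offline workflow Steve describes (grab the live osdmap, then run the upmap optimizer against it with debug output) looks roughly like this; the file names and pool name are illustrative:

```shell
# Export the cluster's current osdmap to a local file.
ceph osd getmap -o /tmp/osdmap

# Have osdmaptool compute upmap entries for one pool, writing the
# resulting 'ceph osd pg-upmap-items ...' commands to a script; the
# debug output shows the overfull/underfull lists it computed.
osdmaptool /tmp/osdmap --upmap /tmp/upmap.sh --upmap-pool rbd-data --debug-osd 20

# Review /tmp/upmap.sh before applying it with: . /tmp/upmap.sh
```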
Re: [ceph-users] Balancer module not balancing perfectly
I was having a difficult time getting debug logs from the active mgr, but I finally got it. Apparently injecting debug_mgr doesn't work, even when the change is reflected when you query the running config. Modifying the config file and restarting the mgr got it to log for me. Now that I have some debug logging, I think I may see the problem.

'ceph config-key dump'
...
"mgr/balancer/active": "1",
"mgr/balancer/max_misplaced": "1",
"mgr/balancer/mode": "upmap",
"mgr/balancer/upmap_max_deviation": "0.0001",
"mgr/balancer/upmap_max_iterations": "1000"

Mgr log excerpt:
2018-10-30 13:25:52.523117 7f08b47ff700 4 mgr[balancer] Optimize plan upmap-balance
2018-10-30 13:25:52.523135 7f08b47ff700 4 mgr get_config get_configkey: mgr/balancer/mode
2018-10-30 13:25:52.523141 7f08b47ff700 10 ceph_config_get mode found: upmap
2018-10-30 13:25:52.523144 7f08b47ff700 4 mgr get_config get_configkey: mgr/balancer/max_misplaced
2018-10-30 13:25:52.523145 7f08b47ff700 10 ceph_config_get max_misplaced found: 1
2018-10-30 13:25:52.523178 7f08b47ff700 4 mgr[balancer] Mode upmap, max misplaced 1.00
2018-10-30 13:25:52.523241 7f08b47ff700 20 mgr[balancer] unknown 0.00 degraded 0.00 inactive 0.00 misplaced 0
2018-10-30 13:25:52.523288 7f08b47ff700 4 mgr[balancer] do_upmap
2018-10-30 13:25:52.523296 7f08b47ff700 4 mgr get_config get_configkey: mgr/balancer/upmap_max_iterations
2018-10-30 13:25:52.523298 7f08b47ff700 4 ceph_config_get upmap_max_iterations not found
2018-10-30 13:25:52.523301 7f08b47ff700 4 mgr get_config get_configkey: mgr/balancer/upmap_max_deviation
2018-10-30 13:25:52.523305 7f08b47ff700 4 ceph_config_get upmap_max_deviation not found
2018-10-30 13:25:52.523339 7f08b47ff700 4 mgr[balancer] pools ['rbd-data']
2018-10-30 13:25:52.523350 7f08b47ff700 10 osdmap_calc_pg_upmaps osdmap 0x7f08b1884280 inc 0x7f0898bda800 max_deviation 0.01 max_iterations 10 pools 3
2018-10-30 13:25:52.579669 7f08bbffc700 4 mgr ms_dispatch active mgrdigest v1
2018-10-30 13:25:52.579671 7f08bbffc700 4 mgr ms_dispatch mgrdigest v1
2018-10-30 13:25:52.579673 7f08bbffc700 10 mgr handle_mgr_digest 1364
2018-10-30 13:25:52.579674 7f08bbffc700 10 mgr handle_mgr_digest 501
2018-10-30 13:25:52.579677 7f08bbffc700 10 mgr notify_all notify_all: notify_all mon_status
2018-10-30 13:25:52.579681 7f08bbffc700 10 mgr notify_all notify_all: notify_all health
2018-10-30 13:25:52.579683 7f08bbffc700 10 mgr notify_all notify_all: notify_all pg_summary
2018-10-30 13:25:52.579684 7f08bbffc700 10 mgr handle_mgr_digest done.
2018-10-30 13:25:52.603867 7f08b47ff700 10 osdmap_calc_pg_upmaps r = 0
2018-10-30 13:25:52.603982 7f08b47ff700 4 mgr[balancer] prepared 0/10 changes

The mgr claims that mgr/balancer/upmap_max_iterations and mgr/balancer/upmap_max_deviation aren't found in the config even though they have been set and appear in the config-key dump. It seems to be picking up the other config options correctly. Am I doing something wrong? I feel like I must have a typo or something, but I'm not seeing it.

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation

On Tue, 2018-10-30 at 10:11 -0600, Steve Taylor wrote:
> I had played with those settings some already, but I just tried again
> with max_deviation set to 0.0001 and max_iterations set to 1000. Same
> result. Thanks for the suggestion though.
>
> On Tue, 2018-10-30 at 12:06 -0400, David Turner wrote:
> > From the balancer module's code for v 12.2.7 I noticed [1] these
> > lines which reference [2] these 2 config options for upmap. You might
> > try using more max iterations or a smaller max deviation to see if
> > you can get a better balance in your cluster.
> > I would try to start with [3] these commands/values and see if it
> > improves your balance and/or allows you to generate a better map.
> >
> > [1] https://github.com/ceph/ceph/blob/v12.2.7/src/pybind/mgr/balancer/module.py#L671-L672
> > [2] upmap_max_iterations (default 10)
> >     upmap_max_deviation (default .01)
> > [3] ceph config-key set mgr/balancer/upmap_max_iterations 50
> >     ceph config-key set mgr/balancer/upmap_max_deviation .005
> >
> > On Tue, Oct 30, 2018 at 11:14 AM Steve Taylor <steve.tay...@storagecraft.com> wrote:
> > > I have a Luminous 12.2.7 cluster with 2 EC pools, both using k=8 and
> > > m=2. Each pool lives on 20 dedicated OSD hosts with 18 OSDs each. Each
> > > pool has 2048 PGs and is distributed across its 360 OSDs with host
> > > failure domains. The OSDs are
Re: [ceph-users] Balancer module not balancing perfectly
I had played with those settings some already, but I just tried again with max_deviation set to 0.0001 and max_iterations set to 1000. Same result. Thanks for the suggestion though.

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation

On Tue, 2018-10-30 at 12:06 -0400, David Turner wrote:
> From the balancer module's code for v 12.2.7 I noticed [1] these
> lines which reference [2] these 2 config options for upmap. You might
> try using more max iterations or a smaller max deviation to see if
> you can get a better balance in your cluster. I would try to start
> with [3] these commands/values and see if it improves your balance
> and/or allows you to generate a better map.
>
> [1] https://github.com/ceph/ceph/blob/v12.2.7/src/pybind/mgr/balancer/module.py#L671-L672
> [2] upmap_max_iterations (default 10)
>     upmap_max_deviation (default .01)
> [3] ceph config-key set mgr/balancer/upmap_max_iterations 50
>     ceph config-key set mgr/balancer/upmap_max_deviation .005
>
> On Tue, Oct 30, 2018 at 11:14 AM Steve Taylor <steve.tay...@storagecraft.com> wrote:
> > I have a Luminous 12.2.7 cluster with 2 EC pools, both using k=8 and
> > m=2. Each pool lives on 20 dedicated OSD hosts with 18 OSDs each.
> > Each pool has 2048 PGs and is distributed across its 360 OSDs with
> > host failure domains. The OSDs are identical (4TB) and are weighted
> > with default weights (3.73).
> >
> > Initially, and not surprisingly, the PG distribution was all over
> > the place with PG counts per OSD ranging from 40 to 83.
> > I enabled the balancer module in upmap mode and let it work its
> > magic, which reduced the range of the per-OSD PG counts to 56-61.
> >
> > While 56-61 is obviously a whole lot better than 40-83, with upmap
> > I expected the range to be 56-57. If I run 'ceph balancer optimize '
> > again to attempt to create a new plan I get 'Error EALREADY:
> > Unable to find further optimization, or distribution is already
> > perfect.' I set the balancer's max_misplaced value to 1 in case
> > that was preventing further optimization, but I still get the same
> > error.
> >
> > I'm sure I'm missing some config option or something that will
> > allow it to do better, but thus far I haven't been able to find
> > anything in the docs, mailing list archives, or balancer source
> > code that helps. Any ideas?
> >
> > Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Balancer module not balancing perfectly
I have a Luminous 12.2.7 cluster with 2 EC pools, both using k=8 and m=2. Each pool lives on 20 dedicated OSD hosts with 18 OSDs each. Each pool has 2048 PGs and is distributed across its 360 OSDs with host failure domains. The OSDs are identical (4TB) and are weighted with default weights (3.73).

Initially, and not surprisingly, the PG distribution was all over the place, with PG counts per OSD ranging from 40 to 83. I enabled the balancer module in upmap mode and let it work its magic, which reduced the range of the per-OSD PG counts to 56-61.

While 56-61 is obviously a whole lot better than 40-83, with upmap I expected the range to be 56-57. If I run 'ceph balancer optimize ' again to attempt to create a new plan I get 'Error EALREADY: Unable to find further optimization, or distribution is already perfect.' I set the balancer's max_misplaced value to 1 in case that was preventing further optimization, but I still get the same error.

I'm sure I'm missing some config option or something that will allow it to do better, but thus far I haven't been able to find anything in the docs, mailing list archives, or balancer source code that helps. Any ideas?

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
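The 56-57 expectation follows from simple arithmetic on the numbers above: each PG in a k=8, m=2 EC pool occupies k+m = 10 OSDs (one shard per OSD), so 2048 PGs spread over 360 OSDs average about 56.9 PG shards per OSD. A quick check:

```python
# Numbers from the cluster description above.
pgs = 2048     # PGs per EC pool
k, m = 8, 2    # EC profile: each PG maps to k + m OSDs
osds = 360     # 20 hosts * 18 OSDs per pool

avg = pgs * (k + m) / osds
print(round(avg, 3))  # -> 56.889, so a perfect spread is 56-57 PGs per OSD
```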
Re: [ceph-users] Strange Ceph host behaviour
Unless this is related to load and OSDs really are unresponsive, it is almost certainly some sort of network issue. Duplicate IP address maybe?

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation

On Tue, 2018-10-02 at 17:17 +0200, Vincent Godin wrote:
> Ceph cluster in Jewel 10.2.11
> Mons & Hosts are on CentOS 7.5.1804 kernel 3.10.0-862.6.3.el7.x86_64
>
> Every day, we can see a lot of logs like these in ceph.log on a monitor:
>
> 2018-10-02 16:07:08.882374 osd.478 192.168.1.232:6838/7689 386 : cluster [WRN] map e612590 wrongly marked me down
> 2018-10-02 16:07:06.462653 osd.464 192.168.1.232:6830/6650 317 : cluster [WRN] map e612588 wrongly marked me down
> 2018-10-02 16:07:10.717673 osd.470 192.168.1.232:6836/7554 371 : cluster [WRN] map e612591 wrongly marked me down
> 2018-10-02 16:14:51.179945 osd.414 192.168.1.227:6808/4767 670 : cluster [WRN] map e612599 wrongly marked me down
> 2018-10-02 16:14:48.422442 osd.403 192.168.1.227:6832/6727 509 : cluster [WRN] map e612597 wrongly marked me down
> 2018-10-02 16:15:13.198180 osd.436 192.168.1.228:6828/6402 533 : cluster [WRN] map e612608 wrongly marked me down
> 2018-10-02 16:15:08.792369 osd.433 192.168.1.228:6832/6732 515 : cluster [WRN] map e612604 wrongly marked me down
> 2018-10-02 16:15:11.680405 osd.429 192.168.1.228:6838/7393 536 : cluster [WRN] map e612607 wrongly marked me down
> 2018-10-02 16:15:14.246717 osd.431 192.168.1.228:6822/5937 474 : cluster [WRN] map e612609 wrongly marked me down
>
> On the server 192.168.1.228, for example, /var/log/messages looks like:
>
> Oct 2 16:15:02 bd-ceph-22 ceph-osd: 2018-10-02 16:15:02.935658 7f716f16e700 -1 osd.432 612603 heartbeat_check: no reply from 192.168.1.215:6815 osd.242 since back 2018-10-02 16:14:59.065582 front 2018-10-02 16:14:42.046092 (cutoff 2018-10-02 16:14:42.935642)
> Oct 2 16:15:03 bd-ceph-22 ceph-osd: 2018-10-02 16:15:03.935841 7f716f16e700 -1 osd.432 612603 heartbeat_check: no reply from 192.168.1.215:6815 osd.242 since back 2018-10-02 16:14:59.065582 front 2018-10-02 16:14:42.046092 (cutoff 2018-10-02 16:14:43.935824)
> Oct 2 16:15:04 bd-ceph-22 ceph-osd: 2018-10-02 16:15:04.283822 7fe378c13700 -1 osd.426 612603 heartbeat_check: no reply from 192.168.1.215:6807 osd.240 since back 2018-10-02 16:15:00.450196 front 2018-10-02 16:14:43.433054 (cutoff 2018-10-02 16:14:44.283811)
> Oct 2 16:15:04 bd-ceph-22 ceph-osd: 2018-10-02 16:15:04.353645 7f1110a32700 -1 osd.438 612603 heartbeat_check: no reply from 192.168.1.212:6807 osd.186 since back 2018-10-02 16:14:59.700105 front 2018-10-02 16:14:43.884248 (cutoff 2018-10-02 16:14:44.353612)
> Oct 2 16:15:04 bd-ceph-22 ceph-osd: 2018-10-02 16:15:04.373905 7f71375de700 -1 osd.432 612603 heartbeat_check: no reply from 192.168.1.215:6815 osd.242 since back 2018-10-02 16:14:59.065582 front 2018-10-02 16:14:42.046092 (cutoff 2018-10-02 16:14:44.373897)
> Oct 2 16:15:04 bd-ceph-22 ceph-osd: 2018-10-02 16:15:04.935997 7f716f16e700 -1 osd.432 612603 heartbeat_check: no reply from 192.168.1.215:6815 osd.242 since back 2018-10-02 16:15:04.369740 front 2018-10-02 16:14:42.046092 (cutoff 2018-10-02 16:14:44.935981)
> Oct 2 16:15:05 bd-ceph-22 ceph-osd: 2018-10-02 16:15:05.007484 7f10d97ec700 -1 osd.438 612603 heartbeat_check: no reply from 192.168.1.212:6807 osd.186 since back 2018-10-02 16:14:59.700105 front 2018-10-02 16:14:43.884248 (cutoff 2018-10-02 16:14:45.007477)
> Oct 2 16:15:05 bd-ceph-22 ceph-osd: 2018-10-02 16:15:05.017154 7fd4cee4d700 -1 osd.435 612603 heartbeat_check: no reply from 192.168.1.212:6833 osd.195 since back 2018-10-02 16:15:03.273909 front 2018-10-02 16:14:44.648411 (cutoff 2018-10-02 16:14:45.017106)
> Oct 2 16:15:05 bd-ceph-22 ceph-osd: 2018-10-02 16:15:05.158580 7fe343c96700 -1 osd.426 612603 heartbeat_check: no reply from 192.168.1.215:6807 osd.240 since back 2018-10-02 16:15:00.450196 front 2018-10-02 16:14:43.433054 (cutoff 2018-10-02 16:14:45.158567)
> Oct 2 16:15:05 bd-ceph-22 ceph-osd: 2018-10-02 16:15:05.283983 7fe378c13700 -1 osd.426 612603 heartbeat_check: no reply from 192.168.1.215:6807 osd.240 since back 2018-10-02 16:15:05.154458 front 2018-10-02 16:14:43.433054 (cutoff 2018-10-02 16:14:45.283975)
>
> There is no network problem at that time (I checked the logs on the host and on the switch). OSD logs show nothing but "wrongly marked me down" and
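One quick way to test Steve's duplicate-IP theory is with standard Linux tools rather than anything Ceph-specific (the interface name here is illustrative):

```shell
# Duplicate address detection: probe for our own address and see
# whether any *other* MAC answers. Run on the host that owns the IP.
arping -D -c 3 -I eth0 192.168.1.228

# On a peer OSD host, check which MAC is cached for the suspect
# address and compare it with the real NIC's MAC.
ip neigh show 192.168.1.228
```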
Re: [ceph-users] move rbd image (with snapshots) to different pool
I have done this with Luminous by deep-flattening a clone in a different pool. It seemed to do what I wanted, but the RBD appeared to lose its sparseness in the process. Can anyone verify that and/or comment on whether Mimic's "rbd deep copy" does the same?

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020

-----Original Message-----
From: ceph-users On Behalf Of Jason Dillaman
Sent: Friday, June 15, 2018 7:45 AM
To: Marc Roos
Cc: ceph-users
Subject: Re: [ceph-users] move rbd image (with snapshots) to different pool

The "rbd clone" command will just create a copy-on-write cloned child of the source image. It will not copy any snapshots from the original image to the clone.

With the Luminous release, you can use "rbd export --export-format 2 - | rbd import --export-format 2 - " to export / import an image (and all its snapshots) to a different pool. Additionally, with the Mimic release, you can run "rbd deep copy" to copy an image (and all its snapshots) to a different pool.

On Fri, Jun 15, 2018 at 3:26 AM, Marc Roos wrote:
>
> If I would like to copy/move an rbd image, this is the only option I
> have? (Want to move an image from a hdd pool to an ssd pool)
>
> rbd clone mypool/parent@snap otherpool/child

--
Jason
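Concretely, the two approaches Jason describes look like this (pool and image names are illustrative; note that 'rbd sparsify', which can reclaim lost sparseness after the copy, only arrived later in Nautilus, so it would not help the Luminous/Mimic case discussed above):

```shell
# Luminous: stream an image and all its snapshots to another pool
# through a pipe.
rbd export --export-format 2 rbd-hdd/myimage - | rbd import --export-format 2 - rbd-ssd/myimage

# Mimic: one-step deep copy, snapshots included.
rbd deep copy rbd-hdd/myimage rbd-ssd/myimage
```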
Re: [ceph-users] osds with different disk sizes may killing performance
I can't comment directly on the relation XFS fragmentation has to Bluestore, but I had a similar issue probably 2-3 years ago where XFS fragmentation was causing a significant degradation in cluster performance. The use case was RBDs with lots of snapshots created and deleted at regular intervals. XFS got pretty severely fragmented and the cluster slowed down quickly. The solution I found was to set the XFS allocsize to match the RBD object size via osd_mount_options_xfs. Of course I also had to defragment XFS to clear up the existing fragmentation, but that was fairly painless. XFS fragmentation hasn't been an issue since. That solution isn't as applicable in an object store use case where the object size is more variable, but increasing the XFS allocsize could still help. As far as Bluestore goes, I haven't deployed it in production yet, but I would expect that manipulating bluestore_min_alloc_size in a similar fashion would yield similar benefits. Of course you are then wasting some disk space for every object that ends up being smaller than that allocation size in both cases. That's the trade-off. [cid:SC_LOGO_VERT_4C_100x72_f823be1a-ae53-43d3-975c-b054a1b22ec3.jpg] Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation<https://storagecraft.com> 380 Data Drive Suite 300 | Draper | Utah | 84020 Office: 801.871.2799 | If you are not the intended recipient of this message or received it erroneously, please notify the sender and delete it, together with any attachments, and be advised that any dissemination or copying of this message is prohibited. On Thu, 2018-04-12 at 04:13 +0200, Marc Roos wrote: Is that not obvious? The 8TB is handling twice as much as the 4TB. Afaik there is not a linear relationship with the iops of a disk and its size. But interesting about this xfs defragmentation, how does this relate/compare to bluestore? -Original Message- From: ? ?? 
[mailto:yaozong...@outlook.com]
Sent: Thursday, 12 April 2018 4:36
To: ceph-users@lists.ceph.com
Subject: [ceph-users] osds with different disk sizes may killing performance

Hi,

For anybody who may be interested, here I share the process of locating the reason for a ceph cluster performance slowdown in our environment.

Internally, we have a cluster with 1.1PB capacity, 800TB used, and about 500TB of raw user data. Each day 3TB of data is uploaded and the 3TB of oldest data is lifecycled (we are using the s3 object store, and bucket lifecycle is enabled). As time went by, the cluster became somewhat slower, and we suspected xfs fragmentation was the culprit. After some testing, we did find that xfs fragmentation slows down filestore's performance; for example, at 15% fragmentation the performance is 85% of the original, and at 25% it is 74.73% of the original. But the main reason for our cluster's deterioration in performance was not xfs fragmentation.

Initially, our ceph cluster contained only osds with 4TB disks. As time went by, we scaled out the cluster by adding new osds with 8TB disks. Since each new disk's capacity is double that of the old disks, each new osd's weight is double that of an old osd. A new osd therefore holds double the pgs and uses double the disk space of an old osd. Everything looked good and fine. But even though a new osd has double the capacity of an old osd, its performance is not double. After digging into our internal system stats, we found that the newly added disks' io util is about twice that of the old ones, and from time to time the new disks' io util rises to 100%. The newly added osds were the performance killer: they slowed down the whole cluster. Once the reason was found, the solution was simple. After lowering the newly added osds' weights, the annoying slow request warnings died away.
So the conclusion is: in a cluster with OSDs of different disk sizes, an osd's weight should not be determined by its capacity alone; its performance must also be taken into account.

Best wishes,
Yao Zongyou

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
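A minimal sketch of the allocsize tuning Steve describes above, assuming filestore OSDs backing 4MB RBD objects. The mount-option string and device/path names are illustrative, not taken from either poster's configuration:

```shell
# ceph.conf on the OSD host: align XFS allocations with the 4MB
# default RBD object size to limit fragmentation on new writes.
# [osd]
# osd_mount_options_xfs = rw,noatime,inode64,allocsize=4M

# Report the current fragmentation factor of an XFS-backed OSD
# (read-only, safe on a mounted filesystem; device is an example).
xfs_db -r -c frag /dev/sdb1

# Defragment existing files in place while the OSD filesystem is
# mounted (path is an example).
xfs_fsr /var/lib/ceph/osd/ceph-0
```

The allocsize change only affects allocations made after remount, which is why the one-time `xfs_fsr` pass was still needed to clear pre-existing fragmentation.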
Re: [ceph-users] Reweight 0 - best way to backfill slowly?
There are two concerns with setting the reweight to 1.0. The first is peering and the second is backfilling. Peering is going to block client I/O on the affected OSDs, while backfilling will only potentially slow things down. I don't know what your client I/O looks like, but personally I would set the norecover and nobackfill flags and slowly increment your reweight value by 0.01, or whatever you deem appropriate for your environment, waiting for peering to complete between each step. Also allow any resulting blocked requests to clear up before incrementing your reweight again. When your reweight is all the way up to 1.0, inject osd_max_backfills to whatever you like (or don't, if you're happy with it as is) and unset the norecover and nobackfill flags to let backfilling begin.

If you are unable to handle the impact of backfilling with osd_max_backfills set to 1, then you need to add some new OSDs to your cluster before doing any of this. They will have to backfill too, but at least you'll have more spindles to handle it.

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation

On Mon, 2018-01-29 at 22:43 +0100, David Majchrzak wrote:

And so I totally forgot to add df tree to the mail. Here's the interesting bit from the first two nodes, where osd.11 has weight but is reweighted to 0.
root@osd1:~# ceph osd df tree
ID WEIGHT    REWEIGHT SIZE   USE    AVAIL  %USE  VAR  TYPE NAME
-1 181.7     -        109T   50848G 60878G 0     0    root default
-2 36.3      -        37242G 16792G 20449G 45.09 0.99     host osd1
 0 3.64000   1.0      3724G  1730G  1993G  46.48 1.02         osd.0
 1 3.64000   1.0      3724G  1666G  2057G  44.75 0.98         osd.1
 2 3.64000   1.0      3724G  1734G  1989G  46.57 1.02         osd.2
 3 3.64000   1.0      3724G  1387G  2336G  37.25 0.82         osd.3
 4 3.64000   1.0      3724G  1722G  2002G  46.24 1.01         osd.4
 6 3.64000   1.0      3724G  1840G  1883G  49.43 1.08         osd.6
 7 3.64000   1.0      3724G  1651G  2072G  44.34 0.97         osd.7
 8 3.64000   1.0      3724G  1747G  1976G  46.93 1.03         osd.8
 9 3.64000   1.0      3724G  1697G  2026G  45.58 1.00         osd.9
 5 3.64000   1.0      3724G  1614G  2109G  43.34 0.95         osd.5
-3 36.3      -        0      0      0      0     0        host osd2
12 3.64000   1.0      3724G  1730G  1993G  46.46 1.02         osd.12
13 3.64000   1.0      3724G  1745G  1978G  46.88 1.03         osd.13
14 3.64000   1.0      3724G  1707G  2016G  45.84 1.01         osd.14
15 3.64000   1.0      3724G  1540G  2184G  41.35 0.91         osd.15
16 3.64000   1.0      3724G  1484G  2239G  39.86 0.87         osd.16
18 3.64000   1.0      3724G  1928G  1796G  51.77 1.14         osd.18
20 3.64000   1.0      3724G  1767G  1956G  47.45 1.04         osd.20
10 3.64000   1.0      3724G  1797G  1926G  48.27 1.06         osd.10
49 3.64000   1.0      3724G  1847G  1877G  49.60 1.09         osd.49
11 3.64000   0        0      0      0      0     0            osd.11

On 29 Jan 2018, at 22:40, David Majchrzak <da...@visions.se> wrote:

Hi!

Cluster: 5 HW nodes, 10 HDDs with SSD journals, filestore, 0.94.9 hammer, debian wheezy (scheduled to upgrade once this is fixed).

I have a replaced HDD that another admin set to reweight 0 instead of weight 0 (I can't remember the reason). What would be the best way to slowly backfill it? Usually I'm using weight and slowly growing it to max size. I guess if I just set reweight to 1.0, it will backfill as fast as I let it, that is max 1 backfill / osd, but it will probably disrupt client io (this being on hammer). And if I set the weight on it to 0, the node will get less weight, and data will start moving around everywhere, right?
Can I use reweight the same way as weight here, slowly increasing it up to 1.0 by increments of, say, 0.01?

Kind Regards,
David Majchrzak

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
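The procedure Steve outlines might look something like this in practice. The OSD id (11), the step size, and the fixed sleep are placeholders; in reality each step should be gated on `ceph -s` showing peering finished and blocked requests cleared:

```shell
# Hold off recovery/backfill while stepping the reweight up.
ceph osd set norecover
ceph osd set nobackfill

# Raise the reweight gradually; watch 'ceph -s' between steps and wait
# for peering to complete and slow requests to clear before continuing.
for w in $(seq 0.01 0.01 1.00); do
    ceph osd reweight 11 "$w"
    sleep 60   # crude stand-in for "wait until peering completes"
done

# Optionally tune backfill throughput, then let backfilling begin.
ceph tell osd.\* injectargs '--osd_max_backfills 1'
ceph osd unset nobackfill
ceph osd unset norecover
```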
Re: [ceph-users] upgrade Hammer>Jewel>Luminous OSD fail to start
It seems like I've seen similar behavior in the past with the change of the osd user context between hammer and jewel. Hammer ran osds as root, and they switched to running as the ceph user in jewel. That doesn't really match your scenario perfectly, but I think the errors you're seeing in the logs match what I've seen in that situation before. If that's the issue, you need to chown everything under /var/lib/ceph/osd to be owned by ceph instead of root, as documented in the jewel release notes.

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation

On Wed, 2017-09-13 at 00:52 +0530, kevin parrikar wrote:

Can someone please help me on this? I have no idea how to bring the cluster back to an operational state.

Thanks,
Kev

On Tue, Sep 12, 2017 at 11:12 AM, kevin parrikar <kevin.parker...@gmail.com> wrote:

hello All,

I am trying to upgrade a small test setup having one monitor and one osd node which is on the hammer release. I updated from hammer to jewel using package update commands and things are working. However, after updating from jewel to luminous, I am facing issues with the osd failing to start.
I upgraded packages on both nodes, and "ceph mon versions" shows the upgrade was successful:

ceph mon versions
{
    "ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)": 1
}

but "ceph osd versions" returns an empty string:

ceph osd versions
{}

dpkg --list | grep ceph
ii ceph          12.2.0-1trusty amd64 distributed storage and file system
ii ceph-base     12.2.0-1trusty amd64 common ceph daemon libraries and management tools
ii ceph-common   12.2.0-1trusty amd64 common utilities to mount and interact with a ceph storage cluster
ii ceph-deploy   1.5.38         all   Ceph-deploy is an easy to use configuration tool
ii ceph-mgr      12.2.0-1trusty amd64 manager for the ceph distributed storage system
ii ceph-mon      12.2.0-1trusty amd64 monitor server for the ceph storage system
ii ceph-osd      12.2.0-1trusty amd64 OSD server for the ceph storage system
ii libcephfs1    10.2.9-1trusty amd64 Ceph distributed file system client library
ii libcephfs2    12.2.0-1trusty amd64 Ceph distributed file system client library
ii python-cephfs 12.2.0-1trusty amd64 Python 2 libraries for the Ceph libcephfs library

From the OSD log:

2017-09-12 05:38:10.618023 7fc307a10d00 0 set uid:gid to 64045:64045 (ceph:ceph)
2017-09-12 05:38:10.618618 7fc307a10d00 0 ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), process (unknown), pid 21513
2017-09-12 05:38:10.624473 7fc307a10d00 0 pidfile_write: ignore empty --pid-file
2017-09-12 05:38:10.633099 7fc307a10d00 0 load: jerasure load: lrc load: isa
2017-09-12 05:38:10.633657 7fc307a10d00 0 filestore(/var/lib/ceph/osd/ceph-0) backend xfs (magic 0x58465342)
2017-09-12 05:38:10.635164 7fc307a10d00 0 filestore(/var/lib/ceph/osd/ceph-0) backend xfs (magic 0x58465342)
2017-09-12 05:38:10.637503 7fc307a10d00 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2017-09-12 05:38:10.637833 7fc307a10d00 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: SEEK_DATA/SEEK_HOLE is disabled via
'filestore seek data hole' config option
2017-09-12 05:38:10.637923 7fc307a10d00 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: splice() is disabled via 'filestore splice' config option
2017-09-12 05:38:10.639047 7fc307a10d00 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2017-09-12 05:38:10.639501 7fc307a10d00 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: extsize is disabled by conf
2017-09-12 05:38:10.640417 7fc307a10d00 0 file
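If the ownership mismatch Steve describes is the problem, the fix from the Jewel release notes is roughly the following. The service commands assume systemd; on Ubuntu trusty (as in this report) the equivalent upstart/sysvinit commands apply:

```shell
# Stop the OSD daemons, hand ownership of the OSD data directories to
# the ceph user, then restart. The chown can take a while on a large
# filestore OSD.
systemctl stop ceph-osd.target
chown -R ceph:ceph /var/lib/ceph/osd
systemctl start ceph-osd.target
```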
Re: [ceph-users] Power outages!!! help!
I'm not familiar with dd_rescue, but I've just been reading about it. I'm not seeing any features that would be beneficial in this scenario that aren't also available in dd. What specific features give it "really a far better chance of restoring a copy of your disk" than dd? I'm always interested in learning about new recovery tools.

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation

On Tue, 2017-08-29 at 21:49 +0200, Willem Jan Withagen wrote:

On 29-8-2017 19:12, Steve Taylor wrote:

Hong,

Probably your best chance at recovering any data without special, expensive, forensic procedures is to perform a dd from /dev/sdb to somewhere else large enough to hold a full disk image and attempt to repair that. You'll want to use 'conv=noerror' with your dd command since your disk is failing. Then you could either re-attach the OSD from the new source or attempt to retrieve objects from the filestore on it.

Like somebody else already pointed out: in problem cases like this disk, use dd_rescue. It has really a far better chance of restoring a copy of your disk.

--WjW

I have actually done this before by creating an RBD that matches the disk size, performing the dd, running xfs_repair, and eventually adding it back to the cluster as an OSD. RBDs as OSDs is certainly a temporary arrangement for repair only, but I'm happy to report that it worked flawlessly in my case. I was able to weight the OSD to 0, offload all of its data, then remove it for a full recovery, at which point I just deleted the RBD. The possibilities afforded by Ceph inception are endless.
☺

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation

On Mon, 2017-08-28 at 23:17 +0100, Tomasz Kusmierz wrote:

Rule of thumb with batteries is:
- the more "proper temperature" you run them at, the more life you get out of them
- the more a battery is overpowered for your application, the longer it will survive.

Get yourself an LSI 94** controller and use it as an HBA and you will be fine. But get MORE DRIVES! …

On 28 Aug 2017, at 23:10, hjcho616 <hjcho...@yahoo.com> wrote:

Thank you Tomasz and Ronny. I'll have to order some hdd soon and try these out. Car battery idea is nice! I may try that.. =) Do they last longer? Ones that fit the UPS original battery spec didn't last very long... part of the reason why I gave up on them.. =P My wife probably won't like the idea of a car battery hanging out though ha!

The OSD1 (the one with mostly ok OSDs, except that smart failure) motherboard doesn't have any additional SATA connectors available. Would it be safe to add another OSD host?

Regards,
Hong

On Monday, August 28, 2017 4:43 PM, Tomasz Kusmierz <tom.kusmierz@gmail.com> wrote:

Sorry for being brutal … anyway:
1. Get the battery for the UPS (a car battery will do as well; I've modded a UPS in the past with a truck battery and it was working like a charm :D)
2. Get spare drives and put those in, because your cluster CAN NOT get out of error due to lack of space
3. Follow the advice of Ronny Aasen on how to recover data from hard drives
4. Get cooling to the drives or you will lose more!

On 28 Aug 2017, at 22:39, hjcho616 <hjcho...@yahoo.com> wrote:

Tomasz,

Those machines are behind a surge protector.
Doesn't appear to have been a good one! I do have a UPS... but it is my fault... no battery. Power was pretty reliable for a while... and the UPS was just beeping every chance it had, disrupting some sleep.. =P So I'm running on the surge protector only. I am running this in a home environment. So far, HDD failures have been very rare for this environment. =) It just doesn't get loaded as much! I am not sure what to expect; seeing that "unfound" and just a feeling of the possibility of maybe getting the OSD back made me excited about it. =) Thanks for letting me know what the priority should be. I just lack experience and knowledge in this. =) Please do continue to guide me through this.

Thank you for the decode of those smart messages! I do agree that it looks like it is on its way out. I would like to know
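To answer Steve's dd vs. dd_rescue question in concrete terms: plain dd either aborts on a read error or, with conv=noerror,sync, zero-pads and plods on block by block, while GNU ddrescue keeps a map file so it can grab the healthy regions fast and come back to retry only the bad spots. Device and output paths below are examples:

```shell
# Plain dd: continue past read errors, zero-fill failed blocks so the
# image stays offset-aligned with the source disk.
dd if=/dev/sdb of=/mnt/backup/sdb.img bs=1M conv=noerror,sync

# GNU ddrescue: first pass skips the scraping of bad areas (-n) to
# rescue the easy data quickly; the map file records what's missing.
ddrescue -f -n /dev/sdb /mnt/backup/sdb.img /mnt/backup/sdb.map

# Second pass: retry only the previously failed regions, 3 times.
ddrescue -f -r3 /dev/sdb /mnt/backup/sdb.img /mnt/backup/sdb.map
```

On a dying drive the map-file approach matters: hammering a bad region first can finish the drive off before the good data has been copied.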
Re: [ceph-users] Power outages!!! help!
Yes, if I had created the RBD in the same cluster I was trying to repair, then I would have used rbd-fuse to "map" the RBD in order to avoid potential deadlock issues with the kernel client. I had another cluster available, so I copied its config file to the osd node, created the RBD in the second cluster, and used the kernel client for the dd, xfs_repair, and mount. Worked like a charm.

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation

On Tue, 2017-08-29 at 18:04 +, David Turner wrote:

But it was absolutely awesome to run an osd off of an rbd after the disk failed.

On Tue, Aug 29, 2017, 1:42 PM David Turner <drakonst...@gmail.com> wrote:

To addend Steve's success: the rbd was created in a second cluster in the same datacenter, so it didn't run the risk of deadlocking that mapping rbds on machines running osds has. It is still theoretical to work on the same cluster, but more inherently dangerous for a few reasons.

On Tue, Aug 29, 2017, 1:15 PM Steve Taylor <steve.tay...@storagecraft.com> wrote:

Hong,

Probably your best chance at recovering any data without special, expensive, forensic procedures is to perform a dd from /dev/sdb to somewhere else large enough to hold a full disk image and attempt to repair that. You'll want to use 'conv=noerror' with your dd command since your disk is failing. Then you could either re-attach the OSD from the new source or attempt to retrieve objects from the filestore on it.
I have actually done this before by creating an RBD that matches the disk size, performing the dd, running xfs_repair, and eventually adding it back to the cluster as an OSD. RBDs as OSDs is certainly a temporary arrangement for repair only, but I'm happy to report that it worked flawlessly in my case. I was able to weight the OSD to 0, offload all of its data, then remove it for a full recovery, at which point I just deleted the RBD. The possibilities afforded by Ceph inception are endless. ☺

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation

On Mon, 2017-08-28 at 23:17 +0100, Tomasz Kusmierz wrote:
> Rule of thumb with batteries is:
> - the more "proper temperature" you run them at, the more life you get out of them
> - the more a battery is overpowered for your application, the longer it will survive.
>
> Get yourself an LSI 94** controller and use it as an HBA and you will be fine. But get MORE DRIVES! …
>
> > On 28 Aug 2017, at 23:10, hjcho616 <hjcho...@yahoo.com> wrote:
> >
> > Thank you Tomasz and Ronny. I'll have to order some hdd soon and
> > try these out. Car battery idea is nice! I may try that.. =) Do
> > they last longer? Ones that fit the UPS original battery spec
> > didn't last very long... part of the reason why I gave up on them..
> > =P My wife probably won't like the idea of a car battery hanging out
> > though ha!
> >
> > The OSD1 (the one with mostly ok OSDs, except that smart failure)
> > motherboard doesn't have any additional SATA connectors available.
> > Would it be safe to add another OSD host?
> >
> > Regards,
> > Hong
> >
> > On Monday, August 28, 2017 4:43 PM, Tomasz Kusmierz <tom.kusmierz@gmail.com> wrote:
> >
> > Sorry for being brutal … anyway:
> > 1. Get the battery for the UPS (a car battery will do as well; I've
> > modded a UPS in the past with a truck battery and it was working like
> > a charm :D)
> > 2. Get spare drives and put those in, because your cluster CAN NOT
> > get out of error due to lack of space
> > 3. Follow the advice of Ronny Aasen on how to recover data from hard
> > drives
> > 4. Get cooling to the drives or you will lose more!
> >
> > > On 28 Aug 2017, at 22:39, hjcho616 <hjcho...@yaho
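The second-cluster workflow Steve describes above could be sketched as follows. The cluster name (`second`, i.e. /etc/ceph/second.conf), pool/image names, size, and devices are all hypothetical placeholders, and size syntax varies by Ceph release:

```shell
# Create a repair RBD in the second cluster, sized to cover the
# failing disk.
rbd --cluster second create repair/osd-disk --size 4T

# Map it with the kernel client on the OSD host. This is the safe
# variant: the RBD lives in a different cluster than the local OSDs,
# so the kernel-client deadlock concern doesn't apply.
rbd --cluster second map repair/osd-disk   # e.g. appears as /dev/rbd0

# Image the failing disk onto it, repair the filesystem, and mount.
dd if=/dev/sdb of=/dev/rbd0 bs=1M conv=noerror,sync
xfs_repair /dev/rbd0
mount /dev/rbd0 /mnt/recovery
```

In the same-cluster case, rbd-fuse would replace the `rbd map` step, at the cost of going through FUSE for the dd.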
Re: [ceph-users] Power outages!!! help!
Hong,

Probably your best chance at recovering any data without special, expensive, forensic procedures is to perform a dd from /dev/sdb to somewhere else large enough to hold a full disk image and attempt to repair that. You'll want to use 'conv=noerror' with your dd command since your disk is failing. Then you could either re-attach the OSD from the new source or attempt to retrieve objects from the filestore on it.

I have actually done this before by creating an RBD that matches the disk size, performing the dd, running xfs_repair, and eventually adding it back to the cluster as an OSD. RBDs as OSDs is certainly a temporary arrangement for repair only, but I'm happy to report that it worked flawlessly in my case. I was able to weight the OSD to 0, offload all of its data, then remove it for a full recovery, at which point I just deleted the RBD. The possibilities afforded by Ceph inception are endless. ☺

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation

On Mon, 2017-08-28 at 23:17 +0100, Tomasz Kusmierz wrote:
> Rule of thumb with batteries is:
> - the more "proper temperature" you run them at, the more life you get
> out of them
> - the more a battery is overpowered for your application, the longer it
> will survive.
>
> Get yourself an LSI 94** controller and use it as an HBA and you will be
> fine. But get MORE DRIVES! …
>
> > On 28 Aug 2017, at 23:10, hjcho616 <hjcho...@yahoo.com> wrote:
> >
> > Thank you Tomasz and Ronny. I'll have to order some hdd soon and
> > try these out. Car battery idea is nice! I may try that.. =) Do
> > they last longer? Ones that fit the UPS original battery spec
> > didn't last very long...
part of the reason why I gave up on them..
> > =P My wife probably won't like the idea of a car battery hanging out
> > though ha!
> >
> > The OSD1 (the one with mostly ok OSDs, except that smart failure)
> > motherboard doesn't have any additional SATA connectors available.
> > Would it be safe to add another OSD host?
> >
> > Regards,
> > Hong
> >
> > On Monday, August 28, 2017 4:43 PM, Tomasz Kusmierz <tom.kusmierz@gmail.com> wrote:
> >
> > Sorry for being brutal … anyway:
> > 1. Get the battery for the UPS (a car battery will do as well; I've
> > modded a UPS in the past with a truck battery and it was working like
> > a charm :D)
> > 2. Get spare drives and put those in, because your cluster CAN NOT
> > get out of error due to lack of space
> > 3. Follow the advice of Ronny Aasen on how to recover data from hard
> > drives
> > 4. Get cooling to the drives or you will lose more!
> >
> > > On 28 Aug 2017, at 22:39, hjcho616 <hjcho...@yahoo.com> wrote:
> > >
> > > Tomasz,
> > >
> > > Those machines are behind a surge protector. Doesn't appear to
> > > have been a good one! I do have a UPS... but it is my fault... no
> > > battery. Power was pretty reliable for a while... and the UPS was
> > > just beeping every chance it had, disrupting some sleep.. =P So
> > > I'm running on the surge protector only. I am running this in a home
> > > environment. So far, HDD failures have been very rare for this
> > > environment. =) It just doesn't get loaded as much! I am not
> > > sure what to expect; seeing that "unfound" and just a feeling of
> > > the possibility of maybe getting the OSD back made me excited about it.
> > > =) Thanks for letting me know what the priority should be. I
> > > just lack experience and knowledge in this. =) Please do continue
> > > to guide me through this.
> > >
> > > Thank you for the decode of those smart messages! I do agree that
> > > it looks like it is on its way out. I would like to know how to get
> > > a good portion of it back if possible.
=) > > > > > > I think I just set the size and min_size to 1. > > > # ceph osd lspools > > > 0 data,1 metadata,2 rbd, > > > # ceph osd pool set rbd size 1 > > > set pool 2 size to 1 > > > # ceph osd pool set rbd min_size 1 > > > set pool 2 min_size to 1 > > > > > > Seems to be doing some backfilling work. > > > > > > # ceph health > > > HEALTH_ERR 22 pgs are stuck inactive for more than 300 seconds; 2 > > > pgs backfill_toofull; 7
Re: [ceph-users] Power outages!!! help!
I'm jumping in a little late here, but running xfs_repair on your partition can't frag your partition table. The partition table lives outside the partition block device, and xfs_repair doesn't have access to it when run against /dev/sdb1. I haven't actually tested it, but it seems unlikely that running xfs_repair on /dev/sdb would do it either. I would assume it would just give you an error about /dev/sdb not containing an XFS filesystem. That's a guess though. I haven't ever tried anything like that.

Are you sure there isn't physical damage to the disk? I wouldn't say it's common, but power outages can do that. You can run 'dmesg | grep sdb' and 'smartctl -a /dev/sdb' to see if there are kernel errors or SMART errors indicative of physical problems.

If the disk is physically sound and the partition table really has been fragged, you may be able to restore it from the backup at the end of the disk, assuming it's GPT. If you can't find a partition or a filesystem somehow, then you're probably out of luck as far as retrieving any objects from that OSD. If the disk is physically damaged and your partition is gone, then it probably isn't worth wasting additional time on it.

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation

On Mon, 2017-08-28 at 19:18 +, hjcho616 wrote:

Tomasz,

Looks like when I did xfs_repair -L /dev/sdb1 it did something to the partition table and I don't see /dev/sdb1 anymore... or maybe I missed the 1 in /dev/sdb1? =(. Yes.. that extra power outage did a pretty good damage...
=P I am hoping 0.007% is very small... =P Any recommendations on fixing the xfs partition I am missing? =)

Ronny,

Thank you for that link! No, I haven't done anything to the osds... not touching them, hoping that I can revive some of them.. =) The only thing done is trying to start and stop them.. Below are the links to newer files with just one start attempt. =)

ceph-osd.3_single.log <https://drive.google.com/open?id=0By7YztAJNGUWRUUtREZhY0NCVzQ>
ceph-osd.4_single.log <https://drive.google.com/open?id=0By7YztAJNGUWVzFxbEZ4UURLQzA>
ceph-osd.5_single.log <https://drive.google.com/open?id=0By7YztAJNGUWQ18wRUVwYkNMRW8>
ceph-osd.8_single.log <https://drive.google.com/open?id=0By7YztAJNGUWSk9XY01SQUo1Vmc>

Regards,
Hong

On Monday, August 28, 2017 12:53 PM, Ronny Aasen <ronny+ceph-us...@aasen.cx> wrote:

comments inline

On 28.08.2017 18:31, hjcho616 wrote:

I'll see what I can do on that...
Looks like I may have to add another OSD host, as I have utilized all of the SATA ports on those boards. =P

Ronny,

I am running with size=2 min_size=1. I created everything with ceph-deploy and didn't touch much of the pool settings... I hope not, but it sounds like I may have lost some files! I do want some of those OSDs to come back online somehow... to get that confidence level up. =P

This is a bad idea, as you have found out. Once your cluster is healthy you should look at improving this.

The dead osd.3 message is probably me trying to stop and start the osd. There were some cases where stop didn't kill the ceph-osd process. I just started or restarted the osd to try and see if that worked.. After that, there were some reboots, and I am not seeing those messages after it... when providing logs.

try to move away the old one. do a single startup. and post that. it ma
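On the lost-partition-table question earlier in this thread: if the disk itself still reads cleanly, the backup GPT at the end of the disk can often be restored. /dev/sdb is the example device, and this is only a sketch of the usual gdisk recovery path, not a guaranteed fix:

```shell
# Verify both GPT copies; gdisk reports whether the main or the
# backup header is damaged.
gdisk -l /dev/sdb

# Interactively rebuild the main GPT from the backup copy:
# r (recovery menu) -> b (use backup GPT header) -> w (write).
gdisk /dev/sdb
```

If gdisk can't find a valid backup header either, the partition geometry is likely gone and further time is better spent on the healthier OSDs.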
Re: [ceph-users] how to fix X is an unexpected clone
I encountered this same issue on two different clusters running Hammer 0.94.9 last week. In both cases I was able to resolve it by deleting (moving) all replicas of the unexpected clone manually and issuing a pg repair.

Which version did you see this on? A call stack for the resulting crash would also be interesting, although troubleshooting further is probably less valid and less valuable now that you've resolved the problem. It's just a matter of curiosity at this point.

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation

On Tue, 2017-08-08 at 12:02 +0200, Stefan Priebe - Profihost AG wrote:

Hello Greg,

Am 08.08.2017 um 11:56 schrieb Gregory Farnum:

On Mon, Aug 7, 2017 at 11:55 PM Stefan Priebe - Profihost AG <s.pri...@profihost.ag> wrote:

Hello,

how can i fix this one:

2017-08-08 08:42:52.265321 osd.20 [ERR] repair 3.61a 3:58654d3d:::rbd_data.106dd406b8b4567.018c:9d455 is an unexpected clone
2017-08-08 08:43:04.914640 mon.0 [INF] HEALTH_ERR; 1 pgs inconsistent; 1 pgs repair; 1 scrub errors
2017-08-08 08:43:33.470246 osd.20 [ERR] 3.61a repair 1 errors, 0 fixed
2017-08-08 08:44:04.915148 mon.0 [INF] HEALTH_ERR; 1 pgs inconsistent; 1 scrub errors

If i just delete the relevant files manually, ceph crashes. rados does not list them at all? How can i fix this?

You've sent quite a few emails that have this story spread out, and I think you've tried several different steps to repair it that have been a bit difficult to track.
It would be helpful if you could put the whole story in one place and explain very carefully exactly what you saw and how you responded. Stuff like manually copying around the wrong files, or files without a matching object info, could have done some very strange things. Also, basic debugging stuff like what version you're running will help. :)

Also note that since you've said elsewhere you don't need this image, I don't think it's going to hurt you to leave it like this for a bit (though it will definitely mess up your monitoring).
-Greg

i'm sorry about that. You're correct. I was able to fix this just a few minutes ago by using ceph-objectstore-tool and the remove operation to remove all the leftover files. I did this on all OSDs with the problematic pg. After that, ceph was able to fix itself.

A better approach might be for ceph to recover from an unexpected clone by just deleting it.

Greets,
Stefan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
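A hedged sketch of the manual cleanup both posters describe, using ceph-objectstore-tool. The OSD id and pg are taken from the log lines above; the exact flags vary between releases, the OSD must be stopped first, and the object spec must be the exact JSON printed by the list op, not typed by hand:

```shell
# Stop the OSD that holds a replica of the inconsistent pg.
systemctl stop ceph-osd@20

# List objects in the pg and locate the unexpected clone's JSON spec.
ceph-objectstore-tool \
    --data-path /var/lib/ceph/osd/ceph-20 \
    --journal-path /var/lib/ceph/osd/ceph-20/journal \
    --pgid 3.61a --op list

# With OBJ set to the JSON line from the listing, remove the clone.
# Repeat on every OSD holding a replica of this pg.
ceph-objectstore-tool \
    --data-path /var/lib/ceph/osd/ceph-20 \
    --journal-path /var/lib/ceph/osd/ceph-20/journal \
    "$OBJ" remove

# Restart the OSDs, then ask Ceph to repair the pg.
systemctl start ceph-osd@20
ceph pg repair 3.61a
```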
Re: [ceph-users] Read errors on OSD
I've seen similar issues in the past with 4U Supermicro servers populated with spinning disks. In my case it turned out to be a specific firmware+BIOS combination on the disk controller card that was buggy. I fixed it by updating the firmware and BIOS on the card to the latest versions. I saw this on several servers, and it took a while to track down, as you can imagine. Same symptoms you're reporting.

There was a data corruption problem a while back with the Linux kernel and Samsung 850 Pro drives, but your problem doesn't sound like data corruption. Still, I'd check to make sure the kernel version you're running has the fix.

On Thu, 2017-06-01 at 13:40 +0100, Oliver Humpage wrote:

On 1 Jun 2017, at 11:55, Matthew Vernon <m...@sanger.ac.uk> wrote:
You don't say what's in kern.log - we've had (rotating) disks that were throwing read errors but still saying they were OK on SMART.

Fair point. There was nothing correlating to the time that ceph logged an error this morning, which is why I didn't mention it, but looking harder I see yesterday there was a

May 31 07:20:13 osd1 kernel: sd 0:0:8:0: [sdi] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
May 31 07:20:13 osd1 kernel: sd 0:0:8:0: [sdi] tag#0 Sense Key : Hardware Error [current]
May 31 07:20:13 osd1 kernel: sd 0:0:8:0: [sdi] tag#0 Add. Sense: Internal target failure
May 31 07:20:13 osd1 kernel: sd 0:0:8:0: [sdi] tag#0 CDB: Read(10) 28 00 77 51 42 d8 00 02 00 00
May 31 07:20:13 osd1 kernel: blk_update_request: critical target error, dev sdi, sector 2001814232

sdi was the disk with the OSD affected today. Guess it's flaky SSDs then. Weird that just re-reading the file makes everything OK though - wondering how much it's worth worrying about that, or if there's a way of making ceph retry reads automatically?

Oliver.
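If you suspect flaky media or a flaky controller path, it can help to correlate the Ceph error with disk-level evidence. A hedged sketch (the device name and sector come from the kernel log above; adjust for your host):

```shell
# Look for media-error counters on the suspect disk
smartctl -a /dev/sdi | egrep -i 'reallocated|pending|uncorrect'

# Try re-reading the failing region directly, bypassing the page cache
dd if=/dev/sdi of=/dev/null bs=512 skip=2001814232 count=1024 iflag=direct

# Note the disk firmware version in case an update is available
smartctl -i /dev/sdi | grep -i firmware
```

If the direct re-read succeeds where Ceph saw an error, that points at a transient controller or firmware issue rather than bad media.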
Re: [ceph-users] Question about unfound objects
One other thing to note from this experience is that we do a LOT of RBD snap trimming, on the order of hundreds of millions of objects per day added to our snap_trimqs globally. All of the unfound objects in these cases were found on other OSDs in the cluster with identical contents, but associated with different snapshots. In other words, the file contents matched exactly, but the xattrs differed and the filenames indicated that the objects belonged to different snapshots. Some of the unfound objects belonged to head, so I don't necessarily believe that they were in the process of being trimmed, but I imagine there is some possibility that this issue is related to snap trimming or deleting snapshots. Just more information...

On Thu, 2017-03-30 at 17:13 +, Steve Taylor wrote:

Good suggestion, Nick. I actually did that at the time. The "ceph osd map" wasn't all that interesting because the OSDs had been outed and their PGs had been mapped to new OSDs. Everything appeared to be in order with the PGs being mapped to the right number of new OSDs. The PG mappings looked fine, but the objects just didn't exist anywhere except on the OSDs that had been marked out.

The PG queries were a little more useful, but still didn't really help in the end. In all cases (unfound objects from 2 OSDs in each of 2 occurrences), the PGs showed 5 or so OSDs where they thought the unfound objects might be, one of which was an OSD that had been marked out. In both cases we even waited until backfilling completed to see if perhaps the missing objects would turn up somewhere else, but none ever did.

In the first instance we were simply able to reattach the 2 OSDs to the cluster with 0 weight and recover the unfound objects. The second instance involved drive problems and was a little bit trickier. The drives had experienced errors and the XFS filesystems had both become corrupt and wouldn't even mount.
We didn't have any spare drives large enough, so I ended up using dd, ignoring errors, to copy the disks to RBDs in a different Ceph cluster. I then kernel mapped the RBDs on the host with the failed drives, ran XFS repairs on them, mounted them to the OSD directories, started the OSDs, and put them back in the cluster with 0 weight. I was lucky enough that those objects were available and they were recovered. Of course I immediately removed those OSDs once the unfound objects cleared up.

That's the other interesting aspect of this problem. This cluster had 4TB HGST drives for its OSDs, but we had to expand it fairly urgently and didn't have enough drives. We added two new servers, each with 16 4TB drives and 16 8TB HGST He8 drives. In both instances the problems we encountered were with the 8TB drives. We have since acquired more 4TB drives and have replaced all of the 8TB drives in the cluster.

We have a total of 8 production clusters globally and have been running Ceph in production for 2 years. These two occurrences recently are the only times we've seen these types of issues, and it was exclusive to the 8TB OSDs. I'm not sure how that would cause such a problem, but it's an interesting data point.

On Thu, 2017-03-30 at 17:33 +0100, Nick Fisk wrote:
Hi Steve,

If you can recreate or if you can remember the object name, it might be worth trying to run "ceph osd map" on the objects and see where it thinks they map to. And/or maybe pg query might show something?

Nick
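The dd-to-RBD recovery described above might look something like this. A sketch only: the image name, sizes, devices, and OSD id are placeholders, and xfs_repair options depend on the state of the copied log.

```shell
# In a healthy secondary cluster: create an image big enough for the 8TB disk
rbd create --size 8388608 rescue/osd-123    # size in MB

# On the host with the failed drive: map it and copy, ignoring read errors
dev=$(rbd map rescue/osd-123)
dd if=/dev/sdX of=$dev bs=4M conv=sync,noerror

# Repair and mount the copied filesystem in place of the dead OSD
xfs_repair $dev
mount $dev /var/lib/ceph/osd/ceph-123

# Bring the OSD back with zero weight so it only serves recovery
ceph osd crush reweight osd.123 0
service ceph start osd.123
```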
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Steve Taylor
Sent: 30 March 2017 16:24
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Question about unfound objects

We've had a couple of puzzling experiences recently with unfound objects, and I wonder if anyone can shed some light. This happened with Ham
Re: [ceph-users] Question about unfound objects
Good suggestion, Nick. I actually did that at the time. The "ceph osd map" wasn't all that interesting because the OSDs had been outed and their PGs had been mapped to new OSDs. Everything appeared to be in order with the PGs being mapped to the right number of new OSDs. The PG mappings looked fine, but the objects just didn't exist anywhere except on the OSDs that had been marked out.

The PG queries were a little more useful, but still didn't really help in the end. In all cases (unfound objects from 2 OSDs in each of 2 occurrences), the PGs showed 5 or so OSDs where they thought the unfound objects might be, one of which was an OSD that had been marked out. In both cases we even waited until backfilling completed to see if perhaps the missing objects would turn up somewhere else, but none ever did.

In the first instance we were simply able to reattach the 2 OSDs to the cluster with 0 weight and recover the unfound objects. The second instance involved drive problems and was a little bit trickier. The drives had experienced errors and the XFS filesystems had both become corrupt and wouldn't even mount. We didn't have any spare drives large enough, so I ended up using dd, ignoring errors, to copy the disks to RBDs in a different Ceph cluster. I then kernel mapped the RBDs on the host with the failed drives, ran XFS repairs on them, mounted them to the OSD directories, started the OSDs, and put them back in the cluster with 0 weight. I was lucky enough that those objects were available and they were recovered. Of course I immediately removed those OSDs once the unfound objects cleared up.

That's the other interesting aspect of this problem. This cluster had 4TB HGST drives for its OSDs, but we had to expand it fairly urgently and didn't have enough drives. We added two new servers, each with 16 4TB drives and 16 8TB HGST He8 drives. In both instances the problems we encountered were with the 8TB drives.
We have since acquired more 4TB drives and have replaced all of the 8TB drives in the cluster. We have a total of 8 production clusters globally and have been running Ceph in production for 2 years. These two occurrences recently are the only times we've seen these types of issues, and it was exclusive to the 8TB OSDs. I'm not sure how that would cause such a problem, but it's an interesting data point.

On Thu, 2017-03-30 at 17:33 +0100, Nick Fisk wrote:
Hi Steve,

If you can recreate or if you can remember the object name, it might be worth trying to run "ceph osd map" on the objects and see where it thinks they map to. And/or maybe pg query might show something?

Nick

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Steve Taylor
Sent: 30 March 2017 16:24
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Question about unfound objects

We've had a couple of puzzling experiences recently with unfound objects, and I wonder if anyone can shed some light. This happened with Hammer 0.94.7 on a cluster with 1,309 OSDs. Our use case is exclusively RBD in this cluster, so it's naturally replicated. The rbd pool size is 3, min_size is 2. The crush map is flat, so each host is a failure domain. The OSD hosts are 4U Supermicro chassis with 32 OSDs each. Drive failures have caused the OSD count to be 1,309 instead of 1,312.

Twice in the last few weeks we've experienced issues where the cluster was HEALTH_OK but was frequently getting some blocked requests.
In each of the two occurrences we investigated and discovered that the blocked requests resulted from two drives in the same host that were misbehaving (a different set of 2 drives in each occurrence). We decided to remove the misbehaving OSDs and let things backfill to see if that would address the issue. Removing the drives resulted in a small number of unfound objects, which was surprising. We were able to add the OSDs back with 0 weight and recover the unfound objects in both cases, but removing two OSDs from a single failure domain shouldn't have resulted in unfound objects in an otherwise healthy cluster, correct?
[ceph-users] Question about unfound objects
We've had a couple of puzzling experiences recently with unfound objects, and I wonder if anyone can shed some light. This happened with Hammer 0.94.7 on a cluster with 1,309 OSDs. Our use case is exclusively RBD in this cluster, so it's naturally replicated. The rbd pool size is 3, min_size is 2. The crush map is flat, so each host is a failure domain. The OSD hosts are 4U Supermicro chassis with 32 OSDs each. Drive failures have caused the OSD count to be 1,309 instead of 1,312.

Twice in the last few weeks we've experienced issues where the cluster was HEALTH_OK but was frequently getting some blocked requests. In each of the two occurrences we investigated and discovered that the blocked requests resulted from two drives in the same host that were misbehaving (a different set of 2 drives in each occurrence). We decided to remove the misbehaving OSDs and let things backfill to see if that would address the issue. Removing the drives resulted in a small number of unfound objects, which was surprising. We were able to add the OSDs back with 0 weight and recover the unfound objects in both cases, but removing two OSDs from a single failure domain shouldn't have resulted in unfound objects in an otherwise healthy cluster, correct?
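When unfound objects turn up, the usual starting points are 'ceph health detail', 'ceph pg <pgid> query', and 'ceph pg <pgid> list_unfound'. As a toy illustration of scripting against the health output (the sample line below is fabricated for the example; real output comes from a live cluster):

```shell
# A made-up 'ceph health detail' style line, used only for illustration
sample='pg 3.2f is active+recovering+degraded, acting [20,41,87], 3 unfound'

# Pull out the PG id and the unfound count
pgid=$(echo "$sample" | awk '{print $2}')
unfound=$(echo "$sample" | grep -o '[0-9]\+ unfound' | awk '{print $1}')
echo "$pgid has $unfound unfound objects"
```

The extracted PG id is then what you feed to 'ceph pg 3.2f query' or 'ceph pg 3.2f list_unfound' on the cluster.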
Re: [ceph-users] add multiple OSDs to cluster
Generally speaking, you are correct. Adding more OSDs at once is more efficient than adding fewer at a time. That being said, do so carefully. We typically add OSDs to our clusters either 32 or 64 at once, and we have had issues on occasion with bad drives. It's common for us to have a drive or two go bad within 24 hours or so of adding them to Ceph, and if multiple drives fail in multiple failure domains within a short amount of time, bad things can happen. The efficient, safe approach is to add as many drives as possible within a single failure domain, wait for recovery, and repeat.

On Tue, 2017-03-21 at 19:56 +0100, mj wrote:
> Hi,
>
> Just a quick question about adding OSDs, since most of the docs I can
> find talk about adding ONE OSD, and I'd like to add four per server on
> my three-node cluster.
>
> This morning I tried the careful approach, and added one OSD to server1.
> It all went fine, everything rebuilt and I have a HEALTH_OK again now.
> It took around 7 hours.
>
> But now I started thinking... (and that's when things go wrong,
> therefore hoping for feedback here)
>
> The question: was I being stupid to add only ONE osd to server1? Is
> it not smarter to add all four OSDs at the same time?
>
> I mean: things will rebuild anyway... and I have the feeling that
> rebuilding from 4 -> 8 OSDs is not going to be much heavier than
> rebuilding from 4 -> 5 OSDs. Right?
>
> So better add all new OSDs together on a specific server?
>
> Or not? :-)
>
> MJ
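One common pattern for adding a whole batch with a single data movement (not from the thread above, just a widely used approach) is to set the rebalance flags before creating the OSDs:

```shell
# Prevent data movement while the new OSDs come up
ceph osd set norebalance
ceph osd set nobackfill

# ... create and start all four OSDs on server1 here ...

# Release the flags and let a single recovery pass run
ceph osd unset nobackfill
ceph osd unset norebalance
ceph -w    # watch until HEALTH_OK
```

With the flags set, data movement is deferred until all four OSDs are in, so the cluster rebuilds once per batch instead of once per OSD.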
Re: [ceph-users] KVM/QEMU rbd read latency
You might try running fio directly on the host using the rbd ioengine (direct librbd) and see how that compares. The major difference between that and the krbd test will be the page cache readahead, which will be present in the krbd stack but not with the rbd ioengine. I would have expected the guest OS to normalize that some due to its own page cache in the librbd test, but that might at least give you some more clues about where to look further.

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Phil Lacroute
Sent: Thursday, February 16, 2017 11:54 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] KVM/QEMU rbd read latency

Hi,

I am doing some performance characterization experiments for ceph with KVM guests, and I'm observing significantly higher read latency when using the QEMU rbd client compared to krbd. Is that expected or have I missed some tuning knobs to improve this?

Cluster details (note that this cluster was built for evaluation purposes, not production, hence the choice of small SSDs with low endurance specs):
Client host OS: Debian, 4.7.0 kernel
QEMU version 2.7.0
Ceph version Jewel 10.2.3
Client and OSD CPU: Xeon D-1541 2.1 GHz
OSDs: 5 nodes, 3 SSDs each, one journal partition and one data partition per SSD, XFS data file system (15 OSDs total)
Disks: DC S3510 240GB
Network: 10 GbE, dedicated switch for storage traffic
Guest OS: Debian, virtio drivers

Performance testing was done with fio on raw disk devices using this config:
ioengine=libaio
iodepth=128
direct=1
size=100%
rw=randread
bs=4k

Case 1: krbd, fio running on the raw rbd device on the client host (no guest)
IOPS: 142k
Average latency: 0.9 msec

Case 2: krbd, fio running in a guest (libvirt config below)
IOPS: 119k
Average latency: 1.1 msec

Case 3: QEMU RBD client, fio running in a guest (libvirt config below)
IOPS: 25k
Average latency: 5.2 msec

The question is why the test with the QEMU RBD client (case 3) shows 4 msec of additional latency compared to the guest using the krbd-mapped image (case 2). Note that the IOPS bottleneck for all of these cases is the rate at which the client issues requests, which is limited by the average latency and the maximum number of outstanding requests (128). Since the latency is the dominant factor in average read throughput for these small accesses, we would really like to understand the source of the additional latency.

Thanks,
Phil
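For the librbd comparison, an fio job using the rbd ioengine might look like the following (a sketch: the pool and image names are placeholders, and it requires fio built with librbd support):

```ini
[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=testimg
iodepth=128
direct=1
rw=randread
bs=4k

[rbd-randread]
```

Comparing this against the krbd numbers isolates the librbd code path from the guest and virtio layers.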
Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?
Thanks, Nick. One other data point that has come up is that nearly all of the blocked requests that are waiting on subops are waiting for OSDs with more PGs than the others. My test cluster has 184 OSDs, 177 of which are 3TB, with 7 4TB OSDs. The cluster is well balanced based on OSD capacity, so those 7 OSDs individually have 33% more PGs than the others and are causing almost all of the blocked requests. It appears that map updates are generally not blocking long enough to show up as blocked requests.

I set the reweight on those 7 OSDs to 0.75 and things are backfilling now. I'll test some more when the PG counts per OSD are more balanced and see what I get. I'll also play with the filestore queue. I was telling some of my colleagues yesterday that this looked likely to be related to buffer bloat somewhere. I appreciate the suggestion.

From: Nick Fisk [mailto:n...@fisk.me.uk]
Sent: Tuesday, February 7, 2017 10:25 AM
To: Steve Taylor <steve.tay...@storagecraft.com>; ceph-users@lists.ceph.com
Subject: RE: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

Hi Steve,

From what I understand, the issue is not with the queueing in Ceph, which is correctly moving client IO to the front of the queue. The problem lies below what Ceph controls, i.e. the scheduler and disk layer in Linux. Once the IOs leave Ceph it's a bit of a free-for-all, and the client IOs tend to get lost in large disk queues surrounded by all the snap trim IOs.
The workaround Sam is working on will limit the number of snap trims that are allowed to run, which I believe will have a similar effect to the sleep parameters in pre-Jewel clusters, but without pausing the whole IO thread. Ultimately the solution requires Ceph to be able to control the queuing of IOs at the lower levels of the kernel. Whether this is via some sort of tagging per IO (currently CFQ is only per thread/process) or some other method, I don't know.

I was speaking to Sage and he thinks the easiest method might be to shrink the filestore queue so that you don't get buffer bloat at the disk level. You should be able to test this out pretty easily now by changing the parameter; a queue of around 5-10 would probably be about right for spinning disks. It's a trade-off of peak throughput vs queue latency though.

Nick

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Steve Taylor
Sent: 07 February 2017 17:01
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

As I look at more of these stuck ops, it looks like more of them are actually waiting on subops than on osdmap updates, so maybe there is still some headway to be made with the weighted priority queue settings. I do see OSDs waiting for map updates all the time, but they aren't blocking things as much as the subops are. Thoughts?
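Sage's suggestion could be tried on a running cluster roughly like this (the value is a starting point to tune, not a recommendation):

```shell
# Shrink the filestore queue on all OSDs at runtime
ceph tell osd.* injectargs '--filestore_queue_max_ops 10'

# Or persist it in ceph.conf under [osd]:
#   filestore queue max ops = 10
```

A smaller queue trades some peak throughput for lower queue latency on spinning disks, as Nick notes.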
________
From: Steve Taylor
Sent: Tuesday, February 7, 2017 9:13 AM
To: 'ceph-users@lists.ceph.com' <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

Sorry, I lost the previous thread on this. I apologize for the resulting incomplete reply.

The issue that we're having with Jewel, as David Turner mentioned, is that we can't seem to throttle snap trimming sufficiently to prevent it from blocking I/O requests. On further investigation, I encountered osd_op_pq_max_tokens_per_priority,
Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?
As I look at more of these stuck ops, it looks like more of them are actually waiting on subops than on osdmap updates, so maybe there is still some headway to be made with the weighted priority queue settings. I do see OSDs waiting for map updates all the time, but they aren't blocking things as much as the subops are. Thoughts?

____
From: Steve Taylor
Sent: Tuesday, February 7, 2017 9:13 AM
To: 'ceph-users@lists.ceph.com' <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

Sorry, I lost the previous thread on this. I apologize for the resulting incomplete reply.

The issue that we're having with Jewel, as David Turner mentioned, is that we can't seem to throttle snap trimming sufficiently to prevent it from blocking I/O requests. On further investigation, I encountered osd_op_pq_max_tokens_per_priority, which should be usable in conjunction with 'osd_op_queue = wpq' to govern the availability of queue positions for various operations using costs, if I understand correctly.

I'm testing with RBDs using 4MB objects, so in order to leave plenty of room in the weighted priority queue for client I/O, I set osd_op_pq_max_tokens_per_priority to 64MB and osd_snap_trim_cost to 32MB+1. I figured this should essentially reserve 32MB in the queue for client I/O operations, which are prioritized higher and therefore shouldn't get blocked.
I still see blocked I/O requests, and when I dump in-flight ops, they show 'op must wait for map.' I assume this means that what's blocking the I/O requests at this point is all of the osdmap updates caused by snap trimming, and not the actual snap trimming itself starving the ops of op threads. Hammer is able to mitigate this with osd_snap_trim_sleep by directly throttling snap trimming and therefore causing less frequent osdmap updates, but there doesn't seem to be a good way to accomplish the same thing with Jewel.

First of all, am I understanding these settings correctly? If so, are there other settings that could potentially help here, or do we just need something like Sam already mentioned that can sort of reserve threads for client I/O requests? Even then it seems like we might have issues if we can't also throttle snap trimming. We delete a LOT of RBD snapshots on a daily basis, which we recognize is an extreme use case. Just wondering if there's something else to try or if we need to start working toward implementing something new ourselves to handle our use case better.
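For reference, the configuration being tested above would look something like this as a ceph.conf fragment (Jewel-era option names; the values are the ones from the experiment, not recommendations):

```ini
[osd]
osd op queue = wpq
; 64 MB of tokens per priority level
osd op pq max tokens per priority = 67108864
; cost of a snap trim op: 32 MB + 1, so a single snap trim
; leaves just under 32 MB of tokens for client I/O
osd snap trim cost = 33554433
```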
Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?
Sorry, I lost the previous thread on this. I apologize for the resulting incomplete reply.

The issue that we're having with Jewel, as David Turner mentioned, is that we can't seem to throttle snap trimming sufficiently to prevent it from blocking I/O requests. On further investigation, I encountered osd_op_pq_max_tokens_per_priority, which should be usable in conjunction with 'osd_op_queue = wpq' to govern the availability of queue positions for various operations using costs, if I understand correctly. I'm testing with RBDs using 4MB objects, so in order to leave plenty of room in the weighted priority queue for client I/O, I set osd_op_pq_max_tokens_per_priority to 64MB and osd_snap_trim_cost to 32MB+1. I figured this should essentially reserve 32MB in the queue for client I/O operations, which are prioritized higher and therefore shouldn't get blocked.

I still see blocked I/O requests, and when I dump in-flight ops, they show 'op must wait for map.' I assume this means that what's blocking the I/O requests at this point is all of the osdmap updates caused by snap trimming, and not the actual snap trimming itself starving the ops of op threads. Hammer is able to mitigate this with osd_snap_trim_sleep by directly throttling snap trimming and therefore causing less frequent osdmap updates, but there doesn't seem to be a good way to accomplish the same thing with Jewel.

First of all, am I understanding these settings correctly? If so, are there other settings that could potentially help here, or do we just need something like Sam already mentioned that can sort of reserve threads for client I/O requests? Even then it seems like we might have issues if we can't also throttle snap trimming. We delete a LOT of RBD snapshots on a daily basis, which we recognize is an extreme use case. Just wondering if there's something else to try or if we need to start working toward implementing something new ourselves to handle our use case better.
Re: [ceph-users] ***Suspected Spam*** dm-crypt journal replacement
No need to re-create the osd. The easiest way to replace the journal is by creating the new journal partition with the same partition guid. You can use

sgdisk -n <num>:<start>:<end> --change-name="<num>:ceph journal" --partition-guid=<num>:<journal uuid> --typecode=<num>:45b0969e-9b03-4f30-b4c6-5ec00ceff106 <device>

to create the new journal partition. You can get the partition guid of the failed journal via 'cat /var/lib/ceph/osd/<cluster>-<id>/journal_uuid' if you don't have it already. Once your partition is created correctly, dmcrypt should be able to map it using the existing key from the old journal. Then the journal needs to be initialized via 'ceph-osd -i <id> --mkjournal' and you should be able to start the osd at that point.

If you can't or don't want to reuse the existing partition guid with its associated dmcrypt key, you can follow the same procedure to create the journal partition using a new partition guid of your choice, but then you have to generate a dmcrypt key with something like 'dd bs=<key size> count=1 if=/dev/urandom of=/etc/ceph/dmcrypt-keys/<guid>' and then create the dmcrypt volume with 'cryptsetup --key-file /etc/ceph/dmcrypt-keys/<guid> --key-size <key size> create <name> <journal partition>' to get the encrypted journal device. Then you have to replace the 'journal' and 'journal_dmcrypt' symlinks in /var/lib/ceph/<osd dir> and write the new partition guid to the 'journal_uuid' file in the same directory. You still have to perform the --mkjournal with ceph-osd, and you should be good to go.
-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Nikolay Khramchikhin
Sent: Wednesday, January 25, 2017 6:50 AM
To: ceph-users@lists.ceph.com
Subject: ***Suspected Spam*** [ceph-users] dm-crypt journal replacement

Hello, folks,

Can someone share the procedure for replacing a failed journal deployed with "ceph-deploy disk prepare --dm-crypt"? I can't find anything about it in the docs. Is the only way to recreate the osd? Ceph Jewel 10.2.5

--
Regards,
Nikolay Khramchikhin
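The same-guid procedure described above could be scripted roughly as follows. A hedged sketch only: the OSD id, device, partition number, journal size, and key size are placeholders, and the service commands depend on your init system.

```shell
OSD=12          # placeholder OSD id
DEV=/dev/sdj    # placeholder journal device
PART=1          # placeholder partition number
GUID=$(cat /var/lib/ceph/osd/ceph-$OSD/journal_uuid)

# Recreate the journal partition with the old partition guid
sgdisk -n $PART:0:+10G \
       --change-name="$PART:ceph journal" \
       --partition-guid=$PART:$GUID \
       --typecode=$PART:45b0969e-9b03-4f30-b4c6-5ec00ceff106 \
       $DEV

# dmcrypt can now map it with the existing key for that guid
cryptsetup --key-file /etc/ceph/dmcrypt-keys/$GUID --key-size 256 \
       create $GUID $DEV$PART

# Initialize the new journal and restart the OSD
ceph-osd -i $OSD --mkjournal
service ceph start osd.$OSD
```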
Re: [ceph-users] 10.2.4 Jewel released
I'm seeing the same behavior with very similar perf top output. One server with 32 OSDs has a load average approaching 800. No excessive memory usage and no iowait at all.
-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Ruben Kerkhof Sent: Wednesday, December 7, 2016 3:08 PM To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] 10.2.4 Jewel released
On Wed, Dec 7, 2016 at 8:46 PM, Francois Lafont <francois.lafont.1...@gmail.com> wrote:
> Hi,
>
> On 12/07/2016 01:21 PM, Abhishek L wrote:
>
>> This point release fixes several important bugs in RBD mirroring, RGW
>> multi-site, CephFS, and RADOS.
>>
>> We recommend that all v10.2.x users upgrade. Also note the following
>> when upgrading from hammer
>
> Well... little warning: after upgrade from 10.2.3 to 10.2.4, I have big load
> cpu on osd and mds.
Yes, same here. perf top shows:
  8.23% [kernel]            [k] sock_recvmsg
  8.16% libpthread-2.17.so  [.] __libc_recv
  7.33% [kernel]            [k] fget_light
  7.24% [kernel]            [k] tcp_recvmsg
  6.41% [kernel]            [k] sock_has_perm
  6.19% [kernel]            [k] _raw_spin_lock_bh
  4.89% [kernel]            [k] system_call
  4.74% [kernel]            [k] avc_has_perm_flags
  3.93% [kernel]            [k] SYSC_recvfrom
  3.18% [kernel]            [k] fput
  3.15% [kernel]            [k] system_call_after_swapgs
  3.12% [kernel]            [k] local_bh_enable_ip
  3.11% [kernel]            [k] release_sock
  2.90% libpthread-2.17.so  [.] __pthread_enable_asynccancel
  2.71% libpthread-2.17.so  [.] __pthread_disable_asynccancel
  2.57% [kernel]            [k] inet_recvmsg
  2.43% [kernel]            [k] local_bh_enable
  2.16% [kernel]            [k] local_bh_disable
  2.03% [kernel]            [k] tcp_cleanup_rbuf
  1.44% [kernel]            [k] sockfd_lookup_light
  1.26% [kernel]            [k] _raw_spin_unlock
  1.20% [kernel]            [k] sysret_check
  1.18% [kernel]            [k] lock_sock_nested
  1.07% [kernel]            [k] selinux_socket_recvmsg
  0.98% [kernel]            [k] _raw_spin_unlock_bh
  0.97% ceph-osd            [.] Pipe::do_recv
  0.87% [kernel]            [k] _cond_resched
  0.73% [kernel]            [k] tcp_release_cb
  0.52% [kernel]            [k] security_socket_recvmsg
Kind regards, Ruben ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Is there a setting on Ceph that we can use to fix the minimum read size?
I also should have mentioned that you’ll naturally have to remount your OSD filestores once you’ve made the change to ceph.conf. You can either restart each OSD after making the config file change or simply use the mount command yourself with the remount option to add the allocsize option live to each OSD’s filestore mount point. From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Steve Taylor Sent: Wednesday, November 30, 2016 8:50 AM To: Thomas Bennett <tho...@ska.ac.za> Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Is there a setting on Ceph that we can use to fix the minimum read size? We’re using Ubuntu 14.04 on x86_64. We just added ‘osd mount options xfs = rw,noatime,inode64,allocsize=1m’ to the [osd] section of our ceph.conf so XFS allocates 1M blocks for new files. That only affected new files, so manual defragmentation was still necessary to clean up older data, but once that was done everything got better and stayed better. You can use the xfs_db command to check fragmentation on an XFS volume and xfs_fsr to perform a defragmentation. The defragmentation can run on a mounted filesystem too, so you don’t even have to rely on Ceph to avoid downtime. I probably wouldn’t run it everywhere at once though for performance reasons. A single OSD at a time would be ideal, but that’s a matter of preference.
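The live remount and the fragmentation check/repair steps above can be sketched as follows. This is a dry run — commands are echoed, not executed, since they need a real OSD host — and the mount point and device are hypothetical examples:

```shell
#!/bin/sh
# Dry-run sketch: apply allocsize=1m to a mounted filestore without
# restarting the OSD, then check and repair XFS fragmentation.
# /var/lib/ceph/osd/ceph-0 and /dev/sdb1 are placeholder examples.
run() { echo "+ $*"; }

OSD_MOUNT=/var/lib/ceph/osd/ceph-0
OSD_DEV=/dev/sdb1

# Apply allocsize live via remount:
run mount -o remount,rw,noatime,inode64,allocsize=1m "$OSD_MOUNT"

# Report the fragmentation factor, then defragment (xfs_fsr runs
# against a mounted filesystem, one OSD at a time):
run xfs_db -r -c frag "$OSD_DEV"
run xfs_fsr -v "$OSD_DEV"
```
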
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Thomas Bennett Sent: Wednesday, November 30, 2016 5:58 AM Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com> Subject: Re: [ceph-users] Is there a setting on Ceph that we can use to fix the minimum read size? Hi Kate and Steve, Thanks for the replies. Always good to hear back from a community :) I'm using Linux on x86_64 architecture, and the block size is limited to the page size, which is 4k. So it looks like I'm hitting hard limits with any change to increase the block size. I found this out by running the following commands: $ mkfs.xfs -f -b size=8192 /dev/sda1 $ mount -v /dev/sda1 /tmp/disk/ mount: Function not implemented #huh??? Checking out the man page: $ man mkfs.xfs -b block_size_options ... XFS on Linux currently only supports pagesize or smaller blocks. I'm hesitant to implement btrfs as it's still experimental, and ext4 seems to have the same limitation. Our current approach is to exclude from our procurement process the hard drive that we're getting the poor read rates from, but it would still be nice to find out how much control we have over how ceph-osd daemons read from the drives. I may attempt an strace on an osd daemon as we read, to see what read request size is actually being asked of the kernel. Cheers, Tom On Tue, Nov 29, 2016 at 11:53 PM, Steve Taylor <steve.tay...@storagecraft.com<mailto:steve.tay...@storagecraft.com>> wrote: We configured XFS on our OSDs to use 1M blocks (our use case is RBDs with 1M blocks) due to massive fragmentation in our filestores a while back. We were having to defrag all the time and cluster performance was noticeably degraded. We also create and delete lots of RBD snapshots on a daily basis, so that likely contributed to the fragmentation as well. It’s been MUCH better since we switched XFS to use 1M allocations. Virtually no fragmentation and performance is consistently good.
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com<mailto:ceph-users-boun...@lists.ceph.com>] On Behalf Of Kate Ward Sent: Tuesday, November 29, 2016 2:02 PM To: Thomas Bennett <tho...@ska.ac.za<mailto:tho...@ska.ac.za>> Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com> Subject: Re: [ceph-users] Is there a setting on Ceph that we can use to fix the minimum read size? I have no experience with XFS, but wouldn't expect poor behaviour with it. I use ZFS myself and know that it would combine writes, but btrfs might be an option. Do you know what block size was used to create the XFS filesystem? It looks like 4k is the default (reasonable) with a max of 64k. Perhaps a larger block size will give better performance for your particular use case. (I use a 1M block size with ZFS.) http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch04s02.html On Tue, Nov 29, 2016 at 10:23 AM Thomas Bennett <tho...@ska.ac.za<ma
Re: [ceph-users] Is there a setting on Ceph that we can use to fix the minimum read size?
We’re using Ubuntu 14.04 on x86_64. We just added ‘osd mount options xfs = rw,noatime,inode64,allocsize=1m’ to the [osd] section of our ceph.conf so XFS allocates 1M blocks for new files. That only affected new files, so manual defragmentation was still necessary to clean up older data, but once that was done everything got better and stayed better. You can use the xfs_db command to check fragmentation on an XFS volume and xfs_fsr to perform a defragmentation. The defragmentation can run on a mounted filesystem too, so you don’t even have to rely on Ceph to avoid downtime. I probably wouldn’t run it everywhere at once though for performance reasons. A single OSD at a time would be ideal, but that’s a matter of preference. From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Thomas Bennett Sent: Wednesday, November 30, 2016 5:58 AM Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Is there a setting on Ceph that we can use to fix the minimum read size? Hi Kate and Steve, Thanks for the replies. Always good to hear back from a community :) I'm using Linux on x86_64 architecture, and the block size is limited to the page size, which is 4k. So it looks like I'm hitting hard limits with any change to increase the block size. I found this out by running the following commands: $ mkfs.xfs -f -b size=8192 /dev/sda1 $ mount -v /dev/sda1 /tmp/disk/ mount: Function not implemented #huh??? Checking out the man page: $ man mkfs.xfs -b block_size_options ... XFS on Linux currently only supports pagesize or smaller blocks. I'm hesitant to implement btrfs as it's still experimental, and ext4 seems to have the same limitation. Our current approach is to exclude from our procurement process the hard drive that we're getting the poor read rates from, but it would still be nice to find out how much control we have over how ceph-osd daemons read from the drives.
I may attempt an strace on an osd daemon as we read, to see what read request size is actually being asked of the kernel. Cheers, Tom On Tue, Nov 29, 2016 at 11:53 PM, Steve Taylor <steve.tay...@storagecraft.com<mailto:steve.tay...@storagecraft.com>> wrote: We configured XFS on our OSDs to use 1M blocks (our use case is RBDs with 1M blocks) due to massive fragmentation in our filestores a while back. We were having to defrag all the time and cluster performance was noticeably degraded. We also create and delete lots of RBD snapshots on a daily basis, so that likely contributed to the fragmentation as well. It’s been MUCH better since we switched XFS to use 1M allocations. Virtually no fragmentation and performance is consistently good. From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com<mailto:ceph-users-boun...@lists.ceph.com>] On Behalf Of Kate Ward Sent: Tuesday, November 29, 2016 2:02 PM To: Thomas Bennett <tho...@ska.ac.za<mailto:tho...@ska.ac.za>> Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com> Subject: Re: [ceph-users] Is there a setting on Ceph that we can use to fix the minimum read size? I have no experience with XFS, but wouldn't expect poor behaviour with it. I use ZFS myself and know that it would combine writes, but btrfs might be an option. Do you know what block size was used to create the XFS filesystem? It looks like 4k is the default (reasonable) with a max of 64k. Perhaps a larger block size will give better performance for your particular use case. (I use a 1M block size with ZFS.) http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch04s02.html On Tue, Nov 29, 2016 at 10:23 AM Thomas Bennett <tho...@ska.ac.za<mailto:tho...@ska.ac.za>> wrote: Hi Kate, Thanks for your reply. We currently use xfs as created by ceph-deploy. What would you recommend we try?
Kind regards, Tom On Tue, Nov 29, 2016 at 11:14 AM, Kate Ward <kate.w...@forestent.com<mailto:kate.w...@forestent.com>> wrote: What filesystem do you use on the OSD? Have you considered a different filesystem that is better at combining requests before they get to the drive? k8 On Tue, Nov 29, 2016 at 9:52 AM Thomas Bennett <tho...@ska.ac.za<mailto:tho...@ska.ac.za>> wrote: Hi, We have a use case where we are reading 128MB objects off spinning disks. We've benchmarked a number of different hard drives and have noticed that for a particular hard drive, we're experiencing slow reads by comparison. This occurs when we have multiple readers (even just 2) reading objects off the OSD. We've recreated the effect using iozone and have noticed that once the record size drops to 4k, the hard drive misbehaves. Is there a setting on Ceph that we can change to fix the minimum read size when the ceph-osd daemon reads objects off the hard drives, to see if we can overcome the overall slow read rate? Cheers, Tom
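The strace idea mentioned in this thread can be sketched as below. This is a dry run — the commands are echoed rather than executed, since attaching requires a live ceph-osd process — and the OSD id and PID are placeholders:

```shell
#!/bin/sh
# Dry-run sketch: find a running OSD's PID and trace its read calls to
# see the request sizes actually hitting the kernel. OSD id 0 and PID
# 12345 are placeholders.
run() { echo "+ $*"; }

run pgrep -f "ceph-osd -i 0"                   # find the OSD's PID
run strace -f -e trace=read,pread64 -p 12345   # attach; substitute the real PID
# Each traced call shows the requested byte count as its last argument,
# which is the read size being asked of the kernel.
```
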
Re: [ceph-users] Is there a setting on Ceph that we can use to fix the minimum read size?
We configured XFS on our OSDs to use 1M blocks (our use case is RBDs with 1M blocks) due to massive fragmentation in our filestores a while back. We were having to defrag all the time and cluster performance was noticeably degraded. We also create and delete lots of RBD snapshots on a daily basis, so that likely contributed to the fragmentation as well. It’s been MUCH better since we switched XFS to use 1M allocations. Virtually no fragmentation and performance is consistently good. From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Kate Ward Sent: Tuesday, November 29, 2016 2:02 PM To: Thomas Bennett <tho...@ska.ac.za> Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Is there a setting on Ceph that we can use to fix the minimum read size? I have no experience with XFS, but wouldn't expect poor behaviour with it. I use ZFS myself and know that it would combine writes, but btrfs might be an option. Do you know what block size was used to create the XFS filesystem? It looks like 4k is the default (reasonable) with a max of 64k. Perhaps a larger block size will give better performance for your particular use case. (I use a 1M block size with ZFS.) http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch04s02.html On Tue, Nov 29, 2016 at 10:23 AM Thomas Bennett <tho...@ska.ac.za<mailto:tho...@ska.ac.za>> wrote: Hi Kate, Thanks for your reply. We currently use xfs as created by ceph-deploy. What would you recommend we try? Kind regards, Tom On Tue, Nov 29, 2016 at 11:14 AM, Kate Ward <kate.w...@forestent.com<mailto:kate.w...@forestent.com>> wrote: What filesystem do you use on the OSD? Have you considered a different filesystem that is better at combining requests before they get to the drive? k8 On Tue, Nov 29, 2016 at 9:52 AM Thomas Bennett <tho...@ska.ac.za<mailto:tho...@ska.ac.za>> wrote: Hi, We have a use case where we are reading 128MB objects off spinning disks. 
We've benchmarked a number of different hard drives and have noticed that for a particular hard drive, we're experiencing slow reads by comparison. This occurs when we have multiple readers (even just 2) reading objects off the OSD. We've recreated the effect using iozone and have noticed that once the record size drops to 4k, the hard drive misbehaves. Is there a setting on Ceph that we can change to fix the minimum read size when the ceph-osd daemon reads objects off the hard drives, to see if we can overcome the overall slow read rate? Cheers, Tom ___ ceph-users mailing list ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Thomas Bennett SKA South Africa Science Processing Team Office: +27 21 5067341<tel:+27%2021%20506%207341> Mobile: +27 79 5237105<tel:+27%2079%20523%207105>
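The record-size effect described above can be approximated with plain dd. This sketch compares 4k against 1M reads using a scratch file so it runs anywhere; to test a real drive, point if= at the device (and drop caches between passes) instead:

```shell
#!/bin/sh
# Create a 16 MB scratch file, then read it back with small and large
# record sizes. On a slow spinning disk, the bs=4k pass is where a
# misbehaving drive shows its poor throughput.
f=$(mktemp)
dd if=/dev/zero of="$f" bs=1M count=16 2>/dev/null

echo "4k reads:"
dd if="$f" of=/dev/null bs=4k 2>&1 | tail -n 1   # dd's summary line
echo "1M reads:"
dd if="$f" of=/dev/null bs=1M 2>&1 | tail -n 1

rm -f "$f"
```
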
Re: [ceph-users] Out-of-date RBD client libraries
CRUSH is what determines where data gets stored, so if you employ newer CRUSH tunables prematurely against older clients that don’t support them, then you run the risk of your clients not being able to find or place objects correctly. I don’t know Ceph’s internals well enough to tell you all of what might result at a lower level from such a scenario, but clients not knowing where data belongs seems bad enough. I wouldn’t necessarily expect data loss, but potentially a lot of client errors. From: jdavidli...@gmail.com [mailto:jdavidli...@gmail.com] On Behalf Of J David Sent: Tuesday, October 25, 2016 1:27 PM To: Steve Taylor <steve.tay...@storagecraft.com> Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Out-of-date RBD client libraries On Tue, Oct 25, 2016 at 3:10 PM, Steve Taylor <steve.tay...@storagecraft.com<mailto:steve.tay...@storagecraft.com>> wrote: Recently we tested an upgrade from 0.94.7 to 10.2.3 and found exactly the opposite. Upgrading the clients first worked for many operations, but we got "function not implemented" errors when we would try to clone RBD snapshots. Yes, we have seen “function not implemented” in the past as well when connecting new clients to old clusters. you must keep your CRUSH tunables at firefly or hammer until the clients are upgraded. Not that I am proposing to try it, but… or else what? Whatever the “or else!” is, the same would apply, I assume, to connecting old clients to a brand-new jewel cluster which would have been created with jewel tunables in the first place? Thanks!
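Keeping tunables pinned for old clients, as discussed above, comes down to two standard ceph CLI calls. A dry-run sketch (commands echoed, not executed, since they need a live cluster):

```shell
#!/bin/sh
# Dry-run sketch: inspect the current CRUSH tunables profile, then pin
# it to hammer so pre-jewel clients can still locate objects.
run() { echo "+ $*"; }

run ceph osd crush show-tunables    # shows the active profile and flags
run ceph osd crush tunables hammer  # keep hammer semantics until all clients upgrade
```
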
Re: [ceph-users] Out-of-date RBD client libraries
We tested an upgrade from 0.94.3 to 0.94.7 and experienced issues when the librbd clients were not upgraded first in the process. It was a while back and I don't remember the specific issues, but upgrading the clients prior to upgrading any services worked in that case. Recently we tested an upgrade from 0.94.7 to 10.2.3 and found exactly the opposite. Upgrading the clients first worked for many operations, but we got "function not implemented" errors when we would try to clone RBD snapshots. We re-tested that upgrade with the clients being upgraded after all of the services and everything worked fine for us in that case. The caveat there is that you must keep your CRUSH tunables at firefly or hammer until the clients are upgraded. At any rate, we've had different experiences upgrading the clients at different points in the process depending on the releases involved. The key is to test first and make sure you have a sane upgrade path before doing anything in production. -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of J David Sent: Tuesday, October 25, 2016 12:46 PM To: ceph-users@lists.ceph.com Subject: [ceph-users] Out-of-date RBD client libraries What are the potential consequences of using out-of-date client libraries with RBD against newer clusters? Specifically, what are the potential ill-effects of using Firefly client libraries (0.80.7 and 0.80.8) to access Hammer or Jewel (10.2.3) clusters?
The upgrading instructions ( http://docs.ceph.com/docs/jewel/install/upgrading-ceph/ ) don’t actually mention clients, just giving the recommended order as: ceph-deploy, mons, osds, mds, object gateways. Are long-running RBD clients (like Qemu virtual machines) placed at risk of instability or data corruption if they are not updated and restarted before, during, or after such an upgrade? If so, what are the potential consequences, and where in the process should they be upgraded to avoid those consequences? Thanks for any advice!
Re: [ceph-users] Ceph consultants?
Try using 'ceph-deploy osd create' instead of 'ceph-deploy osd prepare' and 'ceph-deploy osd activate' when using an entire disk for an OSD. That will create a journal partition and co-locate your journal on the same disk with the OSD, but that's fine for an initial dev setup. -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Tracy Reed Sent: Wednesday, October 5, 2016 3:12 PM To: Peter Maloney <peter.malo...@brockmann-consult.de> Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Ceph consultants? On Wed, Oct 05, 2016 at 01:17:52PM PDT, Peter Maloney spake thusly: > What do you need help with specifically? Setting up ceph isn't very > complicated... just fixing it when things go wrong should be. What > type of scale are you working with, and do you already have hardware? > Or is the problem more to do with integrating it with clients? Hi Peter, I agree, setting up Ceph isn't very complicated. I posted to the list on 10/03/16 with the initial problem I have run into under the subject "Can't activate OSD". Please refer to that thread as it has logs, details of my setup, etc. I started working on this about a month ago then spent several days on it and a few hours with a couple different people on IRC. Nobody has been able to figure out how to get my OSD activated. I took a couple weeks off and now I'm back at it as I really need to get this going soon.
Basically, I'm following the quickstart guide at http://docs.ceph.com/docs/jewel/start/quick-ceph-deploy/ and when I run the command to activate the OSDs like so: ceph-deploy osd activate ceph02:/dev/sdc ceph03:/dev/sdc I get this in the ceph-deploy log:
[2016-10-03 15:16:10,193][ceph_deploy.osd][INFO ] Distro info: CentOS Linux 7.2.1511 Core
[2016-10-03 15:16:10,193][ceph_deploy.osd][DEBUG ] activating host ceph03 disk /dev/sdc
[2016-10-03 15:16:10,193][ceph_deploy.osd][DEBUG ] will use init type: systemd
[2016-10-03 15:16:10,194][ceph03][DEBUG ] find the location of an executable
[2016-10-03 15:16:10,200][ceph03][INFO ] Running command: sudo /usr/sbin/ceph-disk -v activate --mark-init systemd --mount /dev/sdc
[2016-10-03 15:16:10,377][ceph03][WARNING] main_activate: path = /dev/sdc
[2016-10-03 15:21:10,380][ceph03][WARNING] No data was received after 300 seconds, disconnecting...
[2016-10-03 15:21:15,387][ceph03][INFO ] checking OSD status...
[2016-10-03 15:21:15,401][ceph03][DEBUG ] find the location of an executable
[2016-10-03 15:21:15,472][ceph03][INFO ] Running command: sudo /bin/ceph --cluster=ceph osd stat --format=json
[2016-10-03 15:21:15,698][ceph03][INFO ] Running command: sudo systemctl enable ceph.target
More details in other thread. Where am I going wrong here? Thanks! -- Tracy Reed
Re: [ceph-users] Cleanup old osdmaps after #13990 fix applied
I think it's a maximum of 30 maps per osdmap update. So if you've got huge caches like we had, then you might have to generate a lot of updates to get things squared away. That's what I did, and it worked really well. From: Dan Van Der Ster [daniel.vanders...@cern.ch] Sent: Wednesday, September 14, 2016 7:21 AM To: Steve Taylor Cc: ceph-us...@ceph.com Subject: Re: Cleanup old osdmaps after #13990 fix applied Hi Steve, Thanks, that sounds promising. Are only a limited number of maps trimmed for each new osdmap generated? If so, I'll generate a bit of churn to get these cleaned up. -- Dan > On 14 Sep 2016, at 15:08, Steve Taylor <steve.tay...@storagecraft.com> wrote: > > http://tracker.ceph.com/issues/13990 was created by a colleague of mine from > an issue that was affecting us in production. When 0.94.8 was released with > the fix, I immediately deployed a test cluster on 0.94.7, reproduced this > issue, upgraded to 0.94.8, and tested the fix. It worked beautifully. > > I suspect the issue you're seeing is that the clean-up only occurs when new > osdmaps are generated, so as long as nothing is changing you'll continue to > see lots of stale maps cached. We delete RBD snapshots all the time in our > production use case, which updates the osdmap, so I did that in my test > cluster and watched the map cache on one of the OSDs. Sure enough, after a > while the cache was pruned down to the expected size.
> > Over time I imagine you'll see things settle, but it may take a while if you > don't update the osdmap frequently. > > From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Dan Van Der > Ster [daniel.vanders...@cern.ch] > Sent: Wednesday, September 14, 2016 3:45 AM > To: ceph-us...@ceph.com > Subject: [ceph-users] Cleanup old osdmaps after #13990 fix applied > > Hi, > > We've just upgraded to 0.94.9, so I believe this issue is fixed: > >http://tracker.ceph.com/issues/13990 > > AFAICT "resolved" means the number of osdmaps saved on each OSD will not grow > unboundedly anymore. > > However, we have many OSDs with loads of old osdmaps, e.g.: > > # pwd > /var/lib/ceph/osd/ceph-257/current/meta > # find . -name 'osdmap*' | wc -l > 112810 > > (And our maps are ~1MB, so this is >100GB per OSD). > > Is there a solution to remove these old maps? > > Cheers, > Dan
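The churn mentioned above can be generated with a throwaway snapshot loop, since each snap create/rm bumps the osdmap epoch and each new epoch lets OSDs trim a batch of old cached maps. A dry-run sketch (commands echoed, not executed; pool and image names are hypothetical):

```shell
#!/bin/sh
# Dry-run sketch: generate osdmap updates via snapshot churn, then
# watch the per-OSD map cache shrink. rbd/churn is a placeholder image.
run() { echo "+ $*"; }

i=0
while [ "$i" -lt 10 ]; do
    run rbd snap create rbd/churn@trim-$i
    run rbd snap rm rbd/churn@trim-$i
    i=$((i + 1))
done

# Count cached osdmaps on an OSD (path from the thread above):
run sh -c "find /var/lib/ceph/osd/ceph-257/current/meta -name 'osdmap*' | wc -l"
```
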
Re: [ceph-users] Cleanup old osdmaps after #13990 fix applied
http://tracker.ceph.com/issues/13990 was created by a colleague of mine from an issue that was affecting us in production. When 0.94.8 was released with the fix, I immediately deployed a test cluster on 0.94.7, reproduced this issue, upgraded to 0.94.8, and tested the fix. It worked beautifully. I suspect the issue you're seeing is that the clean-up only occurs when new osdmaps are generated, so as long as nothing is changing you'll continue to see lots of stale maps cached. We delete RBD snapshots all the time in our production use case, which updates the osdmap, so I did that in my test cluster and watched the map cache on one of the OSDs. Sure enough, after a while the cache was pruned down to the expected size. Over time I imagine you'll see things settle, but it may take a while if you don't update the osdmap frequently. From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Dan Van Der Ster [daniel.vanders...@cern.ch] Sent: Wednesday, September 14, 2016 3:45 AM To: ceph-us...@ceph.com Subject: [ceph-users] Cleanup old osdmaps after #13990 fix applied Hi, We've just upgraded to 0.94.9, so I believe this issue is fixed: http://tracker.ceph.com/issues/13990 AFAICT "resolved" means the number of osdmaps saved on each OSD will not grow unboundedly anymore. However, we have many OSDs with loads of old osdmaps, e.g.: # pwd /var/lib/ceph/osd/ceph-257/current/meta # find . -name 'osdmap*' | wc -l 112810 (And our maps are ~1MB, so this is >100GB per OSD).
Is there a solution to remove these old maps? Cheers, Dan
Re: [ceph-users] Turn snapshot of a flattened snapshot into regular image
You can use 'rbd -p images --image 417ef4b6-b4b2-4e94-9ae6-ef7a4ee3e560 info' to see the parentage of your cloned RBD from Ceph's perspective. It seems like that could be useful at various times throughout this test to determine what glance is doing under the covers. -Original Message- From: Eugen Block [mailto:ebl...@nde.ag] Sent: Friday, September 2, 2016 7:12 AM To: Steve Taylor <steve.tay...@storagecraft.com> Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Turn snapshot of a flattened snapshot into regular image > Something isn't right. Ceph won't delete RBDs that have existing > snapshots That's what I thought, and I also noticed that in the first test, but not in the second. > The clone becomes a cinder device that is then attached to the nova instance. This is one option, but I don't use it. nova would create a cinder volume if I executed "nova boot --block-device ...", but I don't, so there's no cinder involved. I'll try to provide some details from openstack and ceph, maybe that helps to find the cause.
So I created a glance image:
control1:~ # glance image-list | grep Test
| 87862452-5872-40c9-b657-f5fec0d105c5 | Test2-SLE12SP1
which automatically gets one snapshot in rbd and has no children yet, because no VM has been launched yet:
ceph@node1:~/ceph-deploy> rbd -p images --image 87862452-5872-40c9-b657-f5fec0d105c5 snap ls
SNAPID NAME    SIZE
   429 snap 5120 MB
ceph@node1:~/ceph-deploy> rbd -p images --image 87862452-5872-40c9-b657-f5fec0d105c5 children --snap snap
ceph@node1:~/ceph-deploy>
Now I boot a VM:
nova boot --flavor 2 --image 87862452-5872-40c9-b657-f5fec0d105c5 --nic net-id=4eafc4da-a3cd-4def-b863-5fb8e645e984 vm1
with a resulting instance_uuid=0e44badb-8a76-41d8-be43-b4125ffc6806, and see this in ceph:
ceph@node1:~/ceph-deploy> rbd -p images --image 87862452-5872-40c9-b657-f5fec0d105c5 children --snap snap
images/0e44badb-8a76-41d8-be43-b4125ffc6806_disk
So I have the base image with a snapshot, and based on this snapshot a child which is the disk image for my instance. There is no cinder volume:
control1:~ # cinder list
+----+--------+------+------+-------------+----------+-------------+
| ID | Status | Name | Size | Volume Type | Bootable | Attached to |
+----+--------+------+------+-------------+----------+-------------+
+----+--------+------+------+-------------+----------+-------------+
Now I create a snapshot of vm1 (I removed some lines to focus on the IDs):
control1:~ # nova image-show 417ef4b6-b4b2-4e94-9ae6-ef7a4ee3e560
+-------------------------+--------------------------------------+
| Property                | Value                                |
+-------------------------+--------------------------------------+
| id                      | 417ef4b6-b4b2-4e94-9ae6-ef7a4ee3e560 |
| metadata base_image_ref | 87862452-5872-40c9-b657-f5fec0d105c5 |
| metadata image_type     | snapshot                             |
| metadata instance_uuid  | 0e44badb-8a76-41d8-be43-b4125ffc6806 |
| name                    | snap-vm1                             |
| server                  | 0e44badb-8a76-41d8-be43-b4125ffc6806 |
| status                  | ACTIVE                               |
| updated                 | 2016-09-02T12:51:28Z                 |
+-------------------------+--------------------------------------+
In rbd there is a new object now, without any children:
ceph@node1:~/ceph-deploy> rbd -p images --image 417ef4b6-b4b2-4e94-9ae6-ef7a4ee3e560 snap ls
SNAPID NAME     SIZE
   443 snap 20480 MB
ceph@node1:~/ceph-deploy> rbd -p images --image 417ef4b6-b4b2-4e94-9ae6-ef7a4ee3e560 children --snap snap
ceph@node1:~/ceph-deploy>
And there's still no cinder volume ;-)

After removing vm1 I can delete the base image and snap-vm1:

control1:~ # nova delete vm1
Request to delete server vm1 has been accepted.
control1:~ # glance image-delete 87862452-5872-40c9-b657-f5fec0d105c5
control1:~ #
control1:~ # glance image-delete 417ef4b6-b4b2-4e94-9ae6-ef7a4ee3e560

I did not flatten any snapshot yet, this is really strange! It seems as if the nova snapshot creates a full (flattened) image, so it doesn't depend on the base image. But I didn't
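For anyone wanting to repeat this check outside of OpenStack, the parent/child relationships in the walkthrough above can be inspected directly with rbd. A sketch using the image IDs from this example (pool/image@snap spellings are equivalent to the -p/--image/--snap flags used above):

```shell
# Snapshots of the glance base image
rbd snap ls images/87862452-5872-40c9-b657-f5fec0d105c5

# COW children of its protected 'snap' snapshot
rbd children images/87862452-5872-40c9-b657-f5fec0d105c5@snap

# Inspect the instance disk: a COW clone shows a 'parent:' line,
# a flattened (standalone) image does not
rbd info images/0e44badb-8a76-41d8-be43-b4125ffc6806_disk
```

The `rbd info` parent line is the quickest way to tell whether nova produced a clone or a full copy.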
Re: [ceph-users] Turn snapshot of a flattened snapshot into regular image
Something isn't right. Ceph won't delete RBDs that have existing snapshots, even when those snapshots aren't protected. You can't delete a snapshot that's protected, and you can't unprotect a snapshot if there is a COW clone that depends on it. I'm not intimately familiar with OpenStack, but it must be deleting A without any snapshots. That would seem to indicate that at the point of deletion there are no COW clones of A, or that any clone is no longer dependent on A. A COW clone requires a protected snapshot, a protected snapshot can't be deleted, and existing snapshots prevent RBDs from being deleted.

In my experience with OpenStack, booting a nova instance from a glance image causes a snapshot to be created, protected, and cloned on the RBD for the glance image. The clone becomes a cinder device that is then attached to the nova instance. Thus you're able to modify the contents of the volume within the instance. You wouldn't be able to delete the glance image at that point unless the cinder device were deleted first or it was flattened and no longer dependent on the glance image.

I haven't performed this particular test. It's possible that OpenStack does the flattening for you in this scenario. This issue will likely require some investigation at the RBD level throughout your testing process to understand exactly what's happening.

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation <https://storagecraft.com>
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799

If you are not the intended recipient of this message or received it erroneously, please notify the sender and delete it, together with any attachments, and be advised that any dissemination or copying of this message is prohibited.
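The snapshot/clone rules described above can be verified on a scratch image with plain rbd commands. A sketch (the image names are made up for illustration; `--image-format 2` is needed because cloning requires format 2 images, which older releases don't create by default):

```shell
# Create a scratch image and snapshot it
rbd create images/base --size 1024 --image-format 2
rbd snap create images/base@s1
rbd rm images/base                 # fails: the image still has snapshots

# A COW clone requires a protected snapshot
rbd snap protect images/base@s1
rbd clone images/base@s1 images/child
rbd snap unprotect images/base@s1  # fails: a COW clone depends on s1

# Flattening copies the data so the child no longer depends on base
rbd flatten images/child
rbd snap unprotect images/base@s1  # now succeeds
rbd snap rm images/base@s1
rbd rm images/base                 # now succeeds
```

Each "fails" line exits non-zero with an error explaining the dependency, which makes the chain of rules easy to see in practice.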
-----Original Message-----
From: Eugen Block [mailto:ebl...@nde.ag]
Sent: Thursday, September 1, 2016 9:06 AM
To: Steve Taylor <steve.tay...@storagecraft.com>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Turn snapshot of a flattened snapshot into regular image

Thanks for the quick response, but I don't believe I'm there yet ;-)

> cloned the glance image to a cinder device

I have configured these three services (nova, glance, cinder) to use ceph as the storage backend, but cinder is not involved in the process I'm referring to.

Now I wanted to reproduce this scenario to show a colleague, and couldn't, because now I was able to delete image A even with a non-flattened snapshot! How is that even possible?

Eugen

Zitat von Steve Taylor <steve.tay...@storagecraft.com>:

> You're already there. When you booted ONE you cloned the glance image
> to a cinder device (A', separate RBD) that was a COW clone of A.
> That's why you can't delete A until you flatten SNAP1. A' isn't a full
> copy until that flatten is complete, at which point you're able to
> delete A.
>
> SNAP2 is a second snapshot on A', and thus A' already has all of the
> data it needs from the previous flatten of SNAP1 to allow you to
> delete SNAP1. So SNAP2 isn't actually a full extra copy of the data.
>
> Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation <https://storagecraft.com>
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2799
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Eugen Block
> Sent: Thursday, September 1, 2016 6:51 AM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Turn snapshot of a flattened snapshot into regular image
>
> Hi all,
>
> I'm trying to understand the idea behind rbd images and their
> clones/snapshots. I have tried this scenario:
>
> 1. upload image A to glance
> 2. boot instance ONE from image A
> 3. make changes to instance ONE (install new package)
> 4. create snapshot SNAP1 from ONE
> 5. delete instance ONE
> 6. delete image A
>    deleting image A fails because of existing snapshot SNAP1
> 7. flatten snapshot SNAP1
> 8. delete image A
>    succeeds
> 9. launch instance TWO from SNAP1
> 10. make changes to TWO (install package)
> 11. create snapshot SNAP2 from TWO
> 12. delete TWO
> 13. delete SNAP1
>    succeeds
>
> This means that th
Re: [ceph-users] Turn snapshot of a flattened snapshot into regular image
You're already there. When you booted ONE you cloned the glance image to a cinder device (A', a separate RBD) that was a COW clone of A. That's why you can't delete A until you flatten SNAP1. A' isn't a full copy until that flatten is complete, at which point you're able to delete A.

SNAP2 is a second snapshot on A', and thus A' already has all of the data it needs from the previous flatten of SNAP1 to allow you to delete SNAP1. So SNAP2 isn't actually a full extra copy of the data.

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation <https://storagecraft.com>
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Eugen Block
Sent: Thursday, September 1, 2016 6:51 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Turn snapshot of a flattened snapshot into regular image

Hi all,

I'm trying to understand the idea behind rbd images and their clones/snapshots. I have tried this scenario:

1. upload image A to glance
2. boot instance ONE from image A
3. make changes to instance ONE (install new package)
4. create snapshot SNAP1 from ONE
5. delete instance ONE
6. delete image A
   deleting image A fails because of existing snapshot SNAP1
7. flatten snapshot SNAP1
8. delete image A
   succeeds
9. launch instance TWO from SNAP1
10. make changes to TWO (install package)
11. create snapshot SNAP2 from TWO
12. delete TWO
13. delete SNAP1
   succeeds

This means that the second snapshot has the same (full) size as the first. Can I manipulate SNAP1 somehow so that snapshots are not flattened anymore and SNAP2 becomes a COW clone of SNAP1?
I hope my description is not too confusing. The idea behind this question is: if I have one base image and want to adjust that image from time to time, I don't want to keep several versions of that image, I just want one. But this way I would lose the protection from deleting the base image.

Is there any config option in ceph or OpenStack or anything else I can do to "un-flatten" an image? I would assume that there is some kind of flag set for that image. Maybe someone can point me in the right direction.

Thanks,
Eugen

--
Eugen Block                        voice: +49-40-559 51 75
NDE Netzdesign und -entwicklung AG fax  : +49-40-559 51 77
Postfach 61 03 15
D-22423 Hamburg                    e-mail: ebl...@nde.ag

Vorsitzende des Aufsichtsrates: Angelika Mozdzen
Sitz und Registergericht: Hamburg, HRB 90934
Vorstand: Jens-U. Mozdzen
USt-IdNr. DE 814 013 983

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)
Nick is right. Setting noout is the right move in this scenario. Restarting an OSD shouldn't block I/O unless nodown is also set, however. The exception to this would be a case where min_size can't be achieved because of the down OSD, i.e. min_size=3 and 1 of 3 OSDs is restarting. That would certainly block writes. Otherwise the cluster will recognize down OSDs as down (without nodown set), redirect I/O requests to OSDs that are up, and backfill as necessary when things are back to normal.

You can set min_size to something lower if you don't have enough OSDs to allow you to restart one without blocking writes. If this isn't the case, something deeper is going on with your cluster. You shouldn't get slow requests due to restarting a single OSD with only noout set and idle disks on the remaining OSDs. I've done this many, many times.

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799 | Fax: 801.545.4705
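The restart procedure described here can be sketched as follows (osd.12 and pool "rbd" are stand-in names; systemd unit names are assumed, while pre-systemd releases use `service ceph restart osd.12` or the init script instead):

```shell
# Keep the restarting osd "in" so CRUSH doesn't remap its PGs
ceph osd set noout

# Check that replication settings won't block writes while one osd is down:
# e.g. size=3/min_size=2 tolerates a single down osd
ceph osd pool get rbd size
ceph osd pool get rbd min_size

# Restart the osd; peers mark it down and I/O is redirected meanwhile
systemctl restart ceph-osd@12

# Wait for the cluster to settle, then clear the flag
ceph -s
ceph osd unset noout
```

Leaving noout set longer than necessary only delays backfill if the osd genuinely dies, so unsetting it promptly is the safer habit.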
-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Nick Fisk
Sent: Friday, February 12, 2016 9:07 AM
To: 'Christian Balzer' <ch...@gol.com>; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> Of Christian Balzer
> Sent: 12 February 2016 15:38
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)
>
> On Fri, 12 Feb 2016 15:56:31 +0100 Burkhard Linke wrote:
>
> > Hi,
> >
> > On 02/12/2016 03:47 PM, Christian Balzer wrote:
> > > Hello,
> > >
> > > yesterday I upgraded our most busy (in other words lethally
> > > overloaded) production cluster to the latest Firefly in
> > > preparation for a Hammer upgrade and then phasing in of a cache tier.
> > >
> > > When restarting the OSDs it took 3 minutes (1 minute in a
> > > consecutive repeat to test the impact of primed caches) during
> > > which the cluster crawled to a near stand-still and the dreaded
> > > slow requests piled up, causing applications in the VMs to fail.
> > >
> > > I had of course set things to "noout" beforehand, in hopes of
> > > staving off this kind of scenario.
> > >
> > > Note that the other OSDs and their backing storage were NOT
> > > overloaded during that time, only the backing storage of the OSD
> > > being restarted was under duress.
> > >
> > > I was under the (wishful thinking?) impression that with noout set
> > > and a controlled OSD shutdown/restart, operations would be
> > > redirected to the new primary for the duration.
> > > The strain on the restarted OSDs when recovering those operations
> > > (which I also saw) I was prepared for, the near screeching halt
> > > not so much.
> > >
> > > Any thoughts on how to mitigate this further or is this the
> > > expected behavior?
> >
> > I wouldn't use noout in this scenario.
> > It keeps the cluster from recognizing that an OSD is not available;
> > other OSDs will still try to write to that OSD. This is probably the
> > cause of the blocked requests. Redirecting only works if the cluster
> > is able to detect a PG as being degraded.
>
> Oh well, that makes of course sense, but I found some article stating
> that it also would redirect things, and the recovery activity I saw
> afterwards suggests it did so at some point.

Doesn't noout just stop the crush map from being modified, and hence data shuffling, while nodown controls whether or not the OSD is available for I/O? Maybe try the reverse: set noup so that OSDs don't participate in I/O, and then bring them in manually?

> > If the cluster is aware of the OSD being missing, it could handle
> > the write requests more gracefully. To prevent it from backfilling
> > etc., I prefer to use nobackfill and norecover. It blocks backfill
> > on the cluster level, but allows requests to be carried out (at
> > least in my understanding of these flags).
>
> Yes, I concur and was thinking of that as well. Will give it a spin
> with the upgrade to Hammer.
>
> > 'noout' is fine for large scale cluster maintenance, since it keeps
> > the cluster from backfilling. I've
Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)
I could be wrong, but I didn't think a PG would have to peer when an OSD is restarted with noout set. If I'm wrong, then this peering would definitely block I/O. I just did a quick test on a non-busy cluster and didn't see any peering when my OSD went down or up, but I'm not sure how good a test that is. The OSD should also stay "in" throughout the restart with noout set, so it wouldn't have been "out" before to cause peering when it came "in."

I do know that OSDs don't mark themselves "up" until they're caught up on OSD maps. They won't accept any op requests until they're "up," so they shouldn't have any catching up to do by the time they start taking op requests. In theory they're ready to handle I/O by the time they start handling I/O. At least that's my understanding.

It would be interesting to see what this cluster looks like as far as OSD count, journal configuration, network, CPU, RAM, etc. Something is obviously amiss. Even in a semi-decent configuration one should be able to restart a single OSD with noout under little load without causing blocked op requests.

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799 | Fax: 801.545.4705
-----Original Message-----
From: Robert LeBlanc [mailto:rob...@leblancnet.us]
Sent: Friday, February 12, 2016 1:30 PM
To: Nick Fisk <n...@fisk.me.uk>
Cc: Steve Taylor <steve.tay...@storagecraft.com>; Christian Balzer <ch...@gol.com>; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)

What I've seen is that when an OSD starts up in a busy cluster, as soon as it is "in" (it could have been "out" before) it starts getting client traffic. However, it has to be "in" to start catching up and peering to the other OSDs in the cluster. The OSD is not ready to service requests for that PG yet, but it has the op queued until it is ready. On a busy cluster it can take an OSD a long time to become ready, especially if it is servicing client requests at the same time.

If someone isn't able to look into the code to resolve this by the time I'm finished with the queue optimizations I'm doing (hopefully in a week or two), I plan on looking into this to see if there is something that can be done to prevent the ops from being accepted until the OSD is ready for them.

- Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Fri, Feb 12, 2016 at 9:42 AM, Nick Fisk wrote:
> I wonder if Christian is hitting some performance issue when the OSD
> or number of OSDs all start up at once? Or maybe the OSD is still
> doing some internal startup procedure and when the IO hits it on a
> very busy cluster, it causes it to become overloaded for a few seconds?
>
> I've seen similar things in the past where if I did not have enough
> min free KBs configured, PGs would take a long time to peer/activate
> and cause slow ops.
> >> -----Original Message-----
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> >> Of Steve Taylor
> >> Sent: 12 February 2016 16:32
> >> To: Nick Fisk ; 'Christian Balzer' ; ceph-us...@lists.ceph.com
> >> Subject: Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)
> >>
> >> Nick is right. Setting noout is the right move in this scenario.
> >> Restarting an OSD shouldn't block I/O unless nodown is also set,
> >> however. The exception to this would be a case where min_size can't
> >> be achieved because of the down OSD, i.e. min_size=3 and 1 of 3 OSDs
> >> is restarting. That would certainly block writes. Otherwise the
> >> cluster will recognize down OSDs as down (without nodown set),
> >> redirect I/O requests to OSDs that are up, and backfill as necessary
> >> when things are back to normal.
> >>
> >> You can set min_size to something lower if you don't have enough OSDs
> >> to allow you to restart one without blocking writes. If this isn't
> >> the case, something deeper is going on with your cluster. You
> >> shouldn't get slow requests due to restarting a single OSD with only
> >> noout set and idle disks on the remaining OSDs. I've done this many,
> >> many times.
> >>
> >> Steve Taylor | Senior Software Engineer | StorageCraft Technology
> >> Corporation
> >> 380 Data Drive Suite 300 | Draper | Utah | 84020
> >> Office: 80
Re: [ceph-users] OSDs are down, don't know why
With a single osd there shouldn't be much to worry about. It will have to get caught up on map epochs before it will report itself as up, but on a new cluster that should be pretty immediate. You'll probably have to look for clues in the osd and mon logs. I would expect some sort of error reported in this scenario. It seems likely that it would be network-related in this case, but the logs will confirm or debunk that theory.

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799 | Fax: 801.545.4705

-----Original Message-----
From: Jeff Epstein [mailto:jeff.epst...@commerceguys.com]
Sent: Monday, January 18, 2016 8:32 AM
To: Steve Taylor <steve.tay...@storagecraft.com>; ceph-users <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] OSDs are down, don't know why

Hi Steve,

Thanks for your answer. I don't have a private network defined. Furthermore, in my current testing configuration there is only one OSD, so communication between OSDs should be a non-issue. Do you know how OSD up/down state is determined when there is only one OSD?

Best,
Jeff

On 01/18/2016 03:59 PM, Steve Taylor wrote:
> Do you have a ceph private network defined in your config file? I've seen
> this before in that situation where the private network isn't functional. The
> osds can talk to the mon(s) but not to each other, so they report each other
> as down when they're all running just fine.
> Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2799 | Fax: 801.545.4705
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> Of Jeff Epstein
> Sent: Friday, January 15, 2016 7:28 PM
> To: ceph-users <ceph-users@lists.ceph.com>
> Subject: [ceph-users] OSDs are down, don't know why
>
> Hello,
>
> I'm setting up a small test instance of ceph and I'm running into a
> situation where the OSDs are being shown as down, but I don't know why.
>
> Connectivity seems to be working. The OSD hosts are able to communicate
> with the MON hosts; running "ceph status" and "ceph osd in" from an OSD
> host works fine, but with a HEALTH_WARN that I have 2 osds: 0 up, 2 in.
> Both the OSD and MON daemons seem to be running fine. Network
> connectivity seems to be okay: I can nc from the OSD to port 6789 on the
> MON, and from the MON to ports 6800-6803 on the OSD (I have constrained
> the ms bind port min/max config options so that the OSDs will use only
> these ports). Neither OSD nor MON logs show anything that seems unusual,
> nor why the OSD is marked as being down.
>
> Furthermore, using tcpdump I've watched network traffic between the OSD
> and the MON, and it seems that the OSD is sending heartbeats and getting
> an ack from the MON. So I'm definitely not sure why the MON thinks the
> OSD is down.
>
> Some questions:
> - How does the MON determine if the OSD is down?
> - Is there a way to get the MON to report on why an OSD is down, e.g. no heartbeat?
> - Is there any need to open ports other than TCP 6789 and 6800-6803?
> - Any other suggestions?
> ceph 0.94 on Debian Jessie
>
> Best,
> Jeff
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
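A minimal set of commands for the kind of mon/osd log digging suggested in this thread (default log and admin socket locations assumed; osd.0 is a stand-in ID):

```shell
# What do the mons think of the osd, and what does health say?
ceph osd dump | grep osd.0
ceph health detail

# Ask the osd itself through its local admin socket
ceph daemon osd.0 status

# Look for auth, network, or heartbeat errors around the failure time
tail -n 200 /var/log/ceph/ceph-osd.0.log
tail -n 200 /var/log/ceph/ceph-mon.*.log
```

The `ceph daemon` command must be run on the host where the osd's admin socket lives, which also confirms the daemon is actually responsive.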
Re: [ceph-users] double rebalance when removing osd
Rafael,

Yes, the cluster still rebalances twice when removing a failed osd. An osd that is marked out for any reason but still exists in the crush map gets its placement groups remapped to different osds until it comes back in, at which point those pgs are remapped back. When an osd is removed from the crush map, its pgs get mapped to new osds permanently. The mappings may be completely different for these two cases, which is why you get double rebalancing even when those two operations happen without the osd coming back in in between.

In the case of a failed osd, I usually don't worry about it and just follow the documented steps, because I'm marking an osd out and then removing it from the crush map immediately, so the first rebalance does almost nothing by the time the second overrides it, which matches what you were told by support. If this is a problem for you, or if you're removing an osd that's still functional to some degree, then reweighting to 0, waiting for the single rebalance, then following the removal steps is probably your best bet.

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799 | Fax: 801.545.4705

-----Original Message-----
From: Andy Allan [mailto:gravityst...@gmail.com]
Sent: Monday, January 11, 2016 4:09 AM
To: Rafael Lopez <rafael.lo...@monash.edu>
Cc: Steve Taylor <steve.tay...@storagecraft.com>; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] double rebalance when removing osd

On 11 January 2016 at 02:10, Rafael Lopez <rafael.lo...@monash.edu> wrote:

> @Steve, even when you remove due to failing, have you noticed that the
> cluster rebalances twice using the documented steps?
> You may not if you don't wait for the initial recovery after 'ceph osd
> out'. If you do 'ceph osd out' and immediately 'ceph osd crush remove',
> RH support has told me that this effectively 'cancels' the original move
> triggered from 'ceph osd out' and starts permanently remapping... which
> still doesn't really explain why we have to do the 'ceph osd out' in the
> first place.

This topic was last discussed in December - the documentation for removing an OSD from the cluster is not helpful. Unfortunately it doesn't look like anyone is going to fix the documentation.

http://comments.gmane.org/gmane.comp.file-systems.ceph.user/25627

Basically, when you want to remove an OSD, there's an alternative sequence of commands that avoids the double rebalance. The better approach is to reweight the OSD to zero first, then wait for the (one and only) rebalance, then mark out and remove.

Here's more detail from the previous thread:

http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/25629

Thanks,
Andy

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
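Putting the reweight-first approach together with the documented removal steps, the single-rebalance sequence looks roughly like this (osd.12 is a stand-in ID; the stop command varies by init system):

```shell
# 1. Drain the osd: this triggers the one and only rebalance
ceph osd crush reweight osd.12 0

# 2. Wait until recovery finishes (HEALTH_OK)
ceph -s

# 3. Now out/stop/remove; no further data movement should result
ceph osd out 12
systemctl stop ceph-osd@12     # or 'service ceph stop osd.12' on older releases
ceph osd crush remove osd.12
ceph auth del osd.12
ceph osd rm 12
```

Because the crush weight is already 0 when the osd is removed from the crush map, the removal itself doesn't change any pg mappings, which is the whole point of the reordering.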
Re: [ceph-users] double rebalance when removing osd
If I'm not mistaken, marking an osd out will remap its placement groups temporarily, while removing it from the crush map will remap the placement groups permanently. Additionally, other placement groups from other osds could get remapped permanently when an osd is removed from the crush map. I would think the only benefit to marking an osd out before stopping it would be a cleaner redirection of client I/O before the osd disappears, which may be worthwhile if you're removing a healthy osd.

As for reweighting to 0 prior to removing an osd, it seems like that would give the osd the ability to participate in the recovery essentially in read-only fashion (plus deletes) until it's empty, so objects wouldn't become degraded as placement groups are backfilling onto other osds. Again, this would really only be useful if you're removing a healthy osd. If you're removing an osd while other osds in different failure domains are known to be unhealthy, it seems like this would be a really good idea.

I usually follow the documented steps you've outlined myself, but I'm typically removing osds due to failed/failing drives while the rest of the cluster is healthy.

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation <http://www.storagecraft.com/>
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799 | Fax: 801.545.4705

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Rafael Lopez
Sent: Wednesday, January 06, 2016 4:53 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] double rebalance when removing osd

Hi all,

I am curious what practices other people follow when removing OSDs from a cluster. According to the docs, you are supposed to:

1. ceph osd out
2. stop daemon
3. ceph osd crush remove
4. ceph auth del
5. ceph osd rm

What value does 'ceph osd out' (1) add to the removal process, and why is it in the docs? We have found (as have others) that by outing (1) and then crush removing (3), the cluster has to do two recoveries. Is it necessary? Can you just do a crush remove without step 1?

I found this earlier message from GregF in which he seems to affirm that just doing the crush remove is fine:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-January/007227.html

This recent blog post from Sebastien suggests reweighting to 0 first, but I haven't tested it:
http://www.sebastien-han.fr/blog/2015/12/11/ceph-properly-remove-an-osd/

I thought that by marking it out, it sets the reweight to 0 anyway, so I'm not sure how this would make a difference in terms of two rebalances, but maybe there is a subtle difference.. ?

Thanks,
Raf

--
Senior Storage Engineer - Automation and Delivery
Infrastructure Services - eSolutions

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Recovery question
I recently migrated 240 OSDs to new servers this way in a single cluster, and it worked great. There are two additional items I would note based on my experience, though.

First, if you're using dmcrypt then of course you need to copy the dmcrypt keys for the OSDs to the new host(s). I had to do this in my case, but it was very straightforward.

Second was an issue I didn't expect, probably just because of my ignorance. I was not able to migrate existing OSDs from different failure domains into a new, single failure domain without waiting for full recovery to HEALTH_OK in between. The very first server I put OSD disks from two different failure domains into had issues. The OSDs came up and in just fine, but immediately started flapping and failed to make progress toward recovery. I removed the disks from one failure domain and left the others, and recovery progressed as expected. As soon as I saw HEALTH_OK I re-migrated the OSDs from the other failure domain, and again the cluster recovered as expected.

Proceeding via this method allowed me to migrate all 240 OSDs without any further problems. I was also able to migrate as many OSDs as I wanted to simultaneously, as long as I didn't mix OSDs from different, old failure domains in a new failure domain without recovering in between. I understand mixing failure domains like this is risky, but I sort of expected it to work anyway. Maybe it was better in the end that Ceph forced me to do it more safely.

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799 | Fax: 801.545.4705
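A rough sketch of the per-disk migration flow described above (this assumes ceph-disk-prepared OSDs of that era; the dmcrypt key location varies by deployment and the path below is only a commonly used default):

```shell
# On the new host: same ceph.conf and a valid bootstrap-osd key in place.
# If dmcrypt is used, copy the osd's dmcrypt key over first
# (often under /etc/ceph/dmcrypt-keys/, but check your deployment).

# Plugging the disk in normally triggers udev activation; if not,
# activate the data partition by hand (/dev/sdb1 is a placeholder)
ceph-disk activate /dev/sdb1

# Confirm the osd registered up/in under the new host's CRUSH bucket
ceph osd tree
```

Watching `ceph osd tree` after each batch is what reveals the flapping behavior described above before it snowballs.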
-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Peter Hinman
Sent: Wednesday, July 29, 2015 12:58 PM
To: Robert LeBlanc <rob...@leblancnet.us>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Recovery question

Thanks for the guidance. I'm working on building a valid ceph.conf right now.

I'm not familiar with the osd-bootstrap key. Is that the standard filename for it? Is it the keyring that is stored on the osd?

I'll see if the logs turn up anything I can decipher after I rebuild the ceph.conf file.

--
Peter Hinman

On 7/29/2015 12:49 PM, Robert LeBlanc wrote:

Did you use ceph-deploy or ceph-disk to create the OSDs? If so, it should use udev to start the OSDs. In that case, a new host that has the correct ceph.conf and osd-bootstrap key should be able to bring up the OSDs into the cluster automatically. Just make sure you have the correct journal in the same host with the matching OSD disk; udev should do the magic. The OSD logs are your friend if they don't start properly.

- Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Wed, Jul 29, 2015 at 10:48 AM, Peter Hinman wrote:

I've got a situation that seems on the surface like it should be recoverable, but I'm struggling to understand how to do it.

I had a cluster of 3 monitors, 3 osd disks, and 3 journal ssds. After multiple hardware failures, I pulled the 3 osd disks and 3 journal ssds and am attempting to bring them back up again on new hardware in a new cluster. I see plenty of documentation on how to zap and initialize and add new osds, but I don't see anything on rebuilding with existing osd disks. Could somebody provide guidance on how to do this? I'm running 94.2 on all machines.
Thanks,

--
Peter Hinman

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com