[ceph-users] Wrong PG information after increase pg_num
Hello all,

I am testing a cluster with mixed OSD types on the same data node (yes, it's the idea from http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/), and I have run into a strange status: ceph -s and ceph pg dump show incorrect PG information after setting pg_num on a pool that uses a different ruleset to select the faster OSDs. Please advise what is wrong and whether I can fix this without recreating the pool with the final pg_num directly.

Some more detail:

1) Update the crushmap to add a rule under a different root that selects the other OSDs, like this:

rule replicated_ruleset_ssd {
        ruleset 50
        type replicated
        min_size 1
        max_size 10
        step take sdd
        step chooseleaf firstn 0 type host
        step emit
}

2) Create a new pool and set crush_ruleset to use this new rule:

$ ceph osd pool create ssd 64 64 replicated replicated_ruleset_ssd
(however, after this command the pool is still using the default ruleset 0)
$ ceph osd pool set ssd crush_ruleset 50

3) It looks good now:

$ ceph osd dump | grep pool
pool 0 'data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool crash_replay_interval 45 stripe_width 0
pool 1 'metadata' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 2 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 50 flags hashpspool stripe_width 0
pool 8 'xfs' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 1570 flags hashpspool stripe_width 0
pool 9 'ssd' replicated size 3 min_size 2 crush_ruleset 50 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1574 flags hashpspool stripe_width 0

$ ceph -s
    cluster 5f8ae2a8-f143-42d9-b50d-246ac0874569
     health HEALTH_OK
     monmap e2: 3 mons at {DEV-rhel7-vildn1=10.0.2.156:6789/0,DEV-rhel7-vildn2=10.0.2.157:6789/0,DEV-rhel7-vildn3=10.0.2.158:6789/0}, election epoch 84, quorum 0,1,2 DEV-rhel7-vildn1,DEV-rhel7-vildn2,DEV-rhel7-vildn3
     osdmap e1578: 21 osds: 15 up, 15 in
      pgmap v560681: 1472 pgs, 5 pools, 285 GB data, 73352 objects
            80151 MB used, 695 GB / 779 GB avail
                1472 active+clean

4) Increase pg_num and pgp_num, but the total PG count is still 1472 in ceph -s:

$ ceph osd pool set ssd pg_num 128
set pool 9 pg_num to 128
$ ceph osd pool set ssd pgp_num 128
set pool 9 pgp_num to 128

$ ceph osd dump | grep pool
pool 0 'data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool crash_replay_interval 45 stripe_width 0
pool 1 'metadata' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 2 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 50 flags hashpspool stripe_width 0
pool 8 'xfs' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 1570 flags hashpspool stripe_width 0
pool 9 'ssd' replicated size 3 min_size 2 crush_ruleset 50 object_hash rjenkins pg_num 128 pgp_num 128 last_change 1581 flags hashpspool stripe_width 0

$ ceph -s
    cluster 5f8ae2a8-f143-42d9-b50d-246ac0874569
     health HEALTH_OK
     monmap e2: 3 mons at {DEV-rhel7-vildn1=10.0.2.156:6789/0,DEV-rhel7-vildn2=10.0.2.157:6789/0,DEV-rhel7-vildn3=10.0.2.158:6789/0}, election epoch 84, quorum 0,1,2 DEV-rhel7-vildn1,DEV-rhel7-vildn2,DEV-rhel7-vildn3
     osdmap e1582: 21 osds: 15 up, 15 in
      pgmap v560709: 1472 pgs, 5 pools, 285 GB data, 73352 objects
            80158 MB used, 695 GB / 779 GB avail
                1472 active+clean

5) Same problem with pg dump:

$ ceph pg dump | grep '^9\.' | wc
dumped all in format plain
     64    1472   10288

6) The new PGs do look like they are created under the /var/lib/ceph/osd/ceph-<id>/current folders:

$ ls -ld /var/lib/ceph/osd/ceph-15/current/9.* | wc
     74     666    6133
$ ls -ld /var/lib/ceph/osd/ceph-16/current/9.* | wc
     54     486    4475

6 OSDs are selected by this ruleset, so 128 * 3 / 6 ~= 64 PG directories per OSD, which roughly matches.

Thanks a lot
BR,
Luke Kao
MYCOM-OSI
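For anyone reproducing this, a minimal sanity check of the rule and the pool settings, assuming the rule id 50 and the pool name ssd used above (crushmap.bin is just a placeholder file name):

$ ceph osd pool get ssd pg_num
$ ceph osd pool get ssd pgp_num
$ ceph osd pool get ssd crush_ruleset
$ ceph osd getcrushmap -o crushmap.bin
$ crushtool -i crushmap.bin --test --rule 50 --num-rep 3 --show-utilization
# the pool gets confirm what the monitors think pg_num/pgp_num and the ruleset are,
# and the crushtool run shows how the rule spreads replicas over the SSD OSDs
# without touching the live cluster (depending on the crushtool version, --rule
# takes the ruleset number or the rule id).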
Re: [ceph-users] Ceph on RHEL7.0
Hi Bruce,

The RHEL 7.0 kernel has many issues in its filesystem submodules, and most of them are fixed only in RHEL 7.1. So you should consider going to RHEL 7.1 directly and upgrading to at least kernel 3.10.0-229.1.2.

BR,
Luke

From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Bruce McFarland [bruce.mcfarl...@taec.toshiba.com]
Sent: Friday, May 29, 2015 5:13 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Ceph on RHEL7.0

We're planning on moving from CentOS 6.5 to RHEL 7.0 for Ceph storage and monitor nodes. Are there any known issues using RHEL 7.0? Thanks
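A minimal sketch of checking each node against the 3.10.0-229.1.2 baseline mentioned above (a version-aware compare with GNU sort, nothing Ceph-specific):

$ uname -r
$ required=3.10.0-229.1.2; current=$(uname -r)
$ if printf '%s\n' "$required" "$current" | sort -V -C; then echo "kernel >= $required"; else echo "kernel older than $required, consider upgrading"; fi
# sort -V -C succeeds only when the required version sorts at or before the
# running one, i.e. the node already meets the suggested minimum.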
[ceph-users] Linux block device tuning on Kernel RBD device
Hello everyone,

Does anyone have experience tuning a kernel RBD device by changing the I/O scheduler and related settings? Currently we are trying it with the rbd module bundled in RHEL 7.1, changing the following settings under /sys/block/rbdX/queue:

1) scheduler: noop vs deadline; deadline seems better.
2) nr_requests: default 128; tried 64, 256 and 1024, with no clear difference between the values.
3) rotational: as a network-based device, should this be set to 0 for rbd? Tried it, with no clear difference either way.
4) read_ahead_kb: default 128; 4096 is much better, but we also see a lot of extra network bandwidth being used.

We are now planning to measure the change in IOPS, throughput and side effects in a quantitative way, and would like to know whether anyone can share an optimal setting they have found, and whether there are other parameters worth trying, such as the tunables available for deadline.

Thanks in advance,
Luke
MYCOM OSI
http://www.mycom-osi.com
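A minimal sketch of how these knobs can be applied and made persistent, assuming a device named rbd0; the values are simply the ones being compared above, not a recommendation:

# apply at runtime (lost when the image is unmapped or the node reboots)
$ echo deadline > /sys/block/rbd0/queue/scheduler
$ echo 0 > /sys/block/rbd0/queue/rotational
$ echo 4096 > /sys/block/rbd0/queue/read_ahead_kb
$ cat /sys/block/rbd0/queue/scheduler        # the active scheduler is shown in brackets

# one way to make it persistent for every mapped rbd device: a udev rule,
# e.g. in /etc/udev/rules.d/80-rbd-tune.rules (file name is an arbitrary choice)
KERNEL=="rbd*", SUBSYSTEM=="block", ACTION=="add", ATTR{queue/scheduler}="deadline", ATTR{queue/read_ahead_kb}="4096", ATTR{queue/rotational}="0"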
Re: [ceph-users] Monitor stay in synchronizing state for over 24hour
Hi all,

Does anyone have an idea, or perhaps a pointer to which debug log I can enable to see the progress of the synchronization? Currently I have set:

debug_mon = 20
mon_sync_debug = true

but I am not sure which log entries I should be looking at.

Thanks in advance
BR,
Luke
MYCOM-OSI

From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Luke Kao [luke@mycom-osi.com]
Sent: Thursday, March 12, 2015 5:22 PM
To: ceph-us...@ceph.com
Subject: [ceph-users] Monitor stay in synchronizing state for over 24hour

Hello everyone,

I am currently trying to recover a ceph cluster from a disaster. I now have enough OSDs (171 up and in out of 195) and am left with 2 incomplete PGs. However, the question is not about the incomplete PGs; it is about one mon service that fails to start because a strange, wrong monmap is used. After injecting the monmap exported from the cluster, the mon comes up and enters the synchronizing state, but it has not come back after several hours. I originally assumed this was expected because the whole cluster is still busy recovering and backfilling, but it has now been over 24 hours and there is no hint of when the sync will finish, or whether it is still healthy. The log says it is still synchronizing, and I can see the files under store.db being updated.

A small piece of the log for reference:

2015-03-12 03:20:15.025048 7f3cb6c48700 10 mon.NVMBD1CIF290D00@0(synchronizing).data_health(0) service_tick
2015-03-12 03:20:15.025075 7f3cb6c48700  0 mon.NVMBD1CIF290D00@0(synchronizing).data_health(0) update_stats avail 71% total 103080888 used 24281956 avail 73539668
2015-03-12 03:20:30.460672 7f3cb4b43700 10 -- 10.137.36.30:6789/0 10.137.36.31:6789/0 pipe(0x3528280 sd=9 :57111 s=2 pgs=30630 cs=15 l=0 c=0x34b1760).aborted = 0
2015-03-12 03:20:30.460923 7f3cb4b43700 10 -- 10.137.36.30:6789/0 10.137.36.31:6789/0 pipe(0x3528280 sd=9 :57111 s=2 pgs=30630 cs=15 l=0 c=0x34b1760).reader got message 1466470577 0x45b3c80 mon_sync(chunk cookie 37950063980 lc 12343379 bl 791970 bytes last_key logm,full_5120265) v2
2015-03-12 03:20:30.460963 7f3cbc783700 10 -- 10.137.36.30:6789/0 10.137.36.31:6789/0 pipe(0x3528280 sd=9 :57111 s=2 pgs=30630 cs=15 l=0 c=0x34b1760).writer: state = open policy.server=0
2015-03-12 03:20:30.460988 7f3cbc783700 10 -- 10.137.36.30:6789/0 10.137.36.31:6789/0 pipe(0x3528280 sd=9 :57111 s=2 pgs=30630 cs=15 l=0 c=0x34b1760).write_ack 1466470577
2015-03-12 03:20:30.461011 7f3cbc783700 10 -- 10.137.36.30:6789/0 10.137.36.31:6789/0 pipe(0x3528280 sd=9 :57111 s=2 pgs=30630 cs=15 l=0 c=0x34b1760).writer: state = open policy.server=0
2015-03-12 03:20:30.461030 7f3cb6447700  1 -- 10.137.36.30:6789/0 == mon.1 10.137.36.31:6789/0 1466470577 mon_sync(chunk cookie 37950063980 lc 12343379 bl 791970 bytes last_key logm,full_5120265) v2 792163+0+0 (2147002791 0 0) 0x45b3c80 con 0x34b1760
2015-03-12 03:20:30.461048 7f3cb6447700 10 mon.NVMBD1CIF290D00@0(synchronizing) e1 handle_sync mon_sync(chunk cookie 37950063980 lc 12343379 bl 791970 bytes last_key logm,full_5120265) v2
2015-03-12 03:20:30.461052 7f3cb6447700 10 mon.NVMBD1CIF290D00@0(synchronizing) e1 handle_sync_chunk mon_sync(chunk cookie 37950063980 lc 12343379 bl 791970 bytes last_key logm,full_5120265) v2
2015-03-12 03:20:30.463832 7f3cb6447700 10 mon.NVMBD1CIF290D00@0(synchronizing) e1 sync_reset_timeout

I am also wondering whether some OSDs are failing to join the cluster because of this. Some OSD processes come up without errors, but after loading PGs they never move on to boot, and their status stays down and out.

Please advise, thanks

Luke Kao
MYCOM OSI
http://www.mycom-osi.com
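A minimal sketch of watching the sync from the monitor itself, assuming the admin socket is in its default location and the mon id matches the hostname used in the log above:

$ ceph daemon mon.NVMBD1CIF290D00 mon_status
# "state" stays "synchronizing" until the store catches up; the quorum and
# outside_quorum fields show which peers this mon currently sees.
$ du -sh /var/lib/ceph/mon/*/store.db
# repeating the du over time gives a rough idea of whether the store is still growing.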
[ceph-users] Monitor stay in synchronizing state for over 24hour
Hello everyone,

I am currently trying to recover a ceph cluster from a disaster. I now have enough OSDs (171 up and in out of 195) and am left with 2 incomplete PGs. However, the question is not about the incomplete PGs; it is about one mon service that fails to start because a strange, wrong monmap is used. After injecting the monmap exported from the cluster, the mon comes up and enters the synchronizing state, but it has not come back after several hours. I originally assumed this was expected because the whole cluster is still busy recovering and backfilling, but it has now been over 24 hours and there is no hint of when the sync will finish, or whether it is still healthy. The log says it is still synchronizing, and I can see the files under store.db being updated.

A small piece of the log for reference:

2015-03-12 03:20:15.025048 7f3cb6c48700 10 mon.NVMBD1CIF290D00@0(synchronizing).data_health(0) service_tick
2015-03-12 03:20:15.025075 7f3cb6c48700  0 mon.NVMBD1CIF290D00@0(synchronizing).data_health(0) update_stats avail 71% total 103080888 used 24281956 avail 73539668
2015-03-12 03:20:30.460672 7f3cb4b43700 10 -- 10.137.36.30:6789/0 10.137.36.31:6789/0 pipe(0x3528280 sd=9 :57111 s=2 pgs=30630 cs=15 l=0 c=0x34b1760).aborted = 0
2015-03-12 03:20:30.460923 7f3cb4b43700 10 -- 10.137.36.30:6789/0 10.137.36.31:6789/0 pipe(0x3528280 sd=9 :57111 s=2 pgs=30630 cs=15 l=0 c=0x34b1760).reader got message 1466470577 0x45b3c80 mon_sync(chunk cookie 37950063980 lc 12343379 bl 791970 bytes last_key logm,full_5120265) v2
2015-03-12 03:20:30.460963 7f3cbc783700 10 -- 10.137.36.30:6789/0 10.137.36.31:6789/0 pipe(0x3528280 sd=9 :57111 s=2 pgs=30630 cs=15 l=0 c=0x34b1760).writer: state = open policy.server=0
2015-03-12 03:20:30.460988 7f3cbc783700 10 -- 10.137.36.30:6789/0 10.137.36.31:6789/0 pipe(0x3528280 sd=9 :57111 s=2 pgs=30630 cs=15 l=0 c=0x34b1760).write_ack 1466470577
2015-03-12 03:20:30.461011 7f3cbc783700 10 -- 10.137.36.30:6789/0 10.137.36.31:6789/0 pipe(0x3528280 sd=9 :57111 s=2 pgs=30630 cs=15 l=0 c=0x34b1760).writer: state = open policy.server=0
2015-03-12 03:20:30.461030 7f3cb6447700  1 -- 10.137.36.30:6789/0 == mon.1 10.137.36.31:6789/0 1466470577 mon_sync(chunk cookie 37950063980 lc 12343379 bl 791970 bytes last_key logm,full_5120265) v2 792163+0+0 (2147002791 0 0) 0x45b3c80 con 0x34b1760
2015-03-12 03:20:30.461048 7f3cb6447700 10 mon.NVMBD1CIF290D00@0(synchronizing) e1 handle_sync mon_sync(chunk cookie 37950063980 lc 12343379 bl 791970 bytes last_key logm,full_5120265) v2
2015-03-12 03:20:30.461052 7f3cb6447700 10 mon.NVMBD1CIF290D00@0(synchronizing) e1 handle_sync_chunk mon_sync(chunk cookie 37950063980 lc 12343379 bl 791970 bytes last_key logm,full_5120265) v2
2015-03-12 03:20:30.463832 7f3cb6447700 10 mon.NVMBD1CIF290D00@0(synchronizing) e1 sync_reset_timeout

I am also wondering whether some OSDs are failing to join the cluster because of this. Some OSD processes come up without errors, but after loading PGs they never move on to boot, and their status stays down and out.

Please advise, thanks

Luke Kao
MYCOM OSI
http://www.mycom-osi.com
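For reference, a minimal sketch of the monmap export/inject step described above; the paths are placeholders, the mon id follows the log above, and the daemon must be stopped before injecting:

$ ceph mon getmap -o /tmp/monmap                       # export the monmap from the surviving quorum
$ service ceph stop mon.NVMBD1CIF290D00                # stop the mon that has the wrong map (sysvinit-style; adjust to your init system)
$ ceph-mon -i NVMBD1CIF290D00 --inject-monmap /tmp/monmap
$ service ceph start mon.NVMBD1CIF290D00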
Re: [ceph-users] CRUSHMAP for chassis balance
Hi Gregory,

Thanks for the direction. I finished with 3 different rules in one ruleset, one per replication size range. Tested: no bad mappings, and hosts / OSDs are correctly balanced between the 2 chassis. Not sure whether it can be optimized further, but I am happy with the current result:

rule rule_rep2 {
        ruleset 0
        type replicated
        min_size 2
        max_size 2
        step take chassis1
        step chooseleaf firstn 1 type host
        step emit
        step take chassis2
        step chooseleaf firstn 1 type host
        step emit
}

rule rule_rep34 {
        ruleset 0
        type replicated
        min_size 3
        max_size 4
        step take default
        step choose firstn 2 type chassis
        step chooseleaf firstn 2 type host
        step emit
}

rule rule_rep56 {
        ruleset 0
        type replicated
        min_size 5
        max_size 6
        step take default
        step choose firstn 3 type chassis
        step chooseleaf firstn 3 type host
        step emit
}

Luke

From: Gregory Farnum [mailto:g...@gregs42.com]
Sent: Friday, February 13, 2015 11:01 PM
To: Luke Kao; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] CRUSHMAP for chassis balance

With sufficiently new CRUSH versions (all the latest point releases on LTS?) I think you can simply have the rule return extra IDs, which are dropped if they exceed the number required. So you can choose two chassis, then have each of those choose two leaf OSDs, and return those 4 from the rule.
-Greg

On Fri, Feb 13, 2015 at 6:13 AM Luke Kao [luke@mycom-osi.com] wrote:

Dear cepher,

Currently I am working on a crushmap to make sure that at least one copy goes to a different chassis. Say chassis1 has host1, host2, host3, and chassis2 has host4, host5, host6.

With replication = 2 it's not a problem; I can use the following steps in the rule:

step take chassis1
step chooseleaf firstn 1 type host
step emit
step take chassis2
step chooseleaf firstn 1 type host
step emit

But for replication = 3 I tried:

step take chassis1
step chooseleaf firstn 1 type host
step emit
step take chassis2
step chooseleaf firstn 1 type host
step emit
step take default
step chooseleaf firstn 1 type host
step emit

In the end, the 3rd OSD returned in the rule test is always a duplicate of the first or second. Any idea, or what is the direction to move forward?

Thanks in advance
BR,
Luke
MYCOM-OSI
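A minimal sketch of the kind of rule test mentioned above, assuming the compiled map contains the three rules under ruleset 0 (crushmap.bin is a placeholder file name, and depending on the crushtool version --rule takes the ruleset number or the rule id):

$ ceph osd getcrushmap -o crushmap.bin
$ crushtool -i crushmap.bin --test --rule 0 --num-rep 4 --show-bad-mappings
$ crushtool -i crushmap.bin --test --rule 0 --num-rep 4 --show-utilization
# --show-bad-mappings prints nothing when every input maps to the requested
# number of distinct OSDs; repeating with --num-rep 2 and --num-rep 6
# exercises the other rules, and --show-utilization gives a feel for the
# balance across the two chassis.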
[ceph-users] CRUSHMAP for chassis balance
Dear cepher,

Currently I am working on a crushmap to make sure that at least one copy goes to a different chassis. Say chassis1 has host1, host2, host3, and chassis2 has host4, host5, host6.

With replication = 2 it's not a problem; I can use the following steps in the rule:

step take chassis1
step chooseleaf firstn 1 type host
step emit
step take chassis2
step chooseleaf firstn 1 type host
step emit

But for replication = 3 I tried:

step take chassis1
step chooseleaf firstn 1 type host
step emit
step take chassis2
step chooseleaf firstn 1 type host
step emit
step take default
step chooseleaf firstn 1 type host
step emit

In the end, the 3rd OSD returned in the rule test is always a duplicate of the first or second. Any idea, or what is the direction to move forward?

Thanks in advance
BR,
Luke
MYCOM-OSI
Re: [ceph-users] btrfs backend with autodefrag mount option
Thanks Lionel, we are using btrfs compression and it has also been stable in our cluster. Currently another minor problem with btrfs fragmentation is that we sometimes see the btrfs-transacti process pause the whole OSD node's I/O for seconds, impacting all OSDs on the server, especially during recovery / backfill. However, I wonder whether an OSD restart taking 30 minutes could become a problem for maintenance. I will share the results once we have tested the different settings.

BR,
Luke

From: Lionel Bouton [lionel-subscript...@bouton.name]
Sent: Saturday, January 31, 2015 2:29 AM
To: Luke Kao; ceph-us...@ceph.com
Subject: Re: [ceph-users] btrfs backend with autodefrag mount option

On 01/30/15 14:24, Luke Kao wrote:

Dear ceph users,
Has anyone tried adding the autodefrag mount option when using btrfs as the OSD storage? In some previous discussions, btrfs OSD startup becomes very slow after the filesystem has been used for some time, so we are thinking that adding autodefrag might help. We will add it on our test cluster first to see if there is any difference.

We used autodefrag but it didn't help: performance degrades over time. One possibility raised in previous discussions here is that BTRFS's autodefrag isn't smart enough when snapshots are heavily used, as is the case with Ceph OSDs by default. There are some tunables available that we have yet to test:

filestore btrfs snap
filestore btrfs clone range
filestore journal parallel

All are enabled by default for BTRFS backends. snap is probably the first one you might want to disable, then check how autodefrag and defrag behave. It might be possible to use snap and defrag; BTRFS was quite stable for us (but all our OSDs are on systems with at least 72GB RAM and enough CPU power, so memory wasn't much of an issue).

Best regards,
Lionel Bouton
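A minimal sketch of how the first of the three filestore options named above could be disabled for a test, assuming it goes into the [osd] section of ceph.conf with the OSDs restarted afterwards:

[osd]
    filestore btrfs snap = false
    # the other two can stay at their defaults for a first test
    # filestore btrfs clone range = true
    # filestore journal parallel = true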
[ceph-users] btrfs backend with autodefrag mount option
Dear ceph users,

Has anyone tried adding the autodefrag mount option when using btrfs as the OSD storage? In some previous discussions, btrfs OSD startup becomes very slow after the filesystem has been used for some time, so we are thinking that adding autodefrag might help. We will add it on our test cluster first to see if there is any difference.

Please kindly share your experience if available, thanks

Luke Kao
MYCOM OSI
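A minimal sketch of what the test could look like, assuming the OSD data partition is /dev/sdb1 mounted at /var/lib/ceph/osd/ceph-12 (both placeholders); the other options shown are just common choices, not from the original mail:

$ mount -o rw,noatime,autodefrag /dev/sdb1 /var/lib/ceph/osd/ceph-12
# or, so the init scripts mount OSD data disks with the option, in ceph.conf:
[osd]
    osd mount options btrfs = rw,noatime,autodefrag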
Re: [ceph-users] How to do maintenance without falling out of service?
Hi David,

What about your pools' size and min_size settings? In your cluster you may need to set min_size=1 on all pools before shutting down a server.

BR,
Luke
MYCOM-OSI

From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of J David [j.david.li...@gmail.com]
Sent: Tuesday, January 20, 2015 12:40 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] How to do maintenance without falling out of service?

A couple of weeks ago, we had some involuntary maintenance come up that required us to briefly turn off one node of a three-node ceph cluster. To our surprise, this resulted in failure to write on the VMs on that ceph cluster, even though we set noout before the maintenance.

This cluster is for bulk storage; it has copies=1 (2 total) and very large SATA drives. The OSD tree looks like this:

# id    weight  type name       up/down reweight
-1      127.1   root default
-2      18.16           host f16
0       4.54                    osd.0   up      1
1       4.54                    osd.1   up      1
2       4.54                    osd.2   up      1
3       4.54                    osd.3   up      1
-3      54.48           host f17
4       4.54                    osd.4   up      1
5       4.54                    osd.5   up      1
6       4.54                    osd.6   up      1
7       4.54                    osd.7   up      1
8       4.54                    osd.8   up      1
9       4.54                    osd.9   up      1
10      4.54                    osd.10  up      1
11      4.54                    osd.11  up      1
12      4.54                    osd.12  up      1
13      4.54                    osd.13  up      1
14      4.54                    osd.14  up      1
15      4.54                    osd.15  up      1
-4      54.48           host f18
16      4.54                    osd.16  up      1
17      4.54                    osd.17  up      1
18      4.54                    osd.18  up      1
19      4.54                    osd.19  up      1
20      4.54                    osd.20  up      1
21      4.54                    osd.21  up      1
22      4.54                    osd.22  up      1
23      4.54                    osd.23  up      1
24      4.54                    osd.24  up      1
25      4.54                    osd.25  up      1
26      4.54                    osd.26  up      1
27      4.54                    osd.27  up      1

The host that was turned off was f18. f16 does have a handful of OSDs, but it is mostly there to provide an odd number of monitors. The cluster is very lightly used; here is the current status:

    cluster e9c32e63-f3eb-4c25-b172-4815ed566ec7
     health HEALTH_OK
     monmap e3: 3 mons at {f16=192.168.19.216:6789/0,f17=192.168.19.217:6789/0,f18=192.168.19.218:6789/0}, election epoch 28, quorum 0,1,2 f16,f17,f18
     osdmap e1674: 28 osds: 28 up, 28 in
      pgmap v12965109: 1152 pgs, 3 pools, 11139 GB data, 2784 kobjects
            22314 GB used, 105 TB / 127 TB avail
                1152 active+clean
  client io 38162 B/s wr, 9 op/s

Where did we go wrong last time? How can we do the same maintenance to f17 (taking it offline for about 15-30 minutes) without repeating our mistake? As it stands, it seems like we have inadvertently created a cluster with three single points of failure, rather than none. That has not been our experience with our other clusters, so we're really confused at present.

Thanks for any advice!
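A minimal sketch of the sequence being suggested above; the pool name is a placeholder, and min_size is restored once the node is back and the cluster reports HEALTH_OK:

$ ceph osd set noout                       # keep OSDs from being marked out during the window
$ ceph osd dump | grep ^pool               # check the current size / min_size of each pool
$ ceph osd pool set <pool> min_size 1      # repeat per pool, so writes survive with a single remaining copy
... do the maintenance, bring the node back, wait for HEALTH_OK ...
$ ceph osd pool set <pool> min_size 2
$ ceph osd unset noout

Lowering min_size to 1 trades safety for availability: writes continue during the outage, but for that window they are acknowledged with no redundancy.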
Re: [ceph-users] any workaround for FAILED assert(p != snapset.clones.end())
Hi Sam and Greg,

No, we are not using a cache tier. Just for your information, the backend filestore is btrfs with zlib compression. Do you need any more information?

Thanks.
BR,
Luke

From: Samuel Just [sam.j...@inktank.com]
Sent: Wednesday, January 14, 2015 1:22 AM
To: Luke Kao
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] any workaround for FAILED assert(p != snapset.clones.end())

Are you using a cache tier?
-Sam

On Mon, Jan 12, 2015 at 11:37 PM, Luke Kao [luke@mycom-osi.com] wrote:

Hello community,

We have a cluster running v0.80.5, and recently several OSDs go down with this error when removing an rbd snapshot:

osd/ReplicatedPG.cc: 2352: FAILED assert(p != snapset.clones.end())

After restarting those OSDs, they go down again soon with the same error. It looks like it is linked to bug #8629, but before upgrading to the patched version, is there any workaround other than reformatting the disks and recreating the OSDs?

Also a side question: I don't find this bug fix in the release notes of v0.80.6 or v0.80.7, so should I assume the patch has not been released yet?

Thanks
BR,
Luke Kao
MYCOM-OSI
[ceph-users] any workaround for FAILED assert(p != snapset.clones.end())
Hello community,

We have a cluster running v0.80.5, and recently several OSDs go down with this error when removing an rbd snapshot:

osd/ReplicatedPG.cc: 2352: FAILED assert(p != snapset.clones.end())

After restarting those OSDs, they go down again soon with the same error. It looks like it is linked to bug #8629, but before upgrading to the patched version, is there any workaround other than reformatting the disks and recreating the OSDs?

Also a side question: I don't find this bug fix in the release notes of v0.80.6 or v0.80.7, so should I assume the patch has not been released yet?

Thanks
BR,
Luke Kao
MYCOM-OSI
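A minimal sketch of how the crash site might be narrowed down from the OSD log before deciding between a workaround and a rebuild; the log path assumes the default location, the OSD id and object name are placeholders:

$ grep -B 20 'FAILED assert(p != snapset.clones.end())' /var/log/ceph/ceph-osd.3.log | less
# the lines just above the assert usually name the PG and the clone object
# being trimmed, which indicates which snapshot removal is hitting the bug
$ rados -p rbd listsnaps rbd_data.<prefix>.<suffix>
# listsnaps on that object (name purely illustrative) shows the clones the
# acting OSDs believe exist, if they are up long enough to answer.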
[ceph-users] RBD pool with unfound objects
Hi all,

I have some questions about unfound objects in an rbd pool: what is the real impact on the rbd image? Currently our cluster (running v0.80.5) has 25 unfound objects due to recent OSD crashes, and we cannot mark them as lost yet (bug #10405 created for this). So far it seems we can still mount the rbd image (the filesystem is xfs), but I would like to understand the real impact:

1. My guess is that it should behave like bad sectors on a real hard disk?
2. Is there any way to identify which files on the RBD disk are impacted?
3. What happens if we mark them as lost using ceph pg <pgid> mark_unfound_lost revert (or delete)?
4. Is it better to copy the current rbd image to a new one and use the new one instead?

Any suggestion for the current situation is also welcome; we need to keep the data inside this RBD.

Thanks in advance,
BR,
Luke
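A minimal sketch of how the 25 unfound objects might be enumerated, assuming the PG ids come out of health detail (the pg id 2.5 shown is just a placeholder):

$ ceph health detail | grep unfound        # lists the PGs that report unfound objects
$ ceph pg 2.5 list_missing                 # for each such PG, shows the missing/unfound object names
$ ceph pg 2.5 query | less                 # "recovery_state" shows which OSDs were probed for them
# for question 2: the hex suffix of an rbd_data object name is the object index,
# so index * object size (4 MB by default) gives the byte offset of the affected
# range inside the image, which can then be matched against file extents on the
# mounted filesystem (e.g. with xfs_bmap or filefrag).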