Hi,
We use Ceph 0.80.7 for our IceHouse PoC: 3 MONs and 3 OSD nodes (ids 10, 11, 12) with 2 OSDs each, 1.5 TB of storage in total, and 4 pools for RBD (size=2, 512 PGs per pool). Everything was fine until the middle of last week. Here is what happened:
- OSD node #12 passed away
- AFAICR, Ceph recovered fine
- I installed a fresh new node #12 (which inadvertently erased its 2 attached OSDs) and used ceph-deploy to make the node and its 2 OSDs join the cluster
- it looked okay, except that the weight of the 2 new OSDs (osd.0 and osd.4) was a solid "-3.052e-05"
- I applied the workaround from http://tracker.ceph.com/issues/9998 : 'ceph osd crush reweight' on both OSDs
- Ceph then got busy redistributing PGs across the 6 OSDs; this was on Friday evening
- on Monday morning (yesterday), Ceph was still busy; in fact the two new OSDs were flapping ("map eXXXXX wrongly marked me down" every minute)
- the root cause turned out to be the firewall on node #12; opening TCP ports 6789-6900 solved the flapping
- Ceph kept on reorganising PGs and reached this unhealthy state:
  --- 900 PGs stuck unclean
  --- some 'requests are blocked > 32 sec'
  --- the command 'rbd info images/<image_id>' hung
  --- all tested VMs hung
- so I tried this: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-August/032929.html and removed the 2 new OSDs
- Ceph again started rebalancing data, and things were looking better (VMs responding, although rather slowly)
- but in the end, which is the current state, the cluster fell back into an unhealthy state, and our PoC is stuck.

Fortunately, the PoC users are out for Christmas. I am here until Wed 4pm UTC+1 and then back on Jan 5, so there are around 30 hours left to solve this "PoC sev1" issue. I hope the community can help me find a solution before Christmas. Here are the details (actual host and DC names are not shown in these outputs).
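First, for completeness: the "removed the 2 new OSDs" step above followed, as far as I can reconstruct it, the standard removal sequence. A sketch from memory (not a verified session log), using osd.0 as the example:

```shell
# Approximate removal sequence for one of the new OSDs (osd.0 shown;
# the same was done for osd.4). Reconstructed from memory, so a sketch only.
ceph osd out 0                # stop mapping data to it
service ceph stop osd.0       # stop the daemon on node12, if still running
ceph osd crush remove osd.0   # remove it from the CRUSH map
ceph auth del osd.0           # drop its cephx key
ceph osd rm 0                 # remove it from the osdmap
```

If one of these steps did not fully take, that might matter for what follows.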
[root@MON ~]# date; for im in $(rbd ls images); do echo $im; time rbd info images/$im; done
Tue Dec 23 06:53:15 GMT 2014
0dde9837-3e45-414d-a2c5-902adee0cfe9
<no reply for 2 hours, still ongoing...>

[root@MON ]# rbd ls images | head -5
0dde9837-3e45-414d-a2c5-902adee0cfe9
2b62a79c-bdbc-43dc-ad88-dfbfaa9d005e
3917346f-12b4-46b8-a5a1-04296ea0a826
4bde285b-28db-4bef-99d5-47ce07e2463d
7da30b4c-4547-4b4c-a96e-6a3528e03214
[root@MON ]#

[cloud-user@francois-vm2 ~]$ ls -lh /tmp/file
-rw-rw-r--. 1 cloud-user cloud-user 552M Dec 22 22:19 /tmp/file
[cloud-user@francois-vm2 ~]$ rm /tmp/file
<no reply for 1 hour, still ongoing. The RBD image used by that VM is 'volume-2e989ca0-b620-42ca-a16f-e218aea32000'>

[root@MON ~]# ceph -s
    cluster f0e3957f-1df5-4e55-baeb-0b2236ff6e03
     health HEALTH_WARN 1 pgs peering; 3 pgs stuck inactive; 3 pgs stuck unclean; 103 requests are blocked > 32 sec; noscrub,nodeep-scrub flag(s) set
     monmap e6: 3 mons at {<MON01>=10.60.9.11:6789/0,<MON06>=10.60.9.16:6789/0,<MON09>=10.60.9.19:6789/0}, election epoch 1338, quorum 0,1,2 <MON01>,<MON06>,<MON09>
     osdmap e42050: 6 osds: 6 up, 6 in
            flags noscrub,nodeep-scrub
      pgmap v3290710: 2048 pgs, 4 pools, 301 GB data, 58987 objects
            600 GB used, 1031 GB / 1632 GB avail
                   2 inactive
                2045 active+clean
                   1 remapped+peering
  client io 818 B/s wr, 0 op/s

[root@MON ~]# ceph health detail
HEALTH_WARN 1 pgs peering; 3 pgs stuck inactive; 3 pgs stuck unclean; 103 requests are blocked > 32 sec; 2 osds have slow requests; noscrub,nodeep-scrub flag(s) set
pg 5.a7 is stuck inactive for 54776.026394, current state inactive, last acting [2,1]
pg 5.ae is stuck inactive for 54774.738938, current state inactive, last acting [2,1]
pg 5.b3 is stuck inactive for 71579.365205, current state remapped+peering, last acting [1,0]
pg 5.a7 is stuck unclean for 299118.648789, current state inactive, last acting [2,1]
pg 5.ae is stuck unclean for 286227.592617, current state inactive, last acting [2,1]
pg 5.b3 is stuck unclean for 71579.365263, current state remapped+peering, last acting [1,0]
pg 5.b3 is remapped+peering, acting [1,0]
87 ops are blocked > 67108.9 sec
16 ops are blocked > 33554.4 sec
84 ops are blocked > 67108.9 sec on osd.1
16 ops are blocked > 33554.4 sec on osd.1
3 ops are blocked > 67108.9 sec on osd.2
2 osds have slow requests
noscrub,nodeep-scrub flag(s) set

[root@MON]# ceph osd tree
# id    weight  type name       up/down reweight
-1      1.08    root default
-5      0.54            datacenter dc_TWO
-2      0.54                    host node10
1       0.27                            osd.1   up      1
5       0.27                            osd.5   up      1
-4      0                       host node12
-6      0.54            datacenter dc_ONE
-3      0.54                    host node11
2       0.27                            osd.2   up      1
3       0.27                            osd.3   up      1
0       0       osd.0   up      1
4       0       osd.4   up      1

(I'm concerned about the above two "ghost" entries, osd.0 and osd.4...)

[root@MON]# ceph osd dump
epoch 42050
fsid f0e3957f-1df5-4e55-baeb-0b2236ff6e03
created 2014-09-02 13:29:11.352712
modified 2014-12-22 16:43:22.295253
flags noscrub,nodeep-scrub
pool 3 'images' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 5018 flags hashpspool stripe_width 0
	removed_snaps [1~7,a~1,c~5]
pool 4 'volumes' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 5015 flags hashpspool stripe_width 0
	removed_snaps [1~5,7~c,14~8,1e~2]
pool 5 'ephemeral' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 1553 flags hashpspool stripe_width 0
pool 6 'backups' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 2499 flags hashpspool stripe_width 0
	removed_snaps [1~5]
max_osd 8
osd.0 up   in  weight 1 up_from 40904 up_thru 41379 down_at 40899 last_clean_interval [5563,40902) 10.60.9.22:6800/4527 10.60.9.22:6801/4137004527 10.60.9.22:6811/4137004527 10.60.9.22:6812/4137004527 exists,up 1dea8553-d3fc-4a45-9706-3136104b935e
osd.1 up   in  weight 1 up_from 4128 up_thru 42049 down_at 4024 last_clean_interval [3247,4006) 10.60.9.20:6800/2062 10.60.9.20:6801/2062 10.60.9.20:6802/2062 10.60.9.20:6803/2062 exists,up f47dea5a-6742-4749-956e-818ff7cb91b4
osd.2 up   in  weight 1 up_from 40750 up_thru 42048 down_at 40743 last_clean_interval [2950,40742) 10.60.9.21:6808/1141 10.60.9.21:6809/1141 10.60.9.21:6810/1141 10.60.9.21:6811/1141 exists,up 87c71251-df5b-48c9-8737-e1c609722a3f
osd.3 up   in  weight 1 up_from 40750 up_thru 42039 down_at 40745 last_clean_interval [3998,40745) 10.60.9.21:6801/967 10.60.9.21:6804/967 10.60.9.21:6805/967 10.60.9.21:6806/967 exists,up 6ae95d34-81ae-4e3d-9af2-17886414295f
osd.4 up   in  weight 1 up_from 40905 up_thru 41426 down_at 40902 last_clean_interval [5575,40903) 10.60.9.22:6805/5375 10.60.9.22:6802/4153005375 10.60.9.22:6803/4153005375 10.60.9.22:6810/4153005375 exists,up dca9f2b2-66cd-406a-9d8a-50ff91b8e4d2
osd.5 up   in  weight 1 up_from 40350 up_thru 42047 down_at 40198 last_clean_interval [3317,40283) 10.60.9.20:6805/19439 10.60.9.20:6810/1019439 10.60.9.20:6811/1019439 10.60.9.20:6812/1019439 exists,up 0ea4ce0a-f74c-4a2a-9fa5-c7b55373bc86
pg_temp 5.b3 [1,0]

Again, I'm concerned about osd.0 and osd.4, which appear as up. However, these commands succeeded yesterday:

[root@MON ~]# date; time ceph osd down 0
Mon Dec 22 15:59:31 UTC 2014
marked down osd.0.
real    0m1.264s
user    0m0.192s
sys     0m0.031s
[root@MON ~]# date; time ceph osd down 4
Mon Dec 22 15:59:35 UTC 2014
marked down osd.4.
real    0m0.351s
user    0m0.193s
sys     0m0.028s

The PG map keeps changing, but the state (ceph -s) is still the same. Here is an excerpt of the log.
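A side observation first: the pool settings in the dump (4 pools of 512 PGs each, size=2) spread over only 6 OSDs make for a very high PG count per OSD; I believe the usual guidance is on the order of 100 per OSD. A quick check of the arithmetic:

```shell
# PG replicas per OSD implied by the pool settings above:
# 4 pools x 512 PGs x size 2, divided across 6 OSDs.
pools=4; pgs_per_pool=512; size=2; osds=6
echo $(( pools * pgs_per_pool * size / osds ))   # 682
```

I don't know whether that is related to the stuck PGs, but I mention it for completeness.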
[root@MON]# tail -5 /var/log/ceph/ceph.log
2014-12-23 08:24:48.585052 mon.0 10.60.9.11:6789/0 1209178 : [INF] pgmap v3291074: 2048 pgs: 2 inactive, 2045 active+clean, 1 remapped+peering; 301 GB data, 600 GB used, 1031 GB / 1632 GB avail; 819 B/s wr, 0 op/s
2014-12-23 08:24:52.201230 mon.0 10.60.9.11:6789/0 1209179 : [INF] pgmap v3291075: 2048 pgs: 2 inactive, 2045 active+clean, 1 remapped+peering; 301 GB data, 600 GB used, 1031 GB / 1632 GB avail; 819 B/s wr, 0 op/s
2014-12-23 08:24:55.895255 mon.0 10.60.9.11:6789/0 1209180 : [INF] pgmap v3291076: 2048 pgs: 2 inactive, 2045 active+clean, 1 remapped+peering; 301 GB data, 600 GB used, 1031 GB / 1632 GB avail; 560 B/s wr, 0 op/s
2014-12-23 08:24:58.583940 mon.0 10.60.9.11:6789/0 1209181 : [INF] pgmap v3291077: 2048 pgs: 2 inactive, 2045 active+clean, 1 remapped+peering; 301 GB data, 600 GB used, 1031 GB / 1632 GB avail; 641 B/s wr, 0 op/s
2014-12-23 08:25:02.206420 mon.0 10.60.9.11:6789/0 1209182 : [INF] pgmap v3291078: 2048 pgs: 2 inactive, 2045 active+clean, 1 remapped+peering; 301 GB data, 600 GB used, 1031 GB / 1632 GB avail; 1297 B/s wr, 0 op/s

Apart from the PG map changes, here are the other last messages:

[root@MON]# grep -v "2 inactive, 2045 active+clean, 1 remapped+peering" /var/log/ceph/ceph.log | tail -5
2014-12-23 06:50:37.237534 osd.1 10.60.9.20:6800/2062 16347 : [WRN] slow request 30720.090953 seconds old, received at 2014-12-22 22:18:37.146491: osd_op(client.5021916.0:64428 rbd_data.1bcae02ae8944a.0000000000000510 [sparse-read 3321344~4096] 5.7e03aeb3 RETRY=1 ack+retry+read e42050) v4 currently reached pg
2014-12-23 06:50:37.237541 osd.1 10.60.9.20:6800/2062 16348 : [WRN] slow request 30720.093197 seconds old, received at 2014-12-22 22:18:37.144247: osd_op(client.3324797.0:679739 rbd_data.3fdb9c2ae8944a.000000000000030e [sparse-read 3554816~32768] 5.f2d4a8b3 RETRY=1 ack+retry+read e42050) v4 currently reached pg
2014-12-23 07:00:38.469599 osd.1 10.60.9.20:6800/2062 16349 : [WRN] 100 slow requests, 2 included below; oldest blocked for > 54130.968782 secs
2014-12-23 07:00:38.471314 osd.1 10.60.9.20:6800/2062 16350 : [WRN] slow request 30720.750831 seconds old, received at 2014-12-22 22:28:37.718682: osd_op(client.5021916.0:64967 rbd_data.1bcae02ae8944a.0000000000000510 [sparse-read 3329536~4096] 5.7e03aeb3 RETRY=1 ack+retry+read e42050) v4 currently reached pg
2014-12-23 07:00:38.471326 osd.1 10.60.9.20:6800/2062 16351 : [WRN] slow request 30720.750807 seconds old, received at 2014-12-22 22:28:37.718706: osd_op(client.3324797.0:679750 rbd_data.3fdb9c2ae8944a.000000000000030e [sparse-read 2768384~32768] 5.f2d4a8b3 RETRY=1 ack+retry+read e42050) v4 currently reached pg
[root@MON]#

The RBD image for the sampled VM, with its hung IOs, looks accessible (but the same command against its parent hangs):

[root@MON]# date; time rbd info volumes/volume-2e989ca0-b620-42ca-a16f-e218aea32000
Tue Dec 23 08:27:13 GMT 2014
rbd image 'volume-2e989ca0-b620-42ca-a16f-e218aea32000':
	size 6144 MB in 768 objects
	order 23 (8192 kB objects)
	block_name_prefix: rbd_data.412bb450fdfb09
	format: 2
	features: layering
	parent: images/80a2e4e0-0a26-4c00-8783-5530dc914719@snap
	overlap: 6144 MB
real    0m0.098s
user    0m0.018s
sys     0m0.009s

The CRUSH map:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1

# devices
device 0 device0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 device4
device 5 osd.5

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host node10 {
	id -2		# do not change unnecessarily
	# weight 0.540
	alg straw
	hash 0	# rjenkins1
	item osd.1 weight 0.270
	item osd.5 weight 0.270
}
host node12 {
	id -4		# do not change unnecessarily
	# weight 0.000
	alg straw
	hash 0	# rjenkins1
}
datacenter dc_TWO {
	id -5		# do not change unnecessarily
	# weight 0.540
	alg straw
	hash 0	# rjenkins1
	item node10 weight 0.540
	item node12 weight 0.000
}
host node11 {
	id -3		# do not change unnecessarily
	# weight 0.540
	alg straw
	hash 0	# rjenkins1
	item osd.2 weight 0.270
	item osd.3 weight 0.270
}
datacenter dc_ONE {
	id -6		# do not change unnecessarily
	# weight 0.540
	alg straw
	hash 0	# rjenkins1
	item node11 weight 0.540
}
root default {
	id -1		# do not change unnecessarily
	# weight 1.080
	alg straw
	hash 0	# rjenkins1
	item dc_TWO weight 0.540
	item dc_ONE weight 0.540
}

# rules
rule replicated_ruleset {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type host
	step emit
}
rule DRP {
	ruleset 1
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type datacenter
	step emit
}
# end crush map

Francois.
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com