Hi,
We use Ceph 0.80.7 for our IceHouse PoC: 3 MONs and 3 OSD nodes (ids 10, 11, 12) with 2 OSDs each, 1.5 TB of storage in total, and 4 pools for RBD (size=2, 512 PGs per pool). Everything was fine until the middle of last week. Here is what happened:
- OSD node #12 passed away
- AFAICR, Ceph recovered fine
- I installed a fresh new node #12 (which inadvertently erased its 2 attached OSDs) and used ceph-deploy to make the node and its 2 OSDs join the cluster
- it looked okay, except that the weight of the 2 new OSDs (osd.0 and osd.4) was a solid "-3.052e-05"
- I applied the workaround from http://tracker.ceph.com/issues/9998 : 'ceph osd crush reweight' on both OSDs
- Ceph then got busy redistributing PGs across the 6 OSDs; this was on Friday evening
- on Monday morning (yesterday), Ceph was still busy; in fact the two new OSDs were flapping ("map eXXXXX wrongly marked me down" every minute)
- the root cause turned out to be the firewall on node #12; opening TCP ports 6789-6900 solved the flapping
- Ceph kept on reorganising PGs and reached this unhealthy state:
  --- 900 PGs stuck unclean
  --- some 'requests are blocked > 32 sec'
  --- the command 'rbd info images/<image_id>' hung
  --- all tested VMs hung
- so I tried this: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-August/032929.html and removed the 2 new OSDs
- Ceph again started rebalancing data, and things were looking better (VMs responding, although rather slowly)
- but in the end, which is the current state, the cluster fell back into an unhealthy state, and our PoC is stuck.

Fortunately, the PoC users are out for Christmas. I am here until Wed 4pm UTC+1 and then back on Jan 5, so there are around 30 hours left to solve this "PoC sev1" issue. I hope the community can help me find a solution before Christmas. Here are the details (actual host and DC names are not shown in these outputs).
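First, for completeness: the "removed the 2 new OSDs" step above followed, as far as I can reconstruct it, the standard removal sequence. A sketch from memory (not a verified session log), using osd.0 as the example:

```shell
# Approximate removal sequence for one of the new OSDs (osd.0 shown;
# the same was done for osd.4). Reconstructed from memory, so a sketch only.
ceph osd out 0                # stop mapping data to it
service ceph stop osd.0       # stop the daemon on node12, if still running
ceph osd crush remove osd.0   # remove it from the CRUSH map
ceph auth del osd.0           # drop its cephx key
ceph osd rm 0                 # remove it from the osdmap
```

If one of these steps did not fully take, that might matter for what follows.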
[root@MON ~]# date; for im in $(rbd ls images); do echo $im; time rbd info images/$im; done
Tue Dec 23 06:53:15 GMT 2014
0dde9837-3e45-414d-a2c5-902adee0cfe9
<no reply for 2 hours, still ongoing...>

[root@MON ]# rbd ls images | head -5
0dde9837-3e45-414d-a2c5-902adee0cfe9
2b62a79c-bdbc-43dc-ad88-dfbfaa9d005e
3917346f-12b4-46b8-a5a1-04296ea0a826
4bde285b-28db-4bef-99d5-47ce07e2463d
7da30b4c-4547-4b4c-a96e-6a3528e03214
[root@MON ]#

[cloud-user@francois-vm2 ~]$ ls -lh /tmp/file
-rw-rw-r--. 1 cloud-user cloud-user 552M Dec 22 22:19 /tmp/file
[cloud-user@francois-vm2 ~]$ rm /tmp/file
<no reply for 1 hour, still ongoing. The RBD image used by that VM is 'volume-2e989ca0-b620-42ca-a16f-e218aea32000'>

[root@MON ~]# ceph -s
    cluster f0e3957f-1df5-4e55-baeb-0b2236ff6e03
     health HEALTH_WARN 1 pgs peering; 3 pgs stuck inactive; 3 pgs stuck unclean; 103 requests are blocked > 32 sec; noscrub,nodeep-scrub flag(s) set
     monmap e6: 3 mons at {<MON01>=10.60.9.11:6789/0,<MON06>=10.60.9.16:6789/0,<MON09>=10.60.9.19:6789/0}, election epoch 1338, quorum 0,1,2 <MON01>,<MON06>,<MON09>
     osdmap e42050: 6 osds: 6 up, 6 in
            flags noscrub,nodeep-scrub
      pgmap v3290710: 2048 pgs, 4 pools, 301 GB data, 58987 objects
            600 GB used, 1031 GB / 1632 GB avail
                   2 inactive
                2045 active+clean
                   1 remapped+peering
  client io 818 B/s wr, 0 op/s

[root@MON ~]# ceph health detail
HEALTH_WARN 1 pgs peering; 3 pgs stuck inactive; 3 pgs stuck unclean; 103 requests are blocked > 32 sec; 2 osds have slow requests; noscrub,nodeep-scrub flag(s) set
pg 5.a7 is stuck inactive for 54776.026394, current state inactive, last acting [2,1]
pg 5.ae is stuck inactive for 54774.738938, current state inactive, last acting [2,1]
pg 5.b3 is stuck inactive for 71579.365205, current state remapped+peering, last acting [1,0]
pg 5.a7 is stuck unclean for 299118.648789, current state inactive, last acting [2,1]
pg 5.ae is stuck unclean for 286227.592617, current state inactive, last acting [2,1]
pg 5.b3 is stuck unclean for 71579.365263, current state remapped+peering, last acting [1,0]
pg 5.b3 is remapped+peering, acting [1,0]
87 ops are blocked > 67108.9 sec
16 ops are blocked > 33554.4 sec
84 ops are blocked > 67108.9 sec on osd.1
16 ops are blocked > 33554.4 sec on osd.1
3 ops are blocked > 67108.9 sec on osd.2
2 osds have slow requests
noscrub,nodeep-scrub flag(s) set

[root@MON]# ceph osd tree
# id    weight  type name       up/down reweight
-1      1.08    root default
-5      0.54            datacenter dc_TWO
-2      0.54                    host node10
1       0.27                            osd.1   up      1
5       0.27                            osd.5   up      1
-4      0                       host node12
-6      0.54            datacenter dc_ONE
-3      0.54                    host node11
2       0.27                            osd.2   up      1
3       0.27                            osd.3   up      1
0       0       osd.0   up      1
4       0       osd.4   up      1

(I'm concerned about the above two "ghost" entries, osd.0 and osd.4...)

[root@MON]# ceph osd dump
epoch 42050
fsid f0e3957f-1df5-4e55-baeb-0b2236ff6e03
created 2014-09-02 13:29:11.352712
modified 2014-12-22 16:43:22.295253
flags noscrub,nodeep-scrub
pool 3 'images' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 5018 flags hashpspool stripe_width 0
	removed_snaps [1~7,a~1,c~5]
pool 4 'volumes' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 5015 flags hashpspool stripe_width 0
	removed_snaps [1~5,7~c,14~8,1e~2]
pool 5 'ephemeral' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 1553 flags hashpspool stripe_width 0
pool 6 'backups' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 2499 flags hashpspool stripe_width 0
	removed_snaps [1~5]
max_osd 8
osd.0 up   in  weight 1 up_from 40904 up_thru 41379 down_at 40899 last_clean_interval [5563,40902) 10.60.9.22:6800/4527 10.60.9.22:6801/4137004527 10.60.9.22:6811/4137004527 10.60.9.22:6812/4137004527 exists,up 1dea8553-d3fc-4a45-9706-3136104b935e
osd.1 up   in  weight 1 up_from 4128 up_thru 42049 down_at 4024 last_clean_interval [3247,4006) 10.60.9.20:6800/2062 10.60.9.20:6801/2062 10.60.9.20:6802/2062 10.60.9.20:6803/2062 exists,up f47dea5a-6742-4749-956e-818ff7cb91b4
osd.2 up   in  weight 1 up_from 40750 up_thru 42048 down_at 40743 last_clean_interval [2950,40742) 10.60.9.21:6808/1141 10.60.9.21:6809/1141 10.60.9.21:6810/1141 10.60.9.21:6811/1141 exists,up 87c71251-df5b-48c9-8737-e1c609722a3f
osd.3 up   in  weight 1 up_from 40750 up_thru 42039 down_at 40745 last_clean_interval [3998,40745) 10.60.9.21:6801/967 10.60.9.21:6804/967 10.60.9.21:6805/967 10.60.9.21:6806/967 exists,up 6ae95d34-81ae-4e3d-9af2-17886414295f
osd.4 up   in  weight 1 up_from 40905 up_thru 41426 down_at 40902 last_clean_interval [5575,40903) 10.60.9.22:6805/5375 10.60.9.22:6802/4153005375 10.60.9.22:6803/4153005375 10.60.9.22:6810/4153005375 exists,up dca9f2b2-66cd-406a-9d8a-50ff91b8e4d2
osd.5 up   in  weight 1 up_from 40350 up_thru 42047 down_at 40198 last_clean_interval [3317,40283) 10.60.9.20:6805/19439 10.60.9.20:6810/1019439 10.60.9.20:6811/1019439 10.60.9.20:6812/1019439 exists,up 0ea4ce0a-f74c-4a2a-9fa5-c7b55373bc86
pg_temp 5.b3 [1,0]

Again, I'm concerned about osd.0 and osd.4, which appear as up. However, these commands succeeded yesterday:

[root@MON ~]# date; time ceph osd down 0
Mon Dec 22 15:59:31 UTC 2014
marked down osd.0.
real    0m1.264s
user    0m0.192s
sys     0m0.031s
[root@MON ~]# date; time ceph osd down 4
Mon Dec 22 15:59:35 UTC 2014
marked down osd.4.
real    0m0.351s
user    0m0.193s
sys     0m0.028s

The PG map keeps changing, but the state (ceph -s) is still the same. Here is an excerpt of the log.
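A side observation first: the pool settings in the dump (4 pools of 512 PGs each, size=2) spread over only 6 OSDs make for a very high PG count per OSD; I believe the usual guidance is on the order of 100 per OSD. A quick check of the arithmetic:

```shell
# PG replicas per OSD implied by the pool settings above:
# 4 pools x 512 PGs x size 2, divided across 6 OSDs.
pools=4; pgs_per_pool=512; size=2; osds=6
echo $(( pools * pgs_per_pool * size / osds ))   # 682
```

I don't know whether that is related to the stuck PGs, but I mention it for completeness.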
[root@MON]# tail -5 /var/log/ceph/ceph.log
2014-12-23 08:24:48.585052 mon.0 10.60.9.11:6789/0 1209178 : [INF] pgmap v3291074: 2048 pgs: 2 inactive, 2045 active+clean, 1 remapped+peering; 301 GB data, 600 GB used, 1031 GB / 1632 GB avail; 819 B/s wr, 0 op/s
2014-12-23 08:24:52.201230 mon.0 10.60.9.11:6789/0 1209179 : [INF] pgmap v3291075: 2048 pgs: 2 inactive, 2045 active+clean, 1 remapped+peering; 301 GB data, 600 GB used, 1031 GB / 1632 GB avail; 819 B/s wr, 0 op/s
2014-12-23 08:24:55.895255 mon.0 10.60.9.11:6789/0 1209180 : [INF] pgmap v3291076: 2048 pgs: 2 inactive, 2045 active+clean, 1 remapped+peering; 301 GB data, 600 GB used, 1031 GB / 1632 GB avail; 560 B/s wr, 0 op/s
2014-12-23 08:24:58.583940 mon.0 10.60.9.11:6789/0 1209181 : [INF] pgmap v3291077: 2048 pgs: 2 inactive, 2045 active+clean, 1 remapped+peering; 301 GB data, 600 GB used, 1031 GB / 1632 GB avail; 641 B/s wr, 0 op/s
2014-12-23 08:25:02.206420 mon.0 10.60.9.11:6789/0 1209182 : [INF] pgmap v3291078: 2048 pgs: 2 inactive, 2045 active+clean, 1 remapped+peering; 301 GB data, 600 GB used, 1031 GB / 1632 GB avail; 1297 B/s wr, 0 op/s

Apart from the PG map changes, here are the other last messages:

[root@MON]# grep -v "2 inactive, 2045 active+clean, 1 remapped+peering" /var/log/ceph/ceph.log | tail -5
2014-12-23 06:50:37.237534 osd.1 10.60.9.20:6800/2062 16347 : [WRN] slow request 30720.090953 seconds old, received at 2014-12-22 22:18:37.146491: osd_op(client.5021916.0:64428 rbd_data.1bcae02ae8944a.0000000000000510 [sparse-read 3321344~4096] 5.7e03aeb3 RETRY=1 ack+retry+read e42050) v4 currently reached pg
2014-12-23 06:50:37.237541 osd.1 10.60.9.20:6800/2062 16348 : [WRN] slow request 30720.093197 seconds old, received at 2014-12-22 22:18:37.144247: osd_op(client.3324797.0:679739 rbd_data.3fdb9c2ae8944a.000000000000030e [sparse-read 3554816~32768] 5.f2d4a8b3 RETRY=1 ack+retry+read e42050) v4 currently reached pg
2014-12-23 07:00:38.469599 osd.1 10.60.9.20:6800/2062 16349 : [WRN] 100 slow requests, 2 included below; oldest blocked for > 54130.968782 secs
2014-12-23 07:00:38.471314 osd.1 10.60.9.20:6800/2062 16350 : [WRN] slow request 30720.750831 seconds old, received at 2014-12-22 22:28:37.718682: osd_op(client.5021916.0:64967 rbd_data.1bcae02ae8944a.0000000000000510 [sparse-read 3329536~4096] 5.7e03aeb3 RETRY=1 ack+retry+read e42050) v4 currently reached pg
2014-12-23 07:00:38.471326 osd.1 10.60.9.20:6800/2062 16351 : [WRN] slow request 30720.750807 seconds old, received at 2014-12-22 22:28:37.718706: osd_op(client.3324797.0:679750 rbd_data.3fdb9c2ae8944a.000000000000030e [sparse-read 2768384~32768] 5.f2d4a8b3 RETRY=1 ack+retry+read e42050) v4 currently reached pg
[root@MON]#

The RBD image for the sampled VM, with its hung IOs, looks accessible (but the same command against its parent hangs):

[root@MON]# date; time rbd info volumes/volume-2e989ca0-b620-42ca-a16f-e218aea32000
Tue Dec 23 08:27:13 GMT 2014
rbd image 'volume-2e989ca0-b620-42ca-a16f-e218aea32000':
	size 6144 MB in 768 objects
	order 23 (8192 kB objects)
	block_name_prefix: rbd_data.412bb450fdfb09
	format: 2
	features: layering
	parent: images/80a2e4e0-0a26-4c00-8783-5530dc914719@snap
	overlap: 6144 MB
real    0m0.098s
user    0m0.018s
sys     0m0.009s

The CRUSH map:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1

# devices
device 0 device0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 device4
device 5 osd.5

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host node10 {
	id -2		# do not change unnecessarily
	# weight 0.540
	alg straw
	hash 0	# rjenkins1
	item osd.1 weight 0.270
	item osd.5 weight 0.270
}
host node12 {
	id -4		# do not change unnecessarily
	# weight 0.000
	alg straw
	hash 0	# rjenkins1
}
datacenter dc_TWO {
	id -5		# do not change unnecessarily
	# weight 0.540
	alg straw
	hash 0	# rjenkins1
	item node10 weight 0.540
	item node12 weight 0.000
}
host node11 {
	id -3		# do not change unnecessarily
	# weight 0.540
	alg straw
	hash 0	# rjenkins1
	item osd.2 weight 0.270
	item osd.3 weight 0.270
}
datacenter dc_ONE {
	id -6		# do not change unnecessarily
	# weight 0.540
	alg straw
	hash 0	# rjenkins1
	item node11 weight 0.540
}
root default {
	id -1		# do not change unnecessarily
	# weight 1.080
	alg straw
	hash 0	# rjenkins1
	item dc_TWO weight 0.540
	item dc_ONE weight 0.540
}

# rules
rule replicated_ruleset {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type host
	step emit
}
rule DRP {
	ruleset 1
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type datacenter
	step emit
}
# end crush map

Francois.
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com