Hi, it seems that one of your OSD servers is dead. With Ceph's default settings (size=3, min_size=2), there should be three OSD hosts across which to distribute the objects' replicas. The important point is that only one of your OSD nodes is alive, so each PG has just 1 live replica, which is below min_size. That is why the PGs show as inactive.
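To recover, the usual first step is to bring the down OSDs on nodeB back up and to check the pools' replication settings. A sketch of the commands (the pool name "rbd" and the systemd-style service names are assumptions on my part; older installs may use `/etc/init.d/ceph` or `ceph-disk activate` instead):

```shell
# On nodeB, try to restart the down OSDs (IDs 5-9 per `ceph osd tree`);
# service names assume a systemd-based install:
sudo systemctl start ceph-osd@5
sudo systemctl status ceph-osd@5    # check /var/log/ceph/ if it fails

# Check each pool's replication settings ("rbd" is a placeholder;
# list the real pool names first):
ceph osd lspools
ceph osd pool get rbd size
ceph osd pool get rbd min_size

# Temporary workaround only, at the cost of write safety: allow PGs to
# go active with a single replica until the other host is back:
ceph osd pool set rbd min_size 1
```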
2016-06-20 14:55 GMT+08:00 Ishmael Tsoaela <[email protected]>:

> Hi David,
>
> Apologies for the late response.
>
> NodeB is mon+client, nodeC is client:
>
> Ceph health detail:
>
> HEALTH_ERR 819 pgs are stuck inactive for more than 300 seconds; 883 pgs degraded; 64 pgs stale; 819 pgs stuck inactive; 1064 pgs stuck unclean; 883 pgs undersized; 22 requests are blocked > 32 sec; 3 osds have slow requests; recovery 2/8 objects degraded (25.000%); recovery 2/8 objects misplaced (25.000%); crush map has legacy tunables (require argonaut, min is firefly); crush map has straw_calc_version=0
> pg 2.fc is stuck inactive since forever, current state undersized+degraded+peered, last acting [2]
> pg 2.fd is stuck inactive since forever, current state undersized+degraded+peered, last acting [0]
> pg 2.fe is stuck inactive since forever, current state undersized+degraded+peered, last acting [2]
> pg 2.ff is stuck inactive since forever, current state undersized+degraded+peered, last acting [1]
> pg 1.fb is stuck inactive for 493857.572982, current state undersized+degraded+peered, last acting [4]
> pg 2.f8 is stuck inactive since forever, current state undersized+degraded+peered, last acting [3]
> pg 1.fa is stuck inactive for 492185.443146, current state undersized+degraded+peered, last acting [0]
> pg 2.f9 is stuck inactive since forever, current state undersized+degraded+peered, last acting [0]
> pg 1.f9 is stuck inactive for 492185.452890, current state undersized+degraded+peered, last acting [2]
> pg 2.fa is stuck inactive since forever, current state undersized+degraded+peered, last acting [3]
> pg 1.f8 is stuck inactive for 492185.443324, current state undersized+degraded+peered, last acting [0]
> pg 2.fb is stuck inactive since forever, current state undersized+degraded+peered, last acting [2]
> .
> .
> .
>
> pg 1.fb is undersized+degraded+peered, acting [4]
> pg 2.ff is undersized+degraded+peered, acting [1]
> pg 2.fe is undersized+degraded+peered, acting [2]
> pg 2.fd is undersized+degraded+peered, acting [0]
> pg 2.fc is undersized+degraded+peered, acting [2]
> 3 ops are blocked > 536871 sec on osd.4
> 15 ops are blocked > 268435 sec on osd.4
> 1 ops are blocked > 262.144 sec on osd.4
> 2 ops are blocked > 268435 sec on osd.3
> 1 ops are blocked > 268435 sec on osd.1
> 3 osds have slow requests
> recovery 2/8 objects degraded (25.000%)
> recovery 2/8 objects misplaced (25.000%)
> crush map has legacy tunables (require argonaut, min is firefly); see http://ceph.com/docs/master/rados/operations/crush-map/#tunables
> crush map has straw_calc_version=0; see http://ceph.com/docs/master/rados/operations/crush-map/#tunables
>
> ceph osd stat:
>
> cluster-admin@nodeB:~/.ssh/ceph-cluster$ cat ceph_osd_stat.txt
> osdmap e80: 10 osds: 5 up, 5 in; 558 remapped pgs
>        flags sortbitwise
>
> ceph osd tree:
>
> cluster-admin@nodeB:~/.ssh/ceph-cluster$ ceph osd tree
> ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 9.08691 root default
> -2 4.54346     host nodeB
>  5 0.90869         osd.5     down        0          1.00000
>  6 0.90869         osd.6     down        0          1.00000
>  7 0.90869         osd.7     down        0          1.00000
>  8 0.90869         osd.8     down        0          1.00000
>  9 0.90869         osd.9     down        0          1.00000
> -3 4.54346     host nodeC
>  0 0.90869         osd.0       up  1.00000          1.00000
>  1 0.90869         osd.1       up  1.00000          1.00000
>  2 0.90869         osd.2       up  1.00000          1.00000
>  3 0.90869         osd.3       up  1.00000          1.00000
>  4 0.90869         osd.4       up  1.00000          1.00000
>
> CrushMap:
>
> # begin crush map
>
> # devices
> device 0 osd.0
> device 1 osd.1
> device 2 osd.2
> device 3 osd.3
> device 4 osd.4
> device 5 osd.5
> device 6 osd.6
> device 7 osd.7
> device 8 osd.8
> device 9 osd.9
>
> # types
> type 0 osd
> type 1 host
> type 2 chassis
> type 3 rack
> type 4 row
> type 5 pdu
> type 6 pod
> type 7 room
> type 8 datacenter
> type 9 region
> type 10 root
>
> # buckets
> host nodeB {
>     id -2    # do not change unnecessarily
>     # weight 4.543
>     alg straw
>     hash 0    # rjenkins1
>     item osd.5 weight 0.909
>     item osd.6 weight 0.909
>     item osd.7 weight 0.909
>     item osd.8 weight 0.909
>     item osd.9 weight 0.909
> }
> host nodeC {
>     id -3    # do not change unnecessarily
>     # weight 4.543
>     alg straw
>     hash 0    # rjenkins1
>     item osd.0 weight 0.909
>     item osd.1 weight 0.909
>     item osd.2 weight 0.909
>     item osd.3 weight 0.909
>     item osd.4 weight 0.909
> }
> root default {
>     id -1    # do not change unnecessarily
>     # weight 9.087
>     alg straw
>     hash 0    # rjenkins1
>     item nodeB weight 4.543
>     item nodeC weight 4.543
> }
>
> # rules
> rule replicated_ruleset {
>     ruleset 0
>     type replicated
>     min_size 1
>     max_size 10
>     step take default
>     step chooseleaf firstn 0 type host
>     step emit
> }
>
> # end crush map
>
> ceph.conf:
>
> cluster-admin@nodeB:~/.ssh/ceph-cluster$ cat /etc/ceph/ceph.conf
> [global]
> fsid = a04e9846-6c54-48ee-b26f-d6949d8bacb4
> mon_initial_members = nodeB
> mon_host = <mon IP>
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> public_network = X.X.X.0/24
>
> On Sat, Jun 18, 2016 at 12:15 PM, David <[email protected]> wrote:
>
>> Is this a test cluster that has never been healthy or a working cluster
>> which has just gone unhealthy? Have you changed anything? Are all hosts,
>> drives, network links working? More detail please. Any/all of the following
>> would help:
>>
>> ceph health detail
>> ceph osd stat
>> ceph osd tree
>> Your ceph.conf
>> Your crushmap
>>
>> On 17 Jun 2016 14:14, "Ishmael Tsoaela" <[email protected]> wrote:
>> >
>> > Hi All,
>> >
>> > please assist to fix the error:
>> >
>> > 1 X admin
>> > 2 X admin (hosting admin as well)
>> >
>> > 4 osd each node
>>
>> Please provide more detail, this suggests you should have 12 osd's but
>> your osd map shows 10 osd's, 5 of which are down.
>> >
>> >     cluster a04e9846-6c54-48ee-b26f-d6949d8bacb4
>> >      health HEALTH_ERR
>> >             819 pgs are stuck inactive for more than 300 seconds
>> >             883 pgs degraded
>> >             64 pgs stale
>> >             819 pgs stuck inactive
>> >             245 pgs stuck unclean
>> >             883 pgs undersized
>> >             17 requests are blocked > 32 sec
>> >             recovery 2/8 objects degraded (25.000%)
>> >             recovery 2/8 objects misplaced (25.000%)
>> >             crush map has legacy tunables (require argonaut, min is firefly)
>> >             crush map has straw_calc_version=0
>> >      monmap e1: 1 mons at {nodeB=155.232.195.4:6789/0}
>> >             election epoch 7, quorum 0 nodeB
>> >      osdmap e80: 10 osds: 5 up, 5 in; 558 remapped pgs
>> >             flags sortbitwise
>> >       pgmap v480: 1064 pgs, 3 pools, 6454 bytes data, 4 objects
>> >             25791 MB used, 4627 GB / 4652 GB avail
>> >             2/8 objects degraded (25.000%)
>> >             2/8 objects misplaced (25.000%)
>> >                  819 undersized+degraded+peered
>> >                  181 active
>> >                   64 stale+active+undersized+degraded
>> >
>> > _______________________________________________
>> > ceph-users mailing list
>> > [email protected]
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Best regards,
施柏安 Desmond Shih
技術研發部 Technical Development
<http://www.inwinstack.com/>
迎棧科技股份有限公司
│ 886-975-857-982 │ [email protected] │ 886-2-7738-2858 #7725
│ 新北市220板橋區遠東路3號5樓C室
Rm.C, 5F., No.3, Yuandong Rd., Banqiao Dist., New Taipei City 220, Taiwan (R.O.C)
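A side note on the two crush map warnings in the health output above: they are unrelated to the inactive PGs, but once the cluster is healthy again they can be cleared. A sketch (note that raising the tunables profile will trigger some data movement):

```shell
# Raise the CRUSH tunables profile to the reported minimum (firefly);
# expect some rebalancing traffic afterwards:
ceph osd crush tunables firefly

# Switch to the newer straw bucket weight calculation:
ceph osd crush set-tunable straw_calc_version 1
```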
