On 04/11/2014 12:23 PM, Matteo Favaro wrote:
Hi all,
my name is Matteo Favaro.
I'm an employee of CNAF, and I'm trying to get a Ceph test installation working.

I have learnt a lot about Ceph and I know quite well how to build and
modify it, but there is a question and a problem that I don't know how
to resolve.

My cluster is laid out like this:

5 servers, named "ds-07-01" through "ds-07-04":

- on 01 I have (at the moment):
  2 OSDs + 1 monitor
- on 02:
  2 OSDs + 1 monitor
- on 03:
  2 OSDs + 1 monitor
- on 04:
  2 OSDs

At the moment I have 8 OSDs and 3 monitors in total. The pools have
size = 1, so I have no replicas.


The problem is that during the build phase my OSD number 6 failed. I
removed it from the cluster by deleting it from both the auth list and
the CRUSH map, and finally by running "osd rm".
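(For reference, the removal was roughly this sequence; I may not be
quoting the exact commands verbatim:)

    ceph osd crush remove osd.6   # remove it from the CRUSH map
    ceph auth del osd.6           # delete its authentication key
    ceph osd rm 6                 # remove it from the OSD map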

Then I recreated it, but the result is:

[root@ds-07-01 mycephcluster]# ceph health detail
HEALTH_WARN 16 pgs stale; 1 pgs stuck inactive; 16 pgs stuck stale; 1
pgs stuck unclean
pg 0.31 is stuck inactive since forever, current state creating, last
acting []
pg 0.31 is stuck unclean since forever, current state creating, last
acting []
pg 1.30 is stuck stale for 3449.169280, current state
stale+active+clean, last acting [6]
pg 2.2f is stuck stale for 3449.169280, current state
stale+active+clean, last acting [6]
pg 0.27 is stuck stale for 3449.169277, current state
stale+active+clean, last acting [6]
pg 1.26 is stuck stale for 3449.169283, current state
stale+active+clean, last acting [6]
pg 2.25 is stuck stale for 3449.169289, current state
stale+active+clean, last acting [6]
pg 0.25 is stuck stale for 3449.169286, current state
stale+active+clean, last acting [6]
pg 1.24 is stuck stale for 3449.169292, current state
stale+active+clean, last acting [6]
pg 0.21 is stuck stale for 3449.169277, current state
stale+active+clean, last acting [6]
pg 1.20 is stuck stale for 3449.169289, current state
stale+active+clean, last acting [6]
pg 2.23 is stuck stale for 3449.169294, current state
stale+active+clean, last acting [6]
pg 2.1f is stuck stale for 3449.169288, current state
stale+active+clean, last acting [6]
pg 0.11 is stuck stale for 3449.169286, current state
stale+active+clean, last acting [6]
pg 1.10 is stuck stale for 3449.169292, current state
stale+active+clean, last acting [6]
pg 2.f is stuck stale for 3449.169293, current state stale+active+clean,
last acting [6]
pg 0.1 is stuck stale for 3449.169291, current state stale+active+clean,
last acting [6]
pg 1.0 is stuck stale for 3449.169296, current state stale+active+clean,
last acting [6]

PG 0.31 is in that state because of this note I read in the documentation:

Pool Size = 1: If you have only one copy of an object, no other OSD
will tell the OSD which objects it should have. For each placement group
mapped to the remaining OSD (see ceph pg dump), you can force the OSD to
notice the placement groups it needs by running:

    ceph pg force_create_pg <pgid>

But if I try to run the command:

[root@ds-07-01 mycephcluster]# ceph pg 1.30 mark_unfound_lost revert
Error ENOENT: i don't have pgid 1.30

this is some information about the cluster:

[root@ds-07-01 mycephcluster]# ceph osd lspools
0 data,1 metadata,2 rbd,

[root@ds-07-01 mycephcluster]# ceph -w
     cluster 042a6983-b824-4c99-9ba3-03eebaf74afa
      health HEALTH_WARN 16 pgs stale; 1 pgs stuck inactive; 16 pgs
stuck stale; 1 pgs stuck unclean
      monmap e3: 3 mons at
{ds-07-01=omissis:6789/0,ds-07-02=omissis:6789/0,ds-07-03=omissis:6789/0},
election epoch 22, quorum 0,1,2 ds-07-01,ds-07-02,ds-07-03
      osdmap e55: 8 osds: 8 up, 8 in
       pgmap v172: 192 pgs, 3 pools, 0 bytes data, 0 objects
             289 MB used, 58641 GB / 58641 GB avail
                    1 creating
                  175 active+clean
                   16 stale+active+clean


2014-04-11 12:12:29.457474 osd.6 [INF] i don't have pgid 1.30

[root@ds-07-01 mycephcluster]# ceph osd pool get data size
size: 1
[root@ds-07-01 mycephcluster]# ceph osd pool get metadata size
size: 1
[root@ds-07-01 mycephcluster]# ceph osd pool get rbd size
size: 1
[root@ds-07-01 mycephcluster]#


Could anyone help me understand how to get the cluster back to
HEALTH_OK? I'm missing something about how Ceph works...


Have you tried marking the OSD as lost?

$ ceph osd lost 6 --yes-i-really-mean-it

That will remove the PGs and the cluster should come back.
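
If some PGs are still stuck stale after that, you can list them and
force-create them by hand; with 0 objects in the cluster there is
nothing to lose. Something along these lines:

$ ceph pg dump_stuck stale
$ ceph pg force_create_pg 1.30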

However, I advise you to run with at least a replica size of 2 for your pools.
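For your existing pools that would be something like:

$ ceph osd pool set data size 2
$ ceph osd pool set metadata size 2
$ ceph osd pool set rbd size 2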

Thanks a lot
Matteo Favaro







--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
