Hello,
I forgot to mention that the OSDs are in the preboot state, which seems
strange to me.
root@red-compute:/var/lib/ceph/osd/ceph-1# ceph daemon osd.1 status
{
"cluster_fsid": "9028f4da-0d77-462b-be9b-dbdf7fa57771",
"osd_fsid": "adf9890a-e680-48e4-82c6-e96f4ed56889",
"whoami": 1,
"state": "preboot",
"oldest_map": 1764,
"newest_map": 2504,
"num_pgs": 323
}
root@red-compute:/var/lib/ceph/osd/ceph-1# ceph daemon osd.3 status
{
"cluster_fsid": "9028f4da-0d77-462b-be9b-dbdf7fa57771",
"osd_fsid": "8dd085d4-0b50-4c80-a0ca-c5bc4ad972f7",
"whoami": 3,
"state": "preboot",
"oldest_map": 1764,
"newest_map": 2504,
"num_pgs": 150
}
3 is up and in.
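For anyone following along: the "state" field can be pulled out of those admin-socket dumps without jq. A minimal sketch, assuming the admin sockets exist on the local node; `osd_state` is a helper name I made up:

```shell
# Sketch: extract the "state" field from `ceph daemon osd.N status` JSON.
# osd_state is a hypothetical helper, not a Ceph command; plain sed, no jq.
osd_state() {
    sed -n 's/.*"state": *"\([^"]*\)".*/\1/p'
}

# On a live node you would run something like:
#   for id in 1 3; do ceph daemon osd.$id status | osd_state; done
```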
On Tue, May 10, 2016 at 6:07 PM, Gonzalo Aguilar Delgado <
[email protected]> wrote:
> Hello,
>
> I just upgraded my cluster to version 10.1.2 and it worked well for a
> while, until I saw that the systemctl [email protected] unit had failed
> and I restarted it.
>
> From then on, the OSDs stopped working.
>
> This is ubuntu 16.04.
>
> I asked for help on IRC, where people pointed me to one place or
> another, but none of the investigations resolved it.
>
> My configuration is rather simple:
>
> root@red-compute:~# ceph osd tree
> ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 1.00000 root default
> -4 1.00000 rack rack-1
> -2 1.00000 host blue-compute
> 0 1.00000 osd.0 down 0 1.00000
> 2 1.00000 osd.2 down 0 1.00000
> -3 1.00000 host red-compute
> 1 1.00000 osd.1 down 0 1.00000
> 3 0.50000 osd.3 up 1.00000 1.00000
> 4 1.00000 osd.4 down 0 1.00000
>
>
>
> This is what I have found so far:
>
>
> 1. Once upgraded, I discovered that the daemon runs as the ceph user. I
> ran chown on the ceph directories and it worked.
> 2. Firewall is fully disabled. Checked connectivity with nc and nmap.
> 3. Configuration seems to be right. I can post it if you want.
> 4. Enabling logging on the OSDs shows that, for example, osd.1 is
> reconnecting all the time.
> 1. 2016-05-10 14:35:48.199573 7f53e8f1a700 1 -- 0.0.0.0:6806/13962
> >> :/0 pipe(0x556f99413400 sd=84 :6806 s=0 pgs=0 cs=0 l=0
> c=0x556f993b3a80).accept sd=84 172.16.0.119:35388/0
> 2016-05-10 14:35:48.199966 7f53e8f1a700 2 -- 0.0.0.0:6806/13962
> >> :/0 pipe(0x556f99413400 sd=84 :6806 s=4 pgs=0 cs=0 l=0
> c=0x556f993b3a80).fault (0) Success
> 2016-05-10 14:35:48.200018 7f53fb941700 1 osd.1 2468
> ms_handle_reset con 0x556f993b3a80 session 0
> 5. osd.3 stays OK because it was never marked out, due to a Ceph
> restriction.
> 6. I restarted all services at once so that all OSDs would be available
> at the same time and not get marked down. It didn't work.
> 7. I forced them up from the command line: ceph osd in 1-5. They appear
> as in for a while, then out.
> 8. We tried ceph-disk activate-all to boot everything. It didn't work.
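On step 1 above: one thing worth double-checking is whether any files were missed by the chown. A hedged sketch, assuming Jewel's ceph:ceph ownership; `not_owned_by` is a made-up helper name, not a Ceph tool:

```shell
# Sketch: list anything under a data dir not owned by the given user.
# Jewel (10.x) runs the daemons as "ceph", so leftover root-owned files
# can keep an OSD from starting. not_owned_by is a hypothetical helper.
not_owned_by() {  # usage: not_owned_by USER DIR
    find "$2" ! -user "$1" -print
}

# On a real node (as root) you would run something like:
#   not_owned_by ceph /var/lib/ceph
# and fix any hits with the standard upgrade step:
#   chown -R ceph:ceph /var/lib/ceph
```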
>
>
> The strange thing is that the cluster worked just fine right after the
> upgrade, but the systemctl command broke both servers.
>
> root@blue-compute:~# ceph -w
> cluster 9028f4da-0d77-462b-be9b-dbdf7fa57771
> health HEALTH_ERR
> 694 pgs are stuck inactive for more than 300 seconds
> 694 pgs stale
> 694 pgs stuck stale
> too many PGs per OSD (1528 > max 300)
> mds cluster is degraded
> crush map has straw_calc_version=0
> monmap e10: 2 mons at {blue-compute=
> 172.16.0.119:6789/0,red-compute=172.16.0.100:6789/0}
> election epoch 3600, quorum 0,1 red-compute,blue-compute
> fsmap e673: 1/1/1 up {0:0=blue-compute=up:replay}
> osdmap e2495: 5 osds: 1 up, 1 in; 5 remapped pgs
> pgmap v40765481: 764 pgs, 6 pools, 410 GB data, 103 kobjects
> 87641 MB used, 212 GB / 297 GB avail
> 694 stale+active+clean
> 70 active+clean
>
> 2016-05-10 17:03:55.822440 mon.0 [INF] HEALTH_ERR; 694 pgs are stuck
> inactive for more than 300 seconds; 694 pgs stale; 694 pgs stuck stale; too
> many PGs per OSD (1528 > max 300); mds cluster is degraded; crush map has
> straw_calc_version=0
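The "too many PGs per OSD (1528 > max 300)" figure follows directly from the numbers above: with only 1 of the 5 OSDs in, all 764 PGs at replica size 2 map to a single OSD. A throwaway sketch of that arithmetic (`pgs_per_osd` is a made-up name, not a Ceph command):

```shell
# Average PG copies per OSD = num_pgs * replica_size / num_in_osds.
# With 764 PGs, size 2 and 1 OSD in: 764 * 2 / 1 = 1528.
pgs_per_osd() {  # usage: pgs_per_osd NUM_PGS SIZE NUM_IN_OSDS
    echo $(( $1 * $2 / $3 ))
}

pgs_per_osd 764 2 1   # prints 1528
```

Note that even with all 5 OSDs in, this works out to 764 * 2 / 5 = 305, still above the warning threshold of 300, so that part of the HEALTH_ERR is not purely a symptom of the outage.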
>
> cat /etc/ceph/ceph.conf
> [global]
>
> fsid = 9028f4da-0d77-462b-be9b-dbdf7fa57771
> mon_initial_members = blue-compute, red-compute
> mon_host = 172.16.0.119, 172.16.0.100
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
> public_network = 172.16.0.0/24
> osd_pool_default_pg_num = 100
> osd_pool_default_pgp_num = 100
> osd_pool_default_size = 2 # Write an object 2 times.
> osd_pool_default_min_size = 1 # Allow writing one copy in a degraded state.
>
> ## Required upgrade
> osd max object name len = 256
> osd max object namespace len = 64
>
> [mon.]
>
> debug mon = 9
> caps mon = "allow *"
>
>
> Any help on this? Any clue of what's going wrong?
>
>
> Best regards,
>
>
>
--
Never underestimate the power of stupid people in large groups...
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com