[It is not really a 'mortem', but...]
On Saturday afternoon, my 3-node Proxmox Ceph cluster had a big
'slowdown' that started at 12:35:24 with an OOM condition on one of
the 3 storage nodes, followed by another OOM on a second node at
12:43:31.
After that, all sorts of bad things happened: stuck requests, SCSI
timeouts on VMs, MONs flip-flopping on RBD clients.
I run 'ceph -s' every hour, so at 14:17:01 I had on two nodes:
    cluster 8794c124-c2ec-4e81-8631-742992159bd6
     health HEALTH_WARN
            26 requests are blocked > 32 sec
     monmap e9: 5 mons at {2=10.27.251.11:6789/0,3=10.27.251.12:6789/0,4=10.27.251.9:6789/0,blackpanther=10.27.251.2:6789/0,capitanmarvel=10.27.251.8:6789/0}
            election epoch 3930, quorum 0,1,2,3,4 blackpanther,capitanmarvel,4,2,3
     osdmap e15713: 12 osds: 12 up, 12 in
      pgmap v67358590: 768 pgs, 3 pools, 2222 GB data, 560 kobjects
            6639 GB used, 11050 GB / 17689 GB avail
                 768 active+clean
  client io 266 kB/s wr, 25 op/s
and on the third:
    cluster 8794c124-c2ec-4e81-8631-742992159bd6
     health HEALTH_WARN
            5 mons down, quorum
     monmap e9: 5 mons at {2=10.27.251.11:6789/0,3=10.27.251.12:6789/0,4=10.27.251.9:6789/0,blackpanther=10.27.251.2:6789/0,capitanmarvel=10.27.251.8:6789/0}
            election epoch 3931, quorum
     osdmap e15713: 12 osds: 12 up, 12 in
      pgmap v67358598: 768 pgs, 3 pools, 2222 GB data, 560 kobjects
            6639 GB used, 11050 GB / 17689 GB avail
                 767 active+clean
                   1 active+clean+scrubbing
  client io 617 kB/s wr, 70 op/s
At that hour, the site served by the cluster had just closed (i.e., no
users). The only task running, judging by the logs, seems to have been
a backup (bacula), but it was only saving the catalog, i.e. a database
workload on a container, and it ended at 14:27.
All of this continued, more or less, until Sunday morning, then
everything went back to normal.
There seem to have been no hardware failures on the nodes.
Backup tasks (all VM/LXC backups) on Saturday night completed with no
errors.
Can someone provide some hints on how to 'correlate' the various logs,
and so (try to) understand what happened?
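
As a starting point for this kind of correlation, here is a minimal
sketch (not a tested tool) that merges lines from several log files into
one timeline sorted by timestamp, so that OOM messages, blocked requests
and MON elections from different nodes can be read side by side. It
assumes two timestamp layouts that are not confirmed by anything above:
an ISO-like "YYYY-MM-DD HH:MM:SS" prefix, as Ceph daemons usually write,
and the classic syslog "Mon DD HH:MM:SS" prefix, which carries no year
(so the year has to be supplied). The names parse_ts and merge_logs are
made up for illustration, and clocks on all nodes are assumed to be in
sync.

```python
import re
import sys
from datetime import datetime

# Assumed timestamp layouts (adjust to what your logs really contain):
#   ISO-like prefix:  YYYY-MM-DD HH:MM:SS[.ffffff] ...
#   syslog prefix:    Mon DD HH:MM:SS host ...   (no year in the line)
ISO_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})[ T](\d{2}:\d{2}:\d{2})")
SYSLOG_RE = re.compile(r"^([A-Z][a-z]{2})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})")


def parse_ts(line, year):
    """Return the timestamp at the start of a line, or None if there is none."""
    m = ISO_RE.match(line)
    try:
        if m:
            return datetime.strptime(m.group(1) + " " + m.group(2),
                                     "%Y-%m-%d %H:%M:%S")
        m = SYSLOG_RE.match(line)
        if m:
            text = "%d %s %s %s" % (year, m.group(1), m.group(2), m.group(3))
            return datetime.strptime(text, "%Y %b %d %H:%M:%S")
    except ValueError:
        # Looked like a timestamp but was not a valid date/time.
        return None
    return None


def merge_logs(sources, year, start=None, end=None):
    """Merge lines from several logs into one time-ordered list.

    sources: dict mapping a label (e.g. a node name) to an iterable of lines.
    start/end: optional datetime bounds (inclusive) to limit the window.
    Returns a list of (timestamp, label, line) tuples sorted by timestamp.
    Lines without a recognizable timestamp are skipped.
    """
    events = []
    for label, lines in sources.items():
        for line in lines:
            ts = parse_ts(line, year)
            if ts is None:
                continue
            if start is not None and ts < start:
                continue
            if end is not None and ts > end:
                continue
            events.append((ts, label, line.rstrip("\n")))
    # Stable sort on the timestamp only: ties keep their per-file order.
    events.sort(key=lambda event: event[0])
    return events


if __name__ == "__main__" and len(sys.argv) > 2:
    # Usage: merge_logs.py YEAR label=path [label=path ...]
    log_year = int(sys.argv[1])
    log_sources = {}
    for arg in sys.argv[2:]:
        name, path = arg.split("=", 1)
        with open(path, errors="replace") as fh:
            log_sources[name] = fh.readlines()
    for ts, name, line in merge_logs(log_sources, log_year):
        print("[%s] %s" % (name, line))
```

Feeding it the syslog of each node plus the Ceph cluster and OSD logs,
limited with start/end to a few minutes around 12:35:24 and 12:43:31,
should show which process triggered the OOM and what the MONs and OSDs
logged right before and after.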
Thanks.
--
dott. Marco Gaiarin GNUPG Key ID: 240A3D66
Associazione ``La Nostra Famiglia'' http://www.lanostrafamiglia.it/
Polo FVG - Via della Bontà, 7 - 33078 - San Vito al Tagliamento (PN)
marco.gaiarin(at)lanostrafamiglia.it t +39-0434-842711 f +39-0434-842797
Dona il 5 PER MILLE a LA NOSTRA FAMIGLIA!
http://www.lanostrafamiglia.it/index.php/it/sostienici/5x1000
(cf 00307430132, categoria ONLUS oppure RICERCA SANITARIA)