Hi,
Just had an incident in a 3-node test cluster running 12.1.1 on Debian stretch.
Each node had its own mon, mgr, radosgw, and OSDs. Object store only.
I had s3cmd looping and uploading files via S3.
On one of the machines, the RAID controller barfed and dropped the OS disks.
Or the disks failed; TBC. Anyway, / and /var went read-only.
The monitor on that machine found it couldn't write its logs and died. But the
OSDs stayed up - those disks didn't go read-only.
    health: HEALTH_WARN
            1/3 mons down, quorum store01,store03
    osd: 18 osds: 18 up, 18 in
    rgw: 3 daemons active
The S3 process started timing out on connections to radosgw, even when talking
to one of the other two radosgw instances. (I'm round-robining the DNS records
at the moment.)
I stopped the OSDs on that box. No change. I stopped radosgw on that box.
Still no change. The S3 upload process was still hanging/timing out. A manual
telnet to port 80 on the good nodes still hung.
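For reference, the manual telnet check I was doing can be scripted roughly like
this (store01-03 are placeholders for my three round-robined nodes; the 3-second
timeout is arbitrary):

```shell
#!/usr/bin/env bash
# Probe each radosgw frontend with a short TCP connect timeout,
# instead of waiting on a hanging telnet by hand.
check_rgw() {
    local host=$1 port=${2:-80}
    # /dev/tcp/<host>/<port> is a bash special path; timeout bounds the hang.
    timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

for h in store01 store02 store03; do
    if check_rgw "$h" 80; then
        echo "$h: rgw port 80 reachable"
    else
        echo "$h: rgw port 80 NOT reachable (timed out or refused)"
    fi
done
```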
"radosgw-admin bucket list" showed buckets &c
Then I restarted radosgw on one of the other two nodes. After about a minute,
the looping S3 upload process started working again.
So my questions: Why did I have to manually restart radosgw on one of the
other nodes? Why didn't it either keep working, or e.g. start working when
radosgw was stopped on the bad node?
Also, where are the radosgw server/access logs?
I know it's probably an unusual edge case or something, but we're aiming for HA
and redundancy.
Thanks!
Sean Purdy
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com