Hi,
Just had an incident in a 3-node test cluster running 12.1.1 on Debian stretch.
Each node had its own mon, mgr, radosgw, and OSDs. Object store only.
I had s3cmd looping and uploading files via S3.
On one of the machines, the RAID controller barfed and dropped the OS disks.
Or the disks failed; TBC. Anyway, / and /var went read-only.
The monitor on that machine found it couldn't write its logs and died. But the
OSDs stayed up - those disks didn't go read-only.
    health: HEALTH_WARN
            1/3 mons down, quorum store01,store03
    osd: 18 osds: 18 up, 18 in
    rgw: 3 daemons active
The S3 process started timing out on connections to radosgw, even when talking
to one of the other two radosgw instances. (I'm round-robining the DNS records
at the moment.)
I stopped the OSDs on that box. No change. I stopped radosgw on that box.
Still no change. The S3 upload process was still hanging/timing out. A manual
telnet to port 80 on the good nodes still hung.
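For reference, the manual telnet check I was doing can be scripted roughly like
this (store01-03 are placeholders for my three round-robined nodes; the 3-second
timeout is arbitrary):

```shell
#!/usr/bin/env bash
# Probe each radosgw frontend with a short TCP connect timeout,
# instead of waiting on a hanging telnet by hand.
check_rgw() {
    local host=$1 port=${2:-80}
    # /dev/tcp/<host>/<port> is a bash special path; timeout bounds the hang.
    timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

for h in store01 store02 store03; do
    if check_rgw "$h" 80; then
        echo "$h: rgw port 80 reachable"
    else
        echo "$h: rgw port 80 NOT reachable (timed out or refused)"
    fi
done
```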
"radosgw-admin bucket list" showed buckets &c
Then I restarted radosgw on one of the other two nodes. After about a minute,
the looping S3 upload process started working again.
So my questions: Why did I have to manually restart radosgw on one of the
other nodes? Why didn't it either keep working, or e.g. start working when
radosgw was stopped on the bad node?
Also, where are the radosgw server/access logs?
I know it's probably an unusual edge case or something, but we're aiming for HA
and redundancy.
Thanks!
Sean Purdy
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com