After upgrading from Jewel (10.2.10) to Luminous (12.2.5), we started seeing the 
new ceph-mgr sometimes hang indefinitely when running commands like 
'ceph pg dump' on our largest cluster (~1,300 OSDs).  None of our other clusters 
(10+) show the same issue, but each of them has fewer than 600 OSDs.  Restarting 
ceph-mgr seems to fix the issue for 12 hours or so, but usually overnight the 
hang reappears.  At first I thought it was a hardware issue, but switching the 
primary ceph-mgr to another node didn't fix the problem.
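
One way to pin down when the mgr goes from working to hanging is to run the 
command periodically under a timeout and log the outcome.  The following is a 
minimal sketch, not a tested monitoring tool: the names probe() and watch(), 
the 120-second timeout, and the 5-minute interval are all assumptions chosen 
for illustration.

```python
import subprocess
import time


def probe(cmd, timeout_s):
    """Run cmd once and return (status, elapsed_seconds).

    status is "ok" (exit code 0), "failed" (non-zero exit) or "hang"
    (no exit within timeout_s; the child process is killed).
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        status = "ok" if result.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        status = "hang"
    return status, time.monotonic() - start


def watch(interval_s=300, timeout_s=120):
    """Probe 'ceph pg dump' forever, printing one timestamped line per run.

    interval_s and timeout_s are assumed values; pick a timeout well above
    the duration of a healthy dump on the cluster in question.
    """
    cmd = ["ceph", "pg", "dump", "--format", "json"]
    while True:
        status, elapsed = probe(cmd, timeout_s)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} {status} {elapsed:.1f}s", flush=True)
        time.sleep(interval_s)
```

Calling watch() on a node with an admin keyring and redirecting its output to a 
file gives a timeline whose first "hang" entry can be matched against the 
ceph-mgr log around the same time.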

I've increased debug_mgr logging to 20/20.  A working dump looks like this:

2018-10-18 09:26:16.256911 7f9dbf5e7700  4 mgr.server handle_command decoded 3
2018-10-18 09:26:16.256917 7f9dbf5e7700  4 mgr.server handle_command prefix=pg dump
2018-10-18 09:26:16.256937 7f9dbf5e7700 10 mgr.server _allowed_command client.admin capable
2018-10-18 09:26:16.256951 7f9dbf5e7700  0 log_channel(audit) log [DBG] : from='client.1414554763 10.2.4.2:0/2175076978' entity='client.admin' cmd=[{"prefix": "pg dump", "target": ["mgr", ""], "format": "json-pretty"}]: dispatch
2018-10-18 09:26:22.567583 7f9dbf5e7700  1 mgr.server reply handle_command (0) Success dumped all

A hung dump call doesn't show up in the logs at all.  Not even the 
"mgr.server handle_command prefix=pg dump" entry makes it to the logs.
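
To see how long successful commands take in the periods before a hang, the 
mgr log can be scanned for pairs of "handle_command decoded" and 
"reply handle_command" entries.  The sketch below is illustrative only.  It 
assumes each log entry sits on a single (unwrapped) line in the layout shown 
above, and that a command's reply is logged by the same thread that decoded 
it, as in the sample; the function name command_latencies() is hypothetical.

```python
import re
from datetime import datetime

# timestamp, thread id, debug level, then the "mgr.server" message text
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) (\S+)\s+\d+ mgr\.server (.*)$"
)


def command_latencies(lines):
    """Return a list of (start_time, seconds) for each decoded/reply pair.

    Pairs are matched per thread id, which is an assumption based on the
    sample log; entries that do not match LINE_RE are ignored.
    """
    started = {}  # thread id -> time of its latest "handle_command decoded"
    latencies = []
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue
        when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        thread, message = match.group(2), match.group(3)
        if message.startswith("handle_command decoded"):
            started[thread] = when
        elif message.startswith("reply handle_command") and thread in started:
            begin = started.pop(thread)
            latencies.append((begin, (when - begin).total_seconds()))
    return latencies
```

Applied to the working dump above, this reports a single command that took a 
little over six seconds (09:26:16.256911 to 09:26:22.567583).  A steady rise 
in these durations before the hang would point at the mgr slowing down rather 
than stopping abruptly.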

This problem also continued to appear after upgrading to 12.2.8.

Has anyone else seen this?

Thanks,
Bryan
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com