Hi Alex, bad disks are often the root cause of slow ops... perhaps run htop while you stress your cluster and see which disks are operating at their limit...
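If you want a number you can compare across nodes rather than eyeballing htop, something like the sketch below may help: it samples /proc/diskstats twice and prints roughly what "iostat -x" calls %util per disk, i.e. how much of the interval each disk spent busy with I/O. This is only a minimal sketch under my assumptions (your OSD data devices show up as whole sd* block devices; the 5 second interval is arbitrary), not a polished tool.

#!/usr/bin/env python3
# Rough per-disk utilization sampler, similar in spirit to "iostat -x":
# read /proc/diskstats twice and report what fraction of the interval
# each whole disk spent doing I/O. Run it on each OSD node under load.
import time

def busy_ms():
    """Return {device: milliseconds spent doing I/O} from /proc/diskstats."""
    stats = {}
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            name = fields[2]
            # whole SATA/SAS disks only (sda, sdb, ...); adjust the filter for NVMe etc.
            if name.startswith("sd") and not name[-1].isdigit():
                stats[name] = int(fields[12])  # 13th field: time spent doing I/O (ms)
    return stats

INTERVAL = 5.0  # seconds between samples; pick whatever suits your test
before = busy_ms()
time.sleep(INTERVAL)
after = busy_ms()

for dev in sorted(after):
    util = 100.0 * (after[dev] - before.get(dev, 0)) / (INTERVAL * 1000.0)
    print(f"{dev:8s} {util:6.1f} %util")

A disk that stays pinned near 100% while its peers are idle is a good candidate for SMART and latency checks; "ceph osd perf" (per-OSD commit/apply latency) is a useful cross-check. There is also a note below your quoted message about what the "throttled" gap in that historic op actually covers.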
Only a guess.

On 12 March 2019 14:19:03 CET, Alex Litvak <alexander.v.lit...@gmail.com> wrote:
>I looked further into historic slow ops (thanks to some other posts on the list) and I am a bit confused by the following event:
>
>{
>    "description": "osd_repop(client.85322.0:86478552 7.1b e502/466 7:d8d149b7:::rbd_data.ff7e3d1b58ba.0000000000000316:head v 502'10665506)",
>    "initiated_at": "2019-03-08 07:53:23.673807",
>    "age": 335669.547018,
>    "duration": 13.328475,
>    "type_data": {
>        "flag_point": "commit sent; apply or cleanup",
>        "events": [
>            {
>                "time": "2019-03-08 07:53:23.673807",
>                "event": "initiated"
>            },
>            {
>                "time": "2019-03-08 07:53:23.673807",
>                "event": "header_read"
>            },
>            {
>                "time": "2019-03-08 07:53:23.673808",
>                "event": "throttled"
>            },
>            {
>                "time": "2019-03-08 07:53:37.001601",
>                "event": "all_read"
>            },
>            {
>                "time": "2019-03-08 07:53:37.001643",
>                "event": "dispatched"
>            },
>            {
>                "time": "2019-03-08 07:53:37.001649",
>                "event": "queued_for_pg"
>            },
>            {
>                "time": "2019-03-08 07:53:37.001679",
>                "event": "reached_pg"
>            },
>            {
>                "time": "2019-03-08 07:53:37.001699",
>                "event": "started"
>            },
>            {
>                "time": "2019-03-08 07:53:37.002208",
>                "event": "commit_sent"
>            },
>            {
>                "time": "2019-03-08 07:53:37.002282",
>                "event": "done"
>            }
>        ]
>    }
>},
>
>It just tells me "throttled", nothing else. What does throttled mean in this case?
>I see some events where an OSD is waiting for a response from its partners for a specific PG; while that can be attributed to a network issue, the throttled ones are not as clear-cut.
>
>Appreciate any clues,
>
>On 3/11/2019 4:26 PM, Alex Litvak wrote:
>> Hello Cephers,
>>
>> I am trying to find the cause of multiple slow ops that happened on my small cluster. It has 3 nodes with 9 OSDs each:
>>
>> Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
>> 128 GB RAM
>> Each OSD is an Intel DC-S3710 800GB SSD
>> It runs mimic 13.2.2 in containers.
>>
>> The cluster was operating normally for 4 months, and then recently I had an outage with multiple VMs (RBD) showing:
>>
>> Mar 8 07:59:42 sbc12n2-chi.siptalk.com kernel: [140206.243812] INFO: task xfsaild/vda1:404 blocked for more than 120 seconds.
>> Mar 8 07:59:42 sbc12n2-chi.siptalk.com kernel: [140206.243957] Not tainted 4.19.5-1.el7.elrepo.x86_64 #1
>> Mar 8 07:59:42 sbc12n2-chi.siptalk.com kernel: [140206.244063] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Mar 8 07:59:42 sbc12n2-chi.siptalk.com kernel: [140206.244181] xfsaild/vda1 D 0 404 2 0x80000000
>>
>> After examining the ceph logs, I found the following entries on multiple OSDs:
>> Mar 8 07:38:52 storage1n2-chi ceph-osd-run.sh[20939]: 2019-03-08 07:38:52.299 7fe0bdb8f700 -1 osd.13 502 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.148553.0:5996289 7.fe 7:7f0ebfe2:::rbd_data.17bab2eb141f2.000000000000023d:head [stat,write 2588672~16384] snapc 0=[] ondisk+write+known_if_redirected e502)
>> Mar 8 07:38:53 storage1n2-chi ceph-osd-run.sh[20939]: 2019-03-08 07:38:53.347 7fe0bdb8f700 -1 osd.13 502 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.148553.0:5996289 7.fe 7:7f0ebfe2:::rbd_data.17bab2eb141f2.00000000
>>
>> Mar 8 07:43:05 storage1n2-chi ceph-osd-run.sh[28089]: 2019-03-08 07:43:05.360 7f32536bd700 -1 osd.7 502 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.152215.0:7037343 7.1e 7:78d776e4:::rbd_data.27e662eb141f2.0000000000000436:head [stat,write 393216~16384] snapc 0=[] ondisk+write+known_if_redirected e502)
>> Mar 8 07:43:06 storage1n2-chi ceph-osd-run.sh[28089]: 2019-03-08 07:43:06.332 7f32536bd700 -1 osd.7 502 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.152215.0:7037343 7.1e 7:78d776e4:::rbd_data.27e662eb141f2.0000000000000436:head [stat,write 393216~16384] snapc 0=[] ondisk+write+known_if_redirected e502)
>>
>> The messages were showing on all nodes and affected several OSDs on each node.
>>
>> The trouble started at approximately 07:30 am and ended 30 minutes later. I have not seen any slow ops since then, nor have the VMs shown any kernel hangups. Here is my ceph status. I also want to note that the load on the cluster was minimal at the time. Please let me know where I should start looking, as the cluster cannot be in production with these failures.
>>
>>   cluster:
>>     id:     054890af-aef7-46cf-a179-adc9170e3958
>>     health: HEALTH_OK
>>
>>   services:
>>     mon: 3 daemons, quorum storage1n1-chi,storage1n2-chi,storage1n3-chi
>>     mgr: storage1n3-chi(active), standbys: storage1n1-chi, storage1n2-chi
>>     mds: cephfs-1/1/1 up {0=storage1n2-chi=up:active}, 2 up:standby
>>     osd: 27 osds: 27 up, 27 in
>>     rgw: 3 daemons active
>>
>>   data:
>>     pools:   7 pools, 608 pgs
>>     objects: 1.46 M objects, 697 GiB
>>     usage:   3.0 TiB used, 17 TiB / 20 TiB avail
>>     pgs:     608 active+clean
>>
>>   io:
>>     client: 0 B/s rd, 91 KiB/s wr, 6 op/s rd, 10 op/s wr
>>
>> Thank you in advance,
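On the "throttled" entry in the historic op you quoted: the event name matters less than the gap to the next event. In your dump, essentially the whole 13.3 s duration sits between "throttled" (07:53:23.673808) and "all_read" (07:53:37.001601), while the commit itself took well under a millisecond. That gap is the OSD waiting to read the rest of the replicated-write message off the wire, which usually points at the network between the OSD hosts or at the messenger throttle rather than at the disk. Below is a minimal sketch that prints these per-phase gaps for every historic op; it assumes the JSON comes from "ceph daemon osd.<id> dump_historic_ops" with a top-level "ops" key (osd.13, ops.json and the 0.1 s threshold are just example values).

#!/usr/bin/env python3
# Print the time spent between consecutive events of each historic op,
# so the slow phase (e.g. throttled -> all_read) stands out.
#
# Example usage (osd.13 and ops.json are placeholders):
#   ceph daemon osd.13 dump_historic_ops > ops.json
#   python3 slow_phases.py ops.json
import json
import sys
from datetime import datetime

def ts(s):
    return datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f")

with open(sys.argv[1]) as f:
    dump = json.load(f)

# "ops" is the top-level key I see in mimic's dump_historic_ops output;
# adjust if your release names it differently.
for op in dump.get("ops", []):
    print(f"{op['description']}  (duration {op['duration']:.3f}s)")
    events = op["type_data"]["events"]
    for prev, cur in zip(events, events[1:]):
        gap = (ts(cur["time"]) - ts(prev["time"])).total_seconds()
        if gap >= 0.1:  # only show phases that actually took time
            print(f"  {prev['event']} -> {cur['event']}: {gap:.3f}s")

For the op you posted it would print roughly "throttled -> all_read: 13.328s", so for that particular op I would check the links between the OSD hosts (errors, retransmits, flapping) before blaming the SSDs.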
_______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com