Hi Alex, bad disks are often the root cause of slow ops... perhaps run htop while you stress your cluster and see which disks are operating at their limit...
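If you want a number you can compare across nodes rather than eyeballing htop, something like the sketch below may help: it samples /proc/diskstats twice and prints roughly what "iostat -x" calls %util per disk, i.e. how much of the interval each disk spent busy with I/O. This is only a minimal sketch under my assumptions (your OSD data devices show up as whole sd* block devices; the 5 second interval is arbitrary), not a polished tool.

#!/usr/bin/env python3
# Rough per-disk utilization sampler, similar in spirit to "iostat -x":
# read /proc/diskstats twice and report what fraction of the interval
# each whole disk spent doing I/O. Run it on each OSD node under load.
import time

def busy_ms():
    """Return {device: milliseconds spent doing I/O} from /proc/diskstats."""
    stats = {}
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            name = fields[2]
            # whole SATA/SAS disks only (sda, sdb, ...); adjust the filter for NVMe etc.
            if name.startswith("sd") and not name[-1].isdigit():
                stats[name] = int(fields[12])  # 13th field: time spent doing I/O (ms)
    return stats

INTERVAL = 5.0  # seconds between samples; pick whatever suits your test
before = busy_ms()
time.sleep(INTERVAL)
after = busy_ms()

for dev in sorted(after):
    util = 100.0 * (after[dev] - before.get(dev, 0)) / (INTERVAL * 1000.0)
    print(f"{dev:8s} {util:6.1f} %util")

A disk that stays pinned near 100% while its peers are idle is a good candidate for SMART and latency checks; "ceph osd perf" (per-OSD commit/apply latency) is a useful cross-check. There is also a note below your quoted message about what the "throttled" gap in that historic op actually covers.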
Only a guess.

On 12 March 2019 14:19:03 CET, Alex Litvak <alexander.v.lit...@gmail.com> wrote:
>I looked further into historic slow ops (thanks to some other posts on the list) and I am a bit confused by the following event:
>
>{
>    "description": "osd_repop(client.85322.0:86478552 7.1b e502/466 7:d8d149b7:::rbd_data.ff7e3d1b58ba.0000000000000316:head v 502'10665506)",
>    "initiated_at": "2019-03-08 07:53:23.673807",
>    "age": 335669.547018,
>    "duration": 13.328475,
>    "type_data": {
>        "flag_point": "commit sent; apply or cleanup",
>        "events": [
>            {
>                "time": "2019-03-08 07:53:23.673807",
>                "event": "initiated"
>            },
>            {
>                "time": "2019-03-08 07:53:23.673807",
>                "event": "header_read"
>            },
>            {
>                "time": "2019-03-08 07:53:23.673808",
>                "event": "throttled"
>            },
>            {
>                "time": "2019-03-08 07:53:37.001601",
>                "event": "all_read"
>            },
>            {
>                "time": "2019-03-08 07:53:37.001643",
>                "event": "dispatched"
>            },
>            {
>                "time": "2019-03-08 07:53:37.001649",
>                "event": "queued_for_pg"
>            },
>            {
>                "time": "2019-03-08 07:53:37.001679",
>                "event": "reached_pg"
>            },
>            {
>                "time": "2019-03-08 07:53:37.001699",
>                "event": "started"
>            },
>            {
>                "time": "2019-03-08 07:53:37.002208",
>                "event": "commit_sent"
>            },
>            {
>                "time": "2019-03-08 07:53:37.002282",
>                "event": "done"
>            }
>        ]
>    }
>},
>
>It just tells me "throttled", nothing else. What does throttled mean in this case?
>I see some events where an OSD is waiting for a response from its partners for a specific PG; while that can be attributed to a network issue, the throttled ones are not as clear-cut.
>
>Appreciate any clues,
>
>On 3/11/2019 4:26 PM, Alex Litvak wrote:
>> Hello Cephers,
>>
>> I am trying to find the cause of multiple slow ops that happened on my small cluster. It has 3 nodes with 9 OSDs each:
>>
>> Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
>> 128 GB RAM
>> Each OSD is an Intel DC-S3710 800GB SSD
>> It runs mimic 13.2.2 in containers.
>>
>> The cluster was operating normally for 4 months, and then recently I had an outage with multiple VMs (RBD) showing:
>>
>> Mar 8 07:59:42 sbc12n2-chi.siptalk.com kernel: [140206.243812] INFO: task xfsaild/vda1:404 blocked for more than 120 seconds.
>> Mar 8 07:59:42 sbc12n2-chi.siptalk.com kernel: [140206.243957] Not tainted 4.19.5-1.el7.elrepo.x86_64 #1
>> Mar 8 07:59:42 sbc12n2-chi.siptalk.com kernel: [140206.244063] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Mar 8 07:59:42 sbc12n2-chi.siptalk.com kernel: [140206.244181] xfsaild/vda1 D 0 404 2 0x80000000
>>
>> After examining the ceph logs, I found the following entries on multiple OSDs:
>> Mar 8 07:38:52 storage1n2-chi ceph-osd-run.sh[20939]: 2019-03-08 07:38:52.299 7fe0bdb8f700 -1 osd.13 502 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.148553.0:5996289 7.fe 7:7f0ebfe2:::rbd_data.17bab2eb141f2.000000000000023d:head [stat,write 2588672~16384] snapc 0=[] ondisk+write+known_if_redirected e502)
>> Mar 8 07:38:53 storage1n2-chi ceph-osd-run.sh[20939]: 2019-03-08 07:38:53.347 7fe0bdb8f700 -1 osd.13 502 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.148553.0:5996289 7.fe 7:7f0ebfe2:::rbd_data.17bab2eb141f2.00000000
>>
>> Mar 8 07:43:05 storage1n2-chi ceph-osd-run.sh[28089]: 2019-03-08 07:43:05.360 7f32536bd700 -1 osd.7 502 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.152215.0:7037343 7.1e 7:78d776e4:::rbd_data.27e662eb141f2.0000000000000436:head [stat,write 393216~16384] snapc 0=[] ondisk+write+known_if_redirected e502)
>> Mar 8 07:43:06 storage1n2-chi ceph-osd-run.sh[28089]: 2019-03-08 07:43:06.332 7f32536bd700 -1 osd.7 502 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.152215.0:7037343 7.1e 7:78d776e4:::rbd_data.27e662eb141f2.0000000000000436:head [stat,write 393216~16384] snapc 0=[] ondisk+write+known_if_redirected e502)
>>
>> The messages were showing on all nodes and affected several OSDs on each node.
>>
>> The trouble started at approximately 07:30 am and ended 30 minutes later. I have not seen any slow ops since then, nor have the VMs shown any kernel hangups. Here is my ceph status. I also want to note that the load on the cluster was minimal at the time. Please let me know where I should start looking, as the cluster cannot be in production with these failures.
>>
>>   cluster:
>>     id:     054890af-aef7-46cf-a179-adc9170e3958
>>     health: HEALTH_OK
>>
>>   services:
>>     mon: 3 daemons, quorum storage1n1-chi,storage1n2-chi,storage1n3-chi
>>     mgr: storage1n3-chi(active), standbys: storage1n1-chi, storage1n2-chi
>>     mds: cephfs-1/1/1 up {0=storage1n2-chi=up:active}, 2 up:standby
>>     osd: 27 osds: 27 up, 27 in
>>     rgw: 3 daemons active
>>
>>   data:
>>     pools:   7 pools, 608 pgs
>>     objects: 1.46 M objects, 697 GiB
>>     usage:   3.0 TiB used, 17 TiB / 20 TiB avail
>>     pgs:     608 active+clean
>>
>>   io:
>>     client: 0 B/s rd, 91 KiB/s wr, 6 op/s rd, 10 op/s wr
>>
>> Thank you in advance,
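On the "throttled" entry in the historic op you quoted: the event name matters less than the gap to the next event. In your dump, essentially the whole 13.3 s duration sits between "throttled" (07:53:23.673808) and "all_read" (07:53:37.001601), while the commit itself took well under a millisecond. That gap is the OSD waiting to read the rest of the replicated-write message off the wire, which usually points at the network between the OSD hosts or at the messenger throttle rather than at the disk. Below is a minimal sketch that prints these per-phase gaps for every historic op; it assumes the JSON comes from "ceph daemon osd.<id> dump_historic_ops" with a top-level "ops" key (osd.13, ops.json and the 0.1 s threshold are just example values).

#!/usr/bin/env python3
# Print the time spent between consecutive events of each historic op,
# so the slow phase (e.g. throttled -> all_read) stands out.
#
# Example usage (osd.13 and ops.json are placeholders):
#   ceph daemon osd.13 dump_historic_ops > ops.json
#   python3 slow_phases.py ops.json
import json
import sys
from datetime import datetime

def ts(s):
    return datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f")

with open(sys.argv[1]) as f:
    dump = json.load(f)

# "ops" is the top-level key I see in mimic's dump_historic_ops output;
# adjust if your release names it differently.
for op in dump.get("ops", []):
    print(f"{op['description']}  (duration {op['duration']:.3f}s)")
    events = op["type_data"]["events"]
    for prev, cur in zip(events, events[1:]):
        gap = (ts(cur["time"]) - ts(prev["time"])).total_seconds()
        if gap >= 0.1:  # only show phases that actually took time
            print(f"  {prev['event']} -> {cur['event']}: {gap:.3f}s")

For the op you posted it would print roughly "throttled -> all_read: 13.328s", so for that particular op I would check the links between the OSD hosts (errors, retransmits, flapping) before blaming the SSDs.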
_______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com