Looks like it's mostly RGW metadata stuff; are you running your non-data RGW pools on SSDs (you should, that can help *a lot*)?
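If they're still on HDDs, moving them over is straightforward with device classes. A minimal sketch, assuming Luminous-style device classes and the default RGW pool names; check `ceph osd pool ls` for the names in your zone:

    # route the small, latency-sensitive RGW pools to SSD-backed OSDs
    ceph osd crush rule create-replicated rgw-meta-ssd default host ssd
    for p in .rgw.root default.rgw.control default.rgw.meta default.rgw.log default.rgw.buckets.index; do
        ceph osd pool set "$p" crush_rule rgw-meta-ssd
    done

Note that switching the crush_rule rebalances those pools' data onto the SSD OSDs, so expect some backfill traffic.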
Paul

2018-05-15 18:49 GMT+02:00 Grigory Murashov <[email protected]>:
> Hello guys!
>
> I collected the output of ceph daemon osd.16 dump_ops_in_flight and
> ceph daemon osd.16 dump_historic_ops.
>
> Here is the output of ceph health detail at the moment of the problem:
>
> HEALTH_WARN 20 slow requests are blocked > 32 sec
> REQUEST_SLOW 20 slow requests are blocked > 32 sec
>     20 ops are blocked > 65.536 sec
>     osds 16,27,29 have blocked requests > 65.536 sec
>
> So I grabbed the logs from osd.16. The file is attached. Could you
> please help me interpret it?
>
> Thanks in advance.
>
> Grigory Murashov
> Voximplant
>
> On 14.05.2018 18:14, Grigory Murashov wrote:
>
> Hello David!
>
> 2. I set it to 10/10.
>
> 3. Thanks, my problem was that I ran it on a host where there was no
> osd.15 daemon.
>
> Could you please help me read the OSD logs? Here is a part of ceph.log:
>
> 2018-05-14 13:46:32.644323 mon.storage-ru1-osd1 mon.0 185.164.149.2:6789/0 553895 : cluster [INF] Cluster is now healthy
> 2018-05-14 13:46:43.741921 mon.storage-ru1-osd1 mon.0 185.164.149.2:6789/0 553896 : cluster [WRN] Health check failed: 21 slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2018-05-14 13:46:49.746994 mon.storage-ru1-osd1 mon.0 185.164.149.2:6789/0 553897 : cluster [WRN] Health check update: 23 slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2018-05-14 13:46:55.752314 mon.storage-ru1-osd1 mon.0 185.164.149.2:6789/0 553900 : cluster [WRN] Health check update: 3 slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2018-05-14 13:47:01.030686 mon.storage-ru1-osd1 mon.0 185.164.149.2:6789/0 553901 : cluster [WRN] Health check update: 4 slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2018-05-14 13:47:07.764236 mon.storage-ru1-osd1 mon.0 185.164.149.2:6789/0 553903 : cluster [WRN] Health check update: 32 slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2018-05-14 13:47:13.770833 mon.storage-ru1-osd1 mon.0 185.164.149.2:6789/0 553904 : cluster [WRN] Health check update: 21 slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2018-05-14 13:47:17.774530 mon.storage-ru1-osd1 mon.0 185.164.149.2:6789/0 553905 : cluster [INF] Health check cleared: REQUEST_SLOW (was: 12 slow requests are blocked > 32 sec)
> 2018-05-14 13:47:17.774582 mon.storage-ru1-osd1 mon.0 185.164.149.2:6789/0 553906 : cluster [INF] Cluster is now healthy
>
> At 13:47 I had a problem with osd.21.
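For a repeatable version of that collection step: a minimal sketch, assuming the default admin socket location under /var/run/ceph/ and taking the OSD ids straight from the health output above.

    # snapshot in-flight and recent slow ops on each OSD named by ceph health detail
    for id in 16 27 29; do
        sudo ceph daemon osd.$id dump_ops_in_flight > /tmp/osd.$id.in_flight.json
        sudo ceph daemon osd.$id dump_historic_ops  > /tmp/osd.$id.historic.json
    done

Run it while the REQUEST_SLOW warning is active; once the health check clears, dump_ops_in_flight is usually empty again.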
> 1. Ceph Health (storage-ru1-osd1.voximplant.com:ceph.health): HEALTH_WARN
> {u'REQUEST_SLOW': {u'severity': u'HEALTH_WARN', u'summary': {u'message': u'4 slow requests are blocked > 32 sec'}}}
>
> HEALTH_WARN 4 slow requests are blocked > 32 sec
> REQUEST_SLOW 4 slow requests are blocked > 32 sec
>     2 ops are blocked > 65.536 sec
>     2 ops are blocked > 32.768 sec
>     osd.21 has blocked requests > 65.536 sec
>
> Here is a part of ceph-osd.21.log:
>
> 2018-05-14 13:47:06.891399 7fb806dd6700 10 osd.21 pg_epoch: 236 pg[2.0( v 236'297 (0'0,236'297] local-lis/les=223/224 n=1 ec=119/119 lis/c 223/223 les/c/f 224/224/0 223/223/212) [21,29,15] r=0 lpr=223 crt=236'297 lcod 236'296 mlcod 236'296 active+clean] dropping ondisk_read_lock
> 2018-05-14 13:47:06.891435 7fb806dd6700 10 osd.21 236 dequeue_op 0x56453b753f80 finish
> 2018-05-14 13:47:07.111388 7fb8185f9700 10 osd.21 236 tick
> 2018-05-14 13:47:07.111398 7fb8185f9700 10 osd.21 236 do_waiters -- start
> 2018-05-14 13:47:07.111401 7fb8185f9700 10 osd.21 236 do_waiters -- finish
> 2018-05-14 13:47:07.800421 7fb817df8700 10 osd.21 236 tick_without_osd_lock
> 2018-05-14 13:47:07.800444 7fb817df8700 10 osd.21 236 promote_throttle_recalibrate 0 attempts, promoted 0 objects and 0 bytes; target 25 obj/sec or 5120 k bytes/sec
> 2018-05-14 13:47:07.800449 7fb817df8700 10 osd.21 236 promote_throttle_recalibrate actual 0, actual/prob ratio 1, adjusted new_prob 1000, prob 1000 -> 1000
> 2018-05-14 13:47:08.111470 7fb8185f9700 10 osd.21 236 tick
> 2018-05-14 13:47:08.111483 7fb8185f9700 10 osd.21 236 do_waiters -- start
> 2018-05-14 13:47:08.111485 7fb8185f9700 10 osd.21 236 do_waiters -- finish
> 2018-05-14 13:47:08.181070 7fb8055d3700 10 osd.21 236 dequeue_op 0x564539651000 prio 63 cost 0 latency 0.000143 osd_op(client.2597258.0:213844298 6.1d4 6.4079fd4 (undecoded) ondisk+read+known_if_redirected e236) v8 pg pg[6.1d4( v 236'20882 (236'19289,236'20882] local-lis/les=223/224 n=20791 ec=145/132 lis/c 223/223 les/c/f 224/224/0 223/223/212) [21,29,17] r=0 lpr=223 crt=236'20882 lcod 236'20881 mlcod 236'20881 active+clean]
> 2018-05-14 13:47:08.181112 7fb8055d3700 10 osd.21 pg_epoch: 236 pg[6.1d4( v 236'20882 (236'19289,236'20882] local-lis/les=223/224 n=20791 ec=145/132 lis/c 223/223 les/c/f 224/224/0 223/223/212) [21,29,17] r=0 lpr=223 crt=236'20882 lcod 236'20881 mlcod 236'20881 active+clean] _handle_message: 0x564539651000
> 2018-05-14 13:47:08.181141 7fb8055d3700 10 osd.21 pg_epoch: 236 pg[6.1d4( v 236'20882 (236'19289,236'20882] local-lis/les=223/224 n=20791 ec=145/132 lis/c 223/223 les/c/f 224/224/0 223/223/212) [21,29,17] r=0 lpr=223 crt=236'20882 lcod 236'20881 mlcod 236'20881 active+clean] do_op osd_op(client.2597258.0:213844298 6.1d4 6:2bf9e020:::eb359f44-3316-4cd3-9006-d416c21e0745.1204464.6_2018%2f05%2f14%2fYWRjNmZmNzQzODI2ZGQzOTc3ZjFiNGMxZjIxOGZlYzQvaHR0cDovL3d3dy1sdS0wMS0zNi52b3hpbXBsYW50LmNvbS9yZWNvcmRzLzIwMTgvMDUvMTQvOTRlNjYxY2JiZjU3MTk4NS4xNTI2MjkwMzQ0Ljk2NjQ5MS5tcDM-:head [getxattrs,stat,read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e236) v8 may_read -> read-ordered flags ondisk+read+known_if_redirected
> 2018-05-14 13:47:08.181179 7fb8055d3700 10 osd.21 pg_epoch: 236 pg[6.1d4( v 236'20882 (236'19289,236'20882] local-lis/les=223/224 n=20791 ec=145/132 lis/c 223/223 les/c/f 224/224/0 223/223/212) [21,29,17] r=0 lpr=223 crt=236'20882 lcod 236'20881 mlcod 236'20881 active+clean] get_object_context: found obc in cache: 0x5645b98fdb80
> 2018-05-14 13:47:08.181193 7fb8055d3700 10 osd.21 pg_epoch: 236 pg[6.1d4( v 236'20882 (236'19289,236'20882] local-lis/les=223/224 n=20791 ec=145/132 lis/c 223/223 les/c/f 224/224/0 223/223/212) [21,29,17] r=0 lpr=223 crt=236'20882 lcod 236'20881 mlcod 236'20881 active+clean] get_object_context: 0x5645b98fdb80 6:2bf9e020:::eb359f44-3316-4cd3-9006-d416c21e0745.1204464.6_2018%2f05%2f14%2fYWRjNmZmNzQzODI2ZGQzOTc3ZjFiNGMxZjIxOGZlYzQvaHR0cDovL3d3dy1sdS0wMS0zNi52b3hpbXBsYW50LmNvbS9yZWNvcmRzLzIwMTgvMDUvMTQvOTRlNjYxY2JiZjU3MTk4NS4xNTI2MjkwMzQ0Ljk2NjQ5MS5tcDM-:head rwstate(none n=0 w=0) oi: 6:2bf9e020:::eb359f44-3316-4cd3-9006-d416c21e0745.1204464.6_2018%2f05%2f14%2fYWRjNmZmNzQzODI2ZGQzOTc3ZjFiNGMxZjIxOGZlYzQvaHR0cDovL3d3dy1sdS0wMS0zNi52b3hpbXBsYW50LmNvbS9yZWNvcmRzLzIwMTgvMDUvMTQvOTRlNjYxY2JiZjU3MTk4NS4xNTI2MjkwMzQ0Ljk2NjQ5MS5tcDM-:head(236'20762 client.2597258.0:212394860 dirty|data_digest|omap_digest s 708737 uv 20762 dd c68e9ed od ffffffff alloc_hint [0 0 0]) exists: 1 ssc: 0x564542a6a640 snapset: 0=[]:{}
> 2018-05-14 13:47:08.181222 7fb8055d3700 10 osd.21 pg_epoch: 236 pg[6.1d4( v 236'20882 (236'19289,236'20882] local-lis/les=223/224 n=20791 ec=145/132 lis/c 223/223 les/c/f 224/224/0 223/223/212) [21,29,17] r=0 lpr=223 crt=236'20882 lcod 236'20881 mlcod 236'20881 active+clean] find_object_context 6:2bf9e020:::eb359f44-3316-4cd3-9006-d416c21e0745.1204464.6_2018%2f05%2f14%2fYWRjNmZmNzQzODI2ZGQzOTc3ZjFiNGMxZjIxOGZlYzQvaHR0cDovL3d3dy1sdS0wMS0zNi52b3hpbXBsYW50LmNvbS9yZWNvcmRzLzIwMTgvMDUvMTQvOTRlNjYxY2JiZjU3MTk4NS4xNTI2MjkwMzQ0Ljk2NjQ5MS5tcDM-:head @head oi=6:2bf9e020:::eb359f44-3316-4cd3-9006-d416c21e0745.1204464.6_2018%2f05%2f14%2fYWRjNmZmNzQzODI2ZGQzOTc3ZjFiNGMxZjIxOGZlYzQvaHR0cDovL3d3dy1sdS0wMS0zNi52b3hpbXBsYW50LmNvbS9yZWNvcmRzLzIwMTgvMDUvMTQvOTRlNjYxY2JiZjU3MTk4NS4xNTI2MjkwMzQ0Ljk2NjQ5MS5tcDM-:head(236'20762 client.2597258.0:212394860 dirty|data_digest|omap_digest s 708737 uv 20762 dd c68e9ed od ffffffff alloc_hint [0 0 0])
> 2018-05-14 13:47:08.181252 7fb8055d3700 10 osd.21 pg_epoch: 236 pg[6.1d4( v 236'20882 (236'19289,236'20882] local-lis/les=223/224 n=20791 ec=145/132 lis/c 223/223 les/c/f 224/224/0 223/223/212) [21,29,17] r=0 lpr=223 crt=236'20882 lcod 236'20881 mlcod 236'20881 active+clean] execute_ctx 0x564545128700
> 2018-05-14 13:47:08.181265 7fb8055d3700 10 osd.21 pg_epoch: 236 pg[6.1d4( v 236'20882 (236'19289,236'20882] local-lis/les=223/224 n=20791 ec=145/132 lis/c 223/223 les/c/f 224/224/0 223/223/212) [21,29,17] r=0 lpr=223 crt=236'20882 lcod 236'20881 mlcod 236'20881 active+clean] execute_ctx 6:2bf9e020:::eb359f44-3316-4cd3-9006-d416c21e0745.1204464.6_2018%2f05%2f14%2fYWRjNmZmNzQzODI2ZGQzOTc3ZjFiNGMxZjIxOGZlYzQvaHR0cDovL3d3dy1sdS0wMS0zNi52b3hpbXBsYW50LmNvbS9yZWNvcmRzLzIwMTgvMDUvMTQvOTRlNjYxY2JiZjU3MTk4NS4xNTI2MjkwMzQ0Ljk2NjQ5MS5tcDM-:head [getxattrs,stat,read 0~4194304] ov 236'20762
> 2018-05-14 13:47:08.181282 7fb8055d3700 10 osd.21 pg_epoch: 236 pg[6.1d4( v 236'20882 (236'19289,236'20882] local-lis/les=223/224 n=20791 ec=145/132 lis/c 223/223 les/c/f 224/224/0 223/223/212) [21,29,17] r=0 lpr=223 crt=236'20882 lcod 236'20881 mlcod 236'20881 active+clean] taking ondisk_read_lock
> 2018-05-14 13:47:08.181292 7fb8055d3700 10 osd.21 pg_epoch: 236 pg[6.1d4( v 236'20882 (236'19289,236'20882] local-lis/les=223/224 n=20791 ec=145/132 lis/c 223/223 les/c/f 224/224/0 223/223/212) [21,29,17] r=0 lpr=223 crt=236'20882 lcod 236'20881 mlcod 236'20881 active+clean] do_osd_op 6:2bf9e020:::eb359f44-3316-4cd3-9006-d416c21e0745.1204464.6_2018%2f05%2f14%2fYWRjNmZmNzQzODI2ZGQzOTc3ZjFiNGMxZjIxOGZlYzQvaHR0cDovL3d3dy1sdS0wMS0zNi52b3hpbXBsYW50LmNvbS9yZWNvcmRzLzIwMTgvMDUvMTQvOTRlNjYxY2JiZjU3MTk4NS4xNTI2MjkwMzQ0Ljk2NjQ5MS5tcDM-:head [getxattrs,stat,read 0~4194304]
>
> What are the key words to identify the problem if I want to catch this line?
>
> 2018-05-14 13:47:07.764236 mon.storage-ru1-osd1 mon.0 185.164.149.2:6789/0 553903 : cluster [WRN] Health check update: 32 slow requests are blocked > 32 sec (REQUEST_SLOW)
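The stable strings to match on are REQUEST_SLOW and "slow requests are blocked" for the mon-side summaries, plus "slow request" for the per-op detail. A sketch, assuming the default cluster name "ceph" for the log paths; whether the per-op lines show up in the OSD's own log depends on version and debug levels:

    # mon-side summaries in the cluster log
    grep -E 'REQUEST_SLOW|slow requests are blocked' /var/log/ceph/ceph.log
    # per-op detail, when logged, lands in the OSD's own log
    grep 'slow request' /var/log/ceph/ceph-osd.21.log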
>
> Thanks in advance.
>
> Grigory Murashov
> Voximplant
>
> On 10.05.2018 20:12, David Turner wrote:
>
> 2. When logging, the 1/5 is what's written to the log file / what's
> temporarily stored in memory. If you want to increase logging, you need
> to increase both numbers, to 20/20 or 10/10. You can also just set it
> to 20 or 10 and ceph will set both to the same number. I personally set
> both numbers to remind myself that the defaults aren't the same when I
> set it back.
>
> 3. You are not accessing something stored within the ceph cluster, so
> it isn't using the admin cephx key that you have. It is accessing the
> daemon socket in the OS, so you need proper permissions to access it.
> The daemon sockets are located in /var/run/ceph/, which is a folder
> without any public permissions. You could use `sudo -u ceph ceph daemon
> osd.15 dump_historic_ops` or just `sudo ceph daemon osd.15
> dump_historic_ops`, as root can access the daemon socket as well.
>
> On Thu, May 10, 2018 at 7:14 AM Grigory Murashov <[email protected]> wrote:
>
>> Hi JC!
>>
>> Thanks for your answer first.
>>
>> 1. I have added the output of ceph health detail to Zabbix in case of
>> a warning. So every time I will see which OSD the problem is with.
>>
>> 2. I have the default level for all logs. As I see at
>> http://docs.ceph.com/docs/master/rados/troubleshooting/log-and-debug/
>> the default log level for OSD is 1/5. Should I try
>>
>> debug osd = 1/20, or would 1/10 be enough here?
>>
>> 3. Any thoughts on why I get Permission denied? All of my sockets are
>> also at their defaults.
>>
>> [cephuser@osd3 ~]$ ceph daemon osd.15 dump_historic_ops
>> admin_socket: exception getting command descriptions: [Errno 13] Permission denied
>>
>> Thanks in advance.
>>
>> Grigory Murashov
>> Voximplant
>>
>> On 08.05.2018 17:31, Jean-Charles Lopez wrote:
>> > Hi Grigory,
>> >
>> > are these lines the only lines in your log file for OSD 15?
>> >
>> > Just for sanity, what are the log levels you have set, if any, in
>> your config file away from the default? If you set all log levels to 0
>> like some people do, you may want to simply go back to the defaults by
>> commenting out the debug_ lines in your config file. If you want to
>> see something more detailed you can indeed increase the log level to 5
>> or 10.
>> >
>> > What you can also do is use the admin socket on the machine to see
>> which operations are actually blocked: ceph daemon osd.15
>> dump_ops_in_flight and ceph daemon osd.15 dump_historic_ops.
>> >
>> > These two commands and their output will show you which exact
>> operations are blocked and will also point you to the other OSDs this
>> OSD is working with to serve the IO. Maybe the culprit is actually one
>> of the OSDs handling the subops, or it could be a network problem.
>> >
>> > Regards
>> > JC
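On raising those levels without editing ceph.conf or restarting anything: a sketch, assuming the Luminous-era injectargs syntax; revert the same way once the capture is done.

    # bump OSD logging at runtime, then put it back to the default
    ceph tell osd.15 injectargs '--debug_osd 10/10'
    ceph tell osd.15 injectargs '--debug_osd 1/5'
    # or locally on the OSD host, through the admin socket
    sudo ceph daemon osd.15 config set debug_osd 10/10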
>> >> On May 8, 2018, at 03:11, Grigory Murashov <[email protected]> wrote:
>> >>
>> >> Hello Jean-Charles!
>> >>
>> >> I have finally caught the problem. It was at 13:02.
>> >>
>> >> [cephuser@storage-ru1-osd3 ~]$ ceph health detail
>> >> HEALTH_WARN 18 slow requests are blocked > 32 sec
>> >> REQUEST_SLOW 18 slow requests are blocked > 32 sec
>> >>     3 ops are blocked > 65.536 sec
>> >>     15 ops are blocked > 32.768 sec
>> >>     osd.15 has blocked requests > 65.536 sec
>> >> [cephuser@storage-ru1-osd3 ~]$
>> >>
>> >> But surprise: there is no information in ceph-osd.15.log for that time.
>> >>
>> >> 2018-05-08 12:54:26.105919 7f003f5f9700 4 rocksdb: (Original Log Time 2018/05/08-12:54:26.105843) EVENT_LOG_v1 {"time_micros": 1525773266105834, "job": 2793, "event": "trivial_move", "destination_level": 3, "files": 1, "total_files_size": 68316970}
>> >> 2018-05-08 12:54:26.105926 7f003f5f9700 4 rocksdb: (Original Log Time 2018/05/08-12:54:26.105854) [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/db/db_impl_compaction_flush.cc:1537] [default] Moved #1 files to level-3 68316970 bytes OK: base level 1 max bytes base 268435456 files[0 4 45 403 722 0 0] max score 0.98
>> >>
>> >> 2018-05-08 13:07:29.711425 7f004f619700 4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/db/db_impl_write.cc:684] reusing log 8051 from recycle list
>> >>
>> >> 2018-05-08 13:07:29.711497 7f004f619700 4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/db/db_impl_write.cc:725] [default] New memtable created with log file: #8089. Immutable memtables: 0.
>> >>
>> >> 2018-05-08 13:07:29.726107 7f003fdfa700 4 rocksdb: (Original Log Time 2018/05/08-13:07:29.711524) [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/db/db_impl_compaction_flush.cc:1158] Calling FlushMemTableToOutputFile with column family [default], flush slots available 1, compaction slots allowed 1, compaction slots scheduled 1
>> >> 2018-05-08 13:07:29.726124 7f003fdfa700 4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/db/flush_job.cc:264] [default] [JOB 2794] Flushing memtable with next log file: 8089
>> >>
>> >> Should I enable deeper logging?
>> >>
>> >> Grigory Murashov
>> >> Voximplant
>> >>
>> >> On 07.05.2018 18:59, Jean-Charles Lopez wrote:
>> >>> Hi,
>> >>>
>> >>> ceph health detail
>> >>>
>> >>> This will tell you which OSDs are experiencing the problem so you
>> can then go and inspect the logs and use the admin socket to find out
>> which requests are at the source.
>> >>>
>> >>> Regards
>> >>> JC
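When the OSD's own log is quiet at the time of the warning, as in the 13:02 case above, the historic-ops dump is usually more telling than the log. A sketch for pulling out the slowest recent entries; the field names are as I remember them from the 12.2.x JSON output, and jq is assumed to be installed:

    # show the three slowest recent ops and how long they took
    sudo ceph daemon osd.15 dump_historic_ops | jq '.ops | sort_by(.duration) | reverse | .[:3] | map({description, duration})'

The per-op event timeline in the full entries (type_data) shows which step the op was stuck in, e.g. waiting for subops from other OSDs.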
>> >>>> On May 7, 2018, at 03:52, Grigory Murashov <[email protected]> wrote:
>> >>>>
>> >>>> Hello!
>> >>>>
>> >>>> I'm not very experienced in ceph troubleshooting; that's why I'm
>> asking for help.
>> >>>>
>> >>>> I have multiple warnings coming from Zabbix as a result of ceph -s:
>> >>>>
>> >>>> REQUEST_SLOW: HEALTH_WARN : 21 slow requests are blocked > 32 sec
>> >>>>
>> >>>> I don't see any hardware problems at that time.
>> >>>>
>> >>>> I'm able to find the same strings in ceph.log and ceph-mon.log, like:
>> >>>>
>> >>>> 2018-05-07 12:37:57.375546 7f3037dae700 0 log_channel(cluster) log [WRN] : Health check failed: 12 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> >>>>
>> >>>> Now it's important to find the root of the problem. How do I find out:
>> >>>>
>> >>>> 1. which OSDs are affected
>> >>>>
>> >>>> 2. which particular requests were slowed and blocked?
>> >>>>
>> >>>> I assume I need more detailed logging; how do I set that up?
>> >>>>
>> >>>> Appreciate your help.
>> >>>>
>> >>>> --
>> >>>> Grigory Murashov

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
