Re: [ceph-users] Interpreting reason for blocked request
>>> 2018-05-03 01:56:35.249122 osd.0 192.168.1.16:6800/348 54 :
>>>    cluster [WRN] slow request 961.557151 seconds old,
>>>    received at 2018-05-03 01:40:33.689191:
>>>    pg_query(4.f epoch 490) currently wait for new map
>
> The OSD is waiting for a new OSD map, which it will get from one of its
> peers or the monitor (by request). This tends to happen if the client sees
> a newer version than the OSD does.

Hmmm. So the client gets the current OSD map from the Monitor and then
indicates in its request to the OSD which map epoch it is using? And if the
OSD has an older map, it requests a new one from another OSD or a Monitor
before proceeding? And I suppose if the current epoch is still older than
what the client said, the OSD keeps trying until it gets the epoch the
client stated.

If that's so, this situation could happen if for some reason the client got
the idea that there's a newer map than there really is.

What I'm looking at is probably just a Ceph bug, because this small test
cluster got into this state immediately upon startup, before any client had
connected (I assume these blocked requests are from inside the cluster), and
the requests aren't just blocked for a long time; they're blocked
indefinitely. The only time I've seen it is when I brought the cluster up in
a different order than I usually do. So I'm just trying to understand the
inner workings in case I need to debug it if it keeps happening.

--
Bryan Henderson                                    San Jose, California
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
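The epoch handshake described above can be sketched in a few lines of
Python. This is purely illustrative: `SketchOSD`, `handle_request`, and
`receive_map` are invented names for the sketch, not Ceph code (the real
logic lives in the C++ OSD, with map sharing among peers and much more
machinery):

```python
# Conceptual sketch of the "wait for new map" behavior: each request
# carries the map epoch the sender used; an OSD whose map is older parks
# the op and asks a peer or monitor for a newer map before proceeding.

class SketchOSD:
    def __init__(self, epoch):
        self.epoch = epoch     # newest OSD map epoch this OSD has seen
        self.waiting = []      # ops parked in "wait for new map"

    def handle_request(self, req_epoch, op):
        if req_epoch > self.epoch:
            # Our map is behind the sender's: park the op until a
            # sufficiently new map arrives.
            self.waiting.append((req_epoch, op))
            return "wait for new map"
        return f"processing {op} at epoch {self.epoch}"

    def receive_map(self, new_epoch):
        # A newer map arrived; release any ops it unblocks.
        self.epoch = max(self.epoch, new_epoch)
        ready = [op for e, op in self.waiting if e <= self.epoch]
        self.waiting = [(e, op) for e, op in self.waiting if e > self.epoch]
        return ready

osd = SketchOSD(epoch=488)
print(osd.handle_request(490, "pg_query(4.f)"))  # -> wait for new map
print(osd.receive_map(490))                      # -> ['pg_query(4.f)']
```

Note what happens if the expected epoch never materializes, e.g. because
the sender's idea of the current epoch was wrong: the op stays parked
forever, which would look exactly like the indefinite blockage described
above.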
Re: [ceph-users] Interpreting reason for blocked request
On Sat, May 12, 2018 at 3:22 PM Bryan Henderson wrote:

> I recently had some requests blocked indefinitely; I eventually cleared it
> up by recycling the OSDs, but I'd like some help interpreting the log
> messages that supposedly give a clue as to what caused the blockage:
>
> (I reformatted for easy email reading)
>
> 2018-05-03 01:56:35.248623 osd.0 192.168.1.16:6800/348 53 :
>    cluster [WRN] 7 slow requests, 2 included below;
>    oldest blocked for > 961.596517 secs
>
> 2018-05-03 01:56:35.249122 osd.0 192.168.1.16:6800/348 54 :
>    cluster [WRN] slow request 961.557151 seconds old,
>    received at 2018-05-03 01:40:33.689191:
>    pg_query(4.f epoch 490) currently wait for new map
>
> 2018-05-03 01:56:35.249543 osd.0 192.168.1.16:6800/348 55 :
>    cluster [WRN] slow request 961.556655 seconds old,
>    received at 2018-05-03 01:40:33.689686:
>    pg_query(1.d epoch 490) currently wait for new map
>
> 2018-05-03 01:56:31.918589 osd.1 192.168.1.23:6800/345 80 :
>    cluster [WRN] 2 slow requests, 2 included below;
>    oldest blocked for > 960.677480 secs
>
> 2018-05-03 01:56:31.920076 osd.1 192.168.1.23:6800/345 81 :
>    cluster [WRN] slow request 960.677480 seconds old,
>    received at 2018-05-03 01:40:31.238642:
>    osd_op(mds.0.57:1 mds0_inotable [read 0~0] 2.b852b893
>    RETRY=2 ack+retry+read+known_if_redirected e490) currently reached_pg
>
> 2018-05-03 01:56:31.921526 osd.1 192.168.1.23:6800/345 82 :
>    cluster [WRN] slow request 960.663817 seconds old,
>    received at 2018-05-03 01:40:31.252305:
>    osd_op(mds.0.57:3 mds_snaptable [read 0~0] 2.d90270ad
>    RETRY=2 ack+retry+read+known_if_redirected e490) currently reached_pg
>
> "wait for new map": what map would that be, and where is the OSD expecting
> it to come from?

The OSD is waiting for a new OSD map, which it will get from one of its
peers or the monitor (by request). This tends to happen if the client sees
a newer version than the OSD does.

> "reached_pg"?

The request has been delivered into a queue for the PG to process, but it
hasn't been picked up and worked on yet. There's nothing about the request
itself that is blocking it here, but some other kind of back pressure is
going on: either the PG is working on or waiting for another request that
prevents it from picking up new ones, or there's some kind of throttler
preventing it from picking up new work, or there's no CPU time available
for some reason.

With what you've shown here, it looks like either your cluster is
dramatically overloaded, or else something is going on with the MDS tables
that is killing the OSD whenever it tries to access them. I think there
were some past issues with them if they grew too large in older releases?

-Greg

> You see two OSDs: osd.0 and osd.1. They're basically set up as a mirrored
> pair.
>
> Thanks.
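The "reached_pg" state described above (op delivered to the PG's queue but
not yet picked up, because something else is holding the worker) can be
illustrated with a toy queue. This is hypothetical Python for illustration
only; `SketchPG` and its methods are invented, not Ceph internals:

```python
# Toy illustration of "reached_pg": an op sits in a PG's work queue
# while a long-running op, a throttle, or CPU starvation keeps the
# worker from picking it up. Nothing about the queued op itself blocks.
from collections import deque

class SketchPG:
    def __init__(self):
        self.queue = deque()
        self.busy_with = None    # op currently monopolizing the PG, if any

    def submit(self, op):
        self.queue.append(op)    # the op has now "reached the PG"
        if self.busy_with:
            return "reached_pg"  # queued, but back pressure delays it
        return self.process_one()

    def process_one(self):
        return f"done: {self.queue.popleft()}"

pg = SketchPG()
pg.busy_with = "stuck mds0_inotable read"      # e.g. an op that never finishes
print(pg.submit("osd_op(mds_snaptable read)")) # -> reached_pg
```

In this toy, clearing `busy_with` and draining the queue would release the
op, which matches the observation that restarting the OSDs cleared the
blockage.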
[ceph-users] Interpreting reason for blocked request
I recently had some requests blocked indefinitely; I eventually cleared it
up by recycling the OSDs, but I'd like some help interpreting the log
messages that supposedly give a clue as to what caused the blockage:

(I reformatted for easy email reading)

2018-05-03 01:56:35.248623 osd.0 192.168.1.16:6800/348 53 :
   cluster [WRN] 7 slow requests, 2 included below;
   oldest blocked for > 961.596517 secs

2018-05-03 01:56:35.249122 osd.0 192.168.1.16:6800/348 54 :
   cluster [WRN] slow request 961.557151 seconds old,
   received at 2018-05-03 01:40:33.689191:
   pg_query(4.f epoch 490) currently wait for new map

2018-05-03 01:56:35.249543 osd.0 192.168.1.16:6800/348 55 :
   cluster [WRN] slow request 961.556655 seconds old,
   received at 2018-05-03 01:40:33.689686:
   pg_query(1.d epoch 490) currently wait for new map

2018-05-03 01:56:31.918589 osd.1 192.168.1.23:6800/345 80 :
   cluster [WRN] 2 slow requests, 2 included below;
   oldest blocked for > 960.677480 secs

2018-05-03 01:56:31.920076 osd.1 192.168.1.23:6800/345 81 :
   cluster [WRN] slow request 960.677480 seconds old,
   received at 2018-05-03 01:40:31.238642:
   osd_op(mds.0.57:1 mds0_inotable [read 0~0] 2.b852b893
   RETRY=2 ack+retry+read+known_if_redirected e490) currently reached_pg

2018-05-03 01:56:31.921526 osd.1 192.168.1.23:6800/345 82 :
   cluster [WRN] slow request 960.663817 seconds old,
   received at 2018-05-03 01:40:31.252305:
   osd_op(mds.0.57:3 mds_snaptable [read 0~0] 2.d90270ad
   RETRY=2 ack+retry+read+known_if_redirected e490) currently reached_pg

"wait for new map": what map would that be, and where is the OSD expecting
it to come from?

"reached_pg"?

You see two OSDs: osd.0 and osd.1. They're basically set up as a mirrored
pair.

Thanks.
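For readers triaging warnings like the ones above, a small parser can group
blocked ops by their reported state. This is a sketch: the regex assumes
the exact "slow request ... currently <state>" format shown in the logs
above, and `group_by_state` is an invented helper, not a Ceph tool:

```python
import re

# Matches the per-op slow-request warnings and pulls out the op's age
# in seconds, its receive time, the op description, and the state it
# is stuck in ("wait for new map", "reached_pg", ...).
PATTERN = re.compile(
    r"slow request (?P<age>[\d.]+) seconds old, "
    r"received at (?P<received>[\d-]+ [\d:.]+): "
    r"(?P<op>.+) currently (?P<state>.+)$"
)

def group_by_state(lines):
    groups = {}
    for line in lines:
        m = PATTERN.search(line)
        if m:
            groups.setdefault(m.group("state"), []).append(
                float(m.group("age"))
            )
    return groups

logs = [
    "slow request 961.557151 seconds old, received at 2018-05-03 "
    "01:40:33.689191: pg_query(4.f epoch 490) currently wait for new map",
    "slow request 960.677480 seconds old, received at 2018-05-03 "
    "01:40:31.238642: osd_op(mds.0.57:1 mds0_inotable [read 0~0] "
    "2.b852b893 RETRY=2 ack+retry+read+known_if_redirected e490) "
    "currently reached_pg",
]
print(group_by_state(logs))
# {'wait for new map': [961.557151], 'reached_pg': [960.67748]}
```

Seeing many ops pile up in one state (here, both states at roughly the same
age) is a quick hint that one event stalled everything at once.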