Re: [ceph-users] Interpreting reason for blocked request

2018-05-19 Thread Bryan Henderson
>>> 2018-05-03 01:56:35.249122 osd.0 192.168.1.16:6800/348 54 :
>>>   cluster [WRN] slow request 961.557151 seconds old,
>>>   received at 2018-05-03 01:40:33.689191:
>>> pg_query(4.f epoch 490) currently wait for new map
>>>
>
>The OSD is waiting for a new OSD map, which it will get from one of its
>peers or the monitor (by request). This tends to happen if the client sees
>a newer version than the OSD does.

Hmmm.  So the client gets the current OSD map from the Monitor and then
indicates in its request to the OSD which map epoch it is using?  And if the
OSD has an older map, it requests a newer one from another OSD or the Monitor
before proceeding?  And I suppose that if its map is still older than the
epoch the client cited, the OSD keeps waiting until it has that epoch.

If that's so, this situation could happen if for some reason the client got
the idea that there's a newer map than actually exists.
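
For my own notes, here is a toy model of how I picture that exchange.  It's
plain Python, not Ceph code; Op, Osd, and the method names are all made up.
The point is the failure mode: if the epoch stamped on an incoming message is
ahead of any map that ever actually gets published, the request just sits in
the waiting list forever, which would match what I'm seeing:

    from collections import deque

    class Op:
        def __init__(self, name, sender_epoch):
            self.name = name
            self.sender_epoch = sender_epoch  # epoch the sender claims to have

    class Osd:
        def __init__(self, epoch):
            self.epoch = epoch              # newest OSD map this OSD has
            self.waiting_for_map = deque()  # parked until a newer map arrives

        def handle_op(self, op):
            if op.sender_epoch > self.epoch:
                # "currently wait for new map": park it, ask mon/peers for maps
                self.waiting_for_map.append(op)
                return "blocked"
            return "processed " + op.name

        def advance_map(self, new_epoch):
            # A newer map arrived; retry whatever was waiting on it.
            self.epoch = max(self.epoch, new_epoch)
            ready = [op for op in self.waiting_for_map
                     if op.sender_epoch <= self.epoch]
            self.waiting_for_map = deque(op for op in self.waiting_for_map
                                         if op.sender_epoch > self.epoch)
            for op in ready:
                print(self.handle_op(op))

    osd = Osd(epoch=489)
    print(osd.handle_op(Op("pg_query(4.f)", sender_epoch=490)))  # blocked
    osd.advance_map(490)  # map 490 shows up -> the op is retried and processed
    # If epoch 490 never materializes, the op stays parked indefinitely.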

What I'm looking at is probably just a Ceph bug: this small test cluster got
into this state immediately upon startup, before any client had connected (I
assume these blocked requests originate inside the cluster), and the requests
aren't just blocked for a long time; they're blocked indefinitely.  The only
time I've seen it is when I brought the cluster up in a different order than
I usually do.  So I'm trying to understand the inner workings in case it
keeps happening and I need to debug it.

-- 
Bryan Henderson   San Jose, California


Re: [ceph-users] Interpreting reason for blocked request

2018-05-16 Thread Gregory Farnum
On Sat, May 12, 2018 at 3:22 PM Bryan Henderson wrote:

> I recently had some requests blocked indefinitely; I eventually cleared it
> up by recycling the OSDs, but I'd like some help interpreting the log
> messages that supposedly give a clue as to what caused the blockage:
>
> (I reformatted for easy email reading)
>
> 2018-05-03 01:56:35.248623 osd.0 192.168.1.16:6800/348 53 :
>   cluster [WRN] 7 slow requests, 2 included below;
>   oldest blocked for > 961.596517 secs
>
> 2018-05-03 01:56:35.249122 osd.0 192.168.1.16:6800/348 54 :
>   cluster [WRN] slow request 961.557151 seconds old,
>   received at 2018-05-03 01:40:33.689191:
> pg_query(4.f epoch 490) currently wait for new map
>
> 2018-05-03 01:56:35.249543 osd.0 192.168.1.16:6800/348 55 :
>   cluster [WRN] slow request 961.556655 seconds old,
>   received at 2018-05-03 01:40:33.689686:
> pg_query(1.d epoch 490) currently wait for new map
>
> 2018-05-03 01:56:31.918589 osd.1 192.168.1.23:6800/345 80 :
>   cluster [WRN] 2 slow requests, 2 included below;
>   oldest blocked for > 960.677480 secs
>
> 2018-05-03 01:56:31.920076 osd.1 192.168.1.23:6800/345 81 :
>   cluster [WRN] slow request 960.677480 seconds old,
>   received at 2018-05-03 01:40:31.238642:
> osd_op(mds.0.57:1 mds0_inotable [read 0~0] 2.b852b893
>   RETRY=2 ack+retry+read+known_if_redirected e490) currently reached_pg
>
> 2018-05-03 01:56:31.921526 osd.1 192.168.1.23:6800/345 82 :
>   cluster [WRN] slow request 960.663817 seconds old,
>   received at 2018-05-03 01:40:31.252305:
> osd_op(mds.0.57:3 mds_snaptable [read 0~0] 2.d90270ad
>   RETRY=2 ack+retry+read+known_if_redirected e490) currently reached_pg
>
> "wait for new map": what map would that be, and where is the OSD expecting
> it
> to come from?
>

The OSD is waiting for a new OSD map, which it will get from one of its
peers or the monitor (by request). This tends to happen if the client sees
a newer version than the OSD does.
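
Roughly, the catch-up part looks like this.  It's just a sketch, not the
actual code path, and fetch_incremental() is an invented stand-in for the map
exchange with a mon or peer:

    def catch_up(local_epoch, needed_epoch, fetch_incremental):
        # Pull map increments until we reach needed_epoch (or run out).
        while local_epoch < needed_epoch:
            inc = fetch_incremental(local_epoch + 1)  # ask a mon/peer
            if inc is None:
                break         # nobody has it (yet); the request keeps waiting
            local_epoch += 1
        return local_epoch

    # OSD is at epoch 489; an incoming message claims epoch 490:
    published = {490: "incremental map 489 -> 490"}
    print(catch_up(489, 490, published.get))  # 490: the blocked op can proceed
    print(catch_up(489, 491, published.get))  # 490: still waiting for 491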


>
> "reached_pg"?
>

The request has been delivered into a queue for the PG to process, but it
hasn't been picked up and worked on yet. There's nothing about the request
that is blocking it here, but some other kind of back pressure is going on
— either the PG is working or waiting on another request that prevents it
from picking up new ones, or there's some kind of throttler preventing it
from picking up new work, or there's no CPU time available for some reason.
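
In toy form (again not the real code; PgQueue and max_in_flight are invented
stand-ins for the PG's op queue and its throttles), the op sits in the queue
in the "reached_pg" state until a worker actually dequeues it:

    import queue

    class PgQueue:
        def __init__(self, max_in_flight):
            self.q = queue.Queue()              # ops that have "reached_pg"
            self.in_flight = 0
            self.max_in_flight = max_in_flight  # stand-in for a throttler

        def submit(self, op):
            self.q.put(op)       # queued for the PG, but not yet worked on

        def try_work(self):
            if self.in_flight >= self.max_in_flight or self.q.empty():
                return None      # back pressure: nothing gets dequeued
            self.in_flight += 1
            return self.q.get()  # now the op is actually being processed

    pg = PgQueue(max_in_flight=1)
    pg.submit("some earlier op")
    print(pg.try_work())         # picked up -> in flight
    pg.submit("osd_op(mds0_inotable [read 0~0])")
    print(pg.try_work())         # None: the new op stays in "reached_pg"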

With what you've shown here, it looks like either your cluster is
dramatically overloaded, or else something is going on with the mds tables
that is killing the OSD whenever it tries to access them. I think there
were some past issues with them if they grew too large in older releases?
-Greg


>
> You see two OSDs: osd.0 and osd.1.  They're basically set up as a mirrored
> pair.
>
> Thanks.


[ceph-users] Interpreting reason for blocked request

2018-05-12 Thread Bryan Henderson
I recently had some requests blocked indefinitely; I eventually cleared it
up by recycling the OSDs, but I'd like some help interpreting the log messages
that supposedly give a clue as to what caused the blockage:

(I reformatted for easy email reading)

2018-05-03 01:56:35.248623 osd.0 192.168.1.16:6800/348 53 :
  cluster [WRN] 7 slow requests, 2 included below;
  oldest blocked for > 961.596517 secs
  
2018-05-03 01:56:35.249122 osd.0 192.168.1.16:6800/348 54 :
  cluster [WRN] slow request 961.557151 seconds old,
  received at 2018-05-03 01:40:33.689191:
pg_query(4.f epoch 490) currently wait for new map

2018-05-03 01:56:35.249543 osd.0 192.168.1.16:6800/348 55 :
  cluster [WRN] slow request 961.556655 seconds old,
  received at 2018-05-03 01:40:33.689686:
pg_query(1.d epoch 490) currently wait for new map

2018-05-03 01:56:31.918589 osd.1 192.168.1.23:6800/345 80 :
  cluster [WRN] 2 slow requests, 2 included below;
  oldest blocked for > 960.677480 secs

2018-05-03 01:56:31.920076 osd.1 192.168.1.23:6800/345 81 :
  cluster [WRN] slow request 960.677480 seconds old,
  received at 2018-05-03 01:40:31.238642:
osd_op(mds.0.57:1 mds0_inotable [read 0~0] 2.b852b893
  RETRY=2 ack+retry+read+known_if_redirected e490) currently reached_pg

2018-05-03 01:56:31.921526 osd.1 192.168.1.23:6800/345 82 :
  cluster [WRN] slow request 960.663817 seconds old,
  received at 2018-05-03 01:40:31.252305:
osd_op(mds.0.57:3 mds_snaptable [read 0~0] 2.d90270ad
  RETRY=2 ack+retry+read+known_if_redirected e490) currently reached_pg

"wait for new map": what map would that be, and where is the OSD expecting it
to come from?

"reached_pg"?

You see two OSDs: osd.0 and osd.1.  They're basically set up as a mirrored
pair.

Thanks.