Re: dump_historic_ops, slow requests

2015-10-13 Thread Gregory Farnum
On Mon, Oct 12, 2015 at 2:22 PM, Deneau, Tom <tom.den...@amd.com> wrote:
> I have a small ceph cluster (3 nodes, 5 osds each, journals all just 
> partitions
> on the spinner disks) and I have noticed that when I hit it with a bunch of
> rados bench clients all doing writes of large (40M) objects with --no-cleanup,
> the rados bench commands seem to finish OK but I often get health warnings 
> like
> HEALTH_WARN 4 requests are blocked > 32 sec; 2 osds have slow requests
>     3 ops are blocked > 32.768 sec on osd.9
>     1 ops are blocked > 32.768 sec on osd.10
>     2 osds have slow requests
> After a couple of minutes, health goes to HEALTH_OK.
>
> But if I go to the node containing osd.10, for example, and do dump_historic_ops,
> I see lots of ops with durations around 20 sec but nothing over 32 sec.
>
> The 20-sec or so ops are always  "ack+ondisk+write+known_if_redirected"
> with type_data = "commit sent: apply or cleanup"
> and the following are typical event timings
>
>initiated: 14:06:58.205937
>   reached_pg: 14:07:01.823288, gap=  3617.351
>  started: 14:07:01.823359, gap= 0.071
>waiting for subops from 3: 14:07:01.855259, gap=31.900
>  commit_queued_for_journal_write: 14:07:03.132697, gap=  1277.438
>   write_thread_in_journal_buffer: 14:07:03.143356, gap=10.659
>  journaled_completion_queued: 14:07:04.175863, gap=  1032.507
>op_commit: 14:07:04.585040, gap=   409.177
>   op_applied: 14:07:04.589751, gap= 4.711
> sub_op_commit_rec from 3: 14:07:14.682925, gap= 10093.174
>  commit_sent: 14:07:14.683081, gap= 0.156
> done: 14:07:14.683119, gap= 0.038
>
> Should I expect to see a historic op with duration greater than 32 sec?

IIRC, the historic ops generally consist of up to the N longest
operations in the last Y time period, on a rolling basis. Something
like 20 ops from the last 6 minutes. So if you get a report of a
32-second blocked op and immediately go to the OSD, it should have
that op in either in-progress or historic ops.
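
(For reference, those two limits appear to correspond to the
osd_op_history_size and osd_op_history_duration settings, whose stock
defaults are 20 ops and 600 seconds. A sketch of how to inspect or raise
them on the node hosting the OSD, assuming the default admin socket setup,
would be:

    ceph daemon osd.10 config get osd_op_history_size
    ceph daemon osd.10 config get osd_op_history_duration
    ceph daemon osd.10 config set osd_op_history_size 100

Raising these only changes how much history the op tracker retains; it does
not affect the ops themselves.)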

One possibility is that if you've got peering happening, the OSD has ops
blocked on that, but they get thrown out (rather than recorded in the
history) because the OSD turns out not to be responsible for them.
-Greg


dump_historic_ops, slow requests

2015-10-12 Thread Deneau, Tom
I have a small ceph cluster (3 nodes, 5 osds each, journals all just partitions
on the spinner disks) and I have noticed that when I hit it with a bunch of
rados bench clients all doing writes of large (40M) objects with --no-cleanup,
the rados bench commands seem to finish OK but I often get health warnings like
HEALTH_WARN 4 requests are blocked > 32 sec; 2 osds have slow requests
    3 ops are blocked > 32.768 sec on osd.9
    1 ops are blocked > 32.768 sec on osd.10
    2 osds have slow requests
After a couple of minutes, health goes to HEALTH_OK.

But if I go to the node containing osd.10, for example, and do dump_historic_ops,
I see lots of ops with durations around 20 sec but nothing over 32 sec.
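
(For reference, a minimal way to pull those dumps on the node that hosts
osd.10, assuming the default admin socket setup, is:

    ceph daemon osd.10 dump_ops_in_flight
    ceph daemon osd.10 dump_historic_ops

dump_ops_in_flight shows requests still being processed, while
dump_historic_ops returns recently completed ones with event timings like
those below.)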

The 20-sec or so ops are always  "ack+ondisk+write+known_if_redirected"
with type_data = "commit sent: apply or cleanup"
and the following are typical event timings

   initiated: 14:06:58.205937
  reached_pg: 14:07:01.823288, gap=  3617.351
 started: 14:07:01.823359, gap= 0.071
   waiting for subops from 3: 14:07:01.855259, gap=31.900
 commit_queued_for_journal_write: 14:07:03.132697, gap=  1277.438
  write_thread_in_journal_buffer: 14:07:03.143356, gap=10.659
 journaled_completion_queued: 14:07:04.175863, gap=  1032.507
   op_commit: 14:07:04.585040, gap=   409.177
  op_applied: 14:07:04.589751, gap= 4.711
sub_op_commit_rec from 3: 14:07:14.682925, gap= 10093.174
 commit_sent: 14:07:14.683081, gap= 0.156
done: 14:07:14.683119, gap= 0.038

Should I expect to see a historic op with duration greater than 32 sec?

-- Tom Deneau



Re: slow requests, hunting for new mon

2013-02-14 Thread Chris Dunlop
On 2013-02-12, Chris Dunlop ch...@onthe.net.au wrote:
 Hi,

 What are likely causes for slow requests and monclient: hunting for new
 mon messages? E.g.:

 2013-02-12 16:27:07.318943 7f9c0bc16700  0 monclient: hunting for new mon
 ...
 2013-02-12 16:27:45.892314 7f9c13c26700  0 log [WRN] : 6 slow requests, 6 
 included below; oldest blocked for  30.383883 secs
 2013-02-12 16:27:45.892323 7f9c13c26700  0 log [WRN] : slow request 30.383883 
 seconds old, received at 2013-02-12 16:27:15.508374: 
 osd_op(client.9821.0:122242 rb.0.209f.74b0dc51.0120 [write 
 921600~4096] 2.981cf6bc) v4 currently no flag points reached
 2013-02-12 16:27:45.892328 7f9c13c26700  0 log [WRN] : slow request 30.383782 
 seconds old, received at 2013-02-12 16:27:15.508475: 
 osd_op(client.9821.0:122243 rb.0.209f.74b0dc51.0120 [write 
 987136~4096] 2.981cf6bc) v4 currently no flag points reached
 2013-02-12 16:27:45.892334 7f9c13c26700  0 log [WRN] : slow request 30.383720 
 seconds old, received at 2013-02-12 16:27:15.508537: 
 osd_op(client.9821.0:122244 rb.0.209f.74b0dc51.0120 [write 
 1036288~8192] 2.981cf6bc) v4 currently no flag points reached
 2013-02-12 16:27:45.892338 7f9c13c26700  0 log [WRN] : slow request 30.383684 
 seconds old, received at 2013-02-12 16:27:15.508573: 
 osd_op(client.9821.0:122245 rb.0.209f.74b0dc51.0122 [write 
 1454080~4096] 2.fff29a9a) v4 currently no flag points reached
 2013-02-12 16:27:45.892341 7f9c13c26700  0 log [WRN] : slow request 30.328986 
 seconds old, received at 2013-02-12 16:27:15.563271: 
 osd_op(client.9821.0:122246 rb.0.209f.74b0dc51.0122 [write 
 1482752~4096] 2.fff29a9a) v4 currently no flag points reached

OK, for the sake of anyone who might come across this thread when
searching for similar issues...

http://ceph.com/docs/master/rados/operations/troubleshooting-osd/#slow-or-unresponsive-osd

...unfortunately the error message in the link above says "old request"
rather than "slow request" (old code?), so that page doesn't come up
when googling for the "slow request" message. The page needs
updating.

The underlying problem in our case seems to have been spikes in the
number of IOPS going to the disks (e.g. watch 'iostat -x' output). Whilst
the disks were coping with steady state load, occasionally something (in
this case, activity in a vm running on rbd) would cause a spike in
activity and the disks couldn't cope. I'd initially looked at the amount
of data going to the disks and thought it was well within the disks'
capabilities; however, I hadn't considered the IOPS.
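
(A minimal way to watch that, with the sampling interval being the only
assumption here, is:

    iostat -x 5

where the r/s and w/s columns are the read and write IOPS each disk is
actually servicing, and %util shows how close the device is to saturation.)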

The (partial?) solution was to move the journals onto a separate device,
halving the IOPS going to the data disk (write journal, write data) as
well as avoiding having the heads slamming back and forth between the
data and journal. We're continuing to watch the IOPS and will add more
OSDs to spread the load further if necessary.
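
(For anyone wanting to do the same, the relevant setting is the per-OSD
journal path. A sketch against the ceph.conf quoted later in this thread,
with a hypothetical SSD partition standing in for the real device, might
look like:

    [osd.0]
    host = b4
    # hypothetical separate journal partition on another device
    osd journal = /dev/sdg1

The old journal needs to be flushed and a new one created, e.g. with
ceph-osd -i 0 --flush-journal and then --mkjournal, before restarting the
OSD.)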

I still don't know what the "hunting" messages actually indicate, but
they've also disappeared since fixing the slow request messages.

Incidentally, it strikes me that there is a significant amount of write
amplification going on when running vms with a file system such as xfs
or ext4 (with journal) on top of rbd/rados (with journal) on top of xfs
(with journal). I.e. a single write from a vm can turn into up to 8
separate writes by the time it hits the underlying xfs filesystem. I
think this is why our ceph setup is struggling at far less load on the
same hardware compared to the drbd setup we're wanting to replace.
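
(Counting it the same way: the guest filesystem journal can double each
write, the ceph OSD journal doubles it again, and the journal of the
backing xfs filesystem can double it once more, so in the worst case
2 x 2 x 2 = 8 physical writes per logical guest write, before replication
is even taken into account.)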

Cheers!

Chris.



Re: slow requests, hunting for new mon

2013-02-14 Thread Sage Weil
On Fri, 15 Feb 2013, Chris Dunlop wrote:
 On 2013-02-12, Chris Dunlop ch...@onthe.net.au wrote:
  Hi,
 
  What are likely causes for slow requests and monclient: hunting for new
  mon messages? E.g.:
 
  2013-02-12 16:27:07.318943 7f9c0bc16700  0 monclient: hunting for new mon
  ...
  2013-02-12 16:27:45.892314 7f9c13c26700  0 log [WRN] : 6 slow requests, 6 
  included below; oldest blocked for  30.383883 secs
  2013-02-12 16:27:45.892323 7f9c13c26700  0 log [WRN] : slow request 
  30.383883 seconds old, received at 2013-02-12 16:27:15.508374: 
  osd_op(client.9821.0:122242 rb.0.209f.74b0dc51.0120 [write 
  921600~4096] 2.981cf6bc) v4 currently no flag points reached
  2013-02-12 16:27:45.892328 7f9c13c26700  0 log [WRN] : slow request 
  30.383782 seconds old, received at 2013-02-12 16:27:15.508475: 
  osd_op(client.9821.0:122243 rb.0.209f.74b0dc51.0120 [write 
  987136~4096] 2.981cf6bc) v4 currently no flag points reached
  2013-02-12 16:27:45.892334 7f9c13c26700  0 log [WRN] : slow request 
  30.383720 seconds old, received at 2013-02-12 16:27:15.508537: 
  osd_op(client.9821.0:122244 rb.0.209f.74b0dc51.0120 [write 
  1036288~8192] 2.981cf6bc) v4 currently no flag points reached
  2013-02-12 16:27:45.892338 7f9c13c26700  0 log [WRN] : slow request 
  30.383684 seconds old, received at 2013-02-12 16:27:15.508573: 
  osd_op(client.9821.0:122245 rb.0.209f.74b0dc51.0122 [write 
  1454080~4096] 2.fff29a9a) v4 currently no flag points reached
  2013-02-12 16:27:45.892341 7f9c13c26700  0 log [WRN] : slow request 
  30.328986 seconds old, received at 2013-02-12 16:27:15.563271: 
  osd_op(client.9821.0:122246 rb.0.209f.74b0dc51.0122 [write 
  1482752~4096] 2.fff29a9a) v4 currently no flag points reached
 
 OK, for the sake of anyone who might come across this thread when
 searching for similar issues...
 
 http://ceph.com/docs/master/rados/operations/troubleshooting-osd/#slow-or-unresponsive-osd
 
 ...unfortunately the error message in the link above says old request
 rather than slow request (old code?), so that page doesn't come up
 when googling for the slow request message. The page needs
 updating.

Updated, thanks!

 The underlying problem in our case seems to have been spikes in the
 number of IOPS going to the disks (e.g. watch 'iostat -x' output). Whilst
 the disks were coping with steady state load, occasionally something (in
 this case, activity in a vm running on rbd) would cause a spike in
 activity and the disks couldn't cope. I'd initially looked at the amount
 of data going to the disks and thought it was well with the disks'
 capabilities, however I hadn't considered the IOPS.
 
 The (partial?) solution was to move the journals onto a separate device,
 halving the IOPS going to the data disk (write journal, write data) as
 well as avoiding having the heads slamming back and forth between the
 data and journal. We're continuing to watch the IOPS and will add more
 OSDs to spread the load further if necessary.
 
 I still don't know what the hunting messages actually indicate, but
 they've also disappeared since fixing the slow request messages.

This usually means the monitor we were talking to stopped responding and we
(the OSD or client) are trying to reconnect (to a random monitor).
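
(A quick way to sanity-check the monitors when these messages show up, with
the caveat that exact command availability depends on the version in use,
is:

    ceph -s
    ceph quorum_status

The first shows overall health including the monitor map; the second shows
which monitors are currently in quorum.)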

 Incidentally, it strikes me that there is a significant amount of write
 amplification going on when running vms with a file system such as xfs
 or ext4 (with journal) on top of rbd/rados (with journal) on top of xfs
 (with journal). I.e. a single write from a vm can turn into up to 8
 separate writes by the time it hits the underlying xfs filesystem. I
 think this is why our ceph setup is struggling at far less load on the
 same hardware compared to the drbd setup we're wanting to replace.

Currently, yes.  There is always going to be some additional overhead
because the object data is stored in a file system.  We were/are doing
several other non-optimal things too; however, that is being improved in
the current master branch (moving some metadata into leveldb, which does a
better job of managing the IO pattern).  Stay tuned!

sage



Re: slow requests, hunting for new mon

2013-02-12 Thread Chris Dunlop
On Tue, Feb 12, 2013 at 06:28:15PM +1100, Chris Dunlop wrote:
 Hi,
 
 What are likely causes for slow requests and monclient: hunting for new
 mon messages? E.g.:
 
 2013-02-12 16:27:07.318943 7f9c0bc16700  0 monclient: hunting for new mon
 ...
 2013-02-12 16:27:45.892314 7f9c13c26700  0 log [WRN] : 6 slow requests, 6 
 included below; oldest blocked for  30.383883 secs
 2013-02-12 16:27:45.892323 7f9c13c26700  0 log [WRN] : slow request 30.383883 
 seconds old, received at 2013-02-12 16:27:15.508374: 
 osd_op(client.9821.0:122242 rb.0.209f.74b0dc51.0120 [write 
 921600~4096] 2.981cf6bc) v4 currently no flag points reached
 2013-02-12 16:27:45.892328 7f9c13c26700  0 log [WRN] : slow request 30.383782 
 seconds old, received at 2013-02-12 16:27:15.508475: 
 osd_op(client.9821.0:122243 rb.0.209f.74b0dc51.0120 [write 
 987136~4096] 2.981cf6bc) v4 currently no flag points reached
 2013-02-12 16:27:45.892334 7f9c13c26700  0 log [WRN] : slow request 30.383720 
 seconds old, received at 2013-02-12 16:27:15.508537: 
 osd_op(client.9821.0:122244 rb.0.209f.74b0dc51.0120 [write 
 1036288~8192] 2.981cf6bc) v4 currently no flag points reached
 2013-02-12 16:27:45.892338 7f9c13c26700  0 log [WRN] : slow request 30.383684 
 seconds old, received at 2013-02-12 16:27:15.508573: 
 osd_op(client.9821.0:122245 rb.0.209f.74b0dc51.0122 [write 
 1454080~4096] 2.fff29a9a) v4 currently no flag points reached
 2013-02-12 16:27:45.892341 7f9c13c26700  0 log [WRN] : slow request 30.328986 
 seconds old, received at 2013-02-12 16:27:15.563271: 
 osd_op(client.9821.0:122246 rb.0.209f.74b0dc51.0122 [write 
 1482752~4096] 2.fff29a9a) v4 currently no flag points reached

Sorry, I forgot to add what led me to look into this in the first place...

The b2 machine is running a number of libvirt / kvm virtuals on rbd via
librados:

<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <auth username='xx'>
    <secret type='ceph' uuid='----'/>
  </auth>
  <source protocol='rbd' name='rbd/foo:rbd_cache=1'/>
  <target dev='vdb' bus='virtio'/>
  <alias name='virtio-disk1'/>
  <address type='pci' domain='0x' bus='0x00' slot='0x06' function='0x0'/>
</disk>

Around the time the "slow requests" messages pop up, the virtual machines
experience a spike in disk latency, e.g. from inside one of the virtuals:

                      load %user %nice  %sys  %iow  %stl %idle  dev rrqm/s wrqm/s    r/s    w/s  rkB/s   wkB/s arq-sz aqu-sz    await  rwait    wwait %util
2013-02-12-18:55:23    1.7   0.3   0.0   0.9  42.2   0.0  18.5  vdb    0.0   25.8    0.0    9.6   0.00  136.53  28.44   1.34   139.56   0.00    139.6  70.7
2013-02-12-18:55:38    1.5   0.9   0.0   0.4  35.8   0.0  42.8  vdb    0.0   26.6    0.0   10.8   0.00  143.20  26.52   0.46    41.98   0.00     42.0  40.0
2013-02-12-18:55:53    1.4   0.7   0.0   0.4  38.0   0.0  51.7  vdb    0.0    7.5    0.0    6.8   0.00   53.33  15.69   0.46    68.08   0.00     68.1  38.8
2013-02-12-18:56:10    2.2   0.0   0.0   0.1   8.0   0.0  15.8  vdb    0.0    2.1    0.0    0.2   0.00    9.18  78.00   0.98  4164.00   0.00   4164.0  70.6
2013-02-12-18:56:32    3.7   0.0   0.0   0.1   0.0   0.0   0.0  vdb    0.0    0.1    0.0    0.8   0.00    3.27   8.00  24.12 14519.78   0.00  14519.8 100.0
2013-02-12-18:56:47    5.1   0.5   0.0   0.6   0.0   0.0   0.0  vdb    0.0   11.5    0.0    5.4   0.00   65.87  24.40   1.60  3620.15   0.00   3620.1  88.4
2013-02-12-18:57:03    5.2   0.6   0.0   1.1   3.5   0.0   0.0  vdb    0.0   19.6    0.0    6.9   0.00  101.25  29.19   1.13   162.41   0.00    162.4  87.4
2013-02-12-18:57:20    5.2   0.6   0.0   0.9   9.7   0.0   0.0  vdb    0.0   44.6    0.0    9.7   0.00  214.82  44.27   2.41   248.22   0.00    248.2  85.3
2013-02-12-18:57:36    4.4   0.4   0.0   0.5  28.7   0.0  46.5  vdb    0.0   17.6    0.0    5.2   0.00   87.50  33.33   0.56   107.14   0.00    107.1  48.5

...corresponding to this in the b5 / osd.1 log:

2013-02-12 18:52:08.812880 7f9c0bc16700  0 monclient: hunting for new mon
2013-02-12 18:55:18.851791 7f9c0bc16700  0 monclient: hunting for new mon
2013-02-12 18:56:25.414948 7f9c13c26700  0 log [WRN] : 6 slow requests, 6 
included below; oldest blocked for > 30.372124 secs
2013-02-12 18:56:25.414958 7f9c13c26700  0 log [WRN] : slow request 30.372124 
seconds old, received at 2013-02-12 18:55:55.042767: 
osd_op(client.9821.0:144779 rb.0.209f.74b0dc51.0023 [write 
1593344~4096] 2.1882ddb7) v4 currently no flag points reached
...
2013-02-12 18:57:13.427008 7f9c13c26700  0 log [WRN] : slow request 40.721769 
seconds old, received at 2013-02-12 18:56:32.705190: 
osd_op(client.9821.0:146756 rb.0.209f.74b0dc51.0128 [write 819200~8192] 
2.b4390173) v4 currently commit sent
2013-02-12 18:59:43.886517 7f9c0bc16700  0 monclient: hunting for new mon
2013-02-12 19:02:53.911641 7f9c0bc16700  0 monclient: hunting for new mon


Cheers,

Chris.

slow requests, hunting for new mon

2013-02-11 Thread Chris Dunlop
Hi,

What are likely causes for "slow requests" and "monclient: hunting for new
mon" messages? E.g.:

2013-02-12 16:27:07.318943 7f9c0bc16700  0 monclient: hunting for new mon
...
2013-02-12 16:27:45.892314 7f9c13c26700  0 log [WRN] : 6 slow requests, 6 
included below; oldest blocked for > 30.383883 secs
2013-02-12 16:27:45.892323 7f9c13c26700  0 log [WRN] : slow request 30.383883 
seconds old, received at 2013-02-12 16:27:15.508374: 
osd_op(client.9821.0:122242 rb.0.209f.74b0dc51.0120 [write 921600~4096] 
2.981cf6bc) v4 currently no flag points reached
2013-02-12 16:27:45.892328 7f9c13c26700  0 log [WRN] : slow request 30.383782 
seconds old, received at 2013-02-12 16:27:15.508475: 
osd_op(client.9821.0:122243 rb.0.209f.74b0dc51.0120 [write 987136~4096] 
2.981cf6bc) v4 currently no flag points reached
2013-02-12 16:27:45.892334 7f9c13c26700  0 log [WRN] : slow request 30.383720 
seconds old, received at 2013-02-12 16:27:15.508537: 
osd_op(client.9821.0:122244 rb.0.209f.74b0dc51.0120 [write 
1036288~8192] 2.981cf6bc) v4 currently no flag points reached
2013-02-12 16:27:45.892338 7f9c13c26700  0 log [WRN] : slow request 30.383684 
seconds old, received at 2013-02-12 16:27:15.508573: 
osd_op(client.9821.0:122245 rb.0.209f.74b0dc51.0122 [write 
1454080~4096] 2.fff29a9a) v4 currently no flag points reached
2013-02-12 16:27:45.892341 7f9c13c26700  0 log [WRN] : slow request 30.328986 
seconds old, received at 2013-02-12 16:27:15.563271: 
osd_op(client.9821.0:122246 rb.0.209f.74b0dc51.0122 [write 
1482752~4096] 2.fff29a9a) v4 currently no flag points reached

I have a ceph 0.56.2 system with 3 boxes: two boxes have both mon and a
single osd, and the 3rd box has just a mon - see ceph.conf below. The boxes
are running an eclectic mix of self-compiled kernels: b2 is linux-3.4.6, b4
is linux-3.7.3 and b5 is linux-3.6.10.

On b5 / osd.1 the 'hunting' message appears in the osd log regularly, e.g.
190 times yesterday. The message doesn't appear at all on b4 / osd.0.

Both osd logs show the 'slow requests' messages. Generally these come in
waves, with 30-50 of the associated individual 'slow request' messages
coming in just after the initial 'slow requests' message. Each box saw
around 30 waves yesterday, with no obvious time correlation between the two.

The osd disks are generally cruising along at around 400-800 KB/s, with
occasional spikes up to 1.2-2 MB/s, with a mostly write load.

The gigabit network interfaces (2 per box for public vs cluster) are
also cruising, with the busiest interface at less than 5% utilisation.

CPU utilisation is likewise small, with 90% or more idle and less than 3%
wait for io. There's plenty of free memory, 19 GB on b4 and 6 GB on b5.

Cheers,

Chris


ceph.conf

[global]
auth supported = cephx
[mon]
[mon.b2]
host = b2
mon addr = 10.200.63.130:6789
[mon.b4]
host = b4
mon addr = 10.200.63.132:6789
[mon.b5]
host = b5
mon addr = 10.200.63.133:6789
[osd]
osd journal size = 1000
public network = 10.200.63.0/24
cluster network = 192.168.254.0/24
[osd.0]
host = b4
public addr = 10.200.63.132
cluster addr = 192.168.254.132
[osd.1]
host = b5
public addr = 10.200.63.133
cluster addr = 192.168.254.133



RE: Slow requests

2013-01-15 Thread Chen, Xiaoxi
Hi,
I have also seen the same warning even when I use v0.56.1 (both kernel 
rbd and OSD side) when write stress is high enough(Say I have 3 osds but having 
4~5 clients doing dd on top of the rbd).
2013-01-15 15:54:05.990052 7ff97dd0c700  0 log [WRN] : slow request 32.545624 
seconds old, received at 2013-01-15 15:53:33.444348: osd_op(client.4125.1:51574 
rb.0.100f.6b8b4567.0827 [write 3670016~524288] 2.1ae8252c) currently 
waiting for sub ops
2013-01-15 15:54:05.990055 7ff97dd0c700  0 log [WRN] : slow request 32.533197 
seconds old, received at 2013-01-15 15:53:33.456775: osd_op(client.4125.1:51599 
rb.0.100e.6b8b4567.07a7 [write 3522560~524288] 2.54bcf797) currently 
waiting for sub ops
2013-01-15 15:54:05.990058 7ff97dd0c700  0 log [WRN] : slow request 32.519738 
seconds old, received at 2013-01-15 15:53:33.470234: osd_op(client.4125.1:51600 
rb.0.100e.6b8b4567.07a7 [write 4046848~147456] 2.54bcf797) currently 
waiting for sub ops
2013-01-15 15:54:06.990355 7ff97dd0c700  0 log [WRN] : 2 slow requests, 2 
included below; oldest blocked for > 33.225773 secs
2013-01-15 15:54:06.990366 7ff97dd0c700  0 log [WRN] : slow request 33.225773 
seconds old, received at 2013-01-15 15:53:33.764540: osd_op(client.4125.1:51665 
rb.0.100f.6b8b4567.082e [write 3145728~524288] 2.3c00a0fb) currently 
waiting for sub ops
2013-01-15 15:54:06.990371 7ff97dd0c700  0 log [WRN] : slow request 33.221343 
seconds old, received at 2013-01-15 15:53:33.768970: osd_op(client.4125.1:51666 
rb.0.100f.6b8b4567.082e [write 3670016~524288] 2.3c00a0fb) currently 
waiting for sub ops
2013-01-15 15:54:18.993759 7ff97dd0c700  0 log [WRN] : 6 slow requests, 6 
included below; oldest blocked for > 30.313194 secs
2013-01-15 15:54:18.993769 7ff97dd0c700  0 log [WRN] : slow request 30.313194 
seconds old, received at 2013-01-15 15:53:48.680488: osd_op(client.4125.1:52333 
rb.0.100d.6b8b4567.0851 [write 1048576~524288] 2.43b76099) currently 
waiting for sub ops
2013-01-15 15:54:18.993774 7ff97dd0c700  0 log [WRN] : slow request 30.308800 
seconds old, received at 2013-01-15 15:53:48.684882: osd_op(client.4125.1:52334 
rb.0.100d.6b8b4567.0851 [write 1572864~524288] 2.43b76099) currently 
waiting for sub ops
2013-01-15 15:54:18.993778 7ff97dd0c700  0 log [WRN] : slow request 30.304389 
seconds old, received at 2013-01-15 15:53:48.689293: osd_op(client.4125.1:52335 
rb.0.100d.6b8b4567.0851 [write 2097152~524288] 2.43b76099) currently 
waiting for sub ops
2013-01-15 15:54:18.993782 7ff97dd0c700  0 log [WRN] : slow request 30.299783 
seconds old, received at 2013-01-15 15:53:48.693899: osd_op(client.4125.1:52336 
rb.0.100d.6b8b4567.0851 [write 2621440~524288] 2.43b76099) currently 
waiting for sub ops
2013-01-15 15:54:18.993787 7ff97dd0c700  0 log [WRN] : slow request 30.295340 
seconds old, received at 2013-01-15 15:53:48.698342: osd_op(client.4125.1:52337 
rb.0.100d.6b8b4567.0851 [write 3145728~524288] 2.43b76099) currently 
waiting for sub ops 
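
(For context, the 524288-byte writes in the log above are consistent with
the dd-based client load described; a rough sketch of that kind of stress
run, with a hypothetical mapped rbd device path, would be:

    dd if=/dev/zero of=/dev/rbd0 bs=512k count=8192 oflag=direct

run concurrently from several clients.)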



Xiaoxi
-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: December 17, 2012 6:55
To: Holcombe, Christopher
Cc: Jens Kristian Søgaard; ceph-devel@vger.kernel.org
Subject: RE: Slow requests

v0.52 is also quite old.  This was resolved in v0.55, I believe.  Where are you 
looking for the RPMs?  v0.55 should be there...

sage


On Sun, 16 Dec 2012, Holcombe, Christopher wrote:

 I heard the solution for this was to restart the osd's.  That fixed it for me.
 
 -Chris
 
 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org 
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Jens Kristian 
 Søgaard
 Sent: Sunday, December 16, 2012 9:00 AM
 To: ceph-devel@vger.kernel.org
 Subject: Slow requests
 
 Hi,
 
 My log is filling up with warnings about a single slow request that has been 
 around for a very long time:
 
 osd.1 10.0.0.2:6800/900 162926 : [WRN] 1 slow requests, 1 included 
 below; oldest blocked for  84446.312051 secs
 
 osd.1 10.0.0.2:6800/900 162927 : [WRN] slow request 84446.312051 seconds old, 
 received at 2012-12-15 15:27:56.891437:
 osd_sub_op(client.4528.0:19602219 0.fe
 3807b5fe/rb.0.11b7.4a933baa.0008629e/head//0 [] v 53'185888 
 snapset=0=[]:[] snapc=0=[]) v7 currently started
 
 
 How can I identify the cause of this and how can I cancel this request?
 
 I'm running Ceph on Fedora 17 using the latest RPMs available from ceph.com 
 (0.52-6).
 
 
 Thanks in advance,
 --
 Jens Kristian Søgaard, Mermaid Consulting ApS, 
 j...@mermaidconsulting.dk, http://www.mermaidconsulting.com/

Re: Slow requests

2012-12-18 Thread Jens Kristian Søgaard

Hi Dino,


Seems that the RPM packager likes to keep the latest and greatest
versions in http://ceph.com/rpm-testing/ but this path isn't defined
in the ceph yum repository.


Thanks for the link!

Perhaps the documentation should be updated with this URL?

The release notes link to:

http://ceph.com/docs/master/install/rpm/

--
Jens Kristian Søgaard, Mermaid Consulting ApS,
j...@mermaidconsulting.dk,
http://www.mermaidconsulting.com/




Re: Slow requests

2012-12-17 Thread Dino Yancey
Hi Jens,

Seems that the RPM packager likes to keep the latest and greatest
versions in http://ceph.com/rpm-testing/ but this path isn't defined
in the ceph yum repository.

Dino
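
(If it helps anyone, a throwaway yum repo file pointing at that location
would look roughly like the following; the exact sub-path under
rpm-testing/ is a guess modelled on the stable tree mentioned below, so
check the server's directory listing first:

    [ceph-testing]
    name=Ceph testing packages
    baseurl=http://ceph.com/rpm-testing/fc17/x86_64/
    enabled=1
    gpgcheck=0
)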

On Sun, Dec 16, 2012 at 5:11 PM, Jens Kristian Søgaard
j...@mermaidconsulting.dk wrote:
 Hi Sage,


 v0.52 is also quite old.  This was resolved in v0.55, I believe.  Where
 are you looking for the RPMs?  v0.55 should be there...


 The homepage said to look in:

   http://ceph.com/rpms/fc17/x86_64/

 The newest there is 0.52-6.fc17.x86_64.

 I also looked at the EL6 folder, but the newest there is also 0.52-6.

 --
 Jens Kristian Søgaard, Mermaid Consulting ApS,

 j...@mermaidconsulting.dk,
 http://www.mermaidconsulting.com/



-- 
__
Dino Yancey
2GNT.com Admin


RE: Slow requests

2012-12-16 Thread Holcombe, Christopher
I heard the solution for this was to restart the osd's.  That fixed it for me.

-Chris
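
(For reference, with the sysvinit packaging of that era, restarting a
single OSD from its own node would look something like the following; the
OSD id here is just an example:

    service ceph restart osd.1
    # or: /etc/init.d/ceph restart osd.1
)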

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Jens Kristian Søgaard
Sent: Sunday, December 16, 2012 9:00 AM
To: ceph-devel@vger.kernel.org
Subject: Slow requests

Hi,

My log is filling up with warnings about a single slow request that has been 
around for a very long time:

osd.1 10.0.0.2:6800/900 162926 : [WRN] 1 slow requests, 1 included below; 
oldest blocked for > 84446.312051 secs

osd.1 10.0.0.2:6800/900 162927 : [WRN] slow request 84446.312051 seconds old, 
received at 2012-12-15 15:27:56.891437:
osd_sub_op(client.4528.0:19602219 0.fe
3807b5fe/rb.0.11b7.4a933baa.0008629e/head//0 [] v 53'185888 snapset=0=[]:[] 
snapc=0=[]) v7 currently started


How can I identify the cause of this and how can I cancel this request?

I'm running Ceph on Fedora 17 using the latest RPMs available from ceph.com 
(0.52-6).


Thanks in advance,
--
Jens Kristian Søgaard, Mermaid Consulting ApS,
j...@mermaidconsulting.dk,
http://www.mermaidconsulting.com/





Re: Slow requests

2012-12-16 Thread Andrey Korolyov
On Sun, Dec 16, 2012 at 5:59 PM, Jens Kristian Søgaard
j...@mermaidconsulting.dk wrote:
 Hi,

 My log is filling up with warnings about a single slow request that has been
 around for a very long time:

 osd.1 10.0.0.2:6800/900 162926 : [WRN] 1 slow requests, 1 included below;
 oldest blocked for  84446.312051 secs

 osd.1 10.0.0.2:6800/900 162927 : [WRN] slow request 84446.312051 seconds
 old, received at 2012-12-15 15:27:56.891437:
 osd_sub_op(client.4528.0:19602219 0.fe
 3807b5fe/rb.0.11b7.4a933baa.0008629e/head//0 [] v 53'185888
 snapset=0=[]:[] snapc=0=[]) v7 currently started


 How can I identify the cause of this and how can I cancel this request?

 I'm running Ceph on Fedora 17 using the latest RPMs available from ceph.com
 (0.52-6).


 Thanks in advance,

Hi Jens,

Please take a look at this thread:
http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10843

Seems that you'll need newer rpms to get rid of this.

 --
 Jens Kristian Søgaard, Mermaid Consulting ApS,
 j...@mermaidconsulting.dk,
 http://www.mermaidconsulting.com/


Re: Slow requests

2012-12-16 Thread Jens Kristian Søgaard

Hi Andrey,

Thanks for your reply!


Please take a look to this thread:
http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10843


I took your advice and restarted each of my three osd's individually. 
The first two restarted within a minute or two. The last one took 20 
minutes to restart (?)


Afterwards the slow request had disappeared, so it did seem to work!


Seems that you`ll need newer rpms to get rid of this.


Are newer RPMs available for download somewhere, or do I need to compile 
my own?


I have searched the ceph.com site several times in the past, but I only 
find older versions.


--
Jens Kristian Søgaard, Mermaid Consulting ApS,
j...@mermaidconsulting.dk,
http://www.mermaidconsulting.com/


Re: Slow requests

2012-12-16 Thread Andrey Korolyov
On Mon, Dec 17, 2012 at 2:42 AM, Jens Kristian Søgaard
j...@mermaidconsulting.dk wrote:
 Hi Andrey,

 Thanks for your reply!


 Please take a look to this thread:
 http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10843


 I took your advice and restarted each of my three osd's individually. The
 first two restarted within a minute or two. The last one took 20 minutes to
 restart (?)

 Afterwards the slow request had disappeared, so it did seem to work!


 Seems that you`ll need newer rpms to get rid of this.


 Are newer RPMs available for download somewhere, or do I need to compile my
 own?

 I have searched the ceph.com site several times in the past, but I only find
 older versions.



Oh, sorry, I may have misguided you - the solution is the patch from Sam.
Restarts only help in the short term, and you're not able to
check some pgs for consistency until the patch has been applied - they'll
hang on scrub every time.

 --
 Jens Kristian Søgaard, Mermaid Consulting ApS,
 j...@mermaidconsulting.dk,
 http://www.mermaidconsulting.com/


RE: Slow requests

2012-12-16 Thread Sage Weil
v0.52 is also quite old.  This was resolved in v0.55, I believe.  Where 
are you looking for the RPMs?  v0.55 should be there...

sage


On Sun, 16 Dec 2012, Holcombe, Christopher wrote:

 I heard the solution for this was to restart the osd's.  That fixed it for me.
 
 -Chris
 
 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org 
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Jens Kristian Søgaard
 Sent: Sunday, December 16, 2012 9:00 AM
 To: ceph-devel@vger.kernel.org
 Subject: Slow requests
 
 Hi,
 
 My log is filling up with warnings about a single slow request that has been 
 around for a very long time:
 
 osd.1 10.0.0.2:6800/900 162926 : [WRN] 1 slow requests, 1 included below; 
 oldest blocked for  84446.312051 secs
 
 osd.1 10.0.0.2:6800/900 162927 : [WRN] slow request 84446.312051 seconds old, 
 received at 2012-12-15 15:27:56.891437:
 osd_sub_op(client.4528.0:19602219 0.fe
 3807b5fe/rb.0.11b7.4a933baa.0008629e/head//0 [] v 53'185888 
 snapset=0=[]:[] snapc=0=[]) v7 currently started
 
 
 How can I identify the cause of this and how can I cancel this request?
 
 I'm running Ceph on Fedora 17 using the latest RPMs available from ceph.com 
 (0.52-6).
 
 
 Thanks in advance,
 --
 Jens Kristian Søgaard, Mermaid Consulting ApS,
 j...@mermaidconsulting.dk,
 http://www.mermaidconsulting.com/


Re: Slow requests

2012-12-16 Thread Jens Kristian Søgaard

Hi Andrey,


Oh, sorry, I maybe misguided you - solution is the patch from Sam,
restarts may help only on the short distance and you`re not able to
check some pgs for consistency until patch have been applied - they`ll
hang on scrub every time.


I'll need to update, yes!

I don't have any pgs in scrubbing though.

--
Jens Kristian Søgaard, Mermaid Consulting ApS,
j...@mermaidconsulting.dk,
http://www.mermaidconsulting.com/


Re: endless flying slow requests

2012-12-02 Thread Samuel Just
Looks good, pushed.
-Sam

On Sat, Dec 1, 2012 at 7:18 PM, Sage Weil s...@inktank.com wrote:
 I pushed another fix that plugs a leak when requests race with peering
 that fixed the 'slow request' for osd_sub_op messages for me.  It's in
 wip-osd-leak.

 Sam, when you get a chance, can you take a look and push it to next if it
 looks okay?

 Thanks!
 sage


 On Fri, 30 Nov 2012, Samuel Just wrote:

 I've pushed a fix to next, 49f32cee647c5bd09f36ba7c9fd4f481a697b9d7.
 Let me know if the problem persists with this patch.
 -Sam

 On Wed, Nov 28, 2012 at 2:04 PM, Andrey Korolyov and...@xdel.ru wrote:
  On Thu, Nov 29, 2012 at 1:12 AM, Samuel Just sam.j...@inktank.com wrote:
  Also, these clusters aren't mixed argonaut and next, are they?  (Not
  that that shouldn't work, but it would be a useful data point.)
  -Sam
 
  On Wed, Nov 28, 2012 at 1:11 PM, Samuel Just sam.j...@inktank.com wrote:
  Did you observe hung io along with that error?  Both sub_op_commit and
  sub_op_applied have happened, so the sub_op_reply should have been
  sent back to the primary.  This looks more like a leak.  If you also
  observed hung io, then it's possible that the problem is occurring
  between the sub_op_applied event and the response.
  -Sam
 
 
  It is relatively easy to check if one of client VMs has locked one or
  more cores to iowait or just hangs, so yes, these ops are related to
  real commit operations and they are hanged.
  I`m using all-new 0.54 cluster, without mixing of course. Does
  everyone who hit that bug readjusted cluster before bug shows
  itself(say, in a day-long distance)?
 
  On Tue, Nov 27, 2012 at 11:47 PM, Andrey Korolyov and...@xdel.ru wrote:
  On Wed, Nov 28, 2012 at 5:51 AM, Sage Weil s...@inktank.com wrote:
  Hi Stefan,
 
  On Thu, 15 Nov 2012, Sage Weil wrote:
  On Thu, 15 Nov 2012, Stefan Priebe - Profihost AG wrote:
   Am 14.11.2012 15:59, schrieb Sage Weil:
Hi Stefan,
   
I would be nice to confirm that no clients are waiting on replies 
for
these requests; currently we suspect that the OSD request 
tracking is the
buggy part.  If you query the OSD admin socket you should be able 
to dump
requests and see the client IP, and then query the client.
   
Is it librbd?  In that case you likely need to change the config 
so that
it is listening on an admin socket ('admin socket = path').
  
   Yes it is. So i have to specify admin socket at the KVM host?
 
  Right.  IIRC the disk line is a ; (or \;) separated list of key/value
  pairs.
 
   How do i query the admin socket for requests?
 
  ceph --admin-daemon /path/to/socket help
  ceph --admin-daemon /path/to/socket objecter_dump (i think)
 
  Were you able to reproduce this?
 
  Thanks!
  sage
 
  Meanwhile, I did. :)
  Such requests will always be created if you have restarted or marked
  an osd out and then back in and scrub didn`t happen in the meantime
  (after such operation and before request arrival).
  What is more interesting, the hangup happens not exactly at the time
  of operation, but tens of minutes later.
 
  { description: osd_sub_op(client.1292013.0:45422 4.731
  a384cf31\/rbd_data.1415fb1075f187.00a7\/head\/\/4 [] v
  16444'21693 snapset=0=[]:[] snapc=0=[]),
received_at: 2012-11-28 03:54:43.094151,
age: 27812.942680,
duration: 2.676641,
flag_point: started,
events: [
  { time: 2012-11-28 03:54:43.094222,
event: waiting_for_osdmap},
  { time: 2012-11-28 03:54:43.386890,
event: reached_pg},
  { time: 2012-11-28 03:54:43.386894,
event: started},
  { time: 2012-11-28 03:54:43.386973,
event: commit_queued_for_journal_write},
  { time: 2012-11-28 03:54:45.360049,
event: write_thread_in_journal_buffer},
  { time: 2012-11-28 03:54:45.586183,
event: journaled_completion_queued},
  { time: 2012-11-28 03:54:45.586262,
event: sub_op_commit},
  { time: 2012-11-28 03:54:45.770792,
event: sub_op_applied}]}]}
 
 
 
 
 
  sage
 
  
   Stefan
  
  
On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:
   
 Hello list,

 i see this several times. Endless flying slow requests. And 
 they never
 stop
 until i restart the mentioned osd.

 2012-11-14 10:11:57.513395 osd.24 [WRN] 1 slow requests, 1 
 included below;
 oldest blocked for  31789.858457 secs
 2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 
 31789.858457 seconds
 old,
 received at 2012-11-14 01:22:07.654922: 
 osd_op(client.30286.0:6719
 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
 3.3f6d2373) v4
 currently delayed
 2012-11-14 10:11:58.513584 osd.24 [WRN] 1 slow requests, 1

Re: endless flying slow requests

2012-12-01 Thread Sage Weil
I pushed another fix that plugs a leak when requests race with peering 
that fixed the 'slow request' for osd_sub_op messages for me.  It's in 
wip-osd-leak.

Sam, when you get a chance, can you take a look and push it to next if it 
looks okay?

Thanks!
sage


On Fri, 30 Nov 2012, Samuel Just wrote:

 I've pushed a fix to next, 49f32cee647c5bd09f36ba7c9fd4f481a697b9d7.
 Let me know if the problem persists with this patch.
 -Sam
 
 On Wed, Nov 28, 2012 at 2:04 PM, Andrey Korolyov and...@xdel.ru wrote:
  On Thu, Nov 29, 2012 at 1:12 AM, Samuel Just sam.j...@inktank.com wrote:
  Also, these clusters aren't mixed argonaut and next, are they?  (Not
  that that shouldn't work, but it would be a useful data point.)
  -Sam
 
  On Wed, Nov 28, 2012 at 1:11 PM, Samuel Just sam.j...@inktank.com wrote:
  Did you observe hung io along with that error?  Both sub_op_commit and
  sub_op_applied have happened, so the sub_op_reply should have been
  sent back to the primary.  This looks more like a leak.  If you also
  observed hung io, then it's possible that the problem is occurring
  between the sub_op_applied event and the response.
  -Sam
 
 
  It is relatively easy to check if one of client VMs has locked one or
  more cores to iowait or just hangs, so yes, these ops are related to
  real commit operations and they are hanged.
  I`m using all-new 0.54 cluster, without mixing of course. Does
  everyone who hit that bug readjusted cluster before bug shows
  itself(say, in a day-long distance)?
 
  On Tue, Nov 27, 2012 at 11:47 PM, Andrey Korolyov and...@xdel.ru wrote:
  On Wed, Nov 28, 2012 at 5:51 AM, Sage Weil s...@inktank.com wrote:
  Hi Stefan,
 
  On Thu, 15 Nov 2012, Sage Weil wrote:
  On Thu, 15 Nov 2012, Stefan Priebe - Profihost AG wrote:
   Am 14.11.2012 15:59, schrieb Sage Weil:
Hi Stefan,
   
I would be nice to confirm that no clients are waiting on replies 
for
these requests; currently we suspect that the OSD request tracking 
is the
buggy part.  If you query the OSD admin socket you should be able 
to dump
requests and see the client IP, and then query the client.
   
Is it librbd?  In that case you likely need to change the config 
so that
it is listening on an admin socket ('admin socket = path').
  
   Yes it is. So i have to specify admin socket at the KVM host?
 
  Right.  IIRC the disk line is a ; (or \;) separated list of key/value
  pairs.
 
   How do i query the admin socket for requests?
 
  ceph --admin-daemon /path/to/socket help
  ceph --admin-daemon /path/to/socket objecter_dump (i think)
 
  Were you able to reproduce this?
 
  Thanks!
  sage
 
  Meanwhile, I did. :)
  Such requests will always be created if you have restarted or marked
  an osd out and then back in and scrub didn`t happen in the meantime
  (after such operation and before request arrival).
  What is more interesting, the hangup happens not exactly at the time
  of operation, but tens of minutes later.
 
  { description: osd_sub_op(client.1292013.0:45422 4.731
  a384cf31\/rbd_data.1415fb1075f187.00a7\/head\/\/4 [] v
  16444'21693 snapset=0=[]:[] snapc=0=[]),
received_at: 2012-11-28 03:54:43.094151,
age: 27812.942680,
duration: 2.676641,
flag_point: started,
events: [
  { time: 2012-11-28 03:54:43.094222,
event: waiting_for_osdmap},
  { time: 2012-11-28 03:54:43.386890,
event: reached_pg},
  { time: 2012-11-28 03:54:43.386894,
event: started},
  { time: 2012-11-28 03:54:43.386973,
event: commit_queued_for_journal_write},
  { time: 2012-11-28 03:54:45.360049,
event: write_thread_in_journal_buffer},
  { time: 2012-11-28 03:54:45.586183,
event: journaled_completion_queued},
  { time: 2012-11-28 03:54:45.586262,
event: sub_op_commit},
  { time: 2012-11-28 03:54:45.770792,
event: sub_op_applied}]}]}
 
 
 
 
 
  sage
 
  
   Stefan
  
  
On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:
   
 Hello list,

 i see this several times. Endless flying slow requests. And they 
 never
 stop
 until i restart the mentioned osd.

 2012-11-14 10:11:57.513395 osd.24 [WRN] 1 slow requests, 1 
 included below;
 oldest blocked for  31789.858457 secs
 2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 
 31789.858457 seconds
 old,
 received at 2012-11-14 01:22:07.654922: 
 osd_op(client.30286.0:6719
 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
 3.3f6d2373) v4
 currently delayed
 2012-11-14 10:11:58.513584 osd.24 [WRN] 1 slow requests, 1 
 included below;
 oldest blocked for  31790.858646 secs
 2012-11-14 10:11:58.513586 osd

Re: endless flying slow requests

2012-11-30 Thread Samuel Just
I've pushed a fix to next, 49f32cee647c5bd09f36ba7c9fd4f481a697b9d7.
Let me know if the problem persists with this patch.
-Sam
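
(A quick way to check whether a local build already contains that commit,
assuming you are building from a git checkout of ceph.git, is:

    git branch -a --contains 49f32cee647c5bd09f36ba7c9fd4f481a697b9d7
    git log --oneline -1 49f32cee647c5bd09f36ba7c9fd4f481a697b9d7
)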

On Wed, Nov 28, 2012 at 2:04 PM, Andrey Korolyov and...@xdel.ru wrote:
 On Thu, Nov 29, 2012 at 1:12 AM, Samuel Just sam.j...@inktank.com wrote:
 Also, these clusters aren't mixed argonaut and next, are they?  (Not
 that that shouldn't work, but it would be a useful data point.)
 -Sam

 On Wed, Nov 28, 2012 at 1:11 PM, Samuel Just sam.j...@inktank.com wrote:
 Did you observe hung io along with that error?  Both sub_op_commit and
 sub_op_applied have happened, so the sub_op_reply should have been
 sent back to the primary.  This looks more like a leak.  If you also
 observed hung io, then it's possible that the problem is occurring
 between the sub_op_applied event and the response.
 -Sam


 It is relatively easy to check if one of client VMs has locked one or
 more cores to iowait or just hangs, so yes, these ops are related to
 real commit operations and they are hanged.
 I`m using all-new 0.54 cluster, without mixing of course. Does
 everyone who hit that bug readjusted cluster before bug shows
 itself(say, in a day-long distance)?

 On Tue, Nov 27, 2012 at 11:47 PM, Andrey Korolyov and...@xdel.ru wrote:
 On Wed, Nov 28, 2012 at 5:51 AM, Sage Weil s...@inktank.com wrote:
 Hi Stefan,

 On Thu, 15 Nov 2012, Sage Weil wrote:
 On Thu, 15 Nov 2012, Stefan Priebe - Profihost AG wrote:
  Am 14.11.2012 15:59, schrieb Sage Weil:
   Hi Stefan,
  
   I would be nice to confirm that no clients are waiting on replies for
   these requests; currently we suspect that the OSD request tracking 
   is the
   buggy part.  If you query the OSD admin socket you should be able to 
   dump
   requests and see the client IP, and then query the client.
  
   Is it librbd?  In that case you likely need to change the config so 
   that
   it is listening on an admin socket ('admin socket = path').
 
  Yes it is. So i have to specify admin socket at the KVM host?

 Right.  IIRC the disk line is a ; (or \;) separated list of key/value
 pairs.

  How do i query the admin socket for requests?

 ceph --admin-daemon /path/to/socket help
 ceph --admin-daemon /path/to/socket objecter_dump (i think)

 Were you able to reproduce this?

 Thanks!
 sage

 Meanwhile, I did. :)
 Such requests will always be created if you have restarted or marked
 an osd out and then back in and scrub didn`t happen in the meantime
 (after such operation and before request arrival).
 What is more interesting, the hangup happens not exactly at the time
 of operation, but tens of minutes later.

 { description: osd_sub_op(client.1292013.0:45422 4.731
 a384cf31\/rbd_data.1415fb1075f187.00a7\/head\/\/4 [] v
 16444'21693 snapset=0=[]:[] snapc=0=[]),
   received_at: 2012-11-28 03:54:43.094151,
   age: 27812.942680,
   duration: 2.676641,
   flag_point: started,
   events: [
 { time: 2012-11-28 03:54:43.094222,
   event: waiting_for_osdmap},
 { time: 2012-11-28 03:54:43.386890,
   event: reached_pg},
 { time: 2012-11-28 03:54:43.386894,
   event: started},
 { time: 2012-11-28 03:54:43.386973,
   event: commit_queued_for_journal_write},
 { time: 2012-11-28 03:54:45.360049,
   event: write_thread_in_journal_buffer},
 { time: 2012-11-28 03:54:45.586183,
   event: journaled_completion_queued},
 { time: 2012-11-28 03:54:45.586262,
   event: sub_op_commit},
 { time: 2012-11-28 03:54:45.770792,
   event: sub_op_applied}]}]}





 sage

 
  Stefan
 
 
   On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:
  
Hello list,
   
i see this several times. Endless flying slow requests. And they 
never
stop
until i restart the mentioned osd.
   
2012-11-14 10:11:57.513395 osd.24 [WRN] 1 slow requests, 1 
included below;
oldest blocked for  31789.858457 secs
2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 31789.858457 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
3.3f6d2373) v4
currently delayed
2012-11-14 10:11:58.513584 osd.24 [WRN] 1 slow requests, 1 
included below;
oldest blocked for  31790.858646 secs
2012-11-14 10:11:58.513586 osd.24 [WRN] slow request 31790.858646 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
3.3f6d2373) v4
currently delayed
2012-11-14 10:11:59.513766 osd.24 [WRN] 1 slow requests, 1 
included below;
oldest blocked for  31791.858827 secs
2012-11-14 10:11:59.513768 osd.24 [WRN] slow request 31791.858827

Re: endless flying slow requests

2012-11-28 Thread Samuel Just
Did you observe hung io along with that error?  Both sub_op_commit and
sub_op_applied have happened, so the sub_op_reply should have been
sent back to the primary.  This looks more like a leak.  If you also
observed hung io, then it's possible that the problem is occurring
between the sub_op_applied event and the response.
-Sam
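
(One way to answer the hung-io question from the client side, assuming the
guest's librbd/librados client has an admin socket configured as discussed
elsewhere in this thread, and with a hypothetical socket path, is to dump
its outstanding requests:

    ceph --admin-daemon /var/run/ceph/guest-vm.asok objecter_requests

On newer builds the command is objecter_requests; older builds may expose
it under a slightly different name, so check the socket's "help" output
first.)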

On Tue, Nov 27, 2012 at 11:47 PM, Andrey Korolyov and...@xdel.ru wrote:
 On Wed, Nov 28, 2012 at 5:51 AM, Sage Weil s...@inktank.com wrote:
 Hi Stefan,

 On Thu, 15 Nov 2012, Sage Weil wrote:
 On Thu, 15 Nov 2012, Stefan Priebe - Profihost AG wrote:
  Am 14.11.2012 15:59, schrieb Sage Weil:
   Hi Stefan,
  
   I would be nice to confirm that no clients are waiting on replies for
   these requests; currently we suspect that the OSD request tracking is 
   the
   buggy part.  If you query the OSD admin socket you should be able to 
   dump
   requests and see the client IP, and then query the client.
  
   Is it librbd?  In that case you likely need to change the config so that
   it is listening on an admin socket ('admin socket = path').
 
  Yes it is. So i have to specify admin socket at the KVM host?

 Right.  IIRC the disk line is a ; (or \;) separated list of key/value
 pairs.

  How do i query the admin socket for requests?

 ceph --admin-daemon /path/to/socket help
 ceph --admin-daemon /path/to/socket objecter_dump (i think)

 Were you able to reproduce this?

 Thanks!
 sage

 Meanwhile, I did. :)
 Such requests will always be created if you have restarted or marked
 an osd out and then back in and scrub didn`t happen in the meantime
 (after such operation and before request arrival).
 What is more interesting, the hangup happens not exactly at the time
 of operation, but tens of minutes later.

 { description: osd_sub_op(client.1292013.0:45422 4.731
 a384cf31\/rbd_data.1415fb1075f187.00a7\/head\/\/4 [] v
 16444'21693 snapset=0=[]:[] snapc=0=[]),
   received_at: 2012-11-28 03:54:43.094151,
   age: 27812.942680,
   duration: 2.676641,
   flag_point: started,
   events: [
 { time: 2012-11-28 03:54:43.094222,
   event: waiting_for_osdmap},
 { time: 2012-11-28 03:54:43.386890,
   event: reached_pg},
 { time: 2012-11-28 03:54:43.386894,
   event: started},
 { time: 2012-11-28 03:54:43.386973,
   event: commit_queued_for_journal_write},
 { time: 2012-11-28 03:54:45.360049,
   event: write_thread_in_journal_buffer},
 { time: 2012-11-28 03:54:45.586183,
   event: journaled_completion_queued},
 { time: 2012-11-28 03:54:45.586262,
   event: sub_op_commit},
 { time: 2012-11-28 03:54:45.770792,
   event: sub_op_applied}]}]}





 sage

 
  Stefan
 
 
   On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:
  
Hello list,
   
i see this several times. Endless flying slow requests. And they never
stop
until i restart the mentioned osd.
   
2012-11-14 10:11:57.513395 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31789.858457 secs
2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 31789.858457 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
3.3f6d2373) v4
currently delayed
2012-11-14 10:11:58.513584 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31790.858646 secs
2012-11-14 10:11:58.513586 osd.24 [WRN] slow request 31790.858646 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
3.3f6d2373) v4
currently delayed
2012-11-14 10:11:59.513766 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31791.858827 secs
2012-11-14 10:11:59.513768 osd.24 [WRN] slow request 31791.858827 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
3.3f6d2373) v4
currently delayed
2012-11-14 10:12:00.513909 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31792.858971 secs
2012-11-14 10:12:00.513916 osd.24 [WRN] slow request 31792.858971 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
3.3f6d2373) v4
currently delayed
2012-11-14 10:12:01.514061 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31793.859124 secs
2012-11-14 10:12:01.514063 osd.24 [WRN] slow request 31793.859124 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client

Re: endless flying slow requests

2012-11-28 Thread Samuel Just
Also, these clusters aren't mixed argonaut and next, are they?  (Not
that that shouldn't work, but it would be a useful data point.)
-Sam

On Wed, Nov 28, 2012 at 1:11 PM, Samuel Just sam.j...@inktank.com wrote:
 Did you observe hung io along with that error?  Both sub_op_commit and
 sub_op_applied have happened, so the sub_op_reply should have been
 sent back to the primary.  This looks more like a leak.  If you also
 observed hung io, then it's possible that the problem is occurring
 between the sub_op_applied event and the response.
 -Sam

 On Tue, Nov 27, 2012 at 11:47 PM, Andrey Korolyov and...@xdel.ru wrote:
 On Wed, Nov 28, 2012 at 5:51 AM, Sage Weil s...@inktank.com wrote:
 Hi Stefan,

 On Thu, 15 Nov 2012, Sage Weil wrote:
 On Thu, 15 Nov 2012, Stefan Priebe - Profihost AG wrote:
  Am 14.11.2012 15:59, schrieb Sage Weil:
   Hi Stefan,
  
   I would be nice to confirm that no clients are waiting on replies for
   these requests; currently we suspect that the OSD request tracking is 
   the
   buggy part.  If you query the OSD admin socket you should be able to 
   dump
   requests and see the client IP, and then query the client.
  
   Is it librbd?  In that case you likely need to change the config so 
   that
   it is listening on an admin socket ('admin socket = path').
 
  Yes it is. So i have to specify admin socket at the KVM host?

 Right.  IIRC the disk line is a ; (or \;) separated list of key/value
 pairs.

  How do i query the admin socket for requests?

 ceph --admin-daemon /path/to/socket help
 ceph --admin-daemon /path/to/socket objecter_dump (i think)

 Were you able to reproduce this?

 Thanks!
 sage

 Meanwhile, I did. :)
 Such requests will always be created if you have restarted or marked
 an osd out and then back in and scrub didn`t happen in the meantime
 (after such operation and before request arrival).
 What is more interesting, the hangup happens not exactly at the time
 of operation, but tens of minutes later.

 { description: osd_sub_op(client.1292013.0:45422 4.731
 a384cf31\/rbd_data.1415fb1075f187.00a7\/head\/\/4 [] v
 16444'21693 snapset=0=[]:[] snapc=0=[]),
   received_at: 2012-11-28 03:54:43.094151,
   age: 27812.942680,
   duration: 2.676641,
   flag_point: started,
   events: [
 { time: 2012-11-28 03:54:43.094222,
   event: waiting_for_osdmap},
 { time: 2012-11-28 03:54:43.386890,
   event: reached_pg},
 { time: 2012-11-28 03:54:43.386894,
   event: started},
 { time: 2012-11-28 03:54:43.386973,
   event: commit_queued_for_journal_write},
 { time: 2012-11-28 03:54:45.360049,
   event: write_thread_in_journal_buffer},
 { time: 2012-11-28 03:54:45.586183,
   event: journaled_completion_queued},
 { time: 2012-11-28 03:54:45.586262,
   event: sub_op_commit},
 { time: 2012-11-28 03:54:45.770792,
   event: sub_op_applied}]}]}





 sage

 
  Stefan
 
 
   On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:
  
Hello list,
   
i see this several times. Endless flying slow requests. And they 
never
stop
until i restart the mentioned osd.
   
2012-11-14 10:11:57.513395 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31789.858457 secs
2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 31789.858457 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
3.3f6d2373) v4
currently delayed
2012-11-14 10:11:58.513584 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31790.858646 secs
2012-11-14 10:11:58.513586 osd.24 [WRN] slow request 31790.858646 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
3.3f6d2373) v4
currently delayed
2012-11-14 10:11:59.513766 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31791.858827 secs
2012-11-14 10:11:59.513768 osd.24 [WRN] slow request 31791.858827 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
3.3f6d2373) v4
currently delayed
2012-11-14 10:12:00.513909 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31792.858971 secs
2012-11-14 10:12:00.513916 osd.24 [WRN] slow request 31792.858971 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
3.3f6d2373) v4
currently delayed
2012-11-14 10:12:01.514061 osd.24 [WRN] 1

RE: endless flying slow requests

2012-11-28 Thread Moore, Shawn M
I actually have this exact issue with hung io during these slow requests.  The
odd thing is that when I restarted the osd being mentioned via ceph -w, io did
not resume.  A bit later (unsure of the exact time) a different osd started
being listed in ceph -w, and when I restarted that osd, io came back.  I
believe these occur when I reboot nodes.  I never reboot more than one at a
time, and I have 3 replicas in the cluster.
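
For reference, a quick way to see which osds currently have blocked requests,
so you know which daemon to poke, is something like:

ceph health detail
# shows which requests are blocked and on which osds
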

I have a pure ceph version 0.53 
(commit:2528b5ee105b16352c91af064af5c0b5a7d45d7c) cluster.

Regards,
Shawn


From: ceph-devel-ow...@vger.kernel.org [ceph-devel-ow...@vger.kernel.org] on 
behalf of Samuel Just [sam.j...@inktank.com]
Sent: Wednesday, November 28, 2012 16:12
To: Andrey Korolyov
Cc: Sage Weil; Stefan Priebe - Profihost AG; ceph-devel@vger.kernel.org
Subject: Re: endless flying slow requests

Also, these clusters aren't mixed argonaut and next, are they?  (Not
that that shouldn't work, but it would be a useful data point.)
-Sam

On Wed, Nov 28, 2012 at 1:11 PM, Samuel Just sam.j...@inktank.com wrote:
 Did you observe hung io along with that error?  Both sub_op_commit and
 sub_op_applied have happened, so the sub_op_reply should have been
 sent back to the primary.  This looks more like a leak.  If you also
 observed hung io, then it's possible that the problem is occurring
 between the sub_op_applied event and the response.
 -Sam

 On Tue, Nov 27, 2012 at 11:47 PM, Andrey Korolyov and...@xdel.ru wrote:
 On Wed, Nov 28, 2012 at 5:51 AM, Sage Weil s...@inktank.com wrote:
 Hi Stefan,

 On Thu, 15 Nov 2012, Sage Weil wrote:
 On Thu, 15 Nov 2012, Stefan Priebe - Profihost AG wrote:
  Am 14.11.2012 15:59, schrieb Sage Weil:
   Hi Stefan,
  
   It would be nice to confirm that no clients are waiting on replies for
   these requests; currently we suspect that the OSD request tracking is 
   the
   buggy part.  If you query the OSD admin socket you should be able to 
   dump
   requests and see the client IP, and then query the client.
  
   Is it librbd?  In that case you likely need to change the config so 
   that
   it is listening on an admin socket ('admin socket = path').
 
  Yes it is. So i have to specify admin socket at the KVM host?

 Right.  IIRC the disk line is a ; (or \;) separated list of key/value
 pairs.

  How do i query the admin socket for requests?

 ceph --admin-daemon /path/to/socket help
 ceph --admin-daemon /path/to/socket objecter_dump (i think)

 Were you able to reproduce this?

 Thanks!
 sage

 Meanwhile, I did. :)
 Such requests will always be created if you have restarted an osd, or marked
 one out and then back in, and a scrub didn't happen in the meantime
 (after such an operation and before the request arrives).
 What is more interesting, the hangup happens not exactly at the time
 of the operation, but tens of minutes later.

 { description: osd_sub_op(client.1292013.0:45422 4.731
 a384cf31\/rbd_data.1415fb1075f187.00a7\/head\/\/4 [] v
 16444'21693 snapset=0=[]:[] snapc=0=[]),
   received_at: 2012-11-28 03:54:43.094151,
   age: 27812.942680,
   duration: 2.676641,
   flag_point: started,
   events: [
 { time: 2012-11-28 03:54:43.094222,
   event: waiting_for_osdmap},
 { time: 2012-11-28 03:54:43.386890,
   event: reached_pg},
 { time: 2012-11-28 03:54:43.386894,
   event: started},
 { time: 2012-11-28 03:54:43.386973,
   event: commit_queued_for_journal_write},
 { time: 2012-11-28 03:54:45.360049,
   event: write_thread_in_journal_buffer},
 { time: 2012-11-28 03:54:45.586183,
   event: journaled_completion_queued},
 { time: 2012-11-28 03:54:45.586262,
   event: sub_op_commit},
 { time: 2012-11-28 03:54:45.770792,
   event: sub_op_applied}]}]}





 sage

 
  Stefan
 
 
   On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:
  
Hello list,
   
i see this several times. Endless flying slow requests. And they 
never
stop
until i restart the mentioned osd.
   
2012-11-14 10:11:57.513395 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31789.858457 secs
2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 31789.858457 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
3.3f6d2373) v4
currently delayed
2012-11-14 10:11:58.513584 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31790.858646 secs
2012-11-14 10:11:58.513586 osd.24 [WRN] slow request 31790.858646 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write

Re: endless flying slow requests

2012-11-28 Thread Andrey Korolyov
On Thu, Nov 29, 2012 at 1:12 AM, Samuel Just sam.j...@inktank.com wrote:
 Also, these clusters aren't mixed argonaut and next, are they?  (Not
 that that shouldn't work, but it would be a useful data point.)
 -Sam

 On Wed, Nov 28, 2012 at 1:11 PM, Samuel Just sam.j...@inktank.com wrote:
 Did you observe hung io along with that error?  Both sub_op_commit and
 sub_op_applied have happened, so the sub_op_reply should have been
 sent back to the primary.  This looks more like a leak.  If you also
 observed hung io, then it's possible that the problem is occurring
 between the sub_op_applied event and the response.
 -Sam


It is relatively easy to check whether one of the client VMs has locked one
or more cores in iowait or has simply hung, so yes, these ops correspond to
real commit operations and they are hung.
I'm using an all-new 0.54 cluster, with no mixing of course. Did everyone
who hit this bug make a cluster change (say, restarting or re-adding an osd)
within roughly a day before the bug showed itself?
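
For what it's worth, a quick way to confirm that from inside a guest, assuming
the usual procps/sysstat tools are installed:

# per-core iowait; a core pinned near 100 in the %iowait column points at hung io
mpstat -P ALL 1
# or watch the 'b' (blocked processes) and 'wa' columns
vmstat 1
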

 On Tue, Nov 27, 2012 at 11:47 PM, Andrey Korolyov and...@xdel.ru wrote:
 On Wed, Nov 28, 2012 at 5:51 AM, Sage Weil s...@inktank.com wrote:
 Hi Stefan,

 On Thu, 15 Nov 2012, Sage Weil wrote:
 On Thu, 15 Nov 2012, Stefan Priebe - Profihost AG wrote:
  Am 14.11.2012 15:59, schrieb Sage Weil:
   Hi Stefan,
  
   It would be nice to confirm that no clients are waiting on replies for
   these requests; currently we suspect that the OSD request tracking is 
   the
   buggy part.  If you query the OSD admin socket you should be able to 
   dump
   requests and see the client IP, and then query the client.
  
   Is it librbd?  In that case you likely need to change the config so 
   that
   it is listening on an admin socket ('admin socket = path').
 
  Yes it is. So i have to specify admin socket at the KVM host?

 Right.  IIRC the disk line is a ; (or \;) separated list of key/value
 pairs.

  How do i query the admin socket for requests?

 ceph --admin-daemon /path/to/socket help
 ceph --admin-daemon /path/to/socket objecter_dump (i think)

 Were you able to reproduce this?

 Thanks!
 sage

 Meanwhile, I did. :)
 Such requests will always be created if you have restarted an osd, or marked
 one out and then back in, and a scrub didn't happen in the meantime
 (after such an operation and before the request arrives).
 What is more interesting, the hangup happens not exactly at the time
 of the operation, but tens of minutes later.

 { description: osd_sub_op(client.1292013.0:45422 4.731
 a384cf31\/rbd_data.1415fb1075f187.00a7\/head\/\/4 [] v
 16444'21693 snapset=0=[]:[] snapc=0=[]),
   received_at: 2012-11-28 03:54:43.094151,
   age: 27812.942680,
   duration: 2.676641,
   flag_point: started,
   events: [
 { time: 2012-11-28 03:54:43.094222,
   event: waiting_for_osdmap},
 { time: 2012-11-28 03:54:43.386890,
   event: reached_pg},
 { time: 2012-11-28 03:54:43.386894,
   event: started},
 { time: 2012-11-28 03:54:43.386973,
   event: commit_queued_for_journal_write},
 { time: 2012-11-28 03:54:45.360049,
   event: write_thread_in_journal_buffer},
 { time: 2012-11-28 03:54:45.586183,
   event: journaled_completion_queued},
 { time: 2012-11-28 03:54:45.586262,
   event: sub_op_commit},
 { time: 2012-11-28 03:54:45.770792,
   event: sub_op_applied}]}]}





 sage

 
  Stefan
 
 
   On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:
  
Hello list,
   
i see this several times. Endless flying slow requests. And they 
never
stop
until i restart the mentioned osd.
   
2012-11-14 10:11:57.513395 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31789.858457 secs
2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 31789.858457 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
3.3f6d2373) v4
currently delayed
2012-11-14 10:11:58.513584 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31790.858646 secs
2012-11-14 10:11:58.513586 osd.24 [WRN] slow request 31790.858646 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
3.3f6d2373) v4
currently delayed
2012-11-14 10:11:59.513766 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31791.858827 secs
2012-11-14 10:11:59.513768 osd.24 [WRN] slow request 31791.858827 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
3.3f6d2373) v4
currently delayed
2012-11-14 10

Re: endless flying slow requests

2012-11-27 Thread Sage Weil
Hi Stefan,

On Thu, 15 Nov 2012, Sage Weil wrote:
 On Thu, 15 Nov 2012, Stefan Priebe - Profihost AG wrote:
  Am 14.11.2012 15:59, schrieb Sage Weil:
   Hi Stefan,
   
   It would be nice to confirm that no clients are waiting on replies for
   these requests; currently we suspect that the OSD request tracking is the
   buggy part.  If you query the OSD admin socket you should be able to dump
   requests and see the client IP, and then query the client.
   
   Is it librbd?  In that case you likely need to change the config so that
   it is listening on an admin socket ('admin socket = path').
  
  Yes it is. So i have to specify admin socket at the KVM host?
 
 Right.  IIRC the disk line is a ; (or \;) separated list of key/value 
 pairs.
 
  How do i query the admin socket for requests?
 
 ceph --admin-daemon /path/to/socket help
 ceph --admin-daemon /path/to/socket objecter_dump (i think)

Were you able to reproduce this?

Thanks!
sage


 
 sage
 
  
  Stefan
  
  
   On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:
   
Hello list,

i see this several times. Endless flying slow requests. And they never
stop
until i restart the mentioned osd.

2012-11-14 10:11:57.513395 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31789.858457 secs
2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 31789.858457 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) 
v4
currently delayed
2012-11-14 10:11:58.513584 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31790.858646 secs
2012-11-14 10:11:58.513586 osd.24 [WRN] slow request 31790.858646 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) 
v4
currently delayed
2012-11-14 10:11:59.513766 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31791.858827 secs
2012-11-14 10:11:59.513768 osd.24 [WRN] slow request 31791.858827 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) 
v4
currently delayed
2012-11-14 10:12:00.513909 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31792.858971 secs
2012-11-14 10:12:00.513916 osd.24 [WRN] slow request 31792.858971 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) 
v4
currently delayed
2012-11-14 10:12:01.514061 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31793.859124 secs
2012-11-14 10:12:01.514063 osd.24 [WRN] slow request 31793.859124 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) 
v4
currently delayed

When i now restart osd 24 they go away and everything is fine again.

Stefan


Re: endless flying slow requests

2012-11-27 Thread Stefan Priebe - Profihost AG

Am 28.11.2012 02:51, schrieb Sage Weil:

Hi Stefan,


Yes it is. So i have to specify admin socket at the KVM host?


Right.  IIRC the disk line is a ; (or \;) separated list of key/value
pairs.


How do i query the admin socket for requests?


ceph --admin-daemon /path/to/socket help
ceph --admin-daemon /path/to/socket objecter_dump (i think)


Were you able to reproduce this?


No, not right now. But I've saved your answers and will run the tests
when I see this again.


Right now I'm just seeing extremely slow OSD recovery.

Stefan


Re: endless flying slow requests

2012-11-27 Thread Andrey Korolyov
On Wed, Nov 28, 2012 at 5:51 AM, Sage Weil s...@inktank.com wrote:
 Hi Stefan,

 On Thu, 15 Nov 2012, Sage Weil wrote:
 On Thu, 15 Nov 2012, Stefan Priebe - Profihost AG wrote:
  Am 14.11.2012 15:59, schrieb Sage Weil:
   Hi Stefan,
  
   It would be nice to confirm that no clients are waiting on replies for
   these requests; currently we suspect that the OSD request tracking is the
   buggy part.  If you query the OSD admin socket you should be able to dump
   requests and see the client IP, and then query the client.
  
   Is it librbd?  In that case you likely need to change the config so that
   it is listening on an admin socket ('admin socket = path').
 
  Yes it is. So i have to specify admin socket at the KVM host?

 Right.  IIRC the disk line is a ; (or \;) separated list of key/value
 pairs.

  How do i query the admin socket for requests?

 ceph --admin-daemon /path/to/socket help
 ceph --admin-daemon /path/to/socket objecter_dump (i think)

 Were you able to reproduce this?

 Thanks!
 sage

Meanwhile, I did. :)
Such requests will always be created if you have restarted an osd, or marked
one out and then back in, and a scrub didn't happen in the meantime
(after such an operation and before the request arrives).
What is more interesting, the hangup happens not exactly at the time
of the operation, but tens of minutes later.
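
Roughly, as a sketch (the osd id and pool are only examples, and any steady
write load against the affected pgs should do):

# keep some writes going
rados bench -p rbd 600 write &
# take an osd out and bring it back before a scrub has a chance to run
ceph osd out 12
sleep 60
ceph osd in 12
# then wait; the stuck request may only show up tens of minutes later
ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok dump_ops_in_flight
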

{ description: osd_sub_op(client.1292013.0:45422 4.731
a384cf31\/rbd_data.1415fb1075f187.00a7\/head\/\/4 [] v
16444'21693 snapset=0=[]:[] snapc=0=[]),
  received_at: 2012-11-28 03:54:43.094151,
  age: 27812.942680,
  duration: 2.676641,
  flag_point: started,
  events: [
{ time: 2012-11-28 03:54:43.094222,
  event: waiting_for_osdmap},
{ time: 2012-11-28 03:54:43.386890,
  event: reached_pg},
{ time: 2012-11-28 03:54:43.386894,
  event: started},
{ time: 2012-11-28 03:54:43.386973,
  event: commit_queued_for_journal_write},
{ time: 2012-11-28 03:54:45.360049,
  event: write_thread_in_journal_buffer},
{ time: 2012-11-28 03:54:45.586183,
  event: journaled_completion_queued},
{ time: 2012-11-28 03:54:45.586262,
  event: sub_op_commit},
{ time: 2012-11-28 03:54:45.770792,
  event: sub_op_applied}]}]}





 sage

 
  Stefan
 
 
   On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:
  
Hello list,
   
i see this several times. Endless flying slow requests. And they never
stop
until i restart the mentioned osd.
   
2012-11-14 10:11:57.513395 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31789.858457 secs
2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 31789.858457 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) 
v4
currently delayed
2012-11-14 10:11:58.513584 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31790.858646 secs
2012-11-14 10:11:58.513586 osd.24 [WRN] slow request 31790.858646 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) 
v4
currently delayed
2012-11-14 10:11:59.513766 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31791.858827 secs
2012-11-14 10:11:59.513768 osd.24 [WRN] slow request 31791.858827 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) 
v4
currently delayed
2012-11-14 10:12:00.513909 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31792.858971 secs
2012-11-14 10:12:00.513916 osd.24 [WRN] slow request 31792.858971 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) 
v4
currently delayed
2012-11-14 10:12:01.514061 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31793.859124 secs
2012-11-14 10:12:01.514063 osd.24 [WRN] slow request 31793.859124 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) 
v4
currently delayed
   
When i now restart osd 24 they go away and everything is fine again.
   
Stefan

Re: endless flying slow requests

2012-11-15 Thread Stefan Priebe - Profihost AG

Am 14.11.2012 15:59, schrieb Sage Weil:

Hi Stefan,

It would be nice to confirm that no clients are waiting on replies for
these requests; currently we suspect that the OSD request tracking is the
buggy part.  If you query the OSD admin socket you should be able to dump
requests and see the client IP, and then query the client.

Is it librbd?  In that case you likely need to change the config so that
it is listening on an admin socket ('admin socket = path').


Yes it is. So I have to specify the admin socket on the KVM host? How do I
query the admin socket for requests?


Stefan



On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:


Hello list,

i see this several times. Endless flying slow requests. And they never stop
until i restart the mentioned osd.

2012-11-14 10:11:57.513395 osd.24 [WRN] 1 slow requests, 1 included below;
oldest blocked for  31789.858457 secs
2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 31789.858457 seconds old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
currently delayed
2012-11-14 10:11:58.513584 osd.24 [WRN] 1 slow requests, 1 included below;
oldest blocked for  31790.858646 secs
2012-11-14 10:11:58.513586 osd.24 [WRN] slow request 31790.858646 seconds old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
currently delayed
2012-11-14 10:11:59.513766 osd.24 [WRN] 1 slow requests, 1 included below;
oldest blocked for  31791.858827 secs
2012-11-14 10:11:59.513768 osd.24 [WRN] slow request 31791.858827 seconds old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
currently delayed
2012-11-14 10:12:00.513909 osd.24 [WRN] 1 slow requests, 1 included below;
oldest blocked for  31792.858971 secs
2012-11-14 10:12:00.513916 osd.24 [WRN] slow request 31792.858971 seconds old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
currently delayed
2012-11-14 10:12:01.514061 osd.24 [WRN] 1 slow requests, 1 included below;
oldest blocked for  31793.859124 secs
2012-11-14 10:12:01.514063 osd.24 [WRN] slow request 31793.859124 seconds old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
currently delayed

When i now restart osd 24 they go away and everything is fine again.

Stefan


Re: endless flying slow requests

2012-11-15 Thread Josh Durgin

On 11/15/2012 12:09 AM, Stefan Priebe - Profihost AG wrote:

Am 14.11.2012 15:59, schrieb Sage Weil:

Hi Stefan,

It would be nice to confirm that no clients are waiting on replies for
these requests; currently we suspect that the OSD request tracking is the
buggy part.  If you query the OSD admin socket you should be able to dump
requests and see the client IP, and then query the client.

Is it librbd?  In that case you likely need to change the config so that
it is listening on an admin socket ('admin socket = path').


Yes it is. So i have to specify admin socket at the KVM host? How do i
query the admin socket for requests?


Yes, add 'admin socket = /path/to/admin/socket' to the [client] section
in ceph.conf, and when a guest is running, show outstanding requests
with:

ceph --admin-daemon /path/to/admin/socket objecter_requests
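
As a concrete sketch (the path is only an example; $pid is a ceph.conf
metavariable, so each qemu process gets its own socket):

[client]
    # the directory must be writable by the user qemu runs as
    admin socket = /var/run/ceph/qemu-guest-$pid.asok

ceph --admin-daemon /var/run/ceph/qemu-guest-<pid>.asok objecter_requests
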

Josh


Re: endless flying slow requests

2012-11-15 Thread Sage Weil
On Thu, 15 Nov 2012, Stefan Priebe - Profihost AG wrote:
 Am 14.11.2012 15:59, schrieb Sage Weil:
  Hi Stefan,
  
  It would be nice to confirm that no clients are waiting on replies for
  these requests; currently we suspect that the OSD request tracking is the
  buggy part.  If you query the OSD admin socket you should be able to dump
  requests and see the client IP, and then query the client.
  
  Is it librbd?  In that case you likely need to change the config so that
  it is listening on an admin socket ('admin socket = path').
 
 Yes it is. So i have to specify admin socket at the KVM host?

Right.  IIRC the disk line is a ; (or \;) separated list of key/value 
pairs.
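
For example, something like this on the qemu command line (a sketch from
memory: recent qemu separates the rbd key/value pairs with ':' rather than
';', and ceph option names use '_' in place of spaces, so adjust as needed):

qemu -drive format=raw,if=virtio,file=rbd:rbd/vm1:id=admin:conf=/etc/ceph/ceph.conf:admin_socket=/var/run/ceph/vm1.asok
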

 How do i query the admin socket for requests?

ceph --admin-daemon /path/to/socket help
ceph --admin-daemon /path/to/socket objecter_dump (i think)

sage

 
 Stefan
 
 
  On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:
  
   Hello list,
   
   i see this several times. Endless flying slow requests. And they never
   stop
   until i restart the mentioned osd.
   
   2012-11-14 10:11:57.513395 osd.24 [WRN] 1 slow requests, 1 included below;
   oldest blocked for  31789.858457 secs
   2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 31789.858457 seconds
   old,
   received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
   rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
   currently delayed
   2012-11-14 10:11:58.513584 osd.24 [WRN] 1 slow requests, 1 included below;
   oldest blocked for  31790.858646 secs
   2012-11-14 10:11:58.513586 osd.24 [WRN] slow request 31790.858646 seconds
   old,
   received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
   rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
   currently delayed
   2012-11-14 10:11:59.513766 osd.24 [WRN] 1 slow requests, 1 included below;
   oldest blocked for  31791.858827 secs
   2012-11-14 10:11:59.513768 osd.24 [WRN] slow request 31791.858827 seconds
   old,
   received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
   rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
   currently delayed
   2012-11-14 10:12:00.513909 osd.24 [WRN] 1 slow requests, 1 included below;
   oldest blocked for  31792.858971 secs
   2012-11-14 10:12:00.513916 osd.24 [WRN] slow request 31792.858971 seconds
   old,
   received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
   rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
   currently delayed
   2012-11-14 10:12:01.514061 osd.24 [WRN] 1 slow requests, 1 included below;
   oldest blocked for  31793.859124 secs
   2012-11-14 10:12:01.514063 osd.24 [WRN] slow request 31793.859124 seconds
   old,
   received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
   rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
   currently delayed
   
   When i now restart osd 24 they go away and everything is fine again.
   
   Stefan


endless flying slow requests

2012-11-14 Thread Stefan Priebe - Profihost AG

Hello list,

I see this several times: endless flying slow requests. And they never
stop until I restart the mentioned osd.


2012-11-14 10:11:57.513395 osd.24 [WRN] 1 slow requests, 1 included 
below; oldest blocked for  31789.858457 secs
2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 31789.858457 
seconds old, received at 2012-11-14 01:22:07.654922: 
osd_op(client.30286.0:6719 rbd_data.75c55bf2fdd7.1399 [write 
282624~4096] 3.3f6d2373) v4 currently delayed
2012-11-14 10:11:58.513584 osd.24 [WRN] 1 slow requests, 1 included 
below; oldest blocked for  31790.858646 secs
2012-11-14 10:11:58.513586 osd.24 [WRN] slow request 31790.858646 
seconds old, received at 2012-11-14 01:22:07.654922: 
osd_op(client.30286.0:6719 rbd_data.75c55bf2fdd7.1399 [write 
282624~4096] 3.3f6d2373) v4 currently delayed
2012-11-14 10:11:59.513766 osd.24 [WRN] 1 slow requests, 1 included 
below; oldest blocked for  31791.858827 secs
2012-11-14 10:11:59.513768 osd.24 [WRN] slow request 31791.858827 
seconds old, received at 2012-11-14 01:22:07.654922: 
osd_op(client.30286.0:6719 rbd_data.75c55bf2fdd7.1399 [write 
282624~4096] 3.3f6d2373) v4 currently delayed
2012-11-14 10:12:00.513909 osd.24 [WRN] 1 slow requests, 1 included 
below; oldest blocked for  31792.858971 secs
2012-11-14 10:12:00.513916 osd.24 [WRN] slow request 31792.858971 
seconds old, received at 2012-11-14 01:22:07.654922: 
osd_op(client.30286.0:6719 rbd_data.75c55bf2fdd7.1399 [write 
282624~4096] 3.3f6d2373) v4 currently delayed
2012-11-14 10:12:01.514061 osd.24 [WRN] 1 slow requests, 1 included 
below; oldest blocked for  31793.859124 secs
2012-11-14 10:12:01.514063 osd.24 [WRN] slow request 31793.859124 
seconds old, received at 2012-11-14 01:22:07.654922: 
osd_op(client.30286.0:6719 rbd_data.75c55bf2fdd7.1399 [write 
282624~4096] 3.3f6d2373) v4 currently delayed


When I now restart osd 24, they go away and everything is fine again.

Stefan


Re: endless flying slow requests

2012-11-14 Thread Sage Weil
Hi Stefan,

It would be nice to confirm that no clients are waiting on replies for
these requests; currently we suspect that the OSD request tracking is the 
buggy part.  If you query the OSD admin socket you should be able to dump 
requests and see the client IP, and then query the client.  
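
For example (the socket path and osd id are illustrative, and the exact
command names may vary a little between versions):

ceph --admin-daemon /var/run/ceph/ceph-osd.24.asok dump_ops_in_flight
ceph --admin-daemon /var/run/ceph/ceph-osd.24.asok dump_historic_ops

The op descriptions there include the client instance (client.30286 in the
quoted log below), which you can then match up on the client side.
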

Is it librbd?  In that case you likely need to change the config so that 
it is listening on an admin socket ('admin socket = path').

Thanks!
sage


On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:

 Hello list,
 
 i see this several times. Endless flying slow requests. And they never stop
 until i restart the mentioned osd.
 
 2012-11-14 10:11:57.513395 osd.24 [WRN] 1 slow requests, 1 included below;
 oldest blocked for  31789.858457 secs
 2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 31789.858457 seconds old,
 received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
 currently delayed
 2012-11-14 10:11:58.513584 osd.24 [WRN] 1 slow requests, 1 included below;
 oldest blocked for  31790.858646 secs
 2012-11-14 10:11:58.513586 osd.24 [WRN] slow request 31790.858646 seconds old,
 received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
 currently delayed
 2012-11-14 10:11:59.513766 osd.24 [WRN] 1 slow requests, 1 included below;
 oldest blocked for  31791.858827 secs
 2012-11-14 10:11:59.513768 osd.24 [WRN] slow request 31791.858827 seconds old,
 received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
 currently delayed
 2012-11-14 10:12:00.513909 osd.24 [WRN] 1 slow requests, 1 included below;
 oldest blocked for  31792.858971 secs
 2012-11-14 10:12:00.513916 osd.24 [WRN] slow request 31792.858971 seconds old,
 received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
 currently delayed
 2012-11-14 10:12:01.514061 osd.24 [WRN] 1 slow requests, 1 included below;
 oldest blocked for  31793.859124 secs
 2012-11-14 10:12:01.514063 osd.24 [WRN] slow request 31793.859124 seconds old,
 received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
 currently delayed
 
 When i now restart osd 24 they go away and everything is fine again.
 
 Stefan