Re: [ceph-users] Debugging 'slow requests' ...

2019-02-11 Thread Brad Hubbard
Glad to help!

On Tue, Feb 12, 2019 at 4:55 PM Massimo Sgaravatto
 wrote:
>
> Thanks a lot Brad !
>
> The problem is indeed in the network: we moved the OSD nodes back to the 
> "old" switches and the problem disappeared.
>
> Now we have to figure out what is wrong/misconfigured with the new switch: we 
> would try to replicate the problem, possibly without a ceph deployment ...
>
> Thanks again for your help !
>
> Cheers, Massimo
>
> On Sun, Feb 10, 2019 at 12:07 AM Brad Hubbard  wrote:
>>
>> The log ends at
>>
>> $ zcat ceph-osd.5.log.gz |tail -2
>> 2019-02-09 07:37:00.022534 7f5fce60d700  1 --
>> 192.168.61.202:6816/157436 >> - conn(0x56308edcf000 :6816
>> s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=296 -
>>
>> The last two messages are outbound to 192.168.222.204 and there are no
>> further messages between these two hosts (other than osd_pings) in the
>> log.
>>
>> $ zcat ceph-osd.5.log.gz |gawk
>> '!/osd_ping/&&/192.168.222.202/&&/192.168.222.204/&&/07:29:3/'|tail -5
>> 2019-02-09 07:29:34.267744 7f5fcee0e700  1 --
>> 192.168.222.202:6816/157436 <== osd.29 192.168.222.204:6804/4159520
>> 1946  rep_scrubmap(8.2bc e1205735 from shard 29) v2  40+0+1492
>> (3695125937 0 2050362985) 0x563090674d80 con 0x56308bf61000
>> 2019-02-09 07:29:34.375223 7f5faf4b4700  1 --
>> 192.168.222.202:6816/157436 --> 192.168.222.204:6804/4159520 --
>> replica_scrub(pg:
>> 8.2bc,from:0'0,to:0'0,epoch:1205833/1205735,start:8:3d4e6145:::rbd_data.35f46d19abe4ed.77a4:0,end:8:3d4e6916:::rbd_data.a6dc2425de9600.0006249c:0,chunky:1,deep:0,version:9,allow_preemption:1,priority=5)
>> v9 -- 0x56308bdf2000 con 0
>> 2019-02-09 07:29:34.378535 7f5fcee0e700  1 --
>> 192.168.222.202:6816/157436 <== osd.29 192.168.222.204:6804/4159520
>> 1947  rep_scrubmap(8.2bc e1205735 from shard 29) v2  40+0+1494
>> (3695125937 0 865217733) 0x563092d90900 con 0x56308bf61000
>> 2019-02-09 07:29:34.415868 7f5faf4b4700  1 --
>> 192.168.222.202:6816/157436 --> 192.168.222.204:6804/4159520 --
>> osd_repop(client.171725953.0:404377591 8.9b e1205833/1205735
>> 8:d90adab6:::rbd_data.c47f3c390c8495.0001934a:head v
>> 1205833'4767322) v2 -- 0x56308ca42400 con 0
>> 2019-02-09 07:29:34.486296 7f5faf4b4700  1 --
>> 192.168.222.202:6816/157436 --> 192.168.222.204:6804/4159520 --
>> replica_scrub(pg:
>> 8.2bc,from:0'0,to:0'0,epoch:1205833/1205735,start:8:3d4e6916:::rbd_data.a6dc2425de9600.0006249c:0,end:8:3d4e7434:::rbd_data.47c1b437840214.0003c594:0,chunky:1,deep:0,version:9,allow_preemption:1,priority=5)
>> v9 -- 0x56308e565340 con 0
>>
>> I'd be taking a good, hard look at the network, yes.
>>
>> On Sat, Feb 9, 2019 at 6:33 PM Massimo Sgaravatto
>>  wrote:
>> >
>> > Thanks for your feedback !
>> >
>> > I increased debug_ms to 1/5.
>> >
>> > This is another slow request (full output from 'ceph daemon osd.5 
>> > dump_historic_ops' for this event is attached):
>> >
>> >
>> > {
>> > "description": "osd_op(client.171725953.0:404377591 8.9b 
>> > 8:d90adab6:
>> > ::rbd_data.c47f3c390c8495.0001934a:head [set-alloc-hint 
>> > object_size 4194
>> > 304 write_size 4194304,write 1413120~122880] snapc 0=[] 
>> > ondisk+write+known_if_re
>> > directed e1205833)",
>> > "initiated_at": "2019-02-09 07:29:34.404655",
>> > "age": 387.914193,
>> > "duration": 340.224154,
>> > "type_data": {
>> > "flag_point": "commit sent; apply or cleanup",
>> > "client_info": {
>> > "client": "client.171725953",
>> > "client_addr": "192.168.61.66:0/4056439540",
>> > "tid": 404377591
>> > },
>> > "events": [
>> > {
>> > "time": "2019-02-09 07:29:34.404655",
>> > "event": "initiated"
>> > },
>> > 
>> > 
>> >{
>> > "time": "2019-02-09 07:29:34.416752",
>> > "event": "op_applied"
>> > },
>> > {
>> > "time": "2019-02-09 07:29:34.417200",
>> > "event": "sub_op_commit_rec from 14"
>> > },
>> > {
>> > "time": "2019-02-09 07:35:14.628724",
>> > "event": "sub_op_commit_rec from 29"
>> > },
>> >
>> > osd.5 has IP 192.168.222.202
>> > osd.14 has IP 192.168.222.203
>> > osd.29 has IP 192.168.222.204
>> >
>> >
>> > Grepping for that client id in the osd.5 log, as far as I can understand
>> > (please correct me if my debugging is completely wrong), the replication
>> > requests to OSDs 14 and 29 are sent at 07:29:34:
>> >
>> > 2019-02-09 07:29:34.415808 7f5faf4b4700  1 -- 192.168.222.202:6816/157436 
>> > --> 192.168.222.203:6811/158495 -- osd_repop(client.171725953.0:404377591 
>> > 8.9b e1205833/1205735 

Re: [ceph-users] Debugging 'slow requests' ...

2019-02-11 Thread Massimo Sgaravatto
Thanks a lot, Brad!

The problem is indeed in the network: we moved the OSD nodes back to the
"old" switches and the problem disappeared.

Now we have to figure out what is wrong/misconfigured with the new switch:
we will try to replicate the problem, possibly without a Ceph deployment...
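
(For reference, a minimal way to exercise the cluster network through the new
switch without involving Ceph could be something like the following sketch;
the addresses are the OSD cluster IPs from this thread, and the jumbo-frame
size is only an assumption about the setup:)

# on 192.168.222.204: start a throughput test server
$ iperf3 -s
# on 192.168.222.202: sustained tests in both directions through the new switch
$ iperf3 -c 192.168.222.204 -t 300
$ iperf3 -c 192.168.222.204 -t 300 -R
# if jumbo frames are enabled on the hosts, verify the switch actually passes
# them (8972 = 9000-byte MTU minus 28 bytes of IP/ICMP headers); a silent drop
# here would match replication messages that never arrive
$ ping -M do -s 8972 -c 5 192.168.222.204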

Thanks again for your help!

Cheers, Massimo

On Sun, Feb 10, 2019 at 12:07 AM Brad Hubbard  wrote:

> The log ends at
>
> $ zcat ceph-osd.5.log.gz |tail -2
> 2019-02-09 07:37:00.022534 7f5fce60d700  1 --
> 192.168.61.202:6816/157436 >> - conn(0x56308edcf000 :6816
> s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=296 -
>
> The last two messages are outbound to 192.168.222.204 and there are no
> further messages between these two hosts (other than osd_pings) in the
> log.
>
> $ zcat ceph-osd.5.log.gz |gawk
> '!/osd_ping/&&/192.168.222.202/&&/192.168.222.204/&&/07:29:3/'|tail
>  -5
> 2019-02-09 07:29:34.267744 7f5fcee0e700  1 --
> 192.168.222.202:6816/157436 <== osd.29 192.168.222.204:6804/4159520
> 1946  rep_scrubmap(8.2bc e1205735 from shard 29) v2  40+0+1492
> (3695125937 0 2050362985) 0x563090674d80 con 0x56308bf61000
> 2019-02-09 07:29:34.375223 7f5faf4b4700  1 --
> 192.168.222.202:6816/157436 --> 192.168.222.204:6804/4159520 --
> replica_scrub(pg:
>
> 8.2bc,from:0'0,to:0'0,epoch:1205833/1205735,start:8:3d4e6145:::rbd_data.35f46d19abe4ed.77a4:0,end:8:3d4e6916:::rbd_data.a6dc2425de9600.0006249c:0,chunky:1,deep:0,version:9,allow_preemption:1,priority=5)
> v9 -- 0x56308bdf2000 con 0
> 2019-02-09 07:29:34.378535 7f5fcee0e700  1 --
> 192.168.222.202:6816/157436 <== osd.29 192.168.222.204:6804/4159520
> 1947  rep_scrubmap(8.2bc e1205735 from shard 29) v2  40+0+1494
> (3695125937 0 865217733) 0x563092d90900 con 0x56308bf61000
> 2019-02-09 07:29:34.415868 7f5faf4b4700  1 --
> 192.168.222.202:6816/157436 --> 192.168.222.204:6804/4159520 --
> osd_repop(client.171725953.0:404377591 8.9b e1205833/1205735
> 8:d90adab6:::rbd_data.c47f3c390c8495.0001934a:head v
> 1205833'4767322) v2 -- 0x56308ca42400 con 0
> 2019-02-09 07:29:34.486296 7f5faf4b4700  1 --
> 192.168.222.202:6816/157436 --> 192.168.222.204:6804/4159520 --
> replica_scrub(pg:
>
> 8.2bc,from:0'0,to:0'0,epoch:1205833/1205735,start:8:3d4e6916:::rbd_data.a6dc2425de9600.0006249c:0,end:8:3d4e7434:::rbd_data.47c1b437840214.0003c594:0,chunky:1,deep:0,version:9,allow_preemption:1,priority=5)
> v9 -- 0x56308e565340 con 0
>
> I'd be taking a good, hard look at the network, yes.
>
> On Sat, Feb 9, 2019 at 6:33 PM Massimo Sgaravatto
>  wrote:
> >
> > Thanks for your feedback !
> >
> > I increased debug_ms to 1/5.
> >
> > This is another slow request (full output from 'ceph daemon osd.5
> dump_historic_ops' for this event is attached):
> >
> >
> > {
> > "description": "osd_op(client.171725953.0:404377591 8.9b
> 8:d90adab6:
> > ::rbd_data.c47f3c390c8495.0001934a:head [set-alloc-hint
> object_size 4194
> > 304 write_size 4194304,write 1413120~122880] snapc 0=[]
> ondisk+write+known_if_re
> > directed e1205833)",
> > "initiated_at": "2019-02-09 07:29:34.404655",
> > "age": 387.914193,
> > "duration": 340.224154,
> > "type_data": {
> > "flag_point": "commit sent; apply or cleanup",
> > "client_info": {
> > "client": "client.171725953",
> > "client_addr": "192.168.61.66:0/4056439540",
> > "tid": 404377591
> > },
> > "events": [
> > {
> > "time": "2019-02-09 07:29:34.404655",
> > "event": "initiated"
> > },
> > 
> > 
> >{
> > "time": "2019-02-09 07:29:34.416752",
> > "event": "op_applied"
> > },
> > {
> > "time": "2019-02-09 07:29:34.417200",
> > "event": "sub_op_commit_rec from 14"
> > },
> > {
> > "time": "2019-02-09 07:35:14.628724",
> > "event": "sub_op_commit_rec from 29"
> > },
> >
> > osd.5 has IP 192.168.222.202
> > osd.14 has IP 192.168.222.203
> > osd.29 has IP 192.168.222.204
> >
> >
> > Grepping for that client id in the osd.5 log, as far as I can understand
> (please correct me if my debugging is completely wrong), the replication
> requests to OSDs 14 and 29 are sent at 07:29:34:
> >
> > 2019-02-09 07:29:34.415808 7f5faf4b4700  1 --
> 192.168.222.202:6816/157436 --> 192.168.222.203:6811/158495 --
> osd_repop(client.171725953.0:404377591 8.9b e1205833/1205735 8:d90ada\
> > b6:::rbd_data.c47f3c390c8495.0001934a:head v 1205833'4767322) v2
> -- 0x56307bb61e00 con 0
> > 2019-02-09 07:29:34.415868 7f5faf4b4700  1 

Re: [ceph-users] Debugging 'slow requests' ...

2019-02-09 Thread Brad Hubbard
The log ends at

$ zcat ceph-osd.5.log.gz |tail -2
2019-02-09 07:37:00.022534 7f5fce60d700  1 --
192.168.61.202:6816/157436 >> - conn(0x56308edcf000 :6816
s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=296 -

The last two messages are outbound to 192.168.222.204 and there are no
further messages between these two hosts (other than osd_pings) in the
log.

$ zcat ceph-osd.5.log.gz |gawk
'!/osd_ping/&&/192.168.222.202/&&/192.168.222.204/&&/07:29:3/'|tail -5
2019-02-09 07:29:34.267744 7f5fcee0e700  1 --
192.168.222.202:6816/157436 <== osd.29 192.168.222.204:6804/4159520
1946  rep_scrubmap(8.2bc e1205735 from shard 29) v2  40+0+1492
(3695125937 0 2050362985) 0x563090674d80 con 0x56308bf61000
2019-02-09 07:29:34.375223 7f5faf4b4700  1 --
192.168.222.202:6816/157436 --> 192.168.222.204:6804/4159520 --
replica_scrub(pg:
8.2bc,from:0'0,to:0'0,epoch:1205833/1205735,start:8:3d4e6145:::rbd_data.35f46d19abe4ed.77a4:0,end:8:3d4e6916:::rbd_data.a6dc2425de9600.0006249c:0,chunky:1,deep:0,version:9,allow_preemption:1,priority=5)
v9 -- 0x56308bdf2000 con 0
2019-02-09 07:29:34.378535 7f5fcee0e700  1 --
192.168.222.202:6816/157436 <== osd.29 192.168.222.204:6804/4159520
1947  rep_scrubmap(8.2bc e1205735 from shard 29) v2  40+0+1494
(3695125937 0 865217733) 0x563092d90900 con 0x56308bf61000
2019-02-09 07:29:34.415868 7f5faf4b4700  1 --
192.168.222.202:6816/157436 --> 192.168.222.204:6804/4159520 --
osd_repop(client.171725953.0:404377591 8.9b e1205833/1205735
8:d90adab6:::rbd_data.c47f3c390c8495.0001934a:head v
1205833'4767322) v2 -- 0x56308ca42400 con 0
2019-02-09 07:29:34.486296 7f5faf4b4700  1 --
192.168.222.202:6816/157436 --> 192.168.222.204:6804/4159520 --
replica_scrub(pg:
8.2bc,from:0'0,to:0'0,epoch:1205833/1205735,start:8:3d4e6916:::rbd_data.a6dc2425de9600.0006249c:0,end:8:3d4e7434:::rbd_data.47c1b437840214.0003c594:0,chunky:1,deep:0,version:9,allow_preemption:1,priority=5)
v9 -- 0x56308e565340 con 0

I'd be taking a good, hard look at the network, yes.
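
(A few low-level checks that could help narrow that down, as a sketch; eth0 is
just a placeholder for the cluster-network interface on each OSD host:)

# per-NIC drop/error counters
$ ip -s link show dev eth0
# driver/ring-level counters, if the NIC exposes them
$ ethtool -S eth0 | egrep -i 'drop|err|fifo|pause'
# TCP retransmissions seen by the kernel while the slow requests occur
$ nstat -az | egrep 'TcpRetransSegs|TcpExtTCPLostRetransmit'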

On Sat, Feb 9, 2019 at 6:33 PM Massimo Sgaravatto
 wrote:
>
> Thanks for your feedback !
>
> I increased debug_ms to 1/5.
>
> This is another slow request (full output from 'ceph daemon osd.5 
> dump_historic_ops' for this event is attached):
>
>
> {
> "description": "osd_op(client.171725953.0:404377591 8.9b 
> 8:d90adab6:
> ::rbd_data.c47f3c390c8495.0001934a:head [set-alloc-hint object_size 
> 4194
> 304 write_size 4194304,write 1413120~122880] snapc 0=[] 
> ondisk+write+known_if_re
> directed e1205833)",
> "initiated_at": "2019-02-09 07:29:34.404655",
> "age": 387.914193,
> "duration": 340.224154,
> "type_data": {
> "flag_point": "commit sent; apply or cleanup",
> "client_info": {
> "client": "client.171725953",
> "client_addr": "192.168.61.66:0/4056439540",
> "tid": 404377591
> },
> "events": [
> {
> "time": "2019-02-09 07:29:34.404655",
> "event": "initiated"
> },
> 
> 
>{
> "time": "2019-02-09 07:29:34.416752",
> "event": "op_applied"
> },
> {
> "time": "2019-02-09 07:29:34.417200",
> "event": "sub_op_commit_rec from 14"
> },
> {
> "time": "2019-02-09 07:35:14.628724",
> "event": "sub_op_commit_rec from 29"
> },
>
> osd.5 has IP 192.168.222.202
> osd.14 has IP 192.168.222.203
> osd.29 has IP 192.168.222.204
>
>
> Grepping for that client id in the osd.5 log, as far as I can understand
> (please correct me if my debugging is completely wrong), the replication
> requests to OSDs 14 and 29 are sent at 07:29:34:
>
> 2019-02-09 07:29:34.415808 7f5faf4b4700  1 -- 192.168.222.202:6816/157436 --> 
> 192.168.222.203:6811/158495 -- osd_repop(client.171725953.0:404377591 8.9b 
> e1205833/1205735 8:d90ada\
> b6:::rbd_data.c47f3c390c8495.0001934a:head v 1205833'4767322) v2 -- 
> 0x56307bb61e00 con 0
> 2019-02-09 07:29:34.415868 7f5faf4b4700  1 -- 192.168.222.202:6816/157436 --> 
> 192.168.222.204:6804/4159520 -- osd_repop(client.171725953.0:404377591 8.9b 
> e1205833/1205735 8:d90ad\
> ab6:::rbd_data.c47f3c390c8495.0001934a:head v 1205833'4767322) v2 -- 
> 0x56308ca42400 con 0
> 2019-02-09 07:29:34.417132 7f5fcf60f700  1 -- 192.168.222.202:6816/157436 <== 
> osd.14 192.168.222.203:6811/158495 11242  
> osd_repop_reply(client.171725953.0:404377591 8.9b e120\
> 5833/1205735) v2  111+0+0 (634943494 0 0) 0x563090642780 con 
> 0x56308bbd
>
> The answer from 14 arrives immediately:
>
> 2019-02-09 07:29:34.417132 7f5fcf60f700  1 -- 

Re: [ceph-users] Debugging 'slow requests' ...

2019-02-09 Thread Massimo Sgaravatto
Thanks for your feedback!

I increased debug_ms to 1/5.

This is another slow request (full output from 'ceph daemon osd.5
dump_historic_ops' for this event is attached):


{
"description": "osd_op(client.171725953.0:404377591 8.9b
8:d90adab6:
::rbd_data.c47f3c390c8495.0001934a:head [set-alloc-hint object_size
4194
304 write_size 4194304,write 1413120~122880] snapc 0=[]
ondisk+write+known_if_re
directed e1205833)",
"initiated_at": "2019-02-09 07:29:34.404655",
"age": 387.914193,
"duration": 340.224154,
"type_data": {
"flag_point": "commit sent; apply or cleanup",
"client_info": {
"client": "client.171725953",
"client_addr": "192.168.61.66:0/4056439540",
"tid": 404377591
},
"events": [
{
"time": "2019-02-09 07:29:34.404655",
"event": "initiated"
},


   {
"time": "2019-02-09 07:29:34.416752",
"event": "op_applied"
},
{
"time": "2019-02-09 07:29:34.417200",
"event": "sub_op_commit_rec from 14"
},
{
"time": "2019-02-09 07:35:14.628724",
"event": "sub_op_commit_rec from 29"
},

osd.5 has IP 192.168.222.202
osd.14 has IP 192.168.222.203
osd.29 has IP 192.168.222.204


Grepping for that client id in the osd.5 log, as far as I can understand
(please correct me if my debugging is completely wrong), the replication
requests to OSDs 14 and 29 are sent at 07:29:34:

2019-02-09 07:29:34.415808 7f5faf4b4700  1 -- 192.168.222.202:6816/157436
--> 192.168.222.203:6811/158495 -- osd_repop(client.171725953.0:404377591
8.9b e1205833/1205735 8:d90ada\
b6:::rbd_data.c47f3c390c8495.0001934a:head v 1205833'4767322) v2 --
0x56307bb61e00 con 0
2019-02-09 07:29:34.415868 7f5faf4b4700  1 -- 192.168.222.202:6816/157436
--> 192.168.222.204:6804/4159520 -- osd_repop(client.171725953.0:404377591
8.9b e1205833/1205735 8:d90ad\
ab6:::rbd_data.c47f3c390c8495.0001934a:head v 1205833'4767322) v2
-- 0x56308ca42400 con 0
2019-02-09 07:29:34.417132 7f5fcf60f700  1 -- 192.168.222.202:6816/157436
<== osd.14 192.168.222.203:6811/158495 11242 
osd_repop_reply(client.171725953.0:404377591 8.9b e120\
5833/1205735) v2  111+0+0 (634943494 0 0) 0x563090642780 con
0x56308bbd
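
(For reference, the osd.5 log excerpts in this message can be reproduced with
a filter along these lines; a sketch, using the client request id reported by
dump_historic_ops:)

$ zcat ceph-osd.5.log.gz | grep 'client.171725953.0:404377591'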

The answer from 14 arrives immediately:

2019-02-09 07:29:34.417132 7f5fcf60f700  1 -- 192.168.222.202:6816/157436
<== osd.14 192.168.222.203:6811/158495 11242 
osd_repop_reply(client.171725953.0:404377591 8.9b e120\
5833/1205735) v2  111+0+0 (634943494 0 0) 0x563090642780 con
0x56308bbd

while the one from 29 arrives only at 07:35:

2019-02-09 07:35:14.628614 7f5fcee0e700  1 -- 192.168.222.202:6816/157436
<== osd.29 192.168.222.204:6804/4159520 1952 
osd_repop_reply(client.171725953.0:404377591 8.9b e120\
5833/1205735) v2  111+0+0 (3804866849 0 0) 0x56308f3f2a00 con
0x56308bf61000


In the osd.29 log it looks like the request only arrives at 07:35 (and osd.29
promptly replies):

2019-02-09 07:35:14.627462 7f99972cc700  1 -- 192.168.222.204:6804/4159520
<== osd.5 192.168.222.202:6816/157436 2527 
osd_repop(client.171725953.0:404377591 8.9b e1205833/1205735) v2 
1050+0+123635 (1225076790 0 171428115) 0x5610f5128a00 con 0x5610fc5bf000
2019-02-09 07:35:14.628343 7f998d6d4700  1 -- 192.168.222.204:6804/4159520
--> 192.168.222.202:6816/157436 --
osd_repop_reply(client.171725953.0:404377591 8.9b e1205833/1205735 ondisk,
result = 0) v2 -- 0x5610f4a51180 con 0


Network problems?
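
(One way to tell whether the osd_repop actually left osd.5's host at 07:29 and
was lost in transit, or only reached osd.29 at 07:35, would be to capture the
traffic between the two cluster addresses while the problem reproduces; a
sketch, with the interface name being an assumption and 6816/6804 taken from
the log lines above:)

$ tcpdump -i eth1 -w osd5-osd29.pcap \
    'host 192.168.222.202 and host 192.168.222.204 and (port 6816 or port 6804)'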


Full logs for the 3 relevant OSDs (just for that time period) are at:
https://drive.google.com/drive/folders/1TG5MomMJsqVbsuFosvYokNptLufxOnPY?usp=sharing



Thanks again!
Cheers, Massimo



On Fri, Feb 8, 2019 at 11:50 PM Brad Hubbard  wrote:

> Try capturing another log with debug_ms turned up. 1 or 5 should be Ok
> to start with.
>
> On Fri, Feb 8, 2019 at 8:37 PM Massimo Sgaravatto
>  wrote:
> >
> > Our Luminous ceph cluster has been working without problems for a while,
> but in the last few days we have been suffering from continuous slow requests.
> >
> > We have indeed done some changes in the infrastructure recently:
> >
> > - Moved OSD nodes to a new switch
> > - Increased the number of PGs for a pool, to get to roughly 100 PGs/OSD
> (also because we have to install new OSDs in the cluster). The output of
> 'ceph osd df' is attached.
> >
> > The problem could also be due to some 'bad' client, but in the logs I
> don't see a clear correlation with specific clients or images for such
> blocked requests.
> >
> > I also tried to update to the latest Luminous release and the latest
> CentOS 7, but this didn't help.
> >
> >
> >
> > Attached you can find the 

Re: [ceph-users] Debugging 'slow requests' ...

2019-02-08 Thread Brad Hubbard
Try capturing another log with debug_ms turned up. 1 or 5 should be OK
to start with.
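
(For example, something along these lines should raise it at runtime on one
OSD and put it back afterwards; a sketch, substitute the OSD id as needed:)

# on the OSD host, via the admin socket
$ ceph daemon osd.11 config set debug_ms 1/5
# or remotely from a node with an admin keyring
$ ceph tell osd.11 injectargs '--debug_ms 1/5'
# revert to the usual default when done
$ ceph tell osd.11 injectargs '--debug_ms 0/5'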

On Fri, Feb 8, 2019 at 8:37 PM Massimo Sgaravatto
 wrote:
>
> Our Luminous ceph cluster has been working without problems for a while, but
> in the last few days we have been suffering from continuous slow requests.
>
> We have indeed done some changes in the infrastructure recently:
>
> - Moved OSD nodes to a new switch
> - Increased the number of PGs for a pool, to get to roughly 100 PGs/OSD
> (also because we have to install new OSDs in the cluster). The output of
> 'ceph osd df' is attached.
>
> The problem could also be due to some 'bad' client, but in the logs I don't
> see a clear correlation with specific clients or images for such blocked
> requests.
>
> I also tried to update to the latest Luminous release and the latest
> CentOS 7, but this didn't help.
>
>
>
> Attached you can find the details of one such slow operation, which took
> about 266 seconds (output from 'ceph daemon osd.11 dump_historic_ops').
> So as far as I can understand from these events:
> {
> "time": "2019-02-08 10:26:25.651728",
> "event": "op_commit"
> },
> {
> "time": "2019-02-08 10:26:25.651965",
> "event": "op_applied"
> },
>
>   {
> "time": "2019-02-08 10:26:25.653236",
> "event": "sub_op_commit_rec from 33"
> },
> {
> "time": "2019-02-08 10:30:51.890404",
> "event": "sub_op_commit_rec from 23"
> },
>
> the problem seems to be with the "sub_op_commit_rec from 23" event, which
> took too long.
> So is the problem that the answer from OSD 23 took too long?
>
>
> In the logs of the 2 OSDs (11 and 23) in that time frame (attached) I can't
> find anything useful.
> When the problem happened, the load and memory usage were not high on the
> relevant nodes.
>
>
> Any help to debug the issue is really appreciated ! :-)
>
> Thanks, Massimo
>



-- 
Cheers,
Brad


[ceph-users] Debugging 'slow requests' ...

2019-02-08 Thread Massimo Sgaravatto
Our Luminous ceph cluster has been working without problems for a while,
but in the last few days we have been suffering from continuous slow requests.

We have indeed done some changes in the infrastructure recently:

- Moved OSD nodes to a new switch
- Increased the number of PGs for a pool, to get to roughly 100 PGs/OSD
(also because we have to install new OSDs in the cluster). The output of
'ceph osd df' is attached.

The problem could also be due to some 'bad' client, but in the logs I don't
see a clear correlation with specific clients or images for such blocked
requests.

I also tried to update to the latest Luminous release and the latest
CentOS 7, but this didn't help.



Attached you can find the details of one such slow operation, which took
about 266 seconds (output from 'ceph daemon osd.11 dump_historic_ops').
So as far as I can understand from these events:
{
"time": "2019-02-08 10:26:25.651728",
"event": "op_commit"
},
{
"time": "2019-02-08 10:26:25.651965",
"event": "op_applied"
},

  {
"time": "2019-02-08 10:26:25.653236",
"event": "sub_op_commit_rec from 33"
},
{
"time": "2019-02-08 10:30:51.890404",
"event": "sub_op_commit_rec from 23"
},

the problem seems to be with the "sub_op_commit_rec from 23" event, which
took too long.
So is the problem that the answer from OSD 23 took too long?
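
(For reference, a quick way to pull the duration and flag point of every op
kept in the history is something like the following; a sketch, assuming jq is
installed, and the exact JSON layout can differ between releases:)

$ ceph daemon osd.11 dump_historic_ops | \
    jq -r '.ops[] | [.duration, .type_data.flag_point, .description] | @tsv' | \
    sort -rn | head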


In the logs of the 2 OSDs (11 and 23) in that time frame (attached) I can't
find anything useful.
When the problem happened, the load and memory usage were not high on the
relevant nodes.


Any help to debug the issue is really appreciated! :-)

Thanks, Massimo
{
{
"description": "osd_op(client.171725953.0:368920029 8.208 
8:10448e0e:::rbd_data.c47f3c390c8495.00018b81:head [set-alloc-hint 
object_size 4194304 write_size 4194304,write 2744320~200704] snapc 0=[] 
ondisk+write+known_if_redirected e1203982)",
"initiated_at": "2019-02-08 10:26:25.614728",
"age": 331.598616,
"duration": 266.275721,
"type_data": {
"flag_point": "commit sent; apply or cleanup",
"client_info": {
"client": "client.171725953",
"client_addr": "192.168.61.66:0/4056439540",
"tid": 368920029
},
"events": [
{
"time": "2019-02-08 10:26:25.614728",
"event": "initiated"
},
{
"time": "2019-02-08 10:26:25.650783",
"event": "queued_for_pg"
},
{
"time": "2019-02-08 10:26:25.650819",
"event": "reached_pg"
},
{
"time": "2019-02-08 10:26:25.650860",
"event": "started"
},
{
"time": "2019-02-08 10:26:25.650919",
"event": "waiting for subops from 23,33"
},
{
"time": "2019-02-08 10:26:25.65",
"event": "commit_queued_for_journal_write"
},
{
"time": "2019-02-08 10:26:25.65",
"event": "commit_queued_for_journal_write"
},
{
"time": "2019-02-08 10:26:25.651141",
"event": "write_thread_in_journal_buffer"
},
{
"time": "2019-02-08 10:26:25.651694",
"event": "journaled_completion_queued"
},
{
"time": "2019-02-08 10:26:25.651728",
"event": "op_commit"
},
{
"time": "2019-02-08 10:26:25.651965",
"event": "op_applied"
},
{
"time": "2019-02-08 10:26:25.653236",
"event": "sub_op_commit_rec from 33"
},
{
"time": "2019-02-08 10:30:51.890404", 
"event": "sub_op_commit_rec from 23"
},
{
"time": "2019-02-08 10:30:51.890434",
"event": "commit_sent"
},
{
"time": "2019-02-08 10:30:51.890450",