Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
On Friday, 18 September 2015 at 12:04 +0200, Jan Schermer wrote:
> > On 18 Sep 2015, at 11:28, Christian Balzer  wrote:
> > 
> > On Fri, 18 Sep 2015 11:07:49 +0200 Olivier Bonvalet wrote:
> > 
> > > Le vendredi 18 septembre 2015 à 10:59 +0200, Jan Schermer a écrit
> > > :
> > > > In that case it can either be slow monitors (slow network, slow
> > > > disks(!!!)  or a CPU or memory problem).
> > > > But it still can also be on the OSD side in the form of either
> > > > CPU
> > > > usage or memory pressure - in my case there were lots of memory
> > > > used
> > > > for pagecache (so for all intents and purposes considered
> > > > "free") but
> > > > when peering the OSD had trouble allocating any memory from it
> > > > and it
> > > > caused lots of slow ops and peering hanging in there for a
> > > > while.
> > > > This also doesn't show as high CPU usage, only kswapd spins up
> > > > a bit
> > > > (don't be fooled by its name, it has nothing to do with swap in
> > > > this
> > > > case).
> > > 
> > > My nodes have 256GB of RAM (for 12x300GB ones) or 128GB of RAM
> > > (for
> > > 4x800GB ones), so I will try track this too. Thanks !
> > > 
> > I haven't seen this (known problem) with 64GB or 128GB nodes,
> > probably
> > because I set /proc/sys/vm/min_free_kbytes to 512MB or 1GB
> > respectively.
> > 
> 
> I had this set to 6G and that doesn't help. This "buffer" is probably
> only useful for some atomic allocations that can use it, not for
> userland processes and their memory. Or maybe they get memory from
> this pool but it gets replenished immediately.
> QEMU has no problem allocating 64G on the same host, while the OSD struggles to
> allocate memory during startup or when PGs are added during
> rebalancing - probably because it does a lot of smaller allocations
> instead of one big one.
> 

For now I dropped the caches *and* set min_free_kbytes to 1GB. I haven't triggered
any rebalance yet, but I can already see a reduced filestore.commitcycle_latency.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Jan Schermer
In that case it can either be slow monitors (slow network, slow disks(!!!)), or
a CPU or memory problem.
But it can still also be on the OSD side, in the form of either CPU usage or
memory pressure - in my case there was a lot of memory used for pagecache (so
for all intents and purposes considered "free"), but when peering the OSD had
trouble allocating any memory from it, which caused lots of slow ops and
peering hanging in there for a while. This also doesn't show up as high CPU usage;
only kswapd spins up a bit (don't be fooled by its name, it has nothing to do
with swap in this case).

echo 1 >/proc/sys/vm/drop_caches before I touch anything has become a routine 
now and that problem is gone.
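A minimal sketch of that routine (the meminfo check at the end is just my way of
verifying it, adjust to taste):

sync
echo 1 > /proc/sys/vm/drop_caches         # drop clean pagecache so big allocations succeed again
grep -E 'MemFree|^Cached' /proc/meminfo   # how much is really free vs. sitting in pagecache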

Jan

> On 18 Sep 2015, at 10:53, Olivier Bonvalet  wrote:
> 
> mmm good point.
> 
> I don't see CPU or IO problem on mons, but in logs, I have this :
> 
> 2015-09-18 01:55:16.921027 7fb951175700  0 log [INF] : pgmap v86359128:
> 6632 pgs: 77 inactive, 1 remapped, 10 active+remapped+wait_backfill, 25
> peering, 5 active+remapped, 6 active+remapped+backfilling, 6499
> active+clean, 9 remapped+peering; 18974 GB data, 69004 GB used, 58578
> GB / 124 TB avail; 915 kB/s rd, 26383 kB/s wr, 1671 op/s; 8417/15680513
> objects degraded (0.054%); 1062 MB/s, 274 objects/s recovering
> 
> 
> So... it can be a peering problem. Didn't see that, thanks.
> 
> 
> 
> Le vendredi 18 septembre 2015 à 09:52 +0200, Jan Schermer a écrit :
>> Could this be caused by monitors? In my case lagging monitors can
>> also cause slow requests (because of slow peering). Not sure if
>> that's expected or not, but it of course doesn't show on the OSDs as
>> any kind of bottleneck when you try to investigate...
>> 
>> Jan
>> 
>>> On 18 Sep 2015, at 09:37, Olivier Bonvalet 
>>> wrote:
>>> 
>>> Hi,
>>> 
>>> sorry for missing informations. I was to avoid putting too much
>>> inappropriate infos ;)
>>> 
>>> 
>>> 
>>> Le vendredi 18 septembre 2015 à 12:30 +0900, Christian Balzer a
>>> écrit :
 Hello,
 
 On Fri, 18 Sep 2015 02:43:49 +0200 Olivier Bonvalet wrote:
 
 The items below help, but be a s specific as possible, from OS,
 kernel
 version to Ceph version, "ceph -s", any other specific details
 (pool
 type,
 replica size).
 
>>> 
>>> So, all nodes use Debian Wheezy, running on a vanilla 3.14.x
>>> kernel,
>>> and Ceph 0.80.10.
>>> I don't have anymore ceph status right now. But I have
>>> data to move tonight again, so I'll track that.
>>> 
>>> The affected pool is a standard one (no erasure coding), with only
>>> 2 replica (size=2).
>>> 
>>> 
>>> 
>>> 
> Some additionnal informations :
> - I have 4 SSD per node.
 Type, if nothing else for anecdotal reasons.
>>> 
>>> I have 7 storage nodes here :
>>> - 3 nodes which have each 12 OSD of 300GB
>>> SSD
>>> - 4 nodes which have each  4 OSD of 800GB SSD
>>> 
>>> And I'm trying to replace 12x300GB nodes by 4x800GB nodes.
>>> 
>>> 
>>> 
> - the CPU usage is near 0
> - IO wait is near 0 too
 Including the trouble OSD(s)?
>>> 
>>> Yes
>>> 
>>> 
 Measured how, iostat or atop?
>>> 
>>> iostat, htop, and confirmed with Zabbix supervisor.
>>> 
>>> 
>>> 
>>> 
> - bandwith usage is also near 0
> 
 Yeah, all of the above are not surprising if everything is stuck
 waiting
 on some ops to finish. 
 
 How many nodes are we talking about?
>>> 
>>> 
>>> 7 nodes, 52 OSDs.
>>> 
>>> 
>>> 
> The whole cluster seems waiting for something... but I don't
> see
> what.
> 
 Is it just one specific OSD (or a set of them) or is that all
 over
 the
 place?
>>> 
>>> A set of them. When I increase the weight of all 4 OSDs of a node,
>>> I
>>> frequently have blocked IO from 1 OSD of this node.
>>> 
>>> 
>>> 
 Does restarting the OSD fix things?
>>> 
>>> Yes. For several minutes.
>>> 
>>> 
 Christian
> 
> Le vendredi 18 septembre 2015 à 02:35 +0200, Olivier Bonvalet a
> écrit :
>> Hi,
>> 
>> I have a cluster with lot of blocked operations each time I
>> try
>> to
>> move
>> data (by reweighting a little an OSD).
>> 
>> It's a full SSD cluster, with 10GbE network.
>> 
>> In logs, when I have blocked OSD, on the main OSD I can see
>> that
>> :
>> 2015-09-18 01:55:16.981396 7f89e8cb8700  0 log [WRN] : 2 slow
>> requests, 1 included below; oldest blocked for > 33.976680
>> secs
>> 2015-09-18 01:55:16.981402 7f89e8cb8700  0 log [WRN] : slow
>> request
>> 30.125556 seconds old, received at 2015-09-18
>> 01:54:46.855821:
>> osd_op(client.29760717.1:18680817544
>> rb.0.1c16005.238e1f29.027f [write 180224~16384]
>> 6.c11916a4
>> snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4
>> currently
>> reached pg
>> 2015-09-18 01:55:46.986319 7f89e8cb8700  0 log [WRN] : 2 slow
>> requests, 1 included below; oldest blocked for > 

Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Christian Balzer
On Fri, 18 Sep 2015 11:07:49 +0200 Olivier Bonvalet wrote:

> Le vendredi 18 septembre 2015 à 10:59 +0200, Jan Schermer a écrit :
> > In that case it can either be slow monitors (slow network, slow
> > disks(!!!)  or a CPU or memory problem).
> > But it still can also be on the OSD side in the form of either CPU
> > usage or memory pressure - in my case there were lots of memory used
> > for pagecache (so for all intents and purposes considered "free") but
> > when peering the OSD had trouble allocating any memory from it and it
> > caused lots of slow ops and peering hanging in there for a while.
> > This also doesn't show as high CPU usage, only kswapd spins up a bit
> > (don't be fooled by its name, it has nothing to do with swap in this
> > case).
> 
> My nodes have 256GB of RAM (for the 12x300GB ones) or 128GB of RAM (for the
> 4x800GB ones), so I will try to track this too. Thanks!
> 
I haven't seen this (known problem) with 64GB or 128GB nodes, probably
because I set /proc/sys/vm/min_free_kbytes to 512MB or 1GB respectively.
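For reference, something along these lines (value in kB, so 1048576 is 1GB; the
sysctl.d file name is just an example):

sysctl -w vm.min_free_kbytes=1048576
echo 'vm.min_free_kbytes = 1048576' >> /etc/sysctl.d/local-tuning.conf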

Christian.

> 
> > echo 1 >/proc/sys/vm/drop_caches before I touch anything has become a
> > routine now and that problem is gone.
> > 
> > Jan
> > 
> > > On 18 Sep 2015, at 10:53, Olivier Bonvalet 
> > > wrote:
> > > 
> > > mmm good point.
> > > 
> > > I don't see CPU or IO problem on mons, but in logs, I have this :
> > > 
> > > 2015-09-18 01:55:16.921027 7fb951175700  0 log [INF] : pgmap
> > > v86359128:
> > > 6632 pgs: 77 inactive, 1 remapped, 10
> > > active+remapped+wait_backfill, 25
> > > peering, 5 active+remapped, 6 active+remapped+backfilling, 6499
> > > active+clean, 9 remapped+peering; 18974 GB data, 69004 GB used,
> > > 58578
> > > GB / 124 TB avail; 915 kB/s rd, 26383 kB/s wr, 1671 op/s;
> > > 8417/15680513
> > > objects degraded (0.054%); 1062 MB/s, 274 objects/s recovering
> > > 
> > > 
> > > So... it can be a peering problem. Didn't see that, thanks.
> > > 
> > > 
> > > 
> > > Le vendredi 18 septembre 2015 à 09:52 +0200, Jan Schermer a écrit :
> > > > Could this be caused by monitors? In my case lagging monitors can
> > > > also cause slow requests (because of slow peering). Not sure if
> > > > that's expected or not, but it of course doesn't show on the OSDs
> > > > as
> > > > any kind of bottleneck when you try to investigate...
> > > > 
> > > > Jan
> > > > 
> > > > > On 18 Sep 2015, at 09:37, Olivier Bonvalet wrote:
> > > > > 
> > > > > Hi,
> > > > > 
> > > > > sorry for missing informations. I was to avoid putting too much
> > > > > inappropriate infos ;)
> > > > > 
> > > > > 
> > > > > 
> > > > > Le vendredi 18 septembre 2015 à 12:30 +0900, Christian Balzer a
> > > > > écrit :
> > > > > > Hello,
> > > > > > 
> > > > > > On Fri, 18 Sep 2015 02:43:49 +0200 Olivier Bonvalet wrote:
> > > > > > 
> > > > > > The items below help, but be a s specific as possible, from
> > > > > > OS,
> > > > > > kernel
> > > > > > version to Ceph version, "ceph -s", any other specific
> > > > > > details
> > > > > > (pool
> > > > > > type,
> > > > > > replica size).
> > > > > > 
> > > > > 
> > > > > So, all nodes use Debian Wheezy, running on a vanilla 3.14.x
> > > > > kernel,
> > > > > and Ceph 0.80.10.
> > > > > I don't have anymore ceph status right now. But I have
> > > > > data to move tonight again, so I'll track that.
> > > > > 
> > > > > The affected pool is a standard one (no erasure coding), with
> > > > > only
> > > > > 2 replica (size=2).
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > > > Some additionnal informations :
> > > > > > > - I have 4 SSD per node.
> > > > > > Type, if nothing else for anecdotal reasons.
> > > > > 
> > > > > I have 7 storage nodes here :
> > > > > - 3 nodes which have each 12 OSD of 300GB
> > > > > SSD
> > > > > - 4 nodes which have each  4 OSD of 800GB SSD
> > > > > 
> > > > > And I'm trying to replace 12x300GB nodes by 4x800GB nodes.
> > > > > 
> > > > > 
> > > > > 
> > > > > > > - the CPU usage is near 0
> > > > > > > - IO wait is near 0 too
> > > > > > Including the trouble OSD(s)?
> > > > > 
> > > > > Yes
> > > > > 
> > > > > 
> > > > > > Measured how, iostat or atop?
> > > > > 
> > > > > iostat, htop, and confirmed with Zabbix supervisor.
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > > > - bandwith usage is also near 0
> > > > > > > 
> > > > > > Yeah, all of the above are not surprising if everything is
> > > > > > stuck
> > > > > > waiting
> > > > > > on some ops to finish. 
> > > > > > 
> > > > > > How many nodes are we talking about?
> > > > > 
> > > > > 
> > > > > 7 nodes, 52 OSDs.
> > > > > 
> > > > > 
> > > > > 
> > > > > > > The whole cluster seems waiting for something... but I
> > > > > > > don't
> > > > > > > see
> > > > > > > what.
> > > > > > > 
> > > > > > Is it just one specific OSD (or a set of them) or is that all
> > > > > > over
> > > > > > the
> > > > > > place?
> > > > > 
> > > > > A set of them. When I increase the weight of all 4 OSDs of a
> > > 

Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
On Friday, 18 September 2015 at 17:04 +0900, Christian Balzer wrote:
> Hello,
> 
> On Fri, 18 Sep 2015 09:37:24 +0200 Olivier Bonvalet wrote:
> 
> > Hi,
> > 
> > sorry for missing informations. I was to avoid putting too much
> > inappropriate infos ;)
> > 
> Nah, everything helps, there are known problems with some versions,
> kernels, file systems, etc.
> 
> Speaking of which, what FS are you using on your OSDs?
> 

XFS.

> > 
> > 
> > Le vendredi 18 septembre 2015 à 12:30 +0900, Christian Balzer a
> > écrit :
> > > Hello,
> > > 
> > > On Fri, 18 Sep 2015 02:43:49 +0200 Olivier Bonvalet wrote:
> > > 
> > > The items below help, but be a s specific as possible, from OS,
> > > kernel
> > > version to Ceph version, "ceph -s", any other specific details
> > > (pool
> > > type,
> > > replica size).
> > > 
> > 
> > So, all nodes use Debian Wheezy, running on a vanilla 3.14.x
> > kernel,
> > and Ceph 0.80.10.
> All my stuff is on Jessie, but at least Firefly should be stable and
> I
> haven't seen anything like your problem with it.
> And while 3.14 is a LTS kernel I wonder if something newer may be
> beneficial, but probably not.
> 

Well, I can try a 3.18.x kernel. But for that I have to restart all
nodes, which will trigger some backfilling and probably some blocked IO
too ;)
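(I suppose I could set noout around each reboot so it only causes a short peering
blip instead of a full backfill, something like this per node, if I understand the
noout flag correctly:)

ceph osd set noout      # before taking the node down for the kernel upgrade
# ... reboot the node, wait for its OSDs to rejoin ...
ceph osd unset noout    # once everything is back up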


> > I don't have anymore ceph status right now. But I have
> > data to move tonight again, so I'll track that.
> > 
> I was interested in that to see how many pools and PGs you have.

Well :

cluster de035250-323d-4cf6-8c4b-cf0faf6296b1
 health HEALTH_OK
 monmap e21: 3 mons at 
{faude=10.0.0.13:6789/0,murmillia=10.0.0.18:6789/0,rurkh=10.0.0.19:6789/0}, 
election epoch 4312, quorum 0,1,2 faude,murmillia,rurkh
 osdmap e847496: 88 osds: 88 up, 87 in
  pgmap v86390609: 6632 pgs, 16 pools, 18883 GB data, 5266 kobjects
68559 GB used, 59023 GB / 124 TB avail
6632 active+clean
  client io 3194 kB/s rd, 23542 kB/s wr, 1450 op/s


There are mainly 2 pools used: a "ssd" pool and a "hdd" pool. This hdd
pool uses different OSDs, on different nodes.
Since I don't often rebalance data in this hdd pool, I haven't yet seen
the problem on it.



> >  The affected pool is a standard one (no erasure coding), with only
> > 2
> > replica (size=2).
> > 
> Good, nothing fancy going on there then.
> 
> > 
> > 
> > 
> > > > Some additionnal informations :
> > > > - I have 4 SSD per node.
> > > Type, if nothing else for anecdotal reasons.
> > 
> > I have 7 storage nodes here :
> > - 3 nodes which have each 12 OSD of 300GB
> > SSD
> > - 4 nodes which have each  4 OSD of 800GB SSD
> > 
> > And I'm trying to replace 12x300GB nodes by 4x800GB nodes.
> > 
> Type as in model/maker, but helpful information.
> 

300GB models are Intel SSDSC2BB300G4 (DC S3500).
800GB models are Intel SSDSC2BB800H4 (DC S3500 I think).




> > 
> > 
> > > > - the CPU usage is near 0
> > > > - IO wait is near 0 too
> > > Including the trouble OSD(s)?
> > 
> > Yes
> > 
> > 
> > > Measured how, iostat or atop?
> > 
> > iostat, htop, and confirmed with Zabbix supervisor.
> > 
> 
> Good. I'm sure you checked for network errors. 
> Single network or split client/cluster network?
> 

It's the first thing I checked, and latency and packet loss are
monitored between each node and the mons, but maybe I forgot some checks.


> > 
> > 
> > 
> > > > - bandwith usage is also near 0
> > > > 
> > > Yeah, all of the above are not surprising if everything is stuck
> > > waiting
> > > on some ops to finish. 
> > > 
> > > How many nodes are we talking about?
> > 
> > 
> > 7 nodes, 52 OSDs.
> > 
> That should be below the threshold for most system tunables (there are
> various
> threads and articles on how to tune Ceph for "large" clusters).
> 
> Since this happens only when your cluster reshuffles data (and thus
> has
> more threads going) what is your ulimit setting for open files?


Wow... the default one on Debian Wheezy : 1024.



> > 
> > 
> > > > The whole cluster seems waiting for something... but I don't
> > > > see
> > > > what.
> > > > 
> > > Is it just one specific OSD (or a set of them) or is that all
> > > over
> > > the
> > > place?
> > 
> > A set of them. When I increase the weight of all 4 OSDs of a node,
> > I
> > frequently have blocked IO from 1 OSD of this node.
> > 
> The plot thickens, as in, the target of most writes (new PGs being
> moved
> there) is the culprit.
> 

Yes.


> > 
> > 
> > > Does restarting the OSD fix things?
> > 
> > Yes. For several minutes.
> > 
> That also ties into a resource starvation of sorts, I'd investigate
> along
> those lines.

Yes, I agree. I will increase the OSD verbosity.


> Christian
> > 
> > > Christian
> > > > 
> > > > Le vendredi 18 septembre 2015 à 02:35 +0200, Olivier Bonvalet a
> > > > écrit :
> > > > > Hi,
> > > > > 
> > > > > I have a cluster with lot of blocked operations each time I
> > > > > try
> > > > > to
> > > > > move
> > > > > data (by reweighting a little an OSD).
> > > > > 
> > > > > It's a full SSD 

Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
On Friday, 18 September 2015 at 10:59 +0200, Jan Schermer wrote:
> In that case it can either be slow monitors (slow network, slow
> disks(!!!)), or a CPU or memory problem.
> But it still can also be on the OSD side in the form of either CPU
> usage or memory pressure - in my case there were lots of memory used
> for pagecache (so for all intents and purposes considered "free") but
> when peering the OSD had trouble allocating any memory from it and it
> caused lots of slow ops and peering hanging in there for a while.
> This also doesn't show as high CPU usage, only kswapd spins up a bit
> (don't be fooled by its name, it has nothing to do with swap in this
> case).

My nodes have 256GB of RAM (for the 12x300GB ones) or 128GB of RAM (for the
4x800GB ones), so I will try to track this too. Thanks!


> echo 1 >/proc/sys/vm/drop_caches before I touch anything has become a
> routine now and that problem is gone.
> 
> Jan
> 
> > On 18 Sep 2015, at 10:53, Olivier Bonvalet 
> > wrote:
> > 
> > mmm good point.
> > 
> > I don't see CPU or IO problem on mons, but in logs, I have this :
> > 
> > 2015-09-18 01:55:16.921027 7fb951175700  0 log [INF] : pgmap
> > v86359128:
> > 6632 pgs: 77 inactive, 1 remapped, 10
> > active+remapped+wait_backfill, 25
> > peering, 5 active+remapped, 6 active+remapped+backfilling, 6499
> > active+clean, 9 remapped+peering; 18974 GB data, 69004 GB used,
> > 58578
> > GB / 124 TB avail; 915 kB/s rd, 26383 kB/s wr, 1671 op/s;
> > 8417/15680513
> > objects degraded (0.054%); 1062 MB/s, 274 objects/s recovering
> > 
> > 
> > So... it can be a peering problem. Didn't see that, thanks.
> > 
> > 
> > 
> > Le vendredi 18 septembre 2015 à 09:52 +0200, Jan Schermer a écrit :
> > > Could this be caused by monitors? In my case lagging monitors can
> > > also cause slow requests (because of slow peering). Not sure if
> > > that's expected or not, but it of course doesn't show on the OSDs
> > > as
> > > any kind of bottleneck when you try to investigate...
> > > 
> > > Jan
> > > 
> > > > On 18 Sep 2015, at 09:37, Olivier Bonvalet wrote:
> > > > 
> > > > Hi,
> > > > 
> > > > sorry for missing informations. I was to avoid putting too much
> > > > inappropriate infos ;)
> > > > 
> > > > 
> > > > 
> > > > Le vendredi 18 septembre 2015 à 12:30 +0900, Christian Balzer a
> > > > écrit :
> > > > > Hello,
> > > > > 
> > > > > On Fri, 18 Sep 2015 02:43:49 +0200 Olivier Bonvalet wrote:
> > > > > 
> > > > > The items below help, but be a s specific as possible, from
> > > > > OS,
> > > > > kernel
> > > > > version to Ceph version, "ceph -s", any other specific
> > > > > details
> > > > > (pool
> > > > > type,
> > > > > replica size).
> > > > > 
> > > > 
> > > > So, all nodes use Debian Wheezy, running on a vanilla 3.14.x
> > > > kernel,
> > > > and Ceph 0.80.10.
> > > > I don't have anymore ceph status right now. But I have
> > > > data to move tonight again, so I'll track that.
> > > > 
> > > > The affected pool is a standard one (no erasure coding), with
> > > > only
> > > > 2 replica (size=2).
> > > > 
> > > > 
> > > > 
> > > > 
> > > > > > Some additionnal informations :
> > > > > > - I have 4 SSD per node.
> > > > > Type, if nothing else for anecdotal reasons.
> > > > 
> > > > I have 7 storage nodes here :
> > > > - 3 nodes which have each 12 OSD of 300GB
> > > > SSD
> > > > - 4 nodes which have each  4 OSD of 800GB SSD
> > > > 
> > > > And I'm trying to replace 12x300GB nodes by 4x800GB nodes.
> > > > 
> > > > 
> > > > 
> > > > > > - the CPU usage is near 0
> > > > > > - IO wait is near 0 too
> > > > > Including the trouble OSD(s)?
> > > > 
> > > > Yes
> > > > 
> > > > 
> > > > > Measured how, iostat or atop?
> > > > 
> > > > iostat, htop, and confirmed with Zabbix supervisor.
> > > > 
> > > > 
> > > > 
> > > > 
> > > > > > - bandwith usage is also near 0
> > > > > > 
> > > > > Yeah, all of the above are not surprising if everything is
> > > > > stuck
> > > > > waiting
> > > > > on some ops to finish. 
> > > > > 
> > > > > How many nodes are we talking about?
> > > > 
> > > > 
> > > > 7 nodes, 52 OSDs.
> > > > 
> > > > 
> > > > 
> > > > > > The whole cluster seems waiting for something... but I
> > > > > > don't
> > > > > > see
> > > > > > what.
> > > > > > 
> > > > > Is it just one specific OSD (or a set of them) or is that all
> > > > > over
> > > > > the
> > > > > place?
> > > > 
> > > > A set of them. When I increase the weight of all 4 OSDs of a
> > > > node,
> > > > I
> > > > frequently have blocked IO from 1 OSD of this node.
> > > > 
> > > > 
> > > > 
> > > > > Does restarting the OSD fix things?
> > > > 
> > > > Yes. For several minutes.
> > > > 
> > > > 
> > > > > Christian
> > > > > > 
> > > > > > Le vendredi 18 septembre 2015 à 02:35 +0200, Olivier
> > > > > > Bonvalet a
> > > > > > écrit :
> > > > > > > Hi,
> > > > > > > 
> > > > > > > I have a cluster with lot of blocked operations each time
> > > > > > > I
> > > > > > > try
> > > > > > > 

Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Christian Balzer

Hello,

On Fri, 18 Sep 2015 09:37:24 +0200 Olivier Bonvalet wrote:

> Hi,
> 
> sorry for the missing information. I wanted to avoid putting too much
> inappropriate info ;)
> 
Nah, everything helps, there are known problems with some versions,
kernels, file systems, etc.

Speaking of which, what FS are you using on your OSDs?

> 
> 
> Le vendredi 18 septembre 2015 à 12:30 +0900, Christian Balzer a écrit :
> > Hello,
> > 
> > On Fri, 18 Sep 2015 02:43:49 +0200 Olivier Bonvalet wrote:
> > 
> > The items below help, but be a s specific as possible, from OS,
> > kernel
> > version to Ceph version, "ceph -s", any other specific details (pool
> > type,
> > replica size).
> > 
> 
> So, all nodes use Debian Wheezy, running on a vanilla 3.14.x kernel,
> and Ceph 0.80.10.
All my stuff is on Jessie, but at least Firefly should be stable and I
haven't seen anything like your problem with it.
And while 3.14 is an LTS kernel I wonder if something newer may be
beneficial, but probably not.

> I don't have anymore ceph status right now. But I have
> data to move tonight again, so I'll track that.
>
I was interested in that to see how many pools and PGs you have.
 
> The affected pool is a standard one (no erasure coding), with only 2
> replicas (size=2).
> 
Good, nothing fancy going on there then.

> 
> 
> 
> > > Some additionnal informations :
> > > - I have 4 SSD per node.
> > Type, if nothing else for anecdotal reasons.
> 
> I have 7 storage nodes here:
> - 3 nodes which each have 12 OSDs of 300GB SSD
> - 4 nodes which each have 4 OSDs of 800GB SSD
> 
> And I'm trying to replace the 12x300GB nodes with 4x800GB nodes.
> 
Type as in model/maker, but helpful information.

> 
> 
> > > - the CPU usage is near 0
> > > - IO wait is near 0 too
> > Including the trouble OSD(s)?
> 
> Yes
> 
> 
> > Measured how, iostat or atop?
> 
> iostat, htop, and confirmed with Zabbix supervisor.
>

Good. I'm sure you checked for network errors. 
Single network or split client/cluster network?

> 
> 
> 
> > > - bandwith usage is also near 0
> > > 
> > Yeah, all of the above are not surprising if everything is stuck
> > waiting
> > on some ops to finish. 
> > 
> > How many nodes are we talking about?
> 
> 
> 7 nodes, 52 OSDs.
> 
That should be below the threshold for most system tunables (there are various
threads and articles on how to tune Ceph for "large" clusters).

Since this happens only when your cluster reshuffles data (and thus has
more threads going), what is your ulimit setting for open files?
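Something like this shows what a running OSD actually got and how close it is to
the limit (pidof -s just picks one ceph-osd process, check a few of them):

cat /proc/$(pidof -s ceph-osd)/limits | grep -i 'open files'
ls /proc/$(pidof -s ceph-osd)/fd | wc -l    # fds currently in use by that OSD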

> 
> 
> > > The whole cluster seems waiting for something... but I don't see
> > > what.
> > > 
> > Is it just one specific OSD (or a set of them) or is that all over
> > the
> > place?
> 
> A set of them. When I increase the weight of all 4 OSDs of a node, I
> frequently have blocked IO from 1 OSD of this node.
> 
The plot thickens, as in, the target of most writes (new PGs being moved
there) is the culprit.

> 
> 
> > Does restarting the OSD fix things?
> 
> Yes. For several minutes.
> 
That also ties into a resource starvation of sorts, I'd investigate along
those lines.

Christian
> 
> > Christian
> > > 
> > > Le vendredi 18 septembre 2015 à 02:35 +0200, Olivier Bonvalet a
> > > écrit :
> > > > Hi,
> > > > 
> > > > I have a cluster with lot of blocked operations each time I try
> > > > to
> > > > move
> > > > data (by reweighting a little an OSD).
> > > > 
> > > > It's a full SSD cluster, with 10GbE network.
> > > > 
> > > > In logs, when I have blocked OSD, on the main OSD I can see that
> > > > :
> > > > 2015-09-18 01:55:16.981396 7f89e8cb8700  0 log [WRN] : 2 slow
> > > > requests, 1 included below; oldest blocked for > 33.976680 secs
> > > > 2015-09-18 01:55:16.981402 7f89e8cb8700  0 log [WRN] : slow
> > > > request
> > > > 30.125556 seconds old, received at 2015-09-18 01:54:46.855821:
> > > > osd_op(client.29760717.1:18680817544
> > > > rb.0.1c16005.238e1f29.027f [write 180224~16384]
> > > > 6.c11916a4
> > > > snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4
> > > > currently
> > > > reached pg
> > > > 2015-09-18 01:55:46.986319 7f89e8cb8700  0 log [WRN] : 2 slow
> > > > requests, 1 included below; oldest blocked for > 63.981596 secs
> > > > 2015-09-18 01:55:46.986324 7f89e8cb8700  0 log [WRN] : slow
> > > > request
> > > > 60.130472 seconds old, received at 2015-09-18 01:54:46.855821:
> > > > osd_op(client.29760717.1:18680817544
> > > > rb.0.1c16005.238e1f29.027f [write 180224~16384]
> > > > 6.c11916a4
> > > > snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4
> > > > currently
> > > > reached pg
> > > > 
> > > > How should I read that ? What this OSD is waiting for ?
> > > > 
> > > > Thanks for any help,
> > > > 
> > > > Olivier
> > > > ___
> > > > ceph-users mailing list
> > > > ceph-users@lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > 
> > > ___
> > > ceph-users mailing list
> > > 

Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Christian Balzer

Hello,

On Fri, 18 Sep 2015 10:35:37 +0200 Olivier Bonvalet wrote:

> Le vendredi 18 septembre 2015 à 17:04 +0900, Christian Balzer a écrit :
> > Hello,
> > 
> > On Fri, 18 Sep 2015 09:37:24 +0200 Olivier Bonvalet wrote:
> > 
> > > Hi,
> > > 
> > > sorry for missing informations. I was to avoid putting too much
> > > inappropriate infos ;)
> > > 
> > Nah, everything helps, there are known problems with some versions,
> > kernels, file systems, etc.
> > 
> > Speaking of which, what FS are you using on your OSDs?
> > 
> 
> XFS.
>
No surprises there, one hopes.
 
> > > 
> > > 
> > > Le vendredi 18 septembre 2015 à 12:30 +0900, Christian Balzer a
> > > écrit :
> > > > Hello,
> > > > 
> > > > On Fri, 18 Sep 2015 02:43:49 +0200 Olivier Bonvalet wrote:
> > > > 
> > > > The items below help, but be a s specific as possible, from OS,
> > > > kernel
> > > > version to Ceph version, "ceph -s", any other specific details
> > > > (pool
> > > > type,
> > > > replica size).
> > > > 
> > > 
> > > So, all nodes use Debian Wheezy, running on a vanilla 3.14.x
> > > kernel,
> > > and Ceph 0.80.10.
> > All my stuff is on Jessie, but at least Firefly should be stable and
> > I
> > haven't seen anything like your problem with it.
> > And while 3.14 is a LTS kernel I wonder if something newer may be
> > beneficial, but probably not.
> > 
> 
> Well, I can try a 3.18.x kernel. But for that I have to restart all
> nodes, which will throw some backfilling and probably some blocked IO
> too ;)
>
Yeah, as I said, it might be helpful in some other ways, but probably not
related to your problems.
  
> 
> > > I don't have anymore ceph status right now. But I have
> > > data to move tonight again, so I'll track that.
> > > 
> > I was interested in that to see how many pools and PGs you have.
> 
> Well :
> 
> cluster de035250-323d-4cf6-8c4b-cf0faf6296b1
>  health HEALTH_OK
>  monmap e21: 3 mons at
> {faude=10.0.0.13:6789/0,murmillia=10.0.0.18:6789/0,rurkh=10.0.0.19:6789/0},
> election epoch 4312, quorum 0,1,2 faude,murmillia,rurkh
>  osdmap e847496: 88 osds: 88 up, 87 in
>   pgmap v86390609: 6632 pgs, 16 pools, 18883 GB data, 5266 kobjects
> 68559 GB used, 59023 GB / 124 TB avail
> 6632 active+clean
>   client io 3194 kB/s rd, 23542 kB/s wr, 1450 op/s
> 
> 
> There are mainly 2 pools used: a "ssd" pool and a "hdd" pool. This hdd
> pool uses different OSDs, on different nodes.
> Since I don't often rebalance data in this hdd pool, I haven't yet seen
> the problem on it.
> 
How many PGs in the SSD pool? 
I can see this easily exceeding your open file limits.
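Off the top of my head, something like this gives the numbers to compare against
that limit:

ceph osd pool get ssd pg_num    # per-pool PG count (same for the hdd pool)
ceph osd dump | grep pg_num     # or all pools at once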

> 
> 
> > >  The affected pool is a standard one (no erasure coding), with only
> > > 2
> > > replica (size=2).
> > > 
> > Good, nothing fancy going on there then.
> > 
> > > 
> > > 
> > > 
> > > > > Some additionnal informations :
> > > > > - I have 4 SSD per node.
> > > > Type, if nothing else for anecdotal reasons.
> > > 
> > > I have 7 storage nodes here :
> > > - 3 nodes which have each 12 OSD of 300GB
> > > SSD
> > > - 4 nodes which have each  4 OSD of 800GB SSD
> > > 
> > > And I'm trying to replace 12x300GB nodes by 4x800GB nodes.
> > > 
> > Type as in model/maker, but helpful information.
> > 
> 
> 300GB models are Intel SSDSC2BB300G4 (DC S3500).
> 800GB models are Intel SSDSC2BB800H4 (DC S3500 I think).
> 
0.3 DWPD, but I guess you know that.

> 
> 
> 
> > > 
> > > 
> > > > > - the CPU usage is near 0
> > > > > - IO wait is near 0 too
> > > > Including the trouble OSD(s)?
> > > 
> > > Yes
> > > 
> > > 
> > > > Measured how, iostat or atop?
> > > 
> > > iostat, htop, and confirmed with Zabbix supervisor.
> > > 
> > 
> > Good. I'm sure you checked for network errors. 
> > Single network or split client/cluster network?
> > 
> 
> It's the first thing I checked, and latency and packet loss is
> monitored between each node and mons, but maybe I forgot some checks.
> 
> 
> > > 
> > > 
> > > 
> > > > > - bandwith usage is also near 0
> > > > > 
> > > > Yeah, all of the above are not surprising if everything is stuck
> > > > waiting
> > > > on some ops to finish. 
> > > > 
> > > > How many nodes are we talking about?
> > > 
> > > 
> > > 7 nodes, 52 OSDs.
> > > 
> > That be below the threshold for most system tunables (there are
> > various
> > threads and articles on how to tune Ceph for "large" clusters).
> > 
> > Since this happens only when your cluster reshuffles data (and thus
> > has
> > more threads going) what is your ulimit setting for open files?
> 
> 
> Wow... the default one on Debian Wheezy : 1024.
>
You want to fix this, both during startup and in general.
 
I use this for sysv-init systems like Wheezy:
---
# cat /etc/initscript 
#
ulimit -Hn 65536
ulimit -Sn 16384

# Execute the program.
eval exec "$4"
---

and:
---
# cat /etc/security/limits.d/tuning.conf 
root    soft    nofile  16384
root    hard    nofile  65536
*       soft    nofile  16384
*       hard    nofile  65536
---

Adjust as needed.
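If I remember correctly the init script also honours a "max open files" setting in
ceph.conf, something along these lines (untested from memory, pick your own value):
---
[global]
max open files = 65536
---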

Christian
> 
> 

Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
Hi,

I think I found the problem: a way too large journal.
I caught this in the logs of an OSD having blocked requests:

OSD.15 :

2015-09-19 00:41:12.717062 7fb8a3c44700  1 journal check_for_full at 3548528640 
: JOURNAL FULL 3548528640 >= 1376255 (max_size 4294967296 start 3549904896)
2015-09-19 00:41:43.124590 7fb8a6181700  0 log [WRN] : 6 slow requests, 6 
included below; oldest blocked for > 30.405719 secs
2015-09-19 00:41:43.124596 7fb8a6181700  0 log [WRN] : slow request 30.405719 
seconds old, received at 2015-09-19 00:41:12.718829: 
osd_op(client.31621623.1:5392489797 rb.0.1b844d6.238e1f29.04d3 [write 
0~4096] 6.3aed306f snapc 4=[4,11096,11018] ondisk+write e847952) v4 
currently waiting for subops from 19
2015-09-19 00:41:43.124599 7fb8a6181700  0 log [WRN] : slow request 30.172735 
seconds old, received at 2015-09-19 00:41:12.951813: 
osd_op(client.31435077.1:8423014905 rb.0.1c39394.238e1f29.037a [write 
1499136~8192] 6.2ffed26e snapc 8=[8,1109a,1101c] ondisk+write e847952) 
v4 currently waiting for subops from 28
2015-09-19 00:41:43.124602 7fb8a6181700  0 log [WRN] : slow request 30.172703 
seconds old, received at 2015-09-19 00:41:12.951845: 
osd_op(client.31435077.1:8423014906 rb.0.1c39394.238e1f29.037a [write 
1523712~8192] 6.2ffed26e snapc 8=[8,1109a,1101c] ondisk+write e847952) 
v4 currently waiting for subops from 28
2015-09-19 00:41:43.124604 7fb8a6181700  0 log [WRN] : slow request 30.172576 
seconds old, received at 2015-09-19 00:41:12.951972: 
osd_op(client.31435077.1:8423014907 rb.0.1c39394.238e1f29.037a [write 
1515520~8192] 6.2ffed26e snapc 8=[8,1109a,1101c] ondisk+write e847952) 
v4 currently waiting for subops from 28
2015-09-19 00:41:43.124606 7fb8a6181700  0 log [WRN] : slow request 30.172546 
seconds old, received at 2015-09-19 00:41:12.952002: 
osd_op(client.31435077.1:8423014909 rb.0.1c39394.238e1f29.037a [write 
1531904~8192] 6.2ffed26e snapc 8=[8,1109a,1101c] ondisk+write e847952) 
v4 currently waiting for subops from 28

and at same time on OSD.19 :

2015-09-19 00:41:19.549508 7f55973c0700  0 -- 192.168.42.22:6806/28596 >> 
192.168.42.16:6828/38905 pipe(0x230f sd=358 :6806 s=2 pgs=14268 cs=3 l=0 
c=0x6d9cb00).fault with nothing to send, going to standby
2015-09-19 00:41:43.246421 7f55ba277700  0 log [WRN] : 1 slow requests, 1 
included below; oldest blocked for > 30.253274 secs
2015-09-19 00:41:43.246428 7f55ba277700  0 log [WRN] : slow request 30.253274 
seconds old, received at 2015-09-19 00:41:12.993123: 
osd_op(client.31626115.1:4664205553 rb.0.1c918ad.238e1f29.2da9 [write 
3063808~16384] 6.604ba242 snapc 10aaf=[10aaf,10a31,109b3] ondisk+write e847952) 
v4 currently waiting for subops from 15
2015-09-19 00:42:13.251591 7f55ba277700  0 log [WRN] : 1 slow requests, 1 
included below; oldest blocked for > 60.258446 secs
2015-09-19 00:42:13.251596 7f55ba277700  0 log [WRN] : slow request 60.258446 
seconds old, received at 2015-09-19 00:41:12.993123: 
osd_op(client.31626115.1:4664205553 rb.0.1c918ad.238e1f29.2da9 [write 
3063808~16384] 6.604ba242 snapc 10aaf=[10aaf,10a31,109b3] ondisk+write e847952) 
v4 currently waiting for subops from 15

So the blocking seems to be the "JOURNAL FULL" event, with big numbers.
Is 3548528640 the journal size?
I just reduced the filestore_max_sync_interval to 30s, and everything
seems to work fine.
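For the record, roughly what I changed (the injectargs part is just so I didn't
have to restart the OSDs; the value is what I'm testing, not a recommendation):

ceph tell osd.* injectargs '--filestore_max_sync_interval 30'

and in ceph.conf, so it survives restarts:
[osd]
filestore max sync interval = 30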

For SSD OSDs with the journal on the same device, a big journal is a crazy
thing... I suppose I broke this setup when trying to tune the journal
for the HDD pool.

At the same time, are there any tips for tuning the journal in the case of HDD OSDs,
with a (potentially big) SSD journal and a hardware RAID card which handles
write-back?

Thanks for your help.

Olivier


On Friday, 18 September 2015 at 02:35 +0200, Olivier Bonvalet wrote:
> Hi,
> 
> I have a cluster with a lot of blocked operations each time I try to
> move
> data (by reweighting an OSD a little).
> 
> It's a full SSD cluster, with 10GbE network.
> 
> In logs, when I have blocked OSD, on the main OSD I can see that :
> 2015-09-18 01:55:16.981396 7f89e8cb8700  0 log [WRN] : 2 slow
> requests, 1 included below; oldest blocked for > 33.976680 secs
> 2015-09-18 01:55:16.981402 7f89e8cb8700  0 log [WRN] : slow request
> 30.125556 seconds old, received at 2015-09-18 01:54:46.855821:
> osd_op(client.29760717.1:18680817544
> rb.0.1c16005.238e1f29.027f [write 180224~16384] 6.c11916a4
> snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4 currently
> reached pg
> 2015-09-18 01:55:46.986319 7f89e8cb8700  0 log [WRN] : 2 slow
> requests, 1 included below; oldest blocked for > 63.981596 secs
> 2015-09-18 01:55:46.986324 7f89e8cb8700  0 log [WRN] : slow request
> 60.130472 seconds old, received at 2015-09-18 01:54:46.855821:
> osd_op(client.29760717.1:18680817544
> rb.0.1c16005.238e1f29.027f [write 180224~16384] 6.c11916a4
> snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4 currently
> reached pg
> 
> 

Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
On Friday, 18 September 2015 at 14:14 +0200, Paweł Sadowski wrote:
> It might be worth checking how many threads you have in your system
> (ps
> -eL | wc -l). By default there is a limit of 32k (sysctl -q
> kernel.pid_max). There is/was a bug in fork()
> (https://lkml.org/lkml/2015/2/3/345) reporting ENOMEM when PID limit
> is
> reached. We hit a situation where an OSD trying to create a new thread was
> killed and reported 'Cannot allocate memory' (12 OSDs per node created
> more than 32k threads).
> 

Thanks! For now I don't see more than 5k threads on the nodes with 12 OSDs,
but maybe there are during recovery/backfilling?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Paweł Sadowski
On 09/18/2015 12:17 PM, Olivier Bonvalet wrote:
> Le vendredi 18 septembre 2015 à 12:04 +0200, Jan Schermer a écrit :
>>> On 18 Sep 2015, at 11:28, Christian Balzer  wrote:
>>>
>>> On Fri, 18 Sep 2015 11:07:49 +0200 Olivier Bonvalet wrote:
>>>
 Le vendredi 18 septembre 2015 à 10:59 +0200, Jan Schermer a écrit
 :
> In that case it can either be slow monitors (slow network, slow
> disks(!!!)  or a CPU or memory problem).
> But it still can also be on the OSD side in the form of either
> CPU
> usage or memory pressure - in my case there were lots of memory
> used
> for pagecache (so for all intents and purposes considered
> "free") but
> when peering the OSD had trouble allocating any memory from it
> and it
> caused lots of slow ops and peering hanging in there for a
> while.
> This also doesn't show as high CPU usage, only kswapd spins up
> a bit
> (don't be fooled by its name, it has nothing to do with swap in
> this
> case).
 My nodes have 256GB of RAM (for 12x300GB ones) or 128GB of RAM
 (for
 4x800GB ones), so I will try track this too. Thanks !

>>> I haven't seen this (known problem) with 64GB or 128GB nodes,
>>> probably
>>> because I set /proc/sys/vm/min_free_kbytes to 512MB or 1GB
>>> respectively.
>>>
>> I had this set to 6G and that doesn't help. This "buffer" is probably
>> only useful for some atomic allocations that can use it, not for
>> userland processes and their memory. Or maybe they get memory from
>> this pool but it gets replenished immediately.
>> QEMU has no problem allocating 64G on the same host, OSD struggles to
>> allocate memory during startup or when PGs are added during
>> rebalancing - probably because it does a lot of smaller allocations
>> instead of one big.
>>
> For now I dropped the caches *and* set min_free_kbytes to 1GB. I haven't triggered
> any rebalance yet, but I can already see a reduced filestore.commitcycle_latency.

It might be worth checking how many threads you have in your system (ps
-eL | wc -l). By default there is a limit of 32k (sysctl -q
kernel.pid_max). There is/was a bug in fork()
(https://lkml.org/lkml/2015/2/3/345) reporting ENOMEM when PID limit is
reached. We hit a situation where an OSD trying to create a new thread was
killed and reported 'Cannot allocate memory' (12 OSDs per node created
more than 32k threads).
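A quick way to check and raise it, roughly (the exact value is up to you, this is
just the usual 64-bit maximum):

ps -eL | wc -l                        # current number of threads
sysctl kernel.pid_max kernel.threads-max
sysctl -w kernel.pid_max=4194303      # persist it via /etc/sysctl.d/ as usual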

-- 
PS

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
I use Ceph 0.80.10.

I see that IO wait is near 0 with iostat and htop (in detailed mode), and
rechecked it with the Zabbix supervisor.
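Typically something like this while the requests are blocked (await and %util stay
near zero on the affected OSD disks):

iostat -x 5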


On Thursday, 17 September 2015 at 20:28 -0700, GuangYang wrote:
> Which version are you using?
> 
> My guess is that the request (op) is waiting for lock (might be
> ondisk_read_lock of the object, but a debug_osd=20 should be helpful
> to tell what happened to the op).
> 
> How do you tell the IO wait is near to 0 (by top?)? 
> 
> Thanks,
> Guang
> 
> > From: ceph.l...@daevel.fr
> > To: ceph-users@lists.ceph.com
> > Date: Fri, 18 Sep 2015 02:43:49 +0200
> > Subject: Re: [ceph-users] Lot of blocked operations
> > 
> > Some additionnal informations :
> > - I have 4 SSD per node.
> > - the CPU usage is near 0
> > - IO wait is near 0 too
> > - bandwith usage is also near 0
> > 
> > The whole cluster seems waiting for something... but I don't see
> > what.
> > 
> > 
> > Le vendredi 18 septembre 2015 à 02:35 +0200, Olivier Bonvalet a
> > écrit :
> > > Hi,
> > > 
> > > I have a cluster with lot of blocked operations each time I try
> > > to
> > > move
> > > data (by reweighting a little an OSD).
> > > 
> > > It's a full SSD cluster, with 10GbE network.
> > > 
> > > In logs, when I have blocked OSD, on the main OSD I can see that
> > > :
> > > 2015-09-18 01:55:16.981396 7f89e8cb8700 0 log [WRN] : 2 slow
> > > requests, 1 included below; oldest blocked for> 33.976680 secs
> > > 2015-09-18 01:55:16.981402 7f89e8cb8700 0 log [WRN] : slow
> > > request
> > > 30.125556 seconds old, received at 2015-09-18 01:54:46.855821:
> > > osd_op(client.29760717.1:18680817544
> > > rb.0.1c16005.238e1f29.027f [write 180224~16384]
> > > 6.c11916a4
> > > snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4
> > > currently
> > > reached pg
> > > 2015-09-18 01:55:46.986319 7f89e8cb8700 0 log [WRN] : 2 slow
> > > requests, 1 included below; oldest blocked for> 63.981596 secs
> > > 2015-09-18 01:55:46.986324 7f89e8cb8700 0 log [WRN] : slow
> > > request
> > > 60.130472 seconds old, received at 2015-09-18 01:54:46.855821:
> > > osd_op(client.29760717.1:18680817544
> > > rb.0.1c16005.238e1f29.027f [write 180224~16384]
> > > 6.c11916a4
> > > snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4
> > > currently
> > > reached pg
> > > 
> > > How should I read that ? What this OSD is waiting for ?
> > > 
> > > Thanks for any help,
> > > 
> > > Olivier
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
But yes, I will try to increase OSD verbosity.

On Thursday, 17 September 2015 at 20:28 -0700, GuangYang wrote:
> Which version are you using?
> 
> My guess is that the request (op) is waiting for lock (might be
> ondisk_read_lock of the object, but a debug_osd=20 should be helpful
> to tell what happened to the op).
> 
> How do you tell the IO wait is near to 0 (by top?)? 
> 
> Thanks,
> Guang
> 
> > From: ceph.l...@daevel.fr
> > To: ceph-users@lists.ceph.com
> > Date: Fri, 18 Sep 2015 02:43:49 +0200
> > Subject: Re: [ceph-users] Lot of blocked operations
> > 
> > Some additionnal informations :
> > - I have 4 SSD per node.
> > - the CPU usage is near 0
> > - IO wait is near 0 too
> > - bandwith usage is also near 0
> > 
> > The whole cluster seems waiting for something... but I don't see
> > what.
> > 
> > 
> > Le vendredi 18 septembre 2015 à 02:35 +0200, Olivier Bonvalet a
> > écrit :
> > > Hi,
> > > 
> > > I have a cluster with lot of blocked operations each time I try
> > > to
> > > move
> > > data (by reweighting a little an OSD).
> > > 
> > > It's a full SSD cluster, with 10GbE network.
> > > 
> > > In logs, when I have blocked OSD, on the main OSD I can see that
> > > :
> > > 2015-09-18 01:55:16.981396 7f89e8cb8700 0 log [WRN] : 2 slow
> > > requests, 1 included below; oldest blocked for> 33.976680 secs
> > > 2015-09-18 01:55:16.981402 7f89e8cb8700 0 log [WRN] : slow
> > > request
> > > 30.125556 seconds old, received at 2015-09-18 01:54:46.855821:
> > > osd_op(client.29760717.1:18680817544
> > > rb.0.1c16005.238e1f29.027f [write 180224~16384]
> > > 6.c11916a4
> > > snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4
> > > currently
> > > reached pg
> > > 2015-09-18 01:55:46.986319 7f89e8cb8700 0 log [WRN] : 2 slow
> > > requests, 1 included below; oldest blocked for> 63.981596 secs
> > > 2015-09-18 01:55:46.986324 7f89e8cb8700 0 log [WRN] : slow
> > > request
> > > 60.130472 seconds old, received at 2015-09-18 01:54:46.855821:
> > > osd_op(client.29760717.1:18680817544
> > > rb.0.1c16005.238e1f29.027f [write 180224~16384]
> > > 6.c11916a4
> > > snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4
> > > currently
> > > reached pg
> > > 
> > > How should I read that ? What this OSD is waiting for ?
> > > 
> > > Thanks for any help,
> > > 
> > > Olivier
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
mmm good point.

I don't see CPU or IO problems on the mons, but in the logs I have this:

2015-09-18 01:55:16.921027 7fb951175700  0 log [INF] : pgmap v86359128:
6632 pgs: 77 inactive, 1 remapped, 10 active+remapped+wait_backfill, 25
peering, 5 active+remapped, 6 active+remapped+backfilling, 6499
active+clean, 9 remapped+peering; 18974 GB data, 69004 GB used, 58578
GB / 124 TB avail; 915 kB/s rd, 26383 kB/s wr, 1671 op/s; 8417/15680513
objects degraded (0.054%); 1062 MB/s, 274 objects/s recovering


So... it can be a peering problem. Didn't see that, thanks.
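Next time it happens I'll try to catch which PGs and OSDs are involved, something
like:

ceph health detail             # lists the blocked requests and the OSDs involved
ceph pg dump_stuck inactive    # PGs stuck peering / inactive
ceph pg <pgid> query           # details for one of them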



On Friday, 18 September 2015 at 09:52 +0200, Jan Schermer wrote:
> Could this be caused by monitors? In my case lagging monitors can
> also cause slow requests (because of slow peering). Not sure if
> that's expected or not, but it of course doesn't show on the OSDs as
> any kind of bottleneck when you try to investigate...
> 
> Jan
> 
> > On 18 Sep 2015, at 09:37, Olivier Bonvalet 
> > wrote:
> > 
> > Hi,
> > 
> > sorry for missing informations. I was to avoid putting too much
> > inappropriate infos ;)
> > 
> > 
> > 
> > Le vendredi 18 septembre 2015 à 12:30 +0900, Christian Balzer a
> > écrit :
> > > Hello,
> > > 
> > > On Fri, 18 Sep 2015 02:43:49 +0200 Olivier Bonvalet wrote:
> > > 
> > > The items below help, but be a s specific as possible, from OS,
> > > kernel
> > > version to Ceph version, "ceph -s", any other specific details
> > > (pool
> > > type,
> > > replica size).
> > > 
> > 
> > So, all nodes use Debian Wheezy, running on a vanilla 3.14.x
> > kernel,
> > and Ceph 0.80.10.
> > I don't have anymore ceph status right now. But I have
> > data to move tonight again, so I'll track that.
> > 
> > The affected pool is a standard one (no erasure coding), with only
> > 2 replica (size=2).
> > 
> > 
> > 
> > 
> > > > Some additionnal informations :
> > > > - I have 4 SSD per node.
> > > Type, if nothing else for anecdotal reasons.
> > 
> > I have 7 storage nodes here :
> > - 3 nodes which have each 12 OSD of 300GB
> > SSD
> > - 4 nodes which have each  4 OSD of 800GB SSD
> > 
> > And I'm trying to replace 12x300GB nodes by 4x800GB nodes.
> > 
> > 
> > 
> > > > - the CPU usage is near 0
> > > > - IO wait is near 0 too
> > > Including the trouble OSD(s)?
> > 
> > Yes
> > 
> > 
> > > Measured how, iostat or atop?
> > 
> > iostat, htop, and confirmed with Zabbix supervisor.
> > 
> > 
> > 
> > 
> > > > - bandwith usage is also near 0
> > > > 
> > > Yeah, all of the above are not surprising if everything is stuck
> > > waiting
> > > on some ops to finish. 
> > > 
> > > How many nodes are we talking about?
> > 
> > 
> > 7 nodes, 52 OSDs.
> > 
> > 
> > 
> > > > The whole cluster seems waiting for something... but I don't
> > > > see
> > > > what.
> > > > 
> > > Is it just one specific OSD (or a set of them) or is that all
> > > over
> > > the
> > > place?
> > 
> > A set of them. When I increase the weight of all 4 OSDs of a node,
> > I
> > frequently have blocked IO from 1 OSD of this node.
> > 
> > 
> > 
> > > Does restarting the OSD fix things?
> > 
> > Yes. For several minutes.
> > 
> > 
> > > Christian
> > > > 
> > > > Le vendredi 18 septembre 2015 à 02:35 +0200, Olivier Bonvalet a
> > > > écrit :
> > > > > Hi,
> > > > > 
> > > > > I have a cluster with lot of blocked operations each time I
> > > > > try
> > > > > to
> > > > > move
> > > > > data (by reweighting a little an OSD).
> > > > > 
> > > > > It's a full SSD cluster, with 10GbE network.
> > > > > 
> > > > > In logs, when I have blocked OSD, on the main OSD I can see
> > > > > that
> > > > > :
> > > > > 2015-09-18 01:55:16.981396 7f89e8cb8700  0 log [WRN] : 2 slow
> > > > > requests, 1 included below; oldest blocked for > 33.976680
> > > > > secs
> > > > > 2015-09-18 01:55:16.981402 7f89e8cb8700  0 log [WRN] : slow
> > > > > request
> > > > > 30.125556 seconds old, received at 2015-09-18
> > > > > 01:54:46.855821:
> > > > > osd_op(client.29760717.1:18680817544
> > > > > rb.0.1c16005.238e1f29.027f [write 180224~16384]
> > > > > 6.c11916a4
> > > > > snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4
> > > > > currently
> > > > > reached pg
> > > > > 2015-09-18 01:55:46.986319 7f89e8cb8700  0 log [WRN] : 2 slow
> > > > > requests, 1 included below; oldest blocked for > 63.981596
> > > > > secs
> > > > > 2015-09-18 01:55:46.986324 7f89e8cb8700  0 log [WRN] : slow
> > > > > request
> > > > > 60.130472 seconds old, received at 2015-09-18
> > > > > 01:54:46.855821:
> > > > > osd_op(client.29760717.1:18680817544
> > > > > rb.0.1c16005.238e1f29.027f [write 180224~16384]
> > > > > 6.c11916a4
> > > > > snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4
> > > > > currently
> > > > > reached pg
> > > > > 
> > > > > How should I read that ? What this OSD is waiting for ?
> > > > > 
> > > > > Thanks for any help,
> > > > > 
> > > > > Olivier
> > > > > ___
> > > > > ceph-users mailing list
> > > > > 

[ceph-users] Lot of blocked operations

2015-09-17 Thread Olivier Bonvalet
Hi,

I have a cluster with a lot of blocked operations each time I try to move
data (by reweighting an OSD a little).

It's a full SSD cluster, with 10GbE network.

In logs, when I have blocked OSD, on the main OSD I can see that :
2015-09-18 01:55:16.981396 7f89e8cb8700  0 log [WRN] : 2 slow requests, 1 
included below; oldest blocked for > 33.976680 secs
2015-09-18 01:55:16.981402 7f89e8cb8700  0 log [WRN] : slow request 30.125556 
seconds old, received at 2015-09-18 01:54:46.855821: 
osd_op(client.29760717.1:18680817544 rb.0.1c16005.238e1f29.027f [write 
180224~16384] 6.c11916a4 snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) 
v4 currently reached pg
2015-09-18 01:55:46.986319 7f89e8cb8700  0 log [WRN] : 2 slow requests, 1 
included below; oldest blocked for > 63.981596 secs
2015-09-18 01:55:46.986324 7f89e8cb8700  0 log [WRN] : slow request 60.130472 
seconds old, received at 2015-09-18 01:54:46.855821: 
osd_op(client.29760717.1:18680817544 rb.0.1c16005.238e1f29.027f [write 
180224~16384] 6.c11916a4 snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) 
v4 currently reached pg

How should I read that? What is this OSD waiting for?

Thanks for any help,

Olivier
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lot of blocked operations

2015-09-17 Thread Olivier Bonvalet
Some additional information:
- I have 4 SSDs per node.
- the CPU usage is near 0
- IO wait is near 0 too
- bandwidth usage is also near 0

The whole cluster seems to be waiting for something... but I don't see what.


On Friday, 18 September 2015 at 02:35 +0200, Olivier Bonvalet wrote:
> Hi,
> 
> I have a cluster with a lot of blocked operations each time I try to
> move
> data (by reweighting an OSD a little).
> 
> It's a full SSD cluster, with 10GbE network.
> 
> In logs, when I have blocked OSD, on the main OSD I can see that :
> 2015-09-18 01:55:16.981396 7f89e8cb8700  0 log [WRN] : 2 slow
> requests, 1 included below; oldest blocked for > 33.976680 secs
> 2015-09-18 01:55:16.981402 7f89e8cb8700  0 log [WRN] : slow request
> 30.125556 seconds old, received at 2015-09-18 01:54:46.855821:
> osd_op(client.29760717.1:18680817544
> rb.0.1c16005.238e1f29.027f [write 180224~16384] 6.c11916a4
> snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4 currently
> reached pg
> 2015-09-18 01:55:46.986319 7f89e8cb8700  0 log [WRN] : 2 slow
> requests, 1 included below; oldest blocked for > 63.981596 secs
> 2015-09-18 01:55:46.986324 7f89e8cb8700  0 log [WRN] : slow request
> 60.130472 seconds old, received at 2015-09-18 01:54:46.855821:
> osd_op(client.29760717.1:18680817544
> rb.0.1c16005.238e1f29.027f [write 180224~16384] 6.c11916a4
> snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4 currently
> reached pg
> 
> How should I read that? What is this OSD waiting for?
> 
> Thanks for any help,
> 
> Olivier
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lot of blocked operations

2015-09-17 Thread Christian Balzer

Hello,

On Fri, 18 Sep 2015 02:43:49 +0200 Olivier Bonvalet wrote:

The items below help, but be as specific as possible, from OS and kernel
version to Ceph version, "ceph -s", and any other specific details (pool type,
replica size).

> Some additional information:
> - I have 4 SSDs per node.
Type, if nothing else for anecdotal reasons.
 
> - the CPU usage is near 0
> - IO wait is near 0 too
Including the trouble OSD(s)?
Measured how, iostat or atop?

> - bandwidth usage is also near 0
>
Yeah, all of the above are not surprising if everything is stuck waiting
on some ops to finish. 

How many nodes are we talking about?

> The whole cluster seems to be waiting for something... but I don't see what.
>
Is it just one specific OSD (or a set of them) or is that all over the
place?

Does restarting the OSD fix things?
 
Christian
> 
> Le vendredi 18 septembre 2015 à 02:35 +0200, Olivier Bonvalet a écrit :
> > Hi,
> > 
> > I have a cluster with lot of blocked operations each time I try to
> > move
> > data (by reweighting a little an OSD).
> > 
> > It's a full SSD cluster, with 10GbE network.
> > 
> > In logs, when I have blocked OSD, on the main OSD I can see that :
> > 2015-09-18 01:55:16.981396 7f89e8cb8700  0 log [WRN] : 2 slow
> > requests, 1 included below; oldest blocked for > 33.976680 secs
> > 2015-09-18 01:55:16.981402 7f89e8cb8700  0 log [WRN] : slow request
> > 30.125556 seconds old, received at 2015-09-18 01:54:46.855821:
> > osd_op(client.29760717.1:18680817544
> > rb.0.1c16005.238e1f29.027f [write 180224~16384] 6.c11916a4
> > snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4 currently
> > reached pg
> > 2015-09-18 01:55:46.986319 7f89e8cb8700  0 log [WRN] : 2 slow
> > requests, 1 included below; oldest blocked for > 63.981596 secs
> > 2015-09-18 01:55:46.986324 7f89e8cb8700  0 log [WRN] : slow request
> > 60.130472 seconds old, received at 2015-09-18 01:54:46.855821:
> > osd_op(client.29760717.1:18680817544
> > rb.0.1c16005.238e1f29.027f [write 180224~16384] 6.c11916a4
> > snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4 currently
> > reached pg
> > 
> > How should I read that ? What this OSD is waiting for ?
> > 
> > Thanks for any help,
> > 
> > Olivier
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lot of blocked operations

2015-09-17 Thread GuangYang
Which version are you using?

My guess is that the request (op) is waiting for a lock (it might be the
ondisk_read_lock of the object, but a debug_osd=20 should be helpful to tell
what happened to the op).
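Something along these lines, on the host of the OSD that logs the slow request
(replace N with the OSD id):

ceph tell osd.N injectargs '--debug-osd 20 --debug-ms 1'
ceph daemon osd.N dump_ops_in_flight     # what the ops are currently waiting on
ceph daemon osd.N dump_historic_ops      # recent slow ops with their event timeline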

How do you tell that the IO wait is near 0 (by top?)?

Thanks,
Guang

> From: ceph.l...@daevel.fr
> To: ceph-users@lists.ceph.com
> Date: Fri, 18 Sep 2015 02:43:49 +0200
> Subject: Re: [ceph-users] Lot of blocked operations
>
> Some additional information:
> - I have 4 SSDs per node.
> - the CPU usage is near 0
> - IO wait is near 0 too
> - bandwidth usage is also near 0
>
> The whole cluster seems to be waiting for something... but I don't see what.
>
>
> Le vendredi 18 septembre 2015 à 02:35 +0200, Olivier Bonvalet a écrit :
>> Hi,
>>
>> I have a cluster with lot of blocked operations each time I try to
>> move
>> data (by reweighting a little an OSD).
>>
>> It's a full SSD cluster, with 10GbE network.
>>
>> In logs, when I have blocked OSD, on the main OSD I can see that :
>> 2015-09-18 01:55:16.981396 7f89e8cb8700 0 log [WRN] : 2 slow
>> requests, 1 included below; oldest blocked for> 33.976680 secs
>> 2015-09-18 01:55:16.981402 7f89e8cb8700 0 log [WRN] : slow request
>> 30.125556 seconds old, received at 2015-09-18 01:54:46.855821:
>> osd_op(client.29760717.1:18680817544
>> rb.0.1c16005.238e1f29.027f [write 180224~16384] 6.c11916a4
>> snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4 currently
>> reached pg
>> 2015-09-18 01:55:46.986319 7f89e8cb8700 0 log [WRN] : 2 slow
>> requests, 1 included below; oldest blocked for> 63.981596 secs
>> 2015-09-18 01:55:46.986324 7f89e8cb8700 0 log [WRN] : slow request
>> 60.130472 seconds old, received at 2015-09-18 01:54:46.855821:
>> osd_op(client.29760717.1:18680817544
>> rb.0.1c16005.238e1f29.027f [write 180224~16384] 6.c11916a4
>> snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4 currently
>> reached pg
>>
>> How should I read that ? What this OSD is waiting for ?
>>
>> Thanks for any help,
>>
>> Olivier
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
  
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com