Re: [ceph-users] Stability Issue with 52 OSD hosts

2018-08-24 Thread Andras Pataki
We pin half the OSDs to each socket (and to the corresponding memory).
Since the disk controller and the network card are connected to only one
socket, this still probably produces quite a bit of QPI traffic.
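
A minimal sketch of one way to express this pinning, assuming
systemd-managed OSDs (the drop-in path, core range, and OSD id below are
illustrative; check the real topology with lscpu first):

    # /etc/systemd/system/ceph-osd@.service.d/numa.conf (drop-in)
    [Service]
    # Keep these OSDs on socket 0's cores (assumes cores 0-13 = node 0)
    CPUAffinity=0-13

    # Or wrap a single daemon to bind its memory allocations as well:
    numactl --cpunodebind=0 --membind=0 /usr/bin/ceph-osd -f --id 12
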
It is also worth investigating how the network does under high load.  We 
did run into problems where 40Gbps cards dropped packets heavily under load.
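
The NIC's own counters are the quickest way to spot such drops
(interface name is an example):

    ethtool -S eth0 | grep -iE 'drop|discard|miss'   # per-queue/firmware counters
    ip -s link show eth0                             # interface-level errors/drops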


Andras


On 08/24/2018 05:16 AM, Marc Roos wrote:
Can this be related to NUMA issues? I also have dual processor nodes,
and was wondering if there is some guide on how to optimize for NUMA.




-----Original Message-----
From: Tyler Bishop [mailto:tyler.bis...@beyondhosting.net]
Sent: Friday, 24 August 2018 3:11
To: Andras Pataki
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Stability Issue with 52 OSD hosts

Thanks for the info. I was investigating bluestore as well.  My hosts
don't go unresponsive but I do see parallel I/O slow down.

On Thu, Aug 23, 2018, 8:02 PM Andras Pataki wrote:


We are also running some fairly dense nodes with CentOS 7.4 and ran
into similar problems.  The nodes ran filestore OSDs (Jewel, then
Luminous).  Sometimes a node would be so unresponsive that one
couldn't even ssh to it (even though the root disk was a physically
separate drive on a separate controller from the OSD drives).  Often
these would coincide with kernel stack traces about hung tasks.
Initially we did blame high load, etc. from all the OSDs.

But then we benchmarked the nodes independently of ceph (with iozone
and such) and noticed problems there too.  When we started a few
dozen iozone processes on separate JBOD drives with xfs, some didn't
even start and write a single byte for minutes.  The conclusion we
came to was that there is some interference among a lot of mounted
xfs file systems in the Red Hat 3.10 kernels: some kind of central
lock prevents dozens of xfs file systems from running in parallel.
When we did I/O directly to raw devices in parallel, we saw no
problems (no high loads, etc.).  So we built a newer kernel, and the
situation got better.  4.4 is already much better; nowadays we are
testing moving to 4.14.

Also, migrating to bluestore significantly reduced the load on these
nodes too.  At busy times, the filestore host loads were 20-30, even
higher (on a 28 core node), while the bluestore nodes hummed along at
a load of perhaps 6 or 8.  This also confirms that somehow lots of
xfs mounts don't work in parallel.

Andras


On 08/23/2018 03:24 PM, Tyler Bishop wrote:
> Yes I've reviewed all the logs from monitor and host.   I am not
> getting useful errors (or any) in dmesg or general messages.
>
> I have 2 ceph clusters; the other cluster is 300 SSDs and I never
> have issues like this.   That's why I'm looking for help.
>
> On Thu, Aug 23, 2018 at 3:22 PM Alex Gorbachev wrote:
>> On Wed, Aug 22, 2018 at 11:39 PM Tyler Bishop wrote:
>>> During high load testing I'm only seeing user and sys cpu load
>>> around 60%... my load doesn't seem to be anything crazy on the host
>>> and iowait stays between 6 and 10%.  I have very good `ceph osd
>>> perf` numbers too.
>>>
>>> I am using 10.2.11 Jewel.
>>>
>>> On Wed, Aug 22, 2018 at 11:30 PM Christian Balzer wrote:
>>>> Hello,
>>>>
>>>> On Wed, 22 Aug 2018 23:00:24 -0400 Tyler Bishop wrote:
>>>>
>>>>> Hi,   I've been fighting to get good stability on my cluster for
>>>>> about 3 weeks now.  I am running into intermittent issues with
>>>>> OSDs flapping, marking other OSDs down, then going back to a
>>>>> stable state for hours and days.
>>>>>
>>>>> The cluster is 4x Cisco UCS S3260 with dual E5-2660, 256GB ram,
>>>>> 40G Network to 40G Brocade VDX Switches.  The OSDs are 6TB HGST
>>>>> SAS drives with 400GB HGST SAS 12G SSDs.   My configuration is 4
>>>>> journals per host with 12 disks per journal for a total of 56
>>>>> disks per system and 52 OSDs.
>>>>>
>>>> Any denser and you'd have a storage black hole.
>>>>
>>>> You already pointed your finger in the (or at least one) right
>>>> direction and everybody will agree that this setup is woefully
>>>> underpowered in the CPU department.
>>>>

Re: [ceph-users] Stability Issue with 52 OSD hosts

2018-08-24 Thread Marc Roos
 
Can this be related to NUMA issues? I also have dual processor nodes,
and was wondering if there is some guide on how to optimize for NUMA.
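
As a starting point, the topology and device locality can be mapped
with standard tools (interface name and PCI address are placeholders):

    lscpu | grep -i numa                        # cores per NUMA node
    numactl --hardware                          # memory per node, node distances
    cat /sys/class/net/eth0/device/numa_node    # node the NIC is attached to
    cat /sys/bus/pci/devices/0000:03:00.0/numa_node   # same for the disk HBA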




-----Original Message-----
From: Tyler Bishop [mailto:tyler.bis...@beyondhosting.net]
Sent: Friday, 24 August 2018 3:11
To: Andras Pataki
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Stability Issue with 52 OSD hosts

Thanks for the info. I was investigating bluestore as well.  My hosts
don't go unresponsive but I do see parallel I/O slow down.

On Thu, Aug 23, 2018, 8:02 PM Andras Pataki wrote:


We are also running some fairly dense nodes with CentOS 7.4 and ran
into similar problems.  The nodes ran filestore OSDs (Jewel, then
Luminous).  Sometimes a node would be so unresponsive that one
couldn't even ssh to it (even though the root disk was a physically
separate drive on a separate controller from the OSD drives).  Often
these would coincide with kernel stack traces about hung tasks.
Initially we did blame high load, etc. from all the OSDs.

But then we benchmarked the nodes independently of ceph (with iozone
and such) and noticed problems there too.  When we started a few
dozen iozone processes on separate JBOD drives with xfs, some didn't
even start and write a single byte for minutes.  The conclusion we
came to was that there is some interference among a lot of mounted
xfs file systems in the Red Hat 3.10 kernels: some kind of central
lock prevents dozens of xfs file systems from running in parallel.
When we did I/O directly to raw devices in parallel, we saw no
problems (no high loads, etc.).  So we built a newer kernel, and the
situation got better.  4.4 is already much better; nowadays we are
testing moving to 4.14.

Also, migrating to bluestore significantly reduced the load on these
nodes too.  At busy times, the filestore host loads were 20-30, even
higher (on a 28 core node), while the bluestore nodes hummed along at
a load of perhaps 6 or 8.  This also confirms that somehow lots of
xfs mounts don't work in parallel.

Andras


On 08/23/2018 03:24 PM, Tyler Bishop wrote:
> Yes I've reviewed all the logs from monitor and host.   I am not
> getting useful errors (or any) in dmesg or general messages.
>
> I have 2 ceph clusters; the other cluster is 300 SSDs and I never
> have issues like this.   That's why I'm looking for help.
>
> On Thu, Aug 23, 2018 at 3:22 PM Alex Gorbachev wrote:
>> On Wed, Aug 22, 2018 at 11:39 PM Tyler Bishop wrote:
>>> During high load testing I'm only seeing user and sys cpu load
>>> around 60%... my load doesn't seem to be anything crazy on the host
>>> and iowait stays between 6 and 10%.  I have very good `ceph osd
>>> perf` numbers too.
>>>
>>> I am using 10.2.11 Jewel.
>>>
>>> On Wed, Aug 22, 2018 at 11:30 PM Christian Balzer wrote:
>>>> Hello,
>>>>
>>>> On Wed, 22 Aug 2018 23:00:24 -0400 Tyler Bishop wrote:
>>>>
>>>>> Hi,   I've been fighting to get good stability on my cluster for
>>>>> about 3 weeks now.  I am running into intermittent issues with
>>>>> OSDs flapping, marking other OSDs down, then going back to a
>>>>> stable state for hours and days.
>>>>>
>>>>> The cluster is 4x Cisco UCS S3260 with dual E5-2660, 256GB ram,
>>>>> 40G Network to 40G Brocade VDX Switches.  The OSDs are 6TB HGST
>>>>> SAS drives with 400GB HGST SAS 12G SSDs.   My configuration is 4
>>>>> journals per host with 12 disks per journal for a total of 56
>>>>> disks per system and 52 OSDs.
>>>>>
>>>> Any denser and you'd have a storage black hole.
>>>>
>>>> You already pointed your finger in the (or at least one) right
>>>> direction and everybody will agree that this setup is woefully
>>>> underpowered in the CPU department.
>>>>
>>>>> I am using CentOS 7 with kernel 3.10 and the redhat tuned-adm
>>>>> profile for throughput-performance enabled.
>>>>>
>>>> Ceph version would be interesting as well...
>>>>
>>>>> I have these sysctls set:
>>>>>

Re: [ceph-users] Stability Issue with 52 OSD hosts

2018-08-23 Thread Tyler Bishop
Thanks for the info. I was investigating bluestore as well.  My hosts
don't go unresponsive but I do see parallel I/O slow down.

On Thu, Aug 23, 2018, 8:02 PM Andras Pataki wrote:

> We are also running some fairly dense nodes with CentOS 7.4 and ran into
> similar problems.  The nodes ran filestore OSDs (Jewel, then Luminous).
> Sometimes a node would be so unresponsive that one couldn't even ssh to
> it (even though the root disk was a physically separate drive on a
> separate controller from the OSD drives).  Often these would coincide
> with kernel stack traces about hung tasks. Initially we did blame high
> load, etc. from all the OSDs.
>
> But then we benchmarked the nodes independently of ceph (with iozone and
> such) and noticed problems there too.  When we started a few dozen
> iozone processes on separate JBOD drives with xfs, some didn't even
> start and write a single byte for minutes.  The conclusion we came to
> was that there is some interference among a lot of mounted xfs file
> systems in the Red Hat 3.10 kernels.  Some kind of central lock that
> prevents dozens of xfs file systems from running in parallel.  When we
> did I/O directly to raw devices in parallel, we saw no problems (no high
> loads, etc.).  So we built a newer kernel, and the situation got
> better.  4.4 is already much better, nowadays we are testing moving to
> 4.14.
>
> Also, migrating to bluestore significantly reduced the load on these
> nodes too.  At busy times, the filestore host loads were 20-30, even
> higher (on a 28 core node), while the bluestore nodes hummed along at a
> load of perhaps 6 or 8.  This also confirms that somehow lots of xfs
> mounts don't work in parallel.
>
> Andras
>
>
> On 08/23/2018 03:24 PM, Tyler Bishop wrote:
> > Yes I've reviewed all the logs from monitor and host.   I am not
> > getting useful errors (or any) in dmesg or general messages.
> >
> > I have 2 ceph clusters; the other cluster is 300 SSDs and I never
> > have issues like this.   That's why I'm looking for help.
> >
> > On Thu, Aug 23, 2018 at 3:22 PM Alex Gorbachev wrote:
> >> On Wed, Aug 22, 2018 at 11:39 PM Tyler Bishop wrote:
> >>> During high load testing I'm only seeing user and sys cpu load
> >>> around 60%... my load doesn't seem to be anything crazy on the host
> >>> and iowait stays between 6 and 10%.  I have very good `ceph osd
> >>> perf` numbers too.
> >>>
> >>> I am using 10.2.11 Jewel.
> >>>
> >>> On Wed, Aug 22, 2018 at 11:30 PM Christian Balzer wrote:
> >>>> Hello,
> >>>>
> >>>> On Wed, 22 Aug 2018 23:00:24 -0400 Tyler Bishop wrote:
> >>>>
> >>>>> Hi,   I've been fighting to get good stability on my cluster for
> >>>>> about 3 weeks now.  I am running into intermittent issues with
> >>>>> OSDs flapping, marking other OSDs down, then going back to a
> >>>>> stable state for hours and days.
> >>>>>
> >>>>> The cluster is 4x Cisco UCS S3260 with dual E5-2660, 256GB ram,
> >>>>> 40G Network to 40G Brocade VDX Switches.  The OSDs are 6TB HGST
> >>>>> SAS drives with 400GB HGST SAS 12G SSDs.   My configuration is 4
> >>>>> journals per host with 12 disks per journal for a total of 56
> >>>>> disks per system and 52 OSDs.
> >>>>>
> >>>> Any denser and you'd have a storage black hole.
> >>>>
> >>>> You already pointed your finger in the (or at least one) right
> >>>> direction and everybody will agree that this setup is woefully
> >>>> underpowered in the CPU department.
> >>>>
> >>>>> I am using CentOS 7 with kernel 3.10 and the redhat tuned-adm
> >>>>> profile for throughput-performance enabled.
> >>>>>
> >>>> Ceph version would be interesting as well...
> >>>>
> >>>>> I have these sysctls set:
> >>>>>
> >>>>> kernel.pid_max = 4194303
> >>>>> fs.file-max = 6553600
> >>>>> vm.swappiness = 0
> >>>>> vm.vfs_cache_pressure = 50
> >>>>> vm.min_free_kbytes = 3145728
> >>>>>
> >>>>> I feel like my issue is directly related to the high number of
> >>>>> OSDs per host, but I'm not sure what issue I'm really running
> >>>>> into.   I believe that I have ruled out network issues: I am able
> >>>>> to get 38Gbit consistently via iperf testing, and jumbo-frame MTU
> >>>>> pings succeed with don't-fragment set and an 8972-byte packet
> >>>>> size.
> >>>>>
> >>>> The fact that it all works for days at a time suggests this as
> >>>> well, but you need to verify these things when they're happening.
> >>>>
> >>>>> From FIO testing I seem to be able to get 150-200k iops write
> >>>>> from my rbd clients on 1gbit networking... This is about what I
> >>>>> expected due to the write penalty and my underpowered CPU for the
> >>>>> number of OSDs.
> >>>>>
> >>>>> I get these messages which I believe are normal?
> >>>>> 2018-08-22 10:33:12.754722 7f7d009f5700  0 -- 10.20.136.8:6894/718902
> >>>>> >> 10.20.136.10:6876/490574 pipe(0x55aed77fd400 sd=192 :40502 s=2
> >>>>> pgs=1084 cs=53 l=0 c=0x55aed805bc80).fault with nothing to send,
> >>>>> going to standby
> >>>>>
> >>>> Ignore.
> >>>>
> >>>>> Then randomly I'll get a storm of this every few days for 20
> >>>>> minutes or so:

Re: [ceph-users] Stability Issue with 52 OSD hosts

2018-08-23 Thread Andras Pataki
We are also running some fairly dense nodes with CentOS 7.4 and ran into 
similar problems.  The nodes ran filestore OSDs (Jewel, then Luminous).  
Sometimes a node would be so unresponsive that one couldn't even ssh to 
it (even though the root disk was a physically separate drive on a 
separate controller from the OSD drives).  Often these would coincide 
with kernel stack traces about hung tasks. Initially we did blame high 
load, etc. from all the OSDs.


But then we benchmarked the nodes independently of ceph (with iozone and 
such) and noticed problems there too.  When we started a few dozen 
iozone processes on separate JBOD drives with xfs, some didn't even 
start and write a single byte for minutes.  The conclusion we came to 
was that there is some interference among a lot of mounted xfs file 
systems in the Red Hat 3.10 kernels.  Some kind of central lock that 
prevents dozens of xfs file systems from running in parallel.  When we 
did I/O directly to raw devices in parallel, we saw no problems (no high 
loads, etc.).  So we built a newer kernel, and the situation got 
better.  4.4 is already much better, nowadays we are testing moving to 4.14.
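
A sketch of the kind of parallel run that exposed this, assuming the
JBOD drives are xfs-mounted at /mnt/osd0../mnt/osdN on a test node
(paths and sizes are illustrative; the dd step destroys data and is
for scratch disks only):

    # One iozone writer per mounted xfs file system, all at once:
    for d in /mnt/osd*; do
        iozone -i 0 -r 1m -s 4g -f "$d/iozone.tmp" &
    done
    wait

    # For comparison, direct I/O to the raw devices (DESTROYS DATA):
    for dev in /dev/sd{b..m}; do
        dd if=/dev/zero of="$dev" bs=1M count=4096 oflag=direct &
    done
    wait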


Also, migrating to bluestore significantly reduced the load on these 
nodes too.  At busy times, the filestore host loads were 20-30, even 
higher (on a 28 core node), while the bluestore nodes hummed along at a 
load of perhaps 6 or 8.  This also confirms that somehow lots of xfs 
mounts don't work in parallel.


Andras


On 08/23/2018 03:24 PM, Tyler Bishop wrote:

Yes I've reviewed all the logs from monitor and host.   I am not
getting useful errors (or any) in dmesg or general messages.

I have 2 ceph clusters; the other cluster is 300 SSDs and I never have
issues like this.   That's why I'm looking for help.

On Thu, Aug 23, 2018 at 3:22 PM Alex Gorbachev  wrote:

On Wed, Aug 22, 2018 at 11:39 PM Tyler Bishop wrote:

During high load testing I'm only seeing user and sys cpu load around 60%... my 
load doesn't seem to be anything crazy on the host and iowait stays between 6 
and 10%.  I have very good `ceph osd perf` numbers too.

I am using 10.2.11 Jewel.


On Wed, Aug 22, 2018 at 11:30 PM Christian Balzer  wrote:

Hello,

On Wed, 22 Aug 2018 23:00:24 -0400 Tyler Bishop wrote:


Hi,   I've been fighting to get good stability on my cluster for about
3 weeks now.  I am running into intermittent issues with OSDs flapping,
marking other OSDs down, then going back to a stable state for hours and
days.

The cluster is 4x Cisco UCS S3260 with dual E5-2660, 256GB ram, 40G
Network to 40G Brocade VDX Switches.  The OSD are 6TB HGST SAS drives
with 400GB HGST SAS 12G SSDs.   My configuration is 4 journals per
host with 12 disk per journal for a total of 56 disk per system and 52
OSD.


Any denser and you'd have a storage black hole.

You already pointed your finger in the (or at least one) right direction
and everybody will agree that this setup is woefully underpowered in the
CPU department.


I am using CentOS 7 with kernel 3.10 and the redhat tuned-adm profile
for throughput-performance enabled.


Ceph version would be interesting as well...


I have these sysctls set:

kernel.pid_max = 4194303
fs.file-max = 6553600
vm.swappiness = 0
vm.vfs_cache_pressure = 50
vm.min_free_kbytes = 3145728

I feel like my issue is directly related to the high number of OSD per
host but I'm not sure what issue I'm really running into.   I believe
that I have ruled out network issues: I am able to get 38Gbit
consistently via iperf testing, and jumbo-frame MTU pings succeed with
don't-fragment set and an 8972-byte packet size.


The fact that it all works for days at a time suggests this as well, but
you need to verify these things when they're happening.


From FIO testing I seem to be able to get 150-200k iops write from my
rbd clients on 1gbit networking... This is about what I expected due
to the write penalty and my underpowered CPU for the number of OSD.

I get these messages which I believe are normal?
2018-08-22 10:33:12.754722 7f7d009f5700  0 -- 10.20.136.8:6894/718902
>> 10.20.136.10:6876/490574 pipe(0x55aed77fd400 sd=192 :40502 s=2
pgs=1084 cs=53 l=0 c=0x55aed805bc80).fault with nothing to send, going
to standby


Ignore.


Then randomly I'll get a storm of this every few days for 20 minutes or so:
2018-08-22 15:48:32.631186 7f44b7514700 -1 osd.127 37333
heartbeat_check: no reply from 10.20.142.11:6861 osd.198 since back
2018-08-22 15:48:08.052762 front 2018-08-22 15:48:31.282890 (cutoff
2018-08-22 15:48:12.630773)


Randomly is unlikely.
Again, catch it in the act, atop in huge terminal windows (showing all
CPUs and disks) for all nodes should be very telling, collecting and
graphing this data might work, too.

My suspects would be deep scrubs and/or high IOPS spikes when this is
happening, starving out OSD processes (CPU wise, RAM should be fine one
supposes).

Christian


Please help!!!

Have you looked at the OSD logs on the OSD nodes by chance?  I found
that correlating the messages in those logs with your master ceph log
and also correlating with any messages in syslog or kern.log can
elucidate the cause of the problem pretty well.

Re: [ceph-users] Stability Issue with 52 OSD hosts

2018-08-23 Thread Tyler Bishop
Yes I've reviewed all the logs from monitor and host.   I am not
getting useful errors (or any) in dmesg or general messages.

I have 2 ceph clusters; the other cluster is 300 SSDs and I never have
issues like this.   That's why I'm looking for help.

On Thu, Aug 23, 2018 at 3:22 PM Alex Gorbachev  wrote:
>
> On Wed, Aug 22, 2018 at 11:39 PM Tyler Bishop wrote:
> >
> > During high load testing I'm only seeing user and sys cpu load around 
> > 60%... my load doesn't seem to be anything crazy on the host and iowait 
> > stays between 6 and 10%.  I have very good `ceph osd perf` numbers too.
> >
> > I am using 10.2.11 Jewel.
> >
> >
> > On Wed, Aug 22, 2018 at 11:30 PM Christian Balzer  wrote:
> >>
> >> Hello,
> >>
> >> On Wed, 22 Aug 2018 23:00:24 -0400 Tyler Bishop wrote:
> >>
> >> > Hi,   I've been fighting to get good stability on my cluster for about
> >> > 3 weeks now.  I am running into intermittent issues with OSDs flapping,
> >> > marking other OSDs down, then going back to a stable state for hours and
> >> > days.
> >> >
> >> > The cluster is 4x Cisco UCS S3260 with dual E5-2660, 256GB ram, 40G
> >> > Network to 40G Brocade VDX Switches.  The OSD are 6TB HGST SAS drives
> >> > with 400GB HGST SAS 12G SSDs.   My configuration is 4 journals per
> >> > host with 12 disk per journal for a total of 56 disk per system and 52
> >> > OSD.
> >> >
> >> Any denser and you'd have a storage black hole.
> >>
> >> You already pointed your finger in the (or at least one) right direction
> >> and everybody will agree that this setup is woefully underpowered in the
> >> CPU department.
> >>
> >> > I am using CentOS 7 with kernel 3.10 and the redhat tuned-adm profile
> >> > for throughput-performance enabled.
> >> >
> >> Ceph version would be interesting as well...
> >>
> >> > I have these sysctls set:
> >> >
> >> > kernel.pid_max = 4194303
> >> > fs.file-max = 6553600
> >> > vm.swappiness = 0
> >> > vm.vfs_cache_pressure = 50
> >> > vm.min_free_kbytes = 3145728
> >> >
> >> > I feel like my issue is directly related to the high number of OSD per
> >> > host but I'm not sure what issue I'm really running into.   I believe
> >> > that I have ruled out network issues: I am able to get 38Gbit
> >> > consistently via iperf testing, and jumbo-frame MTU pings succeed
> >> > with don't-fragment set and an 8972-byte packet size.
> >> >
> >> The fact that it all works for days at a time suggests this as well, but
> >> you need to verify these things when they're happening.
> >>
> >> > From FIO testing I seem to be able to get 150-200k iops write from my
> >> > rbd clients on 1gbit networking... This is about what I expected due
> >> > to the write penalty and my underpowered CPU for the number of OSD.
> >> >
> >> > I get these messages which I believe are normal?
> >> > 2018-08-22 10:33:12.754722 7f7d009f5700  0 -- 10.20.136.8:6894/718902
> >> > >> 10.20.136.10:6876/490574 pipe(0x55aed77fd400 sd=192 :40502 s=2
> >> > pgs=1084 cs=53 l=0 c=0x55aed805bc80).fault with nothing to send, going
> >> > to standby
> >> >
> >> Ignore.
> >>
> >> > Then randomly I'll get a storm of this every few days for 20 minutes or 
> >> > so:
> >> > 2018-08-22 15:48:32.631186 7f44b7514700 -1 osd.127 37333
> >> > heartbeat_check: no reply from 10.20.142.11:6861 osd.198 since back
> >> > 2018-08-22 15:48:08.052762 front 2018-08-22 15:48:31.282890 (cutoff
> >> > 2018-08-22 15:48:12.630773)
> >> >
> >> Randomly is unlikely.
> >> Again, catch it in the act, atop in huge terminal windows (showing all
> >> CPUs and disks) for all nodes should be very telling, collecting and
> >> graphing this data might work, too.
> >>
> >> My suspects would be deep scrubs and/or high IOPS spikes when this is
> >> happening, starving out OSD processes (CPU wise, RAM should be fine one
> >> supposes).
> >>
> >> Christian
> >>
> >> > Please help!!!
>
> Have you looked at the OSD logs on the OSD nodes by chance?  I found
> that correlating the messages in those logs with your master ceph log
> and also correlating with any messages in syslog or kern.log can
> elucidate the cause of the problem pretty well.
> --
> Alex Gorbachev
> Storcium
>
>
> >> > ___
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >
> >>
> >>
> >> --
> >> Christian Balzer            Network/Systems Engineer
> >> ch...@gol.com               Rakuten Communications
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Stability Issue with 52 OSD hosts

2018-08-23 Thread Alex Gorbachev
On Wed, Aug 22, 2018 at 11:39 PM Tyler Bishop wrote:
>
> During high load testing I'm only seeing user and sys cpu load around 60%... 
> my load doesn't seem to be anything crazy on the host and iowait stays 
> between 6 and 10%.  I have very good `ceph osd perf` numbers too.
>
> I am using 10.2.11 Jewel.
>
>
> On Wed, Aug 22, 2018 at 11:30 PM Christian Balzer  wrote:
>>
>> Hello,
>>
>> On Wed, 22 Aug 2018 23:00:24 -0400 Tyler Bishop wrote:
>>
>> > Hi,   I've been fighting to get good stability on my cluster for about
>> > 3 weeks now.  I am running into intermittent issues with OSDs flapping,
>> > marking other OSDs down, then going back to a stable state for hours and
>> > days.
>> >
>> > The cluster is 4x Cisco UCS S3260 with dual E5-2660, 256GB ram, 40G
>> > Network to 40G Brocade VDX Switches.  The OSD are 6TB HGST SAS drives
>> > with 400GB HGST SAS 12G SSDs.   My configuration is 4 journals per
>> > host with 12 disk per journal for a total of 56 disk per system and 52
>> > OSD.
>> >
>> Any denser and you'd have a storage black hole.
>>
>> You already pointed your finger in the (or at least one) right direction
>> and everybody will agree that this setup is woefully underpowered in the
>> CPU department.
>>
>> > I am using CentOS 7 with kernel 3.10 and the redhat tuned-adm profile
>> > for throughput-performance enabled.
>> >
>> Ceph version would be interesting as well...
>>
>> > I have these sysctls set:
>> >
>> > kernel.pid_max = 4194303
>> > fs.file-max = 6553600
>> > vm.swappiness = 0
>> > vm.vfs_cache_pressure = 50
>> > vm.min_free_kbytes = 3145728
>> >
>> > I feel like my issue is directly related to the high number of OSD per
>> > host but I'm not sure what issue I'm really running into.   I believe
>> > that I have ruled out network issues: I am able to get 38Gbit
>> > consistently via iperf testing, and jumbo-frame MTU pings succeed
>> > with don't-fragment set and an 8972-byte packet size.
>> >
>> The fact that it all works for days at a time suggests this as well, but
>> you need to verify these things when they're happening.
>>
>> > From FIO testing I seem to be able to get 150-200k iops write from my
>> > rbd clients on 1gbit networking... This is about what I expected due
>> > to the write penalty and my underpowered CPU for the number of OSD.
>> >
>> > I get these messages which I believe are normal?
>> > 2018-08-22 10:33:12.754722 7f7d009f5700  0 -- 10.20.136.8:6894/718902
>> > >> 10.20.136.10:6876/490574 pipe(0x55aed77fd400 sd=192 :40502 s=2
>> > pgs=1084 cs=53 l=0 c=0x55aed805bc80).fault with nothing to send, going
>> > to standby
>> >
>> Ignore.
>>
>> > Then randomly I'll get a storm of this every few days for 20 minutes or so:
>> > 2018-08-22 15:48:32.631186 7f44b7514700 -1 osd.127 37333
>> > heartbeat_check: no reply from 10.20.142.11:6861 osd.198 since back
>> > 2018-08-22 15:48:08.052762 front 2018-08-22 15:48:31.282890 (cutoff
>> > 2018-08-22 15:48:12.630773)
>> >
>> Randomly is unlikely.
>> Again, catch it in the act, atop in huge terminal windows (showing all
>> CPUs and disks) for all nodes should be very telling, collecting and
>> graphing this data might work, too.
>>
>> My suspects would be deep scrubs and/or high IOPS spikes when this is
>> happening, starving out OSD processes (CPU wise, RAM should be fine one
>> supposes).
>>
>> Christian
>>
>> > Please help!!!

Have you looked at the OSD logs on the OSD nodes by chance?  I found
that correlating the messages in those logs with your master ceph log
and also correlating with any messages in syslog or kern.log can
elucidate the cause of the problem pretty well.
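
For example, something along these lines can correlate the three
sources around an incident window (CentOS default paths; the timestamp
is illustrative):

    W='2018-08-22 15:48'
    grep "$W" /var/log/ceph/ceph-osd.*.log     # per-OSD logs on each node
    grep "$W" /var/log/ceph/ceph.log           # cluster log on a monitor
    grep 'Aug 22 15:48' /var/log/messages      # syslog/kernel on CentOS
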
--
Alex Gorbachev
Storcium


>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>>
>>
>> --
>> Christian Balzer            Network/Systems Engineer
>> ch...@gol.com               Rakuten Communications
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Stability Issue with 52 OSD hosts

2018-08-22 Thread Tyler Bishop
During high load testing I'm only seeing user and sys cpu load around
60%... my load doesn't seem to be anything crazy on the host and iowait
stays between 6 and 10%.  I have very good `ceph osd perf` numbers too.

I am using 10.2.11 Jewel.


On Wed, Aug 22, 2018 at 11:30 PM Christian Balzer  wrote:

> Hello,
>
> On Wed, 22 Aug 2018 23:00:24 -0400 Tyler Bishop wrote:
>
> > Hi,   I've been fighting to get good stability on my cluster for about
> > 3 weeks now.  I am running into intermittent issues with OSDs flapping,
> > marking other OSDs down, then going back to a stable state for hours and
> > days.
> >
> > The cluster is 4x Cisco UCS S3260 with dual E5-2660, 256GB ram, 40G
> > Network to 40G Brocade VDX Switches.  The OSD are 6TB HGST SAS drives
> > with 400GB HGST SAS 12G SSDs.   My configuration is 4 journals per
> > host with 12 disk per journal for a total of 56 disk per system and 52
> > OSD.
> >
> Any denser and you'd have a storage black hole.
>
> You already pointed your finger in the (or at least one) right direction
> and everybody will agree that this setup is woefully underpowered in the
> CPU department.
>
> > I am using CentOS 7 with kernel 3.10 and the redhat tuned-adm profile
> > for throughput-performance enabled.
> >
> Ceph version would be interesting as well...
>
> > I have these sysctls set:
> >
> > kernel.pid_max = 4194303
> > fs.file-max = 6553600
> > vm.swappiness = 0
> > vm.vfs_cache_pressure = 50
> > vm.min_free_kbytes = 3145728
> >
> > I feel like my issue is directly related to the high number of OSD per
> > host but I'm not sure what issue I'm really running into.   I believe
> > that I have ruled out network issues: I am able to get 38Gbit
> > consistently via iperf testing, and jumbo-frame MTU pings succeed
> > with don't-fragment set and an 8972-byte packet size.
> >
> The fact that it all works for days at a time suggests this as well, but
> you need to verify these things when they're happening.
>
> > From FIO testing I seem to be able to get 150-200k iops write from my
> > rbd clients on 1gbit networking... This is about what I expected due
> > to the write penalty and my underpowered CPU for the number of OSD.
> >
> > I get these messages which I believe are normal?
> > 2018-08-22 10:33:12.754722 7f7d009f5700  0 -- 10.20.136.8:6894/718902
> > >> 10.20.136.10:6876/490574 pipe(0x55aed77fd400 sd=192 :40502 s=2
> > pgs=1084 cs=53 l=0 c=0x55aed805bc80).fault with nothing to send, going
> > to standby
> >
> Ignore.
>
> > Then randomly I'll get a storm of this every few days for 20 minutes or
> so:
> > 2018-08-22 15:48:32.631186 7f44b7514700 -1 osd.127 37333
> > heartbeat_check: no reply from 10.20.142.11:6861 osd.198 since back
> > 2018-08-22 15:48:08.052762 front 2018-08-22 15:48:31.282890 (cutoff
> > 2018-08-22 15:48:12.630773)
> >
> Randomly is unlikely.
> Again, catch it in the act, atop in huge terminal windows (showing all
> CPUs and disks) for all nodes should be very telling, collecting and
> graphing this data might work, too.
>
> My suspects would be deep scrubs and/or high IOPS spikes when this is
> happening, starving out OSD processes (CPU wise, RAM should be fine one
> supposes).
>
> Christian
>
> > Please help!!!
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
> --
> Christian Balzer            Network/Systems Engineer
> ch...@gol.com               Rakuten Communications
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Stability Issue with 52 OSD hosts

2018-08-22 Thread Christian Balzer
Hello,

On Wed, 22 Aug 2018 23:00:24 -0400 Tyler Bishop wrote:

> Hi,   I've been fighting to get good stability on my cluster for about
> 3 weeks now.  I am running into intermittent issues with OSDs flapping,
> marking other OSDs down, then going back to a stable state for hours and
> days.
> 
> The cluster is 4x Cisco UCS S3260 with dual E5-2660, 256GB ram, 40G
> Network to 40G Brocade VDX Switches.  The OSD are 6TB HGST SAS drives
> with 400GB HGST SAS 12G SSDs.   My configuration is 4 journals per
> host with 12 disk per journal for a total of 56 disk per system and 52
> OSD.
>
Any denser and you'd have a storage black hole.

You already pointed your finger in the (or at least one) right direction
and everybody will agree that this setup is woefully underpowered in the
CPU department.
 
> I am using CentOS 7 with kernel 3.10 and the redhat tuned-adm profile
> for throughput-performance enabled.
> 
Ceph version would be interesting as well...

> I have these sysctls set:
> 
> kernel.pid_max = 4194303
> fs.file-max = 6553600
> vm.swappiness = 0
> vm.vfs_cache_pressure = 50
> vm.min_free_kbytes = 3145728
> 
> I feel like my issue is directly related to the high number of OSD per
> host but I'm not sure what issue I'm really running into.   I believe
> that I have ruled out network issues: I am able to get 38Gbit
> consistently via iperf testing, and jumbo-frame MTU pings succeed
> with don't-fragment set and an 8972-byte packet size.
> 
The fact that it all works for days at a time suggests this as well, but
you need to verify these things when they're happening.

> From FIO testing I seem to be able to get 150-200k iops write from my
> rbd clients on 1gbit networking... This is about what I expected due
> to the write penalty and my underpowered CPU for the number of OSD.
> 
> I get these messages which I believe are normal?
> 2018-08-22 10:33:12.754722 7f7d009f5700  0 -- 10.20.136.8:6894/718902
> >> 10.20.136.10:6876/490574 pipe(0x55aed77fd400 sd=192 :40502 s=2  
> pgs=1084 cs=53 l=0 c=0x55aed805bc80).fault with nothing to send, going
> to standby
> 
Ignore.

> Then randomly I'll get a storm of this every few days for 20 minutes or so:
> 2018-08-22 15:48:32.631186 7f44b7514700 -1 osd.127 37333
> heartbeat_check: no reply from 10.20.142.11:6861 osd.198 since back
> 2018-08-22 15:48:08.052762 front 2018-08-22 15:48:31.282890 (cutoff
> 2018-08-22 15:48:12.630773)
> 
Randomly is unlikely.
Again, catch it in the act, atop in huge terminal windows (showing all
CPUs and disks) for all nodes should be very telling, collecting and
graphing this data might work, too.

My suspects would be deep scrubs and/or high IOPS spikes when this is
happening, starving out OSD processes (CPU wise, RAM should be fine one
supposes).
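
One way to test that theory is to check whether the storms coincide
with deep scrubs, and throttle scrubbing if they do (the options below
exist in Jewel; the values are illustrative):

    ceph pg dump | grep -c 'scrubbing+deep'    # PGs deep-scrubbing right now
    ceph tell osd.* injectargs '--osd_max_scrubs 1 --osd_scrub_sleep 0.1'
    # or pin scrubs to off-peak hours in ceph.conf:
    #   osd scrub begin hour = 22
    #   osd scrub end hour = 6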

Christian

> Please help!!!
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian Balzer            Network/Systems Engineer
ch...@gol.com               Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Stability Issue with 52 OSD hosts

2018-08-22 Thread Tyler Bishop
Hi,   I've been fighting to get good stability on my cluster for about
3 weeks now.  I am running into intermittent issues with OSDs flapping,
marking other OSDs down, then going back to a stable state for hours
and days.

The cluster is 4x Cisco UCS S3260 with dual E5-2660, 256GB ram, 40G
Network to 40G Brocade VDX Switches.  The OSDs are 6TB HGST SAS drives
with 400GB HGST SAS 12G SSDs.   My configuration is 4 journals per
host with 12 disks per journal for a total of 56 disks per system and
52 OSDs.

I am using CentOS 7 with kernel 3.10 and the redhat tuned-adm profile
for throughput-performance enabled.

I have these sysctls set:

kernel.pid_max = 4194303
fs.file-max = 6553600
vm.swappiness = 0
vm.vfs_cache_pressure = 50
vm.min_free_kbytes = 3145728
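
For persistence, a sketch of the usual drop-in (file name is arbitrary):

    # /etc/sysctl.d/90-ceph-osd.conf
    kernel.pid_max = 4194303
    fs.file-max = 6553600
    vm.swappiness = 0
    vm.vfs_cache_pressure = 50
    vm.min_free_kbytes = 3145728
    # apply without rebooting: sysctl --system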

I feel like my issue is directly related to the high number of OSDs per
host, but I'm not sure what issue I'm really running into.   I believe
that I have ruled out network issues: I am able to get 38Gbit
consistently via iperf testing, and jumbo-frame MTU pings succeed with
don't-fragment set and an 8972-byte packet size.
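
For reference, the sort of checks meant here (host name is a
placeholder; flags assume iperf3 and Linux ping):

    iperf3 -c osd-host-2 -P 4 -t 30       # multi-stream throughput test
    ping -M do -s 8972 -c 10 osd-host-2   # 8972B payload + 28B headers = 9000 MTU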

From FIO testing I seem to be able to get 150-200k iops write from my
rbd clients on 1gbit networking... This is about what I expected due
to the write penalty and my underpowered CPU for the number of OSD.
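
A sketch of an equivalent fio run against an RBD image, assuming fio
was built with the rbd engine (pool and image names are placeholders):

    fio --name=rbd-randwrite --ioengine=rbd --clientname=admin \
        --pool=rbd --rbdname=bench --rw=randwrite --bs=4k \
        --iodepth=32 --numjobs=4 --runtime=60 --time_based --group_reporting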

I get these messages which I believe are normal?
2018-08-22 10:33:12.754722 7f7d009f5700  0 -- 10.20.136.8:6894/718902
>> 10.20.136.10:6876/490574 pipe(0x55aed77fd400 sd=192 :40502 s=2
pgs=1084 cs=53 l=0 c=0x55aed805bc80).fault with nothing to send, going
to standby

Then randomly I'll get a storm of this every few days for 20 minutes or so:
2018-08-22 15:48:32.631186 7f44b7514700 -1 osd.127 37333
heartbeat_check: no reply from 10.20.142.11:6861 osd.198 since back
2018-08-22 15:48:08.052762 front 2018-08-22 15:48:31.282890 (cutoff
2018-08-22 15:48:12.630773)

Please help!!!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com