Re: [ceph-users] OSD hanging on 12.2.12 by message worker

2019-06-10 Thread Stefan Kooman
Quoting solarflow99 (solarflo...@gmail.com):
> Can the bitmap allocator be set in ceph-ansible? I wonder why it's not the
> default in 12.2.12.

We don't use ceph-ansible, but if ceph-ansible allows you to set specific
([osd]) settings in ceph.conf, I guess you can do it.
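
If I remember correctly, ceph-ansible has a ceph_conf_overrides variable for
exactly this kind of thing. Something along these lines in group_vars/all.yml
should end up in the [osd] section of the generated ceph.conf (an untested
sketch; please check the ceph-ansible docs for your version):

ceph_conf_overrides:
  osd:
    bluestore_allocator: bitmap
    bluefs_allocator: bitmap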

I don't know what the policy is for changing default settings in Ceph, or
whether they ever do that. The feature has only been available since 12.2.12
and is not battle-tested in luminous. It's not the default in Mimic either,
IIRC. It might be the default in Nautilus?

Behaviour changes can be tricky when people don't know about them.

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/  Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


Re: [ceph-users] OSD hanging on 12.2.12 by message worker

2019-06-10 Thread solarflow99
Can the bitmap allocator be set in ceph-ansible? I wonder why it's not the
default in 12.2.12.


On Thu, Jun 6, 2019 at 7:06 AM Stefan Kooman  wrote:

> Quoting Max Vernimmen (vernim...@textkernel.nl):
> >
> > This is happening several times per day after we made several changes at
> > the same time:
> >
> >- add physical ram to the ceph nodes
> >- move from fixed 'bluestore cache size hdd|ssd' and 'bluestore cache kv
> >max' to 'bluestore cache autotune = 1' and 'osd memory target =
> >20401094656'.
> >- update ceph from 12.2.8 to 12.2.11
> >- update clients from 12.2.8 to 12.2.11
> >
> > We have since upgraded the ceph nodes to 12.2.12 but it did not help to
> > fix this problem.
>
> Have you tried the new bitmap allocator for the OSDs already (available
> since 12.2.12):
>
> [osd]
>
> # MEMORY ALLOCATOR
> bluestore_allocator = bitmap
> bluefs_allocator = bitmap
>
> The issues you are reporting sound like an issue many of us have seen on
> luminous and mimic clusters, which has been identified as being caused by
> the "stupid allocator" memory allocator.
>
> Gr. Stefan
>
>
> --
> | BIT BV  http://www.bit.nl/  Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


Re: [ceph-users] OSD hanging on 12.2.12 by message worker

2019-06-07 Thread Stefan Kooman
Quoting Max Vernimmen (vernim...@textkernel.nl):
> Thank you for the suggestion to use the bitmap allocator. I looked at the
> ceph documentation and could find no mention of this setting. This makes me
> wonder how safe and production-ready this setting really is. I'm hesitant
> to apply that to our production environment.
> If the allocator setting helps to resolve the problem, then it looks to me
> like there is a bug in the 'stupid' allocator that is causing this
> behavior. Would this qualify for creating a bug report, or is some more
> debugging needed before I can do that?

It's safe to use in production. We have test clusters running it, and we
recently put it in production as well. As Igor noted, this might not help
in your situation, but it might prevent you from running into decreased
performance (increased latency) over time.
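
If you want to verify what an OSD actually ended up with after the change,
you can ask it via the admin socket on the OSD node, for example (osd.0 is
just a placeholder, use one of your own OSD ids):

ceph daemon osd.0 config get bluestore_allocator
ceph daemon osd.0 config get bluefs_allocator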

Gr. Stefan


-- 
| BIT BV  http://www.bit.nl/  Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


Re: [ceph-users] OSD hanging on 12.2.12 by message worker

2019-06-07 Thread Igor Fedotov

Hi Max,

I don't think this is an allocator-related issue. The symptoms that
triggered us to start using the bitmap allocator over the stupid one were:

- write op latency gradually increasing over time (days, not hours)

- perf showing a significant amount of time spent in allocator-related
functions (see the commands below)

- an OSD restart was the only remedy

It had nothing to do with network activity and/or client restarts.
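
For reference, that kind of check can be done with something like the
following (a rough sketch; adjust the pid to the ceph-osd process you are
looking at):

# latency trend as reported by the cluster
ceph osd perf
# profile a suspect OSD process on its node
perf top -p <ceph-osd pid>

If a large share of the samples lands in allocator code (e.g. the
StupidAllocator functions), that matches the pattern described above.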


Thanks,

Igor


On 6/7/2019 11:05 AM, Max Vernimmen wrote:
Thank you for the suggestion to use the bitmap allocator. I looked at
the ceph documentation and could find no mention of this setting. This
makes me wonder how safe and production-ready this setting really is.
I'm hesitant to apply that to our production environment.
If the allocator setting helps to resolve the problem, then it looks to
me like there is a bug in the 'stupid' allocator that is causing this
behavior. Would this qualify for creating a bug report, or is some more
debugging needed before I can do that?


On Thu, Jun 6, 2019 at 11:18 AM Stefan Kooman wrote:


Quoting Max Vernimmen (vernim...@textkernel.nl):
>
> This is happening several times per day after we made several changes at
> the same time:
>
>    - add physical ram to the ceph nodes
>    - move from fixed 'bluestore cache size hdd|ssd' and 'bluestore cache kv
>    max' to 'bluestore cache autotune = 1' and 'osd memory target =
>    20401094656'.
>    - update ceph from 12.2.8 to 12.2.11
>    - update clients from 12.2.8 to 12.2.11
>
> We have since upgraded the ceph nodes to 12.2.12 but it did not help to fix
> this problem.

Have you tried the new bitmap allocator for the OSDs already (available
since 12.2.12):

[osd]

# MEMORY ALLOCATOR
bluestore_allocator = bitmap
bluefs_allocator = bitmap

The issues you are reporting sound like an issue many of us have seen on
luminous and mimic clusters, which has been identified as being caused by
the "stupid allocator" memory allocator.

Gr. Stefan


-- 
| BIT BV  http://www.bit.nl/  Kamer van Koophandel 09090351
| GPG: 0xD14839C6  +31 318 648 688 / i...@bit.nl




--
Max Vernimmen
Senior DevOps Engineer
Textkernel

--
Textkernel BV, Nieuwendammerkade 26/a5, 1022 AB, Amsterdam, NL



Re: [ceph-users] OSD hanging on 12.2.12 by message worker

2019-06-07 Thread Max Vernimmen
Thank you for the suggestion to use the bitmap allocator. I looked at the
ceph documentation and could find no mention of this setting. This makes me
wonder how safe and production-ready this setting really is. I'm hesitant
to apply that to our production environment.
If the allocator setting helps to resolve the problem, then it looks to me
like there is a bug in the 'stupid' allocator that is causing this
behavior. Would this qualify for creating a bug report, or is some more
debugging needed before I can do that?

On Thu, Jun 6, 2019 at 11:18 AM Stefan Kooman  wrote:

> Quoting Max Vernimmen (vernim...@textkernel.nl):
> >
> > This is happening several times per day after we made several changes at
> > the same time:
> >
> >- add physical ram to the ceph nodes
> >- move from fixed 'bluestore cache size hdd|ssd' and 'bluestore cache kv
> >max' to 'bluestore cache autotune = 1' and 'osd memory target =
> >20401094656'.
> >- update ceph from 12.2.8 to 12.2.11
> >- update clients from 12.2.8 to 12.2.11
> >
> > We have since upgraded the ceph nodes to 12.2.12 but it did not help to
> > fix this problem.
>
> Have you tried the new bitmap allocator for the OSDs already (available
> since 12.2.12):
>
> [osd]
>
> # MEMORY ALLOCATOR
> bluestore_allocator = bitmap
> bluefs_allocator = bitmap
>
> The issues you are reporting sound like an issue many of us have seen on
> luminous and mimic clusters, which has been identified as being caused by
> the "stupid allocator" memory allocator.
>
> Gr. Stefan
>
>
> --
> | BIT BV  http://www.bit.nl/  Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
>


-- 
Max Vernimmen
Senior DevOps Engineer
Textkernel


--
Textkernel BV, Nieuwendammerkade 26/a5, 1022 AB, Amsterdam, NL


Re: [ceph-users] OSD hanging on 12.2.12 by message worker

2019-06-06 Thread Stefan Kooman
Quoting Max Vernimmen (vernim...@textkernel.nl):
> 
> This is happening several times per day after we made several changes at
> the same time:
> 
>- add physical ram to the ceph nodes
>- move from fixed 'bluestore cache size hdd|ssd' and 'bluestore cache kv
>max' to 'bluestore cache autotune = 1' and 'osd memory target =
>20401094656'.
>- update ceph from 12.2.8 to 12.2.11
>- update clients from 12.2.8 to 12.2.11
> 
> We have since upgraded the ceph nodes to 12.2.12 but it did not help to fix
> this problem.

Have you tried the new bitmap allocator for the OSDs already (available
since 12.2.12):

[osd]

# MEMORY ALLOCATOR
bluestore_allocator = bitmap
bluefs_allocator = bitmap

The issues you are reporting sound like an issue many of us have seen on
luminous and mimic clusters, which has been identified as being caused by
the "stupid allocator" memory allocator.

Gr. Stefan


-- 
| BIT BV  http://www.bit.nl/  Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


[ceph-users] OSD hanging on 12.2.12 by message worker

2019-06-06 Thread Max Vernimmen
Hi,


We are running VM images on ceph using RBD. We are seeing a problem where
one of our VMs gets into trouble because IO does not complete. iostat on the
VM shows IO remaining in the queue, and disk utilisation for the ceph-based
vdisks is 100%.


Upon investigation the problem seems to be with the message worker
(msgr-worker-0) thread for one OSD. Restarting the OSD process fixes the
problem: the IO completes, the VM unfreezes, and life continues without a
problem, until it happens again. Recurrence is between 30 minutes and over
24 hours. The VMs affected are always in the same pool, but it's not always
the same VM that is affected. The problem occurs with different VMs on
different hypervisors, and on different ceph nodes with different OSDs.


When the problem occurs, we see on the network a sudden jump from
<100 Mbit/s to >4 Gbit/s of continuous traffic. This traffic is between
the hypervisor and one OSD, in these cases always one of our HDD OSDs. The
traffic is not visible within the VM, only on the hypervisor.


If the client is rebooted, the problem is gone. If the OSD is restarted,
the problem is gone.
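
For reference, the in-flight and recent slow ops on the affected OSD could be
dumped via its admin socket before restarting it (assuming the socket still
responds; <id> is the OSD id):

ceph daemon osd.<id> dump_ops_in_flight
ceph daemon osd.<id> dump_historic_ops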

This is happening several times per day after we made several changes at
the same time:

   - add physical ram to the ceph nodes
   - move from fixed 'bluestore cache size hdd|ssd' and 'bluestore cache kv
   max' to 'bluestore cache autotune = 1' and 'osd memory target =
   20401094656' (see the config sketch below).
   - update ceph from 12.2.8 to 12.2.11
   - update clients from 12.2.8 to 12.2.11

We have since upgraded the ceph nodes to 12.2.12 but it did not help to fix
this problem.
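
For reference, the cache-related part of that change looks like this as a
ceph.conf sketch (values as listed above, shown here under [osd]):

[osd]
bluestore cache autotune = 1
osd memory target = 20401094656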


My request is that someone take a look at our findings below and give some
insight into whether this is a bug or a misconfiguration, or perhaps give
some idea of where to take a closer look.


Our setup is:

8 identical nodes, each with 4 HDDs (8TB, 7k rpm) and 6 SSDs (4TB). There
are a number of pools that, using crush rules, map to either the HDDs or
the SSDs. The pool that always has this problem is called 'prod_slow' and
goes to the HDDs.
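
As an illustration of how the HDD/SSD split is typically done with device
classes (a sketch, not a copy of our exact rules), the crush side looks
something like:

ceph osd crush rule create-replicated replicated_hdd default host hdd
ceph osd crush rule create-replicated replicated_ssd default host ssd
ceph osd pool set prod_slow crush_rule replicated_hdd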


I tracked down the OSD by looking at the port the client receives most
traffic from (all 4 Gbit/s is read traffic, outgoing from ceph, incoming to
the client).


root@ceph-03:~# netstat -tlpn|grep 6804

tcp        0      0 10.60.8.11:6804   0.0.0.0:*   LISTEN   3741/ceph-osd
tcp        0      0 10.60.6.11:6804   0.0.0.0:*   LISTEN   3730/ceph-osd
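
The same address:port can also be mapped to an OSD id directly from the OSD
map, e.g. (using the address from the netstat output above):

ceph osd dump | grep '10.60.6.11:6804'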


root@ceph-03:~# ps uafx|grep 3730

ceph      3730 44.3  6.3 20214604 16848524 ?   Ssl  Jun05 524:14
/usr/bin/ceph-osd -f --cluster ceph --id 23 --setuser ceph --setgroup ceph


root@ceph-03:~# ps -L -p3730

  PID   LWP TTY          TIME CMD
 3730  3730 ?        00:00:05 ceph-osd
 3730  3778 ?        00:00:00 log
 3730  3791 ?        05:19:49 msgr-worker-0
 3730  3802 ?        00:01:18 msgr-worker-1
 3730  3810 ?        00:01:25 msgr-worker-2
 3730  3842 ?        00:00:00 service
 3730  3845 ?        00:00:00 admin_socket
 3730  4015 ?        00:00:00 ceph-osd
 3730  4017 ?        00:00:00 safe_timer
 3730  4018 ?        00:00:03 safe_timer
 3730  4019 ?        00:00:00 safe_timer
 3730  4020 ?        00:00:00 safe_timer
 3730  4021 ?        00:00:14 bstore_aio
 3730  4023 ?        00:00:05 bstore_aio
 3730  4280 ?        00:00:32 rocksdb:bg0
 3730  4634 ?        00:00:00 dfin
 3730  4635 ?        00:00:12 finisher
 3730  4636 ?        00:00:51 bstore_kv_sync
 3730  4637 ?        00:00:12 bstore_kv_final
 3730  4638 ?        00:00:27 bstore_mempool
 3730  5803 ?        00:03:08 ms_dispatch
 3730  5804 ?        00:00:00 ms_local
 3730  5805 ?        00:00:00 ms_dispatch
 3730  5806 ?        00:00:00 ms_local
 3730  5807 ?        00:00:00 ms_dispatch
 3730  5808 ?        00:00:00 ms_local
 3730  5809 ?        00:00:00 ms_dispatch
 3730  5810 ?        00:00:00 ms_local
 3730  5811 ?        00:00:00 ms_dispatch
 3730  5812 ?        00:00:00 ms_local
 3730  5813 ?        00:00:00 ms_dispatch
 3730  5814 ?        00:00:00 ms_local
 3730  5815 ?        00:00:00 ms_dispatch
 3730  5816 ?        00:00:00 ms_local
 3730  5817 ?        00:00:00 safe_timer
 3730  5818 ?        00:00:00 fn_anonymous
 3730  5819 ?        00:00:02 safe_timer
 3730  5820 ?        00:00:00 tp_peering
 3730  5821 ?        00:00:00 tp_peering
 3730  5822 ?        00:00:00 fn_anonymous
 3730  5823 ?        00:00:00 fn_anonymous
 3730  5824 ?        00:00:00 safe_timer
 3730  5825 ?        00:00:00 safe_timer
 3730  5826 ?        00:00:00 safe_timer
 3730  5827 ?        00:00:00 safe_timer
 3730  5828 ?        00:00:00 osd_srv_agent
 3730  5829 ?        00:01:15 tp_osd_tp
 3730  5830 ?        00:01:27 tp_osd_tp
 3730  5831 ?        00:01:40 tp_osd_tp
 3730  5832 ?        00:00:49