Re: [ceph-users] Best practices for allocating memory to bluestore cache

2018-08-30 Thread Bastiaan Visser
Your claim that all cache is used for the K/V cache is false (with default 
settings). 

The K/V cache is capped at 512 MB by default: 

bluestore_cache_kv_max 
Description:    The maximum amount of cache devoted to key/value data 
(rocksdb). 
Type:   Unsigned Integer 
Required:   Yes 
Default:    512 * 1024 * 1024 (512 MB)
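
For reference, a minimal ceph.conf sketch of the knobs involved (the option 
names are from the BlueStore config reference; the values only illustrate 
Tyler's proposal below and are not a recommendation): 

    [osd]
    # total BlueStore cache per OSD, in bytes (3 GiB here)
    bluestore_cache_size = 3221225472
    # fraction of that cache devoted to key/value (rocksdb) data
    bluestore_cache_kv_ratio = 0.60
    # hard cap on the K/V portion; with the default of 512 MiB, raising
    # the ratio alone has no effect once this cap is hit
    bluestore_cache_kv_max = 536870912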



- Original Message -
From: "David Turner" 
To: "Tyler Bishop" 
Cc: "ceph-users" 
Sent: Friday, August 31, 2018 1:31:37 AM
Subject: Re: [ceph-users] Best practices for allocating memory to bluestore 
cache

Be very careful trying to utilize more RAM while your cluster is healthy. You're 
going to need the extra RAM when your cluster is unhealthy and your OSDs are 
peering, recovering, backfilling, etc. That's when the OSD daemons start 
needing the RAM that is recommended in the docs. 

On Thu, Aug 30, 2018, 4:21 PM Tyler Bishop <tyler.bis...@beyondhosting.net> 
wrote: 



Hi, 
My OSD host has 256GB of RAM and I have 52 OSDs. Currently I have the cache set 
to 1GB, the system only consumes around 44GB of RAM, and the rest sits 
unallocated because I am using bluestore vs filestore. 

The documentation at 
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/ 
lists defaults where the RAM is used almost exclusively for the KV cache. 

With a system like mine, do you think it would be safe to allow a 3GB cache and 
change the KV ratio to 0.60? 

Thanks 
_ 

Tyler Bishop 
EST 2007 


O: 513-299-7108 x1000 
M: 513-646-5809 
http://BeyondHosting.net 


___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 


[ceph-users] MDS not start. Timeout??

2018-08-30 Thread morf...@gmail.com

Hello all!

I had an electric power problem. After this I have 2 incomplete PGs, but 
all RBD volumes still work.


My CephFS, however, does not work. The MDS stops at the "replay" state and 
MDS-related commands hang:


cephfs-journal-tool journal export backup.bin - freezes;

cephfs-journal-tool event recover_dentries summary - freezes (no action 
in strace);


cephfs-journal-tool journal reset - freezes;


strace out:

* 
[pid  6314] <... futex resumed> )   = -1 ETIMEDOUT (Connection timed 
out)

[pid  6314] futex(0x55d342eea928, FUTEX_WAKE_PRIVATE, 1) = 0
[pid  6314] futex(0x55d342eea97c, 
FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 31, {1535692208, 
64139948},  
[pid  6318] <... futex resumed> )   = -1 ETIMEDOUT (Connection timed 
out)

[pid  6318] futex(0x55d3430b6958, FUTEX_WAKE_PRIVATE, 1) = 0
[pid  6318] futex(0x55d3430b6984, 
FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 33, {1535692208, 
80954445},  
[pid  6324] <... futex resumed> )   = -1 ETIMEDOUT (Connection timed 
out)

[pid  6324] futex(0x55d3430b7758, FUTEX_WAKE_PRIVATE, 1) = 0
[pid  6324] write(12, "c", 1)   = 1
[pid  6317] <... epoll_wait resumed> {{EPOLLIN, {u32=11, u64=11}}}, 
5000, 3) = 1

[pid  6317] read(11, "c", 256)  = 1
[pid  6317] read(11, 0x7f1558c32300, 256) = -1 EAGAIN (Resource 
temporarily unavailable)
[pid  6317] futex(0x55d3432269e0, FUTEX_WAIT_PRIVATE, 2, NULL 


[pid  6324] futex(0x55d3432269e0, FUTEX_WAKE_PRIVATE, 1) = 1
[pid  6317] <... futex resumed> )   = 0
[pid  6317] futex(0x55d3432269e0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid  6317] sendmsg(17, {msg_name(0)=NULL, 
msg_iov(1)=[{"\7\25\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\2\0\177\0\1\0\0\0\0\0\0\0\0\0\0"..., 
75}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL 
[pid  6324] futex(0x55d3430b7784, 
FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 33, {1535692208, 
169222622},  

[pid  6317] <... sendmsg resumed> ) = 75
[pid  6317] epoll_wait(10,

*

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Strange Client admin socket error in a containerized ceph environment

2018-08-30 Thread Alex Litvak

Also,

I tried to add a client admin socket, but I am getting this message:

[client]
admin socket = /var/run/ceph/$name.$pid.asok

2018-08-30 20:04:03.337 7f4641ffb700 -1 set_mon_vals failed to set 
admin_socket = $run_dir/$cluster-$name.asok: Configuration option 
'admin_socket' may not be modified at runtime
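
A quick way to check whether anything is actually listening on the conflicting 
socket (a sketch; assumes ss is available on the host or inside the container): 

    # show unix sockets and owning PIDs
    ss -xlp | grep ceph-client.admin.asok
    # if no process is listening, the file is just stale and can be removed
    rm /var/run/ceph/ceph-client.admin.asok

Since every client defaults to the same $cluster-$name.asok path, keeping $pid 
(or $cctid) in the [client] admin socket path as above is what avoids the 
collision between containers. 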


On 08/30/2018 09:06 PM, Alex Litvak wrote:

I keep getting the following error message:

2018-08-30 18:52:37.882 7fca9df7c700 -1 asok(0x7fca98000fe0) 
AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed 
to bind the UNIX domain socket to 
'/var/run/ceph/ceph-client.admin.asok': (17) File exists


Otherwise things seem to be fine

I am running mimic 13.2.1 deployed with ceph-ansible and running in 
docker containers


ls -latr /var/run/ceph/ceph-*

srwxr-xr-x 1 167 167 0 Aug 30 16:11 /var/run/ceph/ceph-osd.10.asok
srwxr-xr-x 1 167 167 0 Aug 30 16:11 /var/run/ceph/ceph-osd.25.asok
srwxr-xr-x 1 167 167 0 Aug 30 16:11 /var/run/ceph/ceph-osd.1.asok
srwxr-xr-x 1 167 167 0 Aug 30 16:11 /var/run/ceph/ceph-osd.19.asok
srwxr-xr-x 1 167 167 0 Aug 30 16:11 /var/run/ceph/ceph-osd.13.asok
srwxr-xr-x 1 167 167 0 Aug 30 16:11 /var/run/ceph/ceph-osd.16.asok
srwxr-xr-x 1 167 167 0 Aug 30 16:11 /var/run/ceph/ceph-osd.22.asok
srwxr-xr-x 1 167 167 0 Aug 30 16:11 /var/run/ceph/ceph-osd.4.asok
srwxr-xr-x 1 167 167 0 Aug 30 16:12 /var/run/ceph/ceph-osd.7.asok
srwxr-xr-x 1 167 167 0 Aug 30 17:53 
/var/run/ceph/ceph-mds.storage1n1-chi.asok
srwxr-xr-x 1 167 167 0 Aug 30 18:16 
/var/run/ceph/ceph-mon.storage1n1-chi.asok
srwxr-xr-x 1 167 167 0 Aug 30 18:40 
/var/run/ceph/ceph-mgr.storage1n1-chi.asok

srwxr-xr-x 1 167 167 0 Aug 30 18:43 /var/run/ceph/ceph-client.admin.asok
srwxr-xr-x 1 167 167 0 Aug 30 18:51 
/var/run/ceph/ceph-client.rgw.storage1n1-chi.asok


The file /var/run/ceph/ceph-client.admin.asok only shows up on one node, the 
same node that reports the error.  Here is the status as well:


ceph -s
2018-08-30 18:57:49.673 7f76457c9700 -1 asok(0x7f764fe0) 
AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed 
to bind the UNIX domain socket to 
'/var/run/ceph/ceph-client.admin.asok': (17) File exists

   cluster:
     id: 7f2fcb31-655f-4fb5-879a-8d1f6e636f7a
     health: HEALTH_OK

   services:
     mon: 3 daemons, quorum storage1n1-chi,storage1n2-chi,storage1n3-chi
     mgr: storage1n1-chi(active), standbys: storage1n3-chi, storage1n2-chi
     mds: cephfs-1/1/1 up  {0=storage1n3-chi=up:active}, 2 up:standby
     osd: 27 osds: 27 up, 27 in
     rgw: 3 daemons active

   data:
     pools:   7 pools, 608 pgs
     objects: 213  objects, 5.5 KiB
     usage:   892 GiB used, 19 TiB / 20 TiB avail
     pgs: 608 active+clean

This is a new cluster with no data yet.  I have the dashboard enabled on the 
manager, which runs on the node that displays the error.


Any help is greatly appreciated.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Strange Client admin socket error in a containerized ceph environment

2018-08-30 Thread Alex Litvak

I keep getting the following error message:

2018-08-30 18:52:37.882 7fca9df7c700 -1 asok(0x7fca98000fe0) 
AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed 
to bind the UNIX domain socket to 
'/var/run/ceph/ceph-client.admin.asok': (17) File exists


Otherwise things seem to be fine

I am running mimic 13.2.1 deployed with ceph-ansible and running in 
docker containers


ls -latr /var/run/ceph/ceph-*

srwxr-xr-x 1 167 167 0 Aug 30 16:11 /var/run/ceph/ceph-osd.10.asok
srwxr-xr-x 1 167 167 0 Aug 30 16:11 /var/run/ceph/ceph-osd.25.asok
srwxr-xr-x 1 167 167 0 Aug 30 16:11 /var/run/ceph/ceph-osd.1.asok
srwxr-xr-x 1 167 167 0 Aug 30 16:11 /var/run/ceph/ceph-osd.19.asok
srwxr-xr-x 1 167 167 0 Aug 30 16:11 /var/run/ceph/ceph-osd.13.asok
srwxr-xr-x 1 167 167 0 Aug 30 16:11 /var/run/ceph/ceph-osd.16.asok
srwxr-xr-x 1 167 167 0 Aug 30 16:11 /var/run/ceph/ceph-osd.22.asok
srwxr-xr-x 1 167 167 0 Aug 30 16:11 /var/run/ceph/ceph-osd.4.asok
srwxr-xr-x 1 167 167 0 Aug 30 16:12 /var/run/ceph/ceph-osd.7.asok
srwxr-xr-x 1 167 167 0 Aug 30 17:53 
/var/run/ceph/ceph-mds.storage1n1-chi.asok
srwxr-xr-x 1 167 167 0 Aug 30 18:16 
/var/run/ceph/ceph-mon.storage1n1-chi.asok
srwxr-xr-x 1 167 167 0 Aug 30 18:40 
/var/run/ceph/ceph-mgr.storage1n1-chi.asok

srwxr-xr-x 1 167 167 0 Aug 30 18:43 /var/run/ceph/ceph-client.admin.asok
srwxr-xr-x 1 167 167 0 Aug 30 18:51 
/var/run/ceph/ceph-client.rgw.storage1n1-chi.asok


The file /var/run/ceph/ceph-client.admin.asok only shows up on one node, the 
same node that reports the error.  Here is the status as well:


ceph -s
2018-08-30 18:57:49.673 7f76457c9700 -1 asok(0x7f764fe0) 
AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed 
to bind the UNIX domain socket to 
'/var/run/ceph/ceph-client.admin.asok': (17) File exists

  cluster:
id: 7f2fcb31-655f-4fb5-879a-8d1f6e636f7a
health: HEALTH_OK

  services:
mon: 3 daemons, quorum storage1n1-chi,storage1n2-chi,storage1n3-chi
mgr: storage1n1-chi(active), standbys: storage1n3-chi, storage1n2-chi
mds: cephfs-1/1/1 up  {0=storage1n3-chi=up:active}, 2 up:standby
osd: 27 osds: 27 up, 27 in
rgw: 3 daemons active

  data:
pools:   7 pools, 608 pgs
objects: 213  objects, 5.5 KiB
usage:   892 GiB used, 19 TiB / 20 TiB avail
pgs: 608 active+clean

This is a new cluster with no data yet.  I have the dashboard enabled on the 
manager, which runs on the node that displays the error.


Any help is greatly appreciated.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practices for allocating memory to bluestore cache

2018-08-30 Thread David Turner
Be very careful trying to utilize more RAM while your cluster is healthy.
You're going to need the extra RAM when your cluster is unhealthy
and your OSDs are peering, recovering, backfilling, etc. That's when the
OSD daemons start needing the RAM that is recommended in the docs.

On Thu, Aug 30, 2018, 4:21 PM Tyler Bishop 
wrote:

> Hi,
>
> My OSD host has 256GB of RAM and I have 52 OSDs.  Currently I have the
> cache set to 1GB, the system only consumes around 44GB of RAM, and the rest
> sits unallocated because I am using bluestore vs filestore.
>
> The documentation:
> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/
> lists defaults where the RAM is used almost exclusively for the KV cache.
>
> With a system like mine, do you think it would be safe to allow a 3GB cache
> and change the KV ratio to 0.60?
>
> Thanks
> _
>
> *Tyler Bishop*
> EST 2007
>
>
> O: 513-299-7108 x1000
> M: 513-646-5809
> http://BeyondHosting.net 
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Best practices for allocating memory to bluestore cache

2018-08-30 Thread Tyler Bishop
Hi,

My OSD host has 256GB of RAM and I have 52 OSDs.  Currently I have the cache
set to 1GB, the system only consumes around 44GB of RAM, and the rest sits
unallocated because I am using bluestore vs filestore.

The documentation:
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/
lists defaults where the RAM is used almost exclusively for the KV cache.

With a system like mine, do you think it would be safe to allow a 3GB cache
and change the KV ratio to 0.60?

Thanks
_

*Tyler Bishop*
EST 2007


O: 513-299-7108 x1000
M: 513-646-5809
http://BeyondHosting.net 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous missing osd_backfill_full_ratio

2018-08-30 Thread David Turner
This moved to the PG map in luminous. I think it might have been there in
Jewel as well.

http://docs.ceph.com/docs/luminous/man/8/ceph/#pg
ceph pg set_full_ratio <ratio>
ceph pg set_backfillfull_ratio <ratio>
ceph pg set_nearfull_ratio <ratio>
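
For example (a rough sketch with illustrative values; on Luminous the
equivalent ceph osd set-*-ratio commands are the non-deprecated form):

    ceph pg set_nearfull_ratio 0.85
    ceph pg set_backfillfull_ratio 0.90
    ceph pg set_full_ratio 0.95
    # or, on Luminous:
    ceph osd set-nearfull-ratio 0.85
    ceph osd set-backfillfull-ratio 0.90
    ceph osd set-full-ratio 0.95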


On Thu, Aug 30, 2018, 1:57 PM David C  wrote:

> Hi All
>
> I feel like this is going to be a silly query with a hopefully simple
> answer. I don't seem to have the osd_backfill_full_ratio config option on
> my OSDs and can't inject it. This is a Luminous 12.2.1 cluster that was
> upgraded from Jewel.
>
> I added an OSD to the cluster and woke up the next day to find the OSD had
> hit OSD_FULL. I'm pretty sure the reason it filled up was because the new
> host was weighted too high (I initially added two OSDs but decided to only
> backfill one at a time). The thing that surprised me was why a backfill
> full ratio didn't kick in to prevent this from happening.
>
> One potentially key piece of info is I haven't run the "ceph osd
> require-osd-release luminous" command yet (I wasn't sure what impact this
> would have so was waiting for a window with quiet client I/O).
>
> ceph osd dump is showing zero for all full ratios:
>
> # ceph osd dump | grep full_ratio
> full_ratio 0
> backfillfull_ratio 0
> nearfull_ratio 0
>
> Do I simply need to run ceph osd set-backfillfull-ratio? Or am I missing
> something here. I don't understand why I don't have a default backfill_full
> ratio on this cluster.
>
> Thanks,
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Luminous missing osd_backfill_full_ratio

2018-08-30 Thread David C
Hi All

I feel like this is going to be a silly query with a hopefully simple
answer. I don't seem to have the osd_backfill_full_ratio config option on
my OSDs and can't inject it. This is a Luminous 12.2.1 cluster that was
upgraded from Jewel.

I added an OSD to the cluster and woke up the next day to find the OSD had
hit OSD_FULL. I'm pretty sure the reason it filled up was because the new
host was weighted too high (I initially added two OSDs but decided to only
backfill one at a time). The thing that surprised me was why a backfill
full ratio didn't kick in to prevent this from happening.

One potentially key piece of info is I haven't run the "ceph osd
require-osd-release luminous" command yet (I wasn't sure what impact this
would have so was waiting for a window with quiet client I/O).

ceph osd dump is showing zero for all full ratios:

# ceph osd dump | grep full_ratio
full_ratio 0
backfillfull_ratio 0
nearfull_ratio 0

Do I simply need to run ceph osd set-backfillfull-ratio? Or am I missing
something here. I don't understand why I don't have a default backfill_full
ratio on this cluster.

Thanks,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS does not always failover to hot standby on reboot

2018-08-30 Thread Gregory Farnum
On Thu, Aug 30, 2018 at 12:46 PM William Lawton 
wrote:

Oh I see. We’d taken steps to reduce the risk of losing the active mds and
> mon leader instances at the same time in the hope that it would prevent
> this issue. Do you know if the mds always connects to a specific mon
> instance i.e. the mon provider and can it be determined which mon instance
> that is? Or is it ad hoc?
>



On Thu, Aug 30, 2018 at 9:45 AM Gregory Farnum  wrote:

> If you need to co-locate, one thing that would make it better without
> being a lot of work is trying to have the MDS connect to one of the
> monitors on a different host. You can do that by just restricting the list
> of monitors you feed it in the ceph.conf, although it's not a guarantee
> that will *prevent* it from connecting to its own monitor if there are
> failures or reconnects after first startup.
>

:)


> Sent from my iPhone
>
> On 30 Aug 2018, at 20:01, Gregory Farnum  wrote:
>
> Okay, well that will be the same reason then. If the active MDS is
> connected to a monitor and they fail at the same time, the monitors
> can’t replace the mds until they’ve been through their own election and a
> full mds timeout window.
>
>
> On Thu, Aug 30, 2018 at 11:46 AM William Lawton 
> wrote:
>
>> Thanks for the response Greg. We did originally have co-located mds and
>> mon but realised this wasn't a good idea early on and separated them out
>> onto different hosts. So our mds hosts are on ceph-01 and ceph-02, and our
>> mon hosts are on ceph-03, 04 and 05. Unfortunately we see this issue
>> occurring when we reboot ceph-02(mds) and ceph-04(mon) together. We expect
>> ceph-01 to become the active mds but often it doesn't.
>>
>> Sent from my iPhone
>>
>> On 30 Aug 2018, at 17:46, Gregory Farnum  wrote:
>>
>> Yes, this is a consequence of co-locating the MDS and monitors — if the
>> MDS reports to its co-located monitor and both fail, the monitor cluster
>> has to go through its own failure detection and then wait for a full MDS
>> timeout period after that before it marks the MDS down. :(
>>
>> We might conceivably be able to optimize for this, but there's not a
>> general solution. If you need to co-locate, one thing that would make it
>> better without being a lot of work is trying to have the MDS connect to one
>> of the monitors on a different host. You can do that by just restricting
>> the list of monitors you feed it in the ceph.conf, although it's not a
>> guarantee that will *prevent* it from connecting to its own monitor if
>> there are failures or reconnects after first startup.
>> -Greg
>>
>> On Thu, Aug 30, 2018 at 8:38 AM William Lawton 
>> wrote:
>>
>>> Hi.
>>>
>>>
>>>
>>> We have a 5 node Ceph cluster (refer to ceph -s output at bottom of
>>> email). During resiliency tests we have an occasional problem when we
>>> reboot the active MDS instance and a MON instance together i.e.
>>>  dub-sitv-ceph-02 and dub-sitv-ceph-04. We expect the MDS to failover to
>>> the standby instance dub-sitv-ceph-01 which is in standby-replay mode, and
>>> 80% of the time it does with no problems. However, 20% of the time it
>>> doesn’t and the MDS_ALL_DOWN health check is not cleared until 30 seconds
>>> later when the rebooted dub-sitv-ceph-02 and dub-sitv-ceph-04 instances
>>> come back up.
>>>
>>>
>>>
>>> When the MDS successfully fails over to the standby we see in the
>>> ceph.log the following:
>>>
>>>
>>>
>>> 2018-08-25 00:30:02.231811 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
>>> 50 : cluster [ERR] Health check failed: 1 filesystem is offline
>>> (MDS_ALL_DOWN)
>>>
>>> 2018-08-25 00:30:02.237389 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
>>> 52 : cluster [INF] Standby daemon mds.dub-sitv-ceph-01 assigned to
>>> filesystem cephfs as rank 0
>>>
>>> 2018-08-25 00:30:02.237528 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
>>> 54 : cluster [INF] Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is
>>> offline)
>>>
>>>
>>>
>>> When the active MDS role does not failover to the standby the
>>> MDS_ALL_DOWN check is not cleared until after the rebooted instances have
>>> come back up e.g.:
>>>
>>>
>>>
>>> 2018-08-25 03:30:02.936554 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
>>> 55 : cluster [ERR] Health check failed: 1 filesystem is offline
>>> (MDS_ALL_DOWN)
>>>
>>> 2018-08-25 03:30:04.235703 mon.dub-sitv-ceph-05 mon.2
>>> 10.18.186.208:6789/0 226 : cluster [INF] mon.dub-sitv-ceph-05 calling
>>> monitor election
>>>
>>> 2018-08-25 03:30:04.238672 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
>>> 56 : cluster [INF] mon.dub-sitv-ceph-03 calling monitor election
>>>
>>> 2018-08-25 03:30:09.242595 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
>>> 57 : cluster [INF] mon.dub-sitv-ceph-03 is new leader, mons
>>> dub-sitv-ceph-03,dub-sitv-ceph-05 in quorum (ranks 0,2)
>>>
>>> 2018-08-25 03:30:09.252804 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
>>> 62 : cluster [WRN] Health check failed: 1/3 mons down, quorum
>>> dub-sitv-ceph-03,dub-sitv-ceph-05 (MON_DOWN)
>>>
>>> 

Re: [ceph-users] MDS does not always failover to hot standby on reboot

2018-08-30 Thread William Lawton
Oh I see. We’d taken steps to reduce the risk of losing the active mds and mon 
leader instances at the same time in the hope that it would prevent this issue. 
Do you know if the mds always connects to a specific mon instance, i.e. the mon 
provider, and can it be determined which mon instance that is? Or is it ad hoc?

Sent from my iPhone

On 30 Aug 2018, at 20:01, Gregory Farnum <gfar...@redhat.com> wrote:

Okay, well that will be the same reason then. If the active MDS is connected 
to a monitor and they fail at the same time, the monitors can’t replace the mds 
until they’ve been through their own election and a full mds timeout window.
On Thu, Aug 30, 2018 at 11:46 AM William Lawton <william.law...@irdeto.com> wrote:
Thanks for the response Greg. We did originally have co-located mds and mon but 
realised this wasn't a good idea early on and separated them out onto different 
hosts. So our mds hosts are on ceph-01 and ceph-02, and our mon hosts are on 
ceph-03, 04 and 05. Unfortunately we see this issue occurring when we reboot 
ceph-02(mds) and ceph-04(mon) together. We expect ceph-01 to become the active 
mds but often it doesn't.

Sent from my iPhone

On 30 Aug 2018, at 17:46, Gregory Farnum <gfar...@redhat.com> wrote:

Yes, this is a consequence of co-locating the MDS and monitors — if the MDS 
reports to its co-located monitor and both fail, the monitor cluster has to go 
through its own failure detection and then wait for a full MDS timeout period 
after that before it marks the MDS down. :(

We might conceivably be able to optimize for this, but there's not a general 
solution. If you need to co-locate, one thing that would make it better without 
being a lot of work is trying to have the MDS connect to one of the monitors on 
a different host. You can do that by just restricting the list of monitors you 
feed it in the ceph.conf, although it's not a guarantee that will *prevent* it 
from connecting to its own monitor if there are failures or reconnects after 
first startup.
-Greg

On Thu, Aug 30, 2018 at 8:38 AM William Lawton <william.law...@irdeto.com> wrote:
Hi.

We have a 5 node Ceph cluster (refer to ceph -s output at bottom of email). 
During resiliency tests we have an occasional problem when we reboot the active 
MDS instance and a MON instance together i.e.  dub-sitv-ceph-02 and 
dub-sitv-ceph-04. We expect the MDS to failover to the standby instance 
dub-sitv-ceph-01 which is in standby-replay mode, and 80% of the time it does 
with no problems. However, 20% of the time it doesn’t and the MDS_ALL_DOWN 
health check is not cleared until 30 seconds later when the rebooted 
dub-sitv-ceph-02 and dub-sitv-ceph-04 instances come back up.

When the MDS successfully fails over to the standby we see in the ceph.log the 
following:

2018-08-25 00:30:02.231811 mon.dub-sitv-ceph-03 mon.0 
10.18.53.32:6789/0 50 : cluster [ERR] Health check 
failed: 1 filesystem is offline (MDS_ALL_DOWN)
2018-08-25 00:30:02.237389 mon.dub-sitv-ceph-03 mon.0 
10.18.53.32:6789/0 52 : cluster [INF] Standby daemon 
mds.dub-sitv-ceph-01 assigned to filesystem cephfs as rank 0
2018-08-25 00:30:02.237528 mon.dub-sitv-ceph-03 mon.0 
10.18.53.32:6789/0 54 : cluster [INF] Health check 
cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)

When the active MDS role does not failover to the standby the MDS_ALL_DOWN 
check is not cleared until after the rebooted instances have come back up e.g.:

2018-08-25 03:30:02.936554 mon.dub-sitv-ceph-03 mon.0 
10.18.53.32:6789/0 55 : cluster [ERR] Health check 
failed: 1 filesystem is offline (MDS_ALL_DOWN)
2018-08-25 03:30:04.235703 mon.dub-sitv-ceph-05 mon.2 
10.18.186.208:6789/0 226 : cluster [INF] 
mon.dub-sitv-ceph-05 calling monitor election
2018-08-25 03:30:04.238672 mon.dub-sitv-ceph-03 mon.0 
10.18.53.32:6789/0 56 : cluster [INF] 
mon.dub-sitv-ceph-03 calling monitor election
2018-08-25 03:30:09.242595 mon.dub-sitv-ceph-03 mon.0 
10.18.53.32:6789/0 57 : cluster [INF] 
mon.dub-sitv-ceph-03 is new leader, mons dub-sitv-ceph-03,dub-sitv-ceph-05 in 
quorum (ranks 0,2)
2018-08-25 03:30:09.252804 mon.dub-sitv-ceph-03 mon.0 
10.18.53.32:6789/0 62 : cluster [WRN] Health check 
failed: 1/3 mons down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05 (MON_DOWN)
2018-08-25 03:30:09.258693 mon.dub-sitv-ceph-03 mon.0 
10.18.53.32:6789/0 63 : cluster [WRN] overall 
HEALTH_WARN 2 osds down; 2 hosts (2 osds) down; 1/3 mons down, quorum 
dub-sitv-ceph-03,dub-sitv-ceph-05
2018-08-25 03:30:10.254162 mon.dub-sitv-ceph-03 mon.0 
10.18.53.32:6789/0 64 : cluster [WRN] Health check 
failed: Reduced data availability: 2 pgs inactive, 115 pgs peering 
(PG_AVAILABILITY)
2018-08-25 03:30:12.429145 

Re: [ceph-users] Ceph-Deploy error on 15/71 stage

2018-08-30 Thread Jones de Andrade
Hi Eugen.

Ok, edited the file /etc/salt/minion, uncommented the "log_level_logfile"
line and set it to "debug" level.

Turned off the computer, waited a few minutes so that the time frame would
stand out in the /var/log/messages file, and restarted the computer.

Using vi I "greped out" (awful wording) the reboot section. From that, I
also removed most of what it seemed totally unrelated to ceph, salt,
minions, grafana, prometheus, whatever.

I got the lines below. It does not seem to complain about anything that I
can see. :(


2018-08-30T15:41:46.455383-03:00 torcello systemd[1]: systemd 234 running
in system mode. (+PAM -AUDIT +SELINUX -IMA +APPARMOR -SMACK +SYSVINIT +UTMP
+LIBCRYPTSETUP +GCRYPT -GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID -ELFUTILS
+KMOD -IDN2 -IDN default-hierarchy=hybrid)
2018-08-30T15:41:46.456330-03:00 torcello systemd[1]: Detected architecture
x86-64.
2018-08-30T15:41:46.456350-03:00 torcello systemd[1]: nss-lookup.target:
Dependency Before=nss-lookup.target dropped
2018-08-30T15:41:46.456357-03:00 torcello systemd[1]: Started Load Kernel
Modules.
2018-08-30T15:41:46.456369-03:00 torcello systemd[1]: Starting Apply Kernel
Variables...
2018-08-30T15:41:46.457230-03:00 torcello systemd[1]: Started Alertmanager
for prometheus.
2018-08-30T15:41:46.457237-03:00 torcello systemd[1]: Started Monitoring
system and time series database.
2018-08-30T15:41:46.457403-03:00 torcello systemd[1]: Starting NTP
client/server...






*2018-08-30T15:41:46.457425-03:00 torcello systemd[1]: Started Prometheus exporter for machine metrics.
2018-08-30T15:41:46.457706-03:00 torcello prometheus[695]: level=info ts=2018-08-30T18:41:44.797896888Z caller=main.go:225 msg="Starting Prometheus" version="(version=2.1.0, branch=non-git, revision=non-git)"
2018-08-30T15:41:46.457712-03:00 torcello prometheus[695]: level=info ts=2018-08-30T18:41:44.797969232Z caller=main.go:226 build_context="(go=go1.9.4, user=abuild@lamb69, date=20180513-03:46:03)"
2018-08-30T15:41:46.457719-03:00 torcello prometheus[695]: level=info ts=2018-08-30T18:41:44.798008802Z caller=main.go:227 host_details="(Linux 4.12.14-lp150.12.4-default #1 SMP Tue May 22 05:17:22 UTC 2018 (66b2eda) x86_64 torcello (none))"
2018-08-30T15:41:46.457726-03:00 torcello prometheus[695]: level=info ts=2018-08-30T18:41:44.798044088Z caller=main.go:228 fd_limits="(soft=1024, hard=4096)"
2018-08-30T15:41:46.457738-03:00 torcello prometheus[695]: level=info ts=2018-08-30T18:41:44.802067189Z caller=web.go:383 component=web msg="Start listening for connections" address=0.0.0.0:9090
2018-08-30T15:41:46.457745-03:00 torcello prometheus[695]: level=info ts=2018-08-30T18:41:44.802037354Z caller=main.go:499 msg="Starting TSDB ..."*
2018-08-30T15:41:46.458145-03:00 torcello smartd[809]: Monitoring 1
ATA/SATA, 0 SCSI/SAS and 0 NVMe devices
2018-08-30T15:41:46.458321-03:00 torcello systemd[1]: Started NTP
client/server.
*2018-08-30T15:41:50.387157-03:00 torcello ceph_exporter[690]: 2018/08/30
15:41:50 Starting ceph exporter on ":9128"*
2018-08-30T15:41:52.658272-03:00 torcello wicked[905]: lo  up
2018-08-30T15:41:52.658738-03:00 torcello wicked[905]: eth0up
2018-08-30T15:41:52.659989-03:00 torcello systemd[1]: Started wicked
managed network interfaces.
2018-08-30T15:41:52.660514-03:00 torcello systemd[1]: Reached target
Network.
2018-08-30T15:41:52.667938-03:00 torcello systemd[1]: Starting OpenSSH
Daemon...
2018-08-30T15:41:52.668292-03:00 torcello systemd[1]: Reached target
Network is Online.




*2018-08-30T15:41:52.669132-03:00 torcello systemd[1]: Started Ceph cluster monitor daemon.
2018-08-30T15:41:52.669328-03:00 torcello systemd[1]: Reached target ceph target allowing to start/stop all ceph-mon@.service instances at once.
2018-08-30T15:41:52.670346-03:00 torcello systemd[1]: Started Ceph cluster manager daemon.
2018-08-30T15:41:52.670565-03:00 torcello systemd[1]: Reached target ceph target allowing to start/stop all ceph-mgr@.service instances at once.
2018-08-30T15:41:52.670839-03:00 torcello systemd[1]: Reached target ceph target allowing to start/stop all ceph*@.service instances at once.*
2018-08-30T15:41:52.671246-03:00 torcello systemd[1]: Starting Login and
scanning of iSCSI devices...
*2018-08-30T15:41:52.672402-03:00 torcello systemd[1]: Starting Grafana
instance...*
2018-08-30T15:41:52.678922-03:00 torcello systemd[1]: Started Backup of
/etc/sysconfig.
2018-08-30T15:41:52.679109-03:00 torcello systemd[1]: Reached target Timers.
*2018-08-30T15:41:52.679630-03:00 torcello systemd[1]: Started The Salt
API.*
2018-08-30T15:41:52.692944-03:00 torcello systemd[1]: Starting Postfix Mail
Transport Agent...
*2018-08-30T15:41:52.694687-03:00 torcello systemd[1]: Started The Salt
Master Server.*
*2018-08-30T15:41:52.696821-03:00 torcello systemd[1]: Starting The Salt
Minion...*
2018-08-30T15:41:52.772750-03:00 torcello sshd-gen-keys-start[1408]:
Checking for missing server keys in /etc/ssh

Re: [ceph-users] MDS does not always failover to hot standby on reboot

2018-08-30 Thread Gregory Farnum
Okay, well that will be the same reason then. If the active MDS is
connected to a monitor and they fail at the same time, the monitors can’t
replace the mds until they’ve been through their own election and a full
mds timeout window.
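
As a sketch of the ceph.conf restriction mentioned in the quoted reply below
(hostnames here are hypothetical, and as noted this only biases the initial
connection rather than guaranteeing anything):

    # /etc/ceph/ceph.conf on the MDS host
    [global]
    # list only monitors that are not co-located with (or likely to
    # fail together with) this MDS
    mon host = ceph-mon-b.example.com, ceph-mon-c.example.com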
On Thu, Aug 30, 2018 at 11:46 AM William Lawton 
wrote:

> Thanks for the response Greg. We did originally have co-located mds and
> mon but realised this wasn't a good idea early on and separated them out
> onto different hosts. So our mds hosts are on ceph-01 and ceph-02, and our
> mon hosts are on ceph-03, 04 and 05. Unfortunately we see this issue
> occurring when we reboot ceph-02(mds) and ceph-04(mon) together. We expect
> ceph-01 to become the active mds but often it doesn't.
>
> Sent from my iPhone
>
> On 30 Aug 2018, at 17:46, Gregory Farnum  wrote:
>
> Yes, this is a consequence of co-locating the MDS and monitors — if the
> MDS reports to its co-located monitor and both fail, the monitor cluster
> has to go through its own failure detection and then wait for a full MDS
> timeout period after that before it marks the MDS down. :(
>
> We might conceivably be able to optimize for this, but there's not a
> general solution. If you need to co-locate, one thing that would make it
> better without being a lot of work is trying to have the MDS connect to one
> of the monitors on a different host. You can do that by just restricting
> the list of monitors you feed it in the ceph.conf, although it's not a
> guarantee that will *prevent* it from connecting to its own monitor if
> there are failures or reconnects after first startup.
> -Greg
>
> On Thu, Aug 30, 2018 at 8:38 AM William Lawton 
> wrote:
>
>> Hi.
>>
>>
>>
>> We have a 5 node Ceph cluster (refer to ceph -s output at bottom of
>> email). During resiliency tests we have an occasional problem when we
>> reboot the active MDS instance and a MON instance together i.e.
>>  dub-sitv-ceph-02 and dub-sitv-ceph-04. We expect the MDS to failover to
>> the standby instance dub-sitv-ceph-01 which is in standby-replay mode, and
>> 80% of the time it does with no problems. However, 20% of the time it
>> doesn’t and the MDS_ALL_DOWN health check is not cleared until 30 seconds
>> later when the rebooted dub-sitv-ceph-02 and dub-sitv-ceph-04 instances
>> come back up.
>>
>>
>>
>> When the MDS successfully fails over to the standby we see in the
>> ceph.log the following:
>>
>>
>>
>> 2018-08-25 00:30:02.231811 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
>> 50 : cluster [ERR] Health check failed: 1 filesystem is offline
>> (MDS_ALL_DOWN)
>>
>> 2018-08-25 00:30:02.237389 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
>> 52 : cluster [INF] Standby daemon mds.dub-sitv-ceph-01 assigned to
>> filesystem cephfs as rank 0
>>
>> 2018-08-25 00:30:02.237528 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
>> 54 : cluster [INF] Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is
>> offline)
>>
>>
>>
>> When the active MDS role does not failover to the standby the
>> MDS_ALL_DOWN check is not cleared until after the rebooted instances have
>> come back up e.g.:
>>
>>
>>
>> 2018-08-25 03:30:02.936554 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
>> 55 : cluster [ERR] Health check failed: 1 filesystem is offline
>> (MDS_ALL_DOWN)
>>
>> 2018-08-25 03:30:04.235703 mon.dub-sitv-ceph-05 mon.2
>> 10.18.186.208:6789/0 226 : cluster [INF] mon.dub-sitv-ceph-05 calling
>> monitor election
>>
>> 2018-08-25 03:30:04.238672 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
>> 56 : cluster [INF] mon.dub-sitv-ceph-03 calling monitor election
>>
>> 2018-08-25 03:30:09.242595 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
>> 57 : cluster [INF] mon.dub-sitv-ceph-03 is new leader, mons
>> dub-sitv-ceph-03,dub-sitv-ceph-05 in quorum (ranks 0,2)
>>
>> 2018-08-25 03:30:09.252804 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
>> 62 : cluster [WRN] Health check failed: 1/3 mons down, quorum
>> dub-sitv-ceph-03,dub-sitv-ceph-05 (MON_DOWN)
>>
>> 2018-08-25 03:30:09.258693 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
>> 63 : cluster [WRN] overall HEALTH_WARN 2 osds down; 2 hosts (2 osds) down;
>> 1/3 mons down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05
>>
>> 2018-08-25 03:30:10.254162 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
>> 64 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs
>> inactive, 115 pgs peering (PG_AVAILABILITY)
>>
>> 2018-08-25 03:30:12.429145 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
>> 66 : cluster [WRN] Health check failed: Degraded data redundancy: 712/2504
>> objects degraded (28.435%), 86 pgs degraded (PG_DEGRADED)
>>
>> 2018-08-25 03:30:16.137408 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
>> 67 : cluster [WRN] Health check update: Reduced data availability: 1 pg
>> inactive, 69 pgs peering (PG_AVAILABILITY)
>>
>> 2018-08-25 03:30:17.193322 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
>> 68 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data
>> availability: 1 pg inactive, 69 pgs peering)
>>
>> 

[ceph-users] SPDK/DPDK with Intel P3700 NVMe pool

2018-08-30 Thread Kevin Olbrich
Hi!

During our move from filestore to bluestore, we removed several Intel P3700
NVMe from the nodes.

Is someone running a SPDK/DPDK NVMe-only EC pool? Is it working well?
The docs are very short about the setup:
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#spdk-usage

I would like to re-use these cards for high-end (max IO) for database VMs.

Some notes or feedback about the setup (ceph-volume etc.) would be
appreciated.

Thank you.

Kind regards
Kevin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rocksdb mon stores growing until restart

2018-08-30 Thread David Turner
The Hammer ticket was https://tracker.ceph.com/issues/13990.  The problem
here was that when OSDs asked each other which map they needed to keep, a
leak would set it to NULL, and then that OSD would never delete an OSD map
again until it was restarted.
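
For anyone wanting to check whether a mon is still holding lots of old osdmaps
(as Joao suggests below), a rough sketch — the store path and key prefix may
differ by version, and the kvstore tool needs the mon to be stopped:

    # committed osdmap range kept by the cluster
    ceph report 2>/dev/null | grep -E 'osdmap_(first|last)_committed'
    # or count osdmap keys directly in a (stopped) mon's store
    ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-$(hostname -s)/store.db list osdmap | wc -l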

On Thu, Aug 30, 2018 at 3:09 AM Joao Eduardo Luis  wrote:

> On 08/30/2018 09:28 AM, Dan van der Ster wrote:
> > Hi,
> >
> > Is anyone else seeing rocksdb mon stores slowly growing to >15GB,
> > eventually triggering the 'mon is using a lot of disk space' warning?
> >
> > Since upgrading to luminous, we've seen this happen at least twice.
> > Each time, we restart all the mons and then stores slowly trim down to
> > <500MB. We have 'mon compact on start = true', but it's not the
> > compaction that's shrinking the rockdb's -- the space used seems to
> > decrease over a few minutes only after *all* mons have been restarted.
> >
> > This reminds me of a hammer-era issue where references to trimmed maps
> > were leaking -- I can't find that bug at the moment, though.
>
> Next time this happens, mind listing the store contents and check if you
> are holding way too many osdmaps? You shouldn't be holding more osdmaps
> than the default IF the cluster is healthy and all the pgs are clean.
>
> I've chased a bug pertaining this last year, even got a patch, but then
> was unable to reproduce it. Didn't pursue merging the patch any longer
> (I think I may still have an open PR for it though), simply because it
> was no longer clear if it was needed.
>
>   -Joao
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS does not always failover to hot standby on reboot

2018-08-30 Thread William Lawton
Thanks for the response Greg. We did originally have co-located mds and mon but 
realised this wasn't a good idea early on and separated them out onto different 
hosts. So our mds hosts are on ceph-01 and ceph-02, and our mon hosts are on 
ceph-03, 04 and 05. Unfortunately we see this issue occurring when we reboot 
ceph-02(mds) and ceph-04(mon) together. We expect ceph-01 to become the active 
mds but often it doesn't.

Sent from my iPhone

On 30 Aug 2018, at 17:46, Gregory Farnum <gfar...@redhat.com> wrote:

Yes, this is a consequence of co-locating the MDS and monitors — if the MDS 
reports to its co-located monitor and both fail, the monitor cluster has to go 
through its own failure detection and then wait for a full MDS timeout period 
after that before it marks the MDS down. :(

We might conceivably be able to optimize for this, but there's not a general 
solution. If you need to co-locate, one thing that would make it better without 
being a lot of work is trying to have the MDS connect to one of the monitors on 
a different host. You can do that by just restricting the list of monitors you 
feed it in the ceph.conf, although it's not a guarantee that will *prevent* it 
from connecting to its own monitor if there are failures or reconnects after 
first startup.
-Greg

On Thu, Aug 30, 2018 at 8:38 AM William Lawton <william.law...@irdeto.com> wrote:
Hi.

We have a 5 node Ceph cluster (refer to ceph -s output at bottom of email). 
During resiliency tests we have an occasional problem when we reboot the active 
MDS instance and a MON instance together i.e.  dub-sitv-ceph-02 and 
dub-sitv-ceph-04. We expect the MDS to failover to the standby instance 
dub-sitv-ceph-01 which is in standby-replay mode, and 80% of the time it does 
with no problems. However, 20% of the time it doesn’t and the MDS_ALL_DOWN 
health check is not cleared until 30 seconds later when the rebooted 
dub-sitv-ceph-02 and dub-sitv-ceph-04 instances come back up.

When the MDS successfully fails over to the standby we see in the ceph.log the 
following:

2018-08-25 00:30:02.231811 mon.dub-sitv-ceph-03 mon.0 
10.18.53.32:6789/0 50 : cluster [ERR] Health check 
failed: 1 filesystem is offline (MDS_ALL_DOWN)
2018-08-25 00:30:02.237389 mon.dub-sitv-ceph-03 mon.0 
10.18.53.32:6789/0 52 : cluster [INF] Standby daemon 
mds.dub-sitv-ceph-01 assigned to filesystem cephfs as rank 0
2018-08-25 00:30:02.237528 mon.dub-sitv-ceph-03 mon.0 
10.18.53.32:6789/0 54 : cluster [INF] Health check 
cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)

When the active MDS role does not failover to the standby the MDS_ALL_DOWN 
check is not cleared until after the rebooted instances have come back up e.g.:

2018-08-25 03:30:02.936554 mon.dub-sitv-ceph-03 mon.0 
10.18.53.32:6789/0 55 : cluster [ERR] Health check 
failed: 1 filesystem is offline (MDS_ALL_DOWN)
2018-08-25 03:30:04.235703 mon.dub-sitv-ceph-05 mon.2 
10.18.186.208:6789/0 226 : cluster [INF] 
mon.dub-sitv-ceph-05 calling monitor election
2018-08-25 03:30:04.238672 mon.dub-sitv-ceph-03 mon.0 
10.18.53.32:6789/0 56 : cluster [INF] 
mon.dub-sitv-ceph-03 calling monitor election
2018-08-25 03:30:09.242595 mon.dub-sitv-ceph-03 mon.0 
10.18.53.32:6789/0 57 : cluster [INF] 
mon.dub-sitv-ceph-03 is new leader, mons dub-sitv-ceph-03,dub-sitv-ceph-05 in 
quorum (ranks 0,2)
2018-08-25 03:30:09.252804 mon.dub-sitv-ceph-03 mon.0 
10.18.53.32:6789/0 62 : cluster [WRN] Health check 
failed: 1/3 mons down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05 (MON_DOWN)
2018-08-25 03:30:09.258693 mon.dub-sitv-ceph-03 mon.0 
10.18.53.32:6789/0 63 : cluster [WRN] overall 
HEALTH_WARN 2 osds down; 2 hosts (2 osds) down; 1/3 mons down, quorum 
dub-sitv-ceph-03,dub-sitv-ceph-05
2018-08-25 03:30:10.254162 mon.dub-sitv-ceph-03 mon.0 
10.18.53.32:6789/0 64 : cluster [WRN] Health check 
failed: Reduced data availability: 2 pgs inactive, 115 pgs peering 
(PG_AVAILABILITY)
2018-08-25 03:30:12.429145 mon.dub-sitv-ceph-03 mon.0 
10.18.53.32:6789/0 66 : cluster [WRN] Health check 
failed: Degraded data redundancy: 712/2504 objects degraded (28.435%), 86 pgs 
degraded (PG_DEGRADED)
2018-08-25 03:30:16.137408 mon.dub-sitv-ceph-03 mon.0 
10.18.53.32:6789/0 67 : cluster [WRN] Health check 
update: Reduced data availability: 1 pg inactive, 69 pgs peering 
(PG_AVAILABILITY)
2018-08-25 03:30:17.193322 mon.dub-sitv-ceph-03 mon.0 
10.18.53.32:6789/0 68 : cluster [INF] Health check 
cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg inactive, 69 pgs 
peering)
2018-08-25 03:30:18.432043 mon.dub-sitv-ceph-03 mon.0 
10.18.53.32:6789/0 69 : cluster 

Re: [ceph-users] mimic/bluestore cluster can't allocate space for bluefs

2018-08-30 Thread David Turner
I have 2 OSDs failing to start due to this [1] segfault.  What is happening
matches what Sage said about this [2] bug.  The OSDs are NVMe disks and
rocksdb is compacting omaps.  I attempted setting `bluestore_bluefs_min_free
= 10737418240` and then starting the OSDs, but they both segfaulted with the
same error. The segfault happens immediately on OSD start, within 5
seconds. Is there any testing that would be helpful for figuring this out
and/or getting these 2 OSDs back up? All data has successfully migrated off of
them, so I'm at HEALTH_OK with them marked out.


[1] FAILED assert(0 == "bluefs enospc")

[2] https://bugzilla.redhat.com/show_bug.cgi?id=1600138
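
For reference, a sketch of how such per-OSD overrides can be placed in
ceph.conf before the next start attempt (osd.2 is just an example id; the
option names are the ones already mentioned in this thread):

    [osd.2]
    # give bluefs more headroom at startup
    bluestore_bluefs_min_free = 10737418240
    # extra logging Igor asked for below
    debug_bluestore = 10
    bluestore_bluefs_balance_failure_dump_interval = 1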


On Tue, Aug 14, 2018 at 12:29 PM Igor Fedotov  wrote:

> Hi Jakub,
>
> for the crashing OSD could you please set
>
> debug_bluestore=10
>
> bluestore_bluefs_balance_failure_dump_interval=1
>
>
> and collect more logs.
>
> This will hopefully provide more insight on why additional space isn't
> allocated for bluefs.
>
> Thanks,
>
> Igor
>
> On 8/14/2018 12:41 PM, Jakub Stańczak wrote:
>
> Hello All!
>
> I am using mimic full bluestore cluster with pure RGW workload. We use AWS
> i3 instance family for osd machines - each instance has 1 NVMe disk which
> is split into 4 partitions and each of those partitions is devoted to
> bluestore block device. We use 1 device per partition - so everything is
> managed by bluestore internally.
>
> The problem is that under write-heavy conditions the DB device grows fast
> and at some point bluefs will stop getting more space, which results in osd
> death. There is no recovery from this error - when bluefs runs out of space
> for rocksdb, the osd dies and it cannot be restarted.
>
> With this particular osd there is plenty of free space but we can see that
> it cannot allocate more space under weird address '_balance_bluefs_freespace
> no allocate on 0x8000'.
>
> I've also done some bluefs tuning because previously I had similar problems,
> but it appeared that bluestore could not keep up with providing enough
> storage for bluefs.
>
> bluefs settings:
> bluestore_bluefs_balance_interval = 0.333 bluestore_bluefs_gift_ratio =
> 0.05 bluestore_bluefs_min_free = 3221225472
>
> snippet from osd logs:
>
> 2018-08-13 18:15:10.960 7f6a54073700  0 bluestore(/var/lib/ceph/osd/ceph-6) 
> _balance_bluefs_freespace no allocate on 0x8000 min_alloc_size 0x2000
> 2018-08-13 18:15:11.330 7f6a54073700  0 bluestore(/var/lib/ceph/osd/ceph-6) 
> _balance_bluefs_freespace no allocate on 0x8000 min_alloc_size 0x2000
> 2018-08-13 18:15:11.752 7f6a54073700  0 bluestore(/var/lib/ceph/osd/ceph-6) 
> _balance_bluefs_freespace no allocate on 0x8000 min_alloc_size 0x2000
> 2018-08-13 18:15:11.785 7f6a5b882700  4 rocksdb: 
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.1/rpm/el7/BUILD/ceph-13.2.1/src/rocksdb
> /db/compaction_job.cc:1166] [default] [JOB 41] Generated table #14590: 304401 
> keys, 68804532 bytes
> 2018-08-13 18:15:11.785 7f6a5b882700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 
> 1534184111786253, "cf_name": "default", "job": 41, "event": 
> "table_file_creation", "file_number": 14590, "file_size": 68804532, 
> "table_properties": {"data_size
> ": 67112437, "index_size": 92, "filter_size": 913252, "raw_key_size": 
> 13383306, "raw_average_key_size": 43, "raw_value_size": 58673606, 
> "raw_average_value_size": 192, "num_data_blocks": 17090, "num_entries": 
> 304401, "filter_policy_na
> me": "rocksdb.BuiltinBloomFilter", "kDeletedKeys": "0", "kMergeOperands": 
> "0"}}
> 2018-08-13 18:15:12.245 7f6a54073700  0 bluestore(/var/lib/ceph/osd/ceph-6) 
> _balance_bluefs_freespace no allocate on 0x8000 min_alloc_size 0x2000
> 2018-08-13 18:15:12.664 7f6a54073700  0 bluestore(/var/lib/ceph/osd/ceph-6) 
> _balance_bluefs_freespace no allocate on 0x8000 min_alloc_size 0x2000
> 2018-08-13 18:15:12.743 7f6a5b882700  4 rocksdb: 
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.1/rpm/el7/BUILD/ceph-13.2.1/src/rocksdb
> /db/compaction_job.cc:1166] [default] [JOB 41] Generated table #14591: 313351 
> keys, 68830515 bytes
> 2018-08-13 18:15:12.743 7f6a5b882700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 
> 1534184112744129, "cf_name": "default", "job": 41, "event": 
> "table_file_creation", "file_number": 14591, "file_size": 68830515, 
> "table_properties": {"data_size
> ": 67109446, "index_size": 785852, "filter_size": 934166, "raw_key_size": 
> 13762246, "raw_average_key_size": 43, "raw_value_size": 58469928, 
> "raw_average_value_size": 186, "num_data_blocks": 17124, "num_entries": 
> 313351, "filter_policy_na
> me": "rocksdb.BuiltinBloomFilter", "kDeletedKeys": "0", "kMergeOperands": 
> "0"}}
> 2018-08-13 18:15:13.025 7f6a54073700  0 bluestore(/var/lib/ceph/osd/ceph-6) 
> _balance_bluefs_freespace no allocate on 0x8000 

Re: [ceph-users] cephfs speed

2018-08-30 Thread Peter Eisch
Thanks for the thought.  It’s mounted with this entry in fstab (one line, if 
email wraps it):

cephmon-s01,cephmon-s02,cephmon-s03:/     /loam    ceph    
noauto,name=clientname,secretfile=/etc/ceph/secret,noatime,_netdev    0       2

Pretty plain, but I'm open to tweaking!
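
If client-side readahead turns out to be the limiter (per Greg's cache comment 
below), one knob worth experimenting with is the kernel client's rasize mount 
option; a sketch of the same fstab line with a larger readahead (the value is 
only an example, one line if email wraps it): 

cephmon-s01,cephmon-s02,cephmon-s03:/     /loam    ceph    
noauto,name=clientname,secretfile=/etc/ceph/secret,noatime,rasize=67108864,_netdev    0       2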

peter


Peter Eisch
From: Gregory Farnum 
Date: Thursday, August 30, 2018 at 11:47 AM
To: Peter Eisch 
Cc: "ceph-users@lists.ceph.com" 
Subject: Re: [ceph-users] cephfs speed

How are you mounting CephFS? It may be that the cache settings are just set 
very badly for a 10G pipe. Plus rados bench is a very parallel large-IO 
benchmark and many benchmarks you might dump into a filesystem are definitely 
not.
-Greg

On Thu, Aug 30, 2018 at 7:54 AM Peter Eisch 
 wrote:
Hi,

I have a cluster serving cephfs and it works. It’s just slow. Client is using 
the kernel driver. I can ‘rados bench’ writes to the cephfs_data pool at wire 
speeds (9580Mb/s on a 10G link) but when I copy data into cephfs it is rare to 
get above 100Mb/s. Large file writes may start fast (2Gb/s) but within a minute 
slows. In the dashboard at the OSDs I get lots of triangles (it doesn't stream) 
which seems to be lots of starts and stops. By contrast the graphs show 
constant flow when using 'rados bench.'

I feel like I'm missing something obvious. What can I do to help diagnose this 
better or resolve the issue?

Errata:
Version: 12.2.7 (on everything)
mon: 3 daemons, quorum cephmon-s01,cephmon-s03,cephmon-s02
mgr: cephmon-s02(active), standbys: cephmon-s01, cephmon-s03
mds: cephfs1-1/1/1 up {0=cephmon-s02=up:active}, 2 up:standby
osd: 70 osds: 70 up, 70 in
rgw: 3 daemons active

rados bench summary:
Total time run: 600.043733
Total writes made: 167725
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 1118.09
Stddev Bandwidth: 7.23868
Max bandwidth (MB/sec): 1140
Min bandwidth (MB/sec): 1084
Average IOPS: 279
Stddev IOPS: 1
Max IOPS: 285
Min IOPS: 271
Average Latency(s): 0.057239
Stddev Latency(s): 0.0354817
Max latency(s): 0.367037
Min latency(s): 0.0120791

peter



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] safe to remove leftover bucket index objects

2018-08-30 Thread David Turner
I'm glad you asked this, because it was on my to-do list. I know that a
marker not existing in the bucket stats does not mean it's safe to
delete.  I have an index pool with 22k objects in it. 70 objects match
existing bucket markers. I was having a problem on the cluster and started
deleting the objects in the index pool, and after going through 200 objects
I stopped it and tested and had lost access to 3 buckets. Luckily for me they
were all buckets I've been working on deleting, so no need for recovery.

I then compared bucket IDs to the objects in that pool, but still only
found a couple hundred more matching objects. I have no idea what the other
22k objects are in the index pool that don't match bucket markers or
bucket IDs. I did confirm there was no resharding happening, both in the
reshard list and in all bucket reshard statuses.

Does anyone know how to parse the names of these objects and how to tell
what can be deleted?  This is of particular interest as I have another
cluster with 1M objects in the index pool.
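
A rough sketch of the comparison (read-only; assumes the default index pool
name, and that radosgw-admin and jq are available):

    # markers of all current buckets
    radosgw-admin bucket stats | jq -r '.[].marker' | sort -u > /tmp/markers
    # index objects, named .dir.<marker> or .dir.<marker>.<shard>
    rados -p default.rgw.buckets.index ls | sort > /tmp/index_objects
    # index objects whose name contains no live marker -> candidates for cleanup
    grep -vFf /tmp/markers /tmp/index_objects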

On Thu, Aug 30, 2018, 7:29 AM Dan van der Ster  wrote:

> Replying to self...
>
> On Wed, Aug 1, 2018 at 11:56 AM Dan van der Ster 
> wrote:
> >
> > Dear rgw friends,
> >
> > Somehow we have more than 20 million objects in our
> > default.rgw.buckets.index pool.
> > They are probably leftover from this issue we had last year:
> >
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-June/018565.html
> > and we want to clean the leftover / unused index objects
> >
> > To do this, I would rados ls the pool, get a list of all existing
> > buckets and their current marker, then delete any objects with an
> > unused marker.
> > Does that sound correct?
>
> More precisely, for example, there is an object
> .dir.61c59385-085d-4caa-9070-63a3868dccb6.2978181.59.8 in the index
> pool.
> I run `radosgw-admin bucket stats` to get the marker for all current
> existing buckets.
> The marker 61c59385-085d-4caa-9070-63a3868dccb6.2978181.59 is not
> mentioned in the bucket stats output.
> Is it safe to rados rm
> .dir.61c59385-085d-4caa-9070-63a3868dccb6.2978181.59.8 ??
>
> Thanks in advance!
>
> -- dan
>
>
>
>
>
>
>
> > Can someone suggest a better way?
> >
> > Cheers, Dan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD OSDs crashing after upgrade to 12.2.7

2018-08-30 Thread Alfredo Deza
On Thu, Aug 30, 2018 at 5:24 AM, Wolfgang Lendl
 wrote:
> Hi Alfredo,
>
>
> caught some logs:
> https://pastebin.com/b3URiA7p

That looks like there is an issue with bluestore. Maybe Radoslaw or
Adam might know a bit more.


>
> br
> wolfgang
>
> On 2018-08-29 15:51, Alfredo Deza wrote:
>> On Wed, Aug 29, 2018 at 2:06 AM, Wolfgang Lendl
>>  wrote:
>>> Hi,
>>>
>>> after upgrading my ceph clusters from 12.2.5 to 12.2.7  I'm experiencing 
>>> random crashes from SSD OSDs (bluestore) - it seems that HDD OSDs are not 
>>> affected.
>>> I destroyed and recreated some of the SSD OSDs which seemed to help.
>>>
>>> this happens on centos 7.5 (different kernels tested)
>>>
>>> /var/log/messages:
>>> Aug 29 10:24:08  ceph-osd: *** Caught signal (Segmentation fault) **
>>> Aug 29 10:24:08  ceph-osd: in thread 7f8a8e69e700 
>>> thread_name:bstore_kv_final
>>> Aug 29 10:24:08  kernel: traps: bstore_kv_final[187470] general protection 
>>> ip:7f8a997cf42b sp:7f8a8e69abc0 error:0 in 
>>> libtcmalloc.so.4.4.5[7f8a997a8000+46000]
>>> Aug 29 10:24:08  systemd: ceph-osd@2.service: main process exited, 
>>> code=killed, status=11/SEGV
>>> Aug 29 10:24:08  systemd: Unit ceph-osd@2.service entered failed state.
>>> Aug 29 10:24:08  systemd: ceph-osd@2.service failed.
>>> Aug 29 10:24:28  systemd: ceph-osd@2.service holdoff time over, scheduling 
>>> restart.
>>> Aug 29 10:24:28  systemd: Starting Ceph object storage daemon osd.2...
>>> Aug 29 10:24:28  systemd: Started Ceph object storage daemon osd.2.
>>> Aug 29 10:24:28  ceph-osd: starting osd.2 at - osd_data 
>>> /var/lib/ceph/osd/ceph-2 /var/lib/ceph/osd/ceph-2/journal
>>> Aug 29 10:24:35  ceph-osd: *** Caught signal (Segmentation fault) **
>>> Aug 29 10:24:35  ceph-osd: in thread 7f5f1e790700 thread_name:tp_osd_tp
>>> Aug 29 10:24:35  kernel: traps: tp_osd_tp[186933] general protection 
>>> ip:7f5f43103e63 sp:7f5f1e78a1c8 error:0 in 
>>> libtcmalloc.so.4.4.5[7f5f430cd000+46000]
>>> Aug 29 10:24:35  systemd: ceph-osd@0.service: main process exited, 
>>> code=killed, status=11/SEGV
>>> Aug 29 10:24:35  systemd: Unit ceph-osd@0.service entered failed state.
>>> Aug 29 10:24:35  systemd: ceph-osd@0.service failed
>> These systemd messages aren't usually helpful, try poking around
>> /var/log/ceph/ for the output on that one OSD.
>>
>> If those logs aren't useful either, try bumping up the verbosity (see
>> http://docs.ceph.com/docs/master/rados/troubleshooting/log-and-debug/#boot-time
>> )
>>> did I hit a known issue?
>>> any suggestions are highly appreciated
>>>
>>>
>>> br
>>> wolfgang
>>>
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>
> --
> Wolfgang Lendl
> IT Systems & Communications
> Medizinische Universität Wien
> Spitalgasse 23 / BT 88 /Ebene 00
> A-1090 Wien
> Tel: +43 1 40160-21231
> Fax: +43 1 40160-921200
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs speed

2018-08-30 Thread Gregory Farnum
How are you mounting CephFS? It may be that the cache settings are just set
very badly for a 10G pipe. Plus rados bench is a very parallel large-IO
benchmark and many benchmarks you might dump into a filesystem are
definitely not.
-Greg
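
For a quick check of that theory, something closer to what rados bench does
(many parallel 4 MB writes) could be tried against the mount; a rough sketch,
not a tuned benchmark, and the path is only an example:

# 8 parallel writers, 4 MB blocks, ~8 GB total, each flushed before dd exits
for i in $(seq 1 8); do
    dd if=/dev/zero of=/mnt/cephfs/bench.$i bs=4M count=256 conv=fdatasync &
done
wait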

On Thu, Aug 30, 2018 at 7:54 AM Peter Eisch 
wrote:

> Hi,
>
> I have a cluster serving cephfs and it works. It’s just slow. Client is
> using the kernel driver. I can ‘rados bench’ writes to the cephfs_data pool
> at wire speeds (9580Mb/s on a 10G link) but when I copy data into cephfs it
> is rare to get above 100Mb/s. Large file writes may start fast (2Gb/s) but
> within a minute slows. In the dashboard at the OSDs I get lots of triangles
> (it doesn't stream) which seems to be lots of starts and stops. By contrast
> the graphs show constant flow when using 'rados bench.'
>
> I feel like I'm missing something obvious. What can I do to help diagnose
> this better or resolve the issue?
>
> Errata:
> Version: 12.2.7 (on everything)
> mon: 3 daemons, quorum cephmon-s01,cephmon-s03,cephmon-s02
> mgr: cephmon-s02(active), standbys: cephmon-s01, cephmon-s03
> mds: cephfs1-1/1/1 up {0=cephmon-s02=up:active}, 2 up:standby
> osd: 70 osds: 70 up, 70 in
> rgw: 3 daemons active
>
> rados bench summary:
> Total time run: 600.043733
> Total writes made: 167725
> Write size: 4194304
> Object size: 4194304
> Bandwidth (MB/sec): 1118.09
> Stddev Bandwidth: 7.23868
> Max bandwidth (MB/sec): 1140
> Min bandwidth (MB/sec): 1084
> Average IOPS: 279
> Stddev IOPS: 1
> Max IOPS: 285
> Min IOPS: 271
> Average Latency(s): 0.057239
> Stddev Latency(s): 0.0354817
> Max latency(s): 0.367037
> Min latency(s): 0.0120791
>
> peter
>
>
>
> Peter Eisch​
> *virginpulse.com* 
> | *globalchallenge.virginpulse.com*
> 
>
> Australia | Bosnia and Herzegovina | Brazil | Canada | Singapore | 
> Switzerland | United Kingdom | USA
> Confidentiality Notice: The information contained in this e-mail,
> including any attachment(s), is intended solely for use by the designated
> recipient(s). Unauthorized use, dissemination, distribution, or
> reproduction of this message by anyone other than the intended
> recipient(s), or a person designated as responsible for delivering such
> messages to the intended recipient, is strictly prohibited and may be
> unlawful. This e-mail may contain proprietary, confidential or privileged
> information. Any views or opinions expressed are solely those of the author
> and do not necessarily represent those of Virgin Pulse, Inc. If you have
> received this message in error, or are not the named recipient(s), please
> immediately notify the sender and delete this e-mail message.
> v2.10
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS does not always failover to hot standby on reboot

2018-08-30 Thread Gregory Farnum
Yes, this is a consequence of co-locating the MDS and monitors — if the MDS
reports to its co-located monitor and both fail, the monitor cluster has to
go through its own failure detection and then wait for a full MDS timeout
period after that before it marks the MDS down. :(

We might conceivably be able to optimize for this, but there's not a
general solution. If you need to co-locate, one thing that would make it
better without being a lot of work is trying to have the MDS connect to one
of the monitors on a different host. You can do that by just restricting
the list of monitors you feed it in the ceph.conf, although it's not a
guarantee that will *prevent* it from connecting to its own monitor if
there are failures or reconnects after first startup.
-Greg
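
A sketch of that workaround, using host names from the thread below (where
exactly mon host should be set is an assumption; verify it suits your setup):

# /etc/ceph/ceph.conf on the MDS hosts: list only the monitors that are not
# rebooted together with this MDS, so its beacons normally go to a monitor
# that stays up
[global]
    mon host = dub-sitv-ceph-03:6789, dub-sitv-ceph-05:6789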

On Thu, Aug 30, 2018 at 8:38 AM William Lawton 
wrote:

> Hi.
>
>
>
> We have a 5 node Ceph cluster (refer to ceph -s output at bottom of
> email). During resiliency tests we have an occasional problem when we
> reboot the active MDS instance and a MON instance together i.e.
>  dub-sitv-ceph-02 and dub-sitv-ceph-04. We expect the MDS to failover to
> the standby instance dub-sitv-ceph-01 which is in standby-replay mode, and
> 80% of the time it does with no problems. However, 20% of the time it
> doesn’t and the MDS_ALL_DOWN health check is not cleared until 30 seconds
> later when the rebooted dub-sitv-ceph-02 and dub-sitv-ceph-04 instances
> come back up.
>
>
>
> When the MDS successfully fails over to the standby we see in the ceph.log
> the following:
>
>
>
> 2018-08-25 00:30:02.231811 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 50 : cluster [ERR] Health check failed: 1 filesystem is offline
> (MDS_ALL_DOWN)
>
> 2018-08-25 00:30:02.237389 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 52 : cluster [INF] Standby daemon mds.dub-sitv-ceph-01 assigned to
> filesystem cephfs as rank 0
>
> 2018-08-25 00:30:02.237528 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 54 : cluster [INF] Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is
> offline)
>
>
>
> When the active MDS role does not failover to the standby the MDS_ALL_DOWN
> check is not cleared until after the rebooted instances have come back up
> e.g.:
>
>
>
> 2018-08-25 03:30:02.936554 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 55 : cluster [ERR] Health check failed: 1 filesystem is offline
> (MDS_ALL_DOWN)
>
> 2018-08-25 03:30:04.235703 mon.dub-sitv-ceph-05 mon.2 10.18.186.208:6789/0
> 226 : cluster [INF] mon.dub-sitv-ceph-05 calling monitor election
>
> 2018-08-25 03:30:04.238672 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 56 : cluster [INF] mon.dub-sitv-ceph-03 calling monitor election
>
> 2018-08-25 03:30:09.242595 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 57 : cluster [INF] mon.dub-sitv-ceph-03 is new leader, mons
> dub-sitv-ceph-03,dub-sitv-ceph-05 in quorum (ranks 0,2)
>
> 2018-08-25 03:30:09.252804 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 62 : cluster [WRN] Health check failed: 1/3 mons down, quorum
> dub-sitv-ceph-03,dub-sitv-ceph-05 (MON_DOWN)
>
> 2018-08-25 03:30:09.258693 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 63 : cluster [WRN] overall HEALTH_WARN 2 osds down; 2 hosts (2 osds) down;
> 1/3 mons down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05
>
> 2018-08-25 03:30:10.254162 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 64 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs
> inactive, 115 pgs peering (PG_AVAILABILITY)
>
> 2018-08-25 03:30:12.429145 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 66 : cluster [WRN] Health check failed: Degraded data redundancy: 712/2504
> objects degraded (28.435%), 86 pgs degraded (PG_DEGRADED)
>
> 2018-08-25 03:30:16.137408 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 67 : cluster [WRN] Health check update: Reduced data availability: 1 pg
> inactive, 69 pgs peering (PG_AVAILABILITY)
>
> 2018-08-25 03:30:17.193322 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 68 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data
> availability: 1 pg inactive, 69 pgs peering)
>
> 2018-08-25 03:30:18.432043 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 69 : cluster [WRN] Health check update: Degraded data redundancy: 1286/2572
> objects degraded (50.000%), 166 pgs degraded (PG_DEGRADED)
>
> 2018-08-25 03:30:26.139491 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 71 : cluster [WRN] Health check update: Degraded data redundancy: 1292/2584
> objects degraded (50.000%), 166 pgs degraded (PG_DEGRADED)
>
> 2018-08-25 03:30:31.355321 mon.dub-sitv-ceph-04 mon.1 10.18.53.155:6789/0
> 1 : cluster [INF] mon.dub-sitv-ceph-04 calling monitor election
>
> 2018-08-25 03:30:31.371519 mon.dub-sitv-ceph-04 mon.1 10.18.53.155:6789/0
> 2 : cluster [WRN] message from mon.0 was stamped 0.817433s in the future,
> clocks not synchronized
>
> 2018-08-25 03:30:32.175677 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 72 : cluster [INF] mon.dub-sitv-ceph-03 calling monitor election
>
> 2018-08-25 

[ceph-users] MDS does not always failover to hot standby on reboot

2018-08-30 Thread William Lawton
Hi.

We have a 5 node Ceph cluster (refer to ceph -s output at bottom of email). 
During resiliency tests we have an occasional problem when we reboot the active 
MDS instance and a MON instance together i.e.  dub-sitv-ceph-02 and 
dub-sitv-ceph-04. We expect the MDS to failover to the standby instance 
dub-sitv-ceph-01 which is in standby-replay mode, and 80% of the time it does 
with no problems. However, 20% of the time it doesn't and the MDS_ALL_DOWN 
health check is not cleared until 30 seconds later when the rebooted 
dub-sitv-ceph-02 and dub-sitv-ceph-04 instances come back up.

When the MDS successfully fails over to the standby we see in the ceph.log the 
following:

2018-08-25 00:30:02.231811 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 50 : 
cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2018-08-25 00:30:02.237389 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 52 : 
cluster [INF] Standby daemon mds.dub-sitv-ceph-01 assigned to filesystem cephfs 
as rank 0
2018-08-25 00:30:02.237528 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 54 : 
cluster [INF] Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)

When the active MDS role does not failover to the standby the MDS_ALL_DOWN 
check is not cleared until after the rebooted instances have come back up e.g.:

2018-08-25 03:30:02.936554 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 55 : 
cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2018-08-25 03:30:04.235703 mon.dub-sitv-ceph-05 mon.2 10.18.186.208:6789/0 226 
: cluster [INF] mon.dub-sitv-ceph-05 calling monitor election
2018-08-25 03:30:04.238672 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 56 : 
cluster [INF] mon.dub-sitv-ceph-03 calling monitor election
2018-08-25 03:30:09.242595 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 57 : 
cluster [INF] mon.dub-sitv-ceph-03 is new leader, mons 
dub-sitv-ceph-03,dub-sitv-ceph-05 in quorum (ranks 0,2)
2018-08-25 03:30:09.252804 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 62 : 
cluster [WRN] Health check failed: 1/3 mons down, quorum 
dub-sitv-ceph-03,dub-sitv-ceph-05 (MON_DOWN)
2018-08-25 03:30:09.258693 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 63 : 
cluster [WRN] overall HEALTH_WARN 2 osds down; 2 hosts (2 osds) down; 1/3 mons 
down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05
2018-08-25 03:30:10.254162 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 64 : 
cluster [WRN] Health check failed: Reduced data availability: 2 pgs inactive, 
115 pgs peering (PG_AVAILABILITY)
2018-08-25 03:30:12.429145 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 66 : 
cluster [WRN] Health check failed: Degraded data redundancy: 712/2504 objects 
degraded (28.435%), 86 pgs degraded (PG_DEGRADED)
2018-08-25 03:30:16.137408 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 67 : 
cluster [WRN] Health check update: Reduced data availability: 1 pg inactive, 69 
pgs peering (PG_AVAILABILITY)
2018-08-25 03:30:17.193322 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 68 : 
cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data 
availability: 1 pg inactive, 69 pgs peering)
2018-08-25 03:30:18.432043 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 69 : 
cluster [WRN] Health check update: Degraded data redundancy: 1286/2572 objects 
degraded (50.000%), 166 pgs degraded (PG_DEGRADED)
2018-08-25 03:30:26.139491 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 71 : 
cluster [WRN] Health check update: Degraded data redundancy: 1292/2584 objects 
degraded (50.000%), 166 pgs degraded (PG_DEGRADED)
2018-08-25 03:30:31.355321 mon.dub-sitv-ceph-04 mon.1 10.18.53.155:6789/0 1 : 
cluster [INF] mon.dub-sitv-ceph-04 calling monitor election
2018-08-25 03:30:31.371519 mon.dub-sitv-ceph-04 mon.1 10.18.53.155:6789/0 2 : 
cluster [WRN] message from mon.0 was stamped 0.817433s in the future, clocks 
not synchronized
2018-08-25 03:30:32.175677 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 72 : 
cluster [INF] mon.dub-sitv-ceph-03 calling monitor election
2018-08-25 03:30:32.175864 mon.dub-sitv-ceph-05 mon.2 10.18.186.208:6789/0 227 
: cluster [INF] mon.dub-sitv-ceph-05 calling monitor election
2018-08-25 03:30:32.180615 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 73 : 
cluster [INF] mon.dub-sitv-ceph-03 is new leader, mons 
dub-sitv-ceph-03,dub-sitv-ceph-04,dub-sitv-ceph-05 in quorum (ranks 0,1,2)
2018-08-25 03:30:32.189593 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 78 : 
cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum 
dub-sitv-ceph-03,dub-sitv-ceph-05)
2018-08-25 03:30:32.190820 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 79 : 
cluster [WRN] mon.1 10.18.53.155:6789/0 clock skew 0.811318s > max 0.05s
2018-08-25 03:30:32.194280 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 80 : 
cluster [WRN] overall HEALTH_WARN 2 osds down; 2 hosts (2 osds) down; Degraded 
data redundancy: 1292/2584 objects degraded (50.000%), 166 pgs degraded
2018-08-25 03:30:35.076121 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 83 : 

[ceph-users] cephfs speed

2018-08-30 Thread Peter Eisch
Hi,

I have a cluster serving cephfs and it works.  It’s just slow.  Client is using 
the kernel driver.  I can ‘rados bench’ writes to the cephfs_data pool at wire 
speeds (9580Mb/s on a 10G link) but when I copy data into cephfs it is rare to 
get above 100Mb/s.  Large file writes may start fast (2Gb/s) but within a 
minute slows.  In the dashboard at the OSDs I get lots of triangles (it doesn't 
stream) which seems to be lots of starts and stops.  By contrast the graphs 
show constant flow when using 'rados bench.'

I feel like I'm missing something obvious.  What can I do to help diagnose this 
better or resolve the issue?

Errata:
Version: 12.2.7 (on everything)
mon: 3 daemons, quorum cephmon-s01,cephmon-s03,cephmon-s02
mgr: cephmon-s02(active), standbys: cephmon-s01, cephmon-s03
mds: cephfs1-1/1/1 up  {0=cephmon-s02=up:active}, 2 up:standby
osd: 70 osds: 70 up, 70 in
rgw: 3 daemons active

rados bench summary:
Total time run: 600.043733
Total writes made:  167725
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 1118.09
Stddev Bandwidth:   7.23868
Max bandwidth (MB/sec): 1140
Min bandwidth (MB/sec): 1084
Average IOPS:   279
Stddev IOPS:1
Max IOPS:   285
Min IOPS:   271
Average Latency(s): 0.057239
Stddev Latency(s):  0.0354817
Max latency(s): 0.367037
Min latency(s): 0.0120791

peter




Peter Eisch
virginpulse.com
|globalchallenge.virginpulse.com
Australia | Bosnia and Herzegovina | Brazil | Canada | Singapore | Switzerland 
| United Kingdom | USA
Confidentiality Notice: The information contained in this e-mail, including any 
attachment(s), is intended solely for use by the designated recipient(s). 
Unauthorized use, dissemination, distribution, or reproduction of this message 
by anyone other than the intended recipient(s), or a person designated as 
responsible for delivering such messages to the intended recipient, is strictly 
prohibited and may be unlawful. This e-mail may contain proprietary, 
confidential or privileged information. Any views or opinions expressed are 
solely those of the author and do not necessarily represent those of Virgin 
Pulse, Inc. If you have received this message in error, or are not the named 
recipient(s), please immediately notify the sender and delete this e-mail 
message.
v2.10
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Clients report OSDs down/up (dmesg) nothing in Ceph logs (flapping OSDs)

2018-08-30 Thread Eugen Block

Correct, except it doesn't have to be a specific host or a specific
OSD.  What matters here is whether the client is idle.  As soon as the
client is woken up and sends a request to _any_ OSD, it receives a new
osdmap and applies it, possibly emitting those dmesg entries.


Thanks for the clarification!


Quoting Ilya Dryomov :


On Thu, Aug 30, 2018 at 1:04 PM Eugen Block  wrote:


Hi again,

we still didn't figure out the reason for the flapping, but I wanted
to get back on the dmesg entries.
They just reflect what happened in the past, they're no indicator to
predict anything.


The kernel client is just that, a client.  Almost by definition,
everything it sees has already happened.



For example, when I changed the primary-affinity of OSD.24 last week,
one of the clients realized that only today, 4 days later. If the
clients don't have to communicate with the respective host/osd in the
meantime, they log those events on the next reconnect.


Correct, except it doesn't have to be a specific host or a specific
OSD.  What matters here is whether the client is idle.  As soon as the
client is woken up and sends a request to _any_ OSD, it receives a new
osdmap and applies it, possibly emitting those dmesg entries.

Thanks,

Ilya




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Clients report OSDs down/up (dmesg) nothing in Ceph logs (flapping OSDs)

2018-08-30 Thread Ilya Dryomov
On Thu, Aug 30, 2018 at 1:04 PM Eugen Block  wrote:
>
> Hi again,
>
> we still didn't figure out the reason for the flapping, but I wanted
> to get back on the dmesg entries.
> They just reflect what happened in the past, they're no indicator to
> predict anything.

The kernel client is just that, a client.  Almost by definition,
everything it sees has already happened.

>
> For example, when I changed the primary-affinity of OSD.24 last week,
> one of the clients realized that only today, 4 days later. If the
> clients don't have to communicate with the respective host/osd in the
> meantime, they log those events on the next reconnect.

Correct, except it doesn't have to be a specific host or a specific
OSD.  What matters here is whether the client is idle.  As soon as the
client is woken up and sends a request to _any_ OSD, it receives a new
osdmap and applies it, possibly emitting those dmesg entries.

Thanks,

Ilya
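
One way to see that in practice (a sketch; it needs debugfs mounted on the
client, and the exact file format varies by kernel version) is to compare
the osdmap epoch the kernel client is holding with the cluster's current one:

# on the client: each kernel mount has a directory under /sys/kernel/debug/ceph
grep -H epoch /sys/kernel/debug/ceph/*/osdmap

# on any cluster node: the current osdmap epoch is on the first line
ceph osd dump | head -2
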
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] safe to remove leftover bucket index objects

2018-08-30 Thread Dan van der Ster
Replying to self...

On Wed, Aug 1, 2018 at 11:56 AM Dan van der Ster  wrote:
>
> Dear rgw friends,
>
> Somehow we have more than 20 million objects in our
> default.rgw.buckets.index pool.
> They are probably leftover from this issue we had last year:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-June/018565.html
> and we want to clean the leftover / unused index objects
>
> To do this, I would rados ls the pool, get a list of all existing
> buckets and their current marker, then delete any objects with an
> unused marker.
> Does that sound correct?

More precisely, for example, there is an object
.dir.61c59385-085d-4caa-9070-63a3868dccb6.2978181.59.8 in the index
pool.
I run `radosgw-admin bucket stats` to get the marker for all current
existing buckets.
The marker 61c59385-085d-4caa-9070-63a3868dccb6.2978181.59 is not
mentioned in the bucket stats output.
Is it safe to rados rm .dir.61c59385-085d-4caa-9070-63a3868dccb6.2978181.59.8 ??
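
For reference, a dry-run sketch of that comparison over the whole pool
(assumes jq; with ~20 million objects the loop will be slow, and nothing is
removed here; the output is only a candidate list to review):

radosgw-admin bucket stats | jq -r '.[] | .marker, .id' | sort -u > /tmp/known-markers
rados -p default.rgw.buckets.index ls | while read -r obj; do
    base=${obj#.dir.}
    grep -qxF "$base" /tmp/known-markers && continue        # unsharded index object in use
    grep -qxF "${base%.*}" /tmp/known-markers && continue   # sharded index object in use
    echo "$obj"
done > /tmp/index-objects-without-bucket
wc -l /tmp/index-objects-without-bucket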

Thanks in advance!

-- dan







> Can someone suggest a better way?
>
> Cheers, Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] librmb: Mail storage on RADOS with Dovecot

2018-08-30 Thread Marc Roos
 

How is it going with this? Are we getting close to a state where we can 
store a mailbox on ceph with this librmb?



-Original Message-
From: Wido den Hollander [mailto:w...@42on.com] 
Sent: Monday, 25 September 2017 9:20
To: Gregory Farnum; Danny Al-Gaaf
Cc: ceph-users
Subject: Re: [ceph-users] librmb: Mail storage on RADOS with Dovecot


> Op 22 september 2017 om 23:56 schreef Gregory Farnum 
:
> 
> 
> On Fri, Sep 22, 2017 at 2:49 PM, Danny Al-Gaaf 
 wrote:
> > Am 22.09.2017 um 22:59 schrieb Gregory Farnum:
> > [..]
> >> This is super cool! Is there anything written down that explains 
> >> this for Ceph developers who aren't familiar with the workings of 
Dovecot?
> >> I've got some questions I see going through it, but they may be 
> >> very dumb.
> >>
> >> *) Why are indexes going on CephFS? Is this just about wanting a 
> >> local cache, or about the existing Dovecot implementations, or 
> >> something else? Almost seems like you could just store the whole 
> >> thing in a CephFS filesystem if that's safe. ;)
> >
> > This is, if everything works as expected, only an intermediate step. 

> > An idea is
> > (https://dalgaaf.github.io/CephMeetUpBerlin20170918-librmb/#/status-
> > 3) be to use omap to store the index/meta data.
> >
> > We chose a step-by-step approach and since we are currently not sure 

> > if using omap would work performance wise, we use CephFS (also since 

> > this requires no changes in Dovecot). Currently we put our focus on 
> > the development of the first version of librmb, but the code to use 
> > omap is already there. It needs integration, testing, and 
> > performance tuning to verify if it would work with our requirements.
> >
> >> *) It looks like each email is getting its own object in RADOS, and 

> >> I assume those are small messages, which leads me to
> >
> > The mail distribution looks like this:
> > https://dalgaaf.github.io/CephMeetUpBerlin20170918-librmb/#/mailplat
> > form-mails-dist
> >
> >
> > Yes, the majority of the mails are under 500k, but the most objects 
> > are around 50k. Not so many very small objects.
> 
> Ah, that slide makes more sense with that context — I was paging 
> through it in bed last night and thought it was about the number of 
> emails per user or something weird.
> 
> So those mail objects are definitely bigger than I expected; 
interesting.
> 
> >
> >>   *) is it really cost-acceptable to not use EC pools on email 
data?
> >
> > We will use EC pools for the mail objects and replication for 
CephFS.
> >
> > But even without EC there would be a cost case compared to the 
> > current system. We will save a large amount of IOPs in the new 
> > platform since the (NFS) POSIX layer is removed from the IO path (at 

> > least for the mail objects). And we expect with Ceph and commodity 
> > hardware we can compete with a traditional enterprise NAS/NFS 
anyway.
> >
> >>   *) isn't per-object metadata overhead a big cost compared to the 
> >> actual stored data?
> >
> > I assume not. The metadata/index is not so much compared to the size 

> > of the mails (currently with NFS around 10% I would say). In the 
> > classic NFS based dovecot the number of index/cache/metadata files 
> > is an issue anyway. With 6.7 billion mails we have 1.2 billion 
> > index/cache/metadata files 
> > 
(https://dalgaaf.github.io/CephMeetUpBerlin20170918-librmb/#/mailplatfor
m-mails-nums).
> 
> I was unclear; I meant the RADOS metadata cost of storing an object. I 

> haven't quantified that in a while but it was big enough to make 4KB 
> objects pretty expensive, which I was incorrectly assuming would be 
> the case for most emails.
> EC pools have the same issue; if you want to erasure-code a 40KB 
> object into 5+3 then you pay the metadata overhead for each 8KB
> (40KB/5) of data, but again that's more on the practical side of 
> things than my initial assumptions placed it.

Yes, it is. But combining object isn't easy either. RGW also has this 
limitation where objects are striped in RADOS and the EC overhead can 
become large.

At this moment the price/GB (correct me if needed Danny!) isn't the 
biggest problem. It could be that all mails will be stored on a 
replicated pool.

There also might be some overhead in BlueStore per object, but the 
number of Deutsche Telekom show that mails usually aren't 4kb. Only a 
small portion of e-mails is 4kb.

We will see how this turns out.

Wido

> 
> This is super cool!
> -Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD OSDs crashing after upgrade to 12.2.7

2018-08-30 Thread Wolfgang Lendl
Hi Alfredo,


caught some logs:
https://pastebin.com/b3URiA7p

br
wolfgang

On 2018-08-29 15:51, Alfredo Deza wrote:
> On Wed, Aug 29, 2018 at 2:06 AM, Wolfgang Lendl
>  wrote:
>> Hi,
>>
>> after upgrading my ceph clusters from 12.2.5 to 12.2.7  I'm experiencing 
>> random crashes from SSD OSDs (bluestore) - it seems that HDD OSDs are not 
>> affected.
>> I destroyed and recreated some of the SSD OSDs which seemed to help.
>>
>> this happens on centos 7.5 (different kernels tested)
>>
>> /var/log/messages:
>> Aug 29 10:24:08  ceph-osd: *** Caught signal (Segmentation fault) **
>> Aug 29 10:24:08  ceph-osd: in thread 7f8a8e69e700 thread_name:bstore_kv_final
>> Aug 29 10:24:08  kernel: traps: bstore_kv_final[187470] general protection 
>> ip:7f8a997cf42b sp:7f8a8e69abc0 error:0 in 
>> libtcmalloc.so.4.4.5[7f8a997a8000+46000]
>> Aug 29 10:24:08  systemd: ceph-osd@2.service: main process exited, 
>> code=killed, status=11/SEGV
>> Aug 29 10:24:08  systemd: Unit ceph-osd@2.service entered failed state.
>> Aug 29 10:24:08  systemd: ceph-osd@2.service failed.
>> Aug 29 10:24:28  systemd: ceph-osd@2.service holdoff time over, scheduling 
>> restart.
>> Aug 29 10:24:28  systemd: Starting Ceph object storage daemon osd.2...
>> Aug 29 10:24:28  systemd: Started Ceph object storage daemon osd.2.
>> Aug 29 10:24:28  ceph-osd: starting osd.2 at - osd_data 
>> /var/lib/ceph/osd/ceph-2 /var/lib/ceph/osd/ceph-2/journal
>> Aug 29 10:24:35  ceph-osd: *** Caught signal (Segmentation fault) **
>> Aug 29 10:24:35  ceph-osd: in thread 7f5f1e790700 thread_name:tp_osd_tp
>> Aug 29 10:24:35  kernel: traps: tp_osd_tp[186933] general protection 
>> ip:7f5f43103e63 sp:7f5f1e78a1c8 error:0 in 
>> libtcmalloc.so.4.4.5[7f5f430cd000+46000]
>> Aug 29 10:24:35  systemd: ceph-osd@0.service: main process exited, 
>> code=killed, status=11/SEGV
>> Aug 29 10:24:35  systemd: Unit ceph-osd@0.service entered failed state.
>> Aug 29 10:24:35  systemd: ceph-osd@0.service failed
> These systemd messages aren't usually helpful, try poking around
> /var/log/ceph/ for the output on that one OSD.
>
> If those logs aren't useful either, try bumping up the verbosity (see
> http://docs.ceph.com/docs/master/rados/troubleshooting/log-and-debug/#boot-time
> )
>> did I hit a known issue?
>> any suggestions are highly appreciated
>>
>>
>> br
>> wolfgang
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>

-- 
Wolfgang Lendl
IT Systems & Communications
Medizinische Universität Wien
Spitalgasse 23 / BT 88 /Ebene 00
A-1090 Wien
Tel: +43 1 40160-21231
Fax: +43 1 40160-921200




signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rocksdb mon stores growing until restart

2018-08-30 Thread Joao Eduardo Luis
On 08/30/2018 09:28 AM, Dan van der Ster wrote:
> Hi,
> 
> Is anyone else seeing rocksdb mon stores slowly growing to >15GB,
> eventually triggering the 'mon is using a lot of disk space' warning?
> 
> Since upgrading to luminous, we've seen this happen at least twice.
> Each time, we restart all the mons and then stores slowly trim down to
> <500MB. We have 'mon compact on start = true', but it's not the
> compaction that's shrinking the rockdb's -- the space used seems to
> decrease over a few minutes only after *all* mons have been restarted.
> 
> This reminds me of a hammer-era issue where references to trimmed maps
> were leaking -- I can't find that bug at the moment, though.

Next time this happens, mind listing the store contents and check if you
are holding way too many osdmaps? You shouldn't be holding more osdmaps
than the default IF the cluster is healthy and all the pgs are clean.

I've chased a bug pertaining this last year, even got a patch, but then
was unable to reproduce it. Didn't pursue merging the patch any longer
(I think I may still have an open PR for it though), simply because it
was no longer clear if it was needed.

  -Joao
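
In case it helps the next time it happens, a sketch of that kind of check
(ceph-monstore-tool should be pointed at a stopped mon or at a copy of the
store; the path is an example):

# how many osdmap epochs the monitors are currently retaining
ceph report | grep committed

# rough breakdown of keys per prefix inside a mon store
ceph-monstore-tool /var/lib/ceph/mon/ceph-$(hostname -s) dump-keys \
    | awk '{print $1}' | sort | uniq -c | sort -rn | head
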
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS : fuse client vs kernel driver

2018-08-30 Thread Hervé Ballans

Hi all,

I just finished setting up a new Ceph cluster (Luminous 12.2.7, 3xMON 
nodes and 6xOSDs nodes, BlueStore OSD on sata hdd with WAL/DB on 
separated NVMe devices, 2x10 Gbs network per node, 3 replicas by pool)


I created a CephFS pool : data pool uses hdd OSDs and metadata pool uses 
dedicated NVMe OSDs. I deployed 3 MDS demons (2 active + 1 failover).


My Ceph cluster is in 'HEALTH_OK' state, for now everything seems to be 
working perfectly.


My question is regarding the cephfs client, and especially the huge 
performance gap between the fuse client and the kernel one.
On the same write test, run one after the other on each mount, I see a factor 
of 55 between the two!


Here is an example from a client (connected to 10 Gbs on the same LAN) :

CephFS Fuse client :

# ceph-fuse -m /FIRST_MON_NODE_IP/:6789 /mnt/ceph_newhome/
# time sh -c "dd if=/dev/zero 
of=/mnt/ceph_newhome/test_io_fuse_mount.tmp bs=4k count=200 && sync"

200+0 records in
200+0 records out
819200 bytes (8.2 GB, 7.6 GiB) copied, 305.57 s, 26.8 MB/s

real    5m5.607s
user    0m1.784s
sys 0m28.584s

CephFS Kernel driver :

# umount /mnt/ceph_newhome
# mount -t ceph /FIRST_MON_NODE_IP/:6789:/ /mnt/ceph_newhome -o 
name=admin,secret=`ceph-authtool -p /etc/ceph/ceph.client.admin.keyring`
# time sh -c "dd if=/dev/zero 
of=/mnt/ceph_newhome/test_io_kernel_mount.tmp bs=4k count=200 && sync"

200+0 records in
200+0 records out
819200 bytes (8.2 GB, 7.6 GiB) copied, 5.47228 s, 1.5 GB/s

real    0m15.161s
user    0m0.444s
sys 0m5.024s

I'm impressed by the write speed of the kernel driver and, since I can use it 
on my client systems, I'm satisfied... but I would like to know whether such 
a difference is normal, or whether there are options/optimizations that 
improve the IO speed of the fuse client? (I'm thinking in particular of the 
recovery scenario where the kernel driver is no longer available after a 
system update/upgrade and I have to use the fuse client as a temporary 
replacement...)
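
For reference, the client-side tunables that seem relevant for ceph-fuse
caching are along these lines (a sketch; these are standard client options,
the values are only examples, and I have not verified whether they close a
gap this large):

# ceph.conf [client] section on the machine running ceph-fuse
[client]
    client oc size = 629145600          # object cacher size in bytes
    client oc max dirty = 314572800     # dirty bytes allowed before write-back
    client readahead max bytes = 4194304

Some gap is expected in any case, since every write through ceph-fuse pays a
FUSE round-trip that the kernel client avoids.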


Thanks for your suggestions,
Hervé



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Clients report OSDs down/up (dmesg) nothing in Ceph logs (flapping OSDs)

2018-08-30 Thread Eugen Block

Hi again,

we still didn't figure out the reason for the flapping, but I wanted  
to get back on the dmesg entries.
They just reflect what happened in the past, they're no indicator to  
predict anything.


For example, when I changed the primary-affinity of OSD.24 last week,  
one of the clients realized that only today, 4 days later. If the  
clients don't have to communicate with the respective host/osd in the  
meantime, they log those events on the next reconnect.
I just wanted to share that in case anybody else is wondering (or  
maybe it was just me).


Regards,
Eugen


Quoting Eugen Block :


Update:
I changed the primary affinity of one OSD back to 1.0 to test if  
those metrics change, and indeed they do:

OSD.24 immediately shows values greater than 0.
I guess the metrics are completely unrelated to the flapping.

So the search goes on...


Quoting Eugen Block :

An hour ago host5 started to report the OSDs on host4 as down  
(still no clue why), resulting in slow requests. This time no  
flapping occured, the cluster recovered a couple of minutes later.  
No other OSDs reported that, only those two on host5. There's  
nothing in the logs of the reporting or the affected OSDs.


Then I compared a perf dump of one healthy OSD with one on host4.  
There's something strange about the metrics (many of them are 0), I  
just can't tell if it's related to the fact that host4 has no  
primary OSDs. But even with no primary OSD I would expect different  
values for OSDs that are running for a week now.


---cut here---
host1:~ # diff -u perfdump.osd1 perfdump.osd24
--- perfdump.osd1   2018-08-23 11:03:03.695927316 +0200
+++ perfdump.osd24  2018-08-23 11:02:09.919927375 +0200
@@ -1,99 +1,99 @@
{
"osd": {
"op_wip": 0,
-"op": 7878594,
-"op_in_bytes": 852767683202,
-"op_out_bytes": 1019871565411,
+"op": 0,
+"op_in_bytes": 0,
+"op_out_bytes": 0,
"op_latency": {
-"avgcount": 7878594,
-"sum": 1018863.131206702,
-"avgtime": 0.129320425
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
},
"op_process_latency": {
-"avgcount": 7878594,
-"sum": 879970.400440694,
-"avgtime": 0.111691299
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
},
"op_prepare_latency": {
-"avgcount": 8321733,
-"sum": 41376.442963329,
-"avgtime": 0.004972094
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
},
-"op_r": 3574792,
-"op_r_out_bytes": 1019871565411,
+"op_r": 0,
+"op_r_out_bytes": 0,
"op_r_latency": {
-"avgcount": 3574792,
-"sum": 54750.502669010,
-"avgtime": 0.015315717
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
},
"op_r_process_latency": {
-"avgcount": 3574792,
-"sum": 34107.703579874,
-"avgtime": 0.009541171
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
},
"op_r_prepare_latency": {
-"avgcount": 3574817,
-"sum": 34262.515884817,
-"avgtime": 0.009584411
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
},
-"op_w": 4249520,
-"op_w_in_bytes": 847518164870,
+"op_w": 0,
+"op_w_in_bytes": 0,
"op_w_latency": {
-"avgcount": 4249520,
-"sum": 960898.540843217,
-"avgtime": 0.226119312
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
},
"op_w_process_latency": {
-"avgcount": 4249520,
-"sum": 844398.804808119,
-"avgtime": 0.198704513
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
},
"op_w_prepare_latency": {
-"avgcount": 4692618,
-"sum": 7032.358957948,
-"avgtime": 0.001498600
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
},
-"op_rw": 54282,
-"op_rw_in_bytes": 5249518332,
+"op_rw": 0,
+"op_rw_in_bytes": 0,
"op_rw_out_bytes": 0,
"op_rw_latency": {
-"avgcount": 54282,
-"sum": 3214.087694475,
-"avgtime": 0.059210929
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
},
"op_rw_process_latency": {
-"avgcount": 54282,
-"sum": 1463.892052701,
-"avgtime": 0.026968277
+"avgcount": 0,
+"sum": 0.0,
+

[ceph-users] rocksdb mon stores growing until restart

2018-08-30 Thread Dan van der Ster
Hi,

Is anyone else seeing rocksdb mon stores slowly growing to >15GB,
eventually triggering the 'mon is using a lot of disk space' warning?

Since upgrading to luminous, we've seen this happen at least twice.
Each time, we restart all the mons and then stores slowly trim down to
<500MB. We have 'mon compact on start = true', but it's not the
compaction that's shrinking the rockdb's -- the space used seems to
decrease over a few minutes only after *all* mons have been restarted.

This reminds me of a hammer-era issue where references to trimmed maps
were leaking -- I can't find that bug at the moment, though.

Cheers, Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Odp.: New Ceph community manager: Mike Perez

2018-08-30 Thread Tomasz Kuzemko
Welcome Mike! You're the perfect person for this role!

--
Tomasz Kuzemko
tomasz.kuze...@corp.ovh.com


From: ceph-users  on behalf of Sage 
Weil 
Sent: Wednesday, 29 August 2018 03:13
To: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com; 
ceph-commun...@lists.ceph.com
Subject: [ceph-users] New Ceph community manager: Mike Perez

Hi everyone,

Please help me welcome Mike Perez, the new Ceph community manager!

Mike has a long history with Ceph: he started at DreamHost working on
OpenStack and Ceph back in the early days, including work on the original
RBD integration.  He went on to work in several roles in the OpenStack
project, doing a mix of infrastructure, cross-project and community
related initiatives, including serving as the Project Technical Lead for
Cinder.

Mike lives in Pasadena, CA, and can be reached at mpe...@redhat.com, on
IRC as thingee, or twitter as @thingee.

I am very excited to welcome Mike back to Ceph, and look forward to
working together on building the Ceph developer and user communities!

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-Deploy error on 15/71 stage

2018-08-30 Thread Eugen Block

Hi,


So, it only contains logs concerning the node itself (is that correct? since
node01 is also the master, I was expecting it to have logs from the other
nodes too) and, moreover, no ceph-osd* files. Also, I'm looking at the logs I
have available, and nothing "shines out" (sorry for my poor English) as a
possible error.


the logging is not configured to be centralised by default, you would  
have to configure that yourself.


Regarding the OSDs, if there are OSD logs created, they're created on  
the OSD nodes, not on the master. But since the OSD deployment fails,  
there probably are no OSD-specific logs yet. So you'll have to take a  
look at the syslog (/var/log/messages); that's where the salt-minion  
reports its attempts to create the OSDs. Chances are high that you'll  
find the root cause there.


If the output is not enough, set the log-level to debug:

osd-1:~ # grep -E "^log_level" /etc/salt/minion
log_level: debug
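
To pull the relevant lines out of the syslog mentioned above, something like
this may be enough (a sketch; it assumes a systemd-based minion):

# on an OSD node, right after a failed deployment run
grep -iE 'salt|osd|sda4|error' /var/log/messages | tail -n 200
journalctl -u salt-minion --since "1 hour ago" | grep -iE 'osd|sda4|error'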


Regards,
Eugen


Quoting Jones de Andrade :


Hi Eugen.

Sorry for the delay in answering.

Just looked in the /var/log/ceph/ directory. It only contains the following
files (for example on node01):

###
# ls -lart
total 3864
-rw--- 1 ceph ceph 904 ago 24 13:11 ceph.audit.log-20180829.xz
drwxr-xr-x 1 root root 898 ago 28 10:07 ..
-rw-r--r-- 1 ceph ceph  189464 ago 28 23:59 ceph-mon.node01.log-20180829.xz
-rw--- 1 ceph ceph   24360 ago 28 23:59 ceph.log-20180829.xz
-rw-r--r-- 1 ceph ceph   48584 ago 29 00:00 ceph-mgr.node01.log-20180829.xz
-rw--- 1 ceph ceph   0 ago 29 00:00 ceph.audit.log
drwxrws--T 1 ceph ceph 352 ago 29 00:00 .
-rw-r--r-- 1 ceph ceph 1908122 ago 29 12:46 ceph-mon.node01.log
-rw--- 1 ceph ceph  175229 ago 29 12:48 ceph.log
-rw-r--r-- 1 ceph ceph 1599920 ago 29 12:49 ceph-mgr.node01.log
###

So, it only contains logs concerning the node itself (is that correct? since
node01 is also the master, I was expecting it to have logs from the other
nodes too) and, moreover, no ceph-osd* files. Also, I'm looking at the logs I
have available, and nothing "shines out" (sorry for my poor English) as a
possible error.

Any suggestion on how to proceed?

Thanks a lot in advance,

Jones


On Mon, Aug 27, 2018 at 5:29 AM Eugen Block  wrote:


Hi Jones,

all ceph logs are in the directory /var/log/ceph/, each daemon has its
own log file, e.g. OSD logs are named ceph-osd.*.

I haven't tried it but I don't think SUSE Enterprise Storage deploys
OSDs on partitioned disks. Is there a way to attach a second disk to
the OSD nodes, maybe via USB or something?

Although this thread is ceph related it is referring to a specific
product, so I would recommend to post your question in the SUSE forum
[1].

Regards,
Eugen

[1] https://forums.suse.com/forumdisplay.php?99-SUSE-Enterprise-Storage

Quoting Jones de Andrade :

> Hi Eugen.
>
> Thanks for the suggestion. I'll look for the logs (since it's our first
> attempt with ceph, I'll have to discover where they are, but no problem).
>
> One thing called my attention on your response however:
>
> I haven't made myself clear, but one of the failures we encountered was
> that the files now containing:
>
> node02:
>--
>storage:
>--
>osds:
>--
>/dev/sda4:
>--
>format:
>bluestore
>standalone:
>True
>
> Were originally empty, and we filled them by hand following a model found
> elsewhere on the web. It was necessary, so that we could continue, but
the
> model indicated that, for example, it should have the path for /dev/sda
> here, not /dev/sda4. We chosen to include the specific partition
> identification because we won't have dedicated disks here, rather just
the
> very same partition as all disks were partitioned exactly the same.
>
> While that was enough for the procedure to continue at that point, now I
> wonder if it was the right call and, if it indeed was, if it was done
> properly.  As such, I wonder: what you mean by "wipe" the partition here?
> /dev/sda4 is created, but is both empty and unmounted: Should a different
> operation be performed on it, should I remove it first, should I have
> written the files above with only /dev/sda as target?
>
> I know that probably I wouldn't run in this issues with dedicated discks,
> but unfortunately that is absolutely not an option.
>
> Thanks a lot in advance for any comments and/or extra suggestions.
>
> Sincerely yours,
>
> Jones
>
> On Sat, Aug 25, 2018 at 5:46 PM Eugen Block  wrote:
>
>> Hi,
>>
>> take a look into the logs, they should point you in the right direction.
>> Since the deployment stage fails at the OSD level, start with the OSD
>> logs. Something's not right with the disks/partitions, did you wipe
>> the partition from previous attempts?
>>
>> Regards,
>> Eugen
>>
>> Zitat von Jones de Andrade :
>>
>>> (Please forgive my previous email: I was using another message and
>>> completely