Re: [ceph-users] out of memory bluestore osds

2019-08-08 Thread Jaime Ibar

Hi Mark,

thanks a lot for your explanation and clarification.

Adjusting osd_memory_target to fit in our systems did the trick.

Jaime

On 07/08/2019 14:09, Mark Nelson wrote:

Hi Jaime,


We only use the cache size parameters now if you've disabled 
autotuning.  With autotuning we adjust the cache size on the fly to 
try to keep the mapped process memory under the osd_memory_target.  
You can set a lower memory target than default, though you will have 
far less cache for bluestore onodes and rocksdb.  You may notice that 
it's slower, especially if you have a big active data set you are 
processing.  I don't usually recommend setting the osd_memory_target 
below 2GB.  At some point it will have shrunk the caches as far as it 
can and the process memory may start exceeding the target.  (with our 
default rocksdb and pglog settings this usually happens somewhere 
between 1.3-1.7GB once the OSD has been sufficiently saturated with 
IO). Given memory prices right now, I'd still recommend upgrading RAM 
if you have the ability though.  You might be able to get away with 
setting each OSD to 2-2.5GB in your scenario but you'll be pushing it.
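
For what it's worth, a minimal sketch of what lowering the target could look like (this assumes osd_memory_target is present and runtime-adjustable in this 12.2.12 build; the 2.5GB value is only an illustration):

# persist it in ceph.conf on the OSD hosts, [osd] section:
#   osd_memory_target = 2684354560
# and apply it to the running daemons:
ceph tell osd.* injectargs '--osd_memory_target 2684354560'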



I would not recommend lowering the osd_memory_cache_min.  You really 
want rocksdb indexes/filters fitting in cache, and as many bluestore 
onodes as you can get.  In any event, you'll still be bound by the 
(currently hardcoded) 64MB cache chunk allocation size in the 
autotuner which osd_memory_cache_min can't reduce (and that's per 
cache while osd_memory_cache_min is global for the kv, buffer, and 
rocksdb block caches).  I.e., each cache is going to get 64MB plus growth 
room regardless of how low you set osd_memory_cache_min.  That's 
intentional as we don't want a single SST file in rocksdb to be able 
to completely blow everything else out of the block cache during 
compaction, only to quickly become invalid, removed from the cache, 
and make it look to the priority cache system like rocksdb doesn't 
actually need any more memory for cache.



Mark


On 8/7/19 7:44 AM, Jaime Ibar wrote:

Hi all,

we run a Ceph Luminous 12.2.12 cluster with 7 osd servers, 12x4TB disks each.

Recently we redeployed the osds of one of them using the bluestore backend;
however, after this, we're facing out-of-memory errors (oom-killer invoked)
and the OS kills one of the ceph-osd processes.
The osd is restarted automatically and is back online after one minute.
We're running Ubuntu 16.04, kernel 4.15.0-55-generic.
The server has 32GB of RAM and 4GB of swap partition.
All the disks are hdd, no ssd disks.
Bluestore settings are the default ones

"osd_memory_target": "4294967296"
"osd_memory_cache_min": "134217728"
"bluestore_cache_size": "0"
"bluestore_cache_size_hdd": "1073741824"
"bluestore_cache_autotune": "true"

As stated in the documentation, bluestore assigns 4GB of RAM per osd by
default (roughly 1GB of RAM per 1TB).
So in this case 48GB of RAM would be needed. Am I right?

Are these the minimum requirements for bluestore?
In case adding more RAM is not an option, can any of
osd_memory_target, osd_memory_cache_min or bluestore_cache_size_hdd
be decreased to fit our server specs?
Would this have any impact on performance?

Thanks
Jaime


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] out of memory bluestore osds

2019-08-07 Thread Jaime Ibar

Hi all,

we run a Ceph Luminous 12.2.12 cluster with 7 osd servers, 12x4TB disks each.
Recently we redeployed the osds of one of them using the bluestore backend;
however, after this, we're facing out-of-memory errors (oom-killer invoked)
and the OS kills one of the ceph-osd processes.
The osd is restarted automatically and is back online after one minute.
We're running Ubuntu 16.04, kernel 4.15.0-55-generic.
The server has 32GB of RAM and 4GB of swap partition.
All the disks are hdd, no ssd disks.
Bluestore settings are the default ones

"osd_memory_target": "4294967296"
"osd_memory_cache_min": "134217728"
"bluestore_cache_size": "0"
"bluestore_cache_size_hdd": "1073741824"
"bluestore_cache_autotune": "true"

As stated in the documentation, bluestore assigns 4GB of RAM per osd by
default (roughly 1GB of RAM per 1TB).
So in this case 48GB of RAM would be needed. Am I right?

Are these the minimum requirements for bluestore?
In case adding more RAM is not an option, can any of
osd_memory_target, osd_memory_cache_min or bluestore_cache_size_hdd
be decreased to fit our server specs?
Would this have any impact on performance?
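
For reference, a rough sizing sketch for a 32GB box with 12 osds, assuming roughly 20% of RAM is left for the OS and for per-osd overhead above the target (values are illustrative only):

# (32 GB * 0.8) / 12 osds ~= 2.1 GB per osd
# ceph.conf, [osd] section:
#   osd_memory_target = 2147483648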

Thanks
Jaime

--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs clients hanging multi mds to single mds

2018-10-02 Thread Jaime Ibar

Hi Paul,

I tried ceph-fuse, mounting it on a different mount point, and it worked.

The problem here is we can't unmount the ceph kernel client as it is in use
by some virsh processes. We forced the unmount and mounted ceph-fuse, but we
got an I/O error; mount -l cleared all the processes, but after rebooting the
vm's they didn't come back and a server reboot was needed.

Not sure how I can restore the mds session or remount cephfs while keeping
all the processes alive.

Thanks a lot for your help.

Jaime


On 02/10/18 11:02, Paul Emmerich wrote:

Kernel 4.4 is not suitable for a multi MDS setup. In general, I
wouldn't feel comfortable running 4.4 with kernel cephfs in
production.
I think at least 4.15 (not sure, but definitely > 4.9) is recommended
for multi MDS setups.

If you can't reboot: maybe try ceph-fuse instead, which is usually
very solid and usually fast enough.

Paul
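
For reference, a hedged sketch of switching a client to ceph-fuse (package name, monitor addresses and mount point below are placeholders):

apt-get install ceph-fuse
ceph-fuse -n client.admin -m mon1:6789,mon2:6789,mon3:6789 /mnt/cephfs
# optional fstab entry for the fuse client:
# none  /mnt/cephfs  fuse.ceph  ceph.id=admin,_netdev,defaults  0 0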

On Tue, 2 Oct 2018 at 10:45, Jaime Ibar wrote:

Hi Paul,

we're using the 4.4 kernel. Not sure if more recent kernels are stable
for production services. In any case, as there are some production
services running on those servers, we'd rather avoid rebooting if we
can bring the ceph clients back without it.

Thanks

Jaime


On 01/10/18 21:10, Paul Emmerich wrote:

Which kernel version are you using for the kernel cephfs clients?
I've seen this problem with "older" kernels (where old is as recent as 4.9)

Paul
On Mon, 1 Oct 2018 at 18:35, Jaime Ibar wrote:

Hi all,

we're running a ceph 12.2.7 Luminous cluster; two weeks ago we enabled
multi mds and after a few hours

these errors started showing up

2018-09-28 09:41:20.577350 mds.1 [WRN] slow request 64.421475 seconds
old, received at 2018-09-28 09:40:16.155841:
client_request(client.31059144:8544450 getattr Xs #0$
12e1e73 2018-09-28 09:40:16.147368 caller_uid=0, caller_gid=124{})
currently failed to authpin local pins

2018-09-28 10:56:51.051100 mon.1 [WRN] Health check failed: 5 clients
failing to respond to cache pressure (MDS_CLIENT_RECALL)
2018-09-28 10:57:08.000361 mds.1 [WRN] 3 slow requests, 1 included
below; oldest blocked for > 4614.580689 secs
2018-09-28 10:57:08.000365 mds.1 [WRN] slow request 244.796854 seconds
old, received at 2018-09-28 10:53:03.203476:
client_request(client.31059144:9080057 lookup #0x100
000b7564/58 2018-09-28 10:53:03.197922 caller_uid=0, caller_gid=0{})
currently initiated
2018-09-28 11:00:00.000105 mon.1 [WRN] overall HEALTH_WARN 1 clients
failing to respond to capability release; 5 clients failing to respond
to cache pressure; 1 MDSs report slow requests,

Due to this, we decided to go back to single mds (as it worked before);
however, the clients pointing to mds.1 started hanging, while the
ones pointing to mds.0 worked fine.

Then we tried to enable multi mds again and the clients pointing to mds.1
went back online, however the ones pointing to mds.0 stopped working.

Today we tried to go back to single mds, however this error was
preventing ceph from disabling the second active mds (mds.1):

2018-10-01 14:33:48.358443 mds.1 [WRN] evicting unresponsive client
X: (30108925), after 68213.084174 seconds

After waiting for 3 hours, we restarted the mds.1 daemon (as it was stuck
in the stopping state forever due to the above error), waited for it to
become active again, unmounted the problematic clients, waited for the
cluster to be healthy and tried to go back to single mds again.

Apparently this worked for some of the clients. We tried to enable multi
mds again to bring the faulty clients back, however no luck this time
and some of them are hanging and can't access the ceph fs.

This is what we have in kern.log

Oct  1 15:29:32 05 kernel: [2342847.017426] ceph: mds1 reconnect start
Oct  1 15:29:32 05 kernel: [2342847.018677] ceph: mds1 reconnect success
Oct  1 15:29:49 05 kernel: [2342864.651398] ceph: mds1 recovery completed

Not sure what else we can try to bring the hanging clients back without
rebooting, as they're in production and rebooting is not an option.

Does anyone know how we can deal with this, please?

Thanks

Jaime

--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725





--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs clients hanging multi mds to single mds

2018-10-02 Thread Jaime Ibar

Hi Paul,

we're using the 4.4 kernel. Not sure if more recent kernels are stable
for production services. In any case, as there are some production
services running on those servers, we'd rather avoid rebooting if we
can bring the ceph clients back without it.

Thanks

Jaime


On 01/10/18 21:10, Paul Emmerich wrote:

Which kernel version are you using for the kernel cephfs clients?
I've seen this problem with "older" kernels (where old is as recent as 4.9)

Paul
On Mon, 1 Oct 2018 at 18:35, Jaime Ibar wrote:

Hi all,

we're running a ceph 12.2.7 Luminous cluster; two weeks ago we enabled
multi mds and after a few hours

these errors started showing up

2018-09-28 09:41:20.577350 mds.1 [WRN] slow request 64.421475 seconds
old, received at 2018-09-28 09:40:16.155841:
client_request(client.31059144:8544450 getattr Xs #0$
12e1e73 2018-09-28 09:40:16.147368 caller_uid=0, caller_gid=124{})
currently failed to authpin local pins

2018-09-28 10:56:51.051100 mon.1 [WRN] Health check failed: 5 clients
failing to respond to cache pressure (MDS_CLIENT_RECALL)
2018-09-28 10:57:08.000361 mds.1 [WRN] 3 slow requests, 1 included
below; oldest blocked for > 4614.580689 secs
2018-09-28 10:57:08.000365 mds.1 [WRN] slow request 244.796854 seconds
old, received at 2018-09-28 10:53:03.203476:
client_request(client.31059144:9080057 lookup #0x100
000b7564/58 2018-09-28 10:53:03.197922 caller_uid=0, caller_gid=0{})
currently initiated
2018-09-28 11:00:00.000105 mon.1 [WRN] overall HEALTH_WARN 1 clients
failing to respond to capability release; 5 clients failing to respond
to cache pressure; 1 MDSs report slow requests,

Due to this, we decided to go back to single mds (as it worked before);
however, the clients pointing to mds.1 started hanging, while the
ones pointing to mds.0 worked fine.

Then we tried to enable multi mds again and the clients pointing to mds.1
went back online, however the ones pointing to mds.0 stopped working.

Today we tried to go back to single mds, however this error was
preventing ceph from disabling the second active mds (mds.1):

2018-10-01 14:33:48.358443 mds.1 [WRN] evicting unresponsive client
X: (30108925), after 68213.084174 seconds

After waiting for 3 hours, we restarted the mds.1 daemon (as it was stuck
in the stopping state forever due to the above error), waited for it to
become active again, unmounted the problematic clients, waited for the
cluster to be healthy and tried to go back to single mds again.

Apparently this worked for some of the clients. We tried to enable multi
mds again to bring the faulty clients back, however no luck this time
and some of them are hanging and can't access the ceph fs.

This is what we have in kern.log

Oct  1 15:29:32 05 kernel: [2342847.017426] ceph: mds1 reconnect start
Oct  1 15:29:32 05 kernel: [2342847.018677] ceph: mds1 reconnect success
Oct  1 15:29:49 05 kernel: [2342864.651398] ceph: mds1 recovery completed

Not sure what else we can try to bring the hanging clients back without
rebooting, as they're in production and rebooting is not an option.

Does anyone know how we can deal with this, please?

Thanks

Jaime

--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs clients hanging multi mds to single mds

2018-10-02 Thread Jaime Ibar

Hi,

there's only one entry in the blacklist; however, it is a mon, not a
cephfs client, and no cephfs is mounted on that host.

We're using the kernel client, and the kernel version is 4.4 for both the
ceph services and the cephfs clients.


This is what we have in /sys/kernel/debug/ceph

cat mdsmap

epoch 59259
root 0
session_timeout 60
session_autoclose 300
    mds0    xxx:6800    (up:active)


cat mdsc

13049   mds0    getattr  #1506e43
13051   (no request)    getattr  #150922b
13053   (no request)    getattr  #150922b
13055   (no request)    getattr  #150922b
13057   (no request)    getattr  #150922b
13058   (no request)    getattr  #150922b
13059   (no request)    getattr  #150922b
13063   mds0    lookup   #150922b/.cache (.cache)

[...]

cat mds_sessions

global_id 29669848
name "cephfs"
mds.0 opening
mds.1 restarting

And it is similar for the other clients.

Thanks

Jaime


On 01/10/18 19:13, Burkhard Linke wrote:

Hi,


we also experience hanging clients after MDS restarts; in our case we 
only use a single active MDS server, and the clients are actively 
blacklisted by the MDS server after a restart. It usually happens if the 
clients are not responsive during the MDS restart (e.g. being very busy).



You can check whether this is the case in your setup by inspecting the 
blacklist ('ceph osd blacklist ls'). It should print the connections 
which are currently blacklisted.



You can also remove entries ('ceph osd blacklist rm ...'), but be 
warned that the mechanism is there for a reason. Removing a 
blacklisted entry might result in file corruption if client and MDS 
server disagree about the current state. Use at own risk.
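
For reference, a hedged example of those commands (the address below is made up and the output format may differ between releases):

ceph osd blacklist ls
# listed 1 entries
# 10.1.2.3:0/3418904512 2018-10-02 18:32:48.758102
ceph osd blacklist rm 10.1.2.3:0/3418904512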



We were also trying a multi-active setup after upgrading to luminous, 
but we were running into the same problem with the same error message. 
It was probably due to old kernel clients, so in the case of kernel-based 
cephfs I would recommend upgrading to the latest available kernel.



As another approach you can check the current state of the cephfs 
client, either by using the daemon socket in case of ceph-fuse, or the 
debug information in /sys/kernel/debug/ceph/... for the kernel client.


Regards,

Burkhard


On 01.10.2018 18:34, Jaime Ibar wrote:

Hi all,

we're running a ceph 12.2.7 Luminous cluster; two weeks ago we enabled
multi mds and after a few hours


these errors started showing up

2018-09-28 09:41:20.577350 mds.1 [WRN] slow request 64.421475 seconds 
old, received at 2018-09-28 09:40:16.155841: 
client_request(client.31059144:8544450 getattr Xs #0$
12e1e73 2018-09-28 09:40:16.147368 caller_uid=0, 
caller_gid=124{}) currently failed to authpin local pins


2018-09-28 10:56:51.051100 mon.1 [WRN] Health check failed: 5 clients 
failing to respond to cache pressure (MDS_CLIENT_RECALL)
2018-09-28 10:57:08.000361 mds.1 [WRN] 3 slow requests, 1 included 
below; oldest blocked for > 4614.580689 secs
2018-09-28 10:57:08.000365 mds.1 [WRN] slow request 244.796854 
seconds old, received at 2018-09-28 10:53:03.203476: 
client_request(client.31059144:9080057 lookup #0x100
000b7564/58 2018-09-28 10:53:03.197922 caller_uid=0, caller_gid=0{}) 
currently initiated
2018-09-28 11:00:00.000105 mon.1 [WRN] overall HEALTH_WARN 1 clients 
failing to respond to capability release; 5 clients failing to 
respond to cache pressure; 1 MDSs report slow requests,


Due to this, we decided to go back to single mds (as it worked before);
however, the clients pointing to mds.1 started hanging, while the
ones pointing to mds.0 worked fine.

Then we tried to enable multi mds again and the clients pointing to mds.1
went back online, however the ones pointing to mds.0 stopped working.

Today we tried to go back to single mds, however this error was
preventing ceph from disabling the second active mds (mds.1):


2018-10-01 14:33:48.358443 mds.1 [WRN] evicting unresponsive client 
X: (30108925), after 68213.084174 seconds


After waiting for 3 hours, we restarted the mds.1 daemon (as it was stuck
in the stopping state forever due to the above error), waited for it to
become active again, unmounted the problematic clients, waited for the
cluster to be healthy and tried to go back to single mds again.

Apparently this worked for some of the clients. We tried to enable multi
mds again to bring the faulty clients back, however no luck this time
and some of them are hanging and can't access the ceph fs.

This is what we have in kern.log

Oct  1 15:29:32 05 kernel: [2342847.017426] ceph: mds1 reconnect start
Oct  1 15:29:32 05 kernel: [2342847.018677] ceph: mds1 reconnect success
Oct  1 15:29:49 05 kernel: [2342864.651398] ceph: mds1 recovery 
completed


Not sure what else we can try to bring the hanging clients back without
rebooting, as they're in production and rebooting is not an option.

Does anyone know how we can deal with this, please?

Thanks

Jaime



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[ceph-users] cephfs clients hanging multi mds to single mds

2018-10-01 Thread Jaime Ibar

Hi all,

we're running a ceph 12.2.7 Luminous cluster; two weeks ago we enabled
multi mds and after a few hours


these errors started showing up

2018-09-28 09:41:20.577350 mds.1 [WRN] slow request 64.421475 seconds 
old, received at 2018-09-28 09:40:16.155841: 
client_request(client.31059144:8544450 getattr Xs #0$
12e1e73 2018-09-28 09:40:16.147368 caller_uid=0, caller_gid=124{}) 
currently failed to authpin local pins


2018-09-28 10:56:51.051100 mon.1 [WRN] Health check failed: 5 clients 
failing to respond to cache pressure (MDS_CLIENT_RECALL)
2018-09-28 10:57:08.000361 mds.1 [WRN] 3 slow requests, 1 included 
below; oldest blocked for > 4614.580689 secs
2018-09-28 10:57:08.000365 mds.1 [WRN] slow request 244.796854 seconds 
old, received at 2018-09-28 10:53:03.203476: 
client_request(client.31059144:9080057 lookup #0x100
000b7564/58 2018-09-28 10:53:03.197922 caller_uid=0, caller_gid=0{}) 
currently initiated
2018-09-28 11:00:00.000105 mon.1 [WRN] overall HEALTH_WARN 1 clients 
failing to respond to capability release; 5 clients failing to respond 
to cache pressure; 1 MDSs report slow requests,


Due to this, we decided to go back to single mds (as it worked before);
however, the clients pointing to mds.1 started hanging, while the
ones pointing to mds.0 worked fine.

Then we tried to enable multi mds again and the clients pointing to mds.1
went back online, however the ones pointing to mds.0 stopped working.

Today we tried to go back to single mds, however this error was
preventing ceph from disabling the second active mds (mds.1):


2018-10-01 14:33:48.358443 mds.1 [WRN] evicting unresponsive client 
X: (30108925), after 68213.084174 seconds


After waiting for 3 hours, we restarted the mds.1 daemon (as it was stuck
in the stopping state forever due to the above error), waited for it to
become active again, unmounted the problematic clients, waited for the
cluster to be healthy and tried to go back to single mds again.

Apparently this worked for some of the clients. We tried to enable multi
mds again to bring the faulty clients back, however no luck this time
and some of them are hanging and can't access the ceph fs.

This is what we have in kern.log

Oct  1 15:29:32 05 kernel: [2342847.017426] ceph: mds1 reconnect start
Oct  1 15:29:32 05 kernel: [2342847.018677] ceph: mds1 reconnect success
Oct  1 15:29:49 05 kernel: [2342864.651398] ceph: mds1 recovery completed

Not sure what else we can try to bring the hanging clients back without
rebooting, as they're in production and rebooting is not an option.

Does anyone know how we can deal with this, please?

Thanks

Jaime

--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow requests blocked. No rebalancing

2018-09-20 Thread Jaime Ibar

Hi all,

after increasing the mon_max_pg_per_osd value, ceph starts rebalancing as
usual. However, the slow requests warnings are still there, even after
setting primary-affinity to 0 beforehand.

On the other hand, if I destroy the osd, ceph will start rebalancing unless
the noout flag is set, am I right?
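
For reference, a hedged sketch of the flags and setting discussed in this thread (the value is illustrative; the default for mon_max_pg_per_osd is 200 in 12.2.x as far as I know):

ceph osd set noout       # keep ceph from rebalancing while the osd is destroyed/re-created
# ... replace the osd ...
ceph osd unset noout
# ceph.conf, [global] section:
#   mon_max_pg_per_osd = 400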

Thanks

Jaime


On 20/09/18 14:25, Paul Emmerich wrote:

You can prevent creation of the PGs on the old filestore OSDs (which
seems to be the culprit here) during replacement by replacing the
disks the hard way:

* ceph osd destroy osd.X
* re-create with bluestore under the same id (ceph-volume ... --osd-id X), as sketched below

it will then just backfill onto the same disk without moving any PG.

Keep in mind that this means that you are running with one missing
copy during the recovery, so that's not the recommended way to do
that.

Paul
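
A hedged sketch of that sequence with ceph-volume (device path and osd id are placeholders; this assumes a ceph-volume recent enough to support --osd-id and zap --destroy):

ceph osd destroy 16 --yes-i-really-mean-it
ceph-volume lvm zap /dev/sdX --destroy
ceph-volume lvm create --osd-id 16 --bluestore --data /dev/sdX

Setting noout beforehand avoids extra data movement while the osd is being rebuilt.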


2018-09-20 13:51 GMT+02:00 Eugen Block :

Hi,

To reduce the impact on clients during migration I would set the OSD's
primary-affinity to 0 beforehand. This should prevent the slow requests; at
least this setting has helped us a lot with problematic OSDs.

Regards
Eugen
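
For reference, a minimal example of that primary-affinity trick (the osd id is illustrative):

ceph osd primary-affinity osd.16 0   # stop the osd from acting as primary
# ... drain / replace the osd ...
ceph osd primary-affinity osd.16 1   # restore the default afterwards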


Quoting Jaime Ibar:



Hi all,

we recently upgraded from Jewel 10.2.10 to Luminous 12.2.7, and now we're
trying to migrate the osds to Bluestore following this document [0];
however, when I mark the osd as out, I'm getting warnings similar to these ones:

2018-09-20 09:32:46.079630 mon.dri-ceph01 [WRN] Health check failed: 2
slow requests are blocked > 32 sec. Implicated osds 16,28 (REQUEST_SLOW)
2018-09-20 09:32:52.841123 mon.dri-ceph01 [WRN] Health check update: 7
slow requests are blocked > 32 sec. Implicated osds 10,16,28,32,59
(REQUEST_SLOW)
2018-09-20 09:32:57.842230 mon.dri-ceph01 [WRN] Health check update: 15
slow requests are blocked > 32 sec. Implicated osds 10,16,28,31,32,59,78,80
(REQUEST_SLOW)

2018-09-20 09:32:58.851142 mon.dri-ceph01 [WRN] Health check update:
244944/40100780 objects misplaced (0.611%) (OBJECT_MISPLACED)
2018-09-20 09:32:58.851160 mon.dri-ceph01 [WRN] Health check update: 249
PGs pending on creation (PENDING_CREATING_PGS)

which prevent ceph from starting rebalancing; the vm's running on ceph start
hanging and we have to mark the osd back in.

I tried to reweight the osd to 0.90 in order to minimize the impact on the
cluster, but the warnings are the same.

I tried increasing these settings to

mds cache memory limit = 2147483648
rocksdb cache size = 2147483648

but with no luck, same warnings.

We also have cephfs for storing files from different projects (no directory
fragmentation enabled).

The problem here is that if one osd dies, all the services will be blocked
as ceph won't be able to start rebalancing.

The cluster is

- 3 mons

- 3 mds(running on the same hosts as the mons). 2 mds active and 1 standby

- 3 mgr(running on the same hosts as the mons)

- 6 servers, 12 osd's each.

- 1GB private network


Does anyone know how to fix this, or where the problem could be?

Thanks a lot in advance.

Jaime


[0]
http://docs.ceph.com/docs/luminous/rados/operations/bluestore-migration/

--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/  |ja...@tchpc.tcd.ie
Tel: +353-1-896-3725




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Slow requests blocked. No rebalancing

2018-09-20 Thread Jaime Ibar

Hi all,

we recently upgraded from Jewel 10.2.10 to Luminous 12.2.7, and now we're
trying to migrate the osds to Bluestore following this document [0];
however, when I mark the osd as out, I'm getting warnings similar to these ones:

2018-09-20 09:32:46.079630 mon.dri-ceph01 [WRN] Health check failed: 2 
slow requests are blocked > 32 sec. Implicated osds 16,28 (REQUEST_SLOW)
2018-09-20 09:32:52.841123 mon.dri-ceph01 [WRN] Health check update: 7 
slow requests are blocked > 32 sec. Implicated osds 10,16,28,32,59 
(REQUEST_SLOW)
2018-09-20 09:32:57.842230 mon.dri-ceph01 [WRN] Health check update: 15 
slow requests are blocked > 32 sec. Implicated osds 
10,16,28,31,32,59,78,80 (REQUEST_SLOW)


2018-09-20 09:32:58.851142 mon.dri-ceph01 [WRN] Health check update: 
244944/40100780 objects misplaced (0.611%) (OBJECT_MISPLACED)
2018-09-20 09:32:58.851160 mon.dri-ceph01 [WRN] Health check update: 249 
PGs pending on creation (PENDING_CREATING_PGS)


which prevent ceph from starting rebalancing; the vm's running on ceph start
hanging and we have to mark the osd back in.

I tried to reweight the osd to 0.90 in order to minimize the impact on the
cluster, but the warnings are the same.


I tried increasing these settings to

mds cache memory limit = 2147483648
rocksdb cache size = 2147483648

but with no luck, same warnings.

We also have cephfs for storing files from different projects (no directory
fragmentation enabled).


The problem here is that if one osd dies, all the services will be blocked
as ceph won't be able to start rebalancing.

The cluster is

- 3 mons

- 3 mds(running on the same hosts as the mons). 2 mds active and 1 standby

- 3 mgr(running on the same hosts as the mons)

- 6 servers, 12 osd's each.

- 1GB private network


Does anyone know how to fix this, or where the problem could be?

Thanks a lot in advance.

Jaime


[0] http://docs.ceph.com/docs/luminous/rados/operations/bluestore-migration/

--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/  |ja...@tchpc.tcd.ie
Tel: +353-1-896-3725

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph upgrade Jewel to Luminous

2018-08-15 Thread Jaime Ibar

Hi Tom,

thanks for the info.

That's what I thought, but I asked just in case, as breaking the entire
cluster would be very bad news.

Thanks again.

Jaime


On 14/08/18 20:18, Thomas White wrote:


Hi Jaime,

Upgrading directly should not be a problem. It is usually recommended 
to go to the latest minor release before upgrading major versions, but 
my own migration from 10.2.10 to 12.2.5 went seamlessly and I can't 
see any technical limitation which would hinder or prevent this 
process.


Kind Regards,

Tom

From: ceph-users On Behalf Of Jaime Ibar
Sent: 14 August 2018 10:00
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Ceph upgrade Jewel to Luminous

Hi all,

we're running Ceph Jewel 10.2.10 in our cluster and we plan to upgrade to
the latest Luminous release (12.2.7). Jewel 10.2.11 was released one month
ago and our plan was to upgrade to this release first and then upgrade to
Luminous, but as someone reported osd crashes after upgrading to Jewel
10.2.11, we wonder if it would be possible to skip this Jewel release and
upgrade directly to Luminous 12.2.7.

Thanks

Jaime


Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie

Tel: +353-1-896-3725



--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph upgrade Jewel to Luminous

2018-08-14 Thread Jaime Ibar
Hi all,

we're running Ceph Jewel 10.2.10 in our cluster and we plan to upgrade to
the latest Luminous release (12.2.7). Jewel 10.2.11 was released one month
ago and our plan was to upgrade to this release first and then upgrade to
Luminous, but as someone reported osd crashes after upgrading to Jewel
10.2.11, we wonder if it would be possible to skip this Jewel release and
upgrade directly to Luminous 12.2.7.

Thanks

Jaime
Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph upgrade Jewel to Luminous

2018-08-14 Thread Jaime Ibar

Hi all,

we're running Ceph Jewel 10.2.10 in our cluster and we plan to upgrade to
the latest Luminous release (12.2.7). Jewel 10.2.11 was released one month
ago and our plan was to upgrade to this release first and then upgrade to
Luminous, but as someone reported osd crashes after upgrading to Jewel
10.2.11, we wonder if it would be possible to skip this Jewel release and
upgrade directly to Luminous 12.2.7.

Thanks

Jaime

--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mount CephFS with dedicated user fails: mount error 13 = Permission denied

2017-07-24 Thread Jaime Ibar

Hi,

I think there is a missing permission for the mds.

Try adding 'allow r' to the mds permissions.

Something like:

ceph auth get-or-create client.mtyadm mon 'allow r' mds 'allow r, 
allow rw path=/MTY' osd 'allow rw pool=hdb-backup,allow rw 
pool=hdb-backup_metadata' -o /etc/ceph/ceph.client.mtyadm.keyring
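
If the client already exists, an alternative (hedged) is to update its caps in place instead of re-creating the keyring:

ceph auth caps client.mtyadm mon 'allow r' mds 'allow r, allow rw path=/MTY' osd 'allow rw pool=hdb-backup,allow rw pool=hdb-backup_metadata'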


Jaime


On 24/07/17 10:36, c.mo...@web.de wrote:

Hello!

I want to mount CephFS with a dedicated user in order to avoid putting the 
admin key on every client host.
Therefore I created a user account
ceph auth get-or-create client.mtyadm mon 'allow r' mds 'allow rw path=/MTY' 
osd 'allow rw pool=hdb-backup,allow rw pool=hdb-backup_metadata' -o 
/etc/ceph/ceph.client.mtyadm.keyring
and wrote out the keyring
ceph-authtool -p -n client.mtyadm ceph.client.mtyadm.keyring > 
ceph.client.mtyadm.key

This user is now displayed in auth list:
client.mtyadm
 key: AQBYu3VZLg66LBAAGM1jW+cvNE6BoJWfsORZKA==
 caps: [mds] allow rw path=/MTY
 caps: [mon] allow r
 caps: [osd] allow rw pool=hdb-backup,allow rw pool=hdb-backup_metadata

When I try to mount directory /MTY on the client host I get this error:
ld2398:/etc/ceph # mount -t ceph ldcephmon1,ldcephmon2,ldcephmon2:/MTY 
/mnt/cephfs -o name=mtyadm,secretfile=/etc/ceph/ceph.client.mtyadm.key
mount error 13 = Permission denied

The mount works using admin though:
ld2398:/etc/ceph # mount -t ceph ldcephmon1,ldcephmon2,ldcephmon2:/MTY 
/mnt/cephfs -o name=admin,secretfile=/etc/ceph/ceph.client.admin.key
ld2398:/etc/ceph # mount | grep cephfs
10.96.5.37,10.96.5.38,10.96.5.38:/MTY on /mnt/cephfs type ceph 
(rw,relatime,name=admin,secret=,acl)

What is causing this mount error?

THX
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot mount Ceph FS

2017-06-29 Thread Jaime Ibar

Hi,

I'd say there is no mds running

$ ceph mds stat
e47262: 1/1/1 up {0=ceph01=up:active}, 2 up:standby

$ ceph -s

[...]

  fsmap e47262: 1/1/1 up {0=ceph01=up:active}, 2 up:standby

[...]

Is mds up and running?
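
For reference, a hedged sketch of checking for and starting an mds daemon on the mds host (the unit name is only a guess based on the host name in your mount command):

ceph mds stat
systemctl status ceph-mds@mds001
systemctl start ceph-mds@mds001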

Jaime


On 29/06/17 12:26, Riccardo Murri wrote:

Hello!

I tried to create and mount a filesystem using the instructions at
<http://docs.ceph.com/docs/master/cephfs/createfs/> and
<http://docs.ceph.com/docs/master/cephfs/kernel> but I am getting
errors:

$ sudo ceph fs new cephfs cephfs_metadata cephfs_data
new fs with metadata pool 1 and data pool 2
$ sudo ceph mds stat
e6: 0/0/1 up
$ sudo mount -t ceph mds001:/ /mnt -o
name=admin,secretfile=/etc/ceph/client.admin.secret
mount error 110 = Connection timed out

I found this `mds cluster_up` command and thought I need to bring the
MDS cluster up before using FS functions but I get errors there as
well:

$ sudo ceph mds cluster_up
unmarked fsmap DOWN

Still, the cluster does not show any health issue:

$ sudo ceph -s
 cluster 00baac7a-0ad4-4ab7-9d5e-fdaf7d122aee
  health HEALTH_OK
  monmap e1: 1 mons at {mon001=172.23.140.181:6789/0}
 election epoch 3, quorum 0 mon001
   fsmap e7: 0/0/1 up
  osdmap e19: 3 osds: 3 up, 3 in
 flags sortbitwise,require_jewel_osds
   pgmap v1278: 192 pgs, 3 pools, 0 bytes data, 0 objects
 9728 MB used, 281 GB / 290 GB avail
  192 active+clean

Any hints?  What I am doing wrong?

Thanks,
Riccardo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] osds down after upgrade hammer to jewel

2017-03-28 Thread Jaime Ibar

Nope, all osds are running 0.94.9


On 28/03/17 14:53, Brian Andrus wrote:
Well, you said you were running v0.94.9, but are there any OSDs 
running pre-v0.94.4 as the error states?


On Tue, Mar 28, 2017 at 6:51 AM, Jaime Ibar <ja...@tchpc.tcd.ie 
<mailto:ja...@tchpc.tcd.ie>> wrote:




On 28/03/17 14:41, Brian Andrus wrote:

What does
# ceph tell osd.* version

ceph tell osd.21 version
Error ENXIO: problem getting command descriptions from osd.21


reveal? Any pre-v0.94.4 hammer OSDs running as the error states?

Yes, as this is the first one I tried to upgrade.
The other ones are running hammer

Thanks




On Tue, Mar 28, 2017 at 1:21 AM, Jaime Ibar <ja...@tchpc.tcd.ie
<mailto:ja...@tchpc.tcd.ie>> wrote:

Hi,

I did change the ownership to user ceph. In fact, OSD
processes are running

ps aux | grep ceph
ceph2199  0.0  2.7 1729044 918792 ?   Ssl  Mar27 
 0:21 /usr/bin/ceph-osd --cluster=ceph -i 42 -f --setuser

ceph --setgroup ceph
ceph2200  0.0  2.7 1721212 911084 ?   Ssl  Mar27 
 0:20 /usr/bin/ceph-osd --cluster=ceph -i 18 -f --setuser

ceph --setgroup ceph
ceph2212  0.0  2.8 1732532 926580 ?   Ssl  Mar27 
 0:20 /usr/bin/ceph-osd --cluster=ceph -i 3 -f --setuser ceph

--setgroup ceph
ceph2215  0.0  2.8 1743552 935296 ?   Ssl  Mar27 
 0:20 /usr/bin/ceph-osd --cluster=ceph -i 47 -f --setuser

ceph --setgroup ceph
ceph2341  0.0  2.7 1715548 908312 ?   Ssl  Mar27 
 0:20 /usr/bin/ceph-osd --cluster=ceph -i 51 -f --setuser

ceph --setgroup ceph
ceph2383  0.0  2.7 1694944 893768 ?   Ssl  Mar27 
 0:20 /usr/bin/ceph-osd --cluster=ceph -i 56 -f --setuser

ceph --setgroup ceph
[...]

If I run one of the osd increasing debug

ceph-osd --debug_osd 5 -i 31

this is what I get in logs

[...]

0 osd.31 14016 done with init, starting boot process
2017-03-28 09:19:15.280182 7f083df0c800  1 osd.31 14016 We
are healthy, booting
2017-03-28 09:19:15.280685 7f081cad3700  1 osd.31 14016
osdmap indicates one or more pre-v0.94.4 hammer OSDs is running
[...]

It seems the osd is running but ceph is not aware of it

Thanks
Jaime




On 27/03/17 21:56, George Mihaiescu wrote:

Make sure the OSD processes on the Jewel node are
running. If you didn't change the ownership to user ceph,
they won't start.


On Mar 27, 2017, at 11:53, Jaime Ibar
<ja...@tchpc.tcd.ie <mailto:ja...@tchpc.tcd.ie>> wrote:

Hi all,

I'm upgrading ceph cluster from Hammer 0.94.9 to
jewel 10.2.6.

The ceph cluster has 3 servers (one mon and one mds
each) and another 6 servers with
12 osds each.
The mons and mds have been successfully upgraded to the
latest jewel release; however, after upgrading the first
osd server (12 osds), ceph is not aware of them and they
are marked as down

ceph -s

cluster 4a158d27-f750-41d5-9e7f-26ce4c9d2d45
 health HEALTH_WARN
[...]
12/72 in osds are down
noout flag(s) set
 osdmap e14010: 72 osds: 60 up, 72 in; 14641
remapped pgs
flags noout
[...]

ceph osd tree

3   3.64000 osd.3 down  1.0 1.0
8   3.64000 osd.8 down  1.0 1.0
14   3.64000 osd.14  down  1.0 1.0
18   3.64000 osd.18  down  1.0 
1.0
21   3.64000 osd.21  down  1.0 
1.0
28   3.64000 osd.28  down  1.0 
1.0
31   3.64000 osd.31  down  1.0 
1.0
37   3.64000 osd.37  down  1.0 
1.0
42   3.64000 osd.42  down  1.0 
1.0
47   3.64000 osd.47  down  1.0 
1.0
51   3.64000 osd.51  down  1.0 
1.0
56   3.64000 osd.56  down  1.0 
1.0


If I run this command with one of the down osd
ceph osd in 14
osd.14 is already in.
however ceph doesn't mark it as up and the cl

Re: [ceph-users] osds down after upgrade hammer to jewel

2017-03-28 Thread Jaime Ibar



On 28/03/17 14:41, Brian Andrus wrote:

What does
# ceph tell osd.* version

ceph tell osd.21 version
Error ENXIO: problem getting command descriptions from osd.21


reveal? Any pre-v0.94.4 hammer OSDs running as the error states?

Yes, as this is the first one I tried to upgrade.
The other ones are running hammer

Thanks



On Tue, Mar 28, 2017 at 1:21 AM, Jaime Ibar <ja...@tchpc.tcd.ie 
<mailto:ja...@tchpc.tcd.ie>> wrote:


Hi,

I did change the ownership to user ceph. In fact, OSD processes
are running

ps aux | grep ceph
ceph2199  0.0  2.7 1729044 918792 ?  Ssl  Mar27  0:21
/usr/bin/ceph-osd --cluster=ceph -i 42 -f --setuser ceph
--setgroup ceph
ceph2200  0.0  2.7 1721212 911084 ?  Ssl  Mar27  0:20
/usr/bin/ceph-osd --cluster=ceph -i 18 -f --setuser ceph
--setgroup ceph
ceph2212  0.0  2.8 1732532 926580 ?  Ssl  Mar27  0:20
/usr/bin/ceph-osd --cluster=ceph -i 3 -f --setuser ceph --setgroup
ceph
ceph2215  0.0  2.8 1743552 935296 ?  Ssl  Mar27  0:20
/usr/bin/ceph-osd --cluster=ceph -i 47 -f --setuser ceph
--setgroup ceph
ceph2341  0.0  2.7 1715548 908312 ?  Ssl  Mar27  0:20
/usr/bin/ceph-osd --cluster=ceph -i 51 -f --setuser ceph
--setgroup ceph
ceph2383  0.0  2.7 1694944 893768 ?  Ssl  Mar27  0:20
/usr/bin/ceph-osd --cluster=ceph -i 56 -f --setuser ceph
--setgroup ceph
[...]

If I run one of the osd increasing debug

ceph-osd --debug_osd 5 -i 31

this is what I get in logs

[...]

0 osd.31 14016 done with init, starting boot process
2017-03-28 09:19:15.280182 7f083df0c800  1 osd.31 14016 We are
healthy, booting
2017-03-28 09:19:15.280685 7f081cad3700  1 osd.31 14016 osdmap
indicates one or more pre-v0.94.4 hammer OSDs is running
[...]

It seems the osd is running but ceph is not aware of it

Thanks
Jaime




On 27/03/17 21:56, George Mihaiescu wrote:

Make sure the OSD processes on the Jewel node are running. If
you didn't change the ownership to user ceph, they won't start.


On Mar 27, 2017, at 11:53, Jaime Ibar <ja...@tchpc.tcd.ie
<mailto:ja...@tchpc.tcd.ie>> wrote:

Hi all,

I'm upgrading ceph cluster from Hammer 0.94.9 to jewel 10.2.6.

The ceph cluster has 3 servers (one mon and one mds each)
and another 6 servers with
12 osds each.
The mons and mds have been successfully upgraded to the
latest jewel release; however, after upgrading the first
osd server (12 osds), ceph is not aware of them and they
are marked as down

ceph -s

cluster 4a158d27-f750-41d5-9e7f-26ce4c9d2d45
 health HEALTH_WARN
[...]
12/72 in osds are down
noout flag(s) set
 osdmap e14010: 72 osds: 60 up, 72 in; 14641 remapped pgs
flags noout
[...]

ceph osd tree

3   3.64000 osd.3  down  1.0 1.0
8   3.64000 osd.8  down  1.0 1.0
14   3.64000 osd.14 down  1.0 1.0
18   3.64000 osd.18 down  1.0  
1.0
21   3.64000 osd.21 down  1.0  
1.0
28   3.64000 osd.28 down  1.0  
1.0
31   3.64000 osd.31 down  1.0  
1.0
37   3.64000 osd.37 down  1.0  
1.0
42   3.64000 osd.42 down  1.0  
1.0
47   3.64000 osd.47 down  1.0  
1.0
51   3.64000 osd.51 down  1.0  
1.0
56   3.64000 osd.56 down  1.0  
1.0


If I run this command with one of the down osds:
ceph osd in 14
osd.14 is already in.
However, ceph doesn't mark it as up and the cluster health
remains in a degraded state.

Do I have to upgrade all the osds to jewel first?
Any help? I'm running out of ideas.

Thanks
Jaime

    -- 


Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
<mailto:ja...@tchpc.tcd.ie>
Tel: +353-1-896-3725 <tel:%2B353-1-896-3725>

___
ceph-users mailing list
  

Re: [ceph-users] osds down after upgrade hammer to jewel

2017-03-28 Thread Jaime Ibar

Hi,

I did change the ownership to user ceph. In fact, OSD processes are running

ps aux | grep ceph
ceph2199  0.0  2.7 1729044 918792 ?  Ssl  Mar27   0:21 
/usr/bin/ceph-osd --cluster=ceph -i 42 -f --setuser ceph --setgroup ceph
ceph2200  0.0  2.7 1721212 911084 ?  Ssl  Mar27   0:20 
/usr/bin/ceph-osd --cluster=ceph -i 18 -f --setuser ceph --setgroup ceph
ceph2212  0.0  2.8 1732532 926580 ?  Ssl  Mar27   0:20 
/usr/bin/ceph-osd --cluster=ceph -i 3 -f --setuser ceph --setgroup ceph
ceph2215  0.0  2.8 1743552 935296 ?  Ssl  Mar27   0:20 
/usr/bin/ceph-osd --cluster=ceph -i 47 -f --setuser ceph --setgroup ceph
ceph2341  0.0  2.7 1715548 908312 ?  Ssl  Mar27   0:20 
/usr/bin/ceph-osd --cluster=ceph -i 51 -f --setuser ceph --setgroup ceph
ceph2383  0.0  2.7 1694944 893768 ?  Ssl  Mar27   0:20 
/usr/bin/ceph-osd --cluster=ceph -i 56 -f --setuser ceph --setgroup ceph

[...]

If I run one of the osd increasing debug

ceph-osd --debug_osd 5 -i 31

this is what I get in logs

[...]

0 osd.31 14016 done with init, starting boot process
2017-03-28 09:19:15.280182 7f083df0c800  1 osd.31 14016 We are healthy, 
booting
2017-03-28 09:19:15.280685 7f081cad3700  1 osd.31 14016 osdmap indicates 
one or more pre-v0.94.4 hammer OSDs is running

[...]

It seems the osd is running but ceph is not aware of it

Thanks
Jaime
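
For reference, since "ceph tell" won't reach an osd the monitors consider down, a hedged way to confirm the locally running version is the admin socket on the osd host:

ceph daemon osd.31 version
# or pointing at the socket directly:
ceph --admin-daemon /var/run/ceph/ceph-osd.31.asok version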



On 27/03/17 21:56, George Mihaiescu wrote:

Make sure the OSD processes on the Jewel node are running. If you didn't change 
the ownership to user ceph, they won't start.



On Mar 27, 2017, at 11:53, Jaime Ibar <ja...@tchpc.tcd.ie> wrote:

Hi all,

I'm upgrading ceph cluster from Hammer 0.94.9 to jewel 10.2.6.

The ceph cluster has 3 servers (one mon and one mds each) and another 6 servers 
with
12 osds each.
The mons and mds have been successfully upgraded to the latest jewel release; 
however,
after upgrading the first osd server (12 osds), ceph is not aware of them and
they are marked as down

ceph -s

cluster 4a158d27-f750-41d5-9e7f-26ce4c9d2d45
 health HEALTH_WARN
[...]
12/72 in osds are down
noout flag(s) set
 osdmap e14010: 72 osds: 60 up, 72 in; 14641 remapped pgs
flags noout
[...]

ceph osd tree

3   3.64000 osd.3  down  1.0 1.0
8   3.64000 osd.8  down  1.0 1.0
14   3.64000 osd.14 down  1.0 1.0
18   3.64000 osd.18 down  1.0  1.0
21   3.64000 osd.21 down  1.0  1.0
28   3.64000 osd.28 down  1.0  1.0
31   3.64000 osd.31 down  1.0  1.0
37   3.64000 osd.37 down  1.0  1.0
42   3.64000 osd.42 down  1.0  1.0
47   3.64000 osd.47 down  1.0  1.0
51   3.64000 osd.51 down  1.0  1.0
56   3.64000 osd.56 down  1.0  1.0

If I run this command with one of the down osds:
ceph osd in 14
osd.14 is already in.
However, ceph doesn't mark it as up and the cluster health remains
in a degraded state.

Do I have to upgrade all the osds to jewel first?
Any help? I'm running out of ideas.

Thanks
Jaime

--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] osds down after upgrade hammer to jewel

2017-03-27 Thread Jaime Ibar

Hi all,

I'm upgrading ceph cluster from Hammer 0.94.9 to jewel 10.2.6.

The ceph cluster has 3 servers (one mon and one mds each) and another 6 
servers with

12 osds each.
The mons and mds have been successfully upgraded to the latest jewel 
release; however,
after upgrading the first osd server (12 osds), ceph is not aware of them and
they are marked as down

ceph -s

 cluster 4a158d27-f750-41d5-9e7f-26ce4c9d2d45
 health HEALTH_WARN
[...]
12/72 in osds are down
noout flag(s) set
 osdmap e14010: 72 osds: 60 up, 72 in; 14641 remapped pgs
flags noout
[...]

ceph osd tree

3   3.64000 osd.3  down  1.0 1.0
 8   3.64000 osd.8  down  1.0 1.0
14   3.64000 osd.14 down  1.0 1.0
18   3.64000 osd.18 down  1.0  1.0
21   3.64000 osd.21 down  1.0  1.0
28   3.64000 osd.28 down  1.0  1.0
31   3.64000 osd.31 down  1.0  1.0
37   3.64000 osd.37 down  1.0  1.0
42   3.64000 osd.42 down  1.0  1.0
47   3.64000 osd.47 down  1.0  1.0
51   3.64000 osd.51 down  1.0  1.0
56   3.64000 osd.56 down  1.0  1.0

If I run this command with one of the down osds:
ceph osd in 14
osd.14 is already in.
However, ceph doesn't mark it as up and the cluster health remains
in a degraded state.

Do I have to upgrade all the osds to jewel first?
Any help? I'm running out of ideas.

Thanks
Jaime

--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com