[ceph-users] Reading a crushtool compare output

2019-08-01 Thread Linh Vu
Hi all,

I'd like to update the tunables on our older ceph cluster, created with firefly 
and now on luminous. I need to update two tunables, chooseleaf_vary_r from 2 to 
1, and chooseleaf_stable from 0 to 1. I'm going to do 1 tunable update at a 
time.

With the first one, I've dumped the current crushmap out and compared it to the 
proposed updated crushmap with chooseleaf_vary_r set to 1 instead of 2. I need 
some help to understand the output:

# ceph osd getcrushmap -o crushmap-20190801-chooseleaf-vary-r-2
# crushtool -i crushmap-20190801-chooseleaf-vary-r-2 --set-chooseleaf-vary-r 1 
-o crushmap-20190801-chooseleaf-vary-r-1
# crushtool -i crushmap-20190801-chooseleaf-vary-r-2 --compare 
crushmap-20190801-chooseleaf-vary-r-1
rule 0 had 9137/10240 mismatched mappings (0.892285)
rule 1 had 9152/10240 mismatched mappings (0.89375)
rule 4 had 9173/10240 mismatched mappings (0.895801)
rule 5 had 0/7168 mismatched mappings (0)
rule 6 had 0/7168 mismatched mappings (0)
warning: maps are NOT equivalent

So I've learned in the past doing this sort of stuff that if the maps are 
equivalent then there is no data movement. In this case, obviously I'm 
expecting data movement, but by how much? Rules 0, 1 and 4 are about our 3 
different device classes in this cluster.

Does that mean I should expect almost 90% mismatched, based on the above 
output? That's much bigger than I expected, as in the previous steps of changing 
chooseleaf-vary-r from 0 to 5 and then down to 2, one step at a time (before 
knowing anything about this crushtool --compare command), I had only up to about 
28% mismatched objects.

Also, if you've done a similar change, please let me know how much data 
movement you encountered. Thanks!

Cheers,
Linh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Changing the release cadence

2019-06-05 Thread Linh Vu
I think a 12-month cycle is much better from the cluster operations perspective. 
I also like March as a release month.

From: ceph-users  on behalf of Sage Weil 

Sent: Thursday, 6 June 2019 1:57 AM
To: ceph-us...@ceph.com; ceph-de...@vger.kernel.org; d...@ceph.io
Subject: [ceph-users] Changing the release cadence

Hi everyone,

Since luminous, we have had the following release cadence and policy:
 - release every 9 months
 - maintain backports for the last two releases
 - enable upgrades to move either 1 or 2 releases ahead
   (e.g., luminous -> mimic or nautilus; mimic -> nautilus or octopus; ...)

This has mostly worked out well, except that the mimic release received
less attention than we wanted due to the fact that multiple downstream 
Ceph products (from Red Hat and SUSE) decided to base their next release 
on nautilus.  Even though upstream every release is an "LTS" release, as a
practical matter mimic got less attention than luminous or nautilus.

We've had several requests/proposals to shift to a 12 month cadence. This
has several advantages:

 - Stable/conservative clusters only have to be upgraded every 2 years
   (instead of every 18 months)
 - Yearly releases are more likely to intersect with downstream
   distribution releases (e.g., Debian).  In the past there have been
   problems where the Ceph releases included in consecutive releases of a
   distro weren't easily upgradeable.
 - Vendors that make downstream Ceph distributions/products tend to
   release yearly.  Aligning with those vendors means they are more likely
   to productize *every* Ceph release.  This will help make every Ceph
   release an "LTS" release (not just in name but also in terms of
   maintenance attention).

So far the balance of opinion seems to favor a shift to a 12 month
cycle[1], especially among developers, so it seems pretty likely we'll
make that shift.  (If you do have strong concerns about such a move, now
is the time to raise them.)

That brings us to an important decision: what time of year should we
release?  Once we pick the timing, we'll be releasing at that time *every
year* for each release (barring another schedule shift, which we want to
avoid), so let's choose carefully!

A few options:

 - November: If we release Octopus 9 months from the Nautilus release
   (planned for Feb, released in Mar) then we'd target this November.  We
   could shift to a 12-month cadence after that.
 - February: That's 12 months from the Nautilus target.
 - March: That's 12 months from when Nautilus was *actually* released.

November is nice in the sense that we'd wrap things up before the
holidays.  It's less good in that users may not be inclined to install the
new release when many developers will be less available in December.

February kind of sucked in that the scramble to get the last few things
done happened during the holidays.  OTOH, we should be doing what we can
to avoid such scrambles, so that might not be something we should factor
in.  March may be a bit more balanced, with a solid 3 months beforehand when 
people are productive, and 3 months afterwards to address any post-release 
issues before people disappear on summer holidays.

People tend to be somewhat less available over the summer months due to
holidays etc, so an early or late summer release might also be less than
ideal.

Thoughts?  If we can narrow it down to a few options maybe we could do a
poll to gauge user preferences.

Thanks!
sage


[1] https://twitter.com/larsmb/status/1130010208971952129

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practice for increasing number of pg and pgp

2019-01-29 Thread Linh Vu
We use the ceph-gentle-split script from https://github.com/cernceph/ceph-scripts to 
slowly increase pg_num by 16 PGs at a time until we hit the target.
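
In essence it boils down to something like this (a simplified sketch, not the actual 
CERN script; pool name and target are placeholders):

TARGET=8192
POOL=cephfs_data
CUR=$(ceph osd pool get $POOL pg_num | awk '{print $2}')
while [ "$CUR" -lt "$TARGET" ]; do
    NEXT=$((CUR + 16))
    ceph osd pool set $POOL pg_num  $NEXT
    ceph osd pool set $POOL pgp_num $NEXT
    # wait until the cluster has settled before the next step
    while ceph -s | grep -qE 'creating|peering|backfill'; do sleep 60; done
    CUR=$NEXT
done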


From: ceph-users  on behalf of Albert Yue 

Sent: Wednesday, 30 January 2019 1:39:40 PM
To: ceph-users
Subject: [ceph-users] Best practice for increasing number of pg and pgp

Dear Ceph Users,

As the number of OSDs increases in our cluster, we have reached a point where the 
PG-per-OSD ratio is lower than the recommended value, and we want to increase pg_num 
from 4096 to 8192.

Somebody recommended that this adjustment be done in multiple stages, e.g. 
increasing by 1024 PGs each time. Is this good practice, or should we increase it 
to 8192 in one go? Thanks!

Best regards,
Albert
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel, hang with libceph: osdx X.X.X.X socket closed (con state OPEN)

2018-11-08 Thread Linh Vu
If you're using the kernel client for CephFS, I strongly advise having the client 
on the same subnet as the Ceph public network, i.e. all traffic should be on the 
same subnet/VLAN. Even if your firewall situation is good, if you have to cross 
subnets or VLANs, you will run into weird problems later. The fuse client has much 
better tolerance for that scenario.


From: ceph-users  on behalf of Alexandre 
DERUMIER 
Sent: Friday, 9 November 2018 12:06:43 PM
To: ceph-users
Subject: Re: [ceph-users] cephfs kernel, hang with libceph: osdx X.X.X.X socket 
closed (con state OPEN)

Ok,
it seems to come from the firewall;
I'm seeing dropped sessions exactly 15 min before the log entry.

The dropped sessions are the sessions to the OSDs; the sessions to the mon && MDS are ok.


It seems that keepalive2 is used to monitor the mon session:
https://patchwork.kernel.org/patch/7105641/

but I'm not sure about the OSD sessions?

- Original Message -
From: "aderumier" 
To: "ceph-users" 
Cc: "Alexandre Bruyelles" 
Sent: Friday, 9 November 2018 01:12:25
Subject: Re: [ceph-users] cephfs kernel, hang with libceph: osdx X.X.X.X socket 
closed (con state OPEN)

To be more precise,

the log messages occur when the hang has finished.

I have looked at stats on 10 different hangs, and the duration is always around 
15 minutes.

Maybe related to:

ms tcp read timeout
Description: If a client or daemon makes a request to another Ceph daemon and 
does not drop an unused connection, the ms tcp read timeout defines the 
connection as idle after the specified number of seconds.
Type: Unsigned 64-bit Integer
Required: No
Default: 900 (i.e. 15 minutes).

?

I found a similar bug report involving a firewall too:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-October/013841.html
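
If it is the firewall silently dropping idle OSD sessions, one thing that might be 
worth experimenting with (just a guess on my side, not verified) is lowering that 
timeout on the OSD side so idle connections are torn down by Ceph before the 
firewall kills them, e.g. in the OSDs' ceph.conf:

[osd]
# default is 900 (15 minutes); drop idle client connections sooner
ms tcp read timeout = 300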


- Original Message -
From: "aderumier" 
To: "ceph-users" 
Sent: Thursday, 8 November 2018 18:16:20
Subject: [ceph-users] cephfs kernel, hang with libceph: osdx X.X.X.X socket 
closed (con state OPEN)

Hi,

we are currently testing cephfs with the kernel module (4.17 and 4.18) instead of 
fuse (which worked fine),

and we are seeing hangs; iowait jumps like crazy for around 20 min.

client is a qemu 2.12 vm with virtio-net interface.


In the client logs, we are seeing this kind of message:

[jeu. nov. 8 12:20:18 2018] libceph: osd14 x.x.x.x:6801 socket closed (con 
state OPEN)
[jeu. nov. 8 12:42:03 2018] libceph: osd9 x.x.x.x:6821 socket closed (con state 
OPEN)


and in osd logs:

osd14:
2018-11-08 12:20:25.247 7f31ffac8700 0 -- x.x.x.x:6801/1745 >> 
x.x.x.x:0/3678871522 conn(0x558c430ec300 :6801 
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg 
accept replacing existing (lossy) channel (new one lossy=1)

osd9:
2018-11-08 12:42:09.820 7f7ca970e700 0 -- x.x.x.x:6821/1739 >> 
x.x.x.x:0/3678871522 conn(0x564fcbec5100 :6821 
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg 
accept replacing existing (lossy) channel (new one lossy=1)


cluster is ceph 13.2.1

Note that we have a physical firewall between client and server; I'm not sure 
yet whether the sessions could be dropped. (I haven't found any logs on the 
firewall.)

Any idea? I would like to know whether it's a network bug or a ceph bug (I'm not 
sure how to interpret the osd logs).

Regards,

Alexandre



client ceph.conf

[client]
fuse_disable_pagecache = true
client_reconnect_stale = true


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS kernel client versions - pg-upmap

2018-11-08 Thread Linh Vu
Kernels 4.13+ (I tested up to 4.18) are missing a non-essential feature (explained 
by a Ceph dev on this ML) that was added in Luminous, so they show up as Jewel, but 
otherwise they're fully compatible with upmap. We have a few hundred nodes on the 
kernel client with CephFS, and we also run the balancer in upmap mode on those 
clusters successfully.
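
If you want to see what the connected clients actually report before flipping the 
flag, something like this does the job (a sketch; run from an admin node):

# ceph features        # summarises connected clients by release and feature bits
# ceph osd set-require-min-compat-client luminous --yes-i-really-mean-it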


From: ceph-users  on behalf of Stefan Kooman 

Sent: Friday, 9 November 2018 3:10:09 AM
To: Ilya Dryomov
Cc: ceph-users
Subject: Re: [ceph-users] CephFS kernel client versions - pg-upmap

Quoting Stefan Kooman (ste...@bit.nl):
> I'm pretty sure it isn't. I'm trying to do the same (force luminous
> clients only) but ran into the same issue. Even when running 4.19 kernel
> it's interpreted as a jewel client. Here is the list I made so far:
>
> Kernel 4.13 / 4.15:
> "features": "0x7010fb86aa42ada",
> "release": "jewel"
>
> kernel 4.18 / 4.19
>  "features": "0x27018fb86aa42ada",
>  "release": "jewel"

On a test cluster with kernel clients 4.13, 4.15, 4.19 I have set the
"ceph osd set-require-min-compat-client luminous --yes-i-really-mean-it"
while doing active IO ... no issues. Remount also works ... makes me
wonder how strict this "require-min-compat-client" is ...

Gr. Stefan

--
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel client stability

2018-10-01 Thread Linh Vu
Might be a networking problem. Are your client nodes on the same subnet as the Ceph 
client network (i.e. public_network in ceph.conf)? In my experience, the kernel 
client only likes being on the same public_network subnet as the MDS, mons and 
OSDs; otherwise you get tons of weird issues. The fuse client, however, is a lot 
more tolerant of this and can jump through gateways etc. with no problem.


From: ceph-users  on behalf of Andras Pataki 

Sent: Tuesday, 2 October 2018 6:40:44 AM
To: Marc Roos; ceph-users
Subject: Re: [ceph-users] cephfs kernel client stability

Unfortunately the CentOS kernel (3.10.0-862.14.4.el7.x86_64) has issues
as well.  Different ones, but the nodes end up with an unusable mount in
an hour or two.  Here are some syslogs:

Oct  1 11:50:28 worker1004 kernel: INFO: task fio:29007 blocked for more
than 120 seconds.
Oct  1 11:50:28 worker1004 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct  1 11:50:28 worker1004 kernel: fio D
996d86e92f70 0 29007  28970 0x
Oct  1 11:50:28 worker1004 kernel: Call Trace:
Oct  1 11:50:28 worker1004 kernel: [] ? bit_wait+0x50/0x50
Oct  1 11:50:28 worker1004 kernel: [] schedule+0x29/0x70
Oct  1 11:50:28 worker1004 kernel: []
schedule_timeout+0x239/0x2c0
Oct  1 11:50:28 worker1004 kernel: [] ?
ktime_get_ts64+0x52/0xf0
Oct  1 11:50:28 worker1004 kernel: [] ? bit_wait+0x50/0x50
Oct  1 11:50:28 worker1004 kernel: []
io_schedule_timeout+0xad/0x130
Oct  1 11:50:28 worker1004 kernel: []
io_schedule+0x18/0x20
Oct  1 11:50:28 worker1004 kernel: []
bit_wait_io+0x11/0x50
Oct  1 11:50:28 worker1004 kernel: []
__wait_on_bit_lock+0x61/0xc0
Oct  1 11:50:28 worker1004 kernel: []
__lock_page+0x74/0x90
Oct  1 11:50:28 worker1004 kernel: [] ?
wake_bit_function+0x40/0x40
Oct  1 11:50:28 worker1004 kernel: []
__find_lock_page+0x54/0x70
Oct  1 11:50:28 worker1004 kernel: []
grab_cache_page_write_begin+0x55/0xc0
Oct  1 11:50:28 worker1004 kernel: []
ceph_write_begin+0x43/0xe0 [ceph]
Oct  1 11:50:28 worker1004 kernel: []
generic_file_buffered_write+0x124/0x2c0
Oct  1 11:50:28 worker1004 kernel: []
ceph_aio_write+0xa3e/0xcb0 [ceph]
Oct  1 11:50:28 worker1004 kernel: [] ?
do_numa_page+0x1be/0x250
Oct  1 11:50:28 worker1004 kernel: [] ?
handle_pte_fault+0x316/0xd10
Oct  1 11:50:28 worker1004 kernel: [] ?
aio_read_events+0x1f3/0x2e0
Oct  1 11:50:28 worker1004 kernel: [] ?
security_file_permission+0x27/0xa0
Oct  1 11:50:28 worker1004 kernel: [] ?
ceph_direct_read_write+0xcd0/0xcd0 [ceph]
Oct  1 11:50:28 worker1004 kernel: []
do_io_submit+0x3c3/0x870
Oct  1 11:50:28 worker1004 kernel: []
SyS_io_submit+0x10/0x20
Oct  1 11:50:28 worker1004 kernel: []
system_call_fastpath+0x22/0x27
Oct  1 11:52:28 worker1004 kernel: INFO: task fio:29007 blocked for more
than 120 seconds.
Oct  1 11:52:28 worker1004 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct  1 11:52:28 worker1004 kernel: fio D
996d86e92f70 0 29007  28970 0x
Oct  1 11:52:28 worker1004 kernel: Call Trace:
Oct  1 11:52:28 worker1004 kernel: [] ? bit_wait+0x50/0x50
Oct  1 11:52:28 worker1004 kernel: [] schedule+0x29/0x70
Oct  1 11:52:28 worker1004 kernel: []
schedule_timeout+0x239/0x2c0
Oct  1 11:52:28 worker1004 kernel: [] ?
ktime_get_ts64+0x52/0xf0
Oct  1 11:52:28 worker1004 kernel: [] ? bit_wait+0x50/0x50
Oct  1 11:52:28 worker1004 kernel: []
io_schedule_timeout+0xad/0x130
Oct  1 11:52:28 worker1004 kernel: []
io_schedule+0x18/0x20
Oct  1 11:52:28 worker1004 kernel: []
bit_wait_io+0x11/0x50
Oct  1 11:52:28 worker1004 kernel: []
__wait_on_bit_lock+0x61/0xc0
Oct  1 11:52:28 worker1004 kernel: []
__lock_page+0x74/0x90
Oct  1 11:52:28 worker1004 kernel: [] ?
wake_bit_function+0x40/0x40
Oct  1 11:52:28 worker1004 kernel: []
__find_lock_page+0x54/0x70
Oct  1 11:52:28 worker1004 kernel: []
grab_cache_page_write_begin+0x55/0xc0
Oct  1 11:52:28 worker1004 kernel: []
ceph_write_begin+0x43/0xe0 [ceph]
Oct  1 11:52:28 worker1004 kernel: []
generic_file_buffered_write+0x124/0x2c0
Oct  1 11:52:28 worker1004 kernel: []
ceph_aio_write+0xa3e/0xcb0 [ceph]
Oct  1 11:52:28 worker1004 kernel: [] ?
do_numa_page+0x1be/0x250
Oct  1 11:52:28 worker1004 kernel: [] ?
handle_pte_fault+0x316/0xd10
Oct  1 11:52:28 worker1004 kernel: [] ?
aio_read_events+0x1f3/0x2e0
Oct  1 11:52:28 worker1004 kernel: [] ?
security_file_permission+0x27/0xa0
Oct  1 11:52:28 worker1004 kernel: [] ?
ceph_direct_read_write+0xcd0/0xcd0 [ceph]
Oct  1 11:52:28 worker1004 kernel: []
do_io_submit+0x3c3/0x870
Oct  1 11:52:28 worker1004 kernel: []
SyS_io_submit+0x10/0x20
Oct  1 11:52:28 worker1004 kernel: []
system_call_fastpath+0x22/0x27

Oct  1 15:04:08 worker1004 kernel: libceph: reset on mds0
Oct  1 15:04:08 worker1004 kernel: ceph: mds0 closed our session
Oct  1 15:04:08 worker1004 kernel: ceph: mds0 reconnect start
Oct  1 15:04:08 worker1004 kernel: libceph: osd182 10.128.150.155:6976
socket closed (con state OPEN)
Oct  1 15:04:08 

Re: [ceph-users] Ceph and NVMe

2018-09-06 Thread Linh Vu
We have P3700s and Optane 900P (similar to P4800 but the workstation version 
and a lot cheaper) on R730xds, for WAL, DB and metadata pools for cephfs and 
radosgw. They perform great!


From: ceph-users  on behalf of Jeff Bailey 

Sent: Friday, 7 September 2018 7:36:19 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph and NVMe

I haven't had any problems using 375GB P4800X's in R730 and R740xd
machines for DB+WAL.  The iDRAC whines a bit on the R740 but everything
works fine.

On 9/6/2018 3:09 PM, Steven Vacaroaia wrote:
> Hi ,
> Just to add to this question, is anyone using Intel Optane DC P4800X on
> DELL R630 ...or any other server ?
> Any gotchas / feedback/ knowledge sharing will be greatly appreciated
> Steven
>
> On Thu, 6 Sep 2018 at 14:59, Stefan Priebe - Profihost AG
> mailto:s.pri...@profihost.ag>> wrote:
>
> Hello list,
>
> has anybody tested current NVMe performance with luminous and bluestore?
> Is this something which makes sense or just a waste of money?
>
> Greets,
> Stefan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] No announce for 12.2.8 / available in repositories

2018-09-05 Thread Linh Vu
With more testing and checking, we realised that this had nothing to do with 
Ceph. One part of the upgrade accidentally changed the MTU of our VMs' tap 
interfaces from 9000 to 1500... Sorry for the false warning, everyone!


From: ceph-users  on behalf of Linh Vu 

Sent: Tuesday, 4 September 2018 4:28 PM
To: Dan van der Ster
Cc: ceph-users
Subject: Re: [ceph-users] No announce for 12.2.8 / available in repositories


We're going to reproduce this again in testing (12.2.8 drops right between our 
previous testing and going production) and compare it to 12.2.7. Will update 
with our findings soon. :)


From: Dan van der Ster 
Sent: Tuesday, 4 September 2018 3:41:01 PM
To: Linh Vu
Cc: nhuill...@dolomede.fr; ceph-users
Subject: Re: [ceph-users] No announce for 12.2.8 / available in repositories

I don't think those issues are known... Could you elaborate on your
librbd issues with v12.2.8 ?

-- dan

On Tue, Sep 4, 2018 at 7:30 AM Linh Vu  wrote:
>
> Version 12.2.8 seems broken. Someone earlier on the ML had a MDS issue. We 
> accidentally upgraded an openstack compute node from 12.2.7 to 12.2.8 
> (librbd) and it caused all kinds of issues writing to the VM disks.
>
> 
> From: ceph-users  on behalf of Nicolas 
> Huillard 
> Sent: Sunday, 2 September 2018 7:31:08 PM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] No announce for 12.2.8 / available in repositories
>
> Hi all,
>
> I just noticed that 12.2.8 was available on the repositories, without
> any announce. Since upgrading to unannounced 12.2.6 was a bad idea,
> I'll wait a bit anyway ;-)
> Where can I find info on this bugfix release ?
> Nothing there : http://lists.ceph.com/pipermail/ceph-announce-ceph.com/
>
> TIA
>
> --
> Nicolas Huillard
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] No announce for 12.2.8 / available in repositories

2018-09-04 Thread Linh Vu
We're going to reproduce this again in testing (12.2.8 drops right between our 
previous testing and going production) and compare it to 12.2.7. Will update 
with our findings soon. :)


From: Dan van der Ster 
Sent: Tuesday, 4 September 2018 3:41:01 PM
To: Linh Vu
Cc: nhuill...@dolomede.fr; ceph-users
Subject: Re: [ceph-users] No announce for 12.2.8 / available in repositories

I don't think those issues are known... Could you elaborate on your
librbd issues with v12.2.8 ?

-- dan

On Tue, Sep 4, 2018 at 7:30 AM Linh Vu  wrote:
>
> Version 12.2.8 seems broken. Someone earlier on the ML had a MDS issue. We 
> accidentally upgraded an openstack compute node from 12.2.7 to 12.2.8 
> (librbd) and it caused all kinds of issues writing to the VM disks.
>
> 
> From: ceph-users  on behalf of Nicolas 
> Huillard 
> Sent: Sunday, 2 September 2018 7:31:08 PM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] No announce for 12.2.8 / available in repositories
>
> Hi all,
>
> I just noticed that 12.2.8 was available on the repositories, without
> any announce. Since upgrading to unannounced 12.2.6 was a bad idea,
> I'll wait a bit anyway ;-)
> Where can I find info on this bugfix release ?
> Nothing there : http://lists.ceph.com/pipermail/ceph-announce-ceph.com/
>
> TIA
>
> --
> Nicolas Huillard
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] No announce for 12.2.8 / available in repositories

2018-09-03 Thread Linh Vu
Version 12.2.8 seems broken. Someone earlier on the ML had an MDS issue. We 
accidentally upgraded an openstack compute node from 12.2.7 to 12.2.8 (librbd) 
and it caused all kinds of issues writing to the VM disks.


From: ceph-users  on behalf of Nicolas 
Huillard 
Sent: Sunday, 2 September 2018 7:31:08 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] No announce for 12.2.8 / available in repositories

Hi all,

I just noticed that 12.2.8 was available on the repositories, without
any announcement. Since upgrading to the unannounced 12.2.6 was a bad idea,
I'll wait a bit anyway ;-)
Where can I find info on this bugfix release?
Nothing there : http://lists.ceph.com/pipermail/ceph-announce-ceph.com/

TIA

--
Nicolas Huillard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs meta data pool to ssd and measuring performance difference

2018-08-03 Thread Linh Vu
Try IOR's mdtest for metadata performance.
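
E.g. something along these lines (a rough sketch - exact flags depend on your mdtest 
version; the process count and mount path are placeholders):

# mpirun -np 8 mdtest -F -n 10000 -i 3 -d /mnt/cephfs/mdtest-scratch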


From: ceph-users  on behalf of Marc Roos 

Sent: Friday, 3 August 2018 7:49:13 PM
To: dcsysengineer
Cc: ceph-users
Subject: Re: [ceph-users] Cephfs meta data pool to ssd and measuring 
performance difference


I have moved the pool, but the strange thing is that if I do something like
this:

for object in `cat out`; do rados -p fs_meta get $object /dev/null ;
done

I do not see any activity on the ssd drives with something like dstat
(checked on all nodes (sdh))

net/eth4.60-net/eth4.52
--dsk/sda-dsk/sdb-dsk/sdc-dsk/sdd-dsk/sde-dsk/sdf---
--dsk/sdg-dsk/sdh-dsk/sdi--
recv send: recv send| read writ: read writ: read writ: read writ:
read writ: read writ: read writ: read writ: read writ
0 0 : 0 0 | 415B 201k:3264k 1358k:4708k 3487k:5779k 3046k:
13M 6886k:9055B 487k:1118k 836k: 327k 210k:3497k 1569k
154k 154k: 242k 336k| 0 0 : 0 0 : 0 12k: 0 0 :
0 0 : 0 0 : 0 0 : 0 0 : 0 0
103k 92k: 147k 199k| 0 164k: 0 0 : 0 32k: 0 0
:8192B 12k: 0 0 : 0 0 : 0 0 : 0 0
96k 124k: 108k 96k| 0 4096B: 0 0 :4096B 20k: 0 0
:4096B 12k: 0 8192B: 0 0 : 0 0 : 0 0
175k 375k: 330k 266k| 0 69k: 0 0 : 0 0 : 0 0 :
0 0 :8192B 136k: 0 0 : 0 0 : 0 0
133k 102k: 124k 103k| 0 0 : 0 0 : 0 76k: 0 0 :
0 32k: 0 0 : 0 0 : 0 0 : 0 0
350k 185k: 318k 1721k| 0 57k: 0 0 : 0 16k: 0 0 :
0 36k:1416k 0 : 0 0 : 0 144k: 0 0
206k 135k: 164k 797k| 0 0 : 0 0 :8192B 44k: 0 0 :
0 28k: 660k 0 : 0 0 :4096B 260k: 0 0
138k 136k: 252k 273k| 0 51k: 0 0 :4096B 16k: 0 0 :
0 0 : 0 0 : 0 0 : 0 0 : 0 0
158k 117k: 436k 369k| 0 0 : 0 0 : 0 0 : 0 20k:
0 0 :4096B 20k: 0 20k: 0 0 : 0 0
146k 106k: 327k 988k| 0 63k: 0 16k: 0 52k: 0 0 :
0 0 : 0 52k: 0 0 : 0 0 : 0 0
77k 74k: 361k 145k| 0 0 : 0 0 : 0 16k: 0 0 :
0 0 : 0 0 : 0 0 : 0 0 : 0 0
186k 149k: 417k 824k| 0 51k: 0 0 : 0 28k: 0 0 :
0 28k: 0 0 : 0 0 : 0 36k: 0 0

But this showed some activity

[@c01 ~]# ceph osd pool stats | grep fs_meta -A 2
pool fs_meta id 19
client io 0 B/s rd, 17 op/s rd, 0 op/s wr

It took maybe around 20h to move the fs_meta pool (only 200MB, 2483328
objects) from hdd to ssd, probably also because of some other remapping from
one replaced and one added hdd. (I have slow hdd's.)

I did not manage to do a good test, because the results seem to be
similar to before the move. I did not want to create files because I
thought it would involve the fs_data pool too much, which is on my slow
hdd's. So I did the readdir and stat tests.

I checked that mds.a was active, and limited the cache of mds.a to 1000 inodes
(I think) with:
ceph daemon mds.a config set mds_cache_size 1000

Flushed caches on the nodes with:
free && sync && echo 3 > /proc/sys/vm/drop_caches && free

And ran these tests:
python ../smallfile-master/smallfile_cli.py --operation stat --threads 1
--file-size 128 --files-per-dir 5 --files 50 --top
/home/backup/test/kernel/
python ../smallfile-master/smallfile_cli.py --operation readdir
--threads 1 --file-size 128 --files-per-dir 5 --files 50 --top
/home/backup/test/kernel/

Maybe this is helpful in selecting a better test for your move.


-Original Message-
From: David C [mailto:dcsysengin...@gmail.com]
Sent: Monday, 30 July 2018 14:23
To: Marc Roos
Cc: ceph-users
Subject: Re: [ceph-users] Cephfs meta data pool to ssd and measuring
performance difference

Something like smallfile perhaps? https://github.com/bengland2/smallfile

Or you just time creating/reading lots of files

With read benching you would want to ensure you've cleared your mds
cache or use a dataset larger than the cache.

I'd be interested in seeing your results; I have this on the to-do list
myself.

On 25 Jul 2018 15:18, "Marc Roos"  wrote:




    From this thread, I learned how to move the metadata pool from the
    hdd's to the ssd's:
    https://www.spinics.net/lists/ceph-users/msg39498.html

ceph osd pool get fs_meta crush_rule
ceph osd pool set fs_meta crush_rule replicated_ruleset_ssd

I guess this can be done on a live system?

What would be a good test to show the performance difference
between the
old hdd and the new ssd?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mgr cephx caps to run `ceph fs status`?

2018-07-31 Thread Linh Vu
Thanks John, that works! It also works with multiple commands, e.g. I granted my 
user access to both `ceph fs status` and `ceph status`:


mgr 'allow command "fs status", allow command "status"'
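
For reference, applied to a client keyring that looks something like this (the 
client name is a placeholder, and the client also needs basic mon read access to 
talk to the cluster at all):

# ceph auth caps client.monitoring \
    mon 'allow r' \
    mgr 'allow command "fs status", allow command "status"'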


From: John Spray 
Sent: Tuesday, 31 July 2018 8:12:00 PM
To: Linh Vu
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Mgr cephx caps to run `ceph fs status`?

On Tue, Jul 31, 2018 at 3:36 AM Linh Vu  wrote:
>
> Hi all,
>
>
> I want a non-admin client to be able to run `ceph fs status`, either via the 
> ceph CLI or a python script. Adding `mgr "allow *"` to this client's cephx 
> caps works, but I'd like to be more specific if possible. I can't find the 
> complete list of mgr cephx caps anywhere, so if you could point me in the 
> right direction, that'd be great!

Both mgr and mon caps have an "allow command" syntax that lets you
restrict users to specific named commands (and even specific
arguments). Internally, the mgr and the mon use the same code to
interpret capabilities.

I just went looking for the documentation for those mon caps and it
appears not to exist!

Anyway, in your case it's something like this:

mgr "allow command \"fs status\""

I don't think I've ever tested this on a mgr daemon, so let us know
how you get on.

John



>
> Cheers,
>
> Linh
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com<http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Mgr cephx caps to run `ceph fs status`?

2018-07-30 Thread Linh Vu
Hi all,


I want a non-admin client to be able to run `ceph fs status`, either via the 
ceph CLI or a python script. Adding `mgr "allow *"` to this client's cephx caps 
works, but I'd like to be more specific if possible. I can't find the complete 
list of mgr cephx caps anywhere, so if you could point me in the right 
direction, that'd be great!


Cheers,

Linh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 12.2.7 - Available space decreasing when adding disks

2018-07-21 Thread Linh Vu
Something funny is going on with your new disks:


138   ssd 0.90970  1.0  931G  820G  111G 88.08 2.71 216 Added
139   ssd 0.90970  1.0  931G  771G  159G 82.85 2.55 207 Added
140   ssd 0.90970  1.0  931G  709G  222G 76.12 2.34 197 Added
141   ssd 0.90970  1.0  931G  664G  267G 71.31 2.19 184 Added


The last 3 columns are: % used, variation, and PG count. These 4 have much 
higher %used and PG count than the rest, almost double. You probably have these 
disks in multiple pools and therefore have too many PGs on them.


One of them is at 88% used. The max available capacity of a pool is calculated 
based on the most full OSD in it, which is why your total available capacity 
drops to 0.6TB.
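
If you just want to pull the most-full OSDs back down while you sort out the PG 
distribution, something like this can help (a sketch - do the dry run first and 
adjust the threshold to taste):

# ceph osd test-reweight-by-utilization 110    # dry run, shows what would change
# ceph osd reweight-by-utilization 110         # reweight OSDs above 110% of average use

or per OSD:

# ceph osd reweight osd.138 0.85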


From: ceph-users  on behalf of Glen Baars 

Sent: Saturday, 21 July 2018 10:43:16 AM
To: ceph-users
Subject: [ceph-users] 12.2.7 - Available space decreasing when adding disks


Hello Ceph Users,



We added more SSD storage to our Ceph cluster last night: 4 x 1TB 
drives. The available space went from 1.6TB to 0.6TB (in `ceph df` for the 
SSD pool).



I would assume that the weight needs to be changed but I didn’t think I would 
need to? Should I change them to 0.75 from 0.9 and hopefully it will rebalance 
correctly?



#ceph osd tree | grep -v hdd

ID  CLASS WEIGHTTYPE NAME STATUS REWEIGHT PRI-AFF

-1   534.60309 root default

-19    62.90637 host NAS-AUBUN-RK2-CEPH06

115   ssd   0.43660 osd.115   up  1.0 1.0

116   ssd   0.43660 osd.116   up  1.0 1.0

117   ssd   0.43660 osd.117   up  1.0 1.0

118   ssd   0.43660 osd.118   up  1.0 1.0

-22   105.51169 host NAS-AUBUN-RK2-CEPH07

138   ssd   0.90970 osd.138   up  1.0 1.0 Added

139   ssd   0.90970 osd.139   up  1.0 1.0 Added

-25   105.51169 host NAS-AUBUN-RK2-CEPH08

140   ssd   0.90970 osd.140   up  1.0 1.0 Added

141   ssd   0.90970 osd.141   up  1.0 1.0 Added

 -3    56.32617 host NAS-AUBUN-RK3-CEPH01

60   ssd   0.43660 osd.60up  1.0 1.0

61   ssd   0.43660 osd.61up  1.0 1.0

62   ssd   0.43660 osd.62up  1.0 1.0

63   ssd   0.43660 osd.63up  1.0 1.0

 -5    56.32617 host NAS-AUBUN-RK3-CEPH02

64   ssd   0.43660 osd.64up  1.0 1.0

65   ssd   0.43660 osd.65up  1.0 1.0

66   ssd   0.43660 osd.66up  1.0 1.0

67   ssd   0.43660 osd.67up  1.0 1.0

 -7    56.32617 host NAS-AUBUN-RK3-CEPH03

68   ssd   0.43660 osd.68up  1.0 1.0

69   ssd   0.43660 osd.69up  1.0 1.0

70   ssd   0.43660 osd.70up  1.0 1.0

71   ssd   0.43660 osd.71up  1.0 1.0

-13    45.84741 host NAS-AUBUN-RK3-CEPH04

72   ssd   0.54579 osd.72up  1.0 1.0

73   ssd   0.54579 osd.73up  1.0 1.0

76   ssd   0.54579 osd.76up  1.0 1.0

77   ssd   0.54579 osd.77up  1.0 1.0

-16    45.84741 host NAS-AUBUN-RK3-CEPH05

74   ssd   0.54579 osd.74up  1.0 1.0

75   ssd   0.54579 osd.75up  1.0 1.0

78   ssd   0.54579 osd.78up  1.0 1.0

79   ssd   0.54579 osd.79up  1.0 1.0



# ceph osd df | grep -v hdd

ID  CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS

115   ssd 0.43660  1.0  447G  250G  196G 56.00 1.72 103

116   ssd 0.43660  1.0  447G  191G  255G 42.89 1.32  84

117   ssd 0.43660  1.0  447G  213G  233G 47.79 1.47  92

118   ssd 0.43660  1.0  447G  208G  238G 46.61 1.43  85

138   ssd 0.90970  1.0  931G  820G  111G 88.08 2.71 216 Added

139   ssd 0.90970  1.0  931G  771G  159G 82.85 2.55 207 Added

140   ssd 0.90970  1.0  931G  709G  222G 76.12 2.34 197 Added

141   ssd 0.90970  1.0  931G  664G  267G 71.31 2.19 184 Added

60   ssd 0.43660  1.0  447G  275G  171G 61.62 1.89 100

61   ssd 0.43660  1.0  447G  237G  209G 53.04 1.63  90

62   ssd 0.43660  1.0  447G  275G  171G 61.58 1.89  95

63   ssd 0.43660  1.0  447G  260G  187G 58.15 1.79  97

64   ssd 0.43660  1.0  447G  232G  214G 52.08 1.60  83

65   ssd 0.43660  1.0  447G  207G  239G 46.36 1.42  75

66   ssd 0.43660  1.0  447G  217G  230G 48.54 1.49  84

67   ssd 0.43660  1.0  447G  

Re: [ceph-users] Crush Rules with multiple Device Classes

2018-07-19 Thread Linh Vu
Since the new NVMes are meant to replace the existing SSDs, why don't you 
assign class "ssd" to the new NVMe OSDs? That way you don't need to change the 
existing OSDs nor the existing crush rule. And the new NVMe OSDs won't lose any 
performance, "ssd" or "nvme" is just a name.

When you deploy the new NVMes, you can chuck this under [osd] in their local 
ceph.conf: `osd_class_update_on_start = false`. They should then come up with a 
blank class, and you can set the class to ssd afterwards.
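
For any OSDs that still come up with the wrong class (or if you forget the config 
option), reassigning is just (a sketch; the osd ids are placeholders):

# ceph osd crush rm-device-class osd.20 osd.21
# ceph osd crush set-device-class ssd osd.20 osd.21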



From: ceph-users  on behalf of Oliver 
Freyermuth 
Sent: Thursday, 19 July 2018 6:13:25 AM
To: ceph-users@lists.ceph.com
Cc: Peter Wienemann
Subject: [ceph-users] Crush Rules with multiple Device Classes

Dear Cephalopodians,

we use an SSD-only pool to store the metadata of our CephFS.
In the future, we will add a few NVMEs, and in the long-term view, replace the 
existing SSDs by NVMEs, too.

Thinking this through, I came up with three questions which I do not find 
answered in the docs (yet).

Currently, we use the following crush-rule:

rule cephfs_metadata {
id 1
type replicated
min_size 1
max_size 10
step take default class ssd
step choose firstn 0 type osd
step emit
}

As you can see, this uses "class ssd".

Now my first question is:
1) Is there a way to specify "take default class (ssd or nvme)"?
   Then we could just do this for the migration period, and at some point 
remove "ssd".

If multi-device-class in a crush rule is not supported yet, the only workaround 
which comes to my mind right now is to issue:
  $ ceph osd crush set-device-class nvme 
for all our old SSD-backed osds, and modify the crush rule to refer to class 
"nvme" straightaway.

This leads to my second question:
2) Since the OSD IDs do not change, Ceph should not move any data around by 
changing both the device classes of the OSDs and the device class in the crush 
rule - correct?

After this operation, adding NVMEs to our cluster should let them automatically 
join this crush rule, and once all SSDs are replaced with NVMEs,
the workaround is automatically gone.

As long as the SSDs are still there, some tunables might not fit well anymore 
out of the box, e.g. the "sleep" values for scrub and repair.

Here my third question:
3) Are the tunables used for NVME devices the same as for SSD devices?
   I do not find any NVME tunables here:
   http://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/
   Only SSD, HDD and Hybrid are shown.

Cheers,
Oliver

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-maintainers] v12.2.7 Luminous released

2018-07-18 Thread Linh Vu
Awesome, thank you Sage! With that explanation, it's actually a lot easier and 
less impacting than I thought. :)


Cheers,

Linh


From: Sage Weil 
Sent: Thursday, 19 July 2018 9:35:33 AM
To: Linh Vu
Cc: Stefan Kooman; ceph-de...@vger.kernel.org; ceph-us...@ceph.com; 
ceph-maintain...@ceph.com; ceph-annou...@ceph.com
Subject: Re: [Ceph-maintainers] [ceph-users] v12.2.7 Luminous released

On Wed, 18 Jul 2018, Linh Vu wrote:
> Thanks for all your hard work in putting out the fixes so quickly! :)
>
> We have a cluster on 12.2.5 with Bluestore and EC pool but for CephFS,
> not RGW. In the release notes, it says RGW is a risk especially the
> garbage collection, and the recommendation is to either pause IO or
> disable RGW garbage collection.
>
> In our case with CephFS, not RGW, is it a lot less risky to perform the
> upgrade to 12.2.7 without the need to pause IO?

It is hard to quantify.  I think we only saw the problem with RGW, but
CephFS also sends deletes to non-existent objects when deleting or
truncating sparse files.  Those are probably not too common in most
environments...

> What does pause IO do? Do current sessions just get queued up and IO
> resume normally with no problem after unpausing?

Exactly.  As long as the application doesn't have some timeout coded where
it gives up when a read or write is taking too long, everything will just 
pause.

> If we have to pause IO, is it better to do something like: pause IO,
> restart OSDs on one node, unpause IO - repeated for all the nodes
> involved in the EC pool?

Yes, that sounds like a great way to proceed!

sage
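
(For reference, that pause/restart/unpause cycle would look roughly like this per 
OSD node - a sketch, assuming systemd-managed OSDs and a placeholder host name:)

# ceph osd pause                                    # sets pauserd,pausewr: client IO stops
# ssh osd-node-01 systemctl restart ceph-osd.target
# ceph -s                                           # wait for PGs to return to active+clean
# ceph osd unpause                                  # resume client IO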

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS with erasure coding, do I need a cache-pool?

2018-07-17 Thread Linh Vu
I think the P4600 should be fine, although 2TB is probably way overkill for 15 
OSDs.


Our older nodes use the P3700 400GB for 16 OSDs. I have yet to see the WAL and 
DB getting filled up at 2GB/10GB each. Our newer nodes use the Intel Optane 
900P 480GB, which is actually faster than the P4600, significantly cheaper in 
our country (we bought ~100 OSD nodes recently and that was a big saving), and 
has a big 10 DWPD. For NLSAS OSDs, even the older P3700 is more than enough, 
but for our flash OSDs, the Optane 900P performs a lot better. It's about 2x 
faster than the P3700 we had, and allows us to get more out of our flash drives.
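
If anyone wants to reproduce that kind of WAL/DB split, one way (a sketch, not 
necessarily exactly how we provision ours) is to set the partition sizes picked up 
at OSD creation time in ceph.conf, e.g. for the 2GB/10GB NLSAS case:

[osd]
# sizes are in bytes, used when the bluestore OSD is created
bluestore_block_wal_size = 2147483648     # 2 GiB
bluestore_block_db_size  = 10737418240    # 10 GiB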


From: Oliver Schulz 
Sent: Wednesday, 18 July 2018 12:00:14 PM
To: Linh Vu; ceph-users
Subject: Re: [ceph-users] CephFS with erasure coding, do I need a cache-pool?

Thanks, Linh!

A question regarding choice of NVMe - do you think an
Intel P4510 or P4600 would do well for WAL+DB? I'm
thinking about using a single 2 TB NVMe for 15 OSDs.
Would you recommend a different model?

Is there any experience on how many 4k IOPS one should
have for WAL+DB per OSD?

We have a few new BlueStore nodes in an older
cluster, and we use Intel Optanes for WAL. We wanted to
use them for DB too - only to learn that while fast
they're just too small for the DB for several OSDs ...
so I hope a "regular" NVMe is fast enough?

We currently use the Gigabyte D120-C21 server barebone
(https://b2b.gigabyte.com/Storage-Server/D120-C21-rev-100<https://b2b.gigabyte.com/Storage-Server/D120-C21-rev-100>)
for our OSD nodes, and we'd like to use it in our
next cluster too, because of the high storage density
and the good hdd-price to server-price ratio.
But it can only fit a single NVMe-drive (we use one of
the 16 HDD slots for an U.2 drive and connect it to the
single M.2-PCIe slot on the mainboard).


Cheers,

Oliver


On 18.07.2018 09:11, Linh Vu wrote:
> On our NLSAS OSD nodes, there is 1x NVMe PCIe card for all the WALs and
> DBs (we accept that the risk of 1 card failing is low, and our failure
> domain is host anyway). Each OSD (16 per host) gets 2GB of WAL and 10GB
> of DB.
>
>
> On our Flash (SSD but not NVMe) OSD nodes, there are 8 OSDs per node,
> and 2x NVMe PCIe cards for the WALs and DBs. Each OSD gets 4GB of WAL
> and 40GB of DB.
>
>
> On our upcoming NVMe OSD nodes, for obvious reason, we don't do any such
> special allocation. 
>
>
> Cheers,
>
> Linh
>
>
> ----
> *From:* Oliver Schulz 
> *Sent:* Tuesday, 17 July 2018 11:39:26 PM
> *To:* Linh Vu; ceph-users
> *Subject:* Re: [ceph-users] CephFS with erasure coding, do I need a
> cache-pool?
> Dear Linh,
>
> another question, if I may:
>
> How do you handle Bluestore WAL and DB, and
> how much SSD space do you allocate for them?
>
>
> Cheers,
>
> Oliver
>
>
> On 17.07.2018 08:55, Linh Vu wrote:
> > Hi Oliver,
> >
> >
> > We have several CephFS on EC pool deployments, one been in production
> > for a while, the others about to pending all the Bluestore+EC fixes in
> > 12.2.7 
> >
> >
> > Firstly as John and Greg have said, you don't need SSD cache pool at all.
> >
> >
> > Secondly, regarding k/m, it depends on how many hosts or racks you have,
> > and how many failures you want to tolerate.
> >
> >
> > For our smallest pool with only 8 hosts in 4 different racks and 2
> > different pairs of switches (note: we consider switch failure more
> > common than rack cooling or power failure), we're using 4/2 with failure
> > domain = host. We currently use this for SSD scratch storage for HPC.
> >
> >
> > For one of our larger pools, with 24 hosts over 6 different racks and 6
> > different pairs of switches, we're using 4:2 with failure domain = rack.
> >
> >
> > For another pool with similar host count but not spread over so many
> > pairs of switches, we're using 6:3 and failure domain = host.
> >
> >
> > Also keep in mind that a higher value of k/m may give you more
> > throughput but increase latency especially for small files, so it also
> > depends on how important performance is and what kind of file size you
> > store on your CephFS.
> >
> >
> > Cheers,
> >
> > Linh
> >
> > 
> > *From:* ceph-users  on behalf of
> > Oliver Schulz 
> > *Sent:* Sunday, 15 July 2018 9:46:16 PM
> > *To:* ceph-users
> > *Subject:* [ceph-users] CephFS with erasure coding, do I need a
> cache-pool?
> > Dear all,
> >
> > we're planning a new Ceph-Clusterm, with CephFS as the
> >

Re: [ceph-users] CephFS with erasure coding, do I need a cache-pool?

2018-07-17 Thread Linh Vu
On our NLSAS OSD nodes, there is 1x NVMe PCIe card for all the WALs and DBs (we 
accept that the risk of 1 card failing is low, and our failure domain is host 
anyway). Each OSD (16 per host) gets 2GB of WAL and 10GB of DB.


On our Flash (SSD but not NVMe) OSD nodes, there are 8 OSDs per node, and 2x 
NVMe PCIe cards for the WALs and DBs. Each OSD gets 4GB of WAL and 40GB of DB.


On our upcoming NVMe OSD nodes, for obvious reasons, we don't do any such 
special allocation.


Cheers,

Linh



From: Oliver Schulz 
Sent: Tuesday, 17 July 2018 11:39:26 PM
To: Linh Vu; ceph-users
Subject: Re: [ceph-users] CephFS with erasure coding, do I need a cache-pool?

Dear Linh,

another question, if I may:

How do you handle Bluestore WAL and DB, and
how much SSD space do you allocate for them?


Cheers,

Oliver


On 17.07.2018 08:55, Linh Vu wrote:
> Hi Oliver,
>
>
> We have several CephFS on EC pool deployments, one been in production
> for a while, the others about to pending all the Bluestore+EC fixes in
> 12.2.7 
>
>
> Firstly as John and Greg have said, you don't need SSD cache pool at all.
>
>
> Secondly, regarding k/m, it depends on how many hosts or racks you have,
> and how many failures you want to tolerate.
>
>
> For our smallest pool with only 8 hosts in 4 different racks and 2
> different pairs of switches (note: we consider switch failure more
> common than rack cooling or power failure), we're using 4/2 with failure
> domain = host. We currently use this for SSD scratch storage for HPC.
>
>
> For one of our larger pools, with 24 hosts over 6 different racks and 6
> different pairs of switches, we're using 4:2 with failure domain = rack.
>
>
> For another pool with similar host count but not spread over so many
> pairs of switches, we're using 6:3 and failure domain = host.
>
>
> Also keep in mind that a higher value of k/m may give you more
> throughput but increase latency especially for small files, so it also
> depends on how important performance is and what kind of file size you
> store on your CephFS.
>
>
> Cheers,
>
> Linh
>
> 
> *From:* ceph-users  on behalf of
> Oliver Schulz 
> *Sent:* Sunday, 15 July 2018 9:46:16 PM
> *To:* ceph-users
> *Subject:* [ceph-users] CephFS with erasure coding, do I need a cache-pool?
> Dear all,
>
> we're planning a new Ceph-Clusterm, with CephFS as the
> main workload, and would like to use erasure coding to
> use the disks more efficiently. Access pattern will
> probably be more read- than write-heavy, on average.
>
> I don't have any practical experience with erasure-
> coded pools so far.
>
> I'd be glad for any hints / recommendations regarding
> these questions:
>
> * Is an SSD cache pool recommended/necessary for
> CephFS on an erasure-coded HDD pool (using Ceph
> Luminous and BlueStore)?
>
> * What are good values for k/m for erasure coding in
> practice (assuming a cluster of about 300 OSDs), to
> make things robust and ease maintenance (ability to
> take a few nodes down)? Is k/m = 6/3 a good choice?
>
> * Will it be sufficient to have k+m racks, resp. failure
> domains?
>
>
> Cheers and thanks for any advice,
>
> Oliver
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com<http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v12.2.7 Luminous released

2018-07-17 Thread Linh Vu
Thanks for all your hard work in putting out the fixes so quickly! :)

We have a cluster on 12.2.5 with Bluestore and EC pool but for CephFS, not RGW. 
In the release notes, it says RGW is a risk especially the garbage collection, 
and the recommendation is to either pause IO or disable RGW garbage collection.


In our case with CephFS, not RGW, is it a lot less risky to perform the upgrade 
to 12.2.7 without the need to pause IO?


What does pause IO do? Do current sessions just get queued up and IO resume 
normally with no problem after unpausing?


If we have to pause IO, is it better to do something like: pause IO, restart 
OSDs on one node, unpause IO - repeated for all the nodes involved in the EC 
pool?


Regards,

Linh


From: ceph-users  on behalf of Sage Weil 

Sent: Wednesday, 18 July 2018 4:42:41 AM
To: Stefan Kooman
Cc: ceph-annou...@ceph.com; ceph-de...@vger.kernel.org; 
ceph-maintain...@ceph.com; ceph-us...@ceph.com
Subject: Re: [ceph-users] v12.2.7 Luminous released

On Tue, 17 Jul 2018, Stefan Kooman wrote:
> Quoting Abhishek Lekshmanan (abhis...@suse.com):
>
> > *NOTE* The v12.2.5 release has a potential data corruption issue with
> > erasure coded pools. If you ran v12.2.5 with erasure coding, please see
^^^
> > below.
>
> < snip >
>
> > Upgrading from v12.2.5 or v12.2.6
> > -
> >
> > If you used v12.2.5 or v12.2.6 in combination with erasure coded
^
> > pools, there is a small risk of corruption under certain workloads.
> > Specifically, when:
>
> < snip >
>
> One section mentions Luminous clusters _with_ EC pools specifically, the other
> section mentions Luminous clusters running 12.2.5.

I think they both do?

> I might be misreading this, but to make things clear for current Ceph
> Luminous 12.2.5 users. Is the following statement correct?
>
> If you do _NOT_ use EC in your 12.2.5 cluster (only replicated pools), there 
> is
> no need to quiesce IO (ceph osd pause).

Correct.

> http://docs.ceph.com/docs/master/releases/luminous/#upgrading-from-other-versions
> If your cluster did not run v12.2.5 or v12.2.6 then none of the above
> issues apply to you and you should upgrade normally.
>
> ^^ Above section would indicate all 12.2.5 luminous clusters.

The intent here is to clarify that any cluster running 12.2.4 or
older can upgrade without reading carefully. If the cluster
does/did run 12.2.5 or .6, then read carefully because it may (or may not)
be affected.

Does that help? Any suggested revisions to the wording in the release
notes that make it clearer are welcome!

Thanks-
sage


>
> Please clarify,
>
> Thanks,
>
> Stefan
>
> --
> | BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351
> | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at 
> http://vger.kernel.org/majordomo-info.html
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS with erasure coding, do I need a cache-pool?

2018-07-16 Thread Linh Vu
Hi Oliver,


We have several CephFS on EC pool deployments: one has been in production for a 
while, the others are about to go in, pending all the Bluestore+EC fixes in 12.2.7 


Firstly as John and Greg have said, you don't need SSD cache pool at all.


Secondly, regarding k/m, it depends on how many hosts or racks you have, and 
how many failures you want to tolerate.


For our smallest pool with only 8 hosts in 4 different racks and 2 different 
pairs of switches (note: we consider switch failure more common than rack 
cooling or power failure), we're using 4/2 with failure domain = host. We 
currently use this for SSD scratch storage for HPC.


For one of our larger pools, with 24 hosts over 6 different racks and 6 
different pairs of switches, we're using 4:2 with failure domain = rack.


For another pool with similar host count but not spread over so many pairs of 
switches, we're using 6:3 and failure domain = host.


Also keep in mind that a higher value of k/m may give you more throughput but 
increase latency especially for small files, so it also depends on how 
important performance is and what kind of file size you store on your CephFS.
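
For concreteness, setting up one of these EC data pools looks roughly like this (a 
sketch with placeholder profile/pool names, PG counts and device class; 
allow_ec_overwrites needs Luminous and BlueStore OSDs):

# ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=rack crush-device-class=hdd
# ceph osd pool create cephfs_data_ec 1024 1024 erasure ec42
# ceph osd pool set cephfs_data_ec allow_ec_overwrites true
# ceph fs add_data_pool <fsname> cephfs_data_ec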


Cheers,

Linh


From: ceph-users  on behalf of Oliver Schulz 

Sent: Sunday, 15 July 2018 9:46:16 PM
To: ceph-users
Subject: [ceph-users] CephFS with erasure coding, do I need a cache-pool?

Dear all,

we're planning a new Ceph-Clusterm, with CephFS as the
main workload, and would like to use erasure coding to
use the disks more efficiently. Access pattern will
probably be more read- than write-heavy, on average.

I don't have any practical experience with erasure-
coded pools so far.

I'd be glad for any hints / recommendations regarding
these questions:

* Is an SSD cache pool recommended/necessary for
CephFS on an erasure-coded HDD pool (using Ceph
Luminous and BlueStore)?

* What are good values for k/m for erasure coding in
practice (assuming a cluster of about 300 OSDs), to
make things robust and ease maintenance (ability to
take a few nodes down)? Is k/m = 6/3 a good choice?

* Will it be sufficient to have k+m racks, resp. failure
domains?


Cheers and thanks for any advice,

Oliver
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous 12.2.6 release date?

2018-07-11 Thread Linh Vu
Going by http://tracker.ceph.com/issues/24597, does this only affect FileStore 
OSDs or are BlueStore ones affected too?


Cheers,

Linh


From: ceph-users  on behalf of Sage Weil 

Sent: Thursday, 12 July 2018 3:48:10 AM
To: Ken Dreyer
Cc: ceph-users; ceph-de...@vger.kernel.org
Subject: Re: [ceph-users] Luminous 12.2.6 release date?

On Wed, 11 Jul 2018, Ken Dreyer wrote:
> Sage, does 
> http://tracker.ceph.com/issues/24597 
> cover the full problem
> you're describing?

Yeah. I've added some detail to that bug. Working on fixing our rados
test suite to reproduce the issue.

sage


>
> - Ken
>
> On Wed, Jul 11, 2018 at 9:40 AM, Sage Weil  wrote:
> > Please hold off on upgrading. We discovered a regression (in 12.2.5
> > actually) but the triggering event is OSD restarts or other peering
> > combined with RGW workloads on EC pools, so unnecessary OSD restarts
> > should be avoided with 12.2.5 until we have it sorted out.
> >
> > sage
> >
> >
> > On Wed, 11 Jul 2018, Dan van der Ster wrote:
> >
> >> And voila, I see the 12.2.6 rpms were released overnight.
> >>
> >> Waiting here for an announcement before upgrading.
> >>
> >> -- dan
> >>
> >> On Tue, Jul 10, 2018 at 10:08 AM Sean Purdy  
> >> wrote:
> >> >
> >> > While we're at it, is there a release date for 12.2.6? It fixes a 
> >> > reshard/versioning bug for us.
> >> >
> >> > Sean
> >> > ___
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors

2018-07-11 Thread Linh Vu
Thanks John :) Yeah I did the `ceph fs reset` as well, because we did have 2 
active MDSes. Currently running on just one until we have completely cleared all 
these issues.


Our original problem started with a partial network outage a few weeks ago, 
around the weekend. After it came back, post-DR and all the scans, we've been 
getting these add_inode asserts every now and then, which would crash the active 
MDS. Usually a standby takes over and continues fine, but on some occasions the 
standbys all crashed too and we had to do DR again. If it happens again, we will 
try the previously mentioned steps.


Cheers,

Linh


From: John Spray 
Sent: Wednesday, 11 July 2018 8:00:29 PM
To: Linh Vu
Cc: Wido den Hollander; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors

On Wed, Jul 11, 2018 at 2:23 AM Linh Vu  wrote:
>
> Hi John,
>
>
> Thanks for the explanation, that command is a lot more impacting than I 
> thought! I hope the change of name for the verb "reset" comes through in the 
> next version, because that is very easy to misunderstand.
>
> "The first question is why we're talking about running it at all. What
> chain of reasoning led you to believe that your inotable needed
> erasing?"
>
> I thought the reset inode command was just like the reset session command: as 
> you can pass the mds rank to it as a param, I assumed it only resets whatever that 
> MDS was holding.
>
> "The most typical case is where the journal has been recovered/erased,
> and take_inos is used to skip forward to avoid re-using any inode
> numbers that had been claimed by journal entries that we threw away."
>
> We had the situation where our MDS was crashing at 
> MDCache::add_inode(CInode*), as discussed earlier. take_inos should fix this, 
> as you mentioned, but we thought that we would need to reset what the MDS was 
> holding, just like the session.
>
> So with your clarification, I believe we only need to do these:
>
> journal backup
> recover dentries
> reset mds journal (it wasn't replaying anyway, kept crashing)
> reset session
> take_inos
> start mds up again
>
> Is that correct?

Probably... I have to be a bit hesitant because we don't know what
originally went wrong with your cluster. You'd also need to add an
"fs reset" before starting up again if you had multiple active MDS
ranks to begin with.

John
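
For anyone landing on this thread later, the steps listed above map roughly onto 
these commands from the disaster-recovery docs (a sketch for a single-rank fs; 
<max_ino> is a placeholder and must be higher than any inode number already in use):

# cephfs-journal-tool journal export backup.bin
# cephfs-journal-tool event recover_dentries summary
# cephfs-journal-tool journal reset
# cephfs-table-tool all reset session
# cephfs-table-tool all take_inos <max_ino>
# ceph fs reset <fs_name> --yes-i-really-mean-it    # only if you had multiple active ranks
# systemctl start ceph-mds.target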

>
> Many thanks, I've learned a lot more about this process.
>
> Cheers,
> Linh
>
> 
> From: John Spray 
> Sent: Tuesday, 10 July 2018 7:24 PM
> To: Linh Vu
> Cc: Wido den Hollander; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors
>
> On Tue, Jul 10, 2018 at 2:49 AM Linh Vu  wrote:
> >
> > While we're on this topic, could someone please explain to me what 
> > `cephfs-table-tool all reset inode` does?
>
> The inode table stores an interval set of free inode numbers. Active
> MDS daemons consume inode numbers as they create files. Resetting the
> inode table means rewriting it to its original state (i.e. everything
> free). Using the "take_inos" command consumes some range of inodes,
> to reflect that the inodes up to a certain point aren't really free,
> but in use by some files that already exist.
>
> > Does it only reset what the MDS has in its cache, and after starting up 
> > again, the MDS will read in new inode range from the metadata pool?
>
> I'm repeating myself a bit, but for the benefit of anyone reading this
> thread in the future: no, it's nothing like that. It effectively
> *erases the inode table* by overwriting it ("resetting") with a blank
> one.
>
>
> As with the journal tool (https://github.com/ceph/ceph/pull/22853),
> perhaps the verb "reset" is too prone to misunderstanding.
>
> > If so, does it mean *before* we run `cephfs-table-tool take_inos`, we must 
> > run `cephfs-table-tool all reset inode`?
>
> The first question is why we're talking about running it at all. What
> chain of reasoning led you to believe that your inotable needed
> erasing?
>
> The most typical case is where the journal has been recovered/erased,
> and take_inos is used to skip forward to avoid re-using any inode
> numbers that had been claimed by journal entries that we threw away.
>
> John
>
> >
> > Cheers,
> >
> > Linh
> >
> > 
> > From: ceph-users  on behalf of Wido den 
> > Hollander 
> > Sent: Saturday, 7 July 2018 12:26:15 AM
> > To: John Spray
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: 

Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors

2018-07-11 Thread Linh Vu
For this cluster, we currently don't build our own ceph packages (although just 
had to do that for one other cluster recently). Is it safe to comment out that 
particular assert, in the event that the full fix isn't coming really soon?


From: Wido den Hollander 
Sent: Wednesday, 11 July 2018 5:23:30 PM
To: Linh Vu; John Spray
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors



On 07/11/2018 01:47 AM, Linh Vu wrote:
> Thanks John :) Has it - asserting out on dupe inode - already been
> logged as a bug yet? I could put one in if needed.
>

Did you just comment out the assert? And indeed, my next question would
be: do we have an issue tracker for this?

Wido

>
> Cheers,
>
> Linh
>
>
>
> 
> *From:* John Spray 
> *Sent:* Tuesday, 10 July 2018 7:11 PM
> *To:* Linh Vu
> *Cc:* Wido den Hollander; ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] CephFS - How to handle "loaded dup inode"
> errors
>
> On Tue, Jul 10, 2018 at 12:43 AM Linh Vu  wrote:
>>
>> We're affected by something like this right now (the dup inode causing MDS 
>> to crash via assert(!p) with add_inode(CInode) function).
>>
>> In terms of behaviours, shouldn't the MDS simply skip to the next available 
>> free inode in the event of a dup, than crashing the entire FS because of one 
>> file? Probably I'm missing something but that'd be a no brainer picking 
>> between the two?
>
> Historically (a few years ago) the MDS asserted out on any invalid
> metadata.  Most of these cases have been picked up and converted into
> explicit damage handling, but this one appears to have been missed --
> so yes, it's a bug that the MDS asserts out.
>
> John
>
>> 
>> From: ceph-users  on behalf of Wido den 
>> Hollander 
>> Sent: Saturday, 7 July 2018 12:26:15 AM
>> To: John Spray
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors
>>
>>
>>
>> On 07/06/2018 01:47 PM, John Spray wrote:
>> > On Fri, Jul 6, 2018 at 12:19 PM Wido den Hollander  wrote:
>> >>
>> >>
>> >>
>> >> On 07/05/2018 03:36 PM, John Spray wrote:
>> >>> On Thu, Jul 5, 2018 at 1:42 PM Dennis Kramer (DBS)  
>> >>> wrote:
>> >>>>
>> >>>> Hi list,
>> >>>>
>> >>>> I have a serious problem now... I think.
>> >>>>
>> >>>> One of my users just informed me that a file he created (.doc file) has
>> >>>> a different content then before. It looks like the file's inode is
>> >>>> completely wrong and points to the wrong object. I myself have found
>> >>>> another file with the same symptoms. I'm afraid my (production) FS is
>> >>>> corrupt now, unless there is a possibility to fix the inodes.
>> >>>
>> >>> You can probably get back to a state with some valid metadata, but it
>> >>> might not necessarily be the metadata the user was expecting (e.g. if
>> >>> two files are claiming the same inode number, one of them's is
>> >>> probably going to get deleted).
>> >>>
>> >>>> Timeline of what happend:
>> >>>>
>> >>>> Last week I upgraded our Ceph Jewel to Luminous.
>> >>>> This went without any problem.
>> >>>>
>> >>>> I already had 5 MDS available and went with the Multi-MDS feature and
>> >>>> enabled it. The seemed to work okay, but after a while my MDS went
>> >>>> beserk and went flapping (crashed -> replay -> rejoin -> crashed)
>> >>>>
>> >>>> The only way to fix this and get the FS back online was the disaster
>> >>>> recovery procedure:
>> >>>>
>> >>>> cephfs-journal-tool event recover_dentries summary
>> >>>> ceph fs set cephfs cluster_down true
>> >>>> cephfs-table-tool all reset session
>> >>>> cephfs-table-tool all reset inode
>> >>>> cephfs-journal-tool --rank=cephfs:0 journal reset
>> >>>> ceph mds fail 0
>> >>>> ceph fs reset cephfs --yes-i-really-mean-it
>> >>>
>> >>> My concern with this procedure is that the recover_dentries and
>> >>> journal reset only hap

Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors

2018-07-10 Thread Linh Vu
Hi John,

Thanks for the explanation, that command has a much bigger impact than I 
thought! I hope the renaming of the verb "reset" makes it into the next version, 
because it is very easy to misunderstand.

"The first question is why we're talking about running it at all.  What
chain of reasoning led you to believe that your inotable needed
erasing?"

I thought the reset inode command was just like the reset session command: since 
you can pass an MDS rank to it as a parameter, I assumed it only reset whatever 
that MDS was holding.

"The most typical case is where the journal has been recovered/erased,
and take_inos is used to skip forward to avoid re-using any inode
numbers that had been claimed by journal entries that we threw away."

We had the situation where our MDS was crashing at MDCache::add_inode(CInode*), 
as discussed earlier. take_inos should fix this, as you mentioned, but we 
thought that we would need to reset what the MDS was holding, just like the 
session.

So with your clarification, I believe we only need to do these:

journal backup
recover dentries
reset mds journal (it wasn't replaying anyway, kept crashing)
reset session
take_inos
start mds up again

Is that correct?
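
In command form, that would be roughly the following (command names as used 
elsewhere in this thread; the take_inos value is a placeholder we would still 
have to choose):

$ cephfs-journal-tool journal export backup.bin
$ cephfs-journal-tool event recover_dentries summary
$ cephfs-journal-tool --rank=myfs:0 journal reset
$ cephfs-table-tool all reset session
$ cephfs-table-tool all take_inos <value above the highest in-use inode>
$ start the MDS daemons up again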

Many thanks, I've learned a lot more about this process.

Cheers,
Linh


From: John Spray 
Sent: Tuesday, 10 July 2018 7:24 PM
To: Linh Vu
Cc: Wido den Hollander; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors

On Tue, Jul 10, 2018 at 2:49 AM Linh Vu  wrote:
>
> While we're on this topic, could someone please explain to me what 
> `cephfs-table-tool all reset inode` does?

The inode table stores an interval set of free inode numbers.  Active
MDS daemons consume inode numbers as they create files.  Resetting the
inode table means rewriting it to its original state (i.e. everything
free).  Using the "take_inos" command consumes some range of inodes,
to reflect that the inodes up to a certain point aren't really free,
but in use by some files that already exist.
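
Concretely, with the commands as they appear elsewhere in this thread (the 
take_inos value is only an example and has to be chosen per cluster, at or 
above the highest inode number already allocated):

$ cephfs-table-tool 0 show inode
$ cephfs-table-tool all take_inos 1099536627776

The first dumps the current free-inode intervals; the second consumes the range 
up to the given number so it won't be handed out again.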

> Does it only reset what the MDS has in its cache, and after starting up 
> again, the MDS will read in new inode range from the metadata pool?

I'm repeating myself a bit, but for the benefit of anyone reading this
thread in the future: no, it's nothing like that.  It effectively
*erases the inode table* by overwriting it ("resetting") with a blank
one.


As with the journal tool (https://github.com/ceph/ceph/pull/22853),
perhaps the verb "reset" is too prone to misunderstanding.

> If so, does it mean *before* we run `cephfs-table-tool take_inos`, we must 
> run `cephfs-table-tool all reset inode`?

The first question is why we're talking about running it at all.  What
chain of reasoning led you to believe that your inotable needed
erasing?

The most typical case is where the journal has been recovered/erased,
and take_inos is used to skip forward to avoid re-using any inode
numbers that had been claimed by journal entries that we threw away.

John

>
> Cheers,
>
> Linh
>
> 
> From: ceph-users  on behalf of Wido den 
> Hollander 
> Sent: Saturday, 7 July 2018 12:26:15 AM
> To: John Spray
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors
>
>
>
> On 07/06/2018 01:47 PM, John Spray wrote:
> > On Fri, Jul 6, 2018 at 12:19 PM Wido den Hollander  wrote:
> >>
> >>
> >>
> >> On 07/05/2018 03:36 PM, John Spray wrote:
> >>> On Thu, Jul 5, 2018 at 1:42 PM Dennis Kramer (DBS)  
> >>> wrote:
> >>>>
> >>>> Hi list,
> >>>>
> >>>> I have a serious problem now... I think.
> >>>>
> >>>> One of my users just informed me that a file he created (.doc file) has
> >>>> a different content then before. It looks like the file's inode is
> >>>> completely wrong and points to the wrong object. I myself have found
> >>>> another file with the same symptoms. I'm afraid my (production) FS is
> >>>> corrupt now, unless there is a possibility to fix the inodes.
> >>>
> >>> You can probably get back to a state with some valid metadata, but it
> >>> might not necessarily be the metadata the user was expecting (e.g. if
> >>> two files are claiming the same inode number, one of them's is
> >>> probably going to get deleted).
> >>>
> >>>> Timeline of what happend:
> >>>>
> >>>> Last week I upgraded our Ceph Jewel to Luminous.
> >>>> This went without any problem.
> >>>>
> >>>> I already had 5 MDS available and w

Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors

2018-07-10 Thread Linh Vu
Thanks John :) Has it - asserting out on dupe inode - already been logged as a 
bug yet? I could put one in if needed.


Cheers,

Linh



From: John Spray 
Sent: Tuesday, 10 July 2018 7:11 PM
To: Linh Vu
Cc: Wido den Hollander; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors

On Tue, Jul 10, 2018 at 12:43 AM Linh Vu  wrote:
>
> We're affected by something like this right now (the dup inode causing MDS to 
> crash via assert(!p) with add_inode(CInode) function).
>
> In terms of behaviours, shouldn't the MDS simply skip to the next available 
> free inode in the event of a dup, than crashing the entire FS because of one 
> file? Probably I'm missing something but that'd be a no brainer picking 
> between the two?

Historically (a few years ago) the MDS asserted out on any invalid
metadata.  Most of these cases have been picked up and converted into
explicit damage handling, but this one appears to have been missed --
so yes, it's a bug that the MDS asserts out.

John

> 
> From: ceph-users  on behalf of Wido den 
> Hollander 
> Sent: Saturday, 7 July 2018 12:26:15 AM
> To: John Spray
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors
>
>
>
> On 07/06/2018 01:47 PM, John Spray wrote:
> > On Fri, Jul 6, 2018 at 12:19 PM Wido den Hollander  wrote:
> >>
> >>
> >>
> >> On 07/05/2018 03:36 PM, John Spray wrote:
> >>> On Thu, Jul 5, 2018 at 1:42 PM Dennis Kramer (DBS)  
> >>> wrote:
> >>>>
> >>>> Hi list,
> >>>>
> >>>> I have a serious problem now... I think.
> >>>>
> >>>> One of my users just informed me that a file he created (.doc file) has
> >>>> a different content then before. It looks like the file's inode is
> >>>> completely wrong and points to the wrong object. I myself have found
> >>>> another file with the same symptoms. I'm afraid my (production) FS is
> >>>> corrupt now, unless there is a possibility to fix the inodes.
> >>>
> >>> You can probably get back to a state with some valid metadata, but it
> >>> might not necessarily be the metadata the user was expecting (e.g. if
> >>> two files are claiming the same inode number, one of them's is
> >>> probably going to get deleted).
> >>>
> >>>> Timeline of what happend:
> >>>>
> >>>> Last week I upgraded our Ceph Jewel to Luminous.
> >>>> This went without any problem.
> >>>>
> >>>> I already had 5 MDS available and went with the Multi-MDS feature and
> >>>> enabled it. The seemed to work okay, but after a while my MDS went
> >>>> beserk and went flapping (crashed -> replay -> rejoin -> crashed)
> >>>>
> >>>> The only way to fix this and get the FS back online was the disaster
> >>>> recovery procedure:
> >>>>
> >>>> cephfs-journal-tool event recover_dentries summary
> >>>> ceph fs set cephfs cluster_down true
> >>>> cephfs-table-tool all reset session
> >>>> cephfs-table-tool all reset inode
> >>>> cephfs-journal-tool --rank=cephfs:0 journal reset
> >>>> ceph mds fail 0
> >>>> ceph fs reset cephfs --yes-i-really-mean-it
> >>>
> >>> My concern with this procedure is that the recover_dentries and
> >>> journal reset only happened on rank 0, whereas the other 4 MDS ranks
> >>> would have retained lots of content in their journals.  I wonder if we
> >>> should be adding some more multi-mds aware checks to these tools, to
> >>> warn the user when they're only acting on particular ranks (a
> >>> reasonable person might assume that recover_dentries with no args is
> >>> operating on all ranks, not just 0).  Created
> >>> http://tracker.ceph.com/issues/24780 to track improving the default
> >>> behaviour.
> >>>
> >>>> Restarted the MDS and I was back online. Shortly after I was getting a
> >>>> lot of "loaded dup inode". In the meanwhile the MDS kept crashing. It
> >>>> looks like it had trouble creating new inodes. Right before the crash
> >>>> it mostly complained something like:
> >>>>
> >>>> -2> 2018-07-05 05:05:01.614290 7f8f8574b700  4 mds.0.server
> >>>> handle_client_request client_

Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors

2018-07-09 Thread Linh Vu
While we're on this topic, could someone please explain to me what 
`cephfs-table-tool all reset inode` does?


Does it only reset what the MDS has in its cache, and after starting up again, 
the MDS will read in new inode range from the metadata pool?


If so, does it mean *before* we run `cephfs-table-tool take_inos`, we must run 
`cephfs-table-tool all reset inode`?


Cheers,

Linh


From: ceph-users  on behalf of Wido den 
Hollander 
Sent: Saturday, 7 July 2018 12:26:15 AM
To: John Spray
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors



On 07/06/2018 01:47 PM, John Spray wrote:
> On Fri, Jul 6, 2018 at 12:19 PM Wido den Hollander  wrote:
>>
>>
>>
>> On 07/05/2018 03:36 PM, John Spray wrote:
>>> On Thu, Jul 5, 2018 at 1:42 PM Dennis Kramer (DBS)  wrote:

 Hi list,

 I have a serious problem now... I think.

 One of my users just informed me that a file he created (.doc file) has
 a different content then before. It looks like the file's inode is
 completely wrong and points to the wrong object. I myself have found
 another file with the same symptoms. I'm afraid my (production) FS is
 corrupt now, unless there is a possibility to fix the inodes.
>>>
>>> You can probably get back to a state with some valid metadata, but it
>>> might not necessarily be the metadata the user was expecting (e.g. if
>>> two files are claiming the same inode number, one of them's is
>>> probably going to get deleted).
>>>
 Timeline of what happend:

 Last week I upgraded our Ceph Jewel to Luminous.
 This went without any problem.

 I already had 5 MDS available and went with the Multi-MDS feature and
 enabled it. The seemed to work okay, but after a while my MDS went
 beserk and went flapping (crashed -> replay -> rejoin -> crashed)

 The only way to fix this and get the FS back online was the disaster
 recovery procedure:

 cephfs-journal-tool event recover_dentries summary
 ceph fs set cephfs cluster_down true
 cephfs-table-tool all reset session
 cephfs-table-tool all reset inode
 cephfs-journal-tool --rank=cephfs:0 journal reset
 ceph mds fail 0
 ceph fs reset cephfs --yes-i-really-mean-it
>>>
>>> My concern with this procedure is that the recover_dentries and
>>> journal reset only happened on rank 0, whereas the other 4 MDS ranks
>>> would have retained lots of content in their journals.  I wonder if we
>>> should be adding some more multi-mds aware checks to these tools, to
>>> warn the user when they're only acting on particular ranks (a
>>> reasonable person might assume that recover_dentries with no args is
>>> operating on all ranks, not just 0).  Created
>>> http://tracker.ceph.com/issues/24780 to track improving the default
>>> behaviour.
>>>
 Restarted the MDS and I was back online. Shortly after I was getting a
 lot of "loaded dup inode". In the meanwhile the MDS kept crashing. It
 looks like it had trouble creating new inodes. Right before the crash
 it mostly complained something like:

 -2> 2018-07-05 05:05:01.614290 7f8f8574b700  4 mds.0.server
 handle_client_request client_request(client.324932014:1434 create
 #0x1360346/pyfiles.txt 2018-07-05 05:05:01.607458 caller_uid=0,
 caller_gid=0{}) v2
 -1> 2018-07-05 05:05:01.614320 7f8f7e73d700  5 mds.0.log
 _submit_thread 24100753876035~1070 : EOpen [metablob 0x1360346, 1
 dirs], 1 open files
  0> 2018-07-05 05:05:01.661155 7f8f8574b700 -1 /build/ceph-
 12.2.5/src/mds/MDCache.cc: In function 'void
 MDCache::add_inode(CInode*)' thread 7f8f8574b700 time 2018-07-05
 05:05:01.615123
 /build/ceph-12.2.5/src/mds/MDCache.cc: 262: FAILED assert(!p)

 I also tried to counter the create inode crash by doing the following:

 cephfs-journal-tool event recover_dentries
 cephfs-journal-tool journal reset
 cephfs-table-tool all reset session
 cephfs-table-tool all reset inode
 cephfs-table-tool all take_inos 10
>>>
>>> This procedure is recovering some metadata from the journal into the
>>> main tree, then resetting everything, but duplicate inodes are
>>> happening when the main tree has multiple dentries containing inodes
>>> using the same inode number.
>>>
>>> What you need is something that scans through all the metadata,
>>> notices which entries point to the a duplicate, and snips out those
>>> dentries.  I'm not quite up to date on the latest CephFS forward scrub
>>> bits, so hopefully someone else can chime in to comment on whether we
>>> have the tooling for this already.
>>
>> But to prevent these crashes setting take_inos to a higher number is a
>> good choice, right? You'll loose inodes numbers, but you will have it
>> running without duplicate (new inodes).
>
> Yes -- that's the motivation to skipping inode numbers after some
> damage (but 

Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors

2018-07-09 Thread Linh Vu
We're affected by something like this right now (the dup inode causing MDS to 
crash via assert(!p) with add_inode(CInode) function).

In terms of behaviour, shouldn't the MDS simply skip to the next available 
free inode in the event of a dup, rather than crashing the entire FS because of 
one file? Probably I'm missing something, but that would seem like a no-brainer 
picking between the two?

From: ceph-users  on behalf of Wido den 
Hollander 
Sent: Saturday, 7 July 2018 12:26:15 AM
To: John Spray
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors



On 07/06/2018 01:47 PM, John Spray wrote:
> On Fri, Jul 6, 2018 at 12:19 PM Wido den Hollander  wrote:
>>
>>
>>
>> On 07/05/2018 03:36 PM, John Spray wrote:
>>> On Thu, Jul 5, 2018 at 1:42 PM Dennis Kramer (DBS)  wrote:

 Hi list,

 I have a serious problem now... I think.

 One of my users just informed me that a file he created (.doc file) has
 a different content then before. It looks like the file's inode is
 completely wrong and points to the wrong object. I myself have found
 another file with the same symptoms. I'm afraid my (production) FS is
 corrupt now, unless there is a possibility to fix the inodes.
>>>
>>> You can probably get back to a state with some valid metadata, but it
>>> might not necessarily be the metadata the user was expecting (e.g. if
>>> two files are claiming the same inode number, one of them's is
>>> probably going to get deleted).
>>>
 Timeline of what happend:

 Last week I upgraded our Ceph Jewel to Luminous.
 This went without any problem.

 I already had 5 MDS available and went with the Multi-MDS feature and
 enabled it. The seemed to work okay, but after a while my MDS went
 beserk and went flapping (crashed -> replay -> rejoin -> crashed)

 The only way to fix this and get the FS back online was the disaster
 recovery procedure:

 cephfs-journal-tool event recover_dentries summary
 ceph fs set cephfs cluster_down true
 cephfs-table-tool all reset session
 cephfs-table-tool all reset inode
 cephfs-journal-tool --rank=cephfs:0 journal reset
 ceph mds fail 0
 ceph fs reset cephfs --yes-i-really-mean-it
>>>
>>> My concern with this procedure is that the recover_dentries and
>>> journal reset only happened on rank 0, whereas the other 4 MDS ranks
>>> would have retained lots of content in their journals.  I wonder if we
>>> should be adding some more multi-mds aware checks to these tools, to
>>> warn the user when they're only acting on particular ranks (a
>>> reasonable person might assume that recover_dentries with no args is
>>> operating on all ranks, not just 0).  Created
>>> http://tracker.ceph.com/issues/24780 to track improving the default
>>> behaviour.
>>>
 Restarted the MDS and I was back online. Shortly after I was getting a
 lot of "loaded dup inode". In the meanwhile the MDS kept crashing. It
 looks like it had trouble creating new inodes. Right before the crash
 it mostly complained something like:

 -2> 2018-07-05 05:05:01.614290 7f8f8574b700  4 mds.0.server
 handle_client_request client_request(client.324932014:1434 create
 #0x1360346/pyfiles.txt 2018-07-05 05:05:01.607458 caller_uid=0,
 caller_gid=0{}) v2
 -1> 2018-07-05 05:05:01.614320 7f8f7e73d700  5 mds.0.log
 _submit_thread 24100753876035~1070 : EOpen [metablob 0x1360346, 1
 dirs], 1 open files
  0> 2018-07-05 05:05:01.661155 7f8f8574b700 -1 /build/ceph-
 12.2.5/src/mds/MDCache.cc: In function 'void
 MDCache::add_inode(CInode*)' thread 7f8f8574b700 time 2018-07-05
 05:05:01.615123
 /build/ceph-12.2.5/src/mds/MDCache.cc: 262: FAILED assert(!p)

 I also tried to counter the create inode crash by doing the following:

 cephfs-journal-tool event recover_dentries
 cephfs-journal-tool journal reset
 cephfs-table-tool all reset session
 cephfs-table-tool all reset inode
 cephfs-table-tool all take_inos 10
>>>
>>> This procedure is recovering some metadata from the journal into the
>>> main tree, then resetting everything, but duplicate inodes are
>>> happening when the main tree has multiple dentries containing inodes
>>> using the same inode number.
>>>
>>> What you need is something that scans through all the metadata,
>>> notices which entries point to the a duplicate, and snips out those
>>> dentries.  I'm not quite up to date on the latest CephFS forward scrub
>>> bits, so hopefully someone else can chime in to comment on whether we
>>> have the tooling for this already.
>>
>> But to prevent these crashes setting take_inos to a higher number is a
>> good choice, right? You'll loose inodes numbers, but you will have it
>> running without duplicate (new inodes).
>
> Yes -- that's the motivation to skipping inode numbers after some
> damage (but it won't 

Re: [ceph-users] Help! Luminous 12.2.5 CephFS - MDS crashed and now won't start (failing at MDCache::add_inode)

2018-06-25 Thread Linh Vu
So my colleague Sean Crosby and I were looking through the logs (with debug mds 
= 10) and found some references to an inode number just before the crash. We 
converted it from hex to decimal and got something like 1099535627776 (the last 
few digits not necessarily correct). We then bumped one digit up, i.e. to 
1099536627776, and used that as the value for take_inos, i.e.:

$ cephfs-table-tool all take_inos 1099536627776


After that, the MDS could start successfully and we have a HEALTH_OK cluster 
once more!
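
For reference, the hex-to-decimal step is just a one-liner, e.g. (the hex value 
here is only an approximation of what we saw in the log):

$ printf '%d\n' 0x100016e3600
1099535627776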


It would still be useful if `show inode` in cephfs-table-tool actually showed us 
the max inode number, at least. And I think take_inos should be documented as 
well in the Disaster Recovery guide. :)


We'll be monitoring the cluster for the next few days. Hopefully nothing too 
interesting to share after this! 


Cheers,

Linh


From: ceph-users  on behalf of Linh Vu 

Sent: Monday, 25 June 2018 7:06:45 PM
To: ceph-users
Subject: [ceph-users] Help! Luminous 12.2.5 CephFS - MDS crashed and now won't 
start (failing at MDCache::add_inode)


Hi all,


We have a Luminous 12.2.5 cluster, running entirely just CephFS with 1 active 
and 1 standby MDS. The active MDS crashed and now won't start again with this 
same error:

###

 0> 2018-06-25 16:11:21.136203 7f01c2749700 -1 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/mds/MDCache.cc:
 In function 'void MDCache::add_inode(CInode*)' thread 7f01c2749700 time 
2018-06-25 16:11:21.133236
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/mds/MDCache.cc:
 262: FAILED assert(!p)
###

Right before that point is just a bunch of client connection requests.

There are also a few other inode errors such as:

###
2018-06-25 09:30:37.889166 7f934c1e5700 -1 log_channel(cluster) log [ERR] : 
loaded dup inode 0x198f00a [2,head] v3426852030 at 
~mds0/stray5/198f00a, but inode 0x198f00a.head v3426838533 already 
exists at ~mds0/stray2/198f00a
###

We've done this for recovery:

$ make sure all MDS are shut down (all crashed by this point anyway)
$ ceph fs set myfs cluster_down true
$ cephfs-journal-tool journal export backup.bin
$ cephfs-journal-tool event recover_dentries summary
Events by type:
  FRAGMENT: 9
  OPEN: 29082
  SESSION: 15
  SUBTREEMAP: 241
  UPDATE: 171835
Errors: 0
$ cephfs-table-tool all reset session
{
"0": {
"data": {},
"result": 0
}
}
$ cephfs-table-tool all reset inode
{
"0": {
"data": {},
"result": 0
}
}
$ cephfs-journal-tool --rank=myfs:0 journal reset

old journal was 35714605847583~423728061

new journal start will be 35715031236608 (1660964 bytes past old end)
writing journal head
writing EResetJournal entry
done
$ ceph mds fail 0
$ ceph fs reset hpc_projects --yes-i-really-mean-it
$ start up MDS again

However, we keep getting the same error as above.

We found this: 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-December/023136.html 
which has a similar issue, and some suggestions on using the cephfs-table-tool 
take_inos command, as our problem looks like we can't create new inodes. 
However, we don't quite understand the show inode or take_inos command. On our 
cluster, we see this:

$ cephfs-table-tool 0 show inode
{
"0": {
"data": {
"version": 1,
"inotable": {
"projected_free": [
{
"start": 1099511627776,
"len": 1099511627776
}
],
"free": [
{
"start": 1099511627776,
"len": 1099511627776
}
]
}
},
"result": 0
}
}

Our test cluster shows the exact same output, and running `cephfs-table-tool 
all take_inos 10` (on the test cluster) doesn't seem to do anything to the 
output of the above, and also the inode number from creating new files doesn't 
seem to jump +100K from where it was (likely we misunderstood how take_inos 
works). On our test cluster (no recovery nor reset has been run on it), the 
latest max inode, from our file creation and running `ls -li` is 1099511627792, 
just a tiny bit bigger than the "start" value above which seems to match the 
file count we've created on it.

How do we find out what is our latest max inode on our production cluster, when 
`show inode` doesn't seem to show us anything useful?


Also, FYI, over a week ago, we had a network failu

[ceph-users] Help! Luminous 12.2.5 CephFS - MDS crashed and now won't start (failing at MDCache::add_inode)

2018-06-25 Thread Linh Vu
Hi all,


We have a Luminous 12.2.5 cluster, running entirely just CephFS with 1 active 
and 1 standby MDS. The active MDS crashed and now won't start again with this 
same error:

###

 0> 2018-06-25 16:11:21.136203 7f01c2749700 -1 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/mds/MDCache.cc:
 In function 'void MDCache::add_inode(CInode*)' thread 7f01c2749700 time 
2018-06-25 16:11:21.133236
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/mds/MDCache.cc:
 262: FAILED assert(!p)
###

Right before that point is just a bunch of client connection requests.

There are also a few other inode errors such as:

###
2018-06-25 09:30:37.889166 7f934c1e5700 -1 log_channel(cluster) log [ERR] : 
loaded dup inode 0x198f00a [2,head] v3426852030 at 
~mds0/stray5/198f00a, but inode 0x198f00a.head v3426838533 already 
exists at ~mds0/stray2/198f00a
###

We've done this for recovery:

$ make sure all MDS are shut down (all crashed by this point anyway)
$ ceph fs set myfs cluster_down true
$ cephfs-journal-tool journal export backup.bin
$ cephfs-journal-tool event recover_dentries summary
Events by type:
  FRAGMENT: 9
  OPEN: 29082
  SESSION: 15
  SUBTREEMAP: 241
  UPDATE: 171835
Errors: 0
$ cephfs-table-tool all reset session
{
"0": {
"data": {},
"result": 0
}
}
$ cephfs-table-tool all reset inode
{
"0": {
"data": {},
"result": 0
}
}
$ cephfs-journal-tool --rank=myfs:0 journal reset

old journal was 35714605847583~423728061

new journal start will be 35715031236608 (1660964 bytes past old end)
writing journal head
writing EResetJournal entry
done
$ ceph mds fail 0
$ ceph fs reset hpc_projects --yes-i-really-mean-it
$ start up MDS again

However, we keep getting the same error as above.

We found this: 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-December/023136.html 
which has a similar issue, and some suggestions on using the cephfs-table-tool 
take_inos command, as our problem looks like we can't create new inodes. 
However, we don't quite understand the show inode or take_inos command. On our 
cluster, we see this:

$ cephfs-table-tool 0 show inode
{
"0": {
"data": {
"version": 1,
"inotable": {
"projected_free": [
{
"start": 1099511627776,
"len": 1099511627776
}
],
"free": [
{
"start": 1099511627776,
"len": 1099511627776
}
]
}
},
"result": 0
}
}

Our test cluster shows the exact same output, and running `cephfs-table-tool 
all take_inos 10` (on the test cluster) doesn't seem to do anything to the 
output of the above, and also the inode number from creating new files doesn't 
seem to jump +100K from where it was (likely we misunderstood how take_inos 
works). On our test cluster (no recovery nor reset has been run on it), the 
latest max inode, from our file creation and running `ls -li` is 1099511627792, 
just a tiny bit bigger than the "start" value above which seems to match the 
file count we've created on it.

How do we find out what is our latest max inode on our production cluster, when 
`show inode` doesn't seem to show us anything useful?
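
One crude way to approximate it, essentially the `ls -li` check above but across 
the whole tree (assuming the filesystem can still be mounted somewhere and GNU 
find is available), would be:

$ find /mnt/cephfs -xdev -printf '%i\n' | sort -n | tail -1

That walks the whole tree so it is slow, and it won't see stray/unlinked inodes, 
so we'd still add a generous margin on top of whatever it reports.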


Also, FYI: over a week ago we had a network failure and had to perform recovery 
then. The recovery seemed OK, but some clients were still running jobs from 
before and appeared to have recovered, so we were still in the process of 
draining and rebooting them as they finished their jobs. Some came back with bad 
files, but nothing that caused trouble until now.

Very much appreciate any help!

Cheers,

Linh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Filestore -> Bluestore

2018-06-12 Thread Linh Vu
ceph-volume lvm zap --destroy $DEVICE
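
i.e. in the script below, the zap line would become something like this 
(untested, just showing where the flag goes):

  /usr/sbin/ceph-volume lvm zap --destroy ${DEV}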


From: ceph-users  on behalf of Vadim Bulst 

Sent: Tuesday, 12 June 2018 4:46:44 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Filestore -> Bluestore


Thanks Sergey.

Could you elaborate on your answer a bit? When I look in the manpage of 
ceph-volume I can't find an option named "--destroy".

I'd just like to make clear: this script has already migrated several servers. 
The problem appears when it tries to migrate devices in the expansion shelf.

"-->  RuntimeError: Cannot use device (/dev/dm-0). A vg/lv path or an existing 
device is needed"

Cheers,

Vadim


On 11.06.2018 23:58, Sergey Malinin wrote:
“Device or resource busy” error rises when no “--destroy” option is passed to 
ceph-volume.
On Jun 11, 2018, 22:44 +0300, Vadim Bulst 
, wrote:
Dear Cephers,

I'm trying to migrate our OSDs to Bluestore using this little script:

#!/bin/bash
HOSTNAME=$(hostname -s)
OSDS=`ceph osd metadata | jq -c '[.[] | select(.osd_objectstore |
contains("filestore")) ]' | jq '[.[] | select(.hostname |
contains("'${HOSTNAME}'")) ]' | jq '.[].id'`
IFS=' ' read -a OSDARRAY <<<$OSDS
for OSD in "${OSDARRAY[@]}"; do
  DEV=/dev/`ceph osd metadata | jq -c '.[] | select(.id=='${OSD}') |
.backend_filestore_dev_node' | sed 's/"//g'`
  echo "=== Migrating OSD nr ${OSD} on device ${DEV} ==="
  ceph osd out ${OSD}
while ! ceph osd safe-to-destroy ${OSD} ; do echo "waiting for full
evacuation"; sleep 60 ; done
  systemctl stop ceph-osd@${OSD}
  umount /var/lib/ceph/osd/ceph-${OSD}
  /usr/sbin/ceph-volume lvm zap ${DEV}
  ceph osd destroy ${OSD} --yes-i-really-mean-it
  /usr/sbin/ceph-volume lvm create --bluestore --data ${DEV}
--osd-id ${OSD}
done

Under normal circumstances this works flawlessly. Unfortunately, in our
case we have expansion shelves connected as multipath devices to our nodes.

/usr/sbin/ceph-volume lvm zap ${DEV}  is breaking with an error:

OSD(s) 1 are safe to destroy without reducing data durability.
--> Zapping: /dev/dm-0
Running command: /sbin/cryptsetup status /dev/mapper/
 stdout: /dev/mapper/ is inactive.
Running command: wipefs --all /dev/dm-0
 stderr: wipefs: error: /dev/dm-0: probing initialization failed:
Device or resource busy
-->  RuntimeError: command returned non-zero exit status: 1
destroyed osd.1
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph --cluster ceph --name
client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring
osd tree -f json
Running command: /usr/bin/ceph --cluster ceph --name
client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring
-i - osd new 74f6ff02-d027-4fc6-9b93-3a96d7535c8f 1
--> Was unable to complete a new OSD, will rollback changes
--> OSD will be destroyed, keeping the ID because it was provided with
--osd-id
Running command: ceph osd destroy osd.1 --yes-i-really-mean-it
 stderr: destroyed osd.1

-->  RuntimeError: Cannot use device (/dev/dm-0). A vg/lv path or an
existing device is needed


Does anybody know how to solve this problem?

Cheers,

Vadim

--
Vadim Bulst

Universität Leipzig / URZ
04109 Leipzig, Augustusplatz 10

phone: +49-341-97-33380
mail: vadim.bu...@uni-leipzig.de

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
Vadim Bulst

Universität Leipzig / URZ
04109  Leipzig, Augustusplatz 10

phone: ++49-341-97-33380
mail:vadim.bu...@uni-leipzig.de
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous 12.2.4: CephFS kernel client (4.15/4.16) shows up as jewel

2018-05-31 Thread Linh Vu
I see, thanks a lot Ilya :) Will test that out.


From: Ilya Dryomov 
Sent: Thursday, 31 May 2018 10:50:48 PM
To: Heðin Ejdesgaard Møller
Cc: Linh Vu; ceph-users
Subject: Re: [ceph-users] Luminous 12.2.4: CephFS kernel client (4.15/4.16) 
shows up as jewel

On Thu, May 31, 2018 at 2:39 PM, Heðin Ejdesgaard Møller  wrote:
> I have encountered the same issue and wrote to the mailing list about it, 
> with the subject: [ceph-users] krbd upmap support on kernel-4.16 ?
>
> The odd thing is that I can krbd map an image after setting min compat to 
> luminous, without specifying --yes-i-really-mean-it . It's only nessecary at 
> the point in time when you set the min_compat parameter, if you at that time 
> have krbd mapped an image.

Correct.  You are forcing the set-require-min-compat-client setting,
but as the feature bit that is causing this isn't actually required,
"rbd map" and everything else continues to work as before.

Thanks,

Ilya

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Luminous 12.2.4: CephFS kernel client (4.15/4.16) shows up as jewel

2018-05-30 Thread Linh Vu
Hi all,


On my test Luminous 12.2.4 cluster, with this set (initially so I could use 
upmap in the mgr balancer module):


# ceph osd set-require-min-compat-client luminous

# ceph osd dump | grep client
require_min_compat_client luminous
min_compat_client jewel


Not quite sure why min_compat_client is still jewel.


I have created cephfs on the cluster, and use a mix of fuse and kernel clients 
to test it. The fuse clients are on ceph-fuse 12.2.5 and show up as luminous 
clients.


The kernel client (just one mount) either on kernel 4.15.13 or 4.16.13 (the 
latest, just out) is showing up as jewel, seen in `ceph features`:


"client": {
"group": {
"features": "0x7010fb86aa42ada",
"release": "jewel",
"num": 1
},
"group": {
"features": "0x1ffddff8eea4fffb",
"release": "luminous",
"num": 8
}
}

I thought I read somewhere here that kernel 4.13+ should have full support for 
Luminous, so I don't know why this is showing up as jewel. I'm also surprised 
that it could mount and write to my cephfs share just fine despite that. It 
also doesn't seem to matter when I run ceph balancer with upmap mode despite 
this client being connected and writing files.


I can't see anything in mount.ceph options to specify jewel vs luminous either.


Is this just a mislabel, i.e. my kernel client actually has full Luminous 
support but shows up as Jewel? Or is the kernel client a bit behind still?


Currently we have a mix of ceph-fuse 12.2.5 and kernel client 4.15.13 in our 
production cluster, and I'm looking to set `ceph osd 
set-require-min-compat-client luminous` so I can use ceph balancer with upmap 
mode.


Cheers,

Linh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to authpin local pins"

2018-05-29 Thread Linh Vu
That could be it. Every time it happens for me, it is indeed from a non-auth 
MDS.


From: Yan, Zheng 
Sent: Wednesday, 30 May 2018 11:25:59 AM
To: Linh Vu
Cc: Oliver Freyermuth; Ceph Users; Peter Wienemann
Subject: Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to 
authpin local pins"

It could be http://tracker.ceph.com/issues/24172


On Wed, May 30, 2018 at 9:01 AM, Linh Vu  wrote:
> In my case, I have multiple active MDS (with directory pinning at the very
> top level), and there would be "Client xxx failing to respond to capability
> release" health warning every single time that happens.
>
> 
> From: ceph-users  on behalf of Yan, Zheng
> 
> Sent: Tuesday, 29 May 2018 9:53:43 PM
> To: Oliver Freyermuth
> Cc: Ceph Users; Peter Wienemann
> Subject: Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to
> authpin local pins"
>
> Single or multiple acitve mds? Were there "Client xxx failing to
> respond to capability release" health warning?
>
> On Mon, May 28, 2018 at 10:38 PM, Oliver Freyermuth
>  wrote:
>> Dear Cephalopodians,
>>
>> we just had a "lockup" of many MDS requests, and also trimming fell
>> behind, for over 2 days.
>> One of the clients (all ceph-fuse 12.2.5 on CentOS 7.5) was in status
>> "currently failed to authpin local pins". Metadata pool usage did grow by 10
>> GB in those 2 days.
>>
>> Rebooting the node to force a client eviction solved the issue, and now
>> metadata usage is down again, and all stuck requests were processed quickly.
>>
>> Is there any idea on what could cause something like that? On the client,
>> der was no CPU load, but many processes waiting for cephfs to respond.
>> Syslog did yield anything. It only affected one user and his user
>> directory.
>>
>> If there are no ideas: How can I collect good debug information in case
>> this happens again?
>>
>> Cheers,
>> Oliver
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>>
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to authpin local pins"

2018-05-29 Thread Linh Vu
In my case, I have multiple active MDS (with directory pinning at the very top 
level), and there would be "Client xxx failing to respond to capability 
release" health warning every single time that happens.


From: ceph-users  on behalf of Yan, Zheng 

Sent: Tuesday, 29 May 2018 9:53:43 PM
To: Oliver Freyermuth
Cc: Ceph Users; Peter Wienemann
Subject: Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to 
authpin local pins"

Single or multiple active MDS? Were there "Client xxx failing to
respond to capability release" health warning?

On Mon, May 28, 2018 at 10:38 PM, Oliver Freyermuth
 wrote:
> Dear Cephalopodians,
>
> we just had a "lockup" of many MDS requests, and also trimming fell behind, 
> for over 2 days.
> One of the clients (all ceph-fuse 12.2.5 on CentOS 7.5) was in status 
> "currently failed to authpin local pins". Metadata pool usage did grow by 10 
> GB in those 2 days.
>
> Rebooting the node to force a client eviction solved the issue, and now 
> metadata usage is down again, and all stuck requests were processed quickly.
>
> Is there any idea on what could cause something like that? On the client, der 
> was no CPU load, but many processes waiting for cephfs to respond.
> Syslog did yield anything. It only affected one user and his user directory.
>
> If there are no ideas: How can I collect good debug information in case this 
> happens again?
>
> Cheers,
> Oliver
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous cluster - how to find out which clients are still jewel?

2018-05-29 Thread Linh Vu
Ah I remember that one, I still have it on my watch list on tracker.ceph.com


Thanks 


Alternatively, is there a way to check on a client node what ceph features 
(jewel, luminous etc.) it has? In our case, it's all CephFS clients, and it's a 
mix between ceph-fuse (which is Luminous 12.2.5) and kernel client (4.15.x). I 
suspect the latter is only supporting jewel features but I'd like to confirm.
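
One thing we could try (assuming admin socket access on the mon hosts) is 
dumping the mon sessions, which list the negotiated features per connected 
client:

$ ceph daemon mon.$(hostname -s) sessions

and for the CephFS side, `ceph daemon mds.<name> session ls` on the active MDS 
should show similar per-client info.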


From: Massimo Sgaravatto 
Sent: Tuesday, 29 May 2018 4:51:56 PM
To: Linh Vu
Cc: ceph-users
Subject: Re: [ceph-users] Luminous cluster - how to find out which clients are 
still jewel?

As far as I know the status wrt this issue is still the one reported in this 
thread:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/020585.html<http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/020585.html>

See also:

http://tracker.ceph.com/issues/21315<http://tracker.ceph.com/issues/21315>

Cheers, Massimo

On Tue, May 29, 2018 at 8:39 AM, Linh Vu 
mailto:v...@unimelb.edu.au>> wrote:

Hi all,


I have a Luminous 12.2.4 cluster. This is what `ceph features` tells me:


...

"client": {
"group": {
"features": "0x7010fb86aa42ada",
"release": "jewel",
"num": 257
},
"group": {
"features": "0x1ffddff8eea4fffb",
"release": "luminous",
"num": 820
}
}
...

How do I find out which clients (IP/hostname/IDs) are actually on jewel feature 
set?

Regards,
Linh


___
ceph-users mailing list
ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com<http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Luminous cluster - how to find out which clients are still jewel?

2018-05-29 Thread Linh Vu
Hi all,


I have a Luminous 12.2.4 cluster. This is what `ceph features` tells me:


...

"client": {
"group": {
"features": "0x7010fb86aa42ada",
"release": "jewel",
"num": 257
},
"group": {
"features": "0x1ffddff8eea4fffb",
"release": "luminous",
"num": 820
}
}
...

How do I find out which clients (IP/hostname/IDs) are actually on jewel feature 
set?

Regards,
Linh

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to authpin local pins"

2018-05-28 Thread Linh Vu
I get the exact opposite situation with the same error message "currently failed 
to authpin local pins". We had a few clients on ceph-fuse 12.2.2 and they ran 
into those issues a lot (evicting works). Upgrading to ceph-fuse 12.2.5 fixed 
it. The main cluster is on 12.2.4.


The cause is a user's HPC jobs, or even just their logins on multiple nodes, 
accessing the same files in a particular way. It doesn't happen to other users. 
We haven't dug into it deeply enough, as upgrading to 12.2.5 fixed our problem.


From: ceph-users  on behalf of Oliver 
Freyermuth 
Sent: Tuesday, 29 May 2018 7:29:06 AM
To: Paul Emmerich
Cc: Ceph Users; Peter Wienemann
Subject: Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to 
authpin local pins"

Dear Paul,

Am 28.05.2018 um 20:16 schrieb Paul Emmerich:
> I encountered the exact same issue earlier today immediately after upgrading 
> a customer's cluster from 12.2.2 to 12.2.5.
> I've evicted the session and restarted the ganesha client to fix it, as I 
> also couldn't find any obvious problem.

interesting! In our case, the client with the problem (it happened again a few 
hours later...) always was a ceph-fuse client. Evicting / rebooting the client 
node helped.
However, it may well be that the original issue was caused by a Ganesha client, 
which we also use (and the user in question who complained was accessing files 
in parallel via NFS and ceph-fuse),
but I don't have a clear indication of that.

Cheers,
Oliver

>
> Paul
>
> 2018-05-28 16:38 GMT+02:00 Oliver Freyermuth  >:
>
> Dear Cephalopodians,
>
> we just had a "lockup" of many MDS requests, and also trimming fell 
> behind, for over 2 days.
> One of the clients (all ceph-fuse 12.2.5 on CentOS 7.5) was in status 
> "currently failed to authpin local pins". Metadata pool usage did grow by 10 
> GB in those 2 days.
>
> Rebooting the node to force a client eviction solved the issue, and now 
> metadata usage is down again, and all stuck requests were processed quickly.
>
> Is there any idea on what could cause something like that? On the client, 
> der was no CPU load, but many processes waiting for cephfs to respond.
> Syslog did yield anything. It only affected one user and his user 
> directory.
>
> If there are no ideas: How can I collect good debug information in case 
> this happens again?
>
> Cheers,
> Oliver
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
>
>
>
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io 
> Tel: +49 89 1896585 90

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Can't get ceph mgr balancer to work (Luminous 12.2.4)

2018-05-28 Thread Linh Vu
I think the mystery is solved :) Someone had set the tunables to hammer, down 
from luminous/optimal, on this test cluster before I did my test. That's why it 
didn't work: upmap requires the straw2 bucket type, I believe. The error 
messages were rather cryptic/unhelpful though!
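
For anyone hitting the same thing, the quick check and fix on our side would be 
something like this (note that changing tunables triggers data movement, so it's 
not something to run blindly on a production cluster):

# ceph osd crush show-tunables
# ceph osd crush tunables optimal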


From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Linh Vu 
<v...@unimelb.edu.au>
Sent: Monday, 28 May 2018 2:26:41 PM
To: ceph-users
Subject: Re: [ceph-users] Can't get ceph mgr balancer to work (Luminous 12.2.4)


I turned debug_mgr up to 4/5 and found this while executing the plan. Apparently 
the command hits an error, but the reply is "Success" even though nothing is 
done. Not sure what the 'foo' part is doing there.


2018-05-28 14:24:02.570822 7fc3f5ff7700  0 log_channel(audit) log [DBG] : 
from='client.282784 $IPADDRESS:0/2504876563' entity='client.admin' 
cmd=[{"prefix": "balancer execute", "plan": "mynewplan2", "target": ["mgr", 
""]}]: dispatch
2018-05-28 14:24:02.570858 7fc3f5ff7700  1 mgr.server handle_command 
pyc_prefix: 'balancer status'
2018-05-28 14:24:02.570862 7fc3f5ff7700  1 mgr.server handle_command 
pyc_prefix: 'balancer mode'
2018-05-28 14:24:02.570864 7fc3f5ff7700  1 mgr.server handle_command 
pyc_prefix: 'balancer on'
2018-05-28 14:24:02.570866 7fc3f5ff7700  1 mgr.server handle_command 
pyc_prefix: 'balancer off'
2018-05-28 14:24:02.570869 7fc3f5ff7700  1 mgr.server handle_command 
pyc_prefix: 'balancer eval'
2018-05-28 14:24:02.570872 7fc3f5ff7700  1 mgr.server handle_command 
pyc_prefix: 'balancer eval-verbose'
2018-05-28 14:24:02.570874 7fc3f5ff7700  1 mgr.server handle_command 
pyc_prefix: 'balancer optimize'
2018-05-28 14:24:02.570876 7fc3f5ff7700  1 mgr.server handle_command 
pyc_prefix: 'balancer show'
2018-05-28 14:24:02.570878 7fc3f5ff7700  1 mgr.server handle_command 
pyc_prefix: 'balancer rm'
2018-05-28 14:24:02.570880 7fc3f5ff7700  1 mgr.server handle_command 
pyc_prefix: 'balancer reset'
2018-05-28 14:24:02.570882 7fc3f5ff7700  1 mgr.server handle_command 
pyc_prefix: 'balancer dump'
2018-05-28 14:24:02.570885 7fc3f5ff7700  1 mgr.server handle_command 
pyc_prefix: 'balancer execute'
2018-05-28 14:24:02.570886 7fc3f5ff7700  4 mgr.server handle_command passing 
through 3
2018-05-28 14:24:02.571183 7fc3f67f8700  1 mgr[balancer] Handling command: 
'{'prefix': 'balancer execute', 'plan': 'mynewplan2', 'target': ['mgr', '']}'
2018-05-28 14:24:02.571252 7fc3f67f8700  4 mgr[balancer] Executing plan 
mynewplan2
2018-05-28 14:24:02.571855 7fc3f67f8700  4 mgr[balancer] ceph osd 
pg-upmap-items 10.3 mappings [{'to': 5L, 'from': 15L}, {'to': 45L, 'from': 
33L}, {'to': 58L, 'from': 62L}]
2018-05-28 14:24:02.572073 7fc3f67f8700  4 mgr[balancer] ceph osd 
pg-upmap-items 10.8 mappings [{'to': 45L, 'from': 41L}, {'to': 5L, 'from': 13L}]
2018-05-28 14:24:02.572217 7fc3f67f8700  4 mgr[balancer] ceph osd 
pg-upmap-items 10.b mappings [{'to': 18L, 'from': 17L}, {'to': 45L, 'from': 
41L}, {'to': 5L, 'from': 13L}]
2018-05-28 14:24:02.572367 7fc3f67f8700  4 mgr[balancer] ceph osd 
pg-upmap-items 10.10 mappings [{'to': 28L, 'from': 27L}]
2018-05-28 14:24:02.572491 7fc3f67f8700  4 mgr[balancer] ceph osd 
pg-upmap-items 10.11 mappings [{'to': 58L, 'from': 51L}]
2018-05-28 14:24:02.572602 7fc3f67f8700  4 mgr[balancer] ceph osd 
pg-upmap-items 10.2a mappings [{'to': 45L, 'from': 32L}, {'to': 21L, 'from': 
27L}]
2018-05-28 14:24:02.572712 7fc3f67f8700  4 mgr[balancer] ceph osd 
pg-upmap-items 10.32 mappings [{'to': 45L, 'from': 40L}, {'to': 5L, 'from': 
7L}, {'to': 58L, 'from': 51L}]
2018-05-28 14:24:02.572848 7fc3f67f8700  4 mgr[balancer] ceph osd 
pg-upmap-items 10.47 mappings [{'to': 45L, 'from': 43L}, {'to': 58L, 'from': 
51L}]
2018-05-28 14:24:02.572940 7fc3f67f8700  4 mgr[balancer] ceph osd 
pg-upmap-items 10.4c mappings [{'to': 5L, 'from': 4L}, {'to': 54L, 'from': 51L}]
2018-05-28 14:24:02.573028 7fc3f67f8700  4 mgr[balancer] ceph osd 
pg-upmap-items 10.61 mappings [{'to': 54L, 'from': 51L}]
2018-05-28 14:24:02.573341 7fc3f67f8700  0 mgr[balancer] Error on command
2018-05-28 14:24:02.573407 7fc3f67f8700  1 mgr.server reply handle_command (0) 
Success
2018-05-28 14:24:02.573560 7fc3f67f8700  1 mgr[restful] Unknown request 'foo'
2018-05-28 14:24:02.573617 7fc3f67f8700  1 mgr[restful] Unknown request 'foo'
2018-05-28 14:24:02.573660 7fc3f67f8700  1 mgr[restful] Unknown request 'foo'
2018-05-28 14:24:02.573699 7fc3f67f8700  1 mgr[restful] Unknown request 'foo'
2018-05-28 14:24:02.573873 7fc3f67f8700  1 mgr[restful] Unknown request 'foo'
2018-05-28 14:24:02.573915 7fc3f67f8700  1 mgr[restful] Unknown request 'foo'
2018-05-28 14:24:02.573961 7fc3f67f8700  1 mgr[restful] Unknown request 'foo'
2018-05-28 14:24:02.574008 7fc3f67f8700  1 mgr[restful] Unknown request 'foo'
2018-05-28 14:24:02.574047 7fc3f67f8700  1 mgr[restful] Unknown request 'foo'
2018-05-28 14:24:02.574142 7fc3f67f8700  1 mgr[restful]

Re: [ceph-users] Can't get ceph mgr balancer to work (Luminous 12.2.4)

2018-05-27 Thread Linh Vu
I turned debug_mgr up to 4/5 and found this while executing the plan. Apparently 
the command hits an error, but the reply is "Success" even though nothing is 
done. Not sure what the 'foo' part is doing there.


2018-05-28 14:24:02.570822 7fc3f5ff7700  0 log_channel(audit) log [DBG] : 
from='client.282784 $IPADDRESS:0/2504876563' entity='client.admin' 
cmd=[{"prefix": "balancer execute", "plan": "mynewplan2", "target": ["mgr", 
""]}]: dispatch
2018-05-28 14:24:02.570858 7fc3f5ff7700  1 mgr.server handle_command 
pyc_prefix: 'balancer status'
2018-05-28 14:24:02.570862 7fc3f5ff7700  1 mgr.server handle_command 
pyc_prefix: 'balancer mode'
2018-05-28 14:24:02.570864 7fc3f5ff7700  1 mgr.server handle_command 
pyc_prefix: 'balancer on'
2018-05-28 14:24:02.570866 7fc3f5ff7700  1 mgr.server handle_command 
pyc_prefix: 'balancer off'
2018-05-28 14:24:02.570869 7fc3f5ff7700  1 mgr.server handle_command 
pyc_prefix: 'balancer eval'
2018-05-28 14:24:02.570872 7fc3f5ff7700  1 mgr.server handle_command 
pyc_prefix: 'balancer eval-verbose'
2018-05-28 14:24:02.570874 7fc3f5ff7700  1 mgr.server handle_command 
pyc_prefix: 'balancer optimize'
2018-05-28 14:24:02.570876 7fc3f5ff7700  1 mgr.server handle_command 
pyc_prefix: 'balancer show'
2018-05-28 14:24:02.570878 7fc3f5ff7700  1 mgr.server handle_command 
pyc_prefix: 'balancer rm'
2018-05-28 14:24:02.570880 7fc3f5ff7700  1 mgr.server handle_command 
pyc_prefix: 'balancer reset'
2018-05-28 14:24:02.570882 7fc3f5ff7700  1 mgr.server handle_command 
pyc_prefix: 'balancer dump'
2018-05-28 14:24:02.570885 7fc3f5ff7700  1 mgr.server handle_command 
pyc_prefix: 'balancer execute'
2018-05-28 14:24:02.570886 7fc3f5ff7700  4 mgr.server handle_command passing 
through 3
2018-05-28 14:24:02.571183 7fc3f67f8700  1 mgr[balancer] Handling command: 
'{'prefix': 'balancer execute', 'plan': 'mynewplan2', 'target': ['mgr', '']}'
2018-05-28 14:24:02.571252 7fc3f67f8700  4 mgr[balancer] Executing plan 
mynewplan2
2018-05-28 14:24:02.571855 7fc3f67f8700  4 mgr[balancer] ceph osd 
pg-upmap-items 10.3 mappings [{'to': 5L, 'from': 15L}, {'to': 45L, 'from': 
33L}, {'to': 58L, 'from': 62L}]
2018-05-28 14:24:02.572073 7fc3f67f8700  4 mgr[balancer] ceph osd 
pg-upmap-items 10.8 mappings [{'to': 45L, 'from': 41L}, {'to': 5L, 'from': 13L}]
2018-05-28 14:24:02.572217 7fc3f67f8700  4 mgr[balancer] ceph osd 
pg-upmap-items 10.b mappings [{'to': 18L, 'from': 17L}, {'to': 45L, 'from': 
41L}, {'to': 5L, 'from': 13L}]
2018-05-28 14:24:02.572367 7fc3f67f8700  4 mgr[balancer] ceph osd 
pg-upmap-items 10.10 mappings [{'to': 28L, 'from': 27L}]
2018-05-28 14:24:02.572491 7fc3f67f8700  4 mgr[balancer] ceph osd 
pg-upmap-items 10.11 mappings [{'to': 58L, 'from': 51L}]
2018-05-28 14:24:02.572602 7fc3f67f8700  4 mgr[balancer] ceph osd 
pg-upmap-items 10.2a mappings [{'to': 45L, 'from': 32L}, {'to': 21L, 'from': 
27L}]
2018-05-28 14:24:02.572712 7fc3f67f8700  4 mgr[balancer] ceph osd 
pg-upmap-items 10.32 mappings [{'to': 45L, 'from': 40L}, {'to': 5L, 'from': 
7L}, {'to': 58L, 'from': 51L}]
2018-05-28 14:24:02.572848 7fc3f67f8700  4 mgr[balancer] ceph osd 
pg-upmap-items 10.47 mappings [{'to': 45L, 'from': 43L}, {'to': 58L, 'from': 
51L}]
2018-05-28 14:24:02.572940 7fc3f67f8700  4 mgr[balancer] ceph osd 
pg-upmap-items 10.4c mappings [{'to': 5L, 'from': 4L}, {'to': 54L, 'from': 51L}]
2018-05-28 14:24:02.573028 7fc3f67f8700  4 mgr[balancer] ceph osd 
pg-upmap-items 10.61 mappings [{'to': 54L, 'from': 51L}]
2018-05-28 14:24:02.573341 7fc3f67f8700  0 mgr[balancer] Error on command
2018-05-28 14:24:02.573407 7fc3f67f8700  1 mgr.server reply handle_command (0) 
Success
2018-05-28 14:24:02.573560 7fc3f67f8700  1 mgr[restful] Unknown request 'foo'
2018-05-28 14:24:02.573617 7fc3f67f8700  1 mgr[restful] Unknown request 'foo'
2018-05-28 14:24:02.573660 7fc3f67f8700  1 mgr[restful] Unknown request 'foo'
2018-05-28 14:24:02.573699 7fc3f67f8700  1 mgr[restful] Unknown request 'foo'
2018-05-28 14:24:02.573873 7fc3f67f8700  1 mgr[restful] Unknown request 'foo'
2018-05-28 14:24:02.573915 7fc3f67f8700  1 mgr[restful] Unknown request 'foo'
2018-05-28 14:24:02.573961 7fc3f67f8700  1 mgr[restful] Unknown request 'foo'
2018-05-28 14:24:02.574008 7fc3f67f8700  1 mgr[restful] Unknown request 'foo'
2018-05-28 14:24:02.574047 7fc3f67f8700  1 mgr[restful] Unknown request 'foo'
2018-05-28 14:24:02.574142 7fc3f67f8700  1 mgr[restful] Unknown request 'foo'
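
(For illustration only, not part of the original message: since the plan shown further down is just a list of plain CLI commands, one possible fallback is to run them by hand; upmap mappings also require every client to be at least Luminous-capable, which the first command below enforces.)

# ceph osd set-require-min-compat-client luminous
# ceph osd pg-upmap-items 10.3 15 5 33 45 62 58
(repeat for the remaining pg-upmap-items lines in the plan, then verify)
# ceph osd dump | grep upmap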



________
From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Linh Vu 
<v...@unimelb.edu.au>
Sent: Monday, 28 May 2018 12:47:41 PM
To: ceph-users
Subject: [ceph-users] Can't get ceph mgr balancer to work (Luminous 12.2.4)


Hi all,


I'm testing out ceph mgr balancer as per 
http://docs.ceph.com/docs/master/mgr/balancer/ on our test cluster on Luminous 
12.2.4, but can't seem to get it to work. Everything looks good in the prep, 
the new plan shows that it will take some

[ceph-users] Can't get ceph mgr balancer to work (Luminous 12.2.4)

2018-05-27 Thread Linh Vu
Hi all,


I'm testing out ceph mgr balancer as per 
http://docs.ceph.com/docs/master/mgr/balancer/ on our test cluster on Luminous 
12.2.4, but can't seem to get it to work. Everything looks good in the prep, 
the new plan shows that it will take some actions, but it doesn't execute at 
all. Am I missing something? Details below:

# ceph mgr module enable balancer

# ceph balancer eval
current cluster score 0.06 (lower is better)

# ceph balancer mode upmap

# ceph balancer optimize mynewplan2

# ceph balancer status
{
"active": true,
"plans": [
"mynewplan2"
],
"mode": "upmap"
}

# ceph balancer show mynewplan2
# starting osdmap epoch 10629
# starting crush version 71
# mode upmap
ceph osd pg-upmap-items 10.3 15 5 33 45 62 58
ceph osd pg-upmap-items 10.8 41 45 13 5
ceph osd pg-upmap-items 10.b 17 18 41 45 13 5
ceph osd pg-upmap-items 10.10 27 28
ceph osd pg-upmap-items 10.11 51 58
ceph osd pg-upmap-items 10.2a 32 45 27 21
ceph osd pg-upmap-items 10.32 40 45 7 5 51 58
ceph osd pg-upmap-items 10.47 43 45 51 58
ceph osd pg-upmap-items 10.4c 4 5 51 54
ceph osd pg-upmap-items 10.61 51 54

# ceph balancer eval mynewplan2
plan mynewplan2 final score 0.010474 (lower is better)

# ceph balancer execute mynewplan2
(nothing happens)

# ceph balancer on
(still nothing happens)

Regards,
Linh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Can Bluestore work with 2 replicas or still need 3 for data integrity?

2018-05-24 Thread Linh Vu
You can use erasure code for your SSDs in Luminous if you're worried about cost 
per TB.
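
(An illustrative sketch only, not from the reply; the profile name, pool names, PG counts and image size are assumptions. Note that RBD still needs a small replicated pool for image metadata, with the EC pool attached as a data pool.)

# ceph osd erasure-code-profile set ec-ssd k=4 m=2 crush-device-class=ssd
# ceph osd pool create rbd-ec-data 256 256 erasure ec-ssd
# ceph osd pool set rbd-ec-data allow_ec_overwrites true
# rbd create --size 100G --data-pool rbd-ec-data rbd/myimage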


From: ceph-users  on behalf of Pardhiv Karri 

Sent: Friday, 25 May 2018 11:16:07 AM
To: ceph-users
Subject: [ceph-users] Can Bluestore work with 2 replicas or still need 3 for 
data integrity?

Hi,

Can Ceph BlueStore in Luminous work with 2 replicas, given its crc32c checksums 
are stronger than the hashing in the filestore versions, or do we still need 3 
replicas for data integrity?

In our current Hammer/filestore environment we are using 3 replicas on HDDs, 
but we are planning to move to an all-SSD BlueStore/Luminous cluster. Due to 
the cost of SSDs, we want to know if 2 replicas are sufficient or if we still need 3.

Thanks,
Pardhiv Karri


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs luminous 12.2.4 - multi-active MDSes with manual pinning

2018-04-24 Thread Linh Vu
Thanks Patrick! Good to know that it's nothing and will be fixed soon :)


From: Patrick Donnelly <pdonn...@redhat.com>
Sent: Wednesday, 25 April 2018 5:17:57 AM
To: Linh Vu
Cc: ceph-users
Subject: Re: [ceph-users] cephfs luminous 12.2.4 - multi-active MDSes with 
manual pinning

Hello Linh,

On Tue, Apr 24, 2018 at 12:34 AM, Linh Vu <v...@unimelb.edu.au> wrote:
> However, on our production cluster, with more powerful MDSes (10 cores
> 3.4GHz, 256GB RAM, much faster networking), I get this in the logs
> constantly:
>
> 2018-04-24 16:29:21.998261 7f02d1af9700  0 mds.1.migrator nicely exporting
> to mds.0 [dir 0x110cd91.1110* /home/ [2,head] auth{0=1017} v=5632699
> cv=5632651/5632651 dir_auth=1 state=1611923458|complete|auxsubtree f(v84
> 55=0+55) n(v245771 rc2018-04-24 16:28:32.830971 b233439385711
> 423085=383063+40022) hs=55+0,ss=0+0 dirty=1 | child=1 frozen=0 subtree=1
> replicated=1 dirty=1 authpin=0 0x55691ccf1c00]
>
> To clarify, /home is pinned to mds.1, so there is no reason it should export
> this to mds.0, and the loads on both MDSes (req/s, network load, CPU load)
> are fairly low, lower than those on the test MDS VMs.

As Dan said, this is simply a spurious log message. Nothing is being
exported. This will be fixed in 12.2.6 as part of several fixes to the
load balancer:

https://github.com/ceph/ceph/pull/21412/commits/cace918dd044b979cd0d54b16a6296094c8a9f90

--
Patrick Donnelly

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs luminous 12.2.4 - multi-active MDSes with manual pinning

2018-04-24 Thread Linh Vu
Hi Dan,


Thanks! Ah so the "nicely exporting" thing is just a distraction, that's good 
to know.


I did bump mds log max segments and max expiring to 240 after reading the 
previous discussion. It seemed to help when there was just 1 active MDS. It 
doesn't really do much at the moment, although the load remains roughly the 
same as before.


I also saw messages about old clients failing to release caps, but wasn't sure 
what caused it due to the "nicely exporting" warnings. Evicting old clients 
only cleared the issue for about 10s, then other clients joined the 
warning list. Only restarting mds.0 so that the standby mds replaces it 
restored cluster health.


Cheers,

Linh


From: Dan van der Ster <d...@vanderster.com>
Sent: Tuesday, 24 April 2018 6:20:18 PM
To: Linh Vu
Cc: ceph-users
Subject: Re: [ceph-users] cephfs luminous 12.2.4 - multi-active MDSes with 
manual pinning

That "nicely exporting" thing is a logging issue that was apparently
fixed in https://github.com/ceph/ceph/pull/19220. I'm not sure if that
will be backported to luminous.

Otherwise the slow requests could be due to either slow trimming (see
previous discussions about mds log max expiring and mds log max
segments options) or old clients failing to release caps correctly
(you would see appropriate warnings about this).

-- Dan


On Tue, Apr 24, 2018 at 9:34 AM, Linh Vu <v...@unimelb.edu.au> wrote:
> Hi all,
>
>
> I have a cluster running cephfs on Luminous 12.2.4, using 2 active MDSes + 1
> standby. I have 3 shares: /projects, /home and /scratch, and I've decided to
> try manual pinning as described here:
> http://docs.ceph.com/docs/master/cephfs/multimds/
>
>
> /projects is pinned to mds.0 (rank 0)
>
> /home and /scratch are pinned to mds.1 (rank 1)
>
> Pinning is verified by `ceph daemon mds.$mds_hostname get subtrees | jq '.[]
> | [.dir.path, .auth_first, .export_pin]'`
>
>
> Clients mount either via ceph-fuse 12.2.4, or kernel client 4.15.13.
>
>
> On our test cluster (same version and setup), it works as I think it should.
> I simulate metadata load via mdtest (up to around 2000 req/s on each mds,
> which is VM with 4 cores, 16GB RAM), and loads on /projects go to mds.0,
> loads on the other shares go to mds.1. Nothing pops up in the logs. I can
> also successfully reset to no pinning (i.e using the default load balancing)
> via setting the ceph.dir.pin value to -1, and vice versa. All that happens
> is this show in the logs:
>
>   mds.mds1-test-ceph2 asok_command: get subtrees (starting...)
>
>   mds.mds1-test-ceph2 asok_command: get subtrees (complete)
>
> However, on our production cluster, with more powerful MDSes (10 cores
> 3.4GHz, 256GB RAM, much faster networking), I get this in the logs
> constantly:
>
> 2018-04-24 16:29:21.998261 7f02d1af9700  0 mds.1.migrator nicely exporting
> to mds.0 [dir 0x110cd91.1110* /home/ [2,head] auth{0=1017} v=5632699
> cv=5632651/5632651 dir_auth=1 state=1611923458|complete|auxsubtree f(v84
> 55=0+55) n(v245771 rc2018-04-24 16:28:32.830971 b233439385711
> 423085=383063+40022) hs=55+0,ss=0+0 dirty=1 | child=1 frozen=0 subtree=1
> replicated=1 dirty=1 authpin=0 0x55691ccf1c00]
>
> To clarify, /home is pinned to mds.1, so there is no reason it should export
> this to mds.0, and the loads on both MDSes (req/s, network load, CPU load)
> are fairly low, lower than those on the test MDS VMs.
>
> Sometimes (depending on which mds starts first), I would get the same
> message but the other way around i.e "mds.0.migrator nicely exporting to
> mds.1" the workload that mds.0 should be doing. This only appears on one
> mds, never the other, until one is restarted.
>
> And we've had a couple of occasions where we get this sort of slow requests:
>
> 7fd401126700  0 log_channel(cluster) log [WRN] : slow request 7681.127406
> seconds old, received at 2018-04-20 08:17:35.970498:
> client_request(client.875554:238655 lookup #0x10038ff1eab/punim0116
> 2018-04-20 08:17:35.970319 caller_uid=10171, caller_gid=1{1,10123,})
> currently failed to authpin local pins
>
> Which then seems to snowball into thousands of slow requests, until mds.0 is
> restarted. When these slow requests happen, loads are fairly low on the
> active MDSes, although it is possible that the users could be doing
> something funky with metadata on production that I can't reproduce with
> mdtest.
>
> I thought the manual pinning likely isn't working as intended due to the
> "mds.1.migrator nicely exporting to mds.0" messages in the logs (to me it
> seems to indicate that we have a bad load balancing situation) but I can't
> seem to replicate this issue in test. Test cluster seems to b

[ceph-users] cephfs luminous 12.2.4 - multi-active MDSes with manual pinning

2018-04-24 Thread Linh Vu
Hi all,


I have a cluster running cephfs on Luminous 12.2.4, using 2 active MDSes + 1 
standby. I have 3 shares: /projects, /home and /scratch, and I've decided to 
try manual pinning as described here: 
http://docs.ceph.com/docs/master/cephfs/multimds/


/projects is pinned to mds.0 (rank 0)

/home and /scratch are pinned to mds.1 (rank 1)

Pinning is verified by `ceph daemon mds.$mds_hostname get subtrees | jq '.[] | 
[.dir.path, .auth_first, .export_pin]'`
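
(For reference, an illustrative sketch that is not part of the original message; the /mnt/cephfs mount point is an assumption. Setting the value back to -1 reverts a directory to the default balancer.) The pins themselves are set via the ceph.dir.pin extended attribute:

setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/projects
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/home
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/scratch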


Clients mount either via ceph-fuse 12.2.4, or kernel client 4.15.13.


On our test cluster (same version and setup), it works as I think it should. I 
simulate metadata load via mdtest (up to around 2000 req/s on each mds, which 
is VM with 4 cores, 16GB RAM), and loads on /projects go to mds.0, loads on the 
other shares go to mds.1. Nothing pops up in the logs. I can also successfully 
reset to no pinning (i.e using the default load balancing) via setting the 
ceph.dir.pin value to -1, and vice versa. All that happens is this show in the 
logs:

  mds.mds1-test-ceph2 asok_command: get subtrees (starting...)

  mds.mds1-test-ceph2 asok_command: get subtrees (complete)

However, on our production cluster, with more powerful MDSes (10 cores 3.4GHz, 
256GB RAM, much faster networking), I get this in the logs constantly:

2018-04-24 16:29:21.998261 7f02d1af9700  0 mds.1.migrator nicely exporting to 
mds.0 [dir 0x110cd91.1110* /home/ [2,head] auth{0=1017} v=5632699 
cv=5632651/5632651 dir_auth=1 state=1611923458|complete|auxsubtree f(v84 
55=0+55) n(v245771 rc2018-04-24 16:28:32.830971 b233439385711 
423085=383063+40022) hs=55+0,ss=0+0 dirty=1 | child=1 frozen=0 subtree=1 
replicated=1 dirty=1 authpin=0 0x55691ccf1c00]

To clarify, /home is pinned to mds.1, so there is no reason it should export 
this to mds.0, and the loads on both MDSes (req/s, network load, CPU load) are 
fairly low, lower than those on the test MDS VMs.

Sometimes (depending on which mds starts first), I would get the same message 
but the other way around i.e "mds.0.migrator nicely exporting to mds.1" the 
workload that mds.0 should be doing. This only appears on one mds, never the 
other, until one is restarted.

And we've had a couple of occasions where we get this sort of slow requests:

7fd401126700  0 log_channel(cluster) log [WRN] : slow request 7681.127406 
seconds old, received at 2018-04-20 08:17:35.970498: 
client_request(client.875554:238655 lookup #0x10038ff1eab/punim0116 2018-04-20 
08:17:35.970319 caller_uid=10171, caller_gid=1{1,10123,}) currently 
failed to authpin local pins

Which then seems to snowball into thousands of slow requests, until mds.0 is 
restarted. When these slow requests happen, loads are fairly low on the active 
MDSes, although it is possible that the users could be doing something funky 
with metadata on production that I can't reproduce with mdtest.

I thought the manual pinning likely isn't working as intended due to the 
"mds.1.migrator nicely exporting to mds.0" messages in the logs (to me it seems 
to indicate that we have a bad load balancing situation) but I can't seem to 
replicate this issue in test. Test cluster seems to be working as intended.

Am I doing manual pinning right? Should I even be using it?

Cheers,
Linh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS very unstable with many small files

2018-02-25 Thread Linh Vu
Sounds like you just need more RAM on your MDS. Ours have 256GB each, and the 
OSD nodes have 128GB each. Networking is 2x25GbE.


We are on luminous 12.2.1, bluestore, and use CephFS for HPC, with about 
500-ish compute nodes. We have done stress testing with small files up to 2M 
per directory as part of our acceptance testing, and encountered no problem.


From: ceph-users  on behalf of Oliver 
Freyermuth 
Sent: Monday, 26 February 2018 3:45:59 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] CephFS very unstable with many small files

Dear Cephalopodians,

in preparation for production, we have run very successful tests with large 
sequential data,
and just now a stress-test creating many small files on CephFS.

We use a replicated metadata pool (4 SSDs, 4 replicas) and a data pool with 6 
hosts with 32 OSDs each, running in EC k=4 m=2.
Compression is activated (aggressive, snappy). All Bluestore, LVM, Luminous 
12.2.3.
There are (at the moment) only two MDS's, one is active, the other standby.

For the test, we had 1120 client processes on 40 client machines (all 
cephfs-fuse!) extract a tarball with 150k small files
( http://distfiles.gentoo.org/snapshots/portage-latest.tar.xz ) each into a 
separate subdirectory.

Things started out rather well (but expectedly slow), we had to increase
mds_log_max_segments => 240
mds_log_max_expiring => 160
due to https://github.com/ceph/ceph/pull/18624
and adjusted mds_cache_memory_limit to 4 GB.
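
(For reference only, not part of the original report: these map to ceph.conf [mds] settings roughly as below; mds cache memory limit is given in bytes.)

[mds]
mds log max segments = 240
mds log max expiring = 160
mds cache memory limit = 4294967296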

Even though the MDS machine has 32 GB, it is also running 2 OSDs (for metadata) 
and so we have been careful with the cache
(e.g. due to http://tracker.ceph.com/issues/22599 ).

After a while, we tested MDS failover and realized we entered a flip-flop 
situation between the two MDS nodes we have.
Increasing mds_beacon_grace to 240 helped with that.

Now, with about 100,000,000 objects written, we are in a disaster situation.
First off, the MDS could not restart anymore - it required >40 GB of memory, 
which (together with the 2 OSDs on the MDS host) exceeded RAM and swap.
So it tried to recover and OOMed quickly after. Replay was reasonably fast, but 
join took many minutes:
2018-02-25 04:16:02.299107 7fe20ce1f700  1 mds.0.17657 rejoin_start
2018-02-25 04:19:00.618514 7fe20ce1f700  1 mds.0.17657 rejoin_joint_start
and finally, 5 minutes later, OOM.

I stopped half of the stress-test tar's, which did not help - then I rebooted 
half of the clients, which did help and let the MDS recover just fine.
So it seems the client caps have been too many for the MDS to handle. I'm 
unsure why "tar" would cause so many open file handles.
Is there anything that can be configured to prevent this from happening?
Now, I only lost some "stress test data", but later, it might be user's data...


In parallel, I had reinstalled one OSD host.
It was backfilling well, but now, <24 hours later, before backfill has 
finished, several OSD hosts enter OOM condition.
Our OSD-hosts have 64 GB of RAM for 32 OSDs, which should be fine with the 
default bluestore cache size of 1 GB. However, it seems the processes are using 
much more,
up to several GBs until memory is exhausted. They then become sluggish, are 
kicked out of the cluster, come back, and finally at some point they are OOMed.
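
(Illustrative only, not from the report: the per-OSD BlueStore cache is controlled by bluestore_cache_size_hdd / bluestore_cache_size_ssd, in bytes, so a lower value could be pinned in ceph.conf along these lines.)

[osd]
bluestore cache size hdd = 536870912
bluestore cache size ssd = 1073741824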

Now, I have restarted some OSD processes and hosts which helped to reduce the 
memory usage - but now I have some OSDs crashing continously,
leading to PG unavailability, and preventing recovery from completion.
I have reported a ticket about that, with stacktrace and log:
http://tracker.ceph.com/issues/23120
This might well be a consequence of a previous OOM killer condition.

However, my final question after these ugly experiences is:
Did somebody ever stresstest CephFS for many small files?
Are those issues known? Can special configuration help?
Are the memory issues known? Are there solutions?

We don't plan to use Ceph for many small files, but we don't have full control 
of our users, which is why we wanted to test this "worst case" scenario.
It would be really bad if we lost a production filesystem due to such a 
situation, so the plan was to test now to know what happens before we enter 
production.
As of now, this looks really bad, and I'm not sure the cluster will ever 
recover.
I'll give it some more time, but we'll likely kill off all remaining clients 
next week and see what happens, and worst case recreate the Ceph cluster.

Cheers,
Oliver

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous: Help with Bluestore WAL

2018-02-20 Thread Linh Vu
Yeah that is the expected behaviour.


From: ceph-users  on behalf of Balakumar 
Munusawmy 
Sent: Wednesday, 21 February 2018 1:41:36 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Luminous: Help with Bluestore WAL


Hi,

We were recently testing Luminous with BlueStore. We have a 6-node cluster 
with 12 HDDs and 1 SSD each. We used ceph-volume with LVM to create all the OSDs 
and attached an SSD WAL (as LVM) to each. We created 12 individual 10 GB LVs on the 
single SSD, one per WAL, so every OSD's WAL is on the same SSD. The problem is that 
if we pull the SSD out, it brings down all 12 OSDs on that node. Is that expected 
behavior, or are we missing some configuration?
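
(Illustrative only, not from the original message; device and VG/LV names are assumptions.) Such a layout is typically built one OSD at a time, with every WAL LV carved from the same SSD volume group, which is why losing that one SSD takes down all of the OSDs that reference it:

# ceph-volume lvm create --bluestore --data /dev/sdb --block.wal ssd_vg/wal_sdb
# ceph-volume lvm create --bluestore --data /dev/sdc --block.wal ssd_vg/wal_sdc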





Thanks and Regards,
Balakumar Munusawmy
Mobile:+19255771645
Skype: 
bala.munusa...@latticeworkinc.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Automated Failover of CephFS Clients

2018-02-20 Thread Linh Vu
You're welcome :)


From: Paul Kunicki <pkuni...@sproutloud.com>
Sent: Wednesday, 21 February 2018 1:16:32 PM
To: Linh Vu
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Automated Failover of CephFS Clients

Thanks for the hint Linh. I had neglected to read up on mount.fuse.ceph here: 
http://docs.ceph.com/docs/master/man/8/mount.fuse.ceph/

I am trying this right now.

Thanks again.




Paul Kunicki
Systems Manager
SproutLoud Media Networks, LLC.
954-476-6211 ext. 144
pkuni...@sproutloud.com
www.sproutloud.com

The information contained in this communication is intended solely for the use 
of the individual or entity to whom it is addressed and for others authorized 
to receive it. It may contain confidential or legally privileged information. 
If you are not the intended recipient, you are hereby notified that any 
disclosure, copying, distribution, or taking any action in reliance on these 
contents is strictly prohibited and may be unlawful. In the event the recipient 
or recipients of this communication are under a non-disclosure agreement, any 
and all information discussed during phone calls and online presentations fall 
under the agreements signed by both parties. If you received this communication 
in error, please notify us immediately by responding to this e-mail.


On Tue, Feb 20, 2018 at 8:35 PM, Linh Vu <v...@unimelb.edu.au> wrote:

Why are you mounting with a single monitor? What is your mount command or 
/etc/fstab? Ceph-fuse should use the available mons you have on the client's 
/etc/ceph/ceph.conf.


e.g. our /etc/fstab entry:


none    /home   fuse.ceph   _netdev,ceph.id=myclusterid,ceph.client_mountpoint=/home,nonempty,defaults  0   0


From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Paul Kunicki <pkuni...@sproutloud.com>
Sent: Wednesday, 21 February 2018 10:23:37 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Automated Failover of CephFS Clients

We currently have multiple CephFS fuse clients mounting the same filesystem 
from a single monitor even though our cluster has several monitors. I would 
like to automate the failover from one monitor to another. Is this possible, 
and where should I be looking for guidance on accomplishing this in 
production? I would like to avoid involving NFS if possible, and Pacemaker seems 
to be overkill, but we can go that route if that is what is in fact needed.

We are currently at 12.2.2 on Centos 7.4.

Thanks.




Re: [ceph-users] Automated Failover of CephFS Clients

2018-02-20 Thread Linh Vu
Why are you mounting with a single monitor? What is your mount command or 
/etc/fstab? Ceph-fuse should use the available mons you have on the client's 
/etc/ceph/ceph.conf.


e.g. our /etc/fstab entry:


none    /home   fuse.ceph   _netdev,ceph.id=myclusterid,ceph.client_mountpoint=/home,nonempty,defaults  0   0
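
(Illustrative only, not from the original message; the addresses are made up.) The monitors that ceph-fuse fails over between come from the mon_host line in the client's /etc/ceph/ceph.conf, e.g.:

[global]
mon_host = 192.168.0.11,192.168.0.12,192.168.0.13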


From: ceph-users  on behalf of Paul Kunicki 

Sent: Wednesday, 21 February 2018 10:23:37 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Automated Failover of CephFS Clients

We currently have multiple CephFS fuse clients mounting the same filesystem 
from a single monitor even though our cluster has several monitors. I would 
like to automate the failover from one monitor to another. Is this possible, 
and where should I be looking for guidance on accomplishing this in 
production? I would like to avoid involving NFS if possible, and Pacemaker seems 
to be overkill, but we can go that route if that is what is in fact needed.

We are currently at 12.2.2 on Centos 7.4.

Thanks.



Paul Kunicki
Systems Manager
SproutLoud Media Networks, LLC.
954-476-6211 ext. 144
pkuni...@sproutloud.com
www.sproutloud.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help rebalancing OSD usage, Luminous 12.2.2

2018-01-30 Thread Linh Vu
Your PG count per OSD looks really low, that might be why. I think in Luminous, 
you should aim for about 200. I'd use the pgcalc on ceph.com to verify.
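
(Illustrative only, not from the thread; the pool name and target count are assumptions, and pg_num can only be increased, never decreased.) PG counts are raised per pool, e.g.:

# ceph osd pool set default.rgw.buckets.data pg_num 512
# ceph osd pool set default.rgw.buckets.data pgp_num 512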


From: ceph-users  on behalf of Bryan 
Banister 
Sent: Wednesday, 31 January 2018 8:28:06 AM
To: Ceph Users
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminous 12.2.2


Sorry I hadn’t RTFML archive before posting this… Looking at the following 
thread for guidance:  
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008626.html



Not the exact same situation (e.g. didn’t add larger OSD disks later on) but 
seems like the same recommendations from this nearly 2yr old thread would still 
apply?



Thanks again,

-Bryan



From: Bryan Banister
Sent: Tuesday, January 30, 2018 10:26 AM
To: Bryan Banister ; Ceph Users 

Subject: RE: Help rebalancing OSD usage, Luminous 12.2.2



Sorry, obviously should have been Luminous 12.2.2,

-B



From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Bryan 
Banister
Sent: Tuesday, January 30, 2018 10:24 AM
To: Ceph Users
Subject: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2



Note: External Email



Hi all,



We are still very new to running a Ceph cluster and have run a RGW cluster for 
a while now (6-ish mo), it mainly holds large DB backups (Write once, read 
once, delete after N days).  The system is now warning us about an OSD that is 
near_full and so we went to look at the usage across OSDs.  We are somewhat 
surprised at how imbalanced the usage is across the OSDs, with the lowest usage 
at 22% full, the highest at nearly 90%, and an almost linear usage pattern 
across the OSDs (though it looks to step in roughly 5% increments):



[root@carf-ceph-osd01 ~]# ceph osd df | sort -nk8

ID  CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS

77   hdd 7.27730  1.0 7451G 1718G 5733G 23.06 0.43  32

73   hdd 7.27730  1.0 7451G 1719G 5732G 23.08 0.43  31

  3   hdd 7.27730  1.0 7451G 2059G 5392G 27.63 0.52  27

46   hdd 7.27730  1.0 7451G 2060G 5391G 27.65 0.52  32

48   hdd 7.27730  1.0 7451G 2061G 5390G 27.66 0.52  25

127   hdd 7.27730  1.0 7451G 2066G 5385G 27.73 0.52  31

42   hdd 7.27730  1.0 7451G 2067G 5384G 27.74 0.52  42

107   hdd 7.27730  1.0 7451G 2402G 5049G 32.24 0.61  34

56   hdd 7.27730  1.0 7451G 2405G 5046G 32.28 0.61  37

51   hdd 7.27730  1.0 7451G 2406G 5045G 32.29 0.61  30

106   hdd 7.27730  1.0 7451G 2408G 5043G 32.31 0.61  29

81   hdd 7.27730  1.0 7451G 2408G 5043G 32.32 0.61  25

123   hdd 7.27730  1.0 7451G 2411G 5040G 32.37 0.61  35

47   hdd 7.27730  1.0 7451G 2412G 5039G 32.37 0.61  29

122   hdd 7.27730  1.0 7451G 2749G 4702G 36.90 0.69  30

84   hdd 7.27730  1.0 7451G 2750G 4701G 36.91 0.69  35

114   hdd 7.27730  1.0 7451G 2751G 4700G 36.92 0.69  26

82   hdd 7.27730  1.0 7451G 2751G 4700G 36.92 0.69  43

103   hdd 7.27730  1.0 7451G 2753G 4698G 36.94 0.69  39

36   hdd 7.27730  1.0 7451G 2752G 4699G 36.94 0.69  37

105   hdd 7.27730  1.0 7451G 2754G 4697G 36.97 0.69  26

14   hdd 7.27730  1.0 7451G 3091G 4360G 41.49 0.78  31

  2   hdd 7.27730  1.0 7451G 3091G 4360G 41.49 0.78  43

  8   hdd 7.27730  1.0 7451G 3091G 4360G 41.49 0.78  37

20   hdd 7.27730  1.0 7451G 3092G 4359G 41.50 0.78  28

60   hdd 7.27730  1.0 7451G 3092G 4359G 41.50 0.78  29

69   hdd 7.27730  1.0 7451G 3092G 4359G 41.50 0.78  37

110   hdd 7.27730  1.0 7451G 3093G 4358G 41.51 0.78  38

68   hdd 7.27730  1.0 7451G 3092G 4358G 41.51 0.78  34

76   hdd 7.27730  1.0 7451G 3093G 4358G 41.51 0.78  28

99   hdd 7.27730  1.0 7451G 3092G 4358G 41.51 0.78  34

50   hdd 7.27730  1.0 7451G 3095G 4356G 41.54 0.78  35

95   hdd 7.27730  1.0 7451G 3095G 4356G 41.54 0.78  31

  0   hdd 7.27730  1.0 7451G 3096G 4355G 41.55 0.78  36

125   hdd 7.27730  1.0 7451G 3096G 4355G 41.55 0.78  34

128   hdd 7.27730  1.0 7451G 3095G 4355G 41.55 0.78  37

94   hdd 7.27730  1.0 7451G 3096G 4355G 41.55 0.78  33

63   hdd 7.27730  1.0 7451G 3096G 4355G 41.56 0.78  41

30   hdd 7.27730  1.0 7451G 3100G 4351G 41.60 0.78  31

26   hdd 7.27730  1.0 7451G 3435G 4015G 46.11 0.87  30

64   hdd 7.27730  1.0 7451G 3435G 4016G 46.11 0.87  42

57   hdd 7.27730  1.0 7451G 3437G 4014G 46.12 0.87  29

33   hdd 7.27730  1.0 7451G 3437G 4014G 46.13 0.87  27

65   hdd 7.27730  1.0 7451G 3439G 4012G 46.15 0.87  29

109   hdd 7.27730  1.0 7451G 3439G 4012G 46.16 0.87  39

11   hdd 7.27730  1.0 7451G 3441G 4010G 46.18 0.87  32

121   hdd 7.27730  1.0 7451G 3441G 4010G 46.18 0.87  46

78   hdd 7.27730  

Re: [ceph-users] OSDs going down/up at random

2018-01-09 Thread Linh Vu
Have you checked your firewall?
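
(Illustrative check only, not from the reply: Ceph OSDs listen on TCP ports 6800-7300 by default, so on a firewalld host one would verify and open that range.)

# firewall-cmd --list-all
# firewall-cmd --permanent --add-port=6800-7300/tcp
# firewall-cmd --reload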


From: ceph-users  on behalf of Mike O'Connor 

Sent: Wednesday, 10 January 2018 3:40:30 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] OSDs going down/up at random

Hi All

I have a ceph host (12.2.2) which has 14 OSDs which seem to go down then
up. What should I look at to try to identify the issue?
The system has three LSI SAS9201-8i cards which are connected to 14
drives at this time (with the option of 24 drives).
I have three of these chassis but only one is running right now, so I
have Ceph set up for a single node.

I have very carefully looked at the log files and not found anything
which indicates any issues with the controller and the drives.

dmesg has these messages.
---
[78752.708932] libceph: osd3 10.1.6.2:6834 socket closed (con state OPEN)
[78752.710319] libceph: osd3 10.1.6.2:6834 socket closed (con state CONNECTING)
[78753.426244] libceph: osd3 down
[78753.426640] libceph: osd3 down
[78776.496962] libceph: osd5 10.1.6.2:6810 socket closed (con state OPEN)
[78776.498626] libceph: osd5 10.1.6.2:6810 socket closed (con state CONNECTING)
[78777.446384] libceph: osd5 down
[78777.446720] libceph: osd5 down
[78806.466973] libceph: osd3 up
[78806.467429] libceph: osd3 up
[78855.565098] libceph: osd10 10.1.6.2:6801 socket closed (con state OPEN)
[78855.567062] libceph: osd10 10.1.6.2:6801 socket closed (con state CONNECTING)
[78856.554209] libceph: osd10 down
[78856.554357] libceph: osd10 down
[78868.265665] libceph: osd1 10.1.6.2:6830 socket closed (con state OPEN)
[78868.266347] libceph: osd1 10.1.6.2:6830 socket closed (con state CONNECTING)
[78868.529575] libceph: osd1 down
[78869.469264] libceph: osd1 down
[78899.538533] libceph: osd10 up
[78899.538808] libceph: osd10 up
[78903.556418] libceph: osd5 up
[78905.309401] libceph: osd5 up
[78909.755499] libceph: osd1 up
[78912.008581] libceph: osd1 up
[78912.040872] libceph: osd4 10.1.6.2:6850 socket error on write
[78924.736964] libceph: osd8 10.1.6.2:6809 socket closed (con state OPEN)
[78924.738402] libceph: osd8 10.1.6.2:6809 socket closed (con state CONNECTING)
[78925.602597] libceph: osd8 down
[78925.602942] libceph: osd8 down
[78988.648108] libceph: osd8 up
[78988.648462] libceph: osd8 up
[79010.808917] libceph: osd4 10.1.6.2:6850 socket closed (con state OPEN)
[79010.810722] libceph: osd4 10.1.6.2:6850 socket closed (con state CONNECTING)
[79011.617598] libceph: osd4 down
[79011.617861] libceph: osd4 down
[79072.772966] libceph: osd14 10.1.6.2:6854 socket closed (con state OPEN)
[79072.773434] libceph: osd14 10.1.6.2:6854 socket closed (con state OPEN)
[79072.774219] libceph: osd14 10.1.6.2:6854 socket closed (con state CONNECTING)
[79073.657383] libceph: osd14 down
[79073.657552] libceph: osd14 down
[79082.565025] libceph: osd13 10.1.6.2:6846 socket closed (con state OPEN)
[79082.565814] libceph: osd13 10.1.6.2:6846 socket closed (con state OPEN)
[79082.566279] libceph: osd13 10.1.6.2:6846 socket closed (con state CONNECTING)
[79082.670861] libceph: osd13 down
[79082.671023] libceph: osd13 down
[79115.435180] libceph: osd14 up
[79115.435989] libceph: osd14 up
[79117.603991] libceph: osd13 up
[79118.557601] libceph: osd13 up
[79154.719547] libceph: osd4 up
[79154.720232] libceph: osd4 up
[79175.900935] libceph: osd12 10.1.6.2:6822 socket closed (con state OPEN)
[79175.902922] libceph: osd12 10.1.6.2:6822 socket closed (con state CONNECTING)
[79176.650847] libceph: osd12 down
[79176.651138] libceph: osd12 down
[79219.762665] libceph: osd12 up
[79219.763090] libceph: osd12 up
[79252.405666] libceph: osd11 10.1.6.2:6805 socket closed (con state OPEN)
[79252.406349] libceph: osd11 10.1.6.2:6805 socket closed (con state CONNECTING)
[79252.462748] libceph: osd11 down
[79252.462855] libceph: osd11 down
[79285.656850] libceph: osd11 up
[79285.657341] libceph: osd11 up
[80558.024975] libceph: osd13 10.1.6.2:6854 socket closed (con state OPEN)
[80558.025751] libceph: osd13 10.1.6.2:6854 socket closed (con state OPEN)
[80558.026341] libceph: osd13 10.1.6.2:6854 socket closed (con state CONNECTING)
[80558.652903] libceph: osd13 10.1.6.2:6854 socket error on write
[80558.734330] libceph: osd13 down
[80558.734501] libceph: osd13 down
[80590.753493] libceph: osd13 up
[80592.884936] libceph: osd13 up
[80592.897062] libceph: osd12 10.1.6.2:6822 socket closed (con state OPEN)
[90351.841800] libceph: osd1 down
[90371.299988] libceph: osd1 down
[90391.238370] libceph: osd1 up
[90391.778979] libceph: osd1 up

Thanks for any help/ideas
Mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph luminous - SSD partitions disssapeared

2018-01-03 Thread Linh Vu
Just want to point out as well that the first thing I did when noticing this 
bug was to add the `ceph` user to the group `disk` thus giving it write 
permission to the devices. However this did not actually work (haven't checked 
in 12.2.2 yet), and I suspect that something in the ceph code was checking 
whether the devices are owned by ceph:ceph. I did not have time to hunt that 
down though.


From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Linh Vu 
<v...@unimelb.edu.au>
Sent: Thursday, 4 January 2018 11:46:40 AM
To: Sergey Malinin; Steven Vacaroaia
Cc: ceph-users
Subject: Re: [ceph-users] ceph luminous - SSD partitions disssapeared


Seen this issue when I first created our Luminous cluster. I use a custom 
systemd service to chown the DB and WAL partitions before ceph osd services get 
started. The script in /usr/local/sbin just does the chowning.

ceph-nvme.service:


# This is a workaround to chown the rocksdb and wal partitions
# for ceph-osd on nvme, because ceph-disk currently does not
# chown them to ceph:ceph so OSDs can't come up at OS startup

[Unit]
Description=Chown rocksdb and wal partitions on NVMe workaround
ConditionFileIsExecutable=/usr/local/sbin/ceph-nvme.sh
After=local-fs.target

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/ceph-nvme.sh
TimeoutSec=0

[Install]
WantedBy=multi-user.target
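
The script itself is not included in the message; a minimal sketch of what /usr/local/sbin/ceph-nvme.sh might contain (the device name is an assumption):

#!/bin/bash
# give ceph-osd access to the NVMe DB/WAL partitions at boot
chown ceph:ceph /dev/nvme0n1p*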



From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Sergey 
Malinin <h...@newmail.com>
Sent: Thursday, 4 January 2018 10:56:13 AM
To: Steven Vacaroaia
Cc: ceph-users
Subject: Re: [ceph-users] ceph luminous - SSD partitions disssapeared

To make device ownership persist over reboots, you can set up udev rules.
The article you referenced seems to have nothing to do with bluestore. When you 
zapped /dev/sda, you zapped the bluestore metadata stored on the db partition, so the 
newly created partitions, if they were created apart from the block storage, are no 
longer relevant, and that's why the osd daemon throws an error.
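
(An illustrative sketch, not from the original message; the partition match and rules file name are assumptions.) Such a udev rule could look like:

# /etc/udev/rules.d/99-ceph-journal.rules
KERNEL=="sda[12]", OWNER="ceph", GROUP="ceph", MODE="0660"

followed by "udevadm control --reload && udevadm trigger" to apply it without a reboot.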


From: Steven Vacaroaia <ste...@gmail.com>
Sent: Wednesday, January 3, 2018 7:20:12 PM
To: Sergey Malinin
Cc: ceph-users
Subject: Re: [ceph-users] ceph luminous - SSD partitions disssapeared

They were not.
After I changed it manually I was still unable to start the service.
Furthermore, a reboot screwed up the permissions again.

 ls -al /dev/sda*
brw-rw 1 root disk 8, 0 Jan  3 11:10 /dev/sda
brw-rw 1 root disk 8, 1 Jan  3 11:10 /dev/sda1
brw-rw 1 root disk 8, 2 Jan  3 11:10 /dev/sda2
[root@osd01 ~]# chown ceph:ceph /dev/sda1
[root@osd01 ~]# chown ceph:ceph /dev/sda2
[root@osd01 ~]# ls -al /dev/sda*
brw-rw 1 root disk 8, 0 Jan  3 11:10 /dev/sda
brw-rw 1 ceph ceph 8, 1 Jan  3 11:10 /dev/sda1
brw-rw 1 ceph ceph 8, 2 Jan  3 11:10 /dev/sda2
[root@osd01 ~]# systemctl start ceph-osd@3
[root@osd01 ~]# systemctl status ceph-osd@3
● ceph-osd@3.service - Ceph object storage daemon osd.3
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled-runtime; 
vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Wed 2018-01-03 
11:18:09 EST; 5s ago
  Process: 3823 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i 
--setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
  Process: 3818 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster 
${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
 Main PID: 3823 (code=exited, status=1/FAILURE)

Jan 03 11:18:09 osd01.tor.medavail.net systemd[1]: Unit ceph-osd@3.service entered failed state.
Jan 03 11:18:09 osd01.tor.medavail.net systemd[1]: ceph-osd@3.service failed.



ceph-osd[3823]: 2018-01-03 11:18:08.515687 7fa55aec8d00 -1 
bluestore(/var/lib/ceph/osd/ceph-3/block.db) _read_bdev_label unable to decode 
label at offset 102: buffer::malformed_input: void bluesto
ceph-osd[3823]: 2018-01-03 11:18:08.515710 7fa55aec8d00 -1 
bluestore(/var/lib/ceph/osd/ceph-3) _open_db check block 
device(/var/lib/ceph/osd/ceph-3/block.db) label returned: (22) Invalid argument



This is very odd as the server was working fine

What is the proper procedure for replacing a failed SSD drive used by Blustore  
?



On 3 January 2018 at 10:23, Sergey Malinin <h...@newmail.com> wrote:
Are actual devices (not only udev links) owned by user “ceph”?


From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Steven Vacaroaia <ste...@gmail.com>
Sent: Wednesday, January 3, 2018 6:19:45 PM
To: ceph-users
Subject: [ceph-users] ceph luminous - SSD partitions disssapeared

Hi,

After a reboot, all the partitions created on the SSD drive dissapeared
They were used by bluestore DB and WAL so the OSD are down

The fol

Re: [ceph-users] ceph luminous - SSD partitions disssapeared

2018-01-03 Thread Linh Vu
Seen this issue when I first created our Luminous cluster. I use a custom 
systemd service to chown the DB and WAL partitions before ceph osd services get 
started. The script in /usr/local/sbin just does the chowning.

ceph-nvme.service:


# This is a workaround to chown the rocksdb and wal partitions
# for ceph-osd on nvme, because ceph-disk currently does not
# chown them to ceph:ceph so OSDs can't come up at OS startup

[Unit]
Description=Chown rocksdb and wal partitions on NVMe workaround
ConditionFileIsExecutable=/usr/local/sbin/ceph-nvme.sh
After=local-fs.target

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/ceph-nvme.sh
TimeoutSec=0

[Install]
WantedBy=multi-user.target



From: ceph-users  on behalf of Sergey 
Malinin 
Sent: Thursday, 4 January 2018 10:56:13 AM
To: Steven Vacaroaia
Cc: ceph-users
Subject: Re: [ceph-users] ceph luminous - SSD partitions disssapeared

To make device ownership persist over reboots, you can set up udev rules.
The article you referenced seems to have nothing to do with bluestore. When you 
zapped /dev/sda, you zapped the bluestore metadata stored on the db partition, so the 
newly created partitions, if they were created apart from the block storage, are no 
longer relevant, and that's why the osd daemon throws an error.


From: Steven Vacaroaia 
Sent: Wednesday, January 3, 2018 7:20:12 PM
To: Sergey Malinin
Cc: ceph-users
Subject: Re: [ceph-users] ceph luminous - SSD partitions disssapeared

They were not.
After I changed it manually I was still unable to start the service.
Furthermore, a reboot screwed up the permissions again.

 ls -al /dev/sda*
brw-rw 1 root disk 8, 0 Jan  3 11:10 /dev/sda
brw-rw 1 root disk 8, 1 Jan  3 11:10 /dev/sda1
brw-rw 1 root disk 8, 2 Jan  3 11:10 /dev/sda2
[root@osd01 ~]# chown ceph:ceph /dev/sda1
[root@osd01 ~]# chown ceph:ceph /dev/sda2
[root@osd01 ~]# ls -al /dev/sda*
brw-rw 1 root disk 8, 0 Jan  3 11:10 /dev/sda
brw-rw 1 ceph ceph 8, 1 Jan  3 11:10 /dev/sda1
brw-rw 1 ceph ceph 8, 2 Jan  3 11:10 /dev/sda2
[root@osd01 ~]# systemctl start ceph-osd@3
[root@osd01 ~]# systemctl status ceph-osd@3
● ceph-osd@3.service - Ceph object storage daemon osd.3
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled-runtime; 
vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Wed 2018-01-03 
11:18:09 EST; 5s ago
  Process: 3823 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i 
--setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
  Process: 3818 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster 
${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
 Main PID: 3823 (code=exited, status=1/FAILURE)

Jan 03 11:18:09 osd01.tor.medavail.net 
systemd[1]: Unit ceph-osd@3.service entered failed state.
Jan 03 11:18:09 osd01.tor.medavail.net 
systemd[1]: ceph-osd@3.service failed.



ceph-osd[3823]: 2018-01-03 11:18:08.515687 7fa55aec8d00 -1 
bluestore(/var/lib/ceph/osd/ceph-3/block.db) _read_bdev_label unable to decode 
label at offset 102: buffer::malformed_input: void bluesto
ceph-osd[3823]: 2018-01-03 11:18:08.515710 7fa55aec8d00 -1 
bluestore(/var/lib/ceph/osd/ceph-3) _open_db check block 
device(/var/lib/ceph/osd/ceph-3/block.db) label returned: (22) Invalid argument



This is very odd as the server was working fine

What is the proper procedure for replacing a failed SSD drive used by Blustore  
?



On 3 January 2018 at 10:23, Sergey Malinin wrote:
Are actual devices (not only udev links) owned by user “ceph”?


From: ceph-users on behalf of Steven Vacaroaia
Sent: Wednesday, January 3, 2018 6:19:45 PM
To: ceph-users
Subject: [ceph-users] ceph luminous - SSD partitions disssapeared

Hi,

After a reboot, all the partitions created on the SSD drive dissapeared
They were used by bluestore DB and WAL so the OSD are down

The following error message are in /var/log/messages


Jan  3 09:54:12 osd01 ceph-osd: 2018-01-03 09:54:12.992218 7f4b52b9ed00 -1 
bluestore(/var/lib/ceph/osd/ceph-6) _open_db /var/lib/ceph/osd/ceph-6/block.db 
link target doesn't exist
Jan  3 09:54:12 osd01 ceph-osd: 2018-01-03 09:54:12.993231 7f7ad37b1d00 -1 
bluestore(/var/lib/ceph/osd/ceph-5) _open_db /var/lib/ceph/osd/ceph-5/block.db 
link target doesn't exist

Then I decided to take this opportunity and "assume" a dead SSD, and thus recreate 
the partitions.

I zapped /dev/sda and then I used this
http://ceph.com/geen-categorie/ceph-recover-osds-after-ssd-journal-failure/
to recreate the partition for ceph-3.
Unfortunately it is now "complaining" about

Re: [ceph-users] Is the 12.2.1 really stable? Anybody have production cluster with Luminous Bluestore?

2017-11-16 Thread Linh Vu
We have a small prod cluster with 12.2.1 and bluestore, running just cephfs, 
for HPC use. It's been in prod for about 7 weeks now, and pretty stable. 


From: ceph-users  on behalf of Eric Nelson 

Sent: Friday, 17 November 2017 7:29:54 AM
To: Ashley Merrick
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Is the 12.2.1 really stable? Anybody have production 
cluster with Luminous Bluestore?

We upgraded to it a few weeks ago in order to get some of the new indexing 
features, but have also had a few nasty bugs in the process (including this 
one) as we have been upgrading osds from filestore to bluestore. Currently 
these are isolated to our SSD cache tier so I've been evicting everything in 
hopes of removing the tier and getting a functional cluster again :-/

On Thu, Nov 16, 2017 at 6:33 AM, Ashley Merrick wrote:
Currently experiencing a nasty bug 
http://tracker.ceph.com/issues/21142

I would say wait a while for the next point release.

,Ashley

-Original Message-
From: ceph-users 
[mailto:ceph-users-boun...@lists.ceph.com]
 On Behalf Of Jack
Sent: 16 November 2017 22:22
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Is the 12.2.1 really stable? Anybody have production 
cluster with Luminous Bluestore?

My cluster (55 OSDs) has run 12.2.x since the release, and BlueStore too. All good 
so far.

On 16/11/2017 15:14, Konstantin Shalygin wrote:
> Hi cephers.
> Some thoughts...
> At this time my cluster on Kraken 11.2.0 - works smooth with FileStore
> and RBD only.
> I want upgrade to Luminous 12.2.1 and go to Bluestore because this
> cluster want grows double with new disks, so is best opportunity
> migrate to Bluestore.
>
> In ML I was found two problems:
> 1. Increased memory usage, should be fixed in upstream
> (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021676.html).
>
> 2. OSD drops and goes cluster offline
> (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-November/022494.html).
> Don't know this Bluestore or FileStore OSD'.s.
>
> If the first case I can safely survive - hosts has enough memory to go
> to Bluestore and with the growing I can wait until the next stable release.
> That second case really scares me. As I understood clusters with this
> problem for now not in production.
>
> By this point I have completed all the preparations for the update and
> now I need to figure out whether I should update to 12.2.1 or wait for
> the next stable release, because my cluster is in production and I
> can't fail. Or I can upgrade and use FileStore until next release,
> this is acceptable for me.
>
> Thanks.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Luminous Directory

2017-11-15 Thread Linh Vu
Luminous supports this now http://docs.ceph.com/docs/master/cephfs/dirfrags/


and in my testing it has handled 2M files per directory with no problem.




From: ceph-users  on behalf of Hauke Homburg 

Sent: Thursday, 16 November 2017 5:06:59 PM
To: ceph-users
Subject: [ceph-users] Ceph Luminous Directory

Hello List,

In our factory the question of CephFS has come up again, and we noticed the new
release, Luminous.

We had problems with Jewel, CephFS and big directories:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-October/013628.html

Does anyone know of a limitation on big directories in Luminous CephFS?

Thanks for help.

Regards
Hauke

--
http://www.w3-creative.de

http://www.westchat.de

https://friendica.westchat.de/profile/hauke

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] luminous vs jewel rbd performance

2017-11-15 Thread Linh Vu
Noticed that you're on 12.2.0 Raf. 12.2.1 fixed a lot of performance issues 
from 12.2.0 for us on Luminous/Bluestore. Have you tried upgrading to it?


From: ceph-users  on behalf of Rafael Lopez 

Sent: Thursday, 16 November 2017 11:59:14 AM
To: Mark Nelson
Cc: ceph-users
Subject: Re: [ceph-users] luminous vs jewel rbd performance

Hi Mark,

Sorry for the late reply... I have been away on vacation/openstack summit etc 
for over a month now and looking at this again.

Yeah the snippet was a bit misleading. The fio file contains small block jobs 
as well as big block jobs:

[write-rbd1-4m-depth1]
rbdname=rbd-tester-fio
bs=4m
iodepth=1
rw=write
stonewall
[write-rbd2-4m-depth16]
rbdname=rbd-tester-fio-2
bs=4m
iodepth=16
rw=write
stonewall

[read-rbd1-4m-depth1]
rbdname=rbd-tester-fio
bs=4m
iodepth=1
rw=read
stonewall
[read-rbd2-4m-depth16]
rbdname=rbd-tester-fio-2
bs=4m
iodepth=16
rw=read
stonewall

The performance hit is more noticeable on big block I/O, I think up to 10x slower on 
some runs, but as a percentage it seems to affect a small block workload too. I 
understand that runs will vary... I wish I had more runs from before upgrading 
to luminous but I only have that single set of results. Regardless, I cannot 
come close to that single set of results since upgrading to luminous.
I understand the caching stuff you mentioned, however we have not changed any 
of that config since the upgrade and the fio job is exactly the same. So if I 
do many runs on luminous throughout the course of a day, including when we 
think the cluster is least busy, we should be able to come pretty close to the 
jewel result on at least one of the runs or is my thinking flawed?

Sage mentioned at OpenStack that there was a perf regression with librbd which 
will be fixed in 12.2.2. Are you aware of this? If so, can you send me the 
link to the bug?

Cheers,
Raf


On 22 September 2017 at 00:31, Mark Nelson wrote:
Hi Rafael,

In the original email you mentioned 4M block size, seq read, but here it looks 
like you are doing 4k writes?  Can you clarify?  If you are doing 4k direct 
sequential writes with iodepth=1 and are also using librbd cache, please make 
sure that librbd is set to writeback mode in both cases.  RBD by default will 
not kick into WB mode until it sees a flush request, and the librbd engine in 
fio doesn't issue one before a test is started.  It can be pretty easy to end 
up in a situation where writeback cache is active on some tests but not others 
if you aren't careful.  IE If one of your tests was done after a flush and the 
other was not, you'd likely see a dramatic difference in performance during 
this test.

You can avoid this by telling librbd to always use WB mode (at least when 
benchmarking):

rbd cache writethrough until flush = false
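
(For clarity, not from the original message: this is a client-side option, so it would go in the [client] section of ceph.conf on the host running fio, e.g.)

[client]
rbd cache = true
rbd cache writethrough until flush = false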

Mark


On 09/20/2017 01:51 AM, Rafael Lopez wrote:
Hi Alexandre,

Yeah we are using filestore for the moment with luminous. With regards
to client, I tried both jewel and luminous librbd versions against the
luminous cluster - similar results.

I am running fio on a physical machine with fio rbd engine. This is a
snippet of the fio config for the runs (the complete jobfile adds
variations of read/write/block size/iodepth).

[global]
ioengine=rbd
clientname=cinder-volume
pool=rbd-bronze
invalidate=1
ramp_time=5
runtime=30
time_based
direct=1

[write-rbd1-4k-depth1]
rbdname=rbd-tester-fio
bs=4k
iodepth=1
rw=write
stonewall

[write-rbd2-4k-depth16]
rbdname=rbd-tester-fio-2
bs=4k
iodepth=16
rw=write
stonewall

Raf

On 20 September 2017 at 16:43, Alexandre DERUMIER wrote:

Hi

so, you use also filestore on luminous ?

do you have also upgraded librbd on client ? (are you benching
inside a qemu machine ? or directly with fio-rbd ?)



(I'm going to do a lot of benchmarks in coming week, I'll post
results on mailing soon.)



- Original Message -
From: "Rafael Lopez"
To: "ceph-users"
Sent: Wednesday, 20 September 2017 08:17:23
Subject: [ceph-users] luminous vs jewel rbd performance

hey guys.
wondering if anyone else has done some solid benchmarking of jewel
vs luminous, in particular on the same cluster that has been
upgraded (same cluster, client and config).

we have recently upgraded a cluster from 10.2.9 to 12.2.0, and
unfortunately i only captured results from a single fio (librbd) run
with a few jobs in it before upgrading. i have run the same fio
jobfile many times at different times of the 

Re: [ceph-users] mount failed since failed to load ceph kernel module

2017-11-14 Thread Linh Vu
Odd, you only got 2 mons and 0 osds? Your cluster build looks incomplete.



From: Dai Xiang <xiang@sky-data.cn>
Sent: Tuesday, November 14, 2017 6:12:27 PM
To: Linh Vu
Cc: ceph-users@lists.ceph.com
Subject: Re: mount failed since failed to load ceph kernel module

On Tue, Nov 14, 2017 at 02:24:06AM +0000, Linh Vu wrote:
> Your kernel is way too old for CephFS Luminous. I'd use one of the newer 
> kernels from http://elrepo.org. :) We're on 4.12 here on RHEL 7.4.

I have updated the kernel to the newest version:
[root@d32f3a7b6eb8 ~]$ uname -a
Linux d32f3a7b6eb8 4.14.0-1.el7.elrepo.x86_64 #1 SMP Sun Nov 12 20:21:04 EST 
2017 x86_64 x86_64 x86_64 GNU/Linux
[root@d32f3a7b6eb8 ~]$ cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)

But still failed:
[root@d32f3a7b6eb8 ~]$ /bin/mount 172.17.0.4,172.17.0.5:/ /cephfs -t ceph -o 
name=admin,secretfile=/etc/ceph/admin.secret -v
failed to load ceph kernel module (1)
parsing options: rw,name=admin,secretfile=/etc/ceph/admin.secret
mount error 2 = No such file or directory
[root@d32f3a7b6eb8 ~]$ ll /cephfs
total 0

[root@d32f3a7b6eb8 ~]$ ceph -s
  cluster:
id: a5f1d744-35eb-4e1b-a7c7-cb9871ec559d
health: HEALTH_WARN
Reduced data availability: 128 pgs inactive
Degraded data redundancy: 128 pgs unclean

  services:
mon: 2 daemons, quorum d32f3a7b6eb8,1d22f2d81028
mgr: d32f3a7b6eb8(active), standbys: 1d22f2d81028
mds: cephfs-1/1/1 up  {0=1d22f2d81028=up:creating}, 1 up:standby
osd: 0 osds: 0 up, 0 in

  data:
pools:   2 pools, 128 pgs
objects: 0 objects, 0 bytes
usage:   0 kB used, 0 kB / 0 kB avail
pgs: 100.000% pgs unknown
 128 unknown

[root@d32f3a7b6eb8 ~]$ lsmod | grep ceph
ceph  372736  0
libceph   315392  1 ceph
fscache65536  3 ceph,nfsv4,nfs
libcrc32c  16384  5 
libceph,nf_conntrack,xfs,dm_persistent_data,nf_nat


--
Best Regards
Dai Xiang
>
>
> Hi!
>
> I got a confused issue in docker as below:
>
> After installing ceph successfully, I wanted to mount cephfs, but it failed:
>
> [root@dbffa72704e4 ~]$ /bin/mount 172.17.0.4:/ 
> /cephfs -t ceph -o name=admin,secretfile=/etc/ceph/admin.secret -v
> failed to load ceph kernel module (1)
> parsing options: rw,name=admin,secretfile=/etc/ceph/admin.secret
> mount error 5 = Input/output error
>
> But ceph related kernel modules have existed:
>
> [root@dbffa72704e4 ~]$ lsmod | grep ceph
> ceph  327687  0
> libceph   287066  1 ceph
> dns_resolver   13140  2 nfsv4,libceph
> libcrc32c  12644  3 xfs,libceph,dm_persistent_data
>
> Check the ceph state(i only set data disk for osd):
>
> [root@dbffa72704e4 ~]$ ceph -s
>   cluster:
> id: 20f51975-303e-446f-903f-04e1feaff7d0
> health: HEALTH_WARN
> Reduced data availability: 128 pgs inactive
> Degraded data redundancy: 128 pgs unclean
>
>   services:
> mon: 2 daemons, quorum dbffa72704e4,5807d12f920e
> mgr: dbffa72704e4(active), standbys: 5807d12f920e
> mds: cephfs-1/1/1 up  {0=5807d12f920e=up:creating}, 1 up:standby
> osd: 0 osds: 0 up, 0 in
>
>   data:
> pools:   2 pools, 128 pgs
> objects: 0 objects, 0 bytes
> usage:   0 kB used, 0 kB / 0 kB avail
> pgs: 100.000% pgs unknown
>  128 unknown
>
> [root@dbffa72704e4 ~]$ ceph version
> ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous 
> (stable)
>
> My container is based on centos:centos7.2.1511, kernel is 3e0728877e22 
> 3.10.0-514.el7.x86_64.
>
> I saw some ceph-related images on Docker Hub, so I think the above
> operation should be OK. Did I miss something important?
>
> --
> Best Regards
> Dai Xiang

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mount failed since failed to load ceph kernel module

2017-11-13 Thread Linh Vu
Your kernel is way too old for CephFS Luminous. I'd use one of the newer 
kernels from elrepo.org. :) We're on 4.12 here on RHEL 7.4.
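
(For anyone wanting to try that, a rough sketch for CentOS/RHEL 7 -- this
assumes the elrepo-release package is already installed and that the mainline
kernel-ml package is acceptable:)

yum --enablerepo=elrepo-kernel install kernel-ml
# make the newly installed kernel the default boot entry, then reboot into it
grub2-set-default 0
reboot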


From: ceph-users  on behalf of 
xiang@sky-data.cn 
Sent: Tuesday, 14 November 2017 1:13:47 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] mount failed since failed to load ceph kernel module

Hi!

I got a confused issue in docker as below:

After installing ceph successfully, I wanted to mount cephfs, but it failed:

[root@dbffa72704e4 ~]$ /bin/mount 172.17.0.4:/ /cephfs -t 
ceph -o name=admin,secretfile=/etc/ceph/admin.secret -v
failed to load ceph kernel module (1)
parsing options: rw,name=admin,secretfile=/etc/ceph/admin.secret
mount error 5 = Input/output error

But ceph related kernel modules have existed:

[root@dbffa72704e4 ~]$ lsmod | grep ceph
ceph  327687  0
libceph   287066  1 ceph
dns_resolver   13140  2 nfsv4,libceph
libcrc32c  12644  3 xfs,libceph,dm_persistent_data

Check the ceph state(i only set data disk for osd):

[root@dbffa72704e4 ~]$ ceph -s
  cluster:
id: 20f51975-303e-446f-903f-04e1feaff7d0
health: HEALTH_WARN
Reduced data availability: 128 pgs inactive
Degraded data redundancy: 128 pgs unclean

  services:
mon: 2 daemons, quorum dbffa72704e4,5807d12f920e
mgr: dbffa72704e4(active), standbys: 5807d12f920e
mds: cephfs-1/1/1 up  {0=5807d12f920e=up:creating}, 1 up:standby
osd: 0 osds: 0 up, 0 in

  data:
pools:   2 pools, 128 pgs
objects: 0 objects, 0 bytes
usage:   0 kB used, 0 kB / 0 kB avail
pgs: 100.000% pgs unknown
 128 unknown

[root@dbffa72704e4 ~]$ ceph version
ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)

My container is based on centos:centos7.2.1511, kernel is 3e0728877e22 
3.10.0-514.el7.x86_64.

I saw some ceph-related images on Docker Hub, so I think the above
operation should be OK. Did I miss something important?

--
Best Regards
Dai Xiang
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Reply: Re: Luminous LTS: `ceph osd crush class create` is gone?

2017-11-05 Thread Linh Vu
Thanks! It seems less intuitive than crush class create, but should work well 
enough.



From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of 
xie.xing...@zte.com.cn <xie.xing...@zte.com.cn>
Sent: Saturday, November 4, 2017 12:22:31 PM
To: caspars...@supernas.eu; bhubb...@redhat.com
Cc: ceph-users@lists.ceph.com
Subject: [ceph-users] Reply: Re: Luminous LTS: `ceph osd crush class create` 
is gone?

> With the caveat that the "ceph osd crush set-device-class" command only works 
> on existing OSD's which already have a default assigned class so you cannot 
> plan/create your classes before adding some OSD's first.
> The "ceph osd crush class create" command could be run without any OSD's 
> configured.


That is not true. "ceph osd crush set-device-class" will fail if the input OSD 
has already been assigned a class. Instead you should do "ceph osd crush 
rm-device-class" before proceeding.

Generally you don't have to pre-create an empty class and associate some 
specific OSDs with that class through  "ceph osd crush set-device-class" later, 
as the command will automatically create it if it does not exist.

Also, you can simply disable the "auto-class" feature by setting 
"osd_class_update_on_start" to false; newly created OSDs will then not be bound 
to any class (e.g., hdd/ssd).

Otherwise you have to do "ceph osd crush rm-device-class" before you can safely 
reset an OSD's class to any freeform class you wish.
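
(Putting that together, a minimal sketch of moving an existing OSD into a
custom class -- osd.0 and "myclass" are placeholders:)

# drop whatever class osd.0 was auto-assigned on start
ceph osd crush rm-device-class osd.0
# assign the custom class; it is created implicitly if it does not exist yet
ceph osd crush set-device-class myclass osd.0
ceph osd crush class ls
# optionally, in ceph.conf under [osd], stop OSDs from re-labelling themselves:
#   osd_class_update_on_start = false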






Original message
From: <caspars...@supernas.eu>;
To: <ceph-users@lists.ceph.com>;
Date: 2017-11-03 17:41
Subject: Re: [ceph-users] Luminous LTS: `ceph osd crush class create` is gone?


2017-11-03 7:59 GMT+01:00 Brad Hubbard <bhubb...@redhat.com>:
On Fri, Nov 3, 2017 at 4:04 PM, Linh Vu <v...@unimelb.edu.au> wrote:
> Hi all,
>
>
> Back in Luminous Dev and RC, I was able to do this:
>
>
> `ceph osd crush class create myclass`

This was removed as part of https://github.com/ceph/ceph/pull/16388

It looks like the set-device-class command is the replacement or equivalent.

$ ceph osd crush class ls
[
"ssd"
]

$ ceph osd crush set-device-class myclass 0 1

$ ceph osd crush class ls
[
"ssd",
"myclass"
]


With the caveat that the "ceph osd crush set-device-class" command only works 
on existing OSD's which already have a default assigned class so you cannot 
plan/create your classes before adding some OSD's first.
The "ceph osd crush class create" command could be run without any OSD's 
configured.

Kind regards,
Caspar

>
>
> so I could utilise the new CRUSH device classes feature as described here:
> http://ceph.com/community/new-luminous-crush-device-classes/
>
>
>
> and in use here:
> http://blog-fromsomedude.rhcloud.com/2017/05/16/Luminous-series-CRUSH-devices-class/
>
>
> Now I'm on Luminous LTS 12.2.1 and my custom device classes are still seen
> in:
>
>
> `ceph osd crush class ls`
>
>
> `ceph osd tree` and so on. The cluster is working fine and healthy.
>
>
> However, when I run `ceph osd crush class create myclass2` now, it tells me
> the command doesn't exist anymore.
>
>
> Are we not meant to create custom device classes anymore?
>
>
> Regards,
>
> Linh
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



--
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Luminous LTS: `ceph osd crush class create` is gone?

2017-11-03 Thread Linh Vu
Hi all,


Back in Luminous Dev and RC, I was able to do this:


`ceph osd crush class create myclass`


so I could utilise the new CRUSH device classes feature as described here: 
http://ceph.com/community/new-luminous-crush-device-classes/



and in use here: 
http://blog-fromsomedude.rhcloud.com/2017/05/16/Luminous-series-CRUSH-devices-class/


Now I'm on Luminous LTS 12.2.1 and my custom device classes are still seen in:


`ceph osd crush class ls`


`ceph osd tree` and so on. The cluster is working fine and healthy.


However, when I run `ceph osd crush class create myclass2` now, it tells me the 
command doesn't exist anymore.


Are we not meant to create custom device classes anymore?


Regards,

Linh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs: some metadata operations take seconds to complete

2017-10-16 Thread Linh Vu
We're using cephfs here as well for HPC scratch, but we're on Luminous 12.2.1. 
This issue seems to have been fixed between Jewel and Luminous; we don't have 
such problems. :) Any reason you guys aren't evaluating the latest LTS?
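
(One common explanation for a stall right after deleting a very large file is
the MDS still unlinking and purging the file's backing objects in the
background. A hedged way to check whether that is what the later commands are
waiting on -- the mds name is a placeholder, and the pool name is the one from
this setup:)

# on the MDS host: watch the stray/purge counters while the deletion settles
ceph daemon mds.<name> perf dump | grep -Ei 'strays|purge'
# Jewel exposes purge throttles such as mds_max_purge_files / mds_max_purge_ops
ceph daemon mds.<name> config show | grep -i purge
# the data pool's object count should keep dropping for a while after rm returns
watch -n 2 'rados df | grep cephfs_data'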


From: ceph-users  on behalf of Tyanko 
Aleksiev 
Sent: Tuesday, 17 October 2017 4:07:26 AM
To: ceph-users
Subject: [ceph-users] cephfs: some metadata operations take seconds to complete

Hi,

At UZH we are currently evaluating cephfs as a distributed file system
for the scratch space of an HPC installation. Some slow down of the
metadata operations seems to occur under certain circumstances. In
particular, commands issued after some big file deletion could take
several seconds.

Example:

dd bs=$((1024*1024*128)) count=2048 if=/dev/zero of=./dd-test
274877906944 bytes (275 GB, 256 GiB) copied, 224.798 s, 1.2 GB/s

dd bs=$((1024*1024*128)) count=2048 if=./dd-test of=./dd-test2
274877906944 bytes (275 GB, 256 GiB) copied, 1228.87 s, 224 MB/s

ls; time rm dd-test2 ; time ls
dd-test  dd-test2

real0m0.004s
user0m0.000s
sys 0m0.000s
dd-test

real0m8.795s
user0m0.000s
sys 0m0.000s

Additionally, the time it takes to complete the "ls" command appears to
be proportional to the size of the deleted file. The issue described
above is not limited to "ls" but extends to other commands:

ls ; time rm dd-test2 ; time du -hs ./*
dd-test  dd-test2

real0m0.003s
user0m0.000s
sys 0m0.000s
128G./dd-test

real0m9.974s
user0m0.000s
sys 0m0.000s

What might be causing this behavior, and how could we improve it?

Setup:

- ceph version: 10.2.9, OS: Ubuntu 16.04, kernel: 4.8.0-58-generic,
- 3 monitors,
- 1 mds,
- 3 storage nodes with 24 X 4TB disks on each node: 1 OSD/disk (72 OSDs
in total). 4TB disks are used for the cephfs_data pool. Journaling is on
SSDs,
- we installed a 400GB NVMe disk on each storage node and aggregated
the three disks in a crush rule. The cephfs_metadata pool was then created
using that rule and is therefore hosted on the NVMes. Journaling and
data are on the same partition here.

So far we are using the default ceph configuration settings.

Clients are mounting the file system with the kernel driver using the
following options (again default):
"rw,noatime,name=admin,secret=,acl,_netdev".

Thank you in advance for the help.

Cheers,
Tyanko



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Can't start ceph-mon through systemctl start ceph-mon@.service after upgrading from Hammer to Jewel

2017-06-22 Thread Linh Vu
Permissions of your mon data directory under /var/lib/ceph/mon/ might have 
changed as part of the Hammer -> Jewel upgrade. Have you had a look there?
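
(A rough sketch of what to check and the usual fix -- the mon id "ceph1" is
taken from your message; the recursive chown is the standard Jewel ownership
change and is only needed if things are still owned by root:)

ls -ld /var/lib/ceph/mon/ceph-ceph1
# if still root-owned, hand /var/lib/ceph over to the ceph user and retry
chown -R ceph:ceph /var/lib/ceph
systemctl start ceph-mon@ceph1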


From: ceph-users  on behalf of 许雪寒 

Sent: Thursday, 22 June 2017 3:32:45 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Can't start ceph-mon through systemctl start 
ceph-mon@.service after upgrading from Hammer to Jewel

Hi, everyone.

I upgraded one of our ceph clusters from Hammer to Jewel. After upgrading, I 
can't start ceph-mon through "systemctl start ceph-mon@ceph1", while, on the 
other hand, I can start ceph-mon, either as user ceph or root, if I directly 
call "/usr/bin/ceph-mon --cluster ceph --id ceph1 --setuser ceph --setgroup ceph". 
I looked at "/var/log/messages" and found that the reason systemctl can't start 
ceph-mon is that ceph-mon can't access its configured data directory. Why can't 
ceph-mon access its data directory when it's called by systemctl?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Packages for Luminous RC 12.1.0?

2017-06-19 Thread Linh Vu
No worries, thanks a lot, look forward to testing it :)


From: Abhishek Lekshmanan <alekshma...@suse.de>
Sent: Monday, 19 June 2017 10:03:15 PM
To: Linh Vu; ceph-users
Subject: Re: [ceph-users] Packages for Luminous RC 12.1.0?

Linh Vu <v...@unimelb.edu.au> writes:

> Hi all,
>
>
> I saw that Luminous RC 12.1.0 has been mentioned in the latest release notes 
> here: http://docs.ceph.com/docs/master/release-notes/

The PR mentioning the release generally goes in around the same time we
announce the release; this time, due to another issue, we had to hold off
on the release. Sorry about that. We should be announcing the release
this week and have the packages available soon.
--
Abhishek Lekshmanan
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Packages for Luminous RC 12.1.0?

2017-06-14 Thread Linh Vu
Hi all,


I saw that Luminous RC 12.1.0 has been mentioned in the latest release notes 
here: http://docs.ceph.com/docs/master/release-notes/


However, I can't see any 12.1.0 package yet on http://download.ceph.com


Does anyone have any idea when the packages will be available? Thanks 


Cheers,

Linh



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com