Re: [ceph-users] Cephfs write fail when node goes down

2018-05-14 Thread Yan, Zheng
On Mon, May 14, 2018 at 5:37 PM, Josef Zelenka
 wrote:
> Hi everyone, we've encountered an unusual thing in our setup(4 nodes, 48
> OSDs, 3 monitors - ceph Jewel, Ubuntu 16.04 with kernel 4.4.0). Yesterday,
> we were doing a HW upgrade of the nodes, so they went down one by one - the
> cluster was in good shape during the upgrade, as we've done this numerous
> times and we're quite sure that the redundancy wasn't screwed up while doing
> this. However, during this upgrade one of the clients that does backups to
> cephfs(mounted via the kernel driver) failed to write the backup file
> correctly to the cluster with the following trace after we turned off one of
> the nodes:
>
> [2585732.529412]  8800baa279a8 813fb2df 880236230e00
> 8802339c
> [2585732.529414]  8800baa28000 88023fc96e00 7fff
> 8800baa27b20
> [2585732.529415]  81840ed0 8800baa279c0 818406d5
> 
> [2585732.529417] Call Trace:
> [2585732.529505]  [] ? cpumask_next_and+0x2f/0x40
> [2585732.529558]  [] ? bit_wait+0x60/0x60
> [2585732.529560]  [] schedule+0x35/0x80
> [2585732.529562]  [] schedule_timeout+0x1b5/0x270
> [2585732.529607]  [] ? kvm_clock_get_cycles+0x1e/0x20
> [2585732.529609]  [] ? bit_wait+0x60/0x60
> [2585732.529611]  [] io_schedule_timeout+0xa4/0x110
> [2585732.529613]  [] bit_wait_io+0x1b/0x70
> [2585732.529614]  [] __wait_on_bit_lock+0x4e/0xb0
> [2585732.529652]  [] __lock_page+0xbb/0xe0
> [2585732.529674]  [] ? autoremove_wake_function+0x40/0x40
> [2585732.529676]  [] pagecache_get_page+0x17d/0x1c0
> [2585732.529730]  [] ? ceph_pool_perm_check+0x48/0x700
> [ceph]
> [2585732.529732]  [] grab_cache_page_write_begin+0x26/0x40
> [2585732.529738]  [] ceph_write_begin+0x48/0xe0 [ceph]
> [2585732.529739]  [] generic_perform_write+0xce/0x1c0
> [2585732.529763]  [] ? file_update_time+0xc9/0x110
> [2585732.529769]  [] ceph_write_iter+0xf89/0x1040 [ceph]
> [2585732.529792]  [] ? __alloc_pages_nodemask+0x159/0x2a0
> [2585732.529808]  [] new_sync_write+0x9b/0xe0
> [2585732.529811]  [] __vfs_write+0x26/0x40
> [2585732.529812]  [] vfs_write+0xa9/0x1a0
> [2585732.529814]  [] SyS_write+0x55/0xc0
> [2585732.529817]  [] entry_SYSCALL_64_fastpath+0x16/0x71
>
>

Are there any hung OSD requests in /sys/kernel/debug/ceph//osdc?
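
Something like this, run as root on the client that writes the backups, will show them (a rough sketch; the directory under /sys/kernel/debug/ceph is named after the cluster fsid and client id, and debugfs has to be mounted):

  $ mount -t debugfs none /sys/kernel/debug                                    # only if not already mounted
  $ for d in /sys/kernel/debug/ceph/*; do echo "== $d =="; cat "$d/osdc"; done

Each line of osdc output is an in-flight OSD request; no output means nothing is stuck.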

> I have encountered this behavior on Luminous, but not on Jewel. Anyone who
> has a clue why the write fails? As far as i'm concerned, it should always
> work if all the PGs are available. Thanks
> Josef
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs write fail when node goes down

2018-05-14 Thread Paul Emmerich
Which kernel version are you using? If it's an older kernel, consider using
the ceph-fuse (FUSE) client instead.
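
For reference, switching the backup client over is roughly this (a sketch, assuming the usual /etc/ceph config is in place and the mount point is /mnt/cephfs; adjust names to your setup):

  $ apt install ceph-fuse
  $ umount /mnt/cephfs                          # stop the kernel mount first
  $ ceph-fuse -m <mon-host>:6789 /mnt/cephfs    # or add an fstab entry of type fuse.ceph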

Paul


2018-05-14 11:37 GMT+02:00 Josef Zelenka :

> Hi everyone, we've encountered an unusual thing in our setup(4 nodes, 48
> OSDs, 3 monitors - ceph Jewel, Ubuntu 16.04 with kernel 4.4.0). Yesterday,
> we were doing a HW upgrade of the nodes, so they went down one by one - the
> cluster was in good shape during the upgrade, as we've done this numerous
> times and we're quite sure that the redundancy wasn't screwed up while
> doing this. However, during this upgrade one of the clients that does
> backups to cephfs(mounted via the kernel driver) failed to write the backup
> file correctly to the cluster with the following trace after we turned off
> one of the nodes:
>
> [2585732.529412]  8800baa279a8 813fb2df 880236230e00
> 8802339c
> [2585732.529414]  8800baa28000 88023fc96e00 7fff
> 8800baa27b20
> [2585732.529415]  81840ed0 8800baa279c0 818406d5
> 
> [2585732.529417] Call Trace:
> [2585732.529505]  [] ? cpumask_next_and+0x2f/0x40
> [2585732.529558]  [] ? bit_wait+0x60/0x60
> [2585732.529560]  [] schedule+0x35/0x80
> [2585732.529562]  [] schedule_timeout+0x1b5/0x270
> [2585732.529607]  [] ? kvm_clock_get_cycles+0x1e/0x20
> [2585732.529609]  [] ? bit_wait+0x60/0x60
> [2585732.529611]  [] io_schedule_timeout+0xa4/0x110
> [2585732.529613]  [] bit_wait_io+0x1b/0x70
> [2585732.529614]  [] __wait_on_bit_lock+0x4e/0xb0
> [2585732.529652]  [] __lock_page+0xbb/0xe0
> [2585732.529674]  [] ? autoremove_wake_function+0x40/
> 0x40
> [2585732.529676]  [] pagecache_get_page+0x17d/0x1c0
> [2585732.529730]  [] ? ceph_pool_perm_check+0x48/0x700
> [ceph]
> [2585732.529732]  [] grab_cache_page_write_begin+0x
> 26/0x40
> [2585732.529738]  [] ceph_write_begin+0x48/0xe0 [ceph]
> [2585732.529739]  [] generic_perform_write+0xce/0x1c0
> [2585732.529763]  [] ? file_update_time+0xc9/0x110
> [2585732.529769]  [] ceph_write_iter+0xf89/0x1040 [ceph]
> [2585732.529792]  [] ? __alloc_pages_nodemask+0x159/0
> x2a0
> [2585732.529808]  [] new_sync_write+0x9b/0xe0
> [2585732.529811]  [] __vfs_write+0x26/0x40
> [2585732.529812]  [] vfs_write+0xa9/0x1a0
> [2585732.529814]  [] SyS_write+0x55/0xc0
> [2585732.529817]  [] entry_SYSCALL_64_fastpath+0x16/0x71
>
>
> I have encountered this behavior on Luminous, but not on Jewel. Anyone who
> has a clue why the write fails? As far as i'm concerned, it should always
> work if all the PGs are available. Thanks
> Josef
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] nfs-ganesha 2.6 deb packages

2018-05-14 Thread Benjeman Meekhof
I see that luminous RPM packages are up at download.ceph.com for
ganesha-ceph 2.6 but there is nothing in the Deb area.  Any estimates
on when we might see those packages?

http://download.ceph.com/nfs-ganesha/deb-V2.6-stable/luminous/

thanks,
Ben
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] a big cluster or several small

2018-05-14 Thread Paul Emmerich
Hi,

don't run multiple clusters on the same server without containers; support
for custom cluster names is deprecated and will probably be removed:
https://github.com/ceph/ceph-deploy/pull/441

Also, I wouldn't split your cluster (yet?); ~300 OSDs is still quite small.
But it depends on the exact circumstances...

Paul


2018-05-14 18:49 GMT+02:00 Marc Boisis :

>
> Hi,
>
> Hello,
> Currently we have a 294 OSD (21 hosts/3 racks) cluster with RBD clients
> only, 1 single pool (size=3).
>
> We want to divide this cluster into several to minimize the risk in case
> of failure/crash.
> For example, a cluster for the mail, another for the file servers, a test
> cluster ...
> Do you think it's a good idea ?
>
> Do you have experience feedback on multiple clusters in production on the
> same hardware:
> - containers (LXD or Docker)
> - multiple cluster on the same host without virtualization (with
> ceph-deploy ... --cluster ...)
> - multiple pools
> ...
>
> Do you have any advice?
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] a big cluster or several small

2018-05-14 Thread João Paulo Sacchetto Ribeiro Bastos
Hello Marc,

In my view that's exactly the main reason why people use Ceph: it gets
more reliable the more nodes we put in the cluster. You should take a look
at the documentation and try to make use of placement rules, erasure codes or
whatever fits your needs. I'm still new to Ceph (been using it for about 1 year),
and I'd say your idea just *may be* good, but it may be a
little overkill too =D

Regards,

On Mon, May 14, 2018 at 2:26 PM Michael Kuriger  wrote:

> The more servers you have in your cluster, the less impact a failure
> causes to the cluster. Monitor your systems and keep them up to date.  You
> can also isolate data with clever crush rules and creating multiple zones.
>
>
>
> *Mike Kuriger*
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *Marc Boisis
> *Sent:* Monday, May 14, 2018 9:50 AM
> *To:* ceph-users
> *Subject:* [ceph-users] a big cluster or several small
>
>
>
>
> Hi,
>
>
>
> Hello,
>
> Currently we have a 294 OSD (21 hosts/3 racks) cluster with RBD clients
> only, 1 single pool (size=3).
>
>
>
> We want to divide this cluster into several to minimize the risk in case
> of failure/crash.
>
> For example, a cluster for the mail, another for the file servers, a test
> cluster ...
>
> Do you think it's a good idea ?
>
>
>
> Do you have experience feedback on multiple clusters in production on the
> same hardware:
>
> - containers (LXD or Docker)
>
> - multiple cluster on the same host without virtualization (with
> ceph-deploy ... --cluster ...)
>
> - multiple pools
>
> ...
>
>
>
> Do you have any advice?
>
>
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
-- 

João Paulo Bastos
DevOps Engineer at Mav Tecnologia
Belo Horizonte - Brazil
+55 31 99279-7092
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] a big cluster or several small

2018-05-14 Thread Michael Kuriger
The more servers you have in your cluster, the less impact a failure causes to
the cluster. Monitor your systems and keep them up to date. You can also
isolate data with clever CRUSH rules and by creating multiple zones.
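
For example, on Luminous, pinning a pool to its own set of hosts looks roughly like this (a sketch; the bucket, host and pool names here are made up, and on pre-Luminous releases the pool setting is crush_ruleset instead of crush_rule):

  $ ceph osd crush add-bucket mail-root root
  $ ceph osd crush move node01 root=mail-root        # repeat for every host that should serve the pool
  $ ceph osd crush rule create-replicated mail-rule mail-root host
  $ ceph osd pool set mail crush_rule mail-rule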

Mike Kuriger


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Marc 
Boisis
Sent: Monday, May 14, 2018 9:50 AM
To: ceph-users
Subject: [ceph-users] a big cluster or several small


Hi,

Hello,
Currently we have a 294 OSD (21 hosts/3 racks) cluster with RBD clients only, 1 
single pool (size=3).

We want to divide this cluster into several to minimize the risk in case of 
failure/crash.
For example, a cluster for the mail, another for the file servers, a test 
cluster ...
Do you think it's a good idea ?

Do you have experience feedback on multiple clusters in production on the same 
hardware:
- containers (LXD or Docker)
- multiple cluster on the same host without virtualization (with ceph-deploy 
... --cluster ...)
- multiple pools
...


Do you have any advice?





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] a big cluster or several small

2018-05-14 Thread Jack
Well

I currently manage 27 nodes, over 9 clusters
There are some burdens you should consider.

The easiest is: "what do we do when two small clusters, which grow
slowly, need more space?"
With one cluster: buy a node, add it, done
With two clusters: buy two nodes, add them, done

This can be an issue;


If you can move the data between clusters transparently and painlessly,
then it's OK: most of our data is used via Proxmox clusters, which
allow us to move from one Ceph cluster to another, so we can
"rebalance" the whole thing.

However, we also have some CephFS stuff, and this is not the same deal:
moving part of a CephFS between clusters is a pain (youhou, rsync & friends).


Considering all of this, splitting your cluster may be a sane idea, or
maybe not. However, I recommend against over-splitting: it is not
worth it.


On 05/14/2018 06:49 PM, Marc Boisis wrote:
> 
> Hi,
> 
> Hello,
> Currently we have a 294 OSD (21 hosts/3 racks) cluster with RBD clients only, 
> 1 single pool (size=3).
> 
> We want to divide this cluster into several to minimize the risk in case of 
> failure/crash.
> For example, a cluster for the mail, another for the file servers, a test 
> cluster ...
> Do you think it's a good idea ?
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] a big cluster or several small

2018-05-14 Thread Marc Boisis

Hi,

Hello,
Currently we have a 294 OSD (21 hosts/3 racks) cluster with RBD clients only, 1 
single pool (size=3).

We want to divide this cluster into several to minimize the risk in case of 
failure/crash.
For example, a cluster for the mail, another for the file servers, a test 
cluster ...
Do you think it's a good idea ?

Do you have experience feedback on multiple clusters in production on the same 
hardware:
- containers (LXD or Docker)
- multiple cluster on the same host without virtualization (with ceph-deploy 
... --cluster ...) 
- multiple pools
...

Do you have any advice?



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow requests are blocked

2018-05-14 Thread Grigory Murashov

Hello David!

2. I set it up 10/10

3. Thanks, my problem was that I did it on a host where there was no osd.15 daemon.

Could you please help to read osd logs?

Here is a part from ceph.log

2018-05-14 13:46:32.644323 mon.storage-ru1-osd1 mon.0 
185.164.149.2:6789/0 553895 : cluster [INF] Cluster is now healthy
2018-05-14 13:46:43.741921 mon.storage-ru1-osd1 mon.0 
185.164.149.2:6789/0 553896 : cluster [WRN] Health check failed: 21 slow 
requests are blocked > 32 sec (REQUEST_SLOW)
2018-05-14 13:46:49.746994 mon.storage-ru1-osd1 mon.0 
185.164.149.2:6789/0 553897 : cluster [WRN] Health check update: 23 slow 
requests are blocked > 32 sec (REQUEST_SLOW)
2018-05-14 13:46:55.752314 mon.storage-ru1-osd1 mon.0 
185.164.149.2:6789/0 553900 : cluster [WRN] Health check update: 3 slow 
requests are blocked > 32 sec (REQUEST_SLOW)
2018-05-14 13:47:01.030686 mon.storage-ru1-osd1 mon.0 
185.164.149.2:6789/0 553901 : cluster [WRN] Health check update: 4 slow 
requests are blocked > 32 sec (REQUEST_SLOW)
2018-05-14 13:47:07.764236 mon.storage-ru1-osd1 mon.0 
185.164.149.2:6789/0 553903 : cluster [WRN] Health check update: 32 slow 
requests are blocked > 32 sec (REQUEST_SLOW)
2018-05-14 13:47:13.770833 mon.storage-ru1-osd1 mon.0 
185.164.149.2:6789/0 553904 : cluster [WRN] Health check update: 21 slow 
requests are blocked > 32 sec (REQUEST_SLOW)
2018-05-14 13:47:17.774530 mon.storage-ru1-osd1 mon.0 
185.164.149.2:6789/0 553905 : cluster [INF] Health check cleared: 
REQUEST_SLOW (was: 12 slow requests are blocked > 32 sec)
2018-05-14 13:47:17.774582 mon.storage-ru1-osd1 mon.0 
185.164.149.2:6789/0 553906 : cluster [INF] Cluster is now healthy


At 13:47 I had a problem with osd.21.

1. Ceph Health (storage-ru1-osd1.voximplant.com:ceph.health): HEALTH_WARN
{u'REQUEST_SLOW': {u'severity': u'HEALTH_WARN', u'summary': {u'message': u'4 slow 
requests are blocked > 32 sec'}}}
HEALTH_WARN 4 slow requests are blocked > 32 sec
REQUEST_SLOW 4 slow requests are blocked > 32 sec
2 ops are blocked > 65.536 sec
2 ops are blocked > 32.768 sec
osd.21 has blocked requests > 65.536 sec

Here is a part from ceph-osd.21.log

2018-05-14 13:47:06.891399 7fb806dd6700 10 osd.21 pg_epoch: 236 pg[2.0( 
v 236'297 (0'0,236'297] local-lis/les=223/224 n=1 ec=119/119 lis/c 
223/223 les/c/f 224/224/0 223/223/212) [21,29,15]
r=0 lpr=223 crt=236'297 lcod 236'296 mlcod 236'296 active+clean] 
dropping ondisk_read_lock
2018-05-14 13:47:06.891435 7fb806dd6700 10 osd.21 236 dequeue_op 
0x56453b753f80 finish

2018-05-14 13:47:07.111388 7fb8185f9700 10 osd.21 236 tick
2018-05-14 13:47:07.111398 7fb8185f9700 10 osd.21 236 do_waiters -- start
2018-05-14 13:47:07.111401 7fb8185f9700 10 osd.21 236 do_waiters -- finish
2018-05-14 13:47:07.800421 7fb817df8700 10 osd.21 236 tick_without_osd_lock
2018-05-14 13:47:07.800444 7fb817df8700 10 osd.21 236 
promote_throttle_recalibrate 0 attempts, promoted 0 objects and 0 bytes; 
target 25 obj/sec or 5120 k bytes/sec
2018-05-14 13:47:07.800449 7fb817df8700 10 osd.21 236 
promote_throttle_recalibrate  actual 0, actual/prob ratio 1, adjusted 
new_prob 1000, prob 1000 -> 1000

2018-05-14 13:47:08.111470 7fb8185f9700 10 osd.21 236 tick
2018-05-14 13:47:08.111483 7fb8185f9700 10 osd.21 236 do_waiters -- start
2018-05-14 13:47:08.111485 7fb8185f9700 10 osd.21 236 do_waiters -- finish
2018-05-14 13:47:08.181070 7fb8055d3700 10 osd.21 236 dequeue_op 
0x564539651000 prio 63 cost 0 latency 0.000143 
osd_op(client.2597258.0:213844298 6.1d4 6.4079fd4 (undecoded) 
ondisk+read+kno
wn_if_redirected e236) v8 pg pg[6.1d4( v 236'20882 (236'19289,236'20882] 
local-lis/les=223/224 n=20791 ec=145/132 lis/c 223/223 les/c/f 224/224/0 
223/223/212) [21,29,17] r=0 lpr=223 crt=236

'20882 lcod 236'20881 mlcod 236'20881 active+clean]
2018-05-14 13:47:08.181112 7fb8055d3700 10 osd.21 pg_epoch: 236 
pg[6.1d4( v 236'20882 (236'19289,236'20882] local-lis/les=223/224 
n=20791 ec=145/132 lis/c 223/223 les/c/f 224/224/0 223/223/
212) [21,29,17] r=0 lpr=223 crt=236'20882 lcod 236'20881 mlcod 236'20881 
active+clean] _handle_message: 0x564539651000
2018-05-14 13:47:08.181141 7fb8055d3700 10 osd.21 pg_epoch: 236 
pg[6.1d4( v 236'20882 (236'19289,236'20882] local-lis/les=223/224 
n=20791 ec=145/132 lis/c 223/223 les/c/f 224/224/0 223/223/
212) [21,29,17] r=0 lpr=223 crt=236'20882 lcod 236'20881 mlcod 236'20881 
active+clean] do_op osd_op(client.2597258.0:213844298 6.1d4 
6:2bf9e020:::eb359f44-3316-4cd3-9006-d416c21e0745.120446

4.6_2018%2f05%2f14%2fYWRjNmZmNzQzODI2ZGQzOTc3ZjFiNGMxZjIxOGZlYzQvaHR0cDovL3d3dy1sdS0wMS0zNi52b3hpbXBsYW50LmNvbS9yZWNvcmRzLzIwMTgvMDUvMTQvOTRlNjYxY2JiZjU3MTk4NS4xNTI2MjkwMzQ0Ljk2NjQ5MS5tcDM-
:head [getxattrs,stat,read 0~4194304] snapc 0=[] 
ondisk+read+known_if_redirected e236) v8 may_read -> read-ordered flags 
ondisk+read+known_if_redirected
2018-05-14 13:47:08.181179 7fb8055d3700 10 osd.21 pg_epoch: 236 
pg[6.1d4( v 236'20882 (236'19289,236'20882] local-lis/les=223/224 
n=20791 ec=145/132 lis/c 223/223 les/c/f 224/224/0 223/223/
212) 

Re: [ceph-users] PG show inconsistent active+clean+inconsistent

2018-05-14 Thread David Turner
Just for clarification, the PG state is not the cause of the scrub errors.
Something happened in your cluster that caused inconsistencies between
copies of the data, the scrub noticed them, the scrub errors are why the PG
is flagged inconsistent, which does put the cluster in HEALTH_ERR.  Anyway,
just semantics from your original assessment of the situation.

Disabling scrubs is a bad idea here.  While you have a lot of scrub errors,
you only know of 1 PG that has those errors.  You may have multiple PGs
with the same problem.  Perhaps a single disk is having problems and every
PG on that disk has scrub errors.  There are a lot of other scenarios that
could be happening as well.  I would start by issuing `ceph osd scrub $osd`
to scrub all PGs on the currently known OSDs used by this PG.  If that
doesn't find anything, then try `ceph osd deep-scrub $osd`.  Those commands
are a shortcut to schedule a scrub/deep-scrub for every PG that is primary
on the given OSD.  If you don't find any more scrub errors, then you may
need to check the rest of the PGs in your cluster, definitely the ones
inside of the same pool #2 along with the currently inconsistent PG.
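
Roughly, for this PG (acting [28,17,37]) that would look like the following; this is just a sketch, and list-inconsistent-pg needs the real name of pool #2:

  $ for osd in 28 17 37; do ceph osd scrub $osd; done     # or: ceph osd deep-scrub $osd
  $ rados list-inconsistent-pg <pool-2-name>              # PGs in that pool with known scrub errors
  $ ceph health detail | grep -i inconsistent
  # later, once the scope is clear, the repair itself: ceph pg repair 2.2c0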

Now, while that's diagnosing and getting us more information... what
happened to your cluster?  Anything where OSDs were flapping up and down?
You added new storage?  Lost a drive?  Upgraded versions?  What is your
version?  What has happened in the past few weeks in your cluster?

Likely, the fix is going to start with issuing a repair of your PG.  I like
to diagnose the full scope of the problem before trying to repair things.
Also, if I can't figure out what's going on, I try to backup the PG copies
I'm repairing before doing so just in case something doesn't repair
properly.

On Sat, May 12, 2018 at 2:38 AM Faizal Latif  wrote:

> Hi Guys,
>
> I need some help. I can see that my Ceph storage is currently showing
> "*active+clean+inconsistent*", which results in a HEALTH_ERR state and causes
> scrub errors. Below is a sample of the output.
>
> HEALTH_ERR 1 pgs inconsistent; 11685 scrub errors; noscrub,nodeep-scrub
> flag(s) set
> pg 2.2c0 is active+clean+inconsistent, acting [28,17,37]
> 11685 scrub errors
> noscrub,nodeep-scrub flag(s) set
>
> I have disabled scrubbing since I can see there are scrub errors. I have
> also tried to use the rados command to see the object status, and below are the
> results.
>
> rados list-inconsistent-obj 2.2c0 --format=json-pretty
> {
> "epoch": 57580,
> "inconsistents": [
> {
> "object": {
> "name": "rbd_data.10815ea2ae8944a.0385",
> "nspace": "",
> "locator": "",
> "snap": 55,
> "version": 0
> },
> "errors": [],
> "union_shard_errors": [
> "missing",
> "*oi_attr_missing*"
> ],
> "shards": [
> {
> "osd": 10,
> "errors": [
> "*oi_attr_missing*"
> ],
> "size": 4194304,
> "omap_digest": "0x",
> "data_digest": "0x32133b39"
> },
> {
> "osd": 28,
> "errors": [
> "missing"
> ]
> },
> {
> "osd": 37,
> "errors": [
> "missing"
> ]
> }
> ]
> },
> {
> "object": {
> "name": "rbd_data.10815ea2ae8944a.0730",
> "nspace": "",
> "locator": "",
> "snap": 55,
> "version": 0
> },
> "errors": [],
> "union_shard_errors": [
> "missing",
> "*oi_attr_missing*"
> ],
> "shards": [
> {
> "osd": 10,
> "errors": [
> "*oi_attr_missing*"
> ],
> "size": 4194304,
> "omap_digest": "0x",
> "data_digest": "0x0f843f64"
> },
> {
> "osd": 28,
> "errors": [
> "missing"
> ]
> },
> {
> "osd": 37,
> "errors": [
> "missing"
> ]
> }
> ]
> },
>
> I can see most of the objects show *oi_attr_missing*. Is there any way
> that I can solve this? I believe this is the reason why scrubbing keeps
> failing for this PG.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel Xeon Scalable and CPU frequency scaling on NVMe/SSD Ceph OSDs

2018-05-14 Thread Nick Fisk
Hi Wido,

Are you trying this setting?

/sys/devices/system/cpu/intel_pstate/min_perf_pct
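
And if so, does it actually stick? A quick sanity check (just a sketch; these are the stock intel_pstate / cpufreq sysfs paths):

  $ cat /sys/devices/system/cpu/intel_pstate/min_perf_pct          # should read 100 after the echo
  $ cat /sys/devices/system/cpu/intel_pstate/no_turbo              # 0 means turbo is still allowed
  $ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq   # current per-core frequency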



-Original Message-
From: ceph-users  On Behalf Of Wido den
Hollander
Sent: 14 May 2018 14:14
To: n...@fisk.me.uk; 'Blair Bethwaite' 
Cc: 'ceph-users' 
Subject: Re: [ceph-users] Intel Xeon Scalable and CPU frequency scaling on
NVMe/SSD Ceph OSDs



On 05/01/2018 10:19 PM, Nick Fisk wrote:
> 4.16 required?
> https://www.phoronix.com/scan.php?page=news_item=Skylake-X-P-State-
> Linux-
> 4.16
> 

I've been trying with the 4.16 kernel for the last few days, but still, it's
not working.

The CPU's keep clocking down to 800Mhz

I've set scaling_min_freq=scaling_max_freq in /sys, but that doesn't change
a thing. The CPUs keep scaling down.

Still not close to the 1ms latency with these CPUs :(

Wido

> 
> -Original Message-
> From: ceph-users  On Behalf Of 
> Blair Bethwaite
> Sent: 01 May 2018 16:46
> To: Wido den Hollander 
> Cc: ceph-users ; Nick Fisk 
> 
> Subject: Re: [ceph-users] Intel Xeon Scalable and CPU frequency 
> scaling on NVMe/SSD Ceph OSDs
> 
> Also curious about this over here. We've got a rack's worth of R740XDs 
> with Xeon 4114's running RHEL 7.4 and intel-pstate isn't even active 
> on them, though I don't believe they are any different at the OS level 
> to our Broadwell nodes (where it is loaded).
> 
> Have you tried poking the kernel's pmqos interface for your use-case?
> 
> On 2 May 2018 at 01:07, Wido den Hollander  wrote:
>> Hi,
>>
>> I've been trying to get the lowest latency possible out of the new 
>> Xeon Scalable CPUs and so far I got down to 1.3ms with the help of Nick.
>>
>> However, I can't seem to pin the CPUs to always run at their maximum 
>> frequency.
>>
>> If I disable power saving in the BIOS they stay at 2.1Ghz (Silver 
>> 4110), but that disables the boost.
>>
>> With the Power Saving enabled in the BIOS and when giving the OS all 
>> control for some reason the CPUs keep scaling down.
>>
>> $ echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
>>
>> cpufrequtils 008: cpufreq-info (C) Dominik Brodowski 2004-2009 Report 
>> errors and bugs to cpuf...@vger.kernel.org, please.
>> analyzing CPU 0:
>>   driver: intel_pstate
>>   CPUs which run at the same hardware frequency: 0
>>   CPUs which need to have their frequency coordinated by software: 0
>>   maximum transition latency: 0.97 ms.
>>   hardware limits: 800 MHz - 3.00 GHz
>>   available cpufreq governors: performance, powersave
>>   current policy: frequency should be within 800 MHz and 3.00 GHz.
>>   The governor "performance" may decide which speed to
use
>>   within this range.
>>   current CPU frequency is 800 MHz.
>>
>> I do see the CPUs scale up to 2.1Ghz, but they quickly scale down 
>> again to 800Mhz and that hurts latency. (50% difference!)
>>
>> With the CPUs scaling down to 800Mhz my latency jumps from 1.3ms to 
>> 2.4ms on avg. With turbo enabled I hope to get down to 1.1~1.2ms on avg.
>>
>> $ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
>> performance
>>
>> Everything seems to be OK and I would expect the CPUs to stay at 
>> 2.10Ghz, but they aren't.
>>
>> C-States are also pinned to 0 as a boot parameter for the kernel:
>>
>> processor.max_cstate=1 intel_idle.max_cstate=0
>>
>> Running Ubuntu 16.04.4 with the 4.13 kernel from the HWE from Ubuntu.
>>
>> Has anybody tried this yet with the recent Intel Xeon Scalable CPUs?
>>
>> Thanks,
>>
>> Wido
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> --
> Cheers,
> ~Blairo
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel Xeon Scalable and CPU frequency scaling on NVMe/SSD Ceph OSDs

2018-05-14 Thread John Hearns
Wido, I am going to put my rather large foot in it here.
I am sure it is understood that the Turbo mode will not keep all cores at
the maximum frequency at any given time.
There is a thermal envelope for the chip, and the chip works to keep  the
power dissipation within that envelope.
From what I gather there is a range of thermal limits even within a given
processor SKU, so every chip will exhibit
different Turbo mode behaviour.
And I am sure we all know that when AVX comes into use the Turbo limit is
lower.

I guess what I am saying is that to have reproducible behaviour, if you
care about it for timings etc., Turbo
can be switched off.
Before you say it: in this case you want to achieve the minimum latency, and
reproducibility at the MHz level is not important.

Also worth saying that cooling is important when Turbo Boost comes into
play. I heard a paper at an HPC Advisory Council meeting
where a Russian setup by Lenovo got significantly more performance at the
HPC acceptance testing stage when the cooling was turned up.

I guess my rambling has not added much to this debate, sorry.
cue a friendly Intel engineer to wander in and tell us exactly what is
going on.



On 14 May 2018 at 15:13, Wido den Hollander  wrote:

>
>
> On 05/01/2018 10:19 PM, Nick Fisk wrote:
> > 4.16 required?
> > https://www.phoronix.com/scan.php?page=news_item=Skylake-
> X-P-State-Linux-
> > 4.16
> >
>
> I've been trying with the 4.16 kernel for the last few days, but still,
> it's not working.
>
> The CPU's keep clocking down to 800Mhz
>
> I've set scaling_min_freq=scaling_max_freq in /sys, but that doesn't
> change a thing. The CPUs keep scaling down.
>
> Still not close to the 1ms latency with these CPUs :(
>
> Wido
>
> >
> > -Original Message-
> > From: ceph-users  On Behalf Of Blair
> > Bethwaite
> > Sent: 01 May 2018 16:46
> > To: Wido den Hollander 
> > Cc: ceph-users ; Nick Fisk 
> > Subject: Re: [ceph-users] Intel Xeon Scalable and CPU frequency scaling
> on
> > NVMe/SSD Ceph OSDs
> >
> > Also curious about this over here. We've got a rack's worth of R740XDs
> with
> > Xeon 4114's running RHEL 7.4 and intel-pstate isn't even active on them,
> > though I don't believe they are any different at the OS level to our
> > Broadwell nodes (where it is loaded).
> >
> > Have you tried poking the kernel's pmqos interface for your use-case?
> >
> > On 2 May 2018 at 01:07, Wido den Hollander  wrote:
> >> Hi,
> >>
> >> I've been trying to get the lowest latency possible out of the new
> >> Xeon Scalable CPUs and so far I got down to 1.3ms with the help of Nick.
> >>
> >> However, I can't seem to pin the CPUs to always run at their maximum
> >> frequency.
> >>
> >> If I disable power saving in the BIOS they stay at 2.1Ghz (Silver
> >> 4110), but that disables the boost.
> >>
> >> With the Power Saving enabled in the BIOS and when giving the OS all
> >> control for some reason the CPUs keep scaling down.
> >>
> >> $ echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
> >>
> >> cpufrequtils 008: cpufreq-info (C) Dominik Brodowski 2004-2009 Report
> >> errors and bugs to cpuf...@vger.kernel.org, please.
> >> analyzing CPU 0:
> >>   driver: intel_pstate
> >>   CPUs which run at the same hardware frequency: 0
> >>   CPUs which need to have their frequency coordinated by software: 0
> >>   maximum transition latency: 0.97 ms.
> >>   hardware limits: 800 MHz - 3.00 GHz
> >>   available cpufreq governors: performance, powersave
> >>   current policy: frequency should be within 800 MHz and 3.00 GHz.
> >>   The governor "performance" may decide which speed to
> use
> >>   within this range.
> >>   current CPU frequency is 800 MHz.
> >>
> >> I do see the CPUs scale up to 2.1Ghz, but they quickly scale down
> >> again to 800Mhz and that hurts latency. (50% difference!)
> >>
> >> With the CPUs scaling down to 800Mhz my latency jumps from 1.3ms to
> >> 2.4ms on avg. With turbo enabled I hope to get down to 1.1~1.2ms on avg.
> >>
> >> $ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
> >> performance
> >>
> >> Everything seems to be OK and I would expect the CPUs to stay at
> >> 2.10Ghz, but they aren't.
> >>
> >> C-States are also pinned to 0 as a boot parameter for the kernel:
> >>
> >> processor.max_cstate=1 intel_idle.max_cstate=0
> >>
> >> Running Ubuntu 16.04.4 with the 4.13 kernel from the HWE from Ubuntu.
> >>
> >> Has anybody tried this yet with the recent Intel Xeon Scalable CPUs?
> >>
> >> Thanks,
> >>
> >> Wido
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> >
> > --
> > Cheers,
> > ~Blairo
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > 

Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-14 Thread Webert de Souza Lima
On Sat, May 12, 2018 at 3:11 AM Alexandre DERUMIER 
wrote:

> The documentation (luminous) say:
>


> >mds cache size
> >
> >Description:The number of inodes to cache. A value of 0 indicates an
> unlimited number. It is recommended to use mds_cache_memory_limit to limit
> the amount of memory the MDS cache uses.
> >Type:   32-bit Integer
> >Default:0
> >

and, my mds_cache_memory_limit is currently at 5GB.


yeah I have only suggested that because the high memory usage seemed to
trouble you and it might be a bug, so it's more of a workaround.
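
For reference, the memory limit can also be changed at runtime, roughly like this (a sketch; <name> is your MDS name, the value is in bytes, and the limit only steers the cache, so the actual MDS RSS can still sit somewhat above it):

  $ ceph tell mds.<name> injectargs '--mds_cache_memory_limit=5368709120'   # 5 GiB
  # or persistently in ceph.conf, [mds] section:
  #   mds_cache_memory_limit = 5368709120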

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel Xeon Scalable and CPU frequency scaling on NVMe/SSD Ceph OSDs

2018-05-14 Thread Wido den Hollander


On 05/01/2018 10:19 PM, Nick Fisk wrote:
> 4.16 required?
> https://www.phoronix.com/scan.php?page=news_item=Skylake-X-P-State-Linux-
> 4.16
> 

I've been trying with the 4.16 kernel for the last few days, but still,
it's not working.

The CPU's keep clocking down to 800Mhz

I've set scaling_min_freq=scaling_max_freq in /sys, but that doesn't
change a thing. The CPUs keep scaling down.

Still not close to the 1ms latency with these CPUs :(

Wido

> 
> -Original Message-
> From: ceph-users  On Behalf Of Blair
> Bethwaite
> Sent: 01 May 2018 16:46
> To: Wido den Hollander 
> Cc: ceph-users ; Nick Fisk 
> Subject: Re: [ceph-users] Intel Xeon Scalable and CPU frequency scaling on
> NVMe/SSD Ceph OSDs
> 
> Also curious about this over here. We've got a rack's worth of R740XDs with
> Xeon 4114's running RHEL 7.4 and intel-pstate isn't even active on them,
> though I don't believe they are any different at the OS level to our
> Broadwell nodes (where it is loaded).
> 
> Have you tried poking the kernel's pmqos interface for your use-case?
> 
> On 2 May 2018 at 01:07, Wido den Hollander  wrote:
>> Hi,
>>
>> I've been trying to get the lowest latency possible out of the new 
>> Xeon Scalable CPUs and so far I got down to 1.3ms with the help of Nick.
>>
>> However, I can't seem to pin the CPUs to always run at their maximum 
>> frequency.
>>
>> If I disable power saving in the BIOS they stay at 2.1Ghz (Silver 
>> 4110), but that disables the boost.
>>
>> With the Power Saving enabled in the BIOS and when giving the OS all 
>> control for some reason the CPUs keep scaling down.
>>
>> $ echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
>>
>> cpufrequtils 008: cpufreq-info (C) Dominik Brodowski 2004-2009 Report 
>> errors and bugs to cpuf...@vger.kernel.org, please.
>> analyzing CPU 0:
>>   driver: intel_pstate
>>   CPUs which run at the same hardware frequency: 0
>>   CPUs which need to have their frequency coordinated by software: 0
>>   maximum transition latency: 0.97 ms.
>>   hardware limits: 800 MHz - 3.00 GHz
>>   available cpufreq governors: performance, powersave
>>   current policy: frequency should be within 800 MHz and 3.00 GHz.
>>   The governor "performance" may decide which speed to use
>>   within this range.
>>   current CPU frequency is 800 MHz.
>>
>> I do see the CPUs scale up to 2.1Ghz, but they quickly scale down 
>> again to 800Mhz and that hurts latency. (50% difference!)
>>
>> With the CPUs scaling down to 800Mhz my latency jumps from 1.3ms to 
>> 2.4ms on avg. With turbo enabled I hope to get down to 1.1~1.2ms on avg.
>>
>> $ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
>> performance
>>
>> Everything seems to be OK and I would expect the CPUs to stay at 
>> 2.10Ghz, but they aren't.
>>
>> C-States are also pinned to 0 as a boot parameter for the kernel:
>>
>> processor.max_cstate=1 intel_idle.max_cstate=0
>>
>> Running Ubuntu 16.04.4 with the 4.13 kernel from the HWE from Ubuntu.
>>
>> Has anybody tried this yet with the recent Intel Xeon Scalable CPUs?
>>
>> Thanks,
>>
>> Wido
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> --
> Cheers,
> ~Blairo
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD Cache and rbd-nbd

2018-05-14 Thread Jason Dillaman
On Mon, May 14, 2018 at 12:15 AM, Marc Schöchlin  wrote:
> Hello Jason,
>
> many thanks for your informative response!
>
> Am 11.05.2018 um 17:02 schrieb Jason Dillaman:
>> I cannot speak for Xen, but in general IO to a block device will hit
>> the pagecache unless the IO operation is flagged as direct (e.g.
>> O_DIRECT) to bypass the pagecache and directly send it to the block
>> device.
> Sure, but it seems that xenserver just forwards io from virtual machines
> (vm: blkfront, dom-0: blkback) to the ndb device in dom-0.
>>> Sorry, my question was a bit imprecise: I was searching for usage statistics
>>> of the rbd cache.
>>> Is there also a possibility to gather rbd_cache usage statistics as a source
>>> of verification for optimizing the cache settings?
>> You can run "perf dump" instead of "config show" to dump out the
>> current performance counters. There are some stats from the in-memory
>> cache included in there.
> Great, i was not aware of that.
> There are really a lot of statistics which might be useful for analyzing
> whats going on or if the optimizations improve the performance of our
> systems.
>>> Can you provide some hints how to about adequate cache settings for a write
>>> intensive environment (70% write, 30% read)?
>>> Is it a good idea to specify a huge rbd cache of 1 GB with a max dirty age
>>> of 10 seconds?
>> Depends on your workload and your testing results. I suspect a
>> database on top of RBD is going to do its own read caching and will be
>> issuing lots of flush calls to the block device, potentially negating
>> the need for a large cache.
>
> Sure, reducing flushes with the acceptance of a degraded level of
> reliability seems to be one import key for improved performance.
>
>>>
>>> Our typical workload is originated over 70 percent in database write
>>> operations in the virtual machines.
>>> Therefore collecting write operations with rbd cache and writing them in
>>> chunks to ceph might be a good thing.
>>> A higher limit for "rbd cache max dirty" might be a adequate here.
>>> At the other side our read workload typically reads huge files in sequential
>>> manner.
>>>
>>> Therefore it might be useful to do start with a configuration like that:
>>>
>>> rbd cache size = 64MB
>>> rbd cache max dirty = 48MB
>>> rbd cache target dirty = 32MB
>>> rbd cache max dirty age = 10
>>>
>>> What is the strategy of librbd to write data to the storage from rbd_cache
>>> if "rbd cache max dirty = 48MB" is reached?
>>> Is there a reduction of io operations (merging of ios) compared to the
>>> granularity of writes of my virtual machines?
>> If the cache is full, incoming IO will be stalled as the dirty bits
>> are written back to the backing RBD image to make room available for
>> the new IO request.
> Sure, i will have a look at the statistics and the throughput.
> Is there any consolidation of write requests in rbd cache?
>
> Example:
> If a VM writes small IO requests to the nbd device which belong to the
> same rados object, does librbd consolidate these requests into a single
> ceph IO?
> What strategies does librbd use for that?

The librbd cache will consolidate sequential dirty extents within the
same object, but it does not consolidate all dirty extents within the
same object to the same write request.
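
As an aside, the cache counters mentioned above can be pulled from the client's admin socket while the workload runs, roughly like this (a sketch; the socket path depends on your admin_socket setting, and the librbd cache statistics show up under a "librbd-..." section of the output):

  $ ceph daemon /var/run/ceph/<client-asok-name>.asok perf dump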

> Regards
> Marc
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cephfs write fail when node goes down

2018-05-14 Thread Josef Zelenka
Hi everyone, we've encountered an unusual thing in our setup(4 nodes, 48 
OSDs, 3 monitors - ceph Jewel, Ubuntu 16.04 with kernel 4.4.0). 
Yesterday, we were doing a HW upgrade of the nodes, so they went down 
one by one - the cluster was in good shape during the upgrade, as we've 
done this numerous times and we're quite sure that the redundancy wasn't 
screwed up while doing this. However, during this upgrade one of the 
clients that does backups to cephfs(mounted via the kernel driver) 
failed to write the backup file correctly to the cluster with the 
following trace after we turned off one of the nodes:


[2585732.529412]  8800baa279a8 813fb2df 880236230e00 
8802339c
[2585732.529414]  8800baa28000 88023fc96e00 7fff 
8800baa27b20
[2585732.529415]  81840ed0 8800baa279c0 818406d5 

[2585732.529417] Call Trace:
[2585732.529505]  [] ? cpumask_next_and+0x2f/0x40
[2585732.529558]  [] ? bit_wait+0x60/0x60
[2585732.529560]  [] schedule+0x35/0x80
[2585732.529562]  [] schedule_timeout+0x1b5/0x270
[2585732.529607]  [] ? kvm_clock_get_cycles+0x1e/0x20
[2585732.529609]  [] ? bit_wait+0x60/0x60
[2585732.529611]  [] io_schedule_timeout+0xa4/0x110
[2585732.529613]  [] bit_wait_io+0x1b/0x70
[2585732.529614]  [] __wait_on_bit_lock+0x4e/0xb0
[2585732.529652]  [] __lock_page+0xbb/0xe0
[2585732.529674]  [] ? autoremove_wake_function+0x40/0x40
[2585732.529676]  [] pagecache_get_page+0x17d/0x1c0
[2585732.529730]  [] ? ceph_pool_perm_check+0x48/0x700 [ceph]
[2585732.529732]  [] grab_cache_page_write_begin+0x26/0x40
[2585732.529738]  [] ceph_write_begin+0x48/0xe0 [ceph]
[2585732.529739]  [] generic_perform_write+0xce/0x1c0
[2585732.529763]  [] ? file_update_time+0xc9/0x110
[2585732.529769]  [] ceph_write_iter+0xf89/0x1040 [ceph]
[2585732.529792]  [] ? __alloc_pages_nodemask+0x159/0x2a0
[2585732.529808]  [] new_sync_write+0x9b/0xe0
[2585732.529811]  [] __vfs_write+0x26/0x40
[2585732.529812]  [] vfs_write+0xa9/0x1a0
[2585732.529814]  [] SyS_write+0x55/0xc0
[2585732.529817]  [] entry_SYSCALL_64_fastpath+0x16/0x71


I have encountered this behavior on Luminous, but not on Jewel. Anyone who has
a clue why the write fails? As far as I'm concerned, it should always work if
all the PGs are available. Thanks
Josef

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD Cache and rbd-nbd

2018-05-14 Thread Marc Schöchlin
Hello Jason,

many thanks for your informative response!

Am 11.05.2018 um 17:02 schrieb Jason Dillaman:
> I cannot speak for Xen, but in general IO to a block device will hit
> the pagecache unless the IO operation is flagged as direct (e.g.
> O_DIRECT) to bypass the pagecache and directly send it to the block
> device.
Sure, but it seems that XenServer just forwards IO from virtual machines
(vm: blkfront, dom-0: blkback) to the nbd device in dom-0.
>> Sorry, my question was a bit imprecise: I was searching for usage statistics
>> of the rbd cache.
>> Is there also a possibility to gather rbd_cache usage statistics as a source
>> of verification for optimizing the cache settings?
> You can run "perf dump" instead of "config show" to dump out the
> current performance counters. There are some stats from the in-memory
> cache included in there.
Great, i was not aware of that.
There are really a lot of statistics which might be useful for analyzing
what's going on, or whether the optimizations improve the performance of our
systems.
>> Can you provide some hints how to about adequate cache settings for a write
>> intensive environment (70% write, 30% read)?
>> Is it a good idea to specify a huge rbd cache of 1 GB with a max dirty age
>> of 10 seconds?
> Depends on your workload and your testing results. I suspect a
> database on top of RBD is going to do its own read caching and will be
> issuing lots of flush calls to the block device, potentially negating
> the need for a large cache.

Sure, reducing flushes, at the cost of a degraded level of
reliability, seems to be one important key to improved performance.

>>
>> Our typical workload is originated over 70 percent in database write
>> operations in the virtual machines.
>> Therefore collecting write operations with rbd cache and writing them in
>> chunks to ceph might be a good thing.
>> A higher limit for "rbd cache max dirty" might be adequate here.
>> At the other side our read workload typically reads huge files in sequential
>> manner.
>>
>> Therefore it might be useful to do start with a configuration like that:
>>
>> rbd cache size = 64MB
>> rbd cache max dirty = 48MB
>> rbd cache target dirty = 32MB
>> rbd cache max dirty age = 10
>>
>> What is the strategy of librbd to write data to the storage from rbd_cache
>> if "rbd cache max dirty = 48MB" is reached?
>> Is there a reduction of io operations (merging of ios) compared to the
>> granularity of writes of my virtual machines?
> If the cache is full, incoming IO will be stalled as the dirty bits
> are written back to the backing RBD image to make room available for
> the new IO request.
Sure, i will have a look at the statistics and the throughput.
Is there any consolidation of write requests in rbd cache?

Example:
If a VM writes small IO requests to the nbd device which belong to the
same rados object, does librbd consolidate these requests into a single
ceph IO?
What strategies does librbd use for that?

Regards
Marc

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jewel to luminous upgrade, chooseleaf_vary_r and chooseleaf_stable

2018-05-14 Thread Dan van der Ster
Hi Adrian,

Is there a strict reason why you *must* upgrade the tunables?

It is normally OK to run with old (e.g. hammer) tunables on a luminous
cluster. The crush placement won't be state of the art, but that's not
a huge problem.

We have a lot of data in a jewel cluster with hammer tunables. We'll
upgrade that to luminous soon, but don't plan to set chooseleaf_stable
until there's a less disruptive procedure, e.g. [1].

Cheers, Dan

[1] One idea I had to make this much less disruptive would be to
script something that uses upmaps to lock all PGs into their current
placement, then set chooseleaf_stable, then gradually remove the
upmaps. There are some details to work out, and it requires all
clients to be running luminous, but I think something like this could
help...
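
Very roughly, the bookkeeping could look like this (an untested sketch; the pg id and osd numbers below are made up, the awk column positions vary a bit between releases, and pairing up exactly which OSD replaced which needs care):

  # before changing the tunable, record the current up set of every PG
  $ ceph pg dump pgs_brief 2>/dev/null | awk 'NR>1 {print $1, $3}' > up.before
  # set chooseleaf_stable = 1, then for each PG whose up set changed, pin it
  # back to its previous placement with an upmap exception, e.g. if pg 2.2c0
  # moved from osd.28 to osd.40:
  $ ceph osd pg-upmap-items 2.2c0 40 28
  # ...and later drop the exceptions a few at a time to let the data move gradually:
  $ ceph osd rm-pg-upmap-items 2.2c0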




On Mon, May 14, 2018 at 9:01 AM, Adrian  wrote:
> Hi all,
>
> We recently upgraded our old ceph cluster to jewel (5xmon, 21xstorage hosts
> with 9x6tb filestore osds and 3xssd's with 3 journals on each) - mostly used
> for openstack compute/cinder.
>
> In order to get there we had to go with chooseleaf_vary_r = 4 in order to
> minimize client impact and save time. We now need to get to luminous (on a
> deadline and time is limited).
>
> Current tunables are:
>   {
>   "choose_local_tries": 0,
>   "choose_local_fallback_tries": 0,
>   "choose_total_tries": 50,
>   "chooseleaf_descend_once": 1,
>   "chooseleaf_vary_r": 4,
>   "chooseleaf_stable": 0,
>   "straw_calc_version": 1,
>   "allowed_bucket_algs": 22,
>   "profile": "unknown",
>   "optimal_tunables": 0,
>   "legacy_tunables": 0,
>   "minimum_required_version": "firefly",
>   "require_feature_tunables": 1,
>   "require_feature_tunables2": 1,
>   "has_v2_rules": 0,
>   "require_feature_tunables3": 1,
>   "has_v3_rules": 0,
>   "has_v4_buckets": 0,
>   "require_feature_tunables5": 0,
>   "has_v5_rules": 0
>   }
>
> Setting chooseleaf_stable to 1, the crush compare tool says:
>Replacing the crushmap specified with --origin with the crushmap
>   specified with --destination will move 8774 PGs (59.08417508417509% of the
> total)
>   from one item to another.
>
> Current tunings we have in ceph.conf are:
>   #THROTTLING CEPH
>   osd_max_backfills = 1
>   osd_recovery_max_active = 1
>   osd_recovery_op_priority = 1
>   osd_client_op_priority = 63
>
>   #PERFORMANCE TUNING
>   osd_op_threads = 6
>   filestore_op_threads = 10
>   filestore_max_sync_interval = 30
>
> I was wondering if anyone has any advice as to anything else we can do
> balancing client impact and speed of recovery or war stories of other things
> to consider.
>
> I'm also wondering about the interplay between chooseleaf_vary_r and
> chooseleaf_stable.
> Are we better with
> 1) sticking with choosleaf_vary_r = 4, setting chooseleaf_stable =1,
> upgrading and then setting chooseleaf_vary_r incrementally to 1 when more
> time is available
> or
> 2) setting chooseleaf_vary_r incrementally first, then chooseleaf_stable and
> finally upgrade
>
> All this bearing in mind we'd like to keep the time it takes us to get to
> luminous as short as possible ;-) (guestimating a 59% rebalance to take many
> days)
>
> Any advice/thoughts gratefully received.
>
> Regards,
> Adrian.
>
> --
> ---
> Adrian : aussie...@gmail.com
> If violence doesn't solve your problem, you're not using enough of it.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com