Re: [ceph-users] Bluestore caching, flawed by design?

2018-04-01 Thread Christian Balzer

Hello,

firstly, Jack pretty much correctly correlated my issues to Mark's points,
more below.

On Sat, 31 Mar 2018 08:24:45 -0500 Mark Nelson wrote:

> On 03/29/2018 08:59 PM, Christian Balzer wrote:
> 
> > Hello,
> >
> > my crappy test cluster was rendered inoperative by an IP renumbering
> > that wasn't planned and forced on me during a DC move, so I decided to
> > start from scratch and explore the fascinating world of Luminous/bluestore
> > and all the assorted bugs. ^_-
> > (yes I could have recovered the cluster by setting up a local VLAN with
> > the old IPs, extract the monmap, etc, but I consider the need for a
> > running monitor a flaw, since all the relevant data was present in the
> > leveldb).
> >
> > Anyways, while I've read about bluestore OSD cache in passing here, the
> > back of my brain was clearly still hoping that it would use pagecache/SLAB
> > like other filesystems.
> > Which after my first round of playing with things clearly isn't the case.
> >
> > This strikes me as a design flaw and regression because:  
> 
> Bluestore's cache is not broken by design.
> 

During further tests I verified something that caught my attention out of
the corner of my eye when glancing at the atop output of the OSDs during my
fio runs.

Consider this fio run, after having first done the same run with --rw=write
to populate the file and the caches (1GB cache per OSD, the default, on the
test cluster; 20 OSDs total on 5 nodes):
---
$ fio --size=8G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
--rw=randread --name=fiojob --blocksize=4M --iodepth=32 
---

This is being run against a kernel mounted RBD image.
On the Luminous test cluster it will read the data from the disks,
completely ignoring the pagecache on the host (as expected and desired)
AND the bluestore cache.

On a Jewel based test cluster with filestore the reads will be served from
the pagecaches of the OSD nodes, not only massively improving speed but,
more importantly, reducing spindle contention.

My guess is that bluestore handles "direct" I/O differently than the kernel
does when accessing a filestore based OSD, and I'm not sure what the
"correct" behavior here is.
But somebody migrating to bluestore with such a use case and plenty of RAM
on their OSD nodes is likely to notice this and is not going to be happy
about it.
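For what it's worth, the Luminous OSD exposes buffer cache hit/miss
counters in its perf dump, which makes it easy to see whether the bluestore
cache served anything during such a run. A sketch (the counter names are
what I see under the "bluestore" section on Luminous and may differ in
other releases):

```shell
# On the OSD host, snapshot the counters before and after the fio run;
# unchanged hit bytes means the bluestore cache was bypassed entirely.
ceph daemon osd.0 perf dump | grep -E '"bluestore_buffer_(hit|miss)_bytes"'
```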


> I'm not totally convinced that some of the trade-offs we've made with 
> bluestore's cache implementation are optimal, but I think you should 
> consider cooling your rhetoric down.
> 
> > 1. Completely new users may think that bluestore defaults are fine and
> > waste all that RAM in their machines.  
> 
> What does "wasting" RAM mean in the context of a node running ceph? Are 
> you upset that other applications can't come in and evict bluestore 
> onode, OMAP, or object data from cache?
> 
As Jack pointed out, unless you go around and start tuning things, all the
available free RAM won't be used for caching.

This raises another point: the cache is per-process memory, and from
skimming over some bluestore threads here, if you raise it to use most of
your RAM during normal ops you're likely to be visited by the evil OOM
witch during heavy recovery OPS.

Whereas the good ole pagecache would just get evicted in that scenario.
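To put numbers on it, here's the kind of back-of-the-envelope budgeting a
fixed per-OSD cache forces on you (a sketch with made-up numbers;
bluestore_cache_size is the Luminous option name, and the recovery
overhead per OSD is pure assumption):

```shell
# Budget the per-OSD cache so that normal ops plus recovery peaks still
# fit in RAM -- headroom the pagecache would otherwise yield on demand.
ram_mb=65536          # total node RAM
num_osds=10
os_reserve_mb=4096    # OS and other daemons
recovery_mb=1536      # assumed extra usage per OSD during heavy recovery
cache_mb=$(( (ram_mb - os_reserve_mb - num_osds * recovery_mb) / num_osds ))
echo "bluestore_cache_size = $(( cache_mb * 1024 * 1024 ))"
```

With these made-up numbers each OSD gets a 4608 MB cache, rather than
whatever share of free RAM it could have borrowed from a shared pagecache.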

> > 2. Having a per OSD cache is inefficient compared to a common cache like
> > pagecache, since an OSD that is busier than others would benefit from a
> > shared cache more.  
> 
> It's only "inefficient" if you assume that using the pagecache, and more 
> generally, kernel syscalls, is free.  Yes the pagecache is convenient 
> and yes it gives you a lot of flexibility, but you pay for that 
> flexibility if you are trying to do anything fast.
> 
> For instance, take the new KPTI patches in the kernel for meltdown. Look 
> at how badly it can hurt MyISAM database performance in MariaDB:
> 
I, like many others here, have decided that all the Meltdown and Spectre
patches are a bit pointless on pure OSD nodes, because if somebody on the
node is running random code you're already in deep doodoo.

That being said, I will totally concur that syscalls aren't free.
However, given the latencies induced by the rather long/complex code path
IOPS have to traverse within Ceph, how much of a gain would you say
eliminating these particular calls did achieve?

> https://mariadb.org/myisam-table-scan-performance-kpti/
> 
> MyISAM does not have a dedicated row cache and instead caches row data 
> in the page cache as you suggest Bluestore should do for its data.  
> Look at how badly KPTI hurts performance (~40%). Now look at ARIA with a 
> dedicated 128MB cache (less than 1%).  KPTI is a really good example of 
> how much this stuff can hurt you, but syscalls, context switches, and 
> page faults were already expensive even before meltdown.  Not to mention 
> that right now bluestore keeps onodes and buffers stored in its cache 
> in an unencoded form.
> 
That last bit is quite relevant of course.

> Here's a couple of other articles worth looking at:
> 
> 

[ceph-users] Have an inconsistent PG, repair not working

2018-04-01 Thread Michael Sudnick
Hello,

I have a small cluster with an inconsistent pg. I've tried ceph pg repair
multiple times with no luck. rados list-inconsistent-obj 49.11c returns:

# rados list-inconsistent-obj 49.11c
No scrub information available for pg 49.11c
error 2: (2) No such file or directory

I'm a bit at a loss here as to what to do to recover. That pg is part of a
cephfs_data pool with compression set to force/snappy.

Does anyone have any suggestions?
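The commonly suggested sequence for that error, as I understand it (a
sketch using the pg id above; no guarantee it applies here), is to run a
deep scrub first so there is scrub information to list:

```shell
ceph pg deep-scrub 49.11c
# wait for the deep scrub to finish (watch `ceph -w` or the pg state),
# then query again:
rados list-inconsistent-obj 49.11c --format=json-pretty
```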

-Michael
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Does jewel 10.2.10 support filestore_split_rand_factor?

2018-04-01 Thread shadow_lin
Thanks.
Is there any workaround for 10.2.10 to avoid all OSDs starting to split at
the same time?
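One workaround I've seen mentioned (a sketch only: I have not verified
that the apply-layout-settings op is present in 10.2.10's
ceph-objectstore-tool, so treat the op name and paths as assumptions) is
to pre-split the directories offline, one OSD at a time, so the splits
never happen under live I/O:

```shell
# Pre-split PG directories offline for one OSD, then move to the next.
POOL=rbd                      # substitute the affected pool
systemctl stop ceph-osd@0
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
    --op apply-layout-settings --pool "$POOL"
systemctl start ceph-osd@0
```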

2018-04-01 


shadowlin




From: Pavan Rallabhandi 
Sent: 2018-04-01 22:39
Subject: Re: [ceph-users] Does jewel 10.2.10 support filestore_split_rand_factor?
To: "shadow_lin","ceph-users"
Cc:

No, it is supported in the next version of Jewel 
http://tracker.ceph.com/issues/22658
 
From: ceph-users  on behalf of shadow_lin 

Date: Sunday, April 1, 2018 at 3:53 AM
To: ceph-users 
Subject: EXT: [ceph-users] Does jewel 10.2.10 support 
filestore_split_rand_factor?
 
Hi list,
The document page of jewel has filestore_split_rand_factor config but I can't 
find the config by using 'ceph daemon osd.x config'.
 
ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
ceph daemon osd.0 config show|grep split
"mon_osd_max_split_count": "32",
"journaler_allow_split_entries": "true",
"mds_bal_split_size": "1",
"mds_bal_split_rd": "25000",
"mds_bal_split_wr": "1",
"mds_bal_split_bits": "3",
"filestore_split_multiple": "4",
"filestore_debug_verify_split": "false",


 
2018-04-01



shadow_lin


Re: [ceph-users] Does jewel 10.2.10 support filestore_split_rand_factor?

2018-04-01 Thread Pavan Rallabhandi
No, it is supported in the next version of Jewel 
http://tracker.ceph.com/issues/22658

From: ceph-users  on behalf of shadow_lin 

Date: Sunday, April 1, 2018 at 3:53 AM
To: ceph-users 
Subject: EXT: [ceph-users] Does jewel 10.2.10 support 
filestore_split_rand_factor?

Hi list,
The document page of jewel has filestore_split_rand_factor config but I can't 
find the config by using 'ceph daemon osd.x config'.

ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
ceph daemon osd.0 config show|grep split
"mon_osd_max_split_count": "32",
"journaler_allow_split_entries": "true",
"mds_bal_split_size": "1",
"mds_bal_split_rd": "25000",
"mds_bal_split_wr": "1",
"mds_bal_split_bits": "3",
"filestore_split_multiple": "4",
"filestore_debug_verify_split": "false",


2018-04-01

shadow_lin


[ceph-users] Does jewel 10.2.10 support filestore_split_rand_factor?

2018-04-01 Thread shadow_lin
Hi list,
The document page of jewel has filestore_split_rand_factor config but I can't 
find the config by using 'ceph daemon osd.x config'.

ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
ceph daemon osd.0 config show|grep split
"mon_osd_max_split_count": "32",
"journaler_allow_split_entries": "true",
"mds_bal_split_size": "1",
"mds_bal_split_rd": "25000",
"mds_bal_split_wr": "1",
"mds_bal_split_bits": "3",
"filestore_split_multiple": "4",
"filestore_debug_verify_split": "false",


2018-04-01


shadow_lin