> A long time ago I was responsible for validating the performance of CXFS
> on an SGI Altix UV distributed shared-memory supercomputer. As it turns
> out, we could achieve about 22GB/s writes with XFS (a huge number at the
> time), but CXFS was 5-10x slower. A big part of that turned out to be the
> kernel distributing page cache across the Numalink5 interconnects to
> remote memory. The problem can potentially happen on any NUMA system to
> varying degrees.
That work sounds very interesting. I used to manage Itanium Altixes and
then a UV system myself. I set up cpusets on the UV system, which gave a
big performance increase since user jobs had CPUs and memory close to each
other. I also had a boot cpuset on the first blade, which had the
fibrechannel HBA, so I guess that had a similar effect in that the CXFS
processes were local to the IO card. UV was running SuSE - sorry.

On the subject of memory allocation, GPFS uses an amount of pagepool
memory, and the given advice always seems to be to make this large. There
is one fixed pagepool on a server, even if it has multiple NSDs. How does
this compare to Ceph memory allocation?

On 31 March 2018 at 15:24, Mark Nelson <[email protected]> wrote:

> On 03/29/2018 08:59 PM, Christian Balzer wrote:
>
>> Hello,
>>
>> my crappy test cluster was rendered inoperable by an IP renumbering
>> that wasn't planned and was forced on me during a DC move, so I decided
>> to start from scratch and explore the fascinating world of
>> Luminous/bluestore and all the assorted bugs. ^_-
>> (Yes, I could have recovered the cluster by setting up a local VLAN
>> with the old IPs, extracting the monmap, etc., but I consider the need
>> for a running monitor a flaw, since all the relevant data was present
>> in the leveldb.)
>>
>> Anyway, while I've read about the bluestore OSD cache in passing here,
>> the back of my brain was clearly still hoping that it would use
>> pagecache/SLAB like other filesystems.
>> Which, after my first round of playing with things, clearly isn't the
>> case.
>>
>> This strikes me as a design flaw and regression because:
>
> Bluestore's cache is not broken by design.
>
> I'm not totally convinced that some of the trade-offs we've made with
> bluestore's cache implementation are optimal, but I think you should
> consider cooling your rhetoric down.
>
>> 1. Completely new users may think that bluestore defaults are fine and
>> waste all that RAM in their machines.
> What does "wasting" RAM mean in the context of a node running ceph? Are
> you upset that other applications can't come in and evict bluestore
> onode, OMAP, or object data from cache?
>
>> 2. Having a per-OSD cache is inefficient compared to a common cache
>> like pagecache, since an OSD that is busier than others would benefit
>> from a shared cache more.
>
> It's only "inefficient" if you assume that using the pagecache, and more
> generally, kernel syscalls, is free. Yes, the pagecache is convenient and
> yes, it gives you a lot of flexibility, but you pay for that flexibility
> if you are trying to do anything fast.
>
> For instance, take the new KPTI patches in the kernel for meltdown. Look
> at how badly they can hurt MyISAM database performance in MariaDB:
>
> https://mariadb.org/myisam-table-scan-performance-kpti/
>
> MyISAM does not have a dedicated row cache and instead caches row data
> in the page cache, as you suggest Bluestore should do for its data. Look
> at how badly KPTI hurts performance (~40%). Now look at ARIA with a
> dedicated 128MB cache (less than 1%). KPTI is a really good example of
> how much this stuff can hurt you, but syscalls, context switches, and
> page faults were already expensive even before meltdown. Not to mention
> that right now bluestore keeps onodes and buffers stored in its cache in
> an unencoded form.
>
> Here are a couple of other articles worth looking at:
>
> https://eng.uber.com/mysql-migration/
> https://www.scylladb.com/2018/01/07/cost-of-avoiding-a-meltdown/
> http://www.brendangregg.com/blog/2018-02-09/kpti-kaiser-meltdown-performance.html
>
>> 3. A uniform OSD cache size of course will be a nightmare when having
>> non-uniform HW, either with RAM or number of OSDs.
>
> Non-uniform hardware is a big reason that pinning dedicated memory to
> specific cores/sockets is really nice vs relying on potentially remote
> memory page cache reads.
> A long time ago I was responsible for validating the performance of CXFS
> on an SGI Altix UV distributed shared-memory supercomputer. As it turns
> out, we could achieve about 22GB/s writes with XFS (a huge number at the
> time), but CXFS was 5-10x slower. A big part of that turned out to be the
> kernel distributing page cache across the Numalink5 interconnects to
> remote memory. The problem can potentially happen on any NUMA system to
> varying degrees.
>
> Personally, I have two primary issues with bluestore's memory
> configuration right now:
>
> 1) It's too complicated for users to figure out where to assign memory
> and in what ratios. I'm attempting to improve this by adding autotuning
> to bluestore's cache, so the user just gives it a number and bluestore
> will try to work out where it should assign memory.
>
> 2) In the case where a subset of OSDs are really hot (maybe RGW bucket
> accesses) you might want some OSDs to get more memory than others. I
> think we can tackle this better if we migrate to a one-osd-per-node
> sharded architecture (likely based on seastar), though we'll still need
> to be very aware of remote memory. Given that this is fairly difficult
> to do well, we're probably going to be better off just dedicating a
> static pool to each shard initially.
>
> Mark
>
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
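The single-number autotuning idea in (1) above can be sketched roughly as follows: take one overall memory target and carve it into per-cache pools by ratio. The cache names and ratios here are illustrative assumptions, not bluestore's actual implementation:

```python
# Sketch: carve one overall memory target into per-cache pools by ratio.
# Cache names ("onode", "omap", "data") and the ratios are illustrative
# assumptions, not bluestore's real tuning logic.

def carve_cache(total_bytes, ratios):
    """Split total_bytes across caches according to ratios.

    ratios: dict of cache name -> fraction of the total; whatever is
    left after the fixed ratios goes to the 'data' cache.
    """
    sizes = {name: int(total_bytes * frac) for name, frac in ratios.items()}
    sizes["data"] = total_bytes - sum(sizes.values())  # remainder to data
    return sizes

# The user supplies only the single number (here 3 GiB):
sizes = carve_cache(3 * 1024**3, {"onode": 0.25, "omap": 0.25})
```

The point of the sketch is that the user supplies one target and the daemon decides the split, instead of today's several independent knobs.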
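On the GPFS pagepool question upthread: unlike the single fixed pagepool per server, bluestore's cache in Luminous is sized per OSD daemon. A minimal ceph.conf sketch (the option names exist in Luminous; the values are illustrative, not recommendations):

```ini
[osd]
# Per-OSD bluestore cache size; each OSD daemon on the box gets its own
# pool of this size, so total usage scales with the OSD count.
bluestore_cache_size = 3221225472        # 3 GiB, illustrative
# Per-device-class sizes, used when bluestore_cache_size is left at 0.
bluestore_cache_size_hdd = 1073741824    # 1 GiB
bluestore_cache_size_ssd = 3221225472    # 3 GiB
# Fractions of the cache given to metadata (onodes) and rocksdb;
# the remainder caches object data.
bluestore_cache_meta_ratio = 0.01
bluestore_cache_kv_ratio = 0.99
```

Because this is per OSD rather than one shared pool, a server with many OSDs multiplies the figure, which is exactly the non-uniform hardware concern raised in the thread.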
