> A long time ago I was responsible for validating the performance of CXFS
> on an SGI Altix UV distributed shared-memory supercomputer.  As it turns
> out, we could achieve about 22GB/s writes with XFS (a huge number at the
> time), but CXFS was 5-10x slower.  A big part of that turned out to be the
> kernel distributing page cache across the Numalink5 interconnects to remote
> memory.
> The problem can potentially happen on any NUMA system to varying degrees.

That's very interesting. I used to manage Itanium Altixes and then a UV
system, so that work sounds very familiar.
I set up cpusets on the UV system, which gave a big performance increase
since user jobs had their CPUs and memory close to each other.
I also had a boot cpuset on the first blade, which held the fibrechannel
HBA, so I guess that had a similar effect in that the CXFS processes were
local to the IO card.
The UV was running SuSE - sorry.
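In case it helps, a rough sketch of the kind of boot cpuset I mean, via
the legacy /dev/cpuset interface we had on SLES (CPU and node numbers are
purely illustrative, not the actual UV layout; newer kernels expose the
same knobs through cgroups instead):

```shell
# Mount the cpuset pseudo-filesystem (legacy interface; cgroup-based
# systems expose the same controls under /sys/fs/cgroup instead).
mount -t cpuset none /dev/cpuset

# Create a "boot" cpuset confined to the first blade's CPUs and the
# memory node local to the fibrechannel HBA (numbers illustrative).
mkdir /dev/cpuset/boot
echo 0-7 > /dev/cpuset/boot/cpus
echo 0   > /dev/cpuset/boot/mems

# Move init (and hence system daemons such as the CXFS threads) into it.
echo 1 > /dev/cpuset/boot/tasks
```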

On the subject of memory allocation, GPFS uses a dedicated pagepool for
its cache, and the advice given always seems to be to make this large.
There is one fixed pagepool per server, even if it serves multiple NSDs.
How does this compare to Ceph's memory allocation?
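For reference, the pagepool is a single cluster-configuration knob set
with mmchconfig; something like the following (size and node-class name
purely illustrative) is typically all the tuning there is:

```shell
# Set the GPFS pagepool to 16 GiB on the NSD servers (example value;
# "nsdNodes" is a hypothetical node class, and the change takes effect
# after GPFS restarts on those nodes).
mmchconfig pagepool=16G -N nsdNodes
```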

On 31 March 2018 at 15:24, Mark Nelson <[email protected]> wrote:

> On 03/29/2018 08:59 PM, Christian Balzer wrote:
>
> Hello,
>>
>> my crappy test cluster was rendered inoperable by an IP renumbering
>> that wasn't planned and forced on me during a DC move, so I decided to
>> start from scratch and explore the fascinating world of Luminous/bluestore
>> and all the assorted bugs. ^_-
>> (yes, I could have recovered the cluster by setting up a local VLAN with
>> the old IPs, extracting the monmap, etc., but I consider the need for a
>> running monitor a flaw, since all the relevant data was present in the
>> leveldb).
>>
>> Anyways, while I've read about bluestore OSD cache in passing here, the
>> back of my brain was clearly still hoping that it would use pagecache/SLAB
>> like other filesystems.
>> Which after my first round of playing with things clearly isn't the case.
>>
>> This strikes me as a design flaw and regression because:
>>
>
> Bluestore's cache is not broken by design.
>
> I'm not totally convinced that some of the trade-offs we've made with
> bluestore's cache implementation are optimal, but I think you should
> consider cooling your rhetoric down.
>
> 1. Completely new users may think that bluestore defaults are fine and
>> waste all that RAM in their machines.
>>
>
> What does "wasting" RAM mean in the context of a node running ceph? Are
> you upset that other applications can't come in and evict bluestore onode,
> OMAP, or object data from cache?
>
> 2. Having a per OSD cache is inefficient compared to a common cache like
>> pagecache, since an OSD that is busier than others would benefit from a
>> shared cache more.
>>
>
> It's only "inefficient" if you assume that using the pagecache, and more
> generally kernel syscalls, is free.  Yes, the pagecache is convenient and
> yes, it gives you a lot of flexibility, but you pay for that flexibility if
> you are trying to do anything fast.
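As a rough illustration of that cost (a toy measurement of my own, not
from Mark's work): even ignoring page faults and KPTI, every trip through
the kernel to fetch already-cached data costs far more than a userspace
lookup does.

```python
# Toy micro-benchmark (illustrative only, not rigorous): repeated small
# reads via the pread() syscall vs. a plain userspace dict cache.
import os
import tempfile
import time

data = os.urandom(4096)
fd, path = tempfile.mkstemp()
os.write(fd, data)

N = 100_000

# Syscall path: one pread() per access, kernel crossing every time.
t0 = time.perf_counter()
for _ in range(N):
    os.pread(fd, 4096, 0)
syscall_time = time.perf_counter() - t0

# Userspace cache path: the same data served from a dict lookup.
cache = {0: data}
t0 = time.perf_counter()
for _ in range(N):
    cache[0]
cache_time = time.perf_counter() - t0

os.close(fd)
os.unlink(path)
print(f"pread: {syscall_time:.3f}s  dict: {cache_time:.3f}s")
```

On any machine I've tried, the dict path wins by well over an order of
magnitude, which is the gap a dedicated cache is buying you before you
even account for Meltdown mitigations.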
>
> For instance, take the new KPTI patches in the kernel for Meltdown. Look
> at how badly they can hurt MyISAM database performance in MariaDB:
>
> https://mariadb.org/myisam-table-scan-performance-kpti/
>
> MyISAM does not have a dedicated row cache and instead caches row data in
> the page cache, as you suggest Bluestore should do for its data.  Look at
> how badly KPTI hurts performance (~40%). Now look at Aria with a dedicated
> 128MB cache (less than 1%).  KPTI is a really good example of how much this
> stuff can hurt you, but syscalls, context switches, and page faults were
> already expensive even before Meltdown.  Not to mention that right now
> bluestore keeps onodes and buffers stored in its cache in an unencoded
> form.
>
> Here are a couple of other articles worth looking at:
>
> https://eng.uber.com/mysql-migration/
> https://www.scylladb.com/2018/01/07/cost-of-avoiding-a-meltdown/
> http://www.brendangregg.com/blog/2018-02-09/kpti-kaiser-meltdown-performance.html
>
> 3. A uniform OSD cache size will of course be a nightmare when you have
>> non-uniform HW, either in RAM or in the number of OSDs.
>>
>
> Non-Uniform hardware is a big reason that pinning dedicated memory to
> specific cores/sockets is really nice vs relying on potentially remote
> memory page cache reads.  A long time ago I was responsible for validating
> the performance of CXFS on an SGI Altix UV distributed shared-memory
> supercomputer.  As it turns out, we could achieve about 22GB/s writes with
> XFS (a huge number at the time), but CXFS was 5-10x slower.  A big part of
> that turned out to be the kernel distributing page cache across the
> Numalink5 interconnects to remote memory.  The problem can potentially
> happen on any NUMA system to varying degrees.
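(As an aside, this is the sort of locality one can approximate today by
pinning an OSD process to one socket; a hedged sketch, with the OSD id
and node numbers purely illustrative:)

```shell
# Bind OSD 3's CPU scheduling and memory allocation to NUMA node 0 so
# its caches stay local to that socket (numbers are illustrative).
numactl --cpunodebind=0 --membind=0 ceph-osd -i 3
```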
>
> Personally I have two primary issues with bluestore's memory configuration
> right now:
>
> 1) It's too complicated for users to figure out where to assign memory and
> in what ratios.  I'm attempting to improve this by making bluestore's cache
> autotune itself, so the user just gives it a single number and bluestore
> will work out where to assign memory.
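(For context, the static knobs this would replace look roughly like the
following in ceph.conf under Luminous; the sizes are examples, not
recommendations:)

```ini
[osd]
# Static per-OSD bluestore cache sizes in Luminous (example values only).
bluestore_cache_size_hdd = 1073741824
bluestore_cache_size_ssd = 3221225472
```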
>
> 2) In the case where a subset of OSDs are really hot (maybe RGW bucket
> accesses) you might want some OSDs to get more memory than others.  I think
> we can tackle this better if we migrate to a one-osd-per-node sharded
> architecture (likely based on seastar), though we'll still need to be very
> aware of remote memory.  Given that this is fairly difficult to do well,
> we're probably going to be better off just dedicating a static pool to each
> shard initially.
>
> Mark
>
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>