Re: [ceph-users] Consumer-grade SSD in Ceph

2019-12-19 Thread Mark Nelson

The way I try to look at this is:


1) How much more do the enterprise grade drives cost?

2) What are the benefits? (Faster performance, longer life, etc)

3) How much does it cost to deal with downtime, diagnose issues, and 
replace malfunctioning hardware?



My personal take is that enterprise drives are usually worth it. There 
may be consumer grade drives worth considering in very specific 
scenarios if they still have power loss protection and high write 
durability.  Even when I was in academia years ago with very limited 
budgets, we got burned with consumer grade SSDs to the point where we 
had to replace them all.  You have to be very careful and know exactly 
what you are buying.



Mark


On 12/19/19 12:04 PM, jes...@krogh.cc wrote:

I don't think “usually” is good enough in a production setup.



Sent from myMail for iOS


Thursday, 19 December 2019, 12.09 +0100 from Виталий Филиппов 
:


Usually it doesn't, it only harms performance and probably SSD lifetime
too

> I would not be running ceph on ssds without powerloss protection. It
> delivers a potential data loss scenario




Re: [ceph-users] very high ram usage by OSDs on Nautilus

2019-10-29 Thread Mark Nelson

Ok, assuming my math is right you've got ~14G of data in the mempools.


~6.5GB bluestore data

~1.8GB bluestore onode

~5GB bluestore other


The rest is other misc stuff.  That seems to be pretty in line with the 
numbers you posted in your screenshot. I.e. this doesn't appear to be a 
leak, but rather the bluestore caches are all using significantly more 
data than is typical given the default 4GB osd_memory_target.  You can 
check what an OSD's memory target is set to via the config show command 
(I'm using the admin socket here but you don't have to):


ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep 
'"osd_memory_target"'

    "osd_memory_target": "4294967296",

Mark


On 10/29/19 8:07 AM, Philippe D'Anjou wrote:
Ok looking at mempool, what does it tell me? This affects multiple 
OSDs, got crashes almost every hour.


{
    "mempool": {
    "by_pool": {
    "bloom_filter": {
    "items": 0,
    "bytes": 0
    },
    "bluestore_alloc": {
    "items": 2545349,
    "bytes": 20362792
    },
    "bluestore_cache_data": {
    "items": 28759,
    "bytes": 6972870656
    },
    "bluestore_cache_onode": {
    "items": 2885255,
    "bytes": 1892727280
    },
    "bluestore_cache_other": {
    "items": 202831651,
    "bytes": 5403585971
    },
    "bluestore_fsck": {
    "items": 0,
    "bytes": 0
    },
    "bluestore_txc": {
    "items": 21,
    "bytes": 15792
    },
    "bluestore_writing_deferred": {
    "items": 77,
    "bytes": 7803168
    },
    "bluestore_writing": {
    "items": 4,
    "bytes": 5319827
    },
    "bluefs": {
    "items": 5242,
    "bytes": 175096
    },
    "buffer_anon": {
    "items": 726644,
    "bytes": 193214370
    },
    "buffer_meta": {
    "items": 754360,
    "bytes": 66383680
    },
    "osd": {
    "items": 29,
    "bytes": 377464
    },
    "osd_mapbl": {
    "items": 50,
    "bytes": 3492082
    },
    "osd_pglog": {
    "items": 99011,
    "bytes": 46170592
    },
    "osdmap": {
    "items": 48130,
    "bytes": 1151208
    },
    "osdmap_mapping": {
    "items": 0,
    "bytes": 0
    },
    "pgmap": {
    "items": 0,
    "bytes": 0
    },
    "mds_co": {
    "items": 0,
    "bytes": 0
    },
    "unittest_1": {
    "items": 0,
    "bytes": 0
    },
    "unittest_2": {
    "items": 0,
    "bytes": 0
    }
    },
    "total": {
    "items": 209924582,
    "bytes": 14613649978
    }
    }
}





Re: [ceph-users] very high ram usage by OSDs on Nautilus

2019-10-28 Thread Mark Nelson

Hi Philippe,


Have you looked at the mempool stats yet?


ceph daemon osd.NNN dump_mempools


You may also want to look at the heap stats, and potentially enable 
debug 5 for bluestore to see what the priority cache manager is doing.  
Typically in these cases we end up seeing a ton of memory used by 
something and the priority cache manager is trying to compensate by 
shrinking the caches, but you won't really know until you start looking 
at the various statistics and logging.
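
Roughly, that boils down to something like the following (osd.NNN is a
placeholder for the OSD id in question):

ceph daemon osd.NNN dump_mempools     # per-pool memory accounting
ceph daemon osd.NNN heap stats        # tcmalloc view: mapped vs freed-but-unreclaimed
ceph tell osd.NNN injectargs '--debug_bluestore 5/5 --debug_prioritycache 5/5'

and then watching the OSD log for the cache tuning lines.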



Mark


On 10/28/19 2:54 AM, Philippe D'Anjou wrote:

Hi,

we are seeing quite a high memory usage by OSDs since Nautilus, 
averaging 10GB/OSD for 10TB HDDs. But I had OOM issues on 128GB 
systems because some single OSD processes used up to 32%.


Here is an example of how they look on average: https://i.imgur.com/kXCtxMe.png

Is that normal? I've never seen this on Luminous. Memory leaks?
Using all default values, OSDs have no special configuration. Use case 
is cephfs.


v14.2.4 on Ubuntu 18.04 LTS

Seems a bit high?

Thanks for help



Re: [ceph-users] cephfs full, 2/3 Raw capacity used

2019-08-26 Thread Mark Nelson


On 8/26/19 7:39 AM, Wido den Hollander wrote:


On 8/26/19 1:35 PM, Simon Oosthoek wrote:

On 26-08-19 13:25, Simon Oosthoek wrote:

On 26-08-19 13:11, Wido den Hollander wrote:


The reweight might actually cause even more confusion for the balancer.
The balancer uses upmap mode and that re-allocates PGs to different OSDs
if needed.

Looking at the output sent earlier I have some replies. See below.




Looking at this output the balancing seems OK, but from a different
perspective.

PGs are allocated to OSDs, not objects or data. All OSDs have 95~97
Placement Groups allocated.

That's good! An almost perfect distribution.

The problem that now arises is the difference in the size of these
Placement Groups as they hold different objects.

This is one of the side-effects of larger disks. The PGs on them will
grow and this will lead to imbalance between the OSDs.

I *think* that increasing the amount of PGs on this cluster would help,
but only for the pools which will contain most of the data.

This will consume a bit more CPU Power and Memory, but on modern systems
this should be less of a problem.

The good thing is that with Nautilus you can also scale down on the
amount of PGs if things would become a problem.

More PGs will mean smaller PGs and thus lead to a better data
distribution.



That makes sense, dividing the data in smaller chunks makes it more
flexible. The osd nodes are quite underloaded, even with turbo
recovery mode on (10, not 32 ;-).

When the cluster is in HEALTH_OK again, I'll increase the PGs for the
cephfs pools...

On second thought, I reverted my reweight commands and adjusted the PGs,
which were quite low for some of the pools. The reason they were low is
that when we first created them, we expected them to be rarely used, but
then we started filling them just for the filling, and these are
probably the cause of the imbalance.


You should make sure that the pools which contain the most data have the
most PGs.

Although ~100 PGs per OSD is the recommendation it won't hurt to have
~200 PGs as long as you have enough CPU power and Memory. More PGs will
mean better data distribution with such large disks.



Memory is probably the biggest concern, since the pglog can eat up a 
surprising amount of memory with lots of PGs on the OSD.  I suspect we 
should consider having the pglog controlled by the prioritycachemanager 
and set the lengths based on the amount of memory we want assigned to 
it.  Perhaps even dynamically changing based on the pool and current 
workload.  In the long run, we should probably have a much longer log on 
disk and shorter log in memory regardless.
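
Until something like that lands, the in-memory log lengths can be capped by
hand.  A sketch only, with illustrative values (check your release's
defaults first; shorter logs mean less memory per PG but more backfill
instead of log-based recovery after an outage):

[osd]
osd_min_pg_log_entries = 1000
osd_max_pg_log_entries = 2000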



Mark





The cluster now has over 8% misplaced objects, so that can take a while...

Cheers

/Simon


[ceph-users] hsbench 0.2 released

2019-08-22 Thread Mark Nelson

Hi Folks,


I've updated hsbench (new S3 benchmark) to 0.2


Notable changes since 0.1:


- Can now output CSV results

- Can now output JSON results

- Fix for poor read performance with low thread counts

- New bucket listing benchmark with a new "mk" flag that lets you 
control the number of keys to fetch at once.



You can get it here:


https://github.com/markhpc/hsbench


I've been doing quite a bit of testing of the RadosGW with hsbench this 
week and so far it's doing exactly what I hoped it would do!



Mark



Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred

2019-08-21 Thread Mark Nelson

Hi Vladimir,


On 8/21/19 8:54 AM, Vladimir Brik wrote:

Hello

I am running a Ceph 14.2.1 cluster with 3 rados gateways. 
Periodically, radosgw process on those machines starts consuming 100% 
of 5 CPU cores for days at a time, even though the machine is not 
being used for data transfers (nothing in radosgw logs, couple of KB/s 
of network).


This situation can affect any number of our rados gateways, lasts from a 
few hours to a few days, and stops when the radosgw process is restarted or 
on its own.


Does anybody have an idea what might be going on or how to debug it? I 
don't see anything obvious in the logs. Perf top is saying that CPU is 
consumed by radosgw shared object in symbol get_obj_data::flush, 
which, if I interpret things correctly, is called from a symbol with a 
long name that contains the substring "boost9intrusive9list_impl"



I don't normally look at the RGW code so maybe Matt/Casey/Eric can chime 
in.  That code is in src/rgw/rgw_rados.cc in the get_obj_data struct.  
The flush method does some sorting/merging and then walks through a 
list of completed IOs and appears to copy a bufferlist out of each 
one, then deletes it from the list and passes the BL off to 
client_cb->handle_data.  Looks like it could be pretty CPU intensive but 
if you are seeing that much CPU for that long it sounds like something 
is rather off.



You might want to try grabbing a callgraph from perf instead of just 
running perf top, or using my wallclock profiler, to see if you can drill 
down and find out where in that method it's spending the most time.
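
A minimal perf sketch, assuming perf is installed and radosgw runs as a
single process on the box:

perf record -g -p $(pgrep -o radosgw) -- sleep 30
perf report --stdio | head -50

That should at least show which callees under get_obj_data::flush are hot.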



My wallclock profiler is here:


https://github.com/markhpc/gdbpmp


Mark




This is our configuration:
rgw_frontends = civetweb num_threads=5000 port=443s 
ssl_certificate=/etc/ceph/rgw.crt 
error_log_file=/var/log/ceph/civetweb.error.log


(error log file doesn't exist)


Thanks,

Vlad


Re: [ceph-users] [RFC] New S3 Benchmark

2019-08-15 Thread Mark Nelson
Ha, I hadn't thought to check if that existed which was probably pretty 
short-sighted on my part. :)



I suppose the good news is that it might be a good candidate for 
comparison, and once I've implemented an S3 client endpoint in CBT I 
should be able to target both pretty easily.



Mark


On 8/15/19 6:41 PM, David Byte wrote:

Mark, did the S3 engine for fio not work?

Sent from my iPhone. Typos are Apple's fault.


On Aug 15, 2019, at 6:37 PM, Mark Nelson  wrote:

Hi Guys,

Earlier this week I was working on investigating the impact of OMAP performance 
on RGW and wanted to see if putting rocksdb on ramdisk would help speed up 
bucket index updates.  While running tests I found out that the benchmark tool 
I was using consumed roughly 15 cores of CPU to push 4K puts/second to RGW from 
128 threads.  That wasn't really viable, so I started looking for alternate S3 
benchmarking tools.  COSBench is sort of the most well known choice out there, 
but it's a bit cumbersome if you just want to run some quick tests from the 
command-line.

I managed to find a simple yet very nice benchmark written in go developed by 
Wasabi Inc called s3-benchmark.  While it works well, it really only targets 
single buckets and is designed more for AWS testing than for Ceph.  I forked 
the project and pretty much refactored the whole thing to be more useful for 
the kind of testing I want to do. It's now at the point where I think it might 
be ready to experiment with.  S3 benchmarking has been a semi-recurring topic 
on the list so I figured other folks might be interested in trying it too and 
hopefully providing feedback.

The new benchmark is called hsbench and it's available here:

https://github.com/markhpc/hsbench

See the README.md for some of the advantages over the original s3-benchmark 
program it was forked from.  I consider this release to be alpha level quality 
and disclaim all responsibility if it breaks and deletes every object in all of 
your buckets.  Consider yourself warned. :)

Mark


Re: [ceph-users] WAL/DB size

2019-08-15 Thread Mark Nelson

Hi Folks,


The basic idea behind the WAL is that for every DB write transaction you 
first write it into an in-memory buffer and to a region on disk.  
RocksDB typically is setup to have multiple WAL buffers, and when one or 
more fills up, it will start flushing the data to L0 while new writes 
are written to the next buffer.  If rocksdb can't flush data fast 
enough, it will throttle write throughput down so that hopefully you 
don't fill all of the buffers up and stall before a flush completes.  
The combined total size/number of buffers governs both how much disk 
space you need for the WAL and how much RAM is needed to store incoming 
IO that hasn't finished flushing into the DB.  There are various 
tradeoffs when adjusting the size, number, and behavior of the WAL.  On one 
hand there's an advantage to having small buffers to favor frequent 
swift flush events and hopefully keep overall memory usage low and CPU 
overhead of key comparisons low.  On the other hand, having large WAL 
buffers means you have more runway both in terms of being able to absorb 
longer L0 compaction events but also potentially in terms of being able 
to avoid writing pglog entries to L0 entirely if a tombstone lands in 
the same WAL buffer as the initial write.  We've seen evidence that 
write amplification is (sometimes much) lower with bigger WAL buffers 
and we think this is a big part of the reason why.



Right now our default WAL settings for rocksdb is:


max_write_buffer_number=4

min_write_buffer_number_to_merge=1

write_buffer_size=268435456


which means we will store up to 4 256MB buffers and start flushing as 
soon as 1 fills up.  Alternate strategies could be to use something like 16 
64MB buffers, and set min_write_buffer_number_to_merge to something like 
4.  Potentially that might provide slightly more fine grained control 
and also may be advantageous with a larger number of column families, 
but we haven't seen evidence yet that splitting the buffers into more 
smaller segments definitely improves things.  Probably the bigger 
take-away is that you can't simply make the WAL huge to give yourself 
extra runway for writes unless you are also willing to eat the RAM cost 
of storing all of that data in-memory as well. That's one of the reasons 
why we tell people regularly that 1-2GB is enough for the WAL.  With a 
target OSD memory of 4GB, (up to) 1GB for the WAL is already pushing 
it.  Luckily in most cases it doesn't actually use the full 1GB though.  
RocksDB will throttle before you get to that point so in reality it's 
more likely the WAL is probably using more like 0-512MB of Disk/RAM with 
2-3 extra buffers of capacity in case things get hairy.
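
For reference, those WAL-related fields live inside the single
bluestore_rocksdb_options string.  A sketch of the 16 x 64MB alternative
mentioned above (this replaces the whole string, so carry over the rest of
your release's default; check it first with
ceph daemon osd.0 config get bluestore_rocksdb_options. Note it only takes
effect when the OSD is restarted):

[osd]
bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=4,recycle_log_file_num=4,write_buffer_size=67108864,writable_file_max_buffer_size=0,compaction_readahead_size=2097152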



Mark


On 8/15/19 1:59 AM, Janne Johansson wrote:
Den tors 15 aug. 2019 kl 00:16 skrev Anthony D'Atri 
mailto:a...@dreamsnake.net>>:


Good points in both posts, but I think there’s still some unclarity.


...

We’ve seen good explanations on the list of why only specific DB
sizes, say 30GB, are actually used _for the DB_.
If the WAL goes along with the DB, shouldn’t we also explicitly
determine an appropriate size N for the WAL, and make the
partition (30+N) GB?
If so, how do we derive N?  Or is it a constant?

Filestore was so much simpler, 10GB set+forget for the journal. 
Not that I miss XFS, mind you.


But we got a simple handwaving-best-effort-guesstimate that went "WAL 
1GB is fine, yes." so there you have an N you can use for the 30+N or 
60+N sizings.
Can't see how that N needs more science than the filestore N=10G you 
showed. Not that I think journal=10G was wrong or anything.


--
May the most significant bit of your life be positive.



Re: [ceph-users] WAL/DB size

2019-08-14 Thread Mark Nelson


On 8/14/19 1:06 PM, solarflow99 wrote:


Actually standalone WAL is required when you have either a very small fast
device (and don't want db to use it) or three devices (different in
performance) behind the OSD (e.g. hdd, ssd, nvme). So WAL is to be located
at the fastest one.

For the given use case you just have HDD and NVMe, and DB and WAL can
safely collocate. Which means you don't need to allocate a specific volume
for WAL. Hence no need to answer the question of how much space is needed
for WAL. Simply allocate DB and WAL will appear there automatically.


Yes, I'm surprised how often people talk about the DB and WAL 
separately for no good reason.  In common setups bluestore goes on 
flash and the storage goes on the HDDs, simple.


In the event flash is 100s of GB and would be wasted, is there 
anything that needs to be done to set rocksdb to use the highest 
level?  600 I believe






When you first set up the OSD you could manually tweak the level 
sizes/multipliers so that one of the level boundaries + WAL falls 
somewhat under the total allocated size of the DB device.  Keep in mind 
that there can be temporary space usage increases due to compaction.  
Ultimately though I think this is a bad approach. The better bet is the 
work that Igor and Adam are doing:



https://github.com/ceph/ceph/pull/28960

https://github.com/ceph/ceph/pull/29047
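
For anyone doing the level arithmetic by hand in the meantime, a rough
sketch assuming the stock rocksdb settings (256MB max_bytes_for_level_base,
10x multiplier, ~256MB write buffers):

L1 ~ 0.25GB, L2 ~ 2.5GB, L3 ~ 25GB, L4 ~ 250GB

so DB partitions of roughly 3GB, 30GB or 300GB (plus 1-2GB of WAL) are the
sizes at which a whole extra level actually fits on the fast device;
anything in between mostly sits unused except as compaction scratch space.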


Mark



Re: [ceph-users] WAL/DB size

2019-08-13 Thread Mark Nelson

On 8/13/19 3:51 PM, Paul Emmerich wrote:


On Tue, Aug 13, 2019 at 10:04 PM Wido den Hollander  wrote:

I just checked an RGW-only setup. 6TB drive, 58% full, 11.2GB of DB in
use. No slow db in use.

random rgw-only setup here: 12TB drive, 77% full, 48GB metadata and
10GB omap for index and whatever.

That's 0.5% + 0.1%. And that's a setup that's using mostly erasure
coding and small-ish objects.



I've talked with many people from the community and I don't see an
agreement for the 4% rule.

agreed, 4% isn't a reasonable default.
I've seen setups with even 10% metadata usage, but these are weird
edge cases with very small objects on NVMe-only setups (obviously
without a separate DB device).

Paul



I agree, and I did quite a bit of the early space usage analysis.  I 
have a feeling that someone was trying to be well-meaning and make a 
simple ratio for users to target that was big enough to handle the 
majority of use cases.  The problem is that reality isn't that simple 
and one-size-fits all doesn't really work here.



For RBD you can usually get away with far less than 4%.  A small 
fraction of that is often sufficient.  For tiny (say 4K) RGW objects  
(especially objects with very long names!) you potentially can end up 
using significantly more than 4%. Unfortunately there's no really good 
way for us to normalize this so long as RGW is using OMAP to store 
bucket indexes.  I think the best we can do long run is make it much 
clearer how space is being used on the block/db/wal devices and easier 
for users to shrink/grow the amount of "fast" disk they have on an OSD. 
Alternately we could put bucket indexes into rados objects instead of 
OMAP, but that would be a pretty big project (with its own challenges 
but potentially also with rewards).



Mark



Re: [ceph-users] out of memory bluestore osds

2019-08-07 Thread Mark Nelson

Hi Jaime,


we only use the cache size parameters now if you've disabled 
autotuning.  With autotuning we adjust the cache size on the fly to try 
and keep the mapped process memory under the osd_memory_target.  You can 
set a lower memory target than default, though you will have far less 
cache for bluestore onodes and rocksdb.  You may notice that it's 
slower, especially if you have a big active data set you are 
processing.  I don't usually recommend setting the osd_memory_target 
below 2GB.  At some point it will have shrunk the caches as far as it 
can and the process memory may start exceeding the target.  (with our 
default rocksdb and pglog settings this usually happens somewhere 
between 1.3-1.7GB once the OSD has been sufficiently saturated with IO). 
Given memory prices right now, I'd still recommend upgrading RAM if you 
have the ability though.  You might be able to get away with setting 
each OSD to 2-2.5GB in your scenario but you'll be pushing it.



I would not recommend lowering the osd_memory_cache_min.  You really 
want rocksdb indexes/filters fitting in cache, and as many bluestore 
onodes as you can get.  In any event, you'll still be bound by the 
(currently hardcoded) 64MB cache chunk allocation size in the autotuner 
which osd_memory_cache_min can't reduce (and that's per cache while 
osd_memory_cache_min is global for the kv,buffer, and rocksdb block 
caches).  IE each cache is going to get 64MB+growth room regardless of 
how low you set osd_memory_cache_min.  That's intentional as we don't 
want a single SST file in rocksdb to be able to completely blow 
everything else out of the block cache during compaction, only to 
quickly become invalid, removed from the cache, and make it look to the 
priority cache system like rocksdb doesn't actually need any more memory 
for cache.
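
For the 32GB / 12-OSD node described below, a sketch of what that would look
like (set it in ceph.conf and restart the OSDs; expect smaller caches and
slower metadata-heavy workloads):

[osd]
osd_memory_target = 2147483648    # 2GB per OSD; 12 OSDs is ~24GB, leaving headroom for the OS

and leave osd_memory_cache_min at its default.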



Mark


On 8/7/19 7:44 AM, Jaime Ibar wrote:

Hi all,

we run a Ceph Luminous 12.2.12 cluster, 7 OSD servers with 12x4TB disks each.
Recently we redeployed the OSDs of one of them using the bluestore backend;
however, after this, we're facing out of memory errors (invoked
oom-killer)

and the OS kills one of the ceph-osd process.
The osd is restarted automatically and back online after one minute.
We're running Ubuntu 16.04, kernel 4.15.0-55-generic.
The server has 32GB of RAM and 4GB of swap partition.
All the disks are hdd, no ssd disks.
Bluestore settings are the default ones

"osd_memory_target": "4294967296"
"osd_memory_cache_min": "134217728"
"bluestore_cache_size": "0"
"bluestore_cache_size_hdd": "1073741824"
"bluestore_cache_autotune": "true"

As stated in the documentation, bluestore assigns by default 4GB of
RAM per OSD (1GB of RAM per 1TB).
So in this case 48GB of RAM would be needed. Am I right?

Are these the minimun requirements for bluestore?
In case adding more RAM is not an option, can any of
osd_memory_target, osd_memory_cache_min, bluestore_cache_size_hdd
be decrease to fit in our server specs?
Would this have any impact on performance?

Thanks
Jaime




Re: [ceph-users] How to maximize the OSD effective queue depth in Ceph?

2019-08-06 Thread Mark Nelson
You may be interested in using my wallclock profiler to look at lock 
contention:



https://github.com/markhpc/gdbpmp


It will greatly slow down the OSD but will show you where time is being 
spent and so far the results appear to at least be relatively 
informative.  I used it recently when refactoring the bluestore caches 
to trim on add (from multiple threads) and break the bluestore cache 
into separate onode/buffer caches with their own locks:



https://github.com/ceph/ceph/pull/28597


One of the things you'll notice is that we have a single kv sync 
thread.  Historically that has been one of the limiting factors in terms 
of write throughput, though these days I tend to see a mix of various 
factors (potentially the shardedopwq, optracker, kv sync, etc).  
Certainly lock contention plays a part here.



Mark


On 8/6/19 11:41 AM, Mark Lehrer wrote:

I have a few more cycles this week to dedicate to the problem of
making OSDs do more than maybe 5 simultaneous operations (as measured
by the iostat effective queue depth of the drive).

However, I'm starting to think that the problem isn't with the number
of threads that have work to do... the problem may just be that the
OSD & PG code has enough thread locking happening that there is no
possible way to have more than a few things happening on a single OSD
(or perhaps a single placement group).

Has anyone thought about the problem from this angle?  This would help
explain why multiple-OSDs-per-SSD is so effective (even though the
thought of doing this in production is utterly terrifying).

For my next set of tests, I'll try some multi-pool testing and see if
isolating the placement groups helps with the thread limitations I'm
seeing.  Last time, I was testing multiple RBDs in the same pool.

Thanks,
Mark



On Sat, May 11, 2019 at 5:50 AM Maged Mokhtar  wrote:


On 10/05/2019 19:54, Mark Lehrer wrote:

I'm setting up a new Ceph cluster with fast SSD drives, and there is
one problem I want to make sure to address straight away:
comically-low OSD queue depths.

On the past several clusters I built, there was one major performance
problem that I never had time to really solve, which is this:
regardless of how much work the RBDs were being asked to do, the OSD
effective queue depth (as measured by iostat's "avgrq-sz" column)
never went above 3... even if I had multiple RBDs with queue depths in
the thousands.

This made sense back in the old days of spinning drives.  However, for
example with these particular drives and a 4K or 16K block size you
don't see maximum read performance until the queue depth gets to 50+.
At a queue depth of 4 the bandwidth is less than 20% what it is at
256.  The bottom line here is that Ceph performance is simply
embarrassing whenever the OSD effective queue depth is in single
digits.

On my last cluster, I spent a week or two researching and trying OSD
config parameters trying to increase the queue depth.  So far, the
only effective method I have seen to increase the effective OSD queue
depth is a gross hack - using multiple partitions per SSD to create
multiple OSDs.

My questions:

1) Is there anyone on this list who has solved this problem already?
On the performance articles I have seen, the authors don't show iostat
results (or any OSD effective queue depth numbers) so I can't really
tell.

2) If there isn't a good response to #1, is anyone else out there able
to do some experimentation to help figure this out?  All you would
need to do to get started is collect the output of this command while
a high-QD rbd test is happening: "iostat -mtxy 1" -- you should
collect it on all of the OSD servers as well as the client (you will
want to attach an RBD and talk to it via /dev/rbd0 otherwise iostat
probably won't see it).

3) If there is any technical reason why this is impossible, please let
me know before I get to far down this road... but because the multiple
partitions trick works so well I expect it must be possible somehow.

Thanks,
Mark

i assume you mean avgqu-sz (queue size) rather than avgrq-sz (request
size). if so, what avgrq-sz do you get ? what kernel and io scheduler
being used ?

It is not uncommon if the system is not well tuned for your workload,
you may have a bottleneck in cpu running near 100% and your disks would
be single digit % busy, the faster your disks are and the more disks you
have, the less they will be busy if there is some cpu or network
bottleneck. If so the queue depth on them will be very low.

It is also possible the cluster has good performance but the bottleneck
is from the client(s) doing the test and is/are not fast enough to fully
stress your cluster, hence your disks.

To know more, we need more numbers:
-How many SSDs/OSDs do you have, what is their raw device random 4k
write sync iops ?
-How many hosts and cpu cores do you have ?

Re: [ceph-users] Bluestore caching oddities, again

2019-08-05 Thread Mark Nelson


On 8/4/19 7:36 PM, Christian Balzer wrote:

Hello,

On Sun, 4 Aug 2019 06:34:46 -0500 Mark Nelson wrote:


On 8/4/19 6:09 AM, Paul Emmerich wrote:


On Sun, Aug 4, 2019 at 3:47 AM Christian Balzer  wrote:
  

2. Bluestore caching still broken
When writing data with the fios below, it isn't cached on the OSDs.
Worse, existing cached data that gets overwritten is removed from the
cache, which while of course correct can't be free in terms of allocation
overhead.
Why not doing what any sensible person would expect from experience with
any other cache there is, cache writes in case the data gets read again
soon and in case of overwrites use existing allocations.

This is by design.
The BlueStore only populates its cache on reads, not on writes. The idea is
that a reasonable application does not read data it just wrote (and if it does
it's already cached at a higher layer like the page cache or a cache on the
hypervisor).


Note that this behavior can be changed by setting
bluestore_default_buffered_write = true.


Thanks to Mark for his detailed reply.
Given the points I assume that with HDD backed (but SSD WAL/DB) OSDs it's
not actually a performance killer?



Not typically from the overhead perspective (ie CPU usage shouldn't be 
an issue unless you have a lot of HDDs and wimpy CPUs or possibly if you 
are also doing EC/compression/encryption with lots of small IO).  The 
next question though is if you are better off caching bluestore onodes 
vs rocksdb block cache vs object data.  When you have DB/WAL on the same 
device as bluestore block, you typically want to prioritize rocksdb 
indexes/filters, bluestore onode, rocksdb block cache, and bluestore 
data in that order (the ratios here though are very workload 
dependent).  If you have HDD + SSD DB/WAL, you probably still want to 
cache the indexes/filters with high priority (these are relatively small 
and will reduce read amplification in the DB significantly!).  Now 
caching bluestore onodes and rocksdb block cache may be less important 
since the SSDs may be able to handle the metadata reads fast enough to 
have little impact on the HDD side of things.  Not all SSDs are made 
equal and people often like to put multiple DB/WALs on a single SSD, so 
all of this can be pretty hardware dependent.  You'll also eat more CPU 
going this path due to encode/decode between bluestore and rocksdb and 
all of the work involved in finding the right key/value pair in rocksdb 
itself. So there are definitely going to be hardware-dependent 
trade-offs (ie even if it's faster on HDD/SSD setups to focus on 
bluestore buffer cache, you may eat more CPU per IO doing it).  Probably 
the take-away is that if you have really beefy CPUs and really fast 
SSDs in a HDD+SSD setup, it may be worth trying a higher buffer cache 
ratio and see what happens.



Note that with the prioritycachemanager and osd memory autotuning, if 
you enable bluestore_default_buffered_write and neither the rocksdb 
block cache nor the bluestore onode cache need more memory, the rest 
automatically gets assigned to bluestore buffer cache for objects.
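
A sketch of what turning that on looks like (set in ceph.conf and restart
the OSDs, or inject it at runtime; mind the CPU/lock contention caveat for
very fast flash mentioned elsewhere in this thread):

[osd]
bluestore_default_buffered_write = true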



Mark


I'll test that of course but a gut feeling or ball park would be
appreciated by probably more people than me.

As Paul's argument, I'm not buying it because:
- It's a complete paradigm change when comparing it to filestore. Somebody
   migrating from FS to BS is likely to experience yet another performance
   decrease they didn't expect.
- Arguing for larger caches on the client only increases the cost of Ceph
   further. In that vein, BS currently can't utilize as much memory as FS
   did for caching in a safe manner.
- Use cases like a DB with enough caching to deal with the normal working
   set but doing some hourly crunching on data that exceeds come to mind.
   One application here also processes written data once an hour, more than
   would fit in the VM pagecache, but currently comes from the FS pagecache.
- The overwrites of already cached data _clearly_ indicate a hotness and
   thus should be preferably cached. That bit in particular is upsetting,
   initial write caching or not.
  
Regards,


Christian

FWIW, there's also a CPU usage and lock contention penalty for default
buffered write when using extremely fast flash storage.  A lot of my
recent work on improving cache performance and intelligence in bluestore
is to reduce contention in the onode/buffer cache and also significantly
reduce the impact of default buffered write = true.  The
PriorityCacheManager was a big one to do a better job of autotuning.
Another big one that recently merged was refactoring bluestore's caches
to trim on write (better memory behavior, shorter more frequent trims,
trims distributed across threads) and not share a single lock between
the onode and buffer cache:


https://github.com/ceph/ceph/pull/28597


Ones still coming down the pipe are to avoid double caching onodes in
the bluestore onode cache and rocksdb block cache and age-binning the
LRU

Re: [ceph-users] Bluestore caching oddities, again

2019-08-04 Thread Mark Nelson

On 8/4/19 6:09 AM, Paul Emmerich wrote:


On Sun, Aug 4, 2019 at 3:47 AM Christian Balzer  wrote:


2. Bluestore caching still broken
When writing data with the fios below, it isn't cached on the OSDs.
Worse, existing cached data that gets overwritten is removed from the
cache, which while of course correct can't be free in terms of allocation
overhead.
Why not doing what any sensible person would expect from experience with
any other cache there is, cache writes in case the data gets read again
soon and in case of overwrites use existing allocations.

This is by design.
The BlueStore only populates its cache on reads, not on writes. The idea is
that a reasonable application does not read data it just wrote (and if it does
it's already cached at a higher layer like the page cache or a cache on the
hypervisor).



Note that this behavior can be changed by setting 
bluestore_default_buffered_write = true.



FWIW, there's also a CPU usage and lock contention penalty for default 
buffered write when using extremely fast flash storage.  A lot of my 
recent work on improving cache performance and intelligence in bluestore 
is to reduce contention in the onode/buffer cache and also significantly 
reduce the impact of default buffered write = true.  The 
PriorityCacheManager was a big one to do a better job of autotuning. 
Another big one that recently merged was refactoring bluestore's caches 
to trim on write (better memory behavior, shorter more frequent trims, 
trims distributed across threads) and not share a single lock between 
the onode and buffer cache:



https://github.com/ceph/ceph/pull/28597


Ones still coming down the pipe are to avoid double caching onodes in 
the bluestore onode cache and rocksdb block cache and age-binning the 
LRU caches to better redistribute memory between caches based on 
relative age.  This is the piece that hopefully would let you cache on 
write while still having the priority of those cached writes quickly 
fall off if they are never read back (the more cache you have, the more 
effective this would be at keeping the onode/omap ratios relatively higher).



Mark






Paul


Re: [ceph-users] High memory usage OSD with BlueStore

2019-08-01 Thread Mark Nelson

Hi Danny,


Are your arm binaries built using tcmalloc?  At least on x86 we saw 
significantly higher memory fragmentation and memory usage with glibc 
malloc.



First, you can look at the mempool stats which may provide a hint:


ceph daemon osd.NNN dump_mempools


Assuming you are using tcmalloc and have the cache autotuning enabled, 
you can also enable debug_bluestore = "5" and debug_prioritycache = "5" 
on one of the OSDs that is using lots of memory.  Look for the lines 
containing "cache_size" "tune_memory target".  Those will tell you how 
much of your memory is being devoted for bluestore caches and how it's 
being divided up between kv, buffer, and rocksdb block cache.
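
Something along these lines should surface those without a restart (osd.NNN
and the log path are placeholders):

ceph tell osd.NNN injectargs '--debug_bluestore 5/5 --debug_prioritycache 5/5'
grep -E 'tune_memory target|cache_size' /var/log/ceph/ceph-osd.NNN.log | tail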



Mark

On 8/1/19 4:25 AM, dannyyang(杨耿丹) wrote:


Hi all:
we have a cephfs env, ceph version is 12.2.10, servers are arm, but fuse clients
are x86, osd disk size is 8T, some OSDs use 12GB memory, is that normal?






Re: [ceph-users] New best practices for osds???

2019-07-26 Thread Mark Nelson


On 7/25/19 9:27 PM, Anthony D'Atri wrote:

We run few hundred HDD OSDs for our backup cluster, we set one RAID 0 per HDD 
in order to be able
to use -battery protected- write cache from the RAID controller. It really 
improves performance, for both
bluestore and filestore OSDs.

Having run something like 6000 HDD-based FileStore OSDs with colo journals on 
RAID HBAs I’d like to offer some contrasting thoughts.

TL;DR:  Never again!  False economy.  ymmv.

Details:

* The implementation predated me and was carved in dogfood^H^H^H^H^H^H^Hstone, 
try as I might I could not get it fixed.

* Single-drive RAID0 VDs were created to expose the underlying drives to the 
OS.  When the architecture was conceived, the HBAs in question didn’t have 
JBOD/passthrough, though a firmware update shortly thereafter did bring that 
ability.  That caching was a function of VDs wasn’t known at the time.

* My sense was that the FBWC did offer some throughput performance for at least 
some workloads, but at the cost of latency.

* Using a RAID-capable HBA in IR mode with FBWC meant having to monitor for the 
presence and status of the BBU/supercap

* The utility needed for that monitoring, when invoked with ostensibly 
innocuous parameters, would lock up the HBA for several seconds.

* Traditional BBUs are rated for lifespan of *only* one year.  FBWCs maybe for 
… three?  Significant cost to RMA or replace them:  time and karma wasted 
fighting with the system vendor CSO, engineer and remote hands time to take the 
system down and swap.  And then the connectors for the supercap were touchy; 
15% of the time the system would come up and not see it at all.

* The RAID-capable HBA itself + FBWC + supercap cost …. a couple three hundred 
more than an IT / JBOD equivalent

* There was a little-known flaw in secondary firmware that caused FBWC / 
supercap modules to be falsely reported bad.  The system vendor acted like I 
was making this up and washed their hands of it, even when I provided them the 
HBA vendors’ artifacts and documents.

* There were two design flaws that could and did result in cache data loss when 
a system rebooted or lost power.  There was a field notice for this, which 
required harvesting serial numbers and checking each.  The affected range of 
serials was quite a bit larger than what the validation tool admitted.  I had 
to manage the replacement of 302+ of these in production use, each needing 
engineer time time to manage Ceph, to do the hands work, and hassle with RMA 
paperwork.

* There was a firmware / utility design flaw that caused the HDD’s onboard 
volatile write cache to be silently turned on, despite an HBA config dump 
showing a setting that should have left it off.  Again data was lost when a 
node crashed hard or lost power.

* There was another firmware flaw that prevented booting if there was pinned / 
preserved cache data after a reboot / power loss if a drive failed or was 
yanked.  The HBA’s option ROM utility would block booting and wait for input on 
the console.  One could get in and tell it to discard that cache, but it would 
not actually do so, instead looping back to the same screen.  The only way to 
get the system to boot again was to replace and RMA the HBA.

* The VD layer lessened the usefulness of iostat data.  It also complicated OSD 
deployment / removal / replacement.  A smartctl hack to access SMART attributes 
below the VD layer would work on some systems but not others.

* The HBA model in question would work normally with a certain CPU generation, 
but not with slightly newer servers with the next CPU generation.  They would 
randomly, on roughly one boot out of five, negotiate PCIe gen3 which they 
weren’t capable of handling properly, and would silently run at about 20% of 
normal speed.  Granted this isn’t necessarily specific to an IR HBA.



Add it all up, and my assertion is that the money, time, karma, and user impact 
you save from NOT dealing with a RAID HBA *more than pays for* using SSDs for 
OSDs instead.



This is worse than I feared, but very much in the realm of concerns I 
had with using single-disk RAID0 setups.  Thank you very much for 
posting your experience!  My money would still be on using *high write 
endurance* NVMes for DB/WAL and whatever I could afford for block.  I 
still have vague hopes that in the long run we move away from the idea 
of distinct block/db/wal devices and toward pools of resources that 
the OSD makes its own decisions about.  I'd like to be able to hand the 
OSD a pile of hardware and say "go".  That might mean something like an 
internal caching scheme but with slow eviction and initial placement 
hints (IE L0 SST files should nearly always end up on fast storage).



If it were structured like the PriorityCacheManager, we'd have SSTs for 
different column family prefixes (OMAP, onodes, etc) competing for fast 
BlueFS device storage with bluestore at different priority levels (so 
for example onode L0 would be 

Re: [ceph-users] Observation of bluestore db/wal performance

2019-07-21 Thread Mark Nelson
FWIW, the DB and WAL don't really do the same thing that the cache tier 
does.  The WAL is similar to filestore's journal, and the DB is 
primarily for storing metadata (onodes, blobs, extents, and OMAP data).  
Offloading these things to an SSD will definitely help, but you won't 
see the same kind of behavior that you would see with cache tiering 
(especially if the workload is small enough to fit entirely in the cache 
tier).



IMHO the biggest performance consideration with cache tiering is when 
your workload doesn't fit entirely in the cache and you are evicting 
large quantities of data over the network.  Depending on a variety of 
factors this can be pretty slow (and in fact can be slower than not 
using a cache tier at all!).  If your workload fits entirely within the 
cache tier though, it's almost certainly going to be faster than 
bluestore without a cache tier.



Mark


On 7/21/19 9:39 AM, Shawn Iverson wrote:
Just wanted to post an observation here.  Perhaps someone with 
resources to perform some performance tests is interested in comparing 
or has some insight into why I observed this.


Background:

12 node ceph cluster
3-way replicated by chassis group
3 chassis groups
4 nodes per chassis
running Luminous (up to date)
heavy use of block storage for kvm virtual machines (proxmox)
some cephfs usage (<10%)
~100 OSDs
~100 pgs/osd
500GB average OSD capacity

I recently attempted to do away with my ssd cache tier on Luminous and 
replace it with bluestore with db/wal on ssd as this seemed to be a 
better practice, or so I thought.


Sadly, after 2 weeks of rebuilding OSDs and placing the db/wal on 
ssd, I was sorely disappointed with performance. My cluster performed 
poorly.  It seemed that the db/wal on ssd did not boost performance as 
I was used to having.  I used 60gb for the size.  Unfortunately, I did 
not have enough ssd capacity to make it any larger for my OSDs


Despite the words of caution on the Ceph docs in regard to replicated 
base tier and replicated cache-tier, I returned to cache tiering.


Performance has returned to expectations.

It would be interesting if someone had the spare iron and resources to 
benchmark bluestore OSDs with SSD db/wal against cache tiering and 
provide some statistics.


--
Shawn Iverson, CETL
Director of Technology
Rush County Schools
765-932-3901 option 7
ivers...@rushville.k12.in.us 

Cybersecurity



Re: [ceph-users] which tool to use for benchmarking rgw s3, yscb or cosbench

2019-07-21 Thread Mark Nelson

Hi Wei Zhao,


I've used ycsb for mongodb on rbd testing before.  It worked fine and 
was pretty straightforward to run.  The only real concern I had was that 
many of the default workloads used a zipfian distribution for reads.  
This basically meant reads were entirely coming from cache and didn't 
really test the storage system at all.  I ended up creating some of my own 
profiles so that we could test both the default zipfian read setup and 
using a random read distribution as well.  I hadn't heard about nor have 
used the YCSB S3 tests, but I would be very interested in giving it a 
try. Cosbench can be a bit heavy if you only need to run a couple of 
simple tests and have other tools for the test orchestration and data 
visualization.



Mark


On 7/21/19 10:51 AM, Wei Zhao wrote:

Hi:
   I found cosbench is a very convenient tool for benchmarking rgw. But
when I read papers ,  I found YCSB tool,
https://github.com/brianfrankcooper/YCSB/tree/master/s3 . It seems
that this is used for testing cloud services, and seems like the right tool
for our service. Has anyone tried this tool? How does it compare to
cosbench?


Re: [ceph-users] Bluestore Runaway Memory

2019-07-18 Thread Mark Nelson

Hi Brett,


Can you enable debug_bluestore = 5 and debug_prioritycache = 5 on one of 
the OSDs that's showing the behavior?  You'll want to look in the logs 
for lines that look like this:



2019-07-18T19:34:42.587-0400 7f4048b8d700  5 prioritycache tune_memory 
target: 4294967296 mapped: 4260962304 unmapped: 856948736 heap: 
5117911040 old mem: 2845415707 new mem: 2845415707
2019-07-18T19:34:33.527-0400 7f4048b8d700  5 
bluestore.MempoolThread(0x55a6d330ead0) _resize_shards cache_size: 
2845415707 kv_alloc: 1241513984 kv_used: 874833889 meta_alloc: 
1258291200 meta_used: 889040246 data_alloc: 318767104 data_used: 0


The first line will tell you what your memory target is set to, how much 
memory is currently mapped, how much is unmapped (ie what's been freed 
but the kernel hasn't reclaimed), the total heap size, and the old and 
new aggregate size for all of bluestore's caches.  The second line also 
tells you the aggregate cache size, and then how much space is being 
allocated and used for the kv, meta, and data caches.  If there's a leak 
somewhere in the OSD or bluestore the autotuner will shrink the cache 
way down but eventually won't be able to contain it, and your 
process will start growing beyond the target size despite having a tiny 
amount of bluestore cache.  If it's something else like a huge amount of 
freed memory not being reclaimed by the kernel, you'll see large amount 
of unmapped memory and a big heap size despite the mapped memory staying 
near the target.  If it's a bug in the autotuner, we might see the 
mapped memory greatly exceeding the target.
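
If the unmapped number turns out to be the big one, the tcmalloc heap
commands over the admin socket are a quick way to confirm it and temporarily
hand the memory back to the kernel (osd.NNN is a placeholder):

ceph daemon osd.NNN heap stats
ceph daemon osd.NNN heap release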



Mark


On 7/18/19 4:02 PM, Brett Kelly wrote:


Hello,

We have a Nautilus cluster exhibiting what looks like this bug: 
https://tracker.ceph.com/issues/39618


No matter what is set as the osd_memory_target (currently 2147483648 
), each OSD process will surpass this value and peak around ~4.0GB 
then eventually start using swap. Cluster stays stable for about a 
week and then starts running into OOM issues, kills off OSDs and 
requires a reboot of each node to get back to a stable state.


Has anyone run into anything similar, or found workarounds?

Ceph version: 14.2.1, RGW Clients

CentOS Linux release 7.6.1810 (Core)

Kernel: 3.10.0-957.12.1.el7.x86_64

256GB RAM per OSD node, 60 OSD's in each node.


Thanks,

--
Brett Kelly




Re: [ceph-users] New best practices for osds???

2019-07-17 Thread Mark Nelson
Some of the first performance studies we did back at Inktank were 
looking at RAID-0 vs JBOD setups! :)  You are absolutely right that the 
controller cache (especially write-back with a battery or supercap) can 
help with HDD-only configurations.  Where we typically saw problems was 
when you load up a chassis with lots of drives and use SAS expanders 
with a single controller.  In some cases we saw higher tail latency and 
of course you can hit throughput limitations for large IOs too with 
enough disks.  This happens much quicker if you've got SSDs in the mix 
for DB/WAL too. Back then, the question really was whether you were 
better off investing money in a controller+cache or jbod-only setup with 
flash journals (now DB/WAL).  I'm guessing it's still worth prioritizing 
flash over the controller, but if you've already got the controller it 
may not be a bad idea to use single-disk RAID0 depending on your use 
case.  JBOD does make system management a bit more friendly imho.



Regarding Disk Cache:  We've diagnosed some very strange disk cache 
behavior with customers in the past.  Nothing recent though.



Mark


On 7/17/19 7:27 AM, John Petrini wrote:
Dell has a whitepaper that compares Ceph performance using JBOD and 
RAID-0 per disk that recommends RAID-0 for HDD's: 
en.community.dell.com/techcenter/cloud/m/dell_cloud_resources/20442913/download 



After switching from JBOD to RAID-0 we saw a huge reduction in 
latency, the difference was much more significant than that whitepaper 
shows. RAID-0 allows us to leverage the controller cache which has 
major performance improvements when used with HDD's. We also disable 
the disk cache on our HDD's and SSD's as we had inconsistent 
performance with disk cache enabled.


As always I'd suggest testing various configurations with your own 
hardware but I wouldn't shy away from RAID-0 simply because of "best 
practice".




Re: [ceph-users] bluestore_allocated vs bluestore_stored

2019-06-17 Thread Mark Nelson
Earlier in bluestore's life, we couldn't handle a 4K min_alloc size on 
NVMe without incurring pretty significant slowdowns (and also generally 
higher amounts of metadata in the DB).  Lately I've been seeing some 
indications that we've improved the stack to the point where 4K 
min_alloc no longer is significantly slower on NVMe than 16K.  It might 
be time to consider switching back for Octopus.  On the HDD side I'm not 
sure if we want to consider dropping down from 64K.  There are 
definitely going to be some trade-offs there.
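
For anyone who wants to experiment, a sketch of what that looks like (these
only apply to OSDs created after the change, since min_alloc_size is baked
in at mkfs time; the 4K value on flash is the experimental choice discussed
above, and 64K is the usual HDD default):

[osd]
bluestore_min_alloc_size_ssd = 4096
bluestore_min_alloc_size_hdd = 65536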



Mark


On 6/17/19 3:22 AM, Igor Fedotov wrote:

Hi Maged,

min_alloc_size determines allocation granularity, hence if the object size 
isn't aligned with its value, allocation overhead still takes place.


E.g. with min_alloc_size = 16K and object size = 24K total allocation 
(i.e. bluestore_allocated) would be 32K.


And yes, this overhead is permanent.

Thanks,

Igor

On 6/17/2019 1:06 AM, Maged Mokhtar wrote:

Hi all,

I want to understand more the difference between bluestore_allocated 
and bluestore_stored in the case of no compression. If I am writing 
fixed objects with sizes greater than min alloc size, would 
bluestore_allocated still be higher than bluestore_stored ? If so, is 
this a permanent overhead/penalty or is something the allocator can 
re-use/optimize later as more objects are stored ?


Appreciate any help.

Cheers /Maged



Re: [ceph-users] Verifying current configuration values

2019-06-12 Thread Mark Nelson


On 6/12/19 5:51 PM, Jorge Garcia wrote:
I'm following the bluestore config reference guide and trying to 
change the value for osd_memory_target. I added the following entry in 
the /etc/ceph/ceph.conf file:


  [osd]
  osd_memory_target = 2147483648

and restarted the osd daemons doing "systemctl restart 
ceph-osd.target". Now, how do I verify that the value has changed? I 
have tried "ceph daemon osd.0 config show" and it lists many settings, 
but osd_memory_target isn't one of them. What am I doing wrong?



What version of ceph are you using? Here's a quick dump of one of my 
test OSDs from master (ignore the low memory target):



$ sudo ceph daemon osd.0 config show  | grep osd_memory_target
    "osd_memory_target": "1073741824",
    "osd_memory_target_cgroup_limit_ratio": "0.80",


It looks to me like you may be on an older version of ceph that hasn't 
had the osd_memory_target code backported?
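
A quick way to check whether the option exists at all on that build (if it
does, this returns the current value even when it's just the default):

ceph daemon osd.0 config get osd_memory_target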



Mark



Re: [ceph-users] OSD RAM recommendations

2019-06-07 Thread Mark Nelson
The truth of the matter is that folks try to boil this down to some kind 
of hard and fast rule but it's often not that simple. With our current 
default settings for pglog, rocksdb WAL buffers, etc, the OSD basically 
needs about 1GB of RAM for bare-bones operation (not under recovery or 
extreme write workload) and any additional space is used for various 
caches.  How much memory you need for caches depends on a variety of 
factors, but the big ones are how likely you are to miss bluestore onode 
reads, omap reads, how big the bloomfilters/indexes are in rocksdb 
(which scale with the total number of objects), whether or not cached 
data is important, etc.


So basically the answer is that how much memory you need depends largely 
on how much you care about performance, how many objects are present on 
an OSD, and how many objects (and how much data) you have in your active 
data set.  4GB is sort of our current default memory target per OSD, but 
as someone else mentioned bumping that up to 8-12GB per OSD might make 
sense for OSDs on large NVMe drives.  You can also lower that down to 
about 2GB before you start having real issues, but it definitely can 
have an impact on OSD performance.
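
As a back-of-the-envelope sketch for the example quoted below (12 OSDs per
machine, using the 4GB default target):

12 OSDs x 4GB osd_memory_target = 48GB, plus headroom for the OS, page
cache and recovery, so something like 64GB per machine is comfortable.

That is closest to the "3-5 GB per OSD" option in the list below.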



Mark


On 6/7/19 12:00 PM, Jorge Garcia wrote:
I'm a bit confused by the RAM recommendations for OSD servers. I have 
also seen conflicting information in the lists (1 GB RAM per OSD, 1 GB 
RAM per TB, 3-5 GB RAM per OSD, etc.). I guess I'm a lot better with a 
concrete example:


Say this is your cluster (using Bluestore):

8 Machines serving OSDs. Each machine is the same:

12 x 10 TB disks for data for 120 TB total per machine (1 disk per OSD)

Each machine is running 12 OSD daemons. The whole cluster has 96 OSDs 
(8 x 12) and a total of 960 TB of space.


What is the recommended amount of RAM for each of the 8 machines 
serving OSDs?


- 12 GB (1 GB per OSD)
- 10 GB (1 GB per TB of each OSD)
- 120 GB (1 GB per TB per machine)
- 960 GB (1 GB per TB for the whole cluster)
- 36 to 60 GB (3-5 GB per OSD)
- None of the above (then what is the answer?)

Thanks!

Jorge

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-03 Thread Mark Nelson


On 5/3/19 1:38 AM, Denny Fuchs wrote:

hi,

I never recognized the Debian /etc/default/ceph :-)

=
# Increase tcmalloc cache size
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728


that is, what is active now.



Yep, if you profile the OSD under a small write workload you can see how 
changing this affects tcmalloc's behavior.  This was especially 
important back before we had the async messenger, but I believe we've 
seen evidence that we're still getting some benefit with larger thread 
cache even now.






Huge pages:

# cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never


# dpkg -S  /usr/lib/x86_64-linux-gnu/libjemalloc.so.1
libjemalloc1: /usr/lib/x86_64-linux-gnu/libjemalloc.so.1


so file exists on PROXMOX 5.x (Ceph version 12.2.11-pve1)


If I understand correct: I should try to set bitmap allocator


[osd]
...
bluestore_allocator = bitmap
bluefs_allocator = bitmap

I would restart the nodes one by one and see, what happens.



If you are using 12.2.11 you likely still have the old bitmap allocator 
which we do not recommend using at all.  Igor Fedotov backported his 
excellent new bitmap allocator to 12.2.12 though:



https://github.com/ceph/ceph/tree/v12.2.12/src/os/bluestore
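
Once you are on 12.2.12+ with the [osd] settings above in place and the 
OSDs restarted, you can confirm what each OSD actually picked up via the 
admin socket (osd.0 here is just an example):

ceph daemon osd.0 config get bluestore_allocator
ceph daemon osd.0 config get bluefs_allocator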


Mark




cu denny
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-02 Thread Mark Nelson


On 5/2/19 1:51 PM, Igor Podlesny wrote:

On Fri, 3 May 2019 at 01:29, Mark Nelson  wrote:

On 5/2/19 11:46 AM, Igor Podlesny wrote:

On Thu, 2 May 2019 at 05:02, Mark Nelson  wrote:
[...]

FWIW, if you still have an OSD up with tcmalloc, it's probably worth
looking at the heap stats to see how much memory tcmalloc thinks it's
allocated vs how much RSS memory is being used by the process.  It's
quite possible that there is memory that has been unmapped but that the
kernel can't (or has decided not yet to) reclaim.
Transparent huge pages can potentially have an effect here both with tcmalloc 
and with
jemalloc so it's not certain that switching the allocator will fix it entirely.

Most likely wrong. -- Default kernel's settings in regards of THP are "madvise".
None of tcmalloc or jemalloc would madvise() to make it happen.
With fresh enough jemalloc you could have it, but it needs special
malloc.conf'ing.


  From one of our centos nodes with no special actions taken to change
THP settings (though it's possible it was inherited from something else):


$ cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

"madvise" will enter direct reclaim like "always" but only for regions
that are have used madvise(MADV_HUGEPAGE). This is the default behaviour.

-- https://www.kernel.org/doc/Documentation/vm/transhuge.txt



Why are you quoting the description for the madvise setting when that's 
clearly not what was set in the case I just showed you?






And regarding madvise and alternate memory allocators:
https:

[...]

did you ever read any of it?

one link's info:

"By default jemalloc does not use huge pages for heap memory (there is
opt.metadata_thp which uses THP for internal metadata though)"



"It turns out that jemalloc(3) uses madvise(2) extensively to notify the 
operating system that it's done with a range of memory which it had 
previously malloc'ed. Because the machine used transparent huge pages, 
the page size was 2MB. As such, a lot of the memory which was being 
marked with madvise(..., MADV_DONTNEED) was within ranges substantially 
smaller than 2MB. This meant that the operating system never was able to 
evict pages which had ranges marked as MADV_DONTNEED because the entire 
page would have to be unneeded to allow it to be reused.


So despite initially looking like a leak, the operating system itself 
was unable to free memory because of madvise(2) and transparent huge 
pages. This led to sustained memory pressure on the machine 
and redis-server eventually getting OOM killed."
(https://blog.digitalocean.com/transparent-huge-pages-and-alternative-memory-allocators/#fn:4)



I'm not going to argue with you about this.  Test it if you want or don't.


Mark




(and I've said

None of tcmalloc or jemalloc would madvise() to make it happen.
With fresh enough jemalloc you could have it, but it needs special
malloc.conf'ing.

before)


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-02 Thread Mark Nelson



On 5/2/19 11:46 AM, Igor Podlesny wrote:

On Thu, 2 May 2019 at 05:02, Mark Nelson  wrote:
[...]

FWIW, if you still have an OSD up with tcmalloc, it's probably worth
looking at the heap stats to see how much memory tcmalloc thinks it's
allocated vs how much RSS memory is being used by the process.  It's
quite possible that there is memory that has been unmapped but that the
kernel can't (or has decided not yet to) reclaim.
Transparent huge pages can potentially have an effect here both with tcmalloc 
and with
jemalloc so it's not certain that switching the allocator will fix it entirely.

Most likely wrong. -- Default kernel's settings in regards of THP are "madvise".
None of tcmalloc or jemalloc would madvise() to make it happen.
With fresh enough jemalloc you could have it, but it needs special
malloc.conf'ing.



From one of our centos nodes with no special actions taken to change 
THP settings (though it's possible it was inherited from something else):



$ cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never


And regarding madvise and alternate memory allocators:


https://blog.digitalocean.com/transparent-huge-pages-and-alternative-memory-allocators/

https://www.nuodb.com/techblog/linux-transparent-huge-pages-jemalloc-and-nuodb

https://github.com/gperftools/gperftools/issues/1073

https://github.com/jemalloc/jemalloc/issues/1243

https://github.com/jemalloc/jemalloc/issues/1128





First I would just get the heap stats and then after that I would be
very curious if disabling transparent huge pages helps. Alternately,
it's always possible it's a memory leak. :D

RedHat can do better (hopefully). ;-P

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-01 Thread Mark Nelson

On 5/1/19 12:59 AM, Igor Podlesny wrote:

On Tue, 30 Apr 2019 at 20:56, Igor Podlesny  wrote:

On Tue, 30 Apr 2019 at 19:10, Denny Fuchs  wrote:
[..]

Any suggestions ?

-- Try different allocator.

Ah, BTW, except memory allocator there's another option: recently
backported bitmap allocator.
Igor Fedotov wrote about it's expected to have lesser memory footprint
with time:

 http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-April/034299.html

Also I'm not sure though if it's okay to switch existent OSDs "on-fly"
-- changing config and restarting OSDs.
Igor (Fedotov), can you please elaborate on this matter?



FWIW, if you still have an OSD up with tcmalloc, it's probably worth 
looking at the heap stats to see how much memory tcmalloc thinks it's 
allocated vs how much RSS memory is being used by the process.  It's 
quite possible that there is memory that has been unmapped but that the 
kernel can't (or has decided not yet to) reclaim.  Transparent huge 
pages can potentially have an effect here both with tcmalloc and with 
jemalloc so it's not certain that switching the allocator will fix it 
entirely.



First I would just get the heap stats and then after that I would be 
very curious if disabling transparent huge pages helps. Alternately, 
it's always possible it's a memory leak. :D
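
Concretely, the heap stats can be pulled from a running OSD like this (the 
osd id is an example), and a release can be forced to see whether the 
kernel then hands the memory back:

ceph tell osd.12 heap stats
ceph tell osd.12 heap release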



Mark

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to reduce HDD OSD flapping due to rocksdb compacting event?

2019-04-12 Thread Mark Nelson
They have the same issue, but depending on the SSD may be better at 
absorbing the extra IO if network or CPU are bigger bottlenecks.  That's 
one of the reasons that a lot of folks like to put the DB on flash for 
HDD based clusters.  It's still possible to oversubscribe them, but 
you've got more headroom.



Mark

On 4/12/19 10:25 AM, Charles Alva wrote:

Thanks Mark,

This is interesting. I'll take a look at the links you provided.

Does rocksdb compacting issue only affect HDDs? Or SSDs are having 
same issue?


Kind regards,

Charles Alva
Sent from Gmail Mobile

On Fri, Apr 12, 2019, 9:01 PM Mark Nelson  wrote:


Hi Charles,


Basically the goal is to reduce write-amplification as much as
possible.  The deeper that the rocksdb hierarchy gets, the worse the
write-amplification for compaction is going to be.  If you look at the
OSD logs you'll see the write-amp factors for compaction in the
rocksdb
compaction summary sections that periodically pop up. There's a
couple
of things we are trying to see if we can improve things on our end:


1) Adam has been working on experimenting with sharding data across
multiple column families.  The idea here is that it might be
better to
have multiple L0 and L1 levels rather than L0, L1, L2 and L3. I'm not
sure if this will pan out or not, but that was one of the goals
behind
trying this.


2) Toshiba recently released trocksdb which could have a really big
impact on compaction write amplification:


Code: https://github.com/ToshibaMemoryAmerica/trocksdb/tree/TRocksRel

Wiki: https://github.com/ToshibaMemoryAmerica/trocksdb/wiki


I recently took a look to see if our key/value size distribution
would
work well with the approach that trocksdb is taking to reduce
write-amplification:



https://docs.google.com/spreadsheets/d/1fNFI8U-JRkU5uaRJzgg5rNxqhgRJFlDB4TsTAVsuYkk/edit?usp=sharing


The good news is that it sounds like the "Trocks Ratio" for the
data we
put in rocksdb is sufficiently high that we'd see some benefit
since it
should greatly reduce write-amplification during compaction for data
(but not keys). This doesn't help your immediate problem, but I
wanted
you to know that you aren't the only one and we are thinking about
ways
to reduce the compaction impact.


Mark


On 4/10/19 2:07 AM, Charles Alva wrote:
> Hi Ceph Users,
>
> Is there a way around to minimize rocksdb compacting event so
that it
> won't use all the spinning disk IO utilization and avoid it being
> marked as down due to fail to send heartbeat to others?
>
> Right now we have frequent high IO disk utilization for every 20-25
> minutes where the rocksdb reaches level 4 with 67GB data to compact.
>
>
> Kind regards,
>
> Charles Alva
> Sent from Gmail Mobile
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to reduce HDD OSD flapping due to rocksdb compacting event?

2019-04-12 Thread Mark Nelson

Hi Charles,


Basically the goal is to reduce write-amplification as much as 
possible.  The deeper that the rocksdb hierarchy gets, the worse the 
write-amplification for compaction is going to be.  If you look at the 
OSD logs you'll see the write-amp factors for compaction in the rocksdb 
compaction summary sections that periodically pop up. There's a couple 
of things we are trying to see if we can improve things on our end:



1) Adam has been working on experimenting with sharding data across 
multiple column families.  The idea here is that it might be better to 
have multiple L0 and L1 levels rather than L0, L1, L2 and L3.  I'm not 
sure if this will pan out or not, but that was one of the goals behind 
trying this.



2) Toshiba recently released trocksdb which could have a really big 
impact on compaction write amplification:



Code: https://github.com/ToshibaMemoryAmerica/trocksdb/tree/TRocksRel

Wiki: https://github.com/ToshibaMemoryAmerica/trocksdb/wiki


I recently took a look to see if our key/value size distribution would 
work well with the approach that trocksdb is taking to reduce 
write-amplification:



https://docs.google.com/spreadsheets/d/1fNFI8U-JRkU5uaRJzgg5rNxqhgRJFlDB4TsTAVsuYkk/edit?usp=sharing


The good news is that it sounds like the "Trocks Ratio" for the data we 
put in rocksdb is sufficiently high that we'd see some benefit since it 
should greatly reduce write-amplification during compaction for data 
(but not keys). This doesn't help your immediate problem, but I wanted 
you to know that you aren't the only one and we are thinking about ways 
to reduce the compaction impact.
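
For anyone who wants to see those numbers on their own cluster: the rocksdb 
compaction summaries are periodically written to the OSD log (assuming the 
default rocksdb logging level), and the W-Amp column in that table is the 
per-level compaction write amplification. The log path below assumes a 
default deployment; adjust the OSD id:

grep -A 20 "Compaction Stats" /var/log/ceph/ceph-osd.12.log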



Mark


On 4/10/19 2:07 AM, Charles Alva wrote:

Hi Ceph Users,

Is there a way around to minimize rocksdb compacting event so that it 
won't use all the spinning disk IO utilization and avoid it being 
marked as down due to fail to send heartbeat to others?


Right now we have frequent high IO disk utilization for every 20-25 
minutes where the rocksdb reaches level 4 with 67GB data to compact.



Kind regards,

Charles Alva
Sent from Gmail Mobile

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd_memory_target exceeding on Luminous OSD BlueStore

2019-04-10 Thread Mark Nelson

In fact the autotuner does it itself every time it tunes the cache size:


https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L3630


Mark


On 4/10/19 2:53 AM, Frédéric Nass wrote:

Hi everyone,

So if the kernel is able to reclaim those pages, is there still a 
point in running the heap release on a regular basis?


Regards,
Frédéric.

Le 09/04/2019 à 19:33, Olivier Bonvalet a écrit :

Good point, thanks !

By making memory pressure (by playing with vm.min_free_kbytes), memory
is freed by the kernel.

So I think I essentially need to update monitoring rules, to avoid
false positive.

Thanks, I continue to read your resources.


Le mardi 09 avril 2019 à 09:30 -0500, Mark Nelson a écrit :

My understanding is that basically the kernel is either unable or
uninterested (maybe due to lack of memory pressure?) in reclaiming
the
memory .  It's possible you might have better behavior if you set
/sys/kernel/mm/khugepaged/max_ptes_none to a low value (maybe 0) or
maybe disable transparent huge pages entirely.


Some background:

https://github.com/gperftools/gperftools/issues/1073

https://blog.nelhage.com/post/transparent-hugepages/

https://www.kernel.org/doc/Documentation/vm/transhuge.txt


Mark


On 4/9/19 7:31 AM, Olivier Bonvalet wrote:

Well, Dan seems to be right :

_tune_cache_size
  target: 4294967296
    heap: 6514409472
    unmapped: 2267537408
  mapped: 4246872064
old cache_size: 2845396873
new cache size: 2845397085


So we have 6GB in heap, but "only" 4GB mapped.

But "ceph tell osd.* heap release" should had release that ?


Thanks,

Olivier


Le lundi 08 avril 2019 à 16:09 -0500, Mark Nelson a écrit :

One of the difficulties with the osd_memory_target work is that
we
can't
tune based on the RSS memory usage of the process. Ultimately
it's up
to
the kernel to decide to reclaim memory and especially with
transparent
huge pages it's tough to judge what the kernel is going to do
even
if
memory has been unmapped by the process.  Instead the autotuner
looks
at
how much memory has been mapped and tries to balance the caches
based
on
that.


In addition to Dan's advice, you might also want to enable debug
bluestore at level 5 and look for lines containing "target:" and
"cache_size:".  These will tell you the current target, the
mapped
memory, unmapped memory, heap size, previous aggregate cache
size,
and
new aggregate cache size.  The other line will give you a break
down
of
how much memory was assigned to each of the bluestore caches and
how
much each case is using.  If there is a memory leak, the
autotuner
can
only do so much.  At some point it will reduce the caches to fit
within
cache_min and leave it there.


Mark


On 4/8/19 5:18 AM, Dan van der Ster wrote:

Which OS are you using?
With CentOS we find that the heap is not always automatically
released. (You can check the heap freelist with `ceph tell
osd.0
heap
stats`).
As a workaround we run this hourly:

ceph tell mon.* heap release
ceph tell osd.* heap release
ceph tell mds.* heap release

-- Dan

On Sat, Apr 6, 2019 at 1:30 PM Olivier Bonvalet <
ceph.l...@daevel.fr> wrote:

Hi,

on a Luminous 12.2.11 deploiement, my bluestore OSD exceed
the
osd_memory_target :

daevel-ob@ssdr712h:~$ ps auxw | grep ceph-osd
ceph    3646 17.1 12.0 6828916 5893136 ? Ssl mars29
1903:42 /usr/bin/ceph-osd -f --cluster ceph --id 143 --
setuser
ceph --setgroup ceph
ceph    3991 12.9 11.2 6342812 5485356 ? Ssl mars29
1443:41 /usr/bin/ceph-osd -f --cluster ceph --id 144 --
setuser
ceph --setgroup ceph
ceph    4361 16.9 11.8 6718432 5783584 ? Ssl mars29
1889:41 /usr/bin/ceph-osd -f --cluster ceph --id 145 --
setuser
ceph --setgroup ceph
ceph    4731 19.7 12.2 6949584 5982040 ? Ssl mars29
2198:47 /usr/bin/ceph-osd -f --cluster ceph --id 146 --
setuser
ceph --setgroup ceph
ceph    5073 16.7 11.6 6639568 5701368 ? Ssl mars29
1866:05 /usr/bin/ceph-osd -f --cluster ceph --id 147 --
setuser
ceph --setgroup ceph
ceph    5417 14.6 11.2 6386764 5519944 ? Ssl mars29
1634:30 /usr/bin/ceph-osd -f --cluster ceph --id 148 --
setuser
ceph --setgroup ceph
ceph    5760 16.9 12.0 6806448 5879624 ? Ssl mars29
1882:42 /usr/bin/ceph-osd -f --cluster ceph --id 149 --
setuser
ceph --setgroup ceph
ceph    6105 16.0 11.6 6576336 5694556 ? Ssl mars29
1782:52 /usr/bin/ceph-osd -f --cluster ceph --id 150 --
setuser
ceph --setgroup ceph

daevel-ob@ssdr712h:~$ free -m
 total    used    free shared  bu
ff/ca
che   available
Mem:  47771   45210    1643 17
    9
17   43556
Swap: 0   0   0

# ceph daemon osd.147 config show | grep memory_target
   "osd_memory_target": "4294967296",


And there is no recovery / backfilling, the cluster is fine :

  $ ceph status
    cluster:
  id: de035250-323d-4cf6-8c4b-cf0faf6296b1
  hea

Re: [ceph-users] osd_memory_target exceeding on Luminous OSD BlueStore

2019-04-09 Thread Mark Nelson
My understanding is that basically the kernel is either unable or 
uninterested (maybe due to lack of memory pressure?) in reclaiming the 
memory.  It's possible you might have better behavior if you set 
/sys/kernel/mm/khugepaged/max_ptes_none to a low value (maybe 0) or 
maybe disable transparent huge pages entirely.
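
For example (sysfs paths per the kernel documentation linked below; make 
the change persistent through your distro's usual mechanism if it turns 
out to help):

echo 0 > /sys/kernel/mm/khugepaged/max_ptes_none
echo never > /sys/kernel/mm/transparent_hugepage/enabled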



Some background:

https://github.com/gperftools/gperftools/issues/1073

https://blog.nelhage.com/post/transparent-hugepages/

https://www.kernel.org/doc/Documentation/vm/transhuge.txt


Mark


On 4/9/19 7:31 AM, Olivier Bonvalet wrote:

Well, Dan seems to be right :

_tune_cache_size
 target: 4294967296
   heap: 6514409472
   unmapped: 2267537408
 mapped: 4246872064
old cache_size: 2845396873
new cache size: 2845397085


So we have 6GB in heap, but "only" 4GB mapped.

But "ceph tell osd.* heap release" should had release that ?


Thanks,

Olivier


Le lundi 08 avril 2019 à 16:09 -0500, Mark Nelson a écrit :

One of the difficulties with the osd_memory_target work is that we
can't
tune based on the RSS memory usage of the process. Ultimately it's up
to
the kernel to decide to reclaim memory and especially with
transparent
huge pages it's tough to judge what the kernel is going to do even
if
memory has been unmapped by the process.  Instead the autotuner looks
at
how much memory has been mapped and tries to balance the caches based
on
that.


In addition to Dan's advice, you might also want to enable debug
bluestore at level 5 and look for lines containing "target:" and
"cache_size:".  These will tell you the current target, the mapped
memory, unmapped memory, heap size, previous aggregate cache size,
and
new aggregate cache size.  The other line will give you a break down
of
how much memory was assigned to each of the bluestore caches and how
much each case is using.  If there is a memory leak, the autotuner
can
only do so much.  At some point it will reduce the caches to fit
within
cache_min and leave it there.


Mark


On 4/8/19 5:18 AM, Dan van der Ster wrote:

Which OS are you using?
With CentOS we find that the heap is not always automatically
released. (You can check the heap freelist with `ceph tell osd.0
heap
stats`).
As a workaround we run this hourly:

ceph tell mon.* heap release
ceph tell osd.* heap release
ceph tell mds.* heap release

-- Dan

On Sat, Apr 6, 2019 at 1:30 PM Olivier Bonvalet <
ceph.l...@daevel.fr> wrote:

Hi,

on a Luminous 12.2.11 deploiement, my bluestore OSD exceed the
osd_memory_target :

daevel-ob@ssdr712h:~$ ps auxw | grep ceph-osd
ceph3646 17.1 12.0 6828916 5893136 ? Ssl  mars29
1903:42 /usr/bin/ceph-osd -f --cluster ceph --id 143 --setuser
ceph --setgroup ceph
ceph3991 12.9 11.2 6342812 5485356 ? Ssl  mars29
1443:41 /usr/bin/ceph-osd -f --cluster ceph --id 144 --setuser
ceph --setgroup ceph
ceph4361 16.9 11.8 6718432 5783584 ? Ssl  mars29
1889:41 /usr/bin/ceph-osd -f --cluster ceph --id 145 --setuser
ceph --setgroup ceph
ceph4731 19.7 12.2 6949584 5982040 ? Ssl  mars29
2198:47 /usr/bin/ceph-osd -f --cluster ceph --id 146 --setuser
ceph --setgroup ceph
ceph5073 16.7 11.6 6639568 5701368 ? Ssl  mars29
1866:05 /usr/bin/ceph-osd -f --cluster ceph --id 147 --setuser
ceph --setgroup ceph
ceph5417 14.6 11.2 6386764 5519944 ? Ssl  mars29
1634:30 /usr/bin/ceph-osd -f --cluster ceph --id 148 --setuser
ceph --setgroup ceph
ceph5760 16.9 12.0 6806448 5879624 ? Ssl  mars29
1882:42 /usr/bin/ceph-osd -f --cluster ceph --id 149 --setuser
ceph --setgroup ceph
ceph6105 16.0 11.6 6576336 5694556 ? Ssl  mars29
1782:52 /usr/bin/ceph-osd -f --cluster ceph --id 150 --setuser
ceph --setgroup ceph

daevel-ob@ssdr712h:~$ free -m
totalusedfree  shared  buff/ca
che   available
Mem:  47771   452101643  17 9
17   43556
Swap: 0   0   0

# ceph daemon osd.147 config show | grep memory_target
  "osd_memory_target": "4294967296",


And there is no recovery / backfilling, the cluster is fine :

 $ ceph status
   cluster:
 id: de035250-323d-4cf6-8c4b-cf0faf6296b1
 health: HEALTH_OK

   services:
 mon: 5 daemons, quorum tolriq,tsyne,olkas,lorunde,amphel
 mgr: tsyne(active), standbys: olkas, tolriq, lorunde,
amphel
 osd: 120 osds: 116 up, 116 in

   data:
 pools:   20 pools, 12736 pgs
 objects: 15.29M objects, 31.1TiB
 usage:   101TiB used, 75.3TiB / 177TiB avail
 pgs: 12732 active+clean
  4 active+clean+scrubbing+deep

   io:
 client:   72.3MiB/s rd, 26.8MiB/s wr, 2.30kop/s rd,
1.29kop/s wr


 On an other host, in the same pool, I see also high memory
usage :

 daevel-ob@ssdr712g:~$ ps auxw | grep ceph-osd
 ceph6287  6.6 10.6 6027388 5190032 ?   

Re: [ceph-users] osd_memory_target exceeding on Luminous OSD BlueStore

2019-04-08 Thread Mark Nelson
One of the difficulties with the osd_memory_target work is that we can't 
tune based on the RSS memory usage of the process. Ultimately it's up to 
the kernel to decide to reclaim memory and especially with transparent 
huge pages it's tough to judge what the kernel is going to do even if 
memory has been unmapped by the process.  Instead the autotuner looks at 
how much memory has been mapped and tries to balance the caches based on 
that.



In addition to Dan's advice, you might also want to enable debug 
bluestore at level 5 and look for lines containing "target:" and 
"cache_size:".  These will tell you the current target, the mapped 
memory, unmapped memory, heap size, previous aggregate cache size, and 
new aggregate cache size.  The other line will give you a breakdown of 
how much memory was assigned to each of the bluestore caches and how 
much each cache is using.  If there is a memory leak, the autotuner can 
only do so much.  At some point it will reduce the caches to fit within 
cache_min and leave it there.
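
A minimal way to do that on a live OSD (the osd id is taken from the ps 
output below and the log path assumes the default /var/log/ceph location):

ceph daemon osd.143 config set debug_bluestore 5/5
grep -E "target:|cache_size:" /var/log/ceph/ceph-osd.143.log | tail -n 20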



Mark


On 4/8/19 5:18 AM, Dan van der Ster wrote:

Which OS are you using?
With CentOS we find that the heap is not always automatically
released. (You can check the heap freelist with `ceph tell osd.0 heap
stats`).
As a workaround we run this hourly:

ceph tell mon.* heap release
ceph tell osd.* heap release
ceph tell mds.* heap release

-- Dan

On Sat, Apr 6, 2019 at 1:30 PM Olivier Bonvalet  wrote:

Hi,

on a Luminous 12.2.11 deployment, my bluestore OSDs exceed the
osd_memory_target:

daevel-ob@ssdr712h:~$ ps auxw | grep ceph-osd
ceph3646 17.1 12.0 6828916 5893136 ? Ssl  mars29 1903:42 
/usr/bin/ceph-osd -f --cluster ceph --id 143 --setuser ceph --setgroup ceph
ceph3991 12.9 11.2 6342812 5485356 ? Ssl  mars29 1443:41 
/usr/bin/ceph-osd -f --cluster ceph --id 144 --setuser ceph --setgroup ceph
ceph4361 16.9 11.8 6718432 5783584 ? Ssl  mars29 1889:41 
/usr/bin/ceph-osd -f --cluster ceph --id 145 --setuser ceph --setgroup ceph
ceph4731 19.7 12.2 6949584 5982040 ? Ssl  mars29 2198:47 
/usr/bin/ceph-osd -f --cluster ceph --id 146 --setuser ceph --setgroup ceph
ceph5073 16.7 11.6 6639568 5701368 ? Ssl  mars29 1866:05 
/usr/bin/ceph-osd -f --cluster ceph --id 147 --setuser ceph --setgroup ceph
ceph5417 14.6 11.2 6386764 5519944 ? Ssl  mars29 1634:30 
/usr/bin/ceph-osd -f --cluster ceph --id 148 --setuser ceph --setgroup ceph
ceph5760 16.9 12.0 6806448 5879624 ? Ssl  mars29 1882:42 
/usr/bin/ceph-osd -f --cluster ceph --id 149 --setuser ceph --setgroup ceph
ceph6105 16.0 11.6 6576336 5694556 ? Ssl  mars29 1782:52 
/usr/bin/ceph-osd -f --cluster ceph --id 150 --setuser ceph --setgroup ceph

daevel-ob@ssdr712h:~$ free -m
   totalusedfree  shared  buff/cache   available
Mem:  47771   452101643  17 917   43556
Swap: 0   0   0

# ceph daemon osd.147 config show | grep memory_target
 "osd_memory_target": "4294967296",


And there is no recovery / backfilling, the cluster is fine :

$ ceph status
  cluster:
id: de035250-323d-4cf6-8c4b-cf0faf6296b1
health: HEALTH_OK

  services:
mon: 5 daemons, quorum tolriq,tsyne,olkas,lorunde,amphel
mgr: tsyne(active), standbys: olkas, tolriq, lorunde, amphel
osd: 120 osds: 116 up, 116 in

  data:
pools:   20 pools, 12736 pgs
objects: 15.29M objects, 31.1TiB
usage:   101TiB used, 75.3TiB / 177TiB avail
pgs: 12732 active+clean
 4 active+clean+scrubbing+deep

  io:
client:   72.3MiB/s rd, 26.8MiB/s wr, 2.30kop/s rd, 1.29kop/s wr


On an other host, in the same pool, I see also high memory usage :

daevel-ob@ssdr712g:~$ ps auxw | grep ceph-osd
ceph6287  6.6 10.6 6027388 5190032 ? Ssl  mars21 1511:07 
/usr/bin/ceph-osd -f --cluster ceph --id 131 --setuser ceph --setgroup ceph
ceph6759  7.3 11.2 6299140 5484412 ? Ssl  mars21 1665:22 
/usr/bin/ceph-osd -f --cluster ceph --id 132 --setuser ceph --setgroup ceph
ceph7114  7.0 11.7 6576168 5756236 ? Ssl  mars21 1612:09 
/usr/bin/ceph-osd -f --cluster ceph --id 133 --setuser ceph --setgroup ceph
ceph7467  7.4 11.1 6244668 5430512 ? Ssl  mars21 1704:06 
/usr/bin/ceph-osd -f --cluster ceph --id 134 --setuser ceph --setgroup ceph
ceph7821  7.7 11.1 6309456 5469376 ? Ssl  mars21 1754:35 
/usr/bin/ceph-osd -f --cluster ceph --id 135 --setuser ceph --setgroup ceph
ceph8174  6.9 11.6 6545224 5705412 ? Ssl  mars21 1590:31 
/usr/bin/ceph-osd -f --cluster ceph --id 136 --setuser ceph --setgroup ceph
ceph8746  6.6 11.1 6290004 5477204 ? Ssl  mars21 1511:11 
/usr/bin/ceph-osd -f --cluster ceph --id 137 --setuser ceph --setgroup ceph
ceph9100  7.7 11.6 6552080 5713560 ? Ssl  

Re: [ceph-users] fio test rbd - single thread - qd1

2019-03-20 Thread Mark Nelson

On 3/20/19 3:12 AM, Vitaliy Filippov wrote:

`cpupower idle-set -D 0` will help you a lot, yes.

However it seems that not only the bluestore makes it slow. >= 50% of 
the latency is introduced by the OSD itself. I'm just trying to 
understand WHAT parts of it are doing so much work. For example in my 
current case (with cpupower idle-set -D 0 of course) when I was 
testing a single OSD on a very good drive (Intel NVMe, capable of 
4+ single-thread sync write iops) it was delivering me only 
950-1000 iops. It's roughly 1 ms latency, and only 50% of it comes 
from bluestore (you can see it `ceph daemon osd.x perf dump`)! I've 
even tuned bluestore a little, so that now I'm getting ~1200 iops from 
it. It means that the bluestore's latency dropped by 33% (it was 
around 1/1000 = 500 us, now it is 1/1200 = ~330 us). But still the 
overall improvement is only 20% - everything else is eaten by the OSD 
itself.




I'd suggest looking in the direction of pglog.  See:


https://www.spinics.net/lists/ceph-devel/msg38975.html


Back around that time I hacked pglog updates out of the code when I was 
testing a custom version of the memstore backend and saw some pretty 
dramatic reductions in CPU usage (and at least somewhat an increase in 
performance).  Unfortunately I think fixing it is going to be a big job, 
but it's high on my list of troublemakers.



Mark


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] optimize bluestore for random write i/o

2019-03-12 Thread Mark Nelson


On 3/12/19 8:40 AM, vita...@yourcmc.ru wrote:

One way or another we can only have a single thread sending writes to
rocksdb.  A lot of the prior optimization work on the write side was
to get as much processing out of the kv_sync_thread as possible.
That's still a worthwhile goal as it's typically what bottlenecks with
high amounts of concurrency.  What I think would be very interesting
though is if we moved more toward a model where we had lots of shards
(OSDs or shards of an OSD) with independent rocksdb instances and less
threading overhead per shard.  That's the way the seastar work is
going, and also sort of the model I've been thinking about for a very
simple single-threaded OSD.


Doesn't rocksdb have pipelined writes? Isn't it better to just use 
that builtin concurrency instead of factoring in your own?



Pipelined writes were added in rocksdb 5.5.1 back in the summer of 
2017.  That wasn't available when bluestore was being written. We may be 
able to make use of it now but I don't think anyone has taken the time 
to figure out how much work it would take or what kind of benefit we 
would get.



Mark

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph block storage - block.db useless?

2019-03-12 Thread Mark Nelson
Our default of 4 256MB WAL buffers is arguably already too big. On one 
hand we are making these buffers large to hopefully avoid short lived 
data going into the DB (pglog writes).  IE if a pglog write comes in and 
later a tombstone invalidating it comes in, we really want those to land 
in the same WAL log to avoid that write being propagated into the DB.  
On the flip side, large buffers mean that there's more work that rocksdb 
has to perform to compare keys to get everything ordered.  This is done 
in the kv_sync_thread where we often bottleneck on small random write 
workloads:



    | | |   |   |   | + 13.30% 
rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::Insert


So on one hand we want large buffers to avoid short lived data going 
into the DB, and on the other hand we want small buffers to avoid large 
amounts of comparisons eating CPU, especially in CPU limited environments.
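
You can see which buffer settings an OSD is actually running with in its 
startup log, where bluestore prints the rocksdb options it applied; the 
4 x 256MB default shows up as max_write_buffer_number = 4 and 
write_buffer_size = 268435456 (the log path assumes a default deployment):

grep "set rocksdb option" /var/log/ceph/ceph-osd.0.log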



Mark



On 3/12/19 8:25 AM, Benjamin Zapiec wrote:

May I configure the size of WAL to increase block.db usage?
For example I configure 20GB I would get an usage of about 48GB on L3.

Or should I stay with ceph defaults?
Is there a maximal size for WAL that makes sense?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] optimize bluestore for random write i/o

2019-03-12 Thread Mark Nelson


On 3/12/19 7:31 AM, vita...@yourcmc.ru wrote:

Decreasing the min_alloc size isn't always a win, but it can be in some
cases.  Originally bluestore_min_alloc_size_ssd was set to 4096 but we
increased it to 16384 because at the time our metadata path was slow
and increasing it resulted in a pretty significant performance win
(along with increasing the WAL buffers in rocksdb to reduce write
amplification).  Since then we've improved the metadata path to the
point where at least on our test nodes performance is pretty close
between min_alloc size = 16k and min_alloc size = 4k the last
time I looked.  It might be a good idea to drop it down to 4k now but
I think we need to be careful because there are tradeoffs.


I think it's all about your disks' latency. Deferred write is 1 
IO+sync and redirect-write is 2 IOs+syncs. So if your IO or sync is 
slow (like it is on HDDs and bad SSDs) then the deferred write is 
better in terms of latency. If your IO is fast then you're only 
bottlenecked by the OSD code itself eating a lot of CPU and then 
direct write may be better. By the way, I think OSD itself is way TOO 
slow currently (see below).



Don't disagree, bluestore's write path has gotten *really* complicated.




The idea I was talking about turned out to be only viable for HDD/slow 
SSDs and only for low iodepths. But the gain is huge - something 
between +50% iops to +100% iops (2x less latency). There is a stupid 
problem in current bluestore implementation which makes it do 2 
journal writes and FSYNCs instead of one for every incoming 
transaction. The details are here: https://tracker.ceph.com/issues/38559


The unnecessary commit is the BlueFS's WAL. All it's doing is 
recording the increased size of a RocksDB WAL file. Which obviously 
shouldn't be required with RocksDB as its default setting is 
"kTolerateCorruptedTailRecords". However, without this setting the WAL 
is not synced to the disk with every write because by some clever 
logic sync_file_range is called only with SYNC_FILE_RANGE_WRITE in the 
corresponding piece of code. Thus the OSD's database gets corrupted 
when you kill it with -9 and thus it's impossible to set 
`bluefs_preextend_wal_files` to true. And thus you get two writes and 
commits instead of one.


I don't know the exact idea behind doing only SYNC_FILE_RANGE_WRITE - 
as I understand there is currently no benefit in doing this. It could 
be a benefit if RocksDB was writing journal in small parts and then 
doing a single sync - but it's always flushing the newly written part 
of a journal to disk as a whole.


The simplest way to fix it is just to add SYNC_FILE_RANGE_WAIT_BEFORE 
and SYNC_FILE_RANGE_WAIT_AFTER to sync_file_range in KernelDevice.cc. 
My pull request is here: https://github.com/ceph/ceph/pull/26909 - 
I've tested this change with 13.2.4 Mimic and 14.1.0 Nautilus and yes, 
it does increase single-thread iops on HDDs two times (!). After this 
change BlueStore becomes actually better than FileStore at least on HDDs.


Another way of fixing it would be to add an explicit bdev->flush at 
the end of the kv_sync_thread, after db->submit_transaction_sync(), 
and possibly remove the redundant sync_file_range at all. But then you 
must do the same in another place in _txc_state_proc, because it's 
also sometimes doing submit_transaction_sync(). In the end I 
personally think that to add flags to sync_file_range is better 
because a function named "submit_transaction_sync" should be in fact 
SYNC! It shouldn't require additional steps from the caller to make 
the data durable.



I'm glad you are peeking under the covers here. :)  There's a lot going 
on here, and it's not immediately obvious what the intent is and the 
failure conditions are.  I suspect the intent here was to err on the 
side of caution but we really need to document this better.  To be fair 
it's not just us, there's confusion and terribleness all the way up to 
the kernel and beyond.





Also I have a small funny test result to share.

I've created one OSD on my laptop on a loop device in a tmpfs (i.e. 
RAM), created 1 RBD image inside it and tested it with `fio 
-ioengine=rbd -direct=1 -bs=4k -rw=randwrite`. Before doing the test 
I've turned off CPU power saving with `cpupower idle-set -D 0`.


The results are:
- filestore: 2200 iops with -iodepth=1 (0.454ms average latency). 8500 
iops with -iodepth=128.
- bluestore: 1800 iops with -iodepth=1 (0.555ms average latency). 9000 
iops with -iodepth=128.
- memstore: 3000 iops with -iodepth=1 (0.333ms average latency). 11000 
iops with -iodepth=128.


If we can think of memstore being a "minimal possible /dev/null" then:
- OSD overhead is 1/3000 = 0.333ms (maybe slighly less, but that 
doesn't matter).

- filestore overhead is 1/2200-1/3000 = 0.121ms
- bluestore overhead is 1/1800-1/3000 = 0.222ms

The conclusion is that bluestore is actually almost TWO TIMES slower 
than filestore in terms of pure latency, and the throughput is only 
slightly 

Re: [ceph-users] Ceph block storage - block.db useless?

2019-03-12 Thread Mark Nelson


On 3/12/19 7:24 AM, Benjamin Zapiec wrote:

Hello,

i was wondering about ceph block.db to be nearly empty and I started
to investigate.

The recommendations from ceph are that block.db should be at least
4% the size of block. So my OSD configuration looks like this:

wal.db   - not explicit specified
block.db - 250GB of SSD storage
block- 6TB



By default we currently use 4 256MB WAL buffers.  2GB should be enough, 
though in most cases you are better off just leaving it on block.db as 
you did below.




Since wal is written to block.db if not available i didn't configured
wal. With the size of 250GB we are slightly above 4%.



WAL will only use about 1GB of that FWIW




So everything should be "fine". But the block.db only contains
about 10GB of data.



If this is an RBD workload, that's quite possible as RBD tends to use 
far less metadata than RGW.





I figured out that an object in block.db gets "amplified" so
the space consumption is much higher than the object itself
would need.



Data in the DB in general will suffer space amplification and it gets 
worse the more levels in rocksdb you have as multiple levels may have 
copies of the same data at different points in time.  The bigger issue 
is that currently an entire level has to fit on the DB device.  IE if 
level 0 takes 1GB, level 1 takes 10GB, level 2 takes 100GB, and level 3 
takes 1000GB, you will only get 0, 1 and 2 on block.db with 250GB.
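
A quick way to check how much of block.db is actually in use, and whether 
any DB data has already spilled over to the slow device, is the bluefs 
section of the OSD perf counters (counter names as of Luminous/Mimic; the 
osd id is an example):

ceph daemon osd.0 perf dump | grep -E "db_total_bytes|db_used_bytes|slow_used_bytes"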





I'm using ceph as storage backend for openstack and raw images
with a size of 10GB and more are common. So if i understand
this correct i have to consider that a 10GB images may
consume 100GB of block.db.



The DB holds metadata for the images (and some metadata for bluestore).  
This is going to be a very small fraction of the overall data size but 
is really important.  Whenever we do a write to an object we first try 
to read some metadata about it (if it exists).  Having those read 
attempts happen quickly is really important to make sure that the write 
happens quickly.





Beside the facts that the image may have a size of 100G and
they are only used for initial reads unitl all changed
blocks gets written to a SSD-only pool i was question me
if i need a block.db and if it would be better to
save the amount of SSD space used for block.db and just
create a 10GB wal.db?



See above.  Also, rocksdb periodically has to compact data and with lots 
of metadata (and as a result lots of levels) it can get pretty slow.  
Having rocksdb on fast storage helps speed that process up and avoid 
write stalls due to level0 compaction (higher level compaction can 
happen in alternate threads).





Has anyone done this before? Anyone who had sufficient SSD space
but stick with wal.db to save SSD space?

If i'm correct the block.db will never be used for huge images.
And even though it may be used for one or two images does this make
sense? The images are used initially to read all unchanged blocks from
it. After a while each VM should access the images pool less and
less due to the changes made in the VM.



The DB is there primarily to store metadata.  RBD doesn't use a lot of 
space but may do a lot of reads from the DB if it can't keep all of the 
bluestore onodes in it's own in-memory cache (the kv cache).  RGW uses 
the DB much more heavily and in some cases you may see 40-50% space 
usage if you have tiny RGW objects (~4KB).  See this spreadsheet for 
more info:



https://drive.google.com/file/d/1Ews2WR-y5k3TMToAm0ZDsm7Gf_fwvyFw/view?usp=sharing


Mark




Any thoughts about this?


Best regards


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 13.2.4 odd memory leak?

2019-03-08 Thread Mark Nelson


On 3/8/19 8:12 AM, Steffen Winther Sørensen wrote:



On 8 Mar 2019, at 14.30, Mark Nelson  wrote:



On 3/8/19 5:56 AM, Steffen Winther Sørensen wrote:


On 5 Mar 2019, at 10.02, Paul Emmerich  wrote:


Yeah, there's a bug in 13.2.4. You need to set it to at least ~1.2GB.

Yeap thanks, setting it at 1G+256M worked :)
Hope this won’t bloat memory during coming weekend VM backups 
through CephFS





FWIW, setting it to 1.2G will almost certainly result in the 
bluestore caches being stuck at cache_min, ie 128MB and the autotuner 
may not be able to keep the OSD memory that low.  I typically 
recommend a bare minimum of 2GB per OSD, and on SSD/NVMe backed OSDs 
3-4+ can improve performance significantly.

This a smaller dev cluster, not much IO, 4 nodes of 16GB & 6x HDD OSD

Just want to avoid consuming swap, which bloated after patching to 
13.2.4 from 13.2.2 after performing VM snapshots to CephFS, Otherwise 
cluster has been fine for ages…

/Steffen



Understood.  We struggled with whether we should have separate HDD and 
SSD defaults for osd_memory_target, but we were seeing other users 
having problems with setting the global default vs the ssd/hdd default 
and not seeing expected behavior.  We decided to have a single 
osd_memory_target to try to make the whole thing simpler with only a 
single parameter to set.  The 4GB/OSD is aggressive but can dramatically 
improve performance on NVMe and we figured that it sort of communicates 
to users where we think the sweet spot is (and as devices and data sets 
get larger, this is going to be even more important).
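
In ceph.conf terms, a workable minimum for this memory-constrained cluster 
would be something like the snippet below (2 GiB instead of the 1 GiB that 
tripped the 13.2.4 bug; restart the OSDs afterwards):

[osd]
osd memory target = 2147483648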



Mark







Mark




On Tue, Mar 5, 2019 at 9:00 AM Steffen Winther Sørensen  wrote:



On 4 Mar 2019, at 16.09, Paul Emmerich  wrote:


Bloated to ~4 GB per OSD and you are on HDDs?

Something like that yes.


13.2.3 backported the cache auto-tuning which targets 4 GB memory
usage by default.


See https://ceph.com/releases/13-2-4-mimic-released/

Right, thanks…


The bluestore_cache_* options are no longer needed. They are replaced
by osd_memory_target, defaulting to 4GB. BlueStore will expand
and contract its cache to attempt to stay within this
limit. Users upgrading should note this is a higher default
than the previous bluestore_cache_size of 1GB, so OSDs using
BlueStore will use more memory by default.
For more details, see the BlueStore docs.

Adding a 'osd memory target’ value to our ceph.conf and restarting 
an OSD just makes the OSD dump like this:


[osd]
  ; this key makes 13.2.4 OSDs abort???
  osd memory target = 1073741824

  ; other OSD key settings
  osd pool default size = 2  # Write an object 2 times.
  osd pool default min size = 1 # Allow writing one copy in a 
degraded state.


  osd pool default pg num = 256
  osd pool default pgp num = 256

  client cache size = 131072
  osd client op priority = 40
  osd op threads = 8
  osd client message size cap = 512
  filestore min sync interval = 10
  filestore max sync interval = 60

  recovery max active = 2
  recovery op priority = 30
  osd max backfills = 2




osd log snippet:
 -472> 2019-03-05 08:36:02.233 7f2743a8c1c0  1 -- - start start
 -471> 2019-03-05 08:36:02.234 7f2743a8c1c0  2 osd.12 0 init 
/var/lib/ceph/osd/ceph-12 (looks like hdd)
 -470> 2019-03-05 08:36:02.234 7f2743a8c1c0  2 osd.12 0 journal 
/var/lib/ceph/osd/ceph-12/journal
 -469> 2019-03-05 08:36:02.234 7f2743a8c1c0  1 
bluestore(/var/lib/ceph/osd/ceph-12) _mount path 
/var/lib/ceph/osd/ceph-12
 -468> 2019-03-05 08:36:02.235 7f2743a8c1c0  1 bdev create path 
/var/lib/ceph/osd/ceph-12/block type kernel
 -467> 2019-03-05 08:36:02.235 7f2743a8c1c0  1 bdev(0x55b31af4a000 
/var/lib/ceph/osd/ceph-12/block) open path 
/var/lib/ceph/osd/ceph-12/block
 -466> 2019-03-05 08:36:02.236 7f2743a8c1c0  1 bdev(0x55b31af4a000 
/var/lib/ceph/osd/ceph-12/block) open size 146775474176 
(0x222c80, 137 GiB) block_size 4096 (4 KiB) rotational
 -465> 2019-03-05 08:36:02.236 7f2743a8c1c0  1 
bluestore(/var/lib/ceph/osd/ceph-12) _set_cache_sizes cache_size 
1073741824 meta 0.4 kv 0.4 data 0.2
 -464> 2019-03-05 08:36:02.237 7f2743a8c1c0  1 bdev create path 
/var/lib/ceph/osd/ceph-12/block type kernel
 -463> 2019-03-05 08:36:02.237 7f2743a8c1c0  1 bdev(0x55b31af4aa80 
/var/lib/ceph/osd/ceph-12/block) open path 
/var/lib/ceph/osd/ceph-12/block
 -462> 2019-03-05 08:36:02.238 7f2743a8c1c0  1 bdev(0x55b31af4aa80 
/var/lib/ceph/osd/ceph-12/block) open size 146775474176 
(0x222c80, 137 GiB) block_size 4096 (4 KiB) rotational
 -461> 2019-03-05 08:36:02.238 7f2743a8c1c0  1 bluefs 
add_block_device bdev 1 path /var/lib/ceph/osd/ceph-12/block size 
137 GiB

 -460> 2019-03-05 08:36:02.238 7f2743a8c1c0  1 bluefs mount
 -459> 2019-03-05 08:36:02.339 7f2743a8c1c0  0  set rocksdb option 
compaction_readahead_size = 2097152
 -458> 2019-03-05 08

Re: [ceph-users] 13.2.4 odd memory leak?

2019-03-08 Thread Mark Nelson


On 3/8/19 5:56 AM, Steffen Winther Sørensen wrote:



On 5 Mar 2019, at 10.02, Paul Emmerich  wrote:

Yeah, there's a bug in 13.2.4. You need to set it to at least ~1.2GB.

Yeap thanks, setting it at 1G+256M worked :)
Hope this won’t bloat memory during coming weekend VM backups through CephFS

/Steffen



FWIW, setting it to 1.2G will almost certainly result in the bluestore 
caches being stuck at cache_min, ie 128MB and the autotuner may not be 
able to keep the OSD memory that low.  I typically recommend a bare 
minimum of 2GB per OSD, and on SSD/NVMe backed OSDs 3-4+ can improve 
performance significantly.



Mark




On Tue, Mar 5, 2019 at 9:00 AM Steffen Winther Sørensen
 wrote:



On 4 Mar 2019, at 16.09, Paul Emmerich  wrote:

Bloated to ~4 GB per OSD and you are on HDDs?

Something like that yes.


13.2.3 backported the cache auto-tuning which targets 4 GB memory
usage by default.


See https://ceph.com/releases/13-2-4-mimic-released/

Right, thanks…


The bluestore_cache_* options are no longer needed. They are replaced
by osd_memory_target, defaulting to 4GB. BlueStore will expand
and contract its cache to attempt to stay within this
limit. Users upgrading should note this is a higher default
than the previous bluestore_cache_size of 1GB, so OSDs using
BlueStore will use more memory by default.
For more details, see the BlueStore docs.

Adding a 'osd memory target’ value to our ceph.conf and restarting an OSD just 
makes the OSD dump like this:

[osd]
   ; this key makes 13.2.4 OSDs abort???
   osd memory target = 1073741824

   ; other OSD key settings
   osd pool default size = 2  # Write an object 2 times.
   osd pool default min size = 1 # Allow writing one copy in a degraded state.

   osd pool default pg num = 256
   osd pool default pgp num = 256

   client cache size = 131072
   osd client op priority = 40
   osd op threads = 8
   osd client message size cap = 512
   filestore min sync interval = 10
   filestore max sync interval = 60

   recovery max active = 2
   recovery op priority = 30
   osd max backfills = 2




osd log snippet:
  -472> 2019-03-05 08:36:02.233 7f2743a8c1c0  1 -- - start start
  -471> 2019-03-05 08:36:02.234 7f2743a8c1c0  2 osd.12 0 init 
/var/lib/ceph/osd/ceph-12 (looks like hdd)
  -470> 2019-03-05 08:36:02.234 7f2743a8c1c0  2 osd.12 0 journal 
/var/lib/ceph/osd/ceph-12/journal
  -469> 2019-03-05 08:36:02.234 7f2743a8c1c0  1 
bluestore(/var/lib/ceph/osd/ceph-12) _mount path /var/lib/ceph/osd/ceph-12
  -468> 2019-03-05 08:36:02.235 7f2743a8c1c0  1 bdev create path 
/var/lib/ceph/osd/ceph-12/block type kernel
  -467> 2019-03-05 08:36:02.235 7f2743a8c1c0  1 bdev(0x55b31af4a000 
/var/lib/ceph/osd/ceph-12/block) open path /var/lib/ceph/osd/ceph-12/block
  -466> 2019-03-05 08:36:02.236 7f2743a8c1c0  1 bdev(0x55b31af4a000 
/var/lib/ceph/osd/ceph-12/block) open size 146775474176 (0x222c80, 137 GiB) 
block_size 4096 (4 KiB) rotational
  -465> 2019-03-05 08:36:02.236 7f2743a8c1c0  1 
bluestore(/var/lib/ceph/osd/ceph-12) _set_cache_sizes cache_size 1073741824 meta 
0.4 kv 0.4 data 0.2
  -464> 2019-03-05 08:36:02.237 7f2743a8c1c0  1 bdev create path 
/var/lib/ceph/osd/ceph-12/block type kernel
  -463> 2019-03-05 08:36:02.237 7f2743a8c1c0  1 bdev(0x55b31af4aa80 
/var/lib/ceph/osd/ceph-12/block) open path /var/lib/ceph/osd/ceph-12/block
  -462> 2019-03-05 08:36:02.238 7f2743a8c1c0  1 bdev(0x55b31af4aa80 
/var/lib/ceph/osd/ceph-12/block) open size 146775474176 (0x222c80, 137 GiB) 
block_size 4096 (4 KiB) rotational
  -461> 2019-03-05 08:36:02.238 7f2743a8c1c0  1 bluefs add_block_device bdev 1 
path /var/lib/ceph/osd/ceph-12/block size 137 GiB
  -460> 2019-03-05 08:36:02.238 7f2743a8c1c0  1 bluefs mount
  -459> 2019-03-05 08:36:02.339 7f2743a8c1c0  0  set rocksdb option 
compaction_readahead_size = 2097152
  -458> 2019-03-05 08:36:02.339 7f2743a8c1c0  0  set rocksdb option compression 
= kNoCompression
  -457> 2019-03-05 08:36:02.339 7f2743a8c1c0  0  set rocksdb option 
max_write_buffer_number = 4
  -456> 2019-03-05 08:36:02.339 7f2743a8c1c0  0  set rocksdb option 
min_write_buffer_number_to_merge = 1
  -455> 2019-03-05 08:36:02.339 7f2743a8c1c0  0  set rocksdb option 
recycle_log_file_num = 4
  -454> 2019-03-05 08:36:02.339 7f2743a8c1c0  0  set rocksdb option 
writable_file_max_buffer_size = 0
  -453> 2019-03-05 08:36:02.339 7f2743a8c1c0  0  set rocksdb option 
write_buffer_size = 268435456
  -452> 2019-03-05 08:36:02.340 7f2743a8c1c0  0  set rocksdb option 
compaction_readahead_size = 2097152
  -451> 2019-03-05 08:36:02.340 7f2743a8c1c0  0  set rocksdb option compression 
= kNoCompression
  -450> 2019-03-05 08:36:02.340 7f2743a8c1c0  0  set rocksdb option 
max_write_buffer_number = 4
  -449> 2019-03-05 08:36:02.340 7f2743a8c1c0  0  set rocksdb option 
min_write_buffer_number_to_merge = 1
  -448> 2019-03-05 08:36:02.340 7f2743a8c1c0  0  set rocksdb option 
recycle_log_file_num = 4
  -447> 2019-03-05 08:36:02.340 7f2743a8c1c0  0  set rocksdb option 

Re: [ceph-users] optimize bluestore for random write i/o

2019-03-06 Thread Mark Nelson


On 3/6/19 5:12 AM, Stefan Priebe - Profihost AG wrote:

Hi Mark,
Am 05.03.19 um 23:12 schrieb Mark Nelson:

Hi Stefan,


Could you try running your random write workload against bluestore and
then take a wallclock profile of an OSD using gdbpmp? It's available here:


https://github.com/markhpc/gdbpmp

sure but it does not work:


# ./gdbpmp.py -p 3760442 -n 100 -o gdbpmp.data
Attaching to process 3760442...0x7f917b6a615f in
pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0

Thread 1 "ceph-osd" received signal SIGCONT, Continued.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
Done.
Gathering Samples
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 2 "log" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.

Re: [ceph-users] optimize bluestore for random write i/o

2019-03-06 Thread Mark Nelson


On 3/5/19 4:23 PM, Vitaliy Filippov wrote:
Testing -rw=write without -sync=1 or -fsync=1 (or -fsync=32 for batch 
IO, or just fio -ioengine=rbd from outside a VM) is rather pointless - 
you're benchmarking the RBD cache, not Ceph itself. RBD cache is 
coalescing your writes into big sequential writes. Of course bluestore 
is faster in this case - it has no double write for big writes.


I'll probably try to test these settings - I'm also interested in 
random write iops in an all-flash bluestore cluster :) but I don't 
think any rocksdb options will help. I found bluestore pretty 
untunable in terms of performance :)



For random writes, you often end up bottlenecked in the kv sync thread 
so long as you aren't generally CPU bound.  Anything you can do to 
reduce the work being done in the kv sync thread usually helps.  A big 
one is making sure you are hitting onodes in the bluestore cache rather 
than rocksdb cache or disk. IE having enough onode cache available for 
the dataset being benchmarked.
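
A rough way to check that is to watch the onode hit/miss counters from the
bluestore perf counters while the benchmark runs, along these lines (osd.0 is
just an example id):

# onode cache hits vs misses for one OSD
ceph daemon osd.0 perf dump | grep -E 'onode_hits|onode_misses'

If misses keep climbing relative to hits, the onode cache is too small for the
working set and the cache size / meta ratio probably needs to be raised.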





The best thing to do for me was to disable CPU powersaving (set 
governor to performance + cpupower idle-set -D 1). Your CPUs become 
frying pans, but write IOPS, especially single-thread write IOPS (which 
are the worst-case scenario AND at the same time the thing 
applications usually need), increase 2-3 times. Test it with fio 
-ioengine=rbd -bs=4k -iodepth=1.



Yep, this is a big one.  I've asked for clarification from vendors if we 
can actually recommend doing this but haven't gotten a clear answer yet. :/
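
For reference, what's being described amounts to roughly the following (exact
commands can vary per distro, and the pool/image names are just examples):

# force the performance governor and disallow deep C-states
cpupower frequency-set -g performance
cpupower idle-set -D 1

# single-thread write latency test against an existing RBD image
fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite \
    -pool=rbd -rbdname=testimg -runtime=60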





Another thing that I've done on my cluster was to set 
`bluestore_min_alloc_size_ssd` to 4096. The reason to do that is that 
it's 16kb by default which means all writes below 16kb use the same 
deferred write path as with HDDs. Deferred writes only increase WA 
factor for SSDs and lower the performance. You have to recreate OSDs 
after changing this variable - it's only applied at the time of OSD 
creation.



Decreasing the min_alloc size isn't always a win, but it can be in some 
cases.  Originally bluestore_min_alloc_size_ssd was set to 4096 but we 
increased it to 16384 because at the time our metadata path was slow and 
increasing it resulted in a pretty significant performance win (along 
with increasing the WAL buffers in rocksdb to reduce write 
amplification).  Since then we've improved the metadata path to the 
point where, at least on our test nodes, performance was pretty close 
between min_alloc size = 16k and min_alloc size = 4k the last time 
I looked.  It might be a good idea to drop it down to 4k now, but I think 
we need to be careful because there are tradeoffs.
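
For anyone who wants to experiment with it, the setting has to be in place
before the OSD is created, roughly like this (device name is just an example):

# in ceph.conf, before deploying the OSD
[osd]
bluestore_min_alloc_size_ssd = 4096

# then redeploy the OSD, e.g. with ceph-volume
ceph-volume lvm zap /dev/nvme0n1 --destroy
ceph-volume lvm create --bluestore --data /dev/nvme0n1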



You can see some of the original work we did in 2016 looking at this on 
our performance test cluster here:



https://docs.google.com/spreadsheets/d/1YPiiDu0IxQdB4DcVVz8WON9CpWX9QOy5r-XmYJL0Sys/edit?usp=sharing


And follow-up work in 2017 here:


https://drive.google.com/file/d/0B2gTBZrkrnpZVXpzR2JNRmR0WFE/view?usp=sharing


It might be time to revisit again.




I'm also currently trying another performance fix, kind of... but it 
involves patching ceph's code, so I'll share it later if I succeed.



Would you consider sharing what your idea is?  There are absolutely 
areas where performance can be improved, but often times they involve 
tradeoffs in some respect.






Hello list,

while the performance of sequential writes 4k on bluestore is very high
and even higher than filestore i was wondering what i can do to optimize
random pattern as well.

While using:
fio --rw=write --iodepth=32 --ioengine=libaio --bs=4k --numjobs=4
--filename=/tmp/test --size=10G --runtime=60 --group_reporting
--name=test --direct=1

I get 36000 iop/s on bluestore while having 11500 on filestore.

Using randwrite gives me 17000 on filestore and only 9500 on bluestore.

This is on all flash / ssd running luminous 12.2.10.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] optimize bluestore for random write i/o

2019-03-05 Thread Mark Nelson

Hi Stefan,


Could you try running your random write workload against bluestore and 
then take a wallclock profile of an OSD using gdbpmp? It's available here:



https://github.com/markhpc/gdbpmp


Thanks,

Mark


On 3/5/19 2:29 AM, Stefan Priebe - Profihost AG wrote:

Hello list,

while the performance of sequential writes 4k on bluestore is very high
and even higher than filestore i was wondering what i can do to optimize
random pattern as well.

While using:
fio --rw=write --iodepth=32 --ioengine=libaio --bs=4k --numjobs=4
--filename=/tmp/test --size=10G --runtime=60 --group_reporting
--name=test --direct=1

I get 36000 iop/s on bluestore while having 11500 on filestore.

Using randwrite gives me 17000 on filestore and only 9500 on bluestore.

This is on all flash / ssd running luminous 12.2.10.

Greets,
Stefan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster on AMD based system.

2019-03-05 Thread Mark Nelson


On 3/5/19 10:20 AM, Darius Kasparavičius wrote:

Thank you for your response.

I was planning to use a 100GbE or 45GbE bond for this cluster. It was
acceptable for our use case to lose sequential/larger I/O speed for
it.  Dual socket would be an option, but I do not want to touch numa,
cgroups and the rest of the settings. Most of the time it is just easier to add
a higher clock CPU or more cores. The plan is currently for 2xosd per
nvme device, but if testing shows that it’s better to use one. We will
stick with one. Which RocksDB settings would you recommend tweaking? I
haven’t had the chance to test them yet. Most of the clusters I have
access to are using leveldb and are still running filestore.



Yeah, numa makes everything more complicated.  I'd just consider jumping 
up to the 7601 then if IOPS is a concern and know that you might still 
be CPU bound (though it's also possible you could hit some other 
bottleneck before it becomes an issue).  Given that the cores aren't 
clocked super high it's possible that you might see a benefit to 2x 
OSDs/device.



RocksDB is tough.  Right now we are heavily tuned to favor reducing 
write amplification but eat CPU to do it.  That can help performance 
when write throughput is a bottleneck and also reduces wear on the drive 
(which is always good, but especially with low write endurance drives).  
Reducing the size of the WAL buffers will (probably) reduce CPU usage 
and also reduce the amount of memory used by the OSD, but we've observed 
higher write-amplification on our test nodes.  I suspect that might be a 
worthwhile trade-off for nvdimms or optane, but I'm not sure it's a good 
idea for typical NVMe drives.
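
For anyone who wants to experiment, those buffers live in the
bluestore_rocksdb_options string, so the usual approach is to take the string
your build reports ("ceph daemon osd.N config show | grep rocksdb_options") and
shrink write_buffer_size / max_write_buffer_number, something like this
(values are only illustrative):

# in ceph.conf; smaller WAL buffers, everything else left at the reported defaults
[osd]
bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=67108864,writable_file_max_buffer_size=0,compaction_readahead_size=2097152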



Mark




On Tue, Mar 5, 2019 at 5:35 PM Mark Nelson  wrote:

Hi,


I've got a ryzen7 1700 box that I regularly run tests on along with the
upstream community performance test nodes that have Intel Xeon E5-2650v3
processors in them.  The Ryzen is 3.0GHz/3.7GHz turbo while the Xeons
are 2.3GHz/3.0GHz.  The Xeons are quite a bit faster clock/clock in the
tests I've done with Ceph. Typically I see a single OSD using fewer
cores on the Xeon processors vs Ryzen to hit similar performance numbers
despite being clocked lower (though I haven't verified the turbo
frequencies of both under load).  On the other hand, the Ryzen processor
is significantly cheaper per core.  If you only looked at cores you'd
think something like Ryzen would be the way to go, but there are other
things to consider.  The number of PCIE lanes, memory configuration,
cache configuration, and CPU interconnect (in multi-socket
configurations) all start becoming really important if you are targeting
multiple NVMe drives like what you are talking about below.  The EPYC
processors give you more of all of that, but also cost a lot more than
Ryzen.  Ultimately the CPU is only a small part of the price for nodes
like this so I wouldn't skimp if your goal is to maximize IOPS.


With 10 NVMe drives per node, I'm guessing that a single EPYC 7451 is
going to be CPU bound for small IO workloads (2.4c/4.8t per OSD), but
will be network bound for large IO workloads unless you are sticking
2x100GbE in.  You might want to consider jumping up to the 7601.  That
would get you closer to where you want to be for 10 NVMe drives
(3.2c/6.4t per OSD).  Another option might be dual 7351s in this chassis:

https://www.supermicro.com/Aplus/system/1U/1123/AS-1123US-TN10RT.cfm


Figure that with sufficient client parallelism/load you'll get about
3000-6000 read IOPS/core and about 1500-3000 write IOPS/core (before
replication) with OSDs typically topping out at a max of about 6-8 cores
each.  Doubling up OSDs on each NVMe drive might improve or hurt
performance depending on what the limitations are (typically it seems to
help most when the kv sync thread is the primary bottleneck in
bluestore, which most likely happens with tons of slow cores and very
fast NVMe drives).  Those are all very rough hand-wavy numbers and
depend on a huge variety of factors so take them with a grain of salt.
Doing things like disabling authentication, disabling logging, forcing
high level P/C states, tweaking RocksDB WAL and compaction settings, the
number of osd shards/threads, and the system numa configuration might
get you higher performance/core, though it's all pretty hard to predict
without outright testing it.


Though you didn't ask about it, probably the most important thing you
can spend money on with NVMe drives is getting high write endurance
(DWPD) if you expect even a moderately high write workload.


Mark


On 3/5/19 3:49 AM, Darius Kasparavičius wrote:

Hello,


I was thinking of using AMD based system for my new nvme based
cluster. In particular I'm looking at
https://www.supermicro.com/Aplus/system/1U/1113/AS-1113S-WN10RT.cfm
and https://www.amd.com/en/products/cpu/amd-epyc-7451 CPU's. Have
anyone tried running it on this particular hardware?

General idea is 6 nodes with 10 nvme drives and 2 osds per nvme drive

Re: [ceph-users] Ceph cluster on AMD based system.

2019-03-05 Thread Mark Nelson

Hi,


I've got a ryzen7 1700 box that I regularly run tests on along with the 
upstream community performance test nodes that have Intel Xeon E5-2650v3 
processors in them.  The Ryzen is 3.0GHz/3.7GHz turbo while the Xeons 
are 2.3GHz/3.0GHz.  The Xeons are quite a bit faster clock/clock in the 
tests I've done with Ceph. Typically I see a single OSD using fewer 
cores on the Xeon processors vs Ryzen to hit similar performance numbers 
despite being clocked lower (though I haven't verified the turbo 
frequencies of both under load).  On the other hand, the Ryzen processor 
is significantly cheaper per core.  If you only looked at cores you'd 
think something like Ryzen would be the way to go, but there are other 
things to consider.  The number of PCIE lanes, memory configuration, 
cache configuration, and CPU interconnect (in multi-socket 
configurations) all start becoming really important if you are targeting 
multiple NVMe drives like what you are talking about below.  The EPYC 
processors give you more of all of that, but also cost a lot more than 
Ryzen.  Ultimately the CPU is only a small part of the price for nodes 
like this so I wouldn't skimp if your goal is to maximize IOPS.



With 10 NVMe drives per node, I'm guessing that a single EPYC 7451 is 
going to be CPU bound for small IO workloads (2.4c/4.8t per OSD), but 
will be network bound for large IO workloads unless you are sticking 
2x100GbE in.  You might want to consider jumping up to the 7601.  That 
would get you closer to where you want to be for 10 NVMe drives 
(3.2c/6.4t per OSD).  Another option might be dual 7351s in this chassis:


https://www.supermicro.com/Aplus/system/1U/1123/AS-1123US-TN10RT.cfm


Figure that with sufficient client parallelism/load you'll get about 
3000-6000 read IOPS/core and about 1500-3000 write IOPS/core (before 
replication) with OSDs typically topping out at a max of about 6-8 cores 
each.  Doubling up OSDs on each NVMe drive might improve or hurt 
performance depending on what the limitations are (typically it seems to 
help most when the kv sync thread is the primary bottleneck in 
bluestore, which most likely happens with tons of slow cores and very 
fast NVMe drives).  Those are all very rough hand-wavy numbers and 
depend on a huge variety of factors so take them with a grain of salt.  
Doing things like disabling authentication, disabling logging, forcing 
high level P/C states, tweaking RocksDB WAL and compaction settings, the 
number of osd shards/threads, and the system numa configuration might 
get you higher performance/core, though it's all pretty hard to predict 
without outright testing it.
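
For completeness, the "disable authentication / disable logging" part of that
list usually looks something like this in ceph.conf (benchmarking only, not
something to run in production):

[global]
auth_cluster_required = none
auth_service_required = none
auth_client_required = none
debug_ms = 0/0
debug_osd = 0/0
debug_bluestore = 0/0
debug_rocksdb = 0/0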



Though you didn't ask about it, probably the most important thing you 
can spend money on with NVMe drives is getting high write endurance 
(DWPD) if you expect even a moderately high write workload.



Mark


On 3/5/19 3:49 AM, Darius Kasparavičius wrote:

Hello,


I was thinking of using AMD based system for my new nvme based
cluster. In particular I'm looking at
https://www.supermicro.com/Aplus/system/1U/1113/AS-1113S-WN10RT.cfm
and https://www.amd.com/en/products/cpu/amd-epyc-7451 CPU's. Have
anyone tried running it on this particular hardware?

General idea is 6 nodes with 10 nvme drives and 2 osds per nvme drive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD poor performance

2019-02-27 Thread Mark Nelson
FWIW, I've got recent tests of a fairly recent master build 
(14.0.1-3118-gd239c2a) showing a single OSD hitting ~33-38K 4k randwrite 
IOPS with 3 client nodes running fio (io_depth = 32) both with RBD and 
with CephFS.  The OSD node had older gen CPUs (Xeon E5-2650 v3) and NVMe 
drives (Intel P3700).  The OSD process and threads were pinned to run on 
the first socket.  It took between 5-7 cores to pull off that throughput 
though.



Jumping up to 4 OSDs in the node (no replication) improved aggregate 
throughput to ~54-55K IOPS with ~15 cores used, so 13-14K IOPS per OSD 
with around 3.5-4 cores each on average.  IE with more OSDs running on 
the same socket competing for cores, the throughput per OSD went down 
and the IOPS/core rate went down too.  With NVMe, you are likely best 
off when multiple OSD processes aren't competing with each other for 
cores and can mostly just run on a specific set of cores without 
contention. I'd expect that numa pinning each OSD process to specific 
cores with enough cores to satisfy the OSD might help.  (Nick Fisk also 
showed a while back that forcing the CPU to not drop into low-power C/P 
states can help dramatically as well).
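
A minimal sketch of that kind of pinning, assuming the OSD should run on socket
0 (ids and paths are just examples; on systemd hosts a CPUAffinity= drop-in for
the ceph-osd@ unit is another way to do it):

numactl --cpunodebind=0 --membind=0 /usr/bin/ceph-osd -f --cluster ceph \
    --id 0 --setuser ceph --setgroup ceph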



Mark


On 2/27/19 4:30 PM, Vitaliy Filippov wrote:
By "maximum write iops of an osd" I mean total iops divided by the 
number of OSDs. For example, an expensive setup from Micron 
(https://www.micron.com/about/blog/2018/april/micron-9200-max-red-hat-ceph-storage-30-reference-architecture-block-performance) 
has got only 8750 peak write iops per an NVMe. These exact NVMes they 
used are rated for 26+ iops when connected directly :). CPU is a 
real bottleneck. The need for a Seastar-based rewrite is not a joke! :)


Total iops is the number coming from a test like:

fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=128 
-rw=randwrite -pool= -runtime=60 -rbdname=testimg


...or from several such jobs run in parallel each over a separate RBD 
image.


This is a "random write bandwidth" test, and, in fact, it's not the 
most useful one - the single-thread latency usually does matter more 
than just total bandwidth. To test for it, run:


fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite 
-pool= -runtime=60 -rbdname=testimg


You'll get a pretty low number (< 100 for HDD clusters, 500-1000 for 
SSD clusters). It's as expected that it's low. Everything above 1000 
iops (< 1ms latency, single-thread iops = 1 / avg latency) is hard to 
achieve with Ceph no matter what disks you're using. Also 
single-thread latency does not depend on the number of OSDs in the 
cluster, because the workload is not parallel.


However you can also test iops of single OSDs by creating a pool with 
size=1 and using a custom benchmark tool we've made with our 
colleagues from a russian Ceph chat... we can publish it here a short 
time later if you want :).



At some point I would expect the cpu to be the bottleneck. They have
always been saying here that for better latency you should get fast cpu's.
Would be nice to know what GHz you are testing, and how that scales. Rep
1-3, erasure probably also takes a hit.
How do you test maximum iops of the osd? (Just curious, so I can test
mine)

I posted a cephfs test on ssd rep 1 here a while ago that was
performing nowhere near native, asking if this was normal. But never got
a response to it. I can remember that they sent everyone a questionnaire
and asked if they should focus on performance more, now I wish I had
checked that box ;)



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-01-30 Thread Mark Nelson


On 1/30/19 7:45 AM, Alexandre DERUMIER wrote:

I don't see any smoking gun here... :/

I need to test to compare when latencies are going very high, but I need to wait 
more days/weeks.



The main difference between a warm OSD and a cold one is that on startup
the bluestore cache is empty. You might try setting the bluestore cache
size to something much smaller and see if that has an effect on the CPU
utilization?

I will try to test. I also wonder if the new auto memory tuning from Mark could 
help too ?
(I'm still on mimic 13.2.1, planning to update to 13.2.5 next month)

also, could check some bluestore related counters ? (onodes, rocksdb,bluestore 
cache)



If it does, probably only by accident. :)  The autotuner in master is 
pretty dumb and mostly just grows/shrinks the caches based on the 
default ratios but accounts for the memory needed for rocksdb 
indexes/filters.  It will try to keep the total OSD memory consumption 
below the specified limit.  It doesn't do anything smart like monitor 
whether or not large caches may introduce more latency than small 
caches.  It actually adds a small amount of additional overhead in the 
mempool thread to perform the calculations.  If you had a static 
workload and tuned the bluestore cache size and ratios perfectly it 
would only add extra (albeit fairly minimal with the default settings) 
computational cost.



If perf isn't showing anything conclusive, you might try my wallclock 
profiler: http://github.com/markhpc/gdbpmp



Some other things to watch out for are CPUs switching C states and the 
effect of having transparent huge pages enabled (though I'd be more 
concerned about this in terms of memory usage).
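
Quick ways to check both of those (paths can vary by distro and kernel):

cat /sys/module/intel_idle/parameters/max_cstate         # C-state limit if intel_idle is in use
cat /sys/kernel/mm/transparent_hugepage/enabled          # current THP mode
echo never > /sys/kernel/mm/transparent_hugepage/enabled # e.g. to rule THP out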



Mark





Note that this doesn't necessarily mean that's what you want. Maybe the
reason why the CPU utilization is higher is because the cache is warm and
the OSD is serving more requests per second...

Well, currently, the server is really quiet

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
nvme0n1   2,00   515,00   48,00 1182,00   304,00 11216,0018,73 
0,010,000,000,00   0,01   1,20

%Cpu(s):  1,5 us,  1,0 sy,  0,0 ni, 97,2 id,  0,2 wa,  0,0 hi,  0,1 si,  0,0 st

And this is only with writes, not reads



- Original Message -
From: "Sage Weil" 
To: "aderumier" 
Cc: "ceph-users" , "ceph-devel" 

Sent: Wednesday, 30 January 2019 14:33:23
Subject: Re: ceph osd commit latency increase over time, until restart

On Wed, 30 Jan 2019, Alexandre DERUMIER wrote:

Hi,

here some new results,
different osd/ different cluster

before osd restart latency was between 2-5ms
after osd restart is around 1-1.5ms

http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms)
http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms)
http://odisoweb1.odiso.net/cephperf2/diff.txt

I don't see any smoking gun here... :/

The main difference between a warm OSD and a cold one is that on startup
the bluestore cache is empty. You might try setting the bluestore cache
size to something much smaller and see if that has an effect on the CPU
utilization?

Note that this doesn't necessarily mean that's what you want. Maybe the
reason why the CPU utilization is higher is because the cache is warm and
the OSD is serving more requests per second...

sage



From what I see in diff, the biggest difference is in tcmalloc, but maybe I'm wrong. 


(I'm using tcmalloc 2.5-2.2)


- Original Message -
From: "Sage Weil" 
To: "aderumier" 
Cc: "ceph-users" , "ceph-devel" 

Sent: Friday, 25 January 2019 10:49:02
Subject: Re: ceph osd commit latency increase over time, until restart

Can you capture a perf top or perf record to see where teh CPU time is
going on one of the OSDs wth a high latency?

Thanks!
sage


On Fri, 25 Jan 2019, Alexandre DERUMIER wrote:


Hi,

I have a strange behaviour of my osd, on multiple clusters,

All clusters are running mimic 13.2.1, bluestore, with ssd or nvme drives,
workload is rbd only, with qemu-kvm vms running with librbd + snapshot/rbd 
export-diff/snapshotdelete each day for backup

When the osd are refreshly started, the commit latency is between 0,5-1ms.

But over time, this latency increases slowly (maybe around 1ms per day), until 
reaching crazy
values like 20-200ms.

Some example graphs:

http://odisoweb1.odiso.net/osdlatency1.png
http://odisoweb1.odiso.net/osdlatency2.png

All osds have this behaviour, in all clusters.

The latency of physical disks is ok. (Clusters are far from being fully loaded)

And if I restart the osd, the latency come back to 0,5-1ms.

That reminds me of the old tcmalloc bug, but maybe it could be a bluestore memory 
bug?

Any Hints for counters/logs to check ?


Regards,

Alexandre






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list

Re: [ceph-users] slow requests and high i/o / read rate on bluestore osds after upgrade 12.2.8 -> 12.2.10

2019-01-18 Thread Mark Nelson


On 1/18/19 9:22 AM, Nils Fahldieck - Profihost AG wrote:

Hello Mark,

I'm answering on behalf of Stefan.
Am 18.01.19 um 00:22 schrieb Mark Nelson:

On 1/17/19 4:06 PM, Stefan Priebe - Profihost AG wrote:

Hello Mark,

after reading
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/

again i'm really confused how the behaviour is exactly under 12.2.8
regarding memory and 12.2.10.

Also i stumpled upon "When tcmalloc and cache autotuning is enabled," -
we're compiling against and using jemalloc. What happens in this case?


Hi Stefan,


The autotuner uses the existing in-tree perfglue code that grabs the
tcmalloc heap and unmapped memory statistics to determine how to tune
the caches.  Theoretically we might be able to do the same thing for
jemalloc and maybe even glibc malloc, but there's no perfglue code for
those yet.  If the autotuner can't get heap statistics it won't try to
tune the caches and should instead revert to using the
bluestore_cache_size and whatever the ratios are (the same as if you set
bluestore_cache_autotune to false).


Thank you for that information on the difference between tcmalloc and
jemalloc. We compiled a new 12.2.10 version using tcmalloc. I upgraded a
cluster, which was running _our_ old 12.2.10 version (which used
jemalloc). This cluster has a very low load, so the
jemalloc-ceph-version didn't trigger any performance problems. Prior to
upgrading, one OSD never used more than 1 GB of RAM. After upgrading
there are OSDs using approx. 5,7 GB right now.

I also removed the 'osd_memory_target' option, which we falsely believed
has replaced 'bluestore_cache_size'.

We still have to test this on a cluster generating more I/O load.

For now, this seems to be working fine. Thanks.


Also i saw now - that 12.2.10 uses 1GB mem max while 12.2.8 uses 6-7GB
Mem (with bluestore_cache_size = 1073741824).


If you are using the autotuner (but it sounds like maybe you are not if
jemalloc is being used?) you'll want to set the osd_memory_target at
least 1GB higher than what you previously had the bluestore_cache_size
set to.  It's likely that trying to set the OSD to stay within 1GB of
memory will cause the cache to sit at osd_memory_cache_min because the
tuner simply can't shrink the cache enough to meet the target (too much
other memory consumed by pglog, rocksdb WAL buffers, random other stuff).

The fact that you see 6-7GB of mem usage with 12.2.8 vs 1GB with 12.2.10
sounds like a clue.  A bluestore OSD using 1GB of memory is going to
have very little space for cache and it's quite likely that it would be
performing reads from disk for a variety of reasons.  Getting to the
root of that might explain what's going on.  If you happen to still have
a 12.2.8 OSD up that's consuming 6-7GB of memory (with
bluestore_cache_size = 1073741824), can you dump the mempool stats and
running configuration for it?



This is one OSD from a different cluster using approximately 6,1 GB of
memory. This OSD and it's cluster is still running with version 12.2.8.

This OSD (and every other OSD running with 12.2.8) is still configured
with 'bluestore_cache_size = 1073741824'. Please see the following
pastebins:


ceph daemon osd.NNN dump_mempools

https://pastebin.com/Pdcrr4ut


And


ceph daemon osd.NNN show config

https://pastebin.com/nkKpNFU3

Best Regards
Nils



Hi Nils,


Forgive me if you already said this, but is osd.32 backed by an SSD?  I 
believe what you are seeing is that the OSD is actually using 3GB of 
cache due to:



bluestore_cache_size_ssd = 3221225472


on line 132 of your show config paste.


That is backed up by the mempool data:


    "bluestore_cache_other": {
    "items": 62839413,
    "bytes": 2573767714
    },

    "total": {
    "items": 214595893,
    "bytes": 3087934707
    }


IE even though you guys set bluestore_cache_size to 1GB, it is being 
overridden by bluestore_cache_size_ssd.  Later when you compiled the 
tcmalloc version of 12.2.10 and set the osd_memory_target to 1GB, it was 
properly being applied and the autotuner desperately attempted to fit 
the entire OSD into 1GB of memory by shrinking all of the caches to fit 
within osd_memory_cache_min (128MB by default).  Ultimately that led to 
many reads from disk as even the rocksdb bloom filters may not have 
properly fit into that small of a cache.  Generally I think the absolute 
minimum osd_memory_target for bluestore is probably around 1.5-2GB (with 
potential performance penalties), but 3-4GB gives it a lot more 
breathing room.  If you are ok with the OSD taking up 6-7GB of memory 
you might set the osd_memory_target accordingly.
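
For example, something along these lines in ceph.conf (the value is 6 GiB
expressed in bytes; in 12.2.x it is typically picked up when the OSD restarts):

[osd]
osd_memory_target = 6442450944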



The reason we wrote the autotuning code is to try to make all of this 
simpler and more explicit.  The idea is that a user shouldn't need to 
think about any of this beyond giving the OSD a target for how much 
memory it should consume and let it worry about figuring out h

Re: [ceph-users] slow requests and high i/o / read rate on bluestore osds after upgrade 12.2.8 -> 12.2.10

2019-01-17 Thread Mark Nelson


On 1/17/19 4:06 PM, Stefan Priebe - Profihost AG wrote:

Hello Mark,

after reading
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/
again i'm really confused how the behaviour is exactly under 12.2.8
regarding memory and 12.2.10.

Also i stumpled upon "When tcmalloc and cache autotuning is enabled," -
we're compiling against and using jemalloc. What happens in this case?



Hi Stefan,


The autotuner uses the existing in-tree perfglue code that grabs the 
tcmalloc heap and unmapped memory statistics to determine how to tune 
the caches.  Theoretically we might be able to do the same thing for 
jemalloc and maybe even glibc malloc, but there's no perfglue code for 
those yet.  If the autotuner can't get heap statistics it won't try to 
tune the caches and should instead revert to using the 
bluestore_cache_size and whatever the ratios are (the same as if you set 
bluestore_cache_autotune to false).
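
An easy way to confirm which mode a given OSD is actually in is to check the
relevant options on its admin socket, for example:

ceph daemon osd.0 config show | grep -E 'bluestore_cache_autotune|osd_memory_target|bluestore_cache_size'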




Also i saw now - that 12.2.10 uses 1GB mem max while 12.2.8 uses 6-7GB
Mem (with bluestore_cache_size = 1073741824).



If you are using the autotuner (but it sounds like maybe you are not if 
jemalloc is being used?) you'll want to set the osd_memory_target at 
least 1GB higher than what you previously had the bluestore_cache_size 
set to.  It's likely that trying to set the OSD to stay within 1GB of 
memory will cause the cache to sit at osd_memory_cache_min because the 
tuner simply can't shrink the cache enough to meet the target (too much 
other memory consumed by pglog, rocksdb WAL buffers, random other stuff).


The fact that you see 6-7GB of mem usage with 12.2.8 vs 1GB with 12.2.10 
sounds like a clue.  A bluestore OSD using 1GB of memory is going to 
have very little space for cache and it's quite likely that it would be 
performing reads from disk for a variety of reasons.  Getting to the 
root of that might explain what's going on.  If you happen to still have 
a 12.2.8 OSD up that's consuming 6-7GB of memory (with 
bluestore_cache_size = 1073741824), can you dump the mempool stats and 
running configuration for it?



ceph daemon osd.NNN dump_mempools


And


ceph daemon osd.NNN show config


Thanks,

Mark




Greets,
Stefan

On 17.01.19 at 22:59, Stefan Priebe - Profihost AG wrote:

Hello Mark,

for whatever reason i didn't get your mails - most probably you kicked
me out of CC/TO and only sent to the ML? I've only subscribed to a daily
digest. (changed that for now)

So i'm very sorry to answer so late.

My messages might sound a bit confused as it isn't easily reproduced and we
tried a lot to find out what's going on.

As 12.2.10 does not contain the pg hard limit i don't suspect it is
related to it.

What i can tell right now is:

1.) Under 12.2.8 we've set bluestore_cache_size = 1073741824

2.) While upgrading to 12.2.10 we replaced it with osd_memory_target =
1073741824

3.) i also tried 12.2.10 without setting osd_memory_target or
bluestore_cache_size

4.) it's not kernel related - for some unknown reason it worked for some
hours with a newer kernel but gave problems again later

5.) a backfill with 12.2.10 of 6x 2TB SSDs took about 14 hours using
12.2.10 while it took 2 hours with 12.2.8

6.) with 12.2.10 i have a constant rate of 100% read i/o (400-500MB/s)
on most of my bluestore OSDs - while on 12.2.8 i've 100kb - 2MB/s max
read on 12.2.8.

7.) upgrades on small clusters or fresh installs seem to work fine. (no
idea why, or if it is related to cluster size)

That's currently all i know.

Thanks a lot!

Greets,
Stefan
On 16.01.19 at 20:56, Stefan Priebe - Profihost AG wrote:

i reverted the whole cluster back to 12.2.8 - recovery speed also
dropped from 300-400MB/s to 20MB/s on 12.2.10. So something is really
broken.

Greets,
Stefan
On 16.01.19 at 16:00, Stefan Priebe - Profihost AG wrote:

This is not the case with 12.2.8 - it happens with 12.2.9 as well. After
boot all pgs are instantly active - not inactive pgs at least not
noticable in ceph -s.

With 12.2.9 or 12.2.10 or eben current upstream/luminous it takes
minutes until all pgs are active again.

Greets,
Stefan
On 16.01.19 at 15:22, Stefan Priebe - Profihost AG wrote:

Hello,

while digging into this further i saw that it takes ages until all pgs
are active. After starting the OSD 3% of all pgs are inactive and it
takes minutes after they're active.

The log of the OSD is full of:


2019-01-16 15:19:13.568527 7fecbf7da700  0 osd.33 pg_epoch: 1318479
pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'615747
21,1318474'61584855] local-lis/les=1318472/1318473 n=1912
ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 131
8472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1
rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+rec
overing+degraded m=184 snaptrimq=[ec1a0~1,ec808~1]
mbc={255={(2+0)=185,(3+0)=2}}] _update_calc_stats ml 185 upset size 3 up 2
2019-01-16 15:19:13.568637 7fecbf7da700  0 osd.33 pg_epoch: 1318479
pg[5.563( v 1318474'61584855 lc 

Re: [ceph-users] slow requests and high i/o / read rate on bluestore osds after upgrade 12.2.8 -> 12.2.10

2019-01-17 Thread Mark Nelson

Hi Stefan,


I'm taking a stab at reproducing this in-house.  Any details you can 
give me that might help would be much appreciated.  I'll let you know 
what I find.



Thanks,

Mark


On 1/16/19 1:56 PM, Stefan Priebe - Profihost AG wrote:

i reverted the whole cluster back to 12.2.8 - recovery speed also
dropped from 300-400MB/s to 20MB/s on 12.2.10. So something is really
broken.

Greets,
Stefan
On 16.01.19 at 16:00, Stefan Priebe - Profihost AG wrote:

This is not the case with 12.2.8 - it happens with 12.2.9 as well. After
boot all pgs are instantly active - not inactive pgs at least not
noticable in ceph -s.

With 12.2.9 or 12.2.10 or eben current upstream/luminous it takes
minutes until all pgs are active again.

Greets,
Stefan
On 16.01.19 at 15:22, Stefan Priebe - Profihost AG wrote:

Hello,

while digging into this further i saw that it takes ages until all pgs
are active. After starting the OSD 3% of all pgs are inactive and it
takes minutes after they're active.

The log of the OSD is full of:


2019-01-16 15:19:13.568527 7fecbf7da700  0 osd.33 pg_epoch: 1318479
pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'615747
21,1318474'61584855] local-lis/les=1318472/1318473 n=1912
ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 131
8472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1
rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+rec
overing+degraded m=184 snaptrimq=[ec1a0~1,ec808~1]
mbc={255={(2+0)=185,(3+0)=2}}] _update_calc_stats ml 185 upset size 3 up 2
2019-01-16 15:19:13.568637 7fecbf7da700  0 osd.33 pg_epoch: 1318479
pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'615747
21,1318474'61584855] local-lis/les=1318472/1318473 n=1912
ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 131
8472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1
rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+rec
overing+degraded m=184 snaptrimq=[ec1a0~1,ec808~1]
mbc={255={(2+0)=185,(3+0)=2}}] _update_calc_stats ml 2 upset size 3 up 3
2019-01-16 15:19:15.909327 7fecbf7da700  0 osd.33 pg_epoch: 1318479
pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'615747
21,1318474'61584855] local-lis/les=1318472/1318473 n=1912
ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 131
8472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1
rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+rec
overing+degraded m=183 snaptrimq=[ec1a0~1,ec808~1]
mbc={255={(2+0)=184,(3+0)=3}}] _update_calc_stats ml 184 upset size 3 up 2
2019-01-16 15:19:15.909446 7fecbf7da700  0 osd.33 pg_epoch: 1318479
pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'615747
21,1318474'61584855] local-lis/les=1318472/1318473 n=1912
ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 131
8472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1
rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+rec
overing+degraded m=183 snaptrimq=[ec1a0~1,ec808~1]
mbc={255={(2+0)=184,(3+0)=3}}] _update_calc_stats ml 3 upset size 3 up 3
2019-01-16 15:19:23.503231 7fecb97ff700  0 osd.33 pg_epoch: 1318479
pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'615747
21,1318474'61584855] local-lis/les=1318472/1318473 n=1912
ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 131
8472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1
rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+rec
overing+degraded m=183 snaptrimq=[ec1a0~1,ec808~1]
mbc={255={(2+0)=183,(3+0)=3}}] _update_calc_stats ml 183 upset size 3 up 2

Greets,
Stefan
On 16.01.19 at 09:12, Stefan Priebe - Profihost AG wrote:

Hi,

no ok it was not. Bug still present. It was only working because the
osdmap was so far away that it has started backfill instead of recovery.

So it happens only in the recovery case.

Greets,
Stefan

On 15.01.19 at 16:02, Stefan Priebe - Profihost AG wrote:

On 15.01.19 at 12:45, Marc Roos wrote:
  
I upgraded this weekend from 12.2.8 to 12.2.10 without such issues

(osd's are idle)


it turns out this was a kernel bug. Updating to a newer kernel - has
solved this issue.

Greets,
Stefan



-Original Message-
From: Stefan Priebe - Profihost AG [mailto:s.pri...@profihost.ag]
Sent: 15 January 2019 10:26
To: ceph-users@lists.ceph.com
Cc: n.fahldi...@profihost.ag
Subject: Re: [ceph-users] slow requests and high i/o / read rate on
bluestore osds after upgrade 12.2.8 -> 12.2.10

Hello list,

i also tested current upstream/luminous branch and it happens as well. A
clean install works fine. It only happens on upgraded bluestore osds.

Greets,
Stefan

On 14.01.19 at 20:35, Stefan Priebe - Profihost AG wrote:

while trying to upgrade a cluster from 12.2.8 to 12.2.10 i'm

experience

issues with bluestore osds - so i canceled the upgrade and all

bluestore

osds are stopped now.

After starting a bluestore osd i'm seeing a lot of slow requests


Re: [ceph-users] slow requests and high i/o / read rate on bluestore osds after upgrade 12.2.8 -> 12.2.10

2019-01-16 Thread Mark Nelson

Hi Stefan,


12.2.9 included the pg hard limit patches and the osd_memory_autotuning 
patches.  While at first I was wondering if this was autotuning, it 
sounds like it may be more related to the pg hard limit.  I'm not 
terribly familiar with those patches though so some of the other members 
from the core team may need to take a look.



Mark


On 1/16/19 9:00 AM, Stefan Priebe - Profihost AG wrote:

This is not the case with 12.2.8 - it happens with 12.2.9 as well. After
boot all pgs are instantly active - not inactive pgs at least not
noticable in ceph -s.

With 12.2.9 or 12.2.10 or eben current upstream/luminous it takes
minutes until all pgs are active again.

Greets,
Stefan
On 16.01.19 at 15:22, Stefan Priebe - Profihost AG wrote:

Hello,

while digging into this further i saw that it takes ages until all pgs
are active. After starting the OSD 3% of all pgs are inactive and it
takes minutes after they're active.

The log of the OSD is full of:


2019-01-16 15:19:13.568527 7fecbf7da700  0 osd.33 pg_epoch: 1318479
pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'615747
21,1318474'61584855] local-lis/les=1318472/1318473 n=1912
ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 131
8472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1
rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+rec
overing+degraded m=184 snaptrimq=[ec1a0~1,ec808~1]
mbc={255={(2+0)=185,(3+0)=2}}] _update_calc_stats ml 185 upset size 3 up 2
2019-01-16 15:19:13.568637 7fecbf7da700  0 osd.33 pg_epoch: 1318479
pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'615747
21,1318474'61584855] local-lis/les=1318472/1318473 n=1912
ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 131
8472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1
rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+rec
overing+degraded m=184 snaptrimq=[ec1a0~1,ec808~1]
mbc={255={(2+0)=185,(3+0)=2}}] _update_calc_stats ml 2 upset size 3 up 3
2019-01-16 15:19:15.909327 7fecbf7da700  0 osd.33 pg_epoch: 1318479
pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'615747
21,1318474'61584855] local-lis/les=1318472/1318473 n=1912
ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 131
8472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1
rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+rec
overing+degraded m=183 snaptrimq=[ec1a0~1,ec808~1]
mbc={255={(2+0)=184,(3+0)=3}}] _update_calc_stats ml 184 upset size 3 up 2
2019-01-16 15:19:15.909446 7fecbf7da700  0 osd.33 pg_epoch: 1318479
pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'615747
21,1318474'61584855] local-lis/les=1318472/1318473 n=1912
ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 131
8472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1
rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+rec
overing+degraded m=183 snaptrimq=[ec1a0~1,ec808~1]
mbc={255={(2+0)=184,(3+0)=3}}] _update_calc_stats ml 3 upset size 3 up 3
2019-01-16 15:19:23.503231 7fecb97ff700  0 osd.33 pg_epoch: 1318479
pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'615747
21,1318474'61584855] local-lis/les=1318472/1318473 n=1912
ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 131
8472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1
rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+rec
overing+degraded m=183 snaptrimq=[ec1a0~1,ec808~1]
mbc={255={(2+0)=183,(3+0)=3}}] _update_calc_stats ml 183 upset size 3 up 2

Greets,
Stefan
On 16.01.19 at 09:12, Stefan Priebe - Profihost AG wrote:

Hi,

no ok it was not. Bug still present. It was only working because the
osdmap was so far away that it has started backfill instead of recovery.

So it happens only in the recovery case.

Greets,
Stefan

On 15.01.19 at 16:02, Stefan Priebe - Profihost AG wrote:

On 15.01.19 at 12:45, Marc Roos wrote:
  
I upgraded this weekend from 12.2.8 to 12.2.10 without such issues

(osd's are idle)


it turns out this was a kernel bug. Updating to a newer kernel - has
solved this issue.

Greets,
Stefan



-Original Message-
From: Stefan Priebe - Profihost AG [mailto:s.pri...@profihost.ag]
Sent: 15 January 2019 10:26
To: ceph-users@lists.ceph.com
Cc: n.fahldi...@profihost.ag
Subject: Re: [ceph-users] slow requests and high i/o / read rate on
bluestore osds after upgrade 12.2.8 -> 12.2.10

Hello list,

i also tested current upstream/luminous branch and it happens as well. A
clean install works fine. It only happens on upgraded bluestore osds.

Greets,
Stefan

On 14.01.19 at 20:35, Stefan Priebe - Profihost AG wrote:

while trying to upgrade a cluster from 12.2.8 to 12.2.10 i'm

experience

issues with bluestore osds - so i canceled the upgrade and all

bluestore

osds are stopped now.

After starting a bluestore osd i'm seeing a lot of slow requests

caused

by very high read rates.


Device: rrqm/s   

Re: [ceph-users] slow requests and high i/o / read rate on bluestore osds after upgrade 12.2.8 -> 12.2.10

2019-01-15 Thread Mark Nelson


On 1/15/19 9:02 AM, Stefan Priebe - Profihost AG wrote:

On 15.01.19 at 12:45, Marc Roos wrote:
  
I upgraded this weekend from 12.2.8 to 12.2.10 without such issues

(osd's are idle)


it turns out this was a kernel bug. Updating to a newer kernel - has
solved this issue.

Greets,
Stefan



Hi Stefan, can you tell me what kernel you were on and what hardware was 
involved?  I want to make sure that it's recorded for the community in 
case others run into the same issue.



Thanks,

Mark






-Original Message-
From: Stefan Priebe - Profihost AG [mailto:s.pri...@profihost.ag]
Sent: 15 January 2019 10:26
To: ceph-users@lists.ceph.com
Cc: n.fahldi...@profihost.ag
Subject: Re: [ceph-users] slow requests and high i/o / read rate on
bluestore osds after upgrade 12.2.8 -> 12.2.10

Hello list,

i also tested current upstream/luminous branch and it happens as well. A
clean install works fine. It only happens on upgraded bluestore osds.

Greets,
Stefan

On 14.01.19 at 20:35, Stefan Priebe - Profihost AG wrote:

while trying to upgrade a cluster from 12.2.8 to 12.2.10 i'm

experience

issues with bluestore osds - so i canceled the upgrade and all

bluestore

osds are stopped now.

After starting a bluestore osd i'm seeing a lot of slow requests

caused

by very high read rates.


Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda  45,00   187,00  767,00   39,00 482040,00  8660,00
1217,6258,16   74,60   73,85   89,23   1,24 100,00

it reads permanently with 500MB/s from the disk and can't service

client

requests. Overall client read rate is at 10.9MiB/s rd

I can't reproduce this with 12.2.8. Is this a known bug / regression?

Greets,
Stefan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow requests and high i/o / read rate on bluestore osds after upgrade 12.2.8 -> 12.2.10

2019-01-14 Thread Mark Nelson

Hi Stefan,


Any idea if the reads are constant or bursty?  One cause of heavy reads 
is when rocksdb is compacting and has to read SST files from disk.  It's 
also possible you could see heavy read traffic during writes if data has 
to be read from SST files rather than cache. It's possible this could be 
related to the osd_memory_autotune feature.  It will try to keep OSD 
memory usage within a certain footprint (4GB by default) which 
supercedes the bluestore cache size (it automatically sets the cache 
size based on the osd_memory_target).



To see what's happening during compaction, you can run this script 
against one of your bluestore OSD logs:


https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py
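
The invocation is roughly the following (exact options may differ, check the
script's help):

python ceph_rocksdb_log_parser.py /var/log/ceph/ceph-osd.0.log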


Mark

On 1/14/19 1:35 PM, Stefan Priebe - Profihost AG wrote:

Hi,

while trying to upgrade a cluster from 12.2.8 to 12.2.10 i'm experience
issues with bluestore osds - so i canceled the upgrade and all bluestore
osds are stopped now.

After starting a bluestore osd i'm seeing a lot of slow requests caused
by very high read rates.


Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda  45,00   187,00  767,00   39,00 482040,00  8660,00
1217,6258,16   74,60   73,85   89,23   1,24 100,00

it reads permanently with 500MB/s from the disk and can't service client
requests. Overall client read rate is at 10.9MiB/s rd

I can't reproduce this with 12.2.8. Is this a known bug / regression?

Greets,
Stefan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow rbd reads (fast writes) with luminous + bluestore

2018-12-13 Thread Mark Nelson

Hi Florian,

On 12/13/18 7:52 AM, Florian Haas wrote:

On 02/12/2018 19:48, Florian Haas wrote:

Hi Mark,

just taking the liberty to follow up on this one, as I'd really like to
get to the bottom of this.

On 28/11/2018 16:53, Florian Haas wrote:

On 28/11/2018 15:52, Mark Nelson wrote:

Option("bluestore_default_buffered_read", Option::TYPE_BOOL,
Option::LEVEL_ADVANCED)
     .set_default(true)
     .set_flag(Option::FLAG_RUNTIME)
     .set_description("Cache read results by default (unless hinted
NOCACHE or WONTNEED)"),

     Option("bluestore_default_buffered_write", Option::TYPE_BOOL,
Option::LEVEL_ADVANCED)
     .set_default(false)
     .set_flag(Option::FLAG_RUNTIME)
     .set_description("Cache writes by default (unless hinted NOCACHE or
WONTNEED)"),


This is one area where bluestore is a lot more confusing for users than
filestore was.  There was a lot of concern about enabling buffer cache
on writes by default because there's some associated overhead
(potentially both during writes and in the mempool thread when trimming
the cache).  It might be worth enabling bluestore_default_buffered_write
and see if it helps reads.

So yes this is rather counterintuitive, but I happily gave it a shot and
the results are... more head-scratching than before. :)

The output is here: http://paste.openstack.org/show/736324/

In summary:

1. Write benchmark is in the same ballpark as before (good).

2. Read benchmark *without* readahead is *way* better than before
(splendid!) but has a weird dip down to 9K IOPS that I find
inexplicable. Any ideas on that?

3. Read benchmark *with* readahead is still abysmal, which I also find
rather odd. What do you think about that one?

These two still confuse me.

And in addition, I'm curious as to what you think of the approach to
configure OSDs with bluestore_cache_kv_ratio = .49, so that rather
than using 1%/99%/0% of cache memory for metadata/KV data/objects, the
OSDs use 1%/49%/50%. Is this sensible? I assume the default of not using
any memory to actually cache object data is there for a reason, but I am
struggling to grasp what that reason would be. Particularly since in
filestore, we always got in-memory object caching for free, via the page
cache.

Hi Mark,

do you mind if I give this another poke?



Sorry, I got super busy with things and totally forgot about this.  
Weird dips always make me think compaction.  One thing we've seen is 
that compaction can force the entire cache to flush, invalidate all of 
the indexes/filters, and generally slow everything down.  If you still 
have the OSD log you can run this tool to get compaction event stats 
(and restrict it to certain level compactions if you like):



https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py


No idea why readahead would be that much slower.  We just saw a case 
where large sequential reads were incredibly slow with certain NVMe 
drives and LVM that were fixed by a kernel upgrade, but that was a very 
specific case.


Regarding meta/kv/data ratios:  It's really tough to configure optimal 
settings for all situations.  Generally for RGW you need more KV cache 
and for RBD you need more meta cache, but it's highly variable (ie even 
in the RBD case you need enough KV cache to make sure all 
indexes/filters are cached, and in the RGW case you still may want to 
prioritize hot bluestore onodes).  That's why I started writing the 
autotuning code.  Because the cache is hierarchical, the worst case 
situation is that you just end up caching the same onode data twice in 
both places (especially if you end up forcing out omap data you need 
cached).  The best case situation is that you cache the most useful 
recent data with as little double caching as possible.  That's sort of 
the direction I'm trying to head with the autotuner.





Cheers,
Florian

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SLOW SSD's after moving to Bluestore

2018-12-10 Thread Mark Nelson

Hi Tyler,

I think we had a user a while back that reported they had background 
deletion work going on after upgrading their OSDs from filestore to 
bluestore due to PGs having been moved around.  Is it possible that your 
cluster is doing a bunch of work (deletion or otherwise) beyond the 
regular client load?  I don't remember how to check for this off the top 
of my head, but it might be something to investigate.  If that's what it 
is, we just recently added the ability to throttle background deletes:


https://github.com/ceph/ceph/pull/24749


If the logs/admin socket don't tell you anything, you could also try 
using our wallclock profiler to see what the OSD is spending it's time 
doing:


https://github.com/markhpc/gdbpmp/


./gdbpmp -t 1000 -p`pidof ceph-osd` -o foo.gdbpmp

./gdbpmp -i foo.gdbpmp -t 1


Mark

On 12/10/18 6:09 PM, Tyler Bishop wrote:

Hi,

I have an SSD only cluster that I recently converted from filestore to 
bluestore and performance has totally tanked. It was fairly decent 
before, only having a little more latency than expected.  Now 
since converting to bluestore the latency is extremely high, SECONDS.  
I am trying to determine if it is an issue with the SSDs or Bluestore 
treating them differently than filestore... potential garbage 
collection? 24+ hrs ???


I am now seeing constant 100% IO utilization on ALL of the devices and 
performance is terrible!


IOSTAT

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.37    0.00    0.34   18.59    0.00   79.70

Device:         rrqm/s   wrqm/s     r/s     w/s rkB/s    wkB/s 
avgrq-sz avgqu-sz   await r_await w_await svctm  %util
sda               0.00     0.00    0.00    9.50  0.00    64.00    
13.47     0.01    1.16    0.00    1.16  1.11   1.05
sdb               0.00    96.50    4.50   46.50 34.00 11776.00  
 463.14   132.68 1174.84  782.67 1212.80 19.61 100.00
dm-0              0.00     0.00    5.50  128.00 44.00  8162.00  
 122.94   507.84 1704.93  674.09 1749.23  7.49 100.00


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.85    0.00    0.30   23.37    0.00   75.48

Device:         rrqm/s   wrqm/s     r/s     w/s rkB/s    wkB/s 
avgrq-sz avgqu-sz   await r_await w_await svctm  %util
sda               0.00     0.00    0.00    3.00  0.00    17.00    
11.33     0.01    2.17    0.00    2.17  2.17   0.65
sdb               0.00    24.50    9.50   40.50 74.00 1.00  
 402.96    83.44 2048.67 1086.11 2274.46 20.00 100.00
dm-0              0.00     0.00   10.00   33.50 78.00  2120.00  
 101.06   287.63 8590.47 1530.40 10697.96 22.99 100.00


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.81    0.00    0.30   11.40    0.00   87.48

Device:         rrqm/s   wrqm/s     r/s     w/s rkB/s    wkB/s 
avgrq-sz avgqu-sz   await r_await w_await svctm  %util
sda               0.00     0.00    0.00    6.00  0.00    40.25    
13.42     0.01    1.33    0.00    1.33  1.25   0.75
sdb               0.00   314.50   15.50   72.00  122.00 17264.00  
 397.39    61.21 1013.30  740.00 1072.13  11.41  99.85
dm-0              0.00     0.00   10.00  427.00 78.00 27728.00  
 127.26   224.12  712.01 1147.00  701.82  2.28  99.85


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.22    0.00    0.29    4.01    0.00   94.47

Device:         rrqm/s   wrqm/s     r/s     w/s rkB/s    wkB/s 
avgrq-sz avgqu-sz   await r_await w_await svctm  %util
sda               0.00     0.00    0.00    3.50  0.00    17.00    
 9.71     0.00    1.29    0.00    1.29  1.14   0.40
sdb               0.00     0.00    1.00   39.50  8.00 10112.00  
 499.75    78.19 1711.83 1294.50 1722.39 24.69 100.00



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow rbd reads (fast writes) with luminous + bluestore

2018-11-28 Thread Mark Nelson


On 11/28/18 8:36 AM, Florian Haas wrote:

On 14/08/2018 15:57, Emmanuel Lacour wrote:

On 13/08/2018 at 16:58, Jason Dillaman wrote:

See [1] for ways to tweak the bluestore cache sizes. I believe that by
default, bluestore will not cache any data but instead will only
attempt to cache its key/value store and metadata.

I suppose too because default ratio is to cache as much as possible k/v
up to 512M and hdd cache is 1G by default.

I tried to increase hdd cache up to 4G and it seems to be used, 4 osd
processes uses 20GB now.


In general, however, I would think that attempting to have bluestore
cache data is just an attempt to optimize to the test instead of
actual workloads. Personally, I think it would be more worthwhile to
just run 'fio --ioengine=rbd' directly against a pre-initialized image
after you have dropped the cache on the OSD nodes.

So with bluestore, I assume that we need to think more of client page
cache (at least when using a VM)  when with old filestore both osd and
client cache were used.
  
For benchmark, I did real benchmark here for the expected app workload

of this new cluster and it's ok for us :)


Thanks for your help Jason.

Shifting over a discussion from IRC and taking the liberty to resurrect
an old thread, as I just ran into the same (?) issue. I see
*significantly* reduced performance on RBD reads, compared to writes
with the same parameters. "rbd bench --io-type read" gives me 8K IOPS
(with the default 4K I/O size), whereas "rbd bench --io-type write"
produces more than twice that.

I should probably add that while my end result of doing an "rbd bench
--io-type read" is about half of what I get from a write benchmark, the
intermediate ops/sec output fluctuates from > 30K IOPS (about twice the
write IOPS) to about 3K IOPS (about 1/6 of what I get for writes). So
really, my read IOPS are all over the map (and terrible on average),
whereas my write IOPS are not stellar, but consistent.

This is an all-bluestore cluster on spinning disks with Luminous, and
I've tried the following things:

- run rbd bench with --rbd_readahead_disable_after_bytes=0 and
--rbd_readahead_max_bytes=4194304 (per
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008271.html)

- configure OSDs with a larger bluestore_cache_size_hdd (4G; default is 1G)

- configure OSDs with bluestore_cache_kv_ratio = .49, so that rather
than using 1%/99%/0% for metadata/KV data/objects, the OSDs use 1%/49%/50%

None of the above produced any tangible improvement. Benchmark results
are at http://paste.openstack.org/show/736314/ if anyone wants to take a
look.

I'd be curious to see if anyone has a suggestion on what else to try.
Thanks in advance!
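
(In case it helps anyone reproduce this, the invocations were roughly
along these lines; the image spec is just an example, and the readahead
settings are passed as plain config overrides on the command line:

    rbd bench --io-type write --io-size 4096 --io-threads 16 rbd/bench-img
    rbd bench --io-type read --io-size 4096 --io-threads 16 \
        --rbd_readahead_disable_after_bytes=0 \
        --rbd_readahead_max_bytes=4194304 rbd/bench-img
)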



Hi Florian,


By default bluestore will cache buffers on reads but not on writes 
(unless there are hints):



Option("bluestore_default_buffered_read", Option::TYPE_BOOL, 
Option::LEVEL_ADVANCED)

    .set_default(true)
    .set_flag(Option::FLAG_RUNTIME)
    .set_description("Cache read results by default (unless hinted 
NOCACHE or WONTNEED)"),


    Option("bluestore_default_buffered_write", Option::TYPE_BOOL, 
Option::LEVEL_ADVANCED)

    .set_default(false)
    .set_flag(Option::FLAG_RUNTIME)
    .set_description("Cache writes by default (unless hinted NOCACHE or 
WONTNEED)"),



This is one area where bluestore is a lot more confusing for users than 
filestore was.  There was a lot of concern about enabling buffer cache 
on writes by default because there's some associated overhead 
(potentially both during writes and in the mempool thread when trimming 
the cache).  It might be worth enabling bluestore_default_buffered_write 
and seeing if it helps reads.  You'll probably also want to pay attention 
to writes though.  I think we might want to consider enabling it by 
default, but we should go through and do a lot of careful testing first. 
FWIW I did have it enabled when testing the new memory target code (and 
the not-yet-merged age-binned autotuning).  It was doing OK in my tests, 
but I didn't do an apples-to-apples comparison with it off.
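
If anyone wants to experiment with that, a minimal sketch (the option is
flagged as runtime-changeable, so injectargs should work on a test
cluster; otherwise set it in ceph.conf and restart the OSDs):

    # ceph.conf on the OSD nodes
    [osd]
    bluestore_default_buffered_write = true

    # or at runtime:
    ceph tell osd.* injectargs '--bluestore_default_buffered_write=true'

Keep an eye on write latency and OSD memory while it's on.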



Mark




Cheers,
Florian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] RGW performance with lots of objects

2018-11-27 Thread Mark Nelson

Hi Robert,


Solved is probably a strong word.  I'd say that things have improved.  
Bluestore in general tends to handle large numbers of objects better 
than filestore does, for several reasons, including that it doesn't 
suffer from PG directory splitting (though RocksDB compaction can become 
a bottleneck with very large DBs and heavy metadata traffic).  Bluestore 
also has less overhead for OMAP operations, and so far we've generally 
seen higher OMAP performance (i.e. how bucket indexes are currently 
stored).  The bucket index sharding of course helps too.

One counter argument is that bluestore uses the key/value DB a lot more 
aggressively than filestore does, and that could have an impact on bucket 
indexes hosted on the same OSDs as user objects.  This gets sort of 
complicated though, and may primarily be an issue if all of your OSDs are 
backed by NVMe and sustaining very high write traffic.

Ultimately I suspect that if you ran the same 500+M object single-bucket 
test, a modern bluestore deployment would probably be faster than what 
you saw pre-luminous with filestore.  Whether or not it's acceptable is a 
different question.  For example, I've noticed in past tests that delete 
performance improved dramatically when objects were spread across a 
higher number of buckets.  Probably the best course of action will be to 
run tests and diagnose the behavior to see if it's going to meet your needs.
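
If you do run those tests, it's worth looking at how the bucket index is
sharded first; a rough sketch (bucket name and shard count are just
examples):

    radosgw-admin bucket stats --bucket=mybucket    # object count and usage
    radosgw-admin bucket limit check                # shard count and fill status per bucket
    radosgw-admin reshard add --bucket=mybucket --num-shards=128
    radosgw-admin reshard process

Luminous can also reshard automatically when rgw_dynamic_resharding is
enabled (it defaults to on).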



Thanks,

Mark


On 11/27/18 12:10 PM, Robert Stanford wrote:


In the old days when I first installed Ceph with RGW the performance 
would be very slow after storing 500+ million objects in my buckets. 
With Luminous and index sharding is this still a problem or is this an 
old problem that has been solved?


Regards
R

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] bucket indices: ssd-only or is a large fast block.db sufficient?

2018-11-20 Thread Mark Nelson
One consideration is that you may not be able to fit higher DB levels on 
the db partition and end up with a lot of waste (Nick Fisk recently saw 
this on his test cluster).  We've talked about potentially trying to 
pre-compute the hierarchy sizing so that we can align a level boundary 
to fit within the db partition size.  I'm concerned there could be some 
unintended consequences (i.e. having a media transition and a write-amp 
jump hit at the same time).  I tend to wonder if we should focus on 
either DB or column family sharding and just get some fraction of the 
high level SSTs from different shards on the db partition.


I think you could probably make either configuration work, and there are 
advantages and disadvantages to each approach. (sizing, complexity, 
write-amp, etc).  If you go for the 2nd option, you probably still want 
some portion of the SSDs carved out for DB/WAL for the data pool which 
would shrink how much you'd have available for the flash-only OSDs.  One 
point I do want to bring up is that we're considering experimenting with 
layering bucket index pools on top of objects rather than using OMAP.  
No idea if that will pan out (or even how far we'll get), but if that 
ends up being a win, you might prefer the second approach as the objects 
would end up on flash.  The 2nd approach is also the only option as far 
as filestore goes, though I'm not sure if that really matters to you guys.
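
To make the level-fitting point above concrete, the rough arithmetic with
bluestore's default RocksDB tuning (max_bytes_for_level_base = 256MB,
level multiplier = 10) looks like this; treat it as an approximation, not
a rule:

    L1 ~ 256MB, L2 ~ 2.5GB, L3 ~ 25GB, L4 ~ 250GB
    WAL + L0..L3 ~ 30GB   -> fits on a 30-40GB db partition
    adding L4    ~ 280GB  -> sizes between roughly 40GB and 280GB are mostly
                             wasted, since L4 spills to the slow device anyway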



Mark

On 11/20/18 8:48 AM, Gregory Farnum wrote:
Looks like you’ve considered the essential points for bluestore OSDs, 
yep. :)
My concern would just be the surprisingly-large block.db requirements 
for rgw workloads that have been brought up. (300+GB per OSD, I think 
someone saw/worked out?).

-Greg

On Tue, Nov 20, 2018 at 1:35 AM Dan van der Ster > wrote:


Hi ceph-users,

Most of our servers have 24 hdds plus 4 ssds.
Any experience how these should be configured to get the best rgw
performance?

We have two options:
   1) All osds the same, with data on the hdd and block.db on a 40GB
ssd partition
   2) Two osd device types: hdd-only for the rgw data pool and
ssd-only for bucket index pool

But all of the bucket index data is in omap, right?
And all of the omap is stored in the rocks db, right?

After reading the recent threads about bluefs slow_used_bytes, I had
the thought that as long as we have a large enough block.db, then
slow_used_bytes will be 0 and all of the bucket indexes will be on
ssd-only, regardless of option (1) or (2) above.

Any thoughts?

Thanks!

Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] How many PGs per OSD is too many?

2018-11-14 Thread Mark Nelson


On 11/14/18 1:45 PM, Vladimir Brik wrote:

Hello

I have a ceph 13.2.2 cluster comprised of 5 hosts, each with 16 HDDs 
and 4 SSDs. HDD OSDs have about 50 PGs each, while SSD OSDs have about 
400 PGs each (a lot more pools use SSDs than HDDs). Servers are fairly 
powerful: 48 HT cores, 192GB of RAM, and 2x25Gbps Ethernet.


The impression I got from the docs is that having more than 200 PGs 
per OSD is not a good thing, but justifications were vague (no 
concrete numbers), like increased peering time, increased resource 
consumption, and possibly decreased recovery performance. None of 
these appeared to be a significant problem in my testing, but the 
tests were very basic and done on a pretty empty cluster under minimal 
load, so I worry I'll run into trouble down the road.


Here are the questions I have:
- In practice, is it a big deal that some OSDs have ~400 PGs?
- In what situations would our cluster most likely fare significantly 
better if I went through the trouble of re-creating pools so that no 
OSD would have more than, say, ~100 PGs?
- What performance metrics could I monitor to detect possible issues 
due to having too many PGs?



It's a fuzzy sort of thing.  During normal operation, with more PGs 
you'll store more pglog info in memory, so you'll have a more bloated 
OSD process.  If you use the new bluestore option for setting an osd 
memory target, that will mean less memory for caches.  It will also 
likely mean that there's a greater chance that pglog entries won't be 
invalidated before memtable flushes in rocksdb, so you might end up with 
higher write amp and slower DB performance as those entries get 
compacted into L0+.  That could matter with RGW or if you are doing lots 
of small 4k writes with RBD.
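
If you want to see how much memory the pglogs are actually consuming on a
given OSD, the mempool dump shows it directly (osd.0 is just an example):

    ceph daemon osd.0 dump_mempools | grep -A 2 osd_pglog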


I'd see what Neha/Josh think about the impact on recovery, though I 
suppose one upside is that more PGs means you get a longer log based 
recovery window.  You could accomplish the same effect by increasing the 
number of pglog entries per PG (or keep the same overall number of 
entries by having more PGs and lowering the number of entries per PG).  
An upside to having more PGs is better data distribution quality, though 
we can now get much better distributions with the new balancer code, even 
with fewer PGs.  One bad thing about having too few PGs is that you can 
have increased lock contention.  The balancer can make the data 
distribution better, but you still can't shrink the number of PGs per 
pool too low.
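
If you want to look at the distribution side of that trade-off, a minimal
sketch of checking per-OSD PG counts and enabling the balancer (upmap mode
needs luminous or newer clients):

    ceph osd df tree        # the PGS column shows PGs per OSD
    ceph osd set-require-min-compat-client luminous
    ceph balancer mode upmap
    ceph balancer on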


The gist of it is that if you decide to look into this yourself you are 
probably going to find some contradictory evidence and trade-offs.  
There are pitfalls if you go too high and pitfalls if you go too low.  
I'm not sure we can easily define the exact PG counts/OSD where they 
happen since it's sort of dependent on how much memory you have, how 
fast your hardware is, whether you are using the balancer, and what your 
expectations are.


How's that for a non-answer? ;)

Mark




Thanks,

Vlad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] Some questions concerning filestore --> bluestore migration

2018-10-05 Thread Mark Nelson
FWIW, here are values I measured directly from the RocksDB SST files 
under different small write workloads (ie the ones where you'd expect a 
larger DB footprint):


https://drive.google.com/file/d/1Ews2WR-y5k3TMToAm0ZDsm7Gf_fwvyFw/view?usp=sharing

These tests were only with 256GB of data written to a single OSD, so 
there's no guarantee that it will scale linearly up to 10TB 
(specifically it's possible that much larger RocksDB databases could 
have higher space amplification).  Also note that the RGW numbers could 
be very dependent on the client workload and are not likely universally 
representative.


Also remember that if you run out of space on your DB partitions you'll 
just end up putting higher rocksdb levels on the block device.  Slower 
to be sure, but not necessarily worse than filestore's behavior 
(especially in the RGW case, where the large object counts will cause PG 
directory splitting chaos).
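
For what it's worth, pointing a new bluestore OSD at an explicit DB
partition is just a ceph-volume flag; a sketch (device names are examples,
and the partition size is what determines how big the DB can grow before
spilling):

    ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/sdk1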


Mark

On 10/05/2018 01:38 PM, solarflow99 wrote:
oh my.. yes 2TB enterprise class SSDs, that a much higher requirement 
than filestore needed.  That would be cost prohibitive to any lower 
end ceph cluster,




On Thu, Oct 4, 2018 at 11:19 PM Massimo Sgaravatto 
mailto:massimo.sgarava...@gmail.com>> 
wrote:


Argg !!
With 10x10TB SATA DB and 2 SSD disks this would mean 2 TB for each
SSD !
If this is really required I am afraid I will keep using filestore ...

Cheers, Massimo

On Fri, Oct 5, 2018 at 7:26 AM mailto:c...@elchaka.de>> wrote:

Hello

Am 4. Oktober 2018 02:38:35 MESZ schrieb solarflow99
mailto:solarflo...@gmail.com>>:
>I use the same configuration you have, and I plan on using
bluestore.
>My
>SSDs are only 240GB and it worked with filestore all this time, I
>suspect
>bluestore should be fine too.
>
>
>On Wed, Oct 3, 2018 at 4:25 AM Massimo Sgaravatto <
>massimo.sgarava...@gmail.com
> wrote:
>
>> Hi
>>
>> I have a ceph cluster, running luminous, composed of 5 OSD
nodes,
>which is
>> using filestore.
>> Each OSD node has 2 E5-2620 v4 processors, 64 GB of RAM,
10x6TB SATA
>disk
>> + 2x200GB SSD disk (then I have 2 other disks in RAID for
the OS), 10
>Gbps.
>> So each SSD disk is used for the journal for 5 OSDs. With this
>> configuration everything is running smoothly ...
>>
>>
>> We are now buying some new storage nodes, and I am trying
to buy
>something
>> which is bluestore compliant. So the idea is to consider a
>configuration
>> something like:
>>
>> - 10 SATA disks (8TB / 10TB / 12TB each. TBD)
>> - 2 processor (~ 10 core each)
>> - 64 GB of RAM
>> - 2 SSD to be used for WAL+DB
>> - 10 Gbps
>>
>> For what concerns the size of the SSD disks I read in this
mailing
>list
>> that it is suggested to have at least 10GB of SSD disk/10TB
of SATA
>disk.
>>
>>
>> So, the questions:
>>
>> 1) Does this hardware configuration seem reasonable ?
>>
>> 2) Are there problems to live (forever, or until filestore
>deprecation)
>> with some OSDs using filestore (the old ones) and some OSDs
using
>bluestore
>> (the old ones) ?
>>
>> 3) Would you suggest to update to bluestore also the old
OSDs, even
>if the
>> available SSDs are too small (they don't satisfy the "10GB
of SSD
>disk/10TB
>> of SATA disk" rule) ?

AFAIR the DB size should be 4% of the OSD in question.

So

For example, if the block size is 1TB, then block.db shouldn’t
be less than 40GB

See:

http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/

Hth
- Mehmet

>>
>> Thanks, Massimo
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] Bluestore DB size and onode count

2018-09-10 Thread Mark Nelson

On 09/10/2018 12:22 PM, Igor Fedotov wrote:


Hi Nick.


On 9/10/2018 1:30 PM, Nick Fisk wrote:
If anybody has 5 minutes could they just clarify a couple of things 
for me


1. onode count, should this be equal to the number of objects stored 
on the OSD?
Through reading several posts, there seems to be a general indication 
that this is the case, but looking at my OSDs the maths don't work.
onode_count is the number of onodes in the cache, not the total number 
of onodes at an OSD.

Hence the difference...


Eg.
ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE  USE    AVAIL  %USE  VAR  PGS
  0   hdd 2.73679  1.0 2802G  1347G  1454G 48.09 0.69 115

So 3TB OSD, roughly half full.  This is pure RBD workload (no 
snapshots or anything clever), so let's assume the worst-case scenario of 
4MB objects (compression is on, however, which would only mean more 
objects for a given size).

1347000/4=~336750 expected objects

sudo ceph daemon osd.0 perf dump | grep blue
 "bluefs": {
 "bluestore": {
 "bluestore_allocated": 1437813964800,
 "bluestore_stored": 2326118994003,
 "bluestore_compressed": 445228558486,
 "bluestore_compressed_allocated": 547649159168,
 "bluestore_compressed_original": 1437773843456,
 "bluestore_onodes": 99022,
 "bluestore_onode_hits": 18151499,
 "bluestore_onode_misses": 4539604,
 "bluestore_onode_shard_hits": 10596780,
 "bluestore_onode_shard_misses": 4632238,
 "bluestore_extents": 896365,
 "bluestore_blobs": 861495,

99022 onodes, anyone care to enlighten me?

2. block.db Size
sudo ceph daemon osd.0 perf dump | grep db
 "db_total_bytes": 8587829248,
 "db_used_bytes": 2375024640,

2.3GB = 0.17% of data size.  This seems a lot lower than the 1% 
recommendation (10GB for every 1TB) or the 4% given in the official docs.  
I know that different workloads will have differing overheads and 
potentially smaller objects, but am I understanding these figures 
correctly?  They seem dramatically lower.
Just in case - is slow_used_bytes equal to 0?  Some DB data might 
reside on the slow device if spill-over has happened, which doesn't 
require the DB volume to be full - that's by RocksDB's design.
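
A quick way to check on a running OSD (osd.0 is just an example):

    ceph daemon osd.0 perf dump | grep -E '(db|slow)_(total|used)_bytes'

Non-zero slow_used_bytes means some DB data has already spilled onto the
slow device.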


And the recommended numbers are a bit... speculative.  So it's quite 
possible that your numbers are absolutely adequate.


FWIW, these are the numbers I came up with after examining the SST files 
generated under different workloads:


https://drive.google.com/file/d/1Ews2WR-y5k3TMToAm0ZDsm7Gf_fwvyFw/view?usp=sharing



Regards,
Nick

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




Re: [ceph-users] Increase tcmalloc thread cache bytes - still recommended?

2018-07-19 Thread Mark Nelson
I believe that the standard mechanisms for launching OSDs already set 
the thread cache higher than the default.  It's possible we might be able 
to relax that now, as async messenger doesn't thrash the cache as badly 
as simple messenger did.  I suspect there's probably still some value to 
increasing it over the default for SSDs.
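
For reference, that setting comes from the environment file the OSD units
read; a sketch if you want to experiment (the path depends on the distro
and the value is just an example):

    # /etc/sysconfig/ceph (RHEL/CentOS) or /etc/default/ceph (Debian/Ubuntu)
    TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728   # 128MB; restart the OSDs afterwards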


Mark

On 07/19/2018 01:35 PM, Gregory Farnum wrote:
I don't think that's a default recommendation — Ceph is doing more 
configuration of tcmalloc these days, tcmalloc has resolved a lot of 
bugs, and that was only ever a thing that mattered for SSD-backed OSDs 
anyway.

-Greg

On Thu, Jul 19, 2018 at 5:50 AM Robert Stanford 
mailto:rstanford8...@gmail.com>> wrote:



 It seems that the Ceph community no longer recommends changing to
jemalloc.  However this also recommends to do what's in this
email's subject:
https://ceph.com/geen-categorie/the-ceph-and-tcmalloc-performance-story/

 Is it still recommended to increase the tcmalloc thread cache
bytes, or is that recommendation old and no longer applicable?

 Thanks
___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




Re: [ceph-users] jemalloc / Bluestore

2018-07-05 Thread Mark Nelson

Hi Uwe,


As luck would have it we were just looking at memory allocators again 
and ran some quick RBD and RGW tests that stress memory allocation:



https://drive.google.com/uc?export=download=1VlWvEDSzaG7fE4tnYfxYtzeJ8mwx4DFg


The gist of it is that tcmalloc looks like it's doing pretty well 
relative to the version of jemalloc and libc malloc tested (The jemalloc 
version here is pretty old though).  You are also correct that there 
have been reports of crashes with jemalloc, potentially related to 
rocksdb.  Right now it looks like our decision to stick with tcmalloc is 
still valid.  I wouldn't suggest switching unless you can find evidence 
that tcmalloc is behaving worse than the others (and please let me know 
if you do!).


Thanks,

Mark


On 07/05/2018 08:08 AM, Uwe Sauter wrote:

Hi all,

is using jemalloc still recommended for Ceph?

There are multiple sites (e.g. 
https://ceph.com/geen-categorie/the-ceph-and-tcmalloc-performance-story/) from 
2015 where jemalloc
is praised for higher performance but I found a bug report that Bluestore 
crashes when used with jemalloc.

Regards,

Uwe
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




Re: [ceph-users] Bluestore caching, flawed by design?

2018-04-02 Thread Mark Nelson

On 04/01/2018 07:59 PM, Christian Balzer wrote:


Hello,

firstly, Jack pretty much correctly correlated my issues to Mark's points,
more below.

On Sat, 31 Mar 2018 08:24:45 -0500 Mark Nelson wrote:


On 03/29/2018 08:59 PM, Christian Balzer wrote:


Hello,

my crappy test cluster was rendered inoperational by an IP renumbering
that wasn't planned and forced on me during a DC move, so I decided to
start from scratch and explore the fascinating world of Luminous/bluestore
and all the assorted bugs. ^_-
(yes I could have recovered the cluster by setting up a local VLAN with
the old IPs, extract the monmap, etc, but I consider the need for a
running monitor a flaw, since all the relevant data was present in the
leveldb).

Anyways, while I've read about bluestore OSD cache in passing here, the
back of my brain was clearly still hoping that it would use pagecache/SLAB
like other filesystems.
Which after my first round of playing with things clearly isn't the case.

This strikes me as a design flaw and regression because:

Bluestore's cache is not broken by design.


During further tests I verified something that caught my attention out of
the corner of my eye when glancing at atop output of the OSDs during my fio
runs.

Consider this fio run, after having done the same with write to populate
the file and caches (1GB per OSD default on the test cluster, 20 OSDs
total on 5 nodes):
---
$ fio --size=8G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
--rw=randread --name=fiojob --blocksize=4M --iodepth=32
---

This is being run against a kernel mounted RBD image.
On the Luminous test cluster it will read the data from the disks,
completely ignoring the pagecache on the host (as expected and desired)
AND the bluestore cache.

On a Jewel based test cluster with filestore the reads will be served from
the pagecaches of the OSD nodes, not only massively improving speed but
more importantly spindle contention.


Filestore absolutely will be able to do better than bluestore in the 
case where a single OSD benefits by utilizing all of the memory in a 
node, even at the expense of other OSDs.  One situation where this could 
be the case is RGW bucket indexes, but even there the better solution 
imho is to shard the buckets.  I'd argue though that you need to be 
careful about how you approach this.

Let's say you have a single node with multiple OSDs and one of those 
OSDs has a big set of temporarily hot read data.  If you let that OSD 
use up most of the memory on the node to cache the data set, all of the 
other OSDs have to give up something: namely, cached onodes.  That means 
that once your hot data is no longer hot, all of those other OSDs will 
need to perform future onode reads from disk.  Whether or not it's 
beneficial to cache the hot data set depends on how long it's going to 
stay hot and how likely it is that those other OSDs will have a 
read/write operation at some point in the future.  I'd argue that if you 
assume a generally mixed workload that spans multiple OSDs, you are much 
better off ignoring the hot data and simply keeping the onodes cached.


I suspect that the more common case where bluestore looks bad is when 
someone is benchmarking reads on a single filestore OSD vs a single 
bluestore OSD and doesn't bother giving bluestore a large portion of the 
memory on the node.  Filestore can look faster than bluestore in that 
case, especially if the data set is relatively small and can fit 
entirely in memory.  In the case where you've configured bluestore to 
use most of your available memory, bluestore should be pretty close.  
For some configurations/workloads potentially faster.




My guess is that bluestore treats "direct" differently than the kernel
accessing a filestore based OSD and I'm not sure what the "correct"
behavior here is.
But somebody migrating to bluestore with such a use case and plenty of RAM
on their OSD nodes is likely to notice this and not going to be happy about
it.


Like I said earlier, it's all about trade-offs.  The pagecache gives you 
a lot of flexibility and on slower devices the price you pay isn't 
terribly high.  On faster devices it's a bigger issue.






I'm not totally convinced that some of the trade-offs we've made with
bluestore's cache implementation are optimal, but I think you should
consider cooling your rhetoric down.


1. Completely new users may think that bluestore defaults are fine and
waste all that RAM in their machines.

What does "wasting" RAM mean in the context of a node running ceph? Are
you upset that other applications can't come in and evict bluestore
onode, OMAP, or object data from cache?


As Jack pointed out, unless you go around and start tuning things,
all available free RAM won't be used for caching.

This raises another point, it being per process data and from skimming
over some bluestore threads here, if you go and raise the cache to use
most RAM during normal ops you're likely

Re: [ceph-users] Bluestore caching, flawed by design?

2018-03-31 Thread Mark Nelson

On 03/29/2018 08:59 PM, Christian Balzer wrote:


Hello,

my crappy test cluster was rendered inoperational by an IP renumbering
that wasn't planned and forced on me during a DC move, so I decided to
start from scratch and explore the fascinating world of Luminous/bluestore
and all the assorted bugs. ^_-
(yes I could have recovered the cluster by setting up a local VLAN with
the old IPs, extract the monmap, etc, but I consider the need for a
running monitor a flaw, since all the relevant data was present in the
leveldb).

Anyways, while I've read about bluestore OSD cache in passing here, the
back of my brain was clearly still hoping that it would use pagecache/SLAB
like other filesystems.
Which after my first round of playing with things clearly isn't the case.

This strikes me as a design flaw and regression because:


Bluestore's cache is not broken by design.

I'm not totally convinced that some of the trade-offs we've made with 
bluestore's cache implementation are optimal, but I think you should 
consider cooling your rhetoric down.



1. Completely new users may think that bluestore defaults are fine and
waste all that RAM in their machines.


What does "wasting" RAM mean in the context of a node running ceph? Are 
you upset that other applications can't come in and evict bluestore 
onode, OMAP, or object data from cache?



2. Having a per OSD cache is inefficient compared to a common cache like
pagecache, since an OSD that is busier than others would benefit from a
shared cache more.


It's only "inefficient" if you assume that using the pagecache, and more 
generally, kernel syscalls, is free.  Yes the pagecache is convenient 
and yes it gives you a lot of flexibility, but you pay for that 
flexibility if you are trying to do anything fast.


For instance, take the new KPTI patches in the kernel for meltdown. Look 
at how badly it can hurt MyISAM database performance in MariaDB:


https://mariadb.org/myisam-table-scan-performance-kpti/

MyISAM does not have a dedicated row cache and instead caches row data 
in the page cache, as you suggest Bluestore should do for its data.  
Look at how badly KPTI hurts performance (~40%).  Now look at ARIA with a 
dedicated 128MB cache (less than 1%).  KPTI is a really good example of 
how much this stuff can hurt you, but syscalls, context switches, and 
page faults were already expensive even before meltdown.  Not to mention 
that right now bluestore keeps onodes and buffers stored in its cache 
in an unencoded form.


Here's a couple of other articles worth looking at:

https://eng.uber.com/mysql-migration/
https://www.scylladb.com/2018/01/07/cost-of-avoiding-a-meltdown/
http://www.brendangregg.com/blog/2018-02-09/kpti-kaiser-meltdown-performance.html


3. A uniform OSD cache size of course will be a nightmare when having
non-uniform HW, either with RAM or number of OSDs.


Non-Uniform hardware is a big reason that pinning dedicated memory to 
specific cores/sockets is really nice vs relying on potentially remote 
memory page cache reads.  A long time ago I was responsible for 
validating the performance of CXFS on an SGI Altix UV distributed 
shared-memory supercomputer.  As it turns out, we could achieve about 
22GB/s writes with XFS (a huge number at the time), but CXFS was 5-10x 
slower.  A big part of that turned out to be the kernel distributing 
page cache across the Numalink5 interconnects to remote memory.  The 
problem can potentially happen on any NUMA system to varying degrees.


Personally I have two primary issues with bluestore's memory 
configuration right now:


1) It's too complicated for users to figure out where to assign memory 
and in what ratios.  I'm attempting to improve this by making 
bluestore's cache autotune itself, so the user just gives it a number and 
bluestore will try to work out where it should assign memory.


2) In the case where a subset of OSDs are really hot (maybe RGW bucket 
accesses) you might want some OSDs to get more memory than others.  I 
think we can tackle this better if we migrate to a one-osd-per-node 
sharded architecture (likely based on seastar), though we'll still need 
to be very aware of remote memory.  Given that this is fairly difficult 
to do well, we're probably going to be better off just dedicating a 
static pool to each shard initially.
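
For reference, these are the knobs involved today; the values below are
only illustrations, not recommendations:

    # ceph.conf on the OSD nodes
    [osd]
    bluestore_cache_size_hdd = 4294967296   # e.g. 4GB instead of the 1GB default
    bluestore_cache_size_ssd = 3221225472   # 3GB default for flash-backed OSDs

    # check what a running OSD is actually using (osd.0 is just an example)
    ceph daemon osd.0 config get bluestore_cache_size_hdd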


Mark
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What do you use to benchmark your rgw?

2018-03-28 Thread Mark Nelson
Personally I usually use a modified version of Mark Seger's getput tool 
here:


https://github.com/markhpc/getput/tree/wip-fix-timing

The difference between this version and upstream is primarily to make 
getput more accurate/useful when using something like CBT for 
orchestration instead of the included orchestration wrapper (gpsuite).


CBT can use this version of getput and run relatively accurate 
multi-client tests without requiring quite as much setup as cosbench.  
Having said that, many folks have used cosbench effectively and I 
suspect that might be a good option for many people.  I'm not sure how 
much development is happening these days; I think the primary author may 
no longer be working on the project.


Mark

On 03/28/2018 09:21 AM, David Byte wrote:
I use cosbench (the last rc works well enough). I can get multiple 
GB/s from my 6 node cluster with 2 RGWs.


David Byte
Sr. Technical Strategist
IHV Alliances and Embedded
SUSE

Sent from my iPhone. Typos are Apple's fault.

On Mar 28, 2018, at 5:26 AM, Janne Johansson > wrote:


s3cmd and cli version of cyberduck to test it end-to-end using 
parallelism if possible.


Getting some 100MB/s at most, from 500km distance over https against 
5*radosgw behind HAProxy.
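
If anyone wants a quick-and-dirty way to get that parallelism, a handful of
s3cmd processes is enough; a sketch (file and bucket names are examples):

    seq 1 8 | xargs -P 8 -I{} s3cmd put ./testfile s3://benchbucket/obj-{}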



2018-03-28 11:17 GMT+02:00 Matthew Vernon >:


Hi,

What are people here using to benchmark their S3 service (i.e.
the rgw)?
rados bench is great for some things, but doesn't tell me about what
performance I can get from my rgws.

It seems that there used to be rest-bench, but that isn't in Jewel
AFAICT; I had a bit of a look at cosbench but it looks fiddly to
set up
and a bit under-maintained (the most recent version doesn't work
out of
the box, and the PR to fix that has been languishing for a while).

This doesn't seem like an unusual thing to want to do, so I'd like to
know what other ceph folk are using (and, if you like, the
numbers you
get from the benchmarkers)...?

Thanks,

Matthew


--
 The Wellcome Sanger Institute is operated by Genome Research
 Limited, a charity registered in England with number 1021457 and a
 company registered in England with number 2742969, whose registered
 office is 215 Euston Road, London, NW1 2BE.
___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





--
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




Re: [ceph-users] [Cbt] Poor libRBD write performance

2017-11-20 Thread Mark Nelson

On 11/20/2017 10:06 AM, Moreno, Orlando wrote:

Hi all,



I’ve been experiencing weird performance behavior when using FIO RBD
engine directly to an RBD volume with numjobs > 1. For a 4KB random
write test at 32 QD and 1 numjob, I can get about 40K IOPS, but when I
increase the numjobs to 4, it plummets to 2800 IOPS. I tried running the
same exact test on a VM using FIO libaio targeting a block device
(volume) attached through QEMU/RBD and I get ~35K-40K IOPS in both
situations. In all cases, CPU was not fully utilized and there were no
signs of any hardware bottlenecks. I did not disable any RBD features
and most of the Ceph parameters are default (besides auth, debug, pool
size, etc).



My Ceph cluster is running on 6 nodes, all-NVMe, 22-core, 376GB mem,
Luminous 12.2.1, Ubuntu 16.04, and clients running FIO job/VM on similar
HW/SW spec. The VM has 16 vCPU, 64GB mem, and the root disk is locally
stored while the persistent disk comes from an RBD volume serviced by
the Ceph cluster.



If anyone has seen this issue or have any suggestions please let me know.


Hi Orlando,

Try seeing if disabling the RBD image exclusive lock helps (if only to 
confirm that's what's going on).  I usually test with numjobs=1 and run 
multiple fio instances with higher iodepth values instead to avoid this. 
 See:


https://www.spinics.net/lists/ceph-devel/msg30468.html

and

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-September/004872.html
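
A minimal sketch of the first suggestion (pool/image names are examples;
features that depend on the exclusive lock have to be disabled first):

    rbd feature disable rbd/fio-test object-map fast-diff
    rbd feature disable rbd/fio-test exclusive-lock
    rbd info rbd/fio-test | grep features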

Mark





Thanks,

Orlando



___
Cbt mailing list
c...@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/cbt-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore performance 50% of filestore

2017-11-16 Thread Mark Nelson
It depends on what you expect your typical workload to be like.  Ceph 
(and distributed storage in general) likes high io depths so writes can 
hit all of the drives at the same time.  There are tricks (like 
journals, writeahead logs, centralized caches, etc) that can help 
mitigate this, but I suspect you'll see much better performance with 
more concurrent writes.


Regarding file size, the smaller the file, the more likely those tricks 
mentioned above are to help you.  Based on your results, it appears 
filestore may be doing a better job of it than bluestore.  The question 
you have to ask is whether or not this kind of test represents what you 
are likely to see for real on your cluster.


Doing writes over a much larger file, say 3-4x over the total amount of 
RAM in all of the nodes, helps you get a better idea of what the 
behavior is like when those tricks are less effective.  I think that's 
probably a more likely scenario in most production environments, but 
it's up to you which workload you think better represents what you are 
going to see in practice.  A while back Nick Fisk showed some results 
where bluestore was slower than filestore at small sync writes, and it 
could be that we simply have more work to do in this area.  On the other 
hand, we pretty consistently see bluestore doing better than filestore 
with 4k random writes and higher IO depths, which is why I'd be curious 
to see how it goes if you try that.
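
Something along these lines is what I have in mind; the path, size and
runtime are just examples, and the file should be several times larger
than the combined RAM of the OSD nodes:

    fio --name=fio_test_file --filename=/mnt/test/fio_test_file --size=200G \
        --ioengine=libaio --iodepth=32 --numjobs=1 --direct=1 \
        --rw=randwrite --bs=4k --time_based --runtime=180 --group_reporting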


Mark

On 11/16/2017 10:11 AM, Milanov, Radoslav Nikiforov wrote:

No,
What test parameters (iodepth/file size/numjobs) would make sense  for 3 
node/27OSD@4TB ?
- Rado

-Original Message-
From: Mark Nelson [mailto:mnel...@redhat.com]
Sent: Thursday, November 16, 2017 10:56 AM
To: Milanov, Radoslav Nikiforov <rad...@bu.edu>; David Turner 
<drakonst...@gmail.com>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Bluestore performance 50% of filestore

Did you happen to have a chance to try with a higher io depth?

Mark

On 11/16/2017 09:53 AM, Milanov, Radoslav Nikiforov wrote:

FYI

Having 50GB bock.db made no difference on the performance.



- Rado



*From:*David Turner [mailto:drakonst...@gmail.com]
*Sent:* Tuesday, November 14, 2017 6:13 PM
*To:* Milanov, Radoslav Nikiforov <rad...@bu.edu>
*Cc:* Mark Nelson <mnel...@redhat.com>; ceph-users@lists.ceph.com
*Subject:* Re: [ceph-users] Bluestore performance 50% of filestore



I'd probably say 50GB to leave some extra space over-provisioned.
50GB should definitely prevent any DB operations from spilling over to the HDD.



On Tue, Nov 14, 2017, 5:43 PM Milanov, Radoslav Nikiforov
<rad...@bu.edu <mailto:rad...@bu.edu>> wrote:

Thank you,

It is 4TB OSDs and they might become full someday, I’ll try 60GB db
partition – this is the max OSD capacity.



- Rado



*From:*David Turner [mailto:drakonst...@gmail.com
<mailto:drakonst...@gmail.com>]
*Sent:* Tuesday, November 14, 2017 5:38 PM


*To:* Milanov, Radoslav Nikiforov <rad...@bu.edu
<mailto:rad...@bu.edu>>

*Cc:*Mark Nelson <mnel...@redhat.com <mailto:mnel...@redhat.com>>;
ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>


*Subject:* Re: [ceph-users] Bluestore performance 50% of filestore



You have to configure the size of the db partition in the config
file for the cluster.  If you're db partition is 1GB, then I can all
but guarantee that you're using your HDD for your blocks.db very
quickly into your testing.  There have been multiple threads
recently about what size the db partition should be and it seems to
be based on how many objects your OSD is likely to have on it.  The
recommendation has been to err on the side of bigger.  If you're
running 10TB OSDs and anticipate filling them up, then you probably
want closer to an 80GB+ db partition.  That's why I asked how full
your cluster was and how large your HDDs are.



Here's a link to one of the recent ML threads on this
topic.
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/020
822.html

On Tue, Nov 14, 2017 at 4:44 PM Milanov, Radoslav Nikiforov
<rad...@bu.edu <mailto:rad...@bu.edu>> wrote:

Block-db partition is the default 1GB (is there a way to modify
this? journals are 5GB in filestore case) and usage is low:



[root@kumo-ceph02 ~]# ceph df

GLOBAL:
    SIZE     AVAIL   RAW USED  %RAW USED
    100602G  99146G  1455G     1.45

POOLS:
    NAME           ID  USED    %USED  MAX AVAIL  OBJECTS
    kumo-vms       1   19757M  0.02   31147G     5067
    kumo-volumes   2   214G    0.18   31147G     55248
    kumo-images    3   203G    0.17   31147G     66486
    kumo-vms3      11  45824M  0.

Re: [ceph-users] Bluestore performance 50% of filestore

2017-11-16 Thread Mark Nelson

Did you happen to have a chance to try with a higher io depth?

Mark

On 11/16/2017 09:53 AM, Milanov, Radoslav Nikiforov wrote:

FYI

Having 50GB bock.db made no difference on the performance.



- Rado



*From:*David Turner [mailto:drakonst...@gmail.com]
*Sent:* Tuesday, November 14, 2017 6:13 PM
*To:* Milanov, Radoslav Nikiforov <rad...@bu.edu>
*Cc:* Mark Nelson <mnel...@redhat.com>; ceph-users@lists.ceph.com
*Subject:* Re: [ceph-users] Bluestore performance 50% of filestore



I'd probably say 50GB to leave some extra space over-provisioned.  50GB
should definitely prevent any DB operations from spilling over to the HDD.



On Tue, Nov 14, 2017, 5:43 PM Milanov, Radoslav Nikiforov <rad...@bu.edu
<mailto:rad...@bu.edu>> wrote:

Thank you,

It is 4TB OSDs and they might become full someday, I’ll try 60GB db
partition – this is the max OSD capacity.



- Rado



*From:*David Turner [mailto:drakonst...@gmail.com
<mailto:drakonst...@gmail.com>]
*Sent:* Tuesday, November 14, 2017 5:38 PM


*To:* Milanov, Radoslav Nikiforov <rad...@bu.edu <mailto:rad...@bu.edu>>

*Cc:*Mark Nelson <mnel...@redhat.com <mailto:mnel...@redhat.com>>;
ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>


*Subject:* Re: [ceph-users] Bluestore performance 50% of filestore



You have to configure the size of the db partition in the config
file for the cluster.  If you're db partition is 1GB, then I can all
but guarantee that you're using your HDD for your blocks.db very
quickly into your testing.  There have been multiple threads
recently about what size the db partition should be and it seems to
be based on how many objects your OSD is likely to have on it.  The
recommendation has been to err on the side of bigger.  If you're
running 10TB OSDs and anticipate filling them up, then you probably
want closer to an 80GB+ db partition.  That's why I asked how full
your cluster was and how large your HDDs are.



Here's a link to one of the recent ML threads on this
topic.  
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/020822.html

On Tue, Nov 14, 2017 at 4:44 PM Milanov, Radoslav Nikiforov
<rad...@bu.edu <mailto:rad...@bu.edu>> wrote:

Block-db partition is the default 1GB (is there a way to modify
this? journals are 5GB in filestore case) and usage is low:



[root@kumo-ceph02 ~]# ceph df

GLOBAL:
    SIZE     AVAIL   RAW USED  %RAW USED
    100602G  99146G  1455G     1.45

POOLS:
    NAME           ID  USED    %USED  MAX AVAIL  OBJECTS
    kumo-vms       1   19757M  0.02   31147G     5067
    kumo-volumes   2   214G    0.18   31147G     55248
    kumo-images    3   203G    0.17   31147G     66486
    kumo-vms3      11  45824M  0.04   31147G     11643
    kumo-volumes3  13  10837M  0      31147G     2724
    kumo-images3   15  82450M  0.09   31147G     10320



- Rado



*From:*David Turner [mailto:drakonst...@gmail.com
<mailto:drakonst...@gmail.com>]
*Sent:* Tuesday, November 14, 2017 4:40 PM
*To:* Mark Nelson <mnel...@redhat.com <mailto:mnel...@redhat.com>>
*Cc:* Milanov, Radoslav Nikiforov <rad...@bu.edu
<mailto:rad...@bu.edu>>; ceph-users@lists.ceph.com
<mailto:ceph-users@lists.ceph.com>


*Subject:* Re: [ceph-users] Bluestore performance 50% of filestore



How big was your blocks.db partition for each OSD and what size
are your HDDs?  Also how full is your cluster?  It's possible
that your blocks.db partition wasn't large enough to hold the
entire db and it had to spill over onto the HDD which would
definitely impact performance.



On Tue, Nov 14, 2017 at 4:36 PM Mark Nelson <mnel...@redhat.com
<mailto:mnel...@redhat.com>> wrote:

How big were the writes in the windows test and how much
concurrency was
there?

Historically bluestore does pretty well for us with small
random writes
so your write results surprise me a bit.  I suspect it's the
low queue
depth.  Sometimes bluestore does worse with reads, especially if
readahead isn't enabled on the client.

Mark

On 11/14/2017 03:14 PM, Milanov, Radoslav Nikiforov wrote:
> Hi Mark,
> Yes RBD is in write back, and the only thing that changed
was converting OSDs to bluestore. It is 7200 rpm drives and
triple replication. I also get same results (bluestore 2

Re: [ceph-users] Bluestore performance 50% of filestore

2017-11-14 Thread Mark Nelson
How big were the writes in the windows test and how much concurrency was 
there?


Historically bluestore does pretty well for us with small random writes 
so your write results surprise me a bit.  I suspect it's the low queue 
depth.  Sometimes bluestore does worse with reads, especially if 
readahead isn't enabled on the client.


Mark

On 11/14/2017 03:14 PM, Milanov, Radoslav Nikiforov wrote:

Hi Mark,
Yes RBD is in write back, and the only thing that changed was converting OSDs 
to bluestore. It is 7200 rpm drives and triple replication. I also get same 
results (bluestore 2 times slower) testing continuous writes on a 40GB 
partition on a Windows VM, completely different tool.

Right now I'm going back to filestore for the OSDs so additional tests are 
possible if that helps.

- Rado

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark 
Nelson
Sent: Tuesday, November 14, 2017 4:04 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Bluestore performance 50% of filestore

Hi Radoslav,

Is RBD cache enabled and in writeback mode?  Do you have client side readahead?

Both are doing better for writes than you'd expect from the native performance 
of the disks assuming they are typical 7200RPM drives and you are using 3X 
replication (~150IOPS * 27 / 3 = ~1350 IOPS).  Given the small file size, I'd 
expect that you might be getting better journal coalescing in filestore.

Sadly I imagine you can't do a comparison test at this point, but I'd be 
curious how it would look if you used libaio with a high iodepth and a much 
bigger partition to do random writes over.

Mark

On 11/14/2017 01:54 PM, Milanov, Radoslav Nikiforov wrote:

Hi

We have 3 node, 27 OSDs cluster running Luminous 12.2.1

In filestore configuration there are 3 SSDs used for journals of 9
OSDs on each hosts (1 SSD has 3 journal paritions for 3 OSDs).

I've converted filestore to bluestore by wiping 1 host a time and
waiting for recovery. SSDs now contain block-db - again one SSD
serving
3 OSDs.



Cluster is used as storage for Openstack.

Running fio on a VM in that Openstack reveals bluestore performance
almost twice slower than filestore.

fio --name fio_test_file --direct=1 --rw=randwrite --bs=4k --size=1G
--numjobs=2 --time_based --runtime=180 --group_reporting

fio --name fio_test_file --direct=1 --rw=randread --bs=4k --size=1G
--numjobs=2 --time_based --runtime=180 --group_reporting





Filestore

  write: io=3511.9MB, bw=19978KB/s, iops=4994, runt=180001msec

  write: io=3525.6MB, bw=20057KB/s, iops=5014, runt=180001msec

  write: io=3554.1MB, bw=20222KB/s, iops=5055, runt=180016msec



  read : io=1995.7MB, bw=11353KB/s, iops=2838, runt=180001msec

  read : io=1824.5MB, bw=10379KB/s, iops=2594, runt=180001msec

  read : io=1966.5MB, bw=11187KB/s, iops=2796, runt=180001msec



Bluestore

  write: io=1621.2MB, bw=9222.3KB/s, iops=2305, runt=180002msec

  write: io=1576.3MB, bw=8965.6KB/s, iops=2241, runt=180029msec

  write: io=1531.9MB, bw=8714.3KB/s, iops=2178, runt=180001msec



  read : io=1279.4MB, bw=7276.5KB/s, iops=1819, runt=180006msec

  read : io=773824KB, bw=4298.9KB/s, iops=1074, runt=180010msec

  read : io=1018.5MB, bw=5793.7KB/s, iops=1448, runt=180001msec





- Rado





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




Re: [ceph-users] Bluestore performance 50% of filestore

2017-11-14 Thread Mark Nelson

Hi Radoslav,

Is RBD cache enabled and in writeback mode?  Do you have client side 
readahead?


Both are doing better for writes than you'd expect from the native 
performance of the disks assuming they are typical 7200RPM drives and 
you are using 3X replication (~150IOPS * 27 / 3 = ~1350 IOPS).  Given 
the small file size, I'd expect that you might be getting better journal 
coalescing in filestore.


Sadly I imagine you can't do a comparison test at this point, but I'd be 
curious how it would look if you used libaio with a high iodepth and a 
much bigger partition to do random writes over.


Mark

On 11/14/2017 01:54 PM, Milanov, Radoslav Nikiforov wrote:

Hi

We have 3 node, 27 OSDs cluster running Luminous 12.2.1

In filestore configuration there are 3 SSDs used for journals of 9 OSDs
on each hosts (1 SSD has 3 journal paritions for 3 OSDs).

I’ve converted filestore to bluestore by wiping 1 host a time and
waiting for recovery. SSDs now contain block-db – again one SSD serving
3 OSDs.



Cluster is used as storage for Openstack.

Running fio on a VM in that Openstack reveals bluestore performance
almost twice slower than filestore.

fio --name fio_test_file --direct=1 --rw=randwrite --bs=4k --size=1G
--numjobs=2 --time_based --runtime=180 --group_reporting

fio --name fio_test_file --direct=1 --rw=randread --bs=4k --size=1G
--numjobs=2 --time_based --runtime=180 --group_reporting





Filestore

  write: io=3511.9MB, bw=19978KB/s, iops=4994, runt=180001msec

  write: io=3525.6MB, bw=20057KB/s, iops=5014, runt=180001msec

  write: io=3554.1MB, bw=20222KB/s, iops=5055, runt=180016msec



  read : io=1995.7MB, bw=11353KB/s, iops=2838, runt=180001msec

  read : io=1824.5MB, bw=10379KB/s, iops=2594, runt=180001msec

  read : io=1966.5MB, bw=11187KB/s, iops=2796, runt=180001msec



Bluestore

  write: io=1621.2MB, bw=9222.3KB/s, iops=2305, runt=180002msec

  write: io=1576.3MB, bw=8965.6KB/s, iops=2241, runt=180029msec

  write: io=1531.9MB, bw=8714.3KB/s, iops=2178, runt=180001msec



  read : io=1279.4MB, bw=7276.5KB/s, iops=1819, runt=180006msec

  read : io=773824KB, bw=4298.9KB/s, iops=1074, runt=180010msec

  read : io=1018.5MB, bw=5793.7KB/s, iops=1448, runt=180001msec





- Rado





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




Re: [ceph-users] Performance, and how much wiggle room there is with tunables

2017-11-10 Thread Mark Nelson



On 11/10/2017 12:21 PM, Maged Mokhtar wrote:

Hi Mark,

It will be interesting to know:

The impact of replication. I guess it will decrease by a higher factor
than the replica count.

I assume you mean the 30K IOPS per OSD is what the client sees; if so,
the OSD's raw disk will be doing more IOPS. Is this correct, and if so,
what is the factor (the lower, the better the efficiency)?


In those tests it's 1x replication with 1 OSD.  You do lose more than 3X 
for 3X replication, but it's not necessarily easy to tell how much 
depending on the network, kernel, etc.




Are you running 1 OSD per physical drive or multiple? Any recommendations?


In those tests 1 OSD per NVMe.  You can do better if you put multiple 
OSDs on the same drive, both for filestore and bluestore.


Mark



Cheers /Maged

On 2017-11-10 18:51, Mark Nelson wrote:


FWIW, on very fast drives you can achieve at least 1.4GB/s and 30K+
write IOPS per OSD (before replication).  It's quite possible to do
better but those are recent numbers on a mostly default bluestore
configuration that I'm fairly confident to share.  It takes a lot of
CPU, but it's possible.

Mark

On 11/10/2017 10:35 AM, Robert Stanford wrote:


 Thank you for that excellent observation.  Are there any rumors / has
anyone had experience with faster clusters, on faster networks?  I
wonder how fast Ceph can get ("it depends", of course), but I wonder
about the numbers people have seen.

On Fri, Nov 10, 2017 at 10:31 AM, Denes Dolhay <de...@denkesys.com
<mailto:de...@denkesys.com>
<mailto:de...@denkesys.com <mailto:de...@denkesys.com>>> wrote:

So you are using a 40 / 100 gbit connection all the way to your client?

John's question is valid because 10 gbit = 1.25GB/s ... subtract some
ethernet, ip, tcp and protocol overhead, take into account some
additional network factors, and you are about there...


Denes


On 11/10/2017 05:10 PM, Robert Stanford wrote:


 The bandwidth of the network is much higher than that.  The
bandwidth I mentioned came from "rados bench" output, under the
"Bandwidth (MB/sec)" row.  I see from comparing mine to others
online that mine is pretty good (relatively).  But I'd like to get
much more than that.

Does "rados bench" show a near maximum of what a cluster can do?
Or is it possible that I can tune it to get more bandwidth?
|
|

On Fri, Nov 10, 2017 at 3:43 AM, John Spray <jsp...@redhat.com
<mailto:jsp...@redhat.com>
<mailto:jsp...@redhat.com <mailto:jsp...@redhat.com>>> wrote:

On Fri, Nov 10, 2017 at 4:29 AM, Robert Stanford
<rstanford8...@gmail.com
<mailto:rstanford8...@gmail.com> <mailto:rstanford8...@gmail.com
<mailto:rstanford8...@gmail.com>>> wrote:
>
>  In my cluster, rados bench shows about 1GB/s bandwidth.
I've done some
> tuning:
>
> [osd]
> osd op threads = 8
> osd disk threads = 4
> osd recovery max active = 7
>
>
> I was hoping to get much better bandwidth.  My network can
handle it, and my
> disks are pretty fast as well.  Are there any major tunables
I can play with
> to increase what will be reported by "rados bench"?  Am I
pretty much stuck
> around the bandwidth it reported?

Are you sure your 1GB/s isn't just the NIC bandwidth limit of the
client you're running rados bench from?

John

>
>  Thank you
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
<mailto:ceph-users@lists.ceph.com> <mailto:ceph-users@lists.ceph.com
<mailto:ceph-users@lists.ceph.com>>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
<http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
>




___
ceph-users mailing list
ceph-users@lists.ceph.com
<mailto:ceph-users@lists.ceph.com> <mailto:ceph-users@lists.ceph.com
<mailto:ceph-users@lists.ceph.com>>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
<http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>



___
ceph-users mailing list
ceph-users@lists.ceph.com
<mailto:ceph-users@lists.ceph.com> <mailto:ceph-users@lists.ceph.com
<mailto:ceph-users@lists.ceph.com>>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
<http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>




___
ceph-users mailing list
ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
http://lists.ceph.com/listinfo.cgi/ce

Re: [ceph-users] Performance, and how much wiggle room there is with tunables

2017-11-10 Thread Mark Nelson
FWIW, on very fast drives you can achieve at least 1.4GB/s and 30K+ 
write IOPS per OSD (before replication).  It's quite possible to do 
better but those are recent numbers on a mostly default bluestore 
configuration that I'm fairly confident to share.  It takes a lot of 
CPU, but it's possible.
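
For anyone trying to reproduce numbers like that, the main thing is to make
sure a single client isn't the bottleneck; a sketch with rados bench run
from several client hosts at once (pool name, runtime and thread count are
just examples), adding the per-client results together afterwards:

    rados bench -p benchpool 60 write -t 64 --no-cleanup
    rados bench -p benchpool 60 rand -t 64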


Mark

On 11/10/2017 10:35 AM, Robert Stanford wrote:


 Thank you for that excellent observation.  Are there any rumors / has
anyone had experience with faster clusters, on faster networks?  I
wonder how fast Ceph can get ("it depends", of course), but I wonder
about the numbers people have seen.

On Fri, Nov 10, 2017 at 10:31 AM, Denes Dolhay > wrote:

So you are using a 40 / 100 gbit connection all the way to your client?

John's question is valid because 10 gbit = 1.25GB/s ... subtract some
ethernet, ip, tcp and protocol overhead, take into account some
additional network factors, and you are about there...


Denes


On 11/10/2017 05:10 PM, Robert Stanford wrote:


 The bandwidth of the network is much higher than that.  The
bandwidth I mentioned came from "rados bench" output, under the
"Bandwidth (MB/sec)" row.  I see from comparing mine to others
online that mine is pretty good (relatively).  But I'd like to get
much more than that.

Does "rados bench" show a near maximum of what a cluster can do?
Or is it possible that I can tune it to get more bandwidth?
|
|

On Fri, Nov 10, 2017 at 3:43 AM, John Spray > wrote:

On Fri, Nov 10, 2017 at 4:29 AM, Robert Stanford
> wrote:
>
>  In my cluster, rados bench shows about 1GB/s bandwidth.
I've done some
> tuning:
>
> [osd]
> osd op threads = 8
> osd disk threads = 4
> osd recovery max active = 7
>
>
> I was hoping to get much better bandwidth.  My network can
handle it, and my
> disks are pretty fast as well.  Are there any major tunables
I can play with
> to increase what will be reported by "rados bench"?  Am I
pretty much stuck
> around the bandwidth it reported?

Are you sure your 1GB/s isn't just the NIC bandwidth limit of the
client you're running rados bench from?

John

>
>  Thank you
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>




___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




Re: [ceph-users] bluestore - wal,db on faster devices?

2017-11-09 Thread Mark Nelson
One small point:  It's a bit easier to observe distinct WAL and DB 
behavior when they are on separate partitions.  I often do this for 
benchmarking and testing, though I don't know that it would be enough of 
a benefit to do it in production.


Mark

On 11/09/2017 04:16 AM, Richard Hesketh wrote:

You're correct, if you were going to put the WAL and DB on the same device you 
should just make one partition and allocate the DB to it, the WAL will 
automatically be stored with the DB. It only makes sense to specify them 
separately if they are going to go on different devices, and that itself only 
makes sense if the WAL device will be much faster than the DB device, otherwise 
you're just making your setup more complex for no gain.

On 09/11/17 08:05, jorpilo wrote:


I get confused there because on the documentation:
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/

"If there is more, provisioning a DB device makes more sense. The BlueStore journal 
will always be placed on the fastest device available, so using a DB device will provide 
the same benefit that the WAL device would while also allowing additional metadata to be 
stored there"

So I guess it doesn't make any sense to implicit put WAL and DB on a SSD, only 
with DB, the biggest you can, would be enough, unless you have 2 different 
kinds of SSD (for example a tiny Nvme and a SSD)

Am I right? Or would I get any benefit from setting implicit WAL partition on 
the same SSD?


 Mensaje original 
De: Nick Fisk <n...@fisk.me.uk>
Fecha: 8/11/17 10:16 p. m. (GMT+01:00)
Para: 'Mark Nelson' <mnel...@redhat.com>, 'Wolfgang Lendl' 
<wolfgang.le...@meduniwien.ac.at>
Cc: ceph-users@lists.ceph.com
Asunto: Re: [ceph-users] bluestore - wal,db on faster devices?


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Mark Nelson
Sent: 08 November 2017 19:46
To: Wolfgang Lendl <wolfgang.le...@meduniwien.ac.at>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] bluestore - wal,db on faster devices?

Hi Wolfgang,

You've got the right idea.  RBD is probably going to benefit less since

you

have a small number of large objects and little extra OMAP data.
Having the allocation and object metadata on flash certainly shouldn't

hurt,

and you should still have less overhead for small (<64k) writes.
With RGW however you also have to worry about bucket index updates
during writes and that's a big potential bottleneck that you don't need to
worry about with RBD.


If you are running anything which is sensitive to sync write latency, like
databases, you will see a big performance improvement from using the WAL on SSD.
As Mark says, small writes will get ack'd once written to SSD: a ~10-200us vs
1-2ms difference. It will also batch lots of these small writes
together and write them to disk in bigger chunks much more effectively. If
you want to run active workloads on RBD and want them to match enterprise
storage array with BBWC type performance, I would say DB and WAL on SSD is a
requirement.
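
A sketch of a fio job that exposes this latency difference, modelled on the rbd
engine jobs used elsewhere in this thread (pool and image names are
placeholders and the image is assumed to already exist; with the librbd cache
disabled or in writethrough mode, each 4k write has to reach the OSD WAL before
it is acknowledged, so the completion latency reflects the sync write path):

[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=latency-test
direct=1
runtime=60
time_based

[sync-write-4k-qd1]
bs=4k
iodepth=1
rw=randwrite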




Mark

On 11/08/2017 01:01 PM, Wolfgang Lendl wrote:

Hi Mark,

thanks for your reply!
I'm a big fan of keeping things simple - this means that there has to
be a very good reason to put the WAL and DB on a separate device
otherwise I'll keep it collocated (and simpler).

as far as I understood - putting the WAL,DB on a faster (than hdd)
device makes more sense in cephfs and rgw environments (more

metadata)

- and less sense in rbd environments - correct?

br
wolfgang

On 11/08/2017 02:21 PM, Mark Nelson wrote:

Hi Wolfgang,

In bluestore the WAL serves sort of a similar purpose to filestore's
journal, but bluestore isn't dependent on it for guaranteeing
durability of large writes.  With bluestore you can often get higher
large-write throughput than with filestore when using HDD-only or
flash-only OSDs.

Bluestore also stores allocation, object, and cluster metadata in the
DB.  That, in combination with the way bluestore stores objects,
dramatically improves behavior during certain workloads.  A big one
is creating millions of small objects as quickly as possible.  In
filestore, PG splitting has a huge impact on performance and tail
latency.  Bluestore is much better just on HDD, and putting the DB
and WAL on flash makes it better still since metadata no longer is a
bottleneck.

Bluestore does have a couple of shortcomings vs filestore currently.
The allocator is not as good as XFS's and can fragment more over time.
There is no server-side readahead so small sequential read
performance is very dependent on client-side readahead.  There's
still a number of optimizations to various things ranging from
threading and locking in the shardedopwq to pglog and dup_ops that
potentially could improve performance.

I have a blog post that we've been working on that explores some of
these things but I'm still waiting on review before I publish it.

Mark


Re: [ceph-users] bluestore - wal,db on faster devices?

2017-11-08 Thread Mark Nelson



On 11/08/2017 03:16 PM, Nick Fisk wrote:

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Mark Nelson
Sent: 08 November 2017 19:46
To: Wolfgang Lendl <wolfgang.le...@meduniwien.ac.at>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] bluestore - wal,db on faster devices?

Hi Wolfgang,

You've got the right idea.  RBD is probably going to benefit less since

you

have a small number of large objects and little extra OMAP data.
Having the allocation and object metadata on flash certainly shouldn't

hurt,

and you should still have less overhead for small (<64k) writes.
With RGW however you also have to worry about bucket index updates
during writes and that's a big potential bottleneck that you don't need to
worry about with RBD.


If you are running anything which is sensitive to sync write latency, like
databases, you will see a big performance improvement from using the WAL on SSD.
As Mark says, small writes will get ack'd once written to SSD: a ~10-200us vs
1-2ms difference. It will also batch lots of these small writes
together and write them to disk in bigger chunks much more effectively. If
you want to run active workloads on RBD and want them to match enterprise
storage array with BBWC type performance, I would say DB and WAL on SSD is a
requirement.


Hi Nick,

You've done more investigation in this area than most I think.  Once you 
get to the point under continuous load where RocksDB is compacting, do 
you see better than a 2X gain?


Mark





Mark

On 11/08/2017 01:01 PM, Wolfgang Lendl wrote:

Hi Mark,

thanks for your reply!
I'm a big fan of keeping things simple - this means that there has to
be a very good reason to put the WAL and DB on a separate device
otherwise I'll keep it collocated (and simpler).

as far as I understood - putting the WAL,DB on a faster (than hdd)
device makes more sense in cephfs and rgw environments (more

metadata)

- and less sense in rbd environments - correct?

br
wolfgang

On 11/08/2017 02:21 PM, Mark Nelson wrote:

Hi Wolfgang,

In bluestore the WAL serves sort of a similar purpose to filestore's
journal, but bluestore isn't dependent on it for guaranteeing
durability of large writes.  With bluestore you can often get higher
large-write throughput than with filestore when using HDD-only or
flash-only OSDs.

Bluestore also stores allocation, object, and cluster metadata in the
DB.  That, in combination with the way bluestore stores objects,
dramatically improves behavior during certain workloads.  A big one
is creating millions of small objects as quickly as possible.  In
filestore, PG splitting has a huge impact on performance and tail
latency.  Bluestore is much better just on HDD, and putting the DB
and WAL on flash makes it better still since metadata no longer is a
bottleneck.

Bluestore does have a couple of shortcomings vs filestore currently.
The allocator is not as good as XFS's and can fragment more over time.
There is no server-side readahead so small sequential read
performance is very dependent on client-side readahead.  There's
still a number of optimizations to various things ranging from
threading and locking in the shardedopwq to pglog and dup_ops that
potentially could improve performance.

I have a blog post that we've been working on that explores some of
these things but I'm still waiting on review before I publish it.

Mark

On 11/08/2017 05:53 AM, Wolfgang Lendl wrote:

Hello,

it's clear to me getting a performance gain from putting the journal
on a fast device (ssd,nvme) when using filestore backend.
it's not when it comes to bluestore - are there any resources,
performance test, etc. out there how a fast wal,db device impacts
performance?


br
wolfgang


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bluestore - wal,db on faster devices?

2017-11-08 Thread Mark Nelson

Hi Wolfgang,

You've got the right idea.  RBD is probably going to benefit less since 
you have a small number of large objects and little extra OMAP data. 
Having the allocation and object metadata on flash certainly shouldn't 
hurt, and you should still have less overhead for small (<64k) writes. 
With RGW however you also have to worry about bucket index updates 
during writes and that's a big potential bottleneck that you don't need 
to worry about with RBD.


Mark

On 11/08/2017 01:01 PM, Wolfgang Lendl wrote:

Hi Mark,

thanks for your reply!
I'm a big fan of keeping things simple - this means that there has to be
a very good reason to put the WAL and DB on a separate device otherwise
I'll keep it collocated (and simpler).

as far as I understood - putting the WAL,DB on a faster (than hdd)
device makes more sense in cephfs and rgw environments (more metadata) -
and less sense in rbd environments - correct?

br
wolfgang

On 11/08/2017 02:21 PM, Mark Nelson wrote:

Hi Wolfgang,

In bluestore the WAL serves sort of a similar purpose to filestore's
journal, but bluestore isn't dependent on it for guaranteeing
durability of large writes.  With bluestore you can often get higher
large-write throughput than with filestore when using HDD-only or
flash-only OSDs.

Bluestore also stores allocation, object, and cluster metadata in the
DB.  That, in combination with the way bluestore stores objects,
dramatically improves behavior during certain workloads.  A big one is
creating millions of small objects as quickly as possible.  In
filestore, PG splitting has a huge impact on performance and tail
latency.  Bluestore is much better just on HDD, and putting the DB and
WAL on flash makes it better still since metadata no longer is a
bottleneck.

Bluestore does have a couple of shortcomings vs filestore currently.
The allocator is not as good as XFS's and can fragment more over time.
There is no server-side readahead so small sequential read performance
is very dependent on client-side readahead.  There's still a number of
optimizations to various things ranging from threading and locking in
the shardedopwq to pglog and dup_ops that potentially could improve
performance.

I have a blog post that we've been working on that explores some of
these things but I'm still waiting on review before I publish it.

Mark

On 11/08/2017 05:53 AM, Wolfgang Lendl wrote:

Hello,

it's clear to me getting a performance gain from putting the journal on
a fast device (ssd,nvme) when using filestore backend.
it's not when it comes to bluestore - are there any resources,
performance test, etc. out there how a fast wal,db device impacts
performance?


br
wolfgang


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bluestore - wal,db on faster devices?

2017-11-08 Thread Mark Nelson

Hi Wolfgang,

In bluestore the WAL serves sort of a similar purpose to filestore's 
journal, but bluestore isn't dependent on it for guaranteeing durability 
of large writes.  With bluestore you can often get higher large-write 
throughput than with filestore when using HDD-only or flash-only OSDs.


Bluestore also stores allocation, object, and cluster metadata in the 
DB.  That, in combination with the way bluestore stores objects, 
dramatically improves behavior during certain workloads.  A big one is 
creating millions of small objects as quickly as possible.  In 
filestore, PG splitting has a huge impact on performance and tail 
latency.  Bluestore is much better just on HDD, and putting the DB and 
WAL on flash makes it better still since metadata no longer is a bottleneck.


Bluestore does have a couple of shortcomings vs filestore currently. 
The allocator is not as good as XFS's and can fragment more over time. 
There is no server-side readahead so small sequential read performance 
is very dependent on client-side readahead.  There's still a number of 
optimizations to various things ranging from threading and locking in 
the shardedopwq to pglog and dup_ops that potentially could improve 
performance.
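
For librbd clients, readahead can be tuned on the client side; a sketch of the
relevant ceph.conf options (values here are only illustrative, check the
defaults for your release; for kernel-mapped RBD the normal block device
readahead, e.g. blockdev --setra, applies instead):

[client]
rbd readahead trigger requests = 10           # sequential requests before readahead starts
rbd readahead max bytes = 4194304             # read ahead up to 4 MB; 0 disables readahead
rbd readahead disable after bytes = 52428800  # stop readahead after the first 50 MB; 0 = never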


I have a blog post that we've been working on that explores some of 
these things but I'm still waiting on review before I publish it.


Mark

On 11/08/2017 05:53 AM, Wolfgang Lendl wrote:

Hello,

it's clear to me getting a performance gain from putting the journal on
a fast device (ssd,nvme) when using filestore backend.
it's not when it comes to bluestore - are there any resources,
performance test, etc. out there how a fast wal,db device impacts
performance?


br
wolfgang


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-11-03 Thread Mark Nelson



On 11/03/2017 08:25 AM, Wido den Hollander wrote:



Op 3 november 2017 om 13:33 schreef Mark Nelson <mnel...@redhat.com>:




On 11/03/2017 02:44 AM, Wido den Hollander wrote:



Op 3 november 2017 om 0:09 schreef Nigel Williams <nigel.willi...@tpac.org.au>:


On 3 November 2017 at 07:45, Martin Overgaard Hansen <m...@multihouse.dk> wrote:

I want to bring this subject back in the light and hope someone can provide
insight regarding the issue, thanks.


Thanks Martin, I was going to do the same.

Is it possible to make the DB partition (on the fastest device) too
big? in other words is there a point where for a given set of OSDs
(number + size) the DB partition is sized too large and is wasting
resources. I recall a comment by someone proposing to split up a
single large (fast) SSD into 100GB partitions for each OSD.



It depends on the size of your backing disk. The DB will grow for the amount of 
Objects you have on your OSD.

A 4TB drive will hold more objects than a 1TB drive (usually), same goes for a 
10TB vs 6TB.

From what I've seen now there is no such thing as a 'too big' DB.

The tests I've done for now seem to suggest that filling up a 50GB DB is rather 
hard to do. But it could be different if you have billions of objects and thus 
tens of millions of objects per OSD.


Are you doing RBD, RGW, or something else to test?  What size are the
objects and are you fragmenting them?


Let's say the avg overhead is 16k you would need a 150GB DB for 10M objects.

You could look into your current numbers and check how many objects you have 
per OSD.

I checked a couple of Ceph clusters I run and see about 1M objects per OSD, but 
others only have 250k objects per OSD.

In all those cases even with 32k you would need a 30GB DB with 1M objects in 
that OSD.


The answer could be couched as some intersection of pool type (RBD /
RADOS / CephFS), object change(update?) intensity, size of OSD etc and
rule-of-thumb.



I would check your running Ceph clusters and calculate the amount of objects 
per OSD.

total objects / num osd * 3
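
As a worked example of that rule of thumb (the numbers are made up):

TOTAL_OBJECTS=60000000   # 60M objects in the cluster
REPLICAS=3
NUM_OSDS=150
echo $(( TOTAL_OBJECTS * REPLICAS / NUM_OSDS ))   # 1200000 object replicas per OSD
# at ~6-16 KB of RocksDB metadata per object that is roughly 7-19 GB of DB per OSD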


One nagging concern I have in the back of my mind is that the amount of
space amplification in rocksdb might grow with the number of levels (ie
the number of objects).  The space used per object might be different at
10M objects and 50M objects.



True. But how many systems do we have out there with 10M objects in ONE OSD?

The systems I checked range from 250k to 1M objects per OSD. Of course, 
statistics aren't the golden rule, but users will want some guideline on how to 
size their DB.


That's actually something I would really like better insight into.  I 
don't feel like I have a sufficient understanding of how many 
objects/OSD people are really deploying in the field.  I figure 10M/OSD 
is probably a reasonable "typical" upper limit for HDDs, but I could see 
some use cases with flash backed SSDs pushing far more.




WAL should be sufficient with 1GB~2GB, right?


Yep.  On the surface this appears to be a simple question, but a much 
deeper question is what are we actually doing with the WAL?  How should 
we be storing PG log and dup ops data?  How can we get away from the 
large WAL buffers and memtables we have now?  These are questions we are 
actively working on solving.  For the moment though, having multiple (4) 
256MB WAL buffers appears to give us the best performance despite 
resulting in large memtables, so 1-2GB for the WAL is right.
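
For anyone who wants ceph-disk/ceph-deploy to create explicitly sized
partitions, the sizes can be set in ceph.conf before preparing the OSD; a
sketch following the guidance above (1-2 GB WAL, DB sized for the expected
object count -- the exact values are assumptions to adapt to your own numbers):

[osd]
bluestore_block_wal_size = 2147483648    # 2 GB WAL
bluestore_block_db_size = 32212254720    # 30 GB DB, e.g. ~1M objects at ~30 KB each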


Mark



Wido



Wido


An idea occurred to me that by monitoring for the logged spill message
(the event when the DB partition spills/overflows to the OSD), OSDs
could be (lazily) destroyed and recreated with a new DB partition
increased in size say by 10% each time.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-11-03 Thread Mark Nelson



On 11/03/2017 04:08 AM, Jorge Pinilla López wrote:

well I haven't found any recommendation either, but I think that
sometimes the SSD space is being wasted.


If someone wanted to write it, you could have bluefs share some of the 
space on the drive for hot object data and release space as needed for 
the DB.  I'd very much recommend keeping the promotion rate incredibly low.




I was thinking about making an OSD from the rest of my SSD space, but it
wouldn't scale in case more speed is needed.


I think there's a temptation to try to shove more stuff on the SSD, but 
honestly I'm not sure it's a great idea.  These drives are already 
handling WAL and DB traffic, potentially for multiple OSDs.  If you have 
a very read centric workload or are using drives with high write 
endurance that's one thing.  From a monetary perspective, think 
carefully about how much drive endurance and mttf matter to you.




Other option I asked was to use bcache or a mix between bcache and small
DB partitions but I was only reply with corruption problems so I decided
not to do it.

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021535.html

I think a good idea would be to use the space needed to store the Hot DB
and the rest use it as a cache (at least a read cache)


Given that bluestore is already storing all of the metadata in rocksdb, 
putting the DB partition on flash is already going to buy you a lot. 
Having said that, something that could let the DB and a cache 
share/reclaim space on the SSD could be interesting.  It won't be a cure 
all, but at least could provide a small improvement so long as the 
promotion overhead is kept very low.




I don't really know a lot about this topic, but I think that maybe giving
50GB of a really expensive SSD is pointless if it's only using 10GB.


Think of it less as "space" and more of it as cells of write endurance. 
That's really what you are buying.  Whether that's a small drive with 
high write endurance or a big drive with low write endurance.  Some may 
have better properties for reads, some may have power-loss-protection 
that allows O_DSYNC writes to go much faster.  As far as the WAL and DB 
goes, it's all about how many writes you can get out of the drive before 
it goes kaput.




El 02/11/2017 a las 21:45, Martin Overgaard Hansen escribió:


Hi, it seems like I’m in the same boat as everyone else in
this particular thread.

I’m also unable to find any guidelines or recommendations regarding
sizing of the wal and / or db.

I want to bring this subject back in the light and hope someone can
provide insight regarding the issue, thanks.

Best Regards,
Martin Overgaard Hansen

MultiHouse IT Partner A/S



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--

*Jorge Pinilla López*
jorp...@unizar.es
Computer engineering student
Systems area intern (SICUZ)
Universidad de Zaragoza
PGP-KeyID: A34331932EBC715A




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-11-03 Thread Mark Nelson



On 11/03/2017 02:44 AM, Wido den Hollander wrote:



On 3 November 2017 at 0:09, Nigel Williams wrote:


On 3 November 2017 at 07:45, Martin Overgaard Hansen  wrote:

I want to bring this subject back in the light and hope someone can provide
insight regarding the issue, thanks.


Thanks Martin, I was going to do the same.

Is it possible to make the DB partition (on the fastest device) too
big? in other words is there a point where for a given set of OSDs
(number + size) the DB partition is sized too large and is wasting
resources. I recall a comment by someone proposing to split up a
single large (fast) SSD into 100GB partitions for each OSD.



It depends on the size of your backing disk. The DB will grow for the amount of 
Objects you have on your OSD.

A 4TB drive will hold more objects than a 1TB drive (usually), same goes for a 
10TB vs 6TB.

From what I've seen now there is no such thing as a 'too big' DB.

The tests I've done for now seem to suggest that filling up a 50GB DB is rather 
hard to do. But it could be different if you have billions of objects and thus 
tens of millions of objects per OSD.


Are you doing RBD, RGW, or something else to test?  What size are the 
objects and are you fragmenting them?


Let's say the avg overhead is 16k you would need a 150GB DB for 10M objects.

You could look into your current numbers and check how many objects you have 
per OSD.

I checked a couple of Ceph clusters I run and see about 1M objects per OSD, but 
others only have 250k objects per OSD.

In all those cases even with 32k you would need a 30GB DB with 1M objects in 
that OSD.


The answer could be couched as some intersection of pool type (RBD /
RADOS / CephFS), object change(update?) intensity, size of OSD etc and
rule-of-thumb.



I would check your running Ceph clusters and calculate the amount of objects 
per OSD.

total objects / num osd * 3


One nagging concern I have in the back of my mind is that the amount of 
space amplification in rocksdb might grow with the number of levels (ie 
the number of objects).  The space used per object might be different at 
10M objects and 50M objects.




Wido


An idea occurred to me that by monitoring for the logged spill message
(the event when the DB partition spills/overflows to the OSD), OSDs
could be (lazily) destroyed and recreated with a new DB partition
increased in size say by 10% each time.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore with SSD-backed DBs; what if the SSD fails?

2017-10-25 Thread Mark Nelson

On 10/25/2017 03:51 AM, Caspar Smit wrote:

Hi,

I've asked the exact same question a few days ago, same answer:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021708.html

I guess we'll have to bite the bullet on this one and take this into
account when designing.


This is one reason (amongst several others) that it's a good idea to 
stick with enterprise grade SSDs that have high write endurance. 
Typically you'll also get power loss protection which allows O_DSYNC 
writes to complete quickly without having to flush cache.




Kind regards,
Caspar

2017-10-25 10:39 GMT+02:00 koukou73gr :

On 2017-10-25 11:21, Wido den Hollander wrote:



On 25 October 2017 at 5:58, Christian Sarrasin wrote:

The one thing I'm still wondering about is failure domains.  With
Filestore and SSD-backed journals, an SSD failure would kill writes but
OSDs were otherwise still whole.  Replacing the failed SSD quickly would
get you back on your feet with relatively little data movement.



Not true. If you lose your OSD's journal with FileStore without a clean 
shutdown of the OSD you lose the OSD. You'd have to rebalance the complete OSD.


Could you crosscheck please? Because this
http://ceph.com/geen-categorie/ceph-recover-osds-after-ssd-journal-failure/
suggests otherwise.

-K.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-10-17 Thread Mark Nelson



On 10/17/2017 01:54 AM, Wido den Hollander wrote:



Op 16 oktober 2017 om 18:14 schreef Richard Hesketh 
<richard.hesk...@rd.bbc.co.uk>:


On 16/10/17 13:45, Wido den Hollander wrote:

Op 26 september 2017 om 16:39 schreef Mark Nelson <mnel...@redhat.com>:
On 09/26/2017 01:10 AM, Dietmar Rieder wrote:

thanks David,

that's confirming what I was assuming. To bad that there is no
estimate/method to calculate the db partition size.


It's possible that we might be able to get ranges for certain kinds of
scenarios.  Maybe if you do lots of small random writes on RBD, you can
expect a typical metadata size of X per object.  Or maybe if you do lots
of large sequential object writes in RGW, it's more like Y.  I think
it's probably going to be tough to make it accurate for everyone though.


So I did a quick test. I wrote 75.000 objects to a BlueStore device:

root@alpha:~# ceph daemon osd.0 perf dump|jq '.bluestore.bluestore_onodes'
75085
root@alpha:~#

I then saw the RocksDB database was 450MB in size:

root@alpha:~# ceph daemon osd.0 perf dump|jq '.bluefs.db_used_bytes'
459276288
root@alpha:~#

459276288 / 75085 = 6116

So about 6kb of RocksDB data per object.

Let's say I want to store 1M objects in a single OSD I would need ~6GB of DB 
space.

Is this a safe assumption? Do you think that 6kb is normal? Low? High?

There aren't many of these numbers out there for BlueStore right now so I'm 
trying to gather some numbers.

Wido


If I check for the same stats on OSDs in my production cluster I see similar 
but variable values:

root@vm-ds-01:~/ceph-conf# for i in {0..9} ; do echo -n "osd.$i db per object: 
" ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon 
osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
osd.0 db per object: 7490
osd.1 db per object: 7523
osd.2 db per object: 7378
osd.3 db per object: 7447
osd.4 db per object: 7233
osd.5 db per object: 7393
osd.6 db per object: 7074
osd.7 db per object: 7967
osd.8 db per object: 7253
osd.9 db per object: 7680

root@vm-ds-02:~# for i in {10..19} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf 
dump | jq '.bluestore.bluestore_onodes'` ; done
osd.10 db per object: 5168
osd.11 db per object: 5291
osd.12 db per object: 5476
osd.13 db per object: 4978
osd.14 db per object: 5252
osd.15 db per object: 5461
osd.16 db per object: 5135
osd.17 db per object: 5126
osd.18 db per object: 9336
osd.19 db per object: 4986

root@vm-ds-03:~# for i in {20..29} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf 
dump | jq '.bluestore.bluestore_onodes'` ; done
osd.20 db per object: 5115
osd.21 db per object: 4844
osd.22 db per object: 5063
osd.23 db per object: 5486
osd.24 db per object: 5228
osd.25 db per object: 4966
osd.26 db per object: 5047
osd.27 db per object: 5021
osd.28 db per object: 5321
osd.29 db per object: 5150

root@vm-ds-04:~# for i in {30..39} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf 
dump | jq '.bluestore.bluestore_onodes'` ; done
osd.30 db per object: 6658
osd.31 db per object: 6445
osd.32 db per object: 6259
osd.33 db per object: 6691
osd.34 db per object: 6513
osd.35 db per object: 6628
osd.36 db per object: 6779
osd.37 db per object: 6819
osd.38 db per object: 6677
osd.39 db per object: 6689

root@vm-ds-05:~# for i in {40..49} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf 
dump | jq '.bluestore.bluestore_onodes'` ; done
osd.40 db per object: 5335
osd.41 db per object: 5203
osd.42 db per object: 5552
osd.43 db per object: 5188
osd.44 db per object: 5218
osd.45 db per object: 5157
osd.46 db per object: 4956
osd.47 db per object: 5370
osd.48 db per object: 5117
osd.49 db per object: 5313

I'm not sure why so much variance (these nodes are basically identical) and I 
think that the db_used_bytes includes the WAL at least in my case, as I don't 
have a separate WAL device. I'm not sure how big the WAL is relative to 
metadata and hence how much this might be thrown off, but ~6kb/object seems 
like a reasonable value to take for back-of-envelope calculating.



Yes, judging from your numbers 6kb/object seems reasonable. More datapoints are 
welcome in this case.

Some input from a BlueStore dev might be helpful as well to see we are not 
drawing the wrong conclusions here.

Wido


I would be very careful about drawing too many conclusions given a 
single snapshot in time, especially if there haven't been a lot of 
partial object rewrites yet.  Just on the surface, 6KB/object feels low 
(especially if they are moderately large objects), but perhaps if 
they've never been rewritten this is a reasonable lower bound.

Re: [ceph-users] BlueStore Cache Ratios

2017-10-11 Thread Mark Nelson

Hi Jorge,

I was sort of responsible for all of this. :)

So basically there are different caches in different places:

- rocksdb bloom filter and index cache
- rocksdb block cache (which can be configured to include filters and 
indexes)

- rocksdb compressed block cache
- bluestore onode cache

The bluestore onode cache is the only one that stores onode/extent/blob 
metadata before it is encoded, ie it's bigger but has lower impact on 
the CPU.  The next step is the regular rocksdb block cache where we've 
already encoded the data, but it's not compressed.  Optionally we could 
also compress the data and then cache it using rocksdb's compressed 
block cache.  Finally, rocksdb can set memory aside for bloom filters 
and indexes but we're configuring those to go into the block cache so we 
can get a better accounting for how memory is being used (otherwise it's 
difficult to control how much memory index and filters get).  The 
downside is that bloom filters and indexes can theoretically get paged 
out under heavy cache pressure.  We set these to be high priority in the 
block cache and also pin the L0 filters/index though to help avoid this.


In the testing I did earlier this year, what I saw is that in low memory 
scenarios it's almost always best to give all of the cache to rocksdb's 
block cache.  Once you hit about the 512MB mark, we start seeing bigger 
gains by giving additional memory to bluestore's onode cache.  So we 
devised a mechanism where you can decide where to cut over.  It's quite 
possible that on very fast CPUs it might make sense to use rocksdb 
compressed cache, or possibly if you have a huge number of objects these 
ratios might change.  The values we have now were sort of the best 
jack-of-all-trades values we found.
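
As a concrete sketch of what that cut-over looks like in ceph.conf (option
names as of Luminous; the values are purely illustrative and shift memory from
the rocksdb block cache toward the onode cache on a larger-memory OSD, not a
recommendation):

[osd]
bluestore_cache_size_ssd = 3221225472   # 3 GB total cache for SSD-backed OSDs
bluestore_cache_kv_max = 536870912      # cap the rocksdb block cache at 512 MB
bluestore_cache_kv_ratio = 0.20
bluestore_cache_meta_ratio = 0.80       # whatever is left (1 - kv - meta) goes to the data cache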


Mark

On 10/11/2017 08:32 AM, Jorge Pinilla López wrote:

okay, thanks for the explanation, so from the 3GB of cache (default
cache for SSD) only 0.5GB is going to K/V and 2.5GB is going to metadata.

Is there a way of knowing how much k/v, metadata and data is being stored and
how full the cache is, so I can adjust my ratios? I was thinking of some ratios
(like 0.9 k/v, 0.07 meta, 0.03 data) but I am only speculating, I don't have
any real data.

El 11/10/2017 a las 14:32, Mohamad Gebai escribió:

Hi Jorge,

On 10/10/2017 07:23 AM, Jorge Pinilla López wrote:

Are .99 KV, .01 metadata and .0 data ratios right? They seem a little
too disproportionate.

Yes, this is correct.


Also, .99 KV and a cache of 3GB for SSD means that almost the whole 3GB
would be used for KV, but there is also another attribute called
bluestore_cache_kv_max which is by default 512MB. What is the rest
of the cache used for then? Nothing? Shouldn't it be a higher kv_max value
or a lower KV ratio?

Anything over the *cache_kv_max value goes to the metadata cache. You
can look in your logs to see the final values of kv, metadata and data
cache ratios. To get data cache, you need to lower the ratios of
metadata and kv caches.

Mohamad
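
If your build exposes the dump_mempools admin socket command, it will show how
much memory the onode, data and other bluestore caches are currently holding,
which helps answer the "how full is the cache" part (osd.0 is just an example):

ceph daemon osd.0 dump_mempools | \
  jq '.mempool.by_pool | {bluestore_cache_onode, bluestore_cache_data, bluestore_cache_other}'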


--

*Jorge Pinilla López*
jorp...@unizar.es
Computer engineering student
Systems area intern (SICUZ)
Universidad de Zaragoza
PGP-KeyID: A34331932EBC715A




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] BlueStore questions about workflow and performance

2017-10-03 Thread Mark Nelson



On 10/03/2017 07:59 AM, Alex Gorbachev wrote:

Hi Sam,

On Mon, Oct 2, 2017 at 6:01 PM Sam Huracan wrote:

Can anyone help me?

On Oct 2, 2017 17:56, "Sam Huracan" wrote:

Hi,

I'm reading this document:
 
http://storageconference.us/2017/Presentations/CephObjectStore-slides.pdf

I have 3 questions:

1. Does BlueStore write data (to the raw block device) and metadata
(to RocksDB) simultaneously, or sequentially?

2. In my opinion, the performance of BlueStore cannot compare to
FileStore with an SSD journal, because writing to the raw disk is
slower than writing through a buffer (that is the buffer's purpose). What do you think?

3. Does putting the RocksDB and RocksDB WAL on SSD enhance only
write performance, only read performance, or both?

Hope your answer,


I am researching the same thing, but recommend you look
at http://ceph.com/community/new-luminous-bluestore

And also search for Bluestore cache to answer some questions.  My test
Luminous cluster so far is not as performant as I would like, but I have
not yet put a serious effort into tuning it, and it does seem stable.

Hth, Alex


Hi Alex,

If you see anything specific please let us know.  There are a couple of 
corner cases where bluestore is likely to be slower than filestore 
(specifically small sequential reads/writes with no client side cache or 
read ahead).  I've also seen some cases where filestore has higher read 
throughput potential (4MB seq reads with multiple NVMe drives per OSD 
node).  In many other cases bluestore is faster (and sometimes much 
faster) than filestore in our tests.  Writes in general tend to be 
faster and high volume object creation is much faster with much lower 
tail latencies (filestore really suffers in this test due to PG splitting).


Mark






___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
--
Alex Gorbachev
Storcium


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-26 Thread Mark Nelson



On 09/26/2017 01:10 AM, Dietmar Rieder wrote:

thanks David,

that's confirming what I was assuming. To bad that there is no
estimate/method to calculate the db partition size.


It's possible that we might be able to get ranges for certain kinds of 
scenarios.  Maybe if you do lots of small random writes on RBD, you can 
expect a typical metadata size of X per object.  Or maybe if you do lots 
of large sequential object writes in RGW, it's more like Y.  I think 
it's probably going to be tough to make it accurate for everyone though.


Mark



Dietmar

On 09/25/2017 05:10 PM, David Turner wrote:

db/wal partitions are per OSD.  DB partitions need to be made as big as
you need them.  If they run out of space, they will fall back to the
block device.  If the DB and block are on the same device, then there's
no reason to partition them and figure out the best size.  If they are
on separate devices, then you need to make it as big as you need to to
ensure that it won't spill over (or if it does that you're ok with the
degraded performance while the db partition is full).  I haven't come
across an equation to judge what size should be used for either
partition yet.

On Mon, Sep 25, 2017 at 10:53 AM Dietmar Rieder
<dietmar.rie...@i-med.ac.at> wrote:

On 09/25/2017 02:59 PM, Mark Nelson wrote:
> On 09/25/2017 03:31 AM, TYLin wrote:
>> Hi,
>>
>> To my understand, the bluestore write workflow is
>>
>> For normal big write
>> 1. Write data to block
>> 2. Update metadata to rocksdb
>> 3. Rocksdb write to memory and block.wal
>> 4. Once reach threshold, flush entries in block.wal to block.db
>>
>> For overwrite and small write
>> 1. Write data and metadata to rocksdb
>> 2. Apply the data to block
>>
>> Seems we don’t have a formula or suggestion to the size of block.db.
>> It depends on the object size and number of objects in your pool. You
>> can just give big partition to block.db to ensure all the database
>> files are on that fast partition. If block.db full, it will use block
>> to put db files, however, this will slow down the db performance. So
>> give db size as much as you can.
>
> This is basically correct.  What's more, it's not just the object
size,
> but the number of extents, checksums, RGW bucket indices, and
> potentially other random stuff.  I'm skeptical how well we can
estimate
> all of this in the long run.  I wonder if we would be better served by
> just focusing on making it easy to understand how the DB device is
being
> used, how much is spilling over to the block device, and make it
easy to
> upgrade to a new device once it gets full.
>
>>
>> If you want to put wal and db on same ssd, you don’t need to create
>> block.wal. It will implicitly use block.db to put wal. The only case
>> you need block.wal is that you want to separate wal to another disk.
>
> I always make explicit partitions, but only because I (potentially
> illogically) like it that way.  There may actually be some benefits to
> using a single partition for both if sharing a single device.

is this "Single db/wal partition" then to be used for all OSDs on a node
or do you need to create a seperate "Single  db/wal partition" for each
OSD  on the node?

>
>>
>> I’m also studying bluestore, this is what I know so far. Any
>> correction is welcomed.
>>
>> Thanks
>>
>>
>>> On Sep 22, 2017, at 5:27 PM, Richard Hesketh
>>> <richard.hesk...@rd.bbc.co.uk> wrote:
>>>
>>> I asked the same question a couple of weeks ago. No response I got
>>> contradicted the documentation but nobody actively confirmed the
>>> documentation was correct on this subject, either; my end state was
>>> that I was relatively confident I wasn't making some horrible
mistake
>>> by simply specifying a big DB partition and letting bluestore work
>>> itself out (in my case, I've just got HDDs and SSDs that were
>>> journals under filestore), but I could not be sure there wasn't some
>>> sort of performance tuning I was missing out on by not specifying
>>> them separately.
>>>
>>> Rich
>>>
>>> On 21/09/17 20:37, Benjeman Meekhof wrote:
>>>> Some of this thread seems to contradict the documentation and
confuses
>>>> me.  Is the statement below correct

Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-25 Thread Mark Nelson



On 09/25/2017 05:02 PM, Nigel Williams wrote:

On 26 September 2017 at 01:10, David Turner  wrote:

If they are on separate
devices, then you need to make it as big as you need to to ensure that it
won't spill over (or if it does that you're ok with the degraded performance
while the db partition is full).  I haven't come across an equation to judge
what size should be used for either partition yet.


Is it the case that only the WAL will spill if there is a backlog
clearing entries into the DB partition? so the WAL's fill-mark
oscillates but the DB is going to steadily grow (depending on the
previously mentioned factors of "...extents, checksums, RGW bucket
indices, and potentially other random stuff".


The WAL should never grow larger than the size of the buffers you've 
specified.  It's the DB that can grow and is difficult to estimate both 
because different workloads will cause different numbers of extents and 
objects, but also because rocksdb itself causes a certain amount of 
space-amplification due to a variety of factors.




Is there an indicator that can be monitored to show that a spill is occurring?


I think there's a message in the logs, but beyond that I don't remember 
if we added any kind of indication in the user tools.  At one point I 
think I remember Sage mentioning he wanted to add something to ceph df.
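
Until something lands in ceph df, one way to watch for spill-over is to compare
the bluefs counters that perf dump already exposes; if slow_used_bytes is
non-zero the DB has started using the slow (block) device (counter names as
seen on Luminous, osd.0 is just an example):

ceph daemon osd.0 perf dump | \
  jq '.bluefs | {db_total_bytes, db_used_bytes, slow_total_bytes, slow_used_bytes}'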



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-25 Thread Mark Nelson

On 09/25/2017 03:31 AM, TYLin wrote:

Hi,

To my understand, the bluestore write workflow is

For normal big write
1. Write data to block
2. Update metadata to rocksdb
3. Rocksdb write to memory and block.wal
4. Once reach threshold, flush entries in block.wal to block.db

For overwrite and small write
1. Write data and metadata to rocksdb
2. Apply the data to block

Seems we don't have a formula or suggestion for the size of block.db. It depends 
on the object size and number of objects in your pool. You can just give a big 
partition to block.db to ensure all the database files are on that fast 
partition. If block.db is full, it will use block to put db files; however, this 
will slow down the db performance. So give the db as much size as you can.


This is basically correct.  What's more, it's not just the object size, 
but the number of extents, checksums, RGW bucket indices, and 
potentially other random stuff.  I'm skeptical how well we can estimate 
all of this in the long run.  I wonder if we would be better served by 
just focusing on making it easy to understand how the DB device is being 
used, how much is spilling over to the block device, and make it easy to 
upgrade to a new device once it gets full.




If you want to put the wal and db on the same ssd, you don't need to create block.wal. 
It will implicitly use block.db for the wal. The only case where you need block.wal is 
when you want to put the wal on a separate disk.


I always make explicit partitions, but only because I (potentially 
illogically) like it that way.  There may actually be some benefits to 
using a single partition for both if sharing a single device.




I’m also studying bluestore, this is what I know so far. Any correction is 
welcomed.

Thanks



On Sep 22, 2017, at 5:27 PM, Richard Hesketh <richard.hesk...@rd.bbc.co.uk> 
wrote:

I asked the same question a couple of weeks ago. No response I got contradicted 
the documentation but nobody actively confirmed the documentation was correct 
on this subject, either; my end state was that I was relatively confident I 
wasn't making some horrible mistake by simply specifying a big DB partition and 
letting bluestore work itself out (in my case, I've just got HDDs and SSDs that 
were journals under filestore), but I could not be sure there wasn't some sort 
of performance tuning I was missing out on by not specifying them separately.

Rich

On 21/09/17 20:37, Benjeman Meekhof wrote:

Some of this thread seems to contradict the documentation and confuses
me.  Is the statement below correct?

"The BlueStore journal will always be placed on the fastest device
available, so using a DB device will provide the same benefit that the
WAL device would while also allowing additional metadata to be stored
there (if it will fix)."

http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices

 it seems to be saying that there's no reason to create separate WAL
and DB partitions if they are on the same device.  Specifying one
large DB partition per OSD will cover both uses.

thanks,
Ben

On Thu, Sep 21, 2017 at 12:15 PM, Dietmar Rieder
<dietmar.rie...@i-med.ac.at> wrote:

On 09/21/2017 05:03 PM, Mark Nelson wrote:


On 09/21/2017 03:17 AM, Dietmar Rieder wrote:

On 09/21/2017 09:45 AM, Maged Mokhtar wrote:

On 2017-09-21 07:56, Lazuardi Nasution wrote:


Hi,

I'm still looking for the answer of these questions. Maybe someone can
share their thought on these. Any comment will be helpful too.

Best regards,

On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
<mrxlazuar...@gmail.com> wrote:

Hi,

1. Is it possible configure use osd_data not as small partition on
OSD but a folder (ex. on root disk)? If yes, how to do that with
ceph-disk and any pros/cons of doing that?
2. Is WAL & DB size calculated based on OSD size or expected
throughput like on journal device of filestore? If no, what is the
default value and pro/cons of adjusting that?
3. Is partition alignment matter on Bluestore, including WAL & DB
if using separate device for them?

Best regards,


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



I am also looking for recommendations on wal/db partition sizes. Some
hints:

ceph-disk defaults used in case it does not find
bluestore_block_wal_size or bluestore_block_db_size in config file:

wal =  512MB

db = if bluestore_block_size (data size) is in config file it uses 1/100
of it else it uses 1G.

There is also a presentation by Sage back in March, see page 16:

https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in


wal: 512 MB

db: "a few" GB

the wal size is probably not debatable, it will be like a journal for
small block sizes which are constrained by iops, hence 512 MB is more than enough.

Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-21 Thread Mark Nelson



On 09/21/2017 03:17 AM, Dietmar Rieder wrote:

On 09/21/2017 09:45 AM, Maged Mokhtar wrote:

On 2017-09-21 07:56, Lazuardi Nasution wrote:


Hi,

I'm still looking for the answer of these questions. Maybe someone can
share their thought on these. Any comment will be helpful too.

Best regards,

On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution wrote:

Hi,

1. Is it possible configure use osd_data not as small partition on
OSD but a folder (ex. on root disk)? If yes, how to do that with
ceph-disk and any pros/cons of doing that?
2. Is WAL & DB size calculated based on OSD size or expected
throughput like on journal device of filestore? If no, what is the
default value and pro/cons of adjusting that?
3. Is partition alignment matter on Bluestore, including WAL & DB
if using separate device for them?

Best regards,


___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




I am also looking for recommendations on wal/db partition sizes. Some hints:

ceph-disk defaults used in case it does not find
bluestore_block_wal_size or bluestore_block_db_size in config file:

wal =  512MB

db = if bluestore_block_size (data size) is in config file it uses 1/100
of it else it uses 1G.

There is also a presentation by Sage back in March, see page 16:

https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in

wal: 512 MB

db: "a few" GB

the wal size is probably not debatable, it will be like a journal for
small block sizes which are constrained by iops hence 512 MB is more
than enough. Probably we will see more on the db size in the future.


This is what I understood so far.
I wonder if it makes sense to set the db size as big as possible and
divide entire db device is  by the number of OSDs it will serve.

E.g. 10 OSDs / 1 NVME (800GB)

 (800GB - 10 x 1GB wal) / 10 = ~79GB db size per OSD

Is this smart/stupid?


Personally I'd use 512MB-2GB for the WAL (larger buffers reduce write 
amp but mean larger memtables and potentially higher overhead scanning 
through memtables).  4x256MB buffers works pretty well, but it means 
memory overhead too.  Beyond that, I'd devote the entire rest of the 
device to DB partitions.
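
Applied to the 10 OSDs / 800 GB NVMe example above, with a 2 GB WAL instead of
1 GB, the arithmetic becomes (GiB/GB rounding ignored):

NVME_GB=800; OSDS=10; WAL_GB=2
echo $(( (NVME_GB - OSDS * WAL_GB) / OSDS ))   # 78 GB of DB per OSD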


Mark




Dietmar
 --
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] luminous vs jewel rbd performance

2017-09-21 Thread Mark Nelson

Hi Rafael,

In the original email you mentioned 4M block size, seq read, but here it 
looks like you are doing 4k writes?  Can you clarify?  If you are doing 
4k direct sequential writes with iodepth=1 and are also using librbd 
cache, please make sure that librbd is set to writeback mode in both 
cases.  RBD by default will not kick into WB mode until it sees a flush 
request, and the librbd engine in fio doesn't issue one before a test is 
started.  It can be pretty easy to end up in a situation where writeback 
cache is active on some tests but not others if you aren't careful.  IE 
If one of your tests was done after a flush and the other was not, you'd 
likely see a dramatic difference in performance during this test.


You can avoid this by telling librbd to always use WB mode (at least 
when benchmarking):


rbd cache writethrough until flush = false

Mark

On 09/20/2017 01:51 AM, Rafael Lopez wrote:

Hi Alexandre,

Yeah we are using filestore for the moment with luminous. With regards
to client, I tried both jewel and luminous librbd versions against the
luminous cluster - similar results.

I am running fio on a physical machine with fio rbd engine. This is a
snippet of the fio config for the runs (the complete jobfile adds
variations of read/write/block size/iodepth).

[global]
ioengine=rbd
clientname=cinder-volume
pool=rbd-bronze
invalidate=1
ramp_time=5
runtime=30
time_based
direct=1

[write-rbd1-4k-depth1]
rbdname=rbd-tester-fio
bs=4k
iodepth=1
rw=write
stonewall

[write-rbd2-4k-depth16]
rbdname=rbd-tester-fio-2
bs=4k
iodepth=16
rw=write
stonewall

Raf

On 20 September 2017 at 16:43, Alexandre DERUMIER wrote:

Hi

so, you use also filestore on luminous ?

do you have also upgraded librbd on client ? (are you benching
inside a qemu machine ? or directly with fio-rbd ?)



(I'm going to do a lot of benchmarks in coming week, I'll post
results on mailing soon.)



- Original message -
From: "Rafael Lopez"
To: "ceph-users"
Sent: Wednesday, 20 September 2017 08:17:23
Subject: [ceph-users] luminous vs jewel rbd performance

hey guys.
wondering if anyone else has done some solid benchmarking of jewel
vs luminous, in particular on the same cluster that has been
upgraded (same cluster, client and config).

we have recently upgraded a cluster from 10.2.9 to 12.2.0, and
unfortunately i only captured results from a single fio (librbd) run
with a few jobs in it before upgrading. i have run the same fio
jobfile many times at different times of the day since upgrading,
and been unable to produce a close match to the pre-upgrade (jewel)
run from the same client. one particular job is significantly slower
(4M block size, iodepth=1, seq read), up to 10x in one run.

i realise i havent supplied much detail and it could be dozens of
things, but i just wanted to see if anyone else had done more
quantitative benchmarking or had similar experiences. keep in mind
all we changed was daemons were restarted to use luminous code,
everything else exactly the same. granted it is possible that
some/all osds had some runtime config injected that differs from
now, but i'm fairly confident this is not the case as they were
recently restarted (on jewel code) after OS upgrades.

cheers,
Raf

___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





--
*Rafael Lopez*
Research Devops Engineer
Monash University eResearch Centre

T: +61 3 9905 9118 
M: +61 (0)427682670 
E: rafael.lo...@monash.edu 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore disk colocation using NVRAM, SSD and SATA

2017-09-21 Thread Mark Nelson



On 09/21/2017 03:19 AM, Maged Mokhtar wrote:

On 2017-09-21 10:01, Dietmar Rieder wrote:


Hi,

I'm in the same situation (NVMEs, SSDs, SAS HDDs). I asked the same
questions to myself.
For now I decided to use the NVMEs as wal and db devices for the SAS
HDDs and on the SSDs I colocate wal and  db.

However, I'm still wonderin how (to what size) and if I should change
the default sizes of wal and db.

Dietmar

On 09/21/2017 01:18 AM, Alejandro Comisario wrote:

But for example, on the same server i have 3 disks technologies to
deploy pools, SSD, SAS and SATA.
The NVME were bought just thinking on the journal for SATA and SAS,
since journals for SSD were colocated.

But now, exactly the same scenario, should i trust the NVME for the SSD
pool ? are there that much of a  gain ? against colocating block.* on
the same SSD?

best.

On Wed, Sep 20, 2017 at 6:36 PM, Nigel Williams wrote:

On 21 September 2017 at 04:53, Maximiliano Venesio wrote:

Hi guys i'm reading different documents about bluestore, and it
never recommends to use NVRAM to store the bluefs db,
nevertheless the official documentation says that, is better to
use the faster device to put the block.db in.


Likely not mentioned since no one yet has had the opportunity to
test it.

So how do i have to deploy using bluestore, regarding where i
should put block.wal and block.db ?


block.* would be best on your NVRAM device, like this:

ceph-deploy osd create --bluestore c0osd-136:/dev/sda --block-wal
/dev/nvme0n1 --block-db /dev/nvme0n1



___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





--
*Alejandro Comisario*
*CTO | NUBELIU*
E-mail: alejan...@nubeliu.com
Cell: +54 9 11 3770 1857
_
www.nubeliu.com  



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




My guess for the wal: you are dealing with a 2-step io operation, so if
it is collocated on your SSDs your iops for small writes will be
halved. The decision is: if you add a small NVMe as wal for 4 or 5
(large) SSDs, you will double their iops for small io sizes. This is not
the case for the db.

For wal size:  512 MB is recommended ( ceph-disk default )

For db size: a "few" GB..probably 10GB is a good number. I guess we will
hear more in the future.


There's a pretty good chance that if you are writing out lots of small 
RGW or rados objects you'll blow past 10GB of metadata once rocksdb 
space-amp is factored in.  I can pretty routinely do it when writing out 
millions of rados objects per OSD.  Bluestore will switch to writing
metadata out to the block disk, and in this case it might not be that bad
of a transition (NVMe to SSD).  If you have spare room, you might as 
well give the DB partition whatever you have available on the device.  A 
harder question is how much fast storage to buy for the WAL/DB.  It's 
not straightforward, and rocksdb can be tuned in various ways to favor
reducing space/write/read amplification, but not all 3 at once.  Right 
now we are likely favoring reducing write-amplification over space/read 
amp, but one could imagine that with a small amount of incredibly fast 
storage it might be better to favor reducing space-amp.
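
For anyone who would rather pin the WAL/DB sizes explicitly than rely on the
ceph-disk defaults, the bluestore_block_wal_size / bluestore_block_db_size
options mentioned elsewhere on this list can be set in ceph.conf before the
OSDs are created.  A rough sketch only: the values below are purely
illustrative, not a recommendation, and are in bytes:

[osd]
# roughly 1 GB WAL per OSD on the fast device
bluestore_block_wal_size = 1073741824
# give the DB partition whatever the fast device can spare, e.g. ~30 GB here
bluestore_block_db_size = 32212254720

Whatever you pick, it only takes effect for OSDs created after the change.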


Mark



Maged Mokhtar




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore "separate" WAL and DB (and WAL/DB size?) [and recovery sleep]

2017-09-14 Thread Mark Nelson

I'm really glad to hear that it wasn't bluestore! :)

It raises another concern though. We didn't expect to see that much of a 
slowdown with the current throttle settings.  An order of magnitude 
slowdown in recovery performance isn't ideal at all.


I wonder if we could improve things dramatically if we kept track of 
client IO activity on the OSD and remove the throttle if there's been no 
client activity for X seconds.  Theoretically more advanced heuristics 
might cover this, but in the interim it seems to me like this would 
solve the very specific problem you are seeing while still throttling 
recovery when IO is happening.
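
In the meantime, if a cluster really has no client load worth protecting,
something along these lines should effectively lift the throttle at runtime.
This is a sketch only: the values are illustrative, you would want to put
them back afterwards, and depending on the release there are per-device
variants (e.g. osd_recovery_sleep_hdd) that may need setting as well:

ceph tell osd.* injectargs '--osd_recovery_sleep 0'
ceph tell osd.* injectargs '--osd_max_backfills 4 --osd_recovery_max_active 8'

The second line is optional; it just raises the older backfill/recovery
concurrency limits on top of removing the per-op sleep.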


Mark

On 09/14/2017 06:19 AM, Richard Hesketh wrote:

Yeah, that hit the nail on the head. Significantly reducing/eliminating the 
recovery sleep times increases the recovery speed back up to (and beyond!) the 
levels I was expecting to see - recovery is almost an order of magnitude faster 
now. Thanks for educating me about those changes!

Rich

On 14/09/17 11:16, Richard Hesketh wrote:

Hi Mark,

No, I wasn't familiar with that work. I am in fact comparing speed of recovery 
to maintenance work I did while the cluster was in Jewel; I haven't manually 
done anything to sleep settings, only adjusted max backfills OSD settings. New 
options that introduce arbitrary slowdown to recovery operations to preserve 
client performance would explain what I'm seeing! I'll have a tinker with 
adjusting those values (in my particular case client load on the cluster is 
very low and I don't have to honour any guarantees about client performance - 
getting back into HEALTH_OK asap is preferable).

Rich

On 13/09/17 21:14, Mark Nelson wrote:

Hi Richard,

Regarding recovery speed, have you looked through any of Neha's results on 
recovery sleep testing earlier this summer?

https://www.spinics.net/lists/ceph-devel/msg37665.html

She tested bluestore and filestore under a couple of different scenarios.  The 
gist of it is that time to recover changes pretty dramatically depending on the 
sleep setting.

I don't recall if you said earlier, but are you comparing filestore and 
bluestore recovery performance on the same version of ceph with the same sleep 
settings?

Mark

On 09/12/2017 05:24 AM, Richard Hesketh wrote:

Thanks for the links. That does seem to largely confirm that I haven't 
horribly misunderstood anything and I've not been doing anything obviously 
wrong while converting my disks; there's no point specifying separate WAL/DB 
partitions if they're going to go on the same device, throw as much space as 
you have available at the DB partitions and they'll use all the space they can, 
and significantly reduced I/O on the DB/WAL device compared to Filestore is 
expected since bluestore's nixed the write amplification as much as possible.

I'm still seeing much reduced recovery speed on my newly Bluestored cluster, 
but I guess that's a tuning issue rather than evidence of catastrophe.

Rich




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore "separate" WAL and DB (and WAL/DB size?)

2017-09-13 Thread Mark Nelson

Hi Richard,

Regarding recovery speed, have you looked through any of Neha's results 
on recovery sleep testing earlier this summer?


https://www.spinics.net/lists/ceph-devel/msg37665.html

She tested bluestore and filestore under a couple of different 
scenarios.  The gist of it is that time to recover changes pretty 
dramatically depending on the sleep setting.


I don't recall if you said earlier, but are you comparing filestore and 
bluestore recovery performance on the same version of ceph with the same 
sleep settings?


Mark

On 09/12/2017 05:24 AM, Richard Hesketh wrote:

Thanks for the links. That does seem to largely confirm that I haven't 
horribly misunderstood anything and I've not been doing anything obviously 
wrong while converting my disks; there's no point specifying separate WAL/DB 
partitions if they're going to go on the same device, throw as much space as 
you have available at the DB partitions and they'll use all the space they can, 
and significantly reduced I/O on the DB/WAL device compared to Filestore is 
expected since bluestore's nixed the write amplification as much as possible.

I'm still seeing much reduced recovery speed on my newly Bluestored cluster, 
but I guess that's a tuning issue rather than evidence of catastrophe.

Rich

On 12/09/17 00:13, Brad Hubbard wrote:

Take a look at these which should answer at least some of your questions.

http://ceph.com/community/new-luminous-bluestore/

http://ceph.com/planet/understanding-bluestore-cephs-new-storage-backend/

On Mon, Sep 11, 2017 at 8:45 PM, Richard Hesketh
 wrote:

On 08/09/17 11:44, Richard Hesketh wrote:

Hi,

Reading the ceph-users list I'm obviously seeing a lot of people talking about 
using bluestore now that Luminous has been released. I note that many users 
seem to be under the impression that they need separate block devices for the 
bluestore data block, the DB, and the WAL... even when they are going to put 
the DB and the WAL on the same device!

As per the docs at 
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/ this 
is nonsense:


If there is only a small amount of fast storage available (e.g., less than a 
gigabyte), we recommend using it as a WAL device. If there is more, 
provisioning a DB
device makes more sense. The BlueStore journal will always be placed on the 
fastest device available, so using a DB device will provide the same benefit 
that the WAL
device would while also allowing additional metadata to be stored there (if it will fix). 
[sic, I assume that should be "fit"]


I understand that if you've got three speeds of storage available, there may be 
some sense to dividing these. For instance, if you've got lots of HDD, a bit of 
SSD, and a tiny NVMe available in the same host, data on HDD, DB on SSD and WAL 
on NVMe may be a sensible division of data. That's not the case for most of the 
examples I'm reading; they're talking about putting DB and WAL on the same 
block device, but in different partitions. There's even one example of someone 
suggesting to try partitioning a single SSD to put data/DB/WAL all in separate 
partitions!

Are the docs wrong and/or I am missing something about optimal bluestore setup, 
or do people simply have the wrong end of the stick? I ask because I'm just 
going through switching all my OSDs over to Bluestore now and I've just been 
reusing the partitions I set up for journals on my SSDs as DB devices for 
Bluestore HDDs without specifying anything to do with the WAL, and I'd like to 
know sooner rather than later if I'm making some sort of horrible mistake.

Rich


Having had no explanatory reply so far I'll ask further...

I have been continuing to update my OSDs and so far the performance offered by 
bluestore has been somewhat underwhelming. Recovery operations after replacing 
the Filestore OSDs with Bluestore equivalents have been much slower than 
expected, not even half the speed of recovery ops when I was upgrading 
Filestore OSDs with larger disks a few months ago. This contributes to my sense 
that I am doing something wrong.

I've found that if I allow ceph-disk to partition my DB SSDs rather than 
reusing the rather large journal partitions I originally created for Filestore, 
it is only creating very small 1GB partitions. Attempting to search for 
bluestore configuration parameters has pointed me towards 
bluestore_block_db_size and bluestore_block_wal_size config settings. 
Unfortunately these settings are completely undocumented so I'm not sure what 
their functional purpose is. In any event in my running config I seem to have 
the following default values:

# ceph-conf --show-config | grep bluestore
...
bluestore_block_create = true
bluestore_block_db_create = false
bluestore_block_db_path =
bluestore_block_db_size = 0
bluestore_block_path =
bluestore_block_preallocate_file = false
bluestore_block_size = 10737418240
bluestore_block_wal_create = false

Re: [ceph-users] Help with down OSD with Ceph 12.1.4 on Bluestore back

2017-08-29 Thread Mark Nelson

Hi Bryan,

Check out your SCSI device failures, but if that doesn't pan out, Sage 
and I have been tracking this:


http://tracker.ceph.com/issues/21171

There's a fix in place being tested now!
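
To rule the hardware in or out quickly, checking the kernel log and the SMART
data for the device behind that OSD is usually enough.  The device name below
is just a placeholder:

dmesg -T | egrep -i 'i/o error|blk_update_request|medium error'
smartctl -a /dev/sdX

A failing drive will usually show up in one or both of those before the aio
thread trips the assert.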

Mark

On 08/29/2017 05:41 PM, Bryan Banister wrote:

Found some bad stuff in the messages file about SCSI block device fails…
I think I found my smoking gun…

-B



*From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
Of *Bryan Banister
*Sent:* Tuesday, August 29, 2017 5:02 PM
*To:* ceph-users@lists.ceph.com
*Subject:* [ceph-users] Help with down OSD with Ceph 12.1.4 on Bluestore
back



/Note: External Email/



Hi all,



Not sure what to do with this down OSD:



-2> 2017-08-29 16:55:34.588339 72d58700  1 --
7.128.13.57:6979/18818 --> 7.128.13.55:0/52877 -- osd_ping(ping_reply
e935 stamp 2017-08-29 16:55:34.587991) v4 -- 0x67397000 con 0

-1> 2017-08-29 16:55:34.588351 72557700  1 --
7.128.13.57:6978/18818 --> 7.128.13.55:0/52877 -- osd_ping(ping_reply
e935 stamp 2017-08-29 16:55:34.587991) v4 -- 0x67395000 con 0

 0> 2017-08-29 16:55:34.650061 7fffecd93700 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/os/bluestore/KernelDevice.cc:
In function 'void KernelDevice::_aio_thread()' thread 7fffecd93700 time
2017-08-29 16:55:34.648642

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/os/bluestore/KernelDevice.cc:
372: FAILED assert(r >= 0)



ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc)

1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x110) [0x55fb4420]

2: (KernelDevice::_aio_thread()+0x4b5) [0x55f59ce5]

3: (KernelDevice::AioCompletionThread::entry()+0xd) [0x55f5e3cd]

4: (()+0x7dc5) [0x75635dc5]

5: (clone()+0x6d) [0x7472a73d]

NOTE: a copy of the executable, or `objdump -rdS ` is needed
to interpret this.



--- logging levels ---

   0/ 5 none

   0/ 1 lockdep

   0/ 1 context

[snip]



Any help with recovery would be greatly appreciated, thanks!

-Bryan






Note: This email is for the confidential use of the named addressee(s)
only and may contain proprietary, confidential or privileged
information. If you are not the intended recipient, you are hereby
notified that any review, dissemination or copying of this email is
strictly prohibited, and to please notify the sender immediately and
destroy this email and any attachments. Email transmission cannot be
guaranteed to be secure or error-free. The Company, therefore, does not
make any guarantees as to the completeness or accuracy of this email or
any attachments. This email is for informational purposes only and does
not constitute a recommendation, offer, request or solicitation of any
kind to buy, sell, subscribe, redeem or perform any type of transaction
of a financial product.






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NVMe + SSD + HDD RBD Replicas with Bluestore...

2017-08-23 Thread Mark Nelson



On 08/23/2017 07:17 PM, Mark Nelson wrote:



On 08/23/2017 06:18 PM, Xavier Trilla wrote:

Oh man, what do you know!... I'm quite amazed. I've been reviewing
more documentation about min_replica_size and seems like it doesn't
work as I thought (Although I remember specifically reading it
somewhere some years ago :/ ).

And, as all replicas need to be written before primary OSD informs the
client about the write being completed, we cannot have the third
replica on HDDs, no way. It would kill latency.

Well, we'll just keep adding NVMs to our cluster (I mean, S4500 and
P4500 price difference is negligible) and we'll decrease the primary
affinity weight for SATA SSDs, just to be sure we get the most out of
NVMe.

BTW, does anybody have any experience so far with erasure coding and
rbd? A 2/3 profile, would really save space on SSDs but I'm afraid
about the extra calculations needed and how will it affect
performance... Well, maybe I'll check into it, and I'll start a new
thread :)


There's a decent chance you'll get higher performance with something
like EC 6+2 vs 3X replication for large writes due simply to having less
data to write (we see somewhere between 2x and 3x rep performance in the
lab for 4MB writes to RBD). Small random writes will almost certainly be
slower due to increased latency.  Reads in general will be slower as
well.  With replication the read comes entirely from the primary but in
EC you have to fetch chunks from the secondaries and reconstruct the
object before sending it back to the client.

So basically compared to 3X rep you'll likely gain some performance on
large writes, lose some performance on large reads, and lose more
performance on small writes/reads (dependent on cpu speed and various
other factors).


I should follow up and mention though that you gain space vs 3X as well, 
so it's very much a question of what trade-offs you are willing to make.
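
If you do end up testing it, the Luminous-era workflow is roughly the
following.  Treat this as a sketch: the profile name, pool names, PG counts
and k/m values are placeholders for illustration only:

ceph osd erasure-code-profile set ec62 k=6 m=2 crush-failure-domain=host
ceph osd pool create rbd-ec-data 128 128 erasure ec62
ceph osd pool set rbd-ec-data allow_ec_overwrites true
rbd create --size 100G --data-pool rbd-ec-data rbd/ec-test

Note that allow_ec_overwrites (and hence RBD on EC) requires bluestore OSDs
behind the EC pool, and the image metadata still lives in a replicated pool
(the plain rbd pool in this sketch).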




Mark



Anyway, thanks for the info!
Xavier.

-Mensaje original-
De: Christian Balzer [mailto:ch...@gol.com]
Enviado el: martes, 22 de agosto de 2017 2:40
Para: ceph-users@lists.ceph.com
CC: Xavier Trilla <xavier.tri...@silicontower.net>
Asunto: Re: [ceph-users] NVMe + SSD + HDD RBD Replicas with Bluestore...


Hello,


Firstly, what David said.

On Mon, 21 Aug 2017 20:25:07 + Xavier Trilla wrote:


Hi,

I'm working on improving the costs of our current Ceph cluster. We
currently keep 3x replicas, all of them on SSDs (that cluster hosts
several hundred VMs' RBD disks), and lately I've been wondering if the
following setup would make sense, in order to improve cost /
performance.



Have you done a full analysis of your current cluster, as in
utilization of your SSDs (IOPS), CPU, etc with
atop/iostat/collectd/grafana?
During peak utilization times?

If so, you should have a decent enough idea of what level IOPS you
need and can design from there.


The ideal would be to move PG primaries to high performance nodes
using NVMe, keep secondary replica in SSDs and move the third replica
to HDDs.

Most probably the hardware will be:

1st Replica: Intel P4500 NVMe (2TB)
2nd Replica: Intel S3520 SATA SSD (1.6TB)

Unless you have:
a) a lot of these and/or
b) very little writes
what David said.

Aside from that, the whole replica idea won't work as you think.


3rd Replica: WD Gold Harddrives (2 TB) (I'm considering either 1TB o
2TB model, as I want to have as many spins as possible)

Also, hosts running OSDs would have a quite different HW configuration
(In our experience NVMe need crazy CPU power in order to get the best
out of them)


Correct, one might run into that with pure NVMe/SSD nodes.


I know the NVMe and SATA SSD replicas will work, no problem about
that (We'll just adjust the primary affinity and crushmap in order to
have the desired data layout + primary OSDs). What I'm worried about is
the HDD replica.

Also the pool will have min_size 1 (Would love to use min_size 2, but
it would kill latency times) so, even if we have to do some
maintenance in the NVMe nodes, writes to HDDs will be always "lazy".

Before bluestore (we are planning to move to luminous most probably
by the end of the year or beginning 2018, once it is released and
tested properly) I would just use  SSD/NVMe journals for the HDDs.
So, all writes would go to the SSD journal, and then moved to the
HDD. But now, with Bluestore I don't think that's an option anymore.


Bluestore bits are still a bit of dark magic in terms of concise and
complete documentation, but the essentials have been mentioned here
before.

Essentially, if you can get the needed IOPS with SSD/NVMe journals and
HDDs, Bluestore won't be worse than that, if done correctly.

With Bluestore use either NVMe for the WAL (small space, high
IOPS/data), SSDs for the actual rocksdb and the (surprise, surprise!)
journal for small writes (large space, nobody knows for sure how large
is large enough) and finally the HDDs.

If you're trying to optimize costs, decent SSDs (good 

Re: [ceph-users] NVMe + SSD + HDD RBD Replicas with Bluestore...

2017-08-23 Thread Mark Nelson



On 08/23/2017 06:18 PM, Xavier Trilla wrote:

Oh man, what do you know!... I'm quite amazed. I've been reviewing more 
documentation about min_replica_size and seems like it doesn't work as I 
thought (Although I remember specifically reading it somewhere some years ago 
:/ ).

And, as all replicas need to be written before primary OSD informs the client 
about the write being completed, we cannot have the third replica on HDDs, no 
way. It would kill latency.

Well, we'll just keep adding NVMs to our cluster (I mean, S4500 and P4500 price 
difference is negligible) and we'll decrease the primary affinity weight for 
SATA SSDs, just to be sure we get the most out of NVMe.

BTW, does anybody have any experience so far with erasure coding and rbd? A 2/3 
profile, would really save space on SSDs but I'm afraid about the extra 
calculations needed and how will it affect performance... Well, maybe I'll 
check into it, and I'll start a new thread :)


There's a decent chance you'll get higher performance with something 
like EC 6+2 vs 3X replication for large writes due simply to having less 
data to write (we see somewhere between 2x and 3x rep performance in the 
lab for 4MB writes to RBD). Small random writes will almost certainly be 
slower due to increased latency.  Reads in general will be slower as 
well.  With replication the read comes entirely from the primary but in 
EC you have to fetch chunks from the secondaries and reconstruct the 
object before sending it back to the client.


So basically compared to 3X rep you'll likely gain some performance on 
large writes, lose some performance on large reads, and lose more 
performance on small writes/reads (dependent on cpu speed and various 
other factors).


Mark



Anyway, thanks for the info!
Xavier.

-Mensaje original-
De: Christian Balzer [mailto:ch...@gol.com]
Enviado el: martes, 22 de agosto de 2017 2:40
Para: ceph-users@lists.ceph.com
CC: Xavier Trilla 
Asunto: Re: [ceph-users] NVMe + SSD + HDD RBD Replicas with Bluestore...


Hello,


Firstly, what David said.

On Mon, 21 Aug 2017 20:25:07 + Xavier Trilla wrote:


Hi,

I'm working on improving the costs of our current Ceph cluster. We currently 
keep 3x replicas, all of them on SSDs (that cluster hosts several hundred VMs' 
RBD disks), and lately I've been wondering if the following setup would make 
sense, in order to improve cost / performance.



Have you done a full analysis of your current cluster, as in utilization of 
your SSDs (IOPS), CPU, etc with atop/iostat/collectd/grafana?
During peak utilization times?

If so, you should have a decent enough idea of what level IOPS you need and can 
design from there.


The ideal would be to move PG primaries to high performance nodes using NVMe, 
keep secondary replica in SSDs and move the third replica to HDDs.

Most probably the hardware will be:

1st Replica: Intel P4500 NVMe (2TB)
2nd Replica: Intel S3520 SATA SSD (1.6TB)

Unless you have:
a) a lot of these and/or
b) very little writes
what David said.

Aside from that, the whole replica idea won't work as you think.


3rd Replica: WD Gold Harddrives (2 TB) (I'm considering either 1TB o
2TB model, as I want to have as many spins as possible)

Also, hosts running OSDs would have a quite different HW configuration
(In our experience NVMe need crazy CPU power in order to get the best
out of them)


Correct, one might run into that with pure NVMe/SSD nodes.


I know the NVMe and SATA SSD replicas will work, no problem about that (We'll 
just adjust the primary affinity and crushmap in order to have the desired data 
layout + primary OSDs). What I'm worried about is the HDD replica.

Also the pool will have min_size 1 (Would love to use min_size 2, but it would kill 
latency times) so, even if we have to do some maintenance in the NVMe nodes, writes to 
HDDs will be always "lazy".

Before bluestore (we are planning to move to luminous most probably by the end 
of the year or beginning 2018, once it is released and tested properly) I would 
just use  SSD/NVMe journals for the HDDs. So, all writes would go to the SSD 
journal, and then moved to the HDD. But now, with Bluestore I don't think 
that's an option anymore.


Bluestore bits are still a bit of dark magic in terms of concise and complete 
documentation, but the essentials have been mentioned here before.

Essentially, if you can get the needed IOPS with SSD/NVMe journals and HDDs, 
Bluestore won't be worse than that, if done correctly.

With Bluestore use either NVMe for the WAL (small space, high IOPS/data), SSDs 
for the actual rocksdb and the (surprise, surprise!) journal for small writes 
(large space, nobody knows for sure how large is large enough) and finally the 
HDDs.

If you're trying to optimize costs, decent SSDs (good luck finding any with 
Intel 37xx and 36xx basically unavailable), maybe the S or P 4600, to hold both 
the WAL and DB should do the trick.

Christian


What I'm worried is 

Re: [ceph-users] Optimise Setup with Bluestore

2017-08-16 Thread Mark Nelson

Hi Mehmet!

On 08/16/2017 11:12 AM, Mehmet wrote:

:( no suggestions or recommendations on this?

On 14 August 2017 at 16:50:15 CEST, Mehmet wrote:

Hi friends,

my actual hardware setup per OSD-node is as follow:

# 3 OSD-Nodes with
- 2x Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz ==> 12 Cores, no
Hyper-Threading
- 64GB RAM
- 12x 4TB HGST 7K4000 SAS2 (6GB/s) Disks as OSDs
- 1x INTEL SSDPEDMD400G4 (Intel DC P3700 NVMe) as Journaling Device for
12 Disks (20G Journal size)
- 1x Samsung SSD 840/850 Pro only for the OS

# and 1x OSD Node with
- 1x Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz (10 Cores 20 Threads)
- 64GB RAM
- 23x 2TB TOSHIBA MK2001TRKB SAS2 (6GB/s) Disks as OSDs
- 1x SEAGATE ST32000445SS SAS2 (6GB/s) Disk as OSDs
- 1x INTEL SSDPEDMD400G4 (Intel DC P3700 NVMe) as Journaling Device for
24 Disks (15G Journal size)
- 1x Samsung SSD 850 Pro only for the OS


The single P3700 for 23 spinning disks is pushing it.  They have high 
write durability but based on the model that is the 400GB version?  If 
you are doing a lot of writes you might wear it out pretty fast and it's 
a single point of failure for the entire node (if it dies you have a lot 
of data dying with it).  Generally, unbalanced setups like this are also 
trickier to get performing well.




As you can see, I am using 1 (one) NVMe (Intel DC P3700 NVMe – 400G)
device for all the spinning disks (partitioned) on each OSD node.

When "Luminous" is available (as the next LTS) I plan to switch from
"filestore" to "bluestore".

As far as I have read, bluestore consists of
- "the device"
- "block-DB": device that stores RocksDB metadata
- "block-WAL": device that stores the RocksDB "write-ahead journal"

Which setup would be useful in my case?
I would set up the disks via "ceph-deploy".


So typically we recommend something like a 1-2GB WAL partition on the 
NVMe drive per OSD and use the remaining space for DB.  If you run out 
of DB space, bluestore will start using the spinning disks to store KV 
data instead.  I suspect this will still be the advice you will want to 
follow, though at some point having so many WAL and DB partitions on the 
NVMe may start becoming a bottleneck.  Something like 63K sequential 
writes to heavily fragmented objects might be worth testing, but in most 
cases I suspect DB and WAL on NVMe is still going to be faster.
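
In practice that layout can be expressed the same way as the ceph-deploy
examples elsewhere on this list, pointing both block.wal and block.db at the
NVMe.  Host and device names here are placeholders, and the exact sizes are
worth testing rather than taking on faith:

ceph-deploy osd create --bluestore osdnode1:/dev/sda --block-wal /dev/nvme0n1 --block-db /dev/nvme0n1

If you want to control the partition sizes rather than accept the defaults,
set bluestore_block_wal_size / bluestore_block_db_size in ceph.conf before
preparing the OSDs.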




Thanks in advance for your suggestions!
- Mehmet


ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] luminous/bluetsore osd memory requirements

2017-08-14 Thread Mark Nelson

On 08/14/2017 02:42 PM, Nick Fisk wrote:

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Ronny Aasen
Sent: 14 August 2017 18:55
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] luminous/bluetsore osd memory requirements

On 10.08.2017 17:30, Gregory Farnum wrote:

This has been discussed a lot in the performance meetings so I've
added Mark to discuss. My naive recollection is that the per-terabyte
recommendation will be more realistic  than it was in the past (an
effective increase in memory needs), but also that it will be under
much better control than previously.



Is there any way to tune or reduce the memory footprint, perhaps by
sacrificing performance? Our Jewel cluster OSD servers are maxed out on
memory, and with the added memory requirements I fear we may not be
able to upgrade to luminous/bluestore.


Check out this PR; it shows the settings that control the memory used for cache
and their defaults:

https://github.com/ceph/ceph/pull/16157


Hey guys, sorry for the late reply.  The gist of it is that memory is 
used in bluestore in a couple of different ways:


1) various internal buffers and such
2) bluestore specific cache (unencoded onodes, extents, etc)
3) rocksdb block cache
  3a) encoded data from bluestore
  3b) bloom filters and table indexes
4) other rocksdb memory/etc

Right now when you set the bluestore cache size it first favors rocksdb 
block cache up to 512MB and then starts favoring bluestore onode cache 
after that.  Even without bloom filters that seems to improve bluestore 
performance with small cache sizes.  With bloom filters it's likely even 
more important to feed whatever you can to rocksdb's block cache to keep 
the index and bloom filters in memory as much as possible.  It's unclear 
right now how quickly we should let the block cache grow as the number 
of objects increases.  Prior to using bloom filters it appeared that 
favoring the onode cache was better.  Now we probably want to favor both 
the bloom filters and bluestore's onode cache.


So the first order of business is to see how changing the bluestore 
cache size hurts you.  Bluestore's default behavior of favoring the 
rocksdb block cache (and specifically the bloom filters) first is 
probably still decent but you may want to play around with it if you 
expect a lot of small objects and limited memory.  For really low memory 
scenarios you could also try reducing the rocksdb buffer sizes, but 
smaller buffers are going to give you higher write-amp.  It's possible 
this PR may help though:


https://github.com/ceph/rocksdb/pull/19

You might be able to lower memory further with smaller PG/OSD maps, but 
at some point you start hitting diminishing returns.
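
As a concrete starting point, the cache knobs from the PR above can simply go
into ceph.conf.  This is a sketch with made-up numbers; check what your build
actually exposes (ceph-conf --show-config | grep bluestore_cache) before
copying anything:

[osd]
# total bluestore cache per HDD-backed OSD, in bytes
bluestore_cache_size_hdd = 1073741824
# fraction of that cache handed to the rocksdb block cache
bluestore_cache_kv_ratio = 0.5

The rocksdb buffer sizes mentioned above live in bluestore_rocksdb_options if
you want to go further, at the cost of higher write-amp as noted.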


Mark






kind regards
Ronny Aasen
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] BlueStore SSD

2017-08-14 Thread Mark Nelson

On 08/14/2017 12:52 PM, Ashley Merrick wrote:

Hello,


Hi Ashley!



We currently run 10x 4TB with 2x SSD for journals, are planning to move fully
to BlueStore, and are looking at adding extra servers.

With the removal of the double write on BlueStore, and from the testing so far
of BlueStore (having WAL & DB on SSD, seeing very minimal SSD use):

Does it make sense for further servers to go with 12x 4TB and get the
benefit of an extra 2 spinning disks per server, over what seems to be
the smaller benefit of having the WAL and DB on SSD?


Depends on your use case.  Small IOs under the min_alloc size (64k by 
default for HDDs) will still suffer the double write penalty.  You can 
tweak it to be smaller, but by decreasing it you increase the amount of 
metadata.  With bluestore you can put both the WAL (more or less the 
equivalent of the journal in FS) and the KeyValueDB store (including 
OMAP!) on flash.  Bluestore will automatically roll KV data over to the 
block disk when your flash DB partition fills up.
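
For anyone who wants to experiment with the min_alloc size trade-off mentioned
above: it is controlled per device type, has to be set before the OSD is
created, and smaller values mean more metadata per object.  A hedged sketch
with an arbitrary value, not a recommendation:

[osd]
# default is 65536 (64k) for HDD-backed bluestore OSDs
bluestore_min_alloc_size_hdd = 16384

I wouldn't change it without benchmarking your own workload first.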


For some workloads like small object RGW, having the DB and WAL on flash 
yields a pretty significant performance advantage in bluestore.  It has 
higher and significantly more stable average performance characteristics 
vs filestore for continuous small object write workloads.


For other workloads like large sequential reads/writes to RBD volumes, 
having flash journal and WAL likely won't help as much.  For small 
random writes to RBD however, you might still want flash for the WAL/DB.


Mark



Thanks,
Ashley
Sent from my iPhone


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Squeezing Performance of CEPH

2017-06-22 Thread Mark Nelson

Hello Massimiliano,

Based on the configuration below, it appears you have 8 SSDs total (2 
nodes with 4 SSDs each)?


I'm going to assume you have 3x replication and are using filestore, 
so in reality you are writing 3 copies and doing full data journaling 
for each copy, for 6x writes per client write.  Taking this into 
account, your per-SSD throughput should be somewhere around:


Sequential write:
~600 * 3 (copies) * 2 (journal write per copy) / 8 (ssds) = ~450MB/s

Sequential read
~3000 / 8 (ssds) = ~375MB/s

Random read
~3337 / 8 (ssds) = ~417MB/s

These numbers are pretty reasonable for SATA based SSDs, though the read 
throughput is a little low.  You didn't include the model of SSD, but if 
you look at Intel's DC S3700 which is a fairly popular SSD for ceph:


https://www.intel.com/content/www/us/en/solid-state-drives/ssd-dc-s3700-spec.html

Sequential read is up to ~500MB/s and Sequential write speeds up to 
460MB/s.  Not too far off from what you are seeing.  You might try 
playing with readahead on the OSD devices to see if that improves things 
at all.  Still, unless I've missed something these numbers aren't terrible.
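
For the readahead experiment, something like this per OSD data device is the
usual quick test.  Device names are placeholders, and the setting does not
persist across reboots unless you add a udev rule for it:

blockdev --getra /dev/sdX      # current readahead, in 512-byte sectors
blockdev --setra 8192 /dev/sdX # e.g. 8192 sectors = 4MB

Then rerun the sequential read bench and see whether it moves the needle.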


Mark

On 06/22/2017 12:19 PM, Massimiliano Cuttini wrote:

Hi everybody,

I want to squeeze all the performance of CEPH (we are using jewel 10.2.7).
We have set up a test environment with 2 nodes having the same
configuration:

  * CentOS 7.3
  * 24 CPUs (12 for real in hyper threading)
  * 32Gb of RAM
  * 2x 100Gbit/s ethernet cards
  * 2x OS dedicated in raid SSD Disks
  * 4x OSD SSD Disks SATA 6Gbit/s

We are already expecting the following bottlenecks:

  * [ SATA speed x n° disks ] = 24Gbit/s
  * [ Networks speed x n° bonded cards ] = 200Gbit/s

So the minimum between them is 24 Gbit/s per node (not taking into account
protocol overhead).

24Gbit/s per node x2 = 48Gbit/s of maximum hypothetical theoretical gross
speed.

Here are the tests:
///IPERF2/// Tests are quite good, scoring 88% of the bottleneck.
Note: iperf2 can use only 1 connection from a bond (it's a well-known issue).

[ ID] Interval   Transfer Bandwidth
[ 12]  0.0-10.0 sec  9.55 GBytes  8.21 Gbits/sec
[  3]  0.0-10.0 sec  10.3 GBytes  8.81 Gbits/sec
[  5]  0.0-10.0 sec  9.54 GBytes  8.19 Gbits/sec
[  7]  0.0-10.0 sec  9.52 GBytes  8.18 Gbits/sec
[  6]  0.0-10.0 sec  9.96 GBytes  8.56 Gbits/sec
[  8]  0.0-10.0 sec  12.1 GBytes  10.4 Gbits/sec
[  9]  0.0-10.0 sec  12.3 GBytes  10.6 Gbits/sec
[ 10]  0.0-10.0 sec  10.2 GBytes  8.80 Gbits/sec
[ 11]  0.0-10.0 sec  9.34 GBytes  8.02 Gbits/sec
[  4]  0.0-10.0 sec  10.3 GBytes  8.82 Gbits/sec
[SUM]  0.0-10.0 sec   103 GBytes  88.6 Gbits/sec

///RADOS BENCH

Taking into consideration the maximum hypothetical speed of 48Gbit/s
(due to the disk bottleneck), the tests are not good enough.

  * Average MB/s in write is almost 5-7Gbit/sec (12,5% of the mhs)
  * Average MB/s in seq read is almost 24Gbit/sec (50% of the mhs)
  * Average MB/s in random read is almost 27Gbit/se (56,25% of the mhs).

Here are the reports.
Write:

# rados bench -p scbench 10 write --no-cleanup
Total time run: 10.229369
Total writes made:  1538
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 601.406
Stddev Bandwidth:   357.012
Max bandwidth (MB/sec): 1080
Min bandwidth (MB/sec): 204
Average IOPS:   150
Stddev IOPS:89
Max IOPS:   270
Min IOPS:   51
Average Latency(s): 0.106218
Stddev Latency(s):  0.198735
Max latency(s): 1.87401
Min latency(s): 0.0225438

sequential read:

# rados bench -p scbench 10 seq
Total time run:   2.054359
Total reads made: 1538
Read size:4194304
Object size:  4194304
Bandwidth (MB/sec):   2994.61
Average IOPS  748
Stddev IOPS:  67
Max IOPS: 802
Min IOPS: 707
Average Latency(s):   0.0202177
Max latency(s):   0.223319
Min latency(s):   0.00589238

random read:

# rados bench -p scbench 10 rand
Total time run:   10.036816
Total reads made: 8375
Read size:4194304
Object size:  4194304
Bandwidth (MB/sec):   3337.71
Average IOPS: 834
Stddev IOPS:  78
Max IOPS: 927
Min IOPS: 741
Average Latency(s):   0.0182707
Max latency(s):   0.257397
Min latency(s):   0.00469212

//

It seems like there is a bottleneck somewhere that we are
underestimating.
Can you help me find it?






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com

Re: [ceph-users] osd_op_tp timeouts

2017-06-13 Thread Mark Nelson

Hi Tyler,

I wanted to make sure you got a reply to this, but unfortunately I don't 
have much to give you.  It sounds like you already took a look at the 
disk metrics and ceph is probably not waiting on disk IO based on your 
description.  If you can easily invoke the problem, you could attach gdb 
to the OSD and do a "thread apply all bt" to see what the threads are 
doing when it's timing out.  Also, please open a tracker ticket if one 
doesn't already exist so we can make sure we get it recorded in case 
other people see the same thing.
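
For reference, grabbing the backtraces is usually just something like the line
below, run while the heartbeat warnings are firing.  The PID is a placeholder
and you will want the ceph debuginfo packages installed so the traces are
readable:

gdb -p <osd-pid> -batch -ex 'set pagination off' -ex 'thread apply all bt' > osd_tp_timeout.bt 2>&1

Attach the resulting file to the tracker ticket along with the OSD log.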


Mark

On 06/12/2017 06:12 PM, Tyler Bischel wrote:

Hi,
  We've been having this ongoing problem with threads timing out on the
OSDs.  Typically we'll see the OSD become unresponsive for about a
minute, as threads from other OSDs time out.  The timeouts don't seem to
be correlated to high load.  We turned up the logs to 10/10 for part of
a day to catch some of these in progress, and saw the pattern below in
the logs several times (grepping for individual threads involved in the
time outs).

We are using Jewel 10.2.7.

*Logs:*

2017-06-12 18:45:12.530698 7f82ebfa8700 10 osd.30 pg_epoch: 5484
pg[10.6d2( v 5484'12967030 (5469'12963946,5484'12967030] local-les=5476
n=419 ec=593 les/c/f 5476/5476/0 5474/5475/5455) [27,16,30] r=2 lpr=5475
pi=4780-5474/109 luod=0'0 lua=5484'12967019 crt=5484'12967027 lcod
5484'12967028 active] add_log_entry 5484'12967030 (0'0) modify
10:4b771c01:::0b405695-e5a7-467f-bb88-37ce8153f1ef.1270728618.3834_filter0634p1mdw1-11203-593EE138-2E:head
by client.1274027169.0:3107075054 2017-06-12 18:45:12.523899

2017-06-12 18:45:12.530718 7f82ebfa8700 10 osd.30 pg_epoch: 5484
pg[10.6d2( v 5484'12967030 (5469'12963946,5484'12967030] local-les=5476
n=419 ec=593 les/c/f 5476/5476/0 5474/5475/5455) [27,16,30] r=2 lpr=5475
pi=4780-5474/109 luod=0'0 lua=5484'12967019 crt=5484'12967028 lcod
5484'12967028 active] append_log: trimming to 5484'12967028 entries
5484'12967028 (5484'12967026) delete
10:4b796a74:::0b405695-e5a7-467f-bb88-37ce8153f1ef.1270728618.3834_filter0469p1mdw1-21390-593EE137-57:head
by client.1274027164.0:3183456083 2017-06-12 18:45:12.491741

2017-06-12 18:45:12.530754 7f82ebfa8700  5 write_log with: dirty_to:
0'0, dirty_from: 4294967295'18446744073709551615,
dirty_divergent_priors: false, divergent_priors: 0, writeout_from:
5484'12967030, trimmed:

2017-06-12 18:45:28.171843 7f82dc503700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.171877 7f82dc402700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.174900 7f82d8887700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.174979 7f82d8786700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.248499 7f82df05e700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.248651 7f82df967700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.261044 7f82d8483700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15


*Metrics:*

OSD Disk IO Wait spikes from 2ms to 1s, CPU Procs Blocked spikes from 0
to 16, IO In progress spikes from 0 to hundreds, IO Time Weighted, IO
Time spike.  Average Queue Size on the device spikes.  One minute later,
Write Time, Reads, and Read Time spike briefly.

Any thoughts on what may be causing this behavior?

--Tyler



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lumionous: bluestore 'tp_osd_tp thread tp_osd_tp' had timed out after 60

2017-06-08 Thread Mark Nelson

Hi Jayaram,

Thanks for creating a tracker entry! Any chance you could add a note 
about how you are generating the 200MB/s client workload?  I've not seen 
this problem in the lab, but any details you could give that would help 
us reproduce the problem would be much appreciated!


Mark

On 06/08/2017 06:08 AM, nokia ceph wrote:

Hello Mark,

Raised tracker for the issue  -- http://tracker.ceph.com/issues/20222

Jake, can you share the restart_OSD_and_log-this.sh script?

Thanks
Jayaram

On Wed, Jun 7, 2017 at 9:40 PM, Jake Grimmett <j...@mrc-lmb.cam.ac.uk> wrote:

Hi Mark & List,

Unfortunately, even when using yesterdays master version of ceph,
I'm still seeing OSDs go down, same error as before:

OSD log shows lots of entries like this:

(osd38)
2017-06-07 16:48:46.070564 7f90b58c3700  1 heartbeat_map is_healthy
'tp_osd_tp thread tp_osd_tp' had timed out after 60

(osd3)
2017-06-07 17:01:25.391075 7f62de6c3700  1 heartbeat_map is_healthy
'tp_osd_tp thread tp_osd_tp' had timed out after 60
2017-06-07 17:01:26.276881 7f62dbe86700 -1 osd.3 6165 heartbeat_check:
no reply from 10.1.0.86:6811 osd.2 since
back 2017-06-07 17:00:19.640002
front 2017-06-07 17:01:21.950160 (cutoff 2017-06-07 17:01:06.276881)


[root@ceph4 ceph]# ceph -v
ceph version 12.0.2-2399-ge38ca14
(e38ca14914340d65ea8001c7bd6e0ff769f3eb2e) luminous (dev)


I'll continue running the cluster with my "restart_OSD_and_log-this.sh"
workaround...

thanks again for your help,

Jake

On 06/06/17 15:52, Jake Grimmett wrote:
> Hi Mark,
>
> OK, I'll upgrade to the current master and retest...
>
> best,
    >
    > Jake
>
> On 06/06/17 15:46, Mark Nelson wrote:
>> Hi Jake,
>>
>> I just happened to notice this was on 12.0.3.  Would it be
possible to
>> test this out with current master and see if it still is a problem?
>>
>> Mark
>>
>> On 06/06/2017 09:10 AM, Mark Nelson wrote:
>>> Hi Jake,
>>>
>>> Thanks much.  I'm guessing at this point this is probably a
bug.  Would
>>> you (or nokiauser) mind creating a bug in the tracker with a short
>>> description of what's going on and the collectl sample showing
this is
>>> not IOs backing up on the disk?
>>>
>>> If you want to try it, we have a gdb based wallclock profiler
that might
>>> be interesting to run while it's in the process of timing out.
It tries
>>> to grab 2000 samples from the osd process which typically takes
about 10
>>> minutes or so.  You'll need to either change the number of
samples to be
>>> lower in the python code (maybe like 50-100), or change the
timeout to
>>> be something longer.
>>>
>>> You can find the code here:
>>>
>>> https://github.com/markhpc/gdbprof
>>>
>>> and invoke it like:
>>>
>>> sudo gdb -ex 'set pagination off' -ex 'attach 27962' -ex 'source
>>> ./gdbprof.py' -ex 'profile begin' -ex 'quit'
>>>
>>> where 27962 in this case is the PID of the ceph-osd process.  You'll
>>> need gdb with the python bindings and the ceph debug symbols for
it to
>>> work.
>>>
>>> This might tell us over time if the tp_osd_tp processes are just
sitting
>>> on pg::locks.
>>>
>>> Mark
>>>
>>> On 06/06/2017 05:34 AM, Jake Grimmett wrote:
>>>> Hi Mark,
>>>>
>>>> Thanks again for looking into this problem.
>>>>
>>>> I ran the cluster overnight, with a script checking for dead
OSDs every
>>>> second, and restarting them.
>>>>
>>>> 40 OSD failures occurred in 12 hours, some OSDs failed multiple
times,
>>>> (there are 50 OSDs in the EC tier).
>>>>
>>>> Unfortunately, the output of collectl doesn't appear to show any
>>>> increase in disk queue depth and service times before the OSDs die.
>>>>
>>>> I've put a couple of examples of collectl output for the disks
>>>> associated with the OSDs here:
>>>>
>>>> https://hastebin.com/icuvotemot.scala
>>>>
>>>> please let me know if you need more info...
>>>>
>>>> best regards,
>>>>
>>>> Jake
>>>>
>>>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

