On 05/02/2017 01:32 AM, Frédéric Nass wrote:


On 28/04/2017 at 17:03, Mark Nelson wrote:
On 04/28/2017 08:23 AM, Frédéric Nass wrote:

On 28/04/2017 at 15:19, Frédéric Nass wrote:

Hi Florian, Wido,

That's interesting. I ran some bluestore benchmarks a few weeks ago on
Luminous dev (1st release) and came to the same (early) conclusion
regarding the performance drop with many small objects on bluestore,
regardless of the number of PGs in the pool. Here is the graph I
generated from the results:



The test was run on a 36-OSD cluster (3x R730xd with 12x 4TB SAS
drives each) with rocksdb and the WAL on the same SAS drives.
The test consisted of multiple runs of the following command on a
size-1 pool: rados bench -p pool-test-mom02h06-2 120 write -b 4K -t 128
--no-cleanup

Correction: the test was run on a size-1 pool hosted on a single
12-OSD node. The rados bench was run from that same host (to that same
host).

Frédéric.

If you happen to have time, I would be very interested to see what the
compaction statistics look like in rocksdb (available via the OSD
logs).  We actually wrote a tool (in the cbt tools directory) that can
parse the data and look at what rocksdb is doing.  Here's some of the
data we collected last fall:

https://drive.google.com/open?id=0B2gTBZrkrnpZRFdiYjFRNmxLblU
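
If you want to pull the raw numbers out yourself, something along these
lines should do it. I'm assuming here that the rocksdb stats dump ends up
in the OSD log once debug_rocksdb is turned up; the paths and OSD IDs are
just examples:

  # ceph.conf on the OSD nodes, then restart the OSDs
  [osd]
      debug rocksdb = 4/5

  # rocksdb periodically dumps a "Compaction Stats" table to its info log,
  # which bluestore routes into the OSD log; pull those sections out with:
  grep -A 20 "Compaction Stats" /var/log/ceph/ceph-osd.0.log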

The idea there was to try to determine how the WAL buffer size/count and
min_alloc size affected the amount of compaction work that rocksdb was
doing.  There are also some more general, more human-readable compaction
statistics in the logs that are worth looking at (i.e. things like write
amplification and such).
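
For reference, the WAL buffer knobs are rocksdb's own options, passed
through bluestore_rocksdb_options; the values below are only placeholders
to show the syntax, not recommendations:

  [osd]
      # size of each rocksdb memtable/WAL buffer, and how many to allow;
      # note this string replaces the default bluestore_rocksdb_options
      # entirely, so in practice you would append to the default rather
      # than set only these two
      bluestore_rocksdb_options = write_buffer_size=33554432,max_write_buffer_number=4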

The gist of it is that as you do lots of small writes, the amount of
metadata rocksdb has to keep track of increases, and rocksdb ends up
doing a *lot* of compaction work, with the associated read and write
amplification.  The only ways to really deal with this are to either
reduce the amount of metadata (onodes, extents, etc.) or see if we can
find ways to reduce the amount of work rocksdb has to do.

On the first point, increasing the min_alloc size in bluestore tends
to help, but with tradeoffs.  Any I/O smaller than the min_alloc size
gets double-written, like with filestore journals, so you trade
reduced metadata for an extra WAL write.  We did a bunch of testing
last fall, and at least on NVMe it was better to use a 16k min_alloc
size and eat the WAL write than to use a 4K min_alloc size, skip the
WAL write, and shove more metadata at rocksdb.  For HDDs, I wouldn't
expect behavior to be too bad with the default 64k min_alloc size, but
it sounds like it could be a problem based on your results.  That's why
it would be interesting to see if that's what's happening during your
tests.
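
If you want to experiment with that, the relevant options (assuming I have
the current Luminous-era names right) are below; note that the min_alloc
size is baked in at OSD creation (mkfs) time, so changing it only affects
newly created OSDs.  The values just mirror what's described above:

  [osd]
      # used when the data device is rotational; 64k is the default
      bluestore_min_alloc_size_hdd = 65536
      # used for flash; 16k is what worked best for us on NVMe
      bluestore_min_alloc_size_ssd = 16384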

Another issue is that short-lived WAL writes can potentially leak into
level 0 and cause additional compaction work.  Sage has a pretty clever
idea to fix this, but we need someone knowledgeable about rocksdb to go
in and try to implement it (or something like it).

Anyway, we still see a significant amount of work being done by
rocksdb due to compaction, most of it being random reads.  We actually
spoke about this quite a bit yesterday at the performance meeting.  If
you look at a wallclock profile of 4K random writes, you'll see a ton
of work being done on compaction (about 70% in total of thread 2):

https://paste.fedoraproject.org/paste/uS3LHRHw2Yma0iUYSkgKOl5M1UNdIGYhyRLivL9gydE=
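
(If anyone wants to grab a similar wallclock profile of an OSD themselves,
a crude poor-man's version is just to sample gdb backtraces in a loop,
something like the sketch below.  It briefly pauses the OSD each time gdb
attaches, so don't run it against anything latency-sensitive in production,
and it assumes a single ceph-osd process on the host:)

  # sample all OSD thread stacks every half second for ~50 seconds,
  # then look at which functions dominate the collected backtraces
  for i in $(seq 1 100); do
      gdb -p $(pidof ceph-osd) -batch -ex "thread apply all bt" >> osd-wallclock.txt
      sleep 0.5
  done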


One thing we are still confused about is why rocksdb is doing
random_reads for compaction rather than sequential reads.  It would be
really great if someone that knows rocksdb well could help us
understand why it's doing this.

Ultimately, for something like RBD I suspect the performance will stop
dropping once you've completely filled the disk with 4k random writes.
For RGW-type workloads, the more tiny objects you add, the more data
rocksdb has to keep track of and the more rocksdb is going to slow down.
It's not the same problem filestore suffers from, but it's similar: the
more keys/bytes/levels rocksdb has to deal with, the more data gets moved
around between levels, the more background work happens, and the more
likely we are to be waiting on rocksdb before we can write more data.

Mark


Hi Mark,

This is very interesting. I actually did use "bluefs buffered io = true"
and "bluestore compression mode = aggressive" during the tests, as I saw
these 2 options improved write performance (roughly 4x), but I didn't look
at the logs for compaction statistics. The nodes I used for the tests have
since gone into production, so I won't be able to reproduce the test any
time soon, but I will when we get new hardware.
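
(In case it's useful to anyone, this is roughly how those two options go
into ceph.conf; I believe they can also be injected at runtime, though I'm
not certain both take effect without an OSD restart:)

  [osd]
      bluefs buffered io = true
      bluestore compression mode = aggressive

  # or at runtime:
  ceph tell osd.* injectargs '--bluefs_buffered_io=true --bluestore_compression_mode=aggressive'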

Frederic.


FWIW, I spent some time yesterday digging into rocksdb's compaction code. It looks like every time compaction is done, the iterator ends up walking through doing "random reads" that ultimately work their way down to bluefs, but in reality it's seeking almost entirely sequentially as it walks through the SST. In the blktrace I did, the behavior looks like:

259,24   4    78185    38.709165799     0  C  RS 21658424 + 8 [0]
259,24   4    78192    38.709230052     0  C  RS 21658424 + 16 [0]
259,24   4    78199    38.709311168     0  C  RS 21658432 + 16 [0]
259,24   4    78206    38.709392489     0  C  RS 21658440 + 16 [0]
259,24   4    78213    38.709492782     0  C  RS 21658448 + 16 [0]
259,24   4    78220    38.709578648     0  C  RS 21658456 + 16 [0]
259,24   4    78227    38.709685765     0  C  RS 21658464 + 16 [0]

First there is an 8-sector (i.e. 4k) read, followed by lots of 8k sequential reads whose offsets advance by 4k. This is what the compaction is spending almost all of its time doing. One thing I noticed is that before this read takes place, the block cache and compressed block cache are checked to see if we can do the read there. I tried pumping the block cache up from the default 128MB in ceph to 8GB, which is the same size as my DB partition. It didn't appear to affect the cache hit rate at all. For some reason it doesn't look like these reads are ever hitting the cache. That's what I'm going to be looking into today.
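
(For anyone who wants to look at the same thing on their own hardware: the
trace above is standard blktrace/blkparse output, so something like the
following, run while the benchmark is going, will reproduce it.  The device
name is just an example; point it at whatever backs your rocksdb/BlueFS
data:)

  # trace the block device and pretty-print the events; the "C RS" lines
  # above are completions of synchronous reads
  blktrace -d /dev/nvme0n1 -o - | blkparse -i - > osd-blktrace.txt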

Mark
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
