Re: [ceph-users] Observation of bluestore db/wal performance

2019-07-23 Thread Виталий Филиппов
Bluestore's deferred write queue doesn't act like Filestore's journal because 
(a) it's very small (64 requests) and (b) it has no background flush thread. 
Bluestore basically refuses to accept writes faster than the HDD can do them 
_on average_. With Filestore you can get 1000-2000 write iops until the 
journal fills up; after that, performance drops to 30-50 iops with very 
unstable latency. With Bluestore you only get 100-300 iops, but those 
100-300 iops are always stable :-)
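
For reference, those knobs can be read off a running OSD's admin socket. A
rough Python sketch (osd.0 is just a placeholder; run it on the OSD's host):

#!/usr/bin/env python
# Sketch: dump the BlueStore deferred-write settings from a running OSD
# via its admin socket.
import json
import subprocess

OPTIONS = [
    "bluestore_prefer_deferred_size_hdd",  # writes at or below this size take the deferred path
    "bluestore_prefer_deferred_size_ssd",
    "bluestore_deferred_batch_ops_hdd",    # how many deferred ops are batched per flush
    "bluestore_deferred_batch_ops_ssd",
]

def get_option(osd, option):
    # "ceph daemon <osd> config get <option>" prints {"<option>": "<value>"}
    out = subprocess.check_output(["ceph", "daemon", osd, "config", "get", option])
    return json.loads(out)[option]

if __name__ == "__main__":
    for opt in OPTIONS:
        print("%-40s %s" % (opt, get_option("osd.0", opt)))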

I'd recommend bcache. It should perform much better than ceph's tiering.
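
In case it helps, the overall shape of a bcache setup for one OSD looks
roughly like this (a sketch only; the device paths are placeholders, and the
OSD is then deployed on /dev/bcache0 rather than the raw HDD):

#!/usr/bin/env python
# Sketch: put an HDD OSD behind bcache in writeback mode.
import subprocess

HDD = "/dev/sdb"          # backing device (the disk the OSD will live on)
SSD = "/dev/nvme0n1p1"    # cache device (SSD/NVMe partition)

# Formatting both devices in one make-bcache invocation also attaches the
# backing device to the new cache set, so no manual attach step is needed.
subprocess.check_call(["make-bcache", "-B", HDD, "-C", SSD])

# bcache defaults to writethrough, which only helps reads; switch the new
# bcache device to writeback so small writes land on the SSD first.
with open("/sys/block/bcache0/bcache/cache_mode", "w") as f:
    f.write("writeback")

# The OSD is then created on /dev/bcache0 instead of the raw HDD.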
-- 
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Observation of bluestore db/wal performance

2019-07-21 Thread Anthony D'Atri
This may be somewhat controversial, so I’ll try to tread lightly.

Might we infer that your OSDs are on spinners?  And at 500 GB average, is it 
fair to guess that both the drives and the servers are old?  Please share 
hardware details and OS.

Having suffered an “enterprise” dogfood deployment in which I had to try to 
support thousands of RBD clients on spinners with colocated journals (and a 
serious design flaw that some of you are familiar with), my knee-jerk reaction 
is that spinners are antithetical to “heavy use of block storage”.  I 
understand, though, that in an education setting you may not have a choice.

How highly utilized are your OSD drives?  Depending on your workload you 
*might* benefit from more PGs.  But since you describe your OSDs as averaging 
500 GB, I have to ask: do their sizes vary considerably?  If so, the larger 
OSDs will hold more PGs (and thus receive more of the workload) than the 
smaller ones.  “ceph osd df” will show the number of PGs on each.  If you do 
have a significant disparity in drive sizes, careful enabling and tweaking of 
primary affinity can have measurable results for read performance.
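
For example, a quick way to eyeball the size/PG disparity per OSD (a sketch;
the field names are what “ceph osd df -f json” emits on Luminous, as far as I
recall):

#!/usr/bin/env python
# Sketch: list each OSD's size, PG count and fill so disparities stand out.
import json
import subprocess

# "ceph osd df -f json" returns {"nodes": [...]} where each node carries the
# OSD's name, raw size in KB, utilization and current PG count.
out = subprocess.check_output(["ceph", "osd", "df", "-f", "json"])
nodes = json.loads(out)["nodes"]

for n in sorted(nodes, key=lambda n: n["kb"]):
    size_gb = n["kb"] / 1024.0 / 1024.0
    print("%-10s %7.0f GB  %4d PGs  %5.1f%% full"
          % (n["name"], size_gb, n["pgs"], n["utilization"]))

# If the big OSDs carry proportionally more PGs, lowering their primary
# affinity shifts read traffic toward the smaller ones, for example:
#   ceph osd primary-affinity osd.12 0.5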

Is the number of PGs a power of 2?  If not, some of your PGs will be much 
larger than others.  Is OSD fillage reasonably well balanced?  If 
“ceph osd df” shows a wide variance, that can also hamper performance, since 
the workload will not be spread evenly.
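
A quick way to check all pools at once (sketch):

#!/usr/bin/env python
# Sketch: check whether each pool's pg_num is a power of two.
import json
import subprocess

# "ceph osd pool ls -f json" returns a JSON list of pool names.
pools = json.loads(subprocess.check_output(
    ["ceph", "osd", "pool", "ls", "-f", "json"]))

for pool in pools:
    out = subprocess.check_output(
        ["ceph", "osd", "pool", "get", pool, "pg_num", "-f", "json"])
    pg_num = json.loads(out)["pg_num"]
    ok = pg_num > 0 and (pg_num & (pg_num - 1)) == 0
    print("%-24s pg_num=%-5d %s"
          % (pool, pg_num, "power of 2" if ok else "NOT a power of 2"))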

With all due respect to those who have tighter constraints than I enjoy in my 
current corporate setting, heavy RBD usage on spinners can be Sisyphean.  
Granted, I’ve never run with a cache tier myself, or with separate WAL/DB 
devices.  In a corporate setting the additional cost of SSD OSDs can easily be 
justified by reduced administrative hassle and improved user experience.  If 
that isn’t an option for you anytime soon, then by all means I’d stick with 
the cache tier, and maybe with Luminous indefinitely.




Re: [ceph-users] Observation of bluestore db/wal performance

2019-07-21 Thread Mark Nelson
FWIW, the DB and WAL don't really do the same thing that the cache tier 
does.  The WAL is similar to filestore's journal, and the DB is 
primarily for storing metadata (onodes, blobs, extents, and OMAP data).  
Offloading these things to an SSD will definitely help, but you won't 
see the same kind of behavior that you would see with cache tiering 
(especially if the workload is small enough to fit entirely in the cache 
tier).
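
For reference, putting the DB (and with it the WAL) on an SSD at creation time
looks roughly like this with ceph-volume; the device paths are placeholders,
and if no separate --block.wal is given the WAL simply lives on the block.db
device:

#!/usr/bin/env python
# Sketch: create a BlueStore OSD with its RocksDB metadata (and WAL) on SSD.
import subprocess

DATA = "/dev/sdb"         # HDD that holds the object data
DB = "/dev/nvme0n1p2"     # SSD/NVMe partition for RocksDB metadata (and WAL)

# With only --block.db specified, BlueStore places the WAL on the same
# (faster) DB device; --block.wal is only needed for a third, faster device.
subprocess.check_call([
    "ceph-volume", "lvm", "create",
    "--bluestore",
    "--data", DATA,
    "--block.db", DB,
])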



IMHO the biggest performance consideration with cache tiering is when 
your workload doesn't fit entirely in the cache and you are evicting 
large quantities of data over the network.  Depending on a variety of 
factors this can be pretty slow (and in fact can be slower than not 
using a cache tier at all!).  If your workload fits entirely within the 
cache tier though, it's almost certainly going to be faster than 
bluestore without a cache tier.
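
That also makes the sizing and eviction knobs on the cache pool important. A
sketch of the usual ones (the pool name and values are purely illustrative;
tune them to your SSD capacity and workload):

#!/usr/bin/env python
# Sketch: set the main cache-tier sizing/eviction knobs on the cache pool.
import subprocess

CACHE_POOL = "hot-cache"   # placeholder name for the cache pool

settings = [
    ("hit_set_type", "bloom"),                   # track object hotness with bloom filters
    ("hit_set_count", "4"),
    ("hit_set_period", "1200"),                  # seconds covered by each hit set
    ("target_max_bytes", str(400 * 1024 ** 3)),  # flush/evict well below raw SSD capacity
    ("cache_target_dirty_ratio", "0.4"),         # start flushing dirty objects here
    ("cache_target_full_ratio", "0.8"),          # start evicting clean objects here
]

for key, value in settings:
    subprocess.check_call(["ceph", "osd", "pool", "set", CACHE_POOL, key, value])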



Mark


On 7/21/19 9:39 AM, Shawn Iverson wrote:
Just wanted to post an observation here.  Perhaps someone with 
resources to perform some performance tests is interested in comparing 
or has some insight into why I observed this.


Background:

12 node ceph cluster
3-way replicated by chassis group
3 chassis groups
4 nodes per chassis
running Luminous (up to date)
heavy use of block storage for kvm virtual machines (proxmox)
some cephfs usage (<10%)
~100 OSDs
~100 pgs/osd
500GB average OSD capacity

I recently attempted to do away with my ssd cache tier on Luminous and 
replace it with bluestore with db/wal on ssd as this seemed to be a 
better practice, or so I thought.


Sadly, after 2 weeks of rebuilding OSDs and placing the db/wal on 
ssd, I was sorely disappointed with the performance. My cluster performed 
poorly.  It seemed that the db/wal on ssd did not boost performance the 
way I was used to.  I used 60 GB for the size.  Unfortunately, I did 
not have enough ssd capacity to make it any larger for my OSDs.


Despite the words of caution in the Ceph docs regarding a replicated 
base tier with a replicated cache tier, I returned to cache tiering.


Performance has returned to expectations.

It would be interesting if someone had the spare iron and resources to 
benchmark bluestore OSDs with SSD db/wal against cache tiering and 
provide some statistics.


--
Shawn Iverson, CETL
Director of Technology
Rush County Schools
765-932-3901 option 7
ivers...@rushville.k12.in.us 



[ceph-users] Observation of bluestore db/wal performance

2019-07-21 Thread Shawn Iverson
Just wanted to post an observation here.  Perhaps someone with resources to
perform some performance tests is interested in comparing or has some
insight into why I observed this.

Background:

12 node ceph cluster
3-way replicated by chassis group
3 chassis groups
4 nodes per chassis
running Luminous (up to date)
heavy use of block storage for kvm virtual machines (proxmox)
some cephfs usage (<10%)
~100 OSDs
~100 pgs/osd
500GB average OSD capacity

I recently attempted to do away with my ssd cache tier on Luminous and
replace it with bluestore with db/wal on ssd as this seemed to be a better
practice, or so I thought.

Sadly, after 2 weeks of rebuilding OSDs and placing the db/wal on ssd, I
was sorely disappointed with the performance.  My cluster performed poorly.  It
seemed that the db/wal on ssd did not boost performance the way I was used
to.  I used 60 GB for the size.  Unfortunately, I did not have enough
ssd capacity to make it any larger for my OSDs.
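
For anyone trying to reproduce this, a quick way to check whether a 60 GB DB
is spilling metadata over onto the HDDs (a sketch; osd.0 is a placeholder, and
the bluefs counter names are from memory):

#!/usr/bin/env python
# Sketch: check whether an OSD's RocksDB has spilled from the SSD DB device
# onto the slow (HDD) device.
import json
import subprocess

# "ceph daemon osd.N perf dump" includes a "bluefs" section with byte counters
# for the DB device and for anything that has spilled to the slow device.
out = subprocess.check_output(["ceph", "daemon", "osd.0", "perf", "dump"])
bluefs = json.loads(out)["bluefs"]

gib = 1024.0 ** 3
print("DB device:      %.1f of %.1f GiB used"
      % (bluefs["db_used_bytes"] / gib, bluefs["db_total_bytes"] / gib))
print("Spilled to HDD: %.1f GiB" % (bluefs["slow_used_bytes"] / gib))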

Despite the words of caution in the Ceph docs regarding a replicated base
tier with a replicated cache tier, I returned to cache tiering.

Performance has returned to expectations.

It would be interesting if someone had the spare iron and resources to
benchmark bluestore OSDs with SSD db/wal against cache tiering and provide
some statistics.

-- 
Shawn Iverson, CETL
Director of Technology
Rush County Schools
765-932-3901 option 7
ivers...@rushville.k12.in.us
