On 04/30/2018 10:25 PM, Gregory Farnum wrote:
> 
> 
> On Thu, Apr 26, 2018 at 11:36 AM Wido den Hollander <w...@42on.com> wrote:
> 
>     Hi,
> 
>     I've been investigating the per object overhead for BlueStore as I've
>     seen this has become a topic for a lot of people who want to store a lot
>     of small objects in Ceph using BlueStore.
> 
>     I've written a piece of Python code which can be run on a server
>     running OSDs and will print the per-object overhead.
> 
>     https://gist.github.com/wido/b1328dd45aae07c45cb8075a24de9f1f
> 
>     Feedback on this script is welcome, and so are the numbers other
>     people are observing.
> 
>     The results from my tests are below, but what I see is that the overhead
>     seems to range from 10kB to 30kB per object.
> 
>     On RBD-only clusters the overhead seems to be around 11kB, but on
>     clusters with an RGW workload it rises to about 20kB.
> 
> 
> This difference seems implausible, as RGW always writes full objects,
> whereas RBD will frequently write pieces of them and do overwrites.
> I'm not sure what all knobs are available and which diagnostics
> BlueStore exports, but is it possible you're looking at the total
> RocksDB data store rather than the per-object overhead? The distinction
> here being that the RocksDB instance will also store "client" (ie, RGW)
> omap data and xattrs, in addition to the actual BlueStore onodes.

Yes, that is possible. But in the end the number of onodes equals the
number of objects you store, and then you want to know how many bytes
the RocksDB database uses per object.
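
For reference, the per-OSD calculation boils down to something like the
sketch below (not the exact script; it assumes the bluestore_onodes and
db_used_bytes counters exposed by 'ceph daemon osd.<id> perf dump',
whose names can differ between releases):

import json
import subprocess

def osd_overhead(osd_id):
    # Ask the OSD admin socket for its performance counters.
    out = subprocess.check_output(
        ['ceph', 'daemon', 'osd.%d' % osd_id, 'perf', 'dump'])
    perf = json.loads(out)
    onodes = perf['bluestore']['bluestore_onodes']  # objects on this OSD
    db_used = perf['bluefs']['db_used_bytes']       # RocksDB space in use
    return db_used // onodes if onodes else 0       # DB bytes per object

print('osd.0: %d bytes of DB per object' % osd_overhead(0))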

I do agree that RGW doesn't do partial writes and has more metadata, but
all of that eventually has to be stored.

We just need to come up with some good numbers on how to size the DB.

Currently I assume a ratio of 10GB of DB per 1TB of data and that is
working out, but with people wanting to use 12TB disks we need to pin
those numbers down even further. Otherwise you will need a lot of SSD
space if you want to keep the DB on SSD.
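
As a back-of-the-envelope illustration (assumed numbers only: the ~20kB
overhead seen on the RGW clusters above and the 50kB average object size
from the use case quoted below), a completely filled 12TB disk needs far
more DB space than the 10GB:1TB rule provides:

KB, GB, TB = 10 ** 3, 10 ** 9, 10 ** 12

disk_size = 12 * TB           # data device size
avg_obj_size = 50 * KB        # average object size
overhead_per_obj = 20 * KB    # ~20kB DB overhead seen on RGW clusters

objects = disk_size // avg_obj_size        # objects that fit on the disk
db_needed = objects * overhead_per_obj     # DB space those objects need
db_rule_of_thumb = (disk_size // TB) * 10 * GB

print('objects on a full disk : %d' % objects)                      # 240 million
print('DB needed (per object) : %.1f TB' % (db_needed / TB))        # 4.8 TB
print('DB from 10GB:1TB rule  : %.0f GB' % (db_rule_of_thumb / GB)) # 120 GB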

Wido

> -Greg
>  
> 
> 
>     I know that partial overwrites and appends contribute to higher overhead
>     on objects and I'm trying to investigate this and share my information
>     with the community.
> 
>     I have two use cases that want to store >2 billion objects with an
>     avg object size of 50kB (8 - 80kB), and the RocksDB overhead is
>     likely to become a big problem.
> 
>     Is anybody willing to share the overhead they are seeing, and for
>     which use case?
> 
>     The more data we have on this the better we can estimate how DBs need to
>     be sized for BlueStore deployments.
> 
>     Wido
> 
>     # Cluster #1
>     osd.25 onodes=178572 db_used_bytes=2188378112 avg_obj_size=6196529
>     overhead=12254
>     osd.20 onodes=209871 db_used_bytes=2307915776 avg_obj_size=5452002
>     overhead=10996
>     osd.10 onodes=195502 db_used_bytes=2395996160 avg_obj_size=6013645
>     overhead=12255
>     osd.30 onodes=186172 db_used_bytes=2393899008 avg_obj_size=6359453
>     overhead=12858
>     osd.1 onodes=169911 db_used_bytes=1799356416 avg_obj_size=4890883
>     overhead=10589
>     osd.0 onodes=199658 db_used_bytes=2028994560 avg_obj_size=4835928
>     overhead=10162
>     osd.15 onodes=204015 db_used_bytes=2384461824 avg_obj_size=5722715
>     overhead=11687
> 
>     # Cluster #2
>     osd.1 onodes=221735 db_used_bytes=2773483520 avg_obj_size=5742992
>     overhead_per_obj=12508
>     osd.0 onodes=196817 db_used_bytes=2651848704 avg_obj_size=6454248
>     overhead_per_obj=13473
>     osd.3 onodes=212401 db_used_bytes=2745171968 avg_obj_size=6004150
>     overhead_per_obj=12924
>     osd.2 onodes=185757 db_used_bytes=3567255552 avg_obj_size=5359974
>     overhead_per_obj=19203
>     osd.5 onodes=198822 db_used_bytes=3033530368 avg_obj_size=6765679
>     overhead_per_obj=15257
>     osd.4 onodes=161142 db_used_bytes=2136997888 avg_obj_size=6377323
>     overhead_per_obj=13261
>     osd.7 onodes=158951 db_used_bytes=1836056576 avg_obj_size=5247527
>     overhead_per_obj=11551
>     osd.6 onodes=178874 db_used_bytes=2542796800 avg_obj_size=6539688
>     overhead_per_obj=14215
>     osd.9 onodes=195166 db_used_bytes=2538602496 avg_obj_size=6237672
>     overhead_per_obj=13007
>     osd.8 onodes=203946 db_used_bytes=3279945728 avg_obj_size=6523555
>     overhead_per_obj=16082
> 
>     # Cluster 3
>     osd.133 onodes=68558 db_used_bytes=15868100608 avg_obj_size=14743206
>     overhead_per_obj=231455
>     osd.132 onodes=60164 db_used_bytes=13911457792 avg_obj_size=14539445
>     overhead_per_obj=231225
>     osd.137 onodes=62259 db_used_bytes=15597568000 avg_obj_size=15138484
>     overhead_per_obj=250527
>     osd.136 onodes=70361 db_used_bytes=14540603392 avg_obj_size=13729154
>     overhead_per_obj=206657
>     osd.135 onodes=68003 db_used_bytes=12285116416 avg_obj_size=12877744
>     overhead_per_obj=180655
>     osd.134 onodes=64962 db_used_bytes=14056161280 avg_obj_size=15923550
>     overhead_per_obj=216375
>     osd.139 onodes=68016 db_used_bytes=20782776320 avg_obj_size=13619345
>     overhead_per_obj=305557
>     osd.138 onodes=66209 db_used_bytes=12850298880 avg_obj_size=14593418
>     overhead_per_obj=194086
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
