We ended up with a 1.6TB PCIe NVMe in each node. For 8 drives this
worked out to a DB size of something like 163GB per OSD. Allowing for
expansion to 12 drives brings it down to 124GB. So maybe just put the
WALs on NVMe and leave the DBs on the platters?
Understood that we will want to move to more nodes rather than more
drives per node, but our funding is grant and donation based, so we may
end up adding drives in the short term. The long term plan is to get to
separate MON/MGR/MDS nodes and 10s of OSD nodes.
Due to our current low node count, we are considering erasure-coded pools
rather than replicated ones in order to maximize usable space. Any
guidelines or suggestions on this?
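To put rough numbers on that trade-off, here is a minimal sketch comparing usable capacity for 3-way replication against a k+m erasure-code profile. The function names are made up for illustration, and the figures ignore full ratios, metadata and other overhead. It also assumes a host-level failure domain, which on 3 nodes limits k+m to 3 (e.g. k=2, m=1, which tolerates only a single host failure).

```python
def usable_tb_replicated(raw_tb, copies):
    """Usable capacity when every object is stored `copies` times."""
    return raw_tb / copies


def usable_tb_erasure(raw_tb, k, m):
    """Usable capacity for an erasure-coded pool with k data and m coding chunks."""
    return raw_tb * k / (k + m)


# 3 nodes x 8 drives x 12 TB, the raw size quoted in the original post.
RAW_TB = 3 * 8 * 12

replicated = usable_tb_replicated(RAW_TB, copies=3)
erasure_2_1 = usable_tb_erasure(RAW_TB, k=2, m=1)

print(f"raw:              {RAW_TB} TB")
print(f"replicated (x3):  {replicated:.0f} TB usable")
print(f"erasure (k=2,m=1): {erasure_2_1:.0f} TB usable")
```

So on paper a 2+1 profile would give about twice the usable space of 3 replicas, at the cost of less failure tolerance.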
Also, sorry for not replying inline. I haven't done this much in a
while - I'll figure it out.
On 1/16/2020 2:48 PM, dhils...@performair.com wrote:
I'd like to expand on this answer, briefly...
The information in the docs is wrong. There have been many discussions about
changing it, but no good alternative has been suggested, so it hasn't been
changed.
The third-party project that Ceph's BlueStore uses for its database (RocksDB)
apparently only uses DB sizes of 3GB, 30GB, and 300GB. As Dave mentions below,
when RocksDB executes a compaction, it creates a new blob of the same
target size, and writes the compacted data into it. This doubles the necessary
space. In addition, BlueStore places its Write Ahead Log (WAL) on the
fastest storage that is available to the OSD daemon, i.e. NVMe if available.
Since this is done before the first compaction is requested, the WAL can force
compaction onto slower storage.
Thus, the numbers I've had floating around in my head for our next cluster are:
7GB, 66GB, and 630GB. From all the discussion I've seen around RocksDB, those
seem like good, common sense targets. Pick the largest one that works for your
All that said... You would really want to pair a 600GB+ NVMe with 12TB drives,
otherwise your DB is almost guaranteed to overflow onto the spinning drive, and
I became aware of most of this after we planned our clusters, so I haven't
tried it, YMMV.
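A minimal sketch of the "pick the largest one that fits" rule, using the 7GB / 66GB / 630GB targets quoted above. The function name is hypothetical, and the targets are the rule-of-thumb numbers from this thread, not values read from Ceph or RocksDB.

```python
# Rule-of-thumb DB partition targets in GB, as quoted in this thread
# (RocksDB tier size plus room for compaction and the WAL).
DB_TARGETS_GB = [7, 66, 630]


def largest_fitting_target(available_gb, targets=DB_TARGETS_GB):
    """Return the largest target that fits in available_gb, or None if none fits."""
    fitting = [t for t in targets if t <= available_gb]
    return max(fitting) if fitting else None


# Per-OSD NVMe shares mentioned in the reply at the top of this thread.
for share_gb in (163, 124):
    target = largest_fitting_target(share_gb)
    print(f"{share_gb} GB per OSD -> use a {target} GB DB partition")
```

With either 163GB or 124GB per OSD this lands on the 66GB target; by this rule of thumb, the space above that would not be expected to help until 630GB is reachable.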
One final note: more hosts and more spindles usually translate into better
cluster-wide performance. I can't predict how the relatively low client
counts you're suggesting would impact that.
Dominic L. Hilsbos, MBA
Director – Information Technology
Perform Air International Inc.
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Sent: Thursday, January 16, 2020 10:55 AM
To: Dave Hall
Subject: Re: [ceph-users] Beginner questions
I would definitely go for Nautilus. Quite a few optimizations went in
after Mimic.
Bluestore DB size usually ends up at either 30 or 60 GB.
30 GB is one of the sweet spots during normal operation. But during compaction,
ceph writes the new data before removing the old, hence the 60GB.
The next sweet spot is 300/600GB. Any space between 60 and 300GB will never be used.
DB Usage is also dependent on ceph usage, object storage is known to use a lot
more db space than rbd images for example.
On Thu, 16 Jan 2020 at 17:46, Dave Hall <kdh...@binghamton.edu> wrote:
Sorry for the beginner questions...
I am in the process of setting up a small (3 nodes, 288TB) Ceph cluster to
store some research data. It is expected that this cluster will grow
significantly in the next year, possibly to multiple petabytes and 10s of
nodes. At this time I'm expecting a relatively small number of clients, with
only one or two actively writing collected data - albeit at a high volume per
Currently I'm deploying on Debian 9 via ceph-ansible.
Before I put this cluster into production I have a couple questions based on my
experience to date:
Luminous, Mimic, or Nautilus? I need stability for this deployment, so I am
sticking with Debian 9 since Debian 10 is fairly new, and I have been hesitant
to go with Nautilus. Yet Mimic seems to have had a hard road on Debian but for
the efforts at Croit.
Statements on the Releases page are now making more sense to me, but I would
like to confirm that Nautilus is the right choice at this time?
Bluestore DB size: My nodes currently have 8 x 12TB drives (plus 4 empty bays)
and a PCIe NVMe drive. If I understand the suggested calculation correctly,
the DB size for a 12 TB Bluestore OSD would be 480GB. If my NVMe isn't big
enough to provide this size, should I skip provisioning the DBs on the NVMe, or
should I give each OSD 1/12th of what I have available? Also, should I try to
shift budget a bit to get more NVMe as soon as I can, and redo the OSDs when
sufficient NVMe is available?
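For reference, the arithmetic behind the 480GB figure can be sketched as below. The 4% ratio is inferred from the numbers in this post (480GB for a 12TB OSD) and is treated here as an assumption about the suggested calculation; the function names are made up for illustration.

```python
def suggested_db_gb(osd_tb, percent=4):
    """DB size in GB as a percentage of the OSD size (assumed 4% guideline)."""
    return osd_tb * 1000 * percent / 100


def total_nvme_needed_gb(osd_tb, osd_count, percent=4):
    """NVMe capacity in GB needed to give every OSD in a node the suggested DB size."""
    return suggested_db_gb(osd_tb, percent) * osd_count


per_osd = suggested_db_gb(12)
print(f"suggested DB per 12 TB OSD: {per_osd:.0f} GB")
print(f"NVMe needed for 8 OSDs:  {total_nvme_needed_gb(12, 8):.0f} GB")
print(f"NVMe needed for 12 OSDs: {total_nvme_needed_gb(12, 12):.0f} GB")
```

Under that assumption a fully populated 12-bay node would want several terabytes of NVMe, which is why the question of a smaller share per OSD comes up at all.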
ceph-users mailing list