> It'll cause problems if your one and only NVMe drive dies - you'll lose
> all the DB partitions and all the OSDs will fail


The severity of this depends a lot on the size of the cluster.  If there are 
only, say, 4 nodes total, the loss of a quarter of the OSDs will for sure be 
somewhere between painful and fatal, especially if the subtree limit does not 
forestall rebalancing and if EC is in use rather than replication.  From a pain 
angle, though, this is no worse than if the server itself smokes.
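
(For clarity, the subtree limit I mean is mon_osd_down_out_subtree_limit, 
which defaults to "rack".  A minimal sketch, assuming you'd rather have the 
loss of an entire host's OSDs - e.g. from a dead shared DB device - wait for a 
human instead of triggering automatic mark-out and backfill:

    # ceph.conf on the mons; `ceph config set mon mon_osd_down_out_subtree_limit host`
    # achieves the same at runtime
    [mon]
    mon_osd_down_out_subtree_limit = host

As always, weigh that against how long you're comfortable running degraded 
with no automatic recovery.)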

It's easy to say "don't do that" but sometimes one doesn't have a choice:

* Unit economics can confound provisioning of larger/more external metadata 
devices.  I'm sure Vlad isn't using spinners because he hates SSDs.

* Devices have to go somewhere.  It's not uncommon for a server to use 2 PCIe 
slots for NICs (1) and another for an HBA, leaving as few as 1 or 0 free.  
Sometimes the option of a second PCIe riser is precluded by the need to 
provision a rear drive cage for OS/boot drives in order to maximize front-panel 
bay availability.

* Cannibalizing one or more front drive bays for metadata drives can be 
problematic:
- Usable cluster capacity decreases, and unit economics suffer
- Dogfood or draconian corporate policy (Herbert! Herbert!) can prohibit this.  
I've personally been prohibited in the past from the obvious choice of a simple 
open-market LFF-to-SFF adapter because it wasn't officially "supported" and 
would have used components without a corporate SKU.

The 4% guidance was 1% until not all that long ago.  Guidance on calculating 
adequate sizing based on application and workload would be nice to have.  I've 
been told that an object storage (RGW) use case can readily get away with less 
because the higher RocksDB levels (L2/L3/etc.) are both rarely accessed and the 
first to be overflowed onto slower storage, and that block (RBD) workloads have 
different access patterns that are more impacted by overflow of those higher 
levels.  As RBD pools are increasingly deployed on SSD/NVMe devices, the case 
for colocating their metadata is strong, which obviates having to worry about 
sizing before deployment.
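
To put rough numbers on that, here's a back-of-the-envelope sketch in Python.  
The 12 TiB OSD and the RocksDB level parameters (256 MiB base, 10x multiplier) 
are assumptions for illustration only; the point is that, as I understand it, 
older BlueFS would only keep a level on the fast device if the whole level fit, 
which is where the oft-quoted ~3/30/300 GB breakpoints come from:

    # Back-of-the-envelope BlueStore DB sizing; all values are assumptions
    # for illustration rather than pulled from any particular cluster.
    GiB = 1024 ** 3
    TiB = 1024 ** 4

    def rule_of_thumb(osd_bytes, pct):
        """DB partition size as a flat percentage of OSD capacity."""
        return osd_bytes * pct / 100.0

    def rocksdb_level_sums(base=256 * 1024 ** 2, multiplier=10, levels=4):
        """Cumulative space needed to hold RocksDB levels L1..Ln in full."""
        sums, total, size = [], 0, base
        for _ in range(levels):
            total += size
            sums.append(total)
            size *= multiplier
        return sums

    osd = 12 * TiB  # illustrative spinner capacity; substitute your own
    print(f"1% rule: {rule_of_thumb(osd, 1) / GiB:7.1f} GiB")
    print(f"4% rule: {rule_of_thumb(osd, 4) / GiB:7.1f} GiB")
    for level, cumulative in enumerate(rocksdb_level_sums(), start=1):
        print(f"hold L1..L{level} in full: {cumulative / GiB:7.2f} GiB")

For a 12 TiB spinner that works out to roughly 123 GiB (1%) or 492 GiB (4%), 
versus about 28 GiB to hold everything through L3, which is roughly where the 
"RGW can get away with less" argument comes from.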

(1) One of many reasons to seriously consider not having a separate replication 
network
