> After you have filled that up, if such a host crashes or needs
> maintenance, another 80-100TB will need recreating from the other huge
> drives.

A judicious setting of mon_osd_down_out_subtree_limit can help mitigate the 
thundering herd, FWIW.
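A sketch of what that might look like (assumes a Mimic-or-later cluster with 
the `ceph config` framework; the default for this option is "rack"):

```shell
# If an entire subtree at or above this CRUSH level goes down, do NOT
# automatically mark its OSDs out. Set to "host" so a whole-host failure
# or reboot waits for an operator instead of kicking off a full-host backfill.
ceph config set mon mon_osd_down_out_subtree_limit host

# Verify the running value:
ceph config get mon mon_osd_down_out_subtree_limit
```

The trade-off, of course, is that a genuinely dead host then needs a human to 
mark its OSDs out before recovery starts.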

> I don't think there are specific limitations on the size itself, but
> as the single drive becomes larger and larger, just adding a new host
> or a drive will mean the cluster is rebalancing for days or weeks if
> not more.

Especially if EC is used.  A corollary here is that IOPS/TB decreases as HDDs 
grow larger.  We see some incremental tweaks, but in the end the interface 
speed hasn’t grown in some time.  Seek and rotational latency are helped 
somewhat by increasing areal density, though capacity growth is also achieved 
by making the platters thinner and more numerous: recent drives pack as many 
as 9 in there (perhaps fewer for SMR models).  I’ve seen scale deployments cap 
HDD size at, say, 8TB because IOPS/TB beyond that was increasingly untenable, 
depending of course on the use-case.
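To put a rough number on the IOPS/TB point: a 7200 RPM spindle delivers on the 
order of 150 random IOPS more or less regardless of capacity (an assumed 
round figure, not a spec for any particular drive), so every capacity doubling 
roughly halves IOPS/TB:

```shell
# Assumed ~150 random IOPS per spindle; print IOPS/TB for a few sizes.
for size in 4 8 16 24; do
  awk -v s="$size" 'BEGIN { printf "%2d TB: %.1f IOPS/TB\n", s, 150 / s }'
done
```

A 4TB drive comes out around 37 IOPS/TB while a 24TB drive is in the single 
digits, which is why the 8TB cap above is a defensible line for IOPS-sensitive 
workloads.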

> At some point you would end up having the cluster almost never in
> HEALTH_OK state because of normal replacements, expansions and other
> surprises

With recent releases backfill doesn’t trigger HEALTH_WARN, though, right?  

>  which in turn could cause secondary problems with mon DBs
> and things like that.

Your point is well made, though: Dan @ CERN observed several years ago that 
with a sufficiently large cluster one has to come to terms with backfill going 
on all the time.  The idea here is that mon DB compaction tends to block if 
there is any degradation; with at least some releases, that means even when 
you are in HEALTH_OK but some OSDs are down/out.
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]