Hi,

Ok, I am starting to get a clearer picture of the terminology; that seems to be one part of the problem.

The etcd database should be on virtual disks for each of the Kubernetes nodes,
not directly on Ceph.

I would plan both a "regular" replicated cluster and a stretched cluster (even
if we would have to install twice, I think it is worth testing both), compare
their behaviour, and I assume the outcome of those tests would also be
interesting for this mailing list, or maybe for a blog article.
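
For the stretched variant I would start from the per-pool stretch command in
the recent docs, something like this (pool and rule names are placeholders,
and I would double-check the argument order against the documentation):

    # turn 'mypool' into an individual stretch pool: peering requires
    # 2 of the 3 datacenter buckets, replicated rule, size 6 / min_size 3
    ceph osd pool stretch set mypool 2 3 datacenter replicated_rule 6 3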

Thank you again for taking the time; it is much clearer now.

Cheers
Soeren


________________________________
From: Joachim Kraftmayer <joachim.kraftma...@clyso.com>
Sent: Monday, April 28, 2025 1:11 PM
To: Soeren Malchow <soeren.malc...@convotis.com>
Cc: Anthony D'Atri <anthony.da...@gmail.com>; ceph-users@ceph.io 
<ceph-users@ceph.io>
Subject: Re: [ceph-users] Re: Stretched pool or not ?

Hi Soeren,

Understood. So stretched pools also need a stretched Ceph cluster.
A simple setup would be replication size 3 for replicated pools and 3 or more
Ceph monitors, ...
The failure domain is the datacenter. That means if you lose one DC the Ceph
cluster is still online.
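
A minimal sketch of that simple setup (rule and pool names are just examples):

    # replicated CRUSH rule that places each replica in a different datacenter
    ceph osd crush rule create-replicated rep_dc default datacenter

    # pool with 3 replicas, one per DC
    ceph osd pool create mypool 128 128 replicated rep_dc
    ceph osd pool set mypool size 3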

Back to your original question:

> If using a stretched pool across all 3 datacenters, what happens if one
> datacenter fails? I did read the documentation, and the question came up
> because I do not understand the sentence "Individual Stretch Pools do not
> support I/O operations during a netsplit scenario between two or more zones"
> completely. Does it mean there is no I/O as soon as even one datacenter fails?

If you have changed your failure domain to 'datacenter', at least two DCs must
be available to handle I/O operations.
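
With one replica per DC that corresponds to min_size 2: PGs keep serving I/O
as long as two of the three copies are reachable, i.e. with one DC down. As a
sketch (pool name again an example):

    ceph osd pool set mypool min_size 2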

Do you want to use Ceph for the etcd database?

Hope it helps, Joachim



joachim.kraftma...@clyso.com

  
www.clyso.com

  Hohenzollernstr. 27, 80801 Munich

Utting | HR: Augsburg | HRB: 25866 | USt. ID-Nr.: DE275430677




On Mon, Apr 28, 2025 at 12:15 PM Soeren Malchow
<soeren.malc...@convotis.com> wrote:
Hi Joachim and Anthony,

First: thanks for taking the time to answer. (Now in plain text, sorry, I did
not think about that.)

The sentence I am referring to is about "stretched pools" without a
tiebreaker, as opposed to "stretch mode", if I understood the documentation
correctly. I read this in the "Limitations" section on exactly the page your
link refers to.

The reason for having 3 datacenters is that we run a lot of k8s clusters which
also need quorum; if I distribute the etcd nodes across 3 datacenters, the
outage of one datacenter keeps each k8s cluster operational (a sketch of that
layout is below).
That's why I was explicitly referring to stretched pools, not stretch mode
(I still hope I understand everything correctly).
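
For illustration, the etcd layout I have in mind, one member per DC (hostnames
and IPs are made up, and in practice the k8s tooling generates this):

    etcd --name etcd-dc1 \
      --initial-advertise-peer-urls http://10.0.1.10:2380 \
      --listen-peer-urls http://10.0.1.10:2380 \
      --initial-cluster etcd-dc1=http://10.0.1.10:2380,etcd-dc2=http://10.0.2.10:2380,etcd-dc3=http://10.0.3.10:2380 \
      --initial-cluster-state new

With 3 members the quorum is 2, so one DC can fail and etcd stays writable.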

We do not have a single point of failure in the setup; all connections and
devices are redundant.
The latency between the datacenters is most likely very low (we cannot measure
it yet since we are still in the planning stage).
The connections between the datacenters run over dark fibre terminated in
modules directly in the top-of-rack switches, so compared to the local
connectivity it will be almost the same.
We have an existing similar setup between 2 datacenters where the WAN
connection adds below 1 ms of latency.

On "exceptionally large nodes", those are all identical servers, 3 per 
datacenter with 16 x 3.84 TB nvme disks, 128 AMD Epyc cores (on 2 sockets) and 
1.5 TB memory, i would not count them as "exceptionally large".

I will read up a little more on asynchronous replication.

Cheers
Soeren



________________________________________
From: Joachim Kraftmayer <joachim.kraftma...@clyso.com>
Sent: Monday, April 28, 2025 8:23 AM
To: Anthony D'Atri <anthony.da...@gmail.com>
Cc: Soeren Malchow <soeren.malc...@convotis.com>; ceph-users@ceph.io
<ceph-users@ceph.io>
Subject: Re: [ceph-users] Re: Stretched pool or not ?

Hi Soeren,
First, I would like to clarify something. There are two options: a stretched
cluster and stretch mode.
The documentation says: "Sometimes this cannot be relied upon. If you have a
“stretched-cluster” deployment in which much of your cluster is behind a
single network component, you might need to use stretch mode to ensure data
integrity."
Source: https://docs.ceph.com/en/latest/rados/operations/stretch-mode/#id1
The focus in this sentence is on ‘single network component’. I hope you don't 
have a single point of failure in your setup.

Which option is best for your requirements?
Regards, Joachim

joachim.kraftma...@clyso.com
www.clyso.com
  Hohenzollernstr. 27, 80801 Munich
Utting | HR: Augsburg | HRB: 25866 | USt. ID-Nr.: DE275430677



On Mon, Apr 28, 2025 at 2:50 AM Anthony D'Atri
<anthony.da...@gmail.com> wrote:

Please make list posts in plain text.


> I am working on the plan for a 3-datacenter setup using Ceph (on Proxmox
> nodes).
>
> Each datacenter has 3 physical nodes to start with and 100 Gbit switches. I
> will also have 2 x 100 Gbit/s connectivity between the datacenters (from
> each datacenter to each of the others).
> The physical nodes have 2 x 100 Gbit/s for the public network and 2 x
> 100 Gbit/s for the cluster network.

You almost certainly don’t need a cluster / replication network unless these 
are exceptionally large nodes.

>
> About this setup i have 2 questions.
>
> Is it even necessary to evaluate a stretched cluster, since the WAN
> connections are as fast as the local ones (including the latency, since it
> is only 25 km)?

There’s more to latency than just distance.  What is the measured latency?  
A:B, B:C, C:A?
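
Once the links are in place, measuring is easy enough, e.g. (hostnames are
placeholders):

    # round-trip latency between one node in each DC
    ping -c 100 node-dc-b
    ping -c 100 node-dc-c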



>
> If using a stretched pool across all 3 datacenters, what happens if one
> datacenter fails? I did read the documentation, and the question came up
> because I do not understand the sentence "Individual Stretch Pools do not
> support I/O operations during a netsplit scenario between two or more zones"
> completely. Does it mean there is no I/O as soon as even one datacenter fails?

That sentence refers to a non-stretch cluster.

Tell us why you’re spreading across three DCs, what you’re trying to 
accomplish, and what your performance requirements are.

AIUI a stretch 3-site cluster requires all pools to be replicated, size=6.
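
A sketch of what such a CRUSH rule might look like, two replicas per DC across
three DCs (rule name and id are arbitrary):

    rule stretch_3site {
        id 5
        type replicated
        step take default
        step choose firstn 0 type datacenter
        step chooseleaf firstn 2 type host
        step emit
    }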

Explicit stretch mode treats the mon quorum in a different way.  With two OSD
sites you deploy a tiebreaker at a third site, which is possibly just a cloud
VM.  With three OSD sites, I might speculate that one would deploy 7 mons, 2
at each OSD site + a tiebreaker.
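
If one did, the mon locations might be declared roughly like this (mon names
and site labels are placeholders; the seventh mon sits at whatever the
tiebreaker location ends up being):

    ceph mon set election_strategy connectivity
    ceph mon set_location a datacenter=dc1
    ceph mon set_location b datacenter=dc1
    ceph mon set_location c datacenter=dc2
    ceph mon set_location d datacenter=dc2
    ceph mon set_location e datacenter=dc3
    ceph mon set_location f datacenter=dc3
    ceph mon set_location g datacenter=tiebreaker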

Operations on a stretch cluster can be slow.   Sometimes separate clusters with 
asynchronous replication make more sense.

>
>
> If I am on the wrong path, maybe someone has a link for me where I can find
> information on this setup?
>
> Cheers
> Soeren
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io