[ceph-users] Re: Building new cluster had a couple of questions

2023-12-26 Thread Drew Weaver
Okay so NVMe is the only path forward?

I was simply going to replace the PERC H750s with some HBA350s, but if that 
won't work I will just wait until I have a pile of NVMe servers that we aren't 
using in a few years, I guess.

Thanks,
-Drew



From: Anthony D'Atri 
Sent: Friday, December 22, 2023 12:33 PM
To: Drew Weaver 
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Building new cluster had a couple of questions




Sorry I thought of one more thing.

I was actually re-reading the hardware recommendations for Ceph and it seems to 
imply that both RAID controllers as well as HBAs are bad ideas.

Advice I added most likely ;)   "RAID controllers" *are* a subset of HBAs BTW.  
The nomenclature can be confusing and there's this guy on Reddit 


I believe I recall that RAID controllers are suboptimal, but I don't understand 
how you would actually build a cluster with many disks (12-14 per server) 
without any HBAs in the servers.

NVMe

Are there certain HBAs that are worse than others? Sorry I am just confused.

For liability and professionalism I won't name names; especially in serverland, 
there's one manufacturer who dominates.

There are three main points, informed by years of wrangling the things.  I 
posted a litany of my experiences to this very list a few years back, including 
data-losing firmware / utility bugs and operationally expensive ECOs.

* RoC HBAs aka "RAID controllers" are IMHO a throwback to the days when x86 / 
x64 servers didn't have good software RAID.  In the land of the Sun we had VxVM 
(at $$$) that worked well, and Sun's SDS/ODS that ... got better over time.  I 
dunno if the Microsoft world has bootable software RAID now or not.  RoC HBAs 
are, in my experience, flaky and a pain to monitor.  Granted they offer the 
potential 
for a system to boot without intervention if the first boot drive is horqued, 
but IMHO that doesn't happen nearly often enough to justify the hassle.

* These things can have significant cost, especially if one shells out for 
cache RAM, BBU/FBWC, etc.  Today I have a handful of legacy systems that were 
purchased with a tri-mode HBA that in 2018 had a list price of USD 2000.  The  
*only* thing it's doing is mirroring two SATA boot drives.  That money would be 
better spent on SSDs, either with a non-RAID aka JBOD HBA, or better NVMe.

* RAID HBAs confound observability.  Many models today have a JBOD / 
passthrough mode -- in which case why pay for all the RAID-fu?  Some, 
bizarrely, still don't, and one must set up a single-drive RAID0 volume around 
every drive for the system to see it.  This makes iostat even less useful than 
it already is, and one has to jump through hoops to get SMART info.  Hoops 
that, for example, the very promising upstream smartctl_exporter doesn't jump 
through.

There's a certain current M.2 boot drive module like this: the OS cannot see 
the drives AT ALL unless they're in a virtual drive.  Like Chong said, there's 
too much recession.

When using SATA or SAS, you can get a non-RAID HBA for much less money than a 
RAID HBA.  But the nuance here is that unless you have pre-existing gear, SATA 
and especially SAS *do not save money*.  This is heterodox to conventional 
wisdom.

An NVMe-only chassis does not need an HBA of any kind.  NVMe *is* PCI-e.  It 
especially doesn't need an astronomically expensive NVMe-capable RAID 
controller, at least not for uses like Ceph.  If one has an unusual use-case 
that absolutely requires a single volume, if LVM doesn't cut it for some reason 
-- maybe.  And there are things like Phison and ScaleFlux that are out of 
scope; we're talking about Ceph here.

Some chassis vendors try hard to stuff an RoC HBA down your throat, with rather 
high markups.  Others may offer a basic SATA HBA built into the motherboard if 
you need it for some reason.

So when you don't have to spend USD hundreds to a thousand on an RoC HBA + 
BBU/cache/FBWC and jump through hoops to have one more thing to monitor, and an 
NVMe SSD doesn't cost significantly more than a SATA SSD, an all-NVMe system 
can easily be *less* expensive than SATA or especially SAS.  SAS is very, very 
much in its last days in the marketplace; SATA is right behind it.  In 5-10 
years you'll be hard-pressed to find enterprise SAS/SATA SSDs, or if you can, 
might only be from a single manufacturer -- which is an Akbarian trap.

This calculator can help show the false economy of SATA / SAS, and especially 
of HDDs.  Yes, in the enterprise, HDDs are *not* less expensive unless you're a 
slave to $chassisvendor.

Total Cost of Ownership (TCO) Model for Storage:
https://www.snia.org/forums/cmsi/programs/TCOcalc

[ceph-users] Re: Building new cluster had a couple of questions

2023-12-22 Thread Johan Hattne

On 2023-12-22 03:28, Robert Sander wrote:

Hi,

On 22.12.23 11:41, Albert Shih wrote:


for n in 1-100
   Take the OSDs on server n offline
   Uninstall docker on server n
   Install podman on server n
   Redeploy on server n
end


Yep, that's basically the procedure.

But first try it on a test cluster.

Regards


For reference, this was also discussed about two years ago:

  https://www.spinics.net/lists/ceph-users/msg70108.html

Worked for me.

// Johan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Building new cluster had a couple of questions

2023-12-22 Thread Anthony D'Atri


> Sorry I thought of one more thing.
> 
> I was actually re-reading the hardware recommendations for Ceph and it seems 
> to imply that both RAID controllers as well as HBAs are bad ideas.

Advice I added most likely ;)   "RAID controllers" *are* a subset of HBAs BTW.  
The nomenclature can be confusing and there's this guy on Reddit 

> I believe I recall that RAID controllers are suboptimal, but I don't 
> understand how you would actually build a cluster with many disks (12-14 per 
> server) without any HBAs in the servers.

NVMe

> Are there certain HBAs that are worse than others? Sorry I am just confused.

For liability and professionalism I won't name names; especially in serverland, 
there's one manufacturer who dominates.

There are three main points, informed by years of wrangling the things.  I 
posted a litany of my experiences to this very list a few years back, including 
data-losing firmware / utility bugs and operationally expensive ECOs.

* RoC HBAs aka "RAID controllers" are IMHO a throwback to the days when x86 / 
x64 servers didn't have good software RAID.  In the land of the Sun we had VxVM 
(at $$$) that worked well, and Sun's SDS/ODS that ... got better over time.  I 
dunno if the Microsoft world has bootable software RAID now or not.  RoC HBAs 
are, in my experience, flaky and a pain to monitor.  Granted they offer the 
potential 
for a system to boot without intervention if the first boot drive is horqued, 
but IMHO that doesn't happen nearly often enough to justify the hassle.

* These things can have significant cost, especially if one shells out for 
cache RAM, BBU/FBWC, etc.  Today I have a handful of legacy systems that were 
purchased with a tri-mode HBA that in 2018 had a list price of USD 2000.  The  
*only* thing it's doing is mirroring two SATA boot drives.  That money would be 
better spent on SSDs, either with a non-RAID aka JBOD HBA, or better NVMe.  

* RAID HBAs confound observability.  Many models today have a JBOD / 
passthrough mode -- in which case why pay for all the RAID-fu?  Some, 
bizarrely, still don't, and one must set up a single-drive RAID0 volume around 
every drive for the system to see it.  This makes iostat even less useful than 
it already is, and one has to jump through hoops to get SMART info.  Hoops 
that, for example, the very promising upstream smartctl_exporter doesn't jump 
through.
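
To make the hoops concrete -- a sketch, where the device path and the megaraid 
index are examples that vary per system:

    # Plain SAS/SATA passthrough or NVMe: query the drive directly
    smartctl -a /dev/sda
    smartctl -a /dev/nvme0

    # Behind a MegaRAID-style RoC HBA the kernel only sees the virtual drive,
    # so the physical disk has to be addressed by its controller slot index
    smartctl -a -d megaraid,4 /dev/sda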

There's a certain current M.2 boot drive module like this: the OS cannot see 
the drives AT ALL unless they're in a virtual drive.  Like Chong said, there's 
too much recession.

When using SATA or SAS, you can get a non-RAID HBA for much less money than a 
RAID HBA.  But the nuance here is that unless you have pre-existing gear, SATA 
and especially SAS *do not save money*.  This is heterodox to conventional 
wisdom.

An NVMe-only chassis does not need an HBA of any kind.  NVMe *is* PCI-e.  It 
especially doesn't need an astronomically expensive NVMe-capable RAID 
controller, at least not for uses like Ceph.  If one has an unusual use-case 
that absolutely requires a single volume, if LVM doesn't cut it for some reason 
-- maybe.  And there are things like Phison and ScaleFlux that are out of 
scope; we're talking about Ceph here.
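
To see the "NVMe *is* PCI-e" point concretely -- a sketch, assuming the 
nvme-cli package is installed:

    # NVMe drives show up as PCIe endpoints, no SAS/SATA HBA in the path
    lspci | grep -i 'non-volatile'

    # and as namespaces via the kernel's native driver
    nvme list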

Some chassis vendors try hard to stuff an RoC HBA down your throat, with rather 
high markups.  Others may offer a basic SATA HBA built into the motherboard if 
you need it for some reason.

So when you don't have to spend USD hundreds to a thousand on an RoC HBA + 
BBU/cache/FBWC and jump through hoops to have one more thing to monitor, and an 
NVMe SSD doesn't cost significantly more than a SATA SSD, an all-NVMe system 
can easily be *less* expensive than SATA or especially SAS.  SAS is very, very 
much in its last days in the marketplace; SATA is right behind it.  In 5-10 
years you'll be hard-pressed to find enterprise SAS/SATA SSDs, or if you can, 
might only be from a single manufacturer -- which is an Akbarian trap.

This calculator can help show the false economy of SATA / SAS, and especially 
of HDDs.  Yes, in the enterprise, HDDs are *not* less expensive unless you're a 
slave to $chassisvendor.

https://www.snia.org/forums/cmsi/programs/TCOcalc


You read that right.

Don't plug in the cost of a 22TB SMR SATA drive; it likely won't be usable in 
real life.  It's not uncommon to limit spinners to, say, 8TB just because of the 
interface and seek bottlenecks.  The above tool has a multiplier for how many 
additional spindles one has to provision to get semi-acceptable IOPS, for RUs, 
power, AFR, cost to repair, etc.

At scale, consider how many chassis you need when you can stuff 32x 60TB SSDs 
into one, vs 12x 8TB HDDs.  Consider also the risks when it takes a couple of 
weeks to migrate data onto a replacement spinner, or if you can't do 
maintenance because your cluster becomes unusable to users.
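
A quick back-of-the-envelope sketch of the chassis math, assuming a 2 PB raw 
target and ignoring replication / EC overhead:

    target_tb=2000
    ssd_chassis_tb=$((32 * 60))   # 1920 TB raw per 32x 60TB SSD chassis
    hdd_chassis_tb=$((12 * 8))    #   96 TB raw per 12x 8TB HDD chassis
    echo "SSD chassis: $(( (target_tb + ssd_chassis_tb - 1) / ssd_chassis_tb ))"  # 2
    echo "HDD chassis: $(( (target_tb + hdd_chassis_tb - 1) / hdd_chassis_tb ))"  # 21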


-- aad


> 
> Thanks,
> -Drew
> 
> -Original Message-
> From: Drew Weaver  
> Sent: Thursday, December 21, 

[ceph-users] Re: Building new cluster had a couple of questions

2023-12-22 Thread Drew Weaver
Sorry I thought of one more thing.

I was actually re-reading the hardware recommendations for Ceph and it seems to 
imply that both RAID controllers as well as HBAs are bad ideas.

I believe I recall that RAID controllers are suboptimal, but I don't understand 
how you would actually build a cluster with many disks (12-14 per server) 
without any HBAs in the servers.

Are there certain HBAs that are worse than others? Sorry I am just confused.

Thanks,
-Drew

-Original Message-
From: Drew Weaver  
Sent: Thursday, December 21, 2023 8:51 AM
To: 'ceph-users@ceph.io' 
Subject: [ceph-users] Building new cluster had a couple of questions

Howdy,

I am going to be replacing an old cluster pretty soon and I am looking for a 
few suggestions.

#1 cephadm or ceph-ansible for management?
#2 Since the whole... CentOS thing... what distro appears to be the most 
straightforward to use with Ceph?  I was going to try and deploy it on Rocky 9.

That is all I have.

Thanks,
-Drew

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Building new cluster had a couple of questions

2023-12-22 Thread Robert Sander

On 22.12.23 11:46, Marc wrote:


Does podman still have the behavior that Docker has, where killing the Docker 
daemon kills all running containers?


Podman does not come with a daemon to start containers.

The Ceph orchestrator creates systemd units to start the daemons in 
podman containers.
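
For example, on a cephadm-managed host each daemon gets its own unit named 
ceph-<fsid>@<daemon>.service, so daemons start and stop independently of any 
container engine daemon (the fsid below is cluster-specific):

    # list the per-daemon units cephadm created on this host
    systemctl list-units 'ceph-*'

    # restart a single OSD without touching anything else
    systemctl restart ceph-<fsid>@osd.3.service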


Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 220009 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Building new cluster had a couple of questions

2023-12-22 Thread André Gemünd
Podman doesn't use a daemon; that's one of its basic design ideas.

We also use it in production btw.

- Am 22. Dez 2023 um 11:46 schrieb Marc m...@f1-outsourcing.eu:

>> >
>> >> It's been claimed to me that almost nobody uses podman in
>> >> production, but I have no empirical data.
> 
> As opposed to docker or to having no containers at all?
> 
>> > I even converted clusters from Docker to podman while they stayed
>> > online thanks to "ceph orch redeploy".
>> >
> 
> Does podman still have the behavior that Docker has, where killing the Docker
> daemon kills all running containers?
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

-- 
André Gemünd, Leiter IT / Head of IT
Fraunhofer-Institute for Algorithms and Scientific Computing
andre.gemu...@scai.fraunhofer.de
Tel: +49 2241 14-4199
/C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Building new cluster had a couple of questions

2023-12-22 Thread Robert Sander

Hi,

On 22.12.23 11:41, Albert Shih wrote:


for n in 1-100
   Take the OSDs on server n offline
   Uninstall docker on server n
   Install podman on server n
   Redeploy on server n
end


Yep, that's basically the procedure.

But first try it on a test cluster.
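
A rough sketch of that loop with real commands, run from a host with the admin 
keyring -- host names are placeholders, the package commands assume an EL-style 
distro, and, again, rehearse on a test cluster first:

    ceph osd set noout                 # keep data in place while daemons bounce

    for host in node01 node02 node03; do
        # swap the container engine on the host
        ssh "$host" systemctl disable --now docker
        ssh "$host" dnf -y remove docker-ce
        ssh "$host" dnf -y install podman

        # recreate every daemon on that host so it comes back under podman
        for d in $(ceph orch ps "$host" | awk 'NR>1 {print $1}'); do
            ceph orch daemon redeploy "$d"
        done
    done

    ceph osd unset noout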

Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 220009 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Building new cluster had a couple of questions

2023-12-22 Thread Marc


> >
> >> It's been claimed to me that almost nobody uses podman in
> >> production, but I have no empirical data.

As opposed to docker or to having no containers at all?

> > I even converted clusters from Docker to podman while they stayed
> > online thanks to "ceph orch redeploy".
> >

Does podman still have the behavior that Docker has, where killing the Docker 
daemon kills all running containers?


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Building new cluster had a couple of questions

2023-12-22 Thread Eugen Block
That's good to know; I have the same in mind for one of the clusters but 
haven't had the time to test it yet.


Zitat von Robert Sander :


On 21.12.23 22:27, Anthony D'Atri wrote:

It's been claimed to me that almost nobody uses podman in  
production, but I have no empirical data.


I even converted clusters from Docker to podman while they stayed  
online thanks to "ceph orch redeploy".


Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 220009 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Building new cluster had a couple of questions

2023-12-22 Thread Robert Sander

On 21.12.23 22:27, Anthony D'Atri wrote:


It's been claimed to me that almost nobody uses podman in production, but I 
have no empirical data.


I even converted clusters from Docker to podman while they stayed online 
thanks to "ceph orch redeploy".


Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 220009 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Building new cluster had a couple of questions

2023-12-21 Thread Robert Sander

Hi,

On 21.12.23 15:13, Nico Schottelius wrote:


I would strongly recommend k8s+rook for new clusters, also allows
running Alpine Linux as the host OS.


Why would I want to learn Kubernetes before I can deploy a new Ceph 
cluster when I have no need for K8s at all?


Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 220009 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Building new cluster had a couple of questions

2023-12-21 Thread Robert Sander

Hi,

On 21.12.23 19:11, Albert Shih wrote:


What is the advantage of podman vs docker? (I mean not in general, but for 
Ceph.)


Docker comes with the Docker daemon, which has to be restarted when it gets 
an update, and restarting it restarts all of its containers. For a storage 
system that is not the best procedure.


Everything needed for the Ceph containers is provided by podman.

Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 220009 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Building new cluster had a couple of questions

2023-12-21 Thread Simon Ironside

On 21/12/2023 13:50, Drew Weaver wrote:

Howdy,

I am going to be replacing an old cluster pretty soon and I am looking for a 
few suggestions.

#1 cephadm or ceph-ansible for management?
#2 Since the whole... CentOS thing... what distro appears to be the most 
straightforward to use with Ceph?  I was going to try and deploy it on Rocky 9.


I'm in the same boat and have used cephadm on Rocky 9 and the standard 
podman packages that come with the distro. Installation went without a 
hitch; a breeze, actually, compared to the old ceph-deploy/Nautilus 
install it's going to replace.
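
For anyone starting from scratch, the core sequence is roughly the following -- 
a sketch with placeholder IPs/host names, assuming podman, lvm2, chrony and 
cephadm are already installed from the distro/Ceph repos:

    cephadm bootstrap --mon-ip 192.0.2.10

    # add the remaining hosts and let the orchestrator consume free disks
    ssh-copy-id -f -i /etc/ceph/ceph.pub root@node02
    ceph orch host add node02 192.0.2.11
    ceph orch apply osd --all-available-devices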


Cheers,
Simon.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Building new cluster had a couple of questions

2023-12-21 Thread Nico Schottelius


Hey Drew,

Drew Weaver  writes:
> #1 cephadm or ceph-ansible for management?
> #2 Since the whole... CentOS thing... what distro appears to be the most 
> straightforward to use with Ceph?  I was going to try and deploy it on Rocky 
> 9.

I would strongly recommend k8s+rook for new clusters, also allows
running Alpine Linux as the host OS.
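
If you do go that route, the Rook quickstart is roughly the following -- a 
sketch; the manifest paths come from the upstream examples directory and can 
shift between releases:

    git clone --depth 1 https://github.com/rook/rook.git
    cd rook/deploy/examples
    kubectl create -f crds.yaml -f common.yaml -f operator.yaml
    kubectl create -f cluster.yaml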

BR,

Nico


--
Sustainable and modern Infrastructures by ungleich.ch
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Building new cluster had a couple of questions

2023-12-21 Thread Robert Sander

Hi,

On 12/21/23 14:50, Drew Weaver wrote:

#1 cephadm or ceph-ansible for management?


cephadm.

The ceph-ansible project writes in its README:

NOTE: cephadm is the new official installer, you should consider 
migrating to cephadm.


https://github.com/ceph/ceph-ansible


#2 Since the whole... CentOS thing... what distro appears to be the most 
straightforward to use with Ceph?  I was going to try and deploy it on Rocky 9.


Any distribution with a recent systemd, podman, LVM2 and time 
synchronization is viable. I prefer Debian; others prefer RPM-based 
distributions.
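
A quick way to sanity-check those prerequisites on a candidate host -- a 
sketch, assuming chrony is the time synchronization daemon:

    podman --version
    lvm version
    systemctl --version | head -1
    systemctl is-active chronyd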


Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io