[ceph-users] Re: Best practice and expected benefits of using separate WAL and DB devices with Bluestore

2024-04-19 Thread Simon Kepp
Hi Ondrej,
When running multiple OSDs on a shared DB/WAL NVMe, it is important to take
into account, when designing your redundancy/failure domains, that the loss
of a single NVMe drive will take out a number of OSDs. You must design your
redundancy so that it is acceptable to lose that many OSDs simultaneously
and still be able to rebuild without data loss. In most scenarios this is
easily addressed simply by using failure_domain=host, as you won't be
sharing DB/WAL NVMes across multiple hosts. I don't think there's any
generally agreed perfect number of OSDs per DB/WAL NVMe, but I've seen
others argue for a best practice of at most 3 OSDs per DB/WAL NVMe, and
have myself adopted that as a standard. I run hosts with 12 HDD OSDs and
4 DB/WAL NVMes, with failure_domain=host.
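
For reference, a replicated CRUSH rule that places each copy on a different
host can be set up roughly like this (a minimal sketch; the rule and pool
names are just placeholders):

  ceph osd crush rule create-replicated replicated-by-host default host
  ceph osd pool set mypool crush_rule replicated-by-host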

Best Regards,
Simon Kepp,
Founder,
Kepp Technologies.

On Fri, Apr 19, 2024 at 2:07 PM Ondřej Kukla  wrote:

> Hello,
>
> I’m going to mainly answer the practical questions Niklaus had.
>
> Our standard setup is 12 HDDs and 2 enterprise NVMes per node, which means
> we have 6 OSDs per NVMe. For the partitioning we use LVM.
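>
> A rough sketch of what that carving could look like (device, VG and LV
> names and sizes are just examples, not our exact layout):
>
>   pvcreate /dev/nvme0n1
>   vgcreate ceph-db-0 /dev/nvme0n1
>   # one DB/WAL LV per OSD that will share this NVMe
>   for i in 1 2 3 4 5 6; do lvcreate -L 300G -n db-$i ceph-db-0; done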
>
> The fact that one failed NVMe takes down 6 OSDs isn’t great, but our
> OSD-node count is more than double the K + M values of our erasure coding
> profile, which means losing 6 OSDs should be ok-ish. Losing multiple NVMes
> could be an issue. If you use replicated pools then this isn’t that
> problematic.
>
> When it comes to recovery, Ceph handles it easily. Just recreate the
> LVs and the OSDs and you are good to go.
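>
> A minimal sketch of that step, assuming the NVMe has been replaced and the
> DB LVs recreated as above (the OSD id, device and LV names are
> placeholders):
>
>   # remove the dead OSD and wipe its HDD, then redeploy it against the new DB LV
>   ceph osd purge 12 --yes-i-really-mean-it
>   ceph-volume lvm zap --destroy /dev/sdb
>   ceph-volume lvm create --data /dev/sdb --block.db ceph-db-0/db-1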
>
> One other benefit for us is that, because we use large NVMes (7.7 TiB), we
> can use the spare space for a fast pool.
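>
> E.g. by creating extra OSDs on the leftover space and, if those OSDs are
> tagged with the nvme device class, pinning a pool to them, roughly like
> this (pool and rule names are just examples):
>
>   ceph osd crush rule create-replicated nvme-only default host nvme
>   ceph osd pool create fast-pool 64 64 replicated nvme-only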
>
> Ondrej
>
> > On 19. 4. 2024, at 12:04, Torkil Svensgaard  wrote:
> >
> > Hi
> >
> > Red Hat Ceph support told us back in the day that 16 DB/WAL partitions
> > per NVMe were the max supported by RHCS, because their testing showed
> > performance suffered beyond that. We are running with 11 per NVMe.
> >
> > We are prepared to lose a bunch of OSDs if we have an NVMe die. We
> > expect Ceph will handle it, and we can redeploy the OSDs with a new NVMe
> > device.
> >
> > We use a service spec for the chopping up bit:
> >
> > service_type: osd
> > service_id: slow
> > service_name: osd.slow
> > placement:
> >  host_pattern: '*'
> > spec:
> >  block_db_size: 290966113186
> >  data_devices:
> >    rotational: 1
> >  db_devices:
> >    rotational: 0
> >    size: '1000G:'
> >  filter_logic: AND
> >  objectstore: bluestore
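> >
> > A spec like that can then be applied with something like this (assuming
> > it is saved as osd-slow.yaml; the --dry-run first previews which OSDs
> > the orchestrator would create):
> >
> >  ceph orch apply -i osd-slow.yaml --dry-run
> >  ceph orch apply -i osd-slow.yaml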
> >
> > Mvh.
> >
> > Torkil
> >
> > On 19-04-2024 11:02, Niklaus Hofer wrote:
> >> Dear all
> >>
> >> We have an HDD Ceph cluster that could do with some more IOPS. One
> >> solution we are considering is installing NVMe SSDs into the storage
> >> nodes and using them as WAL and/or DB devices for the BlueStore OSDs.
> >> However, we have some questions about this and are looking for some
> >> guidance and advice.
> >> The first one is about the expected benefits. Before we undergo the
> >> effort involved in the transition, we are wondering if it is even worth
> >> it. How much of a performance boost can one expect when adding NVMe SSDs
> >> as WAL devices to an HDD cluster? And how much faster than that does it
> >> get with the DB also being on SSD? Are there rule-of-thumb numbers for
> >> that? Or maybe someone has done benchmarks in the past?
> >> The second question is of a more practical nature. Are there any best
> >> practices on how to implement this? I was thinking we won't do one SSD
> >> per HDD - surely an NVMe SSD is plenty fast to handle the traffic from
> >> multiple OSDs. But what is a good ratio? Do I have one NVMe SSD per 4
> >> HDDs? Per 6 or even 8? Also, how should I chop up the SSD, using
> >> partitions or using LVM? Last but not least, if I have one SSD handle
> >> WAL and DB for multiple OSDs, losing that SSD means losing multiple
> >> OSDs. How do people deal with this risk? Is it generally deemed
> >> acceptable, or is this something people tend to mitigate, and if so,
> >> how? Do I run multiple SSDs in RAID?
> >> I do realize that for some of these there might not be one perfect
> >> answer that fits all use cases. I am looking for best practices and in
> >> general just trying to avoid any obvious mistakes.
> >>
> >> Any advice is much appreciated.
> >>
> >> Sincerely
> >> Niklaus Hofer
> >
> > --
> > Torkil Svensgaard
> > Systems Administrator
> > Danish Research Centre for Magnetic Resonance DRCMR, Section 714
> > Copenhagen University Hospital Amager and Hvidovre
> > Kettegaard Allé 30, 2650 Hvidovre, Denmark
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to use hardware

2023-11-17 Thread Simon Kepp
I know that your question is regarding the service servers, but may I ask
why you are planning to place so many OSDs (300) on so few OSD hosts (6),
i.e. 50 OSDs per node?
This is possible to do, but it sounds like the nodes were designed for
scale-up rather than a scale-out architecture like Ceph. Going with such
"fat nodes" is doable, but will significantly limit performance,
reliability and availability compared to distributing the same OSDs
across more, thinner nodes.

Best regards,
Simon Kepp

Founder/CEO
Kepp Technologies

On Fri, Nov 17, 2023 at 10:59 AM Albert Shih  wrote:

> Hi everyone,
>
> For the purpose of deploying a medium-sized Ceph cluster (300 OSDs) we have
> 6 bare-metal servers for the OSDs, and 5 bare-metal servers for the services
> (MDS, MON, etc.)
>
> Those 5 bare-metal servers each have 48 cores and 256 GB of RAM.
>
> What would be the smartest way to use those 5 servers? I see two ways:
>
>   First:
>
> Server 1: MDS, MON, grafana, prometheus, webui
> Server 2: MON
> Server 3: MON
> Server 4: MDS
> Server 5: MDS
>
>   so 3 MDS, 3 MON, and we can lose 2 servers.
>
>   Second:
>
> KVM on each server:
>   Server 1: 3 VMs: one for grafana & co., and 1 MDS, 2 MON
>   other servers: 1 MDS, 1 MON each
>
>   in total: 5 MDS, 5 MON, and we can lose 4 servers.
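>
> A sketch of how that second layout could be expressed with the
> orchestrator (host names and the filesystem name are placeholders):
>
>   ceph orch apply mon --placement="5 srv1 srv2 srv3 srv4 srv5"
>   ceph orch apply mds cephfs --placement="5 srv1 srv2 srv3 srv4 srv5"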
>
> So on paper the second seems smarter, but it's also more complex, so my
> question is: is it worth the complexity to have 5 MDS/MON for 300 OSDs?
>
> Important: the main goal of this Ceph cluster is not to get the maximum
> I/O speed. I would not say that speed is not a factor, but it's not the
> main point.
>
> Regards.
>
>
> --
> Albert SHIH 嶺 
> Observatoire de Paris
> France
> Heure locale/Local time:
> ven. 17 nov. 2023 10:49:27 CET
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to copy an OSD from one failing disk to another one

2020-12-08 Thread Simon Kepp
For Ceph, this is fortunately not a major issue. Drive failures are
considered entirely normal, and Ceph will automatically rebuild your data
from redundancy onto a new replacement drive. If you're able to predict the
imminent failure of a drive, adding a new drive/OSD will automatically
start flowing data onto that drive immediately, thus reducing the time
period with decreased redundancy. If you're running with very tight levels
of redundancy, you're better off creating a new OSD on a replacement drive
before destroying the old OSD on the failing drive. But if you're running
with anything near the recommended/default levels of redundancy, it doesn't
really matter in which order you do it.
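
A rough sketch of the usual destroy-and-replace flow on the command line
(the OSD id and device name are just examples):

  # drain the failing OSD while it is still readable
  ceph osd out osd.12
  # check, then destroy the old OSD (keeps its id) and redeploy on the new disk
  ceph osd safe-to-destroy osd.12
  ceph osd destroy osd.12 --yes-i-really-mean-it
  ceph-volume lvm create --osd-id 12 --data /dev/sdX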


Best regards,
Simon Kepp,
Kepp Technologies.

On Tue, Dec 8, 2020 at 8:59 PM Konstantin Shalygin  wrote:

> Destroy this OSD, replace disk, deploy OSD.
>
>
> k
>
> Sent from my iPhone
>
> > On 8 Dec 2020, at 15:13, huxia...@horebdata.cn wrote:
> >
> > Hi, dear cephers,
> >
> > On one Ceph node I have a failing disk whose SMART information signals an
> > impending failure, but it is still available for reads and writes. I am
> > setting up a new disk on the same node to replace it.
> > What is the best procedure to migrate (or copy) the data from the failing
> > OSD to the new one?
> >
> > Is there any standard method to copy an OSD from one disk to another?
> >
> > best regards,
> >
> > samuel
> >
> >
> >
> > huxia...@horebdata.cn
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io