I ran an experiment with 1GB of memory per OSD using Bluestore; 12.2.2 made
a big difference.

In addition, you should have a look at your max object size. You will likely
see a jump in memory usage if a particular OSD happens to be the primary for
a number of objects being written in parallel. In our case, reducing the
number of clients reduced memory requirements. Reducing the max object size
should also reduce memory requirements on the OSD daemon.
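A minimal ceph.conf sketch of the knobs discussed above (the values are
illustrative assumptions, not recommendations; verify the option names and
defaults against your Ceph release):

```ini
[osd]
; Cap BlueStore's cache; the biggest lever on steady-state OSD memory
bluestore_cache_size = 536870912
; Bound how much in-flight client data a single OSD will buffer at once
osd_client_message_size_cap = 268435456
; A smaller maximum object size limits the memory spike when one OSD
; is primary for many large objects being written in parallel
osd_max_object_size = 67108864
```

After changing these, restart the OSD daemons and watch RSS under your real
write workload before trusting any particular set of values.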

Subhachandra



On Sun, Dec 10, 2017 at 1:01 PM, <[email protected]> wrote:

> Send ceph-users mailing list submissions to
>         [email protected]
>
> To subscribe or unsubscribe via the World Wide Web, visit
>         http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> or, via email, send a message with subject or body 'help' to
>         [email protected]
>
> You can reach the person managing the list at
>         [email protected]
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of ceph-users digest..."
>
>
> Today's Topics:
>
>    1. Re: RBD+LVM -> iSCSI -> VMWare (Donny Davis)
>    2. Re: RBD+LVM -> iSCSI -> VMWare (Brady Deetz)
>    3. Re: RBD+LVM -> iSCSI -> VMWare (Donny Davis)
>    4. Re: RBD+LVM -> iSCSI -> VMWare (Brady Deetz)
>    5. The way to minimize osd memory usage? (shadow_lin)
>    6. Re: The way to minimize osd memory usage? (Konstantin Shalygin)
>    7. Re: The way to minimize osd memory usage? (shadow_lin)
>    8. Random checksum errors (bluestore on Luminous) (Martin Preuss)
>    9. Re: The way to minimize osd memory usage? (David Turner)
>   10. what's the maximum number of OSDs per OSD server? (Igor Mendelev)
>   11. Re: what's the maximum number of OSDs per OSD server? (Nick Fisk)
>   12. Re: what's the maximum number of OSDs per OSD server?
>       (Igor Mendelev)
>   13. Re: RBD+LVM -> iSCSI -> VMWare (Heðin Ejdesgaard Møller)
>   14. Re: Random checksum errors (bluestore on Luminous) (Martin Preuss)
>   15. Re: what's the maximum number of OSDs per OSD server? (Nick Fisk)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Sun, 10 Dec 2017 00:26:39 +0000
> From: Donny Davis <[email protected]>
> To: Brady Deetz <[email protected]>
> Cc: Aaron Glenn <[email protected]>, ceph-users
>         <[email protected]>
> Subject: Re: [ceph-users] RBD+LVM -> iSCSI -> VMWare
> Message-ID:
>         <CAMHmko_35Y0pRqFp89MLJCi+6Uv9BMtF=Z71pkq8YDhDR0E3Mw@
> mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Just curious but why not just use a hypervisor with rbd support? Are there
> VMware specific features you are reliant on?
>
> On Fri, Dec 8, 2017 at 4:08 PM Brady Deetz <[email protected]> wrote:
>
> > I'm testing using RBD as VMWare datastores. I'm currently testing with
> > krbd+LVM on a tgt target hosted on a hypervisor.
> >
> > My Ceph cluster is HDD backed.
> >
> > In order to help with write latency, I added an SSD drive to my
> hypervisor
> > and made it a writeback cache for the rbd via LVM. So far I've managed to
> > smooth out my 4k write latency and have some pleasing results.
> >
> > Architecturally, my current plan is to deploy an iSCSI gateway on each
> > hypervisor hosting that hypervisor's own datastore.
> >
> > Does anybody have any experience with this kind of configuration,
> > especially with regard to LVM writeback caching combined with RBD?
> > _______________________________________________
> > ceph-users mailing list
> > [email protected]
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/
> attachments/20171210/4f055103/attachment-0001.html>
>
> ------------------------------
>
> Message: 2
> Date: Sat, 9 Dec 2017 18:56:53 -0600
> From: Brady Deetz <[email protected]>
> To: Donny Davis <[email protected]>
> Cc: Aaron Glenn <[email protected]>, ceph-users
>         <[email protected]>
> Subject: Re: [ceph-users] RBD+LVM -> iSCSI -> VMWare
> Message-ID:
>         <CADU_9qV6VVVbzxdbEBCofvON-Or9sajS-E0j_22Wf=RdRycBwQ@
> mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> We have over 150 VMs running in vmware. We also have 2PB of Ceph for
> filesystem. With our vmware storage aging and not providing the IOPs we
> need, we are considering and hoping to use ceph. Ultimately, yes we will
> move to KVM, but in the short term, we probably need to stay on VMware.
>
> On Dec 9, 2017 6:26 PM, "Donny Davis" <[email protected]> wrote:
>
> > Just curious but why not just use a hypervisor with rbd support? Are
> there
> > VMware specific features you are reliant on?
> >
> > On Fri, Dec 8, 2017 at 4:08 PM Brady Deetz <[email protected]> wrote:
> >
> >> I'm testing using RBD as VMWare datastores. I'm currently testing with
> >> krbd+LVM on a tgt target hosted on a hypervisor.
> >>
> >> My Ceph cluster is HDD backed.
> >>
> >> In order to help with write latency, I added an SSD drive to my
> >> hypervisor and made it a writeback cache for the rbd via LVM. So far
> I've
> >> managed to smooth out my 4k write latency and have some pleasing
> results.
> >>
> >> Architecturally, my current plan is to deploy an iSCSI gateway on each
> >> hypervisor hosting that hypervisor's own datastore.
> >>
> >> Does anybody have any experience with this kind of configuration,
> >> especially with regard to LVM writeback caching combined with RBD?
> >> _______________________________________________
> >> ceph-users mailing list
> >> [email protected]
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/
> attachments/20171209/8d02eb27/attachment-0001.html>
>
> ------------------------------
>
> Message: 3
> Date: Sun, 10 Dec 2017 01:09:39 +0000
> From: Donny Davis <[email protected]>
> To: Brady Deetz <[email protected]>
> Cc: Aaron Glenn <[email protected]>, ceph-users
>         <[email protected]>
> Subject: Re: [ceph-users] RBD+LVM -> iSCSI -> VMWare
> Message-ID:
>         <[email protected].
> com>
> Content-Type: text/plain; charset="utf-8"
>
> What I am getting at is that instead of sinking a bunch of time into this
> bandaid, why not sink that time into a hypervisor migration? Seems well
> timed if you ask me.
>
> There are even tools to make that migration easier
>
> http://libguestfs.org/virt-v2v.1.html
>
> You should ultimately move your hypervisor instead of building a one-off
> case for ceph. Ceph works really well if you stay inside the box. So does
> KVM. They work like gangbusters together.
>
> I know that doesn't really answer your OP, but this is what I would do.
>
> ~D
>
> On Sat, Dec 9, 2017 at 7:56 PM Brady Deetz <[email protected]> wrote:
>
> > We have over 150 VMs running in vmware. We also have 2PB of Ceph for
> > filesystem. With our vmware storage aging and not providing the IOPs we
> > need, we are considering and hoping to use ceph. Ultimately, yes we will
> > move to KVM, but in the short term, we probably need to stay on VMware.
> > On Dec 9, 2017 6:26 PM, "Donny Davis" <[email protected]> wrote:
> >
> >> Just curious but why not just use a hypervisor with rbd support? Are
> >> there VMware specific features you are reliant on?
> >>
> >> On Fri, Dec 8, 2017 at 4:08 PM Brady Deetz <[email protected]> wrote:
> >>
> >>> I'm testing using RBD as VMWare datastores. I'm currently testing with
> >>> krbd+LVM on a tgt target hosted on a hypervisor.
> >>>
> >>> My Ceph cluster is HDD backed.
> >>>
> >>> In order to help with write latency, I added an SSD drive to my
> >>> hypervisor and made it a writeback cache for the rbd via LVM. So far
> I've
> >>> managed to smooth out my 4k write latency and have some pleasing
> results.
> >>>
> >>> Architecturally, my current plan is to deploy an iSCSI gateway on each
> >>> hypervisor hosting that hypervisor's own datastore.
> >>>
> >>> Does anybody have any experience with this kind of configuration,
> >>> especially with regard to LVM writeback caching combined with RBD?
> >>> _______________________________________________
> >>> ceph-users mailing list
> >>> [email protected]
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>
> >>
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/
> attachments/20171210/afb26767/attachment-0001.html>
>
> ------------------------------
>
> Message: 4
> Date: Sat, 9 Dec 2017 19:17:01 -0600
> From: Brady Deetz <[email protected]>
> To: Donny Davis <[email protected]>
> Cc: Aaron Glenn <[email protected]>, ceph-users
>         <[email protected]>
> Subject: Re: [ceph-users] RBD+LVM -> iSCSI -> VMWare
> Message-ID:
>         <[email protected].
> com>
> Content-Type: text/plain; charset="utf-8"
>
> That's not a bad position. I have concerns with what I'm proposing, so a
> hypervisor migration may actually bring less risk than a storage
> abomination.
>
> On Dec 9, 2017 7:09 PM, "Donny Davis" <[email protected]> wrote:
>
> > What I am getting at is that instead of sinking a bunch of time into this
> > bandaid, why not sink that time into a hypervisor migration. Seems well
> > timed if you ask me.
> >
> > There are even tools to make that migration easier
> >
> > http://libguestfs.org/virt-v2v.1.html
> >
> > You should ultimately move your hypervisor instead of building a one-off
> > case for ceph. Ceph works really well if you stay inside the box. So does
> > KVM. They work like gangbusters together.
> >
> > I know that doesn't really answer your OP, but this is what I would do.
> >
> > ~D
> >
> > On Sat, Dec 9, 2017 at 7:56 PM Brady Deetz <[email protected]> wrote:
> >
> >> We have over 150 VMs running in vmware. We also have 2PB of Ceph for
> >> filesystem. With our vmware storage aging and not providing the IOPs we
> >> need, we are considering and hoping to use ceph. Ultimately, yes we will
> >> move to KVM, but in the short term, we probably need to stay on VMware.
> >> On Dec 9, 2017 6:26 PM, "Donny Davis" <[email protected]> wrote:
> >>
> >>> Just curious but why not just use a hypervisor with rbd support? Are
> >>> there VMware specific features you are reliant on?
> >>>
> >>> On Fri, Dec 8, 2017 at 4:08 PM Brady Deetz <[email protected]> wrote:
> >>>
> >>>> I'm testing using RBD as VMWare datastores. I'm currently testing with
> >>>> krbd+LVM on a tgt target hosted on a hypervisor.
> >>>>
> >>>> My Ceph cluster is HDD backed.
> >>>>
> >>>> In order to help with write latency, I added an SSD drive to my
> >>>> hypervisor and made it a writeback cache for the rbd via LVM. So far
> I've
> >>>> managed to smooth out my 4k write latency and have some pleasing
> results.
> >>>>
> >>>> Architecturally, my current plan is to deploy an iSCSI gateway on each
> >>>> hypervisor hosting that hypervisor's own datastore.
> >>>>
> >>>> Does anybody have any experience with this kind of configuration,
> >>>> especially with regard to LVM writeback caching combined with RBD?
> >>>> _______________________________________________
> >>>> ceph-users mailing list
> >>>> [email protected]
> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>
> >>>
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/
> attachments/20171209/e19aa6ab/attachment-0001.html>
>
> ------------------------------
>
> Message: 5
> Date: Sun, 10 Dec 2017 11:35:33 +0800
> From: "shadow_lin"<[email protected]>
> To: "ceph-users"<[email protected]>
> Subject: [ceph-users] The way to minimize osd memory usage?
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset="utf-8"
>
> Hi All,
> I am testing ceph luminous (12.2.1-249-g42172a4
> (42172a443183ffe6b36e85770e53fe678db293bf)) on ARM servers.
> Each ARM server has a two [email protected] cpu and 2GB of RAM, and I am
> running 2 osds per server with 2x8TB (or 2x10TB) hdds.
> Now I am facing a constant OOM problem. I have tried upgrading ceph (to
> fix an osd memory leak problem) and lowering the bluestore cache settings.
> The OOM problem did get better but still occurs regularly.
>
> I am hoping someone can give me some advice on the following questions.
>
> Is it impossible to run ceph on this hardware configuration, or can I do
> some tuning to solve this problem (even losing some performance to avoid
> the OOM problem)?
>
> Is it a good idea to use raid0 to combine the 2 HDDs into one, so I only
> run one osd and save some memory?
>
> How is the memory usage of an osd related to the size of its HDD?
>
>
>
>
> PS: my ceph.conf bluestore cache settings
> [osd]
>         bluestore_cache_size = 104857600
>         bluestore_cache_kv_max = 67108864
>         osd client message size cap = 67108864
>
>
>
> 2017-12-10
>
>
>
> lin.yunfan
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/
> attachments/20171210/f096c25b/attachment-0001.html>
>
> ------------------------------
>
> Message: 6
> Date: Sun, 10 Dec 2017 11:29:23 +0700
> From: Konstantin Shalygin <[email protected]>
> To: [email protected]
> Cc: shadow_lin <[email protected]>
> Subject: Re: [ceph-users] The way to minimize osd memory usage?
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset=utf-8; format=flowed
>
> > I am testing running ceph luminous(12.2.1-249-g42172a4 (
> 42172a443183ffe6b36e85770e53fe678db293bf) on ARM server.
> Try new 12.2.2 - this release should fix memory issues with Bluestore.
>
>
>
> ------------------------------
>
> Message: 7
> Date: Sun, 10 Dec 2017 12:33:36 +0800
> From: "shadow_lin"<[email protected]>
> To: "Konstantin Shalygin"<[email protected]>,
>         "ceph-users"<[email protected]>
> Subject: Re: [ceph-users] The way to minimize osd memory usage?
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset="utf-8"
>
> The 12.2.1 (12.2.1-249-g42172a4
> (42172a443183ffe6b36e85770e53fe678db293bf)) we are running already includes
> the memory issue fix, and we are working on upgrading to the 12.2.2 release
> to see if there is any further improvement.
>
> 2017-12-10
>
>
> lin.yunfan
>
>
>
> From: Konstantin Shalygin <[email protected]>
> Sent: 2017-12-10 12:29
> Subject: Re: [ceph-users] The way to minimize osd memory usage?
> To: "ceph-users"<[email protected]>
> Cc: "shadow_lin"<[email protected]>
>
> > I am testing running ceph luminous(12.2.1-249-g42172a4 (
> 42172a443183ffe6b36e85770e53fe678db293bf) on ARM server.
> Try new 12.2.2 - this release should fix memory issues with Bluestore.
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/
> attachments/20171210/e5870ab8/attachment-0001.html>
>
> ------------------------------
>
> Message: 8
> Date: Sun, 10 Dec 2017 14:34:03 +0100
> From: Martin Preuss <[email protected]>
> To: [email protected]
> Subject: [ceph-users] Random checksum errors (bluestore on Luminous)
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset="utf-8"
>
> Hi,
>
> I'm new to Ceph. I started a ceph cluster from scratch on Debian 9,
> consisting of 3 hosts, each host has 3-4 OSDs (using 4TB hdds, currently
> totalling 10 hdds).
>
> Right from the start I always received random scrub errors telling me
> that some checksums didn't match the expected value, fixable with "ceph
> pg repair".
>
> I looked at the ceph-osd logfiles on each of the hosts and compared with
> the corresponding syslogs. I never found any hardware error, so there
> was no problem reading or writing a sector hardware-wise. Also there was
> never any other suspicious syslog entry around the time of checksum
> error reporting.
>
> When I looked at the checksum error entries I found that the reported
> bad checksum always was "0x6706be76".
>
> Could someone please tell me where to look further for the source of the
> problem?
>
> I appended an excerpt of the osd logs.
>
>
> Kind regards
> Martin
>
>
> --
> "Things are only impossible until they're not"
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: ceph-osd.log
> Type: text/x-log
> Size: 4645 bytes
> Desc: not available
> URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/
> attachments/20171210/460992fe/attachment-0001.bin>
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: signature.asc
> Type: application/pgp-signature
> Size: 181 bytes
> Desc: OpenPGP digital signature
> URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/
> attachments/20171210/460992fe/attachment-0001.sig>
>
> ------------------------------
>
> Message: 9
> Date: Sun, 10 Dec 2017 15:05:16 +0000
> From: David Turner <[email protected]>
> To: shadow_lin <[email protected]>
> Cc: Konstantin Shalygin <[email protected]>, ceph-users
>         <[email protected]>
> Subject: Re: [ceph-users] The way to minimize osd memory usage?
> Message-ID:
>         <[email protected].
> com>
> Content-Type: text/plain; charset="utf-8"
>
> The docs recommend 1GB of RAM per TB of OSD. I saw people asking if this
> was still accurate for bluestore, and the answer was that it is more true
> for bluestore than filestore. There might be a way to get this working at
> the cost of performance. I would look at Linux kernel memory settings as
> much as ceph and bluestore settings. Cache pressure is one that comes to
> mind where an aggressive setting might help.
>
> On Sat, Dec 9, 2017, 11:33 PM shadow_lin <[email protected]> wrote:
>
> > The 12.2.1 (12.2.1-249-g42172a4
> > (42172a443183ffe6b36e85770e53fe678db293bf)) we are running already
> > includes the memory issue fix, and we are working on upgrading to the
> > 12.2.2 release to see if there is any further improvement.
> >
> > 2017-12-10
> > ------------------------------
> > lin.yunfan
> > ------------------------------
> >
> > *From:* Konstantin Shalygin <[email protected]>
> > *Sent:* 2017-12-10 12:29
> > *Subject:* Re: [ceph-users] The way to minimize osd memory usage?
> > *To:* "ceph-users"<[email protected]>
> > *Cc:* "shadow_lin"<[email protected]>
> >
> >
> >
> > > I am testing running ceph luminous(12.2.1-249-g42172a4 (
> 42172a443183ffe6b36e85770e53fe678db293bf) on ARM server.
> > Try new 12.2.2 - this release should fix memory issues with Bluestore.
> >
> > _______________________________________________
> > ceph-users mailing list
> > [email protected]
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/
> attachments/20171210/534133c9/attachment-0001.html>
>
> ------------------------------
>
> Message: 10
> Date: Sun, 10 Dec 2017 10:38:53 -0500
> From: Igor Mendelev <[email protected]>
> To: [email protected]
> Subject: [ceph-users] what's the maximum number of OSDs per OSD
>         server?
> Message-ID:
>         <CAKtyfj_0NKQmPNO2C6CuU47xZhM_Xagm2WF4yLUdUhfSw2G7Qg@mail.
> gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Given that servers with 64 CPU cores (128 threads @ 2.7GHz), up to 2TB of
> RAM, and 12TB HDDs are easily available and somewhat reasonably priced, I
> wonder what's the maximum number of OSDs per OSD server (if using 10TB or
> 12TB HDDs), and how much RAM it really requires if the total storage
> capacity of such an OSD server is on the order of 1,000+ TB - is it still
> 1GB of RAM per TB of HDD, or could it be less (during normal operations,
> extended with NVMe SSD swap space for extra room during recovery)?
>
> Are there any known scalability limits in Ceph Luminous (12.2.2 with
> BlueStore) and/or Linux that would keep such a high-capacity OSD server
> from scaling well (using sequential IO speed per HDD as a metric)?
>
> Thanks.
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/
> attachments/20171210/01aa76db/attachment-0001.html>
>
> ------------------------------
>
> Message: 11
> Date: Sun, 10 Dec 2017 16:17:40 -0000
> From: Nick Fisk <[email protected]>
> To: 'Igor Mendelev' <[email protected]>, [email protected]
> Subject: Re: [ceph-users] what's the maximum number of OSDs per OSD
>         server?
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset="utf-8"
>
> From: ceph-users [mailto:[email protected]] On Behalf Of
> Igor Mendelev
> Sent: 10 December 2017 15:39
> To: [email protected]
> Subject: [ceph-users] what's the maximum number of OSDs per OSD server?
>
>
>
> Given that servers with 64 CPU cores (128 threads @ 2.7GHz) and up to 2TB
> RAM - as well as 12TB HDDs - are easily available and somewhat reasonably
> priced I wonder what's the maximum number of OSDs per OSD server (if using
> 10TB or 12TB HDDs) and how much RAM does it really require if total storage
> capacity for such OSD server is on the order of 1,000+ TB - is it still 1GB
> RAM per TB of HDD or it could be less (during normal operations - and
> extended with NVMe SSDs swap space for extra space during recovery)?
>
>
>
> Are there any known scalability limits in Ceph Luminous (12.2.2 with
> BlueStore) and/or Linux that'll make such high capacity OSD server not
> scale well (using sequential IO speed per HDD as a metric)?
>
>
>
> Thanks.
>
>
>
> How many total OSDs will you have? If you are planning on having thousands
> then dense nodes might make sense. Otherwise you are leaving yourself open
> to having a small number of very large nodes, which will likely shoot you
> in the foot further down the line. Also don't forget, unless this is purely
> for archiving, you will likely need to scale the networking up per node;
> 2x10G won't cut it when you have 10-20+ disks per node.
>
> With Bluestore, you are probably looking at around 2-3GB of RAM per OSD,
> so say 4GB to be on the safe side.
>
> 7.2k HDDs will likely only use a small proportion of a CPU core due to
> their limited IO potential. I would imagine that even with 90-bay JBODs,
> you will run into physical limitations before you hit CPU ones.
>
> Without knowing your exact requirements, I would suggest that a larger
> number of smaller nodes might be a better idea. If you choose your
> hardware right, you can often get the cost down to comparable levels by
> not going with top-of-the-range kit, i.e. Xeon E3s or Ds vs dual-socket
> E5s.
>
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/
> attachments/20171210/3f1a50cf/attachment-0001.html>
>
> ------------------------------
>
> Message: 12
> Date: Sun, 10 Dec 2017 12:37:05 -0500
> From: Igor Mendelev <[email protected]>
> To: [email protected], [email protected]
> Subject: Re: [ceph-users] what's the maximum number of OSDs per OSD
>         server?
> Message-ID:
>         <CAKtyfj-zCAPpPANb-5S6gXet+XYX33HhOC_65FP6HrTWBKFfDw@
> mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> The expected number of nodes for the initial setup is 10-15, and of OSDs,
> 1,500-2,000.
>
> Networking is planned to be 2x 100GbE, or 2 dual 50GbE in x16 slots, per
> OSD node.
>
> JBODs are to be connected with 3-4 x8 SAS3 HBAs (4 4x SAS3 ports each).
>
> The choice of hardware is made considering (non-trivial) per-server sw
> licensing costs, so small (12-24 HDD) nodes are certainly not optimal
> regardless of CPU cost (which is estimated to be below 10% of the total
> cost in the setup I'm currently considering).
>
> EC (4+2 or 8+3 etc - TBD) - not 3x replication - is planned to be used for
> most of the storage space.
>
> Main applications are expected to be archiving and sequential access to
> large (multiGB) files/objects.
>
> Nick, which physical limitations are you referring to?
>
> Thanks.
>
> On Sun, Dec 10, 2017 at 11:17 AM, Nick Fisk <[email protected]> wrote:
>
> > *From:* ceph-users [mailto:[email protected]] *On Behalf
> > Of *Igor Mendelev
> > *Sent:* 10 December 2017 15:39
> > *To:* [email protected]
> > *Subject:* [ceph-users] what's the maximum number of OSDs per OSD server?
> >
> >
> >
> > Given that servers with 64 CPU cores (128 threads @ 2.7GHz) and up to 2TB
> > RAM - as well as 12TB HDDs - are easily available and somewhat reasonably
> > priced I wonder what's the maximum number of OSDs per OSD server (if
> using
> > 10TB or 12TB HDDs) and how much RAM does it really require if total
> storage
> > capacity for such OSD server is on the order of 1,000+ TB - is it still
> 1GB
> > RAM per TB of HDD or it could be less (during normal operations - and
> > extended with NVMe SSDs swap space for extra space during recovery)?
> >
> >
> >
> > Are there any known scalability limits in Ceph Luminous (12.2.2 with
> > BlueStore) and/or Linux that'll make such high capacity OSD server not
> > scale well (using sequential IO speed per HDD as a metric)?
> >
> >
> >
> > Thanks.
> >
> >
> >
> > How many total OSDs will you have? If you are planning on having
> > thousands then dense nodes might make sense. Otherwise you are leaving
> > yourself open to having a small number of very large nodes, which will
> > likely shoot you in the foot further down the line. Also don't forget,
> > unless this is purely for archiving, you will likely need to scale the
> > networking up per node; 2x10G won't cut it when you have 10-20+ disks
> > per node.
> >
> > With Bluestore, you are probably looking at around 2-3GB of RAM per OSD,
> > so say 4GB to be on the safe side.
> >
> > 7.2k HDDs will likely only use a small proportion of a CPU core due to
> > their limited IO potential. I would imagine that even with 90-bay JBODs,
> > you will run into physical limitations before you hit CPU ones.
> >
> > Without knowing your exact requirements, I would suggest that a larger
> > number of smaller nodes might be a better idea. If you choose your
> > hardware right, you can often get the cost down to comparable levels by
> > not going with top-of-the-range kit, i.e. Xeon E3s or Ds vs dual-socket
> > E5s.
> >
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/
> attachments/20171210/9c3b98f0/attachment-0001.html>
>
> ------------------------------
>
> Message: 13
> Date: Sun, 10 Dec 2017 17:38:30 +0000
> From: Heðin Ejdesgaard Møller <[email protected]>
> To: Brady Deetz <[email protected]>, Donny Davis <[email protected]>
> Cc: Aaron Glenn <[email protected]>, ceph-users
>         <[email protected]>
> Subject: Re: [ceph-users] RBD+LVM -> iSCSI -> VMWare
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset="UTF-8"
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> Another option is to utilize the iscsi gateway, provided in 12.2
> http://docs.ceph.com/docs/master/rbd/iscsi-overview/
>
> Benefits:
> You can EOL your old SAN without having to simultaneously migrate to
> another hypervisor.
> Any infrastructure that ties into vSphere is unaffected. (CEPH is just
> another set of datastores.)
> If you have the appropriate vmware licenses etc., then your move to CEPH
> can be done without any downtime.
>
> The drawback, from my tests using ceph-12.2-latest and ESXi-6.5, is that
> you get around a 30% performance penalty and higher latency compared to a
> direct rbd mount.
>
>
> On ley, 2017-12-09 at 19:17 -0600, Brady Deetz wrote:
> > That's not a bad position. I have concerns with what I'm proposing, so a
> > hypervisor migration may actually bring less risk than a storage
> > abomination.
> >
> > On Dec 9, 2017 7:09 PM, "Donny Davis" <[email protected]> wrote:
> > > What I am getting at is that instead of sinking a bunch of time into
> this bandaid, why not sink that time into a
> > > hypervisor migration. Seems well timed if you ask me.
> > >
> > > There are even tools to make that migration easier
> > >
> > > http://libguestfs.org/virt-v2v.1.html
> > >
> > > You should ultimately move your hypervisor instead of building a
> > > one-off case for ceph. Ceph works really well if you stay inside the
> > > box. So does KVM. They work like gangbusters together.
> > >
> > > I know that doesn't really answer your OP, but this is what I would do.
> > >
> > > ~D
> > >
> > > On Sat, Dec 9, 2017 at 7:56 PM Brady Deetz <[email protected]> wrote:
> > > > We have over 150 VMs running in vmware. We also have 2PB of Ceph for
> filesystem. With our vmware storage aging and
> > > > not providing the IOPs we need, we are considering and hoping to use
> ceph. Ultimately, yes we will move to KVM,
> > > > but in the short term, we probably need to stay on VMware.
> > > > On Dec 9, 2017 6:26 PM, "Donny Davis" <[email protected]> wrote:
> > > > > Just curious but why not just use a hypervisor with rbd support?
> > > > > Are there VMware specific features you are reliant on?
> > > > >
> > > > > On Fri, Dec 8, 2017 at 4:08 PM Brady Deetz <[email protected]>
> wrote:
> > > > > > I'm testing using RBD as VMWare datastores. I'm currently testing
> > > > > > with krbd+LVM on a tgt target hosted on a hypervisor.
> > > > > >
> > > > > > My Ceph cluster is HDD backed.
> > > > > >
> > > > > > In order to help with write latency, I added an SSD drive to my
> hypervisor and made it a writeback cache for
> > > > > > the rbd via LVM. So far I've managed to smooth out my 4k write
> latency and have some pleasing results.
> > > > > >
> > > > > > Architecturally, my current plan is to deploy an iSCSI gateway
> on each hypervisor hosting that hypervisor's
> > > > > > own datastore.
> > > > > >
> > > > > > Does anybody have any experience with this kind of
> configuration, especially with regard to LVM writeback
> > > > > > caching combined with RBD?
> > > > > > _______________________________________________
> > > > > > ceph-users mailing list
> > > > > > [email protected]
> > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > >
> >
> > _______________________________________________
> > ceph-users mailing list
> > [email protected]
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> -----BEGIN PGP SIGNATURE-----
>
> iQIzBAEBCAAdFiEElZWfRQVsNukQFi9Ko80MCbT/An0FAlotcRYACgkQo80MCbT/
> An36fQ//ULP6gwd4qUbXG3yKBHqMtcsTV76+CfP8e3jcuEqyEzlCugoR10DXPELj
> TLCnrBp4fDP5gTd1zIHcU+PMPcVJ91dBYUWoMZrSLAraM0+7kvNQ9Nsacsl6CsiZ
> yq+506uOhwcLub55oLSpKgnaW1rEG6TAG/6TNIBGakb2a79iC1xev16S3lJ8V7zI
> cb3psUCePv/T753q/0E9B5SH9L5BiygsMT4DjiE09xGcFzH3lqkMWm2HMCFXNogI
> WbwqQVTTgk5Ch3oilz6cpOIqLK2VMkK0PPFXSGi1SAEjkw2c/XIBykB9MclVQn+8
> q5kO5g+uFcflEVnFhKTZknXVoOjrybhs4lMYmK4LJJ340Ay1uLyAlFdZdh+xAN3B
> 43QBKfcd1dL+EgKkMVuzGOaYOAqrFbh2/DN5rAz3l1YUy5h3OtjrXlNU/F7AkZfc
> +UECf9wa6M7uS6DqaPMVxtLhROyMnHw+Z6jrKz7V8EamUduxQyNwOxBNIJYDmKVC
> SHSkQi+oykPHWcOIXr1BNR2raaH1YVqXG+6mK8b6YV6sGtVeXA+KCa8RgrtabU3F
> tgDW8cPkeTcPYi5BOVZeQ2OSD90A6eiC4fJbMcWVbUQim+0gSY2paoC8Rk/HQkMF
> ug8xc9Os7SXe/wEOGQAzRHjDi16eKC9JghrS7dH4JLPg4gvBn4E=
> =auLW
> -----END PGP SIGNATURE-----
>
>
>
> ------------------------------
>
> Message: 14
> Date: Sun, 10 Dec 2017 19:45:31 +0100
> From: Martin Preuss <[email protected]>
> To: [email protected]
> Subject: Re: [ceph-users] Random checksum errors (bluestore on
>         Luminous)
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset="utf-8"
>
> Hi (again),
>
> meanwhile I tried
>
> "ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-0"
>
> but that resulted in a segfault (please see attached console log).
>
>
> Regards
> Martin
>
>
> Am 10.12.2017 um 14:34 schrieb Martin Preuss:
> > Hi,
> >
> > I'm new to Ceph. I started a ceph cluster from scratch on Debian 9,
> > consisting of 3 hosts, each host has 3-4 OSDs (using 4TB hdds, currently
> > totalling 10 hdds).
> >
> > Right from the start I always received random scrub errors telling me
> > that some checksums didn't match the expected value, fixable with "ceph
> > pg repair".
> >
> > I looked at the ceph-osd logfiles on each of the hosts and compared with
> > the corresponding syslogs. I never found any hardware error, so there
> > was no problem reading or writing a sector hardware-wise. Also there was
> > never any other suspicious syslog entry around the time of checksum
> > error reporting.
> >
> > When I looked at the checksum error entries I found that the reported
> > bad checksum always was "0x6706be76".
> >
> > Could someone please tell me where to look further for the source of the
> > problem?
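If memory serves, 0x6706be76 has come up on this list before as the CRC-32C
of an all-zero block, which would point at reads coming back zero-filled
rather than at bit rot. Here is a quick, dependency-free sketch to check that
for yourself. The seed (0xffffffff) and the no-final-XOR convention are my
assumption about how BlueStore calls ceph_crc32c, and 4 KiB is BlueStore's
default checksum chunk size:

```python
# Pure-Python CRC-32C (Castagnoli), bitwise variant -- slow but
# dependency-free. Reflected polynomial 0x82F63B78.

def crc32c(data: bytes, crc: int = 0xFFFFFFFF) -> int:
    """CRC-32C over `data`; seeded with 0xffffffff, no final XOR
    (assumed to match the ceph_crc32c calling convention)."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc >> 1) ^ 0x82F63B78) if (crc & 1) else (crc >> 1)
    return crc

# Sanity check: the standard CRC-32C check value for "123456789" is
# 0xE3069283, which includes a final XOR with 0xffffffff.
assert crc32c(b"123456789") ^ 0xFFFFFFFF == 0xE3069283

# Checksum of a zero-filled block at BlueStore's default csum chunk
# size (4 KiB) -- compare against the 0x6706be76 in your logs:
print(hex(crc32c(bytes(4096))))
```

If that matches, the interesting question becomes why reads are returning
zeros (caching layer, controller, or firmware) rather than where the
checksum code went wrong.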
> >
> > I appended an excerpt of the osd logs.
> >
> >
> > Kind regards
> > Martin
> >
> >
> >
> >
> > _______________________________________________
> > ceph-users mailing list
> > [email protected]
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
> --
> "Things are only impossible until they're not"
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: fsck.log
> Type: text/x-log
> Size: 4314 bytes
> Desc: not available
> URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/attachments/20171210/1a19349d/attachment-0001.bin>
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: signature.asc
> Type: application/pgp-signature
> Size: 181 bytes
> Desc: OpenPGP digital signature
> URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/attachments/20171210/1a19349d/attachment-0001.sig>
>
> ------------------------------
>
> Message: 15
> Date: Sun, 10 Dec 2017 20:32:45 -0000
> From: Nick Fisk <[email protected]>
> To: 'Igor Mendelev' <[email protected]>, [email protected]
> Subject: Re: [ceph-users] what's the maximum number of OSDs per OSD
>         server?
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset="utf-8"
>
> From: ceph-users [mailto:[email protected]] On Behalf Of
> Igor Mendelev
> Sent: 10 December 2017 17:37
> To: [email protected]; [email protected]
> Subject: Re: [ceph-users] what's the maximum number of OSDs per OSD server?
>
>
>
> Expected number of nodes for the initial setup is 10-15, and of OSDs,
> 1,500-2,000.
>
>
>
> Networking is planned to be 2 100GbE or 2 dual 50GbE in x16 slots (per OSD
> node).
>
>
>
> JBODs are to be connected with 3-4 x8 SAS3 HBAs (4 4x SAS3 ports each)
>
>
>
> Choice of hardware is done considering (non-trivial) per-server software
> licensing costs, so small (12-24 HDD) nodes are certainly not optimal
> regardless of CPU cost (which is estimated to be below 10% of the total
> cost in the setup I'm currently considering).
>
>
>
> EC (4+2 or 8+3 etc - TBD) - not 3x replication - is planned to be used for
> most of the storage space.
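For anyone weighing the same choice, the raw-space overhead of the profiles
mentioned above falls out of simple arithmetic: k data chunks plus m coding
chunks means each logical byte consumes (k + m) / k bytes of raw capacity.
A small sketch (the helper function is mine, the k/m values are from this
thread):

```python
# Raw-capacity overhead for erasure coding vs 3x replication.
# k data chunks + m coding chunks -> (k + m) / k raw bytes per usable byte.

def ec_overhead(k: int, m: int) -> float:
    return (k + m) / k

for label, factor in [
    ("EC 4+2", ec_overhead(4, 2)),   # tolerates 2 failures
    ("EC 8+3", ec_overhead(8, 3)),   # tolerates 3 failures
    ("3x replication", 3.0),         # tolerates 2 failures
]:
    print(f"{label}: {factor:.3f}x raw, {100 / factor:.1f}% usable")
```

So 8+3 is noticeably cheaper in space (1.375x) than 4+2 (1.5x) while
tolerating one more failure, at the cost of wider stripes touching more
OSDs per object.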
>
>
>
> Main applications are expected to be archiving and sequential access to
> large (multiGB) files/objects.
>
>
>
> Nick, which physical limitations are you referring to?
>
>
>
> Thanks.
>
>
>
>
>
> Hi Igor,
>
>
>
> I guess I meant physical annoyances rather than limitations. Being able to
> pull out a 1 or 2U node is always much less of a chore vs dealing with
> several U of SAS-interconnected JBODs.
>
>
>
> If you have some license reason for larger nodes, then there is a very
> valid argument for them. Is this license cost related in some way to Ceph
> (I thought Redhat's was capacity based) or is this some sort of collocated
> software? Just make sure you size the nodes to a point that if one has to
> be taken offline for any reason, you are happy with the resulting state of
> the cluster, including the peering when suddenly taking ~200 OSDs
> offline/online.
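To put that peering concern in numbers, here is a quick sketch of what one
dense node going down means for the cluster, using the counts proposed
earlier in this thread (10-15 nodes, 1,500-2,000 OSDs; 10 nodes at ~200
OSDs each is the densest end of that range):

```python
# Fraction of the cluster that must re-peer and backfill when one
# dense node goes offline. Counts are from the plan in this thread.

total_osds = 2000
nodes = 10
osds_per_node = total_osds // nodes          # ~200 OSDs per node

offline_fraction = osds_per_node / total_osds
print(f"One node down: {osds_per_node} OSDs offline, "
      f"{offline_fraction:.0%} of the cluster re-peering at once")
```

Ten percent of all OSDs flapping together is a very different event from
losing one 12-bay box, both for mon/peering load and for the backfill that
follows.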
>
>
>
> Nick
>
>
>
>
>
> On Sun, Dec 10, 2017 at 11:17 AM, Nick Fisk <[email protected] <mailto:
> [email protected]> > wrote:
>
> From: ceph-users [mailto:[email protected] <mailto:
> [email protected]> ] On Behalf Of Igor Mendelev
> Sent: 10 December 2017 15:39
> To: [email protected] <mailto:[email protected]>
> Subject: [ceph-users] what's the maximum number of OSDs per OSD server?
>
>
>
> Given that servers with 64 CPU cores (128 threads @ 2.7GHz) and up to 2TB
> RAM - as well as 12TB HDDs - are easily available and somewhat reasonably
> priced, I wonder what's the maximum number of OSDs per OSD server (if using
> 10TB or 12TB HDDs), and how much RAM does it really require if the total
> storage capacity of such an OSD server is on the order of 1,000+ TB - is it
> still 1GB RAM per TB of HDD, or could it be less (during normal operations,
> extended with NVMe SSD swap space for extra headroom during recovery)?
>
>
>
> Are there any known scalability limits in Ceph Luminous (12.2.2 with
> BlueStore) and/or Linux that'll make such high capacity OSD server not
> scale well (using sequential IO speed per HDD as a metric)?
>
>
>
> Thanks.
>
>
>
> How many total OSDs will you have? If you are planning on having
> thousands then dense nodes might make sense. Otherwise you are leaving
> yourself open to having a small number of very large nodes, which will
> likely shoot you in the foot further down the line. Also don't forget,
> unless this is purely for archiving, you will likely need to scale the
> networking up per node; 2x10G won't cut it when you have 10-20+ disks per
> node.
>
>
>
> With Bluestore, you are probably looking at around 2-3GB of RAM per OSD,
> so say 4GB to be on the safe side.
>
> 7.2k HDDs will likely only use a small proportion of a CPU core due to
> their limited IO potential. I would imagine that even with 90-bay JBODs,
> you will run into physical limitations before you hit CPU ones.
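A back-of-envelope for one dense node, using the ~4 GB/OSD figure above:
the 90-bay count echoes the JBODs just mentioned, and the 180 MB/s
sequential figure per 7.2k HDD is an illustrative assumption, not a
measurement:

```python
# Rough sizing for a dense OSD node. ram_per_osd_gb follows the
# "2-3 GB observed, say 4 GB to be safe" rule of thumb above;
# hdd_seq_mb_s is an assumed large-sequential-read rate per 7.2k HDD.

osds_per_node = 90
ram_per_osd_gb = 4
hdd_seq_mb_s = 180

ram_needed_gb = osds_per_node * ram_per_osd_gb
aggregate_gbit = osds_per_node * hdd_seq_mb_s * 8 / 1000

print(f"RAM per node: ~{ram_needed_gb} GB")
print(f"Aggregate HDD sequential bandwidth: ~{aggregate_gbit:.0f} Gbit/s")
# ~130 Gbit/s of raw sequential bandwidth fits under the planned 2x100GbE,
# but replication/EC write amplification and recovery traffic eat into
# that margin quickly.
```

Which is roughly why the dual-100GbE (or dual 2x50GbE) plan mentioned
earlier in the thread is in the right ballpark for nodes this dense.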
>
>
>
> Without knowing your exact requirements, I would suggest that a larger
> number of smaller nodes might be a better idea. If you choose your
> hardware right, you can often get the cost down to comparable levels by
> not going with top-of-the-range kit, i.e. Xeon E3s or Ds vs dual-socket
> E5s.
>
>
>
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/
> attachments/20171210/1e954b89/attachment-0001.html>
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ------------------------------
>
> End of ceph-users Digest, Vol 59, Issue 9
> *****************************************
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
