Christian,

Thank you for the tips -- I certainly googled my eyes out for a good while
before asking -- maybe my google-fu wasn't too good last night.

> I love using IB, alas with just one port per host you're likely best off
> ignoring it, unless you have a converged network/switches that can make
> use of it (or run it in Ethernet mode).

I've always heard people speak fondly of IB, but I've honestly never dealt
with it. I'm mostly a network guy at heart, so I'm perfectly comfortable
aggregating 10Gb/s connections till the cows come home. What are some of
the virtues of IB over Ethernet? (Not Ethernet over IB.)

> Bluestore doesn't have journals per se and unless you're going to wait for
> Luminous I wouldn't recommend using Bluestore in production.
> Hell, I won't be using it any time soon, but anything pre L sounds
> like outright channeling Murphy to smite you

I do like to play with fire often, but not normally with other people's
data. I suppose I will stay away from Bluestore for now, unless Luminous is
released within the next few weeks. I am using it on Kraken in my small
test cluster, so far without a visit from Murphy.

> That said, what SSD is it?
> Bluestore WAL needs are rather small.
> OTOH, a single SSD isn't something I'd recommend either, SPOF and all.

> I'm guessing you have no budget to improve on that gift horse?

It's a Micron 1100 256GB, rated for 120TBW, which works out to about
100GB/day for 3 years, so not even 0.5 DWPD. I doubt it has the endurance
to journal for 36x 1TB drives.
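
For anyone checking my math, here's the back-of-the-envelope version (the
120TBW and 256GB figures are from the spec sheet; the 3-year window is the
warranty period I'm assuming):

```python
# Endurance math for the Micron 1100 256GB.
# Assumptions: 120 TBW rated endurance, 3-year window, decimal units.
capacity_gb = 256
rated_tbw = 120
years = 3

gb_per_day = rated_tbw * 1000 / (years * 365)  # ~109.6 GB/day of rated writes
dwpd = gb_per_day / capacity_gb                # ~0.43 drive writes per day
print(f"{gb_per_day:.1f} GB/day, {dwpd:.2f} DWPD")
```

So even a trickle of journal traffic from 36 spinners would chew through
that rating quickly.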

I do have some room in the budget, and NVMe journals have been in the back
of my mind. These servers have 6 PCIe x8 slots in them, so tons of room.
But then I'm going to get asked about a cache tier, which everyone seems to
think is the holy grail (and it probably would be, if it could 'just work').

But from what I've read, they're an utter nightmare to manage, particularly
without a well-defined workload, and often hurt more than they help.

I haven't spent a ton of time with the network gear that was dumped on me,
but the switches I have now are a Nexus 7000, 4x Force10 S4810 (so I do
have some stackable 10Gb that I can MC-LAG), 2x Mellanox IS5023 (18-port IB
switches), what appears to be a giant IB switch (QLogic 12800-120) and
another apparently big boy (QLogic 12800-180). I'm going to pick them up
from the warehouse tomorrow.

If I stay away from IB completely, I may just use the IB card as 4x 10Gb +
the 2x 10Gb on board, like I had originally mentioned. But if that IB gear
is good, I'd hate to see it go to waste. It might be worth getting a second
IB card for each server.
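
Rough throughput math for that layout (assuming the IB card really does
split cleanly into 4x 10GbE and everything bonds, which I haven't verified
on this hardware yet):

```python
# Aggregate-bandwidth sketch: 4x 10Gb (split IB card) + 2x 10Gb on board,
# versus the theoretical ceiling of 36 spinners at ~125 MB/s each.
links_gbps = [10] * 4 + [10] * 2                  # six 10Gb links per server
aggregate_gbps = sum(links_gbps)                  # 60 Gb/s theoretical
drives = 36
drive_mbps = 125                                  # sequential best case for a 7.2k 1TB drive
drive_total_gbps = drives * drive_mbps * 8 / 1000  # 36 Gb/s
print(f"{aggregate_gbps} Gb/s of links vs {drive_total_gbps:.0f} Gb/s of disk")
```

Plenty of headroom on paper, though as you point out I'll never actually
see the disks sustain that.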

Again, thanks a million for the advice. I'd rather learn this the easy way
than have to rebuild this 6 times over the next 6 months.
On Tue, Jun 6, 2017 at 2:05 AM, Christian Balzer <[email protected]> wrote:

>
> Hello,
>
> lots of similar questions in the past, google is your friend.
>
> On Mon, 5 Jun 2017 23:59:07 -0400 Daniel K wrote:
>
> > I've built 'my-first-ceph-cluster' with two of the 4-node, 12 drive
> > Supermicro servers and dual 10Gb interfaces(one cluster, one public)
> >
> > I now have 9x 36-drive supermicro StorageServers made available to me,
> each
> > with dual 10GB and a single Mellanox IB/40G nic. No 1G interfaces except
> > IPMI. 2x 6-core 6-thread 1.7ghz xeon processors (12 cores total) for 36
> > drives. Currently 32GB of ram. 36x 1TB 7.2k drives.
> >
> I love using IB, alas with just one port per host you're likely best off
> ignoring it, unless you have a converged network/switches that can make
> use of it (or run it in Ethernet mode).
>
> > Early usage will be CephFS, exported via NFS and mounted on ESXi 5.5 and
> > 6.0 hosts(migrating from a VMWare environment), later to transition to
> > qemu/kvm/libvirt using native RBD mapping. I tested iscsi using lio and
> saw
> > much worse performance with the first cluster, so it seems this may be
> the
> > better way, but I'm open to other suggestions.
> >
> I've never seen any ultimate solution to providing HA iSCSI on top of
> Ceph, though other people here have made significant efforts.
>
> > Considerations:
> > Best practice documents indicate .5 cpu per OSD, but I have 36 drives and
> > 12 CPUs. Would it be better to create 18x 2-drive raid0 on the hardware
> > raid card to present a fewer number of larger devices to ceph? Or run
> > multiple drives per OSD?
> >
> You're definitely underpowered in the CPU department and I personally
> would make RAID1 or 10s for never having to re-balance an OSD.
> But if space is an issue, RAID0s would do.
> OTOH, w/o any SSDs in the game your HDD only cluster is going to be less
> CPU hungry than others.
>
> > There is a single 256gb SSD which i feel would be a bottleneck if I used
> it
> > as a journal for all 36 drives, so I believe bluestore with a journal on
> > each drive would be the best option.
> >
> Bluestore doesn't have journals per se and unless you're going to wait for
> Luminous I wouldn't recommend using Bluestore in production.
> Hell, I won't be using it any time soon, but anything pre L sounds
> like outright channeling Murphy to smite you.
>
> That said, what SSD is it?
> Bluestore WAL needs are rather small.
> OTOH, a single SSD isn't something I'd recommend either, SPOF and all.
>
> I'm guessing you have no budget to improve on that gift horse?
>
> > Is 1.7Ghz too slow for what I'm doing?
> >
> If you're going to have a lot of small I/Os it probably will be.
>
> > I like the idea of keeping the public and cluster networks separate.
>
> I don't, at least not on a physical level when you pay for this by losing
> redundancy.
> Do you have 2 switches, are they MC-LAG capable (aka stackable)?
>
> >Any
> > suggestions on which interfaces to use for what? I could theoretically
> push
> > 36Gb/s, figuring 125MB/s for each drive, but in reality will I ever see
> > that?
> Not by a long shot, even with Bluestore.
> With the WAL and other bits on SSD and very kind write patterns, maybe
> 100MB/s per drive, but IIRC there were issues with current Bluestore and
> performance as well.
>
> >Perhaps bond the two 10GB and use them as the public, and the 40gb as
> > the cluster network? Or split the 40gb in to 4x10gb and use 3x10GB bonded
> > for each?
> >
> If you can actually split it up, see above, mc-LAG.
> That will give you 60Gb/s, half that if a switch fails, and if it makes you
> feel better, do the cluster and public with VLANs.
>
> But that will cost you in not so cheap switch ports, of course.
>
> Christian
> > If there is a more appropriate venue for my request, please point me in
> > that direction.
> >
> > Thanks,
> > Dan
>
>
> --
> Christian Balzer        Network/Systems Engineer
> [email protected]           Rakuten Communications
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com