Christian,

Thank you for the tips -- I certainly googled my eyes out for a good while before asking; maybe my google-fu wasn't too good last night.
> I love using IB, alas with just one port per host you're likely best off
> ignoring it, unless you have a converged network/switches that can make
> use of it (or run it in Ethernet mode).

I've always heard people speak fondly of IB, but I've honestly never dealt
with it. I'm mostly a network guy at heart, so I'm perfectly comfortable
aggregating 10Gb/s connections till the cows come home. What are some of
the virtues of IB over Ethernet (not Ethernet over IB)?

> Bluestore doesn't have journals per se and unless you're going to wait for
> Luminous I wouldn't recommend using Bluestore in production.
> Hell, I won't be using it any time soon, but anything pre L sounds
> like outright channeling Murphy to smite you.

I do like to play with fire, but not normally with other people's data. I
suppose I'll stay away from Bluestore for now, unless Luminous is released
within the next few weeks. I am using it on Kraken in my small test cluster,
so far without a visit from Murphy.

> That said, what SSD is it?
> Bluestore WAL needs are rather small.
> OTOH, a single SSD isn't something I'd recommend either, SPOF and all.
> I'm guessing you have no budget to improve on that gift horse?

It's a Micron 1100 256GB, rated for 120 TBW, which works out to roughly
110GB/day over its 3-year warranty -- well under 0.5 DWPD (back-of-the-envelope
below). I doubt it has the endurance to journal 36x 1TB drives.

I do have some room in the budget, and NVMe journals have been in the back
of my mind. These servers have 6 PCIe x8 slots, so there's plenty of room.
But then I'll get asked about a cache tier, which everyone seems to think is
the holy grail (and it probably would be, if it could "just work"). From
what I read, though, cache tiers are a nightmare to manage, particularly
without a well-defined workload, and often hurt more than they help.

I haven't spent much time with the network gear that was dumped on me, but
the switches I have now are a Nexus 7000, 4x Force10 S4810 (so I do have
some stackable 10Gb that I can MC-LAG), 2x Mellanox IS5023 (18-port IB
switch), what appears to be a giant IB switch (QLogic 12800-120), and
another apparently big boy (QLogic 12800-180). I'm picking them up from the
warehouse tomorrow.

If I stay away from IB completely, I may just use the IB card as 4x 10Gb
plus the 2x 10Gb on board, like I originally mentioned (rough config sketch
below). But if that IB gear is good, I'd hate to see it go to waste; it
might be worth getting a second IB card for each server.

Again, thanks a million for the advice. I'd rather learn this the easy way
than have to rebuild this six times over the next six months.
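P.S. The back-of-the-envelope on that SSD, for anyone checking my math
(the 120 TBW and 3-year figures are the ones quoted above; the rest is
just arithmetic):

    # Rough endurance math for the Micron 1100 256GB (120 TBW, 3-year warranty)
    capacity_gb = 256
    tbw = 120                       # rated terabytes written
    days = 3 * 365                  # warranty period in days

    gb_per_day = tbw * 1000 / days  # ~110 GB/day of rated writes
    dwpd = gb_per_day / capacity_gb

    print(f"{gb_per_day:.0f} GB/day -> {dwpd:.2f} DWPD")  # ~110 GB/day -> ~0.43 DWPD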
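And to make the public/cluster split concrete, here's a minimal sketch of
what I'm picturing in ceph.conf -- subnets are made up, and it assumes the
on-board 2x 10Gb carry the public network while the 40Gb (or its 4x 10Gb
breakout) carries the cluster network on its own VLAN, per your suggestion:

    [global]
    public network  = 10.10.1.0/24   # bond0: 2x 10Gb on-board, LACP to the stacked S4810s
    cluster network = 10.10.2.0/24   # 40Gb port (or 4x 10Gb breakout), separate VLAN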
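The bonds themselves would be plain 802.3ad/LACP toward the S4810 pair
(they do VLT, so MC-LAG should work). Something along these lines with
Debian-style ifupdown -- interface names and addresses are placeholders,
not what these boxes actually call their ports:

    auto bond0
    iface bond0 inet static
        address 10.10.1.11
        netmask 255.255.255.0
        bond-slaves eno1 eno2
        bond-mode 802.3ad              # LACP; needs a matching port-channel on the switch pair
        bond-miimon 100
        bond-xmit-hash-policy layer3+4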
On Tue, Jun 6, 2017 at 2:05 AM, Christian Balzer <[email protected]> wrote:
>
> Hello,
>
> lots of similar questions in the past, google is your friend.
>
> On Mon, 5 Jun 2017 23:59:07 -0400 Daniel K wrote:
>
> > I've built 'my-first-ceph-cluster' with two of the 4-node, 12-drive
> > Supermicro servers and dual 10Gb interfaces (one cluster, one public).
> >
> > I now have 9x 36-drive Supermicro StorageServers made available to me,
> > each with dual 10Gb and a single Mellanox IB/40G NIC. No 1G interfaces
> > except IPMI. 2x 6-core 6-thread 1.7GHz Xeon processors (12 cores total)
> > for 36 drives. Currently 32GB of RAM. 36x 1TB 7.2k drives.
> >
> I love using IB, alas with just one port per host you're likely best off
> ignoring it, unless you have a converged network/switches that can make
> use of it (or run it in Ethernet mode).
>
> > Early usage will be CephFS, exported via NFS and mounted on ESXi 5.5 and
> > 6.0 hosts (migrating from a VMware environment), later to transition to
> > qemu/kvm/libvirt using native RBD mapping. I tested iSCSI using LIO and
> > saw much worse performance with the first cluster, so it seems this may
> > be the better way, but I'm open to other suggestions.
> >
> I've never seen any ultimate solution to providing HA iSCSI on top of
> Ceph, though other people here have made significant efforts.
>
> > Considerations:
> > Best practice documents indicate .5 CPU per OSD, but I have 36 drives
> > and 12 CPUs. Would it be better to create 18x 2-drive RAID0 on the
> > hardware RAID card to present a fewer number of larger devices to Ceph?
> > Or run multiple drives per OSD?
> >
> You're definitely underpowered in the CPU department and I personally
> would make RAID1 or 10s for never having to re-balance an OSD.
> But if space is an issue, RAID0s would do.
> OTOH, w/o any SSDs in the game your HDD-only cluster is going to be less
> CPU hungry than others.
>
> > There is a single 256GB SSD which I feel would be a bottleneck if I used
> > it as a journal for all 36 drives, so I believe Bluestore with a journal
> > on each drive would be the best option.
> >
> Bluestore doesn't have journals per se and unless you're going to wait for
> Luminous I wouldn't recommend using Bluestore in production.
> Hell, I won't be using it any time soon, but anything pre L sounds
> like outright channeling Murphy to smite you.
>
> That said, what SSD is it?
> Bluestore WAL needs are rather small.
> OTOH, a single SSD isn't something I'd recommend either, SPOF and all.
>
> I'm guessing you have no budget to improve on that gift horse?
>
> > Is 1.7GHz too slow for what I'm doing?
> >
> If you're going to have a lot of small I/Os it probably will be.
>
> > I like the idea of keeping the public and cluster networks separate.
> >
> I don't, at least not on a physical level when you pay for this by losing
> redundancy.
> Do you have 2 switches, are they MC-LAG capable (aka stackable)?
>
> > Any suggestions on which interfaces to use for what? I could
> > theoretically push 36Gb/s, figuring 125MB/s for each drive, but in
> > reality will I ever see that?
> >
> Not by a long shot, even with Bluestore.
> With the WAL and other bits on SSD and very kind write patterns, maybe
> 100MB/s per drive, but IIRC there were issues with current Bluestore and
> performance as well.
>
> > Perhaps bond the two 10Gb and use them as the public, and the 40Gb as
> > the cluster network? Or split the 40Gb into 4x 10Gb and use 3x 10Gb
> > bonded for each?
> >
> If you can actually split it up, see above, MC-LAG.
> That will give you 60Gb/s, half that if a switch fails, and if it makes
> you feel better, do the cluster and public with VLANs.
>
> But that will cost you in not-so-cheap switch ports, of course.
>
> Christian
>
> > If there is a more appropriate venue for my request, please point me in
> > that direction.
> >
> > Thanks,
> > Dan
> >
> --
> Christian Balzer        Network/Systems Engineer
> [email protected]       Rakuten Communications
