Thank you very much for your answer!
> Yes, I gathered that.
> The question is, what servers between the Windows clients and the final
> Ceph storage are you planning to use.

That I do not yet understand. While I believe that the client can be
connected directly to Ceph :) I will read on :)

>Monday, 22 August 2016, 10:57 +03:00 from Christian Balzer <[email protected]>:
>
>On Mon, 22 Aug 2016 10:18:51 +0300 Александр Пивушков wrote:
>
>> Hello,
>> Several answers below
>>
>> >Wednesday, 17 August 2016, 8:57 +03:00 from Christian Balzer <[email protected]>:
>> >
>> >
>> >Hello,
>> >
>> >On Wed, 17 Aug 2016 09:27:30 +0500 Дробышевский, Владимир wrote:
>> >
>> >> Christian,
>> >>
>> >> thanks a lot for your time. Please see below.
>> >>
>> >>
>> >> 2016-08-17 5:41 GMT+05:00 Christian Balzer <[email protected]>:
>> >>
>> >> >
>> >> > Hello,
>> >> >
>> >> > On Wed, 17 Aug 2016 00:09:14 +0500 Дробышевский, Владимир wrote:
>> >> >
>> >> > > So the demands look like these:
>> >> > >
>> >> > > 1. He has a number of clients which need to periodically write a set of
>> >> > > data as big as 160GB to a storage. The acceptable write speed is about a
>> >> > > minute for such an amount, so it is around 2700-2800MB per second. Each
>> >> > > write session will happen in a dedicated manner.
>> >> >
>> >> > Let me confirm that "dedicated" here means non-concurrent, sequential.
>> >> > So not more than one client at a time, the cluster and network would be
>> >> > good if doing 3GB/s?
>> >> >
>> >> Yes, this is what I meant.
>> >>
>> >That's good to know, it makes that data dump from a single client/server
>> >at least marginally possible, without resorting to even more expensive
>> >network infrastructure.
>> >
>> >>
>> >> >
>> >> > Note that with IPoIB and QDR 3GB/s is about the best you can hope for,
>> >> > that's with a single client of course.
>> >> >
>> >> I understand, thank you.
>> >> Alexander doesn't have any setup yet and would
>> >> like to build a cost-effective one (not exactly 'cheap', but with minimal
>> >> costs to satisfy the requirements), so I've recommended him QDR IB as a minimal
>> >> setup if they are able to live with used hardware (which is pretty
>> >> cheap in general and would allow an inexpensive multi-port-per-server
>> >> setup with bonding, but is hard to get in Russia), or FDR if it is only
>> >> possible to get new network hardware.
>> >>
>> >Single link QDR should do the trick.
>> >Bonding via a Linux bondN interface with IPoIB currently only supports
>> >failover (active-standby), not load balancing.
>> >Never mind that load balancing may still not improve bandwidth for a
>> >single client talking to a single target (it would help on a server
>> >talking to Ceph, thus multiple OSD nodes).
>> >
>> >There are of course other ways of using 2 interfaces to achieve higher
>> >bandwidth, like using routing to the host.
>> >But that gets more involved.
>> We decided to test, buy the 40GbE.
>> There will be two links. One on the external network. Another on the internal
>> network.
>
>Splitting Ceph into an internal (cluster, replication) and external
>(client) network only makes sense in your case if you have more than that
>bandwidth on your local storage.
>Which would mean more than 4x 1.6TB DC P3608s per node, 4GB/s.
>Don't think you need or want to afford that.
>
>Also having just 1 link w/o failover and 2 switches (active-active with
>MC-LAG or active-backup) is a bad idea.
>
>> >
>> >
>> >>
>> >> >
>> >> > >Data read should also be
>> >> > > pretty fast. The written data must be shared after the write.
>> >> > Fast reading might be achieved by these factors:
>> >> > a) lots of RAM, to hold all FS SLAB data and of course page cache.
>> >> > b) splitting writes and reads amongst the pools by using readforward
>> >> > cache mode, so writes go (primarily, initially) to the SSD cache pool and
>> What is "readforward cache mode"?
>
>This (read the tracker link on that page), unfortunately still
>un-documented.
>
>> >
>> >> > (cold) reads come from the HDD base pool.
>> >> > c) having a large cache pool.
>> >> >
>> >> > >Clients OS -
>> >> > > Windows.
>> >> > So what server(s) are they writing to?
>> >> > I don't think that the Windows RBD port (dokan) is a well-tested
>> >> > implementation, besides not being updated for a year or so.
>> Now everything is written to local Intel NVMe P3608 drives.
>
>Yes, I gathered that.
>The question is, what servers between the Windows clients and the final
>Ceph storage are you planning to use.
>
>> >
>> >> >
>> >> This is the question I haven't asked (I hope Alexander will read this and
>> >> write me an answer, and I'll relay it here), but I believe they use local
>> >> P3608s for this at the moment. The main problem is that P3608s are pretty
>> >> expensive, and a local setup doesn't provide enough reliability, so they
>> >> would like to build a cost-effective, reliable setup with more inexpensive
>> >> drives, as well as providing network storage for other data.
>> >> The situation with dokan is exactly what I thought and told Alexander. So
>> >> the only way is to set up intermediate servers, which will significantly
>> >> reduce speed.
>> >>
>> >I haven't even tried to use Samba or NFS on top of RBD or CephFS, but
>> >given that fio (with direct=1!) gives me the full speed of the OSDs, same
>> >as with a "cp -ar", I'd hope that such file servers wouldn't be
>> >significantly slower than their storage system.
>> Can you tell us more about the use of Samba?
>> Do we need something special, or all defaults?
>As I said, I haven't used it (with Ceph), but it should be able to deliver most
>of the speed, based on my own experience with local disks and all the tuning
>guides out there, like this one:
>http://www.eggplant.pro/blog/faster-samba-smb-cifs-share-performance/
>
>> >
>> >
>> >>
>> >> > > 2. It is necessary to have a regular storage as well. He thinks about
>> >> > > 1.2PB HDD storage with a 34TB SSD cache tier at the moment.
>> >> > >
>> >> > A 34TB cache pool with (at the very least) 2x replication will not be
>> >> > cheap.
>> >> >
>> >> > > The main question with an answer I don't have is how to
>> >> > > calculate\predict per-client write speed for a Ceph cluster?
>> >> > This question has been asked before and in fact quite recently, see the
>> >> > very short lived "Ceph performance calculator" thread.
>> >> >
>> >> Thank you, I've found it. I've been following the list for a pretty
>> >> long time but it seems that I missed this discussion.
>> >>
>> >> >
>> >> > In short, too many variables.
>> >> >
>> >> > >For example, if there will be a
>> >> > > cache tier or even a dedicated SSD-only pool with Intel S3710 or Samsung
>> >> > > SM863 drives - how to get an approximation for the write speed? Concurrent
>> >> > > writes to 6-8 good SSD drives could probably give such speed, but is it
>> >> > > true for the cluster in general?
>> >> >
>> >> > Since we're looking here at one of the relatively few use cases where
>> >> > bandwidth/throughput is the main factor and not IOPS, this calculation
>> >> > becomes a bit easier and more predictable.
>> >> > For an example, see my recent post:
>> >> > "Better late than never, some XFS versus EXT4 test results"
>> >> >
>> >> Found it too, thanks! Very useful tests. Besides the current topic,
>> >> wouldn't btrfs give some advantages in case of a pure SSD pool with inline
>> >> (on the same drive) journals?
>> >>
>> >In theory yes, but I think the bigger win here is with IOPS, as opposed to
>> >throughput.
>> >With BTRFS you could use filestore_journal_parallel, but AFAIK that will
>> >still result in 2 writes, so the full speed of the drive won't be
>> >available either.
>> >Main advantage here would be that either a successful journal or FS
>> >write will result in an ACK, so if the FS is faster you get some speedup.
>> >
>> >The question is, how well tested is this code path, by the automatic Ceph
>> >build tests and users out there?
>> >At least fragmentation wouldn't matter with SSDs. ^o^
>> >
>> >At this point in time, I'd go with "well supported" and migrate to
>> >Bluestore once that becomes trustworthy.
>> Do I understand correctly that BlueStore can now be safely and advantageously
>> used in production?
>
>Definitely not, BlueStore is not production ready and won't be for at
>least 1-2 more releases, so sometime next year at the earliest.
>
>Christian
>> >
>> >
>> >Christian
>> >>
>> >> > Which basically shows that with sufficient network bandwidth all
>> >> > available drive speed can be utilized.
>> >> >
>> >> > With fio randwrite and 4MB blocks the above setup gives me 440MB/s and
>> >> > with 4K blocks 8000 IOPS.
>> >> > So throughput-wise, 100% utilization, full speed present.
>> >> > IOPS, less than a third (the SSDs are at 33% utilization, the delays are
>> >> > caused by Ceph and network latencies).
>> >> >
>> >> > >3 sets per 8 drives in 13 servers (with an
>> >> > > additional overhead for the network operations, ACKs and placement
>> >> > > calculations), QDR or FDR InfiniBand or 40GbE; we know the drive specs, is
>> >> > > there a formula to calculate speed expectations from the raw speed
>> >> > > and/or IOPS point of view?
>> >> > >
>> >> >
>> >> > Let's look at a simplified example:
>> >> > 10 nodes (with fast enough CPU cores to fully utilize those SSDs/NVMes),
>> >> > 40Gb/s (QDR, Ether) interconnects.
>> >> > Each node with 2 1.6TB P3608s, which are rated at 2000MB/s write
>> >> > speeds.
>> >> > Of course journals need to go somewhere, so the effective speed is half
>> >> > of that.
>> >> > Thus we get a top speed per node of 2GB/s.
>> >> > With a replication of 2 we would get a 10GB/s write-capable cluster, with
>> >> > 3 it's down to a theoretical 6.6GB/s.
>> >> >
>> >> > I'm ignoring the latency and ACK overhead up there, which have a
>> >> > significantly lower impact on throughput than on IOPS.
>> >>
>> >> > Having a single client or intermediary file server write all that to the
>> >> > Ceph cluster over a single link is the bit I'd be more worried about.
>> >> >
>> >> I totally agree.
>> >>
>> >> >
>> >> > Christian
>> >> >
>> >> > > Or, from another side, if there are pre-requisites, how to be sure
>> >> > > the projected cluster meets them? I'm pretty sure it's a typical task,
>> >> > > how would you solve it?
>> >> > >
>> >> > > Thanks a lot in advance and best regards,
>> >> > > Vladimir
>> >> > >
>> >> > >
>> >> > > Best regards,
>> >> > > Дробышевский Владимир
>> >> > > "АйТи Город" company
>> >> > > +7 343 2222192
>> >> > >
>> >> > > Hardware and software:
>> >> > > IBM, Microsoft, Eset
>> >> > > Turnkey project delivery
>> >> > > IT services outsourcing
>> >> > >
>> >> > > 2016-08-08 19:39 GMT+05:00 Александр Пивушков <[email protected]>:
>> >> > >
>> >> > > > Hello dear community!
>> >> > > > I'm new to Ceph and not long ago took up the theme of building
>> >> > > > clusters.
>> >> > > > Therefore your opinion is very important to me.
>> >> > > >
>> >> > > > It is necessary to create a cluster with 1.2 PB of storage and very
>> >> > > > rapid access to data.
>> >> > > > Earlier, "Intel® SSD DC P3608 Series 1.6TB NVMe
>> >> > > > PCIe 3.0 x4 Solid State Drive" disks were used; their speed fully
>> >> > > > satisfies us, but with the increase of storage volume the price of such
>> >> > > > a cluster grows very strongly, and so there was an idea to use Ceph.
>> >> > > > There are the following requirements:
>> >> > > >
>> >> > > > - An amount of data of 160 GB should be read and written at the speeds
>> >> > > > of the SSD P3608
>> >> > > > - A high-speed storage of SSD drives, 36 TB in volume, must be created,
>> >> > > > with read/write speed tending to the SSD P3608
>> >> > > > - A 1.2 PB store must be created, with access speed the bigger the
>> >> > > > better...
>> >> > > > - Must have triple redundancy
>> >> > > > I do not really understand yet how to create a configuration with SSD
>> >> > > > P3608 disks. Of course, the configuration needs to be changed; it is
>> >> > > > very expensive.
>> >> > > >
>> >> > > > InfiniBand will be used, and 40 GB Ethernet.
>> >> > > > We will also use virtualization on high-performance hardware to
>> >> > > > optimize the number of physical servers.
>> >> > > > I'm not tied to specific server models and manufacturers. I create
>> >> > > > only the cluster scheme, which should be criticized :)
>> >> > > >
>> >> > > > 1. OSD - 13 pieces.
>> >> > > >      a. 1.4 TB SSD drive, analogue of Intel® SSD DC P3608 Series - 2 pieces
>> >> > > >      b. Fibre Channel 16 Gbit/s - 2 ports.
>> >> > > >      c. An array (not RAID) of up to 284 TB of SATA-based drives (36
>> >> > > > drives of 8TB);
>> >> > > >      d. 360 GB SSD, analogue of Intel SSD DC S3500 - 1 piece
>> >> > > >      e. SATA drive 40 GB for installation of the operating system (or
>> >> > > > booting from the network, which is preferable)
>> >> > > >      f. RAM 288 GB
>> >> > > >      g. 2 x CPU - 9 cores at 2 GHz - E5-2630v4
>> >> > > > 2.
>> >> > > > MON - 3 pieces. All virtual servers:
>> >> > > >      a. 1 Gbps Ethernet - 1 port.
>> >> > > >      b. SATA drive 40 GB for installation of the operating system (or
>> >> > > > booting from the network, which is preferable)
>> >> > > >      c. SATA drive 40 GB
>> >> > > >      d. 6GB RAM
>> >> > > >      e. 1 x CPU - 2 cores at 1.9 GHz
>> >> > > > 3. MDS - 2 pcs. All virtual servers:
>> >> > > >      a. 1 Gbps Ethernet - 1 port.
>> >> > > >      b. SATA drive 40 GB for installation of the operating system (or
>> >> > > > booting from the network, which is preferable)
>> >> > > >      c. SATA drive 40 GB
>> >> > > >      d. 6GB RAM
>> >> > > >      e. 1 x CPU - min. 2 cores at 1.9 GHz
>> >> > > >
>> >> > > > I assume I will use SSDs for acceleration, for the cache and the OSD
>> >> > > > journal.
>> >> > > >
>> >> > > > --
>> >> > > > Alexander Pushkov
>> >> > > >
>> >> > > > _______________________________________________
>> >> > > > ceph-users mailing list
>> >> > > > [email protected]
>> >> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> > > >
>> >> > > >
>> >> >
>> >> >
>> >> > --
>> >> > Christian Balzer           Network/Systems Engineer
>> >> > [email protected]    Global OnLine Japan/Rakuten Communications
>> >> > http://www.gol.com/
>> >> >
>> >>
>> >> --
>> >> Best regards,
>> >> Vladimir
>> >
>> >
>> >--
>> >Christian Balzer           Network/Systems Engineer
>> >[email protected]    Global OnLine Japan/Rakuten Communications
>> >http://www.gol.com/
>> >_______________________________________________
>> >ceph-users mailing list
>> >[email protected]
>> >http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>--
>Christian Balzer           Network/Systems Engineer
>[email protected]    Global OnLine Japan/Rakuten Communications
>http://www.gol.com/

--
Александр Пивушков
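P.S. To make the numbers in this thread easier to check, here is a rough back-of-the-envelope sketch in Python of the two throughput figures discussed above: the 160GB-per-minute client requirement and Christian's simplified 10-node example. The drive rating, journal factor and replication levels are taken from the mails; everything else is plain arithmetic, not a benchmark.

```python
# Sanity check of the throughput figures from the thread.
# All inputs come from the mails above; this is arithmetic only.

# 1) Client-side requirement: 160GB written in about one minute.
required_mb_s = 160 * 1024 / 60
print(f"required: ~{required_mb_s:.0f} MB/s")  # ~2731 MB/s, matching "2700-2800MB per second"

# 2) Christian's simplified cluster example:
nodes = 10
drives_per_node = 2
drive_write_mb_s = 2000      # rated write speed of one 1.6TB P3608
journal_factor = 0.5         # journal on the same device halves effective speed

node_mb_s = drives_per_node * drive_write_mb_s * journal_factor  # 2GB/s per node
for replication in (2, 3):
    cluster_gb_s = nodes * node_mb_s / replication / 1000
    print(f"replication {replication}: ~{cluster_gb_s:.1f} GB/s")  # 10.0 and 6.7
```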
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
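P.P.S. The same kind of arithmetic applies to the proposed 13-node OSD scheme with triple redundancy. Note that 36 x 8TB is 288TB raw per node, slightly above the ~284TB quoted in the scheme; the sketch below uses the 36 x 8TB figure and ignores filesystem and Ceph overhead.

```python
# Rough capacity check for the proposed 13-node OSD layout
# (figures from the scheme above; no FS/Ceph overhead included).

nodes = 13
drives_per_node = 36
drive_tb = 8
replication = 3              # "triple redundancy" requirement

raw_tb = nodes * drives_per_node * drive_tb       # 3744 TB raw
usable_pb = raw_tb / replication / 1000           # ~1.25 PB usable
print(f"raw: {raw_tb} TB, usable: ~{usable_pb:.2f} PB")  # close to the 1.2 PB target
```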
