Hello,

Several answers below.

>Wednesday, 17 August 2016, 8:57 +03:00 from Christian Balzer <[email protected]>:
>
>Hello,
>
>On Wed, 17 Aug 2016 09:27:30 +0500 Дробышевский, Владимир wrote:
>
>> Christian,
>>
>> thanks a lot for your time. Please see below.
>>
>> 2016-08-17 5:41 GMT+05:00 Christian Balzer < [email protected] >:
>>
>> > Hello,
>> >
>> > On Wed, 17 Aug 2016 00:09:14 +0500 Дробышевский, Владимир wrote:
>> >
>> > > So demands look like these:
>> > >
>> > > 1. He has a number of clients which need to periodically write a set of
>> > > data as big as 160GB to a storage. The acceptable write time is about a
>> > > minute for such an amount, so it is around 2700-2800MB per second. Each
>> > > write session will happen in a dedicated manner.
>> >
>> > Let me confirm that "dedicated" here means non-concurrent, sequential.
>> > So not more than one client at a time, the cluster and network would be
>> > good if doing 3GB/s?
>> >
>> Yes, this is what I meant.
>>
>That's good to know, it makes that data dump from a single client/server
>at least marginally possible, without resorting to even more expensive
>network infrastructure.
>
>> >
>> > Note that with IPoIB and QDR 3GB/s is about the best you can hope for,
>> > that's with a single client of course.
>> >
>> I understand, thank you. Alexander doesn't have any setup yet and would
>> like to build a cost-effective one (not exactly 'cheap', but with minimal
>> costs to satisfy requirements), so I've recommended him QDR IB as a minimal
>> setup if they will be able to live with the used hardware (which is pretty
>> cheap in general and would allow to make inexpensive multi-port per server
>> setup with bonding, but hard to get in Russia) or FDR if it is possible
>> to get new network hardware only.
>>
>Single link QDR should do the trick.
>Bonding via a Linux bondn: interface with IPoIB currently only supports
>failover (active-standby), not load balancing.
>Never mind that load balancing may still not improve bandwidth for a
>single client talking to a single target (it would help on a server
>talking to Ceph, thus multiple OSD nodes).
>
>There are of course other ways of using 2 interfaces to achieve higher
>bandwidth, like using routing to the host.
>But that gets more involved.

We decided to buy and test 40GbE. There will be two links: one on the external network and another on the internal network.

>
>> >
>> > >Data read should also be
>> > > pretty fast. The written data must be shared after the write.
>> > Fast reading might be achieved by these factors:
>> > a) lots of RAM, to hold all FS SLAB data and of course page cache.
>> > b) splitting writes and reads amongst the pools by using readforward cache
>> > mode, so writes go (primarily, initially) to the SSD cache pool and

What is "readforward cache mode"?

>> > (cold) reads come from the HDD base pool.
>> > c) having a large cache pool.
>> >
>> > >Clients OS -
>> > > Windows.
>> > So what server(s) are they writing to?
>> > I don't think that Windows RBD port (dokan) is a well tested
>> > implementation, besides not being updated for a year or so.

At the moment everything is written to local Intel NVMe P3608 drives.

>> >
>> This is the question I haven't asked (I hope Alexander will read this and
>> write me an answer, and I answer here), but I believe they use local P3608
>> for this at the moment. The main problem is that P3608s are pretty
>> expensive, and local setup doesn't provide enough reliability, so they
>> would like to build a cost-effective reliable setup with less expensive
>> drives as well as providing a network storage for other data as well.
>> The situation with dokan is exactly what I thought and told Alexander. So
>> the only way is to set up intermediate servers which will significantly
>> reduce speed.
>>
>I haven't even tried to use Samba or NFS on top of RBD or CephFS, but
>given that fio (with direct=1!) gives me the full speed of the OSDs, same
>as with a "cp -ar", I'd hope that such file servers wouldn't be
>significantly slower than their storage system.

Can you tell us more about the use of Samba? Should we use something special, or is everything left at the defaults?

>
>> > > 2. It is necessary to have a regular storage as well. He thinks about
>> > > 1.2PB HDD storage with 34TB SSD cache tier at the moment.
>> > >
>> > A 34TB cache pool with (at the very least) 2x replication will not be
>> > cheap.
>> >
>> > > The main question with an answer I don't have is how to calculate\predict
>> > > per client write speed for a ceph cluster?
>> > This question has been asked before and in fact quite recently, see the
>> > very short lived "Ceph performance calculator" thread.
>> >
>> Thank you, I've found it. I've been following the list for a pretty
>> long time but it seems that I missed this discussion.
>>
>> >
>> > In short, too many variables.
>> >
>> > >For example, if there will be a
>> > > cache tier or even a dedicated SSD-only pool with Intel S3710 or Samsung
>> > > SM863 drives - how to get approximation for the write speed? Concurrent
>> > > writes to the 6-8 good SSD drives could probably give such speed, but is
>> > > it true for the cluster in general?
>> >
>> > Since we're looking here at one of the relatively few use cases where
>> > bandwidth/throughput is the main factor and not IOPS, this calculation
>> > becomes a bit easier and predictable.
>> > For an example, see my recent post:
>> > "Better late than never, some XFS versus EXT4 test results"
>> >
>> Found it too, thanks! Very useful tests. Aside from the current topic,
>> wouldn't btrfs give some advantages in case of pure SSD pool with inline
>> (on the same drive) journals?
>>
>In theory yes, but I think the bigger win here is with IOPS, as opposed to
>throughput.
>With BTRFS you could use filestore_journal_parallel, but AFAIK that will
>still result in 2 writes, so the full speed of the drive won't be
>available either.
>Main advantage here would be that either a successful journal or FS
>write will result in an ACK, so if the FS is faster you get some speedup.
>
>The question is, how well tested is this code path, by the automatic Ceph
>build tests and users out there?
>At least fragmentation wouldn't matter with SSDs. ^o^
>
>At this point in time, I'd go with "well supported" and migrate to
>Bluestore once that becomes trustworthy.

Do I understand correctly that Bluestore can now be used safely and to advantage in production?

>
>Christian
>>
>> > Which basically shows that with sufficient network bandwidth all available
>> > drive speed can be utilized.
>> >
>> > With fio randwrite and 4MB blocks the above setup gives me 440MB/s and
>> > with 4K blocks 8000 IOPS.
>> > So throughput wise, 100% utilization, full speed present.
>> > IOPS, less than a third (the SSDs are at 33% utilization, the delays are
>> > caused by Ceph and network latencies).
>> >
>> > >3 sets per 8 drives in 13 servers (with an
>> > > additional overhead for the network operations, ACKs and placement
>> > > calculations), QDR or FDR InfiniBand or 40GbE; we know drive specs, does
>> > > a formula exist to calculate speed expectations from the raw speed
>> > > and/or IOPS point of view?
>> > >
>> >
>> > Let's look at a simplified example:
>> > 10 nodes (with fast enough CPU cores to fully utilize those SSDs/NVMes),
>> > 40Gb/s (QDR, Ether) interconnects.
>> > Each node with 2 1.6TB P3608s, which are rated at 2000MB/s write speeds.
>> > Of course journals need to go somewhere, so the effective speed is half
>> > of that.
>> > Thus we get a top speed per node of 2GB/s.
>> > With a replication of 2 we would get a 10GB/s write capable cluster, with
>> > 3 it's down to a theoretical 6.6GB/s.
>> >
>> > I'm ignoring the latency, ACK overhead up there, which has a significantly
>> > lower impact on throughput than on IOPS.
>>
>> > Having a single client or intermediary file server write all that to the
>> > Ceph cluster over a single link is the bit I'd be more worried about.
>> >
>> I totally agree.
>>
>> >
>> > Christian
>> >
>> > > Or, from another side, if prerequisites exist, how to be sure
>> > > the projected cluster meets them? I'm pretty sure it's a typical task,
>> > > how would you solve it?
>> > >
>> > > Thanks a lot in advance and best regards,
>> > > Vladimir
>> > >
>> > > Best regards,
>> > > Дробышевский Владимир
>> > > Company "АйТи Город"
>> > > +7 343 2222192
>> > >
>> > > Hardware and software
>> > > IBM, Microsoft, Eset
>> > > Turnkey project delivery
>> > > IT services outsourcing
>> > >
>> > > 2016-08-08 19:39 GMT+05:00 Александр Пивушков < [email protected] >:
>> > >
>> > > > Hello dear community!
>> > > > I'm new to Ceph and only recently took up the topic of building
>> > > > clusters.
>> > > > Therefore your opinion is very important to me.
>> > > >
>> > > > It is necessary to create a cluster with 1.2 PB of storage and very rapid
>> > > > access to data. Earlier, "Intel® SSD DC P3608 Series 1.6TB NVMe
>> > > > PCIe 3.0 x4 Solid State Drive" disks were used. Their speed fully
>> > > > satisfies us, but as the storage volume increases, the price of such a
>> > > > cluster grows very strongly, and therefore there was an idea to use Ceph.
>> > > > There are the following requirements:
>> > > >
>> > > > - 160 GB of data should be read and written at the speed of the SSD
>> > > > P3608
>> > > > - A high-speed storage of SSD drives with 36 TB volume must be created,
>> > > > with read/write speed approaching that of the SSD P3608
>> > > > - A 1.2 PB store must be created, with access speed the higher the
>> > > > better ...
>> > > > - Must have triple redundancy
>> > > > I do not really understand it well yet, so I created a configuration with
>> > > > SSD P3608 disks. Of course, the configuration needs to be changed, it is
>> > > > very expensive.
>> > > >
>> > > > InfiniBand will be used, and 40 Gb Ethernet.
>> > > > We will also use virtualization on high-performance hardware to optimize
>> > > > the number of physical servers.
>> > > > I'm not tied to specific server models and manufacturers. I have only
>> > > > drafted the cluster scheme, which should be criticized :)
>> > > >
>> > > > 1. OSD - 13 pieces.
>> > > >    a. 1.4 TB SSD drive, analogue of Intel® SSD DC P3608 Series - 2 pieces
>> > > >    b. Fibre Channel 16 Gbit/s - 2 ports.
>> > > >    c. An array (not RAID) of up to 284 TB of SATA-based drives (36 drives
>> > > >       of 8 TB each);
>> > > >    d. 360 GB SSD, analogue of Intel SSD DC S3500 - 1 piece
>> > > >    e. SATA drive 40 GB for installation of the operating system (or
>> > > >       booting from the network, which is preferable)
>> > > >    f. RAM 288 GB
>> > > >    g. 2 x CPU - 9 cores, 2 GHz - E5-2630v4
>> > > > 2. MON - 3 pieces. All virtual servers:
>> > > >    a. 1 Gbps Ethernet - 1 port.
>> > > >    b. SATA drive 40 GB for installation of the operating system (or
>> > > >       booting from the network, which is preferable)
>> > > >    c. SATA drive 40 GB
>> > > >    d. 6 GB RAM
>> > > >    e. 1 x CPU - 2 cores at 1.9 GHz
>> > > > 3. MDS - 2 pieces. All virtual servers:
>> > > >    a. 1 Gbps Ethernet - 1 port.
>> > > >    b. SATA drive 40 GB for installation of the operating system (or
>> > > >       booting from the network, which is preferable)
>> > > >    c. SATA drive 40 GB
>> > > >    d. 6 GB RAM
>> > > >    e. 1 x CPU - min. 2 cores at 1.9 GHz
>> > > >
>> > > > I plan to use SSDs for acceleration, as a cache and for the OSD journal.
>> > > >
>> > > > --
>> > > > Alexander Pushkov
>> > > >
>> > > > _______________________________________________
>> > > > ceph-users mailing list
>> > > > [email protected]
>> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > > >
>> >
>> > --
>> > Christian Balzer Network/Systems Engineer
>> > [email protected] Global OnLine Japan/Rakuten Communications
>> > http://www.gol.com/
>> >
>>
>> --
>> Best regards,
>> Vladimir
>
>--
>Christian Balzer Network/Systems Engineer
>[email protected] Global OnLine Japan/Rakuten Communications
>http://www.gol.com/
>_______________________________________________
>ceph-users mailing list
>[email protected]
>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Александр Пивушков
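
P.S. To check that I follow the arithmetic quoted above, here is a minimal Python sketch of it. It is only a rough estimate under the same simplifications as Christian's example: journals on the same drive halve the write speed, and latency and ACK overhead are ignored. The function names, the parameter names and the 1 GB = 1024 MB conversion are my own assumptions for illustration, nothing here comes from Ceph itself.

```python
def required_rate_mb_s(data_gb: float, seconds: float) -> float:
    """Sustained rate (MB/s) needed to write data_gb within the given time.

    Assumes 1 GB = 1024 MB, which lands inside the 2700-2800MB/s
    range mentioned in the thread.
    """
    return data_gb * 1024 / seconds


def cluster_write_gb_s(nodes: int, drives_per_node: int,
                       drive_write_mb_s: float, replication: int,
                       journal_penalty: float = 2.0) -> float:
    """Theoretical aggregate client write speed of the cluster in GB/s.

    journal_penalty=2 models the journal sharing the drive, so every
    client write costs two drive writes (assumption from the example).
    """
    per_node_mb_s = drives_per_node * drive_write_mb_s / journal_penalty
    return nodes * per_node_mb_s / replication / 1000


if __name__ == "__main__":
    # One client dumping 160GB in about a minute.
    print(f"required: {required_rate_mb_s(160, 60):.0f} MB/s")
    # Christian's example: 10 nodes, 2 P3608 per node, 2000MB/s each.
    for size in (2, 3):
        speed = cluster_write_gb_s(10, 2, 2000, size)
        print(f"replication {size}: {speed:.1f} GB/s")
```

If I read it correctly, the cluster total is well above what one client needs, so the single client link stays the bottleneck, as noted above.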
