Re: [ceph-users] Fast Ceph a Cluster with PB storage

Александр Пивушков Mon, 22 Aug 2016 00:19:51 -0700

 Hello,
Several answers below

>Wednesday, 17 August 2016, 8:57 +03:00 from Christian Balzer <[email protected]>:
>
>
>Hello,
>
>On Wed, 17 Aug 2016 09:27:30 +0500 Дробышевский, Владимир wrote:
>
>> Christian,
>> 
>>   thanks a lot for your time. Please see below.
>> 
>> 
>> 2016-08-17 5:41 GMT+05:00 Christian Balzer < [email protected] >:
>> 
>> >
>> > Hello,
>> >
>> > On Wed, 17 Aug 2016 00:09:14 +0500 Дробышевский, Владимир wrote:
>> >
>> > >   So demands look like these:
>> > >
>> > > 1. He has a number of clients which need to periodically write a set of
>> > > data as big as 160GB to storage. The acceptable write time is about a
>> > > minute for such an amount, so it is around 2700-2800MB per second. Each
>> > > write session will happen in a dedicated manner.
>> >
>> > Let me confirm that "dedicated" here means non-concurrent, sequential.
>> > So not more than one client at a time, the cluster and network would be
>> > good if doing 3GB/s?
>> >
>> Yes, this is what I meant.
>>
>That's good to know, it makes that data dump from a single client/server
>at least marginally possible, without resorting to even more expensive
>network infrastructure.
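
For reference, the target rate can be checked with simple arithmetic. This is only a rough sketch in Python, assuming the 160GB and one-minute figures quoted above and the ~3GB/s figure from the question; the function name is made up for illustration. The quoted 2700-2800MB/s matches binary units (GiB to MiB/s); with decimal units the figure is slightly lower.

```python
def required_rate(size_gb: float, seconds: float, binary: bool = False) -> float:
    """Rate needed to move size_gb within `seconds`.

    Returns MB/s for decimal units, or MiB/s when binary=True
    (i.e. the size is interpreted as GiB).
    """
    factor = 1024 if binary else 1000
    return size_gb * factor / seconds


decimal_rate = required_rate(160, 60)               # 160 GB in one minute
binary_rate = required_rate(160, 60, binary=True)   # 160 GiB in one minute

# Assumption: ~3 GB/s is the best case for a single client, as discussed above.
link_mb_s = 3000
headroom = link_mb_s - decimal_rate

print(f"decimal: {decimal_rate:.0f} MB/s, binary: {binary_rate:.0f} MiB/s")
print(f"headroom against a 3 GB/s link: {headroom:.0f} MB/s")
```

Either way the margin against 3GB/s is only a few hundred MB/s, which is why the single-client path deserves attention.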
>
>> 
>> >
>> > Note that with IPoIB and QDR 3GB/s is about the best you can hope for,
>> > that's with a single client of course.
>> >
>> I understand, thank you. Alexander doesn't have any setup yet and would
>> like to build a cost-effective one (not exactly 'cheap', but with minimal
>> costs to satisfy requirements), so I've recommended QDR IB to him as a
>> minimal setup if they are able to live with used hardware (which is pretty
>> cheap in general and would allow an inexpensive multi-port per server
>> setup with bonding, but is hard to get in Russia) or FDR if it is possible
>> to get new network hardware only.
>> 
>Single link QDR should do the trick.
>Bonding via a Linux bondN interface with IPoIB currently only supports
>failover (active-standby), not load balancing.
>Never mind that load balancing may still not improve bandwidth for a
>single client talking to a single target (it would help on a server
>talking to Ceph, thus multiple OSD nodes).
>
>There are of course other ways of using 2 interfaces to achieve higher
>bandwidth, like using routing to the host. 
>But that gets more involved. 
We decided to buy 40GbE for testing.
There will be two links: one on the external network and one on the
internal network.
>
>
>> 
>> >
>> > >Data read should also be
>> > > pretty fast. The written data must be shared after the write.
>> > Fast reading might be achieved by these factors:
>> > a) lots of RAM, to hold all FS SLAB data and of course page cache.
>> > b) splitting writes and reads amongst the pools by using readforward cache
>> > mode, so writes go (primarily, initially) to the SSD cache pool and
What is "readforward cache mode"?


>
>> > (cold) reads come from the HDD base pool.
>> > c) having a large cache pool.
>> >
>> > >Clients OS -
>> > > Windows.
>> > So what server(s) are they writing to?
>> > I don't think that Windows RBD port (dokan) is a well tested
>> > implementation, besides not being updated for a year or so.
Right now everything is written to a local Intel NVMe P3608.

>
>> >
>> This is the question I haven't asked (I hope Alexander will read this and
>> write me an answer, and I will answer here), but I believe they use local
>> P3608s for this at the moment. The main problem is that P3608s are pretty
>> expensive, and a local setup doesn't provide enough reliability, so they
>> would like to build a cost-effective, reliable setup with less expensive
>> drives, as well as provide network storage for other data.
>> The situation with dokan is exactly what I thought and told Alexander. So
>> the only way is to set up intermediate servers, which will significantly
>> reduce speed.
>> 
>I haven't even tried to use Samba or NFS on top of RBD or CephFS, but
>given that fio (with direct=1!) gives me the full speed of the OSDs, same
>as with a "cp -ar", I'd hope that such file servers wouldn't be
>significantly slower than their storage system.
Can you tell us more about using Samba?
Should we use any special settings, or leave everything at the defaults?

>
>
>> 
>> > > 2. It is necessary to have regular storage as well. He is thinking about
>> > > 1.2PB of HDD storage with a 34TB SSD cache tier at the moment.
>> > >
>> > A 34TB cache pool with (at the very least) 2x replication will not be
>> > cheap.
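
To put a number on "not cheap", here is a rough sketch of the raw SSD capacity behind a 34TB cache pool. The 1.6TB drive size is only an assumption for illustration (the P3608 size mentioned in this thread); the helper names are made up, and free-space headroom is ignored.

```python
import math


def raw_capacity_tb(usable_tb: float, replicas: int) -> float:
    """Raw capacity needed to hold usable_tb with `replicas` copies."""
    return usable_tb * replicas


def drives_needed(raw_tb: float, drive_tb: float) -> int:
    """Whole drives needed to provide raw_tb, rounded up."""
    return math.ceil(raw_tb / drive_tb)


CACHE_TB = 34    # cache tier size from the message above
DRIVE_TB = 1.6   # assumed drive size, for illustration only

for replicas in (2, 3):
    raw = raw_capacity_tb(CACHE_TB, replicas)
    count = drives_needed(raw, DRIVE_TB)
    print(f"{replicas}x replication: {raw} TB raw, {count} drives of {DRIVE_TB} TB")
```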
>> >
>> > > The main question with an answer I don't have is how to calculate\predict
>> > > per client write speed for a ceph cluster?
>> > This question has been asked before and in fact quite recently, see the
>> > very short lived "Ceph performance calculator" thread.
>> >
>> Thank you, I've found it. I've been following the list for a pretty
>> long time, but it seems that I missed this discussion.
>> 
>> 
>> >
>> > In short, too many variables.
>> >
>> > >For example, if there will be a
>> > > cache tier or even a dedicated SSD-only pool with Intel S3710 or Samsung
>> > > SM863 drives - how to get an approximation of the write speed? Concurrent
>> > > writes to 6-8 good SSD drives could probably give such speed, but is it
>> > > true for the cluster in general?
>> >
>> > Since we're looking here at one of the relatively few use cases where
>> > bandwidth/throughput is the main factor and not IOPS, this calculation
>> > becomes a bit easier and more predictable.
>> > For an example, see my recent post:
>> > "Better late than never, some XFS versus EXT4 test results"
>> >
>> Found it too, thanks! Very useful tests. Aside from the current topic,
>> wouldn't btrfs give some advantages in the case of a pure SSD pool with
>> inline (on the same drive) journals?
>> 
>In theory yes, but I think the bigger win here is with IOPS, as opposed to
>throughput.
>With BTRFS you could use filestore_journal_parallel, but AFAIK that will
>still result in 2 writes, so the full speed of the drive won't be
>available either. 
>Main advantage here would be that either a successful journal or FS
>write will result in an ACK, so if the FS is faster you get some speedup. 
>
>The question is, how well tested is this code path, by the automatic Ceph
>build tests and users out there?
>At least fragmentation wouldn't matter with SSDs. ^o^
>
>At this point in time, I'd go with "well supported" and migrate to
>Bluestore once that becomes trustworthy.
Do I understand correctly that Bluestore can now be used safely and to
advantage in production?

>
>
>Christian
>> 
>> > Which basically shows that with sufficient network bandwidth all available
>> > drive speed can be utilized.
>> >
>> > With fio randwrite and 4MB blocks the above setup gives me 440MB/s and
>> > with 4K blocks 8000 IOPS.
>> > So throughput wise, 100% utilization, full speed present.
>> > IOPS, less than a third (the SSDs are at 33% utilization, the delays are
>> > caused by Ceph and network latencies).
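
The gap between the two fio results is easier to see when both are converted to the same unit. A rough sketch, assuming fio's 4MB and 4K block sizes are binary (4 MiB and 4 KiB) and that MB/s means 10^6 bytes per second; the function names are made up for illustration.

```python
def throughput_mb_s(iops: float, block_bytes: int) -> float:
    """Throughput in MB/s (10^6 bytes) for a given IOPS rate and block size."""
    return iops * block_bytes / 1_000_000


def iops_from_throughput(mb_s: float, block_bytes: int) -> float:
    """IOPS implied by a throughput in MB/s at a given block size."""
    return mb_s * 1_000_000 / block_bytes


KIB = 1024
MIB = 1024 * 1024

small_block_mb_s = throughput_mb_s(8000, 4 * KIB)       # the 4K result as MB/s
large_block_iops = iops_from_throughput(440, 4 * MIB)   # the 4MB result as IOPS

print(f"4K blocks:  8000 IOPS is about {small_block_mb_s:.1f} MB/s")
print(f"4MB blocks: 440 MB/s is about {large_block_iops:.0f} IOPS")
```

So the same setup moves roughly 440MB/s with large blocks but only about 33MB/s with 4K blocks, which is the throughput versus IOPS distinction described above.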
>> >
>> > >3 sets per 8 drives in 13 servers (with an
>> > > additional overhead for the network operations, ACKs and placement
>> > > calculations), QDR or FDR InfiniBand or 40GbE; we know drive specs, is
>> > > there a formula to calculate speed expectations from the raw speed
>> > > and/or IOPS point of view?
>> > >
>> >
>> > Let's look at a simplified example:
>> > 10 nodes (with fast enough CPU cores to fully utilize those SSDs/NVMes),
>> > 40Gb/s (QDR, Ether) interconnects.
>> > Each node with 2 1.6TB P3608s, which are rated at 2000MB/s write speed.
>> > Of course journals need to go somewhere, so the effective speed is half
>> > of that.
>> > Thus we get a top speed per node of 2GB/s.
>> > With a replication of 2 we would get a 10GB/s write capable cluster, with
>> > 3 it's down to a theoretical 6.6GB/s.
>> >
>> > I'm ignoring the latency, ACK overhead up there, which has a significantly
>> > lower impact on throughput than on IOPS.
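
The simplified example above can be written out as a small calculation. This is only a sketch of that arithmetic, not a general Ceph performance model: it assumes inline journals halve the effective drive speed and ignores latency and ACK overhead, as the example does. The function and parameter names are made up for illustration.

```python
def cluster_write_mb_s(nodes: int, drives_per_node: int, drive_mb_s: float,
                       replicas: int, journal_factor: float = 2.0) -> float:
    """Theoretical aggregate client write speed in MB/s.

    journal_factor=2 models each write hitting the drive twice
    (journal plus data), which halves the effective drive speed.
    """
    per_node = drives_per_node * drive_mb_s / journal_factor
    return nodes * per_node / replicas


# Figures from the example: 10 nodes, 2 drives per node, 2000 MB/s per drive.
per_node = 2 * 2000 / 2
two_copies = cluster_write_mb_s(10, 2, 2000, replicas=2)
three_copies = cluster_write_mb_s(10, 2, 2000, replicas=3)

print(f"per node: {per_node:.0f} MB/s")
print(f"2x replication: {two_copies:.0f} MB/s")
print(f"3x replication: {three_copies:.0f} MB/s")
```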
>> 
>> 
>> > Having a single client or intermediary file server write all that to the
>> > Ceph cluster over a single link is the bit I'd be more worried about.
>> >
>> I totally agree.
>> 
>> 
>> >
>> > Christian
>> >
>> > > Or, from another side, if there are prerequisites, how to be sure
>> > > the projected cluster meets them? I'm pretty sure it's a typical task,
>> > > how would you solve it?
>> > >
>> > > Thanks a lot in advance and best regards,
>> > > Vladimir
>> > >
>> > >
>> > > Best regards,
>> > > Дробышевский Владимир
>> > > "АйТи Город" company
>> > > +7 343 2222192
>> > >
>> > > Hardware and software
>> > > IBM, Microsoft, Eset
>> > > Turnkey project delivery
>> > > IT services outsourcing
>> > >
>> > > 2016-08-08 19:39 GMT+05:00 Александр Пивушков < [email protected] >:
>> > >
>> > > > Hello dear community!
>> > > > I'm new to Ceph and only recently took up the topic of building
>> > > > clusters, so your opinion is very important to me.
>> > > >
>> > > > I need to build a cluster with 1.2 PB of storage and very fast
>> > > > access to data. Until now, "Intel® SSD DC P3608 Series 1.6TB NVMe
>> > > > PCIe 3.0 x4 Solid State Drive" disks were used. Their speed satisfies
>> > > > everyone, but as the storage volume grows, the price of such a cluster
>> > > > rises steeply, hence the idea of using Ceph.
>> > > > The requirements are as follows:
>> > > >
>> > > > - 160 GB of data should be read and written at the speed of the SSD
>> > > > P3608
>> > > > - A high-speed 36 TB store built from SSD drives is required, with
>> > > > read/write speed approaching that of the SSD P3608
>> > > > - A 1.2 PB store is required; the faster the access, the better ...
>> > > > - Triple redundancy is required
>> > > > I do not understand this well yet, so I drew up a configuration with SSD
>> > > > P3608 disks. Of course, the configuration needs to be changed, as it is
>> > > > very expensive.
>> > > >
>> > > > InfiniBand and 40 Gb Ethernet will be used.
>> > > > We will also use virtualization on high-performance hardware to
>> > > > reduce the number of physical servers.
>> > > > I'm not tied to specific server models or manufacturers. I am only
>> > > > drawing up the cluster scheme, which should be criticized :)
>> > > >
>> > > > 1. OSD - 13 nodes.
>> > > >      a. 1.4 TB SSD drive, analogue of Intel® SSD DC P3608 Series - 2 pieces
>> > > >      b. Fibre Channel 16 Gbit/s - 2 ports.
>> > > >      c. An array (not RAID) of up to 284 TB of SATA-based drives (36 drives
>> > > > of 8 TB each);
>> > > >      d. 360 GB SSD, analogue of Intel SSD DC S3500 - 1 piece
>> > > >      e. SATA drive 40 GB for installation of the operating system (or
>> > > > booting from the network, which is preferable)
>> > > >      f. RAM 288 GB
>> > > >      g. 2 x CPU - 9 cores at 2 GHz - E5-2630v4
>> > > > 2. MON - 3 nodes. All virtual servers:
>> > > >      a. 1 Gbps Ethernet - 1 port.
>> > > >      b. SATA drive 40 GB for installation of the operating system (or
>> > > > booting from the network, which is preferable)
>> > > >      c. SATA drive 40 GB
>> > > >      d. 6 GB RAM
>> > > >      e. 1 x CPU - 2 cores at 1.9 GHz
>> > > > 3. MDS - 2 nodes. All virtual servers:
>> > > >      a. 1 Gbps Ethernet - 1 port.
>> > > >      b. SATA drive 40 GB for installation of the operating system (or
>> > > > booting from the network, which is preferable)
>> > > >      c. SATA drive 40 GB
>> > > >      d. 6 GB RAM
>> > > >      e. 1 x CPU - min. 2 cores at 1.9 GHz
>> > > >
>> > > > I plan to use SSDs for acceleration, as a cache and as the OSD journal.
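
As a sanity check of this scheme against the 1.2 PB and triple-redundancy requirements, here is a rough capacity calculation. It is only a sketch: it counts the HDDs alone (13 nodes with 36 drives of 8 TB each), uses decimal units, and ignores the free-space headroom a real cluster needs. The function name is made up for illustration.

```python
def cluster_capacity_tb(nodes: int, drives_per_node: int, drive_tb: float,
                        replicas: int) -> tuple[float, float]:
    """Return (raw, usable) capacity in TB for a replicated pool."""
    raw = nodes * drives_per_node * drive_tb
    return raw, raw / replicas


# Figures from the scheme above; 36 x 8 TB gives 288 TB per node,
# slightly more than the 284 TB quoted in the list.
raw_tb, usable_tb = cluster_capacity_tb(13, 36, 8, replicas=3)

print(f"raw: {raw_tb} TB, usable with 3 copies: {usable_tb} TB")
print(f"usable in PB: {usable_tb / 1000:.3f}")
```

With three copies this lands just above 1.2 PB, so there is very little margin once headroom for rebalancing and failures is taken into account.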
>> > > >
>> > > > --
>> > > > Alexander Pushkov
>> > > >
>> > > > _______________________________________________
>> > > > ceph-users mailing list
>> > > >  [email protected]
>> > > >  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > > >
>> > > >
>> >
>> >
>> > --
>> > Christian Balzer        Network/Systems Engineer
>> >  [email protected] Global OnLine Japan/Rakuten Communications
>> >  http://www.gol.com/
>> >
>> 
>> --
>> Best regards,
>> Vladimir
>
>
>-- 
>Christian Balzer        Network/Systems Engineer 
>[email protected] Global OnLine Japan/Rakuten Communications
>http://www.gol.com/


-- 
Александр Пивушков
