Thank you very much for your answer!
> Yes, I gathered that.
> The question is, what servers between the Windows clients and the final
> Ceph storage are you planning to use.

That I do not yet understand. While I believe that the client can be
connected directly to Ceph :) I will read on :)

>Monday, 22 August 2016, 10:57 +03:00 from Christian Balzer <[email protected]>:
>
>On Mon, 22 Aug 2016 10:18:51 +0300 Александр Пивушков wrote:
>
>> Hello,
>> Several answers below
>>
>> >Wednesday, 17 August 2016, 8:57 +03:00 from Christian Balzer <[email protected]>:
>> >
>> >
>> >Hello,
>> >
>> >On Wed, 17 Aug 2016 09:27:30 +0500 Дробышевский, Владимир wrote:
>> >
>> >> Christian,
>> >>
>> >> thanks a lot for your time. Please see below.
>> >>
>> >>
>> >> 2016-08-17 5:41 GMT+05:00 Christian Balzer <[email protected]>:
>> >>
>> >> >
>> >> > Hello,
>> >> >
>> >> > On Wed, 17 Aug 2016 00:09:14 +0500 Дробышевский, Владимир wrote:
>> >> >
>> >> > > So the demands look like these:
>> >> > >
>> >> > > 1. He has a number of clients which need to periodically write a set of
>> >> > > data as big as 160GB to a storage. The acceptable write speed is about a
>> >> > > minute for such an amount, so it is around 2700-2800MB per second. Each
>> >> > > write session will happen in a dedicated manner.
>> >> >
>> >> > Let me confirm that "dedicated" here means non-concurrent, sequential.
>> >> > So not more than one client at a time, the cluster and network would be
>> >> > good if doing 3GB/s?
>> >> >
>> >> Yes, this is what I meant.
>> >>
>> >That's good to know, it makes that data dump from a single client/server
>> >at least marginally possible, without resorting to even more expensive
>> >network infrastructure.
>> >
>> >>
>> >> >
>> >> > Note that with IPoIB and QDR 3GB/s is about the best you can hope for,
>> >> > that's with a single client of course.
>> >> >
>> >> I understand, thank you.
>> >> Alexander doesn't have any setup yet and would
>> >> like to build a cost-effective one (not exactly 'cheap', but with minimal
>> >> costs to satisfy the requirements), so I've recommended him QDR IB as a minimal
>> >> setup if they are able to live with used hardware (which is pretty
>> >> cheap in general and would allow an inexpensive multi-port-per-server
>> >> setup with bonding, but is hard to get in Russia), or FDR if it is only
>> >> possible to get new network hardware.
>> >>
>> >Single link QDR should do the trick.
>> >Bonding via a Linux bondN interface with IPoIB currently only supports
>> >failover (active-standby), not load balancing.
>> >Never mind that load balancing may still not improve bandwidth for a
>> >single client talking to a single target (it would help on a server
>> >talking to Ceph, thus multiple OSD nodes).
>> >
>> >There are of course other ways of using 2 interfaces to achieve higher
>> >bandwidth, like using routing to the host.
>> >But that gets more involved.
>> We decided to test, buy the 40GbE.
>> There will be two links. One on the external network. Another on the internal
>> network.
>
>Splitting Ceph into an internal (cluster, replication) and external
>(client) network only makes sense in your case if you have more than that
>bandwidth on your local storage.
>Which would mean more than 4x 1.6TB DC P3608s per node, 4GB/s.
>Don't think you need or want to afford that.
>
>Also having just 1 link w/o failover and 2 switches (active-active with
>MC-LAG or active-backup) is a bad idea.
>
>> >
>> >
>> >>
>> >> >
>> >> > >Data read should also be
>> >> > > pretty fast. The written data must be shared after the write.
>> >> > Fast reading might be achieved by these factors:
>> >> > a) lots of RAM, to hold all FS SLAB data and of course page cache.
>> >> > b) splitting writes and reads amongst the pools by using readforward
>> >> > cache mode, so writes go (primarily, initially) to the SSD cache pool and
>> What is "readforward cache mode"?
>
>This (read the tracker link on that page), unfortunately still
>un-documented.
>
>> >
>> >> > (cold) reads come from the HDD base pool.
>> >> > c) having a large cache pool.
>> >> >
>> >> > >Clients OS -
>> >> > > Windows.
>> >> > So what server(s) are they writing to?
>> >> > I don't think that the Windows RBD port (dokan) is a well-tested
>> >> > implementation, besides not being updated for a year or so.
>> Now everything is written to local Intel NVMe P3608 drives.
>
>Yes, I gathered that.
>The question is, what servers between the Windows clients and the final
>Ceph storage are you planning to use.
>
>> >
>> >> >
>> >> This is the question I haven't asked (I hope Alexander will read this and
>> >> write me an answer, and I'll relay it here), but I believe they use local
>> >> P3608s for this at the moment. The main problem is that P3608s are pretty
>> >> expensive, and a local setup doesn't provide enough reliability, so they
>> >> would like to build a cost-effective, reliable setup with more inexpensive
>> >> drives, as well as providing network storage for other data.
>> >> The situation with dokan is exactly what I thought and told Alexander. So
>> >> the only way is to set up intermediate servers, which will significantly
>> >> reduce speed.
>> >>
>> >I haven't even tried to use Samba or NFS on top of RBD or CephFS, but
>> >given that fio (with direct=1!) gives me the full speed of the OSDs, same
>> >as with a "cp -ar", I'd hope that such file servers wouldn't be
>> >significantly slower than their storage system.
>> Can you tell us more about the use of Samba?
>> Do we need something special, or all defaults?
>As I said, I haven't used it (with Ceph), but it should be able to deliver most
>of the speed, based on my own experience with local disks and all the tuning
>guides out there, like this one:
>http://www.eggplant.pro/blog/faster-samba-smb-cifs-share-performance/
>
>> >
>> >
>> >>
>> >> > > 2. It is necessary to have a regular storage as well. He thinks about
>> >> > > 1.2PB HDD storage with a 34TB SSD cache tier at the moment.
>> >> > >
>> >> > A 34TB cache pool with (at the very least) 2x replication will not be
>> >> > cheap.
>> >> >
>> >> > > The main question with an answer I don't have is how to
>> >> > > calculate\predict per-client write speed for a Ceph cluster?
>> >> > This question has been asked before and in fact quite recently, see the
>> >> > very short lived "Ceph performance calculator" thread.
>> >> >
>> >> Thank you, I've found it. I've been following the list for a pretty
>> >> long time but it seems that I missed this discussion.
>> >>
>> >> >
>> >> > In short, too many variables.
>> >> >
>> >> > >For example, if there will be a
>> >> > > cache tier or even a dedicated SSD-only pool with Intel S3710 or Samsung
>> >> > > SM863 drives - how to get an approximation for the write speed? Concurrent
>> >> > > writes to 6-8 good SSD drives could probably give such speed, but is it
>> >> > > true for the cluster in general?
>> >> >
>> >> > Since we're looking here at one of the relatively few use cases where
>> >> > bandwidth/throughput is the main factor and not IOPS, this calculation
>> >> > becomes a bit easier and more predictable.
>> >> > For an example, see my recent post:
>> >> > "Better late than never, some XFS versus EXT4 test results"
>> >> >
>> >> Found it too, thanks! Very useful tests. Besides the current topic,
>> >> wouldn't btrfs give some advantages in case of a pure SSD pool with inline
>> >> (on the same drive) journals?
>> >>
>> >In theory yes, but I think the bigger win here is with IOPS, as opposed to
>> >throughput.
>> >With BTRFS you could use filestore_journal_parallel, but AFAIK that will
>> >still result in 2 writes, so the full speed of the drive won't be
>> >available either.
>> >Main advantage here would be that either a successful journal or FS
>> >write will result in an ACK, so if the FS is faster you get some speedup.
>> >
>> >The question is, how well tested is this code path, by the automatic Ceph
>> >build tests and users out there?
>> >At least fragmentation wouldn't matter with SSDs. ^o^
>> >
>> >At this point in time, I'd go with "well supported" and migrate to
>> >Bluestore once that becomes trustworthy.
>> Do I understand correctly that BlueStore can now be safely and advantageously
>> used in production?
>
>Definitely not, BlueStore is not production ready and won't be for at
>least 1-2 more releases, so sometime next year at the earliest.
>
>Christian
>> >
>> >
>> >Christian
>> >>
>> >> > Which basically shows that with sufficient network bandwidth all
>> >> > available drive speed can be utilized.
>> >> >
>> >> > With fio randwrite and 4MB blocks the above setup gives me 440MB/s and
>> >> > with 4K blocks 8000 IOPS.
>> >> > So throughput-wise, 100% utilization, full speed present.
>> >> > IOPS, less than a third (the SSDs are at 33% utilization, the delays are
>> >> > caused by Ceph and network latencies).
>> >> >
>> >> > >3 sets per 8 drives in 13 servers (with an
>> >> > > additional overhead for the network operations, ACKs and placement
>> >> > > calculations), QDR or FDR InfiniBand or 40GbE; we know the drive specs, is
>> >> > > there a formula to calculate speed expectations from the raw speed
>> >> > > and/or IOPS point of view?
>> >> > >
>> >> >
>> >> > Let's look at a simplified example:
>> >> > 10 nodes (with fast enough CPU cores to fully utilize those SSDs/NVMes),
>> >> > 40Gb/s (QDR, Ether) interconnects.
>> >> > Each node with 2 1.6TB P3608s, which are rated at 2000MB/s write
>> >> > speeds.
>> >> > Of course journals need to go somewhere, so the effective speed is half
>> >> > of that.
>> >> > Thus we get a top speed per node of 2GB/s.
>> >> > With a replication of 2 we would get a 10GB/s write-capable cluster, with
>> >> > 3 it's down to a theoretical 6.6GB/s.
>> >> >
>> >> > I'm ignoring the latency and ACK overhead up there, which have a
>> >> > significantly lower impact on throughput than on IOPS.
>> >>
>> >> > Having a single client or intermediary file server write all that to the
>> >> > Ceph cluster over a single link is the bit I'd be more worried about.
>> >> >
>> >> I totally agree.
>> >>
>> >> >
>> >> > Christian
>> >> >
>> >> > > Or, from another side, if there are pre-requisites, how to be sure
>> >> > > the projected cluster meets them? I'm pretty sure it's a typical task,
>> >> > > how would you solve it?
>> >> > >
>> >> > > Thanks a lot in advance and best regards,
>> >> > > Vladimir
>> >> > >
>> >> > >
>> >> > > Best regards,
>> >> > > Дробышевский Владимир
>> >> > > "АйТи Город" company
>> >> > > +7 343 2222192
>> >> > >
>> >> > > Hardware and software:
>> >> > > IBM, Microsoft, Eset
>> >> > > Turnkey project delivery
>> >> > > IT services outsourcing
>> >> > >
>> >> > > 2016-08-08 19:39 GMT+05:00 Александр Пивушков <[email protected]>:
>> >> > >
>> >> > > > Hello dear community!
>> >> > > > I'm new to Ceph and not long ago took up the theme of building
>> >> > > > clusters.
>> >> > > > Therefore your opinion is very important to me.
>> >> > > >
>> >> > > > It is necessary to create a cluster with 1.2 PB of storage and very
>> >> > > > rapid access to data.
>> >> > > > Earlier, "Intel® SSD DC P3608 Series 1.6TB NVMe
>> >> > > > PCIe 3.0 x4 Solid State Drive" disks were used; their speed fully
>> >> > > > satisfies us, but with the increase of storage volume the price of such
>> >> > > > a cluster grows very strongly, and so there was an idea to use Ceph.
>> >> > > > There are the following requirements:
>> >> > > >
>> >> > > > - An amount of data of 160 GB should be read and written at the speeds
>> >> > > > of the SSD P3608
>> >> > > > - A high-speed storage of SSD drives, 36 TB in volume, must be created,
>> >> > > > with read/write speed tending to the SSD P3608
>> >> > > > - A 1.2 PB store must be created, with access speed the bigger the
>> >> > > > better...
>> >> > > > - Must have triple redundancy
>> >> > > > I do not really understand yet how to create a configuration with SSD
>> >> > > > P3608 disks. Of course, the configuration needs to be changed; it is
>> >> > > > very expensive.
>> >> > > >
>> >> > > > InfiniBand will be used, and 40 GB Ethernet.
>> >> > > > We will also use virtualization on high-performance hardware to
>> >> > > > optimize the number of physical servers.
>> >> > > > I'm not tied to specific server models and manufacturers. I create
>> >> > > > only the cluster scheme, which should be criticized :)
>> >> > > >
>> >> > > > 1. OSD - 13 pieces.
>> >> > > >      a. 1.4 TB SSD drive, analogue of Intel® SSD DC P3608 Series - 2 pieces
>> >> > > >      b. Fibre Channel 16 Gbit/s - 2 ports.
>> >> > > >      c. An array (not RAID) of up to 284 TB of SATA-based drives (36
>> >> > > > drives of 8TB);
>> >> > > >      d. 360 GB SSD, analogue of Intel SSD DC S3500 - 1 piece
>> >> > > >      e. SATA drive 40 GB for installation of the operating system (or
>> >> > > > booting from the network, which is preferable)
>> >> > > >      f. RAM 288 GB
>> >> > > >      g. 2 x CPU - 9 cores at 2 GHz - E5-2630v4
>> >> > > > 2.
>> >> > > > MON - 3 pieces. All virtual servers:
>> >> > > >      a. 1 Gbps Ethernet - 1 port.
>> >> > > >      b. SATA drive 40 GB for installation of the operating system (or
>> >> > > > booting from the network, which is preferable)
>> >> > > >      c. SATA drive 40 GB
>> >> > > >      d. 6GB RAM
>> >> > > >      e. 1 x CPU - 2 cores at 1.9 GHz
>> >> > > > 3. MDS - 2 pcs. All virtual servers:
>> >> > > >      a. 1 Gbps Ethernet - 1 port.
>> >> > > >      b. SATA drive 40 GB for installation of the operating system (or
>> >> > > > booting from the network, which is preferable)
>> >> > > >      c. SATA drive 40 GB
>> >> > > >      d. 6GB RAM
>> >> > > >      e. 1 x CPU - min. 2 cores at 1.9 GHz
>> >> > > >
>> >> > > > I assume I will use SSDs for acceleration, for the cache and the OSD
>> >> > > > journal.
>> >> > > >
>> >> > > > --
>> >> > > > Alexander Pushkov
>> >> > > >
>> >> > > > _______________________________________________
>> >> > > > ceph-users mailing list
>> >> > > > [email protected]
>> >> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> > > >
>> >> > > >
>> >> >
>> >> >
>> >> > --
>> >> > Christian Balzer           Network/Systems Engineer
>> >> > [email protected]    Global OnLine Japan/Rakuten Communications
>> >> > http://www.gol.com/
>> >> >
>> >>
>> >> --
>> >> Best regards,
>> >> Vladimir
>> >
>> >
>> >--
>> >Christian Balzer           Network/Systems Engineer
>> >[email protected]    Global OnLine Japan/Rakuten Communications
>> >http://www.gol.com/
>> >_______________________________________________
>> >ceph-users mailing list
>> >[email protected]
>> >http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>--
>Christian Balzer           Network/Systems Engineer
>[email protected]    Global OnLine Japan/Rakuten Communications
>http://www.gol.com/

--
Александр Пивушков
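P.S. To make the numbers in this thread easier to check, here is a rough back-of-the-envelope sketch in Python of the two throughput figures discussed above: the 160GB-per-minute client requirement and Christian's simplified 10-node example. The drive rating, journal factor and replication levels are taken from the mails; everything else is plain arithmetic, not a benchmark.

```python
# Sanity check of the throughput figures from the thread.
# All inputs come from the mails above; this is arithmetic only.

# 1) Client-side requirement: 160GB written in about one minute.
required_mb_s = 160 * 1024 / 60
print(f"required: ~{required_mb_s:.0f} MB/s")  # ~2731 MB/s, matching "2700-2800MB per second"

# 2) Christian's simplified cluster example:
nodes = 10
drives_per_node = 2
drive_write_mb_s = 2000      # rated write speed of one 1.6TB P3608
journal_factor = 0.5         # journal on the same device halves effective speed

node_mb_s = drives_per_node * drive_write_mb_s * journal_factor  # 2GB/s per node
for replication in (2, 3):
    cluster_gb_s = nodes * node_mb_s / replication / 1000
    print(f"replication {replication}: ~{cluster_gb_s:.1f} GB/s")  # 10.0 and 6.7
```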
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
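P.P.S. The same kind of arithmetic applies to the proposed 13-node OSD scheme with triple redundancy. Note that 36 x 8TB is 288TB raw per node, slightly above the ~284TB quoted in the scheme; the sketch below uses the 36 x 8TB figure and ignores filesystem and Ceph overhead.

```python
# Rough capacity check for the proposed 13-node OSD layout
# (figures from the scheme above; no FS/Ceph overhead included).

nodes = 13
drives_per_node = 36
drive_tb = 8
replication = 3              # "triple redundancy" requirement

raw_tb = nodes * drives_per_node * drive_tb       # 3744 TB raw
usable_pb = raw_tb / replication / 1000           # ~1.25 PB usable
print(f"raw: {raw_tb} TB, usable: ~{usable_pb:.2f} PB")  # close to the 1.2 PB target
```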
