Hello,

[re-added the list]

Also, try to leave a line break or paragraph between quoted and new text;
your mail looked like it was all written by me...

On Tue, 09 Aug 2016 11:00:27 +0300 Александр Пивушков wrote:

>  Thank you for your response!
> 
> 
> >Tuesday, 9 August 2016, 5:11 +03:00, from Christian Balzer <ch...@gol.com>:
> >
> >
> >Hello,
> >
> >On Mon, 08 Aug 2016 17:39:07 +0300 Александр Пивушков wrote:
> >
> >> 
> >> Hello dear community!
> >> I'm new to Ceph and only recently took up the topic of building
> >> clusters, so your opinion is very important to me.
> >> I need to create a cluster with 1.2 PB of storage and very fast access
> >> to the data. Previously, "Intel® SSD DC P3608 Series 1.6TB NVMe
> >> PCIe 3.0 x4 Solid State Drive" disks were used; their speed satisfies
> >> everyone, but as the storage volume grows, the price of such a cluster
> >> grows very steeply, hence the idea to use Ceph.
> >
> >You may want to tell us more about your environment, use case and in
> >particular what your clients are.
> >Large amounts of data usually means graphical or scientific data,
> >extremely high speed (IOPS) requirements usually mean database
> >like applications, which one is it, or is it a mix? 
>
>This is a mixed project, combining graphics and science, linking a vast
>array of image data. Like Google Maps :)
> Previously, the clients were Windows machines connected directly to
> powerful servers.
> A Ceph cluster connected over FC to the virtual machine servers is now
> planned. Virtualization - oVirt.

Stop right there. oVirt, despite being from RedHat, doesn't really support
Ceph directly all that well, last I checked.
That is probably where you get the idea/need for FC from.

If at all possible, you do NOT want another layer and protocol conversion
between Ceph and the VMs, like an FC gateway, iSCSI or NFS.

So if you're free to choose your virtualization platform, use KVM/qemu at
the bottom and something like OpenStack, OpenNebula, ganeti, or Pacemaker
with KVM resource agents on top.
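
With KVM/qemu the guests then talk to RBD natively via librbd, no gateway
in between. Purely as an illustration (the pool, image, monitor host and
secret UUID below are placeholders, not from your setup), the disk section
of a libvirt guest definition would look something like this:

  <disk type='network' device='disk'>
    <driver name='qemu' type='raw' cache='writeback'/>
    <auth username='libvirt'>
      <secret type='ceph' uuid='REPLACE-WITH-YOUR-CEPH-SECRET-UUID'/>
    </auth>
    <source protocol='rbd' name='rbd/vm-disk-1'>
      <host name='mon1.example.com' port='6789'/>
    </source>
    <target dev='vda' bus='virtio'/>
  </disk>

That way every qemu process talks to the OSDs directly and there is no FC,
iSCSI or NFS choke point to size and make highly available.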

>The clients are connected to the virtualization servers over 40 Gb Ethernet.

Your VM clients (if using RBD instead of FC) and the end-users could use
the same network infrastructure.

>The clients are on Windows.
> The customers use their own software, written by them. As for a database,
> I do not know; probably there is none. The processing results are stored
> in ordinary files, about 160 GB in total.

1 image file being 160GB?

> We need to process these images very quickly, so as not to cause 
> dissatisfaction among the customers. :) Per minute.

Explain. 
Writing 160GB/minute is going to be a challenge on many levels.
Even with 40Gb/s networks this assumes no contention on the network OR the
storage backend...
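
Back of the envelope, client-side bandwidth only (ignoring replication,
journals and protocol overhead):

  160 GB / 60 s  ~  2.7 GB/s  ~  21 Gb/s sustained from the client side,
  and with 3x replication the OSD nodes have to absorb roughly
  3 * 2.7 GB/s  ~  8 GB/s of writes on the backend.

That is more than half a 40Gb/s link for the payload alone.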


> >
> >
> >For example, how were the above NVMes deployed and how did they serve data
> >to the clients?
> >The fiber channel bit in your HW list below makes me think you're using
> >VMware, FC and/or iSCSI right now. 
>
>The data is stored on the 1.6TB NVMe SSD and processed directly on it, in
>one powerful server dedicated to this task, over 40 Gb Ethernet. The server
>runs CentOS 7.

So you're going from a single server with all NVMe storage to a
distributed storage. 

You will be disappointed by the cost/performance in direct comparison.


> 
> >
> >
> >> The requirements are as follows:
> >> - The 160 GB of data should be read and written at the speed of the 
> >> SSD P3608
> >Again, how are they serving data now?
> >The speeds (latency!) a local NVMe can reach is of course impossible with
> >a network attached SDS like Ceph. 
>
>That is sad. Would parallelizing across 13 servers not help? And the FC?
>
Ceph does not do FC internally.
It only uses IP (so you can use IPoIB if you want).
Never mind that; the problem is that the replication (x3) causes the
largest part of the latency.
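
To put rough, purely illustrative numbers on it (not measured on your
hardware): a local NVMe like the P3608 completes a small write in a few
tens of microseconds, while a replicated Ceph write is only acknowledged
once all replicas have persisted it, so something like

  local NVMe 4KB write:          ~0.02-0.1 ms
  Ceph 4KB write, 3 replicas:    ~1-2 ms and up (filestore, SSD journals)

is typical, i.e. one to two orders of magnitude more latency per operation,
regardless of how much bandwidth you have.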

> >
> >160GB is tiny, are you sure about this number? 
>
>Yes, it's small, and the figure is exact. But this is the data whose
>processing time is the most sensitive. There is also more data that can be
>processed more slowly, in the background; its processing does not make the
>clients so nervous.

Still not getting it, but it seems more and more like 160GB/s.

> >
> >
> >> - A high-speed storage volume of 36 TB on SSD drives must be created, 
> >> with read/write speed approaching that of the SSD P3608
> >How is that different to the point above?
>
> The data of this volume can be processed in the background, in parallel
> with the processing of the 160 GB, and the speed of that processing is
> not so important. Previously, the entire amount was placed in a server on
> SSD disks of lower performance. That is why I specified a Ceph cluster of
> SSD drives of the same volume, which can read and write data quickly.
> >
> >
> >> - A store of 1.2 PB must be created; the higher the access speed, the 
> >> better ...
> >Ceph scales well.
> >> - Must have triple redundancy
> >Also not an issue, depending on how you define this. 
>
>By standard Ceph means, the simplest configuration. As far as I understand,
>by default it has dual redundancy.

Default is 3 replicas for about 2 years now. And trust me, you don't want
less when dealing with HDDs.
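
Which also has a direct capacity consequence, roughly:

  1.2 PB usable * 3 replicas  ~ 3.6 PB raw HDD capacity
   36 TB usable * 3 replicas  ~ 108 TB raw SSD capacity

Your 13 nodes with 36 x 8 TB each come to about 3.7 PB raw, so roughly
1.2 PB usable at 3x, and that is before leaving headroom for recovery and
the near-full ratios.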

> >
> >
> >> I do not really understand yet how to create a configuration with the 
> >> SSD P3608 disks. Of course, the configuration needs to be changed; it 
> >> is very expensive.
> >
> >There are HW guides and plenty of discussion about how to design largish
> >clusters, find and read them.
> >Like the ML threads:
> >"800TB - Ceph Physical Architecture Proposal"
> >"dense storage nodes" Thank you. I found and read
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-April/008775.html  
> 
> >
> >
> >Also read up on Ceph cache-tiering. 
> >
> >> InfiniBand and 40 Gb Ethernet will be used.
> >> We will also use virtualization on high-performance hardware to 
> >> optimize the number of physical servers.
> >
> >What VM stack/environment?
> >If it is VMware, Ceph is a bad fit as the most stable way to export Ceph
> >storage to this platform is NFS, which is also the least performant
> >(AFAIK).
>
> oVirt is used.
> >
> >
> >> I'm not tied to specific server models or manufacturers. I am only 
> >> creating the cluster scheme, which should be criticized :) 
> >> 
> >> 1. OSD - 13 pieces.
> >>      a. 1.4 TB SSD drive (Intel® SSD DC P3608 Series analogue) - 2 pieces
> >
> >For starters, that's not how you'd likely deploy high speed storage, see
> >CPU below. 
> >Also this gives you 36TB un-replicated capacity, so you'll need 2-3 times
> >the amount to be safe. 
>
>this amount is only for processing, there is no storage there

Don't understand what you mean here.

> >
> >
> >>      b. Fiber Channel 16 Gbit/s - 2 ports.
> >What for?
> >If you need a FC GW (bad idea), 2 (dedicated if possible) machines will do. 
>
> One for the Ceph servers and another for the connection to the
> virtualization servers (clients), so that the network is not a bottleneck.
> >

See above, it doesn't work like this.

> >
> >And where/what is your actual network HW?
>
> The cluster is under creation; everything will be purchased.
> Brocade 6510 FC SAN.
> >
> >
> >>      c. An array (not RAID) of 284 TB of SATA-based drives (36 drives 
> >> of 8TB);
> >
> >Ceph works better with not overly large storage nodes and OSDs. 
> >I know you're trying to minimize rack space and cost, but something with
> >less OSDs per node and 4TB per OSD is going to be easier to get right.
> 
> well yes....
> I understood
> >
> >
> >>      d. 360 GB SSD (Intel SSD DC S3500 analogue) - 1 piece
> >What is that for?
> >Ceph only performs decently (with the current filestore) when using SSDs
> >as journals for the HDD based OSDs, a single SSD won't cut it and a 3500
> >has likely insufficient endurance anyway.
> >
> >For 36 OSDs you're looking at 7 400GB DC S3710s or 3 400GB DC P3700s... 

>I do not quite understand. I do in fact have a large HDD store, and Ceph
>needs a separate disk for the logs; that is why I set this disk aside for
>them. It is relatively inexpensive for its volume, roughly 10 GB per OSD.
>The SATA drives are the HDDs.

Logs aren't that big really with OSDs. 

Read up on "Ceph SSD journals".
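
As a rough sizing guide, the formula from the Ceph docs is

  osd journal size = 2 * (expected throughput * filestore max sync interval)

so e.g. 2 * (100 MB/s per HDD OSD * 5 s) = 1 GB, and the usual 5-10 GB per
journal is already generous. What matters is not the space but the sync
write speed and endurance of the journal SSD, and how many HDD OSDs share
it (roughly 4-6 per SATA SSD, around 12 per decent NVMe, as per the
S3710/P3700 numbers above).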

> >
> >
> >>      e. SATA drive 40 GB for installation of the operating system (or 
> >> booting from the network, which is preferable)
> >>      f. RAM 288 GB
> >Generous, but will help reads.

> "However, during recovery they need significantly more RAM (e.g., ~1GB per
> 1TB of storage per daemon). Generally, more RAM is better."
> http://docs.ceph.com/docs/master/start/hardware-recommendations/
> And how much is enough?
>

For 36 OSDs? 128GB if you're feeling lucky, 256GB if you want to play it
safe and have enhanced read performance.
 
That rule is for worst case scenarios and from when 2TB disks were big.
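
The numbers behind that, for comparison:

  docs rule of thumb:  ~1 GB RAM per 1 TB of storage per OSD daemon,
                       i.e. 36 OSDs * 8 TB = 288 GB (your 288 GB figure)
  in practice:         1-2 GB per filestore OSD in normal operation and a
                       few GB per OSD during heavy recovery, so 128-256 GB
                       for 36 OSDs, with anything beyond that mostly going
                       to the page cache (which is what helps reads).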

> >
> >
> >>      g. 2 x CPU - 9 core 2 Ghz. - E-5-2630v4
> >Firstly that's a 10 core, 2.2GHz CPU.
>
> Oh yeah, 10-core indeed... but the price! All according to Freud.
> >
> >Secondly, most likely underpowered if serving both NVMes and 36 HDD OSDs.
> >A 400GB DC S3610 (so slower SATA, not NVMe) will eat about 3 2.2GHz cores
> >when doing small write IOPS.
>
> I took into account the Ceph recommendation of 1 GHz / 1 core per OSD.
> 

That's for pure HDD, which will give you nowhere near the performance you
want. For HDD plus SSD journal I figure 1-2 GHz per OSD; for pure SSD or
NVMe, as much as you can afford.
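
A quick sanity check of the CPU budget with those rules of thumb:

  2 x E5-2630v4            = 2 x 10 cores x 2.2 GHz  ~ 44 core-GHz per node
  36 HDD+SSD-journal OSDs  at 1-2 GHz each           ~ 36-72 core-GHz

and that is before the two NVMe OSDs, which will eat several cores each
under small-write load. So the proposed CPUs are borderline at best.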


> >
> >
> >There are several saner approaches I can think of, but these depend on the
> >answers to the questions above.
> >
> >
> >> 2. MON - 3 pieces. All virtual server:
> >Virtual server can work, I prefer real (even if shared) HW.
> >3 is the absolute minimum, 5 would be a good match.
>
> OK.
> >
> >
> >>      a. 1 Gbps Ethernet - 1 port.
> >While the MONs don't have much data traffic, the lower latency of a faster
> >network would be helpful. 
>what, for example?

Information exchange between the MONs (or MONs and OSDs) will have more
latency at 1Gb/s, thus be slower.

> >
> >
> >If you actually need MDS, make those (real) servers also MONs and put
> >the rest on OSD nodes or VMs.
>
> No, really, I do not know yet whether CephFS is needed. It is an open
> question.

From all I can see, you don't need it.
And thus no MDS.

Christian

> >
> >
> >>      b. SATA drive 40 GB for installation of the operating system (or 
> >> booting from the network, which is preferable)
> >>      c. SATA drive 40 GB
> >MONs like fast storage for their leveldb.
>
> I do not understand...
> >
> >
> >>      d. 6GB RAM
> >A bit low, but most likely enough.
> >
> >>      e. 1 x CPU - 2 cores at 1.9 Ghz
> >Enough for most scenarios, faster cores would be better.
> >
> >
> >> 3. MDS - 2 pcs. All virtual server:
> >Do you actually know what MDS do?
> >And where in your use case is CephFS needed or required?
>
> No, not sure yet; probably it will not be needed.
> >
> >
> >>      a. 1 Gbps Ethernet - 1 port.
> >>      b. SATA drive 40 GB for installation of the operating system (or 
> >> booting from the network, which is preferable)
> >>      c. SATA drive 40 GB
> >>      d. 6GB RAM
> >>      e. 1 x CPU - min. 2 cores at 1.9 Ghz
> >Definitely not, you want physical nodes with the same level of networking
> >as the OSDs and your main clients. 
> >You will also want faster and more cores and way more memory (at least
> >64GB), how much depends on your CephFS size (number of files).
>
> CephFS likely will not be used, unless it is absolutely necessary.
> >
> >
> >> I assume using an SSD for acceleration, as a cache and for the OSD 
> >> journal/log.
> >MDS don't hold any local data (caches), a logging SSD is fine.
>
> I wrote this at the end, but it should have been at the beginning of the
> mail. I'm considering using the 1.6 TB NVMe SSD as a cache and a 300 GB
> SSD for the journal/log.
> >
> >
> >
> >Christian
> >-- 
> >Christian Balzer        Network/Systems Engineer 
> >ch...@gol.com Global OnLine Japan/Rakuten Communications
> >http://www.gol.com/
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
ch...@gol.com           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
