Hello Christian.

2017-08-24 22:43 GMT-03:00 Christian Balzer <[email protected]>:
>
> Hello,
>
> On Thu, 24 Aug 2017 14:49:24 -0300 Guilherme Steinmüller wrote:
>
> > Hello Christian.
> >
> > First of all, thanks for your considerations, I really appreciate it.
> >
> > 2017-08-23 21:34 GMT-03:00 Christian Balzer <[email protected]>:
> >
> > > Hello,
> > >
> > > On Wed, 23 Aug 2017 09:11:18 -0300 Guilherme Steinmüller wrote:
> > >
> > > > Hello!
> > > >
> > > > I recently installed an INTEL SSD 400GB 750 SERIES PCIE 3.0 X4 in 3 of
> > > > my OSD nodes.
> > > >
> > > Well, you know what's coming now, don't you?
> > >
> > > That's a consumer device, with 70GB writes per day endurance.
> > > Unless you're essentially running a read-only cluster, you're throwing
> > > away money.
> > >
> > Yes, we knew that we were going to buy a consumer device, due to our
> > limited budget and our objective of building a small pilot of a
> > production cloud. This model seemed acceptable; it was at the top of the
> > consumer models in Sebastien's benchmarks.
> >
> > We are a lab that depends on different budget sources to acquire
> > equipment, so they can vary, and most of the time we are limited by
> > different budget ranges.
> >
> Noted, I hope your tests won't last too long or move a lot of data. ^o^
>
> > > > First of all, here is a schema describing how my cluster is laid out:
> > > >
> > > > [image: Inline image 1]
> > > >
> > > > [image: Inline image 2]
> > > >
> > > > I primarily use my Ceph cluster as a backend for OpenStack nova,
> > > > glance, swift and cinder. My crushmap is configured to have rulesets
> > > > for SAS disks, SATA disks, and another ruleset that resides on HPE
> > > > nodes also using SATA disks.
> > > >
> > > > Before installing the new journal in the HPE nodes, I was using one
> > > > of the disks that today are OSDs (osd.35, osd.34 and osd.33).
> > > > After upgrading the journal, I noticed that a dd command writing 1GB
> > > > blocks in OpenStack nova instances doubled the throughput, but the
> > > > expected gain was actually 400% or 500%, since on the Dell nodes,
> > > > where we have another nova pool, the throughput is around that value.
> > > >
> > > Apples, oranges and bananas.
> > > You're comparing different HW (and no, I'm not going to look this up)
> > > which may or may not have vastly different capabilities (like HW cache),
> > > RAM and (unlikely relevant) CPU.
> > >
> > Indeed, we took this into account. The HP servers were cheaper and have a
> > poorer configuration due to that limited budget source.
> >
> > > Your NVMe may also be plugged into a different, insufficient PCIe slot
> > > for all we know.
> > >
> > I checked this. I compared the slot information between the 3 Dell nodes
> > and the 3 HP nodes by running:
> >
> > # ls -l /sys/block/nvme0n1
> > # lspci -vvv -s 0000:06:00.0  <- slot identifier
> >
> > The only difference is:
> >
> > Dell has a parameter called *Cache Line Size: 32 bytes* and HP doesn't
> > have this.
> >
> That shouldn't be relevant, AFAIK.
>
> > > You're also using very different HDDs, which definitely will be a
> > > factor.
> > >
> > I thought that the backend disks would not interfere that much. For
> > example, the Ceph journal has a parameter called filestore max sync
> > interval, which means that the journal will commit the transactions to
> > the backend OSDs at a defined interval; ours is set to 35. So the client
> > requests go first to the SSD and are then committed to the OSDs.
> >
> As I wrote before, the journal does not come into play for any large
> amounts of data unless massively tuned and/or under extreme pressure.
>
> You need to touch many more of the journal and filestore parameters than
> max_sync, which by itself will do nothing to prevent min_sync and other
> values from starting to flush more or less immediately.
>
> And tuning things so the journal is used extensively by default will
> result in I/O storms slowing things to a crawl when it finally flushes to
> the HDDs.
>
> If your google foo is strong enough you should find the relevant
> discussions about this, often in the context of SSD OSDs where such tuning
> makes some sense.
>
> > > But most importantly, you're comparing 2 pools of vastly different OSD
> > > counts; no wonder a pool with 15 OSDs is faster in sequential writes
> > > than one with 9.
> > >
> > > > Here is a demonstration of the scenario and the difference in
> > > > performance between Dell nodes and HPE nodes:
> > > >
> > > > Scenario:
> > > >
> > > > - Using pools to store instance disks for OpenStack
> > > >
> > > > - Pool nova in "ruleset SAS" placed on c4-osd201, c4-osd202 and
> > > >   c4-osd203 with 5 OSDs per host
> > > >
> > > SAS
> > >
> > > > - Pool nova_hpedl180 in "ruleset NOVA_HPEDL180" placed on c4-osd204,
> > > >   c4-osd205 and c4-osd206 with 3 OSDs per host
> > > >
> > > SATA
> > >
> > > > - Every OSD has one partition of 35GB on an INTEL SSD 400GB 750
> > > >   SERIES PCIE 3.0 X4
> > > >
> > > Overkill, but since your NVMe will die shortly anyway...
> > >
> > > With large sequential tests, the journal will have nearly NO impact on
> > > the result, even if tuned to that effect.
> > >
> > > > - Internal link for cluster and public network of 10Gbps
> > > >
> > > > - Deployment via ceph-ansible.
> > > > Same configuration defined in ansible for every host on the cluster.
> > > >
> > > > *Instance on pool nova in ruleset SAS:*
> > > >
> > > > # dd if=/dev/zero of=/mnt/bench bs=1G count=1 oflag=direct
> > > > 1+0 records in
> > > > 1+0 records out
> > > > 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.56255 s, 419 MB/s
> > > >
> > > This is a very small test for what you're trying to determine and not
> > > going to be very representative.
> > > If for example there _is_ a HW cache of 2GB on the Dell nodes, it would
> > > fit nicely in there.
> > >
> > Dell has a PERC H730 Mini (Embedded), each with a cache memory size of
> > 1024 MB, while HP uses a B140i dynamic array. Neither HP nor Dell uses
> > any RAID level for the OSDs; only Dell does, for the operating system.
> >
> So the Dells do have a HW cache, which of course will help immensely.
>
> > > > *Instance on pool nova in ruleset NOVA_HPEDL180:*
> > > >
> > > > # dd if=/dev/zero of=/mnt/bench bs=1G count=1 oflag=direct
> > > > 1+0 records in
> > > > 1+0 records out
> > > > 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 11.8243 s, 90.8 MB/s
> > > >
> > > > I ran some FIO benchmarks as suggested by Sebastien (
> > > > https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> > > > ) and the command with 1 job returned about 180MB/s of throughput on
> > > > the recently installed nodes (the HPE nodes). I ran some hdparm
> > > > benchmarks on all SSDs and everything seems normal.
> > > >
> > > I'd consider a 180MB/s result from a device that supposedly does
> > > 900MB/s a fail, but then again those tests above do NOT reflect journal
> > > usage reality; they are more of a hint as to whether something is
> > > totally broken or not.
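For reference, the journal-suitability test from Sebastien's post that I ran is essentially the following. The device name is just an example from our setup, and note that it writes straight to the raw device, so it must only be pointed at a journal device/partition whose contents can be destroyed:

```shell
# 4K sync-write test from Sebastien's post (1 job); DESTRUCTIVE to the
# target device -- /dev/nvme0n1 is an example name, adjust to your setup.
fio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based \
    --group_reporting --name=journal-test
```

This single-job variant is roughly where the ~180MB/s figure above came from.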
> > > >
> > > > I can't see what is causing this difference in throughput, since the
> > > > network is not a problem, and I think that CPU and memory are not
> > > > crucial, since I was monitoring the cluster with the atop command and
> > > > didn't notice saturation of resources. My only thought is that I have
> > > > less workload in the nova_hpedl180 pool on the HPE nodes and fewer
> > > > disks per node, and this can influence the throughput of the journal.
> > > >
> > > How busy are your NVMe journals during that test on the Dells and HPs
> > > respectively?
> > > Same for the HDDs.
> > >
> > I can't say precisely right now, but what I can tell you for sure is
> > that, monitoring both throughput and disk usage on these two pools, the
> > workload for the pool placed on the Dell nodes is significantly higher
> > than for the pool on the HP nodes. For example, the OSDs on the Dell
> > nodes often keep their usage between 70% and 100%, unlike the HP OSDs,
> > which vary between 10% and 40%.
> >
> Basically it boils down to what you're trying to test/compare here:
>
> 1) The speed of your NVMe journal devices?
> Put OSDs on them (with inline journal) and run extensive tests.
> And with that I mean fio with 4MB for sequential speeds, 4K for IOPS and
> latency, direct, sync, etc.
>
> 2) The actual production speed of these different servers and their pools?
> Same as above and stated before, run _long_ tests and see where speeds
> stabilize and what the utilizations are during that time.
>
> And as always, sequential speed is in nearly all use cases not what
> matters in a Ceph cluster anyway, certainly not one serving VMs.

Certainly after this discussion I can see the big picture better. I will plan
some representative tests with my team.

Cheers

> Christian
>
> > > Again, run longer, larger tests to get something that will actually
> > > register, also atop with shorter intervals.
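To make that planning concrete, the longer tests could look roughly like this; the file path, size and runtimes are placeholder values we will adapt:

```shell
# Long 4M sequential-write test for bandwidth; run until speeds stabilize,
# well past any HW cache (path/size/runtime are example values).
fio --filename=/mnt/bench/fio.dat --size=20G --direct=1 --rw=write \
    --bs=4M --numjobs=1 --iodepth=1 --runtime=600 --time_based \
    --group_reporting --name=seq-bw

# Long 4K direct+sync random-write test for IOPS and latency.
fio --filename=/mnt/bench/fio.dat --size=20G --direct=1 --sync=1 \
    --rw=randwrite --bs=4k --numjobs=1 --iodepth=1 --runtime=600 \
    --time_based --group_reporting --name=sync-iops

# Meanwhile, sample utilization on the OSD nodes at short intervals.
atop 2
```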
> > > Christian
> > >
> > > > Any clue about what is missing or what is happening?
> > > >
> > > > Thanks in advance.
> > > >
> > > --
> > > Christian Balzer                Network/Systems Engineer
> > > [email protected]            Rakuten Communications
>
> --
> Christian Balzer                Network/Systems Engineer
> [email protected]            Rakuten Communications
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
