Hello Christian.

2017-08-24 22:43 GMT-03:00 Christian Balzer <[email protected]>:

>
> Hello,
>
> On Thu, 24 Aug 2017 14:49:24 -0300 Guilherme Steinmüller wrote:
>
> > Hello Christian.
> >
> > First of all, thanks for your considerations, I really appreciate it.
> >
> > 2017-08-23 21:34 GMT-03:00 Christian Balzer <[email protected]>:
> >
> > >
> > > Hello,
> > >
> > > On Wed, 23 Aug 2017 09:11:18 -0300 Guilherme Steinmüller wrote:
> > >
> > > > Hello!
> > > >
> > > > I recently installed INTEL SSD 400GB 750 SERIES PCIE 3.0 X4 in 3 of
> > > > my OSD nodes.
> > > >
> > > Well, you know what's coming now, don't you?
> > >
> > > That's a consumer device, with 70GB writes per day endurance.
> > > Unless you're running an essentially read-only cluster, you're
> > > throwing away money.
> > >
> >
> > Yes, we knew we were buying a consumer device, due to our limited budget
> > and our goal of building a small pilot of a production cloud. This model
> > seemed acceptable; it was at the top of the list of consumer models in
> > Sebastien's benchmarks.
> >
> > We are a lab that depends on different budget sources to acquire
> > equipment, so the sources vary and most of the time we are constrained
> > by different budget ranges.
> >
> Noted, I hope your tests won't last too long or move a lot of data. ^o^
>
> > >
> > > > First of all, here is a schema describing how my cluster is set up:
> > > >
> > > > [image: inline image 1]
> > > >
> > > > [image: inline image 2]
> > > >
> > > > I primarily use my Ceph cluster as a backend for OpenStack nova,
> > > > glance, swift and cinder. My crushmap is configured with rulesets
> > > > for SAS disks, SATA disks, and another ruleset that resides on HPE
> > > > nodes, also using SATA disks.
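> > > >
> > > > (For illustration, a minimal sketch of what one such ruleset looks
> > > > like in a decompiled crushmap; the rule name, ruleset number and
> > > > bucket name "sas" here are placeholders:)
> > > >
> > > >     rule sas {
> > > >             ruleset 1
> > > >             type replicated
> > > >             min_size 1
> > > >             max_size 10
> > > >             step take sas
> > > >             step chooseleaf firstn 0 type host
> > > >             step emit
> > > >     }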
> > > >
> > > > Before installing the new journal in the HPE nodes, I was using one
> > > > of the disks that today are OSDs (osd.35, osd.34 and osd.33). After
> > > > upgrading the journal, I noticed that a dd command writing 1GB
> > > > blocks in OpenStack nova instances doubled the throughput, but the
> > > > expected gain was actually 400% or 500%, since on the Dell nodes,
> > > > where we have another nova pool, the throughput is around that value.
> > > >
> > > Apples, oranges and bananas.
> > > You're comparing different HW (and no, I'm not going to look this up)
> > > which may or may not have vastly different capabilities (like HW
> > > cache), RAM and (unlikely to be relevant) CPU.
> > >
> >
> >
> > Indeed, we took this into account. The HP servers were cheaper and have
> > a poorer configuration because of that limited budget source.
> >
> >
> > > Your NVMe may also be plugged into a different, insufficient PCIe
> > > slot for all we know.
> > >
> >
> > I checked this. I compared the slot information between the 3 Dell
> > nodes and the 3 HP nodes by running:
> >
> > # ls -l /sys/block/nvme0n1
> > # lspci -vvv -s 0000:06:00.0 <- slot identifier
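> >
> > (To also rule out a downgraded link, the negotiated PCIe speed and width
> > can be read from the same lspci output; a minimal sketch, assuming the
> > same slot address on both vendors:)
> >
> > # lspci -vvv -s 0000:06:00.0 | grep -i 'lnkcap\|lnksta'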
> >
> > The only difference is:
> >
> > Dell has a parameter called *Cache Line Size: 32 bytes* and HP doesn't
> > have this.
> >
> That shouldn't be relevant, AFAIK.
>
> >
> >
> > > You're also using very different HDDs, which definitely will be a
> > > factor.
> > >
> > >
> > I thought that the backend disks would not interfere that much. For
> > example, the ceph journal has a parameter called filestore max sync
> > interval, which means the journal commits the transactions to the
> > backend OSDs at a defined interval; ours is set to 35. So the client
> > requests go first to the SSD and are then committed to the OSDs.
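> >
> > (The setting I mean, as a minimal ceph.conf sketch with our value:)
> >
> > [osd]
> > filestore max sync interval = 35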
> >
> As I wrote before, the journal does not come into play for any large
> amounts of data unless massively tuned and/or under extreme pressure.
>
> You need to touch many more of the journal and filestore parameters than
> max_sync, which by itself does nothing to prevent min_sync and other
> values from starting to flush more or less immediately.
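>
> (For context, a minimal sketch of the knobs in question in ceph.conf; the
> values here are illustrative only, not a recommendation, so check the
> defaults of your release:)
>
> [osd]
> filestore min sync interval = 0.01
> filestore max sync interval = 5
> journal max write bytes = 10485760
> journal queue max bytes = 33554432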
>
> And tuning things so the journal is used extensively by default will
> result in I/O storms slowing things to a crawl when it finally flushes to
> the HDDs.
>
> If your google-fu is strong enough, you should find the relevant
> discussions about this, often in the context of SSD OSDs, where such
> tuning makes some sense.
>
> >
> > > But most importantly, you're comparing 2 pools of vastly different
> > > OSD count; no wonder a pool with 15 OSDs is faster in sequential
> > > writes than one with 9.
> > >
> > > > Here is a demonstration of the scenario and the difference in
> > > > performance between Dell nodes and HPE nodes:
> > > >
> > > >
> > > >
> > > > Scenario:
> > > >
> > > >
> > > >    -    Using pools to store instance disks for OpenStack
> > > >
> > > >
> > > >    -     Pool nova in "ruleset SAS" placed on c4-osd201, c4-osd202
> > > >    and c4-osd203 with 5 OSDs per host
> > > >
> > > SAS
> > > >
> > > >    -     Pool nova_hpedl180 in "ruleset NOVA_HPEDL180" placed on
> > > >    c4-osd204, c4-osd205, c4-osd206 with 3 OSDs per host
> > > >
> > > SATA
> > > >
> > > >    -     Every OSD has one 35GB partition on an INTEL SSD 400GB 750
> > > >    SERIES PCIE 3.0 X4
> > > >
> > > Overkill, but since your NVMe will die shortly anyway...
> > >
> > > With large sequential tests, the journal will have nearly NO impact
> > > on the result, even if tuned to that effect.
> > >
> > > >
> > > >    -     Internal link for cluster and public network of 10Gbps
> > > >
> > > >
> > > >    -     Deployment via ceph-ansible. Same configuration defined in
> > > >    ansible for every host in the cluster
> > > >
> > > >
> > > >
> > > > *Instance on pool nova in ruleset SAS:*
> > > >
> > > >
> > > >    # dd if=/dev/zero of=/mnt/bench bs=1G count=1 oflag=direct
> > > >        1+0 records in
> > > >        1+0 records out
> > > >        1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.56255 s, 419 MB/s
> > > >
> > > This is a very small test for what you're trying to determine and not
> > > going to be very representative.
> > > If for example there _is_ a HW cache of 2GB on the Dell nodes, it would
> > > fit nicely in there.
> > >
> > >
> > The Dell nodes have a PERC H730 Mini (Embedded), each with a cache
> > memory size of 1024 MB, whereas my HP nodes use a B140i dynamic array.
> > Neither HP nor Dell uses any RAID level for the OSDs; only Dell does,
> > for the operating system.
> >
> So the Dells do have a HW cache, which of course will help immensely.
>
>
> >
> >
> > > >
> > > > *Instance on pool nova in ruleset NOVA_HPEDL180:*
> > > >
> > > >      #  dd if=/dev/zero of=/mnt/bench bs=1G count=1 oflag=direct
> > > >      1+0 records in
> > > >      1+0 records out
> > > >      1073741824 bytes (1.1 GB, 1.0 GiB) copied, 11.8243 s, 90.8 MB/s
> > > >
> > > >
> > > > I made some FIO benchmarks as suggested by Sebastien (
> > > > https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> > > > ) and the command with 1 job returned about 180MB/s of throughput on
> > > > the recently installed nodes (the HPE nodes). I made some hdparm
> > > > benchmarks on all SSDs and everything seems normal.
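> > > >
> > > > (For reference, the 1-job write-sync test from that post looks
> > > > roughly like this, with /dev/nvme0n1 standing in for the journal
> > > > device; it writes to the raw device, so it is destructive:)
> > > >
> > > > # fio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write \
> > > >       --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based \
> > > >       --group_reporting --name=journal-test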
> > > >
> > > I'd consider a 180MB/s result from a device that supposedly does
> > > 900MB/s a fail, but then again those tests above do NOT reflect
> > > journal usage reality; they are more of a hint as to whether something
> > > is totally broken or not.
> > >
> > > >
> > > > I can't see what is causing this difference in throughput, since the
> > > > network is not a problem, and I think that CPU and memory are not
> > > > crucial, since I was monitoring the cluster with the atop command
> > > > and didn't notice any saturation of resources. My only thought is
> > > > that I have less workload in the nova_hpedl180 pool on the HPE nodes
> > > > and fewer disks per node, and this can influence the throughput of
> > > > the journal.
> > > >
> > > How busy are your NVMe journals during that test on the Dells and HPs
> > > respectively?
> > > Same for the HDDs.
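> > >
> > > (A minimal sketch of how to check that during a test run, assuming the
> > > NVMe shows up as nvme0n1 and one HDD as sda; watch the %util column:)
> > >
> > > # iostat -x /dev/nvme0n1 /dev/sda 2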
> > >
> >
> >
> > I can't say precisely right now, but what I can tell you for sure is
> > that, monitoring these two pools, both throughput and disk usage, I can
> > see that the workload for the pool placed on the Dell nodes is
> > significantly higher than for the pool on the HP nodes. For example,
> > the OSDs in the Dell nodes often keep their usage between 70% and 100%,
> > unlike the HP OSDs, which vary between 10% and 40%.
> >
>
> Basically it boils down to what you're trying to test/compare here:
>
> 1) The speed of your NVMe journal devices?
> Put OSDs on them (with inline journal) and run extensive tests.
> And with that I mean fio with 4MB blocks for sequential speeds and 4K for
> IOPS and latency; direct, sync, etc. (a sketch follows below).
>
> 2) The actual production speed of these different servers and their pools?
> Same as above and stated before, run _long_ tests and see where speeds
> stabilize and what the utilizations are during that time.
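>
> (A minimal sketch of such fio runs, assuming the NVMe is /dev/nvme0n1 and
> noting that raw-device write tests destroy its data. 4MB sequential
> writes for throughput:)
>
> # fio --filename=/dev/nvme0n1 --direct=1 --rw=write --bs=4M --numjobs=1 \
>       --iodepth=1 --runtime=120 --time_based --group_reporting --name=seq
>
> (4K sync writes for IOPS and latency:)
>
> # fio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=randwrite --bs=4k \
>       --numjobs=1 --iodepth=1 --runtime=120 --time_based \
>       --group_reporting --name=iops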
>
>
> And as always, in nearly all use cases sequential speed is not what
> matters in a Ceph cluster anyway, certainly not one serving VMs.
>
>

After this discussion I can certainly see the big picture better. I will
plan some representative tests with my team.

Cheers



> Christian
> >
> > >
> > > Again, run longer, larger tests to get something that will actually
> > > register; also run atop with shorter intervals.
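> > >
> > > (atop's sampling interval is its first argument, e.g. one-second
> > > samples instead of the 10-second default:)
> > >
> > > # atop 1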
> > >
> > > Christian
> > > >
> > > > Any clue about what is missing or what is happening?
> > > >
> > > > Thanks in advance.
> > >
> > >
> > > --
> > > Christian Balzer        Network/Systems Engineer
> > > [email protected]           Rakuten Communications
> > >
>
>
> --
> Christian Balzer        Network/Systems Engineer
> [email protected]           Rakuten Communications
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
