Re: [ceph-users] Disk/Pool Layout

Jan Schermer Thu, 27 Aug 2015 13:00:07 -0700

> On 27 Aug 2015, at 21:37, Robert LeBlanc <[email protected]> wrote:
> 
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
> 
> 
> On Thu, Aug 27, 2015 at 1:13 PM, Jan Schermer  wrote:
> >
> >> On 27 Aug 2015, at 20:57, Robert LeBlanc  wrote:
> >>
> >> -----BEGIN PGP SIGNED MESSAGE-----
> >> Hash: SHA256
> >>
> >>
> >>
> >>
> >> On Thu, Aug 27, 2015 at 10:25 AM, Jan Schermer  wrote:
> >>> Some comments inline.
> >>> A lot of it depends on your workload, but I'd say you almost certainly 
> >>> need
> >>> higher-grade SSDs. You can save money on memory.
> >>>
> >>> What will be the role of this cluster? VM disks? Object storage?
> >>> Streaming?...
> >>>
> >>> Jan
> >>>
> >>> On 27 Aug 2015, at 17:56, German Anders  wrote:
> >>>
> >>> Hi all,
> >>>
> >>>   I'm planning to deploy a new Ceph cluster with IB FDR 56Gb/s and I've 
> >>> the
> >>> following HW:
> >>>
> >>> 3x MON Servers:
> >>>   2x Intel Xeon E5-2600@v3 8C
> >>
> >> This is overkill if only a monitor server.
> >
> > Maybe with newer releases of Ceph, but my Mons spin CPU pretty high (100% 
> > core, which means it doesn't scale that well with cores), and when 
> > adding/removing OSDs or shuffling data some of the peering issues I've seen 
> > were caused by lagging Mons.
> 
> If I remember right, you have a fairly large cluster. This is a pretty small 
> cluster, so probably OK with less CPU. Are you running Dumpling? I haven't 
> seen many issues with Hammer.
> 
Yes, Dumpling here.
> >
> >>
> >>>
> >>>   256GB RAM
> >>>
> >>>
> >>> I don't think you need that much memory, 64GB should be plenty (if that's
> >>> the only role for the servers).
> >>
> >>
> >> If it is only monitor, you can get by with even less.
> >>
> >>>
> >>>   1xIB FDR ADPT-DP (two ports for PUB network)
> >>>   1xGB ADPT-DP
> >>>
> >>>   Disk Layout:
> >>>
> >>>   SOFT-RAID:
> >>>   SCSI1 (0,0,0) (sda) - 120.0 GB ATA INTEL SSDSC2BB12 (OS-RAID1)
> >>>   SCSI2 (0,0,0) (sdb) - 120.0 GB ATA INTEL SSDSC2BB12 (OS-RAID1)
> >>>
> >>>
> >>> I 100% recommend going with SSDs for the /var/lib/ceph/mon storage, fast
> >>> ones (but they can be fairly small). Should be the same grade as journal
> >>> drives IMO.
> >>> NOT S3500!
> >>> I can recommend S3610 (just got some :)), Samsung 845 DC PRO. At least 1
> >>> DWPD rating, better go with 3 DWPD.
> >>
> >> S3500 should be just fine here. I get 25% better performance on the
> >> S3500 vs the S3700 doing sync direct writes. Write endurance should be
> >> just fine as the volume of data is not going to be that great. Unless
> >> there is something else I'm not aware of.
> >>
> >
> > S3500 is faster than S3700? I can compare 3700 x 3510 x 3610 tomorrow but 
> > I'd be very surprised if the S3500 had a _sustained_ throughput better than 
> > 36xx or 37xx. Were you comparing that on the same HBA and in the same way? 
> > (No offense, just curious)
> 
> None taken. I used the same box and swapped out the drives. The only 
> difference was the S3500 has been heavily used, the 3700 was fresh from the 
> package (if anything that should have helped the S3700).


What HBA was this?
With my LSI 2308 some drives have issues that manifest as an "IOPS 
amplification" of about 5x (unfortunately btrace doesn't work too well on my 
kernel so not 100% sure what is happening - still investigating).
To get the "true" speed of SSDs I either have to test them on AHCI or not use 
--sync=1 (direct should be sufficient - 1:1). And of course test that on a 
block device just as you do. I usually disable write cache also so that I get 
the bottom line of performance, sometimes it speeds the SSDs up actually.
But what I see is pretty wild, still not sure what's happening.
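
For anyone wanting to repeat the comparison described above (write cache off, direct I/O only, no --sync=1), here is a minimal dry-run sketch. It only prints the commands; nothing touches a disk. build_journal_test is a made-up helper name and /dev/sdX is a placeholder, and running the printed fio line against a real device would overwrite its contents.

```shell
#!/usr/bin/env bash
# Print (do not run) the commands for a direct-only 4k sequential write test
# with the drive's volatile write cache switched off first.
# build_journal_test is a hypothetical helper; /dev/sdX is a placeholder.
# WARNING: the printed fio command is destructive if run on a device with data.
build_journal_test() {
    local dev=$1 jobs=${2:-1}
    # hdparm -W0 turns the drive's write cache off
    echo "hdparm -W0 $dev"
    # same parameters as the quoted journal test, minus --sync=1
    echo "fio --filename=$dev --direct=1 --rw=write --bs=4k" \
         "--numjobs=$jobs --iodepth=1 --runtime=60 --time_based" \
         "--group_reporting --name=journal-test-direct"
}

build_journal_test /dev/sdX 1
```

Comparing the result with and without --sync=1 on the same HBA should show whether the controller, rather than the drive, is responsible for the difference.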

I only got the 3610 today and I got about 15K IOPS (same benchmark you do) when 
I started it, and it got up to 17.5K IOPS when I was leaving home. Let's see 
what it shows in the morning, I left it running overnight. If I remember 
correctly the S3700 did ~40K?
Anyway this is still only an artificial benchmark relevant to a journal-like 
workload, but mix that with some queued reads and varying block sizes and I bet 
the S3700 beats the lower models into the ground. I'm curious so I'll try 
finding the different performance characteristics when I get to it.
> 
> for i in {1..8}; do fio --filename=/dev/sda --direct=1 --sync=1 --rw=write 
> --bs=4k --numjobs=$i --iodepth=1 --runtime=60 --time_based --group_reporting 
> --name=journal-test; done
> 
> # jobs  IOPs   Bandwidth (KB/s)
> 
> Intel S3500 (SSDSC2BB240G4) Max 4K RW 7,500
> 1       5,617  22,468.0
> 2       8,326  33,305.0
> 3      11,575  46,301.0
> 4      13,882  55,529.0
> 5      16,254  65,020.0
> 6      17,890  71,562.0
> 7      19,438  77,752.0
> 8      20,894  83,576.0
> 
> Intel S3700 (SSDSC2BA200G3) Max 4K RW 32,000
>  1      4,417  17,670.0
>  2      5,544  22,178.0
>  3      7,337  29,352.0
>  4      9,243  36,975.0
>  5     11,189  44,759.0
>  6     13,218  52,874.0
>  7     14,801  59,207.0
>  8     16,604  66,419.0
>  9     17,671  70,685.0
> 10     18,715  74,861.0
> 11     20,079  80,318.0
> 12     20,832  83,330.0
> 13     20,571  82,288.0
> 14     23,033  92,135.0
> 15     22,169  88,679.0
> 16     22,875  91,502.0
> 
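
A quick sanity check on the table above: with --bs=4k the bandwidth column is just IOPS x 4 (KB/s), so both columns carry the same information. A minimal sketch, with function names of my own choosing (they are not part of fio):

```shell
#!/usr/bin/env bash
# Convert an IOPS figure to bandwidth in KB/s for a block size given in KB.
# iops_to_kbps and scaling are illustrative helpers, not fio features.
iops_to_kbps() {
    local iops=$1 bs_kb=${2:-4}
    echo $(( iops * bs_kb ))
}

# Ratio between two IOPS results, printed with two decimals.
scaling() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", b / a }'
}

iops_to_kbps 5617     # S3500, 1 job: prints 22468, matching the table
iops_to_kbps 4417     # S3700, 1 job: prints 17668 (table shows 17,670 after rounding)
scaling 5617 20894    # S3500, 8 jobs relative to 1 job
```

The small differences against the table come from fio reporting rounded IOPS.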
> >
> > Mons can use some space, I've experienced logging havoc, leveldb bloating 
> > havoc  (I have to compact manually or it just grows and grows), and my Mons 
> > write quite a lot at times. I guesstimate my mons can write 200GB a day, 
> > often less but often more. Maybe that's not normal. I can confirm those 
> > numbers tomorrow.
> 
> True, I haven't had the compact issues so I can't comment on that. He has a 
> small cluster so I don't think he will get to the level you have.
> 
I only have about 2x more OSDs than he does. A lot more space, yes, but the 
number of OSDs is comparable.
I also have a lot more PGs, but that only seems to improve things so far.
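
To put the 200GB-a-day guess from above into endurance terms, here is some rough arithmetic. It assumes, purely for illustration, that the mon store lives on one of the 120GB drives from the proposed layout and that the write rate stays constant; dwpd and lifetime_tb are made-up helper names.

```shell
#!/usr/bin/env bash
# Drive writes per day implied by a daily write volume (GB) and capacity (GB).
# dwpd and lifetime_tb are hypothetical helpers for back-of-envelope sizing.
dwpd() {
    awk -v w="$1" -v c="$2" 'BEGIN { printf "%.2f\n", w / c }'
}

# Total data written in TB over a number of years at a constant daily rate.
lifetime_tb() {
    local gb_per_day=$1 years=$2
    echo $(( gb_per_day * 365 * years / 1000 ))
}

dwpd 200 120        # prints 1.67
lifetime_tb 200 5   # prints 365
```

Compare the result against the DWPD or TBW rating on the data sheet of whatever drive is being considered.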

> >
> >>>
> >>>
> >>> 8x OSD Servers:
> >>>   2x Intel Xeon E5-2600@v3 10C
> >>>
> >>>
> >>> Go for the fastest you can afford if you need the latency - even at the
> >>> expense of cores.
> >>> Go for cores if you want bigger throughput.
> >>
> >> I'm in the middle of my testing, but it seems that with lots of I/O
> >> depth (either from a single client or multiple clients) that clock
> >> speed does not have as much of an impact as core count does. Once I'm
> >> done, I'll be posting my results. Unless you have a single client that
> >> has a QD=1, go for cores at this point.
> >
> > NoSQL is basically still a database, and while NoSQL is mostly newer 
> > technology built for clouds and horizontal scaling, you still need some 
> > baseline performance to achieve good durability, replication and so on.
> >
> >>
> >>>
> >>>   256GB RAM
> >>>
> >>>
> >>> Again - I think too much if that's the only role for those nodes, 64GB
> >>> should be plenty.
> >>
> >> Agree, if you can afford more RAM, it just means more page cache.
> >
> > But too much  page cache = bad.
> 
> I think /proc/sys/vm/min_free_kbytes helps.
Nope. Had that set all the way up to 10G with no effect.
One scenario (I think I described it here already) is when I start a new OSD. 
The new OSD needs to allocate ~2GB of memory and if it isn't truly "free" then 
it causes all sorts of problems (peering stuck, slow ops...). Lowering 
min_free_kbytes or dropping caches helps because it makes the memory actually 
available to the OSD and it starts right up, but that's not a nice solution.
This is CentOS6/RHEL6 with the 2.6.32 Red Hat frankenkernel with backports and a 
lot of patches that interact in mysterious ways...
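
The "truly free" distinction above can be checked before starting an OSD by looking at MemFree in /proc/meminfo, which excludes page cache. A minimal sketch; has_free_kb is a hypothetical helper, and the 2GB threshold is simply the figure mentioned above, not a value Ceph documents.

```shell
#!/usr/bin/env bash
# Succeed if MemFree (free memory excluding page cache) is at least the
# requested number of kB. Reads meminfo-formatted text from stdin.
# has_free_kb is a hypothetical helper name.
has_free_kb() {
    local need_kb=$1
    awk -v need="$need_kb" '$1 == "MemFree:" { free = $2 }
        END { exit !(free + 0 >= need + 0) }'
}

if has_free_kb $((2 * 1024 * 1024)) < /proc/meminfo; then
    echo "enough free memory"
else
    echo "short on free memory; page cache may need to be reclaimed first"
    # Dropping clean page cache, as described above (run as root):
    #   sync; echo 1 > /proc/sys/vm/drop_caches
fi
```

This only reports the state; whether to drop caches on a production node is a separate decision.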

> 
> >
> >>
> >>>
> >>>
> >>>   1xIB FDR ADPT-DP (one port for PUB and one for CLUS network)
> >>>   1xGB ADPT-DP
> >>>
> >>>   Disk Layout:
> >>>
> >>>   SOFT-RAID:
> >>>   SCSI1 (0,0,0) (sda) - 120.0 GB ATA INTEL SSDSC2BB12 (OS-RAID1)
> >>>   SCSI2 (0,0,0) (sdb) - 120.0 GB ATA INTEL SSDSC2BB12 (OS-RAID1)
> >>>
> >>>   JBOD:
> >>>   SCSI9 (0,0,0) (sdd) - 120.0 GB ATA INTEL SC3500 SSDSC2BB12 (Journal)
> >>>   SCSI9 (0,1,0) (sde) - 120.0 GB ATA INTEL SC3500 SSDSC2BB12 (Journal)
> >>>   SCSI9 (0,2,0) (sdf) - 120.0 GB ATA INTEL SC3500 SSDSC2BB12 (Journal)
> >>>
> >>>
> >>> No no no. Those SSDs will die a horrible death, too little endurance.
> >>> Better go with 2x 3700 in RAID1 and partition them for journals. Or just
> >>> don't use journaling drives and buy better SSDs for storage.
> >>
> >> If he is only using these for journals, he can be just fine. He can
> >> get the same endurance as the S3700 by only using a portion of the
> >> drive space. [1][2]
> >
> > True for the 120GB drives. You only really need something like 1-10GB at 
> > most.
> > I'd still get a smaller higher-class drive and just not touch provisioning, 
> > if only for the sake of warranty. But I think it's easier to just skip 
> > dedicated journal drives in this case.
> 
> I think I remember someone saying that journals on separate SSDs gave them 
> better performance than journals co-located on the SSD, but I don't remember 
> for sure. If warranty replacement is your primary concern, then go with the 
> 3700. If they already have the 3500, they can get it to perform/endure like 
> the 3700, the only cost being disk space.
Yeah. It's true the 3500s will likely survive a few years and then the cost for 
something like 37xx will be much lower. 

The issue with journals on the same _filesystem_ is that an fsync of the journal 
causes all the dirty data to be flushed out. You should have a separate 
partition so that the two don't interact (except in the drive and its cache, a 
non-issue with Intels).
On the other hand, if you have the journal as a file on a filesystem you can 
disable barriers and get much higher throughput, while disabling flushes on a 
block device is hard or impossible (there's a very obscure option of echoing 
"temporary write through" to the scsi_disk/cache_type sysfs node, but that's 
not available on Ubuntu for example).
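
To see what the cache_type sysfs node mentioned above currently reports for each disk, something like the following should work. It is a sketch only: show_cache_types is an illustrative name, and the sysfs root is a parameter just so the function can be tried against a scratch directory.

```shell
#!/usr/bin/env bash
# Print "<SCSI address> <cache_type>" for every disk under the given root.
# show_cache_types is an illustrative helper name.
show_cache_types() {
    local root=${1:-/sys/class/scsi_disk} f
    for f in "$root"/*/cache_type; do
        # skip when the glob matched nothing or the node is unreadable
        [ -r "$f" ] || continue
        printf '%s %s\n' "$(basename "$(dirname "$f")")" "$(cat "$f")"
    done
}

show_cache_types

# Trying the option described above on one disk (as root, and only where the
# kernel accepts it; H:C:T:L is a placeholder for the disk's SCSI address):
#   echo "temporary write through" > /sys/class/scsi_disk/H:C:T:L/cache_type
```

The write itself is left as a comment on purpose, since it changes how the kernel treats flushes for that disk.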
> 
> >
> > NoSQL is very write intensive - depending on implementation (applications) 
> > of course. But it's not unusual to have 300MB of semi-structured data and 
> > 100GB indexes that are rebuilt all the time (of course that indicates the 
> > developers were just lazystupid, which is exactly why NoSQL is so popular 
> > and Agile :)).
> 
> Understandable. Our cluster is primarily write because reads are being served 
> out of all the layers of cache. Overprovisioned 3500s will work just as well 
> as the 3700.
> 
> - ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.0.2
> Comment: https://www.mailvelope.com <https://www.mailvelope.com/>
> 
> wsFcBAEBCAAQBQJV32bnCRDmVDuy+mK58QAA0e4P/3jclEcvCRWgOYwUz0bo
> scf42NOhyNp3bPt4sUMN5h1aptX1s9TtUQxaq9yficjHhIb9ZBt1/SPxzDpf
> cbWBMgjKgEPHhN7AAGK6HwlQ+zrB8znRPabv81JO9heIwrcOY7LLJTl8kpij
> 0ktU7oRBn4xTDINTugZnq+YaBL+8N1/5g65lev6nnMs9ngTh4DSmjYuDjxFH
> Y8YuToImBQtuUQiL4feNN+lA+fPy3k0iYaTS2XvO7yX+w84ElDjUHvjZxOTt
> kZE5/YMKz7sImhhvLmvRRpqpEbJVPDl6JqhbyMTwpH4fkebrEGY/EbVYV+bT
> m3Hq6iMIs2NleExShOwdUK0r0cw1MnWPThdEtOAHefefDcsWPZoQpvPiuqwJ
> MdFxGP1LnX7yx1vYAt89nRhUsBQUvCcparcjjbM4aIe/6Q39Orkqb4sMuygf
> VyxFRwULDPwnl6xMn/oVIAXycXOMs3dWM12t6UGfe4kmSGEoShzkwimgJcvC
> lQnrp8u6jFYz6lflMMOQRauJSA4vDAU63JJMb7MLDqI6zy7MqXjnA9kyS1PP
> Px7mgxLINQ/KG4ymGtlRNKfZVF29fe+CGYZEwrVFsRGAIJsfG9TZj3IhdO1r
> /9gkXHvvE6NMPQWWNwxnvnFseqdNDbCZl3DFy9fciCgofznNo2sQumY8eG9P
> k5jF
> =HkOn
> -----END PGP SIGNATURE-----

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
