>>> - 20 spinning SAS disks per node.
>> Don't use legacy HDDs if you care about performance.
>
> You are right here, but we use Ceph mainly for RBD. It performs 'good enough'
> for our RBD load.
You use RBD for archival?
>>> - Some nodes have 256GB RAM, some nodes 128GB.
>> 128GB is on the low side for 20 OSDs.
>
> Agreed, but with 20 OSDs x osd_memory_target of 4GB (80GB) it is enough. We
> haven't had any server OOM yet.
Remember that's a *target*, not a *limit*. Say one or more of your failure
domains goes offline or you have some other large topology change. Your OSDs
might then want up to 2x osd_memory_target, and you OOM and it cascades. I've
been there; I had to do an emergency upgrade of 24-OSD nodes from 128GB to
192GB.
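For reference, osd_memory_target is applied at runtime, so if you do add RAM
you can raise it without restarting anything. A rough sketch (the ~6GB value
is illustrative, not a recommendation):

    # see what the OSDs currently target
    ceph config get osd osd_memory_target
    # raise it cluster-wide, e.g. to ~6GB per OSD
    ceph config set osd osd_memory_target 6442450944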
>>> - CPU varies between Intel E5-2650 and Intel Gold 5317.
>> The E5-2650 is underpowered for 20 OSDs. The 5317 isn't the ideal fit
>> either; it'd make a decent MDS system. Assuming a dual-socket system you
>> have ~2 threads per OSD, which is maybe acceptable for HDDs, but I assume
>> you have mon/mgr/rgw on some of them too.
>
> The (CPU) load on the OSD nodes is quite low. Our MON/MGR/RGW aren't hosted
> on the OSD nodes and are running on modern hardware.
You didn't list any additional nodes, so I assumed they were colocated. You
might still do well to run a larger number of RGWs, wherever they live; RGWs
often scale better horizontally than vertically.
>
>> rados bench is useful for smoke testing, but not always a reflection of E2E
>> experience.
>>> Unfortunately not getting the same performance with Rados Gateway (S3).
>>> - 1x HAProxy with 3 backend RGW's.
>> Run an RGW on every node.
>
> On every OSD node?
Yep, why not?
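If you happen to be on cephadm it's a quick change; a sketch, where the realm,
zone and label names are placeholders for whatever you actually use:

    # label the OSD hosts, then place one RGW per labelled host
    ceph orch host label add <hostname> rgw
    ceph orch apply rgw myrealm myzone --placement="label:rgw"

With plain packages it's just more radosgw instances/units. Either way,
remember to add the new instances as backends in HAProxy.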
>>> I am using Minio Warp for benchmarking (PUT), with 1 Warp server and 5
>>> Warp clients, benchmarking against the HAProxy.
>>> Results:
>>> - Using 10MB object size, I am hitting the 10Gbit/s link of the HAProxy
>>> server. That's good.
>>> - Using 500K object size, I am getting a throughput of 70 to 150 MB/s
>>> with 140 to 300 obj/s.
>> Tiny objects are the devil of any object storage deployment. The HDDs are
>> killing you here, especially for the index pool. You might do a bit better
>> by raising pg_num above the party-line default.
>
> I would expect high write await times, but all OSDs/disks show write await
> times of 1 to 3 ms.
There are still serialization points in the OSD and PG code. You have 240
OSDs; does your index pool have *at least* 256 PGs?
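Quick way to check, and to bump it if not (pool name assumes the default
zone; 256 is a floor to start from, not a tuned value):

    ceph osd pool get default.rgw.buckets.index pg_num
    ceph osd pool set default.rgw.buckets.index pg_num 256

If the pg autoscaler is enabled it may fight you on that, so either set a
pg_num_min on the pool or switch the autoscaler to warn mode for it.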
>
>> You might also disable Nagle on the RGW nodes.
>
> I need to look up what exactly that is and does.
>
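Short version: Nagle's algorithm batches small TCP writes before sending,
which adds latency to exactly the kind of small-request chatter RGW generates;
TCP_NODELAY turns that batching off. If you're on the civetweb frontend it's
an option on the rgw_frontends line, roughly like this (illustrative, check
the docs for the frontend you actually run):

    rgw_frontends = civetweb port=7480 tcp_nodelay=1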
>>> It depends on the concurrency setting of Warp.
>>> It looks like objects/s is the bottleneck, not the throughput.
>>> Max memory usage is about 80-90GB per node. The CPUs are mostly idle.
>>> Is it reasonable to expect more IOPS / objects/s for RGW with my setup? At
>>> this moment I am not able to find the bottleneck that is causing the low
>>> obj/s.
>> HDDs are a false economy.
>
> Got it :)
>
>>> Ceph version is 15.2.
>>> Thanks!
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]