Re: [ceph-users] HA and data recovery of CEPH
That is true. When an OSD goes down it will take a few seconds for its Placement Groups to re-peer with the other OSDs. During that period writes to those PGs will stall for a couple of seconds. I wouldn't say it's 40s, but it can take ~10s.

Hello,

In my experience, when an OSD crashes or is killed with -9 (any kind of abnormal termination), OSD failure handling goes through the following steps:

1) The failed OSD's peers detect that it does not respond - this can take up to osd_heartbeat_grace + osd_heartbeat_interval seconds.
2) The peers send failure reports to the monitor.
3) The monitor makes a decision based on options from its own config (mon_osd_adjust_heartbeat_grace, osd_heartbeat_grace, mon_osd_laggy_halflife, mon_osd_min_down_reporters, ...) and finally marks the OSD down in the osdmap.
4) The monitor sends the updated osdmap to the OSDs and clients.
5) The OSDs start peering.
5.1) Peering itself is a complicated process; for example, we have seen PGs stuck in the inactive state due to osd_max_pg_per_osd_hard_ratio.
6) Peering finishes (the PGs' data keeps moving) and clients can access the affected PGs normally again.

Clients also have their own timeouts that can affect time to recover. Again, in my experience, 40s with default settings is possible.

--
Best regards!
Aleksei Gutikov

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
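For a rough feel of step 1 alone, here is a back-of-the-envelope sketch. The option names are real Ceph options, but the default values below are assumptions on my part (20s grace, 6s ping interval; verify against your release with `ceph config help`):

```python
# Rough upper bound for step 1 of the failure-handling sequence above:
# how long peers may take to notice a dead OSD before reporting it.
# NOTE: the default values below are assumed, not read from a cluster.

OSD_HEARTBEAT_GRACE = 20    # seconds without a reply before a peer is suspect
OSD_HEARTBEAT_INTERVAL = 6  # seconds between heartbeat pings

def worst_case_detection_seconds(grace=OSD_HEARTBEAT_GRACE,
                                 interval=OSD_HEARTBEAT_INTERVAL):
    """Worst case: the OSD dies right after answering a ping, so the
    peers wait one full interval plus the whole grace period."""
    return grace + interval

if __name__ == "__main__":
    print(worst_case_detection_seconds())  # 26 with the assumed defaults
```

And that is only detection; the reporting, osdmap propagation, and peering steps add their own time on top, which is how 40s totals become plausible.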
[ceph-users] Is it possible not to list rgw names in ceph status output?
In Nautilus, ceph status writes "rgw: 50 daemons active" and then lists all 50 names of the rgw daemons. That takes significant space in the terminal. Is it possible to disable the list of names and make the output like in Luminous, i.e. only the number of active daemons?

Thanks,
Aleksei
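I don't know of a config option for this, but as a workaround you can pull just the count from the JSON form of the status. A hedged sketch - the `servicemap.services.rgw.daemons` layout below (with a `summary` key alongside the daemon entries) is an assumption about what `ceph status --format json` emits, so check it against your cluster first:

```python
import json

def count_rgw_daemons(status_json: str) -> int:
    """Count active rgw daemons from `ceph status --format json` output
    without printing their names. Assumes the servicemap layout where
    every key in rgw.daemons except "summary" is one daemon."""
    status = json.loads(status_json)
    rgw = status.get("servicemap", {}).get("services", {}).get("rgw", {})
    daemons = rgw.get("daemons", {})
    return sum(1 for name in daemons if name != "summary")

# Trimmed-down example document in the assumed shape:
sample = json.dumps({
    "servicemap": {"services": {"rgw": {"daemons": {
        "summary": "",
        "rgw.a": {}, "rgw.b": {}, "rgw.c": {},
    }}}}
})
print(count_rgw_daemons(sample))  # 3
```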
Re: [ceph-users] RGW 4 MiB objects
Hi Thomas,

We did some investigation a while ago and came up with several rules for how to configure rgw and osd for big files stored on an erasure-coded pool. I hope it is useful, and if I have made any mistakes, please let me know.

S3 object saving pipeline:
- The S3 object is divided into multipart shards by the client.
- Rgw shards each multipart shard into rados objects of size rgw_obj_stripe_size.
- The primary osd stripes each rados object into ec stripes of width == ec.k * profile.stripe_unit, erasure-codes them, sends the units to the secondary osds, and writes them into the object store (bluestore).
- Each subobject of a rados object has size == (rados object size) / k.
- While writing to disk, bluestore can further divide a rados subobject into extents of minimal size == bluestore_min_alloc_size_hdd.

The following rules can save some space and iops:
- rgw_multipart_min_part_size SHOULD be a multiple of rgw_obj_stripe_size (the client can use a different, larger value).
- rgw_obj_stripe_size MUST be == rgw_max_chunk_size.
- ec stripe == osd_pool_erasure_code_stripe_unit or profile.stripe_unit.
- rgw_obj_stripe_size SHOULD be a multiple of profile.stripe_unit * ec.k.
- bluestore_min_alloc_size_hdd MAY be equal to bluefs_alloc_size (to avoid fragmentation).
- rgw_obj_stripe_size / ec.k SHOULD be a multiple of bluestore_min_alloc_size_hdd.
- bluestore_min_alloc_size_hdd MAY be a multiple of profile.stripe_unit.

For example, with ec.k=5:
- rgw_multipart_min_part_size = rgw_obj_stripe_size = rgw_max_chunk_size = 20M
- rados object size == 20M
- profile.stripe_unit = 256k
- rados subobject size == 4M (20M / 5), i.e. 16 ec stripe units
- bluestore_min_alloc_size_hdd = bluefs_alloc_size = 1M
- a rados subobject can be written in 4 extents, each containing 4 ec stripe units

On 30.07.19 17:35, Thomas Bennett wrote:
Hi,
Does anyone out there use bigger than default values for rgw_max_chunk_size and rgw_obj_stripe_size?
I'm planning to set rgw_max_chunk_size and rgw_obj_stripe_size to 20MiB, as it suits our use case, and from our testing we can't see any obvious reason not to. Is there any convincing experience suggesting we should stick with 4MiB?

Regards,
Tom

--
Best regards!
Aleksei Gutikov | Ceph storage engineer
synesis.ru | Minsk, BY
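The arithmetic in Aleksei's ec.k=5 example can be sanity-checked with a small sketch (the variable names are just illustrative, not Ceph code; only the option values come from the post above):

```python
# Check the sizing example: k=5, rgw_obj_stripe_size=20M,
# profile.stripe_unit=256k, bluestore_min_alloc_size_hdd=1M.
M = 1024 * 1024
K = 1024

ec_k = 5
rgw_obj_stripe_size = 20 * M
stripe_unit = 256 * K
min_alloc_hdd = 1 * M

# Each rados subobject is (rados object size) / k.
subobject = rgw_obj_stripe_size // ec_k                  # 4 MiB
stripe_units_per_subobject = subobject // stripe_unit    # 16
extents_per_subobject = subobject // min_alloc_hdd       # 4
units_per_extent = min_alloc_hdd // stripe_unit          # 4

# The divisibility rules from the post, as assertions:
assert rgw_obj_stripe_size % (stripe_unit * ec_k) == 0
assert subobject % min_alloc_hdd == 0
assert min_alloc_hdd % stripe_unit == 0

print(subobject // M, stripe_units_per_subobject,
      extents_per_subobject, units_per_extent)  # 4 16 4 4
```

All three divisibility rules hold, which is why each 4M subobject lands cleanly in 4 one-megabyte extents of 4 stripe units each, with no partially filled allocation units.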