Re: [ceph-users] Boot volume on OSD device

2019-01-20 Thread Hector Martin
On 20/01/2019 05.50, Brian Topping wrote:
> My main constraint is I had four disks on a single machine to start with
> and any one of the disks should be able to fail without affecting the
> ability for the machine to boot, the bad disk replaced without requiring
> obscure admin skills, and the final recovery to the promised land of
> “HEALTH_OK”. A single machine Ceph deployment is not much better than
> just using local storage, except the ability to later scale out. That’s
> the use case I’m addressing here.

I assume partitioning the drive and using mdadm to add it to one or
more RAID arrays and then dealing with the Ceph side doesn't qualify as
"obscure admin skills", right? :-)

(I also use single-host Ceph deployments; I like its properties over
traditional RAID or things like ZFS).

> https://theithollow.com/2012/03/21/understanding-raid-penalty/ provided
> a good background that I did not previously have on the RAID write
> penalty. I combined this with what I learned
> in 
> https://serverfault.com/questions/685289/software-vs-hardware-raid-performance-and-cache-usage/685328#685328.
> By the end of these two articles, I felt like I knew all the tradeoffs,
> but the final decision really came down to the penalty table in the
> first article and a “RAID penalty” of 2 for RAID 10, which was the same
> as the penalty for RAID 1, but with 50% better storage efficiency.

FWIW, I disagree with that article on RAID write penalty. It's an
oversimplification and the math doesn't really add up. I don't like the
way they define the concept of "write penalty" relative to the sum of
disk performance. It should be relative to a single disk.

Here's my take on it. First of all, you need to consider three different
performance metrics for writes:

- Sequential writes (seq)
- Random writes < stripe size (small)
- Random writes >> stripe size or aligned (large)

* stripe size is the size across all disks for RAID5/6, but a single
disk for RAID0

And here is the performance, where n is the number of disks, relative to
a single disk of the same type:

         seq   small  large
RAID 0   n     n      1
RAID 1   1     1      1
RAID 5   n-1   0.5    1
RAID 6   n-2   0.5    1
RAID 10  n/2   n/2    1

RAID0 gives a throughput improvement proportional to the number of
disks, and the same small IOPS improvement *on average* (assuming your
I/Os hit all the disks equally, not like repeatedly hammering one stripe
chunk). There is also some loss of performance because whenever I/O hits
multiple disks the *slowest* disk becomes the bottleneck, so if the
worst case latency is 10ms for a single disk, your average latency is
5ms, but the average latency for the slowest of two disks is 6.6ms, for
three disks 7.5ms, etc. approaching 10ms as you add disks.
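
(To spell out the arithmetic behind those numbers, under the simplifying
assumption that per-I/O latency is uniform between 0 and the 10ms worst case:

    E[slowest of n disks] = n/(n+1) * 10ms
    n=1: 5ms    n=2: ~6.7ms    n=3: 7.5ms    -> 10ms as n grows

which is where the figures above come from.)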

RAID1 is just like using a single disk, really. All the disks do the
same thing in parallel. That's it.

RAID5 has the same sequential improvement as RAID0, except with one
fewer disk, because parity takes one disk. However, small writes become
read-modify-write operations (it has to read the old data and parity to
update the parity), so you get half the IOPS. If your write is
stripe-aligned this penalty goes away, and misaligned writes larger than
several stripes amortize the penalty (it only hits the beginning and
end), so the performance approaches 1 as your write size increases, and
exceeds it as the sequential effect starts to dominate.
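
One way to count that, for a single sub-stripe write (the same accounting
applies to the RAID6 case below):

    read(old data), read(old parity)      -> one parallel read pass
    new parity = old parity XOR old data XOR new data
    write(new data), write(new parity)    -> one parallel write pass

Two back-to-back passes on the disks involved instead of one, hence the 0.5
factor relative to a single disk.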

RAID6 is like RAID5 but with two parity disks. You still need a
(parallel) read and a (parallel) write for every small write.

RAID10 is just a RAID0 of RAID1s, so you ignore half the disks (the
mirrors) and the rest behave like RAID0.

The large/aligned I/O performance is identical to a single disk across
all RAID levels, because when your I/Os are larger than one stripe, then
*all* disks across the RAID have to handle the I/O (in parallel).

This is all assuming no controller or CPU bottlenecking. Realistically,
with software RAID and a non-terrible HBA, this is a pretty reasonable
assumption. There will be some extra overhead, but not much. Also, some
of the impact of RAID5/6 will be reduced by caching (hardware cache with
hardware RAID, or software stripe cache with md-raid).

This is all still a simplification in some ways, but I think it's closer
to reality than that article.

(somewhat offtopic for this list, but after seeing that article I felt I
had to try my own attempt at doing the math here).

Personally, I've had two set-ups like yours and this is what I did:

- On a production cluster with several OSDs across 4 disks (and no dedicated
boot drive), I used a 4-disk RAID1 for /boot and a 2-disk RAID1 with 2 spares
for /. This provides possibly a bit more fail-safe reliability in that the
RAID will auto-recover onto the spares when something goes wrong (instead of
having to wait for a human to fix things). You could have a 4-disk RAID1, but
there is some minor penalty (not detailed in my explanation above) for
replicating all writes to all four disks.
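
Roughly, assuming four disks sd[abcd] with partition 1 for /boot and partition
2 for / (untested as pasted; adjust device names and metadata version to your
boot setup):

    # 4-way RAID1 for /boot; 1.0 metadata keeps the superblock at the end
    mdadm --create /dev/md0 --level=1 --raid-devices=4 --metadata=1.0 \
          /dev/sd[abcd]1
    # 2-way RAID1 for /, with the other two partitions as hot spares
    mdadm --create /dev/md1 --level=1 --raid-devices=2 --spare-devices=2 \
          /dev/sd[abcd]2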

Re: [ceph-users] Boot volume on OSD device

2019-01-19 Thread Brian Topping
> On Jan 18, 2019, at 10:58 AM, Hector Martin  wrote:
> 
> Just to add a related experience: you still need 1.0 metadata (that's
> the 1.x variant at the end of the partition, like 0.9.0) for an
> mdadm-backed EFI system partition if you boot using UEFI. This generally
> works well, except on some Dell servers where the firmware inexplicably
> *writes* to the ESP, messing up the RAID mirroring. 

I love this list. You guys are great. I have to admit I was kind of intimidated 
at first; I felt a little unworthy in the face of such cutting-edge tech. 
Thanks to everyone who’s helped with my posts.

Hector, one of the things I was thinking through last night and finally pulled 
the trigger on today was the overhead of various subsystems. LVM does not 
create much overhead, but tiny initial mistakes explode into a lot of wasted 
CPU over the course of a deployment lifetime. So I wanted to review everything 
and thought I would share my notes here.

My main constraint is I had four disks on a single machine to start with and 
any one of the disks should be able to fail without affecting the ability for 
the machine to boot, the bad disk replaced without requiring obscure admin 
skills, and the final recovery to the promised land of “HEALTH_OK”. A single 
machine Ceph deployment is not much better than just using local storage, 
except the ability to later scale out. That’s the use case I’m addressing here.

The first exploration I had was how to optimize for a good balance between 
safety for mon logs, disk usage and performance for the boot partitions. As I 
learned, an OSD can fit in a single partition with no spillover, so I had three 
partitions to work with. `inotifywait -mr /var/lib/ceph/` provided a good 
handle on what was being written and with what frequency, and I could see that 
the activity was mostly log writes.

https://theithollow.com/2012/03/21/understanding-raid-penalty/ provided a good 
background that I did not previously have on the RAID write penalty. I combined 
this with what I learned in 
https://serverfault.com/questions/685289/software-vs-hardware-raid-performance-and-cache-usage/685328#685328. 
By the end of these two articles, I felt like I knew all the tradeoffs, but the 
final decision really came down to the penalty table in the first article and a 
“RAID penalty” of 2 for RAID 10, which was the same as the penalty for RAID 1, 
but with 50% better storage efficiency.

For the boot partition, there are fewer choices. Specifying anything other than 
RAID 1 will not keep all the copies of /boot both up-to-date and ready to 
seamlessly restart the machine in case of a disk failure. Combined with a 
choice of RAID 10 for the root partition, we are left with a configuration that 
can reliably boot through any single drive failure (maybe two; I don’t know what 
mdadm would do in the “less than perfect storm” where one mirror from each 
stripe is lost instead of both mirrors from one stripe…)

With this setup, each disk uses exactly two partitions, and mdadm gets to use 
the latest MD metadata because Grub2 knows how to deal with everything. As well, 
`sfdisk -l /dev/sd[abcd]` shows the first partition on every disk marked as 
bootable. Milestone 1 success!
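
If you’re doing this by hand rather than through Anaconda, the array creation 
boils down to something like this (device names from my box; double-check 
against your own layout before running anything):

    # /boot: 4-way RAID1 so the box can boot from any surviving disk
    mdadm --create /dev/md0 --level=1 --raid-devices=4 /dev/sd[abcd]1
    # /: RAID10 across the second partition of all four disks
    mdadm --create /dev/md1 --level=10 --raid-devices=4 /dev/sd[abcd]2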

The next piece I was unsure of (but didn’t want to spam the list with stuff I 
could just try) was how many partitions an OSD would use. Hector mentioned that 
he was using LVM for BlueStore volumes. I privately wondered about the value of 
creating LVM VGs when the groups don’t span disks, but this is exactly what the 
`ceph-deploy osd create` command does, as documented, when creating BlueStore 
OSDs. Wiring up LVM is not rocket science, but I wanted to avoid as many manual 
steps as possible. This was a biggie.
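
For reference, the documented form works out to something like this (assuming 
the spare partition is /dev/sda3 and the host is node1; substitute your own 
devices and hostname):

    # ceph-deploy wraps ceph-volume, which creates the VG/LV on the partition itself
    ceph-deploy osd create --data /dev/sda3 node1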

And after adding the OSD partitions one after the other, “HEALTH_OK”. w00t!!! 
Final Milestone Success!!

I know there’s no perfect starter configuration for every hardware environment, 
but I thought I would share exactly what I ended up with here for future 
seekers. This has been a fun adventure. 

Next up: convert my existing two pre-production nodes to use this layout. 
Fortunately there’s nothing on the second node except Ceph, and I can take that 
one down pretty easily. It will be good practice to gracefully shut down the 
four OSDs on that node without losing any data, reformat the node with this 
pattern, bring the cluster back to health, then migrate the mon (and the 
workloads) to it while I do the same for the first node. With that, I’ll be 
able to remove these satanic SATADOMs and get back to some real work!!


Re: [ceph-users] Boot volume on OSD device

2019-01-18 Thread Hector Martin
On 19/01/2019 02.24, Brian Topping wrote:
> 
> 
>> On Jan 18, 2019, at 4:29 AM, Hector Martin  wrote:
>>
>> On 12/01/2019 15:07, Brian Topping wrote:
>>> I’m a little nervous that BlueStore assumes it owns the partition table and 
>>> will not be happy that a couple of primary partitions have been used. Will 
>>> this be a problem?
>>
>> You should look into using ceph-volume in LVM mode. This will allow you to 
>> create an OSD out of any arbitrary LVM logical volume, and it doesn't care 
>> about other volumes on the same PV/VG. I'm running BlueStore OSDs sharing 
>> PVs with some non-Ceph stuff without any issues. It's the easiest way for 
>> OSDs to coexist with other stuff right now.
> 
> Very interesting, thanks!
> 
> On the subject, I just rediscovered the technique of putting boot and root 
> volumes on mdadm-backed stores. The last time I felt the need for this, it 
> was a lot of careful planning and commands. 
> 
> Now, at least with RHEL/CentOS, it’s now available in Anaconda. As it’s set 
> up before mkfs, there’s no manual hackery to reduce the size of a volume to 
> make room for the metadata. Even better, one isn’t stuck using metadata 0.9.0 
> just because they need the /boot volume to have the header at the end (grub 
> now understands mdadm 1.2 headers). Just be sure /boot is RAID 1 and it 
> doesn’t seem to matter what one does with the rest of the volumes. Kernel 
> upgrades process correctly as well (another major hassle in the old days 
> since mkinitrd had to be carefully managed).
> 

Just to add a related experience: you still need 1.0 metadata (that's
the 1.x variant at the end of the partition, like 0.9.0) for an
mdadm-backed EFI system partition if you boot using UEFI. This generally
works well, except on some Dell servers where the firmware inexplicably
*writes* to the ESP, messing up the RAID mirroring. But there is a hacky
workaround. They create a directory ("Dell" IIRC) to put their junk in.
If you create a *file* with the same name ahead of time, that makes the
firmware fail to mkdir, but it doesn't seem to cause any issues and it
doesn't touch the disk in this case, so the RAID stays in sync.
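
For the record, the workaround amounts to something like this (assuming the
ESP is mounted at /boot/efi, and that the directory really is named "Dell";
check what your particular firmware tries to create):

    # Pre-create a plain file where the firmware wants its directory. Its
    # mkdir then fails, it gives up without writing to the ESP, and the
    # mdadm mirror stays in sync.
    touch /boot/efi/Dell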


-- 
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub


Re: [ceph-users] Boot volume on OSD device

2019-01-18 Thread Brian Topping


> On Jan 18, 2019, at 4:29 AM, Hector Martin  wrote:
> 
> On 12/01/2019 15:07, Brian Topping wrote:
>> I’m a little nervous that BlueStore assumes it owns the partition table and 
>> will not be happy that a couple of primary partitions have been used. Will 
>> this be a problem?
> 
> You should look into using ceph-volume in LVM mode. This will allow you to 
> create an OSD out of any arbitrary LVM logical volume, and it doesn't care 
> about other volumes on the same PV/VG. I'm running BlueStore OSDs sharing PVs 
> with some non-Ceph stuff without any issues. It's the easiest way for OSDs to 
> coexist with other stuff right now.

Very interesting, thanks!

On the subject, I just rediscovered the technique of putting boot and root 
volumes on mdadm-backed stores. The last time I felt the need for this, it was 
a lot of careful planning and commands. 

At least with RHEL/CentOS, it’s now available in Anaconda. As it’s set up 
before mkfs, there’s no manual hackery to reduce the size of a volume to make 
room for the metadata. Even better, one isn’t stuck using metadata 0.9.0 just 
because the /boot volume needs the header at the end (grub now understands 
mdadm 1.2 headers). Just be sure /boot is RAID 1, and it doesn’t seem to matter 
what one does with the rest of the volumes. Kernel upgrades process correctly 
as well (another major hassle in the old days, since mkinitrd had to be 
carefully managed).

best, B



Re: [ceph-users] Boot volume on OSD device

2019-01-18 Thread Hector Martin

On 12/01/2019 15:07, Brian Topping wrote:
> I’m a little nervous that BlueStore assumes it owns the partition table and 
> will not be happy that a couple of primary partitions have been used. Will this 
> be a problem?


You should look into using ceph-volume in LVM mode. This will allow you 
to create an OSD out of any arbitrary LVM logical volume, and it doesn't 
care about other volumes on the same PV/VG. I'm running BlueStore OSDs 
sharing PVs with some non-Ceph stuff without any issues. It's the 
easiest way for OSDs to coexist with other stuff right now.


So, for example, you could have /boot on a partition, an LVM PV on 
another partition, containing an LV for / and an LV for your OSD. Or you 
could just use a partition for / (including /boot) and just have another 
partition for a PV wholly occupied by a single OSD LV. How you set up 
everything around LVM is up to you, ceph-volume just wants a logical 
volume to own (and uses LVM metadata to store its stuff, so it doesn't 
require a separate filesystem for metadata, just the main BlueStore device).
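
As a concrete sketch of that second layout (device, VG and LV names are only 
examples; size things to taste):

    # the second partition becomes a PV holding both the root LV and the OSD LV
    pvcreate /dev/sda2
    vgcreate vg0 /dev/sda2
    lvcreate -L 50G -n root vg0          # root filesystem LV
    lvcreate -l 100%FREE -n osd0 vg0     # the rest goes to the OSD
    # hand the LV to ceph-volume as a BlueStore OSD
    ceph-volume lvm create --bluestore --data vg0/osd0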


I also have two clusters using ceph-disk with a rootfs RAID1 across OSD 
data drives, with extra partitions; at least with GPT this works without 
any problems, but set-up might be finicky. For our deployment I ended up 
rewriting what ceph-disk does in my own script (it's not that 
complicated, just create a few partitions with the right GUIDs and write 
some files to the OSD filesystem root). So the OSDs get set up with some 
custom code, but then normal usage just uses ceph-disk (it certainly 
doesn't care about extra partitions once everything is set up). This was 
formerly FileStore and now BlueStore, but it's a legacy setup. I expect 
to move this over to ceph-volume at some point.
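
If you go down that road, the partition-typing part is roughly the following.
The GUID is the "Ceph OSD data" partition type as I remember it; verify it
against the ceph-disk source before trusting it:

    # tag partition 3 as Ceph OSD data so udev/ceph-disk will activate it
    sgdisk --typecode=3:4fbd7e29-9d25-41b8-afd0-062c0ceff05d /dev/sda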


--
Hector Martin (hec...@marcansoft.com)
Public Key: https://marcan.st/marcan.asc


Re: [ceph-users] Boot volume on OSD device

2019-01-12 Thread Félix Barbeira
If you have the chance, maybe the best choice is to boot the OS from the
network; that way you don't need an extra HD for the OS. I'm currently trying
to make a squashfs image which is booted over LAN via iPXE. This is a very
good example: https://croit.io/features/efficiency-diskless

On Sat, Jan 12, 2019 at 7:15, Brian Topping wrote:

> Question about OSD sizes: I have two cluster nodes, each with 4x 800GiB
> SLC SSD using BlueStore. They boot from SATADOM so the OSDs are data-only,
> but the MLC SATADOM have terrible reliability and the SLC are way
> overpriced for this application.
>
> Can I carve off 64GiB of from one of the four drives on a node without
> causing problems? If I understand the strategy properly, this will cause
> mild extra load on the other three drives as the weight goes down on the
> partitioned drive, but it probably won’t be a big deal.
>
> Assuming the correct procedure is documented at
> http://docs.ceph.com/docs/mimic/rados/operations/add-or-rm-osds/, first
> removing the OSD as documented, zap it, carve off the partition of the
> freed drive, then adding the remaining space back in.
>
> I’m a little nervous that BlueStore assumes it owns the partition table
> and will not be happy that a couple of primary partitions have been used.
> Will this be a problem?
>
> Thanks, Brian
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
Félix Barbeira.


[ceph-users] Boot volume on OSD device

2019-01-11 Thread Brian Topping
Question about OSD sizes: I have two cluster nodes, each with 4x 800GiB SLC SSD 
using BlueStore. They boot from SATADOM so the OSDs are data-only, but the MLC 
SATADOM have terrible reliability and the SLC are way overpriced for this 
application.

Can I carve off 64GiB from one of the four drives on a node without causing 
problems? If I understand the strategy properly, this will cause mild extra 
load on the other three drives as the weight goes down on the partitioned 
drive, but it probably won’t be a big deal.

Assuming the correct procedure is the one documented at 
http://docs.ceph.com/docs/mimic/rados/operations/add-or-rm-osds/: first remove 
the OSD as documented, zap it, carve the partition off the freed drive, then 
add the remaining space back in.
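
In other words, something like this (assuming it’s osd.0 on /dev/sda; untested 
here, and the page above is authoritative):

    ceph osd out 0
    systemctl stop ceph-osd@0                  # on the OSD host
    ceph osd purge 0 --yes-i-really-mean-it    # drops it from CRUSH, auth and the OSD map
    ceph-volume lvm zap /dev/sda --destroy     # wipe before repartitioning
    # repartition, then re-create the OSD on the now-smaller partition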

I’m a little nervous that BlueStore assumes it owns the partition table and 
will not be happy that a couple of primary partitions have been used. Will this 
be a problem?

Thanks, Brian