[ceph-users] Erasure coding
Hi guys,

We've got a very small Ceph cluster (3 hosts, 5 OSDs each for cold data) that we intend to grow later on as more storage is needed. We would very much like to use erasure coding for some pools, but are facing some challenges regarding the optimal initial profile "replication" settings, given the limited number of initial hosts across which we can spread the chunks. Could somebody please help me with the following questions?

1. Suppose we initially use replication instead of erasure coding. Can we convert a replicated pool to an erasure coded pool later on?

2. Will Ceph gain the ability to change the K and M values for an existing pool in the near future?

3. Can the failure domain be changed for an existing pool? E.g. can we start with failure domain OSD and then switch it to Host after adding more hosts?

4. Where can I find a good comparison of the available erasure code plugins that allows me to properly decide which one suits our needs best?

Many thanks for your help!

Tom

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
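[Editor's note: for context on the K/M tradeoff behind question 2, the raw-space overhead of an erasure coded pool is (K+M)/K, versus N for N-way replication. A small sketch with illustrative numbers (not from the thread):]

```shell
# Raw-to-usable space ratio: 3-way replication vs. two example EC profiles.
# Replication size=3 writes every byte three times; an EC pool with k data
# chunks and m coding chunks writes (k+m)/k bytes per usable byte, and
# still survives the loss of any m chunks.
awk 'BEGIN {
    printf "replication size=3 : %.2fx raw per usable byte\n", 3
    printf "erasure k=3,m=2    : %.2fx (tolerates 2 lost chunks)\n", (3+2)/3
    printf "erasure k=10,m=4   : %.2fx (tolerates 4 lost chunks)\n", (10+4)/10
}'
```

With only 3 hosts and failure domain "host", a profile needs k+m <= 3, which is why the failure-domain question below matters.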
Re: [ceph-users] Erasure coding
Great info! Many thanks!

Tom

2015-03-25 13:30 GMT+01:00 Loic Dachary l...@dachary.org:

Hi Tom,

On 25/03/2015 11:31, Tom Verdaat wrote:

> Hi guys,
>
> We've got a very small Ceph cluster (3 hosts, 5 OSDs each for cold data) that we intend to grow later on as more storage is needed. We would very much like to use erasure coding for some pools but are facing some challenges regarding the optimal initial profile "replication" settings given the limited number of initial hosts that we can use to spread the chunks. Could somebody please help me with the following questions?
>
> 1. Suppose we initially use replication instead of erasure coding. Can we convert a replicated pool to an erasure coded pool later on?

What you would do is create an erasure coded pool later and have the initial replicated pool as a cache in front of it:

http://docs.ceph.com/docs/master/rados/operations/cache-tiering/

Objects from the replicated pool will move to the erasure coded pool if they are not used, and that will save space. You don't need to create the erasure coded pool on your small cluster. You can do it when it grows larger or when it becomes full.

> 2. Will Ceph gain the ability to change the K and M values for an existing pool in the near future?

I don't think so.

> 3. Can the failure domain be changed for an existing pool? E.g. can we start with failure domain OSD and then switch it to Host after adding more hosts?

The failure domain, although listed in the erasure code profile for convenience, really belongs to the crush ruleset applied to the pool. It can therefore be changed after the pool is created. It is likely to result in objects moving a lot during the transition, but it should work fine otherwise.

> 4. Where can I find a good comparison of the available erasure code plugins that allows me to properly decide which one suits our needs best?

In a nutshell: jerasure is flexible and is likely to be what you want; isa computes faster than jerasure but only works on Intel processors (note however that the erasure code computation does not make a significant difference overall); lrc and shec (to be published in hammer) minimize network usage during recovery but use more space than jerasure or isa.

Cheers

> Many thanks for your help!
>
> Tom

--
Loïc Dachary, Artisan Logiciel Libre
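[Editor's note: the cache-tiering approach Loic describes can be sketched with the commands from the linked documentation. Pool names, pg counts, and k/m values are illustrative; the profile key is `ruleset-failure-domain` in hammer-era releases and `crush-failure-domain` in later ones, so check your version's documentation first. These commands require a live cluster:]

```shell
# Create an EC profile and pool, then put the existing replicated pool
# in front of it as a writeback cache tier (names are hypothetical).
ceph osd erasure-code-profile set ecprofile k=3 m=2 plugin=jerasure ruleset-failure-domain=osd
ceph osd pool create ecpool 128 128 erasure ecprofile

# Attach the replicated pool "cachepool" as a cache in front of "ecpool":
ceph osd tier add ecpool cachepool
ceph osd tier cache-mode cachepool writeback
ceph osd tier set-overlay ecpool cachepool
```

Clients then talk to the overlay; cold objects are flushed down to the erasure coded pool.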
Re: [ceph-users] OCFS2 or GFS2 for cluster filesystem?
Hi Darryl,

Would love to do that too, but only if we can configure nova to do this automatically. Any chance you could dig up and share how you guys accomplished this? From everything I've read so far, Grizzly is not up for the task yet. If I can't set it in nova.conf then it probably won't work with 3rd party tools like Hostbill, and would break the user self-service functionality that we're aiming for with a public cloud concept. I think we'll need this and this blueprint implemented to be able to achieve this, and of course this one for the dashboard would be nice too. I'll do some more digging into OpenStack and see how far we can get with this.

In the meantime I've done some more research and figured out that:

* There is a bunch of other cluster file systems, but GFS2 and OCFS2 are the only open source ones I could find, and I believe the only ones that are integrated in the Linux kernel.
* OCFS2 seems to have a lot more public information than GFS2. It has more documentation and a living - though not very active - mailing list.
* OCFS2 seems to be in active use by its sponsor Oracle, while I can't find much on GFS2 from its sponsor Red Hat.
* OCFS2 documentation indicates a soft limit of 256 nodes versus 16 for GFS2, and there are actual deployments of stable 45 TB+ production clusters.
* Performance tests from 2010 indicate OCFS2 clearly beating GFS2, though of course newer versions have been released since.
* GFS2 has more fencing options than OCFS2.

There is not much info from the last 12 months, so it's hard to get an accurate picture. If we have to go with the shared storage approach, though, OCFS2 looks like the preferred option based on the info I've gathered so far.

Tom

Darryl Bond wrote on Fri 2013-07-12 at 10:04 [+1000]:

> Tom,
>
> I'm no expert as I didn't set it up, but we are using OpenStack Grizzly with KVM/QEMU and RBD volumes for VMs. We boot the VMs from the RBD volumes and it all seems to work just fine. Migration works perfectly, although live (no-break) migration only works from the command line tools. The GUI uses the pause, migrate, then un-pause mode. Layered snapshot/cloning works just fine through the GUI. I would say Grizzly has pretty good integration with Ceph.
>
> Regards
> Darryl

On 07/12/13 09:41, Tom Verdaat wrote:

> Hi Alex,
>
> We're planning to deploy OpenStack Grizzly using KVM. I agree that running every VM directly from RBD devices would be preferable, but booting from volumes is not one of OpenStack's strengths, and configuring nova to make boot-from-volume the default method that works automatically is not really feasible yet. So the alternative is to mount a shared filesystem on /var/lib/nova/instances of every compute node. Hence the RBD + OCFS2/GFS2 question.
>
> Tom
>
> p.s. yes, I've read the rbd-openstack page, which covers images and persistent volumes, not running instances, which is what my question is about.

2013/7/12 Alex Bligh a...@alex.org.uk:

> Tom,
>
> On 11 Jul 2013, at 22:28, Tom Verdaat wrote:
> > Actually I want my running VMs to all be stored on the same file system, so we can use live migration to move them between hosts. QEMU is not going to help because we're not using it in our virtualization solution.
>
> Out of interest, what are you using in your virtualization solution? Most things (including modern Xen) seem to use QEMU for the back end. If your virtualization solution does not use qemu as a back end, you can use kernel rbd devices straight, which I think will give you better performance than OCFS2 on RBD devices.
>
> A

2013/7/11 Alex Bligh a...@alex.org.uk:

> On 11 Jul 2013, at 19:25, Gilles Mocellin wrote:
> > Hello,
> >
> > Yes, you missed that qemu can use a RADOS volume directly. Look here: http://ceph.com/docs/master/rbd/qemu-rbd/
> >
> > Create: qemu-img create -f rbd rbd:data/squeeze 10G
> > Use: qemu -m 1024 -drive format=raw,file=rbd:data/squeeze
>
> I don't think he did. As I read it he wants his VMs to all access the same filing system, and doesn't want to use cephfs. OCFS2 on RBD I suppose is a reasonable choice for that.
>
> --
> Alex Bligh
[ceph-users] OCFS2 or GFS2 for cluster filesystem?
Hi guys,

We want to use our Ceph cluster to create a shared disk file system to host VMs. Our preference would be to use CephFS, but since it is not considered stable I'm looking into alternatives. The most appealing alternative seems to be to create an RBD volume, format it with a cluster file system, and mount it on all the VM host machines. Obvious file system candidates would be OCFS2 and GFS2, but I'm having trouble finding recent and reliable documentation on the performance, features and reliability of these file systems, especially related to our specific use case.

The specifics I'm trying to keep in mind are:

- Using it to host VM ephemeral disks means the file system needs to perform well with few but very large files, and usually machines don't compete for access to the same file, except during live migration.
- Needs to handle scale well (a large number of nodes, a volume of tens of terabytes, and file sizes of tens or hundreds of gigabytes) and handle online operations like increasing the volume size.
- Since the cluster FS is already running on a distributed storage system (Ceph), the file system does not need to concern itself with things like replication. It just needs to not get corrupted, and be fast of course.

Anybody here that can help me shed some light on the following questions:

1. Are there other cluster file systems to consider besides OCFS2 and GFS2?
2. Which one would yield the best performance for our use case?
3. Is anybody doing this already and willing to share their experience?
4. Is there anything important that you think we might have missed?

Your help is very much appreciated! Thanks!

Tom
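[Editor's note: the "RBD volume + cluster file system" approach described above can be sketched as follows. The pool/image names and sizes are hypothetical; it assumes the kernel rbd module, ocfs2-tools, and an already configured OCFS2 cluster.conf on every host, and needs a live Ceph cluster to run:]

```shell
# Create and map an RBD image, format it with OCFS2, and mount it at the
# shared instances path. Repeat the map + mount steps on every host.
rbd create vmstore --size 10240           # 10 GB image in the default pool
rbd map vmstore                           # appears as e.g. /dev/rbd0
mkfs.ocfs2 -N 8 -L vmstore /dev/rbd0      # -N: max node slots (run once)
mkdir -p /var/lib/nova/instances
mount /dev/rbd0 /var/lib/nova/instances
```

Note that mkfs.ocfs2 is run only once, on one host; the other hosts only map and mount.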
Re: [ceph-users] OCFS2 or GFS2 for cluster filesystem?
You are right, I do want a single RBD, formatted with a cluster file system, to use as a place for multiple VM image files to reside. Doing everything straight from volumes would be more effective with regards to snapshots, using CoW etc., but unfortunately for now OpenStack nova insists on having an ephemeral disk and copying it to its local /var/lib/nova/instances directory. If you want to be able to do live migrations and such, you need to mount a cluster filesystem at that path on every host machine. And that's what my questions were about!

Tom

2013/7/12 McNamara, Bradley bradley.mcnam...@seattle.gov:

> Correct me if I'm wrong, I'm new to this, but I think the distinction between the two methods is that using 'qemu-img create -f rbd' creates an RBD for either a VM to boot from, or for mounting within a VM. Whereas, the OP wants a single RBD, formatted with a cluster file system, to use as a place for multiple VM image files to reside.
>
> I've often contemplated this same scenario, and would be quite interested in the different ways people have implemented their VM infrastructure using RBD. I guess one of the advantages of using 'qemu-img create -f rbd' is that a snapshot of a single RBD would capture just the changed RBD data for that VM, whereas a snapshot of a larger RBD with OCFS2 and multiple VM images on it would capture the changes of all the VMs, not just one. It might provide more administrative agility to use the former.
>
> Also, I guess another question would be: when an RBD is expanded, does the underlying VM that is created using 'qemu-img create -f rbd' need to be rebooted to see the additional space? My guess would be yes.
>
> Brad

-----Original Message-----
From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Alex Bligh
Sent: Thursday, July 11, 2013 2:03 PM
To: Gilles Mocellin
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] OCFS2 or GFS2 for cluster filesystem?

On 11 Jul 2013, at 19:25, Gilles Mocellin wrote:

> Hello,
>
> Yes, you missed that qemu can use a RADOS volume directly. Look here: http://ceph.com/docs/master/rbd/qemu-rbd/
>
> Create: qemu-img create -f rbd rbd:data/squeeze 10G
> Use: qemu -m 1024 -drive format=raw,file=rbd:data/squeeze

I don't think he did. As I read it he wants his VMs to all access the same filing system, and doesn't want to use cephfs. OCFS2 on RBD I suppose is a reasonable choice for that.

--
Alex Bligh
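[Editor's note: on Brad's question about expanding an RBD, growing the image itself needs no reboot; whether the guest sees the new size live depends on the hypervisor and driver. A sketch, reusing the hypothetical rbd:data/squeeze image from the example above; the drive name passed to block_resize varies per setup, and all of this requires a live cluster:]

```shell
# Grow the image on the Ceph side (size is in MB here):
rbd resize data/squeeze --size 20480      # grow from 10 GB to 20 GB

# A running qemu guest can be told about the new size from the qemu
# monitor without a reboot (drive name is setup-specific):
#   (qemu) block_resize virtio0 20G
# Then grow the partition/filesystem inside the guest, e.g.:
#   resize2fs /dev/vda1
```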
[ceph-users] OSD port error
Hi all,

I've set up a new Ceph cluster for testing and it doesn't seem to be working out-of-the-box. If I check the status, it tells me that of the 3 defined OSDs, only 1 is in:

   health HEALTH_WARN 392 pgs degraded; 392 pgs stuck unclean
   monmap e1: 3 mons at {controller-01=10.20.3.110:6789/0,controller-02=10.20.3.111:6789/0,controller-03=10.20.3.112:6789/0}, election epoch 6, quorum 0,1,2 controller-01,controller-02,controller-03
   osdmap e20: 3 osds: 1 up, 1 in
   pgmap v35: 392 pgs: 392 active+degraded; 0 bytes data, 37444 KB used, 15312 MB / 15348 MB avail
   mdsmap e1: 0/0/1 up

Turns out this is true, because if I run 'service ceph restart' on my OSD nodes, osd.0 will restart just fine but osd.1 and osd.2 give me the following error:

   Starting Ceph osd.0 on storage-02...
   starting osd.1 at :/0 osd_data /var/lib/ceph/osd/ceph-1 /var/lib/ceph/osd/ceph-1/journal
   2013-07-09 11:54:26.497639 7f5b18813780 -1 accepter.accepter.bind unable to bind to 10.20.4.121:7100 on any port in range 6800-7100: Cannot assign requested address
   failed: 'ulimit -n 8192; /usr/bin/ceph-osd -i 1 --pid-file /var/run/ceph/osd.1.pid -c /etc/ceph/ceph.conf '

My ceph.conf just has a very limited configuration. The OSD section basically contains:

   [osd]
   public network = 10.20.3.0/24
   cluster network = 10.20.4.0/24

   [osd.0]
   host = storage-01
   public addr = 10.20.3.120
   cluster addr = 10.20.4.120

   [osd.1]
   host = storage-02
   public addr = 10.20.3.121
   cluster addr = 10.20.4.121

   [osd.2]
   host = storage-03
   public addr = 10.20.3.122
   cluster addr = 10.20.4.122

A quick Google search on that port binding error doesn't really yield any results, so I'm reaching out to you guys. Any thoughts on how to fix this?

Thanks,

Tom
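[Editor's note: "Cannot assign requested address" from bind() usually means the address being bound is not configured on any local interface, rather than a port conflict. A quick check to run on the node where osd.1 fails, hedged: the specific IP is from the thread above, and the port-range grep is only a secondary sanity check:]

```shell
# Is the cluster address from ceph.conf actually present on this host?
ip addr show | grep 10.20.4.121 || echo "10.20.4.121 not configured here"

# Secondary check: is anything already listening in the OSD port range
# 6800-7100? (netstat may need the net-tools package)
netstat -lntp 2>/dev/null | grep -E ':(6[89][0-9][0-9]|70[0-9][0-9]|7100) ' || true
```

If the address is missing, the likely causes are the cluster-network interface not being up on that node, or the init script starting the OSD on the wrong host.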