[ceph-users] Erasure coding

2015-03-25 Thread Tom Verdaat
Hi guys,

We've got a very small Ceph cluster (3 hosts, 5 OSDs each, for cold data)
that we intend to grow later on as more storage is needed. We would very
much like to use Erasure Coding for some pools but are facing some
challenges regarding the optimal initial profile “replication” settings
given the limited number of initial hosts that we can use to spread the
chunks. Could somebody please help me with the following questions?

   1. Suppose we initially use replication instead of erasure. Can we convert
   a replicated pool to an erasure coded pool later on?
   2. Will Ceph gain the ability to change the K and N values for an existing
   pool in the near future?
   3. Can the failure domain be changed for an existing pool? E.g. can we
   start with failure domain OSD and then switch it to Host after adding more
   hosts?
   4. Where can I find a good comparison of the available erasure code plugins
   that allows me to properly decide which one suits our needs best?

 Many thanks for your help!

 Tom


Re: [ceph-users] Erasure coding

2015-03-25 Thread Tom Verdaat
Great info! Many thanks!

Tom

2015-03-25 13:30 GMT+01:00 Loic Dachary l...@dachary.org:

 Hi Tom,

 On 25/03/2015 11:31, Tom Verdaat wrote:
  Hi guys,
 
  We've got a very small Ceph cluster (3 hosts, 5 OSDs each, for cold
 data) that we intend to grow later on as more storage is needed. We would
 very much like to use Erasure Coding for some pools but are facing some
 challenges regarding the optimal initial profile “replication” settings
 given the limited number of initial hosts that we can use to spread the
 chunks. Could somebody please help me with the following questions?
 
   1. Suppose we initially use replication instead of erasure. Can we
  convert a replicated pool to an erasure coded pool later on?

 What you would do is create an erasure coded pool later and have the
 initial replicated pool as a cache in front of it.

 http://docs.ceph.com/docs/master/rados/operations/cache-tiering/

 Objects from the replicated pool will move to the erasure coded pool when
 they are not being used, which will save space. You don't need to create the
 erasure coded pool on your small cluster; you can do it when the cluster
 grows larger or when it becomes full.
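
 For reference, that sequence looks roughly like this on the command line.
 The pool and profile names below are made up, and the cache-tier tunables
 you will want to set are described in the documentation linked above:

  ceph osd erasure-code-profile set coldprofile k=2 m=1
  ceph osd pool create coldpool 128 128 erasure coldprofile
  # the existing replicated pool (here called rbdpool) becomes the cache
  # tier; --force-nonempty is needed because it already contains objects
  ceph osd tier add coldpool rbdpool --force-nonempty
  ceph osd tier cache-mode rbdpool writeback
  ceph osd tier set-overlay coldpool rbdpool
  # example cache sizing/tracking settings; adjust to taste
  ceph osd pool set rbdpool hit_set_type bloom
  ceph osd pool set rbdpool target_max_bytes 1099511627776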

   2. Will Ceph gain the ability to change the K and N values for an
  existing pool in the near future?

 I don't think so.

   3. Can the failure domain be changed for an existing pool? E.g. can we
  start with failure domain OSD and then switch it to Host after adding more
  hosts?

 The failure domain, although listed in the erasure code profile for
 convenience, really belongs to the crush ruleset applied to the pool. It
 can therefore be changed after the pool is created. It is likely to result
 in objects moving a lot during the transition but it should work fine
 otherwise.
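
 Concretely, the switch is a matter of pointing the pool at a different
 ruleset, along these lines (names are invented; on newer releases the pool
 option is called crush_rule rather than crush_ruleset):

  ceph osd erasure-code-profile set coldprofile-host k=2 m=1 ruleset-failure-domain=host
  ceph osd crush rule create-erasure coldrule-host coldprofile-host
  ceph osd crush rule dump coldrule-host      # note the ruleset number it was given
  ceph osd pool set coldpool crush_ruleset 1  # 1 = that number; expect heavy data movement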

   4. Where can I find a good comparison of the available erasure code
  plugins that allows me to properly decide which one suits our needs best?

 In a nutshell: jerasure is flexible and is likely to be what you want; isa
 computes faster than jerasure but only works on Intel processors (note,
 however, that the erasure code computation does not make a significant
 difference overall); lrc and shec (to be published in hammer) minimize
 network usage during recovery but use more space than jerasure or isa.
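
 The plugin is just another setting in the erasure code profile, so the
 alternatives are cheap to try side by side; two hypothetical examples
 (parameters as documented for the jerasure and lrc plugins):

  ceph osd erasure-code-profile set fastprofile plugin=jerasure k=2 m=1 technique=reed_sol_van
  ceph osd erasure-code-profile set lrcprofile plugin=lrc k=4 m=2 l=3   # lrc needs k+m divisible by l
  ceph osd erasure-code-profile get lrcprofile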

 Cheers

  Many thanks for your help!
 
  Tom
 
 
 
 

 --
 Loïc Dachary, Artisan Logiciel Libre




Re: [ceph-users] OCFS2 or GFS2 for cluster filesystem?

2013-07-12 Thread Tom Verdaat
Hi Darryl,

Would love to do that too but only if we can configure nova to do this
automatically. Any chance you could dig up and share how you guys
accomplished this?

From everything I've read so far, Grizzly is not up to the task yet. If
I can't set it in nova.conf then it probably won't work with 3rd-party
tools like Hostbill, and it would break the user self-service functionality
that we're aiming for with a public cloud concept. I think we'll need this
and this blueprint implemented to be able to achieve it, and of course this
one for the dashboard would be nice too.

I'll do some more digging into Openstack and see how far we can get with
this.

In the mean time I've done some more research and figured out that:

  * There is a bunch of other cluster file systems but GFS2 and
OCFS2 are the only open source ones I could find, and I believe
the only ones that are integrated in the Linux kernel.
  * OCFS2 seems to have a lot more public information than GFS2. It
has more documentation and a living - though not very active -
mailing list.
  * OCFS2 seems to be in active use by its sponsor Oracle, while I
can't find much on GFS2 from its sponsor Red Hat.
  * OCFS2 documentation indicates a node soft limit of 256 versus 16
for GFS2, and there are actual deployments of stable 45 TB+
production clusters.
  * Performance tests from 2010 indicate OCFS2 clearly beating GFS2,
though of course newer versions have been released since.
  * GFS2 has more fencing options than OCFS2.


There is not much info from the last 12 months, so it's hard to get an
accurate picture. If we have to go with the shared storage approach,
OCFS2 looks like the preferred option based on the info I've gathered so
far, though.

Tom



Darryl Bond schreef op vr 12-07-2013 om 10:04 [+1000]:

 Tom,
 I'm no expert as I didn't set it up, but we are using Openstack
 Grizzly with KVM/QEMU and RBD volumes for VM's.
 We boot the VMs from the RBD volumes and it all seems to work just
 fine. 
 Migration works perfectly, although live (no-break) migration only
 works from the command line tools. The GUI uses the pause, migrate,
 then un-pause mode.
 Layered snapshot/cloning works just fine through the GUI. I would say
 Grizzly has pretty good integration with Ceph.
 
 Regards
 Darryl
 
 
 On 07/12/13 09:41, Tom Verdaat wrote:
 
  Hi Alex, 
  
  
  
  We're planning to deploy OpenStack Grizzly using KVM. I agree that
  running every VM directly from RBD devices would be preferable, but
  booting from volumes is not one of OpenStack's strengths and
  configuring nova to make boot from volume the default method that
  works automatically is not really feasible yet.
  
  
  So the alternative is to mount a shared filesystem
  on /var/lib/nova/instances of every compute node. Hence the RBD +
  OCFS2/GFS2 question.
  
  
  Tom
  
  
   P.S. Yes, I've read the rbd-openstack page, which covers images and
   persistent volumes, not running instances, which is what my question
   is about.
  
  
  
  2013/7/12 Alex Bligh a...@alex.org.uk
  
  Tom,
  
  
  On 11 Jul 2013, at 22:28, Tom Verdaat wrote:
  
   Actually I want my running VMs to all be stored on the
  same file system, so we can use live migration to move them
  between hosts.
  
   QEMU is not going to help because we're not using it in
  our virtualization solution.
  
  
  
  Out of interest, what are you using in your virtualization
  solution? Most things (including modern Xen) seem to use
  Qemu for the back end. If your virtualization solution does
  not use qemu as a back end, you can use kernel rbd devices
  straight which I think will give you better performance than
  OCFS2 on RBD devices.
  
  
  A
  
  
   2013/7/11 Alex Bligh a...@alex.org.uk
  
   On 11 Jul 2013, at 19:25, Gilles Mocellin wrote:
  
Hello,
   
Yes, you missed that qemu can use directly RADOS volume.
Look here :
http://ceph.com/docs/master/rbd/qemu-rbd/
   
Create :
qemu-img create -f rbd rbd:data/squeeze 10G
   
Use :
   
qemu -m 1024 -drive format=raw,file=rbd:data/squeeze
  
   I don't think he did. As I read it he wants his VMs to all
  access the same filing system, and doesn't want to use
  cephfs.
  
   OCFS2 on RBD I suppose is a reasonable choice for that.
  
   --
   Alex Bligh
  
  
  
  

[ceph-users] OCFS2 or GFS2 for cluster filesystem?

2013-07-11 Thread Tom Verdaat
Hi guys,

We want to use our Ceph cluster to create a shared disk file system to host
VMs. Our preference would be to use CephFS, but since it is not considered
stable I'm looking into alternatives.

The most appealing alternative seems to be to create a RBD volume, format
it with a cluster file system and mount it on all the VM host machines.
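
For context, the rough shape of that approach, with OCFS2 as the example
(image name and size are made up; this assumes the kernel rbd module and an
OCFS2 cluster stack are already set up on every host):

   rbd create instances --size 1048576        # 1 TB image, created once
   rbd map rbd/instances                      # on every VM host, e.g. /dev/rbd0
   mkfs.ocfs2 -L instances -N 16 /dev/rbd0    # once, from a single host
   mount /dev/rbd0 /var/lib/nova/instances    # on every VM host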

Obvious file system candidates would be OCFS2 and GFS2 but I'm having
trouble finding recent and reliable documentation on the performance,
features and reliability of these file systems, especially related to our
specific use case. The specifics I'm trying to keep in mind are:

   - Using it to host VM ephemeral disks means the file system needs to
   perform well with few but very large files; usually machines don't compete
   for access to the same file, except during live migration.
   - Needs to handle scale well (large number of nodes, manage a volume of
   tens of terabytes and file sizes of tens or hundreds of gigabytes) and
   handle online operations like increasing the volume size.
   - Since the cluster FS is already running on a distributed storage
   system (Ceph), the file system does not need to concern itself with things
   like replication. It just needs to not get corrupted and be fast, of course.


Anybody here that can help me shed some light on the following questions:

   1. Are there other cluster file systems to consider besides OCFS2 and
   GFS2?
   2. Which one would yield the best performance for our use case?
   3. Is anybody doing this already and willing to share their experience?
   4. Is there anything important that you think we might have missed?


Your help is very much appreciated!

Thanks!

Tom


Re: [ceph-users] OCFS2 or GFS2 for cluster filesystem?

2013-07-11 Thread Tom Verdaat
You are right, I do want a single RBD, formatted with a cluster file
system, to use as a place for multiple VM image files to reside.

Doing everything straight from volumes would be more effective with regard
to snapshots, using CoW, etc., but unfortunately for now OpenStack nova
insists on having an ephemeral disk and copying it to its local
/var/lib/nova/instances directory. If you want to be able to do live
migrations and such, you need to mount a cluster filesystem at that path on
every host machine.
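
As an aside, the per-host piece of that setup is mostly the O2CB cluster
definition; a minimal /etc/ocfs2/cluster.conf, with made-up host names and
addresses and one node: stanza per compute host, looks something like:

   cluster:
           node_count = 3
           name = ocfs2

   node:
           ip_port = 7777
           ip_address = 192.0.2.11
           number = 0
           name = compute-01
           cluster = ocfs2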

And that's what my questions were about!

Tom



2013/7/12 McNamara, Bradley bradley.mcnam...@seattle.gov

 Correct me if I'm wrong, I'm new to this, but I think the distinction
 between the two methods is that using 'qemu-img create -f rbd' creates an
 RBD for either a VM to boot from, or for mounting within a VM.  Whereas,
 the OP wants a single RBD, formatted with a cluster file system, to use as
 a place for multiple VM image files to reside.

 I've often contemplated this same scenario, and would be quite interested
 in different ways people have implemented their VM infrastructure using
 RBD.  I guess one of the advantages of using 'qemu-img create -f rbd' is
 that a snapshot of a single RBD would capture just the changed RBD data for
 that VM, whereas a snapshot of a larger RBD with OCFS2 and multiple VM
 images on it would capture changes for all the VMs, not just one. It
 might provide more administrative agility to use the former.
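
 For what it's worth, the per-VM granularity would look roughly like this
 (image and snapshot names are invented):

   rbd snap create rbd/vm-0001-disk@pre-upgrade
   rbd snap ls rbd/vm-0001-disk
   rbd snap rollback rbd/vm-0001-disk@pre-upgrade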

 Also, I guess another question would be: when an RBD is expanded, does the
 underlying VM that was created using 'qemu-img create -f rbd' need to be
 rebooted to see the additional space? My guess would be yes.
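
 For reference, growing the image itself is a one-liner (name and size below
 are made up); whether the guest then sees the extra space without a reboot
 is exactly the open question above.

   rbd resize --size 20480 rbd/vm-0001-disk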

 Brad

 -Original Message-
 From: ceph-users-boun...@lists.ceph.com
 [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Alex Bligh
 Sent: Thursday, July 11, 2013 2:03 PM
 To: Gilles Mocellin
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] OCFS2 or GFS2 for cluster filesystem?


 On 11 Jul 2013, at 19:25, Gilles Mocellin wrote:

  Hello,
 
  Yes, you missed that qemu can use directly RADOS volume.
  Look here :
  http://ceph.com/docs/master/rbd/qemu-rbd/
 
  Create :
  qemu-img create -f rbd rbd:data/squeeze 10G
 
  Use :
 
  qemu -m 1024 -drive format=raw,file=rbd:data/squeeze

 I don't think he did. As I read it he wants his VMs to all access the same
 filing system, and doesn't want to use cephfs.

 OCFS2 on RBD I suppose is a reasonable choice for that.

 --
 Alex Bligh






[ceph-users] OSD port error

2013-07-09 Thread Tom Verdaat
Hi all,

I've set up a new Ceph cluster for testing and it doesn't seem to be
working out-of-the-box. If I check the status it tells me that of the 3
defined OSD's, only 1 is in:

   health HEALTH_WARN 392 pgs degraded; 392 pgs stuck unclean
   monmap e1: 3 mons at {controller-01=10.20.3.110:6789/0,controller-02=10.20.3.111:6789/0,controller-03=10.20.3.112:6789/0}, election epoch 6, quorum 0,1,2 controller-01,controller-02,controller-03
   osdmap e20: 3 osds: 1 up, 1 in
   pgmap v35: 392 pgs: 392 active+degraded; 0 bytes data, 37444 KB used, 15312 MB / 15348 MB avail
   mdsmap e1: 0/0/1 up


Turns out this is true: if I run 'service ceph restart' on my OSD
nodes, osd.0 will restart just fine, but osd.1 and osd.2 give me the
following error:

Starting Ceph osd.0 on storage-02...

starting osd.1 at :/0 osd_data /var/lib/ceph/osd/ceph-1
 /var/lib/ceph/osd/ceph-1/journal
 2013-07-09 11:54:26.497639 7f5b18813780 -1 accepter.accepter.bind unable
 to bind to 10.20.4.121:7100 on any port in range 6800-7100: Cannot assign
 requested address
 failed: 'ulimit -n 8192;  /usr/bin/ceph-osd -i 1 --pid-file
 /var/run/ceph/osd.1.pid -c /etc/ceph/ceph.conf '


My ceph.conf just has a very limited configuration. The OSD section
basically contains:

[osd]
 public network=10.20.3.0/24
 cluster network=10.20.4.0/24
 [osd.0]
 host = storage-01
 public addr = 10.20.3.120
 cluster addr = 10.20.4.120
 [osd.1]
 host = storage-02
 public addr = 10.20.3.121
 cluster addr = 10.20.4.121
 [osd.2]
 host = storage-03
 public addr = 10.20.3.122
 cluster addr = 10.20.4.122
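
(As a sanity check: "Cannot assign requested address" usually means the
address isn't actually configured on that node, so on storage-02 something
like the following should show the cluster address on one of its interfaces.)

   ip addr | grep 10.20.4.121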


A quick Google search on that port binding error doesn't really yield any
results, so I'm reaching out to you guys. Any thoughts on how to fix this?

Thanks,

Tom