Ceph support in OpenNebula

2013-02-20 Thread Jaime Melis
Hello everyone,

I'm an OpenNebula [1] developer. We've been working on integrating
OpenNebula with the native Ceph drivers (using libvirt). The integration
is now complete and ready for testing. You can find more information
about its usage here [2].
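
For readers curious what the integration produces under the hood, the VM ends
up with a libvirt network disk backed by RBD, roughly along the lines of the
sketch below (pool name, image name, monitor address and secret UUID are
placeholders, not values taken from the actual drivers):

  <disk type='network' device='disk'>
    <driver name='qemu' type='raw'/>
    <source protocol='rbd' name='one/one-42-0'>
      <host name='mon1.example.com' port='6789'/>
    </source>
    <auth username='libvirt'>
      <secret type='ceph' uuid='REPLACE-WITH-LIBVIRT-SECRET-UUID'/>
    </auth>
    <target dev='vda' bus='virtio'/>
  </disk>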

We will maintain these drivers officially and extend their features, so
if you have any questions, comments or suggestions, please let me
know.

I would be very happy to write an OpenNebula page in Ceph's documentation.

[1] http://opennebula.org
[2] http://opennebula.org/documentation:rel4.0:ceph_ds

Cheers,
Jaime

--
Jaime Melis
Project Engineer
OpenNebula - The Open Source Toolkit for Cloud Computing
www.OpenNebula.org | jme...@opennebula.org


Re: Geo-replication with RBD

2013-02-20 Thread Sławomir Skowron
Like I said, yes. For now it is the only option to migrate data from one
cluster to the other, and it will have to be enough, with some automation
on top.

But is there any timeline, or any brainstorming in Ceph internal
meetings, about possible replication at the block level, or something
like that?
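
(For context, the snapshot-based migration referred to above boils down to
something like the following, sketched with made-up image and config names;
at this point there is no incremental option, so the whole image is copied
each time:)

  # take a snapshot on the primary cluster
  rbd snap create rbd/dbvol@dr-20130220
  # export the full snapshot to a file (or pipe it over the 10GbE link)
  rbd export rbd/dbvol@dr-20130220 /backup/dbvol-dr-20130220.img
  # import it into the standby cluster, pointed at via its own config file
  rbd -c /etc/ceph/cluster-b.conf import /backup/dbvol-dr-20130220.img dbvol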

On 20 Feb 2013, at 17:33, Sage Weil s...@inktank.com wrote:

 On Wed, 20 Feb 2013, Sławomir Skowron wrote:
 My requirement is to have full disaster recovery, business continuity,
 and failover of automated services to a second datacenter, and not on
 the same Ceph cluster.
 The datacenters have a dedicated 10GbE link for communication, and there is
 an option to stretch the cluster across both datacenters, but that is not
 what I mean.
 There are advantages to this option, like fast snapshots and fast
 switchover of services, but there are some problems.

 When I talk about disaster recovery I mean that the whole storage cluster
 has problems, not only the services on top of the storage. I am thinking
 about a bug, or an admin mistake, that makes the cluster inaccessible in
 every copy, or an upgrade that corrupts data, or an upgrade that is
 disruptive for services - auto-failover of services to another DC
 before upgrading the cluster.

 If the cluster had a solution to replicate the data in RBD images to the next
 cluster, then only the data would be migrated, and when disaster comes
 there would be no need to work from the last imported snapshot (a
 constantly imported snapshot may be minutes, or an hour, behind
 production), but from the data as of now. And when we have an automated
 solution to recover the DB clusters (one of the application services on top
 of RBD) on the new datacenter infrastructure, then we have a real disaster
 recovery solution.

 That's why we built an S3 API layer synchronization to another DC, and
 to Amazon, and only RBD is left.

 Have you read the thread from Jens last week, 'snapshot, clone and mount a
 VM-Image'?  Would this type of capability capture your requirements?

 sage


 On 19 Feb 2013, at 10:23, Sébastien Han
 han.sebast...@gmail.com wrote:

 Hi,

 First of all, I have some questions about your setup:

 * What are your requirements?
 * Are the DCs far from each other?

 If they are reasonably close to each other, you can set up a single
 cluster, with replicas across both DCs, and manage the RBD devices with
 pacemaker.

 Cheers.

 --
 Regards,
 Sébastien Han.


 On Mon, Feb 18, 2013 at 3:20 PM, Sławomir Skowron szi...@gmail.com wrote:
 Hi, sorry for the very late response, but I was sick.

 Our case is to have a failover RBD instance in another cluster. We are
 storing block device images for some services, like databases. We need
 to have two clusters, synchronized, for a quick failover if the first
 cluster goes down, or for an upgrade with a restart, or many other cases.

 Volumes come in many sizes: 1-500 GB, used as
 external block devices for KVM VMs, like EBS.

 On Mon, Feb 18, 2013 at 3:07 PM, Sławomir Skowron szi...@gmail.com wrote:
 Hi, sorry for the very late response, but I was sick.

 Our case is to have a failover RBD instance in another cluster. We are
 storing block device images for some services, like databases. We need to
 have two clusters, synchronized, for a quick failover if the first cluster
 goes down, or for an upgrade with a restart, or many other cases.

 Volumes come in many sizes: 1-500 GB, used as
 external block devices for KVM VMs, like EBS.


 On Fri, Feb 1, 2013 at 12:27 AM, Neil Levine neil.lev...@inktank.com
 wrote:

 Skowron,

 Can you go into a bit more detail on your specific use-case? What type
 of data are you storing in rbd (type, volume)?

 Neil

 On Wed, Jan 30, 2013 at 10:42 PM, Skowron Sławomir
 slawomir.skow...@grupaonet.pl wrote:
 I am making a new thread, because I think it's a different case.

 We have managed async geo-replication of the S3 service between two Ceph
 clusters in two DCs, and to Amazon S3 as a third, all of it via the S3 API. I
 would love to see native RGW geo-replication with the features described in
 the other thread.

 There is another case: what about RBD replication? It's much more
 complicated, and for disaster recovery much more important, just like in
 enterprise storage arrays.
 One cluster across two DCs does not solve the problem, because we need
 guarantees on data consistency, and isolation.
 Are you thinking about this case?

 Regards
 Slawomir Skowron




 --
 -
 Pozdrawiam

 Sławek sZiBis Skowron



 --
 -
 Pozdrawiam

 Sławek sZiBis Skowron

Re: Ceph Rpm Packages for Fedora 18

2013-02-20 Thread Gary Lowell
Hi Kiran -

The Ceph 0.56.3 (Bobtail) release includes Fedora 18 RPMs.  You can find those
at:  http://www.ceph.com/rpm-bobtail/fc18/x86_64/
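
(For anyone installing from there, a minimal yum repo definition pointing at
that URL looks something like the following sketch; gpgcheck is left off here
purely for brevity, and in practice you would verify package signatures:)

  [ceph-bobtail]
  name=Ceph Bobtail packages for Fedora 18
  baseurl=http://www.ceph.com/rpm-bobtail/fc18/x86_64/
  enabled=1
  gpgcheck=0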

Cheers,
Gary

On Feb 19, 2013, at 7:01 PM, Kiran Patil wrote:

 Hello,
 
 Ceph RPM packages are available only up to Fedora 17.
 
 May I know when the Fedora 18 RPM package release is scheduled?
 
 Thanks,
 Kiran Patil.



Re: Geo-replication with RBD

2013-02-20 Thread Sage Weil
On Wed, 20 Feb 2013, Sławomir Skowron wrote:
 Like I said, yes. For now it is the only option to migrate data from one
 cluster to the other, and it will have to be enough, with some automation
 on top.
 
 But is there any timeline, or any brainstorming in Ceph internal
 meetings, about possible replication at the block level, or something
 like that?

I would like to get this in for cuttlefish (0.61).  See #4207 for the 
underlying rados bits.  We also need to settle the file format discussion; 
any input there would be appreciated!

sage
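
(For readers following along: the capability being discussed here later
surfaced as the rbd export-diff / import-diff pair. A rough sketch of the
intended workflow, with hypothetical image and cluster config names - the
exact commands were still being designed at the time of this thread:)

  # on the primary cluster: snapshot, then ship only the delta since snap1
  rbd snap create rbd/dbvol@snap2
  rbd export-diff --from-snap snap1 rbd/dbvol@snap2 - | \
      rbd -c /etc/ceph/cluster-b.conf import-diff - rbd/dbvol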


 
 On 20 Feb 2013, at 17:33, Sage Weil s...@inktank.com wrote:
 
  On Wed, 20 Feb 2013, Sławomir Skowron wrote:
  My requirement is to have full disaster recovery, business continuity,
  and failover of automated services to a second datacenter, and not on
  the same Ceph cluster.
  The datacenters have a dedicated 10GbE link for communication, and there is
  an option to stretch the cluster across both datacenters, but that is not
  what I mean.
  There are advantages to this option, like fast snapshots and fast
  switchover of services, but there are some problems.
 
  When I talk about disaster recovery I mean that the whole storage cluster
  has problems, not only the services on top of the storage. I am thinking
  about a bug, or an admin mistake, that makes the cluster inaccessible in
  every copy, or an upgrade that corrupts data, or an upgrade that is
  disruptive for services - auto-failover of services to another DC
  before upgrading the cluster.
 
  If the cluster had a solution to replicate the data in RBD images to the next
  cluster, then only the data would be migrated, and when disaster comes
  there would be no need to work from the last imported snapshot (a
  constantly imported snapshot may be minutes, or an hour, behind
  production), but from the data as of now. And when we have an automated
  solution to recover the DB clusters (one of the application services on top
  of RBD) on the new datacenter infrastructure, then we have a real disaster
  recovery solution.
 
  That's why we built an S3 API layer synchronization to another DC, and
  to Amazon, and only RBD is left.
 
  Have you read the thread from Jens last week, 'snapshot, clone and mount a
  VM-Image'?  Would this type of capability capture your requirements?
 
  sage
 
 
  On 19 Feb 2013, at 10:23, Sébastien Han
  han.sebast...@gmail.com wrote:
 
  Hi,
 
  First of all, I have some questions about your setup:
 
  * What are your requirements?
  * Are the DCs far from each other?
 
  If they are reasonably close to each other, you can set up a single
  cluster, with replicas across both DCs, and manage the RBD devices with
  pacemaker.
 
  Cheers.
 
  --
  Regards,
  Sébastien Han.
 
 
  On Mon, Feb 18, 2013 at 3:20 PM, Sławomir Skowron szi...@gmail.com wrote:
  Hi, sorry for the very late response, but I was sick.
 
  Our case is to have a failover RBD instance in another cluster. We are
  storing block device images for some services, like databases. We need
  to have two clusters, synchronized, for a quick failover if the first
  cluster goes down, or for an upgrade with a restart, or many other cases.
 
  Volumes come in many sizes: 1-500 GB, used as
  external block devices for KVM VMs, like EBS.
 
  On Mon, Feb 18, 2013 at 3:07 PM, Sławomir Skowron szi...@gmail.com wrote:
  Hi, sorry for the very late response, but I was sick.
 
  Our case is to have a failover RBD instance in another cluster. We are
  storing block device images for some services, like databases. We need to
  have two clusters, synchronized, for a quick failover if the first cluster
  goes down, or for an upgrade with a restart, or many other cases.
 
  Volumes come in many sizes: 1-500 GB, used as
  external block devices for KVM VMs, like EBS.
 
 
  On Fri, Feb 1, 2013 at 12:27 AM, Neil Levine neil.lev...@inktank.com
  wrote:
 
  Skowron,
 
  Can you go into a bit more detail on your specific use-case? What type
  of data are you storing in rbd (type, volume)?
 
  Neil
 
  On Wed, Jan 30, 2013 at 10:42 PM, Skowron Sławomir
  slawomir.skow...@grupaonet.pl wrote:
  I am making a new thread, because I think it's a different case.
 
  We have managed async geo-replication of the S3 service between two Ceph
  clusters in two DCs, and to Amazon S3 as a third, all of it via the S3
  API. I would love to see native RGW geo-replication with the features
  described in the other thread.
 
  There is another case: what about RBD replication? It's much more
  complicated, and for disaster recovery much more important, just like
  in enterprise storage arrays.
  One cluster across two DCs does not solve the problem, because we need
  guarantees on data consistency, and isolation.
  Are you thinking about this case?
 
  Regards
  Slawomir Skowron

Re: Hadoop DNS/topology details

2013-02-20 Thread Sage Weil
On Wed, 20 Feb 2013, Noah Watkins wrote:
 
 On Feb 19, 2013, at 4:39 PM, Sage Weil s...@inktank.com wrote:
 
  However, we do have host and rack information in the crush map, at least 
  for non-customized installations.  How about something like
  
   string ceph_get_osd_crush_location(int osd, string type);
  
  or similar.  We could call that with host and rack and get exactly 
  what we need, without making any changes to the data structures.
 
 This would then be used in conjunction with an interface:
 
  ceph_offset_to_osds(offset, vector<int> osds)
 ...
 osdmap->pg_to_acting_osds(osds)
 ...
 
 or something like this that replaces the current extent-to-sockaddr 
 interface? The proposed interface above would do the host/ip mapping, as 
 well as the topology mapping?

Yeah.  The ceph_offset_to_osds should probably also have an (optional?) 
out argument that tells you how long the extent is starting from offset 
that is on those devices.  Then you can do another call at offset+len to 
get the next segment.

sage



[0.48.3] cluster health - 1 pgs incomplete state

2013-02-20 Thread Sławomir Skowron
Hi,

I have a problem. After expanding the OSDs and reorganizing the CRUSH map, I
have 1 PG in the incomplete state. How can I solve this problem?
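
(For anyone hitting something similar, a hedged sketch of how this is usually
narrowed down - diagnostics, not a guaranteed fix:)

  ceph pg dump_stuck inactive   # list the stuck PGs
  ceph pg 5.5c query            # check past_intervals and which OSDs it wants
  ceph osd tree                 # confirm the acting OSDs are up and in
  # Restarting the acting OSDs (osd.35, osd.68, osd.120 here) sometimes
  # re-triggers peering; if an OSD that held the only copy is permanently
  # gone, 'ceph osd lost <id>' is a last resort that accepts data loss.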

ceph -s
   health HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs
stuck unclean
   monmap e21: 3 mons at
{0=10.178.64.4:6790/0,1=10.178.64.5:6790/0,2=10.178.64.6:6790/0},
election epoch 54, quorum 0,1,2 0,1,2
   osdmap e87682: 156 osds: 156 up, 156 in
pgmap v13097839: 6480 pgs: 6479 active+clean, 1 incomplete; 1484
GB data, 7202 GB used, 36218 GB / 43420 GB avail
   mdsmap e1: 0/0/1 up

ceph health details
HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean
pg 5.5c is stuck incomplete, last acting [35,68,120]
pg 5.5c is stuck incomplete, last acting [35,68,120]
pg 5.5c is incomplete, acting [35,68,120]

Attached below is the output from ceph pg 5.5c query.

Regards
Sławek sZiBis Skowron
ceph pg 5.5c query

{ state: incomplete,
  up: [
35,
68,
120],
  acting: [
35,
68,
120],
  info: { pgid: 5.5c,
  last_update: 28692'1809,
  last_complete: 28692'1809,
  log_tail: 509'809,
  last_backfill: 0\/\/0\/\/-1,
  purged_snaps: [],
  history: { epoch_created: 365,
  last_epoch_started: 37673,
  last_epoch_clean: 24973,
  last_epoch_split: 37673,
  same_up_since: 81165,
  same_interval_since: 81165,
  same_primary_since: 55011,
  last_scrub: 19046'1806,
  last_scrub_stamp: 2013-02-11 00:20:52.190807},
  stats: { version: 28692'1809,
  reported: 55011'57838,
  state: incomplete,
  last_fresh: 2013-02-20 16:27:35.078140,
  last_change: 2013-02-19 11:14:39.520274,
  last_active: 0.00,
  last_clean: 0.00,
  last_unstale: 2013-02-20 16:27:35.078140,
  mapping_epoch: 70925,
  log_start: 509'809,
  ondisk_log_start: 509'809,
  created: 365,
  last_epoch_clean: 365,
  parent: 0.0,
  parent_split_bits: 0,
  last_scrub: 19046'1806,
  last_scrub_stamp: 2013-02-11 00:20:52.190807,
  log_size: 203950,
  ondisk_log_size: 203950,
  stat_sum: { num_bytes: 0,
  num_objects: 0,
  num_object_clones: 0,
  num_object_copies: 0,
  num_objects_missing_on_primary: 0,
  num_objects_degraded: 0,
  num_objects_unfound: 0,
  num_read: 0,
  num_read_kb: 0,
  num_write: 0,
  num_write_kb: 0},
  stat_cat_sum: {},
  up: [
35,
68,
120],
  acting: [
35,
68,
120]},
  empty: 0,
  dne: 0,
  incomplete: 1},
  recovery_state: [
{ name: Started\/Primary\/Peering,
  enter_time: 2013-02-19 11:14:35.094762,
  past_intervals: [
{ first: 23374,
  last: 23496,
  maybe_went_rw: 1,
  up: [
95],
  acting: [
95]},
{ first: 23497,
  last: 23498,
  maybe_went_rw: 1,
  up: [
56,
95],
  acting: [
56,
95]},
{ first: 23499,
  last: 23540,
  maybe_went_rw: 1,
  up: [
56,
95],
  acting: [
95,
56]},
{ first: 23541,
  last: 24899,
  maybe_went_rw: 1,
  up: [
56,
95],
  acting: [
56,
95]},
{ first: 24900,
  last: 24908,
  maybe_went_rw: 1,
  up: [
68,
95],
  acting: [
68,
95]},
{ first: 24909,
  last: 24950,
  maybe_went_rw: 1,
  up: [
72,
95],
  acting: [
72,
95]},
{ first: 24951,
  last: 25727,
  maybe_went_rw: 1,
  up: [
72,
95],
  acting: [
72,
95,
56]},
{ first: 25728,
  last: 25739,
  maybe_went_rw: 1,
  up: [
72,
20,
93],

Re: Hadoop DNS/topology details

2013-02-20 Thread Noah Watkins

On Feb 20, 2013, at 9:31 AM, Sage Weil s...@inktank.com wrote:

 or something like this that replaces the current extent-to-sockaddr 
 interface? The proposed interface above would do the host/ip mapping, as 
 well as the topology mapping?
 
 Yeah.  The ceph_offset_to_osds should probably also have an (optional?) 
 out argument that tells you how long the extent is starting from offset 
 that is on those devices.  Then you can do another call at offset+len to 
 get the next segment.


It'd be nice to hide the striping strategy so we don't have to reproduce it in
the Hadoop shim as we currently do - something that is needed with an interface
using only an offset (we have to know the stripe unit to jump to the next
extent). So, something like this might work:

  struct extent {
    loff_t offset, length;
    vector<int> osds;
  };

  ceph_get_file_extents(file, offset, length, vector<extent> extents);

Then we could re-use the Striper or something?
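
(A hedged sketch of how a shim might consume such an interface; the extent
struct and ceph_get_file_extents()/ceph_get_osd_crush_location() below are the
proposal from this thread, stubbed out here for illustration, not an existing
libcephfs API:)

  #include <cstdint>
  #include <cstdio>
  #include <string>
  #include <vector>

  // Proposed types/calls, stubbed so the sketch compiles standalone.
  struct extent {
    int64_t offset;
    int64_t length;
    std::vector<int> osds;
  };

  static int ceph_get_file_extents(int fd, int64_t off, int64_t len,
                                   std::vector<extent> *out) {
    (void)fd;
    // A real implementation would ask the Striper/OSDMap; fake one extent here.
    out->push_back(extent{off, len, {35, 68}});
    return 0;
  }

  static std::string ceph_get_osd_crush_location(int osd, const std::string &type) {
    return type + "-of-osd." + std::to_string(osd);  // placeholder value
  }

  // Roughly what the Hadoop shim would do for getFileBlockLocations():
  int main() {
    std::vector<extent> extents;
    if (ceph_get_file_extents(0, 0, 64 * 1024 * 1024, &extents) < 0)
      return 1;
    for (const extent &e : extents)
      for (int osd : e.osds)
        std::printf("[%lld,%lld) -> osd.%d host=%s rack=%s\n",
                    (long long)e.offset, (long long)(e.offset + e.length), osd,
                    ceph_get_osd_crush_location(osd, "host").c_str(),
                    ceph_get_osd_crush_location(osd, "rack").c_str());
    return 0;
  }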

-Noah


Re: enable old OSD snapshot to re-join a cluster

2013-02-20 Thread Gregory Farnum
On Tue, Feb 19, 2013 at 2:52 PM, Alexandre Oliva ol...@gnu.org wrote:
 It recently occurred to me that I messed up an OSD's storage, and
 decided that the easiest way to bring it back was to roll it back to an
 earlier snapshot I'd taken (along the lines of clustersnap) and let it
 recover from there.

 The problem with that idea was that the cluster had advanced too much
 since the snapshot was taken: the latest OSDMap known by that snapshot
 was far behind the range still carried by the monitors.

 Determined to let that osd recover from all the data it already had,
 rather than restarting from scratch, I hacked up a “solution” that
 appears to work: with the patch below, the OSD will use the contents of
 an earlier OSDMap (presumably the latest one it has) for a newer OSDMap
 it can't get any more.

 A single run of osd with this patch was enough for it to pick up the
 newer state and join the cluster; from then on, the patched osd was no
 longer necessary, and presumably should not be used except for this sort
 of emergency.

 Of course this can only possibly work reliably if other nodes are up
 with same or newer versions of each of the PGs (but then, rolling back
 the OSD to an older snapshot wouldn't be safe otherwise).  I don't know
 of any other scenarios in which this patch will not recover things
 correctly, but unless someone far more familiar with ceph internals than
 I am vouches for it, I'd recommend using this only if you're really
 desperate to avoid a recovery from scratch, and you save snapshots of
 the other osds (as you probably already do, or you wouldn't have older
 snapshots to rollback to :-) and the mon *before* you get the patched
 ceph-osd to run, and that you stop the mds or otherwise avoid changes
 that you're not willing to lose should the patch not work for you and
 you have to go back to the saved state and let the osd recover from
 scratch.  If it works, lucky us; if it breaks, well, I told you :-)

Yeah, this ought to basically work but it's very dangerous —
potentially breaking invariants about cluster state changes, etc. I
wouldn't use it if the cluster wasn't otherwise healthy; other nodes
breaking in the middle of this operation could cause serious problems,
etc. I'd much prefer that one just recovers over the wire using normal
recovery paths... ;)
-Greg


Re: crush reweight

2013-02-20 Thread Sage Weil
On Wed, 20 Feb 2013, Bo-Syung Yang wrote:
 Hi,
 
 I have a crush map (may not be practical, but just for demo) applied
 to a two-host cluster (each host has two OSDs) to test ceph osd crush
 reweight:
 
 # begin crush map
 
 # devices
 device 0 sdc-host0
 device 1 sdd-host0
 device 2 sdc-host1
 device 3 sdd-host1
 
 # types
 type 0 device
 type 1 pool
 type 2 root
 
 # buckets
 pool one {
 id -1
 alg straw
 hash 0  # rjenkins1
 item sdc-host0 weight 1.000
 item sdd-host0 weight 1.000
 item sdc-host1 weight 1.000
 item sdd-host1 weight 1.000
 }
 
 pool two {
 id -2
 alg straw
 hash 0  # rjenkins1
 item sdc-host0 weight 1.000
 item sdd-host0 weight 1.000
 item sdc-host1 weight 1.000
 item sdd-host1 weight 1.000
 }
 
 root root-for-one {
 id -3
 alg straw
 hash 0  # rjenkins1
 item one weight 4.000
 item two weight 4.000
 }
 
 root root-for-two {
 id -4
 alg straw
 hash 0  # rjenkins1
 item one weight 4.000
 item two weight 4.000
 }
 
 rule data {
 ruleset 0
 type replicated
 min_size 1
 max_size 4
 step take root-for-one
 step choose firstn 0 type pool
 step choose firstn 1 type device
 step emit
 }
 
 rule metadata {
 ruleset 1
 type replicated
 min_size 1
 max_size 4
 step take root-for-one
 step choose firstn 0 type pool
 step choose firstn 1 type device
 step emit
 }
 
 rule rbd {
 ruleset 2
 type replicated
 min_size 1
 max_size 4
 step take root-for-two
 step choose firstn 0 type pool
 step choose firstn 1 type device
 step emit
 }
 
 
 After crush map applied, the osd tree looks as:
 
 # id  weight  type name   up/down reweight
 -4  8   root root-for-two
 -1  4   pool one
 0   1   osd.0   up  1
 1   1   osd.1   up  1
 2   1   osd.2   up  1
 3   1   osd.3   up  1
 -2  4   pool two
 0   1   osd.0   up  1
 1   1   osd.1   up  1
 2   1   osd.2   up  1
 3   1   osd.3   up  1
 -3  8   root root-for-one
 -1  4   pool one
 0   1   osd.0   up  1
 1   1   osd.1   up  1
 2   1   osd.2   up  1
 3   1   osd.3   up  1
 -2  4   pool two
 0   1   osd.0   up  1
 1   1   osd.1   up  1
 2   1   osd.2   up  1
 3   1   osd.3   up  1
 
 
 Then, I reweight osd.0 (device sdc-host0) in crush map to 5 through:
 
  ceph osd crush reweight sdc-host0 5
 
 I found the osd tree with the weight changes:
 
 # id  weight  type name   up/down reweight
 -4  8   root root-for-two
 -1  4   pool one
 0   5   osd.0   up  1
 1   1   osd.1   up  1
 2   1   osd.2   up  1
 3   1   osd.3   up  1
 -2  4   pool two
 0   1   osd.0   up  1
 1   1   osd.1   up  1
 2   1   osd.2   up  1
 3   1   osd.3   up  1
 -3  12  root root-for-one
 -1  8   pool one
 0   5   osd.0   up  1
 1   1   osd.1   up  1
 2   1   osd.2   up  1
 3   1   osd.3   up  1
 -2  4   pool two
 0   1   osd.0   up  1
 1   1   osd.1   up  1
 2   1   osd.2   up  1
 3   1   osd.3   up  1
 
 My question is why only pool one's weight changed, but not pool two.

Currently the reweight (and most of the other) command(s) assume there is 
only one instance of each item in the hierarchy, and only operate on the 
first one they see.

What is your motivation for having the pools appear in two different 
trees?

sage
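
(A hedged aside for anyone who needs to adjust one particular instance in a
layout like this: editing the decompiled map offline sidesteps the
"first one seen" behaviour, roughly along these lines:)

  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt
  # edit the item weight in exactly the bucket instance(s) you care about
  crushtool -c crushmap.txt -o crushmap.new
  ceph osd setcrushmap -i crushmap.new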



Re: crush reweight

2013-02-20 Thread Bo-Syung Yang
On Wed, Feb 20, 2013 at 12:39 PM, Sage Weil s...@inktank.com wrote:
 On Wed, 20 Feb 2013, Bo-Syung Yang wrote:
 Hi,

 I have a crush map (may not be practical, but just for demo) applied
 to a two-host cluster (each host has two OSDs) to test ceph osd crush
 reweight:

 # begin crush map

 # devices
 device 0 sdc-host0
 device 1 sdd-host0
 device 2 sdc-host1
 device 3 sdd-host1

 # types
 type 0 device
 type 1 pool
 type 2 root

 # buckets
 pool one {
 id -1
 alg straw
 hash 0  # rjenkins1
 item sdc-host0 weight 1.000
 item sdd-host0 weight 1.000
 item sdc-host1 weight 1.000
 item sdd-host1 weight 1.000
 }

 pool two {
 id -2
 alg straw
 hash 0  # rjenkins1
 item sdc-host0 weight 1.000
 item sdd-host0 weight 1.000
 item sdc-host1 weight 1.000
 item sdd-host1 weight 1.000
 }

 root root-for-one {
 id -3
 alg straw
 hash 0  # rjenkins1
 item one weight 4.000
 item two weight 4.000
 }

 root root-for-two {
 id -4
 alg straw
 hash 0  # rjenkins1
 item one weight 4.000
 item two weight 4.000
 }

 rule data {
 ruleset 0
 type replicated
 min_size 1
 max_size 4
 step take root-for-one
 step choose firstn 0 type pool
 step choose firstn 1 type device
 step emit
 }

 rule metadata {
 ruleset 1
 type replicated
 min_size 1
 max_size 4
 step take root-for-one
 step choose firstn 0 type pool
 step choose firstn 1 type device
 step emit
 }

 rule rbd {
 ruleset 2
 type replicated
 min_size 1
 max_size 4
 step take root-for-two
 step choose firstn 0 type pool
 step choose firstn 1 type device
 step emit
 }


 After crush map applied, the osd tree looks as:

 # id  weight  type name   up/down reweight
 -4  8   root root-for-two
 -1  4   pool one
 0   1   osd.0   up  1
 1   1   osd.1   up  1
 2   1   osd.2   up  1
 3   1   osd.3   up  1
 -2  4   pool two
 0   1   osd.0   up  1
 1   1   osd.1   up  1
 2   1   osd.2   up  1
 3   1   osd.3   up  1
 -3  8   root root-for-one
 -1  4   pool one
 0   1   osd.0   up  1
 1   1   osd.1   up  1
 2   1   osd.2   up  1
 3   1   osd.3   up  1
 -2  4   pool two
 0   1   osd.0   up  1
 1   1   osd.1   up  1
 2   1   osd.2   up  1
 3   1   osd.3   up  1


 Then, I reweight osd.0 (device sdc-host0) in crush map to 5 through:

  ceph osd crush reweight sdc-host0 5

 I found the osd tree with the weight changes:

 # id  weight  type name   up/down reweight
 -4  8   root root-for-two
 -1  4   pool one
 0   5   osd.0   up  1
 1   1   osd.1   up  1
 2   1   osd.2   up  1
 3   1   osd.3   up  1
 -2  4   pool two
 0   1   osd.0   up  1
 1   1   osd.1   up  1
 2   1   osd.2   up  1
 3   1   osd.3   up  1
 -3  12  root root-for-one
 -1  8   pool one
 0   5   osd.0   up  1
 1   1   osd.1   up  1
 2   1   osd.2   up  1
 3   1   osd.3   up  1
 -2  4   pool two
 0   1   osd.0   up  1
 1   1   osd.1   up  1
 2   1   osd.2   up  1
 3   1   osd.3   up  1

 My question is why only pool one's weight changed, but not pool two.

 Currently the reweight (and most of the other) command(s) assume there is
 only one instance of each item in the hierarchy, and only operate on the
 first one they see.

 What is your motivation for having the pools appear in two different
 trees?

 sage


By defining different pools in different trees, I can set
different rules for sharing certain OSDs and/or isolating others
for the different pools (created through ceph osd pool create ...).

Thanks,

Edward

Re: crush reweight

2013-02-20 Thread Sage Weil
On Wed, 20 Feb 2013, Bo-Syung Yang wrote:
 On Wed, Feb 20, 2013 at 12:39 PM, Sage Weil s...@inktank.com wrote:
  On Wed, 20 Feb 2013, Bo-Syung Yang wrote:
  Hi,
 
  I have a crush map (may not be practical, but just for demo) applied
  to a two-host cluster (each host has two OSDs) to test ceph osd crush
  reweight:
 
  # begin crush map
 
  # devices
  device 0 sdc-host0
  device 1 sdd-host0
  device 2 sdc-host1
  device 3 sdd-host1
 
  # types
  type 0 device
  type 1 pool
  type 2 root
 
  # buckets
  pool one {
  id -1
  alg straw
  hash 0  # rjenkins1
  item sdc-host0 weight 1.000
  item sdd-host0 weight 1.000
  item sdc-host1 weight 1.000
  item sdd-host1 weight 1.000
  }
 
  pool two {
  id -2
  alg straw
  hash 0  # rjenkins1
  item sdc-host0 weight 1.000
  item sdd-host0 weight 1.000
  item sdc-host1 weight 1.000
  item sdd-host1 weight 1.000
  }
 
  root root-for-one {
  id -3
  alg straw
  hash 0  # rjenkins1
  item one weight 4.000
  item two weight 4.000
  }
 
  root root-for-two {
  id -4
  alg straw
  hash 0  # rjenkins1
  item one weight 4.000
  item two weight 4.000
  }
 
  rule data {
  ruleset 0
  type replicated
  min_size 1
  max_size 4
  step take root-for-one
  step choose firstn 0 type pool
  step choose firstn 1 type device
  step emit
  }
 
  rule metadata {
  ruleset 1
  type replicated
  min_size 1
  max_size 4
  step take root-for-one
  step choose firstn 0 type pool
  step choose firstn 1 type device
  step emit
  }
 
  rule rbd {
  ruleset 2
  type replicated
  min_size 1
  max_size 4
  step take root-for-two
  step choose firstn 0 type pool
  step choose firstn 1 type device
  step emit
  }
 
 
  After crush map applied, the osd tree looks as:
 
  # id  weight  type name   up/down reweight
  -4  8   root root-for-two
  -1  4   pool one
  0   1   osd.0   up  1
  1   1   osd.1   up  1
  2   1   osd.2   up  1
  3   1   osd.3   up  1
  -2  4   pool two
  0   1   osd.0   up  1
  1   1   osd.1   up  1
  2   1   osd.2   up  1
  3   1   osd.3   up  1
  -3  8   root root-for-one
  -1  4   pool one
  0   1   osd.0   up  1
  1   1   osd.1   up  1
  2   1   osd.2   up  1
  3   1   osd.3   up  1
  -2  4   pool two
  0   1   osd.0   up  1
  1   1   osd.1   up  1
  2   1   osd.2   up  1
  3   1   osd.3   up  1
 
 
  Then, I reweight osd.0 (device sdc-host0) in crush map to 5 through:
 
   ceph osd crush reweight sdc-host0 5
 
  I found the osd tree with the weight changes:
 
  # id  weight  type name   up/down reweight
  -4  8   root root-for-two
  -1  4   pool one
  0   5   osd.0   up  1
  1   1   osd.1   up  1
  2   1   osd.2   up  1
  3   1   osd.3   up  1
  -2  4   pool two
  0   1   osd.0   up  1
  1   1   osd.1   up  1
  2   1   osd.2   up  1
  3   1   osd.3   up  1
  -3  12  root root-for-one
  -1  8   pool one
  0   5   osd.0   up  1
  1   1   osd.1   up  1
  2   1   osd.2   up  1
  3   1   osd.3   up  1
  -2  4   pool two
  0   1   osd.0   up  1
  1   1   osd.1   up  1
  2   1   osd.2   up  1
  3   1   osd.3   up  1
 
  My question is why only pool one's weight changed, but not pool two.
 
  Currently the reweight (and most of the other) command(s) assume there is
  only one instance of each item in the hierarchy, and only operate on the
  first one they see.
 
  What is your motivation for having the pools appear in two different
  trees?
 
  sage
 
 
 By defining different pools in different trees, I can set
 different rules for sharing certain OSDs and/or isolating others
 for the different pools (created through ceph osd pool create ...)

Re: crush reweight

2013-02-20 Thread Bo-Syung Yang
On Wed, Feb 20, 2013 at 1:19 PM, Sage Weil s...@inktank.com wrote:
 On Wed, 20 Feb 2013, Bo-Syung Yang wrote:
 On Wed, Feb 20, 2013 at 12:39 PM, Sage Weil s...@inktank.com wrote:
  On Wed, 20 Feb 2013, Bo-Syung Yang wrote:
  Hi,
 
  I have a crush map (may not be practical, but just for demo) applied
  to a two-host cluster (each host has two OSDs) to test ceph osd crush
  reweight:
 
  # begin crush map
 
  # devices
  device 0 sdc-host0
  device 1 sdd-host0
  device 2 sdc-host1
  device 3 sdd-host1
 
  # types
  type 0 device
  type 1 pool
  type 2 root
 
  # buckets
  pool one {
  id -1
  alg straw
  hash 0  # rjenkins1
  item sdc-host0 weight 1.000
  item sdd-host0 weight 1.000
  item sdc-host1 weight 1.000
  item sdd-host1 weight 1.000
  }
 
  pool two {
  id -2
  alg straw
  hash 0  # rjenkins1
  item sdc-host0 weight 1.000
  item sdd-host0 weight 1.000
  item sdc-host1 weight 1.000
  item sdd-host1 weight 1.000
  }
 
  root root-for-one {
  id -3
  alg straw
  hash 0  # rjenkins1
  item one weight 4.000
  item two weight 4.000
  }
 
  root root-for-two {
  id -4
  alg straw
  hash 0  # rjenkins1
  item one weight 4.000
  item two weight 4.000
  }
 
  rule data {
  ruleset 0
  type replicated
  min_size 1
  max_size 4
  step take root-for-one
  step choose firstn 0 type pool
  step choose firstn 1 type device
  step emit
  }
 
  rule metadata {
  ruleset 1
  type replicated
  min_size 1
  max_size 4
  step take root-for-one
  step choose firstn 0 type pool
  step choose firstn 1 type device
  step emit
  }
 
  rule rbd {
  ruleset 2
  type replicated
  min_size 1
  max_size 4
  step take root-for-two
  step choose firstn 0 type pool
  step choose firstn 1 type device
  step emit
  }
 
 
  After crush map applied, the osd tree looks as:
 
  # id  weight  type name   up/down reweight
  -4  8   root root-for-two
  -1  4   pool one
  0   1   osd.0   up  1
  1   1   osd.1   up  1
  2   1   osd.2   up  1
  3   1   osd.3   up  1
  -2  4   pool two
  0   1   osd.0   up  1
  1   1   osd.1   up  1
  2   1   osd.2   up  1
  3   1   osd.3   up  1
  -3  8   root root-for-one
  -1  4   pool one
  0   1   osd.0   up  1
  1   1   osd.1   up  1
  2   1   osd.2   up  1
  3   1   osd.3   up  1
  -2  4   pool two
  0   1   osd.0   up  1
  1   1   osd.1   up  1
  2   1   osd.2   up  1
  3   1   osd.3   up  1
 
 
  Then, I reweight osd.0 (device sdc-host0) in crush map to 5 through:
 
   ceph osd crush reweight sdc-host0 5
 
  I found the osd tree with the weight changes:
 
  # id  weight  type name   up/down reweight
  -4  8   root root-for-two
  -1  4   pool one
  0   5   osd.0   up  1
  1   1   osd.1   up  1
  2   1   osd.2   up  1
  3   1   osd.3   up  1
  -2  4   pool two
  0   1   osd.0   up  1
  1   1   osd.1   up  1
  2   1   osd.2   up  1
  3   1   osd.3   up  1
  -3  12  root root-for-one
  -1  8   pool one
  0   5   osd.0   up  1
  1   1   osd.1   up  1
  2   1   osd.2   up  1
  3   1   osd.3   up  1
  -2  4   pool two
  0   1   osd.0   up  1
  1   1   osd.1   up  1
  2   1   osd.2   up  1
  3   1   osd.3   up  1
 
  My question is why only pool one's weight changed, but not pool two.
 
  Currently the reweight (and most of the other) command(s) assume there is
  only one instance of each item in the hierarchy, and only operate on the
  first one they see.
 
  What is your motivation for having the pools appear in two different
  trees?
 
  sage
 

 By defining different pools in different trees, I can set
 different rules for sharing certain OSDs and/or isolating others
 

Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2013-02-20 Thread Sage Weil
Hi Jim,

I'm resurrecting an ancient thread here, but: we've just observed this on 
another big cluster and remembered that this hasn't actually been fixed.

I think the right solution is to make an option that will setsockopt on 
SO_RCVBUF to some value (say, 256KB).  I pushed a branch that does this, 
wip-tcp.  Do you mind checking to see if this addresses the issue (without 
manually adjusting things in /proc)?

And perhaps we should consider making this default to 256KB...

Thanks!
sage
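
(A minimal sketch of what such an option amounts to per socket - the actual
wip-tcp change may look different; 256 KB matches the value suggested above:)

  #include <cstdio>
  #include <sys/socket.h>

  // Pinning the receive buffer explicitly also disables the kernel's
  // autotuning for that socket, which is the point here.
  static void pin_rcvbuf(int fd, int bytes = 256 * 1024) {
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
      std::perror("setsockopt(SO_RCVBUF)");
  }

  int main() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd >= 0)
      pin_rcvbuf(fd);
    return 0;
  }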



On Fri, 24 Feb 2012, Jim Schutt wrote:

 On 02/02/2012 10:52 AM, Gregory Farnum wrote:
  On Thu, Feb 2, 2012 at 7:29 AM, Jim Schutt jasc...@sandia.gov wrote:
   I'm currently running 24 OSDs/server, one 1TB 7200 RPM SAS drive
   per OSD.  During a test I watch both OSD servers with both
   vmstat and iostat.
   
   During a good period, vmstat says the server is sustaining > 2 GB/s
   for multiple tens of seconds.  Since I use replication factor 2, that
   means that server is sustaining > 500 MB/s aggregate client throughput,
   right?  During such a period vmstat also reports ~10% CPU idle.
   
   During a bad period, vmstat says the server is doing ~200 MB/s,
   with lots of idle cycles.  It is during these periods that
   messages stuck in the policy throttler build up such long
   wait times.  Sometimes I see really bad periods with aggregate
   throughput per server < 100 MB/s.
   
   The typical pattern I see is that a run starts with tens of seconds
   of aggregate throughput > 2 GB/s.  Then it drops and bounces around
   500 - 1000 MB/s, with occasional excursions under 100 MB/s.  Then
   it ramps back up near 2 GB/s again.
  
  Hmm. 100MB/s is awfully low for this theory, but have you tried to
  correlate the drops in throughput with the OSD journals running out of
  space? I assume from your setup that they're sharing the disk with the
  store (although it works either way), and your description makes me
  think that throughput is initially constrained by sequential journal
  writes but then the journal runs out of space and the OSD has to wait
  for the main store to catch up (with random IO), and that sends the IO
  patterns all to hell. (If you can say that random 4MB IOs are
  hellish.)
  I'm also curious about memory usage as a possible explanation for the
  more dramatic drops.
 
 I've finally figured out what is going on with this behaviour.
 Memory usage was on the right track.
 
 It turns out to be an unfortunate interaction between the
 number of OSDs/server, number of clients, TCP socket buffer
 autotuning, the policy throttler, and limits on the total
 memory used by the TCP stack (net/ipv4/tcp_mem sysctl).
 
 What happens is that for throttled reader threads, the
 TCP stack will continue to receive data as long as there
 is available socket buffer, and the sender has data to send.
 
 As each reader thread receives successive messages, the
 TCP socket buffer autotuning increases the size of the
 socket buffer.  Eventually, due to the number of OSDs
 per server and the number of clients trying to write,
 all the memory the TCP stack is allowed by net/ipv4/tcp_mem
 to use is consumed by the socket buffers of throttled
 reader threads.  When this happens, TCP processing is affected
 to the point that the TCP stack cannot send ACKs on behalf
 of the reader threads that aren't throttled.  At that point
 the OSD stalls until the TCP retransmit count on some connection
 is exceeded, causing it to be reset.
 
 Since my OSD servers don't run anything else, the simplest
 solution for me is to turn off socket buffer autotuning
 (net/ipv4/tcp_moderate_rcvbuf), and set the default socket
 buffer size to something reasonable.  256k seems to be
 working well for me right now.
 
 -- Jim
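
(The system-wide knobs Jim describes correspond roughly to the sysctls below;
a hedged sketch only - the exact values are illustrative and worth
benchmarking on your own hardware:)

  # turn off TCP receive-buffer autotuning ...
  sysctl -w net.ipv4.tcp_moderate_rcvbuf=0
  # ... and pin the default/max TCP receive buffer at 256 KB
  sysctl -w net.ipv4.tcp_rmem="4096 262144 262144"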
 
  -Greg
  
  
 
 


Re: Ceph Rpm Packages for Fedora 18

2013-02-20 Thread Kiran Patil
Thanks Gary.

On Wed, Feb 20, 2013 at 10:34 PM, Gary Lowell gary.low...@inktank.com wrote:
 Hi Kiran -

 The Ceph 0.56.3 (Bobtail) release includes Fedora 18 RPMs.  You can find those
 at:  http://www.ceph.com/rpm-bobtail/fc18/x86_64/

 Cheers,
 Gary

 On Feb 19, 2013, at 7:01 PM, Kiran Patil wrote:

 Hello,

 Ceph RPM packages are available only up to Fedora 17.

 May I know when the Fedora 18 RPM package release is scheduled?

 Thanks,
 Kiran Patil.



Re: rbd export speed limit

2013-02-20 Thread Andrey Korolyov
On Wed, Feb 13, 2013 at 12:22 AM, Stefan Priebe s.pri...@profihost.ag wrote:
 Hi,

 is there a speed limit option for rbd export? Right now I'm able to cause
 several slow requests for important, valid requests just by exporting a
 snapshot which is not really important.

 rbd export runs at 2400 MB/s and each OSD at 250 MB/s, so it seems to block
 valid, normal read/write operations.

 Greets,
 Stefan
 --

I can confirm this in one specific case: when 0.56.2 and 0.56.3
coexist for a long time, nodes running the newer version can produce
such warnings at the beginning of exporting huge snapshots, though not
during the entire export. And there is a real impact on clients - for
example, I can see watchdog messages in the KVM guests. For now, I will
throttle the export input as a temporary workaround.
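
(Lacking a built-in limit, one way to throttle is to stream the export through
a rate limiter such as pv - a sketch with made-up names, assuming rbd export
accepts '-' to write the image to stdout:)

  # cap the export at roughly 50 MB/s so it cannot starve client I/O
  rbd export rbd/myimage@snap1 - | pv -L 50m > /backup/myimage-snap1.img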