Ceph support in OpenNebula
Hello everyone, I'm an OpenNebula [1] developer. We've been working on integrating OpenNebula with the native Ceph drivers (using libvirt). The integration is now complete and ready for testing. You can find more information about its usage here [2].

We will maintain these drivers officially and extend their features, so if you have any questions, comments or suggestions, please let me know. I would be very happy to write an OpenNebula page for Ceph's documentation.

[1] http://opennebula.org
[2] http://opennebula.org/documentation:rel4.0:ceph_ds

Cheers,
Jaime

--
Jaime Melis
Project Engineer
OpenNebula - The Open Source Toolkit for Cloud Computing
www.OpenNebula.org | jme...@opennebula.org
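For context, the drivers attach images through libvirt's native RBD support, so a guest ends up with a network disk definition along these lines. This is a hand-written sketch, not driver output; the pool, image, monitor host, and secret UUID are placeholders:

    <disk type='network' device='disk'>
      <driver name='qemu' type='raw'/>
      <!-- pool/image and monitor host are placeholder values -->
      <source protocol='rbd' name='one/vm-42-disk-0'>
        <host name='mon0.example.com' port='6789'/>
      </source>
      <!-- cephx auth; the uuid must match a libvirt secret -->
      <auth username='libvirt'>
        <secret type='ceph' uuid='00000000-0000-0000-0000-000000000000'/>
      </auth>
      <target dev='vda' bus='virtio'/>
    </disk>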
Re: Geo-replication with RBD
Like I said, yes. Right now that is the only option for migrating data from one cluster to another, and for now it has to be enough, with some automation on top. But is there any timeline, or any brainstorming in Ceph's internal meetings, about possible replication at the block level, or something like that?

On 20 Feb 2013, at 17:33, Sage Weil s...@inktank.com wrote:

On Wed, 20 Feb 2013, Sławomir Skowron wrote:

My requirement is full disaster recovery, business continuity, and failover of automated services to a second datacenter, not on the same Ceph cluster. The datacenters have a dedicated 10GbE link for communication, and there is the option of expanding the cluster across the two datacenters, but that is not what I mean. That option has advantages, like fast snapshots and fast switchover of services, but it also has problems. When I talk about disaster recovery, I mean the whole storage cluster having problems, not only the services on top of it. I am thinking about a bug, or an admin mistake, that makes the cluster inaccessible in every copy; or an upgrade that corrupts data; or an upgrade that is disruptive for services, where we want to fail services over to the other DC before upgrading the cluster.

If the cluster had a way to replicate RBD images to a second cluster, then only the data would be migrated, and when disaster strikes there would be no need to fall back to the last imported snapshot (snapshots can be imported continuously, minutes or hours behind production); we could work on current data. And once we have an automated way to recover the DB clusters (one of the application services on top of RBD) in the new datacenter infrastructure, we have a real disaster recovery solution. That's why we built an S3 API layer synchronized to the other DC and to Amazon; only RBD is left.

Have you read the thread from Jens last week, 'snapshot, clone and mount a VM-Image'? Would this type of capability capture your requirements?

sage

On 19 Feb 2013, at 10:23, Sébastien Han han.sebast...@gmail.com wrote:

Hi, first of all, I have some questions about your setup:
* What are your requirements?
* Are the DCs far from each other?
If they are reasonably close to each other, you can set up a single cluster with replicas across both DCs and manage the RBD devices with Pacemaker.

Cheers.
--
Regards,
Sébastien Han.

On Mon, Feb 18, 2013 at 3:20 PM, Sławomir Skowron szi...@gmail.com wrote:

Hi, sorry for the very late response, but I was sick. Our use case is a failover RBD instance in another cluster. We store block device images for services like databases. We need two clusters, synchronized, for a quick failover if the first cluster goes down, for upgrades with restarts, and many other cases. Volumes come in many sizes: 1-500GB external block devices for KVM VMs, like EBS.

On Fri, Feb 1, 2013 at 12:27 AM, Neil Levine neil.lev...@inktank.com wrote:

Skowron, can you go into a bit more detail on your specific use-case? What type of data are you storing in rbd (type, volume)?

Neil

On Wed, Jan 30, 2013 at 10:42 PM, Skowron Sławomir slawomir.skow...@grupaonet.pl wrote:

I'm starting a new thread, because I think it's a different case. We have managed async geo-replication of an S3 service between two Ceph clusters in two DCs, and to Amazon S3 as a third, all via the S3 API. I would love to see native RGW geo-replication with the features described in the other thread.

But there is another case: what about RBD replication? It's much more complicated, and for disaster recovery much more important, just like in enterprise storage arrays. One cluster in two DCs does not solve the problem, because we need safety in data consistency, and isolation. Are you thinking about this case?

Regards
Sławomir Skowron

--
Regards
Sławek "sZiBis" Skowron
Re: Ceph Rpm Packages for Fedora 18
Hi Kiran -

The Ceph 0.56.3 (Bobtail) release includes Fedora 18 RPMs. You can find those at:

http://www.ceph.com/rpm-bobtail/fc18/x86_64/

Cheers,
Gary

On Feb 19, 2013, at 7:01 PM, Kiran Patil wrote:

Hello, the Ceph RPM packages only go up to Fedora 17. May I know when the Fedora 18 RPM packages are scheduled for release?

Thanks,
Kiran Patil.
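For anyone installing from that tree, a yum repository file along these lines works; this is a sketch, not an official snippet from ceph.com (gpgcheck is disabled here for brevity):

    [ceph-bobtail]
    name=Ceph Bobtail (Fedora 18)
    baseurl=http://www.ceph.com/rpm-bobtail/fc18/x86_64/
    enabled=1
    gpgcheck=0

followed by a plain yum install ceph.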
Re: Geo-replication with RBD
On Wed, 20 Feb 2013, Sławomir Skowron wrote:

Like I said, yes. Right now that is the only option for migrating data from one cluster to another, and for now it has to be enough, with some automation on top. But is there any timeline, or any brainstorming in Ceph's internal meetings, about possible replication at the block level, or something like that?

I would like to get this in for cuttlefish (0.61). See #4207 for the underlying rados bits. We also need to settle the file format discussion; any input there would be appreciated!

sage

[...]
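Until block-level replication lands, the snapshot-shipping approach discussed in this thread can be scripted with plain rbd export/import over the inter-DC link. A rough sketch under bobtail-era tooling; the pool, image, snapshot, and host names are hypothetical:

    # Take a snapshot on the primary cluster (names are examples).
    rbd snap create rbd/db-volume@dr-20130220
    # Ship a full export to the standby DC; pre-cuttlefish tooling has
    # no incremental diff, so the whole image moves each time.
    rbd export rbd/db-volume@dr-20130220 /tmp/db-volume.img
    scp /tmp/db-volume.img standby-dc:/tmp/
    ssh standby-dc 'rbd import /tmp/db-volume.img rbd/db-volume'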
Re: Hadoop DNS/topology details
On Wed, 20 Feb 2013, Noah Watkins wrote:

On Feb 19, 2013, at 4:39 PM, Sage Weil s...@inktank.com wrote:

However, we do have host and rack information in the crush map, at least for non-customized installations. How about something like

    string ceph_get_osd_crush_location(int osd, string type);

or similar. We could call that with host and rack and get exactly what we need, without making any changes to the data structures.

This would then be used in conjunction with an interface:

    ceph_offset_to_osds(offset, vector<int> osds)
      ... osdmap->pg_to_acting_osds(osds) ...

or something like this that replaces the current extent-to-sockaddr interface? The proposed interface above would do the host/IP mapping, as well as the topology mapping?

Yeah. The ceph_offset_to_osds should probably also have an (optional?) out argument that tells you how long the extent starting from offset is on those devices. Then you can do another call at offset+len to get the next segment.

sage
[0.48.3] cluster health - 1 pgs incomplete state
Hi, I have a problem. After expanding the OSDs and reorganizing the cluster's CRUSH map, I have 1 pg in the incomplete state. How can I solve this problem?

ceph -s

    health HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean
    monmap e21: 3 mons at {0=10.178.64.4:6790/0,1=10.178.64.5:6790/0,2=10.178.64.6:6790/0}, election epoch 54, quorum 0,1,2 0,1,2
    osdmap e87682: 156 osds: 156 up, 156 in
    pgmap v13097839: 6480 pgs: 6479 active+clean, 1 incomplete; 1484 GB data, 7202 GB used, 36218 GB / 43420 GB avail
    mdsmap e1: 0/0/1 up

ceph health detail

    HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean
    pg 5.5c is stuck inactive, last acting [35,68,120]
    pg 5.5c is stuck unclean, last acting [35,68,120]
    pg 5.5c is incomplete, acting [35,68,120]

Output from ceph pg 5.5c query is attached.

Regards
Sławek "sZiBis" Skowron

ceph pg 5.5c query:

    { "state": "incomplete",
      "up": [35, 68, 120],
      "acting": [35, 68, 120],
      "info": { "pgid": "5.5c",
          "last_update": "28692'1809",
          "last_complete": "28692'1809",
          "log_tail": "509'809",
          "last_backfill": "0\/\/0\/\/-1",
          "purged_snaps": "[]",
          "history": { "epoch_created": 365,
              "last_epoch_started": 37673,
              "last_epoch_clean": 24973,
              "last_epoch_split": 37673,
              "same_up_since": 81165,
              "same_interval_since": 81165,
              "same_primary_since": 55011,
              "last_scrub": "19046'1806",
              "last_scrub_stamp": "2013-02-11 00:20:52.190807"},
          "stats": { "version": "28692'1809",
              "reported": "55011'57838",
              "state": "incomplete",
              "last_fresh": "2013-02-20 16:27:35.078140",
              "last_change": "2013-02-19 11:14:39.520274",
              "last_active": "0.00",
              "last_clean": "0.00",
              "last_unstale": "2013-02-20 16:27:35.078140",
              "mapping_epoch": 70925,
              "log_start": "509'809",
              "ondisk_log_start": "509'809",
              "created": 365,
              "last_epoch_clean": 365,
              "parent": "0.0",
              "parent_split_bits": 0,
              "last_scrub": "19046'1806",
              "last_scrub_stamp": "2013-02-11 00:20:52.190807",
              "log_size": 203950,
              "ondisk_log_size": 203950,
              "stat_sum": { "num_bytes": 0,
                  "num_objects": 0,
                  "num_object_clones": 0,
                  "num_object_copies": 0,
                  "num_objects_missing_on_primary": 0,
                  "num_objects_degraded": 0,
                  "num_objects_unfound": 0,
                  "num_read": 0,
                  "num_read_kb": 0,
                  "num_write": 0,
                  "num_write_kb": 0},
              "stat_cat_sum": {},
              "up": [35, 68, 120],
              "acting": [35, 68, 120]},
          "empty": 0,
          "dne": 0,
          "incomplete": 1},
      "recovery_state": [
          { "name": "Started\/Primary\/Peering",
            "enter_time": "2013-02-19 11:14:35.094762",
            "past_intervals": [
                { "first": 23374, "last": 23496, "maybe_went_rw": 1, "up": [95], "acting": [95]},
                { "first": 23497, "last": 23498, "maybe_went_rw": 1, "up": [56, 95], "acting": [56, 95]},
                { "first": 23499, "last": 23540, "maybe_went_rw": 1, "up": [56, 95], "acting": [95, 56]},
                { "first": 23541, "last": 24899, "maybe_went_rw": 1, "up": [56, 95], "acting": [56, 95]},
                { "first": 24900, "last": 24908, "maybe_went_rw": 1, "up": [68, 95], "acting": [68, 95]},
                { "first": 24909, "last": 24950, "maybe_went_rw": 1, "up": [72, 95], "acting": [72, 95]},
                { "first": 24951, "last": 25727, "maybe_went_rw": 1, "up": [72, 95], "acting": [72, 95, 56]},
                { "first": 25728, "last": 25739, "maybe_went_rw": 1, "up": [72, 20, 93],
Re: Hadoop DNS/topology details
On Feb 20, 2013, at 9:31 AM, Sage Weil s...@inktank.com wrote:

or something like this that replaces the current extent-to-sockaddr interface? The proposed interface above would do the host/IP mapping, as well as the topology mapping?

Yeah. The ceph_offset_to_osds should probably also have an (optional?) out argument that tells you how long the extent starting from offset is on those devices. Then you can do another call at offset+len to get the next segment.

It'd be nice to hide the striping strategy so we don't have to reproduce it in the Hadoop shim as we currently do, and as is needed with an interface using only an offset (we have to know the stripe unit to jump to the next extent). So, something like this might work:

    struct extent {
      loff_t offset, length;
      vector<int> osds;
    };

    ceph_get_file_extents(file, offset, length, vector<extent> extents);

Then we could re-use the Striper or something?

-Noah
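To make the shape of this proposal concrete, here is a hypothetical, self-contained C++ sketch. The names ceph_get_file_extents and ceph_get_osd_crush_location come from the discussion above, the stub bodies are invented stand-ins, and none of this is a shipped libcephfs API.

    #include <cstdint>
    #include <cstdio>
    #include <string>
    #include <vector>

    // Proposed extent record: a byte range plus the OSDs holding it.
    // (The thread uses loff_t; int64_t is used here for portability.)
    struct extent {
      int64_t offset, length;
      std::vector<int> osds;
    };

    // Invented stub stand-ins so the sketch runs; the real calls would
    // consult the OSDMap and the CRUSH map.
    static int ceph_get_file_extents(int file, int64_t offset, int64_t length,
                                     std::vector<extent>& extents) {
      extents.push_back(extent{offset, length, {35, 68, 120}});
      return 0;
    }
    static std::string ceph_get_osd_crush_location(int osd,
                                                   const std::string& type) {
      return type + std::to_string(osd);  // e.g. "host35", "rack35"
    }

    // What a Hadoop shim might do: resolve each extent's OSDs to
    // host/rack without knowing the file's striping strategy.
    int main() {
      std::vector<extent> extents;
      ceph_get_file_extents(/*file=*/0, 0, int64_t(4) << 20, extents);
      for (const extent& e : extents)
        for (int osd : e.osds)
          std::printf("extent %lld+%lld -> osd.%d (%s, %s)\n",
                      (long long)e.offset, (long long)e.length, osd,
                      ceph_get_osd_crush_location(osd, "host").c_str(),
                      ceph_get_osd_crush_location(osd, "rack").c_str());
      return 0;
    }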
Re: enable old OSD snapshot to re-join a cluster
On Tue, Feb 19, 2013 at 2:52 PM, Alexandre Oliva ol...@gnu.org wrote:

It recently occurred to me that I had messed up an OSD's storage, and I decided that the easiest way to bring it back was to roll it back to an earlier snapshot I'd taken (along the lines of clustersnap) and let it recover from there. The problem with that idea was that the cluster had advanced too much since the snapshot was taken: the latest OSDMap known by that snapshot was far behind the range still carried by the monitors.

Determined to let that OSD recover from all the data it already had, rather than restarting from scratch, I hacked up a “solution” that appears to work: with the patch below, the OSD will use the contents of an earlier OSDMap (presumably the latest one it has) for a newer OSDMap it can't get any more. A single run of the OSD with this patch was enough for it to pick up the newer state and join the cluster; from then on, the patched OSD was no longer necessary, and presumably should not be used except for this sort of emergency.

Of course this can only possibly work reliably if other nodes are up with the same or newer versions of each of the PGs (but then, rolling the OSD back to an older snapshot wouldn't be safe otherwise). I don't know of any other scenarios in which this patch will not recover things correctly, but unless someone far more familiar with Ceph internals than I am vouches for it, I'd recommend using it only if you're really desperate to avoid a recovery from scratch. Save snapshots of the other OSDs (as you probably already do, or you wouldn't have older snapshots to roll back to :-) and of the mon *before* you run the patched ceph-osd, and stop the mds or otherwise avoid changes you're not willing to lose, in case the patch doesn't work for you and you have to go back to the saved state and let the OSD recover from scratch. If it works, lucky us; if it breaks, well, I told you :-)

Yeah, this ought to basically work, but it's very dangerous — potentially breaking invariants about cluster state changes, etc. I wouldn't use it if the cluster weren't otherwise healthy; other nodes breaking in the middle of this operation could cause serious problems, etc. I'd much prefer that one just recovers over the wire using the normal recovery paths... ;)

-Greg
Re: crush reweight
On Wed, 20 Feb 2013, Bo-Syung Yang wrote:

Hi, I have a crush map (maybe not practical, but just for demo purposes) applied to a two-host cluster (each host has two OSDs) to test ceph osd crush reweight:

    # begin crush map

    # devices
    device 0 sdc-host0
    device 1 sdd-host0
    device 2 sdc-host1
    device 3 sdd-host1

    # types
    type 0 device
    type 1 pool
    type 2 root

    # buckets
    pool one {
            id -1
            alg straw
            hash 0  # rjenkins1
            item sdc-host0 weight 1.000
            item sdd-host0 weight 1.000
            item sdc-host1 weight 1.000
            item sdd-host1 weight 1.000
    }
    pool two {
            id -2
            alg straw
            hash 0  # rjenkins1
            item sdc-host0 weight 1.000
            item sdd-host0 weight 1.000
            item sdc-host1 weight 1.000
            item sdd-host1 weight 1.000
    }
    root root-for-one {
            id -3
            alg straw
            hash 0  # rjenkins1
            item one weight 4.000
            item two weight 4.000
    }
    root root-for-two {
            id -4
            alg straw
            hash 0  # rjenkins1
            item one weight 4.000
            item two weight 4.000
    }

    # rules
    rule data {
            ruleset 0
            type replicated
            min_size 1
            max_size 4
            step take root-for-one
            step choose firstn 0 type pool
            step choose firstn 1 type device
            step emit
    }
    rule metadata {
            ruleset 1
            type replicated
            min_size 1
            max_size 4
            step take root-for-one
            step choose firstn 0 type pool
            step choose firstn 1 type device
            step emit
    }
    rule rbd {
            ruleset 2
            type replicated
            min_size 1
            max_size 4
            step take root-for-two
            step choose firstn 0 type pool
            step choose firstn 1 type device
            step emit
    }

After the crush map is applied, the osd tree looks like:

    # id    weight  type name               up/down reweight
    -4      8       root root-for-two
    -1      4               pool one
    0       1                       osd.0   up      1
    1       1                       osd.1   up      1
    2       1                       osd.2   up      1
    3       1                       osd.3   up      1
    -2      4               pool two
    0       1                       osd.0   up      1
    1       1                       osd.1   up      1
    2       1                       osd.2   up      1
    3       1                       osd.3   up      1
    -3      8       root root-for-one
    -1      4               pool one
    0       1                       osd.0   up      1
    1       1                       osd.1   up      1
    2       1                       osd.2   up      1
    3       1                       osd.3   up      1
    -2      4               pool two
    0       1                       osd.0   up      1
    1       1                       osd.1   up      1
    2       1                       osd.2   up      1
    3       1                       osd.3   up      1

Then I reweighted osd.0 (device sdc-host0) in the crush map to 5 through:

    ceph osd crush reweight sdc-host0 5

The osd tree then shows the weight changes:

    # id    weight  type name               up/down reweight
    -4      8       root root-for-two
    -1      4               pool one
    0       5                       osd.0   up      1
    1       1                       osd.1   up      1
    2       1                       osd.2   up      1
    3       1                       osd.3   up      1
    -2      4               pool two
    0       1                       osd.0   up      1
    1       1                       osd.1   up      1
    2       1                       osd.2   up      1
    3       1                       osd.3   up      1
    -3      12      root root-for-one
    -1      8               pool one
    0       5                       osd.0   up      1
    1       1                       osd.1   up      1
    2       1                       osd.2   up      1
    3       1                       osd.3   up      1
    -2      4               pool two
    0       1                       osd.0   up      1
    1       1                       osd.1   up      1
    2       1                       osd.2   up      1
    3       1                       osd.3   up      1

My question is: why did only pool one's weight change, and not pool two's?

Currently the reweight (and most of the other) command(s) assume there is only one instance of each item in the hierarchy, and only operate on the first one they see. What is your motivation for having the pools appear in two different trees?

sage
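One workaround for readers who hit this with duplicated items: round-trip the map through crushtool and edit every instance's weight by hand. These are standard commands; the file names are arbitrary.

    ceph osd getcrushmap -o crushmap.bin        # fetch the compiled map
    crushtool -d crushmap.bin -o crushmap.txt   # decompile to text
    # edit crushmap.txt: change the weight of every instance by hand
    crushtool -c crushmap.txt -o crushmap.new   # recompile
    ceph osd setcrushmap -i crushmap.new        # inject the edited map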
Re: crush reweight
On Wed, Feb 20, 2013 at 12:39 PM, Sage Weil s...@inktank.com wrote:

On Wed, 20 Feb 2013, Bo-Syung Yang wrote:

[...]

My question is: why did only pool one's weight change, and not pool two's?

Currently the reweight (and most of the other) command(s) assume there is only one instance of each item in the hierarchy, and only operate on the first one they see. What is your motivation for having the pools appear in two different trees?

sage

Defining the pools in different trees lets me set different rules that share certain OSDs and/or isolate others for the different pools (created through ceph osd pool create ...).

Thanks,
Edward
Re: crush reweight
On Wed, 20 Feb 2013, Bo-Syung Yang wrote:

[...]

Defining the pools in different trees lets me set different rules that share certain OSDs and/or isolate others for the different pools (created through ceph osd pool create ...)
Re: crush reweight
On Wed, Feb 20, 2013 at 1:19 PM, Sage Weil s...@inktank.com wrote:

On Wed, 20 Feb 2013, Bo-Syung Yang wrote:

[...]

Defining the pools in different trees lets me set different rules that share certain OSDs and/or isolate others
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
Hi Jim,

I'm resurrecting an ancient thread here, but: we've just observed this on another big cluster and remembered that it hasn't actually been fixed. I think the right solution is to add an option that will setsockopt SO_RCVBUF to some value (say, 256KB). I pushed a branch that does this, wip-tcp. Do you mind checking whether it addresses the issue (without manually adjusting things in /proc)? And perhaps we should consider making this default to 256KB...

Thanks!
sage

On Fri, 24 Feb 2012, Jim Schutt wrote:

On 02/02/2012 10:52 AM, Gregory Farnum wrote:

On Thu, Feb 2, 2012 at 7:29 AM, Jim Schutt jasc...@sandia.gov wrote:

I'm currently running 24 OSDs/server, one 1TB 7200 RPM SAS drive per OSD. During a test I watch both OSD servers with both vmstat and iostat. During a good period, vmstat says the server is sustaining 2 GB/s for multiple tens of seconds. Since I use replication factor 2, that means the server is sustaining 500 MB/s aggregate client throughput, right? During such a period vmstat also reports ~10% CPU idle.

During a bad period, vmstat says the server is doing ~200 MB/s, with lots of idle cycles. It is during these periods that messages stuck in the policy throttler build up such long wait times. Sometimes I see really bad periods with aggregate throughput per server under 100 MB/s.

The typical pattern I see is that a run starts with tens of seconds of aggregate throughput of 2 GB/s. Then it drops and bounces around 500 - 1000 MB/s, with occasional excursions under 100 MB/s. Then it ramps back up near 2 GB/s again.

Hmm. 100MB/s is awfully low for this theory, but have you tried to correlate the drops in throughput with the OSD journals running out of space? I assume from your setup that they're sharing the disk with the store (although it works either way), and your description makes me think that throughput is initially constrained by sequential journal writes but then the journal runs out of space and the OSD has to wait for the main store to catch up (with random IO), and that sends the IO patterns all to hell. (If you can say that random 4MB IOs are hellish.)

I'm also curious about memory usage as a possible explanation for the more dramatic drops.

-Greg

I've finally figured out what is going on with this behaviour. Memory usage was on the right track. It turns out to be an unfortunate interaction between the number of OSDs/server, the number of clients, TCP socket buffer autotuning, the policy throttler, and the limit on total memory used by the TCP stack (the net/ipv4/tcp_mem sysctl).

What happens is that for throttled reader threads, the TCP stack will continue to receive data as long as there is available socket buffer and the sender has data to send. As each reader thread receives successive messages, TCP socket buffer autotuning increases the size of the socket buffer. Eventually, due to the number of OSDs per server and the number of clients trying to write, all the memory the TCP stack is allowed by net/ipv4/tcp_mem is consumed by the socket buffers of throttled reader threads. When this happens, TCP processing is affected to the point that the TCP stack cannot send ACKs on behalf of the reader threads that aren't throttled. At that point the OSD stalls until the TCP retransmit count on some connection is exceeded, causing it to be reset.

Since my OSD servers don't run anything else, the simplest solution for me is to turn off socket buffer autotuning (net/ipv4/tcp_moderate_rcvbuf) and set the default socket buffer size to something reasonable. 256k seems to be working well for me right now.

-- Jim
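For readers unfamiliar with the fix under discussion: on Linux, explicitly setting SO_RCVBUF disables receive-buffer autotuning for that socket (the kernel also doubles the requested value to cover bookkeeping overhead), which is exactly what stops throttled readers from consuming all of tcp_mem. A minimal sketch of the idea, not the actual wip-tcp code:

    #include <sys/socket.h>
    #include <cstdio>

    // Cap a connection's receive buffer; once SO_RCVBUF is set
    // explicitly, the kernel stops autotuning this socket's buffer.
    static int cap_rcvbuf(int fd, int bytes = 256 * 1024) {
      if (::setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                       &bytes, sizeof(bytes)) < 0) {
        std::perror("setsockopt(SO_RCVBUF)");
        return -1;
      }
      return 0;
    }

The /proc-level equivalent Jim describes is setting net/ipv4/tcp_moderate_rcvbuf to 0 and lowering the default socket buffer size system-wide.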
Re: Ceph Rpm Packages for Fedora 18
Thanks Gary.

On Wed, Feb 20, 2013 at 10:34 PM, Gary Lowell gary.low...@inktank.com wrote:

Hi Kiran - The Ceph 0.56.3 (Bobtail) release includes Fedora 18 RPMs. You can find those at: http://www.ceph.com/rpm-bobtail/fc18/x86_64/

Cheers,
Gary

[...]
Re: rbd export speed limit
On Wed, Feb 13, 2013 at 12:22 AM, Stefan Priebe s.pri...@profihost.ag wrote:

Hi, is there a speed-limit option for rbd export? Right now I'm able to produce several SLOW requests for important, valid requests while merely exporting a snapshot that is not really important. rbd export runs at 2400MB/s and each OSD at 250MB/s, so it seems to block valid normal read/write operations.

Greets,
Stefan

I can confirm this in one specific case: when 0.56.2 and 0.56.3 coexist for a long time, nodes running the newer version can produce such warnings at the beginning of exporting huge snapshots, though not during the entire export. And there is real impact on clients; for example, I can see messages from the watchdog in the KVM guests. For now, I will throttle the export's input as a temporary workaround.
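As a sketch of such input throttling (assuming the rbd build supports exporting to stdout via '-' and that the pv utility is installed; pool/image names are examples):

    # Limit the export stream to ~50 MB/s so client I/O keeps priority.
    rbd export rbd/vm-image@nightly - | pv -L 50m > vm-image.img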