[ceph-users] filesystem fragmentation on ext4 OSD
Hi, after running Ceph for a while I see a lot of fragmented files on our OSD filesystems (all running ext4). For example:

itchy ~ # fsck -f /srv/ceph/osd/ceph-5
fsck from util-linux 2.22.2
e2fsck 1.42 (29-Nov-2011)
[...]
/dev/mapper/vgosd00-ceph--osd00: 461903/418119680 files (33.7% non-contiguous), 478239460/836229120 blocks

This is an unusually high value for ext4; the normal expectation is something in the 5% range. I suspect that such heavy fragmentation produces lots of unnecessary seeks on the disks. Does anyone have an idea how to make Ceph fragment an OSD filesystem less? TIA Christian -- Dipl.-Inf. Christian Kauhaus · k...@gocept.com · systems administration gocept gmbh co. kg · Forsterstraße 29 · 06112 Halle (Saale) · Germany http://gocept.com · tel +49 345 219401-11 Python, Pyramid, Plone, Zope · consulting, development, hosting, operations
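filefrag and e4defrag (both from e2fsprogs) give a per-file view of the fragmentation that fsck only reports in aggregate. A sketch, assuming /srv/ceph/osd/ceph-5 is the mounted OSD data directory:

# extent counts for a handful of object files; many extents per 4 MB object means heavy fragmentation
find /srv/ceph/osd/ceph-5/current -type f | head -20 | xargs filefrag
# read-only fragmentation score for the whole mount
e4defrag -c /srv/ceph/osd/ceph-5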
[ceph-users] RBD Caching - How to enable?
Hi, I've got a few VMs in Ceph RBD that are running very slowly - presumably down to a backfill after increasing the pg_num of a big pool. Would RBD caching resolve that issue? If so, how do I enable it? The documentation states that setting rbd cache = true in [global] enables it, but doesn't elaborate on whether you need to restart any Ceph processes. Is that literally all that is needed or is there more to it than that? -- Best regards Graeme
Re: [ceph-users] RBD Caching - How to enable?
> The documentation states that setting rbd cache = true in [global] enables it, but doesn't elaborate on whether you need to restart any Ceph processes
It's on the client side! (so no need to restart the Ceph daemons)
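In practice that means the [client] section of the ceph.conf read by librbd, i.e. by qemu on the compute node. A minimal sketch; the sizes are simply the library defaults, shown for illustration:

[client]
rbd cache = true
rbd cache size = 33554432        # bytes (32 MB, the default)
rbd cache max dirty = 25165824   # bytes (24 MB, the default)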
Re: [ceph-users] RBD Caching - How to enable?
> OK, so I need to change ceph.conf on the compute nodes?
yes.
> Do the VMs using RBD images need to be restarted at all?
I think yes.
> Anything changed in the virsh XML for the nodes?
you need to add cache=writeback for your disks. If you use qemu 1.2 (or later), there is no need to add rbd cache = true to ceph.conf :) http://ceph.com/docs/next/rbd/qemu-rbd/
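On the virsh XML question, the relevant bit is cache='writeback' on the disk's driver element. A rough sketch of an rbd disk stanza; the pool/image name, monitor address and secret uuid are placeholders, not values from this thread:

<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='writeback'/>
  <source protocol='rbd' name='rbd/vm-disk-1'>
    <host name='192.168.0.10' port='6789'/>
  </source>
  <auth username='libvirt'>
    <secret type='ceph' uuid='...'/>
  </auth>
  <target dev='vda' bus='virtio'/>
</disk>

With QEMU 1.2 or later this cache mode is passed through to librbd, which is why the extra ceph.conf entry becomes optional.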
[ceph-users] radosgw machines virtualization
Hi Ceph Users, What do you think about virtualizing the radosgw machines? Does anybody have production-level experience with such an architecture? -- Regards Dominik
Re: [ceph-users] radosgw machines virtualization
Hi, Our three radosgw's are OpenStack VMs. Seems to work for our (limited) testing, and I don't see a reason why it shouldn't work. Cheers, Dan -- Dan van der Ster || Data Storage Services || CERN IT Department
Re: [ceph-users] RBD Caching - How to enable?
On Thu, Feb 6, 2014 at 12:11 PM, Alexandre DERUMIER aderum...@odiso.com wrote:
> > Do the VMs using RBD images need to be restarted at all?
> I think yes.
In our case, we had to restart the hypervisor qemu-kvm process to enable caching. Cheers, Dan
[ceph-users] rbd-fuse rbd_list: error %d Numerical result out of range
Hi all, Can anyone advise what the problem below is with rbd-fuse? From http://mail.blameitonlove.com/lists/ceph-devel/msg14723.html it looks like this has happened before but should've been fixed way before now?

rbd-fuse -d -p libvirt-pool -c /etc/ceph/ceph.conf ceph
FUSE library version: 2.8.6
nullpath_ok: 0
unique: 1, opcode: INIT (26), nodeid: 0, insize: 56
INIT: 7.17
flags=0x047b
max_readahead=0x0002
INIT: 7.12
flags=0x0031
max_readahead=0x0002 max_write=0x0002
unique: 1, success, outsize: 40
unique: 2, opcode: GETATTR (3), nodeid: 1, insize: 56
getattr /
rbd_list: error %d : Numerical result out of range
unique: 2, success, outsize: 120
unique: 3, opcode: GETATTR (3), nodeid: 1, insize: 56
getattr /
rbd_list: error %d : Numerical result out of range
unique: 3, success, outsize: 120
unique: 4, opcode: ACCESS (34), nodeid: 1, insize: 48
unique: 4, error: -38 (Function not implemented), outsize: 16
unique: 5, opcode: OPENDIR (27), nodeid: 1, insize: 48
opendir flags: 0x98800 /
rbd_list: error %d : Numerical result out of range
opendir[0] flags: 0x98800 /
unique: 5, success, outsize: 32
unique: 6, opcode: READDIR (28), nodeid: 1, insize: 80
readdir[0] from 0
unique: 6, success, outsize: 80
unique: 7, opcode: READDIR (28), nodeid: 1, insize: 80
unique: 7, success, outsize: 16
unique: 8, opcode: RELEASEDIR (29), nodeid: 1, insize: 64
releasedir[0] flags: 0x0
unique: 8, success, outsize: 16
unique: 9, opcode: OPENDIR (27), nodeid: 1, insize: 48
opendir flags: 0x98800 /
rbd_list: error %d : Numerical result out of range
opendir[0] flags: 0x98800 /
unique: 9, success, outsize: 32
unique: 10, opcode: READDIR (28), nodeid: 1, insize: 80
readdir[0] from 0
unique: 10, success, outsize: 80
unique: 11, opcode: GETATTR (3), nodeid: 1, insize: 56
getattr /
rbd_list: error %d : Numerical result out of range
unique: 11, success, outsize: 120
unique: 12, opcode: GETXATTR (22), nodeid: 1, insize: 65
getxattr / security.selinux 255
unique: 12, success, outsize: 16
unique: 13, opcode: GETXATTR (22), nodeid: 1, insize: 72
getxattr / system.posix_acl_access 0
unique: 13, success, outsize: 24
unique: 14, opcode: GETXATTR (22), nodeid: 1, insize: 73
getxattr / system.posix_acl_default 0
unique: 14, success, outsize: 24
unique: 15, opcode: READDIR (28), nodeid: 1, insize: 80
unique: 15, success, outsize: 16
unique: 16, opcode: RELEASEDIR (29), nodeid: 1, insize: 64
releasedir[0] flags: 0x0
unique: 16, success, outsize: 16
unique: 17, opcode: GETATTR (3), nodeid: 1, insize: 56
getattr /
rbd_list: error %d : Numerical result out of range
unique: 17, success, outsize: 120

-- Best regards Graeme
Re: [ceph-users] poor data distribution
Hi, Mabye this info can help to find what is wrong. For one PG (3.1e4a) which is active+remapped: { state: active+remapped, epoch: 96050, up: [ 119, 69], acting: [ 119, 69, 7], Logs: On osd.7: 2014-02-04 09:45:54.966913 7fa618afe700 1 osd.7 pg_epoch: 94460 pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=93486 n=6718 ec=4 les/c 93486/93486 94460/94460/92233) [119,69] r=-1 lpr=94460 pi=92546-94459/5 lcod 94459'207003 inactive NOTIFY] stateStart: transitioning to Stray 2014-02-04 09:45:55.781278 7fa6172fb700 1 osd.7 pg_epoch: 94461 pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=93486 n=6718 ec=4 les/c 93486/93486 94460/94461/92233) [119,69]/[119,69,7,142] r=2 lpr=94461 pi=92546-94460/6 lcod 94459'207003 remapped NOTIFY] stateStart: transitioning to Stray 2014-02-04 09:49:01.124510 7fa618afe700 1 osd.7 pg_epoch: 94495 pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=94462 n=6718 ec=4 les/c 94462/94494 94460/94495/92233) [119,69]/[119,69,7] r=2 lpr=94495 pi=92546-94494/7 lcod 94459'207003 remapped] stateStart: transitioning to Stray On osd.119: 2014-02-04 09:45:54.981707 7f37f07c5700 1 osd.119 pg_epoch: 94460 pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=93486 n=6718 ec=4 les/c 93486/93486 94460/94460/92233) [119,69] r=0 lpr=94460 pi=93485-94459/1 mlcod 0'0 inactive] stateStart: transitioning to Primary 2014-02-04 09:45:55.805712 7f37ecfbe700 1 osd.119 pg_epoch: 94461 pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=93486 n=6718 ec=4 les/c 93486/93486 94460/94461/92233) [119,69]/[119,69,7,142] r=0 lpr=94461 pi=93485-94460/2 mlcod 0'0 remapped] stateStart: transitioning to Primary 2014-02-04 09:45:56.794015 7f37edfc0700 0 log [INF] : 3.1e4a restarting backfill on osd.69 from (0'0,0'0] MAX to 94459'207004 2014-02-04 09:49:01.156627 7f37ef7c3700 1 osd.119 pg_epoch: 94495 pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=94462 n=6718 ec=4 les/c 94462/94494 94460/94495/92233) [119,69]/[119,69,7] r=0 lpr=94495 pi=94461-94494/1 mlcod 0'0 remapped] stateStart: transitioning to Primary On osd.69: 2014-02-04 09:45:56.845695 7f2231372700 1 osd.69 pg_epoch: 94462 pg[3.1e4a( empty local-les=0 n=0 ec=4 les/c 93486/93486 94460/94461/92233) [119,69]/[119,69,7,142] r=1 lpr=94462 pi=93485-94460/2 inactive] stateStart: transitioning to Stray 2014-02-04 09:49:01.153695 7f2229b63700 1 osd.69 pg_epoch: 94495 pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=94462 n=6718 ec=4 les/c 94462/94494 94460/94495/92233) [119,69]/[119,69,7] r=1 lpr=94495 pi=93485-94494/3 remapped] stateStart: transitioning to Stray pq query recovery state: recovery_state: [ { name: Started\/Primary\/Active, enter_time: 2014-02-04 09:49:02.070724, might_have_unfound: [], recovery_progress: { backfill_target: -1, waiting_on_backfill: 0, backfill_pos: 0\/\/0\/\/-1, backfill_info: { begin: 0\/\/0\/\/-1, end: 0\/\/0\/\/-1, objects: []}, peer_backfill_info: { begin: 0\/\/0\/\/-1, end: 0\/\/0\/\/-1, objects: []}, backfills_in_flight: [], pull_from_peer: [], pushing: []}, scrub: { scrubber.epoch_start: 77502, scrubber.active: 0, scrubber.block_writes: 0, scrubber.finalizing: 0, scrubber.waiting_on: 0, scrubber.waiting_on_whom: []}}, { name: Started, enter_time: 2014-02-04 09:49:01.156626}]} --- Regards Dominik 2014-02-04 12:09 GMT+01:00 Dominik Mostowiec dominikmostow...@gmail.com: Hi, Thanks for Your help !! 
We've done again 'ceph osd reweight-by-utilization 105' Cluster stack on 10387 active+clean, 237 active+remapped; More info in attachments. -- Regards Dominik 2014-02-04 Sage Weil s...@inktank.com: Hi, I spent a couple hours looking at your map because it did look like there was something wrong. After some experimentation and adding a bucnh of improvements to osdmaptool to test the distribution, though, I think everything is working as expected. For pool 3, your map has a standard deviation in utilizations of ~8%, and we should expect ~9% for this number of PGs. For all pools, it is slightly higher (~9% vs expected ~8%). This is either just in the noise, or slightly confounded by the lack of the hashpspool flag on the pools (which slightly amplifies placement nonuniformity with multiple pools... not enough that it is worth changing anything though). The bad news is that that order of standard deviation results in pretty wide min/max range of 118 to 202 pgs. That seems a *bit* higher than we a perfectly random placement generates (I'm seeing a spread in that is usually 50-70 pgs), but I think *that* is where the pool overlap (no hashpspool) is rearing its head;
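(The per-PG output quoted above comes from querying the PG directly. For anyone collecting the same data for other stuck PGs, the usual commands are below; 3.1e4a is just the PG from this mail.)

ceph pg dump_stuck unclean   # list PGs stuck in states such as active+remapped
ceph pg map 3.1e4a           # show the up and acting OSD sets
ceph pg 3.1e4a query         # full peering/recovery state (the JSON quoted above)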
Re: [ceph-users] filesystem fragmentation on ext4 OSD
On 02/06/2014 04:17 AM, Christian Kauhaus wrote:
> Does anyone have an idea how to make Ceph fragment an OSD filesystem less?
Hi Christian, can you tell me a little bit about how you are using Ceph and what kind of IO you are doing?
[ceph-users] Kernel rbd cephx signatures
Hi, I have to open our Ceph cluster to some clients that only support kernel rbd. In general that's no problem and works just fine (verified in our test cluster ;-) ). I then tried to map images from our production cluster and failed:

rbd: add failed: (95) Operation not supported

After some testing and comparing the test and production clusters, it turned out that the config option that prevents the kernel client from mapping the image is

cephx require signatures = true

If I read the documentation (http://ceph.com/docs/master/rados/operations/authentication/#backward-compatibility) correctly, that flag is recommended, which leads to two questions:
1. When will cephx signatures make it into kernel rbd? (It's not there as of at least 3.12.0 and I've found no reference in the changelogs of subsequent versions.)
2. As I have to assess the risk of disabling cephx signatures, do you have an estimate of how probable a real-life attack is, i.e. is there a real threat to the whole infrastructure, or is it only possible to disturb the communication of exactly that client into whose communication malicious messages are injected?
Thanks a lot for your help, best regards, Kurt
PS: If my conclusion is correct, maybe this should be mentioned somewhere at http://ceph.com/docs/master/rbd/rbd-ko/ -- Kurt Bauer kurt.ba...@univie.ac.at Vienna University Computer Center - ACOnet - VIX Universitaetsstrasse 7, A-1010 Vienna, Austria, Europe Tel: ++43 1 4277 - 14070 (Fax: - 814070) KB1970-RIPE
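For context, the option in question lives in [global]. A sketch of the trade-off Kurt is weighing, not a recommendation to disable it:

[global]
# recommended: require signed messages (but kernel rbd up to at least 3.12 cannot map images then)
cephx require signatures = true
# relaxed: lets non-signing clients such as kernel rbd connect, giving up message-authenticity protection
# cephx require signatures = false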
Re: [ceph-users] poor data distribution
Hi, Just an update here. Another user saw this and after playing with it I identified a problem with CRUSH. There is a branch outstanding (wip-crush) that is pending review, but it's not a quick fix because of compatibility issues. sage
On Thu, 6 Feb 2014, Dominik Mostowiec wrote:
> Hi, Maybe this info can help to find what is wrong. For one PG (3.1e4a) which is active+remapped: [... quoted PG logs snipped; see the earlier message in this thread]
Re: [ceph-users] poor data distribution
Hi, Thanks !! Can You suggest any workaround for now? -- Regards Dominik
2014-02-06 18:39 GMT+01:00 Sage Weil s...@inktank.com:
> Just an update here. Another user saw this and after playing with it I identified a problem with CRUSH. There is a branch outstanding (wip-crush) that is pending review, but it's not a quick fix because of compatibility issues. [... rest of quoted thread snipped]
[ceph-users] OSD block device performance
Hey all, I'm currently poring over the Ceph docs trying to familiarize myself with the product before I begin my cluster build-out for a virtualized environment. One area I've been looking into is disk throughput/performance. I stumbled onto the following site: http://www.sebastien-han.fr/blog/2012/08/26/ceph-benchmarks/
1) I'm not sure where the info below originates, as I did not see it on the Ceph doc site unless it is hidden in some dark corner somewhere. Can anyone point me to a wiki/url?
2) Can someone describe this 50/50 split of journal vs filesystem (I assume it has something to do with filestore flush)?

Quoting the article: Consideration about Ceph's journal. The journal is by design the component that could be severely and easily improved. Take a little step back over it. As a reminder, Ceph's journal serves 2 purposes:
* It acts as a buffer cache (FIFO buffer). The journal takes every request and performs each write with O_DIRECT. After a determined period and acknowledgment, the journal flushes its content to the backend filesystem. By default this period is 5 seconds and is called filestore max sync interval. The filestore starts to flush when the journal is half-full or max sync interval is reached.
* Failure coverage: pending writes are handled by the journal if not yet committed to the backend filesystem.
The journal can operate in 2 modes called parallel and writeahead; the mode is automatically detected according to the file system used by the OSD backend storage. The parallel mode is only supported by Btrfs. In practice, a common gigabit network can write 100 MB/sec. Let's say your journal and your backend storage are on the same disk, and this disk has a write speed of 100 MB/sec. With the default writeahead mode, the write speed will be split after 5 seconds (the default duration after which the journal starts to flush to the backend filesystem). The first 5 sec write at 100 MB/sec; after that, writes are split like so:
* 50 MB/sec for the journal
* 50 MB/sec for the backend filesystem
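For reference, the knobs that article describes are ordinary [osd] settings. A sketch with the stock values spelled out; the journal size line is an example sizing, not a default:

[osd]
filestore max sync interval = 5     # seconds; default - how long the filestore waits before flushing the journal to the data filesystem
filestore min sync interval = 0.01  # seconds; default
osd journal size = 10240            # MB; example sizing, roughly 2 x (expected throughput x filestore max sync interval)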
Re: [ceph-users] OSD block device performance
Hi John, The 50/50 thing comes from the way the Ceph OSD writes data twice: first to the journal, and then subsequently to the data partition. The write doubling may not affect your performance outcome, depending on the ratio of drive bandwidth to network bandwidth and the I/O pattern. In configurations where it is an issue, the way to improve performance is to use an SSD for journals (Sebastien mentions this in his article under Commodity improved). The journal is an area of quite some flexibility, the relevant settings are in the docs here: http://ceph.com/docs/master/rados/configuration/journal-ref/ http://ceph.com/docs/master/rados/configuration/osd-config-ref/#journal-settings There is some discussion of the use of SSDs with Ceph here: http://ceph.com/docs/master/start/hardware-recommendations/#solid-state-drives I'm sure others on this list will have more empirical information about their experiences in this area. Cheers, John
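To make the SSD-journal option concrete: the journal can be pointed at a separate device either when the OSD is created or via ceph.conf. The device names below are placeholders:

# ceph-deploy form: data disk and journal partition given as host:data:journal
ceph-deploy osd create node1:/dev/sdb:/dev/sdf1

# or, per OSD in ceph.conf:
[osd.0]
osd journal = /dev/sdf1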
Re: [ceph-users] filesystem fragmentation on ext4 OSD
On 06.02.2014 16:24, Mark Nelson wrote:
> Hi Christian, can you tell me a little bit about how you are using Ceph and what kind of IO you are doing?
Sure. We're using it almost exclusively for serving VM images that are accessed from Qemu's built-in RBD client. The VMs themselves perform a very wide range of I/O types, from servers that write mainly log files to ZEO database servers with nearly completely random I/O. Many VMs have slowly increasing storage utilization. A reason could be that the OSDs issue syncfs() calls and ext4 allocates extents only for what has been written so far. But I'm not sure about the exact pattern of OSD/filesystem interaction. HTH Christian -- Dipl.-Inf. Christian Kauhaus · k...@gocept.com · systems administration gocept gmbh co. kg · Forsterstraße 29 · 06112 Halle (Saale) · Germany http://gocept.com · tel +49 345 219401-11 Python, Pyramid, Plone, Zope · consulting, development, hosting, operations
Re: [ceph-users] poor data distribution
Great! Thanks for Your help. -- Regards Dominik
2014-02-06 21:10 GMT+01:00 Sage Weil s...@inktank.com:
> On Thu, 6 Feb 2014, Dominik Mostowiec wrote:
> > Hi, Thanks !! Can You suggest any workaround for now?
> You can adjust the crush weights on the overfull nodes slightly. You'd need to do it by hand, but that will do the trick. For example, ceph osd crush reweight osd.123 .96 (if the current weight is 1.0). sage [... rest of quoted thread snipped]
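A sketch of that by-hand workaround end to end; osd.123 and 0.96 are Sage's example values, and the inspection commands are just one way of spotting overfull OSDs:

ceph osd tree                          # current crush weights
ceph pg dump osds                      # per-OSD usage statistics
ceph osd crush reweight osd.123 0.96   # nudge an overfull OSD down a few percent
ceph -w                                # data starts rebalancing immediately; watch it here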
Re: [ceph-users] RBD Caching - How to enable?
Does anybody else think there is a problem with the docs/settings here...
On Thu, 06 Feb 2014 12:11:53 +0100, Alexandre DERUMIER aderum...@odiso.com wrote:
> > Anything changed in the virsh XML for the nodes?
> you need to add cache=writeback for your disks. If you use qemu 1.2, no need to add rbd cache = true to ceph.conf :) http://ceph.com/docs/next/rbd/qemu-rbd/
This page reads "If you set rbd_cache=true, you must set cache=writeback or risk data loss." ... That's an inverted definition of writeback AFAIK! -- Cheers, ~Blairo
[ceph-users] Crush Maps
I have a test cluster that is up and running. It consists of three mons and three OSD servers, with each OSD server having eight OSDs and two SSDs for journals. I'd like to move from the flat crushmap to a crushmap with typical depth using most of the predefined types. I have the current crushmap decompiled and have edited it to add the additional depth of failure zones. Questions:
1) Do the IDs of the buckets need to be consecutive, or can I make them up as long as they are negative in value and unique?
2) Is there any way that I can control the assignment of the bucket IDs if I were to update the crushmap on a running system using the CLI?
3) Is there any harm in adding buckets that are not currently used, but assigning them a weight of 0 so they aren't used (a row defined, with racks, but the racks have no hosts defined)?
4) Can I have a bucket with no item lines in it, or does each bucket need at least one item declaration to be valid?
Example:

# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20
device 21 osd.21
device 22 osd.22
device 23 osd.23

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 root

# buckets
host spucosds01 {
    id -2    # do not change unnecessarily
    # weight 29.120
    alg straw
    hash 0   # rjenkins1
    item osd.0 weight 3.640
    item osd.1 weight 3.640
    item osd.2 weight 3.640
    item osd.3 weight 3.640
    item osd.4 weight 3.640
    item osd.5 weight 3.640
    item osd.6 weight 3.640
    item osd.7 weight 3.640
}
host spucosds02 {
    id -3    # do not change unnecessarily
    # weight 29.120
    alg straw
    hash 0   # rjenkins1
    item osd.8 weight 3.640
    item osd.9 weight 3.640
    item osd.10 weight 3.640
    item osd.11 weight 3.640
    item osd.12 weight 3.640
    item osd.13 weight 3.640
    item osd.14 weight 3.640
    item osd.15 weight 3.640
}
host spucosds03 {
    id -4    # do not change unnecessarily
    # weight 29.120
    alg straw
    hash 0   # rjenkins1
    item osd.16 weight 3.640
    item osd.17 weight 3.640
    item osd.18 weight 3.640
    item osd.19 weight 3.640
    item osd.20 weight 3.640
    item osd.21 weight 3.640
    item osd.22 weight 3.640
    item osd.23 weight 3.640
}
rack rack2-2 {
    id -220
    alg straw
    hash 0
    item spucosds01 weight 29.12
}
rack rack3-2 {
    id -230
    alg straw
    hash 0
    item spucosds02 weight 29.12
}
rack rack4-2 {
    id -240
    alg straw
    hash 0
    item spucosds03 weight 29.12
}
row row1 {
    id -100
    alg straw
    hash 0
}
row row2 {
    id -200
    alg straw
    hash 0
    item rack2-2 weight 29.12
    item rack3-2 weight 29.12
    item rack4-2 weight 29.12
}
datacenter smt {
    id -1000
    alg straw
    hash 0
    item row1 weight 0.0
    item row2 weight 87.36
}
root default {
    id -1    # do not change unnecessarily
    # weight 87.360
    alg straw
    hash 0   # rjenkins1
    item smt weight 87.36
}

# rules
rule data {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule metadata {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule rbd {
    ruleset 2
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
# end crush map
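An alternative to hand-picking IDs in the decompiled map is to build the hierarchy on the live cluster with the CLI, in which case the monitors assign the bucket IDs themselves. A sketch using the names from the map above:

ceph osd crush add-bucket smt datacenter
ceph osd crush add-bucket row2 row
ceph osd crush add-bucket rack2-2 rack
ceph osd crush move smt root=default
ceph osd crush move row2 datacenter=smt
ceph osd crush move rack2-2 row=row2
ceph osd crush move spucosds01 rack=rack2-2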
Re: [ceph-users] RGW Replication
On 2/4/14 17:06 , Craig Lewis wrote: Now that I've started seeing missing objects, I'm not able to download objects that should be on the slave if replication is up to date. Either it's not up to date, or it's skipping objects every pass. Using my --max-entries fix (https://github.com/ceph/radosgw-agent/pull/8), I think I see what's happening. Shut down replication Upload 6 objects to an empty bucket on the master: 2014-02-07 02:0210k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test0.jpg 2014-02-07 02:0210k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test1.jpg 2014-02-07 02:0210k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test2.jpg 2014-02-07 02:0210k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test3.jpg 2014-02-07 02:0310k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test4.jpg 2014-02-07 02:0310k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test5.jpg None show on the slave, because replication is down. Start radosgw-agent --max-entries=2 (1 doesn't seem to replicate anything) Check contents of slave after pass #1: 2014-02-07 02:0210k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test0.jpg Check contents of slave after pass #10: 2014-02-07 02:0210k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test0.jpg Leave replication running Upload 1 object, test6.jpg, to the master. Check the master: 2014-02-07 02:0210k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test0.jpg 2014-02-07 02:0210k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test1.jpg 2014-02-07 02:0210k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test2.jpg 2014-02-07 02:0210k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test3.jpg 2014-02-07 02:0310k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test4.jpg 2014-02-07 02:0310k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test5.jpg 2014-02-07 02:0610k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test6.jpg Check contents of slave after next pass: 2014-02-07 02:0210k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test0.jpg 2014-02-07 02:0210k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test1.jpg Upload another file, test7.jpg, to the master: 2014-02-07 02:0210k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test0.jpg 2014-02-07 02:0210k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test1.jpg 2014-02-07 02:0210k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test2.jpg 2014-02-07 02:0210k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test3.jpg 2014-02-07 02:0310k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test4.jpg 2014-02-07 02:0310k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test5.jpg 2014-02-07 02:0610k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test6.jpg 2014-02-07 02:0810k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test7.jpg The slave doesn't get it this time: 2014-02-07 02:0210k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test0.jpg 2014-02-07 02:0210k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test1.jpg Upload another file, test8.jpg, to the master: 2014-02-07 02:0210k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test0.jpg 2014-02-07 02:0210k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test1.jpg 2014-02-07 02:0210k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test2.jpg 2014-02-07 02:0210k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test3.jpg 2014-02-07 02:0310k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test4.jpg 2014-02-07 02:0310k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test5.jpg 2014-02-07 02:0610k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test6.jpg 2014-02-07 02:0810k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test7.jpg 2014-02-07 
02:1010k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test8.jpg The slave gets the 3rd file: 2014-02-07 02:0210k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test0.jpg 2014-02-07 02:0210k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test1.jpg 2014-02-07 02:0210k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test2.jpg So I think the problem is caused by the shard marker being set to the current marker after every pass, even if the bucket replication caps on max-entries. Updating the shard marker by uploading a file causes another pass on the bucket, and the bucket marker is being tracked correctly. I would prefer to track the shard marker better, but I don't see any way to get the last shard marker given the last bucket entry. If I track the shard marker correctly, then the stats I'm generating are still somewhat useful (if incomplete). I'll be able to see when replication falls behind because the graphs keep growing. The alternative is to change the bucket sync so that it loops until
Re: [ceph-users] Crush Maps
Hello Bradley, additionally to your questions, I'm interested in the following:
5) Can I change all 'type' IDs, because I want to add a new type host-slow to distinguish between OSDs with the journal on the same HDD and OSDs with the journal on a separate SSD? E.g. from
type 0 osd
type 1 host
type 2 rack
..
to
type 0 osd
type 1 host
type 2 host-slow
type 3 rack
..
6) After importing the crush map into the cluster, how can I start rebalancing all existing pools? (This is because all OSDs are now moved to other locations in the crush hierarchy.)
best regards Danny
From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of McNamara, Bradley
> 1) Do the IDs of the buckets need to be consecutive, or can I make them up as long as they are negative in value and unique?
> 2) Is there any way that I can control the assignment of the bucket IDs if I were to update the crushmap on a running system using the CLI?
> 3) Is there any harm in adding buckets that are not currently used, but assigning them a weight of 0 so they aren't used (a row defined, with racks, but the racks have no hosts defined)?
> 4) Can I have a bucket with no item lines in it, or does each bucket need at least one item declaration to be valid?
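On question 6: no separate rebalance command is needed; as soon as the new map is injected, the OSDs start moving data to match it. The usual edit/inject cycle, with arbitrary file names:

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# ... edit crushmap.txt ...
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new
ceph -w   # watch the remapping and backfill progress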
[ceph-users] RBD+KVM problems with sequential read
Hi All. Hosts: Dell R815x5, 128 GB RAM, 25 OSD + 5 SSD(journal+system). Network: 2x10Gb+LACP Kernel: 2.6.32 QEMU emulator version 1.4.2, Copyright (c) 2003-2008 Fabrice Bellard POOLs: root@kvm05:~# ceph osd dump | grep 'rbd' pool 5 'rbd' rep size 2 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 1400 pgp_num 1400 last_change 12550 owner 0 --- root@kvm05:~# ceph osd dump | grep 'test' pool 32 'test' rep size 2 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 1400 pgp_num 1400 last_change 12655 owner 0 root@kvm01:~# ceph -v ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60) -- root@kvm01:~# rados bench -p test 120 write --no-cleanup Total time run: 120.125225 Total writes made: 11519 Write size: 4194304 Bandwidth (MB/sec): 383.566 Stddev Bandwidth: 36.2022 Max bandwidth (MB/sec): 408 Min bandwidth (MB/sec): 0 Average Latency:0.166819 Stddev Latency: 0.0553357 Max latency:1.60795 Min latency:0.044263 -- root@kvm01:~# rados bench -p test 120 seq Total time run:67.271769 Total reads made: 11519 Read size:4194304 Bandwidth (MB/sec):684.923 Average Latency: 0.0933579 Max latency: 0.808438 Min latency: 0.018063 --- [root@cephadmin cluster]# cat ceph.conf [global] fsid = 43a571a9-b3e8-4dc9-9200-1f3904e1e12a initial_members = kvm01, kvm02, kvm03 mon_host = 192.168.100.1,192.168.100.2, 192.168.100.3 auth_supported = cephx public network = 192.168.100.0/24 cluster_network = 192.168.101.0/24 [osd] osd journal size = 12500 osd mkfs type = xfs osd mkfs options xfs = -f -i size=2048 osd mount options xfs = rw,noatime,inode64,logbsize=256k,delaylog osd op threads = 10 osd disk threads = 10 osd max backfills = 2 osd recovery max active = 1 filestore op threads = 64 filestore xattr use omap = true [client] rbd cache = true rbd cache size = 134217728 rbd cache max dirty = 0 [mon.kvm01] host = kvm01 mon addr = 192.168.100.1:6789 [mon.kvm02] host = kvm02 mon addr = 192.168.100.2:6789 [mon.kvm03] host = kvm03 mon addr = 192.168.100.3:6789 [osd.0] public addr = 192.168.100.1 cluster addr = 192.168.101.1 [osd.1] public addr = 192.168.100.1 cluster addr = 192.168.101.1 [osd.2] public addr = 192.168.100.1 cluster addr = 192.168.101.1 [osd.3] public addr = 192.168.100.1 cluster addr = 192.168.101.1 [osd.4] public addr = 192.168.100.1 cluster addr = 192.168.101.1 [osd.5] public addr = 192.168.100.2 cluster addr = 192.168.101.2 [osd.6] public addr = 192.168.100.2 cluster addr = 192.168.101.2 [osd.7] public addr = 192.168.100.2 cluster addr = 192.168.101.2 [osd.8] public addr = 192.168.100.2 cluster addr = 192.168.101.2 [osd.9] public addr = 192.168.100.2 cluster addr = 192.168.101.2 [osd.10] public addr = 192.168.100.3 cluster addr = 192.168.101.3 [osd.11] public addr = 192.168.100.3 cluster addr = 192.168.101.3 [osd.12] public addr = 192.168.100.3 cluster addr = 192.168.101.3 [osd.13] public addr = 192.168.100.3 cluster addr = 192.168.101.3 [osd.14] public addr = 192.168.100.3 cluster addr = 192.168.101.3 [osd.15] public addr = 192.168.100.4 cluster addr = 192.168.101.4 [osd.16] public addr = 192.168.100.4 cluster addr = 192.168.101.4 [osd.17] public addr = 192.168.100.4 cluster addr = 192.168.101.4 [osd.18] public addr = 192.168.100.4 cluster addr = 192.168.101.4 [osd.19] public addr = 192.168.100.4 cluster addr = 192.168.101.4 [osd.20] public addr = 192.168.100.5 cluster addr = 192.168.101.5 [osd.21] public addr = 192.168.100.5 cluster addr = 192.168.101.5 [osd.22] public addr = 192.168.100.5 cluster addr = 192.168.101.5 [osd.23] public addr = 192.168.100.5 cluster addr = 192.168.101.5 
[osd.24] public addr = 192.168.100.5 cluster addr = 192.168.101.5 --- [root@cephadmin ~]# cat crushd # begin crush map tunable choose_local_tries 0 tunable choose_local_fallback_tries 0 tunable choose_total_tries 50 tunable chooseleaf_descend_once 1 # devices device 0 osd.0 device 1 osd.1 device 2 osd.2 device 3 osd.3 device 4 osd.4 device 5 osd.5 device 6 osd.6 device 7 osd.7 device 8 osd.8 device 9 osd.9 device 10 osd.10 device 11 osd.11 device 12 osd.12 device 13 osd.13 device 14 osd.14 device 15 osd.15 device 16 osd.16 device 17 osd.17 device 18 osd.18 device 19 osd.19 device