[ceph-users] filesystem fragmentation on ext4 OSD

2014-02-06 Thread Christian Kauhaus
Hi,

after running Ceph for a while I see a lot of fragmented files on our OSD
filesystems (all running ext4). For example:

itchy ~ # fsck -f /srv/ceph/osd/ceph-5
fsck von util-linux 2.22.2
e2fsck 1.42 (29-Nov-2011)
[...]
/dev/mapper/vgosd00-ceph--osd00: 461903/418119680 files (33.7%
non-contiguous), 478239460/836229120 blocks

This is an unusually high value for ext4; the normal expectation is something
in the 5% range. I suspect that such high fragmentation produces lots of
unnecessary seeks on the disks.
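An overall picture of where the fragmentation sits can also be taken per directory with e4defrag's report mode (illustrative command only, using the same mount point as above):

itchy ~ # e4defrag -c /srv/ceph/osd/ceph-5

which prints extent counts and a fragmentation score without actually defragmenting anything.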

Does anyone have an idea how to make Ceph fragment an OSD filesystem less?

TIA

Christian

-- 
Dipl.-Inf. Christian Kauhaus  · k...@gocept.com · systems administration
gocept gmbh & co. kg · Forsterstraße 29 · 06112 Halle (Saale) · Germany
http://gocept.com · tel +49 345 219401-11
Python, Pyramid, Plone, Zope · consulting, development, hosting, operations
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD Caching - How to enable?

2014-02-06 Thread Graeme Lambert

Hi,

I've got a few VMs in Ceph RBD that are running very slowly - presumably 
down to a backfill after increasing the pg_num of a big pool.


Would RBD caching resolve that issue?  If so, how do I enable it? The 
documentation states that setting rbd cache = true in [global] enables 
it, but doesn't elaborate on whether you need to restart any Ceph 
processes.  Is that literally all that is needed or is there more to it 
than that?

--

Best regards

Graeme

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD Caching - How to enable?

2014-02-06 Thread Alexandre DERUMIER
> The documentation states that setting rbd cache = true in [global] enables
> it, but doesn't elaborate on whether you need to restart any Ceph processes

It's on the client side ! (so no need to restart ceph daemons)




- Original Message - 

From: Graeme Lambert glamb...@adepteo.net 
To: ceph-users@lists.ceph.com 
Sent: Thursday, 6 February 2014 11:43:56 
Subject: [ceph-users] RBD Caching - How to enable? 

Hi, 

I've got a few VMs in Ceph RBD that are running very slowly - presumably down 
to a backfill after increasing the pg_num of a big pool. 

Would RBD caching resolve that issue? If so, how do I enable it? The 
documentation states that setting rbd cache = true in [global] enables it, 
but doesn't elaborate on whether you need to restart any Ceph processes. Is 
that literally all that is needed or is there more to it than that? 

-- 

Best regards 

Graeme 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD Caching - How to enable?

2014-02-06 Thread Alexandre DERUMIER
> OK, so I need to change ceph.conf on the compute nodes?
yes.

> Do the VMs using RBD images need to be restarted at all?
I think yes.

> Anything changed in the virsh XML for the nodes?

you need to add cache=writeback for your disks

If you use qemu >= 1.2, there is no need to add "rbd cache = true" to ceph.conf :)
http://ceph.com/docs/next/rbd/qemu-rbd/
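For reference, a minimal sketch of both pieces (pool, image and monitor address below are placeholders, not taken from this thread). In ceph.conf on the compute node, only needed with qemu older than 1.2:

[client]
    rbd cache = true

and in the guest's virsh XML, writeback caching on the RBD disk (with cephx an <auth> element is needed as well):

<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='writeback'/>
  <source protocol='rbd' name='libvirt-pool/vm-disk-1'>
    <host name='192.168.0.1' port='6789'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>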


- Original Message - 

From: Graeme Lambert glamb...@adepteo.net 
To: Alexandre DERUMIER aderum...@odiso.com 
Cc: ceph-users@lists.ceph.com 
Sent: Thursday, 6 February 2014 12:03:00 
Subject: Re: [ceph-users] RBD Caching - How to enable? 


Hi Alexandre, 

OK, so I need to change ceph.conf on the compute nodes? Do the VMs using RBD 
images need to be restarted at all? Anything changed in the virsh XML for the 
nodes? 


Best regards 

Graeme 

On 06/02/14 10:50, Alexandre DERUMIER wrote: 



> The documentation states that setting rbd cache = true in [global] enables
> it, but doesn't elaborate on whether you need to restart any Ceph processes
It's on the client side ! (so no need to restart ceph daemons)




- Original Message - 

From: Graeme Lambert glamb...@adepteo.net To: ceph-users@lists.ceph.com 
Sent: Thursday, 6 February 2014 11:43:56 
Subject: [ceph-users] RBD Caching - How to enable? 

Hi, 

I've got a few VMs in Ceph RBD that are running very slowly - presumably down 
to a backfill after increasing the pg_num of a big pool. 

Would RBD caching resolve that issue? If so, how do I enable it? The 
documentation states that setting rbd cache = true in [global] enables it, 
but doesn't elaborate on whether you need to restart any Ceph processes. Is 
that literally all that is needed or is there more to it than that? 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] radosgw machines virtualization

2014-02-06 Thread Dominik Mostowiec
Hi Ceph Users,
What do you think about virtualizing the radosgw machines?
Does anybody have production-level experience with such an architecture?

-- 
Regards
Dominik
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw machines virtualization

2014-02-06 Thread Dan van der Ster
Hi,
Our three radosgw's are OpenStack VMs. Seems to work for our (limited)
testing, and I don't see a reason why it shouldn't work.
Cheers, Dan

-- Dan van der Ster || Data & Storage Services || CERN IT Department --


On Thu, Feb 6, 2014 at 2:12 PM, Dominik Mostowiec
dominikmostow...@gmail.com wrote:
 Hi Ceph Users,
 What do you think about virtualization of the radosgw machines?
 Have somebody a production level experience with such architecture?

 --
 Regards
 Dominik
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD Caching - How to enable?

2014-02-06 Thread Dan van der Ster
On Thu, Feb 6, 2014 at 12:11 PM, Alexandre DERUMIER aderum...@odiso.com wrote:
>> Do the VMs using RBD images need to be restarted at all?
> I think yes.

In our case, we had to restart the hypervisor qemu-kvm process to
enable caching.
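For reference, one way to confirm that a running qemu/librbd client actually picked up the setting is a client-side admin socket; a sketch, assuming the socket path is configured in [client] and qemu can write to /var/run/ceph:

[client]
    admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok

# then, against the socket of the running VM process:
ceph --admin-daemon /var/run/ceph/ceph-client.*.asok config show | grep rbd_cache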

Cheers, Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rbd-fuse rbd_list: error %d Numerical result out of range

2014-02-06 Thread Graeme Lambert

Hi all,

Can anyone advise what the problem below is with rbd-fuse?  From 
http://mail.blameitonlove.com/lists/ceph-devel/msg14723.html it looks 
like this has happened before but should've been fixed way before now?


rbd-fuse -d -p libvirt-pool -c /etc/ceph/ceph.conf ceph
FUSE library version: 2.8.6
nullpath_ok: 0
unique: 1, opcode: INIT (26), nodeid: 0, insize: 56
INIT: 7.17
flags=0x047b
max_readahead=0x0002
   INIT: 7.12
   flags=0x0031
   max_readahead=0x0002
   max_write=0x0002
   unique: 1, success, outsize: 40
unique: 2, opcode: GETATTR (3), nodeid: 1, insize: 56
getattr /
rbd_list: error %d
: Numerical result out of range
   unique: 2, success, outsize: 120
unique: 3, opcode: GETATTR (3), nodeid: 1, insize: 56
getattr /
rbd_list: error %d
: Numerical result out of range
   unique: 3, success, outsize: 120
unique: 4, opcode: ACCESS (34), nodeid: 1, insize: 48
   unique: 4, error: -38 (Function not implemented), outsize: 16
unique: 5, opcode: OPENDIR (27), nodeid: 1, insize: 48
opendir flags: 0x98800 /
rbd_list: error %d
: Numerical result out of range
   opendir[0] flags: 0x98800 /
   unique: 5, success, outsize: 32
unique: 6, opcode: READDIR (28), nodeid: 1, insize: 80
readdir[0] from 0
   unique: 6, success, outsize: 80
unique: 7, opcode: READDIR (28), nodeid: 1, insize: 80
   unique: 7, success, outsize: 16
unique: 8, opcode: RELEASEDIR (29), nodeid: 1, insize: 64
releasedir[0] flags: 0x0
   unique: 8, success, outsize: 16
unique: 9, opcode: OPENDIR (27), nodeid: 1, insize: 48
opendir flags: 0x98800 /
rbd_list: error %d
: Numerical result out of range
   opendir[0] flags: 0x98800 /
   unique: 9, success, outsize: 32
unique: 10, opcode: READDIR (28), nodeid: 1, insize: 80
readdir[0] from 0
   unique: 10, success, outsize: 80
unique: 11, opcode: GETATTR (3), nodeid: 1, insize: 56
getattr /
rbd_list: error %d
: Numerical result out of range
   unique: 11, success, outsize: 120
unique: 12, opcode: GETXATTR (22), nodeid: 1, insize: 65
getxattr / security.selinux 255
   unique: 12, success, outsize: 16
unique: 13, opcode: GETXATTR (22), nodeid: 1, insize: 72
getxattr / system.posix_acl_access 0
   unique: 13, success, outsize: 24
unique: 14, opcode: GETXATTR (22), nodeid: 1, insize: 73
getxattr / system.posix_acl_default 0
   unique: 14, success, outsize: 24
unique: 15, opcode: READDIR (28), nodeid: 1, insize: 80
   unique: 15, success, outsize: 16
unique: 16, opcode: RELEASEDIR (29), nodeid: 1, insize: 64
releasedir[0] flags: 0x0
   unique: 16, success, outsize: 16
unique: 17, opcode: GETATTR (3), nodeid: 1, insize: 56
getattr /
rbd_list: error %d
: Numerical result out of range
   unique: 17, success, outsize: 120

--

Best regards

Graeme


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] poor data distribution

2014-02-06 Thread Dominik Mostowiec
Hi,
Maybe this info can help to find what is wrong.
For one PG (3.1e4a) which is active+remapped:
{ state: active+remapped,
  epoch: 96050,
  up: [
119,
69],
  acting: [
119,
69,
7],
Logs:
On osd.7:
2014-02-04 09:45:54.966913 7fa618afe700  1 osd.7 pg_epoch: 94460
pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=93486
n=6718 ec=4 les/c 93486/93486 94460/94460/92233) [119,69] r=-1
lpr=94460 pi=92546-94459/5 lcod 94459'207003 inactive NOTIFY]
stateStart: transitioning to Stray
2014-02-04 09:45:55.781278 7fa6172fb700  1 osd.7 pg_epoch: 94461
pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=93486
n=6718 ec=4 les/c 93486/93486 94460/94461/92233)
[119,69]/[119,69,7,142] r=2 lpr=94461 pi=92546-94460/6 lcod
94459'207003 remapped NOTIFY] stateStart: transitioning to Stray
2014-02-04 09:49:01.124510 7fa618afe700  1 osd.7 pg_epoch: 94495
pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=94462
n=6718 ec=4 les/c 94462/94494 94460/94495/92233) [119,69]/[119,69,7]
r=2 lpr=94495 pi=92546-94494/7 lcod 94459'207003 remapped]
stateStart: transitioning to Stray

On osd.119:
2014-02-04 09:45:54.981707 7f37f07c5700  1 osd.119 pg_epoch: 94460
pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=93486
n=6718 ec=4 les/c 93486/93486 94460/94460/92233) [119,69] r=0
lpr=94460 pi=93485-94459/1 mlcod 0'0 inactive] stateStart:
transitioning to Primary
2014-02-04 09:45:55.805712 7f37ecfbe700  1 osd.119 pg_epoch: 94461
pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=93486
n=6718 ec=4 les/c 93486/93486 94460/94461/92233)
[119,69]/[119,69,7,142] r=0 lpr=94461 pi=93485-94460/2 mlcod 0'0
remapped] stateStart: transitioning to Primary
2014-02-04 09:45:56.794015 7f37edfc0700  0 log [INF] : 3.1e4a
restarting backfill on osd.69 from (0'0,0'0] MAX to 94459'207004
2014-02-04 09:49:01.156627 7f37ef7c3700  1 osd.119 pg_epoch: 94495
pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=94462
n=6718 ec=4 les/c 94462/94494 94460/94495/92233) [119,69]/[119,69,7]
r=0 lpr=94495 pi=94461-94494/1 mlcod 0'0 remapped] stateStart:
transitioning to Primary

On osd.69:
2014-02-04 09:45:56.845695 7f2231372700  1 osd.69 pg_epoch: 94462
pg[3.1e4a( empty local-les=0 n=0 ec=4 les/c 93486/93486
94460/94461/92233) [119,69]/[119,69,7,142] r=1 lpr=94462
pi=93485-94460/2 inactive] stateStart: transitioning to Stray
2014-02-04 09:49:01.153695 7f2229b63700  1 osd.69 pg_epoch: 94495
pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=94462
n=6718 ec=4 les/c 94462/94494 94460/94495/92233) [119,69]/[119,69,7]
r=1 lpr=94495 pi=93485-94494/3 remapped] stateStart: transitioning
to Stray

pg query recovery state:
  recovery_state: [
{ name: Started\/Primary\/Active,
  enter_time: 2014-02-04 09:49:02.070724,
  might_have_unfound: [],
  recovery_progress: { backfill_target: -1,
  waiting_on_backfill: 0,
  backfill_pos: 0\/\/0\/\/-1,
  backfill_info: { begin: 0\/\/0\/\/-1,
  end: 0\/\/0\/\/-1,
  objects: []},
  peer_backfill_info: { begin: 0\/\/0\/\/-1,
  end: 0\/\/0\/\/-1,
  objects: []},
  backfills_in_flight: [],
  pull_from_peer: [],
  pushing: []},
  scrub: { scrubber.epoch_start: 77502,
  scrubber.active: 0,
  scrubber.block_writes: 0,
  scrubber.finalizing: 0,
  scrubber.waiting_on: 0,
  scrubber.waiting_on_whom: []}},
{ name: Started,
  enter_time: 2014-02-04 09:49:01.156626}]}
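(For reference, per-PG state like the above can be regenerated at any time with:

  ceph pg 3.1e4a query

which dumps the full peering and recovery state for that PG.)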

---
Regards
Dominik

2014-02-04 12:09 GMT+01:00 Dominik Mostowiec dominikmostow...@gmail.com:
 Hi,
 Thanks for Your help !!
 We've done again 'ceph osd reweight-by-utilization 105'
 Cluster stuck at 10387 active+clean, 237 active+remapped;
 More info in attachments.

 --
 Regards
 Dominik


 2014-02-04 Sage Weil s...@inktank.com:
 Hi,

 I spent a couple hours looking at your map because it did look like there
 was something wrong.  After some experimentation and adding a bunch of
 improvements to osdmaptool to test the distribution, though, I think
 everything is working as expected.  For pool 3, your map has a standard
 deviation in utilizations of ~8%, and we should expect ~9% for this number
 of PGs.  For all pools, it is slightly higher (~9% vs expected ~8%).
 This is either just in the noise, or slightly confounded by the lack of
 the hashpspool flag on the pools (which slightly amplifies placement
 nonuniformity with multiple pools... not enough that it is worth changing
 anything though).

 The bad news is that that order of standard deviation results in pretty
 wide min/max range of 118 to 202 pgs.  That seems a *bit* higher than we a
 perfectly random placement generates (I'm seeing a spread in that is
 usually 50-70 pgs), but I think *that* is where the pool overlap (no
 hashpspool) is rearing its head; 

Re: [ceph-users] filesystem fragmentation on ext4 OSD

2014-02-06 Thread Mark Nelson

On 02/06/2014 04:17 AM, Christian Kauhaus wrote:

Hi,

after running Ceph for a while I see a lot of fragmented files on our OSD
filesystems (all running ext4). For example:

itchy ~ # fsck -f /srv/ceph/osd/ceph-5
fsck von util-linux 2.22.2
e2fsck 1.42 (29-Nov-2011)
[...]
/dev/mapper/vgosd00-ceph--osd00: 461903/418119680 files (33.7%
non-contiguous), 478239460/836229120 blocks

This is an unusually high value for ext4. The normal expectation is something
in the 5% range. I suspect that such a high fragmentation produces lots of
unnecessary seeks on the disks.

Has anyone an idea what to do to make Ceph fragment an OSD filesystem less?


Hi Christian, can you tell me a little bit about how you are using Ceph 
and what kind of IO you are doing?




TIA

Christian



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Kernel rbd cephx signatures

2014-02-06 Thread Kurt Bauer
Hi,

I have to open our CEPH cluster for some clients, that only support
kernel rbd. In general that's no problem and works just fine (verified
in our test-cluster ;-) ). I then tried to map images from our
production cluster and failed: rbd: add failed: (95) Operation not supported
After some testing and comparing the test and production clusters, it turned
out that the config option that prevents the kernel from mapping the image is
cephx require signatures = true
If I read the documentation
(http://ceph.com/docs/master/rados/operations/authentication/#backward-compatibility)
correctly that flag is recommended, which leads to two questions:
1. When will cephx signatures make it to kernel rbd (it's not there till
at least 3.12.0 and I've found no reference in the changelogs of
subsequent versions) ?
2. As I have to assess the risk of disabling cephx signatures, do you
have an estimate of how probable a real-life attack is, i.e. are
there real threats to the whole infrastructure, or is it only possible
to disturb the communication of exactly that client into whose
communication malicious messages are injected?
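For reference, the option in question as it would appear in ceph.conf, together with the finer-grained per-direction variants that also exist (whether those help with old kernel clients is an assumption, not something verified in this thread):

[global]
    cephx require signatures = true
    # separate knobs exist for each direction:
    # cephx cluster require signatures = true
    # cephx service require signatures = true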

Thanks a lot for your help,
best regards,
Kurt

PS.: If my conclusion is correct, maybe that should be mentioned
somewhere at http://ceph.com/docs/master/rbd/rbd-ko/

-- 
Kurt Bauer kurt.ba...@univie.ac.at
Vienna University Computer Center - ACOnet - VIX
Universitaetsstrasse 7, A-1010 Vienna, Austria, Europe
Tel: ++43 1 4277 - 14070 (Fax: - 814070)  KB1970-RIPE



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] poor data distribution

2014-02-06 Thread Sage Weil
Hi,

Just an update here.  Another user saw this and after playing with it I 
identified a problem with CRUSH.  There is a branch outstanding 
(wip-crush) that is pending review, but it's not a quick fix because of 
compatibility issues.

sage


On Thu, 6 Feb 2014, Dominik Mostowiec wrote:

 Hi,
 Mabye this info can help to find what is wrong.
 For one PG (3.1e4a) which is active+remapped:
 { state: active+remapped,
   epoch: 96050,
   up: [
 119,
 69],
   acting: [
 119,
 69,
 7],
 Logs:
 On osd.7:
 2014-02-04 09:45:54.966913 7fa618afe700  1 osd.7 pg_epoch: 94460
 pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=93486
 n=6718 ec=4 les/c 93486/93486 94460/94460/92233) [119,69] r=-1
 lpr=94460 pi=92546-94459/5 lcod 94459'207003 inactive NOTIFY]
 stateStart: transitioning to Stray
 2014-02-04 09:45:55.781278 7fa6172fb700  1 osd.7 pg_epoch: 94461
 pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=93486
 n=6718 ec=4 les/c 93486/93486 94460/94461/92233)
 [119,69]/[119,69,7,142] r=2 lpr=94461 pi=92546-94460/6 lcod
 94459'207003 remapped NOTIFY] stateStart: transitioning to Stray
 2014-02-04 09:49:01.124510 7fa618afe700  1 osd.7 pg_epoch: 94495
 pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=94462
 n=6718 ec=4 les/c 94462/94494 94460/94495/92233) [119,69]/[119,69,7]
 r=2 lpr=94495 pi=92546-94494/7 lcod 94459'207003 remapped]
 stateStart: transitioning to Stray
 
 On osd.119:
 2014-02-04 09:45:54.981707 7f37f07c5700  1 osd.119 pg_epoch: 94460
 pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=93486
 n=6718 ec=4 les/c 93486/93486 94460/94460/92233) [119,69] r=0
 lpr=94460 pi=93485-94459/1 mlcod 0'0 inactive] stateStart:
 transitioning to Primary
 2014-02-04 09:45:55.805712 7f37ecfbe700  1 osd.119 pg_epoch: 94461
 pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=93486
 n=6718 ec=4 les/c 93486/93486 94460/94461/92233)
 [119,69]/[119,69,7,142] r=0 lpr=94461 pi=93485-94460/2 mlcod 0'0
 remapped] stateStart: transitioning to Primary
 2014-02-04 09:45:56.794015 7f37edfc0700  0 log [INF] : 3.1e4a
 restarting backfill on osd.69 from (0'0,0'0] MAX to 94459'207004
 2014-02-04 09:49:01.156627 7f37ef7c3700  1 osd.119 pg_epoch: 94495
 pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=94462
 n=6718 ec=4 les/c 94462/94494 94460/94495/92233) [119,69]/[119,69,7]
 r=0 lpr=94495 pi=94461-94494/1 mlcod 0'0 remapped] stateStart:
 transitioning to Primary
 
 On osd.69:
 2014-02-04 09:45:56.845695 7f2231372700  1 osd.69 pg_epoch: 94462
 pg[3.1e4a( empty local-les=0 n=0 ec=4 les/c 93486/93486
 94460/94461/92233) [119,69]/[119,69,7,142] r=1 lpr=94462
 pi=93485-94460/2 inactive] stateStart: transitioning to Stray
 2014-02-04 09:49:01.153695 7f2229b63700  1 osd.69 pg_epoch: 94495
 pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=94462
 n=6718 ec=4 les/c 94462/94494 94460/94495/92233) [119,69]/[119,69,7]
 r=1 lpr=94495 pi=93485-94494/3 remapped] stateStart: transitioning
 to Stray
 
 pq query recovery state:
   recovery_state: [
 { name: Started\/Primary\/Active,
   enter_time: 2014-02-04 09:49:02.070724,
   might_have_unfound: [],
   recovery_progress: { backfill_target: -1,
   waiting_on_backfill: 0,
   backfill_pos: 0\/\/0\/\/-1,
   backfill_info: { begin: 0\/\/0\/\/-1,
   end: 0\/\/0\/\/-1,
   objects: []},
   peer_backfill_info: { begin: 0\/\/0\/\/-1,
   end: 0\/\/0\/\/-1,
   objects: []},
   backfills_in_flight: [],
   pull_from_peer: [],
   pushing: []},
   scrub: { scrubber.epoch_start: 77502,
   scrubber.active: 0,
   scrubber.block_writes: 0,
   scrubber.finalizing: 0,
   scrubber.waiting_on: 0,
   scrubber.waiting_on_whom: []}},
 { name: Started,
   enter_time: 2014-02-04 09:49:01.156626}]}
 
 ---
 Regards
 Dominik
 
 2014-02-04 12:09 GMT+01:00 Dominik Mostowiec dominikmostow...@gmail.com:
  Hi,
  Thanks for Your help !!
  We've done again 'ceph osd reweight-by-utilization 105'
  Cluster stack on 10387 active+clean, 237 active+remapped;
  More info in attachments.
 
  --
  Regards
  Dominik
 
 
  2014-02-04 Sage Weil s...@inktank.com:
  Hi,
 
  I spent a couple hours looking at your map because it did look like there
  was something wrong.  After some experimentation and adding a bucnh of
  improvements to osdmaptool to test the distribution, though, I think
  everything is working as expected.  For pool 3, your map has a standard
  deviation in utilizations of ~8%, and we should expect ~9% for this number
  of PGs.  For all pools, it is slightly higher (~9% vs expected ~8%).
  This is either just in the noise, or slightly confounded by the lack of
  the hashpspool flag on the pools (which slightly amplifies placement
  

Re: [ceph-users] poor data distribution

2014-02-06 Thread Dominik Mostowiec
Hi,
Thanks!!
Can you suggest any workaround for now?

--
Regards
Dominik


2014-02-06 18:39 GMT+01:00 Sage Weil s...@inktank.com:
 Hi,

 Just an update here.  Another user saw this and after playing with it I
 identified a problem with CRUSH.  There is a branch outstanding
 (wip-crush) that is pending review, but it's not a quick fix because of
 compatibility issues.

 sage


 On Thu, 6 Feb 2014, Dominik Mostowiec wrote:

 Hi,
 Mabye this info can help to find what is wrong.
 For one PG (3.1e4a) which is active+remapped:
 { state: active+remapped,
   epoch: 96050,
   up: [
 119,
 69],
   acting: [
 119,
 69,
 7],
 Logs:
 On osd.7:
 2014-02-04 09:45:54.966913 7fa618afe700  1 osd.7 pg_epoch: 94460
 pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=93486
 n=6718 ec=4 les/c 93486/93486 94460/94460/92233) [119,69] r=-1
 lpr=94460 pi=92546-94459/5 lcod 94459'207003 inactive NOTIFY]
 stateStart: transitioning to Stray
 2014-02-04 09:45:55.781278 7fa6172fb700  1 osd.7 pg_epoch: 94461
 pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=93486
 n=6718 ec=4 les/c 93486/93486 94460/94461/92233)
 [119,69]/[119,69,7,142] r=2 lpr=94461 pi=92546-94460/6 lcod
 94459'207003 remapped NOTIFY] stateStart: transitioning to Stray
 2014-02-04 09:49:01.124510 7fa618afe700  1 osd.7 pg_epoch: 94495
 pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=94462
 n=6718 ec=4 les/c 94462/94494 94460/94495/92233) [119,69]/[119,69,7]
 r=2 lpr=94495 pi=92546-94494/7 lcod 94459'207003 remapped]
 stateStart: transitioning to Stray

 On osd.119:
 2014-02-04 09:45:54.981707 7f37f07c5700  1 osd.119 pg_epoch: 94460
 pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=93486
 n=6718 ec=4 les/c 93486/93486 94460/94460/92233) [119,69] r=0
 lpr=94460 pi=93485-94459/1 mlcod 0'0 inactive] stateStart:
 transitioning to Primary
 2014-02-04 09:45:55.805712 7f37ecfbe700  1 osd.119 pg_epoch: 94461
 pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=93486
 n=6718 ec=4 les/c 93486/93486 94460/94461/92233)
 [119,69]/[119,69,7,142] r=0 lpr=94461 pi=93485-94460/2 mlcod 0'0
 remapped] stateStart: transitioning to Primary
 2014-02-04 09:45:56.794015 7f37edfc0700  0 log [INF] : 3.1e4a
 restarting backfill on osd.69 from (0'0,0'0] MAX to 94459'207004
 2014-02-04 09:49:01.156627 7f37ef7c3700  1 osd.119 pg_epoch: 94495
 pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=94462
 n=6718 ec=4 les/c 94462/94494 94460/94495/92233) [119,69]/[119,69,7]
 r=0 lpr=94495 pi=94461-94494/1 mlcod 0'0 remapped] stateStart:
 transitioning to Primary

 On osd.69:
 2014-02-04 09:45:56.845695 7f2231372700  1 osd.69 pg_epoch: 94462
 pg[3.1e4a( empty local-les=0 n=0 ec=4 les/c 93486/93486
 94460/94461/92233) [119,69]/[119,69,7,142] r=1 lpr=94462
 pi=93485-94460/2 inactive] stateStart: transitioning to Stray
 2014-02-04 09:49:01.153695 7f2229b63700  1 osd.69 pg_epoch: 94495
 pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=94462
 n=6718 ec=4 les/c 94462/94494 94460/94495/92233) [119,69]/[119,69,7]
 r=1 lpr=94495 pi=93485-94494/3 remapped] stateStart: transitioning
 to Stray

 pq query recovery state:
   recovery_state: [
 { name: Started\/Primary\/Active,
   enter_time: 2014-02-04 09:49:02.070724,
   might_have_unfound: [],
   recovery_progress: { backfill_target: -1,
   waiting_on_backfill: 0,
   backfill_pos: 0\/\/0\/\/-1,
   backfill_info: { begin: 0\/\/0\/\/-1,
   end: 0\/\/0\/\/-1,
   objects: []},
   peer_backfill_info: { begin: 0\/\/0\/\/-1,
   end: 0\/\/0\/\/-1,
   objects: []},
   backfills_in_flight: [],
   pull_from_peer: [],
   pushing: []},
   scrub: { scrubber.epoch_start: 77502,
   scrubber.active: 0,
   scrubber.block_writes: 0,
   scrubber.finalizing: 0,
   scrubber.waiting_on: 0,
   scrubber.waiting_on_whom: []}},
 { name: Started,
   enter_time: 2014-02-04 09:49:01.156626}]}

 ---
 Regards
 Dominik

 2014-02-04 12:09 GMT+01:00 Dominik Mostowiec dominikmostow...@gmail.com:
  Hi,
  Thanks for Your help !!
  We've done again 'ceph osd reweight-by-utilization 105'
  Cluster stack on 10387 active+clean, 237 active+remapped;
  More info in attachments.
 
  --
  Regards
  Dominik
 
 
  2014-02-04 Sage Weil s...@inktank.com:
  Hi,
 
  I spent a couple hours looking at your map because it did look like there
  was something wrong.  After some experimentation and adding a bucnh of
  improvements to osdmaptool to test the distribution, though, I think
  everything is working as expected.  For pool 3, your map has a standard
  deviation in utilizations of ~8%, and we should expect ~9% for this number
  of PGs.  For all pools, it is slightly higher (~9% vs expected ~8%).
  This is either just in 

[ceph-users] OSD block device performance

2014-02-06 Thread John Mancuso
Hey all, I'm currently poring through the ceph docs trying to familiarize 
myself with the product before I begin my cluster build-out for a virtualized 
environment. One area which I've been looking into is disk 
throughput/performance.

I stumbled onto the following site:
http://www.sebastien-han.fr/blog/2012/08/26/ceph-benchmarks/


1)   I'm not sure where this info below originates, as I did not see it on 
the ceph doc site, unless it is hidden in some dark corner somewhere.  Can anyone 
point me to a wiki/URL?

2)  Can someone describe this 50/50 split of journal vs filesystem (assume 
it has something to do with filestore flush)?

Consideration about Ceph's journal: the journal is, by design, the component 
that can most easily and significantly be improved. Take a little step back over it. As 
a reminder, Ceph's journal serves 2 purposes:

  *   It acts as a buffer cache (FIFO buffer). The journal takes every request 
and performs each write with O_DIRECT. After a determined period and 
acknowledgment, the journal flushes its content to the backend filesystem. By 
default this value is set to 5 seconds and is called filestore max sync interval. 
The filestore starts to flush when the journal is half-full or the max sync 
interval is reached.
  *   Failure coverage: pending writes are handled by the journal if not yet 
committed to the backend filesystem.
The journal can operate in 2 modes, called parallel and writeahead; the mode 
is automatically detected according to the file system in use by the OSD 
backend storage. The parallel mode is only supported by Btrfs.
In practice, a common gigabit network can write 100 MB/sec. Let's say that your 
journal and your backend storage are stored on the same disk, and this 
disk has a write speed of 100 MB/sec. With the default writeahead mode, the 
write speed will be split after 5 seconds (the default interval after which 
the journal starts to flush to the backend filesystem).
The first 5 sec write at 100 MB/sec; after that, writes are split like so:

  *   50 MB/sec for the journal
  *   50 MB/sec for the backend filesystem
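Put as a back-of-the-envelope calculation (assuming writeahead mode and a single 100 MB/sec disk shared by journal and data, as in the example above):

    client data written to the journal    : 1x
    same data written to the filestore    : 1x
    total disk traffic per client byte    = 2x
    => sustained client throughput       ~= 100 MB/sec / 2 = 50 MB/sec
       (less in practice, due to seeks between the journal and data areas)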
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD block device performance

2014-02-06 Thread John Spray
Hi John,

The 50/50 thing comes from the way the Ceph OSD writes data twice:
first to the journal, and then subsequently to the data partition.
The write doubling may not affect your performance outcome, depending
on the ratio of drive bandwidth to network bandwidth and the I/O
pattern.  In configurations where it is an issue, the way to improve
performance is to use an SSD for journals (Sebastian mentions this in
his article under Commodity improved).
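A minimal sketch of what that looks like in ceph.conf, assuming a dedicated SSD partition per OSD (the device path below is only an example):

[osd.0]
    # journal on its own SSD partition instead of the data disk
    osd journal = /dev/disk/by-partlabel/journal-osd0
    osd journal size = 10000    # MB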

The journal is an area of quite some flexibility, the relevant
settings are in the docs here:
http://ceph.com/docs/master/rados/configuration/journal-ref/
http://ceph.com/docs/master/rados/configuration/osd-config-ref/#journal-settings

There is some discussion of the use of SSDs with Ceph here:
http://ceph.com/docs/master/start/hardware-recommendations/#solid-state-drives

I'm sure others on this list will have more empirical information
about their experiences in this area.

Cheers,
John

On Thu, Feb 6, 2014 at 6:18 PM, John Mancuso jmanc...@freewheel.tv wrote:
 Hey all, I'm currently pouring through the ceph docs trying to familiarize
 myself with the product before I begin my cluster build-out for a
 virtualized environment. One area which I've been looking into is disk
 throughput/performance.



 I stumbled onto the following site:

 http://www.sebastien-han.fr/blog/2012/08/26/ceph-benchmarks/



 1)   I'm not sure where this info below originates as I did not see this
 on the ceph doc site, unless it is hidden in some dark corner somewhere.
 Anyone point me to a wiki/url?

 2)  Can someone describe this 50/50 split of journal vs filesystem
 (assume it has something to do with filestore flush)?

 Consideration about the ceph's journal. The journal is by design the
 component that could be severely and easily improved. Take a little step
 back over it. As a reminder the ceph's journal serves 2 purposes:

 It acts as a buffer cache (FIFO buffer). The journal takes every request and
 performs each write with O_DIRECT. After a determined period and
 acknowledgment the journal flush his content to the backend filesystem. By
 default this value is set to 5 seconds and called filestore max sync
 interval. The filestore starts to flush when the journal is half-full or max
 sync interval is reached.
 Failure coverage, pending writes are handled by the Journal if not committed
 yet to the backend filesystem.

 The journal can operate in 2 modes called parallel and writeahead, the given
 mode is automatically detected according to the file system in use by the
 OSD backend storage. The parallel mode is only supported by Btrfs.

 In practice, common gigabits network can write 100 MB/sec. Let say that you
 store your journal and your backend storage are stored on the same disk.
 This disk has a write speed of 100 MB/sec. With the default writeahead mode
 the write speed will be split after 5 seconds (the default duration during
 the one the journal starts to flush to the backend filesystem).

 The first 5 sec writes at 100 MB/sec, after that writes are splitted like
 so:

 50 MB/sec for the journal
 50 MB/sec for the backend filesystem


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] filesystem fragmentation on ext4 OSD

2014-02-06 Thread Christian Kauhaus
On 06.02.2014 16:24, Mark Nelson wrote:
 Hi Christian, can you tell me a little bit about how you are using Ceph and
 what kind of IO you are doing?

Sure. We're using it almost exclusively for serving VM images that are
accessed from Qemu's built-in RBD client. The VMs themselves perform a very
wide range of I/O types, from servers that write mainly log files to ZEO
database servers with nearly completely random I/O. Many VMs have slowly
increasing storage utilization.

A reason could be that the OSDs issue syncfs() calls and ext4 cuts FS extents
from just what has been written so far. But I'm not sure about the exact
pattern of OSD/filesystem interaction.
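That hypothesis can at least be checked on disk by dumping the extent list of a single, slowly grown object file; a sketch (the path is an example of the filestore layout, not an actual file from this cluster):

itchy ~ # filefrag -v /srv/ceph/osd/ceph-5/current/<pgid>_head/<object file>

Many small extents in a 4 MB object file would fit the "allocate only what has been written so far" pattern.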

HTH

Christian

-- 
Dipl.-Inf. Christian Kauhaus  · k...@gocept.com · systems administration
gocept gmbh & co. kg · Forsterstraße 29 · 06112 Halle (Saale) · Germany
http://gocept.com · tel +49 345 219401-11
Python, Pyramid, Plone, Zope · consulting, development, hosting, operations
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] poor data distribution

2014-02-06 Thread Dominik Mostowiec
Great!
Thanks for Your help.

--
Regards
Dominik

2014-02-06 21:10 GMT+01:00 Sage Weil s...@inktank.com:
 On Thu, 6 Feb 2014, Dominik Mostowiec wrote:
 Hi,
 Thanks !!
 Can You suggest any workaround for now?

 You can adjust the crush weights on the overfull nodes slightly.  You'd
 need to do it by hand, but that will do the trick.  For example,

   ceph osd crush reweight osd.123 .96

 (if the current weight is 1.0).
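 The current crush weight per OSD can be read off beforehand with:

   ceph osd tree

 so the adjustment stays small relative to the existing value.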

 sage


 --
 Regards
 Dominik


 2014-02-06 18:39 GMT+01:00 Sage Weil s...@inktank.com:
  Hi,
 
  Just an update here.  Another user saw this and after playing with it I
  identified a problem with CRUSH.  There is a branch outstanding
  (wip-crush) that is pending review, but it's not a quick fix because of
  compatibility issues.
 
  sage
 
 
  On Thu, 6 Feb 2014, Dominik Mostowiec wrote:
 
  Hi,
  Mabye this info can help to find what is wrong.
  For one PG (3.1e4a) which is active+remapped:
  { state: active+remapped,
epoch: 96050,
up: [
  119,
  69],
acting: [
  119,
  69,
  7],
  Logs:
  On osd.7:
  2014-02-04 09:45:54.966913 7fa618afe700  1 osd.7 pg_epoch: 94460
  pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=93486
  n=6718 ec=4 les/c 93486/93486 94460/94460/92233) [119,69] r=-1
  lpr=94460 pi=92546-94459/5 lcod 94459'207003 inactive NOTIFY]
  stateStart: transitioning to Stray
  2014-02-04 09:45:55.781278 7fa6172fb700  1 osd.7 pg_epoch: 94461
  pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=93486
  n=6718 ec=4 les/c 93486/93486 94460/94461/92233)
  [119,69]/[119,69,7,142] r=2 lpr=94461 pi=92546-94460/6 lcod
  94459'207003 remapped NOTIFY] stateStart: transitioning to Stray
  2014-02-04 09:49:01.124510 7fa618afe700  1 osd.7 pg_epoch: 94495
  pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=94462
  n=6718 ec=4 les/c 94462/94494 94460/94495/92233) [119,69]/[119,69,7]
  r=2 lpr=94495 pi=92546-94494/7 lcod 94459'207003 remapped]
  stateStart: transitioning to Stray
 
  On osd.119:
  2014-02-04 09:45:54.981707 7f37f07c5700  1 osd.119 pg_epoch: 94460
  pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=93486
  n=6718 ec=4 les/c 93486/93486 94460/94460/92233) [119,69] r=0
  lpr=94460 pi=93485-94459/1 mlcod 0'0 inactive] stateStart:
  transitioning to Primary
  2014-02-04 09:45:55.805712 7f37ecfbe700  1 osd.119 pg_epoch: 94461
  pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=93486
  n=6718 ec=4 les/c 93486/93486 94460/94461/92233)
  [119,69]/[119,69,7,142] r=0 lpr=94461 pi=93485-94460/2 mlcod 0'0
  remapped] stateStart: transitioning to Primary
  2014-02-04 09:45:56.794015 7f37edfc0700  0 log [INF] : 3.1e4a
  restarting backfill on osd.69 from (0'0,0'0] MAX to 94459'207004
  2014-02-04 09:49:01.156627 7f37ef7c3700  1 osd.119 pg_epoch: 94495
  pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=94462
  n=6718 ec=4 les/c 94462/94494 94460/94495/92233) [119,69]/[119,69,7]
  r=0 lpr=94495 pi=94461-94494/1 mlcod 0'0 remapped] stateStart:
  transitioning to Primary
 
  On osd.69:
  2014-02-04 09:45:56.845695 7f2231372700  1 osd.69 pg_epoch: 94462
  pg[3.1e4a( empty local-les=0 n=0 ec=4 les/c 93486/93486
  94460/94461/92233) [119,69]/[119,69,7,142] r=1 lpr=94462
  pi=93485-94460/2 inactive] stateStart: transitioning to Stray
  2014-02-04 09:49:01.153695 7f2229b63700  1 osd.69 pg_epoch: 94495
  pg[3.1e4a( v 94459'207004 (72275'204004,94459'207004] local-les=94462
  n=6718 ec=4 les/c 94462/94494 94460/94495/92233) [119,69]/[119,69,7]
  r=1 lpr=94495 pi=93485-94494/3 remapped] stateStart: transitioning
  to Stray
 
  pq query recovery state:
recovery_state: [
  { name: Started\/Primary\/Active,
enter_time: 2014-02-04 09:49:02.070724,
might_have_unfound: [],
recovery_progress: { backfill_target: -1,
waiting_on_backfill: 0,
backfill_pos: 0\/\/0\/\/-1,
backfill_info: { begin: 0\/\/0\/\/-1,
end: 0\/\/0\/\/-1,
objects: []},
peer_backfill_info: { begin: 0\/\/0\/\/-1,
end: 0\/\/0\/\/-1,
objects: []},
backfills_in_flight: [],
pull_from_peer: [],
pushing: []},
scrub: { scrubber.epoch_start: 77502,
scrubber.active: 0,
scrubber.block_writes: 0,
scrubber.finalizing: 0,
scrubber.waiting_on: 0,
scrubber.waiting_on_whom: []}},
  { name: Started,
enter_time: 2014-02-04 09:49:01.156626}]}
 
  ---
  Regards
  Dominik
 
  2014-02-04 12:09 GMT+01:00 Dominik Mostowiec dominikmostow...@gmail.com:
   Hi,
   Thanks for Your help !!
   We've done again 'ceph osd reweight-by-utilization 105'
   Cluster stack on 10387 active+clean, 237 active+remapped;
   More info in attachments.
  
   --
   Regards
   Dominik
  
  
   2014-02-04 Sage Weil 

Re: [ceph-users] RBD Caching - How to enable?

2014-02-06 Thread Blair Bethwaite
Does anybody else think there is a problem with the docs/settings here...


 Message: 13
 Date: Thu, 06 Feb 2014 12:11:53 +0100 (CET)
 From: Alexandre DERUMIER aderum...@odiso.com
 To: Graeme Lambert glamb...@adepteo.net
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] RBD Caching - How to enable?
 Message-ID: d0af5e59-ea2a-471e-be65-ff273d0c0216@mailpro
 Content-Type: text/plain; charset=utf-8

 OK, so I need to change ceph.conf on the compute nodes?
 yes.

 Do the VMs using RBD images need to be restarted at all?
 I think yes.

 Anything changed in the virsh XML for the nodes?

 you need to add cache=writeback for your disks

 If you use qemu >= 1.2, there is no need to add "rbd cache = true" to ceph.conf :)
 http://ceph.com/docs/next/rbd/qemu-rbd/


This page reads "If you set rbd_cache=true, you must set cache=writeback or
risk data loss." ...

That's an inverted definition of writeback AFAIK!

-- 
Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Crush Maps

2014-02-06 Thread McNamara, Bradley
I have a test cluster that is up and running.  It consists of three mons, and 
three OSD servers, with each OSD server having eight OSD's and two SSD's for 
journals.  I'd like to move from the flat crushmap to a crushmap with typical 
depth using most of the predefined types.  I have the current crushmap 
decompiled and have edited it to add the additional depth of failure zones.

Questions:


1)  Do the ID's of the bucket types need to be consecutive, or can I make 
them up as long as they are negative in value and unique?

2)  Is there any way that I can control the assignment of the bucket type 
ID's if I were to update the crushmap on a running system using the CLI?

3)  Is there any harm in adding bucket types that are not currently used, 
but assigning them a weight of 0, so they aren't used (a row defined, with 
racks, but the racks have no hosts defined)?

4)  Can I have a bucket type with no item lines in it, or does each 
bucket type need at least one item declaration to be valid?
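For reference, a typical round-trip for testing an edited map before injecting it (a sketch; filenames are arbitrary, and crushtool --test options vary a bit by version):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# ... edit crushmap.txt ...
crushtool -c crushmap.txt -o crushmap.new
# sanity-check the mappings before injecting:
crushtool -i crushmap.new --test --rule 0 --num-rep 3 --show-utilization
ceph osd setcrushmap -i crushmap.new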

Example:
# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20
device 21 osd.21
device 22 osd.22
device 23 osd.23

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 root

# buckets
host spucosds01 {
id -2   # do not change unnecessarily
# weight 29.120
alg straw
hash 0  # rjenkins1
item osd.0 weight 3.640
item osd.1 weight 3.640
item osd.2 weight 3.640
item osd.3 weight 3.640
item osd.4 weight 3.640
item osd.5 weight 3.640
item osd.6 weight 3.640
item osd.7 weight 3.640
}
host spucosds02 {
id -3   # do not change unnecessarily
# weight 29.120
alg straw
hash 0  # rjenkins1
item osd.8 weight 3.640
item osd.9 weight 3.640
item osd.10 weight 3.640
item osd.11 weight 3.640
item osd.12 weight 3.640
item osd.13 weight 3.640
item osd.14 weight 3.640
item osd.15 weight 3.640
}
host spucosds03 {
id -4   # do not change unnecessarily
# weight 29.120
alg straw
hash 0  # rjenkins1
item osd.16 weight 3.640
item osd.17 weight 3.640
item osd.18 weight 3.640
item osd.19 weight 3.640
item osd.20 weight 3.640
item osd.21 weight 3.640
item osd.22 weight 3.640
item osd.23 weight 3.640
}
rack rack2-2 {
id -220
alg straw
hash 0
item spucosds01 weight 29.12
}
rack rack3-2 {
id -230
alg straw
hash 0
item spucosds02 weight 29.12
}
rack rack4-2 {
id -240
alg straw
hash 0
item spucosds03 weight 29.12
}
row row1 {
id -100
alg straw
hash 0
}
row row2 {
id -200
alg straw
hash 0
item rack2-2 weight 29.12
item rack3-2 weight 29.12
item rack4-2 weight 29.12
}
datacenter smt {
id -1000
alg straw
hash 0
item row1 weight 0.0
item row2 weight 87.36
}
root default {
id -1   # do not change unnecessarily
# weight 87.360
alg straw
hash 0  # rjenkins1
item smt weight 87.36
}

# rules
rule data {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
rule metadata {
ruleset 1
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
rule rbd {
ruleset 2
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

# end crush map

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW Replication

2014-02-06 Thread Craig Lewis

On 2/4/14 17:06 , Craig Lewis wrote:


Now that I've started seeing missing objects, I'm not able to download 
objects that should be on the slave if replication is up to date.  
Either it's not up to date, or it's skipping objects every pass.




Using my --max-entries fix 
(https://github.com/ceph/radosgw-agent/pull/8), I think I see what's 
happening.



Shut down replication
Upload 6 objects to an empty bucket on the master:
2014-02-07 02:02    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test0.jpg
2014-02-07 02:02    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test1.jpg
2014-02-07 02:02    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test2.jpg
2014-02-07 02:02    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test3.jpg
2014-02-07 02:03    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test4.jpg
2014-02-07 02:03    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test5.jpg

None show on the slave, because replication is down.

Start radosgw-agent --max-entries=2 (1 doesn't seem to replicate anything)
Check contents of slave after pass #1:
2014-02-07 02:02    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test0.jpg


Check contents of slave after pass #10:
2014-02-07 02:02    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test0.jpg


Leave replication running
Upload 1 object, test6.jpg, to the master.  Check the master:
2014-02-07 02:02    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test0.jpg
2014-02-07 02:02    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test1.jpg
2014-02-07 02:02    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test2.jpg
2014-02-07 02:02    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test3.jpg
2014-02-07 02:03    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test4.jpg
2014-02-07 02:03    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test5.jpg
2014-02-07 02:06    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test6.jpg


Check contents of slave after next pass:
2014-02-07 02:02    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test0.jpg
2014-02-07 02:02    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test1.jpg


Upload another file, test7.jpg, to the master:
2014-02-07 02:02    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test0.jpg
2014-02-07 02:02    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test1.jpg
2014-02-07 02:02    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test2.jpg
2014-02-07 02:02    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test3.jpg
2014-02-07 02:03    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test4.jpg
2014-02-07 02:03    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test5.jpg
2014-02-07 02:06    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test6.jpg
2014-02-07 02:08    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test7.jpg


The slave doesn't get it this time:
2014-02-07 02:02    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test0.jpg
2014-02-07 02:02    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test1.jpg


Upload another file, test8.jpg, to the master:
2014-02-07 02:02    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test0.jpg
2014-02-07 02:02    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test1.jpg
2014-02-07 02:02    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test2.jpg
2014-02-07 02:02    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test3.jpg
2014-02-07 02:03    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test4.jpg
2014-02-07 02:03    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test5.jpg
2014-02-07 02:06    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test6.jpg
2014-02-07 02:08    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test7.jpg
2014-02-07 02:10    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test8.jpg


The slave gets the 3rd file:
2014-02-07 02:02    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test0.jpg
2014-02-07 02:02    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test1.jpg
2014-02-07 02:02    10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test2.jpg




So I think the problem is caused by the shard marker being set to the 
current marker after every pass, even if the bucket replication caps on 
max-entries.


Updating the shard marker by uploading a file causes another pass on the 
bucket, and the bucket marker is being tracked correctly.



I would prefer to track the shard marker better, but I don't see any way 
to get the last shard marker given the last bucket entry.  If I track 
the shard marker correctly, then the stats I'm generating are still 
somewhat useful (if incomplete).  I'll be able to see when replication 
falls behind because the graphs keep growing.


The alternative is to change the bucket sync so that it loops until 

Re: [ceph-users] Crush Maps

2014-02-06 Thread Daniel Schwager
Hello Bradley, in addition to your questions, I'm interested in the following:

5) Can I change all 'type' IDs when adding a new type host-slow to 
distinguish between OSDs with the journal on the same HDD and those with a separate SSD? E.g. 
from
type 0 osd
type 1 host
type 2 rack
..
to
type 0 osd
type 1 host
type 2 host-slow
type 3 rack
..

6)  After importing the crush map into the cluster, how can I start rebalancing 
all existing pools? (This is because all OSDs are now moved to other locations in 
the crush hierarchy.)

best regards
Danny

From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of McNamara, Bradley


1)  Do the ID's of the bucket types need to be consecutive, or can I make 
them up as long as they are negative in value and unique?

2)  Is there any way that I can control the assignment of the bucket type 
ID's if I were to update the crushmap on a running system using the CLI?

3)  Is there any harm in adding bucket types that are not currently used, 
but assigning them a weight of 0, so they aren't used (a row defined, with 
racks, but the racks have no hosts defined)?

4)  Can I have a bucket type with no item lines in it, or does each 
bucket type need at least on item declaration to be valid?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD+KVM problems with sequential read

2014-02-06 Thread Ирек Фасихов
Hi All.

Hosts: Dell R815x5, 128 GB RAM, 25 OSD + 5 SSD(journal+system).
Network: 2x10Gb+LACP
Kernel: 2.6.32
QEMU emulator version 1.4.2, Copyright (c) 2003-2008 Fabrice Bellard


POOLs:
root@kvm05:~# ceph osd dump | grep 'rbd'
pool 5 'rbd' rep size 2 min_size 1 crush_ruleset 2 object_hash rjenkins
pg_num 1400 pgp_num 1400 last_change 12550 owner 0
---
root@kvm05:~# ceph osd dump | grep 'test'
pool 32 'test' rep size 2 min_size 1 crush_ruleset 2 object_hash rjenkins
pg_num 1400 pgp_num 1400 last_change 12655 owner 0

root@kvm01:~# ceph -v
ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
--
root@kvm01:~# rados bench -p test 120 write --no-cleanup
Total time run: 120.125225
Total writes made:  11519
Write size: 4194304
Bandwidth (MB/sec): 383.566

Stddev Bandwidth:   36.2022
Max bandwidth (MB/sec): 408
Min bandwidth (MB/sec): 0
Average Latency:0.166819
Stddev Latency: 0.0553357
Max latency:1.60795
Min latency:0.044263
--
root@kvm01:~# rados bench -p test 120 seq
Total time run:        67.271769
Total reads made:      11519
Read size:             4194304
Bandwidth (MB/sec):    684.923

Average Latency:       0.0933579
Max latency:           0.808438
Min latency:           0.018063
---
[root@cephadmin cluster]# cat ceph.conf
[global]
fsid = 43a571a9-b3e8-4dc9-9200-1f3904e1e12a
initial_members = kvm01, kvm02, kvm03
mon_host = 192.168.100.1,192.168.100.2, 192.168.100.3
auth_supported = cephx
public network = 192.168.100.0/24
cluster_network = 192.168.101.0/24

[osd]
osd journal size = 12500
osd mkfs type = xfs
osd mkfs options xfs = -f -i size=2048
osd mount options xfs = rw,noatime,inode64,logbsize=256k,delaylog
osd op threads = 10
osd disk threads = 10
osd max backfills = 2
osd recovery max active = 1
filestore op threads = 64
filestore xattr use omap = true

[client]
rbd cache = true
rbd cache size = 134217728
rbd cache max dirty = 0

[mon.kvm01]
host = kvm01
mon addr = 192.168.100.1:6789

[mon.kvm02]
host = kvm02
mon addr = 192.168.100.2:6789

[mon.kvm03]
host = kvm03
mon addr = 192.168.100.3:6789

[osd.0]
public addr = 192.168.100.1
cluster addr = 192.168.101.1

[osd.1]
public addr = 192.168.100.1
cluster addr = 192.168.101.1

[osd.2]
public addr = 192.168.100.1
cluster addr = 192.168.101.1

[osd.3]
public addr = 192.168.100.1
cluster addr = 192.168.101.1

[osd.4]
public addr = 192.168.100.1
cluster addr = 192.168.101.1

[osd.5]
public addr = 192.168.100.2
cluster addr = 192.168.101.2

[osd.6]
public addr = 192.168.100.2
cluster addr = 192.168.101.2

[osd.7]
public addr = 192.168.100.2
cluster addr = 192.168.101.2

[osd.8]
public addr = 192.168.100.2
cluster addr = 192.168.101.2

[osd.9]
public addr = 192.168.100.2
cluster addr = 192.168.101.2

[osd.10]
public addr = 192.168.100.3
cluster addr = 192.168.101.3

[osd.11]
public addr = 192.168.100.3
cluster addr = 192.168.101.3

[osd.12]
public addr = 192.168.100.3
cluster addr = 192.168.101.3

[osd.13]
public addr = 192.168.100.3
cluster addr = 192.168.101.3

[osd.14]
public addr = 192.168.100.3
cluster addr = 192.168.101.3
[osd.15]
public addr = 192.168.100.4
cluster addr = 192.168.101.4

[osd.16]
public addr = 192.168.100.4
cluster addr = 192.168.101.4

[osd.17]
public addr = 192.168.100.4
cluster addr = 192.168.101.4

[osd.18]
public addr = 192.168.100.4
cluster addr = 192.168.101.4

[osd.19]
public addr = 192.168.100.4
cluster addr = 192.168.101.4

[osd.20]
public addr = 192.168.100.5
cluster addr = 192.168.101.5

[osd.21]
public addr = 192.168.100.5
cluster addr = 192.168.101.5

[osd.22]
public addr = 192.168.100.5
cluster addr = 192.168.101.5

[osd.23]
public addr = 192.168.100.5
cluster addr = 192.168.101.5

[osd.24]
public addr = 192.168.100.5
cluster addr = 192.168.101.5
---
[root@cephadmin ~]# cat crushd
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device