[ceph-users] basic questions about pool

2014-07-10 Thread pragya jain
hi all,

I have some very basic questions about pools in ceph.

According to the Ceph documentation, when we deploy a Ceph cluster with a 
radosgw instance on top of it, Ceph creates pools by default to store the 
data, or the deployer can also create pools according to the requirements.
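
For reference, a rough sketch of the commands a deployer would use here; the
pool name and placement-group count below are only illustrative:

  rados lspools                    # list the pools that already exist
  ceph osd pool create mypool 128  # create an additional pool with 128 PGs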

Now, my question is:
1. What is the relevance of multiple pools in a cluster?
i.e. why should a deployer create multiple pools in a cluster, and what are 
the benefits of creating multiple pools?

2. According to the docs, the default pools are data, metadata, and rbd.
What is the difference among these three pools?

3. When a system deployer has deployed a Ceph cluster with a radosgw 
interface and starts providing services to end users (for example, end users 
can create accounts on the Ceph cluster and store/retrieve their data to/from 
the cluster), does the end user have any concern with the pools created in 
the cluster?

Can somebody please help me clear up these points of confusion?

regards
Pragya Jain


Re: [ceph-users] ceph mount not working anymore

2014-07-10 Thread Joshua McClintock
[root@chefwks01 ~]# ceph --cluster us-west01 osd crush dump

{ "devices": [

{ "id": 0,

  "name": "osd.0"},

{ "id": 1,

  "name": "osd.1"},

{ "id": 2,

  "name": "osd.2"},

{ "id": 3,

  "name": "osd.3"},

{ "id": 4,

  "name": "osd.4"}],

  "types": [

{ "type_id": 0,

  "name": "osd"},

{ "type_id": 1,

  "name": "host"},

{ "type_id": 2,

  "name": "chassis"},

{ "type_id": 3,

  "name": "rack"},

{ "type_id": 4,

  "name": "row"},

{ "type_id": 5,

  "name": "pdu"},

{ "type_id": 6,

  "name": "pod"},

{ "type_id": 7,

  "name": "room"},

{ "type_id": 8,

  "name": "datacenter"},

{ "type_id": 9,

  "name": "region"},

{ "type_id": 10,

  "name": "root"}],

  "buckets": [

{ "id": -1,

  "name": "default",

  "type_id": 10,

  "type_name": "root",

  "weight": 147455,

  "alg": "straw",

  "hash": "rjenkins1",

  "items": [

{ "id": -2,

  "weight": 29491,

  "pos": 0},

{ "id": -3,

  "weight": 29491,

  "pos": 1},

{ "id": -4,

  "weight": 29491,

  "pos": 2},

{ "id": -5,

  "weight": 29491,

  "pos": 3},

{ "id": -6,

  "weight": 29491,

  "pos": 4}]},

{ "id": -2,

  "name": "ceph-node20",

  "type_id": 1,

  "type_name": "host",

  "weight": 29491,

  "alg": "straw",

  "hash": "rjenkins1",

  "items": [

{ "id": 0,

  "weight": 29491,

  "pos": 0}]},

{ "id": -3,

  "name": "ceph-node22",

  "type_id": 1,

  "type_name": "host",

  "weight": 29491,

  "alg": "straw",

  "hash": "rjenkins1",

  "items": [

{ "id": 2,

  "weight": 29491,

  "pos": 0}]},

{ "id": -4,

  "name": "ceph-node24",

  "type_id": 1,

  "type_name": "host",

  "weight": 29491,

  "alg": "straw",

  "hash": "rjenkins1",

  "items": [

{ "id": 4,

  "weight": 29491,

  "pos": 0}]},

{ "id": -5,

  "name": "ceph-node21",

  "type_id": 1,

  "type_name": "host",

  "weight": 29491,

  "alg": "straw",

  "hash": "rjenkins1",

  "items": [

{ "id": 1,

  "weight": 29491,

  "pos": 0}]},

{ "id": -6,

  "name": "ceph-node23",

  "type_id": 1,

  "type_name": "host",

  "weight": 29491,

  "alg": "straw",

  "hash": "rjenkins1",

  "items": [

{ "id": 3,

  "weight": 29491,

  "pos": 0}]}],

  "rules": [

{ "rule_id": 0,

  "rule_name": "replicated_ruleset",

  "ruleset": 0,

  "type": 1,

  "min_size": 1,

  "max_size": 10,

  "steps": [

{ "op": "take",

  "item": -1,

  "item_name": "default"},

{ "op": "chooseleaf_firstn",

  "num": 0,

  "type": "host"},

{ "op": "emit"}]},

{ "rule_id": 1,

  "rule_name": "erasure-code",

  "ruleset": 1,

  "type": 3,

  "min_size": 3,

  "max_size": 20,

  "steps": [

{ "op": "set_chooseleaf_tries",

  "num": 5},

{ "op": "take",

  "item": -1,

  "item_name": "default"},

{ "op": "chooseleaf_indep",

  "num": 0,

  "type": "host"},

{ "op": "emit"}]},

{ "rule_id": 2,

  "rule_name": "ecpool",

  "ruleset": 2,

  "type": 3,

  "min_size": 3,

  "max_size": 20,

  "steps": [

{ "op": "set_chooseleaf_tries",

  "num": 5},

{ "op": "take",

  "item": -1,

  "item_name": "default"},

{ "op": "choose_indep",

  "num": 0,

  "type": "osd"},

{ "op": "emit"}]}],

  "tunables": { "choose_local_tries": 0,

  "choose_local_fallback_tries": 0,

  "choose_total_tries": 50,

  "chooseleaf_descend_once": 1,

  "profile": "bobtail",

  "optimal_tunables": 0,

  "legacy_tunables": 0,

  "require_feature_tunables": 1,

 

Re: [ceph-users] ceph mount not working anymore

2014-07-10 Thread Sage Weil
That is CEPH_FEATURE_CRUSH_V2.  Can you attach the output of

 ceph osd crush dump

Thanks!
sage
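
A hedged sketch of where this usually leads, assuming the dump shows the
erasure-code rules and no pool actually uses them: the CRUSH_V2 feature bit is
required by the set_chooseleaf_tries / *_indep steps those rules contain, so
either the kernel client has to be new enough to understand CRUSH_V2, or the
unused rules can be removed so older kernels can connect again. Roughly:

  ceph osd crush rule ls                 # see which rules exist
  ceph osd dump | grep crush_ruleset     # check which rulesets the pools use
  ceph osd crush rule rm ecpool          # only if no pool references it
  ceph osd crush rule rm erasure-code    # only if no pool references it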


On Thu, 10 Jul 2014, Joshua McClintock wrote:

> Yes, I changed some of the mount options on my osds (xfs mount options), but
> I think this may be the answer from dmesg, sorta looks like a version
> mismatch:
> 
> libceph: loaded (mon/osd proto 15/24)
> 
> ceph: loaded (mds proto 32)
> 
> libceph: mon0 192.168.0.14:6789 feature set mismatch, my 4a042aca < server's
> 104a042aca, missing 10
> 
> libceph: mon0 192.168.0.14:6789 socket error on read
> 
> libceph: mon2 192.168.0.16:6789 feature set mismatch, my 4a042aca < server's
> 104a042aca, missing 10
> 
> libceph: mon2 192.168.0.16:6789 socket error on read
> 
> libceph: mon1 192.168.0.15:6789 feature set mismatch, my 4a042aca < server's
> 104a042aca, missing 10
> 
> libceph: mon1 192.168.0.15:6789 socket error on read
> 
> libceph: mon0 192.168.0.14:6789 feature set mismatch, my 4a042aca < server's
> 104a042aca, missing 10
> 
> libceph: mon0 192.168.0.14:6789 socket error on read
> 
> libceph: mon2 192.168.0.16:6789 feature set mismatch, my 4a042aca < server's
> 104a042aca, missing 10
> 
> libceph: mon2 192.168.0.16:6789 socket error on read
> 
> libceph: mon1 192.168.0.15:6789 feature set mismatch, my 4a042aca < server's
> 104a042aca, missing 10
> 
> libceph: mon1 192.168.0.15:6789 socket error on read
> 
> 
> Maybe I didn't update as well as I thought I did.  I did hit every mon,
> but I remember I couldn't upgrade to the new 'ceph' package because it
> conflicted with 'python-ceph', so I uninstalled it (python-ceph), and then
> upgraded to .80.1-2.   Maybe there's a subcomponent I missed? 
> 
> 
> Here's rpm -qa from the client:
> 
> 
> [root@chefwks01 ~]# rpm -qa|grep ceph
> 
> ceph-deploy-1.5.2-0.noarch
> 
> ceph-release-1-0.el6.noarch
> 
> ceph-0.80.1-2.el6.x86_64
> 
> libcephfs1-0.80.1-0.el6.x86_64
> 
> 
> Here's rpm -qa from the mons:
> 
> 
> [root@ceph-mon01 ~]# rpm -qa|grep ceph
> 
> ceph-0.80.1-2.el6.x86_64
> 
> ceph-release-1-0.el6.noarch
> 
> libcephfs1-0.80.1-0.el6.x86_64
> 
> [root@ceph-mon01 ~]#
> 
> 
> [root@ceph-mon02 ~]# rpm -qa|grep ceph
> 
> libcephfs1-0.80.1-0.el6.x86_64
> 
> ceph-0.80.1-2.el6.x86_64
> 
> ceph-release-1-0.el6.noarch
> 
> [root@ceph-mon02 ~]#
> 
> 
> [root@ceph-mon03 ~]# rpm -qa|grep ceph
> 
> libcephfs1-0.80.1-0.el6.x86_64
> 
> ceph-0.80.1-2.el6.x86_64
> 
> ceph-release-1-0.el6.noarch
> 
> [root@ceph-mon03 ~]# 
> 
> 
> Joshua
> 
> 
> 
> On Thu, Jul 10, 2014 at 6:04 PM, Sage Weil  wrote:
>   Have you made any other changes after the upgrade?  (Like
>   adjusting
>   tunables, or creating EC pools?)
> 
>   See if there is anything in 'dmesg' output.
> 
>   sage
> 
>   On Thu, 10 Jul 2014, Joshua McClintock wrote:
> 
>   > I upgraded my cluster to .80.1-2 (CentOS).  My mount command
>   just freezes
>   > and outputs an error:
>   >
>   > mount.ceph 192.168.0.14,192.168.0.15,192.168.0.16:/ /us-west01
>   -o
>   > name=chefwks01,secret=`ceph-authtool -p -n client.admin
>   > /etc/ceph/us-west01.client.admin.keyring`
>   >
>   > mount error 5 = Input/output error
>   >
>   >
>   > Here's the output from 'ceph -s'
>   >
>   >
>   >     cluster xx
>   >
>   >      health HEALTH_OK
>   >
>   >      monmap e1: 3 mons at {ceph-mon01=192.168.0.14:6789/0,ceph-mon02=192.168.0.15:6789/0,ceph-mon03=192.168.0.16:6789/0}, election epoch 88, quorum 0,1,2
>   > ceph-mon01,ceph-mon02,ceph-mon03
>   >
>   >      mdsmap e26: 1/1/1 up {0=0=up:active}
>   >
>   >      osdmap e1371: 5 osds: 5 up, 5 in
>   >
>   >       pgmap v49431: 192 pgs, 3 pools, 135 GB data, 34733
>   objects
>   >
>   >             406 GB used, 1874 GB / 2281 GB avail
>   >
>   >                  192 active+clean
>   >
>   >
>   > I can see some packets being exchanged between the client and
>   the mon, but
>   > it's a pretty short exchange.
>   >
>   > Any ideas where to look next?
>   >
>   > Joshua
>   >
>   >
>   >
> 
> 
> 


[ceph-users] qemu image create failed

2014-07-10 Thread Yonghua Peng

Hi,

I tried to create a qemu image, but it failed.

ceph@ceph:~/my-cluster$ qemu-img create -f rbd rbd:rbd/qemu 2G
Formatting 'rbd:rbd/qemu', fmt=rbd size=2147483648 cluster_size=0
qemu-img: error connecting
qemu-img: rbd:rbd/qemu: error while creating rbd: Input/output error

Can you tell me what the problem is?

Thanks.
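
A hedged first debugging step, assuming cephx is enabled and the default
config paths: "error connecting" usually means librbd could not reach the
monitors or authenticate, so it may help to test plain rbd first and to point
qemu-img at the config and client id explicitly, e.g.:

  rbd -p rbd create qemu --size 2048
  qemu-img create -f rbd rbd:rbd/qemu:id=admin:conf=/etc/ceph/ceph.conf 2G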

--

We are hiring cloud Dev/Ops, more details please see: YY Cloud Jobs 





Re: [ceph-users] I have PGs that I can't deep-scrub

2014-07-10 Thread Chris Dunlop
Hi Craig,

On Thu, Jul 10, 2014 at 03:09:51PM -0700, Craig Lewis wrote:
> I fixed this issue by reformatting all of the OSDs.  I changed the mkfs
> options from
> 
> [osd]
>   osd mkfs type = xfs
>   osd mkfs options xfs = -l size=1024m -n size=64k -i size=2048 -s size=4096
> 
> to
> [osd]
>   osd mkfs type = xfs
>   osd mkfs options xfs = -s size=4096
> 
> (I have a mix of 512 and 4k sector drives, and I want to treat them all
> like 4k sector).
> 
> 
> Now deep scrub runs to completion, and CPU usage of the daemon never goes
> over 30%.  I did have to restart a few OSDs when I scrubbed known problem
> PGs, but they scrubbed the 2nd time successfully.  The cluster is still
> scrubbing, but it's completed half with no more issues.

I suspect it was the "-n size=64k" causing this behaviour, potentially using
too much CPU and starving the OSD processes:

http://xfs.org/index.php/XFS_FAQ#Q:_Performance:_mkfs.xfs_-n_size.3D64k_option
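
For anyone checking their own OSDs, a quick (hedged) way to see whether an
existing filesystem was built with the large directory block size is the
"naming" line of xfs_info; the path below assumes the default OSD mount point:

  xfs_info /var/lib/ceph/osd/ceph-0 | grep naming   # bsize=65536 means -n size=64k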

Cheers,

Chris


Re: [ceph-users] ceph mount not working anymore

2014-07-10 Thread Joshua McClintock
Yes, I changed some of the mount options on my osds (xfs mount options), but
I think this may be the answer from dmesg; it sorta looks like a version
mismatch:

libceph: loaded (mon/osd proto 15/24)

ceph: loaded (mds proto 32)

libceph: mon0 192.168.0.14:6789 feature set mismatch, my 4a042aca <
server's 104a042aca, missing 10

libceph: mon0 192.168.0.14:6789 socket error on read

libceph: mon2 192.168.0.16:6789 feature set mismatch, my 4a042aca <
server's 104a042aca, missing 10

libceph: mon2 192.168.0.16:6789 socket error on read

libceph: mon1 192.168.0.15:6789 feature set mismatch, my 4a042aca <
server's 104a042aca, missing 10

libceph: mon1 192.168.0.15:6789 socket error on read

libceph: mon0 192.168.0.14:6789 feature set mismatch, my 4a042aca <
server's 104a042aca, missing 10

libceph: mon0 192.168.0.14:6789 socket error on read

libceph: mon2 192.168.0.16:6789 feature set mismatch, my 4a042aca <
server's 104a042aca, missing 10

libceph: mon2 192.168.0.16:6789 socket error on read

libceph: mon1 192.168.0.15:6789 feature set mismatch, my 4a042aca <
server's 104a042aca, missing 10

libceph: mon1 192.168.0.15:6789 socket error on read


Maybe I didn't update as well as I thought I did.  I did hit every mon,
but I remember I couldn't upgrade to the new 'ceph' package because it
conflicted with 'python-ceph', so I uninstalled it (python-ceph) and then
upgraded to .80.1-2.  Maybe there's a subcomponent I missed?


Here's rpm -qa from the client:


[root@chefwks01 ~]# rpm -qa|grep ceph

ceph-deploy-1.5.2-0.noarch

ceph-release-1-0.el6.noarch

ceph-0.80.1-2.el6.x86_64

libcephfs1-0.80.1-0.el6.x86_64


Here's rpm -qa from the mons:


[root@ceph-mon01 ~]# rpm -qa|grep ceph

ceph-0.80.1-2.el6.x86_64

ceph-release-1-0.el6.noarch

libcephfs1-0.80.1-0.el6.x86_64

[root@ceph-mon01 ~]#


[root@ceph-mon02 ~]# rpm -qa|grep ceph

libcephfs1-0.80.1-0.el6.x86_64

ceph-0.80.1-2.el6.x86_64

ceph-release-1-0.el6.noarch

[root@ceph-mon02 ~]#


[root@ceph-mon03 ~]# rpm -qa|grep ceph

libcephfs1-0.80.1-0.el6.x86_64

ceph-0.80.1-2.el6.x86_64

ceph-release-1-0.el6.noarch

[root@ceph-mon03 ~]#


Joshua


On Thu, Jul 10, 2014 at 6:04 PM, Sage Weil  wrote:

> Have you made any other changes after the upgrade?  (Like adjusting
> tunables, or creating EC pools?)
>
> See if there is anything in 'dmesg' output.
>
> sage
>
> On Thu, 10 Jul 2014, Joshua McClintock wrote:
>
> > I upgraded my cluster to .80.1-2 (CentOS).  My mount command just freezes
> > and outputs an error:
> >
> > mount.ceph 192.168.0.14,192.168.0.15,192.168.0.16:/ /us-west01 -o
> > name=chefwks01,secret=`ceph-authtool -p -n client.admin
> > /etc/ceph/us-west01.client.admin.keyring`
> >
> > mount error 5 = Input/output error
> >
> >
> > Here's the output from 'ceph -s'
> >
> >
> > cluster xx
> >
> >  health HEALTH_OK
> >
> >  monmap e1: 3 mons at {ceph-mon01=192.168.0.14:6789/0,ceph-mon02=192.168.0.15:6789/0,ceph-mon03=192.168.0.16:6789/0}, election epoch 88, quorum 0,1,2
> > ceph-mon01,ceph-mon02,ceph-mon03
> >
> >  mdsmap e26: 1/1/1 up {0=0=up:active}
> >
> >  osdmap e1371: 5 osds: 5 up, 5 in
> >
> >   pgmap v49431: 192 pgs, 3 pools, 135 GB data, 34733 objects
> >
> > 406 GB used, 1874 GB / 2281 GB avail
> >
> >  192 active+clean
> >
> >
> > I can see some packets being exchanged between the client and the mon,
> but
> > it's a pretty short exchange.
> >
> > Any ideas where to look next?
> >
> > Joshua
> >
> >
> >
>


Re: [ceph-users] ceph mount not working anymore

2014-07-10 Thread Sage Weil
Have you made any other changes after the upgrade?  (Like adjusting 
tunables, or creating EC pools?)

See if there is anything in 'dmesg' output.

sage

On Thu, 10 Jul 2014, Joshua McClintock wrote:

> I upgraded my cluster to .80.1-2 (CentOS).  My mount command just freezes
> and outputs an error:
> 
> mount.ceph 192.168.0.14,192.168.0.15,192.168.0.16:/ /us-west01 -o
> name=chefwks01,secret=`ceph-authtool -p -n client.admin
> /etc/ceph/us-west01.client.admin.keyring`
> 
> mount error 5 = Input/output error
> 
> 
> Here's the output from 'ceph -s'
> 
> 
>     cluster xx
> 
>      health HEALTH_OK
> 
>      monmap e1: 3 mons at {ceph-mon01=192.168.0.14:6789/0,ceph-mon02=192.168.0.15:6789/0,ceph-mon03=192.168.0.16:6789/0}, election epoch 88, quorum 0,1,2
> ceph-mon01,ceph-mon02,ceph-mon03
> 
>      mdsmap e26: 1/1/1 up {0=0=up:active}
> 
>      osdmap e1371: 5 osds: 5 up, 5 in
> 
>       pgmap v49431: 192 pgs, 3 pools, 135 GB data, 34733 objects
> 
>             406 GB used, 1874 GB / 2281 GB avail
> 
>                  192 active+clean
> 
> 
> I can see some packets being exchanged between the client and the mon, but
> it's a pretty short exchange.
> 
> Any ideas where to look next?
> 
> Joshua
> 
> 


[ceph-users] ceph mount not working anymore

2014-07-10 Thread Joshua McClintock
I upgraded my cluster to .80.1-2 (CentOS).  My mount command just freezes
and outputs an error:

mount.ceph 192.168.0.14,192.168.0.15,192.168.0.16:/ /us-west01 -o
name=chefwks01,secret=`ceph-authtool -p -n client.admin
/etc/ceph/us-west01.client.admin.keyring`

mount error 5 = Input/output error


Here's the output from 'ceph -s'


cluster xx

 health HEALTH_OK

 monmap e1: 3 mons at {ceph-mon01=
192.168.0.14:6789/0,ceph-mon02=192.168.0.15:6789/0,ceph-mon03=192.168.0.16:6789/0},
election epoch 88, quorum 0,1,2 ceph-mon01,ceph-mon02,ceph-mon03

 mdsmap e26: 1/1/1 up {0=0=up:active}

 osdmap e1371: 5 osds: 5 up, 5 in

  pgmap v49431: 192 pgs, 3 pools, 135 GB data, 34733 objects

406 GB used, 1874 GB / 2281 GB avail

 192 active+clean


I can see some packets being exchanged between the client and the mon, but
it's a pretty short exchange.

Any ideas where to look next?

Joshua


Re: [ceph-users] Using large SSD cache tier instead of SSD

2014-07-10 Thread Lazuardi Nasution
Hi,

I prefer to use bcache or a similar local write-back cache on SSD, since it
only involves the local HDDs. I think it will reduce the risk of errors
during cache flushing compared to Ceph cache tiering, which still uses the
cluster network for flushing. After the data has been written to the cache
of an OSD, that OSD has finished the write and the rest is an internal OSD
process. Anyway, the performance testing report will be very interesting.
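
For what it's worth, a rough sketch of the per-disk setup I have in mind
(device names are placeholders and this is untested here):

  make-bcache -C /dev/nvme0n1              # format the SSD as a cache device
  make-bcache -B /dev/sdb                  # format the HDD as a backing device
  echo /dev/sdb > /sys/fs/bcache/register  # if udev has not registered it already
  bcache-super-show /dev/nvme0n1           # note the cache set UUID
  echo <cset-uuid> > /sys/block/bcache0/bcache/attach
  echo writeback > /sys/block/bcache0/bcache/cache_mode
  mkfs.xfs /dev/bcache0                    # then use /dev/bcache0 as the OSD data disk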

Best regards,


> Date: Tue, 8 Jul 2014 09:40:29 +
>> From: Somhegyi Benjamin 
>> To: Arne Wiebalck , James Harper
>> 
>> Cc: "ceph-users@lists.ceph.com" 
>> Subject: Re: [ceph-users] Using large SSD cache tier instead of SSD
>> journals?
>> Message-ID:
>> <
>> c8c8e269282a034ca2a0180cbe01c24001b62...@wdcsrv.wdc.wigner.mta.hu>
>> Content-Type: text/plain; charset="us-ascii"
>>
>> Hi Arne and James,
>>
>> Ah, I misunderstood James' suggestion. Using bcache w/ SSDs can be
>> another viable alternative to SSD journal partitions indeed.
>> I think ultimately I will need to test the options since very few people
>> have experience with cache tiering or bcache.
>>
>> Thanks,
>> Benjamin
>>
>> From: Arne Wiebalck [mailto:arne.wieba...@cern.ch]
>> Sent: Tuesday, July 08, 2014 11:27 AM
>> To: Somhegyi Benjamin
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Using large SSD cache tier instead of SSD
>> journals?
>>
>> Hi Benjamin,
>>
>> Unless I misunderstood, I think the suggestion was to use bcache devices
>> on the OSDs
>> (not on the clients), so what you use it for in the end doesn't really
>> matter.
>>
>> The setup of bcache devices is pretty similar to a mkfs and once set up,
>> bcache devices
>> come up and can be mounted as any other device.
>>
>> Cheers,
>>  Arne
>>
>> --
>> Arne Wiebalck
>> CERN IT
>>
>> On 08 Jul 2014, at 11:01, Somhegyi Benjamin <
>> somhegyi.benja...@wigner.mta.hu>
>> wrote:
>>
>>
>> Hi James,
>>
>> Yes, I've checked bcache, but as far as I can tell you need to manually
>> configure and register the backing devices and attach them to the cache
>> device, which is not really suitable to dynamic environment (like RBD
>> devices for cloud VMs).
>>
>> Benjamin
>>
>>
>>
>> -Original Message-
>> From: James Harper [mailto:ja...@ejbdigital.com.au]
>> Sent: Tuesday, July 08, 2014 10:17 AM
>> To: Somhegyi Benjamin; ceph-users@lists.ceph.com> ceph-users@lists.ceph.com>
>> Subject: RE: Using large SSD cache tier instead of SSD journals?
>>
>> Have you considered bcache? It's in the kernel since 3.10 I think.
>>
>> It would be interesting to see comparisons between no ssd, journal on
>> ssd, and bcache with ssd (with journal on same fs as osd)
>>
>> James
>>
>


Re: [ceph-users] ceph-users Digest, Vol 18, Issue 8

2014-07-10 Thread Lazuardi Nasution
Hi,

I prefer to use bcache or a similar local write-back cache on SSD, since it
only involves the local HDDs. I think it will reduce the risk of errors
during cache flushing compared to Ceph cache tiering, which still uses the
cluster network for flushing. After the data has been written to the cache
of an OSD, that OSD has finished the write and the rest is an internal OSD
process. Anyway, the performance testing report will be very interesting.

Best regards


> Date: Tue, 8 Jul 2014 09:40:29 +
> From: Somhegyi Benjamin 
> To: Arne Wiebalck , James Harper
> 
> Cc: "ceph-users@lists.ceph.com" 
> Subject: Re: [ceph-users] Using large SSD cache tier instead of SSD
> journals?
> Message-ID:
>  >
> Content-Type: text/plain; charset="us-ascii"
>
> Hi Arne and James,
>
> Ah, I misunderstood James' suggestion. Using bcache w/ SSDs can be another
> viable alternative to SSD journal partitions indeed.
> I think ultimately I will need to test the options since very few people
> have experience with cache tiering or bcache.
>
> Thanks,
> Benjamin
>
> From: Arne Wiebalck [mailto:arne.wieba...@cern.ch]
> Sent: Tuesday, July 08, 2014 11:27 AM
> To: Somhegyi Benjamin
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Using large SSD cache tier instead of SSD
> journals?
>
> Hi Benjamin,
>
> Unless I misunderstood, I think the suggestion was to use bcache devices
> on the OSDs
> (not on the clients), so what you use it for in the end doesn't really
> matter.
>
> The setup of bcache devices is pretty similar to a mkfs and once set up,
> bcache devices
> come up and can be mounted as any other device.
>
> Cheers,
>  Arne
>
> --
> Arne Wiebalck
> CERN IT
>
> On 08 Jul 2014, at 11:01, Somhegyi Benjamin <
> somhegyi.benja...@wigner.mta.hu>
> wrote:
>
>
> Hi James,
>
> Yes, I've checked bcache, but as far as I can tell you need to manually
> configure and register the backing devices and attach them to the cache
> device, which is not really suitable to dynamic environment (like RBD
> devices for cloud VMs).
>
> Benjamin
>
>
>
> -Original Message-
> From: James Harper [mailto:ja...@ejbdigital.com.au]
> Sent: Tuesday, July 08, 2014 10:17 AM
> To: Somhegyi Benjamin; ceph-users@lists.ceph.com ceph-users@lists.ceph.com>
> Subject: RE: Using large SSD cache tier instead of SSD journals?
>
> Have you considered bcache? It's in the kernel since 3.10 I think.
>
> It would be interesting to see comparisons between no ssd, journal on
> ssd, and bcache with ssd (with journal on same fs as osd)
>
> James
>


Re: [ceph-users] scrub error on firefly

2014-07-10 Thread Samuel Just
It could be an indication of a problem on osd 5, but the timing is
worrying.  Can you attach your ceph.conf?  Have there been any osds
going down, new osds added, anything to cause recovery?  Anything in
dmesg to indicate an fs problem?  Have you recently changed any
settings?
-Sam

On Thu, Jul 10, 2014 at 2:58 PM, Randy Smith  wrote:
> Greetings,
>
> Just a follow up on my original issue. =ceph pg repair ...= fixed the
> problem. However, today I got another inconsistent pg. It's interesting to
> me that this second error is in the same rbd image and appears to be "close"
> to the previously inconsistent pg. (Even more fun, osd.5 was the secondary
> in the first error and is the primary here though the other osd is
> different.)
>
> Is this indicative of a problem on osd.5 or perhaps a clue into what's
> causing firefly to be so inconsistent?
>
> The relevant log entries are below.
>
> 2014-07-07 18:50:48.646407 osd.2 192.168.253.70:6801/56987 163 : [ERR] 3.c6
> shard 2: soid 34dc35c6/rb.0.b0ce3.238e1f29.000b/head//3 digest
> 2256074002 != known digest 3998068918
> 2014-07-07 18:51:36.936076 osd.2 192.168.253.70:6801/56987 164 : [ERR] 3.c6
> deep-scrub 0 missing, 1 inconsistent objects
> 2014-07-07 18:51:36.936082 osd.2 192.168.253.70:6801/56987 165 : [ERR] 3.c6
> deep-scrub 1 errors
>
>
> 2014-07-10 15:38:53.990328 osd.5 192.168.253.81:6800/10013 257 : [ERR] 3.41
> shard 1: soid e183cc41/rb.0.b0ce3.238e1f29.024c/head//3 digest
> 3224286363 != known digest 3409342281
> 2014-07-10 15:39:11.701276 osd.5 192.168.253.81:6800/10013 258 : [ERR] 3.41
> deep-scrub 0 missing, 1 inconsistent objects
> 2014-07-10 15:39:11.701281 osd.5 192.168.253.81:6800/10013 259 : [ERR] 3.41
> deep-scrub 1 errors
>
>
>
> On Thu, Jul 10, 2014 at 12:05 PM, Chahal, Sudip 
> wrote:
>>
>> Thanks - so it appears that the advantage of the 3rd replica (relative to
>> 2 replicas) has to do much more with recovering from two concurrent OSD
>> failures than with inconsistencies found during deep scrub - would you
>> agree?
>>
>> Re: repair - do you mean the "repair" process during deep scrub  - if yes,
>> this is automatic - correct?
>> Or
>> Are you referring to the explicit manually initiated repair commands?
>>
>> Thanks,
>>
>> -Sudip
>>
>> -Original Message-
>> From: Samuel Just [mailto:sam.j...@inktank.com]
>> Sent: Thursday, July 10, 2014 10:50 AM
>> To: Chahal, Sudip
>> Cc: Christian Eichelmann; ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] scrub error on firefly
>>
>> Repair I think will tend to choose the copy with the lowest osd number
>> which is not obviously corrupted.  Even with three replicas, it does not do
>> any kind of voting at this time.
>> -Sam
>>
>> On Thu, Jul 10, 2014 at 10:39 AM, Chahal, Sudip 
>> wrote:
>> > I've a basic related question re: Firefly operation - would appreciate
>> > any insights:
>> >
>> > With three replicas, if checksum inconsistencies across replicas are
>> > found during deep-scrub then:
>> > a.  does the majority win or is the primary always the winner
>> > and used to overwrite the secondaries
>> > b. is this reconciliation done automatically during
>> > deep-scrub or does each reconciliation have to be executed manually by the
>> > administrator?
>> >
>> > With 2 replicas - how are things different (if at all):
>> >a. The primary is declared the winner - correct?
>> >b. is this reconciliation done automatically during
>> > deep-scrub or does it have to be done "manually" because there is no
>> > majority?
>> >
>> > Thanks,
>> >
>> > -Sudip
>> >
>> >
>> > -Original Message-
>> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>> > Of Samuel Just
>> > Sent: Thursday, July 10, 2014 10:16 AM
>> > To: Christian Eichelmann
>> > Cc: ceph-users@lists.ceph.com
>> > Subject: Re: [ceph-users] scrub error on firefly
>> >
>> > Can you attach your ceph.conf for your osds?
>> > -Sam
>> >
>> > On Thu, Jul 10, 2014 at 8:01 AM, Christian Eichelmann
>> >  wrote:
>> >> I can also confirm that after upgrading to firefly both of our
>> >> clusters (test and live) were going from 0 scrub errors each for
>> >> about
>> >> 6 Month to about 9-12 per week...
>> >> This also makes me kind of nervous, since as far as I know everything
>> >> "ceph pg repair" does, is to copy the primary object to all replicas,
>> >> no matter which object is the correct one.
>> >> Of course the described method of manual checking works (for pools
>> >> with more than 2 replicas), but doing this in a large cluster nearly
>> >> every week is horribly timeconsuming and error prone.
>> >> It would be great to get an explanation for the increased numbers of
>> >> scrub errors since firefly. Were they just not detected correctly in
>> >> previous versions? Or is there maybe something wrong with the new code?
>> >>
> >> Actually, our company is currently preventing our projects from moving to
> >> ceph because of this problem

[ceph-users] logrotate

2014-07-10 Thread James Eckersall
Hi,

I've just upgraded a ceph cluster from Ubuntu 12.04 with 0.73.1 to Ubuntu
14.04 with 0.80.1.

I've noticed that the log rotation doesn't appear to work correctly.
The OSDs are just not logging to the current ceph-osd-X.log file.
If I restart the OSDs, they start logging, but then overnight they stop
logging again when the logs are rotated.

Has anyone else noticed a problem with this?
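
One hedged guess at a workaround, assuming the packaged logrotate script's
reload hook simply isn't firing on 14.04: the daemons reopen their logs on
SIGHUP, so a postrotate along these lines should get them writing to the
current files again:

  /var/log/ceph/*.log {
      daily
      rotate 7
      compress
      missingok
      notifempty
      sharedscripts
      postrotate
          killall -q -1 ceph-mon ceph-osd ceph-mds || true
      endscript
  }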


Re: [ceph-users] I have PGs that I can't deep-scrub

2014-07-10 Thread Craig Lewis
I fixed this issue by reformatting all of the OSDs.  I changed the mkfs
options from

[osd]
  osd mkfs type = xfs
  osd mkfs options xfs = -l size=1024m -n size=64k -i size=2048 -s size=4096

to
[osd]
  osd mkfs type = xfs
  osd mkfs options xfs = -s size=4096

(I have a mix of 512 and 4k sector drives, and I want to treat them all
like 4k sector).


Now deep scrub runs to completion, and CPU usage of the daemon never goes
over 30%.  I did have to restart a few OSDs when I scrubbed known problem
PGs, but they scrubbed the 2nd time successfully.  The cluster is still
scrubbing, but it's completed half with no more issues.



It took me a long time to correlate the "XFS: possible memory allocation
deadlock in kmem_alloc" message in dmesg to OSD problems.  It was only when
I started having these deep-scrub issues that the XFS deadlock messages
were well correlated with OSD issues.

Looking back at previous issues I had with OSDs flapping, the XFS deadlocks
were present, but usually preceded the issues by several hours.


I strongly recommend that anybody who sees "XFS: possible memory allocation
deadlock in kmem_alloc" in dmesg reformat their XFS filesystems.  It's
painful, but my cluster has been rock solid since I finished.
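
For anyone wanting to check whether they're affected before going that far, a
hedged one-liner (the warning only appears when an allocation is actually
struggling, so an empty result doesn't prove the filesystem is fine):

  dmesg | grep -i "possible memory allocation deadlock"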





On Wed, Jun 11, 2014 at 2:23 PM, Craig Lewis 
wrote:

> New logs, with debug ms = 1, debug osd = 20.
>
>
> In this timeline, I started the deep-scrub at 11:04:00  Ceph start
> deep-scrubing at 11:04:03.
>
> osd.11 started consuming 100% CPU around 11:07.  Same for osd.0.  CPU
> usage is all user; iowait is < 0.10%.  There is more variance in the
> CPU usage now, ranging between 98.5% and 101.2%
>
> This time, I didn't see any major IO, read or write.
>
> osd.11 was marked down at 11:22:00:
> 2014-06-11 11:22:00.820118 mon.0 [INF] osd.11 marked down after no pg
> stats for 902.656777seconds
>
> osd.0 was marked down at 11:36:00:
>  2014-06-11 11:36:00.890869 mon.0 [INF] osd.0 marked down after no pg
> stats for 902.498894seconds
>
>
>
>
> ceph.conf: https://cd.centraldesktop.com/p/eAAADwbcABIDZuE
> ceph-osd.0.log.gz
> 
> (140MiB, 18MiB compressed):
> https://cd.centraldesktop.com/p/eAAADwbdAHnmhFQ
> ceph-osd.11.log.gz (131MiB, 17MiB compressed):
> https://cd.centraldesktop.com/p/eAAADwbeAEUR9AI
> ceph pg 40.11e query:
> https://cd.centraldesktop.com/p/eAAADwbfAEJcwvc
>
>
>
>
>
> On Wed, Jun 11, 2014 at 5:42 AM, Sage Weil  wrote:
> > Hi Craig,
> >
> > It's hard to say what is going wrong with that level of logs.  Can you
> > reproduce with debug ms = 1 and debug osd = 20?
> >
> > There were a few things fixed in scrub between emperor and firefly.  Are
> > you planning on upgrading soon?
> >
> > sage
> >
> >
> > On Tue, 10 Jun 2014, Craig Lewis wrote:
> >
> >> Every time I deep-scrub one PG, all of the OSDs responsible get kicked
> >> out of the cluster.  I've deep-scrubbed this PG 4 times now, and it
> >> fails the same way every time.  OSD logs are linked at the bottom.
> >>
> >> What can I do to get this deep-scrub to complete cleanly?
> >>
> >> This is the first time I've deep-scrubbed these PGs since Sage helped
> >> me recover from some OSD problems
> >> (
> http://t53277.file-systems-ceph-development.file-systemstalk.info/70-osd-are-down-and-not-coming-up-t53277.html
> )
> >>
> >> I can trigger the issue easily in this cluster, but have not been able
> >> to re-create in other clusters.
> >>
> >>
> >>
> >>
> >>
> >>
> >> The PG stats for this PG say that last_deep_scrub and deep_scrub_stamp
> >> are 48009'1904117 2014-05-21 07:28:01.315996 respectively.  This PG is
> >> owned by OSDs [11,0]
> >>
> >> This is a secondary cluster, so I stopped all external I/O on it.  I
> >> set nodeep-scrub, and restarted both OSDs with:
> >>   debug osd = 5/5
> >>   debug filestore = 5/5
> >>   debug journal = 1
> >>   debug monc = 20/20
> >>
> >> then I ran a deep-scrub on this PG.
> >>
> >> 2014-06-10 10:47:50.881783 mon.0 [INF] pgmap v8832020: 2560 pgs: 2555
> >> active+clean, 5 active+clean+scrubbing; 27701 GB data, 56218 GB used,
> >> 77870 GB / 130 TB avail
> >> 2014-06-10 10:47:54.039829 mon.0 [INF] pgmap v8832021: 2560 pgs: 2554
> >> active+clean, 5 active+clean+scrubbing, 1 active+clean+scrubbing+deep;
> >> 27701 GB data, 56218 GB used, 77870 GB / 130 TB avail
> >>
> >>
> >> At 10:49:09, I see ceph-osd for both 11 and 0 spike to 100% CPU
> >> (100.3% +/- 1.0%).  Prior to this, they were both using ~30% CPU.  It
> >> might've started a few seconds sooner, I'm watching top.
> >>
> >> I forgot to watch IO stat until 10:56.  At this point, both OSDs are
> >> reading.  iostat reports that they're both doing ~100
> >> transactions/sec, reading ~1 MiBps, 0 writes.
> >>
> >>
> >> At 11:01:26, iostat reports that both osds are no longer consuming any
> >> disk I/O.  They both go for > 30 seconds with 0 transactions, and 0
> >> kiB read/write.  There are small bumps of 2 transactions/sec for

Re: [ceph-users] scrub error on firefly

2014-07-10 Thread Randy Smith
Greetings,

Just a follow up on my original issue. =ceph pg repair ...= fixed the
problem. However, today I got another inconsistent pg. It's interesting to
me that this second error is in the same rbd image and appears to be
"close" to the previously inconsistent pg. (Even more fun, osd.5 was the
secondary in the first error and is the primary here though the other osd
is different.)

Is this indicative of a problem on osd.5 or perhaps a clue into what's
causing firefly to be so inconsistent?

The relevant log entries are below.

2014-07-07 18:50:48.646407 osd.2 192.168.253.70:6801/56987 163 : [ERR] 3.c6
shard 2: soid 34dc35c6/rb.0.b0ce3.238e1f29.000b/head//3 digest
2256074002 != known digest 3998068918
2014-07-07 18:51:36.936076 osd.2 192.168.253.70:6801/56987 164 : [ERR] 3.c6
deep-scrub 0 missing, 1 inconsistent objects
2014-07-07 18:51:36.936082 osd.2 192.168.253.70:6801/56987 165 : [ERR] 3.c6
deep-scrub 1 errors


2014-07-10 15:38:53.990328 osd.5 192.168.253.81:6800/10013 257 : [ERR] 3.41
shard 1: soid e183cc41/rb.0.b0ce3.238e1f29.024c/head//3 digest
3224286363 != known digest 3409342281
2014-07-10 15:39:11.701276 osd.5 192.168.253.81:6800/10013 258 : [ERR] 3.41
deep-scrub 0 missing, 1 inconsistent objects
2014-07-10 15:39:11.701281 osd.5 192.168.253.81:6800/10013 259 : [ERR] 3.41
deep-scrub 1 errors
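
In case it's useful, the (hedged) sequence I plan to use to confirm and clear
it, with the pg id taken from the log above:

  ceph health detail        # shows which pg is inconsistent and its acting set
  ceph pg repair 3.41       # after checking the copies, as discussed below
  ceph pg deep-scrub 3.41   # re-scrub to confirm the error is gone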



On Thu, Jul 10, 2014 at 12:05 PM, Chahal, Sudip 
wrote:

> Thanks - so it appears that the advantage of the 3rd replica (relative to
> 2 replicas) has to do much more with recovering from two concurrent OSD
> failures than with inconsistencies found during deep scrub - would you
> agree?
>
> Re: repair - do you mean the "repair" process during deep scrub  - if yes,
> this is automatic - correct?
> Or
> Are you referring to the explicit manually initiated repair commands?
>
> Thanks,
>
> -Sudip
>
> -Original Message-
> From: Samuel Just [mailto:sam.j...@inktank.com]
> Sent: Thursday, July 10, 2014 10:50 AM
> To: Chahal, Sudip
> Cc: Christian Eichelmann; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] scrub error on firefly
>
> Repair I think will tend to choose the copy with the lowest osd number
> which is not obviously corrupted.  Even with three replicas, it does not do
> any kind of voting at this time.
> -Sam
>
> On Thu, Jul 10, 2014 at 10:39 AM, Chahal, Sudip 
> wrote:
> > I've a basic related question re: Firefly operation - would appreciate
> any insights:
> >
> > With three replicas, if checksum inconsistencies across replicas are
> found during deep-scrub then:
> > a.  does the majority win or is the primary always the winner
> and used to overwrite the secondaries
> > b. is this reconciliation done automatically during
> deep-scrub or does each reconciliation have to be executed manually by the
> administrator?
> >
> > With 2 replicas - how are things different (if at all):
> >a. The primary is declared the winner - correct?
> >b. is this reconciliation done automatically during
> deep-scrub or does it have to be done "manually" because there is no
> majority?
> >
> > Thanks,
> >
> > -Sudip
> >
> >
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Samuel Just
> > Sent: Thursday, July 10, 2014 10:16 AM
> > To: Christian Eichelmann
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] scrub error on firefly
> >
> > Can you attach your ceph.conf for your osds?
> > -Sam
> >
> > On Thu, Jul 10, 2014 at 8:01 AM, Christian Eichelmann <
> christian.eichelm...@1und1.de> wrote:
> >> I can also confirm that after upgrading to firefly both of our
> >> clusters (test and live) were going from 0 scrub errors each for
> >> about
> >> 6 Month to about 9-12 per week...
> >> This also makes me kind of nervous, since as far as I know everything
> >> "ceph pg repair" does, is to copy the primary object to all replicas,
> >> no matter which object is the correct one.
> >> Of course the described method of manual checking works (for pools
> >> with more than 2 replicas), but doing this in a large cluster nearly
> >> every week is horribly timeconsuming and error prone.
> >> It would be great to get an explanation for the increased numbers of
> >> scrub errors since firefly. Were they just not detected correctly in
> >> previous versions? Or is there maybe something wrong with the new code?
> >>
> >> Acutally, our company is currently preventing our projects to move to
> >> ceph because of this problem.
> >>
> >> Regards,
> >> Christian
> >> 
> >> Von: ceph-users [ceph-users-boun...@lists.ceph.com]" im Auftrag von
> >> "Travis Rhoden [trho...@gmail.com]
> >> Gesendet: Donnerstag, 10. Juli 2014 16:24
> >> An: Gregory Farnum
> >> Cc: ceph-users@lists.ceph.com
> >> Betreff: Re: [ceph-users] scrub error on firefly
> >>
> >> And actually just to follow-up, it does seem like there are some
> >> additional smarts beyond just using 

Re: [ceph-users] scrub error on firefly

2014-07-10 Thread Chahal, Sudip
Thanks - so it appears that the advantage of the 3rd replica (relative to 2 
replicas) has to do much more with recovering from two concurrent OSD failures 
than with inconsistencies found during deep scrub - would you agree?

Re: repair - do you mean the "repair" process during deep scrub  - if yes, this 
is automatic - correct?  
Or 
Are you referring to the explicit manually initiated repair commands?

Thanks,

-Sudip

-Original Message-
From: Samuel Just [mailto:sam.j...@inktank.com] 
Sent: Thursday, July 10, 2014 10:50 AM
To: Chahal, Sudip
Cc: Christian Eichelmann; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] scrub error on firefly

Repair I think will tend to choose the copy with the lowest osd number which is 
not obviously corrupted.  Even with three replicas, it does not do any kind of 
voting at this time.
-Sam

On Thu, Jul 10, 2014 at 10:39 AM, Chahal, Sudip  wrote:
> I've a basic related question re: Firefly operation - would appreciate any 
> insights:
>
> With three replicas, if checksum inconsistencies across replicas are found 
> during deep-scrub then:
> a.  does the majority win or is the primary always the winner and 
> used to overwrite the secondaries
> b. is this reconciliation done automatically during 
> deep-scrub or does each reconciliation have to be executed manually by the 
> administrator?
>
> With 2 replicas - how are things different (if at all):
>a. The primary is declared the winner - correct?
>b. is this reconciliation done automatically during deep-scrub 
> or does it have to be done "manually" because there is no majority?
>
> Thanks,
>
> -Sudip
>
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf 
> Of Samuel Just
> Sent: Thursday, July 10, 2014 10:16 AM
> To: Christian Eichelmann
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] scrub error on firefly
>
> Can you attach your ceph.conf for your osds?
> -Sam
>
> On Thu, Jul 10, 2014 at 8:01 AM, Christian Eichelmann 
>  wrote:
>> I can also confirm that after upgrading to firefly both of our 
>> clusters (test and live) were going from 0 scrub errors each for 
>> about
>> 6 Month to about 9-12 per week...
>> This also makes me kind of nervous, since as far as I know everything 
>> "ceph pg repair" does, is to copy the primary object to all replicas, 
>> no matter which object is the correct one.
>> Of course the described method of manual checking works (for pools 
>> with more than 2 replicas), but doing this in a large cluster nearly 
>> every week is horribly timeconsuming and error prone.
>> It would be great to get an explanation for the increased numbers of 
>> scrub errors since firefly. Were they just not detected correctly in 
>> previous versions? Or is there maybe something wrong with the new code?
>>
>> Actually, our company is currently preventing our projects from moving to
>> ceph because of this problem.
>>
>> Regards,
>> Christian
>> 
>> Von: ceph-users [ceph-users-boun...@lists.ceph.com]" im Auftrag von 
>> "Travis Rhoden [trho...@gmail.com]
>> Gesendet: Donnerstag, 10. Juli 2014 16:24
>> An: Gregory Farnum
>> Cc: ceph-users@lists.ceph.com
>> Betreff: Re: [ceph-users] scrub error on firefly
>>
>> And actually just to follow-up, it does seem like there are some 
>> additional smarts beyond just using the primary to overwrite the 
>> secondaries...  Since I captured md5 sums before and after the 
>> repair, I can say that in this particular instance, the secondary copy was 
>> used to overwrite the primary.
>> So, I'm just trusting Ceph to the right thing, and so far it seems 
>> to, but the comments here about needing to determine the correct 
>> object and place it on the primary PG make me wonder if I've been missing 
>> something.
>>
>>  - Travis
>>
>>
>> On Thu, Jul 10, 2014 at 10:19 AM, Travis Rhoden  wrote:
>>>
>>> I can also say that after a recent upgrade to Firefly, I have 
>>> experienced massive uptick in scrub errors.  The cluster was on 
>>> cuttlefish for about a year, and had maybe one or two scrub errors.
>>> After upgrading to Firefly, we've probably seen 3 to 4 dozen in the 
>>> last month or so (was getting 2-3 a day for a few weeks until the whole 
>>> cluster was rescrubbed, it seemed).
>>>
>>> What I cannot determine, however, is how to know which object is busted?
>>> For example, just today I ran into a scrub error.  The object has 
>>> two copies and is an 8MB piece of an RBD, and has identical 
>>> timestamps, identical xattrs names and values.  But it definitely 
>>> has a different
>>> MD5 sum. How to know which one is correct?
>>>
>>> I've been just kicking off pg repair each time, which seems to just 
>>> use the primary copy to overwrite the others.  Haven't run into any 
>>> issues with that so far, but it does make me nervous.
>>>
>>>  - Travis
>>>
>>>
>>> On Tue, Jul 8, 2014 at 1:06 AM, Gregory Farnum  wrote

Re: [ceph-users] scrub error on firefly

2014-07-10 Thread Samuel Just
Repair I think will tend to choose the copy with the lowest osd number
which is not obviously corrupted.  Even with three replicas, it does
not do any kind of voting at this time.
-Sam
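
A hedged sketch of checking the copies by hand before repairing, assuming
default OSD data paths and using the pg/object from the scrub error reported
earlier in this thread; run the find/md5sum on each host holding the pg and
compare before deciding which copy to keep:

  ceph pg map 3.41    # shows the up/acting OSDs for the pg
  find /var/lib/ceph/osd/ceph-5/current/3.41_head -name '*rb.0.b0ce3*024c*'
  md5sum /var/lib/ceph/osd/ceph-5/current/3.41_head/<object-file>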

On Thu, Jul 10, 2014 at 10:39 AM, Chahal, Sudip  wrote:
> I've a basic related question re: Firefly operation - would appreciate any 
> insights:
>
> With three replicas, if checksum inconsistencies across replicas are found 
> during deep-scrub then:
> a.  does the majority win or is the primary always the winner and 
> used to overwrite the secondaries
> b. is this reconciliation done automatically during 
> deep-scrub or does each reconciliation have to be executed manually by the 
> administrator?
>
> With 2 replicas - how are things different (if at all):
>a. The primary is declared the winner - correct?
>b. is this reconciliation done automatically during deep-scrub 
> or does it have to be done "manually" because there is no majority?
>
> Thanks,
>
> -Sudip
>
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Samuel Just
> Sent: Thursday, July 10, 2014 10:16 AM
> To: Christian Eichelmann
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] scrub error on firefly
>
> Can you attach your ceph.conf for your osds?
> -Sam
>
> On Thu, Jul 10, 2014 at 8:01 AM, Christian Eichelmann 
>  wrote:
>> I can also confirm that after upgrading to firefly both of our
>> clusters (test and live) were going from 0 scrub errors each for about
>> 6 Month to about 9-12 per week...
>> This also makes me kind of nervous, since as far as I know everything
>> "ceph pg repair" does, is to copy the primary object to all replicas,
>> no matter which object is the correct one.
>> Of course the described method of manual checking works (for pools
>> with more than 2 replicas), but doing this in a large cluster nearly
>> every week is horribly timeconsuming and error prone.
>> It would be great to get an explanation for the increased numbers of
>> scrub errors since firefly. Were they just not detected correctly in
>> previous versions? Or is there maybe something wrong with the new code?
>>
>> Actually, our company is currently preventing our projects from moving to
>> ceph because of this problem.
>>
>> Regards,
>> Christian
>> 
>> Von: ceph-users [ceph-users-boun...@lists.ceph.com]" im Auftrag von
>> "Travis Rhoden [trho...@gmail.com]
>> Gesendet: Donnerstag, 10. Juli 2014 16:24
>> An: Gregory Farnum
>> Cc: ceph-users@lists.ceph.com
>> Betreff: Re: [ceph-users] scrub error on firefly
>>
>> And actually just to follow-up, it does seem like there are some
>> additional smarts beyond just using the primary to overwrite the
>> secondaries...  Since I captured md5 sums before and after the repair,
>> I can say that in this particular instance, the secondary copy was used to 
>> overwrite the primary.
>> So, I'm just trusting Ceph to the right thing, and so far it seems to,
>> but the comments here about needing to determine the correct object
>> and place it on the primary PG make me wonder if I've been missing something.
>>
>>  - Travis
>>
>>
>> On Thu, Jul 10, 2014 at 10:19 AM, Travis Rhoden  wrote:
>>>
>>> I can also say that after a recent upgrade to Firefly, I have
>>> experienced massive uptick in scrub errors.  The cluster was on
>>> cuttlefish for about a year, and had maybe one or two scrub errors.
>>> After upgrading to Firefly, we've probably seen 3 to 4 dozen in the
>>> last month or so (was getting 2-3 a day for a few weeks until the whole 
>>> cluster was rescrubbed, it seemed).
>>>
>>> What I cannot determine, however, is how to know which object is busted?
>>> For example, just today I ran into a scrub error.  The object has two
>>> copies and is an 8MB piece of an RBD, and has identical timestamps,
>>> identical xattrs names and values.  But it definitely has a different
>>> MD5 sum. How to know which one is correct?
>>>
>>> I've been just kicking off pg repair each time, which seems to just
>>> use the primary copy to overwrite the others.  Haven't run into any
>>> issues with that so far, but it does make me nervous.
>>>
>>>  - Travis
>>>
>>>
>>> On Tue, Jul 8, 2014 at 1:06 AM, Gregory Farnum  wrote:

 It's not very intuitive or easy to look at right now (there are
 plans from the recent developer summit to improve things), but the
 central log should have output about exactly what objects are
 busted. You'll then want to compare the copies manually to determine
 which ones are good or bad, get the good copy on the primary (make
 sure you preserve xattrs), and run repair.
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com


 On Mon, Jul 7, 2014 at 6:48 PM, Randy Smith  wrote:
 > Greetings,
 >
 > I upgraded to firefly last week and I suddenly received this error:
 >
 > health HEALTH_ERR 1 pgs inconsiste

Re: [ceph-users] scrub error on firefly

2014-07-10 Thread Chahal, Sudip
I've a basic related question re: Firefly operation - would appreciate any 
insights:

With three replicas, if checksum inconsistencies across replicas are found 
during deep-scrub then:
a.  does the majority win or is the primary always the winner and used 
to overwrite the secondaries
b. is this reconciliation done automatically during deep-scrub 
or does each reconciliation have to be executed manually by the administrator?

With 2 replicas - how are things different (if at all):
   a. The primary is declared the winner - correct?
   b. is this reconciliation done automatically during deep-scrub 
or does it have to be done "manually" because there is no majority?

Thanks,

-Sudip

 
-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Samuel 
Just
Sent: Thursday, July 10, 2014 10:16 AM
To: Christian Eichelmann
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] scrub error on firefly

Can you attach your ceph.conf for your osds?
-Sam

On Thu, Jul 10, 2014 at 8:01 AM, Christian Eichelmann 
 wrote:
> I can also confirm that after upgrading to firefly both of our 
> clusters (test and live) were going from 0 scrub errors each for about 
> 6 Month to about 9-12 per week...
> This also makes me kind of nervous, since as far as I know everything 
> "ceph pg repair" does, is to copy the primary object to all replicas, 
> no matter which object is the correct one.
> Of course the described method of manual checking works (for pools 
> with more than 2 replicas), but doing this in a large cluster nearly 
> every week is horribly timeconsuming and error prone.
> It would be great to get an explanation for the increased numbers of 
> scrub errors since firefly. Were they just not detected correctly in 
> previous versions? Or is there maybe something wrong with the new code?
>
> Actually, our company is currently preventing our projects from moving to
> ceph because of this problem.
>
> Regards,
> Christian
> 
> Von: ceph-users [ceph-users-boun...@lists.ceph.com]" im Auftrag von 
> "Travis Rhoden [trho...@gmail.com]
> Gesendet: Donnerstag, 10. Juli 2014 16:24
> An: Gregory Farnum
> Cc: ceph-users@lists.ceph.com
> Betreff: Re: [ceph-users] scrub error on firefly
>
> And actually just to follow-up, it does seem like there are some 
> additional smarts beyond just using the primary to overwrite the 
> secondaries...  Since I captured md5 sums before and after the repair, 
> I can say that in this particular instance, the secondary copy was used to 
> overwrite the primary.
> So, I'm just trusting Ceph to the right thing, and so far it seems to, 
> but the comments here about needing to determine the correct object 
> and place it on the primary PG make me wonder if I've been missing something.
>
>  - Travis
>
>
> On Thu, Jul 10, 2014 at 10:19 AM, Travis Rhoden  wrote:
>>
>> I can also say that after a recent upgrade to Firefly, I have 
>> experienced massive uptick in scrub errors.  The cluster was on 
>> cuttlefish for about a year, and had maybe one or two scrub errors.  
>> After upgrading to Firefly, we've probably seen 3 to 4 dozen in the 
>> last month or so (was getting 2-3 a day for a few weeks until the whole 
>> cluster was rescrubbed, it seemed).
>>
>> What I cannot determine, however, is how to know which object is busted?
>> For example, just today I ran into a scrub error.  The object has two 
>> copies and is an 8MB piece of an RBD, and has identical timestamps, 
>> identical xattrs names and values.  But it definitely has a different 
>> MD5 sum. How to know which one is correct?
>>
>> I've been just kicking off pg repair each time, which seems to just 
>> use the primary copy to overwrite the others.  Haven't run into any 
>> issues with that so far, but it does make me nervous.
>>
>>  - Travis
>>
>>
>> On Tue, Jul 8, 2014 at 1:06 AM, Gregory Farnum  wrote:
>>>
>>> It's not very intuitive or easy to look at right now (there are 
>>> plans from the recent developer summit to improve things), but the 
>>> central log should have output about exactly what objects are 
>>> busted. You'll then want to compare the copies manually to determine 
>>> which ones are good or bad, get the good copy on the primary (make 
>>> sure you preserve xattrs), and run repair.
>>> -Greg
>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>
>>>
>>> On Mon, Jul 7, 2014 at 6:48 PM, Randy Smith  wrote:
>>> > Greetings,
>>> >
>>> > I upgraded to firefly last week and I suddenly received this error:
>>> >
>>> > health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>>> >
>>> > ceph health detail shows the following:
>>> >
>>> > HEALTH_ERR 1 pgs inconsistent; 1 scrub errors pg 3.c6 is 
>>> > active+clean+inconsistent, acting [2,5]
>>> > 1 scrub errors
>>> >
>>> > The docs say that I can run `ceph pg repair 3.c6` to fix this. 
>>> > What I want to know is what are the risks of data loss if I 

Re: [ceph-users] scrub error on firefly

2014-07-10 Thread Samuel Just
Can you attach your ceph.conf for your osds?
-Sam

On Thu, Jul 10, 2014 at 8:01 AM, Christian Eichelmann
 wrote:
> I can also confirm that after upgrading to firefly both of our clusters
> (test and live) were going from 0 scrub errors each for about 6 Month to
> about 9-12 per week...
> This also makes me kind of nervous, since as far as I know everything "ceph
> pg repair" does, is to copy the primary object to all replicas, no matter
> which object is the correct one.
> Of course the described method of manual checking works (for pools with more
> than 2 replicas), but doing this in a large cluster nearly every week is
> horribly timeconsuming and error prone.
> It would be great to get an explanation for the increased numbers of scrub
> errors since firefly. Were they just not detected correctly in previous
> versions? Or is there maybe something wrong with the new code?
>
> Actually, our company is currently preventing our projects from moving to ceph
> because of this problem.
>
> Regards,
> Christian
> 
> Von: ceph-users [ceph-users-boun...@lists.ceph.com]" im Auftrag von "Travis
> Rhoden [trho...@gmail.com]
> Gesendet: Donnerstag, 10. Juli 2014 16:24
> An: Gregory Farnum
> Cc: ceph-users@lists.ceph.com
> Betreff: Re: [ceph-users] scrub error on firefly
>
> And actually just to follow-up, it does seem like there are some additional
> smarts beyond just using the primary to overwrite the secondaries...  Since
> I captured md5 sums before and after the repair, I can say that in this
> particular instance, the secondary copy was used to overwrite the primary.
> So, I'm just trusting Ceph to the right thing, and so far it seems to, but
> the comments here about needing to determine the correct object and place it
> on the primary PG make me wonder if I've been missing something.
>
>  - Travis
>
>
> On Thu, Jul 10, 2014 at 10:19 AM, Travis Rhoden  wrote:
>>
>> I can also say that after a recent upgrade to Firefly, I have experienced
>> massive uptick in scrub errors.  The cluster was on cuttlefish for about a
>> year, and had maybe one or two scrub errors.  After upgrading to Firefly,
>> we've probably seen 3 to 4 dozen in the last month or so (was getting 2-3 a
>> day for a few weeks until the whole cluster was rescrubbed, it seemed).
>>
>> What I cannot determine, however, is how to know which object is busted?
>> For example, just today I ran into a scrub error.  The object has two copies
>> and is an 8MB piece of an RBD, and has identical timestamps, identical
>> xattrs names and values.  But it definitely has a different MD5 sum. How to
>> know which one is correct?
>>
>> I've been just kicking off pg repair each time, which seems to just use
>> the primary copy to overwrite the others.  Haven't run into any issues with
>> that so far, but it does make me nervous.
>>
>>  - Travis
>>
>>
>> On Tue, Jul 8, 2014 at 1:06 AM, Gregory Farnum  wrote:
>>>
>>> It's not very intuitive or easy to look at right now (there are plans
>>> from the recent developer summit to improve things), but the central
>>> log should have output about exactly what objects are busted. You'll
>>> then want to compare the copies manually to determine which ones are
>>> good or bad, get the good copy on the primary (make sure you preserve
>>> xattrs), and run repair.
>>> -Greg
>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>
>>>
>>> On Mon, Jul 7, 2014 at 6:48 PM, Randy Smith  wrote:
>>> > Greetings,
>>> >
>>> > I upgraded to firefly last week and I suddenly received this error:
>>> >
>>> > health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>>> >
>>> > ceph health detail shows the following:
>>> >
>>> > HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>>> > pg 3.c6 is active+clean+inconsistent, acting [2,5]
>>> > 1 scrub errors
>>> >
>>> > The docs say that I can run `ceph pg repair 3.c6` to fix this. What I
>>> > want
>>> > to know is what are the risks of data loss if I run that command in
>>> > this
>>> > state and how can I mitigate them?
>>> >
>>> > --
>>> > Randall Smith
>>> > Computing Services
>>> > Adams State University
>>> > http://www.adams.edu/
>>> > 719-587-7741
>>> >
>>> > ___
>>> > ceph-users mailing list
>>> > ceph-users@lists.ceph.com
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Suggested best practise for Ceph node online/offline?

2014-07-10 Thread Gregory Farnum
On Thu, Jul 10, 2014 at 9:04 AM, Joe Hewitt  wrote:
> Hi there
> Recently I hit a problem triggered by rebooting Ceph nodes, which eventually
> ended with me rebuilding the cluster from the ground up. The too-long-didn't-read
> question is: are there suggested best practices for taking a Ceph node offline/online?
>
> Following the official ceph doc, I set up a 4 node ceph (firefly) cluster in
> our lab last week. It consists of 1 admin, 1 mon and 2 osd nodes. All 4
> nodes are physical servers running Ubuntu 14.04, no virtual machines used.
> The mds and the 3rd osd are actually on the monitor node. Everything looked
> okay at that point: 'ceph -w' gave me health_ok and I could see all the available
> storage capacity. I was happily writing Python code using the Ceph S3 API.
>
> This Monday I apt-get upgraded all those 4 machines and then rebooted them.
> Once all 4 were back online, ceph -w gave some errors about the IP address of
> the monitor node and some error messages like "{timestamp} 7fb6456f2500  0 --
> {monitor IP address}:0/2924 .. pipe(0x5516270 sd=97 :0 s=1 pgs=0 cs=0
> l=1c=0x5c54e20).fault"
> (sorry didn't save exact error logs since I've rebuilt the cluster, my
> mistake :-( ).
>
> Due to the lab's policy, only DHCP is allowed, so I updated the monitor IP address
> in /etc/ceph/ceph.conf and tried to push the config to all nodes, but that didn't
> work. Then I tried to restart the ceph service on those nodes, no luck. I even
> tried the ceph-deploy purgedata approach, no luck again. Then I had to purge
> everything and restart from zero. Again, I'm sorry no error msgs were saved, I was
> just too frustrated.
>
> Now I have a working cluster, but I don't think I can afford to redo it again.
> So the question mentioned above: how should I properly do maintenance work
> without breaking my Ceph cluster? Are there procedures or commands I should run
> after rebooting? Thanks

The monitors (and only the monitors) require fixed IP addresses. If
you can't give that to them, you're going to have a bad time. The best
suggestion off the top of my head is to maintain a cluster of them,
and, every time you need to reboot one:
1) stop it
2) remove it from the monitor map
3) reboot the machine
4) add it to the monitor map with the same name and new IP.
5) wait until the monitors report it's part of the quorum and nobody's syncing

And just configure the clients with the monitor hostnames to talk to,
rather than specific IPs.
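
As a rough, untested sketch of the steps above (monitor name, hostnames and
address are made up here; adapt them to your cluster):

    # 1) stop the monitor that is about to be rebooted
    service ceph stop mon.mon2          # on Ubuntu/upstart this may be: stop ceph-mon id=mon2

    # 2) remove it from the monitor map
    ceph mon remove mon2

    # 3) reboot the machine and let DHCP hand out whatever address it likes

    # 4) re-add the monitor with the same name and its new IP, then start it
    ceph mon add mon2 192.0.2.12:6789
    service ceph start mon.mon2         # or: start ceph-mon id=mon2

    # 5) wait until it is back in the quorum
    ceph quorum_status

    # and in the clients' ceph.conf, refer to monitors by hostname, not fixed IP:
    # [global]
    # mon host = mon1.example.com, mon2.example.com, mon3.example.com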

There are docs about how to do that config management, but nobody's
ever tried or tested any of this so maybe I'm forgetting something.
And of course if you lose the IP of a majority of them at the same
time, you'll have a bad time again and need to do manual repairs. :/
Really, you just need to give them a fixed IP.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Suggested best practise for Ceph node online/offline?

2014-07-10 Thread Joe Hewitt
Hi there
Recently I hit a problem triggered by rebooting Ceph nodes, which eventually
ended with me rebuilding the cluster from the ground up. The too-long-didn't-read
question is: are there suggested best practices for taking a Ceph node offline and bringing it back online?

Following the official ceph doc, I set up a 4 node ceph (firefly) cluster in
our lab last week. It consists of 1 admin, 1 mon and 2 osd nodes. All 4
nodes are physical servers running Ubuntu 14.04, no virtual machines used.
The mds and the 3rd osd are actually on the monitor node. Everything looked
okay at that point: 'ceph -w' gave me health_ok and I could see all the available
storage capacity. I was happily writing Python code using the Ceph S3 API.

This Monday I apt-get upgraded all those 4 machines and then rebooted them.
Once all 4 were back online, ceph -w gave some errors about the IP address of
the monitor node and some error messages like "{timestamp} 7fb6456f2500  0 --
{monitor IP address}:0/2924 .. pipe(0x5516270 sd=97 :0 s=1 pgs=0 cs=0
l=1c=0x5c54e20).fault"
(sorry didn't save exact error logs since I've rebuilt the cluster, my
mistake :-( ). 

Due to the lab's policy, only DHCP is allowed, so I updated the monitor IP address
in /etc/ceph/ceph.conf and tried to push the config to all nodes, but that didn't
work. Then I tried to restart the ceph service on those nodes, no luck. I even
tried the ceph-deploy purgedata approach, no luck again. Then I had to purge
everything and restart from zero. Again, I'm sorry no error msgs were saved, I was
just too frustrated.

Now I have a working cluster, but I don't think I can afford to redo it again.
So the question mentioned above: how should I properly do maintenance work
without breaking my Ceph cluster? Are there procedures or commands I should run
after rebooting? Thanks

Br.
J Hewitt

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inktank-mellanox webinar access ?

2014-07-10 Thread Georgios Dimitrakakis

That makes two of us...

G.

On Thu, 10 Jul 2014 17:12:08 +0200 (CEST), Alexandre DERUMIER wrote:

OK, sorry, we have finally received the login, a bit late.

Sorry again for spamming the mailing list.
- Original Message -

From: "Alexandre DERUMIER"
To: "ceph-users"
Sent: Thursday, 10 July 2014 16:55:22
Subject: [ceph-users] inktank-mellanox webinar access ?

Hi,

sorry to spam the mailing list,

but there is an Inktank/Mellanox webinar in 10 minutes,

and I haven't received access even though I registered
yesterday (same for my co-worker).

and the webinar's Mellanox contact email (conta...@mellanox.com) does
not exist.

Maybe somebody from Inktank or Mellanox could help us ?

Regards,

Alexandre



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] logrotate

2014-07-10 Thread James Eckersall
Hi,

I've just upgraded a ceph cluster from Ubuntu 12.04 with 0.72.1 to Ubuntu
14.04 with 0.80.1.

I've noticed that the log rotation doesn't appear to work correctly.
The OSDs are just not logging to the current ceph-osd-X.log file.
If I restart the OSDs or run "service ceph-osd reload id=X", they start
logging.  They stop logging again overnight when the logs are rotated.
Running "service ceph reload" does not fix the logging.

Has anyone else noticed a problem with this?
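
For what it's worth, one direction to experiment with (only a sketch, not a
tested fix -- compare it with the stock /etc/logrotate.d/ceph shipped by the
packages) is a postrotate hook that reloads each local OSD so it reopens its
log file, using the same reload command as above:

    /var/log/ceph/*.log {
        daily
        rotate 7
        compress
        sharedscripts
        postrotate
            # ask every OSD present on this host to reopen its logs
            for id in $(ls /var/lib/ceph/osd/ 2>/dev/null | sed -n 's/^ceph-//p'); do
                service ceph-osd reload id=$id || true
            done
        endscript
    }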
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Temporary degradation when adding OSD's

2014-07-10 Thread Gregory Farnum
On Thursday, July 10, 2014, Erik Logtenberg  wrote:

>
> > Yeah, Ceph will never voluntarily reduce the redundancy. I believe
> > splitting the "degraded" state into separate "wrongly placed" and
> > "degraded" (reduced redundancy) states is currently on the menu for
> > the Giant release, but it's not been done yet.
>
> That would greatly improve the accuracy of ceph's status reports.
>
> Does ceph currently know about the difference of these states well
> enough to be smart with prioritizing? Specifically, if I add an OSD and
> ceph starts moving data around, but during that time an other OSD fails;
> is ceph smart enough to quickly prioritize reduplicating the lost copies
> before continuing to move data around (that was still perfectly
> duplicated)?
>

I believe that when choosing the next PG to backfill, OSDs prefer PGs which
are undersized. But it won't stop a backfill that is already in progress if a
PG goes undersized mid-process, and it's not a guarantee anyway, because backfill
is distributed over the cluster while the decisions have to be made locally.
(So a backfilling OSD which has no undersized PGs might beat out an OSD
with undersized PGs to get the "reservation".)
-Greg


-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inktank-mellanox webinar access ?

2014-07-10 Thread Alexandre DERUMIER
OK, sorry, we have finally received the login, a bit late.

Sorry again for spamming the mailing list.
- Original Message -

From: "Alexandre DERUMIER"
To: "ceph-users"
Sent: Thursday, 10 July 2014 16:55:22
Subject: [ceph-users] inktank-mellanox webinar access ?

Hi, 

sorry to spam the mailing list, 

but there is an Inktank/Mellanox webinar in 10 minutes,

and I haven't received access even though I registered yesterday (same
for my co-worker).

and the webinar's Mellanox contact email (conta...@mellanox.com) does not
exist.

Maybe somebody from Inktank or Mellanox could help us ? 

Regards, 

Alexandre 



___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inktank-mellanox webinar access ?

2014-07-10 Thread Georgios Dimitrakakis

The same here... neither I nor my colleagues have received anything.
G.

On Thu, 10 Jul 2014 16:55:22 +0200 (CEST), Alexandre DERUMIER wrote:

Hi,

sorry to spam the mailing list,

but there is an Inktank/Mellanox webinar in 10 minutes,

and I haven't received access even though I registered
yesterday (same for my co-worker).

and the webinar's Mellanox contact email (conta...@mellanox.com) does
not exist.

Maybe somebody from Inktank or Mellanox could help us ?

Regards,

Alexandre



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Placing different pools on different OSDs in the same physical servers

2014-07-10 Thread Nikola Pajtic

Hello to all,

I was wondering whether it is possible to place different pools on different
OSDs using only two physical servers.


I was thinking about this: http://tinypic.com/r/30tgt8l/8

I would like to use osd.0 and osd.1 for the Cinder/RBD pool, and osd.2 and
osd.3 for Nova instances. I was following the howto from the Ceph
documentation:
http://ceph.com/docs/master/rados/operations/crush-map/#placing-different-pools-on-different-osds
, but it assumes that there are 4 physical servers: 2 for the "Platter"
pool and 2 for the "SSD" pool.


What I am concerned about is how the CRUSH map should be written and
how CRUSH will decide where to send the data, given that the same
hostnames appear in both the cinder and nova trees. For example, is it
possible to do something like this:



# buckets
host cephosd1 {
id -2   # do not change unnecessarily
# weight 0.010
alg straw
hash 0  # rjenkins1
item osd.0 weight 0.000
}

host cephosd1 {
id -3   # do not change unnecessarily
# weight 0.010
alg straw
hash 0  # rjenkins1
item osd.2 weight 0.010
}

host cephosd2 {
id -4   # do not change unnecessarily
# weight 0.010
alg straw
hash 0  # rjenkins1
item osd.1 weight 0.000
}

host cephosd2 {
id -5   # do not change unnecessarily
# weight 0.010
alg straw
hash 0  # rjenkins1
item osd.3 weight 0.010
}

root cinder {
id -1   # do not change unnecessarily
# weight 0.000
alg straw
hash 0  # rjenkins1
item cephosd1 weight 0.000
item cephosd2 weight 0.000
}

root nova {
id -6   # do not change unnecessarily
# weight 0.020
alg straw
hash 0  # rjenkins1
item cephosd1 weight 0.010
item cephosd2 weight 0.010
}

If not, could you share an idea how this scenario could be achieved?

Thanks in advance!!
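
For reference, the approach from the documentation linked above would map onto
this setup by giving the two logical "hosts" on each box distinct bucket names
instead of reusing cephosd1/cephosd2 twice -- a rough, unverified sketch (ids,
weights, rule numbers and pool names are made up):

    host cephosd1-cinder {
            id -21
            alg straw
            hash 0  # rjenkins1
            item osd.0 weight 1.000
    }
    host cephosd1-nova {
            id -22
            alg straw
            hash 0  # rjenkins1
            item osd.2 weight 1.000
    }
    # cephosd2-cinder and cephosd2-nova are defined the same way,
    # containing osd.1 and osd.3 respectively.

    root cinder {
            id -25
            alg straw
            hash 0  # rjenkins1
            item cephosd1-cinder weight 1.000
            item cephosd2-cinder weight 1.000
    }
    root nova {
            id -26
            alg straw
            hash 0  # rjenkins1
            item cephosd1-nova weight 1.000
            item cephosd2-nova weight 1.000
    }

    rule cinder {
            ruleset 3
            type replicated
            min_size 1
            max_size 10
            step take cinder
            step chooseleaf firstn 0 type host
            step emit
    }
    rule nova {
            ruleset 4
            type replicated
            min_size 1
            max_size 10
            step take nova
            step chooseleaf firstn 0 type host
            step emit
    }

    # then point each pool at its rule, e.g.:
    # ceph osd pool set volumes crush_ruleset 3
    # ceph osd pool set vms crush_ruleset 4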


--
Nikola Pajtic
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] scrub error on firefly

2014-07-10 Thread Christian Eichelmann
I can also confirm that after upgrading to firefly, both of our clusters (test
and live) went from 0 scrub errors each for about 6 months to about 9-12
per week...
This also makes me kind of nervous, since as far as I know all "ceph pg
repair" does is copy the primary object to all replicas, no matter which
copy is the correct one.
Of course the described method of manual checking works (for pools with more
than 2 replicas), but doing this in a large cluster nearly every week is
horribly time-consuming and error-prone.
It would be great to get an explanation for the increased number of scrub
errors since firefly. Were they just not detected correctly in previous
versions? Or is there maybe something wrong with the new code?

Actually, our company is currently preventing our projects from moving to Ceph
because of this problem.

Regards,
Christian

From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Travis
Rhoden [trho...@gmail.com]
Sent: Thursday, 10 July 2014 16:24
To: Gregory Farnum
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] scrub error on firefly

And actually just to follow-up, it does seem like there are some additional 
smarts beyond just using the primary to overwrite the secondaries...  Since I 
captured md5 sums before and after the repair, I can say that in this 
particular instance, the secondary copy was used to overwrite the primary.  So, 
I'm just trusting Ceph to do the right thing, and so far it seems to, but the
comments here about needing to determine the correct object and place it on the 
primary PG make me wonder if I've been missing something.

 - Travis


On Thu, Jul 10, 2014 at 10:19 AM, Travis Rhoden  wrote:
I can also say that after a recent upgrade to Firefly, I have experienced 
massive uptick in scrub errors.  The cluster was on cuttlefish for about a 
year, and had maybe one or two scrub errors.  After upgrading to Firefly, we've 
probably seen 3 to 4 dozen in the last month or so (was getting 2-3 a day for a 
few weeks until the whole cluster was rescrubbed, it seemed).

What I cannot determine, however, is how to know which object is busted?  For 
example, just today I ran into a scrub error.  The object has two copies and is 
an 8MB piece of an RBD, and has identical timestamps, identical xattrs names 
and values.  But it definitely has a different MD5 sum. How to know which one 
is correct?

I've been just kicking off pg repair each time, which seems to just use the 
primary copy to overwrite the others.  Haven't run into any issues with that so 
far, but it does make me nervous.

 - Travis


On Tue, Jul 8, 2014 at 1:06 AM, Gregory Farnum  wrote:
It's not very intuitive or easy to look at right now (there are plans
from the recent developer summit to improve things), but the central
log should have output about exactly what objects are busted. You'll
then want to compare the copies manually to determine which ones are
good or bad, get the good copy on the primary (make sure you preserve
xattrs), and run repair.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
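
As a rough illustration of that manual check -- a sketch only; the PG id and
acting set below are taken from this thread, but the object name and paths are
hypothetical and assume default filestore locations:

    # the central log / health output names the inconsistent PG and object
    ceph health detail
    grep ERR /var/log/ceph/ceph.log

    # locate the copies on the acting OSDs (here pg 3.c6, acting [2,5])
    find /var/lib/ceph/osd/ceph-2/current/3.c6_head/ -name '*<object-name>*'
    find /var/lib/ceph/osd/ceph-5/current/3.c6_head/ -name '*<object-name>*'

    # compare checksums and xattrs of the two copies
    md5sum <copy-on-osd-2> <copy-on-osd-5>
    getfattr -d -m - <copy-on-osd-2>
    getfattr -d -m - <copy-on-osd-5>

    # once the good copy (with its xattrs) is in place on the primary:
    ceph pg repair 3.c6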


On Mon, Jul 7, 2014 at 6:48 PM, Randy Smith  wrote:
> Greetings,
>
> I upgraded to firefly last week and I suddenly received this error:
>
> health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>
> ceph health detail shows the following:
>
> HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
> pg 3.c6 is active+clean+inconsistent, acting [2,5]
> 1 scrub errors
>
> The docs say that I can run `ceph pg repair 3.c6` to fix this. What I want
> to know is what are the risks of data loss if I run that command in this
> state and how can I mitigate them?
>
> --
> Randall Smith
> Computing Services
> Adams State University
> http://www.adams.edu/
> 719-587-7741
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] inktank-mellanox webinar access ?

2014-07-10 Thread Alexandre DERUMIER
Hi,

sorry to spam the mailing list,

but there is an Inktank/Mellanox webinar in 10 minutes,

and I haven't received access even though I registered yesterday (same
for my co-worker).

and the webinar's Mellanox contact email (conta...@mellanox.com) does not
exist.

Maybe somebody from Inktank or Mellanox could help us ?

Regards,

Alexandre



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph mons die unexpected

2014-07-10 Thread Iban Cabrillo
Hi,
  I am having some trouble with ceph mon stability.
  Every couple of days the mons die. I only see this error in the logs:

  2014-07-08 14:24:53.056805 7f713bb5b700 -1 mon.cephmon02@1(peon) e2 ***
Got Signal Interrupt ***
2014-07-08 14:24:53.061795 7f713bb5b700  1 mon.cephmon02@1(peon) e2 shutdown
2014-07-08 14:24:53.072424 7f713bb5b700  0 quorum service shutdown
2014-07-08 14:24:53.072439 7f713bb5b700  0 mon.cephmon02@1(shutdown).health(36)
HealthMonitor::service_shutdown 1 services
2014-07-08 14:24:53.072446 7f713bb5b700  0 quorum service shutdown


ire=2014-07-08 17:32:21.518667 has v0 lc 2837
2014-07-08 17:32:18.642858 7fa7c1a9a700  1 mon.cephmon03@2(peon).paxos(paxos
active c 2260..2837) is_readable now=2014-07-08 17:32:18.642861
lease_expire=2014-07-08 17:32:21.518667 has v0 lc 2837
2014-07-08 17:32:18.637279 7fa7c0496700 -1 mon.cephmon03@2(peon) e2 *** Got
Signal Interrupt ***
2014-07-08 17:32:18.642936 7fa7c0496700  1 mon.cephmon03@2(peon) e2 shutdown
2014-07-08 17:32:18.643106 7fa7c1a9a700  1 mon.cephmon03@2(peon).paxos(paxos
active c 2260..2837) is_readable now=2014-07-08 17:32:18.643109
lease_expire=2014-07-08 17:32:21.518667 has v0 lc 2837
2014-07-08 17:32:18.659001 7fa7c0496700  0 quorum service shutdown
2014-07-08 17:32:18.659016 7fa7c0496700  0 mon.cephmon03@2(shutdown).health(38)
HealthMonitor::service_shutdown 1 services
2014-07-08 17:32:18.659023 7fa7c0496700  0 quorum service shutdown
2014-07-08 17:32:18.685100 7fa7bdee8700  0 -- 10.10.3.3:6789/0 >>
10.10.33.31:0/1413204834 pipe(0x56b9180 sd=10 :6789 s=0 pgs=0 cs=0 l=0
c=0x2d06040).accept peer addr is really 10.10.33.31:0/1413204834 (socket is
10.10.33.31:56502/0)


Any idea?
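
"Got Signal Interrupt" means the monitor received a signal from outside, so one
way to start narrowing it down (a suggestion only; paths are the usual defaults
and vary by distro) is to look for scheduled jobs that touch the ceph daemons
and to correlate them with the crash times:

    # anything in logrotate/cron that touches ceph?
    grep -r ceph /etc/logrotate.d/ /etc/cron.d/ /etc/cron.daily/ /etc/cron.hourly/ 2>/dev/null

    # what ran around the time of the 14:24 crash?
    grep -iE 'cron|logrotate' /var/log/syslog /var/log/messages 2>/dev/null | grep '14:2'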


Bertrand Russell:
*"El problema con el mundo es que los estúpidos están seguros de todo y los
inteligentes están llenos de dudas*"
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Creating a bucket on a non-master region in a multi-region configuration with unified namespace/replication

2014-07-10 Thread Bachelder, Kurt
OK - I found this information 
(http://comments.gmane.org/gmane.comp.file-systems.ceph.user/4992):

When creating a bucket users can specify which placement target they
want to use with that specific bucket (by using the S3 create bucket
location constraints field, format is
location_constraints='[region][:placement-target]', e.g.,
location_constraints=':fast-placement').

This looks like exactly what I want to do - but HOW is the location_constraints 
field used to specify a region for bucket creation?  Is there any documentation 
around this?
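
For what it's worth, that field is the standard S3 LocationConstraint, so a
create-bucket request would presumably carry it in the request body along the
lines of the sketch below (region and placement-target names are made up, and
whether radosgw then places the bucket in the slave region is exactly the open
question here):

    PUT /my-bucket HTTP/1.1
    Host: rgw.example.com

    <CreateBucketConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
      <LocationConstraint>us-secondary:fast-placement</LocationConstraint>
    </CreateBucketConfiguration>

    # with boto 2.x this would be the 'location' argument, e.g.:
    #   conn.create_bucket('my-bucket', location='us-secondary:fast-placement')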

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Bachelder, Kurt
Sent: Tuesday, July 08, 2014 3:22 PM
To: ceph-us...@ceph.com
Subject: [ceph-users] Creating a bucket on a non-master region in a 
multi-region configuration with unified namespace/replication

Hi -

We have a multi-region configuration with metadata replication between the 
regions for a unified namespace.  Each region has pools in a different cluster. 
Users created in the master region are replicated to the slave region without 
any issues - we can get user info, and everything is consistent.  Buckets 
created in the master region are also indexed in the slave region without any 
issues.

The issue we're having is with creating a bucket in the slave region.  As we 
understand it, a bucket is considered "metadata", and should be created via the 
master region, but using a "location constraint" to ensure the objects within 
the bucket are physically stored in the secondary region.  However, there is a 
lack of documentation on exactly *how* to accomplish this - we've attempted it
with s3cmd using the --bucket-location parameter to point to the slave region,
but the bucket is still created in the master region.

We noticed that the ceph S3 API documentation 
(http://ceph.com/docs/master/radosgw/s3/#features-support) shows that the 
Amazon S3 functionality for "bucket location" is not supported... but I can't 
imagine that restriction renders the entire slave region setup useless... does 
it?

So how are buckets created in a secondary region?

Thanks in advance!

Kurt
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph wthin vmware-cluster

2014-07-10 Thread Marc Hansen
Hi,

we are trying to set up a web cluster; maybe somebody can give a hint.

3 web servers with Typo3.

The Typo3 cache would sit on central storage backed by Ceph.

Would this be useful within a VMware cluster?

I need something other than a central NFS store; that is too slow.

 

Regards

Marc

 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] scrub error on firefly

2014-07-10 Thread Travis Rhoden
And actually just to follow-up, it does seem like there are some additional
smarts beyond just using the primary to overwrite the secondaries...  Since
I captured md5 sums before and after the repair, I can say that in this
particular instance, the secondary copy was used to overwrite the primary.
So, I'm just trusting Ceph to do the right thing, and so far it seems to, but
the comments here about needing to determine the correct object and place
it on the primary PG make me wonder if I've been missing something.

 - Travis


On Thu, Jul 10, 2014 at 10:19 AM, Travis Rhoden  wrote:

> I can also say that after a recent upgrade to Firefly, I have experienced
> massive uptick in scrub errors.  The cluster was on cuttlefish for about a
> year, and had maybe one or two scrub errors.  After upgrading to Firefly,
> we've probably seen 3 to 4 dozen in the last month or so (was getting 2-3 a
> day for a few weeks until the whole cluster was rescrubbed, it seemed).
>
> What I cannot determine, however, is how to know which object is busted?
> For example, just today I ran into a scrub error.  The object has two
> copies and is an 8MB piece of an RBD, and has identical timestamps,
> identical xattrs names and values.  But it definitely has a different MD5
> sum. How to know which one is correct?
>
> I've been just kicking off pg repair each time, which seems to just use
> the primary copy to overwrite the others.  Haven't run into any issues with
> that so far, but it does make me nervous.
>
>  - Travis
>
>
> On Tue, Jul 8, 2014 at 1:06 AM, Gregory Farnum  wrote:
>
>> It's not very intuitive or easy to look at right now (there are plans
>> from the recent developer summit to improve things), but the central
>> log should have output about exactly what objects are busted. You'll
>> then want to compare the copies manually to determine which ones are
>> good or bad, get the good copy on the primary (make sure you preserve
>> xattrs), and run repair.
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>> On Mon, Jul 7, 2014 at 6:48 PM, Randy Smith  wrote:
>> > Greetings,
>> >
>> > I upgraded to firefly last week and I suddenly received this error:
>> >
>> > health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>> >
>> > ceph health detail shows the following:
>> >
>> > HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>> > pg 3.c6 is active+clean+inconsistent, acting [2,5]
>> > 1 scrub errors
>> >
>> > The docs say that I can run `ceph pg repair 3.c6` to fix this. What I
>> want
>> > to know is what are the risks of data loss if I run that command in this
>> > state and how can I mitigate them?
>> >
>> > --
>> > Randall Smith
>> > Computing Services
>> > Adams State University
>> > http://www.adams.edu/
>> > 719-587-7741
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] scrub error on firefly

2014-07-10 Thread Travis Rhoden
I can also say that after a recent upgrade to Firefly, I have experienced
a massive uptick in scrub errors.  The cluster was on cuttlefish for about a
year, and had maybe one or two scrub errors.  After upgrading to Firefly,
we've probably seen 3 to 4 dozen in the last month or so (was getting 2-3 a
day for a few weeks until the whole cluster was rescrubbed, it seemed).

What I cannot determine, however, is how to know which object is busted?
For example, just today I ran into a scrub error.  The object has two
copies and is an 8MB piece of an RBD, and has identical timestamps,
identical xattrs names and values.  But it definitely has a different MD5
sum. How to know which one is correct?

I've been just kicking off pg repair each time, which seems to just use the
primary copy to overwrite the others.  Haven't run into any issues with
that so far, but it does make me nervous.

 - Travis


On Tue, Jul 8, 2014 at 1:06 AM, Gregory Farnum  wrote:

> It's not very intuitive or easy to look at right now (there are plans
> from the recent developer summit to improve things), but the central
> log should have output about exactly what objects are busted. You'll
> then want to compare the copies manually to determine which ones are
> good or bad, get the good copy on the primary (make sure you preserve
> xattrs), and run repair.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Mon, Jul 7, 2014 at 6:48 PM, Randy Smith  wrote:
> > Greetings,
> >
> > I upgraded to firefly last week and I suddenly received this error:
> >
> > health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
> >
> > ceph health detail shows the following:
> >
> > HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
> > pg 3.c6 is active+clean+inconsistent, acting [2,5]
> > 1 scrub errors
> >
> > The docs say that I can run `ceph pg repair 3.c6` to fix this. What I
> want
> > to know is what are the risks of data loss if I run that command in this
> > state and how can I mitigate them?
> >
> > --
> > Randall Smith
> > Computing Services
> > Adams State University
> > http://www.adams.edu/
> > 719-587-7741
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] [ANN] ceph-deploy 1.5.8 released

2014-07-10 Thread Alfredo Deza
Hi All,

There is a new bug-fix release of ceph-deploy, the easy deployment tool
for Ceph.

The full list of fixes for this release can be found in the changelog:

http://ceph.com/ceph-deploy/docs/changelog.html#id1


Make sure you update!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Temporary degradation when adding OSD's

2014-07-10 Thread Erik Logtenberg

> Yeah, Ceph will never voluntarily reduce the redundancy. I believe
> splitting the "degraded" state into separate "wrongly placed" and
> "degraded" (reduced redundancy) states is currently on the menu for
> the Giant release, but it's not been done yet.

That would greatly improve the accuracy of ceph's status reports.

Does ceph currently know about the difference of these states well
enough to be smart with prioritizing? Specifically, if I add an OSD and
ceph starts moving data around, but during that time an other OSD fails;
is ceph smart enough to quickly prioritize reduplicating the lost copies
before continuing to move data around (that was still perfectly duplicated)?

Thanks,

Erik.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph.com centos7 repository ?

2014-07-10 Thread Erik Logtenberg
Hi,

The RHEL7 repository works just as well. CentOS 7 is effectively a copy of
RHEL7 anyway. Packages for CentOS 7 wouldn't actually be any different.

Erik.
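
For example, a repo file along these lines points a CentOS 7 host at the RHEL7
packages (a sketch; add the Ceph release GPG key as described in the install
docs, or adjust gpgcheck accordingly):

    # /etc/yum.repos.d/ceph.repo
    [ceph]
    name=Ceph firefly packages (RHEL7 build)
    baseurl=http://ceph.com/rpm-firefly/rhel7/$basearch
    enabled=1
    gpgcheck=1
    #gpgkey=<URL of the Ceph release.asc key from the install docs>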

On 07/10/2014 06:14 AM, Alexandre DERUMIER wrote:
> Hi,
> 
> I would like to know if a CentOS 7 repository will be available soon?
> 
> Or can I use the current rhel7 repository for the moment?
> 
> http://ceph.com/rpm-firefly/rhel7/x86_64/
> 
> 
> 
> Cheers,
> 
> Alexandre
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] performance tests

2014-07-10 Thread Mark Nelson

On 07/10/2014 03:24 AM, Xabier Elkano wrote:

On 10/07/14 09:18, Christian Balzer wrote:

On Thu, 10 Jul 2014 08:57:56 +0200 Xabier Elkano wrote:


On 09/07/14 16:53, Christian Balzer wrote:

On Wed, 09 Jul 2014 07:07:50 -0500 Mark Nelson wrote:


On 07/09/2014 06:52 AM, Xabier Elkano wrote:

On 09/07/14 13:10, Mark Nelson wrote:

On 07/09/2014 05:57 AM, Xabier Elkano wrote:

Hi,

I was doing some tests in my cluster with the fio tool: one fio
instance with 70 jobs, each job writing 1GB of random data with a 4K
block size. I did this test with 3 variations:

1- Creating 70 images, 60GB each, in the pool. Using rbd kernel
module, format and mount each image as ext4. Each fio job writing
in a separate image/directory. (ioengine=libaio, queue_depth=4,
direct=1)

  IOPS: 6542
  AVG LAT: 41ms

2- Creating 1 large image 4,2TB in the pool. Using rbd kernel
module, format and mount the image as ext4. Each fio job writing
in a separate file in the same directory. (ioengine=libaio,
queue_depth=4,direct=1)

 IOPS: 5899
 AVG LAT:  47ms

3- Creating 1 large image 4,2TB in the pool. Use ioengine rbd in
fio to access the image through librados. (ioengine=rbd,
queue_depth=4,direct=1)

 IOPS: 2638
 AVG LAT: 96ms

Do these results make sense? From a Ceph perspective, is it better to
have many small images than one larger one? What is the best approach
to simulate the workload of 70 VMs?
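
(For reference, the second variation above corresponds to a job file roughly
like the one below -- reconstructed from the parameters given, not the
original job file:)

    ; rand-write-4k.fio -- 70 jobs, 4K random writes, 1GB per job, one mounted image
    [global]
    ioengine=libaio
    direct=1
    iodepth=4
    rw=randwrite
    bs=4k
    size=1g
    numjobs=70
    group_reporting

    [rand-write-4k]
    directory=/mnt/fiotest/vtest0

    ; for the third variation the rbd engine is used instead, i.e.
    ; ioengine=rbd with pool= and rbdname= pointing at the single 4.2TB image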

I'm not sure the difference between the first two cases is enough to
say much yet.  You may need to repeat the test a couple of times to
ensure that the difference is more than noise.  Having said that, if
we are seeing an effect, it would be interesting to know what the
latency distribution is like.  Is it consistently worse in the 2nd
case or do we see higher spikes at specific times?


I've repeated the tests with similar results. Each test is done with
a clean new rbd image, first removing any existing images in the
pool and then creating the new image. Between tests I am running:

   echo 3 > /proc/sys/vm/drop_caches

- In the first test I've created 70 images (60G) and mounted them:

/dev/rbd1 on /mnt/fiotest/vtest0
/dev/rbd2 on /mnt/fiotest/vtest1
..
/dev/rbd70 on /mnt/fiotest/vtest69

fio output:

rand-write-4k: (groupid=0, jobs=70): err= 0: pid=21852: Tue Jul  8
14:52:56 2014
write: io=2559.5MB, bw=26179KB/s, iops=6542, runt=100116msec
  slat (usec): min=18, max=512646, avg=4002.62, stdev=13754.33
  clat (usec): min=867, max=579715, avg=37581.64, stdev=55954.19
   lat (usec): min=903, max=586022, avg=41957.74, stdev=59276.40
  clat percentiles (msec):
   |  1.00th=[5],  5.00th=[   10], 10.00th=[   13],
20.00th=[   18], | 30.00th=[   21], 40.00th=[   26], 50.00th=[   31],
60.00th=[   34], | 70.00th=[   37], 80.00th=[   41], 90.00th=[   48],
95.00th=[   61], | 99.00th=[  404], 99.50th=[  445], 99.90th=[  494],
99.95th=[  515], | 99.99th=[  553]
  bw (KB  /s): min=0, max=  694, per=1.46%, avg=383.29,
stdev=148.01 lat (usec) : 1000=0.01%
  lat (msec) : 2=0.12%, 4=0.63%, 10=4.82%, 20=22.33%, 50=63.97%
  lat (msec) : 100=5.61%, 250=0.47%, 500=2.01%, 750=0.08%
cpu  : usr=0.69%, sys=2.57%, ctx=1525021, majf=0,
minf=2405 IO depths: 1=1.1%, 2=0.6%, 4=335.8%, 8=0.0%, 16=0.0%,
32=0.0%,

=64=0.0%

   submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
64=0.0%,

=64=0.0%

   complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
64=0.0%,

=64=0.0%

   issued: total=r=0/w=655015/d=0, short=r=0/w=0/d=0
   latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
WRITE: io=2559.5MB, aggrb=26178KB/s, minb=26178KB/s,
maxb=26178KB/s, mint=100116msec, maxt=100116msec

Disk stats (read/write):
rbd1: ios=0/2408612, merge=0/979004, ticks=0/39436432,
in_queue=39459720, util=99.68%

- In the second test I only created one large image (4,2T)

/dev/rbd1 on /mnt/fiotest/vtest0 type ext4
(rw,noatime,nodiratime,data=ordered)

fio output:

rand-write-4k: (groupid=0, jobs=70): err= 0: pid=8907: Wed Jul  9
13:38:14 2014
write: io=2264.6MB, bw=23143KB/s, iops=5783, runt=100198msec
  slat (usec): min=0, max=3099.8K, avg=4131.91, stdev=21388.98
  clat (usec): min=850, max=3133.1K, avg=43337.56, stdev=93830.42
   lat (usec): min=930, max=3147.5K, avg=48253.22, stdev=100642.53
  clat percentiles (msec):
   |  1.00th=[5],  5.00th=[   11], 10.00th=[   14],
20.00th=[   19], | 30.00th=[   24], 40.00th=[   29], 50.00th=[   33],
60.00th=[   36], | 70.00th=[   39], 80.00th=[   43], 90.00th=[   51],
95.00th=[   68], | 99.00th=[  506], 99.50th=[  553], 99.90th=[  717],
99.95th=[  783], | 99.99th=[ 3130]
  bw (KB  /s): min=0, max=  680, per=1.54%, avg=355.39,
stdev=156.10 lat (usec) : 1000=0.01%
  lat (msec) : 2=0.12%, 4=0.66%, 10=4.21%, 20=17.82%, 50=66.95%
  lat (msec) : 100=7.34%, 250=0.78%, 500=1.10%, 750=0.99%,
1000=0.02% lat (msec) : >=2000=0.04%
cpu  : usr=0.65%, sys=2.45%, ctx

Re: [ceph-users] performance tests

2014-07-10 Thread Mark Nelson

On 07/09/2014 09:53 AM, Christian Balzer wrote:

On Wed, 09 Jul 2014 07:07:50 -0500 Mark Nelson wrote:


On 07/09/2014 06:52 AM, Xabier Elkano wrote:

On 09/07/14 13:10, Mark Nelson wrote:

On 07/09/2014 05:57 AM, Xabier Elkano wrote:



Hi,

I was doing some tests in my cluster with fio tool, one fio instance
with 70 jobs, each job writing 1GB random with 4K block size. I did
this test with 3 variations:

1- Creating 70 images, 60GB each, in the pool. Using rbd kernel
module, format and mount each image as ext4. Each fio job writing in
a separate image/directory. (ioengine=libaio, queue_depth=4,
direct=1)

  IOPS: 6542
  AVG LAT: 41ms

2- Creating 1 large image 4,2TB in the pool. Using rbd kernel module,
format and mount the image as ext4. Each fio job writing in a
separate file in the same directory. (ioengine=libaio,
queue_depth=4,direct=1)

 IOPS: 5899
 AVG LAT:  47ms

3- Creating 1 large image 4,2TB in the pool. Use ioengine rbd in fio
to access the image through librados. (ioengine=rbd,
queue_depth=4,direct=1)

 IOPS: 2638
 AVG LAT: 96ms

Do these results make sense? From Ceph perspective, It is better to
have many small images than a larger one? What is the best approach
to simulate the workload of 70 VMs?


I'm not sure the difference between the first two cases is enough to
say much yet.  You may need to repeat the test a couple of times to
ensure that the difference is more than noise.  having said that, if
we are seeing an effect, it would be interesting to know what the
latency distribution is like.  is it consistently worse in the 2nd
case or do we see higher spikes at specific times?


I've repeated the tests with similar results. Each test is done with a
clean new rbd image, first removing any existing images in the pool and
then creating the new image. Between tests I am running:

   echo 3 > /proc/sys/vm/drop_caches

- In the first test I've created 70 images (60G) and mounted them:

/dev/rbd1 on /mnt/fiotest/vtest0
/dev/rbd2 on /mnt/fiotest/vtest1
..
/dev/rbd70 on /mnt/fiotest/vtest69

fio output:

rand-write-4k: (groupid=0, jobs=70): err= 0: pid=21852: Tue Jul  8
14:52:56 2014
write: io=2559.5MB, bw=26179KB/s, iops=6542, runt=100116msec
  slat (usec): min=18, max=512646, avg=4002.62, stdev=13754.33
  clat (usec): min=867, max=579715, avg=37581.64, stdev=55954.19
   lat (usec): min=903, max=586022, avg=41957.74, stdev=59276.40
  clat percentiles (msec):
   |  1.00th=[5],  5.00th=[   10], 10.00th=[   13],
20.00th=[   18], | 30.00th=[   21], 40.00th=[   26], 50.00th=[   31],
60.00th=[   34], | 70.00th=[   37], 80.00th=[   41], 90.00th=[   48],
95.00th=[   61], | 99.00th=[  404], 99.50th=[  445], 99.90th=[  494],
99.95th=[  515], | 99.99th=[  553]
  bw (KB  /s): min=0, max=  694, per=1.46%, avg=383.29,
stdev=148.01 lat (usec) : 1000=0.01%
  lat (msec) : 2=0.12%, 4=0.63%, 10=4.82%, 20=22.33%, 50=63.97%
  lat (msec) : 100=5.61%, 250=0.47%, 500=2.01%, 750=0.08%
cpu  : usr=0.69%, sys=2.57%, ctx=1525021, majf=0, minf=2405
IO depths: 1=1.1%, 2=0.6%, 4=335.8%, 8=0.0%, 16=0.0%, 32=0.0%,

=64=0.0%

   submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,

=64=0.0%

   complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,

=64=0.0%

   issued: total=r=0/w=655015/d=0, short=r=0/w=0/d=0
   latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
WRITE: io=2559.5MB, aggrb=26178KB/s, minb=26178KB/s, maxb=26178KB/s,
mint=100116msec, maxt=100116msec

Disk stats (read/write):
rbd1: ios=0/2408612, merge=0/979004, ticks=0/39436432,
in_queue=39459720, util=99.68%

- In the second test I only created one large image (4,2T)

/dev/rbd1 on /mnt/fiotest/vtest0 type ext4
(rw,noatime,nodiratime,data=ordered)

fio output:

rand-write-4k: (groupid=0, jobs=70): err= 0: pid=8907: Wed Jul  9
13:38:14 2014
write: io=2264.6MB, bw=23143KB/s, iops=5783, runt=100198msec
  slat (usec): min=0, max=3099.8K, avg=4131.91, stdev=21388.98
  clat (usec): min=850, max=3133.1K, avg=43337.56, stdev=93830.42
   lat (usec): min=930, max=3147.5K, avg=48253.22, stdev=100642.53
  clat percentiles (msec):
   |  1.00th=[5],  5.00th=[   11], 10.00th=[   14],
20.00th=[   19], | 30.00th=[   24], 40.00th=[   29], 50.00th=[   33],
60.00th=[   36], | 70.00th=[   39], 80.00th=[   43], 90.00th=[   51],
95.00th=[   68], | 99.00th=[  506], 99.50th=[  553], 99.90th=[  717],
99.95th=[  783], | 99.99th=[ 3130]
  bw (KB  /s): min=0, max=  680, per=1.54%, avg=355.39,
stdev=156.10 lat (usec) : 1000=0.01%
  lat (msec) : 2=0.12%, 4=0.66%, 10=4.21%, 20=17.82%, 50=66.95%
  lat (msec) : 100=7.34%, 250=0.78%, 500=1.10%, 750=0.99%,
1000=0.02% lat (msec) : >=2000=0.04%
cpu  : usr=0.65%, sys=2.45%, ctx=1434322, majf=0, minf=2399
IO depths: 1=0.2%, 2=0.1%, 4=365.4%, 8=0.0%, 16=0.0%, 32=0.0%,

=64=0.0%

   submit: 0=0.0%, 4=100

Re: [ceph-users] Some OSD and MDS crash

2014-07-10 Thread Pierre BLONDEAU

Hi,

Great.

All my OSDs restarted:
osdmap e438044: 36 osds: 36 up, 36 in

All PGs are active and some are in recovery:
1604040/49575206 objects degraded (3.236%)
 1780 active+clean
 17 active+degraded+remapped+backfilling
 61 active+degraded+remapped+wait_backfill
 11 active+clean+scrubbing+deep
 34 active+remapped+backfilling
 21 active+remapped+wait_backfill
 4 active+clean+replay

But all the MDS daemons crash. Logs are here:
https://blondeau.users.greyc.fr/cephlog/legacy/


In any case, thank you very much for your help.

Pierre

On 09/07/2014 19:34, Joao Eduardo Luis wrote:

On 07/09/2014 02:22 PM, Pierre BLONDEAU wrote:

Hi,

Is there any chance to restore my data?


Okay, I talked to Sam and here's what you could try before anything else:

- Make sure you have everything running on the same version.
- unset the the chooseleaf_vary_r flag -- this can be accomplished by
setting tunables to legacy.
- have the osds join in the cluster
- you should then either upgrade to firefly (if you haven't done so by
now) or wait for the point-release before you move on to setting
tunables to optimal again.

Let us know how it goes.
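
In command form, the tunables part of that would presumably be:

    # revert to legacy tunables, which clears chooseleaf_vary_r
    ceph osd crush tunables legacy

    # ... let the OSDs rejoin and the cluster settle ...

    # only after the upgrade (or the point release), as noted above:
    ceph osd crush tunables optimal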

   -Joao




Regards
Pierre

On 07/07/2014 15:42, Pierre BLONDEAU wrote:

No chance of getting those logs, even less in debug mode. I made this
change 3 weeks ago.

I put all my logs here in case it can help:
https://blondeau.users.greyc.fr/cephlog/all/

Do I have a chance of recovering my +/- 20 TB of data?

Regards

On 03/07/2014 21:48, Joao Luis wrote:

Do those logs have a higher debugging level than the default? If not,
never mind, as they will not have enough information. If they do, however,
we'd be interested in the portion around the moment you set the
tunables. Say, before the upgrade and a bit after you set the tunable.
If you want to be finer grained, then ideally it would be the moment
where those maps were created, but you'd have to grep the logs for
that.

Or drop the logs somewhere and I'll take a look.

   -Joao

On Jul 3, 2014 5:48 PM, "Pierre BLONDEAU"  wrote:

On 03/07/2014 13:49, Joao Eduardo Luis wrote:

On 07/03/2014 12:15 AM, Pierre BLONDEAU wrote:

On 03/07/2014 00:55, Samuel Just wrote:

Ah,

~/logs » for i in 20 23; do ../ceph/src/osdmaptool --export-crush /tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i > /tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
../ceph/src/osdmaptool: osdmap file 'osd-20_osdmap.13258__0___4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush20
../ceph/src/osdmaptool: osdmap file 'osd-23_osdmap.13258__0___4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush23
6d5
< tunable chooseleaf_vary_r 1

Looks like the chooseleaf_vary_r tunable somehow ended up divergent?


The only thing that comes to mind that could cause this is if we changed
the leader's in-memory map, proposed it, it failed, and only the leader
got to write the map to disk somehow.  This happened once on a totally
different issue (although I can't pinpoint right now which).

In such a scenario, the leader would serve the incorrect osdmap to
whoever asked osdmaps from it, the remaining quorum would serve the
correct osdmaps to all the others.  This could cause this divergence. Or
it could be something else.

Are there logs for the monitors for the timeframe this may have happened in?


Exactly which timeframe do you want? I have 7 days of logs; I should
have information about the upgrade from firefly to 0.82.
Which mon's logs do you want? All three?

Regards

-Joao


Pierre: do you recall how and when that got set?


I am not sure I understand, but if I remember correctly, after the update
to firefly I was in the state "HEALTH_WARN crush map has legacy tunables"
and I saw "feature set mismatch" in the logs.

So, if I remember correctly, I ran "ceph osd crush tunables optimal" for
the "crush map" problem, and I updated my client and server kernels to
3.16rc.

Could it be that?

Pierre

-Sam

On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just  wrote:

Yeah, divergent osdmaps:
555ed048e73024687fc8b106a570db4f  osd-20_osdmap.13258__0___4E62BB79__none
6037911f31dc3c18b05499d24dcdbe5c  osd-23_osdmap.13258__0___4E62BB79__none

Joao: 

Re: [ceph-users] performance tests

2014-07-10 Thread Xabier Elkano
On 10/07/14 09:18, Christian Balzer wrote:
> On Thu, 10 Jul 2014 08:57:56 +0200 Xabier Elkano wrote:
>
>> On 09/07/14 16:53, Christian Balzer wrote:
>>> On Wed, 09 Jul 2014 07:07:50 -0500 Mark Nelson wrote:
>>>
 On 07/09/2014 06:52 AM, Xabier Elkano wrote:
> On 09/07/14 13:10, Mark Nelson wrote:
>> On 07/09/2014 05:57 AM, Xabier Elkano wrote:
>>> Hi,
>>>
>>> I was doing some tests in my cluster with fio tool, one fio
>>> instance with 70 jobs, each job writing 1GB random with 4K block
>>> size. I did this test with 3 variations:
>>>
>>> 1- Creating 70 images, 60GB each, in the pool. Using rbd kernel
>>> module, format and mount each image as ext4. Each fio job writing
>>> in a separate image/directory. (ioengine=libaio, queue_depth=4,
>>> direct=1)
>>>
>>>  IOPS: 6542
>>>  AVG LAT: 41ms
>>>
>>> 2- Creating 1 large image 4,2TB in the pool. Using rbd kernel
>>> module, format and mount the image as ext4. Each fio job writing
>>> in a separate file in the same directory. (ioengine=libaio,
>>> queue_depth=4,direct=1)
>>>
>>> IOPS: 5899
>>> AVG LAT:  47ms
>>>
>>> 3- Creating 1 large image 4,2TB in the pool. Use ioengine rbd in
>>> fio to access the image through librados. (ioengine=rbd,
>>> queue_depth=4,direct=1)
>>>
>>> IOPS: 2638
>>> AVG LAT: 96ms
>>>
>>> Do these results make sense? From Ceph perspective, It is better to
>>> have many small images than a larger one? What is the best approach
>>> to simulate the workload of 70 VMs?
>> I'm not sure the difference between the first two cases is enough to
>> say much yet.  You may need to repeat the test a couple of times to
>> ensure that the difference is more than noise.  having said that, if
>> we are seeing an effect, it would be interesting to know what the
>> latency distribution is like.  is it consistently worse in the 2nd
>> case or do we see higher spikes at specific times?
>>
> I've repeated the tests with similar results. Each test is done with
> a clean new rbd image, first removing any existing images in the
> pool and then creating the new image. Between tests I am running:
>
>   echo 3 > /proc/sys/vm/drop_caches
>
> - In the first test I've created 70 images (60G) and mounted them:
>
> /dev/rbd1 on /mnt/fiotest/vtest0
> /dev/rbd2 on /mnt/fiotest/vtest1
> ..
> /dev/rbd70 on /mnt/fiotest/vtest69
>
> fio output:
>
> rand-write-4k: (groupid=0, jobs=70): err= 0: pid=21852: Tue Jul  8
> 14:52:56 2014
>write: io=2559.5MB, bw=26179KB/s, iops=6542, runt=100116msec
>  slat (usec): min=18, max=512646, avg=4002.62, stdev=13754.33
>  clat (usec): min=867, max=579715, avg=37581.64, stdev=55954.19
>   lat (usec): min=903, max=586022, avg=41957.74, stdev=59276.40
>  clat percentiles (msec):
>   |  1.00th=[5],  5.00th=[   10], 10.00th=[   13],
> 20.00th=[   18], | 30.00th=[   21], 40.00th=[   26], 50.00th=[   31],
> 60.00th=[   34], | 70.00th=[   37], 80.00th=[   41], 90.00th=[   48],
> 95.00th=[   61], | 99.00th=[  404], 99.50th=[  445], 99.90th=[  494],
> 99.95th=[  515], | 99.99th=[  553]
>  bw (KB  /s): min=0, max=  694, per=1.46%, avg=383.29,
> stdev=148.01 lat (usec) : 1000=0.01%
>  lat (msec) : 2=0.12%, 4=0.63%, 10=4.82%, 20=22.33%, 50=63.97%
>  lat (msec) : 100=5.61%, 250=0.47%, 500=2.01%, 750=0.08%
>cpu  : usr=0.69%, sys=2.57%, ctx=1525021, majf=0,
> minf=2405 IO depths: 1=1.1%, 2=0.6%, 4=335.8%, 8=0.0%, 16=0.0%,
> 32=0.0%,
>> =64=0.0%
>   submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
> 64=0.0%,
>> =64=0.0%
>   complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
> 64=0.0%,
>> =64=0.0%
>   issued: total=r=0/w=655015/d=0, short=r=0/w=0/d=0
>   latency   : target=0, window=0, percentile=100.00%, depth=4
>
> Run status group 0 (all jobs):
>WRITE: io=2559.5MB, aggrb=26178KB/s, minb=26178KB/s,
> maxb=26178KB/s, mint=100116msec, maxt=100116msec
>
> Disk stats (read/write):
>rbd1: ios=0/2408612, merge=0/979004, ticks=0/39436432,
> in_queue=39459720, util=99.68%
>
> - In the second test I only created one large image (4,2T)
>
> /dev/rbd1 on /mnt/fiotest/vtest0 type ext4
> (rw,noatime,nodiratime,data=ordered)
>
> fio output:
>
> rand-write-4k: (groupid=0, jobs=70): err= 0: pid=8907: Wed Jul  9
> 13:38:14 2014
>write: io=2264.6MB, bw=23143KB/s, iops=5783, runt=100198msec
>  slat (usec): min=0, max=3099.8K, avg=4131.91, stdev=21388.98
>  clat (usec): min=850, max=3133.1K, avg=43337.56, stdev=93830.42
>   lat (usec): min=930, max=3147.5K, avg=48253.22, stdev=100642.53
>  clat percentiles (

Re: [ceph-users] performance tests

2014-07-10 Thread Christian Balzer
On Thu, 10 Jul 2014 08:57:56 +0200 Xabier Elkano wrote:

> On 09/07/14 16:53, Christian Balzer wrote:
> > On Wed, 09 Jul 2014 07:07:50 -0500 Mark Nelson wrote:
> >
> >> On 07/09/2014 06:52 AM, Xabier Elkano wrote:
> >>> On 09/07/14 13:10, Mark Nelson wrote:
>  On 07/09/2014 05:57 AM, Xabier Elkano wrote:
> >
> > Hi,
> >
> > I was doing some tests in my cluster with fio tool, one fio
> > instance with 70 jobs, each job writing 1GB random with 4K block
> > size. I did this test with 3 variations:
> >
> > 1- Creating 70 images, 60GB each, in the pool. Using rbd kernel
> > module, format and mount each image as ext4. Each fio job writing
> > in a separate image/directory. (ioengine=libaio, queue_depth=4,
> > direct=1)
> >
> >  IOPS: 6542
> >  AVG LAT: 41ms
> >
> > 2- Creating 1 large image 4,2TB in the pool. Using rbd kernel
> > module, format and mount the image as ext4. Each fio job writing
> > in a separate file in the same directory. (ioengine=libaio,
> > queue_depth=4,direct=1)
> >
> > IOPS: 5899
> > AVG LAT:  47ms
> >
> > 3- Creating 1 large image 4,2TB in the pool. Use ioengine rbd in
> > fio to access the image through librados. (ioengine=rbd,
> > queue_depth=4,direct=1)
> >
> > IOPS: 2638
> > AVG LAT: 96ms
> >
> > Do these results make sense? From Ceph perspective, It is better to
> > have many small images than a larger one? What is the best approach
> > to simulate the workload of 70 VMs?
>  I'm not sure the difference between the first two cases is enough to
>  say much yet.  You may need to repeat the test a couple of times to
>  ensure that the difference is more than noise.  having said that, if
>  we are seeing an effect, it would be interesting to know what the
>  latency distribution is like.  is it consistently worse in the 2nd
>  case or do we see higher spikes at specific times?
> 
> >>> I've repeated the tests with similar results. Each test is done with
> >>> a clean new rbd image, first removing any existing images in the
> >>> pool and then creating the new image. Between tests I am running:
> >>>
> >>>   echo 3 > /proc/sys/vm/drop_caches
> >>>
> >>> - In the first test I've created 70 images (60G) and mounted them:
> >>>
> >>> /dev/rbd1 on /mnt/fiotest/vtest0
> >>> /dev/rbd2 on /mnt/fiotest/vtest1
> >>> ..
> >>> /dev/rbd70 on /mnt/fiotest/vtest69
> >>>
> >>> fio output:
> >>>
> >>> rand-write-4k: (groupid=0, jobs=70): err= 0: pid=21852: Tue Jul  8
> >>> 14:52:56 2014
> >>>write: io=2559.5MB, bw=26179KB/s, iops=6542, runt=100116msec
> >>>  slat (usec): min=18, max=512646, avg=4002.62, stdev=13754.33
> >>>  clat (usec): min=867, max=579715, avg=37581.64, stdev=55954.19
> >>>   lat (usec): min=903, max=586022, avg=41957.74, stdev=59276.40
> >>>  clat percentiles (msec):
> >>>   |  1.00th=[5],  5.00th=[   10], 10.00th=[   13],
> >>> 20.00th=[   18], | 30.00th=[   21], 40.00th=[   26], 50.00th=[   31],
> >>> 60.00th=[   34], | 70.00th=[   37], 80.00th=[   41], 90.00th=[   48],
> >>> 95.00th=[   61], | 99.00th=[  404], 99.50th=[  445], 99.90th=[  494],
> >>> 99.95th=[  515], | 99.99th=[  553]
> >>>  bw (KB  /s): min=0, max=  694, per=1.46%, avg=383.29,
> >>> stdev=148.01 lat (usec) : 1000=0.01%
> >>>  lat (msec) : 2=0.12%, 4=0.63%, 10=4.82%, 20=22.33%, 50=63.97%
> >>>  lat (msec) : 100=5.61%, 250=0.47%, 500=2.01%, 750=0.08%
> >>>cpu  : usr=0.69%, sys=2.57%, ctx=1525021, majf=0,
> >>> minf=2405 IO depths: 1=1.1%, 2=0.6%, 4=335.8%, 8=0.0%, 16=0.0%,
> >>> 32=0.0%,
>  =64=0.0%
> >>>   submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
> >>> 64=0.0%,
>  =64=0.0%
> >>>   complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
> >>> 64=0.0%,
>  =64=0.0%
> >>>   issued: total=r=0/w=655015/d=0, short=r=0/w=0/d=0
> >>>   latency   : target=0, window=0, percentile=100.00%, depth=4
> >>>
> >>> Run status group 0 (all jobs):
> >>>WRITE: io=2559.5MB, aggrb=26178KB/s, minb=26178KB/s,
> >>> maxb=26178KB/s, mint=100116msec, maxt=100116msec
> >>>
> >>> Disk stats (read/write):
> >>>rbd1: ios=0/2408612, merge=0/979004, ticks=0/39436432,
> >>> in_queue=39459720, util=99.68%
> >>>
> >>> - In the second test I only created one large image (4,2T)
> >>>
> >>> /dev/rbd1 on /mnt/fiotest/vtest0 type ext4
> >>> (rw,noatime,nodiratime,data=ordered)
> >>>
> >>> fio output:
> >>>
> >>> rand-write-4k: (groupid=0, jobs=70): err= 0: pid=8907: Wed Jul  9
> >>> 13:38:14 2014
> >>>write: io=2264.6MB, bw=23143KB/s, iops=5783, runt=100198msec
> >>>  slat (usec): min=0, max=3099.8K, avg=4131.91, stdev=21388.98
> >>>  clat (usec): min=850, max=3133.1K, avg=43337.56, stdev=93830.42
> >>>   lat (usec): min=930, max=3147.5K, avg=48253.22, stdev=100642.53
> >>>  clat percentiles (msec):
> >>>   |  1.00th=[5],  5.0