[ceph-users] Wrong PG information after increase pg_num

2015-07-14 Thread Luke Kao
Hello all,

I am testing a cluster with mixed OSD types on the same data node (yes, it's the idea
from:
http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/),
and have run into a strange status:

ceph -s and ceph pg dump show incorrect PG information after setting pg_num on a pool
that uses a different ruleset to select the faster OSDs.



Please advise what's wrong, and whether I can fix the issue without recreating the pool
with the final pg_num directly.





Some more detail:

1) Update the crushmap to have a different root and ruleset to select different OSDs,
like this:

rule replicated_ruleset_ssd {
ruleset 50
type replicated
min_size 1
max_size 10
step take sdd
step chooseleaf firstn 0 type host
step emit
}

2) Create a new pool and set crush_ruleset to use this new rule:



$ ceph osd pool create ssd 64 64 replicated replicated_ruleset_ssd

(however, after this command the pool is still using the default ruleset 0)

$ ceph osd pool set ssd crush_ruleset 50
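
(To double-check which rule a pool ended up with - I believe this works on Firefly/Hammer-era releases:)

$ ceph osd pool get ssd crush_ruleset
# should now report ruleset 50 for the pool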

3) it looks good now:
$ ceph osd dump | grep pool
pool 0 'data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins 
pg_num 64 pgp_num 64 last_change 1 flags hashpspool crash_replay_interval 45 
stripe_width 0
pool 1 'metadata' replicated size 3 min_size 2 crush_ruleset 0 object_hash 
rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 2 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins 
pg_num 256 pgp_num 256 last_change 50 flags hashpspool stripe_width 0
pool 8 'xfs' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins 
pg_num 1024 pgp_num 1024 last_change 1570 flags hashpspool stripe_width 0
pool 9 'ssd' replicated size 3 min_size 2 crush_ruleset 50 object_hash rjenkins 
pg_num 64 pgp_num 64 last_change 1574 flags hashpspool stripe_width 0

$ ceph -s
cluster 5f8ae2a8-f143-42d9-b50d-246ac0874569
 health HEALTH_OK
 monmap e2: 3 mons at 
{DEV-rhel7-vildn1=10.0.2.156:6789/0,DEV-rhel7-vildn2=10.0.2.157:6789/0,DEV-rhel7-vildn3=10.0.2.158:6789/0},
 election epoch 84, quorum 0,1,2 
DEV-rhel7-vildn1,DEV-rhel7-vildn2,DEV-rhel7-vildn3
 osdmap e1578: 21 osds: 15 up, 15 in
  pgmap v560681: 1472 pgs, 5 pools, 285 GB data, 73352 objects
80151 MB used, 695 GB / 779 GB avail
1472 active+clean
4) Increase pg_num and pgp_num, but the total PG count is still 1472 in ceph -s:
$ ceph osd pool set ssd pg_num 128
set pool 9 pg_num to 128
$ ceph osd pool set ssd pgp_num 128
set pool 9 pgp_num to 128

$ ceph osd dump | grep pool
pool 0 'data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins 
pg_num 64 pgp_num 64 last_change 1 flags hashpspool crash_replay_interval 45 
stripe_width 0
pool 1 'metadata' replicated size 3 min_size 2 crush_ruleset 0 object_hash 
rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 2 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins 
pg_num 256 pgp_num 256 last_change 50 flags hashpspool stripe_width 0
pool 8 'xfs' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins 
pg_num 1024 pgp_num 1024 last_change 1570 flags hashpspool stripe_width 0
pool 9 'ssd' replicated size 3 min_size 2 crush_ruleset 50 object_hash rjenkins 
pg_num 128 pgp_num 128 last_change 1581 flags hashpspool stripe_width 0

$ ceph -s
cluster 5f8ae2a8-f143-42d9-b50d-246ac0874569
 health HEALTH_OK
 monmap e2: 3 mons at 
{DEV-rhel7-vildn1=10.0.2.156:6789/0,DEV-rhel7-vildn2=10.0.2.157:6789/0,DEV-rhel7-vildn3=10.0.2.158:6789/0},
 election epoch 84, quorum 0,1,2 
DEV-rhel7-vildn1,DEV-rhel7-vildn2,DEV-rhel7-vildn3
 osdmap e1582: 21 osds: 15 up, 15 in
  pgmap v560709: 1472 pgs, 5 pools, 285 GB data, 73352 objects
80158 MB used, 695 GB / 779 GB avail
1472 active+clean

5) The same problem shows up in pg dump:
$ ceph pg dump | grep '^9\.' | wc
dumped all in format plain
     64    1472   10288

6) It looks like the PGs are created under the /var/lib/ceph/osd/ceph-<id>/current folders:



$ ls -ld /var/lib/ceph/osd/ceph-15/current/9.* | wc
     74     666    6133

$ ls -ld /var/lib/ceph/osd/ceph-16/current/9.* | wc
     54     486    4475



With 6 OSDs in this ruleset: 128 * 3 / 6 ~= 64 PG directories per OSD, which roughly matches the counts above.
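
(For completeness, this is roughly how I am counting PGs per pool from the PG map - a quick sketch, the awk part is mine:)

$ ceph pg dump pgs_brief 2>/dev/null \
    | awk -F. '/^[0-9]+\./ {n[$1]++} END {for (p in n) print "pool", p, n[p], "PGs"}'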





Thanks a lot





BR,

Luke Kao

MYCOM-OSI



___
ceph-users mailing list
ceph-users@lists.ceph.com

[ceph-users] rados-java issue tracking and release

2015-07-14 Thread Mingfai
hi,

does anyone know who maintains rados-java and performs releases to Maven
Central? In May there was a release to Maven Central [1], but the released
version is not based on the latest code base from:
https://github.com/ceph/rados-java
I wonder if whoever does the Maven release could tag a version and
release the current snapshot.

Besides, I am not sure whether the rados-java developers will notice issues
reported in the Ceph issue tracker. Would it be better if the rados-java
project enabled issue tracking on GitHub? thx

[1]
http://search.maven.org/#artifactdetails%7Ccom.ceph%7Crados%7C0.1.4%7Cjar

regards,
mingfai
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rados-java issue tracking and release

2015-07-14 Thread Wido den Hollander
Hi,

On 14-07-15 11:05, Mingfai wrote:
 hi,
 
 does anyone know who is maintaining rados-java and perform release to
 the Maven central? In May, there was a release to Maven central *[1],
 but the release version is not based on the latest code base from:
 https://github.com/ceph/rados-java
 I wonder if the one who do the Maven release could tag a version and
 release the current snapshot. 
 

Laszlo from the CloudStack project pushed it to Maven Central with my
permission, but it seems he used a different source than the one on GitHub.

CC'ing him in case he knows which source he used.

 Besides, I am not sure if the rados-java developers will notice any
 issue reported in the ceph issue tracker. would it be better if the
 rados-java project could enable issue tracking at github? thx
 

I have to be honest that I simply forgot to look at the outstanding issues.

Any help is more than appreciated, since I don't have the time to look at
them.

Always feel free to send in a pull request on Github:
https://github.com/ceph/rados-java/pulls

If it fixes an issue, please mention that in the git commit message.

Wido

 [1] http://search.maven.org/#artifactdetails%7Ccom.ceph%7Crados%7C0.1.4%7Cjar 
 
 regards,
 mingfai
 
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] xattrs vs omap

2015-07-14 Thread Gregory Farnum
On Tue, Jul 14, 2015 at 10:53 AM, Jan Schermer j...@schermer.cz wrote:
 Thank you for your reply.
 Comments inline.

 I’m still hoping to get some more input, but there are many people running 
 ceph on ext4, and it sounds like it works pretty good out of the box. Maybe 
 I’m overthinking this, then?

I think so — somebody did a lot of work making sure we were well-tuned
on the standard filesystems; I believe it was David.
-Greg


 Jan

 On 13 Jul 2015, at 21:04, Somnath Roy somnath@sandisk.com wrote:

 inline

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan 
 Schermer
 Sent: Monday, July 13, 2015 2:32 AM
 To: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] xattrs vs omap

 Sorry for reviving an old thread, but could I get some input on this, pretty 
 please?

 ext4 has 256-byte inodes by default (at least according to docs) but the 
 fragment below says:
 OPTION(filestore_max_inline_xattr_size_other, OPT_U32, 512)

 The default 512b is too much if the inode is just 256b, so shouldn’t that be 
 256b in case people use the default ext4 inode size?

 Anyway, is it better to format ext4 with larger inodes (say 2048b) and set 
 filestore_max_inline_xattr_size_other=1536, or leave it at defaults?
 [Somnath] Why 1536 ? why not 1024 or any power of 2 ? I am not seeing any 
 harm though, but, curious.

 AFAIK there is other information in the inode other than xattrs, also you 
 need to count the xattra labels into this - so if I want to store 1536B of 
 “values” it would cost more, and there still needs to be some space left.

 (As I understand it, on ext4 xattrs ale limited to one block, inode size + 
 something can spill to one different inode - maybe someone knows better).


 [Somnath] The xttr size (_) is now more than 256 bytes and it will spill 
 over, so, bigger inode  size will be good. But, I would suggest do your 
 benchmark before putting it into production.


 Good poin and I am going to do that, but I’d like to avoid the guesswork. 
 Also, not all patterns are always replicable….

 Is filestore_max_inline_xattr_size and absolute limit, or is it 
 filestore_max_inline_xattr_size*filestore_max_inline_xattrs in reality?

 [Somnath] The *_size tracks the xattr size per attribute, and
 *inline_xattrs tracks the max number of inline attributes allowed. So, if
 an xattr's size is > *_size it will go to omap, and also if the total number
 of xattrs is > *inline_xattrs it will go to omap.
 If you are only using rbd, the number of inline xattrs will always be 2 and
 it will not cross that default max limit.

 If I’m reading this correctly then with my setting of  
 filestore_max_inline_xattr_size_other=1536, it could actually consume 3072B 
 (2 xattrs), so I should in reality use 4K inodes…?



 Does OSD do the sane thing if for some reason the xattrs do not fit? What 
 are the performance implications of storing the xattrs in leveldb?

 [Somnath] Even though I don't have the exact numbers, but, it has a 
 significant overhead if the xattrs go to leveldb.

 And lastly - what size of xattrs should I really expect if all I use is RBD 
 for OpenStack instances? (No radosgw, no cephfs, but heavy on rbd image and 
 pool snapshots). This overhead is quite large

 [Somnath] It will be 2 xattrs, default _ will be little bigger than 256 
 bytes and _snapset is small depends on number of snaps/clones, but 
 unlikely will cross 256 bytes range.

 I have few pool snapshots and lots (hundreds) of (nested) snapshots for rbd 
 volumes. Does this come into play somehow?


 My plan so far is to format the drives like this:
 mkfs.ext4 -I 2048 -b 4096 -i 524288 -E stride=32,stripe-width=256 (2048b 
 inode, 4096b block size, one inode for 512k of space and set  
 filestore_max_inline_xattr_size_other=1536
 [Somnath] Not much idea on ext4, sorry..

 Does that make sense?

 Thanks!

 Jan



 On 02 Jul 2015, at 12:18, Jan Schermer j...@schermer.cz wrote:

 Does anyone have a known-good set of parameters for ext4? I want to try it 
 as well but I’m a bit worried what happnes if I get it wrong.

 Thanks

 Jan



 On 02 Jul 2015, at 09:40, Nick Fisk n...@fisk.me.uk wrote:

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
 Behalf Of Christian Balzer
 Sent: 02 July 2015 02:23
 To: Ceph Users
 Subject: Re: [ceph-users] xattrs vs omap

 On Thu, 2 Jul 2015 00:36:18 + Somnath Roy wrote:

 It is replaced with the following config option..

 // Use omap for xattrs for attrs over //
 filestore_max_inline_xattr_size or
 OPTION(filestore_max_inline_xattr_size, OPT_U32, 0) //Override
 OPTION(filestore_max_inline_xattr_size_xfs, OPT_U32, 65536)
 OPTION(filestore_max_inline_xattr_size_btrfs, OPT_U32, 2048)
 OPTION(filestore_max_inline_xattr_size_other, OPT_U32, 512)

 // for more than filestore_max_inline_xattrs attrs
 OPTION(filestore_max_inline_xattrs, OPT_U32, 0) //Override
 OPTION(filestore_max_inline_xattrs_xfs, OPT_U32, 

Re: [ceph-users] xattrs vs omap

2015-07-14 Thread Jan Schermer
Thank you for your reply.
Comments inline.

I’m still hoping to get some more input, but there are many people running ceph
on ext4, and it sounds like it works pretty well out of the box. Maybe I’m
overthinking this, then?

Jan

 On 13 Jul 2015, at 21:04, Somnath Roy somnath@sandisk.com wrote:
 
 inline
 
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan 
 Schermer
 Sent: Monday, July 13, 2015 2:32 AM
 To: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] xattrs vs omap
 
 Sorry for reviving an old thread, but could I get some input on this, pretty 
 please?
 
 ext4 has 256-byte inodes by default (at least according to docs) but the 
 fragment below says:
 OPTION(filestore_max_inline_xattr_size_other, OPT_U32, 512)
 
 The default 512b is too much if the inode is just 256b, so shouldn’t that be 
 256b in case people use the default ext4 inode size?
 
 Anyway, is it better to format ext4 with larger inodes (say 2048b) and set 
 filestore_max_inline_xattr_size_other=1536, or leave it at defaults?
 [Somnath] Why 1536 ? why not 1024 or any power of 2 ? I am not seeing any 
 harm though, but, curious.

AFAIK there is other information in the inode besides xattrs, and you also need
to count the xattr labels (names) into this - so if I want to store 1536B of “values”
it would cost more, and there still needs to be some space left.

 (As I understand it, on ext4 xattrs are limited to one block, inode size +
 something can spill to one different inode - maybe someone knows better).
 
 
 [Somnath] The xattr size (_) is now more than 256 bytes and it will spill
 over, so a bigger inode size will be good. But I would suggest doing your own
 benchmark before putting it into production.
 

Good point, and I am going to do that, but I’d like to avoid the guesswork. Also,
not all patterns are always reproducible…

 Is filestore_max_inline_xattr_size an absolute limit, or is it
 filestore_max_inline_xattr_size*filestore_max_inline_xattrs in reality?
 
 [Somnath] The *_size tracks the xattr size per attribute, and
 *inline_xattrs tracks the max number of inline attributes allowed. So, if
 an xattr's size is > *_size it will go to omap, and also if the total number of
 xattrs is > *inline_xattrs it will go to omap.
 If you are only using rbd, the number of inline xattrs will always be 2 and
 it will not cross that default max limit.

If I’m reading this correctly, then with my setting of
filestore_max_inline_xattr_size_other=1536 it could actually consume 3072B (2
xattrs), so I should in reality use 4K inodes…?


 
 Does OSD do the sane thing if for some reason the xattrs do not fit? What are 
 the performance implications of storing the xattrs in leveldb?
 
 [Somnath] Even though I don't have the exact numbers, but, it has a 
 significant overhead if the xattrs go to leveldb.
 
 And lastly - what size of xattrs should I really expect if all I use is RBD 
 for OpenStack instances? (No radosgw, no cephfs, but heavy on rbd image and 
 pool snapshots). This overhead is quite large
 
 [Somnath] It will be 2 xattrs; the default _ will be a little bigger than 256
 bytes, and _snapset is small (depends on the number of snaps/clones) but is
 unlikely to cross the 256-byte range.

I have a few pool snapshots and lots (hundreds) of (nested) snapshots of rbd
volumes. Does this come into play somehow?

 
 My plan so far is to format the drives like this:
 mkfs.ext4 -I 2048 -b 4096 -i 524288 -E stride=32,stripe-width=256 (2048b 
 inode, 4096b block size, one inode for 512k of space and set  
 filestore_max_inline_xattr_size_other=1536
 [Somnath] Not much idea on ext4, sorry..
 
 Does that make sense?
 
 Thanks!
 
 Jan
 
 
 
 On 02 Jul 2015, at 12:18, Jan Schermer j...@schermer.cz wrote:
 
 Does anyone have a known-good set of parameters for ext4? I want to try it 
 as well but I’m a bit worried what happnes if I get it wrong.
 
 Thanks
 
 Jan
 
 
 
 On 02 Jul 2015, at 09:40, Nick Fisk n...@fisk.me.uk wrote:
 
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
 Behalf Of Christian Balzer
 Sent: 02 July 2015 02:23
 To: Ceph Users
 Subject: Re: [ceph-users] xattrs vs omap
 
 On Thu, 2 Jul 2015 00:36:18 + Somnath Roy wrote:
 
 It is replaced with the following config option..
 
 // Use omap for xattrs for attrs over //
 filestore_max_inline_xattr_size or
 OPTION(filestore_max_inline_xattr_size, OPT_U32, 0) //Override
 OPTION(filestore_max_inline_xattr_size_xfs, OPT_U32, 65536)
 OPTION(filestore_max_inline_xattr_size_btrfs, OPT_U32, 2048)
 OPTION(filestore_max_inline_xattr_size_other, OPT_U32, 512)
 
 // for more than filestore_max_inline_xattrs attrs
 OPTION(filestore_max_inline_xattrs, OPT_U32, 0) //Override
 OPTION(filestore_max_inline_xattrs_xfs, OPT_U32, 10)
 OPTION(filestore_max_inline_xattrs_btrfs, OPT_U32, 10)
 OPTION(filestore_max_inline_xattrs_other, OPT_U32, 2)
 
 
 If these limits crossed, xattrs will be stored in omap..
 
 Sounds 

Re: [ceph-users] ceph daemons stucked in FUTEX_WAIT syscall

2015-07-14 Thread Simion Rad
Hi , 

The output of ceph -s : 

cluster 50961297-815c-4598-8efe-5e08203f9fea
 health HEALTH_OK
 monmap e5: 5 mons at 
{pshn05=10.71.13.5:6789/0,pshn06=10.71.13.6:6789/0,pshn13=10.71.13.13:6789/0,psosctl111=10.71.13.111:6789/0,psosctl112=10.71.13.112:6789/0},
 election epoch 258, quorum 0,1,2,3,4 pshn05,pshn06,pshn13,psosctl111,psosctl112
 mdsmap e173: 1/1/1 up {0=pshn17=up:active}, 4 up:standby
 osdmap e21319: 16 osds: 16 up, 16 in
  pgmap v3301189: 384 pgs, 3 pools, 4906 GB data, 3794 kobjects
9940 GB used, 10170 GB / 21187 GB avail
 384 active+clean

I don't use any Ceph clients (kernel or fuse) on the same nodes that run the
osd/mon/mds daemons.
Yes, I see slow-operation warnings from time to time when I'm looking at
ceph -w.
The number of IOPS on the servers isn't that high, and I think the write-back
cache of the RAID controller should be able to help with the journal ops.

Simion Rad.

From: Gregory Farnum [g...@gregs42.com]
Sent: Tuesday, July 14, 2015 12:38
To: Simion Rad
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] ceph daemons stucked in FUTEX_WAIT syscall

On Mon, Jul 13, 2015 at 11:00 PM, Simion Rad simion@yardi.com wrote:
 Hi ,

 I'm running a small cephFS ( 21 TB , 16 OSDs having different sizes between
 400G and 3.5 TB ) cluster that is used as a file warehouse (both small and
 big files).
 Every day there are times when a lot of processes running on the client
 servers ( using either fuse of kernel client) become stuck in D state and
 when I run a strace of them I see them waiting in FUTEX_WAIT syscall.
 The same issue I'm able to see on all OSD demons.
 The ceph version I'm running is Firefly 0.80.10 both on clients and on
 server daemons.
 I use ext4 as osd filesystem.
 Operating system on servers : Ubuntu 14.04 and kernel 3.13.
 Operaing system on clients : Ubuntu 12.04 LTS with HWE option kernel 3.13
 The osd daemons are using RAID5 virtual disks (6 x 300 GB 10K RPM disks on
 RAID controller Dell PERC H700 with 512MB BBU using write-back mode).
 The servers which the ceph daemons are running on are also hosting KVM VMs (
 OpenStack Nova ).
 Because of this unfortunate setup the performance is really bad, but at
 least I shouldn't see as many locking issues (or shoud I ? ).
 The only thing which temporarily improves the performance is restarting
 every osd. After such a restart I see some processes on client machines
 resume I/O but only for a couple of
 hours,  then the whole process must be repeated.
 I cannot afford to run a setup without RAID because there isn't enough RAM
 left for a couple of osd daemons.

 The ceph.conf settings I use  :

 auth cluster required = cephx
 auth service required = cephx
 auth client required = cephx
 filestore xattr use omap = true
 osd pool default size = 2
 osd pool default min size = 1
 osd pool default pg num = 128
 osd pool default pgp num = 128
 public network = 10.71.13.0/24
 cluster network = 10.71.12.0/24

 Did someone else experienced this kind of behaviour (stuck processes in
 FUTEX_WAIT syscall) when running firefly release on Ubuntu 14.04 ?

What's the output of ceph -s on your cluster?
When your clients get stuck, is the cluster complaining about stuck
ops on the OSDs?
Are you running kernel clients on the same boxes as your OSDs?

If I were to guess I'd imagine that you might just have overloaded
your cluster and the FUTEX_WAIT is the clients waiting for writes to
get acknowledged, but if restarting the OSDs brings everything back up
for a few hours that might not be the case.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph daemons stucked in FUTEX_WAIT syscall

2015-07-14 Thread Gregory Farnum
On Tue, Jul 14, 2015 at 11:30 AM, Simion Rad simion@yardi.com wrote:
 Hi ,

 The output of ceph -s :

 cluster 50961297-815c-4598-8efe-5e08203f9fea
  health HEALTH_OK
  monmap e5: 5 mons at 
 {pshn05=10.71.13.5:6789/0,pshn06=10.71.13.6:6789/0,pshn13=10.71.13.13:6789/0,psosctl111=10.71.13.111:6789/0,psosctl112=10.71.13.112:6789/0},
  election epoch 258, quorum 0,1,2,3,4 
 pshn05,pshn06,pshn13,psosctl111,psosctl112
  mdsmap e173: 1/1/1 up {0=pshn17=up:active}, 4 up:standby
  osdmap e21319: 16 osds: 16 up, 16 in
   pgmap v3301189: 384 pgs, 3 pools, 4906 GB data, 3794 kobjects
 9940 GB used, 10170 GB / 21187 GB avail
  384 active+clean

 I don't use any ceph client (kernel or fuse) on the same nodes that run 
 osd/mon/mds daemons.
 Yes, I see slow operations warnings from time to time when I'm looking at 
 ceph -w.

Yeah, I think this is just it — especially if you've got some OSDs
which are 9 times larger than others, the load will disproportionately
go to them and they probably can't take it.

The next time things get stuck you can look at the admin socket on the
ceph-fuse machines and dump_ops_in_flight and see if any of them are
very old, and which OSDs they're targeted at. (You can get similar
information out of the kernel clients by cat'ing the files in
/sys/kernel/debug/ceph/*/.)
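
For example (the exact asok name depends on the client name and pid, so treat these as sketches):

$ ceph --admin-daemon /var/run/ceph/ceph-client.*.asok dump_ops_in_flight   # ceph-fuse
$ cat /sys/kernel/debug/ceph/*/osdc                                         # kernel client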
-Greg

 The number of iops on the servers aren't that high and I think the write-back 
 cache of the RAID controller sould be able to help with the journal ops.

 Simion Rad.
 
 From: Gregory Farnum [g...@gregs42.com]
 Sent: Tuesday, July 14, 2015 12:38
 To: Simion Rad
 Cc: ceph-us...@ceph.com
 Subject: Re: [ceph-users] ceph daemons stucked in FUTEX_WAIT syscall

 On Mon, Jul 13, 2015 at 11:00 PM, Simion Rad simion@yardi.com wrote:
 Hi ,

 I'm running a small cephFS ( 21 TB , 16 OSDs having different sizes between
 400G and 3.5 TB ) cluster that is used as a file warehouse (both small and
 big files).
 Every day there are times when a lot of processes running on the client
 servers ( using either fuse of kernel client) become stuck in D state and
 when I run a strace of them I see them waiting in FUTEX_WAIT syscall.
 The same issue I'm able to see on all OSD demons.
 The ceph version I'm running is Firefly 0.80.10 both on clients and on
 server daemons.
 I use ext4 as osd filesystem.
 Operating system on servers : Ubuntu 14.04 and kernel 3.13.
 Operaing system on clients : Ubuntu 12.04 LTS with HWE option kernel 3.13
 The osd daemons are using RAID5 virtual disks (6 x 300 GB 10K RPM disks on
 RAID controller Dell PERC H700 with 512MB BBU using write-back mode).
 The servers which the ceph daemons are running on are also hosting KVM VMs (
 OpenStack Nova ).
 Because of this unfortunate setup the performance is really bad, but at
 least I shouldn't see as many locking issues (or shoud I ? ).
 The only thing which temporarily improves the performance is restarting
 every osd. After such a restart I see some processes on client machines
 resume I/O but only for a couple of
 hours,  then the whole process must be repeated.
 I cannot afford to run a setup without RAID because there isn't enough RAM
 left for a couple of osd daemons.

 The ceph.conf settings I use  :

 auth cluster required = cephx
 auth service required = cephx
 auth client required = cephx
 filestore xattr use omap = true
 osd pool default size = 2
 osd pool default min size = 1
 osd pool default pg num = 128
 osd pool default pgp num = 128
 public network = 10.71.13.0/24
 cluster network = 10.71.12.0/24

 Did someone else experienced this kind of behaviour (stuck processes in
 FUTEX_WAIT syscall) when running firefly release on Ubuntu 14.04 ?

 What's the output of ceph -s on your cluster?
 When your clients get stuck, is the cluster complaining about stuck
 ops on the OSDs?
 Are you running kernel clients on the same boxes as your OSDs?

 If I were to guess I'd imagine that you might just have overloaded
 your cluster and the FUTEX_WAIT is the clients waiting for writes to
 get acknowledged, but if restarting the OSDs brings everything back up
 for a few hours that might not be the case.
 -Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph daemons stucked in FUTEX_WAIT syscall

2015-07-14 Thread Gregory Farnum
On Mon, Jul 13, 2015 at 11:00 PM, Simion Rad simion@yardi.com wrote:
 Hi ,

 I'm running a small cephFS ( 21 TB , 16 OSDs having different sizes between
 400G and 3.5 TB ) cluster that is used as a file warehouse (both small and
 big files).
 Every day there are times when a lot of processes running on the client
 servers ( using either fuse of kernel client) become stuck in D state and
 when I run a strace of them I see them waiting in FUTEX_WAIT syscall.
 The same issue I'm able to see on all OSD demons.
 The ceph version I'm running is Firefly 0.80.10 both on clients and on
 server daemons.
 I use ext4 as osd filesystem.
 Operating system on servers : Ubuntu 14.04 and kernel 3.13.
 Operaing system on clients : Ubuntu 12.04 LTS with HWE option kernel 3.13
 The osd daemons are using RAID5 virtual disks (6 x 300 GB 10K RPM disks on
 RAID controller Dell PERC H700 with 512MB BBU using write-back mode).
 The servers which the ceph daemons are running on are also hosting KVM VMs (
 OpenStack Nova ).
 Because of this unfortunate setup the performance is really bad, but at
 least I shouldn't see as many locking issues (or shoud I ? ).
 The only thing which temporarily improves the performance is restarting
 every osd. After such a restart I see some processes on client machines
 resume I/O but only for a couple of
 hours,  then the whole process must be repeated.
 I cannot afford to run a setup without RAID because there isn't enough RAM
 left for a couple of osd daemons.

 The ceph.conf settings I use  :

 auth cluster required = cephx
 auth service required = cephx
 auth client required = cephx
 filestore xattr use omap = true
 osd pool default size = 2
 osd pool default min size = 1
 osd pool default pg num = 128
 osd pool default pgp num = 128
 public network = 10.71.13.0/24
 cluster network = 10.71.12.0/24

 Did someone else experienced this kind of behaviour (stuck processes in
 FUTEX_WAIT syscall) when running firefly release on Ubuntu 14.04 ?

What's the output of ceph -s on your cluster?
When your clients get stuck, is the cluster complaining about stuck
ops on the OSDs?
Are you running kernel clients on the same boxes as your OSDs?

If I were to guess I'd imagine that you might just have overloaded
your cluster and the FUTEX_WAIT is the clients waiting for writes to
get acknowledged, but if restarting the OSDs brings everything back up
for a few hours that might not be the case.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] xattrs vs omap

2015-07-14 Thread Jan Schermer
Instead of guessing I took a look at one of my OSDs.

TL;DR: I’m going to bump the inode size to 512, which should fit the majority of
xattrs; no need to touch the filestore parameters.

Short news first - I can’t find a file with more than 2 xattrs. (and that’s 
good)

Then I extracted all the xattrs on all the ~100K files, counted their sizes and
counted the occurrences.
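
(Roughly along these lines - a sketch rather than the exact one-liner I used; the OSD path is from my box:)

$ find /var/lib/ceph/osd/ceph-55/current -type f -print0 \
    | xargs -0 getfattr --absolute-names -d -m 'user.ceph.' -e base64 2>/dev/null \
    | awk -F= '/^user\.ceph\./ {print length($2)}' | sort -n | uniq -c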

The largest xattrs I have are 705 chars in base64 (so let's say about half that in raw bytes), and
that particular file has about 512B total in xattrs (that's more than was
expected with an RBD-only workload, right?)

# file: 
var/lib/ceph/osd/ceph-55//current/4.1ad7_head/rbd134udata.1a785181f15746a.0005a578__head_E5C51AD7__4
 117
user.ceph._=0sCwjyBANKACkAAAByYmRfZGF0YS4xYTc4NTE4MWYxNTc0NmEuMDAwMDAwMDAwMDA1YTU3OP7/1xrF5QAABAAFAxQEAP8AAADrEKMAADB2DQAiDaMA
AG11DQACAhUI1xSoAQD9CwAMAEAAABAgpFWoa6QVAgIV6xCjAAAwdg0=
 347
user.ceph.snapset=0sAgL5AQAAgt8HAAABBgAAAILfBwAAb94HAAC23AcAAEnPBwAA470HAAB4ugcAAAQAAAC1ugcAAOO9BwAAStAHAACC3wcAAAQAAAC1ugcAAAQAAABQFGAUwAowHwAAAJAZ4DggBwAA470HAAAFEA8gDwAAACAFSBQAAABADgAAAJAioAI4JQAAAMgaAABK0AcAAAQAAADgAQAAAOgBeCYAAACAKHAAACkAFwAAgt8HAAAFoAEAAADAAQAAAIAMUA4QBgAAAIAU4ACAFQAAAIAqAAAEtboHAEAAAOO9BwBAAABK0AcAQAAAgt8HAE==
 705

(If anyone wants to enlighten me on the contents that would be great - is this 
expected to grow much?)


BUT most of the files have much smaller xattrs, and if I researched it
correctly, ext4 uses the free space in the inode (which should be something
like inode_size - 128 - 28 = free) and if that's not enough it will allocate one
more block.

In other words, if I format ext4 with 2048 inode size and 4096 block size, 
there will be 2048-(128+28)=1892 bytes available in the inode, and 4096 bytes 
can be allocated  from another block. With default format, there will be just 
256-(128+28)=100 bytes in the inode + 4096 bytes in another block.


In my case, the majority of the files have an xattr size of around 200B, which is
larger than what fits inside one default-sized inode but not really that large, so it
should be beneficial to bump the inode size to 512B (that leaves a comfortable 356 bytes for xattrs).
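
(So for new OSDs something like this - untested here, and the device name is just an example:)

$ mkfs.ext4 -I 512 -b 4096 /dev/sdX1
$ tune2fs -l /dev/sdX1 | grep -i 'inode size'   # confirm the 512-byte inodes took effect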

Jan


 On 14 Jul 2015, at 12:18, Gregory Farnum g...@gregs42.com wrote:
 
 On Tue, Jul 14, 2015 at 10:53 AM, Jan Schermer j...@schermer.cz wrote:
 Thank you for your reply.
 Comments inline.
 
 I’m still hoping to get some more input, but there are many people running 
 ceph on ext4, and it sounds like it works pretty good out of the box. Maybe 
 I’m overthinking this, then?
 
 I think so — somebody did a lot of work making sure we were well-tuned
 on the standard filesystems; I believe it was David.
 -Greg
 
 
 Jan
 
 On 13 Jul 2015, at 21:04, Somnath Roy somnath@sandisk.com wrote:
 
 inline
 
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
 Jan Schermer
 Sent: Monday, July 13, 2015 2:32 AM
 To: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] xattrs vs omap
 
 Sorry for reviving an old thread, but could I get some input on this, 
 pretty please?
 
 ext4 has 256-byte inodes by default (at least according to docs) but the 
 fragment below says:
 OPTION(filestore_max_inline_xattr_size_other, OPT_U32, 512)
 
 The default 512b is too much if the inode is just 256b, so shouldn’t that 
 be 256b in case people use the default ext4 inode size?
 
 Anyway, is it better to format ext4 with larger inodes (say 2048b) and set 
 filestore_max_inline_xattr_size_other=1536, or leave it at defaults?
 [Somnath] Why 1536 ? why not 1024 or any power of 2 ? I am not seeing any 
 harm though, but, curious.
 
 AFAIK there is other information in the inode other than xattrs, also you 
 need to count the xattra labels into this - so if I want to store 1536B of 
 “values” it would cost more, and there still needs to be some space left.
 
 (As I understand it, on ext4 xattrs ale limited to one block, inode size + 
 something can spill to one different inode - maybe someone knows better).
 
 
 [Somnath] The xttr size (_) is now more than 256 bytes and it will spill 
 over, so, bigger inode  size will be good. But, I would suggest do your 
 benchmark before putting it into production.
 
 
 Good poin and I am going to do that, but I’d like to avoid the guesswork. 
 Also, not all patterns are always replicable….
 
 Is filestore_max_inline_xattr_size and absolute limit, or is it 
 filestore_max_inline_xattr_size*filestore_max_inline_xattrs in reality?
 
 [Somnath] The *_size is tracking the 

Re: [ceph-users] slow requests going up and down

2015-07-14 Thread Deneau, Tom
I don't think there were any stale or unclean PGs (when there are,
I have seen health detail list them, and it did not in this case).
I have since restarted the 2 OSDs and the health went immediately to HEALTH_OK.

-- Tom

 -Original Message-
 From: Will.Boege [mailto:will.bo...@target.com]
 Sent: Monday, July 13, 2015 10:19 PM
 To: Deneau, Tom; ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] slow requests going up and down
 
 Does the ceph health detail show anything about stale or unclean PGs, or
 are you just getting the blocked ops messages?
 
 On 7/13/15, 5:38 PM, Deneau, Tom tom.den...@amd.com wrote:
 
 I have a cluster where over the weekend something happened and successive
 calls to ceph health detail show things like below.
 What does it mean when the number of blocked requests goes up and down
 like this?
 Some clients are still running successfully.
 
 -- Tom Deneau, AMD
 
 
 
 HEALTH_WARN 20 requests are blocked > 32 sec; 2 osds have slow requests
 20 ops are blocked > 536871 sec
 2 ops are blocked > 536871 sec on osd.5
 18 ops are blocked > 536871 sec on osd.7
 2 osds have slow requests
 
 HEALTH_WARN 4 requests are blocked > 32 sec; 2 osds have slow requests
 4 ops are blocked > 536871 sec
 2 ops are blocked > 536871 sec on osd.5
 2 ops are blocked > 536871 sec on osd.7
 2 osds have slow requests
 
 HEALTH_WARN 27 requests are blocked > 32 sec; 2 osds have slow requests
 27 ops are blocked > 536871 sec
 2 ops are blocked > 536871 sec on osd.5
 25 ops are blocked > 536871 sec on osd.7
 2 osds have slow requests
 
 HEALTH_WARN 34 requests are blocked > 32 sec; 2 osds have slow requests
 34 ops are blocked > 536871 sec
 9 ops are blocked > 536871 sec on osd.5
 25 ops are blocked > 536871 sec on osd.7
 2 osds have slow requests
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Confusion in Erasure Code benchmark app

2015-07-14 Thread Nitin Saxena
Hi All,

I am trying to debug the ceph_erasure_code_benchmark app available in the Ceph
repo, using the cauchy_good technique. I am running it under gdb with the following command:

src/ceph_erasure_code_benchmark --plugin jerasure_neon --workload encode
--iterations 10 --size 1048576 --parameter k=6 --parameter m=2 --parameter
directory=src/.libs --parameter packetsize=3072 --parameter
technique=cauchy_good

My confusion here is why the underlying GF(32) function galois_w32_region_xor()
is called even though the value of w passed to
jerasure_schedule_encode() is 8.

As I understand it, since GF(8) is selected in jerasure_schedule_encode() (with
parameter w==8), the underlying gf function galois_w8_region_xor() should
have been called instead of the GF(32) function galois_w32_region_xor().
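
(For what it's worth, I am checking which variant actually fires by breaking on both symbols under gdb - nothing fancy:)

$ gdb --args src/ceph_erasure_code_benchmark --plugin jerasure_neon --workload encode \
      --iterations 10 --size 1048576 --parameter k=6 --parameter m=2 \
      --parameter directory=src/.libs --parameter packetsize=3072 --parameter technique=cauchy_good
(gdb) break galois_w8_region_xor
(gdb) break galois_w32_region_xor
(gdb) run
(gdb) backtrace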

Thanks in advance
Nitin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] strange issues after upgrading to SL6.6 and latest kernel

2015-07-14 Thread Dan van der Ster
Hi,
This reminds me of when a buggy leveldb package slipped into the ceph
repos (http://tracker.ceph.com/issues/7792).

Which version of leveldb do you have installed?
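For example, on an EL6 box something like:

$ rpm -q leveldb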
Cheers, Dan

On Tue, Jul 14, 2015 at 3:39 PM, Barry O'Rourke barry.o'rou...@ed.ac.uk wrote:
 Hi,

 I managed to destroy my development cluster yesteday after upgrading it to
 Scientific Linux and kernel 2.6.32-504.23.4.el6.x86_64.

 Upon rebooting the development node hung whilst attempting to start the
 monitor. It was still in the same state after being left overnight to
 see if it would time out.

 I decided to start from scratch to see if I could recreate the issue on
 a clean install.

 I've followed both the quick install and manual install guides on the
 wiki and always see the following error whilst creating the initial
 monitor.

 https://gist.github.com/barryorourke/47b0a988d38a817afb5b#file-gistfile1-txt

 Has anyone seen anything similar?

 Regards,

 Barry

 --
 The University of Edinburgh is a charitable body, registered in
 Scotland, with registration number SC005336.

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Confusion in Erasure Code benchmark app

2015-07-14 Thread Loic Dachary
Hi,

I've observed the same thing but never spent time to figure that out. It would 
be nice to know. I don't think it's a bug, just something slightly confusing.

Cheers

On 14/07/2015 14:52, Nitin Saxena wrote:
 Hi All,
 
 I am trying to debug ceph_erasure_code_benchmark_app available in ceph repo. 
 using cauchy_good technique. I am running gdb using following command:
 
 src/ceph_erasure_code_benchmark --plugin jerasure_neon --workload encode 
 --iterations 10 --size 1048576 --parameter k=6 --parameter m=2 --parameter 
 directory=src/.libs --parameter packetsize=3072 --parameter 
 technique=cauchy_good
 
 My confusion here is why underlying GF(32) function galois_w32_region_xor() 
 is called even if the parameter value of w passed in 
 jerasure_schedule_encode() is 8. 
 
 According to me since GF(8) is passed in jerasure_schedule_encode() (with 
 parameter w==8) then underlying gf function galois_w8_region_xor() should 
 have been called instead of GF(32) function galois_w32_region_xor
 
 Thanks in advance
 Nitin
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 

-- 
Loïc Dachary, Artisan Logiciel Libre



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph daemons stucked in FUTEX_WAIT syscall

2015-07-14 Thread Simion Rad

I'll consider looking at the slow OSDs in more detail.
Thank you, 

Simion Rad.

From: Gregory Farnum [g...@gregs42.com]
Sent: Tuesday, July 14, 2015 13:42
To: Simion Rad
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] ceph daemons stucked in FUTEX_WAIT syscall

On Tue, Jul 14, 2015 at 11:30 AM, Simion Rad simion@yardi.com wrote:
 Hi ,

 The output of ceph -s :

 cluster 50961297-815c-4598-8efe-5e08203f9fea
  health HEALTH_OK
  monmap e5: 5 mons at 
 {pshn05=10.71.13.5:6789/0,pshn06=10.71.13.6:6789/0,pshn13=10.71.13.13:6789/0,psosctl111=10.71.13.111:6789/0,psosctl112=10.71.13.112:6789/0},
  election epoch 258, quorum 0,1,2,3,4 
 pshn05,pshn06,pshn13,psosctl111,psosctl112
  mdsmap e173: 1/1/1 up {0=pshn17=up:active}, 4 up:standby
  osdmap e21319: 16 osds: 16 up, 16 in
   pgmap v3301189: 384 pgs, 3 pools, 4906 GB data, 3794 kobjects
 9940 GB used, 10170 GB / 21187 GB avail
  384 active+clean

 I don't use any ceph client (kernel or fuse) on the same nodes that run 
 osd/mon/mds daemons.
 Yes, I see slow operations warnings from time to time when I'm looking at 
 ceph -w.

Yeah, I think this is just it — especially if you've got some OSDs
which are 9 times larger than others, the load will disproportionately
go to them and they probably can't take it.

The next time things get stuck you can look at the admin socket on the
ceph-fuse machines and dump_ops_in_flight and see if any of them are
very old, and which OSDs they're targeted at. (You can get similar
information out of the kernel clients by cat'ing the files in
/sys/kernel/debug/ceph/*/.)
-Greg

 The number of iops on the servers aren't that high and I think the write-back 
 cache of the RAID controller sould be able to help with the journal ops.

 Simion Rad.
 
 From: Gregory Farnum [g...@gregs42.com]
 Sent: Tuesday, July 14, 2015 12:38
 To: Simion Rad
 Cc: ceph-us...@ceph.com
 Subject: Re: [ceph-users] ceph daemons stucked in FUTEX_WAIT syscall

 On Mon, Jul 13, 2015 at 11:00 PM, Simion Rad simion@yardi.com wrote:
 Hi ,

 I'm running a small cephFS ( 21 TB , 16 OSDs having different sizes between
 400G and 3.5 TB ) cluster that is used as a file warehouse (both small and
 big files).
 Every day there are times when a lot of processes running on the client
 servers ( using either fuse of kernel client) become stuck in D state and
 when I run a strace of them I see them waiting in FUTEX_WAIT syscall.
 The same issue I'm able to see on all OSD demons.
 The ceph version I'm running is Firefly 0.80.10 both on clients and on
 server daemons.
 I use ext4 as osd filesystem.
 Operating system on servers : Ubuntu 14.04 and kernel 3.13.
 Operaing system on clients : Ubuntu 12.04 LTS with HWE option kernel 3.13
 The osd daemons are using RAID5 virtual disks (6 x 300 GB 10K RPM disks on
 RAID controller Dell PERC H700 with 512MB BBU using write-back mode).
 The servers which the ceph daemons are running on are also hosting KVM VMs (
 OpenStack Nova ).
 Because of this unfortunate setup the performance is really bad, but at
 least I shouldn't see as many locking issues (or shoud I ? ).
 The only thing which temporarily improves the performance is restarting
 every osd. After such a restart I see some processes on client machines
 resume I/O but only for a couple of
 hours,  then the whole process must be repeated.
 I cannot afford to run a setup without RAID because there isn't enough RAM
 left for a couple of osd daemons.

 The ceph.conf settings I use  :

 auth cluster required = cephx
 auth service required = cephx
 auth client required = cephx
 filestore xattr use omap = true
 osd pool default size = 2
 osd pool default min size = 1
 osd pool default pg num = 128
 osd pool default pgp num = 128
 public network = 10.71.13.0/24
 cluster network = 10.71.12.0/24

 Did someone else experienced this kind of behaviour (stuck processes in
 FUTEX_WAIT syscall) when running firefly release on Ubuntu 14.04 ?

 What's the output of ceph -s on your cluster?
 When your clients get stuck, is the cluster complaining about stuck
 ops on the OSDs?
 Are you running kernel clients on the same boxes as your OSDs?

 If I were to guess I'd imagine that you might just have overloaded
 your cluster and the FUTEX_WAIT is the clients waiting for writes to
 get acknowledged, but if restarting the OSDs brings everything back up
 for a few hours that might not be the case.
 -Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow requests going up and down

2015-07-14 Thread Will . Boege
In my experience I have seen something like this happen twice. The first
time there were unclean PGs because Ceph was down to one replica of a PG;
when that happens, Ceph blocks IO to the remaining replicas when the number
falls below the 'min_size' parameter. That will manifest as blocked ops.
The second time, a disk was 'soft-failing' - gaining many bad sectors while
SMART still reported the drive as OK. Maybe check OSD.5 and OSD.7 for low-level
media errors with a tool like MegaCli, or whatever controller
management tool comes with your hardware.
At any rate, restarting the problem-child OSDs is probably troubleshooting
step #1, which you have done.
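
For example, with MegaCli (the exact binary path and flags vary by version and controller, so adjust accordingly):

$ /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | egrep -i 'slot|media error|other error|predictive'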

On 7/14/15, 6:45 AM, Deneau, Tom tom.den...@amd.com wrote:

I don't think there were any stale or unclean PGs,  (when there are,
I have seen health detail list them and it did not in this case).
I have since restarted the 2 osds and the health went immediately to
HEALTH_OK.

-- Tom

 -Original Message-
 From: Will.Boege [mailto:will.bo...@target.com]
 Sent: Monday, July 13, 2015 10:19 PM
 To: Deneau, Tom; ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] slow requests going up and down
 
 Does the ceph health detail show anything about stale or unclean PGs, or
 are you just getting the blocked ops messages?
 
 On 7/13/15, 5:38 PM, Deneau, Tom tom.den...@amd.com wrote:
 
 I have a cluster where over the weekend something happened and
successive
 calls to ceph health detail show things like below.
 What does it mean when the number of blocked requests goes up and down
 like this?
 Some clients are still running successfully.
 
 -- Tom Deneau, AMD
 
 
 
 HEALTH_WARN 20 requests are blocked > 32 sec; 2 osds have slow requests
 20 ops are blocked > 536871 sec
 2 ops are blocked > 536871 sec on osd.5
 18 ops are blocked > 536871 sec on osd.7
 2 osds have slow requests
 
 HEALTH_WARN 4 requests are blocked > 32 sec; 2 osds have slow requests
 4 ops are blocked > 536871 sec
 2 ops are blocked > 536871 sec on osd.5
 2 ops are blocked > 536871 sec on osd.7
 2 osds have slow requests
 
 HEALTH_WARN 27 requests are blocked > 32 sec; 2 osds have slow requests
 27 ops are blocked > 536871 sec
 2 ops are blocked > 536871 sec on osd.5
 25 ops are blocked > 536871 sec on osd.7
 2 osds have slow requests
 
 HEALTH_WARN 34 requests are blocked > 32 sec; 2 osds have slow requests
 34 ops are blocked > 536871 sec
 9 ops are blocked > 536871 sec on osd.5
 25 ops are blocked > 536871 sec on osd.7
 2 osds have slow requests
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] strange issues after upgrading to SL6.6 and latest kernel

2015-07-14 Thread Barry O'Rourke

Hi,

I managed to destroy my development cluster yesterday after upgrading it to
Scientific Linux 6.6 and kernel 2.6.32-504.23.4.el6.x86_64.

Upon rebooting, the development node hung whilst attempting to start the
monitor. It was still in the same state after being left overnight to
see if it would time out.

I decided to start from scratch to see if I could recreate the issue on
a clean install.

I've followed both the quick install and manual install guides on the
wiki and always see the following error whilst creating the initial
monitor.

https://gist.github.com/barryorourke/47b0a988d38a817afb5b#file-gistfile1-txt

Has anyone seen anything similar?

Regards,

Barry

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cluster reliability

2015-07-14 Thread Robert LeBlanc

I'm trying to understand the real-world reliability of Ceph, to provide
some data to our upper management; it may also be valuable to others
investigating Ceph.

Things I'm trying to understand:
1. How many clusters are in production?
2. How long has the cluster(s) been in production?
3. The size of the cluster(s) (# of OSDs and TB of raw disk space).
4. Has there been a data loss event?
5. Has there been a data near-loss event (some manual process was
required to recover some or all of the cluster data, but not from
backup as that would be considered a loss event)?
6. How much data was involved in loss/near-loss event?
7. How long did it take to recover the cluster to performing all I/O (not
necessarily healthy)?
8. What was the root cause of the loss/near-loss event?

From what I've read on the mailing lists, it seems that most data loss
events are around a design decision to keep 2 or fewer copies combined with
multiple drive issues. Others seem to be related to human
error, and I'm only aware of one instance where an upgrade caused a
data availability issue.

If you would take a few minutes to send me this information, I can
summarize the findings and report back to the list.

Thank you,
- 
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Workaround for RHEL/CentOS 7.1 rbdmap service start warnings?

2015-07-14 Thread Bruce McFarland
When starting the rbdmap service to provide map/unmap of rbd devices across
boot/shutdown cycles, /etc/init.d/rbdmap includes /lib/lsb/init-functions.
This is not a problem, except that the rbdmap script makes calls to the
log_daemon_*, log_progress_* and log_action_* functions that are included in the Ubuntu
14.04 distro but are not in the RHEL 7.1/RHCS 1.3 distro. Are there any
recommended workarounds for boot-time startup on RHEL/CentOS 7.1 clients?
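
For what it's worth, one stopgap I'm considering is defining no-op stand-ins near the top of
/etc/init.d/rbdmap when the RHEL init-functions don't provide them - an untested sketch, and
the exact set of helpers the script calls may differ:

# untested sketch: only define fallbacks if the Debian-style helpers are missing
if ! type log_daemon_msg >/dev/null 2>&1; then
    log_daemon_msg() { echo -n "$*"; }
    log_progress_msg() { echo -n " $*"; }
    log_end_msg() { [ "$1" -eq 0 ] && echo " ... ok" || echo " ... failed"; return "$1"; }
    log_action_begin_msg() { echo -n "$*"; }
    log_action_end_msg() { log_end_msg "$1"; }
fi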
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] SSL for tracker.ceph.com

2015-07-14 Thread Wido den Hollander
Hi,

Currently tracker.ceph.com doesn't have SSL enabled.

Every time I log in I'm sending my password in plain text, which I'd
rather not do.

Can we get SSL enabled on tracker.ceph.com?

And while we are at it, can we enable IPv6 as well? :)

-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Performance degradation after upgrade to hammer

2015-07-14 Thread Florent MONTHEL
Hi All,

I've just upgraded a Ceph cluster from Firefly 0.80.8 (Red Hat Ceph 1.2.3) to
Hammer (Red Hat Ceph 1.3) - usage: radosgw with Apache 2.4.19 in MPM prefork mode.
I'm experiencing a huge write performance degradation just after the upgrade
(Cosbench).

Have you already run performance tests comparing Hammer and Firefly?

No problem with read performance, which was amazing.


Sent from my iPhone
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance degradation after upgrade to hammer

2015-07-14 Thread Mark Nelson

On 07/14/2015 06:42 PM, Florent MONTHEL wrote:

Hi All,

I've just upgraded Ceph cluster from Firefly 0.80.8 (Redhat Ceph 1.2.3) to 
Hammer (Redhat Ceph 1.3) - Usage : radosgw with Apache 2.4.19 on MPM prefork 
mode
I'm experiencing huge write performance degradation just after upgrade 
(Cosbench).

Do you already run performance tests between Hammer and Firefly ?

No problem with read performance that was amazing


Hi Florent,

Can you talk a little bit about how your write tests are set up?  How
many concurrent IOs and what size?  Also, do you see similar problems
with rados bench?
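
Something along these lines would give a useful data point (pool name and sizes are just examples):

$ rados bench -p <your-pool> 60 write -b 4194304 -t 16 --no-cleanup
$ rados bench -p <your-pool> 60 seq -t 16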


We have done some testing and haven't seen significant performance 
degradation except when switching to civetweb which appears to perform 
deletes more slowly than what we saw with apache+fcgi.


Mark




Sent from my iPhone
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] VM with rbd volume hangs on write during load

2015-07-14 Thread Wido den Hollander
On 07/15/2015 01:17 AM, Jeya Ganesh Babu Jegatheesan wrote:
 Hi,
 
 We have a Openstack + Ceph cluster based on Giant release. We use ceph for 
 the VMs volumes including the boot volumes. Under load, we see the write 
 access to the volumes stuck from within the VM. The same would work after a 
 VM reboot. The issue is seen with and without rbd cache. Let me know if this 
 is some known issue and any way to debug further. The ceph cluster itself 
 seems to be clean. We have currently disabled scrub and deep scrub. 'ceph -s' 
 output as below.
 

Are you seeing slow requests in the system?

Are any of the disks under the OSDs 100% busy or close to it?
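
For example, something like this on one of the OSD nodes:

$ iostat -x 5    # watch %util and await on the OSD data disks and journal devices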

Btw, the amount of PGs is rather high. You are at 88, while the formula
recommends:

num_osd * 100 / 3 = 14k (cluster total)
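
With your 425 OSDs and assuming 3 replicas, that works out to roughly:

$ echo $(( 425 * 100 / 3 ))
14166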

Wido

 cluster eaaeaa55-a8e7-4531-a5eb-03d73028b59d
  health HEALTH_WARN noscrub,nodeep-scrub flag(s) set
  monmap e71: 9 mons at 
 {gngsvc009a=10.163.43.1:6789/0,gngsvc009b=10.163.43.2:6789/0,gngsvc010a=10.163.43.5:6789/0,gngsvc010b=10.163.43.6:6789/0,gngsvc011a=10.163.43.9:6789/0,gngsvc011b=10.163.43.10:6789/0,gngsvc011c=10.163.43.11:6789/0,gngsvm010d=10.163.43.8:6789/0,gngsvm011d=10.163.43.12:6789/0},
  election epoch 22246, quorum 0,1,2,3,4,5,6,7,8 
 gngsvc009a,gngsvc009b,gngsvc010a,gngsvc010b,gngsvm010d,gngsvc011a,gngsvc011b,gngsvc011c,gngsvm011d
  osdmap e54600: 425 osds: 425 up, 425 in
 flags noscrub,nodeep-scrub
   pgmap v13257438: 37620 pgs, 4 pools, 134 TB data, 35289 kobjects
 402 TB used, 941 TB / 1344 TB avail
37620 active+clean
   client io 94059 kB/s rd, 313 MB/s wr, 4623 op/s
 
 
 The traces we see in the VM's kernel are as below.
 
 [ 1080.552901] INFO: task jbd2/vdb-8:813 blocked for more than 120 seconds.
 [ 1080.553027]   Tainted: GF  O 3.13.0-34-generic 
 #60~precise1-Ubuntu
 [ 1080.553157] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables
 this message.
 [ 1080.553295] jbd2/vdb-8  D 88003687e3e0 0   813  2 
 0x
 [ 1080.553298]  880444fadb48 0002 880455114440 
 880444fadfd8
 [ 1080.553302]  00014440 00014440 88044a9317f0 
 88044b7917f0
 [ 1080.553303]  880444fadb48 880455114cd8 88044b7917f0 
 811fc670
 [ 1080.553307] Call Trace:
 [ 1080.553309]  [811fc670] ? __wait_on_buffer+0x30/0x30
 [ 1080.553311]  [8175b8b9] schedule+0x29/0x70
 [ 1080.553313]  [8175b98f] io_schedule+0x8f/0xd0
 [ 1080.553315]  [811fc67e] sleep_on_buffer+0xe/0x20
 [ 1080.553316]  [8175c052] __wait_on_bit+0x62/0x90
 [ 1080.553318]  [811fc670] ? __wait_on_buffer+0x30/0x30
 [ 1080.553320]  [8175c0fc] out_of_line_wait_on_bit+0x7c/0x90
 [ 1080.553322]  [810aff70] ? wake_atomic_t_function+0x40/0x40
 [ 1080.553324]  [811fc66e] __wait_on_buffer+0x2e/0x30
 [ 1080.553326]  [8129806b] 
 jbd2_journal_commit_transaction+0x136b/0x1520
 [ 1080.553329]  [810a1f75] ? sched_clock_local+0x25/0x90
 [ 1080.553331]  [8109a7b8] ? finish_task_switch+0x128/0x170
 [ 1080.55]  [8107891f] ? try_to_del_timer_sync+0x4f/0x70
 [ 1080.553334]  [8129c5d8] kjournald2+0xb8/0x240
 [ 1080.553336]  [810afef0] ? __wake_up_sync+0x20/0x20
 [ 1080.553338]  [8129c520] ? commit_timeout+0x10/0x10
 [ 1080.553340]  [8108fa79] kthread+0xc9/0xe0
 [ 1080.553343]  [8108f9b0] ? flush_kthread_worker+0xb0/0xb0
 [ 1080.553346]  [8176827c] ret_from_fork+0x7c/0xb0
 [ 1080.553349]  [8108f9b0] ? flush_kthread_worker+0xb0/0xb0
 
 Thanks,
 Jeyaganesh.
 
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] VM with rbd volume hangs on write during load

2015-07-14 Thread Jeya Ganesh Babu Jegatheesan


On 7/14/15, 4:56 PM, ceph-users on behalf of Wido den Hollander
ceph-users-boun...@lists.ceph.com on behalf of w...@42on.com wrote:

On 07/15/2015 01:17 AM, Jeya Ganesh Babu Jegatheesan wrote:
 Hi,
 
 We have a Openstack + Ceph cluster based on Giant release. We use ceph
for the VMs volumes including the boot volumes. Under load, we see the
write access to the volumes stuck from within the VM. The same would
work after a VM reboot. The issue is seen with and without rbd cache.
Let me know if this is some known issue and any way to debug further.
The ceph cluster itself seems to be clean. We have currently disabled
scrub and deep scrub. 'ceph -s' output as below.
 

Are you seeing slow requests in the system?
 
I don't see slow requests in the cluster.


Are any of the disks under the OSDs 100% busy or close to it?

Most of the OSDs use 20% of a core. There is no OSD process busy at 100%.


Btw, the number of PGs is rather high. You are at about 88 PGs per OSD, while
the formula recommends:

num_osd * 100 / 3 = 14k (cluster total)

We used 30 * num_osd per pool. We do have 4 pools; I believe that's why the
PG count seems to be high.
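
For reference, working this cluster's numbers through that formula, assuming
the pools are 3x replicated:

    425 OSDs * 100 / 3 replicas ≈ 14,166 PGs recommended (cluster total)
    37620 PGs actual / 425 OSDs ≈ 88 PGs per OSD, i.e. roughly 265 PG copies
    per OSD with size 3, versus the ~100 the formula aims for.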


Wido

 cluster eaaeaa55-a8e7-4531-a5eb-03d73028b59d
  health HEALTH_WARN noscrub,nodeep-scrub flag(s) set
  monmap e71: 9 mons at
{gngsvc009a=10.163.43.1:6789/0,gngsvc009b=10.163.43.2:6789/0,gngsvc010a=1
0.163.43.5:6789/0,gngsvc010b=10.163.43.6:6789/0,gngsvc011a=10.163.43.9:67
89/0,gngsvc011b=10.163.43.10:6789/0,gngsvc011c=10.163.43.11:6789/0,gngsvm
010d=10.163.43.8:6789/0,gngsvm011d=10.163.43.12:6789/0}, election epoch
22246, quorum 0,1,2,3,4,5,6,7,8
gngsvc009a,gngsvc009b,gngsvc010a,gngsvc010b,gngsvm010d,gngsvc011a,gngsvc0
11b,gngsvc011c,gngsvm011d
  osdmap e54600: 425 osds: 425 up, 425 in
 flags noscrub,nodeep-scrub
   pgmap v13257438: 37620 pgs, 4 pools, 134 TB data, 35289 kobjects
 402 TB used, 941 TB / 1344 TB avail
37620 active+clean
   client io 94059 kB/s rd, 313 MB/s wr, 4623 op/s
 
 
 The traces we see in the VM's kernel are as below.
 
 [ 1080.552901] INFO: task jbd2/vdb-8:813 blocked for more than 120
seconds.
 [ 1080.553027]   Tainted: GF  O 3.13.0-34-generic
#60~precise1-Ubuntu
 [ 1080.553157] echo 0 > /proc/sys/kernel/hung_task_timeout_secs
disables this message.
 [ 1080.553295] jbd2/vdb-8  D 88003687e3e0 0   813  2
0x
 [ 1080.553298]  880444fadb48 0002 880455114440
880444fadfd8
 [ 1080.553302]  00014440 00014440 88044a9317f0
88044b7917f0
 [ 1080.553303]  880444fadb48 880455114cd8 88044b7917f0
811fc670
 [ 1080.553307] Call Trace:
 [ 1080.553309]  [811fc670] ? __wait_on_buffer+0x30/0x30
 [ 1080.553311]  [8175b8b9] schedule+0x29/0x70
 [ 1080.553313]  [8175b98f] io_schedule+0x8f/0xd0
 [ 1080.553315]  [811fc67e] sleep_on_buffer+0xe/0x20
 [ 1080.553316]  [8175c052] __wait_on_bit+0x62/0x90
 [ 1080.553318]  [811fc670] ? __wait_on_buffer+0x30/0x30
 [ 1080.553320]  [8175c0fc] out_of_line_wait_on_bit+0x7c/0x90
 [ 1080.553322]  [810aff70] ? wake_atomic_t_function+0x40/0x40
 [ 1080.553324]  [811fc66e] __wait_on_buffer+0x2e/0x30
 [ 1080.553326]  [8129806b]
jbd2_journal_commit_transaction+0x136b/0x1520
 [ 1080.553329]  [810a1f75] ? sched_clock_local+0x25/0x90
 [ 1080.553331]  [8109a7b8] ? finish_task_switch+0x128/0x170
 [ 1080.55]  [8107891f] ? try_to_del_timer_sync+0x4f/0x70
 [ 1080.553334]  [8129c5d8] kjournald2+0xb8/0x240
 [ 1080.553336]  [810afef0] ? __wake_up_sync+0x20/0x20
 [ 1080.553338]  [8129c520] ? commit_timeout+0x10/0x10
 [ 1080.553340]  [8108fa79] kthread+0xc9/0xe0
 [ 1080.553343]  [8108f9b0] ? flush_kthread_worker+0xb0/0xb0
 [ 1080.553346]  [8176827c] ret_from_fork+0x7c/0xb0
 [ 1080.553349]  [8108f9b0] ? flush_kthread_worker+0xb0/0xb0
 
 Thanks,
 Jeyaganesh.
 
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CPU Hyperthreading ?

2015-07-14 Thread Florent MONTHEL
Hi list

Do you recommend enabling or disabling hyper-threading on the CPU?
Is the answer the same for Mon, OSD, and RadosGW?
Thanks

Sent from my iPhone
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSL for tracker.ceph.com

2015-07-14 Thread Ken Dreyer
On 07/14/2015 04:14 PM, Wido den Hollander wrote:
 Hi,
 
 Currently tracker.ceph.com doesn't have SSL enabled.
 
 Every time I log in I'm sending my password in plain text, which I'd
 rather not do.
 
 Can we get SSL enabled on tracker.ceph.com?
 
 And while we are at it, can we enable IPv6 as well? :)
 

File a ... tracker ticket for it! :D

I'm not sure what is involved with getting IPv6 on the rest of our
servers, but we need to look into it. Particularly git.ceph.com.

- Ken
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] VM with rbd volume hangs on write during load

2015-07-14 Thread Jeya Ganesh Babu Jegatheesan
Hi,

We have an OpenStack + Ceph cluster based on the Giant release. We use Ceph for 
the VM volumes, including the boot volumes. Under load, we see write access to 
the volumes get stuck from within the VM. The same access works again after a VM 
reboot. The issue is seen with and without the rbd cache. Let me know if this is 
a known issue and how to debug it further. The Ceph cluster itself seems to be 
clean. We have currently disabled scrub and deep scrub. 'ceph -s' output is below.

cluster eaaeaa55-a8e7-4531-a5eb-03d73028b59d
 health HEALTH_WARN noscrub,nodeep-scrub flag(s) set
 monmap e71: 9 mons at 
{gngsvc009a=10.163.43.1:6789/0,gngsvc009b=10.163.43.2:6789/0,gngsvc010a=10.163.43.5:6789/0,gngsvc010b=10.163.43.6:6789/0,gngsvc011a=10.163.43.9:6789/0,gngsvc011b=10.163.43.10:6789/0,gngsvc011c=10.163.43.11:6789/0,gngsvm010d=10.163.43.8:6789/0,gngsvm011d=10.163.43.12:6789/0},
 election epoch 22246, quorum 0,1,2,3,4,5,6,7,8 
gngsvc009a,gngsvc009b,gngsvc010a,gngsvc010b,gngsvm010d,gngsvc011a,gngsvc011b,gngsvc011c,gngsvm011d
 osdmap e54600: 425 osds: 425 up, 425 in
flags noscrub,nodeep-scrub
  pgmap v13257438: 37620 pgs, 4 pools, 134 TB data, 35289 kobjects
402 TB used, 941 TB / 1344 TB avail
   37620 active+clean
  client io 94059 kB/s rd, 313 MB/s wr, 4623 op/s
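
For reference, the rbd cache mentioned above is toggled with the usual
client-side option in ceph.conf on the compute nodes; a minimal sketch (exact
section naming depends on how libvirt/qemu picks up the config):

[client]
    rbd cache = false    # or true; librbd reads this when each guest is started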


The traces we see in the VM's kernel are as below.

[ 1080.552901] INFO: task jbd2/vdb-8:813 blocked for more than 120 seconds.
[ 1080.553027]   Tainted: GF  O 3.13.0-34-generic 
#60~precise1-Ubuntu
[ 1080.553157] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this 
message.
[ 1080.553295] jbd2/vdb-8  D 88003687e3e0 0   813  2 0x
[ 1080.553298]  880444fadb48 0002 880455114440 
880444fadfd8
[ 1080.553302]  00014440 00014440 88044a9317f0 
88044b7917f0
[ 1080.553303]  880444fadb48 880455114cd8 88044b7917f0 
811fc670
[ 1080.553307] Call Trace:
[ 1080.553309]  [811fc670] ? __wait_on_buffer+0x30/0x30
[ 1080.553311]  [8175b8b9] schedule+0x29/0x70
[ 1080.553313]  [8175b98f] io_schedule+0x8f/0xd0
[ 1080.553315]  [811fc67e] sleep_on_buffer+0xe/0x20
[ 1080.553316]  [8175c052] __wait_on_bit+0x62/0x90
[ 1080.553318]  [811fc670] ? __wait_on_buffer+0x30/0x30
[ 1080.553320]  [8175c0fc] out_of_line_wait_on_bit+0x7c/0x90
[ 1080.553322]  [810aff70] ? wake_atomic_t_function+0x40/0x40
[ 1080.553324]  [811fc66e] __wait_on_buffer+0x2e/0x30
[ 1080.553326]  [8129806b] 
jbd2_journal_commit_transaction+0x136b/0x1520
[ 1080.553329]  [810a1f75] ? sched_clock_local+0x25/0x90
[ 1080.553331]  [8109a7b8] ? finish_task_switch+0x128/0x170
[ 1080.55]  [8107891f] ? try_to_del_timer_sync+0x4f/0x70
[ 1080.553334]  [8129c5d8] kjournald2+0xb8/0x240
[ 1080.553336]  [810afef0] ? __wake_up_sync+0x20/0x20
[ 1080.553338]  [8129c520] ? commit_timeout+0x10/0x10
[ 1080.553340]  [8108fa79] kthread+0xc9/0xe0
[ 1080.553343]  [8108f9b0] ? flush_kthread_worker+0xb0/0xb0
[ 1080.553346]  [8176827c] ret_from_fork+0x7c/0xb0
[ 1080.553349]  [8108f9b0] ? flush_kthread_worker+0xb0/0xb0

Thanks,
Jeyaganesh.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance degradation after upgrade to hammer

2015-07-14 Thread Florent MONTHEL
Yes of course thanks Mark

Infrastructure: 5 servers with 10 SATA disks each (50 OSDs in total), 10 Gb
connected, EC 2+1 on the rgw.buckets pool, 2 RR-DNS-style radosgw instances
installed on 2 of the cluster servers.
No SSD drives used.

We're using Cosbench to send:
- 8k object size, 100% read, 256 workers: better results with Hammer
- 8k object size, 80% read / 20% write, 256 workers: real degradation between
Firefly and Hammer (divided by something like 10)
- 8k object size, 100% write, 256 workers: real degradation between Firefly
and Hammer (divided by something like 10)

Thanks

Sent from my iPhone

 On 14 Jul 2015, at 19:57, Mark Nelson mnel...@redhat.com wrote:
 
 On 07/14/2015 06:42 PM, Florent MONTHEL wrote:
 Hi All,
 
 I've just upgraded a Ceph cluster from Firefly 0.80.8 (Red Hat Ceph 1.2.3) to 
 Hammer (Red Hat Ceph 1.3). Usage: radosgw with Apache 2.4.19 in MPM prefork 
 mode.
 I'm experiencing a huge write performance degradation just after the upgrade 
 (Cosbench).
 
 Have you already run performance tests comparing Hammer and Firefly?
 
 No problem with read performance, which was amazing.
 
 Hi Florent,
 
 Can you talk a little bit about how your write tests are set up?  How many 
 concurrent IOs and what size?  Also, do you see similar problems with rados 
 bench?
 
 We have done some testing and haven't seen significant performance 
 degradation except when switching to civetweb, which appears to perform 
 deletes more slowly than what we saw with apache+fcgi.
 
 Mark
 
 
 
 Sent from my iPhone
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance degradation after upgrade to hammer

2015-07-14 Thread Mark Nelson

Hi Florent,

10x degradation is definitely unusual!  A couple of things to look at:

Are 8K rados bench writes to the rgw.buckets pool slow?  You can check with 
something like:


rados -p rgw.buckets bench 30 write -t 256 -b 8192

You may also want to try targeting a specific RGW server to make sure 
the RR-DNS setup isn't interfering (at least while debugging).  It may 
also be worth creating a new replicated pool and trying writes to that pool 
as well to see if you see much of a difference; see the sketch below.
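
A minimal sketch of that last suggestion, using a throwaway pool (the pool name
and PG count here are just placeholders):

$ ceph osd pool create benchtest 128 128 replicated
$ rados -p benchtest bench 30 write -t 256 -b 8192   # same workers/object size as above
$ ceph osd pool delete benchtest benchtest --yes-i-really-really-mean-it

By default rados bench removes its benchmark objects when the write run
finishes, so the pool can be deleted right afterwards.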


Mark

On 07/14/2015 07:17 PM, Florent MONTHEL wrote:

Yes of course thanks Mark

Infrastructure: 5 servers with 10 SATA disks each (50 OSDs in total), 10 Gb
connected, EC 2+1 on the rgw.buckets pool, 2 RR-DNS-style radosgw instances
installed on 2 of the cluster servers.
No SSD drives used.

We're using Cosbench to send:
- 8k object size, 100% read, 256 workers: better results with Hammer
- 8k object size, 80% read / 20% write, 256 workers: real degradation between
Firefly and Hammer (divided by something like 10)
- 8k object size, 100% write, 256 workers: real degradation between Firefly
and Hammer (divided by something like 10)

Thanks

Sent from my iPhone


On 14 Jul 2015, at 19:57, Mark Nelson mnel...@redhat.com wrote:


On 07/14/2015 06:42 PM, Florent MONTHEL wrote:
Hi All,

I've just upgraded a Ceph cluster from Firefly 0.80.8 (Red Hat Ceph 1.2.3) to 
Hammer (Red Hat Ceph 1.3). Usage: radosgw with Apache 2.4.19 in MPM prefork 
mode.
I'm experiencing a huge write performance degradation just after the upgrade 
(Cosbench).

Have you already run performance tests comparing Hammer and Firefly?

No problem with read performance, which was amazing.


Hi Florent,

Can you talk a little bit about how your write tests are set up?  How many 
concurrent IOs and what size?  Also, do you see similar problems with rados 
bench?

We have done some testing and haven't seen significant performance degradation 
except when switching to civetweb, which appears to perform deletes more slowly 
than what we saw with apache+fcgi.

Mark




Sent from my iPhone
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds0: Client failing to respond to cache pressure

2015-07-14 Thread Eric Eastman
Hi John,

I cut the test down to a single client running only Ganesha NFS
without any ceph drivers loaded on the Ceph FS client.  After deleting
all the files in the Ceph file system, rebooting all the nodes, I
restarted the create-5-million-file test using 2 NFS clients to the
one Ceph file system node running Ganesha NFS. After a couple of hours I
am seeing the "Client ede-c2-gw01 failing to respond to cache pressure"
error:

$ ceph -s
cluster 6d8aae1e-1125-11e5-a708-001b78e265be
 health HEALTH_WARN
mds0: Client ede-c2-gw01 failing to respond to cache pressure
 monmap e1: 3 mons at
{ede-c2-mon01=10.15.2.121:6789/0,ede-c2-mon02=10.15.2.122:6789/0,ede-c2-mon03=10.15.2.123:6789/0}
election epoch 22, quorum 0,1,2
ede-c2-mon01,ede-c2-mon02,ede-c2-mon03
 mdsmap e1860: 1/1/1 up {0=ede-c2-mds02=up:active}, 2 up:standby
 osdmap e323: 8 osds: 8 up, 8 in
  pgmap v302142: 832 pgs, 4 pools, 162 GB data, 4312 kobjects
182 GB used, 78459 MB / 263 GB avail
 832 active+clean

Dumping the mds daemon shows inodes > inodes_max:

# ceph daemon mds.ede-c2-mds02 perf dump mds
{
mds: {
request: 21862302,
reply: 21862302,
reply_latency: {
avgcount: 21862302,
sum: 16728.480772060
},
forward: 0,
dir_fetch: 13,
dir_commit: 50788,
dir_split: 0,
inode_max: 100000,
inodes: 100010,
inodes_top: 0,
inodes_bottom: 0,
inodes_pin_tail: 100010,
inodes_pinned: 100010,
inodes_expired: 4308279,
inodes_with_caps: 8,
caps: 8,
subtrees: 2,
traverse: 30802465,
traverse_hit: 26394836,
traverse_forward: 0,
traverse_discover: 0,
traverse_dir_fetch: 0,
traverse_remote_ino: 0,
traverse_lock: 0,
load_cent: 2186230200,
q: 0,
exported: 0,
exported_inodes: 0,
imported: 0,
imported_inodes: 0
}
}
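
The two cache counters above can be watched with the same admin-socket command,
filtered down to just the inode values, e.g.:

# ceph daemon mds.ede-c2-mds02 perf dump mds | grep inode
# watch -n 10 'ceph daemon mds.ede-c2-mds02 perf dump mds | grep inode'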

Once this test finishes and I verify the files were all correctly
written, I will retest using the SAMBA VFS interface, followed by the
kernel test.
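
For reference, those two access paths will look roughly like this on the
gateway (share name, mount point and secret file are placeholders for this
environment):

# kernel client mount:
mount -t ceph 10.15.2.121:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret

# smb.conf share using the Samba vfs_ceph module:
[cephfs]
    path = /
    vfs objects = ceph
    ceph:config_file = /etc/ceph/ceph.conf
    read only = no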

Please let me know if there is more info you need and if you want me
to open a ticket.

Best regards
Eric



On Mon, Jul 13, 2015 at 9:40 AM, Eric Eastman
eric.east...@keepertech.com wrote:
 Thanks John. I will back the test down to the simple case of 1 client
 without the kernel driver and only running NFS Ganesha, and work forward
 till I trip the problem and report my findings.

 Eric

 On Mon, Jul 13, 2015 at 2:18 AM, John Spray john.sp...@redhat.com wrote:



 On 13/07/2015 04:02, Eric Eastman wrote:

 Hi John,

 I am seeing this problem with Ceph v9.0.1 with the v4.1 kernel on all
 nodes.  This system is using 4 Ceph FS client systems. They all have
 the kernel driver version of CephFS loaded, but none are mounting the
 file system. All 4 clients are using the libcephfs VFS interface to
 Ganesha NFS (V2.2.0-2) and Samba (Version 4.3.0pre1-GIT-0791bb0) to
 share out the Ceph file system.

 # ceph -s
  cluster 6d8aae1e-1125-11e5-a708-001b78e265be
   health HEALTH_WARN
  4 near full osd(s)
  mds0: Client ede-c2-gw01 failing to respond to cache
 pressure
  mds0: Client ede-c2-gw02:cephfs failing to respond to cache
 pressure
  mds0: Client ede-c2-gw03:cephfs failing to respond to cache
 pressure
   monmap e1: 3 mons at

 {ede-c2-mon01=10.15.2.121:6789/0,ede-c2-mon02=10.15.2.122:6789/0,ede-c2-mon03=10.15.2.123:6789/0}
  election epoch 8, quorum 0,1,2
 ede-c2-mon01,ede-c2-mon02,ede-c2-mon03
   mdsmap e912: 1/1/1 up {0=ede-c2-mds03=up:active}, 2 up:standby
   osdmap e272: 8 osds: 8 up, 8 in
pgmap v225264: 832 pgs, 4 pools, 188 GB data, 5173 kobjects
  212 GB used, 48715 MB / 263 GB avail
   832 active+clean
client io 1379 kB/s rd, 20653 B/s wr, 98 op/s


 It would help if we knew whether it's the kernel clients or the userspace
 clients that are generating the warnings here.  You've probably already done
 this, but I'd get rid of any unused kernel client mounts to simplify the
 situation.

 We haven't tested the cache limit enforcement with NFS Ganesha, so there
 is a decent chance that it is broken.  The Ganesha FSAL is doing
 ll_get/ll_put reference counting on inodes, so it seems quite possible that
 its cache is pinning things that we would otherwise be evicting in response
 to cache pressure.  You mention samba as well,

 You can see if the MDS cache is indeed exceeding its limit by looking at
 the output of:
 ceph daemon mds.<daemon id> perf dump mds

 ...where the inodes value tells you how many are in the cache, vs.
 inode_max.

 If you can, it would be useful to boil this down to a straightforward test
 case: if you start with a healthy cluster, mount a single ganesha client,
 and do your 5 million file procedure, do you get the warning?  

Re: [ceph-users] Ruby bindings for Librados

2015-07-14 Thread Ken Dreyer
On 07/13/2015 02:11 PM, Wido den Hollander wrote:
 On 07/13/2015 09:43 PM, Corin Langosch wrote:
 Hi Wido,

 I'm the dev of https://github.com/netskin/ceph-ruby and still use it in 
 production on some systems. It has everything I
 need, so I didn't develop it any further. If you find any bugs or need new 
 features, just open an issue and I'm happy to
 have a look.

 
 Ah, that's great! We should look into making a Ruby binding official
 and moving it to Ceph's GitHub project. That would make it clearer
 for end-users.
 
 I see that RADOS namespaces are currently not implemented in the Ruby
 bindings. Not many bindings have them though. Might be worth looking at.
 
 I'll give the current bindings a try btw!

I'd like to see this happen too. Corin, would you be amenable to moving
this under the ceph GitHub org? You'd still have control over it,
similar to the way Wido manages https://github.com/ceph/phprados

- Ken


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CPU Hyperthreading ?

2015-07-14 Thread Somnath Roy
I was getting better performance with HT enabled (Intel CPU) for ceph-osd. I 
guess for mon it doesn't matter, but for RadosGW I didn't measure the 
difference... We are running our benchmarks with HT enabled for all components, 
though.
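
For reference, a quick way to confirm whether HT is actually enabled on a node
is to check the threads per core, e.g.:

$ lscpu | egrep -i 'thread|core|socket'   # Thread(s) per core > 1 means HT is on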

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Florent MONTHEL
Sent: Tuesday, July 14, 2015 5:19 PM
To: ceph-users
Subject: [ceph-users] CPU Hyperthreading ?

Hi list

Do you recommend enabling or disabling hyper-threading on the CPU?
Is the answer the same for Mon, OSD, and RadosGW?
Thanks

Sent from my iPhone
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds0: Client failing to respond to cache pressure

2015-07-14 Thread 谷枫
I changed the mds_cache_size to 500000 from 100000 and got rid of the
WARN temporarily.
Dumping the mds daemon now shows:
inode_max: 500000,
inodes: 124213,
But I have no idea: if the inodes rise above 500000, should I change the
mds_cache_size again?
Thanks.
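
For reference, mds_cache_size can be set persistently in ceph.conf on the MDS
host (and the MDS restarted), or changed at runtime through the admin socket; a
minimal sketch, where the MDS name is just a placeholder:

[mds]
    mds cache size = 500000

$ ceph daemon mds.<name> config set mds_cache_size 500000

As noted above, this only buys headroom: if the clients keep pinning inodes,
the warning will eventually come back at the new limit.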

2015-07-15 13:34 GMT+08:00 谷枫 feiche...@gmail.com:

 I changed the mds_cache_size to 500000 from 100000 and got rid of the
 WARN temporarily.
 Dumping the mds daemon now shows:
 inode_max: 500000,
 inodes: 124213,
 But I have no idea: if the inodes rise above 500000, should I change the
 mds_cache_size again?
 Thanks.

 2015-07-15 11:06 GMT+08:00 Eric Eastman eric.east...@keepertech.com:

 Hi John,

 I cut the test down to a single client running only Ganesha NFS
 without any ceph drivers loaded on the Ceph FS client.  After deleting
 all the files in the Ceph file system, rebooting all the nodes, I
 restarted the create-5-million-file test using 2 NFS clients to the
 one Ceph file system node running Ganesha NFS. After a couple of hours I
 am seeing the "Client ede-c2-gw01 failing to respond to cache pressure"
 error:

 $ ceph -s
 cluster 6d8aae1e-1125-11e5-a708-001b78e265be
  health HEALTH_WARN
 mds0: Client ede-c2-gw01 failing to respond to cache pressure
  monmap e1: 3 mons at
 {ede-c2-mon01=
 10.15.2.121:6789/0,ede-c2-mon02=10.15.2.122:6789/0,ede-c2-mon03=10.15.2.123:6789/0
 }
 election epoch 22, quorum 0,1,2
 ede-c2-mon01,ede-c2-mon02,ede-c2-mon03
  mdsmap e1860: 1/1/1 up {0=ede-c2-mds02=up:active}, 2 up:standby
  osdmap e323: 8 osds: 8 up, 8 in
   pgmap v302142: 832 pgs, 4 pools, 162 GB data, 4312 kobjects
 182 GB used, 78459 MB / 263 GB avail
  832 active+clean

 Dumping the mds daemon shows inodes > inodes_max:

 # ceph daemon mds.ede-c2-mds02 perf dump mds
 {
 mds: {
 request: 21862302,
 reply: 21862302,
 reply_latency: {
 avgcount: 21862302,
 sum: 16728.480772060
 },
 forward: 0,
 dir_fetch: 13,
 dir_commit: 50788,
 dir_split: 0,
 inode_max: 100000,
 inodes: 100010,
 inodes_top: 0,
 inodes_bottom: 0,
 inodes_pin_tail: 100010,
 inodes_pinned: 100010,
 inodes_expired: 4308279,
 inodes_with_caps: 8,
 caps: 8,
 subtrees: 2,
 traverse: 30802465,
 traverse_hit: 26394836,
 traverse_forward: 0,
 traverse_discover: 0,
 traverse_dir_fetch: 0,
 traverse_remote_ino: 0,
 traverse_lock: 0,
 load_cent: 2186230200,
 q: 0,
 exported: 0,
 exported_inodes: 0,
 imported: 0,
 imported_inodes: 0
 }
 }

 Once this test finishes and I verify the files were all correctly
 written, I will retest using the SAMBA VFS interface, followed by the
 kernel test.

 Please let me know if there is more info you need and if you want me
 to open a ticket.

 Best regards
 Eric



 On Mon, Jul 13, 2015 at 9:40 AM, Eric Eastman
 eric.east...@keepertech.com wrote:
  Thanks John. I will back the test down to the simple case of 1 client
  without the kernel driver and only running NFS Ganesha, and work forward
  till I trip the problem and report my findings.
 
  Eric
 
  On Mon, Jul 13, 2015 at 2:18 AM, John Spray john.sp...@redhat.com
 wrote:
 
 
 
  On 13/07/2015 04:02, Eric Eastman wrote:
 
  Hi John,
 
  I am seeing this problem with Ceph v9.0.1 with the v4.1 kernel on all
  nodes.  This system is using 4 Ceph FS client systems. They all have
  the kernel driver version of CephFS loaded, but none are mounting the
  file system. All 4 clients are using the libcephfs VFS interface to
  Ganesha NFS (V2.2.0-2) and Samba (Version 4.3.0pre1-GIT-0791bb0) to
  share out the Ceph file system.
 
  # ceph -s
   cluster 6d8aae1e-1125-11e5-a708-001b78e265be
health HEALTH_WARN
   4 near full osd(s)
   mds0: Client ede-c2-gw01 failing to respond to cache
  pressure
   mds0: Client ede-c2-gw02:cephfs failing to respond to
 cache
  pressure
   mds0: Client ede-c2-gw03:cephfs failing to respond to
 cache
  pressure
monmap e1: 3 mons at
 
  {ede-c2-mon01=
 10.15.2.121:6789/0,ede-c2-mon02=10.15.2.122:6789/0,ede-c2-mon03=10.15.2.123:6789/0
 }
   election epoch 8, quorum 0,1,2
  ede-c2-mon01,ede-c2-mon02,ede-c2-mon03
mdsmap e912: 1/1/1 up {0=ede-c2-mds03=up:active}, 2 up:standby
osdmap e272: 8 osds: 8 up, 8 in
 pgmap v225264: 832 pgs, 4 pools, 188 GB data, 5173 kobjects
   212 GB used, 48715 MB / 263 GB avail
832 active+clean
 client io 1379 kB/s rd, 20653 B/s wr, 98 op/s
 
 
  It would help if we knew whether it's the kernel clients or the
 userspace
  clients that are generating the warnings here.  You've probably
 already done
  this, but I'd get rid of any unused kernel 

Re: [ceph-users] CPU Hyperthreading ?

2015-07-14 Thread Florent MONTHEL
Thanks for the feedback, Somnath.

Sent from my iPhone

 On 14 Jul 2015, at 20:24, Somnath Roy somnath@sandisk.com wrote:
 
 I was getting better performance with HT enabled (Intel CPU) for ceph-osd. I 
 guess for mon it doesn't matter, but for RadosGW I didn't measure the 
 difference... We are running our benchmarks with HT enabled for all components, 
 though.
 
 Thanks & Regards
 Somnath
 
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
 Florent MONTHEL
 Sent: Tuesday, July 14, 2015 5:19 PM
 To: ceph-users
 Subject: [ceph-users] CPU Hyperthreading ?
 
 Hi list
 
 Do you recommend enabling or disabling hyper-threading on the CPU?
 Is the answer the same for Mon, OSD, and RadosGW?
 Thanks
 
 Sent from my iPhone
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 
 PLEASE NOTE: The information contained in this electronic mail message is 
 intended only for the use of the designated recipient(s) named above. If the 
 reader of this message is not the intended recipient, you are hereby notified 
 that you have received this message in error and that any review, 
 dissemination, distribution, or copying of this message is strictly 
 prohibited. If you have received this communication in error, please notify 
 the sender by telephone or e-mail (as shown above) immediately and destroy 
 any and all copies of this message in your possession (whether hard copies or 
 electronically stored copies).
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and Redhat Enterprise Virtualization (RHEV)

2015-07-14 Thread Neil Levine
RHEV does not formally support Ceph yet. Future versions are looking to
include Cinder support, which will allow you to hook in Ceph.
You should contact your RHEV contacts, who can give an indication of the
timeline for this.

Neil

On Tue, Jul 14, 2015 at 10:43 AM, Peter Michael Calum pe...@tdc.dk wrote:

  Hi,

 Does anyone know if it is possible to use Ceph storage in Red Hat
 Enterprise Virtualization (RHEV)
 and connect it as a data domain in the Red Hat Enterprise Virtualization
 Manager (RHEVM)?

 My RHEV version and hypervisors are the latest, RHEV 6.5.

 Thanks,
 Peter Calum
 TDC


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph and Redhat Enterprise Virtualization (RHEV)

2015-07-14 Thread Peter Michael Calum
Hi,

Does anyone know if it is possible to use Ceph storage in Red Hat Enterprise 
Virtualization (RHEV)
and connect it as a data domain in the Red Hat Enterprise Virtualization Manager 
(RHEVM)?

My RHEV version and hypervisors are the latest, RHEV 6.5.

Thanks,
Peter Calum
TDC

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com