Re: [ceph-users] Poor IOPS performance with Ceph

2015-09-09 Thread Daleep Bais
Hi Shinobu,

I have 1 X 1TB HDD on each node. The network bandwidth between nodes is
1Gbps.

Thanks for the info. I will also try to go through discussion mails related
to performance.

Thanks.

Daleep Singh Bais


On Wed, Sep 9, 2015 at 2:09 PM, Shinobu Kinjo  wrote:

> How many disks does each osd node have?
> How about networking layer?
> There are several factors that can make your cluster much stronger.
>
> Probably you may need to take a look at other discussion on this mailing
> list.
> There was a bunch of discussion about performance.
>
> Shinobu
>
> - Original Message -
> From: "Daleep Bais" 
> To: "Ceph-User" 
> Sent: Wednesday, September 9, 2015 5:17:48 PM
> Subject: [ceph-users] Poor IOPS performance with Ceph
>
> Hi,
>
> I have made a test ceph cluster of 6 OSD's and 03 MON. I am testing the
> read write performance for the test cluster and the read IOPS is poor.
> When I individually test it for each HDD, I get good performance, whereas,
> when I test it for ceph cluster, it is poor.
>
> Between nodes, using iperf, I get good bandwidth.
>
> My cluster info :
>
> root@ceph-node3:~# ceph --version
> ceph version 9.0.2-752-g64d37b7 (64d37b70a687eb63edf69a91196bb124651da210)
> root@ceph-node3:~# ceph -s
> cluster 9654468b-5c78-44b9-9711-4a7c4455c480
> health HEALTH_OK
> monmap e9: 3 mons at {ceph-node10=
> 192.168.1.210:6789/0,ceph-node17=192.168.1.217:6789/0,ceph-node3=192.168.1.203:6789/0
> }
> election epoch 442, quorum 0,1,2 ceph-node3,ceph-node10,ceph-node17
> osdmap e1850: 6 osds: 6 up, 6 in
> pgmap v17400: 256 pgs, 2 pools, 9274 MB data, 2330 objects
> 9624 MB used, 5384 GB / 5394 GB avail
> 256 active+clean
>
>
> I have mapped an RBD block device to client machine (Ubuntu 14) and from
> there, when I run tests using FIO, I get good write IOPS, however, read is
> poor comparatively.
>
> Write IOPS : 44618 approx
>
> Read IOPS : 7356 approx
>
> Pool replica - single
> pool 1 'test1' replicated size 1 min_size 1
>
> I have implemented rbd_readahead in my ceph conf file also.
> Any suggestions in this regard will help me.
>
> Thanks.
>
> Daleep Singh Bais
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question on cephfs recovery tools

2015-09-09 Thread Shinobu
Did you unmount the filesystem using?

  umount -l

Shinobu

On Wed, Sep 9, 2015 at 4:31 PM, Goncalo Borges 
wrote:

> Dear Ceph / CephFS gurus...
>
> Bear with me a bit while I give you a bit of context. Questions will
> appear at the end.
>
> 1) I am currently running ceph 9.0.3 and I have installed it to test the
> cephfs recovery tools.
>
> 2) I've created a situation where I've deliberately (on purpose) lost some
> data and metadata (check annex 1 after the main email).
>
> 3) I've stopped the mds, and waited to check how the cluster reacts. After
> some time, as expected, the cluster reports an ERROR state, with a lot of
> PGs degraded and stuck
>
> # ceph -s
> cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
>  health HEALTH_ERR
> 174 pgs degraded
> 48 pgs stale
> 174 pgs stuck degraded
> 41 pgs stuck inactive
> 48 pgs stuck stale
> 238 pgs stuck unclean
> 174 pgs stuck undersized
> 174 pgs undersized
> recovery 22366/463263 objects degraded (4.828%)
> recovery 8190/463263 objects misplaced (1.768%)
> too many PGs per OSD (388 > max 300)
> mds rank 0 has failed
> mds cluster is degraded
>  monmap e1: 3 mons at
> {mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
> election epoch 10, quorum 0,1,2 mon1,mon3,mon2
>  mdsmap e24: 0/1/1 up, 1 failed
>  osdmap e544: 21 osds: 15 up, 15 in; 87 remapped pgs
>   pgmap v25699: 2048 pgs, 2 pools, 602 GB data, 150 kobjects
> 1715 GB used, 40027 GB / 41743 GB avail
> 22366/463263 objects degraded (4.828%)
> 8190/463263 objects misplaced (1.768%)
> 1799 active+clean
>  110 active+undersized+degraded
>   60 active+remapped
>   37 stale+undersized+degraded+peered
>   23 active+undersized+degraded+remapped
>   11 stale+active+clean
>4 undersized+degraded+peered
>4 active
>
> 4) I've umounted the cephfs clients ('umount -l' worked for me this time
> but I already had situations where 'umount' would simply hang, and the only
> viable solutions would be to reboot the client).
>
> 5) I've recovered the ceph cluster by (details on the recover operations
> are in annex 2 after the main email.)
> - declaring the osds lost
> - removing the osds from the crush map
> - letting the cluster stabilize and letting all the recover I/O finish
> - identifying stuck PGs
> - checking if they existed, and if not recreate them.
>
>
> 6) I've restarted the MDS. Initially, the mds cluster was considered
> degraded but after some small amount of time, that message disappeared. The
> WARNING status was just because of "too many PGs per OSD (409 > max 300)"
>
> # ceph -s
> cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
>  health HEALTH_WARN
> too many PGs per OSD (409 > max 300)
> mds cluster is degraded
>  monmap e1: 3 mons at
> {mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
> election epoch 10, quorum 0,1,2 mon1,mon3,mon2
>  mdsmap e27: 1/1/1 up {0=rccephmds=up:reconnect}
>  osdmap e614: 15 osds: 15 up, 15 in
>   pgmap v27304: 2048 pgs, 2 pools, 586 GB data, 146 kobjects
> 1761 GB used, 39981 GB / 41743 GB avail
> 2048 active+clean
>   client io 4151 kB/s rd, 1 op/s
>
> (wait some time)
>
> # ceph -s
> cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
>  health HEALTH_WARN
> too many PGs per OSD (409 > max 300)
>  monmap e1: 3 mons at
> {mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
> election epoch 10, quorum 0,1,2 mon1,mon3,mon2
>  mdsmap e29: 1/1/1 up {0=rccephmds=up:active}
>  osdmap e614: 15 osds: 15 up, 15 in
>   pgmap v30442: 2048 pgs, 2 pools, 586 GB data, 146 kobjects
> 1761 GB used, 39981 GB / 41743 GB avail
> 2048 active+clean
>
> 7) I was able to mount the cephfs filesystem in a client. When I tried to
> read a file made of some lost objects, I got holes in part of the file
> (compare with the same operation on annex 1)
>
> # od /cephfs/goncalo/5Gbytes_029.txt | head
> 000 00 00 00 00 00 00 00 00
> *
> 200 176665 053717 015710 124465 047254 102011 065275 123534
> 220 015727 131070 075673 176566 047511 154343 146334 006111
> 240 050506 102172 172362 121464 003532 005427 137554 137111
> 260 071444 052477 123364 127652 043562 144163 170405 026422
> 2000100 050316 117337 042573 171037 150704 071144 066344 116653
> 2000120 076041 041546 030235 055204 016253 136063 046012 066200
> 2000140 171626 123573 065351 032357 171326 132673 012213 016046
> 2000160 022034 160053 156107 141471 162551 124615 102247 125502
>
>
> Finally the questions:
>
> a./ Under a situation 

[ceph-users] Poor IOPS performance with Ceph

2015-09-09 Thread Daleep Bais
Hi,

I have made a test ceph cluster of 6 OSD's and 03 MON. I am testing the
read write performance for the test cluster and the read IOPS is  poor.
When I individually test it for each HDD, I get good performance, whereas,
when I test it for ceph cluster, it is poor.

Between nodes, using iperf, I get good bandwidth.

My cluster info :

root@ceph-node3:~# ceph --version
ceph version 9.0.2-752-g64d37b7 (64d37b70a687eb63edf69a91196bb124651da210)
root@ceph-node3:~# ceph -s
cluster 9654468b-5c78-44b9-9711-4a7c4455c480
 health HEALTH_OK
 monmap e9: 3 mons at {ceph-node10=
192.168.1.210:6789/0,ceph-node17=192.168.1.217:6789/0,ceph-node3=192.168.1.203:6789/0
}
election epoch 442, quorum 0,1,2
ceph-node3,ceph-node10,ceph-node17
 osdmap e1850: 6 osds: 6 up, 6 in
  pgmap v17400: 256 pgs, 2 pools, 9274 MB data, 2330 objects
9624 MB used, 5384 GB / 5394 GB avail
 256 active+clean


I have mapped an RBD block device to client machine (Ubuntu 14) and from
there, when I run tests using FIO, I get good write IOPS, however, read is
poor comparatively.

Write IOPS : 44618 approx

Read IOPS : 7356 approx

Pool replica - single

pool 1 'test1' replicated size 1 min_size 1

I have implemented rbd_readahead in my ceph conf file also.
Any suggestions in this regard will help me.
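For reference, the rbd readahead settings live in the [client] section of ceph.conf;
a minimal sketch (the values are only examples, and as noted later in the thread they
only affect librbd clients, not the kernel RBD driver):

  [client]
  rbd readahead trigger requests = 10
  rbd readahead max bytes = 4194304
  rbd readahead disable after bytes = 52428800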

Thanks.

Daleep Singh Bais
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor IOPS performance with Ceph

2015-09-09 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Daleep Bais
> Sent: 09 September 2015 09:18
> To: Ceph-User 
> Subject: [ceph-users] Poor IOPS performance with Ceph
> 
> Hi,
> 
> I have made a test ceph cluster of 6 OSD's and 03 MON. I am testing the read
> write performance for the test cluster and the read IOPS is  poor.
> When I individually test it for each HDD, I get good performance, whereas,
> when I test it for ceph cluster, it is poor.

Can you give any further details about your cluster? Are your HDDs backed by
SSD journals?

> 
> Between nodes, using iperf, I get good bandwidth.
> 
> My cluster info :
> 
> root@ceph-node3:~# ceph --version
> ceph version 9.0.2-752-g64d37b7
> (64d37b70a687eb63edf69a91196bb124651da210)
> root@ceph-node3:~# ceph -s
> cluster 9654468b-5c78-44b9-9711-4a7c4455c480
>  health HEALTH_OK
>  monmap e9: 3 mons at {ceph-node10=192.168.1.210:6789/0,ceph-
> node17=192.168.1.217:6789/0,ceph-node3=192.168.1.203:6789/0}
> election epoch 442, quorum 0,1,2 ceph-node3,ceph-node10,ceph-
> node17
>  osdmap e1850: 6 osds: 6 up, 6 in
>   pgmap v17400: 256 pgs, 2 pools, 9274 MB data, 2330 objects
> 9624 MB used, 5384 GB / 5394 GB avail
>  256 active+clean
> 
> 
> I have mapped an RBD block device to client machine (Ubuntu 14) and from
> there, when I run tests using FIO, I get good write IOPS, however, read is
> poor comparatively.
> 
> Write IOPS : 44618 approx
> 
> Read IOPS : 7356 approx

The first thing that strikes me is that your numbers are too good, unless these are
actually SSDs and not spinning HDDs. I would expect a maximum of around
600 read IOPS for 6x 7.2k disks, so I guess either you are hitting the page
cache on the OSD node(s) or the librbd cache.

The writes are even higher; are you using the "direct=1" option in the fio job?
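For example, a cache-independent read test against the mapped device might look
roughly like this (the device name and job parameters here are assumptions, not
taken from the original post):

  fio --name=randread --filename=/dev/rbd0 --ioengine=libaio --direct=1 \
      --rw=randread --bs=4k --iodepth=32 --runtime=60 --time_based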

> 
> Pool replica - single
> pool 1 'test1' replicated size 1 min_size 1
> 
> I have implemented rbd_readahead in my ceph conf file also.
> Any suggestions in this regard will help me.
> 
> Thanks.
> 
> Daleep Singh Bais




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question on cephfs recovery tools

2015-09-09 Thread Shinobu Kinjo
Anyhow this page would help you:

http://ceph.com/docs/master/cephfs/disaster-recovery/
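The tools covered there include, roughly, the following (a sketch from that
documentation; check the page itself for the exact invocations and ordering, and
export the journal before touching anything):

  cephfs-journal-tool journal export backup.bin
  cephfs-journal-tool event recover_dentries summary
  cephfs-journal-tool journal reset
  cephfs-table-tool all reset session
  cephfs-data-scan scan_extents <data pool>
  cephfs-data-scan scan_inodes <data pool>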

Shinobu

- Original Message -
From: "Shinobu Kinjo" 
To: "Goncalo Borges" 
Cc: "ceph-users" 
Sent: Wednesday, September 9, 2015 5:28:38 PM
Subject: Re: [ceph-users] Question on cephfs recovery tools

Did you try to identify what kind of processes were accessing the filesystem using
fuser or lsof and then kill them?
If not, you should do that first.
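(For reference, that usually looks something like the following, with /cephfs
standing in for wherever the client mounted the filesystem:)

  fuser -vm /cephfs     # list processes holding the mount
  fuser -km /cephfs     # kill them (use with care), then umount /cephfs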

Shinobu

- Original Message -
From: "Goncalo Borges" 
To: ski...@redhat.com
Sent: Wednesday, September 9, 2015 5:04:23 PM
Subject: Re: [ceph-users] Question on cephfs recovery tools

Hi Shinobu

> Did you unmount filesystem using?
>
>   umount -l

Yes!
Goncalo

>
> Shinobu
>
> On Wed, Sep 9, 2015 at 4:31 PM, Goncalo Borges 
> > wrote:
>
> Dear Ceph / CephFS gurus...
>
> Bear with me a bit while I give you a bit of context. Questions
> will appear at the end.
>
> 1) I am currently running ceph 9.0.3 and I have installed it to
> test the cephfs recovery tools.
>
> 2) I've created a situation where I've deliberately (on purpose)
> lost some data and metadata (check annex 1 after the main email).
>
> 3) I've stopped the mds, and waited to check how the cluster
> reacts. After some time, as expected, the cluster reports an ERROR
> state, with a lot of PGs degraded and stuck
>
> # ceph -s
> cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
>  health HEALTH_ERR
> 174 pgs degraded
> 48 pgs stale
> 174 pgs stuck degraded
> 41 pgs stuck inactive
> 48 pgs stuck stale
> 238 pgs stuck unclean
> 174 pgs stuck undersized
> 174 pgs undersized
> recovery 22366/463263 objects degraded (4.828%)
> recovery 8190/463263 objects misplaced (1.768%)
> too many PGs per OSD (388 > max 300)
> mds rank 0 has failed
> mds cluster is degraded
>  monmap e1: 3 mons at
> {mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
> election epoch 10, quorum 0,1,2 mon1,mon3,mon2
>  mdsmap e24: 0/1/1 up, 1 failed
>  osdmap e544: 21 osds: 15 up, 15 in; 87 remapped pgs
>   pgmap v25699: 2048 pgs, 2 pools, 602 GB data, 150 kobjects
> 1715 GB used, 40027 GB / 41743 GB avail
> 22366/463263 objects degraded (4.828%)
> 8190/463263 objects misplaced (1.768%)
> 1799 active+clean
>  110 active+undersized+degraded
>   60 active+remapped
>   37 stale+undersized+degraded+peered
>   23 active+undersized+degraded+remapped
>   11 stale+active+clean
>4 undersized+degraded+peered
>4 active
>
> 4) I've umounted the cephfs clients ('umount -l' worked for me
> this time but I already had situations where 'umount' would simply
> hang, and the only viable solutions would be to reboot the client).
>
> 5) I've recovered the ceph cluster by (details on the recover
> operations are in annex 2 after the main email.)
> - declaring the osds lost
> - removing the osds from the crush map
> - letting the cluster stabilize and letting all the recover I/O finish
> - identifying stuck PGs
> - checking if they existed, and if not recreate them.
>
>
> 6) I've restarted the MDS. Initially, the mds cluster was
> considered degraded but after some small amount of time, that
> message disappeared. The WARNING status was just because of "too
> many PGs per OSD (409 > max 300)"
>
> # ceph -s
> cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
>  health HEALTH_WARN
> too many PGs per OSD (409 > max 300)
> mds cluster is degraded
>  monmap e1: 3 mons at
> {mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
> election epoch 10, quorum 0,1,2 mon1,mon3,mon2
>  mdsmap e27: 1/1/1 up {0=rccephmds=up:reconnect}
>  osdmap e614: 15 osds: 15 up, 15 in
>   pgmap v27304: 2048 pgs, 2 pools, 586 GB data, 146 kobjects
> 1761 GB used, 39981 GB / 41743 GB avail
> 2048 active+clean
>   client io 4151 kB/s rd, 1 op/s
>
> (wait some time)
>
> # ceph -s
> cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
>  health HEALTH_WARN
> too 

Re: [ceph-users] Poor IOPS performance with Ceph

2015-09-09 Thread Shinobu Kinjo
Are you using that HDD also for storing journal data?
Or are you using an SSD for that purpose?

Shinobu

- Original Message -
From: "Daleep Bais" 
To: "Shinobu Kinjo" 
Cc: "Ceph-User" 
Sent: Wednesday, September 9, 2015 5:59:33 PM
Subject: Re: [ceph-users] Poor IOPS performance with Ceph

Hi Shinobu,

I have 1 X 1TB HDD on each node. The network bandwidth between nodes is
1Gbps.

Thanks for the info. I will also try to go through discussion mails related
to performance.

Thanks.

Daleep Singh Bais


On Wed, Sep 9, 2015 at 2:09 PM, Shinobu Kinjo  wrote:

> How many disks does each osd node have?
> How about networking layer?
> There are several factors that can make your cluster much stronger.
>
> Probably you may need to take a look at other discussion on this mailing
> list.
> There was a bunch of discussion about performance.
>
> Shinobu
>
> - Original Message -
> From: "Daleep Bais" 
> To: "Ceph-User" 
> Sent: Wednesday, September 9, 2015 5:17:48 PM
> Subject: [ceph-users] Poor IOPS performance with Ceph
>
> Hi,
>
> I have made a test ceph cluster of 6 OSD's and 03 MON. I am testing the
> read write performance for the test cluster and the read IOPS is poor.
> When I individually test it for each HDD, I get good performance, whereas,
> when I test it for ceph cluster, it is poor.
>
> Between nodes, using iperf, I get good bandwidth.
>
> My cluster info :
>
> root@ceph-node3:~# ceph --version
> ceph version 9.0.2-752-g64d37b7 (64d37b70a687eb63edf69a91196bb124651da210)
> root@ceph-node3:~# ceph -s
> cluster 9654468b-5c78-44b9-9711-4a7c4455c480
> health HEALTH_OK
> monmap e9: 3 mons at {ceph-node10=
> 192.168.1.210:6789/0,ceph-node17=192.168.1.217:6789/0,ceph-node3=192.168.1.203:6789/0
> }
> election epoch 442, quorum 0,1,2 ceph-node3,ceph-node10,ceph-node17
> osdmap e1850: 6 osds: 6 up, 6 in
> pgmap v17400: 256 pgs, 2 pools, 9274 MB data, 2330 objects
> 9624 MB used, 5384 GB / 5394 GB avail
> 256 active+clean
>
>
> I have mapped an RBD block device to client machine (Ubuntu 14) and from
> there, when I run tests using FIO, I get good write IOPS, however, read is
> poor comparatively.
>
> Write IOPS : 44618 approx
>
> Read IOPS : 7356 approx
>
> Pool replica - single
> pool 1 'test1' replicated size 1 min_size 1
>
> I have implemented rbd_readahead in my ceph conf file also.
> Any suggestions in this regard will help me.
>
> Thanks.
>
> Daleep Singh Bais
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] radula - radosgw(s3) cli tool

2015-09-09 Thread Andrew Bibby (lists)
Hey cephers,
Just wanted to briefly announce the release of a radosgw CLI tool that solves 
some of our team's minor annoyances. Called radula, a nod to the patron animal, 
this utility acts a lot like s3cmd with some tweaks to meet the expectations of 
our researchers.
 
https://pypi.python.org/pypi/radula
https://github.com/bibby/radula
 
I've seen a lot of boto wrappers, and yup - it's just another one. But, it 
could still have value for users, so we put it out there.
 
Here's a quick at its features:
- When a user is granted read access to a bucket, they're not given read access 
to any of the existing keys. radula applies bucket ACL changes to existing 
keys, and can synchronize anomalies. New keys are issued a copy of the bucket's 
ACL. Permissions are also kept from duplicating like they can on AWS and rados.
 
- Unless they are tiny, uploads are always multi-parted and multi-threaded. The 
file can then be checksum verified to have uploaded correctly.

- CLI and importable python module
 
- Typical s3cmd-like commands (mb, rb, lb, etc) leaning directly on boto; no 
clever rewrites.
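A quick way to try it, assuming the s3cmd-like verbs listed above (the bucket name
is just a placeholder):

  pip install radula
  radula lb                # list buckets
  radula mb my-bucket      # make a bucket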
 
We hope someone finds it useful.
Have a good rest of the week!
 
- Andrew Bibby
- DevOps, NantOmics, LLC

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question on cephfs recovery tools

2015-09-09 Thread Shinobu Kinjo
Did you try to identify what kind of processes were accessing the filesystem using
fuser or lsof and then kill them?
If not, you should do that first.

Shinobu

- Original Message -
From: "Goncalo Borges" 
To: ski...@redhat.com
Sent: Wednesday, September 9, 2015 5:04:23 PM
Subject: Re: [ceph-users] Question on cephfs recovery tools

Hi Shinobu

> Did you unmount filesystem using?
>
>   umount -l

Yes!
Goncalo

>
> Shinobu
>
> On Wed, Sep 9, 2015 at 4:31 PM, Goncalo Borges 
> > wrote:
>
> Dear Ceph / CephFS gurus...
>
> Bear with me a bit while I give you a bit of context. Questions
> will appear at the end.
>
> 1) I am currently running ceph 9.0.3 and I have installed it to
> test the cephfs recovery tools.
>
> 2) I've created a situation where I've deliberately (on purpose)
> lost some data and metadata (check annex 1 after the main email).
>
> 3) I've stopped the mds, and waited to check how the cluster
> reacts. After some time, as expected, the cluster reports an ERROR
> state, with a lot of PGs degraded and stuck
>
> # ceph -s
> cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
>  health HEALTH_ERR
> 174 pgs degraded
> 48 pgs stale
> 174 pgs stuck degraded
> 41 pgs stuck inactive
> 48 pgs stuck stale
> 238 pgs stuck unclean
> 174 pgs stuck undersized
> 174 pgs undersized
> recovery 22366/463263 objects degraded (4.828%)
> recovery 8190/463263 objects misplaced (1.768%)
> too many PGs per OSD (388 > max 300)
> mds rank 0 has failed
> mds cluster is degraded
>  monmap e1: 3 mons at
> {mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
> election epoch 10, quorum 0,1,2 mon1,mon3,mon2
>  mdsmap e24: 0/1/1 up, 1 failed
>  osdmap e544: 21 osds: 15 up, 15 in; 87 remapped pgs
>   pgmap v25699: 2048 pgs, 2 pools, 602 GB data, 150 kobjects
> 1715 GB used, 40027 GB / 41743 GB avail
> 22366/463263 objects degraded (4.828%)
> 8190/463263 objects misplaced (1.768%)
> 1799 active+clean
>  110 active+undersized+degraded
>   60 active+remapped
>   37 stale+undersized+degraded+peered
>   23 active+undersized+degraded+remapped
>   11 stale+active+clean
>4 undersized+degraded+peered
>4 active
>
> 4) I've umounted the cephfs clients ('umount -l' worked for me
> this time but I already had situations where 'umount' would simply
> hang, and the only viable solutions would be to reboot the client).
>
> 5) I've recovered the ceph cluster by (details on the recover
> operations are in annex 2 after the main email.)
> - declaring the osds lost
> - removing the osds from the crush map
> - letting the cluster stabilize and letting all the recover I/O finish
> - identifying stuck PGs
> - checking if they existed, and if not recreate them.
>
>
> 6) I've restarted the MDS. Initially, the mds cluster was
> considered degraded but after some small amount of time, that
> message disappeared. The WARNING status was just because of "too
> many PGs per OSD (409 > max 300)"
>
> # ceph -s
> cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
>  health HEALTH_WARN
> too many PGs per OSD (409 > max 300)
> mds cluster is degraded
>  monmap e1: 3 mons at
> {mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
> election epoch 10, quorum 0,1,2 mon1,mon3,mon2
>  mdsmap e27: 1/1/1 up {0=rccephmds=up:reconnect}
>  osdmap e614: 15 osds: 15 up, 15 in
>   pgmap v27304: 2048 pgs, 2 pools, 586 GB data, 146 kobjects
> 1761 GB used, 39981 GB / 41743 GB avail
> 2048 active+clean
>   client io 4151 kB/s rd, 1 op/s
>
> (wait some time)
>
> # ceph -s
> cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
>  health HEALTH_WARN
> too many PGs per OSD (409 > max 300)
>  monmap e1: 3 mons at
> {mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
> election epoch 10, quorum 0,1,2 mon1,mon3,mon2
>  mdsmap e29: 1/1/1 up {0=rccephmds=up:active}
>  osdmap e614: 15 osds: 15 up, 15 in
>   pgmap v30442: 2048 pgs, 2 pools, 586 GB 

Re: [ceph-users] Poor IOPS performance with Ceph

2015-09-09 Thread Nick Fisk
It looks like you are using the kernel RBD client, i.e. you ran "rbd map ...". In
that case the librbd settings in ceph.conf won't have any effect, as they
only apply if you are using fio with the librbd engine.

There are several things you may have to do to improve kernel client
performance, but the first thing you need to do is pass the "direct=1" flag to your fio
job to get a realistic idea of your cluster's performance. Be warned: if you
thought you had bad performance now, you will likely be shocked after you
enable it.
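For reference, the two usual first steps are bumping the kernel device's readahead
(which helps buffered sequential reads) and re-running fio with O_DIRECT to get
honest numbers; a sketch, assuming the mapped device is rbd0:

  echo 4096 > /sys/block/rbd0/queue/read_ahead_kb
  fio --name=test --filename=/dev/rbd0 --direct=1 --rw=read --bs=4k --runtime=60 --time_based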

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Daleep Bais
> Sent: 09 September 2015 09:37
> To: Nick Fisk 
> Cc: Ceph-User 
> Subject: Re: [ceph-users] Poor IOPS performance with Ceph
> 
> Hi Nick,
> 
> I don't have a separate SSD / HDD for the journal. I am using a 10 GB partition on the
> same HDD for journaling. They are rotating HDDs and not SSDs.
> 
> I am using below command to run the test:
> 
> fio --name=test --filename=test --bs=4k  --size=4G --readwrite=read / write
> 
> I did some kernel tuning and that has improved my write IOPS. For reads I am
> using rbd_readahead and also used the read_ahead_kb kernel tuning
> parameter.
> 
> Also, I should mention that it's not x86; it's ARMv7 (32-bit).
> 
> Thanks.
> 
> Daleep Singh Bais
> 
> 
> 
> On Wed, Sep 9, 2015 at 1:55 PM, Nick Fisk  wrote:
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> Of
> > Daleep Bais
> > Sent: 09 September 2015 09:18
> > To: Ceph-User 
> > Subject: [ceph-users] Poor IOPS performance with Ceph
> >
> > Hi,
> >
> > I have made a test ceph cluster of 6 OSD's and 03 MON. I am testing the
> read
> > write performance for the test cluster and the read IOPS is  poor.
> > When I individually test it for each HDD, I get good performance, whereas,
> > when I test it for ceph cluster, it is poor.
> 
> Can you give any further details about your cluster. Are your HDD's backed by
> SSD journals?
> 
> >
> > Between nodes, using iperf, I get good bandwidth.
> >
> > My cluster info :
> >
> > root@ceph-node3:~# ceph --version
> > ceph version 9.0.2-752-g64d37b7
> > (64d37b70a687eb63edf69a91196bb124651da210)
> > root@ceph-node3:~# ceph -s
> > cluster 9654468b-5c78-44b9-9711-4a7c4455c480
> >  health HEALTH_OK
> >  monmap e9: 3 mons at {ceph-node10=192.168.1.210:6789/0,ceph-
> > node17=192.168.1.217:6789/0,ceph-node3=192.168.1.203:6789/0}
> > election epoch 442, quorum 0,1,2 ceph-node3,ceph-node10,ceph-
> > node17
> >  osdmap e1850: 6 osds: 6 up, 6 in
> >   pgmap v17400: 256 pgs, 2 pools, 9274 MB data, 2330 objects
> > 9624 MB used, 5384 GB / 5394 GB avail
> >  256 active+clean
> >
> >
> > I have mapped an RBD block device to client machine (Ubuntu 14) and from
> > there, when I run tests using FIO, I get good write IOPS, however, read is
> > poor comparatively.
> >
> > Write IOPS : 44618 approx
> >
> > Read IOPS : 7356 approx
> 
> 1st thing that strikes me is that your numbers are too good, unless these are
> actually SSD's and not spinning HDD's? I would expect to get around a max of
> 600 read IOPs for 6x 7.2k disks, so I guess either you are hitting the page
> cache on the OSD node(s) or the librbd cache.
> 
> The writes are even higher, are you using the "direct=1" option in the Fio
> job?
> 
> >
> > Pool replica - single
> > pool 1 'test1' replicated size 1 min_size 1
> >
> > I have implemented rbd_readahead in my ceph conf file also.
> > Any suggestions in this regard will help me.
> >
> > Thanks.
> >
> > Daleep Singh Bais
> 
> 
> 






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor IOPS performance with Ceph

2015-09-09 Thread Shinobu Kinjo
How many disks does each OSD node have?
How about the networking layer?
There are several factors that can make your cluster much stronger.

You may want to take a look at other discussions on this mailing list.
There has been a bunch of discussion about performance.

Shinobu

- Original Message -
From: "Daleep Bais" 
To: "Ceph-User" 
Sent: Wednesday, September 9, 2015 5:17:48 PM
Subject: [ceph-users] Poor IOPS performance with Ceph

Hi, 

I have made a test ceph cluster of 6 OSD's and 03 MON. I am testing the read 
write performance for the test cluster and the read IOPS is poor. 
When I individually test it for each HDD, I get good performance, whereas, when 
I test it for ceph cluster, it is poor. 

Between nodes, using iperf, I get good bandwidth. 

My cluster info : 

root@ceph-node3:~# ceph --version 
ceph version 9.0.2-752-g64d37b7 (64d37b70a687eb63edf69a91196bb124651da210) 
root@ceph-node3:~# ceph -s 
cluster 9654468b-5c78-44b9-9711-4a7c4455c480 
health HEALTH_OK 
monmap e9: 3 mons at {ceph-node10= 
192.168.1.210:6789/0,ceph-node17=192.168.1.217:6789/0,ceph-node3=192.168.1.203:6789/0
 } 
election epoch 442, quorum 0,1,2 ceph-node3,ceph-node10,ceph-node17 
osdmap e1850: 6 osds: 6 up, 6 in 
pgmap v17400: 256 pgs, 2 pools, 9274 MB data, 2330 objects 
9624 MB used, 5384 GB / 5394 GB avail 
256 active+clean 


I have mapped an RBD block device to client machine (Ubuntu 14) and from there, 
when I run tests using FIO, I get good write IOPS, however, read is poor
comparatively. 

Write IOPS : 44618 approx 

Read IOPS : 7356 approx 

Pool replica - single 
pool 1 'test1' replicated size 1 min_size 1 

I have implemented rbd_readahead in my ceph conf file also. 
Any suggestions in this regard will help me.

Thanks. 

Daleep Singh Bais 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ensuring write activity is finished

2015-09-09 Thread Jan Schermer
I never played much with rados bench, but it doesn't seem to have, for example,
settings for synchronous/asynchronous workloads; thus it probably just
benchmarks the OSD throughput and the ability to write to the journal (in write mode)
unless you let it run for a longer time.
So when you stop rados bench the OSDs are actually still flushing the data, 
exactly as you wrote.

There are parameters filestore_min_sync_interval and 
filestore_max_sync_interval on OSDs, the cluster should be idle after 
filestore_max_sync_interval (+ a few seconds to actually write the dirty data + 
possibly a few seconds for filesystem to flush) has elapsed.
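For reference, those are set in the [osd] section of ceph.conf; the values below are
roughly the defaults (a sketch, not a tuning recommendation):

  [osd]
  filestore min sync interval = 0.01
  filestore max sync interval = 5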

How did you drop caches on the OSD nodes?
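(The usual way is something like the following on each OSD node:)

  sync; echo 3 > /proc/sys/vm/drop_caches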

I apologize in advance if I'm wrong :-)

Jan

> On 08 Sep 2015, at 20:38, Deneau, Tom  wrote:
> 
> When measuring read bandwidth using rados bench, I've been doing the
> following:
>   * write some objects using rados bench write --no-cleanup
>   * drop caches on the osd nodes
>   * use rados bench seq to read.
> 
> I've noticed that on the first rados bench seq immediately following the 
> rados bench write,
> there is often activity on the journal partitions which must be a carry over 
> from the rados
> bench write.
> 
> What is the preferred way to ensure that all write activity is finished 
> before starting
> to use rados bench seq?
> 
> -- Tom Deneau
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Question on cephfs recovery tools

2015-09-09 Thread Goncalo Borges

Dear Ceph / CephFS gurus...

Bear with me a bit while I give you a bit of context. Questions will
appear at the end.


1) I am currently running ceph 9.0.3 and I have installed it to test the
cephfs recovery tools.


2) I've created a situation where I've deliberately (on purpose) lost 
some data and metadata (check annex 1 after the main email).


3) I've stopped the mds, and waited to check how the cluster reacts. 
After some time, as expected, the cluster reports an ERROR state, with a
lot of PGs degraded and stuck


   # ceph -s
cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
 health HEALTH_ERR
174 pgs degraded
48 pgs stale
174 pgs stuck degraded
41 pgs stuck inactive
48 pgs stuck stale
238 pgs stuck unclean
174 pgs stuck undersized
174 pgs undersized
recovery 22366/463263 objects degraded (4.828%)
recovery 8190/463263 objects misplaced (1.768%)
too many PGs per OSD (388 > max 300)
mds rank 0 has failed
mds cluster is degraded
 monmap e1: 3 mons at
   {mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
election epoch 10, quorum 0,1,2 mon1,mon3,mon2
 mdsmap e24: 0/1/1 up, 1 failed
 osdmap e544: 21 osds: 15 up, 15 in; 87 remapped pgs
  pgmap v25699: 2048 pgs, 2 pools, 602 GB data, 150 kobjects
1715 GB used, 40027 GB / 41743 GB avail
22366/463263 objects degraded (4.828%)
8190/463263 objects misplaced (1.768%)
1799 active+clean
 110 active+undersized+degraded
  60 active+remapped
  37 stale+undersized+degraded+peered
  23 active+undersized+degraded+remapped
  11 stale+active+clean
   4 undersized+degraded+peered
   4 active

4) I've umounted the cephfs clients ('umount -l' worked for me this time 
but I already had situations where 'umount' would simply hang, and the 
only viable solution would be to reboot the client).


5) I've recovered the ceph cluster by (details on the recover operations 
are in annex 2 after the main email.)

- declaring the osds lost
- removing the osds from the crush map
- letting the cluster stabilize and letting all the recover I/O finish
- identifying stuck PGs
- checking if they existed, and if not recreate them.
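For reference, the commands behind those steps usually look roughly like this
(a sketch; the OSD and PG ids are placeholders, and annex 2 is not reproduced here):

  ceph osd lost <id> --yes-i-really-mean-it
  ceph osd crush remove osd.<id>
  ceph auth del osd.<id>
  ceph osd rm <id>
  ceph pg dump_stuck stale
  ceph pg dump_stuck inactive
  ceph pg force_create_pg <pg.id>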


6) I've restarted the MDS. Initially, the mds cluster was considered 
degraded but after some small amount of time, that message disappeared. 
The WARNING status was just because of "too many PGs per OSD (409 > max 
300)"


   # ceph -s
cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
 health HEALTH_WARN
too many PGs per OSD (409 > max 300)
mds cluster is degraded
 monmap e1: 3 mons at
   {mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
election epoch 10, quorum 0,1,2 mon1,mon3,mon2
 mdsmap e27: 1/1/1 up {0=rccephmds=up:reconnect}
 osdmap e614: 15 osds: 15 up, 15 in
  pgmap v27304: 2048 pgs, 2 pools, 586 GB data, 146 kobjects
1761 GB used, 39981 GB / 41743 GB avail
2048 active+clean
  client io 4151 kB/s rd, 1 op/s

   (wait some time)

   # ceph -s
cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
 health HEALTH_WARN
too many PGs per OSD (409 > max 300)
 monmap e1: 3 mons at
   {mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
election epoch 10, quorum 0,1,2 mon1,mon3,mon2
 mdsmap e29: 1/1/1 up {0=rccephmds=up:active}
 osdmap e614: 15 osds: 15 up, 15 in
  pgmap v30442: 2048 pgs, 2 pools, 586 GB data, 146 kobjects
1761 GB used, 39981 GB / 41743 GB avail
2048 active+clean

7) I was able to mount the cephfs filesystem in a client. When I tried 
to read a file made of some lost objects, I got holes in part of the 
file (compare with the same operation on annex 1)


   # od /cephfs/goncalo/5Gbytes_029.txt | head
   000 00 00 00 00 00 00 00 00
   *
   200 176665 053717 015710 124465 047254 102011 065275 123534
   220 015727 131070 075673 176566 047511 154343 146334 006111
   240 050506 102172 172362 121464 003532 005427 137554 137111
   260 071444 052477 123364 127652 043562 144163 170405 026422
   2000100 050316 117337 042573 171037 150704 071144 066344 116653
   2000120 076041 041546 030235 055204 016253 136063 046012 066200
   2000140 171626 123573 065351 032357 171326 132673 012213 016046
   2000160 022034 160053 156107 141471 162551 124615 102247 125502


Finally the questions:

a./ Under a situation like the one described above, how can we safely
terminate cephfs in the 

Re: [ceph-users] maximum object size

2015-09-09 Thread HEWLETT, Paul (Paul)
By setting the parameter osd_max_write_size to 2047…
This normally defaults to 90.

Setting to 2048 exposes a bug in Ceph where signed overflow occurs...

Part of the problem is my expectations. Ilya pointed out that one can use
libradosstriper to stripe a large object over many OSDs. I expected this
to happen automatically for any object > osd_max_write_size (=90MB) but it
does not. Instead one has to set special attributes to trigger striping.

Additionally interaction with erasure coding is unclear - apparently the
error is reached when the total file size exceeds the limit - if EC is
enabled then maybe a better solution would be to test the size of the
chunk written to the OSD which will be only part of the total file size.
Or do I have that wrong?

If EC is being used then would the individual chunks after splitting the
file then be erasure coded ? I.e if we decide to split a large file into 5
striped chunks does ceph then EC the individual chunks?

Striping is not really documented…

Paul

On 08/09/2015 17:53, "Somnath Roy"  wrote:

>I think the limit is 90 MB from OSD side, isn't it ?
>If so, how are you able to write object till 1.99 GB ?
>Am I missing anything ?
>
>Thanks & Regards
>Somnath
>
>-Original Message-
>From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>HEWLETT, Paul (Paul)
>Sent: Tuesday, September 08, 2015 8:55 AM
>To: ceph-users@lists.ceph.com
>Subject: [ceph-users] maximum object size
>
>Hi All
>
>We have recently encountered a problem on Hammer (0.94.2) whereby we
>cannot write objects > 2GB in size to the rados backend.
>(NB not RadosGW, CephFS or RBD)
>
>I found the following issue
>https://wiki.ceph.com/Planning/Blueprints/Firefly/Object_striping_in_libra
>d
>os which seems to address this but no progress reported.
>
>What are the implications of writing such large objects to RADOS? What
>impact is expected on the XFS backend particularly regarding the size and
>location of the journal?
>
>Any prospect of progressing the issue reported in the enclosed link?
>
>Interestingly I could not find anywhere in the ceph documentation that
>describes the 2GB limitation. The implication of most of the website docs
>is that there is no limit on objects stored in Ceph. The only hint is
>that osd_max_write_size is a 32 bit signed integer.
>
>If we use erasure coding will this reduce the impact? I.e. 4+1 EC will
>only write 500MB to each OSD and then this value will be tested against
>the chunk size instead of the total file size?
>
>The relevant code in Ceph is:
>
>src/FileJournal.cc:
>
>  needed_space = ((int64_t)g_conf->osd_max_write_size) << 20;
>  needed_space += (2 * sizeof(entry_header_t)) + get_top();
>  if (header.max_size - header.start < needed_space) {
>derr << "FileJournal::create: OSD journal is not large enough to hold
>"
><< "osd_max_write_size bytes!" << dendl;
>ret = -ENOSPC;
>goto free_buf;
>  }
>
>src/osd/OSD.cc:
>
>// too big?
>if (cct->_conf->osd_max_write_size &&
>m->get_data_len() > cct->_conf->osd_max_write_size << 20) {
>// journal can't hold commit!
> derr << "handle_op msg data len " << m->get_data_len()
> << " > osd_max_write_size " << (cct->_conf->osd_max_write_size << 20)
> << " on " << *m << dendl;
>service.reply_op_error(op, -OSD_WRITETOOBIG);
>return;
>  }
>
>Interestingly the code in OSD.cc looks like a bug - the max_write value
>should be cast to an int64_t before shifting left 20 bits (which is done
>correctly in FileJournal.cc). Otherwise overflow may occur and negative
>values may be generated.
>
>
>Any comments welcome - any help appreciated.
>
>Regards
>Paul
>
>
>___
>ceph-users mailing list
>ceph-users@lists.ceph.com
>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>PLEASE NOTE: The information contained in this electronic mail message is
>intended only for the use of the designated recipient(s) named above. If
>the reader of this message is not the intended recipient, you are hereby
>notified that you have received this message in error and that any
>review, dissemination, distribution, or copying of this message is
>strictly prohibited. If you have received this communication in error,
>please notify the sender by telephone or e-mail (as shown above)
>immediately and destroy any and all copies of this message in your
>possession (whether hard copies or electronically stored copies).
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor IOPS performance with Ceph

2015-09-09 Thread Daleep Bais
Hi Nick,

I don't have a separate SSD / HDD for the journal. I am using a 10 GB partition on
the same HDD for journaling. They are rotating HDDs and not SSDs.

I am using below command to run the test:

fio --name=test --filename=test --bs=4k  --size=4G --readwrite=read / write
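Note that without direct=1 this mostly measures the client page cache; the same
jobs with O_DIRECT would be roughly (a sketch of the adjusted commands):

  fio --name=test --filename=test --bs=4k --size=4G --rw=read --direct=1
  fio --name=test --filename=test --bs=4k --size=4G --rw=write --direct=1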

I did some kernel tuning and that has improved my write IOPS. For reads I am
using rbd_readahead and also used the read_ahead_kb kernel tuning
parameter.

Also, I should mention that it's not x86; it's ARMv7 (32-bit).

Thanks.

Daleep Singh Bais



On Wed, Sep 9, 2015 at 1:55 PM, Nick Fisk  wrote:

> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> > Daleep Bais
> > Sent: 09 September 2015 09:18
> > To: Ceph-User 
> > Subject: [ceph-users] Poor IOPS performance with Ceph
> >
> > Hi,
> >
> > I have made a test ceph cluster of 6 OSD's and 03 MON. I am testing the
> read
> > write performance for the test cluster and the read IOPS is  poor.
> > When I individually test it for each HDD, I get good performance,
> whereas,
> > when I test it for ceph cluster, it is poor.
>
> Can you give any further details about your cluster. Are your HDD's backed
> by SSD journals?
>
> >
> > Between nodes, using iperf, I get good bandwidth.
> >
> > My cluster info :
> >
> > root@ceph-node3:~# ceph --version
> > ceph version 9.0.2-752-g64d37b7
> > (64d37b70a687eb63edf69a91196bb124651da210)
> > root@ceph-node3:~# ceph -s
> > cluster 9654468b-5c78-44b9-9711-4a7c4455c480
> >  health HEALTH_OK
> >  monmap e9: 3 mons at {ceph-node10=192.168.1.210:6789/0,ceph-
> > node17=192.168.1.217:6789/0,ceph-node3=192.168.1.203:6789/0}
> > election epoch 442, quorum 0,1,2 ceph-node3,ceph-node10,ceph-
> > node17
> >  osdmap e1850: 6 osds: 6 up, 6 in
> >   pgmap v17400: 256 pgs, 2 pools, 9274 MB data, 2330 objects
> > 9624 MB used, 5384 GB / 5394 GB avail
> >  256 active+clean
> >
> >
> > I have mapped an RBD block device to client machine (Ubuntu 14) and from
> > there, when I run tests using FIO, I get good write IOPS, however, read
> is
> > poor comparatively.
> >
> > Write IOPS : 44618 approx
> >
> > Read IOPS : 7356 approx
>
> 1st thing that strikes me is that your numbers are too good, unless these
> are actually SSD's and not spinning HDD's? I would expect to get around a
> max of 600 read IOPs for 6x 7.2k disks, so I guess either you are hitting
> the page cache on the OSD node(s) or the librbd cache.
>
> The writes are even higher, are you using the "direct=1" option in the Fio
> job?
>
> >
> > Pool replica - single
> > pool 1 'test1' replicated size 1 min_size 1
> >
> > I have implemented rbd_readahead in my ceph conf file also.
> > Any suggestions in this regard will help me.
> >
> > Thanks.
> >
> > Daleep Singh Bais
>
>
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked

2015-09-09 Thread Jan Schermer
Just to recapitulate - the nodes are doing "nothing" when it drops to zero? Not 
flushing something to drives (iostat)? Not cleaning pagecache (kswapd and 
similar)? Not out of any type of memory (slab, min_free_kbytes)? Not network
link errors, no bad checksums (those are hard to spot, though)?

Unless you find something I suggest you try disabling offloads on the NICs and 
see if the problem goes away.
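For reference, checking and disabling offloads usually looks something like this
(the interface name is a placeholder, and which offloads are worth toggling depends
on the NIC):

  ethtool -k eth0                                  # show current offload settings
  ethtool -K eth0 gro off gso off tso off lro off  # disable for testing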

Jan

> On 08 Sep 2015, at 18:26, Lincoln Bryant  wrote:
> 
> For whatever it’s worth, my problem has returned and is very similar to 
> yours. Still trying to figure out what’s going on over here.
> 
> Performance is nice for a few seconds, then goes to 0. This is a similar 
> setup to yours (12 OSDs per box, Scientific Linux 6, Ceph 0.94.3, etc)
> 
>  384  16 29520 29504   307.287  1188 0.0492006  0.208259
>  385  16 29813 29797   309.532  1172 0.0469708  0.206731
>  386  16 30105 30089   311.756  1168 0.0375764  0.205189
>  387  16 30401 30385   314.009  1184  0.036142  0.203791
>  388  16 30695 30679   316.231  1176 0.0372316  0.202355
>  389  16 30987 30971   318.42  1168 0.0660476  0.200962
>  390  16 31282 31266   320.628  1180 0.0358611  0.199548
>  391  16 31568 31552   322.734  1144 0.0405166  0.198132
>  392  16 31857 31841   324.859  1156 0.0360826  0.196679
>  393  16 32090 32074   326.404   932 0.0416869   0.19549
>  394  16 32205 32189   326.743   460 0.0251877  0.194896
>  395  16 32302 32286   326.897   388 0.0280574  0.194395
>  396  16 32348 32332   326.537   184 0.0256821  0.194157
>  397  16 32385 32369   326.087   148 0.0254342  0.193965
>  398  16 32424 32408   325.659   156 0.0263006  0.193763
>  399  16 32445 32429   325.054    84 0.0233839  0.193655
> 2015-09-08 11:22:31.940164 min lat: 0.0165045 max lat: 67.6184 avg lat: 
> 0.193655
>  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>  400  16 32445 32429   324.241 0 -  0.193655
>  401  16 32445 32429   323.433 0 -  0.193655
>  402  16 32445 32429   322.628 0 -  0.193655
>  403  16 32445 32429   321.828 0 -  0.193655
>  404  16 32445 32429   321.031 0 -  0.193655
>  405  16 32445 32429   320.238 0 -  0.193655
>  406  16 32445 32429319.45 0 -  0.193655
>  407  16 32445 32429   318.665 0 -  0.193655
> 
> needless to say, very strange.
> 
> —Lincoln
> 
> 
>> On Sep 7, 2015, at 3:35 PM, Vickey Singh  wrote:
>> 
>> Adding ceph-users.
>> 
>> On Mon, Sep 7, 2015 at 11:31 PM, Vickey Singh  
>> wrote:
>> 
>> 
>> On Mon, Sep 7, 2015 at 10:04 PM, Udo Lembke  wrote:
>> Hi Vickey,
>> Thanks for your time in replying to my problem.
>> 
>> I had the same rados bench output after changing the motherboard of the 
>> monitor node with the lowest IP...
>> Due to the new mainboard, I assume the hw-clock was wrong during startup. 
>> Ceph health show no errors, but all VMs aren't able to do IO (very high load 
>> on the VMs - but no traffic).
>> I stopped the mon, but this don't changed anything. I had to restart all 
>> other mons to get IO again. After that I started the first mon also (with 
>> the right time now) and all worked fine again...
>> 
>> Thanks, I will try to restart all OSDs / MONs and report back if it solves
>> my problem.
>> 
>> Another posibility:
>> Do you use journals on SSDs? Perhaps the SSDs can't write due to garbage
>> collection?
>> 
>> No, I don't have journals on SSD; they are on the same OSD disk.
>> 
>> 
>> 
>> Udo
>> 
>> 
>> On 07.09.2015 16:36, Vickey Singh wrote:
>>> Dear Experts
>>> 
>>> Can someone please help me understand why my cluster is not able to write data.
>>> 
>>> See the output below: cur MB/s is 0 and avg MB/s is decreasing.
>>> 
>>> 
>>> Ceph Hammer  0.94.2
>>> CentOS 6 (3.10.69-1)
>>> 
>>> The Ceph status says OPS are blocked. I have tried checking everything I
>>> know:
>>> 
>>> - System resources (CPU, net, disk, memory) -- all normal
>>> - 10G network for public and cluster network -- no saturation
>>> - All disks are physically healthy
>>> - No messages in /var/log/messages or dmesg
>>> - Tried restarting the OSDs which are blocking operations, but no luck
>>> - Tried writing through RBD and rados bench; both give the same problem
>>> 
>>> Please help me to fix this problem.
>>> 
>>> #  rados bench -p rbd 60 write
>>> Maintaining 16 concurrent writes of 4194304 bytes for up to 60 seconds or 0 
>>> objects
>>> Object prefix: benchmark_data_stor1_1791844
>>>   sec Cur ops   started  finished  

Re: [ceph-users] jemalloc and transparent hugepage

2015-09-09 Thread Jan Schermer
This is great, thank you!

Jan

> On 09 Sep 2015, at 12:37, HEWLETT, Paul (Paul) 
>  wrote:
> 
> Hi Jan
> 
> If I can suggest that you look at:
> 
> http://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases
> 
> 
> where LinkedIn ended up disabling some of the new kernel features to
> prevent memory thrashing.
> Search for Transparent Huge Pages..
> 
> RHEL7 has these now disabled by default - LinkedIn are using GraphDB which
> is a log-structured system.
> 
> Paul
> 
> On 09/09/2015 10:54, "ceph-devel-ow...@vger.kernel.org on behalf of Jan
> Schermer" 
> wrote:
> 
>> I looked at THP before. It comes enabled on RHEL6 and on our KVM hosts it
>> merges a lot (~300GB hugepages on a 400GB KVM footprint).
>> I am probably going to disable it and see if it introduces any problems
>> for me - the most important gain here is better processor memory lookup
>> table (cache) utilization where it considerably lowers the number of
>> entries. Not sure how it affects different workloads - HPC guys should
>> have a good idea? I can only evaluate the effect on OSDs and KVM, but the
>> problem is that going over the cache limit even by a tiny bit can have
>> huge impact - theoretically...
>> 
>> This issue sounds strange, though. THP should kick in and defrag/remerge
>> the pages that are part-empty. Maybe it's just not aggressive enough?
>> Does the "free" memory show as used (part of RSS of the process using the
>> page)? I guess not because there might be more processes with memory in
>> the same hugepage.
>> 
>> This might actually partially explain the pagecache problem I mentioned
>> there about a week ago (slow OSD startup), maybe kswapd is what has to do
>> the work and defrag the pages when memory pressure is high!
>> 
>> I'll try to test it somehow, hopefully then there will be cake.
>> 
>> Jan
>> 
>>> On 09 Sep 2015, at 07:08, Alexandre DERUMIER 
>>> wrote:
>>> 
>>> They are a tracker here
>>> 
>>> https://github.com/jemalloc/jemalloc/issues/243
>>> "Improve interaction with transparent huge pages"
>>> 
>>> 
>>> 
>>> - Mail original -
>>> De: "aderumier" 
>>> À: "Sage Weil" 
>>> Cc: "ceph-devel" , "ceph-users"
>>> 
>>> Envoyé: Mercredi 9 Septembre 2015 06:37:22
>>> Objet: Re: [ceph-users] jemalloc and transparent hugepage
>>> 
> Is this something we can set with mallctl[1] at startup?
>>> 
>>> I don't think it's possible.
>>> 
>>> TP hugepages are managed by the kernel, not jemalloc.
>>> 
>>> (but a simple "echo never >
>>> /sys/kernel/mm/transparent_hugepage/enabled" in init script is enough)
>>> 
>>> - Mail original -
>>> De: "Sage Weil" 
>>> À: "aderumier" 
>>> Cc: "Mark Nelson" , "ceph-devel"
>>> , "ceph-users" ,
>>> "Somnath Roy" 
>>> Envoyé: Mercredi 9 Septembre 2015 04:07:59
>>> Objet: Re: [ceph-users] jemalloc and transparent hugepage
>>> 
>>> On Wed, 9 Sep 2015, Alexandre DERUMIER wrote:
>> Have you noticed any performance difference with tp=never?
 
 No difference. 
 
 I think hugepage could speedup big memory sets like 100-200GB, but for
 1-2GB they are no noticable difference.
>>> 
>>> Is this something we can set with mallctl[1] at startup?
>>> 
>>> sage 
>>> 
>>> [1] 
>>> http://www.canonware.com/download/jemalloc/jemalloc-latest/doc/jemalloc.h
>>> tml 
>>> 
 
 
 
 
 
 
 - Mail original -
 De: "Mark Nelson" 
 À: "aderumier" , "ceph-devel"
 , "ceph-users" 
 Cc: "Somnath Roy" 
 Envoyé: Mercredi 9 Septembre 2015 01:49:35
 Objet: Re: [ceph-users] jemalloc and transparent hugepage
 
 Excellent investigation Alexandre! Have you noticed any performance
 difference with tp=never?
 
 Mark 
 
 On 09/08/2015 06:33 PM, Alexandre DERUMIER wrote:
> I have done a small benchmark with tcmalloc and jemalloc, transparent
> hugepage=always|never.
> 
> For tcmalloc, there is no difference.
> But for jemalloc, the difference is huge (around 25% lower with
> tp=never). 
> 
> jemalloc 4.6.0+tp=never vs tcmalloc: uses 10% more RSS memory
> 
> jemalloc 4.0+tp=never uses almost the same RSS memory as tcmalloc!
> 
> 
> I don't have monitored memory usage in recovery, but I think it
> should help too.
> 
> 
> 
> 
> tcmalloc 2.1 tp=always
> ---
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> 
> root 67746 120 1.0 1531220 671152 ? Ssl 01:18 0:43 /usr/bin/ceph-osd

Re: [ceph-users] Huge memory usage spike in OSD on hammer/giant

2015-09-09 Thread Jan Schermer
Sorry if I wasn't clear.
Going from 2GB to 8GB is not normal, although some slight bloating is expected. 
In your case it just got much worse than usual for reasons yet unknown.

Jan


> On 09 Sep 2015, at 12:40, Mariusz Gronczewski 
>  wrote:
> 
> 
> Well, I was going by
> http://ceph.com/docs/master/start/hardware-recommendations/ and planning for
> 2GB per OSD, so that was a surprise. Maybe there should be a warning somewhere?
> 
> 
> On Wed, 9 Sep 2015 12:21:15 +0200, Jan Schermer  wrote:
> 
>> The memory gets used for additional PGs on the OSD.
>> If you were to "swap" PGs between two OSDs, you'll get memory wasted on both 
>> of them because tcmalloc doesn't release it.*
>> It usually gets stable after few days even during backfills, so it does get 
>> reused if needed.
>> If for some reason your OSDs get to 8GB RSS then I recommend you just get 
>> more memory, or try disabling tcmalloc which can either help or make it even 
>> worse :-)
>> 
>> * E.g. if you do something silly like "ceph osd crush reweight osd.1 1" 
>> you will see the RSS of osd.28 skyrocket. Reweighting it back down will not 
>> release the memory until you do "heap release".
>> 
>> Jan
>> 
>> 
>>> On 09 Sep 2015, at 12:05, Mariusz Gronczewski 
>>>  wrote:
>>> 
>>> On Tue, 08 Sep 2015 16:14:15 -0500, Chad William Seys
>>>  wrote:
>>> 
 Does 'ceph tell osd.* heap release' help with OSD RAM usage?
 
 From
 http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-August/003932.html
 
 Chad.
>>> 
>>> It did help now, but the cluster is in a clean state at the moment. But I
>>> didn't know that one, thanks.
>>> 
>>> High memory usage stopped once the cluster rebuilt, but I had planned the
>>> cluster to have 2GB per OSD, so I needed to add RAM to even get to the
>>> point of Ceph starting to rebuild, as some OSDs ate up to 8 GB during
>>> recovery
>>> 
>>> -- 
>>> Mariusz Gronczewski, Administrator
>>> 
>>> Efigence S. A.
>>> ul. Wołoska 9a, 02-583 Warszawa
>>> T: [+48] 22 380 13 13
>>> F: [+48] 22 380 13 14
>>> E: mariusz.gronczew...@efigence.com
>>> 
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> 
> 
> 
> -- 
> Mariusz Gronczewski, Administrator
> 
> Efigence S. A.
> ul. Wołoska 9a, 02-583 Warszawa
> T: [+48] 22 380 13 13
> F: [+48] 22 380 13 14
> E: mariusz.gronczew...@efigence.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor IOPS performance with Ceph

2015-09-09 Thread Daleep Bais
That same HDD is also used for the journal, on a separate 10 GB partition.

Thanks.

Daleep Singh Bais

On Wed, Sep 9, 2015 at 2:37 PM, Shinobu Kinjo  wrote:

> Are you using that HDD also for storing journal data?
> Or are you using an SSD for that purpose?
>
> Shinobu
>
> - Original Message -
> From: "Daleep Bais" 
> To: "Shinobu Kinjo" 
> Cc: "Ceph-User" 
> Sent: Wednesday, September 9, 2015 5:59:33 PM
> Subject: Re: [ceph-users] Poor IOPS performance with Ceph
>
> Hi Shinobu,
>
> I have 1 X 1TB HDD on each node. The network bandwidth between nodes is
> 1Gbps.
>
> Thanks for the info. I will also try to go through discussion mails related
> to performance.
>
> Thanks.
>
> Daleep Singh Bais
>
>
> On Wed, Sep 9, 2015 at 2:09 PM, Shinobu Kinjo  wrote:
>
> > How many disks does each osd node have?
> > How about networking layer?
> > There are several factors that can make your cluster much stronger.
> >
> > Probably you may need to take a look at other discussion on this mailing
> > list.
> > There was a bunch of discussion about performance.
> >
> > Shinobu
> >
> > - Original Message -
> > From: "Daleep Bais" 
> > To: "Ceph-User" 
> > Sent: Wednesday, September 9, 2015 5:17:48 PM
> > Subject: [ceph-users] Poor IOPS performance with Ceph
> >
> > Hi,
> >
> > I have made a test ceph cluster of 6 OSD's and 03 MON. I am testing the
> > read write performance for the test cluster and the read IOPS is poor.
> > When I individually test it for each HDD, I get good performance,
> whereas,
> > when I test it for ceph cluster, it is poor.
> >
> > Between nodes, using iperf, I get good bandwidth.
> >
> > My cluster info :
> >
> > root@ceph-node3:~# ceph --version
> > ceph version 9.0.2-752-g64d37b7
> (64d37b70a687eb63edf69a91196bb124651da210)
> > root@ceph-node3:~# ceph -s
> > cluster 9654468b-5c78-44b9-9711-4a7c4455c480
> > health HEALTH_OK
> > monmap e9: 3 mons at {ceph-node10=
> >
> 192.168.1.210:6789/0,ceph-node17=192.168.1.217:6789/0,ceph-node3=192.168.1.203:6789/0
> > }
> > election epoch 442, quorum 0,1,2 ceph-node3,ceph-node10,ceph-node17
> > osdmap e1850: 6 osds: 6 up, 6 in
> > pgmap v17400: 256 pgs, 2 pools, 9274 MB data, 2330 objects
> > 9624 MB used, 5384 GB / 5394 GB avail
> > 256 active+clean
> >
> >
> > I have mapped an RBD block device to client machine (Ubuntu 14) and from
> > there, when I run tests using FIO, I get good write IOPS, however, read
> is
> > poor comparatively.
> >
> > Write IOPS : 44618 approx
> >
> > Read IOPS : 7356 approx
> >
> > Pool replica - single
> > pool 1 'test1' replicated size 1 min_size 1
> >
> > I have implemented rbd_readahead in my ceph conf file also.
> > Any suggestions in this regard will help me.
> >
> > Thanks.
> >
> > Daleep Singh Bais
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor IOPS performance with Ceph

2015-09-09 Thread Shinobu Kinjo
These may be of more help for doing some performance analysis:

  http://ceph.com/docs/master/start/hardware-recommendations/
  
http://www.sebastien-han.fr/blog/2013/10/03/quick-analysis-of-the-ceph-io-layer/

Shinobu

- Original Message -
From: "Shinobu Kinjo" 
To: "Daleep Bais" 
Cc: "Ceph-User" 
Sent: Wednesday, September 9, 2015 6:07:56 PM
Subject: Re: [ceph-users] Poor IOPS performance with Ceph

Are you using that HDD also for storing journal data?
Or are you using an SSD for that purpose?

Shinobu

- Original Message -
From: "Daleep Bais" 
To: "Shinobu Kinjo" 
Cc: "Ceph-User" 
Sent: Wednesday, September 9, 2015 5:59:33 PM
Subject: Re: [ceph-users] Poor IOPS performance with Ceph

Hi Shinobu,

I have 1 X 1TB HDD on each node. The network bandwidth between nodes is
1Gbps.

Thanks for the info. I will also try to go through discussion mails related
to performance.

Thanks.

Daleep Singh Bais


On Wed, Sep 9, 2015 at 2:09 PM, Shinobu Kinjo  wrote:

> How many disks does each osd node have?
> How about networking layer?
> There are several factors to make your cluster much more stronger.
>
> Probably you may need to take a look at other discussion on this mailing
> list.
> There was a bunch of discussion about performance.
>
> Shinobu
>
> - Original Message -
> From: "Daleep Bais" 
> To: "Ceph-User" 
> Sent: Wednesday, September 9, 2015 5:17:48 PM
> Subject: [ceph-users] Poor IOPS performance with Ceph
>
> Hi,
>
> I have made a test ceph cluster of 6 OSD's and 03 MON. I am testing the
> read write performance for the test cluster and the read IOPS is poor.
> When I individually test it for each HDD, I get good performance, whereas,
> when I test it for ceph cluster, it is poor.
>
> Between nodes, using iperf, I get good bandwidth.
>
> My cluster info :
>
> root@ceph-node3:~# ceph --version
> ceph version 9.0.2-752-g64d37b7 (64d37b70a687eb63edf69a91196bb124651da210)
> root@ceph-node3:~# ceph -s
> cluster 9654468b-5c78-44b9-9711-4a7c4455c480
> health HEALTH_OK
> monmap e9: 3 mons at {ceph-node10=
> 192.168.1.210:6789/0,ceph-node17=192.168.1.217:6789/0,ceph-node3=192.168.1.203:6789/0
> }
> election epoch 442, quorum 0,1,2 ceph-node3,ceph-node10,ceph-node17
> osdmap e1850: 6 osds: 6 up, 6 in
> pgmap v17400: 256 pgs, 2 pools, 9274 MB data, 2330 objects
> 9624 MB used, 5384 GB / 5394 GB avail
> 256 active+clean
>
>
> I have mapped an RBD block device to client machine (Ubuntu 14) and from
> there, when I run tests using FIO, i get good write IOPS, however, read is
> poor comparatively.
>
> Write IOPS : 44618 approx
>
> Read IOPS : 7356 approx
>
> Pool replica - single
> pool 1 'test1' replicated size 1 min_size 1
>
> I have implemented rbd_readahead in my ceph conf file also.
> Any suggestions in this regard with help me..
>
> Thanks.
>
> Daleep Singh Bais
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD with iSCSI

2015-09-09 Thread Daleep Bais
Hi,

I am following the steps from the URL
http://www.sebastien-han.fr/blog/2014/07/07/start-with-the-rbd-support-for-tgt/
to create an RBD pool and share it with another initiator.

I am not able to get rbd in the backstore list. Please suggest.

below is the output of tgtadm command:

tgtadm --lld iscsi --op show --mode system
System:
State: ready
debug: off
LLDs:
iscsi: ready
iser: error
Backing stores:
sheepdog
bsg
sg
null
ssc
smc (bsoflags sync:direct)
mmc (bsoflags sync:direct)
rdwr (bsoflags sync:direct)
Device types:
disk
cd/dvd
osd
controller
changer
tape
passthrough
iSNS:
iSNS=Off
iSNSServerIP=
iSNSServerPort=3205
iSNSAccessControl=Off


So far I have installed the tgt and tgt-rbd packages. Working on Debian
GNU/Linux 8.1 (jessie).

Thanks.

Daleep Singh Bais
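
For reference, once an "rbd" entry does show up under "Backing stores", the blog's
approach boils down to something like the sketch below (target name, pool and image
are placeholders). If rbd never appears in that list, the installed tgt was most
likely built without RBD support:

  tgtadm --lld iscsi --mode target --op new --tid 1 \
      --targetname iqn.2015-09.com.example:rbd-test
  tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1 \
      --bstype rbd --backing-store rbd/myimage
  tgtadm --lld iscsi --mode target --op bind --tid 1 --initiator-address ALL
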
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jemalloc and transparent hugepage

2015-09-09 Thread Jan Schermer
I looked at THP before. It comes enabled on RHEL6 and on our KVM hosts it 
merges a lot (~300GB hugepages on a 400GB KVM footprint).
I am probably going to disable it and see if it introduces any problems for me 
- the most important gain here is better processor memory lookup table (cache) 
utilization where it considerably lowers the number of entries. Not sure how it 
affects different workloads - HPC guys should have a good idea? I can only 
evaluate the effect on OSDs and KVM, but the problem is that going over the 
cache limit even by a tiny bit can have huge impact - theoretically...

This issue sounds strange, though. THP should kick in and defrag/remerge the 
pages that are part-empty. Maybe it's just not aggressive enough?
Does the "free" memory show as used (part of RSS of the process using the 
page)? I guess not because there might be more processes with memory in the 
same hugepage.

This might actually partially explain the pagecache problem I mentioned there 
about a week ago (slow OSD startup), maybe kswapd is what has to do the work 
and defrag the pages when memory pressure is high!

I'll try to test it somehow, hopefully then there will be cake.

Jan
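
For reference, a minimal sketch of checking and disabling THP as described above.
The sysfs paths are the upstream ones; on RHEL6 the directory is
redhat_transparent_hugepage instead, and the change does not survive a reboot:

  # see whether THP is active and how much is currently merged
  cat /sys/kernel/mm/transparent_hugepage/enabled
  grep AnonHugePages /proc/meminfo

  # disable THP and its defragger until the next boot
  echo never > /sys/kernel/mm/transparent_hugepage/enabled
  echo never > /sys/kernel/mm/transparent_hugepage/defrag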

> On 09 Sep 2015, at 07:08, Alexandre DERUMIER  wrote:
> 
> They are a tracker here
> 
> https://github.com/jemalloc/jemalloc/issues/243
> "Improve interaction with transparent huge pages"
> 
> 
> 
> - Mail original -
> De: "aderumier" 
> À: "Sage Weil" 
> Cc: "ceph-devel" , "ceph-users" 
> 
> Envoyé: Mercredi 9 Septembre 2015 06:37:22
> Objet: Re: [ceph-users] jemalloc and transparent hugepage
> 
>>> Is this something we can set with mallctl[1] at startup? 
> 
> I don't think it's possible. 
> 
> TP hugepage are managed by kernel, not jemalloc. 
> 
> (but a simple "echo never > /sys/kernel/mm/transparent_hugepage/enabled" in 
> init script is enough) 
> 
> - Mail original - 
> De: "Sage Weil"  
> À: "aderumier"  
> Cc: "Mark Nelson" , "ceph-devel" 
> , "ceph-users" , 
> "Somnath Roy"  
> Envoyé: Mercredi 9 Septembre 2015 04:07:59 
> Objet: Re: [ceph-users] jemalloc and transparent hugepage 
> 
> On Wed, 9 Sep 2015, Alexandre DERUMIER wrote: 
 Have you noticed any performance difference with tp=never? 
>> 
>> No difference. 
>> 
>> I think hugepage could speedup big memory sets like 100-200GB, but for 
>> 1-2GB they are no noticable difference. 
> 
> Is this something we can set with mallctl[1] at startup? 
> 
> sage 
> 
> [1] 
> http://www.canonware.com/download/jemalloc/jemalloc-latest/doc/jemalloc.html 
> 
>> 
>> 
>> 
>> 
>> 
>> 
>> - Mail original - 
>> De: "Mark Nelson"  
>> À: "aderumier" , "ceph-devel" 
>> , "ceph-users"  
>> Cc: "Somnath Roy"  
>> Envoyé: Mercredi 9 Septembre 2015 01:49:35 
>> Objet: Re: [ceph-users] jemalloc and transparent hugepage 
>> 
>> Excellent investigation Alexandre! Have you noticed any performance 
>> difference with tp=never? 
>> 
>> Mark 
>> 
>> On 09/08/2015 06:33 PM, Alexandre DERUMIER wrote: 
>>> I have done small benchmark with tcmalloc and jemalloc, transparent 
>>> hugepage=always|never. 
>>> 
>>> for tcmalloc, they are no difference. 
>>> but for jemalloc, the difference is huge (around 25% lower with tp=never). 
>>> 
>>> jemmaloc 4.6.0+tp=never vs tcmalloc use 10% more RSS memory 
>>> 
>>> jemmaloc 4.0+tp=never almost use same RSS memory than tcmalloc ! 
>>> 
>>> 
>>> I don't have monitored memory usage in recovery, but I think it should help 
>>> too. 
>>> 
>>> 
>>> 
>>> 
>>> tcmalloc 2.1 tp=always 
>>> --- 
>>> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND 
>>> 
>>> root 67746 120 1.0 1531220 671152 ? Ssl 01:18 0:43 /usr/bin/ceph-osd 
>>> --cluster=ceph -i 0 -f 
>>> root 67764 144 1.0 1570256 711232 ? Ssl 01:18 0:51 /usr/bin/ceph-osd 
>>> --cluster=ceph -i 1 -f 
>>> 
>>> root 68363 220 0.9 1522292 655888 ? Ssl 01:19 0:46 /usr/bin/ceph-osd 
>>> --cluster=ceph -i 0 -f 
>>> root 68381 261 1.0 1563396 702500 ? Ssl 01:19 0:55 /usr/bin/ceph-osd 
>>> --cluster=ceph -i 1 -f 
>>> 
>>> root 68963 228 1.0 1519240 666196 ? Ssl 01:20 0:31 /usr/bin/ceph-osd 
>>> --cluster=ceph -i 0 -f 
>>> root 68981 268 1.0 1564452 694352 ? Ssl 01:20 0:37 /usr/bin/ceph-osd 
>>> --cluster=ceph -i 1 -f 
>>> 
>>> 
>>> 
>>> tcmalloc 2.1 tp=never 
>>> - 
>>> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND 
>>> 
>>> root 69560 144 1.0 1544968 677584 ? Ssl 01:21 0:20 /usr/bin/ceph-osd 
>>> --cluster=ceph -i 0 -f 
>>> root 69578 167 1.0 1568620 704456 ? Ssl 01:21 0:23 /usr/bin/ceph-osd 
>>> --cluster=ceph -i 1 -f 
>>> 
>>> 
>>> root 70156 164 0.9 1519680 649776 ? Ssl 01:21 0:16 

Re: [ceph-users] Huge memory usage spike in OSD on hammer/giant

2015-09-09 Thread Jan Schermer
The memory gets used for additional PGs on the OSD.
If you were to "swap" PGs between two OSDs, you'll get memory wasted on both of 
them because tcmalloc doesn't release it.*
It usually gets stable after few days even during backfills, so it does get 
reused if needed.
If for some reason your OSDs get to 8GB RSS then I recommend you just get more 
memory, or try disabling tcmalloc which can either help or make it even worse 
:-)

* E.g. if you do something silly like "ceph osd crush reweight osd.1 1" you 
will see the RSS of osd.28 skyrocket. Reweighting it back down will not release 
the memory until you do "heap release".

Jan
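
For reference, a minimal sketch of the above; osd.28 is just an example id, and
the stats output only makes sense on a tcmalloc build:

  # bytes actually in use vs. bytes parked in tcmalloc's freelists
  ceph tell osd.28 heap stats

  # hand the cached-but-unused pages back to the OS
  ceph tell osd.28 heap release        # or: ceph tell 'osd.*' heap release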


> On 09 Sep 2015, at 12:05, Mariusz Gronczewski 
>  wrote:
> 
> On Tue, 08 Sep 2015 16:14:15 -0500, Chad William Seys
>  wrote:
> 
>> Does 'ceph tell osd.* heap release' help with OSD RAM usage?
>> 
>> From
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-August/003932.html
>> 
>> Chad.
> 
> it did help now, but cluster is in clean state at the moment. But I
> didnt know that one, thanks.
> 
> High memory usage stopped once cluster rebuilt, but I've planned
> cluster to have 2GB per OSD so I needed to add ram to even get to the
> point of ceph starting to rebuild, as some OSD ate up to 8 GBs during
> recover
> 
> -- 
> Mariusz Gronczewski, Administrator
> 
> Efigence S. A.
> ul. Wołoska 9a, 02-583 Warszawa
> T: [+48] 22 380 13 13
> F: [+48] 22 380 13 14
> E: mariusz.gronczew...@efigence.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] EC pool design

2015-09-09 Thread Luis Periquito
I'm in the process of adding more resources to an existing cluster.

I'll have 38 hosts, with 2 HDD each, for an EC pool. I plan on adding a
cache pool in front of it (is it worth it? S3 data, mostly writes and
objects are usually 200kB upwards to several MB/GB...); all of the hosts
are on the same rack. All the other pools will go into a separate SSD based
pool and would be replicated.

I was reading Somnath's email regarding the performance of different EC
backends, and he compares the jerasure performance with different plugins.

This cluster is currently Hammer, so I was looking to using LRC. Is it
worth using LRC over standard jerasure? What would be a good k and m? I was
thinking k=12, m=4, l=4 as I have more than enough hosts for these values,
but what if I lose more than one host? Will LRC still be able to recover
using the "adjacent" group?

And what about performance? From Somnath's email it seemed the bigger the k
and m the worse it would perform...

What are the usual values you all use?

PS: I still haven't seen Mark Nelson performance presentation...
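
For reference, a hedged sketch of an LRC profile with the values discussed above
(k=12, m=4, l=4). The simple-form parameter names changed between releases
(ruleset-* on Hammer vs crush-* later), and the pool name and PG numbers here are
only placeholders:

  ceph osd erasure-code-profile set lrc_12_4_4 \
      plugin=lrc k=12 m=4 l=4 ruleset-failure-domain=host
  ceph osd pool create ecpool 1024 1024 erasure lrc_12_4_4
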
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Huge memory usage spike in OSD on hammer/giant

2015-09-09 Thread Mariusz Gronczewski

Well, I was going by
http://ceph.com/docs/master/start/hardware-recommendations/ and was planning for
2GB per OSD, so that was a surprise. Maybe there should be a warning somewhere?


On Wed, 9 Sep 2015 12:21:15 +0200, Jan Schermer  wrote:

> The memory gets used for additional PGs on the OSD.
> If you were to "swap" PGs between two OSDs, you'll get memory wasted on both 
> of them because tcmalloc doesn't release it.*
> It usually gets stable after few days even during backfills, so it does get 
> reused if needed.
> If for some reason your OSDs get to 8GB RSS then I recommend you just get 
> more memory, or try disabling tcmalloc which can either help or make it even 
> worse :-)
> 
> * E.g. if you do something silly like "ceph osd crush reweight osd.1 1" 
> you will see the RSS of osd.28 skyrocket. Reweighting it back down will not 
> release the memory until you do "heap release".
> 
> Jan
> 
> 
> > On 09 Sep 2015, at 12:05, Mariusz Gronczewski 
> >  wrote:
> > 
> > On Tue, 08 Sep 2015 16:14:15 -0500, Chad William Seys
> >  wrote:
> > 
> >> Does 'ceph tell osd.* heap release' help with OSD RAM usage?
> >> 
> >> From
> >> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-August/003932.html
> >> 
> >> Chad.
> > 
> > it did help now, but cluster is in clean state at the moment. But I
> > didnt know that one, thanks.
> > 
> > High memory usage stopped once cluster rebuilt, but I've planned
> > cluster to have 2GB per OSD so I needed to add ram to even get to the
> > point of ceph starting to rebuild, as some OSD ate up to 8 GBs during
> > recover
> > 
> > -- 
> > Mariusz Gronczewski, Administrator
> > 
> > Efigence S. A.
> > ul. Wołoska 9a, 02-583 Warszawa
> > T: [+48] 22 380 13 13
> > F: [+48] 22 380 13 14
> > E: mariusz.gronczew...@efigence.com
> > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 



-- 
Mariusz Gronczewski, Administrator

Efigence S. A.
ul. Wołoska 9a, 02-583 Warszawa
T: [+48] 22 380 13 13
F: [+48] 22 380 13 14
E: mariusz.gronczew...@efigence.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Huge memory usage spike in OSD on hammer/giant

2015-09-09 Thread Mariusz Gronczewski
On Tue, 08 Sep 2015 16:14:15 -0500, Chad William Seys
 wrote:

> Does 'ceph tell osd.* heap release' help with OSD RAM usage?
> 
> From
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-August/003932.html
> 
> Chad.

It did help now, but the cluster is in a clean state at the moment. I
didn't know that one, thanks.

High memory usage stopped once the cluster rebuilt, but I had planned the
cluster for 2GB per OSD, so I needed to add RAM just to get to the point of
ceph starting to rebuild, as some OSDs ate up to 8 GB during
recovery.

-- 
Mariusz Gronczewski, Administrator

Efigence S. A.
ul. Wołoska 9a, 02-583 Warszawa
T: [+48] 22 380 13 13
F: [+48] 22 380 13 14
E: mariusz.gronczew...@efigence.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor IOPS performance with Ceph

2015-09-09 Thread Jan Schermer
For the record

--direct=1 (or any O_DIRECT IO anywhere) is by itself not guaranteed to be
unbuffered and synchronous.
You need to add
--direct=1 --sync=1 --fsync=1 to make sure you are actually flushing the data
somewhere. (This puts additional OPS in the queue, though.)
In the case of RBD this is important because an O_DIRECT write by itself could
actually end up in the rbd cache.
Not sure how it is with different kernels; I believe this behaviour changed
several times, as applications have different assumptions about the durability of
O_DIRECT writes.
I can probably dig some reference to that if you want...

Jan
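
For reference, a minimal fio sketch with those flags; the device path is a
placeholder, and a random-write run like this destroys whatever is on it:

  fio --name=synctest --filename=/dev/rbd0 --rw=randwrite --bs=4k \
      --ioengine=libaio --iodepth=32 --runtime=60 --time_based \
      --direct=1 --sync=1 --fsync=1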

> On 09 Sep 2015, at 11:06, Nick Fisk  wrote:
> 
> It looks like you are using the kernel RBD client, i.e. you ran "rbd map <image>".
> In which case the librbd settings in the ceph.conf won't have any effect, as
> they only apply if you are using fio with the librbd engine.
> 
> There are several things you may have to do to improve Kernel client 
> performance, but 1st thing you need to pass the "direct=1" flag to your fio 
> job to get a realistic idea of your clusters performance. But be warned if 
> you thought you had bad performance now, you will likely be shocked after you 
> enable it.
> 
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Daleep Bais
>> Sent: 09 September 2015 09:37
>> To: Nick Fisk 
>> Cc: Ceph-User 
>> Subject: Re: [ceph-users] Poor IOPS performance with Ceph
>> 
>> Hi Nick,
>> 
>> I dont have separate SSD / HDD for journal. I am using a 10 G partition on 
>> the
>> same HDD for journaling. They are rotating HDD's and not SSD's.
>> 
>> I am using below command to run the test:
>> 
>> fio --name=test --filename=test --bs=4k  --size=4G --readwrite=read / write
>> 
>> I did few kernel tuning and that has improved my write IOPS. For read I am
>> using rbd_readahead  and also used read_ahead_kb kernel tuning
>> parameter.
>> 
>> Also I should mention that its not x86, its on armv7 32bit.
>> 
>> Thanks.
>> 
>> Daleep Singh Bais
>> 
>> 
>> 
>> On Wed, Sep 9, 2015 at 1:55 PM, Nick Fisk  wrote:
>>> -Original Message-
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>> Of
>>> Daleep Bais
>>> Sent: 09 September 2015 09:18
>>> To: Ceph-User 
>>> Subject: [ceph-users] Poor IOPS performance with Ceph
>>> 
>>> Hi,
>>> 
>>> I have made a test ceph cluster of 6 OSD's and 03 MON. I am testing the
>> read
>>> write performance for the test cluster and the read IOPS is  poor.
>>> When I individually test it for each HDD, I get good performance, whereas,
>>> when I test it for ceph cluster, it is poor.
>> 
>> Can you give any further details about your cluster. Are your HDD's backed by
>> SSD journals?
>> 
>>> 
>>> Between nodes, using iperf, I get good bandwidth.
>>> 
>>> My cluster info :
>>> 
>>> root@ceph-node3:~# ceph --version
>>> ceph version 9.0.2-752-g64d37b7
>>> (64d37b70a687eb63edf69a91196bb124651da210)
>>> root@ceph-node3:~# ceph -s
>>>cluster 9654468b-5c78-44b9-9711-4a7c4455c480
>>> health HEALTH_OK
>>> monmap e9: 3 mons at {ceph-node10=192.168.1.210:6789/0,ceph-
>>> node17=192.168.1.217:6789/0,ceph-node3=192.168.1.203:6789/0}
>>>election epoch 442, quorum 0,1,2 ceph-node3,ceph-node10,ceph-
>>> node17
>>> osdmap e1850: 6 osds: 6 up, 6 in
>>>  pgmap v17400: 256 pgs, 2 pools, 9274 MB data, 2330 objects
>>>9624 MB used, 5384 GB / 5394 GB avail
>>> 256 active+clean
>>> 
>>> 
>>> I have mapped an RBD block device to client machine (Ubuntu 14) and from
>>> there, when I run tests using FIO, i get good write IOPS, however, read is
>>> poor comparatively.
>>> 
>>> Write IOPS : 44618 approx
>>> 
>>> Read IOPS : 7356 approx
>> 
>> 1st thing that strikes me is that your numbers are too good, unless these are
>> actually SSD's and not spinning HDD's? I would expect to get around a max of
>> 600 read IOPs for 6x 7.2k disks, so I guess either you are hitting the page
>> cache on the OSD node(s) or the librbd cache.
>> 
>> The writes are even higher, are you using the "direct=1" option in the Fio
>> job?
>> 
>>> 
>>> Pool replica - single
>>> pool 1 'test1' replicated size 1 min_size 1
>>> 
>>> I have implemented rbd_readahead in my ceph conf file also.
>>> Any suggestions in this regard with help me..
>>> 
>>> Thanks.
>>> 
>>> Daleep Singh Bais
>> 
>> 
>> 
> 
> 
> 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Tuning + KV backend

2015-09-09 Thread Jan Schermer
You actually can't know what the network contention is like - you see virtual
NICs, but those are overprovisioned on the physical hosts, and the backbone
between AWS racks/datacenters is likely overprovisioned as well.
The same goes for CPU and RAM - depending on your kernel and how AWS is set up, 
it might look like the CPUs in guests are idle, because the workload has left 
your domain (guest), but the host might be struggling at the same time without 
you knowing. Sometimes this shows as "steal" time, sometimes it does not. Your 
guest can be totally idle because it sent the data "out" (to the virtual drive 
cache, to the network buffers via DMA) but the host still has work to do at 
that moment.

Rerun the test at different times of day, or create the same setup in a 
different AWS zone and compare the results.

Not sure what the AWS service level settings are, if there is some kind of 
resource reservation that's exactly what you should do to get meaningful 
numbers.

If you want to identify the bottlenecks you need to test all the metrics at the 
same time - at least latency (ping, or arping if in the same network/subnet) 
test between the nodes on all networks, some minimalistic fio read+write test 
on the virtual disk, and a latency test (cyclictest from rt-tests for example) 
for the CPUs. Whatever jumps up is the bottleneck.

Jan
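
For reference, a rough sketch of those probes; interface names, peer addresses and
the test file are placeholders, and cyclictest comes from the rt-tests package:

  ping -c 1000 -i 0.01 10.0.0.2            # node-to-node latency
  arping -c 100 -I eth0 10.0.0.2           # L2 latency, same subnet only
  fio --name=probe --filename=/mnt/test.bin --size=1G --rw=randrw \
      --bs=4k --direct=1 --runtime=30 --time_based   # disk latency/IOPS
  cyclictest -t 1 -p 80 -n -i 1000 -l 10000          # CPU scheduling latency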

> On 08 Sep 2015, at 21:00, Niels Jakob Darger  wrote:
> 
> Hello,
> 
> Excuse my ignorance, I have just joined this list and started using Ceph 
> (which looks very cool). On AWS I have set up a 5-way Ceph cluster (4 vCPUs, 
> 32G RAM, dedicated SSDs for system, osd and journal) with the Object Gateway. 
> For the purpose of simplicity of the test all the nodes are identical and 
> each node contains osd, mon and the radosgw.
> 
> I have run parallel inserts from all 5 nodes; I can insert about 10,000-12,000
> objects per minute. The insert rate is relatively constant regardless of 
> whether I run 1 insert process per node or 5, i.e. a total of 5 or 25.
> 
> These are just numbers, of course, and not meaningful without more context. 
> But looking at the nodes I think the cluster could run faster - the CPUs are 
> not doing much, there isn't much I/O wait - only about 50% utilisation and 
> only on the SSDs storing the journals on two of the nodes (I've set the 
> replication to 2), the other file systems are almost idle. The network is far 
> from maxed out and the processes are not using much memory. I've tried 
> increasing osd_op_threads to 5 or 10 but that didn't make much difference.
> 
> The co-location of all the daemons on all the nodes may not be ideal, but 
> since there isn't much resource use or contention I don't think that's the 
> problem.
> 
> So two questions:
> 
> 1) Are there any good resources on tuning Ceph? There's quite a few posts out 
> there testing and timing specific setups with RAID controller X and 12 disks 
> of brand Y etc. but I'm more looking for general tuning guidelines - 
> explaining the big picture.
> 
> 2) What's the status of the keyvalue backend? The documentation on 
> http://ceph.com/docs/master/rados/configuration/keyvaluestore-config-ref/ 
> looks nice but I found it difficult to work out how to switch to the keyvalue 
> backend, the Internet suggests "osd objectstore = keyvaluestore-dev", but 
> that didn't seem to work so I checked out the source code and it looks like 
> "osd objectstore = keyvaluestore" does it. However, it results in nasty 
> things in the log file ("*** experimental feature 'keyvaluestore' is not 
> enabled *** This feature is marked as experimental ...") so perhaps it's too 
> early to use the KV backend for production use?
> 
> Thanks & regards,
> Jakob
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
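
For reference on the KV backend question quoted above: on recent builds the
experimental backend also has to be whitelisted explicitly, so the ceph.conf would
look roughly like the sketch below. This is an assumption based on the warning text
in the log, and it is experimental for a reason - do not use it for data you care
about:

  [osd]
  osd objectstore = keyvaluestore
  enable experimental unrecoverable data corrupting features = keyvaluestore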


Re: [ceph-users] Huge memory usage spike in OSD on hammer/giant

2015-09-09 Thread Mariusz Gronczewski
Sadly I don't have any from when it was taking excess amounts of memory
during the rebuild. But I will remember to do that next time, thanks.

On Tue, 8 Sep 2015 18:28:48 -0400 (EDT), Shinobu Kinjo
 wrote:

> Have you ever?
> 
> http://ceph.com/docs/master/rados/troubleshooting/memory-profiling/
> 
> Shinobu
> 
> - Original Message -
> From: "Chad William Seys" 
> To: "Mariusz Gronczewski" , "Shinobu Kinjo" 
> , ceph-users@lists.ceph.com
> Sent: Wednesday, September 9, 2015 6:14:15 AM
> Subject: Re: Huge memory usage spike in OSD on hammer/giant
> 
> Does 'ceph tell osd.* heap release' help with OSD RAM usage?
> 
> From
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-August/003932.html
> 
> Chad.



-- 
Mariusz Gronczewski, Administrator

Efigence S. A.
ul. Wołoska 9a, 02-583 Warszawa
T: [+48] 22 380 13 13
F: [+48] 22 380 13 14
E: mariusz.gronczew...@efigence.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jemalloc and transparent hugepage

2015-09-09 Thread HEWLETT, Paul (Paul)
Hi Jan

If I can suggest that you look at:

http://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases


where LinkedIn ended up disabling some of the new kernel features to
prevent memory thrashing.
Search for Transparent Huge Pages..

RHEL7 has these now disabled by default - LinkedIn are using GraphDB which
is a log-structured system.

Paul

On 09/09/2015 10:54, "ceph-devel-ow...@vger.kernel.org on behalf of Jan
Schermer" 
wrote:

>I looked at THP before. It comes enabled on RHEL6 and on our KVM hosts it
>merges a lot (~300GB hugepages on a 400GB KVM footprint).
>I am probably going to disable it and see if it introduces any problems
>for me - the most important gain here is better processor memory lookup
>table (cache) utilization where it considerably lowers the number of
>entries. Not sure how it affects different workloads - HPC guys should
>have a good idea? I can only evaluate the effect on OSDs and KVM, but the
>problem is that going over the cache limit even by a tiny bit can have
>huge impact - theoretically...
>
>This issue sounds strange, though. THP should kick in and defrag/remerge
>the pages that are part-empty. Maybe it's just not aggressive enough?
>Does the "free" memory show as used (part of RSS of the process using the
>page)? I guess not because there might be more processes with memory in
>the same hugepage.
>
>This might actually partially explain the pagecache problem I mentioned
>there about a week ago (slow OSD startup), maybe kswapd is what has to do
>the work and defrag the pages when memory pressure is high!
>
>I'll try to test it somehow, hopefully then there will be cake.
>
>Jan
>
>> On 09 Sep 2015, at 07:08, Alexandre DERUMIER 
>>wrote:
>> 
>> They are a tracker here
>> 
>> https://github.com/jemalloc/jemalloc/issues/243
>> "Improve interaction with transparent huge pages"
>> 
>> 
>> 
>> - Mail original -
>> De: "aderumier" 
>> À: "Sage Weil" 
>> Cc: "ceph-devel" , "ceph-users"
>>
>> Envoyé: Mercredi 9 Septembre 2015 06:37:22
>> Objet: Re: [ceph-users] jemalloc and transparent hugepage
>> 
 Is this something we can set with mallctl[1] at startup?
>> 
>> I don't think it's possible.
>> 
>> TP hugepage are managed by kernel, not jemalloc.
>> 
>> (but a simple "echo never >
>>/sys/kernel/mm/transparent_hugepage/enabled" in init script is enough)
>> 
>> - Mail original -
>> De: "Sage Weil" 
>> À: "aderumier" 
>> Cc: "Mark Nelson" , "ceph-devel"
>>, "ceph-users" ,
>>"Somnath Roy" 
>> Envoyé: Mercredi 9 Septembre 2015 04:07:59
>> Objet: Re: [ceph-users] jemalloc and transparent hugepage
>> 
>> On Wed, 9 Sep 2015, Alexandre DERUMIER wrote:
> Have you noticed any performance difference with tp=never?
>>> 
>>> No difference. 
>>> 
>>> I think hugepage could speedup big memory sets like 100-200GB, but for
>>> 1-2GB they are no noticable difference.
>> 
>> Is this something we can set with mallctl[1] at startup?
>> 
>> sage 
>> 
>> [1] 
>>http://www.canonware.com/download/jemalloc/jemalloc-latest/doc/jemalloc.h
>>tml 
>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> - Mail original -
>>> De: "Mark Nelson" 
>>> À: "aderumier" , "ceph-devel"
>>>, "ceph-users" 
>>> Cc: "Somnath Roy" 
>>> Envoyé: Mercredi 9 Septembre 2015 01:49:35
>>> Objet: Re: [ceph-users] jemalloc and transparent hugepage
>>> 
>>> Excellent investigation Alexandre! Have you noticed any performance
>>> difference with tp=never?
>>> 
>>> Mark 
>>> 
>>> On 09/08/2015 06:33 PM, Alexandre DERUMIER wrote:
 I have done small benchmark with tcmalloc and jemalloc, transparent
hugepage=always|never.
 
 for tcmalloc, they are no difference.
 but for jemalloc, the difference is huge (around 25% lower with
tp=never). 
 
 jemmaloc 4.6.0+tp=never vs tcmalloc use 10% more RSS memory
 
 jemmaloc 4.0+tp=never almost use same RSS memory than tcmalloc !
 
 
 I don't have monitored memory usage in recovery, but I think it
should help too.
 
 
 
 
 tcmalloc 2.1 tp=always
 ---
 USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
 
 root 67746 120 1.0 1531220 671152 ? Ssl 01:18 0:43 /usr/bin/ceph-osd
--cluster=ceph -i 0 -f
 root 67764 144 1.0 1570256 711232 ? Ssl 01:18 0:51 /usr/bin/ceph-osd
--cluster=ceph -i 1 -f
 
 root 68363 220 0.9 1522292 655888 ? Ssl 01:19 0:46 /usr/bin/ceph-osd
--cluster=ceph -i 0 -f
 root 68381 261 1.0 1563396 702500 ? Ssl 01:19 0:55 /usr/bin/ceph-osd
--cluster=ceph 

Re: [ceph-users] Question on cephfs recovery tools

2015-09-09 Thread goncalo


Hi Shinobu

I did check that page but I do not think that in its current state it  
helps much.


If you look at my email, I did try the operations documented there but
nothing substantial really happened. The tools do not produce any
output, so I am not sure what they did, or if they did anything at all.
From the documentation it is also not obvious in which situations we  
should use the tools, and if there is a particular order to run them.


The reason for my email is to get some clarification on that.

Cheers

Quoting Shinobu Kinjo :


Anyhow this page would help you:

http://ceph.com/docs/master/cephfs/disaster-recovery/

Shinobu

- Original Message -
From: "Shinobu Kinjo" 
To: "Goncalo Borges" 
Cc: "ceph-users" 
Sent: Wednesday, September 9, 2015 5:28:38 PM
Subject: Re: [ceph-users] Question on cephfs recovery tools

Did you try to identify what kind of processes were accessing  
filesystem using fuser or lsof and then kill them?

If not, you had to do that first.

Shinobu

- Original Message -
From: "Goncalo Borges" 
To: ski...@redhat.com
Sent: Wednesday, September 9, 2015 5:04:23 PM
Subject: Re: [ceph-users] Question on cephfs recovery tools

Hi Shinobu


Did you unmount filesystem using?

  umount -l


Yes!
Goncalo



Shinobu

On Wed, Sep 9, 2015 at 4:31 PM, Goncalo Borges
> wrote:

Dear Ceph / CephFS gurus...

Bare a bit with me while I give you a bit of context. Questions
will appear at the end.

1) I am currently running ceph 9.0.3 and I have install it  to
test the cephfs recovery tools.

2) I've created a situation where I've deliberately (on purpose)
lost some data and metadata (check annex 1 after the main email).

3) I've stopped the mds, and waited to check how the cluster
reacts. After some time, as expected, the cluster reports a ERROR
state, with a lot of PGs degraded and stuck

# ceph -s
cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
 health HEALTH_ERR
174 pgs degraded
48 pgs stale
174 pgs stuck degraded
41 pgs stuck inactive
48 pgs stuck stale
238 pgs stuck unclean
174 pgs stuck undersized
174 pgs undersized
recovery 22366/463263 objects degraded (4.828%)
recovery 8190/463263 objects misplaced (1.768%)
too many PGs per OSD (388 > max 300)
mds rank 0 has failed
mds cluster is degraded
 monmap e1: 3 mons at
{mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
election epoch 10, quorum 0,1,2 mon1,mon3,mon2
 mdsmap e24: 0/1/1 up, 1 failed
 osdmap e544: 21 osds: 15 up, 15 in; 87 remapped pgs
  pgmap v25699: 2048 pgs, 2 pools, 602 GB data, 150 kobjects
1715 GB used, 40027 GB / 41743 GB avail
22366/463263 objects degraded (4.828%)
8190/463263 objects misplaced (1.768%)
1799 active+clean
 110 active+undersized+degraded
  60 active+remapped
  37 stale+undersized+degraded+peered
  23 active+undersized+degraded+remapped
  11 stale+active+clean
   4 undersized+degraded+peered
   4 active

4) I've umounted the cephfs clients ('umount -l' worked for me
this time but I already had situations where 'umount' would simply
hang, and the only viable solutions would be to reboot the client).

5) I've recovered the ceph cluster by (details on the recover
operations are in annex 2 after the main email.)
- declaring the osds lost
- removing the osds from the crush map
- letting the cluster stabilize and letting all the recover I/O finish
- identifying stuck PGs
- checking if they existed, and if not recreate them.


6) I've restarted the MDS. Initially, the mds cluster was
considered degraded but after some small amount of time, that
message disappeared. The WARNING status was just because of "too
many PGs per OSD (409 > max 300)"

# ceph -s
cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
 health HEALTH_WARN
too many PGs per OSD (409 > max 300)
mds cluster is degraded
 monmap e1: 3 mons at
{mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
election epoch 10, quorum 0,1,2 mon1,mon3,mon2
 mdsmap e27: 1/1/1 up {0=rccephmds=up:reconnect}
 osdmap 

Re: [ceph-users] RAM usage only very slowly decreases after cluster recovery

2015-09-09 Thread Mark Nelson



On 08/28/2015 10:55 AM, Somnath Roy wrote:

Yeah, that means tcmalloc is probably caching those, as I suspected..
There is some discussion going on on that front, but, unfortunately, we
concluded to keep tcmalloc as the default; anybody who needs the performance
should move to jemalloc.
One of the reasons is that jemalloc seems to consume ~200MB more memory/osd
during an IO run...
But I think this is one of the serious issues with tcmalloc that we need to
consider as well. I posted these findings earlier on ceph-devel during my write
path optimization investigation.
There are some settings in tcmalloc that should expedite this memory release,
though. I tried them, but they didn't work. I didn't dig further down that
route.

Mark,
Did you observe similar tcmalloc behavior in your recovery experiment for 
tcmalloc vs jemalloc?


Hi Somnath,

I haven't graphed out all of the results, but on slide 13 of the tech 
talk I gave the other week you can see one of the recovery test examples 
comparing tcmalloc and jemalloc:


http://nhm.ceph.com/mark_nelson_ceph_tech_talk.odp

In all cases (jemalloc too) after recovery completed we didn't return 
completely to previous RSS levels.  It would be good to include some 
kind of configurable heap release options in cbt to see what that does 
in each case.


I did try various things to reduce jemalloc memory usage with little 
effect.  I question if the environment variables I set were properly 
being read.  I think we need to do more testing with disabling 
transparent huge pages and see how much effect that has.  I don't want 
to totally give up on jemalloc as default yet since we've got so many 
more things to look into regarding memory usage.


Mark
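
For reference, one of the tcmalloc knobs alluded to above is the release rate. A
sketch of trying it - the variable has to be in the OSD's environment at start
(init script / sysconfig), and whether it actually helps is exactly what is in
question here:

  # 0..10, higher means free pages are returned to the OS more aggressively (default 1.0)
  export TCMALLOC_RELEASE_RATE=10
  /etc/init.d/ceph restart osd.0       # or however your OSDs are (re)started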



Thanks & Regards
Somnath

-Original Message-
From: Chad William Seys [mailto:cws...@physics.wisc.edu]
Sent: Friday, August 28, 2015 7:58 AM
To: 池信泽
Cc: Somnath Roy; Haomai Wang; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] RAM usage only very slowly decreases after cluster 
recovery

Thanks! 'ceph tell osd.* heap release' seems to have worked!  Guess I'll 
sprinkle it around my maintenance scripts.

Somnath Is there a plan to make jemalloc standard in Ceph in the future?

Thanks!
Chad.




PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Huge memory usage spike in OSD on hammer/giant

2015-09-09 Thread Jan Schermer
You can sort of simulate it:

 * E.g. if you do something silly like "ceph osd crush reweight osd.1 
 1" you will see the RSS of osd.28 skyrocket. Reweighting it back down 
 will not release the memory until you do "heap release".

But this is expected, methinks.

Jan


> On 09 Sep 2015, at 15:51, Mark Nelson  wrote:
> 
> Yes, under no circumstances is it really ok for an OSD to consume 8GB of RSS! 
> :)  It'd be really swell if we could replicate that kind of memory growth 
> in-house on demand.
> 
> Mark
> 
> On 09/09/2015 05:56 AM, Jan Schermer wrote:
>> Sorry if I wasn't clear.
>> Going from 2GB to 8GB is not normal, although some slight bloating is 
>> expected. In your case it just got much worse than usual for reasons yet 
>> unknown.
>> 
>> Jan
>> 
>> 
>>> On 09 Sep 2015, at 12:40, Mariusz Gronczewski 
>>>  wrote:
>>> 
>>> 
>>> well I was going by
>>> http://ceph.com/docs/master/start/hardware-recommendations/ and planning 
>>> for 2GB per OSD so that was a suprise maybe there should be warning 
>>> somewhere ?
>>> 
>>> 
>>> On Wed, 9 Sep 2015 12:21:15 +0200, Jan Schermer  wrote:
>>> 
 The memory gets used for additional PGs on the OSD.
 If you were to "swap" PGs between two OSDs, you'll get memory wasted on 
 both of them because tcmalloc doesn't release it.*
 It usually gets stable after few days even during backfills, so it does 
 get reused if needed.
 If for some reason your OSDs get to 8GB RSS then I recommend you just get 
 more memory, or try disabling tcmalloc which can either help or make it 
 even worse :-)
 
 * E.g. if you do something silly like "ceph osd crush reweight osd.1 
 1" you will see the RSS of osd.28 skyrocket. Reweighting it back down 
 will not release the memory until you do "heap release".
 
 Jan
 
 
> On 09 Sep 2015, at 12:05, Mariusz Gronczewski 
>  wrote:
> 
> On Tue, 08 Sep 2015 16:14:15 -0500, Chad William Seys
>  wrote:
> 
>> Does 'ceph tell osd.* heap release' help with OSD RAM usage?
>> 
>> From
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-August/003932.html
>> 
>> Chad.
> 
> it did help now, but cluster is in clean state at the moment. But I
> didnt know that one, thanks.
> 
> High memory usage stopped once cluster rebuilt, but I've planned
> cluster to have 2GB per OSD so I needed to add ram to even get to the
> point of ceph starting to rebuild, as some OSD ate up to 8 GBs during
> recover
> 
> --
> Mariusz Gronczewski, Administrator
> 
> Efigence S. A.
> ul. Wołoska 9a, 02-583 Warszawa
> T: [+48] 22 380 13 13
> F: [+48] 22 380 13 14
> E: mariusz.gronczew...@efigence.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
>>> 
>>> 
>>> 
>>> --
>>> Mariusz Gronczewski, Administrator
>>> 
>>> Efigence S. A.
>>> ul. Wołoska 9a, 02-583 Warszawa
>>> T: [+48] 22 380 13 13
>>> F: [+48] 22 380 13 14
>>> E: mariusz.gronczew...@efigence.com
>>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Huge memory usage spike in OSD on hammer/giant

2015-09-09 Thread Chad William Seys

> Going from 2GB to 8GB is not normal, although some slight bloating is
> expected. 

If I recall correctly, Mariusz's cluster had a period of flapping OSDs?

I experienced a similar situation using hammer. My OSDs went from 10GB in 
RAM in a Healthy state to 24GB RAM + 10GB swap in a recovering state.  I also 
could not re-add a node b/c every time I tried OOM killer would kill an OSD 
daemon somewhere before the cluster could become healthy again.

Therefore I propose we begin expecting bloating under these circumstances.  :) 

> In your case it just got much worse than usual for reasons yet
> unknown.

Not really unknown: B/c 'ceph tell osd.* heap release' freed RAM for Mariusz, 
I think we know the reason for so much RAM use is b/c of tcmalloc not freeing 
unused memory.   Right?

Here is a related "urgent" and "won't fix" bug which applies here:
http://tracker.ceph.com/issues/12681 .  Sage suggests making the heap release
command a cron job.   :)

Have fun!
Chad.
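
For reference, the cron workaround could be as small as the sketch below; the
six-hour interval is a guess, and it needs to run somewhere with an admin keyring:

  # /etc/cron.d/ceph-heap-release
  0 */6 * * * root /usr/bin/ceph tell 'osd.*' heap release > /dev/null 2>&1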
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS and caching

2015-09-09 Thread Kyle Hutson
We are using Hammer - latest released version. How do I check if it's
getting promoted into the cache?

We're using the latest ceph kernel client. Where do I poke at readahead
settings there?

On Tue, Sep 8, 2015 at 8:29 AM, Gregory Farnum  wrote:

> On Thu, Sep 3, 2015 at 11:58 PM, Kyle Hutson  wrote:
> > I was wondering if anybody could give me some insight as to how CephFS
> does
> > its caching - read-caching in particular.
> >
> > We are using CephFS with an EC pool on the backend with a replicated
> cache
> > pool in front of it. We're seeing some very slow read times. Trying to
> > compute an md5sum on a 15GB file twice in a row (so it should be in
> cache)
> > takes the time from 23 minutes down to 17 minutes, but this is over a
> 10Gbps
> > network and with a crap-ton of OSDs (over 300), so I would expect it to
> be
> > down in the 2-3 minute range.
>
> A single sequential read won't necessarily promote an object into the
> cache pool (although if you're using Hammer I think it will), so you
> want to check if it's actually getting promoted into the cache before
> assuming that's happened.
>
> >
> > I'm just trying to figure out what we can do to increase the
> performance. I
> > have over 300 TB of live data that I have to be careful with, though, so
> I
> > have to have some level of caution.
> >
> > Is there some other caching we can do (client-side or server-side) that
> > might give us a decent performance boost?
>
> Which client are you using for this testing? Have you looked at the
> readahead settings? That's usually the big one; if you're only asking
> for 4KB at once then stuff is going to be slow no matter what (a
> single IO takes at minimum about 2 milliseconds right now, although
> the RADOS team is working to improve that).
> -Greg
>
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS and caching

2015-09-09 Thread Gregory Farnum
On Wed, Sep 9, 2015 at 3:27 PM, Kyle Hutson  wrote:
> We are using Hammer - latest released version. How do I check if it's
> getting promoted into the cache?

Umm...that's a good question. You can run rados ls on the cache pool,
but that's not exactly scalable; you can turn up logging and dig into
them to see if redirects are happening, or watch the OSD operations
happening via the admin socket. But I don't know if there's a good
interface for users to just query the cache state of a single object.
:/

>
> We're using the latest ceph kernel client. Where do I poke at readahead
> settings there?

Just the standard kernel readahead settings; I'm not actually familiar
with how to configure those but I don't believe Ceph's are in any way
special. What do you mean by "latest ceph kernel client"; are you
running one of the developer testing kernels or something? I think
Ilya might have mentioned some issues with readahead being
artificially blocked, but that might have only been with RBD.

Oh, are the files you're using sparse? There was a bug with sparse
files not filling in pages that just got patched yesterday or
something.
-Greg
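
For reference, on the kernel CephFS client the readahead window is the rasize
mount option; a sketch, with the monitor address, secret file and the 64MB value
purely illustrative:

  mount -t ceph mon1:6789:/ /mnt/cephfs \
      -o name=admin,secretfile=/etc/ceph/admin.secret,rasize=67108864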

>
> On Tue, Sep 8, 2015 at 8:29 AM, Gregory Farnum  wrote:
>>
>> On Thu, Sep 3, 2015 at 11:58 PM, Kyle Hutson  wrote:
>> > I was wondering if anybody could give me some insight as to how CephFS
>> > does
>> > its caching - read-caching in particular.
>> >
>> > We are using CephFS with an EC pool on the backend with a replicated
>> > cache
>> > pool in front of it. We're seeing some very slow read times. Trying to
>> > compute an md5sum on a 15GB file twice in a row (so it should be in
>> > cache)
>> > takes the time from 23 minutes down to 17 minutes, but this is over a
>> > 10Gbps
>> > network and with a crap-ton of OSDs (over 300), so I would expect it to
>> > be
>> > down in the 2-3 minute range.
>>
>> A single sequential read won't necessarily promote an object into the
>> cache pool (although if you're using Hammer I think it will), so you
>> want to check if it's actually getting promoted into the cache before
>> assuming that's happened.
>>
>> >
>> > I'm just trying to figure out what we can do to increase the
>> > performance. I
>> > have over 300 TB of live data that I have to be careful with, though, so
>> > I
>> > have to have some level of caution.
>> >
>> > Is there some other caching we can do (client-side or server-side) that
>> > might give us a decent performance boost?
>>
>> Which client are you using for this testing? Have you looked at the
>> readahead settings? That's usually the big one; if you're only asking
>> for 4KB at once then stuff is going to be slow no matter what (a
>> single IO takes at minimum about 2 milliseconds right now, although
>> the RADOS team is working to improve that).
>> -Greg
>>
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Huge memory usage spike in OSD on hammer/giant

2015-09-09 Thread Chad William Seys
On Tuesday, September 08, 2015 18:28:48 Shinobu Kinjo wrote:
> Have you ever?
> 
> http://ceph.com/docs/master/rados/troubleshooting/memory-profiling/

No.  But the command 'ceph tell osd.* heap release' did cause my OSDs to 
consume the "normal" amount of RAM.  ("normal" in this case means the same 
amount of RAM as before my cluster went through a recovery phase.)

Chad.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Huge memory usage spike in OSD on hammer/giant

2015-09-09 Thread Mark Nelson
Yes, under no circumstances is it really ok for an OSD to consume 8GB of 
RSS! :)  It'd be really swell if we could replicate that kind of memory 
growth in-house on demand.


Mark

On 09/09/2015 05:56 AM, Jan Schermer wrote:

Sorry if I wasn't clear.
Going from 2GB to 8GB is not normal, although some slight bloating is expected. 
In your case it just got much worse than usual for reasons yet unknown.

Jan



On 09 Sep 2015, at 12:40, Mariusz Gronczewski 
 wrote:


well I was going by
http://ceph.com/docs/master/start/hardware-recommendations/ and planning for 
2GB per OSD so that was a suprise maybe there should be warning somewhere ?


On Wed, 9 Sep 2015 12:21:15 +0200, Jan Schermer  wrote:


The memory gets used for additional PGs on the OSD.
If you were to "swap" PGs between two OSDs, you'll get memory wasted on both of 
them because tcmalloc doesn't release it.*
It usually gets stable after few days even during backfills, so it does get 
reused if needed.
If for some reason your OSDs get to 8GB RSS then I recommend you just get more 
memory, or try disabling tcmalloc which can either help or make it even worse 
:-)

* E.g. if you do something silly like "ceph osd crush reweight osd.1 1" you will see 
the RSS of osd.28 skyrocket. Reweighting it back down will not release the memory until you do 
"heap release".

Jan



On 09 Sep 2015, at 12:05, Mariusz Gronczewski 
 wrote:

On Tue, 08 Sep 2015 16:14:15 -0500, Chad William Seys
 wrote:


Does 'ceph tell osd.* heap release' help with OSD RAM usage?

From
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-August/003932.html

Chad.


it did help now, but cluster is in clean state at the moment. But I
didnt know that one, thanks.

High memory usage stopped once cluster rebuilt, but I've planned
cluster to have 2GB per OSD so I needed to add ram to even get to the
point of ceph starting to rebuild, as some OSD ate up to 8 GBs during
recover

--
Mariusz Gronczewski, Administrator

Efigence S. A.
ul. Wołoska 9a, 02-583 Warszawa
T: [+48] 22 380 13 13
F: [+48] 22 380 13 14
E: mariusz.gronczew...@efigence.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






--
Mariusz Gronczewski, Administrator

Efigence S. A.
ul. Wołoska 9a, 02-583 Warszawa
T: [+48] 22 380 13 13
F: [+48] 22 380 13 14
E: mariusz.gronczew...@efigence.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question on cephfs recovery tools

2015-09-09 Thread Shinobu Kinjo
Hi Goncalo,

>> a./ Under a situation as the one describe above, how can we safely
>> terminate cephfs in the clients? I have had situations where
>> umount simply hangs and there is no real way to unblock the
>> situation unless I reboot the client. If we have hundreds of
>> clients, I would like to avoid that.

Use "lsof" to find process accessing filesystem. I would see process id.
And then kill that process using:

 kill -9 

But you **must** make sure if it's ok or not to kill that process.
You have to be careful no matter when you kill any process.
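
A concrete sequence might look like this, with the mount point and pid as
placeholders:

  fuser -vm /cephfs        # or: lsof /cephfs
  kill -9 <pid>            # only if that process is safe to kill
  umount -l /cephfs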

>> b./ I was expecting to have lost metadata information since I've
>> clean OSDs where metadata information was stored for the
>> /cephfs/goncalo/5Gbytes_029.txt file. I was a bit surprised that
>> the /'cephfs/goncalo/5Gbytes_029.txt' was still properly
>> referenced, without me having to run any recover tool. What am I
>> missing?
>>
>> c./ After recovering the cluster, I though I was in a cephfs
>> situation where I had
>> c.1 files with holes (because of lost PGs and objects in the
>> data pool)
>> c.2 files without metadata (because of lost PGs and objects in
>> the metadata pool)
>> c.3 metadata without associated files (because of lost PGs and
>> objects in the data pool)
>> I've tried to run the recovery tools, but I have several doubts
>> which I did not found described in the documentation
>> - Is there a specific order / a way to run the tools for the
>> c.1, c.2 and c.3 cases I mentioned?

Which recovery tools are these?
How did you run them?
I'm just assuming d) -;

Am I right?

If so, why did you use that tool?

>> d./ Since I was testing, I simply ran the following sequence but I
>> am not sure of what the command are doing, nor if the sequence is
>> correct. I think an example use case should be documented.
>> Specially the cephfs-data-scan did not returned any output, or
>> information. So, I am not sure if anything happened at all.
>>
>> # cephfs-table-tool 0 reset session
>> {
>> "0": {
>> "data": {},
>> "result": 0
>> }
>> }
>>
>> # cephfs-table-tool 0 reset snap
>> {
>> "result": 0
>> }
>>
>> # cephfs-table-tool 0 reset inode
>> {
>> "0": {
>> "data": {},
>> "result": 0
>> }
>> }
>>
>> # cephfs-journal-tool --rank=0 journal reset
>> old journal was 4194304~22381701
>> new journal start will be 29360128 (2784123 bytes past old end)
>> writing journal head
>> writing EResetJournal entry
>> done
>>
>> # cephfs-data-scan init
>>
>> # cephfs-data-scan scan_extents cephfs_dt
>> # cephfs-data-scan scan_inodes cephfs_dt
>>
>> # cephfs-data-scan scan_extents --force-pool cephfs_mt
>> (doesn't seem to work)
>>
>> e./ After running the cephfs tools, everything seemed exactly in
>> the same status. No visible changes or errors at the filesystem
>> level. So, at this point not sure what to conclude...

Anyway, just let me know whether your ceph cluster is in production or not.
I do hope not -;

Shinobu

- Original Message -
From: gonc...@physics.usyd.edu.au
To: "Shinobu Kinjo" 
Cc: "ceph-users" 
Sent: Wednesday, September 9, 2015 9:50:30 PM
Subject: Re: [ceph-users] Question on cephfs recovery tools


Hi Shinobu

I did check that page but I do not think that in its current state it  
helps much.

If you look to my email, I did try the operations documented there but  
nothing substantial really happened. The tools do not produce any  
output so I am not sure what they did, if they did something at all.  
 From the documentation it is also not obvious in which situations we  
should use the tools, and if there is a particular order to run them.

The reason for my email is to get some clarification on that.

Cheers

Quoting Shinobu Kinjo :

> Anyhow this page would help you:
>
> http://ceph.com/docs/master/cephfs/disaster-recovery/
>
> Shinobu
>
> - Original Message -
> From: "Shinobu Kinjo" 
> To: "Goncalo Borges" 
> Cc: "ceph-users" 
> Sent: Wednesday, September 9, 2015 5:28:38 PM
> Subject: Re: [ceph-users] Question on cephfs recovery tools
>
> Did you try to identify what kind of processes were accessing  
> filesystem using fuser or lsof and then kill them?
> If not, you had to do that first.
>
> Shinobu
>
> - Original Message -
> From: "Goncalo Borges" 
> To: ski...@redhat.com
> Sent: Wednesday, September 9, 2015 5:04:23 PM
> Subject: Re: [ceph-users] Question on cephfs recovery tools
>
> Hi Shinobu
>
>> Did you unmount filesystem 

Re: [ceph-users] RAM usage only very slowly decreases after cluster recovery

2015-09-09 Thread Chad William Seys
Thanks Somnath!
I found a bug in the tracker to follow: http://tracker.ceph.com/issues/12681

Chad.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS and caching

2015-09-09 Thread Kyle Hutson
On Wed, Sep 9, 2015 at 9:34 AM, Gregory Farnum  wrote:

> On Wed, Sep 9, 2015 at 3:27 PM, Kyle Hutson  wrote:
> > We are using Hammer - latest released version. How do I check if it's
> > getting promoted into the cache?
>
> Umm...that's a good question. You can run rados ls on the cache pool,
> but that's not exactly scalable; you can turn up logging and dig into
> them to see if redirects are happening, or watch the OSD operations
> happening via the admin socket. But I don't know if there's a good
> interface for users to just query the cache state of a single object.
> :/
>

even using 'rados ls', I (naturally) get cephfs object names - is there a
way to see a filename -> objectname conversion ... or objectname ->
filename ?


> > We're using the latest ceph kernel client. Where do I poke at readahead
> > settings there?
>
> Just the standard kernel readahead settings; I'm not actually familiar
> with how to configure those but I don't believe Ceph's are in any way
> special. What do you mean by "latest ceph kernel client"; are you
> running one of the developer testing kernels or something?


No, just what comes with the latest stock kernel. Sorry for any confusion.


> I think
> Ilya might have mentioned some issues with readahead being
> artificially blocked, but that might have only been with RBD.
>
> Oh, are the files you're using sparse? There was a bug with sparse
> files not filling in pages that just got patched yesterday or
> something.
>

No, these are not sparse files. Just really big.


> >
> > On Tue, Sep 8, 2015 at 8:29 AM, Gregory Farnum 
> wrote:
> >>
> >> On Thu, Sep 3, 2015 at 11:58 PM, Kyle Hutson 
> wrote:
> >> > I was wondering if anybody could give me some insight as to how CephFS
> >> > does
> >> > its caching - read-caching in particular.
> >> >
> >> > We are using CephFS with an EC pool on the backend with a replicated
> >> > cache
> >> > pool in front of it. We're seeing some very slow read times. Trying to
> >> > compute an md5sum on a 15GB file twice in a row (so it should be in
> >> > cache)
> >> > takes the time from 23 minutes down to 17 minutes, but this is over a
> >> > 10Gbps
> >> > network and with a crap-ton of OSDs (over 300), so I would expect it
> to
> >> > be
> >> > down in the 2-3 minute range.
> >>
> >> A single sequential read won't necessarily promote an object into the
> >> cache pool (although if you're using Hammer I think it will), so you
> >> want to check if it's actually getting promoted into the cache before
> >> assuming that's happened.
> >>
> >> >
> >> > I'm just trying to figure out what we can do to increase the
> >> > performance. I
> >> > have over 300 TB of live data that I have to be careful with, though,
> so
> >> > I
> >> > have to have some level of caution.
> >> >
> >> > Is there some other caching we can do (client-side or server-side)
> that
> >> > might give us a decent performance boost?
> >>
> >> Which client are you using for this testing? Have you looked at the
> >> readahead settings? That's usually the big one; if you're only asking
> >> for 4KB at once then stuff is going to be slow no matter what (a
> >> single IO takes at minimum about 2 milliseconds right now, although
> >> the RADOS team is working to improve that).
> >> -Greg
> >>
> >> >
> >> > ___
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >
> >
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS and caching

2015-09-09 Thread Gregory Farnum
On Wed, Sep 9, 2015 at 4:26 PM, Kyle Hutson  wrote:
>
>
> On Wed, Sep 9, 2015 at 9:34 AM, Gregory Farnum  wrote:
>>
>> On Wed, Sep 9, 2015 at 3:27 PM, Kyle Hutson  wrote:
>> > We are using Hammer - latest released version. How do I check if it's
>> > getting promoted into the cache?
>>
>> Umm...that's a good question. You can run rados ls on the cache pool,
>> but that's not exactly scalable; you can turn up logging and dig into
>> them to see if redirects are happening, or watch the OSD operations
>> happening via the admin socket. But I don't know if there's a good
>> interface for users to just query the cache state of a single object.
>> :/
>
>
> even using 'rados ls', I (naturally) get cephfs object names - is there a
> way to see a filename -> objectname conversion ... or objectname -> filename
> ?

The object name is {inode number in hex}.{object index within the file}. So you can
look at the file inode and then see which of its objects are actually
in the pool.
-Greg
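
A worked example of that mapping, as shell commands (the file path and cache
pool name here are illustrative):

  ls -i /cephfs/bigfile.dat                        # decimal inode, e.g. 1099511627812
  printf '%x\n' 1099511627812                      # -> 10000000024, the hex prefix of its objects
  rados -p cachepool ls | grep '^10000000024\.'    # that file's objects currently in the cache tier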

>
>>
>> > We're using the latest ceph kernel client. Where do I poke at readahead
>> > settings there?
>>
>> Just the standard kernel readahead settings; I'm not actually familiar
>> with how to configure those but I don't believe Ceph's are in any way
>> special. What do you mean by "latest ceph kernel client"; are you
>> running one of the developer testing kernels or something?
>
>
> No, just what comes with the latest stock kernel. Sorry for any confusion.
>
>>
>> I think
>> Ilya might have mentioned some issues with readahead being
>> artificially blocked, but that might have only been with RBD.
>>
>> Oh, are the files you're using sparse? There was a bug with sparse
>> files not filling in pages that just got patched yesterday or
>> something.
>
>
> No, these are not sparse files. Just really big.
>
>>
>> >
>> > On Tue, Sep 8, 2015 at 8:29 AM, Gregory Farnum 
>> > wrote:
>> >>
>> >> On Thu, Sep 3, 2015 at 11:58 PM, Kyle Hutson 
>> >> wrote:
>> >> > I was wondering if anybody could give me some insight as to how
>> >> > CephFS
>> >> > does
>> >> > its caching - read-caching in particular.
>> >> >
>> >> > We are using CephFS with an EC pool on the backend with a replicated
>> >> > cache
>> >> > pool in front of it. We're seeing some very slow read times. Trying
>> >> > to
>> >> > compute an md5sum on a 15GB file twice in a row (so it should be in
>> >> > cache)
>> >> > takes the time from 23 minutes down to 17 minutes, but this is over a
>> >> > 10Gbps
>> >> > network and with a crap-ton of OSDs (over 300), so I would expect it
>> >> > to
>> >> > be
>> >> > down in the 2-3 minute range.
>> >>
>> >> A single sequential read won't necessarily promote an object into the
>> >> cache pool (although if you're using Hammer I think it will), so you
>> >> want to check if it's actually getting promoted into the cache before
>> >> assuming that's happened.
>> >>
>> >> >
>> >> > I'm just trying to figure out what we can do to increase the
>> >> > performance. I
>> >> > have over 300 TB of live data that I have to be careful with, though,
>> >> > so
>> >> > I
>> >> > have to have some level of caution.
>> >> >
>> >> > Is there some other caching we can do (client-side or server-side)
>> >> > that
>> >> > might give us a decent performance boost?
>> >>
>> >> Which client are you using for this testing? Have you looked at the
>> >> readahead settings? That's usually the big one; if you're only asking
>> >> for 4KB at once then stuff is going to be slow no matter what (a
>> >> single IO takes at minimum about 2 milliseconds right now, although
>> >> the RADOS team is working to improve that).
>> >> -Greg
>> >>
>> >> >
>> >> > ___
>> >> > ceph-users mailing list
>> >> > ceph-users@lists.ceph.com
>> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> >
>> >
>> >
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] purpose of different default pools created by radosgw instance

2015-09-09 Thread Ben Hines
The Ceph docs in general could use a lot of improvement, IMO. There
are many, many
settings listed, but one must dive into the mailing list to learn
which ones are worth tweaking (And often, even *what they do*!)

-Ben

On Wed, Sep 9, 2015 at 3:51 PM, Mark Kirkwood
 wrote:
> On 16/09/14 17:10, pragya jain wrote:
>> Hi all!
>>
>> As document says, ceph has some default pools for radosgw instance. These 
>> pools are:
>>   * .rgw.root
>>   * .rgw.control
>>   * .rgw.gc
>>   * .rgw.buckets
>>   * .rgw.buckets.index
>>   * .log
>>   * .intent-log
>>   * .usage
>>   * .users
>>   * .users.email
>>   * .users.swift
>>   * .users.uid
>> Can somebody explain me what are the purpose of these different pools in 
>> terms of storing the data, for example, according to my understanding,
>>   * .users pool contains the information of the users that have their 
>> account in the system
>>   * .users.swift contains the information of users that are using Swift 
>> APIs to authenticate to the system.
>> Please help me to clarify all these concepts.
>>
>> Regards
>> Pragya Jain
>>
>
> I'd like to add a +1 to this, just had some issues with puzzling
> contents of one of these pools and the scarcity of doco on them made it
> more puzzling still. Fortunately one can poke about in src/rgw/ for
> enlightenment but that is slower than simply reading some nice docs!
>
> regards
>
> Mark
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] purpose of different default pools created by radosgw instance

2015-09-09 Thread Shinobu Kinjo
That's good point actually.
Probably saves our life -;

Shinobu

- Original Message -
From: "Ben Hines" 
To: "Mark Kirkwood" 
Cc: "ceph-users" 
Sent: Thursday, September 10, 2015 8:23:26 AM
Subject: Re: [ceph-users] purpose of different default pools created by radosgw 
instance

The Ceph docs in general could use a lot of improvement, IMO. There
are many, many
settings listed, but one must dive into the mailing list to learn
which ones are worth tweaking (And often, even *what they do*!)

-Ben

On Wed, Sep 9, 2015 at 3:51 PM, Mark Kirkwood
 wrote:
> On 16/09/14 17:10, pragya jain wrote:
>> Hi all!
>>
>> As document says, ceph has some default pools for radosgw instance. These 
>> pools are:
>>   * .rgw.root
>>   * .rgw.control
>>   * .rgw.gc
>>   * .rgw.buckets
>>   * .rgw.buckets.index
>>   * .log
>>   * .intent-log
>>   * .usage
>>   * .users
>>   * .users.email
>>   * .users.swift
>>   * .users.uid
>> Can somebody explain me what are the purpose of these different pools in 
>> terms of storing the data, for example, according to my understanding,
>>   * .users pool contains the information of the users that have their 
>> account in the system
>>   * .users.swift contains the information of users that are using Swift 
>> APIs to authenticate to the system.
>> Please help me to clarify all these concepts.
>>
>> Regards
>> Pragya Jain
>>
>
> I'd like to add a +1 to this, just had some issues with puzzling
> contents of one of these pools and the scarcity of doco on them made it
> more puzzling still. Fortunately one can poke about in src/rgw/ for
> enlightenment but that is slower than simply reading some nice docs!
>
> regards
>
> Mark
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] purpose of different default pools created by radosgw instance

2015-09-09 Thread Mark Kirkwood
On 16/09/14 17:10, pragya jain wrote:
> Hi all!
> 
> As document says, ceph has some default pools for radosgw instance. These 
> pools are:
>   * .rgw.root
>   * .rgw.control
>   * .rgw.gc
>   * .rgw.buckets
>   * .rgw.buckets.index
>   * .log
>   * .intent-log
>   * .usage
>   * .users
>   * .users.email
>   * .users.swift
>   * .users.uid
> Can somebody explain me what are the purpose of these different pools in 
> terms of storing the data, for example, according to my understanding, 
>   * .users pool contains the information of the users that have their 
> account in the system
>   * .users.swift contains the information of users that are using Swift 
> APIs to authenticate to the system.
> Please help me to clarify all these concepts.
> 
> Regards 
> Pragya Jain
> 

I'd like to add a +1 to this, just had some issues with puzzling
contents of one of these pools and the scarcity of doco on them made it
more puzzling still. Fortunately one can poke about in src/rgw/ for
enlightenment but that is slower than simply reading some nice docs!

regards

Mark

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question on cephfs recovery tools

2015-09-09 Thread Goncalo Borges

Hey Shinobu

Thanks for the replies.


 a./ Under a situation such as the one described above, how can we safely
 terminate cephfs in the clients? I have had situations where
 umount simply hangs and there is no real way to unblock the
 situation unless I reboot the client. If we have hundreds of
 clients, I would like to avoid that.

Use "lsof" to find the processes accessing the filesystem. That will show the
process ids. And then kill each of those processes using:

  kill -9 <pid>

But you **must** make sure whether it is ok or not to kill that process.
You have to be careful whenever you kill any process.


Sure.  Thanks for the advice.
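
A minimal sketch of that workflow, assuming the client mount point is /cephfs:

  fuser -vm /cephfs        # or: lsof /cephfs (lists the processes holding the mount)
  kill <pid>               # try a plain SIGTERM first
  kill -9 <pid>            # only if the process refuses to exit
  umount /cephfs           # or umount -l /cephfs as a last resort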






 b./ I was expecting to have lost metadata information since I've
 clean OSDs where metadata information was stored for the
 /cephfs/goncalo/5Gbytes_029.txt file. I was a bit surprised that
 the /'cephfs/goncalo/5Gbytes_029.txt' was still properly
 referenced, without me having to run any recover tool. What am I
 missing?

 c./ After recovering the cluster, I though I was in a cephfs
 situation where I had
 c.1 files with holes (because of lost PGs and objects in the
 data pool)
 c.2 files without metadata (because of lost PGs and objects in
 the metadata pool)
 c.3 metadata without associated files (because of lost PGs and
 objects in the data pool)
 I've tried to run the recovery tools, but I have several doubts
 which I did not found described in the documentation
 - Is there a specific order / a way to run the tools for the
 c.1, c.2 and c.3 cases I mentioned?

What are the recovery tools?
What did you do with that tool?
I'm just assuming d) -;

Am I right?

If so, why did you use that tool?


The tools and the order of execution I've used were the ones mentioned
in my point d./ below. However, I am not really sure if what I did was
correct. The tools have not provided any output, nor have I seen any
meaningful change when comparing files in the filesystem before and after
their execution. So, I am a bit in the dark concerning what the tools
do. I guess that the tools should log what they are doing so that the
admin understands what is going on. At the end, they should give a
summary of what they fixed or did not fix.
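
For reference, the rough sequence from the upstream CephFS disaster-recovery
notes of that era looks like the following; the data pool name is taken from
the example further down, and the exact sub-commands can differ between
releases, so treat this as a sketch rather than a recipe:

  cephfs-journal-tool journal export backup.bin
  cephfs-journal-tool event recover_dentries summary
  cephfs-journal-tool journal reset
  cephfs-table-tool all reset session
  cephfs-data-scan scan_extents cephfs_dt
  cephfs-data-scan scan_inodes cephfs_dt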


Another thing that puzzled me was what I reported in point b./ I was
able to list /cephfs/goncalo/5Gbytes_029.txt after I recovered the
Ceph cluster, restarted the mds and remounted the client, without having to
run any recovery tools. Please be aware that the original problem was
generated by me when I destroyed the 3 OSDs (my cluster is configured
with 3 replicas) where the metadata for this file was stored. I do not
understand why the metadata information for the file was still available.


3) Get its inode, and convert it to HEX

# ls -li /cephfs/goncalo/5Gbytes_029.txt
1099511627812 -rw-r--r-- 1 root root 5368709120 Sep  8 03:55
/cephfs/goncalo/5Gbytes_029.txt

(1099511627812)_base10 = (10000000024)_base16

--- * ---

5) Get the file / PG / OSD mapping

# ceph osd map cephfs_dt 10000000024.00000000
osdmap e479 pool 'cephfs_dt' (1) object '10000000024.00000000' ->
pg 1.c18fbb6f (1.36f) -> up ([19,15,6], p19) acting ([19,15,6], p19)
# ceph osd map cephfs_mt 10000000024.00000000
osdmap e479 pool 'cephfs_mt' (2) object '10000000024.00000000' ->
pg 2.c18fbb6f (2.36f) -> up ([27,23,13], p27) acting ([27,23,13], p27)

--- * ---

6) Kill the relevant osd daemons, umount the osd partition and
delete the partitions

[root@server1 ~]# for o in 6; do dev=`df /var/lib/ceph/osd/ceph-$o
| tail -n 1 | awk '{print $1}'`; /etc/init.d/ceph stop osd.$o;
umount /var/lib/ceph/osd/ceph-$o; parted -s ${dev::8} rm 1; parted
-s  ${dev::8} rm 2; partprobe; done

[root@server2 ~]# for o in 13 15; do dev=`df
/var/lib/ceph/osd/ceph-$o | tail -n 1 | awk '{print $1}'`;
/etc/init.d/ceph stop osd.$o; umount /var/lib/ceph/osd/ceph-$o;
parted -s ${dev::8} rm 1; parted -s  ${dev::8} rm 2; partprobe; done

[root@server3 ~]# for o in 19 23; do dev=`df
/var/lib/ceph/osd/ceph-$o | tail -n 1 | awk '{print $1}'`;
/etc/init.d/ceph stop osd.$o; umount /var/lib/ceph/osd/ceph-$o;
parted -s ${dev::8} rm 1; parted -s  ${dev::8} rm 2; partprobe; done

[root@server4 ~]# for o in 27; do dev=`df
/var/lib/ceph/osd/ceph-$o | tail -n 1 | awk '{print $1}'`;
/etc/init.d/ceph stop osd.$o; umount /var/lib/ceph/osd/ceph-$o;
parted -s ${dev::8} rm 1; parted -s  ${dev::8} rm 2; partprobe; done







 d./ Since I was testing, I simply ran the following sequence but I
 am not sure of what the commands are doing, nor if the sequence is
 correct. I think an example use case should be documented.
 Especially, the cephfs-data-scan did not return any output or
 information. So, I am not sure if anything happened at all.

 # cephfs-table-tool 0 reset session
 {
 

Re: [ceph-users] purpose of different default pools created by radosgw instance

2015-09-09 Thread Mark Kirkwood
On 10/09/15 11:27, Shinobu Kinjo wrote:
> That's good point actually.
> Probably saves our life -;
> 
> Shinobu
> 
> - Original Message -
> From: "Ben Hines" 
> To: "Mark Kirkwood" 
> Cc: "ceph-users" 
> Sent: Thursday, September 10, 2015 8:23:26 AM
> Subject: Re: [ceph-users] purpose of different default pools created by 
> radosgw instance
> 
> The Ceph docs in general could use a lot of improvement, IMO. There
> are many, many
> settings listed, but one must dive into the mailing list to learn
> which ones are worth tweaking (And often, even *what they do*!)


Indeed - but just to be 100% clear re $SUBJECT - most of the rgw pools
do not have their purpose clearly documented anywhere in the docs (just
rechecked).

There is a little bit of enlightenment in src/doc/radosgw/layout.rst -
but this page does not seem to be linked anywhere in the built docs
(that I could find anyway).

regards

Mark

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked

2015-09-09 Thread Vickey Singh
Hey Lincoln



On Tue, Sep 8, 2015 at 7:26 PM, Lincoln Bryant 
wrote:

> For whatever it’s worth, my problem has returned and is very similar to
> yours. Still trying to figure out what’s going on over here.
>
> Performance is nice for a few seconds, then goes to 0. This is a similar
> setup to yours (12 OSDs per box, Scientific Linux 6, Ceph 0.94.3, etc)
>
>   384  16 29520 29504   307.287  1188 0.0492006  0.208259
>   385  16 29813 29797   309.532  1172 0.0469708  0.206731
>   386  16 30105 30089   311.756  1168 0.0375764  0.205189
>   387  16 30401 30385   314.009  1184  0.036142  0.203791
>   388  16 30695 30679   316.231  1176 0.0372316  0.202355
>   389  16 30987 30971    318.42  1168 0.0660476  0.200962
>   390  16 31282 31266   320.628  1180 0.0358611  0.199548
>   391  16 31568 31552   322.734  1144 0.0405166  0.198132
>   392  16 31857 31841   324.859  1156 0.0360826  0.196679
>   393  16 32090 32074   326.404   932 0.0416869   0.19549
>   394  16 32205 32189   326.743   460 0.0251877  0.194896
>   395  16 32302 32286   326.897   388 0.0280574  0.194395
>   396  16 32348 32332   326.537   184 0.0256821  0.194157
>   397  16 32385 32369   326.087   148 0.0254342  0.193965
>   398  16 32424 32408   325.659   156 0.0263006  0.193763
>   399  16 32445 32429   325.054    84 0.0233839  0.193655
> 2015-09-08 11:22:31.940164 min lat: 0.0165045 max lat: 67.6184 avg lat:
> 0.193655
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>   400  16 32445 32429   324.241 0 -  0.193655
>   401  16 32445 32429   323.433 0 -  0.193655
>   402  16 32445 32429   322.628 0 -  0.193655
>   403  16 32445 32429   321.828 0 -  0.193655
>   404  16 32445 32429   321.031 0 -  0.193655
>   405  16 32445 32429   320.238 0 -  0.193655
>   406  16 32445 32429    319.45 0 -  0.193655
>   407  16 32445 32429   318.665 0 -  0.193655
>
> needless to say, very strange.
>

It's indeed very strange.

(Regarding the solution that was suggested to me in the email below) Have you tried
restarting all OSDs?

By the way, my problem got fixed (but I am afraid it can come back any
time) by doing:

# service ceph restart osd  on all OSD nodes (this didn't help)
# set noout, nodown, nobackfill, norecover and then reboot all OSD nodes (it
worked). After that, all the rados bench writes started to work.

[I know it's hilarious, feels like I am watching *The IT Crowd*: 'Hello
IT, have you tried turning it OFF and ON again?']

It would be really helpful if someone provides a real solution.
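
For reference, the flag sequence described above boils down to the following
commands; remember to unset the flags once the nodes are back and the cluster
has settled:

  ceph osd set noout
  ceph osd set nodown
  ceph osd set nobackfill
  ceph osd set norecover
  # ... restart / reboot the OSD nodes ...
  ceph osd unset norecover
  ceph osd unset nobackfill
  ceph osd unset nodown
  ceph osd unset noout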




>
> —Lincoln
>
>
> > On Sep 7, 2015, at 3:35 PM, Vickey Singh 
> wrote:
> >
> > Adding ceph-users.
> >
> > On Mon, Sep 7, 2015 at 11:31 PM, Vickey Singh <
> vickey.singh22...@gmail.com> wrote:
> >
> >
> > On Mon, Sep 7, 2015 at 10:04 PM, Udo Lembke 
> wrote:
> > Hi Vickey,
> > Thanks for your time in replying to my problem.
> >
> > I had the same rados bench output after changing the motherboard of the
> monitor node with the lowest IP...
> > Due to the new mainboard, I assume the hw-clock was wrong during
> startup. Ceph health showed no errors, but all VMs weren't able to do IO (very
> high load on the VMs - but no traffic).
> > I stopped the mon, but this didn't change anything. I had to restart all
> other mons to get IO again. After that I started the first mon also (with
> the right time now) and all worked fine again...
> >
> > Thanks i will try to restart all OSD / MONS and report back , if it
> solves my problem
> >
> > Another possibility:
> > Do you use journal on SSDs? Perhaps the SSDs can't write to garbage
> collection?
> >
> > No i don't have journals on SSD , they are on the same OSD disk.
> >
> >
> >
> > Udo
> >
> >
> > On 07.09.2015 16:36, Vickey Singh wrote:
> >> Dear Experts
> >>
> >> Can someone please help me , why my cluster is not able write data.
> >>
> >> See the below output  cur MB/S  is 0  and Avg MB/s is decreasing.
> >>
> >>
> >> Ceph Hammer  0.94.2
> >> CentOS 6 (3.10.69-1)
> >>
> >> The Ceph status says OPS are blocked , i have tried checking , what all
> i know
> >>
> >> - System resources ( CPU , net, disk , memory )-- All normal
> >> - 10G network for public and cluster network  -- no saturation
> >> - Add disks are physically healthy
> >> - No messages in /var/log/messages OR dmesg
> >> - Tried restarting OSD which are blocking operation , but no luck
> >> - Tried writing through RBD  and Rados bench , both are giving same
> problemm
> >>
> >> Please help me to 

[ceph-users] Ceph/Radosgw v0.94 Content-Type versus Content-type

2015-09-09 Thread Chang, Fangzhe (Fangzhe)
I noticed that the S3 Java SDK's getContentType() no longer works with Ceph/Radosgw
v0.94 (Hammer). It seems that the S3 SDK expects the metadata key “Content-Type”
whereas ceph responds with “Content-type”.
Does anyone know how to request that this issue be fixed?

Fangzhe


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked

2015-09-09 Thread Bill Sanders
We were experiencing something similar in our setup (rados bench does some
work, then comes to a screeching halt).  No pattern to which OSD's were
causing the problem, though.  Sounds like similar hardware (This was on
Dell R720xd, and yeah, that controller is suuuper frustrating).

For us, setting tcp_moderate_rcvbuf to 0 on all nodes solved the issue.

echo 0 > /proc/sys/net/ipv4/tcp_moderate_rcvbuf

Or set it in /etc/sysctl.conf:

net.ipv4.tcp_moderate_rcvbuf = 0

We figured this out independently after I posted this thread, "Slow/Hung
IOs":
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-January/045674.html

Hope this helps

Bill Sanders

On Wed, Sep 9, 2015 at 11:09 AM, Lincoln Bryant 
wrote:

> Hi Jan,
>
> I’ll take a look at all of those things and report back (hopefully :))
>
> I did try setting all of my OSDs to writethrough instead of writeback on
> the controller, which was significantly more consistent in performance
> (from 1100MB/s down to 300MB/s, but still occasionally dropping to 0MB/s).
> Still plenty of blocked ops.
>
> I was wondering if not-so-nicely failing OSD(s) might be the cause. My
> controller (PERC H730 Mini) seems frustratingly terse with SMART
> information, but at least one disk has a “Non-medium error count” of over
> 20,000..
>
> I’ll try disabling offloads as well.
>
> Thanks much for the suggestions!
>
> Cheers,
> Lincoln
>
> > On Sep 9, 2015, at 3:59 AM, Jan Schermer  wrote:
> >
> > Just to recapitulate - the nodes are doing "nothing" when it drops to
> zero? Not flushing something to drives (iostat)? Not cleaning pagecache
> (kswapd and similiar)? Not out of any type of memory (slab,
> min_free_kbytes)? Not network link errors, no bad checksums (those are hard
> to spot, though)?
> >
> > Unless you find something I suggest you try disabling offloads on the
> NICs and see if the problem goes away.
> >
> > Jan
> >
> >> On 08 Sep 2015, at 18:26, Lincoln Bryant  wrote:
> >>
> >> For whatever it’s worth, my problem has returned and is very similar to
> yours. Still trying to figure out what’s going on over here.
> >>
> >> Performance is nice for a few seconds, then goes to 0. This is a
> similar setup to yours (12 OSDs per box, Scientific Linux 6, Ceph 0.94.3,
> etc)
> >>
> >> 384  16 29520 29504   307.287  1188 0.0492006  0.208259
> >> 385  16 29813 29797   309.532  1172 0.0469708  0.206731
> >> 386  16 30105 30089   311.756  1168 0.0375764  0.205189
> >> 387  16 30401 30385   314.009  1184  0.036142  0.203791
> >> 388  16 30695 30679   316.231  1176 0.0372316  0.202355
> >> 389  16 30987 30971    318.42  1168 0.0660476  0.200962
> >> 390  16 31282 31266   320.628  1180 0.0358611  0.199548
> >> 391  16 31568 31552   322.734  1144 0.0405166  0.198132
> >> 392  16 31857 31841   324.859  1156 0.0360826  0.196679
> >> 393  16 32090 32074   326.404   932 0.0416869   0.19549
> >> 394  16 32205 32189   326.743   460 0.0251877  0.194896
> >> 395  16 32302 32286   326.897   388 0.0280574  0.194395
> >> 396  16 32348 32332   326.537   184 0.0256821  0.194157
> >> 397  16 32385 32369   326.087   148 0.0254342  0.193965
> >> 398  16 32424 32408   325.659   156 0.0263006  0.193763
> >> 399  16 32445 32429   325.054    84 0.0233839  0.193655
> >> 2015-09-08 11:22:31.940164 min lat: 0.0165045 max lat: 67.6184 avg lat:
> 0.193655
> >> sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
> >> 400  16 32445 32429   324.241 0 -  0.193655
> >> 401  16 32445 32429   323.433 0 -  0.193655
> >> 402  16 32445 32429   322.628 0 -  0.193655
> >> 403  16 32445 32429   321.828 0 -  0.193655
> >> 404  16 32445 32429   321.031 0 -  0.193655
> >> 405  16 32445 32429   320.238 0 -  0.193655
> >> 406  16 32445 32429    319.45 0 -  0.193655
> >> 407  16 32445 32429   318.665 0 -  0.193655
> >>
> >> needless to say, very strange.
> >>
> >> —Lincoln
> >>
> >>
> >>> On Sep 7, 2015, at 3:35 PM, Vickey Singh 
> wrote:
> >>>
> >>> Adding ceph-users.
> >>>
> >>> On Mon, Sep 7, 2015 at 11:31 PM, Vickey Singh <
> vickey.singh22...@gmail.com> wrote:
> >>>
> >>>
> >>> On Mon, Sep 7, 2015 at 10:04 PM, Udo Lembke 
> wrote:
> >>> Hi Vickey,
> >>> Thanks for your time in replying to my problem.
> >>>
> >>> I had the same rados bench output after changing the motherboard of
> the monitor node with the lowest IP...
> >>> Due to the new mainboard, I assume the hw-clock was wrong during
> startup. Ceph health 

Re: [ceph-users] rebalancing taking very long time

2015-09-09 Thread Vickey Singh
Agreed with Alphe, Ceph Hammer (0.94.2) sucks when it comes to recovery
and rebalancing.

Here is my Ceph Hammer cluster, which has been like this for more than 30 hours.

You might be thinking about that one OSD which is down and not in.  It's
intentional; I want to remove that OSD.
I want the cluster to become healthy again before I remove that OSD.

Can someone help us with this problem?

 cluster 86edf8b8-b353-49f1-ab0a-a4827a9ea5e8
 health HEALTH_WARN
14 pgs stuck unclean
5 requests are blocked > 32 sec
recovery 420/28358085 objects degraded (0.001%)
recovery 199941/28358085 objects misplaced (0.705%)
too few PGs per OSD (28 < min 30)
 monmap e3: 3 mons at {stor0201=10.100.1.201:6789/0,stor0202
=10.100.1.202:6789/0,stor0203=10.100.1.203:6789/0}
election epoch 1076, quorum 0,1,2 stor0201,stor0202,
stor0203
 osdmap e778879: 96 osds: 95 up, 95 in; 14 remapped pgs
  pgmap v2475334: 896 pgs, 4 pools, 51364 GB data, 9231 kobjects
150 TB used, 193 TB / 344 TB avail
420/28358085 objects degraded (0.001%)
199941/28358085 objects misplaced (0.705%)
 879 active+clean
  14 active+remapped
   3 active+clean+scrubbing+deep



On Tue, Sep 8, 2015 at 5:59 PM, Alphe Salas  wrote:

> I can say exactly the same. I have been using ceph since 0.38 and I have never
> seen OSDs as laggy as with 0.94. The rebalancing/rebuild algorithm is crap in 0.94.
> Seriously, I have 2 OSDs serving 2 discs of 2TB and 4 GB of RAM, and each OSD takes
> 1.6GB !!! Seriously! That snowballs into an avalanche.
>
> Let me be straight and explain what changed.
>
> in 0.38 you could ALWAYS stop the ceph cluster and then start it up; it
> would evaluate whether everyone was back and, if there were enough replicas,
> start rebuilding/rebalancing what was needed. Of course, something like 10
> minutes was necessary to bring up the ceph cluster, but then the
> rebuilding/rebalancing process was smooth.
> With 0.94, first you have 2 OSDs too full at 95% and 4 OSDs at 63% out of 20
> OSDs. Then you get a disc crash, so ceph automatically starts to rebuild and
> rebalance stuff, and there the OSDs start to lag and then to crash.
> You stop the ceph cluster, you change the drive, restart the ceph cluster,
> stop all rebuild processes by setting no-backfill, norecover, noscrub and
> nodeep-scrub, you rm the old osd, create a new one, wait for all osds
> to be in and up, and then the rebuilding/rebalancing (and lagging) starts
> again; since it is automated, there is not much of a choice there.
>
> And again all the osds are stuck in an endless lag/down/recovery cycle...
>
> It is a pain, seriously. 5 days after changing the faulty disc it is still
> locked in the lag/down/recovery cycle.
>
> Sure, it can be argued that my machines are really resource limited and
> that I should buy at least a 3 thousand dollar server. But until 0.72
> the rebalancing/rebuilding process was working smoothly on the same
> hardware.
>
> It seems to me that the rebalancing/rebuilding algorithm is more strict
> now than it was in the past. In the past only what really, really needed to
> be rebuilt or rebalanced was rebalanced or rebuilt.
>
> I can still delete everything and go back to 0.72... just as I could buy a Cray
> T-90 to not have any more problems and have ceph run smoothly. But this will
> not help make ceph a better product.
>
> for me ceph 0.94 is like windows vista...
>
> Alphe Salas
> I.T. engineer
>
>
> On 09/08/2015 10:20 AM, Gregory Farnum wrote:
>
>> On Wed, Sep 2, 2015 at 9:34 PM, Bob Ababurko  wrote:
>>
>>> When I lose a disk OR replace a OSD in my POC ceph cluster, it takes a
>>> very
>>> long time to rebalance.  I should note that my cluster is slightly
>>> unique in
>>> that I am using cephfs(shouldn't matter?) and it currently contains about
>>> 310 million objects.
>>>
>>> The last time I replaced a disk/OSD was 2.5 days ago and it is still
>>> rebalancing.  This is on a cluster with no client load.
>>>
>>> The configurations is 5 hosts with 6 x 1TB 7200rpm SATA OSD's & 1 850 Pro
>>> SSD which contains the journals for said OSD's.  Thats means 30 OSD's in
>>> total.  System disk is on its own disk.  I'm also using a backend network
>>> with single Gb NIC.  THe rebalancing rate(objects/s) seems to be very
>>> slow
>>> when it is close to finishingsay <1% objects misplaced.
>>>
>>> It doesn't seem right that it would take 2+ days to rebalance a 1TB disk
>>> with no load on the cluster.  Are my expectations off?
>>>
>>
>> Possibly...Ceph basically needs to treat each object as a single IO.
>> If you're recovering from a failed disk then you've got to replicate
>> roughly 310 million * 3 / 30 = 31 million objects. If it's perfectly
>> balanced across 30 disks that get 80 IOPS that's 12916 seconds (~3.5
>> hours) worth of work just to read each file — and in reality it's
>> likely to take more than one IO to read the file, and then you have to
>> spend a bunch to write it as well.
>>
>>

Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked

2015-09-09 Thread Vickey Singh
Hello Jan

On Wed, Sep 9, 2015 at 11:59 AM, Jan Schermer  wrote:

> Just to recapitulate - the nodes are doing "nothing" when it drops to
> zero? Not flushing something to drives (iostat)? Not cleaning pagecache
> (kswapd and similiar)? Not out of any type of memory (slab,
> min_free_kbytes)? Not network link errors, no bad checksums (those are hard
> to spot, though)?
>
> Unless you find something I suggest you try disabling offloads on the NICs
> and see if the problem goes away.
>

Could you please elaborate on this point: how do you disable offloads on the
NIC? What does it mean? How do you do it? How is it going to help?

Sorry, I don't know about this.

- Vickey -
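
For reference, offloads are usually inspected and toggled with ethtool; the
interface name below is illustrative:

  ethtool -k eth0                            # show the current offload settings
  ethtool -K eth0 tso off gso off gro off    # disable segmentation / receive offloads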



>
> Jan
>
> > On 08 Sep 2015, at 18:26, Lincoln Bryant  wrote:
> >
> > For whatever it’s worth, my problem has returned and is very similar to
> yours. Still trying to figure out what’s going on over here.
> >
> > Performance is nice for a few seconds, then goes to 0. This is a similar
> setup to yours (12 OSDs per box, Scientific Linux 6, Ceph 0.94.3, etc)
> >
> >  384  16 29520 29504   307.287  1188 0.0492006  0.208259
> >  385  16 29813 29797   309.532  1172 0.0469708  0.206731
> >  386  16 30105 30089   311.756  1168 0.0375764  0.205189
> >  387  16 30401 30385   314.009  1184  0.036142  0.203791
> >  388  16 30695 30679   316.231  1176 0.0372316  0.202355
> >  389  16 30987 30971    318.42  1168 0.0660476  0.200962
> >  390  16 31282 31266   320.628  1180 0.0358611  0.199548
> >  391  16 31568 31552   322.734  1144 0.0405166  0.198132
> >  392  16 31857 31841   324.859  1156 0.0360826  0.196679
> >  393  16 32090 32074   326.404   932 0.0416869   0.19549
> >  394  16 32205 32189   326.743   460 0.0251877  0.194896
> >  395  16 32302 32286   326.897   388 0.0280574  0.194395
> >  396  16 32348 32332   326.537   184 0.0256821  0.194157
> >  397  16 32385 32369   326.087   148 0.0254342  0.193965
> >  398  16 32424 32408   325.659   156 0.0263006  0.193763
> >  399  16 32445 32429   325.054    84 0.0233839  0.193655
> > 2015-09-08 11:22:31.940164 min lat: 0.0165045 max lat: 67.6184 avg lat:
> 0.193655
> >  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
> >  400  16 32445 32429   324.241 0 -  0.193655
> >  401  16 32445 32429   323.433 0 -  0.193655
> >  402  16 32445 32429   322.628 0 -  0.193655
> >  403  16 32445 32429   321.828 0 -  0.193655
> >  404  16 32445 32429   321.031 0 -  0.193655
> >  405  16 32445 32429   320.238 0 -  0.193655
> >  406  16 32445 32429    319.45 0 -  0.193655
> >  407  16 32445 32429   318.665 0 -  0.193655
> >
> > needless to say, very strange.
> >
> > —Lincoln
> >
> >
> >> On Sep 7, 2015, at 3:35 PM, Vickey Singh 
> wrote:
> >>
> >> Adding ceph-users.
> >>
> >> On Mon, Sep 7, 2015 at 11:31 PM, Vickey Singh <
> vickey.singh22...@gmail.com> wrote:
> >>
> >>
> >> On Mon, Sep 7, 2015 at 10:04 PM, Udo Lembke 
> wrote:
> >> Hi Vickey,
> >> Thanks for your time in replying to my problem.
> >>
> >> I had the same rados bench output after changing the motherboard of the
> monitor node with the lowest IP...
> >> Due to the new mainboard, I assume the hw-clock was wrong during
> startup. Ceph health showed no errors, but all VMs weren't able to do IO (very
> high load on the VMs - but no traffic).
> >> I stopped the mon, but this didn't change anything. I had to restart
> all other mons to get IO again. After that I started the first mon also
> (with the right time now) and all worked fine again...
> >>
> >> Thanks i will try to restart all OSD / MONS and report back , if it
> solves my problem
> >>
> >> Another possibility:
> >> Do you use journal on SSDs? Perhaps the SSDs can't write to garbage
> collection?
> >>
> >> No i don't have journals on SSD , they are on the same OSD disk.
> >>
> >>
> >>
> >> Udo
> >>
> >>
> >> On 07.09.2015 16:36, Vickey Singh wrote:
> >>> Dear Experts
> >>>
> >>> Can someone please help me , why my cluster is not able write data.
> >>>
> >>> See the below output  cur MB/S  is 0  and Avg MB/s is decreasing.
> >>>
> >>>
> >>> Ceph Hammer  0.94.2
> >>> CentOS 6 (3.10.69-1)
> >>>
> >>> The Ceph status says OPS are blocked , i have tried checking , what
> all i know
> >>>
> >>> - System resources ( CPU , net, disk , memory )-- All normal
> >>> - 10G network for public and cluster network  -- no saturation
> >>> - Add disks are physically healthy
> >>> - No messages in /var/log/messages OR dmesg
> >>> - Tried 

Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked

2015-09-09 Thread Lincoln Bryant
Hi Jan,

I’ll take a look at all of those things and report back (hopefully :))

I did try setting all of my OSDs to writethrough instead of writeback on the 
controller, which was significantly more consistent in performance (from 
1100MB/s down to 300MB/s, but still occasionally dropping to 0MB/s). Still 
plenty of blocked ops. 

I was wondering if not-so-nicely failing OSD(s) might be the cause. My 
controller (PERC H730 Mini) seems frustratingly terse with SMART information, 
but at least one disk has a “Non-medium error count” of over 20,000..
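
One way to pull per-disk SMART data through a MegaRAID-based controller such as
the H730 is roughly the following; the device node and target numbers are
illustrative:

  smartctl -a -d megaraid,0 /dev/sda    # first drive behind the controller
  smartctl -a -d megaraid,1 /dev/sda    # second drive, and so on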

I’ll try disabling offloads as well. 

Thanks much for the suggestions!

Cheers,
Lincoln

> On Sep 9, 2015, at 3:59 AM, Jan Schermer  wrote:
> 
> Just to recapitulate - the nodes are doing "nothing" when it drops to zero? 
> Not flushing something to drives (iostat)? Not cleaning pagecache (kswapd and 
> similiar)? Not out of any type of memory (slab, min_free_kbytes)? Not network 
> link errors, no bad checksums (those are hard to spot, though)?
> 
> Unless you find something I suggest you try disabling offloads on the NICs 
> and see if the problem goes away.
> 
> Jan
> 
>> On 08 Sep 2015, at 18:26, Lincoln Bryant  wrote:
>> 
>> For whatever it’s worth, my problem has returned and is very similar to 
>> yours. Still trying to figure out what’s going on over here.
>> 
>> Performance is nice for a few seconds, then goes to 0. This is a similar 
>> setup to yours (12 OSDs per box, Scientific Linux 6, Ceph 0.94.3, etc)
>> 
>> 384  16 29520 29504   307.287  1188 0.0492006  0.208259
>> 385  16 29813 29797   309.532  1172 0.0469708  0.206731
>> 386  16 30105 30089   311.756  1168 0.0375764  0.205189
>> 387  16 30401 30385   314.009  1184  0.036142  0.203791
>> 388  16 30695 30679   316.231  1176 0.0372316  0.202355
>> 389  16 30987 30971    318.42  1168 0.0660476  0.200962
>> 390  16 31282 31266   320.628  1180 0.0358611  0.199548
>> 391  16 31568 31552   322.734  1144 0.0405166  0.198132
>> 392  16 31857 31841   324.859  1156 0.0360826  0.196679
>> 393  16 32090 32074   326.404   932 0.0416869   0.19549
>> 394  16 32205 32189   326.743   460 0.0251877  0.194896
>> 395  16 32302 32286   326.897   388 0.0280574  0.194395
>> 396  16 32348 32332   326.537   184 0.0256821  0.194157
>> 397  16 32385 32369   326.087   148 0.0254342  0.193965
>> 398  16 32424 32408   325.659   156 0.0263006  0.193763
>> 399  16 32445 32429   325.054    84 0.0233839  0.193655
>> 2015-09-08 11:22:31.940164 min lat: 0.0165045 max lat: 67.6184 avg lat: 
>> 0.193655
>> sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>> 400  16 32445 32429   324.241 0 -  0.193655
>> 401  16 32445 32429   323.433 0 -  0.193655
>> 402  16 32445 32429   322.628 0 -  0.193655
>> 403  16 32445 32429   321.828 0 -  0.193655
>> 404  16 32445 32429   321.031 0 -  0.193655
>> 405  16 32445 32429   320.238 0 -  0.193655
>> 406  16 32445 32429    319.45 0 -  0.193655
>> 407  16 32445 32429   318.665 0 -  0.193655
>> 
>> needless to say, very strange.
>> 
>> —Lincoln
>> 
>> 
>>> On Sep 7, 2015, at 3:35 PM, Vickey Singh  
>>> wrote:
>>> 
>>> Adding ceph-users.
>>> 
>>> On Mon, Sep 7, 2015 at 11:31 PM, Vickey Singh  
>>> wrote:
>>> 
>>> 
>>> On Mon, Sep 7, 2015 at 10:04 PM, Udo Lembke  wrote:
>>> Hi Vickey,
>>> Thanks for your time in replying to my problem.
>>> 
>>> I had the same rados bench output after changing the motherboard of the 
>>> monitor node with the lowest IP...
>>> Due to the new mainboard, I assume the hw-clock was wrong during startup. 
>>> Ceph health showed no errors, but all VMs weren't able to do IO (very high
>>> load on the VMs - but no traffic).
>>> I stopped the mon, but this didn't change anything. I had to restart all
>>> other mons to get IO again. After that I started the first mon also (with 
>>> the right time now) and all worked fine again...
>>> 
>>> Thanks i will try to restart all OSD / MONS and report back , if it solves 
>>> my problem 
>>> 
>>> Another possibility:
>>> Do you use journal on SSDs? Perhaps the SSDs can't write to garbage 
>>> collection?
>>> 
>>> No i don't have journals on SSD , they are on the same OSD disk. 
>>> 
>>> 
>>> 
>>> Udo
>>> 
>>> 
>>> On 07.09.2015 16:36, Vickey Singh wrote:
 Dear Experts
 
 Can someone please help me , why my cluster is not able write data.
 
 See the below output  cur MB/S  is 0  and Avg MB/s is decreasing.

Re: [ceph-users] Ceph/Radosgw v0.94 Content-Type versus Content-type

2015-09-09 Thread Robin H. Johnson

On Wed, Sep 09, 2015 at 05:28:26PM +, Chang, Fangzhe (Fangzhe) wrote:
> I noticed that S3 Java SDK for getContentType() no longer works in 
> Ceph/Radosgw v0.94 (Hammer). It seems that S3 SDK expects the metadata 
> “Content-Type” whereas ceph responds with “Content-type”.
> Does anyone know how to make a request for having this issue fixed?
I put a fix in place for it already; it just needs to be backported and merged to
Hammer:

https://github.com/ceph/ceph/pull/58012
http://tracker.ceph.com/issues/12939

The S3 SDK should also NOT be case-sensitive; the HTTP spec declares
that all field names are to be treated case-insensitively.
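
A quick way to see exactly which header the gateway returns (bucket and object
names are illustrative, and a publicly readable object is assumed):

  curl -sI http://rgw.example.com/bucket/object | grep -i '^content-type'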

-- 
Robin Hugh Johnson
Gentoo Linux: Developer, Infrastructure Lead
E-Mail : robb...@gentoo.org
GnuPG FP   : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rebalancing taking very long time

2015-09-09 Thread Sage Weil
On Wed, 9 Sep 2015, Vickey Singh wrote:
> Agreed with Alphe , Ceph Hammer (0.94.2) sucks when it comes to recovery and
> rebalancing.
> 
> Here is my Ceph Hammer cluster , which is like this for more than 30 hours.
> 
> You might be thinking about that one OSD which is down and not in.  Its
> intentional, i want to remove that OSD.
> I want the cluster to become healthy again before i remove that OSD.
> 
> Can someone help us with this problem
> 
>  cluster 86edf8b8-b353-49f1-ab0a-a4827a9ea5e8
>      health HEALTH_WARN
>             14 pgs stuck unclean
>             5 requests are blocked > 32 sec
>             recovery 420/28358085 objects degraded (0.001%)
>             recovery 199941/28358085 objects misplaced (0.705%)
>             too few PGs per OSD (28 < min 30)
>      monmap e3: 3 mons at {stor0201=10.100.1.201:6789/0,stor0202
> =10.100.1.202:6789/0,stor0203=10.100.1.203:6789/0}
>             election epoch 1076, quorum 0,1,2 stor0201,stor0202,
> stor0203
>      osdmap e778879: 96 osds: 95 up, 95 in; 14 remapped pgs
>       pgmap v2475334: 896 pgs, 4 pools, 51364 GB data, 9231 kobjects
>             150 TB used, 193 TB / 344 TB avail
>             420/28358085 objects degraded (0.001%)
>             199941/28358085 objects misplaced (0.705%)
>                  879 active+clean

>                   14 active+remapped

   ^^^

This is your problem.  It's not the recovery, it's that CRUSH is only 
mapping to 2 devices for one of your PGs.  This is usually a 
result of the vary_r tunable being 0.  Assuming all of your clients 
are firefly or newer, you can fix it with

 ceph osd crush tunables firefly

Alternatively, you can probably work around the situation by removing any 
'out' OSD from the crush map entirely, in which case

 ceph osd crush rm osd.<id>

will do the trick.

sage
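
For completeness, once the cluster is healthy again, the usual way to drop the
down/out OSD entirely is the following (substitute the real id):

  ceph osd crush rm osd.<id>
  ceph auth del osd.<id>
  ceph osd rm <id>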


>                    3 active+clean+scrubbing+deep
> 
> 
> 
> On Tue, Sep 8, 2015 at 5:59 PM, Alphe Salas  wrote:
>   I can say exactly the same I am using ceph sin 0.38 and I never
>   get osd so laggy than with 0.94. rebalancing /rebuild algorithm
>   is crap in 0.94 serriously I have 2 osd serving 2 discs of 2TB
>   and 4 GB of RAM osd takes 1.6GB each !!! serriously ! that makes
>   avanche snow.
> 
>   Let me be straight and explain what changed.
> 
>   in 0.38 you ALWAYS could stop the ceph cluster and then start it
>   up it would evaluate if everyone is back if there is enough
>   replicas then start rebuilding /rebalancing what needed of
>   course like 10 minutes was necesary to bring up ceph cluster but
>   then the rebuilding /rebalancing process was smooth.
>   With 0.94 first you have 2 osd too full at 95 % and 4 osd at 63%
>   over 20 osd. then you get a disc crash. so ceph starts
>   automatically to rebuild and rebalance stuff. and there osd
>   start to lag then to crash
>   you stop ceph cluster you change the drive restart the ceph
>   cluster stops all rebuild process setting no-backfill, norecovey
>   noscrub nodeep-scrub you rm the old osd create a new one wait
>   for all osd
>   to be in and up and then starts rebuilding lag/rebalancing since
>   it is automated not much a choice there.
> 
>   And again all osd are stuck in enless lag/down/recovery intent
>   cycle...
> 
>   It is a pain serriously. 5 days after changing the faulty disc
>   it is still locked in the lag/down/recovery cycle.
> 
>   Sur it can be argued that my machines are really ressource
>   limited and that I should buy 3 thousand dollar worth server at
>   least. But intil 0.72 that rebalancing /rebuilding process was
>   working smoothly on the same hardware.
> 
>   It seems to me that the rebalancing/rebuilding algorithm is more
>   strict now than it was in the past. in the past only what really
>   really needed to be rebuild or rebalance was rebalanced or
>   rebuild.
> 
>   I can still delete all and go back to 0.72... like I should buy
>   a cray T-90 to not have anymore problems and have ceph run
>   smoothly. But this will not help making ceph a better product.
> 
>   for me ceph 0.94 is like windows vista...
> 
>   Alphe Salas
>   I.T ingeneer
> 
>   On 09/08/2015 10:20 AM, Gregory Farnum wrote:
> On Wed, Sep 2, 2015 at 9:34 PM, Bob Ababurko
>  wrote:
>   When I lose a disk OR replace a OSD in
>   my POC ceph cluster, it takes a very
>   long time to rebalance.  I should note
>   that my cluster is slightly unique in
>   that I am using cephfs(shouldn't
>   matter?) and it currently contains about
>   310 million objects.
> 
>   The last time I replaced a disk/OSD was
>   2.5 days ago and it is still
>   rebalancing.  This is 

[ceph-users] backfilling on a single OSD and caching controllers

2015-09-09 Thread Lionel Bouton
Hi,

just a tip I just validated on our hardware. I'm currently converting an
OSD from xfs with journal on same platter to btrfs with journal on SSD.
To avoid any unwanted movement, I reused the same OSD number, weight and
placement : so Ceph is simply backfilling all PGs previously stored on
the old version of this OSD.

The problem is that all the other OSDs on the same server (which has a
total of 6) suffer greatly (>10x jump in apply latencies). I
half-expected this: the RAID card has 2GB of battery-backed RAM from
which ~1.6-1.7 GB is used as write cache. Obviously if you write the
entire content of an OSD through this cache (~500GB currently) it will
not be useful: the first GBs will be put in cache but the OSD will
overflow the cache (writing faster than what the HDD can handle) which
will then become useless for the backfilling.
Worse, once the cache is full writes to the other HDDs will compete for
access to the cache with the backfilling OSD instead of getting the full
benefit of a BBWC.

I already took the precaution of excluding the SSDs from the
controller's cache (which already divides the cache pressure by 2
because the writes to journals are not using it). But right now I just
disabled the cache for the HDD behind the OSD on which backfilling is
happening and I saw an immediate performance gain: apply latencies for
the other OSDs on the same server jumped back from >100ms to <10ms.

AFAIK the Ceph OSD code doesn't bypass the kernel cache when
backfilling; if that is really the case, it might be a good idea to do so
(or at least make it configurable): the probability that the data
written during backfilling is reused should be lower than the one for
normal accesses.

On an HP Smart Storage Array:

hpacucli> ctrl slot=<slot> ld <id> modify caching=disable

when the backfilling stops:

hpacucli> ctrl slot=<slot> ld <id> modify caching=enable

This is not usable when there are large scale rebalancing (where nearly
all OSDs are hit by pg movements) but in this particular case this helps
a *lot*.
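
A sketch of how to find which logical drive to toggle for a given OSD before
the backfilling starts; the slot, OSD and ld numbers are illustrative:

  df /var/lib/ceph/osd/ceph-12               # which block device backs the OSD
  hpacucli ctrl slot=0 ld all show detail    # shows the /dev name behind each logical drive
  hpacucli ctrl slot=0 ld 3 modify caching=disable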

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com