Re: [ceph-users] Ceph RBD bench has a strange behaviour when RBD client caching is active

2016-01-25 Thread J-P Methot
Ceph package is 0.94.5, which is hammer. So yes, it could very well be
this bug. Should I assume, then, that it only affects rbd bench and not the
general functionality of the client?



On 2016-01-25 1:59 PM, Jason Dillaman wrote:
> What release are you testing?  You might be hitting this issue [1] where 'rbd 
> bench-write' would issue the same IO request repeatedly.  With writeback 
> cache enabled, this would result in virtually no ops issued to the backend.
> 
> [1] http://tracker.ceph.com/issues/14283
> 


-- 
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050
Toll Free: 1-(888)-GTCOMM1
Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph RBD bench has a strange behaviour when RBD client caching is active

2016-01-25 Thread J-P Methot
Hi,

We've run into a weird issue on our current test setup. We're currently
testing a small low-cost Ceph setup, with sata drives, 1gbps ethernet
and an Intel SSD for journaling per host. We've linked this to an
openstack setup. Ceph is the latest Hammer release.

We notice that when we do rbd benchmarks using the rbd bench tool, the
benchmark never really completes. It starts, runs for 3 seconds and then
stops, despite the rbd drive being 10 GB and the tool using a 4k block
size. If I set rbd caching to false, the benchmark runs normally and
completes after a few minutes.

How can the rbd_cache affect the benchmark tool in this manner, and does
it have a direct impact on the openstack cluster running off this ceph setup?
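For reference, the client-side cache in question is controlled from the [client] section of ceph.conf. The fragment below is a sketch of the relevant knobs; the option names are the standard ones, and the per-invocation override in the comment assumes the usual --key=value config-override syntax of the rbd CLI:

```ini
[client]
rbd cache = true
rbd cache writethrough until flush = true

# For an uncached comparison run, either set 'rbd cache = false' above,
# or override it for a single invocation, e.g.:
#   rbd --rbd-cache=false bench-write <image> --io-size 4096
```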
-- 
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050
Toll Free: 1-(888)-GTCOMM1
Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Backing up ceph rbd content to an external storage

2015-09-22 Thread J-P Methot
Hi,

We've been considering periodically backing up rbds from ceph to a
different storage backend, just in case. I've thought of a few ways this
could be possible, but I am curious if anybody on this list is currently
doing that.

Are you currently backing up data that is contained in ceph? What do you
think is the best way to do it?
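For what it's worth, one approach we've considered is snapshot-plus-export. This is only a sketch: the pool, image, and snapshot names are made up, and it assumes the standard rbd export / export-diff subcommands.

```shell
# Take a snapshot for a consistent point-in-time view, then export it:
rbd snap create mypool/myimage@backup-20150922
rbd export mypool/myimage@backup-20150922 /backup/myimage-20150922.img

# Or export only the blocks changed since the previous snapshot:
rbd export-diff --from-snap backup-20150921 \
    mypool/myimage@backup-20150922 /backup/myimage-20150922.diff
```

The diff files can later be replayed onto a restored base image with rbd import-diff.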
-- 
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050
Toll Free: 1-(888)-GTCOMM1
Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Strange logging behaviour for ceph

2015-09-02 Thread J-P Methot
Hi,

We're using Ceph Hammer 0.94.1 on CentOS 7. On the monitor, when we set
log_to_syslog = true
Ceph starts sending logs to stdout. I thought at first that rsyslog might
be misconfigured, but I did not find a rule that could explain this
behavior.

Can anybody else replicate this? If it's a bug, has it been fixed in a
more recent version? (I couldn't find anything relating to such an issue.)
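For context, the monitor's logging section at that point looked essentially like this (a sketch; only log_to_syslog comes from the report above, the other lines show the related options one would normally expect to interact with it):

```ini
[mon]
log_to_syslog = true
# Related options that usually come into play alongside it:
err_to_syslog = true
log_file = /var/log/ceph/$cluster-$name.log
```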

-- 
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050
Toll Free: 1-(888)-GTCOMM1
Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bad performances in recovery

2015-08-21 Thread J-P Methot
Hi,

First of all, we are sure that the return to the default configuration
fixed it. As soon as we restarted just one of the ceph nodes with the
default configuration, recovery sped up tremendously. We had already
restarted with the old conf before, and recovery was never that fast.

Regarding the configuration, here's the old one with comments:

[global]
fsid = *
mon_initial_members = cephmon1
mon_host = ***
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true    // lets you use the extended attributes (xattrs) of xfs/ext4/btrfs filesystems
osd_pool_default_pgp_num = 450     // default pgp number for new pools
osd_pg_bits = 12                   // number of bits used to designate pgs; lets you have 2^12 pgs
osd_pool_default_size = 3          // default number of copies for new pools
osd_pool_default_pg_num = 450      // default pg number for new pools
public_network = *
cluster_network = ***
osd_pgp_bits = 12                  // number of bits used to designate pgps; lets you have 2^12 pgps

[osd]
filestore_queue_max_ops = 5000          // 500 by default; maximum number of in-progress operations the file store accepts before blocking on queuing new operations
filestore_fd_cache_random = true        //
journal_queue_max_ops = 100             // 500 by default; number of operations allowed in the journal queue
filestore_omap_header_cache_size = 100  // size of the LRU used to cache object omap headers; larger values use more memory but may reduce omap lookups
filestore_fd_cache_size = 100           // not in the ceph documentation; seems to be a common tweak for SSD clusters though
max_open_files = 100                    // lets ceph set the max file descriptors in the OS to prevent running out of file descriptors
osd_journal_size = 1                    // maximum journal size for each OSD

New conf:

[global]
fsid = *
mon_initial_members = cephmon1
mon_host = 
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public_network = **
cluster_network = **

You might notice I have a few undocumented settings in the old
configuration. These are settings I took from a certain openstack summit
presentation, and they may have contributed to this whole problem. Here's
a list of the settings that I think might be a possible cause of these
speed issues:

filestore_fd_cache_random = true
filestore_fd_cache_size = 100

Additionally, my colleague thinks these settings may have contributed:

filestore_queue_max_ops = 5000
journal_queue_max_ops = 100

We will do further tests on these settings once we have our ceph lab
test environment, as we are also curious about exactly what caused this.
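When we re-test, the effective values on a live daemon can be read back through the admin socket, which should make before/after comparisons easier. A sketch, assuming the standard 'ceph daemon' admin-socket interface and an osd.0 on the local host:

```shell
# Dump the running configuration of a local OSD and pick out the
# settings under suspicion:
ceph daemon osd.0 config show | egrep 'filestore_fd_cache|queue_max_ops'
```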


On 2015-08-20 11:43 AM, Alex Gorbachev wrote:

 Just to update the mailing list, we ended up going back to default
 ceph.conf without any additional settings than what is mandatory. We are
 now reaching speeds we never reached before, both in recovery and in
 regular usage. There was definitely something we set in the ceph.conf
 bogging everything down.
 
 Could you please share the old and new ceph.conf, or the section that
 was removed?
 
 Best regards,
 Alex
 


 On 2015-08-20 4:06 AM, Christian Balzer wrote:

 Hello,

 from all the pertinent points by Somnath, the one about pre-conditioning
 would be pretty high on my list, especially if this slowness persists and
 nothing else (scrub) is going on.

 This might be fixed by doing a fstrim.

 Additionally the levelDB's per OSD are of course sync'ing heavily during
 reconstruction, so that might not be the favorite thing for your type of
 SSDs.

 But ultimately situational awareness is very important, as in what is
 actually going and slowing things down.
 As usual my recommendations would be to use atop, iostat or similar on all
 your nodes and see if your OSD SSDs are indeed the bottleneck or if it is
 maybe just one of them or something else entirely.

 Christian

 On Wed, 19 Aug 2015 20:54:11 + Somnath Roy wrote:

 Also, check if scrubbing started in the cluster or not. That may
 considerably slow down the cluster.

 -Original Message-
 From: Somnath Roy
 Sent: Wednesday, August 19, 2015 1:35 PM
 To: 'J-P Methot'; ceph-us...@ceph.com
 Subject: RE: [ceph-users] Bad performances in recovery

 All the writes will go through the journal.
 It may happen your SSDs are not preconditioned well and after a lot of
 writes during recovery IOs are stabilized to lower number. This is quite
 common for SSDs

Re: [ceph-users] Bad performances in recovery

2015-08-20 Thread J-P Methot
Hi,

Just to update the mailing list, we ended up going back to default
ceph.conf without any additional settings than what is mandatory. We are
now reaching speeds we never reached before, both in recovery and in
regular usage. There was definitely something we set in the ceph.conf
bogging everything down.


On 2015-08-20 4:06 AM, Christian Balzer wrote:
 
 Hello,
 
 from all the pertinent points by Somnath, the one about pre-conditioning
 would be pretty high on my list, especially if this slowness persists and
 nothing else (scrub) is going on.
 
 This might be fixed by doing a fstrim.
 
 Additionally the levelDB's per OSD are of course sync'ing heavily during
 reconstruction, so that might not be the favorite thing for your type of
 SSDs.
 
 But ultimately situational awareness is very important, as in what is
 actually going and slowing things down. 
 As usual my recommendations would be to use atop, iostat or similar on all
 your nodes and see if your OSD SSDs are indeed the bottleneck or if it is
 maybe just one of them or something else entirely.
 
 Christian
 
 On Wed, 19 Aug 2015 20:54:11 + Somnath Roy wrote:
 
 Also, check if scrubbing started in the cluster or not. That may
 considerably slow down the cluster.

 -Original Message-
 From: Somnath Roy 
 Sent: Wednesday, August 19, 2015 1:35 PM
 To: 'J-P Methot'; ceph-us...@ceph.com
 Subject: RE: [ceph-users] Bad performances in recovery

 All the writes will go through the journal.
 It may happen your SSDs are not preconditioned well and after a lot of
 writes during recovery IOs are stabilized to lower number. This is quite
 common for SSDs if that is the case.

 Thanks & Regards
 Somnath

 -Original Message-
 From: J-P Methot [mailto:jpmet...@gtcomm.net]
 Sent: Wednesday, August 19, 2015 1:03 PM
 To: Somnath Roy; ceph-us...@ceph.com
 Subject: Re: [ceph-users] Bad performances in recovery

 Hi,

 Thank you for the quick reply. However, we do have those exact settings
 for recovery and it still strongly affects client io. I have looked at
 various ceph logs and osd logs and nothing is out of the ordinary.
 Here's an idea though, please tell me if I am wrong.

 We use intel SSDs for journaling and samsung SSDs as proper OSDs. As was
 explained several times on this mailing list, Samsung SSDs suck in ceph.
 They have horrible O_dsync speed and die easily, when used as journal.
 That's why we're using Intel ssds for journaling, so that we didn't end
 up putting 96 samsung SSDs in the trash.

 In recovery though, what is the ceph behaviour? What kind of write does
 it do on the OSD SSDs? Does it write directly to the SSDs or through the
 journal?

 Additionally, something else we notice: the ceph cluster is MUCH slower
 after recovery than before. Clearly there is a bottleneck somewhere and
 that bottleneck does not get cleared up after the recovery is done.


 On 2015-08-19 3:32 PM, Somnath Roy wrote:
 If you are concerned about *client io performance* during recovery,
 use these settings..

 osd recovery max active = 1
 osd max backfills = 1
 osd recovery threads = 1
 osd recovery op priority = 1

 If you are concerned about *recovery performance*, you may want to
 bump this up, but I doubt it will help much from default settings..

 Thanks & Regards
 Somnath

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf 
 Of J-P Methot
 Sent: Wednesday, August 19, 2015 12:17 PM
 To: ceph-us...@ceph.com
 Subject: [ceph-users] Bad performances in recovery

 Hi,

 Our setup is currently comprised of 5 OSD nodes with 12 OSD each, for
 a total of 60 OSDs. All of these are SSDs with 4 SSD journals on each.
 The ceph version is hammer v0.94.1 . There is a performance overhead
 because we're using SSDs (I've heard it gets better in infernalis, but
 we're not upgrading just yet) but we can reach numbers that I would
 consider alright.

 Now, the issue is, when the cluster goes into recovery it's very fast
 at first, but then slows down to ridiculous levels as it moves
 forward. You can go from 7% to 2% to recover in ten minutes, but it
 may take 2 hours to recover the last 2%. While this happens, the
 attached openstack setup becomes incredibly slow, even though there is
 only a small fraction of objects still recovering (less than 1%). The
 settings that may affect recovery speed are very low, as they are by
 default, yet they still affect client io speed way more than it should.

 Why would ceph recovery become so slow as it progress and affect
 client io even though it's recovering at a snail's pace? And by a
 snail's pace, I mean a few kb/second on 10gbps uplinks.
 --
 ==
 Jean-Philippe Méthot
 Administrateur système / System administrator
 GloboTech Communications
 Phone: 1-514-907-0050
 Toll Free: 1-(888)-GTCOMM1
 Fax: 1-(514)-907-0750
 jpmet...@gtcomm.net
 http://www.gtcomm.net
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com

Re: [ceph-users] Bad performances in recovery

2015-08-19 Thread J-P Methot
Hi,

Thank you for the quick reply. However, we do have those exact settings
for recovery and it still strongly affects client io. I have looked at
various ceph logs and osd logs and nothing is out of the ordinary.
Here's an idea though, please tell me if I am wrong.

We use Intel SSDs for journaling and Samsung SSDs as the OSDs proper. As
has been explained several times on this mailing list, Samsung SSDs suck
in ceph: they have horrible O_DSYNC speeds and die easily when used as
journals. That's why we're using Intel SSDs for journaling, so that we
didn't end up putting 96 Samsung SSDs in the trash.

In recovery though, what is the ceph behaviour? What kind of write does
it do on the OSD SSDs? Does it write directly to the SSDs or through the
journal?

Additionally, something else we notice: the ceph cluster is MUCH slower
after recovery than before. Clearly there is a bottleneck somewhere and
that bottleneck does not get cleared up after the recovery is done.
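As an aside, the recovery throttles discussed in this thread can be changed on a live cluster without restarting daemons. A sketch, assuming the standard injectargs mechanism; the exact option spellings should be checked against the running release:

```shell
# Apply conservative recovery settings to all OSDs at runtime:
ceph tell osd.* injectargs \
    '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'
```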


On 2015-08-19 3:32 PM, Somnath Roy wrote:
 If you are concerned about *client io performance* during recovery, use these 
 settings..
 
 osd recovery max active = 1
 osd max backfills = 1
 osd recovery threads = 1
 osd recovery op priority = 1
 
 If you are concerned about *recovery performance*, you may want to bump this 
 up, but I doubt it will help much from default settings..
 
 Thanks & Regards
 Somnath
 
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of J-P 
 Methot
 Sent: Wednesday, August 19, 2015 12:17 PM
 To: ceph-us...@ceph.com
 Subject: [ceph-users] Bad performances in recovery
 
 Hi,
 
 Our setup is currently comprised of 5 OSD nodes with 12 OSD each, for a total 
 of 60 OSDs. All of these are SSDs with 4 SSD journals on each. The ceph 
 version is hammer v0.94.1 . There is a performance overhead because we're 
 using SSDs (I've heard it gets better in infernalis, but we're not upgrading 
 just yet) but we can reach numbers that I would consider alright.
 
 Now, the issue is, when the cluster goes into recovery it's very fast at 
 first, but then slows down to ridiculous levels as it moves forward. You can 
 go from 7% to 2% to recover in ten minutes, but it may take 2 hours to 
 recover the last 2%. While this happens, the attached openstack setup becomes 
 incredibly slow, even though there is only a small fraction of objects still 
 recovering (less than 1%). The settings that may affect recovery speed are 
 very low, as they are by default, yet they still affect client io speed way 
 more than it should.
 
 Why would ceph recovery become so slow as it progress and affect client io 
 even though it's recovering at a snail's pace? And by a snail's pace, I mean 
 a few kb/second on 10gbps uplinks.
 --
 ==
 Jean-Philippe Méthot
 Administrateur système / System administrator GloboTech Communications
 Phone: 1-514-907-0050
 Toll Free: 1-(888)-GTCOMM1
 Fax: 1-(514)-907-0750
 jpmet...@gtcomm.net
 http://www.gtcomm.net
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 
 PLEASE NOTE: The information contained in this electronic mail message is 
 intended only for the use of the designated recipient(s) named above. If the 
 reader of this message is not the intended recipient, you are hereby notified 
 that you have received this message in error and that any review, 
 dissemination, distribution, or copying of this message is strictly 
 prohibited. If you have received this communication in error, please notify 
 the sender by telephone or e-mail (as shown above) immediately and destroy 
 any and all copies of this message in your possession (whether hard copies or 
 electronically stored copies).
 


-- 
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050
Toll Free: 1-(888)-GTCOMM1
Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Bad performances in recovery

2015-08-19 Thread J-P Methot
Hi,

Our setup currently comprises 5 OSD nodes with 12 OSDs each, for a
total of 60 OSDs. All of these are SSDs, with 4 SSD journals on each
node. The ceph version is hammer v0.94.1. There is a performance
overhead because we're using SSDs (I've heard it gets better in
infernalis, but we're not upgrading just yet), but we can reach numbers
that I would consider alright.

Now, the issue is, when the cluster goes into recovery, it's very fast
at first but then slows down to ridiculous levels as it moves forward.
You can go from 7% to 2% left to recover in ten minutes, but it may take
2 hours to recover the last 2%. While this happens, the attached
openstack setup becomes incredibly slow, even though only a small
fraction of objects (less than 1%) is still recovering. The settings
that may affect recovery speed are very low, as they are by default, yet
they still affect client io speed way more than they should.

Why would ceph recovery become so slow as it progresses, and affect
client io even though it's recovering at a snail's pace? And by a
snail's pace, I mean a few kB/second on 10gbps uplinks.
-- 
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050
Toll Free: 1-(888)-GTCOMM1
Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to backup hundreds or thousands of TB

2015-05-06 Thread J-P Methot
Case in point, here's a little story as to why backup outside ceph is 
necessary:


I was working on modifying journal locations on a running test ceph
cluster when, after bringing back a few OSD nodes, two PGs started being
marked as incomplete. That made all operations on the pool hang: for
some reason, rbd clients couldn't read the missing PGs, and there was no
timeout value for their operations. After spending half a day trying to
fix this, I ended up needing to delete the pool and then recreate it.
Thankfully, that setup was not in production, so it was only a minor
setback.


So, when we go into production with our setup, we are planning to have
a second ceph cluster for backups, just in case such an issue happens
again. I don't want to scare anyone, and I'm pretty sure my issue was
very exceptional, but no matter how well ceph replicates and ensures
data safety, backups are still a good idea, in my humble opinion.



On 5/6/2015 6:35 AM, Mariusz Gronczewski wrote:

Snapshot on same storage cluster should definitely NOT be treated as
backup

Snapshot as a source for backup however can be pretty good solution for
some cases, but not every case.

For example if using ceph to serve static web files, I'd rather have
possibility to restore given file from given path than snapshot of
whole multiple TB cluster.

There are 2 cases for backup restore:

* something failed, need to fix it - usually full restore needed
* someone accidentally removed a thing, and now they need a thing back

Snapshots fix first problem, but not the second one, restoring 7TB of
data to recover few GBs is not reasonable.

As it is now we just backup from inside VMs (file-based backup) and have
puppet to easily recreate machine config but if (or rather when) we
would use object store we would backup it in a way that allows for
partial restore.

On Wed, 6 May 2015 10:50:34 +0100, Nick Fisk n...@fisk.me.uk wrote:

For me personally I would always feel more comfortable with backups on a 
completely different storage technology.

Whilst there are many things you can do with snapshots and replication, there 
is always a small risk that whatever causes data loss on your primary system 
may affect/replicate to your 2nd copy.

I guess it all really depends on what you are trying to protect against, but 
Tape still looks very appealing if you want to maintain a completely isolated 
copy of data.


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Alexandre DERUMIER
Sent: 06 May 2015 10:10
To: Götz Reinicke
Cc: ceph-users
Subject: Re: [ceph-users] How to backup hundreds or thousands of TB

for the moment, you can use snapshot for backup

https://ceph.com/community/blog/tag/backup/

I think that async mirror is on the roadmap
https://wiki.ceph.com/Planning/Blueprints/Hammer/RBD%3A_Mirroring



if you use qemu, you can do qemu full backup. (qemu incremental backup is
coming for qemu 2.4)


- Mail original -
De: Götz Reinicke goetz.reini...@filmakademie.de
À: ceph-users ceph-users@lists.ceph.com
Envoyé: Mercredi 6 Mai 2015 10:25:01
Objet: [ceph-users] How to backup hundreds or thousands of TB

Hi folks,

beside hardware and performance and failover design: How do you manage
to backup hundreds or thousands of TB :) ?

Any suggestions? Best practice?

A second ceph cluster at a different location? bigger archive Disks in good
boxes? Or tabe-libs?

What kind of backupsoftware can handle such volumes nicely?

Thanks and regards . Götz
--
Götz Reinicke
IT-Koordinator

Tel. +49 7141 969 82 420
E-Mail goetz.reini...@filmakademie.de

Filmakademie Baden-Württemberg GmbH
Akademiehof 10
71638 Ludwigsburg
www.filmakademie.de

Eintragung Amtsgericht Stuttgart HRB 205016

Vorsitzender des Aufsichtsrats: Jürgen Walter MdL Staatssekretär im
Ministerium für Wissenschaft, Forschung und Kunst Baden-Württemberg

Geschäftsführer: Prof. Thomas Schadt


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050
Toll Free: 1-(888)-GTCOMM1
Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Possible improvements for a slow write speed (excluding independent SSD journals)

2015-04-21 Thread J-P Methot


Thank you everyone for your replies. We are currently in the process of
selecting new drives for journaling to replace the Samsung drives. We're
running our own tests using dd and the command found here:
http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/


dd if=randfile of=/dev/sda bs=4k count=10 oflag=direct,dsync

I have no trouble believing that this is right, but I was asked to
double-check the command's validity. So, does this command properly
emulate the way ceph journaling writes to an SSD?
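As a sanity check of what dd is measuring, the same access pattern (small sequential writes, each forced to stable storage before the next is issued, roughly how the filestore journal writes) can be reproduced directly. This is a sketch against a temporary file; pointing it at a real device (destructive!) and raising the write count would be the actual test. Note that dd's oflag=direct,dsync also adds O_DIRECT, which this sketch omits.

```python
import os
import tempfile
import time

def dsync_write_rate(path, block_size=4096, count=200):
    """Sequential small writes, each completing only once it is on
    stable storage (O_DSYNC) -- roughly the filestore journal's pattern."""
    buf = os.urandom(block_size)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
    try:
        start = time.monotonic()
        for _ in range(count):
            os.write(fd, buf)
        elapsed = time.monotonic() - start
    finally:
        os.close(fd)
    return (block_size * count) / elapsed  # bytes per second

path = os.path.join(tempfile.mkdtemp(), "journal-test.bin")
rate = dsync_write_rate(path)
print("O_DSYNC write rate: %.1f MB/s" % (rate / 1e6))
```

Drives that look fast in spec sheets can collapse on this pattern, which is the whole point of the journal-suitability test.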


If you want, I can also post the results of our tests on different
drives once we're done.


On 4/21/2015 4:04 AM, Andrei Mikhailovsky wrote:

Hi

I have been testing the Samsung 840 Pro (128gb) for quite sometime and 
I can also confirm that this drive is unsuitable for osd journal. The 
performance and latency that I get from these drives (according to 
ceph osd perf) are between 10 - 15 times slower compared to the Intel 
520. The Intel 530 drives are also pretty awful. They are meant to be 
a replacement of the 520 drives, but the performance is pretty bad.


I have found Intel 520 to be a reasonable drive for performance per 
price, for a cluster without a great deal of writes. However they do 
not make those anymore.


Otherwise, it seems that the Intel 3600 and 3700 series is a good 
performer and has a much longer life expectancy.


Andrei


*From: *Eneko Lacunza elacu...@binovo.es
*To: *J-P Methot jpmet...@gtcomm.net, Christian Balzer
ch...@gol.com, ceph-users@lists.ceph.com
*Sent: *Tuesday, 21 April, 2015 8:18:20 AM
*Subject: *Re: [ceph-users] Possible improvements for a slow write
speed (excluding independent SSD journals)

Hi,

I'm just writing to you to stress out what others have already said,
because it is very important that you take it very seriously.

On 20/04/15 19:17, J-P Methot wrote:
 On 4/20/2015 11:01 AM, Christian Balzer wrote:

 This is similar to another thread running right now, but since our
 current setup is completely different from the one described
in the
 other thread, I thought it may be better to start a new one.

 We are running Ceph Firefly 0.80.8 (soon to be upgraded to
0.80.9). We
 have 6 OSD hosts with 16 OSD each (so a total of 96 OSDs).
Each OSD
 is a
 Samsung SSD 840 EVO on which I can reach write speeds of
roughly 400
 MB/sec, plugged in jbod on a controller that can theoretically
transfer
 at 6gb/sec. All of that is linked to openstack compute nodes
on two
 bonded 10gbps links (so a max transfer rate of 20 gbps).

 I sure as hell hope you're not planning to write all that much
to this
 cluster.
 But then again you're worried about write speed, so I guess you do.
 Those _consumer_ SSDs will be dropping like flies, there are a
number of
 threads about them here.

 They also might be of the kind that don't play well with
O_DSYNC, I
 can't
 recall for sure right now, check the archives.
   Consumer SSDs universally tend to slow down quite a bit when not
 TRIM'ed
 and/or subjected to prolonged writes, like those generated by a
 benchmark.
 I see, yes it looks like these SSDs are not the best for the
job. We
 will not change them for now, but if they start failing, we will
 replace them with better ones.
I tried to put a Samsung 840 Pro 256GB in a ceph setup. It is
supposed
to be quite better than the EVO right? It was total crap. No not the
best for the job. TOTAL CRAP. :)

It can't give any useful write performance for a Ceph OSD. Spec sheet
numbers don't matter for this, they don't work for ceph OSD,
period. And
yes, the drive is fine and works like a charm in workstation
workloads.

I suggest you at least get some intel S3700/S3610 and use them for
the
journal of those samsung drives, I think that could help
performance a lot.

Cheers
Eneko

-- 
Zuzendari Teknikoa / Director Técnico

Binovo IT Human Project, S.L.
Telf. 943575997
   943493611
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun
(Gipuzkoa)
www.binovo.es

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





--
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050
Toll Free: 1-(888)-GTCOMM1
Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Possible improvements for a slow write speed (excluding independent SSD journals)

2015-04-20 Thread J-P Methot
My journals are on-disk, each disk being an SSD. The reason I didn't go
with dedicated drives for journals is that, when designing the setup, I
was told that having dedicated journal SSDs on an all-SSD setup would
not give me a performance increase.


So that makes the journal-disk to data-disk ratio 1:1.

The replication size is 3, yes. The pools are replicated.

On 4/20/2015 10:43 AM, Barclay Jameson wrote:
Are your journals on separate disks? What is your ratio of journal 
disks to data disks? Are you doing replication size 3 ?


On Mon, Apr 20, 2015 at 9:30 AM, J-P Methot jpmet...@gtcomm.net 
mailto:jpmet...@gtcomm.net wrote:


Hi,

This is similar to another thread running right now, but since our
current setup is completely different from the one described in
the other thread, I thought it may be better to start a new one.

We are running Ceph Firefly 0.80.8 (soon to be upgraded to
0.80.9). We have 6 OSD hosts with 16 OSD each (so a total of 96
OSDs). Each OSD is a Samsung SSD 840 EVO on which I can reach
write speeds of roughly 400 MB/sec, plugged in jbod on a
controller that can theoretically transfer at 6gb/sec. All of that
is linked to openstack compute nodes on two bonded 10gbps links
(so a max transfer rate of 20 gbps).

When I run rados bench from the compute nodes, I reach the network
cap in read speed. However, write speeds are vastly inferior,
reaching about 920 MB/sec. If I have 4 compute nodes running the
write benchmark at the same time, I can see the number plummet to
350 MB/sec . For our planned usage, we find it to be rather slow,
considering we will run a high number of virtual machines in there.

Of course, the first thing to do would be to transfer the journal
on faster drives. However, these are SSDs we're talking about. We
don't really have access to faster drives. I must find a way to
get better write speeds. Thus, I am looking for suggestions as to
how to make it faster.

I have also thought of options myself like:
-Upgrading to the latest stable hammer version (would that really
give me a big performance increase?)
-Crush map modifications? (this is a long shot, but I'm still
using the default crush map, maybe there's a change there I could
make to improve performances)

Any suggestions as to anything else I can tweak would be strongly
appreciated.

For reference, here's part of my ceph.conf:

[global]
auth_service_required = cephx
filestore_xattr_use_omap = true
auth_client_required = cephx
auth_cluster_required = cephx
osd pool default size = 3


osd pg bits = 12
osd pgp bits = 12
osd pool default pg num = 800
osd pool default pgp num = 800

[client]
rbd cache = true
rbd cache writethrough until flush = true

[osd]
filestore_fd_cache_size = 100
filestore_omap_header_cache_size = 100
filestore_fd_cache_random = true
filestore_queue_max_ops = 5000
journal_queue_max_ops = 100
max_open_files = 100
osd journal size = 1

-- 
==

Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050 tel:1-514-907-0050
Toll Free: 1-(888)-GTCOMM1
Fax: 1-(514)-907-0750 tel:1-%28514%29-907-0750
jpmet...@gtcomm.net mailto:jpmet...@gtcomm.net
http://www.gtcomm.net

___
ceph-users mailing list
ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





--
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050
Toll Free: 1-(888)-GTCOMM1
Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Possible improvements for a slow write speed (excluding independent SSD journals)

2015-04-20 Thread J-P Methot

Hi,

This is similar to another thread running right now, but since our 
current setup is completely different from the one described in the 
other thread, I thought it may be better to start a new one.


We are running Ceph Firefly 0.80.8 (soon to be upgraded to 0.80.9). We
have 6 OSD hosts with 16 OSDs each (96 OSDs in total). Each OSD is a
Samsung 840 EVO SSD on which I can reach write speeds of roughly 400
MB/s, plugged in JBOD mode into a controller that can theoretically
transfer at 6 Gb/s. All of that is linked to the OpenStack compute
nodes over two bonded 10 Gbps links (so a maximum transfer rate of 20 Gbps).


When I run rados bench from the compute nodes, I hit the network cap on
reads. Write speeds, however, are vastly inferior, topping out at about
920 MB/s, and if I have 4 compute nodes running the write benchmark at
the same time, the number plummets to 350 MB/s. For our planned usage we
find this rather slow, considering we will run a high number of virtual
machines on this cluster.


Of course, the first thing to do would be to move the journals to
faster drives. However, these are SSDs we're talking about; we don't
really have access to faster drives, so I'm looking for other ways to
get better write speeds.
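As a back-of-the-envelope check, the hardware figures above already bound what the cluster can deliver: with journals colocated on the same SSDs, every client write is written twice (journal, then filestore), and 3x replication divides the remainder again. The sketch below is a simplification of that arithmetic, not a model of the actual cluster; it ignores controller limits, filesystem overhead, and per-op latency:

```python
def write_ceiling_mb_s(num_osds, per_osd_mb_s, replication, journal_colocated=True):
    """Crude upper bound on aggregate client write throughput, in MB/s."""
    raw = num_osds * per_osd_mb_s        # 96 * 400 = 38400 MB/s of raw SSD bandwidth
    if journal_colocated:
        raw /= 2                         # each write hits the journal, then the filestore
    return raw / replication             # each object is written to `replication` OSDs

disk_bound = write_ceiling_mb_s(96, 400, 3)   # -> 6400.0 MB/s
network_bound = 20 * 1000 / 8                 # 20 Gbps bonded client link ~= 2500 MB/s
print(min(disk_bound, network_bound))         # -> 2500.0, the tighter ceiling
```

The observed 920 MB/s sits well below both ceilings, which suggests the bottleneck is something this cluster-level arithmetic does not capture, such as per-op latency, CPU, or the poor synchronous-write behaviour consumer SSDs like the 840 EVO are often reported to have under journal workloads.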


I have also considered a few options myself:
-Upgrading to the latest stable Hammer release (would that really give
me a big performance increase?)
-CRUSH map modifications? (this is a long shot, but I'm still using the
default CRUSH map; maybe there's a change I could make there to improve
performance)


Any other suggestions as to what I can tweak would be greatly
appreciated.


For reference, here's part of my ceph.conf:

[global]
auth_service_required = cephx
filestore_xattr_use_omap = true
auth_client_required = cephx
auth_cluster_required = cephx
osd pool default size = 3


osd pg bits = 12
osd pgp bits = 12
osd pool default pg num = 800
osd pool default pgp num = 800

[client]
rbd cache = true
rbd cache writethrough until flush = true

[osd]
filestore_fd_cache_size = 100
filestore_omap_header_cache_size = 100
filestore_fd_cache_random = true
filestore_queue_max_ops = 5000
journal_queue_max_ops = 100
max_open_files = 100
osd journal size = 1

--
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050
Toll Free: 1-(888)-GTCOMM1
Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating objects from one pool to another?

2015-03-26 Thread J-P Methot
That's a great idea. I know I can set up cinder (the OpenStack volume
manager) as a multi-backend manager and migrate from one backend to the
other, each backend pointing to a different pool of the same Ceph cluster.
What bugs me, though, is that I'm pretty sure the image store, glance,
wouldn't let me do that. Additionally, since the compute component also
has its own Ceph pool, I'm pretty sure it won't let me migrate that data
through OpenStack either.




On 3/26/2015 3:54 PM, Steffen W Sørensen wrote:

On 26/03/2015, at 20.38, J-P Methot jpmet...@gtcomm.net wrote:

Lately I've been going back to work on one of my first Ceph setups, and I now see
that I created way too many placement groups for the pools on that setup
(about 10,000 too many). I believe this may be hurting performance, as
performance on this cluster is abysmal. Since it is not possible to
reduce the number of PGs in a pool, I was thinking of creating new pools with a
smaller number of PGs, moving the data from the old pools to the new ones, and
then deleting the old pools.

I haven't seen any command to copy objects from one pool to another. Would that
be possible? I'm using Ceph for block storage with OpenStack, so surely there
must be a way to move block devices from one pool to another, right?

What I did at one point was to go one layer higher in my storage abstraction:
I created new Ceph pools, used those as new storage resources/pools in my
VM environment (Proxmox) on top of Ceph RBD, and then live-migrated the virtual
disks there. I assume you could do the same in OpenStack.

My $0.02

/Steffen



--
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050
Toll Free: 1-(888)-GTCOMM1
Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Migrating objects from one pool to another?

2015-03-26 Thread J-P Methot

Hi,

Lately I've been going back to work on one of my first Ceph setups, and
I now see that I created way too many placement groups for the
pools on that setup (about 10,000 too many). I believe this may be
hurting performance, as performance on this cluster is
abysmal. Since it is not possible to reduce the number of PGs in a pool,
I was thinking of creating new pools with a smaller number of PGs,
moving the data from the old pools to the new ones, and then deleting
the old pools.


I haven't seen any command to copy objects from one pool to another.
Would that be possible? I'm using Ceph for block storage with OpenStack,
so surely there must be a way to move block devices from one pool to
another, right?
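For context on the "way too many PGs" figure, the usual rule of thumb from the docs (roughly 100 PGs per OSD, divided by the replica count, rounded up to a power of two) can be sketched like this. The 96-OSD / 3-replica figures are borrowed from another thread purely as an example:

```python
import math

def target_pg_num(num_osds, replication, pgs_per_osd=100):
    """Rule-of-thumb pg_num: ~100 PGs per OSD, divided by the replica
    count, rounded up to the next power of two."""
    raw = num_osds * pgs_per_osd / replication
    return 1 << math.ceil(math.log2(raw))

print(target_pg_num(96, 3))   # 96 * 100 / 3 = 3200, rounded up -> 4096
```

A pool sized 10,000 PGs above that target would indeed give each OSD several times the recommended PG load, which matches the abysmal performance described.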


--
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050
Toll Free: 1-(888)-GTCOMM1
Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Cluster Address

2015-03-03 Thread J-P Methot
I had to go through the same experience of changing the public network
address, and it's not easy. Ceph seems to keep a record of which IP
address is associated with which OSD, along with a port number for the
process. I was never able to find out where this record is kept or how
to change it manually. Here's what I did, from memory:


1. Remove the network address you no longer want to use from
ceph.conf and put in the one you want instead. Don't worry:
modifying ceph.conf will not affect a currently running cluster
unless you issue a command to it, like adding an OSD.
2. Remove each OSD one by one and reinitialize each right after.
You will lose the data on that OSD, but if your cluster is
replicated properly and you do this one OSD at a time, you should
not lose the other copies of that data.
3. Check the OSD status to make sure the OSDs use the proper IP. The command
ceph osd dump will tell you whether your OSDs are detected on the proper IP.

4. Remove and reinstall each monitor one by one.

If anybody else has another solution I'd be curious to hear it, but this
is how I managed to do it: by basically reinstalling each component one
by one.


On 3/3/2015 12:26 PM, Garg, Pankaj wrote:


Hi,

I have a Ceph cluster that is contained within a rack (1 monitor and 5
OSD nodes). I kept the same public and private address in the configuration.


I do have 2 NICs and 2 valid IP addresses (one internal-only and one
external) for each machine.


Is it possible now, to change the Public Network address, after the 
cluster is up and running?


I had used ceph-deploy for the cluster. If I change the address of the
public network in ceph.conf, do I need to propagate it to all the
machines in the cluster, or is the monitor node enough?


Thanks

Pankaj



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050
Toll Free: 1-(888)-GTCOMM1
Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] client unable to access files after caching pool addition

2015-02-03 Thread J-P Methot

Hi,

I tried to add a caching pool in front of the OpenStack vms and volumes
pools. I believed the process was transparent, but as soon as I enabled
caching for both of these pools, the VMs could not find their volumes
anymore. Obviously, when I undid my changes, everything went back to
normal.


Could it be an authorization issue? Would the OpenStack VMs need to
connect to the caching pool instead of the storage pools to be able to
access their volumes? Or is the configuration supposed to stay the same,
with the process completely transparent?


--
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050
Toll Free: 1-(888)-GTCOMM1
Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] filestore_fiemap and other ceph tweaks

2015-02-02 Thread J-P Methot

Hi,

I've been looking into increasing the performance of my Ceph cluster for
OpenStack, which will be moved into production soon. It's a full 1 TB SSD
cluster with 16 OSDs per node over 6 nodes.


As I searched for possible tweaks to implement, I stumbled upon 
unitedstack's presentation at the openstack paris summit (video : 
https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/build-a-high-performance-and-high-durability-block-storage-service-based-on-ceph).


Now, before implementing any of the suggested tweaks, I've been reading
up on each one. It's not that I distrust what's being said there, but I
thought it better to inform myself before implementing tweaks that may
strongly impact the performance and stability of my cluster.


One of the suggested tweaks is to set filestore_fiemap to true. The 
issue is, after some research, I found that there is a rados block 
device corruption bug linked to setting that option to true (link: 
http://www.spinics.net/lists/ceph-devel/msg06851.html ). I have not 
found any trace of that bug being fixed since, despite the mailing list 
message being fairly old.


Is it safe to set filestore_fiemap to true?

Additionally, if anybody feels like watching the video or reading the 
presentation (slides are at 
http://www.spinics.net/lists/ceph-users/attachments/pdfUlINnd6l8e.pdf ), 
what do you think of the part about the other tweaks and the data 
durability part?


--
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050
Toll Free: 1-(888)-GTCOMM1
Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] filestore_fiemap and other ceph tweaks

2015-02-02 Thread J-P Methot
Thank you very much, and thank you for the presentation you made in
Paris; it was very instructive.

So, from what I understand, the fiemap patch is proven to work on kernel
2.6.32. The good news is that we use the same kernel in our setup. How
long has your production cluster been running with fiemap set to true?


On 2/2/2015 10:47 AM, Haomai Wang wrote:

There exists a more recently discuss in
PR(https://github.com/ceph/ceph/pull/1665).


On Mon, Feb 2, 2015 at 11:05 PM, J-P Methot jpmet...@gtcomm.net wrote:

Hi,

I've been looking into increasing the performance of my ceph cluster for
openstack that will be moved in production soon. It's a full 1TB SSD cluster
with 16 OSD per node over 6 nodes.

As I searched for possible tweaks to implement, I stumbled upon
unitedstack's presentation at the openstack paris summit (video :
https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/build-a-high-performance-and-high-durability-block-storage-service-based-on-ceph).

Now, before implementing any of the suggested tweaks, I've been reading up
on each one. It's not that I don't trust everything that's being said there,
but I thought it may be better to inform myself before starting
to implement tweaks that may strongly impact the performance and stability
of my cluster.

One of the suggested tweaks is to set filestore_fiemap to true. The issue
is, after some research, I found that there is a rados block device
corruption bug linked to setting that option to true (link:
http://www.spinics.net/lists/ceph-devel/msg06851.html ). I have not found
any trace of that bug being fixed since, despite the mailing list message
being fairly old.

Is it safe to set filestore_fiemap to true?

Additionally, if anybody feels like watching the video or reading the
presentation (slides are at
http://www.spinics.net/lists/ceph-users/attachments/pdfUlINnd6l8e.pdf ),
what do you think of the part about the other tweaks and the data durability
part?

--
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050
Toll Free: 1-(888)-GTCOMM1
Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






--
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050
Toll Free: 1-(888)-GTCOMM1
Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSDs not getting mounted back after reboot

2015-01-28 Thread J-P Methot

Hi,

I'm having an issue quite similar to this old bug:
http://tracker.ceph.com/issues/5194, except that I'm using CentOS 6.
Basically, I set up the cluster using ceph-deploy to save some time (this
is a 90+ OSD cluster). I rebooted a node earlier today and now all the
drives are unmounted, and any attempt at mounting them manually returns:
mount: special device /dev/sda1 does not exist


However, those partitions are listed if I do sfdisk -l /dev/sda. I have
also tried running partprobe on the devices, as was done for the previous
bug, to no avail. /usr/sbin/ceph-disk-activate --mount /dev/sda1 tells
me that the device does not exist. Is this a bug, or am I doing something
wrong?


--
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050
Toll Free: 1-(888)-GTCOMM1
Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph configuration on multiple public networks.

2015-01-09 Thread J-P Methot

Hi,

We've set up Ceph and OpenStack on a fairly peculiar network
configuration (or at least I think it is), and I'm looking for
information on how to make it work properly.


Basically, we have 3 networks: a management network, a storage network
and a cluster network. The management network is on a 1 Gbps link,
while the storage network is on two bonded 10 Gbps links. The cluster
network can be ignored for now, as it works well.


Now, the main problem is that the Ceph OSD nodes are plugged into the
management, storage and cluster networks, but the monitors are only
on the management network. When I run tests, I see that all the
traffic ends up going through the management network, slowing down
Ceph's performance. Because of the current network setup, I can't hook
the monitor nodes up to the storage network, as we're missing ports
on the switch.


Would it be possible to maintain access to the management nodes while 
forcing the ceph cluster to use the storage network for data transfer? 
As a reference, here's my ceph.conf.


[global]
osd_pool_default_pgp_num = 800
osd_pg_bits = 12
auth_service_required = cephx
osd_pool_default_size = 3
filestore_xattr_use_omap = true
auth_client_required = cephx
osd_pool_default_pg_num = 800
auth_cluster_required = cephx
mon_host = 10.251.0.51
public_network = 10.251.0.0/24, 10.21.0.0/24
mon_initial_members = cephmon1
cluster_network = 192.168.31.0/24
fsid = 60e1b557-e081-4dab-aa76-e68ba38a159e
osd_pgp_bits = 12

As you can see, I've set up two public networks: 10.251.0.0/24 is the
management network and 10.21.0.0/24 the storage network. Would it be
possible to keep the cluster functional while removing 10.251.0.0/24 from
the public_network list? For example, if I removed it from the
public network list and referenced each monitor node's IP in the config
file, would I still maintain connectivity?
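Which network a daemon ends up using comes down to which of the configured subnets its address falls in, so one quick sanity check on the two public_network entries is to test candidate daemon addresses against them. A small sketch with Python's ipaddress module (the host addresses below, other than the mon_host IP from the config, are made up for illustration):

```python
import ipaddress

# The two public_network entries from the ceph.conf above.
public_networks = [ipaddress.ip_network(n)
                   for n in ("10.251.0.0/24", "10.21.0.0/24")]

def matching_network(addr):
    """Return the first configured public network containing addr, or None."""
    ip = ipaddress.ip_address(addr)
    return next((net for net in public_networks if ip in net), None)

print(matching_network("10.251.0.51"))   # mon_host address -> 10.251.0.0/24
print(matching_network("10.21.0.7"))     # a storage-side address -> 10.21.0.0/24
print(matching_network("192.168.31.5"))  # cluster-network address -> None
```

Since the monitor's only address (10.251.0.51) falls in the management subnet, dropping 10.251.0.0/24 from public_network would leave the monitors with no valid public address, which is the crux of the question above.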


--
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050
Toll Free: 1-(888)-GTCOMM1
Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Increasing osd pg bits and osd pgp bits after cluster has been setup

2014-12-03 Thread J-P Methot

Hi,

I'm trying to reach the 4096 PGs suggested in the docs, but I can't
get more than 32 PGs per OSD. I suspect this is caused by the
default of 6 pg bits (2^5 = 32, the first bit being for 2^0). Is there a
command to increase it once the OSDs have been added to the cluster and
the default pool has been created? I have increased the number of bits
to 9 in ceph.conf, but the OSDs that already exist do not take it into
account.
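If, as the post assumes, the pg-bits setting caps the per-OSD PG count at a power of two, the arithmetic for a 4096-PG target works out as below. The 96-OSD and 3-replica figures are assumptions borrowed from other threads for illustration, and whether the option really behaves this way is not confirmed here:

```python
import math

def pg_copies_per_osd(total_pgs, replication, num_osds):
    """Average number of PG copies each OSD carries."""
    return total_pgs * replication / num_osds

def bits_for(count):
    """Smallest b such that 2**b >= count."""
    return math.ceil(math.log2(count))

per_osd = pg_copies_per_osd(4096, 3, 96)   # -> 128.0 PG copies per OSD
print(bits_for(per_osd))                   # -> 7 bits would cover 128
```

Under those assumptions, the jump from the default 6 bits to 9 in ceph.conf leaves plenty of headroom; the remaining question, as stated above, is how to make already-created OSDs and pools pick the new value up.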


--
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050
Toll Free: 1-(888)-GTCOMM1
Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com