[ceph-users] incomplete pg, recovery some data

2015-06-18 Thread Mateusz Skała
Hi,

After some hardware errors, one of the pgs on our backup server is 'incomplete'.

I exported the pg without problems, as described here:
https://ceph.com/community/incomplete-pgs-oh-my/

After removing the pg from all OSDs and importing it into one OSD, the pg is
still 'incomplete'.

I only want to recover some piece of data from this rbd, so if I lose
something, nothing bad happens. How can I tell ceph to accept this pg as
complete and clean?
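
A sketch of that export/import workflow (OSD ids, paths and the export file name are placeholders, and the mark-complete op is an assumption - it only appears in newer ceph-objectstore-tool releases, so it may not exist in the version in use here):

# stop the OSD before touching its store
service ceph stop osd.9
# export the surviving copy of the pg from whichever OSD still has it
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-9 \
    --journal-path /var/lib/ceph/osd/ceph-9/journal \
    --pgid 0.109 --op export --file /tmp/pg.0.109.export
# import it into the target OSD and, if the tool supports it, mark the pg complete
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-9 \
    --journal-path /var/lib/ceph/osd/ceph-9/journal \
    --pgid 0.109 --op import --file /tmp/pg.0.109.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-9 \
    --journal-path /var/lib/ceph/osd/ceph-9/journal \
    --pgid 0.109 --op mark-complete
service ceph start osd.9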

 

 ceph health detail

HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean

pg 0.109 is stuck inactive since forever, current state incomplete, last
acting [9,13]

pg 0.109 is stuck unclean since forever, current state incomplete, last
acting [9,13]

pg 0.109 is incomplete, acting [9,13]

 

 ceph pg 0.109 query

In attachment.

 

Regards,

Mateusz

{ state: incomplete,
  snap_trimq: [],
  epoch: 6310,
  up: [
9,
13],
  acting: [
9,
13],
  info: { pgid: 0.109,
  last_update: 0'0,
  last_complete: 0'0,
  log_tail: 0'0,
  last_user_version: 0,
  last_backfill: MAX,
  purged_snaps: [],
  history: { epoch_created: 1,
  last_epoch_started: 4101,
  last_epoch_clean: 4089,
  last_epoch_split: 0,
  same_up_since: 6306,
  same_interval_since: 6306,
  same_primary_since: 6304,
  last_scrub: 3249'1096189,
  last_scrub_stamp: 2015-06-05 14:50:50.378387,
  last_deep_scrub: 3084'1088300,
  last_deep_scrub_stamp: 2015-05-31 13:56:29.394517,
  last_clean_scrub_stamp: 2015-06-05 14:50:50.378387},
  stats: { version: 0'0,
  reported_seq: 7,
  reported_epoch: 6310,
  state: incomplete,
  last_fresh: 2015-06-18 12:54:14.562011,
  last_change: 2015-06-18 12:53:05.172499,
  last_active: 0.00,
  last_clean: 0.00,
  last_became_active: 0.00,
  last_unstale: 2015-06-18 12:54:14.562011,
  last_undegraded: 2015-06-18 12:54:14.562011,
  last_fullsized: 2015-06-18 12:54:14.562011,
  mapping_epoch: 6306,
  log_start: 0'0,
  ondisk_log_start: 0'0,
  created: 1,
  last_epoch_clean: 4089,
  parent: 0.0,
  parent_split_bits: 0,
  last_scrub: 3249'1096189,
  last_scrub_stamp: 2015-06-05 14:50:50.378387,
  last_deep_scrub: 3084'1088300,
  last_deep_scrub_stamp: 2015-05-31 13:56:29.394517,
  last_clean_scrub_stamp: 2015-06-05 14:50:50.378387,
  log_size: 0,
  ondisk_log_size: 0,
  stats_invalid: 0,
  stat_sum: { num_bytes: 0,
  num_objects: 0,
  num_object_clones: 0,
  num_object_copies: 0,
  num_objects_missing_on_primary: 0,
  num_objects_degraded: 0,
  num_objects_misplaced: 0,
  num_objects_unfound: 0,
  num_objects_dirty: 0,
  num_whiteouts: 0,
  num_read: 0,
  num_read_kb: 0,
  num_write: 0,
  num_write_kb: 0,
  num_scrub_errors: 0,
  num_shallow_scrub_errors: 0,
  num_deep_scrub_errors: 0,
  num_objects_recovered: 0,
  num_bytes_recovered: 0,
  num_keys_recovered: 0,
  num_objects_omap: 0,
  num_objects_hit_set_archive: 0,
  num_bytes_hit_set_archive: 0},
  stat_cat_sum: {},
  up: [
9,
13],
  acting: [
9,
13],
  blocked_by: [],
  up_primary: 9,
  acting_primary: 9},
  empty: 1,
  dne: 0,
  incomplete: 0,
  last_epoch_started: 0,
  hit_set_history: { current_last_update: 0'0,
  current_last_stamp: 0.00,
  current_info: { begin: 0.00,
  end: 0.00,
  version: 0'0},
  history: []}},
  peer_info: [
{ peer: 2,
  pgid: 0.109,
  last_update: 0'0,
  last_complete: 0'0,
  log_tail: 0'0,
  last_user_version: 0,
  last_backfill: MAX,
  purged_snaps: [],
  history: { epoch_created: 0,
  last_epoch_started: 0,
  last_epoch_clean: 0,
  last_epoch_split: 0,
  same_up_since: 0,
  same_interval_since: 0,
  same_primary_since: 0,
  last_scrub: 0'0,
  last_scrub_stamp: 0.00,
  last_deep_scrub: 0'0,
  last_deep_scrub_stamp: 0.00,
  last_clean_scrub_stamp: 0.00},
  stats: { version: 0'0,
  reported_seq: 0,
  reported_epoch: 0,
  state: inactive,
  last_fresh: 0.00,
  last_change: 0.00,
  last_active: 0.00,
  last_clean: 0.00,
  last_became_active: 0.00,

Re: [ceph-users] 403-Forbidden error using radosgw

2015-06-18 Thread B, Naga Venkata

I am also having the same issue; can somebody help me out? But for me it is
HTTP/1.1 404 Not Found.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd performance issue - can't find bottleneck

2015-06-18 Thread Jacek Jarosiewicz

Hi,

On 06/18/2015 12:54 PM, Alexandre DERUMIER wrote:

Hi,

for read benchmark

with fio, what is the iodepth ?

my fio 4k randr results with

iodepth=1 : bw=6795.1KB/s, iops=1698
iodepth=2 : bw=14608KB/s, iops=3652
iodepth=4 : bw=32686KB/s, iops=8171
iodepth=8 : bw=76175KB/s, iops=19043
iodepth=16 :bw=173651KB/s, iops=43412
iodepth=32 :bw=336719KB/s, iops=84179
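
For reference, a 4k random-read fio job along these lines produces that kind of iodepth sweep (the target device is only a placeholder; the exact options used above are not given in the thread):

fio --name=randread-test --filename=/dev/rbd0 --ioengine=libaio --direct=1 \
    --rw=randread --bs=4k --iodepth=16 --numjobs=1 --runtime=60 \
    --time_based --group_reporting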



I'm trying multiple versions - from one job with iodepth=1 up to 16 jobs 
with iodepth=32, similar to what you do.


I'm less worried about the bandwidth now, since I found out about the 
Intel SSD 530 problem (the dsync stuff).


I'm worried about iops - when I test it locally I get the expected ~40k 
iops on an SSD drive, but when I do it from a client I get only 2-4k iops..



(This should be similar with rados bench -t (threads) option).

This is normal because of network latencies + ceph latencies.
Doing more parallelism increases iops.
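
For example (pool name and thread count here are only placeholders):

rados -p rbd bench 30 write -t 32 --no-cleanup
rados -p rbd bench 30 rand -t 32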



yes, I'm expecting that, but for now I can't get close to what I should 
see using an SSD as an OSD in ceph..



(doing a bench with dd = iodepth=1)



I'm only using dd to test seq read/write speed.


Theses result are with 1 client/rbd volume.


now with more fio client (numjobs=X)

I can reach up to 300kiops with 8-10 clients.



I would love to see these results in my setup :)

J
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd performance issue - can't find bottleneck

2015-06-18 Thread Jacek Jarosiewicz

On 06/18/2015 12:23 PM, Mark Nelson wrote:

so.. in order to increase performance, do I need to change the ssd
drives?


I'm just guessing, but because your read performance is slow as well,
you may have multiple issues going on.  The Intel 530 being slow at O_DSYNC
writes is one of them, but it's possible there is something else too. If
I were in your position I think I'd try to beg/borrow/steal a single DC
S3700 or even 520 (despite its presumed lack of safety) and just see
how a single OSD cluster using it does on your setup before replacing
everything.



Oh, sorry - this was my bad, I was doing different tests with different 
setups to find out what might be the problem. I thought that maybe the 
mellanox network hardware/setup is the problem (wouldn't know why, but I 
wanted to check) so I switched the servers to use 1Gbps network cards 
and thus the slow read results. After I switched back to 56Gbps network, 
sequential read/write tests are satisfactory:


root@cf03:/ceph/tmp# dd if=/dev/zero of=test bs=100M count=100 oflag=direct
100+0 records in
100+0 records out
10485760000 bytes (10 GB) copied, 27.0479 s, 388 MB/s

root@cf03:/ceph/tmp# dd if=test of=/dev/null bs=100M iflag=direct
100+0 records in
100+0 records out
10485760000 bytes (10 GB) copied, 7.30296 s, 1.4 GB/s

and now rados bench shows:

root@cf03:~# rados -p rbd bench 30 rand
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
 0   0 0 0 0 0 - 0
 1  16   208   192   767.782   768  0.084049 0.0796911
 2  16   390   374   747.833   728  0.055108 0.0834168
 3  16   579   563   750.523   756  0.080945 0.0841484
 4  16   756   740   739.865   708  0.119879 0.0853113
 5  16   942   926   740.668   744  0.131534  0.085389
 6  16  1128  1112   741.207   744  0.085159 0.0857775
 7  16  1314  1298   741.587   744  0.137615 0.0857103
 8  16  1496  1480   739.877   728  0.047122 0.0858808
 9  16  1678  1662   738.548   728  0.118557 0.0860778
10  16  1866  1850   739.882   752   0.07375 0.0861203
11  16  2054  2038   740.974   752  0.053814 0.0860436
12  16  2247  2231   743.55   772  0.101077 0.0857194
13  16  2430  2414   742.652   732  0.038217 0.0856958
14  16  2592  2576   735.886   648  0.014755 0.0864883
15  16  2764  2748   732.688   688  0.125262 0.0870332
16  16  2934  2918   729.39   680  0.144276 0.0873883
17  16  3109  3093   727.655   700   0.05022 0.0876425
18  16  3274  3258   723.892   660  0.027348 0.0880826
19  16  3428  3412   718.209   616  0.145429 0.0888024
20  16  3590  3574   714.695   648  0.145609 0.0892346
21  16  3753  3737   711.704   652  0.146557   0.08958
22  16  3914  3898   708.623   644  0.164886 0.0900086
23  16  4077  4061   706.158   652  0.021976 0.0903442
24  16  4243  4227   704.398   664  0.013213 0.0905628
25  16  4409  4393   702.779   664  0.039111 0.0908182
26  16  4576  4560   701.438   668  0.179205 0.0909782
27  16  4744  4728   700.344   672  0.176603 0.0911509
28  16  4924  4908   701.043   720  0.062736 0.0911056
29  16  5107  5091   702.107   732  0.103679 0.0910063
30  16  5294  5278   703.633   748  0.078924 0.0908063
 Total time run:       30.105242
Total reads made:      5294
Read size:             4194304
Bandwidth (MB/sec):    703.399

Average Latency:       0.0909628
Max latency:           0.198346
Min latency:           0.00676


..but unfortunately fio still shows low iops - 2-4k...

J

--
Jacek Jarosiewicz
Administrator Systemów Informatycznych


SUPERMEDIA Sp. z o.o. z siedzibą w Warszawie
ul. Senatorska 13/15, 00-075 Warszawa
Sąd Rejonowy dla m.st.Warszawy, XII Wydział Gospodarczy Krajowego 
Rejestru Sądowego,

nr KRS 029537; kapitał zakładowy 42.756.000 zł
NIP: 957-05-49-503
Adres korespondencyjny: ul. Jubilerska 10, 04-190 Warszawa


SUPERMEDIA -   http://www.supermedia.pl
dostep do internetu - hosting - kolokacja - lacza - telefonia
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw did not create auth url for swift

2015-06-18 Thread venkat

Can you please let me know if you solved this issue?



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] best Linux distro for Ceph

2015-06-18 Thread Chris Jones
Hi Shane,

We (Bloomberg) have many large clusters and we currently use Ubuntu. We
have just recently upgraded to Trusty (14.04). Our new super object store
that we're building out is using Trusty but we may switch to RHEL because
of other departments joining in - final decision has not been made.
However, our OpenStack clusters will stay Ubuntu.

Thanks,
Chris

On Wed, Jun 17, 2015 at 2:06 PM, Shane Gibson shane_gib...@symantec.com
wrote:


 Ok - I know this post has the potential to spread to unsavory corners of
 discussion about the best linux distro ... blah blah blah ... please,
 don't let it go there ... !

 I'm seeking some input from people that have been running larger Ceph
 clusters ... on the order of 100s of physical servers with thousands of
 OSDs in them.  Our primary use case is Object via Swift API integration and
 adding Block store capability for both OpenStack/KVM backing VMs, as well
 as general use for various block store scenarios.

 We'd *like* to look at CephFS, and I'm heartened to see a kernel module
 (over the FUSE-based one), and a growing user base around it, and hoping
 production ready will soon be stamped on CephFS ...

 We currently deploy Ubuntu (primarily Trusty - 14.04), and CentOS 7.1.
 We've been testing our Ceph clusters on both, but our preference
 as an organization is CentOS 7.1.1503 (currently).

 However - I see a lot of noise in the list about needing to track the more
 modern kernel versions as opposed to the already dated 3.10.x that CentOS
 7.1 deploys.  Yes, I know RH and community backport a lot of the newer
 kernel features to their kernel version ... but ... not everything gets
 backported.

 Can someone out there with real world, larger scale Ceph cluster
 operational experience provide a guideline on the Linux Distro they
 deploy/use, that works well with Ceph, and is more in line with keeping up
 with modern kernel versions ... without crossing the line into the
 bleeding and painful edge versions ... ?

 Thank you ...

 ~~shane



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




-- 
Best Regards,
Chris Jones

http://www.cloudm2.com

cjo...@cloudm2.com
(p) 770.655.0770

This message is intended exclusively for the individual or entity to which
it is addressed.  This communication may contain information that is
proprietary, privileged or confidential or otherwise legally exempt from
disclosure.  If you are not the named addressee, you are not authorized to
read, print, retain, copy or disseminate this message or any part of it.
If you have received this message in error, please notify the sender
immediately by e-mail and delete all copies of the message.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD Journal creation ?

2015-06-18 Thread Barclay Jameson
The journal should be a raw partition and should not have any filesystem on it.
Inside your /var/lib/ceph/osd/ceph-# you should make a symlink to the
journal partition that you are going to use for that osd.
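
Something along these lines, for example (the OSD id and partition are placeholders; ceph-osd --mkjournal then initializes the journal on the raw partition):

ln -sf /dev/disk/by-partuuid/<journal-partition-uuid> /var/lib/ceph/osd/ceph-0/journal
ceph-osd -i 0 --mkjournal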


On Thu, Jun 18, 2015 at 2:36 AM, Shane Gibson shane_gib...@symantec.com wrote:
 All - I am building my first ceph cluster, and doing it the hard way,
 manually without the aid of ceph-deploy.  I have successfully built the
 mon cluster and am now adding OSDs.

 My main question:
 How do I prepare the Journal prior to the prepare/activate stages of the
 OSD creation?


 More details:
 Basically - all of the documentation seems to assume the journal is
 prepared.   Do I simply create a single raw partition on a physical
 device and the ceph-disk prepare... and ceph-disk activate... steps
 will take care of everything for the journal ... presumably based on the
 ceph-disk prepare ... --type filesystem setting?  Or do I need to
 actually format it as a filesystem prior to giving it over to the Ceph OSD
 ???

 The architecture I'm thinking of is as follows - based on the hardware I
 have for OSDs (currently 9 servers each with):

   RAID 0 mirror for OS hard drives (2 disks)
   data disk for journal placement for 5 physical disks (4TB)
   data disk for journal placement for 5 physical disks (4TB)
   10 data disks as OSDs (one OSD per disk) (4TB each)

 Essentially - there are 12 data disks in the node (all 4 TB 7200 rpm
 spinning disks).  Splitting the Journal across two of them gives me a
 failure domain of 5 data disks + 1 journal disk in a single physical
 server for crush map purposes ...  It also vaguely helps spread the I/O
 workload for the journaling activity across 2 physical disks in a chassis
 instead of just one (since the journal disk is pretty darn slow).

 In this configuration I'd create 5 separate partitions on Journal Disk A
 and 5 on Journal Disk B ... but do they need to be formatted and mounted?

 Yes, we know as we go to more real production workloads, we'll want/need
 to change this for performance reasons - eg the Journal on SSDs ...

 Any pointers on where I missed this info in the documentation would be
 helpful too ... I've been all over the ceph.com/docs/ site and haven't
 found it yet...

 Thanks,
 ~~shane

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Hammer 0.94.2: Error when running commands on CEPH admin node

2015-06-18 Thread Teclus Dsouza -X (teclus - TECH MAHINDRA LIM at Cisco)
Hello Everyone,

I have set up a new cluster with the Ceph Hammer version (0.94.2). The install went 
through fine without any issues, but from the admin node I am not able to 
execute any of the Ceph commands.

Error:
root@ceph-main:/cephcluster# ceph auth export
2015-06-18 12:43:28.922367 7f54d286b700 -1 monclient(hunting): ERROR: missing 
keyring, cannot use cephx for authentication
2015-06-18 12:43:28.922375 7f54d286b700  0 librados: client.admin 
initialization error (2) No such file or directory
Error connecting to cluster: ObjectNotFound

I googled for this and only found one article relevant, but it did not solve my 
problem.
http://t75390.file-systems-ceph-user.file-systemstalk.us/newbie-error-connecting-to-cluster-permissionerror-t75390.html

Is there any other workaround or fix for this ??

Regards
Teclus Dsouza
Technical Architect
Tech Mahindra


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hammer 0.94.2: Error when running commands on CEPH admin node

2015-06-18 Thread B, Naga Venkata
Do you have admin keyring in /etc/ceph directory?

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Teclus 
Dsouza -X (teclus - TECH MAHINDRA LIM at Cisco)
Sent: Thursday, June 18, 2015 10:35 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Hammer 0.94.2: Error when running commands on CEPH admin 
node
Importance: High

Hello Everyone,

I have setup a new cluster with Ceph-hammer version (0.94.2   The install went 
through fine without any issues but from the admin node I am not able to 
execute any of the Ceph commands

Error:
root@ceph-main:/cephcluster# ceph auth export
2015-06-18 12:43:28.922367 7f54d286b700 -1 monclient(hunting): ERROR: missing 
keyring, cannot use cephx for authentication
2015-06-18 12:43:28.922375 7f54d286b700  0 librados: client.admin 
initialization error (2) No such file or directory
Error connecting to cluster: ObjectNotFound

I googled for this and only found one article relevant, but it did not solve my 
problem.
http://t75390.file-systems-ceph-user.file-systemstalk.us/newbie-error-connecting-to-cluster-permissionerror-t75390.html

Is there any other workaround or fix for this ??

Regards
Teclus Dsouza
Technical Architect
Tech Mahindra


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hammer 0.94.2: Error when running commands on CEPH admin node

2015-06-18 Thread Teclus Dsouza -X (teclus - TECH MAHINDRA LIM at Cisco)
Hello Naga,

The keyring file is present under a folder I created for ceph.   Are you saying 
the same needs to be copied to the /etc/ceph folder?

Regards
Teclus

From: B, Naga Venkata [mailto:nag...@hp.com]
Sent: Thursday, June 18, 2015 10:37 PM
To: Teclus Dsouza -X (teclus - TECH MAHINDRA LIM at Cisco); 
ceph-users@lists.ceph.com
Subject: RE: [ceph-users] Hammer 0.94.2: Error when running commands on CEPH 
admin node

Do you have admin keyring in /etc/ceph directory?

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Teclus 
Dsouza -X (teclus - TECH MAHINDRA LIM at Cisco)
Sent: Thursday, June 18, 2015 10:35 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Hammer 0.94.2: Error when running commands on CEPH admin 
node
Importance: High

Hello Everyone,

I have setup a new cluster with Ceph-hammer version (0.94.2   The install went 
through fine without any issues but from the admin node I am not able to 
execute any of the Ceph commands

Error:
root@ceph-main:/cephcluster# ceph auth export
2015-06-18 12:43:28.922367 7f54d286b700 -1 monclient(hunting): ERROR: missing 
keyring, cannot use cephx for authentication
2015-06-18 12:43:28.922375 7f54d286b700  0 librados: client.admin 
initialization error (2) No such file or directory
Error connecting to cluster: ObjectNotFound

I googled for this and only found one article relevant, but it did not solve my 
problem.
http://t75390.file-systems-ceph-user.file-systemstalk.us/newbie-error-connecting-to-cluster-permissionerror-t75390.html

Is there any other workaround or fix for this ??

Regards
Teclus Dsouza
Technical Architect
Tech Mahindra


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hammer 0.94.2: Error when running commands on CEPH admin node

2015-06-18 Thread Alan Johnson
And also this needs the correct permissions set, as otherwise it will give this 
error.


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of B, 
Naga Venkata
Sent: Thursday, June 18, 2015 10:07 AM
To: Teclus Dsouza -X (teclus - TECH MAHINDRA LIM at Cisco); 
ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Hammer 0.94.2: Error when running commands on CEPH 
admin node

Do you have admin keyring in /etc/ceph directory?

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Teclus 
Dsouza -X (teclus - TECH MAHINDRA LIM at Cisco)
Sent: Thursday, June 18, 2015 10:35 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Hammer 0.94.2: Error when running commands on CEPH admin 
node
Importance: High

Hello Everyone,

I have setup a new cluster with Ceph-hammer version (0.94.2   The install went 
through fine without any issues but from the admin node I am not able to 
execute any of the Ceph commands

Error:
root@ceph-main:/cephcluster# ceph auth export
2015-06-18 12:43:28.922367 7f54d286b700 -1 monclient(hunting): ERROR: missing 
keyring, cannot use cephx for authentication
2015-06-18 12:43:28.922375 7f54d286b700  0 librados: client.admin 
initialization error (2) No such file or directory
Error connecting to cluster: ObjectNotFound

I googled for this and only found one article relevant, but it did not solve my 
problem.
http://t75390.file-systems-ceph-user.file-systemstalk.us/newbie-error-connecting-to-cluster-permissionerror-t75390.html

Is there any other workaround or fix for this ??

Regards
Teclus Dsouza
Technical Architect
Tech Mahindra


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] intel atom erasure coded pool

2015-06-18 Thread Reid Kelley
Has there been any testing/feedback on using the 8-core Intel Atom C2750 with 
EC pools? Or any use case really?  There are some enticing 1U 12x3.5’ chassis 
out there with the Atom processor.  The idea of low-power, dense, EC pool 
storage has a lot of appeal.  We’re looking to build out a pretty cold EC pool 
(media storage with a strong hot/cold skew) behind a large-ish NVMe cache tier.

Thanks!

 -Reid
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] keyring getting overwritten by mon generated bootstrap-osd keyring

2015-06-18 Thread Johanni Thunstrom
Dear Ceph Community,


We are fetching the mon and osd bootstrap keyring values from our own encrypted 
data bags. We are successful in setting the mon_secret to a preset value, 
but fail to do so for the /var/lib/ceph/bootstrap-osd keyring.

Similar to how we set mon_secret, we set osd_secret. We added log messages 
printing out the osd_secret in the ceph community cookbook recipe osd.rb. This 
value logged correctly in our chef client log. However, after chef completes 
the ceph osd recipe, the /var/lib/ceph/bootstrap-osd/ceph.keyring file is not 
the same value as the intended osd_secret. It is overwritten by the 
bootstrap-osd keyring value created during the mon recipe.

Since it is being reverted, how can we set the initial 
/var/lib/ceph/bootstrap-osd/ceph.keyring file to start out with the correct 
value? We see the bootstrap-osd keyring file is being created during the mon 
installation, but we are not sure where and how to set the 
bootstrap-osd keyring value.

Sincerely,
Johanni B. Thunstrom
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] SSD test results with Plextor M6 Pro, HyperX Fury, Kingston V300, ADATA SP90

2015-06-18 Thread Jelle de Jong
Hello everybody,

I thought I would share the benchmarks from these four ssd's I tested
(see attachment)

I do still have some questions:

#1 *Data Set Management TRIM supported (limit 1 block)
vs
   *Data Set Management TRIM supported (limit 8 blocks)
and how this affects Ceph, and also how I can test whether TRIM is actually
working and not corrupting data.

#2 are there other things I should test to compare ssd's for Ceph Journals

#3 are the power loss security mechanisms on SSD relevant in Ceph when
configured in a way that a full node can fully die and that a power loss
of all nodes at the same time should not be possible (or has an extreme
low probability)

#4 how to benchmark the OSD (disk+ssd-journal) combination so I can
compare them.

I have some other benchmark questions, but I will send them in a separate
mail.

Kind regards,

Jelle de Jong
#---

# Plextor M6 Pro 128G

root@ceph01:~# uname -a
Linux ceph01 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1 (2015-05-24) x86_64 
GNU/Linux

root@ceph01:~# smartctl -i /dev/sdc
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     PLEXTOR PX-128M6Pro
Serial Number:    P02441106228
LU WWN Device Id: 5 002303 1002de43e
Add. Product Id:  NC702090
Firmware Version: 1.02
User Capacity:    128,035,676,160 bytes [128 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS, ATA/ATAPI-7 T13/1532D revision 4a
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Jun 18 15:46:33 2015 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

root@ceph01:~# hdparm -I /dev/sdc | grep TRIM
   *Data Set Management TRIM supported (limit 8 blocks)

root@ceph01:~# hdparm -W 0 /dev/sdc 0

fio --filename=/dev/sdc --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 
--iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test
fio --filename=/dev/sdc --direct=1 --sync=1 --rw=write --bs=4k --numjobs=2 
--iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test
fio --filename=/dev/sdc --direct=1 --sync=1 --rw=write --bs=4k --numjobs=4 
--iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test
fio --filename=/dev/sdc --direct=1 --sync=1 --rw=write --bs=4k --numjobs=8 
--iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test
fio --filename=/dev/sdc --direct=1 --sync=1 --rw=write --bs=4k --numjobs=16 
--iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test
fio --filename=/dev/sdc --direct=1 --sync=1 --rw=write --bs=4k --numjobs=32 
--iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test
fio --filename=/dev/sdc --direct=1 --sync=1 --rw=write --bs=4k --numjobs=64 
--iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test

1# write: io=163136KB, bw=2718.1KB/s, iops=679,   runt= 60001msec
2# write: io=323768KB, bw=5396.5KB/s, iops=1349,  runt= 60001msec
4#  write: io=643624KB, bw=10727KB/s, iops=2681,  runt= 60001msec
8#  write: io=1238.3MB, bw=21132KB/s, iops=5283,  runt= 60002msec
16# write: io=2218.9MB, bw=37868KB/s, iops=9466,  runt= 60001msec
32# write: io=3342.7MB, bw=57045KB/s, iops=14261, runt= 60003msec
64# write: io=3149.6MB, bw=53745KB/s, iops=13436, runt= 60007msec

# second run after testing the other ssd's
1# write: io=162100KB, bw=2701.7KB/s, iops=675,   runt= 60001msec
2# write: io=321076KB, bw=5351.2KB/s, iops=1337,  runt= 60001msec
4#  write: io=641076KB, bw=10684KB/s, iops=2671,  runt= 60001msec
8#  write: io=1230.5MB, bw=20999KB/s, iops=5249,  runt= 60002msec
16# write: io=2199.9MB, bw=37543KB/s, iops=9385,  runt= 60002msec
32# write: io=3367.4MB, bw=57467KB/s, iops=14366, runt= 60002msec
64# write: io=3270.5MB, bw=55809KB/s, iops=13952, runt= 60006msec

root@ceph01:~# dd if=/dev/zero of=/dev/sdc bs=4k count=10000 oflag=direct,dsync
10000+0 records in
10000+0 records out
40960000 bytes (41 MB) copied, 14.6745 s, 2.8 MB/s

root@ceph01:~# dmidecode -t system
# dmidecode 2.12
SMBIOS 2.6 present.

Handle 0x0002, DMI type 1, 27 bytes
System Information
Manufacturer: Hewlett-Packard
Product Name: HP Z600 Workstation
Version:
Serial Number: CZC0121R1J
UUID: CD0720D9-378D-11DF-BBDA-05C40AB118A9
Wake-up Type: Power Switch
SKU Number: FW863AV
Family: 103C_53335X

Handle 0x004B, DMI type 32, 11 bytes
System Boot Information
Status: No errors detected

#---

# Kingston HyperX Fury 120G

root@ceph01:~# uname -a
Linux ceph01 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1 (2015-05-24) x86_64 
GNU/Linux

root@ceph01:~# smartctl -i 

Re: [ceph-users] Hammer 0.94.2: Error when running commands on CEPH admin node

2015-06-18 Thread Alan Johnson
For the permissions use  sudo chmod +r /etc/ceph/ceph.client.admin.keyring
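
Something like the following, assuming the admin keyring was generated in the /cephcluster working directory shown in the prompt (file name and source path are assumptions):

sudo cp /cephcluster/ceph.client.admin.keyring /etc/ceph/ceph.client.admin.keyring
sudo chmod +r /etc/ceph/ceph.client.admin.keyring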


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Teclus 
Dsouza -X (teclus - TECH MAHINDRA LIM at Cisco)
Sent: Thursday, June 18, 2015 10:21 AM
To: B, Naga Venkata; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Hammer 0.94.2: Error when running commands on CEPH 
admin node

Hello Naga,

The keyring file is present under a folder I created for ceph.   Are you saying 
the same needs to be copied to the /etc/ceph folder ?

Regards
Teclus

From: B, Naga Venkata [mailto:nag...@hp.com]
Sent: Thursday, June 18, 2015 10:37 PM
To: Teclus Dsouza -X (teclus - TECH MAHINDRA LIM at Cisco); 
ceph-users@lists.ceph.com
Subject: RE: [ceph-users] Hammer 0.94.2: Error when running commands on CEPH 
admin node

Do you have admin keyring in /etc/ceph directory?

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Teclus 
Dsouza -X (teclus - TECH MAHINDRA LIM at Cisco)
Sent: Thursday, June 18, 2015 10:35 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Hammer 0.94.2: Error when running commands on CEPH admin 
node
Importance: High

Hello Everyone,

I have setup a new cluster with Ceph-hammer version (0.94.2   The install went 
through fine without any issues but from the admin node I am not able to 
execute any of the Ceph commands

Error:
root@ceph-main:/cephcluster# ceph auth export
2015-06-18 12:43:28.922367 7f54d286b700 -1 monclient(hunting): ERROR: missing 
keyring, cannot use cephx for authentication
2015-06-18 12:43:28.922375 7f54d286b700  0 librados: client.admin 
initialization error (2) No such file or directory
Error connecting to cluster: ObjectNotFound

I googled for this and only found one article relevant, but it did not solve my 
problem.
http://t75390.file-systems-ceph-user.file-systemstalk.us/newbie-error-connecting-to-cluster-permissionerror-t75390.html

Is there any other workaround or fix for this ??

Regards
Teclus Dsouza
Technical Architect
Tech Mahindra


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD test results with Plextor M6 Pro, HyperX Fury, Kingston V300, ADATA SP90

2015-06-18 Thread Christian Balzer

Hello,

On Thu, 18 Jun 2015 17:48:12 +0200 Jelle de Jong wrote:

 Hello everybody,
 
 I thought I would share the benchmarks from these four ssd's I tested
 (see attachment)
 

None of these are DC-level SSDs of course, though the HyperX at least
supposedly can handle 2.5 DWPD.
Alas that info is only in the PDF, not the web page specifications, and
that PDF also says not for servers, no siree.
Which can mean a lot of things, the worst would be something like going
_very_ slow when doing housekeeping or the likes.

 I do still have some question:
 
 #1 *Data Set Management TRIM supported (limit 1 block)
 vs
*Data Set Management TRIM supported (limit 8 blocks)
 and how this effects Ceph and also how can I test if TRIM is actually
 working and not corruption data.
 

I would not deploy any SSDs that actually require TRIM to maintain their
speed or TBW endurance. 
And I wouldn't want Ceph to do TRIMs due to the corruption issues you
already are aware of.
And last but not least, TRIM makes little to no sense with Ceph journals.
These are raw partitions, so Ceph would need to issue the TRIM commands.
And they are constantly being overwritten, trimming them would be
detrimental to the performance for sure.

 #2 are there other things I should test to compare ssd's for Ceph
 Journals
 
TBW/$. I couldn't find the endurance data for the Plextor at all.
I have a cluster with journal SSDs that experience average 2MB/s writes,
so in 5 years that makes 315TB. Just shy of the 354TB the 128GB HyperX
promises. 
First rule of engineering, overspec by at least 100%, so the 240GB model
would be a fit. If one were to use such drives in the first place.

 #3 are the power loss security mechanisms on SSD relevant in Ceph when
 configured in a way that a full node can fully die and that a power loss
 of all nodes at the same time should not be possible (or has an extreme
 low probability)
 
A full node death is often something you can recover from much faster than
a dead OSD (usually no data loss, just reboot it) and if Ceph is configured
correctly (mon_osd_down_out_subtree_limit = host) with very little impact
when it comes back.
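
In ceph.conf terms that is something like the following (whether it lives in [global] or [mon] is a matter of taste):

[global]
mon osd down out subtree limit = host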

If your journals are hosed because of a power loss, all the associated
OSDs are dead until you either recreate the journal (if possible) or in
the worst case (OSD HDD also hosed) the entire OSD.

That said, I personally consider total power loss scenarios in the DCs we
use to be very, very unlikely as well. Others here will strongly disagree
with that, based on their experience.
Penultimately that doesn't stop folks from accidentally powering off or
unplugging servers.
And I have seen SSDs w/o power loss protection getting hosed in such
scenarios while ones with it had no issues.

 #4 how to benchmarks the OSD (disk+ssd-journal) combination so I can
 compare them.
 
There are plenty of examples in the archives, from rados bench to fio with
rbd ioengine to running fio in a VM (for most people the most realistic
test). Block size will have of course a dramatic impact on throughput,
IOPS and CPU utilization.
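
For example, a fio run against an RBD image via the rbd ioengine could look roughly like this (pool and image names are placeholders, and the rbd engine needs a reasonably recent fio build):

fio --name=rbd-4k-randwrite --ioengine=rbd --clientname=admin --pool=rbd \
    --rbdname=test-image --rw=randwrite --bs=4k --iodepth=16 --direct=1 \
    --runtime=60 --time_based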

The fio and dd tests you did are an indication of the capabilities of
those SSDs, those numbers however don't translate directly to Ceph.

Also, once your SSDs are fast enough to ACK things in a timely fashion,
your HDDs will become the bottleneck with persistent loads.

For example in my cluster with a 2 journals per SSD (DC S3700 100GB) a fio
run with 4K blocks will quickly get the CPUs sweating, the HDDs to 100%
utilization and the SSDs to about 10%. 
However with 4M blocks the CPUs are nearly bored, the HDDs of course at
about 100% and the SSD are going up to 40% (they are approaching their
throughput/bandwidth limit of 200MB/s, not IOPS). With rados bench I can
push the SSDs to 70%, which is one of the reasons I postulate that HDDs
(of the 7.2K RPM SATA persuasion) won't be doing much over 80MB/s in the
best case scenario when being used as OSDs.

Regards,

Christian

 I got some other benchmarks question, but I will make an separate mail
 for them.
 
 Kind regards,
 
 Jelle de Jong


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Aug Ceph Hackathon

2015-06-18 Thread Patrick McGarry
Hey Cephers,

So it looks like we have the list of approved attendees for the Ceph
Hackathon in Hilsboro, OR that Intel is being kind enough to host.

http://pad.ceph.com/p/hackathon_2015-08

If you are not on that list and would like to be, please contact me as
soon as possible to see if we can get you added. Thanks!


-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CDS Jewel Details Posted

2015-06-18 Thread Patrick McGarry
Hey cephers,

The schedule and videoconference details have been added to the CDS Jewel page.

http://tracker.ceph.com/projects/ceph/wiki/CDS_Jewel

If you see any problems with my timezone math or have a scheduling
conflict that wont allow you to attend your blueprint session, please
let me know. We don't have a ton of options for moving things around,
but we can try our best to at least get the blueprint owners to their
own session.

Shout if you have any questions. Thanks.


-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs unmounts itself from time to time

2015-06-18 Thread Roland Giesler
On 15 June 2015 at 13:09, Gregory Farnum g...@gregs42.com wrote:

 On Mon, Jun 15, 2015 at 4:03 AM, Roland Giesler rol...@giesler.za.net
 wrote:
  I have a small cluster of 4 machines and quite a few drives.  After
 about 2
  - 3 weeks cephfs fails.  It's not properly mounted anymore in
 /mnt/cephfs,
  which of course causes the VM's running to fail too.
 
  In /var/log/syslog I have /mnt/cephfs: File exists at
  /usr/share/perl5/PVE/Storage/DirPlugin.pm line 52 repeatedly.
 
  There doesn't seem to be anything wrong with ceph at the time.
 
  # ceph -s
  cluster 40f26838-4760-4b10-a65c-b9c1cd671f2f
   health HEALTH_WARN clock skew detected on mon.s1
   monmap e2: 2 mons at
  {h1=192.168.121.30:6789/0,s1=192.168.121.33:6789/0}, election epoch 312,
  quorum 0,1 h1,s1
   mdsmap e401: 1/1/1 up {0=s3=up:active}, 1 up:standby
   osdmap e5577: 19 osds: 19 up, 19 in
pgmap v11191838: 384 pgs, 3 pools, 774 GB data, 455 kobjects
  1636 GB used, 9713 GB / 11358 GB avail
   384 active+clean
client io 12240 kB/s rd, 1524 B/s wr, 24 op/s
  # ceph osd tree
  # id  weight  type name      up/down  reweight
  -1    11.13   root default
  -2    8.14        host h1
   1    0.9             osd.1   up      1
   3    0.9             osd.3   up      1
   4    0.9             osd.4   up      1
   5    0.68            osd.5   up      1
   6    0.68            osd.6   up      1
   7    0.68            osd.7   up      1
   8    0.68            osd.8   up      1
   9    0.68            osd.9   up      1
  10    0.68            osd.10  up      1
  11    0.68            osd.11  up      1
  12    0.68            osd.12  up      1
  -3    0.45        host s3
   2    0.45            osd.2   up      1
  -4    0.9         host s2
  13    0.9             osd.13  up      1
  -5    1.64        host s1
  14    0.29            osd.14  up      1
   0    0.27            osd.0   up      1
  15    0.27            osd.15  up      1
  16    0.27            osd.16  up      1
  17    0.27            osd.17  up      1
  18    0.27            osd.18  up      1
 
  When I umount -l /mnt/cephfs and then mount -a after that, the
 ceph
  volume is loaded again.  I can restart the VM's and all seems well.
 
  I can't find errors pertaining to cephfs in the the other logs either.
 
  System information:
 
  Linux s1 2.6.32-34-pve #1 SMP Fri Dec 19 07:42:04 CET 2014 x86_64
 GNU/Linux

 I'm not sure what version of Linux this really is (I assume it's a
 vendor kernel of some kind!), but it's definitely an old one! CephFS
 sees pretty continuous improvements to stability and it could be any
 number of resolved bugs.


​This is the stock standard installation of Proxmo​x with CephFS.



 If you can't upgrade the kernel, you might try out the ceph-fuse
 client instead as you can run a much newer and more up-to-date version
 of it, even on the old kernel.


I'm under the impression that CephFS is the filesystem implemented by
ceph-fuse. Is it not? ​
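
For reference, the two clients are mounted differently (monitor address taken from the ceph -s output above; the secretfile path is only a placeholder):

# kernel client - the code lives in the running kernel:
mount -t ceph 192.168.121.30:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret
# FUSE client - a userspace daemon that can be upgraded independently of the kernel:
ceph-fuse -m 192.168.121.30:6789 /mnt/cephfs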



 Other than that, can you include more
 information about exactly what you mean when saying CephFS unmounts
 itself?


​Everything runs fine for weeks.  Then suddenly a user reports that a VM is
not functioning anymore.  On investigation it transpires that CephFS is not
mounted anymore and the error I reported is logged.

I can't see anything else wrong at this stage.  ceph is running, the osd
are all up.

thanks again

Roland​



 -Greg
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Unexpected disk write activity with btrfs OSDs

2015-06-18 Thread Lionel Bouton
Hi,

I've just noticed an odd behaviour with the btrfs OSDs. We monitor the
amount of disk writes on each device, our granularity is 10s (every 10s
the monitoring system collects the total amount of sector written and
write io performed since boot and computes both the B/s and IO/s).

With only residual write activity on our storage network (~450kB/s total
for the whole Ceph cluster, which amounts to a theoretical ~120kB/s on
each OSD once replication, double writes due to journal and number of
OSD are factored in) :
- Disks with btrfs OSD have a spike of activity every 30s (2 intervals
of 10s with nearly 0 activity, one interval with a total amount of
writes of ~120MB). The averages are : 4MB/s, 100 IO/s.
- Disks with xfs OSD (with journal on a separate partition but same
disk) don't have these spikes of activity and the averages are far lower
: 160kB/s and 5 IO/s. This is not far off what is expected from the
whole cluster write activity.

There's a setting of 30s on our platform :
filestore max sync interval

I changed it to 60s with
ceph tell osd.* injectargs '--filestore-max-sync-interval 60'
and the amount of writes was lowered to ~2.5MB/s.

I changed it to 5s (the default) with
ceph tell osd.* injectargs '--filestore-max-sync-interval 5'
the amount of writes to the device rose to an average of 10MB/s (and
given our sampling interval of 10s appeared constant).

During these tests the activity on disks hosting XFS OSDs didn't change
much.

So it seems filestore syncs generate far more activity on btrfs OSDs
compared to XFS OSDs (journal activity included for both).

Note that autodefrag is disabled on our btrfs OSDs. We use our own
scheduler which in the case of our OSD limits the amount of defragmented
data to ~10MB per minute in the worst case and usually (during low write
activity which was the case here) triggers a single file defragmentation
every 2 minutes (which amounts to a 4MB write as we only host RBDs with
the default order value). So defragmentation shouldn't be an issue here.

This doesn't seem to generate too much stress when filestore max sync
interval is 30s (our btrfs OSDs are faster than xfs OSDs with the same
amount of data according to apply latencies) but at 5s the btrfs OSDs
are far slower than our xfs OSDs with 10x the average apply latency (we
didn't let this continue more than 10 minutes as it began to make some
VMs wait for IOs too much).

Does anyone know if this is normal and why it is happening?

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpected disk write activity with btrfs OSDs

2015-06-18 Thread Lionel Bouton
I just realized I forgot to add proper context:

this is with Firefly 0.80.9 and the btrfs OSDs are running on kernel
4.0.5 (this was happening with previous kernel versions according to our
monitoring history), xfs OSDs run on 4.0.5 or 3.18.9. There are 23 OSDs
total and 2 of them are using btrfs.

On 06/18/15 23:28, Lionel Bouton wrote:
 Hi,

 I've just noticed an odd behaviour with the btrfs OSDs. We monitor the
 amount of disk writes on each device, our granularity is 10s (every 10s
 the monitoring system collects the total amount of sector written and
 write io performed since boot and computes both the B/s and IO/s).

 With only residual write activity on our storage network (~450kB/s total
 for the whole Ceph cluster, which amounts to a theoretical ~120kB/s on
 each OSD once replication, double writes due to journal and number of
 OSD are factored in) :
 - Disks with btrfs OSD have a spike of activity every 30s (2 intervals
 of 10s with nearly 0 activity, one interval with a total amount of
 writes of ~120MB). The averages are : 4MB/s, 100 IO/s.
 - Disks with xfs OSD (with journal on a separate partition but same
 disk) don't have these spikes of activity and the averages are far lower
 : 160kB/s and 5 IO/s. This is not far off what is expected from the
 whole cluster write activity.

 There's a setting of 30s on our platform :
 filestore max sync interval

 I changed it to 60s with
 ceph tell osd.* injectargs '--filestore-max-sync-interval 60'
 and the amount of writes was lowered to ~2.5MB/s.

 I changed it to 5s (the default) with
 ceph tell osd.* injectargs '--filestore-max-sync-interval 5'
 the amount of writes to the device rose to an average of 10MB/s (and
 given our sampling interval of 10s appeared constant).

 During these tests the activity on disks hosting XFS OSDs didn't change
 much.

 So it seems filestore syncs generate far more activity on btrfs OSDs
 compared to XFS OSDs (journal activity included for both).

 Note that autodefrag is disabled on our btrfs OSDs. We use our own
 scheduler which in the case of our OSD limits the amount of defragmented
 data to ~10MB per minute in the worst case and usually (during low write
 activity which was the case here) triggers a single file defragmentation
 every 2 minutes (which amounts to a 4MB write as we only host RBDs with
 the default order value). So defragmentation shouldn't be an issue here.

 This doesn't seem to generate too much stress when filestore max sync
 interval is 30s (our btrfs OSDs are faster than xfs OSDs with the same
 amount of data according to apply latencies) but at 5s the btrfs OSDs
 are far slower than our xfs OSDs with 10x the average apply latency (we
 didn't let this continue more than 10 minutes as it began to make some
 VMs wait for IOs too much).

 Does anyone know if this is normal and why it is happening?

 Best regards,

 Lionel
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Very chatty MON logs: Is this normal?

2015-06-18 Thread Joao Eduardo Luis
On 06/17/2015 08:30 PM, Somnath Roy wrote:
  However, I'd rather not set the level to 0/0, as that would disable all 
 logging from the MONs
 
 I don't think so. All the error scenarios and stack trace (in case of crash) 
 are supposed to be logged with log level 0. But, generally, we need the 
 highest log level (say 20) to get all the information when something to debug.
 So, I doubt how beneficial it will be to enable logging for some intermediate 
 levels.
 Probably, there is no guideline for these log level too which developer 
 should follow strictly.

I don't think this is documented anywhere, but for a while now we've
been using roughly this approach to debug levels:

-1  - errors.
 0  - info you really want in the log each time it happens.
 1  - info that should be outputted by default
  should be stuff that doesn't happen often and is quite important to
  get to the logs when it happens.
 5  - important, but happens a bit too often to output at level 1
10  - gross majority of debug messages in the monitor
20  - debug that could impact monitor performance severely
  (e.g., debug from inside a loop)
30  - debug that you should not need unless you're really looking for it

It is fairly common a developer will ask you for 'debug mon = 10' in
order to catch all debug messages at levels 5 and 10, because those are
the ones that usually pay off when tracking down issues.

But given this is left pretty much to the developer's criteria,
different services may use different levels of verbosity for different
things, and you may need a higher debug level to get info out of some
parts of the code than others.

In this particular case, the message that is being outputted should,
imo, be on debug level 5 instead of 1.  We used to output a lot of stuff
on debug level 1, but have been moving away from that; there are still
artifacts though, and this is one of them.

Setting 'debug mon = 0/5' should be okay.  Unless you see that setting
'/5' impacts your performance and/or memory consumption, you should
leave that be.  '0/5' means 'output only debug 0 or lower to the logs;
keep the last 1000 debug level 5 or lower in memory in case of a crash'.
Your logs will not be as heavily populated but, if for some reason the
daemon crashes, you get quite a few of debug information to help track
down the source of the problem.
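
In ceph.conf form that would be, for example:

[mon]
debug mon = 0/5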

HTH,

  -Joao

 
 Thanks  Regards
 Somnath
 
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
 Daniel Schneller
 Sent: Wednesday, June 17, 2015 12:11 PM
 To: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Very chatty MON logs: Is this normal?
 
 On 2015-06-17 18:52:51 +, Somnath Roy said:
 
 This is presently written from log level 1 onwards :-) So, only log
 level 0 will not log this..
 Try, 'debug_mon = 0/0' in the conf file..
 
 Yeah, once I had sent the mail I realized that 1 in the log line was the 
 level. Had overlooked that before.
 However, I'd rather not set the level to 0/0, as that would disable all 
 logging from the MONs.
 
 Now, I don't have enough knowledge on that part to say whether it is
 important enough to log at log level 1 , sorry :-(
 
 That would indeed be an interesting to know.
 Judging from the sheer amount, at least I have my doubts, because the cluster 
 seems to be running without any issues. So I figure at least it isn't 
 indicative of an immediate issue.
 
 Anyone with a little more definitve knowledge around? Should I create a bug 
 ticket for this?
 
 Cheers,
 Daniel
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 
 PLEASE NOTE: The information contained in this electronic mail message is 
 intended only for the use of the designated recipient(s) named above. If the 
 reader of this message is not the intended recipient, you are hereby notified 
 that you have received this message in error and that any review, 
 dissemination, distribution, or copying of this message is strictly 
 prohibited. If you have received this communication in error, please notify 
 the sender by telephone or e-mail (as shown above) immediately and destroy 
 any and all copies of this message in your possession (whether hard copies or 
 electronically stored copies).
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Interesting postmortem on SSDs from Algolia

2015-06-18 Thread Mark Nelson

Oh that's very good to know.  Are there details posted anywhere?

Mark

On 06/18/2015 02:46 AM, Dan van der Ster wrote:

Thanks, that's a nice article.

We're pretty happy with the SSDs he lists as Good, but note that
they're not totally immune to these type of issues -- indeed we've
found that bcache can crash a DC S3700, and Intel confirmed it was a
firmware bug.

Cheers, Dan


On Wed, Jun 17, 2015 at 8:36 PM, Steve Anthony sma...@lehigh.edu wrote:

There's often a great deal of discussion about which SSDs to use for
journals, and why some of the cheaper SSDs end up being more expensive
in the long run. The recent blog post at Algolia, though not Ceph
specific, provides a good illustration of exactly how insidious
kernel/SSD interactions can be. Thought the list might find it
interesting.

https://blog.algolia.com/when-solid-state-drives-are-not-that-solid/

-Steve

--
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma...@lehigh.edu



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd performance issue - can't find bottleneck

2015-06-18 Thread Mark Nelson

On 06/18/2015 04:49 AM, Jacek Jarosiewicz wrote:

On 06/17/2015 04:19 PM, Mark Nelson wrote:

SSD's are INTEL SSDSC2BW240A4


Ah, if I'm not mistaken that's the Intel 530 right?  You'll want to see
this thread by Stefan Priebe:

https://www.mail-archive.com/ceph-users@lists.ceph.com/msg05667.html

In fact it was the difference in Intel 520 and Intel 530 performance
that triggered many of the different investigations that have taken
place by various folks into SSD flushing behavior on ATA_CMD_FLUSH.  The
gist of it is that the 520 is very fast but probably not safe.  The 530
is safe but not fast.  The DC S3700 (and similar drives with super
capacitors) are thought to be both fast and safe (though some drives
like the Crucial M500 and later misrepresented their power loss
protection so you have to be very careful!)



Yes, these are Intel 530.
I did the tests described in the thread you pasted and unfortunately
that's my case... I think.

The dd run locally on a mounted ssd partition looks like this:

[root@cf02 journal]# dd if=/dev/zero of=test bs=350k count=10000
oflag=direct,dsync
10000+0 records in
10000+0 records out
3584000000 bytes (3.6 GB) copied, 211.698 s, 16.9 MB/s

and when I skip the flag dsync it goes fast:

[root@cf02 journal]# dd if=/dev/zero of=test bs=350k count=10000
oflag=direct
10000+0 records in
10000+0 records out
3584000000 bytes (3.6 GB) copied, 9.05432 s, 396 MB/s

(I used the same 350k block size as mentioned in the e-mail from the
thread above)

I tried disabling the dsync like this:

[root@cf02 ~]# echo "temporary write through" > 
/sys/class/scsi_disk/1\:0\:0\:0/cache_type

[root@cf02 ~]# cat /sys/class/scsi_disk/1\:0\:0\:0/cache_type
write through

..and then locally I see the speedup:

[root@cf02 journal]# dd if=/dev/zero of=test bs=350k count=1
oflag=direct,dsync
1+0 records in
1+0 records out
358400 bytes (3.6 GB) copied, 10.4624 s, 343 MB/s


..but when I test it from a client I still get slow results:

root@cf03:/ceph/tmp# dd if=/dev/zero of=test bs=100M count=100 oflag=direct
100+0 records in
100+0 records out
10485760000 bytes (10 GB) copied, 122.482 s, 85.6 MB/s

and fio gives the same 2-3k iops.

after the change to SSD cache_type I tried remounting the test image,
recreating it and so on - nothing helped.

I ran rbd bench-write on it, and it's not good either:

root@cf03:~# rbd bench-write t2
bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern seq
   SEC   OPS   OPS/SEC   BYTES/SEC
 1  4221   4220.64  32195919.35
 2  9628   4813.95  36286083.00
 3 15288   4790.90  35714620.49
 4 19610   4902.47  36626193.93
 5 24844   4968.37  37296562.14
 6 30488   5081.31  38112444.88
 7 36152   5164.54  38601615.10
 8 41479   5184.80  38860207.38
 9 46971   5218.70  39181437.52
10 52219   5221.77  39322641.34
11 5   5151.36  38761566.30
12 62073   5172.71  38855021.35
13 65962   5073.95  38182880.49
14 71541   5110.02  38431536.17
15 77039   5135.85  38615125.42
16 82133   5133.31  38692578.98
17 87657   5156.24  38849948.84
18 92943   5141.03  38635464.85
19 97528   5133.03  38628548.32
20103100   5154.99  38751359.30
21108952   5188.09  38944016.94
22114511   5205.01  38999594.18
23120319   5231.17  39138227.64
24125975   5248.92  39195739.46
25131438   5257.50  39259023.06
26136883   5264.72  39344673.41
27142362   5272.66  39381638.20
elapsed:27  ops:   143789  ops/sec:  5273.01  bytes/sec: 39376124.30

rados bench gives:

root@cf03:~# rados -p rbd bench 30 write --no-cleanup
  Maintaining 16 concurrent writes of 4194304 bytes for up to 30 seconds
or 0 objects
  Object prefix: benchmark_data_cf03_21194
sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
  0   0 0 0 0 0 - 0
  1  162812   47.986348  0.779211   0.48964
  2  164327   53.988660   1.17958  0.775733
  3  16594357.32264  0.157145  0.798348
  4  167357   56.989756  0.424493  0.862553
  5  168973 58.3964  0.246444  0.893064
  6  16   10488   58.656960   1.67389  0.901757
  7  16   120   104   59.418664   1.78324  0.935242
  8  16   132   116   57.990548   1.50035  0.963947
  9  16   147   131   58.212860   1.85047  0.978697
 10  16   161   145   57.990856  0.133187  0.99
 11  16   174   158   57.445552   1.59548   1.02264
 12  16   189   173   57.657760  0.179966   1.01623
 13  16   206   190   58.452668   1.93064   1.02108
 14 

Re: [ceph-users] Interesting postmortem on SSDs from Algolia

2015-06-18 Thread Dan van der Ster
Thanks, that's a nice article.

We're pretty happy with the SSDs he lists as Good, but note that
they're not totally immune to these type of issues -- indeed we've
found that bcache can crash a DC S3700, and Intel confirmed it was a
firmware bug.

Cheers, Dan


On Wed, Jun 17, 2015 at 8:36 PM, Steve Anthony sma...@lehigh.edu wrote:
 There's often a great deal of discussion about which SSDs to use for
 journals, and why some of the cheaper SSDs end up being more expensive
 in the long run. The recent blog post at Algolia, though not Ceph
 specific, provides a good illustration of exactly how insidious
 kernel/SSD interactions can be. Thought the list might find it
 interesting.

 https://blog.algolia.com/when-solid-state-drives-are-not-that-solid/

 -Steve

 --
 Steve Anthony
 LTS HPC Support Specialist
 Lehigh University
 sma...@lehigh.edu



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD Journal creation ?

2015-06-18 Thread Shane Gibson
All - I am building my first ceph cluster, and doing it the hard way,
manually without the aid of ceph-deploy.  I have successfully built the
mon cluster and am now adding OSDs.

My main question:
How do I prepare the Journal prior to the prepare/activate stages of the
OSD creation?  


More details:
Basically - all of the documentation seems to assume the journal is already
prepared.  Do I simply create a single raw partition on a physical
device, and the ceph-disk prepare ... and ceph-disk activate ... steps
will take care of everything for the journal, presumably based on the
ceph-disk prepare ... --fs-type setting?  Or do I need to
actually format it as a filesystem prior to giving it over to the Ceph OSD?

The architecture I'm thinking of is as follows - based on the hardware I
have for OSDs (currently 9 servers each with):

  RAID 0 mirror for OS hard drives (2 disks)
  data disk for journal placement for 5 physical disks (4TB)
  data disk for journal placement for 5 physical disks (4TB)
  10 data disks as OSDs (one OSD per disk) (4TB each)

Essentially - there are 12 data disks in the node (all 4 TB 7200 rpm
spinning disks).  Splitting the Journal across two of them gives me a
failure domain of 5 data disks + 1 journal disk in a single physical
server for crush map purposes ...  It also vaguely helps spread the I/O
workload for the journaling activity across 2 physical disks in a chassis
instead of just one (since the journal disk is pretty darn slow).

In this configuration I'd create 5 separate partitions on Journal Disk A
and 5 on Journal Disk B ... but do they need to be formatted and mounted?
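
(For illustration, a rough sketch of that layout with the ceph-disk tooling of
that era, assuming GPT-partitioned journal disks; /dev/sdb as a journal disk
and /dev/sdd as a data disk are placeholders. The journal partitions are left
raw - no filesystem and no mount - and ceph-disk points the OSD's journal at
the partition you hand it:)

# carve five raw journal partitions (the 10 GB size is an assumption) on journal disk A
for i in 1 2 3 4 5; do
    sgdisk --new=${i}:0:+10G --change-name=${i}:"ceph journal" /dev/sdb
done

# prepare and activate one OSD: data disk plus one raw journal partition
ceph-disk prepare --fs-type xfs /dev/sdd /dev/sdb1
ceph-disk activate /dev/sdd1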

Yes, we know that as we move to more real production workloads we'll want/need
to change this for performance reasons - e.g. putting the journal on SSDs ...

Any pointers on where I missed this info in the documentation would be
helpful too ... I've been all over the ceph.com/docs/ site and haven't
found it yet... 

Thanks,
~~shane 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hardware cache settings recomendation

2015-06-18 Thread Jan Schermer
Those are strange numbers - where are you getting them from? Test the drives 
directly with fio with every combination; that should tell you what’s 
happening.
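
(As an illustration - a minimal sketch of such a direct drive test, in the
spirit of the journal-test method linked further down in this thread; /dev/sdX
is a placeholder, and writing to the raw device is destructive, so only run
this on an empty drive:)

fio --name=sync-write-latency --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based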

Jan

 On 18 Jun 2015, at 07:52, Mateusz Skała mateusz.sk...@budikom.net wrote:
 
 Thanks for answer,
 
 I made some tests. First I left dwc=enabled and caching on the journal drive 
 disabled - latency grew from 20ms to 90ms on this drive. Next I enabled cache 
 on the journal drive and disabled all cache on the data drives - latency on 
 the data drives grew from 30-50ms to 1500-2000ms. 
 The test was made only on one OSD host with a P410i controller, with SATA 
 ST1000LM014-1EJ1 drives for data and an Intel SSDSC2BW12 SSD for the journal.
 Regards, 
 Mateusz
 
 
 From: Jan Schermer [mailto:j...@schermer.cz] 
 Sent: Wednesday, June 17, 2015 9:41 AM
 To: Mateusz Skała
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Hardware cache settings recomendation
 
 Cache on top of the data drives (not journal) will not help in most cases, 
 those writes are already buffered in the OS - so unless your OS is very light 
 on memory and flushing constantly it will have no effect, it just adds 
 overhead in case a flush comes. I haven’t tested this extensively with Ceph, 
 though.
 
 Cache enabled on journal drive _could_ help if your SSD is very slow (or if 
 you don’t have SSD for journal at all), and if it is large enough (more than 
 the active journal size) it could prolong the life of your SSD - depending on 
 how and when the cache starts to flush. I know from experience that write 
 cache on Areca controller didn't flush at all until it hit a watermark (50% 
 capacity default or something) and it will be faster than some SSDs on their 
 own. Some SSDs have higher IOPS than the cache can achieve, but you likely 
 won’t saturate that with Ceph.
 
 Another thing is write cache on the drives themselves - I’d leave that on 
 disabled (which is probably the default) unless the drive in question has 
 capacitors to flush the cache in case of power failure. Controllers usually 
 have a whitelist of devices that respect flushes on which the write cache is 
 default=enabled, but in case of for example Dell Perc you would need to have 
 Dell original drives or enable it manually.
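
(Side note - a minimal sketch of toggling the on-disk write cache from the OS
rather than from the controller BIOS; /dev/sdX is a placeholder, and whether
the command reaches the disk depends on the controller passing it through:)

hdparm -W /dev/sdX     # query the drive's current write-cache setting
hdparm -W0 /dev/sdX    # disable the drive's volatile write cache
hdparm -W1 /dev/sdX    # re-enable it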
 
 YMMV - I’ve hit the controller cache IOPS limit in the past with a cheap Dell 
 Perc (H310 was it?) that did ~20K IOPS tops on one SSD drive, while the drive 
 itself did close to 40K. On my SSDs, disabling write cache helps latency 
 (good for journal) but could be troubling for the SSD lifetime.
 
 In any case I don’t think you would saturate either with Ceph, so I recommend 
 you just test the latency with write cache enabled/disabled on the controller 
 and pick the one that gives the best numbers.
 This is basically how: 
 http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
 
 Ceph recommended way is to use everything as passthrough (initiator/target 
 mode) or JBOD (RAID0 with single drives on some controllers), so I’d stick 
 with that.
 
 Jan
 
 
 On 17 Jun 2015, at 08:01, Mateusz Skała mateusz.sk...@budikom.net wrote:
 
 Yes, all disks are in single-drive RAID 0. Cache is currently enabled for all 
 drives; should I disable the cache for the SSD drives?
 Regards,
 Mateusz
 
 From: Tyler Bishop [mailto:tyler.bis...@beyondhosting.net] 
 Sent: Thursday, June 11, 2015 7:30 PM
 To: Mateusz Skała
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Hardware cache settings recomendation
 
 You want write cache to disk, no write cache for SSD.
 
 I assume all of your data disk are single drive raid 0?
 
 
 
 Tyler Bishop
 Chief Executive Officer
 513-299-7108 x10
 tyler.bis...@beyondhosting.net
 
 If you are not the intended recipient of this transmission you are notified 
 that disclosing, copying, distributing or taking any action in reliance on 
 the contents of this information is strictly prohibited.
 
 
 
 From: Mateusz Skała mateusz.sk...@budikom.net
 To: ceph-users@lists.ceph.com
 Sent: Saturday, June 6, 2015 4:09:59 AM
 Subject: [ceph-users] Hardware cache settings recomendation
 
 Hi,
 Please help me with hardware cache settings on controllers for the best Ceph 
 RBD performance. All Ceph hosts have one SSD drive for the journal.
 
 We are using 4 different controllers, all with BBU: 
 • HP Smart Array P400
 • HP Smart Array P410i
 • Dell PERC 6/i
 • Dell  PERC H700
 
 I have to set cache policy, on Dell settings are:
 • Read Policy 
 o   Read-Ahead (current)
 o   No-Read-Ahead
 o   Adaptive Read-Ahead
 • Write Policy 
 o   Write-Back (current)
 o   Write-Through 
 • Cache Policy
 o   Cache I/O
 o   Direct I/O (current)
 • Disk Cache Policy
 o   Default (current)
 o   Enabled
 o   Disabled
 On HP controllers:
 • Cache Ratio (current: 25% Read / 75% Write)
 • Drive Write Cache
 o   Enabled (current)
 o   Disabled
 
   And there is one more setting 

Re: [ceph-users] Accessing Ceph from Spark

2015-06-18 Thread Milan Sladky
Hi Yuan,

Thanks for the answer.

Our main use case is to replace AWS S3 with object storage in a private cloud,
very preferably with an S3-compatible API. But we also know that we want to
perform some machine learning and data processing with Spark in the not so
distant future on the data residing in the object storage. The data locality
feature would be very nice to have, but I was not aware that this is possible
with Ceph or Swift.
We do not want to use HDFS, mainly because of the cost brought in by the 3x
replication factor, and because we also plan to store a lot of smaller files.

Best regards,
Milan
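
(For context, a minimal sketch of pointing Spark at RGW through the
hadoop-openstack Swift connector referenced below; the provider name rgw, the
endpoint, credentials and container are placeholders, the hadoop-openstack jar
must be on the classpath, and the exact auth settings depend on how RGW's
Swift API is configured:)

spark-submit \
  --conf spark.hadoop.fs.swift.service.rgw.auth.url=http://radosgw.example.com/auth/1.0 \
  --conf spark.hadoop.fs.swift.service.rgw.username=demo:swift \
  --conf spark.hadoop.fs.swift.service.rgw.password=SECRET_KEY \
  --conf spark.hadoop.fs.swift.service.rgw.public=true \
  my_job.py

# inside the job, data would then be addressed as e.g. swift://mycontainer.rgw/input/*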

 From: dunk...@gmail.com
 Date: Wed, 17 Jun 2015 23:41:48 +0800
 Subject: Re: [ceph-users] Accessing Ceph from Spark
 To: milan.sla...@outlook.com
 CC: ceph-users@lists.ceph.com
 
 Hi Milan,
 
 We've done some tests here and our hadoop can talk to RGW successfully
 with this SwiftFS plugin. But we haven't tried Spark yet. One thing is
 the data locality feature: it actually requires some special
 configuration of the Swift proxy-server, so RGW is not able to achieve
 data locality there.
 
 Could you please kindly share some deployment considerations for running
 Spark on Swift/Ceph? Tachyon seems more promising...
 
 
 Sincerely, Yuan
 
 
 On Wed, Jun 17, 2015 at 9:58 PM, Milan Sladky milan.sla...@outlook.com 
 wrote:
  Is it possible to access Ceph from Spark as it is mentioned here for
  Openstack Swift?
 
  https://spark.apache.org/docs/latest/storage-openstack-swift.html
 
  Thanks for help.
 
  Milan Sladky
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
  ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd performance issue - can't find bottleneck

2015-06-18 Thread Jacek Jarosiewicz

On 06/17/2015 04:19 PM, Mark Nelson wrote:

SSD's are INTEL SSDSC2BW240A4


Ah, if I'm not mistaken that's the Intel 530 right?  You'll want to see
this thread by Stefan Priebe:

https://www.mail-archive.com/ceph-users@lists.ceph.com/msg05667.html

In fact it was the difference in Intel 520 and Intel 530 performance
that triggered many of the different investigations that have taken
place by various folks into SSD flushing behavior on ATA_CMD_FLUSH.  The
gist of it is that the 520 is very fast but probably not safe.  The 530
is safe but not fast.  The DC S3700 (and similar drives with super
capacitors) are thought to be both fast and safe (though some drives
like the Crucial M500 and later misrepresented their power loss
protection so you have to be very careful!)



Yes, these are Intel 530.
I did the tests described in the thread you pasted and unfortunately 
that's my case... I think.


The dd run locally on a mounted ssd partition looks like this:

[root@cf02 journal]# dd if=/dev/zero of=test bs=350k count=10000 
oflag=direct,dsync

10000+0 records in
10000+0 records out
3584000000 bytes (3.6 GB) copied, 211.698 s, 16.9 MB/s

and when I skip the flag dsync it goes fast:

[root@cf02 journal]# dd if=/dev/zero of=test bs=350k count=10000 
oflag=direct

10000+0 records in
10000+0 records out
3584000000 bytes (3.6 GB) copied, 9.05432 s, 396 MB/s

(I used the same 350k block size as mentioned in the e-mail from the 
thread above)


I tried disabling the dsync like this:

[root@cf02 ~]# echo temporary write through > /sys/class/scsi_disk/1\:0\:0\:0/cache_type


[root@cf02 ~]# cat /sys/class/scsi_disk/1\:0\:0\:0/cache_type
write through

..and then locally I see the speedup:

[root@cf02 journal]# dd if=/dev/zero of=test bs=350k count=10000 
oflag=direct,dsync

10000+0 records in
10000+0 records out
3584000000 bytes (3.6 GB) copied, 10.4624 s, 343 MB/s


..but when I test it from a client I still get slow results:

root@cf03:/ceph/tmp# dd if=/dev/zero of=test bs=100M count=100 oflag=direct
100+0 records in
100+0 records out
10485760000 bytes (10 GB) copied, 122.482 s, 85.6 MB/s

and fio gives the same 2-3k iops.

after the change to SSD cache_type I tried remounting the test image, 
recreating it and so on - nothing helped.


I ran rbd bench-write on it, and it's not good either:

root@cf03:~# rbd bench-write t2
bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern seq
  SEC   OPS   OPS/SEC   BYTES/SEC
1  4221   4220.64  32195919.35
2  9628   4813.95  36286083.00
3 15288   4790.90  35714620.49
4 19610   4902.47  36626193.93
5 24844   4968.37  37296562.14
6 30488   5081.31  38112444.88
7 36152   5164.54  38601615.10
8 41479   5184.80  38860207.38
9 46971   5218.70  39181437.52
   10 52219   5221.77  39322641.34
   11 56665   5151.36  38761566.30
   12 62073   5172.71  38855021.35
   13 65962   5073.95  38182880.49
   14 71541   5110.02  38431536.17
   15 77039   5135.85  38615125.42
   16 82133   5133.31  38692578.98
   17 87657   5156.24  38849948.84
   18 92943   5141.03  38635464.85
   19 97528   5133.03  38628548.32
   20 103100   5154.99  38751359.30
   21 108952   5188.09  38944016.94
   22 114511   5205.01  38999594.18
   23 120319   5231.17  39138227.64
   24 125975   5248.92  39195739.46
   25 131438   5257.50  39259023.06
   26 136883   5264.72  39344673.41
   27 142362   5272.66  39381638.20
elapsed:27  ops:   143789  ops/sec:  5273.01  bytes/sec: 39376124.30

rados bench gives:

root@cf03:~# rados -p rbd bench 30 write --no-cleanup
 Maintaining 16 concurrent writes of 4194304 bytes for up to 30 seconds 
or 0 objects

 Object prefix: benchmark_data_cf03_21194
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
 0   0 0 0 0 0 - 0
  1      16        28        12   47.9863        48  0.779211   0.48964
  2      16        43        27   53.9886        60   1.17958  0.775733
  3      16        59        43    57.322        64  0.157145  0.798348
  4      16        73        57   56.9897        56  0.424493  0.862553
  5      16        89        73     58.39        64  0.246444  0.893064
  6      16       104        88   58.6569        60   1.67389  0.901757
  7      16       120       104   59.4186        64   1.78324  0.935242
  8      16       132       116   57.9905        48   1.50035  0.963947
  9      16       147       131   58.2128        60   1.85047  0.978697
 10      16       161       145   57.9908        56  0.133187      0.99
 11      16       174       158   57.4455        52   1.59548   1.02264
 12      16       189       173   57.6577        60  0.179966   1.01623
 13      16       206       190   58.4526        68   1.93064   1.02108
 14      16       221       205   58.5624        60   1.54504   1.02566
15  

Re: [ceph-users] rbd performance issue - can't find bottleneck

2015-06-18 Thread Alexandre DERUMIER
Hi,

for read benchmark

with fio, what is the iodepth ?

my fio 4k randr results with

iodepth=1 : bw=6795.1KB/s, iops=1698
iodepth=2 : bw=14608KB/s, iops=3652
iodepth=4 : bw=32686KB/s, iops=8171
iodepth=8 : bw=76175KB/s, iops=19043
iodepth=16 :bw=173651KB/s, iops=43412
iodepth=32 :bw=336719KB/s, iops=84179

(This should be similar with rados bench -t (threads) option).

This is normal because of network latencies + ceph latencies.
Doing more parallelism increases iops.

(doing a bench with dd = iodepth=1)

These results are with 1 client/rbd volume.


now with more fio clients (numjobs=X)

I can reach up to 300k iops with 8-10 clients.
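
(a minimal sketch of the kind of fio job those numbers come from; the
RBD-backed target path, size and runtime are illustrative:)

fio --name=4k-randread --filename=/mnt/rbd/fio.test --size=4G \
    --ioengine=libaio --direct=1 --rw=randread --bs=4k \
    --iodepth=32 --numjobs=8 --runtime=60 --time_based --group_reporting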


This should be the same when launching multiple rados bench instances in parallel.

(BTW, it could be great to have an option in rados bench to do it)
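
(Until then, a shell loop like this is one way to approximate it - a sketch,
assuming the pool is named rbd and that your rados version supports
--run-name so the benchmark objects of the parallel instances don't collide:)

for i in $(seq 1 8); do
    rados -p rbd bench 30 write -t 16 --run-name bench_$i --no-cleanup &
done
wait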


- Original Message -
From: Jacek Jarosiewicz jjarosiew...@supermedia.pl
To: Mark Nelson mnel...@redhat.com, ceph-users ceph-users@lists.ceph.com
Sent: Thursday, 18 June 2015 11:49:11
Subject: Re: [ceph-users] rbd performance issue - can't find bottleneck

On 06/17/2015 04:19 PM, Mark Nelson wrote: 
 SSD's are INTEL SSDSC2BW240A4 
 
 Ah, if I'm not mistaken that's the Intel 530 right? You'll want to see 
 this thread by Stefan Priebe: 
 
 https://www.mail-archive.com/ceph-users@lists.ceph.com/msg05667.html 
 
 In fact it was the difference in Intel 520 and Intel 530 performance 
 that triggered many of the different investigations that have taken 
 place by various folks into SSD flushing behavior on ATA_CMD_FLUSH. The 
 gist of it is that the 520 is very fast but probably not safe. The 530 
 is safe but not fast. The DC S3700 (and similar drives with super 
 capacitors) are thought to be both fast and safe (though some drives 
 like the Crucial M500 and later misrepresented their power loss 
 protection so you have to be very careful!) 
 

Yes, these are Intel 530. 
I did the tests described in the thread You pasted and unfortunately 
that's my case... I think. 

The dd run locally on a mounted ssd partition looks like this: 

[root@cf02 journal]# dd if=/dev/zero of=test bs=350k count=10000 
oflag=direct,dsync 
10000+0 records in 
10000+0 records out 
3584000000 bytes (3.6 GB) copied, 211.698 s, 16.9 MB/s 

and when I skip the flag dsync it goes fast: 

[root@cf02 journal]# dd if=/dev/zero of=test bs=350k count=10000 
oflag=direct 
10000+0 records in 
10000+0 records out 
3584000000 bytes (3.6 GB) copied, 9.05432 s, 396 MB/s 

(I used the same 350k block size as mentioned in the e-mail from the 
thread above) 

I tried disabling the dsync like this: 

[root@cf02 ~]# echo temporary write through > /sys/class/scsi_disk/1\:0\:0\:0/cache_type 

[root@cf02 ~]# cat /sys/class/scsi_disk/1\:0\:0\:0/cache_type 
write through 

..and then locally I see the speedup: 

[root@cf02 journal]# dd if=/dev/zero of=test bs=350k count=10000 
oflag=direct,dsync 
10000+0 records in 
10000+0 records out 
3584000000 bytes (3.6 GB) copied, 10.4624 s, 343 MB/s 


..but when I test it from a client I still get slow results: 

root@cf03:/ceph/tmp# dd if=/dev/zero of=test bs=100M count=100 oflag=direct 
100+0 records in 
100+0 records out 
10485760000 bytes (10 GB) copied, 122.482 s, 85.6 MB/s 

and fio gives the same 2-3k iops. 

after the change to SSD cache_type I tried remounting the test image, 
recreating it and so on - nothing helped. 

I ran rbd bench-write on it, and it's not good either: 

root@cf03:~# rbd bench-write t2 
bench-write io_size 4096 io_threads 16 bytes 1073741824 pattern seq 
SEC OPS OPS/SEC BYTES/SEC 
1 4221 4220.64 32195919.35 
2 9628 4813.95 36286083.00 
3 15288 4790.90 35714620.49 
4 19610 4902.47 36626193.93 
5 24844 4968.37 37296562.14 
6 30488 5081.31 38112444.88 
7 36152 5164.54 38601615.10 
8 41479 5184.80 38860207.38 
9 46971 5218.70 39181437.52 
10 52219 5221.77 39322641.34 
11 56665 5151.36 38761566.30 
12 62073 5172.71 38855021.35 
13 65962 5073.95 38182880.49 
14 71541 5110.02 38431536.17 
15 77039 5135.85 38615125.42 
16 82133 5133.31 38692578.98 
17 87657 5156.24 38849948.84 
18 92943 5141.03 38635464.85 
19 97528 5133.03 38628548.32 
20 103100 5154.99 38751359.30 
21 108952 5188.09 38944016.94 
22 114511 5205.01 38999594.18 
23 120319 5231.17 39138227.64 
24 125975 5248.92 39195739.46 
25 131438 5257.50 39259023.06 
26 136883 5264.72 39344673.41 
27 142362 5272.66 39381638.20 
elapsed: 27 ops: 143789 ops/sec: 5273.01 bytes/sec: 39376124.30 

rados bench gives: 

root@cf03:~# rados -p rbd bench 30 write --no-cleanup 
Maintaining 16 concurrent writes of 4194304 bytes for up to 30 seconds 
or 0 objects 
Object prefix: benchmark_data_cf03_21194 
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 
0 0 0 0 0 0 - 0 
1 16 28 12 47.9863 48 0.779211 0.48964 
2 16 43 27 53.9886 60 1.17958 0.775733 
3 16 59 43 57.322 64 0.157145 0.798348 
4 16 73 57 56.9897 56 0.424493 0.862553 
5 16 89 73 58.39 64 0.246444 0.893064 
6 16 104 88 58.6569 60 1.67389 0.901757 
7 16 120 104 59.4186 64 1.78324 0.935242 
8 16 132 116 57.9905 48