Re: [ceph-users] New cluster in unhealthy state

2015-06-19 Thread Nick Fisk
Try

ceph osd pool set rbd pgp_num 310
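(That addresses the "pool rbd pg_num 310 > pgp_num 64" warning below: pgp_num was never raised to match pg_num, so the extra PGs are never placed. A minimal sketch of checking and fixing it:)

# check the current values
ceph osd pool get rbd pg_num
ceph osd pool get rbd pgp_num

# bring pgp_num up to match pg_num so the new PGs actually get placed
ceph osd pool set rbd pgp_num 310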

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Dave Durkee
Sent: 19 June 2015 22:31
To: ceph-users@lists.ceph.com
Subject: [ceph-users] New cluster in unhealthy state

 

I just built a small lab cluster.  1 mon node, 3 osd nodes with 3 ceph disks
and 1 os/journal disk, an admin vm and 3 client vm's.

 

I followed the preflight and install instructions and when I finished adding
the osd's I ran a ceph status and got the following:

 

ceph status

cluster b4419183-5320-4701-aae2-eb61e186b443

 health HEALTH_WARN

32 pgs degraded

64 pgs stale

32 pgs stuck degraded

246 pgs stuck inactive

64 pgs stuck stale

310 pgs stuck unclean

32 pgs stuck undersized

32 pgs undersized

pool rbd pg_num 310 > pgp_num 64

 monmap e1: 1 mons at {mon=172.17.1.16:6789/0}

election epoch 2, quorum 0 mon

 osdmap e49: 11 osds: 9 up, 9 in

  pgmap v122: 310 pgs, 1 pools, 0 bytes data, 0 objects

298 MB used, 4189 GB / 4189 GB avail

 246 creating

  32 stale+active+undersized+degraded

  32 stale+active+remapped

 

ceph health

HEALTH_WARN 32 pgs degraded; 64 pgs stale; 32 pgs stuck degraded; 246 pgs
stuck inactive; 64 pgs stuck stale; 310 pgs stuck unclean; 32 pgs stuck
undersized; 32 pgs undersized; pool rbd pg_num 310 > pgp_num 64

 

ceph quorum_status

{"election_epoch":2,"quorum":[0],"quorum_names":["mon"],"quorum_leader_name":"mon",
"monmap":{"epoch":1,"fsid":"b4419183-5320-4701-aae2-eb61e186b443","modified":"0.00",
"created":"0.00","mons":[{"rank":0,"name":"mon","addr":"172.17.1.16:6789\/0"}]}}

 

ceph mon_status

{"name":"mon","rank":0,"state":"leader","election_epoch":2,"quorum":[0],
"outside_quorum":[],"extra_probe_peers":[],"sync_provider":[],"monmap":{"epoch":1,
"fsid":"b4419183-5320-4701-aae2-eb61e186b443","modified":"0.00","created":"0.00",
"mons":[{"rank":0,"name":"mon","addr":"172.17.1.16:6789\/0"}]}}

 

ceph osd tree

ID WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY 

-1 4.94997 root default 

-2 2.24998 host osd1

 0 0.45000 osd.0   down  0  1.0 

 1 0.45000 osd.1   down  0  1.0 

 2 0.45000 osd.2   up  1.0  1.0 

 3 0.45000 osd.3   up  1.0  1.0 

10 0.45000 osd.10  up  1.0  1.0 

-3 1.34999 host osd2

 4 0.45000 osd.4   up  1.0  1.0 

 5 0.45000 osd.5   up  1.0  1.0 

 6 0.45000 osd.6   up  1.0  1.0 

-4 1.34999 host osd3

 7 0.45000 osd.7   up  1.0  1.0 

 8 0.45000 osd.8   up  1.0  1.0 

 9 0.45000 osd.9   up  1.0  1.0

 

 

Admin-node:

[root@admin test-cluster]# cat ceph.conf

[global]

auth_service_required = cephx

filestore_xattr_use_omap = true

auth_client_required = cephx

auth_cluster_required = cephx

mon_host = 172.17.1.16

mon_initial_members = mon

fsid = b4419183-5320-4701-aae2-eb61e186b443

osd pool default size = 2

public network = 172.17.1.0/24

cluster network = 10.0.0.0/24

 

 

How do I diagnose and solve the cluster health issue?  Do you need any
additional information to help with the diag process?

 

Thanks!!

 

Dave




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rados gateway to use ec pools

2015-06-19 Thread Somnath Roy
Just configure '.rgw.buckets' as an EC pool and the rest of the rgw pools should
be replicated.
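
A minimal sketch of that layout, assuming a hypothetical EC profile and placeholder PG counts (adjust both for your cluster):

# erasure-code profile for the bucket data pool
ceph osd erasure-code-profile set rgw-ec k=4 m=1 ruleset-failure-domain=host

# object data pool: erasure coded
ceph osd pool create .rgw.buckets 256 256 erasure rgw-ec

# index and the other rgw pools stay replicated, e.g.
ceph osd pool create .rgw.buckets.index 64 64 replicated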

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Deneau, Tom
Sent: Friday, June 19, 2015 2:31 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] rados gateway to use ec pools

what is the correct way to make radosgw create its pools as erasure coded pools?

-- Tom Deneau, AMD
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph EC pool performance benchmarking, high latencies.

2015-06-19 Thread Nick Fisk
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Mark Nelson
 Sent: 19 June 2015 13:44
 To: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Ceph EC pool performance benchmarking, high
 latencies.
 
 On 06/19/2015 07:28 AM, MATHIAS, Bryn (Bryn) wrote:
  Hi All,
 
  I am currently benchmarking CEPH to work out the correct read / write
 model, to get the optimal cluster throughput and latency.
 
  For the moment I am writing 4Mb files to an EC 4+1 pool with a randomised
 name using the rados python interface.
 
  Load generation is happening on external machines.
 
  Write generation is characterised as the number of IOContexts and the
 number of simultaneous async writes on those contexts.
  With one machine, IOContexts threads and 50 simultaneous writes per
 context I achieve over 300 seconds:
 
  Percentile 5 = 0.133775639534
  Percentile 10 = 0.178686833382
  Percentile 15 = 0.180827605724
  Percentile 20 = 0.185487747192
  Percentile 25 = 0.229317903519
  Percentile 30 = 0.23066740036
  Percentile 35 = 0.232764816284
  Percentile 40 = 0.278827047348
  Percentile 45 = 0.280579996109
  Percentile 50 = 0.283169865608
  Percentile 55 = 0.329843044281
  Percentile 60 = 0.332481050491
  Percentile 65 = 0.380337607861
  Percentile 70 = 0.428911447525
  Percentile 75 = 0.438932359219
  Percentile 80 = 0.530071306229
  Percentile 85 = 0.597331762314
  Percentile 90 = 0.735066819191
  Percentile 95 = 1.08006491661
  Percentile 100 = 11.7352428436
  Max latancies = 11.7352428436, Min = 0.0499050617218, mean =
  0.43913059745 Total objects writen = 24552 in time 302.979903936s
  gives 81.0350775118/s (324.140310047 MB/s)
 
 
 
   From two load generators on separate machines I achieve:
 
 
  Percentile 5 = 0.228541088104
  Percentile 10 = 0.23213224411
  Percentile 15 = 0.279508590698
  Percentile 20 = 0.28137254715
  Percentile 25 = 0.328829288483
  Percentile 30 = 0.330499911308
  Percentile 35 = 0.334045898914
  Percentile 40 = 0.380131435394
  Percentile 45 = 0.382810294628
  Percentile 50 = 0.430188417435
  Percentile 55 = 0.43399245739
  Percentile 60 = 0.48120136261
  Percentile 65 = 0.530511438847
  Percentile 70 = 0.580485081673
  Percentile 75 = 0.631661534309
  Percentile 80 = 0.728989124298
  Percentile 85 = 0.830820584297
  Percentile 90 = 1.03238985538
  Percentile 95 = 1.62925363779
  Percentile 100 = 32.5414278507
  Max latancies = 32.5414278507, Min = 0.0375339984894, mean =
  0.863403101415 Total objects writen = 12714 in time 325.92741394s
  gives 39.0086855422/s (156.034742169 MB/s)
 
 
  Percentile 5 = 0.229072237015
  Percentile 10 = 0.247376871109
  Percentile 15 = 0.280901908875
  Percentile 20 = 0.329082489014
  Percentile 25 = 0.331234931946
  Percentile 30 = 0.379406833649
  Percentile 35 = 0.381390666962
  Percentile 40 = 0.429595994949
  Percentile 45 = 0.43164896965
  Percentile 50 = 0.480262041092
  Percentile 55 = 0.529169607162
  Percentile 60 = 0.533170747757
  Percentile 65 = 0.582635164261
  Percentile 70 = 0.634325170517
  Percentile 75 = 0.72939991951
  Percentile 80 = 0.829002094269
  Percentile 85 = 0.931713819504
  Percentile 90 = 1.18014221191
  Percentile 95 = 2.08048944473
  Percentile 100 = 31.1357450485
  Max latancies = 31.1357450485, Min = 0.0553231239319, mean =
  1.03054529335 Total objects writen = 10769 in time 328.515608788s
  gives 32.7807863978/s (131.123145591 MB/s)
 
  Total = 278Mb/s
 
 
  The combined test has much higher latencies and a less than half
 throughput per box.
 
  If I scale this up to 5 nodes all generating load I see the throughput drop 
  to
 ~50MB/s and latencies up to 60 seconds.
 
 
  An example slow write from dump_historic_ops is:
 
   description: osd_op(client.1892123.0:1525
 \/c18\/vx1907\/kDDb\/180\/4935.ts [] 6.f4d68aae
 ack+ondisk+write+known_if_redirected e523),
   initiated_at: 2015-06-19 12:37:54.698848,
   age: 578.438516,
   duration: 38.399151,
   type_data: [
   commit sent; apply or cleanup,
   {
   client: client.1892123,
   tid: 1525
   },
   [
   {
   time: 2015-06-19 12:37:54.698848,
   event: initiated
   },
   {
   time: 2015-06-19 12:37:54.856361,
   event: reached_pg
   },
   {
   time: 2015-06-19 12:37:55.095731,
   event: started
   },
   {
   time: 2015-06-19 12:37:55.103645,
   event: started
   },
   {
   time: 2015-06-19 12:37:55.104125,
   event: commit_queued_for_journal_write

Re: [ceph-users] Very chatty MON logs: Is this normal?

2015-06-19 Thread Joao Eduardo Luis
On 06/19/2015 11:16 AM, Daniel Schneller wrote:
 On 2015-06-18 09:53:54 +, Joao Eduardo Luis said:
 
 Setting 'mon debug = 0/5' should be okay.  Unless you see that setting
 '/5' impacts your performance and/or memory consumption, you should
 leave that be.  '0/5' means 'output only debug 0 or lower to the logs;
 keep the last 1000 debug level 5 or lower in memory in case of a crash'.
 Your logs will not be as heavily populated but, if for some reason the
 daemon crashes, you get quite a few of debug information to help track
 down the source of the problem.
 
 Great, will do.
 
 Just for my understanding re/ memory: If this is a ring
 buffer for the last 10,000 events, shouldn't that be a somewhat fixed amount
 of memory? How would it negatively affect the MON's consumption? Assuming
 it works that way, once they have been running for a few days or weeks,
 these buffers would be full of events anyway, just more aged ones if
 the memory level was lower?
 
 Daniel

From briefly taking a peek at 'src/log/*', this looks like it is a
linked list rather than a ring buffer.

So, given it will always be capped at 10k events, there's a fixed amount
of memory it will consume in the worst case (when you have 10k events).

But if you have bare minimum activity in the logs, said memory
consumption should be lower, or at most slowly growing as the queue grows.

Although it was not obvious, my initial thought was that someone with
debug levels set at 0/0 would certainly be surprised if, after setting
0/5, the daemon's memory consumption started to grow.  In retrospect,
10k log messages should not take more than a handful of MBs, and should
not have any impact at all as long as you're not provisioning your
monitor's memory in the dozens of MBs.
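
For reference, a minimal ceph.conf sketch of that setting (the option is usually spelled "debug mon"; the section placement below is an assumption, put it wherever your mon options live):

[mon]
    # log only level 0 to disk; keep the most recent level <= 5 messages
    # in memory so they are dumped if the daemon crashes
    debug mon = 0/5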

  -Joao



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] New cluster in unhealthy state

2015-06-19 Thread Dave Durkee
I just built a small lab cluster.  1 mon node, 3 osd nodes with 3 ceph disks 
and 1 os/journal disk, an admin vm and 3 client vm's.

I followed the preflight and install instructions and when I finished adding 
the osd's I ran a ceph status and got the following:

ceph status
cluster b4419183-5320-4701-aae2-eb61e186b443
 health HEALTH_WARN
32 pgs degraded
64 pgs stale
32 pgs stuck degraded
246 pgs stuck inactive
64 pgs stuck stale
310 pgs stuck unclean
32 pgs stuck undersized
32 pgs undersized
pool rbd pg_num 310 > pgp_num 64
 monmap e1: 1 mons at {mon=172.17.1.16:6789/0}
election epoch 2, quorum 0 mon
 osdmap e49: 11 osds: 9 up, 9 in
  pgmap v122: 310 pgs, 1 pools, 0 bytes data, 0 objects
298 MB used, 4189 GB / 4189 GB avail
 246 creating
  32 stale+active+undersized+degraded
  32 stale+active+remapped

ceph health
HEALTH_WARN 32 pgs degraded; 64 pgs stale; 32 pgs stuck degraded; 246 pgs stuck 
inactive; 64 pgs stuck stale; 310 pgs stuck unclean; 32 pgs stuck undersized; 
32 pgs undersized; pool rbd pg_num 310 > pgp_num 64

ceph quorum_status
{"election_epoch":2,"quorum":[0],"quorum_names":["mon"],"quorum_leader_name":"mon","monmap":{"epoch":1,"fsid":"b4419183-5320-4701-aae2-eb61e186b443","modified":"0.00","created":"0.00","mons":[{"rank":0,"name":"mon","addr":"172.17.1.16:6789\/0"}]}}

ceph mon_status
{"name":"mon","rank":0,"state":"leader","election_epoch":2,"quorum":[0],"outside_quorum":[],"extra_probe_peers":[],"sync_provider":[],"monmap":{"epoch":1,"fsid":"b4419183-5320-4701-aae2-eb61e186b443","modified":"0.00","created":"0.00","mons":[{"rank":0,"name":"mon","addr":"172.17.1.16:6789\/0"}]}}

ceph osd tree
ID WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 4.94997 root default
-2 2.24998 host osd1
 0 0.45000 osd.0   down  0  1.0
 1 0.45000 osd.1   down  0  1.0
 2 0.45000 osd.2   up  1.0  1.0
 3 0.45000 osd.3   up  1.0  1.0
10 0.45000 osd.10  up  1.0  1.0
-3 1.34999 host osd2
 4 0.45000 osd.4   up  1.0  1.0
 5 0.45000 osd.5   up  1.0  1.0
 6 0.45000 osd.6   up  1.0  1.0
-4 1.34999 host osd3
 7 0.45000 osd.7   up  1.0  1.0
 8 0.45000 osd.8   up  1.0  1.0
 9 0.45000 osd.9   up  1.0  1.0


Admin-node:
[root@admin test-cluster]# cat ceph.conf
[global]
auth_service_required = cephx
filestore_xattr_use_omap = true
auth_client_required = cephx
auth_cluster_required = cephx
mon_host = 172.17.1.16
mon_initial_members = mon
fsid = b4419183-5320-4701-aae2-eb61e186b443
osd pool default size = 2
public network = 172.17.1.0/24
cluster network = 10.0.0.0/24


How do I diagnose and solve the cluster health issue?  Do you need any 
additional information to help with the diag process?

Thanks!!

Dave
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rados gateway to use ec pools

2015-06-19 Thread Deneau, Tom
what is the correct way to make radosgw create its pools as erasure coded pools?

-- Tom Deneau, AMD
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Build latest KRBD module

2015-06-19 Thread Vasiliy Angapov
Hi, guys!

Do we have any procedure on how to build the latest KRBD module? I think it
will be helpful to many people here.

Regards, Vasily.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs unmounts itself from time to time

2015-06-19 Thread Roland Giesler
On 19 June 2015 at 13:46, Gregory Farnum g...@gregs42.com wrote:

 On Thu, Jun 18, 2015 at 10:15 PM, Roland Giesler rol...@giesler.za.net
 wrote:
  On 15 June 2015 at 13:09, Gregory Farnum g...@gregs42.com wrote:
 
  On Mon, Jun 15, 2015 at 4:03 AM, Roland Giesler rol...@giesler.za.net
  wrote:
   I have a small cluster of 4 machines and quite a few drives.  After
   about 2-3 weeks cephfs fails.  It's not properly mounted anymore in
   /mnt/cephfs, which of course causes the VM's running to fail too.

<snip>


 
 
  I'm under the impression that CephFS is the filesystem implimented by
  ceph-fuse. Is it not?

 Of course it is, but it's a different implementation than the kernel
 client and often has different bugs. ;) Plus you can get a newer
 version of it easily.


Let me look into it and see how it might help me.


   Other than that, can you include more
  information about exactly what you mean when saying CephFS unmounts
  itself?
 
 
  Everything runs fine for weeks.  Then suddenly a user reports that a VM
 is
  not functioning anymore.  On investigation is transpires than CephFS is
 not
  mounted anymore and the error I reported is logged.
 
  I can't see anything else wrong at this stage.  ceph is running, the osd
 are
  all up.

 Maybe one of our kernel devs has a better idea but I've no clue how to
 debug this if you can't give me any information about how CephFS came
 to be unmounted. It just doesn't make any sense to me. :(


I'll go through the logs again and find the point where it happens and
post it.

- Roland
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] reversing the removal of an osd (re-adding osd)

2015-06-19 Thread Jelle de Jong
Hello everybody,

I'm doing some experiments and I am trying to re-add a removed osd. I
removed it with the below five commands.

http://ceph.com/docs/master/rados/operations/add-or-rm-osds/

ceph osd out 5
/etc/init.d/ceph stop osd.5
ceph osd crush remove osd.5
ceph auth del osd.5
ceph osd rm 5

I think I added the auth back correctly, but I can't figure out the right
crush add commands?

ceph auth add osd.5 osd 'allow *' mon 'allow rwx' -i
/var/lib/ceph/osd/ceph-5/keyring

root@ceph03:~# /etc/init.d/ceph start osd.5
=== osd.5 ===
Error ENOENT: osd.5 does not exist.  create it before updating the crush map
failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.5
--keyring=/var/lib/ceph/osd/ceph-5/keyring osd crush create-or-move -- 5
0.91 host=ceph03 root=default'

Can somebody show me some examples of the right commands to re-add?

Kind regards,

Jelle de Jong
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] reversing the removal of an osd (re-adding osd)

2015-06-19 Thread Jelle de Jong
On 19/06/15 16:07, Jelle de Jong wrote:
 Hello everybody,
 
 I'm doing some experiments and I am trying to re-add an removed osd. I
 removed it with the bellow five commands.
 
 http://ceph.com/docs/master/rados/operations/add-or-rm-osds/
 
 ceph osd out 5
 /etc/init.d/ceph stop osd.5
 ceph osd crush remove osd.5
 ceph auth del osd.5
 ceph osd rm 5
 
 I think I added the auth back correctly, but I cant figure out the right
 crush add commands?
 
 ceph auth add osd.5 osd 'allow *' mon 'allow rwx' -i
 /var/lib/ceph/osd/ceph-5/keyring
 
 root@ceph03:~# /etc/init.d/ceph start osd.5
 === osd.5 ===
 Error ENOENT: osd.5 does not exist.  create it before updating the crush map
 failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.5
 --keyring=/var/lib/ceph/osd/ceph-5/keyring osd crush create-or-move -- 5
 0.91 host=ceph03 root=default'
 
 Can somebody show me some examples of the right commands to re-add?

I figured it out myself :)

root@ceph03:~# ceph osd create
5
root@ceph03:~# ceph osd crush add 5 0.0 host=ceph03 root=default
add item id 5 name 'osd.5' weight 0 at location
{host=ceph03,root=default} to crush map

root@ceph03:~# /etc/init.d/ceph start osd.5
=== osd.5 ===
create-or-move updated item name 'osd.5' weight 0.91 at location
{host=ceph03,root=default} to crush map
Starting Ceph osd.5 on ceph03...
starting osd.5 at :/0 osd_data /var/lib/ceph/osd/ceph-5
/var/lib/ceph/osd/ceph-5/journal
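
For anyone else following along, the whole re-add sequence condensed from the above (ids, weights and host names are the ones from this cluster; adjust for yours):

# recreate the OSD id (reuses the lowest free id, here 5)
ceph osd create

# re-register the key from the still-existing data directory
ceph auth add osd.5 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-5/keyring

# put the OSD back into the CRUSH map (weight can start at 0)
ceph osd crush add 5 0.0 host=ceph03 root=default

# start it; the init script bumps the CRUSH weight via create-or-move
/etc/init.d/ceph start osd.5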

Kind regards,

Jelle de Jong
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] incomplete pg, recovery some data

2015-06-19 Thread Mykola Golub
On Thu, Jun 18, 2015 at 01:24:38PM +0200, Mateusz Skała wrote:
 Hi,
 
 After some hardware errors one of pg in our backup server is 'incomplete'.
 
 I do export pg without problems like here:
 https://ceph.com/community/incomplete-pgs-oh-my/
 
 After removing the pg from all osd's and importing it to one of the osds,
 the pg is still 'incomplete'.
 
 I want to recover only some piece of data from this rbd, so if I lose
 something then nothing bad happens. How can I tell ceph to accept this pg as
 complete and clean?

I have a patch for ceph-objectstore-tool, which adds mark-complete operation,
as it has been suggested by Sam in http://tracker.ceph.com/issues/10098

https://github.com/ceph/ceph/pull/5031

It has not been reviewed yet and not tested well though, because I
don't know a simple way to get an incomplete pg.

You might want to try it at your own risk.
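
If you do try it, the invocation would presumably look something like the sketch below; the op name comes from the pull request, the paths and pgid are placeholders, and the OSD must be stopped first:

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
    --journal-path /var/lib/ceph/osd/ceph-12/journal \
    --pgid 6.12 --op mark-complete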

-- 
Mykola Golub
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: Re: Unexpected disk write activity with btrfs OSDs

2015-06-19 Thread Lionel Bouton
On 06/19/15 13:42, Burkhard Linke wrote:

 Forget the reply to the list...

  Forwarded Message 
 Subject:  Re: [ceph-users] Unexpected disk write activity with btrfs OSDs
 Date: Fri, 19 Jun 2015 09:06:33 +0200
 From: Burkhard Linke burkhard.li...@computational.bio.uni-giessen.de
 To:   Lionel Bouton lionel+c...@bouton.name



 Hi,

 On 06/18/2015 11:28 PM, Lionel Bouton wrote:
  Hi,
 *snipsnap*

  - Disks with btrfs OSD have a spike of activity every 30s (2 intervals
  of 10s with nearly 0 activity, one interval with a total amount of
  writes of ~120MB). The averages are : 4MB/s, 100 IO/s.

 Just a guess:

 btrfs has a commit interval which defaults to 30 seconds.

 You can verify this by changing the interval with the commit=XYZ mount 
 option.

I know and I tested commit intervals of 60 and 120 seconds without any
change. As this is directly linked to filestore max sync interval I
didn't report this test result.

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Explanation for ceph osd set nodown and ceph osd cluster_snap

2015-06-19 Thread Carsten Schmitt

Hi Jan,

On 06/18/2015 12:48 AM, Jan Schermer wrote:

1) Flags available in ceph osd set are

pause|noup|nodown|noout|noin|nobackfill|norecover|noscrub|nodeep-scrub|notieragent

I know or can guess most of them (the docs are a “bit” lacking)

But with "ceph osd set nodown" I have no idea what it should be used for
- to keep hammering a faulty OSD?


I only know the documentation for this one:
http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
You can set an OSD to nodown if you know for certain that it is not 
faulty but it gets set to this state by the monitor because of problems 
with the cluster network.
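
A quick sketch of toggling it (and remembering to clear it again afterwards):

# stop the monitors from marking OSDs down while the network is being fixed
ceph osd set nodown

# ... fix the cause of the flapping ...

# restore normal failure detection
ceph osd unset nodown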


Cheers,
Carsten



2) looking through the docs there I found reference to "ceph osd
cluster_snap"
http://ceph.com/docs/v0.67.9/rados/operations/control/

what does it do? how does that work? does it really work? ;-) I got a
few hits on google which suggest it might not be something that really
works, but looks like something we could certainly use

Thanks

Jan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com








___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph EC pool performance benchmarking, high latencies.

2015-06-19 Thread MATHIAS, Bryn (Bryn)
Hi All,

I am currently benchmarking CEPH to work out the correct read / write model, to 
get the optimal cluster throughput and latency.

For the moment I am writing 4Mb files to an EC 4+1 pool with a randomised name 
using the rados python interface.

Load generation is happening on external machines.

Write generation is characterised as the number of IOContexts and the number of 
simultaneous async writes on those contexts.
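
(For anyone wanting to approximate this load without the Python harness, a rough rados bench sketch; the pool name and runtime are assumptions:)

# 300 s of 4 MB object writes with 50 concurrent ops against the EC pool
rados bench -p ecpool 300 write -t 50 -b 4194304 --no-cleanup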
With one machine, IOContexts threads and 50 simultaneous writes per context I 
achieve over 300 seconds:

Percentile 5 = 0.133775639534
Percentile 10 = 0.178686833382
Percentile 15 = 0.180827605724
Percentile 20 = 0.185487747192
Percentile 25 = 0.229317903519
Percentile 30 = 0.23066740036
Percentile 35 = 0.232764816284
Percentile 40 = 0.278827047348
Percentile 45 = 0.280579996109
Percentile 50 = 0.283169865608
Percentile 55 = 0.329843044281
Percentile 60 = 0.332481050491
Percentile 65 = 0.380337607861
Percentile 70 = 0.428911447525
Percentile 75 = 0.438932359219
Percentile 80 = 0.530071306229
Percentile 85 = 0.597331762314
Percentile 90 = 0.735066819191
Percentile 95 = 1.08006491661
Percentile 100 = 11.7352428436
Max latancies = 11.7352428436, Min = 0.0499050617218, mean = 0.43913059745
Total objects writen = 24552 in time 302.979903936s gives 81.0350775118/s 
(324.140310047 MB/s)



From two load generators on separate machines I achieve:


Percentile 5 = 0.228541088104
Percentile 10 = 0.23213224411
Percentile 15 = 0.279508590698
Percentile 20 = 0.28137254715
Percentile 25 = 0.328829288483
Percentile 30 = 0.330499911308
Percentile 35 = 0.334045898914
Percentile 40 = 0.380131435394
Percentile 45 = 0.382810294628
Percentile 50 = 0.430188417435
Percentile 55 = 0.43399245739
Percentile 60 = 0.48120136261
Percentile 65 = 0.530511438847
Percentile 70 = 0.580485081673
Percentile 75 = 0.631661534309
Percentile 80 = 0.728989124298
Percentile 85 = 0.830820584297
Percentile 90 = 1.03238985538
Percentile 95 = 1.62925363779
Percentile 100 = 32.5414278507
Max latancies = 32.5414278507, Min = 0.0375339984894, mean = 0.863403101415
Total objects writen = 12714 in time 325.92741394s gives 39.0086855422/s 
(156.034742169 MB/s)


Percentile 5 = 0.229072237015
Percentile 10 = 0.247376871109
Percentile 15 = 0.280901908875
Percentile 20 = 0.329082489014
Percentile 25 = 0.331234931946
Percentile 30 = 0.379406833649
Percentile 35 = 0.381390666962
Percentile 40 = 0.429595994949
Percentile 45 = 0.43164896965
Percentile 50 = 0.480262041092
Percentile 55 = 0.529169607162
Percentile 60 = 0.533170747757
Percentile 65 = 0.582635164261
Percentile 70 = 0.634325170517
Percentile 75 = 0.72939991951
Percentile 80 = 0.829002094269
Percentile 85 = 0.931713819504
Percentile 90 = 1.18014221191
Percentile 95 = 2.08048944473
Percentile 100 = 31.1357450485
Max latancies = 31.1357450485, Min = 0.0553231239319, mean = 1.03054529335
Total objects writen = 10769 in time 328.515608788s gives 32.7807863978/s 
(131.123145591 MB/s)

Total = 278Mb/s 


The combined test has much higher latencies and a less than half throughput per 
box.

If I scale this up to 5 nodes all generating load I see the throughput drop to 
~50MB/s and latencies up to 60 seconds.


An example slow write from dump_historic_ops is:

description: osd_op(client.1892123.0:1525 
\/c18\/vx1907\/kDDb\/180\/4935.ts [] 6.f4d68aae 
ack+ondisk+write+known_if_redirected e523),
initiated_at: 2015-06-19 12:37:54.698848,
age: 578.438516,
duration: 38.399151,
type_data: [
commit sent; apply or cleanup,
{
client: client.1892123,
tid: 1525
},
[
{
time: 2015-06-19 12:37:54.698848,
event: initiated
},
{
time: 2015-06-19 12:37:54.856361,
event: reached_pg
},
{
time: 2015-06-19 12:37:55.095731,
event: started
},
{
time: 2015-06-19 12:37:55.103645,
event: started
},
{
time: 2015-06-19 12:37:55.104125,
event: commit_queued_for_journal_write
},
{
time: 2015-06-19 12:37:55.104900,
event: write_thread_in_journal_buffer
},
{
time: 2015-06-19 12:37:55.106112,
event: journaled_completion_queued
},
{
time: 2015-06-19 12:37:55.107065,
event: sub_op_committed
},
{
time: 2015-06-19 

Re: [ceph-users] cephfs unmounts itself from time to time

2015-06-19 Thread Gregory Farnum
On Thu, Jun 18, 2015 at 10:15 PM, Roland Giesler rol...@giesler.za.net wrote:
 On 15 June 2015 at 13:09, Gregory Farnum g...@gregs42.com wrote:

 On Mon, Jun 15, 2015 at 4:03 AM, Roland Giesler rol...@giesler.za.net
 wrote:
  I have a small cluster of 4 machines and quite a few drives.  After
  about 2
  - 3 weeks cephfs fails.  It's not properly mounted anymore in
  /mnt/cephfs,
  which of course causes the VM's running to fail too.
 
  In /var/log/syslog I have /mnt/cephfs: File exists at
  /usr/share/perl5/PVE/Storage/DirPlugin.pm line 52 repeatedly.
 
  There doesn't seem to be anything wrong with ceph at the time.
 
  # ceph -s
  cluster 40f26838-4760-4b10-a65c-b9c1cd671f2f
   health HEALTH_WARN clock skew detected on mon.s1
   monmap e2: 2 mons at
  {h1=192.168.121.30:6789/0,s1=192.168.121.33:6789/0}, election epoch 312,
  quorum 0,1 h1,s1
   mdsmap e401: 1/1/1 up {0=s3=up:active}, 1 up:standby
   osdmap e5577: 19 osds: 19 up, 19 in
pgmap v11191838: 384 pgs, 3 pools, 774 GB data, 455 kobjects
  1636 GB used, 9713 GB / 11358 GB avail
   384 active+clean
client io 12240 kB/s rd, 1524 B/s wr, 24 op/s
  # ceph osd tree
  # id  weight  type name       up/down  reweight
  -1    11.13   root default
  -2     8.14   host h1
   1     0.9    osd.1    up      1
   3     0.9    osd.3    up      1
   4     0.9    osd.4    up      1
   5     0.68   osd.5    up      1
   6     0.68   osd.6    up      1
   7     0.68   osd.7    up      1
   8     0.68   osd.8    up      1
   9     0.68   osd.9    up      1
  10     0.68   osd.10   up      1
  11     0.68   osd.11   up      1
  12     0.68   osd.12   up      1
  -3     0.45   host s3
   2     0.45   osd.2    up      1
  -4     0.9    host s2
  13     0.9    osd.13   up      1
  -5     1.64   host s1
  14     0.29   osd.14   up      1
   0     0.27   osd.0    up      1
  15     0.27   osd.15   up      1
  16     0.27   osd.16   up      1
  17     0.27   osd.17   up      1
  18     0.27   osd.18   up      1
 
  When I umount -l /mnt/cephfs and then mount -a after that, the the
  ceph
  volume is loaded again.  I can restart the VM's and all seems well.
 
  I can't find errors pertaining to cephfs in the the other logs either.
 
  System information:
 
  Linux s1 2.6.32-34-pve #1 SMP Fri Dec 19 07:42:04 CET 2014 x86_64
  GNU/Linux

 I'm not sure what version of Linux this really is (I assume it's a
 vendor kernel of some kind!), but it's definitely an old one! CephFS
 sees pretty continuous improvements to stability and it could be any
 number of resolved bugs.


 This is the stock standard installation of Proxmox with CephFS.



 If you can't upgrade the kernel, you might try out the ceph-fuse
 client instead as you can run a much newer and more up-to-date version
 of it, even on the old kernel.


 I'm under the impression that CephFS is the filesystem implimented by
 ceph-fuse. Is it not?

Of course it is, but it's a different implementation than the kernel
client and often has different bugs. ;) Plus you can get a newer
version of it easily.
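
(For reference, mounting through the FUSE client is a one-liner; the monitor address below is the one from your ceph -s output:)

# unmount the kernel client first, then:
ceph-fuse -m 192.168.121.30:6789 /mnt/cephfs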

 Other than that, can you include more
 information about exactly what you mean when saying CephFS unmounts
 itself?


 Everything runs fine for weeks.  Then suddenly a user reports that a VM is
  not functioning anymore.  On investigation it transpires that CephFS is not
 mounted anymore and the error I reported is logged.

 I can't see anything else wrong at this stage.  ceph is running, the osd are
 all up.

Maybe one of our kernel devs has a better idea but I've no clue how to
debug this if you can't give me any information about how CephFS came
to be unmounted. It just doesn't make any sense to me. :(
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Very chatty MON logs: Is this normal?

2015-06-19 Thread Daniel Schneller

On 2015-06-18 09:53:54 +, Joao Eduardo Luis said:


Setting 'mon debug = 0/5' should be okay.  Unless you see that setting
'/5' impacts your performance and/or memory consumption, you should
leave that be.  '0/5' means 'output only debug 0 or lower to the logs;
keep the last 1000 debug level 5 or lower in memory in case of a crash'.
Your logs will not be as heavily populated but, if for some reason the
daemon crashes, you get quite a few of debug information to help track
down the source of the problem.


Great, will do.

Just for my understanding re/ memory: If this is a ring
buffer for the last 10,000 events, shouldn't that be a somewhat fixed amount
of memory? How would it negatively affect the MON's consumption? Assuming
it works that way, once they have been running for a few days or weeks,
these buffers would be full of events anyway, just more aged ones if
the memory level was lower?

Daniel


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fwd: Re: Unexpected disk write activity with btrfs OSDs

2015-06-19 Thread Burkhard Linke


Forget the reply to the list...

 Forwarded Message 
Subject:Re: [ceph-users] Unexpected disk write activity with btrfs OSDs
Date:   Fri, 19 Jun 2015 09:06:33 +0200
From:   Burkhard Linke burkhard.li...@computational.bio.uni-giessen.de
To: Lionel Bouton lionel+c...@bouton.name



Hi,

On 06/18/2015 11:28 PM, Lionel Bouton wrote:

Hi,

*snipsnap*


- Disks with btrfs OSD have a spike of activity every 30s (2 intervals
of 10s with nearly 0 activity, one interval with a total amount of
writes of ~120MB). The averages are : 4MB/s, 100 IO/s.


Just a guess:

btrfs has a commit interval which defaults to 30 seconds.

You can verify this by changing the interval with the commit=XYZ mount
option.
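
A quick sketch of that test, assuming the OSD is mounted at the usual path:

# bump the btrfs commit interval on a mounted OSD and watch whether
# the 30 s write spikes move accordingly
mount -o remount,commit=60 /var/lib/ceph/osd/ceph-0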

Best regards,
Burkhard



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] qemu jemalloc patch

2015-06-19 Thread Alexandre DERUMIER
Hi,

I have send a patch to qemu devel mailing list to add support jemalloc linking

http://lists.nongnu.org/archive/html/qemu-devel/2015-06/msg05265.html

Help is welcome to get it upstream !

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RadosGW Performance

2015-06-19 Thread Stuart Harland
I'm trying to evaluate various object stores/distributed file systems for
use in our company and have a little experience of using Ceph in the past.
However I'm running into a few issues when running some benchmarks against
RadosGW.


Basically my script is pretty dumb, but it captures one of our primary use
cases reasonably accurately -  it iteratively copies files repeatedly onto
a different key either in s3, or to a hierarchical directory structure on a
block device (eg 000/000/000/001/1.jpg) where the directory is a key. When
adding to an s3-esque object store, it uses the same scheme to generate the
key for the file.
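
(A stripped-down sketch of that loop against radosgw, here using s3cmd purely for illustration -- the endpoint, bucket and tool are assumptions, not what the original script used:)

# hypothetical: radosgw endpoint and credentials already configured in ~/.s3cfg
for i in $(seq 1 1000); do
    s3cmd put sample.jpg "s3://benchmark/000/000/000/001/${i}.jpg"
done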

Now when running this script against an RBD volume I get high hundreds of
MB/s throughput quite happily particularly if I run the process in parallel
(forking the process multiple times). However if I try to bludgeon the
script to use the s3 interface via radosgw, everything grinds to a halt
(read 0.5MB/s throughput per fork). This is a problem. I don't believe that
the discrepancy is due to anything other than a misconfiguration.

The test cluster is running with 3 nodes, 86 drives/OSDs each (they are
currently 6tb). Our use case requires the storage density to be high. HW
wise, there is 256GB Ram with 2 12Core E5-2690 v3 @ 2.60GHz, so more than
enough CPU/Ram capacity.

Currently I have RadosGW running on one of the nodes with Apache 2.4.7
acting as the proxy.

Any suggestions/pointers would be more than welcome, as ceph is high on our
list of favourites due to its feature set. It definitely should be
performing faster than this.

Regards

Stuart Harland
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd performance issue - can't find bottleneck

2015-06-19 Thread Andrei Mikhailovsky
Hi guys,

I also use a combination of intel 520 and 530 for my journals and have noticed 
that the latency and the speed of 520s is better than 530s. 

Could someone please confirm that doing the following at start up will stop the 
dsync on the relevant drives?

# echo "temporary write through" > /sys/class/scsi_disk/1\:0\:0\:0/cache_type

Do I need to patch my kernel for this or is this already implementable in 
vanilla? I am running 3.19.x branch from ubuntu testing repo.

Would the above change the performance of 530s to be more like 520s?

Cheers

Andrei



- Original Message -
 From: Alexandre DERUMIER aderum...@odiso.com
 To: Jacek Jarosiewicz jjarosiew...@supermedia.pl
 Cc: ceph-users ceph-users@lists.ceph.com
 Sent: Thursday, 18 June, 2015 11:54:42 AM
 Subject: Re: [ceph-users] rbd performance issue - can't find bottleneck
 
 Hi,
 
 for read benchmark
 
 with fio, what is the iodepth ?
 
 my fio 4k randr results with
 
 iodepth=1 : bw=6795.1KB/s, iops=1698
 iodepth=2 : bw=14608KB/s, iops=3652
 iodepth=4 : bw=32686KB/s, iops=8171
 iodepth=8 : bw=76175KB/s, iops=19043
 iodepth=16 :bw=173651KB/s, iops=43412
 iodepth=32 :bw=336719KB/s, iops=84179
 
 (This should be similar with rados bench -t (threads) option).
 
 This is normal because of network latencies + ceph latencies.
 Doing more parallism increase iops.
 
 (doing a bench with dd = iodepth=1)
 
 Theses result are with 1 client/rbd volume.
 
 
 now with more fio client (numjobs=X)
 
 I can reach up to 300kiops with 8-10 clients.
 
 
 This should be the same with lauching multiple rados bench in parallel
 
 (BTW, it could be great to have an option in rados bench to do it)
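
 For reference, a single-command fio sketch of the iodepth scaling described above, using the rbd ioengine (pool, image and client names are assumptions):

 fio --name=rbd-iodepth-test --ioengine=rbd --clientname=admin \
     --pool=rbd --rbdname=test --rw=randread --bs=4k \
     --iodepth=16 --numjobs=1 --runtime=60 --time_based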
 
 
 - Mail original -
 De: Jacek Jarosiewicz jjarosiew...@supermedia.pl
 À: Mark Nelson mnel...@redhat.com, ceph-users
 ceph-users@lists.ceph.com
 Envoyé: Jeudi 18 Juin 2015 11:49:11
 Objet: Re: [ceph-users] rbd performance issue - can't find bottleneck
 
 On 06/17/2015 04:19 PM, Mark Nelson wrote:
  SSD's are INTEL SSDSC2BW240A4
  
  Ah, if I'm not mistaken that's the Intel 530 right? You'll want to see
  this thread by Stefan Priebe:
  
  https://www.mail-archive.com/ceph-users@lists.ceph.com/msg05667.html
  
  In fact it was the difference in Intel 520 and Intel 530 performance
  that triggered many of the different investigations that have taken
  place by various folks into SSD flushing behavior on ATA_CMD_FLUSH. The
  gist of it is that the 520 is very fast but probably not safe. The 530
  is safe but not fast. The DC S3700 (and similar drives with super
  capacitors) are thought to be both fast and safe (though some drives
  like the crucual M500 and later misrepresented their power loss
  protection so you have to be very careful!)
  
 
 Yes, these are Intel 530.
 I did the tests described in the thread You pasted and unfortunately
 that's my case... I think.
 
 The dd run locally on a mounted ssd partition looks like this:
 
 [root@cf02 journal]# dd if=/dev/zero of=test bs=350k count=1
 oflag=direct,dsync
 1+0 records in
 1+0 records out
 358400 bytes (3.6 GB) copied, 211.698 s, 16.9 MB/s
 
 and when I skip the flag dsync it goes fast:
 
 [root@cf02 journal]# dd if=/dev/zero of=test bs=350k count=1
 oflag=direct
 1+0 records in
 1+0 records out
 358400 bytes (3.6 GB) copied, 9.05432 s, 396 MB/s
 
 (I used the same 350k block size as mentioned in the e-mail from the
 thread above)
 
 I tried disabling the dsync like this:
 
 [root@cf02 ~]# echo "temporary write through" > /sys/class/scsi_disk/1\:0\:0\:0/cache_type
 
 [root@cf02 ~]# cat /sys/class/scsi_disk/1\:0\:0\:0/cache_type
 write through
 
 ..and then locally I see the speedup:
 
 [root@cf02 journal]# dd if=/dev/zero of=test bs=350k count=1
 oflag=direct,dsync
 1+0 records in
 1+0 records out
 358400 bytes (3.6 GB) copied, 10.4624 s, 343 MB/s
 
 
 ..but when I test it from a client I still get slow results:
 
 root@cf03:/ceph/tmp# dd if=/dev/zero of=test bs=100M count=100 oflag=direct
 100+0 records in
 100+0 records out
 1048576 bytes (10 GB) copied, 122.482 s, 85.6 MB/s
 
 and fio gives the same 2-3k iops.
 
 after the change to SSD cache_type I tried remounting the test image,
 recreating it and so on - nothing helped.
 
 I ran rbd bench-write on it, and it's not good either:
 
 root@cf03:~# rbd bench-write t2
 bench-write io_size 4096 io_threads 16 bytes 1073741824 pattern seq
 SEC OPS OPS/SEC BYTES/SEC
 1 4221 4220.64 32195919.35
 2 9628 4813.95 36286083.00
 3 15288 4790.90 35714620.49
 4 19610 4902.47 36626193.93
 5 24844 4968.37 37296562.14
 6 30488 5081.31 38112444.88
 7 36152 5164.54 38601615.10
 8 41479 5184.80 38860207.38
 9 46971 5218.70 39181437.52
 10 52219 5221.77 39322641.34
 11 5 5151.36 38761566.30
 12 62073 5172.71 38855021.35
 13 65962 5073.95 38182880.49
 14 71541 5110.02 38431536.17
 15 77039 5135.85 38615125.42
 16 82133 5133.31 38692578.98
 17 87657 5156.24 38849948.84
 18 92943 5141.03 38635464.85
 19 97528 5133.03 

[ceph-users] EC on 1.1PB?

2015-06-19 Thread Sean

*

I am looking to use Ceph using EC on a few leftover storage servers (36 
disk supermicro servers with dual xeon sockets and around 256Gb of ram). 
I did a small test using one node and using the ISA library and noticed 
that the CPU load was pretty spikey for just normal operation.



Does anyone have any experience running Ceph EC on around 216 to 270 4TB 
disks? I'm looking  to yield around 680 TB to 1PB if possible. just 
putting my feelers out there to see if anyone else has had any 
experience and looking for any guidance.*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Block Size

2015-06-19 Thread Garg, Pankaj
Hi,

I have been formatting my OSD drives with XFS (using mkfs.xfs) with default
options. Is it recommended for Ceph to choose a bigger block size?
I'd like to understand the impact of block size. Any recommendations?
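
For what it's worth, a sketch of the commonly used defaults: the XFS block size is normally left at 4 KB, and only the inode size and mount options are tweaked (device and mount point below are placeholders):

# larger inodes give Ceph's xattrs a better chance of staying inline
mkfs.xfs -f -i size=2048 /dev/sdb1

# typical mount options for an XFS OSD
mount -o noatime,inode64 /dev/sdb1 /var/lib/ceph/osd/ceph-0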

Thanks
Pankaj
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] EC on 1.1PB?

2015-06-19 Thread Lincoln Bryant
Hi Sean,

We have ~1PB of EC storage using Dell R730xd servers with 6TB OSDs. We've got 
our erasure coding profile set up to be k=10,m=3 which gives us a very 
reasonable chunk of the raw storage with nice resiliency.
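
For anyone wanting to reproduce that layout, a minimal sketch (the profile name, pool name and PG count are placeholders):

ceph osd erasure-code-profile set ec-10-3 k=10 m=3 ruleset-failure-domain=host
ceph osd pool create ecpool 2048 2048 erasure ec-10-3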

I found that CPU usage was significantly higher in EC, but not so much as to be 
problematic. Additionally, EC performance was about 40% of replicated pool 
performance in our testing. 

With 36-disk servers you'll probably need to make sure you do the usual kernel 
tweaks like increasing the max number of file descriptors, etc. 

Cheers,
Lincoln

On Jun 19, 2015, at 10:36 AM, Sean wrote:

 I am looking to use Ceph using EC on a few leftover storage servers (36 disk 
 supermicro servers with dual xeon sockets and around 256Gb of ram). I did a 
 small test using one node and using the ISA library and noticed that the CPU 
 load was pretty spikey for just normal operation.
 
 Does anyone have any experience running Ceph EC on around 216 to 270 4TB 
 disks? I'm looking  to yield around 680 TB to 1PB if possible. just putting 
 my feelers out there to see if anyone else has had any experience and looking 
 for any guidance.
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd performance issue - can't find bottleneck

2015-06-19 Thread Mark Nelson



On 06/19/2015 11:19 AM, Andrei Mikhailovsky wrote:


Mark, thanks for putting it down this way. It does make sense.

Does it mean that having the Intel 520s, which bypass the dsync is theat to the 
data stored on the journals?


I'm not sure if anyone has ever 100% conclusively shown that this is 
what they are doing, but given their performance that's the current 
theory.  I still use them in our test lab because we've got a bunch of 
them and they are reasonably close in terms of performance to the DC 
S3700, but I'd be very concerned using them in a production environment 
for real data.




I do have a few of these installed, alongside with 530s. I did not plan to 
replace them just yet. Would it make more sense to get a small battery 
protected raid card in front of the 520s and 530s to protect against these 
types of scenarios?


Maybe, but only if you can disable all of the on-disk cache.  Since the 
drive itself is (probably) doing bad things, you are kind of at its
mercy and who knows what other demons lurk.  I'd be wary.




Cheers

- Original Message -

From: Mark Nelson mnel...@redhat.com
To: Andrei Mikhailovsky and...@arhont.com
Cc: ceph-users@lists.ceph.com
Sent: Friday, 19 June, 2015 5:08:31 PM
Subject: Re: [ceph-users] rbd performance issue - can't find bottleneck

On 06/19/2015 10:29 AM, Andrei Mikhailovsky wrote:

Mark,

Thanks, I do understand that there is a risk of data loss by doing this.
Having said this, ceph is designed to be fault tollerant and self
repairing should something happen to individual journals, osds and server
nodes. Isn't this a still good measure to compromise between data
integrity and speed? So, by faking dsync and not actually doing this, you
have a window of opportunity to data loss should a failure happen between
the last flash and the moment of failure.

Thus, if the ssd disk failure happens, regardless if dsync is used or not,
would ceph still consider the osds behind the journal to be
unavailable/lost and migrate the data around anyway and perform the
necessary checks to make sure the data integrity is not compromised? If
this is true, I would still consider using the dsync bypass in favour of
the extra speed benefit. Unless I am missing a bigger picture and
miscalculated something.

Could someone please elaborate on this a bit further to understand the
realy world threat of not using the dsync bypass?


Hi Andrei,

Basically the entire point of the Ceph journal is to guarantee that data
hits a persistent medium before the write gets acknowledged.  Imagine a
scenario where you lose power just as the write happens.

Scenario A:  You have proper O_DSYNC writes.  In this case, assuming the
SSD is behaving properly, you can be fairly confident that the write to
the local journal succeeded (or not).

Scenario B: You bypass O_DSYNC.  The journal write completes quickly,
but it's not actually written out to flash, just to the drive cache.  If
the SSD has power loss protection it can theoretically write that data
out to the flash before it losses power.  For this reason, drives with
PLP can often perform O_DSYNC writes very quickly even without this hack
(ie it can ignore ATA_CMD_FLUSH).

For a drive like the 530 without PLP, there's no guarantee that the data
in cache will hit the flash.  Ceph will *think* it did though, and the
risk is worse because the write completes so fast.  Now you have a
scenario where ceph thinks something exists but it really doesn't (or
exists in a corrupted state).  This leads to all sorts of problems.  If
another OSD goes down and you have two copies of the data that disagree
with each other, what do you do?  What if not all of the replica writes
succeeded but you have a copy of the data on the primary?  Can you trust
it?  Everything starts breaking down.

Mark



Cheers

Andrei


- Original Message -

From: Mark Nelson mnel...@redhat.com
To: ceph-users@lists.ceph.com
Sent: Friday, 19 June, 2015 3:59:55 PM
Subject: Re: [ceph-users] rbd performance issue - can't find bottleneck



On 06/19/2015 09:54 AM, Andrei Mikhailovsky wrote:

Hi guys,

I also use a combination of intel 520 and 530 for my journals and have
noticed that the latency and the speed of 520s is better than 530s.

Could someone please confirm that doing the following at start up will
stop
the dsync on the relevant drives?

 # echo "temporary write through" > /sys/class/scsi_disk/1\:0\:0\:0/cache_type

Do I need to patch my kernel for this or is this already implementable in
vanilla? I am running 3.19.x branch from ubuntu testing repo.

Would the above change the performance of 530s to be more like 520s?


I need to comment that it's *really* not a good idea to do this if you
care about data integrity.  There's a reason why the 530 is slower than
the 520.  If you need speed and you care about your data, you should
really consider jumping up to the DC S3700.

There's a possibility that the 730 *may* be ok as it supposedly has
power loss protection, but it's 

Re: [ceph-users] rbd performance issue - can't find bottleneck

2015-06-19 Thread Mark Nelson

On 06/19/2015 10:29 AM, Andrei Mikhailovsky wrote:

Mark,

Thanks, I do understand that there is a risk of data loss by doing this. Having 
said this, ceph is designed to be fault tollerant and self repairing should 
something happen to individual journals, osds and server nodes. Isn't this a 
still good measure to compromise between data integrity and speed? So, by 
faking dsync and not actually doing this, you have a window of opportunity to 
data loss should a failure happen between the last flash and the moment of 
failure.

Thus, if the ssd disk failure happens, regardless if dsync is used or not, 
would ceph still consider the osds behind the journal to be unavailable/lost 
and migrate the data around anyway and perform the necessary checks to make 
sure the data integrity is not compromised? If this is true, I would still 
consider using the dsync bypass in favour of the extra speed benefit. Unless I 
am missing a bigger picture and miscalculated something.

Could someone please elaborate on this a bit further to understand the realy 
world threat of not using the dsync bypass?


Hi Andrei,

Basically the entire point of the Ceph journal is to guarantee that data 
hits a persistent medium before the write gets acknowledged.  Imagine a 
scenario where you lose power just as the write happens.


Scenario A:  You have proper O_DSYNC writes.  In this case, assuming the 
SSD is behaving properly, you can be fairly confident that the write to 
the local journal succeeded (or not).


Scenario B: You bypass O_DSYNC.  The journal write completes quickly, 
but it's not actually written out to flash, just to the drive cache.  If 
the SSD has power loss protection it can theoretically write that data 
out to the flash before it losses power.  For this reason, drives with 
PLP can often perform O_DSYNC writes very quickly even without this hack 
(ie it can ignore ATA_CMD_FLUSH).


For a drive like the 530 without PLP, there's no guarantee that the data 
in cache will hit the flash.  Ceph will *think* it did though, and the 
risk is worse because the write completes so fast.  Now you have a 
scenario where ceph thinks something exists but it really doesn't (or 
exists in a corrupted state).  This leads to all sorts of problems.  If 
another OSD goes down and you have two copies of the data that disagree 
with each other, what do you do?  What if not all of the replica writes 
succeeded but you have a copy of the data on the primary?  Can you trust 
it?  Everything starts breaking down.
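
(A quick way to see the difference described above on a given SSD is to compare synchronous and non-synchronous 4 KB write rates; a hedged fio sketch, with the test file path as a placeholder -- sync=1 uses O_SYNC, which is close enough for this comparison:)

# synchronous writes -- roughly what the journal relies on
fio --name=sync-test --filename=/mnt/ssd/fio.tmp --size=1G \
    --rw=write --bs=4k --direct=1 --sync=1 --iodepth=1 --numjobs=1

# the same test without sync -- the "fast but unsafe" number
fio --name=nosync-test --filename=/mnt/ssd/fio.tmp --size=1G \
    --rw=write --bs=4k --direct=1 --iodepth=1 --numjobs=1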


Mark



Cheers

Andrei


- Original Message -

From: Mark Nelson mnel...@redhat.com
To: ceph-users@lists.ceph.com
Sent: Friday, 19 June, 2015 3:59:55 PM
Subject: Re: [ceph-users] rbd performance issue - can't find bottleneck



On 06/19/2015 09:54 AM, Andrei Mikhailovsky wrote:

Hi guys,

I also use a combination of intel 520 and 530 for my journals and have
noticed that the latency and the speed of 520s is better than 530s.

Could someone please confirm that doing the following at start up will stop
the dsync on the relevant drives?

# echo "temporary write through" > /sys/class/scsi_disk/1\:0\:0\:0/cache_type

Do I need to patch my kernel for this or is this already implementable in
vanilla? I am running 3.19.x branch from ubuntu testing repo.

Would the above change the performance of 530s to be more like 520s?


I need to comment that it's *really* not a good idea to do this if you
care about data integrity.  There's a reason why the 530 is slower than
the 520.  If you need speed and you care about your data, you should
really consider jumping up to the DC S3700.

There's a possibility that the 730 *may* be ok as it supposedly has
power loss protection, but it's still not using HET MLC so the flash
cells will wear out faster.  It's also a consumer grade drive, so no one
will give you support for this kind of use case if you have problems.

Mark



Cheers

Andrei



- Original Message -

From: Alexandre DERUMIER aderum...@odiso.com
To: Jacek Jarosiewicz jjarosiew...@supermedia.pl
Cc: ceph-users ceph-users@lists.ceph.com
Sent: Thursday, 18 June, 2015 11:54:42 AM
Subject: Re: [ceph-users] rbd performance issue - can't find bottleneck

Hi,

for read benchmark

with fio, what is the iodepth ?

my fio 4k randr results with

iodepth=1 : bw=6795.1KB/s, iops=1698
iodepth=2 : bw=14608KB/s, iops=3652
iodepth=4 : bw=32686KB/s, iops=8171
iodepth=8 : bw=76175KB/s, iops=19043
iodepth=16 :bw=173651KB/s, iops=43412
iodepth=32 :bw=336719KB/s, iops=84179

(This should be similar with rados bench -t (threads) option).

This is normal because of network latencies + ceph latencies.
Doing more parallism increase iops.

(doing a bench with dd = iodepth=1)

Theses result are with 1 client/rbd volume.


now with more fio client (numjobs=X)

I can reach up to 300kiops with 8-10 clients.


This should be the same with lauching multiple rados bench in parallel

(BTW, it could be great to have an option in rados bench to do it)


- Mail 

[ceph-users] fail OSD prepare

2015-06-19 Thread Jaemyoun Lee
I am following the quick doc.

Everything was successful up to "Adding the initial monitor".
So I made the osd folders (/var/local/osd0, osd10, osd20) on each node
(csAnt, csBull, csCat) and ran ceph-deploy to prepare the OSDs.

But the below error was occurred.

---
jae@csElsa:~$ ceph-deploy osd prepare csAnt:/var/local/osd0
csBull:/var/local/osd10 csCat:/var/local/osd20
[ceph_deploy.conf][DEBUG ] found configuration file at:
/home/jae/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.25): /usr/bin/ceph-deploy osd
prepare csAnt:/var/local/osd0 csBull:/var/local/osd10 csCat:/var/local/osd20
[ceph_deploy][ERROR ] ConfigError: Cannot load config: [Errno 2] No such
file or directory: 'ceph.conf'; has `ceph-deploy new` been run in this
directory?
---

Should I do something that is not in the quick doc before preparing
the OSDs?
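
If `ceph-deploy new` was already run, the usual fix is just to invoke the prepare step from the directory that holds the generated ceph.conf; a sketch (the directory name is a placeholder):

cd ~/test-cluster       # the directory where `ceph-deploy new <mon-host>` was run
ls ceph.conf            # must exist here
ceph-deploy osd prepare csAnt:/var/local/osd0 csBull:/var/local/osd10 csCat:/var/local/osd20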

-- 
  Jaemyoun Lee

  CPS Lab. ( Cyber-Physical Systems Laboratory in Hanyang University)
  E-mail : jm...@cpslab.hanyang.ac.kr
  Homepage : http://cpslab.hanyang.ac.kr
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd performance issue - can't find bottleneck

2015-06-19 Thread Shane Gibson

All - I have been following this thread for a bit, and am happy to see how
involved, capable, and collaborative this ceph-users community seems
to be.  It appears there is a fairly strong amount of domain knowledge
around the hardware used by many Ceph deployments, with a lot of thumbs
up and thumbs down sort of experience based on bugs, problems, issues,
configuration landmines to avoid, etc...

Is there somewhere that community experience with hardware like this is
being tracked?  Not necessarily a full blown HWCL (hardware compability
list), but maybe a more cohesive list of controllers, SSD/Spinning disks,
and the community lessons learned (like when to or not to use TRIM, silent
corruption, etc...)???

It seems like this is all extremely valuable information as new
operators like myself come into the picture...  Yes, one can mine the
email archives ... 

Thanks! 
~~shane



On 6/19/15, 9:08 AM, ceph-users on behalf of Mark Nelson
ceph-users-boun...@lists.ceph.com on behalf of mnel...@redhat.com wrote:


 Would the above change the performance of 530s to be more like 520s?

 I need to comment that it's *really* not a good idea to do this if you
 care about data integrity.  There's a reason why the 530 is slower than
 the 520.  If you need speed and you care about your data, you should
 really consider jumping up to the DC S3700.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd performance issue - can't find bottleneck

2015-06-19 Thread Andrei Mikhailovsky

Mark, thanks for putting it down this way. It does make sense.

Does it mean that having the Intel 520s, which bypass the dsync, is a threat to the
data stored on the journals?

I do have a few of these installed, alongside with 530s. I did not plan to 
replace them just yet. Would it make more sense to get a small battery 
protected raid card in front of the 520s and 530s to protect against these 
types of scenarios?

Cheers

- Original Message -
 From: Mark Nelson mnel...@redhat.com
 To: Andrei Mikhailovsky and...@arhont.com
 Cc: ceph-users@lists.ceph.com
 Sent: Friday, 19 June, 2015 5:08:31 PM
 Subject: Re: [ceph-users] rbd performance issue - can't find bottleneck
 
 On 06/19/2015 10:29 AM, Andrei Mikhailovsky wrote:
  Mark,
 
  Thanks, I do understand that there is a risk of data loss by doing this.
  Having said this, ceph is designed to be fault tollerant and self
  repairing should something happen to individual journals, osds and server
  nodes. Isn't this a still good measure to compromise between data
  integrity and speed? So, by faking dsync and not actually doing this, you
  have a window of opportunity to data loss should a failure happen between
  the last flash and the moment of failure.
 
  Thus, if the ssd disk failure happens, regardless if dsync is used or not,
  would ceph still consider the osds behind the journal to be
  unavailable/lost and migrate the data around anyway and perform the
  necessary checks to make sure the data integrity is not compromised? If
  this is true, I would still consider using the dsync bypass in favour of
  the extra speed benefit. Unless I am missing a bigger picture and
  miscalculated something.
 
  Could someone please elaborate on this a bit further to understand the
  realy world threat of not using the dsync bypass?
 
 Hi Andrei,
 
 Basically the entire point of the Ceph journal is to guarantee that data
 hits a persistent medium before the write gets acknowledged.  Imagine a
 scenario where you lose power just as the write happens.
 
 Scenario A:  You have proper O_DSYNC writes.  In this case, assuming the
 SSD is behaving properly, you can be fairly confident that the write to
 the local journal succeeded (or not).
 
 Scenario B: You bypass O_DSYNC.  The journal write completes quickly,
 but it's not actually written out to flash, just to the drive cache.  If
 the SSD has power loss protection it can theoretically write that data
 out to the flash before it loses power.  For this reason, drives with
 PLP can often perform O_DSYNC writes very quickly even without this hack
 (ie it can ignore ATA_CMD_FLUSH).
 
 For a drive like the 530 without PLP, there's no guarantee that the data
 in cache will hit the flash.  Ceph will *think* it did though, and the
 risk is worse because the write completes so fast.  Now you have a
 scenario where ceph thinks something exists but it really doesn't (or
 exists in a corrupted state).  This leads to all sorts of problems.  If
 another OSD goes down and you have two copies of the data that disagree
 with each other, what do you do?  What if not all of the replica writes
 succeeded but you have a copy of the data on the primary?  Can you trust
 it?  Everything starts breaking down.
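 
 As a side note, if you want to see how a given drive behaves here, a
 single-threaded O_DSYNC write test is a reasonable sanity check.  The fio run
 below is only a sketch - /dev/sdX is a placeholder and the test overwrites the
 device, so only point it at a disk you can afford to wipe:
 
 fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
     --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based
 
 Drives that honour flushes will typically show far lower numbers here than
 their datasheet IOPS; drives that quietly ignore them look suspiciously fast.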
 
 Mark
 
 
  Cheers
 
  Andrei
 
 
  - Original Message -
  From: Mark Nelson mnel...@redhat.com
  To: ceph-users@lists.ceph.com
  Sent: Friday, 19 June, 2015 3:59:55 PM
  Subject: Re: [ceph-users] rbd performance issue - can't find bottleneck
 
 
 
  On 06/19/2015 09:54 AM, Andrei Mikhailovsky wrote:
  Hi guys,
 
  I also use a combination of Intel 520s and 530s for my journals and have
  noticed that the latency and speed of the 520s are better than those of the 530s.
 
  Could someone please confirm that doing the following at startup will
  disable dsync on the relevant drives?
  
  # echo "temporary write through" > /sys/class/scsi_disk/1\:0\:0\:0/cache_type
 
  Do I need to patch my kernel for this, or is it already available in the
  vanilla kernel? I am running the 3.19.x branch from the Ubuntu testing repo.
 
  Would the above change the performance of 530s to be more like 520s?
 
  I need to comment that it's *really* not a good idea to do this if you
  care about data integrity.  There's a reason why the 530 is slower than
  the 520.  If you need speed and you care about your data, you should
  really consider jumping up to the DC S3700.
 
  There's a possibility that the 730 *may* be ok as it supposedly has
  power loss protection, but it's still not using HET MLC so the flash
  cells will wear out faster.  It's also a consumer grade drive, so no one
  will give you support for this kind of use case if you have problems.
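  
  (If you do run consumer drives, it is at least worth watching their wear
  indicators.  A rough sketch with smartmontools - the exact attribute name
  varies by vendor, so treat the grep pattern as an assumption:
  
  smartctl -a /dev/sdX | grep -i wear
  
  On Intel SSDs this usually surfaces as Media_Wearout_Indicator.)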
 
  Mark
 
 
  Cheers
 
  Andrei
 
 
 
  - Original Message -
  From: Alexandre DERUMIER aderum...@odiso.com
  To: Jacek Jarosiewicz jjarosiew...@supermedia.pl
  Cc: ceph-users ceph-users@lists.ceph.com
  Sent: Thursday, 18 June, 2015 11:54:42 AM
  Subject: Re: [ceph-users] rbd performance issue - can't find 

Re: [ceph-users] EC on 1.1PB?

2015-06-19 Thread Sean
Thanks Lincoln! May I ask how many drives you have per storage node and how 
many threads you have available? I.e., are you using hyper-threading, and do 
you have more than 24 disks per node in your cluster? I noticed with our 
replicated cluster that more disks == more PGs == more CPU/RAM, and with 24+ 
disks this ends up causing issues in some cases. So a 3-node cluster with 70 
disks each is fine, but scaling up to 21 nodes I see issues, even with 
connections, pids, and file descriptors turned up. Are you using just jerasure, 
or have you tried the ISA plugin as well?


Sorry for bombarding you with questions - I am just curious where the 40% 
performance figure comes from.


On 06/19/2015 11:05 AM, Lincoln Bryant wrote:

Hi Sean,

We have ~1PB of EC storage using Dell R730xd servers with 6TB 
OSDs. We've got our erasure coding profile set up to be k=10,m=3 which 
gives us a very reasonable chunk of the raw storage with nice resiliency.
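
For reference, a profile along those lines is created roughly as follows - the 
profile name, pool name and PG count below are placeholders rather than our 
actual settings:

ceph osd erasure-code-profile set ec-k10m3 k=10 m=3 ruleset-failure-domain=host
ceph osd pool create ecpool 4096 4096 erasure ec-k10m3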


I found that CPU usage was significantly higher in EC, but not so much 
as to be problematic. Additionally, EC performance was about 40% of 
replicated pool performance in our testing.


With 36-disk servers you'll probably need to make sure you do the 
usual kernel tweaks like increasing the max number of file 
descriptors, etc.


Cheers,
Lincoln

On Jun 19, 2015, at 10:36 AM, Sean wrote:


I am looking to use Ceph with EC on a few leftover storage servers 
(36-disk Supermicro servers with dual Xeon sockets and around 256 GB 
of RAM). I did a small test using one node with the ISA library 
and noticed that the CPU load was pretty spiky just for normal 
operation.


Does anyone have any experience running Ceph EC on around 216 to 270 
4 TB disks? I'm looking to yield around 680 TB to 1 PB if possible. 
Just putting my feelers out there to see if anyone else has had any 
experience and looking for any guidance.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Block Size

2015-06-19 Thread Somnath Roy
Pankaj,
I think Linux will not allow an XFS block size larger than the page size. If 
you want a block size bigger than 4K, you would need to rebuild the kernel, I 
guess. I am also not sure whether there are any internal settings (or grub 
parameters) to tweak the page size at boot time.
It is recommended (or at least best practice) to use a larger inode size and 
the inode64 mount option.
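
For example, something along these lines in ceph.conf covers both at OSD 
creation time (just a sketch - adjust the inode size to your needs):

[osd]
osd mkfs options xfs = -f -i size=2048
osd mount options xfs = rw,noatime,inode64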

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Garg, 
Pankaj
Sent: Friday, June 19, 2015 9:59 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Block Size

Hi,

I have been formatting my OSD drives with XFS (using mkfs.xfs) with default 
options. Is it recommended for Ceph to choose a bigger block size?
I'd like to understand the impact of block size. Any recommendations?

Thanks
Pankaj




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] EC on 1.1PB?

2015-06-19 Thread Lincoln Bryant
We're running 12 OSDs per node, with 32 hyper-threaded CPUs available. We 
over-provisioned the CPUs because we would like to additionally run jobs from 
our batch system and isolate them via cgroups (we're a high-throughput 
computing facility). With a total of ~13000 PGs across a few pools, I'm 
seeing about 1GB of resident memory per OSD. As far as EC plugins go, we're 
using jerasure and haven't experimented with others.

That said, in our use case we're using CephFS, so we're fronting the 
erasure-coded pool with a cache tier. The cache pool is limited to 5TB, and 
right now usage is light enough that most operations live in the cache tier and 
rarely get flushed out to the EC pool. I'm sure as we bring more users onto 
this, there will be some more tweaking to do.
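
Roughly, the tiering setup looks like the following - the pool names and the 
target size here are illustrative rather than our exact configuration:

ceph osd tier add ecpool cachepool
ceph osd tier cache-mode cachepool writeback
ceph osd tier set-overlay ecpool cachepool
ceph osd pool set cachepool hit_set_type bloom
ceph osd pool set cachepool target_max_bytes 5497558138880   # ~5 TB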

As far as performance goes, you might want to read Mark Nelson's excellent 
document about EC performance under Firefly. If you search the list archives, 
he sent a mail in February titled "Erasure Coding CPU Overhead Data". I can 
forward you the PDF off-list if you would like.

--Lincoln

On Jun 19, 2015, at 12:42 PM, Sean wrote:

 Thanks Lincoln! May I ask how many drives you have per storage node and how 
 many threads you have available? I.e., are you using hyper-threading, and do you 
 have more than 24 disks per node in your cluster? I noticed with our 
 replicated cluster that more disks == more PGs == more CPU/RAM, and with 24+ disks 
 this ends up causing issues in some cases. So a 3-node cluster with 70 disks 
 each is fine, but scaling up to 21 nodes I see issues, even with connections, 
 pids, and file descriptors turned up. Are you using just jerasure, or have you 
 tried the ISA plugin as well? 
 
 Sorry for bombarding you with questions - I am just curious where the 40% 
 performance figure comes from.
 
 On 06/19/2015 11:05 AM, Lincoln Bryant wrote:
 Hi Sean,
 
 We have ~1PB of EC storage using Dell R730xd servers with 6TB OSDs. We've 
 got our erasure coding profile set up to be k=10,m=3 which gives us a very 
 reasonable chunk of the raw storage with nice resiliency.
 
 I found that CPU usage was significantly higher in EC, but not so much as to 
 be problematic. Additionally, EC performance was about 40% of replicated 
 pool performance in our testing. 
 
 With 36-disk servers you'll probably need to make sure you do the usual 
 kernel tweaks like increasing the max number of file descriptors, etc. 
 
 Cheers,
 Lincoln
 
 On Jun 19, 2015, at 10:36 AM, Sean wrote:
 
  I am looking to use Ceph with EC on a few leftover storage servers (36-disk 
 Supermicro servers with dual Xeon sockets and around 256 GB of RAM). I 
 did a small test using one node with the ISA library and noticed that 
 the CPU load was pretty spiky just for normal operation.
 
 Does anyone have any experience running Ceph EC on around 216 to 270 4 TB 
 disks? I'm looking to yield around 680 TB to 1 PB if possible. Just putting 
 my feelers out there to see if anyone else has had any experience and 
 looking for any guidance.
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Unexpected period of iowait, no obvious activity?

2015-06-19 Thread Daniel Schneller

Hi!

Recently over a few hours our 4 Ceph disk nodes showed unusually high
and somewhat constant iowait times. Cluster runs 0.94.1 on Ubuntu
14.04.1.

It started on one node, then - with maybe a 15-minute delay each time - on the
next and the next one. The overall duration of the phenomenon was about 90
minutes on each machine, with the nodes finishing in the same order they had started.

We could not see any obvious cluster activity during that time, and the
applications did not do anything out of the ordinary. Scrubbing and deep
scrubbing were turned off long before this happened.

We are using CephFS for shared administrator home directories on the
system, RBD volumes for OpenStack and the Rados Gateway to manage
application data via the Swift interface. Telemetry and logs from inside
the VMs did not offer an explanation either.

The fact that these readings were limited to the OSD hosts and did not appear
on any of the other (client) nodes in the system suggests this must be some
kind of Ceph behaviour. Any ideas? We would like to understand what the system
was doing, but haven't found anything obvious in the logs.
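
If it happens again we plan to capture something like the following on the
affected hosts while the iowait is high (a rough sketch - osd.N would be
whichever OSD the node shows as busiest):

iostat -xk 2                         # which devices are actually busy
iotop -obd 5                         # which processes generate the I/O
ceph daemon osd.N perf dump          # per-OSD internal counters
ceph daemon osd.N dump_historic_ops  # recent slow ops on that OSD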

Thanks!
Daniel


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com