Re: [ceph-users] Two osds are spaming dmesg every 900 seconds

2014-08-26 Thread Gregory Farnum
This is being output by one of the kernel clients, and it's just
saying that the connections to those two OSDs have died from
inactivity. Either the other OSD connections are used a lot more, or
aren't used at all.

In any case, it's not a problem; just a noisy notification. There's
not much you can do about it; sorry.
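
(For reference: on the client host you can see the kernel client's sessions and whether
they are really idle via debugfs -- a rough sketch, assuming debugfs is mounted at
/sys/kernel/debug:)

  ls /sys/kernel/debug/ceph/            # one directory per mounted kernel-client instance
  cat /sys/kernel/debug/ceph/*/osdc     # in-flight OSD requests (normally empty when idle)
  dmesg | grep 'socket closed'          # the notifications in question
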
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Mon, Aug 25, 2014 at 12:01 PM, Andrei Mikhailovsky and...@arhont.com wrote:
 Hello

 I am seeing this message every 900 seconds on the osd servers. My dmesg 
 output is all filled with:

 [256627.683702] libceph: osd3 192.168.168.200:6821 socket closed (con state 
 OPEN)
 [256627.687663] libceph: osd6 192.168.168.200:6841 socket closed (con state 
 OPEN)


 Looking at the ceph-osd logs I see the following at the same time:

 2014-08-25 19:48:14.869145 7f0752125700  0 -- 192.168.168.200:6821/4097 >>
 192.168.168.200:0/2493848861 pipe(0x13b43c80 sd=92 :6821 s=0 pgs=0 cs=0 l=0
 c=0x16a606e0).accept peer addr is really 192.168.168.200:0/2493848861 (socket
 is 192.168.168.200:54457/0)


 This happens only on two osds and the rest of osds seem fine. Does anyone 
 know why am I seeing this and how to correct it?

 Thanks

 Andrei


Re: [ceph-users] Ceph-fuse fails to mount

2014-08-26 Thread Gregory Farnum
In particular, we changed things post-Firefly so that the filesystem
isn't created automatically. You'll need to set it up (and its pools,
etc) explicitly to use it.
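
(Roughly, assuming a recent development release where 'ceph fs new' exists; pool names
and PG counts below are only examples:)

  ceph osd pool create cephfs_data 128
  ceph osd pool create cephfs_metadata 128
  ceph fs new cephfs cephfs_metadata cephfs_data
  ceph mds stat      # should eventually show up:active
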
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Mon, Aug 25, 2014 at 2:40 PM, Sean Crosby
richardnixonsh...@gmail.com wrote:
 Hi James,


 On 26 August 2014 07:17, LaBarre, James (CTR) A6IT james.laba...@cigna.com
 wrote:



 [ceph@first_cluster ~]$ ceph -s

 cluster e0433b49-d64c-4c3e-8ad9-59a47d84142d

  health HEALTH_OK

  monmap e1: 1 mons at {first_cluster=10.25.164.192:6789/0}, election
 epoch 2, quorum 0 first_cluster

  mdsmap e4: 1/1/1 up {0=first_cluster=up:active}

  osdmap e13: 3 osds: 3 up, 3 in

   pgmap v480: 192 pgs, 3 pools, 1417 MB data, 4851 objects

 19835 MB used, 56927 MB / 76762 MB avail

  192 active+clean


 This cluster has an MDS. It should mount.




 [ceph@second_cluster ~]$ ceph -s

 cluster 06f655b7-e147-4790-ad52-c57dcbf160b7

  health HEALTH_OK

  monmap e1: 1 mons at {second_cluster=10.25.165.91:6789/0}, election
 epoch 1, quorum 0 cilsdbxd1768

  osdmap e16: 7 osds: 7 up, 7 in

   pgmap v539: 192 pgs, 3 pools, 0 bytes data, 0 objects

 252 MB used, 194 GB / 194 GB avail

  192 active+clean


 No mdsmap line for this cluster, and therefore the filesystem won't mount.
 Have you added an MDS for this cluster, or has the mds daemon died? You'll
 have to get the mdsmap line to show before it will mount

 Sean




Re: [ceph-users] MDS dying on Ceph 0.67.10

2014-08-26 Thread Gregory Farnum
I don't think the log messages you're showing are the actual cause of
the failure. The log file should have a proper stack trace (with
specific function references and probably a listed assert failure),
can you find that?
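
(Something like this usually pulls it out of the MDS log:)

  grep -B 5 -A 30 'FAILED assert' /var/log/ceph/ceph-mds.*.log | less
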
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, Aug 26, 2014 at 9:11 AM, MinhTien MinhTien
tientienminh080...@gmail.com wrote:
 Hi all,

 I have a cluster of 2 nodes on Centos 6.5 with ceph 0.67.10 (replicate =  2)

 When I add the 3rd node to the Ceph cluster, Ceph performs load balancing.

 I have 3 MDSs on 3 nodes; the MDS process dies after a while with a stack
 trace:

 ---

  2014-08-26 17:08:34.362901 7f1c2c704700  1 -- 10.20.0.21:6800/22154 ==
 osd.10 10.20.0.21:6802/15917 1  osd_op_reply(230 10003f6.
 [tmapup 0~0] ondisk = 0) v4  119+0+0 (1770421071 0 0) 0x2aece00 con
 0x2aa4200
-54 2014-08-26 17:08:34.362942 7f1c2c704700  1 -- 10.20.0.21:6800/22154
 == osd.55 10.20.0.23:6800/2407 10  osd_op_reply(263
 100048a. [getxattr] ack = -2 (No such file or directory)) v4
  119+0+0 (3908997833 0 0) 0x1e63000 con 0x1e7aaa0
-53 2014-08-26 17:08:34.363001 7f1c2c704700  5 mds.0.log submit_entry
 427629603~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs]
-52 2014-08-26 17:08:34.363022 7f1c2c704700  1 -- 10.20.0.21:6800/22154
 == osd.37 10.20.0.22:6898/11994 6  osd_op_reply(226 1. [tmapput
 0~7664] ondisk = 0) v4  109+0+0 (1007110430 0 0) 0x1e64800 con 0x1e7a7e0
-51 2014-08-26 17:08:34.363092 7f1c2c704700  5 mds.0.log _expired
 segment 293601899 2548 events
-50 2014-08-26 17:08:34.363117 7f1c2c704700  1 -- 10.20.0.21:6800/22154
 == osd.17 10.20.0.21:6941/17572 9  osd_op_reply(264
 1000489. [getxattr] ack = -2 (No such file or directory)) v4
  119+0+0 (1979034473 0 0) 0x1e62200 con 0x1e7b180
-49 2014-08-26 17:08:34.363177 7f1c2c704700  5 mds.0.log submit_entry
 427631148~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs]
-48 2014-08-26 17:08:34.363197 7f1c2c704700  1 -- 10.20.0.21:6800/22154
 == osd.1 10.20.0.21:6872/13227 6  osd_op_reply(265 1000491.
 [getxattr] ack = -2 (No such file or directory)) v4  119+0+0 (1231782695
 0 0) 0x1e63400 con 0x1e7ac00
-47 2014-08-26 17:08:34.363255 7f1c2c704700  5 mds.0.log submit_entry
 427632693~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs]
-46 2014-08-26 17:08:34.363274 7f1c2c704700  1 -- 10.20.0.21:6800/22154
 == osd.11 10.20.0.21:6884/7018 5  osd_op_reply(266 100047d.
 [getxattr] ack = -2 (No such file or directory)) v4  119+0+0 (2737916920
 0 0) 0x1e61e00 con 0x1e7bc80

 -
 I tried to restart the MDSs, but after a few seconds in the active state, the MDS
 switches to laggy or crashed. I have a lot of important data on it.
 I do not want to use the command:
 ceph mds newfs <metadata pool id> <data pool id> --yes-i-really-mean-it

 :(

 Tien Bui.



 --
 Bui Minh Tien



Re: [ceph-users] Fresh Firefly install degraded without modified default tunables

2014-08-26 Thread Gregory Farnum
Hmm, that all looks basically fine. But why did you decide not to
segregate OSDs across hosts (according to your CRUSH rules)? I think
maybe it's the interaction of your map, setting choose_local_tries to
0, and trying to go straight to the OSDs instead of choosing hosts.
But I'm not super familiar with how the tunables would act under these
exact conditions.
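
(For what it's worth, you can inspect the active tunables and switch profiles with
something like the following -- check the docs for your exact version first:)

  ceph osd crush show-tunables
  ceph osd crush tunables bobtail      # or: legacy, optimal
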
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Mon, Aug 25, 2014 at 12:59 PM, Ripal Nathuji ri...@nathuji.com wrote:
 Hi Greg,

 Thanks for helping to take a look. Please find your requested outputs below.

 ceph osd tree:

 # id weight type name up/down reweight
 -1 0 root default
 -2 0 host osd1
 0 0 osd.0 up 1
 4 0 osd.4 up 1
 8 0 osd.8 up 1
 11 0 osd.11 up 1
 -3 0 host osd0
 1 0 osd.1 up 1
 3 0 osd.3 up 1
 6 0 osd.6 up 1
 9 0 osd.9 up 1
 -4 0 host osd2
 2 0 osd.2 up 1
 5 0 osd.5 up 1
 7 0 osd.7 up 1
 10 0 osd.10 up 1


 ceph -s:

 cluster 4a158d27-f750-41d5-9e7f-26ce4c9d2d45
  health HEALTH_WARN 832 pgs degraded; 832 pgs stuck unclean; recovery
 43/86 objects degraded (50.000%)
  monmap e1: 1 mons at {ceph-mon0=192.168.2.10:6789/0}, election epoch 2,
 quorum 0 ceph-mon0
  osdmap e34: 12 osds: 12 up, 12 in
   pgmap v61: 832 pgs, 8 pools, 840 bytes data, 43 objects
 403 MB used, 10343 MB / 10747 MB avail
 43/86 objects degraded (50.000%)
  832 active+degraded


 Thanks,
 Ripal

 On Aug 25, 2014, at 12:45 PM, Gregory Farnum g...@inktank.com wrote:

 What's the output of ceph osd tree? And the full output of ceph -s?
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com


 On Mon, Aug 18, 2014 at 8:07 PM, Ripal Nathuji ri...@nathuji.com wrote:

 Hi folks,

 I've come across an issue which I found a fix for, but I'm not sure
 whether it's correct or if there is some other misconfiguration on my end
 and this is merely a symptom. I'd appreciate any insights anyone could
 provide based on the information below, and happy to provide more details as
 necessary.

 Summary: A fresh install of Ceph 0.80.5 comes up with all pgs marked as
 active+degraded. This reproduces on 12.04 as well as CentOS 7 with a varying
 number of OSD hosts (1, 2, 3), where each OSD host has four storage drives.
 The configuration file defines a default replica size of 2, and allows leafs
 of type 0. Specific snippet:

 [global]
  ...
  osd pool default size = 2
  osd crush chooseleaf type = 0


 I verified the crush rules were as expected:

  rules: [
{ rule_id: 0,
  rule_name: replicated_ruleset,
  ruleset: 0,
  type: 1,
  min_size: 1,
  max_size: 10,
  steps: [
{ op: take,
  item: -1,
  item_name: default},
{ op: choose_firstn,
  num: 0,
  type: osd},
{ op: emit}]}],


 Inspecting the pg dump I observed that all pgs had a single osd in the
 up/acting sets. That seemed to explain why the pgs were degraded, but it was
 unclear to me why a second OSD wasn't in the set. After trying a variety of
 things, I noticed that there was a difference between Emperor (which works
 fine in these configurations) and Firefly with the default tunables, where
 Firefly comes up with the bobtail profile. The setting
 choose_local_fallback_tries is 0 in this profile while it used to default to
 5 on Emperor. Sure enough, if I modify my crush map and set the parameter to
 a non-zero value, the cluster remaps and goes healthy with all pgs
 active+clean.

 The documentation states the optimal value of choose_local_fallback_tries is
 0 for FF, so I'd like to get a better understanding of this parameter and
 why modifying the default value moves the pgs to a clean state in my
 scenarios.

 Thanks,
 Ripal



Re: [ceph-users] [Ceph-community] ceph replication and striping

2014-08-26 Thread Aaron Ten Clay
On Tue, Aug 26, 2014 at 5:07 AM, m.channappa.nega...@accenture.com wrote:

  Hello all,



 I have configured a ceph storage cluster.



  1. I created the volume. I would like to know whether replication of data
  will happen automatically in ceph?

  2. How do I configure a striped volume using ceph?





 Regards,

 Malleshi CN


If I understand your position and questions correctly... the replication
level is configured per-pool, so whatever your size parameter is set to
for the pool you created the volume in will dictate how many copies are
stored. (Default is 3, IIRC.)
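
For example, to check or change it on an existing pool (the pool name below is just a
placeholder):

  ceph osd pool get rbd size
  ceph osd pool set rbd size 3
  ceph osd pool set rbd min_size 2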

RADOS block device volumes are always striped across 4 MiB objects. I don't
believe that is configurable (at least not yet.)


FYI, this list is intended for discussion of Ceph community concerns. These
kinds of questions are better handled on the ceph-users list, and I've
forwarded your message accordingly.

-Aaron


Re: [ceph-users] Ceph-fuse fails to mount

2014-08-26 Thread Gregory Farnum
[Re-added the list.]

I believe you'll find everything you need at
http://ceph.com/docs/master/cephfs/createfs/
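
Once the filesystem exists, mounting with ceph-fuse is roughly (monitor address and
mountpoint are just examples):

  ceph-fuse -m 10.25.165.91:6789 /mnt/cephfs
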
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, Aug 26, 2014 at 1:25 PM, LaBarre, James  (CTR)  A6IT
james.laba...@cigna.com wrote:
 So is there a link for documentation on the newer versions?  (we're doing 
 evaluations at present, so I had wanted to work with newer versions, since it 
 would be closer to what we would end up using).


 -Original Message-
 From: Gregory Farnum [mailto:g...@inktank.com]
 Sent: Tuesday, August 26, 2014 4:05 PM
 To: Sean Crosby
 Cc: LaBarre, James (CTR) A6IT; ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Ceph-fuse fails to mount

 In particular, we changed things post-Firefly so that the filesystem isn't 
 created automatically. You'll need to set it up (and its pools,
 etc) explicitly to use it.
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com


 On Mon, Aug 25, 2014 at 2:40 PM, Sean Crosby richardnixonsh...@gmail.com 
 wrote:
 Hi James,


 On 26 August 2014 07:17, LaBarre, James (CTR) A6IT
 james.laba...@cigna.com
 wrote:



 [ceph@first_cluster ~]$ ceph -s

 cluster e0433b49-d64c-4c3e-8ad9-59a47d84142d

  health HEALTH_OK

  monmap e1: 1 mons at {first_cluster=10.25.164.192:6789/0},
 election epoch 2, quorum 0 first_cluster

  mdsmap e4: 1/1/1 up {0=first_cluster=up:active}

  osdmap e13: 3 osds: 3 up, 3 in

   pgmap v480: 192 pgs, 3 pools, 1417 MB data, 4851 objects

 19835 MB used, 56927 MB / 76762 MB avail

  192 active+clean


 This cluster has an MDS. It should mount.




 [ceph@second_cluster ~]$ ceph -s

 cluster 06f655b7-e147-4790-ad52-c57dcbf160b7

  health HEALTH_OK

  monmap e1: 1 mons at {second_cluster=10.25.165.91:6789/0},
 election epoch 1, quorum 0 cilsdbxd1768

  osdmap e16: 7 osds: 7 up, 7 in

   pgmap v539: 192 pgs, 3 pools, 0 bytes data, 0 objects

 252 MB used, 194 GB / 194 GB avail

  192 active+clean


 No mdsmap line for this cluster, and therefore the filesystem won't mount.
 Have you added an MDS for this cluster, or has the mds daemon died?
 You'll have to get the mdsmap line to show before it will mount

 Sean



 --
 CONFIDENTIALITY NOTICE: If you have received this email in error,
 please immediately notify the sender by e-mail at the address shown.
 This email transmission may contain confidential information.  This
 information is intended only for the use of the individual(s) or entity to
 whom it is intended even if addressed incorrectly.  Please delete it from
 your files if you are not the intended recipient.  Thank you for your
 compliance.  Copyright (c) 2014 Cigna
 ==


Re: [ceph-users] Fresh Firefly install degraded without modified default tunables

2014-08-26 Thread Ripal Nathuji
Hi Greg,

Good question: I started with a single-node test and had just left the setting
in place across the larger configs, since in earlier versions (e.g. Emperor) it didn't
seem to matter. I also had the same thought that it could be causing an issue with
the new default tunables in Firefly, and did try removing it for multi-host (all
things the same except for omitting osd crush chooseleaf type = 0 in
ceph.conf). However, I observed the same behavior in both cases.
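
For reference, the usual round-trip for hand-editing a tunable like
choose_local_fallback_tries looks roughly like this (a sketch; keep a backup of the
original map):

  ceph osd getcrushmap -o crush.bin
  crushtool -d crush.bin -o crush.txt
  # edit 'tunable choose_local_fallback_tries 0' in crush.txt, then:
  crushtool -c crush.txt -o crush.new
  ceph osd setcrushmap -i crush.new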

Thanks,
Ripal

On Aug 26, 2014, at 3:04 PM, Gregory Farnum g...@inktank.com wrote:

 Hmm, that all looks basically fine. But why did you decide not to
 segregate OSDs across hosts (according to your CRUSH rules)? I think
 maybe it's the interaction of your map, setting choose_local_tries to
 0, and trying to go straight to the OSDs instead of choosing hosts.
 But I'm not super familiar with how the tunables would act under these
 exact conditions.
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com
 
 
 On Mon, Aug 25, 2014 at 12:59 PM, Ripal Nathuji ri...@nathuji.com wrote:
 Hi Greg,
 
 Thanks for helping to take a look. Please find your requested outputs below.
 
 ceph osd tree:
 
 # id weight type name up/down reweight
 -1 0 root default
 -2 0 host osd1
 0 0 osd.0 up 1
 4 0 osd.4 up 1
 8 0 osd.8 up 1
 11 0 osd.11 up 1
 -3 0 host osd0
 1 0 osd.1 up 1
 3 0 osd.3 up 1
 6 0 osd.6 up 1
 9 0 osd.9 up 1
 -4 0 host osd2
 2 0 osd.2 up 1
 5 0 osd.5 up 1
 7 0 osd.7 up 1
 10 0 osd.10 up 1
 
 
 ceph -s:
 
cluster 4a158d27-f750-41d5-9e7f-26ce4c9d2d45
 health HEALTH_WARN 832 pgs degraded; 832 pgs stuck unclean; recovery
 43/86 objects degraded (50.000%)
 monmap e1: 1 mons at {ceph-mon0=192.168.2.10:6789/0}, election epoch 2,
 quorum 0 ceph-mon0
 osdmap e34: 12 osds: 12 up, 12 in
  pgmap v61: 832 pgs, 8 pools, 840 bytes data, 43 objects
403 MB used, 10343 MB / 10747 MB avail
43/86 objects degraded (50.000%)
 832 active+degraded
 
 
 Thanks,
 Ripal
 
 On Aug 25, 2014, at 12:45 PM, Gregory Farnum g...@inktank.com wrote:
 
 What's the output of ceph osd tree? And the full output of ceph -s?
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com
 
 
 On Mon, Aug 18, 2014 at 8:07 PM, Ripal Nathuji ri...@nathuji.com wrote:
 
 Hi folks,
 
 I've come across an issue which I found a fix for, but I'm not sure
 whether it's correct or if there is some other misconfiguration on my end
 and this is merely a symptom. I'd appreciate any insights anyone could
 provide based on the information below, and happy to provide more details as
 necessary.
 
 Summary: A fresh install of Ceph 0.80.5 comes up with all pgs marked as
 active+degraded. This reproduces on 12.04 as well as CentOS 7 with a varying
 number of OSD hosts (1, 2, 3), where each OSD host has four storage drives.
 The configuration file defines a default replica size of 2, and allows leafs
 of type 0. Specific snippet:
 
 [global]
 ...
 osd pool default size = 2
 osd crush chooseleaf type = 0
 
 
 I verified the crush rules were as expected:
 
 rules: [
   { rule_id: 0,
 rule_name: replicated_ruleset,
 ruleset: 0,
 type: 1,
 min_size: 1,
 max_size: 10,
 steps: [
   { op: take,
 item: -1,
 item_name: default},
   { op: choose_firstn,
 num: 0,
 type: osd},
   { op: emit}]}],
 
 
 Inspecting the pg dump I observed that all pgs had a single osd in the
 up/acting sets. That seemed to explain why the pgs were degraded, but it was
 unclear to me why a second OSD wasn't in the set. After trying a variety of
 things, I noticed that there was a difference between Emperor (which works
 fine in these configurations) and Firefly with the default tunables, where
 Firefly comes up with the bobtail profile. The setting
 choose_local_fallback_tries is 0 in this profile while it used to default to
 5 on Emperor. Sure enough, if I modify my crush map and set the parameter to
 a non-zero value, the cluster remaps and goes healthy with all pgs
 active+clean.
 
 The documentation states the optimal value of choose_local_fallback_tries is
 0 for FF, so I'd like to get a better understanding of this parameter and
 why modifying the default value moves the pgs to a clean state in my
 scenarios.
 
 Thanks,
 Ripal
 


[ceph-users] do RGW have billing feature? If have, how do we use it ?

2014-08-26 Thread baijia...@126.com





baijia...@126.com


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-26 Thread Christian Balzer

Hello,

On Tue, 26 Aug 2014 20:21:39 +0200 Loic Dachary wrote:

 Hi Craig,
 
 I assume the reason for the 48 hours recovery time is to keep the cost
 of the cluster low ? I wrote 1h recovery time because it is roughly
 the time it would take to move 4TB over a 10Gb/s link. Could you upgrade
 your hardware to reduce the recovery time to less than two hours ? Or
 are there factors other than cost that prevent this ?
 

I doubt Craig is operating on a shoestring budget.
And even if his network were to be just GbE, that would still make it only
10 hours according to your wishful thinking formula.

He probably has set the max_backfills to 1 because that is the level of
I/O his OSDs can handle w/o degrading cluster performance too much.
The network is unlikely to be the limiting factor.

The way I see it most Ceph clusters are in sort of steady state when
operating normally, i.e. a few hundred VM RBD images ticking over, most
actual OSD disk ops are writes, as nearly all hot objects that are being
read are in the page cache of the storage nodes.
Easy peasy.

Until something happens that breaks this routine, like a deep scrub, all
those VMs rebooting at the same time or a backfill caused by a failed OSD.
Now all of a sudden client ops compete with the backfill ops, page caches
are no longer hot, the spinners are seeking left and right. 
Pandemonium.

I doubt very much that even with a SSD backed cluster you would get away
with less than 2 hours for 4TB.

To give you some real life numbers, I currently am building a new cluster
but for the time being have only one storage node to play with.
It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs  and 8 actual
OSD HDDs (3TB, 7200RPM). 90GB of (test) data on it.

So I took out one OSD (reweight 0 first, then the usual removal steps)
because the actual disk was wonky. Replaced the disk and re-added the OSD.
Both operations took about the same time, 4 minutes for evacuating the OSD
(having 7 write targets clearly helped) for a measly 12GB or about 50MB/s,
and 5 minutes or about 35MB/s for refilling the OSD.
And that is on one node (thus no network latency) that has the default
parameters (so a max_backfill of 10) which was otherwise totally idle. 

In other words, in this pretty ideal case it would have taken 22 hours
to re-distribute 4TB.
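
(Back-of-the-envelope, assuming the ~50MB/s effective rate above holds:)

  python -c 'print 4e6/50/3600'    # ~22 hours to move 4,000,000 MB at 50MB/s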

More in another reply.

 Cheers
 
 On 26/08/2014 19:37, Craig Lewis wrote:
  My OSD rebuild time is more like 48 hours (4TB disks, 60% full, osd
  max backfills = 1).   I believe that increases my risk of failure by
  48^2 .  Since your numbers are failure rate per hour per disk, I need
  to consider the risk for the whole time for each disk.  So more
  formally, rebuild time to the power of (replicas -1).
  
  So I'm at 2304/100,000,000, or  approximately 1/43,000.  That's a much
  higher risk than 1 / 10^8.
  
  
  A risk of 1/43,000 means that I'm more likely to lose data due to
  human error than disk failure.  Still, I can put a small bit of effort
  in to optimize recovery speed, and lower this number.  Managing human
  error is much harder.
  
  
  
  
  
  
  On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary l...@dachary.org
  mailto:l...@dachary.org wrote:
  
  Using percentages instead of numbers lead me to calculations
  errors. Here it is again using 1/100 instead of % for clarity ;-)
  
  Assuming that:
  
  * The pool is configured for three replicas (size = 3 which is the
  default)
  * It takes one hour for Ceph to recover from the loss of a single
  OSD
  * Any other disk has a 1/100,000 chance to fail within the hour
  following the failure of the first disk (assuming AFR
  https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is
  8%, divided by the number of hours during a year == (0.08 / 8760) ~=
  1/100,000
  * A given disk does not participate in more than 100 PG
  
 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/


Re: [ceph-users] ceph-deploy with --release (--stable) for dumpling?

2014-08-26 Thread Nigel Williams
On Tue, Aug 26, 2014 at 5:10 PM, Konrad Gutkowski
konrad.gutkow...@ffs.pl wrote:
 Ceph-deploy should set priority for ceph repository, which it doesn't, this
 usually installs the best available version from any repository.

Thanks Konrad for the tip. It took several goes (notably ceph-deploy
purge did not, for me at least, seem to be removing librbd1 cleanly)
but I managed to get 0.67.10 to be preferred, basically I did this:

root@ceph12:~# ceph -v
ceph version 0.67.10
root@ceph12:~# cat /etc/apt/preferences
Package: *
Pin: origin ceph.com
Pin-priority: 900

Package: *
Pin: origin ceph.newdream.net
Pin-priority: 900
root@ceph12:~#
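
(A quick way to confirm the pin actually wins before installing or upgrading:)

  apt-cache policy ceph ceph-common librbd1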


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-26 Thread Christian Balzer

Hello,

On Tue, 26 Aug 2014 16:12:11 +0200 Loic Dachary wrote:

 Using percentages instead of numbers lead me to calculations errors.
 Here it is again using 1/100 instead of % for clarity ;-)
 
 Assuming that:
 
 * The pool is configured for three replicas (size = 3 which is the
 default)
 * It takes one hour for Ceph to recover from the loss of a single OSD
I think Craig and I have debunked that number.
It will be something like "that depends on many things, starting with the
amount of data, the disk speeds, the contention (client and other ops),
the network speed/utilization, the actual OSD process and queue handling
speed, etc."
If you want to make an assumption that's not an order of magnitude wrong,
start with 24 hours.

It would be nice to hear from people with really huge clusters like Dan at
CERN how their recovery speeds are.

 * Any other disk has a 1/100,000 chance to fail within the hour
 following the failure of the first disk (assuming AFR
 https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is
 8%, divided by the number of hours during a year == (0.08 / 8760) ~=
 1/100,000 
 * A given disk does not participate in more than 100 PG
 
You will find that the smaller the cluster, the more likely it is to be
higher than 100, due to rounding up or just upping things because the
distribution is too uneven otherwise.


 Each time an OSD is lost, there is a 1/100,000*1/100,000 =
 1/10,000,000,000 chance that two other disks are lost before recovery.
 Since the disk that failed initialy participates in 100 PG, that is
 1/10,000,000,000 x 100 = 1/100,000,000 chance that a PG is lost. Or the
 entire pool if it is used in a way that loosing a PG means loosing all
 data in the pool (as in your example, where it contains RBD volumes and
 each of the RBD volume uses all the available PG).
 
 If the pool is using at least two datacenters operated by two different
 organizations, this calculation makes sense to me. However, if the
 cluster is in a single datacenter, isn't it possible that some event
 independent of Ceph has a greater probability of permanently destroying
 the data ? A month ago I lost three machines in a Ceph cluster and
 realized on that occasion that the crushmap was not configured properly
 and that PG were lost as a result. Fortunately I was able to recover the
 disks and plug them in another machine to recover the lost PGs. I'm not
 a system administrator and the probability of me failing to do the right
 thing is higher than normal: this is just an example of a high
 probability event leading to data loss. Another example would be if all
 disks in the same PG are part of the same batch and therefore likely to
 fail at the same time. In other words, I wonder if this 0.0001% chance
 of losing a PG within the hour following a disk failure matters or if it
 is dominated by other factors. What do you think?


Batch failures are real, I'm seeing that all the time. 
However they tend to be still spaced out widely enough most of the time.
Still something to consider in a complete calculation.

As for failures other than disks, these tend to be recoverable, as you
experienced yourself. A node, rack, whatever failure might make your
cluster temporarily inaccessible (and thus should be avoided by proper
CRUSH maps and other precautions), but it will not lead to actual data
loss.
  
Regards,

Christian
 
 Cheers
 
  
  Assuming that:
  
  * The pool is configured for three replicas (size = 3 which is the
  default)
  * It takes one hour for Ceph to recover from the loss of a single OSD
  * Any other disk has a 0.001% chance to fail within the hour following
  the failure of the first disk (assuming AFR
  https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is
  10%, divided by the number of hours during a year).
  * A given disk does not participate in more than 100 PG
  
  Each time an OSD is lost, there is a 0.001*0.001 = 0.01% chance
  that two other disks are lost before recovery. Since the disk that
  failed initially participates in 100 PG, that is 0.01% x 100 =
  0.0001% chance that a PG is lost. Or the entire pool if it is used in
  a way that losing a PG means losing all data in the pool (as in your
  example, where it contains RBD volumes and each of the RBD volume uses
  all the available PG).
  
  If the pool is using at least two datacenters operated by two
  different organizations, this calculation makes sense to me. However,
  if the cluster is in a single datacenter, isn't it possible that some
  event independent of Ceph has a greater probability of permanently
  destroying the data ? A month ago I lost three machines in a Ceph
  cluster and realized on that occasion that the crushmap was not
  configured properly and that PG were lost as a result. Fortunately I
  was able to recover the disks and plug them in another machine to
  recover the lost PGs. I'm not a system administrator and the
  probability of me failing to do the right thing 

Re: [ceph-users] MDS dying on Ceph 0.67.10

2014-08-26 Thread MinhTien MinhTien
Hi Gregory Farnum,

Thank you for your reply!
This is the log:

2014-08-26 16:22:39.103461 7f083752f700 -1 mds/CDir.cc: In function 'void
CDir::_committed(version_t)' thread 7f083752f700 time 2014-08-26
16:22:39.075809
mds/CDir.cc: 2071: FAILED assert(in->is_dirty() || in->last < ((__u64)(-2)))

 ceph version 0.67.10 (9d446bd416c52cd785ccf048ca67737ceafcdd7f)
 1: (CDir::_committed(unsigned long)+0xc4e) [0x74d9ee]
 2: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xe8d) [0x7d09bd]
 3: (MDS::handle_core_message(Message*)+0x987) [0x57c457]
 4: (MDS::_dispatch(Message*)+0x2f) [0x57c50f]
 5: (MDS::ms_dispatch(Message*)+0x19b) [0x57dfbb]
 6: (DispatchQueue::entry()+0x5a2) [0x904732]
 7: (DispatchQueue::DispatchThread::entry()+0xd) [0x8afdbd]
 8: (()+0x79d1) [0x7f083c2979d1]
 9: (clone()+0x6d) [0x7f083afb6b5d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 hadoop
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 1
  max_new 1000
  log_file /var/log/ceph/ceph-mds.Ceph01-dc5k3u0104.log
--- end dump of recent events ---
2014-08-26 16:22:39.134173 7f083752f700 -1 *** Caught signal (Aborted) **
 in thread 7f083752f700
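
(In case it helps: a more detailed trace of what the MDS was doing can usually be had
by raising its debug levels before restarting it, e.g. in ceph.conf:)

  [mds]
      debug mds = 20
      debug ms = 1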




On Wed, Aug 27, 2014 at 3:09 AM, Gregory Farnum g...@inktank.com wrote:

 I don't think the log messages you're showing are the actual cause of
 the failure. The log file should have a proper stack trace (with
 specific function references and probably a listed assert failure),
 can you find that?
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com


 On Tue, Aug 26, 2014 at 9:11 AM, MinhTien MinhTien
 tientienminh080...@gmail.com wrote:
  Hi all,
 
  I have a cluster of 2 nodes on Centos 6.5 with ceph 0.67.10 (replicate
 =  2)
 
  When I add the 3rd node in the Ceph Cluster, CEPH perform load balancing.
 
  I have 3 MDS in 3 nodes,the MDS process is dying after a while with a
 stack
  trace:
 
 
 ---
 
   2014-08-26 17:08:34.362901 7f1c2c704700  1 -- 10.20.0.21:6800/22154 ==
  osd.10 10.20.0.21:6802/15917 1  osd_op_reply(230
 10003f6.
  [tmapup 0~0] ondisk = 0) v4  119+0+0 (1770421071 0 0) 0x2aece00 con
  0x2aa4200
 -54 2014-08-26 17:08:34.362942 7f1c2c704700  1 --
 10.20.0.21:6800/22154
  == osd.55 10.20.0.23:6800/2407 10  osd_op_reply(263
  100048a. [getxattr] ack = -2 (No such file or directory)) v4
   119+0+0 (3908997833 0 0) 0x1e63000 con 0x1e7aaa0
 -53 2014-08-26 17:08:34.363001 7f1c2c704700  5 mds.0.log submit_entry
  427629603~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs]
 -52 2014-08-26 17:08:34.363022 7f1c2c704700  1 --
 10.20.0.21:6800/22154
  == osd.37 10.20.0.22:6898/11994 6  osd_op_reply(226 1.
 [tmapput
  0~7664] ondisk = 0) v4  109+0+0 (1007110430 0 0) 0x1e64800 con
 0x1e7a7e0
 -51 2014-08-26 17:08:34.363092 7f1c2c704700  5 mds.0.log _expired
  segment 293601899 2548 events
 -50 2014-08-26 17:08:34.363117 7f1c2c704700  1 --
 10.20.0.21:6800/22154
  == osd.17 10.20.0.21:6941/17572 9  osd_op_reply(264
  1000489. [getxattr] ack = -2 (No such file or directory)) v4
   119+0+0 (1979034473 0 0) 0x1e62200 con 0x1e7b180
 -49 2014-08-26 17:08:34.363177 7f1c2c704700  5 mds.0.log submit_entry
  427631148~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs]
 -48 2014-08-26 17:08:34.363197 7f1c2c704700  1 --
 10.20.0.21:6800/22154
  == osd.1 10.20.0.21:6872/13227 6  osd_op_reply(265
 1000491.
  [getxattr] ack = -2 (No such file or directory)) v4  119+0+0
 (1231782695
  0 0) 0x1e63400 con 0x1e7ac00
 -47 2014-08-26 17:08:34.363255 7f1c2c704700  5 mds.0.log submit_entry
  427632693~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs]
 -46 2014-08-26 17:08:34.363274 7f1c2c704700  1 --
 10.20.0.21:6800/22154
  == osd.11 10.20.0.21:6884/7018 5  osd_op_reply(266
 100047d.
  [getxattr] ack = -2 (No such file or directory)) v4  119+0+0
 (2737916920
  0 0) 0x1e61e00 con 0x1e7bc80
 
 
 -
  I try to restart MDSs, but after a 

[ceph-users] 'incomplete' PGs: what does it mean?

2014-08-26 Thread John Morris
In the docs [1], 'incomplete' is defined thusly:

  Ceph detects that a placement group is missing a necessary period of
  history from its log. If you see this state, report a bug, and try
  to start any failed OSDs that may contain the needed information.

However, during an extensive review of list postings related to
incomplete PGs, an alternate and oft-repeated definition is something
like 'the number of existing replicas is less than the min_size of the
pool'.  In no list posting was there any acknowledgement of the
definition from the docs.

While trying to understand what 'incomplete' PGs are, I simply set
min_size = 1 on this cluster with incomplete PGs, and they continue to
be 'incomplete'.  Does this mean that definition #2 is incorrect?

In case #1 is correct, how can the cluster be told to forget the lapse
in history?  In our case, there was nothing writing to the cluster
during the OSD reorganization that could have caused this lapse.
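
One way to see, per PG, why Ceph considers it incomplete is to query it directly (the
pg id below is just an example):

  ceph pg dump_stuck inactive
  ceph pg 2.5 query | less     # look at the recovery_state / peering sections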

[1] http://ceph.com/docs/master/rados/operations/pg-states/

John


[ceph-users] question about getting rbd.ko and ceph.ko

2014-08-26 Thread yuelongguang
hi, all

Is there a way to get rbd.ko and ceph.ko for CentOS 6.x,
or do I have to build them from source code? What is the minimum kernel version required?
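
(A quick way to check whether a given kernel already ships them, just as a sketch:)

  modinfo rbd ceph 2>/dev/null | egrep 'filename|version'
  modprobe rbd && lsmod | grep rbd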
 
thanks


Re: [ceph-users] enrich ceph test methods, what is your concern about ceph. thanks

2014-08-26 Thread Irek Fasikhov
Hi.
I and many people use fio.
For ceph rbd has a special engine:
https://telekomcloud.github.io/ceph/2014/02/26/ceph-performance-analysis_fio_rbd.html
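
A minimal job file for the rbd engine looks roughly like this (pool/image names are
examples, the image must already exist, and fio must be built with rbd support):

  [global]
  ioengine=rbd
  clientname=admin
  pool=rbd
  rbdname=fio-test
  bs=4k
  runtime=60

  [rand-write]
  rw=randwrite
  iodepth=32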


2014-08-26 12:15 GMT+04:00 yuelongguang fasts...@163.com:

 hi, all

 I am planning to do a test on ceph, covering performance, throughput,
 scalability and availability.
 In order to get a full test result, I hope you all can give me some
 advice. Meanwhile I can send the results to you, if you like.
 For each test category (performance, throughput,
 scalability, availability), do you have test ideas and test
 tools?
 Basically I know some tools to test throughput and IOPS, but please
 tell me the tools you prefer and the results you expect.

 thanks very much








-- 
Best regards, Irek Fasikhov
Mob.: +79229045757


Re: [ceph-users] ceph cluster inconsistency?

2014-08-26 Thread Kenneth Waegeman


Hi,

In the meantime I already tried upgrading the cluster to 0.84, to
see if that made a difference, and it seems it does.

I can't reproduce the crashing osds by doing a 'rados -p ecdata ls' anymore.

But now the cluster detect it is inconsistent:

  cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
   health HEALTH_ERR 40 pgs inconsistent; 40 scrub errors; too  
few pgs per osd (4 < min 20); mon.ceph002 low disk space
   monmap e3: 3 mons at  
{ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0}, election epoch 30, quorum 0,1,2  
ceph001,ceph002,ceph003

   mdsmap e78951: 1/1/1 up {0=ceph003.cubone.os=up:active}, 3 up:standby
   osdmap e145384: 78 osds: 78 up, 78 in
pgmap v247095: 320 pgs, 4 pools, 15366 GB data, 3841 kobjects
  1502 GB used, 129 TB / 131 TB avail
   279 active+clean
40 active+clean+inconsistent
 1 active+clean+scrubbing+deep


I tried to do ceph pg repair for all the inconsistent pgs:

  cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
   health HEALTH_ERR 40 pgs inconsistent; 1 pgs repair; 40 scrub  
errors; too few pgs per osd (4 < min 20); mon.ceph002 low disk space
   monmap e3: 3 mons at  
{ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0}, election epoch 30, quorum 0,1,2  
ceph001,ceph002,ceph003

   mdsmap e79486: 1/1/1 up {0=ceph003.cubone.os=up:active}, 3 up:standby
   osdmap e146452: 78 osds: 78 up, 78 in
pgmap v248520: 320 pgs, 4 pools, 15366 GB data, 3841 kobjects
  1503 GB used, 129 TB / 131 TB avail
   279 active+clean
39 active+clean+inconsistent
 1 active+clean+scrubbing+deep
 1 active+clean+scrubbing+deep+inconsistent+repair

I let it recover through the night, but this morning the mons were
all gone, with nothing to see in the log files. The osds were all still up!


cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
 health HEALTH_ERR 36 pgs inconsistent; 1 pgs repair; 36 scrub  
errors; too few pgs per osd (4 < min 20)
 monmap e7: 3 mons at  
{ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0}, election epoch 44, quorum 0,1,2  
ceph001,ceph002,ceph003

 mdsmap e109481: 1/1/1 up {0=ceph003.cubone.os=up:active}, 3 up:standby
 osdmap e203410: 78 osds: 78 up, 78 in
  pgmap v331747: 320 pgs, 4 pools, 15251 GB data, 3812 kobjects
1547 GB used, 129 TB / 131 TB avail
   1 active+clean+scrubbing+deep+inconsistent+repair
 284 active+clean
  35 active+clean+inconsistent

I restarted the monitors now, I will let you know when I see something more..



- Message from Haomai Wang haomaiw...@gmail.com -
 Date: Sun, 24 Aug 2014 12:51:41 +0800
 From: Haomai Wang haomaiw...@gmail.com
Subject: Re: [ceph-users] ceph cluster inconsistency?
   To: Kenneth Waegeman kenneth.waege...@ugent.be,  
ceph-users@lists.ceph.com




It's really strange! I write a test program according the key ordering
you provided and parse the corresponding value. It's true!

I have no idea now. If free, could you add this debug code to
src/os/GenericObjectMap.cc and insert *before* assert(start =
header.oid);:

  dout(0)  start:   start  header.oid:   header.oid  dendl;

Then you need to recompile ceph-osd and run it again. The output log
can help it!

On Tue, Aug 19, 2014 at 10:19 PM, Haomai Wang haomaiw...@gmail.com wrote:

I feel a little embarrassed, 1024 rows still true for me.

I was wondering if you could give me all your keys via
"ceph-kvstore-tool /var/lib/ceph/osd/ceph-67/current/ list
_GHOBJTOSEQ_ > keys.log".

thanks!

On Tue, Aug 19, 2014 at 4:58 PM, Kenneth Waegeman
kenneth.waege...@ugent.be wrote:


- Message from Haomai Wang haomaiw...@gmail.com -
 Date: Tue, 19 Aug 2014 12:28:27 +0800

 From: Haomai Wang haomaiw...@gmail.com
Subject: Re: [ceph-users] ceph cluster inconsistency?
   To: Kenneth Waegeman kenneth.waege...@ugent.be
   Cc: Sage Weil sw...@redhat.com, ceph-users@lists.ceph.com



On Mon, Aug 18, 2014 at 7:32 PM, Kenneth Waegeman
kenneth.waege...@ugent.be wrote:



- Message from Haomai Wang haomaiw...@gmail.com -
 Date: Mon, 18 Aug 2014 18:34:11 +0800

 From: Haomai Wang haomaiw...@gmail.com
Subject: Re: [ceph-users] ceph cluster inconsistency?
   To: Kenneth Waegeman kenneth.waege...@ugent.be
   Cc: Sage Weil sw...@redhat.com, ceph-users@lists.ceph.com




On Mon, Aug 18, 2014 at 5:38 PM, Kenneth Waegeman
kenneth.waege...@ugent.be wrote:



Hi,

I tried this after restarting the osd, but I guess that was not the aim
(
# ceph-kvstore-tool /var/lib/ceph/osd/ceph-67/current/ list
_GHOBJTOSEQ_|
grep 6adb1100 -A 100
IO error: lock /var/lib/ceph/osd/ceph-67/current//LOCK: Resource
temporarily
unavailable
tools/ceph_kvstore_tool.cc: In function 

Re: [ceph-users] v0.84 released

2014-08-26 Thread Stijn De Weirdt

hi all,


there are a zillion OSD bug fixes. Things are looking pretty good for the
Giant release that is coming up in the next month.
any chance of having a compilable cephfs kernel module for el7 for the 
next major release?




stijn


Re: [ceph-users] Ceph monitor load, low performance

2014-08-26 Thread Irek Fasikhov
Move the logs onto the SSD and you will immediately increase performance; you are
losing about 50% of your performance on the logs. Also, for three replicas,
more than 5 hosts are recommended.
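
(Assuming you mean the OSD journals: the usual per-OSD procedure is roughly the
following -- the device path is only an example, and the OSD must be stopped first:)

  service ceph stop osd.0
  ceph-osd -i 0 --flush-journal
  # point 'osd journal' for osd.0 at an SSD partition in ceph.conf, e.g.
  #   osd journal = /dev/disk/by-partlabel/journal-osd0
  ceph-osd -i 0 --mkjournal
  service ceph start osd.0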


2014-08-26 12:17 GMT+04:00 Mateusz Skała mateusz.sk...@budikom.net:


 Hi thanks for reply.



  From the top of my head, it is recommended to use 3 mons in
 production. Also, for the 22 osds your number of PGs looks a bit low,
 you should look at that.

 I get it from http://ceph.com/docs/master/rados/operations/placement-
 groups/

 (22 OSDs * 100) / 3 replicas = 733, rounded up to the next power of two = 1024 pgs
 Please correct me if I'm wrong.

 It will be 5 mons (on 6 hosts) but now we must migrate some data from used
 servers.




 The performance of the cluster is poor - this is too vague. What is
 your current performance, what benchmarks have you tried, what is your
 data workload and most importantly, how is your cluster setup. what
 disks, ssds, network, ram, etc.

 Please provide more information so that people could help you.

 Andrei


 Hardware informations:
 ceph15:
 RAM: 4GB
 Network: 4x 1GB NIC
 OSD disk's:
 2x SATA Seagate ST31000524NS
 2x SATA WDC WD1003FBYX-18Y7B0

 ceph25:
 RAM: 16GB
 Network: 4x 1GB NIC
 OSD disk's:
 2x SATA WDC WD7500BPKX-7
 2x SATA WDC WD7500BPKX-2
 2x SATA SSHD ST1000LM014-1EJ164

 ceph30
 RAM: 16GB
 Network: 4x 1GB NIC
 OSD disks:
 6x SATA SSHD ST1000LM014-1EJ164

 ceph35:
 RAM: 16GB
 Network: 4x 1GB NIC
 OSD disks:
 6x SATA SSHD ST1000LM014-1EJ164


 All journals are on OSD's. 2 NIC are for backend network (10.20.4.0/22)
 and 2 NIC are for frontend (10.20.8.0/22).

  We use this cluster as the storage backend for 100 VMs on KVM. I haven't run
  benchmarks, but all VMs were migrated from Xen+GlusterFS (NFS); before the
  migration every VM ran fine, but now each VM hangs for a few seconds from time
  to time, and apps installed on the VMs take much longer to load. GlusterFS
  was running on 2 servers with 1x 1GB NIC and 2x8 disks WDC WD7500BPKX-7.

  I ran one test with recovery: if a disk is marked out, recovery io is
  150-200MB/s but all VMs hang until recovery ends.

  The biggest load is on ceph35: IOPS on each disk are near 150, cpu load ~4-5.
  On the other hosts the cpu load is ~2 with 120~130 IOPS.

 Our ceph.conf

 ===
 [global]

 fsid=a9d17295-62f2-46f6-8325-1cad7724e97f
 mon initial members = ceph35, ceph30, ceph25, ceph15
 mon host = 10.20.8.35, 10.20.8.30, 10.20.8.25, 10.20.8.15
 public network = 10.20.8.0/22
 cluster network = 10.20.4.0/22
 osd journal size = 1024
 filestore xattr use omap = true
 osd pool default size = 3
 osd pool default min size = 1
 osd pool default pg num = 1024
 osd pool default pgp num = 1024
 osd crush chooseleaf type = 1
 auth cluster required = cephx
 auth service required = cephx
 auth client required = cephx
 rbd default format = 2

 ##ceph35 osds
 [osd.0]
 cluster addr = 10.20.4.35
 [osd.1]
 cluster addr = 10.20.4.35
 [osd.2]
 cluster addr = 10.20.4.35
 [osd.3]
 cluster addr = 10.20.4.36
 [osd.4]
 cluster addr = 10.20.4.36
 [osd.5]
 cluster addr = 10.20.4.36

 ##ceph25 osds
 [osd.6]
 cluster addr = 10.20.4.25
 public addr = 10.20.8.25
 [osd.7]
 cluster addr = 10.20.4.25
 public addr = 10.20.8.25
 [osd.8]
 cluster addr = 10.20.4.25
 public addr = 10.20.8.25
 [osd.9]
 cluster addr = 10.20.4.26
 public addr = 10.20.8.26
 [osd.10]
 cluster addr = 10.20.4.26
 public addr = 10.20.8.26
 [osd.11]
 cluster addr = 10.20.4.26
 public addr = 10.20.8.26

 ##ceph15 osds
 [osd.12]
 cluster addr = 10.20.4.15
 public addr = 10.20.8.15
 [osd.13]
 cluster addr = 10.20.4.15
 public addr = 10.20.8.15
 [osd.14]
 cluster addr = 10.20.4.15
 public addr = 10.20.8.15
 [osd.15]
 cluster addr = 10.20.4.16
 public addr = 10.20.8.16

 ##ceph30 osds
 [osd.16]
 cluster addr = 10.20.4.30
 public addr = 10.20.8.30
 [osd.17]
 cluster addr = 10.20.4.30
 public addr = 10.20.8.30
 [osd.18]
 cluster addr = 10.20.4.30
 public addr = 10.20.8.30
 [osd.19]
 cluster addr = 10.20.4.31
 public addr = 10.20.8.31
 [osd.20]
 cluster addr = 10.20.4.31
 public addr = 10.20.8.31
 [osd.21]
 cluster addr = 10.20.4.31
 public addr = 10.20.8.31

 [mon.ceph35]
 host = ceph35
 mon addr = 10.20.8.35:6789
 [mon.ceph30]
 host = ceph30
 mon addr = 10.20.8.30:6789
 [mon.ceph25]
 host = ceph25
 mon addr = 10.20.8.25:6789
 [mon.ceph15]
 host = ceph15
 mon addr = 10.20.8.15:6789
 

 Regards,

 Mateusz






-- 
Best regards, Irek Fasikhov
Mob.: +79229045757


Re: [ceph-users] ceph cluster inconsistency?

2014-08-26 Thread Haomai Wang
Hmm, it looks like you hit this bug (http://tracker.ceph.com/issues/9223).

Sorry for the late message, I forgot that this fix was merged into 0.84.

Thanks for your patience :-)

On Tue, Aug 26, 2014 at 4:39 PM, Kenneth Waegeman
kenneth.waege...@ugent.be wrote:

 Hi,

 In the meantime I already tried with upgrading the cluster to 0.84, to see
 if that made a difference, and it seems it does.
 I can't reproduce the crashing osds by doing a 'rados -p ecdata ls' anymore.

 But now the cluster detect it is inconsistent:

   cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
health HEALTH_ERR 40 pgs inconsistent; 40 scrub errors; too few pgs
per osd (4 < min 20); mon.ceph002 low disk space
monmap e3: 3 mons at
 {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0},
 election epoch 30, quorum 0,1,2 ceph001,ceph002,ceph003
mdsmap e78951: 1/1/1 up {0=ceph003.cubone.os=up:active}, 3 up:standby
osdmap e145384: 78 osds: 78 up, 78 in
 pgmap v247095: 320 pgs, 4 pools, 15366 GB data, 3841 kobjects
   1502 GB used, 129 TB / 131 TB avail
279 active+clean
 40 active+clean+inconsistent
  1 active+clean+scrubbing+deep


 I tried to do ceph pg repair for all the inconsistent pgs:

   cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
health HEALTH_ERR 40 pgs inconsistent; 1 pgs repair; 40 scrub errors;
too few pgs per osd (4 < min 20); mon.ceph002 low disk space
monmap e3: 3 mons at
 {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0},
 election epoch 30, quorum 0,1,2 ceph001,ceph002,ceph003
mdsmap e79486: 1/1/1 up {0=ceph003.cubone.os=up:active}, 3 up:standby
osdmap e146452: 78 osds: 78 up, 78 in
 pgmap v248520: 320 pgs, 4 pools, 15366 GB data, 3841 kobjects
   1503 GB used, 129 TB / 131 TB avail
279 active+clean
 39 active+clean+inconsistent
  1 active+clean+scrubbing+deep
  1 active+clean+scrubbing+deep+inconsistent+repair

 I let it recovering through the night, but this morning the mons were all
 gone, nothing to see in the log files.. The osds were all still up!

 cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
  health HEALTH_ERR 36 pgs inconsistent; 1 pgs repair; 36 scrub errors;
too few pgs per osd (4 < min 20)
  monmap e7: 3 mons at
 {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0},
 election epoch 44, quorum 0,1,2 ceph001,ceph002,ceph003
  mdsmap e109481: 1/1/1 up {0=ceph003.cubone.os=up:active}, 3 up:standby
  osdmap e203410: 78 osds: 78 up, 78 in
   pgmap v331747: 320 pgs, 4 pools, 15251 GB data, 3812 kobjects
 1547 GB used, 129 TB / 131 TB avail
1 active+clean+scrubbing+deep+inconsistent+repair
  284 active+clean
   35 active+clean+inconsistent

 I restarted the monitors now, I will let you know when I see something
 more..




 - Message from Haomai Wang haomaiw...@gmail.com -
  Date: Sun, 24 Aug 2014 12:51:41 +0800

  From: Haomai Wang haomaiw...@gmail.com
 Subject: Re: [ceph-users] ceph cluster inconsistency?
To: Kenneth Waegeman kenneth.waege...@ugent.be,
 ceph-users@lists.ceph.com


 It's really strange! I write a test program according the key ordering
 you provided and parse the corresponding value. It's true!

 I have no idea now. If free, could you add this debug code to
  src/os/GenericObjectMap.cc and insert *before* assert(start <=
  header.oid);:

    dout(0) << "start: " << start << " header.oid: " << header.oid << dendl;

 Then you need to recompile ceph-osd and run it again. The output log
 can help it!

 On Tue, Aug 19, 2014 at 10:19 PM, Haomai Wang haomaiw...@gmail.com
 wrote:

 I feel a little embarrassed, 1024 rows still true for me.

  I was wondering if you could give me all your keys via
  "ceph-kvstore-tool /var/lib/ceph/osd/ceph-67/current/ list
  _GHOBJTOSEQ_ > keys.log".

 thanks!

 On Tue, Aug 19, 2014 at 4:58 PM, Kenneth Waegeman
 kenneth.waege...@ugent.be wrote:


 - Message from Haomai Wang haomaiw...@gmail.com -
  Date: Tue, 19 Aug 2014 12:28:27 +0800

  From: Haomai Wang haomaiw...@gmail.com
 Subject: Re: [ceph-users] ceph cluster inconsistency?
To: Kenneth Waegeman kenneth.waege...@ugent.be
Cc: Sage Weil sw...@redhat.com, ceph-users@lists.ceph.com


 On Mon, Aug 18, 2014 at 7:32 PM, Kenneth Waegeman
 kenneth.waege...@ugent.be wrote:



 - Message from Haomai Wang haomaiw...@gmail.com -
  Date: Mon, 18 Aug 2014 18:34:11 +0800

  From: Haomai Wang haomaiw...@gmail.com
 Subject: Re: [ceph-users] ceph cluster inconsistency?
To: Kenneth Waegeman kenneth.waege...@ugent.be
Cc: Sage Weil sw...@redhat.com, ceph-users@lists.ceph.com



 On Mon, Aug 18, 2014 at 5:38 PM, Kenneth Waegeman
 kenneth.waege...@ugent.be wrote:



 Hi,

 

Re: [ceph-users] Ceph monitor load, low performance

2014-08-26 Thread Mateusz Skała

You mean to move /var/log/ceph/* to SSD disk?



Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-26 Thread Christian Balzer

Hello,

On Tue, 26 Aug 2014 10:23:43 +1000 Blair Bethwaite wrote:

  Message: 25
  Date: Fri, 15 Aug 2014 15:06:49 +0200
  From: Loic Dachary l...@dachary.org
  To: Erik Logtenberg e...@logtenberg.eu, ceph-users@lists.ceph.com
  Subject: Re: [ceph-users] Best practice K/M-parameters EC pool
  Message-ID: 53ee05e9.1040...@dachary.org
  Content-Type: text/plain; charset=iso-8859-1
  ...
  Here is how I reason about it, roughly:
 
  If the probability of loosing a disk is 0.1%, the probability of
  loosing two disks simultaneously (i.e. before the failure can be
  recovered) would be 0.1*0.1 = 0.01% and three disks becomes
  0.1*0.1*0.1 = 0.001% and four disks becomes 0.0001%
 
 I watched this conversation and an older similar one (Failure
 probability with largish deployments) with interest as we are in the
 process of planning a pretty large Ceph cluster (~3.5 PB), so I have
 been trying to wrap my head around these issues.

As the OP of the Failure probability with largish deployments thread I
have to thank Blair for raising this issue again and doing the hard math
below. Which looks fine to me.

At the end of that slightly inconclusive thread I walked away with the
same impression as Blair, namely that the survival of PGs is the key
factor and that they will likely be spread out over most, if not all the
OSDs.

Which in turn did reinforce my decision to deploy our first production
Ceph cluster based on nodes with 2 OSDs backed by 11 disk RAID6 sets behind
a HW RAID controller with 4GB cache AND SDD journals. 
I can live with the reduced performance (which is caused by the OSD code
running out of steam long before the SSDs or the RAIDs do), because not
only do I save 1/3rd of the space and 1/4th of the cost compared to a
replication 3 cluster, the total of disks that need to fail within the
recovery window to cause data loss is now 4.

The next cluster I'm currently building is a classic Ceph design,
replication of 3, 8 OSD HDDs and 4 journal SSDs per node, because with
this cluster I won't have predictable I/O patterns and loads.
OTOH, I don't see it growing much beyond 48 OSDs, so I'm happy enough with
the odds here.

I think doing the exact maths for a cluster of the size you're planning
would be very interesting and also very much needed. 
3.5PB usable space would be close to 3000 disks with a replication of 3,
but even if you meant that as gross value it would probably mean that
you're looking at frequent, if not daily disk failures.


Regards,

Christian
 Loic's reasoning (above) seems sound as a naive approximation assuming
 independent probabilities for disk failures, which may not be quite
 true given potential for batch production issues, but should be okay
 for other sorts of correlations (assuming a sane crushmap that
 eliminates things like controllers and nodes as sources of
 correlation).
 
 One of the things that came up in the Failure probability with
 largish deployments thread and has raised its head again here is the
 idea that striped data (e.g., RADOS-GW objects and RBD volumes) might
 be somehow more prone to data-loss than non-striped. I don't think
 anyone has so far provided an answer on this, so here's my thinking...
 
 The level of atomicity that matters when looking at durability 
 availability in Ceph is the Placement Group. For any non-trivial RBD
 it is likely that many RBDs will span all/most PGs, e.g., even a
 relatively small 50GiB volume would (with default 4MiB object size)
 span 12800 PGs - more than there are in many production clusters
 obeying the 100-200 PGs per drive rule of thumb. <IMPORTANT>Losing any
 one PG will cause data-loss. The failure-probability effects of
 striping across multiple PGs are immaterial considering that loss of
 any single PG is likely to damage all your RBDs</IMPORTANT>. This
 might be why the reliability calculator doesn't consider total number
 of disks.
 
 Related to all this is the durability of 2 versus 3 replicas (or e.g.
 M=1 for Erasure Coding). It's easy to get caught up in the worrying
 fallacy that losing any M OSDs will cause data-loss, but this isn't
 true - they have to be members of the same PG for data-loss to occur.
 So then it's tempting to think the chances of that happening are so
 slim as to not matter and why would we ever even need 3 replicas. I
 mean, what are the odds of exactly those 2 drives, out of the
 100,200... in my cluster, failing in recovery window?! But therein
 lays the rub - you should be thinking about PGs. If a drive fails then
 the chance of a data-loss event resulting are dependent on the chances
 of losing further drives from the affected/degraded PGs.
 
 I've got a real cluster at hand, so let's use that as an example. We
 have 96 drives/OSDs - 8 nodes, 12 OSDs per node, 2 replicas, top-down
 failure domains: rack pairs (x2), nodes, OSDs... Let's say OSD 15
 dies. How many PGs are now at risk:
 $ grep ^10\. pg.dump | awk '{print $15}' | grep 15 | wc
 109 109 861
 (NB: 10 is the pool 

Re: [ceph-users] Ceph monitor load, low performance

2014-08-26 Thread Irek Fasikhov
I'm sorry, of course I meant the journals :)


2014-08-26 13:16 GMT+04:00 Mateusz Skała mateusz.sk...@budikom.net:

 You mean to move /var/log/ceph/* to SSD disk?






-- 
Best regards, Irek Fasikhov
Mob.: +79229045757


[ceph-users] Ceph monitor load, low performance

2014-08-26 Thread pawel . orzechowski
 

Hello Gentlemen :-)

Let me point out one important aspect of this low-performance problem:
of all 4 nodes in our ceph cluster only one node shows bad metrics, that
is, very high latency on its OSDs (200-600ms), while the other three
nodes behave normally, with OSD latencies between 1-10ms.

So, the idea of putting journals on SSD is something that we are looking
at, but we think that we have some general problem with that particular
node, which affects the whole cluster.

So can the number of hosts (4) be a reason for that? Any other hints?

Thanks 

Pawel ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] enrich ceph test methods, what is your concern about ceph. thanks

2014-08-26 Thread yuelongguang

Thanks, Irek Fasikhov.
Is that the only way to test ceph-rbd? An important aim of the test is to
find where the bottleneck is: qemu, librbd or ceph itself.
Could you share your test results with me?
 
 
 
thanks




 


On 2014-08-26 04:22:22, Irek Fasikhov malm...@gmail.com wrote:

Hi.
I and many people use fio. 
For ceph rbd, fio has a special engine:
https://telekomcloud.github.io/ceph/2014/02/26/ceph-performance-analysis_fio_rbd.html
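
A minimal job file in the spirit of the one in that article might look like
this (a sketch -- the pool/image names are placeholders, the test image has to
exist first, e.g. rbd create fio_test --size 10240, and invalidate=0 is only
needed to work around a bug in older fio versions):

[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio_test
invalidate=0
rw=randwrite
bs=4k
iodepth=32
runtime=60
time_based

[rbd-4k-randwrite]

Save it as rbd.fio and run fio rbd.fio. Repeating the run with bs=4m and
different iodepth values helps separate small-I/O latency limits from raw
bandwidth limits, and rados -p rbd bench 60 write is a useful cross-check
below the librbd layer.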



2014-08-26 12:15 GMT+04:00 yuelongguang fasts...@163.com:

hi, all
 
I am planning to run a set of tests on Ceph covering performance, throughput,
scalability and availability.
In order to get a complete picture I hope you can all give me some advice;
I can send the results to you if you like.
For each test category (performance, throughput, scalability, availability),
do you have any test ideas or test tools to suggest?
Basically I already know some tools for testing throughput and IOPS, but
please tell me which tools you prefer and what results you would expect.
 
thanks very much
 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com







--

Best regards, Irek Nurgayazovich Fasikhov
Mobile: +79229045757
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] enrich ceph test methods, what is your concern about ceph. thanks

2014-08-26 Thread Irek Fasikhov
For me, the bottleneck is single-threaded operation. Writes are more or less
solved by enabling the rbd cache, but reads are still a problem. I think the
read side could be addressed with a cache pool, but I have not tested that.
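
For reference, the librbd cache is enabled purely on the client side in
ceph.conf (a minimal sketch; the commented sizes are just the documented
defaults, and with qemu the drive also has to be configured with
cache=writeback for the cache to take effect):

[client]
    rbd cache = true
    rbd cache writethrough until flush = true
    # rbd cache size = 33554432        (default, 32MB)
    # rbd cache max dirty = 25165824   (default, 24MB)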

In principle, the more threads, the greater the read and write throughput,
but in reality it is different.

Throughput and the number of operations per second depend on many factors,
such as network latency.

Examples testing, special attention to the charts:

https://software.intel.com/en-us/blogs/2013/10/25/measure-ceph-rbd-performance-in-a-quantitative-way-part-i
and
https://software.intel.com/en-us/blogs/2013/11/20/measure-ceph-rbd-performance-in-a-quantitative-way-part-ii


2014-08-26 15:11 GMT+04:00 yuelongguang fasts...@163.com:


 thanks Irek Fasikhov.
 is it the only way to test ceph-rbd?  and an important aim of the test is
 to find where  the bottleneck is.   qemu/librbd/ceph.
 could you share your test result with me?



 thanks






 On 2014-08-26 04:22:22, Irek Fasikhov malm...@gmail.com wrote:

 Hi.
 I and many people use fio.
 For ceph rbd has a special engine:
 https://telekomcloud.github.io/ceph/2014/02/26/ceph-performance-analysis_fio_rbd.html


 2014-08-26 12:15 GMT+04:00 yuelongguang fasts...@163.com:

 hi,all

 i am planning to do a test on ceph, include performance, throughput,
 scalability,availability.
 in order to get a full test result, i  hope you all can give me some
 advice. meanwhile i can send the result to you,if you like.
 as for each category test( performance, throughput,
 scalability,availability)  ,  do you have some some test idea and test
 tools?
 basicly i have know some tools to test throughtput,iops .  but you can
 tell the tools you prefer and the result you expect.

 thanks very much




 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




 --
 Best regards, Irek Nurgayazovich Fasikhov
 Mobile: +79229045757






-- 
Best regards, Irek Nurgayazovich Fasikhov
Mobile: +79229045757
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] enrich ceph test methods, what is your concern about ceph. thanks

2014-08-26 Thread Irek Fasikhov
Sorry..Enter pressed :)

continued...
no, it's not the only way to test; it depends on what you want to use Ceph for

2014-08-26 15:22 GMT+04:00 Irek Fasikhov malm...@gmail.com:

 For me, the bottleneck is single-threaded operation. The recording will
 have more or less solved with the inclusion of rbd cache, but there are
 problems with reading. But I think that these problems can be solved cache
 pool, but have not tested.

 It follows that the more threads, the greater the speed of reading and
 writing. But in reality it is different.

 The speed and number of operations, depending on many factors, such as
 network latency.

 Examples testing, special attention to the charts:


 https://software.intel.com/en-us/blogs/2013/10/25/measure-ceph-rbd-performance-in-a-quantitative-way-part-i
 and

 https://software.intel.com/en-us/blogs/2013/11/20/measure-ceph-rbd-performance-in-a-quantitative-way-part-ii


 2014-08-26 15:11 GMT+04:00 yuelongguang fasts...@163.com:


 thanks Irek Fasikhov.
 is it the only way to test ceph-rbd?  and an important aim of the test is
 to find where  the bottleneck is.   qemu/librbd/ceph.
 could you share your test result with me?



 thanks






 On 2014-08-26 04:22:22, Irek Fasikhov malm...@gmail.com wrote:

 Hi.
 I and many people use fio.
 For ceph rbd has a special engine:
 https://telekomcloud.github.io/ceph/2014/02/26/ceph-performance-analysis_fio_rbd.html


 2014-08-26 12:15 GMT+04:00 yuelongguang fasts...@163.com:

 hi,all

 i am planning to do a test on ceph, include performance, throughput,
 scalability,availability.
 in order to get a full test result, i  hope you all can give me some
 advice. meanwhile i can send the result to you,if you like.
 as for each category test( performance, throughput,
 scalability,availability)  ,  do you have some some test idea and test
 tools?
 basicly i have know some tools to test throughtput,iops .  but you can
 tell the tools you prefer and the result you expect.

 thanks very much




 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




 --
 Best regards, Irek Nurgayazovich Fasikhov
 Mobile: +79229045757






 --
 Best regards, Irek Nurgayazovich Fasikhov
 Mobile: +79229045757




-- 
Best regards, Irek Nurgayazovich Fasikhov
Mobile: +79229045757
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph can not repair itself after accidental power down, half of pgs are peering

2014-08-26 Thread yuelongguang
hi,all
 
I have 5 OSDs and 3 mons. Its status was OK before the power loss.
 
To be clear, this cluster holds no data; I had only just deployed it to get
familiar with the command-line tools.
What is the problem, and how do I fix it?
 
thanks
 
 
---environment-
ceph-release-1-0.el6.noarch
ceph-deploy-1.5.11-0.noarch
ceph-0.81.0-5.el6.x86_64
ceph-libs-0.81.0-5.el6.x86_64
-ceph -s --
[root@cephosd1-mona ~]# ceph -s
cluster 508634f6-20c9-43bb-bc6f-b777f4bb1651
 health HEALTH_WARN 183 pgs peering; 183 pgs stuck inactive; 183 pgs stuck 
unclean; clock skew detected on mon.cephosd2-monb, mon.cephosd3-monc
 monmap e13: 3 mons at 
{cephosd1-mona=10.154.249.3:6789/0,cephosd2-monb=10.154.249.4:6789/0,cephosd3-monc=10.154.249.5:6789/0},
 election epoch 74, quorum 0,1,2 cephosd1-mona,cephosd2-monb,cephosd3-monc
 osdmap e151: 5 osds: 5 up, 5 in
  pgmap v499: 384 pgs, 4 pools, 0 bytes data, 0 objects
201 MB used, 102143 MB / 102344 MB avail
 167 peering
 201 active+clean
  16 remapped+peering
 
 
--log--osd.0
2014-08-26 19:16:13.926345 7f114a8d2700  0 cephx: verify_authorizer could not 
decrypt ticket info: error: decryptor.MessageEnd::Exception: 
StreamTransformationFilter: invalid PKCS #7 block padding found
2014-08-26 19:16:13.926355 7f114a8d2700  0 -- 11.154.249.2:6800/1667  
11.154.249.7:6800/1599 pipe(0x4dc2a80 sd=25 :6800 s=0 pgs=0 cs=0 l=0 
c=0x45d5960).accept: got bad authorizer
2014-08-26 19:16:28.928023 7f114a8d2700  0 cephx: verify_authorizer could not 
decrypt ticket info: error: decryptor.MessageEnd::Exception: 
StreamTransformationFilter: invalid PKCS #7 block padding found
2014-08-26 19:16:28.928050 7f114a8d2700  0 -- 11.154.249.2:6800/1667  
11.154.249.7:6800/1599 pipe(0x4dc2800 sd=25 :6800 s=0 pgs=0 cs=0 l=0 
c=0x45d56a0).accept: got bad authorizer
2014-08-26 19:16:28.929139 7f114c009700  0 cephx: verify_reply couldn't decrypt 
with error: error decoding block for decryption
2014-08-26 19:16:28.929237 7f114c009700  0 -- 11.154.249.2:6800/1667  
11.154.249.7:6800/1599 pipe(0x3edb700 sd=24 :38071 s=1 pgs=0 cs=0 l=0 
c=0x45d23c0).failed verifying authorize reply
2014-08-26 19:16:43.930846 7f114a8d2700  0 cephx: verify_authorizer could not 
decrypt ticket info: error: decryptor.MessageEnd::Exception: 
StreamTransformationFilter: invalid PKCS #7 block padding found
2014-08-26 19:16:43.930899 7f114a8d2700  0 -- 11.154.249.2:6800/1667  
11.154.249.7:6800/1599 pipe(0x4dc2580 sd=25 :6800 s=0 pgs=0 cs=0 l=0 
c=0x45d0b00).accept: got bad authorizer
2014-08-26 19:16:43.932204 7f114c009700  0 cephx: verify_reply couldn't decrypt 
with error: error decoding block for decryption
2014-08-26 19:16:43.932230 7f114c009700  0 -- 11.154.249.2:6800/1667  
11.154.249.7:6800/1599 pipe(0x3edb700 sd=24 :38073 s=1 pgs=0 cs=0 l=0 
c=0x45d23c0).failed verifying authorize reply
2014-08-26 19:16:58.933526 7f114a8d2700  0 cephx: verify_authorizer could not 
decrypt ticket info: error: decryptor.MessageEnd::Exception: 
StreamTransformationFilter: invalid PKCS #7 block padding found
2014-08-26 19:16:58.935094 7f114a8d2700  0 -- 11.154.249.2:6800/1667  
11.154.249.7:6800/1599 pipe(0x4dc2300 sd=25 :6800 s=0 pgs=0 cs=0 l=0 
c=0x45d0840).accept: got bad authorizer
2014-08-26 19:16:58.936239 7f114c009700  0 cephx: verify_reply couldn't decrypt 
with error: error decoding block for decryption
2014-08-26 19:16:58.936261 7f114c009700  0 -- 11.154.249.2:6800/1667  
11.154.249.7:6800/1599 pipe(0x3edb700 sd=24 :38074 s=1 pgs=0 cs=0 l=0 
c=0x45d23c0).failed verifying authorize reply
2014-08-26 19:17:13.937335 7f114a8d2700  0 cephx: verify_authorizer could not 
decrypt ticket info: error: decryptor.MessageEnd::Exception: 
StreamTransformationFilter: invalid PKCS #7 block padding found
2014-08-26 19:17:13.937368 7f114a8d2700  0 -- 11.154.249.2:6800/1667  
11.154.249.7:6800/1599 pipe(0x4dc2080 sd=25 :6800 s=0 pgs=0 cs=0 l=0 
c=0x45d1b80).accept: got bad authorizer
2014-08-26 19:17:13.937923 7f114c009700  0 cephx: verify_reply couldn't decrypt 
with error: error decoding block for decryption
2014-08-26 19:17:13.937933 7f114c009700  0 -- 11.154.249.2:6800/1667  
11.154.249.7:6800/1599 pipe(0x3edb700 sd=24 :38075 s=1 pgs=0 cs=0 l=0 
c=0x45d23c0).failed verifying authorize reply
2014-08-26 19:17:28.939439 7f114a8d2700  0 cephx: verify_authorizer could not 
decrypt ticket info: error: decryptor.MessageEnd::Exception: 
StreamTransformationFilter: invalid PKCS #7 block padding found
2014-08-26 19:17:28.939455 7f114a8d2700  0 -- 11.154.249.2:6800/1667  
11.154.249.7:6800/1599 pipe(0x4dc1e00 sd=25 :6800 s=0 pgs=0 cs=0 l=0 
c=0x45d5540).accept: got bad authorizer
2014-08-26 19:17:28.939716 7f114c009700  0 cephx: verify_reply couldn't decrypt 
with error: error decoding block for decryption
2014-08-26 19:17:28.939731 7f114c009700  0 -- 11.154.249.2:6800/1667  
11.154.249.7:6800/1599 pipe(0x3edb700 sd=24 :38076 s=1 pgs=0 cs=0 l=0 

Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-26 Thread Loic Dachary
Hi Blair,

Assuming that:

* The pool is configured for three replicas (size = 3 which is the default)
* It takes one hour for Ceph to recover from the loss of a single OSD
* Any other disk has a 0.001% chance to fail within the hour following the 
failure of the first disk (assuming AFR 
https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is 10%, 
divided by the number of hours during a year).
* A given disk does not participate in more than 100 PG

Each time an OSD is lost, there is a 0.001*0.001 = 0.01% chance that two 
other disks are lost before recovery. Since the disk that failed initially 
participates in 100 PG, that is 0.01% x 100 = 0.0001% chance that a PG is 
lost. Or the entire pool if it is used in a way that losing a PG means losing 
all data in the pool (as in your example, where it contains RBD volumes and 
each of the RBD volumes uses all the available PG).

If the pool is using at least two datacenters operated by two different 
organizations, this calculation makes sense to me. However, if the cluster is 
in a single datacenter, isn't it possible that some event independent of Ceph 
has a greater probability of permanently destroying the data ? A month ago I 
lost three machines in a Ceph cluster and realized on that occasion that the 
crushmap was not configured properly and that PG were lost as a result. 
Fortunately I was able to recover the disks and plug them in another machine to 
recover the lost PGs. I'm not a system administrator and the probability of me 
failing to do the right thing is higher than normal: this is just an example of 
a high probability event leading to data loss. In other words, I wonder if this 
0.0001% chance of losing a PG within the hour following a disk failure matters 
or if it is dominated by other factors. What do you think ?

Cheers

On 26/08/2014 02:23, Blair Bethwaite wrote:
 Message: 25
 Date: Fri, 15 Aug 2014 15:06:49 +0200
 From: Loic Dachary l...@dachary.org
 To: Erik Logtenberg e...@logtenberg.eu, ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Best practice K/M-parameters EC pool
 Message-ID: 53ee05e9.1040...@dachary.org
 Content-Type: text/plain; charset=iso-8859-1
 ...
 Here is how I reason about it, roughly:

 If the probability of losing a disk is 0.1%, the probability of losing two 
 disks simultaneously (i.e. before the failure can be recovered) would be 
 0.1*0.1 = 0.01% and three disks becomes 0.1*0.1*0.1 = 0.001% and four disks 
 becomes 0.0001%
 
 I watched this conversation and an older similar one (Failure
 probability with largish deployments) with interest as we are in the
 process of planning a pretty large Ceph cluster (~3.5 PB), so I have
 been trying to wrap my head around these issues.
 
 Loic's reasoning (above) seems sound as a naive approximation assuming
 independent probabilities for disk failures, which may not be quite
 true given potential for batch production issues, but should be okay
 for other sorts of correlations (assuming a sane crushmap that
 eliminates things like controllers and nodes as sources of
 correlation).
 
 One of the things that came up in the Failure probability with
 largish deployments thread and has raised its head again here is the
 idea that striped data (e.g., RADOS-GW objects and RBD volumes) might
 be somehow more prone to data-loss than non-striped. I don't think
 anyone has so far provided an answer on this, so here's my thinking...
 
 The level of atomicity that matters when looking at durability &
 availability in Ceph is the Placement Group. For any non-trivial RBD use
 it is likely that many RBDs will span all/most PGs, e.g., even a
 relatively small 50GiB volume would (with the default 4MiB object size)
 be split into 12800 objects - more objects than there are PGs in many
 production clusters obeying the 100-200 PGs per drive rule of thumb.
 *Losing any one PG will cause data-loss. The failure-probability effects
 of striping across multiple PGs are immaterial considering that loss of
 any single PG is likely to damage all your RBDs.* This might be why the
 reliability calculator doesn't consider the total number of disks.
 
 Related to all this is the durability of 2 versus 3 replicas (or e.g.
 M=1 for Erasure Coding). It's easy to get caught up in the worrying
 fallacy that losing any M OSDs will cause data-loss, but this isn't
 true - they have to be members of the same PG for data-loss to occur.
 So then it's tempting to think the chances of that happening are so
 slim as to not matter and why would we ever even need 3 replicas. I
 mean, what are the odds of exactly those 2 drives, out of the
 100,200... in my cluster, failing within the recovery window?! But therein
 lies the rub - you should be thinking about PGs. If a drive fails then
 the chance of a data-loss event resulting is dependent on the chances
 of losing further drives from the affected/degraded PGs.
 
 I've got a real cluster at hand, so let's use that as an example. We
 have 96 drives/OSDs - 8 nodes, 12 

Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-26 Thread Loic Dachary
Using percentages instead of plain numbers led me to calculation errors. Here
it is again using 1/100 fractions instead of % for clarity ;-)

Assuming that:

* The pool is configured for three replicas (size = 3 which is the default)
* It takes one hour for Ceph to recover from the loss of a single OSD
* Any other disk has a 1/100,000 chance to fail within the hour following the 
failure of the first disk (assuming AFR 
https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is 8%, 
divided by the number of hours during a year: (0.08 / 8760) ~= 1/100,000)
* A given disk does not participate in more than 100 PG

Each time an OSD is lost, there is a 1/100,000 * 1/100,000 = 1/10,000,000,000
chance that two other disks are lost before recovery. Since the disk that
failed initially participates in 100 PG, that is 1/10,000,000,000 x 100 =
1/100,000,000 chance that a PG is lost. Or the entire pool if it is used in a
way that losing a PG means losing all data in the pool (as in your example,
where it contains RBD volumes and each of the RBD volumes uses all the
available PG).
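
Putting the same arithmetic on one line (size = 3, one hour recovery window,
100 PG per disk):

  P(lose a PG | one disk just failed) ~= 100 x (0.08 / 8760)^2
                                      ~= 100 x (1/100,000)^2 = 1/100,000,000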

If the pool is using at least two datacenters operated by two different 
organizations, this calculation makes sense to me. However, if the cluster is 
in a single datacenter, isn't it possible that some event independent of Ceph 
has a greater probability of permanently destroying the data ? A month ago I 
lost three machines in a Ceph cluster and realized on that occasion that the 
crushmap was not configured properly and that PGs were lost as a result.
Fortunately I was able to recover the disks and plug them into another machine
to recover the lost PGs. I'm not a system administrator and the probability of
me failing to do the right thing is higher than normal: this is just an example
of a high-probability event leading to data loss. Another example would be if
all disks in the same PG are part of the same batch and therefore likely to
fail at the same time. In other words, I wonder if this 1/100,000,000 chance of
losing a PG within the hour following a disk failure matters or if it is
dominated by other factors. What do you think?

Cheers

 
 Assuming that:
 
 * The pool is configured for three replicas (size = 3 which is the default)
 * It takes one hour for Ceph to recover from the loss of a single OSD
 * Any other disk has a 0.001% chance to fail within the hour following the 
 failure of the first disk (assuming AFR 
 https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is 10%, 
 divided by the number of hours during a year).
 * A given disk does not participate in more than 100 PG
 
 Each time an OSD is lost, there is a 0.001*0.001 = 0.01% chance that two 
 other disks are lost before recovery. Since the disk that failed initialy 
 participates in 100 PG, that is 0.01% x 100 = 0.0001% chance that a PG is 
 lost. Or the entire pool if it is used in a way that loosing a PG means 
 loosing all data in the pool (as in your example, where it contains RBD 
 volumes and each of the RBD volume uses all the available PG).
 
 If the pool is using at least two datacenters operated by two different 
 organizations, this calculation makes sense to me. However, if the cluster is 
 in a single datacenter, isn't it possible that some event independent of Ceph 
 has a greater probability of permanently destroying the data ? A month ago I 
 lost three machines in a Ceph cluster and realized on that occasion that the 
 crushmap was not configured properly and that PG were lost as a result. 
 Fortunately I was able to recover the disks and plug them in another machine 
 to recover the lost PGs. I'm not a system administrator and the probability 
 of me failing to do the right thing is higher than normal: this is just an 
 example of a high probability event leading to data loss. In other words, I 
 wonder if this 0.0001% chance of losing a PG within the hour following a disk 
 failure matters or if it is dominated by other factors. What do you think ?
 
 Cheers

On 26/08/2014 15:25, Loic Dachary wrote: Hi Blair,
 
 Assuming that:
 
 * The pool is configured for three replicas (size = 3 which is the default)
 * It takes one hour for Ceph to recover from the loss of a single OSD
 * Any other disk has a 0.001% chance to fail within the hour following the 
 failure of the first disk (assuming AFR 
 https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is 10%, 
 divided by the number of hours during a year).
 * A given disk does not participate in more than 100 PG
 
 Each time an OSD is lost, there is a 0.001*0.001 = 0.01% chance that two 
 other disks are lost before recovery. Since the disk that failed initialy 
 participates in 100 PG, that is 0.01% x 100 = 0.0001% chance that a PG is 
 lost. Or the entire pool if it is used in a way that loosing a PG means 
 loosing all data in the pool (as in your example, where it contains RBD 
 volumes and each of the RBD volume uses all the available PG).
 
 If the 

[ceph-users] MDS dying on Ceph 0.67.10

2014-08-26 Thread MinhTien MinhTien
Hi all,

I have a cluster of 2 nodes on CentOS 6.5 with ceph 0.67.10 (replication = 2).

When I added the 3rd node to the Ceph cluster, Ceph started rebalancing.

I have 3 MDSs on the 3 nodes; the MDS processes die after a while with a
stack trace:

---

 2014-08-26 17:08:34.362901 7f1c2c704700  1 -- 10.20.0.21:6800/22154
== osd.10 10.20.0.21:6802/15917 1  osd_op_reply(230
10003f6. [tmapup 0~0] ondisk = 0) v4  119+0+0
(1770421071 0 0) 0x2aece00 con 0x2aa4200
   -54 2014-08-26 17:08:34.362942 7f1c2c704700  1 --
10.20.0.21:6800/22154 == osd.55 10.20.0.23:6800/2407 10 
osd_op_reply(263 100048a. [getxattr] ack = -2 (No such
file or directory)) v4  119+0+0 (3908997833 0 0) 0x1e63000 con
0x1e7aaa0
   -53 2014-08-26 17:08:34.363001 7f1c2c704700  5 mds.0.log
submit_entry 427629603~1541 : EUpdate purge_stray truncate [metablob
100, 2 dirs]
   -52 2014-08-26 17:08:34.363022 7f1c2c704700  1 --
10.20.0.21:6800/22154 == osd.37 10.20.0.22:6898/11994 6 
osd_op_reply(226 1. [tmapput 0~7664] ondisk = 0) v4 
109+0+0 (1007110430 0 0) 0x1e64800 con 0x1e7a7e0
   -51 2014-08-26 17:08:34.363092 7f1c2c704700  5 mds.0.log _expired
segment 293601899 2548 events
   -50 2014-08-26 17:08:34.363117 7f1c2c704700  1 --
10.20.0.21:6800/22154 == osd.17 10.20.0.21:6941/17572 9 
osd_op_reply(264 1000489. [getxattr] ack = -2 (No such
file or directory)) v4  119+0+0 (1979034473 0 0) 0x1e62200 con
0x1e7b180
   -49 2014-08-26 17:08:34.363177 7f1c2c704700  5 mds.0.log
submit_entry 427631148~1541 : EUpdate purge_stray truncate [metablob
100, 2 dirs]
   -48 2014-08-26 17:08:34.363197 7f1c2c704700  1 --
10.20.0.21:6800/22154 == osd.1 10.20.0.21:6872/13227 6 
osd_op_reply(265 1000491. [getxattr] ack = -2 (No such
file or directory)) v4  119+0+0 (1231782695 0 0) 0x1e63400 con
0x1e7ac00
   -47 2014-08-26 17:08:34.363255 7f1c2c704700  5 mds.0.log
submit_entry 427632693~1541 : EUpdate purge_stray truncate [metablob
100, 2 dirs]
   -46 2014-08-26 17:08:34.363274 7f1c2c704700  1 --
10.20.0.21:6800/22154 == osd.11 10.20.0.21:6884/7018 5 
osd_op_reply(266 100047d. [getxattr] ack = -2 (No such
file or directory)) v4  119+0+0 (2737916920 0 0) 0x1e61e00 con
0x1e7bc80

-
I tried restarting the MDSs, but after a few seconds in the active state they
switch to laggy or crashed. I have a lot of important data on this filesystem,
so I do not want to have to use:
ceph mds newfs <metadata pool id> <data pool id> --yes-i-really-mean-it

:(
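
Before doing anything destructive it is probably worth capturing a full trace
of the crash with MDS debug logging turned up (a sketch -- adjust names and
paths to your setup; the injectargs syntax varies slightly between releases):

# in ceph.conf on the MDS nodes, then restart the mds:
[mds]
    debug mds = 20
    debug ms = 1

# or on a running daemon:
ceph mds tell 0 injectargs '--debug_mds 20 --debug_ms 1'

# the assert and stack trace should then show up in
# /var/log/ceph/ceph-mds.<name>.log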

Tien Bui.



-- 
Bui Minh Tien
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph can not repair itself after accidental power down, half of pgs are peering

2014-08-26 Thread Michael
How far out are your clocks? It's showing a clock skew; if they're too
far out it can cause issues with cephx.

Otherwise you're probably going to need to check your cephx auth keys.
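
A quick way to check both (a sketch, assuming ntpd on CentOS-style hosts and
the default keyring locations):

# how big the skew is and which mons are affected
ceph health detail

# on each node: compare against your NTP servers and step the clock if needed
ntpq -p
service ntpd stop; ntpdate pool.ntp.org; service ntpd start

# compare the keys the monitors know about with what is on disk
ceph auth list
cat /etc/ceph/ceph.client.admin.keyring
cat /var/lib/ceph/osd/ceph-0/keyring   # per-OSD keyring, osd.0 as an example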

-Michael

On 26/08/2014 12:26, yuelongguang wrote:

hi,all
i have 5 osds and 3 mons. its status is ok then.
to be mentioned , this cluster has no any data.  i just deploy it and 
to be familar with some command lines.

what is the probpem and how to fix?
thanks
---environment-
ceph-release-1-0.el6.noarch
ceph-deploy-1.5.11-0.noarch
ceph-0.81.0-5.el6.x86_64
ceph-libs-0.81.0-5.el6.x86_64
-ceph -s --
[root@cephosd1-mona ~]# ceph -s
cluster 508634f6-20c9-43bb-bc6f-b777f4bb1651
 health HEALTH_WARN 183 pgs peering; 183 pgs stuck inactive; 183 
pgs stuck unclean; clock skew detected on mon.cephosd2-monb, 
mon.cephosd3-monc
 monmap e13: 3 mons at 
{cephosd1-mona=10.154.249.3:6789/0,cephosd2-monb=10.154.249.4:6789/0,cephosd3-monc=10.154.249.5:6789/0}, 
election epoch 74, quorum 0,1,2 cephosd1-mona,cephosd2-monb,cephosd3-monc

 osdmap e151: 5 osds: 5 up, 5 in
  pgmap v499: 384 pgs, 4 pools, 0 bytes data, 0 objects
201 MB used, 102143 MB / 102344 MB avail
 167 peering
 201 active+clean
  16 remapped+peering
--log--osd.0
2014-08-26 19:16:13.926345 7f114a8d2700  0 cephx: verify_authorizer 
could not decrypt ticket info: error: decryptor.MessageEnd::Exception: 
StreamTransformationFilter: invalid PKCS #7 block padding found
2014-08-26 19:16:13.926355 7f114a8d2700  0 -- 11.154.249.2:6800/1667 
 11.154.249.7:6800/1599 pipe(0x4dc2a80 sd=25 :6800 s=0 pgs=0 cs=0 
l=0 c=0x45d5960).accept: got bad authorizer
2014-08-26 19:16:28.928023 7f114a8d2700  0 cephx: verify_authorizer 
could not decrypt ticket info: error: decryptor.MessageEnd::Exception: 
StreamTransformationFilter: invalid PKCS #7 block padding found
2014-08-26 19:16:28.928050 7f114a8d2700  0 -- 11.154.249.2:6800/1667 
 11.154.249.7:6800/1599 pipe(0x4dc2800 sd=25 :6800 s=0 pgs=0 cs=0 
l=0 c=0x45d56a0).accept: got bad authorizer
2014-08-26 19:16:28.929139 7f114c009700  0 cephx: verify_reply 
couldn't decrypt with error: error decoding block for decryption
2014-08-26 19:16:28.929237 7f114c009700  0 -- 11.154.249.2:6800/1667 
 11.154.249.7:6800/1599 pipe(0x3edb700 sd=24 :38071 s=1 pgs=0 cs=0 
l=0 c=0x45d23c0).failed verifying authorize reply
2014-08-26 19:16:43.930846 7f114a8d2700  0 cephx: verify_authorizer 
could not decrypt ticket info: error: decryptor.MessageEnd::Exception: 
StreamTransformationFilter: invalid PKCS #7 block padding found
2014-08-26 19:16:43.930899 7f114a8d2700  0 -- 11.154.249.2:6800/1667 
 11.154.249.7:6800/1599 pipe(0x4dc2580 sd=25 :6800 s=0 pgs=0 cs=0 
l=0 c=0x45d0b00).accept: got bad authorizer
2014-08-26 19:16:43.932204 7f114c009700  0 cephx: verify_reply 
couldn't decrypt with error: error decoding block for decryption
2014-08-26 19:16:43.932230 7f114c009700  0 -- 11.154.249.2:6800/1667 
 11.154.249.7:6800/1599 pipe(0x3edb700 sd=24 :38073 s=1 pgs=0 cs=0 
l=0 c=0x45d23c0).failed verifying authorize reply
2014-08-26 19:16:58.933526 7f114a8d2700  0 cephx: verify_authorizer 
could not decrypt ticket info: error: decryptor.MessageEnd::Exception: 
StreamTransformationFilter: invalid PKCS #7 block padding found
2014-08-26 19:16:58.935094 7f114a8d2700  0 -- 11.154.249.2:6800/1667 
 11.154.249.7:6800/1599 pipe(0x4dc2300 sd=25 :6800 s=0 pgs=0 cs=0 
l=0 c=0x45d0840).accept: got bad authorizer
2014-08-26 19:16:58.936239 7f114c009700  0 cephx: verify_reply 
couldn't decrypt with error: error decoding block for decryption
2014-08-26 19:16:58.936261 7f114c009700  0 -- 11.154.249.2:6800/1667 
 11.154.249.7:6800/1599 pipe(0x3edb700 sd=24 :38074 s=1 pgs=0 cs=0 
l=0 c=0x45d23c0).failed verifying authorize reply
2014-08-26 19:17:13.937335 7f114a8d2700  0 cephx: verify_authorizer 
could not decrypt ticket info: error: decryptor.MessageEnd::Exception: 
StreamTransformationFilter: invalid PKCS #7 block padding found
2014-08-26 19:17:13.937368 7f114a8d2700  0 -- 11.154.249.2:6800/1667 
 11.154.249.7:6800/1599 pipe(0x4dc2080 sd=25 :6800 s=0 pgs=0 cs=0 
l=0 c=0x45d1b80).accept: got bad authorizer
2014-08-26 19:17:13.937923 7f114c009700  0 cephx: verify_reply 
couldn't decrypt with error: error decoding block for decryption
2014-08-26 19:17:13.937933 7f114c009700  0 -- 11.154.249.2:6800/1667 
 11.154.249.7:6800/1599 pipe(0x3edb700 sd=24 :38075 s=1 pgs=0 cs=0 
l=0 c=0x45d23c0).failed verifying authorize reply
2014-08-26 19:17:28.939439 7f114a8d2700  0 cephx: verify_authorizer 
could not decrypt ticket info: error: decryptor.MessageEnd::Exception: 
StreamTransformationFilter: invalid PKCS #7 block padding found
2014-08-26 19:17:28.939455 7f114a8d2700  0 -- 11.154.249.2:6800/1667 
 11.154.249.7:6800/1599 pipe(0x4dc1e00 sd=25 :6800 s=0 pgs=0 cs=0 
l=0 c=0x45d5540).accept: got bad authorizer
2014-08-26 19:17:28.939716 7f114c009700  0 

Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-26 Thread Craig Lewis
My OSD rebuild time is more like 48 hours (4TB disks, 60% full, osd max
backfills = 1).   I believe that increases my risk of failure by 48^2 .
 Since your numbers are failure rate per hour per disk, I need to consider
the risk for the whole time for each disk.  So more formally, rebuild time
to the power of (replicas -1).

So I'm at 2304/100,000,000, or  approximately 1/43,000.  That's a much
higher risk than 1 / 10^8.
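
Spelling that out with the same per-hour figure Loic used (1/100,000 per disk
per hour, 100 PGs per disk, size = 3):

  100 x (48 x 1/100,000)^2 = 100 x 2304/10^10 = 2304/10^8 ~= 1/43,000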


A risk of 1/43,000 means that I'm more likely to lose data due to human
error than disk failure.  Still, I can put a small bit of effort in to
optimize recovery speed, and lower this number.  Managing human error is
much harder.






On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary l...@dachary.org wrote:

 Using percentages instead of numbers lead me to calculations errors. Here
 it is again using 1/100 instead of % for clarity ;-)

 Assuming that:

 * The pool is configured for three replicas (size = 3 which is the default)
 * It takes one hour for Ceph to recover from the loss of a single OSD
 * Any other disk has a 1/100,000 chance to fail within the hour following
 the failure of the first disk (assuming AFR
 https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is
 8%, divided by the number of hours during a year == (0.08 / 8760) ~=
 1/100,000
 * A given disk does not participate in more than 100 PG

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph monitor load, low performance

2014-08-26 Thread Craig Lewis
I had a similar problem once.  I traced it to a failed battery on my RAID
card, which disabled write caching.  One of the many things I need to add to
monitoring.
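
If it is the same class of problem, it tends to show up very clearly on the
slow node (a sketch; device names and the RAID vendor tool will differ):

# per-OSD commit/apply latency as the cluster sees it (recent releases)
ceph osd perf

# on the suspect host: look for one device pegged near 100% util / high await
iostat -x 5

# drive health
smartctl -a /dev/sdb

# plus the controller's cache/BBU status with the vendor tool
# (MegaCli, hpssacli, etc. -- the exact invocation depends on the controller)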



On Tue, Aug 26, 2014 at 3:58 AM, pawel.orzechow...@budikom.net wrote:

  Hello Gentelmen:-)

 Let me point one important aspect of this low performance problem: from
 all 4 nodes of our ceph cluster only one node shows bad metrics, that is
 very high latency on its osd's (from 200-600ms), while other three nodes
 behave normaly, thats is latency of their osds is between 1-10ms.

 So, the idea of putting journals on SSD is something that we are looking
 at, but we think that we have in general some problem with that particular
 node, what affects whole cluster.

 So can the number (4) of hosts a reason for that? Any other hints?

 Thanks

 Pawel

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-26 Thread Loic Dachary
Hi Craig,

I assume the reason for the 48 hours recovery time is to keep the cost of the 
cluster low ? I wrote 1h recovery time because it is roughly the time it 
would take to move 4TB over a 10Gb/s link. Could you upgrade your hardware to 
reduce the recovery time to less than two hours ? Or are there factors other 
than cost that prevent this ?

Cheers

On 26/08/2014 19:37, Craig Lewis wrote:
 My OSD rebuild time is more like 48 hours (4TB disks, 60% full, osd max 
 backfills = 1).   I believe that increases my risk of failure by 48^2 .  
 Since your numbers are failure rate per hour per disk, I need to consider the 
 risk for the whole time for each disk.  So more formally, rebuild time to the 
 power of (replicas -1).
 
 So I'm at 2304/100,000,000, or  approximately 1/43,000.  That's a much higher 
 risk than 1 / 10^8.
 
 
 A risk of 1/43,000 means that I'm more likely to lose data due to human error 
 than disk failure.  Still, I can put a small bit of effort in to optimize 
 recovery speed, and lower this number.  Managing human error is much harder.
 
 
 
 
 
 
 On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary l...@dachary.org 
 mailto:l...@dachary.org wrote:
 
 Using percentages instead of numbers lead me to calculations errors. Here 
 it is again using 1/100 instead of % for clarity ;-)
 
 Assuming that:
 
 * The pool is configured for three replicas (size = 3 which is the 
 default)
 * It takes one hour for Ceph to recover from the loss of a single OSD
 * Any other disk has a 1/100,000 chance to fail within the hour following 
 the failure of the first disk (assuming AFR 
 https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is 8%, 
 divided by the number of hours during a year == (0.08 / 8760) ~= 1/100,000
 * A given disk does not participate in more than 100 PG
 

-- 
Loïc Dachary, Artisan Logiciel Libre



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow read speeds from kernel rbd (Firefly 0.80.4)

2014-08-26 Thread Steve Anthony
Ok, after some delays and the move to new network hardware I have an
update. I'm still seeing the same low bandwidth and high retransmissions
from iperf after moving to the Cisco 6001 (10Gb) and 2960 (1Gb). I've
narrowed it down to transmissions from a 10Gb connected host to a 1Gb
connected host. Taking a more targeted tcpdump, I discovered that there
are multiple duplicate ACKs, triggering fast retransmissions between the
two test hosts.

There are several websites/articles which suggest that mixing 10Gb and
1Gb hosts causes performance issues, but no concrete explanation of why
that's the case, and if it can be avoided without moving everything to
10Gb, eg.

http://blogs.technet.com/b/networking/archive/2011/05/16/tcp-dupacks-and-tcp-fast-retransmits.aspx
http://en.community.dell.com/dell-groups/dtcmedia/m/mediagallery/19856911/download.aspx
[PDF]
http://packetpushers.net/flow-control-storm-%E2%80%93-ip-storage-performance-effects/

I verified that it's not a flow control storm (the pause frame counters
along the network path are zero), so assuming it might be bandwidth
related I installed trickle and used it to limit the bandwidth of iperf
to 1Gb; no change. I further restricted it down to 100Kbps, and was
*still* seeing high retransmission. This seems to imply it's not purely
bandwidth related.

After further research, I noticed a difference of about 0.1ms in the RTT
between two 10Gb hosts (intra-switch) and the 10Gb and 1Gb host
(inter-switch). I theorized this may be affecting the retransmission
timeout counter calculations, per:

http://sgros.blogspot.com/2012/02/calculating-tcp-rto.html

so I used ethtool to set the link plugged into the 10Gb 6001 to 1Gb;
this immediately fixed the issue. After this change the difference in
RTTs moved to about .025ms. Plugging another host into the old 10Gb FEX,
I have 10Gb to 10Gb RTTs withing .001ms of 6001 to 2960 RTTs, and don't
see the high retransmissions with iperf between those 10Gb hosts.
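
For reference, the retransmission timer the stack keeps per connection is, per
RFC 6298 (simplified, ignoring clock granularity):

  SRTT   = (1 - 1/8) * SRTT + 1/8 * R
  RTTVAR = (1 - 1/4) * RTTVAR + 1/4 * |SRTT - R|
  RTO    = SRTT + 4 * RTTVAR

so a shift in both the mean and the variance of the measured RTT feeds straight
into the timer. Running ss -i on the two hosts shows the rtt/rto values the
kernel is actually using, which makes it easy to compare before and after the
ethtool change.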


 tldr 

So, right now I don't see retransmissions between hosts on the same
switch (even if speeds are mixed), but I do across switches when the
hosts are mixed 10Gb/1Gb. Also, I wonder what the difference between
process bandwidth limiting and link 1Gb negotiation is which leads to
the differences observed. I checked the link per Mark's suggestion
below, but all the values they increase in that old post are already
lower than the defaults set on my hosts.

If anyone has any ideas or explanations, I'd appreciate it. Otherwise,
I'll keep the list posted if I uncover a solution or make more progress.
Thanks.

-Steve

On 07/28/2014 01:21 PM, Mark Nelson wrote:
 On 07/28/2014 11:28 AM, Steve Anthony wrote:
 While searching for more information I happened across the following
 post (http://dachary.org/?p=2961) which vaguely resembled the symptoms
 I've been experiencing. I ran tcpdump and noticed what appeared to be a
 high number of retransmissions on the host where the images are mounted
 during a read from a Ceph rbd, so I ran iperf3 to get some concrete
 numbers:

 Very interesting that you are seeing retransmissions.


 Server: nas4 (where rbd images are mapped)
 Client: ceph2 (currently not in the cluster, but configured
 identically to the other nodes)

 Start server on nas4:
 iperf3 -s

 On ceph2, connect to server nas4, send 4096MB of data, report on 1
 second intervals. Add -R to reverse the client/server roles.
 iperf3 -c nas4 -n 4096M -i 1

 Summary of traffic going out the 1Gb interface to a switch

 [ ID] Interval   Transfer Bandwidth   Retr
 [  5]   0.00-36.53  sec  4.00 GBytes   941 Mbits/sec   15
 sender
 [  5]   0.00-36.53  sec  4.00 GBytes   940 Mbits/sec
 receiver

 Reversed, summary of traffic going over the fabric extender

 [ ID] Interval   Transfer Bandwidth   Retr
 [  5]   0.00-80.84  sec  4.00 GBytes   425 Mbits/sec  30756
 sender
 [  5]   0.00-80.84  sec  4.00 GBytes   425 Mbits/sec
 receiver

 Definitely looks suspect!



 It appears that the issue is related to the network topology employed.
 The private cluster network and nas4's public interface are both
 connected to a 10Gb Cisco Fabric Extender (FEX), in turn connected to a
 Nexus 7000. This was meant as a temporary solution until our network
 team could finalize their design and bring up the Nexus 6001 for the
 cluster. From what our network guys have said, the FEX has been much
 more limited than they anticipated and they haven't been pleased with it
 as a solution in general. The 6001 is supposed be ready this week, so
 once it's online I'll move the cluster to that switch and re-test to see
 if this fixes the issues I've been experiencing.

 If it's not the hardware, one other thing you might want to test is to
 make sure it's not something similar to the autotuning issues we used
 to see.  I don't think this should be an issue at this point given the
 code changes we made to address it, but it would be easy to test. 
 Doesn't seem like