Re: [ceph-users] Two osds are spamming dmesg every 900 seconds
This is being output by one of the kernel clients, and it's just saying that the connections to those two OSDs have died from inactivity. Either the other OSD connections are used a lot more, or aren't used at all. In any case, it's not a problem; just a noisy notification. There's not much you can do about it; sorry. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com

On Mon, Aug 25, 2014 at 12:01 PM, Andrei Mikhailovsky and...@arhont.com wrote:

Hello, I am seeing this message every 900 seconds on the osd servers. My dmesg output is all filled with:

[256627.683702] libceph: osd3 192.168.168.200:6821 socket closed (con state OPEN)
[256627.687663] libceph: osd6 192.168.168.200:6841 socket closed (con state OPEN)

Looking at the ceph-osd logs I see the following at the same time:

2014-08-25 19:48:14.869145 7f0752125700 0 -- 192.168.168.200:6821/4097 >> 192.168.168.200:0/2493848861 pipe(0x13b43c80 sd=92 :6821 s=0 pgs=0 cs=0 l=0 c=0x16a606e0).accept peer addr is really 192.168.168.200:0/2493848861 (socket is 192.168.168.200:54457/0)

This happens only on two osds; the rest of the osds seem fine. Does anyone know why I am seeing this and how to correct it?

Thanks, Andrei

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
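If you want to confirm which kernel client holds the idle sessions, the client-side state can be inspected via debugfs. A minimal sketch, assuming the kernel client's debugfs interface is available (paths can vary by kernel version):

# on the host running the kernel client (rbd.ko / ceph.ko):
mount -t debugfs none /sys/kernel/debug 2>/dev/null   # if not already mounted
cat /sys/kernel/debug/ceph/*/osdc    # in-flight requests per OSD session
cat /sys/kernel/debug/ceph/*/monc    # monitor session state

An empty osdc listing alongside the periodic "socket closed" messages would be consistent with these being plain idle-timeout disconnects.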
Re: [ceph-users] Ceph-fuse fails to mount
In particular, we changed things post-Firefly so that the filesystem isn't created automatically. You'll need to set it up (and its pools, etc) explicitly to use it. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com

On Mon, Aug 25, 2014 at 2:40 PM, Sean Crosby richardnixonsh...@gmail.com wrote:

Hi James,

On 26 August 2014 07:17, LaBarre, James (CTR) A6IT james.laba...@cigna.com wrote:

[ceph@first_cluster ~]$ ceph -s
    cluster e0433b49-d64c-4c3e-8ad9-59a47d84142d
     health HEALTH_OK
     monmap e1: 1 mons at {first_cluster=10.25.164.192:6789/0}, election epoch 2, quorum 0 first_cluster
     mdsmap e4: 1/1/1 up {0=first_cluster=up:active}
     osdmap e13: 3 osds: 3 up, 3 in
      pgmap v480: 192 pgs, 3 pools, 1417 MB data, 4851 objects
            19835 MB used, 56927 MB / 76762 MB avail
                 192 active+clean

This cluster has an MDS. It should mount.

[ceph@second_cluster ~]$ ceph -s
    cluster 06f655b7-e147-4790-ad52-c57dcbf160b7
     health HEALTH_OK
     monmap e1: 1 mons at {second_cluster=10.25.165.91:6789/0}, election epoch 1, quorum 0 cilsdbxd1768
     osdmap e16: 7 osds: 7 up, 7 in
      pgmap v539: 192 pgs, 3 pools, 0 bytes data, 0 objects
            252 MB used, 194 GB / 194 GB avail
                 192 active+clean

No mdsmap line for this cluster, and therefore the filesystem won't mount. Have you added an MDS for this cluster, or has the mds daemon died? You'll have to get the mdsmap line to show before it will mount.

Sean

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
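For reference, a minimal sketch of creating the filesystem explicitly on releases that include the ceph fs new command (pool names and PG counts here are illustrative, and an MDS daemon must already be running):

ceph osd pool create cephfs_data 128
ceph osd pool create cephfs_metadata 128
ceph fs new cephfs cephfs_metadata cephfs_data
ceph mds stat        # should report up:active once the MDS picks up the new fs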
Re: [ceph-users] MDS dying on Ceph 0.67.10
I don't think the log messages you're showing are the actual cause of the failure. The log file should have a proper stack trace (with specific function references and probably a listed assert failure); can you find that? -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com

On Tue, Aug 26, 2014 at 9:11 AM, MinhTien MinhTien tientienminh080...@gmail.com wrote:

Hi all,

I have a cluster of 2 nodes on CentOS 6.5 with ceph 0.67.10 (replicate = 2). When I add the 3rd node to the Ceph cluster, Ceph performs load balancing. I have 3 MDSs on 3 nodes; the MDS process is dying after a while with a stack trace:

---
2014-08-26 17:08:34.362901 7f1c2c704700 1 -- 10.20.0.21:6800/22154 <== osd.10 10.20.0.21:6802/15917 1 ==== osd_op_reply(230 10003f6. [tmapup 0~0] ondisk = 0) v4 ==== 119+0+0 (1770421071 0 0) 0x2aece00 con 0x2aa4200
   -54> 2014-08-26 17:08:34.362942 7f1c2c704700 1 -- 10.20.0.21:6800/22154 <== osd.55 10.20.0.23:6800/2407 10 ==== osd_op_reply(263 100048a. [getxattr] ack = -2 (No such file or directory)) v4 ==== 119+0+0 (3908997833 0 0) 0x1e63000 con 0x1e7aaa0
   -53> 2014-08-26 17:08:34.363001 7f1c2c704700 5 mds.0.log submit_entry 427629603~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs]
   -52> 2014-08-26 17:08:34.363022 7f1c2c704700 1 -- 10.20.0.21:6800/22154 <== osd.37 10.20.0.22:6898/11994 6 ==== osd_op_reply(226 1. [tmapput 0~7664] ondisk = 0) v4 ==== 109+0+0 (1007110430 0 0) 0x1e64800 con 0x1e7a7e0
   -51> 2014-08-26 17:08:34.363092 7f1c2c704700 5 mds.0.log _expired segment 293601899 2548 events
   -50> 2014-08-26 17:08:34.363117 7f1c2c704700 1 -- 10.20.0.21:6800/22154 <== osd.17 10.20.0.21:6941/17572 9 ==== osd_op_reply(264 1000489. [getxattr] ack = -2 (No such file or directory)) v4 ==== 119+0+0 (1979034473 0 0) 0x1e62200 con 0x1e7b180
   -49> 2014-08-26 17:08:34.363177 7f1c2c704700 5 mds.0.log submit_entry 427631148~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs]
   -48> 2014-08-26 17:08:34.363197 7f1c2c704700 1 -- 10.20.0.21:6800/22154 <== osd.1 10.20.0.21:6872/13227 6 ==== osd_op_reply(265 1000491. [getxattr] ack = -2 (No such file or directory)) v4 ==== 119+0+0 (1231782695 0 0) 0x1e63400 con 0x1e7ac00
   -47> 2014-08-26 17:08:34.363255 7f1c2c704700 5 mds.0.log submit_entry 427632693~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs]
   -46> 2014-08-26 17:08:34.363274 7f1c2c704700 1 -- 10.20.0.21:6800/22154 <== osd.11 10.20.0.21:6884/7018 5 ==== osd_op_reply(266 100047d. [getxattr] ack = -2 (No such file or directory)) v4 ==== 119+0+0 (2737916920 0 0) 0x1e61e00 con 0x1e7bc80
-

I tried to restart the MDSs, but after a few seconds in the active state, the MDS switches to laggy or crashed. I have a lot of important data on it. I do not want to use the command ceph mds newfs <metadata pool id> <data pool id> --yes-i-really-mean-it :(

Tien Bui. -- Bui Minh Tien

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
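One way to capture the full trace with more context is to raise the MDS debug levels before reproducing the crash; a sketch (daemon id, levels, and log path are illustrative):

ceph mds tell 0 injectargs '--debug-mds 20 --debug-ms 1'
# or persistently in ceph.conf on the MDS hosts, then restart the daemon:
#   [mds]
#       debug mds = 20
#       debug ms = 1
# the assert and stack trace land in /var/log/ceph/ceph-mds.<host>.log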
Re: [ceph-users] Fresh Firefly install degraded without modified default tunables
Hmm, that all looks basically fine. But why did you decide not to segregate OSDs across hosts (according to your CRUSH rules)? I think maybe it's the interaction of your map, setting choose_local_tries to 0, and trying to go straight to the OSDs instead of choosing hosts. But I'm not super familiar with how the tunables would act under these exact conditions. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com

On Mon, Aug 25, 2014 at 12:59 PM, Ripal Nathuji ri...@nathuji.com wrote:

Hi Greg,

Thanks for helping to take a look. Please find your requested outputs below.

ceph osd tree:

# id    weight  type name       up/down reweight
-1      0       root default
-2      0               host osd1
0       0                       osd.0   up      1
4       0                       osd.4   up      1
8       0                       osd.8   up      1
11      0                       osd.11  up      1
-3      0               host osd0
1       0                       osd.1   up      1
3       0                       osd.3   up      1
6       0                       osd.6   up      1
9       0                       osd.9   up      1
-4      0               host osd2
2       0                       osd.2   up      1
5       0                       osd.5   up      1
7       0                       osd.7   up      1
10      0                       osd.10  up      1

ceph -s:

    cluster 4a158d27-f750-41d5-9e7f-26ce4c9d2d45
     health HEALTH_WARN 832 pgs degraded; 832 pgs stuck unclean; recovery 43/86 objects degraded (50.000%)
     monmap e1: 1 mons at {ceph-mon0=192.168.2.10:6789/0}, election epoch 2, quorum 0 ceph-mon0
     osdmap e34: 12 osds: 12 up, 12 in
      pgmap v61: 832 pgs, 8 pools, 840 bytes data, 43 objects
            403 MB used, 10343 MB / 10747 MB avail
            43/86 objects degraded (50.000%)
                 832 active+degraded

Thanks, Ripal

On Aug 25, 2014, at 12:45 PM, Gregory Farnum g...@inktank.com wrote:

What's the output of ceph osd tree? And the full output of ceph -s? -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com

On Mon, Aug 18, 2014 at 8:07 PM, Ripal Nathuji ri...@nathuji.com wrote:

Hi folks,

I've come across an issue which I found a fix for, but I'm not sure whether it's correct or if there is some other misconfiguration on my end and this is merely a symptom. I'd appreciate any insights anyone could provide based on the information below, and happy to provide more details as necessary.

Summary: A fresh install of Ceph 0.80.5 comes up with all pgs marked as active+degraded. This reproduces on 12.04 as well as CentOS 7 with a varying number of OSD hosts (1, 2, 3), where each OSD host has four storage drives. The configuration file defines a default replica size of 2, and allows leafs of type 0. Specific snippet:

[global]
...
osd pool default size = 2
osd crush chooseleaf type = 0

I verified the crush rules were as expected:

"rules": [
    { "rule_id": 0,
      "rule_name": "replicated_ruleset",
      "ruleset": 0,
      "type": 1,
      "min_size": 1,
      "max_size": 10,
      "steps": [
        { "op": "take", "item": -1, "item_name": "default"},
        { "op": "choose_firstn", "num": 0, "type": "osd"},
        { "op": "emit"}]}],

Inspecting the pg dump I observed that all pgs had a single osd in the up/acting sets. That seemed to explain why the pgs were degraded, but it was unclear to me why a second OSD wasn't in the set. After trying a variety of things, I noticed that there was a difference between Emperor (which works fine in these configurations) and Firefly with the default tunables, where Firefly comes up with the bobtail profile. The setting choose_local_fallback_tries is 0 in this profile while it used to default to 5 on Emperor. Sure enough, if I modify my crush map and set the parameter to a non-zero value, the cluster remaps and goes healthy with all pgs active+clean. The documentation states the optimal value of choose_local_fallback_tries is 0 for FF, so I'd like to get a better understanding of this parameter and why modifying the default value moves the pgs to a clean state in my scenarios.
Thanks, Ripal

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] [Ceph-community] ceph replication and striping
On Tue, Aug 26, 2014 at 5:07 AM, m.channappa.nega...@accenture.com wrote:

Hello all,

I have configured a ceph storage cluster.

1. I created a volume. I would like to know: will replication of data happen automatically in ceph?
2. How do I configure a striped volume using ceph?

Regards, Malleshi CN

If I understand your position and questions correctly... the replication level is configured per-pool, so whatever your size parameter is set to for the pool you created the volume in will dictate how many copies are stored. (Default is 3, IIRC.) RADOS block device volumes are always striped across 4 MiB objects. I don't believe that is configurable (at least not yet.) FYI, this list is intended for discussion of Ceph community concerns. These kinds of questions are better handled on the ceph-users list, and I've forwarded your message accordingly. -Aaron

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
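As a concrete sketch of the per-pool replication setting described above (the pool and image names are illustrative):

ceph osd pool get rbd size          # show the current replica count for the pool
ceph osd pool set rbd size 3        # keep three copies of every object
rbd info rbd/myvolume               # "order 22" in the output means 4 MiB objects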
Re: [ceph-users] Ceph-fuse fails to mount
[Re-added the list.] I believe you'll find everything you need at http://ceph.com/docs/master/cephfs/createfs/ -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com

On Tue, Aug 26, 2014 at 1:25 PM, LaBarre, James (CTR) A6IT james.laba...@cigna.com wrote:

So is there a link for documentation on the newer versions? (We're doing evaluations at present, so I had wanted to work with newer versions, since it would be closer to what we would end up using.)

-----Original Message-----
From: Gregory Farnum [mailto:g...@inktank.com]
Sent: Tuesday, August 26, 2014 4:05 PM
To: Sean Crosby
Cc: LaBarre, James (CTR) A6IT; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph-fuse fails to mount

In particular, we changed things post-Firefly so that the filesystem isn't created automatically. You'll need to set it up (and its pools, etc) explicitly to use it. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com

On Mon, Aug 25, 2014 at 2:40 PM, Sean Crosby richardnixonsh...@gmail.com wrote:

Hi James,

On 26 August 2014 07:17, LaBarre, James (CTR) A6IT james.laba...@cigna.com wrote:

[ceph@first_cluster ~]$ ceph -s
    cluster e0433b49-d64c-4c3e-8ad9-59a47d84142d
     health HEALTH_OK
     monmap e1: 1 mons at {first_cluster=10.25.164.192:6789/0}, election epoch 2, quorum 0 first_cluster
     mdsmap e4: 1/1/1 up {0=first_cluster=up:active}
     osdmap e13: 3 osds: 3 up, 3 in
      pgmap v480: 192 pgs, 3 pools, 1417 MB data, 4851 objects
            19835 MB used, 56927 MB / 76762 MB avail
                 192 active+clean

This cluster has an MDS. It should mount.

[ceph@second_cluster ~]$ ceph -s
    cluster 06f655b7-e147-4790-ad52-c57dcbf160b7
     health HEALTH_OK
     monmap e1: 1 mons at {second_cluster=10.25.165.91:6789/0}, election epoch 1, quorum 0 cilsdbxd1768
     osdmap e16: 7 osds: 7 up, 7 in
      pgmap v539: 192 pgs, 3 pools, 0 bytes data, 0 objects
            252 MB used, 194 GB / 194 GB avail
                 192 active+clean

No mdsmap line for this cluster, and therefore the filesystem won't mount. Have you added an MDS for this cluster, or has the mds daemon died? You'll have to get the mdsmap line to show before it will mount.

Sean

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Fresh Firefly install degraded without modified default tunables
Hi Greg,

Good question: I started with a single node test and had just left the setting in across larger configs, as in earlier versions (e.g. Emperor) it didn't seem to matter. I also had the same thought that it could be causing an issue with the new default tunables in Firefly, and did try removing it for multi-host (all things the same except for omitting osd crush chooseleaf type = 0 in ceph.conf). However, I observed the same behavior in both cases.

Thanks, Ripal

On Aug 26, 2014, at 3:04 PM, Gregory Farnum g...@inktank.com wrote:

Hmm, that all looks basically fine. But why did you decide not to segregate OSDs across hosts (according to your CRUSH rules)? I think maybe it's the interaction of your map, setting choose_local_tries to 0, and trying to go straight to the OSDs instead of choosing hosts. But I'm not super familiar with how the tunables would act under these exact conditions. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com

On Mon, Aug 25, 2014 at 12:59 PM, Ripal Nathuji ri...@nathuji.com wrote:

Hi Greg,

Thanks for helping to take a look. Please find your requested outputs below.

ceph osd tree:

# id    weight  type name       up/down reweight
-1      0       root default
-2      0               host osd1
0       0                       osd.0   up      1
4       0                       osd.4   up      1
8       0                       osd.8   up      1
11      0                       osd.11  up      1
-3      0               host osd0
1       0                       osd.1   up      1
3       0                       osd.3   up      1
6       0                       osd.6   up      1
9       0                       osd.9   up      1
-4      0               host osd2
2       0                       osd.2   up      1
5       0                       osd.5   up      1
7       0                       osd.7   up      1
10      0                       osd.10  up      1

ceph -s:

    cluster 4a158d27-f750-41d5-9e7f-26ce4c9d2d45
     health HEALTH_WARN 832 pgs degraded; 832 pgs stuck unclean; recovery 43/86 objects degraded (50.000%)
     monmap e1: 1 mons at {ceph-mon0=192.168.2.10:6789/0}, election epoch 2, quorum 0 ceph-mon0
     osdmap e34: 12 osds: 12 up, 12 in
      pgmap v61: 832 pgs, 8 pools, 840 bytes data, 43 objects
            403 MB used, 10343 MB / 10747 MB avail
            43/86 objects degraded (50.000%)
                 832 active+degraded

Thanks, Ripal

On Aug 25, 2014, at 12:45 PM, Gregory Farnum g...@inktank.com wrote:

What's the output of ceph osd tree? And the full output of ceph -s? -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com

On Mon, Aug 18, 2014 at 8:07 PM, Ripal Nathuji ri...@nathuji.com wrote:

Hi folks,

I've come across an issue which I found a fix for, but I'm not sure whether it's correct or if there is some other misconfiguration on my end and this is merely a symptom. I'd appreciate any insights anyone could provide based on the information below, and happy to provide more details as necessary.

Summary: A fresh install of Ceph 0.80.5 comes up with all pgs marked as active+degraded. This reproduces on 12.04 as well as CentOS 7 with a varying number of OSD hosts (1, 2, 3), where each OSD host has four storage drives. The configuration file defines a default replica size of 2, and allows leafs of type 0. Specific snippet:

[global]
...
osd pool default size = 2
osd crush chooseleaf type = 0

I verified the crush rules were as expected:

"rules": [
    { "rule_id": 0,
      "rule_name": "replicated_ruleset",
      "ruleset": 0,
      "type": 1,
      "min_size": 1,
      "max_size": 10,
      "steps": [
        { "op": "take", "item": -1, "item_name": "default"},
        { "op": "choose_firstn", "num": 0, "type": "osd"},
        { "op": "emit"}]}],

Inspecting the pg dump I observed that all pgs had a single osd in the up/acting sets. That seemed to explain why the pgs were degraded, but it was unclear to me why a second OSD wasn't in the set. After trying a variety of things, I noticed that there was a difference between Emperor (which works fine in these configurations) and Firefly with the default tunables, where Firefly comes up with the bobtail profile. The setting choose_local_fallback_tries is 0 in this profile while it used to default to 5 on Emperor. Sure enough, if I modify my crush map and set the parameter to a non-zero value, the cluster remaps and goes healthy with all pgs active+clean. The documentation states the optimal value of choose_local_fallback_tries is 0 for FF, so I'd like to get a better understanding of this parameter and why modifying the default value moves the pgs to a clean state in my scenarios.

Thanks, Ripal

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
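For anyone wanting to reproduce the experiment, the usual decompile/edit/recompile cycle for CRUSH tunables looks like this (a sketch; file names are illustrative):

ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# edit crush.txt, e.g.: tunable choose_local_fallback_tries 5
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new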
[ceph-users] does RGW have a billing feature? If it does, how do we use it?
baijia...@126.com

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Best practice K/M-parameters EC pool
Hello,

On Tue, 26 Aug 2014 20:21:39 +0200 Loic Dachary wrote:

Hi Craig, I assume the reason for the 48 hours recovery time is to keep the cost of the cluster low? I wrote 1h recovery time because it is roughly the time it would take to move 4TB over a 10Gb/s link. Could you upgrade your hardware to reduce the recovery time to less than two hours? Or are there factors other than cost that prevent this?

I doubt Craig is operating on a shoestring budget. And even if his network were to be just GbE, that would still make it only 10 hours according to your wishful-thinking formula. He probably has set the max_backfills to 1 because that is the level of I/O his OSDs can handle w/o degrading cluster performance too much. The network is unlikely to be the limiting factor.

The way I see it, most Ceph clusters are in a sort of steady state when operating normally, i.e. a few hundred VM RBD images ticking over; most actual OSD disk ops are writes, as nearly all hot objects that are being read are in the page cache of the storage nodes. Easy peasy. Until something happens that breaks this routine, like a deep scrub, all those VMs rebooting at the same time, or a backfill caused by a failed OSD. Now all of a sudden client ops compete with the backfill ops, page caches are no longer hot, the spinners are seeking left and right. Pandemonium.

I doubt very much that even with an SSD-backed cluster you would get away with less than 2 hours for 4TB.

To give you some real-life numbers: I currently am building a new cluster, but for the time being have only one storage node to play with. It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs and 8 actual OSD HDDs (3TB, 7200RPM), with 90GB of (test) data on it. So I took out one OSD (reweight 0 first, then the usual removal steps) because the actual disk was wonky. Replaced the disk and re-added the OSD. Both operations took about the same time: 4 minutes for evacuating the OSD (having 7 write targets clearly helped), i.e. about 50MB/s for a measly 12GB, and 5 minutes, or about 35MB/s, for refilling the OSD. And that is on one node (thus no network latency) that has the default parameters (so a max_backfill of 10) and which was otherwise totally idle. In other words, in this pretty ideal case it would have taken 22 hours to re-distribute 4TB.

More in another reply.

Cheers

On 26/08/2014 19:37, Craig Lewis wrote:

My OSD rebuild time is more like 48 hours (4TB disks, 60% full, osd max backfills = 1). I believe that increases my risk of failure by 48^2. Since your numbers are failure rate per hour per disk, I need to consider the risk for the whole time for each disk. So more formally, rebuild time to the power of (replicas - 1). So I'm at 2304/100,000,000, or approximately 1/43,000. That's a much higher risk than 1/10^8. A risk of 1/43,000 means that I'm more likely to lose data due to human error than disk failure. Still, I can put a small bit of effort in to optimize recovery speed, and lower this number. Managing human error is much harder.

On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary l...@dachary.org mailto:l...@dachary.org wrote:

Using percentages instead of numbers led me to calculation errors. Here it is again using 1/100 instead of % for clarity ;-)

Assuming that:
* The pool is configured for three replicas (size = 3 which is the default)
* It takes one hour for Ceph to recover from the loss of a single OSD
* Any other disk has a 1/100,000 chance to fail within the hour following the failure of the first disk (assuming the AFR https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is 8%, divided by the number of hours during a year == (0.08 / 8760) ~= 1/100,000)
* A given disk does not participate in more than 100 PGs

-- Christian Balzer    Network/Systems Engineer
ch...@gol.com    Global OnLine Japan/Fusion Communications http://www.gol.com/

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
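For completeness, the knobs Christian alludes to can be throttled (or opened up) at runtime; a sketch, with values purely illustrative:

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
# raise the numbers to recover faster at the cost of client I/O; lower them
# to protect client latency at the cost of a longer recovery window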
Re: [ceph-users] ceph-deploy with --release (--stable) for dumpling?
On Tue, Aug 26, 2014 at 5:10 PM, Konrad Gutkowski konrad.gutkow...@ffs.pl wrote:

Ceph-deploy should set priority for the ceph repository, which it doesn't; this usually installs the best available version from any repository.

Thanks Konrad for the tip. It took several goes (notably ceph-deploy purge did not, for me at least, seem to be removing librbd1 cleanly), but I managed to get 0.67.10 to be preferred. Basically I did this:

root@ceph12:~# ceph -v
ceph version 0.67.10
root@ceph12:~# cat /etc/apt/preferences
Package: *
Pin: origin ceph.com
Pin-priority: 900

Package: *
Pin: origin ceph.newdream.net
Pin-priority: 900
root@ceph12:~#

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
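Not from the original exchange, but a quick way to confirm a pin like this is actually winning is to ask apt for the candidate versions per origin:

apt-cache policy ceph librbd1   # the candidate should come from the 900-pinned origin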
Re: [ceph-users] Best practice K/M-parameters EC pool
Hello,

On Tue, 26 Aug 2014 16:12:11 +0200 Loic Dachary wrote:

Using percentages instead of numbers led me to calculation errors. Here it is again using 1/100 instead of % for clarity ;-)

Assuming that:
* The pool is configured for three replicas (size = 3 which is the default)
* It takes one hour for Ceph to recover from the loss of a single OSD

I think Craig and I have debunked that number. It will be something like "that depends" on many things, starting with the amount of data, the disk speeds, the contention (client and other ops), the network speed/utilization, the actual OSD process and queue handling speed, etc. If you want to make an assumption that's not an order of magnitude wrong, start with 24 hours. It would be nice to hear from people with really huge clusters like Dan at CERN how their recovery speeds are.

* Any other disk has a 1/100,000 chance to fail within the hour following the failure of the first disk (assuming the AFR https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is 8%, divided by the number of hours during a year == (0.08 / 8760) ~= 1/100,000)
* A given disk does not participate in more than 100 PGs

You will find that the smaller the cluster, the more likely it is to be higher than 100, due to rounding up or just upping things because the distribution is too uneven otherwise.

Each time an OSD is lost, there is a 1/100,000*1/100,000 = 1/10,000,000,000 chance that two other disks are lost before recovery. Since the disk that failed initially participates in 100 PGs, that is 1/10,000,000,000 x 100 = 1/100,000,000 chance that a PG is lost. Or the entire pool if it is used in a way that losing a PG means losing all data in the pool (as in your example, where it contains RBD volumes and each of the RBD volumes uses all the available PGs). If the pool is using at least two datacenters operated by two different organizations, this calculation makes sense to me. However, if the cluster is in a single datacenter, isn't it possible that some event independent of Ceph has a greater probability of permanently destroying the data? A month ago I lost three machines in a Ceph cluster and realized on that occasion that the crushmap was not configured properly and that PGs were lost as a result. Fortunately I was able to recover the disks and plug them into another machine to recover the lost PGs. I'm not a system administrator and the probability of me failing to do the right thing is higher than normal: this is just an example of a high-probability event leading to data loss. Another example would be if all disks in the same PG are part of the same batch and therefore likely to fail at the same time. In other words, I wonder if this 0.0001% chance of losing a PG within the hour following a disk failure matters or if it is dominated by other factors. What do you think?

Batch failures are real, I'm seeing that all the time. However they tend to be still spaced out widely enough most of the time. Still something to consider in a complete calculation. As for failures other than disks, these tend to be recoverable, as you experienced yourself. A node, rack, whatever failure might make your cluster temporarily inaccessible (and thus should be avoided by proper CRUSH maps and other precautions), but it will not lead to actual data loss.

Regards, Christian

Cheers

Assuming that:
* The pool is configured for three replicas (size = 3 which is the default)
* It takes one hour for Ceph to recover from the loss of a single OSD
* Any other disk has a 0.001% chance to fail within the hour following the failure of the first disk (assuming the AFR https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is 10%, divided by the number of hours during a year).
* A given disk does not participate in more than 100 PGs

Each time an OSD is lost, there is a 0.001*0.001 = 0.01% chance that two other disks are lost before recovery. Since the disk that failed initially participates in 100 PGs, that is 0.01% x 100 = 0.0001% chance that a PG is lost. Or the entire pool if it is used in a way that losing a PG means losing all data in the pool (as in your example, where it contains RBD volumes and each of the RBD volumes uses all the available PGs). If the pool is using at least two datacenters operated by two different organizations, this calculation makes sense to me. However, if the cluster is in a single datacenter, isn't it possible that some event independent of Ceph has a greater probability of permanently destroying the data? A month ago I lost three machines in a Ceph cluster and realized on that occasion that the crushmap was not configured properly and that PGs were lost as a result. Fortunately I was able to recover the disks and plug them into another machine to recover the lost PGs. I'm not a system administrator and the probability of me failing to do the right thing
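Loic's corrected arithmetic is easy to sanity-check; a quick sketch with bc, using the thread's assumptions (8% AFR, one-hour recovery window, 100 PGs per disk):

echo 'scale=12; 0.08/8760' | bc              # ~.000009132420, i.e. ~1/100,000 per hour
echo 'scale=18; (0.08/8760)^2 * 100' | bc    # ~8.3e-9, same order as 1/100,000,000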
Re: [ceph-users] MDS dying on Ceph 0.67.10
Hi Gregory Farnum,

Thank you for your reply! This is the log:

2014-08-26 16:22:39.103461 7f083752f700 -1 mds/CDir.cc: In function 'void CDir::_committed(version_t)' thread 7f083752f700 time 2014-08-26 16:22:39.075809
mds/CDir.cc: 2071: FAILED assert(in->is_dirty() || in->last < ((__u64)(-2)))

ceph version 0.67.10 (9d446bd416c52cd785ccf048ca67737ceafcdd7f)
1: (CDir::_committed(unsigned long)+0xc4e) [0x74d9ee]
2: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xe8d) [0x7d09bd]
3: (MDS::handle_core_message(Message*)+0x987) [0x57c457]
4: (MDS::_dispatch(Message*)+0x2f) [0x57c50f]
5: (MDS::ms_dispatch(Message*)+0x19b) [0x57dfbb]
6: (DispatchQueue::entry()+0x5a2) [0x904732]
7: (DispatchQueue::DispatchThread::entry()+0xd) [0x8afdbd]
8: (()+0x79d1) [0x7f083c2979d1]
9: (clone()+0x6d) [0x7f083afb6b5d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 hadoop
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 1
max_new 1000
log_file /var/log/ceph/ceph-mds.Ceph01-dc5k3u0104.log
--- end dump of recent events ---

2014-08-26 16:22:39.134173 7f083752f700 -1 *** Caught signal (Aborted) ** in thread 7f083752f700

On Wed, Aug 27, 2014 at 3:09 AM, Gregory Farnum g...@inktank.com wrote:

I don't think the log messages you're showing are the actual cause of the failure. The log file should have a proper stack trace (with specific function references and probably a listed assert failure); can you find that? -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com

On Tue, Aug 26, 2014 at 9:11 AM, MinhTien MinhTien tientienminh080...@gmail.com wrote:

Hi all,

I have a cluster of 2 nodes on CentOS 6.5 with ceph 0.67.10 (replicate = 2). When I add the 3rd node to the Ceph cluster, Ceph performs load balancing. I have 3 MDSs on 3 nodes; the MDS process is dying after a while with a stack trace:

---
2014-08-26 17:08:34.362901 7f1c2c704700 1 -- 10.20.0.21:6800/22154 <== osd.10 10.20.0.21:6802/15917 1 ==== osd_op_reply(230 10003f6. [tmapup 0~0] ondisk = 0) v4 ==== 119+0+0 (1770421071 0 0) 0x2aece00 con 0x2aa4200
   -54> 2014-08-26 17:08:34.362942 7f1c2c704700 1 -- 10.20.0.21:6800/22154 <== osd.55 10.20.0.23:6800/2407 10 ==== osd_op_reply(263 100048a. [getxattr] ack = -2 (No such file or directory)) v4 ==== 119+0+0 (3908997833 0 0) 0x1e63000 con 0x1e7aaa0
   -53> 2014-08-26 17:08:34.363001 7f1c2c704700 5 mds.0.log submit_entry 427629603~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs]
   -52> 2014-08-26 17:08:34.363022 7f1c2c704700 1 -- 10.20.0.21:6800/22154 <== osd.37 10.20.0.22:6898/11994 6 ==== osd_op_reply(226 1. [tmapput 0~7664] ondisk = 0) v4 ==== 109+0+0 (1007110430 0 0) 0x1e64800 con 0x1e7a7e0
   -51> 2014-08-26 17:08:34.363092 7f1c2c704700 5 mds.0.log _expired segment 293601899 2548 events
   -50> 2014-08-26 17:08:34.363117 7f1c2c704700 1 -- 10.20.0.21:6800/22154 <== osd.17 10.20.0.21:6941/17572 9 ==== osd_op_reply(264 1000489. [getxattr] ack = -2 (No such file or directory)) v4 ==== 119+0+0 (1979034473 0 0) 0x1e62200 con 0x1e7b180
   -49> 2014-08-26 17:08:34.363177 7f1c2c704700 5 mds.0.log submit_entry 427631148~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs]
   -48> 2014-08-26 17:08:34.363197 7f1c2c704700 1 -- 10.20.0.21:6800/22154 <== osd.1 10.20.0.21:6872/13227 6 ==== osd_op_reply(265 1000491. [getxattr] ack = -2 (No such file or directory)) v4 ==== 119+0+0 (1231782695 0 0) 0x1e63400 con 0x1e7ac00
   -47> 2014-08-26 17:08:34.363255 7f1c2c704700 5 mds.0.log submit_entry 427632693~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs]
   -46> 2014-08-26 17:08:34.363274 7f1c2c704700 1 -- 10.20.0.21:6800/22154 <== osd.11 10.20.0.21:6884/7018 5 ==== osd_op_reply(266 100047d. [getxattr] ack = -2 (No such file or directory)) v4 ==== 119+0+0 (2737916920 0 0) 0x1e61e00 con 0x1e7bc80
-

I try to restart the MDSs, but after a
[ceph-users] 'incomplete' PGs: what does it mean?
In the docs [1], 'incomplete' is defined thusly:

"Ceph detects that a placement group is missing a necessary period of history from its log. If you see this state, report a bug, and try to start any failed OSDs that may contain the needed information."

However, during an extensive review of list postings related to incomplete PGs, an alternate and oft-repeated definition is something like 'the number of existing replicas is less than the min_size of the pool'. In no list posting was there any acknowledgement of the definition from the docs. While trying to understand what 'incomplete' PGs are, I simply set min_size = 1 on this cluster with incomplete PGs, and they continue to be 'incomplete'. Does this mean that definition #2 is incorrect? In case #1 is correct, how can the cluster be told to forget the lapse in history? In our case, there was nothing writing to the cluster during the OSD reorganization that could have caused this lapse.

[1] http://ceph.com/docs/master/rados/operations/pg-states/

John

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
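Not part of the original post, but for anyone digging into the same state, the peering history a PG is missing can be inspected directly (the pg id here is illustrative):

ceph pg dump_stuck inactive     # list the stuck/incomplete pgs
ceph pg 2.1f query              # the "recovery_state" section shows what peering is blocked on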
[ceph-users] question about getting rbd.ko and ceph.ko
hi, all

Is there a way to get rbd.ko and ceph.ko for CentOS 6.x, or do I have to build them from source code? What is the minimum kernel version?

thanks

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
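A quick check for whether a given kernel ships the modules (stock el6 kernels generally do not; this is a sketch, not from the original thread):

uname -r
modinfo rbd && modinfo ceph    # an error here means the running kernel doesn't ship the module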
Re: [ceph-users] enrich ceph test methods, what is your concern about ceph. thanks
Hi. I and many people use fio. For Ceph RBD, fio has a special engine: https://telekomcloud.github.io/ceph/2014/02/26/ceph-performance-analysis_fio_rbd.html

2014-08-26 12:15 GMT+04:00 yuelongguang fasts...@163.com:

hi, all

I am planning to do a test on ceph, including performance, throughput, scalability, and availability. In order to get a full test result, I hope you all can give me some advice. Meanwhile I can send the result to you, if you like. As for each test category (performance, throughput, scalability, availability), do you have some test ideas and test tools? Basically I know some tools to test throughput and IOPS, but please tell me the tools you prefer and the results you expect.

thanks very much

-- Best regards, Irek Fasikhov Mob.: +79229045757

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
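A minimal sketch of a job file for fio's rbd engine (fio must be built with rbd support; pool/image names are illustrative, and the image must exist before the run):

rbd create fio_test --size 2048
cat > rbd.fio <<'EOF'
[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio_test
invalidate=0
rw=randwrite
bs=4k
iodepth=32
runtime=60
time_based

[rbd_randwrite]
EOF
fio rbd.fio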
Re: [ceph-users] ceph cluster inconsistency?
Hi,

In the meantime I already tried upgrading the cluster to 0.84, to see if that made a difference, and it seems it does: I can't reproduce the crashing osds by doing a 'rados -p ecdata ls' anymore. But now the cluster detects it is inconsistent:

  cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
   health HEALTH_ERR 40 pgs inconsistent; 40 scrub errors; too few pgs per osd (4 < min 20); mon.ceph002 low disk space
   monmap e3: 3 mons at {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0}, election epoch 30, quorum 0,1,2 ceph001,ceph002,ceph003
   mdsmap e78951: 1/1/1 up {0=ceph003.cubone.os=up:active}, 3 up:standby
   osdmap e145384: 78 osds: 78 up, 78 in
    pgmap v247095: 320 pgs, 4 pools, 15366 GB data, 3841 kobjects
          1502 GB used, 129 TB / 131 TB avail
               279 active+clean
                40 active+clean+inconsistent
                 1 active+clean+scrubbing+deep

I tried to do ceph pg repair for all the inconsistent pgs:

  cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
   health HEALTH_ERR 40 pgs inconsistent; 1 pgs repair; 40 scrub errors; too few pgs per osd (4 < min 20); mon.ceph002 low disk space
   monmap e3: 3 mons at {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0}, election epoch 30, quorum 0,1,2 ceph001,ceph002,ceph003
   mdsmap e79486: 1/1/1 up {0=ceph003.cubone.os=up:active}, 3 up:standby
   osdmap e146452: 78 osds: 78 up, 78 in
    pgmap v248520: 320 pgs, 4 pools, 15366 GB data, 3841 kobjects
          1503 GB used, 129 TB / 131 TB avail
               279 active+clean
                39 active+clean+inconsistent
                 1 active+clean+scrubbing+deep
                 1 active+clean+scrubbing+deep+inconsistent+repair

I let it recover through the night, but this morning the mons were all gone, nothing to see in the log files. The osds were all still up!

  cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
   health HEALTH_ERR 36 pgs inconsistent; 1 pgs repair; 36 scrub errors; too few pgs per osd (4 < min 20)
   monmap e7: 3 mons at {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0}, election epoch 44, quorum 0,1,2 ceph001,ceph002,ceph003
   mdsmap e109481: 1/1/1 up {0=ceph003.cubone.os=up:active}, 3 up:standby
   osdmap e203410: 78 osds: 78 up, 78 in
    pgmap v331747: 320 pgs, 4 pools, 15251 GB data, 3812 kobjects
          1547 GB used, 129 TB / 131 TB avail
                 1 active+clean+scrubbing+deep+inconsistent+repair
               284 active+clean
                35 active+clean+inconsistent

I restarted the monitors now; I will let you know when I see something more.

- Message from Haomai Wang haomaiw...@gmail.com -
Date: Sun, 24 Aug 2014 12:51:41 +0800
From: Haomai Wang haomaiw...@gmail.com
Subject: Re: [ceph-users] ceph cluster inconsistency?
To: Kenneth Waegeman kenneth.waege...@ugent.be, ceph-users@lists.ceph.com

It's really strange! I wrote a test program according to the key ordering you provided and parsed the corresponding value. It's true! I have no idea now. If free, could you add this debug code to src/os/GenericObjectMap.cc and insert it *before* assert(start <= header.oid);:

dout(0) << "start: " << start << " header.oid: " << header.oid << dendl;

Then you need to recompile ceph-osd and run it again. The output log can help!

On Tue, Aug 19, 2014 at 10:19 PM, Haomai Wang haomaiw...@gmail.com wrote:

I feel a little embarrassed, 1024 rows is still true for me. I was wondering if you could give me all your keys via "ceph-kvstore-tool /var/lib/ceph/osd/ceph-67/current/ list _GHOBJTOSEQ_ > keys.log". thanks!

On Tue, Aug 19, 2014 at 4:58 PM, Kenneth Waegeman kenneth.waege...@ugent.be wrote:

- Message from Haomai Wang haomaiw...@gmail.com -
Date: Tue, 19 Aug 2014 12:28:27 +0800
From: Haomai Wang haomaiw...@gmail.com
Subject: Re: [ceph-users] ceph cluster inconsistency?
To: Kenneth Waegeman kenneth.waege...@ugent.be
Cc: Sage Weil sw...@redhat.com, ceph-users@lists.ceph.com

On Mon, Aug 18, 2014 at 7:32 PM, Kenneth Waegeman kenneth.waege...@ugent.be wrote:

- Message from Haomai Wang haomaiw...@gmail.com -
Date: Mon, 18 Aug 2014 18:34:11 +0800
From: Haomai Wang haomaiw...@gmail.com
Subject: Re: [ceph-users] ceph cluster inconsistency?
To: Kenneth Waegeman kenneth.waege...@ugent.be
Cc: Sage Weil sw...@redhat.com, ceph-users@lists.ceph.com

On Mon, Aug 18, 2014 at 5:38 PM, Kenneth Waegeman kenneth.waege...@ugent.be wrote:

Hi,

I tried this after restarting the osd, but I guess that was not the aim (

# ceph-kvstore-tool /var/lib/ceph/osd/ceph-67/current/ list _GHOBJTOSEQ_ | grep 6adb1100 -A 100
IO error: lock /var/lib/ceph/osd/ceph-67/current//LOCK: Resource temporarily unavailable
tools/ceph_kvstore_tool.cc: In function
Re: [ceph-users] v0.84 released
hi all,

"there are a zillion OSD bug fixes. Things are looking pretty good for the Giant release that is coming up in the next month."

any chance of having a compilable cephfs kernel module for el7 for the next major release?

stijn

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph monitor load, low performance
Move the logs to SSD and you will immediately increase performance; you are losing about 50% of the performance on the logs. And for three replicas, more than 5 hosts are recommended.

2014-08-26 12:17 GMT+04:00 Mateusz Skała mateusz.sk...@budikom.net:

Hi, thanks for the reply.

"From the top of my head, it is recommended to use 3 mons in production. Also, for the 22 osds your number of PGs looks a bit low, you should look at that."

I got it from http://ceph.com/docs/master/rados/operations/placement-groups/: (22 osds * 100) / 3 replicas = 733, ~1024 pgs. Please correct me if I'm wrong. It will be 5 mons (on 6 hosts), but for now we must migrate some data from the used servers.

"The performance of the cluster is poor - this is too vague. What is your current performance, what benchmarks have you tried, what is your data workload and most importantly, how is your cluster setup. what disks, ssds, network, ram, etc. Please provide more information so that people could help you. Andrei"

Hardware information:

ceph15:
RAM: 4GB
Network: 4x 1GB NIC
OSD disks: 2x SATA Seagate ST31000524NS, 2x SATA WDC WD1003FBYX-18Y7B0

ceph25:
RAM: 16GB
Network: 4x 1GB NIC
OSD disks: 2x SATA WDC WD7500BPKX-7, 2x SATA WDC WD7500BPKX-2, 2x SATA SSHD ST1000LM014-1EJ164

ceph30:
RAM: 16GB
Network: 4x 1GB NIC
OSD disks: 6x SATA SSHD ST1000LM014-1EJ164

ceph35:
RAM: 16GB
Network: 4x 1GB NIC
OSD disks: 6x SATA SSHD ST1000LM014-1EJ164

All journals are on the OSDs. 2 NICs are for the backend network (10.20.4.0/22) and 2 NICs are for the frontend (10.20.8.0/22). This cluster is used as the storage backend for 100 VMs on KVM. I haven't run benchmarks, but all VMs were migrated from Xen+GlusterFS(NFS); before the migration every VM ran fine, now each VM hangs for a few seconds from time to time, and apps installed on the VMs take much longer to load. GlusterFS was running on 2 servers with 1x 1GB NIC and 2x8 disks WDC WD7500BPKX-7. I made one test with recovery: if a disk is marked out, recovery io is 150-200MB/s, but all VMs hang until recovery ends. The biggest load is on ceph35: IOPS on each disk are near 150, cpu load ~4-5.
On the other hosts, cpu load ~2 and 120-130 IOPS.

Our ceph.conf:

[global]
fsid=a9d17295-62f2-46f6-8325-1cad7724e97f
mon initial members = ceph35, ceph30, ceph25, ceph15
mon host = 10.20.8.35, 10.20.8.30, 10.20.8.25, 10.20.8.15
public network = 10.20.8.0/22
cluster network = 10.20.4.0/22
osd journal size = 1024
filestore xattr use omap = true
osd pool default size = 3
osd pool default min size = 1
osd pool default pg num = 1024
osd pool default pgp num = 1024
osd crush chooseleaf type = 1
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
rbd default format = 2

##ceph35 osds
[osd.0]
cluster addr = 10.20.4.35
[osd.1]
cluster addr = 10.20.4.35
[osd.2]
cluster addr = 10.20.4.35
[osd.3]
cluster addr = 10.20.4.36
[osd.4]
cluster addr = 10.20.4.36
[osd.5]
cluster addr = 10.20.4.36

##ceph25 osds
[osd.6]
cluster addr = 10.20.4.25
public addr = 10.20.8.25
[osd.7]
cluster addr = 10.20.4.25
public addr = 10.20.8.25
[osd.8]
cluster addr = 10.20.4.25
public addr = 10.20.8.25
[osd.9]
cluster addr = 10.20.4.26
public addr = 10.20.8.26
[osd.10]
cluster addr = 10.20.4.26
public addr = 10.20.8.26
[osd.11]
cluster addr = 10.20.4.26
public addr = 10.20.8.26

##ceph15 osds
[osd.12]
cluster addr = 10.20.4.15
public addr = 10.20.8.15
[osd.13]
cluster addr = 10.20.4.15
public addr = 10.20.8.15
[osd.14]
cluster addr = 10.20.4.15
public addr = 10.20.8.15
[osd.15]
cluster addr = 10.20.4.16
public addr = 10.20.8.16

##ceph30 osds
[osd.16]
cluster addr = 10.20.4.30
public addr = 10.20.8.30
[osd.17]
cluster addr = 10.20.4.30
public addr = 10.20.8.30
[osd.18]
cluster addr = 10.20.4.30
public addr = 10.20.8.30
[osd.19]
cluster addr = 10.20.4.31
public addr = 10.20.8.31
[osd.20]
cluster addr = 10.20.4.31
public addr = 10.20.8.31
[osd.21]
cluster addr = 10.20.4.31
public addr = 10.20.8.31

[mon.ceph35]
host = ceph35
mon addr = 10.20.8.35:6789
[mon.ceph30]
host = ceph30
mon addr = 10.20.8.30:6789
[mon.ceph25]
host = ceph25
mon addr = 10.20.8.25:6789
[mon.ceph15]
host = ceph15
mon addr = 10.20.8.15:6789

Regards, Mateusz

-- Best regards, Irek Fasikhov Mob.: +79229045757

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
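Since the advice in this thread is to move the journals to SSD, here is a sketch of the usual per-OSD migration procedure (the device path and osd id are illustrative; the init commands vary by distro):

ceph osd set noout
service ceph stop osd.0
ceph-osd -i 0 --flush-journal
ln -sf /dev/disk/by-id/ssd-part1 /var/lib/ceph/osd/ceph-0/journal
ceph-osd -i 0 --mkjournal
service ceph start osd.0
ceph osd unset noout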
Re: [ceph-users] ceph cluster inconsistency?
Hmm, it looks like you hit this bug (http://tracker.ceph.com/issues/9223). Sorry for the late message; I forgot that this fix is merged into 0.84. Thanks for your patience :-)

On Tue, Aug 26, 2014 at 4:39 PM, Kenneth Waegeman kenneth.waege...@ugent.be wrote:

Hi,

In the meantime I already tried upgrading the cluster to 0.84, to see if that made a difference, and it seems it does: I can't reproduce the crashing osds by doing a 'rados -p ecdata ls' anymore. But now the cluster detects it is inconsistent:

  cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
   health HEALTH_ERR 40 pgs inconsistent; 40 scrub errors; too few pgs per osd (4 < min 20); mon.ceph002 low disk space
   monmap e3: 3 mons at {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0}, election epoch 30, quorum 0,1,2 ceph001,ceph002,ceph003
   mdsmap e78951: 1/1/1 up {0=ceph003.cubone.os=up:active}, 3 up:standby
   osdmap e145384: 78 osds: 78 up, 78 in
    pgmap v247095: 320 pgs, 4 pools, 15366 GB data, 3841 kobjects
          1502 GB used, 129 TB / 131 TB avail
               279 active+clean
                40 active+clean+inconsistent
                 1 active+clean+scrubbing+deep

I tried to do ceph pg repair for all the inconsistent pgs:

  cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
   health HEALTH_ERR 40 pgs inconsistent; 1 pgs repair; 40 scrub errors; too few pgs per osd (4 < min 20); mon.ceph002 low disk space
   monmap e3: 3 mons at {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0}, election epoch 30, quorum 0,1,2 ceph001,ceph002,ceph003
   mdsmap e79486: 1/1/1 up {0=ceph003.cubone.os=up:active}, 3 up:standby
   osdmap e146452: 78 osds: 78 up, 78 in
    pgmap v248520: 320 pgs, 4 pools, 15366 GB data, 3841 kobjects
          1503 GB used, 129 TB / 131 TB avail
               279 active+clean
                39 active+clean+inconsistent
                 1 active+clean+scrubbing+deep
                 1 active+clean+scrubbing+deep+inconsistent+repair

I let it recover through the night, but this morning the mons were all gone, nothing to see in the log files. The osds were all still up!

  cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
   health HEALTH_ERR 36 pgs inconsistent; 1 pgs repair; 36 scrub errors; too few pgs per osd (4 < min 20)
   monmap e7: 3 mons at {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0}, election epoch 44, quorum 0,1,2 ceph001,ceph002,ceph003
   mdsmap e109481: 1/1/1 up {0=ceph003.cubone.os=up:active}, 3 up:standby
   osdmap e203410: 78 osds: 78 up, 78 in
    pgmap v331747: 320 pgs, 4 pools, 15251 GB data, 3812 kobjects
          1547 GB used, 129 TB / 131 TB avail
                 1 active+clean+scrubbing+deep+inconsistent+repair
               284 active+clean
                35 active+clean+inconsistent

I restarted the monitors now; I will let you know when I see something more.

- Message from Haomai Wang haomaiw...@gmail.com -
Date: Sun, 24 Aug 2014 12:51:41 +0800
From: Haomai Wang haomaiw...@gmail.com
Subject: Re: [ceph-users] ceph cluster inconsistency?
To: Kenneth Waegeman kenneth.waege...@ugent.be, ceph-users@lists.ceph.com

It's really strange! I wrote a test program according to the key ordering you provided and parsed the corresponding value. It's true! I have no idea now. If free, could you add this debug code to src/os/GenericObjectMap.cc and insert it *before* assert(start <= header.oid);:

dout(0) << "start: " << start << " header.oid: " << header.oid << dendl;

Then you need to recompile ceph-osd and run it again. The output log can help it!

On Tue, Aug 19, 2014 at 10:19 PM, Haomai Wang haomaiw...@gmail.com wrote:

I feel a little embarrassed, 1024 rows is still true for me. I was wondering if you could give me all your keys via "ceph-kvstore-tool /var/lib/ceph/osd/ceph-67/current/ list _GHOBJTOSEQ_ > keys.log". thanks!

On Tue, Aug 19, 2014 at 4:58 PM, Kenneth Waegeman kenneth.waege...@ugent.be wrote:

- Message from Haomai Wang haomaiw...@gmail.com -
Date: Tue, 19 Aug 2014 12:28:27 +0800
From: Haomai Wang haomaiw...@gmail.com
Subject: Re: [ceph-users] ceph cluster inconsistency?
To: Kenneth Waegeman kenneth.waege...@ugent.be
Cc: Sage Weil sw...@redhat.com, ceph-users@lists.ceph.com

On Mon, Aug 18, 2014 at 7:32 PM, Kenneth Waegeman kenneth.waege...@ugent.be wrote:

- Message from Haomai Wang haomaiw...@gmail.com -
Date: Mon, 18 Aug 2014 18:34:11 +0800
From: Haomai Wang haomaiw...@gmail.com
Subject: Re: [ceph-users] ceph cluster inconsistency?
To: Kenneth Waegeman kenneth.waege...@ugent.be
Cc: Sage Weil sw...@redhat.com, ceph-users@lists.ceph.com

On Mon, Aug 18, 2014 at 5:38 PM, Kenneth Waegeman kenneth.waege...@ugent.be wrote:

Hi,
Re: [ceph-users] Ceph monitor load, low performance
You mean to move /var/log/ceph/* to SSD disk? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Best practice K/M-parameters EC pool
Hello,

On Tue, 26 Aug 2014 10:23:43 +1000 Blair Bethwaite wrote:

Message: 25
Date: Fri, 15 Aug 2014 15:06:49 +0200
From: Loic Dachary l...@dachary.org
To: Erik Logtenberg e...@logtenberg.eu, ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Best practice K/M-parameters EC pool
Message-ID: 53ee05e9.1040...@dachary.org
Content-Type: text/plain; charset=iso-8859-1

... Here is how I reason about it, roughly: If the probability of losing a disk is 0.1%, the probability of losing two disks simultaneously (i.e. before the failure can be recovered) would be 0.1*0.1 = 0.01%, and three disks becomes 0.1*0.1*0.1 = 0.001%, and four disks becomes 0.0001%.

I watched this conversation and an older similar one (Failure probability with largish deployments) with interest, as we are in the process of planning a pretty large Ceph cluster (~3.5 PB), so I have been trying to wrap my head around these issues.

As the OP of the "Failure probability with largish deployments" thread, I have to thank Blair for raising this issue again and doing the hard math below, which looks fine to me. At the end of that slightly inconclusive thread I walked away with the same impression as Blair, namely that the survival of PGs is the key factor and that they will likely be spread out over most, if not all, the OSDs. Which in turn reinforced my decision to deploy our first production Ceph cluster based on nodes with 2 OSDs backed by 11-disk RAID6 sets behind a HW RAID controller with 4GB cache AND SSD journals. I can live with the reduced performance (which is caused by the OSD code running out of steam long before the SSDs or the RAIDs do), because not only do I save 1/3rd of the space and 1/4th of the cost compared to a replication 3 cluster, the total of disks that need to fail within the recovery window to cause data loss is now 4. The next cluster I'm currently building is a classic Ceph design, replication of 3, 8 OSD HDDs and 4 journal SSDs per node, because with this cluster I won't have predictable I/O patterns and loads. OTOH, I don't see it growing much beyond 48 OSDs, so I'm happy enough with the odds here.

I think doing the exact maths for a cluster of the size you're planning would be very interesting and also very much needed. 3.5PB usable space would be close to 3000 disks with a replication of 3, but even if you meant that as a gross value it would probably mean that you're looking at frequent, if not daily, disk failures.

Regards, Christian

Loic's reasoning (above) seems sound as a naive approximation assuming independent probabilities for disk failures, which may not be quite true given the potential for batch production issues, but should be okay for other sorts of correlations (assuming a sane crushmap that eliminates things like controllers and nodes as sources of correlation).

One of the things that came up in the "Failure probability with largish deployments" thread, and has raised its head again here, is the idea that striped data (e.g., RADOS-GW objects and RBD volumes) might be somehow more prone to data-loss than non-striped. I don't think anyone has so far provided an answer on this, so here's my thinking...

The level of atomicity that matters when looking at durability and availability in Ceph is the Placement Group. For any non-trivial RBD it is likely that many RBDs will span all/most PGs; e.g., even a relatively small 50GiB volume would (with the default 4MiB object size) span 12800 PGs - more than there are in many production clusters obeying the 100-200 PGs per drive rule of thumb. <IMPORTANT>Losing any one PG will cause data-loss. The failure-probability effects of striping across multiple PGs are immaterial considering that loss of any single PG is likely to damage all your RBDs.</IMPORTANT> This might be why the reliability calculator doesn't consider the total number of disks.

Related to all this is the durability of 2 versus 3 replicas (or e.g. M=1 for Erasure Coding). It's easy to get caught up in the worrying fallacy that losing any M OSDs will cause data-loss, but this isn't true - they have to be members of the same PG for data-loss to occur. So then it's tempting to think the chances of that happening are so slim as to not matter, and why would we ever even need 3 replicas. I mean, what are the odds of exactly those 2 drives, out of the 100, 200... in my cluster, failing in the recovery window?! But therein lies the rub - you should be thinking about PGs. If a drive fails, then the chance of a resulting data-loss event is dependent on the chances of losing further drives from the affected/degraded PGs.

I've got a real cluster at hand, so let's use that as an example. We have 96 drives/OSDs - 8 nodes, 12 OSDs per node, 2 replicas, top-down failure domains: rack pairs (x2), nodes, OSDs... Let's say OSD 15 dies. How many PGs are now at risk:

$ grep ^10\. pg.dump | awk '{print $15}' | grep 15 | wc
    109     109     861

(NB: 10 is the pool
Re: [ceph-users] Ceph monitor load, low performance
I'm sorry, of course I meant journals :)

2014-08-26 13:16 GMT+04:00 Mateusz Skała mateusz.sk...@budikom.net:

You mean to move /var/log/ceph/* to SSD disk?

-- Best regards, Irek Fasikhov Mob.: +79229045757

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Ceph monitor load, low performance
Hello Gentlemen :-)

Let me point out one important aspect of this low-performance problem: of all 4 nodes of our ceph cluster, only one node shows bad metrics, that is, very high latency on its osds (200-600ms), while the other three nodes behave normally, with latency on their osds between 1-10ms. So, the idea of putting journals on SSD is something that we are looking at, but we think that we have in general some problem with that particular node, which affects the whole cluster. Can the number of hosts (4) be a reason for that? Any other hints?

Thanks, Pawel

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
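A couple of quick checks that can help isolate a single slow node (not from the original thread; a sketch):

ceph osd perf    # commit/apply latency per OSD; the bad node's OSDs should stand out
iostat -x 5      # on the suspect node: look for one disk with abnormal await/%util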
Re: [ceph-users] enrich ceph test methods, what is your concern about ceph. thanks
Thanks Irek Fasikhov. Is fio the only way to test ceph-rbd? An important aim of the test is to find where the bottleneck is: qemu, librbd, or ceph. Could you share your test results with me? Thanks.

On 2014-08-26 04:22:22, Irek Fasikhov malm...@gmail.com wrote:

Hi. I and many people use fio. For Ceph RBD, fio has a special engine: https://telekomcloud.github.io/ceph/2014/02/26/ceph-performance-analysis_fio_rbd.html

2014-08-26 12:15 GMT+04:00 yuelongguang fasts...@163.com:

hi, all

I am planning to do a test on ceph, including performance, throughput, scalability, and availability. In order to get a full test result, I hope you all can give me some advice. Meanwhile I can send the result to you, if you like. As for each test category (performance, throughput, scalability, availability), do you have some test ideas and test tools? Basically I know some tools to test throughput and IOPS, but please tell me the tools you prefer and the results you expect.

thanks very much

-- Best regards, Irek Fasikhov Mob.: +79229045757

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] enrich ceph test methods, what is your concern about ceph. thanks
For me, the bottleneck is single-threaded operation. Writes are more or less solved by enabling the rbd cache, but there are still problems with reads. I think those problems can be solved with a cache pool, but I have not tested it. It follows that the more threads, the greater the read and write speed; but in reality it is different: the speed and number of operations depend on many factors, such as network latency. Example tests, with special attention to the charts: https://software.intel.com/en-us/blogs/2013/10/25/measure-ceph-rbd-performance-in-a-quantitative-way-part-i and https://software.intel.com/en-us/blogs/2013/11/20/measure-ceph-rbd-performance-in-a-quantitative-way-part-ii

2014-08-26 15:11 GMT+04:00 yuelongguang fasts...@163.com:

Thanks Irek Fasikhov. Is fio the only way to test ceph-rbd? An important aim of the test is to find where the bottleneck is: qemu, librbd, or ceph. Could you share your test results with me? Thanks.

On 2014-08-26 04:22:22, Irek Fasikhov malm...@gmail.com wrote:

Hi. I and many people use fio. For Ceph RBD, fio has a special engine: https://telekomcloud.github.io/ceph/2014/02/26/ceph-performance-analysis_fio_rbd.html

2014-08-26 12:15 GMT+04:00 yuelongguang fasts...@163.com:

hi, all

I am planning to do a test on ceph, including performance, throughput, scalability, and availability. In order to get a full test result, I hope you all can give me some advice. Meanwhile I can send the result to you, if you like. As for each test category (performance, throughput, scalability, availability), do you have some test ideas and test tools? Basically I know some tools to test throughput and IOPS, but please tell me the tools you prefer and the results you expect.

thanks very much

-- Best regards, Irek Fasikhov Mob.: +79229045757

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
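For reference, a minimal ceph.conf sketch for the client-side RBD cache mentioned above (option availability depends on the release, and QEMU must also be configured to allow writeback caching; the sizes are illustrative):

[client]
rbd cache = true
rbd cache size = 33554432          # 32 MiB
rbd cache max dirty = 25165824     # 24 MiB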
Re: [ceph-users] enrich ceph test methods, what is your concern about ceph. thanks
Sorry, Enter was pressed too soon :) Continued: no, it's not the only way to test, but it depends on what you want to use Ceph for. 2014-08-26 15:22 GMT+04:00 Irek Fasikhov malm...@gmail.com: For me, the bottleneck is single-threaded operation. Writes are more or less solved by enabling the RBD cache, but there are still problems with reads. I think those problems can be solved with a cache pool, but I have not tested it. In theory, the more threads, the greater the read and write speed, but in reality it varies: the speed and number of operations depend on many factors, such as network latency. Examples of testing, with special attention to the charts: https://software.intel.com/en-us/blogs/2013/10/25/measure-ceph-rbd-performance-in-a-quantitative-way-part-i and https://software.intel.com/en-us/blogs/2013/11/20/measure-ceph-rbd-performance-in-a-quantitative-way-part-ii 2014-08-26 15:11 GMT+04:00 yuelongguang fasts...@163.com: Thanks, Irek Fasikhov. Is it the only way to test ceph-rbd? An important aim of the test is to find where the bottleneck is: qemu, librbd, or ceph. Could you share your test results with me? Thanks. On 2014-08-26 04:22:22, Irek Fasikhov malm...@gmail.com wrote: Hi. I and many other people use fio. For Ceph RBD it has a special engine: https://telekomcloud.github.io/ceph/2014/02/26/ceph-performance-analysis_fio_rbd.html 2014-08-26 12:15 GMT+04:00 yuelongguang fasts...@163.com: hi all, I am planning to run a test on Ceph covering performance, throughput, scalability, and availability. In order to get a full test result, I hope you all can give me some advice; meanwhile, I can send the results to you if you like. For each test category (performance, throughput, scalability, availability), do you have test ideas and test tools? Basically, I know some tools to test throughput and IOPS, but please tell me which tools you prefer and the results you expect. Thanks very much ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- С уважением, Фасихов Ирек Нургаязович Моб.: +79229045757 -- С уважением, Фасихов Ирек Нургаязович Моб.: +79229045757 -- С уважением, Фасихов Ирек Нургаязович Моб.: +79229045757 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
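One way to put the threading claim above to the test is to sweep the queue depth and watch whether IOPS scales; a hypothetical sweep reusing the fio invocation sketched earlier (same assumed fio-test image):

    # Flat IOPS across depths would point at a serialization bottleneck
    # somewhere in qemu/librbd/ceph rather than at the disks.
    for qd in 1 4 16 64; do
        fio --name=rbd-qd$qd --ioengine=rbd --clientname=admin \
            --pool=rbd --rbdname=fio-test --rw=randread --bs=4k \
            --iodepth=$qd --runtime=30 --time_based
    done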
[ceph-users] ceph can not repair itself after accidental power down, half of pgs are peering
hi all, I have 5 OSDs and 3 mons; its status was OK before. To be mentioned, this cluster has no data; I had just deployed it and was getting familiar with some command lines. What is the problem and how do I fix it? thanks ---environment- ceph-release-1-0.el6.noarch ceph-deploy-1.5.11-0.noarch ceph-0.81.0-5.el6.x86_64 ceph-libs-0.81.0-5.el6.x86_64 -ceph -s -- [root@cephosd1-mona ~]# ceph -s cluster 508634f6-20c9-43bb-bc6f-b777f4bb1651 health HEALTH_WARN 183 pgs peering; 183 pgs stuck inactive; 183 pgs stuck unclean; clock skew detected on mon.cephosd2-monb, mon.cephosd3-monc monmap e13: 3 mons at {cephosd1-mona=10.154.249.3:6789/0,cephosd2-monb=10.154.249.4:6789/0,cephosd3-monc=10.154.249.5:6789/0}, election epoch 74, quorum 0,1,2 cephosd1-mona,cephosd2-monb,cephosd3-monc osdmap e151: 5 osds: 5 up, 5 in pgmap v499: 384 pgs, 4 pools, 0 bytes data, 0 objects 201 MB used, 102143 MB / 102344 MB avail 167 peering 201 active+clean 16 remapped+peering --log--osd.0 2014-08-26 19:16:13.926345 7f114a8d2700 0 cephx: verify_authorizer could not decrypt ticket info: error: decryptor.MessageEnd::Exception: StreamTransformationFilter: invalid PKCS #7 block padding found 2014-08-26 19:16:13.926355 7f114a8d2700 0 -- 11.154.249.2:6800/1667 11.154.249.7:6800/1599 pipe(0x4dc2a80 sd=25 :6800 s=0 pgs=0 cs=0 l=0 c=0x45d5960).accept: got bad authorizer 2014-08-26 19:16:28.928023 7f114a8d2700 0 cephx: verify_authorizer could not decrypt ticket info: error: decryptor.MessageEnd::Exception: StreamTransformationFilter: invalid PKCS #7 block padding found 2014-08-26 19:16:28.928050 7f114a8d2700 0 -- 11.154.249.2:6800/1667 11.154.249.7:6800/1599 pipe(0x4dc2800 sd=25 :6800 s=0 pgs=0 cs=0 l=0 c=0x45d56a0).accept: got bad authorizer 2014-08-26 19:16:28.929139 7f114c009700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption 2014-08-26 19:16:28.929237 7f114c009700 0 -- 11.154.249.2:6800/1667 11.154.249.7:6800/1599 pipe(0x3edb700 sd=24 :38071 s=1 pgs=0 cs=0 l=0 c=0x45d23c0).failed verifying authorize reply 2014-08-26 19:16:43.930846 7f114a8d2700 0 cephx: verify_authorizer could not decrypt ticket info: error: decryptor.MessageEnd::Exception: StreamTransformationFilter: invalid PKCS #7 block padding found 2014-08-26 19:16:43.930899 7f114a8d2700 0 -- 11.154.249.2:6800/1667 11.154.249.7:6800/1599 pipe(0x4dc2580 sd=25 :6800 s=0 pgs=0 cs=0 l=0 c=0x45d0b00).accept: got bad authorizer 2014-08-26 19:16:43.932204 7f114c009700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption 2014-08-26 19:16:43.932230 7f114c009700 0 -- 11.154.249.2:6800/1667 11.154.249.7:6800/1599 pipe(0x3edb700 sd=24 :38073 s=1 pgs=0 cs=0 l=0 c=0x45d23c0).failed verifying authorize reply 2014-08-26 19:16:58.933526 7f114a8d2700 0 cephx: verify_authorizer could not decrypt ticket info: error: decryptor.MessageEnd::Exception: StreamTransformationFilter: invalid PKCS #7 block padding found 2014-08-26 19:16:58.935094 7f114a8d2700 0 -- 11.154.249.2:6800/1667 11.154.249.7:6800/1599 pipe(0x4dc2300 sd=25 :6800 s=0 pgs=0 cs=0 l=0 c=0x45d0840).accept: got bad authorizer 2014-08-26 19:16:58.936239 7f114c009700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption 2014-08-26 19:16:58.936261 7f114c009700 0 -- 11.154.249.2:6800/1667 11.154.249.7:6800/1599 pipe(0x3edb700 sd=24 :38074 s=1 pgs=0 cs=0 l=0 c=0x45d23c0).failed verifying authorize reply 2014-08-26 19:17:13.937335 7f114a8d2700 0 cephx: verify_authorizer could not decrypt ticket info: error: decryptor.MessageEnd::Exception:
StreamTransformationFilter: invalid PKCS #7 block padding found 2014-08-26 19:17:13.937368 7f114a8d2700 0 -- 11.154.249.2:6800/1667 11.154.249.7:6800/1599 pipe(0x4dc2080 sd=25 :6800 s=0 pgs=0 cs=0 l=0 c=0x45d1b80).accept: got bad authorizer 2014-08-26 19:17:13.937923 7f114c009700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption 2014-08-26 19:17:13.937933 7f114c009700 0 -- 11.154.249.2:6800/1667 11.154.249.7:6800/1599 pipe(0x3edb700 sd=24 :38075 s=1 pgs=0 cs=0 l=0 c=0x45d23c0).failed verifying authorize reply 2014-08-26 19:17:28.939439 7f114a8d2700 0 cephx: verify_authorizer could not decrypt ticket info: error: decryptor.MessageEnd::Exception: StreamTransformationFilter: invalid PKCS #7 block padding found 2014-08-26 19:17:28.939455 7f114a8d2700 0 -- 11.154.249.2:6800/1667 11.154.249.7:6800/1599 pipe(0x4dc1e00 sd=25 :6800 s=0 pgs=0 cs=0 l=0 c=0x45d5540).accept: got bad authorizer 2014-08-26 19:17:28.939716 7f114c009700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption 2014-08-26 19:17:28.939731 7f114c009700 0 -- 11.154.249.2:6800/1667 11.154.249.7:6800/1599 pipe(0x3edb700 sd=24 :38076 s=1 pgs=0 cs=0 l=0
Re: [ceph-users] Best practice K/M-parameters EC pool
Hi Blair, Assuming that: * The pool is configured for three replicas (size = 3 which is the default) * It takes one hour for Ceph to recover from the loss of a single OSD * Any other disk has a 0.001% chance to fail within the hour following the failure of the first disk (assuming the AFR https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is 10%, divided by the number of hours during a year). * A given disk does not participate in more than 100 PGs Each time an OSD is lost, there is a 0.001*0.001 = 0.01% chance that two other disks are lost before recovery. Since the disk that failed initially participates in 100 PGs, that is 0.01% x 100 = 0.0001% chance that a PG is lost. Or the entire pool if it is used in a way that losing a PG means losing all data in the pool (as in your example, where it contains RBD volumes and each of the RBD volumes uses all the available PGs). If the pool is using at least two datacenters operated by two different organizations, this calculation makes sense to me. However, if the cluster is in a single datacenter, isn't it possible that some event independent of Ceph has a greater probability of permanently destroying the data? A month ago I lost three machines in a Ceph cluster and realized on that occasion that the crushmap was not configured properly and that PGs were lost as a result. Fortunately I was able to recover the disks and plug them into another machine to recover the lost PGs. I'm not a system administrator and the probability of me failing to do the right thing is higher than normal: this is just an example of a high-probability event leading to data loss. In other words, I wonder if this 0.0001% chance of losing a PG within the hour following a disk failure matters or if it is dominated by other factors. What do you think? Cheers On 26/08/2014 02:23, Blair Bethwaite wrote: Message: 25 Date: Fri, 15 Aug 2014 15:06:49 +0200 From: Loic Dachary l...@dachary.org To: Erik Logtenberg e...@logtenberg.eu, ceph-users@lists.ceph.com Subject: Re: [ceph-users] Best practice K/M-parameters EC pool Message-ID: 53ee05e9.1040...@dachary.org Content-Type: text/plain; charset=iso-8859-1 ... Here is how I reason about it, roughly: If the probability of losing a disk is 0.1%, the probability of losing two disks simultaneously (i.e. before the failure can be recovered) would be 0.1*0.1 = 0.01% and three disks becomes 0.1*0.1*0.1 = 0.001% and four disks becomes 0.0001% I watched this conversation and an older similar one (Failure probability with largish deployments) with interest as we are in the process of planning a pretty large Ceph cluster (~3.5 PB), so I have been trying to wrap my head around these issues. Loic's reasoning (above) seems sound as a naive approximation assuming independent probabilities for disk failures, which may not be quite true given the potential for batch production issues, but should be okay for other sorts of correlations (assuming a sane crushmap that eliminates things like controllers and nodes as sources of correlation). One of the things that came up in the Failure probability with largish deployments thread and has raised its head again here is the idea that striped data (e.g., RADOS-GW objects and RBD volumes) might be somehow more prone to data-loss than non-striped. I don't think anyone has so far provided an answer on this, so here's my thinking... The level of atomicity that matters when looking at durability/availability in Ceph is the Placement Group.
For any non-trivial RBD workload it is likely that many RBDs will span all/most PGs, e.g., even a relatively small 50GiB volume would (with the default 4MiB object size) comprise 12800 objects, mapping to more PGs than exist in many production clusters obeying the 100-200 PGs per drive rule of thumb. IMPORTANT: Losing any one PG will cause data-loss. The failure-probability effects of striping across multiple PGs are immaterial considering that loss of any single PG is likely to damage all your RBDs. This might be why the reliability calculator doesn't consider the total number of disks. Related to all this is the durability of 2 versus 3 replicas (or e.g. M=1 for Erasure Coding). It's easy to get caught up in the worrying fallacy that losing any M OSDs will cause data-loss, but this isn't true; they have to be members of the same PG for data-loss to occur. So then it's tempting to think the chances of that happening are so slim as to not matter, and to wonder why we would ever even need 3 replicas. I mean, what are the odds of exactly those 2 drives, out of the 100, 200... in my cluster, failing in the recovery window?! But therein lies the rub: you should be thinking about PGs. If a drive fails, then the chances of a data-loss event resulting are dependent on the chances of losing further drives from the affected/degraded PGs. I've got a real cluster at hand, so let's use that as an example. We have 96 drives/OSDs - 8 nodes, 12
Re: [ceph-users] Best practice K/M-parameters EC pool
Using percentages instead of numbers led me to calculation errors. Here it is again using 1/100 instead of % for clarity ;-) Assuming that: * The pool is configured for three replicas (size = 3 which is the default) * It takes one hour for Ceph to recover from the loss of a single OSD * Any other disk has a 1/100,000 chance to fail within the hour following the failure of the first disk (assuming the AFR https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is 8%, divided by the number of hours during a year == (0.08 / 8760) ~= 1/100,000) * A given disk does not participate in more than 100 PGs Each time an OSD is lost, there is a 1/100,000*1/100,000 = 1/10,000,000,000 chance that two other disks are lost before recovery. Since the disk that failed initially participates in 100 PGs, that is 1/10,000,000,000 x 100 = 1/100,000,000 chance that a PG is lost. Or the entire pool if it is used in a way that losing a PG means losing all data in the pool (as in your example, where it contains RBD volumes and each of the RBD volumes uses all the available PGs). If the pool is using at least two datacenters operated by two different organizations, this calculation makes sense to me. However, if the cluster is in a single datacenter, isn't it possible that some event independent of Ceph has a greater probability of permanently destroying the data? A month ago I lost three machines in a Ceph cluster and realized on that occasion that the crushmap was not configured properly and that PGs were lost as a result. Fortunately I was able to recover the disks and plug them into another machine to recover the lost PGs. I'm not a system administrator and the probability of me failing to do the right thing is higher than normal: this is just an example of a high-probability event leading to data loss. Another example would be if all disks in the same PG are part of the same batch and therefore likely to fail at the same time. In other words, I wonder if this 1/100,000,000 chance of losing a PG within the hour following a disk failure matters or if it is dominated by other factors. What do you think? Cheers Assuming that: * The pool is configured for three replicas (size = 3 which is the default) * It takes one hour for Ceph to recover from the loss of a single OSD * Any other disk has a 0.001% chance to fail within the hour following the failure of the first disk (assuming the AFR https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is 10%, divided by the number of hours during a year). * A given disk does not participate in more than 100 PGs Each time an OSD is lost, there is a 0.001*0.001 = 0.01% chance that two other disks are lost before recovery. Since the disk that failed initially participates in 100 PGs, that is 0.01% x 100 = 0.0001% chance that a PG is lost. Or the entire pool if it is used in a way that losing a PG means losing all data in the pool (as in your example, where it contains RBD volumes and each of the RBD volumes uses all the available PGs). If the pool is using at least two datacenters operated by two different organizations, this calculation makes sense to me. However, if the cluster is in a single datacenter, isn't it possible that some event independent of Ceph has a greater probability of permanently destroying the data? A month ago I lost three machines in a Ceph cluster and realized on that occasion that the crushmap was not configured properly and that PGs were lost as a result.
Fortunately I was able to recover the disks and plug them into another machine to recover the lost PGs. I'm not a system administrator and the probability of me failing to do the right thing is higher than normal: this is just an example of a high-probability event leading to data loss. In other words, I wonder if this 0.0001% chance of losing a PG within the hour following a disk failure matters or if it is dominated by other factors. What do you think? Cheers On 26/08/2014 15:25, Loic Dachary wrote: Hi Blair, Assuming that: * The pool is configured for three replicas (size = 3 which is the default) * It takes one hour for Ceph to recover from the loss of a single OSD * Any other disk has a 0.001% chance to fail within the hour following the failure of the first disk (assuming the AFR https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is 10%, divided by the number of hours during a year). * A given disk does not participate in more than 100 PGs Each time an OSD is lost, there is a 0.001*0.001 = 0.01% chance that two other disks are lost before recovery. Since the disk that failed initially participates in 100 PGs, that is 0.01% x 100 = 0.0001% chance that a PG is lost. Or the entire pool if it is used in a way that losing a PG means losing all data in the pool (as in your example, where it contains RBD volumes and each of the RBD volumes uses all the available PGs). If the
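Restated compactly, the corrected calculation above reads as follows, writing $p$ for the per-disk hourly failure probability and $N$ for the number of PGs the failed disk participates in:

\[
p = \frac{\mathrm{AFR}}{8760} \approx \frac{0.08}{8760} \approx \frac{1}{100{,}000}, \qquad
P(\text{PG lost}) \approx N \, p^{2} = 100 \times \frac{1}{10^{10}} = \frac{1}{10^{8}}
\]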
[ceph-users] MDS dying on Ceph 0.67.10
Hi all, I have a cluster of 2 nodes on CentOS 6.5 with ceph 0.67.10 (replicate = 2). When I added the 3rd node to the Ceph cluster, Ceph performed rebalancing. I have 3 MDSs on the 3 nodes; the MDS process dies after a while with a stack trace: --- 2014-08-26 17:08:34.362901 7f1c2c704700 1 -- 10.20.0.21:6800/22154 == osd.10 10.20.0.21:6802/15917 1 osd_op_reply(230 10003f6. [tmapup 0~0] ondisk = 0) v4 119+0+0 (1770421071 0 0) 0x2aece00 con 0x2aa4200 -54 2014-08-26 17:08:34.362942 7f1c2c704700 1 -- 10.20.0.21:6800/22154 == osd.55 10.20.0.23:6800/2407 10 osd_op_reply(263 100048a. [getxattr] ack = -2 (No such file or directory)) v4 119+0+0 (3908997833 0 0) 0x1e63000 con 0x1e7aaa0 -53 2014-08-26 17:08:34.363001 7f1c2c704700 5 mds.0.log submit_entry 427629603~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs] -52 2014-08-26 17:08:34.363022 7f1c2c704700 1 -- 10.20.0.21:6800/22154 == osd.37 10.20.0.22:6898/11994 6 osd_op_reply(226 1. [tmapput 0~7664] ondisk = 0) v4 109+0+0 (1007110430 0 0) 0x1e64800 con 0x1e7a7e0 -51 2014-08-26 17:08:34.363092 7f1c2c704700 5 mds.0.log _expired segment 293601899 2548 events -50 2014-08-26 17:08:34.363117 7f1c2c704700 1 -- 10.20.0.21:6800/22154 == osd.17 10.20.0.21:6941/17572 9 osd_op_reply(264 1000489. [getxattr] ack = -2 (No such file or directory)) v4 119+0+0 (1979034473 0 0) 0x1e62200 con 0x1e7b180 -49 2014-08-26 17:08:34.363177 7f1c2c704700 5 mds.0.log submit_entry 427631148~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs] -48 2014-08-26 17:08:34.363197 7f1c2c704700 1 -- 10.20.0.21:6800/22154 == osd.1 10.20.0.21:6872/13227 6 osd_op_reply(265 1000491. [getxattr] ack = -2 (No such file or directory)) v4 119+0+0 (1231782695 0 0) 0x1e63400 con 0x1e7ac00 -47 2014-08-26 17:08:34.363255 7f1c2c704700 5 mds.0.log submit_entry 427632693~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs] -46 2014-08-26 17:08:34.363274 7f1c2c704700 1 -- 10.20.0.21:6800/22154 == osd.11 10.20.0.21:6884/7018 5 osd_op_reply(266 100047d. [getxattr] ack = -2 (No such file or directory)) v4 119+0+0 (2737916920 0 0) 0x1e61e00 con 0x1e7bc80 - I tried to restart the MDSs, but after a few seconds in the active state, the MDS switches to laggy or crashed. I have a lot of important data on it. I do not want to use the command: ceph mds newfs metadata pool id data pool id --yes-i-really-mean-it :( Tien Bui. -- Bui Minh Tien ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph can not repair itself after accidental power down, half of pgs are peering
How far out are your clocks? It's showing a clock skew; if the clocks are too far out, it can cause issues with cephx. Otherwise you're probably going to need to check your cephx auth keys. -Michael On 26/08/2014 12:26, yuelongguang wrote: hi all, I have 5 OSDs and 3 mons; its status was OK before. To be mentioned, this cluster has no data; I had just deployed it and was getting familiar with some command lines. What is the problem and how do I fix it? thanks ---environment- ceph-release-1-0.el6.noarch ceph-deploy-1.5.11-0.noarch ceph-0.81.0-5.el6.x86_64 ceph-libs-0.81.0-5.el6.x86_64 -ceph -s -- [root@cephosd1-mona ~]# ceph -s cluster 508634f6-20c9-43bb-bc6f-b777f4bb1651 health HEALTH_WARN 183 pgs peering; 183 pgs stuck inactive; 183 pgs stuck unclean; clock skew detected on mon.cephosd2-monb, mon.cephosd3-monc monmap e13: 3 mons at {cephosd1-mona=10.154.249.3:6789/0,cephosd2-monb=10.154.249.4:6789/0,cephosd3-monc=10.154.249.5:6789/0}, election epoch 74, quorum 0,1,2 cephosd1-mona,cephosd2-monb,cephosd3-monc osdmap e151: 5 osds: 5 up, 5 in pgmap v499: 384 pgs, 4 pools, 0 bytes data, 0 objects 201 MB used, 102143 MB / 102344 MB avail 167 peering 201 active+clean 16 remapped+peering --log--osd.0 2014-08-26 19:16:13.926345 7f114a8d2700 0 cephx: verify_authorizer could not decrypt ticket info: error: decryptor.MessageEnd::Exception: StreamTransformationFilter: invalid PKCS #7 block padding found 2014-08-26 19:16:13.926355 7f114a8d2700 0 -- 11.154.249.2:6800/1667 11.154.249.7:6800/1599 pipe(0x4dc2a80 sd=25 :6800 s=0 pgs=0 cs=0 l=0 c=0x45d5960).accept: got bad authorizer ...
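If it is clock skew, syncing the monitor hosts is usually enough; a sketch assuming ntpd is in use on each mon node (CentOS 6 style service commands):

    # Check how far this mon's clock is from its NTP peers
    ntpq -p
    # Force a one-off resync if the offset is large (ntpd must be stopped)
    service ntpd stop && ntpdate pool.ntp.org && service ntpd start
    # Then confirm the skew warning clears
    ceph health detail | grep -i skew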
Re: [ceph-users] Best practice K/M-parameters EC pool
My OSD rebuild time is more like 48 hours (4TB disks, 60% full, osd max backfills = 1). I believe that increases my risk of failure by 48^2. Since your numbers are failure rate per hour per disk, I need to consider the risk over the whole rebuild window for each disk; more formally, rebuild time to the power of (replicas - 1). So I'm at 2304/100,000,000, or approximately 1/43,000. That's a much higher risk than 1/10^8. A risk of 1/43,000 means that I'm more likely to lose data due to human error than disk failure. Still, I can put in a small bit of effort to optimize recovery speed and lower this number. Managing human error is much harder. On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary l...@dachary.org wrote: Using percentages instead of numbers led me to calculation errors. Here it is again using 1/100 instead of % for clarity ;-) Assuming that: * The pool is configured for three replicas (size = 3 which is the default) * It takes one hour for Ceph to recover from the loss of a single OSD * Any other disk has a 1/100,000 chance to fail within the hour following the failure of the first disk (assuming the AFR https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is 8%, divided by the number of hours during a year == (0.08 / 8760) ~= 1/100,000) * A given disk does not participate in more than 100 PGs ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
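Written out, the arithmetic sketched here is (with $t$ the recovery window in hours, $p \approx 1/100{,}000$ the per-disk hourly failure probability, $r = 3$ replicas, and $N = 100$ PGs per disk):

\[
P \approx N \, (t\,p)^{\,r-1} = 100 \times \left(\frac{48}{100{,}000}\right)^{2} = \frac{2304}{10^{8}} \approx \frac{1}{43{,}000}
\]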
Re: [ceph-users] Ceph monitor load, low performance
I had a similar problem once. I traced my problem to a failed battery on my RAID card, which disabled write caching. One of the many things I need to add to monitoring. On Tue, Aug 26, 2014 at 3:58 AM, pawel.orzechow...@budikom.net wrote: Hello Gentlemen :-) Let me point out one important aspect of this low-performance problem: of all 4 nodes in our Ceph cluster, only one node shows bad metrics, that is, very high latency on its OSDs (from 200-600ms), while the other three nodes behave normally, with OSD latencies between 1-10ms. So, putting journals on SSDs is something we are looking at, but we think we have some general problem with that particular node, which affects the whole cluster. Can the number (4) of hosts be a reason for that? Any other hints? Thanks Pawel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
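For anyone wanting to monitor this, here is the sort of check involved; a hypothetical sketch for LSI MegaRAID controllers only (the tool name, path, and flags depend on your controller and should be treated as assumptions to verify):

    # Hypothetical BBU health check on an LSI MegaRAID controller
    MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL | grep -i 'battery state'
    # Verify the write cache policy hasn't silently fallen back to write-through
    MegaCli64 -LDGetProp -Cache -LAll -aAll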
Re: [ceph-users] Best practice K/M-parameters EC pool
Hi Craig, I assume the reason for the 48-hour recovery time is to keep the cost of the cluster low? I wrote 1h recovery time because it is roughly the time it would take to move 4TB over a 10Gb/s link. Could you upgrade your hardware to reduce the recovery time to less than two hours? Or are there factors other than cost that prevent this? Cheers On 26/08/2014 19:37, Craig Lewis wrote: My OSD rebuild time is more like 48 hours (4TB disks, 60% full, osd max backfills = 1). I believe that increases my risk of failure by 48^2. Since your numbers are failure rate per hour per disk, I need to consider the risk over the whole rebuild window for each disk; more formally, rebuild time to the power of (replicas - 1). So I'm at 2304/100,000,000, or approximately 1/43,000. That's a much higher risk than 1/10^8. A risk of 1/43,000 means that I'm more likely to lose data due to human error than disk failure. Still, I can put in a small bit of effort to optimize recovery speed and lower this number. Managing human error is much harder. On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary l...@dachary.org mailto:l...@dachary.org wrote: Using percentages instead of numbers led me to calculation errors. Here it is again using 1/100 instead of % for clarity ;-) Assuming that: * The pool is configured for three replicas (size = 3 which is the default) * It takes one hour for Ceph to recover from the loss of a single OSD * Any other disk has a 1/100,000 chance to fail within the hour following the failure of the first disk (assuming the AFR https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is 8%, divided by the number of hours during a year == (0.08 / 8760) ~= 1/100,000) * A given disk does not participate in more than 100 PGs -- Loïc Dachary, Artisan Logiciel Libre ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
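On the knobs available for shortening recovery: the throttle Craig mentions can be raised at runtime; a sketch with illustrative values, to be weighed against the impact on client I/O during recovery:

    # Inject higher recovery/backfill limits into all OSDs at runtime
    ceph tell osd.* injectargs '--osd-max-backfills 4 --osd-recovery-max-active 8'
    # To make the change persistent, set in ceph.conf under [osd]:
    #   osd max backfills = 4
    #   osd recovery max active = 8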
Re: [ceph-users] slow read speeds from kernel rbd (Firefly 0.80.4)
Ok, after some delays and the move to new network hardware, I have an update. I'm still seeing the same low bandwidth and high retransmissions from iperf after moving to the Cisco 6001 (10Gb) and 2960 (1Gb). I've narrowed it down to transmissions from a 10Gb-connected host to a 1Gb-connected host. Taking a more targeted tcpdump, I discovered that there are multiple duplicate ACKs, triggering fast retransmissions between the two test hosts. There are several websites/articles which suggest that mixing 10Gb and 1Gb hosts causes performance issues, but no concrete explanation of why that's the case, or whether it can be avoided without moving everything to 10Gb, e.g.: http://blogs.technet.com/b/networking/archive/2011/05/16/tcp-dupacks-and-tcp-fast-retransmits.aspx http://en.community.dell.com/dell-groups/dtcmedia/m/mediagallery/19856911/download.aspx [PDF] http://packetpushers.net/flow-control-storm-%E2%80%93-ip-storage-performance-effects/ I verified that it's not a flow control storm (the pause frame counters along the network path are zero), so, assuming it might be bandwidth related, I installed trickle and used it to limit the bandwidth of iperf to 1Gb; no change. I further restricted it down to 100Kbps, and was *still* seeing high retransmission. This seems to imply it's not purely bandwidth related. After further research, I noticed a difference of about 0.1ms in the RTT between two 10Gb hosts (intra-switch) and the 10Gb and 1Gb host (inter-switch). I theorized this may be affecting the retransmission timeout counter calculations, per: http://sgros.blogspot.com/2012/02/calculating-tcp-rto.html so I used ethtool to set the link plugged into the 10Gb 6001 to 1Gb; this immediately fixed the issue. After this change the difference in RTTs moved to about .025ms. Plugging another host into the old 10Gb FEX, I have 10Gb to 10Gb RTTs within .001ms of 6001 to 2960 RTTs, and don't see the high retransmissions with iperf between those 10Gb hosts. tl;dr: Right now I don't see retransmissions between hosts on the same switch (even if speeds are mixed), but I do across switches when the hosts are mixed 10Gb/1Gb. Also, I wonder what the difference is between process-level bandwidth limiting and negotiating the link at 1Gb that leads to the differences observed. I checked the link per Mark's suggestion below, but all the values they increase in that old post are already lower than the defaults set on my hosts. If anyone has any ideas or explanations, I'd appreciate it. Otherwise, I'll keep the list posted if I uncover a solution or make more progress. Thanks. -Steve On 07/28/2014 01:21 PM, Mark Nelson wrote: On 07/28/2014 11:28 AM, Steve Anthony wrote: While searching for more information I happened across the following post (http://dachary.org/?p=2961) which vaguely resembled the symptoms I've been experiencing. I ran tcpdump and noticed what appeared to be a high number of retransmissions on the host where the images are mounted during a read from a Ceph rbd, so I ran iperf3 to get some concrete numbers: Very interesting that you are seeing retransmissions. Server: nas4 (where rbd images are mapped) Client: ceph2 (currently not in the cluster, but configured identically to the other nodes) Start server on nas4: iperf3 -s On ceph2, connect to server nas4, send 4096MB of data, report on 1 second intervals. Add -R to reverse the client/server roles.
iperf3 -c nas4 -n 4096M -i 1

Summary of traffic going out the 1Gb interface to a switch:

    [ ID] Interval           Transfer     Bandwidth       Retr
    [  5] 0.00-36.53 sec     4.00 GBytes  941 Mbits/sec   15      sender
    [  5] 0.00-36.53 sec     4.00 GBytes  940 Mbits/sec           receiver

Reversed, summary of traffic going over the fabric extender:

    [ ID] Interval           Transfer     Bandwidth       Retr
    [  5] 0.00-80.84 sec     4.00 GBytes  425 Mbits/sec   30756   sender
    [  5] 0.00-80.84 sec     4.00 GBytes  425 Mbits/sec           receiver

Definitely looks suspect! It appears that the issue is related to the network topology employed. The private cluster network and nas4's public interface are both connected to a 10Gb Cisco Fabric Extender (FEX), in turn connected to a Nexus 7000. This was meant as a temporary solution until our network team could finalize their design and bring up the Nexus 6001 for the cluster. From what our network guys have said, the FEX has been much more limited than they anticipated and they haven't been pleased with it as a solution in general. The 6001 is supposed to be ready this week, so once it's online I'll move the cluster to that switch and re-test to see if this fixes the issues I've been experiencing. If it's not the hardware, one other thing you might want to test is to make sure it's not something similar to the autotuning issues we used to see. I don't think this should be an issue at this point given the code changes we made to address it, but it would be easy to test. Doesn't seem like
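Two follow-ups that may help the next person who hits this. First, the ethtool change Steve describes is a one-liner; a sketch assuming the interface name is eth0 and that the switch port tolerates a forced speed (both assumptions to verify locally):

    # Force the NIC down to 1Gb full duplex (disables autonegotiation)
    ethtool -s eth0 speed 1000 duplex full autoneg off
    # Confirm what was actually negotiated
    ethtool eth0 | grep -i speed

Second, on the autotuning question at the end: the relevant kernel knobs can be inspected without changing anything; treat the names below as the usual suspects rather than a definitive list:

    # TCP buffer autotuning settings (min/default/max, in bytes)
    sysctl net.ipv4.tcp_moderate_rcvbuf net.ipv4.tcp_rmem net.ipv4.tcp_wmem
    # Rough count of retransmitted segments since boot
    netstat -s | grep -i retrans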