Re: [ceph-users] debian repositories path change?
On Sat, 19 Sep 2015, Brian Kroth wrote:
> Just to be clear, there's no longer going to be a generic
> http://downloads.ceph.com/debian (sans -{ceph-release-name}) path? In
> other words, we'll have to monitor something else to determine what's
> considered stable for our {distro-release} and then update the sources
> to point at a new debian-{ceph-release-name} ourselves, correct?

This was an oversight. I'll add the symlinks for debian and rpm, pointing
to hammer for now. They'll generally always point to the most recent
stable release.

sage

> Thanks,
> Brian
>
> On Fri, Sep 18, 2015, 09:45 Alfredo Deza wrote:
> > The new locations are in:
> >
> >   http://packages.ceph.com/
> >
> > For debian this would be:
> >
> >   http://packages.ceph.com/debian-{release}
> >
> > Note that ceph-extras is no longer available: the current repos should
> > provide everything that is needed to properly install ceph. Otherwise,
> > please let us know.
> >
> > On Fri, Sep 18, 2015 at 10:35 AM, Brian Kroth wrote:
> > > Hmm, apparently I haven't gotten that far in my email backlog yet.
> > > That's good to know too.
> > >
> > > Thanks,
> > > Brian
> > >
> > > Olivier Bonvalet 2015-09-18 16:02:
> > > > Hi,
> > > >
> > > > not sure if it's related, but there are recent changes because of
> > > > a security issue:
> > > >
> > > > http://ceph.com/releases/important-security-notice-regarding-signing-key-and-binary-downloads-of-ceph/
> > > >
> > > > On Friday 18 September 2015 at 08:45 -0500, Brian Kroth wrote:
> > > > > Hi all, we've had the following in our
> > > > > /etc/apt/sources.list.d/ceph.list for a while based on some
> > > > > previous docs,
> > > > >
> > > > > # ceph upstream stable (currently giant) release packages for wheezy:
> > > > > deb http://ceph.com/debian/ wheezy main
> > > > >
> > > > > # ceph extras:
> > > > > deb http://ceph.com/packages/ceph-extras/debian wheezy main
> > > > >
> > > > > but it seems like the straight "debian/" portion of that path
> > > > > has gone missing recently, and now there's only debian-firefly/,
> > > > > debian-giant/, debian-hammer/, etc.
> > > > >
> > > > > Is that just an oversight, or should we be switching our sources
> > > > > to one of the named releases? I figured that the unnamed one
> > > > > would automatically track what ceph currently considered
> > > > > "stable" for the target distro release for me, but maybe that's
> > > > > not the case.
> > > > >
> > > > > Thanks,
> > > > > Brian
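For reference, a sources.list entry pinned to a named stable release, per
the layout described above, would look like this (a sketch; substitute
your distro codename and the release you want to track):

    # /etc/apt/sources.list.d/ceph.list
    # track the hammer stable release for wheezy
    deb http://download.ceph.com/debian-hammer/ wheezy main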
Re: [ceph-users] debian repositories path change?
On 19 September 2015 at 01:55, Ken Dreyer wrote:
> To avoid confusion here, I've deleted packages.ceph.com from DNS
> today, and the change will propagate soon.
>
> Please use download.ceph.com (it's the same IP address and server,
> 173.236.248.54)

I'm getting:

  W: GPG error: http://download.ceph.com wheezy Release: The following
  signatures couldn't be verified because the public key is not
  available: NO_PUBKEY E84AC2C0460F3994

trying to update from there.

--
Lindsay
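The NO_PUBKEY error means apt does not yet trust the new release key that
was rotated in after the security incident mentioned earlier in the
thread. A minimal sketch of the fix, assuming the key is published at the
usual location on download.ceph.com:

    # fetch the new Ceph release key and add it to apt's trusted keyring
    wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -
    sudo apt-get update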
Re: [ceph-users] multi-datacenter crush map
Ok, so if I understand correctly, for replication level 3 or 4 I would
have to use the rule

  rule replicated_ruleset {
          ruleset 0
          type replicated
          min_size 1
          max_size 10
          step take root
          step choose firstn 2 type datacenter
          step chooseleaf firstn 2 type host
          step emit
  }

The question I have now is: how will it behave when a DC goes down?
(Assuming catastrophic failure, the thing burns down.)

For example, if I set replication to 3 and min_rep to 3, then if a DC
goes down, CRUSH will only return 2 OSDs, so everything will hang (same
for 4/4 and 4/3).

If I set replication to 3 and min_rep to 2, it could occur that all data
of a PG is in one DC (degraded mode). If this DC goes down, the PG will
hang. As far as I know, degraded PGs will still accept writes, so data
loss is possible. (Same for 4/2.)

I can't seem to find a way around this. What am I missing?

Wouter

On Fri, Sep 18, 2015 at 10:10 PM, Gregory Farnum wrote:
> On Fri, Sep 18, 2015 at 4:57 AM, Wouter De Borger wrote:
> > Hi all,
> >
> > I have found on the mailing list that it should be possible to have a
> > multi datacenter setup, if latency is low enough.
> >
> > I would like to set this up, so that each datacenter has at least two
> > replicas and each PG has a replication level of 3.
> >
> > In this mail, it is suggested that I should use the following crush
> > map for multi DC:
> >
> >   rule dc {
> >           ruleset 0
> >           type replicated
> >           min_size 1
> >           max_size 10
> >           step take default
> >           step chooseleaf firstn 0 type datacenter
> >           step emit
> >   }
> >
> > This looks suspicious to me, as it will only generate a list of two
> > OSDs (and only one OSD if one DC is down).
> >
> > I think I should use:
> >
> >   rule replicated_ruleset {
> >           ruleset 0
> >           type replicated
> >           min_size 1
> >           max_size 10
> >           step take root
> >           step choose firstn 2 type datacenter
> >           step chooseleaf firstn 2 type host
> >           step emit
> >           step take root
> >           step chooseleaf firstn -4 type host
> >           step emit
> >   }
> >
> > This correctly generates a list with 2 OSDs in one DC, then 2 in the
> > other, and then a list of further OSDs.
> >
> > The problem is that this list contains duplicates (e.g. for 8 OSDs
> > per DC):
> >
> >   [13,11,1,8,13,11,16,4,3,7]
> >   [9,2,13,11,9,15,12,18,3,5]
> >   [3,5,17,10,3,5,7,13,18,10]
> >   [7,6,11,14,7,14,3,16,4,11]
> >   [6,3,15,18,6,3,12,9,16,15]
> >
> > Will this be a problem?
>
> For replicated pools, it probably will cause trouble. For EC pools I
> think it should work fine, but obviously you're losing all kinds of
> redundancy. Nothing in the system will do work to avoid colocating
> them if you use a rule like this. Rather than distributing some of the
> replicas randomly across DCs, you really just want to split them up
> evenly across datacenters (or in some ratio, if one has more space
> than the other). Given CRUSH's current abilities that does require
> building the replication size into the rule, but such is life.
>
> > If crush is executed, will it only consider OSDs which are (up, in),
> > or all OSDs in the map and then filter them from the list afterwards?
>
> CRUSH will consider all OSDs, but if it selects any OSDs which are out
> then it retries until it gets one that is still marked in.
> -Greg
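A rule like this can be sanity-checked offline before it goes anywhere
near live data. A sketch using crushtool, assuming the rule above is
compiled into the map as ruleset 0:

    # grab the compiled crush map from the cluster
    ceph osd getcrushmap -o crushmap
    # show the OSD sets the rule produces for 4 replicas
    crushtool -i crushmap --test --rule 0 --num-rep 4 --show-mappings
    # flag any results with fewer than 4 distinct OSDs; OSDs can be
    # zero-weighted with --weight <id> 0 to simulate a failed DC
    crushtool -i crushmap --test --rule 0 --num-rep 4 --show-bad-mappings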
[ceph-users] Potential OSD deadlock?
We have had two situations where I/O just seems to be indefinitely
blocked on our production cluster today (0.94.3). In the case this
morning, it was just normal I/O traffic, no recovery or backfill. In the
case this evening, we were backfilling to some new OSDs. I would have
loved to have bumped up the debugging to get an idea of what was going
on, but time was exhausted. During the incident this evening I was able
to do some additional troubleshooting, but got real anxious after I/O
had been blocked for 10 minutes and the ops team was getting hot under
the collar.

Here are the important parts of the logs:

  [osd.30]
  2015-09-18 23:05:36.188251 7efed0ef0700  0 log_channel(cluster) log [WRN] :
  slow request 30.662958 seconds old, received at 2015-09-18 23:05:05.525220:
  osd_op(client.3117179.0:18654441 rbd_data.1099d2f67aaea.0f62
  [set-alloc-hint object_size 8388608 write_size 8388608,write 1048576~643072]
  4.5ba1672c ack+ondisk+write+known_if_redirected e55919)
  currently waiting for subops from 32,70,72

  [osd.72]
  2015-09-18 23:05:19.302985 7f3fa19f8700  0 log_channel(cluster) log [WRN] :
  slow request 30.200408 seconds old, received at 2015-09-18 23:04:49.102519:
  osd_op(client.4267090.0:3510311 rbd_data.3f41d41bd65b28.9e2b
  [set-alloc-hint object_size 4194304 write_size 4194304,write 1048576~421888]
  17.40adcada ack+ondisk+write+known_if_redirected e55919)
  currently waiting for subops from 2,30,90

The other OSDs listed (32,70,2,90) did not have any errors in their logs
about blocked I/O. It seems that osd.30 was waiting for osd.72 and vice
versa. I looked at top and iostat on these two hosts, and the OSD
processes and disk I/O were pretty idle.

I know that this isn't a lot to go on. Our cluster is under very heavy
load and we get several blocked I/Os every hour, but they usually clear
up within 15 seconds. We seem to get I/O blocked when the op latency of
the cluster goes above 1 (averaged over all OSDs, as seen by Graphite).

Has anyone seen I/O blocked indefinitely like this? Bouncing osd.72
immediately cleared all the blocked I/O, and it was fine after rejoining
the cluster. Which logs, and at what debug level, would be most
beneficial for troubleshooting a case like this?

I hope this makes sense, it has been a long day.

- Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
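When this happens again, logging can be raised on the implicated OSDs at
runtime without restarting them. A sketch, assuming osd.30 and osd.72
are the stuck pair (debug level 20 is very verbose, so turn it back down
afterwards):

    # raise OSD and messenger logging on the two suspects
    ceph tell osd.30 injectargs '--debug_osd 20 --debug_ms 1'
    ceph tell osd.72 injectargs '--debug_osd 20 --debug_ms 1'
    # on the OSD's own host, dump the ops it is currently sitting on
    ceph daemon osd.72 dump_ops_in_flight
    # restore the defaults once done
    ceph tell osd.30 injectargs '--debug_osd 0/5 --debug_ms 0/5'
    ceph tell osd.72 injectargs '--debug_osd 0/5 --debug_ms 0/5'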
[ceph-users] how to clear unnormal huge objects
Hello,

I have built a Ceph cluster for testing. After I performed some recovery
testing, some OSDs went down because there was no available disk space.
When I checked the OSD data folders, I found many huge objects with the
prefix obj-xvrzfdsafd. I would like to know how those objects were
generated and what they are used for. How can I clear them? It seems
that I cannot delete them directly.

PS: I used block storage (rbd) only.

Also, I found that the pool usage shows even more than 100%. Why?

[attached screenshots of pool usage omitted]

Waiting for your feedback, thanks.

Best Regards!

Raijin Xiang (向毓)
Computing and Storage Dept
Huawei Technologies Co., Ltd.
Mobile: +86 186 2032 2562
Mail: xiang...@huawei.com
Bldg1-B, Cloud Park, Huancheng Road, Bantian Str., Longgang District,
518129 Shenzhen, P. R. China
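No one answers this in the thread, but the usual first step for
"usage looks wrong" questions is to compare raw and per-pool numbers
from the CLI rather than from screenshots. A sketch using standard
commands:

    # raw cluster usage plus per-pool usage; per-pool %USED is computed
    # against that pool's MAX AVAIL, so it can read oddly on a cluster
    # that is nearly full
    ceph df
    # per-pool object counts and I/O totals
    rados df
    # sample object names in the rbd pool to see what the huge objects
    # belong to
    rados -p rbd ls | head -50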
Re: [ceph-users] lttng duplicate registration problem when using librados2 and libradosstriper
Hi Paul,

I hit the same problem here (see last post):

https://groups.google.com/forum/#!topic/bareos-users/mEzJ7IbDxvA

If I ever get to the bottom of it, I will let you know. Sorry I can't be
of any more help.

Nick

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> Of Paul Mansfield
> Sent: 18 September 2015 17:16
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] lttng duplicate registration problem when using
> librados2 and libradosstriper
>
> Hello,
> thanks for your attention.
>
> I have started using the rados striper library, calling the functions
> from a C program.
>
> As soon as I add libradosstriper to the linking process, I get this
> error when the program runs, even though I am not calling any
> functions from the rados striper library (I commented them out):
>
>   LTTng-UST: Error (-17) while registering tracepoint probe. Duplicate
>   registration of tracepoint probes having the same name is not allowed.
>   /bin/sh: line 1: 61001 Aborted (core dumped) ./$test
>
> I had been using lttng in my program but removed it to ensure it
> wasn't causing the problem.
>
> I have tried running the program under gdb, but the calls that
> initialise lttng occur before main() is called, so I cannot add a
> breakpoint to see what is happening.
>
> thanks
> Paul
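For anyone who wants to confirm where the duplicate probes come from,
one diagnostic sketch is to list the tracepoint-related symbols each
shared library exports; exact symbol names and library paths will vary
with the LTTng-UST version and the distribution:

    # look for tracepoint provider symbols in both libraries
    nm -D /usr/lib/librados.so.2 | grep -i tracepoint
    nm -D /usr/lib/libradosstriper.so.1 | grep -i tracepoint
    # if both embed the same provider, that would match the -17
    # (EEXIST) duplicate-registration error at startup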
[ceph-users] How to move OSD form 1TB disk to 2TB disk
I know another way: mark the 1TB osd out and bring the 2TB osd up as
osd.X without data; then RADOS will backfill the data to the 2TB disks.

For now I used rsync to move the data from the 1TB disk to the 2TB disk,
but the new osd coredumps. What is the problem?

ceph version: 0.80.1
osd.X
host1 with 1TB disks
host2 with 2TB disks

On host1:
  take osd.X down
  ceph-osd -i X --flush-journal
  rsync -av /data/osd/osd.X/ root@host2:/data/osd/osd.X/

On host2:
  vim ceph.conf
  ceph-osd -i X --mkjournal
  ceph-osd -i X

Then osd.X coredumps. OSD log:

  -1> 2015-09-19 14:52:22.371149 7f008cd007a0  0 osd.29 416 load_pgs
   0> 2015-09-19 14:52:22.378677 7f008cd007a0 -1 osd/PG.cc: In function
  'static epoch_t PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&,
  ceph::bufferlist*)' thread 7f008cd007a0 time 2015-09-19 14:52:22.377569
  osd/PG.cc: 2559: FAILED assert(r > 0)

  ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
  1: (PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&,
     ceph::buffer::list*)+0x48d) [0x7fa4ad]
  2: (OSD::load_pgs()+0x18f1) [0x63c771]
  3: (OSD::init()+0x22b0) [0x6550e0]
  4: (main()+0x359e) [0x5f931e]
  5: (__libc_start_main()+0xfd) [0x3073c1ed5d]
  6: ceph-osd() [0x5f59c9]

Coredump backtrace:

  (gdb) bt
  #0  0x00307400f5db in raise () from /lib64/libpthread.so.0
  #1  0x009ab7f4 in ?? ()
  #2  <signal handler called>
  #3  0x003073c32635 in raise () from /lib64/libc.so.6
  #4  0x003073c33e15 in abort () from /lib64/libc.so.6
  #5  0x003b4febea7d in __gnu_cxx::__verbose_terminate_handler() ()
      from /usr/lib64/libstdc++.so.6
  #6  0x003b4febcbd6 in ?? () from /usr/lib64/libstdc++.so.6
  #7  0x003b4febcc03 in std::terminate() () from /usr/lib64/libstdc++.so.6
  #8  0x003b4febcd22 in __cxa_throw () from /usr/lib64/libstdc++.so.6
  #9  0x00aec612 in ceph::__ceph_assert_fail(char const*, char const*,
      int, char const*) ()
  #10 0x007fa4ad in PG::peek_map_epoch(ObjectStore*, coll_t,
      hobject_t&, ceph::buffer::list*) ()
  #11 0x0063c771 in OSD::load_pgs() ()
  #12 0x006550e0 in OSD::init() ()
  #13 0x005f931e in main ()
Re: [ceph-users] debian repositories path change?
On 18/09/15 17:28, Sage Weil wrote:
> Make that download.ceph.com .. the packages url was temporary while we
> got the new site ready and will go away shortly! (Also, HTTPS is
> enabled now.)

But still no jessie packages available... :(
Re: [ceph-users] How to move OSD form 1TB disk to 2TB disk
Just use the built-in Ceph recovery to move data to the new disk. By
changing disk sizes, you also change the mapping across the cluster, so
you are going to be moving more data than necessary anyway.

My recommendation: bring the new disk in as a new OSD, then set the old
disk to 'out'. The old OSD will keep participating in the backfills
until it is empty. Once the backfill is done, stop the old OSD and
remove it from the cluster.

- Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Sat, Sep 19, 2015 at 2:30 AM, wsnote wrote:
> I know another way: out 1TB osd, up 2TB osd as osd.X without data, then
> rados will backfill the data to 2TB disks.
> Now I use rsync to mv data form 1TB disk to 2TB disk, but the new osd
> coredump.
> What's the problem?
> [remainder of the original message and backtrace quoted above trimmed]
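A sketch of the replace-and-drain flow described above, assuming the old
disk is osd.X and the new disk has already been brought up as a fresh
OSD:

    # mark the old OSD out; it keeps serving data while it drains
    ceph osd out X
    # watch until backfill finishes and the cluster is active+clean
    ceph -w
    # then stop the daemon (init syntax varies by distro) and remove
    # the old OSD from the cluster for good
    ceph osd crush remove osd.X
    ceph auth del osd.X
    ceph osd rm X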
Re: [ceph-users] How to move OSD form 1TB disk to 2TB disk
On 09/19/2015 10:30 AM, wsnote wrote:
> I know another way: out 1TB osd, up 2TB osd as osd.X without data, then
> rados will backfill the data to 2TB disks.
> Now I use rsync to mv data form 1TB disk to 2TB disk, but the new osd
> coredump.
> What's the problem?

Did you use rsync with the -X option to also transfer xattrs?

Wido

> [remainder of the original message and backtrace quoted above trimmed]

--
Wido den Hollander
42on B.V.
Ceph trainer and consultant
Phone: +31 (0)20 700 9902
Skype: contact42on
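If the copy route is attempted anyway, the rsync invocation needs to
carry the metadata an OSD depends on; missing xattrs are a plausible
cause of the peek_map_epoch assert above. A sketch, with the caveat that
the in-cluster backfill approach is still the safer path:

    # -X preserves extended attributes, -A ACLs, -H hard links;
    # --numeric-ids avoids uid/gid remapping between hosts
    rsync -avXAH --numeric-ids /data/osd/osd.X/ root@host2:/data/osd/osd.X/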
Re: [ceph-users] multi-datacenter crush map
You will want size=4 and min_size=2 if you want to keep I/O going when a
DC fails and still ensure some data integrity. Data checksumming (which
I think is being added) would provide much stronger integrity checking
in a two-copy situation, because you could tell which of the two copies
is the good one instead of needing a third to break the tie.

However, you have yet another problem on your hands: the way monitors
work makes this tricky. If you have one monitor in one DC and two in the
other, and the two-monitor DC burns down, the surviving DC stops working
too, because one monitor out of three is not a majority. Putting two
monitors in each DC doesn't help either: if either DC goes down, the two
survivors cannot form a quorum (you need three out of four). It has been
suggested that putting the odd monitor in the cloud (or some other
location off-site from both DCs) could be an option, but latency could
cause problems. The cloud monitor would complete the quorum with
whichever DC survives.

Also remember that there is no data-locality awareness in Ceph at the
moment. This can mean that the primary for a PG is in the other DC, so
your client has to contact the primary in the other DC; that OSD then
contacts one OSD in its own DC and two in the other, waits for
confirmation that the writes are acknowledged, and then acks the write
to the client. A write will therefore cost between 2 x (LAN latency +
WAN latency) and 2 x (LAN latency + 2 x WAN latency), and reads will
cost between 2 x LAN latency and 2 x WAN latency. Then there is write
amplification, so make sure you have a lot more WAN bandwidth than you
think you need.

I think the large majority of us are eagerly waiting for the RBD
replication feature, or some sort of lag-behind OSD, for situations like
this.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Sat, Sep 19, 2015 at 12:54 PM, Wouter De Borger wrote:
> Ok, so if I understand correctly, for replication level 3 or 4 I would
> have to use the rule
> [crush rule and failure scenarios quoted above trimmed]
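For completeness, a sketch of the pool settings recommended above,
assuming a pool named rbd:

    # four replicas, and keep accepting I/O as long as two are alive
    ceph osd pool set rbd size 4
    ceph osd pool set rbd min_size 2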