Re: [ceph-users] [rgw] civetweb behind haproxy doesn't work with absolute URI
I think if you haven't defined it in the Ceph config, it's disabled?

Matt

On Sat, Mar 31, 2018 at 4:59 PM, Rudenko Aleksandr wrote:
> Hi, Sean.
>
> Thank you for the reply.
>
> What does it mean: "We had to disable "rgw dns name" in the end"?
>
> "rgw_dns_name": "", has no effect for me.
>
> On 29 Mar 2018, at 11:23, Sean Purdy wrote:
>> We had something similar recently. We had to disable "rgw dns name" in the end.

--
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel. 734-821-5101
fax. 734-769-8938
cel. 734-216-5309
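For what it's worth, you can confirm what the running radosgw actually picked up via its admin socket (a sketch; the socket path and daemon name below are assumptions, match them to your deployment):

    # ask the live radosgw which value it is really using
    ceph --admin-daemon /var/run/ceph/ceph-client.rgw.gateway1.asok config get rgw_dns_name

"Disabling" it then just means having no such line in ceph.conf, e.g.:

    [client.rgw.gateway1]
    rgw frontends = civetweb port=7480
    # (no "rgw dns name = ..." entry here)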
Re: [ceph-users] [rgw] civetweb behind haproxy doesn't work with absolute URI
Hi, Sean.

Thank you for the reply.

What does it mean: "We had to disable "rgw dns name" in the end"?

"rgw_dns_name": "", has no effect for me.

On 29 Mar 2018, at 11:23, Sean Purdy wrote:
> We had something similar recently. We had to disable "rgw dns name" in the end.
Re: [ceph-users] 1 mon unable to join the quorum
The cluster was initially deployed using ceph-ansible, on the Infernalis version. For some unknown reason controller02 fell out of the quorum and we were unable to add it back.

We updated the cluster to the Jewel version using the rolling-update playbook from ceph-ansible. controller02 was still not in the quorum.

We then deleted the mon completely and added it again using the manual method from http://docs.ceph.com/docs/jewel/rados/operations/add-or-rm-mons/ (with id controller02). The logs provided are from when controller02 was added with the manual method. But controller02 still won't join the cluster.

Hope this helps you understand.

On 31/03/2018 02:12, Brad Hubbard wrote:
> I'm not sure I completely understand your "test". What exactly are you trying to achieve and what documentation are you following?
>
> On Fri, Mar 30, 2018 at 10:49 PM, Julien Lavesque wrote:
>> Brad,
>>
>> Thanks for your answer.
>>
>> On 30/03/2018 02:09, Brad Hubbard wrote:
>>> 2018-03-19 11:03:50.819493 7f842ed47640 0 mon.controller02 does not exist in monmap, will attempt to join an existing cluster
>>> 2018-03-19 11:03:50.820323 7f842ed47640 0 starting mon.controller02 rank -1 at 172.18.8.6:6789/0 mon_data /var/lib/ceph/mon/ceph-controller02 fsid f37f31b1-92c5-47c8-9834-1757a677d020
>>>
>>> We are called 'mon.controller02' and we can not find our name in the local copy of the monmap.
>>>
>>> 2018-03-19 11:03:52.346318 7f842735d700 10 mon.controller02@-1(probing) e68 ready to join, but i'm not in the monmap or my addr is blank, trying to join
>>>
>>> Our name is not in the copy of the monmap we got from peer controller01 either.
>>
>> During our test we deleted the controller02 monitor completely and added it again. The log you have is from when controller02 was added (so it wasn't in the monmap before).
>>
>>> $ cat ../controller02-mon_status.log
>>> [root@controller02 ~]# ceph --admin-daemon /var/run/ceph/ceph-mon.controller02.asok mon_status
>>> {
>>>     "name": "controller02",
>>>     "rank": 1,
>>>     "state": "electing",
>>>     "election_epoch": 32749,
>>>     "quorum": [],
>>>     "outside_quorum": [],
>>>     "extra_probe_peers": [],
>>>     "sync_provider": [],
>>>     "monmap": {
>>>         "epoch": 71,
>>>         "fsid": "f37f31b1-92c5-47c8-9834-1757a677d020",
>>>         "modified": "2018-03-29 10:48:06.371157",
>>>         "created": "0.00",
>>>         "mons": [
>>>             { "rank": 0, "name": "controller01", "addr": "172.18.8.5:6789\/0" },
>>>             { "rank": 1, "name": "controller02", "addr": "172.18.8.6:6789\/0" },
>>>             { "rank": 2, "name": "controller03", "addr": "172.18.8.7:6789\/0" }
>>>         ]
>>>     }
>>> }
>>>
>>> In the monmaps we are called 'controller02', not 'mon.controller02'. These names need to be identical.
>>
>> The cluster has been deployed using ceph-ansible with the servers' hostnames.
>> All monitors are called mon.controller0x in the monmap, and all 3 monitors have the same configuration.
>>
>> We get the same behavior creating a monmap from scratch:
>>
>> [root@controller03 ~]# monmaptool --create --add controller01 172.18.8.5:6789 --add controller02 172.18.8.6:6789 --add controller03 172.18.8.7:6789 --fsid f37f31b1-92c5-47c8-9834-1757a677d020 --clobber test-monmap
>> monmaptool: monmap file test-monmap
>> monmaptool: set fsid to f37f31b1-92c5-47c8-9834-1757a677d020
>> monmaptool: writing epoch 0 to test-monmap (3 monitors)
>>
>> [root@controller03 ~]# monmaptool --print test-monmap
>> monmaptool: monmap file test-monmap
>> epoch 0
>> fsid f37f31b1-92c5-47c8-9834-1757a677d020
>> last_changed 2018-03-30 14:42:18.809719
>> created 2018-03-30 14:42:18.809719
>> 0: 172.18.8.5:6789/0 mon.controller01
>> 1: 172.18.8.6:6789/0 mon.controller02
>> 2: 172.18.8.7:6789/0 mon.controller03
>>
>>> On Thu, Mar 29, 2018 at 7:23 PM, Julien Lavesque wrote:
>>>> Hi Brad,
>>>>
>>>> The results have been uploaded on the tracker (https://tracker.ceph.com/issues/23403)
>>>>
>>>> Julien
>>>>
>>>> On 29/03/2018 07:54, Brad Hubbard wrote:
>>>>> Can you update with the result of the following commands from all of the MONs?
>>>>>
>>>>> # ceph --admin-daemon /var/run/ceph/ceph-mon.[whatever].asok mon_status
>>>>> # ceph --admin-daemon /var/run/ceph/ceph-mon.[whatever].asok quorum_status
>>>>>
>>>>> On Thu, Mar 29, 2018 at 3:11 PM, Gauvain Pocentek wrote:
>>>>>> Hello Ceph users,
>>>>>>
>>>>>> We are having a problem on a ceph cluster running Jewel: one of the mons left the quorum, and we have not been able to make it join again. The two other monitors are running just fine, but obviously we need this third one.
>>>>>>
>>>>>> The problem happened before Jewel, when the cluster was running Infernalis. We upgraded hoping that it would solve the problem, but no luck.
>>>>>>
>>>>>> We've validated several things: no network problem, no clock skew, same OS and ceph version everywhere. We've also removed the mon completely, and recreated it. We also
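For anyone hitting the same thing: the manual remove/re-add cycle being described boils down to roughly the following (a sketch after http://docs.ceph.com/docs/jewel/rados/operations/add-or-rm-mons/; the temp paths and the systemd unit name are assumptions for a stock Jewel install, run on the affected node):

    # drop the broken mon from the monmap
    ceph mon remove controller02
    # fetch the mon keyring and the current monmap from the surviving quorum
    ceph auth get mon. -o /tmp/mon.keyring
    ceph mon getmap -o /tmp/monmap
    # rebuild the mon data dir; the -i id must match the name in the monmap
    ceph-mon -i controller02 --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
    systemctl start ceph-mon@controller02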
[ceph-users] [Hammer][Simple Msg] Cluster cannot work when Accepter::entry quits
Hi cephers,

Recently we hit a big problem in our production ceph cluster. It had been running very well for one and a half years. The RBD client network and the ceph public network are different, communicating through a router.

Our ceph version is 0.94.5. Our IO transport uses SimpleMessenger.

Yesterday some of our VMs (using qemu librbd) could not send IO to the ceph cluster. Ceph status was healthy, with no osd up/down and no pg inactive or down. When we exported an rbd image through rbd export, we found the rbd client could not connect to one osd, say osd.34. We found that osd.34 was up and running, but its log was full of errors like this, repeated over and over:

accepter no incoming connection? sd = -1, error 24, too many open files

We found that our max open files is set to 20, but the filestore fd cache size is too big (50), so I think we have some wrong fd configuration. But when Accepter::entry() hits errors like this, it would be better to assert (crash) the OSD process, so that new rbd clients can connect to the ceph cluster, and so that old rbd clients can also reconnect to the cluster after a network problem.

I do not know whether this has been fixed in later versions.

Best regards,
brandy
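If it is just a misconfiguration, the two knobs involved are set in ceph.conf along these lines (a sketch; the values below are illustrative, not a recommendation):

    [global]
    # the init script raises the daemon's RLIMIT_NOFILE to this at start
    max open files = 131072

    [osd]
    # keep the filestore fd cache well below the open-files limit
    filestore fd cache size = 4096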
Re: [ceph-users] Bluestore caching, flawed by design?
On 03/31/2018 03:24 PM, Mark Nelson wrote:
>> 1. Completely new users may think that bluestore defaults are fine and
>> waste all that RAM in their machines.
>
> What does "wasting" RAM mean in the context of a node running ceph? Are
> you upset that other applications can't come in and evict bluestore
> onode, OMAP, or object data from cache?

I think this is what he meant by #1: unless I am mistaken, with bluestore you allocate some cache per OSD, and the OSD won't use more, even if there is free memory lying around. Thus, a "waste" of RAM.

>> 2. Having a per OSD cache is inefficient compared to a common cache like
>> pagecache, since an OSD that is busier than others would benefit from a
>> shared cache more.
>
> It's only "inefficient" if you assume that using the pagecache, and more
> generally, kernel syscalls, is free. Yes the pagecache is convenient
> and yes it gives you a lot of flexibility, but you pay for that
> flexibility if you are trying to do anything fast.

And by #2: "inefficient" because each OSD has a fixed cache size, unrelated to its real usage.

To me, "flawed" is a bit extreme; bluestore is a good piece of work, even if there is still room for improvement.
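For reference, the per-OSD cache being discussed is controlled through options like these in Luminous (a sketch; the sizes are illustrative only):

    [osd]
    # bluestore manages its own cache, per OSD, within these budgets
    bluestore cache size hdd = 1073741824   # 1 GiB for OSDs backed by HDD
    bluestore cache size ssd = 3221225472   # 3 GiB for OSDs backed by SSD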
Re: [ceph-users] Bluestore caching, flawed by design?
On 03/29/2018 08:59 PM, Christian Balzer wrote:
> Hello,
>
> my crappy test cluster was rendered inoperational by an IP renumbering that wasn't planned and forced on me during a DC move, so I decided to start from scratch and explore the fascinating world of Luminous/bluestore and all the assorted bugs. ^_-
>
> (yes I could have recovered the cluster by setting up a local VLAN with the old IPs, extracting the monmap, etc, but I consider the need for a running monitor a flaw, since all the relevant data was present in the leveldb).
>
> Anyways, while I've read about the bluestore OSD cache in passing here, the back of my brain was clearly still hoping that it would use pagecache/SLAB like other filesystems. Which, after my first round of playing with things, clearly isn't the case.
>
> This strikes me as a design flaw and regression because:

Bluestore's cache is not broken by design. I'm not totally convinced that some of the trade-offs we've made with bluestore's cache implementation are optimal, but I think you should consider cooling your rhetoric down.

> 1. Completely new users may think that bluestore defaults are fine and
> waste all that RAM in their machines.

What does "wasting" RAM mean in the context of a node running ceph? Are you upset that other applications can't come in and evict bluestore onode, OMAP, or object data from cache?

> 2. Having a per OSD cache is inefficient compared to a common cache like
> pagecache, since an OSD that is busier than others would benefit from a
> shared cache more.

It's only "inefficient" if you assume that using the pagecache, and more generally, kernel syscalls, is free. Yes the pagecache is convenient and yes it gives you a lot of flexibility, but you pay for that flexibility if you are trying to do anything fast.

For instance, take the new KPTI patches in the kernel for meltdown. Look at how badly they can hurt MyISAM database performance in MariaDB:

https://mariadb.org/myisam-table-scan-performance-kpti/

MyISAM does not have a dedicated row cache and instead caches row data in the page cache, as you suggest bluestore should do for its data. Look at how badly KPTI hurts its performance (~40%). Now look at ARIA with a dedicated 128MB cache (less than 1%). KPTI is a really good example of how much this stuff can hurt you, but syscalls, context switches, and page faults were already expensive even before meltdown. Not to mention that right now bluestore keeps onodes and buffers stored in its cache in an unencoded form. Here are a couple of other articles worth looking at:

https://eng.uber.com/mysql-migration/
https://www.scylladb.com/2018/01/07/cost-of-avoiding-a-meltdown/
http://www.brendangregg.com/blog/2018-02-09/kpti-kaiser-meltdown-performance.html

> 3. A uniform OSD cache size of course will be a nightmare when having
> non-uniform HW, either with RAM or number of OSDs.

Non-uniform hardware is a big reason that pinning dedicated memory to specific cores/sockets is really nice vs relying on potentially remote memory page cache reads. A long time ago I was responsible for validating the performance of CXFS on an SGI Altix UV distributed shared-memory supercomputer. As it turns out, we could achieve about 22GB/s writes with XFS (a huge number at the time), but CXFS was 5-10x slower. A big part of that turned out to be the kernel distributing page cache across the Numalink5 interconnects to remote memory. The problem can potentially happen on any NUMA system to varying degrees.
Personally I have two primary issues with bluestore's memory configuration right now:

1) It's too complicated for users to figure out where to assign memory and in what ratios. I'm attempting to improve this by making bluestore's cache autotuning, so the user just gives it a number and bluestore will try to work out where it should assign memory.

2) In the case where a subset of OSDs are really hot (maybe RGW bucket accesses) you might want some OSDs to get more memory than others. I think we can tackle this better if we migrate to a one-osd-per-node sharded architecture (likely based on seastar), though we'll still need to be very aware of remote memory. Given that this is fairly difficult to do well, we're probably going to be better off just dedicating a static pool to each shard initially.

Mark
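That "just gives it a number" model would collapse the tuning to a single per-OSD memory budget, along these lines (a sketch; osd_memory_target is the option name this work eventually shipped under in later releases, and the value is illustrative):

    [osd]
    # overall memory budget per OSD; bluestore sizes its caches within it
    osd memory target = 4294967296   # ~4 GiB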