Re: [ceph-users] [rgw] civetweb behind haproxy doesn't work with absolute URI

2018-03-31 Thread Matt Benjamin
I think if you haven't defined it in the Ceph config, it's disabled?
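i.e. something like this in ceph.conf (the section name here is only an example, adjust to your RGW instance):

[client.rgw.gateway1]
rgw frontends = civetweb port=7480
# no "rgw dns name" line at all; as far as I recall, leaving it
# undefined is not the same as setting it to an empty string at runtime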

Matt

On Sat, Mar 31, 2018 at 4:59 PM, Rudenko Aleksandr  wrote:
> Hi, Sean.
>
> Thank you for the reply.
>
> What do you mean by: "We had to disable 'rgw dns name' in the end"?
>
> "rgw_dns_name": “”, has no effect for me.
>
>
>
> On 29 Mar 2018, at 11:23, Sean Purdy  wrote:
>
> We had something similar recently.  We had to disable "rgw dns name" in the end
>
>
>



-- 

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309


Re: [ceph-users] [rgw] civetweb behind haproxy doesn't work with absolute URI

2018-03-31 Thread Rudenko Aleksandr
Hi, Sean.

Thank you for the reply.

What do you mean by: "We had to disable 'rgw dns name' in the end"?

"rgw_dns_name": “”, has no effect for me.



On 29 Mar 2018, at 11:23, Sean Purdy wrote:

We had something similar recently.  We had to disable "rgw dns name" in the end



Re: [ceph-users] 1 mon unable to join the quorum

2018-03-31 Thread Julien Lavesque
The cluster was initially deployed using ceph-ansible on the Infernalis release.
For some unknown reason controller02 fell out of the quorum and we were unable
to bring it back in.


We then updated the cluster to Jewel using the rolling-update playbook from
ceph-ansible.


The controller02 was still not in the quorum.

We tried deleting the mon completely and adding it again with id controller02,
following the manual method from
http://docs.ceph.com/docs/jewel/rados/operations/add-or-rm-mons/.
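Roughly the sequence from that page (paths and service names here are only illustrative):

# remove the stray mon, then recreate it from the current cluster state
ceph mon remove controller02
rm -rf /var/lib/ceph/mon/ceph-controller02
ceph mon getmap -o /tmp/monmap
ceph auth get mon. -o /tmp/keyring
ceph-mon -i controller02 --mkfs --monmap /tmp/monmap --keyring /tmp/keyring
chown -R ceph:ceph /var/lib/ceph/mon/ceph-controller02
systemctl start ceph-mon@controller02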


The logs provided are from when controller02 was added using this manual
method.


But controller02 still won't join the cluster.

Hope this helps you understand the situation.


On 31/03/2018 02:12, Brad Hubbard wrote:

I'm not sure I completely understand your "test". What exactly are you
trying to achieve and what documentation are you following?

On Fri, Mar 30, 2018 at 10:49 PM, Julien Lavesque wrote:

Brad,

Thanks for your answer

On 30/03/2018 02:09, Brad Hubbard wrote:


2018-03-19 11:03:50.819493 7f842ed47640  0 mon.controller02 does not
exist in monmap, will attempt to join an existing cluster
2018-03-19 11:03:50.820323 7f842ed47640  0 starting mon.controller02
rank -1 at 172.18.8.6:6789/0 mon_data
/var/lib/ceph/mon/ceph-controller02 fsid
f37f31b1-92c5-47c8-9834-1757a677d020

We are called 'mon.controller02' and we can not find our name in the
local copy of the monmap.

2018-03-19 11:03:52.346318 7f842735d700 10
mon.controller02@-1(probing) e68  ready to join, but i'm not in the
monmap or my addr is blank, trying to join

Our name is not in the copy of the monmap we got from peer controller01 either.



During our test we deleted the controller02 monitor completely and added it
again.

The log you have is from when controller02 was added (so it wasn't in the
monmap before).




$ cat ../controller02-mon_status.log
[root@controller02 ~]# ceph --admin-daemon /var/run/ceph/ceph-mon.controller02.asok mon_status
{
    "name": "controller02",
    "rank": 1,
    "state": "electing",
    "election_epoch": 32749,
    "quorum": [],
    "outside_quorum": [],
    "extra_probe_peers": [],
    "sync_provider": [],
    "monmap": {
        "epoch": 71,
        "fsid": "f37f31b1-92c5-47c8-9834-1757a677d020",
        "modified": "2018-03-29 10:48:06.371157",
        "created": "0.00",
        "mons": [
            {
                "rank": 0,
                "name": "controller01",
                "addr": "172.18.8.5:6789\/0"
            },
            {
                "rank": 1,
                "name": "controller02",
                "addr": "172.18.8.6:6789\/0"
            },
            {
                "rank": 2,
                "name": "controller03",
                "addr": "172.18.8.7:6789\/0"
            }
        ]
    }
}

In the monmaps we are called 'controller02', not 'mon.controller02'.
These names need to be identical.



The cluster was deployed using ceph-ansible with the servers' hostnames.

All monitors are called mon.controller0x in the monmap and all three
monitors have the same configuration.

We see the same behavior when creating a monmap from scratch:

[root@controller03 ~]# monmaptool --create \
    --add controller01 172.18.8.5:6789 \
    --add controller02 172.18.8.6:6789 \
    --add controller03 172.18.8.7:6789 \
    --fsid f37f31b1-92c5-47c8-9834-1757a677d020 \
    --clobber test-monmap
monmaptool: monmap file test-monmap
monmaptool: set fsid to f37f31b1-92c5-47c8-9834-1757a677d020
monmaptool: writing epoch 0 to test-monmap (3 monitors)

[root@controller03 ~]# monmaptool --print test-monmap
monmaptool: monmap file test-monmap
epoch 0
fsid f37f31b1-92c5-47c8-9834-1757a677d020
last_changed 2018-03-30 14:42:18.809719
created 2018-03-30 14:42:18.809719
0: 172.18.8.5:6789/0 mon.controller01
1: 172.18.8.6:6789/0 mon.controller02
2: 172.18.8.7:6789/0 mon.controller03
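(For what it's worth, my understanding is that injecting such a map into the stray mon would look roughly like this; the service name is illustrative:)

systemctl stop ceph-mon@controller02
ceph-mon -i controller02 --inject-monmap test-monmap
systemctl start ceph-mon@controller02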




On Thu, Mar 29, 2018 at 7:23 PM, Julien Lavesque wrote:


Hi Brad,

The results have been uploaded on the tracker
(https://tracker.ceph.com/issues/23403)

Julien


On 29/03/2018 07:54, Brad Hubbard wrote:



Can you update with the result of the following commands from all of the MONs?

# ceph --admin-daemon /var/run/ceph/ceph-mon.[whatever].asok mon_status

# ceph --admin-daemon /var/run/ceph/ceph-mon.[whatever].asok quorum_status

On Thu, Mar 29, 2018 at 3:11 PM, Gauvain Pocentek wrote:



Hello Ceph users,

We are having a problem on a ceph cluster running Jewel: one of the mons
left the quorum, and we have not been able to make it join again. The two
other monitors are running just fine, but obviously we need this third one.

The problem happened before Jewel, when the cluster was running
Infernalis. We upgraded hoping that it would solve the problem, but no luck.

We've validated several things: no network problem, no clock skew, same OS
and ceph version everywhere. We've also removed the mon completely, and
recreated it. We also 

[ceph-users] [Hammer][Simple Msg] Cluster can not work when Accepter::entry quits

2018-03-31 Thread yu2xiangyang
Hi cephers,
Recently we hit a big problem in our production ceph cluster. It had been
running very well for one and a half years.
The RBD client network and the ceph public network are different,
communicating through a router.
Our ceph version is 0.94.5. Our IO transport uses the Simple Messenger.
Yesterday some of our VMs (using qemu librbd) could not send IO to the
ceph cluster.
Ceph status is healthy, with no osd going up/down and no pg inactive or down.
When we export an rbd image through rbd export, we find the rbd client
cannot connect to one osd, say osd.34.
We find that osd.34 is up and running, but in its log we see errors like
the following:
accepter no incoming connection?  sd = -1, errno 24, too many open files.
(the same line repeats many times)
We find that our max open files limit is set to 20, but the filestore fd
cache size is too big, around 50.
I think we have some wrong fd configuration. But when Accepter::entry()
hits errors like this, it would be better to assert (crash) the osd
process, so that new rbd clients can connect to the ceph cluster, and so
that old rbd clients can reconnect after a network problem.
I do not know whether this has already been fixed in a later version.
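For reference, what I mean by checking and raising the open files limit is roughly the following (the pid and value are only examples):

# limit the running osd.34 process actually has (pid is an example)
grep 'open files' /proc/12345/limits

# ceph.conf; picked up by the init scripts when the OSD is restarted
[global]
max open files = 131072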
Best regards,
brandy


Re: [ceph-users] Bluestore caching, flawed by design?

2018-03-31 Thread Jack
On 03/31/2018 03:24 PM, Mark Nelson wrote:
>> 1. Completely new users may think that bluestore defaults are fine and
>> waste all that RAM in their machines.
> 
> What does "wasting" RAM mean in the context of a node running ceph? Are
> you upset that other applications can't come in and evict bluestore
> onode, OMAP, or object data from cache?

I think he was referring to your #1.
Unless I am mistaken, with bluestore you allocate a fixed amount of cache
per OSD, and the OSD won't use more, even if there is free memory lying around.
Thus, a "waste" of RAM.


>> 2. Having a per OSD cache is inefficient compared to a common cache like
>> pagecache, since an OSD that is busier than others would benefit from a
>> shared cache more.
> 
> It's only "inefficient" if you assume that using the pagecache, and more
> generally, kernel syscalls, is free.  Yes the pagecache is convenient
> and yes it gives you a lot of flexibility, but you pay for that
> flexibility if you are trying to do anything fast.

I think he was referring to your #2.
"Inefficient" because each OSD has a fixed cache size, unrelated to its
real usage.
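As far as I know, the only workaround today is a static per-OSD override, for example:

[osd.12]
# give this (busier) OSD a bigger cache, but it is still a fixed size
bluestore cache size = 4294967296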


To me, "flawed" is a bit extreme, bluestore is a good piece of work,
even if there is still place for improvements;



Re: [ceph-users] Bluestore caching, flawed by design?

2018-03-31 Thread Mark Nelson

On 03/29/2018 08:59 PM, Christian Balzer wrote:


Hello,

my crappy test cluster was rendered inoperable by an IP renumbering
that wasn't planned and forced on me during a DC move, so I decided to
start from scratch and explore the fascinating world of Luminous/bluestore
and all the assorted bugs. ^_-
(yes, I could have recovered the cluster by setting up a local VLAN with
the old IPs, extracting the monmap, etc., but I consider the need for a
running monitor a flaw, since all the relevant data was present in the
leveldb).

Anyways, while I've read about bluestore OSD cache in passing here, the
back of my brain was clearly still hoping that it would use pagecache/SLAB
like other filesystems.
Which after my first round of playing with things clearly isn't the case.

This strikes me as a design flaw and regression because:


Bluestore's cache is not broken by design.

I'm not totally convinced that some of the trade-offs we've made with 
bluestore's cache implementation are optimal, but I think you should 
consider cooling your rhetoric down.



1. Completely new users may think that bluestore defaults are fine and
waste all that RAM in their machines.


What does "wasting" RAM mean in the context of a node running ceph? Are 
you upset that other applications can't come in and evict bluestore 
onode, OMAP, or object data from cache?



2. Having a per OSD cache is inefficient compared to a common cache like
pagecache, since an OSD that is busier than others would benefit from a
shared cache more.


It's only "inefficient" if you assume that using the pagecache, and more 
generally, kernel syscalls, is free.  Yes the pagecache is convenient 
and yes it gives you a lot of flexibility, but you pay for that 
flexibility if you are trying to do anything fast.


For instance, take the new KPTI patches in the kernel for meltdown. Look 
at how badly it can hurt MyISAM database performance in MariaDB:


https://mariadb.org/myisam-table-scan-performance-kpti/

MyISAM does not have a dedicated row cache and instead caches row data
in the page cache, as you suggest Bluestore should do for its data.
Look at how badly KPTI hurts performance (~40%). Now look at Aria with a
dedicated 128MB cache (less than 1%).  KPTI is a really good example of
how much this stuff can hurt you, but syscalls, context switches, and
page faults were already expensive even before meltdown.  Not to mention
that right now bluestore keeps onodes and buffers stored in its cache
in an unencoded form.


Here's a couple of other articles worth looking at:

https://eng.uber.com/mysql-migration/
https://www.scylladb.com/2018/01/07/cost-of-avoiding-a-meltdown/
http://www.brendangregg.com/blog/2018-02-09/kpti-kaiser-meltdown-performance.html


3. A uniform OSD cache size of course will be a nightmare when having
non-uniform HW, either with RAM or number of OSDs.


Non-Uniform hardware is a big reason that pinning dedicated memory to 
specific cores/sockets is really nice vs relying on potentially remote 
memory page cache reads.  A long time ago I was responsible for 
validating the performance of CXFS on an SGI Altix UV distributed 
shared-memory supercomputer.  As it turns out, we could achieve about 
22GB/s writes with XFS (a huge number at the time), but CXFS was 5-10x 
slower.  A big part of that turned out to be the kernel distributing 
page cache across the Numalink5 interconnects to remote memory.  The 
problem can potentially happen on any NUMA system to varying degrees.


Personally I have two primary issues with bluestore's memory 
configuration right now:


1) It's too complicated for users to figure out where to assign memory
and in what ratios.  I'm attempting to improve this by making
bluestore's cache autotune itself, so the user just gives it a single
number and bluestore tries to work out where it should assign memory.
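As a rough sketch of the intended interface (the option name and value below are illustrative of the idea, not something to rely on in current releases):

[osd]
# single per-OSD memory budget; bluestore decides how to split it
# between onode, OMAP, and data caches
osd memory target = 4294967296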


2) In the case where a subset of OSDs are really hot (maybe RGW bucket 
accesses) you might want some OSDs to get more memory than others.  I 
think we can tackle this better if we migrate to a one-osd-per-node 
sharded architecture (likely based on seastar), though we'll still need 
to be very aware of remote memory.  Given that this is fairly difficult 
to do well, we're probably going to be better off just dedicating a 
static pool to each shard initially.


Mark