Re: [ceph-users] cephfs slow delete

2016-10-14 Thread Gregory Farnum
On Fri, Oct 14, 2016 at 7:45 PM,   wrote:
> I was doing parallel deletes until the point when there are >1M objects in
> the stray directory. Then the delete fails with a ‘no space left’ error. If one
> deep-scrubs the PGs containing the corresponding metadata, they turn out to be
> inconsistent. In the worst case one ends up with virtually empty folders that
> report a size of 16EB. Those are impossible to delete as they are ‘non empty’.

Yeah, as far as I can tell these are unrelated. You just got unlucky. :)
-Greg

>
>
>
> -Mykola
>
>
>
> From: Gregory Farnum
> Sent: Saturday, 15 October 2016 05:02
> To: Mykola Dvornik
> Cc: Heller, Chris; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] cephfs slow delete
>
>
>
> On Fri, Oct 14, 2016 at 6:26 PM,   wrote:
>
>> If you are running 10.2.3 on your cluster, then I would strongly recommend
>
>> to NOT delete files in parallel as you might hit
>
>> http://tracker.ceph.com/issues/17177
>
>
>
> I don't think these have anything to do with each other. What gave you
>
> the idea simultaneous deletes could invoke that issue?
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs slow delete

2016-10-14 Thread mykola.dvornik
I was doing parallel deletes until the point when there are >1M objects in the 
stray directory. Then the delete fails with a ‘no space left’ error. If one 
deep-scrubs the PGs containing the corresponding metadata, they turn out to be 
inconsistent. In the worst case one ends up with virtually empty folders that 
report a size of 16EB. Those are impossible to delete as they are ‘non empty’. 

-Mykola

From: Gregory Farnum
Sent: Saturday, 15 October 2016 05:02
To: Mykola Dvornik
Cc: Heller, Chris; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] cephfs slow delete

On Fri, Oct 14, 2016 at 6:26 PM,   wrote:
> If you are running 10.2.3 on your cluster, then I would strongly recommend
> to NOT delete files in parallel as you might hit
> http://tracker.ceph.com/issues/17177

I don't think these have anything to do with each other. What gave you
the idea simultaneous deletes could invoke that issue?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs slow delete

2016-10-14 Thread Gregory Farnum
On Fri, Oct 14, 2016 at 6:26 PM,   wrote:
> If you are running 10.2.3 on your cluster, then I would strongly recommend
> to NOT delete files in parallel as you might hit
> http://tracker.ceph.com/issues/17177

I don't think these have anything to do with each other. What gave you
the idea simultaneous deletes could invoke that issue?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs slow delete

2016-10-14 Thread mykola.dvornik
If you are running 10.2.3 on your cluster, then I would strongly recommend to 
NOT delete files in parallel as you might hit 
http://tracker.ceph.com/issues/17177

-Mykola

From: Heller, Chris
Sent: Saturday, 15 October 2016 03:36
To: Gregory Farnum
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] cephfs slow delete

Just a thought, but since a directory tree is a first-class item in cephfs, 
could the wire protocol be extended with a “recursive delete” operation, 
specifically for cases like this?

On 10/14/16, 4:16 PM, "Gregory Farnum"  wrote:

On Fri, Oct 14, 2016 at 1:11 PM, Heller, Chris  wrote:
> Ok. Since I’m running through the Hadoop/ceph api, there is no syscall
> boundary so there is a simple place to improve the throughput here. Good to
> know, I’ll work on a patch…

Ah yeah, if you're in whatever they call the recursive tree delete
function you can unroll that loop a whole bunch. I forget where the
boundary is so you may need to go play with the JNI code; not sure.
-Greg


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs slow delete

2016-10-14 Thread Heller, Chris
Just a thought, but since a directory tree is a first-class item in cephfs, 
could the wire protocol be extended with a “recursive delete” operation, 
specifically for cases like this?

On 10/14/16, 4:16 PM, "Gregory Farnum"  wrote:

On Fri, Oct 14, 2016 at 1:11 PM, Heller, Chris  wrote:
> Ok. Since I’m running through the Hadoop/ceph api, there is no syscall
> boundary so there is a simple place to improve the throughput here. Good to
> know, I’ll work on a patch…

Ah yeah, if you're in whatever they call the recursive tree delete
function you can unroll that loop a whole bunch. I forget where the
boundary is so you may need to go play with the JNI code; not sure.
-Greg


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Missing arm64 Ubuntu packages for 10.2.3

2016-10-14 Thread Stillwell, Bryan J
On 10/14/16, 2:29 PM, "Alfredo Deza"  wrote:

>On Thu, Oct 13, 2016 at 5:19 PM, Stillwell, Bryan J
> wrote:
>> On 10/13/16, 2:32 PM, "Alfredo Deza"  wrote:
>>
>>>On Thu, Oct 13, 2016 at 11:33 AM, Stillwell, Bryan J
>>> wrote:
 I have a basement cluster that is partially built with Odroid-C2 boards
 and when I attempted to upgrade to the 10.2.3 release I noticed that this
 release doesn't have an arm64 build.  Are there any plans on continuing
 to make arm64 builds?
>>>
>>>We have a couple of machines for building ceph releases on ARM64 but
>>>unfortunately they sometimes have issues and since Arm64 is
>>>considered a "nice to have" at the moment we usually skip them if
>>>anything comes up.
>>>
>>>So it is an on-and-off kind of situation (I don't recall what happened
>>>for 10.2.3)
>>>
>>>But since you've asked, I can try to get them built and see if we can
>>>get 10.2.3 out.
>>
>> Sounds good, thanks Alfredo!
>
>10.2.3 arm64 for xenial (and centos7) is out. We only have xenial
>available for arm64, hopefully that will work for you.

Thanks Alfredo, but I'm only seeing xenial arm64 dbg packages here:

http://download.ceph.com/debian-jewel/pool/main/c/ceph/


There's also a report on IRC that the Packages file no longer contains the
10.2.3 amd64 packages for xenial.

Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Missing arm64 Ubuntu packages for 10.2.3

2016-10-14 Thread Alfredo Deza
On Thu, Oct 13, 2016 at 5:19 PM, Stillwell, Bryan J
 wrote:
> On 10/13/16, 2:32 PM, "Alfredo Deza"  wrote:
>
>>On Thu, Oct 13, 2016 at 11:33 AM, Stillwell, Bryan J
>> wrote:
>>> I have a basement cluster that is partially built with Odroid-C2 boards
>>> and when I attempted to upgrade to the 10.2.3 release I noticed that this
>>> release doesn't have an arm64 build.  Are there any plans on continuing
>>> to make arm64 builds?
>>
>>We have a couple of machines for building ceph releases on ARM64 but
>>unfortunately they sometimes have issues and since Arm64 is
>>considered a "nice to have" at the moment we usually skip them if
>>anything comes up.
>>
>>So it is an on-and-off kind of situation (I don't recall what happened
>>for 10.2.3)
>>
>>But since you've asked, I can try to get them built and see if we can
>>get 10.2.3 out.
>
> Sounds good, thanks Alfredo!

10.2.3 arm64 for xenial (and centos7) is out. We only have xenial
available for arm64, hopefully that will work for you.
>
> Bryan
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs slow delete

2016-10-14 Thread Gregory Farnum
On Fri, Oct 14, 2016 at 1:11 PM, Heller, Chris  wrote:
> Ok. Since I’m running through the Hadoop/ceph api, there is no syscall 
> boundary so there is a simple place to improve the throughput here. Good to 
> know, I’ll work on a patch…

Ah yeah, if you're in whatever they call the recursive tree delete
function you can unroll that loop a whole bunch. I forget where the
boundary is so you may need to go play with the JNI code; not sure.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs slow delete

2016-10-14 Thread Heller, Chris
Ok. Since I’m running through the Hadoop/ceph api, there is no syscall boundary 
so there is a simple place to improve the throughput here. Good to know, I’ll 
work on a patch…

On 10/14/16, 3:58 PM, "Gregory Farnum"  wrote:

On Fri, Oct 14, 2016 at 11:41 AM, Heller, Chris  wrote:
> Unfortunately, it was all in the unlink operation. Looks as if it took
> nearly 20 hours to remove the dir, roundtrip is a killer there. What can be
> done to reduce RTT to the MDS? Does the client really have to sequentially
> delete directories or can it have internal batching or parallelization?

It's bound by the same syscall APIs as anything else. You can spin off
multiple deleters; I'd either keep them on one client (if you want to
work within a single directory) or if using multiple clients assign
them to different portions of the hierarchy. That will let you
parallelize across the IO latency until you hit a cap on the MDS'
total throughput (should be 1-10k deletes/s based on latest tests
IIRC).
-Greg

>
> -Chris
>
> On 10/13/16, 4:22 PM, "Gregory Farnum"  wrote:
>
> On Thu, Oct 13, 2016 at 12:44 PM, Heller, Chris  wrote:
> > I have a directory I’ve been trying to remove from cephfs (via
> > cephfs-hadoop), the directory is a few hundred gigabytes in size and
> > contains a few million files, but not in a single sub directory. I started
> > the delete yesterday at around 6:30 EST, and it’s still progressing. I can
> > see from (ceph osd df) that the overall data usage on my cluster is
> > decreasing, but at the rate it’s going it will be a month before the entire
> > sub directory is gone. Is a recursive delete of a directory known to be a
> > slow operation in CephFS or have I hit upon some bad configuration? What
> > steps can I take to better debug this scenario?
>
> Is it the actual unlink operation taking a long time, or just the
> reduction in used space? Unlinks require a round trip to the MDS
> unfortunately, but you should be able to speed things up at least some
> by issuing them in parallel on different directories.
>
> If it's the used space, you can let the MDS issue more RADOS delete
> ops by adjusting the "mds max purge files" and "mds max purge ops"
> config values.
> -Greg
>
>


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs slow delete

2016-10-14 Thread Gregory Farnum
On Fri, Oct 14, 2016 at 11:41 AM, Heller, Chris  wrote:
> Unfortunately, it was all in the unlink operation. Looks as if it took nearly 
> 20 hours to remove the dir, roundtrip is a killer there. What can be done to 
> reduce RTT to the MDS? Does the client really have to sequentially delete 
> directories or can it have internal batching or parallelization?

It's bound by the same syscall APIs as anything else. You can spin off
multiple deleters; I'd either keep them on one client (if you want to
work within a single directory) or if using multiple clients assign
them to different portions of the hierarchy. That will let you
parallelize across the IO latency until you hit a cap on the MDS'
total throughput (should be 1-10k deletes/s based on latest tests
IIRC).
-Greg
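
For illustration, a minimal sketch of the "several deleters on one client"
approach (the mount point, path and parallelism level below are made up;
adjust them to your setup, and note it assumes directory names without
whitespace):

    # fan out one rm per top-level subdirectory, 8 at a time, so the
    # per-unlink MDS round trips overlap instead of running serially
    ls -d /mnt/cephfs/big-dir/*/ | xargs -n 1 -P 8 rm -rf
    # then remove whatever is left at the top level
    rm -rf /mnt/cephfs/big-dir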

>
> -Chris
>
> On 10/13/16, 4:22 PM, "Gregory Farnum"  wrote:
>
> On Thu, Oct 13, 2016 at 12:44 PM, Heller, Chris  wrote:
> > I have a directory I’ve been trying to remove from cephfs (via
> > cephfs-hadoop), the directory is a few hundred gigabytes in size and
> > contains a few million files, but not in a single sub directory. I started
> > the delete yesterday at around 6:30 EST, and it’s still progressing. I can
> > see from (ceph osd df) that the overall data usage on my cluster is
> > decreasing, but at the rate it’s going it will be a month before the entire
> > sub directory is gone. Is a recursive delete of a directory known to be a
> > slow operation in CephFS or have I hit upon some bad configuration? What
> > steps can I take to better debug this scenario?
>
> Is it the actual unlink operation taking a long time, or just the
> reduction in used space? Unlinks require a round trip to the MDS
> unfortunately, but you should be able to speed things up at least some
> by issuing them in parallel on different directories.
>
> If it's the used space, you can let the MDS issue more RADOS delete
> ops by adjusting the "mds max purge files" and "mds max purge ops"
> config values.
> -Greg
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] radosgw keystone integration in mitaka

2016-10-14 Thread Jonathan Proulx
Hi All,

Recently upgraded from Kilo->Mitaka on my OpenStack deploy and now
radosgw nodes (jewel) are unable to validate keystone tokens.

Initially I thought it was because radosgw relies on admin_token
(which is a bad idea, but ...) and that's now deprecated.  I
verified the token was still in keystone.conf and fixed it when I found
it had been commented out of keystone-paste.ini, but even after fixing
that and restarting my keystone I get:


-- grep req-a5030a83-f265-4b25-b6e5-1918c978f824 /var/log/keystone/keystone.log
2016-10-14 15:12:47.631 35977 WARNING keystone.middleware.auth 
[req-a5030a83-f265-4b25-b6e5-1918c978f824 - - - - -] Deprecated: 
build_auth_context middleware checking for the admin token is deprecated as of 
the Mitaka release and will be removed in the O release. If your deployment 
requires use of the admin token, update keystone-paste.ini so that 
admin_token_auth is before build_auth_context in the paste pipelines, otherwise 
remove the admin_token_auth middleware from the paste pipelines.
2016-10-14 15:12:47.671 35977 INFO keystone.common.wsgi 
[req-a5030a83-f265-4b25-b6e5-1918c978f824 - - - - -] GET 
https://nimbus-1.csail.mit.edu:35358/v2.0/tokens/
2016-10-14 15:12:47.672 35977 WARNING oslo_log.versionutils 
[req-a5030a83-f265-4b25-b6e5-1918c978f824 - - - - -] Deprecated: validate_token 
of the v2 API is deprecated as of Mitaka in favor of a similar function in the 
v3 API and may be removed in Q.
2016-10-14 15:12:47.684 35977 WARNING keystone.common.wsgi 
[req-a5030a83-f265-4b25-b6e5-1918c978f824 - - - - -] You are not authorized to 
perform the requested action: identity:validate_token

I've dug through keystone/policy.json and identity:validate_token is
authorized to "role:admin or is_admin:1" which I *think* should cover
the token use case...but not 100% sure.

Can radosgw use a proper keystone user so I can avoid the admin_token
mess (http://docs.ceph.com/docs/jewel/radosgw/keystone/ seems to
indicate no)?

Or anyone see where in my keystone chain I might have dropped a link?

Thanks,
-Jon
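
For reference, a rough sketch of the service-user style radosgw->keystone
configuration in ceph.conf; the section name and values below are placeholders,
and whether your exact jewel build honours the user/password options (rather
than only the admin token) should be double-checked against its own
documentation:

    [client.radosgw.gateway]
    rgw keystone url = https://nimbus-1.csail.mit.edu:35358
    rgw keystone admin user = radosgw
    rgw keystone admin password = <secret>
    rgw keystone admin tenant = service
    rgw keystone accepted roles = admin,Member,_member_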
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs slow delete

2016-10-14 Thread Heller, Chris
Unfortunately, it was all in the unlink operation. Looks as if it took nearly 
20 hours to remove the dir, roundtrip is a killer there. What can be done to 
reduce RTT to the MDS? Does the client really have to sequentially delete 
directories or can it have internal batching or parallelization?

-Chris

On 10/13/16, 4:22 PM, "Gregory Farnum"  wrote:

On Thu, Oct 13, 2016 at 12:44 PM, Heller, Chris  wrote:
> I have a directory I’ve been trying to remove from cephfs (via
> cephfs-hadoop), the directory is a few hundred gigabytes in size and
> contains a few million files, but not in a single sub directory. I started
> the delete yesterday at around 6:30 EST, and it’s still progressing. I can
> see from (ceph osd df) that the overall data usage on my cluster is
> decreasing, but at the rate it’s going it will be a month before the entire
> sub directory is gone. Is a recursive delete of a directory known to be a
> slow operation in CephFS or have I hit upon some bad configuration? What
> steps can I take to better debug this scenario?

Is it the actual unlink operation taking a long time, or just the
reduction in used space? Unlinks require a round trip to the MDS
unfortunately, but you should be able to speed things up at least some
by issuing them in parallel on different directories.

If it's the used space, you can let the MDS issue more RADOS delete
ops by adjusting the "mds max purge files" and "mds max purge ops"
config values.
-Greg
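
A minimal sketch of how those purge knobs could be raised; the numbers are
purely illustrative (defaults and sensible ceilings should be checked against
your release), but this is the general shape:

    # ceph.conf on the MDS host, [mds] section
    mds max purge files = 256
    mds max purge ops = 16384

    # or injected at runtime without restarting the MDS
    ceph tell mds.0 injectargs '--mds_max_purge_files 256 --mds_max_purge_ops 16384'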


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Even data distribution across OSD - Impossible Achievement?

2016-10-14 Thread info
Hi all, 

after encountering a warning about one of my OSDs running out of space, I tried 
to study in more detail how data distribution works. 

I'm running a Hammer Ceph cluster v. 0.94.7 

I did some tests with crushtool trying to figure out how to achieve even data 
distribution across OSDs. 

Let's take this simple CRUSH MAP: 

# begin crush map 
tunable choose_local_tries 0 
tunable choose_local_fallback_tries 0 
tunable choose_total_tries 50 
tunable chooseleaf_descend_once 1 
tunable straw_calc_version 1 
tunable chooseleaf_vary_r 1 

# devices 
# ceph-osd-001 
device 0 osd.0 # sata-p 
device 1 osd.1 # sata-p 
device 3 osd.3 # sata-p 
device 4 osd.4 # sata-p 
device 5 osd.5 # sata-p 
device 7 osd.7 # sata-p 
device 9 osd.9 # sata-p 
device 10 osd.10 # sata-p 
device 11 osd.11 # sata-p 
device 13 osd.13 # sata-p 
# ceph-osd-002 
device 14 osd.14 # sata-p 
device 15 osd.15 # sata-p 
device 16 osd.16 # sata-p 
device 18 osd.18 # sata-p 
device 19 osd.19 # sata-p 
device 21 osd.21 # sata-p 
device 23 osd.23 # sata-p 
device 24 osd.24 # sata-p 
device 25 osd.25 # sata-p 
device 26 osd.26 # sata-p 
# ceph-osd-003 
device 28 osd.28 # sata-p 
device 29 osd.29 # sata-p 
device 30 osd.30 # sata-p 
device 31 osd.31 # sata-p 
device 32 osd.32 # sata-p 
device 33 osd.33 # sata-p 
device 34 osd.34 # sata-p 
device 35 osd.35 # sata-p 
device 36 osd.36 # sata-p 
device 41 osd.41 # sata-p 
# types 
type 0 osd 
type 1 server 
type 3 datacenter 

# buckets 

### CEPH-OSD-003 ### 
server ceph-osd-003-sata-p { 
id -12 
alg straw 
hash 0 # rjenkins1 
item osd.28 weight 1.000 
item osd.29 weight 1.000 
item osd.30 weight 1.000 
item osd.31 weight 1.000 
item osd.32 weight 1.000 
item osd.33 weight 1.000 
item osd.34 weight 1.000 
item osd.35 weight 1.000 
item osd.36 weight 1.000 
item osd.41 weight 1.000 
} 

### CEPH-OSD-002 ### 
server ceph-osd-002-sata-p { 
id -9 
alg straw 
hash 0 # rjenkins1 
item osd.14 weight 1.000 
item osd.15 weight 1.000 
item osd.16 weight 1.000 
item osd.18 weight 1.000 
item osd.19 weight 1.000 
item osd.21 weight 1.000 
item osd.23 weight 1.000 
item osd.24 weight 1.000 
item osd.25 weight 1.000 
item osd.26 weight 1.000 
} 

### CEPH-OSD-001 ### 
server ceph-osd-001-sata-p { 
id -5 
alg straw 
hash 0 # rjenkins1 
item osd.0 weight 1.000 
item osd.1 weight 1.000 
item osd.3 weight 1.000 
item osd.4 weight 1.000 
item osd.5 weight 1.000 
item osd.7 weight 1.000 
item osd.9 weight 1.000 
item osd.10 weight 1.000 
item osd.11 weight 1.000 
item osd.13 weight 1.000 
} 

# DATACENTER 
datacenter dc1 { 
id -1 
alg straw 
hash 0 # rjenkins1 
item ceph-osd-001-sata-p weight 10.000 
item ceph-osd-002-sata-p weight 10.000 
item ceph-osd-003-sata-p weight 10.000 
} 

# rules 
rule sata-p { 
ruleset 0 
type replicated 
min_size 2 
max_size 10 
step take dc1 
step chooseleaf firstn 0 type server 
step emit 
} 

# end crush map 


Basically it's 30 OSDs spread across 3 servers. One rule exists, the classic 
replica-3 


cephadm@cephadm01:/etc/ceph/$ crushtool -i crushprova.c --test 
--show-utilization --num-rep 3 --tree --max-x 1 

ID WEIGHT TYPE NAME 
-1 30.0 datacenter milano1 
-5 10.0 server ceph-osd-001-sata-p 
0 1.0 osd.0 
1 1.0 osd.1 
3 1.0 osd.3 
4 1.0 osd.4 
5 1.0 osd.5 
7 1.0 osd.7 
9 1.0 osd.9 
10 1.0 osd.10 
11 1.0 osd.11 
13 1.0 osd.13 
-9 10.0 server ceph-osd-002-sata-p 
14 1.0 osd.14 
15 1.0 osd.15 
16 1.0 osd.16 
18 1.0 osd.18 
19 1.0 osd.19 
21 1.0 osd.21 
23 1.0 osd.23 
24 1.0 osd.24 
25 1.0 osd.25 
26 1.0 osd.26 
-12 10.0 server ceph-osd-003-sata-p 
28 1.0 osd.28 
29 1.0 osd.29 
30 1.0 osd.30 
31 1.0 osd.31 
32 1.0 osd.32 
33 1.0 osd.33 
34 1.0 osd.34 
35 1.0 osd.35 
36 1.0 osd.36 
41 1.0 osd.41 

rule 0 (sata-performance), x = 0..1023, numrep = 3..3 
rule 0 (sata-performance) num_rep 3 result size == 3: 1024/1024 
device 0: stored : 95 expected : 102.49 
device 1: stored : 95 expected : 102.49 
device 3: stored : 104 expected : 102.49 
device 4: stored : 95 expected : 102.49 
device 5: stored : 110 expected : 102.49 
device 7: stored : 111 expected : 102.49 
device 9: stored : 106 expected : 102.49 
device 10: stored : 97 expected : 102.49 
device 11: stored : 105 expected : 102.49 
device 13: stored : 106 expected : 102.49 
device 14: stored : 107 expected : 102.49 
device 15: stored : 107 expected : 102.49 
device 16: stored : 101 expected : 102.49 
device 18: stored : 93 expected : 102.49 
device 19: stored : 102 expected : 102.49 
device 21: stored : 112 expected : 102.49 
device 23: stored : 115 expected : 102.49 
device 24: stored : 95 expected : 102.49 
device 25: stored : 98 expected : 102.49 
device 26: stored : 94 expected : 102.49 
device 28: stored : 92 expected : 102.49 
device 29: stored : 87 expected : 102.49 
device 30: stored : 109 expected : 102.49 

Re: [ceph-users] resolve split brain situation in ceph cluster

2016-10-14 Thread Gregory Farnum
On Fri, Oct 14, 2016 at 7:27 AM, Manuel Lausch  wrote:
> Hi,
>
> I need some help to fix a broken cluster. I think we broke the cluster, but
> I want to know your opinion and if you see a possibility to recover it.
>
> Let me explain what happend.
>
> We have a cluster (version 0.94.9) in two datacenters (A and B), each with 12
> nodes of 60 OSDs. In A we have 3 monitor nodes and in B 2. The CRUSH rule and
> replication factor force two replicas in each datacenter.
>
> We write objects via librados into the cluster. The objects are immutable, so
> they are either present or absent.
>
> In this cluster we tested what happens if datacenter A fails and we need
> to bring up the cluster in B by creating a monitor quorum in B. We did this
> by cutting off the network connection between the two datacenters. The OSDs from
> DC B went down as expected. We then removed the mon nodes from the monmap
> in B (by extracting it offline and editing it). Our clients then wrote data into
> both independent cluster parts before we stopped the mons in A. (YES, I know,
> this is a really bad thing.)

This story line seems to be missing some points. How did you cut off
the network connection? What leads you to believe the OSDs accepted
writes on both sides of the split? Did you edit the monmap in both
data centers, or just DC A (that you wanted to remain alive)? What
monitor counts do you have in each DC?
-Greg

>
> Now we are trying to join the two sides again, but so far without success.
>
> Only the OSDs in B are running. The OSDs in A start but stay
> down. In the mon log we see a lot of „...(leader).pg v3513957 ignoring stats
> from non-active osd“ alerts.
>
> We see that the current osdmap epoch in the running cluster is „28873“. In
> the OSDs in A the epoch is „29003“. We assume that this is the reason why
> the OSDs won't join.
>
>
> BTW: This is only a testcluster, so no important data are harmed.
>
>
> Regards
> Manuel
>
>
> --
> Manuel Lausch
>
> Systemadministrator
> Cloud Services
>
> 1&1 Mail & Media Development & Technology GmbH | Brauerstraße 48 | 76135
> Karlsruhe | Germany
> Phone: +49 721 91374-1847
> E-Mail: manuel.lau...@1und1.de | Web: www.1und1.de
>
> Amtsgericht Montabaur, HRB 5452
>
> Geschäftsführer: Frank Einhellinger, Thomas Ludwig, Jan Oetjen
>
>
> Member of United Internet
>
>
> This e-mail may contain confidential and/or privileged information. If you
> are not the intended recipient of this e-mail, you are hereby notified that
> saving, distribution or use of the content of this e-mail in any way is
> prohibited. If you have received this e-mail in error, please notify the
> sender and delete the e-mail.
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] resolve split brain situation in ceph cluster

2016-10-14 Thread Manuel Lausch

Hi,

I need some help to fix a broken cluster. I think we broke the cluster, 
but I want to know your opinion and if you see a possibility to recover it.


Let me explain what happend.

We have a cluster (version 0.94.9) in two datacenters (A and B), each with 
12 nodes of 60 OSDs. In A we have 3 monitor nodes and in B 2. The 
CRUSH rule and replication factor force two replicas in each datacenter.

We write objects via librados into the cluster. The objects are immutable, 
so they are either present or absent.

In this cluster we tested what happens if datacenter A fails and we 
need to bring up the cluster in B by creating a monitor quorum in B. We 
did this by cutting off the network connection between the two datacenters. 
The OSDs from DC B went down as expected. We then removed the mon nodes 
from the monmap in B (by extracting it offline and editing it). Our clients 
then wrote data into both independent cluster parts before we stopped the 
mons in A. (YES, I know, this is a really bad thing.)


Now we are trying to join the two sides again, but so far without success.

Only the OSDs in B are running. The OSDs in A start but stay 
down. In the mon log we see a lot of „...(leader).pg v3513957 ignoring 
stats from non-active osd“ alerts.

We see that the current osdmap epoch in the running cluster is „28873“. 
In the OSDs in A the epoch is „29003“. We assume that this is the reason 
why the OSDs won't join.



BTW: This is only a testcluster, so no important data are harmed.


Regards
Manuel


--
Manuel Lausch

Systemadministrator
Cloud Services

1&1 Mail & Media Development & Technology GmbH | Brauerstraße 48 | 76135 
Karlsruhe | Germany
Phone: +49 721 91374-1847
E-Mail: manuel.lau...@1und1.de | Web: www.1und1.de

Amtsgericht Montabaur, HRB 5452

Geschäftsführer: Frank Einhellinger, Thomas Ludwig, Jan Oetjen


Member of United Internet

This e-mail may contain confidential and/or privileged information. If you are 
not the intended recipient of this e-mail, you are hereby notified that saving, 
distribution or use of the content of this e-mail in any way is prohibited. If 
you have received this e-mail in error, please notify the sender and delete the 
e-mail.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Calc the number of shards needed for a bucket

2016-10-14 Thread Ansgar Jazdzewski
Hi,

I would like to know if any of you have some kind of formula for setting
the right number of shards for a bucket.
We currently have a bucket with 30M objects and expect it to grow
to 50M.
At the moment we have 64 shards configured, but I was told that
this is much too low.

Any hints / formulas for me? Thanks,
Ansgar
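
A commonly cited rule of thumb (not an official formula, so treat this as a
hedged sketch and verify against your RGW release) is to aim for roughly 100k
objects per bucket index shard:

    # 50M expected objects at ~100k objects per shard
    echo $(( (50000000 + 99999) / 100000 ))    # -> 500, round up to taste

    # for *newly created* buckets the default can be set in ceph.conf,
    # e.g. in your rgw client section:
    #   rgw override bucket index max shards = 512

Existing buckets keep the shard count they were created with, so changing that
option alone will not reshard a bucket that already holds 30M objects.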
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Stuck at "Setting up ceph-osd (10.2.3-1~bpo80+1)"

2016-10-14 Thread Henrik Korkuc

On 16-10-13 22:46, Chris Murray wrote:

On 13/10/2016 11:49, Henrik Korkuc wrote:
Is apt/dpkg doing something now? Is the problem repeatable, e.g. by 
killing the upgrade and starting it again? Are there any stuck systemctl 
processes?


I had no problems upgrading 10.2.x clusters to 10.2.3

On 16-10-13 13:41, Chris Murray wrote:

On 22/09/2016 15:29, Chris Murray wrote:

Hi all,

Might anyone be able to help me troubleshoot an "apt-get dist-upgrade"
which is stuck at "Setting up ceph-osd (10.2.3-1~bpo80+1)"?

I'm upgrading from 10.2.2. The two OSDs on this node are up, and think
they are version 10.2.3, but the upgrade doesn't appear to be 
finishing

... ?

Thank you in advance,
Chris


Hi,

Are there possibly any pointers to help troubleshoot this? I've got 
a test system on which the same thing has happened.


The cluster's status is "HEALTH_OK" before starting. I'm running 
Debian Jessie.


dpkg.log only has the following:

2016-10-13 11:37:25 configure ceph-osd:amd64 10.2.3-1~bpo80+1 
2016-10-13 11:37:25 status half-configured ceph-osd:amd64 
10.2.3-1~bpo80+1


At this point, the ugrade gets stuck and doesn't go any further. 
Where could I look for the next clue?


Thanks,

Chris


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





Thank you Henrik, I see it's a systemctl process that's stuck.

It is reproducible for me on every run of  dpkg --configure -a

And, indeed, reproducible across two separate machines.

I'll pursue the stuck "/bin/systemctl start ceph-osd.target".



you can try to check if systemctl daemon-reexec helps to solve this 
problem. I couldn't find a link quickly, but it seems that Jessie systemd 
sometimes manages to get stuck on systemctl calls.
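
A minimal sketch of that check on the stuck node (assumes a root shell; the
unit name matches the ceph-osd.target mentioned above):

    systemctl list-jobs        # see whether ceph-osd.target is sitting in the job queue
    systemctl daemon-reexec    # re-exec systemd, which often clears wedged jobs
    dpkg --configure -a        # then retry the half-configured ceph-osd package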



Thanks again,
Chris




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com