Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-02 Thread Denny Fuchs

hi,

I never noticed the Debian /etc/default/ceph before :-)

=
# Increase tcmalloc cache size
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728


That is what is active now.
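
(If needed, the heap stats that were asked about earlier in the thread can
be pulled from a running OSD via the tell interface, e.g.:

# ceph tell osd.<id> heap stats

where <id> is one of the OSDs still running with tcmalloc.)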


Huge pages:

# cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never


# dpkg -S  /usr/lib/x86_64-linux-gnu/libjemalloc.so.1
libjemalloc1: /usr/lib/x86_64-linux-gnu/libjemalloc.so.1


So the file exists on Proxmox 5.x (Ceph version 12.2.11-pve1).


If I understand correctly, I should try setting the bitmap allocator:


[osd]
...
bluestore_allocator = bitmap
bluefs_allocator = bitmap

I would restart the nodes one by one and see what happens.
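
A rough sketch of the per-node rolling restart I have in mind (assuming
systemd-managed OSDs; osd.<id> stands for any OSD on the restarted node):

# ceph osd set noout
# systemctl restart ceph-osd.target
# ceph -s                 (wait until HEALTH_OK / all PGs active+clean)
# ceph daemon osd.<id> config get bluestore_allocator
# ceph osd unset noout    (once all nodes are done)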

cu denny


Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-02 Thread Igor Podlesny
On Fri, 3 May 2019 at 05:12, Mark Nelson  wrote:
[...]
> > -- https://www.kernel.org/doc/Documentation/vm/transhuge.txt
>
> Why are you quoting the description for the madvise setting when that's
> clearly not what was set in the case I just showed you?

Similarly, why are you telling us it must be due to THPs if:

1) by default they're not used unless madvise()'ed,
2) neither jemalloc nor tcmalloc would madvise() by default either.

[...]
> previously malloc'ed. Because the machine used transparent huge pages,

Is that from DigitalOcean's blog? I read it quite a while ago. It was also
written long ago, referring to some ancient release of jemalloc and, more
importantly, to a system that has THP activated.

-- But I've shown you that using THP is not the kernel's default setting --
unless madvise() tells the kernel to.
Your CentOS example isn't relevant because the person who started this
thread uses Debian (Proxmox, to be more precise).
Moreover, something tells me that even on default CentOS installs
THPs are also set to madvise()-only.

> I'm not going to argue with you about this.

I'm not arguing with you.
I'm merely showing you that instead of making baseless claims (or wild
guesswork), it's worth checking the facts first.
Checking whether THPs are used at all (although that might be due not to
OSDs but to, say, KVM) is as simple as looking into /proc/meminfo.
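
For instance (substitute the PID of a running OSD):

grep AnonHugePages /proc/meminfo
grep AnonHugePages /proc/<osd-pid>/smaps | awk '{s+=$2} END {print s" kB"}'

The first shows system-wide THP usage; the second sums the THP-backed
anonymous memory of a single process.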

> Test it if you want or don't.

I didn't start this thread. ;)
As for me -- I've played enough with all kinds of allocators and THP settings. :)

-- 
End of message. Next message?


Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-02 Thread Mark Nelson


On 5/2/19 1:51 PM, Igor Podlesny wrote:

On Fri, 3 May 2019 at 01:29, Mark Nelson  wrote:

On 5/2/19 11:46 AM, Igor Podlesny wrote:

On Thu, 2 May 2019 at 05:02, Mark Nelson  wrote:
[...]

FWIW, if you still have an OSD up with tcmalloc, it's probably worth
looking at the heap stats to see how much memory tcmalloc thinks it's
allocated vs how much RSS memory is being used by the process.  It's
quite possible that there is memory that has been unmapped but that the
kernel can't (or has decided not yet to) reclaim.
Transparent huge pages can potentially have an effect here both with tcmalloc 
and with
jemalloc so it's not certain that switching the allocator will fix it entirely.

Most likely wrong -- the kernel's default setting for THP is "madvise",
and neither tcmalloc nor jemalloc would madvise() to make it happen.
With a fresh enough jemalloc you could have it, but it needs special
malloc.conf'ing.


  From one of our centos nodes with no special actions taken to change
THP settings (though it's possible it was inherited from something else):


$ cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

"madvise" will enter direct reclaim like "always" but only for regions
that are have used madvise(MADV_HUGEPAGE). This is the default behaviour.

-- https://www.kernel.org/doc/Documentation/vm/transhuge.txt



Why are you quoting the description for the madvise setting when that's 
clearly not what was set in the case I just showed you?






And regarding madvise and alternate memory allocators:
https:

[...]

did you ever read any of it?

one link's info:

"By default jemalloc does not use huge pages for heap memory (there is
opt.metadata_thp which uses THP for internal metadata though)"



"It turns out that|jemalloc(3)|uses|madvise(2)|extensively to notify the 
operating system that it's done with a range of memory which it had 
previously|malloc|'ed. Because the machine used transparent huge pages, 
the page size was 2MB. As such, a lot of the memory which was being 
marked with|madvise(..., MADV_DONTNEED)|was within ranges substantially 
smaller than 2MB. This meant that the operating system never was able to 
evict pages which had ranges marked as|MADV_DONTNEED|because the entire 
page would have to be unneeded to allow it to be reused.


So despite initially looking like a leak, the operating system itself 
was unable to free memory because of|madvise(2)|and transparent huge 
pages.^4 
 
This led to sustained memory pressure on the machine 
and|redis-server|eventually getting OOM killed."



I'm not going to argue with you about this.  Test it if you want or don't.


Mark




(and I've said

None of tcmalloc or jemalloc would madvise() to make it happen.
With fresh enough jemalloc you could have it, but it needs special
malloc.conf'ing.

before)




[ceph-users] radosgw index all keys in all buckets

2019-05-02 Thread Aaron Bassett
Hello,
I'm trying to write a tool to index all keys in all buckets stored in radosgw. 
I've created a user with the following caps:

"caps": [
{
"type": "buckets",
"perm": "read"
},
{
"type": "metadata",
"perm": "read"
},
{
"type": "usage",
"perm": "read"
},
{
"type": "users",
"perm": "read"
}
],


With these caps I'm able to use a Python radosgw-admin library to list buckets,
ACLs and users, but not keys. This user is also unable to read buckets and/or
keys through the normal S3 API. Is there a way to create an S3 user that has
read access to all buckets and keys without explicitly being granted ACLs?
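
For reference, the caps above were granted with the usual radosgw-admin
invocation (the uid is just an example):

radosgw-admin caps add --uid=indexer \
    --caps="buckets=read;metadata=read;usage=read;users=read"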

Thanks,
Aaron


Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-02 Thread Igor Podlesny
On Fri, 3 May 2019 at 01:29, Mark Nelson  wrote:
> On 5/2/19 11:46 AM, Igor Podlesny wrote:
> > On Thu, 2 May 2019 at 05:02, Mark Nelson  wrote:
> > [...]
> >> FWIW, if you still have an OSD up with tcmalloc, it's probably worth
> >> looking at the heap stats to see how much memory tcmalloc thinks it's
> >> allocated vs how much RSS memory is being used by the process.  It's
> >> quite possible that there is memory that has been unmapped but that the
> >> kernel can't (or has decided not yet to) reclaim.
> >> Transparent huge pages can potentially have an effect here both with 
> >> tcmalloc and with
> >> jemalloc so it's not certain that switching the allocator will fix it 
> >> entirely.
> > Most likely wrong -- the kernel's default setting for THP is
> > "madvise", and neither tcmalloc nor jemalloc would madvise() to make
> > it happen.
> > With a fresh enough jemalloc you could have it, but it needs special
> > malloc.conf'ing.
>
>
>  From one of our centos nodes with no special actions taken to change
> THP settings (though it's possible it was inherited from something else):
>
>
> $ cat /etc/redhat-release
> CentOS Linux release 7.5.1804 (Core)
> $ cat /sys/kernel/mm/transparent_hugepage/enabled
> [always] madvise never

"madvise" will enter direct reclaim like "always" but only for regions
that are have used madvise(MADV_HUGEPAGE). This is the default behaviour.

-- https://www.kernel.org/doc/Documentation/vm/transhuge.txt

> And regarding madvise and alternate memory allocators:
> https:
[...]

did you ever read any of it?

one link's info:

"By default jemalloc does not use huge pages for heap memory (there is
opt.metadata_thp which uses THP for internal metadata though)"

(and I've said
> > None of tcmalloc or jemalloc would madvise() to make it happen.
> > With fresh enough jemalloc you could have it, but it needs special
> > malloc.conf'ing.
before)

-- 
End of message. Next message?


Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-02 Thread Mark Nelson



On 5/2/19 11:46 AM, Igor Podlesny wrote:

On Thu, 2 May 2019 at 05:02, Mark Nelson  wrote:
[...]

FWIW, if you still have an OSD up with tcmalloc, it's probably worth
looking at the heap stats to see how much memory tcmalloc thinks it's
allocated vs how much RSS memory is being used by the process.  It's
quite possible that there is memory that has been unmapped but that the
kernel can't (or has decided not yet to) reclaim.
Transparent huge pages can potentially have an effect here both with tcmalloc 
and with
jemalloc so it's not certain that switching the allocator will fix it entirely.

Most likely wrong -- the kernel's default setting for THP is "madvise",
and neither tcmalloc nor jemalloc would madvise() to make it happen.
With a fresh enough jemalloc you could have it, but it needs special
malloc.conf'ing.



From one of our centos nodes with no special actions taken to change 
THP settings (though it's possible it was inherited from something else):



$ cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never


And regarding madvise and alternate memory allocators:


https://blog.digitalocean.com/transparent-huge-pages-and-alternative-memory-allocators/

https://www.nuodb.com/techblog/linux-transparent-huge-pages-jemalloc-and-nuodb

https://github.com/gperftools/gperftools/issues/1073

https://github.com/jemalloc/jemalloc/issues/1243

https://github.com/jemalloc/jemalloc/issues/1128





First I would just get the heap stats and then after that I would be
very curious if disabling transparent huge pages helps. Alternately,
it's always possible it's a memory leak. :D

RedHat can do better (hopefully). ;-P



Re: [ceph-users] Restricting access to RadosGW/S3 buckets

2019-05-02 Thread Benjeman Meekhof
Hi Vlad,

If a user creates a bucket then only that user can see the bucket
unless an S3 ACL is applied giving additional permissions... but I'd
guess you are asking a more complex question than that.

If you are looking to apply some kind of policy over-riding whatever
ACL a user might apply to a bucket then it looks like the integration
with Open Policy Agent can do what you want.  I have not myself tried
this out but it looks very interesting if you have the Nautilus
release.
http://docs.ceph.com/docs/nautilus/radosgw/opa/

A third option is you could run the RGW behind something like HAproxy
and configure ACL there which allow/disallow requests based on
different criteria.  For example you can parse the bucket name out of
the URL and match against an ACL.  You may be able to use the
Authorization header to pull out the access key id and match that
against a map file and allow/disallow the request, or use some other
criteria as might be available in HAproxy.  HAproxy does have a unix
socket interface allowing for modifying mapfile entries without
restarting/editing the proxy config files.
http://cbonte.github.io/haproxy-dconv/1.8/configuration.html#7
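
For instance, with an admin-level stats socket and a map file (paths and
key are purely illustrative), entries can be added and inspected at runtime:

echo "add map /etc/haproxy/access_keys.map AKIAEXAMPLEKEY allow" | \
    socat stdio /var/run/haproxy.sock
echo "show map /etc/haproxy/access_keys.map" | socat stdio /var/run/haproxy.sock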

thanks,
Ben

On Thu, May 2, 2019 at 12:53 PM Vladimir Brik
 wrote:
>
> Hello
>
> I am trying to figure out a way to restrict access to S3 buckets. Is it
> possible to create a RadosGW user that can only access specific bucket(s)?
>
>
> Thanks,
>
> Vlad
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Restricting access to RadosGW/S3 buckets

2019-05-02 Thread Vladimir Brik

Hello

I am trying to figure out a way to restrict access to S3 buckets. Is it 
possible to create a RadosGW user that can only access specific bucket(s)?



Thanks,

Vlad


Re: [ceph-users] Cephfs on an EC Pool - What determines object size

2019-05-02 Thread Daniel Williams
Thanks so much for your help!

On Mon, Apr 29, 2019 at 6:49 PM Gregory Farnum  wrote:

> Yes, check out the file layout options:
> http://docs.ceph.com/docs/master/cephfs/file-layouts/
>
> On Mon, Apr 29, 2019 at 3:32 PM Daniel Williams 
> wrote:
> >
> > Is the 4MB configurable?
> >
> > On Mon, Apr 29, 2019 at 4:36 PM Gregory Farnum 
> wrote:
> >>
> >> CephFS automatically chunks objects into 4MB objects by default. For
> >> an EC pool, RADOS internally will further subdivide them based on the
> >> erasure code and striping strategy, with a layout that can vary. But
> >> by default if you have eg an 8+3 EC code, you'll end up with a bunch
> >> of (4MB/8=)512KB objects within the OSD.
> >> -Greg
> >>
> >> On Sun, Apr 28, 2019 at 12:42 PM Daniel Williams 
> wrote:
> >> >
> >> > Hey,
> >> >
> >> > What controls / determines object size of a purely cephfs ec (6.3)
> pool? I have large file but seemingly small objects.
> >> >
> >> > Daniel
> >> > ___
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
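
For example, the object size from those layouts can be adjusted per
directory with the virtual xattrs described in that doc (assuming a
CephFS mount at /mnt/cephfs; 16 MiB here is only an illustration and
only applies to files created afterwards):

setfattr -n ceph.dir.layout.object_size -v 16777216 /mnt/cephfs/somedir
getfattr -n ceph.dir.layout /mnt/cephfs/somedir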


Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-02 Thread Igor Podlesny
On Thu, 2 May 2019 at 05:02, Mark Nelson  wrote:
[...]
> FWIW, if you still have an OSD up with tcmalloc, it's probably worth
> looking at the heap stats to see how much memory tcmalloc thinks it's
> allocated vs how much RSS memory is being used by the process.  It's
> quite possible that there is memory that has been unmapped but that the
> kernel can't (or has decided not yet to) reclaim.

> Transparent huge pages can potentially have an effect here both with tcmalloc 
> and with
> jemalloc so it's not certain that switching the allocator will fix it 
> entirely.

Most likely wrong -- the kernel's default setting for THP is "madvise",
and neither tcmalloc nor jemalloc would madvise() to make it happen.
With a fresh enough jemalloc you could have it, but it needs special
malloc.conf'ing.

> First I would just get the heap stats and then after that I would be
> very curious if disabling transparent huge pages helps. Alternately,
> it's always possible it's a memory leak. :D

RedHat can do better (hopefully). ;-P

-- 
End of message. Next message?


Re: [ceph-users] Data distribution question

2019-05-02 Thread Shain Miley

Just to follow up on this:

I ended up enabling the balancer module in upmap mode.
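
Roughly, via:

ceph mgr module enable balancer
ceph osd set-require-min-compat-client luminous
ceph balancer mode upmap
ceph balancer on
ceph balancer status

(the min-compat-client step is needed before upmap can be used, if it was
not already set).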

This did resolve the short-term issue and evened things out a
bit... but things are still far from uniform.


It seems like the balancer is an ongoing process that continues
to run over time, so maybe things will improve even more over the next
few weeks.


Thank you to everyone who helped provide insight into possible solutions.

Shain

On 4/30/19 2:08 PM, Dan van der Ster wrote:

Removing pools won't make a difference.

Read up to slide 22 here: 
https://www.slideshare.net/mobile/Inktank_Ceph/ceph-day-berlin-mastering-ceph-operations-upmap-and-the-mgr-balancer 



..
Dan

(Apologies for terseness, I'm mobile)



On Tue, 30 Apr 2019, 20:02 Shain Miley wrote:


Here is the per pool pg_num info:

'data' pg_num 64
'metadata' pg_num 64
'rbd' pg_num 64
'npr_archive' pg_num 6775
'.rgw.root' pg_num 64
'.rgw.control' pg_num 64
'.rgw' pg_num 64
'.rgw.gc' pg_num 64
'.users.uid' pg_num 64
'.users.email' pg_num 64
'.users' pg_num 64
'.usage' pg_num 64
'.rgw.buckets.index' pg_num 128
'.intent-log' pg_num 8
'.rgw.buckets' pg_num 64
'kube' pg_num 512
'.log' pg_num 8

Here is the df output:

GLOBAL:
    SIZE     AVAIL   RAW USED  %RAW USED
    1.06PiB  306TiB  778TiB    71.75
POOLS:
    NAME                ID  USED     %USED  MAX AVAIL  OBJECTS
    data                0   11.7GiB  0.14   8.17TiB    3006
    metadata            1   0B       0      8.17TiB    0
    rbd                 2   43.2GiB  0.51   8.17TiB    11147
    npr_archive         3   258TiB   97.93  5.45TiB    82619649
    .rgw.root           4   1001B    0      8.17TiB    5
    .rgw.control        5   0B       0      8.17TiB    8
    .rgw                6   6.16KiB  0      8.17TiB    35
    .rgw.gc             7   0B       0      8.17TiB    32
    .users.uid          8   0B       0      8.17TiB    0
    .users.email        9   0B       0      8.17TiB    0
    .users              10  0B       0      8.17TiB    0
    .usage              11  0B       0      8.17TiB    1
    .rgw.buckets.index  12  0B       0      8.17TiB    26
    .intent-log         17  0B       0      5.45TiB    0
    .rgw.buckets        18  24.2GiB  0.29   8.17TiB    6622
    kube                21  1.82GiB  0.03   5.45TiB    550
    .log                22  0B       0      5.45TiB    176


The stuff in the data pool and the rgw pools is old data that we used
for testing... if you guys think that removing everything outside of rbd
and npr_archive would make a significant impact I will give it a try.

Thanks,

Shain



On 4/30/19 1:15 PM, Jack wrote:
> Hi,
>
> I see that you are using rgw
> RGW comes with many pools, yet most of them are used for
metadata and
> configuration, those do not store many data
> Such pools do not need more than a couple PG, each (I use pg_num
= 8)
>
> You need to allocate your pg on pool that actually stores the data
>
> Please do the following, to let us know more:
> Print the pg_num per pool:
> for i in $(rados lspools); do echo -n "$i: "; ceph osd pool get $i
> pg_num; done
>
> Print the usage per pool:
> ceph df
>
> Also, instead of doing a "ceph osd reweight-by-utilization", check out
> the balancer plugin:
> http://docs.ceph.com/docs/mimic/mgr/balancer/
>
> Finally, in nautilus, the pg can now upscale and downscale automatically.
> See https://ceph.com/rados/new-in-nautilus-pg-merging-and-autotuning/
>
>
> On 04/30/2019 06:34 PM, Shain Miley wrote:
>> Hi,
>>
>> We have a cluster with 235 osd's running version 12.2.11 

Re: [ceph-users] upgrade to nautilus: "require-osd-release nautilus" required to increase pg_num

2019-05-02 Thread Sage Weil
On Mon, 29 Apr 2019, Alexander Y. Fomichev wrote:
> Hi,
> 
> I just upgraded from mimic to nautilus(14.2.0) and stumbled upon a strange
> "feature".
> I tried to increase pg_num for a pool. There was no errors but also no
> visible effect:
> 
> # ceph osd pool get foo_pool01 pg_num
> pg_num: 256
> # ceph osd pool set foo_pool01 pg_num 512
> set pool 11 pg_num to 512
> # ceph osd pool get foo_pool01 pg_num
> pg_num: 256
> 
> until finally I found that
> # ceph osd require-osd-release nautilus
> solves this problem
> 
> Docs are very scarce about "require-osd-release" command. Something like
> "Complete the upgrade by disallowing pre-Nautilus OSDs and enabling all new
> Nautilus-only functionality:", which gives no clue as to why the pretty
> old feature of increasing pg_num doesn't work. Anyway, I doubt that
> silently ignoring user commands is a good idea.
> So the question is: is this intentional behavior, or did I hit a bug?

You hit a bug.  :)

Starting in nautilus, the pg_num and pgp_num adjustments are managed by
the manager. When you make a change on the CLI, you're setting the
pg_num_target / pgp_num_target values, and the mgr makes the real changes
in a managed/careful way.
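
(As an aside, the pending value the mgr is converging toward can be seen
with something like

  ceph osd pool ls detail | grep foo_pool01

which, if I recall correctly, lists pg_num_target alongside pg_num once
the CLI change has been accepted.)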

In this case, the mon was assuming the nautilus behavior and then 
effectively discarding it because the osdmaps are still encoded in the 
mimic format.  Opened

http://tracker.ceph.com/issues/39570

sage


Re: [ceph-users] RGW Beast frontend and ipv6 options

2019-05-02 Thread Abhishek Lekshmanan
Daniel Gryniewicz  writes:

> After discussing with Casey, I'd like to propose some clarifications to 
> this.
>
> First, we do not treat EAFNOSUPPORT as a non-fatal error.  Any other 
> error binding is fatal, but that one we warn and continue.
>
> Second, we treat "port=<port>" as expanding to "endpoint=0.0.0.0:<port>,
> endpoint=[::]:<port>".
>
> Then, we just process the set of endpoints properly.  This should, I 
> believe, result in simple, straight-forward code, and easily 
> understandable semantics, and should make it simple for orchestrators.

Agreed, this makes a lot of sense. Specifying both a port and an endpoint
is somewhat of a corner case, and I guess for this particular case
failure to bind is acceptable, with the documentation already mentioning
the port's implicit endpoint behaviour.
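
For example, the kind of explicit dual-stack configuration being discussed
would look roughly like this in ceph.conf (instance name and port are only
illustrative):

[client.rgw.gateway1]
rgw frontends = beast endpoint=0.0.0.0:8000 endpoint=[::]:8000

(Today, depending on net.ipv6.bindv6only, one of the two binds can fail
because the other already covers both address families -- which is exactly
what the proposal above tries to make predictable.)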
>
> This would make 1 and 2 below fallout naturally.  3 is modified so that 
> we only use configured endpoints, but port= is now implicit endpoint 
> configuration.



>
> Daniel
>
> On 5/2/19 10:08 AM, Daniel Gryniewicz wrote:
>> Based on past experience with this issue in other projects, I would 
>> propose this:
>> 
>> 1. By default (rgw frontends=beast), we should bind to both IPv4 and 
>> IPv6, if available.
>> 
>> 2. Just specifying port (rgw frontends=beast port=8000) should apply to 
>> both IPv4 and IPv6, if available.
>> 
>> 3. If the user provides endpoint config, we should use only that 
>> endpoint config.  For example, if they provide only v4 addresses, we 
>> should only bind to v4.
>> 
>> This should all be independent of the bindv6only setting; that is, we 
>> should specifically bind our v4 and v6 addresses, and not depend on the 
>> system to automatically bind v4 when binding v6.
>> 
>> In the case of 1 or 2, if the system has disabled either v4 or v6, this 
>> should not be an error, as long as one of the two binds works.  In the 
>> case of 3, we should error out if any configured endpoint cannot be bound.
>> 
>> This should allow an orchestrator to confidently install a system, 
>> knowing what will happen, without needing to know or manipulate the 
>> bindv6only flag.
>> 
>> As for what happens if you specify an endpoint and a port, I don't have 
>> a strong opinion.  I see 2 reasonable possibilites:
>> 
>> a. Make it an error
>> 
>> b. Treat a port in this case as an endpoint of 0.0.0.0:port (v4-only)
>> 
>> Daniel
>> 
>> On 4/26/19 4:49 AM, Abhishek Lekshmanan wrote:
>>>
>>> Currently RGW's beast frontend supports ipv6 via the endpoint
>>> configurable. The port option will bind to ipv4 _only_.
>>>
>>> http://docs.ceph.com/docs/master/radosgw/frontends/#options
>>>
>>> Since many Linux systems may default the sysconfig net.ipv6.bindv6only
>>> flag to true, it usually means that specifying a v6 endpoint will bind
>>> to both v4 and v6. But this also means that deployment systems must be
>>> aware of this while configuring depending on whether both v4 and v6
>>> endpoints need to work or not. Specifying both a v4 and v6 endpoint or a
>>> port (v4) and endpoint with the same v6 port will currently lead to a
>>> failure as the system would've already bound the v6 port to both v4 and
>>> v6. This leaves us with a few options.
>>>
>>> 1. Keep the implicit behaviour as it is, document this, as systems are
>>> already aware of sysconfig flags and will expect that at a v6 endpoint
>>> will bind to both v4 and v6.
>>>
>>> 2. Be explicit with endpoints & configuration, Beast itself overrides
>>> the socket option to bind both v4 and v6, which means that v6 endpoint
>>> will bind to v6 *only* and binding to a v4 will need an explicit
>>> specification. (there is a pr in progress for this:
>>> https://github.com/ceph/ceph/pull/27270)
>>>
>>> Any more suggestions on how systems handle this are also welcome.
>>>
>>> -- 
>>> Abhishek
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>> 
>
>

-- 
Abhishek Lekshmanan
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)


Re: [ceph-users] RGW Beast frontend and ipv6 options

2019-05-02 Thread Daniel Gryniewicz
After discussing with Casey, I'd like to propose some clarifications to 
this.


First, we do not treat EAFNOSUPPORT as a non-fatal error.  Any other 
error binding is fatal, but that one we warn and continue.


Second, we treat "port=<port>" as expanding to "endpoint=0.0.0.0:<port>,
endpoint=[::]:<port>".


Then, we just process the set of endpoints properly.  This should, I 
believe, result in simple, straight-forward code, and easily 
understandable semantics, and should make it simple for orchestrators.


This would make 1 and 2 below fallout naturally.  3 is modified so that 
we only use configured endpoints, but port= is now implicit endpoint 
configuration.


Daniel

On 5/2/19 10:08 AM, Daniel Gryniewicz wrote:
Based on past experience with this issue in other projects, I would 
propose this:


1. By default (rgw frontends=beast), we should bind to both IPv4 and 
IPv6, if available.


2. Just specifying port (rgw frontends=beast port=8000) should apply to 
both IPv4 and IPv6, if available.


3. If the user provides endpoint config, we should use only that 
endpoint config.  For example, if they provide only v4 addresses, we 
should only bind to v4.


This should all be independent of the bindv6only setting; that is, we 
should specifically bind our v4 and v6 addresses, and not depend on the 
system to automatically bind v4 when binding v6.


In the case of 1 or 2, if the system has disabled either v4 or v6, this 
should not be an error, as long as one of the two binds works.  In the 
case of 3, we should error out if any configured endpoint cannot be bound.


This should allow an orchestrator to confidently install a system, 
knowing what will happen, without needing to know or manipulate the 
bindv6only flag.


As for what happens if you specify an endpoint and a port, I don't have 
a strong opinion.  I see 2 reasonable possibilites:


a. Make it an error

b. Treat a port in this case as an endpoint of 0.0.0.0:port (v4-only)

Daniel

On 4/26/19 4:49 AM, Abhishek Lekshmanan wrote:


Currently RGW's beast frontend supports ipv6 via the endpoint
configurable. The port option will bind to ipv4 _only_.

http://docs.ceph.com/docs/master/radosgw/frontends/#options

Since many Linux systems may default the sysconfig net.ipv6.bindv6only
flag to true, it usually means that specifying a v6 endpoint will bind
to both v4 and v6. But this also means that deployment systems must be
aware of this while configuring depending on whether both v4 and v6
endpoints need to work or not. Specifying both a v4 and v6 endpoint or a
port (v4) and endpoint with the same v6 port will currently lead to a
failure as the system would've already bound the v6 port to both v4 and
v6. This leaves us with a few options.

1. Keep the implicit behaviour as it is, document this, as systems are
already aware of sysconfig flags and will expect that at a v6 endpoint
will bind to both v4 and v6.

2. Be explicit with endpoints & configuration, Beast itself overrides
the socket option to bind both v4 and v6, which means that v6 endpoint
will bind to v6 *only* and binding to a v4 will need an explicit
specification. (there is a pr in progress for this:
https://github.com/ceph/ceph/pull/27270)

Any more suggestions on how systems handle this are also welcome.

--
Abhishek


Re: [ceph-users] RGW Beast frontend and ipv6 options

2019-05-02 Thread Daniel Gryniewicz
Based on past experience with this issue in other projects, I would 
propose this:


1. By default (rgw frontends=beast), we should bind to both IPv4 and 
IPv6, if available.


2. Just specifying port (rgw frontends=beast port=8000) should apply to 
both IPv4 and IPv6, if available.


3. If the user provides endpoint config, we should use only that 
endpoint config.  For example, if they provide only v4 addresses, we 
should only bind to v4.


This should all be independent of the bindv6only setting; that is, we 
should specifically bind our v4 and v6 addresses, and not depend on the 
system to automatically bind v4 when binding v6.


In the case of 1 or 2, if the system has disabled either v4 or v6, this 
should not be an error, as long as one of the two binds works.  In the 
case of 3, we should error out if any configured endpoint cannot be bound.


This should allow an orchestrator to confidently install a system, 
knowing what will happen, without needing to know or manipulate the 
bindv6only flag.


As for what happens if you specify an endpoint and a port, I don't have 
a strong opinion.  I see 2 reasonable possibilites:


a. Make it an error

b. Treat a port in this case as an endpoint of 0.0.0.0:port (v4-only)

Daniel

On 4/26/19 4:49 AM, Abhishek Lekshmanan wrote:


Currently RGW's beast frontend supports ipv6 via the endpoint
configurable. The port option will bind to ipv4 _only_.

http://docs.ceph.com/docs/master/radosgw/frontends/#options

Since many Linux systems may default the sysconfig net.ipv6.bindv6only
flag to true, it usually means that specifying a v6 endpoint will bind
to both v4 and v6. But this also means that deployment systems must be
aware of this while configuring depending on whether both v4 and v6
endpoints need to work or not. Specifying both a v4 and v6 endpoint or a
port (v4) and endpoint with the same v6 port will currently lead to a
failure as the system would've already bound the v6 port to both v4 and
v6. This leaves us with a few options.

1. Keep the implicit behaviour as it is, document this, as systems are
already aware of sysconfig flags and will expect that at a v6 endpoint
will bind to both v4 and v6.

2. Be explicit with endpoints & configuration, Beast itself overrides
the socket option to bind both v4 and v6, which means that v6 endpoint
will bind to v6 *only* and binding to a v4 will need an explicit
specification. (there is a pr in progress for this:
https://github.com/ceph/ceph/pull/27270)

Any more suggestions on how systems handle this are also welcome.

--
Abhishek


Re: [ceph-users] ceph-volume activate runs infinitely

2019-05-02 Thread Alfredo Deza
On Thu, May 2, 2019 at 8:28 AM Robert Sander
 wrote:
>
> Hi,
>
> On 02.05.19 13:40, Alfredo Deza wrote:
>
> > Can you give a bit more details on the environment? How dense is the
> > server? If the unit retries is fine and I was hoping at some point it
> > would see things ready and start activating (it does retry
> > indefinitely at the moment).
>
> It is a machine with 13 Bluestore OSDs on LVM with SSDs as Block.DB devices.
> The SSDs have also been setup with LVM. This has been done with "ceph-volume 
> lvm batch".
>
> The issue started with the latest Ubuntu updates (no Ceph updates involved)
> and the following reboot. The customer let the boot process run for over
> 30 minutes but the ceph-volume activation services (and wpa-supplicant + 
> logind)
> were not able to start.
>
> > Would also help to see what problems is it encountering as it can't
> > get to activate. There are two logs for this, one for the systemd unit
> > at /var/log/ceph/ceph-volume-systemd.log and the other one at
> > /var/log/ceph/ceph-volume.log that might
> > help.
>
> Like these entries?
>
> [2019-05-02 10:04:32,211][ceph_volume.process][INFO  ] stderr Job for 
> ceph-osd@21.service canceled.
> [2019-05-02 10:04:32,211][ceph_volume][ERROR ] exception caught by decorator
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/dist-packages/ceph_volume/decorators.py", line 59, 
> in newfunc
> return f(*a, **kw)
>   File "/usr/lib/python2.7/dist-packages/ceph_volume/main.py", line 148, in 
> main
> terminal.dispatch(self.mapper, subcommand_args)
>   File "/usr/lib/python2.7/dist-packages/ceph_volume/terminal.py", line 182, 
> in dispatch
> instance.main()
>   File "/usr/lib/python2.7/dist-packages/ceph_volume/devices/lvm/main.py", 
> line 40, in main
> terminal.dispatch(self.mapper, self.argv)
>   File "/usr/lib/python2.7/dist-packages/ceph_volume/terminal.py", line 182, 
> in dispatch
> instance.main()
>   File "/usr/lib/python2.7/dist-packages/ceph_volume/decorators.py", line 16, 
> in is_root
> return func(*a, **kw)
>   File "/usr/lib/python2.7/dist-packages/ceph_volume/devices/lvm/trigger.py", 
> line 70, in main
> Activate(['--auto-detect-objectstore', osd_id, osd_uuid]).main()
>   File 
> "/usr/lib/python2.7/dist-packages/ceph_volume/devices/lvm/activate.py", line 
> 339, in main
> self.activate(args)
>   File "/usr/lib/python2.7/dist-packages/ceph_volume/decorators.py", line 16, 
> in is_root
> return func(*a, **kw)
>   File 
> "/usr/lib/python2.7/dist-packages/ceph_volume/devices/lvm/activate.py", line 
> 261, in activate
> return activate_bluestore(lvs, no_systemd=args.no_systemd)
>   File 
> "/usr/lib/python2.7/dist-packages/ceph_volume/devices/lvm/activate.py", line 
> 196, in activate_bluestore
> systemctl.start_osd(osd_id)
>   File "/usr/lib/python2.7/dist-packages/ceph_volume/systemd/systemctl.py", 
> line 39, in start_osd
> return start(osd_unit % id_)
>   File "/usr/lib/python2.7/dist-packages/ceph_volume/systemd/systemctl.py", 
> line 8, in start
> process.run(['systemctl', 'start', unit])
>   File "/usr/lib/python2.7/dist-packages/ceph_volume/process.py", line 153, 
> in run
> raise RuntimeError(msg)
> RuntimeError: command returned non-zero exit status: 1
>
>
> [2019-05-02 10:04:32,222][ceph_volume.process][INFO  ] stdout Running 
> command: /bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-21
> --> Absolute path not found for executable: restorecon
> --> Ensure $PATH environment variable contains common executable locations
> Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-21
> Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir 
> --dev 
> /dev/ceph-block-393ba2fc-e970-4d48-8dcb-c6261dfdfe08/osd-block-931e2d94-63f6-4df8-baed-6873eb0123e2
>  --path /var/lib/ceph/osd/ceph-21 --no-mon-config
> Running command: /bin/ln -snf 
> /dev/ceph-block-393ba2fc-e970-4d48-8dcb-c6261dfdfe08/osd-block-931e2d94-63f6-4df8-baed-6873eb0123e2
>  /var/lib/ceph/osd/ceph-21/block
> Running command: /bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-21/block
> Running command: /bin/chown -R ceph:ceph /dev/dm-12
> Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-21
> Running command: /bin/ln -snf 
> /dev/ceph-block-dbs-75eda181-946f-4a40-b4e0-8ecd60721398/osd-block-db-45ee9a1f-3ee2-4db9-a057-fd06fa1452e8
>  /var/lib/ceph/osd/ceph-21/block.db
> Running command: /bin/chown -h ceph:ceph 
> /dev/ceph-block-dbs-75eda181-946f-4a40-b4e0-8ecd60721398/osd-block-db-45ee9a1f-3ee2-4db9-a057-fd06fa1452e8
> Running command: /bin/chown -R ceph:ceph /dev/dm-21
> Running command: /bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-21/block.db
> Running command: /bin/chown -R ceph:ceph /dev/dm-21
> Running command: /bin/systemctl enable 
> ceph-volume@lvm-21-e6f688e0-3e71-4ee6-90f3-b3c07a99059f
> Running command: /bin/systemctl enable --runtime ceph-osd@21
>  stderr: Created symlink 
> /run/systemd/system/ceph-osd.target.wants/ceph-osd@21.servi

Re: [ceph-users] ceph-volume activate runs infinitely

2019-05-02 Thread Robert Sander
Hi,

On 02.05.19 13:40, Alfredo Deza wrote:

> Can you give a bit more details on the environment? How dense is the
> server? If the unit retries is fine and I was hoping at some point it
> would see things ready and start activating (it does retry
> indefinitely at the moment).

It is a machine with 13 Bluestore OSDs on LVM with SSDs as Block.DB devices.
The SSDs have also been setup with LVM. This has been done with "ceph-volume 
lvm batch".

The issue started with the latest Ubuntu updates (no Ceph updates involved)
and the following reboot. The customer let the boot process run for over
30 minutes but the ceph-volume activation services (and wpa-supplicant + logind)
were not able to start.

> Would also help to see what problems is it encountering as it can't
> get to activate. There are two logs for this, one for the systemd unit
> at /var/log/ceph/ceph-volume-systemd.log and the other one at
> /var/log/ceph/ceph-volume.log that might
> help.

Like these entries?

[2019-05-02 10:04:32,211][ceph_volume.process][INFO  ] stderr Job for 
ceph-osd@21.service canceled.
[2019-05-02 10:04:32,211][ceph_volume][ERROR ] exception caught by decorator
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/ceph_volume/decorators.py", line 59, 
in newfunc
return f(*a, **kw)
  File "/usr/lib/python2.7/dist-packages/ceph_volume/main.py", line 148, in main
terminal.dispatch(self.mapper, subcommand_args)
  File "/usr/lib/python2.7/dist-packages/ceph_volume/terminal.py", line 182, in 
dispatch
instance.main()
  File "/usr/lib/python2.7/dist-packages/ceph_volume/devices/lvm/main.py", line 
40, in main
terminal.dispatch(self.mapper, self.argv)
  File "/usr/lib/python2.7/dist-packages/ceph_volume/terminal.py", line 182, in 
dispatch
instance.main()
  File "/usr/lib/python2.7/dist-packages/ceph_volume/decorators.py", line 16, 
in is_root
return func(*a, **kw)
  File "/usr/lib/python2.7/dist-packages/ceph_volume/devices/lvm/trigger.py", 
line 70, in main
Activate(['--auto-detect-objectstore', osd_id, osd_uuid]).main()
  File "/usr/lib/python2.7/dist-packages/ceph_volume/devices/lvm/activate.py", 
line 339, in main
self.activate(args)
  File "/usr/lib/python2.7/dist-packages/ceph_volume/decorators.py", line 16, 
in is_root
return func(*a, **kw)
  File "/usr/lib/python2.7/dist-packages/ceph_volume/devices/lvm/activate.py", 
line 261, in activate
return activate_bluestore(lvs, no_systemd=args.no_systemd)
  File "/usr/lib/python2.7/dist-packages/ceph_volume/devices/lvm/activate.py", 
line 196, in activate_bluestore
systemctl.start_osd(osd_id)
  File "/usr/lib/python2.7/dist-packages/ceph_volume/systemd/systemctl.py", 
line 39, in start_osd
return start(osd_unit % id_)
  File "/usr/lib/python2.7/dist-packages/ceph_volume/systemd/systemctl.py", 
line 8, in start
process.run(['systemctl', 'start', unit])
  File "/usr/lib/python2.7/dist-packages/ceph_volume/process.py", line 153, in 
run
raise RuntimeError(msg)
RuntimeError: command returned non-zero exit status: 1


[2019-05-02 10:04:32,222][ceph_volume.process][INFO  ] stdout Running command: 
/bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-21
--> Absolute path not found for executable: restorecon
--> Ensure $PATH environment variable contains common executable locations
Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-21
Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir 
--dev 
/dev/ceph-block-393ba2fc-e970-4d48-8dcb-c6261dfdfe08/osd-block-931e2d94-63f6-4df8-baed-6873eb0123e2
 --path /var/lib/ceph/osd/ceph-21 --no-mon-config
Running command: /bin/ln -snf 
/dev/ceph-block-393ba2fc-e970-4d48-8dcb-c6261dfdfe08/osd-block-931e2d94-63f6-4df8-baed-6873eb0123e2
 /var/lib/ceph/osd/ceph-21/block
Running command: /bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-21/block
Running command: /bin/chown -R ceph:ceph /dev/dm-12
Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-21
Running command: /bin/ln -snf 
/dev/ceph-block-dbs-75eda181-946f-4a40-b4e0-8ecd60721398/osd-block-db-45ee9a1f-3ee2-4db9-a057-fd06fa1452e8
 /var/lib/ceph/osd/ceph-21/block.db
Running command: /bin/chown -h ceph:ceph 
/dev/ceph-block-dbs-75eda181-946f-4a40-b4e0-8ecd60721398/osd-block-db-45ee9a1f-3ee2-4db9-a057-fd06fa1452e8
Running command: /bin/chown -R ceph:ceph /dev/dm-21
Running command: /bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-21/block.db
Running command: /bin/chown -R ceph:ceph /dev/dm-21
Running command: /bin/systemctl enable 
ceph-volume@lvm-21-e6f688e0-3e71-4ee6-90f3-b3c07a99059f
Running command: /bin/systemctl enable --runtime ceph-osd@21
 stderr: Created symlink 
/run/systemd/system/ceph-osd.target.wants/ceph-osd@21.service → 
/lib/systemd/system/ceph-osd@.service.
Running command: /bin/systemctl start ceph-osd@21
 stderr: Job for ceph-osd@21.service canceled.

There is nothing in the global journal because journald had not
been started at that time.

> The "After=" directive is ju

Re: [ceph-users] ceph-volume activate runs infinitely

2019-05-02 Thread Alfredo Deza
On Thu, May 2, 2019 at 5:27 AM Robert Sander
 wrote:
>
> Hi,
>
> The ceph-volume@.service units on an Ubuntu 18.04.2 system
> run unlimited and do not finish.
>
> Only after we create this override config the system boots again:
>
> # /etc/systemd/system/ceph-volume@.service.d/override.conf
> [Unit]
> After=network-online.target local-fs.target time-sync.target ceph-mon.target
>
> It looks like "After=local-fs.target" (the original value) is not
> enough for the dependencies.

Can you give a bit more details on the environment? How dense is the
server? If the unit retries, that's fine -- I was hoping at some point it
would see things ready and start activating (it does retry
indefinitely at the moment).

It would also help to see what problems it is encountering as it can't
get to activate. There are two logs for this, one for the systemd unit
at /var/log/ceph/ceph-volume-systemd.log and the other one at
/var/log/ceph/ceph-volume.log, that might help.

The "After=" directive is just adding some wait time to start
activating here, so I wonder how is it that your OSDs didn't
eventually came up.
>
> Regards
> --
> Robert Sander
> Heinlein Support GmbH
> Schwedter Str. 8/9b, 10119 Berlin
>
> https://www.heinlein-support.de
>
> Tel: 030 / 405051-43
> Fax: 030 / 405051-19
>
> Amtsgericht Berlin-Charlottenburg - HRB 93818 B
> Geschäftsführer: Peer Heinlein - Sitz: Berlin
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore Compression

2019-05-02 Thread Igor Fedotov

Hi Ashley,

The general rule is that the compression switch does not affect existing
data; it only controls how future write requests are processed.

You can enable/disable compression at any time.

Once disabled, no more compression happens. Data that has already been
compressed remains in that state until removal, overwrite, or (under
some circumstances, when keeping it compressed isn't beneficial any
more) garbage collection. The latter is mostly triggered by partial
overwrites.
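
For example, to stop compressing new writes on an existing pool (pool name
is only an example):

ceph osd pool set mypool compression_mode none
ceph osd pool get mypool compression_mode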



Thanks,

Igor

On 5/2/2019 12:20 PM, Ashley Merrick wrote:

Hello,

I am aware that when enabling compression in bluestore it will only 
compress new data.


However, if I had compression enabled for a period of time, is it then 
possible to disable compression and any data that was compressed 
continue to be uncompressed on read as normal but any new data not be 
compressed.


Or once it's enabled for a pool there is no going back apart from 
creating a new fresh pool?


, Ashley



[ceph-users] ceph-volume activate runs infinitely

2019-05-02 Thread Robert Sander
Hi,

The ceph-volume@.service units on an Ubuntu 18.04.2 system
run unlimited and do not finish.

Only after we create this override config the system boots again:

# /etc/systemd/system/ceph-volume@.service.d/override.conf
[Unit]
After=network-online.target local-fs.target time-sync.target ceph-mon.target

It looks like "After=local-fs.target" (the original value) is not
enough for the dependencies.
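
For reference, a drop-in like that can be created with systemd's own
mechanism, e.g.:

systemctl edit ceph-volume@.service    (paste the [Unit] override above)
systemctl daemon-reload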

Regards
-- 
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 93818 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin





[ceph-users] Bluestore Compression

2019-05-02 Thread Ashley Merrick
Hello,

I am aware that when enabling compression in bluestore it will only
compress new data.

However, if I had compression enabled for a period of time, is it then
possible to disable compression so that any data that was compressed
continues to be decompressed on read as normal, but any new data is not
compressed?

Or, once it's enabled for a pool, is there no going back apart from
creating a fresh new pool?

, Ashley


[ceph-users] sync rados objects to other cluster

2019-05-02 Thread Florian Engelmann

Hi,

we need to migrate a ceph pool used for gnocchi to another cluster in 
another datacenter. Gnocchi uses the python rados or cradox module to 
access the Ceph cluster. The pool is dedicated to gnocchi only. The 
source pool is based on HDD OSDs while the target pool is SSD-only. As
there are > 600,000 small objects (total = 12 GB) in the pool, a


rados export ... - | ssh ... rados import -

takes too long (more than 2 hours), so we would lose 2 hours of billing
data.


We will now try to add SSDs to the source cluster and modify the crush 
map to speed up the migration.


Are there any alternative options? Any "rsync" style RADOS tool?

All the best,
Flo




Re: [ceph-users] co-located cephfs client deadlock

2019-05-02 Thread Dan van der Ster
On the stuck client:

  cat /sys/kernel/debug/ceph/*/osdc

REQUESTS 0 homeless 0
LINGER REQUESTS
BACKOFFS
REQUESTS 1 homeless 0
245540 osd100 1.9443e2a5 1.2a5 [100,1,75]/100 [100,1,75]/100 e74658
fsvolumens_393f2dcc-6b09-44d7-8d20-0e84b072ed26/2000b2f5905.0001
0x400024 1 write
LINGER REQUESTS
BACKOFFS

osd.100 is clearly there ^^
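
The restart itself was nothing special, roughly:

ceph osd ok-to-stop 100    (if your release has it)
systemctl restart ceph-osd@100

on the host carrying osd.100.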

-- dan

On Thu, May 2, 2019 at 9:25 AM Marc Roos  wrote:
>
>
> How did you retrieve what OSD number to restart?
>
> Just for future reference, when I run into a similar situation. If you
> have a client hang on a osd node. This can be resolved by restarting
> the osd that it is reading from?
>
>
>
>
> -Original Message-
> From: Dan van der Ster [mailto:d...@vanderster.com]
> Sent: donderdag 2 mei 2019 8:51
> To: Yan, Zheng
> Cc: ceph-users; pablo.llo...@cern.ch
> Subject: Re: [ceph-users] co-located cephfs client deadlock
>
> On Mon, Apr 1, 2019 at 1:46 PM Yan, Zheng  wrote:
> >
> > On Mon, Apr 1, 2019 at 6:45 PM Dan van der Ster 
> wrote:
> > >
> > > Hi all,
> > >
> > > We have been benchmarking a hyperconverged cephfs cluster (kernel
> > > clients + osd on same machines) for awhile. Over the weekend (for
> > > the first time) we had one cephfs mount deadlock while some clients
> > > were running ior.
> > >
> > > All the ior processes are stuck in D state with this stack:
> > >
> > > [] wait_on_page_bit+0x83/0xa0 []
>
> > > __filemap_fdatawait_range+0x111/0x190
> > > [] filemap_fdatawait_range+0x14/0x30
> > > [] filemap_write_and_wait_range+0x56/0x90
> > > [] ceph_fsync+0x55/0x420 [ceph]
> > > [] do_fsync+0x67/0xb0 []
> > > SyS_fsync+0x10/0x20 []
> > > system_call_fastpath+0x22/0x27 []
> > > 0x
> > >
> >
> > are there hang osd requests in /sys/kernel/debug/ceph/xxx/osdc?
>
> We never managed to reproduce on this cluster.
>
> But on a separate (not co-located) cluster we had a similar issue. A
> client was stuck like this for several hours:
>
> HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs
> report slow requests MDS_CLIENT_LATE_RELEASE 1 clients failing to
> respond to capability release
> mdscephflax-mds-2a4cfd0e2c(mds.1): Client hpc070.cern.ch:hpcscid02
> failing to respond to capability release client_id: 69092525
> MDS_SLOW_REQUEST 1 MDSs report slow requests
> mdscephflax-mds-2a4cfd0e2c(mds.1): 1 slow requests are blocked > 30
> sec
>
>
> Indeed there was a hung write on hpc070.cern.ch:
>
> 245540  osd100  1.9443e2a5 1.2a5   [100,1,75]/100  [100,1,75]/100
> e74658
> fsvolumens_393f2dcc-6b09-44d7-8d20-0e84b072ed26/2000b2f5905.0001
> 0x4000241 write
>
> I restarted osd.100 and the deadlocked request went away.
> Does this sound like a known issue?
>
> Thanks, Dan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


Re: [ceph-users] co-located cephfs client deadlock

2019-05-02 Thread Marc Roos
 
How did you retrieve what OSD number to restart?

Just for future reference, in case I run into a similar situation: if you
have a client hanging on an OSD node, can this be resolved by restarting
the OSD that it is reading from?




-Original Message-
From: Dan van der Ster [mailto:d...@vanderster.com] 
Sent: donderdag 2 mei 2019 8:51
To: Yan, Zheng
Cc: ceph-users; pablo.llo...@cern.ch
Subject: Re: [ceph-users] co-located cephfs client deadlock

On Mon, Apr 1, 2019 at 1:46 PM Yan, Zheng  wrote:
>
> On Mon, Apr 1, 2019 at 6:45 PM Dan van der Ster  
wrote:
> >
> > Hi all,
> >
> > We have been benchmarking a hyperconverged cephfs cluster (kernel 
> > clients + osd on same machines) for awhile. Over the weekend (for 
> > the first time) we had one cephfs mount deadlock while some clients 
> > were running ior.
> >
> > All the ior processes are stuck in D state with this stack:
> >
> > [] wait_on_page_bit+0x83/0xa0 [] 

> > __filemap_fdatawait_range+0x111/0x190
> > [] filemap_fdatawait_range+0x14/0x30 
> > [] filemap_write_and_wait_range+0x56/0x90
> > [] ceph_fsync+0x55/0x420 [ceph] 
> > [] do_fsync+0x67/0xb0 [] 
> > SyS_fsync+0x10/0x20 [] 
> > system_call_fastpath+0x22/0x27 [] 
> > 0x
> >
>
> are there hang osd requests in /sys/kernel/debug/ceph/xxx/osdc?

We never managed to reproduce on this cluster.

But on a separate (not co-located) cluster we had a similar issue. A 
client was stuck like this for several hours:

HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs 
report slow requests MDS_CLIENT_LATE_RELEASE 1 clients failing to 
respond to capability release
mdscephflax-mds-2a4cfd0e2c(mds.1): Client hpc070.cern.ch:hpcscid02 
failing to respond to capability release client_id: 69092525 
MDS_SLOW_REQUEST 1 MDSs report slow requests
mdscephflax-mds-2a4cfd0e2c(mds.1): 1 slow requests are blocked > 30 
sec


Indeed there was a hung write on hpc070.cern.ch:

245540  osd100  1.9443e2a5 1.2a5   [100,1,75]/100  [100,1,75]/100
e74658  
fsvolumens_393f2dcc-6b09-44d7-8d20-0e84b072ed26/2000b2f5905.0001
0x4000241 write

I restarted osd.100 and the deadlocked request went away.
Does this sound like a known issue?

Thanks, Dan


Re: [ceph-users] hardware requirements for metadata server

2019-05-02 Thread Marc Roos



I have only 366M of metadata stored in an SSD pool, with 16TB (10 million
objects) of filesystem data (HDD pools).
The active MDS is using 13GB of memory.
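
How much of that is cache can be checked via the admin socket, e.g.
(assuming the daemon name mds.a as below):

ceph daemon mds.a config get mds_cache_memory_limit
ceph daemon mds.a cache status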

Some stats from the active MDS server:
[@c01 ~]# ceph daemonperf mds.a
---mds --mds_cache--- --mds_log-- 
-mds_mem- mds_server- mds_ -objecter-- purg
req  rlat fwd  inos caps exi  imi |stry recy recd|subm evts segs 
repl|ino  dn  |hcr  hcs  hsr  cre |sess|actv rd   wr   rdwr|purg|
  000  3.6M  72k   00 | 28k   00 |  0   85k 1290 
|2.3M 3.6M|  0000 | 16 |  0000 |  0
  000  3.6M  72k   00 | 28k   00 |  0   85k 1290 
|2.3M 3.6M|  0000 | 16 |  0000 |  0
  000  3.6M  72k   00 | 28k   00 |  0   85k 1290 
|2.3M 3.6M|  0000 | 16 |  0000 |  0
  000  3.6M  72k   00 | 28k   00 |  0   85k 1290 
|2.3M 3.6M|  0000 | 16 |  0000 |  0
  000  3.6M  72k   00 | 28k   00 |  0   85k 1290 
|2.3M 3.6M|  0300 | 16 |  0000 |  0
  000  3.6M  72k   00 | 28k   00 |  6   85k 1290 
|2.3M 3.6M|  0500 | 16 |  0060 |  0
  000  3.6M  72k   00 | 28k   00 |  0   85k 1290 
|2.3M 3.6M|  0200 | 16 |  0000 |  0
  000  3.6M  72k   00 | 28k   00 |  0   85k 1290 
|2.3M 3.6M|  0000 | 16 |  0000 |  0
  000  3.6M  72k   00 | 28k   00 |  0   85k 1290 
|2.3M 3.6M|  0100 | 16 |  0000 |  0
  000  3.6M  72k   00 | 28k   00 |  3   85k 1290 
|2.3M 3.6M|  0100 | 16 |  0000 |  0

-Original Message-
From: Manuel Sopena Ballesteros [mailto:manuel...@garvan.org.au] 
Sent: donderdag 2 mei 2019 2:46
To: ceph-users@lists.ceph.com
Subject: [ceph-users] hardware requirements for metadata server

Dear Ceph users,

 

I would like to ask: does the metadata server need much block device
storage, or does it only need RAM? How can I calculate the amount of
disk and/or memory needed?

 

Thank you very much
