Re: [ceph-users] ceph-fuse segfaults ( jewel 10.2.2)

2016-07-04 Thread Patrick Donnelly
Hi Goncalo,

I believe this segfault may be the one fixed here:

https://github.com/ceph/ceph/pull/10027

(Sorry for the brief top-post. I'm on mobile.)

On Jul 4, 2016 9:16 PM, "Goncalo Borges" 
wrote:
>
> Dear All...
>
> We have recently migrated all our ceph infrastructure from 9.2.0 to
10.2.2.
>
> We are currently using ceph-fuse to mount cephfs in a number of clients.
>
> ceph-fuse 10.2.2 client is segfaulting in some situations. One of the
scenarios where ceph-fuse segfaults is when a user submits a parallel (mpi)
application requesting 4 hosts with 4 cores each (16 instances in total) .
According to the user, each instance has its own dedicated inputs and
outputs.
>
> Please note that if we go back to ceph-fuse 9.2.0 client everything works
fine.
>
> The ceph-fuse 10.2.2 client segfault is the following (we were able to
capture it mounting ceph-fuse in debug mode):
>>
>> 2016-07-04 21:21:00.074087 7f6aed92be40  0 ceph version 10.2.2
(45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-fuse, pid 7346
>> ceph-fuse[7346]: starting ceph client
>> 2016-07-04 21:21:00.107816 7f6aed92be40 -1 init, newargv =
0x7f6af8c12320 newargc=11
>> ceph-fuse[7346]: starting fuse
>> *** Caught signal (Segmentation fault) **
>>  in thread 7f69d7fff700 thread_name:ceph-fuse
>>  ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>>  1: (()+0x297ef2) [0x7f6aedbecef2]
>>  2: (()+0x3b88c0f7e0) [0x7f6aec64b7e0]
>>  3: (Client::get_root_ino()+0x10) [0x7f6aedaf0330]
>>  4: (CephFuse::Handle::make_fake_ino(inodeno_t, snapid_t)+0x175)
[0x7f6aedaee035]
>>  5: (()+0x199891) [0x7f6aedaee891]
>>  6: (()+0x15b76) [0x7f6aed50db76]
>>  7: (()+0x12aa9) [0x7f6aed50aaa9]
>>  8: (()+0x3b88c07aa1) [0x7f6aec643aa1]
>>  9: (clone()+0x6d) [0x7f6aeb8d193d]
>> 2016-07-05 10:09:14.045131 7f69d7fff700 -1 *** Caught signal
(Segmentation fault) **
>>  in thread 7f69d7fff700 thread_name:ceph-fuse
>>
>>  ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>>  1: (()+0x297ef2) [0x7f6aedbecef2]
>>  2: (()+0x3b88c0f7e0) [0x7f6aec64b7e0]
>>  3: (Client::get_root_ino()+0x10) [0x7f6aedaf0330]
>>  4: (CephFuse::Handle::make_fake_ino(inodeno_t, snapid_t)+0x175)
[0x7f6aedaee035]
>>  5: (()+0x199891) [0x7f6aedaee891]
>>  6: (()+0x15b76) [0x7f6aed50db76]
>>  7: (()+0x12aa9) [0x7f6aed50aaa9]
>>  8: (()+0x3b88c07aa1) [0x7f6aec643aa1]
>>  9: (clone()+0x6d) [0x7f6aeb8d193d]
>>  NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.
>>
>>
> The full dump is quite long. Here are the very last bits of it. Let me
know if you need the full dump.
>>
>> --- begin dump of recent events ---
>>  -> 2016-07-05 10:09:13.956502 7f6a5700  3 client.464559
_getxattr(137c789, "security.capability", 0) = -61
>>  -9998> 2016-07-05 10:09:13.956507 7f6aa96fa700  3 client.464559
ll_write 0x7f6a08028be0 137c78c 20094~34
>>  -9997> 2016-07-05 10:09:13.956527 7f6aa96fa700  3 client.464559
ll_write 0x7f6a08028be0 20094~34 = 34
>>  -9996> 2016-07-05 10:09:13.956535 7f69d7fff700  3 client.464559
ll_write 0x7f6a100145f0 137c78d 28526~34
>>  -9995> 2016-07-05 10:09:13.956553 7f69d7fff700  3 client.464559
ll_write 0x7f6a100145f0 28526~34 = 34
>>  -9994> 2016-07-05 10:09:13.956561 7f6ac0dfa700  3 client.464559
ll_forget 137c78c 1
>>  -9993> 2016-07-05 10:09:13.956569 7f6a5700  3 client.464559
ll_forget 137c789 1
>>  -9992> 2016-07-05 10:09:13.956577 7f6a5ebfd700  3 client.464559
ll_write 0x7f6a94006350 137c789 22010~216
>>  -9991> 2016-07-05 10:09:13.956594 7f6a5ebfd700  3 client.464559
ll_write 0x7f6a94006350 22010~216 = 216
>>  -9990> 2016-07-05 10:09:13.956603 7f6aa8cf9700  3 client.464559
ll_getxattr 137c78c.head security.capability size 0
>>  -9989> 2016-07-05 10:09:13.956609 7f6aa8cf9700  3 client.464559
_getxattr(137c78c, "security.capability", 0) = -61
>>
>> 
>>
>>   -160> 2016-07-05 10:09:14.043687 7f69d7fff700  3 client.464559
_getxattr(137c78a, "security.capability", 0) = -61
>>   -159> 2016-07-05 10:09:14.043694 7f6ac0dfa700  3 client.464559
ll_write 0x7f6a08042560 137c78b 11900~34
>>   -158> 2016-07-05 10:09:14.043712 7f6ac0dfa700  3 client.464559
ll_write 0x7f6a08042560 11900~34 = 34
>>   -157> 2016-07-05 10:09:14.043722 7f6ac17fb700  3 client.464559
ll_getattr 11e9c80.head
>>   -156> 2016-07-05 10:09:14.043727 7f6ac17fb700  3 client.464559
ll_getattr 11e9c80.head = 0
>>   -155> 2016-07-05 10:09:14.043734 7f69d7fff700  3 client.464559
ll_forget 137c78a 1
>>   -154> 2016-07-05 10:09:14.043738 7f6a5ebfd700  3 client.464559
ll_write 0x7f6a140d5930 137c78a 18292~34
>>   -153> 2016-07-05 10:09:14.043759 7f6a5ebfd700  3 client.464559
ll_write 0x7f6a140d5930 18292~34 = 34
>>   -152> 2016-07-05 10:09:14.043767 7f6ac17fb700  3 client.464559
ll_forget 11e9c80 1
>>   -151> 2016-07-05 10:09:14.043784 7f6aa8cf9700  3 client.464559
ll_flush 0x7f6a00049fe0 11e9c80
>>   -150> 2016-07-05 10:09:14.043794 7f6aa8cf9700  3 client.464559
ll_getxattr 137c78a.head security.capability 

Re: [ceph-users] ceph-fuse segfaults ( jewel 10.2.2)

2016-07-06 Thread Patrick Donnelly
Hi Goncalo,

On Wed, Jul 6, 2016 at 2:18 AM, Goncalo Borges
 wrote:
> Just to confirm that, after applying the patch and recompiling, we are no
> longer seeing segfaults.
>
> I just tested with a user application which would kill ceph-fuse almost
> instantaneously.  Now it is running for quite some time, reading and
> updating the files that it should.
>
> I should test with other applications which were also triggering the
> ceph-fuse segfault, but for now, it is looking good.

Great, thanks for letting us know it worked.

> Is there a particular reason why in 9.2.0 we were not getting such
> segfaults? I am asking because the patch was simply to introduce two lock
> functions in two specific lines of src/client/Client.cc  which, I imagine,
> were also not there in 9.2.0 (unless there was a big rewrite of
> src/client/Client.cc from 9.2.0 to 10.2.2)

The locks were missing in 9.2.0 as well. There were probably unreported
or unresolved instances of the segfault there too.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-fuse segfaults ( jewel 10.2.2)

2016-07-07 Thread Patrick Donnelly
On Thu, Jul 7, 2016 at 2:01 AM, Goncalo Borges
 wrote:
> Unfortunately, the other user application breaks ceph-fuse again (it is a
> completely different application than in my previous test).
>
> We have tested it in 4 machines with 4 cores. The user is submitting 16
> single core jobs which are all writing different output files (one per job)
> to a common dir in cephfs. The first 4 jobs run happily and never break
> ceph-fuse. But the remaining 12 jobs, running in the remaining 3 machines,
> trigger a segmentation fault, which is completely different from the other
> case.
>
> ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
> 1: (()+0x297fe2) [0x7f54402b7fe2]
> 2: (()+0xf7e0) [0x7f543ecf77e0]
> 3: (ObjectCacher::bh_write_scattered(std::list std::allocator >&)+0x36) [0x7f5440268086]
> 4: (ObjectCacher::bh_write_adjacencies(ObjectCacher::BufferHead*,
> std::chrono::time_point std::chrono::duration > >, long*,
> int*)+0x22c) [0x7f5440268a3c]
> 5: (ObjectCacher::flush(long)+0x1ef) [0x7f5440268cef]
> 6: (ObjectCacher::flusher_entry()+0xac4) [0x7f5440269a34]
> 7: (ObjectCacher::FlusherThread::entry()+0xd) [0x7f5440275c6d]
> 8: (()+0x7aa1) [0x7f543ecefaa1]
>  9: (clone()+0x6d) [0x7f543df6893d]
> NOTE: a copy of the executable, or `objdump -rdS ` is needed to
> interpret this.

This one looks like a very different problem. I've created an issue
here: http://tracker.ceph.com/issues/16610

Thanks for the report and debug log!

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds standby + standby-reply upgrade

2016-07-08 Thread Patrick Donnelly
Hi Dzianis,

On Thu, Jun 30, 2016 at 4:03 PM, Dzianis Kahanovich  wrote:
> Upgraded infernalis->jewel (git, Gentoo). The upgrade was done via a global
> stop/restart of everything in one shot.
>
> Infernalis: e5165: 1/1/1 up {0=c=up:active}, 1 up:standby-replay, 1 up:standby
>
> Now, after the upgrade and the next mon restart, the active monitor fails with
> "assert(info.state == MDSMap::STATE_STANDBY)" (even without a running mds).

This is the first time you've upgraded your pool to Jewel, right?
Straight from 9.x to 10.2.2?

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-fuse segfaults ( jewel 10.2.2)

2016-07-11 Thread Patrick Donnelly
Hi Goncalo,

On Fri, Jul 8, 2016 at 3:01 AM, Goncalo Borges
 wrote:
> 5./ I have noticed that ceph-fuse (in 10.2.2) consumes about 1.5 GB of
> virtual memory when there is no applications using the filesystem.
>
>  7152 root  20   0 1108m  12m 5496 S  0.0  0.0   0:00.04 ceph-fuse
>
> When I only have one instance of the user application running, ceph-fuse (in
> 10.2.2) slowly rises with time up to 10 GB of memory usage.
>
> if I submit a large number of user applications simultaneously, ceph-fuse
> goes very fast to ~10GB.
>
>   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
> 18563 root  20   0 10.0g 328m 5724 S  4.0  0.7   1:38.00 ceph-fuse
>  4343 root  20   0 3131m 237m  12m S  0.0  0.5  28:24.56 dsm_om_connsvcd
>  5536 goncalo   20   0 1599m  99m  32m R 99.9  0.2  31:35.46 python
> 31427 goncalo   20   0 1597m  89m  20m R 99.9  0.2  31:35.88 python
> 20504 goncalo   20   0 1599m  89m  20m R 100.2  0.2  31:34.29 python
> 20508 goncalo   20   0 1599m  89m  20m R 99.9  0.2  31:34.20 python
>  4973 goncalo   20   0 1599m  89m  20m R 99.9  0.2  31:35.70 python
>  1331 goncalo   20   0 1597m  88m  20m R 99.9  0.2  31:35.72 python
> 20505 goncalo   20   0 1597m  88m  20m R 99.9  0.2  31:34.46 python
> 20507 goncalo   20   0 1599m  87m  20m R 99.9  0.2  31:34.37 python
> 28375 goncalo   20   0 1597m  86m  20m R 99.9  0.2  31:35.52 python
> 20503 goncalo   20   0 1597m  85m  20m R 100.2  0.2  31:34.09 python
> 20506 goncalo   20   0 1597m  84m  20m R 99.5  0.2  31:34.42 python
> 20502 goncalo   20   0 1597m  83m  20m R 99.9  0.2  31:34.32 python

I've seen this type of thing before. It could be glibc's malloc arenas
for threads. See:

https://www.ibm.com/developerworks/community/blogs/kevgrig/entry/linux_glibc_2_10_rhel_6_malloc_may_show_excessive_virtual_memory_usage?lang=en

I would guess there are 20 cores on this machine*?

* 20 = 10GB/(8*64MB)

If the cause here is glibc arenas, I don't think we need to do
anything special. The virtual memory is not actually being used due to
Linux overcommit.
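
If you want to sanity-check that theory, something like the following
should do it (adjust the mount point and options to match yours;
MALLOC_ARENA_MAX is a glibc knob, not a Ceph one, and the numbers are
only for comparison):

$ nproc                                        # core count behind the 8*64MB-per-core estimate
$ grep VmSize /proc/$(pidof ceph-fuse)/status  # current virtual size
$ MALLOC_ARENA_MAX=4 ceph-fuse /mnt/cephfs     # remount with fewer arenas and compare VIRT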

> 6./ On the machines where the user had the segfault, we have 16 GB of RAM
> and 1GB of SWAP
>
> Mem:  16334244k total,  3590100k used, 12744144k free,   221364k buffers
> Swap:  1572860k total,10512k used,  1562348k free,  2937276k cached

But do we know that ceph-fuse is using 10G VM on those machines (the
core count may be different)?

> 7./ I think what is happening is that once the user submits his sets of
> jobs, the memory usage goes to the very limit on this type machine, and the
> raise is actually to fast that ceph-fuse segfaults before OOM Killer can
> kill it.

It's possible, but we have no evidence yet that ceph-fuse is using up
all the memory on those machines, right?

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS write performance

2016-07-19 Thread Patrick Donnelly
On Tue, Jul 19, 2016 at 10:25 AM, Fabiano de O. Lucchese
 wrote:
> I configured the cluster to replicate data twice (3 copies), so these
> numbers fall within my expectations. So far so good, but here comes the
> issue: I configured CephFS and mounted a share locally on one of my servers.
> When I write data to it, it shows abnormally high performance at the
> beginning for about 5 seconds, stalls for about 20 seconds and then picks up
> again. For long running tests, the observed write throughput is very close
> to what the rados bench provided (about 640 MB/s), but for short-lived
> tests, I get peak performances of over 5GB/s. I know that journaling is
> expected to cause spiky performance patterns like that, but not to this
> level, which makes me think that CephFS is buffering my writes and returning
> control back to the client before persisting them to the journal, which looks
> undesirable.

The client is buffering the writes to RADOS, which would give you the
abnormally high initial performance until the cache needs to be flushed.
You might try tweaking certain OSD settings:

http://docs.ceph.com/docs/hammer/rados/configuration/osd-config-ref/

in particular: "osd client message size cap". Also:

http://docs.ceph.com/docs/hammer/rados/configuration/journal-ref/
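
For example, something along these lines in ceph.conf on the OSD hosts
(the values are only placeholders to show where the knobs live, not
recommendations for your hardware):

[osd]
    # cap on outstanding client data an OSD will buffer (bytes)
    osd_client_message_size_cap = 262144000
    # journal write batching, from the second link above
    journal_max_write_bytes = 10485760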

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS snapshot preferred behaviors

2016-07-27 Thread Patrick Donnelly
On Mon, Jul 25, 2016 at 5:41 PM, Gregory Farnum  wrote:
> Some specific questions:
> * Right now, we allow users to rename snapshots. (This is newish, so
> you may not be aware of it if you've been using snapshots for a
> while.) Is that an important ability to preserve?

IMO, renaming snapshots is very useful when doing regular time-based
snapshots (e.g. a "today" snapshot is renamed "yesterday"). This is a
very popular feature in ZFS.

> * If you create a hard link at "/1/2/foo/bar" pointing at "/1/3/bar"
> and then take a snapshot at "/1/2/foo", it *will not* capture the file
> data in bar. Is that okay? Doing otherwise is *exceedingly* difficult.

This is only the case if /1/2/foo/ does not have the embedded inode
for "bar", right? (That's normally the case but an intervening unlink
of "1/3/bar" may eventually cause "/1/2/foo/bar" to become the new
primary inode?)

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS crash

2016-08-10 Thread Patrick Donnelly
Hello Randy,

On Wed, Aug 10, 2016 at 12:20 PM, Randy Orr  wrote:
> mds/Locker.cc: In function 'bool Locker::check_inode_max_size(CInode*, bool,
> bool, uint64_t, bool, uint64_t, utime_t)' thread 7fc305b83700 time
> 2016-08-09 18:51:50.626630
> mds/Locker.cc: 2190: FAILED assert(in->is_file())
>
>  ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x8b) [0x563d1e0a2d3b]
>  2: (Locker::check_inode_max_size(CInode*, bool, bool, unsigned long, bool,
> unsigned long, utime_t)+0x15e3) [0x563d1de506a3]
>  3: (Server::handle_client_open(std::shared_ptr&)+0x1061)
> [0x563d1dd386a1]
>  4: (Server::dispatch_client_request(std::shared_ptr&)+0xa0b)
> [0x563d1dd5709b]
>  5: (Server::handle_client_request(MClientRequest*)+0x47f) [0x563d1dd5768f]
>  6: (Server::dispatch(Message*)+0x3bb) [0x563d1dd5b8db]
>  7: (MDSRank::handle_deferrable_message(Message*)+0x80c) [0x563d1dce1f8c]
>  8: (MDSRank::_dispatch(Message*, bool)+0x1e1) [0x563d1dceb081]
>  9: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x563d1dcec1d5]
>  10: (MDSDaemon::ms_dispatch(Message*)+0xc3) [0x563d1dcd3f83]
>  11: (DispatchQueue::entry()+0x78b) [0x563d1e1996cb]
>  12: (DispatchQueue::DispatchThread::entry()+0xd) [0x563d1e08862d]
>  13: (()+0x8184) [0x7fc30bd7c184]
>  14: (clone()+0x6d) [0x7fc30a2d337d]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to
> interpret this.

I have a bug report filed for this issue: http://tracker.ceph.com/issues/16983

I believe it should be straightforward to solve and we'll have a fix
for it soon.

Thanks for the report!

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS crash

2016-08-10 Thread Patrick Donnelly
Randy, are you using ceph-fuse or the kernel client (or something else)?

On Wed, Aug 10, 2016 at 2:33 PM, Randy Orr  wrote:
> Great, thank you. Please let me know if I can be of any assistance in
> testing or validating a fix.
>
> -Randy
>
> On Wed, Aug 10, 2016 at 1:21 PM, Patrick Donnelly 
> wrote:
>>
>> Hello Randy,
>>
>> On Wed, Aug 10, 2016 at 12:20 PM, Randy Orr  wrote:
>> > mds/Locker.cc: In function 'bool Locker::check_inode_max_size(CInode*,
>> > bool,
>> > bool, uint64_t, bool, uint64_t, utime_t)' thread 7fc305b83700 time
>> > 2016-08-09 18:51:50.626630
>> > mds/Locker.cc: 2190: FAILED assert(in->is_file())
>> >
>> >  ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
>> >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> > const*)+0x8b) [0x563d1e0a2d3b]
>> >  2: (Locker::check_inode_max_size(CInode*, bool, bool, unsigned long,
>> > bool,
>> > unsigned long, utime_t)+0x15e3) [0x563d1de506a3]
>> >  3: (Server::handle_client_open(std::shared_ptr&)+0x1061)
>> > [0x563d1dd386a1]
>> >  4:
>> > (Server::dispatch_client_request(std::shared_ptr&)+0xa0b)
>> > [0x563d1dd5709b]
>> >  5: (Server::handle_client_request(MClientRequest*)+0x47f)
>> > [0x563d1dd5768f]
>> >  6: (Server::dispatch(Message*)+0x3bb) [0x563d1dd5b8db]
>> >  7: (MDSRank::handle_deferrable_message(Message*)+0x80c)
>> > [0x563d1dce1f8c]
>> >  8: (MDSRank::_dispatch(Message*, bool)+0x1e1) [0x563d1dceb081]
>> >  9: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x563d1dcec1d5]
>> >  10: (MDSDaemon::ms_dispatch(Message*)+0xc3) [0x563d1dcd3f83]
>> >  11: (DispatchQueue::entry()+0x78b) [0x563d1e1996cb]
>> >  12: (DispatchQueue::DispatchThread::entry()+0xd) [0x563d1e08862d]
>> >  13: (()+0x8184) [0x7fc30bd7c184]
>> >  14: (clone()+0x6d) [0x7fc30a2d337d]
>> >  NOTE: a copy of the executable, or `objdump -rdS ` is
>> > needed to
>> > interpret this.
>>
>> I have a bug report filed for this issue:
>> http://tracker.ceph.com/issues/16983
>>
>> I believe it should be straightforward to solve and we'll have a fix
>> for it soon.
>>
>> Thanks for the report!
>>
>> --
>> Patrick Donnelly
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mds log: dne in the mdsmap

2017-07-11 Thread Patrick Donnelly
On Tue, Jul 11, 2017 at 7:36 AM, John Spray  wrote:
> On Tue, Jul 11, 2017 at 3:23 PM, Webert de Souza Lima
>  wrote:
>> Hello,
>>
>> today I got a MDS respawn with the following message:
>>
>> 2017-07-11 07:07:55.397645 7ffb7a1d7700  1 mds.b handle_mds_map i
>> (10.0.1.2:6822/28190) dne in the mdsmap, respawning myself
>
> "dne in the mdsmap" is what an MDS says when the monitors have
> concluded that the MDS is dead, but the MDS is really alive.  "dne"
> stands for "does not exist", so the MDS is complaining that it has
> been removed from the mdsmap.
>
> The message could definitely be better worded!

Tracker ticket: http://tracker.ceph.com/issues/20583

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] updating the documentation

2017-07-12 Thread Patrick Donnelly
On Wed, Jul 12, 2017 at 11:29 AM, Sage Weil  wrote:
> In the meantime, we can also avoid making the problem worse by requiring
> that all pull requests include any relevant documentation updates.  This
> means (1) helping educate contributors that doc updates are needed, (2)
> helping maintainers and reviewers remember that doc updates are part of
> the merge criteria (it will likely take a bit of time before this is
> second nature), and (3) generally inducing developers to become aware of
> the documentation that exists so that they know what needs to be updated
> when they make a change.

There was a joke about adding a bot which automatically fails PRs that
have no documentation, but I think there is a way to make that work
reasonably. Perhaps the bot could simply comment on all PRs touching
src/ that documentation is required (and where to look), and then fail
a doc check. A developer would then comment on the PR to say it passes
the documentation requirements before the bot changes the check to pass.

This addresses all three points in an automatic way.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Stealth Jewel release?

2017-07-12 Thread Patrick Donnelly
On Wed, Jul 12, 2017 at 11:31 AM, Dan van der Ster  wrote:
> On Wed, Jul 12, 2017 at 5:51 PM, Abhishek L
>  wrote:
>> On Wed, Jul 12, 2017 at 9:13 PM, Xiaoxi Chen  wrote:
>>> +However, it also introduced a regression that could cause MDS damage.
>>> +Therefore, we do *not* recommend that Jewel users upgrade to this version -
>>> +instead, we recommend upgrading directly to v10.2.9 in which the 
>>> regression is
>>> +fixed.
>>>
>>> It looks like this version is NOT production ready. Curious why we
>>> want a not-recommended version to be released?
>>
>> We found a regression in MDS right after packages were built, and the release
>> was about to be announced. This is why we didn't announce the release.
>> We're  currently running tests after the fix for MDS was merged.
>>
>> So when we do announce the release we'll announce 10.2.9 so that users
>> can upgrade from 10.2.7->10.2.9
>
> Suppose some users already upgraded their CephFS to 10.2.8 -- what is
> the immediate recommended course of action? Downgrade or wait for
> 10.2.9?

I'm not aware of any changes that would make downgrading back to 10.2.7
a problem, but the safest thing to do would be to replace the v10.2.8
ceph-mds binaries with the v10.2.7 binaries. If that's not practical, I
would recommend a cluster-wide downgrade to 10.2.7.
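
Roughly, on each MDS host that would look something like this (assuming
an RPM-based install and the stock systemd unit names; adjust the
package manager command and the unit id for your setup):

$ systemctl stop ceph-mds@$(hostname -s)
$ yum downgrade ceph-mds-10.2.7        # or the equivalent for your package manager
$ systemctl start ceph-mds@$(hostname -s)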

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Change the meta data pool of cephfs

2017-07-12 Thread Patrick Donnelly
On Tue, Jul 11, 2017 at 12:22 AM, Marc Roos  wrote:
>
>
> Is it possible to change the cephfs meta data pool. I would like to
> lower the pg's. And thought about just making a new pool, copying the
> pool and then renaming them. But I guess cephfs works with the pool id
> not? How can this be best done?

There is currently no way to change the metadata pool except through
manual recovery into a new pool:
http://docs.ceph.com/docs/master/cephfs/disaster-recovery/#using-an-alternate-metadata-pool-for-recovery

I would strongly recommend backups before trying such a procedure.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Stealth Jewel release?

2017-07-14 Thread Patrick Donnelly
On Fri, Jul 14, 2017 at 12:26 AM, Martin Palma  wrote:
> So only the ceph-mds is affected? Let's say if we have mons and osds
> on 10.2.8 and the MDS on 10.2.6 or 10.2.7 we would be "safe"?

Yes, only the MDS was affected.

As Udo mentioned, v10.2.9 is out so feel free to upgrade to that instead.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Yet another performance tuning for CephFS

2017-07-17 Thread Patrick Donnelly
Hi Gencer,

On Mon, Jul 17, 2017 at 12:31 PM,   wrote:
> I located and applied almost every different tuning setting/config I found on
> the internet. I couldn't manage to speed things up one byte further. It is
> always the same speed whatever I do.

I understand you're frustrated, but this type of information isn't really
helpful. Instead, tell us which config settings you've tried tuning.

> I have 2 nodes with 10 OSD each and each OSD is 3TB SATA drive. Each node
> has 24 cores and 64GB of RAM. Ceph nodes are connected via 10GbE NIC. No
> FUSE used. But tried that too. Same results.
>
>
>
> $ dd if=/dev/zero of=/mnt/c/testfile bs=100M count=10 oflag=direct

This looks like your problem: don't use oflag=direct. That will cause
CephFS to do synchronous I/O at great cost to performance in order to
avoid buffering by the client.
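
For example, dropping oflag=direct and using conv=fdatasync instead
gives a fairer number, because the final flush is still included in the
timing but the client is allowed to buffer:

$ dd if=/dev/zero of=/mnt/c/testfile bs=100M count=10 conv=fdatasync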

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Yet another performance tuning for CephFS

2017-07-17 Thread Patrick Donnelly
On Mon, Jul 17, 2017 at 1:08 PM,   wrote:
> But let's try another one. Let's say I have a file on my server which is 5GB. If I
> do this:
>
> $ rsync ./bigfile /mnt/cephfs/targetfile --progress
>
> Then I see max. 200 mb/s. I think it is still slow :/ Is this expected?

Perhaps that is the bandwidth limit of your local device rsync is reading from?

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Defining quota in CephFS - quota is ignored

2017-07-26 Thread Patrick Donnelly
On Wed, Jul 26, 2017 at 2:26 AM,   wrote:
> Hello!
>
> Based on the documentation for defining quotas in CephFS for any directory 
> (http://docs.ceph.com/docs/master/cephfs/quota/), I defined a quota for 
> attribute max_bytes:
> ld4257:~ # getfattr -n ceph.quota.max_bytes /mnt/ceph-fuse/MTY/
> getfattr: Removing leading '/' from absolute path names
> # file: mnt/ceph-fuse/MTY/
> ceph.quota.max_bytes="1"
>
> To validate if the quota is working, I write a 128MB file in 
> /mnt/ceph-fuse/MTY:
> ld4257:~ # dd if=/dev/zero of=/mnt/ceph-fuse/MTY/128MBfile bs=64M count=2
> 2+0 records in
> 2+0 records out
> 134217728 bytes (134 MB, 128 MiB) copied, 0.351206 s, 382 MB/s
>
> This file is created correctly, and the utilization statistcs confirm it:
> ld4257:~ # rados df
> pool name KB  objects   clones degraded  
> unfound   rdrd KB   wrwr KB
> hdb-backup131072   3200   
>  08843251 88572586
> hdb-backup_metadata27920   2700   
>  0  301   168115 645955386
> rbd0000   
>  00000
> templates  0000   
>  00000
>   total used 9528188   59
>   total avail   811829446772
>   total space   811838974960
>
>
> Question:
> Why can I create a file with size 128MB after defining a quota of 100MB?

I don't have a cluster to check this on right now, but it may be because
a sparse file (you wrote all zeros) does not count its full size against
the quota (only what it actually uses). Retry with /dev/urandom.
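
For example (note that the client-side quota check is also somewhat
lazy, so a small overshoot can still happen even with non-sparse data):

$ dd if=/dev/urandom of=/mnt/ceph-fuse/MTY/128MBfile-rnd bs=64M count=2
$ du -sh /mnt/ceph-fuse/MTY/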

(And the usual disclaimer: quotas only work with libcephfs/ceph-fuse.
The kernel client does not support quotas.)

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-fuse hanging on df with ceph luminous >= 12.1.3

2017-08-22 Thread Patrick Donnelly
On Mon, Aug 21, 2017 at 5:37 PM, Alessandro De Salvo
 wrote:
> Hi,
>
> when trying to use df on a ceph-fuse mounted cephfs filesystem with ceph
> luminous >= 12.1.3 I'm having hangs with the following kind of messages in
> the logs:
>
>
> 2017-08-22 02:20:51.094704 7f80addb7700  0 client.174216 ms_handle_reset on
> 192.168.0.10:6789/0
>
>
> The logs are only showing this type of messages and nothing more useful. The
> only possible way to resume the operations is to kill ceph-fuse and remount.
> Only df is hanging though, while file operations, like copy/rm/ls are
> working as expected.
>
> This behavior is only shown for ceph >= 12.1.3, while for example ceph-fuse
> on 12.1.2 works.
>
> Anyone has seen the same problems? Any help is highly appreciated.

It could be caused by [1]. I don't see a particular reason why you
would experience a hang in the client. You can try adding "debug
client = 20" and "debug ms = 5" to your ceph.conf [2] to get more
information.

[1] https://github.com/ceph/ceph/pull/16378/
[2] http://docs.ceph.com/docs/master/rados/troubleshooting/log-and-debug/
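
Concretely, that would be something like the following in the [client]
section on the node running ceph-fuse (or the equivalent
--debug-client/--debug-ms options on the ceph-fuse command line; the log
path is just an example):

[client]
    debug_client = 20
    debug_ms = 5
    log_file = /var/log/ceph/ceph-fuse.$name.$pid.log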

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS Segfault 12.2.0

2017-09-18 Thread Patrick Donnelly
Hi Derek,

On Mon, Sep 18, 2017 at 1:30 PM, Derek Yarnell  wrote:
> We have a recent cluster upgraded from Jewel to Luminous.  Today we had
> a segmentation fault that left the file system degraded.  Systemd then
> decided to restart the daemon over and over with a different stack trace
> (which can be seen after the 10k events in the log file[0]).
>
> We tried to fail over to the standby, which also kept failing. After
> shutting down both MDSs for some time we brought one back online, and
> it seemed the clients had been out long enough to be evicted.
> We were then able to reboot the clients (RHEL 7.4) and have them reconnect
> to the file system.

This looks like an instance of:

http://tracker.ceph.com/issues/21070

Upcoming v12.2.1 has the fix. Until then, you will need to apply the
patch locally.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS crashes shortly after startup while trying to purge stray files.

2017-10-02 Thread Patrick Donnelly
On Thu, Sep 28, 2017 at 5:16 AM, Micha Krause  wrote:
> Hi,
>
> I had a chance to catch John Spray at the Ceph Day, and he suggested that I
> try to reproduce this bug in luminos.

Did you edit the code before trying Luminous? I also noticed from your
original mail that it appears you're using multiple active metadata
servers. If so, that's not stable in Jewel. You may have tripped over
one of many bugs fixed in Luminous for that configuration.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] what does associating ceph pool to application do?

2017-10-06 Thread Patrick Donnelly
Hello Chad,

On Fri, Oct 6, 2017 at 10:01 AM, Chad William Seys
 wrote:
> Thanks John!  I see that a pool can have more than one "application". Should
> I feel free to combine uses (e.g. cephfs,rbd) or is this counterindicated?

That's not currently possible but we are thinking about changes which
would allow multiple ceph file systems to use the same data pool by
having each FS work in a separate namespace. See also:

http://tracker.ceph.com/issues/15066

Support with CephFS and RBD using the same pool may follow that.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs ceph-fuse performance

2017-10-18 Thread Patrick Donnelly
Hello Ashley,

On Wed, Oct 18, 2017 at 12:45 AM, Ashley Merrick  wrote:
> 1/ Is there any options or optimizations that anyone has used or can suggest
> to increase ceph-fuse performance?

You can try playing with the sizes of reads and writes. An alternative
is to use libcephfs directly and avoid FUSE entirely.
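
Purely as an illustration of the kind of knobs I mean (whether they help
is very workload-dependent, and the values below are just examples, not
recommendations):

[client]
    # client-side readahead and object cache sizing
    client_readahead_max_bytes = 8388608
    client_oc_size = 419430400
    client_oc_max_dirty = 209715200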

> 2/ The reason for looking at ceph-fuse is the benefit of cephfs quotas
> (currently not enabled), will it ever be possible for enable quotas on the
> kernel mount or is this something not possible with the current
> implementation of quotas?

Adding quota support to the kernel is one of our priorities for Mimic.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] luminous ceph-fuse crashes with "failed to remount for kernel dentry trimming"

2017-11-27 Thread Patrick Donnelly
Hello Andras,

On Mon, Nov 27, 2017 at 2:31 PM, Andras Pataki
 wrote:
> After upgrading to the Luminous 12.2.1 ceph-fuse client, we've seen clients
> on various nodes randomly crash at the assert
> FAILED assert(0 == "failed to remount for kernel dentry trimming")
>
> with the stack:
>
>  ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous
> (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x110) [0x5584ad80]
>  2: (C_Client_Remount::finish(int)+0xcf) [0x557e7fff]
>  3: (Context::complete(int)+0x9) [0x557e3dc9]
>  4: (Finisher::finisher_thread_entry()+0x198) [0x55849d18]
>  5: (()+0x7e25) [0x760a3e25]
>  6: (clone()+0x6d) [0x74f8234d]

What kernel version are you using? We have seen instances of this
error recently. It may be related to [1]. Are you running out of
memory on these machines?

[1] http://tracker.ceph.com/issues/17517

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] strange error on link() for nfs over cephfs

2017-11-29 Thread Patrick Donnelly
On Wed, Nov 29, 2017 at 3:44 AM, Jens-U. Mozdzen  wrote:
> Hi *,
>
> we recently have switched to using CephFS (with Luminous 12.2.1). On one
> node, we're kernel-mounting the CephFS (kernel 4.4.75, openSUSE version) and
> export it via kernel nfsd. As we're transitioning right now, a number of
> machines still auto-mount users home directories from that nfsd.

You need to try a newer kernel as there have been many fixes since 4.4
which probably have not been backported to your distribution's kernel.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS log jam prevention

2017-12-05 Thread Patrick Donnelly
On Tue, Dec 5, 2017 at 8:07 AM, Reed Dier  wrote:
> Been trying to do a fairly large rsync onto a 3x replicated, filestore HDD
> backed CephFS pool.
>
> Luminous 12.2.1 for all daemons, kernel CephFS driver, Ubuntu 16.04 running
> mix of 4.8 and 4.10 kernels, 2x10GbE networking between all daemons and
> clients.

You should try a newer kernel client if possible since the MDS is
having trouble trimming its cache.

> HEALTH_ERR 1 MDSs report oversized cache; 1 MDSs have many clients failing
> to respond to cache pressure; 1 MDSs behind on tr
> imming; noout,nodeep-scrub flag(s) set; application not enabled on 1
> pool(s); 242 slow requests are blocked > 32 sec
> ; 769378 stuck requests are blocked > 4096 sec
> MDS_CACHE_OVERSIZED 1 MDSs report oversized cache
> mdsdb(mds.0): MDS cache is too large (23GB/8GB); 1018 inodes in use by
> clients, 1 stray files
> MDS_CLIENT_RECALL_MANY 1 MDSs have many clients failing to respond to cache
> pressure
> mdsdb(mds.0): Many clients (37) failing to respond to cache
> pressureclient_count: 37
> MDS_TRIM 1 MDSs behind on trimming
> mdsdb(mds.0): Behind on trimming (36252/30)max_segments: 30,
> num_segments: 36252

See also: http://tracker.ceph.com/issues/21975

You can try doubling (several times if necessary) the MDS configs
`mds_log_max_segments` and `mds_log_max_expiring` to make it trim its
journal more aggressively. (That may not help much since your OSD
requests are slow.)
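
For example, to double both from their current values at runtime (put
the same settings in the [mds] section of ceph.conf to make them
persistent):

$ ceph tell mds.mdsdb injectargs '--mds_log_max_segments 60 --mds_log_max_expiring 40'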

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs mds millions of caps

2017-12-14 Thread Patrick Donnelly
On Thu, Dec 14, 2017 at 9:18 AM, Webert de Souza Lima
 wrote:
> So, questions: does that really matter? What are the possible impacts? What
> could have caused these 2 hosts to hold so many capabilities?
> One of the hosts is for test purposes, traffic is close to zero. The other
> host wasn't using cephfs at all. All services stopped.

It's likely you're a victim of a kernel backport that removed a dentry
invalidation mechanism for FUSE mounts. The result is that ceph-fuse
can't trim dentries. We have a patch to turn off that particular
mechanism by default:

https://github.com/ceph/ceph/pull/17925

I suggest setting that config manually to false on all of your clients
and ensuring each client can remount itself to trim dentries (i.e. it's
being run as root or with sufficient capabilities), which is the
fallback mechanism.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs mds millions of caps

2017-12-14 Thread Patrick Donnelly
On Thu, Dec 14, 2017 at 4:44 PM, Webert de Souza Lima
 wrote:
> Hi Patrick,
>
> On Thu, Dec 14, 2017 at 7:52 PM, Patrick Donnelly 
> wrote:
>>
>>
>> It's likely you're a victim of a kernel backport that removed a dentry
>> invalidation mechanism for FUSE mounts. The result is that ceph-fuse
>> can't trim dentries.
>
>
> even though I'm not using FUSE? I'm using kernel mounts.
>
>
>>
>> I suggest setting that config manually to false on all of your clients
>
>
> Ok how do I do that?

I missed that you were using the kernel client. I agree with Zheng's analysis.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS cache size limits

2018-01-04 Thread Patrick Donnelly
Hello Stefan,

On Thu, Jan 4, 2018 at 1:45 AM, Stefan Kooman  wrote:
> I have a question about the "mds_cache_memory_limit" parameter and MDS
> memory usage. We currently have set mds_cache_memory_limit=150G.
> The MDS server itself (and its active-standby) have 256 GB of RAM.
> Eventually the MDS process will consume ~ 87.5% of available memory.
> At that point it will trim its cache, confirmed with:
>
> while sleep 1; do ceph daemon mds.mds1 perf dump | jq '.mds_mem.rss'; ceph
> daemon mds.mds1 dump_mempools | jq -c '.mds_co'; done
>
> 1 cephfs kernel client (4.13.0-21-generic), Ceph 12.2.2.
>
> Anyways, it will consume roughly 1.5 times the amount of memory it is
> allowed to use according to mds_cache_memory_limit. Is this expected
> behaviour?

It's expected but not desired: http://tracker.ceph.com/issues/21402

The memory usage tracking is off by a constant factor. I'd suggest
just lowering the limit so it's about where it should be for your
system.
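
For example, if you want the real RSS to land around 150GB on that box,
aiming the limit at roughly 100GB should get you there (adjust at
runtime and keep watching mds_mem.rss as you already are):

$ ceph tell mds.mds1 injectargs '--mds_cache_memory_limit 107374182400'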

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS cache size limits

2018-01-05 Thread Patrick Donnelly
On Fri, Jan 5, 2018 at 3:54 AM, Stefan Kooman  wrote:
> Quoting Patrick Donnelly (pdonn...@redhat.com):
>>
>> It's expected but not desired: http://tracker.ceph.com/issues/21402
>>
>> The memory usage tracking is off by a constant factor. I'd suggest
>> just lowering the limit so it's about where it should be for your
>> system.
>
> Thanks for the info. Yeah, we did exactly that (observe and adjust
> setting accordingly). Is this something worth
> mentioning in the documentation? Especially when this "factor" is a
> constant? Over time (with issue 21402 being worked on) things will
> change. Ceph operators will want to make use of as much cache as
> possible without overcommitting (the MDS won't notice until there is no more
> memory left, restarts, and loses all its cache :/).

Yup: http://tracker.ceph.com/issues/22599

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] After Luminous upgrade: ceph-fuse clients failing to respond to cache pressure

2018-01-18 Thread Patrick Donnelly
Hi Andras,

On Thu, Jan 18, 2018 at 3:38 AM, Andras Pataki
 wrote:
> Hi John,
>
> Some other symptoms of the problem:  when the MDS has been running for a few
> days, it starts looking really busy.  At this time, listing directories
> becomes really slow.  An "ls -l" on a directory with about 250 entries takes
> about 2.5 seconds.  All the metadata is on OSDs with NVMe backing stores.
> Interestingly enough the memory usage seems pretty low (compared to the
> allowed cache limit).
>
>
> PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+
> COMMAND
> 1604408 ceph  20   0 3710304 2.387g  18360 S 100.0  0.9 757:06.92
> /usr/bin/ceph-mds -f --cluster ceph --id cephmon00 --setuser ceph --setgroup
> ceph
>
> Once I bounce it (fail it over), the CPU usage goes down to the 10-25%
> range.  The same ls -l after the bounce takes about 0.5 seconds.  I
> remounted the filesystem before each test to ensure there isn't anything
> cached.
>
> PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+
> COMMAND
>   00 ceph  20   0 6537052 5.864g  18500 S  17.6  2.3   9:23.55
> /usr/bin/ceph-mds -f --cluster ceph --id cephmon02 --setuser ceph --setgroup
> ceph
>
> Also, I have a crawler that crawls the file system periodically.  Normally
> the full crawl runs for about 24 hours, but with the slowing down MDS, now
> it has been running for more than 2 days and isn't close to finishing.
>
> The MDS related settings we are running with are:
>
> mds_cache_memory_limit = 17179869184
> mds_cache_reservation = 0.10

Debug logs from the MDS at that time would be helpful with `debug mds
= 20` and `debug ms = 1`. Feel free to create a tracker ticket and use
ceph-post-file [1] to share logs.

[1] http://docs.ceph.com/docs/hammer/man/8/ceph-post-file/
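
For example (the injectargs form takes effect at runtime, and
ceph-post-file prints an ID you can paste into the ticket; the log path
assumes the default location):

$ ceph tell mds.cephmon00 injectargs '--debug_mds 20 --debug_ms 1'
$ ceph-post-file -d 'slow MDS after a few days of uptime' /var/log/ceph/ceph-mds.cephmon00.log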

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] client with uid

2018-01-25 Thread Patrick Donnelly
On Wed, Jan 24, 2018 at 7:47 AM, Keane Wolter  wrote:
> Hello all,
>
> I was looking at the Client Config Reference page
> (http://docs.ceph.com/docs/master/cephfs/client-config-ref/) and there was
> mention of a flag --client_with_uid. The way I read it is that you can
> specify the UID of a user on a cephfs and the user mounting the filesystem
> will act as the same UID. I am using the flags --client_mount_uid and
> --client_mount_gid set equal to my UID and GID values on the cephfs when
> running ceph-fuse. Is this the correct action for the flags or am I
> misunderstanding the flags?

These options are no longer used (with the exception of some bugs
[1,2]). The uid/gid should be provided by FUSE so you don't need to do
anything. If you're using the client library, you provide the uid/gid
via the UserPerm struct to each operation.

[1] http://tracker.ceph.com/issues/22802
[2] http://tracker.ceph.com/issues/22801


-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] client with uid

2018-02-06 Thread Patrick Donnelly
On Mon, Feb 5, 2018 at 9:08 AM, Keane Wolter  wrote:
> Hi Patrick,
>
> Thanks for the info. Looking at the fuse options in the man page, I should
> be able to pass "-o uid=$(id -u)" at the end of the ceph-fuse command.
> However, when I do, it returns with an unknown option for fuse and
> segfaults. Any pointers would be greatly appreciated. This is the result I
> get:

I'm not familiar with that uid= option; you'll have to redirect that
question to the FUSE devs. (However, I don't think it does what you want
it to. It only hard-codes the st_uid field returned by stat.)

> daemoneye@wolterk:~$ ceph-fuse --id=kwolter_test1 -r /user/kwolter/
> /home/daemoneye/ceph/ --client-die-on-failed-remount=false -o uid=$(id -u)
> ceph-fuse[25156]: starting ceph client
> fuse: unknown option `uid=1000'
> ceph-fuse[25156]: fuse failed to start
> *** Caught signal (Segmentation fault) **
>  in thread 7efc7da86100 thread_name:ceph-fuse
>  ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous
> (stable)
>  1: (()+0x6a8784) [0x5583372d8784]
>  2: (()+0x12180) [0x7efc7bb4f180]
>  3: (Client::_ll_drop_pins()+0x67) [0x558336e5dea7]
>  4: (Client::unmount()+0x943) [0x558336e67323]
>  5: (main()+0x7ed) [0x558336e02b0d]
>  6: (__libc_start_main()+0xea) [0x7efc7a892f2a]
>  7: (_start()+0x2a) [0x558336e0b73a]
> ceph-fuse [25154]: (33) Numerical argument out of domain
> daemoneye@wolterk:~$

I wasn't able to reproduce this.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS Problems

2016-11-04 Thread Patrick Donnelly
ceph-mds1 handle_mds_map i 
> (10.1.2.76:6805/1334) dne in the mdsmap, respawning
> myself
> 2016-11-04 13:35:15.412886 7fcf47754700  1 mds.gp-ceph-mds1 respawn
>
> Some of the Asserts
>
> 2016-11-04 13:26:30.344284 7f03fd42e700 -1 mds/MDSDaemon.cc: In function 
> 'void MDSDaemon::respawn()' thread 7f03fd42e700 time
> 2016-11-04 13:26:30.329841
> mds/MDSDaemon.cc: 1132: FAILED assert(0)
>
>  ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x80) [0x557546692e80]
>  2: (MDSDaemon::respawn()+0x73d) [0x5575462789fd]
>  3: (MDSDaemon::handle_mds_map(MMDSMap*)+0x1517) [0x557546281667]
>  4: (MDSDaemon::handle_core_message(Message*)+0x7f3) [0x557546284a03]
>  5: (MDSDaemon::ms_dispatch(Message*)+0x1c3) [0x557546284cf3]
>  6: (DispatchQueue::entry()+0xf2b) [0x557546799f6b]
>  7: (DispatchQueue::DispatchThread::entry()+0xd) [0x55754667911d]
>  8: (()+0x76fa) [0x7f04026b06fa]
>  9: (clone()+0x6d) [0x7f0400b71b5d]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
> interpret this.

This assert might be this issue: http://tracker.ceph.com/issues/17531

However, the exe_path debug line in your log would not indicate that
bug. You would see something like:

2016-10-06 15:12:04.933212 7fd94f072700  1 mds.a  exe_path
/home/pdonnell/ceph/build/bin/ceph-mds (deleted)

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practices for use ceph cluster and directories with many! Entries

2016-11-15 Thread Patrick Donnelly
On Tue, Nov 15, 2016 at 8:40 AM, Hauke Homburg  wrote:
> In the last few weeks we enabled directory fragmentation for testing. The
> result is that we sometimes get rsync error messages about failed unlinks and
> no space left on device.

Enabling directory fragmentation would not cause the unlink and ENOSPC
errors. Failure to unlink is caused by the stray directories on the
MDS growing too large. The only current solution is to wait for the
MDS to eventually purge the stray directory entries. Retry the unlink
as necessary. [The other workaround is to increase
mds_bal_fragment_size_max [1] which is not recommended.]

Directory fragmentation is not yet considered stable so beware
potential issues including data loss. However, fragmentation will
allow your directories to grow to unbounded size. This includes the
stray directories which would permit unlink to avoid this issue.

>  Does anyone have a timeline for the testing dir frag mds?

Directory fragmentation is on track to be stable for the Luminous release.

[1] https://github.com/ceph/ceph/pull/9789

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds daemon damaged

2018-07-12 Thread Patrick Donnelly
On Thu, Jul 12, 2018 at 2:30 PM, Kevin  wrote:
> Sorry for the long posting but trying to cover everything
>
> I woke up to find my cephfs filesystem down. This was in the logs
>
> 2018-07-11 05:54:10.398171 osd.1 [ERR] 2.4 full-object read crc 0x6fc2f65a
> != expected 0x1c08241c on 2:292cf221:::200.:head

Since this came from the OSD, you should look at resolving that problem
first. What you've done below is blow the journal away, which hasn't
helped you any because (a) your journal is now probably lost without a
lot of manual intervention, and (b) the "new" journal is still written
to the same bad backing device/file, so it's probably still unusable, as
you found out.

> I had one standby MDS, but as far as I can tell it did not fail over. This
> was in the logs

If a rank becomes damaged, standbys will not take over. You must mark
it repaired first.
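
In this case, once the underlying object is readable again, that would
be something like:

$ ceph mds repaired test-cephfs-1:0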

> (insufficient standby MDS daemons available)
>
> Currently my ceph looks like this
>   cluster:
> id: ..
> health: HEALTH_ERR
> 1 filesystem is degraded
> 1 mds daemon damaged
>
>   services:
> mon: 6 daemons, quorum ds26,ds27,ds2b,ds2a,ds28,ds29
> mgr: ids27(active)
> mds: test-cephfs-1-0/1/1 up , 3 up:standby, 1 damaged
> osd: 5 osds: 5 up, 5 in
>
>   data:
> pools:   3 pools, 202 pgs
> objects: 1013k objects, 4018 GB
> usage:   12085 GB used, 6544 GB / 18630 GB avail
> pgs: 201 active+clean
>  1   active+clean+scrubbing+deep
>
>   io:
> client:   0 B/s rd, 0 op/s rd, 0 op/s wr
>
> I started trying to get the damaged MDS back online
>
> Based on this page
> http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#disaster-recovery-experts
>
> # cephfs-journal-tool journal export backup.bin
> 2018-07-12 13:35:15.675964 7f3e1389bf00 -1 Header 200. is unreadable
> 2018-07-12 13:35:15.675977 7f3e1389bf00 -1 journal_export: Journal not
> readable, attempt object-by-object dump with `rados`
> Error ((5) Input/output error)
>
> # cephfs-journal-tool event recover_dentries summary
> Events by type:
> 2018-07-12 13:36:03.000590 7fc398a18f00 -1 Header 200. is
> unreadableErrors: 0
>
> cephfs-journal-tool journal reset - (I think this command might have worked)
>
> Next up, tried to reset the filesystem
>
> ceph fs reset test-cephfs-1 --yes-i-really-mean-it
>
> Each time same errors
>
> 2018-07-12 11:56:35.760449 mon.ds26 [INF] Health check cleared: MDS_DAMAGE
> (was: 1 mds daemon damaged)
> 2018-07-12 11:56:35.856737 mon.ds26 [INF] Standby daemon mds.ds27 assigned
> to filesystem test-cephfs-1 as rank 0
> 2018-07-12 11:56:35.947801 mds.ds27 [ERR] Error recovering journal 0x200:
> (5) Input/output error
> 2018-07-12 11:56:36.900807 mon.ds26 [ERR] Health check failed: 1 mds daemon
> damaged (MDS_DAMAGE)
> 2018-07-12 11:56:35.945544 osd.0 [ERR] 2.4 full-object read crc 0x6fc2f65a
> != expected 0x1c08241c on 2:292cf221:::200.:head
> 2018-07-12 12:00:00.000142 mon.ds26 [ERR] overall HEALTH_ERR 1 filesystem is
> degraded; 1 mds daemon damaged
>
> Tried to 'fail' mds.ds27
> # ceph mds fail ds27
> # failed mds gid 1929168
>
> Command worked, but each time I run the reset command the same errors above
> appear
>
> Online searches say the object read error has to be removed. But there's no
> object listed. This web page is the closest to the issue
> http://tracker.ceph.com/issues/20863
>
> Recommends fixing error by hand. Tried running deep scrub on pg 2.4, it
> completes but still have the same issue above
>
> Final option is to attempt removing mds.ds27. If mds.ds29 was a standby and
> has data it should become live. If it was not
> I assume we will lose the filesystem at this point
>
> Why didn't the standby MDS failover?
>
> Just looking for any way to recover the cephfs, thanks!

I think it's time to do a scrub on the PG containing that object.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds daemon damaged

2018-07-12 Thread Patrick Donnelly
On Thu, Jul 12, 2018 at 3:55 PM, Patrick Donnelly  wrote:
>> Recommends fixing error by hand. Tried running deep scrub on pg 2.4, it
>> completes but still have the same issue above
>>
>> Final option is to attempt removing mds.ds27. If mds.ds29 was a standby and
>> has data it should become live. If it was not
>> I assume we will lose the filesystem at this point
>>
>> Why didn't the standby MDS failover?
>>
>> Just looking for any way to recover the cephfs, thanks!
>
> I think it's time to do a scrub on the PG containing that object.

Sorry, I didn't read the part of the email that said you already did
that. :) Did you confirm that the PG is active+clean after the deep
scrub finished? It looks like you're still scrubbing that PG.
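
Once it has finished, something like this should tell you whether the
object is actually consistent now, and let you repair it if not (be sure
you understand which copy is bad before issuing the repair):

$ ceph pg deep-scrub 2.4
$ rados list-inconsistent-obj 2.4 --format=json-pretty
$ ceph pg repair 2.4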

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Insane CPU utilization in ceph.fuse

2018-07-23 Thread Patrick Donnelly
On Mon, Jul 23, 2018 at 5:48 AM, Daniel Carrasco  wrote:
> Hi, thanks for your response.
>
> We have about 6 clients, and 4 of them are on standby most of the time. Only
> two are active servers that are serving the webpage. Also we have a Varnish
> in front, so they are not getting all the load (below 30% in PHP is not much).
> About the MDS cache, I now have mds_cache_memory_limit at 8Mb.

What! Please post `ceph daemon mds. config diff`, `... perf dump`, and
`... dump_mempools` from the server the active MDS is on.

> I've tested
> also 512Mb, but the CPU usage is the same and the MDS RAM usage grows up to
> 15GB (on a 16Gb server it starts to swap and all fails). With 8Mb, at least
> the memory usage is stable on less than 6Gb (now is using about 1GB of RAM).

We've seen reports of possible memory leaks before, and the potential
fixes for those went into 12.2.6. How fast does your MDS reach 15GB?
Your MDS cache size should normally be configured to 1-8GB (depending on
your preference), so it's disturbing to see it set so low.
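
For reference, that's just the one setting in the [mds] section of
ceph.conf on the MDS hosts (value in bytes; 4GB shown here as an
arbitrary example):

[mds]
    mds_cache_memory_limit = 4294967296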

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Secure way to wipe a Ceph cluster

2018-07-27 Thread Patrick Donnelly
Hello Christopher,

On Fri, Jul 27, 2018 at 12:00 AM, Christopher Kunz
 wrote:
> Hello all,
>
> as part of deprovisioning customers, we regularly have the task of
> wiping their Ceph clusters. Is there a certifiable, GDPR compliant way
> to do so without physically shredding the disks?

This should work and should be as fast as it can be:

wipefs -a /dev/sdX
shred /dev/sdX
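
Or, looping over every device that held OSD data or journals on a box
(double-check the device list first; a single shred pass is usually
enough and much faster than the default three):

for dev in /dev/sdb /dev/sdc; do
    wipefs -a "$dev"
    shred -n 1 -v "$dev"
done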

Whether or not that's "GDPR compliant" will depend on external
certification, I guess.

(The issues might be that you can't guarantee all blocks in an SSD/HDD
are actually erased because the device firmware may retire bad blocks
and make them inaccessible. It may not be possible for the device to
physically destroy those blocks either even with SMART directives. You
may be stuck with an industrial shredder to be compliant if the rules
are stringent.)

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS Quota and ACL support

2018-08-27 Thread Patrick Donnelly
On Mon, Aug 27, 2018 at 12:51 AM, Oliver Freyermuth
 wrote:
> These features are critical for us, so right now we use the Fuse client. My 
> hope is CentOS 8 will use a recent enough kernel
> to get those features automatically, though.

Your cluster needs to be running Mimic and Linux v4.17+.

See also: https://github.com/ceph/ceph/pull/23728/files

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] omap vs. xattr in librados

2018-09-11 Thread Patrick Donnelly
On Tue, Sep 11, 2018 at 12:43 PM, Benjamin Cherian
 wrote:
> On Tue, Sep 11, 2018 at 10:44 AM Gregory Farnum  wrote:
>>
>> 
>> In general, if the key-value storage is of unpredictable or non-trivial
>> size, you should use omap.
>>
>> At the bottom layer where the data is actually stored, they're likely to
>> be in the same places (if using BlueStore, they are the same — in FileStore,
>> a rados xattr *might* be in the local FS xattrs, or it might not). It is
>> somewhat more likely that something stored in an xattr will get pulled into
>> memory at the same time as the object's internal metadata, but that only
>> happens if it's quite small (think the xfs or ext4 xattr rules).
>
>
> Based on this description, if I'm planning on using Bluestore, there is no
> particular reason to ever prefer using xattrs over omap (outside of ease of
> use in the API), correct?

You may prefer xattrs on BlueStore if the metadata is small, or if you
need to store it on an EC pool; omap is not supported on EC pools.
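
You can see the difference from the command line with the rados tool
(the pool and object names here are made up):

$ rados -p ecpool setxattr myobj mykey myvalue      # xattrs work on EC pools
$ rados -p ecpool getxattr myobj mykey
$ rados -p replpool setomapval myobj mykey myvalue  # omap requires a replicated pool
$ rados -p replpool listomapvals myobj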

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS performance.

2018-10-04 Thread Patrick Donnelly
On Thu, Oct 4, 2018 at 2:10 AM Ronny Aasen  wrote:
> in rbd there is a fancy striping solution, by using --stripe-unit and
> --stripe-count. This would get more spindles running ; perhaps consider
> using rbd instead of cephfs if it fits the workload.

CephFS also supports custom striping via layouts:
http://docs.ceph.com/docs/master/cephfs/file-layouts/
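
For example, a directory layout set via the virtual xattrs applies to
all new files created beneath it (directory and values here are just
examples):

$ setfattr -n ceph.dir.layout.stripe_count -v 8 /mnt/cephfs/somedir
$ getfattr -n ceph.dir.layout /mnt/cephfs/somedir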

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Don't upgrade to 13.2.2 if you use cephfs

2018-10-08 Thread Patrick Donnelly
+ceph-announce

On Sun, Oct 7, 2018 at 7:30 PM Yan, Zheng  wrote:
> There is a bug in v13.2.2 mds, which causes decoding purge queue to
> fail. If mds is already in damaged state, please downgrade mds to
> 13.2.1, then run 'ceph mds repaired fs_name:damaged_rank' .
>
> Sorry for all the trouble I caused.
> Yan, Zheng

This issue is being tracked here: http://tracker.ceph.com/issues/36346

The problem was caused by a backport of the wrong commit which
unfortunately was not caught. The backport was not done to Luminous;
only Mimic 13.2.2 is affected. New deployments on 13.2.2 are also
affected but do not require immediate action. A procedure for handling
upgrades of fresh deployments from 13.2.2 to 13.2.3 will be included
in the release notes for 13.2.3.
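
To spell out the recovery step Zheng mentions, assuming a file system
named "cephfs" with rank 0 damaged, the command after downgrading the
MDS to 13.2.1 would look like:

  ceph mds repaired cephfs:0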
-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS hangs in "heartbeat_map" deadlock

2018-10-08 Thread Patrick Donnelly
On Thu, Oct 4, 2018 at 3:58 PM Stefan Kooman  wrote:
> A couple of hours later we hit the same issue. We restarted with
> debug_mds=20 and debug_journaler=20 on the standby-replay node. Eight
> hours later (an hour ago) we hit the same issue. We captured ~ 4.7 GB of
> logging I skipped to the end of the log file just before the
> "hearbeat_map" messages start:
>
> 2018-10-04 23:23:53.144644 7f415ebf4700 20 mds.0.locker  client.17079146 
> pending pAsLsXsFscr allowed pAsLsXsFscr wanted pFscr
> 2018-10-04 23:23:53.144645 7f415ebf4700 10 mds.0.locker eval done
> 2018-10-04 23:23:55.088542 7f415bbee700 10 mds.beacon.mds2 _send up:active 
> seq 5021
> 2018-10-04 23:23:59.088602 7f415bbee700 10 mds.beacon.mds2 _send up:active 
> seq 5022
> 2018-10-04 23:24:03.088688 7f415bbee700 10 mds.beacon.mds2 _send up:active 
> seq 5023
> 2018-10-04 23:24:07.088775 7f415bbee700 10 mds.beacon.mds2 _send up:active 
> seq 5024
> 2018-10-04 23:24:11.088867 7f415bbee700  1 heartbeat_map is_healthy 'MDSRank' 
> had timed out after 15
> 2018-10-04 23:24:11.088871 7f415bbee700  1 mds.beacon.mds2 _send skipping 
> beacon, heartbeat map not healthy
>
> As far as I can see just normal behaviour.
>
> The big question is: what is happening when the mds start logging the 
> hearbeat_map messages?
> Why does it log "heartbeat_map is_healthy", just to log .04 seconds later 
> it's not healthy?
>
> Ceph version: 12.2.8 on all nodes (mon, osd, mds)
> mds: one active / one standby-replay
>
> The system was not under any kind of resource pressure: plenty of CPU, RAM
> available. Metrics all look normal up to the moment things go into a deadlock
> (so it seems).

Thanks for the detailed notes. It looks like the MDS is stuck
somewhere it's not even outputting any log messages. If possible, it'd
be helpful to get a coredump (e.g. by sending SIGQUIT to the MDS) or,
if you're comfortable with gdb, a backtrace of any threads that look
suspicious (e.g. not waiting on a futex) including `info threads`.
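
For example, something along these lines on the MDS host (the pid
lookup is only illustrative, and gdb needs the debuginfo packages
installed to produce useful symbols):

  # core dump via SIGQUIT (make sure core dumps are enabled, e.g. ulimit -c unlimited)
  kill -QUIT $(pidof ceph-mds)

  # or attach with gdb and collect backtraces
  gdb -p $(pidof ceph-mds)
  (gdb) info threads
  (gdb) thread apply all bt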
-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Don't upgrade to 13.2.2 if you use cephfs

2018-10-17 Thread Patrick Donnelly
On Wed, Oct 17, 2018 at 11:05 AM Alexandre DERUMIER  wrote:
>
> Hi,
>
> Is it possible to have more infos or announce about this problem ?
>
> I'm currently waiting to migrate from luminious to mimic, (I need new quota 
> feature for cephfs)
>
> is it safe to upgrade to 13.2.2 ?
>
> or better to wait to 13.2.3 ? or install 13.2.1 for now ?

Upgrading to 13.2.1 would be safe.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Balanced MDS, all as active and recomended client settings.

2018-02-21 Thread Patrick Donnelly
Hello Daniel,

On Wed, Feb 21, 2018 at 10:26 AM, Daniel Carrasco  wrote:
> Is possible to make a better distribution on the MDS load of both nodes?.

We are aware of bugs with the balancer which are being worked on. You
can also manually create a partition if the workload can benefit:

https://ceph.com/community/new-luminous-cephfs-subtree-pinning/
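
As a sketch, a manual partition is created by pinning directory trees
to specific ranks through an extended attribute (paths and ranks are
placeholders):

  setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/projects    # rank 0 serves this tree
  setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/home        # rank 1 serves this tree
  setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/home       # unpin again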

> Is posible to set all nodes as Active without problems?

No. I recommend you read the docs carefully:

http://docs.ceph.com/docs/master/cephfs/multimds/

> My last question is if someone can recomend me a good client configuration
> like cache size, and maybe something to lower the metadata servers load.

>>
>> ##
>> [mds]
>>  mds_cache_size = 25
>>  mds_cache_memory_limit = 792723456

You should only specify one of those. See also:

http://docs.ceph.com/docs/master/cephfs/cache-size-limits/
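
In other words, keeping only the memory limit is enough; a minimal
example (the value is just the one from your configuration):

  [mds]
    mds_cache_memory_limit = 792723456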

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Balanced MDS, all as active and recomended client settings.

2018-02-22 Thread Patrick Donnelly
On Wed, Feb 21, 2018 at 11:17 PM, Daniel Carrasco  wrote:
> I want to search also if there is any way to cache file metadata on client,
> to lower the MDS load. I suppose that files are cached but the client check
> with MDS if there are changes on files. On my server files are the most of
> time read-only so MDS data can be also cached for a while.

The MDS issues capabilities that allow clients to coherently cache metadata.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Balanced MDS, all as active and recomended client settings.

2018-02-23 Thread Patrick Donnelly
On Fri, Feb 23, 2018 at 12:54 AM, Daniel Carrasco  wrote:
>  client_permissions = false

Yes, this will potentially reduce checks against the MDS.

>   client_quota = false

This option no longer exists since Luminous; quota enforcement is no
longer optional. However, if you don't have any quotas then there is
no added load on the client/mds.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS very unstable with many small files

2018-02-26 Thread Patrick Donnelly
On Sun, Feb 25, 2018 at 3:49 PM, Oliver Freyermuth
 wrote:
> Am 25.02.2018 um 21:50 schrieb John Spray:
>> On Sun, Feb 25, 2018 at 4:45 PM, Oliver Freyermuth
>>> Now, with about 100,000,000 objects written, we are in a disaster situation.
>>> First off, the MDS could not restart anymore - it required >40 GB of 
>>> memory, which (together with the 2 OSDs on the MDS host) exceeded RAM and 
>>> swap.
>>> So it tried to recover and OOMed quickly after. Replay was reasonably fast, 
>>> but join took many minutes:
>>> 2018-02-25 04:16:02.299107 7fe20ce1f700  1 mds.0.17657 rejoin_start
>>> 2018-02-25 04:19:00.618514 7fe20ce1f700  1 mds.0.17657 rejoin_joint_start
>>> and finally, 5 minutes later, OOM.
>>>
>>> I stopped half of the stress-test tar's, which did not help - then I 
>>> rebooted half of the clients, which did help and let the MDS recover just 
>>> fine.
>>> So it seems the client caps have been too many for the MDS to handle. I'm 
>>> unsure why "tar" would cause so many open file handles.
>>> Is there anything that can be configured to prevent this from happening?
>>
>> Clients will generally hold onto capabilities for files they've
>> written out -- this is pretty sub-optimal for many workloads where
>> files are written out but not likely to be accessed again in the near
>> future.  While clients hold these capabilities, the MDS cannot drop
>> things from its own cache.
>>
>> The way this is *meant* to work is that the MDS hits its cache size
>> limit, and sends a message to clients asking them to drop some files
>> from their local cache, and consequently release those capabilities.
>> However, this has historically been a tricky area with ceph-fuse
>> clients (there are some hacks for detecting kernel version and using
>> different mechanisms for different versions of fuse), and it's
>> possible that on your clients this mechanism is simply not working,
>> leading to a severely oversized MDS cache.
>>
>> The MDS should have been showing health alerts in "ceph status" about
>> this, but I suppose it's possible that it wasn't surviving long enough
>> to hit the timeout (60s) that we apply for warning about misbehaving
>> clients?  It would be good to check the cluster log to see if you were
>> getting any health messages along the lines of "Client xyz failing to
>> respond to cache pressure".
>
> This explains the high memory usage indeed.
> I can also confirm seeing those health alerts, now that I check the logs.
> The systems have been (servers and clients) all exclusively CentOS 7.4,
> so kernels are rather old, but I would have hoped things have been backported
> by RedHat.
>
> Is there anything one can do to limit client's cache sizes?

You said the clients are ceph-fuse running 12.2.3? Then they should have:

http://tracker.ceph.com/issues/22339

(Please double check you're not running older clients on accident.)

I have run small file tests with ~128 clients without issue. Generally
if there is an issue it is because clients are not releasing their
capabilities properly (due to invalidation bugs which should be caught
by the above backport) or the MDS memory usage exceeds RAM. If the
clients are not releasing their capabilities, you should see the
errors John described in the cluster log.

You said in the original post that the `mds cache memory limit = 4GB`.
If that's the case, you really shouldn't be exceeding 40GB of RAM!
It's possible you have found a bug of some kind. I suggest tracking
the MDS cache statistics (which includes the inode count in cache) by
collecting a `perf dump` via the admin socket. Then you can begin to
find out what's consuming all of the MDS memory.
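
For example, on the MDS host (the daemon name is a placeholder), the
statistics can be sampled periodically:

  ceph daemon mds.<name> perf dump mds    # includes "inodes", "caps", etc.
  ceph daemon mds.<name> perf dump        # full dump, useful to log over time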

Additionally, I concur with John on digging into why the MDS is
missing heartbeats by collecting debug logs (`debug mds = 15`) at that
time. It may also shed light on the issue.

Thanks for performing the test and letting us know the results.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Storage usage of CephFS-MDS

2018-02-26 Thread Patrick Donnelly
On Sun, Feb 25, 2018 at 10:26 AM, Oliver Freyermuth
 wrote:
> Looking with:
> ceph daemon osd.2 perf dump
> I get:
> "bluefs": {
> "gift_bytes": 0,
> "reclaim_bytes": 0,
> "db_total_bytes": 84760592384,
> "db_used_bytes": 78920024064,
> "wal_total_bytes": 0,
> "wal_used_bytes": 0,
> "slow_total_bytes": 0,
> "slow_used_bytes": 0,
> so it seems this is almost exclusively RocksDB usage.
>
> Is this expected?

Yes. The directory entries are stored in the omap of the objects. This
will be stored in the RocksDB backend of Bluestore.

> Is there a recommendation on how much MDS storage is needed for a CephFS with 
> 450 TB?

It seems in the above test you're using about 1KB per inode (file).
Using that you can extrapolate how much space the data pool needs
based on your file system usage. (If all you're doing is filling the
file system with empty files, of course you're going to need an
unusually large metadata pool.)
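
As a rough back-of-the-envelope using the ~1KB-per-inode figure above
(actual usage varies with directory sizes, snapshots and so on):

  100,000,000 files x ~1 KB/inode ~= 100 GB in the metadata pool
  with 3x replication             ~= 300 GB of raw capacity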

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Storage usage of CephFS-MDS

2018-02-26 Thread Patrick Donnelly
On Mon, Feb 26, 2018 at 7:59 AM, Patrick Donnelly  wrote:
> It seems in the above test you're using about 1KB per inode (file).
> Using that you can extrapolate how much space the data pool needs

s/data pool/metadata pool/

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Updating standby mds from 12.2.2 to 12.2.4 caused up:active 12.2.2 mds's to suicide

2018-02-28 Thread Patrick Donnelly
On Wed, Feb 28, 2018 at 2:07 AM, Dan van der Ster  wrote:
> (Sorry to spam)
>
> I guess it's related to this fix to the layout v2 feature id:
> https://github.com/ceph/ceph/pull/18782/files
>
> -#define MDS_FEATURE_INCOMPAT_FILE_LAYOUT_V2 CompatSet::Feature(8,
> "file layout v2")
> +#define MDS_FEATURE_INCOMPAT_FILE_LAYOUT_V2 CompatSet::Feature(9,
> "file layout v2")

Yes, this looks to be the issue.

> Is there a way to update from 12.2.2 without causing the other active
> MDS's to suicide?

I think it will be necessary to reduce the actives to 1 (max_mds -> 1;
deactivate the other ranks), shut down the standbys, upgrade the single
active, then upgrade/start the standbys.
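
A rough sketch of that ordering on Luminous, assuming a file system
named "cephfs" with two active ranks and hypothetical daemon names:

  ceph fs set cephfs max_mds 1
  ceph mds deactivate cephfs:1        # wait for rank 1 to finish stopping
  systemctl stop ceph-mds@standby-a   # repeat on each standby host
  # upgrade and restart the remaining active, then upgrade and start the standbys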

Unfortunately this didn't get flagged in upgrade testing. Thanks for
the report Dan.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-Fuse and mount namespaces

2018-02-28 Thread Patrick Donnelly
On Tue, Feb 27, 2018 at 3:27 PM, Oliver Freyermuth
 wrote:
> As you can see:
> - Name collision for admin socket, since the helper is already running.

You can change the admin socket path using the `admin socket` config
variable. Use metavariables [1] to make the path unique.
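
For example, something like this in ceph.conf should give each helper
its own socket and log (the paths themselves are just examples):

  [client]
    admin socket = /var/run/ceph/$cluster-$name.$pid.asok
    log file = /var/log/ceph/$cluster-$name.$pid.log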

> - A second helper for the same mountpoint was fired up!

This is expected. If you want a single ceph-fuse mount then you need
to persist the mount in the host namespace somewhere (using bind
mounts) so you can reuse it. However, mind what David Turner said
regarding using a single ceph-fuse client for multiple containers.
Right now parallel requests are not handled well in the client so it
can be slow for multiple applications (or containers). Another option
is to use a kernel mount which would be more performant and also allow
parallel requests.

> - On a side-note, once I exit the container (and hence close the mount 
> namespace), the "old" helper is finally freed.

Once the last mount point is unmounted, FUSE will destroy the userspace helper.

[1] http://docs.ceph.com/docs/master/rados/configuration/ceph-conf/?highlight=configuration#metavariables

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Don't use ceph mds set max_mds

2018-03-07 Thread Patrick Donnelly
On Wed, Mar 7, 2018 at 5:29 AM, John Spray  wrote:
> On Wed, Mar 7, 2018 at 10:11 AM, Dan van der Ster  wrote:
>> Hi all,
>>
>> What is the purpose of
>>
>>ceph mds set max_mds 
>>
>> ?
>>
>> We just used that by mistake on a cephfs cluster when attempting to
>> decrease from 2 to 1 active mds's.
>>
>> The correct command to do this is of course
>>
>>   ceph fs set  max_mds 
>>
>> So, is `ceph mds set max_mds` useful for something? If not, should it
>> be removed from the CLI?
>
> It's the legacy version of the command from before we had multiple
> filesystems.  Those commands are marked as obsolete internally so that
> they're not included in the --help output, but they're still handled
> (applied to the "default" filesystem) if called.
>
> The multi-fs stuff went in for Jewel, so maybe we should think about
> removing the old commands in Mimic: any thoughts Patrick?

These commands have already been removed (obsoleted) in master/Mimic.
You can no longer use them. In Luminous, the commands are deprecated
(basically, omitted from --help).

See also: https://tracker.ceph.com/issues/20596

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Updating standby mds from 12.2.2 to 12.2.4 caused up:active 12.2.2 mds's to suicide

2018-03-14 Thread Patrick Donnelly
On Wed, Mar 14, 2018 at 5:48 AM, Lars Marowsky-Bree  wrote:
> On 2018-02-28T02:38:34, Patrick Donnelly  wrote:
>
>> I think it will be necessary to reduce the actives to 1 (max_mds -> 1;
>> deactivate other ranks), shutdown standbys, upgrade the single active,
>> then upgrade/start the standbys.
>>
>> Unfortunately this didn't get flagged in upgrade testing. Thanks for
>> the report Dan.
>
> This means that - when the single active is being updated - there's a
> time when there's no MDS active, right?

Yes. But the real outcome is not "no MDS [is] active" but "some or all
metadata I/O will pause" -- and there is no avoiding that. During an
MDS upgrade, a standby must take over the MDS being shutdown (and
upgraded).  During takeover, metadata I/O will briefly pause as the
rank is unavailable. (Specifically, no other rank can obtain locks or
communicate with the "failed" rank; so metadata I/O will necessarily
pause until a standby takes over.) Single active vs. multiple active
upgrade makes little difference in this outcome.

> Is another approach theoretically feasible? Have the updated MDS only go
> into the incompatible mode once there's a quorum of new ones available,
> or something?

I believe so, yes. That option wasn't explored for this patch because
it was just disambiguating the compatibility flags and the full
side-effects weren't realized.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rctime not tracking inode ctime

2018-03-14 Thread Patrick Donnelly
On Wed, Mar 14, 2018 at 9:22 AM, Dan van der Ster  wrote:
> Hi all,
>
> On our luminous v12.2.4 ceph-fuse clients / mds the rctime is not
> tracking the latest inode ctime, but only the latest directory ctimes.
>
> Initial empty dir:
>
> # getfattr -d -m ceph . | egrep 'bytes|ctime'
> ceph.dir.rbytes="0"
> ceph.dir.rctime="1521043742.09466372697"
>
> Create a file, rctime is updated:
>
> # touch a
> # getfattr -d -m ceph . | egrep 'bytes|ctime'
> ceph.dir.rbytes="0"
> ceph.dir.rctime="1521043831.0921836283"
>
> Modify a file, rbytes is updated but not rctime:
>
> # echo hello > a
> # getfattr -d -m ceph . | egrep 'bytes|ctime'
> ceph.dir.rbytes="6"
> ceph.dir.rctime="1521043831.0921836283"
>
> Modify the dir, rctime is updated:
>
> # touch b
> # getfattr -d -m ceph . | egrep 'bytes|ctime'
> ceph.dir.rbytes="6"
> ceph.dir.rctime="1521043861.09597651370"
>
> Do others see the same rctime behaviour? Is this how it's supposed to work?

It appears rctime is meant to reflect changes to directory inodes.
Traditionally, modifying a file (truncate, write) does not involve
metadata changes to a directory inode.

Whether that is the intended behavior is a good question. Perhaps it
should be changed?

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs and number of clients

2018-03-20 Thread Patrick Donnelly
On Tue, Mar 20, 2018 at 3:27 AM, James Poole  wrote:
> I have a query regarding cephfs and prefered number of clients. We are
> currently using luminous cephfs to support storage for a number of web
> servers. We have one file system split into folders, example:
>
> /vol1
> /vol2
> /vol3
> /vol4
>
> At the moment the root of the cephfs filesystem is mounted to each web
> server. The query is would there be a benefit to having separate mount
> points for each folder like above?

Performance benefit? No. Data isolation benefit? Sure.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-03-27 Thread Patrick Donnelly
Hello Alexandre,

On Thu, Mar 22, 2018 at 2:29 AM, Alexandre DERUMIER  wrote:
> Hi,
>
> I'm running cephfs since 2 months now,
>
> and my active msd memory usage is around 20G now (still growing).
>
> ceph 1521539 10.8 31.2 20929836 20534868 ?   Ssl  janv.26 8573:34 
> /usr/bin/ceph-mds -f --cluster ceph --id 2 --setuser ceph --setgroup ceph
> USER PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
>
>
> this is on luminous 12.2.2
>
> only tuning done is:
>
> mds_cache_memory_limit = 5368709120
>
>
> (5GB). I known it's a soft limit, but 20G seem quite huge vs 5GB 
>
>
> Is it normal ?

No, that's definitely not normal!


> # ceph daemon mds.2 perf dump mds
> {
> "mds": {
> "request": 1444009197,
> "reply": 1443999870,
> "reply_latency": {
> "avgcount": 1443999870,
> "sum": 1657849.656122933,
> "avgtime": 0.001148095
> },
> "forward": 0,
> "dir_fetch": 51740910,
> "dir_commit": 9069568,
> "dir_split": 64367,
> "dir_merge": 58016,
> "inode_max": 2147483647,
> "inodes": 2042975,
> "inodes_top": 152783,
> "inodes_bottom": 138781,
> "inodes_pin_tail": 1751411,
> "inodes_pinned": 1824714,
> "inodes_expired": 7258145573,
> "inodes_with_caps": 1812018,
> "caps": 2538233,
> "subtrees": 2,
> "traverse": 1591668547,
> "traverse_hit": 1259482170,
> "traverse_forward": 0,
> "traverse_discover": 0,
> "traverse_dir_fetch": 30827836,
> "traverse_remote_ino": 7510,
> "traverse_lock": 86236,
> "load_cent": 144401980319,
> "q": 49,
> "exported": 0,
> "exported_inodes": 0,
> "imported": 0,
> "imported_inodes": 0
> }
> }

Can you also share `ceph daemon mds.2 cache status`, the full `ceph
daemon mds.2 perf dump`, and `ceph status`?

Note [1] will be in 12.2.5 and may help with your issue.

[1] https://github.com/ceph/ceph/pull/20527

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is it possible to suggest the active MDS to move to a datacenter ?

2018-03-29 Thread Patrick Donnelly
On Thu, Mar 29, 2018 at 1:02 PM, Nicolas Huillard  wrote:
> Hi,
>
> I manage my 2 datacenters with Pacemaker and Booth. One of them is the
> publicly-known one, thanks to Booth.
> Whatever the "public datacenter", Ceph is a single storage cluster.
> Since most of the cephfs traffic come from this "public datacenter",
> I'd like to suggest or force the active MDS to move to the same
> datacenter, hoping to reduce trafic on the inter-datacenter link, and
> reduce cephfs metadata operations latency.
>
> Is it possible for forcefully move the active MDS using external
> triggers ?

No and it probably wouldn't be beneficial. The MDS still needs to talk
to the metadata/data pools and increasing the latency between the MDS
and the OSDs will probably do more harm.

One possibility for helping your situation is to put NFS-Ganesha in
the public datacenter as a gateway to CephFS. This may help with your
performance by (a) sharing a larger cache among multiple clients and
(b) reducing capability conflicts between clients thereby resulting in
less metadata traffic with the MDS. Be aware an HA solution doesn't
yet exist for NFS-Ganesha+CephFS outside of OpenStack Queens
deployments.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-fuse segfaults

2018-04-02 Thread Patrick Donnelly
Probably fixed by this: http://tracker.ceph.com/issues/17206

You need to upgrade your version of ceph-fuse.

On Mon, Apr 2, 2018 at 12:56 AM, Zhang Qiang  wrote:
> Hi,
>
> I'm using ceph-fuse 10.2.3 on CentOS 7.3.1611. ceph-fuse always
> segfaults after running for some time.
>
> *** Caught signal (Segmentation fault) **
>  in thread 7f455d832700 thread_name:ceph-fuse
>  ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>  1: (()+0x2a442a) [0x7f457208e42a]
>  2: (()+0xf5e0) [0x7f4570b895e0]
>  3: (Client::get_root_ino()+0x10) [0x7f4571f86a20]
>  4: (CephFuse::Handle::make_fake_ino(inodeno_t, snapid_t)+0x18d)
> [0x7f4571f844bd]
>  5: (()+0x19ae21) [0x7f4571f84e21]
>  6: (()+0x164b5) [0x7f457199e4b5]
>  7: (()+0x16bdb) [0x7f457199ebdb]
>  8: (()+0x13471) [0x7f457199b471]
>  9: (()+0x7e25) [0x7f4570b81e25]
>  10: (clone()+0x6d) [0x7f456fa6934d]
>
> Detailed events dump:
> https://drive.google.com/file/d/0B_4ESJRu7BZFcHZmdkYtVG5CTGQ3UVFod0NxQloxS0ZCZmQ0/view?usp=sharing
> Let me know if more info is needed.
>
> Thanks.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs hardlink snapshot

2018-04-05 Thread Patrick Donnelly
Hi Marc,

On Wed, Apr 4, 2018 at 11:21 PM, Marc Roos  wrote:
>
> 'Hard links do not interact well with snapshots' is this still an issue?
> Because I am using rsync and hardlinking. And it would be nice if I can
> snapshot the directory, instead of having to copy it.

Hardlink handling for snapshots will be in Mimic.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs snapshot format upgrade

2018-04-10 Thread Patrick Donnelly
On Tue, Apr 10, 2018 at 5:54 AM, John Spray  wrote:
> On Tue, Apr 10, 2018 at 1:44 PM, Yan, Zheng  wrote:
>> Hello
>>
>> To simplify snapshot handling in multiple active mds setup, we changed
>> format of snaprealm in mimic dev.
>> https://github.com/ceph/ceph/pull/16779.
>>
>> The new version mds can handle old format snaprealm in single active
>> setup. It also can convert old format snaprealm to the new format when
>> snaprealm is modified. The problem is that new version mds can not
>> properly handle old format snaprealm in multiple active setup. It may
>> crash when it encounter old format snaprealm. For existing filesystem
>> with snapshots, upgrading mds to mimic seems to be no problem at first
>> glance. But if user later enables multiple active mds,  mds may
>> crashes continuously. No easy way to switch back to single acitve mds.
>>
>> I don't have clear idea how to handle this situation. I can think of a
>> few options.
>>
>> 1. Forbid multiple active before all old snapshots are deleted or
>> before all snaprealms are converted to new format. Format conversion
>> requires traversing while whole filesystem tree.  Not easy to
>> implement.
>
> This has been a general problem with metadata format changes: we can
> never know if all the metadata in a filesystem has been brought up to
> a particular version.  Scrubbing (where scrub does the updates) should
> be the answer, but we don't have the mechanism for recording/ensuring
> the scrub has really happened.
>
> Maybe we need the MDS to be able to report a complete whole-filesystem
> scrub to the monitor, and record a field like
> "latest_scrubbed_version" in FSMap, so that we can be sure that all
> the filesystem metadata has been brought up to a certain version
> before enabling certain features?  So we'd then have a
> "latest_scrubbed_version >= mimic" test before enabling multiple
> active daemons.
>
> For this particular situation, we'd also need to protect against
> people who had enabled multimds and snapshots experimentally, with an
> MDS startup check like:
>  ((ever_allowed_features & CEPH_MDSMAP_ALLOW_SNAPS) &&
> (allows_multimds() || in.size() >1)) && latest_scrubbed_version <
> mimic

This sounds like the right approach to me. The mons should also be
capable of performing the same test and raising a health error that
pre-Mimic MDSs must be started and the number of actives be reduced to
1.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster degraded after Ceph Upgrade 12.2.1 => 12.2.2

2018-04-11 Thread Patrick Donnelly
Hello Ronny,

On Wed, Apr 11, 2018 at 10:25 AM, Ronny Aasen  wrote:
> mds: restart mds's one at the time. you will notice the standby mds taking
> over for the mds that was restarted. do both.

No longer recommended. See:
http://docs.ceph.com/docs/master/cephfs/upgrading/#upgrading-the-mds-cluster

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster degraded after Ceph Upgrade 12.2.1 => 12.2.2

2018-04-12 Thread Patrick Donnelly
On Thu, Apr 12, 2018 at 5:05 AM, Mark Schouten  wrote:
> On Wed, 2018-04-11 at 17:10 -0700, Patrick Donnelly wrote:
>> No longer recommended. See:
>> http://docs.ceph.com/docs/master/cephfs/upgrading/#upgrading-the-mds-
>> cluster
>
> Shouldn't docs.ceph.com/docs/luminous/cephfs/upgrading include that
> too?

The backport is in-progress: https://github.com/ceph/ceph/pull/21352

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs luminous 12.2.4 - multi-active MDSes with manual pinning

2018-04-24 Thread Patrick Donnelly
Hello Linh,

On Tue, Apr 24, 2018 at 12:34 AM, Linh Vu  wrote:
> However, on our production cluster, with more powerful MDSes (10 cores
> 3.4GHz, 256GB RAM, much faster networking), I get this in the logs
> constantly:
>
> 2018-04-24 16:29:21.998261 7f02d1af9700  0 mds.1.migrator nicely exporting
> to mds.0 [dir 0x110cd91.1110* /home/ [2,head] auth{0=1017} v=5632699
> cv=5632651/5632651 dir_auth=1 state=1611923458|complete|auxsubtree f(v84
> 55=0+55) n(v245771 rc2018-04-24 16:28:32.830971 b233439385711
> 423085=383063+40022) hs=55+0,ss=0+0 dirty=1 | child=1 frozen=0 subtree=1
> replicated=1 dirty=1 authpin=0 0x55691ccf1c00]
>
> To clarify, /home is pinned to mds.1, so there is no reason it should export
> this to mds.0, and the loads on both MDSes (req/s, network load, CPU load)
> are fairly low, lower than those on the test MDS VMs.

As Dan said, this is simply a spurious log message. Nothing is being
exported. This will be fixed in 12.2.6 as part of several fixes to the
load balancer:

https://github.com/ceph/ceph/pull/21412/commits/cace918dd044b979cd0d54b16a6296094c8a9f90

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multi-MDS Failover

2018-04-26 Thread Patrick Donnelly
On Thu, Apr 26, 2018 at 3:16 PM, Scottix  wrote:
> Updated to 12.2.5
>
> We are starting to test multi_mds cephfs and we are going through some
> failure scenarios in our test cluster.
>
> We are simulating a power failure to one machine and we are getting mixed
> results of what happens to the file system.
>
> This is the status of the mds once we simulate the power loss considering
> there are no more standbys.
>
> mds: cephfs-2/2/2 up
> {0=CephDeploy100=up:active,1=TigoMDS100=up:active(laggy or crashed)}
>
> 1. It is a little unclear if it is laggy or really is down, using this line
> alone.

Of course -- the mons can't tell the difference!

> 2. The first time we lost total access to ceph folder and just blocked i/o

You must have standbys for high availability. This is the docs.

> 3. One time we were still able to access ceph folder and everything seems to
> be running.

It depends(tm) on how the metadata is distributed and what locks are
held by each MDS.

Standbys are not optional in any production cluster.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multi-MDS Failover

2018-04-26 Thread Patrick Donnelly
On Thu, Apr 26, 2018 at 4:40 PM, Scottix  wrote:
>> Of course -- the mons can't tell the difference!
> That is really unfortunate, it would be nice to know if the filesystem has
> been degraded and to what degree.

If a rank is laggy/crashed, the file system as a whole is generally
unavailable. The span between partial outage and full is small and not
worth quantifying.

>> You must have standbys for high availability. This is the docs.
> Ok but what if you have your standby go down and a master go down. This
> could happen in the real world and is a valid error scenario.
>Also there is
> a period between when the standby becomes active what happens in-between
> that time?

The standby MDS goes through a series of states where it recovers the
lost state and connections with clients. Finally, it goes active.

>> It depends(tm) on how the metadata is distributed and what locks are
> held by each MDS.
> Your saying depending on which mds had a lock on a resource it will block
> that particular POSIX operation? Can you clarify a little bit?
>
>> Standbys are not optional in any production cluster.
> Of course in production I would hope people have standbys but in theory
> there is no enforcement in Ceph for this other than a warning. So when you
> say not optional that is not exactly true it will still run.

It's self-defeating to expect CephFS to enforce having standbys --
presumably by throwing an error or becoming unavailable -- when the
standbys exist to make the system available.

There's nothing to enforce. A warning is sufficient to tell the
operator that (a) they didn't configure any standbys, or (b) MDS daemon
processes/boxes are going away and not coming back as standbys (i.e.
the pool of MDS daemons is decreasing with each failover).

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multi-MDS Failover

2018-04-27 Thread Patrick Donnelly
On Thu, Apr 26, 2018 at 7:04 PM, Scottix  wrote:
> Ok let me try to explain this better, we are doing this back and forth and
> its not going anywhere. I'll just be as genuine as I can and explain the
> issue.
>
> What we are testing is a critical failure scenario and actually more of a
> real world scenario. Basically just what happens when it is 1AM and the shit
> hits the fan, half of your servers are down and 1 of the 3 MDS boxes are
> still alive.
> There is one very important fact that happens with CephFS and when the
> single Active MDS server fails. It is guaranteed 100% all IO is blocked. No
> split-brain, no corrupted data, 100% guaranteed ever since we started using
> CephFS
>
>
> Now with multi_mds, I understand this changes the logic and I understand how
> difficult and how hard this problem is, trust me I would not be able to
> tackle this. Basically I need to answer the question; what happens when 1 of
> 2 multi_mds fails with no standbys ready to come save them?
> What I have tested is not the same of a single active MDS; this absolutely
> changes the logic of what happens and how we troubleshoot. The CephFS is
> still alive and it does allow operations and does allow resources to go
> through. How, why and what is affected are very relevant questions if this
> is what the failure looks like since it is not 100% blocking.

Okay so now I understand what your real question is: what is the state
of CephFS when one or more ranks have failed but no standbys exist to
takeover? The answer is that there may be partial availability from
the up:active ranks which may hand out capabilities for the subtrees
they manage or no availability if that's not possible because it
cannot obtain the necessary locks.  No metadata is lost. No
inconsistency is created between clients. Full availability will be
restored when the lost ranks come back online.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.

2018-04-30 Thread Patrick Donnelly
Hello Sean,

On Mon, Apr 30, 2018 at 2:32 PM, Sean Sullivan  wrote:
> I was creating a new user and mount point. On another hardware node I
> mounted CephFS as admin to mount as root. I created /aufstest and then
> unmounted. From there it seems that both of my mds nodes crashed for some
> reason and I can't start them any more.
>
> https://pastebin.com/1ZgkL9fa -- my mds log
>
> I have never had this happen in my tests so now I have live data here. If
> anyone can lend a hand or point me in the right direction while
> troubleshooting that would be a godsend!

Thanks for keeping the list apprised of your efforts. Since this is so
easily reproduced for you, I would suggest that you next get higher
debug logs (debug_mds=20/debug_ms=1) from the MDS. And, since this is
a segmentation fault, a backtrace with debug symbols from gdb would
also be helpful.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-10 Thread Patrick Donnelly
Hello Brady,

On Thu, May 10, 2018 at 7:35 AM, Brady Deetz  wrote:
> I am now seeing the exact same issues you are reporting. A heap release did
> nothing for me.

I'm not sure it's the same issue...

> [root@mds0 ~]# ceph daemon mds.mds0 config get mds_cache_memory_limit
> {
> "mds_cache_memory_limit": "80530636800"
> }

80G right? What was the memory use from `ps aux | grep ceph-mds`?

> [root@mds0 ~]# ceph daemon mds.mds0 perf dump
> {
> ...
> "inode_max": 2147483647,
> "inodes": 35853368,
> "inodes_top": 23669670,
> "inodes_bottom": 12165298,
> "inodes_pin_tail": 18400,
> "inodes_pinned": 2039553,
> "inodes_expired": 142389542,
> "inodes_with_caps": 831824,
> "caps": 881384,

Your cap count is 2% of the inodes in cache; the pinned inodes are 5%
of the total. Your cache should be getting trimmed assuming the cache
size (as measured by the MDS; there are fixes in 12.2.5 which improve
its precision) is larger than your configured limit.

If the cache size is larger than the limit (use `cache status` admin
socket command) then we'd be interested in seeing a few seconds of the
MDS debug log with higher debugging set (`config set debug_mds 20`).
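
Concretely, on the MDS host (the daemon name is a placeholder):

  ceph daemon mds.<name> cache status
  ceph daemon mds.<name> config set debug_mds 20    # capture a few seconds of log
  ceph daemon mds.<name> config set debug_mds 1/5   # restore the default afterwards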

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-10 Thread Patrick Donnelly
On Thu, May 10, 2018 at 12:00 PM, Brady Deetz  wrote:
> [ceph-admin@mds0 ~]$ ps aux | grep ceph-mds
> ceph1841  3.5 94.3 133703308 124425384 ? Ssl  Apr04 1808:32
> /usr/bin/ceph-mds -f --cluster ceph --id mds0 --setuser ceph --setgroup ceph
>
>
> [ceph-admin@mds0 ~]$ sudo ceph daemon mds.mds0 cache status
> {
> "pool": {
> "items": 173261056,
> "bytes": 76504108600
> }
> }
>
> So, 80GB is my configured limit for the cache and it appears the mds is
> following that limit. But, the mds process is using over 100GB RAM in my
> 128GB host. I thought I was playing it safe by configuring at 80. What other
> things consume a lot of RAM for this process?
>
> Let me know if I need to create a new thread.

The cache size measurement is imprecise pre-12.2.5 [1]. You should upgrade ASAP.

[1] https://tracker.ceph.com/issues/22972

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Too many active mds servers

2018-05-15 Thread Patrick Donnelly
Hello Thomas,

On Tue, May 15, 2018 at 2:35 PM, Thomas Bennett  wrote:
> Hi,
>
> I'm running Luminous 12.2.5 and I'm testing cephfs.
>
> However, I seem to have too many active mds servers on my test cluster.
>
> How do I set one of my mds servers to become standby?
>
> I've run ceph fs set cephfs max_mds 2 which set the max_mds from 3 to 2 but
> has no effect on my running configuration.

http://docs.ceph.com/docs/luminous/cephfs/multimds/#decreasing-the-number-of-ranks

Note: the behavior is changing in Mimic to be automatic after reducing max_mds.
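
For your case that would look roughly like this on Luminous (assuming
the file system is named "cephfs" and rank 2 is the highest rank):

  ceph fs set cephfs max_mds 2
  ceph mds deactivate cephfs:2    # the stopped rank's daemon becomes a standby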

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] (yet another) multi active mds advise needed

2018-05-18 Thread Patrick Donnelly
Hello Webert,

On Fri, May 18, 2018 at 1:10 PM, Webert de Souza Lima
 wrote:
> Hi,
>
> We're migrating from a Jewel / filestore based cephfs archicture to a
> Luminous / buestore based one.
>
> One MUST HAVE is multiple Active MDS daemons. I'm still lacking knowledge of
> how it actually works.
> After reading the docs and ML we learned that they work by sort of dividing
> the responsibilities, each with his own and only directory subtree. (please
> correct me if I'm wrong).

Each MDS may have multiple subtrees they are authoritative for. Each
MDS may also replicate metadata from another MDS as a form of load
balancing.

> Question 1: I'd like to know if it is viable to have 4 MDS daemons, being 3
> Active and 1 Standby (or Standby-Replay if that's still possible with
> multi-mds).

standby-replay daemons are not available to take over for ranks other
than the one it follows. So, you would want to have a standby-replay
daemon for each rank or just have normal standbys. It will likely
depend on the size of your MDS (cache size) and available hardware.
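
A sketch of per-daemon standby-replay configuration on Luminous (the
daemon names and ranks are placeholders):

  [mds.mds-a-sr]
    mds standby replay = true
    mds standby for rank = 0

  [mds.mds-b-sr]
    mds standby replay = true
    mds standby for rank = 1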

> Basically, what we have is 2 subtrees used by dovecot: INDEX and MAIL.
> Their tree is almost identical but INDEX stores all dovecot metadata with
> heavy IO going on and MAIL stores actual email files, with much more writes
> than reads.
>
> I don't know by now which one could bottleneck the MDS servers most so I
> wonder if I can take metrics on MDS usage per pool when it's deployed.
> Question 2: If the metadata workloads are very different I wonder if I can
> isolate them, like pinning MDS servers X and Y to one of the directories.

It's best to see whether the normal balancer (especially in v12.2.6
[1]) can handle the load for you without trying to micromanage things
via pins. You can use pinning to isolate metadata load from other
ranks as a stop-gap measure.

[1] https://github.com/ceph/ceph/pull/21412

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS "move" operation

2018-05-25 Thread Patrick Donnelly
On Fri, May 25, 2018 at 6:46 AM, Oliver Freyermuth
 wrote:
>> It might be possible to allow rename(2) to proceed in cases where
>> nlink==1, but the behavior will probably seem inconsistent (some files get
>> EXDEV, some don't).
>
> I believe even this would be extremely helpful, performance-wise. At least in 
> our case, hardlinks are seldomly used,
> it's more about data movement between user, group and scratch areas.
> For files with nlinks>1, it's more or less expected a copy has to be 
> performed when crossing quota boundaries (I think).

It may be possible to allow the rename in the MDS and check quotas
there. I've filed a tracker ticket here:
http://tracker.ceph.com/issues/24305


-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Mimic (13.2.0) Release Notes Bug on CephFS Snapshot Upgrades

2018-06-07 Thread Patrick Donnelly
There was a bug [1] in the release notes [2] which had incorrect
commands for upgrading the snapshot format of an existing CephFS file
system which has had snapshots enabled at some point. The correction
is here [3]:

diff --git a/doc/releases/mimic.rst b/doc/releases/mimic.rst
index 137d56311c..3a3345bbc0 100644
--- a/doc/releases/mimic.rst
+++ b/doc/releases/mimic.rst
@@ -346,8 +346,8 @@ These changes occurred between the Luminous and Mimic releases.
 previous max_mds" step in above URL to fail. To re-enable the feature,
 either delete all old snapshots or scrub the whole filesystem:

-  - ``ceph daemon  scrub_path /``
-  - ``ceph daemon  scrub_path '~mdsdir'``
+  - ``ceph daemon  scrub_path / force recursive repair``
+  - ``ceph daemon  scrub_path '~mdsdir' force recursive repair``

   - Support has been added in Mimic for quotas in the Linux kernel client as of v4.17.


The release notes on the blog have already been updated.

If you executed the wrong commands already, it should be sufficient to
run the correct commands once more to fix the file system.

[1] https://tracker.ceph.com/issues/24435
[2] https://ceph.com/releases/v13-2-0-mimic-released/
[3] https://github.com/ceph/ceph/pull/22445/files

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS mount in Kubernetes requires setenforce

2018-06-18 Thread Patrick Donnelly
On Mon, Jun 18, 2018 at 10:34 AM, Rares Vernica  wrote:
> I have a CentOS cluster running Ceph, in particular CephFS. I'm also running
> Kubernetes on the cluster and using CephFS as a persistent storage for the
> Kubernetes pods. I noticed that the pods can't read or write on the mounted
> CephFS volumes unless I do "setenforce 0" on the CentOS hosts. Is this
> expected? Is there a better way to enable pods to write to the CephFS
> volumes?

It's a known issue that  the CephFS kernel client doesn't work with
SELinux yet: http://tracker.ceph.com/issues/13231

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Review of Ceph on ZFS - or how not to deploy Ceph for RBD + OpenStack

2017-01-10 Thread Patrick Donnelly
Hello Kevin,

On Tue, Jan 10, 2017 at 4:21 PM, Kevin Olbrich  wrote:
> 5x Ceph node equipped with 32GB RAM, Intel i5, Intel DC P3700 NVMe journal,

Is the "journal" used as a ZIL?

> We experienced a lot of io blocks (X requests blocked > 32 sec) when a lot
> of data is changed in cloned RBDs (disk imported via OpenStack Glance,
> cloned during instance creation by Cinder).
> If the disk was cloned some months ago and large software updates are
> applied (a lot of small files) combined with a lot of syncs, we often had a
> node hit suicide timeout.
> Most likely this is a problem with op thread count, as it is easy to block
> threads with RAIDZ2 (RAID6) if many small operations are written to disk
> (again, COW is not optimal here).
> When recovery took place (0.020% degraded) the cluster performance was very
> bad - remote service VMs (Windows) were unusable. Recovery itself was using
> 70 - 200 mb/s which was okay.

I would think having an SSD ZIL here would make a very large
difference. Probably a ZIL may have a much larger performance impact
than an L2ARC device. [You may even partition it and have both but I'm
not sure if that's normally recommended.]
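
For reference, attaching partitions of the NVMe device as both SLOG
(ZIL) and L2ARC would look something like this (pool and device names
are hypothetical):

  zpool add tank log /dev/nvme0n1p1     # dedicated ZIL / SLOG
  zpool add tank cache /dev/nvme0n1p2   # L2ARC read cache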

Thanks for your writeup!

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] systemd and ceph-mon autostart on Ubuntu 16.04

2017-01-25 Thread Patrick Donnelly
On Wed, Jan 25, 2017 at 2:19 PM, Wido den Hollander  wrote:
> Hi,
>
> I thought this issue was resolved a while ago, but while testing Kraken with 
> BlueStore I ran into the problem again.
>
> My monitors are not being started on boot:
>
> Welcome to Ubuntu 16.04.1 LTS (GNU/Linux 4.4.0-59-generic x86_64)
>
>  * Documentation:  https://help.ubuntu.com
>  * Management: https://landscape.canonical.com
>  * Support:https://ubuntu.com/advantage
> Last login: Wed Jan 25 15:08:57 2017 from 2001:db8::100
> root@bravo:~# systemctl status ceph-mon.target
> ● ceph-mon.target - ceph target allowing to start/stop all ceph-mon@.service 
> instances at once
>Loaded: loaded (/lib/systemd/system/ceph-mon.target; disabled; vendor 
> preset: enabled)
>Active: inactive (dead)
> root@bravo:~#
>
> If I enable ceph-mon.target my Monitors start just fine on boot:
>
> root@bravo:~# systemctl enable ceph-mon.target
> Created symlink from 
> /etc/systemd/system/multi-user.target.wants/ceph-mon.target to 
> /lib/systemd/system/ceph-mon.target.
> Created symlink from /etc/systemd/system/ceph.target.wants/ceph-mon.target to 
> /lib/systemd/system/ceph-mon.target.
> root@bravo:~# ceph -v
> ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)
> root@bravo:~#
>
> Anybody else seeing this before I start digging into the .deb packaging?

Are you wanting ceph-mon.target to automatically be enabled on package
install? That doesn't sound good to me but I'm not familiar with
Ubuntu's packaging rules. I would think the sysadmin must enable the
services they install themselves.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Passing LUA script via python rados execute

2017-02-19 Thread Patrick Donnelly
On Sat, Feb 18, 2017 at 2:55 PM, Noah Watkins  wrote:
> The least intrusive solution is to simply change the sandbox to allow
> the standard file system module loading function as expected. Then any
> user would need to make sure that every OSD had consistent versions of
> dependencies installed using something like LuaRocks. This is simple,
> but could make debugging and deployment a major headache.

A locked-down require which doesn't load C bindings (i.e. only loads
.lua files) would probably be alright.

> A more ambitious version would be to create an interface for users to
> upload scripts and dependencies into objects, and support referencing
> those objects as standard dependencies in Lua scripts as if they were
> standard modules on the file system. Each OSD could then cache scripts
> and dependencies, allowing applications to use references to scripts
> instead of sending a script with every request.

This is very doable. I imagine we'd just put all of the Lua modules in
a flattened hierarchy under a RADOS namespace? The potentially
annoying nit in this is writing some kind of mechanism for installing
a Lua module tree into RADOS. Users would install locally and then
upload the tree through some tool.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Passing LUA script via python rados execute

2017-02-21 Thread Patrick Donnelly
On Tue, Feb 21, 2017 at 4:45 PM, Nick Fisk  wrote:
> I'm trying to put some examples together for a book and so wanted to try and 
> come up with a more out of the box experience someone could follow. I'm 
> guessing some basic examples in LUA and then come custom rados classes in C++ 
> might be the best approach for this for now?

FYI, since you are writing a book: Lua is not an acronym:
https://www.lua.org/about.html#name

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] purging strays faster

2017-03-07 Thread Patrick Donnelly
Hi Dan,

On Tue, Mar 7, 2017 at 11:10 AM, Daniel Davidson
 wrote:
> When I try this command, I still get errors:
>
> ceph daemon mds.0 config show
> admin_socket: exception getting command descriptions: [Errno 2] No such file
> or directory
> admin_socket: exception getting command descriptions: [Errno 2] No such file
> or directory
>
> I am guessing there is a path set up incorrectly somewhere, but I do not
> know where to look.

You need to run the command on the machine where the daemon is running.
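
For example, on the host running the MDS (the daemon id is a
placeholder), either of these should work:

  ceph daemon mds.<id> config show
  ceph --admin-daemon /var/run/ceph/ceph-mds.<id>.asok config show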

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] FreeBSD port net/ceph-devel released

2017-04-04 Thread Patrick Donnelly
On Sat, Apr 1, 2017 at 4:58 PM, Willem Jan Withagen  wrote:
> On 1-4-2017 21:59, Wido den Hollander wrote:
>>
>>> Op 31 maart 2017 om 19:15 schreef Willem Jan Withagen :
>>>
>>>
>>> On 31-3-2017 17:32, Wido den Hollander wrote:
>>>> Hi Willem Jan,
>>>>
>>>>> Op 30 maart 2017 om 13:56 schreef Willem Jan Withagen
>>>>> :
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> I'm pleased to announce that my efforts to port to FreeBSD have
>>>>> resulted in a ceph-devel port commit in the ports tree.
>>>>>
>>>>> https://www.freshports.org/net/ceph-devel/
>>>>>
>>>>
>>>> Awesome work! I don't touch FreeBSD that much, but I can imagine that
>>>> people want this.
>>>>
>>>> Out of curiosity, does this run on ZFS under FreeBSD? Or what
>>>> Filesystem would you use behind FileStore with this? Or does
>>>> BlueStore work?
>>>
>>> Since I'm a huge ZFS fan, that is what I run it on.
>>
>> Cool! The ZIL, ARC and L2ARC can actually make that very fast. Interesting!
>
> Right, ZIL is magic, and more or equal to the journal now used with OSDs
> for exactly the same reason. Sad thing is that a write is now 3*
> journaled: 1* by Ceph, and 2* by ZFS. Which means that the used
> bandwidth to the SSDs is double of what it could be.
>
> Had some discussion about this, but disabling the Ceph journal is not
> just setting an option. Although I would like to test performance of an
> OSD with just the ZFS journal. But I expect that the OSD journal is
> rather firmly integrated.

Disabling the OSD journal will never be viable. The journal is also
necessary for transactions and batch updates which cannot be done
atomically in FileStore.

This is great work Willem. I'm especially looking forward to seeing
BlueStore performance on a ZVol.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: ceph-fuse segfaults

2017-04-07 Thread Patrick Donnelly
Hello Andras,

On Wed, Mar 29, 2017 at 11:07 AM, Andras Pataki
 wrote:
> Below is a crash we had on a few machines with the ceph-fuse client on the
> latest Jewel release 10.2.6.  A total of 5 ceph-fuse processes crashed more
> or less the same way at different times.  The full logs are at
> http://voms.simonsfoundation.org:50013/9SXnEpflYPmE6UhM9EgOR3us341eqym/ceph-20170328

This is a reference count bug. I'm afraid it won't be possible to
debug it without a higher debug setting (probably "debug client =
0/20"). Be aware that will slow down your client.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Race Condition(?) in CephFS

2017-04-25 Thread Patrick Donnelly
Hello Adam,

On Tue, Apr 25, 2017 at 5:32 PM, Adam Tygart  wrote:
> I'm using CephFS, on CentOS 7. We're currently migrating away from
> using a catch-all cephx key to mount the filesystem (with the kernel
> module), to a much more restricted key.
>
> In my tests, I've come across an issue, extracting a tar archive with
> a mount using the restricted key routinely cannot create files or
> directories in recently created directories. I need to keep running a
> CentOS based kernel on the clients because of some restrictions from
> other software. Below looks like a race condition to me, although I am
> not versed well enough in Ceph or the inner workings of the kernel to
> know for sure.
> [...]
>
> We're currently running Ceph Jewel (10.2.5). We're looking to update
> soon, but we wanted a clean backup of everything in CephFS first.

To me, this looks like: http://tracker.ceph.com/issues/17858

Fortunately you should only need to upgrade to 10.2.6 or 10.2.7 to fix this.

HTH,

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Failed to read JournalPointer - MDS error (mds rank 0 is damaged)

2017-05-02 Thread Patrick Donnelly
Looks like: http://tracker.ceph.com/issues/17236

The fix is in v10.2.6.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Removing MDS

2018-10-30 Thread Patrick Donnelly
On Tue, Oct 30, 2018 at 4:05 PM Rhian Resnick  wrote:
> We are running into issues deactivating mds ranks. Is there a way to safely 
> forcibly remove a rank?

No, there's no "safe" way to force the issue. The rank needs to come
back, flush its journal, and then complete its deactivation. To get
more help, you need to describe your environment, version of Ceph in
use, relevant log snippets, etc.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mon:failed in thread_name:safe_timer

2018-11-19 Thread Patrick Donnelly
On Mon, Nov 19, 2018 at 7:17 PM 楼锴毅  wrote:
> sorry to disturb , but recently when I use ceph(12.2.8),I found that the 
> leader monitor will always failed in thread_name:safe_timer.
> [...]

Try upgrading the mons to v12.2.9 (but see recent warnings concerning
upgrades to v12.2.9 for the OSDs):
https://tracker.ceph.com/issues/35848

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] read performance, separate client CRUSH maps or limit osd read access from each client

2018-11-20 Thread Patrick Donnelly
You either need to accept that reads/writes will land on different data
centers, ensure the primary OSD for a given pool is always in the
desired data center, or use some other non-Ceph solution, which will
have either expensive, eventual, or false consistency.

On Fri, Nov 16, 2018, 10:07 AM Vlad Kopylov  wrote:
> This is what Jean suggested. I understand it and it works with primary.
> *But what I need is for all clients to access same files, not separate
> sets (like red blue green)*
>
> Thanks Konstantin.
>
> On Fri, Nov 16, 2018 at 3:43 AM Konstantin Shalygin 
> wrote:
>
>> On 11/16/18 11:57 AM, Vlad Kopylov wrote:
>> > Exactly. But write operations should go to all nodes.
>>
>> This can be set via primary affinity [1], when a ceph client reads or
>> writes data, it always contacts the primary OSD in the acting set.
>>
>>
>> If u want to totally segregate IO, you can use device classes:
>>
>> Just create osds with different classes:
>>
>> dc1
>>
>>host1
>>
>>  red osd.0 primary
>>
>>  blue osd.1
>>
>>  green osd.2
>>
>> dc2
>>
>>host2
>>
>>  red osd.3
>>
>>  blue osd.4 primary
>>
>>  green osd.5
>>
>> dc3
>>
>>host3
>>
>>  red osd.6
>>
>>  blue osd.7
>>
>>  green osd.8 primary
>>
>>
>> create 3 crush rules:
>>
>> ceph osd crush rule create-replicated red default host red
>>
>> ceph osd crush rule create-replicated blue default host blue
>>
>> ceph osd crush rule create-replicated green default host green
>>
>>
>> and 3 pools:
>>
>> ceph osd pool create red 64 64 replicated red
>>
>> ceph osd pool create blue 64 64 replicated blue
>>
>> ceph osd pool create green 64 64 replicated green
>>
>>
>> [1]
>>
>> http://docs.ceph.com/docs/master/rados/operations/crush-map/#primary-affinity
>>
>>
>>
>> k
>>
>> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mon:failed in thread_name:safe_timer

2018-11-21 Thread Patrick Donnelly
On Tue, Nov 20, 2018 at 6:18 PM 楼锴毅  wrote:
> Hello
> Yesterday I upgraded my cluster to v12.2.9, but the mons still failed for the
> same reason. And when I run 'ceph versions', it returned
> "
> "mds": {
> "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) 
> luminous (stable)": 1,
> "ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) 
> luminous (stable)": 4
> },
> "
> But actually I only have four MDSs, and their versions are all v12.2.9. I am
> confused about it.

How did you restart the MDSs? If you used `ceph mds fail` then the
executable version (v12.2.8) will not change.

Also, the monitor failure requires updating the monitor to v12.2.9.
What version are the mons running?
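
For comparison, a restart that actually swaps in the new binary would look
roughly like this (the unit and daemon names are assumptions about your
deployment):

    # on the MDS host, after installing the new packages
    systemctl restart ceph-mds@$(hostname -s)
    # confirm via the admin socket which version the daemon is now running
    ceph daemon mds.$(hostname -s) version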

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS MDS optimal setup on Google Cloud

2019-01-07 Thread Patrick Donnelly
Hello Mahmoud,

On Fri, Dec 21, 2018 at 7:44 AM Mahmoud Ismail
 wrote:
> I'm doing benchmarks for metadata operations on CephFS, HDFS, and HopsFS on
> Google Cloud. In my current setup, I'm using 32 vCPU machines with 29 GB
> memory, and I have 1 MDS, 1 MON and 3 OSDs. The MDS and the MON nodes are
> co-located on one VM, while each of the OSDs is on a separate VM with 1 SSD
> disk attached. I'm using the default configuration for the MDS and OSDs.
>
> I'm running 300 clients on 10 machines (16 vCPU); each client creates a
> CephFileSystem using the CephFS hadoop plugin, and then writes empty files
> for 30 seconds followed by reading the empty files for another 30 seconds.
> The aggregated throughput is around 2000 file create operations/sec and 1
> file read operations/sec. However, the MDS is not fully utilizing the 32
> cores on the machine; is there any configuration that I should consider to
> fully utilize the machine?

The MDS is not yet very parallel; it can only utilize about 2.5 cores
in the best circumstances. Make sure you allocate plenty of RAM for
the MDS. 16GB or 32GB would be a good choice. See (and disregard the
warning on that page):
http://docs.ceph.com/docs/mimic/cephfs/cache-size-limits/

You may also try using multiple active metadata servers to increase
throughput. See: http://docs.ceph.com/docs/mimic/cephfs/multimds/
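
As a sketch, with the filesystem name as a placeholder (the centralized
`ceph config set` needs Mimic or newer; on Luminous set the option in
ceph.conf or via injectargs, and you may also need to allow multimds
first):

    # raise the MDS cache target to 16 GiB (value is in bytes)
    ceph config set mds mds_cache_memory_limit 17179869184
    # add a second active MDS; a standby daemon must be available to take the rank
    ceph fs set cephfs max_mds 2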

> Also, I noticed that running more than 20-30 clients (on different threads)
> per machine degrades the aggregated read throughput. Is there a limitation
> in CephFileSystem and libceph on the number of clients created per machine?

No. Can't give you any hints without more information about the test
setup. We also have not tested with the Hadoop plugin in years. There
may be limitations we're not presently aware of.

> Another issue: are the MDS operations single-threaded, as pointed out here:
> https://www.slideshare.net/XiaoxiChen3/cephfs-jewel-mds-performance-benchmark ?

Yes, this is still the case.

> Regarding the MDS global lock, is it a single lock per MDS or is it a
> global distributed lock for all MDSs?

per-MDS


-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-maintainers] v13.2.4 Mimic released

2019-01-08 Thread Patrick Donnelly
On Mon, Jan 7, 2019 at 7:10 AM Alexandre DERUMIER  wrote:
>
> Hi,
>
> >>* Ceph v13.2.2 includes a wrong backport, which may cause mds to go into
> >>'damaged' state when upgrading Ceph cluster from previous version.
> >>The bug is fixed in v13.2.3. If you are already running v13.2.2,
> >>upgrading to v13.2.3 does not require special action.
>
> Any special action for upgrading from 13.2.1 ?

No special actions for CephFS are required for the upgrade.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] tuning ceph mds cache settings

2019-01-09 Thread Patrick Donnelly
Hello Jonathan,

On Wed, Jan 9, 2019 at 5:37 AM Jonathan Woytek  wrote:
> While examining performance under load at scale, I see a marked
> performance improvement whenever I restart certain MDS daemons. I was
> able to duplicate the performance improvement by issuing a "daemon mds.blah 
> cache drop". The performance bump lasts for quite a long time--far longer 
> than it takes for the cache to "fill" according to the stats.

What version of Ceph are you running? Can you expand on what this
performance improvement is?

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH_FSAL Nfs-ganesha

2019-01-15 Thread Patrick Donnelly
On Mon, Jan 14, 2019 at 7:11 AM Daniel Gryniewicz  wrote:
>
> Hi.  Welcome to the community.
>
> On 01/14/2019 07:56 AM, David C wrote:
> > Hi All
> >
> > I've been playing around with the nfs-ganesha 2.7 exporting a cephfs
> > filesystem, it seems to be working pretty well so far. A few questions:
> >
> > 1) The docs say " For each NFS-Ganesha export, FSAL_CEPH uses a
> > libcephfs client,..." [1]. For arguments sake, if I have ten top level
> > dirs in my Cephfs namespace, is there any value in creating a separate
> > export for each directory? Will that potentially give me better
> > performance than a single export of the entire namespace?
>
> I don't believe there are any advantages from the Ceph side.  From the
> Ganesha side, you configure permissions, client ACLs, squashing, and so
> on on a per-export basis, so you'll need different exports if you need
> different settings for each top level directory.  If they can all use
> the same settings, one export is probably better.

There may be a performance impact (good or bad) from having separate
exports for CephFS. Each export instantiates a separate instance of
the CephFS client, which has its own bookkeeping and set of
capabilities issued by the MDS. Also, each client instance has a
separate big lock (potentially a big deal for performance). If the
data for each export is disjoint (no hard links or shared inodes) and
the NFS server is expected to have a lot of load, breaking out the
exports can have a positive impact on performance. If there are hard
links, then the clients associated with the exports will potentially
fight over capabilities, which will add to request latency.
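
For illustration, a disjoint per-directory export in ganesha.conf might
look roughly like the block below (IDs, paths, and access settings are
placeholders, not recommendations); repeat it with a different
Export_ID/Path for each top-level directory you want isolated:

    EXPORT {
        Export_ID = 1;
        Path = "/dir1";
        Pseudo = "/dir1";
        Access_Type = RW;
        Squash = No_Root_Squash;
        FSAL {
            Name = CEPH;
        }
    }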

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to do multiple cephfs mounts.

2019-01-17 Thread Patrick Donnelly
On Thu, Jan 17, 2019 at 3:23 AM Marc Roos  wrote:
> Shouldn't I be able to increase the IOPS by splitting the data writes
> over e.g. 2 CephFS mounts? I am still getting similar overall
> performance. Is it even possible to increase performance by using
> multiple mounts?
>
> Using 2 kernel mounts on CentOS 7.6

It's unlikely this changes anything unless you also split the workload
into two. That may allow the kernel to do parallel requests?
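
A hedged sketch of what splitting the workload could look like with two
kernel mounts (monitor address, credentials, and paths are placeholders):

    mount -t ceph mon1:6789:/projectA /mnt/cephfs-a -o name=admin,secretfile=/etc/ceph/admin.secret
    mount -t ceph mon1:6789:/projectB /mnt/cephfs-b -o name=admin,secretfile=/etc/ceph/admin.secret
    # then point half of the writers at /mnt/cephfs-a and the other half at /mnt/cephfs-b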

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multi-filesystem wthin a cluster

2019-01-17 Thread Patrick Donnelly
On Thu, Jan 17, 2019 at 2:44 AM Dan van der Ster  wrote:
>
> On Wed, Jan 16, 2019 at 11:17 PM Patrick Donnelly  wrote:
> >
> > On Wed, Jan 16, 2019 at 1:21 AM Marvin Zhang  wrote:
> > > Hi CephFS experts,
> > > From the documentation, I know that multi-fs within a cluster is still an
> > > experimental feature.
> > > 1. Is there any estimation about stability and performance for this 
> > > feature?
> >
> > Remaining blockers [1] need to be completed. No developer has yet taken on
> > this task. Perhaps by the O release.
> >
> > > 2. It seems that each FS will consume at least 1 active MDS and
> > > different FSs can't share an MDS. Suppose I want to create 10 FSs; I need
> > > at least 10 MDSs. Is that right? Is there any limit on the number of MDSs
> > > within a cluster?
> >
> > No limit on number of MDS but there is a limit on the number of
> > actives (multimds).
>
> TIL...
> What is the max number of actives in a single FS?

https://github.com/ceph/ceph/blob/39f9e8db4dc7f8bfcb01a9ad20b8961c36138f4f/src/mds/mdstypes.h#L40

I don't think there's a particular reason for this limit. There may be
some parts of the code that expect fewer than 256 active MDS but that
could probably be easily changed.
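
For completeness, creating a second filesystem still has to be enabled
explicitly; a minimal sketch with placeholder pool and filesystem names
(the pools must already exist):

    ceph fs flag set enable_multiple true --yes-i-really-mean-it
    ceph fs new secondfs secondfs_metadata secondfs_data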

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Update / upgrade cluster with MDS from 12.2.7 to 12.2.11

2019-02-11 Thread Patrick Donnelly
On Mon, Feb 11, 2019 at 12:10 PM Götz Reinicke
 wrote:
> As 12.2.11 has been out for some days and no panic mails have shown up on
> the list, I was planning to update too.
>
> I know there are recommended orders in which to update/upgrade the cluster,
> but I don't know how the rpm packages handle restarting services after a
> yum update, e.g. when MDS and MONs are on the same server.

This should be fine. The MDS only uses a new executable file if you
explicitly restart it via systemd (or, the MDS fails and systemd
restarts it).

More info: when the MDS respawns in normal circumstances, it passes
the /proc/self/exe file to execve. An intended side-effect is that the
MDS will continue using the same executable file across execs.

> And regarding an MDS cluster, I'd like to ask whether the upgrade instructions
> about running only one MDS during an upgrade also apply to an update?
>
> http://docs.ceph.com/docs/mimic/cephfs/upgrading/

If you upgrade an MDS, it may update the compatibility bits in the
Monitor's MDSMap. Other MDSs will abort when they see this change. The
upgrade process is intended to help you avoid seeing those errors so you
don't inadvertently think something went wrong.

If you don't mind seeing those errors and you're using 1 active MDS,
then don't worry about it.

Good luck!

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS - read latency.

2019-02-18 Thread Patrick Donnelly
On Sun, Feb 17, 2019 at 9:51 PM  wrote:
>
> > Probably not related to CephFS. Try to compare the latency you are
> > seeing to the op_r_latency reported by the OSDs.
> >
> > The fast_read option on the pool can also help a lot for this IO pattern.
>
> Magic, that actually cut the read-latency in half - making it more
> aligned with what to expect from the HW+network side:
>
> N   Min   MaxMedian   AvgStddev
> x 100  0.015687  0.221538  0.0252530.03259606   0.028827849
>
> 25ms as a median, 32ms average is still on the high side,
> but way, way better.
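
For reference, fast_read is a per-pool flag (mainly relevant for EC data
pools); a minimal sketch, assuming a data pool named cephfs_data:

    ceph osd pool set cephfs_data fast_read 1
    ceph osd pool get cephfs_data fast_read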

I'll use this opportunity to point out that serial archive programs
like tar are terrible for distributed file systems. It would be
awesome if someone multithreaded tar or extended it for asynchronous
I/O. If only I had more time (TM)...
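
In the meantime, one workaround sketch using only standard GNU tools
(paths and the worker count are illustrative) is to unpack to local
scratch space first and then push the files into CephFS with several
copies in flight:

    # unpack locally, serially
    tar -xf archive.tar -C /scratch/unpacked
    cd /scratch/unpacked
    # recreate the directory tree, then copy files with 8 parallel workers
    find . -type d -print0 | xargs -0 -I{} mkdir -p /mnt/cephfs/dest/{}
    find . -type f -print0 | xargs -0 -P 8 -I{} cp {} /mnt/cephfs/dest/{}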

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Understanding EC properties for CephFS / small files.

2019-02-18 Thread Patrick Donnelly
Hello Jesper,

On Sat, Feb 16, 2019 at 11:11 PM  wrote:
>
> Hi List.
>
> I'm trying to understand the nuts and bolts of EC / CephFS
> We're running an EC4+2 pool on top of 72 x 7.2K rpm 10TB drives. Pretty
> slow bulk / archive storage.
>
> # getfattr -n ceph.dir.layout /mnt/home/cluster/mysqlbackup
> getfattr: Removing leading '/' from absolute path names
> # file: mnt/home/cluster/mysqlbackup
> ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304
> pool=cephfs_data_ec42"
>
> This configuration is taken directly out of the online documentation:
> (Which may have been where it went all wrong from our perspective):

Correction: this is from the Ceph default for the file layout. The
default is that no file striping is performed and 4MB chunks are used
for file blocks. You may find this document instructive on how files
are striped (especially the ASCII art):

https://github.com/ceph/ceph/blob/master/doc/dev/file-striping.rst
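
If you do want striping rather than the default, the layout is changed per
directory (affecting files created afterwards) via the virtual xattrs; the
values and path below are purely illustrative:

    # stripe new files across 4 objects at a time, 1 MiB per stripe unit
    setfattr -n ceph.dir.layout.stripe_unit -v 1048576 /mnt/cephfs/somedir
    setfattr -n ceph.dir.layout.stripe_count -v 4 /mnt/cephfs/somedir
    getfattr -n ceph.dir.layout /mnt/cephfs/somedir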

> http://docs.ceph.com/docs/master/cephfs/file-layouts/
>
> OK, this means that a 16MB file will be split into 4 chunks of 4MB each,
> with 2 erasure coding chunks? I don't really understand the stripe_count
> element.

A 16 MB file would be split into 4 RADOS objects. Then those objects
would be distributed across OSDs according to the EC profile.

> And since erasure coding works at the object level, striping individual
> objects across - here 4 replicas - it'll end up filling 16MB? Or
> is there an internal optimization causing this not to be the case?
>
> Additionally, when reading the file, all 4 chunks need to be read to
> assemble the object, causing (at a minimum) 4 IOPS per file.
>
> Now, my common file size is < 8MB, and commonly 512KB files are on
> this pool.
>
> Will that cause a 512KB file to be padded to 4MB with 3 empty chunks
> to fill the erasure coding profile and then 2 coding chunks on top?
> In total, 24MB for storing 512KB?

No. Files do not always use the full 4MB chunk. The final chunk of the
file will be minimally sized. For example:

pdonnell@senta02 ~/mnt/tmp.ZS9VCMhBWg$ cp /bin/grep .
pdonnell@senta02 ~/mnt/tmp.ZS9VCMhBWg$ stat grep
  File: 'grep'
  Size: 211224  Blocks: 413IO Block: 4194304 regular file
Device: 2ch/44d Inode: 1099511627836  Links: 1
Access: (0750/-rwxr-x---)  Uid: ( 1163/pdonnell)   Gid: ( 1163/pdonnell)
Access: 2019-02-18 14:02:11.503875296 -0500
Modify: 2019-02-18 14:02:11.523375657 -0500
Change: 2019-02-18 14:02:11.523375657 -0500
 Birth: -
pdonnell@senta02 ~/mnt/tmp.ZS9VCMhBWg$ printf %x 1099511627836
13c
$ bin/rados -p cephfs.a.data stat 13c.
cephfs.a.data/100003c. mtime 2019-02-18 14:02:11.00, size 211224

So the object holding "grep" still only uses ~200KB and not 4MB.


-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] faster switch to another mds

2019-02-20 Thread Patrick Donnelly
On Tue, Feb 19, 2019 at 11:39 AM Fyodor Ustinov  wrote:
>
> Hi!
>
> From documentation:
>
> mds beacon grace
> Description: The interval without beacons before Ceph declares an MDS
> laggy (and possibly replaces it).
> Type:   Float
> Default:15
>
> I do not understand: is 15 in seconds or in beacons?

seconds
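
If you need to check or tune it, a rough sketch (the daemon id is an
assumption; the mons apply the same grace when marking an MDS laggy, so
any change should be made there as well):

    # value currently in effect on this MDS, in seconds
    ceph daemon mds.$(hostname -s) config get mds_beacon_grace
    # example: raise it at runtime on one daemon
    ceph tell mds.$(hostname -s) injectargs '--mds_beacon_grace=30'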

> And an additional misunderstanding - if we gently turn off the MDS (or MON),
> why does it not inform everyone interested before it dies - "I am turned off,
> no need to wait, appoint a new active server"?

The MDS does inform the monitors if it has been shutdown. If you pull
the plug or SIGKILL, it does not. :)


-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS_SLOW_METADATA_IO

2019-02-28 Thread Patrick Donnelly
On Thu, Feb 28, 2019 at 12:49 PM Stefan Kooman  wrote:
>
> Dear list,
>
> After upgrading to 12.2.11 the MDSes are reporting slow metadata IOs
> (MDS_SLOW_METADATA_IO). The metadata IOs would have been blocked for
> more than 5 seconds. We have one active, and one active standby MDS. All
> storage on SSD (Samsung PM863a / Intel DC4500). No other (OSD) slow ops
> reported. The MDSes are underutilized, only a handful of active clients
> and almost no load (fast hexacore CPU, 256 GB RAM, 20 Gb/s network). The
> cluster is also far from busy.
>
> I've dumped ops in flight on the MDSes but all ops that are printed are
> finished in a split second (duration: 0.000152), flag_point": "acquired
> locks".

I believe you're looking at the wrong "ops" dump. You want to check
"objecter_requests".

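That dump comes from the MDS admin socket; a minimal example, assuming the
daemon id matches the short hostname:

    ceph daemon mds.$(hostname -s) objecter_requests
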
-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How To Scale Ceph for Large Numbers of Clients?

2019-03-06 Thread Patrick Donnelly
Hello Zack,

On Wed, Mar 6, 2019 at 1:18 PM Zack Brenton  wrote:
>
> Hello,
>
> We're running Ceph on Kubernetes 1.12 using the Rook operator 
> (https://rook.io), but we've been struggling to scale applications mounting 
> CephFS volumes above 600 pods / 300 nodes. All our instances use the kernel 
> client and run kernel `4.19.23-coreos-r1`.
>
> We've tried increasing the MDS memory limits, running multiple active MDS 
> pods, and running different versions of Ceph (up to the latest Luminous and 
> Mimic releases), but we run into MDS_SLOW_REQUEST errors at the same scale 
> regardless of the memory limits we set. See this GitHub issue for more info 
> on what we've tried up to this point: https://github.com/rook/rook/issues/2590
>
> I've written a simple load test that reads all the files in a given directory 
> on an interval. While running this test, I've noticed that the `mds_co.bytes` 
> value (from `ceph daemon mds.myfs-a dump_mempools | jq -c 
> '.mempool.by_pool.mds_co'`) increases each time files are read. Why is this 
> number increasing after the first iteration? If the same client is reading 
> the same cached files, why would the data in the cache change at all? What is 
> `mds_co.bytes` actually reporting?
>
> My most important question is this: How do I configure Ceph to be able to 
> scale to large numbers of clients?

Please post more information about your cluster: types of devices,
`ceph osd tree`, `ceph osd df`, and `ceph osd lspools`.

There's no reason why CephFS shouldn't be able to scale to that number
of clients. The issue is probably related to the configuration of the
pools/MDS. From your ticket, I have a *lot* of trouble believing the
MDS is still at 3GB memory usage with that number of clients and
mds_cache_memory_limit=17179869184 (16GB).
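
Two quick checks via the admin socket (reusing the mds.myfs-a name from
your own commands) will show whether the limit is actually in effect and
what the cache reports:

    ceph daemon mds.myfs-a config get mds_cache_memory_limit
    ceph daemon mds.myfs-a cache status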

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

