[ceph-users] assertion error trying to start mds server

2017-10-10 Thread Bill Sharer
I've been in the process of updating my Gentoo-based cluster, both with
new hardware and with a somewhat postponed update.  This includes some
major changes, among them the switch from gcc 4.x to 5.4.0 on existing
hardware and the use of gcc 6.4.0 to make better use of AMD Ryzen on the
new hardware.  The existing cluster was on 10.2.2, but I was going to
10.2.7-r1 as an interim step before moving on to 12.2.0 to begin
transitioning to BlueStore on the OSDs.

The Ryzen units are slated to be BlueStore-based OSD servers if and when
I get to that point.  Up until the MDS failure, they were simply CephFS
clients.  I had three OSD servers updated to 10.2.7-r1 (one is also a
MON) and had two servers left to update.  Both of these are also MONs
and were acting as a pair of dual-active MDS servers running 10.2.2.
Monday morning I found out the hard way that the UPS one of them was on
has a dead battery.  After the box fsck'd and came back up, I saw the
following assertion error when it was trying to start its mds.B server:


 mdsbeacon(64162/B up:replay seq 3 v4699) v7  126+0+0 (709014160 0 0) 0x7f6fb4001bc0 con 0x55f94779d8d0
 0> 2017-10-09 11:43:06.935662 7f6fa9ffb700 -1 mds/journal.cc: In function 'virtual void EImportStart::replay(MDSRank*)' thread 7f6fa9ffb700 time 2017-10-09 11:43:06.934972
mds/journal.cc: 2929: FAILED assert(mds->sessionmap.get_version() == cmapv)

 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x82) [0x55f93d64a122]
 2: (EImportStart::replay(MDSRank*)+0x9ce) [0x55f93d52a5ce]
 3: (MDLog::_replay_thread()+0x4f4) [0x55f93d4a8e34]
 4: (MDLog::ReplayThread::entry()+0xd) [0x55f93d25bd4d]
 5: (()+0x74a4) [0x7f6fd009b4a4]
 6: (clone()+0x6d) [0x7f6fce5a598d]
 NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 kinetic
   1/ 5 fuse
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 1
  max_new 1000
  log_file /var/log/ceph/ceph-mds.B.log



When I was googling around, I ran into this CERN presentation and tried
out the offline backward scrubbing commands on slide 25 first:

https://indico.cern.ch/event/531810/contributions/2309925/attachments/1357386/2053998/GoncaloBorges-HEPIX16-v3.pdf


Both ran without any messages, so I'm assuming I have sane contents in
the cephfs_data and cephfs_metadata pools.  Still no luck getting things
restarted, so I tried the cephfs-journal-tool journal reset on slide
23.  That didn't work either.  Just for giggles, I tried setting up the
two Ryzen boxes as new mds.C and mds.D servers which would run on
10.2.7-r1 instead of using mds.A and mds.B (10.2.2).  The D server fails
with the same assert as follows:


=== 132+0+1979520 (4198351460 0 1611007530) 0x7fffc4000a70 con
0x7fffe0013310
 0> 2017-10-09 13:01:31.571195 7fffd99f5700 -1 mds/journal.cc: In
function 'virtual void EImportStart::replay(MDSRank*)' thread
7fffd99f5700 time 2017-10-09 13:01:31.570608
mds/journal.cc: 2949: FAILED assert(mds->sessionmap.get_version() == cmapv)

 ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x80) [0x55b7ebc8]
 2: (EImportStart::replay(MDSRank*)+0x9ea) [0x55a5674a]
 3: (MDLog::_replay_thread()+0xe51) [0x559cef21]
 4: (MDLog::ReplayThread::entry()+0xd) [0x557778cd]
 5: (()+0x7364) [0x77bc5364]
 6: (clone()+0x6d) [0x76051ccd]
 NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.
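
For completeness, the fuller journal-recovery sequence in the upstream
cephfs disaster-recovery docs goes roughly like this (so far I had only
tried the scrub commands and the journal reset); the export is a safety
copy and the table reset is a last resort:

 cephfs-journal-tool journal export backup.bin
 cephfs-journal-tool event recover_dentries summary
 cephfs-journal-tool journal reset
 cephfs-table-tool all reset session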




Re: [ceph-users] min_size & hybrid OSD latency

2017-10-10 Thread Christian Balzer

Hello,

On Wed, 11 Oct 2017 00:05:26 +0200 Jack wrote:

> Hi,
> 
> I would like some information about the following
> 
> Let's say I have a running cluster with 4 OSDs: 2 SSDs and 2 HDDs.
> My single pool has size=3, min_size=2.
>
> For a write-only pattern, I thought I would get SSD-level performance,
> because the write would be acked as soon as min_size OSDs had acked.
>
> But am I right?
> 
You're the second person in very recent times to come to that wrong
conclusion about min_size.

All writes have to be ACKed by every OSD in the acting set before the
client gets its ACK; the only place a hybrid setup helps is in
accelerating reads.
That is something people like me have very little interest in, as it is
the writes that need to be fast.
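
If reads from the SSDs are what somebody is after in a hybrid pool, the
usual knob is primary affinity, not min_size. A rough sketch (the OSD ids
below are made up; pre-Luminous clusters also need the mon flag first):

 ceph tell mon.* injectargs '--mon_osd_allow_primary_affinity=true'
 ceph osd primary-affinity osd.2 0   # demote the HDD OSDs from primary
 ceph osd primary-affinity osd.3 0

That only moves reads to the SSD primaries; it does nothing for writes.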

Christian

> (The same setup could involve some high-latency OSDs, in the case of a
> country-level cluster.)
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications


[ceph-users] RGW flush_read_list error

2017-10-10 Thread Travis Nielsen
In Luminous 12.2.1, when running a GET on a large (1 GB) file repeatedly
for an hour from RGW, the following error was hit intermittently a number
of times. The first error was hit after 45 minutes, and then the error
happened frequently for the remainder of the test.

ERROR: flush_read_list(): d->client_cb->handle_data() returned -5

Here is some more context from the rgw log around one of the failures.

2017-10-10 18:20:32.321681 I | rgw: 2017-10-10 18:20:32.321643
7f8929f41700 1 civetweb: 0x55bd25899000: 10.32.0.1 - -
[10/Oct/2017:18:19:07 +] "GET /bucket100/testfile.tst HTTP/1.1" 1 0 -
aws-sdk-java/1.9.0 Linux/4.4.0-93-generic
OpenJDK_64-Bit_Server_VM/25.131-b11/1.8.0_131
2017-10-10 18:20:32.383855 I | rgw: 2017-10-10 18:20:32.383786
7f8924736700 1 == starting new request req=0x7f892472f140 =
2017-10-10 18:20:46.605668 I | rgw: 2017-10-10 18:20:46.605576
7f894af83700 0 ERROR: flush_read_list(): d->client_cb->handle_data()
returned -5
2017-10-10 18:20:46.605934 I | rgw: 2017-10-10 18:20:46.605914
7f894af83700 1 == req done req=0x7f894af7c140 op status=-5
http_status=200 ==
2017-10-10 18:20:46.606249 I | rgw: 2017-10-10 18:20:46.606225
7f8924736700 0 ERROR: flush_read_list(): d->client_cb->handle_data()
returned -5

I don't see anything else standing out in the log. The object store was
configured with an erasure-coded data pool with k=2 and m=1.
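
If more verbose logs would help, bumping RGW logging on one gateway during
a repro should show what the frontend is doing when handle_data() returns
-5; a sketch (the admin socket name depends on how the rgw instance
registers itself):

 ceph daemon client.rgw.<instance> config set debug_rgw 20
 ceph daemon client.rgw.<instance> config set debug_civetweb 10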

There are a number of threads around this, but I don't see a resolution.
Is there a tracking issue for this?
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-February/007756.html
https://www.spinics.net/lists/ceph-users/msg16117.html
https://www.spinics.net/lists/ceph-devel/msg37657.html


Here's our tracking Rook issue.
https://github.com/rook/rook/issues/1067


Thanks,
Travis



On 10/10/17, 3:05 PM, "ceph-users on behalf of Jack"
 wrote:

>Hi,
>
>I would like some information about the following
>
>Let's say I have a running cluster with 4 OSDs: 2 SSDs and 2 HDDs.
>My single pool has size=3, min_size=2.
>
>For a write-only pattern, I thought I would get SSD-level performance,
>because the write would be acked as soon as min_size OSDs had acked.
>
>But am I right?
>
>(The same setup could involve some high-latency OSDs, in the case of a
>country-level cluster.)
>___
>ceph-users mailing list
>ceph-users@lists.ceph.com
>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



[ceph-users] min_size & hybrid OSD latency

2017-10-10 Thread Jack
Hi,

I would like some information about the following

Let's say I have a running cluster with 4 OSDs: 2 SSDs and 2 HDDs.
My single pool has size=3, min_size=2.

For a write-only pattern, I thought I would get SSD-level performance,
because the write would be acked as soon as min_size OSDs had acked.

But am I right?

(The same setup could involve some high-latency OSDs, in the case of a
country-level cluster.)


Re: [ceph-users] right way to recover a failed OSD (disk) when using BlueStore ?

2017-10-10 Thread Alejandro Comisario
Hi, I see some notes there that didn't exist in Jewel:

http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replacing-an-osd

In my case, what I'm using right now on that OSD is this:

root@ndc-cl-osd4:~# ls -lsah /var/lib/ceph/osd/ceph-104
total 64K
   0 drwxr-xr-x  2 ceph ceph  310 Sep 21 10:56 .
4.0K drwxr-xr-x 25 ceph ceph 4.0K Sep 21 10:56 ..
   0 lrwxrwxrwx  1 ceph ceph   58 Sep 21 10:30 block ->
/dev/disk/by-partuuid/0ffa3ed7-169f-485c-9170-648ce656e9b1
   0 lrwxrwxrwx  1 ceph ceph   58 Sep 21 10:30 block.db ->
/dev/disk/by-partuuid/5873e2cb-3c26-4a7d-8ff1-1bc3e2d62e5a
   0 lrwxrwxrwx  1 ceph ceph   58 Sep 21 10:30 block.wal ->
/dev/disk/by-partuuid/aed9e5e4-c798-46b5-8243-e462e74f6485

block.db and block.wal are on two different NVMe partitions, which are
nvme1n1p17 and nvme1n1p18. So, assuming that after hot-swapping the device
the drive letter is "sdx", what, according to the link above, would be the
right command to re-use the two NVMe partitions for block.db and block.wal?

I presume that everything else is the same.
best.
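
For what it's worth, this is roughly what I expect the ceph-disk invocation
to look like for re-using those existing partitions (just a sketch with
/dev/sdx as the replacement disk; please correct me if the flags are off,
they should be checked against `ceph-disk prepare --help` on 12.2.x):

 ceph-disk zap /dev/sdx   # only the new data disk, not the NVMe partitions
 ceph-disk prepare --bluestore \
     --block.db /dev/disk/by-partuuid/5873e2cb-3c26-4a7d-8ff1-1bc3e2d62e5a \
     --block.wal /dev/disk/by-partuuid/aed9e5e4-c798-46b5-8243-e462e74f6485 \
     /dev/sdx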


On Sat, Sep 30, 2017 at 9:00 PM, David Turner  wrote:

> I'm pretty sure that the process is the same as with filestore. The
> cluster doesn't really know if an osd is filestore or bluestore... It's
> just an osd running a daemon.
>
> If there are any differences, they would be in the release notes for
> Luminous as changes from Jewel.
>
> On Sat, Sep 30, 2017, 6:28 PM Alejandro Comisario 
> wrote:
>
>> Hi all.
>> Independently of the fact that I've deployed a Ceph Luminous cluster with
>> BlueStore using ceph-ansible (https://github.com/ceph/ceph-ansible), what
>> is the right way to replace a disk when using BlueStore?
>>
>> I will try to forget everything I know about how to recover things with
>> filestore and start fresh.
>>
>> Any how-tos? Experiences? I don't seem to find an official way of doing
>> it.
>> best.
>>
>> --
>> *Alejandro Comisario*
>> *CTO | NUBELIU*
>> E-mail: alejandro@nubeliu.comCell: +54 9 11 3770 1857
>> _
>> www.nubeliu.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>


-- 
*Alejandro Comisario*
*CTO | NUBELIU*
E-mail: alejandro@nubeliu.comCell: +54911 3770 1857
_


Re: [ceph-users] Unable to restrict a CephFS client to a subdirectory

2017-10-10 Thread John Spray
On Tue, Oct 10, 2017 at 5:40 PM, Shawfeng Dong  wrote:
> Hi Yoann,
>
> I confirm too that your recipe works!
>
> We run CentOS 7:
> [root@pulpo-admin ~]# uname -r
> 3.10.0-693.2.2.el7.x86_64
>
> Here were the old caps for user 'hydra':
> # ceph auth get client.hydra
> exported keyring for client.hydra
> [client.hydra]
> key = AQ==
> caps mds = "allow rw"
> caps mgr = "allow r"
> caps mon = "allow r"
> caps osd = "allow rw"
>
> Our CephFS name is 'pulpos', when I tried to restrict CephFS client
> capabilities:
> # ceph fs authorize pulpos client.hydra /hydra rw
> I got this error:
> Error EINVAL: key for client.hydra exists but cap mds does not match
>
> In retrospect, the error means exactly what it says: the user caps and
> CephFS client caps must match! You can't *restrict* (narrow down) user caps
> with 'ceph fs authorize'.
>
> For example, this won't work (I can't give 'rw' cap to all pools and then
> restrict it):
> # ceph auth caps client.hydra mon 'allow r' mgr 'allow r' osd 'allow rw' mds
> 'allow rw path=/hydra'
> updated caps for client.hydra
> # ceph fs authorize pulpos client.hydra /hydra rw
> Error EINVAL: key for client.hydra exists but cap osd does not match
>
> I find only this works:
> # ceph auth caps client.hydra mon 'allow r' mgr 'allow r' osd 'allow rw
> pool=cephfs_data' mds 'allow rw path=/hydra'
> updated caps for client.hydra
> # ceph fs authorize pulpos client.hydra /hydra rw
> [client.hydra]
> key = AQ==
> ```
>
> But I still have 2 lingering questions:
> 1. If the user caps and CephFS client caps must match, why do we need 2
> commands ('ceph auth' & 'ceph fs authorize')? The first one is sufficient.

The "fs authorize" command is a new thing that's meant to make it
easier, so that people don't have to know the syntax of the path=
stuff, and so that they don't have to manually list out their data
pools etc.
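
So with an existing user, the simplest path is to delete the old key and
let "fs authorize" recreate it with consistent caps, e.g. (note that the
client gets a new key this way):

 ceph auth del client.hydra
 ceph fs authorize pulpos client.hydra /hydra rw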

> 2. We only give 'rw' cap to the data pool and it works. Why is it
> unnecessary to give 'rw' cap to the metadata pool?

Clients never need access to the metadata pool.  Was something giving
them access?

John

>
> Best regards,
> Shaw
>
>
>
> On Tue, Oct 10, 2017 at 4:20 AM, Yoann Moulin  wrote:
>>
>>
>> >> I am trying to follow the instructions at:
>> >> http://docs.ceph.com/docs/master/cephfs/client-auth/
>> >> to restrict a client to a subdirectory of  Ceph filesystem, but always
>> >> get
>> >> an error.
>> >>
>> >> We are running the latest stable release of Ceph (v12.2.1) on CentOS 7
>> >> servers. The user 'hydra' has the following capabilities:
>> >> # ceph auth get client.hydra
>> >> exported keyring for client.hydra
>> >> [client.hydra]
>> >> key = AQ==
>> >> caps mds = "allow rw"
>> >> caps mgr = "allow r"
>> >> caps mon = "allow r"
>> >> caps osd = "allow rw"
>> >>
>> >> When I tried to restrict the client to only mount and work within the
>> >> directory /hydra of the Ceph filesystem 'pulpos', I got an error:
>> >> # ceph fs authorize pulpos client.hydra /hydra rw
>> >> Error EINVAL: key for client.dong exists but cap mds does not match
>> >>
>> >> I've tried a few combinations of user caps and CephFS client caps; but
>> >> always got the same error!
>> >
>> > The "fs authorize" command isn't smart enough to edit existing
>> > capabilities safely, so it is cautious and refuses to overwrite what
>> > is already there.  If you remove your client.hydra user and try again,
>> > it should create it for you with the correct capabilities.
>>
>> I confirm it works perfectly ! it should be added to the documentation. :)
>>
>> # ceph fs authorize cephfs client.foo1 /foo1 rw
>> [client.foo1]
>> key = XXX1
>> # ceph fs authorize cephfs client.foo2 / r /foo2 rw
>> [client.foo2]
>> key = XXX2
>>
>> # ceph auth get client.foo1
>> exported keyring for client.foo1
>> [client.foo1]
>> key = XXX1
>> caps mds = "allow rw path=/foo1"
>> caps mon = "allow r"
>> caps osd = "allow rw pool=cephfs_data"
>>
>> # ceph auth get client.foo2
>> exported keyring for client.foo2
>> [client.foo2]
>> key = XXX2
>> caps mds = "allow r, allow rw path=/foo2"
>> caps mon = "allow r"
>> caps osd = "allow rw pool=cephfs_data"
>>
>> Best regards,
>>
>> --
>> Yoann Moulin
>> EPFL IC-IT
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] Unable to restrict a CephFS client to a subdirectory

2017-10-10 Thread Shawfeng Dong
Hi Yoann,

I confirm too that your recipe works!

We run CentOS 7:
[root@pulpo-admin ~]# uname -r
3.10.0-693.2.2.el7.x86_64

Here were the old caps for user 'hydra':
# ceph auth get client.hydra
exported keyring for client.hydra
[client.hydra]
key = AQ==
caps mds = "allow rw"
caps mgr = "allow r"
caps mon = "allow r"
caps osd = "allow rw"

Our CephFS name is 'pulpos', when I tried to restrict CephFS client
capabilities:
# ceph fs authorize pulpos client.hydra /hydra rw
I got this error:
Error EINVAL: key for client.hydra exists but cap mds does not match

In retrospect, the error means exactly what it says: the user caps and
CephFS client caps must match! You can't *restrict* (narrow down) user caps
with 'ceph fs authorize'.

For example, this won't work (I can't give 'rw' cap to all pools and then
restrict it):
# ceph auth caps client.hydra mon 'allow r' mgr 'allow r' osd 'allow rw'
mds 'allow rw path=/hydra'
updated caps for client.hydra
# ceph fs authorize pulpos client.hydra /hydra rw
Error EINVAL: key for client.hydra exists but cap osd does not match

I find only this works:
# ceph auth caps client.hydra mon 'allow r' mgr 'allow r' osd 'allow rw
pool=cephfs_data' mds 'allow rw path=/hydra'
updated caps for client.hydra
# ceph fs authorize pulpos client.hydra /hydra rw
[client.hydra]
key = AQ==
```

But I still have 2 lingering questions:
1. If the user caps and CephFS client caps must match, why do we need 2
commands ('ceph auth' & 'ceph fs authorize')? The first one is sufficient.
2. We only give 'rw' cap to the data pool and it works. Why is it
unnecessary to give 'rw' cap to the metadata pool?

Best regards,
Shaw



On Tue, Oct 10, 2017 at 4:20 AM, Yoann Moulin  wrote:

>
> >> I am trying to follow the instructions at:
> >> http://docs.ceph.com/docs/master/cephfs/client-auth/
> >> to restrict a client to a subdirectory of  Ceph filesystem, but always
> get
> >> an error.
> >>
> >> We are running the latest stable release of Ceph (v12.2.1) on CentOS 7
> >> servers. The user 'hydra' has the following capabilities:
> >> # ceph auth get client.hydra
> >> exported keyring for client.hydra
> >> [client.hydra]
> >> key = AQ==
> >> caps mds = "allow rw"
> >> caps mgr = "allow r"
> >> caps mon = "allow r"
> >> caps osd = "allow rw"
> >>
> >> When I tried to restrict the client to only mount and work within the
> >> directory /hydra of the Ceph filesystem 'pulpos', I got an error:
> >> # ceph fs authorize pulpos client.hydra /hydra rw
> >> Error EINVAL: key for client.dong exists but cap mds does not match
> >>
> >> I've tried a few combinations of user caps and CephFS client caps; but
> >> always got the same error!
> >
> > The "fs authorize" command isn't smart enough to edit existing
> > capabilities safely, so it is cautious and refuses to overwrite what
> > is already there.  If you remove your client.hydra user and try again,
> > it should create it for you with the correct capabilities.
>
> I confirm it works perfectly ! it should be added to the documentation. :)
>
> # ceph fs authorize cephfs client.foo1 /foo1 rw
> [client.foo1]
> key = XXX1
> # ceph fs authorize cephfs client.foo2 / r /foo2 rw
> [client.foo2]
> key = XXX2
>
> # ceph auth get client.foo1
> exported keyring for client.foo1
> [client.foo1]
> key = XXX1
> caps mds = "allow rw path=/foo1"
> caps mon = "allow r"
> caps osd = "allow rw pool=cephfs_data"
>
> # ceph auth get client.foo2
> exported keyring for client.foo2
> [client.foo2]
> key = XXX2
> caps mds = "allow r, allow rw path=/foo2"
> caps mon = "allow r"
> caps osd = "allow rw pool=cephfs_data"
>
> Best regards,
>
> --
> Yoann Moulin
> EPFL IC-IT
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] All replicas of pg 5.b got placed on the same host - how to correct?

2017-10-10 Thread Peter Linder
Probably chooseleaf also, instead of choose.

Konrad Riedel  wrote: (10 October 2017 17:05:52 CEST)
>Hello Ceph-users,
>
>after switching to luminous I was excited about the great
>crush-device-class feature - now we have 5 servers with 1x2TB NVMe
>based OSDs, 3 of them additionally with 4 HDDS per server. (we have
>only three 400G NVMe disks for block.wal and block.db and therefore
>can't distribute all HDDs evenly on all servers.)
>
>Output from "ceph pg dump" shows that some PGs end up on HDD OSDs on
>the same
>Host:
>
>ceph pg map 5.b
>osdmap e12912 pg 5.b (5.b) -> up [9,7,8] acting [9,7,8]
>
>(on rebooting this host I had 4 stale PGs)
>
>I've written a small perl script to add hostname after OSD number and
>got many PGs where
>ceph placed 2 replicas on the same host... :
>
>5.1e7: 8 - daniel 9 - daniel 11 - udo
>5.1eb: 10 - udo 7 - daniel 9 - daniel
>5.1ec: 10 - udo 11 - udo 7 - daniel
>5.1ed: 13 - felix 16 - felix 5 - udo
>
>
>Is there any way I can correct this?
>
>
>Please see crushmap below. Thanks for any help!
>
># begin crush map
>tunable choose_local_tries 0
>tunable choose_local_fallback_tries 0
>tunable choose_total_tries 50
>tunable chooseleaf_descend_once 1
>tunable chooseleaf_vary_r 1
>tunable chooseleaf_stable 1
>tunable straw_calc_version 1
>tunable allowed_bucket_algs 54
>
># devices
>device 0 osd.0 class hdd
>device 1 device1
>device 2 osd.2 class ssd
>device 3 device3
>device 4 device4
>device 5 osd.5 class hdd
>device 6 device6
>device 7 osd.7 class hdd
>device 8 osd.8 class hdd
>device 9 osd.9 class hdd
>device 10 osd.10 class hdd
>device 11 osd.11 class hdd
>device 12 osd.12 class hdd
>device 13 osd.13 class hdd
>device 14 osd.14 class hdd
>device 15 device15
>device 16 osd.16 class hdd
>device 17 device17
>device 18 device18
>device 19 device19
>device 20 device20
>device 21 device21
>device 22 device22
>device 23 device23
>device 24 osd.24 class hdd
>device 25 device25
>device 26 osd.26 class hdd
>device 27 osd.27 class hdd
>device 28 osd.28 class hdd
>device 29 osd.29 class hdd
>device 30 osd.30 class ssd
>device 31 osd.31 class ssd
>device 32 osd.32 class ssd
>device 33 osd.33 class ssd
>
># types
>type 0 osd
>type 1 host
>type 2 rack
>type 3 row
>type 4 room
>type 5 datacenter
>type 6 root
>
># buckets
>host daniel {
>   id -4   # do not change unnecessarily
>   id -2 class hdd # do not change unnecessarily
>   id -9 class ssd # do not change unnecessarily
>   # weight 3.459
>   alg straw2
>   hash 0  # rjenkins1
>   item osd.31 weight 1.819
>   item osd.7 weight 0.547
>   item osd.8 weight 0.547
>   item osd.9 weight 0.547
>}
>host felix {
>   id -5   # do not change unnecessarily
>   id -3 class hdd # do not change unnecessarily
>   id -10 class ssd# do not change unnecessarily
>   # weight 3.653
>   alg straw2
>   hash 0  # rjenkins1
>   item osd.33 weight 1.819
>   item osd.13 weight 0.547
>   item osd.14 weight 0.467
>   item osd.16 weight 0.547
>   item osd.0 weight 0.274
>}
>host udo {
>   id -6   # do not change unnecessarily
>   id -7 class hdd # do not change unnecessarily
>   id -11 class ssd# do not change unnecessarily
>   # weight 4.006
>   alg straw2
>   hash 0  # rjenkins1
>   item osd.32 weight 1.819
>   item osd.5 weight 0.547
>   item osd.10 weight 0.547
>   item osd.11 weight 0.547
>   item osd.12 weight 0.547
>}
>host moritz {
>   id -13  # do not change unnecessarily
>   id -14 class hdd# do not change unnecessarily
>   id -15 class ssd# do not change unnecessarily
>   # weight 1.819
>   alg straw2
>   hash 0  # rjenkins1
>   item osd.30 weight 1.819
>}
>host bruno {
>   id -16  # do not change unnecessarily
>   id -17 class hdd# do not change unnecessarily
>   id -18 class ssd# do not change unnecessarily
>   # weight 3.183
>   alg straw2
>   hash 0  # rjenkins1
>   item osd.24 weight 0.273
>   item osd.26 weight 0.273
>   item osd.27 weight 0.273
>   item osd.28 weight 0.273
>   item osd.29 weight 0.273
>   item osd.2 weight 1.819
>}
>root default {
>   id -1   # do not change unnecessarily
>   id -8 class hdd # do not change unnecessarily
>   id -12 class ssd# do not change unnecessarily
>   # weight 16.121
>   alg straw2
>   hash 0  # rjenkins1
>   item daniel weight 3.459
>   item felix weight 3.653
>   item udo weight 4.006
>   item moritz weight 1.819
>   item bruno weight 3.183
>}
>
># rules
>rule ssd {
>   id 0
>   type replicated
>   min_size 1
>   max_size 10
>   step take default class ssd
>   step choose firstn 0 type osd
>   step emit
>}
>rule hdd {
>  

Re: [ceph-users] All replicas of pg 5.b got placed on the same host - how to correct?

2017-10-10 Thread Peter Linder
I think your failure domain within your rules is wrong.

step choose firstn 0 type osd

Should be:

step choose firstn 0 type host
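
Together with the chooseleaf change from the other reply, the corrected
hdd rule would look roughly like this (the ssd rule needs the same
treatment; double-check the ids against your own map):

 rule hdd {
         id 1
         type replicated
         min_size 1
         max_size 10
         step take default class hdd
         step chooseleaf firstn 0 type host
         step emit
 }

After recompiling and injecting the map (crushtool -c crushmap.txt -o
crushmap.new; ceph osd setcrushmap -i crushmap.new), expect some data
movement while the misplaced PGs get remapped.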


On 10/10/2017 5:05 PM, Konrad Riedel wrote:
> Hello Ceph-users,
>
> after switching to luminous I was excited about the great
> crush-device-class feature - now we have 5 servers with 1x2TB NVMe
> based OSDs, 3 of them additionally with 4 HDDS per server. (we have
> only three 400G NVMe disks for block.wal and block.db and therefore
> can't distribute all HDDs evenly on all servers.)
>
> Output from "ceph pg dump" shows that some PGs end up on HDD OSDs on
> the same
> Host:
>
> ceph pg map 5.b
> osdmap e12912 pg 5.b (5.b) -> up [9,7,8] acting [9,7,8]
>
> (on rebooting this host I had 4 stale PGs)
>
> I've written a small perl script to add hostname after OSD number and
> got many PGs where
> ceph placed 2 replicas on the same host... :
>
> 5.1e7: 8 - daniel 9 - daniel 11 - udo
> 5.1eb: 10 - udo 7 - daniel 9 - daniel
> 5.1ec: 10 - udo 11 - udo 7 - daniel
> 5.1ed: 13 - felix 16 - felix 5 - udo
>
>
> Is there any way I can correct this?
>
>
> Please see crushmap below. Thanks for any help!
>
> # begin crush map
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
> tunable chooseleaf_descend_once 1
> tunable chooseleaf_vary_r 1
> tunable chooseleaf_stable 1
> tunable straw_calc_version 1
> tunable allowed_bucket_algs 54
>
> # devices
> device 0 osd.0 class hdd
> device 1 device1
> device 2 osd.2 class ssd
> device 3 device3
> device 4 device4
> device 5 osd.5 class hdd
> device 6 device6
> device 7 osd.7 class hdd
> device 8 osd.8 class hdd
> device 9 osd.9 class hdd
> device 10 osd.10 class hdd
> device 11 osd.11 class hdd
> device 12 osd.12 class hdd
> device 13 osd.13 class hdd
> device 14 osd.14 class hdd
> device 15 device15
> device 16 osd.16 class hdd
> device 17 device17
> device 18 device18
> device 19 device19
> device 20 device20
> device 21 device21
> device 22 device22
> device 23 device23
> device 24 osd.24 class hdd
> device 25 device25
> device 26 osd.26 class hdd
> device 27 osd.27 class hdd
> device 28 osd.28 class hdd
> device 29 osd.29 class hdd
> device 30 osd.30 class ssd
> device 31 osd.31 class ssd
> device 32 osd.32 class ssd
> device 33 osd.33 class ssd
>
> # types
> type 0 osd
> type 1 host
> type 2 rack
> type 3 row
> type 4 room
> type 5 datacenter
> type 6 root
>
> # buckets
> host daniel {
> id -4   # do not change unnecessarily
> id -2 class hdd   # do not change unnecessarily
> id -9 class ssd   # do not change unnecessarily
> # weight 3.459
> alg straw2
> hash 0   # rjenkins1
> item osd.31 weight 1.819
> item osd.7 weight 0.547
> item osd.8 weight 0.547
> item osd.9 weight 0.547
> }
> host felix {
> id -5   # do not change unnecessarily
> id -3 class hdd   # do not change unnecessarily
> id -10 class ssd   # do not change unnecessarily
> # weight 3.653
> alg straw2
> hash 0   # rjenkins1
> item osd.33 weight 1.819
> item osd.13 weight 0.547
> item osd.14 weight 0.467
> item osd.16 weight 0.547
> item osd.0 weight 0.274
> }
> host udo {
> id -6   # do not change unnecessarily
> id -7 class hdd   # do not change unnecessarily
> id -11 class ssd   # do not change unnecessarily
> # weight 4.006
> alg straw2
> hash 0   # rjenkins1
> item osd.32 weight 1.819
> item osd.5 weight 0.547
> item osd.10 weight 0.547
> item osd.11 weight 0.547
> item osd.12 weight 0.547
> }
> host moritz {
> id -13   # do not change unnecessarily
> id -14 class hdd   # do not change unnecessarily
> id -15 class ssd   # do not change unnecessarily
> # weight 1.819
> alg straw2
> hash 0   # rjenkins1
> item osd.30 weight 1.819
> }
> host bruno {
> id -16   # do not change unnecessarily
> id -17 class hdd   # do not change unnecessarily
> id -18 class ssd   # do not change unnecessarily
> # weight 3.183
> alg straw2
> hash 0   # rjenkins1
> item osd.24 weight 0.273
> item osd.26 weight 0.273
> item osd.27 weight 0.273
> item osd.28 weight 0.273
> item osd.29 weight 0.273
> item osd.2 weight 1.819
> }
> root default {
> id -1   # do not change unnecessarily
> id -8 class hdd   # do not change unnecessarily
> id -12 class ssd   # do not change unnecessarily
> # weight 16.121
> alg straw2
> hash 0   # rjenkins1
> item daniel weight 3.459
> item felix weight 3.653
> item udo weight 4.006
> item moritz weight 1.819
> item bruno weight 3.183
> }
>
> # rules
> rule ssd {
> id 0
> type replicated
> min_size 1
> max_size 10
> step take default class ssd
> step choose firstn 0 type osd
> step emit
> }
> rule hdd {
> id 1
> type replicated
> min_size 1
> 

[ceph-users] All replicas of pg 5.b got placed on the same host - how to correct?

2017-10-10 Thread Konrad Riedel

Hello Ceph-users,

after switching to Luminous I was excited about the great crush-device-class
feature - now we have 5 servers with one 2 TB NVMe-based OSD each, 3 of them
additionally with 4 HDDs per server. (We have only three 400 GB NVMe disks for
block.wal and block.db and therefore can't distribute all HDDs evenly on all
servers.)

Output from "ceph pg dump" shows that some PGs end up on HDD OSDs on the same
host:

ceph pg map 5.b
osdmap e12912 pg 5.b (5.b) -> up [9,7,8] acting [9,7,8]

(on rebooting this host I had 4 stale PGs)

I've written a small Perl script to append the hostname to each OSD number, and
got many PGs where Ceph placed 2 replicas on the same host:

5.1e7: 8 - daniel 9 - daniel 11 - udo
5.1eb: 10 - udo 7 - daniel 9 - daniel
5.1ec: 10 - udo 11 - udo 7 - daniel
5.1ed: 13 - felix 16 - felix 5 - udo


Is there any way I can correct this?


Please see crushmap below. Thanks for any help!

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 device1
device 2 osd.2 class ssd
device 3 device3
device 4 device4
device 5 osd.5 class hdd
device 6 device6
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 device15
device 16 osd.16 class hdd
device 17 device17
device 18 device18
device 19 device19
device 20 device20
device 21 device21
device 22 device22
device 23 device23
device 24 osd.24 class hdd
device 25 device25
device 26 osd.26 class hdd
device 27 osd.27 class hdd
device 28 osd.28 class hdd
device 29 osd.29 class hdd
device 30 osd.30 class ssd
device 31 osd.31 class ssd
device 32 osd.32 class ssd
device 33 osd.33 class ssd

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 root

# buckets
host daniel {
id -4   # do not change unnecessarily
id -2 class hdd # do not change unnecessarily
id -9 class ssd # do not change unnecessarily
# weight 3.459
alg straw2
hash 0  # rjenkins1
item osd.31 weight 1.819
item osd.7 weight 0.547
item osd.8 weight 0.547
item osd.9 weight 0.547
}
host felix {
id -5   # do not change unnecessarily
id -3 class hdd # do not change unnecessarily
id -10 class ssd# do not change unnecessarily
# weight 3.653
alg straw2
hash 0  # rjenkins1
item osd.33 weight 1.819
item osd.13 weight 0.547
item osd.14 weight 0.467
item osd.16 weight 0.547
item osd.0 weight 0.274
}
host udo {
id -6   # do not change unnecessarily
id -7 class hdd # do not change unnecessarily
id -11 class ssd# do not change unnecessarily
# weight 4.006
alg straw2
hash 0  # rjenkins1
item osd.32 weight 1.819
item osd.5 weight 0.547
item osd.10 weight 0.547
item osd.11 weight 0.547
item osd.12 weight 0.547
}
host moritz {
id -13  # do not change unnecessarily
id -14 class hdd# do not change unnecessarily
id -15 class ssd# do not change unnecessarily
# weight 1.819
alg straw2
hash 0  # rjenkins1
item osd.30 weight 1.819
}
host bruno {
id -16  # do not change unnecessarily
id -17 class hdd# do not change unnecessarily
id -18 class ssd# do not change unnecessarily
# weight 3.183
alg straw2
hash 0  # rjenkins1
item osd.24 weight 0.273
item osd.26 weight 0.273
item osd.27 weight 0.273
item osd.28 weight 0.273
item osd.29 weight 0.273
item osd.2 weight 1.819
}
root default {
id -1   # do not change unnecessarily
id -8 class hdd # do not change unnecessarily
id -12 class ssd# do not change unnecessarily
# weight 16.121
alg straw2
hash 0  # rjenkins1
item daniel weight 3.459
item felix weight 3.653
item udo weight 4.006
item moritz weight 1.819
item bruno weight 3.183
}

# rules
rule ssd {
id 0
type replicated
min_size 1
max_size 10
step take default class ssd
step choose firstn 0 type osd
step emit
}
rule hdd {
id 1
type replicated
min_size 1
max_size 10
step take default class hdd
step choose firstn 0 type osd
step emit
}

# end crush map
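
For checking how a rule maps without touching the cluster, crushtool's
test mode can be run against an exported map; a sketch:

 ceph osd getcrushmap -o crushmap.bin
 crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-mappings | head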

--

Kind regards


Re: [ceph-users] Ceph-mgr summarize recovery counters

2017-10-10 Thread John Spray
On Tue, Oct 10, 2017 at 3:50 PM, Benjeman Meekhof  wrote:
> Hi John,
>
> Thanks for the guidance!  Is pg_status something we should expect to
> find in Luminous (12.2.1)?  It doesn't seem to exist.  We do have a
> 'pg_summary' object which contains a list of every PG and current
> state (active, etc) but nothing about I/O.
>
> Calls to self.get('pg_status') in our module log:  mgr get_python
> Python module requested unknown data 'pg_status'

Yes, it's new in master.

When modules like influx & prometheus are using those calls in master
though, we can backport things like the pg_status implementation at
the same time as backporting the modules if we do that.
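
For anyone wanting to experiment on 12.2.x in the meantime, a minimal
module sketch that falls back to pg_summary (illustrative only, not a
tested recipe):

 from mgr_module import MgrModule

 class Module(MgrModule):
     def serve(self):
         # 'pg_status' only exists in master right now; on Luminous the
         # call logs "requested unknown data", so fall back to pg_summary
         data = self.get('pg_status') or self.get('pg_summary') or {}
         self.log.info("pg data keys: %s" % list(data.keys()))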

John


>
> thanks,
> Ben
>
> On Thu, Oct 5, 2017 at 8:42 AM, John Spray  wrote:
>> On Wed, Oct 4, 2017 at 7:14 PM, Gregory Farnum  wrote:
>>> On Wed, Oct 4, 2017 at 9:14 AM, Benjeman Meekhof  wrote:
 Wondering if anyone can tell me how to summarize recovery
 bytes/ops/objects from counters available in the ceph-mgr python
 interface?  To put another way, how does the ceph -s command put
 together that infomation and can I access that information from a
 counter queryable by the ceph-mgr python module api?

 I want info like the 'recovery' part of the status output.  I have a
 ceph-mgr module that feeds influxdb but I'm not sure what counters
 from ceph-mgr to summarize to create this information.  OSD have
 available a recovery_ops counter which is not quite the same.  Maybe
 the various 'subop_..' counters encompass recovery ops?  It's not
 clear to me but I'm hoping it is obvious to someone more familiar with
 the internals.

 io:
 client:   2034 B/s wr, 0 op/s rd, 0 op/s wr
 recovery: 1173 MB/s, 8 keys/s, 682 objects/s
>>>
>>>
>>> You'll need to run queries against the PGMap. I'm not sure how that
>>> works in the python interfaces but I'm led to believe it's possible.
>>> Documentation is probably all in the PGMap.h header; you can look at
>>> functions like the "recovery_rate_summary" to see what they're doing.
>>
>> Try get("pg_status") from a python module, that should contain the
>> recovery/client IO amongst other things.
>>
>> You may find that the fields only appear when they're nonzero, I would
>> be happy to see a change that fixed the underlying functions to always
>> output the fields (e.g. in PGMapDigest::recovery_rate_summary) when
>> writing to a Formatter.  Skipping the irrelevant stuff is only useful
>> when doing plain text output.
>>
>> John
>>
>>> -Greg
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-mgr summarize recovery counters

2017-10-10 Thread Benjeman Meekhof
Hi John,

Thanks for the guidance!  Is pg_status something we should expect to
find in Luminous (12.2.1)?  It doesn't seem to exist.  We do have a
'pg_summary' object which contains a list of every PG and current
state (active, etc) but nothing about I/O.

Calls to self.get('pg_status') in our module log:  mgr get_python
Python module requested unknown data 'pg_status'

thanks,
Ben

On Thu, Oct 5, 2017 at 8:42 AM, John Spray  wrote:
> On Wed, Oct 4, 2017 at 7:14 PM, Gregory Farnum  wrote:
>> On Wed, Oct 4, 2017 at 9:14 AM, Benjeman Meekhof  wrote:
>>> Wondering if anyone can tell me how to summarize recovery
>>> bytes/ops/objects from counters available in the ceph-mgr python
>>> interface?  To put another way, how does the ceph -s command put
>>> together that information and can I access that information from a
>>> counter queryable by the ceph-mgr python module api?
>>>
>>> I want info like the 'recovery' part of the status output.  I have a
>>> ceph-mgr module that feeds influxdb but I'm not sure what counters
>>> from ceph-mgr to summarize to create this information.  OSD have
>>> available a recovery_ops counter which is not quite the same.  Maybe
>>> the various 'subop_..' counters encompass recovery ops?  It's not
>>> clear to me but I'm hoping it is obvious to someone more familiar with
>>> the internals.
>>>
>>> io:
>>> client:   2034 B/s wr, 0 op/s rd, 0 op/s wr
>>> recovery: 1173 MB/s, 8 keys/s, 682 objects/s
>>
>>
>> You'll need to run queries against the PGMap. I'm not sure how that
>> works in the python interfaces but I'm led to believe it's possible.
>> Documentation is probably all in the PGMap.h header; you can look at
>> functions like the "recovery_rate_summary" to see what they're doing.
>
> Try get("pg_status") from a python module, that should contain the
> recovery/client IO amongst other things.
>
> You may find that the fields only appear when they're nonzero, I would
> be happy to see a change that fixed the underlying functions to always
> output the fields (e.g. in PGMapDigest::recovery_rate_summary) when
> writing to a Formatter.  Skipping the irrelevant stuff is only useful
> when doing plain text output.
>
> John
>
>> -Greg
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rgw resharding operation seemingly won't end

2017-10-10 Thread Ryan Leimenstoll
Thanks for the response Yehuda. 


Status:
[root@objproxy02 UMobjstore]# radosgw-admin reshard status --bucket=$bucket_name
[
{
"reshard_status": 1,
"new_bucket_instance_id": 
"8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.47370206.1",
"num_shards": 4
}
]

I cleared the flag using the bucket check --fix command and will keep an eye on
that tracker issue. 

Do you have any insight into why the RGWs ultimately paused/reloaded and failed 
to come back? I am happy to provide more information that could assist. At the 
moment we are somewhat nervous to reenable dynamic sharding as it seems to have 
contributed to this problem. 
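
For reference, the knob we set in ceph.conf to disable it (section name is
whatever the rgw instances are called):

 [client.rgw.objproxy02]
 rgw dynamic resharding = false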

Thanks,
Ryan



> On Oct 9, 2017, at 5:26 PM, Yehuda Sadeh-Weinraub  wrote:
> 
> On Mon, Oct 9, 2017 at 1:59 PM, Ryan Leimenstoll
>  wrote:
>> Hi all,
>> 
>> We recently upgraded to Ceph 12.2.1 (Luminous) from 12.2.0 however are now 
>> seeing issues running radosgw. Specifically, it appears an automatically 
>> triggered resharding operation won’t end, despite the jobs being cancelled 
>> (radosgw-admin reshard cancel). I have also disabled dynamic sharding for 
>> the time being in the ceph.conf.
>> 
>> 
>> [root@objproxy02 ~]# radosgw-admin reshard list
>> []
>> 
>> The two buckets were also reported in the `radosgw-admin reshard list` 
>> before our RGW frontends paused recently (and only came back after a service 
>> restart). These two buckets cannot currently be written to at this point 
>> either.
>> 
>> 2017-10-06 22:41:19.547260 7f90506e9700 0 block_while_resharding ERROR: 
>> bucket is still resharding, please retry
>> 2017-10-06 22:41:19.547411 7f90506e9700 0 WARNING: set_req_state_err 
>> err_no=2300 resorting to 500
>> 2017-10-06 22:41:19.547729 7f90506e9700 0 ERROR: 
>> RESTFUL_IO(s)->complete_header() returned err=Input/output error
>> 2017-10-06 22:41:19.548570 7f90506e9700 1 == req done req=0x7f90506e3180 
>> op status=-2300 http_status=500 ==
>> 2017-10-06 22:41:19.548790 7f90506e9700 1 civetweb: 0x55766d111000: 
>> $MY_IP_HERE$ - - [06/Oct/2017:22:33:47 -0400] "PUT /
>> $REDACTED_BUCKET_NAME$/$REDACTED_KEY_NAME$ HTTP/1.1" 1 0 - Boto3/1.4.7 
>> Python/2.7.12 Linux/4.9.43-17.3
>> 9.amzn1.x86_64 exec-env/AWS_Lambda_python2.7 Botocore/1.7.2 Resource
>> [.. slightly later in the logs..]
>> 2017-10-06 22:41:53.516272 7f90406c9700 1 rgw realm reloader: Frontends 
>> paused
>> 2017-10-06 22:41:53.528703 7f907893f700 0 ERROR: failed to clone shard, 
>> completion_mgr.get_next() returned ret=-125
>> 2017-10-06 22:44:32.049564 7f9074136700 0 ERROR: keystone revocation 
>> processing returned error r=-22
>> 2017-10-06 22:59:32.059222 7f9074136700 0 ERROR: keystone revocation 
>> processing returned error r=-22
>> 
>> Can anyone advise on the best path forward to stop the current sharding 
>> states and avoid this moving forward?
>> 
> 
> What does 'radosgw-admin reshard status --bucket=' return?
> I think just manually resharding the buckets should clear this flag,
> is that not an option?
> manual reshard: radosgw-admin bucket reshard --bucket=
> --num-shards=
> 
> also, the 'radosgw-admin bucket check --fix' might clear that flag.
> 
> For some reason it seems that the reshard cancellation code is not
> clearing that flag on the bucket index header (pretty sure it used to
> do it at one point). I'll open a tracker ticket.
> 
> Thanks,
> Yehuda
> 
>> 
>> Some other details:
>> - 3 rgw instances
>> - Ceph Luminous 12.2.1
>> - 584 active OSDs, rgw bucket index is on Intel NVMe OSDs
>> 
>> 
>> Thanks,
>> Ryan Leimenstoll
>> rleim...@umiacs.umd.edu
>> University of Maryland Institute for Advanced Computer Studies
>> 
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] killing ceph-disk [was Re: ceph-volume: migration and disk partition support]

2017-10-10 Thread Willem Jan Withagen
On 10-10-2017 14:21, Alfredo Deza wrote:
> On Tue, Oct 10, 2017 at 8:14 AM, Willem Jan Withagen  wrote:
>> On 10-10-2017 13:51, Alfredo Deza wrote:
>>> On Mon, Oct 9, 2017 at 8:50 PM, Christian Balzer  wrote:

 Hello,

 (pet peeve alert)
 On Mon, 9 Oct 2017 15:09:29 + (UTC) Sage Weil wrote:

> To put this in context, the goal here is to kill ceph-disk in mimic.
>>
>> Right, that means we need a ceph-volume zfs before things get shot down.
>> Fortunately there is little history to carry over.
>>
>> But then still somebody needs to do the work. ;-|
>> Haven't looked at ceph-volume, but I'll put it on the agenda.
> 
> An interesting take on zfs (and anything else we didn't set up from
> the get-go) is that we envisioned developers might
> want to craft plugins for ceph-volume and expand its capabilities,
> without placing the burden of coming up
> with new device technology to support.
> 
> The other nice aspect of this is that a plugin would get to re-use all
> the tooling in place in ceph-volume. The plugin architecture
> exists but it isn't fully developed/documented yet.

I was part of the original discussion when ceph-volume said it was going
to be pluggable... and I would be a great proponent of the plugins,
if only because ceph-disk is rather convoluted to add to. Not that it
cannot be done, but the code is rather loaded with Linuxisms for its
devices. And it takes some care not to upset the old code, even to the
point that the code for a routine is refactored into 3 new routines: one
OS selector, then the old code for Linux, and the new code for FreeBSD.
And that starts to look like a poor man's plugin. :)

But still I need to find the time, and sharpen my Python skills.
Luckily mimic is 9 months away. :)

--WjW




Re: [ceph-users] ceph-volume: migration and disk partition support

2017-10-10 Thread Alfredo Deza
On Tue, Oct 10, 2017 at 3:28 AM, Stefan Kooman  wrote:
> Hi,
>
> Quoting Alfredo Deza (ad...@redhat.com):
>> Hi,
>>
>> Now that ceph-volume is part of the Luminous release, we've been able
>> to provide filestore support for LVM-based OSDs. We are making use of
>> LVM's powerful mechanisms to store metadata which allows the process
>> to no longer rely on UDEV and GPT labels (unlike ceph-disk).
>>
>> Bluestore support should be the next step for `ceph-volume lvm`, and
>> while that is planned we are thinking of ways to improve the current
>> caveats (like OSDs not coming up) for clusters that have deployed OSDs
>> with ceph-disk.
>
> I'm a bit confused after reading this. Just to make things clear: would
> bluestore be put on top of an LVM volume (in an ideal world)? Does
> bluestore in Ceph Luminous have support for LVM, i.e. is there code in
> bluestore to support LVM? Or is it _just_ support of `ceph-volume lvm`
> for bluestore?

There is currently no support in `ceph-volume lvm` for bluestore yet.
It is being worked on today and should be ready soon (hopefully in the
next Luminous release).

And yes, in the case of  `ceph-volume lvm` it means that bluestore
would be "on top" of LVM volumes.

>
>> --- New clusters ---
>> The `ceph-volume lvm` deployment is straightforward (currently
>> supported in ceph-ansible), but there isn't support for plain disks
>> (with partitions) currently, like there is with ceph-disk.
>>
>> Is there a pressing interest in supporting plain disks with
>> partitions? Or only supporting LVM-based OSDs fine?
>
> We're still in a green field situation. Users with an installed base
> will have to comment on this. If the assumption that bluestore would be
> put on top of LVM is true, it would make things simpler (in our own Ceph
> ansible playbook).

There is already support in ceph-ansible too, which will mean that
when bluestore support is added, it will be added in ceph-ansible at
the same time.
>
> Gr. Stefan
>
> --
> | BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


Re: [ceph-users] killing ceph-disk [was Re: ceph-volume: migration and disk partition support]

2017-10-10 Thread Alfredo Deza
On Tue, Oct 10, 2017 at 8:14 AM, Willem Jan Withagen  wrote:
> On 10-10-2017 13:51, Alfredo Deza wrote:
>> On Mon, Oct 9, 2017 at 8:50 PM, Christian Balzer  wrote:
>>>
>>> Hello,
>>>
>>> (pet peeve alert)
>>> On Mon, 9 Oct 2017 15:09:29 + (UTC) Sage Weil wrote:
>>>
 To put this in context, the goal here is to kill ceph-disk in mimic.
>
> Right, that means we need a ceph-volume zfs before things get shot down.
> Fortunately there is little history to carry over.
>
> But then still somebody needs to do the work. ;-|
> Haven't looked at ceph-volume, but I'll put it on the agenda.

An interesting take on zfs (and anything else we didn't set up from
the get-go) is that we envisioned developers might
want to craft plugins for ceph-volume and expand its capabilities,
without placing the burden of coming up
with new device technology to support.

The other nice aspect of this is that a plugin would get to re-use all
the tooling in place in ceph-volume. The plugin architecture
exists but it isn't fully developed/documented yet.

>
> --WjW
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ceph-users] how to debug (in order to repair) damaged MDS (rank)?

2017-10-10 Thread Daniel Baumann
On 10/10/2017 02:10 PM, John Spray wrote:
> Yes.

worked, rank 6 is back and cephfs up again. thank you very much.

> Do a final ls to make sure you got all of them -- it is
> dangerous to leave any fragments behind.

will do.

> BTW opened http://tracker.ceph.com/issues/21749 for the underlying bug.

thanks; I've saved all the logs, so I'm happy to provide anything you need.

Regards,
Daniel


Re: [ceph-users] killing ceph-disk [was Re: ceph-volume: migration and disk partition support]

2017-10-10 Thread Willem Jan Withagen
On 10-10-2017 13:51, Alfredo Deza wrote:
> On Mon, Oct 9, 2017 at 8:50 PM, Christian Balzer  wrote:
>>
>> Hello,
>>
>> (pet peeve alert)
>> On Mon, 9 Oct 2017 15:09:29 + (UTC) Sage Weil wrote:
>>
>>> To put this in context, the goal here is to kill ceph-disk in mimic.

Right, that means we need a ceph-volume zfs before things get shot down.
Fortunately there is little history to carry over.

But then still somebody needs to do the work. ;-|
Haven't looked at ceph-volume, but I'll put it on the agenda.

--WjW




Re: [ceph-users] how to debug (in order to repair) damaged MDS (rank)?

2017-10-10 Thread John Spray
On Tue, Oct 10, 2017 at 12:30 PM, Daniel Baumann  wrote:
> Hi John,
>
> thank you very much for your help.
>
> On 10/10/2017 12:57 PM, John Spray wrote:
>>  A) Do a "rados -p  ls | grep "^506\." or similar, to
>> get a list of the objects
>
> done, gives me these:
>
>   506.
>   506.0017
>   506.001b
>   506.0019
>   506.001a
>   506.001c
>   506.0018
>   506.0016
>   506.001e
>   506.001f
>   506.001d
>
>>  B) Write a short bash loop to do a "rados -p  get" on
>> each of those objects into a file.
>
> done, saved them as the object name as filename, resulting in these 11
> files:
>
>90 Oct 10 13:17 506.
>  4.0M Oct 10 13:17 506.0016
>  4.0M Oct 10 13:17 506.0017
>  4.0M Oct 10 13:17 506.0018
>  4.0M Oct 10 13:17 506.0019
>  4.0M Oct 10 13:17 506.001a
>  4.0M Oct 10 13:17 506.001b
>  4.0M Oct 10 13:17 506.001c
>  4.0M Oct 10 13:17 506.001d
>  4.0M Oct 10 13:17 506.001e
>  4.0M Oct 10 13:17 506.001f
>
>>  C) Stop the MDS, set "debug mds = 20" and "debug journaler = 20",
>> mark the rank repaired, start the MDS again, and then gather the
>> resulting log (it should end in the same "Error -22 recovering
>> write_pos", but have much much more detail about what came before).
>
> I've attached the entire log from right before issuing "repaired" until
> after the mds drops to standby again.
>
>> Because you've hit a serious bug, it's really important to gather all
>> this and share it, so that we can try to fix it and prevent it
>> happening again to you or others.
>
> absolutely, sure. If you need anything more, I'm happy to share.
>
>> You have two options, depending on how much downtime you can tolerate:
>>  - carefully remove all the metadata objects that start with 506. --
>
> given the outtage (and people need access to their data), I'd go with
> this. Just to be safe: that would go like this?
>
>   rados -p  rm 506.
>   rados -p  rm 506.0016

Yes.  Do a final ls to make sure you got all of them -- it is
dangerous to leave any fragments behind.
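
Something like this, keeping a spare copy of each object before removing
it (substitute your metadata pool name):

 for obj in $(rados -p cephfs_metadata ls | grep '^506\.'); do
     rados -p cephfs_metadata get "$obj" "backup-$obj"
     rados -p cephfs_metadata rm "$obj"
 done
 rados -p cephfs_metadata ls | grep '^506\.'   # final check: should be empty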

BTW opened http://tracker.ceph.com/issues/21749 for the underlying bug.

John


>   [...]
>
> Regards,
> Daniel
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] killing ceph-disk [was Re: ceph-volume: migration and disk partition support]

2017-10-10 Thread Alfredo Deza
On Mon, Oct 9, 2017 at 8:50 PM, Christian Balzer  wrote:
>
> Hello,
>
> (pet peeve alert)
> On Mon, 9 Oct 2017 15:09:29 + (UTC) Sage Weil wrote:
>
>> To put this in context, the goal here is to kill ceph-disk in mimic.
>>
>> One proposal is to make it so new OSDs can *only* be deployed with LVM,
>> and old OSDs with the ceph-disk GPT partitions would be started via
>> ceph-volume support that can only start (but not deploy new) OSDs in that
>> style.
>>
>> Is the LVM-only-ness concerning to anyone?
>>
> If the provision below is met, not really.
>
>> Looking further forward, NVMe OSDs will probably be handled a bit
>> differently, as they'll eventually be using SPDK and kernel-bypass (hence,
>> no LVM).  For the time being, though, they would use LVM.
>>
> And so it begins.
> LVM does a lot of nice things, but not everything for everybody.
> It is also another layer added with all the (minor) reductions in
> performance (with normal storage, not NVMe) and of course potential bugs.
>

ceph-volume was crafted in a way that we wouldn't be forcing anyone to
a single backend (e.g. LVM). Initially it went even further,
as just being a simple orchestrator for getting devices mounted and
starting the OSD with minimal configuration and *regardless* of what
type of devices were being used.

The current status of the LVM portion is *very* robust, although it is
lacking a big chunk of feature parity with ceph-disk. I anticipate
potential bugs
anyway :)

>>
>> On Fri, 6 Oct 2017, Alfredo Deza wrote:
>> > Now that ceph-volume is part of the Luminous release, we've been able
>> > to provide filestore support for LVM-based OSDs. We are making use of
>> > LVM's powerful mechanisms to store metadata which allows the process
>> > to no longer rely on UDEV and GPT labels (unlike ceph-disk).
>> >
>> > Bluestore support should be the next step for `ceph-volume lvm`, and
>> > while that is planned we are thinking of ways to improve the current
>> > caveats (like OSDs not coming up) for clusters that have deployed OSDs
>> > with ceph-disk.
>> >
>> > --- New clusters ---
>> > The `ceph-volume lvm` deployment is straightforward (currently
>> > supported in ceph-ansible), but there isn't support for plain disks
>> > (with partitions) currently, like there is with ceph-disk.
>> >
>> > Is there a pressing interest in supporting plain disks with
>> > partitions? Or only supporting LVM-based OSDs fine?
>>
>> Perhaps the "out" here is to support a "dir" option where the user can
>> manually provision and mount an OSD on /var/lib/ceph/osd/*, with 'journal'
>> or 'block' symlinks, and ceph-volume will do the last bits that initialize
>> the filestore or bluestore OSD from there.  Then if someone has a scenario
>> that isn't captured by LVM (or whatever else we support) they can always
>> do it manually?
>>
> Basically this.
> Since all my old clusters were deployed like this, with no
> chance/intention to upgrade to GPT or even LVM.
> How would symlinks work with Bluestore, the tiny XFS bit?

In this case, we are looking to allow ceph-volume to scan currently
deployed OSDs, and get all the information
needed and save it as a plain configuration file that will be read at
boot time. That is the only other option that
is not dependent on udev/ceph-disk that doesn't mean redoing an OSD
from scratch.

It would be a one-time operation to get out of old deployment's tie
into udev/gpt/ceph-disk

>
>> > --- Existing clusters ---
>> > Migration to ceph-volume, even with plain disk support means
>> > re-creating the OSD from scratch, which would end up moving data.
>> > There is no way to make a GPT/ceph-disk OSD become a ceph-volume one
>> > without starting from scratch.
>> >
>> > A temporary workaround would be to provide a way for existing OSDs to
>> > be brought up without UDEV and ceph-disk, by creating logic in
>> > ceph-volume that could load them with systemd directly. This wouldn't
>> > make them lvm-based, nor it would mean there is direct support for
>> > them, just a temporary workaround to make them start without UDEV and
>> > ceph-disk.
>> >
>> > I'm interested in what current users might look for here: is it fine
>> > to provide this workaround if the issues are that problematic? Or is
>> > it OK to plan a migration towards ceph-volume OSDs?
>>
>> IMO we can't require any kind of data migration in order to upgrade, which
>> means we either have to (1) keep ceph-disk around indefinitely, or (2)
>> teach ceph-volume to start existing GPT-style OSDs.  Given all of the
>> flakiness around udev, I'm partial to #2.  The big question for me is
>> whether #2 alone is sufficient, or whether ceph-volume should also know
>> how to provision new OSDs using partitions and no LVM.  Hopefully not?
>>
> I really disliked the udev/GPT stuff from the get-go, and "flakiness"
> is being kind for what is sometimes completely nondeterministic behavior.
>

Yep, forcing users to always fit one model seemed annoying to me. I
understand the 

Re: [ceph-users] how to debug (in order to repair) damaged MDS (rank)?

2017-10-10 Thread Daniel Baumann
Hi John,

thank you very much for your help.

On 10/10/2017 12:57 PM, John Spray wrote:
>  A) Do a "rados -p  ls | grep "^506\." or similar, to
> get a list of the objects

done, gives me these:

  506.
  506.0017
  506.001b
  506.0019
  506.001a
  506.001c
  506.0018
  506.0016
  506.001e
  506.001f
  506.001d

>  B) Write a short bash loop to do a "rados -p  get" on
> each of those objects into a file.

done, saved each of them using the object name as the filename,
resulting in these 11 files:

   90 Oct 10 13:17 506.
 4.0M Oct 10 13:17 506.0016
 4.0M Oct 10 13:17 506.0017
 4.0M Oct 10 13:17 506.0018
 4.0M Oct 10 13:17 506.0019
 4.0M Oct 10 13:17 506.001a
 4.0M Oct 10 13:17 506.001b
 4.0M Oct 10 13:17 506.001c
 4.0M Oct 10 13:17 506.001d
 4.0M Oct 10 13:17 506.001e
 4.0M Oct 10 13:17 506.001f

>  C) Stop the MDS, set "debug mds = 20" and "debug journaler = 20",
> mark the rank repaired, start the MDS again, and then gather the
> resulting log (it should end in the same "Error -22 recovering
> write_pos", but have much much more detail about what came before).

I've attached the entire log from right before issuing "repaired" until
after the mds drops to standby again.

> Because you've hit a serious bug, it's really important to gather all
> this and share it, so that we can try to fix it and prevent it
> happening again to you or others.

absolutely, sure. If you need anything more, I'm happy to share.

> You have two options, depending on how much downtime you can tolerate:
>  - carefully remove all the metadata objects that start with 506. --

given the outage (and people need access to their data), I'd go with
this. Just to be safe: would that go like this?

  rados -p  rm 506.
  rados -p  rm 506.0016
  [...]
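
or, equivalently, as one loop over all of them (a sketch only, assuming
the metadata pool is called cephfs_metadata; substitute the real pool
name):

  # delete every rank-6 purge queue object
  for obj in $(rados -p cephfs_metadata ls | grep '^506\.'); do
      rados -p cephfs_metadata rm "$obj"
  done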

Regards,
Daniel
2017-10-10 13:21:55.413752 7f3f3011a700  5 mds.mds9 handle_mds_map epoch 96224 
from mon.0
2017-10-10 13:21:55.413836 7f3f3011a700 10 mds.mds9  my compat 
compat={},rocompat={},incompat={1=base v0.20,2=client writeable 
ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses 
versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=file 
layout v2}
2017-10-10 13:21:55.413847 7f3f3011a700 10 mds.mds9  mdsmap compat 
compat={},rocompat={},incompat={1=base v0.20,2=client writeable 
ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses 
versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
2017-10-10 13:21:55.413852 7f3f3011a700 10 mds.mds9 map says I am 
147.87.226.189:6800/1634944095 mds.6.96224 state up:replay
2017-10-10 13:21:55.414088 7f3f3011a700  4 mds.6.purge_queue operator():  data 
pool 7 not found in OSDMap
2017-10-10 13:21:55.414141 7f3f3011a700 10 mds.mds9 handle_mds_map: 
initializing MDS rank 6
2017-10-10 13:21:55.414410 7f3f3011a700 10 mds.6.0 update_log_config 
log_to_monitors {default=true}
2017-10-10 13:21:55.414415 7f3f3011a700 10 mds.6.0 create_logger
2017-10-10 13:21:55.414635 7f3f3011a700  7 mds.6.server operator(): full = 0 
epoch = 0
2017-10-10 13:21:55.414644 7f3f3011a700  4 mds.6.purge_queue operator():  data 
pool 7 not found in OSDMap
2017-10-10 13:21:55.414648 7f3f3011a700  4 mds.6.0 handle_osd_map epoch 0, 0 
new blacklist entries
2017-10-10 13:21:55.414660 7f3f3011a700 10 mds.6.server apply_blacklist: killed 0
2017-10-10 13:21:55.414830 7f3f3011a700 10 mds.mds9 handle_mds_map: handling 
map as rank 6
2017-10-10 13:21:55.414839 7f3f3011a700  1 mds.6.96224 handle_mds_map i am now 
mds.6.96224
2017-10-10 13:21:55.414843 7f3f3011a700  1 mds.6.96224 handle_mds_map state 
change up:boot --> up:replay
2017-10-10 13:21:55.414855 7f3f3011a700 10 mds.beacon.mds9 set_want_state: 
up:standby -> up:replay
2017-10-10 13:21:55.414859 7f3f3011a700  1 mds.6.96224 replay_start
2017-10-10 13:21:55.414873 7f3f3011a700  7 mds.6.cache set_recovery_set 
0,1,2,3,4,5,7,8
2017-10-10 13:21:55.414883 7f3f3011a700  1 mds.6.96224  recovery set is 
0,1,2,3,4,5,7,8
2017-10-10 13:21:55.414893 7f3f3011a700  1 mds.6.96224  waiting for osdmap 
18607 (which blacklists prior instance)
2017-10-10 13:21:55.414901 7f3f3011a700  4 mds.6.purge_queue operator():  data 
pool 7 not found in OSDMap
2017-10-10 13:21:55.416011 7f3f3011a700  7 mds.6.server operator(): full = 0 
epoch = 18608
2017-10-10 13:21:55.416024 7f3f3011a700  4 mds.6.96224 handle_osd_map epoch 
18608, 0 new blacklist entries
2017-10-10 13:21:55.416027 7f3f3011a700 10 mds.6.server apply_blacklist: killed 0
2017-10-10 13:21:55.416076 7f3f2a10e700 10 MDSIOContextBase::complete: 
12C_IO_Wrapper
2017-10-10 13:21:55.416095 7f3f2a10e700 10 MDSInternalContextBase::complete: 
15C_MDS_BootStart
2017-10-10 13:21:55.416101 7f3f2a10e700  2 mds.6.96224 boot_start 0: opening 
inotable
2017-10-10 13:21:55.416120 7f3f2a10e700 10 mds.6.inotable: load
2017-10-10 13:21:55.416301 7f3f2a10e700  2 mds.6.96224 boot_start 0: opening 
sessionmap
2017-10-10 13:21:55.416310 

[ceph-users] BlueStore Cache Ratios

2017-10-10 Thread Jorge Pinilla López
I've been reading about BlueStore and came across the BlueStore cache
and its ratios, but I couldn't fully understand them.

Are the ratios of .99 for KV, .01 for metadata and .0 for data right?
They seem a little too disproportionate.
Also, a .99 KV ratio with a 3GB cache for SSDs means almost all of the
3GB would be used for KV, but there is also another attribute called
bluestore_cache_kv_max, which is by default 512MB. What is the rest of
the cache used for then? Nothing? Shouldn't the kv_max value be higher,
or the KV ratio lower?

I know it really depends on the environment (size, amount of I/Os,
files...) but all of these values seem a little odd and not reasonable
to me.

Is there any way I can roughly estimate how much KV and metadata is
generated per GB of actual data?

Is there any point in leaving some cache for the data itself, or is it
better to just cache onodes and metadata?

Another little question: I don't really understand where BlueStore gets
its speed from, since it actually writes the data directly to the end
device (unlike FileStore, where you had a journal). Shouldn't the speed
then be limited by that device's write speed, even with an SSD for
RocksDB?

Thanks a lot.


*Jorge Pinilla López*
jorp...@unizar.es

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unable to restrict a CephFS client to a subdirectory

2017-10-10 Thread Yoann Moulin

>> I am trying to follow the instructions at:
>> http://docs.ceph.com/docs/master/cephfs/client-auth/
>> to restrict a client to a subdirectory of  Ceph filesystem, but always get
>> an error.
>>
>> We are running the latest stable release of Ceph (v12.2.1) on CentOS 7
>> servers. The user 'hydra' has the following capabilities:
>> # ceph auth get client.hydra
>> exported keyring for client.hydra
>> [client.hydra]
>> key = AQ==
>> caps mds = "allow rw"
>> caps mgr = "allow r"
>> caps mon = "allow r"
>> caps osd = "allow rw"
>>
>> When I tried to restrict the client to only mount and work within the
>> directory /hydra of the Ceph filesystem 'pulpos', I got an error:
>> # ceph fs authorize pulpos client.hydra /hydra rw
>> Error EINVAL: key for client.dong exists but cap mds does not match
>>
>> I've tried a few combinations of user caps and CephFS client caps; but
>> always got the same error!
> 
> The "fs authorize" command isn't smart enough to edit existing
> capabilities safely, so it is cautious and refuses to overwrite what
> is already there.  If you remove your client.hydra user and try again,
> it should create it for you with the correct capabilities.

I confirm it works perfectly! It should be added to the documentation. :)

# ceph fs authorize cephfs client.foo1 /foo1 rw
[client.foo1]
key = XXX1
# ceph fs authorize cephfs client.foo2 / r /foo2 rw
[client.foo2]
key = XXX2

# ceph auth get client.foo1
exported keyring for client.foo1
[client.foo1]
key = XXX1
caps mds = "allow rw path=/foo1"
caps mon = "allow r"
caps osd = "allow rw pool=cephfs_data"

# ceph auth get client.foo2
exported keyring for client.foo2
[client.foo2]
key = XXX2
caps mds = "allow r, allow rw path=/foo2"
caps mon = "allow r"
caps osd = "allow rw pool=cephfs_data"

Best regards,

-- 
Yoann Moulin
EPFL IC-IT
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unable to restrict a CephFS client to a subdirectory

2017-10-10 Thread John Spray
On Tue, Oct 10, 2017 at 2:22 AM, Shawfeng Dong  wrote:
> Dear all,
>
> I am trying to follow the instructions at:
> http://docs.ceph.com/docs/master/cephfs/client-auth/
> to restrict a client to a subdirectory of  Ceph filesystem, but always get
> an error.
>
> We are running the latest stable release of Ceph (v12.2.1) on CentOS 7
> servers. The user 'hydra' has the following capabilities:
> # ceph auth get client.hydra
> exported keyring for client.hydra
> [client.hydra]
> key = AQ==
> caps mds = "allow rw"
> caps mgr = "allow r"
> caps mon = "allow r"
> caps osd = "allow rw"
>
> When I tried to restrict the client to only mount and work within the
> directory /hydra of the Ceph filesystem 'pulpos', I got an error:
> # ceph fs authorize pulpos client.hydra /hydra rw
> Error EINVAL: key for client.dong exists but cap mds does not match
>
> I've tried a few combinations of user caps and CephFS client caps; but
> always got the same error!

The "fs authorize" command isn't smart enough to edit existing
capabilities safely, so it is cautious and refuses to overwrite what
is already there.  If you remove your client.hydra user and try again,
it should create it for you with the correct capabilities.

John

>
> Has anyone been able to get this to work? What is your recipe?
>
> Thanks,
> Shaw
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to debug (in order to repair) damaged MDS (rank)?

2017-10-10 Thread John Spray
On Tue, Oct 10, 2017 at 10:28 AM, Daniel Baumann  wrote:
> Hi all,
>
> unfortunately I'm still struggling to bring cephfs back up after one of
> the MDSes has been marked "damaged" (see my messages from Monday).
>
> 1. When I mark the rank as "repaired", this is what I get in the monitor
>log (leaving unrelated leveldb compacting chatter aside):
>
> 2017-10-10 10:51:23.177865 7f3290710700  0 log_channel(audit) log [INF]
> : from='client.? 147.87.226.72:0/1658479115' entity='client.admin' cmd
> ='[{"prefix": "mds repaired", "rank": "6"}]': finished
> 2017-10-10 10:51:23.177993 7f3290710700  0 log_channel(cluster) log
> [DBG] : fsmap cephfs-9/9/9 up  {0=mds1=up:resolve,1=mds2=up:resolve,2=mds3
> =up:resolve,3=mds4=up:resolve,4=mds5=up:resolve,5=mds6=up:resolve,6=mds9=up:replay,7=mds7=up:resolve,8=mds8=up:resolve}
> [...]
>
> 2017-10-10 10:51:23.492040 7f328ab1c700  1 mon.mon1@0(leader).mds e96186
>  mds mds.? 147.87.226.189:6800/524543767 can't write to fsmap compat=
> {},rocompat={},incompat={1=base v0.20,2=client writeable
> ranges,3=default file layouts on dirs,4=dir inode in separate
> object,5=mds uses versi
> oned encoding,6=dirfrag is stored in omap,8=file layout v2}
> [...]
>
> 2017-10-10 10:51:24.291827 7f328d321700 -1 log_channel(cluster) log
> [ERR] : Health check failed: 1 mds daemon damaged (MDS_DAMAGE)
>
> 2. ...and this is what I get on the mds:
>
> 2017-10-10 11:21:26.537204 7fcb01702700 -1 mds.6.journaler.pq(ro)
> _decode error from assimilate_prefetch
> 2017-10-10 11:21:26.537223 7fcb01702700 -1 mds.6.purge_queue _recover:
> Error -22 recovering write_pos

This is probably the root cause: somehow the PurgeQueue (one of the
on-disk metadata structures) has become corrupt.

The purge queue objects for rank 6 will all have names starting "506."
in the metadata pool.

This is probably the result of a bug of some kind, so to give us a
chance of working out what went wrong let's gather some evidence
first:
 A) Do a "rados -p  ls | grep "^506\." or similar, to
get a list of the objects
 B) Write a short bash loop to do a "rados -p  get" on
each of those objects into a file.
 C) Stop the MDS, set "debug mds = 20" and "debug journaler = 20",
mark the rank repaired, start the MDS again, and then gather the
resulting log (it should end in the same "Error -22 recovering
write_pos", but have much much more detail about what came before).

Because you've hit a serious bug, it's really important to gather all
this and share it, so that we can try to fix it and prevent it
happening again to you or others.

Once you've put all that evidence somewhere safe, you can start
intervening to repair it.  The good news is that this is the best part
of your metadata to damage, because all it does is record the list of
deleted files to purge.

You have two options, depending on how much downtime you can tolerate:
 - carefully remove all the metadata objects that start with 506. --
this will cause that MDS rank to completely forget about purging
anything in its queue.  This will leave some orphan data objects in
the data pool that will never get cleaned up (without doing some more
offline repair).
 - inspect the detailed logs from step C of the evidence gathering, to
work out exactly how far the journal loading got before hitting
something corrupt.  Then with some finer-grained editing of the
on-disk objects, we can persuade it to skip over the part that was
damaged.

John

> (see attachment for the full mds log during the "repair" action)
>
>
> I'm really stuck here and would greatly appreciate any help. How can I
> see what is actually going on / what the problem is? Running
> ceph-mon/ceph-mds with higher debug levels just logs "damaged" as
> quoted above, but doesn't say what is wrong or why it's failing.
>
> would going back to single MDS with "ceph fs reset" allow me to access
> the data again?
>
>
> Regards,
> Daniel
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] advice on number of objects per OSD

2017-10-10 Thread Alexander Kushnirenko
Hi,

Are there any recommendations on the limit at which OSD performance
starts to decline because of a large number of objects? Or perhaps a
procedure for finding this number (on luminous)?  My understanding is
that the recommended object size is 10-100 MB, but is there any
performance hit due to a large number of objects?  I ran across a figure
of about 1M objects; is that right?  We do not have a dedicated SSD for
the journal, and we use librados for I/O.

Alexander.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] 1 MDSs behind on trimming (was Re: clients failing to advance oldest client/flush tid)

2017-10-10 Thread John Spray
On Tue, Oct 10, 2017 at 3:48 AM, Nigel Williams
 wrote:
> On 9 October 2017 at 19:21, Jake Grimmett  wrote:
>> HEALTH_WARN 9 clients failing to advance oldest client/flush tid;
>> 1 MDSs report slow requests; 1 MDSs behind on trimming

(This is the less worrying of the original thread's messages, so I've
edited the subject line)

> On a proof-of-concept 12.2.1 cluster (few random files added, 30 OSDs,
> default Ceph settings) I can get the above error by doing this from a
> client:
>
> bonnie++ -s 0 -n 1000 -u 0
>
> This makes 1 million files in a single directory (we wanted to see
> what might break).
>
> This takes a few hours to run but seems to finish without incident.
> Over that time we get this in the logs:

We do sometimes see this in systems that have the metadata pool either
on kinda-slow drives, or on drives that are shared with a very busy
data pool.  If either is the case (i.e. if your OSDs are very busy)
then the warning is probably nothing to worry about (it does make me
wonder if we should make the default journal length longer though).

You can make the system more tolerant of slow metadata writeback by
adjusting mds_log_max_segments upwards (for example, doubling from the
default 30 is not a big deal).
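
For example, on a running MDS (assuming a daemon called mds.a; the
setting can also be put in ceph.conf):

  # double the number of journal segments kept before trimming
  ceph tell mds.a injectargs '--mds_log_max_segments=60'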

If your OSDs are *not* very busy, and you're still seeing this
warning, then you are hitting a bug and it's worth investigating.

John

>
> root@c0mon-101:/var/log/ceph# zcat ceph-mon.c0mon-101.log.6.gz|fgrep MDS_TRIM
> 2017-10-04 11:14:18.489943 7ff914a26700  0 log_channel(cluster) log
> [WRN] : Health check failed: 1 MDSs behind on trimming (MDS_TRIM)
> 2017-10-04 11:14:22.523117 7ff914a26700  0 log_channel(cluster) log
> [INF] : Health check cleared: MDS_TRIM (was: 1 MDSs behind on
> trimming)
> 2017-10-04 11:14:26.589797 7ff914a26700  0 log_channel(cluster) log
> [WRN] : Health check failed: 1 MDSs behind on trimming (MDS_TRIM)
> 2017-10-04 11:14:34.614567 7ff914a26700  0 log_channel(cluster) log
> [INF] : Health check cleared: MDS_TRIM (was: 1 MDSs behind on
> trimming)
> 2017-10-04 20:38:22.812032 7ff914a26700  0 log_channel(cluster) log
> [WRN] : Health check failed: 1 MDSs behind on trimming (MDS_TRIM)
> 2017-10-04 20:41:14.700521 7ff914a26700  0 log_channel(cluster) log
> [INF] : Health check cleared: MDS_TRIM (was: 1 MDSs behind on
> trimming)
> root@c0mon-101:/var/log/ceph#
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] how to debug (in order to repair) damaged MDS (rank)?

2017-10-10 Thread Daniel Baumann
Hi all,

unfortunately I'm still struggling to bring cephfs back up after one of
the MDSes has been marked "damaged" (see my messages from Monday).

1. When I mark the rank as "repaired", this is what I get in the monitor
   log (leaving unrelated leveldb compacting chatter aside):

2017-10-10 10:51:23.177865 7f3290710700  0 log_channel(audit) log [INF]
: from='client.? 147.87.226.72:0/1658479115' entity='client.admin' cmd
='[{"prefix": "mds repaired", "rank": "6"}]': finished
2017-10-10 10:51:23.177993 7f3290710700  0 log_channel(cluster) log
[DBG] : fsmap cephfs-9/9/9 up  {0=mds1=up:resolve,1=mds2=up:resolve,2=mds3
=up:resolve,3=mds4=up:resolve,4=mds5=up:resolve,5=mds6=up:resolve,6=mds9=up:replay,7=mds7=up:resolve,8=mds8=up:resolve}
[...]

2017-10-10 10:51:23.492040 7f328ab1c700  1 mon.mon1@0(leader).mds e96186
 mds mds.? 147.87.226.189:6800/524543767 can't write to fsmap compat=
{},rocompat={},incompat={1=base v0.20,2=client writeable
ranges,3=default file layouts on dirs,4=dir inode in separate
object,5=mds uses versi
oned encoding,6=dirfrag is stored in omap,8=file layout v2}
[...]

2017-10-10 10:51:24.291827 7f328d321700 -1 log_channel(cluster) log
[ERR] : Health check failed: 1 mds daemon damaged (MDS_DAMAGE)

2. ...and this is what I get on the mds:

2017-10-10 11:21:26.537204 7fcb01702700 -1 mds.6.journaler.pq(ro)
_decode error from assimilate_prefetch
2017-10-10 11:21:26.537223 7fcb01702700 -1 mds.6.purge_queue _recover:
Error -22 recovering write_pos

(see attachment for the full mds log during the "repair" action)


I'm really stuck here and would greatly appreciate any help. How can I
see what is actually going on / what the problem is? Running
ceph-mon/ceph-mds with higher debug levels just logs "damaged" as quoted
above, but doesn't say what is wrong or why it's failing.

would going back to single MDS with "ceph fs reset" allow me to access
the data again?

Regards,
Daniel
2017-10-10 11:21:26.419394 7fcb0670c700 10 mds.mds9  mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=de
fault file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
2017-10-10 11:21:26.419399 7fcb0670c700 10 mds.mds9 map says I am 147.87.226.189:6800/1182896077 mds.6.96195 state up:replay
2017-10-10 11:21:26.419623 7fcb0670c700  4 mds.6.purge_queue operator():  data pool 7 not found in OSDMap
2017-10-10 11:21:26.419679 7fcb0670c700 10 mds.mds9 handle_mds_map: initializing MDS rank 6
2017-10-10 11:21:26.419916 7fcb0670c700 10 mds.6.0 update_log_config log_to_monitors {default=true}
2017-10-10 11:21:26.419920 7fcb0670c700 10 mds.6.0 create_logger
2017-10-10 11:21:26.420138 7fcb0670c700  7 mds.6.server operator(): full = 0 epoch = 0
2017-10-10 11:21:26.420146 7fcb0670c700  4 mds.6.purge_queue operator():  data pool 7 not found in OSDMap
2017-10-10 11:21:26.420150 7fcb0670c700  4 mds.6.0 handle_osd_map epoch 0, 0 new blacklist entries
2017-10-10 11:21:26.420159 7fcb0670c700 10 mds.6.server apply_blacklist: killed 0
2017-10-10 11:21:26.420338 7fcb0670c700 10 mds.mds9 handle_mds_map: handling map as rank 6
2017-10-10 11:21:26.420347 7fcb0670c700  1 mds.6.96195 handle_mds_map i am now mds.6.96195
2017-10-10 11:21:26.420351 7fcb0670c700  1 mds.6.96195 handle_mds_map state change up:boot --> up:replay
2017-10-10 11:21:26.420366 7fcb0670c700 10 mds.beacon.mds9 set_want_state: up:standby -> up:replay
2017-10-10 11:21:26.420370 7fcb0670c700  1 mds.6.96195 replay_start
2017-10-10 11:21:26.420375 7fcb0670c700  7 mds.6.cache set_recovery_set 0,1,2,3,4,5,7,8
2017-10-10 11:21:26.420380 7fcb0670c700  1 mds.6.96195  recovery set is 0,1,2,3,4,5,7,8
2017-10-10 11:21:26.420395 7fcb0670c700  1 mds.6.96195  waiting for osdmap 18593 (which blacklists prior instance)
2017-10-10 11:21:26.420401 7fcb0670c700  4 mds.6.purge_queue operator():  data pool 7 not found in OSDMap
2017-10-10 11:21:26.421206 7fcb0670c700  7 mds.6.server operator(): full = 0 epoch = 18593
2017-10-10 11:21:26.421217 7fcb0670c700  4 mds.6.96195 handle_osd_map epoch 18593, 0 new blacklist entries
2017-10-10 11:21:26.421220 7fcb0670c700 10 mds.6.server apply_blacklist: killed 0
2017-10-10 11:21:26.421253 7fcb00700700 10 MDSIOContextBase::complete: 12C_IO_Wrapper
2017-10-10 11:21:26.421263 7fcb00700700 10 MDSInternalContextBase::complete: 15C_MDS_BootStart
2017-10-10 11:21:26.421267 7fcb00700700  2 mds.6.96195 boot_start 0: opening inotable
2017-10-10 11:21:26.421285 7fcb00700700 10 mds.6.inotable: load
2017-10-10 11:21:26.421441 7fcb00700700  2 mds.6.96195 boot_start 0: opening sessionmap
2017-10-10 11:21:26.421449 7fcb00700700 10 mds.6.sessionmap load
2017-10-10 11:21:26.421551 7fcb00700700  2 mds.6.96195 boot_start 0: opening mds log
2017-10-10 11:21:26.421558 7fcb00700700  5 mds.6.log open discovering log bounds
2017-10-10 11:21:26.421720 7fcaff6fe700 10 mds.6.log _submit_thread start
2017-10-10 11:21:26.423002 7fcb00700700 10 MDSIOContextBase::complete: N12_GLOBAL__N_112C_IO_SM_LoadE

Re: [ceph-users] ceph-volume: migration and disk partition support

2017-10-10 Thread Dan van der Ster
On Fri, Oct 6, 2017 at 6:56 PM, Alfredo Deza  wrote:
> Hi,
>
> Now that ceph-volume is part of the Luminous release, we've been able
> to provide filestore support for LVM-based OSDs. We are making use of
> LVM's powerful mechanisms to store metadata which allows the process
> to no longer rely on UDEV and GPT labels (unlike ceph-disk).
>
> Bluestore support should be the next step for `ceph-volume lvm`, and
> while that is planned we are thinking of ways to improve the current
> caveats (like OSDs not coming up) for clusters that have deployed OSDs
> with ceph-disk.
>
> --- New clusters ---
> The `ceph-volume lvm` deployment is straightforward (currently
> supported in ceph-ansible), but there isn't support for plain disks
> (with partitions) currently, like there is with ceph-disk.
>
> Is there a pressing interest in supporting plain disks with
> partitions? Or is only supporting LVM-based OSDs fine?
>
> --- Existing clusters ---
> Migration to ceph-volume, even with plain disk support means
> re-creating the OSD from scratch, which would end up moving data.
> There is no way to make a GPT/ceph-disk OSD become a ceph-volume one
> without starting from scratch.
>
> A temporary workaround would be to provide a way for existing OSDs to
> be brought up without UDEV and ceph-disk, by creating logic in
> ceph-volume that could load them with systemd directly. This wouldn't
> make them lvm-based, nor would it mean there is direct support for
> them, just a temporary workaround to make them start without UDEV and
> ceph-disk.
>
> I'm interested in what current users might look for here: is it fine
> to provide this workaround if the issues are that problematic? Or is
> it OK to plan a migration towards ceph-volume OSDs?

Without fully understanding the technical details and plans, it will
be hard to answer this.

In general, I wouldn't plan to recreate all OSDs. In our case, we
don't currently plan to recreate FileStore OSDs as Bluestore after the
Luminous upgrade, as that would be too much work. *New* OSDs will be
created the *new* way (is that ceph-disk bluestore? ceph-volume lvm
bluestore??) It wouldn't be nice if we created new OSDs today with
ceph-disk bluestore, then have to recreate all those with ceph-volume
bluestore in a few months.

Disks/servers have a ~5 year lifetime, and we want to format OSDs
exactly once. I'd hope those OSDs remain bootable for the upcoming
releases.

(ceph-disk activation works reliably enough here -- just don't remove
the existing functionality and we'll be happy).

-- dan

>
> -Alfredo
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume: migration and disk partition support

2017-10-10 Thread Stefan Kooman
Hi,

Quoting Alfredo Deza (ad...@redhat.com):
> Hi,
> 
> Now that ceph-volume is part of the Luminous release, we've been able
> to provide filestore support for LVM-based OSDs. We are making use of
> LVM's powerful mechanisms to store metadata which allows the process
> to no longer rely on UDEV and GPT labels (unlike ceph-disk).
> 
> Bluestore support should be the next step for `ceph-volume lvm`, and
> while that is planned we are thinking of ways to improve the current
> caveats (like OSDs not coming up) for clusters that have deployed OSDs
> with ceph-disk.

I'm a bit confused after reading this, so just to make things clear:
would bluestore be put on top of an LVM volume (in an ideal world)? Does
bluestore in Ceph Luminous have support for LVM, i.e. is there code in
bluestore to support LVM? Or is it _just_ support for bluestore in
`ceph-volume lvm`?

> --- New clusters ---
> The `ceph-volume lvm` deployment is straightforward (currently
> supported in ceph-ansible), but there isn't support for plain disks
> (with partitions) currently, like there is with ceph-disk.
> 
> Is there a pressing interest in supporting plain disks with
> partitions? Or is only supporting LVM-based OSDs fine?

We're still in a green field situation. Users with an installed base
will have to comment on this. If the assumption that bluestore would be
put on top of LVM is true, it would make things simpler (in our own Ceph
ansible playbook).

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unable to restrict a CephFS client to a subdirectory

2017-10-10 Thread Yoann Moulin
Hello,

> I am trying to follow the instructions at:
> http://docs.ceph.com/docs/master/cephfs/client-auth/
> to restrict a client to a subdirectory of  Ceph filesystem, but always get an 
> error.
> 
> We are running the latest stable release of Ceph (v12.2.1) on CentOS 7 
> servers. The user 'hydra' has the following capabilities:
> # ceph auth get client.hydra
> exported keyring for client.hydra
> [client.hydra]
>         key = AQ==
>         caps mds = "allow rw"
>         caps mgr = "allow r"
>         caps mon = "allow r"
>         caps osd = "allow rw"
> 
> When I tried to restrict the client to only mount and work within the 
> directory /hydra of the Ceph filesystem 'pulpos', I got an error:
> # ceph fs authorize pulpos client.hydra /hydra rw
> Error EINVAL: key for client.dong exists but cap mds does not match
> 
> I've tried a few combinations of user caps and CephFS client caps; but always 
> got the same error!
> 
> Has anyone been able to get this to work? What is your recipe?

In the case where the client runs an old kernel (at least 4.4 is old,
4.10 is not), you need to give read access to the entire cephfs
filesystem; if not, you won't be able to mount the subdirectory.

1/ give read access to the mds and rw to the subdirectory :

  # ceph auth get-or-create client.foo mon "allow r" osd "allow rw 
pool=cephfs_data" mds "allow r, allow rw path=/foo"

or, if client.foo already exists :

  # ceph auth caps client.foo mon "allow r" osd "allow rw pool=cephfs_data" mds 
"allow r, allow rw path=/foo"

[client.foo]
key = XXX
caps mds = "allow r, allow rw path=/foo"
caps mon = "allow r"
caps osd = "allow rw pool=cephfs_data"

2/ you give read access to / and rw access to the subdirectory :

  # ceph fs authorize cephfs client.foo / r /foo rw

Then you get the secret key and mount :

  # ceph --cluster container auth get-key client.foo > foo.secret
  # mount.ceph mds1,mds2,mds3:/foo /foo -v -o 
name=foo,secretfile=/path/to/foo.secret

With an old kernel, you will always be able to mount the root of the cephfs fs.

  # mount.ceph mds1,mds2,mds3:/ /foo -v -o 
name=foo,secretfile=/path/to/foo.secret

If your client runs a not-so-old kernel, you can do this :

1/ you need to give access to the specific path, like :

  # ceph auth get-or-create client.bar mon "allow r" osd "allow rw 
pool=cephfs_data" mds "allow rw path=/bar"

or, if client.bar already exists :

  # ceph auth caps client.bar mon "allow r" osd "allow rw pool=cephfs_data" mds 
"allow rw path=/bar"

[client.bar]
key = XXX
caps mds = "allow rw path=/bar"
caps mon = "allow r"
caps osd = "allow rw pool=cephfs_data"

2/ you give rw access only on the subdirectory :

  # ceph fs authorize cephfs client.bar /bar rw

Then you get the secret key and mount :

  # ceph --cluster container auth get-key client.bar > bar.secret
  # mount.ceph mds1,mds2,mds3:/bar /bar -v -o 
name=bar,secretfile=/path/to/bar.secret

If you try to mount the cephfs root, you should get an access denied error:

  # mount.ceph mds1,mds2,mds3:/ /bar -v -o 
name=bar,secretfile=/path/to/bar.secret


In case you want to increase security, you might have a look at
namespaces and file layouts:

http://docs.ceph.com/docs/master/cephfs/file-layouts/

I haven't had a look at them yet, but they look really interesting!


> 
> Thanks,
> Shaw
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Yoann Moulin
EPFL IC-IT
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com