Re: [ceph-users] CephFS with cache-tier kernel-mount client unable to write (Nautilus)

2020-01-21 Thread Ilya Dryomov
On Tue, Jan 21, 2020 at 7:51 PM Hayashida, Mami  wrote:
>
> Ilya,
>
> Thank you for your suggestions!
>
> `dmesg` (on the client node) only had `libceph: mon0 10.33.70.222:6789 socket 
> error on write`.  No further detail.  But using the admin key (client.admin) 
> for mounting CephFS solved my problem.  I was able to write successfully! :-)
>
> $ sudo mount -t ceph 10.33.70.222:6789:/  /mnt/cephfs -o 
> name=admin,secretfile=/etc/ceph/fsclient_secret // with the corresponding 
> client.admin key
>
> $ sudo vim /mnt/cephfs/file4
> $ sudo ls -l /mnt/cephfs
> total 1
> -rw-r--r-- 1 root root  0 Jan 21 16:25 file1
> -rw-r--r-- 1 root root  0 Jan 21 16:45 file2
> -rw-r--r-- 1 root root  0 Jan 21 18:35 file3
> -rw-r--r-- 1 root root 22 Jan 21 18:42 file4
>
> Now, here is the difference between the two keys. client.testuser was 
> obviously generated with the command `ceph fs authorize cephfs_test 
> client.testuser / rw`, but something in there is obviously interfering with 
> CephFS with a Cache Tier pool.  Do I need to edit the `tag` or the `data` 
> part?  Now, I should mention the same type of key (like client.testuser) 
> worked just fine when I was testing CephFS without a Cache Tier pool.
>
> client.admin
> key: XXXZZZ
> caps: [mds] allow *
> caps: [mgr] allow *
> caps: [mon] allow *
> caps: [osd] allow *
>
> client.testuser
> key: XXXZZZ
> caps: [mds] allow rw
> caps: [mon] allow r
> caps: [osd] allow rw tag cephfs data=cephfs_test

Right.  I think this is because with cache tiering you have two data
pools involved, but "ceph fs authorize" generates an OSD cap that ends
up restricting the client to the data pool that the filesystem
"knows" about.

You will probably need to create your client users by hand instead of
generating them with "ceph fs authorize".  CCing Patrick who might know
more.
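
For example, something along these lines should cover both tiers (untested
sketch; pool names taken from your earlier setup, adjust as needed):

  $ ceph auth get-or-create client.testuser2 \
      mds 'allow rw' \
      mon 'allow r' \
      osd 'allow rw pool=cephfs-data, allow rw pool=cephfs-data-cache'

With the overlay set, client I/O is redirected to the cache pool, which is
why it needs to appear in the OSD cap as well.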

Thanks,

Ilya


Re: [ceph-users] CephFS with cache-tier kernel-mount client unable to write (Nautilus)

2020-01-21 Thread Ilya Dryomov
On Tue, Jan 21, 2020 at 6:02 PM Hayashida, Mami  wrote:
>
> I am trying to set up a CephFS with a Cache Tier (for data) on a mini test 
> cluster, but a kernel-mount CephFS client is unable to write.  Cache tier 
> setup alone seems to be working fine (I tested it with `rados put` and `osd 
> map` commands to verify on which OSDs the objects are placed) and setting up 
> CephFS without the cache-tiering also worked fine on the same cluster with 
> the same client, but combining the two fails.  Here is what I have tried:
>
> Ceph version: 14.2.6
>
> Set up Cache Tier:
> $ ceph osd crush rule create-replicated highspeedpool default host ssd
> $ ceph osd crush rule create-replicated highcapacitypool default host hdd
>
> $ ceph osd pool create cephfs-data 256 256 highcapacitypool
> $ ceph osd pool create cephfs-metadata 128 128 highspeedpool
> $ ceph osd pool create cephfs-data-cache 256 256 highspeedpool
>
> $ ceph osd tier add cephfs-data cephfs-data-cache
> $ ceph osd tier cache-mode cephfs-data-cache writeback
> $ ceph osd tier set-overlay cephfs-data cephfs-data-cache
>
> $ ceph osd pool set cephfs-data-cache hit_set_type bloom
>
> ###
> All the cache tier configs set (hit_set_count, hit_set period, 
> target_max_bytes etc.)
> ###
>
> $ ceph-deploy mds create 
> $ ceph fs new cephfs_test cephfs-metadata cephfs-data
>
> $ ceph fs authorize cephfs_test client.testuser / rw
> $ ceph auth ls
> client.testuser
> key: XXX
> caps: [mds] allow rw
> caps: [mon] allow r
> caps: [osd] allow rw tag cephfs data=cephfs_test
>
> ### Confirm the pool setting
> $ ceph osd pool ls detail
> pool 1 'cephfs-data' replicated size 3 min_size 2 crush_rule 2 object_hash 
> rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 63 lfor 
> 53/53/53 flags hashpspool tiers 3 read_tier 3 write_tier 3 stripe_width 0 
> application cephfs
> pool 2 'cephfs-metadata' replicated size 3 min_size 2 crush_rule 1 
> object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode warn last_change 
> 63 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 
> recovery_priority 5 application cephfs
> pool 3 'cephfs-data-cache' replicated size 3 min_size 2 crush_rule 1 
> object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode warn last_change 
> 63 lfor 53/53/53 flags hashpspool,incomplete_clones tier_of 1 cache_mode 
> writeback target_bytes 800 hit_set bloom{false_positive_probability: 
> 0.05, target_size: 0, seed: 0} 120s x2 decay_rate 0 search_last_n 0 
> stripe_width 0
>
>  Set up the client side (kernel mount)
> $ sudo vim /etc/ceph/fsclient_secret
> $ sudo mkdir /mnt/cephfs
> $ sudo mount -t ceph :6789:/  /mnt/cephfs -o 
> name=testuser,secretfile=/etc/ceph/fsclient_secret // no errors at this 
> point
>
> $ sudo vim /mnt/cephfs/file1   // Writing attempt fails
>
> "file1" E514: write error (file system full?)
> WARNING: Original file may be lost or damaged
> don't quit the editor until the file is successfully written!
>
> $ ls -l /mnt/cephfs
> total 0
> -rw-r--r-- 1 root root 0 Jan 21 16:25 file1
>
> Any help will be appreciated.

Hi Mami,

Is there anything in dmesg?

What happens if you mount without involving testuser (i.e. using
client.admin and the admin key)?
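
For example, mirroring your testuser mount but with the admin identity (and
the client.admin key placed in the secret file):

  $ sudo mount -t ceph 10.33.70.222:6789:/ /mnt/cephfs -o \
      name=admin,secretfile=/etc/ceph/fsclient_secret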

Thanks,

Ilya


Re: [ceph-users] Weird mount issue (Ubuntu 18.04, Ceph 14.2.5 & 14.2.6)

2020-01-17 Thread Ilya Dryomov
On Fri, Jan 17, 2020 at 2:21 AM Aaron  wrote:
>
> No worries, can definitely do that.
>
> Cheers
> Aaron
>
> On Thu, Jan 16, 2020 at 8:08 PM Jeff Layton  wrote:
>>
>> On Thu, 2020-01-16 at 18:42 -0500, Jeff Layton wrote:
>> > On Wed, 2020-01-15 at 08:05 -0500, Aaron wrote:
>> > > Seeing a weird mount issue.  Some info:
>> > >
>> > > No LSB modules are available.
>> > > Distributor ID: Ubuntu
>> > > Description: Ubuntu 18.04.3 LTS
>> > > Release: 18.04
>> > > Codename: bionic
>> > >
>> > > Ubuntu 18.04.3 with kernel 4.15.0-74-generic
>> > > Ceph 14.2.5 & 14.2.6
>> > >
>> > > With ceph-common, ceph-base, etc installed:
>> > >
>> > > ceph/stable,now 14.2.6-1bionic amd64 [installed]
>> > > ceph-base/stable,now 14.2.6-1bionic amd64 [installed]
>> > > ceph-common/stable,now 14.2.6-1bionic amd64 [installed,automatic]
>> > > ceph-mds/stable,now 14.2.6-1bionic amd64 [installed]
>> > > ceph-mgr/stable,now 14.2.6-1bionic amd64 [installed,automatic]
>> > > ceph-mgr-dashboard/stable,stable,now 14.2.6-1bionic all [installed]
>> > > ceph-mon/stable,now 14.2.6-1bionic amd64 [installed]
>> > > ceph-osd/stable,now 14.2.6-1bionic amd64 [installed]
>> > > libcephfs2/stable,now 14.2.6-1bionic amd64 [installed,automatic]
>> > > python-ceph-argparse/stable,stable,now 14.2.6-1bionic all 
>> > > [installed,automatic]
>> > > python-cephfs/stable,now 14.2.6-1bionic amd64 [installed,automatic]
>> > >
>> > > I create a user via get-or-create cmd, and I have a users/secret now.
>> > > When I try to mount on these Ubuntu nodes,
>> > >
>> > > The mount cmd I run for testing is:
>> > > sudo mount -t ceph -o
>> > > name=user-20c5338c-34db-11ea-b27a-de7033e905f6,secret=AQC6dhpeyczkDxAAhRcr7oERUY4BcD2NCUkuNg==
>> > > 10.10.10.10:6789:/work/20c5332d-34db-11ea-b27a-de7033e905f6 /tmp/test
>> > >
>> > > I get the error:
>> > > couldn't finalize options: -34
>> > >
>> > > From some tracking down, it's part of the get_secret_option() in
>> > > common/secrets.c and the Linux System Error:
>> > >
>> > > #define ERANGE  34  /* Math result not representable */
>> > >
>> > > Now the weird part...when I remove all the above libs above, the mount
>> > > command works. I know that there are ceph.ko modules in the Ubuntu
>> > > filesystems DIR, and that Ubuntu comes with some understanding of how
>> > > to mount a cephfs system.  So, that explains how it can mount
>> > > cephfs...but, what I don't understand is why I'm getting that -34
>> > > error with the 14.2.5 and 14.2.6 libs installed. I didn't have this
>> > > issue with 14.2.3 or 14.2.4.
>> >
>> > This sounds like a regression in mount.ceph, probably due to something
>> > that went in for v14.2.5. I can reproduce the problem on Fedora, and I
>> > think it has something to do with the very long username you're using.
>> >
>> > I'll take a closer look and let you know. Stay tuned.
>> >
>>
>> I think I see the issue. The SECRET_OPTION_BUFSIZE is just too small for
>> your use case. We need to make that a little larger than the largest
>> name= parameter can be. Prior to v14.2.5, it was ~1000 bytes, but I made
>> it smaller in that set thinking that was too large. Mea culpa.
>>
>> The problem is determining how big that size can be. AFAICT EntityName
>> is basically a std::string in the ceph code, which can be an arbitrary
>> size (up to 4g or so).

It's just that you made SECRET_OPTION_BUFSIZE account precisely for
"secret=", but it can also be "key=".

I don't think there is much of a problem.  Defining it back to ~1000 is
guaranteed to work.  Or we could remove it and just compute the size of
secret_option exactly the same way as get_secret_option() does it:

  strlen(cmi->cmi_secret) + strlen(cmi->cmi_name) + 7 + 1

Thanks,

Ilya


Re: [ceph-users] RBD EC images for a ZFS pool

2020-01-09 Thread Ilya Dryomov
On Thu, Jan 9, 2020 at 2:52 PM Kyriazis, George
 wrote:
>
> Hello ceph-users!
>
> My setup is that I’d like to use RBD images as a replication target of a 
> FreeNAS zfs pool.  I have a 2nd FreeNAS (in a VM) to act as a backup target 
> in which I mount the RBD image.  All this (except the source FreeNAS server) 
> is in Proxmox.
>
> Since I am using RBD as a backup target, performance is not really critical, 
> but I still don’t want it to take months to complete the backup.  My source 
> pool size is in the order of ~30TB.
>
> I’ve set up an EC RBD pool (and the matching replicated pool) and created 
> image with no problems.  However, with the stock 4MB object size, backup 
> speed in quite slow.  I tried creating an image with 4K object size, but even 
> for a relatively small image size (of 1TB), I get:
>
> # rbd -p rbd_backup create vm-118-disk-0 --size 1T --object-size 4K 
> --data-pool rbd_ec
> 2020-01-09 07:40:27.120 7f3e4aa15f40 -1 librbd::image::CreateRequest: 
> validate_layout: image size not compatible with object map
> rbd: create error: (22) Invalid argument
> #

Yeah, this is an object map limitation.  Given that this is a backup
target, you don't really need the object map feature.  Disable it with
"rbd feature disable vm-118-disk-0 object-map" and you should be able
to create an image of any size.
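
Alternatively, the object map (and fast-diff) can simply be left out when
creating the image, e.g. something like (untested sketch):

  $ rbd -p rbd_backup create vm-118-disk-0 --size 1T --object-size 4K \
      --data-pool rbd_ec --image-feature layering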

That said, are you sure that object size is the issue?  If you expect
small sequential writes and want them to go to different OSDs, look at
using a fancy striping pattern instead of changing the object size:

  https://docs.ceph.com/docs/master/man/8/rbd/#striping

E.g. with --stripe-unit 4K --stripe-count 8, the first 4K will go to
object 1, the second 4K to object 2, etc.  The ninth 4K will return to
object 1, the tenth to object 2, etc.  When objects 1-8 become full, it
will move on to objects 9-16, then to 17-24, etc.

This way you get the increased parallelism without the very significant
overhead of tons of small objects (if your OSDs are capable enough).
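
Applied to your create command, that might look like (untested):

  $ rbd -p rbd_backup create vm-118-disk-0 --size 1T \
      --stripe-unit 4K --stripe-count 8 --data-pool rbd_ec

i.e. keep the default 4M object size and let the striping spread the small
writes instead.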

Thanks,

Ilya


Re: [ceph-users] rbd du command

2020-01-06 Thread Ilya Dryomov
On Mon, Jan 6, 2020 at 2:51 PM M Ranga Swami Reddy  wrote:
>
> Thank you.
> Can you please share a simple example here?
>
> Thanks
> Swami
>
> On Mon, Jan 6, 2020 at 4:02 PM  wrote:
>>
>> Hi,
>>
>> rbd are thin provisionned, you need to trim on the upper level, either
>> via the fstrim command, or the discard option (on Linux)
>>
>> Unless you trim, the rbd layer does not know that data has been removed
>> and are thus no longer needed
>>
>>
>>
>> On 1/6/20 10:30 AM, M Ranga Swami Reddy wrote:
>> > Hello,
>> > I ran the "rbd du /image" command. It shows the size increasing when I add
>> > data to the image. That looks good. But when I removed data from the image,
>> > it's not showing the size decreasing.
>> >
>> > Is this expected with "rbd du", or is it not implemented?
>> >
>> > NOTE: Expected behavior is the same as " Linux du command"
>> >
>> > Thanks
>> > Swami

Literally just "sudo fstrim ".  Another alternative is to
mount with "-o discard", but that can negatively affect performance.

I wrote up a detailed explanation of what is reported by "rbd du" in
another thread:

  https://www.mail-archive.com/ceph-users@lists.ceph.com/msg57186.html

Thanks,

Ilya


Re: [ceph-users] RBD Object-Map Usuage incorrect

2019-12-12 Thread Ilya Dryomov
On Thu, Dec 12, 2019 at 9:12 AM Ashley Merrick  wrote:
>
> Due to the recent 5.3.x kernel having support for Object-Map and other 
> features required in KRBD I have now enabled object-map,fast-diff on some RBD 
> images with CEPH (14.2.5), I have rebuilt the object map using "rbd 
> object-map rebuild"
>
> However, for some RBD images the Provisioned/Total Provisioned listed in 
> the Ceph MGR is the full RBD size and not the true size reflected in the VM 
> by df -h.  I have discard enabled and have run fstrim, but I know that, for 
> example, a 20TB RBD has never gone above the current 9TB shown in df -h, yet 
> Ceph MGR shows it as 20TB under Provisioned/Total Provisioned.
>
> Not sure if I am hitting a bug? Or if this is expected behavior?

Unless you know *exactly* what the filesystem is doing in your case and
see an inconsistency, this is expected.

If you are interested, here is an example:

$ rbd create --size 1G img
$ sudo rbd map img
/dev/rbd0
$ sudo mkfs.ext4 /dev/rbd0
$ sudo mount /dev/rbd0 /mnt
$ df -h /mnt
Filesystem  Size  Used Avail Use% Mounted on
/dev/rbd0   976M  2.6M  907M   1% /mnt
$ rbd du img
NAME  PROVISIONED  USED
img   1 GiB        60 MiB
$ ceph df | grep -B1 rbd
POOL ID STORED OBJECTS USED   %USED MAX AVAIL
rbd   1 33 MiB  20 33 MiB 0  1013 GiB

After I create a big file, almost the entire image is shown as used:

$ dd if=/dev/zero of=/mnt/file bs=1M count=900
$ df -h /mnt
Filesystem  Size  Used Avail Use% Mounted on
/dev/rbd0   976M  903M  6.2M 100% /mnt
$ rbd du img
NAME  PROVISIONED  USED
img   1 GiB        956 MiB
$ ceph df | grep -B1 rbd
POOL ID STORED  OBJECTS USED%USED MAX AVAIL
rbd   1 933 MiB 248 933 MiB  0.09  1012 GiB

Now if I carefully punch out most of that file, leaving one page in
each megabyte, and run fstrim:

$ for ((i = 0; i < 900; i++)); do fallocate -p -n -o $((i * 2**20)) -l
$((2**20 - 4096)) /mnt/file; done
$ sudo fstrim /mnt
$ df -h /mnt
Filesystem  Size  Used Avail Use% Mounted on
/dev/rbd0   976M  6.1M  903M   1% /mnt
$ rbd du img
NAME  PROVISIONED  USED
img   1 GiB        956 MiB
$ ceph df | grep -B1 rbd
POOL ID STORED OBJECTS USED   %USED MAX AVAIL
rbd   1 36 MiB 248 36 MiB 0  1013 GiB

You can see that df -h is back to ~6M, but "rbd du" USED remained
the same.  This is because "rbd du" is very coarse-grained, it works
at the object level and doesn't go any deeper.  If the number of
objects and their sizes remain the same, "rbd du" USED remains the
same.  It doesn't account for sparseness which I produced above.

"ceph df" goes down to the individual bluestore blobs, but only per
pool.  Looking at STORED, you can see that the space is back, even
though the number of objects remained the same.  Unfortunately, there
is no (fast) way to get the same information per image.

So what you see in the dashboard is basically "rbd du".  It is fast
to compute (especially when object map is enabled), but it shows you
the picture at the object level, not at the blob level.

Thanks,

Ilya


Re: [ceph-users] CephFS kernel module lockups in Ubuntu linux-image-5.0.0-32-generic?

2019-10-24 Thread Ilya Dryomov
On Thu, Oct 24, 2019 at 5:45 PM Paul Emmerich  wrote:
>
> Could it be related to the broken backport as described in
> https://tracker.ceph.com/issues/40102 ?
>
> (It did affect 4.19, not sure about 5.0)

It does, I have just updated the linked ticket to reflect that.

Thanks,

Ilya


Re: [ceph-users] krbd / kcephfs - jewel client features question

2019-10-21 Thread Ilya Dryomov
On Sat, Oct 19, 2019 at 2:00 PM Lei Liu  wrote:
>
> Hello llya,
>
> After updating the client kernel version to 3.10.0-862, ceph features shows:
>
> "client": {
> "group": {
> "features": "0x7010fb86aa42ada",
> "release": "jewel",
> "num": 5
> },
> "group": {
> "features": "0x7fddff8ee8cbffb",
> "release": "jewel",
> "num": 1
> },
> "group": {
> "features": "0x3ffddff8eea4fffb",
> "release": "luminous",
> "num": 6
> },
> "group": {
> "features": "0x3ffddff8eeacfffb",
> "release": "luminous",
> "num": 1
> }
> }
>
> both 0x7fddff8ee8cbffb and 0x7010fb86aa42ada are reported by new kernel 
> client.
>
> Is it now possible to force set-require-min-compat-client to be luminous, and if 
> not, how can I fix it?

No, you haven't upgraded the one with features 0x7fddff8ee8cbffb (or
rather it looks like you have upgraded it from 0x7fddff8ee84bffb, but
to a version that is still too old).

What exactly did you do on that machine?  That change doesn't look like
it came from a kernel upgrade.  What is the output of "uname -a" there?

Thanks,

Ilya


Re: [ceph-users] krbd / kcephfs - jewel client features question

2019-10-17 Thread Ilya Dryomov
On Thu, Oct 17, 2019 at 3:38 PM Lei Liu  wrote:
>
> Hi Cephers,
>
> We have some ceph clusters on version 12.2.x.  Now we want to use the upmap 
> balancer, but when I set set-require-min-compat-client to luminous, it fails:
>
> # ceph osd set-require-min-compat-client luminous
> Error EPERM: cannot set require_min_compat_client to luminous: 6 connected 
> client(s) look like jewel (missing 0xa20); 1 connected client(s) 
> look like jewel (missing 0x800); 1 connected client(s) look like 
> jewel (missing 0x820); add --yes-i-really-mean-it to do it anyway
>
> ceph features
>
> "client": {
> "group": {
> "features": "0x40106b84a842a52",
> "release": "jewel",
> "num": 6
> },
> "group": {
> "features": "0x7010fb86aa42ada",
> "release": "jewel",
> "num": 1
> },
> "group": {
> "features": "0x7fddff8ee84bffb",
> "release": "jewel",
> "num": 1
> },
> "group": {
> "features": "0x3ffddff8eea4fffb",
> "release": "luminous",
> "num": 7
> }
> }
>
> and sessions
>
> "MonSession(unknown.0 10.10.100.6:0/1603916368 is open allow *, features 
> 0x40106b84a842a52 (jewel))",
> "MonSession(unknown.0 10.10.100.2:0/2484488531 is open allow *, features 
> 0x40106b84a842a52 (jewel))",
> "MonSession(client.? 10.10.100.6:0/657483412 is open allow *, features 
> 0x7fddff8ee84bffb (jewel))",
> "MonSession(unknown.0 10.10.14.67:0/500706582 is open allow *, features 
> 0x7010fb86aa42ada (jewel))"
>
> can i use --yes-i-really-mean-it to force enable it ?

No.  0x40106b84a842a52 and 0x7fddff8ee84bffb are too old.
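
For reference, these values fail the same bit 21 check as the
detect_upmap.py snippet further down in this archive:

  $ echo 0x40106b84a842a52 | python /tmp/detect_upmap.py
  Upmap is NOT supported
  $ echo 0x7fddff8ee84bffb | python /tmp/detect_upmap.py
  Upmap is NOT supported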

Thanks,

Ilya


Re: [ceph-users] Panic in kernel CephFS client after kernel update

2019-10-05 Thread Ilya Dryomov
On Tue, Oct 1, 2019 at 9:12 PM Jeff Layton  wrote:
>
> On Tue, 2019-10-01 at 15:04 -0400, Sasha Levin wrote:
> > On Tue, Oct 01, 2019 at 01:54:45PM -0400, Jeff Layton wrote:
> > > On Tue, 2019-10-01 at 19:03 +0200, Ilya Dryomov wrote:
> > > > On Tue, Oct 1, 2019 at 6:41 PM Kenneth Van Alstyne
> > > >  wrote:
> > > > > All:
> > > > > I’m not sure this should go to LKML or here, but I’ll start here.  
> > > > > After upgrading from Linux kernel 4.19.60 to 4.19.75 (or 76), I 
> > > > > started running into kernel panics in the “ceph” module.  Based on 
> > > > > the call trace, I believe I was able to narrow it down to the 
> > > > > following commit in the Linux kernel 4.19 source tree:
> > > > >
> > > > > commit 81281039a673d30f9d04d38659030a28051a
> > > > > Author: Yan, Zheng 
> > > > > Date:   Sun Jun 2 09:45:38 2019 +0800
> > > > >
> > > > > ceph: use ceph_evict_inode to cleanup inode's resource
> > > > >
> > > > > [ Upstream commit 87bc5b895d94a0f40fe170d4cf5771c8e8f85d15 ]
> > > > >
> > > > > remove_session_caps() relies on __wait_on_freeing_inode(), to 
> > > > > wait for
> > > > > freeing inode to remove its caps. But VFS wakes freeing inode 
> > > > > waiters
> > > > > before calling destroy_inode().
> > > > >
> > > > > Cc: sta...@vger.kernel.org
> > > > > Link: https://tracker.ceph.com/issues/40102
> > > > > Signed-off-by: "Yan, Zheng" 
> > > > > Reviewed-by: Jeff Layton 
> > > > > Signed-off-by: Ilya Dryomov 
> > > > > Signed-off-by: Sasha Levin 
> > > > >
> > > > >
> > > > > Backing this patch out and recompiling my kernel has since resolved 
> > > > > my issues (as far as I can tell thus far).  The issue was fairly easy 
> > > > > to create by simply creating and deleting files.  I tested using ‘dd’ 
> > > > > and was pretty consistently able to reproduce the issue. Since the 
> > > > > issue occurred in a VM, I do have a screenshot of the crashed machine 
> > > > > and to avoid attaching an image, I’ll link to where they are:  
> > > > > http://kvanals.kvanals.org/.ceph_kernel_panic_images/
> > > > >
> > > > > Am I way off base or has anyone else run into this issue?
> > > >
> > > > Hi Kenneth,
> > > >
> > > > This might be a botched backport.  The first version of this patch had
> > > > a conflict with Al's change that introduced ceph_free_inode() and Zheng
> > > > had to adjust it for that.  However, it looks like it has been taken to
> > > > 4.19 verbatim, even though 4.19 does not have ceph_free_inode().
> > > >
> > > > Zheng, Jeff, please take a look ASAP.
> > > >
> > >
> > > (Sorry for the resend -- I got Sasha's old addr)
> > >
> > > Thanks Ilya,
> > >
> > > I think you're right -- this patch should not have been merged on any
> > > pre-5.2 kernels. We should go ahead and revert this for now, and do a
> > > one-off backport for v4.19.
> > >
> > > Sasha, what do we need to do to make that happen?
> >
> > I think the easiest would be to just revert the broken one and apply a
> > clean backport which you'll send me?
> >
>
> Thanks, Sasha. You can revert the old patch as soon as you're ready.
> It'll take me a bit to put together and test a proper backport, but
> I'll try to have something ready within the next day or so.

Kenneth, this is now fixed in 4.19.77.  Thanks for the report!

Ilya


Re: [ceph-users] Panic in kernel CephFS client after kernel update

2019-10-01 Thread Ilya Dryomov
On Tue, Oct 1, 2019 at 6:41 PM Kenneth Van Alstyne
 wrote:
>
> All:
> I’m not sure this should go to LKML or here, but I’ll start here.  After 
> upgrading from Linux kernel 4.19.60 to 4.19.75 (or 76), I started running 
> into kernel panics in the “ceph” module.  Based on the call trace, I believe 
> I was able to narrow it down to the following commit in the Linux kernel 4.19 
> source tree:
>
> commit 81281039a673d30f9d04d38659030a28051a
> Author: Yan, Zheng 
> Date:   Sun Jun 2 09:45:38 2019 +0800
>
> ceph: use ceph_evict_inode to cleanup inode's resource
>
> [ Upstream commit 87bc5b895d94a0f40fe170d4cf5771c8e8f85d15 ]
>
> remove_session_caps() relies on __wait_on_freeing_inode(), to wait for
> freeing inode to remove its caps. But VFS wakes freeing inode waiters
> before calling destroy_inode().
>
> Cc: sta...@vger.kernel.org
> Link: https://tracker.ceph.com/issues/40102
> Signed-off-by: "Yan, Zheng" 
> Reviewed-by: Jeff Layton 
> Signed-off-by: Ilya Dryomov 
> Signed-off-by: Sasha Levin 
>
>
> Backing this patch out and recompiling my kernel has since resolved my issues 
> (as far as I can tell thus far).  The issue was fairly easy to create by 
> simply creating and deleting files.  I tested using ‘dd’ and was pretty 
> consistently able to reproduce the issue. Since the issue occurred in a VM, I 
> do have a screenshot of the crashed machine and to avoid attaching an image, 
> I’ll link to where they are:  
> http://kvanals.kvanals.org/.ceph_kernel_panic_images/
>
> Am I way off base or has anyone else run into this issue?

Hi Kenneth,

This might be a botched backport.  The first version of this patch had
a conflict with Al's change that introduced ceph_free_inode() and Zheng
had to adjust it for that.  However, it looks like it has been taken to
4.19 verbatim, even though 4.19 does not have ceph_free_inode().

Zheng, Jeff, please take a look ASAP.

Thanks,

Ilya


Re: [ceph-users] CephFS meltdown fallout: mds assert failure, kernel oopses

2019-08-14 Thread Ilya Dryomov
On Tue, Aug 13, 2019 at 1:06 PM Hector Martin  wrote:
>
> I just had a minor CephFS meltdown caused by underprovisioned RAM on the
> MDS servers. This is a CephFS with two ranks; I manually failed over the
> first rank and the new MDS server ran out of RAM in the rejoin phase
> (ceph-mds didn't get OOM-killed, but I think things slowed down enough
> due to swapping out that something timed out). This happened 4 times,
> with the rank bouncing between two MDS servers, until I brought up an
> MDS on a bigger machine.
>
> The new MDS managed to become active, but then crashed with an assert:
>
> 2019-08-13 16:03:37.346 7fd4578b2700  1 mds.0.1164 clientreplay_done
> 2019-08-13 16:03:37.502 7fd45e2a7700  1 mds.mon02 Updating MDS map to
> version 1239 from mon.1
> 2019-08-13 16:03:37.502 7fd45e2a7700  1 mds.0.1164 handle_mds_map i am
> now mds.0.1164
> 2019-08-13 16:03:37.502 7fd45e2a7700  1 mds.0.1164 handle_mds_map state
> change up:clientreplay --> up:active
> 2019-08-13 16:03:37.502 7fd45e2a7700  1 mds.0.1164 active_start
> 2019-08-13 16:03:37.690 7fd45e2a7700  1 mds.0.1164 cluster recovered.
> 2019-08-13 16:03:45.130 7fd45e2a7700  1 mds.mon02 Updating MDS map to
> version 1240 from mon.1
> 2019-08-13 16:03:46.162 7fd45e2a7700  1 mds.mon02 Updating MDS map to
> version 1241 from mon.1
> 2019-08-13 16:03:50.286 7fd4578b2700 -1
> /build/ceph-13.2.6/src/mds/MDCache.cc: In function 'void
> MDCache::remove_inode(CInode*)' thread 7fd4578b2700 time 2019-08-13
> 16:03:50.279463
> /build/ceph-13.2.6/src/mds/MDCache.cc: 361: FAILED
> assert(o->get_num_ref() == 0)
>
>   ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic
> (stable)
>   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x14e) [0x7fd46650eb5e]
>   2: (()+0x2c4cb7) [0x7fd46650ecb7]
>   3: (MDCache::remove_inode(CInode*)+0x59d) [0x55f423d6992d]
>   4: (StrayManager::_purge_stray_logged(CDentry*, unsigned long,
> LogSegment*)+0x1f2) [0x55f423dc7192]
>   5: (MDSIOContextBase::complete(int)+0x11d) [0x55f423ed42bd]
>   6: (MDSLogContextBase::complete(int)+0x40) [0x55f423ed4430]
>   7: (Finisher::finisher_thread_entry()+0x135) [0x7fd46650d0a5]
>   8: (()+0x76db) [0x7fd465dc26db]
>   9: (clone()+0x3f) [0x7fd464fa888f]
>
> Thankfully this didn't happen on a subsequent attempt, and I got the
> filesystem happy again.
>
> At this point, of the 4 kernel clients actively using the filesystem, 3
> had gone into a strange state (can't SSH in, partial service). Here is a
> kernel log from one of the hosts (the other two were similar):
> https://mrcn.st/p/ezrhr1qR
>
> After playing some service failover games and hard rebooting the three
> affected client boxes everything seems to be fine. The remaining FS
> client box had no kernel errors (other than blocked task warnings and
> cephfs talking about reconnections and such) and seems to be fine.
>
> I can't find these errors anywhere, so I'm guessing they're not known bugs?

Jeff, the oops seems to be a NULL dereference in ceph_lock_message().
Please take a look.

Thanks,

Ilya


Re: [ceph-users] Canonical Livepatch broke CephFS client

2019-08-14 Thread Ilya Dryomov
On Wed, Aug 14, 2019 at 1:54 PM Tim Bishop  wrote:
>
> On Wed, Aug 14, 2019 at 12:44:15PM +0200, Ilya Dryomov wrote:
> > On Tue, Aug 13, 2019 at 10:56 PM Tim Bishop  wrote:
> > > This email is mostly a heads up for others who might be using
> > > Canonical's livepatch on Ubuntu on a CephFS client.
> > >
> > > I have an Ubuntu 18.04 client with the standard kernel currently at
> > > version linux-image-4.15.0-54-generic 4.15.0-54.58. CephFS is mounted
> > > with the kernel client. Cluster is running mimic 13.2.6. I've got
> > > livepatch running and this evening it did an update:
> > >
> > > Aug 13 17:33:55 myclient canonical-livepatch[2396]: Client.Check
> > > Aug 13 17:33:55 myclient canonical-livepatch[2396]: Checking with 
> > > livepatch service.
> > > Aug 13 17:33:55 myclient canonical-livepatch[2396]: updating last-check
> > > Aug 13 17:33:55 myclient canonical-livepatch[2396]: touched last check
> > > Aug 13 17:33:56 myclient canonical-livepatch[2396]: Applying update 54.1 
> > > for 4.15.0-54.58-generic
> > > Aug 13 17:33:56 myclient kernel: [3700923.970750] PKCS#7 signature not 
> > > signed with a trusted key
> > > Aug 13 17:33:59 myclient kernel: [3700927.069945] livepatch: enabling 
> > > patch 'lkp_Ubuntu_4_15_0_54_58_generic_54'
> > > Aug 13 17:33:59 myclient kernel: [3700927.154956] livepatch: 
> > > 'lkp_Ubuntu_4_15_0_54_58_generic_54': starting patching transition
> > > Aug 13 17:34:01 myclient kernel: [3700928.994487] livepatch: 
> > > 'lkp_Ubuntu_4_15_0_54_58_generic_54': patching complete
> > > Aug 13 17:34:09 myclient canonical-livepatch[2396]: Applied patch version 
> > > 54.1 to 4.15.0-54.58-generic
> > >
> > > And then immediately I saw:
> > >
> > > Aug 13 17:34:18 myclient kernel: [3700945.728684] libceph: mds0 
> > > 1.2.3.4:6800 socket closed (con state OPEN)
> > > Aug 13 17:34:18 myclient kernel: [3700946.040138] libceph: mds0 
> > > 1.2.3.4:6800 socket closed (con state OPEN)
> > > Aug 13 17:34:19 myclient kernel: [3700947.105692] libceph: mds0 
> > > 1.2.3.4:6800 socket closed (con state OPEN)
> > > Aug 13 17:34:20 myclient kernel: [3700948.033704] libceph: mds0 
> > > 1.2.3.4:6800 socket closed (con state OPEN)
> > >
> > > And on the MDS:
> > >
> > > 2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367 Message 
> > > signature does not match contents.
> > > 2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367Signature on 
> > > message:
> > > 2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367sig: 
> > > 10517606059379971075
> > > 2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367Locally 
> > > calculated signature:
> > > 2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367 
> > > sig_check:4899837294009305543
> > > 2019-08-13 17:34:18.286 7ff165e75700  0 Signature failed.
> > > 2019-08-13 17:34:18.286 7ff165e75700  0 -- 1.2.3.4:6800/512468759 >> 
> > > 4.3.2.1:0/928333509 conn(0xe6b9500 :6800 >> 
> > > s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=2 cs=1 l=0).process >> 
> > > Signature check failed
> > >
> > > Thankfully I was able to umount -f to unfreeze the client, but I have
> > > been unsuccessful remounting the file system using the kernel client.
> > > The fuse client worked fine as a workaround, but is slower.
> > >
> > > Taking a look at livepatch 54.1 I can see it touches Ceph code in the
> > > kernel:
> > >
> > > https://git.launchpad.net/~ubuntu-livepatch/+git/bionic-livepatches/commit/?id=3a3081c1e4c8e2e0f9f7a1ae4204eba5f38fbd29
> > >
> > > But the relevance of those changes isn't immediately clear to me. I
> > > expect after a reboot it'll be fine, but as yet untested.
> >
> > These changes are very relevant.  They introduce support for CEPHX_V2
> > protocol, where message signatures are computed slightly differently:
> > same algorithm but a different set of inputs.  The live-patched kernel
> > likely started signing using CEPHX_V2 without renegotiating.
>
> Ah - thanks for looking. Looks like something that wasn't a security
> issue so shouldn't have been included in the live patch.

Well, strictly speaking it is a security issue because the protocol was
rev'ed in response to two CVEs:

  https://nvd.nist.gov/vuln/detail/CVE-2018-1128
  https://nvd.nist.gov/vuln/detail/CVE-2018-1129

That said, it definitely doesn't qualify for live-patching, especially
when the resulting kernel image is not thoroughly tested.

>
> > This is a good example of how live-patching can go wrong.  A reboot
> > should definitely help.
>
> Yup, it certainly has its tradeoffs (not having to reboot so regularly
> is certainly a positive, though). I've replicated on a test machine and
> confirmed that a reboot does indeed fix the problem.

Thanks,

Ilya


Re: [ceph-users] Canonical Livepatch broke CephFS client

2019-08-14 Thread Ilya Dryomov
On Tue, Aug 13, 2019 at 10:56 PM Tim Bishop  wrote:
>
> Hi,
>
> This email is mostly a heads up for others who might be using
> Canonical's livepatch on Ubuntu on a CephFS client.
>
> I have an Ubuntu 18.04 client with the standard kernel currently at
> version linux-image-4.15.0-54-generic 4.15.0-54.58. CephFS is mounted
> with the kernel client. Cluster is running mimic 13.2.6. I've got
> livepatch running and this evening it did an update:
>
> Aug 13 17:33:55 myclient canonical-livepatch[2396]: Client.Check
> Aug 13 17:33:55 myclient canonical-livepatch[2396]: Checking with livepatch 
> service.
> Aug 13 17:33:55 myclient canonical-livepatch[2396]: updating last-check
> Aug 13 17:33:55 myclient canonical-livepatch[2396]: touched last check
> Aug 13 17:33:56 myclient canonical-livepatch[2396]: Applying update 54.1 for 
> 4.15.0-54.58-generic
> Aug 13 17:33:56 myclient kernel: [3700923.970750] PKCS#7 signature not signed 
> with a trusted key
> Aug 13 17:33:59 myclient kernel: [3700927.069945] livepatch: enabling patch 
> 'lkp_Ubuntu_4_15_0_54_58_generic_54'
> Aug 13 17:33:59 myclient kernel: [3700927.154956] livepatch: 
> 'lkp_Ubuntu_4_15_0_54_58_generic_54': starting patching transition
> Aug 13 17:34:01 myclient kernel: [3700928.994487] livepatch: 
> 'lkp_Ubuntu_4_15_0_54_58_generic_54': patching complete
> Aug 13 17:34:09 myclient canonical-livepatch[2396]: Applied patch version 
> 54.1 to 4.15.0-54.58-generic
>
> And then immediately I saw:
>
> Aug 13 17:34:18 myclient kernel: [3700945.728684] libceph: mds0 1.2.3.4:6800 
> socket closed (con state OPEN)
> Aug 13 17:34:18 myclient kernel: [3700946.040138] libceph: mds0 1.2.3.4:6800 
> socket closed (con state OPEN)
> Aug 13 17:34:19 myclient kernel: [3700947.105692] libceph: mds0 1.2.3.4:6800 
> socket closed (con state OPEN)
> Aug 13 17:34:20 myclient kernel: [3700948.033704] libceph: mds0 1.2.3.4:6800 
> socket closed (con state OPEN)
>
> And on the MDS:
>
> 2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367 Message signature 
> does not match contents.
> 2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367Signature on message:
> 2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367sig: 
> 10517606059379971075
> 2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367Locally calculated 
> signature:
> 2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367 
> sig_check:4899837294009305543
> 2019-08-13 17:34:18.286 7ff165e75700  0 Signature failed.
> 2019-08-13 17:34:18.286 7ff165e75700  0 -- 1.2.3.4:6800/512468759 >> 
> 4.3.2.1:0/928333509 conn(0xe6b9500 :6800 >> 
> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=2 cs=1 l=0).process >> 
> Signature check failed
>
> Thankfully I was able to umount -f to unfreeze the client, but I have
> been unsuccessful remounting the file system using the kernel client.
> The fuse client worked fine as a workaround, but is slower.
>
> Taking a look at livepatch 54.1 I can see it touches Ceph code in the
> kernel:
>
> https://git.launchpad.net/~ubuntu-livepatch/+git/bionic-livepatches/commit/?id=3a3081c1e4c8e2e0f9f7a1ae4204eba5f38fbd29
>
> But the relevance of those changes isn't immediately clear to me. I
> expect after a reboot it'll be fine, but as yet untested.

Hi Tim,

These changes are very relevant.  They introduce support for CEPHX_V2
protocol, where message signatures are computed slightly differently:
same algorithm but a different set of inputs.  The live-patched kernel
likely started signing using CEPHX_V2 without renegotiating.

This is a good example of how live-patching can go wrong.  A reboot
should definitely help.
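
If you want to check what the cluster side is configured to require, the
cephx-related options can be dumped from a mon admin socket, e.g.
(substitute your mon id for "a"):

  $ sudo ceph daemon mon.a config show | grep cephx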

Thanks,

Ilya


Re: [ceph-users] Time of response of "rbd ls" command

2019-08-13 Thread Ilya Dryomov
On Tue, Aug 13, 2019 at 6:37 PM Gesiel Galvão Bernardes
 wrote:
>
> HI,
>
> I recently noticed that in two of my pools the command "rbd ls" has taken 
> several minutes to return the values. These pools have between 100 and 120 
> images each.
>
> Where should I look to check why this slowness? The cluster is apparently 
> fine, without any warning.
>
> Thank you very much in advance.

Hi Gesiel,

Try

$ rbd ls --debug-ms 1

and look at the timestamps.  If the latency is coming from RADOS, it
would probably be between "... osd_op(..." and "... osd_op_reply(...".
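
To make the latency easier to spot, you can filter the debug output down to
the request/reply pairs, e.g.:

  $ rbd ls --debug-ms 1 2>&1 | grep -E 'osd_op(_reply)?\('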

Thanks,

Ilya


Re: [ceph-users] Cephfs cannot mount with kernel client

2019-08-13 Thread Ilya Dryomov
On Tue, Aug 13, 2019 at 4:30 PM Serkan Çoban  wrote:
>
> I am out of office right now, but I am pretty sure it was the same
> stack trace as in tracker.
> I will confirm tomorrow.
> Any workarounds?

Compaction

# echo 1 >/proc/sys/vm/compact_memory

might help if the memory in question is moveable.  If not, reboot and
mount on a freshly booted node.

I have raised the priority on the ticket.

Thanks,

Ilya


Re: [ceph-users] Cephfs cannot mount with kernel client

2019-08-13 Thread Ilya Dryomov
On Tue, Aug 13, 2019 at 3:57 PM Serkan Çoban  wrote:
>
> I checked /var/log/messages and see there are page allocation
> failures, but I don't understand why.
> The client has 768GB of memory and most of it is not used; the cluster has
> 1500 OSDs. Do I need to increase vm.min_free_kbytes? It is set to 1GB
> now.
> Also, hugepages are disabled on the clients.

https://tracker.ceph.com/issues/40481

I can confirm if you pastebin page allocation splats.

Thanks,

Ilya


Re: [ceph-users] Cephfs cannot mount with kernel client

2019-08-13 Thread Ilya Dryomov
On Tue, Aug 13, 2019 at 12:36 PM Serkan Çoban  wrote:
>
> Hi,
>
> Just installed nautilus 14.2.2 and setup cephfs on it. OS is all centos 7.6.
> From a client I can mount the cephfs with ceph-fuse, but I cannot
> mount with ceph kernel client.
> It gives "mount error 110 connection timeout" and I can see "libceph:
> corrupt full osdmap (-12) epoch 2759 off 656" in /var/log/messages.
> This client is not on same subnet with ceph servers.
>
> However on a client with the same subnet with the servers I can
> successfully mount both with ceph-fuse and kernel client.
>
> Do I need to configure anything for clients that are in a different subnet?
> Is this a kernel issue?

Hi Serkan,

It is failing to allocate memory, so the subnet is probably not the
issue.  Is there anything else pointing to memory shortage -- "page
allocation failure" splats, etc?

How much memory is available for use on that node?  How many OSDs do
you have in your cluster?

Thanks,

Ilya


Re: [ceph-users] Problems understanding 'ceph-features' output

2019-08-05 Thread Ilya Dryomov
On Tue, Jul 30, 2019 at 10:33 AM Massimo Sgaravatto
 wrote:
>
> The documentation that I have seen says that the minimum requirements for 
> clients to use upmap are:
>
> - CentOs 7.5 or kernel 4.5
> - Luminous version

Do you have a link for that?

This is wrong: CentOS 7.5 (i.e. RHEL 7.5 kernel) is right, but for
upstream kernels it is 4.13 (unless someone did a large backport that
I'm not aware of).

>
> But in general ceph admins could not have access to all clients to check 
> these versions.
>
> In general: is there a table somewhere reporting the minimum "feature" 
> version supported by upmap ?
>
> E.g. right now I am interested about 0x1ffddff8eea4fffb. Is this also good 
> enough for upmap ?

Yeah, this is annoying.  The missing feature bit has been merged into
5.3, so starting with 5.3 the kernel client will finally report itself
as luminous.

In the meantime, use this:

$ cat /tmp/detect_upmap.py
if int(input()) & (1 << 21):
    print("Upmap is supported")
else:
    print("Upmap is NOT supported")

$ echo 0x1ffddff8eea4fffb | python /tmp/detect_upmap.py
Upmap is supported

Thanks,

Ilya


Re: [ceph-users] "session established", "io error", "session lost, hunting for new mon" solution/fix

2019-07-16 Thread Ilya Dryomov
On Fri, Jul 12, 2019 at 5:38 PM Marc Roos  wrote:
>
>
> Thanks Ilya for explaining. Am I correct to understand from the link [0]
> mentioned in the issue that, because e.g. I have had an unhealthy state for
> some time (1 pg on an insignificant pool), I have larger osdmaps,
> triggering this issue? Or is it just random bad luck? (Just a bit curious
> why I have this issue.)
>
> [0]
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg51522.html

I'm not sure.  I wouldn't expect one unhealthy PG to trigger a large
osdmap message.  Only verbose logs can tell.

Thanks,

Ilya


Re: [ceph-users] "session established", "io error", "session lost, hunting for new mon" solution/fix

2019-07-12 Thread Ilya Dryomov
On Fri, Jul 12, 2019 at 12:33 PM Paul Emmerich  wrote:
>
>
>
> On Thu, Jul 11, 2019 at 11:36 PM Marc Roos  wrote:
>> Anyone know why I would get these? Is it not strange to get them in a
>> 'standard' setup?
>
> you are probably running on an ancient kernel. this bug has been fixed a long 
> time ago.

This is not a kernel bug:

http://tracker.ceph.com/issues/38040

It is possible to hit with few OSDs too.  The actual problem is the
size of the osdmap message which can contain multiple full osdmaps, not
the number of OSDs.  The size of a full osdmap is proportional to the
number of OSDs but it's not the only way to get a big osdmap message.

As you have experienced, these settings used to be expressed in the
number of osdmaps and our defaults were too high for a stream of full
osdmaps (as opposed to incrementals).  It is now expressed in bytes,
the patch should be in 12.2.13.
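
If I recall the option name from that patch correctly, on a release that has
it the knob becomes osd_map_message_max_bytes, e.g. (the value here is just
an illustration):

  $ ceph tell mon.* injectargs '--osd_map_message_max_bytes=10485760'
  $ ceph tell osd.* injectargs '--osd_map_message_max_bytes=10485760'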

>
> Paul
>
>> -Original Message-
>> Subject: [ceph-users] "session established", "io error", "session lost,
>> hunting for new mon" solution/fix
>>
>>
>> I have this on a cephfs client again (luminous cluster, centos7, only 32
>> osds!). Wanted to share the 'fix'
>>
>> [Thu Jul 11 12:16:09 2019] libceph: mon0 192.168.10.111:6789 session
>> established
>> [Thu Jul 11 12:16:09 2019] libceph: mon0 192.168.10.111:6789 io error
>> [Thu Jul 11 12:16:09 2019] libceph: mon0 192.168.10.111:6789 session
>> lost, hunting for new mon
>> [Thu Jul 11 12:16:09 2019] libceph: mon2 192.168.10.113:6789 session
>> established
>> [Thu Jul 11 12:16:09 2019] libceph: mon2 192.168.10.113:6789 io error
>> [Thu Jul 11 12:16:09 2019] libceph: mon2 192.168.10.113:6789 session
>> lost, hunting for new mon
>> [Thu Jul 11 12:16:09 2019] libceph: mon0 192.168.10.111:6789 session
>> established
>> [Thu Jul 11 12:16:09 2019] libceph: mon0 192.168.10.111:6789 io error
>> [Thu Jul 11 12:16:09 2019] libceph: mon0 192.168.10.111:6789 session
>> lost, hunting for new mon
>> [Thu Jul 11 12:16:09 2019] libceph: mon1 192.168.10.112:6789 session
>> established
>> [Thu Jul 11 12:16:09 2019] libceph: mon1 192.168.10.112:6789 io error
>> [Thu Jul 11 12:16:09 2019] libceph: mon1 192.168.10.112:6789 session
>> lost, hunting for new mon
>>
>> 1) I blocked client access to the monitors with
>> iptables -I INPUT -p tcp -s 192.168.10.43 --dport 6789 -j REJECT
>> Resulting in
>>
>> [Thu Jul 11 12:34:16 2019] libceph: mon1 192.168.10.112:6789 socket
>> closed (con state CONNECTING)
>> [Thu Jul 11 12:34:18 2019] libceph: mon1 192.168.10.112:6789 socket
>> closed (con state CONNECTING)
>> [Thu Jul 11 12:34:22 2019] libceph: mon1 192.168.10.112:6789 socket
>> closed (con state CONNECTING)
>> [Thu Jul 11 12:34:26 2019] libceph: mon2 192.168.10.113:6789 socket
>> closed (con state CONNECTING)
>> [Thu Jul 11 12:34:27 2019] libceph: mon2 192.168.10.113:6789 socket
>> closed (con state CONNECTING)
>> [Thu Jul 11 12:34:28 2019] libceph: mon2 192.168.10.113:6789 socket
>> closed (con state CONNECTING)
>> [Thu Jul 11 12:34:30 2019] libceph: mon1 192.168.10.112:6789 socket
>> closed (con state CONNECTING)
>> [Thu Jul 11 12:34:30 2019] libceph: mon2 192.168.10.113:6789 socket
>> closed (con state CONNECTING)
>> [Thu Jul 11 12:34:34 2019] libceph: mon2 192.168.10.113:6789 socket
>> closed (con state CONNECTING)
>> [Thu Jul 11 12:34:42 2019] libceph: mon2 192.168.10.113:6789 socket
>> closed (con state CONNECTING)
>> [Thu Jul 11 12:34:44 2019] libceph: mon0 192.168.10.111:6789 socket
>> closed (con state CONNECTING)
>> [Thu Jul 11 12:34:45 2019] libceph: mon0 192.168.10.111:6789 socket
>> closed (con state CONNECTING)
>> [Thu Jul 11 12:34:46 2019] libceph: mon0 192.168.10.111:6789 socket
>> closed (con state CONNECTING)
>>
>> 2) I applied the suggested changes to the osd map message max, mentioned
>>
>> in early threads[0]
>> ceph tell osd.* injectargs '--osd_map_message_max=10'
>> ceph tell mon.* injectargs '--osd_map_message_max=10'
>> [@c01 ~]# ceph daemon osd.0 config show|grep message_max
>> "osd_map_message_max": "10",
>> [@c01 ~]# ceph daemon mon.a config show|grep message_max
>> "osd_map_message_max": "10",
>>
>> [0]
>> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg54419.html
>> http://tracker.ceph.com/issues/38040
>>
>> 3) Allow access to a monitor with
>> iptables -D INPUT -p tcp -s 192.168.10.43 --dport 6789 -j REJECT
>>
>> Getting
>> [Thu Jul 11 12:39:26 2019] libceph: mon0 192.168.10.111:6789 session
>> established
>> [Thu Jul 11 12:39:26 2019] libceph: osd0 down
>> [Thu Jul 11 12:39:26 2019] libceph: osd0 up
>>
>> Problems solved, in D state hung unmount was released.
>>
>> I am not sure if the prolonged disconnection to the monitors was the
>> solution or the osd_map_message_max=10, or both.

Thanks,

Ilya


Re: [ceph-users] krbd namespace missing in /dev

2019-06-10 Thread Ilya Dryomov
On Mon, Jun 10, 2019 at 8:03 PM Jason Dillaman  wrote:
>
> On Mon, Jun 10, 2019 at 1:50 PM Jonas Jelten  wrote:
> >
> > When I run:
> >
> >   rbd map --name client.lol poolname/somenamespace/imagename
> >
> > The image is mapped to /dev/rbd0 and
> >
> >   /dev/rbd/poolname/imagename
> >
> > I would expect the rbd to be mapped to (the rbdmap tool tries this name):
> >
> >   /dev/rbd/poolname/somenamespace/imagename
> >
> > The current map point would not allow same-named images in different 
> > namespaces, and the automatic mount of rbdmap fails
> > because of this.
> >
> >
> > Are there plans to fix this?
>
> I opened a tracker ticket for this issue [1].
>
> [1] http://tracker.ceph.com/issues/40247

If we are going to touch it, we might want to include cluster fsid as
well.  There is an old ticket on this:

http://tracker.ceph.com/issues/16811

Thanks,

Ilya


Re: [ceph-users] How to fix this? session lost, hunting for new mon, session established, io error

2019-05-24 Thread Ilya Dryomov
On Tue, May 21, 2019 at 11:41 AM Marc Roos  wrote:
>
>
>
> I have this on a cephfs client. I had ceph-common on 12.2.11, and
> upgraded to 12.2.12 while having this error. They write here [0] that
> you need to upgrade the kernel and that it is fixed in 12.2.2.
>
> [@~]# uname -a
> Linux mail03 3.10.0-957.5.1.el7.x86_6
>
> [Tue May 21 11:23:26 2019] libceph: mon2 192.168.10.113:6789 session
> established
> [Tue May 21 11:23:26 2019] libceph: mon2 192.168.10.113:6789 io error
> [Tue May 21 11:23:26 2019] libceph: mon2 192.168.10.113:6789 session
> lost, hunting for new mon
> [Tue May 21 11:23:26 2019] libceph: mon0 192.168.10.111:6789 session
> established
> [Tue May 21 11:23:26 2019] libceph: mon0 192.168.10.111:6789 io error
> [Tue May 21 11:23:26 2019] libceph: mon0 192.168.10.111:6789 session
> lost, hunting for new mon
> [Tue May 21 11:23:26 2019] libceph: mon1 192.168.10.112:6789 session
> established
> [Tue May 21 11:23:26 2019] libceph: mon1 192.168.10.112:6789
> [Tue May 21 11:23:26 2019] libceph: mon1 192.168.10.112:6789 session
> lost, hunting for new mon
> [Tue May 21 11:23:26 2019] libceph: mon2 192.168.10.113:6789 session
> established
>
>
>
> ceph version
> ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous
> (stable)
>
> [0]
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg52177.html
> https://tracker.ceph.com/issues/23537

Hi Marc,

The issue you linked is definitely not related -- no "io error" there.

This looks like http://tracker.ceph.com/issues/38040.  This is a server
side issue, so no point in upgrading the kernel.  It's still present in
luminous, but there is an easy workaround -- try decreasing "osd map
message max" as described in the thread linked from the description.

Thanks,

Ilya


Re: [ceph-users] rbd unmap fails with error: rbd: sysfs write failed rbd: unmap failed: (16) Device or resource busy

2019-02-27 Thread Ilya Dryomov
On Wed, Feb 27, 2019 at 12:00 PM Thomas <74cmo...@gmail.com> wrote:
>
> Hi,
> I have noticed an error when writing to a mapped RBD.
> Therefore I unmounted the block device.
> Then I tried to unmap it w/o success:
> ld2110:~ # rbd unmap /dev/rbd0
> rbd: sysfs write failed
> rbd: unmap failed: (16) Device or resource busy
>
> The same block device is mapped on another client and there are no issues:
> root@ld4257:~# rbd info hdb-backup/ld2110
> rbd image 'ld2110':
> size 7.81TiB in 2048000 objects
> order 22 (4MiB objects)
> block_name_prefix: rbd_data.3cda0d6b8b4567
> format: 2
> features: layering
> flags:
> create_timestamp: Fri Feb 15 10:53:50 2019
> root@ld4257:~# rados -p hdb-backup  listwatchers rbd_data.3cda0d6b8b4567
> error listing watchers hdb-backup/rbd_data.3cda0d6b8b4567: (2) No such
> file or directory
> root@ld4257:~# rados -p hdb-backup  listwatchers rbd_header.3cda0d6b8b4567
> watcher=10.76.177.185:0/1144812735 client.21865052 cookie=1
> watcher=10.97.206.97:0/4023931980 client.18484780
> cookie=18446462598732841027
>
>
> Question:
> How can I force to unmap the RBD on client ld2110 (= 10.76.177.185)?

Hi Thomas,

It appears that /dev/rbd0 is still open on that node.

Was the unmount successful?  Which filesystem (ext4, xfs, etc)?

What is the output of "ps aux | grep rbd" on that node?

Try lsof, fuser, check for LVM volumes and multipath -- these have been
reported to cause this issue previously:

  http://tracker.ceph.com/issues/12763
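
Concretely, something along these lines (device name taken from your
output; the forced unmap is a last resort):

  $ sudo lsof /dev/rbd0
  $ sudo fuser -vm /dev/rbd0
  $ sudo dmsetup ls --tree       # anything (LVM, etc.) stacked on top of rbd0?
  $ sudo multipath -ll
  $ sudo rbd unmap -o force /dev/rbd0   # last resort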

Thanks,

Ilya


Re: [ceph-users] Ceph Nautilus Release T-shirt Design

2019-02-15 Thread Ilya Dryomov
On Fri, Feb 15, 2019 at 12:05 AM Mike Perez  wrote:
>
> Hi Marc,
>
> You can see previous designs on the Ceph store:
>
> https://www.proforma.com/sdscommunitystore

Hi Mike,

This site stopped working during DevConf and hasn't been working since.
I think Greg has contacted some folks about this, but it would be great
if you could follow up because it's been a couple of weeks now...

Thanks,

Ilya


Re: [ceph-users] krbd and image striping

2019-02-06 Thread Ilya Dryomov
On Wed, Feb 6, 2019 at 11:09 AM James Dingwall
 wrote:
>
> Hi,
>
> I have been doing some testing with striped rbd images and have a
> question about the calculation of the optimal_io_size and
> minimum_io_size parameters.  My test image was created using a 4M object
> size, stripe unit 64k and stripe count 16.
>
> In the kernel rbd_init_disk() code:
>
> unsigned int objset_bytes =
>  rbd_dev->layout.object_size * rbd_dev->layout.stripe_count;
>
>  blk_queue_io_min(q, objset_bytes);
>  blk_queue_io_opt(q, objset_bytes);
>
> Which resulted in 64M minimal / optimal io sizes.  If I understand the
> meaning correctly then even for a small write there is going to be at
> least 64M data written?

No, these are just hints.  The exported values are pretty stupid even
in the default case and more so in the custom striping case and should
be changed.  It's certainly not the case that any write is going to be
turned into io_min or io_opt sized write.
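
You can see the values that were actually exported for a mapped image via
sysfs, e.g.:

  $ cat /sys/block/rbd0/queue/minimum_io_size
  $ cat /sys/block/rbd0/queue/optimal_io_size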

>
> My use case is a ceph cluster (13.2.4) hosting rbd images for VMs
> running on Xen.  The rbd volumes are mapped to dom0 and then passed
> through to the guest using standard blkback/blkfront drivers.
>
> I am doing a bit of testing with different stripe unit sizes but keeping
> object size * count = 4M.  Does anyone have any experience finding
> optimal rbd parameters for this scenario?

I'd recommend focusing on the client side performance numbers for the
expected workload(s), not io_min/io_opt or object size * count target.
su = 64k and sc = 16 means that a 1M request will need responses from
up to 16 OSDs at once, which is probably not what you want unless you
have a small sequential write workload (where a custom striping layout
can prove very useful).

Thanks,

Ilya


Re: [ceph-users] Kernel requirements for balancer in upmap mode

2019-02-04 Thread Ilya Dryomov
On Mon, Feb 4, 2019 at 9:25 AM Massimo Sgaravatto
 wrote:
>
> The official documentation [*] says that the only requirement to use the 
> balancer in upmap mode is that all clients must run at least luminous.
> But I read somewhere (also in this mailing list) that there are also 
> requirements wrt the kernel.
> If so:
>
> 1) Could you please specify what is the minimum required kernel ?

4.13 or CentOS 7.5.  See [1] for details.

> 2) Does this kernel requirement apply only to the OSD nodes ? Or also to the 
> clients ?

No, only to the kernel client nodes.  If the kernel client isn't used,
there is no requirement at all.

[1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/027002.html

Thanks,

Ilya


Re: [ceph-users] RBD client hangs

2019-01-28 Thread Ilya Dryomov
On Mon, Jan 28, 2019 at 7:31 AM ST Wong (ITSC)  wrote:
>
> > That doesn't appear to be an error -- that's just stating that it found a 
> > dead client that was holding the exclusice-lock, so it broke the dead 
> > client's lock on the image (by blacklisting the client).
>
> As there is only 1 RBD client in this testing, does it mean the RBD client 
> process keeps failing?
> In a fresh boot RBD client, doing some basic operations also gives the 
> warning:
>
>  cut here 
> # rbd -n client.acapp1 map 4copy/foo
> /dev/rbd0
> # mount /dev/rbd0 /4copy
> # cd /4copy; ls
>
>
> # tail /var/log/messages
> Jan 28 14:23:39 acapp1 kernel: Key type ceph registered
> Jan 28 14:23:39 acapp1 kernel: libceph: loaded (mon/osd proto 15/24)
> Jan 28 14:23:39 acapp1 kernel: rbd: loaded (major 252)
> Jan 28 14:23:39 acapp1 kernel: libceph: mon2 192.168.1.156:6789 session 
> established
> Jan 28 14:23:39 acapp1 kernel: libceph: client80624 fsid 
> cc795498-5d16-4b84-9584-1788d0458be9
> Jan 28 14:23:39 acapp1 kernel: rbd: rbd0: capacity 10737418240 features 0x5
> Jan 28 14:23:44 acapp1 kernel: XFS (rbd0): Mounting V5 Filesystem
> Jan 28 14:23:44 acapp1 kernel: rbd: rbd0: client80621 seems dead, breaking 
> lock <--
> Jan 28 14:23:45 acapp1 kernel: XFS (rbd0): Starting recovery (logdev: 
> internal)
> Jan 28 14:23:45 acapp1 kernel: XFS (rbd0): Ending recovery (logdev: internal)
>
>  cut here 
>
> Is this normal?

Yes -- the lock isn't released because you are hard resetting your
machine.  When it comes back up, the new client fences the old client
to avoid split brain.

>
>
>
> Besides, repeated the testing:
> * Map and mount the rbd device, read/write ok.
> * Umount all rbd, then reboot without problem
> * Reboot hangs if not umounting all rbd before reboot:
>
>  cut here 
> Jan 28 14:13:12 acapp1 kernel: rbd: rbd0: client80531 seems dead, breaking 
> lock
> Jan 28 14:13:13 acapp1 kernel: XFS (rbd0): Ending clean mount 
>   <-- Reboot hangs here
> Jan 28 14:14:06 acapp1 systemd: Stopping Session 1 of user root.  
>   <-- pressing power reset
> Jan 28 14:14:06 acapp1 systemd: Stopped target Multi-User System.
>  cut here 
>
> Is it necessary to umount all RDB before rebooting  the client host?

Yes, it's necessary.  If you enable rbdmap.service, it should do it for
you:

https://github.com/ceph/ceph/blob/f52c22ebf5ff24107faf061a8de1f36376ed515d/systemd/rbdmap.service.in#L15
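
Roughly (a sketch -- image name, user and keyring path are made up):

$ cat /etc/ceph/rbdmap
rbd/foo id=admin,keyring=/etc/ceph/ceph.client.admin.keyring
$ sudo systemctl enable rbdmap.service

The unit then maps the listed images on boot and unmounts/unmaps them
on shutdown.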

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel client instability

2019-01-25 Thread Ilya Dryomov
On Fri, Jan 25, 2019 at 9:40 AM Martin Palma  wrote:
>
> > Do you see them repeating every 30 seconds?
>
> yes:
>
> Jan 25 09:34:37 sdccgw01 kernel: [6306813.737615] libceph: mon4
> 10.8.55.203:6789 session lost, hunting for new mon
> Jan 25 09:34:37 sdccgw01 kernel: [6306813.737620] libceph: mon3
> 10.8.55.202:6789 session lost, hunting for new mon
> Jan 25 09:34:37 sdccgw01 kernel: [6306813.737728] libceph: mon2
> 10.8.55.201:6789 session lost, hunting for new mon
> Jan 25 09:34:37 sdccgw01 kernel: [6306813.739711] libceph: mon1
> 10.7.55.202:6789 session established
> Jan 25 09:34:37 sdccgw01 kernel: [6306813.739899] libceph: mon1
> 10.7.55.202:6789 session established
> Jan 25 09:34:37 sdccgw01 kernel: [6306813.740015] libceph: mon3
> 10.8.55.202:6789 session established
> Jan 25 09:34:43 sdccgw01 kernel: [6306819.881560] libceph: mon2
> 10.8.55.201:6789 session lost, hunting for new mon
> Jan 25 09:34:43 sdccgw01 kernel: [6306819.883730] libceph: mon4
> 10.8.55.203:6789 session established
> Jan 25 09:34:47 sdccgw01 kernel: [6306823.977566] libceph: mon0
> 10.7.55.201:6789 session lost, hunting for new mon
> Jan 25 09:34:47 sdccgw01 kernel: [6306823.980033] libceph: mon1
> 10.7.55.202:6789 session established
> Jan 25 09:35:07 sdccgw01 kernel: [6306844.457449] libceph: mon1
> 10.7.55.202:6789 session lost, hunting for new mon
> Jan 25 09:35:07 sdccgw01 kernel: [6306844.457450] libceph: mon3
> 10.8.55.202:6789 session lost, hunting for new mon
> Jan 25 09:35:07 sdccgw01 kernel: [6306844.457612] libceph: mon1
> 10.7.55.202:6789 session lost, hunting for new mon
> Jan 25 09:35:07 sdccgw01 kernel: [6306844.459168] libceph: mon3
> 10.8.55.202:6789 session established
> Jan 25 09:35:07 sdccgw01 kernel: [6306844.459537] libceph: mon4
> 10.8.55.203:6789 session established
> Jan 25 09:35:07 sdccgw01 kernel: [6306844.459792] libceph: mon4
> 10.8.55.203:6789 session established
>
> > Which kernel are you running?
>
> Current running kernel is 4.11.0-13-generic  (Ubuntu 16.04.5 LTS), and
> the latest that is provided is  4.15.0-43-generic

Looks like https://tracker.ceph.com/issues/23537 indeed.  A kernel
upgrade will fix it.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel client instability

2019-01-25 Thread Ilya Dryomov
On Fri, Jan 25, 2019 at 8:37 AM Martin Palma  wrote:
>
> Hi Ilya,
>
> thank you for the clarification. After setting the
> "osd_map_messages_max" to 10 the io errors and the MDS error
> "MDS_CLIENT_LATE_RELEASE" are gone.
>
> The messages of  "mon session lost, hunting for new new mon" didn't go
> away... can it be that this is related to
> https://tracker.ceph.com/issues/23537

Do you see them repeating every 30 seconds?

Which kernel are you running?

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel client instability

2019-01-24 Thread Ilya Dryomov
On Thu, Jan 24, 2019 at 6:21 PM Andras Pataki
 wrote:
>
> Hi Ilya,
>
> Thanks for the clarification - very helpful.
> I've lowered osd_map_messages_max to 10, and this resolves the issue
> about the kernel being unhappy about large messages when the OSDMap
> changes.  One comment here though: you mentioned that Luminous uses 40
> as the default, which is indeed the case.  The documentation for
> Luminous (and master), however, says that the default is 100.

Looks like that page hasn't been kept up to date.  I'll fix that
section.

>
> One other follow-up question on the kernel client about something I've
> been seeing while testing.  Does the kernel client clean up when the MDS
> asks due to cache pressure?  On a machine I ran something that touches a
> lot of files, so the kernel client accumulated over 4 million caps.
> Many hours after all the activity finished (i.e. many hours after
> anything accesses ceph on that node) the kernel client still holds
> millions of caps, and the MDS periodically complains about clients not
> responding to cache pressure.  How is this supposed to be handled?
> Obviously asking the kernel to drop caches via /proc/sys/vm/drop_caches
> does a very thorough cleanup, but something in the middle would be better.

The kernel client sitting on way too many caps for way too long is
a long-standing issue.  Adding Zheng, who has recently been doing some
work to facilitate cap releases and put a limit on the overall cap
count.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel client instability

2019-01-24 Thread Ilya Dryomov
On Thu, Jan 24, 2019 at 8:16 PM Martin Palma  wrote:
>
> We are experiencing the same issues on clients with CephFS mounted
> using the kernel client and 4.x kernels.
>
> The problem  shows up when we add new OSDs, on reboots after
> installing patches and when changing the weight.
>
> Here the logs of a misbehaving client;
>
> [6242967.890611] libceph: mon4 10.8.55.203:6789 session established
> [6242968.010242] libceph: osd534 10.7.55.23:6814 io error
> [6242968.259616] libceph: mon1 10.7.55.202:6789 io error
> [6242968.259658] libceph: mon1 10.7.55.202:6789 session lost, hunting
> for new mon
> [6242968.359031] libceph: mon4 10.8.55.203:6789 session established
> [6242968.622692] libceph: osd534 10.7.55.23:6814 io error
> [6242968.692274] libceph: mon4 10.8.55.203:6789 io error
> [6242968.692337] libceph: mon4 10.8.55.203:6789 session lost, hunting
> for new mon
> [6242968.694216] libceph: mon0 10.7.55.201:6789 session established
> [6242969.099862] libceph: mon0 10.7.55.201:6789 io error
> [6242969.099888] libceph: mon0 10.7.55.201:6789 session lost, hunting
> for new mon
> [6242969.224565] libceph: osd534 10.7.55.23:6814 io error
>
> Additional to the MON io error we also got some OSD io errors.

This isn't surprising -- the kernel client can receive osdmaps from
both monitors and OSDs.

>
> Moreover when the error occurs several clients causes a
> "MDS_CLIENT_LATE_RELEASE" error on the MDS server.
>
> We are currently running on Luminous 12.2.10 and have around 580 OSDs
> and 5 monitor nodes. The cluster is running on CentOS 7.6.
>
> The ‘osd_map_message_max’ setting is set to the default value of 40.
> But we are still getting these errors.

My advice is the same: set it to 20 or even 10.  The problem is that this
setting is expressed as a number of osdmaps rather than the size of the
resulting message.  I've filed

  http://tracker.ceph.com/issues/38040
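
A sketch of lowering it on luminous -- either in ceph.conf on the mon
and osd nodes (followed by a restart) or injected at runtime:

[global]
    osd map message max = 10

$ ceph tell osd.* injectargs '--osd_map_message_max 10'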

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD client hangs

2019-01-21 Thread Ilya Dryomov
On Mon, Jan 21, 2019 at 11:43 AM ST Wong (ITSC)  wrote:
>
> Hi, we’re trying mimic on an VM farm.  It consists 4 OSD hosts (8 OSDs) and 3 
> MON. We tried mounting as RBD and CephFS (fuse and kernel mount) on 
> different clients without problem.

Is this an upgraded or a fresh cluster?

>
> Then one day we perform failover test and stopped one of the OSD.  Not sure 
> if it’s related but after that testing, the RBD client freeze when trying to 
> mount the rbd device.
>
>
>
> Steps to reproduce:
>
>
>
> # modprobe rbd
>
>
>
> (dmesg)
>
> [  309.997587] Key type dns_resolver registered
>
> [  310.043647] Key type ceph registered
>
> [  310.044325] libceph: loaded (mon/osd proto 15/24)
>
> [  310.054548] rbd: loaded
>
>
>
> # rbd -n client.acapp1 map 4copy/foo
>
> /dev/rbd0
>
>
>
> # rbd showmapped
>
> id pool  image snap device
>
> 0  4copy foo   -/dev/rbd0
>
>
>
>
>
> Then hangs if I tried to mount or reboot the server after rbd map.   There 
> are lot of error in dmesg, e.g.
>
>
>
> Jan 20 03:43:32 acapp1 kernel: rbd: rbd0: blacklist of client74700 failed: -13
>
> Jan 20 03:43:32 acapp1 kernel: rbd: rbd0: failed to acquire lock: -13
>
> Jan 20 03:43:32 acapp1 kernel: rbd: rbd0: no lock owners detected
>
> Jan 20 03:43:32 acapp1 kernel: rbd: rbd0: client74700 seems dead, breaking 
> lock
>
> Jan 20 03:43:32 acapp1 kernel: rbd: rbd0: blacklist of client74700 failed: -13
>
> Jan 20 03:43:32 acapp1 kernel: rbd: rbd0: failed to acquire lock: -13
>
> Jan 20 03:43:32 acapp1 kernel: rbd: rbd0: no lock owners detected

Does client.acapp1 have the permission to blacklist other clients?  You
can check with "ceph auth get client.acapp1".  If not, follow step 6 of
http://docs.ceph.com/docs/master/releases/luminous/#upgrade-from-jewel-or-kraken.
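
If the caps turn out to be pre-luminous, a minimal sketch of what that
step boils down to (client and pool names taken from this thread):

$ ceph auth caps client.acapp1 mon 'profile rbd' osd 'profile rbd pool=4copy'

'profile rbd' on the mon is what grants the permission to blacklist
a dead lock holder.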

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] read-only mounts of RBD images on multiple nodes for parallel reads

2019-01-18 Thread Ilya Dryomov
On Fri, Jan 18, 2019 at 11:25 AM Mykola Golub  wrote:
>
> On Thu, Jan 17, 2019 at 10:27:20AM -0800, Void Star Nill wrote:
> > Hi,
> >
> > We am trying to use Ceph in our products to address some of the use cases.
> > We think Ceph block device for us. One of the use cases is that we have a
> > number of jobs running in containers that need to have Read-Only access to
> > shared data. The data is written once and is consumed multiple times. I
> > have read through some of the similar discussions and the recommendations
> > on using CephFS for these situations, but in our case Block device makes
> > more sense as it fits well with other use cases and restrictions we have
> > around this use case.
> >
> > The following scenario seems to work as expected when we tried on a test
> > cluster, but we wanted to get an expert opinion to see if there would be
> > any issues in production. The usage scenario is as follows:
> >
> > - A block device is created with "--image-shared" options:
> >
> > rbd create mypool/foo --size 4G --image-shared
>
> "--image-shared" just means that the created image will have
> "exclusive-lock" feature and all other features that depend on it
> disabled. It is useful for scenarios when one wants simulteous write
> access to the image (e.g. when using a shared-disk cluster fs like
> ocfs2) and does not want a performance penalty due to "exlusive-lock"
> being pinged-ponged between writers.
>
> For your scenario it is not necessary but is ok.
>
> > - The image is mapped to a host, formatted in ext4 format (or other file
> > formats), mounted to a directory in read/write mode and data is written to
> > it. Please note that the image will be mapped in exclusive write mode -- no
> > other read/write mounts are allowed a this time.
>
> The map "exclusive" option works only for images with "exclusive-lock"
> feature enabled and prevent in this case automatic exclusive lock
> transitions (ping-pong mentioned above) from one writer to
> another. And in this case it will not prevent from mapping and
> mounting it ro and probably even rw (I am not familiar enough with
> kernel rbd implementation to be sure here), though in the last case
> the write will fail.

With -o exclusive, in addition to preventing automatic lock
transitions, the kernel will attempt to acquire the lock at map time
(i.e. before allowing any I/O) and return an error from "rbd map" in
case the lock cannot be acquired.

However, the fact that the image is mapped -o exclusive on one host
doesn't mean that it can't be mapped without -o exclusive on another
host.  If you then try to write through the non-exclusive mapping, the
write will block until the exclusive mapping goes away, resulting in
hung tasks stuck in uninterruptible sleep -- a much less pleasant
failure mode.

So make sure that all writers use -o exclusive.
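
i.e. something along these lines on every host that is expected to
write (image name from the example above, just a sketch):

$ rbd map -o exclusive mypool/foo

and a plain "rbd map mypool/foo" (or -o ro) on the read-only consumers.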

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] read-only mounts of RBD images on multiple nodes for parallel reads

2019-01-18 Thread Ilya Dryomov
On Fri, Jan 18, 2019 at 9:25 AM Burkhard Linke
 wrote:
>
> Hi,
>
> On 1/17/19 7:27 PM, Void Star Nill wrote:
>
> Hi,
>
> We am trying to use Ceph in our products to address some of the use cases. We 
> think Ceph block device for us. One of the use cases is that we have a number 
> of jobs running in containers that need to have Read-Only access to shared 
> data. The data is written once and is consumed multiple times. I have read 
> through some of the similar discussions and the recommendations on using 
> CephFS for these situations, but in our case Block device makes more sense as 
> it fits well with other use cases and restrictions we have around this use 
> case.
>
> The following scenario seems to work as expected when we tried on a test 
> cluster, but we wanted to get an expert opinion to see if there would be any 
> issues in production. The usage scenario is as follows:
>
> - A block device is created with "--image-shared" options:
>
> rbd create mypool/foo --size 4G --image-shared
>
>
> - The image is mapped to a host, formatted in ext4 format (or other file 
> formats), mounted to a directory in read/write mode and data is written to 
> it. Please note that the image will be mapped in exclusive write mode -- no 
> other read/write mounts are allowed a this time.
>
> - The volume is unmapped from the host and then mapped on to N number of 
> other hosts where it will be mounted in read-only mode and the data is read 
> simultaneously from N readers
>
>
> There is no read-only ext4. Using the 'ro' mount option is by no means a 
> read-only access to the underlying storage. ext4 maintains a journal for 
> example, and needs to access and flush the journal on mount. You _WILL_ run 
> into unexpected issues.

Only if the journal needs replaying.  If you ensure a clean unmount
after writing the data, it shouldn't need to write to the underlying
block device on subsequent read-only mounts.

As an additional safeguard, map the image with -o ro.  This way the
block device will be read-only from the get-go.
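
A rough sketch of the read-only side (device name assumed):

$ rbd map -o ro mypool/foo
$ mount -o ro /dev/rbd0 /mnt/data

With the block device mapped read-only, a mount that does need journal
replay should fail instead of silently writing to the image.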

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel client instability

2019-01-16 Thread Ilya Dryomov
On Wed, Jan 16, 2019 at 7:12 PM Andras Pataki
 wrote:
>
> Hi Ilya/Kjetil,
>
> I've done some debugging and tcpdump-ing to see what the interaction
> between the kernel client and the mon looks like.  Indeed -
> CEPH_MSG_MAX_FRONT defined as 16Mb seems low for the default mon
> messages for our cluster (with osd_mon_messages_max at 100).  We have
> about 3500 osd's, and the kernel advertises itself as older than

This is too big, especially for a fairly large cluster such as yours.
The default was reduced to 40 in luminous.  Given about 3500 OSDs, you
might want to set it to 20 or even 10.

> Luminous, so it gets full map updates.  The FRONT message size on the
> wire I saw was over 24Mb.  I'll try setting osd_mon_messages_max to 30
> and do some more testing, but from the debugging it definitely seems
> like the issue.
>
> Is the kernel driver really not up to date to be considered at least a
> Luminous client by the mon (i.e. it has some feature really missing)?  I
> looked at the bits, and the MON seems to want is bit 59 in ceph features
> shared by FS_BTIME, FS_CHANGE_ATTR, MSG_ADDR2.  Can the kernel client be
> used when setting require-min-compat to luminous (either with the 4.19.x
> kernel or the Redhat/Centos 7.6 kernel)?  Some background here would be
> helpful.

Yes, the kernel client is missing support for that feature bit, however
4.13+ and RHEL 7.5+ _can_ be used with require-min-compat-client set to
luminous.  See

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/027002.html
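
For completeness, the override is the same command mentioned in that
thread:

$ ceph osd set-require-min-compat-client luminous --yes-i-really-mean-it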

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel client instability

2019-01-16 Thread Ilya Dryomov
On Wed, Jan 16, 2019 at 1:27 AM Kjetil Joergensen  wrote:
>
> Hi,
>
> you could try reducing "osd map message max", some code paths that end up as 
> -EIO (kernel: libceph: mon1 *** io error) is exceeding 
> include/linux/ceph/libceph.h:CEPH_MSG_MAX_{FRONT,MIDDLE,DATA}_LEN.
>
> This "worked for us" - YMMV.

Kjetil, how big is your cluster?  Do you remember the circumstances
under which you started seeing these errors?

Andras, please let us know if this resolves the issue.  Decreasing
"osd map message max" for large clusters can help with the overall
memory consumption and is probably a good idea in general, but then
these kernel client limits are pretty arbitrary, so we can look at
bumping them.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] centos 7.6 kernel panic caused by osd

2019-01-11 Thread Ilya Dryomov
On Fri, Jan 11, 2019 at 11:58 AM Rom Freiman  wrote:
>
> Same kernel :)

Rom, can you update your CentOS ticket with the link to the Ceph BZ?

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] centos 7.6 kernel panic caused by osd

2019-01-11 Thread Ilya Dryomov
On Fri, Jan 11, 2019 at 1:38 AM Brad Hubbard  wrote:
>
> On Fri, Jan 11, 2019 at 9:57 AM Jason Dillaman  wrote:
> >
> > I think Ilya recently looked into a bug that can occur when
> > CONFIG_HARDENED_USERCOPY is enabled and the IO's TCP message goes
> > through the loopback interface (i.e. co-located OSDs and krbd).
> > Assuming that you have the same setup, you might be hitting the same
> > bug.
>
> Thanks for that Jason, I wasn't aware of that bug. I'm interested to
> see the details.

Here is Rom's BZ, it has some details:

https://bugzilla.redhat.com/show_bug.cgi?id=1665248

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Image has watchers, but cannot determine why

2019-01-10 Thread Ilya Dryomov
On Wed, Jan 9, 2019 at 5:17 PM Kenneth Van Alstyne
 wrote:
>
> Hey folks, I’m looking into what I would think would be a simple problem, but 
> is turning out to be more complicated than I would have anticipated.   A 
> virtual machine managed by OpenNebula was blown away, but the backing RBD 
> images remain.  Upon investigating, it appears
> that the images still have watchers on the KVM node that that VM previously 
> lived on.  I can confirm that there are no mapped RBD images on the machine 
> and the qemu-system-x86_64 process is indeed no longer running.  Any ideas?  
> Additional details are below:
>
> # rbd info one-73-145-10
> rbd image 'one-73-145-10':
> size 1024 GB in 262144 objects
> order 22 (4096 kB objects)
> block_name_prefix: rbd_data.27174d6b8b4567
> format: 2
> features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
> flags:
> parent: rbd/one-73@snap
> overlap: 102400 kB
> #
> # rbd status one-73-145-10
> Watchers:
> watcher=10.0.235.135:0/3820784110 client.33810559 cookie=140234310778880
> #
> #
> # rados -p rbd listwatchers rbd_header.27174d6b8b4567
> watcher=10.0.235.135:0/3820784110 client.33810559 cookie=140234310778880

This appears to be a RADOS (i.e. not a kernel client) watch.  Are you
sure that nothing of the sort is running on that node?

In order for the watch to stay live, the watcher has to send periodic
ping messages to the OSD.  Perhaps determine the primary OSD with "ceph
osd map rbd rbd_header.27174d6b8b4567", set debug_ms to 1 on that OSD
and monitor the log for a few minutes?
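
Something like this (header object name taken from above, the OSD id
is whatever the first command prints -- a sketch):

$ ceph osd map rbd rbd_header.27174d6b8b4567
$ ceph tell osd.<primary-id> injectargs '--debug_ms 1'
  ... watch /var/log/ceph/ceph-osd.<primary-id>.log for pings from
  10.0.235.135 ...
$ ceph tell osd.<primary-id> injectargs '--debug_ms 0'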

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] list admin issues

2018-12-28 Thread Ilya Dryomov
On Sat, Dec 22, 2018 at 7:18 PM Brian :  wrote:
>
> Sorry to drag this one up again.
>
> Just got the unsubscribed due to excessive bounces thing.
>
> 'Your membership in the mailing list ceph-users has been disabled due
> to excessive bounces The last bounce received from you was dated
> 21-Dec-2018.  You will not get any more messages from this list until
> you re-enable your membership.  You will receive 3 more reminders like
> this before your membership in the list is deleted.'
>
> can anyone check MTA logs to see what the bounce is?

Me too.  Happens regularly and only on ceph-users, not on sepia or
ceph-maintainers, etc.  David, Dan, could you or someone you know look
into this?

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 12.2.10 rbd kernel mount issue after update

2018-12-06 Thread Ilya Dryomov
On Thu, Dec 6, 2018 at 11:15 AM Ashley Merrick  wrote:
>
> That is correct, but that command was run weeks ago.
>
> And the RBD connected fine on 2.9 via the kernel 4.12 so I’m really lost to 
> why suddenly it’s now blocking a connection it originally allowed through 
> (even if by mistake)

When was it last mapped, before that command was run or after?  If
before, and the command was run with --yes-i-really-mean-it, that would
mostly explain it.

>
> Which kernel do I need to run to support luminous level?

4.13 or newer.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 12.2.10 rbd kernel mount issue after update

2018-12-06 Thread Ilya Dryomov
On Thu, Dec 6, 2018 at 10:58 AM Ashley Merrick  wrote:
>
> That command returns luminous.

This is the issue.

My guess is that someone ran "ceph osd set-require-min-compat-client
luminous", making it so that only luminous-aware clients are allowed to
connect to the cluster.  Kernel 4.12 doesn't support luminous features,
so it isn't allowed to connect.  Perhaps you wanted to experiment with
the balancer module?

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 12.2.10 rbd kernel mount issue after update

2018-12-06 Thread Ilya Dryomov
On Thu, Dec 6, 2018 at 4:22 AM Ashley Merrick  wrote:
>
> Hello,
>
> As mentioned earlier the cluster is seperatly running on the latest mimic.
>
> Due to 14.04 only supporting up to Luminous I was running the 12.2.9 version 
> of ceph-common for the rbd binary.
>
> This is what was upgraded when I did the dist-upgrade on the VM mounting the 
> RBD.
>
> The cluster it self has not changed and has always been running the latest 
> point release on mimic.
>
> All that changed was the move of ceph-common and dependencies on the mounting 
> VM.
>
> 12.2.9 + 4.12 Kernel was able to mount a Mimic EC backed RBD via KRBD, since 
> 12.2.10 I now get the error.
>
> So to me looks like there was a client side change from .9 to .10 as no 
> change cluster side.

The error is coming from the kernel, not from "rbd map".  "rbd map"
doesn't really do much beyond gathering options and setting up the
keys.  I don't think the client side ceph upgrade is the root cause
here.

You didn't answer my other question: what is the output of "ceph osd
get-require-min-compat-client"?

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 12.2.10 rbd kernel mount issue after update

2018-12-05 Thread Ilya Dryomov
On Wed, Dec 5, 2018 at 3:48 PM Ashley Merrick  wrote:
>
> I have had some ec backed Mimic RBD's mounted via the kernel module on a 
> Ubuntu 14.04 VM, these have been running no issues after updating the kernel 
> to 4.12 to support EC features.
>
> Today I run an apt dist-upgrade which upgraded from 12.2.9 to 12.2.10, since 
> then I have been getting the following line in the syslog and had to role 
> back to using rbd-nbd for the moment which continues to work fine.

Hi Ashley,

Are you sure that the release you upgraded from was 12.2.9?

What upgrade procedure did you follow?

>
> Not sure if there is a change in 12.2.10 that this is expected with a non 
> mimic kernel client such as Luminous.
>
> Error in VM syslog:
>
> feature set mismatch, my 40107b86a842ada < server's 60107b86aa42ada, missing 
> 220
> libceph: mon2 176.9.86.219:6789 missing required protocol features

These are standard luminous feature bits.  It looks like your cluster
didn't require them before the upgrade and now it does.  12.2.10 should
continue to work with 4.12 and any older kernel, as long as nothing
luminous-only is enabled.

What is the output of "ceph osd get-require-min-compat-client"?

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS kernel client versions - pg-upmap

2018-11-08 Thread Ilya Dryomov
On Thu, Nov 8, 2018 at 5:10 PM Stefan Kooman  wrote:
>
> Quoting Stefan Kooman (ste...@bit.nl):
> > I'm pretty sure it isn't. I'm trying to do the same (force luminous
> > clients only) but ran into the same issue. Even when running 4.19 kernel
> > it's interpreted as a jewel client. Here is the list I made so far:
> >
> > Kernel 4.13 / 4.15:
> > "features": "0x7010fb86aa42ada",
> > "release": "jewel"
> >
> > kernel 4.18 / 4.19
> >  "features": "0x27018fb86aa42ada",
> >  "release": "jewel"
>
> On a test cluster with kernel clients 4.13, 4.15, 4.19 I have set the
> "ceph osd set-require-min-compat-client luminous --yes-i-really-mean-it"
> while doing active IO ... no issues. Remount also works ... makes me
> wonder how strict this "require-min-compat-client" is ...

It's there to stop you from accidentally enabling new features that
some of your clients are too old for.  In its current form it's quite
easy to bypass but I think there are plans to make it stronger in the
future.

You didn't actually enable any new features by bumping it to luminous.
But you shouldn't see any issues even if you go ahead and do that (e.g.
put the balancer in upmap mode) because your clients are new enough.
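
For reference, that would be something like:

$ ceph mgr module enable balancer
$ ceph balancer mode upmap
$ ceph balancer on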

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS kernel client versions - pg-upmap

2018-11-08 Thread Ilya Dryomov
On Thu, Nov 8, 2018 at 2:15 PM Stefan Kooman  wrote:
>
> Quoting Ilya Dryomov (idryo...@gmail.com):
> > On Sat, Nov 3, 2018 at 10:41 AM  wrote:
> > >
> > > Hi.
> > >
> > > I tried to enable the "new smart balancing" - backend are on RH luminous
> > > clients are Ubuntu 4.15 kernel.
> [cut]
> > > ok, so 4.15 kernel connects as a "hammer" (<1.0) client?  Is there a
> > > huge gap in upstreaming kernel clients to kernel.org or what am I
> > > misreading here?
> > >
> > > Hammer is 2015'ish - 4.15 is January 2018'ish?
> > >
> > > Is kernel client development lacking behind ?
> >
> > Hi Jesper,
> >
> > There are four different groups of clients in that output.  Which one
> > of those four is the kernel client?  Are you sure it's just the hammer
> > one?
>
> I'm pretty sure it isn't. I'm trying to do the same (force luminous
> clients only) but ran into the same issue. Even when running 4.19 kernel
> it's interpreted as a jewel client. Here is the list I made so far:
>
> Kernel 4.13 / 4.15:
> "features": "0x7010fb86aa42ada",
> "release": "jewel"
>
> kernel 4.18 / 4.19
>  "features": "0x27018fb86aa42ada",
>  "release": "jewel"
>
> I have tested both Ubuntu as CentOS mainline kernels.  I came accross
> this issue made by Sage [1], which is resolved, but which looks similiar
> to this.

I asked about those hammer ones because they were clearly not 4.13+.
For 4.13+ and CentOS 7.5+ you can force require-min-compat-client with
--yes-i-really-mean-it.  See

  https://www.spinics.net/lists/ceph-users/msg45071.html
  http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/029105.html

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [bug] mount.ceph man description is wrong

2018-11-07 Thread Ilya Dryomov
On Wed, Nov 7, 2018 at 2:25 PM  wrote:
>
> Hi!
>
> I use ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic 
> (stable) and i want to call `ls -ld` to read whole dir size in cephfs:
>
> When i man mount.ceph:
>
> rbytes Report the recursive size of the directory contents for st_size on 
> directories.  Default: on
>
> But without rbytes like below, "ls -ld" do not work:
>
> mount -t ceph 192.168.0.24:/ /mnt -o 
> name=admin,secretfile=/etc/ceph/admin.secret
>
> [root@test mnt]# ls -ld mongo
> drwxr-xr-x 4 polkitd root 29 11月  6 16:33 mongo
>
> Then i umoun and mount use below cmd, it works:
>
> mount -t ceph 192.168.0.24:/ /mnt -o 
> name=admin,secretfile=/etc/ceph/admin.secret,rbytes
>
>
> [root@test mnt]# ls -ld mongo
> drwxr-xr-x 4 polkitd root 392021518 11月  6 16:33 mongo
>
>
> So the description is wrong, right?

Yes, it's wrong.  Thanks for the PR; if you address the feedback, we'll
merge it.

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS kernel client versions - pg-upmap

2018-11-05 Thread Ilya Dryomov
On Sat, Nov 3, 2018 at 10:41 AM  wrote:
>
> Hi.
>
> I tried to enable the "new smart balancing" - backend are on RH luminous
> clients are Ubuntu 4.15 kernel.
>
> As per: http://docs.ceph.com/docs/mimic/rados/operations/upmap/
> $ sudo ceph osd set-require-min-compat-client luminous
> Error EPERM: cannot set require_min_compat_client to luminous: 1 connected
> client(s) look like firefly (missing 0xe010020); 1 connected
> client(s) look like firefly (missing 0xe01); 1 connected
> client(s) look like hammer (missing 0xe20); 55 connected
> client(s) look like jewel (missing 0x800); add
> --yes-i-really-mean-it to do it anyway
>
> ok, so 4.15 kernel connects as a "hammer" (<1.0) client?  Is there a
> huge gap in upstreaming kernel clients to kernel.org or what am I
> misreading here?
>
> Hammer is 2015'ish - 4.15 is January 2018'ish?
>
> Is kernel client development lacking behind ?

Hi Jesper,

There are four different groups of clients in that output.  Which one
of those four is the kernel client?  Are you sure it's just the hammer
one?

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bcache, dm-cache support

2018-10-10 Thread Ilya Dryomov
On Wed, Oct 10, 2018 at 8:48 PM Kjetil Joergensen  wrote:
>
> Hi,
>
> We tested bcache, dm-cache/lvmcache, and one more which name eludes me with 
> PCIe NVME on top of large spinning rust drives behind a SAS3 expander - and 
> decided this were not for us.
>
> This was probably jewel with filestore, and our primary reason for trying to 
> go down this path were that leveldb compaction were killing us, and putting 
> omap/leveldb and things on separate locations were "so-so" supported (IIRC: 
> some were explicitly supported, some you could do a bit of symlink or mount 
> trickery).
>
> The caching worked - although, when we started doing power failure 
> survivability (power cycle the entire rig, wait for recovery, repeat), we 
> ended up with seriously corrupted the XFS filesystems on top of the cached 
> block device within a handful of power cycles). We did not test fully 
> disabling the spinning rust on-device cache (which were the leading 
> hypothesis of why this actually failed, potentially combined with ordering of 
> FLUSH+FUA ending up slightly funky combined with the rather asymmetric commit 
> latency). Just to rule out anything else, we did run the same power-fail test 
> regimen for days without the nvme-over-spinning-rust-caching, without 
> triggering the same filesystem corruption.
>
> So yea - I'd recommend looking at i.e. bluestore and stick rocksdb, journal 
> and anything else performance critical on faster storage instead.
>
> If you do decide to go down the dm-cache/lvmcache/(other cache) road - I'd 
> recommend throughly testing failure scenarios like i.e. power-loss so you 
> don't find out accidentally when you do have a multi-failure-domain outage. :)

Yeah, definitely do a lot of pulling disks and power cycle testing.
dm-cache had a bug in 4.9+ that could corrupt data on power loss:

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5b1fe7bec8a8d0cc547a22e7ddc2bd59acd67de4

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] issued! = cap->implemented in handle_cap_export

2018-09-25 Thread Ilya Dryomov
On Tue, Sep 25, 2018 at 2:05 PM 刘 轩  wrote:
>
> Hi Ilya:
>
>  I have some questions about the commit 
> d84b37f9fa9b23a46af28d2e9430c87718b6b044 about the function 
> handle_cap_export. In which case, issued! = cap->implemented may occur.
>
> I encountered this kind of mistake in my cluster. Do you think this is 
> probably BUG?
>
> ceph: limit rate of cap import/export error messages
>
> https://github.com/ceph/ceph-client/commit/d84b37f9fa9b23a46af28d2e9430c87718b6b044
>
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index 7e09fa8ab0ed..f28efaecbb50 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -3438,7 +3438,14 @@ static void handle_cap_export(struct inode *inode, 
> struct ceph_mds_caps *ex,
>*/
>
>   issued = cap->issued;
> - WARN_ON(issued != cap->implemented);
> + if (issued != cap->implemented)
> + pr_err_ratelimited("handle_cap_export: issued != implemented: "
> + "ino (%llx.%llx) mds%d seq %d mseq %d "
> + "issued %s implemented %s\n",
> + ceph_vinop(inode), mds, cap->seq, cap->mseq,
> + ceph_cap_string(issued),
> + ceph_cap_string(cap->implemented));
> +
>
>   tcap = __get_cap_for_mds(ci, target);
>   if (tcap) {

Resending to ceph-users in plain text, adding Zheng.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Get supported features of all connected clients

2018-09-11 Thread Ilya Dryomov
On Tue, Sep 11, 2018 at 1:00 PM Tobias Florek  wrote:
>
> Hi!
>
> I have a cluster serving RBDs and CephFS that has a big number of
> clients I don't control.  I want to know what feature flags I can safely
> set without locking out clients.  Is there a command analogous to `ceph
> versions` that shows the connected clients and their feature support?

Yes, "ceph features".

https://ceph.com/community/new-luminous-upgrade-complete/

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd-nbd on CentOS

2018-09-10 Thread Ilya Dryomov
On Mon, Sep 10, 2018 at 7:46 PM David Turner  wrote:
>
> Now that you mention it, I remember those threads on the ML.  What happens if 
> you use --yes-i-really-mean-it to do those things and then later you try to 
> map an RBD with an older kernel for CentOS 7.3 or 7.4?  Will that mapping 
> fail because of the min-client-version of luminous set on the cluster while 
> allowing CentOS 7.5 clients map RBDs?

Yes, more or less.

If you _just_ set the require-min-compat-client setting, nothing will
change.  It's there to prevent you from accidentally locking out older
clients by enabling some new feature.  You will continue to be able to
map images with both old and new kernels.

If you then go ahead and install an upmap exception (manually or via
the balancer module), you will no longer be able to map images with old
kernels.

This applies to all RADOS clients, not just the kernel client.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd-nbd on CentOS

2018-09-10 Thread Ilya Dryomov
On Mon, Sep 10, 2018 at 7:19 PM David Turner  wrote:
>
> I haven't found any mention of this on the ML and Google's results are all 
> about compiling your own kernel to use NBD on CentOS. Is everyone that's 
> using rbd-nbd on CentOS honestly compiling their own kernels for the clients? 
> This feels like something that shouldn't be necessary anymore.
>
> I would like to use the balancer module with upmap, but can't do that with 
> kRBD because even the latest kernels still register as Jewel. What have y'all 
> done to use rbd-nbd on CentOS? I'm hoping I'm missing something and not that 
> I'll need to compile a kernel to use on all of the hosts that I want to map 
> RBDs to.

FWIW upmap is fully supported since 4.13 and RHEL 7.5:

  https://www.spinics.net/lists/ceph-users/msg45071.html
  http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/029105.html

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Force unmap of RBD image

2018-09-10 Thread Ilya Dryomov
On Mon, Sep 10, 2018 at 10:46 AM Martin Palma  wrote:
>
> We are trying to unmap an rbd image form a host for deletion and
> hitting the following error:
>
> rbd: sysfs write failed
> rbd: unmap failed: (16) Device or resource busy
>
> We used commands like "lsof" and "fuser" but nothing is reported to
> use the device. Also checked for watcher with "rados -p pool
> listwatchers image.rbd" but there aren't any listed.

The device is still open by someone.  Check for LVM volumes, multipath,
loop devices etc.  None of those typically show up in lsof.
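
A few things worth checking (rbd0 assumed):

$ lsblk /dev/rbd0
$ ls /sys/block/rbd0/holders/
$ dmsetup ls --tree
$ losetup -a

Anything that shows up as a holder (device-mapper/LVM, multipath, loop)
has to be torn down before the unmap can succeed.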

>
> By investigating `/sys/kernel/debug/ceph//osdc` we get:
>
> 160460241osd15019.b2af34image.rbd
> 231954'1271503593144320watch

Which kernel is that?

>
> Our goal is to unmap the image for deletion so if the unmap process
> should destroy the image is for us OK.
>
> Any help/suggestions?

On newer kernels you could do "rbd unmap -o force", but it
looks like you are running an older kernel.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Safe to use RBD mounts for Docker volumes on containerized Ceph nodes

2018-09-09 Thread Ilya Dryomov
On Sun, Sep 9, 2018 at 6:31 AM David Turner  wrote:
>
> The problem is with the kernel pagecache. If that is still shared in a 
> containerized environment with the OSDs in containers and RBDs which are 
> married on The node outside of containers, then it is indeed still a problem. 
> I would guess that's the case, but I do not know for certain. Using rbd-nbd 
> instead of krbd bypasses this problem and you can ignore it. Only using krbd 
> is problematic.

How is the nbd client in the kernel different from the rbd client in
the kernel (i.e.  krbd)?  They are both network block devices, the only
difference is that the latter talks directly to the OSDs while the
former has to go through a proxy.  It'll be the same kernel either way
if you choose to co-locate, so I don't think using rbd-nbd bypasses
this problem.  On the contrary, there is more opportunity for breakage
with an additional daemon in the I/O path.

The kernel is evolving and I haven't seen a report of such a deadlock
in quite a while.  I think it's still there, but it's probably harder
to hit than it used to be.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] kRBD write performance for high IO use cases

2018-09-08 Thread Ilya Dryomov
On Sat, Sep 8, 2018 at 1:52 AM Tyler Bishop
 wrote:
>
> I have a fairly large cluster running ceph bluestore with extremely fast SAS 
> ssd for the metadata.  Doing FIO benchmarks I am getting 200k-300k random 
> write iops but during sustained workloads of ElasticSearch my clients seem to 
> hit a wall of around 1100 IO/s per RBD device.  I've tried 1 RBD and 4 RBD 
> devices and I still only get 1100 IO per device, so 4 devices gets me around 
> 4k.
>
> Is there some sort of setting that limits each RBD devices performance?  I've 
> tried playing with nr_requests but that don't seem to change it at all... I'm 
> just looking for another 20-30% performance on random write io... I even 
> thought about doing raid 0 across 4-8 rbd devices just to get the io 
> performance.

What is the I/O profile of that workload?  How did you arrive at the
20-30% number?

Which kernel are you running?  Increasing nr_requests doesn't actually
increase the queue depth, at least on anything moderately recent.  You
need to map with queue_depth=X for that, see [1] for details.

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b55841807fb864eccca0167650a65722fd7cd553
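
For example (a sketch, the number is arbitrary):

$ rbd map mypool/image -o queue_depth=256

The option is only recognized on kernels that include the commit above.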

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Clients report OSDs down/up (dmesg) nothing in Ceph logs (flapping OSDs)

2018-08-30 Thread Ilya Dryomov
On Thu, Aug 30, 2018 at 1:04 PM Eugen Block  wrote:
>
> Hi again,
>
> we still didn't figure out the reason for the flapping, but I wanted
> to get back on the dmesg entries.
> They just reflect what happened in the past, they're no indicator to
> predict anything.

The kernel client is just that, a client.  Almost by definition,
everything it sees has already happened.

>
> For example, when I changed the primary-affinity of OSD.24 last week,
> one of the clients realized that only today, 4 days later. If the
> clients don't have to communicate with the respective host/osd in the
> meantime, they log those events on the next reconnect.

Correct, except it doesn't have to be a specific host or a specific
OSD.  What matters here is whether the client is idle.  As soon as the
client is woken up and sends a request to _any_ OSD, it receives a new
osdmap and applies it, possibly emitting those dmesg entries.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-container - rbd map failing since upgrade?

2018-08-21 Thread Ilya Dryomov
On Tue, Aug 21, 2018 at 9:19 PM Jacob DeGlopper  wrote:
>
> I'm seeing an error from the rbd map command running in ceph-container;
> I had initially deployed this cluster as Luminous, but a pull of the
> ceph/daemon container unexpectedly upgraded me to Mimic 13.2.1.
>
> [root@nodeA2 ~]# ceph version
> ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic
> (stable)
>
> [root@nodeA2 ~]# rbd info mysqlTB
> rbd image 'mysqlTB':
>  size 360 GiB in 92160 objects
>  order 22 (4 MiB objects)
>  id: 206a962ae8944a
>  block_name_prefix: rbd_data.206a962ae8944a
>  format: 2
>  features: layering
>  op_features:
>  flags:
>  create_timestamp: Sat Aug 11 00:00:36 2018
>
> [root@nodeA2 ~]# rbd map mysqlTB
> rbd: failed to add secret 'client.admin' to kernel
> In some cases useful info is found in syslog - try "dmesg | tail".
> rbd: map failed: (1) Operation not permitted
>
> [root@nodeA2 ~]# type rbd
> rbd is a function
> rbd ()
> {
>  sudo docker exec ceph-mon-nodeA2 rbd --cluster ceph ${@}
> }
>
> [root@nodeA2 ~]# ls -alF /etc/ceph/ceph.client.admin.keyring
> -rw--- 1 ceph ceph 159 May 21 09:27 /etc/ceph/ceph.client.admin.keyring
>
> System is CentOS 7 with the elrepo mainline kernel:
>
> [root@nodeA2 ~]# uname -a
> Linux nodeA2 4.18.3-1.el7.elrepo.x86_64 #1 SMP Sat Aug 18 09:30:18 EDT
> 2018 x86_64 x86_64 x86_64 GNU/Linux
>
> I see a similar question here with no answer:
> https://github.com/ceph/ceph-container/issues/1030

Hi Jacob,

You mentioned an upgrade in the subject, did it work with luminous
ceph-container?

It seems unlikely -- docker blocks add_key(2) and other key-management
system calls with seccomp because the kernel keyring is global.
See https://docs.docker.com/engine/security/seccomp/.
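
One way to confirm that seccomp is what's getting in the way (purely
for diagnosis, not a recommendation) is to recreate the container with
the default profile disabled, e.g. with

$ sudo docker run --security-opt seccomp=unconfined ...

and retry the map.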

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs client version in RedHat/CentOS 7.5

2018-08-21 Thread Ilya Dryomov
On Mon, Aug 20, 2018 at 9:49 PM Dan van der Ster  wrote:
>
> On Mon, Aug 20, 2018 at 5:37 PM Ilya Dryomov  wrote:
> >
> > On Mon, Aug 20, 2018 at 4:52 PM Dietmar Rieder
> >  wrote:
> > >
> > > Hi Cephers,
> > >
> > >
> > > I wonder if the cephfs client in RedHat/CentOS 7.5 will be updated to
> > > luminous?
> > > As far as I see there is some luminous related stuff that was
> > > backported, however,
> > > the "ceph features" command just reports "jewel" as release of my cephfs
> > > clients running CentOS 7.5 (kernel 3.10.0-862.11.6.el7.x86_64)
> > >
> > >
> > > {
> > > "mon": {
> > > "group": {
> > > "features": "0x3ffddff8eea4fffb",
> > > "release": "luminous",
> > > "num": 3
> > > }
> > > },
> > > "mds": {
> > > "group": {
> > > "features": "0x3ffddff8eea4fffb",
> > > "release": "luminous",
> > > "num": 3
> > > }
> > > },
> > > "osd": {
> > > "group": {
> > > "features": "0x3ffddff8eea4fffb",
> > > "release": "luminous",
> > > "num": 240
> > > }
> > > },
> > > "client": {
> > > "group": {
> > > "features": "0x7010fb86aa42ada",
> > > "release": "jewel",
> > > "num": 23
> > > },
> > > "group": {
> > > "features": "0x3ffddff8eea4fffb",
> > > "release": "luminous",
> > > "num": 4
> > > }
> > > }
> > > }
> > >
> > >
> > > This prevents me to run ceph balancer using the upmap mode.
> > >
> > >
> > > Any idea?
> >
> > Hi Dietmar,
> >
> > All luminous features are supported in RedHat/CentOS 7.5, but it shows
> > up as jewel due to a technicality.
>
> Except rados namespaces, right? Manila CephFS shares are not yet
> mountable with 7.5.

Yes, I was talking about cluster-wide feature bits, as that is what
"ceph features" is about.  CephFS layouts with namespaces are indeed
not supported in 7.5.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs client version in RedHat/CentOS 7.5

2018-08-21 Thread Ilya Dryomov
On Tue, Aug 21, 2018 at 9:12 AM Dietmar Rieder
 wrote:
>
> On 08/20/2018 05:36 PM, Ilya Dryomov wrote:
> > On Mon, Aug 20, 2018 at 4:52 PM Dietmar Rieder
> >  wrote:
> >>
> >> Hi Cephers,
> >>
> >>
> >> I wonder if the cephfs client in RedHat/CentOS 7.5 will be updated to
> >> luminous?
> >> As far as I see there is some luminous related stuff that was
> >> backported, however,
> >> the "ceph features" command just reports "jewel" as release of my cephfs
> >> clients running CentOS 7.5 (kernel 3.10.0-862.11.6.el7.x86_64)
> >>
> >>
> >> {
> >> "mon": {
> >> "group": {
> >> "features": "0x3ffddff8eea4fffb",
> >> "release": "luminous",
> >> "num": 3
> >> }
> >> },
> >> "mds": {
> >> "group": {
> >> "features": "0x3ffddff8eea4fffb",
> >> "release": "luminous",
> >> "num": 3
> >> }
> >> },
> >> "osd": {
> >> "group": {
> >> "features": "0x3ffddff8eea4fffb",
> >> "release": "luminous",
> >> "num": 240
> >> }
> >> },
> >> "client": {
> >> "group": {
> >> "features": "0x7010fb86aa42ada",
> >> "release": "jewel",
> >> "num": 23
> >> },
> >> "group": {
> >> "features": "0x3ffddff8eea4fffb",
> >> "release": "luminous",
> >> "num": 4
> >> }
> >> }
> >> }
> >>
> >>
> >> This prevents me to run ceph balancer using the upmap mode.
> >>
> >>
> >> Any idea?
> >
> > Hi Dietmar,
> >
> > All luminous features are supported in RedHat/CentOS 7.5, but it shows
> > up as jewel due to a technicality.  Just do
> >
> >   $ ceph osd set-require-min-compat-client luminous --yes-i-really-mean-it
> >
> > to override the safety check.
> >
> > See https://www.spinics.net/lists/ceph-users/msg45071.html for details.
> > It references an upstream kernel, but both the problem and the solution
> > are the same.
> >
>
> Hi Ilya,
>
> thank you for your answer.
>
> Just to make sure:
> The thread you are referring to, is about kernel 4.13+, is this also
> true for the "official" RedHat/CentOS 7.5 kernel 3.10
> (3.10.0-862.11.6.el7.x86_64) ?

Yes, it is.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs client version in RedHat/CentOS 7.5

2018-08-20 Thread Ilya Dryomov
On Mon, Aug 20, 2018 at 4:52 PM Dietmar Rieder
 wrote:
>
> Hi Cephers,
>
>
> I wonder if the cephfs client in RedHat/CentOS 7.5 will be updated to
> luminous?
> As far as I see there is some luminous related stuff that was
> backported, however,
> the "ceph features" command just reports "jewel" as release of my cephfs
> clients running CentOS 7.5 (kernel 3.10.0-862.11.6.el7.x86_64)
>
>
> {
> "mon": {
> "group": {
> "features": "0x3ffddff8eea4fffb",
> "release": "luminous",
> "num": 3
> }
> },
> "mds": {
> "group": {
> "features": "0x3ffddff8eea4fffb",
> "release": "luminous",
> "num": 3
> }
> },
> "osd": {
> "group": {
> "features": "0x3ffddff8eea4fffb",
> "release": "luminous",
> "num": 240
> }
> },
> "client": {
> "group": {
> "features": "0x7010fb86aa42ada",
> "release": "jewel",
> "num": 23
> },
> "group": {
> "features": "0x3ffddff8eea4fffb",
> "release": "luminous",
> "num": 4
> }
> }
> }
>
>
> This prevents me to run ceph balancer using the upmap mode.
>
>
> Any idea?

Hi Dietmar,

All luminous features are supported in RedHat/CentOS 7.5, but it shows
up as jewel due to a technicality.  Just do

  $ ceph osd set-require-min-compat-client luminous --yes-i-really-mean-it

to override the safety check.

See https://www.spinics.net/lists/ceph-users/msg45071.html for details.
It references an upstream kernel, but both the problem and the solution
are the same.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bad crc/signature errors

2018-08-14 Thread Ilya Dryomov
On Mon, Aug 13, 2018 at 5:57 PM Nikola Ciprich
 wrote:
>
> Hi Ilya,
>
> hmm, OK, I'm not  sure now whether this is the bug which I'm
> experiencing.. I've had read_partial_message  / bad crc/signature
> problem occurance on the second cluster in short period even though
> we're on the same ceph version (12.2.5) for quite long time (almost since
> its release), so it's starting to pain me.. I suppose this must
> have been caused by some kernel update, (we're currently sticking
> to 4.14.x and lately been upgrading to 4.14.50)

These "bad crc/signature" are usually the sign of faulty hardware.

What was the last "good" kernel and the first "bad" kernel?

You said "on the second cluster".  How is it different from the first?
Are you using the kernel client with both?  Is there Xen involved?

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] LVM on top of RBD apparent pagecache corruption with snapshots

2018-08-13 Thread Ilya Dryomov
On Mon, Aug 6, 2018 at 8:17 PM Ilya Dryomov  wrote:
>
> On Mon, Aug 6, 2018 at 8:13 PM Ilya Dryomov  wrote:
> >
> > On Thu, Jul 26, 2018 at 1:55 AM Alex Gorbachev  
> > wrote:
> > >
> > > On Wed, Jul 25, 2018 at 7:07 PM, Alex Gorbachev 
> > >  wrote:
> > > > On Wed, Jul 25, 2018 at 6:07 PM, Alex Gorbachev 
> > > >  wrote:
> > > >> On Wed, Jul 25, 2018 at 5:51 PM, Jason Dillaman  
> > > >> wrote:
> > > >>>
> > > >>>
> > > >>> On Wed, Jul 25, 2018 at 5:41 PM Alex Gorbachev 
> > > >>> 
> > > >>> wrote:
> > > >>>>
> > > >>>> I am not sure this related to RBD, but in case it is, this would be 
> > > >>>> an
> > > >>>> important bug to fix.
> > > >>>>
> > > >>>> Running LVM on top of RBD, XFS filesystem on top of that, consumed 
> > > >>>> in RHEL
> > > >>>> 7.4.
> > > >>>>
> > > >>>> When running a large read operation and doing LVM snapshots during
> > > >>>> that operation, the block being read winds up all zeroes in 
> > > >>>> pagecache.
> > > >>>>
> > > >>>> Dropping the caches syncs up the block with what's on "disk" and
> > > >>>> everything is fine.
> > > >>>>
> > > >>>> Working on steps to reproduce simply - ceph is Luminous 12.2.7, RHEL
> > > >>>> client is Jewel 10.2.10-17.el7cp
> > > >>>
> > > >>>
> > > >>> Is this krbd or QEMU+librbd? If the former, what kernel version are 
> > > >>> you
> > > >>> running?
> > > >>
> > > >> It's krbd on RHEL.
> > > >>
> > > >> RHEL kernel:
> > > >>
> > > >> Linux dmg-cbcache01 3.10.0-862.el7.x86_64 #1 SMP Wed Mar 21 18:14:51
> > > >> EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
> > > >
> > > > Not sure if this is exactly replicating the issue, but I was able to
> > > > do this on two different systems:
> > > >
> > > > RHEL 7.4 kernel as above.
> > > >
> > > > Create a PVM PV on a mapped kRBD device
> > > >
> > > > example: pvcreate /dev/rbd/spin1/lvm1
> > > >
> > > > Create a VG and LV, make an XFS FS
> > > >
> > > > vgcreate datavg /dev/rbd/spin1/lvm1
> > > > lvcreate -n data1 -L 5G datavg
> > > > mkfs.xfs /dev/datavg/data1
> > > > 
> > > >
> > > > Get some large file and copy it to some other file, same storage or
> > > > different.  All is well.
> > > >
> > > > Now snapshot the LV
> > > >
> > > > lvcreate -l8%ORIGIN -s -n snap_data1 /dev/datavg/data1 --addtag backup
> > > >
> > > > Now try to copy that file again.  I get:
> > > >
> > > > NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:1:3470]
> > > >
> > > > And in dmesg (this is on Proxmox but I did the same on ESXi)
> > > >
> > > > [1397609.308673] sched: RT throttling activated
> > > > [1397658.759259] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s!
> > > > [kworker/0:1:2648]
> > > > [1397658.759354] Modules linked in: dm_snapshot dm_bufio rbd libceph
> > > > rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache
> > > > sunrpc ppdev joydev pcspkr sg parport_pc virtio_balloon parport shpchp
> > > > i2c_piix4 ip_tables xfs libcrc32c sd_mod sr_mod crc_t10dif
> > > > crct10dif_generic cdrom crct10dif_common ata_generic pata_acpi
> > > > virtio_scsi virtio_console virtio_net bochs_drm drm_kms_helper
> > > > syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ata_piix libata
> > > > serio_raw virtio_pci i2c_core virtio_ring virtio floppy dm_mirror
> > > > dm_region_hash dm_log dm_mod
> > > > [1397658.759400] CPU: 0 PID: 2648 Comm: kworker/0:1 Kdump: loaded Not
> > > > tainted 3.10.0-862.el7.x86_64 #1
> > > > [1397658.759402] Hardware name: QEMU Standard PC (i440FX + PIIX,
> > > > 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org
> > > > 04/01/2014
> > > > [1397658.759415] Workqueue: kcopyd do_work [dm_mod]

Re: [ceph-users] bad crc/signature errors

2018-08-13 Thread Ilya Dryomov
On Mon, Aug 13, 2018 at 2:49 PM Nikola Ciprich
 wrote:
>
> Hi Paul,
>
> thanks, I'll give it a try.. do you think this might head to
> upstream soon?  for some reason I can't review comments for
> this patch on github.. Is some new version of this patch
> on the way, or can I try to apply this one to latest luminous?
>
> thanks a lot!
>
> nik
>
>
> On Fri, Aug 10, 2018 at 06:05:26PM +0200, Paul Emmerich wrote:
> > I've built a work-around here:
> > https://github.com/ceph/ceph/pull/23273

Those are completely different crc errors.  The ones Paul is talking
about occur in bluestore when fetching data from the underlying disk.
When they occur, there is no data to reply with to the client.  Paul's
pull request is working around that (likely a bug in the core kernel)
by adding up to two retries.

The ones this thread is about occur on the client side when receiving
a reply from the OSD.  The retry logic is already there: the connection
is cut, the client reconnects and resends the OSD request.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] LVM on top of RBD apparent pagecache corruption with snapshots

2018-08-06 Thread Ilya Dryomov
On Mon, Aug 6, 2018 at 8:13 PM Ilya Dryomov  wrote:
>
> On Thu, Jul 26, 2018 at 1:55 AM Alex Gorbachev  
> wrote:
> >
> > On Wed, Jul 25, 2018 at 7:07 PM, Alex Gorbachev  
> > wrote:
> > > On Wed, Jul 25, 2018 at 6:07 PM, Alex Gorbachev 
> > >  wrote:
> > >> On Wed, Jul 25, 2018 at 5:51 PM, Jason Dillaman  
> > >> wrote:
> > >>>
> > >>>
> > >>> On Wed, Jul 25, 2018 at 5:41 PM Alex Gorbachev 
> > >>> 
> > >>> wrote:
> > >>>>
> > >>>> I am not sure this related to RBD, but in case it is, this would be an
> > >>>> important bug to fix.
> > >>>>
> > >>>> Running LVM on top of RBD, XFS filesystem on top of that, consumed in 
> > >>>> RHEL
> > >>>> 7.4.
> > >>>>
> > >>>> When running a large read operation and doing LVM snapshots during
> > >>>> that operation, the block being read winds up all zeroes in pagecache.
> > >>>>
> > >>>> Dropping the caches syncs up the block with what's on "disk" and
> > >>>> everything is fine.
> > >>>>
> > >>>> Working on steps to reproduce simply - ceph is Luminous 12.2.7, RHEL
> > >>>> client is Jewel 10.2.10-17.el7cp
> > >>>
> > >>>
> > >>> Is this krbd or QEMU+librbd? If the former, what kernel version are you
> > >>> running?
> > >>
> > >> It's krbd on RHEL.
> > >>
> > >> RHEL kernel:
> > >>
> > >> Linux dmg-cbcache01 3.10.0-862.el7.x86_64 #1 SMP Wed Mar 21 18:14:51
> > >> EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
> > >
> > > Not sure if this is exactly replicating the issue, but I was able to
> > > do this on two different systems:
> > >
> > > RHEL 7.4 kernel as above.
> > >
> > > Create a PVM PV on a mapped kRBD device
> > >
> > > example: pvcreate /dev/rbd/spin1/lvm1
> > >
> > > Create a VG and LV, make an XFS FS
> > >
> > > vgcreate datavg /dev/rbd/spin1/lvm1
> > > lvcreate -n data1 -L 5G datavg
> > > mkfs.xfs /dev/datavg/data1
> > > 
> > >
> > > Get some large file and copy it to some other file, same storage or
> > > different.  All is well.
> > >
> > > Now snapshot the LV
> > >
> > > lvcreate -l8%ORIGIN -s -n snap_data1 /dev/datavg/data1 --addtag backup
> > >
> > > Now try to copy that file again.  I get:
> > >
> > > NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:1:3470]
> > >
> > > And in dmesg (this is on Proxmox but I did the same on ESXi)
> > >
> > > [1397609.308673] sched: RT throttling activated
> > > [1397658.759259] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s!
> > > [kworker/0:1:2648]
> > > [1397658.759354] Modules linked in: dm_snapshot dm_bufio rbd libceph
> > > rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache
> > > sunrpc ppdev joydev pcspkr sg parport_pc virtio_balloon parport shpchp
> > > i2c_piix4 ip_tables xfs libcrc32c sd_mod sr_mod crc_t10dif
> > > crct10dif_generic cdrom crct10dif_common ata_generic pata_acpi
> > > virtio_scsi virtio_console virtio_net bochs_drm drm_kms_helper
> > > syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ata_piix libata
> > > serio_raw virtio_pci i2c_core virtio_ring virtio floppy dm_mirror
> > > dm_region_hash dm_log dm_mod
> > > [1397658.759400] CPU: 0 PID: 2648 Comm: kworker/0:1 Kdump: loaded Not
> > > tainted 3.10.0-862.el7.x86_64 #1
> > > [1397658.759402] Hardware name: QEMU Standard PC (i440FX + PIIX,
> > > 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org
> > > 04/01/2014
> > > [1397658.759415] Workqueue: kcopyd do_work [dm_mod]
> > > [1397658.759418] task: 932df65d3f40 ti: 932fb138c000 task.ti:
> > > 932fb138c000
> > > [1397658.759420] RIP: 0010:[]  []
> > > copy_callback+0x50/0x130 [dm_snapshot]
> > > [1397658.759426] RSP: 0018:932fb138fd08  EFLAGS: 0283
> > > [1397658.759428] RAX: 0003e5e8 RBX: ebecc4943ec0 RCX:
> > > 932ff4704068
> > > [1397658.759430] RDX: 932dc8050d00 RSI: 932fd6a0f9b8 RDI:
> > > 
> > > [1397658.759431] RBP: 932fb138fd28 R08: 932dc7d2c0b0 R09:
> > > 9

Re: [ceph-users] LVM on top of RBD apparent pagecache corruption with snapshots

2018-08-06 Thread Ilya Dryomov
On Thu, Jul 26, 2018 at 1:55 AM Alex Gorbachev  wrote:
>
> On Wed, Jul 25, 2018 at 7:07 PM, Alex Gorbachev  
> wrote:
> > On Wed, Jul 25, 2018 at 6:07 PM, Alex Gorbachev  
> > wrote:
> >> On Wed, Jul 25, 2018 at 5:51 PM, Jason Dillaman  
> >> wrote:
> >>>
> >>>
> >>> On Wed, Jul 25, 2018 at 5:41 PM Alex Gorbachev 
> >>> wrote:
> 
>  I am not sure this related to RBD, but in case it is, this would be an
>  important bug to fix.
> 
>  Running LVM on top of RBD, XFS filesystem on top of that, consumed in 
>  RHEL
>  7.4.
> 
>  When running a large read operation and doing LVM snapshots during
>  that operation, the block being read winds up all zeroes in pagecache.
> 
>  Dropping the caches syncs up the block with what's on "disk" and
>  everything is fine.
> 
>  Working on steps to reproduce simply - ceph is Luminous 12.2.7, RHEL
>  client is Jewel 10.2.10-17.el7cp
> >>>
> >>>
> >>> Is this krbd or QEMU+librbd? If the former, what kernel version are you
> >>> running?
> >>
> >> It's krbd on RHEL.
> >>
> >> RHEL kernel:
> >>
> >> Linux dmg-cbcache01 3.10.0-862.el7.x86_64 #1 SMP Wed Mar 21 18:14:51
> >> EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
> >
> > Not sure if this is exactly replicating the issue, but I was able to
> > do this on two different systems:
> >
> > RHEL 7.4 kernel as above.
> >
> > Create a PVM PV on a mapped kRBD device
> >
> > example: pvcreate /dev/rbd/spin1/lvm1
> >
> > Create a VG and LV, make an XFS FS
> >
> > vgcreate datavg /dev/rbd/spin1/lvm1
> > lvcreate -n data1 -L 5G datavg
> > mkfs.xfs /dev/datavg/data1
> > 
> >
> > Get some large file and copy it to some other file, same storage or
> > different.  All is well.
> >
> > Now snapshot the LV
> >
> > lvcreate -l8%ORIGIN -s -n snap_data1 /dev/datavg/data1 --addtag backup
> >
> > Now try to copy that file again.  I get:
> >
> > NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:1:3470]
> >
> > And in dmesg (this is on Proxmox but I did the same on ESXi)
> >
> > [1397609.308673] sched: RT throttling activated
> > [1397658.759259] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s!
> > [kworker/0:1:2648]
> > [1397658.759354] Modules linked in: dm_snapshot dm_bufio rbd libceph
> > rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache
> > sunrpc ppdev joydev pcspkr sg parport_pc virtio_balloon parport shpchp
> > i2c_piix4 ip_tables xfs libcrc32c sd_mod sr_mod crc_t10dif
> > crct10dif_generic cdrom crct10dif_common ata_generic pata_acpi
> > virtio_scsi virtio_console virtio_net bochs_drm drm_kms_helper
> > syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ata_piix libata
> > serio_raw virtio_pci i2c_core virtio_ring virtio floppy dm_mirror
> > dm_region_hash dm_log dm_mod
> > [1397658.759400] CPU: 0 PID: 2648 Comm: kworker/0:1 Kdump: loaded Not
> > tainted 3.10.0-862.el7.x86_64 #1
> > [1397658.759402] Hardware name: QEMU Standard PC (i440FX + PIIX,
> > 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org
> > 04/01/2014
> > [1397658.759415] Workqueue: kcopyd do_work [dm_mod]
> > [1397658.759418] task: 932df65d3f40 ti: 932fb138c000 task.ti:
> > 932fb138c000
> > [1397658.759420] RIP: 0010:[]  []
> > copy_callback+0x50/0x130 [dm_snapshot]
> > [1397658.759426] RSP: 0018:932fb138fd08  EFLAGS: 0283
> > [1397658.759428] RAX: 0003e5e8 RBX: ebecc4943ec0 RCX:
> > 932ff4704068
> > [1397658.759430] RDX: 932dc8050d00 RSI: 932fd6a0f9b8 RDI:
> > 
> > [1397658.759431] RBP: 932fb138fd28 R08: 932dc7d2c0b0 R09:
> > 932dc8050d20
> > [1397658.759433] R10: c7d2b301 R11: ebecc01f4a00 R12:
> > 
> > [1397658.759435] R13: 000180090003 R14:  R15:
> > ff80
> > [1397658.759438] FS:  () GS:932fffc0()
> > knlGS:
> > [1397658.759440] CS:  0010 DS:  ES:  CR0: 8005003b
> > [1397658.759442] CR2: 7f17bcd5e860 CR3: 42c0e000 CR4:
> > 06f0
> > [1397658.759447] Call Trace:
> > [1397658.759452]  [] ? origin_resume+0x70/0x70 
> > [dm_snapshot]
> > [1397658.759459]  [] run_complete_job+0x6b/0xc0 [dm_mod]
> > [1397658.759466]  [] process_jobs+0x60/0x100 [dm_mod]
> > [1397658.759471]  [] ? kcopyd_put_pages+0x50/0x50 [dm_mod]
> > [1397658.759477]  [] do_work+0x42/0x90 [dm_mod]
> > [1397658.759483]  [] process_one_work+0x17f/0x440
> > [1397658.759485]  [] worker_thread+0x22c/0x3c0
> > [1397658.759489]  [] ? manage_workers.isra.24+0x2a0/0x2a0
> > [1397658.759494]  [] kthread+0xd1/0xe0
> > [1397658.759497]  [] ? insert_kthread_work+0x40/0x40
> > [1397658.759503]  [] ret_from_fork_nospec_begin+0x21/0x21
> > [1397658.759506]  [] ? insert_kthread_work+0x40/0x40
> >
> >
>
>
> Tried same on Ubuntu kernel 4.14.39 - no issues

I reproduced multiple soft lockups in copy_callback() on ceph-client
testing branch (based on 4.18-rc7) and on 4.14.39 -- wanted to confirm
it wasn't 

Re: [ceph-users] different size of rbd

2018-08-06 Thread Ilya Dryomov
On Mon, Aug 6, 2018 at 3:24 AM Dai Xiang  wrote:
>
> On Thu, Aug 02, 2018 at 01:04:46PM +0200, Ilya Dryomov wrote:
> > On Thu, Aug 2, 2018 at 12:49 PM  wrote:
> > >
> > > I create a rbd named dx-app with 500G, and map as rbd0.
> > >
> > > But i find the size is different with different cmd:
> > >
> > > [root@dx-app docker]# rbd info dx-app
> > > rbd image 'dx-app':
> > > size 32000 GB in 8192000 objects  <
> > > order 22 (4096 kB objects)
> > > block_name_prefix: rbd_data.1206643c9869
> > > format: 2
> > > features: layering
> > > flags:
> > > create_timestamp: Thu Aug  2 18:18:20 2018
> > >
> > > [root@dx-app docker]# lsblk
> > > NAME  MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
> > > vda   253:0020G  0 disk
> > > └─vda1253:1020G  0 part /
> > > vdb   253:16   0   200G  0 disk
> > > └─vg--test--data-lv--data 252:00 199.9G  0 lvm  /test/data
> > > vdc   253:32   0   200G  0 disk
> > > vdd   253:48   0   200G  0 disk /pkgs
> > > vde   253:64   0   200G  0 disk
> > > rbd0  251:00  31.3T  0 disk /test/docker  
> > > <
> > >
> > > [root@dx-app docker]# df -Th
> > > Filesystem  Type  Size  Used Avail 
> > > Use% Mounted on
> > > /dev/vda1   xfs20G   14G  6.5G  
> > > 68% /
> > > devtmpfsdevtmpfs  7.8G 0  7.8G   
> > > 0% /dev
> > > tmpfs   tmpfs 7.8G   12K  7.8G   
> > > 1% /dev/shm
> > > tmpfs   tmpfs 7.8G  3.7M  7.8G   
> > > 1% /run
> > > tmpfs   tmpfs 7.8G 0  7.8G   
> > > 0% /sys/fs/cgroup
> > > /dev/vdexfs   200G   33M  200G   
> > > 1% /test/software
> > > /dev/vddxfs   200G  117G   84G  
> > > 59% /pkgs
> > > /dev/mapper/vg--test--data-lv--data xfs   200G  334M  200G   1% 
> > > /test/data
> > > tmpfs   tmpfs 1.6G 0  1.6G   
> > > 0% /run/user/0
> > > /dev/rbd0   xfs   500G   34M  500G   
> > > 1% /test/docker  <
> > >
> > > Which is true?
> >
> > Did you run "rbd create", "rbd map", "mkfs.xfs" and "mount" by
> > yourself?  If not, how was that mount created?
>
> Yes, I do `rbd create`, `rbd map`, `mkfs.xfs` and `mount` myself.
>
> I think the size difference is because I ran `rbd resize 102400T` and
> then cancelled it.
>
> But the result is not what we want, right?

"rbd resize" resizes only the rbd image itself.  The filesystem needs
to be resized separately.  So if you created the filesystem and _then_
grew the image with "rbd resize", both are true: the old size for XFS
and the new size for the image.
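
For example, to grow an image and its filesystem together -- a rough
sketch, assuming the image lives in the default "rbd" pool and carries
an XFS filesystem mounted at /test/docker (adjust names to your setup):

  $ rbd resize --size 1T rbd/dx-app        # grow the rbd image
  $ xfs_growfs /test/docker                # then grow XFS to match

(resize2fs on the mapped device would be the ext4 equivalent.  To shrink
the image back down to the filesystem's 500G, "rbd resize" needs
--allow-shrink, and only do that if you are sure the filesystem never
grew past that size.)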

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] a little question about rbd_discard parameter len

2018-08-06 Thread Ilya Dryomov
On Mon, Aug 6, 2018 at 9:10 AM Will Zhao  wrote:
>
> Hi all:
>
> extern "C" int rbd_discard(rbd_image_t image, uint64_t ofs, uint64_t len)
> {
>   librbd::ImageCtx *ictx = (librbd::ImageCtx *)image;
>   tracepoint(librbd, discard_enter, ictx, ictx->name.c_str(),
>              ictx->snap_name.c_str(), ictx->read_only, ofs, len);
>   if (len > std::numeric_limits<int>::max()) {
>     tracepoint(librbd, discard_exit, -EINVAL);
>     return -EINVAL;
>   }
>   int r = ictx->io_work_queue->discard(ofs, len, ictx->skip_partial_discard);
>   tracepoint(librbd, discard_exit, r);
>   return r;
> }
>
> I tried to call the rbd python api, rbd.Image.discard, and I found there
> is a limit on the parameter len: it is a uint64, but it is capped at
> std::numeric_limits<int>::max(), so I can't discard too large a space at
> a time.  I wonder what the considerations behind this are?

rbd_discard() returns the number of bytes discarded or a negative error
code.  The return type is just an int though, so the range is capped at
INT_MAX to avoid overflow.

On top of that, current librbd doesn't rate limit on OSD requests in
all cases, so huge discards would need to be rejected anyway (although
the limit would probably be higher than ~2G).  Once that is fixed,
a new version of rbd_discard() with an updated signature could be
added.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] different size of rbd

2018-08-02 Thread Ilya Dryomov
On Thu, Aug 2, 2018 at 12:49 PM  wrote:
>
> I create a rbd named dx-app with 500G, and map as rbd0.
>
> But i find the size is different with different cmd:
>
> [root@dx-app docker]# rbd info dx-app
> rbd image 'dx-app':
> size 32000 GB in 8192000 objects  <
> order 22 (4096 kB objects)
> block_name_prefix: rbd_data.1206643c9869
> format: 2
> features: layering
> flags:
> create_timestamp: Thu Aug  2 18:18:20 2018
>
> [root@dx-app docker]# lsblk
> NAME  MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
> vda   253:0020G  0 disk
> └─vda1253:1020G  0 part /
> vdb   253:16   0   200G  0 disk
> └─vg--test--data-lv--data 252:00 199.9G  0 lvm  /test/data
> vdc   253:32   0   200G  0 disk
> vdd   253:48   0   200G  0 disk /pkgs
> vde   253:64   0   200G  0 disk
> rbd0  251:00  31.3T  0 disk /test/docker  
> <
>
> [root@dx-app docker]# df -Th
> Filesystem  Type  Size  Used Avail Use% 
> Mounted on
> /dev/vda1   xfs20G   14G  6.5G  68% /
> devtmpfsdevtmpfs  7.8G 0  7.8G   0% 
> /dev
> tmpfs   tmpfs 7.8G   12K  7.8G   1% 
> /dev/shm
> tmpfs   tmpfs 7.8G  3.7M  7.8G   1% 
> /run
> tmpfs   tmpfs 7.8G 0  7.8G   0% 
> /sys/fs/cgroup
> /dev/vdexfs   200G   33M  200G   1% 
> /test/software
> /dev/vddxfs   200G  117G   84G  59% 
> /pkgs
> /dev/mapper/vg--test--data-lv--data xfs   200G  334M  200G   1% /test/data
> tmpfs   tmpfs 1.6G 0  1.6G   0% 
> /run/user/0
> /dev/rbd0   xfs   500G   34M  500G   1% 
> /test/docker  <
>
> Which is true?

Did you run "rbd create", "rbd map", "mkfs.xfs" and "mount" by
yourself?  If not, how was that mount created?

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbdmap service issue

2018-08-01 Thread Ilya Dryomov
On Wed, Aug 1, 2018 at 11:13 AM  wrote:
>
> Hi!
>
> I found an rbd map service issue:
> [root@dx-test ~]# systemctl status rbdmap
> ● rbdmap.service - Map RBD devices
>Loaded: loaded (/usr/lib/systemd/system/rbdmap.service; enabled; vendor 
> preset: disabled)
>Active: active (exited) (Result: exit-code) since 六 2018-07-28 13:55:01 
> CST; 11min ago
>   Process: 1459 ExecStart=/usr/bin/rbdmap map (code=exited, status=1/FAILURE)
>  Main PID: 1459 (code=exited, status=1/FAILURE)
>
> 7月 28 13:55:01 dx-test.novalocal systemd[1]: Started Map RBD devices.
> 7月 28 13:55:01 dx-test.novalocal systemd[1]: Starting Map RBD devices...
> 7月 28 14:01:19 dx-test.novalocal systemd[1]: rbdmap.service: main process 
> exited, code=exited, status=1/FAILURE
> [root@dx-test ~]# echo $?
> 0
>
> I am testing rbd map service HA with the ceph cluster down.
>
> I shut down the ceph cluster and monitored the rbdmap service; it spent
> 6 mins starting and then failed.

rbdmap is nothing but a convenience service.  It attempts to map images
listed in /etc/ceph/rbdmap on start and to unmap them on stop.

If the kernel can't reach the cluster, "rbd map" is expected to fail.
Check dmesg for details.
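
For reference, /etc/ceph/rbdmap is just one image per line plus its map
options, something along these lines (illustrative names):

  rbd/test_rbd    id=admin,keyring=/etc/ceph/ceph.client.admin.keyring

On start the unit runs "rbd map" for each entry, so it can never succeed
in a situation where running "rbd map" by hand would fail.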

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] LVM on top of RBD apparent pagecache corruption with snapshots

2018-07-27 Thread Ilya Dryomov
On Thu, Jul 26, 2018 at 5:15 PM Alex Gorbachev  wrote:
>
> On Thu, Jul 26, 2018 at 9:49 AM, Ilya Dryomov  wrote:
> > On Thu, Jul 26, 2018 at 1:07 AM Alex Gorbachev  
> > wrote:
> >>
> >> On Wed, Jul 25, 2018 at 6:07 PM, Alex Gorbachev  
> >> wrote:
> >> > On Wed, Jul 25, 2018 at 5:51 PM, Jason Dillaman  
> >> > wrote:
> >> >>
> >> >>
> >> >> On Wed, Jul 25, 2018 at 5:41 PM Alex Gorbachev 
> >> >> 
> >> >> wrote:
> >> >>>
> >> >>> I am not sure this related to RBD, but in case it is, this would be an
> >> >>> important bug to fix.
> >> >>>
> >> >>> Running LVM on top of RBD, XFS filesystem on top of that, consumed in 
> >> >>> RHEL
> >> >>> 7.4.
> >> >>>
> >> >>> When running a large read operation and doing LVM snapshots during
> >> >>> that operation, the block being read winds up all zeroes in pagecache.
> >> >>>
> >> >>> Dropping the caches syncs up the block with what's on "disk" and
> >> >>> everything is fine.
> >> >>>
> >> >>> Working on steps to reproduce simply - ceph is Luminous 12.2.7, RHEL
> >> >>> client is Jewel 10.2.10-17.el7cp
> >> >>
> >> >>
> >> >> Is this krbd or QEMU+librbd? If the former, what kernel version are you
> >> >> running?
> >> >
> >> > It's krbd on RHEL.
> >> >
> >> > RHEL kernel:
> >> >
> >> > Linux dmg-cbcache01 3.10.0-862.el7.x86_64 #1 SMP Wed Mar 21 18:14:51
> >> > EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
> >>
> >> Not sure if this is exactly replicating the issue, but I was able to
> >> do this on two different systems:
> >>
> >> RHEL 7.4 kernel as above.
> >
> > 3.10.0-862.el7 is a RHEL 7.5 kernel.
> > I think you have a 7.4 system with a 7.5 kernel.
>
> I was wrong, it's 7.5:
>
> Red Hat Enterprise Linux Server release 7.5 (Maipo)
>
> >
> >>
> >> Create a PVM PV on a mapped kRBD device
> >>
> >> example: pvcreate /dev/rbd/spin1/lvm1
> >>
> >> Create a VG and LV, make an XFS FS
> >>
> >> vgcreate datavg /dev/rbd/spin1/lvm1
> >> lvcreate -n data1 -L 5G datavg
> >> mkfs.xfs /dev/datavg/data1
> >> 
> >>
> >> Get some large file and copy it to some other file, same storage or
> >> different.  All is well.
> >
> > Same storage as in the same XFS filesystem?  Is copying it from
> > /mnt/foo to /mnt/bar enough (assuming /dev/datavg/data1 is on /mnt)?
>
> I tried both same and different.  Yes, copy on the same fs appears to
> be enough to trigger this.
>
> >
> >>
> >> Now snapshot the LV
> >>
> >> lvcreate -l8%ORIGIN -s -n snap_data1 /dev/datavg/data1 --addtag backup
> >
> > 8% of 5G is 400M.  How large is the file?
>
> This is just an example, I tried it with a 94 GB filesystem and a 5GB
> file, and also with a 700gb VG and a 70 GB filesystem and a 10 GB file
>
> >
> >>
> >> Now try to copy that file again.  I get:
> >>
> >> NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:1:3470]
> >>
> >> And in dmesg (this is on Proxmox but I did the same on ESXi)
> >>
> >> [1397609.308673] sched: RT throttling activated
> >> [1397658.759259] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s!
> >> [kworker/0:1:2648]
> >> [1397658.759354] Modules linked in: dm_snapshot dm_bufio rbd libceph
> >> rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache
> >> sunrpc ppdev joydev pcspkr sg parport_pc virtio_balloon parport shpchp
> >> i2c_piix4 ip_tables xfs libcrc32c sd_mod sr_mod crc_t10dif
> >> crct10dif_generic cdrom crct10dif_common ata_generic pata_acpi
> >> virtio_scsi virtio_console virtio_net bochs_drm drm_kms_helper
> >> syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ata_piix libata
> >> serio_raw virtio_pci i2c_core virtio_ring virtio floppy dm_mirror
> >> dm_region_hash dm_log dm_mod
> >> [1397658.759400] CPU: 0 PID: 2648 Comm: kworker/0:1 Kdump: loaded Not
> >> tainted 3.10.0-862.el7.x86_64 #1
> >> [1397658.759402] Hardware name: QEMU Standard PC (i440FX + PIIX,
> >> 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org
> >> 04/01/2014
>

Re: [ceph-users] LVM on top of RBD apparent pagecache corruption with snapshots

2018-07-26 Thread Ilya Dryomov
On Thu, Jul 26, 2018 at 1:07 AM Alex Gorbachev  wrote:
>
> On Wed, Jul 25, 2018 at 6:07 PM, Alex Gorbachev  
> wrote:
> > On Wed, Jul 25, 2018 at 5:51 PM, Jason Dillaman  wrote:
> >>
> >>
> >> On Wed, Jul 25, 2018 at 5:41 PM Alex Gorbachev 
> >> wrote:
> >>>
> >>> I am not sure this related to RBD, but in case it is, this would be an
> >>> important bug to fix.
> >>>
> >>> Running LVM on top of RBD, XFS filesystem on top of that, consumed in RHEL
> >>> 7.4.
> >>>
> >>> When running a large read operation and doing LVM snapshots during
> >>> that operation, the block being read winds up all zeroes in pagecache.
> >>>
> >>> Dropping the caches syncs up the block with what's on "disk" and
> >>> everything is fine.
> >>>
> >>> Working on steps to reproduce simply - ceph is Luminous 12.2.7, RHEL
> >>> client is Jewel 10.2.10-17.el7cp
> >>
> >>
> >> Is this krbd or QEMU+librbd? If the former, what kernel version are you
> >> running?
> >
> > It's krbd on RHEL.
> >
> > RHEL kernel:
> >
> > Linux dmg-cbcache01 3.10.0-862.el7.x86_64 #1 SMP Wed Mar 21 18:14:51
> > EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
>
> Not sure if this is exactly replicating the issue, but I was able to
> do this on two different systems:
>
> RHEL 7.4 kernel as above.

3.10.0-862.el7 is a RHEL 7.5 kernel.
I think you have a 7.4 system with a 7.5 kernel.

>
> Create a PVM PV on a mapped kRBD device
>
> example: pvcreate /dev/rbd/spin1/lvm1
>
> Create a VG and LV, make an XFS FS
>
> vgcreate datavg /dev/rbd/spin1/lvm1
> lvcreate -n data1 -L 5G datavg
> mkfs.xfs /dev/datavg/data1
> 
>
> Get some large file and copy it to some other file, same storage or
> different.  All is well.

Same storage as in the same XFS filesystem?  Is copying it from
/mnt/foo to /mnt/bar enough (assuming /dev/datavg/data1 is on /mnt)?

>
> Now snapshot the LV
>
> lvcreate -l8%ORIGIN -s -n snap_data1 /dev/datavg/data1 --addtag backup

8% of 5G is 400M.  How large is the file?

>
> Now try to copy that file again.  I get:
>
> NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:1:3470]
>
> And in dmesg (this is on Proxmox but I did the same on ESXi)
>
> [1397609.308673] sched: RT throttling activated
> [1397658.759259] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s!
> [kworker/0:1:2648]
> [1397658.759354] Modules linked in: dm_snapshot dm_bufio rbd libceph
> rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache
> sunrpc ppdev joydev pcspkr sg parport_pc virtio_balloon parport shpchp
> i2c_piix4 ip_tables xfs libcrc32c sd_mod sr_mod crc_t10dif
> crct10dif_generic cdrom crct10dif_common ata_generic pata_acpi
> virtio_scsi virtio_console virtio_net bochs_drm drm_kms_helper
> syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ata_piix libata
> serio_raw virtio_pci i2c_core virtio_ring virtio floppy dm_mirror
> dm_region_hash dm_log dm_mod
> [1397658.759400] CPU: 0 PID: 2648 Comm: kworker/0:1 Kdump: loaded Not
> tainted 3.10.0-862.el7.x86_64 #1
> [1397658.759402] Hardware name: QEMU Standard PC (i440FX + PIIX,
> 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org
> 04/01/2014
> [1397658.759415] Workqueue: kcopyd do_work [dm_mod]
> [1397658.759418] task: 932df65d3f40 ti: 932fb138c000 task.ti:
> 932fb138c000
> [1397658.759420] RIP: 0010:[]  []
> copy_callback+0x50/0x130 [dm_snapshot]
> [1397658.759426] RSP: 0018:932fb138fd08  EFLAGS: 0283
> [1397658.759428] RAX: 0003e5e8 RBX: ebecc4943ec0 RCX:
> 932ff4704068
> [1397658.759430] RDX: 932dc8050d00 RSI: 932fd6a0f9b8 RDI:
> 
> [1397658.759431] RBP: 932fb138fd28 R08: 932dc7d2c0b0 R09:
> 932dc8050d20
> [1397658.759433] R10: c7d2b301 R11: ebecc01f4a00 R12:
> 
> [1397658.759435] R13: 000180090003 R14:  R15:
> ff80
> [1397658.759438] FS:  () GS:932fffc0()
> knlGS:
> [1397658.759440] CS:  0010 DS:  ES:  CR0: 8005003b
> [1397658.759442] CR2: 7f17bcd5e860 CR3: 42c0e000 CR4:
> 06f0
> [1397658.759447] Call Trace:
> [1397658.759452]  [] ? origin_resume+0x70/0x70 [dm_snapshot]
> [1397658.759459]  [] run_complete_job+0x6b/0xc0 [dm_mod]
> [1397658.759466]  [] process_jobs+0x60/0x100 [dm_mod]
> [1397658.759471]  [] ? kcopyd_put_pages+0x50/0x50 [dm_mod]
> [1397658.759477]  [] do_work+0x42/0x90 [dm_mod]
> [1397658.759483]  [] process_one_work+0x17f/0x440
> [1397658.759485]  [] worker_thread+0x22c/0x3c0
> [1397658.759489]  [] ? manage_workers.isra.24+0x2a0/0x2a0
> [1397658.759494]  [] kthread+0xd1/0xe0
> [1397658.759497]  [] ? insert_kthread_work+0x40/0x40
> [1397658.759503]  [] ret_from_fork_nospec_begin+0x21/0x21
> [1397658.759506]  [] ? insert_kthread_work+0x40/0x40

Did you experiment with the snapshot chunk size (lvcreate --chunksize)?
I wonder if the default snapshot chunk size is the same on RHEL 7.4 and
on the Ubuntu kernel (4.14.39) you tried.
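
If you want to rule that out, the chunk size can be pinned explicitly on
both systems, e.g. (a sketch only, 512K is an arbitrary choice):

  # lvcreate -l8%ORIGIN -s -c 512K -n snap_data1 /dev/datavg/data1 --addtag backup
  # lvs -o lv_name,chunk_size datavg       # confirm what was actually used

lvcreate -c/--chunksize takes powers of two between 4K and 512K for
snapshot COW chunks.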

Thanks,

Ilya

Re: [ceph-users] LVM on top of RBD apparent pagecache corruption with snapshots

2018-07-26 Thread Ilya Dryomov
On Thu, Jul 26, 2018 at 1:55 AM Alex Gorbachev  wrote:
>
> On Wed, Jul 25, 2018 at 7:07 PM, Alex Gorbachev  
> wrote:
> > On Wed, Jul 25, 2018 at 6:07 PM, Alex Gorbachev  
> > wrote:
> >> On Wed, Jul 25, 2018 at 5:51 PM, Jason Dillaman  
> >> wrote:
> >>>
> >>>
> >>> On Wed, Jul 25, 2018 at 5:41 PM Alex Gorbachev 
> >>> wrote:
> 
>  I am not sure this related to RBD, but in case it is, this would be an
>  important bug to fix.
> 
>  Running LVM on top of RBD, XFS filesystem on top of that, consumed in 
>  RHEL
>  7.4.
> 
>  When running a large read operation and doing LVM snapshots during
>  that operation, the block being read winds up all zeroes in pagecache.
> 
>  Dropping the caches syncs up the block with what's on "disk" and
>  everything is fine.
> 
>  Working on steps to reproduce simply - ceph is Luminous 12.2.7, RHEL
>  client is Jewel 10.2.10-17.el7cp
> >>>
> >>>
> >>> Is this krbd or QEMU+librbd? If the former, what kernel version are you
> >>> running?
> >>
> >> It's krbd on RHEL.
> >>
> >> RHEL kernel:
> >>
> >> Linux dmg-cbcache01 3.10.0-862.el7.x86_64 #1 SMP Wed Mar 21 18:14:51
> >> EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
> >
> > Not sure if this is exactly replicating the issue, but I was able to
> > do this on two different systems:
> >
> > RHEL 7.4 kernel as above.
> >
> > Create a PVM PV on a mapped kRBD device
> >
> > example: pvcreate /dev/rbd/spin1/lvm1
> >
> > Create a VG and LV, make an XFS FS
> >
> > vgcreate datavg /dev/rbd/spin1/lvm1
> > lvcreate -n data1 -L 5G datavg
> > mkfs.xfs /dev/datavg/data1
> > 
> >
> > Get some large file and copy it to some other file, same storage or
> > different.  All is well.
> >
> > Now snapshot the LV
> >
> > lvcreate -l8%ORIGIN -s -n snap_data1 /dev/datavg/data1 --addtag backup
> >
> > Now try to copy that file again.  I get:
> >
> > NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:1:3470]
> >
> > And in dmesg (this is on Proxmox but I did the same on ESXi)
> >
> > [1397609.308673] sched: RT throttling activated
> > [1397658.759259] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s!
> > [kworker/0:1:2648]
> > [1397658.759354] Modules linked in: dm_snapshot dm_bufio rbd libceph
> > rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache
> > sunrpc ppdev joydev pcspkr sg parport_pc virtio_balloon parport shpchp
> > i2c_piix4 ip_tables xfs libcrc32c sd_mod sr_mod crc_t10dif
> > crct10dif_generic cdrom crct10dif_common ata_generic pata_acpi
> > virtio_scsi virtio_console virtio_net bochs_drm drm_kms_helper
> > syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ata_piix libata
> > serio_raw virtio_pci i2c_core virtio_ring virtio floppy dm_mirror
> > dm_region_hash dm_log dm_mod
> > [1397658.759400] CPU: 0 PID: 2648 Comm: kworker/0:1 Kdump: loaded Not
> > tainted 3.10.0-862.el7.x86_64 #1
> > [1397658.759402] Hardware name: QEMU Standard PC (i440FX + PIIX,
> > 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org
> > 04/01/2014
> > [1397658.759415] Workqueue: kcopyd do_work [dm_mod]
> > [1397658.759418] task: 932df65d3f40 ti: 932fb138c000 task.ti:
> > 932fb138c000
> > [1397658.759420] RIP: 0010:[]  []
> > copy_callback+0x50/0x130 [dm_snapshot]
> > [1397658.759426] RSP: 0018:932fb138fd08  EFLAGS: 0283
> > [1397658.759428] RAX: 0003e5e8 RBX: ebecc4943ec0 RCX:
> > 932ff4704068
> > [1397658.759430] RDX: 932dc8050d00 RSI: 932fd6a0f9b8 RDI:
> > 
> > [1397658.759431] RBP: 932fb138fd28 R08: 932dc7d2c0b0 R09:
> > 932dc8050d20
> > [1397658.759433] R10: c7d2b301 R11: ebecc01f4a00 R12:
> > 
> > [1397658.759435] R13: 000180090003 R14:  R15:
> > ff80
> > [1397658.759438] FS:  () GS:932fffc0()
> > knlGS:
> > [1397658.759440] CS:  0010 DS:  ES:  CR0: 8005003b
> > [1397658.759442] CR2: 7f17bcd5e860 CR3: 42c0e000 CR4:
> > 06f0
> > [1397658.759447] Call Trace:
> > [1397658.759452]  [] ? origin_resume+0x70/0x70 
> > [dm_snapshot]
> > [1397658.759459]  [] run_complete_job+0x6b/0xc0 [dm_mod]
> > [1397658.759466]  [] process_jobs+0x60/0x100 [dm_mod]
> > [1397658.759471]  [] ? kcopyd_put_pages+0x50/0x50 [dm_mod]
> > [1397658.759477]  [] do_work+0x42/0x90 [dm_mod]
> > [1397658.759483]  [] process_one_work+0x17f/0x440
> > [1397658.759485]  [] worker_thread+0x22c/0x3c0
> > [1397658.759489]  [] ? manage_workers.isra.24+0x2a0/0x2a0
> > [1397658.759494]  [] kthread+0xd1/0xe0
> > [1397658.759497]  [] ? insert_kthread_work+0x40/0x40
> > [1397658.759503]  [] ret_from_fork_nospec_begin+0x21/0x21
> > [1397658.759506]  [] ? insert_kthread_work+0x40/0x40
> >
> >
>
>
> Tried same on Ubuntu kernel 4.14.39 - no issues

What about the original "zeroes in the page cache" issue?  Do you
see it with 4.14.39?  Trying to establish whether the soft lockup is
related.


Re: [ceph-users] CephFS+NFS For VMWare

2018-07-02 Thread Ilya Dryomov
On Fri, Jun 29, 2018 at 8:08 PM Nick Fisk  wrote:
>
> This is for us peeps using Ceph with VMWare.
>
>
>
> My current favoured solution for consuming Ceph in VMWare is via RBD’s 
> formatted with XFS and exported via NFS to ESXi. This seems to perform better 
> than iSCSI+VMFS which seems to not play nicely with Ceph’s PG contention 
> issues particularly if working with thin provisioned VMDK’s.
>
>
>
> I’ve still been noticing some performance issues however, mainly noticeable 
> when doing any form of storage migrations. This is largely due to the way 
> vSphere transfers VM’s in 64KB IO’s at a QD of 32. vSphere does this so 
> Arrays with QOS can balance the IO easier than if larger IO’s were submitted. 
> However Ceph’s PG locking means that only one or two of these IO’s can happen 
> at a time, seriously lowering throughput. Typically you won’t be able to push 
> more than 20-25MB/s during a storage migration
>
>
>
> There is also another issue in that the IO needed for the XFS journal on the 
> RBD, can cause contention and effectively also means every NFS write IO sends 
> 2 down to Ceph. This can have an impact on latency as well. Due to possible 
> PG contention caused by the XFS journal updates when multiple IO’s are in 
> flight, you normally end up making more and more RBD’s to try and spread the 
> load. This normally means you end up having to do storage migrations…..you 
> can see where I’m getting at here.
>
>
>
> I’ve been thinking for a while that CephFS works around a lot of these 
> limitations.
>
>
>
> 1.   It supports fancy striping, so should mean there is less per object 
> contention

Hi Nick,

Fancy striping is supported since 4.17.  I think its primary use case
is small sequential I/Os, so not sure if it is going to help much, but
it might be worth doing some benchmarking.
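
If you do benchmark it, striping is chosen at image creation time, e.g.
(illustrative values only):

  $ rbd create --size 1T --stripe-unit 64K --stripe-count 8 rbd/nfs_backing
  $ rbd map rbd/nfs_backing     # non-default striping needs kernel 4.17+

so eight consecutive 64K units are spread across eight different objects
instead of all landing in the same 4M object.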

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd map hangs

2018-06-07 Thread Ilya Dryomov
On Thu, Jun 7, 2018 at 6:30 PM, Jason Dillaman  wrote:
> On Thu, Jun 7, 2018 at 12:13 PM, Tracy Reed  wrote:
>> On Thu, Jun 07, 2018 at 08:40:50AM PDT, Ilya Dryomov spake thusly:
>>> > Kernel is Linux cpu04.mydomain.com 3.10.0-229.20.1.el7.x86_64 #1 SMP Tue 
>>> > Nov 3 19:10:07 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> This is a *very* old kernel.
>>
>> It's what's shipping with CentOS/RHEL 7 and probably what the vast
>> majority of people are using aside from perhaps the Ubuntu LTS people.
>
> I think what Ilya is saying is that it's a very old RHEL 7-based
> kernel (RHEL 7.1?). For example, the current RHEL 7.5 kernel includes
> numerous improvements that have been backported from the current
> upstream kernel.

Correct.  RHEL 7.1 isn't supported anymore -- even the EUS (Extended
Update Support) from Red Hat ended more than a year ago.

I would recommend an upgrade to 7.5 or a recent upstream kernel from
ELRepo.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd map hangs

2018-06-07 Thread Ilya Dryomov
On Thu, Jun 7, 2018 at 4:33 PM, Tracy Reed  wrote:
> On Thu, Jun 07, 2018 at 02:05:31AM PDT, Ilya Dryomov spake thusly:
>> > find /sys/kernel/debug/ceph -type f -print -exec cat {} \;
>>
>> Can you paste the entire output of that command?
>>
>> Which kernel are you running on the client box?
>
> Kernel is Linux cpu04.mydomain.com 3.10.0-229.20.1.el7.x86_64 #1 SMP Tue Nov 
> 3 19:10:07 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

This is a *very* old kernel.

>
> output is:
>
> # find /sys/kernel/debug/ceph -type f -print -exec cat {} \;
> /sys/kernel/debug/ceph/b2b00aae-f00d-41b4-a29b-58859aa41375.client31276017/osdmap
> epoch 232455
> flags
> pool 0 pg_num 2500 (4095) read_tier -1 write_tier -1
> pool 2 pg_num 512 (511) read_tier -1 write_tier -1
> pool 3 pg_num 128 (127) read_tier -1 write_tier -1
> pool 4 pg_num 100 (127) read_tier -1 write_tier -1
> osd010.0.5.3:680154%(exists, up)100%
> osd110.0.5.3:681257%(exists, up)100%
> osd2(unknown sockaddr family 0)   0%(doesn't exist) 100%
> osd310.0.5.4:681250%(exists, up)100%
> osd4(unknown sockaddr family 0)   0%(doesn't exist) 100%
> osd5(unknown sockaddr family 0)   0%(doesn't exist) 100%
> osd610.0.5.9:686137%(exists, up)100%
> osd710.0.5.9:687628%(exists, up)100%
> osd810.0.5.9:686443%(exists, up)100%
> osd910.0.5.9:683630%(exists, up)100%
> osd10   10.0.5.9:682022%(exists, up)100%
> osd11   10.0.5.9:684454%(exists, up)100%
> osd12   10.0.5.9:680343%(exists, up)100%
> osd13   10.0.5.9:682641%(exists, up)100%
> osd14   10.0.5.9:685337%(exists, up)100%
> osd15   10.0.5.9:687236%(exists, up)100%
> osd16   (unknown sockaddr family 0)   0%(doesn't exist) 100%
> osd17   10.0.5.9:681244%(exists, up)100%
> osd18   10.0.5.9:681748%(exists, up)100%
> osd19   10.0.5.9:685633%(exists, up)100%
> osd20   10.0.5.9:680846%(exists, up)100%
> osd21   10.0.5.9:687141%(exists, up)100%
> osd22   10.0.5.9:681649%(exists, up)100%
> osd23   10.0.5.9:682356%(exists, up)100%
> osd24   10.0.5.9:680054%(exists, up)100%
> osd25   10.0.5.9:684854%(exists, up)100%
> osd26   10.0.5.9:684037%(exists, up)100%
> osd27   10.0.5.9:688369%(exists, up)100%
> osd28   10.0.5.9:683339%(exists, up)100%
> osd29   10.0.5.9:680938%(exists, up)100%
> osd30   10.0.5.9:682951%(exists, up)100%
> osd31   10.0.5.11:6828   47%(exists, up)100%
> osd32   10.0.5.11:6848   25%(exists, up)100%
> osd33   10.0.5.11:6802   56%(exists, up)100%
> osd34   10.0.5.11:6840   35%(exists, up)100%
> osd35   10.0.5.11:6856   32%(exists, up)100%
> osd36   10.0.5.11:6832   26%(exists, up)100%
> [88/1848]
> osd37   10.0.5.11:6868   42%(exists, up)100%
> osd38   (unknown sockaddr family 0)   0%(doesn't exist) 100%
> osd39   10.0.5.11:6812   52%(exists, up)100%
> osd40   10.0.5.11:6864   44%(exists, up)100%
> osd41   10.0.5.11:6801   25%(exists, up)100%
> osd42   10.0.5.11:6872   39%(exists, up)100%
> osd43   10.0.5.13:6809   38%(exists, up)100%
> osd44   10.0.5.11:6844   47%(exists, up)100%
> osd45   10.0.5.11:6816   20%(exists, up)100%
> osd46   10.0.5.3:680058%(exists, up)100%
> osd47   10.0.5.2:680843%(exists, up)100%
> osd48   10.0.5.2:680444%(exists, up)100%
> osd49   10.0.5.2:681244%(exists, up)100%
> osd50   10.0.5.2:680047%(exists, up)100%
> osd51   10.0.5.4:680843%(exists, up)100%
> osd52   10.0.5.12:6815   41%(exists, up)100%
> osd53   10.0.5.11:6820   24%(up)100%
> osd54   10.0.5.11:6876   34%(exists, up)100%
> osd55   10.0.5.11:6836   48%(exists, up)100%
> osd56   10.0.5.11:6824   31%(exists, up)100%
> osd57   10.0.5.11:6860   48%(exists, up)100%
> osd58   10.0.5.11:6852   35%(exists, up)100%
> osd59   10.0.5.11:6800   42%(exists, up)100%
> osd60   10.0.5.11:6880   58%(exists, up)100%
> osd61   10.0.5.3:680352%(exists, up)100%
> osd62   10.0.5.12:6800   42%(exists, up)100%
> osd63   10.0.5.12:6819   46%(exists, up)100%
> osd64   10.0.5.12:6809   44%(exists, up)100%
> osd65   10.0.5.13:6800   44%(exists, up)100%
> osd66   (unknown sockaddr family 0)   0%(doesn't exist) 100%
> osd67   10.0.5.13:6808   50%(exists, up)100%
> osd6

Re: [ceph-users] rbd map hangs

2018-06-07 Thread Ilya Dryomov
On Thu, Jun 7, 2018 at 5:12 AM, Tracy Reed  wrote:
>
> Hello all! I'm running luminous with old style non-bluestore OSDs. ceph
> 10.2.9 clients though, haven't been able to upgrade those yet.
>
> Occasionally I have access to rbds hang on the client such as right now.
> I tried to dd a VM image into a mapped rbd and it just hung.
>
> Then I tried to map a new rbd and that hangs also.
>
> How would I troubleshoot this? /var/log/ceph is empty, nothing in
> /var/log/messages or dmesg etc.
>
> I just discovered:
>
> find /sys/kernel/debug/ceph -type f -print -exec cat {} \;
>
> which produces (among other seemingly innocuous things, let me know if
> anyone wants to see the rest):
>
> osd2(unknown sockaddr family 0) 0%(doesn't exist) 100%
>
> which seems suspicious.

Can you paste the entire output of that command?

Which kernel are you running on the client box?

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to run MySQL (or other database ) on Ceph using KRBD ?

2018-06-05 Thread Ilya Dryomov
On Tue, Jun 5, 2018 at 4:07 AM, 李昊华  wrote:
> Thanks for reading my questions!
>
> I want to run MySQL on Ceph using KRBD because KRBD is faster than librbd.
> And I know KRBD is a kernel module and we can use KRBD to mount the RBD
> device on the operating systems.
>
> It is easy to use the command-line tool to mount the RBD device on the
> operating system. Are there any other ways to use the RBD module, such as
> changing MySQL's IO interface to use the krbd interface?
>
> I saw krbd.h and found that it only offers a few function interfaces,
> whereas librbd offers us many interfaces, such as creating and cloning an
> RBD device.
>
> And I want to verify my hypotheses below:
>
> 1. Librbd provides a richer interface than krbd, and some functions cannot
> be implemented through krbd

librbd is what should be used to create and delete images, perform
maintenance operations on exisitng images, etc.  krbd just drives the
I/O to a mapped image.

>
> 2. Applications can only use krbd via the command line tool instead of code
> interfaces.

Yes, libkrbd / krbd.h is an internal convenience library.  It can be
changed into a standalone .so fairly easily, but we never got to it.

Once the image is mapped with "rbd map" (krbd_map() from krbd.h),
it shows up as a regular block device.  Applications can use normal
system calls like open(), read(), write(), fallocate(), etc on it.
MySQL doesn't need changing to work on krbd.

Any guide on setting up MySQL to use a raw block device should work.
Just substitute e.g. /dev/rbd0 for a block device.
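
A minimal sketch (made-up names; most setups put a filesystem on the
device rather than feeding MySQL the raw device):

  $ rbd create --size 200G rbd/mysql-data
  $ rbd map rbd/mysql-data                 # appears as e.g. /dev/rbd0
  $ mkfs.xfs /dev/rbd0
  $ mount /dev/rbd0 /var/lib/mysql
  $ chown -R mysql:mysql /var/lib/mysql    # then start mysqld with datadir there

MySQL itself is unchanged -- all the Ceph-specific work happens inside
the kernel once the image is mapped.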

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous 12.2.4: CephFS kernel client (4.15/4.16) shows up as jewel

2018-05-31 Thread Ilya Dryomov
On Thu, May 31, 2018 at 2:39 PM, Heðin Ejdesgaard Møller  wrote:
> I have encountered the same issue and wrote to the mailing list about it, 
> with the subject: [ceph-users] krbd upmap support on kernel-4.16 ?
>
> The odd thing is that I can krbd map an image after setting min compat to 
> luminous, without specifying --yes-i-really-mean-it. It's only necessary at
> the point in time when you set the min_compat parameter, if you at that time
> have an image krbd-mapped.

Correct.  You are forcing the set-require-min-compat-client setting,
but as the feature bit that is causing this isn't actually required,
"rbd map" and everything else continues to work as before.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous 12.2.4: CephFS kernel client (4.15/4.16) shows up as jewel

2018-05-31 Thread Ilya Dryomov
On Thu, May 31, 2018 at 4:16 AM, Linh Vu  wrote:
> Hi all,
>
>
> On my test Luminous 12.2.4 cluster, with this set (initially so I could use
> upmap in the mgr balancer module):
>
>
> # ceph osd set-require-min-compat-client luminous
>
> # ceph osd dump | grep client
> require_min_compat_client luminous
> min_compat_client jewel
>
> Not quite sure why min_compat_client is still jewel.
>
>
> I have created cephfs on the cluster, and use a mix of fuse and kernel
> clients to test it. The fuse clients are on ceph-fuse 12.2.5 and show up as
> luminous clients.
>
>
> The kernel client (just one mount) either on kernel 4.15.13 or 4.16.13 (the
> latest, just out) is showing up as jewel, seen in `ceph features`:
>
> "client": {
> "group": {
> "features": "0x7010fb86aa42ada",
> "release": "jewel",
> "num": 1
> },
> "group": {
> "features": "0x1ffddff8eea4fffb",
> "release": "luminous",
> "num": 8
> }
> }
>
> I thought I read somewhere here that kernel 4.13+ should have full support
> for Luminous, so I don't know why this is showing up as jewel. I'm also
> surprised that it could mount and write to my cephfs share just fine despite
> that. It also doesn't seem to matter when I run ceph balancer with upmap
> mode despite this client being connected and writing files.
>
>
> I can't see anything in mount.ceph options to specify jewel vs luminous
> either.
>
>
> Is this just a mislabel i.e my kernel client is actually fully Luminous
> supported but showing up as Jewel? Or is the kernel client a bit behind
> still?

All luminous features, including upmap, are supported in 4.13+.

This is just a reporting issue caused by the fact that MSG_ADDR2 (which
came before luminous and isn't a required feature) isn't implemented in
the kernel client yet.

>
>
> Currently we have a mix of ceph-fuse 12.2.5 and kernel client 4.15.13 in our
> production cluster, and I'm looking to set `ceph osd
> set-require-min-compat-client luminous` so I can use ceph balancer with
> upmap mode.

You will need to append --yes-i-really-mean-it as a work around.
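
i.e. something along the lines of:

  $ ceph osd set-require-min-compat-client luminous --yes-i-really-mean-it

which forces the setting even though the connected kernel client still
reports itself as jewel.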

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor CentOS 7.5 client performance

2018-05-18 Thread Ilya Dryomov
On Fri, May 18, 2018 at 3:25 PM, Donald "Mac" McCarthy
 wrote:
> Ilya,
>   Your recommendation worked beautifully.  Thank you!
>
> Is this something that is expected behavior or is this something that should 
> be filed as a bug?
>
> I ask because I have just enough experience with ceph at this point to be 
> very dangerous and not enough history to know if this was expected from past 
> behavior.
>
> I did the dd testing after noticing poor read/write from a set of machines 
> that use ceph to back their home directories.

This is a bug, definitely not expected.  A Red Hat BZ has already been
filed.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor CentOS 7.5 client performance

2018-05-17 Thread Ilya Dryomov
On Wed, May 16, 2018 at 8:27 PM, Donald "Mac" McCarthy
 wrote:
> CephFS.  8 core atom C2758, 16 GB ram, 256GB ssd, 2.5 GB NIC (supermicro 
> microblade node).
>
> Read test:
> dd if=/ceph/1GB.test of=/dev/null bs=1M

Yup, looks like a kcephfs regression.  The performance of the above
command is highly dependent on readahead settings and it looks like
that got goofed up.

After mounting w/o options on 7.4:

  $ cat /sys/devices/virtual/bdi/ceph-1/read_ahead_kb
  8192

Same on 7.5:

  $ cat /sys/devices/virtual/bdi/ceph-1/read_ahead_kb
  0

As a workaround, try resetting it back to 8192:

  # echo 8192 >/sys/devices/virtual/bdi/ceph-1/read_ahead_kb
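
The sysfs value doesn't survive a remount, so if that fixes it you can
also pin readahead with the rasize mount option (bytes, so 8192K is
8388608) until the regression itself is fixed -- a sketch, substitute
your own monitors and credentials:

  # mount -t ceph <mon-addr>:/ /ceph -o name=myuser,secretfile=/etc/ceph/secret,rasize=8388608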

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd feature map fail

2018-05-15 Thread Ilya Dryomov
On Tue, May 15, 2018 at 10:07 AM,   wrote:
> Hi, all!
>
> I am using rbd and found the issue below:
>
> When I create an rbd image with the features:
> layering,exclusive-lock,object-map,fast-diff
>
> failed to map:
> rbd: sysfs write failed
> RBD image feature set mismatch. Try disabling features unsupported by the
> kernel with "rbd feature disable".
> In some cases useful info is found in syslog - try "dmesg | tail".
> rbd: map failed: (6) No such device or address
>
> dmesg | tail:
> [960284.869596] rbd: rbd0: capacity 107374182400 features 0x5
> [960310.908615] libceph: mon1 10.0.10.12:6789 session established
> [960310.908916] libceph: client21459 fsid
> fe308030-ae94-471a-8d52-2c12151262fc
> [960310.911729] rbd: image foo: image uses unsupported features: 0x18
> [960337.946856] libceph: mon1 10.0.10.12:6789 session established
> [960337.947320] libceph: client21465 fsid
> fe308030-ae94-471a-8d52-2c12151262fc
> [960337.950116] rbd: image foo: image uses unsupported features: 0x8
> [960346.248676] libceph: mon0 10.0.10.11:6789 session established
> [960346.249077] libceph: client21866 fsid
> fe308030-ae94-471a-8d52-2c12151262fc
> [960346.254145] rbd: rbd0: capacity 107374182400 features 0x5
>
> If I just create a layering-only image, the map is ok.
>
> *The question is here:*
>
> Then I enable the features:
> exclusive-lock,object-map,fast-diff
>
> It works.
>
> And rbd info shows all the features I set.
>
> I think it is a bug:
>
> Why does mapping fail when the image is created with those features, but
> enabling them after creating (and mapping) the image is ok?

Yes, it is a bug.  There is a patch pending from Dongsheng, so it will
be fixed in 4.18.

If you are told that these features are unsupported, you shouldn't be
looking for backdoor ways to enable them ;)
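
For reference, the supported route is the one the error message already
suggests -- drop the features krbd doesn't understand before mapping,
e.g.:

  $ rbd feature disable foo object-map fast-diff
  $ rbd map foo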

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] remove big rbd image is very slow

2018-03-26 Thread Ilya Dryomov
On Sat, Mar 17, 2018 at 5:11 PM, shadow_lin  wrote:
> Hi list,
> My ceph version is jewel 10.2.10.
> I tried to use rbd rm to remove a 50TB image (without object map, because krbd
> doesn't support it). It takes about 30 mins just to complete about 3%. Is this
> expected? Is there a way to make it faster?
> I know there are scripts to delete rados objects of the rbd image to make it
> faster. But is the slowness expected for rbd rm command?
>
> PS: I also encountered very slow rbd export for a large rbd image (a 20TB image
> but with only a few GB of data). It takes hours to complete the export. I guess
> both are related to object map not being enabled, but krbd doesn't support the
> object map
> feature.

If you don't have any other images in that pool, you can simply delete
the pool with "ceph osd pool delete".  It'll take a second ;)

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Kernel version for Debian 9 CephFS/RBD clients

2018-03-26 Thread Ilya Dryomov
On Fri, Mar 23, 2018 at 5:53 PM, Nicolas Huillard <nhuill...@dolomede.fr> wrote:
> Le vendredi 23 mars 2018 à 12:14 +0100, Ilya Dryomov a écrit :
>> On Fri, Mar 23, 2018 at 11:48 AM,  <c...@jack.fr.eu.org> wrote:
>> > The stock kernel from Debian is perfect
>> > Spectre / meltdown mitigations are worthless for a Ceph point of
>> > view,
>> > and should be disabled (again, strictly from a Ceph point of view)
>
> I know that Ceph itself doesn't need this, but the ceph client machines,
> especially those hosting VMs or more diverse code, should have those
> mitigations.
>
>> > If you need the luminous features, using the userspace
>> > implementations
>> > is required (librbd via rbd-nbd or qemu, libcephfs via fuse etc)
>
> I'd rather use the faster kernel cephfs implementation instead of fuse,
> especially with the Meltdown PTI mitigation (I guess fuse implies twice
> the userland-to-kernel calls, which are costly with PTI).
> I don't have an idea yet re. RBD...
>
>> luminous cluster-wide feature bits are supported since kernel 4.13.
>
> This means that there are differences between 4.9 and 4.14 re. Ceph
> features. I know that quotas are not supported yet in any kernel, but I
> don't use this...

luminous cluster-wide features include pg-upmap, which in concert with
the new mgr balancer module can provide the perfect distribution of PGs
across OSDs, and some other OSD performance and memory usage related
improvements.
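
For the curious, the balancer side of that is roughly (luminous+ only,
and all clients must be luminous-capable first):

  $ ceph osd set-require-min-compat-client luminous
  $ ceph mgr module enable balancer
  $ ceph balancer mode upmap
  $ ceph balancer on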

> Are there some performance/stability improvements in the kernel that
> would justify using 4.14 instead of 4.9 ? I can't find any list
> anywhere...
> Since I'm building a new cluster, I'd rather choose the latest software
> from the start if it's justified.

From the point of view of the kernel client, a number of issues get
fixed in every kernel release, but only a handful of most important
patches get backported.

Given that 4.14 is available in your environment, there is really no
reason to use 4.9, whether you are starting from scratch or not.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Kernel version for Debian 9 CephFS/RBD clients

2018-03-23 Thread Ilya Dryomov
On Fri, Mar 23, 2018 at 3:01 PM,   wrote:
> Ok ^^
>
> For Cephfs, as far as I know, quota support is not supported in kernel space
> This is not specific to luminous, tho

quota support is coming, hopefully in 4.17.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Kernel version for Debian 9 CephFS/RBD clients

2018-03-23 Thread Ilya Dryomov
On Fri, Mar 23, 2018 at 2:18 PM,  <c...@jack.fr.eu.org> wrote:
> On 03/23/2018 12:14 PM, Ilya Dryomov wrote:
>> luminous cluster-wide feature bits are supported since kernel 4.13.
>
> ?
>
> # uname -a
> Linux abweb1 4.14.0-0.bpo.3-amd64 #1 SMP Debian 4.14.13-1~bpo9+1
> (2018-01-14) x86_64 GNU/Linux
> # rbd info truc
> rbd image 'truc':
> size 20480 MB in 5120 objects
> order 22 (4096 kB objects)
> block_name_prefix: rbd_data.9eca966b8b4567
> format: 2
> features: layering, exclusive-lock, object-map, fast-diff, 
> deep-flatten
> flags:
> # rbd map truc
> rbd: sysfs write failed
> RBD image feature set mismatch. You can disable features unsupported by
> the kernel with "rbd feature disable pool/truc object-map fast-diff
> deep-flatten".
> In some cases useful info is found in syslog - try "dmesg | tail".
> rbd: map failed: (6) No such device or address
> # dmesg | tail -1
> [1108045.667333] rbd: image truc: image uses unsupported features: 0x38

Those are rbd image features.  Your email also mentioned "libcephfs via
fuse", so I assumed you had meant cluster-wide feature bits.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Kernel version for Debian 9 CephFS/RBD clients

2018-03-23 Thread Ilya Dryomov
On Fri, Mar 23, 2018 at 11:48 AM,   wrote:
> The stock kernel from Debian is perfect
> Spectre / meltdown mitigations are worthless for a Ceph point of view,
> and should be disabled (again, strictly from a Ceph point of view)
>
> If you need the luminous features, using the userspace implementations
> is required (librbd via rbd-nbd or qemu, libcephfs via fuse etc)

luminous cluster-wide feature bits are supported since kernel 4.13.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore cluster, bad IO perf on blocksize<64k... could it be throttling ?

2018-03-23 Thread Ilya Dryomov
On Wed, Mar 21, 2018 at 6:50 PM, Frederic BRET  wrote:
> Hi all,
>
> The context :
> - Test cluster aside production one
> - Fresh install on Luminous
> - choice of Bluestore (coming from Filestore)
> - Default config (including wpq queuing)
> - 6 nodes SAS12, 14 OSD, 2 SSD, 2 x 10Gb nodes, far more Gb at each switch
> uplink...
> - R3 pool, 2 nodes per site
> - separate db (25GB) and wal (600MB) partitions on SSD for each OSD to be
> able to observe each kind of IO with iostat
> - RBD client fio --ioengine=libaio --iodepth=128 --direct=1
> - client RBD:  rbd map rbd/test_rbd -o queue_depth=1024
> - Just to point out, this is not a thread on SSD performance or on matching
> SSDs to the number of OSDs. These 12Gb SAS 10DWPD SSDs are perfectly
> performing with lot of headroom on the production cluster even with XFS
> filestore and journals on SSDs.
> - This thread is about a possible bottleneck on low size blocks with
> rocksdb/wal/Bluestore.
>
> To begin with, Bluestore performance is really breathtaking compared to
> filestore/XFS : we saturate the 20Gb clients bandwidth on this small test
> cluster, as soon as IO blocksize=64k, a thing we couldn't achieve with
> Filestore and journals, even at 256k.
>
> The downside: all small IO blocksizes (4k, 8k, 16k, 32k) are considerably
> slower and appear somewhat capped.
>
> Just to compare, here are observed latencies at 2 consecutive values for
> blocksize 64k and 32k :
> 64k :
>   write: io=55563MB, bw=1849.2MB/s, iops=29586, runt= 30048msec
>  lat (msec): min=2, max=867, avg=17.29, stdev=32.31
>
> 32k :
>   write: io=6332.2MB, bw=207632KB/s, iops=6488, runt= 31229msec
>  lat (msec): min=1, max=5111, avg=78.81, stdev=430.50
>
> Whereas the 64k one almost fills the 20Gb client connection, the 32k one is
> only getting a mere 1/10th of the bandwidth, and IOs latencies are
> multiplied by 4.5 (or get a  ~60ms pause ? ... )
>
> And we see the same constant latency at 16k, 8k and 4k :
> 16k :
>   write: io=3129.4MB, bw=102511KB/s, iops=6406, runt= 31260msec
>  lat (msec): min=0.908, max=6.67, avg=79.87, stdev=500.08
>
> 8k :
>   write: io=1592.8MB, bw=52604KB/s, iops=6575, runt= 31005msec
>  lat (msec): min=0.824, max=5.49, avg=77.82, stdev=461.61
>
> 4k :
>   write: io=837892KB, bw=26787KB/s, iops=6696, runt= 31280msec
>  lat (msec): min=0.766, max=5.45, avg=76.39, stdev=428.29
>
> To compare with filestore, on 4k IOs results I have on hand from previous
> install, we were getting almost 2x the Bluestore perfs on the exact same
> cluster :
> WRITE: io=1221.4MB, aggrb=41477KB/,s maxt=30152msec
>
> The thing is, during these small-blocksize fio benchmarks, neither node CPU,
> OSD, SSD, nor of course the network is saturated (i.e. I think this has
> nothing to do with write amplification), yet client IOPS starve at low
> values.
> Shouldn't Bluestore IOPs be far higher than Filestore on small IOs too ?
>
> To summerize, here is what we can observe :
>
>
> Seeking counters, I found incrementing values in "perf dump" during the slow
> IO benchmarks; here for one run of 4k fio:
> "deferred_write_ops": 7631,
> "deferred_write_bytes": 31457280,

bluestore data-journals any write smaller than min_alloc_size because
it has to happen in place, whereas writes equal to or larger than that
go directly to their final location on disk.  IOW anything smaller than
min_alloc_size is written twice.

The default min_alloc_size is 64k.  That is what those counters refer
to.
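
If you want to experiment with it, note that min_alloc_size is baked in
when an OSD is created -- changing ceph.conf only affects OSDs deployed
afterwards.  A sketch (values are just an example, not a recommendation):

  [osd]
  bluestore_min_alloc_size_hdd = 16384
  bluestore_min_alloc_size_ssd = 4096

  $ ceph daemon osd.0 config get bluestore_min_alloc_size_hdd   # check a running OSD
  $ ceph daemon osd.0 perf dump | grep deferred                 # the counters you quoted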

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock

2018-03-13 Thread Ilya Dryomov
On Mon, Mar 12, 2018 at 8:20 PM, Maged Mokhtar <mmokh...@petasan.org> wrote:
> On 2018-03-12 21:00, Ilya Dryomov wrote:
>
> On Mon, Mar 12, 2018 at 7:41 PM, Maged Mokhtar <mmokh...@petasan.org> wrote:
>
> On 2018-03-12 14:23, David Disseldorp wrote:
>
> On Fri, 09 Mar 2018 11:23:02 +0200, Maged Mokhtar wrote:
>
> 2) I understand that before switching the path, the initiator will send a
> TMF ABORT can we pass this to down to the same abort_request() function
> in osd_client that is used for osd_request_timeout expiry ?
>
>
> IIUC, the existing abort_request() codepath only cancels the I/O on the
> client/gw side. A TMF ABORT successful response should only be sent if
> we can guarantee that the I/O is terminated at all layers below, so I
> think this would have to be implemented via an additional OSD epoch
> barrier or similar.
>
> Cheers, David
>
> Hi David,
>
> I was thinking we would get the block request then loop down to all its osd
> requests and cancel those using the same  osd request cancel function.
>
>
> All that function does is tear down OSD client / messenger data
> structures associated with the OSD request.  Any OSD request that hit
> the TCP layer may eventually get through to the OSDs.
>
> Thanks,
>
> Ilya
>
> Hi Ilya,
>
> OK.. so I guess this also applies to osd_request_timeout expiry: it is not
> guaranteed to stop all stale IOs.

Yes.  The purpose of osd_request_timeout is to unblock the client side
by failing the I/O on the client side.  It doesn't attempt to stop any
in-flight I/O -- it simply marks it as failed.
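
For reference, osd_request_timeout can be set per mapping at map time; the
default is 0, i.e. never time out (the pool/image names below are made up):

$ sudo rbd map mypool/myimage -o osd_request_timeout=30

With that, OSD requests that have been blocked for ~30 seconds are failed
on the client side -- which, as above, is not a guarantee that the OSDs
never see them.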

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock

2018-03-12 Thread Ilya Dryomov
On Mon, Mar 12, 2018 at 7:41 PM, Maged Mokhtar  wrote:
> On 2018-03-12 14:23, David Disseldorp wrote:
>
> On Fri, 09 Mar 2018 11:23:02 +0200, Maged Mokhtar wrote:
>
> 2) I understand that before switching the path, the initiator will send a
> TMF ABORT. Can we pass this down to the same abort_request() function
> in osd_client that is used for osd_request_timeout expiry ?
>
>
> IIUC, the existing abort_request() codepath only cancels the I/O on the
> client/gw side. A TMF ABORT successful response should only be sent if
> we can guarantee that the I/O is terminated at all layers below, so I
> think this would have to be implemented via an additional OSD epoch
> barrier or similar.
>
> Cheers, David
>
> Hi David,
>
> I was thinking we would get the block request, then loop down to all its osd
> requests and cancel those using the same osd request cancel function.

All that function does is tear down OSD client / messenger data
structures associated with the OSD request.  Any OSD request that hit
the TCP layer may eventually get through to the OSDs.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd feature overheads

2018-02-13 Thread Ilya Dryomov
On Tue, Feb 13, 2018 at 1:24 AM, Blair Bethwaite
 wrote:
> Thanks Ilya,
>
> We can probably handle ~6.2MB for a 100TB volume. Is it reasonable to expect
> a librbd client such as QEMU to only hold one object-map per guest?

Yes, I think so.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd feature overheads

2018-02-12 Thread Ilya Dryomov
On Mon, Feb 12, 2018 at 6:25 AM, Blair Bethwaite
 wrote:
> Hi all,
>
> Wondering if anyone can clarify whether there are any significant overheads
> from rbd features like object-map, fast-diff, etc. I'm interested in both
> performance overheads from a latency and space perspective, e.g., can
> object-map be sanely deployed on a 100TB volume or does the client try to
> read the whole thing into memory...?

Yes, it does.  Enabling object-map on images larger than 1PB isn't
allowed for exactly that reason.  The memory overhead is 2 bits per
object, i.e. 64K per 1TB assuming the default object size.

object-map also depends on exclusive-lock, which is bad for use cases
where sharing the same image between multiple clients is a requirement.

Once object-map is enabled, fast-diff is virtually no overhead.
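
For reference, the arithmetic for the 100TB case mentioned in the thread,
assuming the default 4M object size:

  100TB / 4MB per object        = 26,214,400 objects
  26,214,400 objects * 2 bits   = 52,428,800 bits ~= 6.25MB of object map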

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Obtaining cephfs client address/id from the host that mounted it

2018-02-09 Thread Ilya Dryomov
On Fri, Feb 9, 2018 at 12:05 PM, Mauricio Garavaglia
 wrote:
> Hello,
> Is it possible to get the cephfs client id/address on the host that mounted
> it, in the same way we can get the address for rbd-mapped volumes by looking
> at /sys/bus/rbd/devices/*/client_addr?

No, not without querying the servers.

Unfortunately, there is nothing like client_addr for the filesystem.
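
For what it's worth, the server side does know the addresses -- on an MDS
host you can list sessions via the admin socket (the daemon name below is
just an example) and match the "inst" field (client.<id> <addr>) against
the host in question:

$ sudo ceph daemon mds.$(hostname -s) session ls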

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous/Ubuntu 16.04 kernel recommendation ?

2018-02-08 Thread Ilya Dryomov
On Thu, Feb 8, 2018 at 12:54 PM, Kevin Olbrich  wrote:
> 2018-02-08 11:20 GMT+01:00 Martin Emrich :
>>
>> I have a machine here mounting a Ceph RBD from luminous 12.2.2 locally,
>> running linux-generic-hwe-16.04 (4.13.0-32-generic).
>>
>> Works fine, except that it does not support the latest features: I had to
>> disable exclusive-lock,fast-diff,object-map,deep-flatten on the image.
>> Otherwise it runs well.
>
>
> I always thought that the latest features are built into newer kernels. Are
> they available on non-HWE 4.4, HWE 4.8 or HWE 4.10?

No, some of these features haven't made it to the kernel client yet.

> Also I am researching for the OSD server side.

For the OSDs, you should be fine with pretty much any kernel supported
by your distro.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous/Ubuntu 16.04 kernel recommendation ?

2018-02-08 Thread Ilya Dryomov
On Thu, Feb 8, 2018 at 11:20 AM, Martin Emrich
 wrote:
> I have a machine here mounting a Ceph RBD from luminous 12.2.2 locally,
> running linux-generic-hwe-16.04 (4.13.0-32-generic).
>
> Works fine, except that it does not support the latest features: I had to
> disable exclusive-lock,fast-diff,object-map,deep-flatten on the image.
> Otherwise it runs well.

That kernel should support exclusive-lock.  It doesn't hurt to disable
exclusive-lock if you don't need it though.
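
For reference, a sketch of the disable-then-map workaround described above
(pool/image names are made up; dependent features are disabled first):

$ rbd feature disable mypool/myimage deep-flatten
$ rbd feature disable mypool/myimage fast-diff
$ rbd feature disable mypool/myimage object-map
$ sudo rbd map mypool/myimage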

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] luminous rbd feature 'striping' is deprecated or just a bug?

2018-01-29 Thread Ilya Dryomov
On Mon, Jan 29, 2018 at 8:37 AM, Konstantin Shalygin  wrote:
> Does anybody know about changes to the rbd feature 'striping'? Maybe it is
> a deprecated feature? Here is what I mean:
>
> I have a volume created by a Jewel client on a Luminous cluster.
>
> # rbd --user=cinder info
> solid_rbd/volume-12b5df1e-df4c-4574-859d-22a88415aaf7
> rbd image 'volume-12b5df1e-df4c-4574-859d-22a88415aaf7':
> size 200 GB in 51200 objects
> order 22 (4096 kB objects)
> block_name_prefix: rbd_data.73bb33166d45615
> format: 2
> features: layering, striping, exclusive-lock, object-map, fast-diff
> flags:
> create_timestamp: Tue Jan 16 15:06:26 2018
> stripe unit: 4096 kB
> stripe count: 1
>
>
> Striping is enabled.
>
>
> When I try to create a volume with a Luminous client:
>
>
> # rbd --user=cinder create solid_rbd/mysupervol --size=1G --image-feature
> layering,striping
> # rbd --user=cinder info solid_rbd/mysupervol
> rbd image 'mysupervol':
> size 1024 MB in 256 objects
> order 22 (4096 kB objects)
> block_name_prefix: rbd_data.ae31f32ae8944a
> format: 2
> features: layering
> flags:
> create_timestamp: Mon Jan 29 14:11:10 2018
>
>
> Striping is silently disabled, because stripe_unit and stripe_count default
> to 0:
>
>
> # ceph --show-config | grep rbd_default_stripe
> rbd_default_stripe_count = 0
> rbd_default_stripe_unit = 0

Unless you specify a non-default stripe_unit/stripe_count, striping
feature bit is not set and striping-related fields aren't displayed.
This behaviour is new in luminous, but jewel and older clients still
work with luminous images.

>
>
> Trying to specify the values manually:
>
>
> # rbd --user=cinder create solid_rbd/mysupervol --size=1G --image-feature
> layering,striping --stripe-unit 4096 --stripe-count 1
> # rbd --user=cinder info solid_rbd/mysupervol
> rbd image 'mysupervol':
> size 1024 MB in 256 objects
> order 22 (4096 kB objects)
> block_name_prefix: rbd_data.adebc974b0dc51
> format: 2
> features: layering, striping
> flags:
> create_timestamp: Mon Jan 29 14:16:13 2018
> stripe unit: 4096 bytes
> stripe count: 1

Here you specified a custom stripe_unit, thus enabling the striping
feature bit.
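
For reference, a minimal illustration of the luminous behaviour (pool/image
names are made up; the stripe unit is given in bytes to avoid any suffix
ambiguity):

# default striping (stripe_unit = object size, stripe_count = 1):
# the striping feature bit is not set
$ rbd create mypool/vol-plain --size 1G --image-feature layering

# non-default striping: the feature bit is set and the fields are displayed
$ rbd create mypool/vol-striped --size 1G --image-feature layering,striping \
    --stripe-unit 1048576 --stripe-count 4
$ rbd info mypool/vol-striped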

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Not timing out watcher

2017-12-21 Thread Ilya Dryomov
On Thu, Dec 21, 2017 at 3:04 PM, Serguei Bezverkhi (sbezverk)
 wrote:
> Hi Ilya,
>
> Here you go, no k8s services running this time:
>
> sbezverk@kube-4:~$ sudo rbd map raw-volume --pool kubernetes --id admin -m 
> 192.168.80.233  --key=AQCeHO1ZILPPDRAA7zw3d76bplkvTwzoosybvA==
> /dev/rbd0
> sbezverk@kube-4:~$ sudo rbd status raw-volume --pool kubernetes --id admin -m 
> 192.168.80.233  --key=AQCeHO1ZILPPDRAA7zw3d76bplkvTwzoosybvA==
> Watchers:
> watcher=192.168.80.235:0/3465920438 client.65327 cookie=1
> sbezverk@kube-4:~$ sudo rbd info raw-volume --pool kubernetes --id admin -m 
> 192.168.80.233  --key=AQCeHO1ZILPPDRAA7zw3d76bplkvTwzoosybvA==
> rbd image 'raw-volume':
> size 10240 MB in 2560 objects
> order 22 (4096 kB objects)
> block_name_prefix: rb.0.fafa.625558ec
> format: 1
> sbezverk@kube-4:~$ sudo reboot
>
> sbezverk@kube-4:~$ sudo rbd status raw-volume --pool kubernetes --id admin -m 
> 192.168.80.233  --key=AQCeHO1ZILPPDRAA7zw3d76bplkvTwzoosybvA==
> Watchers: none
>
> It seems that when the image is mapped manually, this issue is not reproducible.
>
> K8s does not just map the image, it also creates a loopback device which is
> linked to /dev/rbd0. Maybe this somehow prompts the rbd client to re-activate
> a watcher on reboot. I will try to manually mimic the exact steps k8s follows
> to see what exactly forces an active watcher after reboot.

To confirm, I'd also make sure that nothing runs "rbd unmap" on all
images (or some subset of images) during shutdown in the manual case.
Either do a hard reboot or rename /usr/bin/rbd to something else before
running reboot.
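
A rough sketch of what that manual test could look like (image/pool names
follow the ones used earlier in the thread; auth options are omitted for
brevity):

$ sudo rbd map raw-volume --pool kubernetes --id admin
# either hard-reboot, so nothing gets a chance to run "rbd unmap" ...
$ echo b | sudo tee /proc/sysrq-trigger
# ... or take the binary out of the way first and reboot normally:
$ sudo mv /usr/bin/rbd /usr/bin/rbd.disabled && sudo reboot
# once the node is back (restore the binary if you renamed it):
$ sudo rbd status raw-volume --pool kubernetes --id admin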

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Not timing out watcher

2017-12-21 Thread Ilya Dryomov
On Wed, Dec 20, 2017 at 6:20 PM, Serguei Bezverkhi (sbezverk)
 wrote:
> It took 30 minutes for the watcher to time out after an ungraceful restart.
> Is there a way to limit it to something a bit more reasonable, like 1-3
> minutes?
>
> On 2017-12-20, 12:01 PM, "Serguei Bezverkhi (sbezverk)"  
> wrote:
>
> Ok, here is what I found out. If I gracefully kill a pod, the watcher
> gets properly cleared, but if it is done ungracefully, without “rbd unmap”,
> then even after a node reboot the watcher stays up for a long time -- it has
> been more than 20 minutes and it is still active (no kubernetes services are
> running).

Hi Serguei,

Can you try taking k8s out of the equation -- set up a fresh VM with
the same kernel, do "rbd map" in it and kill it?

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

