[ceph-users] troubleshooting ceph rdma performance

2018-11-07 Thread Raju Rangoju
Hello All,

I have been collecting performance numbers on our ceph cluster, and I noticed very poor throughput with ceph async+rdma compared with tcp. I was wondering what tunings/settings I should apply to the cluster to improve the ceph rdma (async+rdma) performance.

Currently, from what we see, ceph rdma throughput is less than half of the ceph tcp throughput (measured by running fio over iscsi-mounted disks).
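
For context, a representative fio job of the kind used for such a throughput comparison against an iscsi-mounted disk (the block size, queue depth, and /dev/sdX device path are illustrative assumptions, not the exact job we ran):

fio --name=seqread --filename=/dev/sdX --direct=1 --ioengine=libaio \
    --rw=read --bs=1M --iodepth=32 --numjobs=4 \
    --runtime=60 --time_based --group_reporting
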
Our ceph cluster has 8 nodes and is configured with two networks: a cluster network and a client network.
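
For completeness, the async+rdma messenger is selected via ceph.conf options along these lines (a minimal sketch; the buffer/polling values are illustrative rather than our exact tuning, and the device name is whatever ibv_devices reports on the node):

[global]
ms_type = async+rdma
ms_cluster_type = async+rdma
# device name as reported by ibv_devices on each node
ms_async_rdma_device_name = <rnic-device>
# commonly tuned RDMA knobs (values illustrative)
ms_async_rdma_send_buffers = 1024
ms_async_rdma_receive_buffers = 1024
ms_async_rdma_buffer_size = 131072
ms_async_rdma_polling_us = 1000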

Can someone please shed some light?

I'd be glad to provide any further information regarding the setup.

Thanks in Advance,
Raju
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] permission errors rolling back ceph cluster to v13

2018-08-08 Thread Raju Rangoju
Thanks Greg.

I think I have to re-install ceph v13 from scratch then.

-Raju
From: Gregory Farnum 
Sent: 09 August 2018 01:54
To: Raju Rangoju 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] permission errors rolling back ceph cluster to v13

On Tue, Aug 7, 2018 at 6:27 PM Raju Rangoju <ra...@chelsio.com> wrote:
Hi,

I have been running into some connection issues with the latest ceph-14 version, so we thought a feasible solution would be to roll the cluster back to the previous version (ceph-13.0.1), where things are known to work properly.

I’m wondering if rollback/downgrade is supported at all?

After compiling/starting ceph-13 I’m running into a permission error. Basically it complains about an incompatible on-disk layout (ceph-13 mimic vs. ceph-14 nautilus):

2018-08-07 10:41:00.580 2b391528e080 -1 ERROR: on disk data includes 
unsupported features: compat={},rocompat={},incompat={11=nautilus ondisk layout}
2018-08-07 10:41:00.580 2b391528e080 -1 error checking features: (1) Operation 
not permitted
2018-08-07 10:41:11.161 2b16a7d14080  0 set uid:gid to 167:167 (ceph:ceph)
2018-08-07 10:41:11.161 2b16a7d14080  0 ceph version 13.0.1-3266-g6b59fbf 
(6b59fbfcc6bbfd67193e1c1e142b478ddd68aab4) mimic (dev), process (unknown), pid 
14013
2018-08-07 10:41:11.161 2b16a7d14080  0 pidfile_write: ignore empty --pid-file


I thought granting the mon permissions would fix it (see below), but the ceph command hangs, so the permissions were never applied.

ceph auth add osd.44 osd 'allow *' mon 'allow profile osd' -i 
/var/lib/ceph/osd/ceph-44/keyring

[root@hadoop1 my-ceph]# ceph -s
2018-08-07 10:59:59.325 2b5c26347700  0 monclient(hunting): authenticate timed 
out after 300
2018-08-07 10:59:59.325 2b5c26347700  0 librados: client.admin authentication 
error (110) Connection timed out
[errno 110] error connecting to the cluster

Has anyone tried ceph rollback before? Any help is greatly appreciated.

Unfortunately this is generally not possible. Disk encodings change across 
major versions and the old code can't understand what's been written down once 
the new code runs.
-Greg
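
As a quick sanity check (a hedged sketch, and only usable while the monitors still respond, which is not the case once authentication hangs as above), the release each running daemon reports and the minimum release the OSD maps already require can be read with:

ceph versions
ceph osd dump | grep require_osd_release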


Thanks,
Raj

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] permission errors rolling back ceph cluster to v13

2018-08-07 Thread Raju Rangoju
Hi,

I have been running into some connection issues with the latest ceph-14 version, so we thought a feasible solution would be to roll the cluster back to the previous version (ceph-13.0.1), where things are known to work properly.

I'm wondering if rollback/downgrade is supported at all?

After compiling/starting ceph-13 I'm running into a permission error. Basically it complains about an incompatible on-disk layout (ceph-13 mimic vs. ceph-14 nautilus):

2018-08-07 10:41:00.580 2b391528e080 -1 ERROR: on disk data includes 
unsupported features: compat={},rocompat={},incompat={11=nautilus ondisk layout}
2018-08-07 10:41:00.580 2b391528e080 -1 error checking features: (1) Operation 
not permitted
2018-08-07 10:41:11.161 2b16a7d14080  0 set uid:gid to 167:167 (ceph:ceph)
2018-08-07 10:41:11.161 2b16a7d14080  0 ceph version 13.0.1-3266-g6b59fbf 
(6b59fbfcc6bbfd67193e1c1e142b478ddd68aab4) mimic (dev), process (unknown), pid 
14013
2018-08-07 10:41:11.161 2b16a7d14080  0 pidfile_write: ignore empty --pid-file


I thought granting the mon permissions would fix it (see below), but the ceph command hangs, so the permissions were never applied.

ceph auth add osd.44 osd 'allow *' mon 'allow profile osd' -i 
/var/lib/ceph/osd/ceph-44/keyring

[root@hadoop1 my-ceph]# ceph -s
2018-08-07 10:59:59.325 2b5c26347700  0 monclient(hunting): authenticate timed 
out after 300
2018-08-07 10:59:59.325 2b5c26347700  0 librados: client.admin authentication 
error (110) Connection timed out
[errno 110] error connecting to the cluster

Has anyone tried ceph rollback before? Any help is greatly appreciated.

Thanks,
Raj

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] issues with ceph nautilus version

2018-06-20 Thread Raju Rangoju
Hey Igor, the patch you pointed to worked for me.
Thanks again.

From: ceph-users  On Behalf Of Igor Fedotov
Sent: 20 June 2018 21:55
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] issues with ceph nautilus version


Hi Raju,



This is a bug in new BlueStore's bitmap allocator.

This PR will most probably fix that:

https://github.com/ceph/ceph/pull/22610



Also you may try to switch bluestore and bluefs allocators (bluestore_allocator 
and bluefs_allocator parameters respectively) to stupid and restart OSDs.

This should help.
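
In ceph.conf terms that switch amounts to something like this on the OSD nodes (a minimal sketch; the restart command assumes systemd-managed OSDs):

[osd]
bluestore_allocator = stupid
bluefs_allocator = stupid

followed by restarting the OSDs, e.g. 'systemctl restart ceph-osd.target'.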



Thanks,

Igor

On 6/20/2018 6:41 PM, Raju Rangoju wrote:
Hi,

I recently upgraded my ceph cluster from version 13.0.1 to version 14.0.0 - nautilus (dev), and after this I noticed some weird data usage numbers on the cluster.
Here are the issues I'm seeing...

  1.  The data usage reported is much higher than what is actually available:

usage:   16 EiB used, 164 TiB / 158 TiB avail



Before this upgrade, usage used to be reported correctly:

usage:   1.10T used, 157T / 158T avail



  2.  It reports that all the OSDs/pools are full.

Can someone please shed some light? Any help is greatly appreciated.

[root@hadoop1 my-ceph]# ceph --version
ceph version 14.0.0-480-g6c1e8ee (6c1e8ee14f9b25dc96684dbc1f8c8255c47f0bb9) 
nautilus (dev)

[root@hadoop1 my-ceph]# ceph -s
  cluster:
id: ee4660fd-167b-42e6-b27b-126526dab04d
health: HEALTH_ERR
87 full osd(s)
11 pool(s) full

  services:
mon: 3 daemons, quorum hadoop1,hadoop4,hadoop6
mgr: hadoop6(active), standbys: hadoop1, hadoop4
mds: cephfs-1/1/1 up  {0=hadoop3=up:creating}, 2 up:standby
osd: 88 osds: 87 up, 87 in

  data:
pools:   11 pools, 32588 pgs
objects: 0  objects, 0 B
usage:   16 EiB used, 164 TiB / 158 TiB avail
pgs: 32588 active+clean

Thanks in advance
-Raj





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] issues with ceph nautilus version

2018-06-20 Thread Raju Rangoju
Hi Igor,
Great! Thanks for the quick response.

Will try the fix and let you know how it goes.

-Raj
From: ceph-users  On Behalf Of Igor Fedotov
Sent: 20 June 2018 21:55
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] issues with ceph nautilus version


Hi Raju,



This is a bug in new BlueStore's bitmap allocator.

This PR will most probably fix that:

https://github.com/ceph/ceph/pull/22610



Also you may try to switch bluestore and bluefs allocators (bluestore_allocator 
and bluefs_allocator parameters respectively) to stupid and restart OSDs.

This should help.



Thanks,

Igor

On 6/20/2018 6:41 PM, Raju Rangoju wrote:
Hi,

I recently upgraded my ceph cluster from version 13.0.1 to version 14.0.0 - nautilus (dev), and after this I noticed some weird data usage numbers on the cluster.
Here are the issues I'm seeing...

  1.  The data usage reported is much higher than what is actually available:

usage:   16 EiB used, 164 TiB / 158 TiB avail



Before this upgrade, usage used to be reported correctly:

usage:   1.10T used, 157T / 158T avail



  2.  It reports that all the OSDs/pools are full.

Can someone please shed some light? Any help is greatly appreciated.

[root@hadoop1 my-ceph]# ceph --version
ceph version 14.0.0-480-g6c1e8ee (6c1e8ee14f9b25dc96684dbc1f8c8255c47f0bb9) 
nautilus (dev)

[root@hadoop1 my-ceph]# ceph -s
  cluster:
id: ee4660fd-167b-42e6-b27b-126526dab04d
health: HEALTH_ERR
87 full osd(s)
11 pool(s) full

  services:
mon: 3 daemons, quorum hadoop1,hadoop4,hadoop6
mgr: hadoop6(active), standbys: hadoop1, hadoop4
mds: cephfs-1/1/1 up  {0=hadoop3=up:creating}, 2 up:standby
osd: 88 osds: 87 up, 87 in

  data:
pools:   11 pools, 32588 pgs
objects: 0  objects, 0 B
usage:   16 EiB used, 164 TiB / 158 TiB avail
pgs: 32588 active+clean

Thanks in advance
-Raj





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] issues with ceph nautilus version

2018-06-20 Thread Raju Rangoju
Hi,

I recently upgraded my ceph cluster from version 13.0.1 to version 14.0.0 - nautilus (dev), and after this I noticed some weird data usage numbers on the cluster.
Here are the issues I'm seeing...

  1.  The data usage reported is much higher than what is actually available:

usage:   16 EiB used, 164 TiB / 158 TiB avail



Before this upgrade, usage used to be reported correctly:

usage:   1.10T used, 157T / 158T avail



  2.  It reports that all the OSDs/pools are full.

Can someone please shed some light? Any help is greatly appreciated.

[root@hadoop1 my-ceph]# ceph --version
ceph version 14.0.0-480-g6c1e8ee (6c1e8ee14f9b25dc96684dbc1f8c8255c47f0bb9) 
nautilus (dev)

[root@hadoop1 my-ceph]# ceph -s
  cluster:
id: ee4660fd-167b-42e6-b27b-126526dab04d
health: HEALTH_ERR
87 full osd(s)
11 pool(s) full

  services:
mon: 3 daemons, quorum hadoop1,hadoop4,hadoop6
mgr: hadoop6(active), standbys: hadoop1, hadoop4
mds: cephfs-1/1/1 up  {0=hadoop3=up:creating}, 2 up:standby
osd: 88 osds: 87 up, 87 in

  data:
pools:   11 pools, 32588 pgs
objects: 0  objects, 0 B
usage:   16 EiB used, 164 TiB / 158 TiB avail
pgs: 32588 active+clean

Thanks in advance
-Raj

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] issue with OSD class path in RDMA mode

2018-05-31 Thread Raju Rangoju
Hello,

I'm trying to run iscsi tgtd on a ceph cluster. When I do 'rbd list' I see the errors below.

[root@ceph1 ceph]# rbd list
2018-05-30 18:19:02.227 2ae7260a8140 -1 librbd::api::Image: list_images: error 
listing image in directory: (5) Input/output error
2018-05-30 18:19:02.227 2ae7260a8140 -1 librbd: error listing v2 images: (5) 
Input/output error
rbd: list: (5) Input/output error

I followed '/ceph/doc/dev/osd-class-path.rst' [1], found that this is an issue with the OSD class path, and added osd_class_dir to ceph.conf.
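
That addition amounts to something like the following (a minimal sketch; /usr/lib64/rados-classes is the path suggested by the doc quoted below, and the restart assumes systemd-managed OSDs):

[osd]
osd_class_dir = /usr/lib64/rados-classes

followed by 'systemctl restart ceph-osd.target' on each OSD node.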

After adding the osd class path, 'rbd list' worked perfectly fine in TCP mode and no issues were seen. But in RDMA (async) mode, only 26 out of 88 OSDs come up (see 'ceph -s' below). No errors are seen in the osd logs. Can someone please shed some light?

[root@hadoop1 my-ceph]# ceph -s
  cluster:
id: ee4660fd-167b-42e6-b27b-126526dab04d
health: HEALTH_WARN
1 filesystem is degraded
39 osds down
4 hosts (44 osds) down
Reduced data availability: 23944 pgs inactive

  services:
mon: 3 daemons, quorum hadoop1,hadoop4,hadoop6
mgr: hadoop1(active), standbys: hadoop6, hadoop4
mds: cephfs-1/1/1 up  {0=hadoop3=up:replay}, 2 up:standby
osd: 88 osds: 26 up, 88 in

  data:
pools:   13 pools, 23944 pgs
objects: 0 objects, 0
usage:   0 used, 0 / 0 avail
pgs: 100.000% pgs unknown
 23944 unknown

I am more than happy to provide any extra information necessary.

[1]
/root/ceph/doc/dev/osd-class-path.rst

===
OSD class path issues
===
::
  2011-12-05 17:41:00.994075 7ffe8b5c3760 librbd: failed to assign a block name 
for image
  create error: error 5: Input/output error

This usually happens because your OSDs can't find ``cls_rbd.so``. They
search for it in ``osd_class_dir``, which may not be set correctly by
default (http://tracker.ceph.com/issues/1722).

Most likely it's looking in ``/usr/lib/rados-classes`` instead of
``/usr/lib64/rados-classes`` - change ``osd_class_dir`` in your
``ceph.conf`` and restart the OSDs to fix it.

Thanks,
Raju
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] exporting cephfs as nfs share on RDMA transport

2017-08-14 Thread Raju Rangoju
Hi,

I am testing Ceph over RDMA. For one of the tests I had to export the ceph filesystem as an NFS share over RDMA transport. For TCP transport, I used ganesha as the NFS server; it runs in user space and supports the cephFS FSAL using libcephfs, and it worked perfectly fine. However, my requirement is to export cephFS as an NFS share over RDMA transport instead of TCP.
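
For reference, the working TCP-mode setup boils down to a CephFS FSAL export block of roughly this shape in ganesha.conf (a hedged sketch; the export id, paths, and access settings are illustrative, not my exact configuration):

EXPORT {
    Export_Id = 1;
    Path = /;                  # path inside cephfs
    Pseudo = /cephfs;          # NFSv4 pseudo-root
    Access_Type = RW;
    Squash = No_Root_Squash;
    FSAL {
        Name = CEPH;           # cephfs FSAL backed by libcephfs
    }
}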

So, I was wondering if there is something like ganesha to achieve the same on 
RDMA transport. Can someone please provide some pointers on this?

Appreciate any help.

Thanks,
Raju

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com