Re: [ceph-users] GPF kernel panics

2014-07-31 Thread Brad Hubbard
 lp parport nfsd auth_rpcgss nfs_acl nfs lockd
sunrpc
  fscache hid_generic igb ixgbe i2c_algo_bit usbhid dca hid ptp
ahci libahci
  pps_core megaraid_sas mdio
  [   32.843936] CPU: 18 PID: 5030 Comm: tr Not tainted
3.13.0-30-generic
  #54-Ubuntu
  [   32.860163] Hardware name: Dell Inc. PowerEdge R620/0PXXHP,
BIOS 1.6.0
  03/07/2013
  [   32.876774] task: 880417b15fc0 ti: 8804273f4000 task.ti:
  8804273f4000
  [   32.893384] RIP: 0010:[811a19c5]  [811a19c5]
  kmem_cache_alloc+0x75/0x1e0
  [   32.912198] RSP: 0018:8804273f5d40  EFLAGS: 00010286
  [   32.924015] RAX:  RBX:  RCX:
  11ed
  [   32.939856] RDX: 11ec RSI: 80d0 RDI:
  88042f803700
  [   32.955696] RBP: 8804273f5d70 R08: 00017260 R09:
  811be63c
  [   32.971559] R10: 8080808080808080 R11:  R12:
  7d10f8ec0c3cb928
  [   32.987421] R13: 80d0 R14: 88042f803700 R15:
  88042f803700
  [   33.003284] FS:  () GS:88042fd2()
  knlGS:
  [   33.021281] CS:  0010 DS:  ES:  CR0: 80050033
  [   33.034068] CR2: 7f01a8fced40 CR3: 00040e52f000 CR4:
  000407e0
  [   33.049929] Stack:
  [   33.054456]  811be63c  88041be52780
  880428052000
  [   33.071259]  8804273f5f2c ff9c 8804273f5d98
  811be63c
  [   33.088084]  0080 8804273f5f2c 8804273f5e40
  8804273f5e30
  [   33.104908] Call Trace:
  [   33.110399]  [811be63c] ? get_empty_filp+0x5c/0x180
  [   33.123188]  [811be63c] get_empty_filp+0x5c/0x180
  [   33.135593]  [811cc03d] path_openat+0x3d/0x620
  [   33.147422]  [811cd47a] do_filp_open+0x3a/0x90
  [   33.159250]  [811a1985] ? kmem_cache_alloc+0x35/0x1e0
  [   33.172405]  [811cc6bf] ? getname_flags+0x4f/0x190
  [   33.185004]  [811da237] ? __alloc_fd+0xa7/0x130
  [   33.197025]  [811bbb99] do_sys_open+0x129/0x280
  [   33.209049]  [81020d25] ?
syscall_trace_enter+0x145/0x250
  [   33.222992]  [811bbd0e] SyS_open+0x1e/0x20
  [   33.234053]  [8172aeff] tracesys+0xe1/0xe6
  [   33.245112] Code: dc 00 00 49 8b 50 08 4d 8b 20 49 8b 40 10 4d
85 e4 0f
  84 17 01 00 00 48 85 c0 0f 84 0e 01 00 00 49 63 46 20 48 8d 4a 01
4d 8b 06
  49 8b 1c 04 4c 89 e0 65 49 0f c7 08 0f 94 c0 84 c0 74 b9 49 63
  [   33.292549] RIP  [811a19c5] kmem_cache_alloc+0x75/0x1e0
  [   33.306192]  RSP 8804273f5d40

Hi James,

Are all the stacktraces the same?  When are those rbd images mapped
- during
boot with some sort of init script?  Can you attach the entire dmesg?

Thanks,

 Ilya




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--

Kindest Regards,

Brad Hubbard
Senior Software Maintenance Engineer
Red Hat Global Support Services
Asia Pacific Region


Re: [ceph-users] Ceph Block Device

2015-02-17 Thread Brad Hubbard

On 02/18/2015 11:48 AM, Garg, Pankaj wrote:

libkmod: ERROR ../libkmod/libkmod.c:556 kmod_search_moddep: could not open 
moddep file


Try running 'sudo depmod' to regenerate the module dependency files, then run your modprobe again.

This seems more like an OS issue than a Ceph specific issue.
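A minimal sketch of that recovery step, assuming the standard kmod tools are installed and the running kernel's /lib/modules directory is present:

```shell
# Regenerate modules.dep and modules.dep.bin for the running kernel,
# then retry loading the rbd module.
sudo depmod -a "$(uname -r)"
sudo modprobe rbd
lsmod | grep '^rbd'   # confirm the module is now loaded
```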

Cheers,
Brad


Re: [ceph-users] Ceph Block Device

2015-02-17 Thread Brad Hubbard

On 02/18/2015 09:56 AM, Garg, Pankaj wrote:

Hi,

I have a Ceph cluster and I am trying to create a block device. I execute the 
following command, and get errors:

sudo rbd map cephblockimage --pool rbd -k /etc/ceph/ceph.client.admin.keyring

libkmod: ERROR ../libkmod/libkmod.c:556 kmod_search_moddep: could not open 
moddep file '/lib/modules/3.18.0-02094-gab62ac9/modules.dep.bin'

modinfo: ERROR: Module alias rbd not found.

modprobe: ERROR: ../libkmod/libkmod.c:556 kmod_search_moddep() could not open 
moddep file '/lib/modules/3.18.0-02094-gab62ac9/modules.dep.bin'

rbd: modprobe rbd failed! (256)


What distro/release is this?

Does /lib/modules/3.18.0-02094-gab62ac9/modules.dep.bin exist?

Can you run the command as root?



Need help with what is wrong. I installed the Ceph package on the machine where 
I execute the command. This is on ARM BTW.  Is there something I am missing?

I am able to run Object storage and rados bench just fine on the cluster.

Thanks

Pankaj









Re: [ceph-users] ceph-giant installation error on centos 6.6

2015-02-17 Thread Brad Hubbard

On 02/18/2015 12:43 PM, Wenxiao He wrote:


Hello,

I need some help as I am getting package dependency errors when trying to 
install ceph-giant on centos 6.6. See below for repo files and also the yum 
install output.




--- Package python-imaging.x86_64 0:1.1.6-19.el6 will be installed
-- Finished Dependency Resolution
Error: Package: 1:librbd1-0.87-0.el6.x86_64 (Ceph)
Requires: liblttng-ust.so.0()(64bit)
Error: Package: gperftools-libs-2.0-11.el6.3.x86_64 (Ceph)
Requires: libunwind.so.8()(64bit)
Error: Package: 1:librados2-0.87-0.el6.x86_64 (Ceph)
Requires: liblttng-ust.so.0()(64bit)
Error: Package: 1:ceph-0.87-0.el6.x86_64 (Ceph)
Requires: liblttng-ust.so.0()(64bit)


Looks like you may need to install libunwind and lttng-ust from EPEL 6?

They seem to be the packages that supply liblttng-ust.so and libunwind.so, so you
could try installing those from EPEL 6 and see how that goes?

Note that this should not be taken as the, or even an, authoritative answer :)
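As a sketch, assuming EPEL 6 carries the two libraries under these package names (the names are a guess on my part, not verified):

```shell
# Enable EPEL 6, then pull in the libraries the Ceph packages link against.
sudo yum install -y epel-release
sudo yum install -y libunwind lttng-ust
# Retry the original installation afterwards.
sudo yum install -y ceph
```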

Cheers,
Brad


Re: [ceph-users] Centos 7 OSD silently fail to start

2015-02-25 Thread Brad Hubbard

On 02/26/2015 09:05 AM, Kyle Hutson wrote:

Thank you Thomas. You at least made me look in the right spot. Their long-form
is showing what to do for a mon, not an osd.

At the bottom of step 11, instead of
sudo touch /var/lib/ceph/mon/{cluster-name}-{hostname}/sysvinit

It should read
sudo touch /var/lib/ceph/osd/{cluster-name}-{osd-num}/sysvinit

Once I did that 'service ceph status' correctly shows that I have that OSD 
available and I can start or stop it from there.



Could you open a bug for this?

Cheers,
Brad
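As a concrete example of the fix above, for osd.12 in a cluster named "ceph" (the ids here are illustrative):

```shell
# Mark the OSD as sysvinit-managed, then control it via the init script.
sudo touch /var/lib/ceph/osd/ceph-12/sysvinit
sudo service ceph status osd.12
sudo service ceph start osd.12
```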



Re: [ceph-users] Centos 7 OSD silently fail to start

2015-02-25 Thread Brad Hubbard

On 02/26/2015 03:24 PM, Kyle Hutson wrote:

Just did it. Thanks for suggesting it.


No, definitely thank you. Much appreciated.



On Wed, Feb 25, 2015 at 5:59 PM, Brad Hubbard bhubb...@redhat.com 
mailto:bhubb...@redhat.com wrote:

On 02/26/2015 09:05 AM, Kyle Hutson wrote:

Thank you Thomas. You at least made me look in the right spot. Their
long-form is showing what to do for a mon, not an osd.

At the bottom of step 11, instead of
sudo touch /var/lib/ceph/mon/{cluster-__name}-{hostname}/sysvinit

It should read
sudo touch /var/lib/ceph/osd/{cluster-__name}-{osd-num}/sysvinit

Once I did that 'service ceph status' correctly shows that I have that 
OSD available and I can start or stop it from there.


Could you open a bug for this?

Cheers,
Brad






Re: [ceph-users] stuck ceph-deploy mon create-initial / giant

2015-02-24 Thread Brad Hubbard

On 02/24/2015 09:06 PM, Loic Dachary wrote:



On 24/02/2015 12:00, Christian Balzer wrote:

On Tue, 24 Feb 2015 11:17:22 +0100 Loic Dachary wrote:




On 24/02/2015 09:58, Stephan Seitz wrote:

Hi Loic,

this is the content of our ceph.conf

[global]
fsid = 719f14b2-7475-4b25-8c5f-3ffbcf594d13
mon_initial_members = ceph1, ceph2, ceph3
mon_host = 192.168.10.107,192.168.10.108,192.168.10.109
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd pool default size = 2
public network = 192.168.10.0/24
cluster networt = 192.168.108.0/24


s/networt/network/ ?



If this really should turn out to be the case, it is another painfully
obvious reason why I proposed providing full config parser output in
the logs at the default debugging level.


I agree. However, it is non trivial to implement because there is not a central 
place where all valid values are defined. It is also likely that third party 
scripts rely on the fact that arbitrary key/values can be stored in the 
configuration file.



This could be implemented as a warning then?

Possible invalid or 3rd party entry found: X. Please check your ceph.conf 
file.

People can then report false positives and they can be added to the list of known
values?
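A rough sketch of such a check in shell, using a deliberately tiny, hypothetical list of known option names (the real set is far larger) and the sample conf from this thread:

```shell
# Illustrative sketch only: warn about ceph.conf option names that are not in
# a known-good list. KNOWN_KEYS is a tiny hypothetical sample, not the real
# set of valid Ceph options.
cat > /tmp/ceph.conf.sample <<'EOF'
[global]
fsid = 719f14b2-7475-4b25-8c5f-3ffbcf594d13
public network = 192.168.10.0/24
cluster networt = 192.168.108.0/24
EOF

KNOWN_KEYS="fsid mon_initial_members mon_host public_network cluster_network"

WARNINGS=$(grep -v '^\[' /tmp/ceph.conf.sample | grep '=' |
    while IFS='=' read -r key _; do
        # Ceph treats spaces and underscores in option names interchangeably.
        key=$(printf '%s' "$key" | sed 's/[[:space:]]*$//' | tr ' ' '_')
        case " $KNOWN_KEYS " in
            *" $key "*) ;;  # recognised option
            *) echo "Possible invalid or 3rd party entry found: $key" ;;
        esac
    done)
echo "$WARNINGS"
# -> Possible invalid or 3rd party entry found: cluster_networt
```

This would have flagged the "networt" typo while letting the valid options through.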

Cheers,
Brad


Re: [ceph-users] QEMU Venom Vulnerability

2015-05-20 Thread Brad Hubbard

On 05/20/2015 11:02 AM, Robert LeBlanc wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

I've downloaded the new tarball, placed it in rpmbuild/SOURCES then
with the extracted spec file in rpmbuild/SPEC, I update it to the new
version and then rpmbuild -ba program.spec. If you install the SRPM
then it will install the RH patches that have been applied to the
package and then you get to have the fun of figuring out which patches
are still needed and which ones need to be modified. You can probably
build the package without the patches, but some things may work a
little differently. That would get you the closest to the official
RPMs

As to where to find the SRPMs, I'm not really sure, I come from a
Debian background where access to source packages is really easy.



# yumdownloader --source qemu-kvm --source qemu-kvm-rhev

This assumes you have the correct source repos enabled. Something like;

# subscription-manager repos --enable=rhel-7-server-openstack-6.0-source-rpms 
--enable=rhel-7-server-source-rpms

Taken from https://access.redhat.com/solutions/1381603

HTH.

Cheers,
Brad


- 
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, May 19, 2015 at 3:47 PM, Georgios Dimitrakakis  wrote:

Erik,

are you talking about the ones here :
http://ftp.redhat.com/redhat/linux/enterprise/6Server/en/RHEV/SRPMS/ ???

 From what I see the version is rather small 0.12.1.2-2.448

How one can verify that it has been patched against venom vulnerability?

Additionally I only see the qemu-kvm package and not the qemu-img. Is it
essential to update both in order to have a working CentOS system or can I
just proceed with the qemu-kvm?

Robert, any ideas where can I find the latest and patched SRPMs...I have
been building v.2.3.0 from source but I am very reluctant to use it in my
system :-)

Best,

George



You can also just fetch the rhev SRPMs  and build those. They have
rbd enabled already.
On May 19, 2015 12:31 PM, Robert LeBlanc  wrote:


-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

You should be able to get the SRPM, extract the SPEC file and use
that
to build a new package. You should be able to tweak all the compile
options as well. I'm still really new to building/rebuilding RPMs
but
I've been able to do this for a couple of packages.
- 
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Tue, May 19, 2015 at 12:33 PM, Georgios Dimitrakakis  wrote:

I am trying to build the packages manually and I was wondering
is the flag --enable-rbd enough to have full Ceph functionality?

Does anybody know what else flags should I include in order to

have the same

functionality as the original CentOS package plus the RBD

support?


Regards,

George


On Tue, 19 May 2015 13:45:50 +0300, Georgios Dimitrakakis wrote:


Hi!

The QEMU Venom vulnerability (http://venom.crowdstrike.com/ [1])

got my

attention and I would
like to know what are you people doing in order to have the

latest

patched QEMU version
working with Ceph RBD?

In my case I am using the qemu-img and qemu-kvm packages

provided by

Ceph (http://ceph.com/packages/ceph-extras/rpm/centos6/x86_64/

[2]) in

order to have RBD working on CentOS6 since the default

repository

packages do not work!

If I want to update to the latest QEMU packages which ones are

known

to work with Ceph RBD?
I have seen some people mentioning that Fedora packages are

working

but I am not sure if they have the latest packages available and

if

they are going to work eventually.

Is building manually the QEMU packages the only way???


Best regards,


George




-BEGIN PGP SIGNATURE-
Version: Mailvelope v0.13.1
Comment: https://www.mailvelope.com [7]

wsFcBAEBCAAQBQJVW4+RCRDmVDuy+mK58QAAg8AP/jqmQFYEwOeGRTJigk9M
pBhr34vyA3mky+BjjW9pt2tydECOH0p5PlYXBfhrQeg2B/yT0uVUKYbYkdBU
fY85UhS5NFdm7VyFyMPSGQwZlXIADF8YJw+Zbj1tpfRvbCi/sntbvGQk+9X8
usVSwBTbWKhYyMW8J5edppv72fMwoVjmoNXuE7wCUoqwxpQBUt0ouap6gDNd
Cu0ZMu+RKq+gfLGcIeSIhsDfV0/LHm2QBO/XjNZtMjyomOWNk9nYHp6HGJxH
MV/EoF4dYoCqHcODPjU2NvesQfYkmqfFoq/n9q/fMEV5JQ+mDfXqc2BcQUsx
40LDWDs+4BTw0KI+dNT0XUYTw+O0WnXFzgIn1wqXEs8pyOSJy1gCcnOGEavy
4PqYasm1g+5uzggaIddFPcWHJTw5FuFfjCnHX8Jo3EeQVDM6Vg8FPkkb5JQk
sqxVRQWsF89gGRUbHIQWdkgy3PZN0oTkBvUfflmE/cUq/r40sD4c25D+9Gti
Gj0IKG5uqMaHud3Hln++0ai5roOghoK0KxcDoBTmFLaQSNo9c4CIFCDf2kJ3
idH5tVozDSgvFpgBFLFatb7isctIYf4Luh/XpLXUzdjklGGzo9mhOjXsbm56
WCJZOkQ/OY1UFysMV5+tSSEn7TsF7Np9NagZB7AHhYuTKlOnbv3QJlhATOPp
u4wP
=SsM2
-END PGP SIGNATURE-

Re: [ceph-users] QEMU Venom Vulnerability

2015-05-20 Thread Brad Hubbard

On 05/21/2015 08:47 AM, Brad Hubbard wrote:

On 05/20/2015 11:02 AM, Robert LeBlanc wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

I've downloaded the new tarball, placed it in rpmbuild/SOURCES then
with the extracted spec file in rpmbuild/SPEC, I update it to the new
version and then rpmbuild -ba program.spec. If you install the SRPM
then it will install the RH patches that have been applied to the
package and then you get to have the fun of figuring out which patches
are still needed and which ones need to be modified. You can probably
build the package without the patches, but some things may work a
little differently. That would get you the closest to the official
RPMs

As to where to find the SRPMs, I'm not really sure, I come from a
Debian background where access to source packages is really easy.



# yumdownloader --source qemu-kvm --source qemu-kvm-rhev

This assumes you have the correct source repos enabled. Something like;

# subscription-manager repos --enable=rhel-7-server-openstack-6.0-source-rpms 
--enable=rhel-7-server-source-rpms

Taken from https://access.redhat.com/solutions/1381603


Of course the above is for RHEL only and is unnecessary as there are errata
packages for rhel. I was just trying to explain how you can get access to the
source packages for rhel.

As for Centos 6, although the version number may be small it has the fix.

http://vault.centos.org/6.6/updates/Source/SPackages/qemu-kvm-0.12.1.2-2.448.el6_6.3.src.rpm

$ rpm -qp --changelog qemu-kvm-0.12.1.2-2.448.el6_6.3.src.rpm |head -5
warning: qemu-kvm-0.12.1.2-2.448.el6_6.3.src.rpm: Header V3 RSA/SHA1 Signature, 
key ID c105b9de: NOKEY
* Fri May 08 2015 Miroslav Rezanina mreza...@redhat.com - 
0.12.1.2-2.448.el6_6.3
- kvm-fdc-force-the-fifo-access-to-be-in-bounds-of-the-all.patch [bz#1219267]
- Resolves: bz#1219267
  (EMBARGOED CVE-2015-3456 qemu-kvm: qemu: floppy disk controller flaw 
[rhel-6.6.z])

HTH.


Cheers,
Brad



HTH.

Cheers,
Brad


- 
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, May 19, 2015 at 3:47 PM, Georgios Dimitrakakis  wrote:

Erik,

are you talking about the ones here :
http://ftp.redhat.com/redhat/linux/enterprise/6Server/en/RHEV/SRPMS/ ???

 From what I see the version is rather small 0.12.1.2-2.448

How one can verify that it has been patched against venom vulnerability?

Additionally I only see the qemu-kvm package and not the qemu-img. Is it
essential to update both in order to have a working CentOS system or can I
just proceed with the qemu-kvm?

Robert, any ideas where can I find the latest and patched SRPMs...I have
been building v.2.3.0 from source but I am very reluctant to use it in my
system :-)

Best,

George



You can also just fetch the rhev SRPMs  and build those. They have
rbd enabled already.
On May 19, 2015 12:31 PM, Robert LeBlanc  wrote:


-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

You should be able to get the SRPM, extract the SPEC file and use
that
to build a new package. You should be able to tweak all the compile
options as well. I'm still really new to building/rebuilding RPMs
but
I've been able to do this for a couple of packages.
- 
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Tue, May 19, 2015 at 12:33 PM, Georgios Dimitrakakis  wrote:

I am trying to build the packages manually and I was wondering
is the flag --enable-rbd enough to have full Ceph functionality?

Does anybody know what else flags should I include in order to

have the same

functionality as the original CentOS package plus the RBD

support?


Regards,

George


On Tue, 19 May 2015 13:45:50 +0300, Georgios Dimitrakakis wrote:


Hi!

The QEMU Venom vulnerability (http://venom.crowdstrike.com/ [1])

got my

attention and I would
like to know what are you people doing in order to have the

latest

patched QEMU version
working with Ceph RBD?

In my case I am using the qemu-img and qemu-kvm packages

provided by

Ceph (http://ceph.com/packages/ceph-extras/rpm/centos6/x86_64/

[2]) in

order to have RBD working on CentOS6 since the default

repository

packages do not work!

If I want to update to the latest QEMU packages which ones are

known

to work with Ceph RBD?
I have seen some people mentioning that Fedora packages are

working

but I am not sure if they have the latest packages available and

if

they are going to work eventually.

Is building manually the QEMU packages the only way???


Best regards,


George




-BEGIN PGP SIGNATURE-
Version: Mailvelope v0.13.1
Comment: https://www.mailvelope.com [7]

wsFcBAEBCAAQBQJVW4+RCRDmVDuy+mK58QAAg8AP

Re: [ceph-users] QEMU Venom Vulnerability

2015-05-21 Thread Brad Hubbard

On 05/21/2015 03:39 PM, Georgios Dimitrakakis wrote:

Hi Brad!

Thanks for pointing out that for CentOS 6 the fix is included! Good to know 
that!


No problem.



But I think that the original package doesn't support RBD by default so it has 
to be built again, am I right?


I have not looked at the patch or the source but, judging by the changelog the
ceph package does not have the fix so yes, it would need to be patched and
recompiled.



If that's correct then starting from there and building a new RPM with RBD 
support is the proper way of updating. Correct?


I guess there are two ways to approach this.

1. use the existing ceph source rpm here.

http://ceph.com/packages/ceph-extras/rpm/centos6/SRPMS/qemu-kvm-0.12.1.2-2.415.el6.3ceph.src.rpm

and just apply the venom patch(s) to it by adding the patch file, incrementing
pkgrelease and adding a changelog entry (last two are optional but good
practice).
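Sketched out, approach 1 looks roughly like this (the spec edits are described, not shown; the patch filename is the one from the CentOS changelog quoted earlier):

```shell
# Unpack the ceph.com source RPM into the rpmbuild tree.
rpm -ivh qemu-kvm-0.12.1.2-2.415.el6.3ceph.src.rpm
# Drop the CVE-2015-3456 fix into SOURCES.
cp kvm-fdc-force-the-fifo-access-to-be-in-bounds-of-the-all.patch \
   ~/rpmbuild/SOURCES/
cd ~/rpmbuild/SPECS
# Edit qemu-kvm.spec: add a PatchNNNN: line, the matching %patch in %prep,
# bump Release, and add a %changelog entry. Then rebuild:
rpmbuild -ba qemu-kvm.spec
```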

2. use the latest Centos/Red Hat src rpm and add the ceph patches to its source
tree and Patch lines to its spec file as well as the optional pkgrelease and
changelog entries. The hard part may be working out what patches need to be
applied although this may be a starting point.

$ grep -i ceph qemu-kvm.spec
#%define buildid ceph
# For bz#988079 - [6.5 FEAT] qemu runtime support for librbd backend (ceph)
Patch4462: kvm-ceph-rbd-block-driver-for-qemu-kvm.patch
# For bz#988079 - [6.5 FEAT] qemu runtime support for librbd backend (ceph)
# For bz#988079 - [6.5 FEAT] qemu runtime support for librbd backend (ceph)
# For bz#988079 - [6.5 FEAT] qemu runtime support for librbd backend (ceph)
# For bz#988079 - [6.5 FEAT] qemu runtime support for librbd backend (ceph)
- kvm-ceph-rbd-block-driver-for-qemu-kvm.patch [bz#988079]
  ([6.5 FEAT] qemu runtime support for librbd backend (ceph))

Another approach would be to download the Centos src rpm on which the ceph rpm
was built and diff the two qemu-kvm.spec files.

If this can wait a few days I can chase up what the status of the ceph Centos 6
packages are internally and see if we are going to build more packages or if I
can build these for you/others (this will depend on what I find out).
Unfortunately I am travelling over the next few days so the timing is awkward
(when isn't it?).

Good luck either way and I will chase up the status of the Centos builds when I
am back home regardless.

Cheers,
Brad



Since I am very new at building RPMs is something else that I should be aware 
of or take care? Any guidelines maybe

Best regards,

George

On Thu, 21 May 2015 09:25:32 +1000, Brad Hubbard wrote:

On 05/21/2015 08:47 AM, Brad Hubbard wrote:

On 05/20/2015 11:02 AM, Robert LeBlanc wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

I've downloaded the new tarball, placed it in rpmbuild/SOURCES then
with the extracted spec file in rpmbuild/SPEC, I update it to the new
version and then rpmbuild -ba program.spec. If you install the SRPM
then it will install the RH patches that have been applied to the
package and then you get to have the fun of figuring out which patches
are still needed and which ones need to be modified. You can probably
build the package without the patches, but some things may work a
little differently. That would get you the closest to the official
RPMs

As to where to find the SRPMs, I'm not really sure, I come from a
Debian background where access to source packages is really easy.



# yumdownloader --source qemu-kvm --source qemu-kvm-rhev

This assumes you have the correct source repos enabled. Something like;

# subscription-manager repos --enable=rhel-7-server-openstack-6.0-source-rpms 
--enable=rhel-7-server-source-rpms

Taken from https://access.redhat.com/solutions/1381603


Of course the above is for RHEL only and is unnecessary as there are errata
packages for rhel. I was just trying to explain how you can get access to the
source packages for rhel.

As for Centos 6, although the version number may be small it has the fix.


http://vault.centos.org/6.6/updates/Source/SPackages/qemu-kvm-0.12.1.2-2.448.el6_6.3.src.rpm

$ rpm -qp --changelog qemu-kvm-0.12.1.2-2.448.el6_6.3.src.rpm |head -5
warning: qemu-kvm-0.12.1.2-2.448.el6_6.3.src.rpm: Header V3 RSA/SHA1
Signature, key ID c105b9de: NOKEY
* Fri May 08 2015 Miroslav Rezanina mreza...@redhat.com -
0.12.1.2-2.448.el6_6.3
- kvm-fdc-force-the-fifo-access-to-be-in-bounds-of-the-all.patch [bz#1219267]
- Resolves: bz#1219267
  (EMBARGOED CVE-2015-3456 qemu-kvm: qemu: floppy disk controller
flaw [rhel-6.6.z])

HTH.


Cheers,
Brad



HTH.

Cheers,
Brad


- 
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, May 19, 2015 at 3:47 PM, Georgios Dimitrakakis  wrote:

Erik,

are you talking about the ones here :

http://ftp.redhat.com/redhat/linux/enterprise/6Server/en/RHEV/SRPMS/ ???

 From what I see the version is rather small 0.12.1.2-2.448

How one can verify that it has been patched against venom vulnerability?

Additionally I

Re: [ceph-users] QEMU Venom Vulnerability

2015-05-21 Thread Brad Hubbard

On 05/21/2015 09:36 PM, Brad Hubbard wrote:

On 05/21/2015 03:39 PM, Georgios Dimitrakakis wrote:

Hi Brad!

Thanks for pointing out that for CentOS 6 the fix is included! Good to know 
that!


No problem.



But I think that the original package doesn't support RBD by default so it has 
to be built again, am I right?


Taking a look at the latest Centos src rpm it appears to include support for
librbd so I suspect the package on ceph.com is deprecated and you can just use
the latest Centos package. So it looks like qemu-kvm-0.12.1.2-2.448.el6_6.3 is
the way to go.

Can anyone confirm this is the case?
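One quick way to check whether a given qemu-kvm build was compiled with rbd support (a heuristic, not an authoritative answer):

```shell
# If the supported-formats list printed by qemu-img mentions rbd,
# the build was linked against librbd.
qemu-img --help | grep -ow rbd
```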



I have not looked at the patch or the source but, judging by the changelog the
ceph package does not have the fix so yes, it would need to be patched and
recompiled.



If that's correct then starting from there and building a new RPM with RBD 
support is the proper way of updating. Correct?


I guess there are two ways to approach this.

1. use the existing ceph source rpm here.

http://ceph.com/packages/ceph-extras/rpm/centos6/SRPMS/qemu-kvm-0.12.1.2-2.415.el6.3ceph.src.rpm

and just apply the venom patch(s) to it by adding the patch file, incrementing
pkgrelease and adding a changelog entry (last two are optional but good
practice).

2. use the latest Centos/Red Hat src rpm and add the ceph patches to its source
tree and Patch lines to its spec file as well as the optional pkgrelease and
changelog entries. The hard part may be working out what patches need to be
applied although this may be a starting point.

$ grep -i ceph qemu-kvm.spec
#%define buildid ceph
# For bz#988079 - [6.5 FEAT] qemu runtime support for librbd backend (ceph)
Patch4462: kvm-ceph-rbd-block-driver-for-qemu-kvm.patch
# For bz#988079 - [6.5 FEAT] qemu runtime support for librbd backend (ceph)
# For bz#988079 - [6.5 FEAT] qemu runtime support for librbd backend (ceph)
# For bz#988079 - [6.5 FEAT] qemu runtime support for librbd backend (ceph)
# For bz#988079 - [6.5 FEAT] qemu runtime support for librbd backend (ceph)
- kvm-ceph-rbd-block-driver-for-qemu-kvm.patch [bz#988079]
  ([6.5 FEAT] qemu runtime support for librbd backend (ceph))

Another approach would be to download the Centos src rpm on which the ceph rpm
was built and diff the two qemu-kvm.spec files.

If this can wait a few days I can chase up what the status of the ceph Centos 6
packages are internally and see if we are going to build more packages or if I
can build these for you/others (this will depend on what I find out).
Unfortunately I am travelling over the next few days so the timing is awkward
(when isn't it?).

Good luck either way and I will chase up the status of the Centos builds when I
am back home regardless.

Cheers,
Brad



Since I am very new at building RPMs is something else that I should be aware 
of or take care? Any guidelines maybe

Best regards,

George

On Thu, 21 May 2015 09:25:32 +1000, Brad Hubbard wrote:

On 05/21/2015 08:47 AM, Brad Hubbard wrote:

On 05/20/2015 11:02 AM, Robert LeBlanc wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

I've downloaded the new tarball, placed it in rpmbuild/SOURCES then
with the extracted spec file in rpmbuild/SPEC, I update it to the new
version and then rpmbuild -ba program.spec. If you install the SRPM
then it will install the RH patches that have been applied to the
package and then you get to have the fun of figuring out which patches
are still needed and which ones need to be modified. You can probably
build the package without the patches, but some things may work a
little differently. That would get you the closest to the official
RPMs

As to where to find the SRPMs, I'm not really sure, I come from a
Debian background where access to source packages is really easy.



# yumdownloader --source qemu-kvm --source qemu-kvm-rhev

This assumes you have the correct source repos enabled. Something like;

# subscription-manager repos --enable=rhel-7-server-openstack-6.0-source-rpms 
--enable=rhel-7-server-source-rpms

Taken from https://access.redhat.com/solutions/1381603


Of course the above is for RHEL only and is unnecessary as there are errata
packages for rhel. I was just trying to explain how you can get access to the
source packages for rhel.

As for Centos 6, although the version number may be small it has the fix.


http://vault.centos.org/6.6/updates/Source/SPackages/qemu-kvm-0.12.1.2-2.448.el6_6.3.src.rpm

$ rpm -qp --changelog qemu-kvm-0.12.1.2-2.448.el6_6.3.src.rpm |head -5
warning: qemu-kvm-0.12.1.2-2.448.el6_6.3.src.rpm: Header V3 RSA/SHA1
Signature, key ID c105b9de: NOKEY
* Fri May 08 2015 Miroslav Rezanina mreza...@redhat.com -
0.12.1.2-2.448.el6_6.3
- kvm-fdc-force-the-fifo-access-to-be-in-bounds-of-the-all.patch [bz#1219267]
- Resolves: bz#1219267
  (EMBARGOED CVE-2015-3456 qemu-kvm: qemu: floppy disk controller
flaw [rhel-6.6.z])

HTH.


Cheers,
Brad



HTH.

Cheers,
Brad


- 
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

Re: [ceph-users] problem with RGW

2015-07-31 Thread Brad Hubbard


- Original Message -
From: Butkeev Stas staer...@ya.ru
To: ceph-us...@ceph.com, ceph-commun...@lists.ceph.com, supp...@ceph.com
Sent: Friday, 31 July, 2015 9:10:40 PM
Subject: [ceph-users] problem with RGW

Hello everybody

We have ceph cluster that consist of 8 host with 12 osd per each host. It's 2T 
SATA disks.
In log osd.0

2015-07-31 14:03:24.490774 7f2cd95c5700  0 log_channel(cluster) log [WRN] : 35 
slow requests, 9 included below; oldest blocked for  3003.952332 secs
2015-07-31 14:03:24.490782 7f2cd95c5700  0 log_channel(cluster) log [WRN] : 
slow request 960.179599 seconds old, received at 2015-07-31 13:47:24.311080: 
osd_op(client.67321.0:7856 
default.34169.37__shadow_.AnULxoR-51Q7fGdIVVP92CPeptlQJIm_226 [writefull 0~0] 
26.f9af7c89 ack+ondisk+write+known_if_redirected e9467) currently no flag 
points reached
2015-07-31 14:03:24.490791 7f2cd95c5700  0 log_channel(cluster) log [WRN] : 
slow request 960.179357 seconds old, received at 2015-07-31 13:47:24.311323: 
osd_op(client.67321.0:7857 
default.34169.37__shadow_.AnULxoR-51Q7fGdIVVP92CPeptlQJIm_226 [writefull 
0~524288] 26.f9af7c89 ack+ondisk+write+known_if_redirected e9467) currently no 
flag points reached
2015-07-31 14:03:24.490794 7f2cd95c5700  0 log_channel(cluster) log [WRN] : 
slow request 960.167539 seconds old, received at 2015-07-31 13:47:24.323141: 
osd_op(client.67321.0:7858 
default.34169.37__shadow_.AnULxoR-51Q7fGdIVVP92CPeptlQJIm_226 [write 
524288~524288] 26.f9af7c89 ack+ondisk+write+known_if_redirected e9467) 
currently no flag points reached
2015-07-31 14:03:24.490797 7f2cd95c5700  0 log_channel(cluster) log [WRN] : 
slow request 960.14 seconds old, received at 2015-07-31 13:47:24.335126: 
osd_op(client.67321.0:7859 
default.34169.37__shadow_.AnULxoR-51Q7fGdIVVP92CPeptlQJIm_226 [write 
1048576~524288] 26.f9af7c89 ack+ondisk+write+known_if_redirected e9467) 
currently no flag points reached
2015-07-31 14:03:24.490801 7f2cd95c5700  0 log_channel(cluster) log [WRN] : 
slow request 960.145867 seconds old, received at 2015-07-31 13:47:24.344813: 
osd_op(client.67321.0:7860 
default.34169.37__shadow_.AnULxoR-51Q7fGdIVVP92CPeptlQJIm_226 [write 
1572864~524288] 26.f9af7c89 ack+ondisk+write+known_if_redirected e9467) 
currently no flag points reached
2015-07-31 14:03:25.491062 7f2cd95c5700  0 log_channel(cluster) log [WRN] : 35 
slow requests, 4 included below; oldest blocked for  3004.952621 secs
2015-07-31 14:03:25.491078 7f2cd95c5700  0 log_channel(cluster) log [WRN] : 
slow request 961.140790 seconds old, received at 2015-07-31 13:47:24.350178: 
osd_op(client.67321.0:7861 
default.34169.37__shadow_.AnULxoR-51Q7fGdIVVP92CPeptlQJIm_226 [write 
2097152~524288] 26.f9af7c89 ack+ondisk+write+known_if_redirected e9467) 
currently no flag points reached
2015-07-31 14:03:25.491084 7f2cd95c5700  0 log_channel(cluster) log [WRN] : 
slow request 961.097870 seconds old, received at 2015-07-31 13:47:24.393098: 
osd_op(client.67321.0:7862 
default.34169.37__shadow_.AnULxoR-51Q7fGdIVVP92CPeptlQJIm_226 [write 
2621440~524288] 26.f9af7c89 ack+ondisk+write+known_if_redirected e9467) 
currently no flag points reached
2015-07-31 14:03:25.491089 7f2cd95c5700  0 log_channel(cluster) log [WRN] : 
slow request 961.093229 seconds old, received at 2015-07-31 13:47:24.397740: 
osd_op(client.67321.0:7863 
default.34169.37__shadow_.AnULxoR-51Q7fGdIVVP92CPeptlQJIm_226 [write 
3145728~524288] 26.f9af7c89 ack+ondisk+write+known_if_redirected e9467) 
currently no flag points reached
2015-07-31 14:03:25.491095 7f2cd95c5700  0 log_channel(cluster) log [WRN] : 
slow request 961.002957 seconds old, received at 2015-07-31 13:47:24.488012: 
osd_op(client.67321.0:7864 
default.34169.37__shadow_.AnULxoR-51Q7fGdIVVP92CPeptlQJIm_226 [write 
3670016~524288] 26.f9af7c89 ack+ondisk+write+known_if_redirected e9467) 
currently no flag points reached

How can I avoid these blocked requests? What is the root cause of this problem?


Do a "ceph pg dump" and look for the pgs in this state
(ack+ondisk+write+known_if_redirected), then do a "ceph pg [pgid] query" and post
the output here (if there aren't too many, otherwise a representative sample).
Also look carefully at the acting OSDs for these pgs and check the output of
"ceph daemon /var/run/ceph/ceph-osd.NNN.asok dump_ops_in_flight". There could be
problems with these OSDs slowing down the requests, including hardware problems,
so check thoroughly.
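The pgid those follow-up commands need can be pulled straight out of the slow-request lines above. A minimal sketch (the sample log line is abridged from the output above; the cluster commands are left as comments since they need a live cluster):

```shell
# Extract the placement group id (pool.pg-seed) from a slow-request log line.
line='osd_op(client.67321.0:7858 default.34169.37__shadow_ [write 0~524288] 26.f9af7c89 ack+ondisk+write+known_if_redirected e9467)'
pgid=$(printf '%s\n' "$line" | grep -oE '[0-9]+\.[0-9a-f]+ ack' | cut -d' ' -f1)
echo "pgid=$pgid"
# On the cluster itself (commented out here):
#   ceph pg "$pgid" query
#   ceph daemon /var/run/ceph/ceph-osd.NNN.asok dump_ops_in_flight
```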
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] fedora core 22

2015-10-27 Thread Brad Hubbard
- Original Message -
> From: "Andrew Hume" 
> To: ceph-users@lists.ceph.com
> Sent: Tuesday, 27 October, 2015 11:13:04 PM
> Subject: [ceph-users] fedora core 22
> 
> a while back, i had installed ceph (firefly i believe) on my fedora core
> system and all went smoothly.
> i went to repeat this yesterday with hammer, but i am stymied by lack of
> packages. there doesn’t
> appear anything for fc21 or fc22.
> 
> i initially tried ceph-deploy, but it fails because of the above issue.
> i then looked at the manual install documentation but am growing nervous
> because
> it is clearly out of date (contents of ceph.conf are different than what
> ceps-deploy generated).
> 
> how do i make progress?

$ dnf list ceph
Last metadata expiration check performed 7 days, 16:01:51 ago on Tue Oct 20 
18:22:03 2015.
Available Packages
ceph.x86_64  1:0.94.3-1.fc22   updates

> 
>   andrew
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


Re: [ceph-users] Rados: Undefined symbol error

2015-08-27 Thread Brad Hubbard
- Original Message -
 From: Brad Hubbard bhubb...@redhat.com
 To: Aakanksha Pudipeddi-SSI aakanksha...@ssi.samsung.com
 Cc: ceph-us...@ceph.com
 Sent: Friday, 28 August, 2015 10:54:04 AM
 Subject: Re: [ceph-users] Rados: Undefined symbol error
 
 - Original Message -
  From: Aakanksha Pudipeddi-SSI aakanksha...@ssi.samsung.com
  To: Brad Hubbard bhubb...@redhat.com
  Cc: Jason Dillaman dilla...@redhat.com, ceph-us...@ceph.com
  Sent: Friday, 28 August, 2015 6:15:12 AM
  Subject: RE: [ceph-users] Rados: Undefined symbol error
  
  Hello Brad,
  
  Thank you for your response. Looks like the command is undefined.
  
   U _ZN5Mutex4LockEb
   U _ZN5Mutex6UnlockEv
   U _ZN5MutexC1ERKSsbbbP11CephContext
   U _ZN5MutexD1Ev
 
 $ git checkout v9.0.2
 M   src/civetweb
 HEAD is now at be422c8... 9.0.2
 
 $ git show|head -1
 commit be422c8f5b494c77ebcf0f7b95e5d728ecacb7f0
 
 $ ./autogen.sh && ./configure && make -j2
 $ sudo make install
 
 $ which rados
 /usr/local/bin/rados
 $ rados -v
 ceph version 9.0.2 (be422c8f5b494c77ebcf0f7b95e5d728ecacb7f0)
 
 $ nm /usr/local/bin/rados|grep ZN5MutexC1ERKSsbbbP11CephContext
 00513790 T _ZN5MutexC1ERKSsbbbP11CephContext
 
 What OS/environment is this in and is there anything unusual about it or the
 build environment or build process?
 
 What does the following command output?
 
 $ `which rados` -v

Hehe, of course you can't do this because getting it to run is the problem isn't
it? :P

Try this instead.

$ strings `which rados`|grep "^ceph version" -A5

 
  
  Thanks,
  Aakanksha
  
  -Original Message-
  From: Brad Hubbard [mailto:bhubb...@redhat.com]
  Sent: Wednesday, August 26, 2015 5:46 PM
  To: Aakanksha Pudipeddi-SSI
  Cc: Jason Dillaman; ceph-us...@ceph.com
  Subject: Re: [ceph-users] Rados: Undefined symbol error
  
  - Original Message -
   From: Aakanksha Pudipeddi-SSI aakanksha...@ssi.samsung.com
   To: Jason Dillaman dilla...@redhat.com
   Cc: ceph-us...@ceph.com
   Sent: Thursday, 27 August, 2015 6:22:45 AM
   Subject: Re: [ceph-users] Rados: Undefined symbol error
   
   Hello Jason,
   
   I checked the version of my built packages and they are all 9.0.2. I
   purged the cluster and uninstalled the packages and there seems to be
   nothing else
   - no older version. Could you elaborate on the fix for this issue?
  
  Some thoughts...
  
  # c++filt  _ZN5MutexC1ERKSsbbbP11CephContext
  Mutex::Mutex(std::basic_string<char, std::char_traits<char>,
  std::allocator<char> > const&, bool, bool, bool, CephContext*)
  
  That's from common/Mutex.cc
  
  # nm --dynamic `which rados` 2>&1|grep Mutex
  00504da0 T _ZN5Mutex4LockEb
  00504f70 T _ZN5Mutex6UnlockEv
  00504a50 T _ZN5MutexC1EPKcbbbP11CephContext
  00504a50 T _ZN5MutexC2EPKcbbbP11CephContext
  00504d10 T _ZN5MutexD1Ev
  00504d10 T _ZN5MutexD2Ev
  
  This shows my version is defined in the text section of the binary itself.
  What do you get when you run the above command?
  
  Like Jason says this is some sort of mis-match between your rados binary
  and
  your installed libs.
  
  HTH,
  Brad
  
   
   Thanks,
   Aakanksha
   
   -Original Message-
   From: Jason Dillaman [mailto:dilla...@redhat.com]
   Sent: Friday, August 21, 2015 6:37 AM
   To: Aakanksha Pudipeddi-SSI
   Cc: ceph-us...@ceph.com
   Subject: Re: [ceph-users] Rados: Undefined symbol error
   
    It sounds like you have the rados CLI tool from an earlier Ceph release
    (< Hammer) installed and it is attempting to use the librados shared
    library from a newer (>= Hammer) version of Ceph.
   
   Jason
   
   
   - Original Message -
   
From: Aakanksha Pudipeddi-SSI aakanksha...@ssi.samsung.com
To: ceph-us...@ceph.com
Sent: Thursday, August 20, 2015 11:47:26 PM
Subject: [ceph-users] Rados: Undefined symbol error
   
Hello,
   
I cloned the master branch of Ceph and after setting up the cluster,
when I tried to use the rados commands, I got this error:
   
rados: symbol lookup error: rados: undefined symbol:
_ZN5MutexC1ERKSsbbbP11CephContext
   
I saw a similar post here: http://tracker.ceph.com/issues/12563 but
I am not clear on the solution for this problem. I am not performing
an upgrade here but the error seems to be similar. Could anybody
shed more light on the issue and how to solve it? Thanks a lot!
   
Aakanksha
   
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
   ___
   ceph-users mailing list
   ceph-users@lists.ceph.com
   http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
   
  
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Rados: Undefined symbol error

2015-08-26 Thread Brad Hubbard
- Original Message -
 From: Aakanksha Pudipeddi-SSI aakanksha...@ssi.samsung.com
 To: Jason Dillaman dilla...@redhat.com
 Cc: ceph-us...@ceph.com
 Sent: Thursday, 27 August, 2015 6:22:45 AM
 Subject: Re: [ceph-users] Rados: Undefined symbol error
 
 Hello Jason,
 
 I checked the version of my built packages and they are all 9.0.2. I purged
 the cluster and uninstalled the packages and there seems to be nothing else
 - no older version. Could you elaborate on the fix for this issue?

Some thoughts...

# c++filt  _ZN5MutexC1ERKSsbbbP11CephContext
Mutex::Mutex(std::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&, bool, bool, bool, CephContext*)

That's from common/Mutex.cc

# nm --dynamic `which rados` 2>&1|grep Mutex
00504da0 T _ZN5Mutex4LockEb
00504f70 T _ZN5Mutex6UnlockEv
00504a50 T _ZN5MutexC1EPKcbbbP11CephContext
00504a50 T _ZN5MutexC2EPKcbbbP11CephContext
00504d10 T _ZN5MutexD1Ev
00504d10 T _ZN5MutexD2Ev

This shows my version is defined in the text section of the binary itself. What
do you get when you run the above command?

Like Jason says this is some sort of mis-match between your rados binary and
your installed libs.
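To make the T-versus-U distinction above concrete, here is a small sketch that works on canned nm-style output (the two sample lines are assumptions mirroring the thread, not taken from a real binary):

```shell
# 'T' means the symbol is defined in the binary's own text section;
# 'U' means it is undefined and must be resolved from a shared library
# (here, librados) by the dynamic linker at run time.
nm_output='00504a50 T _ZN5MutexC1EPKcbbbP11CephContext
U _ZN5MutexC1ERKSsbbbP11CephContext'
# List only the symbols the binary expects a library to provide.
printf '%s\n' "$nm_output" | awk '$1 == "U" {print $2}'
```

A binary whose Mutex constructor shows up only as 'U' fails exactly as reported if no installed librados exports that mangled name.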

HTH,
Brad

 
 Thanks,
 Aakanksha
 
 -Original Message-
 From: Jason Dillaman [mailto:dilla...@redhat.com]
 Sent: Friday, August 21, 2015 6:37 AM
 To: Aakanksha Pudipeddi-SSI
 Cc: ceph-us...@ceph.com
 Subject: Re: [ceph-users] Rados: Undefined symbol error
 
 It sounds like you have the rados CLI tool from an earlier Ceph release
 (< Hammer) installed and it is attempting to use the librados shared library
 from a newer (>= Hammer) version of Ceph.
 
 Jason
 
 
 - Original Message -
 
  From: Aakanksha Pudipeddi-SSI aakanksha...@ssi.samsung.com
  To: ceph-us...@ceph.com
  Sent: Thursday, August 20, 2015 11:47:26 PM
  Subject: [ceph-users] Rados: Undefined symbol error
 
  Hello,
 
  I cloned the master branch of Ceph and after setting up the cluster,
  when I tried to use the rados commands, I got this error:
 
  rados: symbol lookup error: rados: undefined symbol:
  _ZN5MutexC1ERKSsbbbP11CephContext
 
  I saw a similar post here: http://tracker.ceph.com/issues/12563 but I
  am not clear on the solution for this problem. I am not performing an
  upgrade here but the error seems to be similar. Could anybody shed
  more light on the issue and how to solve it? Thanks a lot!
 
  Aakanksha
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 


Re: [ceph-users] Rados: Undefined symbol error

2015-08-31 Thread Brad Hubbard
- Original Message -
> From: "Aakanksha Pudipeddi-SSI" <aakanksha...@ssi.samsung.com>
> To: "Brad Hubbard" <bhubb...@redhat.com>
> Cc: ceph-us...@ceph.com
> Sent: Tuesday, 1 September, 2015 3:33:38 AM
> Subject: RE: [ceph-users] Rados: Undefined symbol error
> 
> Hello Brad,
> 
> Sorry for the delay in replying. As you mentioned earlier,
> 
> $ `which rados` -v
> 
> Returns a command not found error

If it can't find rados, how are you using it?

Previously I gave you the following command. How did you run it if `which rados`
returns command not found?

# nm --dynamic `which rados` 2>&1|grep Mutex

$ which rados

Should return the path to the rados binary which you are having problems with.

What OS/environment is this in and is there anything unusual about it or the
build environment or build process?

> 
> $ strings `which rados`|grep "^ceph version" -A5
> 
> Returns no results.
> 
> Thanks a lot!
> Aakanksha
> 
> -Original Message-
> From: Brad Hubbard [mailto:bhubb...@redhat.com]
> Sent: Thursday, August 27, 2015 10:00 PM
> To: Aakanksha Pudipeddi-SSI
> Cc: ceph-us...@ceph.com
> Subject: Re: [ceph-users] Rados: Undefined symbol error
> 
> - Original Message -
> > From: "Brad Hubbard" <bhubb...@redhat.com>
> > To: "Aakanksha Pudipeddi-SSI" <aakanksha...@ssi.samsung.com>
> > Cc: ceph-us...@ceph.com
> > Sent: Friday, 28 August, 2015 10:54:04 AM
> > Subject: Re: [ceph-users] Rados: Undefined symbol error
> > 
> > - Original Message -
> > > From: "Aakanksha Pudipeddi-SSI" <aakanksha...@ssi.samsung.com>
> > > To: "Brad Hubbard" <bhubb...@redhat.com>
> > > Cc: "Jason Dillaman" <dilla...@redhat.com>, ceph-us...@ceph.com
> > > Sent: Friday, 28 August, 2015 6:15:12 AM
> > > Subject: RE: [ceph-users] Rados: Undefined symbol error
> > > 
> > > Hello Brad,
> > > 
> > > Thank you for your response. Looks like the command is undefined.
> > > 
> > > U _ZN5Mutex4LockEb
> > >  U _ZN5Mutex6UnlockEv
> > >  U _ZN5MutexC1ERKSsbbbP11CephContext
> > >  U _ZN5MutexD1Ev
> > 
> > $ git checkout v9.0.2
> > M   src/civetweb
> > HEAD is now at be422c8... 9.0.2
> > 
> > $ git show|head -1
> > commit be422c8f5b494c77ebcf0f7b95e5d728ecacb7f0
> > 
> > $ ./autogen.sh && ./configure && make -j2
> > $ sudo make install
> > 
> > $ which rados
> > /usr/local/bin/rados
> > $ rados -v
> > ceph version 9.0.2 (be422c8f5b494c77ebcf0f7b95e5d728ecacb7f0)
> > 
> > $ nm /usr/local/bin/rados|grep ZN5MutexC1ERKSsbbbP11CephContext
> > 00513790 T _ZN5MutexC1ERKSsbbbP11CephContext
> > 
> > What OS/environment is this in and is there anything unusual about it
> > or the build environment or build process?
> > 
> > What does the following command output?
> > 
> > $ `which rados` -v
> 
> Hehe, of course you can't do this because getting it to run is the problem
> isn't it? :P
> 
> Try this instead.
> 
> $ strings `which rados`|grep "^ceph version" -A5
> 
> > 
> > > 
> > > Thanks,
> > > Aakanksha
> > > 
> > > -Original Message-
> > > From: Brad Hubbard [mailto:bhubb...@redhat.com]
> > > Sent: Wednesday, August 26, 2015 5:46 PM
> > > To: Aakanksha Pudipeddi-SSI
> > > Cc: Jason Dillaman; ceph-us...@ceph.com
> > > Subject: Re: [ceph-users] Rados: Undefined symbol error
> > > 
> > > - Original Message -
> > > > From: "Aakanksha Pudipeddi-SSI" <aakanksha...@ssi.samsung.com>
> > > > To: "Jason Dillaman" <dilla...@redhat.com>
> > > > Cc: ceph-us...@ceph.com
> > > > Sent: Thursday, 27 August, 2015 6:22:45 AM
> > > > Subject: Re: [ceph-users] Rados: Undefined symbol error
> > > > 
> > > > Hello Jason,
> > > > 
> > > > I checked the version of my built packages and they are all 9.0.2.
> > > > I purged the cluster and uninstalled the packages and there seems
> > > > to be nothing else
> > > > - no older version. Could you elaborate on the fix for this issue?
> > > 
> > > Some thoughts...
> > > 
> > > # c++filt  _ZN5MutexC1ERKSsbbbP11CephContext
> > > Mutex::Mutex(std::basic_string<char, std::char_traits<char>,
> > > std::allocator<char> > const&, bool,

Re: [ceph-users] Rados: Undefined symbol error

2015-08-31 Thread Brad Hubbard
- Original Message -
> From: "Aakanksha Pudipeddi-SSI" <aakanksha...@ssi.samsung.com>
> To: "Brad Hubbard" <bhubb...@redhat.com>
> Cc: ceph-us...@ceph.com
> Sent: Tuesday, 1 September, 2015 7:27:04 AM
> Subject: RE: [ceph-users] Rados: Undefined symbol error
> 
> Hello Brad,
> 
> When I type "which rados" it returns /usr/bin/rados.

Ah, I think I see what is happening.

$ strings `which rados`|grep "^ceph version" -A5

Those are backticks "`", not single quotes "'".

Try the following if it's easier.

$ strings $(which rados)|grep "^ceph version" -A5
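As an aside, the two substitution forms are interchangeable; a tiny sketch (the echoed path is purely illustrative):

```shell
# Backticks and $(...) both perform command substitution; $(...) nests
# cleanly and is harder to mistake for single quotes in email text.
a=`echo /usr/bin/rados`
b=$(echo /usr/bin/rados)
[ "$a" = "$b" ] && echo "identical: $a"
```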


> I am using Ubuntu 14.04.
> I follow these steps in installing ceph from source:
> 
> 1. ./autogen.sh
> 2. Going to be using rocksdb, so: ./configure --with-librocksdb-static
> 3. make
> 4. sudo dpkg-buildpackage
> 
> Then I use ceph-deploy to complete setup of the cluster and instead of
> ceph-deploy install, I type sudo dpkg -i -R <path to the result of step 4>.
> 
> Thanks,
> Aakanksha
>  
> 
> -Original Message-
> From: Brad Hubbard [mailto:bhubb...@redhat.com]
> Sent: Monday, August 31, 2015 2:19 PM
> To: Aakanksha Pudipeddi-SSI
> Cc: ceph-us...@ceph.com
> Subject: Re: [ceph-users] Rados: Undefined symbol error
> 
> - Original Message -
> > From: "Aakanksha Pudipeddi-SSI" <aakanksha...@ssi.samsung.com>
> > To: "Brad Hubbard" <bhubb...@redhat.com>
> > Cc: ceph-us...@ceph.com
> > Sent: Tuesday, 1 September, 2015 3:33:38 AM
> > Subject: RE: [ceph-users] Rados: Undefined symbol error
> > 
> > Hello Brad,
> > 
> > Sorry for the delay in replying. As you mentioned earlier,
> > 
> > $ `which rados` -v
> > 
> > Returns a command not found error
> 
> If it can't find rados, how are you using it?
> 
> Previously I gave you the following command. How did you run it if `which
> rados` returns command not found?
> 
> # nm --dynamic `which rados` 2>&1|grep Mutex
> 
> $ which rados
> 
> Should return the path to the rados binary which you are having problems
> with.
> 
> What OS/environment is this in and is there anything unusual about it or the
> build environment or build process?
> 
> > 
> > $ strings `which rados`|grep "^ceph version" -A5
> > 
> > Returns no results.
> > 
> > Thanks a lot!
> > Aakanksha
> > 
> > -Original Message-
> > From: Brad Hubbard [mailto:bhubb...@redhat.com]
> > Sent: Thursday, August 27, 2015 10:00 PM
> > To: Aakanksha Pudipeddi-SSI
> > Cc: ceph-us...@ceph.com
> > Subject: Re: [ceph-users] Rados: Undefined symbol error
> > 
> > - Original Message -
> > > From: "Brad Hubbard" <bhubb...@redhat.com>
> > > To: "Aakanksha Pudipeddi-SSI" <aakanksha...@ssi.samsung.com>
> > > Cc: ceph-us...@ceph.com
> > > Sent: Friday, 28 August, 2015 10:54:04 AM
> > > Subject: Re: [ceph-users] Rados: Undefined symbol error
> > > 
> > > - Original Message -
> > > > From: "Aakanksha Pudipeddi-SSI" <aakanksha...@ssi.samsung.com>
> > > > To: "Brad Hubbard" <bhubb...@redhat.com>
> > > > Cc: "Jason Dillaman" <dilla...@redhat.com>, ceph-us...@ceph.com
> > > > Sent: Friday, 28 August, 2015 6:15:12 AM
> > > > Subject: RE: [ceph-users] Rados: Undefined symbol error
> > > > 
> > > > Hello Brad,
> > > > 
> > > > Thank you for your response. Looks like the command is undefined.
> > > > 
> > > >   U _ZN5Mutex4LockEb
> > > >  U _ZN5Mutex6UnlockEv
> > > >  U _ZN5MutexC1ERKSsbbbP11CephContext
> > > >  U _ZN5MutexD1Ev
> > > 
> > > $ git checkout v9.0.2
> > > M   src/civetweb
> > > HEAD is now at be422c8... 9.0.2
> > > 
> > > $ git show|head -1
> > > commit be422c8f5b494c77ebcf0f7b95e5d728ecacb7f0
> > > 
> > > $ ./autogen.sh && ./configure && make -j2
> > > $ sudo make install
> > > 
> > > $ which rados
> > > /usr/local/bin/rados
> > > $ rados -v
> > > ceph version 9.0.2 (be422c8f5b494c77ebcf0f7b95e5d728ecacb7f0)
> > > 
> > > $ nm /usr/local/bin/rados|grep ZN5MutexC1ERKSsbbbP11CephContext
> > > 00513790 T _ZN5MutexC1ERKSsbbbP11CephContext
> > > 
> > > What OS/environment is this in and is there anything unusual about
> > > it or the build environment 

Re: [ceph-users] Rados: Undefined symbol error

2015-08-31 Thread Brad Hubbard
- Original Message -
> From: "Aakanksha Pudipeddi-SSI" <aakanksha...@ssi.samsung.com>
> To: "Brad Hubbard" <bhubb...@redhat.com>
> Cc: "ceph-users" <ceph-us...@ceph.com>
> Sent: Tuesday, 1 September, 2015 7:58:33 AM
> Subject: RE: [ceph-users] Rados: Undefined symbol error
> 
> Brad,
> 
> Yes, you are right. Sorry about that! This is what I get when I try with the
> back ticks:
> $ `which rados` -v
> /usr/bin/rados: symbol lookup error: /usr/bin/rados: undefined symbol:
> _ZN5MutexC1ERKSsbbbP11CephContext
> $ strings `which rados`|grep "^ceph version"
> $
> $ strings $(which rados)|grep "^ceph version" -A5
> $
> 
> The latest command returns no results too.

Here's what you should get.

# strings $(which rados)|grep "^ceph version" -A5
ceph version 
e4bfad3a3c51054df7e537a724c8d0bf9be972ff
ConfLine(key = '
', val='
', newsection='
 = "

> 
> Thanks,
> Aakanksha
> 
> -Original Message-
> From: Brad Hubbard [mailto:bhubb...@redhat.com]
> Sent: Monday, August 31, 2015 2:49 PM
> To: Aakanksha Pudipeddi-SSI
> Cc: ceph-users
> Subject: Re: [ceph-users] Rados: Undefined symbol error
> 
> - Original Message -
> > From: "Aakanksha Pudipeddi-SSI" <aakanksha...@ssi.samsung.com>
> > To: "Brad Hubbard" <bhubb...@redhat.com>
> > Cc: ceph-us...@ceph.com
> > Sent: Tuesday, 1 September, 2015 7:27:04 AM
> > Subject: RE: [ceph-users] Rados: Undefined symbol error
> > 
> > Hello Brad,
> > 
> > When I type "which rados" it returns /usr/bin/rados.
> 
> Ah, I think I see what is happening.
> 
> $ strings `which rados`|grep "^ceph version" -A5
> 
> Those are backticks "`", not single quotes "'".
> 
> Try the following if it's easier.
> 
> $ strings $(which rados)|grep "^ceph version" -A5
> 
> 
> > I am using Ubuntu 14.04.
> > I follow these steps in installing ceph from source:
> > 
> > 1. ./autogen.sh
> > 2. Going to be using rocksdb, so: ./configure --with-librocksdb-static
> > 3. make 4. sudo dpkg-buildpackage
> > 
> > Then I use ceph-deploy to complete setup of the cluster and instead of
> > ceph-deploy install, I type sudo dpkg -i -R <path to the result of step 4>.
> > 
> > Thanks,
> > Aakanksha
> >  
> > 
> > -Original Message-
> > From: Brad Hubbard [mailto:bhubb...@redhat.com]
> > Sent: Monday, August 31, 2015 2:19 PM
> > To: Aakanksha Pudipeddi-SSI
> > Cc: ceph-us...@ceph.com
> > Subject: Re: [ceph-users] Rados: Undefined symbol error
> > 
> > - Original Message -
> > > From: "Aakanksha Pudipeddi-SSI" <aakanksha...@ssi.samsung.com>
> > > To: "Brad Hubbard" <bhubb...@redhat.com>
> > > Cc: ceph-us...@ceph.com
> > > Sent: Tuesday, 1 September, 2015 3:33:38 AM
> > > Subject: RE: [ceph-users] Rados: Undefined symbol error
> > > 
> > > Hello Brad,
> > > 
> > > Sorry for the delay in replying. As you mentioned earlier,
> > > 
> > > $ `which rados` -v
> > > 
> > > Returns a command not found error
> > 
> > If it can't find rados, how are you using it?
> > 
> > Previously I gave you the following command. How did you run it if
> > `which rados` returns command not found?
> > 
> > # nm --dynamic `which rados` 2>&1|grep Mutex
> > 
> > $ which rados
> > 
> > Should return the path to the rados binary which you are having
> > problems with.
> > 
> > What OS/environment is this in and is there anything unusual about it
> > or the build environment or build process?
> > 
> > > 
> > > $ strings `which rados`|grep "^ceph version" -A5
> > > 
> > > Returns no results.
> > > 
> > > Thanks a lot!
> > > Aakanksha
> > > 
> > > -Original Message-
> > > From: Brad Hubbard [mailto:bhubb...@redhat.com]
> > > Sent: Thursday, August 27, 2015 10:00 PM
> > > To: Aakanksha Pudipeddi-SSI
> > > Cc: ceph-us...@ceph.com
> > > Subject: Re: [ceph-users] Rados: Undefined symbol error
> > > 
> > > - Original Message -
> > > > From: "Brad Hubbard" <bhubb...@redhat.com>
> > > > To: "Aakanksha Pudipeddi-SSI" <aakanksha...@ssi.samsung.com>
> > > > Cc: ceph-us...@ceph.com
> > > > Sent: Friday, 28 August, 2015 10:54:04 AM
> > > > Subject: Re: [ceph-users

Re: [ceph-users] Rados: Undefined symbol error

2015-08-31 Thread Brad Hubbard


- Original Message -
> From: "Brad Hubbard" <bhubb...@redhat.com>
> To: "Aakanksha Pudipeddi-SSI" <aakanksha...@ssi.samsung.com>
> Cc: "ceph-users" <ceph-us...@ceph.com>
> Sent: Tuesday, 1 September, 2015 8:36:33 AM
> Subject: Re: [ceph-users] Rados: Undefined symbol error
> 
> - Original Message -
> > From: "Aakanksha Pudipeddi-SSI" <aakanksha...@ssi.samsung.com>
> > To: "Brad Hubbard" <bhubb...@redhat.com>
> > Cc: "ceph-users" <ceph-us...@ceph.com>
> > Sent: Tuesday, 1 September, 2015 7:58:33 AM
> > Subject: RE: [ceph-users] Rados: Undefined symbol error
> > 
> > Brad,
> > 
> > Yes, you are right. Sorry about that! This is what I get when I try with
> > the
> > back ticks:
> > $ `which rados` -v
> > /usr/bin/rados: symbol lookup error: /usr/bin/rados: undefined symbol:
> > _ZN5MutexC1ERKSsbbbP11CephContext
> > $ strings `which rados`|grep "^ceph version"
> > $
> > $ strings $(which rados)|grep "^ceph version" -A5
> > $
> > 
> > The latest command returns no results too.
> 
> Here's what you should get.
> 
> # strings $(which rados)|grep "^ceph version" -A5
> ceph version
> e4bfad3a3c51054df7e537a724c8d0bf9be972ff

Except you should see be422c8f5b494c77ebcf0f7b95e5d728ecacb7f0 since that is
v9.0.2. Your rados binary just isn't behaving like anything I've seen before.

How about you stand up a fresh VM and run "./autogen.sh && ./configure && make
install" on v9.0.2 and see if you get similar output to what I'm getting
then try working back from there?

> ConfLine(key = '
> ', val='
> ', newsection='
>  = "
> 
> > 
> > Thanks,
> > Aakanksha
> > 
> > -Original Message-
> > From: Brad Hubbard [mailto:bhubb...@redhat.com]
> > Sent: Monday, August 31, 2015 2:49 PM
> > To: Aakanksha Pudipeddi-SSI
> > Cc: ceph-users
> > Subject: Re: [ceph-users] Rados: Undefined symbol error
> > 
> > - Original Message -
> > > From: "Aakanksha Pudipeddi-SSI" <aakanksha...@ssi.samsung.com>
> > > To: "Brad Hubbard" <bhubb...@redhat.com>
> > > Cc: ceph-us...@ceph.com
> > > Sent: Tuesday, 1 September, 2015 7:27:04 AM
> > > Subject: RE: [ceph-users] Rados: Undefined symbol error
> > > 
> > > Hello Brad,
> > > 
> > > When I type "which rados" it returns /usr/bin/rados.
> > 
> > Ah, I think I see what is happening.
> > 
> > $ strings `which rados`|grep "^ceph version" -A5
> > 
> > Those are backticks "`", not single quotes "'".
> > 
> > Try the following if it's easier.
> > 
> > $ strings $(which rados)|grep "^ceph version" -A5
> > 
> > 
> > > I am using Ubuntu 14.04.
> > > I follow these steps in installing ceph from source:
> > > 
> > > 1. ./autogen.sh
> > > 2. Going to be using rocksdb, so: ./configure --with-librocksdb-static
> > > 3. make 4. sudo dpkg-buildpackage
> > > 
> > > Then I use ceph-deploy to complete setup of the cluster and instead of
> > > ceph-deploy install, I type sudo dpkg -i -R <path to the result of step 4>.
> > > 
> > > Thanks,
> > > Aakanksha
> > >  
> > > 
> > > -Original Message-
> > > From: Brad Hubbard [mailto:bhubb...@redhat.com]
> > > Sent: Monday, August 31, 2015 2:19 PM
> > > To: Aakanksha Pudipeddi-SSI
> > > Cc: ceph-us...@ceph.com
> > > Subject: Re: [ceph-users] Rados: Undefined symbol error
> > > 
> > > - Original Message -
> > > > From: "Aakanksha Pudipeddi-SSI" <aakanksha...@ssi.samsung.com>
> > > > To: "Brad Hubbard" <bhubb...@redhat.com>
> > > > Cc: ceph-us...@ceph.com
> > > > Sent: Tuesday, 1 September, 2015 3:33:38 AM
> > > > Subject: RE: [ceph-users] Rados: Undefined symbol error
> > > > 
> > > > Hello Brad,
> > > > 
> > > > Sorry for the delay in replying. As you mentioned earlier,
> > > > 
> > > > $ `which rados` -v
> > > > 
> > > > Returns a command not found error
> > > 
> > > If it can't find rados, how are you using it?
> > > 
> > > Previously I gave you the following command. How did you run it if
> > > `which rados` returns command not found?
> > > 
> > > # nm --dyna

Re: [ceph-users] Rados: Undefined symbol error

2015-09-01 Thread Brad Hubbard
- Original Message -
> From: "Aakanksha Pudipeddi-SSI" <aakanksha...@ssi.samsung.com>
> To: "Brad Hubbard" <bhubb...@redhat.com>
> Sent: Wednesday, 2 September, 2015 6:25:49 AM
> Subject: RE: [ceph-users] Rados: Undefined symbol error
> 
> Hello Brad,
> 
> I wanted to clarify the "make install" part of building a cluster. I finished
> building the source (have not done "make install" yet) and now when I type
> in "rados", I get this:
> 
> $rados
> 2015-09-01 13:12:25.061939 7f5370f35840 -1 did not load config file, using
> default settings.
> rados: you must give an action. Try --help
> 
> When I built ceph from source a couple of months ago(giant), I found that
> sudo make install does not deploy ceph binaries onto the system and hence,
> went through the process of building packages via dpkg and then deploying
> the cluster with ceph-deploy. I am not sure as to what make install does
> here. Could you elaborate on that?
> 
> I actually tried "make install" yesterday and when I typed "rados", I got
> something like this:
> 
> /usr/local/bin/rados: librados.so.2: cannot open shared object file
> 
> But I had to clone the source again because of some other issues and I am
> currently at the stage I mentioned in the beginning. Now I am not sure if I
> should "make install" or go through the process of building ceph packages
> from source and deploying the cluster with ceph-deploy. Any pointers on this
> would be very helpful! Thanks a lot again for your continued help :)

Note that the idea here was not to go into production with this but merely as a
test, that's why I suggested standing up a new vm to do it.

So let's try some things in the build directory then.

After the build the rados binary should end up in ./src/.libs/rados if deb
systems are the same as Fedora in that regard. If not you will need to find the
rados binary that gets built when you run "make". Once you have that run the
following on it.

$ strings ./src/.libs/rados|grep "^ceph version" -A5
$ eu-unstrip -n -e  ./src/.libs/rados
$ nm --dynamic ./src/.libs/rados|grep Mutex
$ ./src/.libs/rados -v

The last command may not work unless you have the correct libraries in place on
the target system but please include all output.

Then you can do your normal packaging and install and run the same commands
substituting "$(which rados)" for ./src/.libs/rados.

It is very important you include all output and, if any of the tools are
missing, you may need to install the equivalent of the elfutils package (for
eu-unstrip), although I guess in a pinch you could just use "strip" from the
binutils package; I just prefer the elfutils versions.
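One more check worth doing (not from the thread; the binary path is a placeholder): ask the dynamic linker which shared libraries the binary would actually load, since a stale librados.so.2 earlier in the search path would produce exactly this undefined-symbol failure.

```shell
# RADOS_BIN is a placeholder; point it at /usr/bin/rados on the affected host.
# /bin/sh stands in here so the sketch runs anywhere.
bin="${RADOS_BIN:-/bin/sh}"
# Count the resolved shared-library dependencies; with the real binary,
# eyeball the librados.so.2 line to see which copy the linker picked.
deps=$(ldd "$bin" | grep -c '=>')
echo "resolved deps: $deps"
```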

> 
> Aakanksha
> 
> 
> 
> -Original Message-
> From: Brad Hubbard [mailto:bhubb...@redhat.com]
> Sent: Monday, August 31, 2015 3:47 PM
> To: Aakanksha Pudipeddi-SSI
> Cc: ceph-users
> Subject: Re: [ceph-users] Rados: Undefined symbol error
> 
> 
> 
> - Original Message -
> > From: "Brad Hubbard" <bhubb...@redhat.com>
> > To: "Aakanksha Pudipeddi-SSI" <aakanksha...@ssi.samsung.com>
> > Cc: "ceph-users" <ceph-us...@ceph.com>
> > Sent: Tuesday, 1 September, 2015 8:36:33 AM
> > Subject: Re: [ceph-users] Rados: Undefined symbol error
> > 
> > - Original Message -
> > > From: "Aakanksha Pudipeddi-SSI" <aakanksha...@ssi.samsung.com>
> > > To: "Brad Hubbard" <bhubb...@redhat.com>
> > > Cc: "ceph-users" <ceph-us...@ceph.com>
> > > Sent: Tuesday, 1 September, 2015 7:58:33 AM
> > > Subject: RE: [ceph-users] Rados: Undefined symbol error
> > > 
> > > Brad,
> > > 
> > > Yes, you are right. Sorry about that! This is what I get when I try
> > > with the back ticks:
> > > $ `which rados` -v
> > > /usr/bin/rados: symbol lookup error: /usr/bin/rados: undefined symbol:
> > > _ZN5MutexC1ERKSsbbbP11CephContext
> > > $ strings `which rados`|grep "^ceph version"
> > > $
> > > $ strings $(which rados)|grep "^ceph version" -A5
> > > $
> > > 
> > > The latest command returns no results too.
> > 
> > Here's what you should get.
> > 
> > # strings $(which rados)|grep "^ceph version" -A5
> > ceph version
> > e4bfad3a3c51054df7e537a724c8d0bf9be972ff
> 
> Except you should see be422c8f5b494c77ebcf0f7b95e5d728ecacb7f0 since that is
> v9.0.2. Your rados binary just isn't behaving like anything I've seen
> before.
> 
> How about you stand up a fresh VM and run "

Re: [ceph-users] Cannot add/create new monitor on ceph v0.94.3

2015-09-06 Thread Brad Hubbard
- Original Message -
> From: "Fangzhe Chang (Fangzhe)" 
> To: ceph-users@lists.ceph.com
> Sent: Saturday, 5 September, 2015 6:26:16 AM
> Subject: [ceph-users] Cannot add/create new monitor on ceph v0.94.3
> 
> 
> 
> Hi,
> 
> I’m trying to add a second monitor using ‘ceph-deploy mon new <hostname>’. However, the log file shows the following error:
> 
> 2015-09-04 16:13:54.863479 7f4cbc3f7700 0 cephx: verify_reply couldn't
> decrypt with error: error decoding block for decryption
> 
> 2015-09-04 16:13:54.863491 7f4cbc3f7700 0 -- :6789/0 >>
> :6789/0 pipe(0x413 sd=12 :57954 s=1 pgs=0 cs=0 l=0
> c=0x3f29600).failed verifying authorize reply

A couple of things to look at are verifying all your clocks are in sync (ntp
helps here) and making sure you are running ceph-deploy in the directory you 
used
to create the cluster.
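A rough sketch of the clock check (hostnames are placeholders; the ssh line is commented out so the sketch runs locally):

```shell
# cephx authentication fails with "verify_reply couldn't decrypt" when
# node clocks drift too far apart, so compare each node to this one.
for host in mon1 mon2; do
    # remote=$(ssh "$host" date +%s)   # use this on a real cluster
    remote=$(date +%s)                 # local stand-in for the sketch
    now=$(date +%s)
    skew=$((remote - now))
    if [ "$skew" -lt 0 ]; then skew=$((-skew)); fi
    echo "$host clock skew: ${skew}s"
done
```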

> 
> 
> 
> Does anyone know how to resolve this?
> 
> Thanks
> 
> 
> 
> Fangzhe
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


Re: [ceph-users] Cannot add/create new monitor on ceph v0.94.3

2015-09-08 Thread Brad Hubbard
I'd suggest starting the mon with debugging turned right up and taking
a good look at the output.

Cheers,
Brad

- Original Message -
> From: "Fangzhe Chang (Fangzhe)" <fangzhe.ch...@alcatel-lucent.com>
> To: "Brad Hubbard" <bhubb...@redhat.com>
> Cc: ceph-users@lists.ceph.com
> Sent: Wednesday, 9 September, 2015 7:35:42 AM
> Subject: RE: [ceph-users] Cannot add/create new monitor on ceph v0.94.3
> 
> Thanks for the answer.
> 
> NTP is running on both the existing monitor and the new monitor being
> installed.
> I did run ceph-deploy in the same directory as I created the cluster.
> However, I need to tweak the options supplied to ceph-deploy a little bit
> since I was running it behind a corporate firewall.
> 
> I noticed the ceph-create-keys process is running in the background. When I
> ran it manually, I got the following results.
> 
> $ python /usr/sbin/ceph-create-keys --cluster ceph -i 
> INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
> INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
> INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
> 
> 
> -Original Message-
> From: Brad Hubbard [mailto:bhubb...@redhat.com]
> Sent: Sunday, September 06, 2015 11:58 PM
> To: Chang, Fangzhe (Fangzhe)
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Cannot add/create new monitor on ceph v0.94.3
> 
> - Original Message -
> > From: "Fangzhe Chang (Fangzhe)" <fangzhe.ch...@alcatel-lucent.com>
> > To: ceph-users@lists.ceph.com
> > Sent: Saturday, 5 September, 2015 6:26:16 AM
> > Subject: [ceph-users] Cannot add/create new monitor on ceph v0.94.3
> > 
> > 
> > 
> > Hi,
> > 
> > I’m trying to add a second monitor using ‘ceph-deploy mon new <hostname>’. However, the log file shows the following error:
> > 
> > 2015-09-04 16:13:54.863479 7f4cbc3f7700 0 cephx: verify_reply couldn't
> > decrypt with error: error decoding block for decryption
> > 
> > 2015-09-04 16:13:54.863491 7f4cbc3f7700 0 -- :6789/0
> > >> :6789/0 pipe(0x413 sd=12 :57954 s=1 pgs=0
> > cs=0 l=0 c=0x3f29600).failed verifying authorize reply
> 
> A couple of things to look at are verifying all your clocks are in sync (ntp
> helps here) and making sure you are running ceph-deploy in the directory you
> used to create the cluster.
> 
> > 
> > 
> > 
> > Does anyone know how to resolve this?
> > 
> > Thanks
> > 
> > 
> > 
> > Fangzhe
> > 
> > 
> > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> 


Re: [ceph-users] 9 PGs stay incomplete

2015-09-11 Thread Brad Hubbard
- Original Message -
> From: "Wido den Hollander" 
> To: "ceph-users" 
> Sent: Friday, 11 September, 2015 6:46:11 AM
> Subject: [ceph-users] 9 PGs stay incomplete
> 
> Hi,
> 
> I'm running into an issue with Ceph 0.94.2/3 where after doing a recovery
> test 9 PGs stay incomplete:
> 
> osdmap e78770: 2294 osds: 2294 up, 2294 in
> pgmap v1972391: 51840 pgs, 7 pools, 220 TB data, 185 Mobjects
>755 TB used, 14468 TB / 15224 TB avail
>   51831 active+clean
>   9 incomplete
> 
> As you can see, all 2294 OSDs are online and almost all PGs became
> active+clean again, except for 9.
> 
> I found out that these PGs are the problem:
> 
> 10.3762
> 7.309e
> 7.29a2
> 10.2289
> 7.17dd
> 10.165a
> 7.1050
> 7.c65
> 10.abf
> 
> Digging further, all the PGs map back to a OSD which is running on the
> same host. 'ceph-stg-01' in this case.
> 
> $ ceph pg 10.3762 query
> 
> Looking at the recovery state, this is shown:
> 
> {
> "first": 65286,
> "last": 67355,
> "maybe_went_rw": 0,
> "up": [
> 1420,
> 854,
> 1105

Anything interesting in the OSD logs for these OSDs?

> ],
> "acting": [
> 1420
> ],
> "primary": 1420,
> "up_primary": 1420
> },
> 
> osd.1420 is online. I tried restarting it, but nothing happens, these 9
> PGs stay incomplete.
> 
> Under 'peer_info' info I see both osd.854 and osd.1105 reporting about
> the PG with identical numbers.
> 
> I restarted both 854 and 1105, without result.
> 
> The output of PG query can be found here: http://pastebin.com/qQL699zC
> 
> The cluster is running a mix of 0.94.2 and .3 on Ubuntu 14.04.2 with the
> 3.13 kernel. XFS is being used as the backing filesystem.
> 
> Any suggestions to fix this issue? There is no valuable data in these
> pools, so I can remove them, but I'd rather fix the root-cause.
> 
> --
> Wido den Hollander
> 42on B.V.
> Ceph trainer and consultant
> 
> Phone: +31 (0)20 700 9902
> Skype: contact42on
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


Re: [ceph-users] OSD crash

2015-09-22 Thread Brad Hubbard
- Original Message - 

> From: "Alex Gorbachev" 
> To: "ceph-users" 
> Sent: Wednesday, 9 September, 2015 6:38:50 AM
> Subject: [ceph-users] OSD crash

> Hello,

> We have run into an OSD crash this weekend with the following dump. Please
> advise what this could be.

Hello Alex,

As you know I created http://tracker.ceph.com/issues/13074 for this issue but
the developers working on it would like any additional information you can
provide about the nature of the issue. Could you take a look?

Cheers,
Brad

> Best regards,
> Alex

> 2015-09-07 14:55:01.345638 7fae6c158700 0 -- 10.80.4.25:6830/2003934 >>
> 10.80.4.15:6813/5003974 pipe(0x1dd73000 sd=257 :6830 s=2 pgs=14271 cs=251
> l=0 c=0x10d34580).fault with nothing to send, going to standby
> 2015-09-07 14:56:16.948998 7fae643e8700 -1 *** Caught signal (Segmentation
> fault) **
> in thread 7fae643e8700

> ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
> 1: /usr/bin/ceph-osd() [0xacb3ba]
> 2: (()+0x10340) [0x7faea044e340]
> 3:
> (tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
> unsigned long, int)+0x103) [0x7faea067fac3]
> 4: (tcmalloc::ThreadCache::ListTooLong(tcmalloc::ThreadCache::FreeList*,
> unsigned long)+0x1b) [0x7faea067fb7b]
> 5: (operator delete(void*)+0x1f8) [0x7faea068ef68]
> 6: (std::_Rb_tree > >, std::_Select1st > > >, std::less,
> std::allocator > > > >::_M_erase(std::_Rb_tree_node > > >*)+0x58) [0xca2438]
> 7: (std::_Rb_tree > >, std::_Select1st > > >, std::less,
> std::allocator > > > >::erase(int const&)+0xdf) [0xca252f]
> 8: (Pipe::writer()+0x93c) [0xca097c]
> 9: (Pipe::Writer::entry()+0xd) [0xca40dd]
> 10: (()+0x8182) [0x7faea0446182]
> 11: (clone()+0x6d) [0x7fae9e9b100d]
> NOTE: a copy of the executable, or `objdump -rdS ` is needed to
> interpret this.

> --- begin dump of recent events ---
> -1> 2015-08-20 05:32:32.454940 7fae8e897700 0 -- 10.80.4.25:6830/2003934
> >> 10.80.4.15:6806/4003754 pipe(0x1992d000 sd=142 :6830 s=0 pgs=0 cs=0 l=0
> c=0x12bf5700).accept connect_seq 816 vs existing 815 state standby

> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] a couple of radosgw questions

2015-08-28 Thread Brad Hubbard
- Original Message -
 From: Tom Deneau tom.den...@amd.com
 To: ceph-us...@ceph.com
 Sent: Saturday, 29 August, 2015 4:01:08 AM
 Subject: [ceph-users] a couple of radosgw questions
 
 A couple of questions on the radosgw...
 
 1.  I noticed when I use s3cmd to put a 10M object into a bucket in the rados
 object gateway,
 I get the following objects created in .rgw.buckets:
  0.5M
4M
4M
  1.5M
 
 I assume the 4M breakdown is controlled by rgw obj stripe size.  What
 causes the small initial 0.5M piece?

Does this describe what you are seeing?

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-March/037736.html
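That post attributes the small first piece to the head object being capped separately from the tail stripes. As a rough sketch of the resulting layout (defaults assumed here: rgw_max_chunk_size = 512 KiB for the head, rgw_obj_stripe_size = 4 MiB for the tail; the real RGW manifest logic is more involved):

```python
# Sketch of how RGW splits an uploaded object into RADOS objects:
# the head object is capped at rgw_max_chunk_size (512 KiB assumed),
# the tail is cut into rgw_obj_stripe_size (4 MiB assumed) stripes.
MiB = 1024 * 1024

def rgw_layout(total, head_max=MiB // 2, stripe=4 * MiB):
    parts = [min(total, head_max)]
    left = total - parts[0]
    while left > 0:
        parts.append(min(left, stripe))
        left -= parts[-1]
    return parts

# A 10 MiB upload yields the 0.5M + 4M + 4M + 1.5M pieces observed
# in .rgw.buckets.
print([p / MiB for p in rgw_layout(10 * MiB)])  # [0.5, 4.0, 4.0, 1.5]
```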

 
 Also, is there any diagram showing which parts of this striping, if any,
 occur in parallel?
 
 
 2. I noticed when I use s3cmd to remove an object, it is no longer visible
 from the S3 API, but the objects
that comprised it are still there in .rgw.buckets pool.  When do they get
removed?

Does the following command remove them?

http://ceph.com/docs/master/radosgw/purge-temp/ 

 
 -- Tom Deneau, AMD
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 


Re: [ceph-users] Error while installing ceph

2015-08-28 Thread Brad Hubbard
Did you follow this first?

http://docs.ceph.com/docs/v0.80.5/start/quick-start-preflight/

It doesn't seem to be able to locate the repos for the ceph rpms.

- Original Message -
 From: pavana bhat pavanakrishnab...@gmail.com
 To: ceph-users@lists.ceph.com
 Sent: Saturday, 29 August, 2015 8:55:14 AM
 Subject: [ceph-users] Error while installing ceph
 
 Hi,
 
 Im getting an error while ceph installation. Can you please help me?
 
 I'm exactly following the steps given in
 http://docs.ceph.com/docs/v0.80.5/start/quick-ceph-deploy/ to install ceph.
 
 But when I execute ceph-deploy install {ceph-node}[{ceph-node} ...], getting
 following error:
 
 
 
 [ ceph-vm-mon1 ][ DEBUG ] Cleaning up everything
 
 [ ceph-vm-mon1 ][ DEBUG ] Cleaning up list of fastest mirrors
 
 [ ceph-vm-mon1 ][ INFO ] Running command: sudo yum -y install ceph-osd
 ceph-mds ceph-mon ceph-radosgw
 
 [ ceph-vm-mon1 ][ DEBUG ] Loaded plugins: fastestmirror
 
 [ ceph-vm-mon1 ][ DEBUG ] Determining fastest mirrors
 
 [ ceph-vm-mon1 ][ DEBUG ] * rhel-7-ha-rpms: 203.36.4.124
 
 [ ceph-vm-mon1 ][ DEBUG ] * rhel-7-optional-rpms: 203.36.4.124
 
 [ ceph-vm-mon1 ][ DEBUG ] * rhel-7-server-rpms: 203.36.4.124
 
 [ ceph-vm-mon1 ][ DEBUG ] * rhel-7-supplemental-rpms: 203.36.4.124
 
 [ ceph-vm-mon1 ][ DEBUG ] No package ceph-osd available.
 
 [ ceph-vm-mon1 ][ DEBUG ] No package ceph-mds available.
 
 [ ceph-vm-mon1 ][ DEBUG ] No package ceph-mon available.
 
 [ ceph-vm-mon1 ][ DEBUG ] No package ceph-radosgw available.
 
 [ ceph-vm-mon1 ][ WARNIN ] Error: Nothing to do
 
 [ ceph-vm-mon1 ][ ERROR ] RuntimeError: command returned non-zero exit
 status: 1
 
 [ ceph_deploy ][ ERROR ] RuntimeError: Failed to execute command: yum -y
 install ceph-osd ceph-mds ceph-mon ceph-radosgw
 
 I have finished the preflight steps and I'm able to connect to internet from
 my nodes.
 
 Thanks,
 
 Pavana
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 


Re: [ceph-users] Rados: Undefined symbol error

2015-08-27 Thread Brad Hubbard
- Original Message -
 From: Aakanksha Pudipeddi-SSI aakanksha...@ssi.samsung.com
 To: Brad Hubbard bhubb...@redhat.com
 Cc: Jason Dillaman dilla...@redhat.com, ceph-us...@ceph.com
 Sent: Friday, 28 August, 2015 6:15:12 AM
 Subject: RE: [ceph-users] Rados: Undefined symbol error
 
 Hello Brad,
 
 Thank you for your response. Looks like the command is undefined.
 
 U _ZN5Mutex4LockEb
  U _ZN5Mutex6UnlockEv
  U _ZN5MutexC1ERKSsbbbP11CephContext
  U _ZN5MutexD1Ev

$ git checkout v9.0.2
M   src/civetweb
HEAD is now at be422c8... 9.0.2

$ git show|head -1
commit be422c8f5b494c77ebcf0f7b95e5d728ecacb7f0

$ ./autogen.sh && ./configure && make -j2
$ sudo make install

$ which rados
/usr/local/bin/rados
$ rados -v
ceph version 9.0.2 (be422c8f5b494c77ebcf0f7b95e5d728ecacb7f0)

$ nm /usr/local/bin/rados|grep ZN5MutexC1ERKSsbbbP11CephContext
00513790 T _ZN5MutexC1ERKSsbbbP11CephContext

What OS/environment is this in and is there anything unusual about it or the
build environment or build process?

What does the following command output?

$ `which rados` -v

 
 Thanks,
 Aakanksha
 
 -Original Message-
 From: Brad Hubbard [mailto:bhubb...@redhat.com]
 Sent: Wednesday, August 26, 2015 5:46 PM
 To: Aakanksha Pudipeddi-SSI
 Cc: Jason Dillaman; ceph-us...@ceph.com
 Subject: Re: [ceph-users] Rados: Undefined symbol error
 
 - Original Message -
  From: Aakanksha Pudipeddi-SSI aakanksha...@ssi.samsung.com
  To: Jason Dillaman dilla...@redhat.com
  Cc: ceph-us...@ceph.com
  Sent: Thursday, 27 August, 2015 6:22:45 AM
  Subject: Re: [ceph-users] Rados: Undefined symbol error
  
  Hello Jason,
  
  I checked the version of my built packages and they are all 9.0.2. I
  purged the cluster and uninstalled the packages and there seems to be
  nothing else
  - no older version. Could you elaborate on the fix for this issue?
 
 Some thoughts...
 
 # c++filt _ZN5MutexC1ERKSsbbbP11CephContext
 Mutex::Mutex(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, bool, bool, CephContext*)
 
 Thats from common/Mutex.cc
 
 # nm --dynamic `which rados` 2>&1|grep Mutex
 00504da0 T _ZN5Mutex4LockEb
 00504f70 T _ZN5Mutex6UnlockEv
 00504a50 T _ZN5MutexC1EPKcbbbP11CephContext
 00504a50 T _ZN5MutexC2EPKcbbbP11CephContext
 00504d10 T _ZN5MutexD1Ev
 00504d10 T _ZN5MutexD2Ev
 
 This shows my version is defined in the text section of the binary itself.
 What do you get when you run the above command?
 
 Like Jason says this is some sort of mis-match between your rados binary and
 your installed libs.
 
 HTH,
 Brad
 
  
  Thanks,
  Aakanksha
  
  -Original Message-
  From: Jason Dillaman [mailto:dilla...@redhat.com]
  Sent: Friday, August 21, 2015 6:37 AM
  To: Aakanksha Pudipeddi-SSI
  Cc: ceph-us...@ceph.com
  Subject: Re: [ceph-users] Rados: Undefined symbol error
  
  It sounds like you have rados CLI tool from an earlier Ceph release (
  Hammer) installed and it is attempting to use the librados shared
  library from a newer (= Hammer) version of Ceph.
  
  Jason
  
  
  - Original Message -
  
   From: Aakanksha Pudipeddi-SSI aakanksha...@ssi.samsung.com
   To: ceph-us...@ceph.com
   Sent: Thursday, August 20, 2015 11:47:26 PM
   Subject: [ceph-users] Rados: Undefined symbol error
  
   Hello,
  
   I cloned the master branch of Ceph and after setting up the cluster,
   when I tried to use the rados commands, I got this error:
  
   rados: symbol lookup error: rados: undefined symbol:
   _ZN5MutexC1ERKSsbbbP11CephContext
  
   I saw a similar post here: http://tracker.ceph.com/issues/12563 but
   I am not clear on the solution for this problem. I am not performing
   an upgrade here but the error seems to be similar. Could anybody
   shed more light on the issue and how to solve it? Thanks a lot!
  
   Aakanksha
  
   ___
   ceph-users mailing list
   ceph-users@lists.ceph.com
   http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
  
 


Re: [ceph-users] Error while installing ceph

2015-08-28 Thread Brad Hubbard
- Original Message -
 From: pavana bhat pavanakrishnab...@gmail.com
 To: Brad Hubbard bhubb...@redhat.com
 Cc: ceph-users@lists.ceph.com
 Sent: Saturday, 29 August, 2015 9:40:50 AM
 Subject: Re: [ceph-users] Error while installing ceph
 
 Yes I did follow all the preflight steps.
 
 After yum install (sudo yum update && sudo yum install ceph-deploy), it did
 show the following are installed
 
 rhel-7-ha-rpms
 
 
 rhel-7-optional-rpms
 
 
 rhel-7-server-rpms
 
 
 rhel-7-supplemental-rpms
 
 
 rhel-7-server-rpms/primary_db
 
 ceph-noarch
 
 Installed:
 
   ceph-deploy.noarch 0:1.5.28-0

Perhaps the --repo and/or --release flags are required?

 
 
 Thanks,
 
 Pavana
 
 On Fri, Aug 28, 2015 at 4:29 PM, Brad Hubbard bhubb...@redhat.com wrote:
 
  Did you follow this first?
 
  http://docs.ceph.com/docs/v0.80.5/start/quick-start-preflight/
 
  It doesn't seem to be able to locate the repos for the ceph rpms.
 
  - Original Message -
   From: pavana bhat pavanakrishnab...@gmail.com
   To: ceph-users@lists.ceph.com
   Sent: Saturday, 29 August, 2015 8:55:14 AM
   Subject: [ceph-users] Error while installing ceph
  
   Hi,
  
   Im getting an error while ceph installation. Can you please help me?
  
   I'm exactly following the steps given in
   http://docs.ceph.com/docs/v0.80.5/start/quick-ceph-deploy/ to install
  ceph.
  
   But when I execute ceph-deploy install {ceph-node}[{ceph-node} ...],
  getting
   following error:
  
  
  
   [ ceph-vm-mon1 ][ DEBUG ] Cleaning up everything
  
   [ ceph-vm-mon1 ][ DEBUG ] Cleaning up list of fastest mirrors
  
   [ ceph-vm-mon1 ][ INFO ] Running command: sudo yum -y install ceph-osd
   ceph-mds ceph-mon ceph-radosgw
  
   [ ceph-vm-mon1 ][ DEBUG ] Loaded plugins: fastestmirror
  
   [ ceph-vm-mon1 ][ DEBUG ] Determining fastest mirrors
  
   [ ceph-vm-mon1 ][ DEBUG ] * rhel-7-ha-rpms: 203.36.4.124
  
   [ ceph-vm-mon1 ][ DEBUG ] * rhel-7-optional-rpms: 203.36.4.124
  
   [ ceph-vm-mon1 ][ DEBUG ] * rhel-7-server-rpms: 203.36.4.124
  
   [ ceph-vm-mon1 ][ DEBUG ] * rhel-7-supplemental-rpms: 203.36.4.124
  
   [ ceph-vm-mon1 ][ DEBUG ] No package ceph-osd available.
  
   [ ceph-vm-mon1 ][ DEBUG ] No package ceph-mds available.
  
   [ ceph-vm-mon1 ][ DEBUG ] No package ceph-mon available.
  
   [ ceph-vm-mon1 ][ DEBUG ] No package ceph-radosgw available.
  
   [ ceph-vm-mon1 ][ WARNIN ] Error: Nothing to do
  
   [ ceph-vm-mon1 ][ ERROR ] RuntimeError: command returned non-zero exit
   status: 1
  
   [ ceph_deploy ][ ERROR ] RuntimeError: Failed to execute command: yum -y
   install ceph-osd ceph-mds ceph-mon ceph-radosgw
  
   I have finished the preflight steps and I'm able to connect to internet
  from
   my nodes.
  
   Thanks,
  
   Pavana
  
   ___
   ceph-users mailing list
   ceph-users@lists.ceph.com
   http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
  
 
 


Re: [ceph-users] a couple of radosgw questions

2015-08-28 Thread Brad Hubbard
- Original Message -
 From: Ben Hines bhi...@gmail.com
 To: Brad Hubbard bhubb...@redhat.com
 Cc: Tom Deneau tom.den...@amd.com, ceph-users ceph-us...@ceph.com
 Sent: Saturday, 29 August, 2015 9:49:00 AM
 Subject: Re: [ceph-users] a couple of radosgw questions
 
 16:22:38 root@sm-cephrgw4 /etc/ceph $ radosgw-admin temp remove
 unrecognized arg remove
 usage: radosgw-admin cmd [options...]
 commands:
 
  temp remove    remove temporary objects that were created up to
  specified date (and optional time)

Looking into this ambiguity, thanks.

 
 
 On Fri, Aug 28, 2015 at 4:24 PM, Brad Hubbard bhubb...@redhat.com wrote:
  emove an object, it is no longer visible
  from the S3 API, but the objects
 that comprised it are still there in .rgw.buckets pool.  When do they
 get
 removed?
 
  Does the following command remove them?
 
  http://ceph.com/docs/master/radosgw/purge-temp/
 

Does radosgw-admin gc list show anything?


Re: [ceph-users] Error while installing ceph

2015-08-28 Thread Brad Hubbard
- Original Message -
 From: Travis Rhoden trho...@gmail.com
 To: Brad Hubbard bhubb...@redhat.com
 Cc: pavana bhat pavanakrishnab...@gmail.com, ceph-users@lists.ceph.com
 Sent: Saturday, 29 August, 2015 10:11:21 AM
 Subject: Re: [ceph-users] Error while installing ceph
 
 A couple of things here...
 
 Looks like you are on RHEL. If you are on RHEL, but *not* trying to install
 RHCS (Red Hat Ceph Storage), a few extra flags are required.  You must use
 --release.  For example, ceph-deploy install --release hammer  in
 order to get the Hammer upstream release.
 
 The docs need to make this more clear (I don't think it's mentioned
 anywhere

Here, http://ceph.com/ceph-deploy/docs/install.html

 -- upstream Ceph on RHEL is not a very common case, but it is
 supposed to work. :))

Must admit I haven't played much with the installation of upstream rpms on
rhel.

 
 That will at least install the right packages.  However, there is still one
 more issue you will hit, which is that when installing upstream Ceph on
 RHEL, it knows that it needs EPEL (EPEL is not needed with RHCS), and it
 will try to install it by name yum install epel-release.  But that
 doesn't work on RHEL.  Until that is fixed, you will also have to install
 EPEL by hand on your nodes.

The above page also says "In some distributions, other repos (besides the ceph
repos) will be added, like EPEL for CentOS."

 
 On Fri, Aug 28, 2015 at 5:02 PM, Brad Hubbard bhubb...@redhat.com wrote:
 
  - Original Message -
   From: pavana bhat pavanakrishnab...@gmail.com
   To: Brad Hubbard bhubb...@redhat.com
   Cc: ceph-users@lists.ceph.com
   Sent: Saturday, 29 August, 2015 9:40:50 AM
   Subject: Re: [ceph-users] Error while installing ceph
  
   Yes I did follow all the preflight steps.
  
   After yum install (sudo yum update && sudo yum install ceph-deploy), it
  did
   show the following are installed
  
   rhel-7-ha-rpms
  
  
   rhel-7-optional-rpms
  
  
   rhel-7-server-rpms
  
  
   rhel-7-supplemental-rpms
  
  
   rhel-7-server-rpms/primary_db
  
   ceph-noarch
  
   Installed:
  
 ceph-deploy.noarch 0:1.5.28-0
 
  Perhaps the --repo and/or --release flags are required?
 
  
  
   Thanks,
  
   Pavana
  
   On Fri, Aug 28, 2015 at 4:29 PM, Brad Hubbard bhubb...@redhat.com
  wrote:
  
Did you follow this first?
   
http://docs.ceph.com/docs/v0.80.5/start/quick-start-preflight/
   
It doesn't seem to be able to locate the repos for the ceph rpms.
   
- Original Message -
 From: pavana bhat pavanakrishnab...@gmail.com
 To: ceph-users@lists.ceph.com
 Sent: Saturday, 29 August, 2015 8:55:14 AM
 Subject: [ceph-users] Error while installing ceph

 Hi,

 Im getting an error while ceph installation. Can you please help me?

 I'm exactly following the steps given in
 http://docs.ceph.com/docs/v0.80.5/start/quick-ceph-deploy/ to
  install
ceph.
 
 
 These are pretty old docs (see the version number in the URL).  It's
 probably always best to start at http://docs.ceph.com/docs/master instead.
 How did you get to this old version?  If it was from a link, we would want
 to check that that link still made sense.
 
 

 But when I execute ceph-deploy install {ceph-node}[{ceph-node} ...],
getting
 following error:



 [ ceph-vm-mon1 ][ DEBUG ] Cleaning up everything

 [ ceph-vm-mon1 ][ DEBUG ] Cleaning up list of fastest mirrors

 [ ceph-vm-mon1 ][ INFO ] Running command: sudo yum -y install
  ceph-osd
 ceph-mds ceph-mon ceph-radosgw

 [ ceph-vm-mon1 ][ DEBUG ] Loaded plugins: fastestmirror

 [ ceph-vm-mon1 ][ DEBUG ] Determining fastest mirrors

 [ ceph-vm-mon1 ][ DEBUG ] * rhel-7-ha-rpms: 203.36.4.124

 [ ceph-vm-mon1 ][ DEBUG ] * rhel-7-optional-rpms: 203.36.4.124

 [ ceph-vm-mon1 ][ DEBUG ] * rhel-7-server-rpms: 203.36.4.124

 [ ceph-vm-mon1 ][ DEBUG ] * rhel-7-supplemental-rpms: 203.36.4.124

 [ ceph-vm-mon1 ][ DEBUG ] No package ceph-osd available.

 [ ceph-vm-mon1 ][ DEBUG ] No package ceph-mds available.

 [ ceph-vm-mon1 ][ DEBUG ] No package ceph-mon available.

 [ ceph-vm-mon1 ][ DEBUG ] No package ceph-radosgw available.

 [ ceph-vm-mon1 ][ WARNIN ] Error: Nothing to do

 [ ceph-vm-mon1 ][ ERROR ] RuntimeError: command returned non-zero
  exit
 status: 1

 [ ceph_deploy ][ ERROR ] RuntimeError: Failed to execute command:
  yum -y
 install ceph-osd ceph-mds ceph-mon ceph-radosgw

 I have finished the preflight steps and I'm able to connect to
  internet
from
 my nodes.

 Thanks,

 Pavana

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

   
  
  ___
  ceph-users mailing list
  ceph-users

Re: [ceph-users] OSD error

2015-12-08 Thread Brad Hubbard
+ceph-devel

- Original Message - 

> From: "Dan Nica" 
> To: ceph-us...@ceph.com
> Sent: Tuesday, 8 December, 2015 7:54:20 PM
> Subject: [ceph-users] OSD error

> Hi guys,

> Recently I installed ceph cluster version 9.2.0, and on my osd logs I see
> these errors:

> 2015-12-08 04:49:12.931683 7f42ec266700 -1 lsb_release_parse - pclose failed:
> (13) Permission denied
> 2015-12-08 04:49:12.955264 7f42ec266700 -1 lsb_release_parse - pclose failed:
> (13) Permission denied

> Do I have to worry about it ? what is generating these errors ?

Dan, what does "lsb_release -idrc" return on this system?

I wonder if we are getting hit with EINTR here maybe and getting SIGPIPE?

static void lsb_release_parse(map<string, string> *m, CephContext *cct)
{
  FILE *fp = popen("lsb_release -idrc", "r");
  if (!fp) {
int ret = -errno;
lderr(cct) << "lsb_release_parse - failed to call lsb_release binary with 
error: " << cpp_strerror(ret) << dendl;
return;
  }

  char buf[512];
  while (fgets(buf, sizeof(buf) - 1, fp) != NULL) {
if (lsb_release_set(buf, "Distributor ID:", m, "distro"))
  continue;
if (lsb_release_set(buf, "Description:", m, "distro_description"))
  continue;
if (lsb_release_set(buf, "Release:", m, "distro_version"))
  continue;
if (lsb_release_set(buf, "Codename:", m, "distro_codename"))
  continue;

lderr(cct) << "unhandled output: " << buf << dendl;
  }

  if (pclose(fp)) {
int ret = -errno;
lderr(cct) << "lsb_release_parse - pclose failed: " << cpp_strerror(ret) << 
dendl;   <--HERE
  }
}
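For reference, the parsing loop above just picks four "Key: value" lines out of the lsb_release output. A Python analogue of that step (field names taken from the code above; the sample output is made up, so check it against what lsb_release actually prints on the affected box):

```python
# Python analogue of the parsing loop in lsb_release_parse() above:
# extract the four fields ceph records from `lsb_release -idrc` output.
FIELDS = {
    "Distributor ID:": "distro",
    "Description:": "distro_description",
    "Release:": "distro_version",
    "Codename:": "distro_codename",
}

def parse_lsb(text):
    out = {}
    for line in text.splitlines():
        for prefix, key in FIELDS.items():
            if line.startswith(prefix):
                out[key] = line[len(prefix):].strip()
    return out

sample = """Distributor ID:\tUbuntu
Description:\tUbuntu 14.04.2 LTS
Release:\t14.04
Codename:\ttrusty"""
print(parse_lsb(sample)["distro"])  # Ubuntu
```

Anything the C++ loop cannot match lands in the "unhandled output" branch, which is a separate symptom from the pclose error you are seeing.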

Cheers,
Brad


Re: [ceph-users] ceph new installation of ceph 0.9.2 issue and crashing osds

2015-12-08 Thread Brad Hubbard
Looks like it's failing to create a thread.

Try setting kernel.pid_max to 4194303 in /etc/sysctl.conf
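To see why a dense OSD box can blow past the default pid_max of 32768, a rough back-of-envelope sketch (the per-OSD worker count and peer counts below are illustrative only, not exact Ceph accounting):

```python
# Rough thread-demand estimate for a host with many OSDs: the simple
# messenger costs roughly two threads per open connection, on top of
# each OSD's own worker threads. All numbers here are illustrative.
def threads_needed(osds, peers_per_osd, workers_per_osd=40):
    return osds * (workers_per_osd + 2 * peers_per_osd)

DEFAULT_PID_MAX = 32768  # typical kernel default

need = threads_needed(osds=50, peers_per_osd=400)
print(need, need > DEFAULT_PID_MAX)  # 42000 True
```

Once demand crosses pid_max, pthread_create() starts failing and you get exactly the FAILED assert(ret == 0) in Thread::create() shown in the stacktrace below.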

Cheers,
Brad

- Original Message -
> From: "Kenneth Waegeman" 
> To: ceph-users@lists.ceph.com
> Sent: Tuesday, 8 December, 2015 10:45:11 PM
> Subject: [ceph-users] ceph new installation of ceph 0.9.2 issue and crashing  
> osds
> 
> Hi,
> 
> I installed ceph 0.9.2 on a new cluster of 3 nodes, with 50 OSDs on each
> node (300GB disks, 96GB RAM)
> 
> While installing, I got some issue that I even could not login as ceph
> user. So I increased some limits:
>   security/limits.conf
> 
> ceph-   nproc   1048576
> ceph-   nofile 1048576
> 
> I could then install the other OSDs.
> 
> After the cluster was installed, I added some extra pools. when creating
> the pgs of these pools, the osds of the cluster started to fail, with
> stacktraces. If I try to restart them, they keep on failing. I don't
> know if this is an actual bug of Infernalis, or a limit that is still
> not high enough.. I've increased the noproc and nofile entries even
> more, but no luck. Someone has a clue? Hereby the stacktraces I see:
> 
> Mostly this one:
> 
> -12> 2015-12-08 10:17:18.995243 7fa9063c5700  5 osd.12 pg_epoch: 904
> pg[3.3b(unlocked)] enter Initial
> -11> 2015-12-08 10:17:18.995279 7fa9063c5700  5 write_log with:
> dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615,
> dirty_divergent_priors: false, divergent_priors: 0, writeout_from:
> 4294967295'184467
> 44073709551615, trimmed:
> -10> 2015-12-08 10:17:18.995292 7fa9063c5700  5 osd.12 pg_epoch: 904
> pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
> [12,80,111] r=0 lpr=0 crt=0'0 mlcod 0'0 inactive] exit Initial 0.48
> 0 0.00
>  -9> 2015-12-08 10:17:18.995301 7fa9063c5700  5 osd.12 pg_epoch: 904
> pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
> [12,80,111] r=0 lpr=0 crt=0'0 mlcod 0'0 inactive] enter Reset
>  -8> 2015-12-08 10:17:18.995310 7fa9063c5700  5 osd.12 pg_epoch: 904
> pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
> [12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 inactive] exit Reset 0.08
> 1 0.17
>  -7> 2015-12-08 10:17:18.995326 7fa9063c5700  5 osd.12 pg_epoch: 904
> pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
> [12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 inactive] enter Started
>  -6> 2015-12-08 10:17:18.995332 7fa9063c5700  5 osd.12 pg_epoch: 904
> pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
> [12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 inactive] enter Start
>  -5> 2015-12-08 10:17:18.995338 7fa9063c5700  1 osd.12 pg_epoch: 904
> pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
> [12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 inactive] state: transi
> tioning to Primary
>  -4> 2015-12-08 10:17:18.995345 7fa9063c5700  5 osd.12 pg_epoch: 904
> pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
> [12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 inactive] exit Start 0.12
> 0 0.00
>  -3> 2015-12-08 10:17:18.995352 7fa9063c5700  5 osd.12 pg_epoch: 904
> pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
> [12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 inactive] enter Started/Primar
> y
>  -2> 2015-12-08 10:17:18.995358 7fa9063c5700  5 osd.12 pg_epoch: 904
> pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
> [12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 creating] enter Started/Primar
> y/Peering
>  -1> 2015-12-08 10:17:18.995365 7fa9063c5700  5 osd.12 pg_epoch: 904
> pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
> [12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 creating+peering] enter Starte
> d/Primary/Peering/GetInfo
>   0> 2015-12-08 10:17:18.998472 7fa9063c5700 -1 common/Thread.cc: In
> function 'void Thread::create(size_t)' thread 7fa9063c5700 time
> 2015-12-08 10:17:18.995438
> common/Thread.cc: 154: FAILED assert(ret == 0)
> 
>   ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
>   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x85) [0x7fa91924ebe5]
>   2: (Thread::create(unsigned long)+0x8a) [0x7fa91923325a]
>   3: (SimpleMessenger::connect_rank(entity_addr_t const&, int,
> PipeConnection*, Message*)+0x185) [0x7fa919229105]
>   4: (SimpleMessenger::get_connection(entity_inst_t const&)+0x3ba)
> [0x7fa9192298ea]
>   5: (OSDService::get_con_osd_cluster(int, unsigned int)+0x1ab)
> [0x7fa918c7318b]
>   6: (OSD::do_queries(std::map std::less, std::allocator >,
> std::less, std::allocator > > > > >&, std::shared_ptr)+0x1f1)
> [0x7fa918c9b061]
>   7: (OSD::dispatch_context(PG::RecoveryCtx&, PG*,
> std::shared_ptr, ThreadPool::TPHandle*)+0x142)
> [0x7fa918cb5832]
>   8: 

Re: [ceph-users] infernalis osd activation on centos 7

2015-12-02 Thread Brad Hubbard
- Original Message - 

> From: "Dan Nica" 
> To: ceph-us...@ceph.com
> Sent: Thursday, 3 December, 2015 1:39:16 AM
> Subject: [ceph-users] infernalis osd activation on centos 7

> Hi guys,

> After managing to get the mons up, I am stuck at activating the osds with the
> error below

> [ceph_deploy.conf][DEBUG ] found configuration file at:
> /home/ceph/.cephdeploy.conf

What is the fsid in this file?

0c36d242-92a9-4331-b48d-ce07b628750a or 0e906cd0-81f1-412c-a3aa-3866192a2de7 ?

Check that your config in /home/ceph/ matches your running config.
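For a quick mechanical check that the two configs agree, a small sketch (the paths are the usual defaults; substitute your actual ceph-deploy working directory):

```python
# Sketch: compare the fsid in the ceph-deploy working directory's
# ceph.conf against the one installed in /etc/ceph/ceph.conf.
import configparser

def read_fsid(path):
    cp = configparser.ConfigParser(strict=False)
    cp.read(path)
    return cp["global"]["fsid"]

def fsids_match(a, b):
    return read_fsid(a) == read_fsid(b)

# e.g. fsids_match("/home/ceph/ceph.conf", "/etc/ceph/ceph.conf")
```

If they differ, ceph-disk will refuse to activate with exactly the "No cluster conf found ... with fsid" error above.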

Cheers,
Brad


> [osd01][WARNIN] __main__.Error: Error: No cluster conf found in /etc/ceph
> with fsid 0c36d242-92a9-4331-b48d-ce07b628750a
> [osd01][ERROR ] RuntimeError: command returned non-zero exit status: 1
> [ceph_deploy][ERROR ] RuntimeError: Failed to execute command: ceph-disk -v
> activate --mark-init systemd --mount /dev/sdb1

> Why do I get no cluster conf ?

> [ceph@osd01 ~]$ ll /etc/ceph/
> total 12
> -rw--- 1 ceph ceph 63 Dec 2 10:30 ceph.client.admin.keyring
> -rw-r--r-- 1 ceph ceph 270 Dec 2 10:31 ceph.conf
> -rwxr-xr-x 1 ceph ceph 92 Nov 10 07:06 rbdmap
> -rw--- 1 ceph ceph 0 Dec 2 10:30 tmp0jJPo4

> [ceph@osd01 ~]$ cat /etc/ceph/ceph.conf
> [global]
> fsid = 0e906cd0-81f1-412c-a3aa-3866192a2de7
> mon_initial_members = cmon01, cmon02, cmon03
> mon_host = 10.8.250.249,10.8.250.248,10.8.250.247
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true

> why is it looking for other fsid than in the ceph.conf ?

> Thanks,
> Dan

> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Crashing OSDs (suicide timeout, following a single pool)

2016-06-02 Thread Brad Hubbard
On Thu, Jun 2, 2016 at 9:07 AM, Brandon Morris, PMP
 wrote:

> The only way that I was able to get back to Health_OK was to export/import.  
> * Please note, any time you use the ceph_objectstore_tool you risk data 
> loss if not done carefully.   Never remove a PG until you have a known good 
> export *
>
> Here are the steps I used:
>
> 1. set NOOUT, NO BACKFILL
> 2. Stop the OSD's that have the erroring PG
> 3. Flush the journal and export the primary version of the PG.  This took 1 
> minute on a well-behaved PG and 4 hours on the misbehaving PG
>   i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16 
> --journal-path /var/lib/ceph/osd/ceph-16/journal --pgid 32.10c --op export 
> --file /root/32.10c.b.export
>
> 4. Import the PG into a New / Temporary OSD that is also offline,
>   i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-100 
> --journal-path /var/lib/ceph/osd/ceph-100/journal --pgid 32.10c --op export 
> --file /root/32.10c.b.export

This should be an import op and presumably to a different data path
and journal path more like the following?

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-101
--journal-path /var/lib/ceph/osd/ceph-101/journal --pgid 32.10c --op
import --file /root/32.10c.b.export

Just trying to clarify for anyone that comes across this thread in the future.
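One cheap safety check worth adding before step 5 (my suggestion, not part of the original procedure): confirm the export file actually exists and is non-empty before deleting any PG copy. The path is the one used in the example above; substitute your own.

```shell
# Sanity-check the export before removing PG copies; if the only good copy
# is removed without a valid export, the data is unrecoverable.
export_file=/root/32.10c.b.export
if [ -s "$export_file" ]; then
    echo "export looks sane: $(stat -c %s "$export_file") bytes"
else
    echo "DO NOT remove any PG copy: export missing or empty"
fi
```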

Cheers,
Brad

>
> 5. remove the PG from all other OSD's  (16, 143, 214, and 448 in your case it 
> looks like)
> 6. Start cluster OSD's
> 7. Start the temporary OSD's and ensure 32.10c backfills correctly to the 3 
> OSD's it is supposed to be on.
>
> This is similar to the recovery process described in this post from 
> 04/09/2015: 
> http://ceph-users.ceph.narkive.com/lwDkR2fZ/recovering-incomplete-pgs-with-ceph-objectstore-tool
> Hopefully it works in your case too and you can get the cluster back to a
> state where you can make the CephFS directories smaller.
>
> - Brandon


Re: [ceph-users] Ceph Status - Segmentation Fault

2016-05-24 Thread Brad Hubbard
/usr/bin/ceph is a python script, so it's not segfaulting itself; some binary
it launches is, and there doesn't appear to be much information about it in the
log you uploaded.

Are you able to capture a core file and generate a stack trace from gdb?

The following may help to get some data.

$ ulimit -c unlimited
$ ceph -s
$ ls core.*   // This should list a recently made core file
$ file core.XXX
// Now run gdb with the output of the previous "file" command 
$ gdb -c core.XXX  $(which binary_name) -batch -ex "thr apply all bt"
$ ulimit -c 0

You may need debuginfo for the relevant binary and libraries installed to get
good stack traces but it's something you can try.

For example.

$ ulimit -c unlimited
$ sleep 100 &
[1] 32056
$ kill -SIGSEGV 32056
$ ls core.*
core.32056
[1]+  Segmentation fault  (core dumped) sleep 100
$ file core.32056 
core.32056: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 
'sleep 100'
$ gdb -c core.32056 $(which sleep) -batch -ex "thr apply all bt"
[New LWP 32056]

warning: the debug information found in 
"/usr/lib/debug//lib64/libc-2.22.so.debug" does not match "/lib64/libc.so.6" 
(CRC mismatch).


warning: the debug information found in 
"/usr/lib/debug//usr/lib64/libc-2.22.so.debug" does not match 
"/lib64/libc.so.6" (CRC mismatch).


warning: the debug information found in 
"/usr/lib/debug//lib64/ld-2.22.so.debug" does not match 
"/lib64/ld-linux-x86-64.so.2" (CRC mismatch).


warning: the debug information found in 
"/usr/lib/debug//usr/lib64/ld-2.22.so.debug" does not match 
"/lib64/ld-linux-x86-64.so.2" (CRC mismatch).

Core was generated by `sleep 100'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x7f1fd99e84b0 in __nanosleep_nocancel () from /lib64/libc.so.6

Thread 1 (LWP 32056):
#0  0x7f1fd99e84b0 in __nanosleep_nocancel () from /lib64/libc.so.6
#1  0x5641e10ba29f in rpl_nanosleep ()
#2  0x5641e10ba100 in xnanosleep ()
#3  0x5641e10b7a1d in main ()

$ ulimit -c 0
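One more thing worth checking if no core.* file appears after the crash: the kernel may be piping cores to a handler (Ubuntu's apport, for example, or abrt on RHEL-family systems) rather than writing them to the current directory. The core pattern tells you where they go:

```shell
# Where core files land is controlled by the kernel core pattern; a value
# starting with '|' means cores are piped to a helper program instead of
# being written as core.* in the working directory.
cat /proc/sys/kernel/core_pattern
```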

HTH,
Brad

- Original Message -
> From: "Mathias Buresch" 
> To: ceph-us...@ceph.com
> Sent: Monday, 23 May, 2016 9:41:51 PM
> Subject: [ceph-users] Ceph Status - Segmentation Fault
> 
> Hi there,
> I was updating Ceph to 0.94.7 and now I am getting segmentation faults.
> 
> When getting status via "ceph -s" or "ceph health detail" I am getting
> an error "Segmentation fault".
> 
> I have only two Monitor Daemons.. but haven't had any problems with
> that yet.. maybe the maintenance time was too long this time..?!
> 
> When getting the status via admin socket I get following for both:
> 
> ceph daemon mon.pix01 mon_status
> {
> "name": "pix01",
> "rank": 0,
> "state": "leader",
> "election_epoch": 226,
> "quorum": [
> 0,
> 1
> ],
> "outside_quorum": [],
> "extra_probe_peers": [],
> "sync_provider": [],
> "monmap": {
> "epoch": 1,
> "fsid": "28af67eb-4060-4770-ac1d-d2be493877af",
> "modified": "2014-11-12 15:44:27.182395",
> "created": "2014-11-12 15:44:27.182395",
> "mons": [
> {
> "rank": 0,
> "name": "pix01",
> "addr": "x.x.x.x:6789\/0"
> },
> {
> "rank": 1,
> "name": "pix02",
> "addr": "x.x.x.x:6789\/0"
> }
> ]
> }
> }
> 
> ceph daemon mon.pix02 mon_status
> {
> "name": "pix02",
> "rank": 1,
> "state": "peon",
> "election_epoch": 226,
> "quorum": [
> 0,
> 1
> ],
> "outside_quorum": [],
> "extra_probe_peers": [],
> "sync_provider": [],
> "monmap": {
> "epoch": 1,
> "fsid": "28af67eb-4060-4770-ac1d-d2be493877af",
> "modified": "2014-11-12 15:44:27.182395",
> "created": "2014-11-12 15:44:27.182395",
> "mons": [
> {
> "rank": 0,
> "name": "pix01",
> "addr": "x.x.x.x:6789\/0"
> },
> {
> "rank": 1,
> "name": "pix02",
> "addr": "x.x.x.x:6789\/0"
> }
> ]
> }
> }
> 
> Please found the logs with higher debug level attached to this email.
> 
> 
> Kind regards
> Mathias


Re: [ceph-users] Ceph Status - Segmentation Fault

2016-05-25 Thread Brad Hubbard
Hi John,

This looks a lot like http://tracker.ceph.com/issues/12417 which is, of
course, fixed.

Worth gathering debug-auth=20 ? Maybe on the MON end as well?

Cheers,
Brad


- Original Message -
> From: "Mathias Buresch" 
> To: jsp...@redhat.com
> Cc: ceph-us...@ceph.com
> Sent: Thursday, 26 May, 2016 12:57:47 AM
> Subject: Re: [ceph-users] Ceph Status - Segmentation Fault
> 
> There wasn't a package ceph-debuginfo available (maybe because I am running
> Ubuntu). Have installed those:
> 
>  * ceph-dbg
>  * librados2-dbg
> 
> There would also be ceph-mds-dbg and ceph-fs-common-dbg and so on..
> 
> But now there is more information provided by the gdb output :)
> 
> (gdb) run /usr/bin/ceph status --debug-monc=20 --debug-ms=20 --debug-
> rados=20
> Starting program: /usr/bin/python /usr/bin/ceph status --debug-monc=20
> --debug-ms=20 --debug-rados=20
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-
> gnu/libthread_db.so.1".
> [New Thread 0x710f5700 (LWP 26739)]
> [New Thread 0x708f4700 (LWP 26740)]
> [Thread 0x710f5700 (LWP 26739) exited]
> [New Thread 0x710f5700 (LWP 26741)]
> [Thread 0x710f5700 (LWP 26741) exited]
> [New Thread 0x710f5700 (LWP 26742)]
> [Thread 0x710f5700 (LWP 26742) exited]
> [New Thread 0x710f5700 (LWP 26743)]
> [Thread 0x710f5700 (LWP 26743) exited]
> [New Thread 0x710f5700 (LWP 26744)]
> [Thread 0x710f5700 (LWP 26744) exited]
> [New Thread 0x710f5700 (LWP 26745)]
> [Thread 0x710f5700 (LWP 26745) exited]
> [New Thread 0x710f5700 (LWP 26746)]
> [New Thread 0x7fffeb885700 (LWP 26747)]
> 2016-05-25 16:55:30.929131 710f5700 10 monclient(hunting):
> build_initial_monmap
> 2016-05-25 16:55:30.929221 710f5700  1 librados: starting msgr at
> :/0
> 2016-05-25 16:55:30.929226 710f5700  1 librados: starting objecter
> [New Thread 0x7fffeb084700 (LWP 26748)]
> 2016-05-25 16:55:30.930288 710f5700 10 -- :/0 ready :/0
> [New Thread 0x7fffea883700 (LWP 26749)]
> [New Thread 0x7fffea082700 (LWP 26750)]
> 2016-05-25 16:55:30.932251 710f5700  1 -- :/0 messenger.start
> [New Thread 0x7fffe9881700 (LWP 26751)]
> 2016-05-25 16:55:30.933277 710f5700  1 librados: setting wanted
> keys
> 2016-05-25 16:55:30.933287 710f5700  1 librados: calling monclient
> init
> 2016-05-25 16:55:30.933289 710f5700 10 monclient(hunting): init
> 2016-05-25 16:55:30.933279 7fffe9881700 10 -- :/3663984981 reaper_entry
> start
> 2016-05-25 16:55:30.933300 710f5700 10 monclient(hunting):
> auth_supported 2 method cephx
> 2016-05-25 16:55:30.933303 7fffe9881700 10 -- :/3663984981 reaper
> 2016-05-25 16:55:30.933305 7fffe9881700 10 -- :/3663984981 reaper done
> [New Thread 0x7fffe9080700 (LWP 26752)]
> [New Thread 0x7fffe887f700 (LWP 26753)]
> 2016-05-25 16:55:30.935485 710f5700 10 monclient(hunting):
> _reopen_session rank -1 name
> 2016-05-25 16:55:30.935495 710f5700 10 -- :/3663984981 connect_rank
> to 62.176.141.181:6789/0, creating pipe and registering
> [New Thread 0x7fffe3fff700 (LWP 26754)]
> 2016-05-25 16:55:30.936556 710f5700 10 -- :/3663984981 >>
> 62.176.141.181:6789/0 pipe(0x7fffec064010 sd=-1 :0 s=1 pgs=0 cs=0 l=1
> c=0x7fffec05aa30).register_pipe
> 2016-05-25 16:55:30.936573 710f5700 10 -- :/3663984981
> get_connection mon.0 62.176.141.181:6789/0 new 0x7fffec064010
> 2016-05-25 16:55:30.936557 7fffe3fff700 10 -- :/3663984981 >>
> 62.176.141.181:6789/0 pipe(0x7fffec064010 sd=-1 :0 s=1 pgs=0 cs=0 l=1
> c=0x7fffec05aa30).writer: state = connecting policy.server=0
> 2016-05-25 16:55:30.936583 7fffe3fff700 10 -- :/3663984981 >>
> 62.176.141.181:6789/0 pipe(0x7fffec064010 sd=-1 :0 s=1 pgs=0 cs=0 l=1
> c=0x7fffec05aa30).connect 0
> 2016-05-25 16:55:30.936594 710f5700 10 monclient(hunting): picked
> mon.pix01 con 0x7fffec05aa30 addr 62.176.141.181:6789/0
> 2016-05-25 16:55:30.936600 710f5700 20 -- :/3663984981
> send_keepalive con 0x7fffec05aa30, have pipe.
> 2016-05-25 16:55:30.936603 7fffe3fff700 10 -- :/3663984981 >>
> 62.176.141.181:6789/0 pipe(0x7fffec064010 sd=3 :0 s=1 pgs=0 cs=0 l=1
> c=0x7fffec05aa30).connecting to 62.176.141.181:6789/0
> 2016-05-25 16:55:30.936615 710f5700 10 monclient(hunting):
> _send_mon_message to mon.pix01 at 62.176.141.181:6789/0
> 2016-05-25 16:55:30.936618 710f5700  1 -- :/3663984981 -->
> 62.176.141.181:6789/0 -- auth(proto 0 30 bytes epoch 0) v1 -- ?+0
> 0x7fffec060450 con 0x7fffec05aa30
> 2016-05-25 16:55:30.936623 710f5700 20 -- :/3663984981
> submit_message auth(proto 0 30 bytes epoch 0) v1 remote,
> 62.176.141.181:6789/0, have pipe.
> 2016-05-25 16:55:30.936626 710f5700 10 monclient(hunting):
> renew_subs
> 2016-05-25 16:55:30.936630 710f5700 10 monclient(hunting):
> authenticate will time out at 2016-05-25 17:00:30.936629
> 2016-05-25 16:55:30.936867 7fffe3fff700 20 -- :/3663984981 >>
> 62.176.141.181:6789/0 pipe(0x7fffec064010 sd=3 :38763 s=1 pgs=0 cs=0
> l=1 

Re: [ceph-users] Ceph Status - Segmentation Fault

2016-06-13 Thread Brad Hubbard
On Tue, Jun 14, 2016 at 2:26 AM, Mathias Buresch
<mathias.bure...@de.clara.net> wrote:
> Hey,
>
> I opened an issue at tracker.ceph.com -> http://tracker.ceph.com/issues
> /16266

Hi Mathias,

Thanks!

I've added some information in that bug as I came across this same
issue working on something else and saw your bug this morning.

Cheers,
Brad

-Original Message-
> From: Brad Hubbard <bhubb...@redhat.com>
> To: Mathias Buresch <mathias.bure...@de.clara.net>
> Cc: jsp...@redhat.com <jsp...@redhat.com>, ceph-us...@ceph.com
> Subject: Re: [ceph-users] Ceph Status - Segmentation Fault
> Date: Thu, 2 Jun 2016 09:50:20 +1000
>
> Could this be the call in RotatingKeyRing::get_secret() failing?
>
> Mathias, I'd suggest opening a tracker for this with the information in
> your last post and let us know the number here.
> Cheers,
> Brad
>
> On Wed, Jun 1, 2016 at 3:15 PM, Mathias Buresch
> <mathias.bure...@de.clara.net> wrote:
>> Hi,
>>
>> here is the output including --debug-auth=20. Does this help?
>>
>> (gdb) run /usr/bin/ceph status --debug-monc=20 --debug-ms=20 --debug-
>> rados=20 --debug-auth=20
>> Starting program: /usr/bin/python /usr/bin/ceph status --debug-
>> monc=20
>> --debug-ms=20 --debug-rados=20 --debug-auth=20
>> [Thread debugging using libthread_db enabled]
>> Using host libthread_db library "/lib/x86_64-linux-
>> gnu/libthread_db.so.1".
>> [New Thread 0x710f5700 (LWP 2210)]
>> [New Thread 0x708f4700 (LWP 2211)]
>> [Thread 0x710f5700 (LWP 2210) exited]
>> [New Thread 0x710f5700 (LWP 2212)]
>> [Thread 0x710f5700 (LWP 2212) exited]
>> [New Thread 0x710f5700 (LWP 2213)]
>> [Thread 0x710f5700 (LWP 2213) exited]
>> [New Thread 0x710f5700 (LWP 2233)]
>> [Thread 0x710f5700 (LWP 2233) exited]
>> [New Thread 0x710f5700 (LWP 2236)]
>> [Thread 0x710f5700 (LWP 2236) exited]
>> [New Thread 0x710f5700 (LWP 2237)]
>> [Thread 0x710f5700 (LWP 2237) exited]
>> [New Thread 0x710f5700 (LWP 2238)]
>> [New Thread 0x7fffeb885700 (LWP 2240)]
>> 2016-06-01 07:12:55.656336 710f5700 10 monclient(hunting):
>> build_initial_monmap
>> 2016-06-01 07:12:55.656440 710f5700  1 librados: starting msgr at
>> :/0
>> 2016-06-01 07:12:55.656446 710f5700  1 librados: starting
>> objecter
>> [New Thread 0x7fffeb084700 (LWP 2241)]
>> 2016-06-01 07:12:55.657552 710f5700 10 -- :/0 ready :/0
>> [New Thread 0x7fffea883700 (LWP 2242)]
>> [New Thread 0x7fffea082700 (LWP 2245)]
>> 2016-06-01 07:12:55.659548 710f5700  1 -- :/0 messenger.start
>> [New Thread 0x7fffe9881700 (LWP 2248)]
>> 2016-06-01 07:12:55.660530 710f5700  1 librados: setting wanted
>> keys
>> 2016-06-01 07:12:55.660539 710f5700  1 librados: calling
>> monclient
>> init
>> 2016-06-01 07:12:55.660540 710f5700 10 monclient(hunting): init
>> 2016-06-01 07:12:55.660550 710f5700  5 adding auth protocol:
>> cephx
>> 2016-06-01 07:12:55.660552 710f5700 10 monclient(hunting):
>> auth_supported 2 method cephx
>> 2016-06-01 07:12:55.660532 7fffe9881700 10 -- :/1337675866
>> reaper_entry
>> start
>> 2016-06-01 07:12:55.660570 7fffe9881700 10 -- :/1337675866 reaper
>> 2016-06-01 07:12:55.660572 7fffe9881700 10 -- :/1337675866 reaper
>> done
>> 2016-06-01 07:12:55.660733 710f5700  2 auth: KeyRing::load:
>> loaded
>> key file /etc/ceph/ceph.client.admin.keyring
>> [New Thread 0x7fffe9080700 (LWP 2251)]
>> [New Thread 0x7fffe887f700 (LWP 2252)]
>> 2016-06-01 07:12:55.662754 710f5700 10 monclient(hunting):
>> _reopen_session rank -1 name
>> 2016-06-01 07:12:55.662764 710f5700 10 -- :/1337675866
>> connect_rank
>> to 62.176.141.181:6789/0, creating pipe and registering
>> [New Thread 0x7fffe3fff700 (LWP 2255)]
>> 2016-06-01 07:12:55.663789 710f5700 10 -- :/1337675866 >>
>> 62.176.141.181:6789/0 pipe(0x7fffec064010 sd=-1 :0 s=1 pgs=0 cs=0 l=1
>> c=0x7fffec05aa30).register_pipe
>> 2016-06-01 07:12:55.663819 710f5700 10 -- :/1337675866
>> get_connection mon.0 62.176.141.181:6789/0 new 0x7fffec064010
>> 2016-06-01 07:12:55.663790 7fffe3fff700 10 -- :/1337675866 >>
>> 62.176.141.181:6789/0 pipe(0x7fffec064010 sd=-1 :0 s=1 pgs=0 cs=0 l=1
>> c=0x7fffec05aa30).writer: state = connecting policy.server=0
>> 2016-06-01 07:12:55.663830 7fffe3fff700 10 -- :/1337675866 >>
>> 62.176.141.181:6789/0 pipe(0x7fffec064010 sd=-1 :0 s=1 pgs=0 cs=0 l=1
>> c=0x7fffec05aa30).connect 0
>> 2016-06-01 07:1

Re: [ceph-users] Ceph Status - Segmentation Fault

2016-06-01 Thread Brad Hubbard
> 62.176.141.181:6789/0
> pipe(0x7fffec064010 sd=3 :41128 s=2 pgs=339278 cs=1 l=1
> c=0x7fffec05aa30).reader got message 2 0x7fffd0002f20 auth_reply(proto
> 2 0 (0) Success) v1
> 2016-06-01 07:12:55.665944 7fffe3efe700 20 --
> 62.176.141.181:0/1337675866 queue 0x7fffd0002f20 prio 196
> 2016-06-01 07:12:55.665950 7fffe3efe700 20 --
> 62.176.141.181:0/1337675866 >> 62.176.141.181:6789/0
> pipe(0x7fffec064010 sd=3 :41128 s=2 pgs=339278 cs=1 l=1
> c=0x7fffec05aa30).reader reading tag...
> 2016-06-01 07:12:55.665891 7fffea883700  1 --
> 62.176.141.181:0/1337675866 <== mon.0 62.176.141.181:6789/0 1 
> mon_map magic: 0 v1  340+0+0 (3213884171 0 0) 0x7fffd0001cb0 con
> 0x7fffec05aa30
> 2016-06-01 07:12:55.665953 7fffe3fff700 10 --
> 62.176.141.181:0/1337675866 >> 62.176.141.181:6789/0
> pipe(0x7fffec064010 sd=3 :41128 s=2 pgs=339278 cs=1 l=1
> c=0x7fffec05aa30).writer: state = open policy.server=0
> 2016-06-01 07:12:55.665960 7fffea883700 10 monclient(hunting):
> handle_monmap mon_map magic: 0 v1
> 2016-06-01 07:12:55.665960 7fffe3fff700 10 --
> 62.176.141.181:0/1337675866 >> 62.176.141.181:6789/0
> pipe(0x7fffec064010 sd=3 :41128 s=2 pgs=339278 cs=1 l=1
> c=0x7fffec05aa30).write_ack 2
> 2016-06-01 07:12:55.665966 7fffe3fff700 10 --
> 62.176.141.181:0/1337675866 >> 62.176.141.181:6789/0
> pipe(0x7fffec064010 sd=3 :41128 s=2 pgs=339278 cs=1 l=1
> c=0x7fffec05aa30).writer: state = open policy.server=0
> 2016-06-01 07:12:55.665971 7fffea883700 10 monclient(hunting):  got
> monmap 1, mon.pix01 is now rank 0
> 2016-06-01 07:12:55.665970 7fffe3fff700 20 --
> 62.176.141.181:0/1337675866 >> 62.176.141.181:6789/0
> pipe(0x7fffec064010 sd=3 :41128 s=2 pgs=339278 cs=1 l=1
> c=0x7fffec05aa30).writer sleeping
> 2016-06-01 07:12:55.665972 7fffea883700 10 monclient(hunting): dump:
> epoch 1
> fsid 28af67eb-4060-4770-ac1d-d2be493877af
> last_changed 2014-11-12 15:44:27.182395
> created 2014-11-12 15:44:27.182395
> 0: 62.176.141.181:6789/0 mon.pix01
> 1: 62.176.141.182:6789/0 mon.pix02
>
> 2016-06-01 07:12:55.665988 7fffea883700 10 --
> 62.176.141.181:0/1337675866 dispatch_throttle_release 340 to dispatch
> throttler 373/104857600
> 2016-06-01 07:12:55.665992 7fffea883700 20 --
> 62.176.141.181:0/1337675866 done calling dispatch on 0x7fffd0001cb0
> 2016-06-01 07:12:55.665997 7fffea883700  1 --
> 62.176.141.181:0/1337675866 <== mon.0 62.176.141.181:6789/0 2 
> auth_reply(proto 2 0 (0) Success) v1  33+0+0 (3918039325 0 0)
> 0x7fffd0002f20 con 0x7fffec05aa30
> 2016-06-01 07:12:55.666015 7fffea883700 10 cephx: set_have_need_key no
> handler for service mon
> 2016-06-01 07:12:55.666016 7fffea883700 10 cephx: set_have_need_key no
> handler for service osd
> 2016-06-01 07:12:55.666017 7fffea883700 10 cephx: set_have_need_key no
> handler for service auth
> 2016-06-01 07:12:55.666018 7fffea883700 10 cephx: validate_tickets want
> 37 have 0 need 37
> 2016-06-01 07:12:55.666020 7fffea883700 10 monclient(hunting): my
> global_id is 3511432
> 2016-06-01 07:12:55.666022 7fffea883700 10 cephx client:
> handle_response ret = 0
> 2016-06-01 07:12:55.666023 7fffea883700 10 cephx client:  got initial
> server challenge 3112857369079243605
> 2016-06-01 07:12:55.666025 7fffea883700 10 cephx client:
> validate_tickets: want=37 need=37 have=0
> 2016-06-01 07:12:55.666026 7fffea883700 10 cephx: set_have_need_key no
> handler for service mon
> 2016-06-01 07:12:55.666027 7fffea883700 10 cephx: set_have_need_key no
> handler for service osd
> 2016-06-01 07:12:55.666030 7fffea883700 10 cephx: set_have_need_key no
> handler for service auth
> 2016-06-01 07:12:55.666030 7fffea883700 10 cephx: validate_tickets want
> 37 have 0 need 37
> 2016-06-01 07:12:55.666031 7fffea883700 10 cephx client: want=37
> need=37 have=0
> 2016-06-01 07:12:55.666034 7fffea883700 10 cephx client: build_request
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7fffea883700 (LWP 2242)]
> 0x73141a57 in encrypt (cct=,
> error=0x7fffea882280, out=..., in=..., this=0x7fffea882470)
> at auth/cephx/../Crypto.h:110
> 110 auth/cephx/../Crypto.h: No such file or directory.
> (gdb) bt
> #0  0x73141a57 in encrypt (cct=,
> error=0x7fffea882280, out=..., in=..., this=0x7fffea882470)
> at auth/cephx/../Crypto.h:110
> #1  encode_encrypt_enc_bl (cct=,
> error="", out=..., key=..., t=)
> at auth/cephx/CephxProtocol.h:464
> #2  encode_encrypt (cct=, error="",
> out=..., key=..., t=)
> at auth/cephx/CephxProtocol.h:489
> #3  cephx_calc_client_server_challenge (cct=,
> secret=..., server_challenge=3112857369079243605,
> client_challenge=12899511428024786235, key=key@entry=0

Re: [ceph-users] Ceph 10.1.1 rbd map fail

2016-06-21 Thread Brad Hubbard
On Wed, Jun 22, 2016 at 1:35 PM, 王海涛  wrote:
> Hi All
>
> I'm using ceph-10.1.1 to map an rbd image, but it doesn't work. The error
> messages are:
>
> root@heaven:~#rbd map rbd/myimage --id admin
> 2016-06-22 11:16:34.546623 7fc87ca53d80 -1 WARNING: the following dangerous
> and experimental features are enabled: bluestore,rocksdb
> 2016-06-22 11:16:34.547166 7fc87ca53d80 -1 WARNING: the following dangerous
> and experimental features are enabled: bluestore,rocksdb
> 2016-06-22 11:16:34.549018 7fc87ca53d80 -1 WARNING: the following dangerous
> and experimental features are enabled: bluestore,rocksdb
> rbd: sysfs write failed
> rbd: map failed: (5) Input/output error

Anything in dmesg, or anywhere, about "feature set mismatch" ?

http://cephnotes.ksperis.com/blog/2014/01/21/feature-set-mismatch-error-on-ceph-kernel-client

>
> Could someone tell me what's wrong?
> Thanks!
>
> Kind Regards,
> Haitao Wang
>
>
>



-- 
Cheers,
Brad


Re: [ceph-users] [Ceph-community] Regarding Technical Possibility of Configuring Single Ceph Cluster on Different Networks

2016-06-16 Thread Brad Hubbard
On Fri, Jun 10, 2016 at 3:01 AM, Venkata Manojawa Paritala
 wrote:
> Hello Friends,
>
> I am Manoj Paritala, working in Vedams Software Solutions India Pvt Ltd,
> Hyderabad, India. We are developing a POC with the below specification. I
> would like to know if it is technically possible to configure a Single Ceph
> cluster with this requirement. Please find attached the network diagram for
> more clarity on what we are trying to setup.
>
> 1. There should be 3 OSD nodes (machines), 3 Monitor nodes (machines) and 3
> Client nodes in the Ceph cluster.
>
> 2. There are 3 data centers with 3 different networks. Lets call each Data
> center a Site. So, we have Site1, Site2 and Site3 with different networks.
>
> 3. Each Site should have One OSD node + Monitor node + Client node.
>
> 4. In each Site there should be again 2 sub-networks.
>
> 4a. Site Public Network :- Where in the Ceph Clients, OSDs and Monitor would
> connect.
> 4b. Site Cluster Network :- Where in only OSDs communicate for replication
> and rebalancing.
>
> 5. Configure routing between Cluster networks across sites, in such a way
> that OSD in one site can communicate to the OSDs on other sites.
>
> 6. Configure routing between Site Public Networks across, in such a way that
> ONLY the Monitor & OSD nodes in each site can communicate to the nodes in
> other sites. PLEASE NOTE, CLIENTS IN ONE SITE WILL NOT BE ABLE TO
> COMMUNICATE TO OSDs/CLIENTS ON OTHER SITES.

This won't work. The clients need to communicate with the primary OSD for the
PG, not just any OSD, so they will need access to all OSDs.

A configuration like this is a stretched cluster, and the links between the
DCs will kill performance once you load them up or once recovery is occurring.
Do the links between your DCs meet the stated requirements here?

http://docs.ceph.com/docs/master/start/hardware-recommendations/#networks

>
> Hoping that my requirement is clear. Please let me know if I am not clear on
> any step.
>
> Actually, based on our reading, our understanding is that 2-way replication
> between 2 different Ceph clusters is not possible. To overcome the same, we
> came up with the above configuration, which will allow us to create pools
> with OSDs on different sites / data centers and is useful for disaster
> recovery.

I don't think this configuration will work as you expect.

>
> In case our proposed configuration is not possible, can you please suggest
> us an alternative approach to achieve our requirement.

What is your requirement, it's not clearly stated.

Cheers,
Brad

>
> Thanks & Regards,
> Manoj
>
> ___
> Ceph-community mailing list
> ceph-commun...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com
>



-- 
Cheers,
Brad


Re: [ceph-users] Installing ceph monitor on Ubuntu denial: segmentation fault

2016-06-17 Thread Brad Hubbard
On Fri, May 20, 2016 at 7:32 PM, Daniel Wilhelm  wrote:
> Hi
>
>
>
> I am relieved to have found a solution to this problem.
>
>
>
> The ansible script for generating the key did not pass the key to the
> following command line and therefore sent an empty string to this script
> (see monitor_secret).
>
>
>
> ceph-authtool /var/lib/ceph/tmp/keyring.mon.{{ monitor_name }}
> --create-keyring --name=mon. --add-key={{ monitor_secret }} --cap mon 'allow
> *'

This issue is being handled in existing tracker
http://tracker.ceph.com/issues/2904

>
>
>
> Now when this invalid key is being used to create the ceph file systems it
> seems to be copied to the location indicated below
> (/var/lib/ceph/mon/ceph-control01/keyring), and is crashing the ceph command
> below.
>
>
>
> Maybe a developer should have a look into this. It seems to me as if a
> base64 decoding went wrong in this case and crashed the process.

I was able to reproduce this and have created a patch. I've opened
http://tracker.ceph.com/issues/16266
for it.
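As a rough illustration of the failure mode (an assumption about the mechanism, not the exact code path): cephx secrets are base64-encoded, and an empty monitor_secret decodes to a zero-length key, which is what ends up in the keyring below.

```shell
# A valid cephx secret decodes to a fixed-size key; an empty secret
# decodes to zero bytes. (Illustrative only; the key below is junk.)
key=$(head -c 16 /dev/zero | base64)
echo "valid secret decodes to $(printf '%s' "$key" | base64 -d | wc -c) bytes"
echo "empty secret decodes to $(printf '' | base64 -d | wc -c) bytes"
```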

Cheers,
Brad

>
>
>
> Thanks anyway
>
>
>
> Cheers
>
>
>
> Daniel
>
>
>
>
>
> From: Daniel Wilhelm
> Sent: Donnerstag, 19. Mai 2016 12:00
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Installing ceph monitor on Ubuntu denial: segmentation
> fault
>
>
>
> Hi
>
>
>
> I am trying to install ceph with the ceph ansible role:
> https://github.com/shieldwed/ceph-ansible.
>
>
>
> I had to fix some ansible tasks to work correctly with ansible 2.0.2.0 but
> now it seems to work quite well.
>
> Sadly I have now come across a bug, I cannot solve myself:
>
>
>
> When ansible is starting the service ceph-mon@ceph-control01.service,
> ceph-create-keys@control01.service gets started as a dependency to create
> the admin key.
>
>
>
> Within the unit log the following lines are shown:
>
>
>
> May 19 11:42:14 control01 ceph-create-keys[21818]:
> INFO:ceph-create-keys:Talking to monitor...
>
> May 19 11:42:14 control01 ceph-create-keys[21818]:
> INFO:ceph-create-keys:Cannot get or create admin key
>
> May 19 11:42:15 control01 ceph-create-keys[21818]:
> INFO:ceph-create-keys:Talking to monitor...
>
> May 19 11:42:15 control01 ceph-create-keys[21818]:
> INFO:ceph-create-keys:Cannot get or create admin key
>
>
>
> And so on.
>
>
>
> Since this script is calling “ceph --cluster=ceph --name=mon.
> --keyring=/var/lib/ceph/mon/ceph-control01/keyring auth get-or-create
> client.admin mon 'allow *' osd 'allow *' mds 'allow *'”
>
>
>
> I tried to call this command myself and got this as a result:
>
> Segmentation fault (core dumped)
>
>
>
> As for the ceph versions, I tried two different with the same result:
>
> ·   Ubuntu integrated: ceph 10.1.2
>
> ·   Official stable repo: http://download.ceph.com/debian-jewel so:
> 10.2.1
>
>
>
> How can I circumvent this problem? Or is there any solution to that?
>
>
>
> Thanks
>
>
>
> Daniel
>
>
>



-- 
Cheers,
Brad


Re: [ceph-users] Ceph 10.1.1 rbd map fail

2016-06-23 Thread Brad Hubbard
On Thu, Jun 23, 2016 at 6:38 PM, 王海涛 <wht...@163.com> wrote:
> Hi, Brad:
> This is the output of "ceph osd crush show-tunables -f json-pretty"
> {
> "choose_local_tries": 0,
> "choose_local_fallback_tries": 0,
> "choose_total_tries": 50,
> "chooseleaf_descend_once": 1,
> "chooseleaf_vary_r": 1,
> "chooseleaf_stable": 0,
> "straw_calc_version": 1,
> "allowed_bucket_algs": 22,
> "profile": "firefly",
> "optimal_tunables": 0,
> "legacy_tunables": 0,
> "minimum_required_version": "firefly",
> "require_feature_tunables": 1,
> "require_feature_tunables2": 1,
> "has_v2_rules": 0,
> "require_feature_tunables3": 1,
> "has_v3_rules": 0,
> "has_v4_buckets": 0,
> "require_feature_tunables5": 0,
> "has_v5_rules": 0
> }
>
> The value of "require_feature_tunables3" is 1; I think it needs to be 0 to
> make my rbd map succeed.
> So I set it to 0 with the ceph osd crush tool, but it still doesn't work.
> Then I checked the rbd image info:
> rbd image 'myimage':
> size 1024 MB in 256 objects
> order 22 (4096 kB objects)
> block_name_prefix: rbd_data.5e3074b0dc51
> format: 2
> features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
> flags:
>
> It looks like some of the features are not supported by my rbd kernel
> module, because when I get rid of the last 4 features and only keep the
> "layering" feature, the image can be mapped and used correctly.
>
> Thanks for your answer!

yw

>
> Kind Regards,
> Haitao Wang
>
> At 2016-06-23 09:51:02, "Brad Hubbard" <bhubb...@redhat.com> wrote:
>>On Wed, Jun 22, 2016 at 3:20 PM, 王海涛 <wht...@163.com> wrote:
>>> I find this message in dmesg:
>>> [83090.212918] libceph: mon0 192.168.159.128:6789 feature set mismatch,
>>> my
>>> 4a042a42 < server's 2004a042a42, missing 200
>>>
>>> According to
>>>
>>> "http://cephnotes.ksperis.com/blog/2014/01/21/feature-set-mismatch-error-on-ceph-kernel-client",
>>> this could mean that I need to upgrade kernel client up to 3.15 or
>>> disable
>>> tunable 3 features.
>>> Our cluster is not convenient to upgrade.
>>> Could you tell me how to disable tunable 3 features?
>>
>>Can you show the output of the following command please?
>>
>># ceph osd crush show-tunables -f json-pretty
>>
>>I believe you'll need to use "ceph osd crush tunables " to adjust this.
>>
>>>
>>> Thanks!
>>>
>>> Kind Regards,
>>> Haitao Wang
>>>
>>>
>>> At 2016-06-22 12:33:42, "Brad Hubbard" <bhubb...@redhat.com> wrote:
>>>>On Wed, Jun 22, 2016 at 1:35 PM, 王海涛 <wht...@163.com> wrote:
>>>>> Hi All
>>>>>
>>>>> I'm using ceph-10.1.1 to map a rbd image ,but it dosen't work ,the
>>>>> error
>>>>> messages are:
>>>>>
>>>>> root@heaven:~#rbd map rbd/myimage --id admin
>>>>> 2016-06-22 11:16:34.546623 7fc87ca53d80 -1 WARNING: the following
>>>>> dangerous
>>>>> and experimental features are enabled: bluestore,rocksdb
>>>>> 2016-06-22 11:16:34.547166 7fc87ca53d80 -1 WARNING: the following
>>>>> dangerous
>>>>> and experimental features are enabled: bluestore,rocksdb
>>>>> 2016-06-22 11:16:34.549018 7fc87ca53d80 -1 WARNING: the following
>>>>> dangerous
>>>>> and experimental features are enabled: bluestore,rocksdb
>>>>> rbd: sysfs write failed
>>>>> rbd: map failed: (5) Input/output error
>>>>
>>>>Anything in dmesg, or anywhere, about "feature set mismatch" ?
>>>>
>>>>http://cephnotes.ksperis.com/blog/2014/01/21/feature-set-mismatch-error-on-ceph-kernel-client
>>>>
>>>>>
>>>>> Could someone tell me what's wrong?
>>>>> Thanks!
>>>>>
>>>>> Kind Regards,
>>>>> Haitao Wang
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>--
>>>>Cheers,
>>>>Brad
>>
>>
>>
>>--
>>Cheers,
>>Brad



-- 
Cheers,
Brad


Re: [ceph-users] Ceph 10.1.1 rbd map fail

2016-06-22 Thread Brad Hubbard
On Wed, Jun 22, 2016 at 3:20 PM, 王海涛 <wht...@163.com> wrote:
> I find this message in dmesg:
> [83090.212918] libceph: mon0 192.168.159.128:6789 feature set mismatch, my
> 4a042a42 < server's 2004a042a42, missing 200
>
> According to
> "http://cephnotes.ksperis.com/blog/2014/01/21/feature-set-mismatch-error-on-ceph-kernel-client",
> this could mean that I need to upgrade kernel client up to 3.15 or disable
> tunable 3 features.
> Our cluster is not convenient to upgrade.
> Could you tell me how to disable tunable 3 features?

Can you show the output of the following command please?

# ceph osd crush show-tunables -f json-pretty

I believe you'll need to use "ceph osd crush tunables " to adjust this.
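As an aside, the two hex masks in a "feature set mismatch" dmesg line (like the one quoted above) can be turned into the missing bits with plain shell arithmetic:

```shell
# Client mask 0x4a042a42, server mask 0x2004a042a42 (values from this thread).
# The bits the server requires but the client lacks:
printf 'missing: %x\n' $(( 0x2004a042a42 & ~0x4a042a42 ))
```

That yields 20000000000 (bit 41), which, if I'm reading the kernel client feature table correctly, corresponds to CRUSH_TUNABLES3 — consistent with the advice to either upgrade the kernel client or relax the tunables.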

>
> Thanks!
>
> Kind Regards,
> Haitao Wang
>
>
> At 2016-06-22 12:33:42, "Brad Hubbard" <bhubb...@redhat.com> wrote:
>>On Wed, Jun 22, 2016 at 1:35 PM, 王海涛 <wht...@163.com> wrote:
>>> Hi All
>>>
>>> I'm using ceph-10.1.1 to map an rbd image, but it doesn't work; the error
>>> messages are:
>>>
>>> root@heaven:~#rbd map rbd/myimage --id admin
>>> 2016-06-22 11:16:34.546623 7fc87ca53d80 -1 WARNING: the following
>>> dangerous
>>> and experimental features are enabled: bluestore,rocksdb
>>> 2016-06-22 11:16:34.547166 7fc87ca53d80 -1 WARNING: the following
>>> dangerous
>>> and experimental features are enabled: bluestore,rocksdb
>>> 2016-06-22 11:16:34.549018 7fc87ca53d80 -1 WARNING: the following
>>> dangerous
>>> and experimental features are enabled: bluestore,rocksdb
>>> rbd: sysfs write failed
>>> rbd: map failed: (5) Input/output error
>>
>>Anything in dmesg, or anywhere, about "feature set mismatch" ?
>>
>>http://cephnotes.ksperis.com/blog/2014/01/21/feature-set-mismatch-error-on-ceph-kernel-client
>>
>>>
>>> Could someone tell me what's wrong?
>>> Thanks!
>>>
>>> Kind Regards,
>>> Haitao Wang
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>>
>>
>>--
>>Cheers,
>>Brad



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph not replicating to all osds

2016-06-27 Thread Brad Hubbard
On Tue, Jun 28, 2016 at 1:00 AM, Ishmael Tsoaela  wrote:
> Hi ALL,
>
> Anyone can help with this issue would be much appreciated.
>
> I have created an  image on one client and mounted it on both 2 client I
> have setup.
>
> When I write data on one client, I cannot access the data on another client,
> what could be causing this issue?

I suspect you are talking about files showing up in a filesystem on the
rbd image you have mounted on both clients? If so, you need to verify
that the chosen filesystem supports that.

Let me know if I got this wrong (please provide a more detailed
description), or if you need more information.

Cheers,
Brad

>
> root@nodeB:/mnt# ceph osd tree
> ID WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 1.81738 root default
> -2 0.90869 host nodeB
>  0 0.90869 osd.0   up  1.0  1.0
> -3 0.90869 host nodeC
>  1 0.90869 osd.1   up  1.0  1.0
>
>
> cluster_master@nodeC:/mnt$ ceph osd dump | grep data
> pool 1 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash
> rjenkins pg_num 128 pgp_num 128 last_change 17 flags hashpspool stripe_width
> 0
>
>
> cluster_master@nodeC:/mnt$ cat decompiled-crush-map.txt
> # begin crush map
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
> tunable chooseleaf_descend_once 1
> tunable chooseleaf_vary_r 1
> tunable straw_calc_version 1
>
> # devices
> device 0 osd.0
> device 1 osd.1
>
> # types
> type 0 osd
> type 1 host
> type 2 chassis
> type 3 rack
> type 4 row
> type 5 pdu
> type 6 pod
> type 7 room
> type 8 datacenter
> type 9 region
> type 10 root
>
> # buckets
> host nodeB {
> id -2 # do not change unnecessarily
> # weight 0.909
> alg straw
> hash 0 # rjenkins1
> item osd.0 weight 0.909
> }
> host nodeC {
> id -3 # do not change unnecessarily
> # weight 0.909
> alg straw
> hash 0 # rjenkins1
> item osd.1 weight 0.909
> }
> root default {
> id -1 # do not change unnecessarily
> # weight 1.817
> alg straw
> hash 0 # rjenkins1
> item nodeB weight 0.909
> item nodeC weight 0.909
> }
>
> # rules
> rule replicated_ruleset {
> ruleset 0
> type replicated
> min_size 1
> max_size 10
> step take default
> step chooseleaf firstn 0 type host
> step emit
> }
>
> # end crush map
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] VM shutdown because of PG increase

2016-06-28 Thread Brad Hubbard
On Tue, Jun 28, 2016 at 7:39 PM, Torsten Urbas  wrote:
> Hello,
>
> are you sure about your Ceph version? Below’s output states "0.94.1“.

I suspect it's quite likely that the cluster was upgraded but not the
clients or, if the clients were upgraded, that the VMs were not restarted,
so they still have the old binary images in memory and thus still report
0.94.1.

A restart of any of the remaining VMs that have not been restarted would
be a good idea.

You can identify these VMs as they should show librbd/librados as "deleted" in
/proc/[PID]/maps output (this will need to be the PID of the qemu-kvm instance).
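A sketch of that check — the "(deleted)" suffix is how the kernel marks mappings whose backing file has since been removed or replaced on disk:

```shell
# List PIDs that still map a deleted shared library (e.g. an old librbd
# replaced by a package upgrade). Prints nothing if no process matches.
for maps in /proc/[0-9]*/maps; do
    pid=${maps#/proc/}; pid=${pid%/maps}
    if grep -q '\.so.*(deleted)' "$maps" 2>/dev/null; then
        echo "PID $pid still maps deleted libraries:"
        grep '\.so.*(deleted)' "$maps" | awk '{print "  " $6}' | sort -u
    fi
done
```

For the case above you would then narrow this to the qemu-kvm PIDs and look for librbd/librados entries.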

HTH,
Brad

>
> We have ran into a similar issue with Ceph 0.94.3 and can confirm that we no
> longer see that with Ceph 0.94.5.
>
> If you upgraded during operation, did you at least migrate all of your VMs
> at least once to make sure they are using the most recent librbd?
>
> Cheers,
> Torsten
>
> --
> Torsten Urbas
> Mobile: +49 (170) 77 38 251
>
> Am 28. Juni 2016 um 11:00:21, 한승진 (yongi...@gmail.com) schrieb:
>
> Hi, Cephers.
>
> Our ceph version is Hammer(0.94.7).
>
> I implemented ceph with OpenStack, all instances use block storage as a
> local volume.
>
> After increasing the PG number from 256 to 768, many vms are shutdown.
>
> That was very strange case for me.
>
> Below vm's is libvirt error log.
>
> osd/osd_types.cc: In function 'bool pg_t::is_split(unsigned int, unsigned
> int, std::set*) const' thread 7fc4c01b9700 time 2016-06-28
> 14:17:35.004480
> osd/osd_types.cc: 459: FAILED assert(m_seed < old_pg_num)
>  ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
>  1: (()+0x15374b) [0x7fc4d1ca674b]
>  2: (()+0x222f01) [0x7fc4d1d75f01]
>  3: (()+0x222fdd) [0x7fc4d1d75fdd]
>  4: (()+0xc5339) [0x7fc4d1c18339]
>  5: (()+0xdc3e5) [0x7fc4d1c2f3e5]
>  6: (()+0xdcc4a) [0x7fc4d1c2fc4a]
>  7: (()+0xde1b2) [0x7fc4d1c311b2]
>  8: (()+0xe3fbf) [0x7fc4d1c36fbf]
>  9: (()+0x2c3b99) [0x7fc4d1e16b99]
>  10: (()+0x2f160d) [0x7fc4d1e4460d]
>  11: (()+0x80a5) [0x7fc4cd7aa0a5]
>  12: (clone()+0x6d) [0x7fc4cd4d7cfd]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to
> interpret this.
> terminate called after throwing an instance of 'ceph::FailedAssertion'
> 2016-06-28 05:17:36.557+: shutting down
>
>
> Could you anybody explain this?
>
> Thank you.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph not replicating to all osds

2016-06-28 Thread Brad Hubbard
On Tue, Jun 28, 2016 at 4:17 PM, Ishmael Tsoaela  wrote:
> Hi,
>
> I am new to Ceph and most of the concepts are new.
>
> image mounted on nodeA, FS is XFS
>
> sudo mkfs.xfs  /dev/rbd/data/data_01
>
> sudo mount /dev/rbd/data/data_01 /mnt
>
> cluster_master@nodeB:~$ mount|grep rbd
> /dev/rbd0 on /mnt type xfs (rw)

XFS is not a network filesystem. It cannot be mounted on more than one
system at any given time without corrupting it. Even if one mountpoint
does no writes, the log will still be replayed during the mount, and
that should be enough for at least one system to detect that the
filesystem is corrupted.

Cheers,
Brad

>
>
> Basically I need a way to write on nodeA, mount the same image on nodeB and
> be able to write on either of the nodes. Data should be replicated to both
> but I see on the logs for both osd, data is only stored on one.
>
>
> I am busy looking at CEPHFS
>
>
> thanks for the assistance.
>
>
>
>
>
>
>
>
>
>
> On Tue, Jun 28, 2016 at 1:09 AM, Christian Balzer  wrote:
>>
>>
>> Hello,
>>
>> On Mon, 27 Jun 2016 17:00:42 +0200 Ishmael Tsoaela wrote:
>>
>> > Hi ALL,
>> >
>> > Anyone can help with this issue would be much appreciated.
>> >
>> Your subject line has nothing to do with your "problem".
>>
>> You're alluding to OSD replication problems, obviously assuming that one
>> client would write to OSD A and the other client reading from OSD B.
>> Which is not how Ceph works, but again, that's not your problem.
>>
>> > I have created an  image on one client and mounted it on both 2 client I
>> > have setup.
>> >
>> Details missing, but it's pretty obvious that you created a plain FS like
>> Ext4 on that image.
>>
>> > When I write data on one client, I cannot access the data on another
>> > client, what could be causing this issue?
>> >
>> This has cropped up here frequently, you're confusing replicated BLOCK
>> storage like RBD or DRBD with shared file systems like NFS of CephFS.
>>
>> EXT4 and other normal FS can't do that and you just corrupted your FS on
>> that image.
>>
>> So either use CephFS or run OCFS2/GFS2 on your shared image and clients.
>>
>> Christian
>> --
>> Christian BalzerNetwork/Systems Engineer
>> ch...@gol.com   Global OnLine Japan/Rakuten Communications
>> http://www.gol.com/
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] librados: how to get notified when a certain object is created

2016-02-23 Thread Brad Hubbard

- Original Message -
> From: "Sorin Manolache" 
> To: ceph-users@lists.ceph.com
> Sent: Sunday, 21 February, 2016 8:20:13 AM
> Subject: [ceph-users] librados: how to get notified when a certain object is  
> created
> 
> Hello,
> 
> I can set a watch on an object in librados. Does this object have to
> exist already at the moment I'm setting the watch on it? What happens if
> the object does not exist? Is my watcher valid? Will I get notified when
> someone else creates the missing object that I'm watching and sends a
> notification?
> 
> If the watch is not valid if the object has not yet been created then
> how can I get notified when the object is created? (I can imagine a
> work-around: there's an additional object, a kind of object registry
> object (the equivalent of a directory in a file system), that contains
> the list of created objects. I'm watching for modifications of the
> object registry object. Whenever a new object is created, the agent that
> creates the object also updates the object registry object.)

Could an object class be the right solution here?

https://github.com/ceph/ceph/blob/master/src/cls/hello/cls_hello.cc#L78

Cheers,
Brad

> 
> Thank you,
> Sorin
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw crash - Infernalis

2016-04-28 Thread Brad Hubbard

- Original Message -
> From: "Karol Mroz" <km...@suse.com>
> To: "Brad Hubbard" <bhubb...@redhat.com>
> Cc: "Ben Hines" <bhi...@gmail.com>, "ceph-users" <ceph-users@lists.ceph.com>
> Sent: Thursday, 28 April, 2016 7:17:05 PM
> Subject: Re: [ceph-users] radosgw crash - Infernalis
> 
> Hi Brad,
> 
> On Wed, Apr 27, 2016 at 11:40:40PM -0400, Brad Hubbard wrote:
> [...]
> > > 0030a810 <_Z13pidfile_writePK11md_config_t@@Base>:
> > > ...
> > >   30b09d:   e8 0e 40 e4 ff  callq  14f0b0 <backtrace@plt>
> > >   30b0a2:   4c 89 efmov%r13,%rdi
> > >   ---
> > > ...
> > > 
> > > So either we tripped backtrace() code from pidfile_write() _or_ we can't
> > > trust the stack. From the log snippet, it looks that we're far past the
> > > point
> > > at which we would write a pidfile to disk (ie. at process start during
> > > global_init()).
> > > Rather, we're actually handling a request and outputting some bit of
> > > debug
> > > message
> > > via MSDOp::print() and beyond...
> > 
> > It would help to know what binary this is and what OS.
> > 
> > We know the offset into the function is 0x30b0a2 but we don't know which
> > function yet AFAICT. Karol, how did you arrive at pidfile_write? Purely
> > from
> > the offset? I'm not sure that would be reliable...
> 
> Correct, from the offset. Let me clarify, I don't think pidfile_write() is
> the
> function in which we segfaulted :) Hence my suspicion of a blown stack. I

You could definitely be on the money here but IMHO it is too early to tell.

> don't
> know the specifics behind the backtrace call used to generate this stack...
> so
> maybe this is a naive question... but why do you think the offset is
> unreliable?
> Perhaps I'm not reading this trace correctly?

Well, you could have multiple functions which include an offset of 0x30b0a2.
Which function would it be in that case? The other frame shows an offset of
0xf100, can you identify that function just from the offset?

The following stack gives some good examples.

 1: /usr/bin/ceph-osd() [0xa05e32]
 2: (()+0xf100) [0x7f9ea295c100]
 3: (OSD::handle_osd_ping(MOSDPing*)+0x75a) [0x659e7a]
 4: (OSD::heartbeat_dispatch(Message*)+0x2fb) [0x65b0cb]
 5: (DispatchQueue::entry()+0x62a) [0xbc2aba]
 6: (DispatchQueue::DispatchThread::entry()+0xd) [0xae572d]
 7: (()+0x7dc5) [0x7f9ea2954dc5]
 8: (clone()+0x6d) [0x7f9ea143528d]

The offsets are relative to the address where the function is loaded in memory
and I don't think searching for 0x6d, 0x2fb, 0x62a or 0x75a will give you the
correct result if you don't know which function you are dealing with.  The
offset is just an offset from the start of *some function* so without knowing
which function we can't be sure what instruction we were on.  That's my
understanding anyway.
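To make the arithmetic concrete: when the frame has no function name, as in frame 1 of this crash, the offset can at best be interpreted relative to the object's load address, which you would read from /proc/<pid>/maps or the core file (the base below is made up for illustration):

```shell
# Frame 1 was "(()+0x30b0a2) [0x7f4c4907f0a2]": the bracketed value is
# the absolute crash address; the offset only resolves to a symbol once
# you know where the object was loaded.
crash_addr=0x7f4c4907f0a2
load_base=0x7f4c48d74000    # hypothetical base from /proc/<pid>/maps

printf 'offset into object: %#x\n' $(( crash_addr - load_base ))  # 0x30b0a2
# With the base known, addr2line can map the offset to a symbol:
# addr2line -Cfe /usr/bin/radosgw 0x30b0a2
```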

I agree that a stack with only two frames looks dodgy though and we may be
chasing our tails but I'm hoping we can squeeze more info out of a core or a
better stack trace with all debuginfo loaded (if the function has no name due
to lack of debuginfo and not due to stack corruption).

Cheers,
Brad

> 
> > 
> > This is a segfault so the address of the frame where we crashed should be
> > the
> > exact instruction where we crashed. I don't believe a mov from one register
> > to
> > another that does not involve a dereference ((%r13) as opposed to %r13) can
> > cause a segfault so I don't think we are on the right instruction but then,
> > as
> > you say, the stack may be corrupt.
> 
> Agreed... a mov between registers wouldn't cause a segfault.
> 
> --
> Regards,
> Karol
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw crash - Infernalis

2016-04-27 Thread Brad Hubbard
- Original Message -
> From: "Karol Mroz" 
> To: "Ben Hines" 
> Cc: "ceph-users" 
> Sent: Wednesday, 27 April, 2016 7:06:56 PM
> Subject: Re: [ceph-users] radosgw crash - Infernalis
> 
> On Tue, Apr 26, 2016 at 10:17:31PM -0700, Ben Hines wrote:
> [...]
> > --> 10.30.1.6:6800/10350 -- osd_op(client.44852756.0:79
> > default.42048218. [getxattrs,stat,read 0~524288] 12.aa730416
> > ack+read+known_if_redirected e100207) v6 -- ?+0 0x7f49c41880b0 con
> > 0x7f49c4145eb0
> >  0> 2016-04-26 22:07:59.685615 7f49a07f0700 -1 *** Caught signal
> > (Segmentation fault) **
> >  in thread 7f49a07f0700
> > 
> >  ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd)
> >  1: (()+0x30b0a2) [0x7f4c4907f0a2]
> >  2: (()+0xf100) [0x7f4c44f7a100]
> >  NOTE: a copy of the executable, or `objdump -rdS ` is needed
> > to interpret this.
> 
> Hi Ben,
> 
> I sense a pretty badly corrupted stack. From the radosgw-9.2.1 (obtained from
> a downloaded rpm):
> 
> 0030a810 <_Z13pidfile_writePK11md_config_t@@Base>:
> ...
>   30b09d:   e8 0e 40 e4 ff  callq  14f0b0 
>   30b0a2:   4c 89 efmov%r13,%rdi
>   ---
> ...
> 
> So either we tripped backtrace() code from pidfile_write() _or_ we can't
> trust the stack. From the log snippet, it looks that we're far past the point
> at which we would write a pidfile to disk (ie. at process start during
> global_init()).
> Rather, we're actually handling a request and outputting some bit of debug
> message
> via MSDOp::print() and beyond...

It would help to know what binary this is and what OS.

We know the offset into the function is 0x30b0a2 but we don't know which
function yet AFAICT. Karol, how did you arrive at pidfile_write? Purely from
the offset? I'm not sure that would be reliable...

This is a segfault so the address of the frame where we crashed should be the
exact instruction where we crashed. I don't believe a mov from one register to
another that does not involve a dereference ((%r13) as opposed to %r13) can
cause a segfault so I don't think we are on the right instruction but then, as
you say, the stack may be corrupt.

> 
> Is this something you're able to easily reproduce? More logs with higher log
> levels
> would be helpful... a coredump with radosgw compiled with -g would be
> excellent :)

Agreed, although if this is an rpm based system it should be sufficient to
run the following.

# debuginfo-install ceph glibc

That may give us the name of the function depending on where we are (if we are
in a library it may require the debuginfo for that library to be loaded).

Karol is right that a coredump would be a good idea in this case and will give
us maximum information about the issue you are seeing.

Cheers,
Brad

> 
> --
> Regards,
> Karol
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw crash - Infernalis

2016-04-27 Thread Brad Hubbard
- Original Message - 

> From: "Ben Hines" <bhi...@gmail.com>
> To: "Brad Hubbard" <bhubb...@redhat.com>
> Cc: "Karol Mroz" <km...@suse.com>, "ceph-users" <ceph-users@lists.ceph.com>
> Sent: Thursday, 28 April, 2016 3:09:16 PM
> Subject: Re: [ceph-users] radosgw crash - Infernalis

> Got it again - however, the stack is exactly the same, no symbols - debuginfo
> didn't resolve. Do i need to do something to enable that?

It's possible we are in a library for which you don't have debuginfo loaded.
Given the list of libraries that radosgw links to getting all debuginfo loaded
may be a daunting prospect. The other possibility is the stack is badly
corrupted as Karol suggested.

Any chance you can capture a core?

You could try setting "ulimit -c unlimited" and starting the daemon from the
command line.
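Something along these lines (paths are illustrative; the core's location depends on your kernel's core_pattern setting):

```shell
# Enable core dumps in this shell, then start the crashing daemon from
# the same shell so the limit is inherited.
ulimit -c unlimited 2>/dev/null || true
ulimit -c                             # show the effective soft limit
# Where the kernel writes cores:
# cat /proc/sys/kernel/core_pattern
# After the next crash, open the core with matching debuginfo installed:
# gdb /usr/bin/radosgw /path/to/core -ex 'bt'
```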

HTH,
Brad

> The server in 'debug ms=10' this time, so there is a bit more spew:

> -14> 2016-04-27 21:59:58.811919 7f9e817fa700 1 -- 10.30.1.8:0/3291985349 -->
> 10.30.2.13:6805/27519 -- osd_op(client.44936150.0:223
> obj_delete_at_hint.55 [call timeindex.list] 10.2c88dbcf
> ack+read+known_if_redirected e100564) v6 -- ?+0 0x7f9f140dc5f0 con
> 0x7f9f1410ed10
> -13> 2016-04-27 21:59:58.812039 7f9e3fa6b700 10 -- 10.30.1.8:0/3291985349 >>
> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914 cs=1
> l=1 c=0x7f9f1410ed10).writer: state = open policy.server=0
> -12> 2016-04-27 21:59:58.812096 7f9e3fa6b700 10 -- 10.30.1.8:0/3291985349 >>
> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914 cs=1
> l=1 c=0x7f9f1410ed10).writer: state = open policy.server=0
> -11> 2016-04-27 21:59:58.814343 7f9e3f96a700 10 -- 10.30.1.8:0/3291985349 >>
> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914 cs=1
> l=1 c=0x7f9f1410ed10).reader wants 211 from dispatch throttler 0/104857600
> -10> 2016-04-27 21:59:58.814375 7f9e3f96a700 10 -- 10.30.1.8:0/3291985349 >>
> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914 cs=1
> l=1 c=0x7f9f1410ed10).aborted = 0
> -9> 2016-04-27 21:59:58.814405 7f9e3f96a700 10 -- 10.30.1.8:0/3291985349 >>
> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914 cs=1
> l=1 c=0x7f9f1410ed10).reader got message 2 0x7f9ec0009250 osd_op_reply(223
> obj_delete_at_hint.55 [call] v0'0 uv1448004 ondisk = 0) v6
> -8> 2016-04-27 21:59:58.814428 7f9e3f96a700 1 -- 10.30.1.8:0/3291985349 <==
> osd.6 10.30.2.13:6805/27519 2  osd_op_reply(223
> obj_delete_at_hint.55 [call] v0'0 uv1448004 ondisk = 0) v6 
> 196+0+15 (3849172018 0 2149983739) 0x7f9ec0009250 con 0x7f9f1410ed10
> -7> 2016-04-27 21:59:58.814472 7f9e3f96a700 10 -- 10.30.1.8:0/3291985349
> dispatch_throttle_release 211 to dispatch throttler 211/104857600
> -6> 2016-04-27 21:59:58.814470 7f9e3fa6b700 10 -- 10.30.1.8:0/3291985349 >>
> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914 cs=1
> l=1 c=0x7f9f1410ed10).writer: state = open policy.server=0
> -5> 2016-04-27 21:59:58.814511 7f9e3fa6b700 10 -- 10.30.1.8:0/3291985349 >>
> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914 cs=1
> l=1 c=0x7f9f1410ed10).write_ack 2
> -4> 2016-04-27 21:59:58.814528 7f9e3fa6b700 10 -- 10.30.1.8:0/3291985349 >>
> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914 cs=1
> l=1 c=0x7f9f1410ed10).writer: state = open policy.server=0
> -3> 2016-04-27 21:59:58.814607 7f9e817fa700 1 -- 10.30.1.8:0/3291985349 -->
> 10.30.2.13:6805/27519 -- osd_op(client.44936150.0:224
> obj_delete_at_hint.55 [call lock.unlock] 10.2c88dbcf
> ondisk+write+known_if_redirected e100564) v6 -- ?+0 0x7f9f140dc5f0 con
> 0x7f9f1410ed10
> -2> 2016-04-27 21:59:58.814718 7f9e3fa6b700 10 -- 10.30.1.8:0/3291985349 >>
> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914 cs=1
> l=1 c=0x7f9f1410ed10).writer: state = open policy.server=0
> -1> 2016-04-27 21:59:58.814778 7f9e3fa6b700 10 -- 10.30.1.8:0/3291985349 >>
> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914 cs=1
> l=1 c=0x7f9f1410ed10).writer: state = open policy.server=0
> 0> 2016-04-27 21:59:58.826494 7f9e7e7f4700 -1 *** Caught signal (Segmentation
> fault) **
> in thread 7f9e7e7f4700

> ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd)
> 1: (()+0x30b0a2) [0x7fa11c5030a2]
> 2: (()+0xf100) [0x7fa1183fe100]
> NOTE: a copy of the executable, or `objdump -rdS ` is needed to
> interpret this.

> --- logging levels ---
> 

> On Wed, Apr 27, 2016 at 9:39 PM, Ben Hines < bhi...@gmail.com > wrote:

> > Yes, CentOS 7.2. Ha

Re: [ceph-users] Recovery stuck after adjusting to recent tunables

2016-07-25 Thread Brad Hubbard
On Tue, Jul 26, 2016 at 6:08 AM, Kostis Fardelas <dante1...@gmail.com> wrote:
> Following up, I increased pg_num/pgp_num for my 3-replica pool to 128

These pg numbers seem low.

Can you take a look at http://ceph.com/pgcalc/ and verify these values
are appropriate for your environment and use case?

I'd also take a good look at your crush rules to determine if they are
contributing to the problem.
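The rule of thumb behind pgcalc can be sketched as follows (its default target of roughly 100 PGs per OSD is a guideline, not a hard rule; the shell's integer division slightly undercounts before the round-up, which doesn't change the result here):

```shell
# Rough pgcalc-style heuristic: target ~100 placement groups per OSD,
# divide by the replica count, and round up to the next power of two.
num_osds=2 replicas=3 target_per_osd=100
raw=$(( num_osds * target_per_osd / replicas ))
pg=1
while [ "$pg" -lt "$raw" ]; do pg=$(( pg * 2 )); done
echo "suggested pg_num: $pg"   # 128
```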

> (being in argonaut tunables) and after a small recovery that followed,
> I switched to bobtail tunables. Remapping started and got stuck (!)
> again without any OSD down this time with 1 PG active+remapped. Tried
> restarting PG's OSDs, no luck.
>
> One thing to notice is that stuck PGs are always on this 3-replicated pool.
>
> Finally, I decided to take the hit and switch to firefly tunables
> (with chooseleaf_vary_r=1) just for the sake of it. Misplaced objects
> are on 51% of the cluster right now, so I am going to wait and update
> our thread with the outcome when the dust settles down.
>
> All in all, even if firefly tunables lead to a healthy PG
> distribution, I am afraid I am going to stick with argonaut tunables
> for now and on, the experience was far from encouraging and there is
> little documentation regarding the cons and pros of profile tunables
> changes and their impact on a production cluster.
>
> Kostis
>
> On 24 July 2016 at 14:29, Kostis Fardelas <dante1...@gmail.com> wrote:
>> nice to hear from you Goncalo,
>> what you propose sounds like an interesting theory, I will test it
>> tomorrow and let you know. In the meanwhile, I did the same test with
>> bobtail and argonaut tunables:
>> - with argonaut tunables, the recovery completes to the end
>> - with bobtail tunables, the situation is worse than with firefly - I
>> got even more degraded and misplaced objects and recovery stuck across
>> 6 PGs
>>
>> I also fell upon a thread with an almost similar case [1], where Sage
>> recommends to switch to hammer tunables and straw2 algorithm, but this
>> is not an option for a lot of people due to kernel requirements
>>
>> [1] https://www.spinics.net/lists/ceph-devel/msg30381.html
>>
>>
>> On 24 July 2016 at 03:44, Goncalo Borges <goncalo.bor...@sydney.edu.au> 
>> wrote:
>>> Hi Kostis
>>> This is a wild guess but one thing I note is that your pool 179 has a very 
>>> low pg number (100).
>>>
>>> Maybe the algorithm behind the new tunable need a higher pg number to 
>>> actually proceed with the recovery?
>>>
>>> You could try to increase the pgs to 128 (it is always better to use powers 
>>> of 2) and see if the recover completes..
>>>
>>> Cheers
>>> G.
>>> 
>>> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Kostis 
>>> Fardelas [dante1...@gmail.com]
>>> Sent: 23 July 2016 16:32
>>> To: Brad Hubbard
>>> Cc: ceph-users
>>> Subject: Re: [ceph-users] Recovery stuck after adjusting to recent tunables
>>>
>>> Hi Brad,
>>>
>>> pool 0 'data' replicated size 2 min_size 1 crush_ruleset 3 object_hash
>>> rjenkins pg_num 2048 pgp_num 2048 last_change 119047
>>> crash_replay_interval 45 stripe_width 0
>>> pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 3
>>> object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 119048
>>> stripe_width 0
>>> pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 3 object_hash
>>> rjenkins pg_num 2048 pgp_num 2048 last_change 119049 stripe_width 0
>>> pool 3 'blocks' replicated size 2 min_size 1 crush_ruleset 4
>>> object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 119050
>>> stripe_width 0
>>> pool 4 'maps' replicated size 2 min_size 1 crush_ruleset 3 object_hash
>>> rjenkins pg_num 2048 pgp_num 2048 last_change 119051 stripe_width 0
>>> pool 179 'scbench' replicated size 3 min_size 1 crush_ruleset 0
>>> object_hash rjenkins pg_num 100 pgp_num 100 last_change 154034 flags
>>> hashpspool stripe_width 0
>>>
>>> This is the status of 179.38 when the cluster is healthy:
>>> http://pastebin.ca/3663600
>>>
>>> and this is when recovery is stuck:
>>> http://pastebin.ca/3663601
>>>
>>>
>>> It seems that the PG is replicated with size 3 but the cluster cannot
>>> create the third replica for some objects whose third OSD (OSD.14) is
>>> down. That was not the case with argonaut tunables as I remember.
>>>
>>> Regards
>>>
>>>
>>> On 23 J

Re: [ceph-users] mon_osd_nearfull_ratio (unchangeable) ?

2016-07-25 Thread Brad Hubbard
On Tue, Jul 26, 2016 at 12:16:35PM +1000, Goncalo Borges wrote:
> Hi Brad
> 
> Thanks for replying.
> 
> Answers inline.
> 
> 
> > > I am a bit confused about the 'unchangeable' message we get in Jewel 10.2.2
> > > when I try to change some cluster configs.
> > > 
> > > For example:
> > > 
> > > 1./ if I try to change mon_osd_nearfull_ratio from 0.85 to 0.90, I get
> > > 
> > > # ceph tell mon.* injectargs "--mon_osd_nearfull_ratio 0.90"
> > > mon.rccephmon1: injectargs:mon_osd_nearfull_ratio = '0.9'
> > > (unchangeable)
> > > mon.rccephmon3: injectargs:mon_osd_nearfull_ratio = '0.9'
> > > (unchangeable)
> > > mon.rccephmon2: injectargs:mon_osd_nearfull_ratio = '0.9'
> > > (unchangeable)
> > This is telling you that this variable has no observers (i.e. nothing 
> > monitors
> > it dynamically) so changing it at runtime has no effect. IOW it is read at
> > start-up and not referred to again after that IIUC.
> > 
> > > but the 0.85 default values continues to be showed in
> > > 
> > >  ceph --show-config --conf /dev/null | grep mon_osd_nearfull_ratio
> > >  mon_osd_nearfull_ratio = 0.85
> > Try something like the following.
> > 
> > $ ceph daemon mon.a config show|grep mon_osd_nearfull_ratio
> > 
> > > and I continue to have health warnings regarding near full osds.
> > So the actual config value has been changed but has no effect and will not
> > persist. IOW, this value needs to be modified in the conf file and the
> > daemon restarted.
> > 
> > > 
> > > 2./ If I change in the ceph.conf and restart services, I get the same
> > > behaviour as in 1./ However, if I check the daemon configuration, I see:
> > Please clarify what you mean by "the same behaviour"?
> 
> So, in my ceph.conf I've set 'mon osd nearfull ratio = 0.90' and restarted
> mon and osd (not sure if those were needed) daemons everywhere.
> 
> After restarting, I am still getting the health warnings regarding near full
> osds above 85%. If the new value was active, I should not get such warnings.
> 
> > 
> > >  # ceph daemon mon.rccephmon2 config show | grep 
> > > mon_osd_nearfull_ratio
> > >  "mon_osd_nearfull_ratio": "0.9",
> > Use the daemon command I showed above.
> 
> Isn't it the same as you suggested? That was run after restarting services

Yes, it is. I assumed wrongly that you were using the "--show-config" command
again here.

> so it is still unclear to me why the new value is not picked up and why
> running 'ceph --show-config --conf /dev/null | grep mon_osd_nearfull_ratio'

That command shows the default ceph config, try something like this.

$ ceph -n mon.rccephmon2 --show-config|grep mon_osd_nearfull_ratio

> still shows 0.85
> 
> Maybe a restart if services is not what has to be done but a stop/start
> instead?

You can certainly try it but I would have thought a restart would involve
stop/start of the MON daemon. This thread includes additional information that
may be relevant to you atm.

http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/23391

> 
> Cheers
> Goncalo

-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mon_osd_nearfull_ratio (unchangeable) ?

2016-07-25 Thread Brad Hubbard
On Tue, Jul 26, 2016 at 11:01:49AM +1000, Goncalo Borges wrote:
> Dear Cephers...

Hi Goncalo,

> 
> I am a bit confused about the 'unchangeable' message we get in Jewel 10.2.2
> when I try to change some cluster configs.
> 
> For example:
> 
> 1./ if I try to change mon_osd_nearfull_ratio from 0.85 to 0.90, I get
> 
># ceph tell mon.* injectargs "--mon_osd_nearfull_ratio 0.90"
>mon.rccephmon1: injectargs:mon_osd_nearfull_ratio = '0.9'
>(unchangeable)
>mon.rccephmon3: injectargs:mon_osd_nearfull_ratio = '0.9'
>(unchangeable)
>mon.rccephmon2: injectargs:mon_osd_nearfull_ratio = '0.9'
>(unchangeable)

This is telling you that this variable has no observers (i.e. nothing monitors
it dynamically) so changing it at runtime has no effect. IOW it is read at
start-up and not referred to again after that IIUC.

> 
> but the 0.85 default values continues to be showed in
> 
> ceph --show-config --conf /dev/null | grep mon_osd_nearfull_ratio
> mon_osd_nearfull_ratio = 0.85

Try something like the following.

$ ceph daemon mon.a config show|grep mon_osd_nearfull_ratio

> 
> and I continue to have health warnings regarding near full osds.

So the actual config value has been changed but has no effect and will not
persist. IOW, this value needs to be modified in the conf file and the daemon
restarted.
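For reference, a hypothetical ceph.conf fragment matching the above (section placement assumed; whether a plain restart actually picks it up is exactly what this thread goes on to question):

```ini
# ceph.conf fragment -- illustrative. The value is read at start-up
# only, so it must be in place before the monitors start.
[mon]
mon osd nearfull ratio = 0.90
```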

> 
> 
> 2./ If I change in the ceph.conf and restart services, I get the same
> behaviour as in 1./ However, if I check the daemon configuration, I see:

Please clarify what you mean by "the same behaviour"?

> 
> # ceph daemon mon.rccephmon2 config show | grep mon_osd_nearfull_ratio
> "mon_osd_nearfull_ratio": "0.9",

Use the daemon command I showed above.

> 
> 
> There seems to be some discussion on the topic here:
> https://github.com/ceph/ceph/pull/7085
> 
> But in summary, it doesn't seem I am able to change this value.
> 
> Can someone clarify exactly what is the happening here?

Let me know if this is still unclear.

> 
> Cheers
> G.
> 
> -- 
> Goncalo Borges
> Research Computing
> ARC Centre of Excellence for Particle Physics at the Terascale
> School of Physics A28 | University of Sydney, NSW  2006
> T: +61 2 93511937
> 

> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
Cheers,
Brad


Re: [ceph-users] syslog broke my cluster

2016-07-26 Thread Brad Hubbard
On Tue, Jul 26, 2016 at 03:48:33PM +0100, Sergio A. de Carvalho Jr. wrote:
> As per my previous messages on the list, I was having a strange problem in
> my test cluster (Hammer 0.94.6, CentOS 6.5) where my monitors were
> literally crawling to a halt, preventing them to ever reach quorum and
> causing all sort of problems. As it turned out, to my surprise everything
> went back to normal as soon as I turned off syslog -- special thanks to
> Sean!
> 
> The slowdown with syslog on was so severe that logs were being written with
> a timestamp that was several minutes (and eventually up to hours) behind
> the system clock. The logs from my 4 monitors can be seen in the links
> below:
> 
> https://gist.github.com/anonymous/85213467f701c5a69c7fdb4e54bc7406
> https://gist.github.com/anonymous/f30a8903e701423825fd4d5aaa651e6a
> https://gist.github.com/anonymous/42a1856cc819de5b110d9f887e9859d2
> https://gist.github.com/anonymous/652bc41197e83a9d76cf5b2e6a211aa2
> 
> I'm still trying to understand what is going on with my syslog servers but
> I was wondering... is this a known/documented issue?

If it is, it would be known/documented by the syslog community, right?

> 
> Luckily this was a test cluster but I'm worried I could hit this on a
> production cluster any time soon, and I'm wondering how I could detect it
> before my support engineers lose their minds.

This does not appear to be a ceph-specific issue and would likely affect any
daemon that logs to syslog, right?

One thing you could try is running strace against the MON to see what system
calls are taking a long time and extrapolate from there. The procedure would
be the same if things were being held up by a slow disk (for whatever reason)
or filesystem, etc. This is just a standard performance problem and not a
ceph-specific issue.
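
To put a number on the kind of clock drift described above, you can compare
each log line's embedded timestamp against the current time. A rough sketch,
assuming the usual `YYYY-MM-DD HH:MM:SS.ffffff` timestamp at the start of
each Ceph log line:

```python
from datetime import datetime

# Measure how far behind "now" a log line's embedded timestamp is.
# Assumes the Ceph log format where each line starts with a timestamp
# like "2016-07-26 15:48:33.123456".

def log_lag_seconds(line, now):
    """Return (now - embedded timestamp) in seconds for one log line."""
    stamp = " ".join(line.split()[:2])
    written = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")
    return (now - written).total_seconds()

now = datetime(2016, 7, 26, 16, 0, 0)
line = "2016-07-26 15:48:33.123456 7f9c mon.0 ... some message"
lag = log_lag_seconds(line, now)
print(f"log line is {lag:.0f}s behind the clock")
```

A lag that keeps growing over time would be a cheap early-warning signal for
the syslog slowdown before it reaches hours.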

> 
> Thanks,
> 
> Sergio



-- 
Cheers,
Brad


Re: [ceph-users] mon_osd_nearfull_ratio (unchangeable) ?

2016-07-26 Thread Brad Hubbard
On Tue, Jul 26, 2016 at 09:37:37AM +0200, Dan van der Ster wrote:
> On Tue, Jul 26, 2016 at 3:52 AM, Brad Hubbard <bhubb...@redhat.com> wrote:
> >> 1./ if I try to change mon_osd_nearfull_ratio from 0.85 to 0.90, I get
> >>
> >># ceph tell mon.* injectargs "--mon_osd_nearfull_ratio 0.90"
> >>mon.rccephmon1: injectargs:mon_osd_nearfull_ratio = '0.9'
> >>(unchangeable)
> >>mon.rccephmon3: injectargs:mon_osd_nearfull_ratio = '0.9'
> >>(unchangeable)
> >>mon.rccephmon2: injectargs:mon_osd_nearfull_ratio = '0.9'
> >>(unchangeable)
> >
> > This is telling you that this variable has no observers (i.e. nothing 
> > monitors
> > it dynamically) so changing it at runtime has no effect. IOW it is read at
> > start-up and not referred to again after that IIUC.
> 
> That's not actually true. In fact, the "unchangeable" feature is just
> misleading/wrong in most cases. See
> http://tracker.ceph.com/issues/16054
> 
> In this case, the config item is re-read for every new pgmap:
> 
> src/mon/PGMonitor.cc:pending_inc.nearfull_ratio =
> g_conf->mon_osd_nearfull_ratio;
> 
> so it _is_ changeable at runtime.

Oh, I see now, my bad. That was a stupid mistake on my part since the code is
quite obvious. I was fooled by the code where "(unchangeable)" is output and
the commit/PR for that change.

Thanks for setting this straight and correcting my mistake.

> 
> -- Dan

-- 
Cheers,
Brad


Re: [ceph-users] ceph master build fails on src/gmock, workaround?

2016-07-12 Thread Brad Hubbard
This was resolved in http://tracker.ceph.com/issues/16646

On Sun, Jul 10, 2016 at 5:09 PM, Brad Hubbard <bhubb...@redhat.com> wrote:
> On Sat, Jul 09, 2016 at 10:43:52AM +, Kevan Rehm wrote:
>> Greetings,
>>
>> I cloned the master branch of ceph at https://github.com/ceph/ceph.git
>> onto a Centos 7 machine, then did
>>
>> ./autogen.sh
>> ./configure --enable-xio
>> make
>
> BTW, you should be defaulting to cmake if you don't have a specific need to
> use the autotools build.
>
> --
> Cheers,
> Brad



-- 
Cheers,
Brad


Re: [ceph-users] ceph-fuse segfaults ( jewel 10.2.2)

2016-07-14 Thread Brad Hubbard
python
> > >  16846 goncalo   20   0 1594m  84m  19m R 99.9  0.2   1:06.05 python
> > >  29595 goncalo   20   0 1594m  83m  19m R 100.2  0.2   1:05.57 python
> > >  29312 goncalo   20   0 1594m  83m  19m R 99.9  0.2   1:05.01 python
> > >  31979 goncalo   20   0 1595m  82m  19m R 100.2  0.2   1:04.82 python
> > >  29333 goncalo   20   0 1594m  82m  19m R 99.5  0.2   1:04.94 python
> > >  29609 goncalo   20   0 1594m  82m  19m R 99.9  0.2   1:05.07 python
> > > 
> > > 
> > > 5.> Also, is the version of fuse the same on the nodes running 9.2.0 vs. 
> > > the
> > > nodes running 10.2.2?
> > > 
> > > In 10.2.2 I've compiled with fuse 2.9.7 while in 9.2.0 I've compiled 
> > > against
> > > the default sl6 fuse libs version 2.8.7. However, as I said before, I am
> > > seeing the same issue with 9.2.0 (although with a bit less of used virtual
> > > memory in total).
> > > 
> > > 
> > > 
> > > 
> > > On 07/08/2016 10:53 PM, John Spray wrote:
> > > 
> > > On Fri, Jul 8, 2016 at 8:01 AM, Goncalo Borges
> > > <goncalo.bor...@sydney.edu.au> wrote:
> > > 
> > > Hi Brad, Patrick, All...
> > > 
> > > I think I've understood this second problem. In summary, it is memory
> > > related.
> > > 
> > > This is how I found the source of the problem:
> > > 
> > > 1./ I copied and adapted the user application to run in another cluster of
> > > ours. The idea was for me to understand the application and run it myself 
> > > to
> > > collect logs and so on...
> > > 
> > > 2./ Once I submit it to this other cluster, every thing went fine. I was
> > > hammering cephfs from multiple nodes without problems. This pointed to
> > > something different between the two clusters.
> > > 
> > > 3./ I've started to look better to the segmentation fault message, and
> > > assuming that the names of the methods and functions do mean something, 
> > > the
> > > log seems related to issues on the management of objects in cache. This
> > > pointed to a memory related problem.
> > > 
> > > 4./ On the cluster where the application run successfully, machines have
> > > 48GB of RAM and 96GB of SWAP (don't know why we have such a large SWAP 
> > > size,
> > > it is a legacy setup).
> > > 
> > > # top
> > > top - 00:34:01 up 23 days, 22:21,  1 user,  load average: 12.06, 12.12,
> > > 10.40
> > > Tasks: 683 total,  13 running, 670 sleeping,   0 stopped,   0 zombie
> > > Cpu(s): 49.7%us,  0.6%sy,  0.0%ni, 49.7%id,  0.1%wa,  0.0%hi,  0.0%si,
> > > 0.0%st
> > > Mem:  49409308k total, 29692548k used, 19716760k free,   433064k buffers
> > > Swap: 98301948k total,0k used, 98301948k free, 26742484k cached
> > > 
> > > 5./ I have noticed that ceph-fuse (in 10.2.2) consumes about 1.5 GB of
> > > virtual memory when there is no applications using the filesystem.
> > > 
> > >   7152 root  20   0 1108m  12m 5496 S  0.0  0.0   0:00.04 ceph-fuse
> > > 
> > > When I only have one instance of the user application running, ceph-fuse 
> > > (in
> > > 10.2.2) slowly rises with time up to 10 GB of memory usage.
> > > 
> > > if I submit a large number of user applications simultaneously, ceph-fuse
> > > goes very fast to ~10GB.
> > > 
> > >PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
> > > 18563 root  20   0 10.0g 328m 5724 S  4.0  0.7   1:38.00 ceph-fuse
> > >   4343 root  20   0 3131m 237m  12m S  0.0  0.5  28:24.56 
> > > dsm_om_connsvcd
> > >   5536 goncalo   20   0 1599m  99m  32m R 99.9  0.2  31:35.46 python
> > > 31427 goncalo   20   0 1597m  89m  20m R 99.9  0.2  31:35.88 python
> > > 20504 goncalo   20   0 1599m  89m  20m R 100.2  0.2  31:34.29 python
> > > 20508 goncalo   20   0 1599m  89m  20m R 99.9  0.2  31:34.20 python
> > >   4973 goncalo   20   0 1599m  89m  20m R 99.9  0.2  31:35.70 python
> > >   1331 goncalo   20   0 1597m  88m  20m R 99.9  0.2  31:35.72 python
> > > 20505 goncalo   20   0 1597m  88m  20m R 99.9  0.2  31:34.46 python
> > > 20507 goncalo   20   0 1599m  87m  20m R 99.9  0.2  31:34.37 python
> > > 28375 goncalo   20   0 1597m  86m  20m R 99.9  0.2  31:35.52 python
> > > 20503 goncalo   20   0 1597m  85m  20m R 100.2  0.2  31:34.09 python
> > > 20506 goncalo   20   0 1597m 

Re: [ceph-users] Try to install ceph hammer on CentOS7

2016-07-22 Thread Brad Hubbard
On Sat, Jul 23, 2016 at 1:41 AM, Ruben Kerkhof  wrote:
> Please keep the mailing list on the CC.
>
> On Fri, Jul 22, 2016 at 3:40 PM, Manuel Lausch  wrote:
>> Oh, that was a copy-and-paste error.
>> Of course I checked my config again. Some other variations of configurating
>> didn't help as well.
>>
>> Finally I put the ceph-0.94.7-0.el7.x86_64.rpm in a directory and created
>> the necessary repository index files with createrepo. Also with this as a
>> repository the ceph package is not visible. Other packages in the repository
>> works fine.
>>
>> If I try to install the package with yum install
>> ~/ceph-0.94.7-0.el7.x86_64.rpm, the installation including the dependencies
>> is successful.
>>
>> My knowledge of rpm and yum is not as deep as it should be, so I don't know
>> how to debug further.
>
> What does yum repolist show?

This is good advice.

I'd also advise running "yum clean all" before proceeding once you
have confirmed everything is configured correctly.

HTH,
Brad

> It looks like the ceph-noarch repo is ok, the ceph repo isn't.
>
>>
>> Regards,
>> Manuel
>
> Regards,
>
> Ruben


Re: [ceph-users] Recovery stuck after adjusting to recent tunables

2016-07-22 Thread Brad Hubbard
On Sat, Jul 23, 2016 at 12:17 AM, Kostis Fardelas  wrote:
> Hello,
> being in latest Hammer, I think I hit a bug with more recent than
> legacy tunables.
>
> Being in legacy tunables for a while, I decided to experiment with
> "better" tunables. So first I went from argonaut profile to bobtail
> and then to firefly. However, I decided to make the changes on
> chooseleaf_vary_r incrementally (because the remapping from 0 to 5 was
> huge), from 5 down to the best value (1). So when I reached
> chooseleaf_vary_r = 2, I decided to run a simple test before going to
> chooseleaf_vary_r = 1: close an OSD (OSD.14) and let the cluster
> recover. But the recovery never completes and a PG remains stuck,
> reported as undersized+degraded. No OSD is near full and all pools
> have min_size=1.
>
> ceph osd crush show-tunables -f json-pretty
>
> {
> "choose_local_tries": 0,
> "choose_local_fallback_tries": 0,
> "choose_total_tries": 50,
> "chooseleaf_descend_once": 1,
> "chooseleaf_vary_r": 2,
> "straw_calc_version": 1,
> "allowed_bucket_algs": 22,
> "profile": "unknown",
> "optimal_tunables": 0,
> "legacy_tunables": 0,
> "require_feature_tunables": 1,
> "require_feature_tunables2": 1,
> "require_feature_tunables3": 1,
> "has_v2_rules": 0,
> "has_v3_rules": 0,
> "has_v4_buckets": 0
> }
>
> The really strange thing is that the OSDs of the stuck PG belong to
> other nodes than the one I decided to stop (osd.14).
>
> # ceph pg dump_stuck
> ok
> pg_stat state up up_primary acting acting_primary
> 179.38 active+undersized+degraded [2,8] 2 [2,8] 2

Can you share a query of this pg?

What size (not min size) is this pool (assuming it's 2)?

>
>
> ID WEIGHT   TYPE NAME   UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 11.19995 root default
> -3 11.19995 rack unknownrack
> -2  0.3 host staging-rd0-03
> 14  0.2 osd.14   up  1.0  1.0
> 15  0.2 osd.15   up  1.0  1.0
> -8  5.19998 host staging-rd0-01
>  6  0.5 osd.6up  1.0  1.0
>  7  0.5 osd.7up  1.0  1.0
>  8  1.0 osd.8up  1.0  1.0
>  9  1.0 osd.9up  1.0  1.0
> 10  1.0 osd.10   up  1.0  1.0
> 11  1.0 osd.11   up  1.0  1.0
> -7  5.19998 host staging-rd0-00
>  0  0.5 osd.0up  1.0  1.0
>  1  0.5 osd.1up  1.0  1.0
>  2  1.0 osd.2up  1.0  1.0
>  3  1.0 osd.3up  1.0  1.0
>  4  1.0 osd.4up  1.0  1.0
>  5  1.0 osd.5up  1.0  1.0
> -4  0.3 host staging-rd0-02
> 12  0.2 osd.12   up  1.0  1.0
> 13  0.2 osd.13   up  1.0  1.0
>
>
> Have you experienced something similar?
>
> Regards,
> Kostis



-- 
Cheers,
Brad


Re: [ceph-users] blocked ops

2016-08-11 Thread Brad Hubbard
On Thu, Aug 11, 2016 at 11:33:29PM +0100, Roeland Mertens wrote:
> Hi,
> 
> I was hoping someone on this list may be able to help?
> 
> We're running a 35 node 10.2.1 cluster with 595 OSDs. For the last 12 hours
> we've been plagued with blocked requests which completely kills the
> performance of the cluster
> 
> # ceph health detail
> HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs down; 1
> pgs peering; 1 pgs stuck inactive; 100 requests are blocked > 32 sec; 1 osds
> have slow requests; noout,nodeep-scrub,sortbitwise flag(s) set
> pg 63.1a18 is stuck inactive for 135133.509820, current state
> down+remapped+peering, last acting 
> [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,235,148,290,300,147,157,370]

That value (2147483647) is defined in src/crush/crush.h like so;

#define CRUSH_ITEM_NONE   0x7fff  /* no result */

So this could be due to a bad crush rule or maybe choose_total_tries needs to
be higher?
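
For reference, that constant is just the largest signed 32-bit integer, so an
acting set can be decoded mechanically. A small sketch using the acting set
from the health output above:

```python
# CRUSH_ITEM_NONE is 0x7fffffff (== 2147483647): CRUSH found no OSD for
# that slot. Decode an acting set into real OSDs vs. unmapped slots.

CRUSH_ITEM_NONE = 0x7fffffff

acting = [2147483647, 2147483647, 2147483647, 2147483647, 2147483647,
          2147483647, 235, 148, 290, 300, 147, 157, 370]

mapped = [osd for osd in acting if osd != CRUSH_ITEM_NONE]
unmapped_slots = [i for i, osd in enumerate(acting) if osd == CRUSH_ITEM_NONE]

print(f"{len(mapped)} of {len(acting)} shards mapped; "
      f"empty slots: {unmapped_slots}")
```

Six of the thirteen EC shards having no OSD assigned is what points at CRUSH
giving up (or having nowhere valid to place them).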

$ ceph osd crush rule ls

Then, for each rule listed by the above command:

$ ceph osd crush rule dump [rule_name]

I'd then dump out the crushmap and test it, showing any bad mappings, with the
commands listed here:

http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon

I'd also check that the pg numbers for your pool(s) are appropriate, as too few
pgs could also be a contributing factor, IIRC.
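
On the pg-number point, the common rule of thumb (a heuristic only, not
authoritative for every workload) is roughly 100 PGs per OSD divided by the
replica count, rounded up to a power of two. A sketch:

```python
# Rough pg_num sizing heuristic: (osds * target_pgs_per_osd) / replicas,
# rounded up to the next power of two. This is a guideline only; the Ceph
# placement-group docs are the authoritative reference.

def suggest_pg_num(num_osds, replicas, target_pgs_per_osd=100):
    raw = num_osds * target_pgs_per_osd / replicas
    pg_num = 1
    while pg_num < raw:
        pg_num *= 2
    return pg_num

# For a 595-OSD cluster with 3 replicas:
print(suggest_pg_num(595, 3))
```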

That should hopefully give some insight.

-- 
HTH,
Brad

> pg 63.1a18 is down+remapped+peering, acting 
> [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,235,148,290,300,147,157,370]
> 100 ops are blocked > 2097.15 sec on osd.4
> 1 osds have slow requests
> noout,nodeep-scrub,sortbitwise flag(s) set
> 
> the one pg down is due to us running into an odd EC issue which I mailed the
> list about earlier, it's the 100 blocked ops that are puzzling us. If we out
> the osd in question, they just shift to another osd (on a different host!).
> We even tried rebooting the node it's on but to little avail.
> 
> We get a ton of log messages like this:
> 
> 2016-08-11 23:32:10.041174 7fc668d9f700  0 log_channel(cluster) log [WRN] :
> 100 slow requests, 5 included below; oldest blocked for > 139.313915 secs
> 2016-08-11 23:32:10.041184 7fc668d9f700  0 log_channel(cluster) log [WRN] :
> slow request 139.267004 seconds old, received at 2016-08-11 23:29:50.774091:
> osd_op(client.9192464.0:485640 66.b96c3a18
> default.4282484.42_442fac8195c63a2e19c3c4bb91e8800e [getxattrs,stat,read
> 0~524288] snapc 0=[] RETRY=36 ack+retry+read+known_if_redirected e50109)
> currently waiting for blocked object
> 2016-08-11 23:32:10.041189 7fc668d9f700  0 log_channel(cluster) log [WRN] :
> slow request 139.244839 seconds old, received at 2016-08-11 23:29:50.796256:
> osd_op(client.9192464.0:596033 66.942a5a18
> default.4282484.30__shadow_.sLkZ_rUX6cvi0ifFasw1UipEIuFPzYB_6 [write
> 1048576~524288] snapc 0=[] RETRY=36
> ack+ondisk+retry+write+known_if_redirected e50109) currently waiting for
> blocked object
> 
> A dump of the blocked ops tells us very little , is there anyone who can
> shed some light on this? Or at least give us a hint on how we can fix this?
> 
> # ceph daemon osd.4 dump_blocked_ops
> 
> 
>{
> "description": "osd_op(client.9192464.0:596030 66.942a5a18
> default.4282484.30__shadow_.sLkZ_rUX6cvi0ifFasw1UipEIuFPzYB_6 [writefull
> 0~0] snapc 0=[] RETRY=32 ack+ondisk+retry+write+known_if_redirected
> e50092)",
> "initiated_at": "2016-08-11 22:58:09.721027",
> "age": 1515.105186,
> "duration": 1515.113255,
> "type_data": [
> "reached pg",
> {
> "client": "client.9192464",
> "tid": 596030
> },
> [
> {
> "time": "2016-08-11 22:58:09.721027",
> "event": "initiated"
> },
> {
> "time": "2016-08-11 22:58:09.721066",
> "event": "waiting_for_map not empty"
> },
> {
> "time": "2016-08-11 22:58:09.813574",
> "event": "reached_pg"
> },
> {
> "time": "2016-08-11 22:58:09.813581",
> "event": "waiting for peered"
> },
> {
> "time": "2016-08-11 22:58:09.852796",
> "event": "reached_pg"
> },
> {
> "time": "2016-08-11 22:58:09.852804",
> "event": "waiting for peered"
> },
> {
> "time": "2016-08-11 22:58:10.876636",
> "event": "reached_pg"
> },
> {
>  

Re: [ceph-users] installing multi osd and monitor of ceph in single VM

2016-08-09 Thread Brad Hubbard
On Wed, Aug 10, 2016 at 12:26 AM, agung Laksono  wrote:
>
> Hi Ceph users,
>
> I am new to ceph. I've succeeded in installing ceph in 4 VMs using the Quick
> Installation guide in the ceph documentation.
>
> I've also compiled ceph from source code, and
> built and installed it in a single VM.
>
> What I want to do next is to run multiple ceph nodes in a cluster,
> but only inside a single machine. I need this because I will
> learn the ceph code and will modify some codes, recompile and
> redeploy on the node/VM. For my study, I also have to be able to run/kill a
> particular node.
>
> Does somebody know how to configure a single VM to run multiple ceph OSDs
> and monitors?
>
> Advice and comments are very much appreciated. Thanks.

Hi,

Did you see this?

http://docs.ceph.com/docs/hammer/dev/quick_guide/#running-a-development-deployment

Also take a look at the AIO (all in one) options in ceph-ansible.

HTH,
Brad


Re: [ceph-users] Recover Data from Deleted RBD Volume

2016-08-09 Thread Brad Hubbard
On Tue, Aug 9, 2016 at 7:39 AM, George Mihaiescu  wrote:
> Look in the cinder db (the volumes table) to find the UUID of the deleted
> volume.

You could also look through the logs from the time of the delete; I suspect
you should be able to see how the rbd image was prefixed/named when it was
removed.
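
If the prefix does turn up in the logs, the names of the data objects that
backed the image can be reconstructed, since (if I recall the naming
correctly) they are the block-name prefix plus a hex-encoded object index,
i.e. the offset divided by the 4 MB default object size: 12 hex digits for
format-1 images (`rb.0.` prefixes) and 16 for format-2 (`rbd_data.`). A
sketch with an invented prefix:

```python
# Reconstruct the RADOS object names that backed an RBD image, given its
# block-name prefix. The prefix below is invented for illustration.
# Format-1 images use 12 hex digits in the suffix, format-2 use 16; both
# index objects by offset // object_size (4 MiB by default).

OBJECT_SIZE = 4 * 1024 * 1024  # default RBD object size

def object_name(prefix, offset, hex_digits=12):
    index = offset // OBJECT_SIZE
    return f"{prefix}.{index:0{hex_digits}x}"

prefix = "rb.0.1234.5678"  # hypothetical format-1 prefix
# First three objects of the image:
for off in range(0, 3 * OBJECT_SIZE, OBJECT_SIZE):
    print(object_name(prefix, off))
```

Those names are what you would then hunt for with filesystem undelete tools
on the OSDs.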

HTH,
Brad

>
> If you go through your OSDs and look for the directories for PG index 20,
> you might find some fragments from the deleted volume, but it's a long shot...
>
>> On Aug 8, 2016, at 4:39 PM, Georgios Dimitrakakis  
>> wrote:
>>
>> Dear David (and all),
>>
>> the data are considered very critical therefore all this attempt to recover 
>> them.
>>
>> Although the cluster hasn't been fully stopped all users actions have. I 
>> mean services are running but users are not able to read/write/delete.
>>
>> The deleted image was the exact same size of the example (500GB) but it 
>> wasn't the only one deleted today. Our user was trying to do a "massive" 
>> cleanup by deleting 11 volumes and unfortunately one of them was very 
>> important.
>>
>> Let's assume that I "dd" all the drives what further actions should I do to 
>> recover the files? Could you please elaborate a bit more on the phrase "If 
>> you've never deleted any other rbd images and assuming you can recover data 
>> with names, you may be able to find the rbd objects"??
>>
>> Do you mean that if I know the file names I can go through and check for 
>> them? How?
>> Do I have to know *all* file names or by searching for a few of them I can 
>> find all data that exist?
>>
>> Thanks a lot for taking the time to answer my questions!
>>
>> All the best,
>>
>> G.
>>
>>> I don't think there's a way of getting the prefix from the cluster at
>>> this point.
>>>
>>> If the deleted image was a similar size to the example you've given,
>>> you will likely have had objects on every OSD. If this data is
>>> absolutely critical you need to stop your cluster immediately or make
>>> copies of all the drives with something like dd. If you've never
>>> deleted any other rbd images and assuming you can recover data with
>>> names, you may be able to find the rbd objects.
>>>
>>> On Mon, Aug 8, 2016 at 7:28 PM, Georgios Dimitrakakis  wrote:
>>>
>> Hi,
>>
>> On 08.08.2016 10:50, Georgios Dimitrakakis wrote:
>>
 Hi,

> On 08.08.2016 09:58, Georgios Dimitrakakis wrote:
>
> Dear all,
>
> I would like your help with an emergency issue but first
> let me describe our environment.
>
> Our environment consists of 2OSD nodes with 10x 2TB HDDs
> each and 3MON nodes (2 of them are the OSD nodes as well)
> all with ceph version 0.80.9
> (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)
>
> This environment provides RBD volumes to an OpenStack
> Icehouse installation.
>
> Although not a state of the art environment is working
> well and within our expectations.
>
> The issue now is that one of our users accidentally
> deleted one of the volumes without keeping its data first!
>
> Is there any way (since the data are considered critical
> and very important) to recover them from CEPH?

 Short answer: no

 Long answer: no, but

 Consider the way Ceph stores data... each RBD is striped
 into chunks
 (RADOS objects with 4MB size by default); the chunks are
 distributed
 among the OSDs with the configured number of replicates
 (probably two
 in your case since you use 2 OSD hosts). RBD uses thin
 provisioning,
 so chunks are allocated upon first write access.
 If an RBD is deleted all of its chunks are deleted on the
 corresponding OSDs. If you want to recover a deleted RBD,
 you need to
 recover all individual chunks. Whether this is possible
 depends on
 your filesystem and whether the space of a former chunk is
 already
 assigned to other RADOS objects. The RADOS object names are
 composed
 of the RBD name and the offset position of the chunk, so if
 an
 undelete mechanism exists for the OSD's filesystem, you have
 to be
 able to recover files by their filenames, otherwise you might
 end up
 mixing the content of various deleted RBDs. Due to the thin
 provisioning there might be some chunks missing (e.g. never
 allocated
 before).

 Given the fact that
 - you probably use XFS on the OSDs since it is the
 preferred
 filesystem for OSDs (there is RDR-XFS, but I've never had to
 use it)
 - you would need to stop the complete ceph cluster
 (recovery tools do
 not work on mounted filesystems)

Re: [ceph-users] Recover Data from Deleted RBD Volume

2016-08-10 Thread Brad Hubbard


On Wed, Aug 10, 2016 at 3:16 PM, Georgios Dimitrakakis  
wrote:
>
> Hello!
>
> Brad,
>
> is that possible from the default logging or verbose one is needed??
>
> I 've managed to get the UUID of the deleted volume from OpenStack but don't
> really know how to get the offsets and OSD maps since "rbd info" doesn't
> provide any information for that volume.

Did you grep for the UUID (might be safer to grep for the first 8 chars or
so since I'm not 100% sure of the format) in the logs?

There is also a RADOS object called the rbd directory that contains some
mapping information for rbd images. I don't know whether it is erased when an
image is deleted, nor how to look at it, but someone more adept at RBD may be
able to suggest how to confirm this.

HTH,
Brad

>
> Is it possible to somehow get them from leveldb?
>
> Best,
>
> G.
>
>
>> On Tue, Aug 9, 2016 at 7:39 AM, George Mihaiescu
>>  wrote:
>>>
>>> Look in the cinder db, the volumes table to find the Uuid of the deleted
>>> volume.
>>
>>
>> You could also look through the logs at the time of the delete and I
>> suspect you should
>> be able to see how the rbd image was prefixed/named at the time of
>> the delete.
>>
>> HTH,
>> Brad
>>
>>>
>>> If you go through yours OSDs and look for the directories for PG index
>>> 20, you might find some fragments from the deleted volume, but it's a long
>>> shot...
>>>
 On Aug 8, 2016, at 4:39 PM, Georgios Dimitrakakis 
 wrote:

 Dear David (and all),

 the data are considered very critical therefore all this attempt to
 recover them.

 Although the cluster hasn't been fully stopped all users actions have. I
 mean services are running but users are not able to read/write/delete.

 The deleted image was the exact same size of the example (500GB) but it
 wasn't the only one deleted today. Our user was trying to do a "massive"
 cleanup by deleting 11 volumes and unfortunately one of them was very
 important.

 Let's assume that I "dd" all the drives what further actions should I do
 to recover the files? Could you please elaborate a bit more on the phrase
 "If you've never deleted any other rbd images and assuming you can recover
 data with names, you may be able to find the rbd objects"??

 Do you mean that if I know the file names I can go through and check for
 them? How?
 Do I have to know *all* file names or by searching for a few of them I
 can find all data that exist?

 Thanks a lot for taking the time to answer my questions!

 All the best,

 G.

> I don't think there's a way of getting the prefix from the cluster at
> this point.
>
> If the deleted image was a similar size to the example you've given,
> you will likely have had objects on every OSD. If this data is
> absolutely critical you need to stop your cluster immediately or make
> copies of all the drives with something like dd. If you've never
> deleted any other rbd images and assuming you can recover data with
> names, you may be able to find the rbd objects.
>
> On Mon, Aug 8, 2016 at 7:28 PM, Georgios Dimitrakakis  wrote:
>
 Hi,

 On 08.08.2016 10:50, Georgios Dimitrakakis wrote:

>> Hi,
>>
>>> On 08.08.2016 09:58, Georgios Dimitrakakis wrote:
>>>
>>> Dear all,
>>>
>>> I would like your help with an emergency issue but first
>>> let me describe our environment.
>>>
>>> Our environment consists of 2OSD nodes with 10x 2TB HDDs
>>> each and 3MON nodes (2 of them are the OSD nodes as well)
>>> all with ceph version 0.80.9
>>> (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)
>>>
>>> This environment provides RBD volumes to an OpenStack
>>> Icehouse installation.
>>>
>>> Although not a state of the art environment is working
>>> well and within our expectations.
>>>
>>> The issue now is that one of our users accidentally
>>> deleted one of the volumes without keeping its data first!
>>>
>>> Is there any way (since the data are considered critical
>>> and very important) to recover them from CEPH?
>>
>>
>> Short answer: no
>>
>> Long answer: no, but
>>
>> Consider the way Ceph stores data... each RBD is striped
>> into chunks
>> (RADOS objects with 4MB size by default); the chunks are
>> distributed
>> among the OSDs with the configured number of replicates
>> (probably two
>> in your case since you use 2 OSD hosts). RBD uses thin
>> provisioning,
>> so chunks are allocated upon first write access.
>> If an RBD is deleted all of its chunks are deleted on the

Re: [ceph-users] installing multi osd and monitor of ceph in single VM

2016-08-10 Thread Brad Hubbard
On Thu, Aug 11, 2016 at 12:45 AM, agung Laksono <agung.sma...@gmail.com> wrote:
> I've seen the Ansible option before, but not in detail.
> I have also tried to run the quick guide for development.
> It did not work on my VM where I had already installed ceph.
>
> the error is :
>
>  agung@arrasyid:~/ceph/ceph/src$ ./vstart.sh -d -n -x
> ** going verbose **
> [./fetch_config /tmp/fetched.ceph.conf.3818]
> ./init-ceph: failed to fetch config with './fetch_config
> /tmp/fetched.ceph.conf.3818'
>
>
> Do I need to use a vanilla ceph to make vstart.sh work?
>
> When I learn a cloud system, I usually compile
> the source code, run it pseudo-distributed, modify the code
> and add prints somewhere, then recompile and re-run the system.
> Might this method work for exploring ceph?

It should, sure.

Try this. 

1) Clone a fresh copy of the repo.
2) ./do_cmake.sh
3) cd build
4) make
5) OSD=3 MON=3 MDS=1 ../src/vstart.sh -n -x -l
6) bin/ceph -s

That should give you a working cluster with 3 MONs, 3 OSDs and 1 MDS.

-- 
Cheers,
Brad

>
>
> On Wed, Aug 10, 2016 at 9:14 AM, Brad Hubbard <bhubb...@redhat.com> wrote:
>>
>> On Wed, Aug 10, 2016 at 12:26 AM, agung Laksono <agung.sma...@gmail.com>
>> wrote:
>> >
>> > Hi Ceph users,
>> >
>> > I am new in ceph. I've been succeed installing ceph in 4 VM using Quick
>> > installation guide in ceph documentation.
>> >
>> > And I've also done to compile
>> > ceph from source code, build and install in single vm.
>> >
>> > What I want to do next is that run ceph multiple nodes in a cluster
>> > but only inside a single machine. I need this because I will
>> > learn the ceph code and will modify some codes, recompile and
>> > redeploy on the node/VM. On my study, I've also to be able to run/kill
>> > particular node.
>> >
>> > does somebody know how to configure single vm to run multiple osd and
>> > monitor of ceph?
>> >
>> > Advises and comments are very appreciate. thanks
>>
>> Hi,
>>
>> Did you see this?
>>
>>
>> http://docs.ceph.com/docs/hammer/dev/quick_guide/#running-a-development-deployment
>>
>> Also take a look at the AIO (all in one) options in ceph-ansible.
>>
>> HTH,
>> Brad
>
>
>
>
> --
> Cheers,
>
> Agung Laksono
>




Re: [ceph-users] blocked ops

2016-08-12 Thread Brad Hubbard
On Fri, Aug 12, 2016 at 07:47:54AM +0100, roeland mertens wrote:
> Hi Brad,
> 
> thank you for that. Unfortunately our immediate concern is the blocked ops
> rather than the broken pg (we know why it's broken).

OK, if you look at the following file it shows not only the declaration of
wait_for_blocked_object (highlighted) but also all of its callers.
https://github.com/ceph/ceph/blob/master/src/osd/ReplicatedPG.cc#L500

Multiple calls relate to snapshots, but I'd suggest turning debug logging for
the OSDs right up; that may give us more information.

# ceph tell osd.* injectargs '--debug_osd 20 --debug_ms 5'

Note: The above will turn up debugging for all OSDs; you may want to focus on
only some, so adjust accordingly.

> I don't think that's specifically crushmap related nor related to the
> broken pg as the osds involved in the blocked ops aren't the ones that were
> hosting the broken pg.
> 
> 
> 
> 
> On 12 August 2016 at 04:12, Brad Hubbard <bhubb...@redhat.com> wrote:
> 
> > On Thu, Aug 11, 2016 at 11:33:29PM +0100, Roeland Mertens wrote:
> > > Hi,
> > >
> > > I was hoping someone on this list may be able to help?
> > >
> > > We're running a 35 node 10.2.1 cluster with 595 OSDs. For the last 12
> > hours
> > > we've been plagued with blocked requests which completely kills the
> > > performance of the cluster
> > >
> > > # ceph health detail
> > > HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs
> > down; 1
> > > pgs peering; 1 pgs stuck inactive; 100 requests are blocked > 32 sec; 1
> > osds
> > > have slow requests; noout,nodeep-scrub,sortbitwise flag(s) set
> > > pg 63.1a18 is stuck inactive for 135133.509820, current state
> > > down+remapped+peering, last acting [2147483647,2147483647,
> > 2147483647,2147483647,2147483647,2147483647,235,148,290,300,147,157,370]
> >
> > That value (2147483647) is defined in src/crush/crush.h like so;
> >
> > #define CRUSH_ITEM_NONE   0x7fff  /* no result */
> >
> > So this could be due to a bad crush rule, or maybe choose_total_tries needs
> > to be higher?
> >
> > $ ceph osd crush rule ls
> >
> > Then, for each rule listed by the above command:
> >
> > $ ceph osd crush rule dump [rule_name]
> >
> > I'd then dump out the crushmap and test it, showing any bad mappings, with
> > the commands listed here:
> >
> > http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon
> >
> > I'd also check that the pg numbers for your pool(s) are appropriate, as too
> > few pgs could also be a contributing factor, IIRC.
> >
> > That should hopefully give some insight.
> >
> > --
> > HTH,
> > Brad
> >
> > > pg 63.1a18 is down+remapped+peering, acting [2147483647,2147483647,
> > 2147483647,2147483647,2147483647,2147483647,235,148,290,300,147,157,370]
> > > 100 ops are blocked > 2097.15 sec on osd.4
> > > 1 osds have slow requests
> > > noout,nodeep-scrub,sortbitwise flag(s) set
> > >
> > > the one pg down is due to us running into an odd EC issue which I mailed
> > the
> > > list about earlier, it's the 100 blocked ops that are puzzling us. If we
> > out
> > > the osd in question, they just shift to another osd (on a different
> > host!).
> > > We even tried rebooting the node it's on but to little avail.
> > >
> > > We get a ton of log messages like this:
> > >
> > > 2016-08-11 23:32:10.041174 7fc668d9f700  0 log_channel(cluster) log
> > [WRN] :
> > > 100 slow requests, 5 included below; oldest blocked for > 139.313915 secs
> > > 2016-08-11 23:32:10.041184 7fc668d9f700  0 log_channel(cluster) log
> > [WRN] :
> > > slow request 139.267004 seconds old, received at 2016-08-11
> > 23:29:50.774091:
> > > osd_op(client.9192464.0:485640 66.b96c3a18
> > > default.4282484.42_442fac8195c63a2e19c3c4bb91e8800e [getxattrs,stat,read
> > > 0~524288] snapc 0=[] RETRY=36 ack+retry+read+known_if_redirected e50109)
> > > currently waiting for blocked object
> > > 2016-08-11 23:32:10.041189 7fc668d9f700  0 log_channel(cluster) log
> > [WRN] :
> > > slow request 139.244839 seconds old, received at 2016-08-11
> > 23:29:50.796256:
> > > osd_op(client.9192464.0:596033 66.942a5a18
> > > default.4282484.30__shadow_.sLkZ_rUX6cvi0ifFasw1UipEIuFPzYB_6 [write
> > > 1048576~524288] snapc 0=[] RETRY=36
> > > ack+ondisk+retry+write+known_if_redirected e5

Re: [ceph-users] installing multi osd and monitor of ceph in single VM

2016-08-10 Thread Brad Hubbard
 thread_name:ceph-osd

 ceph version v11.0.0-798-g62e8a97 (62e8a97bebb8581318d5484391ec0b131e6f7c71)
 1: /home/brad/working/src/ceph/build/bin/ceph-osd() [0xc1f87e]
 2: (()+0x10c30) [0x7fa0e7546c30]
 3: (pthread_join()+0xad) [0x7fa0e753e6bd]
 4: (Thread::join(void**)+0x2c) [0xe8622c]
 5: (DispatchQueue::wait()+0x12) [0xf47002]
 6: (SimpleMessenger::wait()+0xb59) [0xe21389]
 7: (main()+0x2f00) [0x6b3010]

-- 
HTH,
Brad

> 
> 
> 
> On Thu, Aug 11, 2016 at 4:17 AM, Brad Hubbard <bhubb...@redhat.com> wrote:
> 
> > On Thu, Aug 11, 2016 at 12:45 AM, agung Laksono <agung.sma...@gmail.com>
> > wrote:
> > > I've seen Ansible before, but not in detail for that.
> > > I also tried to run the quick guide for development, but
> > > it did not work on the VM where I had already installed ceph.
> > >
> > > the error is :
> > >
> > >  agung@arrasyid:~/ceph/ceph/src$ ./vstart.sh -d -n -x
> > > ** going verbose **
> > > [./fetch_config /tmp/fetched.ceph.conf.3818]
> > > ./init-ceph: failed to fetch config with './fetch_config
> > > /tmp/fetched.ceph.conf.3818'
> > >
> > >
> > > Do I need to use a vanilla ceph to make vstart.sh work?
> > >
> > > When I learn a cloud system, I usually compile
> > > the source code, run it pseudo-distributed, modify the code
> > > and add prints somewhere, then recompile and re-run the system.
> > > Might this method work for exploring ceph?
> >
> > It should, sure.
> >
> > Try this.
> >
> > 1) Clone a fresh copy of the repo.
> > 2) ./do_cmake.sh
> > 3) cd build
> > 4) make
> > 5) OSD=3 MON=3 MDS=1 ../src/vstart.sh -n -x -l
> > 6) bin/ceph -s
> >
> > That should give you a working cluster with 3 MONs, 3 OSDs and 1 MDS.
> >
> > --
> > Cheers,
> > Brad
> >
> > >
> > >
> > > On Wed, Aug 10, 2016 at 9:14 AM, Brad Hubbard <bhubb...@redhat.com>
> > wrote:
> > >>
> > >> On Wed, Aug 10, 2016 at 12:26 AM, agung Laksono <agung.sma...@gmail.com> wrote:
> > >> >
> > >> > Hi Ceph users,
> > >> >
> > >> > I am new to ceph. I've succeeded in installing ceph in 4 VMs using the
> > >> > Quick installation guide in the ceph documentation.
> > >> >
> > >> > I've also compiled ceph
> > >> > from source, and built and installed it in a single VM.
> > >> >
> > >> > What I want to do next is to run multiple ceph nodes in a cluster,
> > >> > but only inside a single machine. I need this because I will
> > >> > study the ceph code and modify some of it, then recompile and
> > >> > redeploy on the node/VM. For my study, I also have to be able to run/kill
> > >> > a particular node.
> > >> >
> > >> > Does somebody know how to configure a single VM to run multiple ceph
> > >> > OSDs and monitors?
> > >> >
> > >> > Advice and comments are much appreciated. Thanks.
> > >>
> > >> Hi,
> > >>
> > >> Did you see this?
> > >>
> > >>
> > >> http://docs.ceph.com/docs/hammer/dev/quick_guide/#running-a-development-deployment
> > >>
> > >> Also take a look at the AIO (all in one) options in ceph-ansible.
> > >>
> > >> HTH,
> > >> Brad
> > >
> > >
> > >
> > >
> > > --
> > > Cheers,
> > >
> > > Agung Laksono
> > >
> >
> >
> >
> 
> 
> -- 
> Cheers,
> 
> Agung Laksono
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd failing to start

2016-07-13 Thread Brad Hubbard
On Thu, Jul 14, 2016 at 06:06:58AM +0200, Martin Wilderoth wrote:
>  Hello,
> 
> I have a ceph cluster where one OSD is failing to start. I have been
> upgrading ceph to see if the error disappeared. Now I'm running jewel but I
> still get the error message.
> 
> -1> 2016-07-13 17:04:22.061384 7fda4d24e700  1 heartbeat_map is_healthy
> 'OSD::osd_tp thread 0x7fda25dd8700' had suicide timed out after 150

This appears to indicate that an OSD thread pool thread (work queue thread)
has failed to complete an operation within the 150 second grace period.

The most likely and common cause for this is hardware failure, and I would
therefore suggest you thoroughly check this device and look for indicators in
syslog, dmesg, diagnostics, etc. that it may have failed.
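For instance, a few places to look (the device name and log paths are placeholders — adjust for your system):

```
$ dmesg | grep -i -E 'sdX|i/o error|medium error'
$ grep -i sdX /var/log/messages      # or /var/log/syslog on Debian/Ubuntu
$ smartctl -a /dev/sdX               # SMART health/attributes, needs smartmontools
```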

-- 
HTH,
Brad


Re: [ceph-users] Fwd: Ceph OSD suicide himself

2016-07-11 Thread Brad Hubbard
On Mon, Jul 11, 2016 at 04:53:36PM +0200, Lionel Bouton wrote:
> Le 11/07/2016 11:56, Brad Hubbard a écrit :
> > On Mon, Jul 11, 2016 at 7:18 PM, Lionel Bouton
> > <lionel-subscript...@bouton.name> wrote:
> >> Le 11/07/2016 04:48, 한승진 a écrit :
> >>> Hi cephers.
> >>>
> >>> I need your help for some issues.
> >>>
> >>> The ceph cluster version is Jewel(10.2.1), and the filesytem is btrfs.
> >>>
> >>> I run 1 Mon and 48 OSD in 4 Nodes(each node has 12 OSDs).
> >>>
> >>> I've experienced one of OSDs was killed himself.
> >>>
> >>> Always it issued suicide timeout message.
> >> This is probably a fragmentation problem : typical rbd access patterns
> >> cause heavy BTRFS fragmentation.
> > To the extent that operations take over 120 seconds to complete? Really?
> 
> Yes, really. I had these too. By default Ceph/RBD uses BTRFS in a very
> aggressive way, rewriting data all over the place and creating/deleting
> snapshots every filestore sync interval (5 seconds max by default IIRC).
> 
> As I said there are 3 main causes of performance degradation :
> - the snapshots,
> - the journal in a standard copy-on-write file (move it out of the FS or
> use NoCow),
> - the weak auto defragmentation of BTRFS (autodefrag mount option).
> 
> Each one of them is enough to impact or even destroy performance in the
> long run. The 3 combined make BTRFS unusable by default. This is why
> BTRFS is not recommended : if you want to use it you have to be prepared
> for some (heavy) tuning. The first 2 points are easy to address, for the
> last (which begins to be noticeable when you accumulate rewrites on your
> data) I'm not aware of any other tool than the one we developed and
> published on github (link provided in previous mail).
> 
> Another thing : you better have a recent 4.1.x or 4.4.x kernel on your
> OSDs if you use BTRFS. We've used it since 3.19.x but I wouldn't advise
> it now and would recommend 4.4.x if it's possible for you and 4.1.x
> otherwise.

Thanks for the information. I wasn't aware things were that bad with BTRFS as
I haven't had much to do with it up to this point.

Cheers,
Brad

> 
> Best regards,
> 
> Lionel

-- 
Cheers,
Brad


Re: [ceph-users] ceph-fuse segfaults ( jewel 10.2.2)

2016-07-14 Thread Brad Hubbard
  0.2   1:06.16 python
> > 16846 goncalo   20   0 1594m  84m  19m R 99.9  0.2   1:06.05 python
> > 29595 goncalo   20   0 1594m  83m  19m R 100.2  0.2   1:05.57 python
> > 29312 goncalo   20   0 1594m  83m  19m R 99.9  0.2   1:05.01 python
> > 31979 goncalo   20   0 1595m  82m  19m R 100.2  0.2   1:04.82 python
> > 29333 goncalo   20   0 1594m  82m  19m R 99.5  0.2   1:04.94 python
> > 29609 goncalo   20   0 1594m  82m  19m R 99.9  0.2   1:05.07 python
> >
> >
> > 5.> Also, is the version of fuse the same on the nodes running 9.2.0 vs. the
> > nodes running 10.2.2?
> >
> > In 10.2.2 I've compiled with fuse 2.9.7 while in 9.2.0 I've compiled against
> > the default sl6 fuse libs version 2.8.7. However, as I said before, I am
> > seeing the same issue with 9.2.0 (although with a bit less of used virtual
> > memory in total).
> >
> >
> >
> >
> > On 07/08/2016 10:53 PM, John Spray wrote:
> >
> > On Fri, Jul 8, 2016 at 8:01 AM, Goncalo Borges
> > <goncalo.bor...@sydney.edu.au> wrote:
> >
> > Hi Brad, Patrick, All...
> >
> > I think I've understood this second problem. In summary, it is memory
> > related.
> >
> > This is how I found the source of the problem:
> >
> > 1./ I copied and adapted the user application to run in another cluster of
> > ours. The idea was for me to understand the application and run it myself to
> > collect logs and so on...
> >
> > 2./ Once I submit it to this other cluster, every thing went fine. I was
> > hammering cephfs from multiple nodes without problems. This pointed to
> > something different between the two clusters.
> >
> > 3./ I've started to look better to the segmentation fault message, and
> > assuming that the names of the methods and functions do mean something, the
> > log seems related to issues on the management of objects in cache. This
> > pointed to a memory related problem.
> >
> > 4./ On the cluster where the application run successfully, machines have
> > 48GB of RAM and 96GB of SWAP (don't know why we have such a large SWAP size,
> > it is a legacy setup).
> >
> > # top
> > top - 00:34:01 up 23 days, 22:21,  1 user,  load average: 12.06, 12.12,
> > 10.40
> > Tasks: 683 total,  13 running, 670 sleeping,   0 stopped,   0 zombie
> > Cpu(s): 49.7%us,  0.6%sy,  0.0%ni, 49.7%id,  0.1%wa,  0.0%hi,  0.0%si,
> > 0.0%st
> > Mem:  49409308k total, 29692548k used, 19716760k free,   433064k buffers
> > Swap: 98301948k total,0k used, 98301948k free, 26742484k cached
> >
> > 5./ I have noticed that ceph-fuse (in 10.2.2) consumes about 1.5 GB of
> > virtual memory when there is no applications using the filesystem.
> >
> >  7152 root  20   0 1108m  12m 5496 S  0.0  0.0   0:00.04 ceph-fuse
> >
> > When I only have one instance of the user application running, ceph-fuse (in
> > 10.2.2) slowly rises with time up to 10 GB of memory usage.
> >
> > if I submit a large number of user applications simultaneously, ceph-fuse
> > goes very fast to ~10GB.
> >
> >   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
> > 18563 root  20   0 10.0g 328m 5724 S  4.0  0.7   1:38.00 ceph-fuse
> >  4343 root  20   0 3131m 237m  12m S  0.0  0.5  28:24.56 dsm_om_connsvcd
> >  5536 goncalo   20   0 1599m  99m  32m R 99.9  0.2  31:35.46 python
> > 31427 goncalo   20   0 1597m  89m  20m R 99.9  0.2  31:35.88 python
> > 20504 goncalo   20   0 1599m  89m  20m R 100.2  0.2  31:34.29 python
> > 20508 goncalo   20   0 1599m  89m  20m R 99.9  0.2  31:34.20 python
> >  4973 goncalo   20   0 1599m  89m  20m R 99.9  0.2  31:35.70 python
> >  1331 goncalo   20   0 1597m  88m  20m R 99.9  0.2  31:35.72 python
> > 20505 goncalo   20   0 1597m  88m  20m R 99.9  0.2  31:34.46 python
> > 20507 goncalo   20   0 1599m  87m  20m R 99.9  0.2  31:34.37 python
> > 28375 goncalo   20   0 1597m  86m  20m R 99.9  0.2  31:35.52 python
> > 20503 goncalo   20   0 1597m  85m  20m R 100.2  0.2  31:34.09 python
> > 20506 goncalo   20   0 1597m  84m  20m R 99.5  0.2  31:34.42 python
> > 20502 goncalo   20   0 1597m  83m  20m R 99.9  0.2  31:34.32 python
> >
> > 6./ On the machines where the user had the segfault, we have 16 GB of RAM
> > and 1GB of SWAP
> >
> > Mem:  16334244k total,  3590100k used, 12744144k free,   221364k buffers
> > Swap:  1572860k total,10512k used,  1562348k free,  2937276k cached
> >
> > 7./ I think what is happening is that once the user submits his sets of
> > jobs, the memory usage goes to t

Re: [ceph-users] ceph-fuse segfaults ( jewel 10.2.2)

2016-07-04 Thread Brad Hubbard
On Tue, Jul 5, 2016 at 12:13 PM, Shinobu Kinjo  wrote:
> Can you reproduce with debug client = 20?

In addition to this, I would suggest making sure you have debug symbols in
your build and capturing a core file.

You can do that by setting "ulimit -c unlimited" in the environment
where ceph-fuse is running.

Once you have a core file you can do the following.

$ gdb /path/to/ceph-fuse core.
(gdb) thread apply all bt full

This looks like it might be a race and that might help us identify the
threads involved.
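If you prefer to capture that non-interactively, something like this should work (paths are placeholders):

```
$ ulimit -c unlimited      # in the shell that starts ceph-fuse
$ gdb -batch -ex 'thread apply all bt full' \
      /path/to/ceph-fuse /path/to/core > backtraces.txt 2>&1
```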

HTH,
Brad

>
> On Tue, Jul 5, 2016 at 10:16 AM, Goncalo Borges
>  wrote:
>>
>> Dear All...
>>
>> We have recently migrated all our ceph infrastructure from 9.2.0 to
>> 10.2.2.
>>
>> We are currently using ceph-fuse to mount cephfs in a number of clients.
>>
>> ceph-fuse 10.2.2 client is segfaulting in some situations. One of the
>> scenarios where ceph-fuse segfaults is when a user submits a parallel (mpi)
>> application requesting 4 hosts with 4 cores each (16 instances in total) .
>> According to the user, each instance has its own dedicated inputs and
>> outputs.
>>
>> Please note that if we go back to ceph-fuse 9.2.0 client everything works
>> fine.
>>
>> The ceph-fuse 10.2.2 client segfault is the following (we were able to
>> capture it mounting ceph-fuse in debug mode):
>>
>> 2016-07-04 21:21:00.074087 7f6aed92be40  0 ceph version 10.2.2
>> (45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-fuse, pid 7346
>> ceph-fuse[7346]: starting ceph client
>> 2016-07-04 21:21:00.107816 7f6aed92be40 -1 init, newargv = 0x7f6af8c12320
>> newargc=11
>> ceph-fuse[7346]: starting fuse
>> *** Caught signal (Segmentation fault) **
>>  in thread 7f69d7fff700 thread_name:ceph-fuse
>>  ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>>  1: (()+0x297ef2) [0x7f6aedbecef2]
>>  2: (()+0x3b88c0f7e0) [0x7f6aec64b7e0]
>>  3: (Client::get_root_ino()+0x10) [0x7f6aedaf0330]
>>  4: (CephFuse::Handle::make_fake_ino(inodeno_t, snapid_t)+0x175)
>> [0x7f6aedaee035]
>>  5: (()+0x199891) [0x7f6aedaee891]
>>  6: (()+0x15b76) [0x7f6aed50db76]
>>  7: (()+0x12aa9) [0x7f6aed50aaa9]
>>  8: (()+0x3b88c07aa1) [0x7f6aec643aa1]
>>  9: (clone()+0x6d) [0x7f6aeb8d193d]
>> 2016-07-05 10:09:14.045131 7f69d7fff700 -1 *** Caught signal (Segmentation
>> fault) **
>>  in thread 7f69d7fff700 thread_name:ceph-fuse
>>
>>  ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>>  1: (()+0x297ef2) [0x7f6aedbecef2]
>>  2: (()+0x3b88c0f7e0) [0x7f6aec64b7e0]
>>  3: (Client::get_root_ino()+0x10) [0x7f6aedaf0330]
>>  4: (CephFuse::Handle::make_fake_ino(inodeno_t, snapid_t)+0x175)
>> [0x7f6aedaee035]
>>  5: (()+0x199891) [0x7f6aedaee891]
>>  6: (()+0x15b76) [0x7f6aed50db76]
>>  7: (()+0x12aa9) [0x7f6aed50aaa9]
>>  8: (()+0x3b88c07aa1) [0x7f6aec643aa1]
>>  9: (clone()+0x6d) [0x7f6aeb8d193d]
>>  NOTE: a copy of the executable, or `objdump -rdS ` is needed
>> to interpret this.
>>
>>
>> The full dump is quite long. Here are the very last bits of it. Let me
>> know if you need the full dump.
>>
>> --- begin dump of recent events ---
>>  -> 2016-07-05 10:09:13.956502 7f6a5700  3 client.464559
>> _getxattr(137c789, "security.capability", 0) = -61
>>  -9998> 2016-07-05 10:09:13.956507 7f6aa96fa700  3 client.464559 ll_write
>> 0x7f6a08028be0 137c78c 20094~34
>>  -9997> 2016-07-05 10:09:13.956527 7f6aa96fa700  3 client.464559 ll_write
>> 0x7f6a08028be0 20094~34 = 34
>>  -9996> 2016-07-05 10:09:13.956535 7f69d7fff700  3 client.464559 ll_write
>> 0x7f6a100145f0 137c78d 28526~34
>>  -9995> 2016-07-05 10:09:13.956553 7f69d7fff700  3 client.464559 ll_write
>> 0x7f6a100145f0 28526~34 = 34
>>  -9994> 2016-07-05 10:09:13.956561 7f6ac0dfa700  3 client.464559 ll_forget
>> 137c78c 1
>>  -9993> 2016-07-05 10:09:13.956569 7f6a5700  3 client.464559 ll_forget
>> 137c789 1
>>  -9992> 2016-07-05 10:09:13.956577 7f6a5ebfd700  3 client.464559 ll_write
>> 0x7f6a94006350 137c789 22010~216
>>  -9991> 2016-07-05 10:09:13.956594 7f6a5ebfd700  3 client.464559 ll_write
>> 0x7f6a94006350 22010~216 = 216
>>  -9990> 2016-07-05 10:09:13.956603 7f6aa8cf9700  3 client.464559
>> ll_getxattr 137c78c.head security.capability size 0
>>  -9989> 2016-07-05 10:09:13.956609 7f6aa8cf9700  3 client.464559
>> _getxattr(137c78c, "security.capability", 0) = -61
>>
>> 
>>
>>   -160> 2016-07-05 10:09:14.043687 7f69d7fff700  3 client.464559
>> _getxattr(137c78a, "security.capability", 0) = -61
>>   -159> 2016-07-05 10:09:14.043694 7f6ac0dfa700  3 client.464559 ll_write
>> 0x7f6a08042560 137c78b 11900~34
>>   -158> 2016-07-05 10:09:14.043712 7f6ac0dfa700  3 client.464559 ll_write
>> 0x7f6a08042560 11900~34 = 34
>>   -157> 2016-07-05 10:09:14.043722 7f6ac17fb700  3 client.464559
>> ll_getattr 11e9c80.head
>>   -156> 2016-07-05 10:09:14.043727 7f6ac17fb700  3 client.464559
>> ll_getattr 11e9c80.head = 0
>>   -155> 2016-07-05 10:09:14.043734 7f69d7fff700  3 client.464559 

Re: [ceph-users] ceph-fuse segfaults ( jewel 10.2.2)

2016-07-04 Thread Brad Hubbard
On Tue, Jul 5, 2016 at 1:34 PM, Patrick Donnelly  wrote:
> Hi Goncalo,
>
> I believe this segfault may be the one fixed here:
>
> https://github.com/ceph/ceph/pull/10027

Ah, nice one Patrick.

Goncalo, the patch is fairly simple, just the addition of a lock on two lines
to resolve the race. Could you try recompiling with those changes and let us
know how it goes?
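A sketch of one way to fetch and build that change, assuming a git clone whose origin points at the GitHub ceph/ceph repo (the branch name is arbitrary, and substitute your usual build procedure if you don't use cmake):

```
$ git fetch origin pull/10027/head:wip-client-lock
$ git checkout wip-client-lock
$ ./do_cmake.sh && cd build && make ceph-fuse
```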

Cheers,
Brad

>
> (Sorry for brief top-post. Im on mobile.)
>
> On Jul 4, 2016 9:16 PM, "Goncalo Borges" 
> wrote:
>>
>> Dear All...
>>
>> We have recently migrated all our ceph infrastructure from 9.2.0 to
>> 10.2.2.
>>
>> We are currently using ceph-fuse to mount cephfs in a number of clients.
>>
>> ceph-fuse 10.2.2 client is segfaulting in some situations. One of the
>> scenarios where ceph-fuse segfaults is when a user submits a parallel (mpi)
>> application requesting 4 hosts with 4 cores each (16 instances in total) .
>> According to the user, each instance has its own dedicated inputs and
>> outputs.
>>
>> Please note that if we go back to ceph-fuse 9.2.0 client everything works
>> fine.
>>
>> The ceph-fuse 10.2.2 client segfault is the following (we were able to
>> capture it mounting ceph-fuse in debug mode):
>>>
>>> 2016-07-04 21:21:00.074087 7f6aed92be40  0 ceph version 10.2.2
>>> (45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-fuse, pid 7346
>>> ceph-fuse[7346]: starting ceph client
>>> 2016-07-04 21:21:00.107816 7f6aed92be40 -1 init, newargv = 0x7f6af8c12320
>>> newargc=11
>>> ceph-fuse[7346]: starting fuse
>>> *** Caught signal (Segmentation fault) **
>>>  in thread 7f69d7fff700 thread_name:ceph-fuse
>>>  ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>>>  1: (()+0x297ef2) [0x7f6aedbecef2]
>>>  2: (()+0x3b88c0f7e0) [0x7f6aec64b7e0]
>>>  3: (Client::get_root_ino()+0x10) [0x7f6aedaf0330]
>>>  4: (CephFuse::Handle::make_fake_ino(inodeno_t, snapid_t)+0x175)
>>> [0x7f6aedaee035]
>>>  5: (()+0x199891) [0x7f6aedaee891]
>>>  6: (()+0x15b76) [0x7f6aed50db76]
>>>  7: (()+0x12aa9) [0x7f6aed50aaa9]
>>>  8: (()+0x3b88c07aa1) [0x7f6aec643aa1]
>>>  9: (clone()+0x6d) [0x7f6aeb8d193d]
>>> 2016-07-05 10:09:14.045131 7f69d7fff700 -1 *** Caught signal
>>> (Segmentation fault) **
>>>  in thread 7f69d7fff700 thread_name:ceph-fuse
>>>
>>>  ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>>>  1: (()+0x297ef2) [0x7f6aedbecef2]
>>>  2: (()+0x3b88c0f7e0) [0x7f6aec64b7e0]
>>>  3: (Client::get_root_ino()+0x10) [0x7f6aedaf0330]
>>>  4: (CephFuse::Handle::make_fake_ino(inodeno_t, snapid_t)+0x175)
>>> [0x7f6aedaee035]
>>>  5: (()+0x199891) [0x7f6aedaee891]
>>>  6: (()+0x15b76) [0x7f6aed50db76]
>>>  7: (()+0x12aa9) [0x7f6aed50aaa9]
>>>  8: (()+0x3b88c07aa1) [0x7f6aec643aa1]
>>>  9: (clone()+0x6d) [0x7f6aeb8d193d]
>>>  NOTE: a copy of the executable, or `objdump -rdS ` is needed
>>> to interpret this.
>>>
>>>
>> The full dump is quite long. Here are the very last bits of it. Let me
>> know if you need the full dump.
>>>
>>> --- begin dump of recent events ---
>>>  -> 2016-07-05 10:09:13.956502 7f6a5700  3 client.464559
>>> _getxattr(137c789, "security.capability", 0) = -61
>>>  -9998> 2016-07-05 10:09:13.956507 7f6aa96fa700  3 client.464559 ll_write
>>> 0x7f6a08028be0 137c78c 20094~34
>>>  -9997> 2016-07-05 10:09:13.956527 7f6aa96fa700  3 client.464559 ll_write
>>> 0x7f6a08028be0 20094~34 = 34
>>>  -9996> 2016-07-05 10:09:13.956535 7f69d7fff700  3 client.464559 ll_write
>>> 0x7f6a100145f0 137c78d 28526~34
>>>  -9995> 2016-07-05 10:09:13.956553 7f69d7fff700  3 client.464559 ll_write
>>> 0x7f6a100145f0 28526~34 = 34
>>>  -9994> 2016-07-05 10:09:13.956561 7f6ac0dfa700  3 client.464559
>>> ll_forget 137c78c 1
>>>  -9993> 2016-07-05 10:09:13.956569 7f6a5700  3 client.464559
>>> ll_forget 137c789 1
>>>  -9992> 2016-07-05 10:09:13.956577 7f6a5ebfd700  3 client.464559 ll_write
>>> 0x7f6a94006350 137c789 22010~216
>>>  -9991> 2016-07-05 10:09:13.956594 7f6a5ebfd700  3 client.464559 ll_write
>>> 0x7f6a94006350 22010~216 = 216
>>>  -9990> 2016-07-05 10:09:13.956603 7f6aa8cf9700  3 client.464559
>>> ll_getxattr 137c78c.head security.capability size 0
>>>  -9989> 2016-07-05 10:09:13.956609 7f6aa8cf9700  3 client.464559
>>> _getxattr(137c78c, "security.capability", 0) = -61
>>>
>>> 
>>>
>>>   -160> 2016-07-05 10:09:14.043687 7f69d7fff700  3 client.464559
>>> _getxattr(137c78a, "security.capability", 0) = -61
>>>   -159> 2016-07-05 10:09:14.043694 7f6ac0dfa700  3 client.464559 ll_write
>>> 0x7f6a08042560 137c78b 11900~34
>>>   -158> 2016-07-05 10:09:14.043712 7f6ac0dfa700  3 client.464559 ll_write
>>> 0x7f6a08042560 11900~34 = 34
>>>   -157> 2016-07-05 10:09:14.043722 7f6ac17fb700  3 client.464559
>>> ll_getattr 11e9c80.head
>>>   -156> 2016-07-05 10:09:14.043727 7f6ac17fb700  3 client.464559
>>> ll_getattr 11e9c80.head = 0
>>>   -155> 2016-07-05 10:09:14.043734 7f69d7fff700  3 client.464559
>>> ll_forget 137c78a 1
>>>   -154> 

Re: [ceph-users] Is anyone seeing iissues with task_numa_find_cpu?

2016-07-05 Thread Brad Hubbard
On Sun, Jul 3, 2016 at 7:51 AM, Alex Gorbachev  wrote:
>> Thank you Stefan and Campbell for the info - hope 4.7rc5 resolves this
>> for us - please note that my workload is purely RBD, no QEMU/KVM.
>> Also, we do not have CFQ turned on, neither scsi-mq and blk-mq, so I
>> am surmising ceph-osd must be using something from the fair scheduler.
>> I read that its IO has been switched to blk-mq internally, so maybe
>> there is a relationship there.
>
> If the OSD code is compiled against the source from a buggy fair
> scheduler code, then that would be an OSD code issue, correct?

OSD code is not compiled against any kernel code. ceph-osd runs in userspace,
not kernelspace. A userspace process should not be able to crash a kernel; if it
can, that's a kernel bug.

HTH,
Brad
>
>>
>> We had no such problems with kernel 4.2.x, but had other issues with
>> XFS, which do not seem to happen now.
>>
>> Regards,
>> Alex
>>
>>>
>>> Stefan
>>>
>>> Am 29.06.2016 um 11:41 schrieb Campbell Steven:
 Hi Alex/Stefan,

 I'm in the middle of testing 4.7rc5 on our test cluster to confirm
 once and for all this particular issue has been completely resolved by
 Peter's recent patch to sched/fair.c refereed to by Stefan above. For
 us anyway the patches that Stefan applied did not solve the issue and
 neither did any 4.5.x or 4.6.x released kernel thus far, hopefully it
 does the trick for you. We could get about 4 hours uptime before
 things went haywire for us.

 It's interesting how it seems the CEPH workload triggers this bug so
 well as it's quite a long standing issue that's only just been
 resolved, another user chimed in on the lkml thread a couple of days
 ago as well and again his trace had ceph-osd in it as well.

 https://lkml.org/lkml/headers/2016/6/21/491

 Campbell

 On 29 June 2016 at 18:29, Stefan Priebe - Profihost AG
  wrote:
>
> Am 29.06.2016 um 04:30 schrieb Alex Gorbachev:
>> Hi Stefan,
>>
>> On Tue, Jun 28, 2016 at 1:46 PM, Stefan Priebe - Profihost AG
>>  wrote:
>>> Please be aware that you may need even more patches. Overall this needs 
>>> 3
>>> patches. Where the first two try to fix a bug and the 3rd one fixes the
>>> fixes + even more bugs related to the scheduler. I've no idea on which 
>>> patch
>>> level Ubuntu is.
>>
>> Stefan, would you be able to please point to the other two patches
>> beside https://lkml.org/lkml/diff/2016/6/22/102/1 ?
>
> Sorry sure yes:
>
> 1. 2b8c41daba32 ("sched/fair: Initiate a new task's util avg to a
> bounded value")
>
> 2.) 40ed9cba24bb7e01cc380a02d3f04065b8afae1d ("sched/fair: Fix
> post_init_entity_util_avg() serialization")
>
> 3.) the one listed at lkml.
>
> Stefan
>
>>
>> Thank you,
>> Alex
>>
>>>
>>> Stefan
>>>
>>> Excuse my typo sent from my mobile phone.
>>>
>>> Am 28.06.2016 um 17:59 schrieb Tim Bishop :
>>>
>>> Yes - I noticed this today on Ubuntu 16.04 with the default kernel. No
>>> useful information to add other than it's not just you.
>>>
>>> Tim.
>>>
>>> On Tue, Jun 28, 2016 at 11:05:40AM -0400, Alex Gorbachev wrote:
>>>
>>> After upgrading to kernel 4.4.13 on Ubuntu, we are seeing a few of
>>>
>>> these issues where an OSD would fail with the stack below.  I logged a
>>>
>>> bug at https://bugzilla.kernel.org/show_bug.cgi?id=121101 and there is
>>>
>>> a similar description at https://lkml.org/lkml/2016/6/22/102, but the
>>>
>>> odd part is we have turned off CFQ and blk-mq/scsi-mq and are using
>>>
>>> just the noop scheduler.
>>>
>>>
>>> Does the ceph kernel code somehow use the fair scheduler code block?
>>>
>>>
>>> Thanks
>>>
>>> --
>>>
>>> Alex Gorbachev
>>>
>>> Storcium
>>>
>>>
>>> Jun 28 09:46:41 roc04r-sca090 kernel: [137912.684974] CPU: 30 PID:
>>>
>>> 10403 Comm: ceph-osd Not tainted 4.4.13-040413-generic #201606072354
>>>
>>> Jun 28 09:46:41 roc04r-sca090 kernel: [137912.684991] Hardware name:
>>>
>>> Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.2
>>>
>>> 03/04/2015
>>>
>>> Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685009] task:
>>>
>>> 880f79df8000 ti: 880f79fb8000 task.ti: 880f79fb8000
>>>
>>> Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685024] RIP:
>>>
>>> 0010:[]  []
>>>
>>> task_numa_find_cpu+0x22e/0x6f0
>>>
>>> Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685051] RSP:
>>>
>>> 0018:880f79fbb818  EFLAGS: 00010206
>>>
>>> Jun 28 09:46:41 roc04r-sca090 kernel: [137912.685063] RAX:
>>>
>>>  RBX: 880f79fbb8b8 RCX: 

Re: [ceph-users] Should I restart VMs when I upgrade ceph client version

2016-07-05 Thread Brad Hubbard
On Wed, Jul 6, 2016 at 3:28 PM, 한승진  wrote:
> Hi Cephers,
>
> I implemented Ceph with OpenStack.
>
> Recently, I upgrade Ceph server from Hammer to Jewel.
>
> Also, I plan to upgrade ceph clients that are OpenStack Nodes.
>
> There are a lot of VMs running in Compute Nodes.
>
> Should I restart the VMs after upgrade of Compute Nodes?

Once you upgrade, if you look in /proc/[PID]/maps you will see the
Ceph libraries marked as "(deleted)" in the output. This means the
version on disk no longer matches what is in memory, so the version in
the running qemu-kvm instances is still the old one until you restart
the process and it loads the new versions of the libraries from disk. There
is a utility, needs-restarting, which may tell you what needs to be restarted
after an upgrade, but I have not used it much so can't vouch for its accuracy.
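A small sketch of that check — the function below scans /proc/[PID]/maps-format input for Ceph libraries whose on-disk file no longer matches what is mapped (the pid and paths in the usage comment are hypothetical):

```shell
# Print any mapped ceph/rados/rbd libraries marked "(deleted)" in
# /proc/<pid>/maps-format input read from stdin. A non-empty result
# means the process is still running the old library versions.
stale_ceph_libs() {
    grep -E 'lib(ceph|rados|rbd)[^ ]*\.so[^ ]* \(deleted\)' |
        awk '{ print $6 }' | sort -u
}

# Usage against a running qemu-kvm instance (pid is a placeholder):
#   stale_ceph_libs < /proc/12345/maps
```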

HTH,
Brad

>
>



-- 
Cheers,
Brad


Re: [ceph-users] ceph-fuse segfaults ( jewel 10.2.2)

2016-07-06 Thread Brad Hubbard
On Thu, Jul 7, 2016 at 12:31 AM, Patrick Donnelly  wrote:
>
> The locks were missing in 9.2.0. There were probably instances of the
> segfault unreported/unresolved.

Or even unseen :)

Race conditions are funny things, and extremely subtle changes in timing,
introduced by any number of things, can affect whether they happen or not.
I've seen races that only happen on certain CPUs and not others, or that
don't happen unless a particular flag is on/off during compilation.
Difficult to predict.

>
> --
> Patrick Donnelly



-- 
Cheers,
Brad


Re: [ceph-users] Data recovery stuck

2016-07-08 Thread Brad Hubbard
On Sat, Jul 9, 2016 at 1:20 AM, Pisal, Ranjit Dnyaneshwar
 wrote:
> Hi All,
>
>
>
> I am in process of adding new OSDs to Cluster however after adding second
> node Cluster recovery seems to be stopped.
>
>
>
> Its more than 3 days but Objects degraded % has not improved even by 1%.
>
>
>
> Will adding further OSDs help improve situation or is there any other way to
> improve recovery process?
>
>
>
>
>
> [ceph@MYOPTPDN01 ~]$ ceph -s
>
> cluster 9e3e9015-f626-4a44-83f7-0a939ef7ec02
>
>  health HEALTH_WARN 315 pgs backfill; 23 pgs backfill_toofull; 3 pgs

You have 23 pgs that are "backfill_toofull". You need to identify these pgs.

You could try increasing the backfill full ratio for those pgs.

ceph health detail
ceph tell osd. injectargs '--osd-backfill-full-ratio=0.90'

Keep in mind that new storage needs to be added to the cluster as soon as
possible, but I guess that's what you are trying to do.

You could also look at reweighting the full OSDs if you have other OSDs with
considerable space available.
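For example, a sketch of identifying the affected pgs and the full OSDs (output abbreviated):

```
$ ceph health detail | grep backfill_toofull   # lists the affected pgs
$ ceph pg dump_stuck unclean | grep backfill_toofull
$ ceph osd df        # per-OSD utilization, to spot candidates for reweighting
```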

HTH,
Brad

> backfilling; 53 pgs degraded; 2 pgs recovering; 232 pgs recovery_wait; 552
> pgs stuck unclean; recovery 3622384/90976826 objects degraded (3.982%); 1
> near full osd(s)
>
>  monmap e4: 5 mons at
> {MYOPTPDN01=10.115.1.136:6789/0,MYOPTPDN02=10.115.1.137:6789/0,MYOPTPDN03=10.115.1.138:6789/0,MYOPTPDN04=10.115.1.139:6789/0,MYOPTPDN05=10.115.1.140:6789/0},
> election epoch 6654, quorum 0,1,2,3,4
> MYOPTPDN01,MYOPTPDN02,MYOPTPDN03,MYOPTPDN04,MYOPTPDN05
>
>  osdmap e198079: 171 osds: 171 up, 171 in
>
>   pgmap v26428186: 5696 pgs, 4 pools, 105 TB data, 28526 kobjects
>
> 329 TB used, 136 TB / 466 TB avail
>
> 3622384/90976826 objects degraded (3.982%)
>
>   23 active+remapped+wait_backfill+backfill_toofull
>
>  120 active+recovery_wait+remapped
>
> 5144 active+clean
>
>1 active+recovering+remapped
>
>  104 active+recovery_wait
>
>   45 active+degraded+remapped+wait_backfill
>
>1 active+recovering
>
>3 active+remapped+backfilling
>
>  247 active+remapped+wait_backfill
>
>8 active+recovery_wait+degraded+remapped
>
>   client io 62143 kB/s rd, 100 MB/s wr, 14427 op/s
>
> [ceph@MYOPTPDN01 ~]$
>
>
>
> Best Regards,
>
> Ranjit
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph master build fails on src/gmock, workaround?

2016-07-10 Thread Brad Hubbard
On Sat, Jul 09, 2016 at 10:43:52AM +, Kevan Rehm wrote:
> Greetings,
> 
> I cloned the master branch of ceph at https://github.com/ceph/ceph.git
> onto a Centos 7 machine, then did
> 
> ./autogen.sh
> ./configure --enable-xio
> make

BTW, you should be defaulting to cmake if you don't have a specific need to
use the autotools build.

-- 
Cheers,
Brad


Re: [ceph-users] ceph master build fails on src/gmock, workaround?

2016-07-10 Thread Brad Hubbard
On Sat, Jul 09, 2016 at 10:43:52AM +, Kevan Rehm wrote:
> Greetings,
> 
> I cloned the master branch of ceph at https://github.com/ceph/ceph.git
> onto a Centos 7 machine, then did
> 
> ./autogen.sh
> ./configure --enable-xio
> make
> 
> but the build fails when it references the src/gmock subdirectory, see
> below.   Typing "make" a second time stops in the same place.  To get past
> this problem I had to cd into src/gmock, type "make", wait for it to
> finish compiling, then cd back to the top and restart the make again.
> 
> Anyone else seeing this?  Seems like a Makefile ordering problem, the
> src/gmock directory needs to be compiled before it is referenced.  I'm not
> a Makefile expert, can someone suggest a patch to Makefile.in to get
> builds to work cleanly again?
> 
> Thanks, Kevan
> 
> ...
> CXX  ceph_osd.o
>   CXX  ceph_mds.o
>   CXX  test/erasure-code/ceph_erasure_code_non_regression.o
>   CXX  test/messenger/simple_server-simple_server.o
>   CXX  test/messenger/simple_server-simple_dispatcher.o
>   CXX  test/messenger/simple_client-simple_client.o
>   CXX  test/messenger/simple_client-simple_dispatcher.o
>   CXX  test/messenger/xio_server-xio_server.o
>   CXX  test/messenger/xio_server-xio_dispatcher.o
>   CXX  test/messenger/xio_client-xio_client.o
>   CXX  test/messenger/xio_client-xio_dispatcher.o
>   CXX  test/librgw_file_cd-librgw_file_cd.o
> make[3]: *** No rule to make target `../src/gmock/lib/libgmock_main.la',
> needed by `librgw_file_cd'.  Stop.
> make[3]: *** Waiting for unfinished jobs
> make[3]: Leaving directory `/root/krehm/ceph/ceph-11.0.0/ceph-11.0.0/src'
> make[2]: *** [all-recursive] Error 1
> make[2]: Leaving directory `/root/krehm/ceph/ceph-11.0.0/ceph-11.0.0/src'
> make[1]: *** [all] Error 2
> make[1]: Leaving directory `/root/krehm/ceph/ceph-11.0.0/ceph-11.0.0/src'
> make: *** [all-recursive] Error 1

This has happened before and, as before, running configure with "--with-debug"
allows the build to complete successfully.

Looking at Greg's old email on the subject we can make the following change.

# git diff
diff --git a/src/test/Makefile-client.am b/src/test/Makefile-client.am
index f9534e5..2ba393c 100644
--- a/src/test/Makefile-client.am
+++ b/src/test/Makefile-client.am
@@ -772,7 +772,7 @@ librgw_file_cd_SOURCES = test/librgw_file_cd.cc
 librgw_file_cd_CXXFLAGS = -I$(srcdir)/xxHash $(UNITTEST_CXXFLAGS)
 librgw_file_cd_LDADD = $(UNITTEST_LDADD) \
$(LIBRGW) $(LIBRGW_DEPS) librados.la $(PTHREAD_LIBS) $(CEPH_GLOBAL) 
$(EXTRALIBS)
-noinst_PROGRAMS += librgw_file_cd
+check_PROGRAMS += librgw_file_cd
 
 librgw_file_gp_SOURCES = test/librgw_file_gp.cc
 librgw_file_gp_CXXFLAGS = -I$(srcdir)/xxHash $(UNITTEST_CXXFLAGS)

If we do that the error moves to...

make[3]: *** No rule to make target `../src/gmock/lib/libgmock_main.la', needed 
by `librgw_file_gp'.  Stop.

In the end I had to make the following four changes.

# git diff|gawk '/noinst_PROGRAMS/||/check_PROGRAMS/'
-noinst_PROGRAMS += librgw_file_cd
+check_PROGRAMS += librgw_file_cd
-noinst_PROGRAMS += librgw_file_gp
+check_PROGRAMS += librgw_file_gp
-noinst_PROGRAMS += librgw_file_aw
+check_PROGRAMS += librgw_file_aw
-noinst_PROGRAMS += librgw_file_nfsns
+check_PROGRAMS += librgw_file_nfsns

I'm not convinced this is the way to go but I'll open a tracker and submit a
PR to at least get the conversation started.
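
For anyone wondering why that change helps: automake builds noinst_PROGRAMS as
part of "make all", while check_PROGRAMS are deferred until "make check" runs,
by which point the gmock convenience library has already been built.
Schematically (a simplified illustration, not the real Makefile-client.am):

```
# bin_PROGRAMS    -> built by "make all", installed by "make install"
# noinst_PROGRAMS -> built by "make all", never installed
# check_PROGRAMS  -> built only when "make check" runs
check_PROGRAMS += librgw_file_cd
```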

Thanks for the report!

--
Cheers,
Brad


Re: [ceph-users] Fwd: Ceph OSD suicide himself

2016-07-10 Thread Brad Hubbard
On Mon, Jul 11, 2016 at 11:48:57AM +0900, 한승진 wrote:
> Hi cephers.
> 
> I need your help for some issues.
> 
> The ceph cluster version is Jewel(10.2.1), and the filesystem is btrfs.
> 
> I run 1 Mon and 48 OSD in 4 Nodes(each node has 12 OSDs).
> 
> I've experienced one of OSDs was killed himself.
> 
> Always it issued suicide timeout message.
> 
> Below is detailed logs.
> 
> 
> ==
> 0. ceph df detail
> $ sudo ceph df detail
> GLOBAL:
> SIZE   AVAIL  RAW USED %RAW USED OBJECTS
> 42989G 24734G   18138G 42.19  23443k
> POOLS:
> NAMEID CATEGORY QUOTA OBJECTS QUOTA BYTES USED
>   %USED MAX AVAIL OBJECTS  DIRTY  READ   WRITE
>  RAW USED
> ha-pool 40 -N/A   N/A
>  1405G  9.81 5270G 22986458 22447k  0
> 22447k4217G
> volumes 45 -N/A   N/A
>  4093G 28.57 5270G   933401   911k   648M
> 649M   12280G
> images  46 -N/A   N/A
> 53745M  0.37 5270G 6746   6746  1278k
>  21046 157G
> backups 47 -N/A   N/A
>  0 0 5270G0  0  0  0
>  0
> vms 48 -N/A   N/A
> 309G  2.16 5270G79426  79426 92612k 46506k
> 928G
> 
> 1. ceph no.15 log
> 
> *(20:02 first timed out message)*
> 2016-07-08 20:02:01.049483 7fcd3caa5700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7fcd2c284700' had timed out after 15
> 2016-07-08 20:02:01.050403 7fcd3b2a2700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7fcd2c284700' had timed out after 15
> 2016-07-08 20:02:01.086792 7fcd3b2a2700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7fcd2c284700' had timed out after 15
> .
> .
> (sometimes this logs with..)
> 2016-07-08 20:02:11.379597 7fcd4d8f8700  0 log_channel(cluster) log [WRN] :
> 12 slow requests, 5 included below; oldest blocked for > 30.269577 secs
> 2016-07-08 20:02:11.379608 7fcd4d8f8700  0 log_channel(cluster) log [WRN] :
> slow request 30.269577 seconds old, received at 2016-07-08 20:01:41.109937:
> osd_op(client.895668.0:5302745 45.e2e779c2
> rbd_data.cc460bc7fc8f.04d8 [stat,write 2596864~516096] snapc
> 0=[] ack+ondisk+write+known_if_redirected e30969) currently commit_sent
> 2016-07-08 20:02:11.379612 7fcd4d8f8700  0 log_channel(cluster) log [WRN] :
> slow request 30.269108 seconds old, received at 2016-07-08 20:01:41.110406:
> osd_op(client.895668.0:5302746 45.e2e779c2
> rbd_data.cc460bc7fc8f.04d8 [stat,write 3112960~516096] snapc
> 0=[] ack+ondisk+write+known_if_redirected e30969) currently waiting for rw
> locks
> 2016-07-08 20:02:11.379630 7fcd4d8f8700  0 log_channel(cluster) log [WRN] :
> slow request 30.268607 seconds old, received at 2016-07-08 20:01:41.110907:
> osd_op(client.895668.0:5302747 45.e2e779c2
> rbd_data.cc460bc7fc8f.04d8 [stat,write 3629056~516096] snapc
> 0=[] ack+ondisk+write+known_if_redirected e30969) currently waiting for rw
> locks
> 2016-07-08 20:02:11.379633 7fcd4d8f8700  0 log_channel(cluster) log [WRN] :
> slow request 30.268143 seconds old, received at 2016-07-08 20:01:41.111371:
> osd_op(client.895668.0:5302748 45.e2e779c2
> rbd_data.cc460bc7fc8f.04d8 [stat,write 4145152~516096] snapc
> 0=[] ack+ondisk+write+known_if_redirected e30969) currently waiting for rw
> locks
> 2016-07-08 20:02:11.379636 7fcd4d8f8700  0 log_channel(cluster) log [WRN] :
> slow request 30.267662 seconds old, received at 2016-07-08 20:01:41.111852:
> osd_op(client.895668.0:5302749 45.e2e779c2
> rbd_data.cc460bc7fc8f.04d8 [stat,write 4661248~516096] snapc
> 0=[] ack+ondisk+write+known_if_redirected e30969) currently waiting for rw
> locks
> .
> .
> (after a lot of same messages)
> 2016-07-08 20:03:53.682828 7fcd3caa5700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7fcd2d286700' had timed out after 15
> 2016-07-08 20:03:53.682828 7fcd3caa5700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7fcd2da87700' had timed out after 15
> 2016-07-08 20:03:53.682829 7fcd3caa5700  1 heartbeat_map is_healthy
> 'FileStore::op_tp thread 0x7fcd48716700' had timed out after 60
> 2016-07-08 20:03:53.682830 7fcd3caa5700  1 heartbeat_map is_healthy
> 'FileStore::op_tp thread 0x7fcd47f15700' had timed out after 60
> .
> .
> (fault with nothing to send, going to standby massages)
> 2016-07-08 20:03:53.708665 7fcd15787700  0 -- 10.200.10.145:6818/6462 >>
> 10.200.10.146:6806/4642 pipe(0x55818727e000 sd=276 :51916 s=2 pgs=2225 cs=1
> l=0 c=0x558186f61d80).fault with nothing to send, going to standby
> 2016-07-08 20:03:53.724928 7fcd072c2700  0 -- 10.200.10.145:6818/6462 >>
> 10.200.10.146:6800/4336 pipe(0x55818a25b400 sd=109 

Re: [ceph-users] Fwd: Ceph OSD suicide himself

2016-07-10 Thread Brad Hubbard
On Mon, Jul 11, 2016 at 1:21 PM, Brad Hubbard <bhubb...@redhat.com> wrote:
> On Mon, Jul 11, 2016 at 11:48:57AM +0900, 한승진 wrote:
>> Hi cephers.
>>
>> I need your help for some issues.
>>
>> The ceph cluster version is Jewel(10.2.1), and the filesystem is btrfs.
>>
>> I run 1 Mon and 48 OSD in 4 Nodes(each node has 12 OSDs).
>>
>> I've experienced one of OSDs was killed himself.
>>
>> Always it issued suicide timeout message.
>>
>> Below is detailed logs.
>>
>>
>> ==
>> 0. ceph df detail
>> $ sudo ceph df detail
>> GLOBAL:
>> SIZE   AVAIL  RAW USED %RAW USED OBJECTS
>> 42989G 24734G   18138G 42.19  23443k
>> POOLS:
>> NAMEID CATEGORY QUOTA OBJECTS QUOTA BYTES USED
>>   %USED MAX AVAIL OBJECTS  DIRTY  READ   WRITE
>>  RAW USED
>> ha-pool 40 -N/A   N/A
>>  1405G  9.81 5270G 22986458 22447k  0
>> 22447k4217G
>> volumes 45 -N/A   N/A
>>  4093G 28.57 5270G   933401   911k   648M
>> 649M   12280G
>> images  46 -N/A   N/A
>> 53745M  0.37 5270G 6746   6746  1278k
>>  21046 157G
>> backups 47 -N/A   N/A
>>  0 0 5270G0  0  0  0
>>  0
>> vms 48 -N/A   N/A
>> 309G  2.16 5270G79426  79426 92612k 46506k
>> 928G
>>
>> 1. ceph no.15 log
>>
>> *(20:02 first timed out message)*
>> 2016-07-08 20:02:01.049483 7fcd3caa5700  1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7fcd2c284700' had timed out after 15
>> 2016-07-08 20:02:01.050403 7fcd3b2a2700  1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7fcd2c284700' had timed out after 15
>> 2016-07-08 20:02:01.086792 7fcd3b2a2700  1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7fcd2c284700' had timed out after 15
>> .
>> .
>> (sometimes this logs with..)
>> 2016-07-08 20:02:11.379597 7fcd4d8f8700  0 log_channel(cluster) log [WRN] :
>> 12 slow requests, 5 included below; oldest blocked for > 30.269577 secs
>> 2016-07-08 20:02:11.379608 7fcd4d8f8700  0 log_channel(cluster) log [WRN] :
>> slow request 30.269577 seconds old, received at 2016-07-08 20:01:41.109937:
>> osd_op(client.895668.0:5302745 45.e2e779c2
>> rbd_data.cc460bc7fc8f.04d8 [stat,write 2596864~516096] snapc
>> 0=[] ack+ondisk+write+known_if_redirected e30969) currently commit_sent
>> 2016-07-08 20:02:11.379612 7fcd4d8f8700  0 log_channel(cluster) log [WRN] :
>> slow request 30.269108 seconds old, received at 2016-07-08 20:01:41.110406:
>> osd_op(client.895668.0:5302746 45.e2e779c2
>> rbd_data.cc460bc7fc8f.04d8 [stat,write 3112960~516096] snapc
>> 0=[] ack+ondisk+write+known_if_redirected e30969) currently waiting for rw
>> locks
>> 2016-07-08 20:02:11.379630 7fcd4d8f8700  0 log_channel(cluster) log [WRN] :
>> slow request 30.268607 seconds old, received at 2016-07-08 20:01:41.110907:
>> osd_op(client.895668.0:5302747 45.e2e779c2
>> rbd_data.cc460bc7fc8f.04d8 [stat,write 3629056~516096] snapc
>> 0=[] ack+ondisk+write+known_if_redirected e30969) currently waiting for rw
>> locks
>> 2016-07-08 20:02:11.379633 7fcd4d8f8700  0 log_channel(cluster) log [WRN] :
>> slow request 30.268143 seconds old, received at 2016-07-08 20:01:41.111371:
>> osd_op(client.895668.0:5302748 45.e2e779c2
>> rbd_data.cc460bc7fc8f.04d8 [stat,write 4145152~516096] snapc
>> 0=[] ack+ondisk+write+known_if_redirected e30969) currently waiting for rw
>> locks
>> 2016-07-08 20:02:11.379636 7fcd4d8f8700  0 log_channel(cluster) log [WRN] :
>> slow request 30.267662 seconds old, received at 2016-07-08 20:01:41.111852:
>> osd_op(client.895668.0:5302749 45.e2e779c2
>> rbd_data.cc460bc7fc8f.04d8 [stat,write 4661248~516096] snapc
>> 0=[] ack+ondisk+write+known_if_redirected e30969) currently waiting for rw
>> locks
>> .
>> .
>> (after a lot of same messages)
>> 2016-07-08 20:03:53.682828 7fcd3caa5700  1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7fcd2d286700' had timed out after 15
>> 2016-07-08 20:03:53.682828 7fcd3caa5700  1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7fcd2da87

Re: [ceph-users] ceph admin socket protocol

2016-07-10 Thread Brad Hubbard
On Sun, Jul 10, 2016 at 09:32:33PM +0200, Stefan Priebe - Profihost AG wrote:
> 
> Am 10.07.2016 um 16:33 schrieb Daniel Swarbrick:
> > If you can read C code, there is a collectd plugin that talks directly
> > to the admin socket:
> > 
> > https://github.com/collectd/collectd/blob/master/src/ceph.c
> 
> thanks can read that.

If you're interested in using the AdminSocketClient here's some example code.

#include "common/admin_socket_client.h"

#include <iostream>

int main(int argc, char** argv)
{
    std::string response;
    AdminSocketClient client(argv[1]);
    //client.do_request("{\"prefix\":\"help\"}", &response);
    //client.do_request("{\"prefix\":\"help\", \"format\": \"json\"}", &response);
    client.do_request("{\"prefix\":\"perf dump\"}", &response);
    //client.do_request("{\"prefix\":\"perf dump\", \"format\": \"json\"}", &response);
    std::cout << response << '\n';

    return 0;
}

// $ g++ -O2 -std=c++11 ceph-admin-socket-test.cpp -I../ceph/src/ -I../ceph/build/include/ ../ceph/build/lib/libcommon.a

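
For Stefan's original question about the protocol itself: as far as I can tell
from the sources (and the collectd plugin mentioned above), it is just a UNIX
stream socket where the client writes the command string terminated by a NUL
byte, and the server replies with a 4-byte big-endian payload length followed
by the payload. A Python sketch of that reading of the protocol (the framing
details are my interpretation, not a documented API):

```python
import json
import socket
import struct

def _recv_exact(sock, n):
    """Read exactly n bytes from the socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise EOFError("socket closed mid-response")
        buf += chunk
    return buf

def admin_socket_request(path, command):
    """Send one JSON command to an admin socket and return the raw payload.

    Framing (my reading of Ceph's AdminSocket and collectd's ceph.c):
    request  = JSON command string + NUL byte
    response = 4-byte big-endian payload length, then the payload.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        sock.sendall(json.dumps(command).encode("utf-8") + b"\0")
        (length,) = struct.unpack(">I", _recv_exact(sock, 4))
        return _recv_exact(sock, length)

# e.g. admin_socket_request("/var/run/ceph/ceph-osd.0.asok", {"prefix": "perf dump"})
```
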

-- 
Cheers,
Brad

> 
> Stefan
> 
> > 
> > On 10/07/16 10:36, Stefan Priebe - Profihost AG wrote:
> >> Hi,
> >>
> >> is the ceph admin socket protocol described anywhere? I want to talk
> >> directly to the socket instead of calling the ceph binary. I searched
> >> the doc but didn't find anything useful.
> >>
> >> Thanks,
> >> Stefan
> >>
> > 
> > 


Re: [ceph-users] Fwd: Ceph OSD suicide himself

2016-07-11 Thread Brad Hubbard
On Mon, Jul 11, 2016 at 7:18 PM, Lionel Bouton
 wrote:
> Le 11/07/2016 04:48, 한승진 a écrit :
>> Hi cephers.
>>
>> I need your help for some issues.
>>
>> The ceph cluster version is Jewel(10.2.1), and the filesystem is btrfs.
>>
>> I run 1 Mon and 48 OSD in 4 Nodes(each node has 12 OSDs).
>>
>> I've experienced one of OSDs was killed himself.
>>
>> Always it issued suicide timeout message.
>
> This is probably a fragmentation problem : typical rbd access patterns
> cause heavy BTRFS fragmentation.

To the extent that operations take over 120 seconds to complete? Really?

I have no experience with BTRFS but had heard that performance can "fall
off a cliff" but I didn't know it was that bad.

-- 
Cheers,
Brad

>
> If you already use the autodefrag mount option, you can try this which
> performs much better for us :
> https://github.com/jtek/ceph-utils/blob/master/btrfs-defrag-scheduler.rb
>
> Note that it can take some time to fully defragment the filesystems but
> it shouldn't put more stress than autodefrag while doing so.
>
> If you don't already use it, set :
> filestore btrfs snap = false
> in ceph.conf an restart your OSDs.
>
> Finally if you use journals on the filesystem and not on dedicated
> partitions, you'll have to recreate them with the NoCow attribute
> (there's no way to defragment journals in any way that doesn't kill
> performance otherwise).
>
> Best regards,
>
> Lionel


Re: [ceph-users] ceph-fuse segfaults ( jewel 10.2.2)

2016-07-07 Thread Brad Hubbard
Hi Goncalo,

If possible it would be great if you could capture a core file for this with
full debugging symbols (preferably glibc debuginfo as well). How you do
that will depend on the ceph version and your OS, but we can offer help
if required I'm sure.

Once you have the core do the following.

$ gdb /path/to/ceph-fuse core.
(gdb) set pag off
(gdb) set log on
(gdb) thread apply all bt
(gdb) thread apply all bt full

Then quit gdb and you should find a file called gdb.txt in your
working directory.
If you could attach that file to http://tracker.ceph.com/issues/16610, that would be great.

Cheers,
Brad

On Fri, Jul 8, 2016 at 12:06 AM, Patrick Donnelly  wrote:
> On Thu, Jul 7, 2016 at 2:01 AM, Goncalo Borges
>  wrote:
>> Unfortunately, the other user application breaks ceph-fuse again (It is a
>> completely different application then in my previous test).
>>
>> We have tested it in 4 machines with 4 cores. The user is submitting 16
>> single core jobs which are all writing different output files (one per job)
>> to a common dir in cephfs. The first 4 jobs run happily and never break
>> ceph-fuse. But the remaining 12 jobs, running in the remaining 3 machines,
>> trigger a segmentation fault, which is completely different from the other
>> case.
>>
>> ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>> 1: (()+0x297fe2) [0x7f54402b7fe2]
>> 2: (()+0xf7e0) [0x7f543ecf77e0]
>> 3: (ObjectCacher::bh_write_scattered(std::list> std::allocator >&)+0x36) [0x7f5440268086]
>> 4: (ObjectCacher::bh_write_adjacencies(ObjectCacher::BufferHead*,
>> std::chrono::time_point> std::chrono::duration > >, long*,
>> int*)+0x22c) [0x7f5440268a3c]
>> 5: (ObjectCacher::flush(long)+0x1ef) [0x7f5440268cef]
>> 6: (ObjectCacher::flusher_entry()+0xac4) [0x7f5440269a34]
>> 7: (ObjectCacher::FlusherThread::entry()+0xd) [0x7f5440275c6d]
>> 8: (()+0x7aa1) [0x7f543ecefaa1]
>>  9: (clone()+0x6d) [0x7f543df6893d]
>> NOTE: a copy of the executable, or `objdump -rdS ` is needed to
>> interpret this.
>
> This one looks like a very different problem. I've created an issue
> here: http://tracker.ceph.com/issues/16610
>
> Thanks for the report and debug log!
>
> --
> Patrick Donnelly



-- 
Cheers,
Brad


Re: [ceph-users] Hammer: PGs stuck creating

2016-06-30 Thread Brad Hubbard
On Thu, Jun 30, 2016 at 11:34 PM, Brian Felton <bjfel...@gmail.com> wrote:
> Sure.  Here's a complete query dump of one of the 30 pgs:
> http://pastebin.com/NFSYTbUP

Looking at that something immediately stands out.

There are a lot of entries in "past intervals" like so.

"past_intervals": [
 {
 "first": 18522,
 "last": 18523,
 "maybe_went_rw": 1,
 "up": [
 2147483647,
...
"acting": [
2147483647,
2147483647,
2147483647,
2147483647
],
"primary": -1,
"up_primary": -1

That value is defined in src/crush/crush.h like so:

#define CRUSH_ITEM_NONE   0x7fffffff  /* no result */

So it looks like this could be due to a bad crush rule (or at least a
previously unsatisfiable rule).

Could you share the output from the following?

$ ceph osd crush rule ls

For each rule listed by the above command.

$ ceph osd crush rule dump [rule_name]

I'd then dump out the crushmap and test it showing any bad mappings with the
commands listed here;

http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon

That should hopefully give some insight.
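
That check can also be scripted: 2147483647 is just 0x7fffffff, so scanning a
`ceph pg <pgid> query` dump for it finds the intervals where CRUSH returned no
result. A rough sketch (field names taken from the Hammer-era query output
discussed above):

```python
import json

CRUSH_ITEM_NONE = 0x7fffffff  # src/crush/crush.h: "no result"

def unmapped_intervals(pg_query):
    """Return past intervals whose acting set contains CRUSH_ITEM_NONE,
    i.e. replica slots CRUSH could not fill."""
    return [iv for iv in pg_query.get("past_intervals", [])
            if CRUSH_ITEM_NONE in iv.get("acting", [])]

sample = json.loads("""
{"past_intervals": [
  {"first": 18522, "last": 18523, "maybe_went_rw": 1,
   "acting": [2147483647, 2147483647, 2147483647, 2147483647], "primary": -1},
  {"first": 18524, "last": 18600, "maybe_went_rw": 1,
   "acting": [12, 40, 7, 101], "primary": 12}
]}
""")

print(len(unmapped_intervals(sample)))  # -> 1
```
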

HTH,
Brad

>
> Brian
>
> On Wed, Jun 29, 2016 at 6:25 PM, Brad Hubbard <bhubb...@redhat.com> wrote:
>>
>> On Thu, Jun 30, 2016 at 3:22 AM, Brian Felton <bjfel...@gmail.com> wrote:
>> > Greetings,
>> >
>> > I have a lab cluster running Hammer 0.94.6 and being used exclusively
>> > for
>> > object storage.  The cluster consists of four servers running 60 6TB
>> > OSDs
>> > each.  The main .rgw.buckets pool is using k=3 m=1 erasure coding and
>> > contains 8192 placement groups.
>> >
>> > Last week, one of our guys out-ed and removed one OSD from each of three
>> > of
>> > the four servers in the cluster, which resulted in some general badness
>> > (the
>> > disks were wiped post-removal, so the data are gone).  After a proper
>> > education in why this is a Bad Thing, we got the OSDs added back.  When
>> > all
>> > was said and done, we had 30 pgs that were stuck incomplete, and no
>> > amount
>> > of magic has been able to get them to recover.  From reviewing the data,
>> > we
>> > knew that all of these pgs contained at least 2 of the removed OSDs; I
>> > understand and accept that the data are gone, and that's not a concern
>> > (yay
>> > lab).
>> >
>> > Here are the things I've tried:
>> >
>> > - Restarted all OSDs
>> > - Stopped all OSDs, removed all OSDs from the crush map, and started
>> > everything back up
>> > - Executed a 'ceph pg force_create_pg ' for each of the 30 stuck pgs
>> > - Executed a 'ceph pg send_pg_creates' to get the ball rolling on
>> > creates
>> > - Executed several 'ceph pg  query' commands to ensure we were
>> > referencing valid OSDs after the 'force_create_pg'
>> > - Ensured those OSDs were really removed (e.g. 'ceph auth del', 'ceph
>> > osd
>> > crush remove', and 'ceph osd rm')
>>
>> Can you share some of the pg query output?
>>
>> >
>> > At this point, I've got the same 30 pgs that are stuck creating.  I've
>> > run
>> > out of ideas for getting this back to a healthy state.  In reviewing the
>> > other posts on the mailing list, the overwhelming solution was a bad OSD
>> > in
>> > the crush map, but I'm all but certain that isn't what's hitting us
>> > here.
>> > Normally, being the lab, I'd consider nuking the .rgw.buckets pool and
>> > starting from scratch, but we've recently spent a lot of time pulling
>> > 140TB
>> > of data into this cluster for some performance and recovery tests, and
>> > I'd
>> > prefer not to have to start that process again.  I am willing to
>> > entertain
>> > most any other idea irrespective to how destructive it is to these PGs,
>> > so
>> > long as I don't have to lose the rest of the data in the pool.
>> >
>> > Many thanks in advance for any assistance here.
>> >
>> > Brian Felton
>> >
>> >
>> >
>> >
>> >
>>
>>
>>
>> --
>> Cheers,
>> Brad
>
>



-- 
Cheers,
Brad


Re: [ceph-users] Hammer: PGs stuck creating

2016-06-29 Thread Brad Hubbard
On Thu, Jun 30, 2016 at 3:22 AM, Brian Felton  wrote:
> Greetings,
>
> I have a lab cluster running Hammer 0.94.6 and being used exclusively for
> object storage.  The cluster consists of four servers running 60 6TB OSDs
> each.  The main .rgw.buckets pool is using k=3 m=1 erasure coding and
> contains 8192 placement groups.
>
> Last week, one of our guys out-ed and removed one OSD from each of three of
> the four servers in the cluster, which resulted in some general badness (the
> disks were wiped post-removal, so the data are gone).  After a proper
> education in why this is a Bad Thing, we got the OSDs added back.  When all
> was said and done, we had 30 pgs that were stuck incomplete, and no amount
> of magic has been able to get them to recover.  From reviewing the data, we
> knew that all of these pgs contained at least 2 of the removed OSDs; I
> understand and accept that the data are gone, and that's not a concern (yay
> lab).
>
> Here are the things I've tried:
>
> - Restarted all OSDs
> - Stopped all OSDs, removed all OSDs from the crush map, and started
> everything back up
> - Executed a 'ceph pg force_create_pg ' for each of the 30 stuck pgs
> - Executed a 'ceph pg send_pg_creates' to get the ball rolling on creates
> - Executed several 'ceph pg  query' commands to ensure we were
> referencing valid OSDs after the 'force_create_pg'
> - Ensured those OSDs were really removed (e.g. 'ceph auth del', 'ceph osd
> crush remove', and 'ceph osd rm')

Can you share some of the pg query output?

>
> At this point, I've got the same 30 pgs that are stuck creating.  I've run
> out of ideas for getting this back to a healthy state.  In reviewing the
> other posts on the mailing list, the overwhelming solution was a bad OSD in
> the crush map, but I'm all but certain that isn't what's hitting us here.
> Normally, being the lab, I'd consider nuking the .rgw.buckets pool and
> starting from scratch, but we've recently spent a lot of time pulling 140TB
> of data into this cluster for some performance and recovery tests, and I'd
> prefer not to have to start that process again.  I am willing to entertain
> most any other idea irrespective to how destructive it is to these PGs, so
> long as I don't have to lose the rest of the data in the pool.
>
> Many thanks in advance for any assistance here.
>
> Brian Felton
>
>
>
>
>



-- 
Cheers,
Brad


Re: [ceph-users] RADOSGW buckets via NFS?

2016-07-03 Thread Brad Hubbard
On Sun, Jul 3, 2016 at 9:07 PM, Sean Redmond  wrote:
> Hi,
>
> I noticed in the jewel release notes:
>
> "You can now access radosgw buckets via NFS (experimental)."
>
> Are there any docs that explain the configuration of NFS to access RADOSGW
> buckets?

Here's what I found.

http://tracker.ceph.com/projects/ceph/wiki/RGW_-_NFS
https://github.com/nfs-ganesha/nfs-ganesha/tree/next/src/FSAL/FSAL_RGW
https://www.youtube.com/watch?v=zWURdwudAUI

It looks like the information in the video and docs about s3fs-fuse is
no longer relevant.

The file src/test/librgw_file_nfsns.cc in the source tree gives a
little insight.

It looks to me as though the NFS-Ganesha FSAL mounts a bucket using
LibRGW and leverages this library to perform needed operations.

Configuration would involve setting up the FSAL to *point* to the
relevant bucket correctly.
It's very likely the generic Ganesha docs can help here, but I have no
experience with that, I'm afraid.

I can't find much more info at this time, which doesn't mean it doesn't exist :)
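
That said, the nfs-ganesha sample configs suggest an export block roughly along
these lines. Every value below is a placeholder, and the option names are my
reading of the FSAL_RGW samples rather than a tested setup:

```
EXPORT {
    Export_ID = 1;
    Path = "/";            # a bucket name, or "/" for all buckets
    Pseudo = "/rgw";
    Access_Type = RW;
    FSAL {
        Name = RGW;
        User_Id = "s3user";
        Access_Key_Id = "<access key>";
        Secret_Access_Key = "<secret key>";
    }
}

RGW {
    ceph_conf = "/etc/ceph/ceph.conf";
}
```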

HTH,
Brad

>
> Thanks
>
>



-- 
Cheers,
Brad


Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-02-20 Thread Brad Hubbard
Refer to my previous post for data you can gather that will help
narrow this down.

On Mon, Feb 20, 2017 at 6:36 PM, Jay Linux <jaylinuxg...@gmail.com> wrote:
> Hello John,
>
> Created tracker for this issue Refer-- >
> http://tracker.ceph.com/issues/18994
>
> Thanks
>
> On Fri, Feb 17, 2017 at 6:15 PM, John Spray <jsp...@redhat.com> wrote:
>>
>> On Fri, Feb 17, 2017 at 6:27 AM, Muthusamy Muthiah
>> <muthiah.muthus...@gmail.com> wrote:
>> > On one our platform mgr uses 3 CPU cores . Is there a ticket available
>> > for
>> > this issue ?
>>
>> Not that I'm aware of, you could go ahead and open one.
>>
>> Cheers,
>> John
>>
>> > Thanks,
>> > Muthu
>> >
>> > On 14 February 2017 at 03:13, Brad Hubbard <bhubb...@redhat.com> wrote:
>> >>
>> >> Could one of the reporters open a tracker for this issue and attach
>> >> the requested debugging data?
>> >>
>> >> On Mon, Feb 13, 2017 at 11:18 PM, Donny Davis <do...@fortnebula.com>
>> >> wrote:
>> >> > I am having the same issue. When I looked at my idle cluster this
>> >> > morning,
>> >> > one of the nodes had 400% cpu utilization, and ceph-mgr was 300% of
>> >> > that.  I
>> >> > have 3 AIO nodes, and only one of them seemed to be affected.
>> >> >
>> >> > On Sat, Jan 14, 2017 at 12:18 AM, Brad Hubbard <bhubb...@redhat.com>
>> >> > wrote:
>> >> >>
>> >> >> Want to install debuginfo packages and use something like this to
>> >> >> try
>> >> >> and find out where it is spending most of its time?
>> >> >>
>> >> >> https://poormansprofiler.org/
>> >> >>
>> >> >> Note that you may need to do multiple runs to get a "feel" for where
>> >> >> it is spending most of its time. Also note that likely only one or
>> >> >> two
>> >> >> threads will be using the CPU (you can see this in ps output using a
>> >> >> command like the following) the rest will likely be idle or waiting
>> >> >> for something.
>> >> >>
>> >> >> # ps axHo %cpu,stat,pid,tid,pgid,ppid,comm,wchan
>> >> >>
>> >> >> Observation of these two and maybe a couple of manual gstack dumps
>> >> >> like this to compare thread ids to ps output (LWP is the thread id
>> >> >> (tid) in gdb output) should give us some idea of where it is
>> >> >> spinning.
>> >> >>
>> >> >> # gstack $(pidof ceph-mgr)
>> >> >>
>> >> >>
>> >> >> On Sat, Jan 14, 2017 at 9:54 AM, Robert Longstaff
>> >> >> <robert.longst...@tapad.com> wrote:
>> >> >> > FYI, I'm seeing this as well on the latest Kraken 11.1.1 RPMs on
>> >> >> > CentOS
>> >> >> > 7 w/
>> >> >> > elrepo kernel 4.8.10. ceph-mgr is currently tearing through CPU
>> >> >> > and
>> >> >> > has
>> >> >> > allocated ~11GB of RAM after a single day of usage. Only the
>> >> >> > active
>> >> >> > manager
>> >> >> > is performing this way. The growth is linear and reproducible.
>> >> >> >
>> >> >> > The cluster is mostly idle; 3 mons (4 CPU, 16GB), 20 heads with
>> >> >> > 45x8TB
>> >> >> > OSDs
>> >> >> > each.
>> >> >> >
>> >> >> >
>> >> >> > top - 23:45:47 up 1 day,  1:32,  1 user,  load average: 3.56,
>> >> >> > 3.94,
>> >> >> > 4.21
>> >> >> >
>> >> >> > Tasks: 178 total,   1 running, 177 sleeping,   0 stopped,   0
>> >> >> > zombie
>> >> >> >
>> >> >> > %Cpu(s): 33.9 us, 28.1 sy,  0.0 ni, 37.3 id,  0.0 wa,  0.0 hi,
>> >> >> > 0.7
>> >> >> > si,
>> >> >> > 0.0
>> >> >> > st
>> >> >> >
>> >> >> > KiB Mem : 16423844 total,  3980500 free, 11556532 used,   886812
>> >> >> > buff/cache
>> >> >> >
>> >> >> > KiB Swap:  2097148 total,  2097148 fr

Re: [ceph-users] Fwd: Upgrade Woes on suse leap with OBS ceph.

2017-02-23 Thread Brad Hubbard
Is your change reflected in the current crushmap?

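The "feature set mismatch" errors in the dmesg output quoted below can be
decoded: the kernel logs missing = server_features & ~client_features, and
each set bit is one CEPH_FEATURE_* flag from src/include/ceph_features.h. A
sketch that just lists the missing bit positions (I've deliberately left out
the bit-to-name mapping, since the assignments shift between releases):

```python
def missing_feature_bits(server, client):
    """Bit positions present in the server's feature mask but absent
    from the client's -- the value the kernel logs as 'missing'."""
    missing = server & ~client
    return [bit for bit in range(64) if missing & (1 << bit)]

# Masks from the dmesg output quoted below ("my ... < server's ...").
server = 0xe0106b84a846a52
client = 0x40106b84a842a52
print(missing_feature_bits(server, client))  # -> [14, 57, 59]
```
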
On Fri, Feb 24, 2017 at 12:07 PM, Schlacta, Christ <aarc...@aarcane.org> wrote:
> -- Forwarded message --
> From: Schlacta, Christ <aarc...@aarcane.org>
> Date: Thu, Feb 23, 2017 at 6:06 PM
> Subject: Re: [ceph-users] Upgrade Woes on suse leap with OBS ceph.
> To: Brad Hubbard <bhubb...@redhat.com>
>
>
> So setting the above to 0 by sheer brute force didn't work, so it's
> not crush or osd problem..  also, the errors still say mon0, so I
> suspect it's related to communication between libceph in kernel and
> the mon.
>
> aarcane@densetsu:/etc/target$ sudo ceph --cluster rk osd crush tunables hammer
> adjusted tunables profile to hammer
> aarcane@densetsu:/etc/target$ ceph --cluster rk osd crush show-tunables
> {
> "choose_local_tries": 0,
> "choose_local_fallback_tries": 0,
> "choose_total_tries": 50,
> "chooseleaf_descend_once": 1,
> "chooseleaf_vary_r": 1,
> "chooseleaf_stable": 0,
> "straw_calc_version": 1,
> "allowed_bucket_algs": 54,
> "profile": "hammer",
> "optimal_tunables": 0,
> "legacy_tunables": 0,
> "minimum_required_version": "firefly",
> "require_feature_tunables": 1,
> "require_feature_tunables2": 1,
> "has_v2_rules": 0,
> "require_feature_tunables3": 1,
> "has_v3_rules": 0,
> "has_v4_buckets": 0,
> "require_feature_tunables5": 0,
> "has_v5_rules": 0
> }
>
> aarcane@densetsu:/etc/target$ sudo rbd --cluster rk map rt1
> rbd: sysfs write failed
> In some cases useful info is found in syslog - try "dmesg | tail" or so.
> rbd: map failed: (110) Connection timed out
> aarcane@densetsu:~$ dmesg | tail
> [10118.778868] libceph: mon0 10.0.0.67:6789 feature set mismatch, my
> 40106b84a842a52 < server's e0106b84a846a52, missing a004000
> [10118.779597] libceph: mon0 10.0.0.67:6789 missing required protocol features
> [10119.834634] libceph: mon0 10.0.0.67:6789 feature set mismatch, my
> 40106b84a842a52 < server's e0106b84a846a52, missing a004000
> [10119.835174] libceph: mon0 10.0.0.67:6789 missing required protocol features
> [10120.762983] libceph: mon0 10.0.0.67:6789 feature set mismatch, my
> 40106b84a842a52 < server's e0106b84a846a52, missing a004000
> [10120.763707] libceph: mon0 10.0.0.67:6789 missing required protocol features
> [10121.787128] libceph: mon0 10.0.0.67:6789 feature set mismatch, my
> 40106b84a842a52 < server's e0106b84a846a52, missing a004000
> [10121.787847] libceph: mon0 10.0.0.67:6789 missing required protocol features
> [10122.97] libceph: mon0 10.0.0.67:6789 feature set mismatch, my
> 40106b84a842a52 < server's e0106b84a846a52, missing a004000
> [10122.911872] libceph: mon0 10.0.0.67:6789 missing required protocol features
> aarcane@densetsu:~$
>
>
> On Thu, Feb 23, 2017 at 5:56 PM, Schlacta, Christ <aarc...@aarcane.org> wrote:
>> They're from the suse leap ceph team.  They maintain ceph, and build
>> up to date versions for suse leap.  What I don't know is how to
>> disable it.  When I try, I get the following mess:
>>
>> aarcane@densetsu:/etc/target$ ceph --cluster rk osd crush set-tunable
>> require_feature_tunables5 0
>> Invalid command:  require_feature_tunables5 not in straw_calc_version
>> osd crush set-tunable straw_calc_version <int> :  set crush tunable
>> <tunable> to <value>
>> Error EINVAL: invalid command
>>
>> On Thu, Feb 23, 2017 at 5:54 PM, Brad Hubbard <bhubb...@redhat.com> wrote:
>>> On Fri, Feb 24, 2017 at 11:00 AM, Schlacta, Christ <aarc...@aarcane.org> 
>>> wrote:
>>>> aarcane@densetsu:~$ ceph --cluster rk osd crush show-tunables
>>>> {
>>>> "choose_local_tries": 0,
>>>> "choose_local_fallback_tries": 0,
>>>> "choose_total_tries": 50,
>>>> "chooseleaf_descend_once": 1,
>>>> "chooseleaf_vary_r": 1,
>>>> "chooseleaf_stable": 1,
>>>> "straw_calc_version": 1,
>>>> "allowed_bucket_algs": 54,
>>>> "profile": "jewel",
>>>> "optimal_tunables": 1,
>>>> "legacy_tunables": 0,
>>>> "minimum_required_version": "jewel",
>>>> "require_feature_tunables": 1,
>>>>

Re: [ceph-users] Upgrade Woes on suse leap with OBS ceph.

2017-02-23 Thread Brad Hubbard
On Fri, Feb 24, 2017 at 11:00 AM, Schlacta, Christ <aarc...@aarcane.org> wrote:
> aarcane@densetsu:~$ ceph --cluster rk osd crush show-tunables
> {
> "choose_local_tries": 0,
> "choose_local_fallback_tries": 0,
> "choose_total_tries": 50,
> "chooseleaf_descend_once": 1,
> "chooseleaf_vary_r": 1,
> "chooseleaf_stable": 1,
> "straw_calc_version": 1,
> "allowed_bucket_algs": 54,
> "profile": "jewel",
> "optimal_tunables": 1,
> "legacy_tunables": 0,
> "minimum_required_version": "jewel",
> "require_feature_tunables": 1,
> "require_feature_tunables2": 1,
> "has_v2_rules": 0,
> "require_feature_tunables3": 1,
> "has_v3_rules": 0,
> "has_v4_buckets": 0,
> "require_feature_tunables5": 1,

I suspect setting the above to 0 would resolve the issue with the
client but there may be a reason why this is set?

Where did those packages come from?

> "has_v5_rules": 0
> }
>
> On Thu, Feb 23, 2017 at 4:45 PM, Brad Hubbard <bhubb...@redhat.com> wrote:
>> On Thu, Feb 23, 2017 at 5:18 PM, Schlacta, Christ <aarc...@aarcane.org> 
>> wrote:
>>> So I updated suse leap, and now I'm getting the following error from
>>> ceph.  I know I need to disable some features, but I'm not sure what
>>> they are..  Looks like 14, 57, and 59, but I can't figure out what
>>> they correspond to, nor therefore, how to turn them off.
>>>
>>> libceph: mon0 10.0.0.67:6789 feature set mismatch, my 40106b84a842a42
>>> < server's e0106b84a846a42, missing a004000
>>
>> http://cpp.sh/2rfy says...
>>
>> Bit 14 set
>> Bit 57 set
>> Bit 59 set
>>
>> Comparing this to
>> https://github.com/ceph/ceph/blob/master/src/include/ceph_features.h
>> shows...
>>
>> DEFINE_CEPH_FEATURE(14, 2, SERVER_KRAKEN)
>> DEFINE_CEPH_FEATURE(57, 1, MON_STATEFUL_SUB)
>> DEFINE_CEPH_FEATURE(57, 1, MON_ROUTE_OSDMAP) // overlap
>> DEFINE_CEPH_FEATURE(57, 1, OSDSUBOP_NO_SNAPCONTEXT) // overlap
>> DEFINE_CEPH_FEATURE(57, 1, SERVER_JEWEL) // overlap
>> DEFINE_CEPH_FEATURE(59, 1, FS_BTIME)
>> DEFINE_CEPH_FEATURE(59, 1, FS_CHANGE_ATTR) // overlap
>> DEFINE_CEPH_FEATURE(59, 1, MSG_ADDR2) // overlap
>>
>> $ echo "obase=16;ibase=16;$(echo e0106b84a846a42-a004000|tr
>> '[a-z]' '[A-Z]')"|bc -qi
>> obase=16;ibase=16;E0106B84A846A42-A004000
>> 40106B84A842A42
>>
>> So "me" (the client kernel) does not have the above features that are
>> present on the servers.
>>
>> Can you post the output of "ceph osd crush show-tunables"?
>>
>>>
>>> SuSE Leap 42.2 is Up to date as of tonight, no package updates available.
>>> All the ceph packages have the following version:
>>>
>>> 11.1.0+git.1486588482.ba197ae-72.1
>>>
>>> And the kernel has version:
>>>
>>> 4.4.49-16.1
>>>
>>> It was working perfectly before the upgrade.
>>>
>>> Thank you very much
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> --
>> Cheers,
>> Brad



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: Upgrade Woes on suse leap with OBS ceph.

2017-02-23 Thread Brad Hubbard
Did you dump out the crushmap and look?
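For reference, one way to do that check, sketched here assuming the cluster name "rk" used elsewhere in this thread and a host with crushtool installed:

```shell
# Dump the binary crushmap from the cluster, decompile it to text,
# and check whether the tunable lines reflect the change.
ceph --cluster rk osd getcrushmap -o /tmp/crushmap.bin
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt
grep '^tunable' /tmp/crushmap.txt
```

Note that some tunables (the require_feature_* flags in show-tunables output) are derived from the profile rather than stored as explicit "tunable" lines, so the decompiled map may not list every field shown by "ceph osd crush show-tunables".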

On Fri, Feb 24, 2017 at 1:36 PM, Schlacta, Christ <aarc...@aarcane.org> wrote:
> insofar as I can tell, yes.  Everything indicates that they are in effect.
>
> On Thu, Feb 23, 2017 at 7:14 PM, Brad Hubbard <bhubb...@redhat.com> wrote:
>> Is your change reflected in the current crushmap?
>>
>> On Fri, Feb 24, 2017 at 12:07 PM, Schlacta, Christ <aarc...@aarcane.org> 
>> wrote:
>>> -- Forwarded message --
>>> From: Schlacta, Christ <aarc...@aarcane.org>
>>> Date: Thu, Feb 23, 2017 at 6:06 PM
>>> Subject: Re: [ceph-users] Upgrade Woes on suse leap with OBS ceph.
>>> To: Brad Hubbard <bhubb...@redhat.com>
>>>
>>>
>>> So setting the above to 0 by sheer brute force didn't work, so it's
>>> not crush or osd problem..  also, the errors still say mon0, so I
>>> suspect it's related to communication between libceph in kernel and
>>> the mon.
>>>
>>> aarcane@densetsu:/etc/target$ sudo ceph --cluster rk osd crush tunables 
>>> hammer
>>> adjusted tunables profile to hammer
>>> aarcane@densetsu:/etc/target$ ceph --cluster rk osd crush show-tunables
>>> {
>>> "choose_local_tries": 0,
>>> "choose_local_fallback_tries": 0,
>>> "choose_total_tries": 50,
>>> "chooseleaf_descend_once": 1,
>>> "chooseleaf_vary_r": 1,
>>> "chooseleaf_stable": 0,
>>> "straw_calc_version": 1,
>>> "allowed_bucket_algs": 54,
>>> "profile": "hammer",
>>> "optimal_tunables": 0,
>>> "legacy_tunables": 0,
>>> "minimum_required_version": "firefly",
>>> "require_feature_tunables": 1,
>>> "require_feature_tunables2": 1,
>>> "has_v2_rules": 0,
>>> "require_feature_tunables3": 1,
>>> "has_v3_rules": 0,
>>> "has_v4_buckets": 0,
>>> "require_feature_tunables5": 0,
>>> "has_v5_rules": 0
>>> }
>>>
>>> aarcane@densetsu:/etc/target$ sudo rbd --cluster rk map rt1
>>> rbd: sysfs write failed
>>> In some cases useful info is found in syslog - try "dmesg | tail" or so.
>>> rbd: map failed: (110) Connection timed out
>>> aarcane@densetsu:~$ dmesg | tail
>>> [10118.778868] libceph: mon0 10.0.0.67:6789 feature set mismatch, my
>>> 40106b84a842a52 < server's e0106b84a846a52, missing a004000
>>> [10118.779597] libceph: mon0 10.0.0.67:6789 missing required protocol 
>>> features
>>> [10119.834634] libceph: mon0 10.0.0.67:6789 feature set mismatch, my
>>> 40106b84a842a52 < server's e0106b84a846a52, missing a004000
>>> [10119.835174] libceph: mon0 10.0.0.67:6789 missing required protocol 
>>> features
>>> [10120.762983] libceph: mon0 10.0.0.67:6789 feature set mismatch, my
>>> 40106b84a842a52 < server's e0106b84a846a52, missing a004000
>>> [10120.763707] libceph: mon0 10.0.0.67:6789 missing required protocol 
>>> features
>>> [10121.787128] libceph: mon0 10.0.0.67:6789 feature set mismatch, my
>>> 40106b84a842a52 < server's e0106b84a846a52, missing a004000
>>> [10121.787847] libceph: mon0 10.0.0.67:6789 missing required protocol 
>>> features
>>> [10122.97] libceph: mon0 10.0.0.67:6789 feature set mismatch, my
>>> 40106b84a842a52 < server's e0106b84a846a52, missing a004000
>>> [10122.911872] libceph: mon0 10.0.0.67:6789 missing required protocol 
>>> features
>>> aarcane@densetsu:~$
>>>
>>>
>>> On Thu, Feb 23, 2017 at 5:56 PM, Schlacta, Christ <aarc...@aarcane.org> 
>>> wrote:
>>>> They're from the suse leap ceph team.  They maintain ceph, and build
>>>> up to date versions for suse leap.  What I don't know is how to
>>>> disable it.  When I try, I get the following mess:
>>>>
>>>> aarcane@densetsu:/etc/target$ ceph --cluster rk osd crush set-tunable
>>>> require_feature_tunables5 0
>>>> Invalid command:  require_feature_tunables5 not in straw_calc_version
>>>> osd crush set-tunable straw_calc_version <int> :  set crush tunable
>>>> <tunable> to <value>
>>>> Error EINVAL: invalid command
>>>>
>>>> On Thu, Feb 23, 2017 at 5:54 PM, Brad Hubbard <bhubb...@

Re: [ceph-users] Fwd: Upgrade Woes on suse leap with OBS ceph.

2017-02-23 Thread Brad Hubbard
Hmm,

What's interesting is the feature set reported by the servers has only
changed from

e0106b84a846a42

Bit 1 set Bit 6 set Bit 9 set Bit 11 set Bit 13 set Bit 14 set Bit 18
set Bit 23 set Bit 25 set Bit 27 set Bit 30 set Bit 35 set Bit 36 set
Bit 37 set Bit 39 set Bit 41 set Bit 42 set Bit 48 set Bit 57 set Bit
58 set Bit 59 set

to

e0106b84a846a52

Bit 1 set Bit 4 set Bit 6 set Bit 9 set Bit 11 set Bit 13 set Bit 14
set Bit 18 set Bit 23 set Bit 25 set Bit 27 set Bit 30 set Bit 35 set
Bit 36 set Bit 37 set Bit 39 set Bit 41 set Bit 42 set Bit 48 set Bit
57 set Bit 58 set Bit 59 set

So all it has done is *add* Bit 4, which is DEFINE_CEPH_FEATURE(4, 1,
SUBSCRIBE2)


On Fri, Feb 24, 2017 at 1:40 PM, Schlacta, Christ <aarc...@aarcane.org> wrote:
> # begin crush map
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
> tunable chooseleaf_descend_once 1
> tunable chooseleaf_vary_r 1
> tunable straw_calc_version 1
> tunable allowed_bucket_algs 54
>
> # devices
> device 0 osd.0
> device 1 osd.1
> device 2 osd.2
>
> # types
> type 0 osd
> type 1 host
> type 2 chassis
> type 3 rack
> type 4 row
> type 5 pdu
> type 6 pod
> type 7 room
> type 8 datacenter
> type 9 region
> type 10 root
>
> # buckets
> host densetsu {
> id -2   # do not change unnecessarily
> # weight 0.293
> alg straw
> hash 0  # rjenkins1
> item osd.0 weight 0.146
> item osd.1 weight 0.146
> }
> host density {
> id -3   # do not change unnecessarily
> # weight 0.145
> alg straw
> hash 0  # rjenkins1
> item osd.2 weight 0.145
> }
> root default {
> id -1   # do not change unnecessarily
> # weight 0.438
> alg straw
> hash 0  # rjenkins1
> item densetsu weight 0.293
> item density weight 0.145
> }
>
> # rules
> rule replicated_ruleset {
> ruleset 0
> type replicated
> min_size 1
>     max_size 10
> step take default
> step chooseleaf firstn 0 type host
> step emit
> }
>
> # end crush map
>
> On Thu, Feb 23, 2017 at 7:37 PM, Brad Hubbard <bhubb...@redhat.com> wrote:
>> Did you dump out the crushmap and look?
>>
>> On Fri, Feb 24, 2017 at 1:36 PM, Schlacta, Christ <aarc...@aarcane.org> 
>> wrote:
>>> insofar as I can tell, yes.  Everything indicates that they are in effect.
>>>
>>> On Thu, Feb 23, 2017 at 7:14 PM, Brad Hubbard <bhubb...@redhat.com> wrote:
>>>> Is your change reflected in the current crushmap?
>>>>
>>>> On Fri, Feb 24, 2017 at 12:07 PM, Schlacta, Christ <aarc...@aarcane.org> 
>>>> wrote:
>>>>> -- Forwarded message --
>>>>> From: Schlacta, Christ <aarc...@aarcane.org>
>>>>> Date: Thu, Feb 23, 2017 at 6:06 PM
>>>>> Subject: Re: [ceph-users] Upgrade Woes on suse leap with OBS ceph.
>>>>> To: Brad Hubbard <bhubb...@redhat.com>
>>>>>
>>>>>
>>>>> So setting the above to 0 by sheer brute force didn't work, so it's
>>>>> not crush or osd problem..  also, the errors still say mon0, so I
>>>>> suspect it's related to communication between libceph in kernel and
>>>>> the mon.
>>>>>
>>>>> aarcane@densetsu:/etc/target$ sudo ceph --cluster rk osd crush tunables 
>>>>> hammer
>>>>> adjusted tunables profile to hammer
>>>>> aarcane@densetsu:/etc/target$ ceph --cluster rk osd crush show-tunables
>>>>> {
>>>>> "choose_local_tries": 0,
>>>>> "choose_local_fallback_tries": 0,
>>>>> "choose_total_tries": 50,
>>>>> "chooseleaf_descend_once": 1,
>>>>> "chooseleaf_vary_r": 1,
>>>>> "chooseleaf_stable": 0,
>>>>> "straw_calc_version": 1,
>>>>> "allowed_bucket_algs": 54,
>>>>> "profile": "hammer",
>>>>> "optimal_tunables": 0,
>>>>> "legacy_tunables": 0,
>>>>> "minimum_required_version": "firefly",
>>>>> "require_feature_tunables": 1,
>>>>> "require_feature_tunables2": 1,
>>>>> "has_v2_rules": 0,
>>>>> "require_feature_tun

Re: [ceph-users] Fwd: Upgrade Woes on suse leap with OBS ceph.

2017-02-23 Thread Brad Hubbard
Kefu has just pointed out that this has the hallmarks of
https://github.com/ceph/ceph/pull/13275

On Fri, Feb 24, 2017 at 3:00 PM, Brad Hubbard <bhubb...@redhat.com> wrote:
> Hmm,
>
> What's interesting is the feature set reported by the servers has only
> changed from
>
> e0106b84a846a42
>
> Bit 1 set Bit 6 set Bit 9 set Bit 11 set Bit 13 set Bit 14 set Bit 18
> set Bit 23 set Bit 25 set Bit 27 set Bit 30 set Bit 35 set Bit 36 set
> Bit 37 set Bit 39 set Bit 41 set Bit 42 set Bit 48 set Bit 57 set Bit
> 58 set Bit 59 set
>
> to
>
> e0106b84a846a52
>
> Bit 1 set Bit 4 set Bit 6 set Bit 9 set Bit 11 set Bit 13 set Bit 14
> set Bit 18 set Bit 23 set Bit 25 set Bit 27 set Bit 30 set Bit 35 set
> Bit 36 set Bit 37 set Bit 39 set Bit 41 set Bit 42 set Bit 48 set Bit
> 57 set Bit 58 set Bit 59 set
>
> So all it's done is *added* Bit 4 which is DEFINE_CEPH_FEATURE( 4, 1,
> SUBSCRIBE2)
>
>
> On Fri, Feb 24, 2017 at 1:40 PM, Schlacta, Christ <aarc...@aarcane.org> wrote:
>> # begin crush map
>> tunable choose_local_tries 0
>> tunable choose_local_fallback_tries 0
>> tunable choose_total_tries 50
>> tunable chooseleaf_descend_once 1
>> tunable chooseleaf_vary_r 1
>> tunable straw_calc_version 1
>> tunable allowed_bucket_algs 54
>>
>> # devices
>> device 0 osd.0
>> device 1 osd.1
>> device 2 osd.2
>>
>> # types
>> type 0 osd
>> type 1 host
>> type 2 chassis
>> type 3 rack
>> type 4 row
>> type 5 pdu
>> type 6 pod
>> type 7 room
>> type 8 datacenter
>> type 9 region
>> type 10 root
>>
>> # buckets
>> host densetsu {
>> id -2   # do not change unnecessarily
>> # weight 0.293
>> alg straw
>> hash 0  # rjenkins1
>> item osd.0 weight 0.146
>> item osd.1 weight 0.146
>> }
>> host density {
>> id -3   # do not change unnecessarily
>> # weight 0.145
>> alg straw
>> hash 0  # rjenkins1
>> item osd.2 weight 0.145
>> }
>> root default {
>> id -1   # do not change unnecessarily
>> # weight 0.438
>> alg straw
>> hash 0  # rjenkins1
>> item densetsu weight 0.293
>> item density weight 0.145
>> }
>>
>> # rules
>> rule replicated_ruleset {
>> ruleset 0
>> type replicated
>> min_size 1
>> max_size 10
>> step take default
>> step chooseleaf firstn 0 type host
>> step emit
>> }
>>
>> # end crush map
>>
>> On Thu, Feb 23, 2017 at 7:37 PM, Brad Hubbard <bhubb...@redhat.com> wrote:
>>> Did you dump out the crushmap and look?
>>>
>>> On Fri, Feb 24, 2017 at 1:36 PM, Schlacta, Christ <aarc...@aarcane.org> 
>>> wrote:
>>>> insofar as I can tell, yes.  Everything indicates that they are in effect.
>>>>
>>>> On Thu, Feb 23, 2017 at 7:14 PM, Brad Hubbard <bhubb...@redhat.com> wrote:
>>>>> Is your change reflected in the current crushmap?
>>>>>
>>>>> On Fri, Feb 24, 2017 at 12:07 PM, Schlacta, Christ <aarc...@aarcane.org> 
>>>>> wrote:
>>>>>> -- Forwarded message --
>>>>>> From: Schlacta, Christ <aarc...@aarcane.org>
>>>>>> Date: Thu, Feb 23, 2017 at 6:06 PM
>>>>>> Subject: Re: [ceph-users] Upgrade Woes on suse leap with OBS ceph.
>>>>>> To: Brad Hubbard <bhubb...@redhat.com>
>>>>>>
>>>>>>
>>>>>> So setting the above to 0 by sheer brute force didn't work, so it's
>>>>>> not crush or osd problem..  also, the errors still say mon0, so I
>>>>>> suspect it's related to communication between libceph in kernel and
>>>>>> the mon.
>>>>>>
>>>>>> aarcane@densetsu:/etc/target$ sudo ceph --cluster rk osd crush tunables 
>>>>>> hammer
>>>>>> adjusted tunables profile to hammer
>>>>>> aarcane@densetsu:/etc/target$ ceph --cluster rk osd crush show-tunables
>>>>>> {
>>>>>> "choose_local_tries": 0,
>>>>>> "choose_local_fallback_tries": 0,
>>>>>> "choose_total_tries": 50,
>>>>>> "chooseleaf_descend_once": 1,
>>>>>> "choosel

Re: [ceph-users] Fwd: Upgrade Woes on suse leap with OBS ceph.

2017-02-23 Thread Brad Hubbard
On Fri, Feb 24, 2017 at 3:07 PM, Schlacta, Christ <aarc...@aarcane.org> wrote:
> So hopefully when the suse ceph team get 11.2 released it should fix this,
> yes?

Definitely not a question I can answer.

What I can tell you is the fix is only in master atm, not yet
backported to kraken http://tracker.ceph.com/issues/18842

>
> On Feb 23, 2017 21:06, "Brad Hubbard" <bhubb...@redhat.com> wrote:
>>
>> Kefu has just pointed out that this has the hallmarks of
>> https://github.com/ceph/ceph/pull/13275
>>
>> On Fri, Feb 24, 2017 at 3:00 PM, Brad Hubbard <bhubb...@redhat.com> wrote:
>> > Hmm,
>> >
>> > What's interesting is the feature set reported by the servers has only
>> > changed from
>> >
>> > e0106b84a846a42
>> >
>> > Bit 1 set Bit 6 set Bit 9 set Bit 11 set Bit 13 set Bit 14 set Bit 18
>> > set Bit 23 set Bit 25 set Bit 27 set Bit 30 set Bit 35 set Bit 36 set
>> > Bit 37 set Bit 39 set Bit 41 set Bit 42 set Bit 48 set Bit 57 set Bit
>> > 58 set Bit 59 set
>> >
>> > to
>> >
>> > e0106b84a846a52
>> >
>> > Bit 1 set Bit 4 set Bit 6 set Bit 9 set Bit 11 set Bit 13 set Bit 14
>> > set Bit 18 set Bit 23 set Bit 25 set Bit 27 set Bit 30 set Bit 35 set
>> > Bit 36 set Bit 37 set Bit 39 set Bit 41 set Bit 42 set Bit 48 set Bit
>> > 57 set Bit 58 set Bit 59 set
>> >
>> > So all it's done is *added* Bit 4 which is DEFINE_CEPH_FEATURE( 4, 1,
>> > SUBSCRIBE2)
>> >
>> >
>> > On Fri, Feb 24, 2017 at 1:40 PM, Schlacta, Christ <aarc...@aarcane.org>
>> > wrote:
>> >> # begin crush map
>> >> tunable choose_local_tries 0
>> >> tunable choose_local_fallback_tries 0
>> >> tunable choose_total_tries 50
>> >> tunable chooseleaf_descend_once 1
>> >> tunable chooseleaf_vary_r 1
>> >> tunable straw_calc_version 1
>> >> tunable allowed_bucket_algs 54
>> >>
>> >> # devices
>> >> device 0 osd.0
>> >> device 1 osd.1
>> >> device 2 osd.2
>> >>
>> >> # types
>> >> type 0 osd
>> >> type 1 host
>> >> type 2 chassis
>> >> type 3 rack
>> >> type 4 row
>> >> type 5 pdu
>> >> type 6 pod
>> >> type 7 room
>> >> type 8 datacenter
>> >> type 9 region
>> >> type 10 root
>> >>
>> >> # buckets
>> >> host densetsu {
>> >> id -2   # do not change unnecessarily
>> >> # weight 0.293
>> >> alg straw
>> >> hash 0  # rjenkins1
>> >> item osd.0 weight 0.146
>> >> item osd.1 weight 0.146
>> >> }
>> >> host density {
>> >> id -3       # do not change unnecessarily
>> >> # weight 0.145
>> >> alg straw
>> >> hash 0  # rjenkins1
>> >> item osd.2 weight 0.145
>> >> }
>> >> root default {
>> >> id -1   # do not change unnecessarily
>> >> # weight 0.438
>> >> alg straw
>> >> hash 0  # rjenkins1
>> >> item densetsu weight 0.293
>> >> item density weight 0.145
>> >> }
>> >>
>> >> # rules
>> >> rule replicated_ruleset {
>> >> ruleset 0
>> >> type replicated
>> >> min_size 1
>> >> max_size 10
>> >> step take default
>> >> step chooseleaf firstn 0 type host
>> >> step emit
>> >> }
>> >>
>> >> # end crush map
>> >>
>> >> On Thu, Feb 23, 2017 at 7:37 PM, Brad Hubbard <bhubb...@redhat.com>
>> >> wrote:
>> >>> Did you dump out the crushmap and look?
>> >>>
>> >>> On Fri, Feb 24, 2017 at 1:36 PM, Schlacta, Christ
>> >>> <aarc...@aarcane.org> wrote:
>> >>>> insofar as I can tell, yes.  Everything indicates that they are in
>> >>>> effect.
>> >>>>
>> >>>> On Thu, Feb 23, 2017 at 7:14 PM, Brad Hubbard <bhubb...@redhat.com>
>> >>>> wrote:
>> >>>>> Is your change reflected in the current crushmap?
>> >>>>>
>> >>>>> On Fri, Feb 24, 2017 at 12

Re: [ceph-users] Upgrade Woes on suse leap with OBS ceph.

2017-02-23 Thread Brad Hubbard
On Thu, Feb 23, 2017 at 5:18 PM, Schlacta, Christ  wrote:
> So I updated suse leap, and now I'm getting the following error from
> ceph.  I know I need to disable some features, but I'm not sure what
> they are..  Looks like 14, 57, and 59, but I can't figure out what
> they correspond to, nor therefore, how to turn them off.
>
> libceph: mon0 10.0.0.67:6789 feature set mismatch, my 40106b84a842a42
> < server's e0106b84a846a42, missing a004000

http://cpp.sh/2rfy says...

Bit 14 set
Bit 57 set
Bit 59 set

Comparing this to
https://github.com/ceph/ceph/blob/master/src/include/ceph_features.h
shows...

DEFINE_CEPH_FEATURE(14, 2, SERVER_KRAKEN)
DEFINE_CEPH_FEATURE(57, 1, MON_STATEFUL_SUB)
DEFINE_CEPH_FEATURE(57, 1, MON_ROUTE_OSDMAP) // overlap
DEFINE_CEPH_FEATURE(57, 1, OSDSUBOP_NO_SNAPCONTEXT) // overlap
DEFINE_CEPH_FEATURE(57, 1, SERVER_JEWEL) // overlap
DEFINE_CEPH_FEATURE(59, 1, FS_BTIME)
DEFINE_CEPH_FEATURE(59, 1, FS_CHANGE_ATTR) // overlap
DEFINE_CEPH_FEATURE(59, 1, MSG_ADDR2) // overlap

$ echo "obase=16;ibase=16;$(echo e0106b84a846a42-a004000|tr
'[a-z]' '[A-Z]')"|bc -qi
obase=16;ibase=16;E0106B84A846A42-A004000
40106B84A842A42

So "me" (the client kernel) does not have the above features that are
present on the servers.
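The same decoding can be reproduced without cpp.sh; this is a sketch using
bash's 64-bit arithmetic on the two masks quoted from the dmesg line above:

```shell
# Decode which feature bits the client is missing, from the dmesg line
# "feature set mismatch, my 40106b84a842a42 < server's e0106b84a846a42".
mine=0x40106b84a842a42
server=0xe0106b84a846a42
missing=$(( server & ~mine ))   # bits the servers have that the client lacks
printf 'missing mask: %x\n' "$missing"
for bit in $(seq 0 63); do
    if (( (missing >> bit) & 1 )); then
        echo "Bit $bit set"
    fi
done
```

This prints bits 14, 57 and 59, matching the lookup against ceph_features.h
above (SERVER_KRAKEN, the SERVER_JEWEL overlap group, and the MSG_ADDR2
overlap group).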

Can you post the output of "ceph osd crush show-tunables"?

>
> SuSE Leap 42.2 is Up to date as of tonight, no package updates available.
> All the ceph packages have the following version:
>
> 11.1.0+git.1486588482.ba197ae-72.1
>
> And the kernel has version:
>
> 4.4.49-16.1
>
> It was working perfectly before the upgrade.
>
> Thank you very much
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Authentication error CEPH installation

2017-02-23 Thread Brad Hubbard
You need ceph.client.admin.keyring in /etc/ceph/
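A sketch of what that can look like, assuming the admin keyring exists on the
deploy node and reusing the hostname "host1" from the example below (the
radosgw keyring that was copied there is not the one librados looks for as
client.admin):

```shell
# Copy the *admin* keyring, not the radosgw one, then verify the
# connection from the target host.
scp /etc/ceph/ceph.client.admin.keyring host1:/etc/ceph/
ssh host1 ceph -s   # should now connect instead of failing with PermissionError
```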

On Thu, Feb 23, 2017 at 8:13 PM, Chaitanya Ravuri
 wrote:
> Hi Team,
>
> I have recently deployed a new CEPH cluster for OEL6 boxes for my testing. I
> am getting below error on the admin host. not sure how can i fix it.
>
> 2017-02-23 02:13:04.166366 7f9c85efb700  0 librados: client.admin
> authentication error (1) Operation not permitted
> Error connecting to cluster: PermissionError
>
>
> I have reviewed few blogs and tried copying as below
>
>  scp /etc/ceph/ceph.client.radosgw.keyring host1:/etc/ceph/
>
> It didn't help.
>
> Can anyone please suggest further.
>
> Thanks,
> RC
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-02-13 Thread Brad Hubbard
Could one of the reporters open a tracker for this issue and attach
the requested debugging data?

On Mon, Feb 13, 2017 at 11:18 PM, Donny Davis <do...@fortnebula.com> wrote:
> I am having the same issue. When I looked at my idle cluster this morning,
> one of the nodes had 400% cpu utilization, and ceph-mgr was 300% of that.  I
> have 3 AIO nodes, and only one of them seemed to be affected.
>
> On Sat, Jan 14, 2017 at 12:18 AM, Brad Hubbard <bhubb...@redhat.com> wrote:
>>
>> Want to install debuginfo packages and use something like this to try
>> and find out where it is spending most of its time?
>>
>> https://poormansprofiler.org/
>>
>> Note that you may need to do multiple runs to get a "feel" for where
>> it is spending most of its time. Also note that likely only one or two
>> threads will be using the CPU (you can see this in ps output using a
>> command like the following) the rest will likely be idle or waiting
>> for something.
>>
>> # ps axHo %cpu,stat,pid,tid,pgid,ppid,comm,wchan
>>
>> Observation of these two and maybe a couple of manual gstack dumps
>> like this to compare thread ids to ps output (LWP is the thread id
>> (tid) in gdb output) should give us some idea of where it is spinning.
>>
>> # gstack $(pidof ceph-mgr)
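The sampling approach described above can be sketched as a small loop (a
sketch only: it assumes gdb's gstack wrapper is installed, ceph-mgr is
running, and the sample count is arbitrary):

```shell
# Poor man's profiler: sample the ceph-mgr stacks a few times, then
# count the most frequent innermost (#0) frames across the samples.
pid=$(pidof ceph-mgr)
for i in $(seq 1 10); do
    gstack "$pid"
    sleep 1
done > /tmp/mgr-stacks.txt
# Frames that dominate this list are where the CPU time is going:
grep '^#0' /tmp/mgr-stacks.txt | sort | uniq -c | sort -rn | head
```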
>>
>>
>> On Sat, Jan 14, 2017 at 9:54 AM, Robert Longstaff
>> <robert.longst...@tapad.com> wrote:
>> > FYI, I'm seeing this as well on the latest Kraken 11.1.1 RPMs on CentOS
>> > 7 w/
>> > elrepo kernel 4.8.10. ceph-mgr is currently tearing through CPU and has
>> > allocated ~11GB of RAM after a single day of usage. Only the active
>> > manager
>> > is performing this way. The growth is linear and reproducible.
>> >
>> > The cluster is mostly idle; 3 mons (4 CPU, 16GB), 20 heads with 45x8TB
>> > OSDs
>> > each.
>> >
>> >
>> > top - 23:45:47 up 1 day,  1:32,  1 user,  load average: 3.56, 3.94, 4.21
>> >
>> > Tasks: 178 total,   1 running, 177 sleeping,   0 stopped,   0 zombie
>> >
>> > %Cpu(s): 33.9 us, 28.1 sy,  0.0 ni, 37.3 id,  0.0 wa,  0.0 hi,  0.7 si,
>> > 0.0
>> > st
>> >
>> > KiB Mem : 16423844 total,  3980500 free, 11556532 used,   886812
>> > buff/cache
>> >
>> > KiB Swap:  2097148 total,  2097148 free,0 used.  4836772 avail
>> > Mem
>> >
>> >
>> >   PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+
>> > COMMAND
>> >
>> >  2351 ceph  20   0 12.160g 0.010t  17380 S 203.7 64.8   2094:27
>> > ceph-mgr
>> >
>> >  2302 ceph  20   0  620316 267992 157620 S   2.3  1.6  65:11.50
>> > ceph-mon
>> >
>> >
>> > On Wed, Jan 11, 2017 at 12:00 PM, Stillwell, Bryan J
>> > <bryan.stillw...@charter.com> wrote:
>> >>
>> >> John,
>> >>
>> >> This morning I compared the logs from yesterday and I show a noticeable
>> >> increase in messages like these:
>> >>
>> >> 2017-01-11 09:00:03.032521 7f70f15c1700 10 mgr handle_mgr_digest 575
>> >> 2017-01-11 09:00:03.032523 7f70f15c1700 10 mgr handle_mgr_digest 441
>> >> 2017-01-11 09:00:03.032529 7f70f15c1700 10 mgr notify_all notify_all:
>> >> notify_all mon_status
>> >> 2017-01-11 09:00:03.032532 7f70f15c1700 10 mgr notify_all notify_all:
>> >> notify_all health
>> >> 2017-01-11 09:00:03.032534 7f70f15c1700 10 mgr notify_all notify_all:
>> >> notify_all pg_summary
>> >> 2017-01-11 09:00:03.033613 7f70f15c1700  4 mgr ms_dispatch active
>> >> mgrdigest v1
>> >> 2017-01-11 09:00:03.033618 7f70f15c1700 -1 mgr ms_dispatch mgrdigest v1
>> >> 2017-01-11 09:00:03.033620 7f70f15c1700 10 mgr handle_mgr_digest 575
>> >> 2017-01-11 09:00:03.033622 7f70f15c1700 10 mgr handle_mgr_digest 441
>> >> 2017-01-11 09:00:03.033628 7f70f15c1700 10 mgr notify_all notify_all:
>> >> notify_all mon_status
>> >> 2017-01-11 09:00:03.033631 7f70f15c1700 10 mgr notify_all notify_all:
>> >> notify_all health
>> >> 2017-01-11 09:00:03.033633 7f70f15c1700 10 mgr notify_all notify_all:
>> >> notify_all pg_summary
>> >> 2017-01-11 09:00:03.532898 7f70f15c1700  4 mgr ms_dispatch active
>> >> mgrdigest v1
>> >> 2017-01-11 09:00:03.532945 7f70f15c1700 -1 mgr ms_dispatch mgrdigest v1
>> >>
>> >>
>> >> In a 1 minute period yesterday I saw 84 times this group 

Re: [ceph-users] Cannot shutdown monitors

2017-02-10 Thread Brad Hubbard
That looks like dmesg output from the libceph kernel module. Do you
have the libceph kernel module loaded?

If the answer to that question is "yes" the follow-up question is
"Why?" as it is not required for a MON or OSD host.

On Sat, Feb 11, 2017 at 1:18 PM, Michael Andersen  wrote:
> Yeah, all three mons have OSDs on the same machines.
>
> On Feb 10, 2017 7:13 PM, "Shinobu Kinjo"  wrote:
>>
>> Is your primary MON running on the host which some OSDs are running on?
>>
>> On Sat, Feb 11, 2017 at 11:53 AM, Michael Andersen
>>  wrote:
>> > Hi
>> >
>> > I am running a small cluster of 8 machines (80 osds), with three
>> > monitors on
>> > Ubuntu 16.04. Ceph version 10.2.5.
>> >
>> > I cannot reboot the monitors without physically going into the
>> > datacenter
>> > and power cycling them. What happens is that while shutting down, ceph
>> > gets
>> > stuck trying to contact the other monitors but networking has already
>> > shut
>> > down or something like that. I get an endless stream of:
>> >
>> > libceph: connect 10.20.0.10:6789 error -101
>> > libceph: connect 10.20.0.13:6789 error -101
>> > libceph: connect 10.20.0.17:6789 error -101
>> >
>> > where in this case 10.20.0.10 is the machine I am trying to shut down
>> > and
>> > all three IPs are the MONs.
>> >
>> > At this stage of the shutdown, the machine doesn't respond to pings, and
>> > I
>> > cannot even log in on any of the virtual terminals. Nothing to do but
>> > poweroff at the server.
>> >
>> > The other non-mon servers shut down just fine, and the cluster was
>> > healthy
>> > at the time I was rebooting the mon (I only reboot one machine at a
>> > time,
>> > waiting for it to come up before I do the next one).
>> >
>> > Also worth mentioning that if I execute
>> >
>> > sudo systemctl stop ceph\*.service ceph\*.target
>> >
>> > on the server, the only things I see are:
>> >
>> > root 11143 2  0 18:40 ?00:00:00 [ceph-msgr]
>> > root 11162 2  0 18:40 ?00:00:00 [ceph-watch-noti]
>> >
>> > and even then, when no ceph daemons are left running, doing a reboot
>> > goes
>> > into the same loop.
>> >
>> > I can't really find any mention of this online, but I feel someone must
>> > have
>> > hit this. Any idea how to fix it? It's really annoying because its hard
>> > for
>> > me to get access to the datacenter.
>> >
>> > Thanks
>> > Michael
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot shutdown monitors

2017-02-10 Thread Brad Hubbard
Just making sure the list sees this for those that are following.

On Sat, Feb 11, 2017 at 2:49 PM, Michael Andersen <mich...@steelcode.com> wrote:
> Right, so yes libceph is loaded
>
> root@compound-7:~# lsmod | egrep "ceph|rbd"
> rbd69632  0
> libceph   245760  1 rbd
> libcrc32c  16384  3 xfs,raid456,libceph
>
> I stopped all the services and unloaded the modules
>
> root@compound-7:~# systemctl stop ceph\*.service ceph\*.target
> root@compound-7:~# modprobe -r rbd
> root@compound-7:~# modprobe -r libceph
> root@compound-7:~# lsmod | egrep "ceph|rbd"
>
> Then rebooted
> root@compound-7:~# reboot
>
> And sure enough the reboot happened OK.
>
> So that solves my immediate problem, I now know how to work around it
> (thanks!), but I would love to work out how to not need this step. Any
> further info I can give to help?
>
>
>
> On Fri, Feb 10, 2017 at 8:42 PM, Michael Andersen <mich...@steelcode.com>
> wrote:
>>
>> Sorry this email arrived out of order. I will do the modprobe -r test
>>
>> On Fri, Feb 10, 2017 at 8:20 PM, Brad Hubbard <bhubb...@redhat.com> wrote:
>>>
>>> On Sat, Feb 11, 2017 at 2:08 PM, Michael Andersen <mich...@steelcode.com>
>>> wrote:
>>> > I believe I did shutdown mon process. Is that not done by the
>>> >
>>> > sudo systemctl stop ceph\*.service ceph\*.target
>>> >
>>> > command? Also, as I noted, the mon process does not show up in ps after
>>> > I do
>>> > that, but I still get the shutdown halting.
>>> >
>>> > The libceph kernel module may be installed. I did not do so
>>> > deliberately but
>>> > I used ceph-deploy so if it installs that then that is why it's there.
>>> > I
>>> > also run some kubernetes pods with rbd persistent volumes on these
>>> > machines,
>>> > although no rbd volumes are in use or mounted when I try shut down. In
>>> > fact
>>> > I unmapped all rbd volumes across the whole cluster to make sure. Is
>>> > libceph
>>> > required for rbd?
>>>
>>> For kernel rbd (/dev/rbd0, etc.) yes, for librbd, no.
>>>
>>> As a test try modprobe -r on both the libceph and rbd modules before
>>> shutdown and see if that helps ("modprobe -r rbd" should unload
>>> libceph as well but verify that).
>>>
>>> >
>>> > But even so, is it normal for the libceph kernel module to prevent
>>> > shutdown?
>>> > Is there another stage in the shutdown procedure that I am missing?
>>> >
>>> >
>>> > On Feb 10, 2017 7:49 PM, "Brad Hubbard" <bhubb...@redhat.com> wrote:
>>> >
>>> > That looks like dmesg output from the libceph kernel module. Do you
>>> > have the libceph kernel module loaded?
>>> >
>>> > If the answer to that question is "yes" the follow-up question is
>>> > "Why?" as it is not required for a MON or OSD host.
>>> >
>>> > On Sat, Feb 11, 2017 at 1:18 PM, Michael Andersen
>>> > <mich...@steelcode.com>
>>> > wrote:
>>> >> Yeah, all three mons have OSDs on the same machines.
>>> >>
>>> >> On Feb 10, 2017 7:13 PM, "Shinobu Kinjo" <ski...@redhat.com> wrote:
>>> >>>
>>> >>> Is your primary MON running on the host which some OSDs are running
>>> >>> on?
>>> >>>
>>> >>> On Sat, Feb 11, 2017 at 11:53 AM, Michael Andersen
>>> >>> <mich...@steelcode.com> wrote:
>>> >>> > Hi
>>> >>> >
>>> >>> > I am running a small cluster of 8 machines (80 osds), with three
>>> >>> > monitors on
>>> >>> > Ubuntu 16.04. Ceph version 10.2.5.
>>> >>> >
>>> >>> > I cannot reboot the monitors without physically going into the
>>> >>> > datacenter
>>> >>> > and power cycling them. What happens is that while shutting down,
>>> >>> > ceph
>>> >>> > gets
>>> >>> > stuck trying to contact the other monitors but networking has
>>> >>> > already
>>> >>> > shut
>>> >>> > down or something like that. I get an endless stream of:
>>> >>> >
>>> >

Re: [ceph-users] OSD Repeated Failure

2017-02-10 Thread Brad Hubbard
On Sat, Feb 11, 2017 at 2:51 PM, Ashley Merrick  wrote:
> Hello,
>
>
>
> I have a particular OSD (53), which at random will crash with the OSD
> process stopping.
>
>
>
> OS: Debian 8.x
>
> CEPH : ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
>
>
>
> From the logs at the time of the OSD being marked as crashed I can only see
> the following:
>
>
>
> -4> 2017-02-10 23:40:16.820894 7fadbd049700  1 -- 172.16.3.7:6825/16969
> <== osd.26 172.16.2.104:0/5812 1  osd_ping(ping e29842 stamp 2017-02$
>
> -3> 2017-02-10 23:40:16.820918 7fadbd049700  1 -- 172.16.3.7:6825/16969
> --> 172.16.2.104:0/5812 -- osd_ping(ping_reply e29842 stamp 2017-02-10 2$
>
> -2> 2017-02-10 23:40:16.822436 7faddb149700  1 --
> 172.16.2.107:6820/16969 <== client.8222771 172.16.2.2:0/1125091221 86 
> osd_op(client.82227$
>
> -1> 2017-02-10 23:40:16.822453 7faddb149700  5 -- op tracker -- seq:
> 670, time: 2017-02-10 23:40:16.822453, event: queued_for_pg, op: osd_op(cli$
>
>  0> 2017-02-10 23:40:16.832241 7fadd0631700 -1 *** Caught signal
> (Aborted) **
>
> in thread 7fadd0631700 thread_name:tp_osd_tp
>
>
>
> ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
>
> 1: (()+0x951cc7) [0x5556d8c4bcc7]
>
> 2: (()+0xf890) [0x7fadf5f8e890]
>
> 3: (gsignal()+0x37) [0x7fadf3fd5067]
>
> 4: (abort()+0x148) [0x7fadf3fd6448]
>
> 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x256) [0x5556d8d51296]
>
> 6: (FileStore::read(coll_t const&, ghobject_t const&, unsigned long,
> unsigned long, ceph::buffer::list&, unsigned int, bool)+0xd7c)
> [0x5556d89e68ec]
>
> 7: (ReplicatedBackend::objects_read_sync(hobject_t const&, unsigned long,
> unsigned long, unsigned int, ceph::buffer::list*)+0xcd) [0x5556d885ce7d]
>
> 8: (ReplicatedPG::do_osd_ops(ReplicatedPG::OpContext*, std::vector<OSDOp, std::allocator<OSDOp> >&)+0x6355) [0x5556d87f6515]
>
> 9: (ReplicatedPG::prepare_transaction(ReplicatedPG::OpContext*)+0x61)
> [0x5556d8802101]
>
> 10: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x936)
> [0x5556d880a566]
>
> 11: (ReplicatedPG::do_op(std::shared_ptr<OpRequest>&)+0x37c3)
> [0x5556d880f3d3]
>
> 12: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
> ThreadPool::TPHandle&)+0x727) [0x5556d87c6ae7]
>
> 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>,
> ThreadPool::TPHandle&)+0x420) [0x5556d866b650]
>
> 14: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x6a)
> [0x5556d866b8aa]
>
> 15: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0x87a) [0x5556d8687f7a]
>
> 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x8b6)
> [0x5556d8d40c56]
>
> 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5556d8d42c10]
>
> 18: (()+0x8064) [0x7fadf5f87064]
>
> 19: (clone()+0x6d) [0x7fadf408862d]
>
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> interpret this.
>
>
>
>
>
> Does this relate to anything or do I need to dig deeper to find the issue?

It's likely a filesystem or hardware problem as it is failing an
assert in FileStore::read.

Could you thoroughly check the filesystem and the underlying hardware?

You can possibly get more information about the specifics of the issue
by capturing a log with debugging turned right up (20).
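
If it helps, one hypothetical way to do that for just this OSD is a ceph.conf
fragment like the following on the host running osd.53, followed by a restart
of the daemon (the subsystems and levels shown are assumptions, adjust to
taste, and remove them again afterwards, as level 20 logging is very verbose):

```ini
# Hypothetical ceph.conf fragment on the host running osd.53.
[osd.53]
    debug osd = 20
    debug filestore = 20
    debug ms = 1
```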

>
>
>
> ,Ashley
>
>



-- 
Cheers,
Brad


Re: [ceph-users] Cannot shutdown monitors

2017-02-10 Thread Brad Hubbard
On Sat, Feb 11, 2017 at 2:58 PM, Brad Hubbard <bhubb...@redhat.com> wrote:
> Just making sure the list sees this for those that are following.
>
> On Sat, Feb 11, 2017 at 2:49 PM, Michael Andersen <mich...@steelcode.com> 
> wrote:
>> Right, so yes libceph is loaded
>>
>> root@compound-7:~# lsmod | egrep "ceph|rbd"
>> rbd                    69632  0
>> libceph   245760  1 rbd
>> libcrc32c  16384  3 xfs,raid456,libceph
>>
>> I stopped all the services and unloaded the modules
>>
>> root@compound-7:~# systemctl stop ceph\*.service ceph\*.target
>> root@compound-7:~# modprobe -r rbd
>> root@compound-7:~# modprobe -r libceph
>> root@compound-7:~# lsmod | egrep "ceph|rbd"
>>
>> Then rebooted
>> root@compound-7:~# reboot
>>
>> And sure enough the reboot happened OK.
>>
>> So that solves my immediate problem, I now know how to work around it
>> (thanks!), but I would love to work out how to not need this step. Any

Can you double-check that all rbd volumes are unmounted on this host
when shutting down? Maybe unmap them just for good measure.

I don't believe the libceph module should need to talk to the cluster
unless it has active connections at the time of shutdown.
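
As a rough sketch of that unmap step (the "rbd showmapped" output here is a
made-up sample fed through a here-doc; on a real host you would pipe the
actual command output instead):

```shell
#!/bin/sh
# Collect mapped device paths from "rbd showmapped"-style output.
# The here-doc stands in for real command output (sample data only).
devices=$(awk 'NR > 1 { print $NF }' <<'EOF'
id pool image snap device
0  rbd  vol1  -    /dev/rbd0
1  rbd  vol2  -    /dev/rbd1
EOF
)
echo "$devices"
# On a real host, as root, you would then run something like:
#   for dev in $devices; do rbd unmap "$dev"; done
#   modprobe -r rbd && modprobe -r libceph
```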

>> further info I can give to help?
>>
>>
>>
>> On Fri, Feb 10, 2017 at 8:42 PM, Michael Andersen <mich...@steelcode.com>
>> wrote:
>>>
>>> Sorry this email arrived out of order. I will do the modprobe -r test
>>>
>>> On Fri, Feb 10, 2017 at 8:20 PM, Brad Hubbard <bhubb...@redhat.com> wrote:
>>>>
>>>> On Sat, Feb 11, 2017 at 2:08 PM, Michael Andersen <mich...@steelcode.com>
>>>> wrote:
>>>> > I believe I did shutdown mon process. Is that not done by the
>>>> >
>>>> > sudo systemctl stop ceph\*.service ceph\*.target
>>>> >
>>>> > command? Also, as I noted, the mon process does not show up in ps after
>>>> > I do
>>>> > that, but I still get the shutdown halting.
>>>> >
>>>> > The libceph kernel module may be installed. I did not do so
>>>> > deliberately but
>>>> > I used ceph-deploy so if it installs that then that is why it's there.
>>>> > I
>>>> > also run some kubernetes pods with rbd persistent volumes on these
>>>> > machines,
>>>> > although no rbd volumes are in use or mounted when I try shut down. In
>>>> > fact
>>>> > I unmapped all rbd volumes across the whole cluster to make sure. Is
>>>> > libceph
>>>> > required for rbd?
>>>>
>>>> For kernel rbd (/dev/rbd0, etc.) yes, for librbd, no.
>>>>
>>>> As a test try modprobe -r on both the libceph and rbd modules before
>>>> shutdown and see if that helps ("modprobe -r rbd" should unload
>>>> libceph as well but verify that).
>>>>
>>>> >
>>>> > But even so, is it normal for the libceph kernel module to prevent
>>>> > shutdown?
>>>> > Is there another stage in the shutdown procedure that I am missing?
>>>> >
>>>> >
>>>> > On Feb 10, 2017 7:49 PM, "Brad Hubbard" <bhubb...@redhat.com> wrote:
>>>> >
>>>> > That looks like dmesg output from the libceph kernel module. Do you
>>>> > have the libceph kernel module loaded?
>>>> >
>>>> > If the answer to that question is "yes" the follow-up question is
>>>> > "Why?" as it is not required for a MON or OSD host.
>>>> >
>>>> > On Sat, Feb 11, 2017 at 1:18 PM, Michael Andersen
>>>> > <mich...@steelcode.com>
>>>> > wrote:
>>>> >> Yeah, all three mons have OSDs on the same machines.
>>>> >>
>>>> >> On Feb 10, 2017 7:13 PM, "Shinobu Kinjo" <ski...@redhat.com> wrote:
>>>> >>>
>>>> >>> Is your primary MON running on the host which some OSDs are running
>>>> >>> on?
>>>> >>>
>>>> >>> On Sat, Feb 11, 2017 at 11:53 AM, Michael Andersen
>>>> >>> <mich...@steelcode.com> wrote:
>>>> >>> > Hi
>>>> >>> >
>>>> >>> > I am running a small cluster of 8 machines (80 osds), with three
>>>> >>> > monitors o

Re: [ceph-users] build and Compile ceph in development mode takes an hour

2016-08-17 Thread Brad Hubbard
On Thu, Aug 18, 2016 at 1:12 AM, agung Laksono  wrote:
> Hi Ceph User,
>
> When I make change inside ceph codes in the development mode,
> I found that recompiling takes around an hour because I have to remove
> a build folder and all the contest and then reproduce it.
>
> Is there a way to make the compiling process be faster? something like only
> compile a particular code that I change.

Sure, just reuse the same build directory and run "make" again after you make
code changes; it should only re-compile the binaries that are affected by your
changes.

You can also use "make -jX" if you aren't already, where 'X' is usually the
number of CPUs / 2, which may speed up the build.
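
For example (assuming GNU make and coreutils' nproc; run the resulting
command in your existing, unmodified build directory):

```shell
#!/bin/sh
# Pick a parallel job count of CPUs / 2, with a floor of 1.
jobs=$(( $(nproc) / 2 ))
if [ "$jobs" -lt 1 ]; then jobs=1; fi
echo "make -j${jobs}"
# Running that in the unchanged build directory makes "make" compare
# timestamps and recompile only the objects whose sources changed.
```

ccache can also speed up full rebuilds considerably if you do end up
reconfiguring from scratch.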

HTH,
Brad

>
> Thanks before
>
>
> --
> Cheers,
>
> Agung Laksono
>
>

