Re: [ceph-users] ceph-deploy for Hammer

2015-05-28 Thread Garg, Pankaj
Hi Travis,

These binaries are hosted on Canonical servers and are only for Ubuntu. Until 
the latest Firefly patch release, 0.80.9, everything worked fine. I just tried 
the Hammer binaries, and they seem to be failing to load the erasure-coding 
libraries.
I have now built my own binaries and I was able to get the cluster up and 
running using ceph-deploy. 
You just have to skip the ceph installation step in ceph-deploy and instead do 
a manual install from the deb files. The rest worked fine.
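For reference, a rough sketch of what I did (hostnames and the deb location are 
just examples from my setup):

# on every node, install the locally built packages by hand
sudo dpkg -i /path/to/ceph-debs/*.deb
sudo apt-get -f install    # pull in any missing dependencies

# then run ceph-deploy as usual, just without 'ceph-deploy install'
ceph-deploy new ceph1 ceph2 ceph3
ceph-deploy mon create-initial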

Thanks
Pankaj

-Original Message-
From: Travis Rhoden [mailto:trho...@gmail.com] 
Sent: Thursday, May 28, 2015 8:02 AM
To: Garg, Pankaj
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph-deploy for Hammer

Hi Pankaj,

While there have been times in the past where ARM binaries were hosted on 
ceph.com, there is not currently any ARM hardware for builds.  I don't think 
you will see any ARM binaries in 
http://ceph.com/debian-hammer/pool/main/c/ceph/, for example.

Combine that with the fact that ceph-deploy is not intended to work with 
locally compiled binaries (only packages, as it relies on paths, conventions, 
and service definitions from the packages), and it is a very tricky combo to 
use ceph-deploy and ARM together.

Your most recent error is indicative of the ceph-mon service not coming up 
successfully. When ceph-mon (the service, not the daemon) is started, it also 
calls ceph-create-keys, which waits for the monitor daemon to come up and then 
creates the keys that are necessary for any cluster running with cephx (the 
admin key, the bootstrap keys).
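If you want to check that by hand on the monitor host (the hostname below is 
just an example), you can ask the monitor for its status over the admin socket 
and, if it is up, re-run the key creation:

sudo ceph --admin-daemon /var/run/ceph/ceph-mon.ceph1.asok mon_status
sudo ceph-create-keys --id ceph1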

 - Travis

On Wed, May 27, 2015 at 8:27 PM, Garg, Pankaj pankaj.g...@caviumnetworks.com 
wrote:
 Actually the ARM binaries do exist and I have been using them for previous 
 releases. Somehow this library is the one that doesn’t load.

 Anyway I did compile my own Ceph for ARM, and now I am getting the following 
 issue:



 [ceph_deploy.gatherkeys][WARNIN] Unable to find 
 /etc/ceph/ceph.client.admin.keyring on ceph1

 [ceph_deploy][ERROR ] KeyNotFoundError: Could not find keyring file:
 /etc/ceph/ceph.client.admin.keyring on host ceph1





 From: Somnath Roy [mailto:somnath@sandisk.com]
 Sent: Wednesday, May 27, 2015 4:29 PM
 To: Garg, Pankaj


 Cc: ceph-users@lists.ceph.com
 Subject: RE: ceph-deploy for Hammer



 If you are trying to install the ceph repo hammer binaries, I don’t 
 think they are built for ARM. Both the binaries and the .so need to be built 
 for ARM to make this work, I guess.

 Try to build hammer code base in your ARM server and then retry.



 Thanks & Regards

 Somnath



 From: Pankaj Garg [mailto:pankaj.g...@caviumnetworks.com]
 Sent: Wednesday, May 27, 2015 4:17 PM
 To: Somnath Roy
 Cc: ceph-users@lists.ceph.com
 Subject: RE: ceph-deploy for Hammer



 Yes I am on ARM.

 -Pankaj

 On May 27, 2015 3:58 PM, Somnath Roy somnath@sandisk.com wrote:

 Are you running this on ARM ?

 If not, it should not be trying to load this library.



 Thanks & Regards

 Somnath



 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf 
 Of Garg, Pankaj
 Sent: Wednesday, May 27, 2015 2:26 PM
 To: Garg, Pankaj; ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] ceph-deploy for Hammer



 I seem to be getting these errors in the Monitor Log :

 2015-05-27 21:17:41.908839 3ff907368e0 -1
 erasure_code_init(jerasure,/usr/lib/aarch64-linux-gnu/ceph/erasure-code):
 (5) Input/output error

 2015-05-27 21:17:41.978113 3ff969168e0  0 ceph version 0.94.1 
 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff), process ceph-mon, pid 
 16592

 2015-05-27 21:17:41.984383 3ff969168e0 -1 ErasureCodePluginSelectJerasure:
 load
 dlopen(/usr/lib/aarch64-linux-gnu/ceph/erasure-code/libec_jerasure_neon.so):
 /usr/lib/aarch64-linux-gnu/ceph/erasure-code/libec_jerasure_neon.so: 
 cannot open shared object file: No such file or directory

 2015-05-27 21:17:41.98 3ff969168e0 -1
 erasure_code_init(jerasure,/usr/lib/aarch64-linux-gnu/ceph/erasure-code):
 (5) Input/output error

 2015-05-27 21:17:42.052415 3ff90cf68e0  0 ceph version 0.94.1 
 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff), process ceph-mon, pid 
 16604

 2015-05-27 21:17:42.058656 3ff90cf68e0 -1 ErasureCodePluginSelectJerasure:
 load
 dlopen(/usr/lib/aarch64-linux-gnu/ceph/erasure-code/libec_jerasure_neon.so):
 /usr/lib/aarch64-linux-gnu/ceph/erasure-code/libec_jerasure_neon.so: 
 cannot open shared object file: No such file or directory

 2015-05-27 21:17:42.058715 3ff90cf68e0 -1
 erasure_code_init(jerasure,/usr/lib/aarch64-linux-gnu/ceph/erasure-code):
 (5) Input/output error

 2015-05-27 21:17:42.125279 3ffac4368e0  0 ceph version 0.94.1 
 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff), process ceph-mon, pid 
 16616

 2015-05-27 21:17:42.131666 3ffac4368e0 -1 ErasureCodePluginSelectJerasure:
 load
 dlopen(/usr/lib/aarch64-linux-gnu/ceph/erasure-code/libec_jerasure_neon.so):
 /usr/lib/aarch64-linux-gnu/ceph/erasure-code/libec_jerasure_neon.so: 
 cannot open shared object file: No such file or directory
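For completeness, to see what actually got installed into the plugin directory 
the log complains about (path taken straight from the log above):

ls -l /usr/lib/aarch64-linux-gnu/ceph/erasure-code/
dpkg -S /usr/lib/aarch64-linux-gnu/ceph/erasure-code/libec_jerasure_neon.so

In my case the NEON jerasure plugin was simply not present in the prebuilt 
packages, which matches the dlopen errors above.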

[ceph-users] TCP or UDP

2015-05-28 Thread Garg, Pankaj
Hi,
Does Ceph typically use TCP, UDP, or something else on the data path, both for 
connections to clients and for inter-OSD cluster traffic?

Thanks
Pankaj
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] TCP or UDP

2015-05-28 Thread Robert LeBlanc

TCP
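(Easy to confirm on any node: the Ceph daemons only open TCP sockets, so 
something like this shows plenty of TCP connections and nothing on UDP; adjust 
the process names as needed:

ss -tnp | grep -E 'ceph-osd|ceph-mon'
ss -unp | grep ceph
)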
- 
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, May 28, 2015 at 2:00 PM, Garg, Pankaj  wrote:
 Hi,

 Does ceph typically use TCP or UDP or something else for data path for
 connection to clients and inter OSD cluster traffic?



 Thanks

 Pankaj


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NFS interaction with RBD

2015-05-28 Thread John-Paul Robinson
To follow up on the original post,

Further digging indicates this is a problem with RBD image access and is
not related to NFS-RBD interaction as initially suspected.  The nfsd is
simply hanging as a result of a hung request to the XFS file system
mounted on our RBD-NFS gateway. This hung XFS call is caused by a
problem with the RBD module interacting with our Ceph pool.

I've found a reliable way to trigger a hang directly on an rbd image
mapped into our RBD-NFS gateway box.  The image contains an XFS file
system.  When I try to list the contents of a particular directory, the
request hangs indefinitely.

Two weeks ago our ceph status was:

jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova status
   health HEALTH_WARN 1 near full osd(s)
   monmap e1: 3 mons at

{da0-36-9f-0e-28-2c=172.16.171.6:6789/0,da0-36-9f-0e-2b-88=172.16.171.5:6789/0,da0-36-9f-0e-2b-a0=172.16.171.4:6789/0},
election epoch 350, quorum 0,1,2
da0-36-9f-0e-28-2c,da0-36-9f-0e-2b-88,da0-36-9f-0e-2b-a0
   osdmap e5978: 66 osds: 66 up, 66 in
pgmap v26434260: 3072 pgs: 3062 active+clean, 6
active+clean+scrubbing, 4 active+clean+scrubbing+deep; 45712 GB
data, 91590 GB used, 51713 GB / 139 TB avail; 12234B/s wr, 1op/s
   mdsmap e1: 0/0/1 up


The near full osd was number 53 and we updated our crush map to reweight
the osd.  All of the OSDs had a weight of 1 based on the assumption that
all osds were 2.0TB.  Apparently one of our servers had the OSDs sized to
2.8TB and this caused the OSD imbalance even though we are only at 50%
utilization.  We reweighted the near-full osd to 0.8 and that initiated a
rebalance that has since relieved the 95% full condition on that OSD.
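(For reference, the reweight itself was done with the standard command, along 
the lines of:

sudo ceph --id nova osd crush reweight osd.53 0.8
)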

However, since that time the re-peering has not completed and we suspect
this is causing problems with our access to RBD images.  Our current
ceph status is:

jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova status
   health HEALTH_WARN 1 pgs peering; 1 pgs stuck inactive; 4 pgs
stuck unclean; recovery 9/23842120 degraded (0.000%)
   monmap e1: 3 mons at

{da0-36-9f-0e-28-2c=172.16.171.6:6789/0,da0-36-9f-0e-2b-88=172.16.171.5:6789/0,da0-36-9f-0e-2b-a0=172.16.171.4:6789/0},
election epoch 350, quorum 0,1,2
da0-36-9f-0e-28-2c,da0-36-9f-0e-2b-88,da0-36-9f-0e-2b-a0
   osdmap e6036: 66 osds: 66 up, 66 in
pgmap v27104371: 3072 pgs: 3 active, 3056 active+clean, 9
active+clean+scrubbing, 1 remapped+peering, 3
active+clean+scrubbing+deep; 45868 GB data, 92006 GB used, 51297 GB
/ 139 TB avail; 3125B/s wr, 0op/s; 9/23842120 degraded (0.000%)
   mdsmap e1: 0/0/1 up


Here are further details on our stuck pgs:

jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova pg
dump_stuck inactive
ok
pg_stat objects mip degr unf bytes log disklog state state_stamp v reported up acting last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
3.3af   11600   0   0   0   47941791744 153812 
153812  remapped+peering2015-05-15 12:47:17.223786 
5979'293066  6000'1248735 [48,62] [53,48,62] 
5979'293056 2015-05-15 07:40:36.275563  5979'293056
2015-05-15 07:40:36.275563

jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova pg
dump_stuck unclean
ok
pg_stat objects mip degr unf bytes log disklog state state_stamp v reported up acting last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
3.106   11870   0   9   0   49010106368 163991 
163991  active  2015-05-15 12:47:19.761469  6035'356332
5968'1358516 [62,53]  [62,53] 5979'356242 2015-05-14
22:22:12.966150  5979'351351 2015-05-12 18:04:41.838686
5.104   0   0   0   0   0   0   0  
active  2015-05-15 12:47:19.800676  0'0 5968'1615  
[62,53] [62,53]   0'0 2015-05-14 18:43:22.425105 
0'0 2015-05-08 10:19:54.938934
4.105   0   0   0   0   0   0   0  
active  2015-05-15 12:47:19.801028  0'0 5968'1615  
[62,53] [62,53]   0'0 2015-05-14 18:43:04.434826 
0'0 2015-05-14 18:43:04.434826
3.3af   11600   0   0   0   47941791744 153812 
153812  remapped+peering2015-05-15 12:47:17.223786 
5979'293066  6000'1248735 [48,62] [53,48,62] 
5979'293056 2015-05-15 07:40:36.275563  5979'293056
2015-05-15 07:40:36.275563
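(To dig further into the pg that is stuck in remapped+peering, the standard 
query should show which OSDs it is waiting on:

sudo ceph --id nova pg 3.3af query
)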


The servers in the pool are not overloaded.  On the ceph server that
originally had the nearly full osd (osd.53), I'm seeing entries like
this in the osd log:

2015-05-28 06:25:02.900129 7f2ea8a4f700  0 log [WRN] : 6 slow
requests, 6 included below; oldest blocked for  1096430.805069 secs
2015-05-28 06:25:02.900145 7f2ea8a4f700  0 log [WRN] : slow request

Re: [ceph-users] NFS interaction with RBD

2015-05-28 Thread Georgios Dimitrakakis

Thanks a million for the feedback Christian!

I've tried to recreate the issue with 10 RBD volumes mounted on a 
single server, without success!


I've issued the mkfs.xfs commands simultaneously (or at least as fast 
as I could in different terminals) without noticing any problems. Can 
you please tell me what the size of each one of the RBD volumes was, 
because I have a feeling that mine were too small, and if so I have to 
test it on our bigger cluster.


I've also thought that, besides the QEMU version, the underlying OS might 
also matter, so what was your testbed?



All the best,

George


Hi George

In order to experience the error it was enough to simply run mkfs.xfs
on all the volumes.


In the meantime it became clear what the problem was:

 ~ ; cat /proc/183016/limits
...
Max open files            1024                 4096                 files

..

This can be changed by setting a decent value in
/etc/libvirt/qemu.conf for max_files.
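For example (the value is only an illustration, pick something that fits the 
number of volumes per VM):

# /etc/libvirt/qemu.conf
max_files = 32768

followed by a restart of libvirtd and of the affected guests so the new limit 
is picked up.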

Regards
Christian



On 27 May 2015, at 16:23, Jens-Christian Fischer
jens-christian.fisc...@switch.ch wrote:


George,

 I will let Christian provide you with the details. As far as I know, it 
 was enough to just do an ‘ls’ on all of the attached drives.


we are using Qemu 2.0:

$ dpkg -l | grep qemu
ii  ipxe-qemu           1.0.0+git-2013.c3d1e78-2ubuntu1  all    PXE boot firmware - ROM images for qemu
ii  qemu-keymaps        2.0.0+dfsg-2ubuntu1.11           all    QEMU keyboard maps
ii  qemu-system         2.0.0+dfsg-2ubuntu1.11           amd64  QEMU full system emulation binaries
ii  qemu-system-arm     2.0.0+dfsg-2ubuntu1.11           amd64  QEMU full system emulation binaries (arm)
ii  qemu-system-common  2.0.0+dfsg-2ubuntu1.11           amd64  QEMU full system emulation binaries (common files)
ii  qemu-system-mips    2.0.0+dfsg-2ubuntu1.11           amd64  QEMU full system emulation binaries (mips)
ii  qemu-system-misc    2.0.0+dfsg-2ubuntu1.11           amd64  QEMU full system emulation binaries (miscelaneous)
ii  qemu-system-ppc     2.0.0+dfsg-2ubuntu1.11           amd64  QEMU full system emulation binaries (ppc)
ii  qemu-system-sparc   2.0.0+dfsg-2ubuntu1.11           amd64  QEMU full system emulation binaries (sparc)
ii  qemu-system-x86     2.0.0+dfsg-2ubuntu1.11           amd64  QEMU full system emulation binaries (x86)
ii  qemu-utils          2.0.0+dfsg-2ubuntu1.11           amd64  QEMU utilities


cheers
jc

--
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch

http://www.switch.ch/stories

On 26.05.2015, at 19:12, Georgios Dimitrakakis 
gior...@acmac.uoc.gr wrote:



Jens-Christian,

 how did you test that? Did you just try to write to them 
 simultaneously? Any other tests that one can perform to verify that?


 In our installation we have a VM with 30 RBD volumes mounted which 
 are all exported via NFS to other VMs.
 No one has complained for the moment, but the load/usage is very 
 minimal.
 If this problem really exists, then very soon after the trial phase 
 is over we will have millions of complaints :-(


What version of QEMU are you using? We are using the one provided 
by Ceph in qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64.rpm


Best regards,

George


I think we (i.e. Christian) found the problem:

We created a test VM with 9 mounted RBD volumes (no NFS server). As
soon as he hit all disks, we started to experience these 120 second
timeouts. We realized that the QEMU process on the hypervisor is
opening a TCP connection to every OSD for every mounted volume -
exceeding the 1024 FD limit.

So no deep scrubbing etc, but simply too many connections…
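(A quick way to see how close a given VM is to the limit on the hypervisor; the 
process name and pid lookup are just an example:

pid=$(pgrep -f qemu-system-x86_64 | head -n1)
grep 'open files' /proc/$pid/limits
ls /proc/$pid/fd | wc -l
)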

cheers
jc

--
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch [3]
http://www.switch.ch

http://www.switch.ch/stories

On 25.05.2015, at 06:02, Christian Balzer  wrote:


Hello,

Let's compare your case with John-Paul's.

Different OS and Ceph versions (thus we can assume different NFS
versions
as well).
The only common thing is that both of you added OSDs and are 
likely

suffering from delays stemming from Ceph re-balancing or
deep-scrubbing.

Ceph logs will only pipe up when things have been blocked for 
more

than 30
seconds, NFS might take offense to lower values (or the 
accumulation

of
several distributed delays).

You added 23 OSDs, tell us more about your cluster, HW, network.
Were these added to the existing 16 nodes, are these on new 
storage

nodes
(so could there be something different with those nodes?), how 
busy

is 

Re: [ceph-users] Ceph MDS continually respawning (hammer)

2015-05-28 Thread Gregory Farnum
On Thu, May 28, 2015 at 1:04 AM, Kenneth Waegeman
kenneth.waege...@ugent.be wrote:


 On 05/27/2015 10:30 PM, Gregory Farnum wrote:

 On Wed, May 27, 2015 at 6:49 AM, Kenneth Waegeman
 kenneth.waege...@ugent.be wrote:

 We are also running a full backup sync to cephfs, using multiple
 distributed
 rsync streams (with zkrsync), and also ran in this issue today on Hammer
 0.94.1  .
 After setting the beacon higher, and eventually clearing the journal, it
 stabilized again.

 We were using ceph-fuse to mount the cephfs, not the ceph kernel client.


 What's your MDS cache size set to?

 I did set it to 100 before (we have 64G of ram for the mds) trying to
 get rid of the 'Client .. failing to respond to cache pressure' messages

Oh, that's definitely enough if one client is eating it all up to run
into this, without that patch I referenced. :)
-Greg

  Did you have any warnings in the

 ceph log about clients not releasing caps?

 Unfortunately I lost the logs from before it happened. But nothing in the new
 logs about that; I will follow this up.


 I think you could hit this in ceph-fuse as well on hammer, although we
 just merged in a fix: https://github.com/ceph/ceph/pull/4653
 -Greg


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Chinese Language List

2015-05-28 Thread Patrick McGarry
On Thu, May 28, 2015 at 12:59 AM, kefu chai tchai...@gmail.com wrote:
 On Wed, May 27, 2015 at 3:36 AM, Patrick McGarry pmcga...@redhat.com wrote:
 Due to popular demand we are expanding the Ceph lists to include a
 Chinese-language list to allow for direct communications for all of
 our friends in China.

 ceph...@lists.ceph.com

 It was decided that there are many fragmented discussions going on in
 the region due to unfamiliarity or discomfort with English. Hopefully
 this will allow for smooth communications between anyone in China that
 is interested in Ceph!

 that's great news! Patrick, could you please also update
 https://ceph.com/resources/mailing-list-irc/ ?


done and done


 I would greatly appreciate it if important
 messages/announcements/questions could be translated and forwarded by
 anyone that is able to translate them so that the greater community
 can still benefit. Thanks.

 will try to proxy some of the traffic here =)


awesome, thank you



 --

 Best Regards,

 Patrick McGarry
 Director Ceph Community || Red Hat
 http://ceph.com  ||  http://community.redhat.com
 @scuttlemonkey || @ceph
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html



 --
 Regards
 Kefu Chai
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] stuck degraded, undersized

2015-05-28 Thread Christian Balzer

Hello,

google is your friend, this comes up every month at least, if not more
frequently.

Your default size (replica) is 2, the default CRUSH rule you quote at the
very end of your mail delineates failure domains on the host level (quite
rightly so).
So with 2 replicas (quite dangerous with disk) you will need to have at
least 2 storage nodes.
Or change the CRUSH rule to allow them to be placed on the same host.

Christian

On Fri, 29 May 2015 10:48:04 +0800 Doan Hartono wrote:

 Hi ceph experts,
 
 I just freshly deployed ceph 0.94.1 with one monitor and one storage 
 node containing 4 disks. But ceph health shows pgs stuck in degraded, 
 unclean, and undersized. Any idea how to resolve this issue to get 
 active+clean state?
 
 ceph health
 HEALTH_WARN 27 pgs degraded; 27 pgs stuck degraded; 128 pgs stuck 
 unclean; 27 pgs stuck undersized; 27 pgs undersized
 
 ceph status
  cluster 6a8291d4-a3b8-475b-ad6c-c73895228762
   health HEALTH_WARN
  27 pgs degraded
  27 pgs stuck degraded
  128 pgs stuck unclean
  27 pgs stuck undersized
  27 pgs undersized
   monmap e1: 1 mons at {ceph-mon=10.0.0.154:6789/0}
  election epoch 2, quorum 0 ceph-mon
   osdmap e38: 4 osds: 4 up, 4 in; 101 remapped pgs
pgmap v63: 128 pgs, 1 pools, 0 bytes data, 0 objects
  135 MB used, 7428 GB / 7428 GB avail
73 active+remapped
28 active
27 active+undersized+degraded
 
 I set pg num and pgp num to 128 following ceph recommendation in the 
 documentation
 
 [global]
 fsid = 6a8291d4-a3b8-475b-ad6c-c73895228762
 mon_initial_members = ceph-mon
 mon_host = x
 auth_cluster_required = cephx
 auth_service_required = cephx
 auth_client_required = cephx
 filestore_xattr_use_omap = true
 osd pool default size = 2
 osd pool default pg num = 128
 osd pool default pgp num = 128
 
 I have set rbd pool's pg_num and pgp_num to 128.
 $ ceph osd pool get rbd pg_num
 pg_num: 128
 $ ceph osd pool get rbd pgp_num
 pgp_num: 128
 $ ceph osd pool get rbd size
 size: 2
 
 I have tried modifying crush tunables as well
 ceph osd crush tunables legacy
 ceph osd crush tunables optimal
 but no effect on ceph health
 
 Crush map:
 
 # begin crush map
 tunable choose_local_tries 0
 tunable choose_local_fallback_tries 0
 tunable choose_total_tries 50
 tunable chooseleaf_descend_once 1
 tunable chooseleaf_vary_r 1
 tunable straw_calc_version 1
 tunable allowed_bucket_algs 54
 
 # devices
 device 0 osd.0
 device 1 osd.1
 device 2 osd.2
 device 3 osd.3
 
 # types
 type 0 osd
 type 1 host
 type 2 chassis
 type 3 rack
 type 4 row
 type 5 pdu
 type 6 pod
 type 7 room
 type 8 datacenter
 type 9 region
 type 10 root
 
 # buckets
 host research10-pc {
  id -2   # do not change unnecessarily
  # weight 7.240
  alg straw
  hash 0  # rjenkins1
  item osd.0 weight 1.810
  item osd.1 weight 1.810
  item osd.2 weight 1.810
  item osd.3 weight 1.810
 }
 root default {
  id -1   # do not change unnecessarily
  # weight 7.240
  alg straw
  hash 0  # rjenkins1
  item research10-pc weight 7.240
 }
 
 # rules
 rule replicated_ruleset {
  ruleset 0
  type replicated
  min_size 1
  max_size 10
  step take default
  step chooseleaf firstn 0 type host
  step emit
 }
 
 Regards,
 Doan
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] stuck degraded, undersized

2015-05-28 Thread Doan Hartono

Hi Christian,

Based on your feedback, I modified the CRUSH map:

step chooseleaf firstn 0 type host
to
step chooseleaf firstn 0 type osd

And then I compiled and set it, and voilà, health is OK now. Thanks so much!
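(For reference, the edit/compile/set cycle was roughly the standard one:

ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# edit crush.txt: chooseleaf ... type host -> type osd
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new
)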

ceph health
HEALTH_OK

Regards,
Doan


On 05/29/2015 10:53 AM, Christian Balzer wrote:

Hello,

google is your friend, this comes up every month at least, if not more
frequently.

Your default size (replica) is 2, the default CRUSH rule you quote at the
very end of your mail delineates failure domains on the host level (quite
rightly so).
So with 2 replicas (quite dangerous with disk) you will need to have at
least 2 storage nodes.
Or change the CRUSH rule to allow them to be placed on the same host.

Christian

On Fri, 29 May 2015 10:48:04 +0800 Doan Hartono wrote:


Hi ceph experts,

I just freshly deployed ceph 0.94.1 with one monitor and one storage
node containing 4 disks. But ceph health shows pgs stuck in degraded,
unclean, and undersized. Any idea how to resolve this issue to get
active+clean state?

ceph health
HEALTH_WARN 27 pgs degraded; 27 pgs stuck degraded; 128 pgs stuck
unclean; 27 pgs stuck undersized; 27 pgs undersized

ceph status
  cluster 6a8291d4-a3b8-475b-ad6c-c73895228762
   health HEALTH_WARN
  27 pgs degraded
  27 pgs stuck degraded
  128 pgs stuck unclean
  27 pgs stuck undersized
  27 pgs undersized
   monmap e1: 1 mons at {ceph-mon=10.0.0.154:6789/0}
  election epoch 2, quorum 0 ceph-mon
   osdmap e38: 4 osds: 4 up, 4 in; 101 remapped pgs
pgmap v63: 128 pgs, 1 pools, 0 bytes data, 0 objects
  135 MB used, 7428 GB / 7428 GB avail
73 active+remapped
28 active
27 active+undersized+degraded

I set pg num and pgp num to 128 following ceph recommendation in the
documentation

[global]
fsid = 6a8291d4-a3b8-475b-ad6c-c73895228762
mon_initial_members = ceph-mon
mon_host = x
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd pool default size = 2
osd pool default pg num = 128
osd pool default pgp num = 128

I have set rbd pool's pg_num and pgp_num to 128.
$ ceph osd pool get rbd pg_num
pg_num: 128
$ ceph osd pool get rbd pgp_num
pgp_num: 128
$ ceph osd pool get rbd size
size: 2

I have tried modifying crush tunables as well
ceph osd crush tunables legacy
ceph osd crush tunables optimal
but no effect on ceph health

Crush map:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host research10-pc {
  id -2   # do not change unnecessarily
  # weight 7.240
  alg straw
  hash 0  # rjenkins1
  item osd.0 weight 1.810
  item osd.1 weight 1.810
  item osd.2 weight 1.810
  item osd.3 weight 1.810
}
root default {
  id -1   # do not change unnecessarily
  # weight 7.240
  alg straw
  hash 0  # rjenkins1
  item research10-pc weight 7.240
}

# rules
rule replicated_ruleset {
  ruleset 0
  type replicated
  min_size 1
  max_size 10
  step take default
  step chooseleaf firstn 0 type host
  step emit
}

Regards,
Doan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] stuck degraded, undersized

2015-05-28 Thread Doan Hartono

Hi ceph experts,

I just freshly deployed ceph 0.94.1 with one monitor and one storage 
node containing 4 disks. But ceph health shows pgs stuck in degraded, 
unclean, and undersized. Any idea how to resolve this issue to get 
active+clean state?


ceph health
HEALTH_WARN 27 pgs degraded; 27 pgs stuck degraded; 128 pgs stuck 
unclean; 27 pgs stuck undersized; 27 pgs undersized


ceph status
cluster 6a8291d4-a3b8-475b-ad6c-c73895228762
 health HEALTH_WARN
27 pgs degraded
27 pgs stuck degraded
128 pgs stuck unclean
27 pgs stuck undersized
27 pgs undersized
 monmap e1: 1 mons at {ceph-mon=10.0.0.154:6789/0}
election epoch 2, quorum 0 ceph-mon
 osdmap e38: 4 osds: 4 up, 4 in; 101 remapped pgs
  pgmap v63: 128 pgs, 1 pools, 0 bytes data, 0 objects
135 MB used, 7428 GB / 7428 GB avail
  73 active+remapped
  28 active
  27 active+undersized+degraded

I set pg num and pgp num to 128 following ceph recommendation in the 
documentation


[global]
fsid = 6a8291d4-a3b8-475b-ad6c-c73895228762
mon_initial_members = ceph-mon
mon_host = x
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd pool default size = 2
osd pool default pg num = 128
osd pool default pgp num = 128

I have set rbd pool's pg_num and pgp_num to 128.
$ ceph osd pool get rbd pg_num
pg_num: 128
$ ceph osd pool get rbd pgp_num
pgp_num: 128
$ ceph osd pool get rbd size
size: 2

I have tried modifying crush tunables as well
ceph osd crush tunables legacy
ceph osd crush tunables optimal
but no effect on ceph health

Crush map:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host research10-pc {
id -2   # do not change unnecessarily
# weight 7.240
alg straw
hash 0  # rjenkins1
item osd.0 weight 1.810
item osd.1 weight 1.810
item osd.2 weight 1.810
item osd.3 weight 1.810
}
root default {
id -1   # do not change unnecessarily
# weight 7.240
alg straw
hash 0  # rjenkins1
item research10-pc weight 7.240
}

# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

Regards,
Doan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph on RHEL7.0

2015-05-28 Thread Luke Kao
Hi Bruce,
The RHEL 7.0 kernel has many issues in its filesystem submodules, and most of them are 
fixed only in RHEL 7.1.
So you should consider going to RHEL 7.1 directly and upgrading to at least kernel 
3.10.0-229.1.2.


BR,
Luke


From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Bruce 
McFarland [bruce.mcfarl...@taec.toshiba.com]
Sent: Friday, May 29, 2015 5:13 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Ceph on RHEL7.0

We’re planning on moving from Centos6.5 to RHEL7.0 for Ceph storage and monitor 
nodes. Are there any known issues using RHEL7.0?
Thanks



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Blocked requests/ops?

2015-05-28 Thread Christian Balzer

Hello,

On Thu, 28 May 2015 12:05:03 +0200 Xavier Serrano wrote:

 On Thu May 28 11:22:52 2015, Christian Balzer wrote:
 
   We are testing different scenarios before making our final decision
   (cache-tiering, journaling, separate pool,...).
  
  Definitely a good idea to test things out and get an idea what Ceph and
  your hardware can do.
  
  From my experience and reading this ML however I think your best bet
  (overall performance) is to use those 4 SSDs a 1:5 journal SSDs for
  your 20 OSDs HDDs.
  
  Currently cache-tiering is probably the worst use for those SSD
  resources, though the code and strategy is of course improving.
  
 I agree: in our particular enviroment, our tests also conclude that
 SSD journaling performs far better than cache-tiering, especially when
 cache becomes close to its capacity and data movement between cache
 and backing storage occurs frequently.

Precisely.
 
 We also want to test if it is possible to use SSD disks as a
 transparent cache for the HDDs at system (Linux kernel) level, and how
 reliable/good it is.
 
There are quite a number of threads about this here, some quite
recent/current. 
They range from not worth it (i.e. about the same performance as journal
SSDs) to xyz-cache destroyed my data, ate my babies and set the house on
fire (i.e. massive reliability problems).

Which is a pity, as in theory they look like a nice fit/addition to Ceph.

  Dedicated SSD pools may be a good fit depending on your use case.
  However I'd advise against mixing SSD and HDD OSDs on the same node.
  To fully utilize those SSDs you'll need a LOT more CPU power than
  required by HDD OSDs or SSD journals/HDD OSDs systems. 
  And you already have 20 OSDs in that box.
 
 Good point! We did not consider that, thanks for pointing it out.
 
  What CPUs do you have in those storage nodes anyway?
  
 Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz, according to /proc/cpuinfo.
 We have only 1 CPU per osd node, so I'm afraid we have another
 potential bottleneck here.
 
Oh dear, about 10GHz (that CPU is supposedly 2.4, but you may see the
2.5 because it already is in turbo mode) for 20 OSDs.
Whereas the recommendation for HDD-only OSDs is about 1GHz per OSD.

Fire up atop (large window so you can see all the details and devices) on
one of your storage nodes.

Then from a client (VM) run this:
---
fio --size=8G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 
--rw=randwrite --name=fiojob --blocksize=4M --iodepth=32
---
This should result in your disks (OSDs) getting busy to the point of 100%
utilization, but your CPU still having some idle (that's idle AND wait
combined).

If you change the blocksize to 4K (and just ctrl-c fio after 30 or so
seconds) you should see a very different picture, with the CPU being much
busier and the HDDs seeing less than 100% usage.

That will become even more pronounced with faster HDDs and/or journal SSDs.

And pure SSD clusters/pools are way above that in terms of CPU hunger.

  If you have the budget, I'd deploy the current storage nodes in classic
  (SSDs for journals) mode and add a small (2x 8-12 SSDs) pair of pure
  SSD nodes, optimized for their task (more CPU power, faster network).
  
  Then use those SSD nodes to experiment with cache-tiers and pure SSD
  pools and switch over things when you're comfortable with this and
  happy with the performance. 
   
   
However with 20 OSDs per node, you're likely to go from being
bottlenecked by your HDDs to being CPU limited (when dealing with
lots of small IOPS at least).
Still, better than now for sure.

   This is very interesting, thanks for pointing it out!
   What would you suggest to use in order to identify the actual
   bottleneck? (disk, CPU, RAM, etc.). Tools like munin?
   
  Munin might work, I use collectd to gather all those values (and even
  more importantly all Ceph counters) and graphite to visualize it.
  For ad-hoc, on the spot analysis I really like atop (in a huge window),
  which will make it very clear what is going on.
  
   In addition, there are some kernel tunables that may be helpful
   to improve overall performance. Maybe we are filling some kernel
   internals and that limits our results (for instance, we had to
   increase fs.aio-max-nr in sysctl.d to 262144 to be able to use 20
   disks per host). Which tunables should we observe?
   
  I'm no expert for large (not even medium) clusters, so you'll have to
  research the archives and net (the CERN Ceph slide is nice).
  One thing I remember is kernel.pid_max, which is something you're
  likely to run into at some point with your dense storage nodes:
  http://ceph.com/docs/master/start/hardware-recommendations/#additional-considerations
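  For example, something like this in /etc/sysctl.d/ (file name and values are
  only examples; the aio value is the one you already mentioned):

  # /etc/sysctl.d/90-ceph.conf
  kernel.pid_max = 4194303
  fs.aio-max-nr = 262144

  and then 'sysctl --system' (or a reboot) to apply it.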
  
  Christian
 
 All you say is really interesting. Thanks for your valuable advice.
 We surely still have plenty of things to learn and test before going
 to production.
 
As long as you have the time to test out things, you'll be fine. ^_^

Christian

 Thanks again for your 

Re: [ceph-users] NFS interaction with RBD

2015-05-28 Thread Trent Lloyd
Jens-Christian Fischer jens-christian.fischer@... writes:

 
 I think we (i.e. Christian) found the problem:
 We created a test VM with 9 mounted RBD volumes (no NFS server). As soon as 
he hit all disks, we started to experience these 120 second timeouts. We 
realized that the QEMU process on the hypervisor is opening a TCP connection 
to every OSD for every mounted volume - exceeding the 1024 FD limit.
 
 So no deep scrubbing etc, but simply to many connections…

Have seen mention of similar from CERN in their presentations, found this 
post on a quick google.. might help?

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-December/026187.html

Cheers,
Trent
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD trashed by simple reboot (Debian Jessie, systemd?)

2015-05-28 Thread Christian Balzer

Hello Greg,

On Wed, 27 May 2015 22:53:43 -0700 Gregory Farnum wrote:

 The description of the logging abruptly ending and the journal being
 bad really sounds like part of the disk is going back in time. I'm not
 sure if XFS internally is set up in such a way that something like
 losing part of its journal would allow that?
 
I'm special. ^o^
No XFS, EXT4. As stated in the original thread, below.
And the (OSD) journal is a raw partition on a DC S3700.

And since there was at least a 30 seconds pause between the completion of
the /etc/init.d/ceph stop and issuing of the shutdown command, the
logging abruptly ending seems to be unlikely related to the shutdown at
all.

 If any of the OSD developers have the time it's conceivable a copy of
 the OSD journal would be enlightening (if e.g. the header offsets are
 wrong but there are a bunch of valid journal entries), but this is two
 reports of this issue from you and none very similar from anybody
 else. I'm still betting on something in the software or hardware stack
 misbehaving. (There aren't that many people running Debian; there are
 lots of people running Ubuntu and we find bad XFS kernels there not
 infrequently; I think you're hitting something like that.)
 
There should be no file system involved with the raw partition SSD
journal, n'est-ce pas?

The hardware is vastly different, the previous case was on an AMD
system with onboard SATA (SP5100), this one is a SM storage goat with LSI
3008.

The only thing they have in common is the Ceph version 0.80.7 (via the
Debian repository, not Ceph) and Debian Jessie as OS with kernel 3.16
(though there were minor updates on that between those incidents,
backported fixes)
 
A copy of the journal would consist of the entire 10GB partition, since we
don't know where in the loop it was at the time, right?

Christian
 
 On Sun, May 24, 2015 at 7:26 PM, Christian Balzer ch...@gol.com wrote:
 
  Hello again (marvel at my elephantine memory and thread necromancy)
 
  Firstly, this happened again, details below.
  Secondly, as I changed things to sysv-init AND did a /etc/init.d/ceph
  stop which dutifully listed all OSDs as being killed/stopped BEFORE
  rebooting the node.
 
  This is completely new node with significantly different HW than the
  example below.
  But the same SW versions as before (Debian Jessie, Ceph 0.80.7).
  And just like below/before the logs for that OSD have nothing in them
  indicating it did shut down properly (no journal flush done) and when
  coming back on reboot we get the dreaded:
  ---
  2015-05-25 10:32:55.439492 7f568aa157c0  1 journal
  _open /var/lib/ceph/osd/ceph-30/journal fd 23: 1269312 bytes,
  block size 4096 bytes, directio = 1, aio = 1 2015-05-25
  10:32:55.439859 7f568aa157c0 -1 journal read_header error decoding
  journal header 2015-05-25 10:32:55.439905 7f568aa157c0 -1
  filestore(/var/lib/ceph/osd/ceph-30) mount failed to open
  journal /var/lib/ceph/osd/ceph-30/journal: (22) Invalid argument
  2015-05-25 10:32:55.936975 7f568aa157c0 -1 osd.30 0 OSD:init: unable
  to mount object store ---
 
  I see nothing in the changelogs for 0.80.8 and .9 that seems related to
  this, never mind that from the looks of it the repository at Ceph has
  only Wheezy (bpo70) packages and Debian Jessie is still stuck at
  0.80.7 (Sid just went to .9 last week)
 
  I'm preserving the state of things as they are for a few days, so if
  any developer would like a peek or more details, speak up now.
 
  I'd open an issue, but I don't have a reliable way to reproduce this
  and even less desire to do so on this production cluster. ^_-
 
  Christian
 
  On Sat, 6 Dec 2014 12:48:25 +0900 Christian Balzer wrote:
 
  On Fri, 5 Dec 2014 11:23:19 -0800 Gregory Farnum wrote:
 
   On Thu, Dec 4, 2014 at 7:03 PM, Christian Balzer ch...@gol.com
   wrote:
   
Hello,
   
This morning I decided to reboot a storage node (Debian Jessie,
thus 3.16 kernel and Ceph 0.80.7, HDD OSDs with SSD journals)
after applying some changes.
   
It came back up one OSD short, the last log lines before the
reboot are: ---
2014-12-05 09:35:27.700330 7f87e789c700  2 --
10.0.8.21:6823/29520  10.0.8.22:0/5161 pipe(0x7f881b772580
sd=247 :6823 s=2 pgs=21 cs=1 l=1 c=0x7f881f469020).fault (0)
Success 2014-12-05 09:35:27.700350 7f87f011d700 10 osd.4
pg_epoch: 293 pg[3.316( v 289'1347 (0'0,289'1347] local-les=289
n=8 ec=5 les/c 289/289 288/288/288) [8,4,16] r=1 lpr=288
pi=276-287/1 luod=0'0 crt=289'1345 lcod 289'1346 active]
cancel_copy_ops ---
   
Quite obviously it didn't complete its shutdown, so
unsurprisingly we get: ---
2014-12-05 09:37:40.278128 7f218a7037c0  1 journal
_open /var/lib/ceph/osd/ceph-4/journal fd 24: 1269312 bytes,
block size 4096 bytes, directio = 1, aio = 1 2014-12-05
09:37:40.278427 7f218a7037c0 -1 journal read_header error decoding
journal header 2014-12-05 09:37:40.278479 7f218a7037c0 -1

Re: [ceph-users] Cache Pool Flush/Eviction Limits - Hard of Soft?

2015-05-28 Thread Nick Fisk
Hi Greg,

That is really great, thanks for your response, I completely understand what is 
going on now. I wasn't thinking about capacity in a per PG sense.

I have exported a pg dump of the cache pool and calculated some percentages, and 
I can see that the data can vary by up to around 5% amongst the PGs, so this 
probably ties in with there being isolated bursts on single OSDs.

I've knocked the cache_target_full_ratio down by 10% and will see if that helps.
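(i.e. the plain pool setting; the pool name below is just a placeholder:

ceph osd pool set cache-pool cache_target_full_ratio 0.7
)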

FYI Regarding my 2nd point about having high and low ratios for the cache 
eviction/flushing. I have been speaking to Li Wang and he is potentially 
interested in developing a prototype.

Thanks Again,
Nick

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Gregory Farnum
 Sent: 27 May 2015 22:02
 To: Nick Fisk
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Cache Pool Flush/Eviction Limits - Hard of Soft?
 
 The max target limit is a hard limit: the OSDs won't let more than that amount
 of data in the cache tier. They will start flushing and evicting based on the
 percentage ratios you can set (I don't remember the exact parameter
 names) and you may need to set these more aggressively for your given
 workload.
 
 The tricky bit with this is that of course the OSDs don't have global 
 knowledge
 about how much total data is in the cache — so when you set a 100TB cache
 that has 1024 PGs, the OSDs are actually applying those limits on a per-PG
 basis, and not letting any given PG use more than
 100/1024 TB. This is probably the heavy read activity you're seeing on one
 OSD at a time, when it happens to reach the hard limit. :/
 
 The specific blocked ops you're seeing are in various stages and probably just
 indicative of the OSD doing a bunch of flushing which is blocking other
 accesses.
 -Greg
 
 On Tue, May 19, 2015 at 12:03 PM, Nick Fisk n...@fisk.me.uk wrote:
  Been doing some more digging. I'm getting messages in the OSD logs
  like these, don't know if these are normal or a clue to something not
  right
 
  2015-05-19 18:36:27.664698 7f58b91dd700  0 log_channel(cluster) log [WRN]
 :
  slow request 30.346117 seconds old, received at 2015-05-19
 18:35:57.318208:
  osd_repop(client.1205463.0:7612211 6.2f
  ec3d412f/rb.0.6e7a9.74b0dc51.000be050/head//6 v 2674'1102892)
  currently commit_sent
 
  2015-05-19 17:50:29.700766 7ff1503db700  0 log_channel(cluster) log [WRN]
 :
  slow request 32.548750 seconds old, received at 2015-05-19
 17:49:57.151935:
  osd_repop_reply(osd.46.0:2088048 6.64 ondisk, result = 0) currently no
  flag points reached
 
  2015-05-19 17:47:26.903122 7f296b6fc700  0 log_channel(cluster) log [WRN]
 :
  slow request 30.620519 seconds old, received at 2015-05-19
 17:46:56.282504:
  osd_op(client.1205463.0:7261972 rb.0.6e7a9.74b0dc51.000b7ff9
  [set-alloc-hint object_size 1048576 write_size 1048576,write
  258048~131072] 6.882797bc ack+ondisk+write+known_if_redirected
 e2674)
  currently commit_sent
 
 
 
 
 
  -Original Message-
  From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
  Of Nick Fisk
  Sent: 18 May 2015 17:25
  To: ceph-users@lists.ceph.com
  Subject: Re: [ceph-users] Cache Pool Flush/Eviction Limits - Hard of Soft?
 
  Just to update on this, I've been watching iostat across my Ceph
  nodes and
  I
  can see something slightly puzzling happening and is most likely the
  cause
  of
  the slow (32s) requests I am getting.
 
  During a client write-only IO stream, I see reads and writes to the
  cache
  tier,
  which is normal as blocks are being promoted/demoted. The latency
  does suffer, but not excessively and is acceptable for data that has
  fallen out
  of
  cache.
 
  However, every now and again it appears that one of the OSD's
  suddenly
  just
  starts aggressively reading and appears to block any IO until that
  read
  has
  finished. Example below where /dev/sdd is a 10K disk in the cache tier.
  All
  other nodes have their /dev/sdd devices being completely idle during
  this period. The disks on the base tier seem to be doing writes
  during this
  period,
  so looks related to some sort of flushing.
 
  Device   rrqm/s  wrqm/s  r/s     w/s   rkB/s    wkB/s  rq-sz  qu-sz  await  r_wait  w_wait  svctm  util
  sdd      0.00    0.00    471.50  0.00  2680.00  0.00   11.37  0.96   2.03   2.03    0.00    1.90   89.80
 
  Most of the times I observed this whilst I was watching iostat, the
  read
  only
  lasted around 5-10s, but I suspect that sometimes it is going on for
  longer and
  is the cause of the requests are blocked errors. I have also
  noticed
  that this
  appears to happen more often depending on if there are a greater
  number of blocks to be promoted/demoted. Other pools are not affected
  during these hangs.
 
  From the look of the iostat stats, I would assume that for a 10k
  disk, it
  must
  be doing a sequential read to get that number of IO's.
 
  Does anybody have any clue what 

Re: [ceph-users] OSD trashed by simple reboot (Debian Jessie, systemd?)

2015-05-28 Thread Christian Balzer
On Thu, 28 May 2015 10:32:18 +0200 Jan Schermer wrote:

 Can you check the capacitor reading on the S3700 with smartctl ? 

I suppose you mean this?
---
175 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       648 (2 2862)
---

Never mind that these are brand new.

This drive has non-volatile cache which *should* get flushed when power is
lost; depending on what the hardware does on reboot, it might get flushed
even when rebooting.

That would probably trigger an increase in the unsafe shutdown count
SMART value. 
I will have to test that from a known starting point, since the current
values are likely from earlier tests and actual shutdowns. 
I'd be surprised if a reboot would drop power to the drives, but it is a
possibility of course.

However I'm VERY unconvinced that this could result in data loss, with the
SSDs in perfect CAPS health. 

I just got this drive for testing yesterday and
 it’s a beast, but some things were peculiar - for example my fio
 benchmark slowed down (35K IOPS - 5K IOPS) after several GB (random -
 5-40) written, and then it would creep back up over time even under
 load. Disabling write cache helps, no idea why.
 
I haven't seen that behavior with DC S3700s, but with 5xx ones and
some Samsung, yes.

Christian

 Z.
 
 
  On 28 May 2015, at 09:22, Christian Balzer ch...@gol.com wrote:
  
  
  Hello Greg,
  
  On Wed, 27 May 2015 22:53:43 -0700 Gregory Farnum wrote:
  
  The description of the logging abruptly ending and the journal being
  bad really sounds like part of the disk is going back in time. I'm not
  sure if XFS internally is set up in such a way that something like
  losing part of its journal would allow that?
  
  I'm special. ^o^
  No XFS, EXT4. As stated in the original thread, below.
  And the (OSD) journal is a raw partition on a DC S3700.
  
  And since there was at least a 30 seconds pause between the completion
  of the /etc/init.d/ceph stop and issuing of the shutdown command, the
  logging abruptly ending seems to be unlikely related to the shutdown at
  all.
  
  If any of the OSD developers have the time it's conceivable a copy of
  the OSD journal would be enlightening (if e.g. the header offsets are
  wrong but there are a bunch of valid journal entries), but this is two
  reports of this issue from you and none very similar from anybody
  else. I'm still betting on something in the software or hardware stack
  misbehaving. (There aren't that many people running Debian; there are
  lots of people running Ubuntu and we find bad XFS kernels there not
  infrequently; I think you're hitting something like that.)
  
  There should be no file system involved with the raw partition SSD
  journal, n'est-ce pas?
  
  The hardware is vastly different, the previous case was on an AMD
  system with onboard SATA (SP5100), this one is a SM storage goat with
  LSI 3008.
  
  The only thing they have in common is the Ceph version 0.80.7 (via the
  Debian repository, not Ceph) and Debian Jessie as OS with kernel 3.16
  (though there were minor updates on that between those incidents,
  backported fixes)
  
  A copy of the journal would consist of the entire 10GB partition,
  since we don't know where in loop it was at the time, right?
  
  Christian
  
  On Sun, May 24, 2015 at 7:26 PM, Christian Balzer ch...@gol.com
  wrote:
  
  Hello again (marvel at my elephantine memory and thread necromancy)
  
  Firstly, this happened again, details below.
  Secondly, as I changed things to sysv-init AND did a
  /etc/init.d/ceph stop which dutifully listed all OSDs as being
  killed/stopped BEFORE rebooting the node.
  
  This is completely new node with significantly different HW than the
  example below.
  But the same SW versions as before (Debian Jessie, Ceph 0.80.7).
  And just like below/before the logs for that OSD have nothing in them
  indicating it did shut down properly (no journal flush done) and
  when coming back on reboot we get the dreaded:
  ---
  2015-05-25 10:32:55.439492 7f568aa157c0  1 journal
  _open /var/lib/ceph/osd/ceph-30/journal fd 23: 1269312 bytes,
  block size 4096 bytes, directio = 1, aio = 1 2015-05-25
  10:32:55.439859 7f568aa157c0 -1 journal read_header error decoding
  journal header 2015-05-25 10:32:55.439905 7f568aa157c0 -1
  filestore(/var/lib/ceph/osd/ceph-30) mount failed to open
  journal /var/lib/ceph/osd/ceph-30/journal: (22) Invalid argument
  2015-05-25 10:32:55.936975 7f568aa157c0 -1 osd.30 0 OSD:init: unable
  to mount object store ---
  
  I see nothing in the changelogs for 0.80.8 and .9 that seems related
  to this, never mind that from the looks of it the repository at Ceph
  has only Wheezy (bpo70) packages and Debian Jessie is still stuck at
  0.80.7 (Sid just went to .9 last week)
  
  I'm preserving the state of things as they are for a few days, so if
  any developer would like a peek or more details, speak up now.
  
  I'd open an issue, but I don't have a reliable way to reproduce this
  

[ceph-users] umount stuck on NFS gateways switch over by using Pacemaker

2015-05-28 Thread WD_Hwang
Hello,
  I have been testing NFS over RBD recently. I am trying to build an NFS HA 
environment under Ubuntu 14.04 for testing; the package version 
information is as follows:
- Ubuntu 14.04 : 3.13.0-32-generic (Ubuntu 14.04.2 LTS)
- ceph : 0.80.9-0ubuntu0.14.04.2
- ceph-common : 0.80.9-0ubuntu0.14.04.2
- pacemaker (git20130802-1ubuntu2.3)
- corosync (2.3.3-1ubuntu1)
PS: I also tried ceph/ceph-common (0.87.1-1trusty and 0.87.2-1trusty) on a 
3.13.0-48-generic (Ubuntu 14.04.2) server and got the same results.

  The environment has 5 nodes in the Ceph cluster (3 MONs and 5 OSDs) and two 
NFS gateways (nfs1 and nfs2) for high availability. I issued the command 'sudo 
service pacemaker stop' on 'nfs1' to force these resources to stop and be 
transferred to 'nfs2', and vice versa.

When the two nodes are up and I issue 'sudo service pacemaker stop' on one node, 
the other node takes over all resources. Everything looks fine. Then I waited 
about 30 minutes and did nothing to the NFS gateways. I repeated the previous 
steps to test the failover procedure and found that the process state of 'umount' 
is 'D' (uninterruptible sleep); 'ps' showed the following result:

root 21047 0.0 0.0 17412 952 ? D 16:39 0:00 umount /mnt/block1

Does anyone have an idea how to solve or work around this? Because 'umount' is 
stuck, neither 'reboot' nor 'shutdown' works well. So unless I wait 20 minutes 
for the 'umount' to time out, the only thing I can do is power off the server directly.
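(In case it helps with debugging: while 'umount' is in D state, its kernel stack 
and the hung-task messages should show where it is blocked; pid taken from the 
ps output above:

cat /proc/21047/stack
dmesg | tail -n 50
)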
Any help would be much appreciated.

I attached my configurations and loggings as follows.


Pacemaker configurations:

crm configure primitive p_rbd_map_1 ocf:ceph:rbd.in \
params user=admin pool=block_data name=data01 
cephconf=/etc/ceph/ceph.conf \
op monitor interval=10s timeout=20s

crm configure primitive p_fs_rbd_1 ocf:heartbeat:Filesystem \
params directory=/mnt/block1 fstype=xfs device=/dev/rbd1 \
fast_stop=no options=noatime,nodiratime,nobarrier,inode64 \
op monitor interval=20s timeout=40s \
op start interval=0 timeout=60s \
op stop interval=0 timeout=60s

crm configure primitive p_export_rbd_1 ocf:heartbeat:exportfs \
params directory=/mnt/block1 clientspec=10.35.64.0/24 
options=rw,async,no_subtree_check,no_root_squash fsid=1 \
op monitor interval=10s timeout=20s \
op start interval=0 timeout=40s

crm configure primitive p_vip_1 ocf:heartbeat:IPaddr2 \
params ip=10.35.64.90 cidr_netmask=24 \
op monitor interval=5

crm configure primitive p_nfs_server lsb:nfs-kernel-server \
op monitor interval=10s timeout=30s

crm configure primitive p_rpcbind upstart:rpcbind \
op monitor interval=10s timeout=30s

crm configure group g_rbd_share_1 p_rbd_map_1 p_fs_rbd_1 p_export_rbd_1 p_vip_1 
\
meta target-role=Started

crm configure group g_nfs p_rpcbind p_nfs_server \
meta target-role=Started

crm configure clone clo_nfs g_nfs \
meta globally-unique=false target-role=Started


'crm_mon' status results for normal condition:
Online: [ nfs1 nfs2 ]

Resource Group: g_rbd_share_1
p_rbd_map_1 (ocf::ceph:rbd.in): Started nfs1
p_fs_rbd_1 (ocf::heartbeat:Filesystem): Started nfs1
p_export_rbd_1 (ocf::heartbeat:exportfs): Started nfs1
p_vip_1 (ocf::heartbeat:IPaddr2): Started nfs1
Clone Set: clo_nfs [g_nfs]
Started: [ nfs1 nfs2 ]

'crm_mon' status results for fail over condition:
Online: [ nfs1 nfs2 ]

Resource Group: g_rbd_share_1
p_rbd_map_1 (ocf::ceph:rbd.in): Started nfs1
p_fs_rbd_1 (ocf::heartbeat:Filesystem): Started nfs1 (unmanaged) FAILED
p_export_rbd_1 (ocf::heartbeat:exportfs): Stopped
p_vip_1 (ocf::heartbeat:IPaddr2): Stopped
Clone Set: clo_nfs [g_nfs]
Started: [ nfs2 ]
Stopped: [ nfs1 ]

Failed actions:
p_fs_rbd_1_stop_0 (node=nfs1, call=114, rc=1, status=Timed Out, 
last-rc-change=Wed May 13 16:39:10 2015, queued=60002ms, exec=1ms
): unknown error


'demsg' messages:

[ 9470.284509] nfsd: last server has exited, flushing export cache
[ 9470.322893] init: rpcbind main process (4267) terminated with status 2
[ 9600.520281] INFO: task umount:2675 blocked for more than 120 seconds.
[ 9600.520445] Not tainted 3.13.0-32-generic #57-Ubuntu
[ 9600.520570] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
[ 9600.520792] umount D 88003fc13480 0 2675 1 0x
[ 9600.520800] 88003a4f9dc0 0082 880039ece000 
88003a4f9fd8
[ 9600.520805] 00013480 00013480 880039ece000 
880039ece000
[ 9600.520809] 88003fc141a0 0001  
88003a377928
[ 9600.520814] Call Trace:
[ 9600.520830] [817251a9] schedule+0x29/0x70
[ 9600.520882] [a043b300] _xfs_log_force+0x220/0x280 [xfs]
[ 9600.520891] [8109a9b0] ? wake_up_state+0x20/0x20
[ 9600.520922] [a043b386] xfs_log_force+0x26/0x80 [xfs]
[ 9600.520947] [a03f3b6d] xfs_fs_sync_fs+0x2d/0x50 [xfs]
[ 9600.520954] 

Re: [ceph-users] Ceph MDS continually respawning (hammer)

2015-05-28 Thread Kenneth Waegeman



On 05/27/2015 10:30 PM, Gregory Farnum wrote:

On Wed, May 27, 2015 at 6:49 AM, Kenneth Waegeman
kenneth.waege...@ugent.be wrote:

We are also running a full backup sync to cephfs, using multiple distributed
rsync streams (with zkrsync), and also ran in this issue today on Hammer
0.94.1  .
After setting the beacon higher, and eventually clearing the journal, it
stabilized again.

We were using ceph-fuse to mount the cephfs, not the ceph kernel client.


What's your MDS cache size set to?
I did set it to 100 before (we have 64G of ram for the mds) trying 
to get rid of the 'Client .. failing to respond to cache pressure' messages
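(For reference, a sketch of how that knob is set - the value below is only a 
placeholder, not necessarily what we ran with:

[mds]
mds cache size = 1000000

or at runtime: ceph tell mds.0 injectargs '--mds-cache-size 1000000')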

 Did you have any warnings in the

ceph log about clients not releasing caps?
Unfortunately I lost the logs from before it happened, but there is nothing
in the new logs about that; I will follow this up.


I think you could hit this in ceph-fuse as well on hammer, although we
just merged in a fix: https://github.com/ceph/ceph/pull/4653
-Greg


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD trashed by simple reboot (Debian Jessie, systemd?)

2015-05-28 Thread Jan Schermer
Can you check the capacitor reading on the S3700 with smartctl? This drive has a
non-volatile cache which *should* get flushed when power is lost; depending on
what the hardware does on reboot, it might get flushed even when rebooting.
I just got this drive for testing yesterday and it’s a beast, but some things
were peculiar - for example my fio benchmark slowed down (35K IOPS -> 5K IOPS)
after several GB (random - 5-40) written, and then it would creep back up over
time even under load. Disabling the write cache helps, no idea why.

Z.


 On 28 May 2015, at 09:22, Christian Balzer ch...@gol.com wrote:
 
 
 Hello Greg,
 
 On Wed, 27 May 2015 22:53:43 -0700 Gregory Farnum wrote:
 
 The description of the logging abruptly ending and the journal being
 bad really sounds like part of the disk is going back in time. I'm not
 sure if XFS internally is set up in such a way that something like
 losing part of its journal would allow that?
 
 I'm special. ^o^
 No XFS, EXT4. As stated in the original thread, below.
 And the (OSD) journal is a raw partition on a DC S3700.
 
 And since there was at least a 30 seconds pause between the completion of
 the /etc/init.d/ceph stop and issuing of the shutdown command, the
 logging abruptly ending seems to be unlikely related to the shutdown at
 all.
 
 If any of the OSD developers have the time it's conceivable a copy of
 the OSD journal would be enlightening (if e.g. the header offsets are
 wrong but there are a bunch of valid journal entries), but this is two
 reports of this issue from you and none very similar from anybody
 else. I'm still betting on something in the software or hardware stack
 misbehaving. (There aren't that many people running Debian; there are
 lots of people running Ubuntu and we find bad XFS kernels there not
 infrequently; I think you're hitting something like that.)
 
 There should be no file system involved with the raw partition SSD
 journal, n'est-ce pas?
 
 The hardware is vastly different, the previous case was on an AMD
 system with onboard SATA (SP5100), this one is a SM storage goat with LSI
 3008.
 
 The only thing they have in common is the Ceph version 0.80.7 (via the
 Debian repository, not Ceph) and Debian Jessie as OS with kernel 3.16
 (though there were minor updates on that between those incidents,
 backported fixes)
 
 A copy of the journal would consist of the entire 10GB partition, since we
 don't know where in loop it was at the time, right?
 
 Christian
 
 On Sun, May 24, 2015 at 7:26 PM, Christian Balzer ch...@gol.com wrote:
 
 Hello again (marvel at my elephantine memory and thread necromancy)
 
 Firstly, this happened again, details below.
 Secondly, as I changed things to sysv-init AND did a /etc/init.d/ceph
 stop which dutifully listed all OSDs as being killed/stopped BEFORE
 rebooting the node.
 
 This is completely new node with significantly different HW than the
 example below.
 But the same SW versions as before (Debian Jessie, Ceph 0.80.7).
 And just like below/before the logs for that OSD have nothing in them
 indicating it did shut down properly (no journal flush done) and when
 coming back on reboot we get the dreaded:
 ---
 2015-05-25 10:32:55.439492 7f568aa157c0  1 journal
 _open /var/lib/ceph/osd/ceph-30/journal fd 23: 1269312 bytes,
 block size 4096 bytes, directio = 1, aio = 1 2015-05-25
 10:32:55.439859 7f568aa157c0 -1 journal read_header error decoding
 journal header 2015-05-25 10:32:55.439905 7f568aa157c0 -1
 filestore(/var/lib/ceph/osd/ceph-30) mount failed to open
 journal /var/lib/ceph/osd/ceph-30/journal: (22) Invalid argument
 2015-05-25 10:32:55.936975 7f568aa157c0 -1 osd.30 0 OSD:init: unable
 to mount object store ---
 
 I see nothing in the changelogs for 0.80.8 and .9 that seems related to
 this, never mind that from the looks of it the repository at Ceph has
 only Wheezy (bpo70) packages and Debian Jessie is still stuck at
 0.80.7 (Sid just went to .9 last week)
 
 I'm preserving the state of things as they are for a few days, so if
 any developer would like a peek or more details, speak up now.
 
 I'd open an issue, but I don't have a reliable way to reproduce this
 and even less desire to do so on this production cluster. ^_-
 
 Christian
 
 On Sat, 6 Dec 2014 12:48:25 +0900 Christian Balzer wrote:
 
 On Fri, 5 Dec 2014 11:23:19 -0800 Gregory Farnum wrote:
 
 On Thu, Dec 4, 2014 at 7:03 PM, Christian Balzer ch...@gol.com
 wrote:
 
 Hello,
 
 This morning I decided to reboot a storage node (Debian Jessie,
 thus 3.16 kernel and Ceph 0.80.7, HDD OSDs with SSD journals)
 after applying some changes.
 
 It came back up one OSD short, the last log lines before the
 reboot are: ---
 2014-12-05 09:35:27.700330 7f87e789c700  2 --
 10.0.8.21:6823/29520 >> 10.0.8.22:0/5161 pipe(0x7f881b772580
 sd=247 :6823 s=2 pgs=21 cs=1 l=1 c=0x7f881f469020).fault (0)
 Success 2014-12-05 09:35:27.700350 7f87f011d700 10 osd.4
 pg_epoch: 293 pg[3.316( v 289'1347 (0'0,289'1347] local-les=289
 n=8 ec=5 les/c 289/289 

Re: [ceph-users] Blocked requests/ops?

2015-05-28 Thread Xavier Serrano
On Thu May 28 11:22:52 2015, Christian Balzer wrote:

  We are testing different scenarios before making our final decision
  (cache-tiering, journaling, separate pool,...).
 
 Definitely a good idea to test things out and get an idea what Ceph and
 your hardware can do.
 
 From my experience and reading this ML however I think your best bet
 (overall performance) is to use those 4 SSDs as journal SSDs at a 1:5 ratio
 for your 20 HDD OSDs.
 
 Currently cache-tiering is probably the worst use for those SSD resources,
 though the code and strategy is of course improving.
 
I agree: in our particular environment, our tests also conclude that
SSD journaling performs far better than cache-tiering, especially when
the cache gets close to its capacity and data movement between the cache
and the backing storage occurs frequently.

We also want to test whether it is possible to use SSDs as a transparent
cache for the HDDs at the system (Linux kernel) level, and how reliable/good
it is.

 Dedicated SSD pools may be a good fit depending on your use case.
 However I'd advise against mixing SSD and HDD OSDs on the same node.
 To fully utilize those SSDs you'll need a LOT more CPU power than required
 by HDD OSDs or SSD journals/HDD OSDs systems. 
 And you already have 20 OSDs in that box.

Good point! We did not consider that, thanks for pointing it out.

 What CPUs do you have in those storage nodes anyway?
 
Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz, according to /proc/cpuinfo.
We have only 1 CPU per osd node, so I'm afraid we have another
potential bottleneck here.

 If you have the budget, I'd deploy the current storage nodes in classic
 (SSDs for journals) mode and add a small (2x 8-12 SSDs) pair of pure SSD
 nodes, optimized for their task (more CPU power, faster network).
 
 Then use those SSD nodes to experiment with cache-tiers and pure SSD pools
 and switch over things when you're comfortable with this and happy with the
 performance. 
  
  
   However with 20 OSDs per node, you're likely to go from a being
   bottlenecked by your HDDs to being CPU limited (when dealing with lots
   of small IOPS at least).
   Still, better than now for sure.
   
  This is very interesting, thanks for pointing it out!
  What would you suggest to use in order to identify the actual
  bottleneck? (disk, CPU, RAM, etc.). Tools like munin?
  
 Munin might work, I use collectd to gather all those values (and even more
 importantly all Ceph counters) and graphite to visualize it.
 For ad-hoc, on the spot analysis I really like atop (in a huge window),
 which will make it very clear what is going on.
 
  In addition, there are some kernel tunables that may be helpful
  to improve overall performance. Maybe we are filling some kernel
  internals and that limits our results (for instance, we had to increase
  fs.aio-max-nr in sysctl.d to 262144 to be able to use 20 disks per
  host). Which tunables should we observe?
  
 I'm no expert for large (not even medium) clusters, so you'll have to
 research the archives and net (the CERN Ceph slide is nice).
 One thing I remember is kernel.pid_max, which is something you're likely
 to run into at some point with your dense storage nodes:
 http://ceph.com/docs/master/start/hardware-recommendations/#additional-considerations
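
A minimal sysctl.d sketch covering the two tunables mentioned in this thread (the file name is arbitrary, and the pid_max value is just the usual 64-bit maximum rather than a tested recommendation):

# /etc/sysctl.d/90-ceph-dense-node.conf
fs.aio-max-nr = 262144
kernel.pid_max = 4194303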
 
 Christian

All you say is really interesting. Thanks for your valuable advice.
We surely still have plenty of things to learn and test before going
to production.

Thanks again for your time and help.

Best regards,
- Xavier Serrano
- LCAC, Laboratori de Càlcul
- Departament d'Arquitectura de Computadors, UPC

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD trashed by simple reboot (Debian Jessie, systemd?)

2015-05-28 Thread Jan Schermer

 On 28 May 2015, at 10:56, Christian Balzer ch...@gol.com wrote:
 
 On Thu, 28 May 2015 10:32:18 +0200 Jan Schermer wrote:
 
 Can you check the capacitor reading on the S3700 with smartctl ? 
 
 I suppose you mean this?
 ---
 175 Power_Loss_Cap_Test 0x0033   100   100   010   Pre-fail  Always   -   648 (2 2862)
 ---
 
 Never mind that these are brand new.
 

Most of the failures occur on either very new or very old hardware :-)

 This
 drive has non-volatile cache which *should* get flushed when power is
 lost, depending on what hardware does on reboot it might get flushed
 even when rebooting. 
 
 That would probably trigger an increase in the unsafe shutdown count
 SMART value. 
 I will have to test that from a known starting point, since the current
 values are likely from earlier tests and actual shutdowns. 
 I'd be surprised if a reboot would drop power to the drives, but it is a
 possibility of course.
 
 However I'm VERY unconvinced that this could result in data loss, with the
 SSDs in perfect CAPS health. 
 

You are right, it shouldn’t happen, but stuff happens.

 I just got this drive for testing yesterday and
 it’s a beast, but some things were peculiar - for example my fio
 benchmark slowed down (35K IOPS -> 5K IOPS) after several GB (random -
 5-40) written, and then it would creep back up over time even under
 load. Disabling write cache helps, no idea why.
 
 I haven't seen that behavior with DC S3700s, but with 5xx ones and
 some Samsung, yes.


Try this simple test:

fio --filename=/dev/$device --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 \
--iodepth=1 --runtime=60 --time_based --name=journal-test --size=10M

(play with iodepth; if I remember correctly the highest gain was with
iodepth=1, higher depths reach almost the max without disabling the write cache)

First run with the write cache enabled:
hdparm -W1 /dev/$device
then with the write cache disabled:
hdparm -W0 /dev/$device
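
Put together, a sketch of the whole comparison (assumes $device is set to the SSD under test and that the device holds no data you care about):

for wc in 1 0; do
    hdparm -W$wc /dev/$device
    fio --filename=/dev/$device --direct=1 --sync=1 --rw=write --bs=4k \
        --numjobs=1 --iodepth=1 --runtime=60 --time_based \
        --name=journal-test-wc$wc --size=10M
done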

I get much higher IOPS with cache disabled on all SSDs I tested - Kingston, 
Samsung, Intel. I think it disables compression on those drives that use it 
internally, and it probably causes the SSD not to wait for other IOs to 
coalesce it with. This might have a very bad effect on the drive longevity in 
the long run, though...

Jan

 
 Christian
 
 Z.
 
 
 On 28 May 2015, at 09:22, Christian Balzer ch...@gol.com wrote:
 
 
 Hello Greg,
 
 On Wed, 27 May 2015 22:53:43 -0700 Gregory Farnum wrote:
 
 The description of the logging abruptly ending and the journal being
 bad really sounds like part of the disk is going back in time. I'm not
 sure if XFS internally is set up in such a way that something like
 losing part of its journal would allow that?
 
 I'm special. ^o^
 No XFS, EXT4. As stated in the original thread, below.
 And the (OSD) journal is a raw partition on a DC S3700.
 
 And since there was at least a 30 seconds pause between the completion
 of the /etc/init.d/ceph stop and issuing of the shutdown command, the
 logging abruptly ending seems to be unlikely related to the shutdown at
 all.
 
 If any of the OSD developers have the time it's conceivable a copy of
 the OSD journal would be enlightening (if e.g. the header offsets are
 wrong but there are a bunch of valid journal entries), but this is two
 reports of this issue from you and none very similar from anybody
 else. I'm still betting on something in the software or hardware stack
 misbehaving. (There aren't that many people running Debian; there are
 lots of people running Ubuntu and we find bad XFS kernels there not
 infrequently; I think you're hitting something like that.)
 
 There should be no file system involved with the raw partition SSD
 journal, n'est-ce pas?
 
 The hardware is vastly different, the previous case was on an AMD
 system with onboard SATA (SP5100), this one is a SM storage goat with
 LSI 3008.
 
 The only thing they have in common is the Ceph version 0.80.7 (via the
 Debian repository, not Ceph) and Debian Jessie as OS with kernel 3.16
 (though there were minor updates on that between those incidents,
 backported fixes)
 
 A copy of the journal would consist of the entire 10GB partition,
 since we don't know where in loop it was at the time, right?
 
 Christian
 
 On Sun, May 24, 2015 at 7:26 PM, Christian Balzer ch...@gol.com
 wrote:
 
 Hello again (marvel at my elephantine memory and thread necromancy)
 
 Firstly, this happened again, details below.
 Secondly, as I changed things to sysv-init AND did a
 /etc/init.d/ceph stop which dutifully listed all OSDs as being
 killed/stopped BEFORE rebooting the node.
 
 This is completely new node with significantly different HW than the
 example below.
 But the same SW versions as before (Debian Jessie, Ceph 0.80.7).
 And just like below/before the logs for that OSD have nothing in them
 indicating it did shut down properly (no journal flush done) and
 when coming back on reboot we get the dreaded:
 ---
 2015-05-25 10:32:55.439492 7f568aa157c0  1 journal
 _open 

Re: [ceph-users] OSD trashed by simple reboot (Debian Jessie, systemd?)

2015-05-28 Thread Gregory Farnum
On Thu, May 28, 2015 at 12:22 AM, Christian Balzer ch...@gol.com wrote:

 Hello Greg,

 On Wed, 27 May 2015 22:53:43 -0700 Gregory Farnum wrote:

 The description of the logging abruptly ending and the journal being
 bad really sounds like part of the disk is going back in time. I'm not
 sure if XFS internally is set up in such a way that something like
 losing part of its journal would allow that?

 I'm special. ^o^
 No XFS, EXT4. As stated in the original thread, below.
 And the (OSD) journal is a raw partition on a DC S3700.

 And since there was at least a 30 seconds pause between the completion of
 the /etc/init.d/ceph stop and issuing of the shutdown command, the
 logging abruptly ending seems to be unlikely related to the shutdown at
 all.

Oh, sorry...
I happened to read this article last night:
http://lwn.net/SubscriberLink/645720/01149aa7c58954eb/

Depending on configuration (I think you'd need to have a
journal-as-file) you could be experiencing that. And again, not many
people use ext4 so who knows what other ways there are of things being
broken that nobody else has seen yet.


 If any of the OSD developers have the time it's conceivable a copy of
 the OSD journal would be enlightening (if e.g. the header offsets are
 wrong but there are a bunch of valid journal entries), but this is two
 reports of this issue from you and none very similar from anybody
 else. I'm still betting on something in the software or hardware stack
 misbehaving. (There aren't that many people running Debian; there are
 lots of people running Ubuntu and we find bad XFS kernels there not
 infrequently; I think you're hitting something like that.)

 There should be no file system involved with the raw partition SSD
 journal, n'est-ce pas?

...and I guess probably you aren't since you are using partitions.


 The hardware is vastly different, the previous case was on an AMD
 system with onboard SATA (SP5100), this one is a SM storage goat with LSI
 3008.

 The only thing they have in common is the Ceph version 0.80.7 (via the
 Debian repository, not Ceph) and Debian Jessie as OS with kernel 3.16
 (though there were minor updates on that between those incidents,
 backported fixes)

 A copy of the journal would consist of the entire 10GB partition, since we
 don't know where in loop it was at the time, right?

Yeah.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] umount stuck on NFS gateways switch over by using Pacemaker

2015-05-28 Thread Eric Eastman
On Thu, May 28, 2015 at 1:33 AM, wd_hw...@wistron.com wrote:

 Hello,

   I am testing NFS over RBD recently. I am trying to build an NFS HA
 environment under Ubuntu 14.04 for testing, and the package version
 information is as follows:
 - Ubuntu 14.04 : 3.13.0-32-generic(Ubuntu 14.04.2 LTS)
 - ceph : 0.80.9-0ubuntu0.14.04.2
 - ceph-common : 0.80.9-0ubuntu0.14.04.2
 - pacemaker (git20130802-1ubuntu2.3)
 - corosync (2.3.3-1ubuntu1)
 PS: I also tried ceph/ceph-common (0.87.1-1trusty and 0.87.2-1trusty) on a
 3.13.0-48-generic (Ubuntu 14.04.2) server and got the same results.

   The environment has 5 nodes in the Ceph cluster (3 MONs and 5 OSDs) and
 two NFS gateways (nfs1 and nfs2) for high availability. I issued the command
 'sudo service pacemaker stop' on 'nfs1' to force the resources to stop and
 move to 'nfs2', and vice versa.

 When the two nodes are up and I issue 'sudo service pacemaker stop' on one
 node, the other node takes over all resources and everything looks fine.
 Then I wait about 30 minutes without touching the NFS gateways and repeat
 the previous steps to test the failover procedure. This time the 'umount'
 process is stuck in state 'D' (uninterruptible sleep); 'ps' showed the
 following result:

 root 21047 0.0 0.0 17412 952 ? D 16:39 0:00 umount /mnt/block1

 Does anyone have an idea how to solve or work around this? Because 'umount'
 is stuck, neither 'reboot' nor 'shutdown' works properly. So unless I wait 20
 minutes for the 'umount' to time out, the only thing I can do is power off
 the server directly.

 Any help would be much appreciated.


I am not sure how to get out of the stuck umount, but you can skip the
shutdown scripts that call the umount during a reboot using:

reboot -fn

This can cause data loss, as it is like a power cycle, so it is best
to run sync before running the reboot -fn command to flush out
buffers.

Sometimes when a system is really hung, reboot -fn does not work, but
this seems to always work if run as root:

echo 1 > /proc/sys/kernel/sysrq
echo b > /proc/sysrq-trigger

Eric
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-deploy for Hammer

2015-05-28 Thread Travis Rhoden
Hi Pankaj,

While there have been times in the past where ARM binaries were hosted
on ceph.com, there is not currently any ARM hardware for builds.  I
don't think you will see any ARM binaries in
http://ceph.com/debian-hammer/pool/main/c/ceph/, for example.

Combine that with the fact that ceph-deploy is not intended to work
with locally compiled binaries (only packages, as it relies on paths,
conventions, and service definitions from the packages), and it is a
very tricky combo to use ceph-deploy and ARM together.

Your most recent error is indicative of the ceph-mon service not
coming up successfully.  when ceph-mon (the service, not the daemon)
is started, it also calls ceph-create-keys, which waits for the
monitor daemon to come up and the creates keys that are necessary for
all clusters to run when using cephx (the admin key, the bootsraps
keys).
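
A quick way to check that by hand (a sketch; the cluster name 'ceph' is the default and the host/mon name 'ceph1' is taken from the error above) is to query the monitor over its admin socket and, if it reports quorum, re-run the key creation:

ceph --admin-daemon /var/run/ceph/ceph-mon.ceph1.asok mon_status
ceph-create-keys --cluster ceph --id ceph1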

 - Travis

On Wed, May 27, 2015 at 8:27 PM, Garg, Pankaj
pankaj.g...@caviumnetworks.com wrote:
 Actually the ARM binaries do exist and I have been using for previous
 releases. Somehow this library is the one that doesn’t load.

 Anyway I did compile my own Ceph for ARM, and now getting the following
 issue:



 [ceph_deploy.gatherkeys][WARNIN] Unable to find
 /etc/ceph/ceph.client.admin.keyring on ceph1

 [ceph_deploy][ERROR ] KeyNotFoundError: Could not find keyring file:
 /etc/ceph/ceph.client.admin.keyring on host ceph1





 From: Somnath Roy [mailto:somnath@sandisk.com]
 Sent: Wednesday, May 27, 2015 4:29 PM
 To: Garg, Pankaj


 Cc: ceph-users@lists.ceph.com
 Subject: RE: ceph-deploy for Hammer



 If you are trying to install the ceph repo hammer binaries, I don’t think it
 is built for ARM. Both binary and the .so needs to be built in ARM to make
 this work I guess.

 Try to build hammer code base in your ARM server and then retry.



 Thanks  Regards

 Somnath



 From: Pankaj Garg [mailto:pankaj.g...@caviumnetworks.com]
 Sent: Wednesday, May 27, 2015 4:17 PM
 To: Somnath Roy
 Cc: ceph-users@lists.ceph.com
 Subject: RE: ceph-deploy for Hammer



 Yes I am on ARM.

 -Pankaj

 On May 27, 2015 3:58 PM, Somnath Roy somnath@sandisk.com wrote:

 Are you running this on ARM ?

 If not, it should not go for loading this library.



 Thanks  Regards

 Somnath



 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Garg, Pankaj
 Sent: Wednesday, May 27, 2015 2:26 PM
 To: Garg, Pankaj; ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] ceph-deploy for Hammer



 I seem to be getting these errors in the Monitor Log :

 2015-05-27 21:17:41.908839 3ff907368e0 -1
 erasure_code_init(jerasure,/usr/lib/aarch64-linux-gnu/ceph/erasure-code):
 (5) Input/output error

 2015-05-27 21:17:41.978113 3ff969168e0  0 ceph version 0.94.1
 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff), process ceph-mon, pid 16592

 2015-05-27 21:17:41.984383 3ff969168e0 -1 ErasureCodePluginSelectJerasure:
 load
 dlopen(/usr/lib/aarch64-linux-gnu/ceph/erasure-code/libec_jerasure_neon.so):
 /usr/lib/aarch64-linux-gnu/ceph/erasure-code/libec_jerasure_neon.so: cannot
 open shared object file: No such file or directory

 2015-05-27 21:17:41.98 3ff969168e0 -1
 erasure_code_init(jerasure,/usr/lib/aarch64-linux-gnu/ceph/erasure-code):
 (5) Input/output error

 2015-05-27 21:17:42.052415 3ff90cf68e0  0 ceph version 0.94.1
 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff), process ceph-mon, pid 16604

 2015-05-27 21:17:42.058656 3ff90cf68e0 -1 ErasureCodePluginSelectJerasure:
 load
 dlopen(/usr/lib/aarch64-linux-gnu/ceph/erasure-code/libec_jerasure_neon.so):
 /usr/lib/aarch64-linux-gnu/ceph/erasure-code/libec_jerasure_neon.so: cannot
 open shared object file: No such file or directory

 2015-05-27 21:17:42.058715 3ff90cf68e0 -1
 erasure_code_init(jerasure,/usr/lib/aarch64-linux-gnu/ceph/erasure-code):
 (5) Input/output error

 2015-05-27 21:17:42.125279 3ffac4368e0  0 ceph version 0.94.1
 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff), process ceph-mon, pid 16616

 2015-05-27 21:17:42.131666 3ffac4368e0 -1 ErasureCodePluginSelectJerasure:
 load
 dlopen(/usr/lib/aarch64-linux-gnu/ceph/erasure-code/libec_jerasure_neon.so):
 /usr/lib/aarch64-linux-gnu/ceph/erasure-code/libec_jerasure_neon.so: cannot
 open shared object file: No such file or directory

 2015-05-27 21:17:42.131726 3ffac4368e0 -1
 erasure_code_init(jerasure,/usr/lib/aarch64-linux-gnu/ceph/erasure-code):
 (5) Input/output error





 The lib file exists, so I'm not sure why this is happening. Any help
 appreciated.
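
One way to narrow a dlopen failure like this down is to check that the plugin really is an aarch64 object and that all of its own dependencies resolve (a sketch using the path from the log above):

file /usr/lib/aarch64-linux-gnu/ceph/erasure-code/libec_jerasure_neon.so
ldd /usr/lib/aarch64-linux-gnu/ceph/erasure-code/libec_jerasure_neon.so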



 Thanks

 Pankaj



 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Garg, Pankaj
 Sent: Wednesday, May 27, 2015 1:37 PM
 To: ceph-users@lists.ceph.com
 Subject: [ceph-users] ceph-deploy for Hammer



 Hi,

 Is there a particular version of Ceph-Deploy that should be used with Hammer
 release? This is a brand new cluster.

 I’m getting the following error when running the command: ceph-deploy mon
 create-initial



 [ceph_deploy.conf][DEBUG ] found configuration file at:
 

[ceph-users] mds crash

2015-05-28 Thread Peter Tiernan

Hi all,

I have been testing cephfs with an erasure coded pool and cache tier. I
have 3 MDS daemons running on the same physical server as the 3 mons. The
cluster is otherwise in an OK state: rbd is working and all PGs are
active+clean. I'm running v0.87.2 Giant on all nodes and Ubuntu 14.04.2.


The cluster was working fine, but when copying a large file from a client
to cephfs it froze, and now the MDSs keep crashing with:


 0 2015-05-28 16:50:58.267112 7f0282946700 -1 mds/MDCache.cc: In 
function 'virtual void C_IO_MDC_TruncateFinish::finish(int)' thread 
7f0282946700 time 2015-05-28 16:50:58.243904

mds/MDCache.cc: 5974: FAILED assert(r == 0 || r == -2)

any ideas?

thanks
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Memory Allocators and Ceph

2015-05-28 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

I've got some more tests running right now. Once those are done, I'll
find a couple of tests that had extreme difference and gather some
perf data for them.
- 
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, May 27, 2015 at 3:48 PM, Mark Nelson  wrote:


 On 05/27/2015 04:00 PM, Robert LeBlanc wrote:

 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA256


 On Wed, May 27, 2015 at 2:06 PM, Mark Nelson  wrote:

 Compiling Ceph entirely with jemalloc overall had a negative
 performance impact. This may be due to dynamically linking to RocksDB
 instead of the default static linking.



 Is it possible that there were any other differences?  A 30% gain turning
 into a 30% loss with pre-loading vs compiling seems pretty crazy!


 I tried hard to minimize the differences by backporting the Ceph
 jemalloc feature into 0.94.1 that was used in the other testing. I did
 have to get RocksDB from master to get it to compile against jemalloc
 so there is some difference there. When preloading Ceph with jemalloc,
 parts of Ceph still used tcmalloc because it was statically linked to
 by RocksDB, so it was using both allocators during those tests.
 Programming is not my forte so it is likely that I may have botched
 something with that test.
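
For context, the preload variant boils down to something like this (a sketch; the jemalloc library path and the OSD id are assumptions and vary by distro and cluster):

LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 ceph-osd -i 0 -f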

 The goal of the test was to see if and where these allocators may
 help/hinder performance. It could also provide some feedback to Ceph
 devs on how to leverage one or the other or both. I don't consider
 this test to be extremely reliable as there is some variability in
 this pre-production system even though I tried to remove the
 variability to an extent.

 I hope others can build on this as a jumping off point and at least
 have some interesting places to look instead of having to scope out a
 large section of the space.


 Might be worth trying to reproduce the results and grab perf data or some
 other kind of trace data during the tests.  There's so much variability
 here
 it's really tough to get an idea of why the performance swings so
 dramatically.


 I'm not very familiar with the perf tools (can you use them with
 jemalloc?) and what would be useful. If you would like to tell me some
 configurations and tests you are interested in and let me know how you
 want perf to generate the data, I can see what I can do to provide
 that. Each test suite takes about 9 hours to run so it is pretty
 intensive.


 perf can give you a call graph showing how much cpu time is being spent in
 different parts of the code.

 Something like this during the test:

 sudo perf record --call-graph dwarf -F 99 -a
 sudo perf report

 You may need a newish kernel/os for dwarf support to work.  There are
 probably other tools that may also give insights into what is going on.
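
To keep a capture bounded to a single sub-test, the same recording can be scoped to the OSD processes and a fixed duration (a sketch; 300s matches the 5-minute sub-tests mentioned below):

sudo perf record --call-graph dwarf -F 99 -p $(pgrep -d, ceph-osd) -- sleep 300
sudo perf report --stdio > perf-report.txt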


 Each sub-test (i.e. 4K seq read) takes 5 minutes, so it is much
 easier to run selections of those if there are specific tests you are
 interested in. I'm happy to provide data, but given the time to run
 these tests if we can focus on specific areas it would provide
 data/benefits much faster.


 I guess starting out I'm interested in what's happening with preloaded vs
 compiled jemalloc.  Other tests might be interesting too though!




 Still, excellent testing!  We definitely need more of this so we can
 determine if jemalloc is something that would be worth switching to
 eventually.




 - 
 Robert LeBlanc
 GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
 -BEGIN PGP SIGNATURE-
 Version: Mailvelope v0.13.1
 Comment: https://www.mailvelope.com

 wsFcBAEBCAAQBQJVZjCACRDmVDuy+mK58QAAsHIQAImJWLkGix2sDKCZgcME
 0RHmelyEBtFFjIUNJvrwC0PvUKqQ/sffdtC+QLLcFYKOO2G5lrojKhCdwhXI
 OP0O1IqMcXUCBcq5yNJf8O6uzQ56Q4qCHWJmg49JRHx4gQLNK9VtGLRevL96
 JNrwhllpI5v+ewuQR/P2uD/NAXhFWDjEXLO4xHQGylOQOOVRQBlWeq+3QLqX
 4Zz+yiY4VIdhSe/z3aQYxes12snyjF2zP2Zo/BS47KBtVbmOJ7wGBGIFy8nw
 T4r7HYapCX3sqAN/fHEvwgcunYaW4y8aZT2a3Lv0PZKz23d6zcOUBPEFJ86W
 DzZyqqmDq7QJLtUnAb1yyQj23bWntI/zoT83zWCUvPHU7odmlBvSWZ8w7ToC
 mpOYjPw5CBVvztCFM2gwnmEXdM0qtmtdv/NhfQVu5+FNhQDSlhOPMCXdM3wf
 2JjuygdfRg4kGE6KyX4nYSZxfacsvX3SIkLnKYsdeWMNMZwGC6TvulApY61s
 sedwbe+hyFqlfGlbM+QCtW495Wr9EcfFdM/PWUDkXtfmfE20UdqAKYzIeJfC
 F8HS5sZz6GtiLb1Dbiq69hNmUUtfDEIDVssARKbMtmZ30bPdNe42grBttzDG
 3aNc05TwFe72HMjAhtvQrkrq1C+4XZA3mpNnosiXCUJT9WeOAOJbzWQS0mUS
 Yrtb
 =+ESo
 -END PGP SIGNATURE-



-BEGIN PGP SIGNATURE-
Version: Mailvelope v0.13.1
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJVZzoiCRDmVDuy+mK58QAAwiIQALFexcUi7eeosd36JMPQ
ZfDKaeLkkZoftAtM3EYAZVfx2vdiUDeQKyecdhgFin2CGz68NFRRBjZZ9qll
USMyfk85X71XQh7cZplkFGc4fwKN2leUDJWbnbpB8PQa15ocj+wBOlfeFmTX
PCW0+fv06slo/uCPtJH0Drl978pU1MXrESYJwJaGcfK9IUgCGD/w+4rtGwt3
ITvEfdmDBwEmNErxFojBcQ1XTxbb5tDXMjwJ9acdg0mDg0PiKXGtu79fJrle
kouO2RyBYNfA5/w83Hy8IhFncI+9XO2NnCF4pGR6G35yhwNq6TuA67bPQ4ip
+fdkPvp+/v3YOpeB0iBkZJLSGQVTICbCEW3GQNT9lhZ31cc/tyWqMLh5Zdwq

Re: [ceph-users] mds crash

2015-05-28 Thread John Spray


(This came up as in-reply-to to the previous mds crashing thread -- 
it's better to start threads with a fresh message)




On 28/05/2015 16:58, Peter Tiernan wrote:

Hi all,

I have been testing cephfs with erasure coded pool and cache tier. I 
have 3 mds running on the same physical server as 3 mons. The cluster 
is in ok state otherwise, rbd is working and all pg are active+clean. 
Im running v 0.87.2 giant on all nodes and ubuntu 14.04.2 .


The cluster was working fine but when copying a large file on a client 
to cephfs, it froze and now mdss keep crashing with:


 0 2015-05-28 16:50:58.267112 7f0282946700 -1 mds/MDCache.cc: In 
function 'virtual void C_IO_MDC_TruncateFinish::finish(int)' thread 
7f0282946700 time 2015-05-28 16:50:58.243904

mds/MDCache.cc: 5974: FAILED assert(r == 0 || r == -2)

any ideas?


You're getting some kind of IO error from RADOS, and the CephFS code 
doesn't have clean handling for that in many cases, so it's asserting out.


Enable debug objecter = 10 on the MDS to see what the operation is 
that's failing, and please provide the whole section of the log leading 
up to the crash rather than just the last line.
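
A minimal sketch of how to turn that on: either add it to ceph.conf on the MDS host and restart the MDS, or set it at runtime over the admin socket (the socket form assumes the MDS name matches the short hostname, which is what ceph-deploy sets up by default):

[mds]
    debug objecter = 10

ceph daemon mds.$(hostname -s) config set debug_objecter 10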


Cheers,
John


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com