Re: [ceph-users] ceph-deploy for Hammer
Hi Travis,

These binaries are hosted on Canonical servers and are only for Ubuntu. Through the latest Firefly point release, 0.80.9, everything worked fine. I just tried the Hammer binaries, and they seem to fail when loading the erasure-coding libraries. I have now built my own binaries and was able to get the cluster up and running using ceph-deploy. You just have to skip the ceph installation step in ceph-deploy and instead do a manual install from the .deb files. The rest worked fine.

Thanks
Pankaj

-----Original Message-----
From: Travis Rhoden [mailto:trho...@gmail.com]
Sent: Thursday, May 28, 2015 8:02 AM
To: Garg, Pankaj
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph-deploy for Hammer

Hi Pankaj,

While there have been times in the past where ARM binaries were hosted on ceph.com, there is currently no ARM hardware for builds. I don't think you will see any ARM binaries in http://ceph.com/debian-hammer/pool/main/c/ceph/, for example. Combine that with the fact that ceph-deploy is not intended to work with locally compiled binaries (only packages, as it relies on paths, conventions, and service definitions from the packages), and ceph-deploy plus ARM is a very tricky combination.

Your most recent error is indicative of the ceph-mon service not coming up successfully. When ceph-mon (the service, not the daemon) is started, it also calls ceph-create-keys, which waits for the monitor daemon to come up and then creates the keys that are necessary for any cluster running with cephx (the admin key and the bootstrap keys).

 - Travis

On Wed, May 27, 2015 at 8:27 PM, Garg, Pankaj pankaj.g...@caviumnetworks.com wrote:

Actually, the ARM binaries do exist and I have been using them for previous releases. Somehow this library is the one that doesn't load. Anyway, I did compile my own Ceph for ARM, and now I'm getting the following issue:

[ceph_deploy.gatherkeys][WARNIN] Unable to find /etc/ceph/ceph.client.admin.keyring on ceph1
[ceph_deploy][ERROR ] KeyNotFoundError: Could not find keyring file: /etc/ceph/ceph.client.admin.keyring on host ceph1

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, May 27, 2015 4:29 PM
To: Garg, Pankaj
Cc: ceph-users@lists.ceph.com
Subject: RE: ceph-deploy for Hammer

If you are trying to install the hammer binaries from the ceph repo, I don't think they are built for ARM. Both the binary and the .so need to be built on ARM to make this work, I guess. Try building the hammer code base on your ARM server and then retry.

Thanks & Regards
Somnath

From: Pankaj Garg [mailto:pankaj.g...@caviumnetworks.com]
Sent: Wednesday, May 27, 2015 4:17 PM
To: Somnath Roy
Cc: ceph-users@lists.ceph.com
Subject: RE: ceph-deploy for Hammer

Yes, I am on ARM.
-Pankaj

On May 27, 2015 3:58 PM, Somnath Roy somnath@sandisk.com wrote:

Are you running this on ARM? If not, it should not be trying to load this library.
Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Garg, Pankaj
Sent: Wednesday, May 27, 2015 2:26 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph-deploy for Hammer

I seem to be getting these errors in the monitor log:

2015-05-27 21:17:41.908839 3ff907368e0 -1 erasure_code_init(jerasure,/usr/lib/aarch64-linux-gnu/ceph/erasure-code): (5) Input/output error
2015-05-27 21:17:41.978113 3ff969168e0 0 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff), process ceph-mon, pid 16592
2015-05-27 21:17:41.984383 3ff969168e0 -1 ErasureCodePluginSelectJerasure: load dlopen(/usr/lib/aarch64-linux-gnu/ceph/erasure-code/libec_jerasure_neon.so): /usr/lib/aarch64-linux-gnu/ceph/erasure-code/libec_jerasure_neon.so: cannot open shared object file: No such file or directory
2015-05-27 21:17:41.98 3ff969168e0 -1 erasure_code_init(jerasure,/usr/lib/aarch64-linux-gnu/ceph/erasure-code): (5) Input/output error
2015-05-27 21:17:42.052415 3ff90cf68e0 0 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff), process ceph-mon, pid 16604
2015-05-27 21:17:42.058656 3ff90cf68e0 -1 ErasureCodePluginSelectJerasure: load dlopen(/usr/lib/aarch64-linux-gnu/ceph/erasure-code/libec_jerasure_neon.so): /usr/lib/aarch64-linux-gnu/ceph/erasure-code/libec_jerasure_neon.so: cannot open shared object file: No such file or directory
2015-05-27 21:17:42.058715 3ff90cf68e0 -1 erasure_code_init(jerasure,/usr/lib/aarch64-linux-gnu/ceph/erasure-code): (5) Input/output error
2015-05-27 21:17:42.125279 3ffac4368e0 0 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff), process ceph-mon, pid 16616
2015-05-27 21:17:42.131666 3ffac4368e0 -1 ErasureCodePluginSelectJerasure: load dlopen(/usr/lib/aarch64-linux-gnu/ceph/erasure-code/libec_jerasure_neon.so): /usr/lib/aarch64-linux-gnu/ceph/erasure-code/libec_jerasure_neon.so: cannot open shared object file: No such file or directory
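[Editor's note] A sketch of the workaround Pankaj describes - skipping 'ceph-deploy install' and installing locally built packages by hand. Host and package file names are illustrative; the exact .debs depend on your local build:

$ ceph-deploy new ceph1
# on each node, instead of 'ceph-deploy install', install your own ARM builds:
$ sudo dpkg -i ceph-common_0.94.1-1_arm64.deb ceph_0.94.1-1_arm64.deb
$ sudo apt-get -f install          # pull in any missing dependencies
# then continue with ceph-deploy as usual:
$ ceph-deploy mon create-initial
$ ceph-deploy osd prepare ceph1:/dev/sdb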
[ceph-users] TCP or UDP
Hi,

Does Ceph typically use TCP or UDP (or something else) on the data path, both for connections to clients and for inter-OSD cluster traffic?

Thanks
Pankaj
Re: [ceph-users] TCP or UDP
TCP.

- Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Thu, May 28, 2015 at 2:00 PM, Garg, Pankaj wrote:

Hi, does Ceph typically use TCP or UDP (or something else) on the data path, both for connections to clients and for inter-OSD cluster traffic? Thanks, Pankaj
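[Editor's note] This is easy to verify on a running cluster with standard Linux tooling (output elided):

$ sudo ss -tnp | grep ceph

Every client session, monitor connection, and OSD-to-OSD replication link shows up as an established TCP connection; Ceph's messenger has no UDP data path, on either the public or the cluster network.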
Re: [ceph-users] NFS interaction with RBD
To follow up on the original post: further digging indicates this is a problem with RBD image access, and is not related to NFS-RBD interaction as initially suspected. The nfsd is simply hanging as a result of a hung request to the XFS file system mounted on our RBD-NFS gateway. That hung XFS call is caused by a problem with the RBD module interacting with our Ceph pool.

I've found a reliable way to trigger a hang directly on an RBD image mapped into our RBD-NFS gateway box. The image contains an XFS file system. When I try to list the contents of a particular directory, the request hangs indefinitely.

Two weeks ago our ceph status was:

jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova status
   health HEALTH_WARN 1 near full osd(s)
   monmap e1: 3 mons at {da0-36-9f-0e-28-2c=172.16.171.6:6789/0,da0-36-9f-0e-2b-88=172.16.171.5:6789/0,da0-36-9f-0e-2b-a0=172.16.171.4:6789/0}, election epoch 350, quorum 0,1,2 da0-36-9f-0e-28-2c,da0-36-9f-0e-2b-88,da0-36-9f-0e-2b-a0
   osdmap e5978: 66 osds: 66 up, 66 in
   pgmap v26434260: 3072 pgs: 3062 active+clean, 6 active+clean+scrubbing, 4 active+clean+scrubbing+deep; 45712 GB data, 91590 GB used, 51713 GB / 139 TB avail; 12234B/s wr, 1op/s
   mdsmap e1: 0/0/1 up

The near-full OSD was number 53, and we updated our CRUSH map to reweight that OSD. All of the OSDs had a weight of 1, based on the assumption that all OSDs were 2.0 TB. Apparently one of our servers had its OSDs sized at 2.8 TB, and this caused the OSD imbalance even though we are only at 50% utilization.

We reweighted the near-full OSD to 0.8, which initiated a rebalance that has since relieved the 95%-full condition on that OSD. However, since that time the re-peering has not completed, and we suspect this is causing the problems with our access to RBD images. Our current ceph status is:

jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova status
   health HEALTH_WARN 1 pgs peering; 1 pgs stuck inactive; 4 pgs stuck unclean; recovery 9/23842120 degraded (0.000%)
   monmap e1: 3 mons at {da0-36-9f-0e-28-2c=172.16.171.6:6789/0,da0-36-9f-0e-2b-88=172.16.171.5:6789/0,da0-36-9f-0e-2b-a0=172.16.171.4:6789/0}, election epoch 350, quorum 0,1,2 da0-36-9f-0e-28-2c,da0-36-9f-0e-2b-88,da0-36-9f-0e-2b-a0
   osdmap e6036: 66 osds: 66 up, 66 in
   pgmap v27104371: 3072 pgs: 3 active, 3056 active+clean, 9 active+clean+scrubbing, 1 remapped+peering, 3 active+clean+scrubbing+deep; 45868 GB data, 92006 GB used, 51297 GB / 139 TB avail; 3125B/s wr, 0op/s; 9/23842120 degraded (0.000%)
   mdsmap e1: 0/0/1 up

Here are further details on our stuck pgs:

jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova pg dump_stuck inactive
ok
pg_stat  objects  mip  degr  unf  bytes  log  disklog  state  state_stamp  v  reported  up  acting  last_scrub  scrub_stamp  last_deep_scrub  deep_scrub_stamp
3.3af  11600  0  0  0  47941791744  153812  153812  remapped+peering  2015-05-15 12:47:17.223786  5979'293066  6000'1248735  [48,62]  [53,48,62]  5979'293056  2015-05-15 07:40:36.275563  5979'293056  2015-05-15 07:40:36.275563

jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova pg dump_stuck unclean
ok
pg_stat  objects  mip  degr  unf  bytes  log  disklog  state  state_stamp  v  reported  up  acting  last_scrub  scrub_stamp  last_deep_scrub  deep_scrub_stamp
3.106  11870  0  9  0  49010106368  163991  163991  active  2015-05-15 12:47:19.761469  6035'356332  5968'1358516  [62,53]  [62,53]  5979'356242  2015-05-14 22:22:12.966150  5979'351351  2015-05-12 18:04:41.838686
5.104  0  0  0  0  0  0  0  active  2015-05-15 12:47:19.800676  0'0  5968'1615  [62,53]  [62,53]  0'0  2015-05-14 18:43:22.425105  0'0  2015-05-08 10:19:54.938934
4.105  0  0  0  0  0  0  0  active  2015-05-15 12:47:19.801028  0'0  5968'1615  [62,53]  [62,53]  0'0  2015-05-14 18:43:04.434826  0'0  2015-05-14 18:43:04.434826
3.3af  11600  0  0  0  47941791744  153812  153812  remapped+peering  2015-05-15 12:47:17.223786  5979'293066  6000'1248735  [48,62]  [53,48,62]  5979'293056  2015-05-15 07:40:36.275563  5979'293056  2015-05-15 07:40:36.275563

The servers in the pool are not overloaded. On the ceph server that originally had the nearly full OSD (osd.53), I'm seeing entries like this in the osd log:

2015-05-28 06:25:02.900129 7f2ea8a4f700 0 log [WRN] : 6 slow requests, 6 included below; oldest blocked for 1096430.805069 secs
2015-05-28 06:25:02.900145 7f2ea8a4f700 0 log [WRN] : slow request
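[Editor's note] When a pg sticks in remapped+peering like 3.3af above, querying the pg and dumping in-flight ops on the suspect OSD usually shows what it is waiting on. A sketch using long-standing admin commands; osd.53 is taken from this thread and the admin-socket path is the Ceph default:

jpr@rcs-02:~$ sudo ceph --id nova pg 3.3af query        # inspect the 'recovery_state' section
# on the storage node hosting osd.53:
$ sudo ceph --admin-daemon /var/run/ceph/ceph-osd.53.asok dump_ops_in_flight

Note that the slow-request age above (1096430 seconds, roughly 12.7 days) lines up with the reweight done two weeks earlier, which supports the theory that the stuck peering, not NFS or XFS, is what is blocking the I/O.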
Re: [ceph-users] NFS interaction with RBD
Thanks a million for the feedback, Christian!

I've tried to recreate the issue with 10 RBD volumes mounted on a single server, without success! I issued the mkfs.xfs commands simultaneously (or at least as fast as I could in different terminals) without noticing any problems. Can you please tell me what the size of each of your RBD volumes was? I have a feeling that mine were too small, and if so I have to test it on our bigger cluster. I've also thought that, besides the QEMU version, the underlying OS might also matter, so what was your testbed?

All the best,
George

Hi George,

In order to experience the error it was enough to simply run mkfs.xfs on all the volumes. In the meantime it became clear what the problem was:

~ ; cat /proc/183016/limits
...
Max open files            1024                 4096                 files
...

This can be changed by setting a decent value in /etc/libvirt/qemu.conf for max_files.

Regards
Christian

On 27 May 2015, at 16:23, Jens-Christian Fischer jens-christian.fisc...@switch.ch wrote:

George, I will let Christian provide you the details. As far as I know, it was enough to just do an 'ls' on all of the attached drives. We are using QEMU 2.0:

$ dpkg -l | grep qemu
ii ipxe-qemu           1.0.0+git-2013.c3d1e78-2ubuntu1  all    PXE boot firmware - ROM images for qemu
ii qemu-keymaps        2.0.0+dfsg-2ubuntu1.11           all    QEMU keyboard maps
ii qemu-system         2.0.0+dfsg-2ubuntu1.11           amd64  QEMU full system emulation binaries
ii qemu-system-arm     2.0.0+dfsg-2ubuntu1.11           amd64  QEMU full system emulation binaries (arm)
ii qemu-system-common  2.0.0+dfsg-2ubuntu1.11           amd64  QEMU full system emulation binaries (common files)
ii qemu-system-mips    2.0.0+dfsg-2ubuntu1.11           amd64  QEMU full system emulation binaries (mips)
ii qemu-system-misc    2.0.0+dfsg-2ubuntu1.11           amd64  QEMU full system emulation binaries (miscellaneous)
ii qemu-system-ppc     2.0.0+dfsg-2ubuntu1.11           amd64  QEMU full system emulation binaries (ppc)
ii qemu-system-sparc   2.0.0+dfsg-2ubuntu1.11           amd64  QEMU full system emulation binaries (sparc)
ii qemu-system-x86     2.0.0+dfsg-2ubuntu1.11           amd64  QEMU full system emulation binaries (x86)
ii qemu-utils          2.0.0+dfsg-2ubuntu1.11           amd64  QEMU utilities

cheers
jc

--
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch  http://www.switch.ch/stories

On 26.05.2015, at 19:12, Georgios Dimitrakakis gior...@acmac.uoc.gr wrote:

Jens-Christian, how did you test that? Did you just try to write to them simultaneously? Are there any other tests one can perform to verify it?

In our installation we have a VM with 30 RBD volumes mounted, all of which are exported via NFS to other VMs. No one has complained so far, but the load/usage is very minimal. If this problem really exists, then very soon after the trial phase is over we will have millions of complaints :-(

What version of QEMU are you using? We are using the one provided by Ceph in qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64.rpm

Best regards,
George

I think we (i.e. Christian) found the problem:

We created a test VM with 9 mounted RBD volumes (no NFS server). As soon as he hit all disks, we started to experience these 120-second timeouts. We realized that the QEMU process on the hypervisor opens a TCP connection to every OSD for every mounted volume - exceeding the 1024 FD limit. So no deep scrubbing etc., but simply too many connections…

cheers
jc

--
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch [3]
http://www.switch.ch  http://www.switch.ch/stories

On 25.05.2015, at 06:02, Christian Balzer wrote:

Hello,

Let's compare your case with John-Paul's. Different OS and Ceph versions (thus we can assume different NFS versions as well). The only common thing is that both of you added OSDs and are likely suffering from delays stemming from Ceph re-balancing or deep-scrubbing. Ceph logs will only pipe up when things have been blocked for more than 30 seconds; NFS might take offense at lower values (or the accumulation of several distributed delays).

You added 23 OSDs; tell us more about your cluster, HW, network. Were these added to the existing 16 nodes, or are these on new storage nodes (so could there be something different with those nodes?), how busy is
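[Editor's note] A minimal sketch of checking for and raising this limit, based on the details above (the PID is the one quoted in the thread; the max_files value is an illustrative choice, not a tested recommendation):

# on the hypervisor: find the QEMU process and compare its FD count to its limit
$ pidof qemu-system-x86_64
183016
$ ls /proc/183016/fd | wc -l
$ grep 'Max open files' /proc/183016/limits

# raise the limit for libvirt-managed guests in /etc/libvirt/qemu.conf:
max_files = 32768

# then restart libvirtd; guests typically pick up the new limit only after
# being power-cycled (stop/start, not merely rebooted from inside the guest)

Since each mapped RBD volume in a guest can hold a TCP connection per OSD, size max_files to roughly (volumes x OSDs) plus headroom.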
Re: [ceph-users] Ceph MDS continually respawning (hammer)
On Thu, May 28, 2015 at 1:04 AM, Kenneth Waegeman kenneth.waege...@ugent.be wrote:

On 05/27/2015 10:30 PM, Gregory Farnum wrote:

On Wed, May 27, 2015 at 6:49 AM, Kenneth Waegeman kenneth.waege...@ugent.be wrote:

We are also running a full backup sync to cephfs, using multiple distributed rsync streams (with zkrsync), and also ran into this issue today on Hammer 0.94.1. After setting the beacon higher, and eventually clearing the journal, it stabilized again. We were using ceph-fuse to mount the cephfs, not the ceph kernel client.

What's your MDS cache size set to?

I did set it to 100 before (we have 64G of RAM for the MDS), trying to get rid of the 'Client .. failing to respond to cache pressure' messages.

Oh, that's definitely enough if one client is eating it all up to run into this, without that patch I referenced. :)
-Greg

Did you have any warnings in the ceph log about clients not releasing caps?

Unfortunately I lost the logs from before it happened. But there is nothing in the new logs about that; I will follow this up.

I think you could hit this in ceph-fuse as well on hammer, although we just merged in a fix: https://github.com/ceph/ceph/pull/4653
-Greg
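[Editor's note] For reference, the knob being discussed is mds_cache_size, which in this era of Ceph is an inode count (default 100000), not a byte limit. A sketch of checking and raising it; the daemon name mds.a and the target value are illustrative:

# in ceph.conf on the MDS host:
[mds]
        mds cache size = 1000000

# or inspect/change it at runtime via the admin socket:
$ sudo ceph daemon mds.a config get mds_cache_size
$ sudo ceph daemon mds.a config set mds_cache_size 1000000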
Re: [ceph-users] Chinese Language List
On Thu, May 28, 2015 at 12:59 AM, kefu chai tchai...@gmail.com wrote:

On Wed, May 27, 2015 at 3:36 AM, Patrick McGarry pmcga...@redhat.com wrote:

Due to popular demand we are expanding the Ceph lists to include a Chinese-language list to allow for direct communications for all of our friends in China.

ceph...@lists.ceph.com

It was decided that there are many fragmented discussions going on in the region due to unfamiliarity or discomfort with English. Hopefully this will allow for smooth communications between anyone in China that is interested in Ceph!

That's great news! Patrick, could you please also update https://ceph.com/resources/mailing-list-irc/ ?

done and done

I would greatly appreciate it if important messages/announcements/questions could be translated and forwarded by anyone that is able to translate them, so that the greater community can still benefit. Thanks.

will try to proxy some of the traffic here =)

awesome, thank you

--
Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com || http://community.redhat.com
@scuttlemonkey || @ceph

--
Regards
Kefu Chai
Re: [ceph-users] stuck degraded, undersized
Hello,

Google is your friend; this comes up every month at least, if not more frequently.

Your default size (replica count) is 2, and the default CRUSH rule you quote at the very end of your mail delineates failure domains at the host level (quite rightly so). So with 2 replicas (quite dangerous with disks) you will need at least 2 storage nodes, or you will have to change the CRUSH rule to allow replicas to be placed on the same host.

Christian

On Fri, 29 May 2015 10:48:04 +0800 Doan Hartono wrote:

Hi ceph experts,

I just freshly deployed ceph 0.94.1 with one monitor and one storage node containing 4 disks. But ceph health shows pgs stuck in degraded, unclean, and undersized. Any idea how to resolve this issue and get to the active+clean state?

[snip - full status output, ceph.conf, and CRUSH map as in the original post below]

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
Re: [ceph-users] stuck degraded, undersized
Hi Christian,

Based on your feedback, I modified the CRUSH map, changing

step chooseleaf firstn 0 type host

to

step chooseleaf firstn 0 type osd

I then compiled and set the new map, and voila, health is OK now. Thanks so much!

ceph health
HEALTH_OK

Regards,
Doan

On 05/29/2015 10:53 AM, Christian Balzer wrote:

Hello,

Google is your friend; this comes up every month at least, if not more frequently. Your default size (replica count) is 2, and the default CRUSH rule you quote delineates failure domains at the host level (quite rightly so). So with 2 replicas (quite dangerous with disks) you will need at least 2 storage nodes, or change the CRUSH rule to allow placement on the same host.

Christian

[snip - original post quoted in full; see below]
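[Editor's note] The edit-compile-set cycle Doan describes is the standard crushtool workflow; file names here are arbitrary. Note that 'type osd' placement gives up host-level redundancy, so it is only appropriate for single-node test setups like this one:

$ ceph osd getcrushmap -o crushmap.bin
$ crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt: in rule replicated_ruleset, change
#   step chooseleaf firstn 0 type host
# to
#   step chooseleaf firstn 0 type osd
$ crushtool -c crushmap.txt -o crushmap.new
$ ceph osd setcrushmap -i crushmap.new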
[ceph-users] stuck degraded, undersized
Hi ceph experts,

I just freshly deployed ceph 0.94.1 with one monitor and one storage node containing 4 disks. But ceph health shows pgs stuck in degraded, unclean, and undersized. Any idea how to resolve this issue and get to the active+clean state?

ceph health
HEALTH_WARN 27 pgs degraded; 27 pgs stuck degraded; 128 pgs stuck unclean; 27 pgs stuck undersized; 27 pgs undersized

ceph status
    cluster 6a8291d4-a3b8-475b-ad6c-c73895228762
     health HEALTH_WARN
            27 pgs degraded
            27 pgs stuck degraded
            128 pgs stuck unclean
            27 pgs stuck undersized
            27 pgs undersized
     monmap e1: 1 mons at {ceph-mon=10.0.0.154:6789/0}
            election epoch 2, quorum 0 ceph-mon
     osdmap e38: 4 osds: 4 up, 4 in; 101 remapped pgs
      pgmap v63: 128 pgs, 1 pools, 0 bytes data, 0 objects
            135 MB used, 7428 GB / 7428 GB avail
                  73 active+remapped
                  28 active
                  27 active+undersized+degraded

I set pg num and pgp num to 128, following the ceph recommendation in the documentation:

[global]
fsid = 6a8291d4-a3b8-475b-ad6c-c73895228762
mon_initial_members = ceph-mon
mon_host = x
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd pool default size = 2
osd pool default pg num = 128
osd pool default pgp num = 128

I have set the rbd pool's pg_num and pgp_num to 128:

$ ceph osd pool get rbd pg_num
pg_num: 128
$ ceph osd pool get rbd pgp_num
pgp_num: 128
$ ceph osd pool get rbd size
size: 2

I have tried modifying the crush tunables as well:

ceph osd crush tunables legacy
ceph osd crush tunables optimal

but with no effect on ceph health.

Crush map:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host research10-pc {
    id -2       # do not change unnecessarily
    # weight 7.240
    alg straw
    hash 0      # rjenkins1
    item osd.0 weight 1.810
    item osd.1 weight 1.810
    item osd.2 weight 1.810
    item osd.3 weight 1.810
}
root default {
    id -1       # do not change unnecessarily
    # weight 7.240
    alg straw
    hash 0      # rjenkins1
    item research10-pc weight 7.240
}

# rules
rule replicated_ruleset {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

Regards,
Doan
Re: [ceph-users] Ceph on RHEL7.0
Hi Bruce,

The RHEL 7.0 kernel has many issues in the filesystem submodules, and most of them were fixed only in RHEL 7.1. So you should consider going to RHEL 7.1 directly and upgrading to at least kernel 3.10.0-229.1.2.

BR,
Luke

From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Bruce McFarland [bruce.mcfarl...@taec.toshiba.com]
Sent: Friday, May 29, 2015 5:13 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Ceph on RHEL7.0

We're planning on moving from CentOS 6.5 to RHEL 7.0 for Ceph storage and monitor nodes. Are there any known issues using RHEL 7.0?

Thanks
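[Editor's note] A quick sanity check before and after the move, using standard RHEL tooling (the version string shown is the minimum Luke recommends, not necessarily the newest available):

$ cat /etc/redhat-release
$ uname -r        # should report at least 3.10.0-229.1.2 on RHEL 7.1
$ sudo yum update kernel && sudo reboot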
Re: [ceph-users] Blocked requests/ops?
Hello,

On Thu, 28 May 2015 12:05:03 +0200 Xavier Serrano wrote:

On Thu May 28 11:22:52 2015, Christian Balzer wrote:

We are testing different scenarios before making our final decision (cache-tiering, journaling, separate pool, ...).

Definitely a good idea to test things out and get an idea what Ceph and your hardware can do. From my experience and from reading this ML, however, I think your best bet (for overall performance) is to use those 4 SSDs as 1:5 journal SSDs for your 20 HDD OSDs. Currently cache-tiering is probably the worst use of those SSD resources, though the code and strategy are of course improving.

I agree: in our particular environment, our tests also conclude that SSD journaling performs far better than cache-tiering, especially when the cache gets close to its capacity and data movement between cache and backing storage occurs frequently.

Precisely.

We also want to test whether it is possible to use SSD disks as a transparent cache for the HDDs at the system (Linux kernel) level, and how reliable/good that is.

There are quite a number of threads about this here, some quite recent/current. They range from "not worth it" (i.e. about the same performance as journal SSDs) to "xyz-cache destroyed my data, ate my babies and set the house on fire" (i.e. massive reliability problems). Which is a pity, as in theory they look like a nice fit/addition to Ceph.

Dedicated SSD pools may be a good fit depending on your use case. However, I'd advise against mixing SSD and HDD OSDs on the same node. To fully utilize those SSDs you'll need a LOT more CPU power than required by HDD OSDs or SSD-journal/HDD-OSD systems. And you already have 20 OSDs in that box.

Good point! We did not consider that, thanks for pointing it out.

What CPUs do you have in those storage nodes anyway?

Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz, according to /proc/cpuinfo. We have only 1 CPU per OSD node, so I'm afraid we have another potential bottleneck here.

Oh dear, about 10GHz total (that CPU is supposedly 2.4, but you may see 2.5 because it is already in turbo mode) for 20 OSDs, where the recommendation for HDD-only OSDs is 1GHz each.

Fire up atop (in a large window, so you can see all the details and devices) on one of your storage nodes. Then from a client (VM) run this:

---
fio --size=8G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4M --iodepth=32
---

This should result in your disks (OSDs) getting busy to the point of 100% utilization, while your CPUs still have some idle time (that's idle AND wait combined). If you change the blocksize to 4K (and just ctrl-c fio after 30 or so seconds) you should see a very different picture, with the CPU being much busier and the HDDs seeing less than 100% usage. That becomes even more pronounced with faster HDDs and/or journal SSDs. And pure SSD clusters/pools are way beyond that in terms of CPU hunger.

If you have the budget, I'd deploy the current storage nodes in classic (SSDs for journals) mode and add a small (2x 8-12 SSDs) pair of pure SSD nodes, optimized for their task (more CPU power, faster network). Then use those SSD nodes to experiment with cache-tiers and pure SSD pools, and switch things over when you're comfortable with this and happy with the performance. However, with 20 OSDs per node you're likely to go from being bottlenecked by your HDDs to being CPU-limited (when dealing with lots of small IOPS, at least). Still, better than now for sure.

This is very interesting, thanks for pointing it out!

What would you suggest to use in order to identify the actual bottleneck (disk, CPU, RAM, etc.)? Tools like munin?

Munin might work; I use collectd to gather all those values (and, even more importantly, all the Ceph counters) and graphite to visualize them. For ad-hoc, on-the-spot analysis I really like atop (in a huge window), which makes it very clear what is going on.

In addition, there are some kernel tunables that may be helpful to improve overall performance. Maybe we are filling some kernel internals and that limits our results (for instance, we had to increase fs.aio-max-nr in sysctl.d to 262144 to be able to use 20 disks per host). Which tunables should we watch?

I'm no expert on large (not even medium) clusters, so you'll have to research the archives and the net (the CERN Ceph slides are nice). One thing I remember is kernel.pid_max, which is something you're likely to run into at some point with your dense storage nodes:
http://ceph.com/docs/master/start/hardware-recommendations/#additional-considerations

Christian

All you say is really interesting. Thanks for your valuable advice. We surely still have plenty of things to learn and test before going to production.

As long as you have the time to test things out, you'll be fine. ^_^

Christian

Thanks again for your
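[Editor's note] The two tunables mentioned in this exchange can be made persistent with a sysctl drop-in. The aio value is the one Xavier reports using; the pid_max value is an illustrative ceiling for dense nodes (see the hardware-recommendations link above):

# /etc/sysctl.d/90-ceph.conf
fs.aio-max-nr = 262144
kernel.pid_max = 4194303

$ sudo sysctl --system        # apply without a reboot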
Re: [ceph-users] NFS interaction with RBD
Jens-Christian Fischer jens-christian.fischer@... writes:

I think we (i.e. Christian) found the problem: we created a test VM with 9 mounted RBD volumes (no NFS server). As soon as he hit all disks, we started to experience these 120-second timeouts. We realized that the QEMU process on the hypervisor opens a TCP connection to every OSD for every mounted volume - exceeding the 1024 FD limit. So no deep scrubbing etc., but simply too many connections…

I have seen mention of something similar from CERN in their presentations, and found this post on a quick Google; it might help:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-December/026187.html

Cheers,
Trent
Re: [ceph-users] OSD trashed by simple reboot (Debian Jessie, systemd?)
Hello Greg,

On Wed, 27 May 2015 22:53:43 -0700 Gregory Farnum wrote:

The description of the logging abruptly ending and the journal being bad really sounds like part of the disk is going back in time. I'm not sure if XFS internally is set up in such a way that something like losing part of its journal would allow that?

I'm special. ^o^
No XFS; ext4, as stated in the original thread below. And the (OSD) journal is a raw partition on a DC S3700.

And since there was at least a 30-second pause between the completion of "/etc/init.d/ceph stop" and the issuing of the shutdown command, the logging abruptly ending seems unlikely to be related to the shutdown at all.

If any of the OSD developers have the time, it's conceivable a copy of the OSD journal would be enlightening (if e.g. the header offsets are wrong but there are a bunch of valid journal entries), but this is two reports of this issue from you and none very similar from anybody else. I'm still betting on something in the software or hardware stack misbehaving. (There aren't that many people running Debian; there are lots of people running Ubuntu, and we find bad XFS kernels there not infrequently; I think you're hitting something like that.)

There should be no file system involved with the raw-partition SSD journal, n'est-ce pas?

The hardware is vastly different: the previous case was on an AMD system with onboard SATA (SP5100); this one is an SM storage goat with an LSI 3008. The only things they have in common are the Ceph version, 0.80.7 (via the Debian repository, not Ceph), and Debian Jessie as the OS with kernel 3.16 (though there were minor updates to that between the two incidents - backported fixes).

A copy of the journal would consist of the entire 10GB partition, since we don't know where in the loop it was at the time, right?

Christian

On Sun, May 24, 2015 at 7:26 PM, Christian Balzer ch...@gol.com wrote:

Hello again (marvel at my elephantine memory and thread necromancy).

Firstly, this happened again, details below. Secondly, this time I had changed things to sysv-init AND did an "/etc/init.d/ceph stop", which dutifully listed all OSDs as being killed/stopped BEFORE rebooting the node. This is a completely new node with significantly different HW than the example below, but the same SW versions as before (Debian Jessie, Ceph 0.80.7).

And just like below/before, the logs for that OSD have nothing in them indicating it shut down properly (no "journal flush done"), and when coming back after the reboot we get the dreaded:

---
2015-05-25 10:32:55.439492 7f568aa157c0 1 journal _open /var/lib/ceph/osd/ceph-30/journal fd 23: 1269312 bytes, block size 4096 bytes, directio = 1, aio = 1
2015-05-25 10:32:55.439859 7f568aa157c0 -1 journal read_header error decoding journal header
2015-05-25 10:32:55.439905 7f568aa157c0 -1 filestore(/var/lib/ceph/osd/ceph-30) mount failed to open journal /var/lib/ceph/osd/ceph-30/journal: (22) Invalid argument
2015-05-25 10:32:55.936975 7f568aa157c0 -1 osd.30 0 OSD:init: unable to mount object store
---

I see nothing in the changelogs for 0.80.8 and .9 that seems related to this, never mind that from the looks of it the repository at Ceph has only Wheezy (bpo70) packages and Debian Jessie is still stuck at 0.80.7 (Sid just went to .9 last week).

I'm preserving the state of things as they are for a few days, so if any developer would like a peek or more details, speak up now. I'd open an issue, but I don't have a reliable way to reproduce this, and even less desire to do so on this production cluster. ^_-

Christian

On Sat, 6 Dec 2014 12:48:25 +0900 Christian Balzer wrote:

On Fri, 5 Dec 2014 11:23:19 -0800 Gregory Farnum wrote:

On Thu, Dec 4, 2014 at 7:03 PM, Christian Balzer ch...@gol.com wrote:

Hello,

This morning I decided to reboot a storage node (Debian Jessie, thus 3.16 kernel and Ceph 0.80.7, HDD OSDs with SSD journals) after applying some changes. It came back up one OSD short; the last log lines before the reboot are:

---
2014-12-05 09:35:27.700330 7f87e789c700 2 -- 10.0.8.21:6823/29520 10.0.8.22:0/5161 pipe(0x7f881b772580 sd=247 :6823 s=2 pgs=21 cs=1 l=1 c=0x7f881f469020).fault (0) Success
2014-12-05 09:35:27.700350 7f87f011d700 10 osd.4 pg_epoch: 293 pg[3.316( v 289'1347 (0'0,289'1347] local-les=289 n=8 ec=5 les/c 289/289 288/288/288) [8,4,16] r=1 lpr=288 pi=276-287/1 luod=0'0 crt=289'1345 lcod 289'1346 active] cancel_copy_ops
---

Quite obviously it didn't complete its shutdown, so unsurprisingly we get:

---
2014-12-05 09:37:40.278128 7f218a7037c0 1 journal _open /var/lib/ceph/osd/ceph-4/journal fd 24: 1269312 bytes, block size 4096 bytes, directio = 1, aio = 1
2014-12-05 09:37:40.278427 7f218a7037c0 -1 journal read_header error decoding journal header
2014-12-05 09:37:40.278479 7f218a7037c0 -1
Re: [ceph-users] Cache Pool Flush/Eviction Limits - Hard of Soft?
Hi Greg,

That is really great, thanks for your response; I completely understand what is going on now. I wasn't thinking about capacity in a per-PG sense. I have exported a pg dump of the cache pool and calculated some percentages, and I can see that the data can vary by up to around 5% among the PGs, so this probably ties in with there being isolated bursts on single OSDs. I've knocked the cache_target_full_ratio down by 10% and will see if that helps.

FYI, regarding my 2nd point about having high and low ratios for cache eviction/flushing: I have been speaking to Li Wang, and he is potentially interested in developing a prototype.

Thanks Again,
Nick

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Gregory Farnum
Sent: 27 May 2015 22:02
To: Nick Fisk
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Cache Pool Flush/Eviction Limits - Hard of Soft?

The max target limit is a hard limit: the OSDs won't let more than that amount of data into the cache tier. They will start flushing and evicting based on the percentage ratios you can set (I don't remember the exact parameter names), and you may need to set these more aggressively for your given workload.

The tricky bit is that the OSDs of course don't have global knowledge of how much total data is in the cache — so when you set a 100TB cache that has 1024 PGs, the OSDs are actually applying those limits on a per-PG basis, not letting any given PG use more than 100/1024 TB. This is probably the heavy read activity you're seeing on one OSD at a time, when it happens to reach the hard limit. :/

The specific blocked ops you're seeing are in various stages and are probably just indicative of the OSD doing a bunch of flushing, which blocks other accesses.
-Greg

On Tue, May 19, 2015 at 12:03 PM, Nick Fisk n...@fisk.me.uk wrote:

Been doing some more digging. I'm getting messages in the OSD logs like these; I don't know if they are normal or a clue to something not being right:

2015-05-19 18:36:27.664698 7f58b91dd700 0 log_channel(cluster) log [WRN] : slow request 30.346117 seconds old, received at 2015-05-19 18:35:57.318208: osd_repop(client.1205463.0:7612211 6.2f ec3d412f/rb.0.6e7a9.74b0dc51.000be050/head//6 v 2674'1102892) currently commit_sent
2015-05-19 17:50:29.700766 7ff1503db700 0 log_channel(cluster) log [WRN] : slow request 32.548750 seconds old, received at 2015-05-19 17:49:57.151935: osd_repop_reply(osd.46.0:2088048 6.64 ondisk, result = 0) currently no flag points reached
2015-05-19 17:47:26.903122 7f296b6fc700 0 log_channel(cluster) log [WRN] : slow request 30.620519 seconds old, received at 2015-05-19 17:46:56.282504: osd_op(client.1205463.0:7261972 rb.0.6e7a9.74b0dc51.000b7ff9 [set-alloc-hint object_size 1048576 write_size 1048576,write 258048~131072] 6.882797bc ack+ondisk+write+known_if_redirected e2674) currently commit_sent

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Nick Fisk
Sent: 18 May 2015 17:25
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Cache Pool Flush/Eviction Limits - Hard of Soft?

Just to update on this: I've been watching iostat across my Ceph nodes, and I can see something slightly puzzling happening which is most likely the cause of the slow (32s) requests I am getting.

During a client write-only IO stream, I see reads and writes to the cache tier, which is normal as blocks are being promoted/demoted. Latency does suffer, but not excessively, and it is acceptable for data that has fallen out of cache.

However, every now and again one of the OSDs suddenly starts reading aggressively and appears to block all IO until that read has finished. Example below, where /dev/sdd is a 10K disk in the cache tier; the /dev/sdd devices on all other nodes are completely idle during this period. The disks on the base tier seem to be doing writes during this period, so it looks related to some sort of flushing.

Device  rrqm/s  wrqm/s  r/s     w/s   rkB/s    wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdd     0.00    0.00    471.50  0.00  2680.00  0.00   11.37     0.96      2.03   2.03     0.00     1.90   89.80

Most of the times I observed this while watching iostat, the read only lasted around 5-10s, but I suspect that it sometimes goes on for longer and is the cause of the "requests are blocked" errors. I have also noticed that this appears to happen more often when there are a greater number of blocks to be promoted/demoted. Other pools are not affected during these hangs. From the look of the iostat numbers, I would assume that, for a 10k disk, it must be doing a sequential read to get that number of IOs. Does anybody have any clue what
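[Editor's note] For reference, the ratios and hard cap discussed above are set per cache pool. The pool name and values here are illustrative (Nick's "down by 10%" would be e.g. 0.8 -> 0.7), and note that hammer exposes only single flush/evict thresholds, not the high/low pairs Nick proposed to Li Wang:

$ ceph osd pool set hot-pool cache_target_dirty_ratio 0.4
$ ceph osd pool set hot-pool cache_target_full_ratio 0.7
$ ceph osd pool set hot-pool target_max_bytes 1099511627776    # 1 TiB hard limit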
Re: [ceph-users] OSD trashed by simple reboot (Debian Jessie, systemd?)
On Thu, 28 May 2015 10:32:18 +0200 Jan Schermer wrote:

Can you check the capacitor reading on the S3700 with smartctl?

I suppose you mean this?
---
175 Power_Loss_Cap_Test 0x0033 100 100 010 Pre-fail Always - 648 (2 2862)
---
Never mind that these are brand new.

This drive has a non-volatile cache which *should* get flushed when power is lost; depending on what the hardware does on reboot, it might get flushed even when rebooting.

That would probably trigger an increase in the unsafe-shutdown count SMART value. I will have to test that from a known starting point, since the current values are likely from earlier tests and actual shutdowns.

I'd be surprised if a reboot dropped power to the drives, but it is a possibility of course. However, I'm VERY unconvinced that this could result in data loss with the SSDs in perfect CAPS health.

I just got this drive for testing yesterday and it's a beast, but some things were peculiar - for example, my fio benchmark slowed down (35K IOPS -> 5K IOPS) after several GB (random, 5-40) written, and then it would creep back up over time, even under load. Disabling the write cache helps, no idea why.

I haven't seen that behavior with DC S3700s, but with 5xx ones and some Samsungs, yes.

Christian

On 28 May 2015, at 09:22, Christian Balzer ch...@gol.com wrote:

Hello Greg,

[snip - quoted thread; see the full message earlier in this digest]
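[Editor's note] For anyone wanting to run the same check, the attribute Christian quotes is SMART attribute 175 on Intel DC-series drives; a sketch with an illustrative device node:

$ sudo smartctl -A /dev/sda | grep -i power_loss
175 Power_Loss_Cap_Test 0x0033 100 100 010 Pre-fail Always - 648 (2 2862)

A normalized value above the threshold (here 100 vs. 010) means the capacitors passed their last self-test. The unsafe-shutdown counter Christian mentions is, on these drives, typically attribute 192 (Unsafe_Shutdown_Count).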
[ceph-users] umount stuck on NFS gateways switch over by using Pacemaker
Hello,

I have been testing NFS over RBD recently. I am trying to build an NFS HA environment under Ubuntu 14.04 for testing; the package version information is as follows:

- Ubuntu 14.04 : 3.13.0-32-generic (Ubuntu 14.04.2 LTS)
- ceph : 0.80.9-0ubuntu0.14.04.2
- ceph-common : 0.80.9-0ubuntu0.14.04.2
- pacemaker (git20130802-1ubuntu2.3)
- corosync (2.3.3-1ubuntu1)

PS: I also tried ceph/ceph-common (0.87.1-1trusty and 0.87.2-1trusty) on a 3.13.0-48-generic (Ubuntu 14.04.2) server and hit the same situation.

The environment has 5 nodes in the Ceph cluster (3 MONs and 5 OSDs) and two NFS gateways (nfs1 and nfs2) for high availability. I issued the command 'sudo service pacemaker stop' on nfs1 to force the resources to stop and be transferred to nfs2, and vice versa. When both nodes are up and I issue 'sudo service pacemaker stop' on one node, the other node takes over all resources; everything looks fine. Then I waited about 30 minutes, doing nothing to the NFS gateways, and repeated the previous steps to test the failover procedure. This time I found the 'umount' process stuck in state 'D' (uninterruptible sleep); 'ps' showed the following:

root 21047 0.0 0.0 17412 952 ? D 16:39 0:00 umount /mnt/block1

Any idea how to solve or work around this? Because 'umount' is stuck, neither 'reboot' nor 'shutdown' works properly, so unless I wait 20 minutes for the 'umount' to time out, the only thing I can do is power off the server directly. Any help would be much appreciated. My configuration and logs are attached below.

Pacemaker configuration:

crm configure primitive p_rbd_map_1 ocf:ceph:rbd.in \
    params user=admin pool=block_data name=data01 cephconf=/etc/ceph/ceph.conf \
    op monitor interval=10s timeout=20s
crm configure primitive p_fs_rbd_1 ocf:heartbeat:Filesystem \
    params directory=/mnt/block1 fstype=xfs device=/dev/rbd1 \
    fast_stop=no options=noatime,nodiratime,nobarrier,inode64 \
    op monitor interval=20s timeout=40s \
    op start interval=0 timeout=60s \
    op stop interval=0 timeout=60s
crm configure primitive p_export_rbd_1 ocf:heartbeat:exportfs \
    params directory=/mnt/block1 clientspec=10.35.64.0/24 options=rw,async,no_subtree_check,no_root_squash fsid=1 \
    op monitor interval=10s timeout=20s \
    op start interval=0 timeout=40s
crm configure primitive p_vip_1 ocf:heartbeat:IPaddr2 \
    params ip=10.35.64.90 cidr_netmask=24 \
    op monitor interval=5
crm configure primitive p_nfs_server lsb:nfs-kernel-server \
    op monitor interval=10s timeout=30s
crm configure primitive p_rpcbind upstart:rpcbind \
    op monitor interval=10s timeout=30s
crm configure group g_rbd_share_1 p_rbd_map_1 p_fs_rbd_1 p_export_rbd_1 p_vip_1 \
    meta target-role=Started
crm configure group g_nfs p_rpcbind p_nfs_server \
    meta target-role=Started
crm configure clone clo_nfs g_nfs \
    meta globally-unique=false target-role=Started

'crm_mon' status in the normal condition:

Online: [ nfs1 nfs2 ]
 Resource Group: g_rbd_share_1
     p_rbd_map_1    (ocf::ceph:rbd.in):          Started nfs1
     p_fs_rbd_1     (ocf::heartbeat:Filesystem): Started nfs1
     p_export_rbd_1 (ocf::heartbeat:exportfs):   Started nfs1
     p_vip_1        (ocf::heartbeat:IPaddr2):    Started nfs1
 Clone Set: clo_nfs [g_nfs]
     Started: [ nfs1 nfs2 ]

'crm_mon' status in the failover condition:

Online: [ nfs1 nfs2 ]
 Resource Group: g_rbd_share_1
     p_rbd_map_1    (ocf::ceph:rbd.in):          Started nfs1
     p_fs_rbd_1     (ocf::heartbeat:Filesystem): Started nfs1 (unmanaged) FAILED
     p_export_rbd_1 (ocf::heartbeat:exportfs):   Stopped
     p_vip_1        (ocf::heartbeat:IPaddr2):    Stopped
 Clone Set: clo_nfs [g_nfs]
     Started: [ nfs2 ]
     Stopped: [ nfs1 ]

Failed actions:
    p_fs_rbd_1_stop_0 (node=nfs1, call=114, rc=1, status=Timed Out, last-rc-change=Wed May 13 16:39:10 2015, queued=60002ms, exec=1ms): unknown error

'dmesg' messages:

[ 9470.284509] nfsd: last server has exited, flushing export cache
[ 9470.322893] init: rpcbind main process (4267) terminated with status 2
[ 9600.520281] INFO: task umount:2675 blocked for more than 120 seconds.
[ 9600.520445]       Not tainted 3.13.0-32-generic #57-Ubuntu
[ 9600.520570] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9600.520792] umount          D 88003fc13480     0  2675      1 0x
[ 9600.520800]  88003a4f9dc0 0082 880039ece000 88003a4f9fd8
[ 9600.520805]  00013480 00013480 880039ece000 880039ece000
[ 9600.520809]  88003fc141a0 0001 88003a377928
[ 9600.520814] Call Trace:
[ 9600.520830]  [817251a9] schedule+0x29/0x70
[ 9600.520882]  [a043b300] _xfs_log_force+0x220/0x280 [xfs]
[ 9600.520891]  [8109a9b0] ? wake_up_state+0x20/0x20
[ 9600.520922]  [a043b386] xfs_log_force+0x26/0x80 [xfs]
[ 9600.520947]  [a03f3b6d] xfs_fs_sync_fs+0x2d/0x50 [xfs]
[ 9600.520954]
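[Editor's note] When umount wedges in D state like this, the kernel can be asked what it is blocked on; a minimal diagnostic sketch using standard tools (requires magic-sysrq to be enabled):

# list all processes in uninterruptible sleep, with the kernel function they wait in
$ ps -eo pid,stat,wchan:30,cmd | awk '$2 ~ /^D/'
# dump the kernel stacks of all blocked tasks into the kernel log
$ echo w | sudo tee /proc/sysrq-trigger
$ dmesg | tail -60

The xfs_log_force -> schedule trace in the dmesg output above is exactly what this produces: XFS flushing its log and waiting on I/O to the rbd device that is not completing, which is why the umount cannot be interrupted.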
Re: [ceph-users] Ceph MDS continually respawning (hammer)
On 05/27/2015 10:30 PM, Gregory Farnum wrote:

On Wed, May 27, 2015 at 6:49 AM, Kenneth Waegeman kenneth.waege...@ugent.be wrote:

We are also running a full backup sync to cephfs, using multiple distributed rsync streams (with zkrsync), and also ran into this issue today on Hammer 0.94.1. After setting the beacon higher, and eventually clearing the journal, it stabilized again. We were using ceph-fuse to mount the cephfs, not the ceph kernel client.

What's your MDS cache size set to?

I did set it to 100 before (we have 64G of RAM for the MDS), trying to get rid of the 'Client .. failing to respond to cache pressure' messages.

Did you have any warnings in the ceph log about clients not releasing caps?

Unfortunately I lost the logs from before it happened. But there is nothing in the new logs about that; I will follow this up.

I think you could hit this in ceph-fuse as well on hammer, although we just merged in a fix: https://github.com/ceph/ceph/pull/4653
-Greg
Re: [ceph-users] OSD trashed by simple reboot (Debian Jessie, systemd?)
Can you check the capacitor reading on the S3700 with smartctl ? This drive has non-volatile cache which *should* get flushed when power is lost, depending on what hardware does on reboot it might get flushed even when rebooting. I just got this drive for testing yesterday and it’s a beast, but some things were peculiar - for example my fio benchmark slowed down (35K IOPS - 5K IOPS) after several GB (random - 5-40) written, and then it would creep back up over time even under load. Disabling write cache helps, no idea why. Z. On 28 May 2015, at 09:22, Christian Balzer ch...@gol.com wrote: Hello Greg, On Wed, 27 May 2015 22:53:43 -0700 Gregory Farnum wrote: The description of the logging abruptly ending and the journal being bad really sounds like part of the disk is going back in time. I'm not sure if XFS internally is set up in such a way that something like losing part of its journal would allow that? I'm special. ^o^ No XFS, EXT4. As stated in the original thread, below. And the (OSD) journal is a raw partition on a DC S3700. And since there was at least a 30 seconds pause between the completion of the /etc/init.d/ceph stop and issuing of the shutdown command, the logging abruptly ending seems to be unlikely related to the shutdown at all. If any of the OSD developers have the time it's conceivable a copy of the OSD journal would be enlightening (if e.g. the header offsets are wrong but there are a bunch of valid journal entries), but this is two reports of this issue from you and none very similar from anybody else. I'm still betting on something in the software or hardware stack misbehaving. (There aren't that many people running Debian; there are lots of people running Ubuntu and we find bad XFS kernels there not infrequently; I think you're hitting something like that.) There should be no file system involved with the raw partition SSD journal, n'est-ce pas? The hardware is vastly different, the previous case was on an AMD system with onboard SATA (SP5100), this one is a SM storage goat with LSI 3008. The only thing they have in common is the Ceph version 0.80.7 (via the Debian repository, not Ceph) and Debian Jessie as OS with kernel 3.16 (though there were minor updates on that between those incidents, backported fixes) A copy of the journal would consist of the entire 10GB partition, since we don't know where in loop it was at the time, right? Christian On Sun, May 24, 2015 at 7:26 PM, Christian Balzer ch...@gol.com wrote: Hello again (marvel at my elephantine memory and thread necromancy) Firstly, this happened again, details below. Secondly, as I changed things to sysv-init AND did a /etc/init.d/ceph stop which dutifully listed all OSDs as being killed/stopped BEFORE rebooting the node. This is completely new node with significantly different HW than the example below. But the same SW versions as before (Debian Jessie, Ceph 0.80.7). 
And just like below/before, the logs for that OSD have nothing in them indicating it shut down properly (no "journal flush done"), and when coming back after the reboot we get the dreaded:

---
2015-05-25 10:32:55.439492 7f568aa157c0 1 journal _open /var/lib/ceph/osd/ceph-30/journal fd 23: 1269312 bytes, block size 4096 bytes, directio = 1, aio = 1
2015-05-25 10:32:55.439859 7f568aa157c0 -1 journal read_header error decoding journal header
2015-05-25 10:32:55.439905 7f568aa157c0 -1 filestore(/var/lib/ceph/osd/ceph-30) mount failed to open journal /var/lib/ceph/osd/ceph-30/journal: (22) Invalid argument
2015-05-25 10:32:55.936975 7f568aa157c0 -1 osd.30 0 OSD:init: unable to mount object store
---

I see nothing in the changelogs for 0.80.8 and .9 that seems related to this, never mind that from the looks of it the repository at Ceph has only Wheezy (bpo70) packages and Debian Jessie is still stuck at 0.80.7 (Sid just went to .9 last week).

I'm preserving the state of things as they are for a few days, so if any developer would like a peek or more details, speak up now. I'd open an issue, but I don't have a reliable way to reproduce this and even less desire to do so on this production cluster. ^_-

Christian

On Sat, 6 Dec 2014 12:48:25 +0900 Christian Balzer wrote:

On Fri, 5 Dec 2014 11:23:19 -0800 Gregory Farnum wrote:

On Thu, Dec 4, 2014 at 7:03 PM, Christian Balzer ch...@gol.com wrote:

Hello,

This morning I decided to reboot a storage node (Debian Jessie, thus 3.16 kernel and Ceph 0.80.7, HDD OSDs with SSD journals) after applying some changes. It came back up one OSD short; the last log lines before the reboot are:

---
2014-12-05 09:35:27.700330 7f87e789c700 2 -- 10.0.8.21:6823/29520 10.0.8.22:0/5161 pipe(0x7f881b772580 sd=247 :6823 s=2 pgs=21 cs=1 l=1 c=0x7f881f469020).fault (0) Success
2014-12-05 09:35:27.700350 7f87f011d700 10 osd.4 pg_epoch: 293 pg[3.316( v 289'1347 (0'0,289'1347] local-les=289 n=8 ec=5 les/c 289/289
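If a developer does want a copy, a minimal sketch of preserving the whole 10GB journal partition for offline inspection (the device name is a placeholder):

---
# Copy the raw journal partition of osd.30 to a file a developer can inspect.
# /dev/sdb1 is hypothetical; substitute the actual journal partition.
dd if=/dev/sdb1 of=/root/osd-30-journal.img bs=4M
---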
Re: [ceph-users] Blocked requests/ops?
On Thu May 28 11:22:52 2015, Christian Balzer wrote:

We are testing different scenarios before making our final decision (cache-tiering, journaling, separate pool, ...).

Definitely a good idea to test things out and get an idea of what Ceph and your hardware can do. From my experience and from reading this ML, however, I think your best bet (overall performance) is to use those 4 SSDs as 1:5 journal SSDs for your 20 HDD OSDs. Currently cache-tiering is probably the worst use for those SSD resources, though the code and strategy are of course improving.

I agree: in our particular environment, our tests also conclude that SSD journaling performs far better than cache-tiering, especially when the cache comes close to its capacity and data movement between cache and backing storage occurs frequently. We also want to test whether it is possible to use SSD disks as a transparent cache for the HDDs at the system (Linux kernel) level, and how reliable/good that is.

Dedicated SSD pools may be a good fit depending on your use case. However, I'd advise against mixing SSD and HDD OSDs on the same node. To fully utilize those SSDs you'll need a LOT more CPU power than is required by HDD OSDs or SSD journal/HDD OSD systems. And you already have 20 OSDs in that box.

Good point! We did not consider that; thanks for pointing it out.

What CPUs do you have in those storage nodes anyway?

Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz, according to /proc/cpuinfo. We have only 1 CPU per OSD node, so I'm afraid we have another potential bottleneck here.

If you have the budget, I'd deploy the current storage nodes in classic (SSDs for journals) mode and add a small (2x 8-12 SSDs) pair of pure SSD nodes, optimized for their task (more CPU power, faster network). Then use those SSD nodes to experiment with cache tiers and pure SSD pools, and switch things over when you're comfortable with them and happy with the performance. However, with 20 OSDs per node, you're likely to go from being bottlenecked by your HDDs to being CPU limited (when dealing with lots of small IOPS at least). Still, better than now for sure.

This is very interesting, thanks for pointing it out! What would you suggest using to identify the actual bottleneck (disk, CPU, RAM, etc.)? Tools like Munin?

Munin might work; I use collectd to gather all those values (and, even more importantly, all Ceph counters) and graphite to visualize them. For ad-hoc, on-the-spot analysis I really like atop (in a huge window), which will make it very clear what is going on.

In addition, there are some kernel tunables that may help improve overall performance. Maybe we are filling some kernel internals and that limits our results (for instance, we had to increase fs.aio-max-nr in sysctl.d to 262144 to be able to use 20 disks per host). Which tunables should we observe?

I'm no expert on large (or even medium) clusters, so you'll have to research the archives and the net (the CERN Ceph slides are nice). One thing I remember is kernel.pid_max, which is something you're likely to run into at some point with your dense storage nodes (see the sketch after this message):
http://ceph.com/docs/master/start/hardware-recommendations/#additional-considerations

Christian

All you say is really interesting. Thanks for your valuable advice. We surely still have plenty of things to learn and test before going to production. Thanks again for your time and help.
Best regards,
- Xavier Serrano
- LCAC, Laboratori de Càlcul
- Departament d'Arquitectura de Computadors, UPC
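A sketch of the two tunables mentioned above in sysctl.d drop-in form. The fs.aio-max-nr value is the one quoted in the thread; the pid_max value is only an illustration (4194303 is the 64-bit kernel maximum, against a default of 32768):

---
# /etc/sysctl.d/90-ceph.conf -- illustrative values only
fs.aio-max-nr = 262144       # async I/O contexts; needed above for 20 OSD disks per host
kernel.pid_max = 4194303     # dense OSD nodes can exhaust the 32768 default with OSD threads
---

Apply with 'sysctl --system' (or reboot).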
Re: [ceph-users] OSD trashed by simple reboot (Debian Jessie, systemd?)
On 28 May 2015, at 10:56, Christian Balzer ch...@gol.com wrote:

On Thu, 28 May 2015 10:32:18 +0200 Jan Schermer wrote:

Can you check the capacitor reading on the S3700 with smartctl?

I suppose you mean this?
---
175 Power_Loss_Cap_Test 0x0033 100 100 010 Pre-fail Always - 648 (2 2862)
---
Never mind that these are brand new.

Most of the failures occur on either very new or very old hardware :-)

This drive has a non-volatile cache which *should* get flushed when power is lost; depending on what the hardware does on reboot, it might get flushed even when rebooting.

That would probably trigger an increase in the unsafe shutdown count SMART value. I will have to test that from a known starting point, since the current values are likely from earlier tests and actual shutdowns. I'd be surprised if a reboot would drop power to the drives, but it is a possibility of course. However, I'm VERY unconvinced that this could result in data loss with the SSDs in perfect CAPS health.

You are right, it shouldn't happen, but stuff happens.

I just got this drive for testing yesterday and it's a beast, but some things were peculiar - for example, my fio benchmark slowed down (35K IOPS -> 5K IOPS) after several GB (random - 5-40) written, and then it would creep back up over time even under load. Disabling write cache helps, no idea why.

I haven't seen that behavior with DC S3700s, but with 5xx ones and some Samsungs, yes.

Try this simple test:

fio --filename=/dev/$device --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=journal-test --size=10M

(Play with iodepth; if I remember correctly the highest gain was with iodepth=1, higher depths reach almost the max without disabling write cache.)

First run with write cache enabled:
hdparm -W1 /dev/$device
then with write cache disabled:
hdparm -W0 /dev/$device

I get much higher IOPS with cache disabled on all SSDs I tested - Kingston, Samsung, Intel. I think it disables compression on those drives that use it internally, and it probably causes the SSD not to wait for other IOs to coalesce with. This might have a very bad effect on drive longevity in the long run, though...

Jan
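Jan's comparison, gathered into one small script - a sketch only, assuming a scratch device in $device, since writing to the raw device destroys any data on it:

---
#!/bin/sh
# DESTRUCTIVE: writes directly to the device. Use a scratch SSD only.
device=sdb    # placeholder; set to the device under test

for wc in 1 0; do    # 1 = write cache enabled, 0 = disabled
    hdparm -W$wc /dev/$device
    fio --filename=/dev/$device --direct=1 --sync=1 --rw=write --bs=4k \
        --numjobs=1 --iodepth=1 --runtime=60 --time_based \
        --name=journal-test-wc$wc --size=10M
done
---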
Re: [ceph-users] OSD trashed by simple reboot (Debian Jessie, systemd?)
On Thu, May 28, 2015 at 12:22 AM, Christian Balzer ch...@gol.com wrote:

Hello Greg,

On Wed, 27 May 2015 22:53:43 -0700 Gregory Farnum wrote:

The description of the logging abruptly ending and the journal being bad really sounds like part of the disk is going back in time. I'm not sure if XFS internally is set up in such a way that something like losing part of its journal would allow that?

I'm special. ^o^ No XFS, EXT4. As stated in the original thread, below. And the (OSD) journal is a raw partition on a DC S3700. And since there was at least a 30-second pause between the completion of /etc/init.d/ceph stop and the issuing of the shutdown command, the logging abruptly ending seems unlikely to be related to the shutdown at all.

Oh, sorry... I happened to read this article last night: http://lwn.net/SubscriberLink/645720/01149aa7c58954eb/
Depending on configuration (I think you'd need to have a journal-as-file) you could be experiencing that. And again, not many people use ext4, so who knows what other ways there are of things being broken that nobody else has seen yet.

If any of the OSD developers have the time, it's conceivable a copy of the OSD journal would be enlightening (if e.g. the header offsets are wrong but there are a bunch of valid journal entries), but this is two reports of this issue from you and none very similar from anybody else. I'm still betting on something in the software or hardware stack misbehaving. (There aren't that many people running Debian; there are lots of people running Ubuntu and we find bad XFS kernels there not infrequently; I think you're hitting something like that.)

There should be no file system involved with the raw partition SSD journal, n'est-ce pas?

...and I guess you probably aren't, since you are using partitions.

The hardware is vastly different: the previous case was on an AMD system with onboard SATA (SP5100), this one is a SM storage goat with LSI 3008. The only thing they have in common is the Ceph version, 0.80.7 (via the Debian repository, not Ceph), and Debian Jessie as OS with kernel 3.16 (though there were minor updates on that between those incidents, backported fixes).

A copy of the journal would consist of the entire 10GB partition, since we don't know where in the loop it was at the time, right?

Yeah.
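A quick way to check which of the two cases (journal-as-file vs. raw partition) applies; the path comes from the logs quoted earlier in this thread, and the output shown is illustrative:

---
# A symlink to a block device means a raw-partition journal;
# a regular file means journal-as-file.
ls -l /var/lib/ceph/osd/ceph-30/journal
# lrwxrwxrwx 1 root root 9 ... /var/lib/ceph/osd/ceph-30/journal -> /dev/sdb1
---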
Re: [ceph-users] umount stuck on NFS gateways switch over by using Pacemaker
On Thu, May 28, 2015 at 1:33 AM, wd_hw...@wistron.com wrote:

Hello,
I have been testing NFS over RBD recently. I am trying to build an NFS HA environment under Ubuntu 14.04 for testing; the package version information is as follows:
- Ubuntu 14.04: 3.13.0-32-generic (Ubuntu 14.04.2 LTS)
- ceph: 0.80.9-0ubuntu0.14.04.2
- ceph-common: 0.80.9-0ubuntu0.14.04.2
- pacemaker (git20130802-1ubuntu2.3)
- corosync (2.3.3-1ubuntu1)
PS: I also tried ceph/ceph-common (0.87.1-1trusty and 0.87.2-1trusty) on 3.13.0-48-generic (Ubuntu 14.04.2) servers and got the same results.

The environment has 5 nodes in the Ceph cluster (3 MONs and 5 OSDs) and two NFS gateways (nfs1 and nfs2) for high availability. I issued the command 'sudo service pacemaker stop' on 'nfs1' to force these resources to stop and be transferred to 'nfs2', and vice versa. When the two nodes are up and I issue 'sudo service pacemaker stop' on one node, the other node takes over all resources. Everything looks fine.

Then I waited about 30 minutes, doing nothing to the NFS gateways, and repeated the previous steps to test the fail-over procedure. This time I found the process state of 'umount' was 'D' (uninterruptible sleep); 'ps' showed the following:

root 21047 0.0 0.0 17412 952 ? D 16:39 0:00 umount /mnt/block1

Any idea how to solve or work around this? Because of the stuck 'umount', neither 'reboot' nor 'shutdown' works properly, so unless I wait 20 minutes for the 'umount' to time out, the only thing I can do is power off the server directly. Any help would be much appreciated.

I am not sure how to get out of the stuck umount, but you can skip the shutdown scripts that call the umount during a reboot using:

reboot -fn

This can cause data loss, as it is like a power cycle, so it is best to run sync before running the reboot -fn command to flush out buffers. Sometimes when a system is really hung, reboot -fn does not work, but this seems to always work if run as root:

echo 1 > /proc/sys/kernel/sysrq
echo b > /proc/sysrq-trigger

Eric
Re: [ceph-users] ceph-deploy for Hammer
Hi Pankaj,

While there have been times in the past when ARM binaries were hosted on ceph.com, there is not currently any ARM hardware for builds. I don't think you will see any ARM binaries in http://ceph.com/debian-hammer/pool/main/c/ceph/, for example. Combine that with the fact that ceph-deploy is not intended to work with locally compiled binaries (only packages, as it relies on paths, conventions, and service definitions from the packages), and ceph-deploy plus ARM is a very tricky combination to use.

Your most recent error is indicative of the ceph-mon service not coming up successfully. When ceph-mon (the service, not the daemon) is started, it also calls ceph-create-keys, which waits for the monitor daemon to come up and then creates the keys that are necessary for all clusters to run when using cephx (the admin key, the bootstrap keys).

- Travis

On Wed, May 27, 2015 at 8:27 PM, Garg, Pankaj pankaj.g...@caviumnetworks.com wrote:

Actually the ARM binaries do exist and I have been using them for previous releases. Somehow this library is the one that doesn't load. Anyway, I did compile my own Ceph for ARM, and am now getting the following issue:

[ceph_deploy.gatherkeys][WARNIN] Unable to find /etc/ceph/ceph.client.admin.keyring on ceph1
[ceph_deploy][ERROR ] KeyNotFoundError: Could not find keyring file: /etc/ceph/ceph.client.admin.keyring on host ceph1

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, May 27, 2015 4:29 PM
To: Garg, Pankaj
Cc: ceph-users@lists.ceph.com
Subject: RE: ceph-deploy for Hammer

If you are trying to install the ceph repo hammer binaries, I don't think they are built for ARM. Both the binary and the .so need to be built on ARM to make this work, I guess. Try building the hammer code base on your ARM server and then retry.

Thanks & Regards
Somnath

From: Pankaj Garg [mailto:pankaj.g...@caviumnetworks.com]
Sent: Wednesday, May 27, 2015 4:17 PM
To: Somnath Roy
Cc: ceph-users@lists.ceph.com
Subject: RE: ceph-deploy for Hammer

Yes, I am on ARM.
-Pankaj

On May 27, 2015 3:58 PM, Somnath Roy somnath@sandisk.com wrote:

Are you running this on ARM? If not, it should not go for loading this library.
Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Garg, Pankaj
Sent: Wednesday, May 27, 2015 2:26 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph-deploy for Hammer

I seem to be getting these errors in the monitor log:

2015-05-27 21:17:41.908839 3ff907368e0 -1 erasure_code_init(jerasure,/usr/lib/aarch64-linux-gnu/ceph/erasure-code): (5) Input/output error
2015-05-27 21:17:41.978113 3ff969168e0 0 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff), process ceph-mon, pid 16592
2015-05-27 21:17:41.984383 3ff969168e0 -1 ErasureCodePluginSelectJerasure: load dlopen(/usr/lib/aarch64-linux-gnu/ceph/erasure-code/libec_jerasure_neon.so): /usr/lib/aarch64-linux-gnu/ceph/erasure-code/libec_jerasure_neon.so: cannot open shared object file: No such file or directory
2015-05-27 21:17:41.98 3ff969168e0 -1 erasure_code_init(jerasure,/usr/lib/aarch64-linux-gnu/ceph/erasure-code): (5) Input/output error
2015-05-27 21:17:42.052415 3ff90cf68e0 0 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff), process ceph-mon, pid 16604
2015-05-27 21:17:42.058656 3ff90cf68e0 -1 ErasureCodePluginSelectJerasure: load dlopen(/usr/lib/aarch64-linux-gnu/ceph/erasure-code/libec_jerasure_neon.so): /usr/lib/aarch64-linux-gnu/ceph/erasure-code/libec_jerasure_neon.so: cannot open shared object file: No such file or directory
2015-05-27 21:17:42.058715 3ff90cf68e0 -1 erasure_code_init(jerasure,/usr/lib/aarch64-linux-gnu/ceph/erasure-code): (5) Input/output error
2015-05-27 21:17:42.125279 3ffac4368e0 0 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff), process ceph-mon, pid 16616
2015-05-27 21:17:42.131666 3ffac4368e0 -1 ErasureCodePluginSelectJerasure: load dlopen(/usr/lib/aarch64-linux-gnu/ceph/erasure-code/libec_jerasure_neon.so): /usr/lib/aarch64-linux-gnu/ceph/erasure-code/libec_jerasure_neon.so: cannot open shared object file: No such file or directory
2015-05-27 21:17:42.131726 3ffac4368e0 -1 erasure_code_init(jerasure,/usr/lib/aarch64-linux-gnu/ceph/erasure-code): (5) Input/output error

The lib file exists, so I am not sure why this is happening. Any help appreciated.

Thanks
Pankaj

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Garg, Pankaj
Sent: Wednesday, May 27, 2015 1:37 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] ceph-deploy for Hammer

Hi,
Is there a particular version of ceph-deploy that should be used with the Hammer release? This is a brand new cluster. I'm getting the following error when running the command: ceph-deploy mon create-initial

[ceph_deploy.conf][DEBUG ] found configuration file at:
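When gatherkeys fails like this, a few checks on the monitor host usually narrow it down. A hedged sketch - the mon id 'ceph1' matches the hostname in the error above, and the Upstart syntax assumes the Ubuntu packaging of that era:

---
# Is the monitor daemon actually running and in quorum?
sudo initctl status ceph-mon id=ceph1    # Upstart; on sysvinit: sudo service ceph status mon
sudo ceph daemon mon.ceph1 mon_status    # query the mon via its admin socket

# ceph-create-keys only writes this once the mon reaches quorum:
ls -l /etc/ceph/ceph.client.admin.keyring
---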
[ceph-users] mds crash
Hi all,

I have been testing cephfs with an erasure-coded pool and a cache tier. I have 3 MDSs running on the same physical servers as the 3 MONs. The cluster is otherwise in an OK state: rbd is working and all PGs are active+clean. I'm running v0.87.2 Giant on all nodes and Ubuntu 14.04.2. The cluster was working fine, but when copying a large file from a client to cephfs it froze, and now the MDSs keep crashing with:

0 2015-05-28 16:50:58.267112 7f0282946700 -1 mds/MDCache.cc: In function 'virtual void C_IO_MDC_TruncateFinish::finish(int)' thread 7f0282946700 time 2015-05-28 16:50:58.243904
mds/MDCache.cc: 5974: FAILED assert(r == 0 || r == -2)

Any ideas?
Thanks
Re: [ceph-users] Memory Allocators and Ceph
I've got some more tests running right now. Once those are done, I'll find a couple of tests that had extreme differences and gather some perf data for them.

- Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Wed, May 27, 2015 at 3:48 PM, Mark Nelson wrote:

On 05/27/2015 04:00 PM, Robert LeBlanc wrote:

On Wed, May 27, 2015 at 2:06 PM, Mark Nelson wrote:

Compiling Ceph entirely with jemalloc overall had a negative performance impact. This may be due to dynamically linking to RocksDB instead of the default static linking.

Is it possible that there were any other differences? A 30% gain turning into a 30% loss with preloading vs. compiling seems pretty crazy!

I tried hard to minimize the differences by backporting the Ceph jemalloc feature into the 0.94.1 that was used in the other testing. I did have to get RocksDB from master to get it to compile against jemalloc, so there is some difference there. When preloading Ceph with jemalloc, parts of Ceph still used tcmalloc because it was statically linked to by RocksDB, so it was using both allocators during those tests. Programming is not my forte, so it is likely that I may have botched something with that test.

The goal of the test was to see if and where these allocators may help/hinder performance. It could also provide some feedback to Ceph devs on how to leverage one or the other, or both. I don't consider this test to be extremely reliable, as there is some variability in this pre-production system even though I tried to remove the variability to an extent. I hope others can build on this as a jumping-off point and at least have some interesting places to look, instead of having to scope out a large section of the space.

Might be worth trying to reproduce the results and grab perf data or some other kind of trace data during the tests. There's so much variability here it's really tough to get an idea of why the performance swings so dramatically.

I'm not very familiar with the perf tools (can you use them with jemalloc?) and what would be useful. If you would like to tell me some configurations and tests you are interested in, and let me know how you want perf to generate the data, I can see what I can do to provide that. Each test suite takes about 9 hours to run, so it is pretty intensive.

perf can give you a call graph showing how much CPU time is being spent in different parts of the code. Something like this during the test:

sudo perf record --call-graph dwarf -F 99 -a
sudo perf report

You may need a newish kernel/OS for dwarf support to work. There are probably other tools that may also give insights into what is going on.

Each sub-test (i.e. 4K seq read) takes 5 minutes, so it is much easier to run selections of those if there are specific tests you are interested in. I'm happy to provide data, but given the time it takes to run these tests, if we can focus on specific areas it will provide data/benefits much faster.

I guess starting out I'm interested in what's happening with preloaded vs. compiled jemalloc. Other tests might be interesting too, though! Still, excellent testing! We definitely need more of this so we can determine if jemalloc is something that would be worth switching to eventually.
- Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
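For readers unfamiliar with the distinction discussed above: "preloading" swaps the allocator in at process load time without rebuilding Ceph, via LD_PRELOAD. A minimal sketch, assuming the Ubuntu/trusty path for the libjemalloc1 package (statically linked components such as the bundled RocksDB are unaffected, as Robert notes):

---
# Run an OSD in the foreground with jemalloc preloaded.
# Library path is an assumption; adjust for your distro.
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 ceph-osd -i 0 -f
---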
Re: [ceph-users] mds crash
(This came up as an in-reply-to to the previous "mds crashing" thread -- it's better to start threads with a fresh message.)

On 28/05/2015 16:58, Peter Tiernan wrote:

Hi all,
I have been testing cephfs with an erasure-coded pool and a cache tier. I have 3 MDSs running on the same physical servers as the 3 MONs. The cluster is otherwise in an OK state: rbd is working and all PGs are active+clean. I'm running v0.87.2 Giant on all nodes and Ubuntu 14.04.2. The cluster was working fine, but when copying a large file from a client to cephfs it froze, and now the MDSs keep crashing with:

0 2015-05-28 16:50:58.267112 7f0282946700 -1 mds/MDCache.cc: In function 'virtual void C_IO_MDC_TruncateFinish::finish(int)' thread 7f0282946700 time 2015-05-28 16:50:58.243904
mds/MDCache.cc: 5974: FAILED assert(r == 0 || r == -2)

Any ideas?

You're getting some kind of IO error from RADOS, and the CephFS code doesn't have clean handling for that in many cases, so it's asserting out. Enable "debug objecter = 10" on the MDS to see what the operation is that's failing, and please provide the whole section of the log leading up to the crash rather than just the last line.

Cheers,
John
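A hedged sketch of two ways to raise that log level; the MDS id 'a' is a placeholder:

---
# 1) At runtime, on the MDS host, via the admin socket:
ceph daemon mds.a config set debug_objecter 10

# 2) Persistently, in ceph.conf on the MDS host (restart the MDS afterwards):
#    [mds]
#        debug objecter = 10
---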