Re: [ceph-users] about rgw region sync
Are you trying to setup replication on one cluster right now? Generally replication is setup between two different clusters, each having one zone. Both clusters are in the same region. I can't think of a reason why two zones in one cluster wouldn't work. It's more complicated to setup though. Anything outside of a test setup would need a lot of planning to make sure the two zones are as fault isolated as possible. I'm pretty sure you need separate RadosGW nodes for each zone. It could be possible to share, but it will be easier if you don't. I still haven't gone through your previous logs carefully. On Tue, May 12, 2015 at 6:46 AM, TERRY 316828...@qq.com wrote: could i build one region using two clusters, each cluster has one zone。 so that I sync metadata and data from one cluster to another cluster。 I build two ceph clusters. for the first cluster, I do the follow steps 1.create pools sudo ceph osd pool create .us-east.rgw.root 64 64 sudo ceph osd pool create .us-east.rgw.control 64 64 sudo ceph osd pool create .us-east.rgw.gc 64 64 sudo ceph osd pool create .us-east.rgw.buckets 64 64 sudo ceph osd pool create .us-east.rgw.buckets.index 64 64 sudo ceph osd pool create .us-east.rgw.buckets.extra 64 64 sudo ceph osd pool create .us-east.log 64 64 sudo ceph osd pool create .us-east.intent-log 64 64 sudo ceph osd pool create .us-east.usage 64 64 sudo ceph osd pool create .us-east.users 64 64 sudo ceph osd pool create .us-east.users.email 64 64 sudo ceph osd pool create .us-east.users.swift 64 64 sudo ceph osd pool create .us-east.users.uid 64 64 2.create a keyring sudo ceph-authtool --create-keyring /etc/ceph/ceph.client.radosgw.keyring sudo chmod +r /etc/ceph/ceph.client.radosgw.keyring sudo ceph-authtool /etc/ceph/ceph.client.radosgw.keyring -n client.radosgw.us-east-1 --gen-key sudo ceph-authtool -n client.radosgw.us-east-1 --cap osd 'allow rwx' --cap mon 'allow rwx' /etc/ceph sudo ceph -k /etc/ceph/ceph.client.admin.keyring auth add client.radosgw.us-east-1 -i /etc/ceph/ceph.client.radosgw.keyring 3.create a region sudo radosgw-admin region set --infile us.json --name client.radosgw.us-east-1 sudo radosgw-admin region default --rgw-region=us --name client.radosgw.us-east-1 sudo radosgw-admin regionmap update --name client.radosgw.us-east-1 the content of us.json: cat us.json { name: us, api_name: us, is_master: true, endpoints: [ http:\/\/WH-CEPH-TEST01.MATRIX.CTRIPCORP.COM:80\/, http:\/\/ WH-CEPH-TEST02.MATRIX.CTRIPCORP.COM:80\/], master_zone: us-east, zones: [ { name: us-east, endpoints: [ http:\/\/WH-CEPH-TEST01.MATRIX.CTRIPCORP.COM:80\/], log_meta: true, log_data: true}, { name: us-west, endpoints: [ http:\/\/WH-CEPH-TEST02.MATRIX.CTRIPCORP.COM:80\/], log_meta: true, log_data: true}], placement_targets: [ { name: default-placement, tags: [] } ], default_placement: default-placement} 4.create zones sudo radosgw-admin zone set --rgw-zone=us-east --infile us-east-secert.json --name client.radosgw.us-east-1 sudo radosgw-admin regionmap update --name client.radosgw.us-east-1 cat us-east-secert.json { domain_root: .us-east.domain.rgw, control_pool: .us-east.rgw.control, gc_pool: .us-east.rgw.gc, log_pool: .us-east.log, intent_log_pool: .us-east.intent-log, usage_log_pool: .us-east.usage, user_keys_pool: .us-east.users, user_email_pool: .us-east.users.email, user_swift_pool: .us-east.users.swift, user_uid_pool: .us-east.users.uid, system_key: { access_key: XNK0ST8WXTMWZGN29NF9, secret_key: 7VJm8uAp71xKQZkjoPZmHu4sACA1SY8jTjay9dP5}, placement_pools: [ { key: default-placement, val: { 
index_pool: .us-east.rgw.buckets.index, data_pool: .us-east.rgw.buckets} } ] } #5 Create Zone Users system user sudo radosgw-admin user create --uid=us-east --display-name=Region-US Zone-East --name client.radosgw.us-east-1 --access_key=XNK0ST8WXTMWZGN29NF9 --secret=7VJm8uAp71xKQZkjoPZmHu4sACA1SY8jTjay9dP5 --system sudo radosgw-admin user create --uid=us-west --display-name=Region-US Zone-West --name client.radosgw.us-east-1 --access_key=AAK0ST8WXTMWZGN29NF9 --secret=AAJm8uAp71xKQZkjoPZmHu4sACA1SY8jTjay9dP5 --system #6 creat zone users not system user sudo radosgw-admin user create --uid=us-test-east --display-name=Region-US Zone-East-test --name client.radosgw.us-east-1 --access_key=DDK0ST8WXTMWZGN29NF9 --secret=DDJm8uAp71xKQZkjoPZmHu4sACA1SY8jTjay9dP5 #7 subuser create sudo radosgw-admin subuser create --uid=us-test-east --subuser=us-test-east:swift --access=full --name client.radosgw.us-east-1 --key-type swift --secret=ffJm8uAp71xKQZkjoPZmHu4sACA1SY8jTjay9dP5 sudo /etc/init.d/ceph -a restart sudo /etc/init.d/httpd re sudo /etc/init.d/ceph-radosgw restart for the second cluster, I do the follow steps
Re: [ceph-users] RadosGW - Hardware recomendations
RadosGW is pretty light compared to the rest of Ceph, but it depends on your use case. RadosGW just needs network bandwidth and a bit of CPU. It doesn't access the cluster network, just the public network. If you have some spare public network bandwidth, you can run on existing nodes. If you plan to build a big object store, you should dedicate some nodes. Either way, you'll want a big enough load balancer in front of them. RadosGW is just HTTP, so re-organizing the RadosGW topology is very easy. For dedicated hardware, I would use the same hardware that I use for a MON node. For network bandwidth planning, think of RadosGW as a load balancer. It's simplistic, but it works to a first approximation. An upload comes in to RadosGW, and gets streamed out to the OSDs. A download request is made, RadosGW pulls the data from the OSDs, and sends it to the client. If you want the RadosGW public IPs on a network that isn't the ceph public network, then I'd give them dedicated hardware with a connection to the HTTP network and the ceph public network. I only have 7 nodes in my cluster, and all RadosGW processes use a total of ~150 Mbps. Because my usage is light, I'm running Apache and RadosGW daemon on the MON nodes. Once those nodes start using 50% of their public network bandwidth, I'll move RadosGW to dedicated hardware. On Wed, May 6, 2015 at 2:09 PM, Italo Santos okd...@gmail.com wrote: Hello everyone, I’m build a new infrastructure which will serve S3 protocol, and I’d like your help to estimate a hardware configuration to radosgw servers. I found many information on - http://ceph.com/docs/master/start/hardware-recommendations/ but nothing about the radosgw daemon. Regards. *Italo Santos* http://italosantos.com.br/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
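Since RadosGW is just HTTP, the load balancer in front of it can be anything you're comfortable with. A minimal HAProxy sketch for two gateway nodes (the addresses and names are hypothetical, and you'd want the timeouts and health checks tuned for your largest uploads):

global
    daemon
defaults
    mode http
    timeout connect 5s
    timeout client 5m
    timeout server 5m
frontend rgw_frontend
    bind *:80
    default_backend rgw_backend
backend rgw_backend
    balance roundrobin
    server rgw1 192.0.2.11:80 check
    server rgw2 192.0.2.12:80 check

Because RadosGW is stateless HTTP, you can add or remove backend servers here without touching the Ceph cluster itself.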
Re: [ceph-users] about rgw region sync
System users are the only ones that need to be created in both zones. Non-system users (and their sub-users) should be created in the primary zone. radosgw-agent will replicate them to the secondary zone. I didn't create sub-users for my system users, but I don't think it matters. I can read my objects from the primary and secondary zones using the same non-system user's Access and Secret. Using the S3 API, I only had to change the host name to use the DNS entries that point at the secondary cluster. eg http://bucket1.us-east.myceph.com/object and http://bucket1.us-west.myceph.com/object. It's possible that adding the non-system users to the secondary zone causes replication to fail. I would verify that users, buckets, and objects are being replicated using radosgw-admin. `radosgw-admin --name $name bucket list`, `radosgw-admin --name $name user info --uid=$username`, and `radosgw-admin --name $name --bucket=$bucket bucket list`. That will let you determine if you have a replication or an access problem. On Wed, Apr 29, 2015 at 10:27 PM, TERRY 316828...@qq.com wrote: hi: I am using the following script to setup my cluster. I upgrade my radosgw-agent from version 1.2.0 to 1.2.2-1. (1.2.0 will results a error!) cat repeat.sh #!/bin/bash set -e set -x #1 create pools sudo ./create_pools.sh #2 create a keyring sudo ceph-authtool --create-keyring /etc/ceph/ceph.client.radosgw.keyring sudo chmod +r /etc/ceph/ceph.client.radosgw.keyring sudo ceph-authtool /etc/ceph/ceph.client.radosgw.keyring -n client.radosgw.us-east-1 --gen-key sudo ceph-authtool /etc/ceph/ceph.client.radosgw.keyring -n client.radosgw.us-west-1 --gen-key sudo ceph-authtool -n client.radosgw.us-east-1 --cap osd 'allow rwx' --cap mon 'allow rwx' /etc/ceph/ceph.client.radosgw.keyring sudo ceph-authtool -n client.radosgw.us-west-1 --cap osd 'allow rwx' --cap mon 'allow rwx' /etc/ceph/ceph.client.radosgw.keyring sudo ceph -k /etc/ceph/ceph.client.admin.keyring auth del client.radosgw.us-east-1 sudo ceph -k /etc/ceph/ceph.client.admin.keyring auth del client.radosgw.us-west-1 sudo ceph -k /etc/ceph/ceph.client.admin.keyring auth add client.radosgw.us-east-1 -i /etc/ceph/ceph.client.radosgw.keyring sudo ceph -k /etc/ceph/ceph.client.admin.keyring auth add client.radosgw.us-west-1 -i /etc/ceph/ceph.client.radosgw.keyring # 3 create a region sudo radosgw-admin region set --infile us.json --name client.radosgw.us-east-1 set +e sudo rados -p .us.rgw.root rm region_info.default set -e sudo radosgw-admin region default --rgw-region=us --name client.radosgw.us-east-1 sudo radosgw-admin regionmap update --name client.radosgw.us-east-1 # try don't do it sudo radosgw-admin region set --infile us.json --name client.radosgw.us-west-1 set +e sudo rados -p .us.rgw.root rm region_info.default set -e sudo radosgw-admin region default --rgw-region=us --name client.radosgw.us-west-1 sudo radosgw-admin regionmap update --name client.radosgw.us-west-1 # 4 create zones # try chanege us-east-no-secert.json file contents sudo radosgw-admin zone set --rgw-zone=us-east --infile us-east-no-secert.json --name client.radosgw.us-east-1 sudo radosgw-admin zone set --rgw-zone=us-east --infile us-east-no-secert.json --name client.radosgw.us-west-1 sudo radosgw-admin zone set --rgw-zone=us-west --infile us-west-no-secert.json --name client.radosgw.us-east-1 sudo radosgw-admin zone set --rgw-zone=us-west --infile us-west-no-secert.json --name client.radosgw.us-west-1 set +e sudo rados -p .rgw.root rm zone_info.default set -e sudo radosgw-admin regionmap update --name 
client.radosgw.us-east-1 # try don't do it sudo radosgw-admin regionmap update --name client.radosgw.us-west-1 #5 Create Zone Users system user sudo radosgw-admin user create --uid=us-east --display-name=Region-US Zone-East --name client.radosgw.us-east-1 --access_key=XNK0ST8WXTMWZGN29NF9 --secret=7VJm8uAp71xKQZkjoPZmHu4sACA1SY8jTjay9dP5 --system sudo radosgw-admin user create --uid=us-west --display-name=Region-US Zone-West --name client.radosgw.us-west-1 --access_key=AAK0ST8WXTMWZGN29NF9 --secret=AAJm8uAp71xKQZkjoPZmHu4sACA1SY8jTjay9dP5 --system sudo radosgw-admin user create --uid=us-east --display-name=Region-US Zone-East --name client.radosgw.us-west-1 --access_key=XNK0ST8WXTMWZGN29NF9 --secret=7VJm8uAp71xKQZkjoPZmHu4sACA1SY8jTjay9dP5 --system sudo radosgw-admin user create --uid=us-west --display-name=Region-US Zone-West --name client.radosgw.us-east-1 --access_key=AAK0ST8WXTMWZGN29NF9 --secret=AAJm8uAp71xKQZkjoPZmHu4sACA1SY8jTjay9dP5 --system #6 subuser create #may create a user without --system? sudo radosgw-admin subuser create --uid=us-east --subuser=us-east:swift --access=full --name client.radosgw.us-east-1 --key-type swift --secret=7VJm8uAp71xKQZkjoPZmHu4sACA1SY8jTjay9dP5 sudo radosgw-admin subuser create --uid=us-west --subuser=us-west:swift --access=full --name client.radosgw.us-west-1 --key-type swift
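Following up on the verification advice at the top of this message, a rough sequence looks like this (zone and user names follow the quoted setup; bucket1 is a placeholder, and --name should be whichever gateway user you run the command as):

# buckets known to each zone
radosgw-admin bucket list --name client.radosgw.us-east-1
radosgw-admin bucket list --name client.radosgw.us-west-1
# confirm a non-system user replicated from the primary to the secondary zone
radosgw-admin user info --uid=us-test-east --name client.radosgw.us-west-1
# list the objects in one bucket from both zones
radosgw-admin bucket list --bucket=bucket1 --name client.radosgw.us-east-1
radosgw-admin bucket list --bucket=bucket1 --name client.radosgw.us-west-1

If the users and objects show up on both sides, the agent is replicating and any remaining failures are an access problem, not a replication problem.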
Re: [ceph-users] How to backup hundreds or thousands of TB
This is an older post of mine on this topic: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-April/038484.html. The only thing that's changed since then is that Hammer now supports RadosGW object versioning. A combination of RadosGW replication, versioning, and access control meets my needs for offsite backup. I've abandoned the RadosGW snapshots hack I was working on. On Wed, May 6, 2015 at 1:25 AM, Götz Reinicke - IT Koordinator goetz.reini...@filmakademie.de wrote: Hi folks, beside hardware and performance and failover design: How do you manage to backup hundreds or thousands of TB :) ? Any suggestions? Best practice? A second ceph cluster at a different location? bigger archive Disks in good boxes? Or tabe-libs? What kind of backupsoftware can handle such volumes nicely? Thanks and regards . Götz -- Götz Reinicke IT-Koordinator Tel. +49 7141 969 82 420 E-Mail goetz.reini...@filmakademie.de Filmakademie Baden-Württemberg GmbH Akademiehof 10 71638 Ludwigsburg www.filmakademie.de Eintragung Amtsgericht Stuttgart HRB 205016 Vorsitzender des Aufsichtsrats: Jürgen Walter MdL Staatssekretär im Ministerium für Wissenschaft, Forschung und Kunst Baden-Württemberg Geschäftsführer: Prof. Thomas Schadt ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
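For the versioning piece, once you're on Hammer it should be possible to turn it on per bucket from any S3 client that supports the call. A sketch with the AWS CLI pointed at RGW (the endpoint and bucket name are hypothetical, and I haven't verified this call against RGW myself):

aws --endpoint-url http://rgw.example.com s3api put-bucket-versioning \
  --bucket backups --versioning-configuration Status=Enabled
aws --endpoint-url http://rgw.example.com s3api get-bucket-versioning --bucket backups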
Re: [ceph-users] Ceph Radosgw multi zone data replication failure
[root@us-east-1 ceph]# ceph -s --name client.radosgw.us-east-1 [root@us-east-1 ceph]# ceph -s --name client.radosgw.us-west-1 Are you trying to setup two zones on one cluster? That's possible, but you'll also want to spend some time on your CRUSH map making sure that the two zones are as independent as possible (no shared disks, etc). Are you using Civetweb or Apache + FastCGI? Can you include the output (from both clusters): radosgw-admin --name=client.radosgw.us-east-1 region get radosgw-admin --name=client.radosgw.us-east-1 zone get Double check that both system users exist in both clusters, with the same secret. On Sun, Apr 26, 2015 at 8:01 AM, Vickey Singh vickey.singh22...@gmail.com wrote: Hello Geeks I am trying to setup Ceph Radosgw multi site data replication using official documentation http://ceph.com/docs/master/radosgw/federated-config/#multi-site-data-replication Everything seems to work except radosgw-agent sync , Request you to please check the below outputs and help me in any possible way. *Environment : * CentOS 7.0.1406 Ceph Versino 0.87.1 Rados Gateway configured using Civetweb *Radosgw zone list : Works nicely * [root@us-east-1 ceph]# radosgw-admin zone list --name client.radosgw.us-east-1 { zones: [ us-west, us-east]} [root@us-east-1 ceph]# *Curl request to master zone : Works nicely * [root@us-east-1 ceph]# curl http://us-east-1.crosslogic.com:7480 ?xml version=1.0 encoding=UTF-8?ListAllMyBucketsResult xmlns= http://s3.amazonaws.com/doc/2006-03-01/ OwnerIDanonymous/IDDisplayName/DisplayName/OwnerBuckets/Buckets/ListAllMyBucketsResult [root@us-east-1 ceph]# *Curl request to secondary zone : Works nicely * [root@us-east-1 ceph]# curl http://us-west-1.crosslogic.com:7480 ?xml version=1.0 encoding=UTF-8?ListAllMyBucketsResult xmlns= http://s3.amazonaws.com/doc/2006-03-01/ OwnerIDanonymous/IDDisplayName/DisplayName/OwnerBuckets/Buckets/ListAllMyBucketsResult [root@us-east-1 ceph]# *Rados Gateway agent configuration file : Seems correct, no TYPO errors* [root@us-east-1 ceph]# cat cluster-data-sync.conf src_access_key: M7QAKDH8CYGTK86CG93U src_secret_key: 0xQR6PINk23W\/GYrWJ14aF+1stG56M6xMkqkdloO destination: http://us-west-1.crosslogic.com:7480 dest_access_key: ZQ32ES1WAWPG05YMZ7T7 dest_secret_key: INvk8AkrZRsejLEL34yRpMLmOqydt8ncOXy4RHCM log_file: /var/log/radosgw/radosgw-sync-us-east-west.log [root@us-east-1 ceph]# *Rados Gateway agent SYNC : Fails , however it can fetch region map so i think src and dest KEYS are correct. 
But don't know why it fails on AttributeError * *[root@us-east-1 ceph]# radosgw-agent -c cluster-data-sync.conf* *region map is: {u'us': [u'us-west', u'us-east']}* *Traceback (most recent call last):* * File /usr/bin/radosgw-agent, line 21, in module* *sys.exit(main())* * File /usr/lib/python2.7/site-packages/radosgw_agent/cli.py, line 275, in main* *except client.ClientException as e:* *AttributeError: 'module' object has no attribute 'ClientException'* *[root@us-east-1 ceph]#* *Can query to Ceph cluster using us-east-1 ID* [root@us-east-1 ceph]# ceph -s --name client.radosgw.us-east-1 cluster 9609b429-eee2-4e23-af31-28a24fcf5cbc health HEALTH_OK monmap e3: 3 mons at {ceph-node1= 192.168.1.101:6789/0,ceph-node2=192.168.1.102:6789/0,ceph-node3=192.168.1.103:6789/0}, election epoch 448, quorum 0,1,2 ceph-node1,ceph-node2,ceph-node3 osdmap e1063: 9 osds: 9 up, 9 in pgmap v8473: 1500 pgs, 43 pools, 374 MB data, 2852 objects 1193 MB used, 133 GB / 134 GB avail 1500 active+clean [root@us-east-1 ceph]# *Can query to Ceph cluster using us-west-1 ID* [root@us-east-1 ceph]# ceph -s --name client.radosgw.us-west-1 cluster 9609b429-eee2-4e23-af31-28a24fcf5cbc health HEALTH_OK monmap e3: 3 mons at {ceph-node1= 192.168.1.101:6789/0,ceph-node2=192.168.1.102:6789/0,ceph-node3=192.168.1.103:6789/0}, election epoch 448, quorum 0,1,2 ceph-node1,ceph-node2,ceph-node3 osdmap e1063: 9 osds: 9 up, 9 in pgmap v8473: 1500 pgs, 43 pools, 374 MB data, 2852 objects 1193 MB used, 133 GB / 134 GB avail 1500 active+clean [root@us-east-1 ceph]# *Hope these packages are correct* [root@us-east-1 ceph]# rpm -qa | egrep -i ceph|radosgw libcephfs1-0.87.1-0.el7.centos.x86_64 ceph-common-0.87.1-0.el7.centos.x86_64 python-ceph-0.87.1-0.el7.centos.x86_64 ceph-radosgw-0.87.1-0.el7.centos.x86_64 ceph-release-1-0.el7.noarch ceph-0.87.1-0.el7.centos.x86_64 radosgw-agent-1.2.1-0.el7.centos.noarch [root@us-east-1 ceph]# Regards VS ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] cluster not coming up after reboot
On Thu, Apr 23, 2015 at 5:20 AM, Kenneth Waegeman wrote: So it is all fixed now, but is it explainable that at first about 90% of the OSDs kept going into shutdown over and over, and only got into a stable situation after some time, because of one host network failure? Thanks again!

Yes, unless you've adjusted:

[global]
mon osd min down reporters = 9
mon osd min down reports = 12

OSDs talk to the MONs on the public network. The cluster network is only used for OSD-to-OSD communication. If one OSD node can't talk on that network, the other nodes will tell the MONs that its OSDs are down. And that node will also tell the MONs that all the other OSDs are down. Then the OSDs marked down will tell the MONs that they're not down, and the cycle will repeat. I'm somewhat surprised that your cluster eventually stabilized.

I have 8 OSDs per node. I set my min down reporters high enough that no single node can mark another node's OSDs down. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
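For example, with 8 OSDs per node, setting the reporter count to at least 9 means no single node can mark another node's OSDs down by itself. A minimal sketch (the exact value is a judgment call for your failure domain, and the runtime injection only lasts until the MONs restart):

# in ceph.conf on the monitors
[global]
mon osd min down reporters = 9

# or apply at runtime without a restart
ceph tell mon.* injectargs '--mon-osd-min-down-reporters=9'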
Re: [ceph-users] unbalanced OSDs
ceph osd reweight-by-utilization needs another argument (a percentage) to do something. The recommended starting value is 120. Run it again with lower and lower values until you're happy. The value is a percentage, and I'm not sure what happens if you go below 100. If you get into trouble with this (too much backfilling causing problems), you can use ceph osd reweight osdid 1 to go back to normal; just look at ceph osd tree to see the reweighted OSDs.

Bear in mind that reweight-by-utilization adjusts the osd weight, which is not a permanent value. In/out events will reset this weight. But that's OK, because you don't need the reweight to last very long. Even if you get it perfectly balanced, you're going to be at ~75%. I order more hardware when I hit 70% utilization. Once you start adding hardware, the data distribution will change, so any permanent weights you set will probably be wrong.

If you do want the weights to be permanent, you should look at ceph osd crush reweight osdid weight. This permanently changes the weight in the CRUSH map, and it's not affected by in/out events. Bear in mind that you'll probably have to revisit all of these weights anytime your cluster changes. Also note that this weight is different than ceph osd reweight; this weight is the disk size in TiB. I recommend small changes to all over- and under-utilized disks, then re-evaluate after each pass.

On Wed, Apr 22, 2015 at 4:12 AM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Hello, I've got heavily unbalanced OSDs. Some are at 61% usage and some at 86%, which is 372G free space vs 136G free space. All are up and are weighted at 1. I'm running firefly with tunables set to optimal and hashpspool 1. Also, a reweight-by-utilization does nothing. # ceph osd reweight-by-utilization no change: average_util: 0.714381, overload_util: 0.857257. overloaded osds: (none) Stefan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
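A hedged sketch of the iterative approach above (120 is the recommended starting point from the reply; osd 12 is just a placeholder for whichever OSD you want to reset):

ceph osd reweight-by-utilization 120
# let the backfill settle and check the utilization spread, then make another pass with a lower value
ceph osd reweight-by-utilization 115
# if one OSD ends up reweighted too far, put it back to normal:
ceph osd reweight 12 1.0
# the temporary weights show up in the REWEIGHT column of:
ceph osd tree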
Re: [ceph-users] Odp.: Odp.: CEPH 1 pgs incomplete
ceph pg query says all the OSDs are being probed. If those 6 OSDs are staying up, it probably just needs some time. The OSDs need to stay up longer than 15 minutes. If any of them are getting marked down at all, that'll cause problems.

I'd like to see the past intervals in the recovery state get smaller. All of those entries indicate potential history that needs to be reconciled. If that array is getting smaller, then recovery is proceeding. You could try pushing it a bit with a ceph pg scrub 0.37. If that finishes without any improvement, try ceph pg deep-scrub 0.37. Sometimes it helps move things faster, and sometimes it doesn't.

On Wed, Apr 22, 2015 at 11:54 AM, MEGATEL / Rafał Gawron rafal.gaw...@megatel.com.pl wrote: All OSDs are working fine now.

ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 1080.71985 root default
-2 120.07999 host s1
0 60.03999 osd.0 up 1.0 1.0
1 60.03999 osd.1 up 1.0 1.0
-3 120.07999 host s2
2 60.03999 osd.2 up 1.0 1.0
3 60.03999 osd.3 up 1.0 1.0
-4 120.07999 host s3
4 60.03999 osd.4 up 1.0 1.0
5 60.03999 osd.5 up 1.0 1.0
-5 120.07999 host s4
6 60.03999 osd.6 up 1.0 1.0
7 60.03999 osd.7 up 1.0 1.0
-6 120.07999 host s5
9 60.03999 osd.9 up 1.0 1.0
8 60.03999 osd.8 up 1.0 1.0
-7 120.07999 host s6
10 60.03999 osd.10 up 1.0 1.0
11 60.03999 osd.11 up 1.0 1.0
-8 120.07999 host s7
12 60.03999 osd.12 up 1.0 1.0
13 60.03999 osd.13 up 1.0 1.0
-9 120.07999 host s8
14 60.03999 osd.14 up 1.0 1.0
15 60.03999 osd.15 up 1.0 1.0
-10 120.07999 host s9
17 60.03999 osd.17 up 1.0 1.0
16 60.03999 osd.16 up 1.0 1.0

Earlier I had a power failure and my cluster was down. After it came up it was recovering, but now I have: 1 pgs incomplete 1 pgs stuck inactive 1 pgs stuck unclean The cluster can't recover this PG. I took some OSDs out and added OSDs back to my cluster, but the recovery afterwards didn't rebuild it. -- *From:* Craig Lewis cle...@centraldesktop.com *Sent:* 22 April 2015 20:40 *To:* MEGATEL / Rafał Gawron *Subject:* Re: Odp.: [ceph-users] CEPH 1 pgs incomplete So you have flapping OSDs. None of the 6 OSDs involved in that PG are staying up long enough to complete the recovery. What's happened is that because of how quickly the OSDs are coming up and failing, no single OSD has a complete copy of the data. There should be a complete copy of the data, but different OSDs have different chunks of it. Figure out why those 6 OSDs are failing, and Ceph should recover. Do you see anything interesting in those OSD logs? If not, you might need to increase the logging levels. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
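If you want to watch whether that past-intervals list is actually shrinking, a simple before/after comparison is enough (0.37 is the PG from this thread; the exact JSON layout of the query output varies a little between releases):

ceph pg 0.37 query > query-before.json
ceph pg scrub 0.37
# wait a while, grab it again, and compare the recovery_state section
ceph pg 0.37 query > query-after.json
diff query-before.json query-after.json | less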
Re: [ceph-users] What is a dirty object
On Mon, Apr 20, 2015 at 3:38 AM, John Spray john.sp...@redhat.com wrote: I hadn't noticed that we presented this as nonzero for regular pools before; it is a bit weird. Perhaps we should show zero here instead for non-cache-tier pools.

I have always planned to add a cold EC tier later, once my cluster was large enough to make tiers worthwhile. This minor change seems like it would make that more complicated. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] many slow requests on different osds (scrubbing disabled)
I've seen something like this a few times. Once, I lost the battery in my battery-backed RAID card. That caused all the OSDs on that host to be slow, which triggered slow request notices pretty much cluster wide. It was only when I histogrammed the slow request notices that I saw most of them were on a single node. I compared the disk latency graphs between nodes, and saw that one node had a much higher write latency. This took me a while to track down.

Another time, I had a consumer HDD that was slowly failing. It would hit a group of bad sectors, remap, repeat. SMART warned me about it, so I replaced the disk after the second slow request alert. This was pretty straightforward to diagnose, only because smartd notified me.

In both cases, I saw slow request notices on the affected disks. Your osd.284 says osd.186 and osd.177 are being slow, but osd.186 and osd.177 don't claim to be slow. It's possible that there is another disk that is slow, causing osd.186 and osd.177 replication to slow down. With the PG distribution over OSDs, one disk being a little slow can affect a large number of OSDs.

If SMART doesn't show you a disk is failing, I'd start looking for disks (the disk itself, not the OSD daemon) with a high latency around your problem times. If you focus on the problem times, give it a +/- 10 minute window. Sometimes it takes a little while for the disk slowness to spread out enough for Ceph to complain.

On Wed, Apr 15, 2015 at 3:20 PM, Dominik Mostowiec dominikmostow...@gmail.com wrote: Hi, For a few days we have noticed many slow requests on our cluster. Cluster: ceph version 0.67.11 3 x mon 36 hosts - 10 osd ( 4T ) + 2 SSD (journals) Scrubbing and deep scrubbing is disabled but the count of slow requests is still increasing. Disk utilisation is very small after we have disabled scrubbings. Log from one slow write with debug osd = 20/20 osd.284 - master: http://pastebin.com/xPtpNU6n osd.186 - replica: http://pastebin.com/NS1gmhB0 osd.177 - replica: http://pastebin.com/Ln9L2Z5Z Can you help me find what is the reason for it? -- Regards Dominik ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
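To look for a slow disk rather than a slow OSD daemon, per-device latency from the OS is usually enough. A sketch using iostat (from the sysstat package) on each OSD node; the sar file name is just an example for the 15th of the month and depends on sysstat actually collecting history:

# extended stats every 5 seconds; watch the await column for one device that stands out
iostat -x 5
# or pull historical data for the problem window, if sar is collecting it
sar -d -f /var/log/sa/sa15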
Re: [ceph-users] Managing larger ceph clusters
I'm running a small cluster, but I'll chime in since nobody else has. CERN had a presentation a while ago (dumpling time-frame) about their deployment. They go over some of your questions: http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern

My philosophy on Config Management is that it should save me time. If it's going to take me longer to write a recipe to do something, I'll just do it by hand. Since my cluster is small, there are many things I can do faster by hand. This may or may not work for you, depending on your documentation / repeatability requirements. For things that need to be documented, I'll usually write the recipe anyway (I accept Chef recipes as documentation).

For my clusters, I'm using Chef to set up all nodes and manage ceph.conf. I manually manage my pools, CRUSH map, RadosGW users, and disk replacement. I was using Chef to add new disks, but I ran into load problems due to my small cluster size. I'm currently adding disks manually, to manage cluster load better. As my cluster gets larger, that'll be less important. I'm also doing upgrades manually, because it's less work than writing the Chef recipe to do a cluster upgrade. Since Chef isn't cluster aware, it would be a pain to make the recipe cluster aware enough to handle the upgrade. And I figure if I stall long enough, somebody else will write it :-) Ansible, with its cluster-wide coordination, looks like it would handle that a bit better.

On Wed, Apr 15, 2015 at 2:05 PM, Stillwell, Bryan bryan.stillw...@twcable.com wrote: I'm curious what people managing larger ceph clusters are doing with configuration management and orchestration to simplify their lives? We've been using ceph-deploy to manage our ceph clusters so far, but feel that moving the management of our clusters to standard tools would provide a little more consistency and help prevent some mistakes that have happened while using ceph-deploy. We're looking at using the same tools we use in our OpenStack environment (puppet/ansible), but I'm interested in hearing from people using chef/salt/juju as well. Some of the cluster operation tasks that I can think of along with ideas/concerns I have are: Keyring management Seems like hiera-eyaml is a natural fit for storing the keyrings. ceph.conf I believe the puppet ceph module can be used to manage this file, but I'm wondering if using a template (erb?) might be better method to keeping it organized and properly documented. Pool configuration The puppet module seems to be able to handle managing replicas and the number of placement groups, but I don't see support for erasure coded pools yet. This is probably something we would want the initial configuration to be set up by puppet, but not something we would want puppet changing on a production cluster. CRUSH maps Describing the infrastructure in yaml makes sense. Things like which servers are in which rows/racks/chassis. Also describing the type of server (model, number of HDDs, number of SSDs) makes sense. CRUSH rules I could see puppet managing the various rules based on the backend storage (HDD, SSD, primary affinity, erasure coding, etc). Replacing a failed HDD disk Do you automatically identify the new drive and start using it right away? I've seen people talk about using a combination of udev and special GPT partition IDs to automate this. If you have a cluster with thousands of drives I think automating the replacement makes sense. How do you handle the journal partition on the SSD?
Does removing the old journal partition and creating a new one create a hole in the partition map (because the old partition is removed and the new one is created at the end of the drive)? Replacing a failed SSD journal Has anyone automated recreating the journal drive using Sebastien Han's instructions, or do you have to rebuild all the OSDs as well? http://www.sebastien-han.fr/blog/2014/11/27/ceph-recover-osds-after-ssd-jou rnal-failure/ Adding new OSD servers How are you adding multiple new OSD servers to the cluster? I could see an ansible playbook which disables nobackfill, noscrub, and nodeep-scrub followed by adding all the OSDs to the cluster being useful. Upgrading releases I've found an ansible playbook for doing a rolling upgrade which looks like it would work well, but are there other methods people are using? http://www.sebastien-han.fr/blog/2015/03/30/ceph-rolling-upgrades-with-ansi ble/ Decommissioning hardware Seems like another ansible playbook for reducing the OSDs weights to zero, marking the OSDs out, stopping the service, removing the OSD ID, removing the CRUSH entry, unmounting the drives, and finally removing the server would be the best method here. Any other ideas on how to approach this? That's all I can think of right now. Is there any other tasks that people have run into
Re: [ceph-users] Rebalance after empty bucket addition
Yes, it's expected. The crush map contains the inputs to the CRUSH hashing algorithm. Every change made to the crush map causes the hashing algorithm to behave slightly differently. It is consistent though. If you removed the new bucket, it would go back to the way it was before you made the change. The Ceph team is working to reduce this, but it's unlikely to go away completely. On Sun, Apr 5, 2015 at 11:45 AM, Andrey Korolyov and...@xdel.ru wrote: Hello, after reaching certain ceiling of host/PG ratio, moving empty bucket in causes a small rebalance: ceph osd crush add-bucket 10.10.2.13 ceph osd crush move 10.10.2.13 root=default rack=unknownrack I have two pools, one is very large and it is keeping up with proper amount of pg/osd but another one contains in fact lesser amount of PGs than the number of active OSDs and after insertion of empty bucket in it goes to a rebalance, though that the actual placement map is not changed. Keeping in mind that this case is very far from being offensive to any kind of a sane production configuration, is this an expected behavior? Thanks! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
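If you want to see roughly how much movement a CRUSH change will cause before you commit to it, you can compare the mappings offline with crushtool (a rough sketch; the rule number, replica count, and file names are placeholders):

ceph osd getcrushmap -o before.crush
# ...make the change (e.g. add the empty bucket), then:
ceph osd getcrushmap -o after.crush
crushtool -i before.crush --test --show-mappings --rule 0 --num-rep 3 > before.txt
crushtool -i after.crush --test --show-mappings --rule 0 --num-rep 3 > after.txt
# every differing line is a test input whose mapping moved
diff before.txt after.txt | wc -l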
Re: [ceph-users] Recovering incomplete PGs with ceph_objectstore_tool
In that case, I'd set the crush weight to the disk's size in TiB, and mark the OSD out:

ceph osd crush reweight osd.OSDID weight
ceph osd out OSDID

Then your tree should look like:
-9 *2.72* host ithome
30 *2.72* osd.30 up *0*

An OSD can be UP and OUT, which causes Ceph to migrate all of its data away.

On Thu, Apr 2, 2015 at 10:20 PM, Chris Kitzmiller ca...@hampshire.edu wrote: On Apr 3, 2015, at 12:37 AM, LOPEZ Jean-Charles jelo...@redhat.com wrote: according to your ceph osd tree capture, although the OSD reweight is set to 1, the OSD CRUSH weight is set to 0 (2nd column). You need to assign the OSD a CRUSH weight so that it can be selected by CRUSH: ceph osd crush reweight osd.30 x.y (where 1.0=1TB) Only when this is done will you see if it joins. I don't really want osd.30 to join my cluster though. It is a purely temporary device that I restored just those two PGs to. It should still be able to (and be trying to) push out those two PGs with a weight of zero, right? I don't want any of my production data to migrate towards osd.30. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
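For reference, getting a PG onto a temporary OSD like this is done with an export/import while the OSDs are stopped. A rough sketch from memory (the source OSD number, PG id, and file path are placeholders, so double-check the flags against ceph_objectstore_tool --help on your version):

# on an OSD that still has a copy of the PG (daemon stopped):
ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-12 \
  --journal-path /var/lib/ceph/osd/ceph-12/journal \
  --op export --pgid 19.1a --file /tmp/19.1a.export
# on the temporary OSD (also stopped):
ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-30 \
  --journal-path /var/lib/ceph/osd/ceph-30/journal \
  --op import --file /tmp/19.1a.export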
Re: [ceph-users] Error DATE 1970
No, but I've seen it in RadosGW too. I've been meaning to post about it. I get about ten a day, out of about 50k objects/day.

clewis@clewis-mac ~ (-) $ s3cmd ls s3://live32/ | grep '1970-01' | head -1
1970-01-01 00:00 0 s3://live-32/39020f17716a18b39efd8daa96e8245eb2901f353ba1004e724cb56de5367055

Also note the 0-byte file size and the lack of MD5 checksum. What I find interesting is that I can fix this by downloading the file and uploading it again. The filename is a SHA256 hash of the file contents, and the file downloads correctly every time. I never see this in my replication cluster, only the primary cluster.

The access log looks kind of interesting for this file:

192.168.2.146 - - [23/Mar/2015:20:11:30 -0700] PUT /39020f17716a18b39efd8daa96e8245eb2901f353ba1004e724cb56de5367055 HTTP/1.1 500 722 - aws-sdk-php2/2.7.20 Guzzle/3.9.2 curl/7.40.0 PHP/5.5.21 live-32.us-west-1.ceph.cdlocal
192.168.2.146 - - [23/Mar/2015:20:12:01 -0700] PUT /39020f17716a18b39efd8daa96e8245eb2901f353ba1004e724cb56de5367055 HTTP/1.1 200 205 - aws-sdk-php2/2.7.20 Guzzle/3.9.2 curl/7.40.0 PHP/5.5.21 live-32.us-west-1.ceph.cdlocal
192.168.2.146 - - [23/Mar/2015:20:12:09 -0700] HEAD /39020f17716a18b39efd8daa96e8245eb2901f353ba1004e724cb56de5367055 HTTP/1.1 200 250 - aws-sdk-php2/2.7.20 Guzzle/3.9.2 curl/7.40.0 PHP/5.5.21 live-32.us-west-1.ceph.cdlocal

31 seconds is a big spread between the initial PUT and the second PUT. The file is only 43k, so it'll be using the direct PUT, not multi-part upload. I haven't verified this for all of them. There's nothing in radosgw.log at the time of the 500.

On Wed, Apr 1, 2015 at 2:42 AM, Jimmy Goffaux ji...@goffaux.fr wrote: Hello, I found a strange behavior in Ceph. This behavior is visible on buckets (RGW) and pools (RBD). pools: `` root@:~# qemu-img info rbd:pool/kibana2 image: rbd:pool/kibana2 file format: raw virtual size: 30G (32212254720 bytes) disk size: unavailable Snapshot list: IDTAG VM SIZE DATE VM CLOCK snap2014-08-26-kibana2snap2014-08-26-kibana2 30G 1970-01-01 01:00:00 00:00:00.000 snap2014-09-05-kibana2snap2014-09-05-kibana2 30G 1970-01-01 01:00:00 00:00:00.000 `` As you can see, all the dates are set to 1970-01-01? Here's the content of a JSON file in a bucket: `` {'bytes': 0, 'last_modified': '1970-01-01T00:00:00.000Z', 'hash': u'', 'name': 'bab34dad-531c-4609-ae5e-62129b43b181'} ``` You can see this is the same for the Last Modified date. Do you have any ideas? -- Jimmy Goffaux ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
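Since the broken objects always download correctly, the re-upload work-around is easy to script. A rough sketch with s3cmd (the bucket name is the one from my listing above; this assumes object names never contain whitespace):

s3cmd ls s3://live-32/ | awk '/^1970-01/ {print $NF}' | while read obj; do
  s3cmd get "$obj" /tmp/fixup.tmp
  s3cmd put /tmp/fixup.tmp "$obj"
  rm -f /tmp/fixup.tmp
done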
Re: [ceph-users] Production Ceph :: PG data lost : Cluster PG incomplete, inactive, unclean
3.d60 0 0 0 0 0 0 0 incomplete 2015-04-01 21:21:16.158179 0'0 262813:169 [60,56,220] 60 [60,56,220] 60 33552'321 2015-03-12 13:44:43.502907 28356'39 2015-03-11 13:44:41.663482 4.1fc 0 0 0 0 0 0 0 incomplete 2015-04-01 21:21:16.217291 0'0 262813:163 [144,58,153] 144 [144,58,153] 144 0'0 2015-03-12 17:58:19.254170 0'0 2015-03-09 17:54:55.720479 3.e02 72 0 0 0 585105425 304 304 down+incomplete 2015-04-01 21:21:16.099150 33568'304 262813:169744 [15,102,147] 15 [15,102,147] 15 33568'304 2015-03-16 10:04:19.894789 2246'4 2015-03-09 11:43:44.176331 8.1d4 0 0 0 0 0 0 0 down+incomplete 2015-04-01 21:21:16.218644 0'0 262813:21867 [126,43,174] 126 [126,43,174] 126 0'0 2015-03-12 14:34:35.258338 0'0 2015-03-12 14:34:35.258338 4.2f4 0 0 0 0 0 0 0 down+incomplete 2015-04-01 21:21:16.117515 0'0 262813:116150 [181,186,13] 181 [181,186,13] 181 0'0 2015-03-12 14:59:03.529264 0'0 2015-03-09 13:46:40.601301 3.e5a 76 70 0 0 623902741 325 325 incomplete 2015-04-01 21:21:16.043300 33569'325 262813:73426 [97,22,62] 97 [97,22,62] 97 33569'325 2015-03-12 13:58:05.813966 28433'44 2015-03-11 13:57:53.909795 8.3a0 0 0 0 0 0 0 0 incomplete 2015-04-01 21:21:16.056437 0'0 262813:175168 [62,14,224] 62 [62,14,224] 62 0'0 2015-03-12 13:52:44.546418 0'0 2015-03-12 13:52:44.546418 3.24e 0 0 0 0 0 0 0 incomplete 2015-04-01 21:21:16.130831 0'0 262813:165 [39,202,90] 39 [39,202,90] 39 33556'272 2015-03-13 11:44:41.263725 2327'4 2015-03-09 17:54:43.675552 5.f7 0 0 0 0 0 0 0 incomplete 2015-04-01 21:21:16.145298 0'0 262813:153 [54,193,123] 54 [54,193,123] 54 0'0 2015-03-12 17:58:30.257371 0'0 2015-03-09 17:55:11.725629 [root@pouta-s01 ceph]# ## Example 1 : PG 10.70 ### *10.70 0 0 0 0 0 0 0 incomplete 2015-04-01 21:21:16.152179 0'0 262813:163 [213,88,80] 213 [213,88,80] 213 0'0 2015-03-12 17:59:43.275049 0'0 2015-03-09 17:55:58.745662* This is how i found location of each OSD [root@pouta-s01 ceph]# *ceph osd find 88* { osd: 88, ip: 10.100.50.3:7079\/916853, crush_location: { host: pouta-s03, root: default”}} [root@pouta-s01 ceph]# When i manually check current/pg_head directory , data is not present here ( i.e. 
data is lost from all the copies ) [root@pouta-s04 current]# ls -l /var/lib/ceph/osd/ceph-80/current/10.70_head *total 0* [root@pouta-s04 current]# On some of the OSD’s HEAD directory does not exists [root@pouta-s03 ~]# ls -l /var/lib/ceph/osd/ceph-88/current/10.70_head *ls: cannot access /var/lib/ceph/osd/ceph-88/current/10.70_head: No such file or directory* [root@pouta-s03 ~]# [root@pouta-s02 ~]# ls -l /var/lib/ceph/osd/ceph-213/current/10.70_head *total 0* [root@pouta-s02 ~]# # ceph pg 10.70 query --- *http://paste.ubuntu.com/10719840/ http://paste.ubuntu.com/10719840/* ## Example 2 : PG 3.7d0 ### *3.7d0 78 0 0 0 609222686 376 376 down+incomplete 2015-04-01 21:21:16.135599 33538'376 262813:185045 [117,118,177] 117 [117,118,177] 117 33538'376 2015-03-12 13:51:03.984454 28394'62 2015-03-11 13:50:58.196288* [root@pouta-s04 current]# ceph pg map 3.7d0 osdmap e262813 pg 3.7d0 (3.7d0) - up [117,118,177] acting [117,118,177] [root@pouta-s04 current]# *Data is present here , so 1 copy is present out of 3 * *[root@pouta-s04 current]# ls -l /var/lib/ceph/osd/ceph-117/current/3.7d0_head/ | wc -l* *63* *[root@pouta-s04 current]#* [root@pouta-s03 ~]# ls -l /var/lib/ceph/osd/ceph-118/current/3.7d0_head/ *total 0* [root@pouta-s03 ~]# [root@pouta-s01 ceph]# ceph osd find 177 { osd: 177, ip: 10.100.50.2:7062\/99, crush_location: { host: pouta-s02, root: default”}} [root@pouta-s01 ceph]# *Even directory is not present here * [root@pouta-s02 ~]# ls -l /var/lib/ceph/osd/ceph-177/current/3.7d0_head/ *ls: cannot access /var/lib/ceph/osd/ceph-177/current/3.7d0_head/: No such file or directory* [root@pouta-s02 ~]# *# ceph pg 3.7d0 query http://paste.ubuntu.com/10720107/ http://paste.ubuntu.com/10720107/* - Karan - On 20 Mar 2015, at 22:43, Craig Lewis cle...@centraldesktop.com wrote: osdmap e261536: 239 osds: 239 up, 238 in Why is that last OSD not IN? The history you need is probably there. Run ceph pg pgid query on some of the stuck PGs. Look for the recovery_state section. That should tell you what Ceph needs to complete the recovery. If you need more help, post the output of a couple pg queries. On Fri, Mar 20, 2015 at 4:22 AM, Karan Singh karan.si...@csc.fi wrote: Hello Guys My CEPH cluster lost data and not its not recovering. This problem occurred when Ceph performed recovery when one of the node was down. Now all the nodes are up but Ceph is showing PG as incomplete , unclean , recovering. I have tried several things to recover them like , *scrub , deep-scrub , pg repair , try changing primary affinity and then scrubbing , osd_pool_default_size etc. BUT NO LUCK* Could yo please advice , how to recover PG and achieve HEALTH_OK # ceph
Re: [ceph-users] PGs issue
This seems to be a fairly consistent problem for new users. The create-or-move is adjusting the crush weight, not the osd weight. Perhaps the init script should set the defaultweight to 0.01 if it's = 0? It seems like there's a downside to this, but I don't see it. On Fri, Mar 20, 2015 at 1:25 PM, Robert LeBlanc rob...@leblancnet.us wrote: The weight can be based on anything, size, speed, capability, some random value, etc. The important thing is that it makes sense to you and that you are consistent. Ceph by default (ceph-disk and I believe ceph-deploy) take the approach of using size. So if you use a different weighting scheme, you should manually add the OSDs, or clean up after using ceph-disk/ceph-deploy. Size works well for most people, unless the disks are less than 10 GB so most people don't bother messing with it. On Fri, Mar 20, 2015 at 12:06 PM, Bogdan SOLGA bogdan.so...@gmail.com wrote: Thank you for the clarifications, Sahana! I haven't got to that part, yet, so these details were (yet) unknown to me. Perhaps some information on the PGs weight should be provided in the 'quick deployment' page, as this issue might be encountered in the future by other users, as well. Kind regards, Bogdan On Fri, Mar 20, 2015 at 12:05 PM, Sahana shna...@gmail.com wrote: Hi Bogdan, Here is the link for hardware recccomendations : http://ceph.com/docs/master/start/hardware-recommendations/#hard-disk-drives. As per this link, minimum size reccommended for osds is 1TB. Butt as Nick said, Ceph OSDs must be min. 10GB to get an weight of 0.01 Here is the snippet from crushmaps section of ceph docs: Weighting Bucket Items Ceph expresses bucket weights as doubles, which allows for fine weighting. A weight is the relative difference between device capacities. We recommend using 1.00 as the relative weight for a 1TB storage device. In such a scenario, a weight of 0.5 would represent approximately 500GB, and a weight of 3.00 would represent approximately 3TB. Higher level buckets have a weight that is the sum total of the leaf items aggregated by the bucket. Thanks Sahana On Fri, Mar 20, 2015 at 2:08 PM, Bogdan SOLGA bogdan.so...@gmail.com wrote: Thank you for your suggestion, Nick! I have re-weighted the OSDs and the status has changed to '256 active+clean'. Is this information clearly stated in the documentation, and I have missed it? In case it isn't - I think it would be recommended to add it, as the issue might be encountered by other users, as well. Kind regards, Bogdan On Fri, Mar 20, 2015 at 10:33 AM, Nick Fisk n...@fisk.me.uk wrote: I see the Problem, as your OSD's are only 8GB they have a zero weight, I think the minimum size you can get away with is 10GB in Ceph as the size is measured in TB and only has 2 decimal places. For a work around try running :- ceph osd crush reweight osd.X 1 for each osd, this will reweight the OSD's. Assuming this is a test cluster and you won't be adding any larger OSD's in the future this shouldn't cause any problems. 
admin@cp-admin:~/safedrive$ ceph osd tree
# id    weight  type name       up/down reweight
-1      0       root default
-2      0       host osd-001
0       0       osd.0   up      1
1       0       osd.1   up      1
-3      0       host osd-002
2       0       osd.2   up      1
3       0       osd.3   up      1
-4      0       host osd-003
4       0       osd.4   up      1
5       0       osd.5   up      1
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
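Related to the init-script thought above: if you'd rather not have freshly added OSDs weighted by their (tiny) size at all, newer releases have a config option to pin the initial CRUSH weight. A sketch (check that your release actually has the option before relying on it):

[osd]
# weight assigned the first time an OSD is added to the CRUSH map
osd crush initial weight = 0.01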
Re: [ceph-users] Production Ceph :: PG data lost : Cluster PG incomplete, inactive, unclean
osdmap e261536: 239 osds: 239 up, 238 in Why is that last OSD not IN? The history you need is probably there. Run ceph pg pgid query on some of the stuck PGs. Look for the recovery_state section. That should tell you what Ceph needs to complete the recovery. If you need more help, post the output of a couple pg queries. On Fri, Mar 20, 2015 at 4:22 AM, Karan Singh karan.si...@csc.fi wrote: Hello Guys My CEPH cluster lost data and not its not recovering. This problem occurred when Ceph performed recovery when one of the node was down. Now all the nodes are up but Ceph is showing PG as incomplete , unclean , recovering. I have tried several things to recover them like , *scrub , deep-scrub , pg repair , try changing primary affinity and then scrubbing , osd_pool_default_size etc. BUT NO LUCK* Could yo please advice , how to recover PG and achieve HEALTH_OK # ceph -s cluster 2bd3283d-67ef-4316-8b7e-d8f4747eae33 health *HEALTH_WARN 19 pgs incomplete; 3 pgs recovering; 20 pgs stuck inactive; 23 pgs stuck unclean*; 2 requests are blocked 32 sec; recovery 531/980676 objects degraded (0.054%); 243/326892 unfound (0.074%) monmap e3: 3 mons at {xxx=:6789/0,xxx=:6789:6789/0,xxx=:6789:6789/0}, election epoch 1474, quorum 0,1,2 xx,xx,xx osdmap e261536: 239 osds: 239 up, 238 in pgmap v415790: 18432 pgs, 13 pools, 2330 GB data, 319 kobjects 20316 GB used, 844 TB / 864 TB avail 531/980676 objects degraded (0.054%); 243/326892 unfound (0.074%) 1 creating 18409 active+clean 3 active+recovering 19 incomplete # ceph pg dump_stuck unclean ok pg_stat objects mip degr unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp 10.70 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.534911 0'0 261536:1015 [153,140,80] 153 [153,140,80] 153 0'0 2015-03-12 17:59:43.275049 0'0 2015-03-09 17:55:58.745662 3.dde 68 66 0 66 552861709 297 297 incomplete 2015-03-20 12:19:49.584839 33547'297 261536:228352 [174,5,179] 174 [174,5,179] 174 33547'297 2015-03-12 14:19:15.261595 28522'43 2015-03-11 14:19:13.894538 5.a2 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.560756 0'0 261536:897 [214,191,170] 214 [214,191,170] 214 0'0 2015-03-12 17:58:29.257085 0'0 2015-03-09 17:55:07.684377 13.1b6 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.846253 0'0 261536:1050 [0,176,131] 0 [0,176,131] 0 0'0 2015-03-12 18:00:13.286920 0'0 2015-03-09 17:56:18.715208 7.25b 16 0 0 0 67108864 16 16 incomplete 2015-03-20 12:19:49.639102 27666'16 261536:4777 [194,145,45] 194 [194,145,45] 194 27666'16 2015-03-12 17:59:06.357864 2330'3 2015-03-09 17:55:30.754522 5.19 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.742698 0'0 261536:25410 [212,43,131] 212 [212,43,131] 212 0'0 2015-03-12 13:51:37.777026 0'0 2015-03-11 13:51:35.406246 3.a2f 0 0 0 0 0 0 0 creating 2015-03-20 12:42:15.586372 0'0 0:0 [] -1 [] -1 0'0 0.00 0'0 0.00 7.298 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.566966 0'0 261536:900 [187,95,225] 187 [187,95,225] 187 27666'13 2015-03-12 17:59:10.308423 2330'4 2015-03-09 17:55:35.750109 3.a5a 77 87 261 87 623902741 325 325 active+recovering 2015-03-20 10:54:57.443670 33569'325 261536:182464 [150,149,181] 150 [150,149,181] 150 33569'325 2015-03-12 13:58:05.813966 28433'44 2015-03-11 13:57:53.909795 1.1e7 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.610547 0'0 261536:772 [175,182] 175 [175,182] 175 0'0 2015-03-12 17:55:45.203232 0'0 2015-03-09 17:53:49.694822 3.774 79 0 0 0 645136397 339 339 incomplete 2015-03-20 12:19:49.821708 33570'339 261536:166857 [162,39,161] 162 
[162,39,161] 162 33570'339 2015-03-12 14:49:03.869447 2226'2 2015-03-09 13:46:49.783950 3.7d0 78 0 0 0 609222686 376 376 incomplete 2015-03-20 12:19:49.534004 33538'376 261536:182810 [117,118,177] 117 [117,118,177] 117 33538'376 2015-03-12 13:51:03.984454 28394'62 2015-03-11 13:50:58.196288 3.d60 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.647196 0'0 261536:833 [154,172,1] 154 [154,172,1] 154 33552'321 2015-03-12 13:44:43.502907 28356'39 2015-03-11 13:44:41.663482 4.1fc 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.610103 0'0 261536:1069 [70,179,58] 70 [70,179,58] 70 0'0 2015-03-12 17:58:19.254170 0'0 2015-03-09 17:54:55.720479 3.e02 72 0 0 0 585105425 304 304 incomplete 2015-03-20 12:19:49.564768 33568'304 261536:167428 [15,102,147] 15 [15,102,147] 15 33568'304 2015-03-16 10:04:19.894789 2246'4 2015-03-09 11:43:44.176331 8.1d4 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.614727 0'0 261536:19611 [126,43,174] 126 [126,43,174] 126 0'0 2015-03-12 14:34:35.258338 0'0 2015-03-12 14:34:35.258338 4.2f4 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.595109 0'0 261536:113791 [181,186,13] 181 [181,186,13] 181 0'0 2015-03-12 14:59:03.529264 0'0 2015-03-09 13:46:40.601301 3.52c 65 23 69 23
Re: [ceph-users] Ceiling on number of PGs in a OSD
This isn't a hard limit on the number, but it's recommended that you keep it around 100. Smaller values cause uneven data distribution. Larger values cause the OSD processes to use more CPU, RAM, and file descriptors, particularly during recovery. With that many OSDs, you're going to want to increase your sysctls, particularly open file descriptors, open sockets, FDs per process, etc.

You don't need the same number of placement groups for every pool. Pools without much data don't need as many PGs. For example, I have a bunch of pools for RGW zones, and they have 32 PGs each. I have a total of 2600 PGs, 2048 are in the .rgw.buckets pool. Also keep in mind that your pg_num and pgp_num need to be multiplied by the number of replicas to get the PG per OSD count. I have 2600 PGs and replication 3, so I really have 7800 PGs spread over 72 OSDs.

Assuming you have one big pool, 750 OSDs, and replication 3, I'd go with 32k PGs on the big pool. Same thing, but replication 2, I'd still go 32k, but prepare to expand PGs with your next addition of OSDs. If you're going to have several big pools (ie, you're using RGW and RBD heavily), I'd go with 16k PGs for the big pools, and adjust those over time depending on which is used more heavily. If RBD is consuming 2x the space, then increase its pg_num and pgp_num during the next OSD expansion, but don't increase RGW's pg_num and pgp_num.

The number of PGs per OSD should stay around 100 as you add OSDs. If you add 10x the OSDs, you'll multiply the pg_num and pgp_num by 10 too, which gives you the same number of PGs per OSD. My (pg_num / osd_num) fluctuates between 75 and 200, depending on when I do the pg_num and pgp_num increase relative to the OSD adds.

When you increase pg_num and pgp_num, don't do a large jump. Ceph will only allow you to double the value. Even that is extreme. It will cause every OSD in the cluster to start splitting PGs. When you want to double your pg_num and pgp_num, it's recommended that you make several passes. I don't recall seeing any recommendations, but I'm planning to break my next increase up into 10 passes. I'm at 2048 now, so I'll probably add 204 PGs at a time until I get to 4096.

On Thu, Mar 19, 2015 at 6:12 AM, Sreenath BH bhsreen...@gmail.com wrote: Hi, Is there a ceiling on the number of placement groups in an OSD beyond which steady state and/or recovery performance will start to suffer? Example: I need to create a pool with 750 osds (25 OSD per server, 50 servers). The PG calculator gives me 65536 placement groups with 300 PGs per OSD. Now as the cluster expands, the number of PGs in an OSD has to increase as well. If the cluster size increases by a factor of 10, the number of PGs per OSD will also need to be increased. What would be the impact of a large PG number in an OSD on peering and rebalancing? There is 3GB per OSD available. thanks, Sreenath ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
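To make the rule of thumb concrete, the arithmetic is just (OSDs x 100) / replicas, rounded to a nearby power of two. For the quoted example:

# 750 OSDs, replication 3, targeting ~100 PGs per OSD:
echo $(( 750 * 100 / 3 ))    # 25000
# the nearest power of two above that is 32768, which works out to roughly:
echo $(( 32768 * 3 / 750 ))  # ~131 PGs per OSD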
Re: [ceph-users] RADOS Gateway Maturity
I have found a few incompatibilities, but so far they're all on the Ceph side. One example I remember was having to change the way we delete objects. The function we originally used fetches a list of object versions, and deletes all versions. Ceph is implementing objects versions now (I believe that'll ship with Hammer), so we had to call a different function to delete the object without iterating over the versions. AFAIK, that code should work fine if we point it at Amazon. I haven't tried it though. I've been using RGW (with replication) in production for 2 years now, although I'm not large. So far, all of my RGW issues have been Ceph issues. Most of my issues are caused by my under-powered hardware, or shooting myself in the foot with aggressive optimizations. Things are better with my journals on SSD, but the best thing I did was slow down with my changes. For example, I have 7 OSD nodes and 72 OSDs. When I add new OSDs, I add a couple at a time instead of adding all the disks in a node at once. Guess how I learned that lesson. :-) On Wed, Mar 18, 2015 at 10:03 AM, Jerry Lam jerry@oicr.on.ca wrote: Hi Chris, Thank you for your reply. We are also thinking about using the S3 API but we are concerned about how compatible it is with the real S3. For instance, we would like to design the system using pre-signed URL for storing some objects. I read the ceph documentation, it does not mention if it supports it or not. My question is do you guys find that the code using the RADOS S3 API can easily run in Amazon S3 without any change? If no, how much effort it is needed to make it compatible? Best Regards, Jerry From: Chris Jones cjo...@cloudm2.com Date: Tuesday, March 17, 2015 at 4:39 PM To: Jerry Lam jerry@oicr.on.ca Cc: ceph-users@lists.ceph.com ceph-users@lists.ceph.com Subject: Re: [ceph-users] RADOS Gateway Maturity Hi Jerry, I currently work at Bloomberg and we currently have a very large Ceph installation in production and we use the S3 compatible API for rados gateway. We are also re-architecting our new RGW and evaluating a different Apache configuration for a little better performance. We only use replicas right now, no erasure coding yet. Actually, you can take a look at our current configuration at https://github.com/bloomberg/chef-bcpc. -Chris On Tue, Mar 17, 2015 at 10:40 AM, Jerry Lam jerry@oicr.on.ca wrote: Hi Ceph user, I’m new to Ceph but I need to use Ceph as the storage for the Cloud we are building in house. Did anyone use RADOS Gateway in production? How mature it is in terms of compatibility with S3 / Swift? Anyone can share their experience on it? Best Regards, Jerry ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Best Regards, Chris Jones http://www.cloudm2.com cjo...@cloudm2.com (p) 770.655.0770 This message is intended exclusively for the individual or entity to which it is addressed. This communication may contain information that is proprietary, privileged or confidential or otherwise legally exempt from disclosure. If you are not the named addressee, you are not authorized to read, print, retain, copy or disseminate this message or any part of it. If you have received this message in error, please notify the sender immediately by e-mail and delete all copies of the message. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
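On the pre-signed URL question in the quoted mail: I haven't exercised that code path myself, but it's easy to smoke-test against RGW with any S3 client that can sign URLs. A sketch with s3cmd (the bucket and object are placeholders, and the expiry argument syntax depends on the s3cmd version):

s3cmd signurl s3://testbucket/testobject +3600
# fetch the signed URL it prints with curl and confirm you get the object back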
Re: [ceph-users] Uneven CPU usage on OSD nodes
I would say you're a little light on RAM. With 4TB disks 70% full, I've seen some ceph-osd processes using 3.5GB of RAM during recovery. You'll be fine during normal operation, but you might run into issues at the worst possible time. I have 8 OSDs per node, and 32G of RAM. I've had ceph-osd processes start swapping, and that's a great way to get them kicked out for being unresponsive. I'm not a dev, but I can make some wild and uninformed guesses :-) . The primary OSD uses more CPU than the replicas, and I suspect that you have more primaries on the hot nodes. Since you're testing, try repeating the test on 3 OSD nodes instead of 4. If you don't want to run that test, you can generate a histogram from ceph pg dump data, and see if there are more primary osds (the first one in the acting array) on the hot nodes. On Wed, Mar 18, 2015 at 7:18 AM, f...@univ-lr.fr f...@univ-lr.fr wrote: Hi to the ceph-users list ! We're setting up a new Ceph infrastructure : - 1 MDS admin node - 4 OSD storage nodes (60 OSDs) each of them running a monitor - 1 client Each 32GB RAM/16 cores OSD node supports 15 x 4TB SAS OSDs (XFS) and 1 SSD with 5GB journal partitions, all in JBOD attachement. Every node has 2x10Gb LACP attachement. The OSD nodes are freshly installed with puppet then from the admin node Default OSD weight in the OSD tree 1 test pool with 4096 PGs During setup phase, we're trying to qualify the performance characteristics of our setup. Rados benchmark are done from a client with these commandes : rados -p pool -b 4194304 bench 60 write -t 32 --no-cleanup rados -p pool -b 4194304 bench 60 seq -t 32 --no-cleanup Each time we observed a recurring phenomena : 2 of the 4 OSD nodes have twice the CPU load : http://www.4shared.com/photo/Ua0umPVbba/UnevenLoad.html (What to look at is the real-time %CPU and the cumulated CPU time per ceph-osd process) And after a fresh complete reinstall to be sure, this twice-as-high CPU load is observed but not on the same 2 nodes : http://www.4shared.com/photo/2AJfd1B_ba/UnevenLoad-v2.html Nothing obvious about the installation seems able to explain that. The crush distribution function doesn't have more than 4.5% inequality between the 4 OSD nodes for the primary OSDs of the objects, and less than 3% between the hosts if we considere the whole acting sets for the objects used during the benchmark. And the differences are not accordingly comparable to the CPU loads. So the cause has to be elsewhere. I cannot be sure it has no impact on performance. Even if we have enough CPU cores headroom, logic would say it has to have some consequences on delays and also on performances . Would someone have any idea, or reproduce the test on its setup to see if this is a common comportment ? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
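For anyone who wants to try that histogram without writing a script, a rough shell version follows. The column position ($16, acting_primary) matches the 0.80-era plain output, so check it against the header line that ceph pg dump prints and adjust if your release differs:

ceph pg dump 2>/dev/null | head -2      # confirm which column is 'acting_primary'
ceph pg dump 2>/dev/null | awk -F'\t' '$1 ~ /^[0-9]+\./ {print $16}' | sort -n | uniq -c | sort -rn

The count next to each OSD id is how many PGs it serves as primary; mapping the ids back to hosts with ceph osd tree shows whether the hot nodes really do hold more primaries.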
Re: [ceph-users] Question Blackout
I'm not a CephFS user, but I have had a few cluster outages. Each OSD has a journal, and Ceph ensures that a write is in all of the journals (primary and replicas) before it acknowledges the write. If an OSD process crashes, it replays the journal on startup, and recovers the write. I've lost power at my data center, and had the whole cluster down. Ceph came back up when power was restored without me getting involved. You might want the paid support package. For extra peace of mind, you can get a paid cluster review, and an engineer will go through your use case with you. On Tue, Mar 17, 2015 at 8:32 PM, Jesus Chavez (jeschave) jesch...@cisco.com wrote: Hi everyone, I am ready to launch ceph on production but there is one thing that keeps on my mind... If there was a blackout where all the ceph nodes went off, what would really happen with the filesystem? Would it get corrupt? Or does ceph have any kind of mechanism to survive something like that? Thanks Jesus Chavez SYSTEMS ENGINEER-C.SALES jesch...@cisco.com Phone: +52 55 5267 3146 Mobile: +51 1 5538883255 CCIE - 44433 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Mapping users to different rgw pools
Yes, the placement target feature is logically separate from multi-zone setups. Placement targets are configured in the region though, which somewhat muddies the issue. Placement targets are a useful feature for multi-zone, so different zones in a cluster don't share the same disks. Federation setup is the only place I've seen any discussion about the topic. Even that is just a brief mention. I didn't see any documentation directly talking about setting up placement targets, even in the federation guides. It looks like you'll need to edit the default region to add the placement targets, but you won't need to set up zones. As far as I can tell, you'll have to piece together what you need from the federation setup and some experimentation. I highly recommend a test VM that you can experiment on before attempting anything in production. On Sun, Mar 15, 2015 at 11:53 PM, Sreenath BH bhsreen...@gmail.com wrote: Thanks. Is this possible outside of a multi-zone setup (with only one zone)? For example, I want to have pools with different replication factors (or erasure codings) and map users to these pools. -Sreenath On 3/13/15, Craig Lewis cle...@centraldesktop.com wrote: Yes, RadosGW has the concept of Placement Targets and Placement Pools. You can create a target, and point it at a set of RADOS pools. Those pools can be configured to use different storage strategies by creating different crushmap rules, and assigning those rules to the pool. RGW users can be assigned a default placement target. When they create a bucket, they can either specify the target, or use their default one. All objects in a bucket are stored according to the bucket's placement target. I haven't seen a good guide for making use of these features. The best guide I know of is the Federation guide ( http://ceph.com/docs/giant/radosgw/federated-config/), but it only briefly mentions placement targets. On Thu, Mar 12, 2015 at 11:48 PM, Sreenath BH bhsreen...@gmail.com wrote: Hi all, Can one Rados gateway support more than one pool for storing objects? And as a follow-up question, is there a way to map different users to separate rgw pools so that their objects get stored in different pools? thanks, Sreenath ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
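A rough sketch of the experiment, pieced together from the federation docs; the target name, pool names and user id here are made up, and the JSON edits are done by hand:

radosgw-admin region get > region.json
# add an entry (e.g. "fast-placement") under placement_targets in region.json
radosgw-admin region set --infile region.json
radosgw-admin regionmap update
radosgw-admin zone get > zone.json
# add a matching key under placement_pools pointing at your new data/index pools
radosgw-admin zone set --infile zone.json
radosgw-admin metadata get user:testuser > user.json
# set "default_placement": "fast-placement" in user.json, then:
radosgw-admin metadata put user:testuser < user.json
# restart radosgw so it rereads the region and zone maps

Buckets created by that user afterwards should land in the new pools, which you can confirm with rados df.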
Re: [ceph-users] RadosGW Direct Upload Limitation
Maybe, but I'm not sure if Yehuda would want to take it upstream or not. This limit is present because it's part of the S3 spec. For larger objects you should use multi-part upload, which can get much bigger. -Greg Note that the multi-part upload has a minimum part size of 4MiB, and the direct upload has an upper limit of 5GiB. So you have to use both methods - direct upload for small files, and multi-part upload for big files. Your best bet is to use the Amazon S3 libraries. They have functions that take care of it for you. I'd like to see this mentioned in the Ceph documentation someplace. When I first encountered the issue, I couldn't find a limit in the RadosGW documentation anywhere. I only found the 5GiB limit in the Amazon API documentation, which led me to test on RadosGW. Now that I know it was done to preserve Amazon compatibility, I don't want to override the value anymore. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
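If you're scripting uploads rather than using the AWS SDKs, recent s3cmd versions do the multi-part split for you; the chunk size below is just an example value:

# files larger than the chunk size are uploaded as multi-part automatically
s3cmd put --multipart-chunk-size-mb=64 big-image.iso s3://mybucket/

Each part then stays well under the 5GiB cap that a single direct upload request enforces.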
Re: [ceph-users] PGs stuck unclean active+remapped after an osd marked out
If I remember/guess correctly, if you mark an OSD out it won't necessarily change the weight of the bucket above it (ie, the host), whereas if you change the weight of the OSD then the host bucket's weight changes. -Greg That sounds right. Marking an OSD out is a ceph osd reweight, not a ceph osd crush reweight. Experimentally confirmed. I have an OSD out right now, and the host's crush weight is the same as the other hosts' crush weight. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
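For anyone comparing the two, these are the commands and where each value shows up (osd.7 and the weights are only examples):

ceph osd crush reweight osd.7 3.64   # persistent CRUSH weight, shown in 'ceph osd tree'
ceph osd reweight 7 0.85             # temporary override between 0 and 1, shown in 'ceph osd dump'
ceph osd out 7                       # effectively 'ceph osd reweight 7 0'

Marking the OSD back in resets the temporary reweight to 1, while the crush weight survives out/in cycles and daemon restarts.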
Re: [ceph-users] query about mapping of Swift/S3 APIs to Ceph cluster APIs
On Sat, Mar 14, 2015 at 3:04 AM, pragya jain prag_2...@yahoo.co.in wrote: Hello all! I am working on Ceph object storage architecture from last few months. I am unable to search a document which can describe how Ceph object storage APIs (Swift/S3 APIs) are mappedd with Ceph storage cluster APIs (librados APIs) to store the data at Ceph storage cluster. As the documents say: Radosgw, a gateway interface for ceph object storage users, accept user request to store or retrieve data in the form of Swift APIs or S3 APIs and convert the user's request in RADOS request. Please help me in knowing 1. how does Radosgw convert user request to RADOS request ? 2. how are HTTP requests mapped with RADOS request? The RadosGW daemon takes care of that. It's an application that sits on top of RADOS. For HTTP, there are a couple ways. The older way has Apache accepting the HTTP request, then forwarding that to the RadosGW daemon using FastCGI. Newer versions support RadosGW handling the HTTP directly. For the full details, you'll want to check out the source code at https://github.com/ceph/ceph If you're not interested enough to read the source code (I wasn't :-) ), setup a test cluster. Create a user, bucket, and object, and look at the contents of the rados pools. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
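A quick version of that experiment on a test cluster, assuming the default (non-federated) pool names and with the user, bucket and file names as placeholders:

radosgw-admin user create --uid=testuser --display-name="Test User"
s3cmd mb s3://testbucket
s3cmd put hello.txt s3://testbucket/hello.txt
rados ls -p .rgw                  # bucket entrypoint/metadata objects
rados ls -p .rgw.buckets.index    # the bucket index object
rados ls -p .rgw.buckets          # the object data, named <bucket marker>_hello.txt
radosgw-admin bucket stats --bucket=testbucket   # shows the marker/id used in those names

Watching one S3 PUT turn into a handful of RADOS objects makes the gateway's translation much more concrete than the docs alone.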
Re: [ceph-users] Shadow files
Out of curiousity, what's the frequency of the peaks and troughs? RadosGW has configs on how long it should wait after deleting before garbage collecting, how long between GC runs, and how many objects it can GC in per run. The defaults are 2 hours, 1 hour, and 32 respectively. Search http://docs.ceph.com/docs/master/radosgw/config-ref/ for rgw gc. If your peaks and troughs have a frequency less than 1 hour, then GC is going to delay and alias the disk usage w.r.t. the object count. If you have millions of objects, you probably need to tweak those values. If RGW is only GCing 32 objects an hour, it's never going to catch up. Now that I think about it, I bet I'm having issues here too. I delete more than (32*24) objects per day... On Sun, Mar 15, 2015 at 4:41 PM, Ben b@benjackson.email wrote: It is either a problem with CEPH, Civetweb or something else in our configuration. But deletes in user buckets is still leaving a high number of old shadow files. Since we have millions and millions of objects, it is hard to reconcile what should and shouldnt exist. Looking at our cluster usage, there are no troughs, it is just a rising peak. But when looking at users data usage, we can see peaks and troughs as you would expect as data is deleted and added. Our ceph version 0.80.9 Please ideas? On 2015-03-13 02:25, Yehuda Sadeh-Weinraub wrote: - Original Message - From: Ben b@benjackson.email To: ceph-us...@ceph.com Sent: Wednesday, March 11, 2015 8:46:25 PM Subject: Re: [ceph-users] Shadow files Anyone got any info on this? Is it safe to delete shadow files? It depends. Shadow files are badly named objects that represent part of the objects data. They are only safe to remove if you know that the corresponding objects no longer exist. Yehuda On 2015-03-11 10:03, Ben wrote: We have a large number of shadow files in our cluster that aren't being deleted automatically as data is deleted. Is it safe to delete these files? Is there something we need to be aware of when deleting them? Is there a script that we can run that will delete these safely? Is there something wrong with our cluster that it isn't deleting these files when it should be? We are using civetweb with radosgw, with tengine ssl proxy infront of it Any advice please Thanks ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
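For reference, the knobs being described, with their default values (the section name is whatever your rgw client section is called in ceph.conf):

[client.radosgw.gateway]
  rgw gc obj min wait = 7200        # seconds after delete before an object is eligible for GC
  rgw gc processor period = 3600    # seconds between GC runs
  rgw gc max objs = 32              # objects processed per run

You can also watch and drive it by hand while tuning:

radosgw-admin gc list --include-all   # what is queued for garbage collection
radosgw-admin gc process              # run a GC pass now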
Re: [ceph-users] Can not list objects in large bucket
By default, radosgw only returns the first 1000 objects. Looks like radosgw-admin has the same limit. Looking at the man page, I don't see any way to page through the list. I must be missing something. The S3 API does have the ability to page through the list. I use the command line tool s3cmd to get the full bucket list. It does require user credentials though, so that might be a pain if you have many users. On Wed, Mar 11, 2015 at 6:47 PM, Sean Sullivan seapasu...@uchicago.edu wrote: I have a single radosgw user with 2 s3 keys and 1 swift key. I have created a few buckets and I can list all of the contents of bucket A and C but not B with either S3 (boto) or python-swiftclient. I am able to list the first 1000 entries using radosgw-admin 'bucket list --bucket=bucketB' without any issues but this doesn't really help. The odd thing is I can still upload and download objects in the bucket. I just can't list them. I tried setting the bucket canned_acl to private and public but I still can't list the objects inside. I'm using ceph .87 (Giant) Here is some info about the cluster:: http://pastebin.com/LvQYnXem -- ceph.conf http://pastebin.com/efBBPCwa -- ceph -s http://pastebin.com/tF62WMU9 -- radosgw-admin bucket list http://pastebin.com/CZ8TkyNG -- python list bucket objects script http://pastebin.com/TUCyxhMD -- radosgw-admin bucket stats --bucketB http://pastebin.com/uHbEtGHs -- rados -p .rgw.buckets ls | grep default.20283.2 (bucketB marker) http://pastebin.com/WYwfQndV -- Python Error when trying to list BucketB via boto I have no idea why this could be happening outside of the acl. Has anyone seen this before? Any idea on how I can get access to this bucket again via s3/swift? Also is there a way to list the full list of a bucket via radosgw-admin and not the first 9000 lines / 1000 entries, or a way to page through them? EDIT:: I just fixed it (I hope) but the fix doesn't make any sense: radosgw-admin bucket unlink --uid=user --bucket=bucketB radosgw-admin bucket link --uid=user --bucket=bucketB --bucket-id=default.20283.2 Now with swift or s3 (boto) I am able to list the bucket contents without issue ^_^ Can someone elaborate on why this works and how it broken in the first place when ceph was health_ok the entire time? With 3 replicas how did this happen? Could this be a bug? sorry for the rambling. I am confused and tired ;p ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
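For the archives, listing past the 1000-entry mark from the S3 side looks roughly like this (endpoint and credentials are placeholders; s3cmd follows the paging markers for you):

s3cmd --access_key=AKIAEXAMPLE --secret_key=secretexample \
      --host=rgw.example.com --host-bucket='%(bucket)s.rgw.example.com' \
      ls s3://bucketB | wc -l

Newer radosgw-admin releases also have --max-entries and --marker options for bucket list, which let you page through without user credentials, though I haven't checked which version introduced them.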
Re: [ceph-users] CEPH Expansion
It depends. There are a lot of variables, like how many nodes and disks you currently have. Are you using journals on SSD. How much data is already in the cluster. What the client load is on the cluster. Since you only have 40 GB in the cluster, it shouldn't take long to backfill. You may find that it finishes backfilling faster than you can format the new disks. Since you only have a single OSD node, you must've changed the crushmap to allow replication over OSDs instead of hosts. After you get the new node in would be the best time to switch back to host level replication. The more data you have, the more painful that change will become. On Sun, Jan 18, 2015 at 10:09 AM, Georgios Dimitrakakis gior...@acmac.uoc.gr wrote: Hi Jiri, thanks for the feedback. My main concern is if it's better to add each OSD one-by-one and wait for the cluster to rebalance every time or do it all-together at once. Furthermore an estimate of the time to rebalance would be great! Regards, ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CEPH Expansion
You've either modified the crushmap, or changed the pool size to 1. The defaults create 3 replicas on different hosts. What does `ceph osd dump | grep ^pool` output? If the size param is 1, then you reduced the replica count. If it's still at the default of 3, then you must've adjusted the crushmap. Either way, after you add the second node would be the ideal time to change that back to the default. Given that you only have 40GB of data in the cluster, you shouldn't have a problem adding the 2nd node. On Fri, Jan 23, 2015 at 3:58 PM, Georgios Dimitrakakis gior...@acmac.uoc.gr wrote: Hi Craig! For the moment I have only one node with 10 OSDs. I want to add a second one with 10 more OSDs. Each OSD in every node is a 4TB SATA drive. No SSD disks! The data are approximately 40GB and I will do my best to have zero or at least very very low load during the expansion process. To be honest I haven't touched the crushmap. I wasn't aware that I should have changed it. Therefore, it still is with the default one. Is that OK? Where can I read about the host level replication in CRUSH map in order to make sure that it's applied or how can I find if this is already enabled? Any other things that I should be aware of? All the best, George It depends. There are a lot of variables, like how many nodes and disks you currently have. Are you using journals on SSD. How much data is already in the cluster. What the client load is on the cluster. Since you only have 40 GB in the cluster, it shouldn't take long to backfill. You may find that it finishes backfilling faster than you can format the new disks. Since you only have a single OSD node, you must've changed the crushmap to allow replication over OSDs instead of hosts. After you get the new node in would be the best time to switch back to host level replication. The more data you have, the more painful that change will become. On Sun, Jan 18, 2015 at 10:09 AM, Georgios Dimitrakakis wrote: Hi Jiri, thanks for the feedback. My main concern is if it's better to add each OSD one-by-one and wait for the cluster to rebalance every time or do it all-together at once. Furthermore an estimate of the time to rebalance would be great! Regards, ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
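To answer the "how can I find if this is already enabled" part concretely, the failure domain is visible in the decompiled crushmap:

ceph osd dump | grep ^pool                 # check the 'size' value per pool
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
grep 'step chooseleaf' crushmap.txt        # 'type osd' = per-disk, 'type host' = per-server replication

# once the second node is in and healthy, switch back to host-level replication:
sed -e 's/step chooseleaf firstn 0 type osd/step chooseleaf firstn 0 type host/' crushmap.txt > crushmap-new.txt
crushtool -c crushmap-new.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin

Expect data movement when the rule changes, so do it before the cluster fills up much further.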
Re: [ceph-users] Slow/Hung IOs
I doesn't seem like the problem here, but I've noticed that slow OSDs have a large fan-out. I have less than 100 OSDs, so every OSD talks to every other OSD in my cluster. I was getting slow notices from all of my OSDs. Nothing jumped out, so I started looking at disk write latency graphs. I noticed that all the OSDs in one node had 10x the write latency of the other nodes. After that, I graphed the number of slow notices per OSD, and noticed that a much higher number of slow requests on that node. Long story short, I lost a battery on my write cache. But it wasn't at all obvious from the slow request notices, not until I dug deeper. On Mon, Jan 5, 2015 at 4:07 PM, Sanders, Bill bill.sand...@teradata.com wrote: Thanks for the reply. 14 and 18 happened to show up during that run, but its certainly not only those OSD's. It seems to vary each run. Just from the runs I've done today I've seen the following pairs of OSD's: ['0,13', '0,18', '0,24', '0,25', '0,32', '0,34', '0,36', '10,22', '11,30', '12,28', '13,30', '14,22', '14,24', '14,27', '14,30', '14,31', '14,33', '14,34', '14,35', '14,39', '16,20', '16,27', '18,38', '19,30', '19,31', '19,39', '20,38', '22,30', '26,37', '26,38', '27,33', '27,34', '27,36', '28,32', '28,34', '28,36', '28,37', '3,18', '3,27', '3,29', '3,37', '4,10', '4,29', '5,19', '5,37', '6,25', '9,28', '9,29', '9,37'] Which is almost all of the OSD's in the system. Bill -- *From:* Lincoln Bryant [linco...@uchicago.edu] *Sent:* Monday, January 05, 2015 3:40 PM *To:* Sanders, Bill *Cc:* ceph-users@lists.ceph.com *Subject:* Re: [ceph-users] Slow/Hung IOs Hi BIll, From your log excerpt, it looks like your slow requests are happening on OSDs 14 and 18. Is it always these two OSDs? If you don't have a long recovery time (e.g., the cluster is just full of test data), maybe you could try setting OSDs 14 and 18 out and re-benching? Alternatively I suppose you could just use bonnie++ or dd etc to write to those OSDs (careful to not clobber any Ceph dirs) and see how the performance looks. Cheers, Lincoln On Jan 5, 2015, at 4:36 PM, Sanders, Bill wrote: Hi Ceph Users, We've got a Ceph cluster we've built, and we're experiencing issues with slow or hung IO's, even running 'rados bench' on the OSD cluster. Things start out great, ~600 MB/s, then rapidly drops off as the test waits for IO's. Nothing seems to be taxed... the system just seems to be waiting. Any help trying to figure out what could cause the slow IO's is appreciated. 
For example, 'rados -p rbd bench 60 write -t 32' takes over 900s to complete: A typical rados bench: Total time run: 957.458274 Total writes made: 9251 Write size: 4194304 Bandwidth (MB/sec): 38.648 Stddev Bandwidth: 157.323 Max bandwidth (MB/sec): 964 Min bandwidth (MB/sec): 0 Average Latency:3.21126 Stddev Latency: 51.9546 Max latency:910.72 Min latency:0.04516 According to ceph.log, we're not experiencing any OSD flapping or monitor election cycles, just slow requests: # grep slow /var/log/ceph/ceph.log: 2015-01-05 13:42:42.937678 osd.18 39.7.48.7:6803/11185 220 : [WRN] 3 slow requests, 1 included below; oldest blocked for 513.611379 secs 2015-01-05 13:42:42.937685 osd.18 39.7.48.7:6803/11185 221 : [WRN] slow request 30.136429 seconds old, received at 2015-01-05 13:42:12.801205: osd_op(client.92008.1:3101508 rb.0.1437.238e1f29.000f [write 114688~512] 3.841c0edf ondisk+write e994) v4 currently waiting for subops from 3,37 2015-01-05 13:42:49.938681 osd.18 39.7.48.7:6803/11185 222 : [WRN] 3 slow requests, 1 included below; oldest blocked for 520.612372 secs 2015-01-05 13:42:49.938688 osd.18 39.7.48.7:6803/11185 223 : [WRN] slow request 480.636547 seconds old, received at 2015-01-05 13:34:49.302080: osd_op(client.92008.1:3100010 rb.0.140d.238e1f29.0c77 [write 3622400~512] 3.d031a69f ondisk+write e994) v4 currently waiting for subops from 26,37 2015-01-05 13:43:12.941838 osd.18 39.7.48.7:6803/11185 224 : [WRN] 3 slow requests, 1 included below; oldest blocked for 543.615545 secs 2015-01-05 13:43:12.941844 osd.18 39.7.48.7:6803/11185 225 : [WRN] slow request 60.140595 seconds old, received at 2015-01-05 13:42:12.801205: osd_op(client.92008.1:3101508 rb.0.1437.238e1f29.000f [write 114688~512] 3.841c0edf ondisk+write e994) v4 currently waiting for subops from 3,37 2015-01-05 13:44:04.933440 osd.14 39.7.48.7:6818/11640 251 : [WRN] 4 slow requests, 1 included below; oldest blocked for 606.941954 secs 2015-01-05 13:44:04.933469 osd.14 39.7.48.7:6818/11640 252 : [WRN] slow request 240.101138 seconds old, received at 2015-01-05 13:40:04.832272: osd_op(client.92008.1:3101102 rb.0.142b.238e1f29.0010 [write 475136~512] 3.5e623815 ondisk+write e994) v4 currently waiting for subops from 27,33 2015-01-05 13:44:12.950805 osd.18
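A couple of one-liners that help narrow this kind of thing down, run against the central log on a monitor host (the path may differ on your install):

# which OSDs are reporting slow requests
grep 'slow request' /var/log/ceph/ceph.log | awk '{print $3}' | sort | uniq -c | sort -rn
# which OSDs the slow ops are waiting on as replicas
grep 'waiting for subops from' /var/log/ceph/ceph.log | sed 's/.*subops from //' | tr ',' '\n' | sort -n | uniq -c | sort -rn

If one OSD or one host dominates either list, that's the disk or controller to look at first.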
Re: [ceph-users] backfill_toofull, but OSDs not full
What was the osd_backfill_full_ratio? That's the config that controls backfill_toofull. By default, it's 85%. The mon_osd_*_ratio settings affect the ceph status output. I've noticed that it takes a while for backfilling to restart after changing osd_backfill_full_ratio. Backfilling usually restarts for me in 10-15 minutes. Some PGs will stay in that state until the cluster is nearly done recovering. I've only seen backfill_toofull happen after the OSD exceeds the ratio (so it's reactive, not proactive). Mine usually happen when I'm rebalancing a nearfull cluster, and an OSD backfills itself toofull. On Mon, Jan 5, 2015 at 11:32 AM, c3 ceph-us...@lopkop.com wrote: Hi, I am wondering how a PG gets marked backfill_toofull. I reweighted several OSDs using ceph osd crush reweight. As expected, PGs began moving around (backfilling). Some PGs got marked +backfilling (~10), some +wait_backfill (~100). But some are marked +backfill_toofull. My OSDs are between 25% and 72% full. Looking at ceph pg dump, I can find the backfill_toofull PGs and verified the OSDs involved are less than 72% full. Do backfill reservations include a size? Are these OSDs projected to be toofull, once the current backfilling completes? Some of the backfill_toofull and backfilling point to the same OSDs. I did adjust the full ratios, but that did not change the backfill_toofull status. ceph tell mon.\* injectargs '--mon_osd_full_ratio 0.95' ceph tell osd.\* injectargs '--osd_backfill_full_ratio 0.92' ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
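Two quick checks that help when this comes up (osd.12 is just an example id, and the second command has to run on the host where that OSD lives, since it uses the admin socket):

ceph pg dump_stuck unclean | grep backfill_toofull    # which PGs are stuck, and which OSDs they map to
ceph daemon osd.12 config show | grep -E 'backfill_full_ratio|full_ratio'   # what the running daemon actually has

The second check matters because injectargs only changes running daemons; the new value is lost at the next restart unless it is also written into ceph.conf.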
Re: [ceph-users] Different disk usage on different OSDs
The short answer is that uniform distribution is a lower priority feature of the CRUSH hashing algorithm. CRUSH is designed to be consistent and stable in it's hashing. For the details, you can read Sage's paper ( http://ceph.com/papers/weil-rados-pdsw07.pdf). The goal is that if you make a change to your cluster, there will be some moderate data movement, but not everything moves. If you then undo the change, things will go back to exactly how they were before. Doing that and getting uniform distribution is hard, and it's work in progress. The tunables are progress on this front, but they are by no means the last word. The current work around is to use ceph osd reweight-by-utilization. That tool will look at data distributions, and reweight things to bring the OSDs more inline with each other. Unfortunately, it does a ceph osd reweight, not a ceph osd crush reweight. (The existence of two different weighs with different behavior is unfortunate too). ceph osd reweight is temporary, in that the value will be lost if a OSD is marked out. ceph osd crush reweight updates the CRUSHMAP, and it's not temporary. So I use ceph osd crush reweight manually. While it would be nice if Ceph would automatically rebalance itself, I'd turn that off. Moving data around in my small cluster involves a major performance hit. By manually adjusting the crush weights, I have some control over when and how much data is moved around. I recommend taking a look a ceph osd tree and df on all nodes, and start adjusting the crush weight of heavily used disks down, and under utilized disks up. The crush weight is generally the size (base2) of the disk in TiB. I adjust my OSDs up or down by 0.05 each step, then decide if I need to make another pass. I have one 4 TiB drives with a weight of 4.14, and another with a weight of 3.04. They're still not balanced, but it's better. If data migration has a smaller impact on your cluster, larger steps should be fine. And if anything causes major problems, just revert the change. CRUSH is stable and consistent :-) On Mon, Jan 5, 2015 at 2:04 AM, ivan babrou ibob...@gmail.com wrote: Hi! I have a cluster with 106 osds and disk usage is varying from 166gb to 316gb. Disk usage is highly correlated to number of pg per osd (no surprise here). Is there a reason for ceph to allocate more pg on some nodes? The biggest osds are 30, 42 and 69 (300gb+ each) and the smallest are 87, 33 and 55 (170gb each). The biggest pool has 2048 pgs, pools with very little data has only 8 pgs. PG size in biggest pool is ~6gb (5.1..6.3 actually). Lack of balanced disk usage prevents me from using all the disk space. When the biggest osd is full, cluster does not accept writes anymore. Here's gist with info about my cluster: https://gist.github.com/bobrik/fb8ad1d7c38de0ff35ae -- Regards, Ian Babrou http://bobrik.name http://twitter.com/ibobrik skype:i.babrou ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
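A concrete version of that workflow, using the OSD ids from the post above; the weights are only example steps around the nominal 3.64 for a 4 TB drive:

ceph osd tree                         # current crush weights per OSD and host
ceph osd crush reweight osd.30 3.59   # nudge an over-full OSD down one step
ceph osd crush reweight osd.87 3.69   # and an under-full one up one step
ceph -w                               # watch the resulting backfill before taking another step

There's also 'ceph osd reweight-by-utilization 120' to do it automatically (reweighting OSDs more than 20% above average utilization), but as noted it adjusts the temporary reweight value rather than the crush weight.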
Re: [ceph-users] Ceph PG Incomplete = Cluster unusable
On Mon, Dec 29, 2014 at 4:49 PM, Alexandre Oliva ol...@gnu.org wrote: However, I suspect that temporarily setting min size to a lower number could be enough for the PGs to recover. If ceph osd pool pool set min_size 1 doesn't get the PGs going, I suppose restarting at least one of the OSDs involved in the recovery, so that they PG undergoes peering again, would get you going again. It depends on how incomplete your incomplete PGs are. min_size is defined as Sets the minimum number of replicas required for I/O.. By default, size is 3 and min_size is 2 on recent versions of ceph. If the number of replicas you have drops below min_size, then Ceph will mark the PG as incomplete. As long as you have one copy of the PG, you can recover by lowering the min_size to the number of copies you do have, then restoring the original value after recovery is complete. I did this last week when I deleted the wrong PGs as part of a toofull experiment. If the number of replicas drops to 0, I think you can use ceph pg force_create_pg, but I haven't tested it. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
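The command sequence for that recovery path, with 'rbd' standing in for whichever pool holds the incomplete PGs:

ceph osd pool get rbd min_size     # note the current value
ceph osd pool set rbd min_size 1   # let PGs with a single surviving copy go active
# wait for recovery/backfill to finish (watch with 'ceph -w'), then:
ceph osd pool set rbd min_size 2   # put the safety margin back

Running with min_size 1 means one more failure can lose data, so treat it strictly as a recovery setting, not a steady state.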
Re: [ceph-users] Behaviour of a cluster with full OSD(s)
On Tue, Dec 23, 2014 at 3:34 AM, Max Power mailli...@ferienwohnung-altenbeken.de wrote: I understand that the status osd full should never be reached. As I am new to ceph I want to be prepared for this case. I tried two different scenarios and here are my experiences: For a real cluster, you should be monitoring your cluster, and taking immediate action once you get an OSD in nearfull state. Waiting until OSDs are toofull is too late. For a test cluster, it's a great learning experience. :-) The first one is to completely fill the storage (for me: writing files to a rados blockdevice). I discovered that the writing client (dd for example) gets completly stucked then. And this prevents me from stoping the process (SIGTERM, SIGKILL). At the moment I restart the whole computer to prevent writing to the cluster. Then I unmap the rbd device and set the full ratio a bit higher (0.95 to 0.97). I do a mount on my adminnode and delete files till everything is okay again. Is this the best practice? It is a design feature of Ceph that all cluster reads and writes stop until the toofull situation is resolved. The route you took is one of two ways to recover. The other route you found in your replica test. Is it possible to prevent the system from running in a osd full state? I could make the block devices smaller than the cluster can save. But it's hard to calculate this exactly. If you continue to add data to the cluster after it's nearfull, then you're going to hit toofull. Once you hit nearfull, you need to delete existing data, or add more OSDs. You've probably noticed that some OSDs are using more space than others. You can try to even them out with `ceph osd reweight` or `ceph osd crush reweight`, but that's a delaying tactic. When I hit nearfull, I place an order for new hardware, then use `ceph osd reweight` until it arrives. The next scenario is to change a pool size from say 2 to 3 replicas. While the cluster copies the objects it gets stuck as an osd reaches it limit. Normally the osd process quits then and I cannot restart it (even after setting the replicas back). The only possibility is to manually delete complete PG folders after exploring them with 'pg dump'. Is this the only way to get it back working again? There are some other configs that might have come into play here. You might have run into osd_failsafe_nearfull_ratio or osd_failsafe_full_ratio. You could try bumping those up a bit, and see if that lets the process stay up long enough to start reducing replicas. Since osd_failsafe_full_ratio is already 0.97, I wouldn't take it any higher than 0.98. Ceph triggers on greater-than percentages, so 0.99 will let you fill a disk to 100% full. If you get a disk to 100% full, the only way to cleanup is to start deleting PG directories. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Any Good Ceph Web Interfaces?
Are you asking because you want to manage a Ceph cluster point and click? Or do you need some shiny to show the boss? I'm using a combination of Chef and Zabbix. I'm not running RHEL though, but I would assume those are available in the repos. It's not as slick as Calamari, and it really doesn't give me a whole cluster view. Ganglia did a better job of that, but I went with Zabbix for the graphing and alerting in a single product. If you're looking for some shiny for the boss, Zabbix's web interface should work fine. If you're looking for a point and click way to build a Ceph cluster, I think Calamari is your only option. On Mon, Dec 22, 2014 at 4:11 PM, Tony unix...@gmail.com wrote: Please don't mention calamari :-) The best web interface for ceph that actually works with RHEL6.6 Preferable something in repo and controls and monitors all other ceph osd, mon, etc. Take everything and live for the moment. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph-deploy state of documentation [was: OSD JOURNAL not associated - ceph-disk list ?]
I get the impression that more people on the ML are using a config management system. ceph-deploy questions seem to come from new users following the quick start guide. I know both Puppet and Chef are fairly well represented here. I've seen a few posts about Salt and Ansible, but not much. Calamari is built on top of Salt, so I suppose that means Salt is well represented. I really haven't seen anything from the CFEngine or Bcfg2 camps. I'm personally using Chef with a private fork of the Ceph cookbook. The Ceph cookbook doesn't use ceph-deploy, but it does use ceph-disk. Whenever I have problems with the ceph-disk command, I first go look at the cookbook to see how it's doing things. On Sun, Dec 21, 2014 at 10:37 AM, Nico Schottelius nico-ceph-us...@schottelius.org wrote: Hello list, I am a bit wondering about ceph-deploy and the development of ceph: I see that many people in the community are pushing towards the use of ceph-deploy, likely to ease use of ceph. However, I have run multiple times into issues using ceph-deploy, when it failed or incorrectly setup partitions or created a cluster of monitors that never reach a qurom. I have also recognised debugging and learning of ceph being much more difficult with ceph-deploy, compared to going the manual way, because as a user I miss a lot of information. Furthermore as the maintainer of a configuration management system [0], I am interested in knowing how things are working behind the scenes to be able to automate them. Thus I was wondering, if it is an option for the ceph community to focus on both (the manual ceph-deploy) way instead of just pushing ceph-deploy? Cheers, Nico p.s.: Loic, just taking your mail as an example, but it is not personal - just want to show my point. Loic Dachary [Sun, Dec 21, 2014 at 06:08:27PM +0100]: [...] Is there a reason why you need to do this instead of letting ceph-disk prepare do it for you ? [...] -- New PGP key: 659B 0D91 E86E 7E24 FD15 69D0 C729 21A1 293F 2D24 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.
On Mon, Dec 22, 2014 at 2:57 PM, Sean Sullivan seapasu...@uchicago.edu wrote: Thanks Craig! I think that this may very well be my issue with osds dropping out but I am still not certain as I had the cluster up for a small period while running rados bench for a few days without any status changes. Mine were fine for a while too, through several benchmarks and a large RadosGW import. My problems were memory pressure plus an XFS bug, so it took a while to manifest. When it did, all of the ceph-osd processes on that node would have periods of ~30 seconds with 100% CPU. Some OSDs would get kicked out. Once that started, it was a downward spiral of recovery causing increasing load causing more OSDs to get kicked out... Once I found the memory problem, I cronned a buffer flush, and that usually kept things from getting too bad. I was able to see on the CPU graphs that CPU was increasing before the problems started. Once CPU got close to 100% usage on all cores, that's when the OSDs started dropping out. Hard to say if it was the CPU itself, or if the CPU was just a symptom of the memory pressure plus XFS bug. The real big issue that I have is the radosgw one currently. After I figure out the root cause of the slow radosgw performance and correct that, it should hopefully buy me enough time to figure out the osd slow issue. It just doesn't make sense that I am getting 8mbps per client no matter 1 or 60 clients while rbd and rados shoot well above 600MBs (above 1000 as well). That is strange. I was able to get 300 Mbps per client, on a 3 node cluster with GigE. I expected that each client would saturate the GigE on their own, but 300 Mbps is more than enough for now. I am using the Ceph apache and fastcgi module, but otherwise it's a pretty standard apache setup. My RadosGW processes are using a fair amount of CPU, but as long as you have some idle CPU, that shouldn't be the bottleneck. May I ask how you are monitoring your clusters logs? Are you just using rsyslog or do you have a logstash type system set up? Load wise I do not see a spike until I pull an osd out of the cluster or stop then start an osd without marking nodown. I'm monitoring the cluster with Zabbix, and that gives me pretty much the same info that I'd get in the logs. I am planning to start pushing the logs to Logstash soon, as soon as I get my logstash is able to handle the extra load. I do think that CPU is probably the cause of the osd slow issue though as it makes the most logical sense. Did you end up dropping ceph and moving to zfs or did you stick with it and try to mitigate it via file flusher/ other tweaks? I'm still on Ceph. I worked around the memory pressure by reformatting my XFS filesystems to use regular sized inodes. It was a rough couple of months, but everything has been stable for the last two months. I do still want to use ZFS on my OSDs. It's got all the features of BtrFS, with the extra feature of being production ready. It's just not production ready in Ceph yet. It's coming along nicely though, and I hope to reformat one node to be all ZFS sometime next year. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Have 2 different public networks
On Thu, Dec 18, 2014 at 10:47 PM, Francois Lafont flafdiv...@free.fr wrote: Le 19/12/2014 02:18, Craig Lewis a écrit : The daemons bind to *, Yes but *only* for the OSD daemon. Am I wrong? Personally I must provide IP addresses for the monitors in the /etc/ceph/ceph.conf, like this: [global] mon host = 10.0.1.1, 10.0.1.2, 10.0.1.3 Or like this: [mon.1] mon addr = 10.0.1.1 [mon.2] mon addr = 10.0.1.2 [mon.3] mon addr = 10.0.1.3 I'm not using mon addr lines, and my ceph-mon daemons are bound to 0.0.0.0:*. I have no [mon.#] or [osd.#] sections at all. I do have the global mon host line. On the management nodes, try putting the 10.0.2.0/24 addresses there instead of the 10.0.1.0/24 addresses. Do you really plan on having enough traffic creating and deleting RDB images that you need a dedicated network? It seems like setting up link aggregation on 10.0.1.0/24 would be simpler and less error prone. This is not for traffic. I must have a node to manage rbd images and this node is in a different VLAN (this is an Openstack install... I try... ;). If it's not a traffic volume problem, can you allow the 10.0.2.0/24 network to route to the 10.0.1.0/24 network, and open the firewall enough? There should be enough info in the network config to get the firewall working: http://docs.ceph.com/docs/next/rados/configuration/network-config-ref/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
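If routing between the VLANs is the chosen fix, the ports that need to be reachable from the 10.0.2.0/24 management network are the ones listed on that network-config page; as iptables rules on the mon/OSD hosts that would look roughly like:

iptables -A INPUT -s 10.0.2.0/24 -p tcp --dport 6789 -j ACCEPT        # monitors
iptables -A INPUT -s 10.0.2.0/24 -p tcp --dport 6800:7300 -j ACCEPT   # OSDs (and MDS, if used)

Persist them with iptables-save or your distro's firewall tooling. Nothing on 10.0.2.0/24 needs access to the cluster (replication) network; clients only ever talk to the public network.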
Re: [ceph-users] Need help from Ceph experts
I've done single nodes. I have a couple VMs for RadosGW Federation testing. It has a single virtual network, with both clusters on the same network. Because I'm only using a single OSD on a single host, I had to update the crushmap to handle that. My Chef recipe runs: ceph osd getcrushmap -o /tmp/compiled-crushmap.old crushtool -d /tmp/compiled-crushmap.old -o /tmp/decompiled-crushmap.old sed -e '/step chooseleaf firstn 0 type/s/host/osd/' /tmp/decompiled-crushmap.old /tmp/decompiled-crushmap.new crushtool -c /tmp/decompiled-crushmap.new -o /tmp/compiled-crushmap.new ceph osd setcrushmap -i /tmp/compiled-crushmap.new Those are the only extra commands I run for a single node cluster. Otherwise, it looks the same as my production nodes that run mon, osd, and rgw. Here's my single node's ceph.conf: [global] fsid = a7798848-1d31-421b-8f3c-5a34d60f6579 mon initial members = test0-ceph0 mon host = 172.16.205.143:6789 auth client required = none auth cluster required = none auth service required = none mon warn on legacy crush tunables = false osd crush chooseleaf type = 0 osd pool default flag hashpspool = true osd pool default min size = 1 osd pool default size = 1 public network = 172.16.205.0/24 [osd] osd journal size = 1000 osd mkfs options xfs = -s size=4096 osd mkfs type = xfs osd mount options xfs = rw,noatime,nodiratime,nosuid,noexec,inode64 osd_scrub_sleep = 1.0 osd_snap_trim_sleep = 1.0 [client.radosgw.test0-ceph0] host = test0-ceph0 rgw socket path = /var/run/ceph/radosgw.test0-ceph0 keyring = /etc/ceph/ceph.client.radosgw.test0-ceph0.keyring log file = /var/log/ceph/radosgw.log admin socket = /var/run/ceph/radosgw.asok rgw dns name = test0-ceph rgw region = us rgw region root pool = .us.rgw.root rgw zone = us-west rgw zone root pool = .us-west.rgw.root On Thu, Dec 18, 2014 at 11:23 PM, Debashish Das deba@gmail.com wrote: Hi Team, Thank for the insight the replies, as I understood from the mails - running Ceph cluster in a single node is possible but definitely not recommended. The challenge which i see is there is no clear documentation for single node installation. So I would request if anyone has installed Ceph in single node, please share the link or document which i can refer to install Ceph in my local server. Again thanks guys !! Kind Regards Debashish Das On Fri, Dec 19, 2014 at 6:08 AM, Robert LeBlanc rob...@leblancnet.us wrote: Thanks, I'll look into these. On Thu, Dec 18, 2014 at 5:12 PM, Craig Lewis cle...@centraldesktop.com wrote: I think this is it: https://engage.redhat.com/inktank-ceph-reference-architecture-s-201409080939 You can also check out a presentation on Cern's Ceph cluster: http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern At large scale, the biggest problem will likely be network I/O on the inter-switch links. On Thu, Dec 18, 2014 at 3:29 PM, Robert LeBlanc rob...@leblancnet.us wrote: I'm interested to know if there is a reference to this reference architecture. It would help alleviate some of the fears we have about scaling this thing to a massive scale (10,000's OSDs). Thanks, Robert LeBlanc On Thu, Dec 18, 2014 at 3:43 PM, Craig Lewis cle...@centraldesktop.com wrote: On Thu, Dec 18, 2014 at 5:16 AM, Patrick McGarry patr...@inktank.com wrote: 2. What should be the minimum hardware requirement of the server (CPU, Memory, NIC etc) There is no real minimum to run Ceph, it's all about what your workload will look like and what kind of performance you need. We have seen Ceph run on Raspberry Pis. 
Technically, the smallest cluster is a single node with a 10 GiB disk. Anything smaller won't work. That said, Ceph was envisioned to run on large clusters. IIRC, the reference architecture has 7 rows, each row having 10 racks, all full. Those of us running small clusters (less than 10 nodes) are noticing that it doesn't work quite as well. We have to significantly scale back the amount of backfilling and recovery that is allowed. I try to keep all backfill/recovery operations touching less than 20% of my OSDs. In the reference architecture, it could lose a whole row, and still keep under that limit. My 5-node cluster is noticeably better than the 3-node cluster. It's faster, has lower latency, and latency doesn't increase as much during recovery operations. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Recovering from PG in down+incomplete state
Why did you remove osd.7? Something else appears to be wrong. With all 11 OSDs up, you shouldn't have any PGs stuck in stale or peering. How badly are the clocks skewed between nodes? If it's bad enough, it can cause communication problems between nodes. Ceph will complain if the clocks are more than 50ms different. It's best if you run ntpd on all nodes. I'm thinking that cleaning up the clock skew will fix most of your issues. If that does fix the issue, you can try bringing osd.7 back in. Don't reformat it, just deploy it as you normally would. The CRUSHMAP will go back to the way it was before you removed osd.7. Ceph will start to backfill+remap data onto the new osd, and see that most of it is already there. It should recovery relatively quickly... I think. On Fri, Dec 19, 2014 at 10:28 AM, Robert LeBlanc rob...@leblancnet.us wrote: I'm still pretty new at troubleshooting Ceph and since no one has responded yet I'll give a stab. What is the size of your pool? 'ceph osd pool get pool name size' It seems like based on the number of incomplete PGs that it was '1'. I understand that if you are able to bring osd 7 back in, it would clear up. I'm just not seeing a secondary osd for that PG. Disclaimer: I could be totally wrong. Robert LeBlanc On Thu, Dec 18, 2014 at 11:41 PM, Mallikarjun Biradar mallikarjuna.bira...@gmail.com wrote: Hi all, I had 12 OSD's in my cluster with 2 OSD nodes. One of the OSD was in down state, I have removed that PG from cluster, by removing crush rule for that OSD. Now cluster with 11 OSD's, started rebalancing. After sometime, cluster status was ems@rack6-client-5:~$ sudo ceph -s cluster eb5452f4-5ce9-4b97-9bfd-2a34716855f1 health HEALTH_WARN 1 pgs down; 252 pgs incomplete; 10 pgs peering; 73 pgs stale; 262 pgs stuck inactive; 73 pgs stuck stale; 262 pgs stuck unclean; clock skew detected on mon.rack6-client-5, mon.rack6-client-6 monmap e1: 3 mons at {rack6-client-4= 10.242.43.105:6789/0,rack6-client-5=10.242.43.106:6789/0,rack6-client-6=10.242.43.107:6789/0}, election epoch 12, quorum 0,1,2 rack6-client-4,rack6-client-5,rack6-client-6 osdmap e2648: 11 osds: 11 up, 11 in pgmap v554251: 846 pgs, 3 pools, 4383 GB data, 1095 kobjects 11668 GB used, 26048 GB / 37717 GB avail 63 stale+active+clean 1 down+incomplete 521 active+clean 251 incomplete 10 stale+peering ems@rack6-client-5:~$ To fix this, i cant run ceph osd lost osd.id to remove the PG which is in down state. As OSD is already removed from the cluster. ems@rack6-client-4:~$ sudo ceph pg dump all | grep down dumped all in format plain 1.3815480 0 0 0 6492782592 3001 3001down+incomplete 2014-12-18 15:58:29.681708 1118'508438 2648:1073892[6,3,1] 6 [6,3,1] 6 76'437184 2014-12-16 12:38:35.322835 76'437184 2014-12-16 12:38:35.322835 ems@rack6-client-4:~$ ems@rack6-client-4:~$ sudo ceph pg 1.38 query . recovery_state: [ { name: Started\/Primary\/Peering, enter_time: 2014-12-18 15:58:29.681666, past_intervals: [ { first: 1109, last: 1118, maybe_went_rw: 1, ... ... down_osds_we_would_probe: [ 7], peering_blocked_by: []}, ... ... 
ems@rack6-client-4:~$ sudo ceph osd tree # idweight type name up/down reweight -1 36.85 root default -2 20.1host rack2-storage-1 0 3.35osd.0 up 1 1 3.35osd.1 up 1 2 3.35osd.2 up 1 3 3.35osd.3 up 1 4 3.35osd.4 up 1 5 3.35osd.5 up 1 -3 16.75 host rack2-storage-5 6 3.35osd.6 up 1 8 3.35osd.8 up 1 9 3.35osd.9 up 1 10 3.35osd.10 up 1 11 3.35osd.11 up 1 ems@rack6-client-4:~$ sudo ceph osd lost 7 --yes-i-really-mean-it osd.7 is not down or doesn't exist ems@rack6-client-4:~$ Can somebody suggest any other recovery step to come out of this? -Thanks Regards, Mallikarjun Biradar ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com
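For the clock skew part of the warning, the usual checks look like this (the NTP server is a placeholder; use whatever your site syncs against, and note the service is called ntpd on RHEL-family hosts):

ceph health detail | grep -i clock    # shows which mons are skewed and by how much
ntpdate -q 0.pool.ntp.org             # dry-run query: how far off is this node?
# one-time hard sync, then keep ntpd running so it doesn't drift again
service ntp stop && ntpdate 0.pool.ntp.org && service ntp start

The default warning threshold (mon_clock_drift_allowed) is 50ms, so even a modest drift on one monitor is enough to trigger the HEALTH_WARN above.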
Re: [ceph-users] Placement groups stuck inactive after down out of 1/9 OSDs
That seems odd. So you have 3 nodes, with 3 OSDs each. You should've been able to mark osd.0 down and out, then stop the daemon without having those issues. It's generally best to mark an osd down, then out, and wait until the cluster has recovered completely before stopping the daemon and removing it from the cluster. That guarantees that you always have 3+ copies of the data. Disks don't always fail gracefully though. If you have a sudden and complete failure, you can't do it the nice way. At that point, just mark the osd down and out. If your cluster was healthy before this event, you shouldn't have any data problems. If the cluster wasn't HEALTH_OK before the event, you will likely have some problems. Is your cluster HEALTH_OK now? If not, can you give me the following? - ceph -s - ceph osd tree - ceph osd dump | grep ^pool - ceph pg dump_stuck - ceph pg query pgid # For just one of the stuck PGs I'm a bit confused why your cluster has a bunch of PGs in the remapped state, but none in the remapping state. It's supposed to be recovering, and something is blocking that. As to the hung VMs, during any recovery or backfill, you'll probably have IO problems. The ceph.conf defaults are intended for large clusters, probably with SSD journals. In my 3 nodes, 24 OSD cluster with no SSD journals, recovery was IO starving my clients. I de-prioritized recovery with: [osd] osd max backfills = 1 osd recovery max active = 1 osd recovery op priority = 1 It was still painful, but those values kept my cluster usable. Since I've grown to 5 nodes, and added SSD journals, I've been able to increase the backfills and recovery active to 3. I found those values through trial and error, watching my RadosGW latency, and playing with ceph tell osd.\* injectargs ... I've found that I have problems if more than 20% of my OSDs are involved in a backfilling operation. With your 9 OSDs, you're guaranteeing that any single event will always hit at least 22% of your OSDS, and probably more. If you're unable to add more disks, I would highly recommend adding SSD journals. On Fri, Dec 19, 2014 at 8:08 AM, Chris Murray chrismurra...@gmail.com wrote: Hello, I'm a newbie to CEPH, gaining some familiarity by hosting some virtual machines on a test cluster. I'm using a virtualisation product called Proxmox Virtual Environment, which conveniently handles cluster setup, pool setup, OSD creation etc. During the attempted removal of an OSD, my pool appeared to cease serving IO to virtual machines, and I'm wondering if I did something wrong or if there's something more to the process of removing an OSD. The CEPH cluster is small; 9 OSDs in total across 3 nodes. There's a pool called 'vmpool', with size=3 and min_size=1. It's a bit slow, but I see plenty of information on how to troubleshoot that, and understand I should be separating cluster communication onto a separate network segment to improve performance. CEPH version is Firefly - 0.80.7 So, the issue was: I marked osd.0 as down out (or possibly out down, if order matters), and virtual machines hung. 
Almost immediately, 78 pgs were 'stuck inactive', and after some activity overnight, they remained that way: cluster e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a health HEALTH_WARN 290 pgs degraded; 78 pgs stuck inactive; 496 pgs stuck unclean; 4 requests are blocked 32 sec; recovery 69696/685356 objects degraded (10.169%) monmap e3: 3 mons at {0=192.168.12.25:6789/0,1=192.168.12.26:6789/0,2=192.168.12.27:6789/0}, election epoch 50, quorum 0,1,2 0,1,2 osdmap e669: 9 osds: 8 up, 8 in pgmap v100175: 1216 pgs, 4 pools, 888 GB data, 223 kobjects 2408 GB used, 7327 GB / 9736 GB avail 69696/685356 objects degraded (10.169%) 78 inactive 720 active+clean 290 active+degraded 128 active+remapped I started the OSD to bring it back 'up'. It was still 'out'. cluster e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a health HEALTH_WARN 59 pgs degraded; 496 pgs stuck unclean; recovery 30513/688554 objects degraded (4.431%) monmap e3: 3 mons at {0=192.168.12.25:6789/0,1=192.168.12.26:6789/0,2=192.168.12.27:6789/0}, election epoch 50, quorum 0,1,2 0,1,2 osdmap e671: 9 osds: 9 up, 8 in pgmap v103181: 1216 pgs, 4 pools, 892 GB data, 224 kobjects 2408 GB used, 7327 GB / 9736 GB avail 30513/688554 objects degraded (4.431%) 720 active+clean 59 active+degraded 437 active+remapped client io 2303 kB/s rd, 153 kB/s wr, 85 op/s The inactive pgs had disappeared. I stopped the OSD again, making it 'down' and 'out', as it was previous. At this point, I started my virtual machines again, which functioned correctly. cluster e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a health HEALTH_WARN 368 pgs degraded; 496
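To apply the [osd] settings mentioned above without restarting anything, the injectargs form works; the values revert at the next restart unless they are also written into ceph.conf:

ceph tell osd.\* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'
# and back up again once the cluster is healthy and client IO is quiet:
ceph tell osd.\* injectargs '--osd-max-backfills 3 --osd-recovery-max-active 3'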
Re: [ceph-users] Placement groups stuck inactive after down out of 1/9 OSDs
With only one OSD down and size = 3, you shouldn't've had any PGs inactive. At worst, they should've been active+degraded. The only thought I have is that some of your PGs aren't mapping to the correct number of OSDs. That's not supposed to be able to happen unless you've messed up your crush rules. You might go through ceph pg dump, and verify that all PGs have 3 OSDs in the reporting and acting columns, and that there are no duplicate OSDs in those lists. With your 1216 PGs, it might be faster to write a script to parse the JSON than to do it manually. If you happen to remember some PGs that were inactive or degraded, you could spot check those. On Fri, Dec 19, 2014 at 11:45 AM, Chris Murray chrismurra...@gmail.com wrote: Interesting indeed, those tuneables were suggested on the pve-user mailing list too, and they certainly sound like they’ll ease the pressure during the recovery operation. What I might not have explained very well though is that the VMs hung indefinitely and past the end of the recovery process, rather than being slow; almost as if the 78 stuck inactive placement groups contained data which was critical to VM operation. Looking at IO and performance in the cluster is certainly on the to-do list, with a scale-out of nodes and move of journals to SSD, but of course that needs some investment and I’d like to prove things first. It’s a bit catch-22 :-) To my knowledge, the cluster was HEALTH_OK before and it is HEALTH_OK now, BUT ... I've not followed my usual advice of stopping and thinking about things before trying something else, so I suppose the marking of the OSD 'up' this morning (which turned those 78 into some other ACTIVE+* states) has spoiled the chance of troubleshooting. I’ve been messing around with osd.0 since too, and the health is now: cluster e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a health HEALTH_OK monmap e3: 3 mons at {0= 192.168.12.25:6789/0,1=192.168.12.26:6789/0,2=192.168.12.27:6789/0}, election epoch 58, quorum 0,1,2 0,1,2 osdmap e1205: 9 osds: 9 up, 9 in pgmap v120175: 1216 pgs, 4 pools, 892 GB data, 224 kobjects 2679 GB used, 9790 GB / 12525 GB avail 1216 active+clean If it helps at all, the other details are as follows. Nothing from 'dump stuck' although I expect there would have been this morning. root@ceph25:~# ceph osd tree # idweight type name up/down reweight -1 12.22 root default -2 4.3 host ceph25 3 0.9 osd.3 up 1 6 0.68osd.6 up 1 0 2.72osd.0 up 1 -3 4.07host ceph26 1 2.72osd.1 up 1 4 0.9 osd.4 up 1 7 0.45osd.7 up 1 -4 3.85host ceph27 2 2.72osd.2 up 1 5 0.68osd.5 up 1 8 0.45osd.8 up 1 root@ceph25:~# ceph osd dump | grep ^pool pool 0 'data' replicated size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool crash_replay_interval 45 stripe_width 0 pool 1 'metadata' replicated size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0 pool 2 'rbd' replicated size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0 pool 3 'vmpool' replicated size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 187 flags hashpspool stripe_width 0 root@ceph25:~# ceph pg dump_stuck ok The more I think about this problem, the less I think there'll be an easy answer, and it's more likely that I'll have to reproduce the scenario and actually pause myself next time in order to troubleshoot it? 
From: Craig Lewis [mailto:cle...@centraldesktop.com] Sent: 19 December 2014 19:17 To: Chris Murray Cc: ceph-users Subject: Re: [ceph-users] Placement groups stuck inactive after down out of 1/9 OSDs That seems odd. So you have 3 nodes, with 3 OSDs each. You should've been able to mark osd.0 down and out, then stop the daemon without having those issues. It's generally best to mark an osd down, then out, and wait until the cluster has recovered completely before stopping the daemon and removing it from the cluster. That guarantees that you always have 3+ copies of the data. Disks don't always fail gracefully though. If you have a sudden and complete failure, you can't do it the nice way. At that point, just mark the osd down and out. If your cluster was healthy before this event, you shouldn't have any data problems. If the cluster wasn't HEALTH_OK before the event, you will likely have some problems. Is your cluster HEALTH_OK now? If not, can you give me the following
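For reference, the sequence I mean looks roughly like this, with osd.0 as a stand-in id (the service commands vary by distro and init system, so adjust as needed):

    ceph osd out 0
    # wait for recovery to finish before touching the daemon:
    ceph -s                        # repeat until all PGs are active+clean again
    stop ceph-osd id=0             # or: service ceph stop osd.0
    ceph osd crush remove osd.0
    ceph auth del osd.0
    ceph osd rm 0

That keeps 3 copies of everything until the very end.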
Re: [ceph-users] Have 2 different public networks
On Fri, Dec 19, 2014 at 4:03 PM, Francois Lafont flafdiv...@free.fr wrote: Le 19/12/2014 19:17, Craig Lewis a écrit : I'm not using mon addr lines, and my ceph-mon daemons are bound to 0.0.0.0:*. And do you have several IP addresses on your server? Can you contact the *same* monitor process with different IP addresses? For instance: telnet -e ']' ip_addr1 6789 telnet -e ']' ip_addr2 6789 Oh. The second one fails, even though ceph-mon is bound to 0.0.0.0. I guess that's not going to work. Looking again... I'm an idiot. I was looking at the wrong column in netstat. The daemon is bound to a single IP.netstat | grep, with no column headers bites me again. I apologize for that wild goose chase. Please, could you post your ceph.conf here (or just lines about monitors)? Probably doesn't help now, but: [global] fsid = snip mon initial members = ceph0c, ceph1c mon host = 10.193.0.6:6789, 10.193.0.7:6789 auth client required = none auth cluster required = none auth service required = none cluster network = 10.194.0.0/16 mon warn on legacy crush tunables = false public network = 10.193.0.0/16 There is something that I don't understand. Personally I don't use ceph-deploy and I use manual deployment (because I want to make a Puppet deployment of my labs for Ubuntu Trusty with Ceph Firefly) and I'm using Chef, which is also more like a manual deployment than ceph-deploy. when I create my cluster with the first monitor, I have to generate a monitor map with this command: monmaptool --create --add {hostname} {ip-address} --fsid {uuid} /tmp/monmap And I have to provide an IP address, so it seems logical to me that a monitor is bound to only one IP address. I don't see the Chef rule doing anything like that though. Can you post your monitors map? ceph mon getmap -o /tmp/monmap monmaptool --print /tmp/monmap monmaptool: monmap file /tmp/monmap epoch 2 fsid 1604ec7a-6ceb-42fc-8c68-0a7896c4e120 last_changed 2013-11-22 17:34:21.462685 created 0.00 0: 10.193.0.6:6789/0 mon.ceph0c 1: 10.193.0.7:6789/0 mon.ceph1c If it's not a traffic volume problem, can you allow the 10.0.2.0/24 network to route to the 10.0.1.0/24 network, and open the firewall enough? There should be enough info in the network config to get the firewall working: http://docs.ceph.com/docs/next/rados/configuration/network-config-ref/ Yes indeed, It could be enough. But I find it a shame to do this workaround because I'm not able to have monitors bound to several IP addresses. ;) Looks like you'll have to go this route. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
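For the record, the check I should have done, with the columns visible (either command works; the bind address shows up under Local Address):

    netstat -lntp | grep ceph-mon
    ss -lntp | grep ceph-mon
    # a mon bound to a single address shows e.g. 10.193.0.6:6789, not 0.0.0.0:6789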
Re: [ceph-users] Have 2 different public networks
On Fri, Dec 19, 2014 at 6:19 PM, Francois Lafont flafdiv...@free.fr wrote: So, indeed, I have to use routing *or* maybe create 2 monitors by server like this: [mon.node1-public1] host = ceph-node1 mon addr = 10.0.1.1 [mon.node1-public2] host = ceph-node1 mon addr = 10.0.2.1 # etc... But, in this case, the working directories of mon.node1-public1 and mon.node1-public2 will be in the same disk (I have no choice). Is it a problem? Are monitors big consumers of I/O disk? Interesting idea. While you will have an even number of monitors, you'll still have an odd number of failure domains. I'm not sure if it'll work though... make sure you test having the leader on both networks. It might cause problems if the leader is on the 10.0.1.0/24 network? Monitors can be big consumers of disk IO, if there is a lot of cluster activity. Monitors record all of the cluster changes in LevelDB, and send copies to all of the daemons. There have been posts to the ML about people running out of Disk IOps on the monitors, and the problems it causes. The bigger the cluster, the more IOps. As long as you monitor and alert on your monitor disk IOps, I don't think it would be a problem. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
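If you try it, watching the IOps on whatever disk holds /var/lib/ceph/mon is a good enough first pass (sda below is just a placeholder for that device):

    iostat -x 5 sda
    # keep an eye on w/s and %util, especially during recovery and heavy map churn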
Re: [ceph-users] Need help from Ceph experts
On Thu, Dec 18, 2014 at 5:16 AM, Patrick McGarry patr...@inktank.com wrote: 2. What should be the minimum hardware requirement of the server (CPU, Memory, NIC etc) There is no real minimum to run Ceph, it's all about what your workload will look like and what kind of performance you need. We have seen Ceph run on Raspberry Pis. Technically, the smallest cluster is a single node with a 10 GiB disk. Anything smaller won't work. That said, Ceph was envisioned to run on large clusters. IIRC, the reference architecture has 7 rows, each row having 10 racks, all full. Those of us running small clusters (less than 10 nodes) are noticing that it doesn't work quite as well. We have to significantly scale back the amount of backfilling and recovery that is allowed. I try to keep all backfill/recovery operations touching less than 20% of my OSDs. In the reference architecture, it could lose a whole row, and still keep under that limit. My 5-node cluster is noticeably better than the 3-node cluster. It's faster, has lower latency, and latency doesn't increase as much during recovery operations. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Need help from Ceph experts
I think this is it: https://engage.redhat.com/inktank-ceph-reference-architecture-s-201409080939 You can also check out a presentation on Cern's Ceph cluster: http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern At large scale, the biggest problem will likely be network I/O on the inter-switch links. On Thu, Dec 18, 2014 at 3:29 PM, Robert LeBlanc rob...@leblancnet.us wrote: I'm interested to know if there is a reference to this reference architecture. It would help alleviate some of the fears we have about scaling this thing to a massive scale (10,000's OSDs). Thanks, Robert LeBlanc On Thu, Dec 18, 2014 at 3:43 PM, Craig Lewis cle...@centraldesktop.com wrote: On Thu, Dec 18, 2014 at 5:16 AM, Patrick McGarry patr...@inktank.com wrote: 2. What should be the minimum hardware requirement of the server (CPU, Memory, NIC etc) There is no real minimum to run Ceph, it's all about what your workload will look like and what kind of performance you need. We have seen Ceph run on Raspberry Pis. Technically, the smallest cluster is a single node with a 10 GiB disk. Anything smaller won't work. That said, Ceph was envisioned to run on large clusters. IIRC, the reference architecture has 7 rows, each row having 10 racks, all full. Those of us running small clusters (less than 10 nodes) are noticing that it doesn't work quite as well. We have to significantly scale back the amount of backfilling and recovery that is allowed. I try to keep all backfill/recovery operations touching less than 20% of my OSDs. In the reference architecture, it could lose a whole row, and still keep under that limit. My 5 nodes cluster is noticeably better better than the 3 node cluster. It's faster, has lower latency, and latency doesn't increase as much during recovery operations. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Have 2 different public networks
The daemons bind to *, so adding the 3rd interface to the machine will allow you to talk to the daemons on that IP. I'm not really sure how you'd setup the management network though. I'd start by setting the ceph.conf public network on the management nodes to have the public network 10.0.2.0/24, and an /etc/hosts file with the monitor's names on the 10.0.2.0/24 network. Make sure the management nodes can't route to the 10.0.1.0/24 network, and see what happens. Do you really plan on having enough traffic creating and deleting RDB images that you need a dedicated network? It seems like setting up link aggregation on 10.0.1.0/24 would be simpler and less error prone. On Thu, Dec 18, 2014 at 4:19 PM, Francois Lafont flafdiv...@free.fr wrote: Hi, Is it possible to have 2 different public networks in a Ceph cluster? I explain my question below. Currently, I have 3 identical nodes in my Ceph cluster. Each node has: - only 1 monitor; - n osds (we don't care about the value n here); - and 3 interfaces. One interface for the cluster network (10.0.0.0/24): - node1 - 10.0.0.1 - node2 - 10.0.0.2 - node3 - 10.0.0.3 One interface for the public network (10.0.1.0/24): - node1 - [mon.1] mon addr = 10.0.1.1 - node2 - [mon.2] mon addr = 10.0.1.2 - node3 - [mon.3] mon addr = 10.0.1.3 And one interface not used yet (see below). With this configuration, if I have a Ceph client in the public network, I can use rbd images etc. No problem, it works. But now I would like to use the third interface of the nodes for a *different* plublic network - 10.0.2.0/24. The Ceph clients in this network will not really use the storage but will create and delete rbd images in a pool. In fact it's just a network for *Ceph management*. So, I want to have 2 different public networks: - 10.0.1.0/24 (already exists) - *and* 10.0.2.0/24 Am I wrong if I say that mon.1, mon.2 and mon.3 must have one more IP address? Is it possible to have a monitor that listens on 2 addresses? Something like that: - node1 - [mon.1] mon addr = 10.0.1.1 *and* 10.0.2.1 - node2 - [mon.2] mon addr = 10.0.1.2 *and* 10.0.2.2 - node3 - [mon.3] mon addr = 10.0.1.3 *and* 10.0.2.3 My environment is not a production environment, just a lab. So, if necessary I can reinstall everything, no problem. Thanks for your help. -- François Lafont ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
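Concretely, I was picturing something like this on the management nodes only (addresses from your example, hostnames are placeholders; it's a sketch, not something I've tested):

    # /etc/ceph/ceph.conf on the management nodes
    [global]
        public network = 10.0.2.0/24
        mon host = node1, node2, node3

    # /etc/hosts on the management nodes
    10.0.2.1    node1
    10.0.2.2    node2
    10.0.2.3    node3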
Re: [ceph-users] Dual RADOSGW Network
You may need split horizon DNS. The internal machines' DNS should resolve to the internal IP, and the external machines' DNS should resolve to the external IP. There are various ways to do that. The RadosGW config has an example of setting up Dnsmasq: http://ceph.com/docs/master/radosgw/config/#enabling-subdomain-s3-calls On Tue, Dec 16, 2014 at 3:05 AM, Georgios Dimitrakakis gior...@acmac.uoc.gr wrote: Thanks Craig. I will try that! I thought it was more complicate than that because of the entries for the public_network and rgw dns name in the config file... I will give it a try. Best, George That shouldnt be a problem. Just have Apache bind to all interfaces instead of the external IP. In my case, I only have Apache bound to the internal interface. My load balancer has an external and internal IP, and Im able to talk to it on both interfaces. On Mon, Dec 15, 2014 at 2:00 PM, Georgios Dimitrakakis wrote: Hi all! I have a single CEPH node which has two network interfaces. One is configured to be accessed directly by the internet (153.*) and the other one is configured on an internal LAN (192.*) For the moment radosgw is listening on the external (internet) interface. Can I configure radosgw to be accessed by both interfaces? What I would like to do is to save bandwidth and time for the machines on the internal network and use the internal net for all rados communications. Any ideas? Best regards, George ___ ceph-users mailing list ceph-users@lists.ceph.com [1] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [2] Links: -- [1] mailto:ceph-users@lists.ceph.com [2] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [3] mailto:gior...@acmac.uoc.gr ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
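A minimal Dnsmasq sketch of the split view, with placeholder names and addresses (the address= line also matches bucket subdomains, which is what the S3-style calls need):

    # dnsmasq on the internal network
    address=/rgw.example.com/192.168.1.10
    # the public DNS keeps rgw.example.com pointed at the 153.* address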
Re: [ceph-users] Test 6
I always wondered why my posts didn't show up until somebody replied to them. I thought it was my filters. Thanks! On Mon, Dec 15, 2014 at 10:57 PM, Leen de Braal l...@braha.nl wrote: If you are trying to see if your mails come through, don't check on the list. You have a gmail account, gmail removes mails that you have sent yourself. You can check the archives to see. And your mails did come on the list. -- Lindsay ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- L. de Braal BraHa Systems NL - Terneuzen T +31 115 649333 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] OSD Crash makes whole cluster unusable ?
So the problem started once remapping+backfilling started, and lasted until the cluster was healthy again? Have you adjusted any of the recovery tunables? Are you using SSD journals? I had a similar experience the first time my OSDs started backfilling. The average RadosGW operation latency went from 0.1 seconds to 10 seconds, which is longer than the default HAProxy timeout. Fun times. Since then, I've increased HAProxy's timeouts, de-prioritized Ceph's recovery, and I added SSD journals. The relevant sections of ceph.conf are: [global] mon osd down out interval = 900 mon osd min down reporters = 9 mon osd min down reports = 12 mon warn on legacy crush tunables = false osd pool default flag hashpspool = true [osd] osd max backfills = 3 osd recovery max active = 3 osd recovery op priority = 1 osd scrub sleep = 1.0 osd snap trim sleep = 1.0 Before the SSD journals, I had osd_max_backfills and osd_recovery_max_active set to 1. I watched my latency graphs, and used ceph tell osd.\* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 to tweak the values until the latency was acceptable. On Tue, Dec 16, 2014 at 5:37 AM, Christoph Adomeit christoph.adom...@gatworks.de wrote: Hi there, today I had an osd crash with ceph 0.87/giant which made my hole cluster unusable for 45 Minutes. First it began with a disk error: sd 0:1:2:0: [sdc] CDB: Read(10)Read(10):: 28 28 00 00 0d 15 fe d0 fd 7b e8 f8 00 00 00 00 b0 08 00 00 XFS (sdc1): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5. Then most other osds found out that my osd.3 is down: 2014-12-16 08:45:15.873478 mon.0 10.67.1.11:6789/0 3361077 : cluster [INF] osd.3 10.67.1.11:6810/713621 failed (42 reports from 35 peers after 23.642482 = grace 23.348982) 5 minutes later the osd is marked as out: 2014-12-16 08:50:21.095903 mon.0 10.67.1.11:6789/0 3361367 : cluster [INF] osd.3 out (down for 304.581079) However, since 8:45 until 9:20 I have 1000 slow requests and 107 incomplete pgs. Many requests are not answered: 2014-12-16 08:46:03.029094 mon.0 10.67.1.11:6789/0 3361126 : cluster [INF] pgmap v6930583: 4224 pgs: 4117 active+clean, 107 incomplete; 7647 GB data, 19090 GB used, 67952 GB / 87042 GB avail; 2307 kB/s rd, 2293 kB/s wr, 407 op/s Also a recovery to another osd was not starting Seems the osd thinks it is still up and all other osds think this osd is down ? I found this in the log of osd3: ceph-osd.3.log:2014-12-16 08:45:19.319152 7faf81296700 0 log_channel(default) log [WRN] : map e61177 wrongly marked me down ceph-osd.3.log: -440 2014-12-16 08:45:19.319152 7faf81296700 0 log_channel(default) log [WRN] : map e61177 wrongly marked me down Luckily I was able to restart osd3 and everything was working again but I do not understand what has happened. The cluster ways simply not usable for 45 Minutes. Any ideas Thanks Christoph ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
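For completeness, the injectargs invocation with the quoting intact (these are the values I settled on; inject first, and only persist them in ceph.conf once you're happy, since injected values don't survive an OSD restart):

    ceph tell osd.\* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'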
Re: [ceph-users] my cluster has only rbd pool
If you're running Ceph 0.88 or newer, only the rbd pool is created by default now. Greg Farnum mentioned that the docs are out of date there. On Sat, Dec 13, 2014 at 8:25 PM, wang lin linw...@hotmail.com wrote: Hi All I set up my first ceph cluster according to instructions in http://ceph.com/docs/master/start/quick-ceph-deploy/#storing-retrieving-object-data but I got this error error opening pool data: (2) No such file or directory when using command rados put hello_obj hello --pool=data. I typed the command ceph osd lspools, the result only show 0 rbd,, no other pools. Did I missing anything? Could anyone give me some advise? Thanks Lin, Wangf ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
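If you want to follow the old quick-start as written, just create the pool it expects first (the PG count below is only a placeholder for a tiny test cluster):

    ceph osd pool create data 64 64
    rados put hello_obj hello --pool=data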
Re: [ceph-users] Dual RADOSGW Network
That shouldn't be a problem. Just have Apache bind to all interfaces instead of the external IP. In my case, I only have Apache bound to the internal interface. My load balancer has an external and internal IP, and I'm able to talk to it on both interfaces. On Mon, Dec 15, 2014 at 2:00 PM, Georgios Dimitrakakis gior...@acmac.uoc.gr wrote: Hi all! I have a single CEPH node which has two network interfaces. One is configured to be accessed directly by the internet (153.*) and the other one is configured on an internal LAN (192.*) For the moment radosgw is listening on the external (internet) interface. Can I configure radosgw to be accessed by both interfaces? What I would like to do is to save bandwidth and time for the machines on the internal network and use the internal net for all rados communications. Any ideas? Best regards, George ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
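In Apache terms that's just the Listen directive, e.g. (the address below is a placeholder for your external IP):

    # bind to every interface
    Listen 80
    # versus binding only to the external address
    Listen 203.0.113.10:80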
Re: [ceph-users] Number of SSD for OSD journal
I was going with a low perf scenario, and I still ended up adding SSDs. Everything was fine in my 3 node cluster, until I wanted to add more nodes. Admittedly, I was a bit aggressive with the expansion. I added a whole node at once, rather than one or two disks at a time. Still, I wasn't expecting the average RadosGW latency to go from 0.1 seconds to 10 seconds. With the SSDs, I can do the same thing, and latency only goes up to 1 second. I'll be adding the Intel DC S3700's to all my nodes. On Mon, Dec 15, 2014 at 12:45 PM, Sebastien Han sebastien@enovance.com wrote: If you’re going with a low perf scenario I don’t think you should bother buying SSD, just remove them from the picture and do 12 SATA 7.2K 4TB. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Multiple issues :( Ubuntu 14.04, latest Ceph
On Sun, Dec 14, 2014 at 6:31 PM, Benjamin zor...@gmail.com wrote: The machines each have Ubuntu 14.04 64-bit, with 1GB of RAM and 8GB of disk. They have between 10% and 30% disk utilization but common between all of them is that they *have free disk space* meaning I have no idea what the heck is causing Ceph to complain. Each OSD is 8GB? You need to make them at least 10 GB. Ceph weights each disk as its size in TiB, and it truncates to two decimal places. So your 8 GiB disks have a weight of 0.00. Bump it up to 10 GiB, and it'll get a weight of 0.01. You should have 3 OSDs, one for each of ceph0, ceph1, ceph2. If that doesn't fix the problem, go ahead and post the things Udo mentioned. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
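If resizing the virtual disks is a pain, you can also just override the weights by hand (assuming osd ids 0-2):

    ceph osd crush reweight osd.0 0.01
    ceph osd crush reweight osd.1 0.01
    ceph osd crush reweight osd.2 0.01
    ceph osd tree    # confirm nothing is weighted 0 anymore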
Re: [ceph-users] active+degraded on an empty new cluster
When I first created a test cluster, I used 1 GiB disks. That causes problems. Ceph has a CRUSH weight. By default, the weight is the size of the disk in TiB, truncated to 2 decimal places. ie, any disk smaller than 10 GiB will have a weight of 0.00. I increased all of my virtual disks to 10 GiB. After rebooting the nodes (to see the changes), everything healed. On Tue, Dec 9, 2014 at 9:45 AM, Gregory Farnum g...@gregs42.com wrote: It looks like your OSDs all have weight zero for some reason. I'd fix that. :) -Greg On Tue, Dec 9, 2014 at 6:24 AM Giuseppe Civitella giuseppe.civite...@gmail.com wrote: Hi, thanks for the quick answer. I did try the force_create_pg on a pg but is stuck on creating: root@ceph-mon1:/home/ceph# ceph pg dump |grep creating dumped all in format plain 2.2f0 0 0 0 0 0 0 creating 2014-12-09 13:11:37.384808 0'0 0:0 [] -1 [] -1 0'0 0.000'0 0.00 root@ceph-mon1:/home/ceph# ceph pg 2.2f query { state: active+degraded, epoch: 105, up: [ 0], acting: [ 0], actingbackfill: [ 0], info: { pgid: 2.2f, last_update: 0'0, last_complete: 0'0, log_tail: 0'0, last_user_version: 0, last_backfill: MAX, purged_snaps: [], last_scrub: 0'0, last_scrub_stamp: 2014-12-06 14:15:11.499769, last_deep_scrub: 0'0, last_deep_scrub_stamp: 2014-12-06 14:15:11.499769, last_clean_scrub_stamp: 0.00, log_size: 0, ondisk_log_size: 0, stats_invalid: 0, stat_sum: { num_bytes: 0, num_objects: 0, num_object_clones: 0, num_object_copies: 0, num_objects_missing_on_primary: 0, num_objects_degraded: 0, num_objects_unfound: 0, num_objects_dirty: 0, num_whiteouts: 0, num_read: 0, num_read_kb: 0, num_write: 0, num_write_kb: 0, num_scrub_errors: 0, num_shallow_scrub_errors: 0, num_deep_scrub_errors: 0, num_objects_recovered: 0, num_bytes_recovered: 0, num_keys_recovered: 0, num_objects_omap: 0, num_objects_hit_set_archive: 0}, stat_cat_sum: {}, up: [ 0], acting: [ 0], up_primary: 0, acting_primary: 0}, empty: 1, dne: 0, incomplete: 0, last_epoch_started: 104, hit_set_history: { current_last_update: 0'0, current_last_stamp: 0.00, current_info: { begin: 0.00, end: 0.00, version: 0'0}, history: []}}, peer_info: [], recovery_state: [ { name: Started\/Primary\/Active, enter_time: 2014-12-09 12:12:52.760384, might_have_unfound: [], recovery_progress: { backfill_targets: [], waiting_on_backfill: [], last_backfill_started: 0\/\/0\/\/-1, backfill_info: { begin: 0\/\/0\/\/-1, end: 0\/\/0\/\/-1, objects: []}, peer_backfill_info: [], backfills_in_flight: [], recovering: [], pg_backend: { pull_from_peer: [], pushing: []}}, scrub: { scrubber.epoch_start: 0, scrubber.active: 0, scrubber.block_writes: 0, scrubber.finalizing: 0, scrubber.waiting_on: 0, scrubber.waiting_on_whom: []}}, { name: Started, enter_time: 2014-12-09 12:12:51.845686}], agent_state: {}}root@ceph-mon1:/home/ceph# 2014-12-09 13:01 GMT+01:00 Irek Fasikhov malm...@gmail.com: Hi. http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/ ceph pg force_create_pg pgid 2014-12-09 14:50 GMT+03:00 Giuseppe Civitella giuseppe.civite...@gmail.com: Hi all, last week I installed a new ceph cluster on 3 vm running Ubuntu 14.04 with default kernel. There is a ceph monitor a two osd hosts. 
Here are some details: ceph -s cluster c46d5b02-dab1-40bf-8a3d-f8e4a77b79da health HEALTH_WARN 192 pgs degraded; 192 pgs stuck unclean monmap e1: 1 mons at {ceph-mon1=10.1.1.83:6789/0}, election epoch 1, quorum 0 ceph-mon1 osdmap e83: 6 osds: 6 up, 6 in pgmap v231: 192 pgs, 3 pools, 0 bytes data, 0 objects 207 MB used, 30446 MB / 30653 MB avail 192 active+degraded root@ceph-mon1:/home/ceph# ceph osd dump epoch 99 fsid c46d5b02-dab1-40bf-8a3d-f8e4a77b79da created 2014-12-06 13:15:06.418843 modified 2014-12-09 11:38:04.353279 flags pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0
Re: [ceph-users] Scrub while cluster re-balancing
What's the output of ceph osd dump | grep ^pool ? On Tue, Dec 2, 2014 at 10:44 PM, Mallikarjun Biradar mallikarjuna.bira...@gmail.com wrote: Hi Craig, but, my concern is why ceph status is not reporting for pool 2 (testPool2 in this case). Whether its not performing scrub or its ceph status report issue? Though I have enough of objects in testPool2, scrub is not reporting active+clean+scrubbing in ceph -s. ems@rack6-ramp-4:~$ sudo ceph osd lspools 0 rbd,1 testPool,2 testPool2, ems@rack6-ramp-4:~$ ems@rack6-ramp-4:~$ sudo rados df pool name category KB objects clones degraded unfound rdrd KB wrwr KB rbd - 000 0 00000 testPool- 5948025217 14521740 0 0141056332 22948324301141070117 22950524809 testPool2 - 45039617109990 0 0 11238999 44955958 11259655 45038593 total used 18004641796 1463173 total avail32330689516 total space50335331312 ems@rack6-ramp-4:~$ -Thanks regards, Mallikarjun Biradar On Wed, Dec 3, 2014 at 1:20 AM, Craig Lewis cle...@centraldesktop.com wrote: ceph osd dump | grep ^pool will map pool names to numbers. PGs are named after the pool; PG 2.xx belongs to pool 2. rados df will tell you have many items and data are in a pool. On Tue, Dec 2, 2014 at 10:53 AM, Mallikarjun Biradar mallikarjuna.bira...@gmail.com wrote: Hi Craig, ceph -s is not showing any PG's in pool2. I have 3 pools. rbd and two pools that i created testPool and testPool2. I have more than 10TB of data in testPool1 and good amount of data in testPool2 as well. Iam not using rbd pool. -Thanks regards, Mallikarjun Biradar On 3 Dec 2014 00:15, Craig Lewis cle...@centraldesktop.com wrote: You mean `ceph -w` and `ceph -s` didn't show any PGs in the active+clean+scrubbing state while pool 2's PGs were being scrubbed? I see that happen with my really small pools. I have a bunch of RadosGW pools that contain 5 objects, and ~1kB of data. When I scrub the PGs in those pools, they complete so fast that they never show up in `ceph -w`. Since you have pools 0, 1, and 2, I assume those are the default 'data', 'metadata', and 'rdb'. If you're not using RDB, then the rdb pool will be very small. On Tue, Dec 2, 2014 at 5:32 AM, Mallikarjun Biradar mallikarjuna.bira...@gmail.com wrote: Hi all, I was running scrub while cluster is in re-balancing state. From the osd logs.. 
2014-12-02 18:50:26.934802 7fcc6b614700 0 log_channel(default) log [INF] : 0.3 scrub ok 2014-12-02 18:50:27.890785 7fcc6b614700 0 log_channel(default) log [INF] : 0.24 scrub ok 2014-12-02 18:50:31.902978 7fcc6b614700 0 log_channel(default) log [INF] : 0.25 scrub ok 2014-12-02 18:50:33.088060 7fcc6b614700 0 log_channel(default) log [INF] : 0.33 scrub ok 2014-12-02 18:50:50.828893 7fcc6b614700 0 log_channel(default) log [INF] : 1.61 scrub ok 2014-12-02 18:51:06.774648 7fcc6b614700 0 log_channel(default) log [INF] : 1.68 scrub ok 2014-12-02 18:51:20.463283 7fcc6b614700 0 log_channel(default) log [INF] : 1.80 scrub ok 2014-12-02 18:51:39.883295 7fcc6b614700 0 log_channel(default) log [INF] : 1.89 scrub ok 2014-12-02 18:52:00.568808 7fcc6b614700 0 log_channel(default) log [INF] : 1.9f scrub ok 2014-12-02 18:52:15.897191 7fcc6b614700 0 log_channel(default) log [INF] : 1.a3 scrub ok 2014-12-02 18:52:34.681874 7fcc6b614700 0 log_channel(default) log [INF] : 1.aa scrub ok 2014-12-02 18:52:47.833630 7fcc6b614700 0 log_channel(default) log [INF] : 1.b1 scrub ok 2014-12-02 18:53:09.312792 7fcc6b614700 0 log_channel(default) log [INF] : 1.b3 scrub ok 2014-12-02 18:53:25.324635 7fcc6b614700 0 log_channel(default) log [INF] : 1.bd scrub ok 2014-12-02 18:53:48.638475 7fcc6b614700 0 log_channel(default) log [INF] : 1.c3 scrub ok 2014-12-02 18:54:02.996972 7fcc6b614700 0 log_channel(default) log [INF] : 1.d7 scrub ok 2014-12-02 18:54:19.660038 7fcc6b614700 0 log_channel(default) log [INF] : 1.d8 scrub ok 2014-12-02 18:54:32.780646 7fcc6b614700 0 log_channel(default) log [INF] : 1.fa scrub ok 2014-12-02 18:54:36.772931 7fcc6b614700 0 log_channel(default) log [INF] : 2.4 scrub ok 2014-12-02 18:54:41.758487 7fcc6b614700 0 log_channel(default) log [INF] : 2.9 scrub ok 2014-12-02 18:54:46.910043 7fcc6b614700 0 log_channel(default) log [INF] : 2.a scrub ok 2014-12-02 18:54:51.908335 7fcc6b614700 0 log_channel(default) log [INF] : 2.16 scrub ok 2014-12-02 18:54:54.940807 7fcc6b614700 0 log_channel(default) log [INF] : 2.19 scrub ok 2014-12-02 18:55:00.956170 7fcc6b614700 0 log_channel(default) log [INF] : 2.44 scrub ok 2014-12-02 18:55:01.948455 7fcc6b614700 0 log_channel(default
Re: [ceph-users] Slow Requests when taking down OSD Node
I've found that it helps to shut down the osds before shutting down the host. Especially if the node is also a monitor. It seems that some OSD shutdown messages get lost while monitors are holding elections. On Tue, Dec 2, 2014 at 10:10 AM, Christoph Adomeit christoph.adom...@gatworks.de wrote: Hi there, I have a giant cluster with 60 OSDs on 6 OSD Hosts. Now I want to do maintenance on one of the OSD Hosts. The documented Procedure is to ceph osd set noout and then shutdown the OSD Node for maintenance. However, as soon as I even shut down 1 OSD I get around 200 slow requests and the number of slow requests is growing for minutes. The test was done at night with low IOPS and I was expecting the cluster to handle this condition much better. Is there some way of a more graceful shutdown of OSDs so that I can prevent those slow requests ? I suppose it takes some time until monitor gets notified that an OSD was shutdown. Thanks Christoph ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
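For a planned node shutdown, something along these lines has been gentler for me (the osd ids and service commands are placeholders; adjust for your init system):

    ceph osd set noout
    for id in 51 52 53; do
        service ceph stop osd.$id     # or: stop ceph-osd id=$id
        sleep 10                      # give the monitors time to register each shutdown
    done
    # ...do the maintenance, bring the OSDs back up, then:
    ceph osd unset noout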
Re: [ceph-users] Scrub while cluster re-balancing
You mean `ceph -w` and `ceph -s` didn't show any PGs in the active+clean+scrubbing state while pool 2's PGs were being scrubbed? I see that happen with my really small pools. I have a bunch of RadosGW pools that contain 5 objects, and ~1kB of data. When I scrub the PGs in those pools, they complete so fast that they never show up in `ceph -w`. Since you have pools 0, 1, and 2, I assume those are the default 'data', 'metadata', and 'rdb'. If you're not using RDB, then the rdb pool will be very small. On Tue, Dec 2, 2014 at 5:32 AM, Mallikarjun Biradar mallikarjuna.bira...@gmail.com wrote: Hi all, I was running scrub while cluster is in re-balancing state. From the osd logs.. 2014-12-02 18:50:26.934802 7fcc6b614700 0 log_channel(default) log [INF] : 0.3 scrub ok 2014-12-02 18:50:27.890785 7fcc6b614700 0 log_channel(default) log [INF] : 0.24 scrub ok 2014-12-02 18:50:31.902978 7fcc6b614700 0 log_channel(default) log [INF] : 0.25 scrub ok 2014-12-02 18:50:33.088060 7fcc6b614700 0 log_channel(default) log [INF] : 0.33 scrub ok 2014-12-02 18:50:50.828893 7fcc6b614700 0 log_channel(default) log [INF] : 1.61 scrub ok 2014-12-02 18:51:06.774648 7fcc6b614700 0 log_channel(default) log [INF] : 1.68 scrub ok 2014-12-02 18:51:20.463283 7fcc6b614700 0 log_channel(default) log [INF] : 1.80 scrub ok 2014-12-02 18:51:39.883295 7fcc6b614700 0 log_channel(default) log [INF] : 1.89 scrub ok 2014-12-02 18:52:00.568808 7fcc6b614700 0 log_channel(default) log [INF] : 1.9f scrub ok 2014-12-02 18:52:15.897191 7fcc6b614700 0 log_channel(default) log [INF] : 1.a3 scrub ok 2014-12-02 18:52:34.681874 7fcc6b614700 0 log_channel(default) log [INF] : 1.aa scrub ok 2014-12-02 18:52:47.833630 7fcc6b614700 0 log_channel(default) log [INF] : 1.b1 scrub ok 2014-12-02 18:53:09.312792 7fcc6b614700 0 log_channel(default) log [INF] : 1.b3 scrub ok 2014-12-02 18:53:25.324635 7fcc6b614700 0 log_channel(default) log [INF] : 1.bd scrub ok 2014-12-02 18:53:48.638475 7fcc6b614700 0 log_channel(default) log [INF] : 1.c3 scrub ok 2014-12-02 18:54:02.996972 7fcc6b614700 0 log_channel(default) log [INF] : 1.d7 scrub ok 2014-12-02 18:54:19.660038 7fcc6b614700 0 log_channel(default) log [INF] : 1.d8 scrub ok 2014-12-02 18:54:32.780646 7fcc6b614700 0 log_channel(default) log [INF] : 1.fa scrub ok 2014-12-02 18:54:36.772931 7fcc6b614700 0 log_channel(default) log [INF] : 2.4 scrub ok 2014-12-02 18:54:41.758487 7fcc6b614700 0 log_channel(default) log [INF] : 2.9 scrub ok 2014-12-02 18:54:46.910043 7fcc6b614700 0 log_channel(default) log [INF] : 2.a scrub ok 2014-12-02 18:54:51.908335 7fcc6b614700 0 log_channel(default) log [INF] : 2.16 scrub ok 2014-12-02 18:54:54.940807 7fcc6b614700 0 log_channel(default) log [INF] : 2.19 scrub ok 2014-12-02 18:55:00.956170 7fcc6b614700 0 log_channel(default) log [INF] : 2.44 scrub ok 2014-12-02 18:55:01.948455 7fcc6b614700 0 log_channel(default) log [INF] : 2.4f scrub ok 2014-12-02 18:55:07.273587 7fcc6b614700 0 log_channel(default) log [INF] : 2.76 scrub ok 2014-12-02 18:55:10.641274 7fcc6b614700 0 log_channel(default) log [INF] : 2.9e scrub ok 2014-12-02 18:55:11.621669 7fcc6b614700 0 log_channel(default) log [INF] : 2.ab scrub ok 2014-12-02 18:55:18.261900 7fcc6b614700 0 log_channel(default) log [INF] : 2.b0 scrub ok 2014-12-02 18:55:19.560766 7fcc6b614700 0 log_channel(default) log [INF] : 2.b1 scrub ok 2014-12-02 18:55:20.501591 7fcc6b614700 0 log_channel(default) log [INF] : 2.bb scrub ok 2014-12-02 18:55:21.523936 7fcc6b614700 0 log_channel(default) log [INF] : 2.cd scrub ok Interestingly, for pg's 2.x 
(2.4, 2.9 etc)in logs here, cluster status was not reporting scrubbing, whereas for 0.x 1.x it was reporting as scrubbing in cluster status. In case of scrub operation on PG's (2.x) is really scrubbing performed OR cluster status is missing to report them? -Thanks Regards, Mallikarjun Biradar ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
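One way to settle it is to kick off a scrub on one of those PGs yourself and watch from both sides (2.4 picked from your log):

    ceph pg scrub 2.4
    ceph -w | grep scrub      # in a second terminal
    # the primary OSD's log should print a "2.4 scrub ok" line shortly afterwards

If the OSD logs it but ceph -w never shows the PG in active+clean+scrubbing, the scrub is real and just finishing too quickly to be reported.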
Re: [ceph-users] Removing Snapshots Killing Cluster Performance
On Mon, Dec 1, 2014 at 1:51 AM, Daniel Schneller daniel.schnel...@centerdevice.com wrote: I could not find any way to throttle the background deletion activity (the command returns almost immediately). I'm only aware of osd snap trim sleep. I haven't tried this since my Firefly upgrade though. I have tested out osd scrub sleep under a heavy deep-scrub load, and found that I needed a value of 1.0, which is much higher than the recommended starting point of 0.005. I'll revisit this when #9487 gets backported (Thanks Dan Van Der Ster!). I used ceph tell osd.\* injectargs, and watched my IO graphs. Start with 0.005, and multiply by 10 until you see a change. It took 10-60 seconds to see a change after injecting the args. While this is a big issue in itself for us, we would at least try to estimate how long the process will take per snapshot / per pool. I assume the time needed is a function of the number of objects that were modified between two snapshots. That matches my experiences as well. Big snapshots take longer, and are much more likely to cause a cluster outage than small snapshots. 1) Is there any way to control how much such an operation will tax the cluster (we would be happy to have it run longer, if that meant not utilizing all disks fully during that time)? On Firefly, osd snap trim sleep, and playing with the CFQ scheduler are your only options. They're not great options. If you can upgrade to Giant, the snap trim sleep should solve your problem. There is some work being done in Hammer: https://wiki.ceph.com/Planning/Blueprints/Hammer/osd%3A_Scrub%2F%2FSnapTrim_IO_prioritization For the time being, I'm letting my snapshots accumulate. I can't recover anything without the database backups, and those are deleted on time, so I can say with a straight face that their data is deleted. I'll collect the garbage later. 3) Would SSD journals help here? Or any other hardware configuration change for that matter? Probably, but it's not going to fix it. I added SSD journals. It's better, but I still had downtime after trimming. I'm glad I added them though. The cluster overall is much healthier and more responsive. In particular, backfilling doesn't cause massive latency anymore. 4) Any other recommendations? We definitely need to remove the data, not because of a lack of space (at least not at the moment), but because when customers delete stuff / cancel accounts, we are obliged to remove their data at least after a reasonable amount of time. I know it's kind of snarky, but perhaps you can redefine reasonable until you have a chance to upgrade to Giant or Hammer? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
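For reference, the knob itself, applied the same way as the scrub sleep (the starting value is illustrative; persist it under [osd] as osd snap trim sleep once you find a value that works):

    ceph tell osd.\* injectargs '--osd_snap_trim_sleep 0.005'
    # then multiply by 10 while watching the IO graphs, same approach as with osd_scrub_sleep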
Re: [ceph-users] Slow Requests when taking down OSD Node
If you watch `ceph -w` while stopping the OSD, do you see 2014-12-02 11:45:17.715629 mon.0 [INF] osd.X marked itself down ? On Tue, Dec 2, 2014 at 11:06 AM, Christoph Adomeit christoph.adom...@gatworks.de wrote: Thanks Craig, but this is what I am doing. After setting ceph osd set noout I do a service ceph stop osd.51 and as soon as I do this I get growing numbers (200) of slow requests, although there is not a big load on my cluster. Christoph On Tue, Dec 02, 2014 at 10:40:13AM -0800, Craig Lewis wrote: I've found that it helps to shut down the osds before shutting down the host. Especially if the node is also a monitor. It seems that some OSD shutdown messages get lost while monitors are holding elections. On Tue, Dec 2, 2014 at 10:10 AM, Christoph Adomeit christoph.adom...@gatworks.de wrote: Hi there, I have a giant cluster with 60 OSDs on 6 OSD Hosts. Now I want to do maintenance on one of the OSD Hosts. The documented Procedure is to ceph osd set noout and then shutdown the OSD Node for maintenance. However, as soon as I even shut down 1 OSD I get around 200 slow requests and the number of slow requests is growing for minutes. The test was done at night with low IOPS and I was expecting the cluster to handle this condition much better. Is there some way of a more graceful shutdown of OSDs so that I can prevent those slow requests ? I suppose it takes some time until monitor gets notified that an OSD was shutdown. Thanks Christoph ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
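For example, with osd.51 from your message (start the watch before stopping the daemon):

    # terminal 1:
    ceph -w | grep 'marked itself down'
    # terminal 2:
    service ceph stop osd.51

If the "osd.51 marked itself down" line never appears, the shutdown message was lost, and the OSD stays nominally up until its peers report it down, which is when the slow requests pile up.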
Re: [ceph-users] Scrub while cluster re-balancing
ceph osd dump | grep ^pool will map pool names to numbers. PGs are named after the pool; PG 2.xx belongs to pool 2. rados df will tell you have many items and data are in a pool. On Tue, Dec 2, 2014 at 10:53 AM, Mallikarjun Biradar mallikarjuna.bira...@gmail.com wrote: Hi Craig, ceph -s is not showing any PG's in pool2. I have 3 pools. rbd and two pools that i created testPool and testPool2. I have more than 10TB of data in testPool1 and good amount of data in testPool2 as well. Iam not using rbd pool. -Thanks regards, Mallikarjun Biradar On 3 Dec 2014 00:15, Craig Lewis cle...@centraldesktop.com wrote: You mean `ceph -w` and `ceph -s` didn't show any PGs in the active+clean+scrubbing state while pool 2's PGs were being scrubbed? I see that happen with my really small pools. I have a bunch of RadosGW pools that contain 5 objects, and ~1kB of data. When I scrub the PGs in those pools, they complete so fast that they never show up in `ceph -w`. Since you have pools 0, 1, and 2, I assume those are the default 'data', 'metadata', and 'rdb'. If you're not using RDB, then the rdb pool will be very small. On Tue, Dec 2, 2014 at 5:32 AM, Mallikarjun Biradar mallikarjuna.bira...@gmail.com wrote: Hi all, I was running scrub while cluster is in re-balancing state. From the osd logs.. 2014-12-02 18:50:26.934802 7fcc6b614700 0 log_channel(default) log [INF] : 0.3 scrub ok 2014-12-02 18:50:27.890785 7fcc6b614700 0 log_channel(default) log [INF] : 0.24 scrub ok 2014-12-02 18:50:31.902978 7fcc6b614700 0 log_channel(default) log [INF] : 0.25 scrub ok 2014-12-02 18:50:33.088060 7fcc6b614700 0 log_channel(default) log [INF] : 0.33 scrub ok 2014-12-02 18:50:50.828893 7fcc6b614700 0 log_channel(default) log [INF] : 1.61 scrub ok 2014-12-02 18:51:06.774648 7fcc6b614700 0 log_channel(default) log [INF] : 1.68 scrub ok 2014-12-02 18:51:20.463283 7fcc6b614700 0 log_channel(default) log [INF] : 1.80 scrub ok 2014-12-02 18:51:39.883295 7fcc6b614700 0 log_channel(default) log [INF] : 1.89 scrub ok 2014-12-02 18:52:00.568808 7fcc6b614700 0 log_channel(default) log [INF] : 1.9f scrub ok 2014-12-02 18:52:15.897191 7fcc6b614700 0 log_channel(default) log [INF] : 1.a3 scrub ok 2014-12-02 18:52:34.681874 7fcc6b614700 0 log_channel(default) log [INF] : 1.aa scrub ok 2014-12-02 18:52:47.833630 7fcc6b614700 0 log_channel(default) log [INF] : 1.b1 scrub ok 2014-12-02 18:53:09.312792 7fcc6b614700 0 log_channel(default) log [INF] : 1.b3 scrub ok 2014-12-02 18:53:25.324635 7fcc6b614700 0 log_channel(default) log [INF] : 1.bd scrub ok 2014-12-02 18:53:48.638475 7fcc6b614700 0 log_channel(default) log [INF] : 1.c3 scrub ok 2014-12-02 18:54:02.996972 7fcc6b614700 0 log_channel(default) log [INF] : 1.d7 scrub ok 2014-12-02 18:54:19.660038 7fcc6b614700 0 log_channel(default) log [INF] : 1.d8 scrub ok 2014-12-02 18:54:32.780646 7fcc6b614700 0 log_channel(default) log [INF] : 1.fa scrub ok 2014-12-02 18:54:36.772931 7fcc6b614700 0 log_channel(default) log [INF] : 2.4 scrub ok 2014-12-02 18:54:41.758487 7fcc6b614700 0 log_channel(default) log [INF] : 2.9 scrub ok 2014-12-02 18:54:46.910043 7fcc6b614700 0 log_channel(default) log [INF] : 2.a scrub ok 2014-12-02 18:54:51.908335 7fcc6b614700 0 log_channel(default) log [INF] : 2.16 scrub ok 2014-12-02 18:54:54.940807 7fcc6b614700 0 log_channel(default) log [INF] : 2.19 scrub ok 2014-12-02 18:55:00.956170 7fcc6b614700 0 log_channel(default) log [INF] : 2.44 scrub ok 2014-12-02 18:55:01.948455 7fcc6b614700 0 log_channel(default) log [INF] : 2.4f scrub ok 2014-12-02 18:55:07.273587 7fcc6b614700 
0 log_channel(default) log [INF] : 2.76 scrub ok 2014-12-02 18:55:10.641274 7fcc6b614700 0 log_channel(default) log [INF] : 2.9e scrub ok 2014-12-02 18:55:11.621669 7fcc6b614700 0 log_channel(default) log [INF] : 2.ab scrub ok 2014-12-02 18:55:18.261900 7fcc6b614700 0 log_channel(default) log [INF] : 2.b0 scrub ok 2014-12-02 18:55:19.560766 7fcc6b614700 0 log_channel(default) log [INF] : 2.b1 scrub ok 2014-12-02 18:55:20.501591 7fcc6b614700 0 log_channel(default) log [INF] : 2.bb scrub ok 2014-12-02 18:55:21.523936 7fcc6b614700 0 log_channel(default) log [INF] : 2.cd scrub ok Interestingly, for pg's 2.x (2.4, 2.9 etc)in logs here, cluster status was not reporting scrubbing, whereas for 0.x 1.x it was reporting as scrubbing in cluster status. In case of scrub operation on PG's (2.x) is really scrubbing performed OR cluster status is missing to report them? -Thanks Regards, Mallikarjun Biradar ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Rebuild OSD's
You have a total of 2 OSDs, and 2 disks, right? The safe method is to mark one OSD out, and wait for the cluster to heal. Delete, reformat, add it back to the cluster, and wait for the cluster to heal. Repeat. But that only works when you have enough OSDs that the cluster can heal. So you'll have to go the less safe route, and hope you don't suffer a failure in the middle. I went this route, because it was taking too long to do the safe route: First, setup your ceph.conf with the new osd options. osd mkfs *, osd journal *, whatever you want the OSDs to look like when you're done. You may want to set osd max backfills to 1 before you start. The default value of 10 is really only a good idea if you have a large cluster and SSD journals. Remove the disk, format, and put it back in: - ceph osd set norecover - ceph osd set nobackfill - ceph osd out $OSDID - sleep 30 - stop ceph-osd id=$OSDID - ceph osd crush remove osd.$OSDID - ceph osd lost $OSDID --yes-i-really-mean-it - ceph auth del osd.$OSDID - ceph osd rm $OSDID - ceph-disk-prepare --zap $dev $journal# ceph-deploy would also work - ceph osd unset norecover - ceph osd unset nobackfill Wait for the cluster to heal, then repeat. It's more complicated if you have multiple devices in the zpool and you're using more than a small percentage of the disk space. On Sat, Nov 29, 2014 at 2:29 PM, Lindsay Mathieson lindsay.mathie...@gmail.com wrote: I have 2 OSD's on two nodes top of zfs that I'd like to rebuild in a more standard (xfs) setup. Would the following be a non destructive if somewhat tedious way of doing so? Following the instructions from here: http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual 1. Remove osd.0 2. Recreate osd.0 3. Add. osd.0 4. Wait for health to be restored i.e all data be copied from osd.1 to osd.0 5. Remove osd.1 6. Recreate osd.1 7. Add. osd.1 8. Wait for health to be restored i.e all data be copied from osd.0 to osd.1 9. Profit! There's 1TB of data total. I can do this after hours while the system network is not being used I do have complete backups in case it all goes pear shaped. thanks, -- Lindsay ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Optimal or recommended threads values
I'm still using the default values, mostly because I haven't had time to test. On Thu, Nov 27, 2014 at 2:44 AM, Andrei Mikhailovsky and...@arhont.com wrote: Hi Craig, Are you keeping the filestore, disk and op threads at their default values? or did you also change them? Cheers Tuning these values depends on a lot more than just the SSDs and HDDs. Which kernel and IO scheduler are you using? Does your HBA do write caching? It also depends on what your goals are. Tuning for a RadosGW cluster is different that for a RDB cluster. The short answer is that you are the only person that can can tell you what your optimal values are. As always, the best benchmark is production load. In my small cluster (5 nodes, 44 osds), I'm optimizing to minimize latency during recovery. When the cluster is healthy, bandwidth and latency are more than adequate for my needs. Even with journals on SSDs, I've found that reducing the number of operations and threads has reduced my average latency. I use injectargs to try out new values while I monitor cluster latency. I monitor latency while the cluster is healthy and recovering. If a change is deemed better, only then will I persist the change to ceph.conf. This gives me a fallback that any changes that causes massive problems can be undone with a restart or reboot. So far, the configs that I've written to ceph.conf are [global] mon osd down out interval = 900 mon osd min down reporters = 9 mon osd min down reports = 12 osd pool default flag hashpspool = true [osd] osd max backfills = 1 osd recovery max active = 1 osd recovery op priority = 1 I have it on my list to investigate filestore max sync interval. And now that I've pasted that, I need to revisit the min down reports/reporters. I have some nodes with 10 OSDs, and I don't want any one node able to mark the rest of the cluster as down (it happened once). On Sat, Nov 22, 2014 at 6:24 AM, Andrei Mikhailovsky and...@arhont.com wrote: Hello guys, Could some one comment on the optimal or recommended values of various threads values in ceph.conf? At the moment I have the following settings: filestore_op_threads = 8 osd_disk_threads = 8 osd_op_threads = 8 filestore_merge_threshold = 40 filestore_split_multiple = 8 Are these reasonable for a small cluster made of 7.2K SAS disks with ssd journals with a ratio of 4:1? What are the settings that other people are using? Thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Tip of the week: don't use Intel 530 SSD's for journals
I have suffered power losses in every data center I've been in. I have lost SSDs because of it (Intel 320 Series). The worst time, I lost both SSDs in a RAID1. That was a bad day. I'm using the Intel DC S3700 now, so I don't have a repeat. My cluster is small enough that losing a journal SSD would be a major headache. I'm manually monitoring wear level. So far all of my journals are still at 100% lifetime. I do have some of the Intel 320 that are down to 45% lifetime remaining. (Those Intel 320s are in less critical roles). One of these days I'll get around to automating it. Speed wise, my small cluster was fast enough without SSDs, until I started to expand. I'm only using RadosGW, and I only care about latency in the human timeframe. A second or two of latency is annoying, but not a big deal. I went from 3 nodes to 5, and the expansion was extremely painful. I admit that I inflicted a lot of pain on myself. I expanded too fast (add all the OSDs at the same time? Sure, why not.), and I was using the default configs. Things got better after I lowered the backfill priority and count, and learned to add one or two disks at a time. Still, customers noticed the increase in latency when I was adding osds. Now that I have the journals on SSDs, customers don't notice the maintenance anymore. RadosGW latency goes from ~50ms to ~80ms, not ~50ms to 2000ms. On Tue, Nov 25, 2014 at 9:12 AM, Michael Kuriger mk7...@yp.com wrote: My cluster is actually very fast without SSD drives. Thanks for the advice! Michael Kuriger mk7...@yp.com 818-649-7235 MikeKuriger (IM) On 11/25/14, 7:49 AM, Mark Nelson mark.nel...@inktank.com wrote: On 11/25/2014 09:41 AM, Erik Logtenberg wrote: If you are like me, you have the journals for your OSD's with rotating media stored separately on an SSD. If you are even more like me, you happen to use Intel 530 SSD's in some of your hosts. If so, please do check your S.M.A.R.T. statistics regularly, because these SSD's really can't cope with Ceph. Check out the media-wear graphs for the two Intel 530's in my cluster. As soon as those declining lines get down to 30% or so, they need to be replaced. That means less than half a year between purchase and end-of-life :( Tip of the week, keep an eye on those statistics, don't let a failing SSD surprise you. This is really good advice, and it's not just the Intel 530s. Most consumer grade SSDs have pretty low write endurance. If you mostly are doing reads from your cluster you may be OK, but if you have even moderately high write workloads and you care about avoiding OSD downtime (which in a production cluster is pretty important though not usually 100% critical), get high write endurance SSDs. Mark Erik. 
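Until it's automated, a one-liner per journal device does the job (sda is a placeholder; the attribute names below are what Intel and Samsung drives report, other vendors label wear differently):

    smartctl -A /dev/sda | egrep -i 'Media_Wearout_Indicator|Wear_Leveling'
    # Intel's Media_Wearout_Indicator starts at 100 and counts down; alert well before it hits 0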
Re: [ceph-users] private network - VLAN vs separate switch
It's mostly about bandwidth. With VLANs, the public and cluster networks are going to be sharing the inter-switch links. For a cluster that size, I don't see much advantage to the VLANs. You'll save a few ports by having the inter-switch links shared, at the expense of contention on those links. If you're trying to save ports, I'd go with a single network. Adding a cluster network later is relatively straight forward. Just monitor the bandwidth on the inter-switch links, and plan to expand when you saturate those links. That said, I am using VLANs, but my cluster is much smaller. I only have 5 nodes and a single switch. I'm planning to transition to a dedicated cluster switch when I need the extra ports. I don't anticipate the transition being difficult. I'll continue to use the same VLAN on the dedicated switch, just to make the migration less complicated. On Tue, Nov 25, 2014 at 3:11 AM, Sreenath BH bhsreen...@gmail.com wrote: Hi For a large network (say 100 servers and 2500 disks), are there any strong advantages to using separate switch and physical network instead of VLAN? Also, how difficult it would be to switch from a VLAN to using separate switches later? -Sreenath ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
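When you do split it out, the change itself is just two lines plus the extra interfaces (subnets are placeholders):

    [global]
        public network  = 10.0.1.0/24
        cluster network = 10.0.2.0/24

    # OSDs pick their addresses from those ranges when they restart, so roll the change
    # through one node at a time and watch the inter-switch links as you go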
Re: [ceph-users] Create OSD on ZFS Mount (firefly)
There was a good thread on the mailing list a little while ago. There were several recommendations in that thread, maybe some of them will help. Found it: https://www.mail-archive.com/ceph-users@lists.ceph.com/msg14154.html On Tue, Nov 25, 2014 at 4:16 AM, Lindsay Mathieson lindsay.mathie...@gmail.com wrote: Testing ceph on top of ZFS (zfsonlinux), kernel driver. - Have created ZFS mount: /var/lib/ceph/osd/ceph-0 - followed the instructions at: http://ceph.com/docs/firefly/rados/operations/add-or-rm-osds/ failing on the step 4. Initialize the OSD data directory. ceph-osd -i 0 --mkfs --mkkey 2014-11-25 22:12:26.563666 7ff12b466780 -1 filestore(/var/lib/ceph/osd/ceph-0) mkjournal error creating journal on /var/lib/ceph/osd/ceph-0/journal: (22) Invalid argument 2014-11-25 22:12:26.563691 7ff12b466780 -1 OSD::mkfs: ObjectStore::mkfs failed with error -22 2014-11-25 22:12:26.563765 7ff12b466780 -1 ** ERROR: error creating empty object store in /var/lib/ceph/osd/ceph-0: (22) Invalid argument Is this supported? thanks, -- Lindsay ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
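If it's the issue I'm thinking of, the EINVAL comes from the journal being opened with O_DIRECT, which older ZFS-on-Linux doesn't support. One workaround that gets mentioned (an assumption on my part, I haven't run ZFS myself) is to disable direct/async IO for the journal, or put the journal on a non-ZFS device:

    [osd]
        journal dio = false
        journal aio = false
    # or: osd journal = /path/on/a/non-zfs/device/$cluster-$id-journal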
Re: [ceph-users] Regarding Federated Gateways - Zone Sync Issues
/::initializing 2014-11-22 14:19:21.918229 7f73b07c0700 10 host=us-west-1.lt.com rgw_dns_name=us-west-1.lt.com 2014-11-22 14:19:21.918288 7f73b07c0700 2 req 1:0.53:swift-auth:GET /auth/::getting op 2014-11-22 14:19:21.918300 7f73b07c0700 2 req 1:0.71:swift-auth:GET /auth/:swift_auth_get:authorizing 2014-11-22 14:19:21.918307 7f73b07c0700 2 req 1:0.78:swift-auth:GET /auth/:swift_auth_get:reading permissions 2014-11-22 14:19:21.918313 7f73b07c0700 2 req 1:0.84:swift-auth:GET /auth/:swift_auth_get:init op 2014-11-22 14:19:21.918319 7f73b07c0700 2 req 1:0.90:swift-auth:GET /auth/:swift_auth_get:verifying op mask 2014-11-22 14:19:21.918325 7f73b07c0700 20 required_mask= 0 user.op_mask=7 2014-11-22 14:19:21.918330 7f73b07c0700 2 req 1:0.000100:swift-auth:GET /auth/:swift_auth_get:verifying op permissions 2014-11-22 14:19:21.918336 7f73b07c0700 2 req 1:0.000107:swift-auth:GET /auth/:swift_auth_get:verifying op params 2014-11-22 14:19:21.918341 7f73b07c0700 2 req 1:0.000112:swift-auth:GET /auth/:swift_auth_get:executing 2014-11-22 14:19:21.918470 7f73b07c0700 20 get_obj_state: rctx=0x7f73dc002030 obj=.us-west.users.swift:east-user:swift state=0x7f73dc0066d8 s-prefetch_data=0 2014-11-22 14:19:21.918494 7f73b07c0700 10 cache get: name=.us-west.users.swift+east-user:swift : miss 2014-11-22 14:19:21.931892 7f73b07c0700 10 cache put: name=.us-west.users.swift+east-user:swift 2014-11-22 14:19:21.931892 7f73b07c0700 10 adding .us-west.users.swift+east-user:swift to cache LRU end 2014-11-22 14:19:21.931892 7f73b07c0700 20 get_obj_state: s-obj_tag was set empty 2014-11-22 14:19:21.931892 7f73b07c0700 10 cache get: name=.us-west.users.swift+east-user:swift : type miss (requested=1, cached=6) 2014-11-22 14:19:21.931893 7f73b07c0700 20 get_obj_state: rctx=0x7f73dc007300 obj=.us-west.users.swift:east-user:swift state=0x7f73dc006558 s-prefetch_data=0 2014-11-22 14:19:21.931893 7f73b07c0700 10 cache get: name=.us-west.users.swift+east-user:swift : hit 2014-11-22 14:19:21.931893 7f73b07c0700 20 get_obj_state: s-obj_tag was set empty 2014-11-22 14:19:21.931893 7f73b07c0700 20 get_obj_state: rctx=0x7f73dc007300 obj=.us-west.users.swift:east-user:swift state=0x7f73dc006558 s-prefetch_data=0 2014-11-22 14:19:21.931893 7f73b07c0700 20 state for obj=.us-west.users.swift:east-user:swift is not atomic, not appending atomic test 2014-11-22 14:19:21.931893 7f73b07c0700 20 rados-read obj-ofs=0 read_ofs=0 read_len=524288 2014-11-22 14:19:21.932003 7f73b07c0700 20 rados-read r=0 bl.length=13 2014-11-22 14:19:21.932021 7f73b07c0700 10 cache put: name=.us-west.users.swift+east-user:swift 2014-11-22 14:19:21.932023 7f73b07c0700 10 moving .us-west.users.swift+east-user:swift to cache LRU end 2014-11-22 14:19:21.932054 7f73b07c0700 20 get_obj_state: rctx=0x7f73dc006b30 obj=.us-west.users.uid:east-user state=0x7f73dc006498 s-prefetch_data=0 2014-11-22 14:19:21.932062 7f73b07c0700 10 cache get: name=.us-west.users.uid+east-user : miss 2014-11-22 14:19:21.933559 7f73b07c0700 10 cache put: name=.us-west.users.uid+east-user 2014-11-22 14:19:21.933567 7f73b07c0700 10 adding .us-west.users.uid+east-user to cache LRU end 2014-11-22 14:19:21.933572 7f73b07c0700 20 get_obj_state: s-obj_tag was set empty 2014-11-22 14:19:21.933580 7f73b07c0700 10 cache get: name=.us-west.users.uid+east-user : type miss (requested=1, cached=6) 2014-11-22 14:19:21.933601 7f73b07c0700 20 get_obj_state: rctx=0x7f73dc006b30 obj=.us-west.users.uid:east-user state=0x7f73dc006498 s-prefetch_data=0 2014-11-22 14:19:21.933607 7f73b07c0700 10 cache get: 
name=.us-west.users.uid+east-user : hit 2014-11-22 14:19:21.933611 7f73b07c0700 20 get_obj_state: s-obj_tag was set empty 2014-11-22 14:19:21.933617 7f73b07c0700 20 get_obj_state: rctx=0x7f73dc006b30 obj=.us-west.users.uid:east-user state=0x7f73dc006498 s-prefetch_data=0 2014-11-22 14:19:21.933620 7f73b07c0700 20 state for obj=.us-west.users.uid:east-user is not atomic, not appending atomic test 2014-11-22 14:19:21.933622 7f73b07c0700 20 rados-read obj-ofs=0 read_ofs=0 read_len=524288 2014-11-22 14:19:21.934709 7f73b07c0700 20 rados-read r=0 bl.length=310 2014-11-22 14:19:21.934725 7f73b07c0700 10 cache put: name=.us-west.users.uid+east-user 2014-11-22 14:19:21.934727 7f73b07c0700 10 moving .us-west.users.uid+east-user to cache LRU end 2014-11-22 14:19:21.934790 7f73b07c0700 2 req 1:0.016560:swift-auth:GET /auth/:swift_auth_get:http status=403 2014-11-22 14:19:21.934794 7f73b07c0700 1 == req done req=0x7f73e000d010 http_status=403 == 2014-11-22 14:19:21.934800 7f73b07c0700 20 process_request() returned -1 Why am I not able to authenticate? On Fri, Nov 21, 2014 at 1:04 AM, Craig Lewis cle...@centraldesktop.com wrote: You need to create two system users, in both zones. They should have the same name, access key, and secret in both zones. By convention, these system users are named the same as the zones. You
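A quick way to confirm that the two zones really do share the same credentials is to dump the user from each zone and compare the keys and swift_keys sections. A minimal sketch, using the uid visible in the log above; the --name values are placeholders for your two gateway instances:

radosgw-admin user info --uid=east-user --name client.radosgw.us-east-1
radosgw-admin user info --uid=east-user --name client.radosgw.us-west-1

If the swift key is missing or different in the secondary zone, the 403 from swift_auth_get is expected until the user exists there with the same secret.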
Re: [ceph-users] Optimal or recommended threads values
Tuning these values depends on a lot more than just the SSDs and HDDs. Which kernel and IO scheduler are you using? Does your HBA do write caching? It also depends on what your goals are. Tuning for a RadosGW cluster is different than for an RBD cluster. The short answer is that you are the only person who can tell you what your optimal values are. As always, the best benchmark is production load. In my small cluster (5 nodes, 44 osds), I'm optimizing to minimize latency during recovery. When the cluster is healthy, bandwidth and latency are more than adequate for my needs. Even with journals on SSDs, I've found that reducing the number of operations and threads has reduced my average latency. I use injectargs to try out new values while I monitor cluster latency. I monitor latency while the cluster is healthy and recovering. If a change is deemed better, only then will I persist the change to ceph.conf. This gives me a fallback: any change that causes massive problems can be undone with a restart or reboot. So far, the configs that I've written to ceph.conf are [global] mon osd down out interval = 900 mon osd min down reporters = 9 mon osd min down reports = 12 osd pool default flag hashpspool = true [osd] osd max backfills = 1 osd recovery max active = 1 osd recovery op priority = 1 I have it on my list to investigate filestore max sync interval. And now that I've pasted that, I need to revisit the min down reports/reporters. I have some nodes with 10 OSDs, and I don't want any one node able to mark the rest of the cluster as down (it happened once). On Sat, Nov 22, 2014 at 6:24 AM, Andrei Mikhailovsky and...@arhont.com wrote: Hello guys, Could someone comment on the optimal or recommended values for the various thread settings in ceph.conf? At the moment I have the following settings: filestore_op_threads = 8 osd_disk_threads = 8 osd_op_threads = 8 filestore_merge_threshold = 40 filestore_split_multiple = 8 Are these reasonable for a small cluster made of 7.2K SAS disks with SSD journals at a ratio of 4:1? What are the settings that other people are using? Thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
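For reference, a minimal sketch of the injectargs workflow described above. The values are only examples of the recovery settings mentioned; they take effect at runtime and revert when the daemon restarts, which is exactly what makes them a safe trial:

ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'
ceph osd perf

ceph osd perf shows per-OSD commit/apply latency, which is a reasonable number to watch while a trial setting is in place.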
Re: [ceph-users] Regarding Federated Gateways - Zone Sync Issues
You need to create two system users, in both zones. They should have the same name, access key, and secret in both zones. By convention, these system users are named the same as the zones. You shouldn't use those system users for anything other than replication. You should create a non-system user to interact with the cluster. Just like you don't run as root all the time, you don't want to be a radosgw system user all the time. You only need to create this user in the primary zone. Once replication is working, it should copy the non-system user to the secondary cluster, as well as any buckets and objects this user creates. On Wed, Nov 19, 2014 at 1:16 AM, Vinod H I vinvi...@gmail.com wrote: Hi, I am using firefly version 0.80.7. I am testing disaster recovery mechanism for rados gateways. I have followed the federated gateway setup as mentioned in the docs. There is one region with two zones on the same cluster. After sync(using radosgw-agent, with --sync-scope=full), container created by the swift user(with --system flag) on the master zone gateway is not visible for the swift user(with --system flag) on the slave zone. There are no error during the syncing process. I tried by creating a new slave zone user with same uid and access and secret keys as that of master. It did not work! Any idea on how to be able to read the synced containers from the slave zone? Is there any requirement that the two zones must be on separate clusters? -- Vinod H I ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
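As a rough sketch of that setup (every value here is a placeholder; the important parts are that the uid, access key, and secret are identical on both clusters, and that only the replication users carry --system):

radosgw-admin user create --uid=us-east --display-name="us-east replication user" --access-key=REPLACE_ACCESS --secret=REPLACE_SECRET --system --name client.radosgw.us-east-1
radosgw-admin user create --uid=us-east --display-name="us-east replication user" --access-key=REPLACE_ACCESS --secret=REPLACE_SECRET --system --name client.radosgw.us-west-1

Repeat the pair of commands for the us-west user. If your radosgw-admin version doesn't accept the key flags at creation time, create the user first and set the keys afterwards with radosgw-admin key create.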
Re: [ceph-users] pg's degraded
Just to be clear, this is from a cluster that was healthy, had a disk replaced, and hasn't returned to healthy? It's not a new cluster that has never been healthy, right? Assuming it's an existing cluster, how many OSDs did you replace? It almost looks like you replaced multiple OSDs at the same time, and lost data because of it. Can you give us the output of `ceph osd tree`, and `ceph pg 2.33 query`? On Wed, Nov 19, 2014 at 2:14 PM, JIten Shah jshah2...@me.com wrote: After rebuilding a few OSD’s, I see that the pg’s are stuck in degraded mode. Sone are in the unclean and others are in the stale state. Somehow the MDS is also degraded. How do I recover the OSD’s and the MDS back to healthy ? Read through the documentation and on the web but no luck so far. pg 2.33 is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [3] pg 0.30 is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [3] pg 1.31 is stuck unclean since forever, current state stale+active+degraded, last acting [2] pg 2.32 is stuck unclean for 597129.903922, current state stale+active+degraded, last acting [2] pg 0.2f is stuck unclean for 597129.903951, current state stale+active+degraded, last acting [2] pg 1.2e is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [3] pg 2.2d is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [2] pg 0.2e is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [3] pg 1.2f is stuck unclean for 597129.904015, current state stale+active+degraded, last acting [2] pg 2.2c is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [3] pg 0.2d is stuck stale for 422844.566858, current state stale+active+degraded, last acting [2] pg 1.2c is stuck stale for 422598.539483, current state stale+active+degraded+remapped, last acting [3] pg 2.2f is stuck stale for 422598.539488, current state stale+active+degraded+remapped, last acting [3] pg 0.2c is stuck stale for 422598.539487, current state stale+active+degraded+remapped, last acting [3] pg 1.2d is stuck stale for 422598.539492, current state stale+active+degraded+remapped, last acting [3] pg 2.2e is stuck stale for 422598.539496, current state stale+active+degraded+remapped, last acting [3] pg 0.2b is stuck stale for 422598.539491, current state stale+active+degraded+remapped, last acting [3] pg 1.2a is stuck stale for 422598.539496, current state stale+active+degraded+remapped, last acting [3] pg 2.29 is stuck stale for 422598.539504, current state stale+active+degraded+remapped, last acting [3] . . . 6 ops are blocked 2097.15 sec 3 ops are blocked 2097.15 sec on osd.0 2 ops are blocked 2097.15 sec on osd.2 1 ops are blocked 2097.15 sec on osd.4 3 osds have slow requests recovery 40/60 objects degraded (66.667%) mds cluster is degraded mds.Lab-cephmon001 at X.X.16.111:6800/3424727 rank 0 is replaying journal —Jiten ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] pg's degraded
So you have your crushmap set to choose osd instead of choose host? Did you wait for the cluster to recover between each OSD rebuild? If you rebuilt all 3 OSDs at the same time (or without waiting for a complete recovery between them), that would cause this problem. On Thu, Nov 20, 2014 at 11:40 AM, JIten Shah jshah2...@me.com wrote: Yes, it was a healthy cluster and I had to rebuild because the OSD’s got accidentally created on the root disk. Out of 4 OSD’s I had to rebuild 3 of them. [jshah@Lab-cephmon001 ~]$ ceph osd tree # id weight type name up/down reweight -1 0.5 root default -2 0.0 host Lab-cephosd005 4 0.0 osd.4 up 1 -3 0.0 host Lab-cephosd001 0 0.0 osd.0 up 1 -4 0.0 host Lab-cephosd002 1 0.0 osd.1 up 1 -5 0.0 host Lab-cephosd003 2 0.0 osd.2 up 1 -6 0.0 host Lab-cephosd004 3 0.0 osd.3 up 1 [jshah@Lab-cephmon001 ~]$ ceph pg 2.33 query Error ENOENT: i don't have paid 2.33 —Jiten On Nov 20, 2014, at 11:18 AM, Craig Lewis cle...@centraldesktop.com wrote: Just to be clear, this is from a cluster that was healthy, had a disk replaced, and hasn't returned to healthy? It's not a new cluster that has never been healthy, right? Assuming it's an existing cluster, how many OSDs did you replace? It almost looks like you replaced multiple OSDs at the same time, and lost data because of it. Can you give us the output of `ceph osd tree`, and `ceph pg 2.33 query`? On Wed, Nov 19, 2014 at 2:14 PM, JIten Shah jshah2...@me.com wrote: After rebuilding a few OSD’s, I see that the pg’s are stuck in degraded mode. Sone are in the unclean and others are in the stale state. Somehow the MDS is also degraded. How do I recover the OSD’s and the MDS back to healthy ? Read through the documentation and on the web but no luck so far. pg 2.33 is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [3] pg 0.30 is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [3] pg 1.31 is stuck unclean since forever, current state stale+active+degraded, last acting [2] pg 2.32 is stuck unclean for 597129.903922, current state stale+active+degraded, last acting [2] pg 0.2f is stuck unclean for 597129.903951, current state stale+active+degraded, last acting [2] pg 1.2e is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [3] pg 2.2d is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [2] pg 0.2e is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [3] pg 1.2f is stuck unclean for 597129.904015, current state stale+active+degraded, last acting [2] pg 2.2c is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [3] pg 0.2d is stuck stale for 422844.566858, current state stale+active+degraded, last acting [2] pg 1.2c is stuck stale for 422598.539483, current state stale+active+degraded+remapped, last acting [3] pg 2.2f is stuck stale for 422598.539488, current state stale+active+degraded+remapped, last acting [3] pg 0.2c is stuck stale for 422598.539487, current state stale+active+degraded+remapped, last acting [3] pg 1.2d is stuck stale for 422598.539492, current state stale+active+degraded+remapped, last acting [3] pg 2.2e is stuck stale for 422598.539496, current state stale+active+degraded+remapped, last acting [3] pg 0.2b is stuck stale for 422598.539491, current state stale+active+degraded+remapped, last acting [3] pg 1.2a is stuck stale for 422598.539496, current state stale+active+degraded+remapped, last 
acting [3] pg 2.29 is stuck stale for 422598.539504, current state stale+active+degraded+remapped, last acting [3] . . . 6 ops are blocked 2097.15 sec 3 ops are blocked 2097.15 sec on osd.0 2 ops are blocked 2097.15 sec on osd.2 1 ops are blocked 2097.15 sec on osd.4 3 osds have slow requests recovery 40/60 objects degraded (66.667%) mds cluster is degraded mds.Lab-cephmon001 at X.X.16.111:6800/3424727 rank 0 is replaying journal —Jiten ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
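If it helps, a quick way to check what the rules actually choose is to decompile the CRUSH map and look at the chooseleaf lines (the file names below are arbitrary):

ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
grep chooseleaf crush.txt

A rule with "chooseleaf firstn 0 type osd" will happily put every replica of a PG on one host, which is what makes rebuilding several OSDs at once on a small cluster so dangerous; "type host" spreads replicas across hosts.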
Re: [ceph-users] pg's degraded
If there's no data to lose, tell Ceph to re-create all the missing PGs. ceph pg force_create_pg 2.33 Repeat for each of the missing PGs. If that doesn't do anything, you might need to tell Ceph that you lost the OSDs. For each OSD you moved, run ceph osd lost OSDID, then try the force_create_pg command again. If that doesn't work, you can keep fighting with it, but it'll be faster to rebuild the cluster. On Thu, Nov 20, 2014 at 1:45 PM, JIten Shah jshah2...@me.com wrote: Thanks for your help. I was using puppet to install the OSD’s where it chooses a path over a device name. Hence it created the OSD in the path within the root volume since the path specified was incorrect. And all 3 of the OSD’s were rebuilt at the same time because it was unused and we had not put any data in there. Any way to recover from this or should i rebuild the cluster altogether. —Jiten On Nov 20, 2014, at 1:40 PM, Craig Lewis cle...@centraldesktop.com wrote: So you have your crushmap set to choose osd instead of choose host? Did you wait for the cluster to recover between each OSD rebuild? If you rebuilt all 3 OSDs at the same time (or without waiting for a complete recovery between them), that would cause this problem. On Thu, Nov 20, 2014 at 11:40 AM, JIten Shah jshah2...@me.com wrote: Yes, it was a healthy cluster and I had to rebuild because the OSD’s got accidentally created on the root disk. Out of 4 OSD’s I had to rebuild 3 of them. [jshah@Lab-cephmon001 ~]$ ceph osd tree # id weight type name up/down reweight -1 0.5 root default -2 0.0 host Lab-cephosd005 4 0.0 osd.4 up 1 -3 0.0 host Lab-cephosd001 0 0.0 osd.0 up 1 -4 0.0 host Lab-cephosd002 1 0.0 osd.1 up 1 -5 0.0 host Lab-cephosd003 2 0.0 osd.2 up 1 -6 0.0 host Lab-cephosd004 3 0.0 osd.3 up 1 [jshah@Lab-cephmon001 ~]$ ceph pg 2.33 query Error ENOENT: i don't have paid 2.33 —Jiten On Nov 20, 2014, at 11:18 AM, Craig Lewis cle...@centraldesktop.com wrote: Just to be clear, this is from a cluster that was healthy, had a disk replaced, and hasn't returned to healthy? It's not a new cluster that has never been healthy, right? Assuming it's an existing cluster, how many OSDs did you replace? It almost looks like you replaced multiple OSDs at the same time, and lost data because of it. Can you give us the output of `ceph osd tree`, and `ceph pg 2.33 query`? On Wed, Nov 19, 2014 at 2:14 PM, JIten Shah jshah2...@me.com wrote: After rebuilding a few OSD’s, I see that the pg’s are stuck in degraded mode. Sone are in the unclean and others are in the stale state. Somehow the MDS is also degraded. How do I recover the OSD’s and the MDS back to healthy ? Read through the documentation and on the web but no luck so far. 
pg 2.33 is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [3] pg 0.30 is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [3] pg 1.31 is stuck unclean since forever, current state stale+active+degraded, last acting [2] pg 2.32 is stuck unclean for 597129.903922, current state stale+active+degraded, last acting [2] pg 0.2f is stuck unclean for 597129.903951, current state stale+active+degraded, last acting [2] pg 1.2e is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [3] pg 2.2d is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [2] pg 0.2e is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [3] pg 1.2f is stuck unclean for 597129.904015, current state stale+active+degraded, last acting [2] pg 2.2c is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [3] pg 0.2d is stuck stale for 422844.566858, current state stale+active+degraded, last acting [2] pg 1.2c is stuck stale for 422598.539483, current state stale+active+degraded+remapped, last acting [3] pg 2.2f is stuck stale for 422598.539488, current state stale+active+degraded+remapped, last acting [3] pg 0.2c is stuck stale for 422598.539487, current state stale+active+degraded+remapped, last acting [3] pg 1.2d is stuck stale for 422598.539492, current state stale+active+degraded+remapped, last acting [3] pg 2.2e is stuck stale for 422598.539496, current state stale+active+degraded+remapped, last acting [3] pg 0.2b is stuck stale for 422598.539491, current state stale+active+degraded+remapped, last acting [3] pg 1.2a is stuck stale for 422598.539496, current state stale+active+degraded+remapped, last acting [3] pg 2.29 is stuck stale for 422598.539504, current state stale+active+degraded+remapped, last acting [3] . . . 6 ops are blocked 2097.15 sec 3 ops are blocked 2097.15 sec on osd.0 2 ops are blocked 2097.15 sec on osd.2 1 ops are blocked 2097.15 sec on osd.4 3 osds have slow requests recovery 40/60 objects
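For completeness, a sketch of that recovery path. The OSD id is a placeholder for the ones that were rebuilt, and the awk expression assumes the usual ceph health detail wording, so check its output by hand before looping over it:

ceph osd lost 0 --yes-i-really-mean-it
for pg in $(ceph health detail | awk '/is stuck stale/ {print $2}'); do ceph pg force_create_pg $pg; done

Both commands tell Ceph that the data in those placement groups is unrecoverable, so they are only appropriate here because the cluster held no real data.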
Re: [ceph-users] OSD commits suicide
That would probably have helped. The XFS deadlocks would only occur when there was relatively little free memory. Kernel 3.18 is supposed to have a fix for that, but I haven't tried it yet. Looking at my actual usage, I don't even need 64k inodes. 64k inodes should make things a bit faster when you have a large number of files in a directory. Ceph will automatically split directories with too many files into multiple sub-directories, so it's kinda pointless. I may try the experiment again, but probably not. It took several weeks to reformat all of the OSDs. Even on a single node, it takes 4-5 days to drain, format, and backfill. That was months ago, and I'm still dealing with the side effects. I'm not eager to try again. On Mon, Nov 17, 2014 at 2:04 PM, Andrey Korolyov and...@xdel.ru wrote: On Tue, Nov 18, 2014 at 12:54 AM, Craig Lewis cle...@centraldesktop.com wrote: I did have a problem in my secondary cluster that sounds similar to yours. I was using XFS, and traced my problem back to 64 kB inodes (osd mkfs options xfs = -i size=64k). This showed up with a lot of XFS: possible memory allocation deadlock in kmem_alloc in the kernel logs. I was able to keep things limping along by flushing the cache frequently, but I eventually re-formatted every OSD to get rid of the 64k inodes. After I finished the reformat, I had problems because of deep-scrubbing. While reformatting, I disabled deep-scrubbing. Once I re-enabled it, Ceph wanted to deep-scrub the whole cluster, and sometimes 90% of my OSDs would be doing a deep-scrub. I'm manually deep-scrubbing now, trying to spread out the schedule a bit. Once this finishes in a few days, I should be able to re-enable deep-scrubbing and keep my HEALTH_OK. Would you mind checking the suggestions from my hints, or from the URLs mentioned at http://marc.info/?l=linux-mm&m=141607712831090&w=2, with 64k inodes again? As for me, I am not observing the lock loop after setting min_free_kbytes to half a gigabyte per OSD. Even if your locks have a different nature, it may be worth trying anyway. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
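For anyone who wants to try the min_free_kbytes suggestion, a minimal sketch; the value is per host, and the sizing (roughly half a GB per OSD, so 2 GB for a 4-OSD host) is Andrey's rule of thumb, not an official recommendation:

sysctl -w vm.min_free_kbytes=2097152
echo 'vm.min_free_kbytes = 2097152' >> /etc/sysctl.conf

The mkfs option that triggered the deadlocks in the first place was set in ceph.conf as osd mkfs options xfs = -i size=64k; leaving that line out entirely gives new OSDs the XFS default inode size.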
Re: [ceph-users] osd crashed while there was no space
You shouldn't let the cluster get so full that losing a few OSDs will make you go toofull. Letting the cluster get to 100% full is such a bad idea that you should make sure it doesn't happen. Ceph is supposed to stop moving data to an OSD once that OSD hits osd_backfill_full_ratio, which defaults to 0.85. Any disk at 86% full will stop backfilling. I have verified this works when the disks fill up while the cluster is healthy, but I haven't failed a disk once I'm in the toofull state. Even so, mon_osd_full_ratio (default 0.95) or osd_failsafe_full_ratio (default 0.97) should stop all IO until a human gets involved. The only gotcha I can find is that the values are percentages, and the test is a greater than done with two significant digits. ie, if the osd_backfill_full_ratio is 0.85, it will continue backfilling until the disk is 86% full. So values are 0.99 and 1.00 will cause problems. On Mon, Nov 17, 2014 at 6:50 PM, han vincent hang...@gmail.com wrote: hi, craig: Your solution did work very well. But if the data is very important, when remove directory of PG from OSDs, a small mistake will result in loss of data. And if cluster is very large, do not you think delete the data on the disk from 100% to 95% is a tedious and error-prone thing, for so many OSDs, large disks, and so on. so my key question is: if there is no space in the cluster while some OSDs crashed, why the cluster should choose to migrate? And in the migrating, other OSDs will crashed one by one until the cluster could not work. 2014-11-18 5:28 GMT+08:00 Craig Lewis cle...@centraldesktop.com: At this point, it's probably best to delete the pool. I'm assuming the pool only contains benchmark data, and nothing important. Assuming you can delete the pool: First, figure out the ID of the data pool. You can get that from ceph osd dump | grep '^pool' Once you have the number, delete the data pool: rados rmpool data data --yes-i-really-really-mean-it That will only free up space on OSDs that are up. You'll need to manually some PGs on the OSDs that are 100% full. Go to /var/lib/ceph/osd/ceph-OSDID/current, and delete a few directories that start with your data pool ID. You don't need to delete all of them. Once the disk is below 95% full, you should be able to start that OSD. Once it's up, it will finish deleting the pool. If you can't delete the pool, it is possible, but it's more work, and you still run the risk of losing data if you make a mistake. You need to disable backfilling, then delete some PGs on each OSD that's full. Try to only delete one copy of each PG. If you delete every copy of a PG on all OSDs, then you lost the data that was in that PG. As before, once you delete enough that the disk is less than 95% full, you can start the OSD. Once you start it, start deleting your benchmark data out of the data pool. Once that's done, you can re-enable backfilling. You may need to scrub or deep-scrub the OSDs you deleted data from to get everything back to normal. So how did you get the disks 100% full anyway? Ceph normally won't let you do that. Did you increase mon_osd_full_ratio, osd_backfill_full_ratio, or osd_failsafe_full_ratio? On Mon, Nov 17, 2014 at 7:00 AM, han vincent hang...@gmail.com wrote: hello, every one: These days a problem of ceph has troubled me for a long time. I build a cluster with 3 hosts and each host has three osds in it. And after that I used the command rados bench 360 -p data -b 4194304 -t 300 write --no-cleanup to test the write performance of the cluster. 
When the cluster is near full, there couldn't write any data to it. Unfortunately, there was a host hung up, then a lots of PG was going to migrate to other OSDs. After a while, a lots of OSD was marked down and out, my cluster couldn't work any more. The following is the output of ceph -s: cluster 002c3742-ab04-470f-8a7a-ad0658b547d6 health HEALTH_ERR 103 pgs degraded; 993 pgs down; 617 pgs incomplete; 1008 pgs peering; 12 pgs recovering; 534 pgs stale; 1625 pgs stuck inactive; 534 pgs stuck stale; 1728 pgs stuck unclean; recovery 945/29649 objects degraded (3.187%); 1 full osd(s); 1 mons down, quorum 0,2 2,1 monmap e1: 3 mons at {0=10.0.0.97:6789/0,1=10.0.0.98:6789/0,2=10.0.0.70:6789/0}, election epoch 40, quorum 0,2 2,1 osdmap e173: 9 osds: 2 up, 2 in flags full pgmap v1779: 1728 pgs, 3 pools, 39528 MB data, 9883 objects 37541 MB used, 3398 MB / 40940 MB avail 945/29649 objects degraded (3.187%) 34 stale+active+degraded+remapped 176 stale+incomplete 320 stale+down+peering 53 active+degraded+remapped 408 incomplete 1 active+recovering+degraded 673 down
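For reference, the thresholds involved live in ceph.conf and default to the values Craig quotes; they are shown here only to make the knobs explicit, not as recommended changes:

[global]
    mon osd full ratio = .95
    mon osd nearfull ratio = .85
[osd]
    osd backfill full ratio = .85
    osd failsafe full ratio = .97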
Re: [ceph-users] OSDs down
Firstly, any chance of getting node4 and node5 back up? You can move the disks (monitor and osd) to a new chasis, and bring it back up. As long as it has the same IP as the original node4 and node5, the monitor should join. How much is the clock skewed on node2? I haven't had problems with small skew (~100 ms), but I've seen posts to the mailing list about large skews (minutes) causing quorum and authentication problems. When you say Nevertheless on node3 every ceph * commands stay freezed, do you by chance mean node2 instead of node3? If so, that supports the clock skew being a problem, preventing the commands and the OSDs from authenticating with the monitors. If you really did mean node3, then something strange else going on. On Mon, Nov 17, 2014 at 7:07 AM, NEVEU Stephane stephane.ne...@thalesgroup.com wrote: Hi all J , I need some help, I’m in a sad situation : i’ve lost 2 ceph server nodes physically (5 nodes initialy/ 5 monitors). So 3 nodes left : node1, node2, node3 On my first node leaving, I’ve updated the crush map to remove every osds running on those 2 lost servers : Ceph osd crush remove osds ceph auth del osds ceph osd rm osds ceph osd remove my2Lostnodes So the crush map seems to be ok now on node1. Ceph osd tree on node 1 returns that every osds running on node2 are “down 1” and “up 1” on node 3 and “up 1” on node1. Nevertheless on node3 every ceph * commands stay freezed, so I’m not sure the crush map has been updated on node2 and node3. I don’t know how to set ods on node 2 up again. My node2 says it cannot connect to the cluster ! Ceph –s on node 1 gives me (so still 5 monitors): cluster 45d9195b-365e-491a-8853-34b46553db94 health HEALTH_WARN 10016 pgs degraded; 10016 pgs stuck unclean; recovery 181055/544038 objects degraded (33.280%); 11/33 in osds are down; noout flag(s) set; 2 mons down, quorum 0,1,2 node1,node2,node3; clock skew detected on mon.node2 monmap e1: 5 mons at {node1= 172.23.6.11:6789/0,node2=172.23.6.12:6789/0,node3=172.23.6.13:6789/0,node4=172.23.6.14:6789/0,node5=172.23.6.15:6789/0 http://172.23.6.14:6789/0,omcinfcph02d=172.23.6.15:6789/0,omcinfcph61d=172.23.6.11:6789/0,omcinfcph62d=172.23.6.12:6789/0,omcinfcph63d=172.23.6.13:6789/0}, election epoch 488, quorum 0,1,2 node1,node2,node3 mdsmap e48: 1/1/1 up {0=node3=up:active} osdmap e3852: 33 osds: 22 up, 33 in flags noout pgmap v8189785: 10016 pgs, 9 pools, 705 GB data, 177 kobjects 2122 GB used, 90051 GB / 92174 GB avail 181055/544038 objects degraded (33.280%) 10016 active+degraded client io 0 B/s rd, 233 kB/s wr, 22 op/s Thx for your help !! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
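Two quick checks that might help, sketched below; the host names are taken from the thread, and shrinking the monmap only makes sense if node4 and node5 really are gone for good:

ntpq -pn
ceph mon remove node4
ceph mon remove node5

The first shows whether NTP is actually syncing on node2 (fixing the skew should let its OSDs and the ceph commands authenticate again); the other two drop the dead monitors so quorum becomes 3 of 3 instead of 3 of 5.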
Re: [ceph-users] osd crashed while there was no space
At this point, it's probably best to delete the pool. I'm assuming the pool only contains benchmark data, and nothing important. Assuming you can delete the pool: First, figure out the ID of the data pool. You can get that from ceph osd dump | grep '^pool' Once you have the number, delete the data pool: rados rmpool data data --yes-i-really-really-mean-it That will only free up space on OSDs that are up. You'll need to manually delete some PGs on the OSDs that are 100% full. Go to /var/lib/ceph/osd/ceph-OSDID/current, and delete a few directories that start with your data pool ID. You don't need to delete all of them. Once the disk is below 95% full, you should be able to start that OSD. Once it's up, it will finish deleting the pool. If you can't delete the pool, recovery is still possible, but it's more work, and you still run the risk of losing data if you make a mistake. You need to disable backfilling, then delete some PGs on each OSD that's full. Try to only delete one copy of each PG. If you delete every copy of a PG on all OSDs, then you have lost the data that was in that PG. As before, once you delete enough that the disk is less than 95% full, you can start the OSD. Once you start it, start deleting your benchmark data out of the data pool. Once that's done, you can re-enable backfilling. You may need to scrub or deep-scrub the OSDs you deleted data from to get everything back to normal. So how did you get the disks 100% full anyway? Ceph normally won't let you do that. Did you increase mon_osd_full_ratio, osd_backfill_full_ratio, or osd_failsafe_full_ratio? On Mon, Nov 17, 2014 at 7:00 AM, han vincent hang...@gmail.com wrote: hello, everyone: These days a problem with ceph has troubled me for a long time. I built a cluster with 3 hosts, and each host has three OSDs in it. After that I used the command rados bench 360 -p data -b 4194304 -t 300 write --no-cleanup to test the write performance of the cluster. When the cluster was nearly full, no more data could be written to it. Unfortunately, a host hung, and then a lot of PGs started migrating to other OSDs. After a while, a lot of OSDs were marked down and out, and my cluster couldn't work any more.
The following is the output of ceph -s: cluster 002c3742-ab04-470f-8a7a-ad0658b547d6 health HEALTH_ERR 103 pgs degraded; 993 pgs down; 617 pgs incomplete; 1008 pgs peering; 12 pgs recovering; 534 pgs stale; 1625 pgs stuck inactive; 534 pgs stuck stale; 1728 pgs stuck unclean; recovery 945/29649 objects degraded (3.187%); 1 full osd(s); 1 mons down, quorum 0,2 2,1 monmap e1: 3 mons at {0=10.0.0.97:6789/0,1=10.0.0.98:6789/0,2=10.0.0.70:6789/0}, election epoch 40, quorum 0,2 2,1 osdmap e173: 9 osds: 2 up, 2 in flags full pgmap v1779: 1728 pgs, 3 pools, 39528 MB data, 9883 objects 37541 MB used, 3398 MB / 40940 MB avail 945/29649 objects degraded (3.187%) 34 stale+active+degraded+remapped 176 stale+incomplete 320 stale+down+peering 53 active+degraded+remapped 408 incomplete 1 active+recovering+degraded 673 down+peering 1 stale+active+degraded 15 remapped+peering 3 stale+active+recovering+degraded+remapped 3 active+degraded 33 remapped+incomplete 8 active+recovering+degraded+remapped The following is the output of ceph osd tree: # idweight type name up/down reweight -1 9 root default -3 9 rack unknownrack -2 3 host 10.0.0.97 0 1 osd.0 down0 1 1 osd.1 down0 2 1 osd.2 down0 -4 3 host 10.0.0.98 3 1 osd.3 down0 4 1 osd.4 down0 5 1 osd.5 down0 -5 3 host 10.0.0.70 6 1 osd.6 up 1 7 1 osd.7 up 1 8 1 osd.8 down0 The following is part of output os osd.0.log -3 2014-11-14 17:33:02.166022 7fd9dd1ab700 0 filestore(/data/osd/osd.0) error (28) No space left on device not handled on operation 10 (15804.0.13, or op 13, counting from 0) -2 2014-11-14 17:33:02.216768 7fd9dd1ab700 0 filestore(/data/osd/osd.0) ENOSPC handling not implemented -1 2014-11-14 17:33:02.216783 7fd9dd1ab700 0 filestore(/data/osd/osd.0) transaction dump: ... ... 0 2014-11-14 17:33:02.541008 7fd9dd1ab700 -1
Re: [ceph-users] jbod + SMART : how to identify failing disks ?
I use `dd` to force activity to the disk I want to replace, and watch the activity lights. That only works if your disks aren't 100% busy. If they are, stop the ceph-osd daemon, and see which drive stops having activity. Repeat until you're 100% confident that you're pulling the right drive. On Wed, Nov 12, 2014 at 5:05 AM, SCHAER Frederic frederic.sch...@cea.fr wrote: Hi, I’m used to RAID software giving me the failing disks slots, and most often blinking the disks on the disk bays. I recently installed a DELL “6GB HBA SAS” JBOD card, said to be an LSI 2008 one, and I now have to identify 3 pre-failed disks (so says S.M.A.R.T) . Since this is an LSI, I thought I’d use MegaCli to identify the disks slot, but MegaCli does not see the HBA card. Then I found the LSI “sas2ircu” utility, but again, this one fails at giving me the disk slots (it finds the disks, serials and others, but slot is always 0) Because of this, I’m going to head over to the disk bay and unplug the disk which I think corresponds to the alphabetical order in linux, and see if it’s the correct one…. But even if this is correct this time, it might not be next time. But this makes me wonder : how do you guys, Ceph users, manage your disks if you really have JBOD servers ? I can’t imagine having to guess slots that each time, and I can’t imagine neither creating serial number stickers for every single disk I could have to manage … Is there any specific advice reguarding JBOD cards people should (not) use in their systems ? Any magical way to “blink” a drive in linux ? Thanks regards ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
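A sketch of that approach, plus noting the serial number before pulling anything (the device name is only an example):

dd if=/dev/sdk of=/dev/null bs=1M iflag=direct
smartctl -i /dev/sdk | grep -i serial

The dd read is harmless to the data and keeps the activity LED lit; smartctl gives you the serial to match against the label on the drive once it's out. If the enclosure supports SES, the ledmon package's ledctl locate=/dev/sdk can blink the slot directly, but plenty of JBOD backplanes don't wire that up.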
Re: [ceph-users] OSD commits suicide
I did have a problem in my secondary cluster that sounds similar to yours. I was using XFS, and traced my problem back to 64 kB inodes (osd mkfs options xfs = -i size=64k). This showed up with a lot of XFS: possible memory allocation deadlock in kmem_alloc in the kernel logs. I was able to keep things limping along by flushing the cache frequently, but I eventually re-formatted every OSD to get rid of the 64k inodes. After I finished the reformat, I had problems because of deep-scrubbing. While reformatting, I disabled deep-scrubbing. Once I re-enabled it, Ceph wanted to deep-scrub the whole cluster, and sometimes 90% of my OSDs would be doing a deep-scrub. I'm manually deep-scrubbing now, trying to spread out the schedule a bit. Once this finishes in a few day, I should be able to re-enable deep-scrubbing and keep my HEALTH_OK. My primary cluster has always been well behaved. It completed the re-format without having any problems. The clusters are nearly identical, the biggest difference being that the secondary had a higher sustained load due to a replication backlog. On Sat, Nov 15, 2014 at 12:38 PM, Erik Logtenberg e...@logtenberg.eu wrote: Hi, Thanks for the tip, I applied these configuration settings and it does lower the load during rebuilding a bit. Are there settings like these that also tune Ceph down a bit during regular operations? The slow requests, timeouts and OSD suicides are killing me. If I allow the cluster to regain consciousness and stay idle a bit, it all seems to settle down nicely, but as soon as I apply some load it immediately starts to overstress and complain like crazy. I'm also seeing this behaviour: http://tracker.ceph.com/issues/9844 This was reported by Dmitry Smirnov 26 days ago, but the report has no response yet. Any ideas? In my experience, OSD's are quite unstable in Giant and very easily stressed, causing chain effects, further worsening the issues. It would be nice to know if this is also noticed by other users? Thanks, Erik. On 11/10/2014 08:40 PM, Craig Lewis wrote: Have you tuned any of the recovery or backfill parameters? My ceph.conf has: [osd] osd max backfills = 1 osd recovery max active = 1 osd recovery op priority = 1 Still, if it's running for a few hours, then failing, it sounds like there might be something else at play. OSDs use a lot of RAM during recovery. How much RAM and how many OSDs do you have in these nodes? What does memory usage look like after a fresh restart, and what does it look like when the problems start? Even better if you know what it looks like 5 minutes before the problems start. Is there anything interesting in the kernel logs? OOM killers, or memory deadlocks? On Sat, Nov 8, 2014 at 11:19 AM, Erik Logtenberg e...@logtenberg.eu mailto:e...@logtenberg.eu wrote: Hi, I have some OSD's that keep committing suicide. My cluster has ~1.3M misplaced objects, and it can't really recover, because OSD's keep failing before recovering finishes. The load on the hosts is quite high, but the cluster currently has no other tasks than just the backfilling/recovering. I attached the logfile from a failed OSD. It shows the suicide, the recent events and also me starting the OSD again after some time. It'll keep running for a couple of hours and then fail again, for the same reason. I noticed a lot of timeouts. Apparently ceph stresses the hosts to the limit with the recovery tasks, so much that they timeout and can't finish that task. I don't understand why. Can I somehow throttle ceph a bit so that it doesn't keep overrunning itself? 
I kinda feel like it should chill out a bit and simply recover one step at a time instead of full force and then fail. Thanks, Erik. ___ ceph-users mailing list ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
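For the manual deep-scrub spreading mentioned above, a minimal sketch; the pg id is a placeholder, and the useful column (the last deep-scrub timestamp) is near the end of each pg dump line:

ceph pg dump > /tmp/pgdump
ceph pg deep-scrub 2.33

Issuing a handful of deep-scrubs per day, oldest timestamp first, spreads the schedule out so that re-enabling automatic deep-scrubbing later doesn't kick off the whole cluster at once.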
Re: [ceph-users] Deep scrub parameter tuning
The minimum value for osd_deep_scrub_interval is osd_scrub_min_interval, and it wouldn't be advisable to go that low. I can't find the documentation, but basically Ceph will attempt a scrub sometime between osd_scrub_min_interval and osd_scrub_max_interval. If the PG hasn't been deep-scrubbed in the last osd_deep_scrub_interval seconds, it does a deep-scrub instead. So if you set osd_deep_scrub_interval to osd_scrub_min_interval, you'll never scrub your PGs, you'll only deep-scrub. Obviously, you can lower the two scrub intervals too. As Loïc says, test it well. I find when I'm playing with these values, I use injectargs to find a good value, then persist that value in the ceph.conf. On Fri, Nov 14, 2014 at 3:16 AM, Loic Dachary l...@dachary.org wrote: Hi, On 14/11/2014 12:11, Mallikarjun Biradar wrote: Hi, Default deep scrub interval is once per week, which we can set using osd_deep_scrub_interval parameter. Whether can we reduce it to less than a week or minimum interval is one week? You can reduce it to a shorter period. It is worth testing the impact on disk IO before going to production with shorter intervals though. Cheers -Thanks regards, Mallikarjun Biradar ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Loïc Dachary, Artisan Logiciel Libre ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
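A sketch of trying a shorter interval at runtime before persisting it, per the injectargs approach above (259200 seconds, i.e. three days, is just an example):

ceph tell osd.* injectargs '--osd_deep_scrub_interval 259200'

and, once you're happy with it, in ceph.conf under [osd]: osd deep scrub interval = 259200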
Re: [ceph-users] Negative number of objects degraded for extended period of time
Well, after 4 days, this is probably moot. Hopefully it's finished backfilling, and your problem is gone. If not, I believe that if you fix those backfill_toofull, the negative numbers will start approaching zero. I seem to recall that negative degraded is a special case of degraded, but I don't remember exactly, and can't find any references. I have seen it before, and it went away when my cluster became healthy. As long as you still have OSDs completing their backfilling, I'd let it run. If you get to the point that all of the backfills are done, and you're left with only wait_backfill+backfill_toofull, then you can bump osd_backfill_full_ratio, mon_osd_nearfull_ratio, and maybe osd_failsafe_nearfull_ratio. If you do, be careful, and only bump them just enough to let them start backfilling. If you set them to 0.99, bad things will happen. On Thu, Nov 13, 2014 at 7:57 AM, Fred Yang frederic.y...@gmail.com wrote: Hi, The Ceph cluster we are running have few OSDs approaching to 95% 1+ weeks ago so I ran a reweight to balance it out, in the meantime, instructing application to purge data not required. But after large amount of data purge issued from application side(all OSDs' usage dropped below 20%), the cluster fall into this weird state for days, the objects degraded remain negative for more than 7 days, I'm seeing some IOs going on on OSDs consistently, but the number(negative) objects degraded does not change much: 2014-11-13 10:43:07.237292 mon.0 [INF] pgmap v5935301: 44816 pgs: 44713 active+clean, 1 active+backfilling, 20 active+remapped+wait_backfill, 27 active+remapped+wait_backfill+backfill_toofull, 11 active+recovery_wait, 33 active+remapped+backfilling, 11 active+wait_backfill+backfill_toofull; 1473 GB data, 2985 GB used, 17123 GB / 20109 GB avail; 30172 kB/s wr, 58 op/s; -13582/1468299 objects degraded (-0.925%) 2014-11-13 10:43:08.248232 mon.0 [INF] pgmap v5935302: 44816 pgs: 44713 active+clean, 1 active+backfilling, 20 active+remapped+wait_backfill, 27 active+remapped+wait_backfill+backfill_toofull, 11 active+recovery_wait, 33 active+remapped+backfilling, 11 active+wait_backfill+backfill_toofull; 1473 GB data, 2985 GB used, 17123 GB / 20109 GB avail; 26459 kB/s wr, 51 op/s; -13582/1468303 objects degraded (-0.925%) Any idea what might be happening here? It seems active+remapped+wait_backfill+backfill_toofull stuck? osdmap e43029: 36 osds: 36 up, 36 in pgmap v5935658: 44816 pgs, 32 pools, 1488 GB data, 714 kobjects 3017 GB used, 17092 GB / 20109 GB avail -13438/1475773 objects degraded (-0.911%) 44713 active+clean 1 active+backfilling 20 active+remapped+wait_backfill 27 active+remapped+wait_backfill+backfill_toofull 11 active+recovery_wait 33 active+remapped+backfilling 11 active+wait_backfill+backfill_toofull client io 478 B/s rd, 40170 kB/s wr, 80 op/s The cluster is running on v0.72.2, we are planning to upgrade cluster to firefly, but I would like to get the cluster state clean first before the upgrade. Thanks, Fred ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
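If it does come to bumping the ratios, a sketch of doing it at runtime so it can be reverted the moment backfill starts moving again (0.88 is an arbitrary small step, not a recommendation):

ceph tell osd.* injectargs '--osd_backfill_full_ratio 0.88'
ceph tell osd.* injectargs '--osd_backfill_full_ratio 0.85'

The second line puts the default back; as noted above, pushing these anywhere near 0.99 is asking for trouble.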
Re: [ceph-users] Federated gateways
I have identical regionmaps in both clusters. I only created the zone's pools in that cluster. I didn't delete the default .rgw.* pools, so those exist in both zones. Both users need to be system on both ends, and have identical access and secrets. If they're not, this is likely your problem. On Fri, Nov 14, 2014 at 11:38 AM, Aaron Bassett aa...@five3genomics.com wrote: Well I upgraded both clusters to giant this morning just to see if that would help, and it didn’t. I have a couple questions though. I have the same regionmap on both clusters, with both zones in it, but then i only have the buckets and zone info for one zone in each cluster, is this right? Or do I need all the buckets and zones in both clusters? Reading the docs it doesn’t seem like i do because I’m expecting data to sync from one zone in one cluster to the other zone on the other cluster, but I don’t know what to think anymore. Also do both users need to be system users on both ends? Aaron On Nov 12, 2014, at 4:00 PM, Craig Lewis cle...@centraldesktop.com wrote: http://tracker.ceph.com/issues/9206 My post to the ML: http://www.spinics.net/lists/ceph-users/msg12665.html IIRC, the system uses didn't see the other user's bucket in a bucket listing, but they could read and write the objects fine. On Wed, Nov 12, 2014 at 11:16 AM, Aaron Bassett aa...@five3genomics.com wrote: In playing around with this a bit more, I noticed that the two users on the secondary node cant see each others buckets. Is this a problem? IIRC, the system user couldn't see each other's buckets, but they could read and write the objects. On Nov 11, 2014, at 6:56 PM, Craig Lewis cle...@centraldesktop.com wrote: I see you're running 0.80.5. Are you using Apache 2.4? There is a known issue with Apache 2.4 on the primary and replication. It's fixed, just waiting for the next firefly release. Although, that causes 40x errors with Apache 2.4, not 500 errors. It is apache 2.4, but I’m actually running 0.80.7 so I probably have that bug fix? No, the unreleased 0.80.8 has the fix. Have you verified that both system users can read and write to both clusters? (Just make sure you clean up the writes to the slave cluster). Yes I can write everywhere and radosgw-agent isn’t getting any 403s like it was earlier when I had mismatched keys. The .us-nh.rgw.buckets.index pool is syncing properly, as are the users. It seems like really the only thing that isn’t syncing is the .zone.rgw.buckets pool. That's pretty much the same behavior I was seeing with Apache 2.4. Try downgrading the primary cluster to Apache 2.2. In my testing, the secondary cluster could run 2.2 or 2.4. Do you have a link to that bug#? I want to see if it gives me any clues. Aaron ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Federated gateways
http://tracker.ceph.com/issues/9206 My post to the ML: http://www.spinics.net/lists/ceph-users/msg12665.html IIRC, the system users didn't see the other user's bucket in a bucket listing, but they could read and write the objects fine. On Wed, Nov 12, 2014 at 11:16 AM, Aaron Bassett aa...@five3genomics.com wrote: In playing around with this a bit more, I noticed that the two users on the secondary node can't see each other's buckets. Is this a problem? IIRC, the system users couldn't see each other's buckets, but they could read and write the objects. On Nov 11, 2014, at 6:56 PM, Craig Lewis cle...@centraldesktop.com wrote: I see you're running 0.80.5. Are you using Apache 2.4? There is a known issue with Apache 2.4 on the primary and replication. It's fixed, just waiting for the next firefly release. Although, that causes 40x errors with Apache 2.4, not 500 errors. It is Apache 2.4, but I’m actually running 0.80.7 so I probably have that bug fix? No, the unreleased 0.80.8 has the fix. Have you verified that both system users can read and write to both clusters? (Just make sure you clean up the writes to the slave cluster). Yes I can write everywhere and radosgw-agent isn’t getting any 403s like it was earlier when I had mismatched keys. The .us-nh.rgw.buckets.index pool is syncing properly, as are the users. It seems like really the only thing that isn’t syncing is the .zone.rgw.buckets pool. That's pretty much the same behavior I was seeing with Apache 2.4. Try downgrading the primary cluster to Apache 2.2. In my testing, the secondary cluster could run 2.2 or 2.4. Do you have a link to that bug#? I want to see if it gives me any clues. Aaron ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Federated gateways
=0x7f53f00053f0).reader got front 190 2014-11-11 14:37:06.701449 7f51ff0f0700 10 -- 172.16.10.103:0/1007381 172.16.10.103:6934/14875 pipe(0x7f53f0005160 sd=61 :33168 s=2 pgs=2524 cs=1 l=1 c=0x7f53f00053f0).aborted = 0 2014-11-11 14:37:06.701458 7f51ff0f0700 20 -- 172.16.10.103:0/1007381 172.16.10.103:6934/14875 pipe(0x7f53f0005160 sd=61 :33168 s=2 pgs=2524 cs=1 l=1 c=0x7f53f00053f0).reader got 190 + 0 + 0 byte message 2014-11-11 14:37:06.701569 7f51ff0f0700 10 -- 172.16.10.103:0/1007381 172.16.10.103:6934/14875 pipe(0x7f53f0005160 sd=61 :33168 s=2 pgs=2524 cs=1 l=1 c=0x7f53f00053f0).reader got message 49 0x7f51b4001460 osd_op_reply(1784 statelog.obj_opstate.97 [call] v47531'14 uv14 ondisk = 0) v6 2014-11-11 14:37:06.701597 7f51ff0f0700 20 -- 172.16.10.103:0/1007381 queue 0x7f51b4001460 prio 127 2014-11-11 14:37:06.701627 7f51ff0f0700 20 -- 172.16.10.103:0/1007381 172.16.10.103:6934/14875 pipe(0x7f53f0005160 sd=61 :33168 s=2 pgs=2524 cs=1 l=1 c=0x7f53f00053f0).reader reading tag... 2014-11-11 14:37:06.701636 7f51ff1f1700 10 -- 172.16.10.103:0/1007381 172.16.10.103:6934/14875 pipe(0x7f53f0005160 sd=61 :33168 s=2 pgs=2524 cs=1 l=1 c=0x7f53f00053f0).writer: state = open policy.server=0 2014-11-11 14:37:06.701678 7f51ff1f1700 10 -- 172.16.10.103:0/1007381 172.16.10.103:6934/14875 pipe(0x7f53f0005160 sd=61 :33168 s=2 pgs=2524 cs=1 l=1 c=0x7f53f00053f0).write_ack 49 2014-11-11 14:37:06.701684 7f54ebfff700 1 -- 172.16.10.103:0/1007381 == osd.25 172.16.10.103:6934/14875 49 osd_op_reply(1784 statelog.obj_opstate.97 [call] v47531'14 uv14 ondisk = 0) v6 190+0+0 (1714651716 0 0) 0x7f51b4001460 con 0x7f53f00053f0 2014-11-11 14:37:06.701710 7f51ff1f1700 10 -- 172.16.10.103:0/1007381 172.16.10.103:6934/14875 pipe(0x7f53f0005160 sd=61 :33168 s=2 pgs=2524 cs=1 l=1 c=0x7f53f00053f0).writer: state = open policy.server=0 2014-11-11 14:37:06.701728 7f51ff1f1700 20 -- 172.16.10.103:0/1007381 172.16.10.103:6934/14875 pipe(0x7f53f0005160 sd=61 :33168 s=2 pgs=2524 cs=1 l=1 c=0x7f53f00053f0).writer sleeping 2014-11-11 14:37:06.701751 7f54ebfff700 10 -- 172.16.10.103:0/1007381 dispatch_throttle_release 190 to dispatch throttler 190/104857600 2014-11-11 14:37:06.701762 7f54ebfff700 20 -- 172.16.10.103:0/1007381 done calling dispatch on 0x7f51b4001460 2014-11-11 14:37:06.701815 7f54447f0700 0 WARNING: set_req_state_err err_no=5 resorting to 500 2014-11-11 14:37:06.701894 7f54447f0700 1 == req done req=0x7f546800f3b0 http_status=500 == Any information you could give me would be wonderful as I’ve been banging my head against this for a few days. Thanks, Aaron On Nov 5, 2014, at 3:02 PM, Aaron Bassett aa...@five3genomics.com wrote: Ah so I need both users in both clusters? I think I missed that bit, let me see if that does the trick. Aaron On Nov 5, 2014, at 2:59 PM, Craig Lewis cle...@centraldesktop.com wrote: One region two zones is the standard setup, so that should be fine. Is metadata (users and buckets) being replicated, but not data (objects)? Let's go through a quick checklist: - Verify that you enabled log_meta and log_data in the region.json for the master zone - Verify that RadosGW is using your region map with radosgw-admin regionmap get --name client.radosgw.name - Verifu - Verify that RadosGW is using your zone map with radosgw-admin zone get --name client.radosgw.name - Verify that all the pools in your zone exist (RadosGW only auto-creates the basic ones). - Verify that your system users exist in both zones with the same access and secret. Hopefully that gives you an idea what's not working correctly. 
If it doesn't, crank up the logging on the radosgw daemon on both sides, and check the logs. Add debug rgw = 20 to both ceph.conf (in the client.radosgw.name section), and restart. Hopefully those logs will tell you what's wrong. On Wed, Nov 5, 2014 at 11:39 AM, Aaron Bassett aa...@five3genomics.com wrote: Hello everyone, I am attempted to setup a two cluster situation for object storage disaster recovery. I have two physically separate sites so using 1 big cluster isn’t an option. I’m attempting to follow the guide at: http://ceph.com/docs/v0.80.5/radosgw/federated-config/ . After a couple days of flailing, I’ve settled on using 1 region with two zones, where each cluster is a zone. I’m now attempting to set up an agent as per the “Multi-Site Data Replication section. The agent kicks off ok and starts making all sorts of connections, but no objects were being copied to the non-master zone. I re-ran the agent with the -v flag and saw a lot of: DEBUG:urllib3.connectionpool:GET /admin/opstate?client-id=radosgw-agentobject=test%2F_shadow_.JjVixjWmebQTrRed36FL6D0vy2gDVZ__39op-id=phx-r1-head1%3A2451615%3A1 HTTP/1.1 200 None DEBUG:radosgw_agent.worker:op state is [] DEBUG:radosgw_agent.worker:error geting op state: list index out of range So it appears something is still
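A sketch of those verification steps; the --name value is a placeholder for whatever client section your gateways actually run under:

radosgw-admin regionmap get --name client.radosgw.us-nh-1
radosgw-admin zone get --name client.radosgw.us-nh-1

For the logging, add debug rgw = 20 to the gateway's client section of ceph.conf on both clusters; adding debug ms = 1 as well makes the failing request visible down to the individual OSD ops.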
Re: [ceph-users] pg's stuck for 4-5 days after reaching backfill_toofull
How many OSDs are nearfull? I've seen Ceph want two toofull OSDs to swap PGs. In that case, I dynamically raised mon_osd_nearfull_ratio and osd_backfill_full_ratio a bit, then put them back to normal once the scheduling deadlock finished. Keep in mind that ceph osd reweight is temporary. If you mark an osd OUT then IN, the weight will be set to 1.0. If you need something that's persistent, you can use ceph osd crush reweight osd.NUM crush_weight. Look at ceph osd tree to get the current weight. I also recommend stepping towards your goal. Changing either weight can cause a lot of unrelated migrations, and the crush weight seems to cause more than the osd weight. I step osd weight by 0.125, and crush weight by 0.05. On Tue, Nov 11, 2014 at 12:47 PM, Chad Seys cws...@physics.wisc.edu wrote: Find out which OSD it is: ceph health detail Squeeze blocks off the affected OSD: ceph osd reweight OSDNUM 0.8 Repeat with any OSD which becomes toofull. Your cluster is only about 50% used, so I think this will be enough. Then when it finishes, allow data back on the OSD: ceph osd reweight OSDNUM 1 Hopefully ceph will someday be taught to move PGs in a better order! Chad. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
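For example, stepping a single OSD down (osd.12 is a placeholder; crush reweight takes an absolute weight, so read the current one from ceph osd tree first):

ceph osd tree | grep osd.12
ceph osd reweight 12 0.875
ceph osd crush reweight osd.12 1.75

The first reweight is the temporary kind discussed above, stepped by 0.125 from 1.0; the crush reweight is the persistent kind, and 1.75 only makes sense if the current crush weight was 1.80.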
Re: [ceph-users] Federated gateways
I see you're running 0.80.5. Are you using Apache 2.4? There is a known issue with Apache 2.4 on the primary and replication. It's fixed, just waiting for the next firefly release. Although, that causes 40x errors with Apache 2.4, not 500 errors. It is apache 2.4, but I’m actually running 0.80.7 so I probably have that bug fix? No, the unreleased 0.80.8 has the fix. Have you verified that both system users can read and write to both clusters? (Just make sure you clean up the writes to the slave cluster). Yes I can write everywhere and radosgw-agent isn’t getting any 403s like it was earlier when I had mismatched keys. The .us-nh.rgw.buckets.index pool is syncing properly, as are the users. It seems like really the only thing that isn’t syncing is the .zone.rgw.buckets pool. That's pretty much the same behavior I was seeing with Apache 2.4. Try downgrading the primary cluster to Apache 2.2. In my testing, the secondary cluster could run 2.2 or 2.4. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] emperor - firefly 0.80.7 upgrade problem
If all of your PGs now have an empty down_osds_we_would_probe, I'd run through this discussion again. The commands to tell Ceph to give up on lost data should have an effect now. That's my experience anyway. Nothing progressed until I took care of down_osds_we_would_probe. After that was empty, I was able to repair. It wasn't immediate though. It still took ~24 hours, and a few OSD restarts, for the cluster to get itself healthy. You might try sequentially restarting OSDs. It shouldn't be necessary, but it shouldn't make anything worse. On Mon, Nov 10, 2014 at 7:17 AM, Chad Seys cws...@physics.wisc.edu wrote: Hi Craig and list, If you create a real osd.20, you might want to leave it OUT until you get things healthy again. I created a real osd.20 (and it turns out I needed an osd.21 also). ceph pg x.xx query no longer lists down osds for probing: down_osds_we_would_probe: [], But I cannot find the magic command line which will remove these incomplete PGs. Anyone know how to remove incomplete PGs ? Thanks! Chad. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
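A quick way to confirm that nothing is still waiting to probe a missing OSD before re-running the lost/force-create commands; the awk and grep assume the usual health detail wording and the JSON key name, so eyeball the output rather than trusting it blindly:

for pg in $(ceph health detail | awk '/incomplete/ {print $2}'); do echo -n "$pg: "; ceph pg $pg query | grep down_osds_we_would_probe; done

Every line should show an empty list before ceph osd lost and ceph pg force_create_pg are worth retrying.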
Re: [ceph-users] PG inconsistency
For #1, it depends what you mean by fast. I wouldn't worry about it taking 15 minutes. If you mark the old OSD out, ceph will start remapping data immediately, including a bunch of PGs on unrelated OSDs. Once you replace the disk, and put the same OSDID back in the same host, the CRUSH map will be back to what it was before you started. All of those remaps on unrelated OSDs will reverse. They'll complete fairly quickly, because they only have to backfill the data that was written during the remap. I prefer #1. ceph pg repair will just overwrite the replicas with whatever the primary OSD has, which may copy bad data from your bad OSD over good replicas. So #2 has the potential to corrupt the data. #1 will delete the data you know is bad, leaving only good data behind to replicate. Once ceph pg repair gets more intelligent, I'll revisit this. I also prefer the simplicity. If it's dead or corrupt, they're treated the same. On Sun, Nov 9, 2014 at 7:25 PM, GuangYang yguan...@outlook.com wrote: In terms of disk replacement, to avoid migrating data back and forth, are the below two approaches reasonable? 1. Keep the OSD in and do an ad-hoc disk replacement and provision a new OSD (so that keep the OSD id as the same), and then trigger data migration. In this way the data migration only happens once, however, it does require operators to replace the disk very fast. 2. Move the data on the broken disk to a new disk completely and use Ceph to repair bad objects. Thanks, Guang ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
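One common variant of approach #1, sketched here only as an illustration: set noout so nothing gets remapped while the disk is being swapped, stop just that one daemon, replace the disk and re-create the OSD with the same id, then clear the flag and let it backfill:

ceph osd set noout
ceph osd unset noout

That keeps the CRUSH map untouched for the whole swap, at the cost of running degraded until the new disk has backfilled.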
Re: [ceph-users] OSD commits suicide
Have you tuned any of the recovery or backfill parameters? My ceph.conf has: [osd] osd max backfills = 1 osd recovery max active = 1 osd recovery op priority = 1 Still, if it's running for a few hours, then failing, it sounds like there might be something else at play. OSDs use a lot of RAM during recovery. How much RAM and how many OSDs do you have in these nodes? What does memory usage look like after a fresh restart, and what does it look like when the problems start? Even better if you know what it looks like 5 minutes before the problems start. Is there anything interesting in the kernel logs? OOM killers, or memory deadlocks? On Sat, Nov 8, 2014 at 11:19 AM, Erik Logtenberg e...@logtenberg.eu wrote: Hi, I have some OSD's that keep committing suicide. My cluster has ~1.3M misplaced objects, and it can't really recover, because OSD's keep failing before recovering finishes. The load on the hosts is quite high, but the cluster currently has no other tasks than just the backfilling/recovering. I attached the logfile from a failed OSD. It shows the suicide, the recent events and also me starting the OSD again after some time. It'll keep running for a couple of hours and then fail again, for the same reason. I noticed a lot of timeouts. Apparently ceph stresses the hosts to the limit with the recovery tasks, so much that they timeout and can't finish that task. I don't understand why. Can I somehow throttle ceph a bit so that it doesn't keep overrunning itself? I kinda feel like it should chill out a bit and simply recover one step at a time instead of full force and then fail. Thanks, Erik. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
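If those settings aren't already in ceph.conf, they can also be pushed to running OSDs without a restart; a small sketch for firefly-era releases:

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'
# spot-check one OSD via its admin socket (run on that OSD's host)
ceph daemon osd.0 config show | grep -E 'osd_max_backfills|osd_recovery_max_active|osd_recovery_op_priority'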
Re: [ceph-users] An OSD always crashes a few minutes after start
You're running 0.87-6. There were various fixes for this problem in Firefly. Were any of these snapshots created on early version of Firefly? So far, every fix for this issue has gotten developers involved. I'd see if you can talk to some devs on IRC, or post to the ceph-devel mailing list. My own experience is that I had to delete the affected PGs, and force create them. Hopefully there's a better answer now. On Fri, Nov 7, 2014 at 8:10 PM, Chu Duc Minh chu.ducm...@gmail.com wrote: One of my OSDs have problems and can NOT be start. I tried to start many times but it always crash few minutes after start. I think about two reasons to make it crash: 1. A read/write request to this OSD, but due to the corrupted volume/snapshot/parent-image/..., it crash. 2. The recovering process can NOT work properly due to the corrupted volumes/snapshot/parent-image/... After many retry and check log, i guess the reason (2) is the main cause. Because if (1) is the main cause, other OSDs (contain buggy volume/snapshot) will crash too. State of my ceph cluster (just few seconds before crash time): 111/57706299 objects degraded (0.001%) 14918 active+clean 1 active+clean+scrubbing+deep 52 active+recovery_wait+degraded 2 active+recovering+degraded PS: i attach crash-dump log of that OSD in this email for your information. Thank you! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
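For reference, deleting an affected PG copy from a stopped OSD is usually done with the objectstore tool; the binary name and flags differ between releases (ceph_objectstore_tool in giant, ceph-objectstore-tool later), so treat this purely as a sketch, export the PG first, and involve the devs before removing anything:

stop ceph-osd id=3                                      # hypothetical OSD holding the bad PG copy
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 \
    --journal-path /var/lib/ceph/osd/ceph-3/journal \
    --pgid 2.5 --op export --file /root/pg2.5.export    # keep a copy before touching anything
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 \
    --journal-path /var/lib/ceph/osd/ceph-3/journal \
    --pgid 2.5 --op remove
ceph pg force_create_pg 2.5                             # only if no surviving copy exists anywhere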
Re: [ceph-users] Stuck in stale state
nothing to send, going to standby isn't necessarily bad, I see it from time to time. It shouldn't stay like that for long though. If it's been 5 minutes, and the cluster still isn't doing anything, I'd restart that osd. On Fri, Nov 7, 2014 at 1:55 PM, Jan Pekař jan.pe...@imatic.cz wrote: Hi, I was testing ceph cluster map changes and I got to stuck state which seems to be indefinite. First my description what I have done. I'm testing special case with only one copy of pg's (pool size = 1). All pg's was on one osd.0. I created second osd.1 and modified cluster map to transfer one pool (metadata) to the newly created osd.1 PG's started to remap and objects degraded number was dropping - so everything looked normal. During that recovery process I restarted both osd daemons. After that I noticed, that pg's, that should be remapped had stale state - stale+active+remapped+backfilling and other object with stale state . I tried to run ceph pg force_create_pg on one pg, that should be remapped, but nothing changed (that is 1 stuck / creating PG below in ceph health) Command rados -p metadata ls hangs so data are unavailable, but it should be there. What should I do in this state to get it working? ceph -s below: cluster 93418692-8e2e-4689-a237-ed5b47f39f72 health HEALTH_WARN 52 pgs backfill; 1 pgs backfilling; 63 pgs stale; 1 pgs stuck inactive; 63 pgs stuck stale; 54 pgs stuck unclean; recovery 107232/1881806 objects degraded (5.698%); mon.imatic-mce low disk space monmap e1: 1 mons at {imatic-mce=192.168.11.165:6789/0}, election epoch 1, quorum 0 imatic-mce mdsmap e450: 1/1/1 up {0=imatic-mce=up:active} osdmap e275: 2 osds: 2 up, 2 in pgmap v51624: 448 pgs, 4 pools, 790 GB data, 1732 kobjects 804 GB used, 2915 GB / 3720 GB avail 107232/1881806 objects degraded (5.698%) 52 stale+active+remapped+wait_backfill 1 creating 1 stale+active+remapped+backfilling 10 stale+active+clean 384 active+clean Last message in OSD log's: 2014-11-07 22:17:45.402791 deb4db70 0 -- 192.168.11.165:6804/29564 192.168.11.165:6807/29939 pipe(0x9d52f00 sd=213 :53216 s=2 pgs=1 cs=1 l=0 c=0x2c7f58c0).fault with nothing to send, going to standby Thank you for help With regards Jan Pekar, ceph fan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
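A small sketch of what that check-then-restart can look like; the osd id and init flavor are assumptions:

ceph pg dump_stuck stale
ceph health detail | grep -i stale
service ceph restart osd.1        # or on upstart: stop ceph-osd id=1 && start ceph-osd id=1
ceph -w                           # watch whether the stale PGs start peering again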
Re: [ceph-users] osd down
Yes, removing an OSD before re-creating it will give you the same OSD ID. That's my preferred method, because it keeps the crushmap the same. Only PGs that existed on the replaced disk need to be backfilled. I don't know if adding the replacement to the same host then removing the old OSD gives you the same CRUSH map as the reverse. I suspect not, because the OSDs are re-ordered on that host. On Mon, Nov 10, 2014 at 1:29 PM, Shain Miley smi...@npr.org wrote: Craig, Thanks for the info. I ended up doing a zap and then a create via ceph-deploy. One question that I still have is surrounding adding the failed osd back into the pool. In this example...osd.70 was badwhen I added it back in via ceph-deploy...the disk was brought up as osd.108. Only after osd.108 was up and running did I think to remove osd.70 from the crush map etc. My question is this...had I removed it from the crush map prior to my ceph-deploy create...should/would Ceph have reused the osd number 70? I would prefer to replace a failed disk with a new one and keep the old osd assignment...if possible that is why I am asking. Anyway...thanks again for all the help. Shain Sent from my iPhone On Nov 7, 2014, at 2:09 PM, Craig Lewis cle...@centraldesktop.com wrote: I'd stop that osd daemon, and run xfs_check / xfs_repair on that partition. If you repair anything, you should probably force a deep-scrub on all the PGs on that disk. I think ceph osd deep-scrub osdid will do that, but you might have to manually grep ceph pg dump . Or you could just treat it like a failed disk, but re-use the disk. ceph-disk-prepare --zap-disk should take care of you. On Thu, Nov 6, 2014 at 5:06 PM, Shain Miley smi...@npr.org wrote: I tried restarting all the osd's on that node, osd.70 was the only ceph process that did not come back online. There is nothing in the ceph-osd log for osd.70. However I do see over 13,000 of these messages in the kern.log: Nov 6 19:54:27 hqosd6 kernel: [34042786.392178] XFS (sdl1): xfs_log_force: error 5 returned. Does anyone have any suggestions on how I might be able to get this HD back in the cluster (or whether or not it is worth even trying). Thanks, Shain Shain Miley | Manager of Systems and Infrastructure, Digital Media | smi...@npr.org | 202.513.3649 From: Shain Miley [smi...@npr.org] Sent: Tuesday, November 04, 2014 3:55 PM To: ceph-users@lists.ceph.com Subject: osd down Hello, We are running ceph version 0.80.5 with 108 osd's. Today I noticed that one of the osd's is down: root@hqceph1:/var/log/ceph# ceph -s cluster 504b5794-34bd-44e7-a8c3-0494cf800c23 health HEALTH_WARN crush map has legacy tunables monmap e1: 3 mons at {hqceph1= 10.35.1.201:6789/0,hqceph2=10.35.1.203:6789/0,hqceph3=10.35.1.205:6789/0 }, election epoch 146, quorum 0,1,2 hqceph1,hqceph2,hqceph3 osdmap e7119: 108 osds: 107 up, 107 in pgmap v6729985: 3208 pgs, 17 pools, 81193 GB data, 21631 kobjects 216 TB used, 171 TB / 388 TB avail 3204 active+clean 4 active+clean+scrubbing client io 4079 kB/s wr, 8 op/s Using osd dump I determined that it is osd number 70: osd.70 down out weight 0 up_from 2668 up_thru 6886 down_at 6913 last_clean_interval [488,2665) 10.35.1.217:6814/22440 10.35.1.217:6820/22440 10.35.1.217:6824/22440 10.35.1.217:6830/22440 autoout,exists http://10.35.1.217:6830/22440autoout,exists 5dbd4a14-5045-490e-859b-15533cd67568 Looking at that node, the drive is still mounted and I did not see any errors in any of the system logs, and the raid level status shows the drive as up and healthy, etc. 
root@hqosd6:~# df -h |grep 70 /dev/sdl1 3.7T 1.9T 1.9T 51% /var/lib/ceph/osd/ceph-70 I was hoping that someone might be able to advise me on the next course of action (can I add the osd back in?, should I replace the drive altogether, etc) I have attached the osd log to this email. Any suggestions would be great. Thanks, Shain -- Shain Miley | Manager of Systems and Infrastructure, Digital Media | smi...@npr.org | 202.513.3649 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
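For the deep-scrub step mentioned above, a hedged sketch using osd.70 from this thread; the pg dump filter is crude, so verify the PG list before acting on it:

ceph osd deep-scrub 70            # asks osd.70 to deep-scrub the PGs it is primary for
# to also cover PGs where osd.70 is only a replica, pull them out of pg dump
ceph pg dump 2>/dev/null | grep -E '\[([0-9]+,)*70(,[0-9]+)*\]' | awk '{print $1}' | \
    while read pg; do ceph pg deep-scrub "$pg"; done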
Re: [ceph-users] emperor - firefly 0.80.7 upgrade problem
I had the same experience with force_create_pg too. I ran it, and the PGs sat there in creating state. I left the cluster overnight, and sometime in the middle of the night, they created. The actual transition from creating to active+clean happened during the recovery after a single OSD was kicked out. I don't recall if that single OSD was responsible for the creating PGs. I really can't say what un-jammed my creating. On Mon, Nov 10, 2014 at 12:33 PM, Chad Seys cws...@physics.wisc.edu wrote: Hi Craig, If all of your PGs now have an empty down_osds_we_would_probe, I'd run through this discussion again. Yep, looks to be true. So I ran: # ceph pg force_create_pg 2.5 and it has been creating for about 3 hours now. :/ # ceph health detail | grep creating pg 2.5 is stuck inactive since forever, current state creating, last acting [] pg 2.5 is stuck unclean since forever, current state creating, last acting [] Then I restart all OSDs. The creating label disapears and I'm back with same number of incomplete PGs. :( is the 'force_create_pg' the right command? The 'mark_unfound_lost' complains that 'pg has no unfound objects' . I shall start the 'force_create_pg' again and wait longer. Unless there is a different command to use. ? Thanks! Chad. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] buckets and users
You need separate pools for the different zones, otherwise both zones will have the same data. You could use the defaults for the first zone, but the second zone will need it's own. You might as well follow the convention of creating non-default pools for the zone. This is all semantics, but regions are generally seen as more distinct than zones. It's up to you if you want separate regions or the same region with separate zones. The end result is the same either way. On Fri, Nov 7, 2014 at 2:06 AM, Marco Garcês ma...@garces.cc wrote: So I really need to create the region also? I thought it was using the default region, so I didn't have to create extra regions. Let me try to figure this out, the docs are a little bit confusing. Marco Garcês On Thu, Nov 6, 2014 at 6:39 PM, Craig Lewis cle...@centraldesktop.com wrote: You need to tell each radosgw daemon which zone to use. In ceph.conf, I have: [client.radosgw.ceph3c] host = ceph3c rgw socket path = /var/run/ceph/radosgw.ceph3c keyring = /etc/ceph/ceph.client.radosgw.ceph3c.keyring log file = /var/log/ceph/radosgw.log admin socket = /var/run/ceph/radosgw.asok rgw dns name = us-central-1.ceph.cdlocal rgw region = us rgw region root pool = .us.rgw.root rgw zone = us-central-1 rgw zone root pool = .us-central-1.rgw.root On Thu, Nov 6, 2014 at 6:35 AM, Marco Garcês ma...@garces.cc wrote: Update: I was able to fix the authentication error, and I have 2 radosgw running on the same host. The problem now, is, I believe I have created the zone wrong, or, I am doing something wrong, because I can login with the user I had before, and I can access his buckets. I need to have everything separated. Here are my zone info: default zone: { domain_root: .rgw, control_pool: .rgw.control, gc_pool: .rgw.gc, log_pool: .log, intent_log_pool: .intent-log, usage_log_pool: .usage, user_keys_pool: .users, user_email_pool: .users.email, user_swift_pool: .users.swift, user_uid_pool: .users.uid, system_key: { access_key: , secret_key: }, placement_pools: [ { key: default-placement, val: { index_pool: .rgw.buckets.index, data_pool: .rgw.buckets, data_extra_pool: .rgw.buckets.extra}}]} env2 zone: { domain_root: .rgw, control_pool: .rgw.control, gc_pool: .rgw.gc, log_pool: .log, intent_log_pool: .intent-log, usage_log_pool: .usage, user_keys_pool: .users, user_email_pool: .users.email, user_swift_pool: .users.swift, user_uid_pool: .users.uid, system_key: { access_key: , secret_key: }, placement_pools: [ { key: default-placement, val: { index_pool: .rgw.buckets.index, data_pool: .rgw.buckets, data_extra_pool: .rgw.buckets.extra}}]} Could you guys help me? Marco Garcês On Thu, Nov 6, 2014 at 3:56 PM, Marco Garcês ma...@garces.cc wrote: By the way, Is it possible to run 2 radosgw on the same host? I think I have created the zone, not sure if it was correct, because it used the default pool names, even though I had changed them in the json file I had provided. Now I am trying to run ceph-radosgw with two different entries in the ceph.conf file, but without sucess. 
Example: [client.radosgw.gw] host = GATEWAY keyring = /etc/ceph/keyring.radosgw.gw rgw socket path = /var/run/ceph/ceph.radosgw.gateway.fastcgi.sock log file = /var/log/ceph/client.radosgw.gateway.log rgw print continue = false rgw dns name = gateway.local rgw enable ops log = false rgw enable usage log = true rgw usage log tick interval = 30 rgw usage log flush threshold = 1024 rgw usage max shards = 32 rgw usage max user shards = 1 rgw cache lru size = 15000 rgw thread pool size = 2048 #[client.radosgw.gw.env2] #host = GATEWAY #keyring = /etc/ceph/keyring.radosgw.gw #rgw socket path = /var/run/ceph/ceph.env2.radosgw.gateway.fastcgi.sock #log file = /var/log/ceph/client.env2.radosgw.gateway.log #rgw print continue = false #rgw dns name = cephppr.local #rgw enable ops log = false #rgw enable usage log = true #rgw usage log tick interval = 30 #rgw usage log flush threshold = 1024 #rgw usage max shards = 32 #rgw usage max user shards = 1 #rgw cache lru size = 15000 #rgw thread pool size = 2048 #rgw zone = ppr It fails to create the socket: 2014-11-06 15:39:08.862364 7f80cc670880 0 ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6), process radosgw, pid 7930 2014-11-06 15:39:08.870429 7f80cc670880 0 librados: client.radosgw.gw.env2 authentication error (1) Operation not permitted 2014-11-06 15:39:08.870889 7f80cc670880 -1 Couldn't init storage provider (RADOS) What am I doing wrong? Marco Garcês #sysadmin Maputo
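The "authentication error (1) Operation not permitted" at the end is usually just the new client name having no cephx key of its own; a hedged sketch of creating one, using the client name from the config above and a hypothetical keyring path:

ceph auth get-or-create client.radosgw.gw.env2 \
    osd 'allow rwx' mon 'allow rwx' \
    -o /etc/ceph/keyring.radosgw.gw.env2
# then point the second instance at its own keyring (plus its own socket and log paths):
#   [client.radosgw.gw.env2]
#   keyring = /etc/ceph/keyring.radosgw.gw.env2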
Re: [ceph-users] Is it normal that an OSD's memory exceeds 1GB under stress test?
It depends on which version of ceph, but it's pretty normal under newer versions. There are a bunch of variables. How many PGs per OSD, how much data is in the PGs, etc. I'm a bit light on the PGs (~60 PGs per OSD), and heavy on the data (~3 TiB of data on each OSD). In the production cluster, under peak user traffic, my OSDs are using around 1GiB of memory. If there is some scrubbing, deep-scrubbing, or a recovery, I've seen individual OSDs go as high as 4 GiB. Which causes some problems... On Thu, Nov 6, 2014 at 11:00 PM, 谢锐 xie...@szsandstone.com wrote: and make one osd down.then do stress test by fio. -- Original -- From: 谢锐xie...@szsandstone.com; Date: Fri, Nov 7, 2014 02:50 PM To: ceph-usersceph-us...@ceph.com; Subject: [ceph-users] Is it normal that osd's memory exceed 1GB under stresstest? I set mon_osd_down_out_interval to two days,and do stress test. the memory of osd exceed 1GB. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
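If you want to see where the memory is going, the OSD admin socket and tcmalloc heap profiler can report (and release) heap usage; a small sketch, run against one OSD and assuming a tcmalloc build:

ceph daemon osd.3 perf dump | less     # internal counters (run on that OSD's host)
ceph tell osd.3 heap stats             # tcmalloc heap summary
ceph tell osd.3 heap release           # hand freed-but-held pages back to the OS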
Re: [ceph-users] osd down
I'd stop that osd daemon, and run xfs_check / xfs_repair on that partition. If you repair anything, you should probably force a deep-scrub on all the PGs on that disk. I think ceph osd deep-scrub osdid will do that, but you might have to manually grep ceph pg dump . Or you could just treat it like a failed disk, but re-use the disk. ceph-disk-prepare --zap-disk should take care of you. On Thu, Nov 6, 2014 at 5:06 PM, Shain Miley smi...@npr.org wrote: I tried restarting all the osd's on that node, osd.70 was the only ceph process that did not come back online. There is nothing in the ceph-osd log for osd.70. However I do see over 13,000 of these messages in the kern.log: Nov 6 19:54:27 hqosd6 kernel: [34042786.392178] XFS (sdl1): xfs_log_force: error 5 returned. Does anyone have any suggestions on how I might be able to get this HD back in the cluster (or whether or not it is worth even trying). Thanks, Shain Shain Miley | Manager of Systems and Infrastructure, Digital Media | smi...@npr.org | 202.513.3649 From: Shain Miley [smi...@npr.org] Sent: Tuesday, November 04, 2014 3:55 PM To: ceph-users@lists.ceph.com Subject: osd down Hello, We are running ceph version 0.80.5 with 108 osd's. Today I noticed that one of the osd's is down: root@hqceph1:/var/log/ceph# ceph -s cluster 504b5794-34bd-44e7-a8c3-0494cf800c23 health HEALTH_WARN crush map has legacy tunables monmap e1: 3 mons at {hqceph1= 10.35.1.201:6789/0,hqceph2=10.35.1.203:6789/0,hqceph3=10.35.1.205:6789/0}, election epoch 146, quorum 0,1,2 hqceph1,hqceph2,hqceph3 osdmap e7119: 108 osds: 107 up, 107 in pgmap v6729985: 3208 pgs, 17 pools, 81193 GB data, 21631 kobjects 216 TB used, 171 TB / 388 TB avail 3204 active+clean 4 active+clean+scrubbing client io 4079 kB/s wr, 8 op/s Using osd dump I determined that it is osd number 70: osd.70 down out weight 0 up_from 2668 up_thru 6886 down_at 6913 last_clean_interval [488,2665) 10.35.1.217:6814/22440 10.35.1.217:6820/22440 10.35.1.217:6824/22440 10.35.1.217:6830/22440 autoout,exists 5dbd4a14-5045-490e-859b-15533cd67568 Looking at that node, the drive is still mounted and I did not see any errors in any of the system logs, and the raid level status shows the drive as up and healthy, etc. root@hqosd6:~# df -h |grep 70 /dev/sdl1 3.7T 1.9T 1.9T 51% /var/lib/ceph/osd/ceph-70 I was hoping that someone might be able to advise me on the next course of action (can I add the osd back in?, should I replace the drive altogether, etc) I have attached the osd log to this email. Any suggestions would be great. Thanks, Shain -- Shain Miley | Manager of Systems and Infrastructure, Digital Media | smi...@npr.org | 202.513.3649 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
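A hedged sketch of that repair path for this particular OSD; the device and mount point come from the thread, xfs_repair must run on an unmounted filesystem, and on newer xfsprogs xfs_check is deprecated in favor of xfs_repair -n:

stop ceph-osd id=70                       # or: service ceph stop osd.70
umount /var/lib/ceph/osd/ceph-70
xfs_repair -n /dev/sdl1                   # dry run: report problems only
xfs_repair /dev/sdl1                      # actual repair (may need -L if the XFS log is corrupt)
mount /dev/sdl1 /var/lib/ceph/osd/ceph-70
start ceph-osd id=70
ceph osd deep-scrub 70                    # then deep-scrub the PGs on that disk, as above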
Re: [ceph-users] emperor - firefly 0.80.7 upgrade problem
ceph-disk-prepare will give you the next unused number. So this will work only if the osd you remove is greater than 20. On Thu, Nov 6, 2014 at 12:12 PM, Chad Seys cws...@physics.wisc.edu wrote: Hi Craig, You'll have trouble until osd.20 exists again. Ceph really does not want to lose data. Even if you tell it the osd is gone, ceph won't believe you. Once ceph can probe any osd that claims to be 20, it might let you proceed with your recovery. Then you'll probably need to use ceph pg pgid mark_unfound_lost. If you don't have a free bay to create a real osd.20, it's possible to fake it with some small loop-back filesystems. Bring it up and mark it OUT. It will probably cause some remapping. I would keep it around until you get things healthy. If you create a real osd.20, you might want to leave it OUT until you get things healthy again. Thanks for the recovery tip! I would guess I could safely remove an OSD (mark OUT, wait for migration to stop, then crush osd rm) and then add back in as osd.20 would work? New switch: --yes-i-really-REALLY-mean-it ;) Chad. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] buckets and users
You need to tell each radosgw daemon which zone to use. In ceph.conf, I have: [client.radosgw.ceph3c] host = ceph3c rgw socket path = /var/run/ceph/radosgw.ceph3c keyring = /etc/ceph/ceph.client.radosgw.ceph3c.keyring log file = /var/log/ceph/radosgw.log admin socket = /var/run/ceph/radosgw.asok rgw dns name = us-central-1.ceph.cdlocal rgw region = us rgw region root pool = .us.rgw.root rgw zone = us-central-1 rgw zone root pool = .us-central-1.rgw.root On Thu, Nov 6, 2014 at 6:35 AM, Marco Garcês ma...@garces.cc wrote: Update: I was able to fix the authentication error, and I have 2 radosgw running on the same host. The problem now, is, I believe I have created the zone wrong, or, I am doing something wrong, because I can login with the user I had before, and I can access his buckets. I need to have everything separated. Here are my zone info: default zone: { domain_root: .rgw, control_pool: .rgw.control, gc_pool: .rgw.gc, log_pool: .log, intent_log_pool: .intent-log, usage_log_pool: .usage, user_keys_pool: .users, user_email_pool: .users.email, user_swift_pool: .users.swift, user_uid_pool: .users.uid, system_key: { access_key: , secret_key: }, placement_pools: [ { key: default-placement, val: { index_pool: .rgw.buckets.index, data_pool: .rgw.buckets, data_extra_pool: .rgw.buckets.extra}}]} env2 zone: { domain_root: .rgw, control_pool: .rgw.control, gc_pool: .rgw.gc, log_pool: .log, intent_log_pool: .intent-log, usage_log_pool: .usage, user_keys_pool: .users, user_email_pool: .users.email, user_swift_pool: .users.swift, user_uid_pool: .users.uid, system_key: { access_key: , secret_key: }, placement_pools: [ { key: default-placement, val: { index_pool: .rgw.buckets.index, data_pool: .rgw.buckets, data_extra_pool: .rgw.buckets.extra}}]} Could you guys help me? Marco Garcês On Thu, Nov 6, 2014 at 3:56 PM, Marco Garcês ma...@garces.cc wrote: By the way, Is it possible to run 2 radosgw on the same host? I think I have created the zone, not sure if it was correct, because it used the default pool names, even though I had changed them in the json file I had provided. Now I am trying to run ceph-radosgw with two different entries in the ceph.conf file, but without sucess. 
Example: [client.radosgw.gw] host = GATEWAY keyring = /etc/ceph/keyring.radosgw.gw rgw socket path = /var/run/ceph/ceph.radosgw.gateway.fastcgi.sock log file = /var/log/ceph/client.radosgw.gateway.log rgw print continue = false rgw dns name = gateway.local rgw enable ops log = false rgw enable usage log = true rgw usage log tick interval = 30 rgw usage log flush threshold = 1024 rgw usage max shards = 32 rgw usage max user shards = 1 rgw cache lru size = 15000 rgw thread pool size = 2048 #[client.radosgw.gw.env2] #host = GATEWAY #keyring = /etc/ceph/keyring.radosgw.gw #rgw socket path = /var/run/ceph/ceph.env2.radosgw.gateway.fastcgi.sock #log file = /var/log/ceph/client.env2.radosgw.gateway.log #rgw print continue = false #rgw dns name = cephppr.local #rgw enable ops log = false #rgw enable usage log = true #rgw usage log tick interval = 30 #rgw usage log flush threshold = 1024 #rgw usage max shards = 32 #rgw usage max user shards = 1 #rgw cache lru size = 15000 #rgw thread pool size = 2048 #rgw zone = ppr It fails to create the socket: 2014-11-06 15:39:08.862364 7f80cc670880 0 ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6), process radosgw, pid 7930 2014-11-06 15:39:08.870429 7f80cc670880 0 librados: client.radosgw.gw.env2 authentication error (1) Operation not permitted 2014-11-06 15:39:08.870889 7f80cc670880 -1 Couldn't init storage provider (RADOS) What am I doing wrong? Marco Garcês #sysadmin Maputo - Mozambique [Skype] marcogarces On Thu, Nov 6, 2014 at 10:11 AM, Marco Garcês ma...@garces.cc wrote: Your solution of pre-pending the environment name to the bucket, was my first choice, but at the moment I can't ask the devs to change the code to do that. For now I have to stick with the zones solution. Should I follow the federated zones docs (http://ceph.com/docs/master/radosgw/federated-config/) but skip the sync step? Thank you, Marco Garcês On Wed, Nov 5, 2014 at 8:13 PM, Craig Lewis cle...@centraldesktop.com wrote: You could setup dedicated zones for each environment, and not replicate between them. Each zone would have it's own URL, but you would be able to re-use usernames and bucket names. If different URLs are a problem, you might be able to get around that in the load balancer or the web servers. I wouldn't really recommend that, but it's possible. I have a similar requirement. I was able to pre-pending
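Following the advice above about giving each zone its own pools, a hedged sketch of what the second zone's definition could look like; the "env2" zone name and pool names are placeholders, the pools still have to be created, and the region map needs updating afterwards:

radosgw-admin zone set --rgw-zone=env2 --infile env2-zone.json --name client.radosgw.gw.env2
radosgw-admin regionmap update --name client.radosgw.gw.env2

cat env2-zone.json
{ "domain_root": ".env2.rgw",
  "control_pool": ".env2.rgw.control",
  "gc_pool": ".env2.rgw.gc",
  "log_pool": ".env2.log",
  "intent_log_pool": ".env2.intent-log",
  "usage_log_pool": ".env2.usage",
  "user_keys_pool": ".env2.users",
  "user_email_pool": ".env2.users.email",
  "user_swift_pool": ".env2.users.swift",
  "user_uid_pool": ".env2.users.uid",
  "placement_pools": [
    { "key": "default-placement",
      "val": { "index_pool": ".env2.rgw.buckets.index",
               "data_pool": ".env2.rgw.buckets",
               "data_extra_pool": ".env2.rgw.buckets.extra" }}]}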
Re: [ceph-users] Basic Ceph Questions
On Wed, Nov 5, 2014 at 11:57 PM, Wido den Hollander w...@42on.com wrote: On 11/05/2014 11:03 PM, Lindsay Mathieson wrote: - Geo Replication - thats done via federated gateways? looks complicated :( * The remote slave, it would be read only? That is only for the RADOS Gateway. Ceph itself (RADOS) does not support Geo Replication. That is only for the RADOS Gateway. Ceph itself (RADOS) does not support Geo Replication. The 3 services built on top of RADOS support backups, but RADOS itself does not. For RDB, you can use snapshot diffs, and ship them offsite (see various threads on the ML). For RadosGW, there is Federation. For CephFS, you can use traditional POSIX filesystem backup tools. - Disaster strikes, apart from DR backups how easy is it to recover your data off ceph OSD's? one of the things I liked about gluster was that if I totally screwed up the gluster masters, I could always just copy the data off the filesystem. Not so much with ceph. It's a bit harder with Ceph. Eventually it is doable, but that is something that would take a lot of time. In practice, not really. Out of curiosity, I attempted this for some RadosGW objects. It was easy when there was a single object less than 4MB. It very quickyl became complicated with a few larger objects. You'd have to have a very deep understanding of the service to track all of the information down with the cluster offline. It's definitely possible, just not practical. - Am I abusing ceph? :) I just have a small 3 node VM server cluster with 20 windows VM;s, some servers, some VDI. The shared store is a QNAP nas which is struggling. I'm using ceph for - Shared Storage - Replication/Redundancy - Improved performance I think that 3 nodes is not sufficient, Ceph really starts performing when you go 10 nodes (excluding monitors). If it meets your needs, then it's working. :-) You're going to spend a lot more time managing the 3 node Ceph cluster than you spent on the QNAP. If it doesn't make sense for you to spent a lot of time dealing with storage, then a single shared store with more IOPS would be a better fit. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
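For the RBD case, the snapshot-diff shipping mentioned above looks roughly like this; image name, snapshot names, and the remote host are placeholders, and the destination image and starting snapshot must already exist before import-diff will apply an incremental:

# one-time full copy of the image to the backup cluster
rbd export rbd/vm-disk - | ssh backup-host rbd import - rbd/vm-disk
rbd snap create rbd/vm-disk@base
ssh backup-host rbd snap create rbd/vm-disk@base
# nightly incremental: ship only the blocks changed since the previous snapshot
rbd snap create rbd/vm-disk@2014-11-06
rbd export-diff --from-snap base rbd/vm-disk@2014-11-06 - | \
    ssh backup-host rbd import-diff - rbd/vm-disk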
Re: [ceph-users] emperor - firefly 0.80.7 upgrade problem
On Thu, Nov 6, 2014 at 11:27 AM, Chad Seys Also, are you certain that osd 20 is not up? -Sam Yep. # ceph osd metadata 20 Error ENOENT: osd.20 does not exist So part of ceph thinks osd.20 doesn't exist, but another part (the down_osds_we_would_probe) thinks the osd exists and is down? You'll have trouble until osd.20 exists again. Ceph really does not want to lose data. Even if you tell it the osd is gone, ceph won't believe you. Once ceph can probe any osd that claims to be 20, it might let you proceed with your recovery. Then you'll probably need to use ceph pg pgid mark_unfound_lost. If you don't have a free bay to create a real osd.20, it's possible to fake it with some small loop-back filesystems. Bring it up and mark it OUT. It will probably cause some remapping. I would keep it around until you get things healthy. If you create a real osd.20, you might want to leave it OUT until you get things healthy again. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
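For completeness, a very hedged sketch of the loop-back trick; it assumes id 20 is currently the lowest free osd id and an upstart/sysvinit-style init script, it exists only to make osd.20 probe-able, and it should stay OUT and be torn down once the cluster is healthy:

dd if=/dev/zero of=/srv/fake-osd-20.img bs=1M count=10240
losetup /dev/loop0 /srv/fake-osd-20.img
mkfs.xfs /dev/loop0
mkdir -p /var/lib/ceph/osd/ceph-20
mount /dev/loop0 /var/lib/ceph/osd/ceph-20
ceph osd create                          # should hand back 20 if it is the lowest free id
ceph-osd -i 20 --mkfs --mkkey
ceph auth add osd.20 osd 'allow *' mon 'allow profile osd' \
    -i /var/lib/ceph/osd/ceph-20/keyring
start ceph-osd id=20                     # or: service ceph start osd.20
ceph osd out 20                          # keep it OUT so no data migrates onto it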