[ceph-users] radosgw-admin orphans find -- Hammer
Hello,

we need to reclaim a lot of space wasted by RGW orphans in our production Hammer cluster (0.94.10 on Ubuntu 14.04). According to http://tracker.ceph.com/issues/18258 there is a bug in the radosgw-admin orphans find command that causes it to get stuck in an infinite loop. From the bug report I cannot tell if there are unusual circumstances that need to be present to trigger the infinite-loop condition, or if I am more or less guaranteed to hit the issue.

The bug has been fixed, but not in Hammer. Any chance of getting it backported into Hammer? Is the fix in the radosgw-admin tool itself, or are there more/other components that would have to be touched?

As the cluster has about 200 million objects, I would rather not just “try my luck” and get stuck in the middle. Any insight on this would be appreciated.

Thanks a lot,
Daniel

--
Daniel Schneller
Principal Cloud Engineer
CenterDevice GmbH | Hochstraße 11 | 42697 Solingen | Deutschland
tel: +49 1754155711
daniel.schnel...@centerdevice.de | www.centerdevice.de
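P.S.: For reference, in case the fix does get backported -- as far as I understand the tool, an orphan scan is a multi-step affair, roughly along these lines (pool name and job id below are only examples from our side, not anything official):

# start (or resume) a scan over the RGW data pool; the job id lets you resume or clean up later
radosgw-admin orphans find --pool=.rgw.buckets --job-id=orphans-scan-1

# see which scan jobs exist / are still around
radosgw-admin orphans list-jobs

# remove the intermediate scan data once the results have been dealt with
radosgw-admin orphans finish --job-id=orphans-scan-1

If I read it correctly, the find step only reports the presumably orphaned RADOS objects; actually deleting them is still a manual "rados -p <pool> rm <object>" per object -- which is exactly why I would like to be sure the scan itself terminates.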
[ceph-users] OSD omap disk write bursts
Hello!

We are observing a somewhat strange IO pattern on our OSDs. The cluster is running Hammer 0.94.1, 48 OSDs, 4 TB spinners, xfs, colocated journals.

Over periods of days on end we see groups of 3 OSDs being busy with lots and lots of small writes for several minutes at a time. Once one group calms down, another group begins. Might be easier to understand in a graph: https://public.centerdevice.de/3e62a18d-dd01-477e-b52b-f65d181e2920 (this shows a limited time range to make the individual lines discernible)

Initial attempts to correlate this with client activity involving small writes turned out to be wrong -- not really surprising, because both VM RBD activity and RGW object storage should show much more evenly spread patterns across all OSDs.

Using sysdig I figured out it seems to be LevelDB activity:

[16:58:42 B|daniel.schneller@node02] ~ ➜ sudo sysdig -p "%12user.name %6proc.pid %12proc.name %3fd.num %fd.typechar %fd.name" "evt.type=write and proc.pid=8215"
root 8215 ceph-osd 153 f /var/lib/ceph/osd/ceph-14/current/omap/763308.log
... (*lots and lots* more writes to 763308.log ) ...
root 8215 ceph-osd 153 f /var/lib/ceph/osd/ceph-14/current/omap/763308.log
root 8215 ceph-osd 153 f /var/lib/ceph/osd/ceph-14/current/omap/763308.log
root 8215 ceph-osd 103 f /var/lib/ceph/osd/ceph-14/current/omap/763310.log
root 8215 ceph-osd 103 f /var/lib/ceph/osd/ceph-14/current/omap/763310.log
root 8215 ceph-osd 15 f /var/lib/ceph/osd/ceph-14/current/omap/LOG
root 8215 ceph-osd 15 f /var/lib/ceph/osd/ceph-14/current/omap/LOG
root 8215 ceph-osd 103 f /var/lib/ceph/osd/ceph-14/current/omap/763310.log
root 8215 ceph-osd 103 f /var/lib/ceph/osd/ceph-14/current/omap/763310.log
root 8215 ceph-osd 103 f /var/lib/ceph/osd/ceph-14/current/omap/763310.log
root 8215 ceph-osd 103 f /var/lib/ceph/osd/ceph-14/current/omap/763310.log
root 8215 ceph-osd 153 f /var/lib/ceph/osd/ceph-14/current/omap/763311.ldb
root 8215 ceph-osd 103 f /var/lib/ceph/osd/ceph-14/current/omap/763310.log
root 8215 ceph-osd 103 f /var/lib/ceph/osd/ceph-14/current/omap/763310.log
root 8215 ceph-osd 153 f /var/lib/ceph/osd/ceph-14/current/omap/763311.ldb
root 8215 ceph-osd 153 f /var/lib/ceph/osd/ceph-14/current/omap/763311.ldb
root 8215 ceph-osd 153 f /var/lib/ceph/osd/ceph-14/current/omap/763311.ldb
root 8215 ceph-osd 153 f /var/lib/ceph/osd/ceph-14/current/omap/763311.ldb
root 8215 ceph-osd 153 f /var/lib/ceph/osd/ceph-14/current/omap/763311.ldb
root 8215 ceph-osd 153 f /var/lib/ceph/osd/ceph-14/current/omap/763311.ldb
root 8215 ceph-osd 153 f /var/lib/ceph/osd/ceph-14/current/omap/763311.ldb
... (*lots and lots* more writes to 763311.ldb ) ...
root 8215 ceph-osd 15 f /var/lib/ceph/osd/ceph-14/current/omap/LOG
root 8215 ceph-osd 15 f /var/lib/ceph/osd/ceph-14/current/omap/LOG
root 8215 ceph-osd 18 f /var/lib/ceph/osd/ceph-14/current/omap/MANIFEST-171304
root 8215 ceph-osd 18 f /var/lib/ceph/osd/ceph-14/current/omap/MANIFEST-171304
root 8215 ceph-osd 15 f /var/lib/ceph/osd/ceph-14/current/omap/LOG
root 8215 ceph-osd 15 f /var/lib/ceph/osd/ceph-14/current/omap/LOG
root 8215 ceph-osd 103 f /var/lib/ceph/osd/ceph-14/current/omap/763310.log
root 8215 ceph-osd 103 f /var/lib/ceph/osd/ceph-14/current/omap/763310.log
... (*lots and lots* more writes to 763310.log ) ...

This correlates to the patterns in the graph for the given OSDs. If I understand this correctly, it looks like LevelDB compaction -- however, if that is the case, why would that happen in groups of only three at a time, and why would it hit a single OSD in short succession?
See this single-OSD graph of the same time as before: https://public.centerdevice.de/ab5f417d-43af-435d-aad0-7becff2b9acb

Are there any regular / event-based maintenance tasks that are ensured to only run on n (=3) OSDs at a time? Can I do anything to smooth this out or reduce it somehow?

Thanks,
Daniel

--
Daniel Schneller
Principal Cloud Engineer
CenterDevice GmbH
https://www.centerdevice.de
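P.S.: In case someone wants to check the same thing on their own nodes, this is roughly what I used to see how large the LevelDB omap directories are per OSD and to watch the counters while one of the bursts is running (paths assume the default Ubuntu layout, the OSD id is just an example):

# size of the LevelDB store per OSD on one node
for d in /var/lib/ceph/osd/ceph-*/current/omap; do du -sh "$d"; done

# per-OSD perf counters while a burst is running (run on the node hosting the OSD)
sudo ceph daemon osd.14 perf dump | python -m json.tool | less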
Re: [ceph-users] RGW container deletion problem
Bump On 2016-07-25 14:05:38 +, Daniel Schneller said: Hi! I created a bunch of test containers with some objects in them via RGW/Swift (Ubuntu, RGW via Apache, Ceph Hammer 0.94.1) Now I try to get rid of the test data. I manually staretd with one container: ~/rgwtest ➜ swift -v -V 1.0 -A http://localhost:8405/auth -U <...> -K <...> --insecure delete test_a6b3e80c-e880-bef9-b1b5-892073e3b153 test_10 test_5 test_100 test_20 test_30 So far so good. Notice that locahost:8405 is bound by haproxy, distributing requests to 4 RGWs on different servers, in case that is relevant. To make sure my script gets error handling right, I tried to delete the same container again, leading to an error: ~/rgwtest ➜ swift -v --retries=0 -V 1.0 -A http://localhost:8405/auth -U <...> -K <...> --insecure delete test_a6b3e80c-e880-bef9-b1b5-892073e3b153 Container DELETE failed: http://localhost:8405:8405/swift/v1/test_a6b3e80c-e880-bef9-b1b5-892073e3b153 500 Internal Server Error UnknownError Stat'ing it still works: ~/rgwtest ➜ swift -v -V 1.0 -A http://localhost:8405/auth -U <...> -K <...> --insecure stat test_a6b3e80c-e880-bef9-b1b5-892073e3b153 URL: http://localhost:8405/swift/v1/test_a6b3e80c-e880-bef9-b1b5-892073e3b153 Auth Token: AUTH_rgwtk... Account: v1 Container: test_a6b3e80c-e880-bef9-b1b5-892073e3b153 Objects: 0 Bytes: 0 Read ACL: Write ACL: Sync To: Sync Key: Server: Apache/2.4.7 (Ubuntu) X-Container-Bytes-Used-Actual: 0 X-Storage-Policy: default-placement Content-Type: text/plain; charset=utf-8 Checking the RGW Logs I found this: 2016-07-25 15:21:29.751055 7fbcd67f4700 1 == starting new request req=0x7fbce40a1100 = 2016-07-25 15:21:29.768688 7fbcd67f4700 0 WARNING: set_req_state_err err_no=125 resorting to 500 2016-07-25 15:21:29.768743 7fbcd67f4700 1 == req done req=0x7fbce40a1100 http_status=500 == Googling a little and finding this: http://tracker.ceph.com/issues/14208 mentioning similar issues and an out-of-sync metadata cache between different RGWs. I vaguely remember having seen something like this in the Firefly timeframe before, but I am not sure if it is the same. Where does this metadata cache live? Can it be flushed somehow without disturbing other operations? I found this PDF https://archive.fosdem.org/2016/schedule/event/virt_iaas_ceph_rados_gateway_overview/attachments/audio/1077/export/events/attachments/virt_iaas_ceph_rados_gateway_overview/audio/1077/Fosdem_RGW.pdf but without the "audio track" it doesn't really help me. Thanks! Daniel -- -- Daniel Schneller Principal Cloud Engineer CenterDevice GmbH https://www.centerdevice.de ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] RGW container deletion problem
Hi!

I created a bunch of test containers with some objects in them via RGW/Swift (Ubuntu, RGW via Apache, Ceph Hammer 0.94.1). Now I am trying to get rid of the test data. I manually started with one container:

~/rgwtest ➜ swift -v -V 1.0 -A http://localhost:8405/auth -U <...> -K <...> --insecure delete test_a6b3e80c-e880-bef9-b1b5-892073e3b153
test_10
test_5
test_100
test_20
test_30

So far so good. Notice that localhost:8405 is bound by haproxy, distributing requests to 4 RGWs on different servers, in case that is relevant.

To make sure my script gets error handling right, I tried to delete the same container again, leading to an error:

~/rgwtest ➜ swift -v --retries=0 -V 1.0 -A http://localhost:8405/auth -U <...> -K <...> --insecure delete test_a6b3e80c-e880-bef9-b1b5-892073e3b153
Container DELETE failed: http://localhost:8405:8405/swift/v1/test_a6b3e80c-e880-bef9-b1b5-892073e3b153 500 Internal Server Error UnknownError

Stat'ing it still works:

~/rgwtest ➜ swift -v -V 1.0 -A http://localhost:8405/auth -U <...> -K <...> --insecure stat test_a6b3e80c-e880-bef9-b1b5-892073e3b153
URL: http://localhost:8405/swift/v1/test_a6b3e80c-e880-bef9-b1b5-892073e3b153
Auth Token: AUTH_rgwtk...
Account: v1
Container: test_a6b3e80c-e880-bef9-b1b5-892073e3b153
Objects: 0
Bytes: 0
Read ACL:
Write ACL:
Sync To:
Sync Key:
Server: Apache/2.4.7 (Ubuntu)
X-Container-Bytes-Used-Actual: 0
X-Storage-Policy: default-placement
Content-Type: text/plain; charset=utf-8

Checking the RGW logs I found this:

2016-07-25 15:21:29.751055 7fbcd67f4700 1 == starting new request req=0x7fbce40a1100 =
2016-07-25 15:21:29.768688 7fbcd67f4700 0 WARNING: set_req_state_err err_no=125 resorting to 500
2016-07-25 15:21:29.768743 7fbcd67f4700 1 == req done req=0x7fbce40a1100 http_status=500 ==

Googling a little turned up http://tracker.ceph.com/issues/14208, mentioning similar issues and an out-of-sync metadata cache between different RGWs. I vaguely remember having seen something like this in the Firefly timeframe before, but I am not sure if it is the same.

Where does this metadata cache live? Can it be flushed somehow without disturbing other operations?

I found this PDF https://archive.fosdem.org/2016/schedule/event/virt_iaas_ceph_rados_gateway_overview/attachments/audio/1077/export/events/attachments/virt_iaas_ceph_rados_gateway_overview/audio/1077/Fosdem_RGW.pdf but without the "audio track" it doesn't really help me.

Thanks!
Daniel

--
Daniel Schneller
Principal Cloud Engineer
CenterDevice GmbH
https://www.centerdevice.de
Re: [ceph-users] Pinpointing performance bottleneck / would SSD journals help?
On 2016-06-27 16:01:07 +, Lionel Bouton said:

> On 27/06/2016 17:42, Daniel Schneller wrote:
>> Hi!
>> * Network Link saturation. All links / bonds are well below any relevant load (around 35MB/s or less)
> ...
> Are you sure? On each server you have 12 OSDs with a theoretical bandwidth of at least half of 100MB/s (minimum bandwidth of any reasonable HDD but halved because of the journal on the same device). Which means your total disk bandwidth per server is 600MB/s.

Correct. However, I fear that because of lots of random IO going on, we won't be coming anywhere near that number, esp. with 3x replication.

> Bonded links are not perfect aggregation (depending on the mode one client will either always use the same link or have its traffic imperfectly balanced between the 2), so your theoretical network bandwidth is probably nearer to 1Gbps (~ 120MB/s).

We use layer3+4 to spread traffic based on source and destination IP and port information. Benchmarks have shown that using enough parallel streams we can saturate the full 250MB/s this ideally produces. You are right, of course, that any single TCP connection will never exceed 1Gbps.

> What could happen is that the 35MB/s is an average over a large period (several seconds), it's probably peaking at 120MB/s during short bursts.

That thought crossed my mind early on, too, but these values are based on /proc/net/dev, which has counters for each network device. The statistics are gathered by checking the difference between the current sample and the last. So this does not suffer from samples being taken at relatively long intervals.

> I wouldn't use less than 10Gbps for both the cluster and public networks in your case.

I whole-heartedly agree... Certainly sensible, but for now we have to make do with the infrastructure we have. Still, based on the data we have so far, the network at least doesn't jump at me as a (major) contributor to the slowness we see in this current scenario.

> You didn't say how many VMs are running: the rkB/s and wkB/s seem very low (note that for write intensive tasks your VM is reading quite a bit...) but if you have 10 VMs or more battling for read and write access this way it wouldn't be unexpected. As soon as latency rises for one reason or another (here it would be network latency) you can expect the total throughput of random accesses to plummet.

In total there are about 25 VMs, however many of them are less I/O bound than MongoDB and Elasticsearch. As for the comparatively high read load, I agree, but I cannot really explain that in detail at the moment.

In general I would be very much interested in diagnosing the underlying bare metal layer without making too many assumptions about what clients are actually doing. In this case we can look into the VMs, but in general it would be ideal to pinpoint a bottleneck on the "lower" levels. Any improvements there would be beneficial to all client software.

Cheers,
Daniel

--
Daniel Schneller
Principal Cloud Engineer
CenterDevice GmbH
https://www.centerdevice.de
[ceph-users] Pinpointing performance bottleneck / would SSD journals help?
Hi!

We are currently trying to pinpoint a bottleneck and are somewhat stuck. First things first, this is the hardware setup:

4x DELL PowerEdge R510, 12x 4TB OSD HDDs, journal colocated on HDD
96GB RAM, 2x6 cores + HT
2x 1GbE bonded interfaces for cluster network
2x 1GbE bonded interfaces for public network
Ceph Hammer on Ubuntu 14.04

6 OpenStack compute nodes with all-RBD VMs (no ephemeral storage). The VMs run a variety of stuff, most notably MongoDB, Elasticsearch and our custom software, which uses both the VMs' virtual disks as well as the Rados Gateway for object storage.

Recently, under certain more write-intensive conditions, we see reads and overall system performance starting to suffer as well. Here is an iostat -x 3 sample for one of the VMs hosting MongoDB. Notice the "await" times (vda is the root, vdb is the data volume).

Linux 3.13.0-35-generic (node02)   06/24/2016   _x86_64_   (16 CPU)

avg-cpu:  %user %nice %system %iowait %steal %idle
           1.55  0.00    0.44    0.42   0.00 97.59

Device: rrqm/s wrqm/s   r/s   w/s  rkB/s   wkB/s avgrq-sz avgqu-sz   await r_await w_await svctm %util
vda       0.00   0.91  0.09  1.01   2.55    9.59    22.12     0.01  266.90 2120.51   98.59  4.76  0.52
vdb       0.00   1.53 18.39 40.79 405.98  483.92    30.07     0.30    5.68    5.42    5.80  3.96 23.43

avg-cpu:  %user %nice %system %iowait %steal %idle
           5.05  0.00    2.08    3.16   0.00 89.71

Device: rrqm/s wrqm/s   r/s   w/s  rkB/s   wkB/s avgrq-sz avgqu-sz   await r_await w_await svctm %util
vda       0.00   0.00  0.00  0.00   0.00    0.00     0.00     0.00    0.00    0.00    0.00  0.00  0.00
vdb       0.00   7.00 23.00 29.00 368.00  500.00    33.38     1.91  446.00  422.26  464.83 19.08 99.20

avg-cpu:  %user %nice %system %iowait %steal %idle
           4.43  0.00    1.73    4.94   0.00 88.90

Device: rrqm/s wrqm/s   r/s   w/s  rkB/s   wkB/s avgrq-sz avgqu-sz   await r_await w_await svctm %util
vda       0.00   0.00  0.00  0.00   0.00    0.00     0.00     0.00    0.00    0.00    0.00  0.00  0.00
vdb       0.00  13.00 45.00 83.00 712.00 1041.00    27.39     2.54 1383.25  272.18 1985.64  7.50 96.00

If we read this right, the average time spent waiting for read or write requests to be serviced can be multi-second. This would go in line with MongoDB's slow log, where we see fully indexed queries, returning a single result, taking over a second, where they would normally be finished quasi instantly.

So far we have looked at these metrics (using StackExchange's Bosun from https://bosun.org). Most values are collected every 15 seconds.

* Network link saturation. All links / bonds are well below any relevant load (around 35MB/s or less)
* Storage node RAM. At least 3GB reported "free", between 50GB and 70GB as cached.
* Storage node CPU. Hardly above 30%
* # of I/Os in progress per OSD (as per /proc/diskstats). These reach values of up to 180.

Bosun collects the raw data for these metrics (and lots of others) every 15 seconds.

We have a suspicion the spinners are the culprit here, but to verify this and to be able to convince the upper layers of company leadership to invest in some SSDs for journals, we need better evidence; apart from the personal desire to understand exactly what's going on here :)

Regardless of the VMs on top (which could be any client, as I see it), which metrics would I have to collect/look at to verify/reject the assumption that we are limited by our pure HDD setup?

Thanks a lot!
Daniel

--
Daniel Schneller
Principal Cloud Engineer
CenterDevice GmbH
https://www.centerdevice.de
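P.S.: For the record, this is roughly how the "I/Os in progress" figure mentioned above can be sampled by hand (Bosun does essentially the same internally). Field 12 of /proc/diskstats is the number of I/Os currently in flight for a device, so watching it for the OSD data disks gives a feel for queue depth without any Ceph-specific tooling -- the device name pattern is just what matches our boxes:

# print in-flight I/Os for all sd* devices once per second
while sleep 1; do
  awk '$3 ~ /^sd[a-z]+$/ { printf "%s=%s ", $3, $12 }' /proc/diskstats
  echo
done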
Re: [ceph-users] How to run multiple RadosGW instances under the same zone
On 2016-01-04 10:37:43 +, Srinivasula Maram said: Hi Joseph, You can try haproxy as proxy for load balancing and failover. Thanks, Srinivas We have 6 hosts running RadosGW with haproxy in front of them without problems. Depending on your setup you might even consider running haproxy locally on your application servers, so that your application always connects to localhost. This saves you from having to set up highly available load balancers. It's strongly recommended, of course, to use some kind of automatic provisioning (Ansible, Puppet etc.) to roll out identical haproxy configuration on all these machines. -- Daniel Schneller Principal Cloud Engineer CenterDevice GmbH https://www.centerdevice.de___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
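To give an idea of what that looks like in practice, below is a trimmed-down sketch of the kind of haproxy configuration we roll out -- hostnames, the port and the check are placeholders, not our literal production values. Each application server runs this locally, so the application always talks to 127.0.0.1:

listen radosgw
    bind 127.0.0.1:8405
    mode http
    balance roundrobin
    option httpchk GET /
    server rgw01 rgw01.example.net:80 check inter 2000 rise 2 fall 3
    server rgw02 rgw02.example.net:80 check inter 2000 rise 2 fall 3
    server rgw03 rgw03.example.net:80 check inter 2000 rise 2 fall 3

With the health check in place, a RadosGW host that goes away is taken out of rotation automatically, which is what makes the "haproxy on every app server" approach work without a dedicated load balancer pair.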
[ceph-users] Namespaces and authentication
Hi! On http://docs.ceph.com/docs/master/rados/operations/user-management/#namespace I read about auth namespaces. According to the most recent documentation it is still not supported by any of the client libraries, especially rbd. I have a client asking to get access to rbd volumes for Kubernetes (http://kubernetes.io/v1.1/docs/user-guide/volumes.html#rbd). Due to the dynamic nature of the environment, I would like to grant them access to a dedicated pool where they could create volumes on their own. Different ceph secrets should be used for different volumes, so that they can hand out different secrets to different tenants in their environment to only give them access to their respective volumes. Is there any way to do that yet? Are there plans on extending the namespace support beyond the current state? Of course, I would be open to suggestions on how to do it differently, too, in case I am overlooking something obvious. Main requirements are a) client admin can create new rbd volumes in a dedicated pool, b) client admin can limit access to a volume to a specific user/secret. Thanks! Daniel -- Daniel Schneller Principal Cloud Engineer CenterDevice GmbH https://www.centerdevice.de___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
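To make requirement a) more concrete, this is a minimal sketch of the pool-level part I have in mind (the client name and pool are made up for illustration); it is exactly the per-volume part b) that I do not see a way to express:

# dedicated pool for the tenant
ceph osd pool create kube 128

# key that can only read cluster maps and read/write that one pool
ceph auth get-or-create client.kube mon 'allow r' osd 'allow rwx pool=kube'

This gives their admin free rein inside the pool (create, map and delete any rbd volume there), but as far as I can tell there is no supported way to then restrict individual volumes within that pool to different secrets.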
[ceph-users] Number of buckets per user
Hi! Maybe I am missing something obvious, but is there no way to quickly tell how many buckets an RGW user has? I can see the max_buckets limit in radosgw-admin user info --uid=x, but nothing about how much of that limit has been used. To be clear: I do not care what they are called, or what is in them, just the count. Is that something the RGW maintains for cheap queries? Thanks, Daniel -- Daniel Schneller Principal Cloud Engineer CenterDevice GmbH https://www.centerdevice.de___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
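In case it clarifies what I am after: right now the only way I see is listing all bucket names for the user and counting them client-side, roughly like this (the uid is just an example; if I am not mistaken, bucket list returns a JSON array of names):

radosgw-admin bucket list --uid=someuser | python -c 'import json,sys; print len(json.load(sys.stdin))'

That works, but for users with very many buckets it is exactly the kind of expensive enumeration I would like to avoid -- hence the question whether a counter is maintained somewhere for cheap queries.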
Re: [ceph-users] Creating RGW Zone System Users Fails with "couldn't init storage provider"
Bump... :) On 2015-11-02 15:52:44 +, Daniel Schneller said: Hi! I am trying to set up a Rados Gateway, prepared for multiple regions and zones, according to the documenation on http://docs.ceph.com/docs/hammer/radosgw/federated-config/. Ceph version is 0.94.3 (Hammer). I am stuck at the "Create zone users" step (http://docs.ceph.com/docs/hammer/radosgw/federated-config/#create-zone-users). Running the user create command I get this: $ sudo radosgw-admin user create --uid="eu-zone1" --display-name="Region-EU Zone-zone1" --client-id client.radosgw.eu-zone1-1 --system couldn't init storage provider $ echo $? 5 I have found this in a Documentation bug ticket, but unfortunately there is no indication of what was actually going on there: http://tracker.ceph.com/issues/10848#note-21 I am at a loss, I have even tried to figure out what was going on via reading the rgw-admin source, but I could not find any strong hints. Ideas? Thanks, Daniel Find all relevant(?) bits of configuration below: Ceph.conf has this for the RGW instances: [client.radosgw.eu-zone1-1] host = dec-b1-d7-73-f0-04 admin socket = /var/run/ceph-radosgw/client.radosgw.dec-b1-d7-73-f0-04.asok pid file = /var/run/ceph-radosgw/$name.pid rgw region = eu rgw region root pool = .eu.rgw.root rgw zone = eu-zone1 rgw zone root pool = .eu-zone1.rgw.root rgw_print_continue = false keyring = /etc/ceph/ceph.client.radosgw.keyring rgw_socket_path = /var/run/ceph-radosgw/client.radosgw.eu-zone1-1.sock log_file = /var/log/radosgw/radosgw.log rgw_enable_ops_log = false rgw_gc_max_objs = 31 rgw_frontends = fastcgi debug_rgw = 20 Keyring: [client.radosgw.eu-zone1-1] key = caps mon = "allow rwx" caps osd = "allow rwx" ceph auth list has the same key and these caps: client.radosgw.eu-zone1-1 key: caps: [mon] allow rwx caps: [osd] allow rwx I have followed the instructions on that page and have created Region and Zone configurations as follows: { "name": "eu", "api_name": "eu", "is_master": "true", "endpoints": [ "https:\/\/rgw-eu-zone1.mydomain.net:443\/", "http:\/\/rgw-eu-zone1.mydomain.net:80\/"], "master_zone": "eu-zone1", "zones": [ { "name": "eu-zone1", "endpoints": [ "https:\/\/rgw-eu-zone1.mydomain.net:443\/", "http:\/\/rgw-eu-zone1.mydomain.net:80\/"], "log_meta": "true", "log_data": "true"} ], "placement_targets": [ { "name": "default-placement", "tags": [] } ], "default_placement": "default-placement"} { "domain_root": ".eu-zone1.domain.rgw", "control_pool": ".eu-zone1.rgw.control", "gc_pool": ".eu-zone1.rgw.gc", "log_pool": ".eu-zone1.log", "intent_log_pool": ".eu-zone1.intent-log", "usage_log_pool": ".eu-zone1.usage", "user_keys_pool": ".eu-zone1.users", "user_email_pool": ".eu-zone1.users.email", "user_swift_pool": ".eu-zone1.users.swift", "user_uid_pool": ".eu-zone1.users.uid", "system_key": { "access_key": "", "secret_key": ""}, "placement_pools": [ { "key": "default-placement", "val": { "index_pool": ".eu-zone1.rgw.buckets.index", "data_pool": ".eu-zone1.rgw.buckets"} } ] } These pools are defined: rbd images volumes .eu-zone1.rgw.root .eu-zone1.rgw.control .eu-zone1.rgw.gc .eu-zone1.rgw.buckets .eu-zone1.rgw.buckets.index .eu-zone1.rgw.buckets.extra .eu-zone1.log .eu-zone1.intent-log .eu-zone1.usage .eu-zone1.users .eu-zone1.users.email .eu-zone1.users.swift .eu-zone1.users.uid .eu.rgw.root .eu-zone1.domain.rgw .rgw .rgw.root .rgw.gc .users.uid .users .rgw.control .log .intent-log .usage .users.email .users.swift -- -- Daniel Schneller Principal Cloud Engineer CenterDevice GmbH https://www.centerdevice.de ___ ceph-users mailing list 
ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Creating RGW Zone System Users Fails with "couldn't init storage provider"
On 2015-11-05 12:16:35 +, Wido den Hollander said:

> This is usually when keys aren't set up properly. Are you sure that the cephx keys you are using are correct and that you can connect to the Ceph cluster?
>
> Wido

Yes, I could execute all kinds of commands. However, it turns out I might have seen the effects of some non-obvious behavior: What we noticed was that whatever is used as an argument to --client-id (tried with completely random crap), we could successfully execute commands! E. g.

$ sudo radosgw-admin zone list --client-id blablabla

would get results back just fine, which took me very much by surprise.

Turns out, if you read `man ceph` closely, "--client-id" is not even a valid parameter! Trying it with e. g. "ceph -s" will tell you that immediately:

$ sudo ceph --client-id blablabla -s
Invalid command: unused arguments: ['--client-id', 'blablabla']
...

On the other hand, radosgw-admin doesn't:

$ sudo radosgw-admin user info --uid=someuser --client-id blablabla
{ results }

Apparently, radosgw-admin swallows unknown arguments silently. It just uses the admin key, which I could see by running this as an unprivileged user without sudo:

$ radosgw-admin user info --uid=someuser --client-id blablabla
2015-11-05 14:47:30.079318 7fc4dd104900 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
couldn't init storage provider
2015-11-05 14:47:30.079323 7fc4dd104900 0 librados: client.admin initialization error (2) No such file or directory

The unknown --client-id argument gets dropped and it tries to use the admin keyring, which it is not allowed to access without sudo.

I still do not know exactly why this did not help me originally, because it should just have created the user using the admin key. So it is not exactly clear what was going on then. Nevertheless, the user exists now, so it might remain a mystery...

In any case, making radosgw-admin at least _inform_ about unknown arguments might be a better idea than just silently ignoring them.

Thanks!
Daniel

--
Daniel Schneller
Principal Cloud Engineer
CenterDevice GmbH
https://www.centerdevice.de
Re: [ceph-users] One object in .rgw.buckets.index causes systemic instability
We had a similar issue in Firefly, where we had a very large number (about 1,500,000) of buckets for a single RGW user. We observed a number of slow requests in day-to-day use, but did not think much of it at the time.

At one point the primary OSD managing the list of buckets for that user crashed and could not restart, because processing the tremendous amount of buckets on startup - which also seemed to be single-threaded, judging by the 100% CPU usage we could see - took longer than the suicide timeout. That led to this OSD crashing again, and again. Eventually, it would be marked out and the secondary tried to process the list with the same result, leading to a cascading failure.

While I am quite certain it is a different code path in your case (you speak about a handful of buckets), it certainly sounds like a very similar issue. Do you have lots of objects in those few buckets, or are they few, but large in size, to reach the 30TB?

Worst case you might be in for a similar procedure as we had to take: Take load off the cluster, increase the timeouts to ridiculous levels and copy the data over into a more evenly distributed set of buckets (users in our case). Fortunately, as long as we did not try to write to the problematic buckets, we could still read from them.

Please notice that this is only a guess, I could be completely wrong.

Daniel

On 2015-11-03 13:33:19 +, Gerd Jakobovitsch said:

> Dear all,
>
> I have a cluster running hammer (0.94.5), with 5 nodes. The main usage is for S3-compatible object storage.
>
> I am getting to a very troublesome problem at a ceph cluster. A single object in the .rgw.buckets.index is not responding to requests and takes a very long time while recovering after an osd restart. During this time, the OSDs where this object is mapped got heavily loaded, with high cpu as well as memory usage. At the same time, the directory /var/lib/ceph/osd/ceph-XX/current/omap gets a large number of entries ( > 1), that won't decrease.
>
> Very frequently, I get >100 blocked requests for this object, and the main OSD that stores it ends up accepting no other requests. Very frequently the OSD ends up crashing due to filestore timeout, and getting it up again is very troublesome - it usually has to run alone in the node for a long time, until the object gets recovered, somehow.
>
> At the OSD logs, there are several entries like these:
>
> -7051> 2015-11-03 10:46:08.339283 7f776974f700 10 log_client logged 2015-11-03 10:46:02.942023 osd.63 10.17.0.9:6857/2002 41 : cluster [WRN] slow request 120.003081 seconds old, received at 2015-11-03 10:43:56.472825: osd_repop(osd.53.236531:7 34.7 8a7482ff/.dir.default.198764998.1/head//34 v 2369 84'22) currently commit_sent
>
> 2015-11-03 10:28:32.405265 7f0035982700 0 log_channel(cluster) log [WRN] : 97 slow requests, 1 included below; oldest blocked for > 2046.502848 secs
> 2015-11-03 10:28:32.405269 7f0035982700 0 log_channel(cluster) log [WRN] : slow request 1920.676998 seconds old, received at 2015-11-03 09:56:31.728224: osd_op(client.210508702.0:14696798 .dir.default.198764998.1 [call rgw.bucket_prepare_op] 15.8a7482ff ondisk+write+known_if_redirected e236956) currently waiting for blocked object
>
> Is there any way to go deeper into this problem, or to rebuild the .rgw index without losing data? I currently have 30 TB of data in the cluster - most of it concentrated in a handful of buckets - that I can't lose.
>
> Regards.

--
Daniel Schneller
Principal Cloud Engineer
CenterDevice GmbH
https://www.centerdevice.de
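P.S.: One thing that helped us judge how bad an index object was before touching it (read-only, so reasonably safe even on a struggling cluster) was counting the omap keys of the bucket index object directly -- something along these lines, using the object name from your log excerpt:

rados -p .rgw.buckets.index listomapkeys .dir.default.198764998.1 | wc -l

If that number is in the many millions for a single, unsharded index object, every rgw.bucket_prepare_op has to be served by the one OSD holding that object, which would match the behaviour you describe.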
[ceph-users] Creating RGW Zone System Users Fails with "couldn't init storage provider"
Hi! I am trying to set up a Rados Gateway, prepared for multiple regions and zones, according to the documenation on http://docs.ceph.com/docs/hammer/radosgw/federated-config/. Ceph version is 0.94.3 (Hammer). I am stuck at the "Create zone users" step (http://docs.ceph.com/docs/hammer/radosgw/federated-config/#create-zone-users). Running the user create command I get this: $ sudo radosgw-admin user create --uid="eu-zone1" --display-name="Region-EU Zone-zone1" --client-id client.radosgw.eu-zone1-1 --system couldn't init storage provider $ echo $? 5 I have found this in a Documentation bug ticket, but unfortunately there is no indication of what was actually going on there: http://tracker.ceph.com/issues/10848#note-21 I am at a loss, I have even tried to figure out what was going on via reading the rgw-admin source, but I could not find any strong hints. Ideas? Thanks, Daniel Find all relevant(?) bits of configuration below: Ceph.conf has this for the RGW instances: [client.radosgw.eu-zone1-1] host = dec-b1-d7-73-f0-04 admin socket = /var/run/ceph-radosgw/client.radosgw.dec-b1-d7-73-f0-04.asok pid file = /var/run/ceph-radosgw/$name.pid rgw region = eu rgw region root pool = .eu.rgw.root rgw zone = eu-zone1 rgw zone root pool = .eu-zone1.rgw.root rgw_print_continue = false keyring = /etc/ceph/ceph.client.radosgw.keyring rgw_socket_path = /var/run/ceph-radosgw/client.radosgw.eu-zone1-1.sock log_file = /var/log/radosgw/radosgw.log rgw_enable_ops_log = false rgw_gc_max_objs = 31 rgw_frontends = fastcgi debug_rgw = 20 Keyring: [client.radosgw.eu-zone1-1] key = caps mon = "allow rwx" caps osd = "allow rwx" ceph auth list has the same key and these caps: client.radosgw.eu-zone1-1 key: caps: [mon] allow rwx caps: [osd] allow rwx I have followed the instructions on that page and have created Region and Zone configurations as follows: { "name": "eu", "api_name": "eu", "is_master": "true", "endpoints": [ "https:\/\/rgw-eu-zone1.mydomain.net:443\/", "http:\/\/rgw-eu-zone1.mydomain.net:80\/"], "master_zone": "eu-zone1", "zones": [ { "name": "eu-zone1", "endpoints": [ "https:\/\/rgw-eu-zone1.mydomain.net:443\/", "http:\/\/rgw-eu-zone1.mydomain.net:80\/"], "log_meta": "true", "log_data": "true"} ], "placement_targets": [ { "name": "default-placement", "tags": [] } ], "default_placement": "default-placement"} { "domain_root": ".eu-zone1.domain.rgw", "control_pool": ".eu-zone1.rgw.control", "gc_pool": ".eu-zone1.rgw.gc", "log_pool": ".eu-zone1.log", "intent_log_pool": ".eu-zone1.intent-log", "usage_log_pool": ".eu-zone1.usage", "user_keys_pool": ".eu-zone1.users", "user_email_pool": ".eu-zone1.users.email", "user_swift_pool": ".eu-zone1.users.swift", "user_uid_pool": ".eu-zone1.users.uid", "system_key": { "access_key": "", "secret_key": ""}, "placement_pools": [ { "key": "default-placement", "val": { "index_pool": ".eu-zone1.rgw.buckets.index", "data_pool": ".eu-zone1.rgw.buckets"} } ] } These pools are defined: rbd images volumes .eu-zone1.rgw.root .eu-zone1.rgw.control .eu-zone1.rgw.gc .eu-zone1.rgw.buckets .eu-zone1.rgw.buckets.index .eu-zone1.rgw.buckets.extra .eu-zone1.log .eu-zone1.intent-log .eu-zone1.usage .eu-zone1.users .eu-zone1.users.email .eu-zone1.users.swift .eu-zone1.users.uid .eu.rgw.root .eu-zone1.domain.rgw .rgw .rgw.root .rgw.gc .users.uid .users .rgw.control .log .intent-log .usage .users.email .users.swift -- Daniel Schneller Principal Cloud Engineer CenterDevice GmbH https://www.centerdevice.de ___ ceph-users mailing list ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Problems to expect with newer point release rgw vs. older MONs/OSDs
On 2015-07-08 10:34:14 +, Wido den Hollander said:

> On 08-07-15 12:20, Daniel Schneller wrote:
>> Hi!
>> Just a quick question regarding mixed versions. So far a cluster is running on 0.94.1-1trusty without Rados Gateway. Since the packages have been updated in the meantime, installing radosgw now would entail bringing a few updated dependencies along. OSDs and MONs on the nodes that are to become Rados Gateways would not automatically be upgraded, too. Is that a safe setup, or do I need to upgrade the whole cluster to the same point release?
>
> That's safe. It's not required that the whole cluster runs the same version. However, 94.2 fixes some bugs, so I would recommend that you upgrade the cluster anyway. It can be done in a rolling fashion.
>
> Wido

Understood. We are planning to upgrade to 0.94.3 on everything once that becomes available. In the meantime we decided to install rgw 0.94.1 just to have less stuff to track in our heads and because we know 0.94.1 works for our current use case in another cluster.

However, just now I tried this without success:

[C|daniel.schneller@node01] ~ ➜ apt-get install --dry-run radosgw=0.94.1-1trusty
...
E: Version '0.94.1-1trusty' for 'radosgw' was not found

[C|daniel.schneller@node01] ~ ➜ apt-cache policy radosgw
radosgw:
  Installed: (none)
  Candidate: 0.94.2-1trusty
  Version table:
     0.94.2-1trusty 0
        999 http://ceph.com/debian-hammer/ trusty/main amd64 Packages
     0.80.9-0ubuntu0.14.04.2 0
        500 http://archive.ubuntu.com/ubuntu/ trusty-updates/main amd64 Packages
     0.79-0ubuntu1 0
        500 http://archive.ubuntu.com/ubuntu/ trusty/main amd64 Packages

It seems the repo does not offer anything but the most recent version? Am I missing anything?

Daniel
[ceph-users] Problems to expect with newer point release rgw vs. older MONs/OSDs
Hi! Just a quick question regarding mixed versions. So far a cluster is running on 0.94.1-1trusty without Rados Gateway. Since the packets have been updated in the meantime, installing radosgw now would entail bringing a few updated dependencies along. OSDs and MONs on the nodes that are to become Rados Gateways would not automatically be upgraded, too. Is that a safe setup, or do I need to upgrade the whole cluster to the same point release? Thanks, Daniel [C|daniel.schneller@node01] ~ ➜ apt-get install --dry-run radosgw NOTE: This is only a simulation! apt-get needs root privileges for real execution. Keep also in mind that locking is deactivated, so don't depend on the relevance to the real current situation! Reading package lists... Done Building dependency tree Reading state information... Done The following extra packages will be installed: ceph-common libfcgi0ldbl librados2 libradosstriper1 librbd1 python-cephfs python-rados python-rbd The following NEW packages will be installed: libfcgi0ldbl radosgw The following packages will be upgraded: ceph-common librados2 libradosstriper1 librbd1 python-cephfs python-rados python-rbd 7 upgraded, 2 newly installed, 0 to remove and 206 not upgraded. Inst libradosstriper1 [0.94.1-1trusty] (0.94.2-1trusty stable [amd64]) [] Inst ceph-common [0.94.1-1trusty] (0.94.2-1trusty stable [amd64]) [] Inst librbd1 [0.94.1-1trusty] (0.94.2-1trusty stable [amd64]) [] Inst librados2 [0.94.1-1trusty] (0.94.2-1trusty stable [amd64]) [] Inst python-rados [0.94.1-1trusty] (0.94.2-1trusty stable [amd64]) [] Inst python-cephfs [0.94.1-1trusty] (0.94.2-1trusty stable [amd64]) [] Inst python-rbd [0.94.1-1trusty] (0.94.2-1trusty stable [amd64]) Inst libfcgi0ldbl (2.4.0-8.1ubuntu5 Ubuntu:14.04/trusty [amd64]) Inst radosgw (0.94.2-1trusty stable [amd64]) Conf librados2 (0.94.2-1trusty stable [amd64]) Conf libradosstriper1 (0.94.2-1trusty stable [amd64]) Conf librbd1 (0.94.2-1trusty stable [amd64]) Conf python-rados (0.94.2-1trusty stable [amd64]) Conf python-cephfs (0.94.2-1trusty stable [amd64]) Conf python-rbd (0.94.2-1trusty stable [amd64]) Conf ceph-common (0.94.2-1trusty stable [amd64]) Conf libfcgi0ldbl (2.4.0-8.1ubuntu5 Ubuntu:14.04/trusty [amd64]) Conf radosgw (0.94.2-1trusty stable [amd64]) ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Node reboot -- OSDs not logging off from cluster
On 2015-07-03 01:31:35 +, Johannes Formann said:

> Hi,
>
>> When rebooting one of the nodes (e. g. for a kernel upgrade) the OSDs do not seem to shut down correctly. Clients hang and ceph osd tree shows the OSDs of that node still up. Repeated runs of ceph osd tree show them going down after a while. For instance, here OSD.7 is still up, even though the machine is in the middle of the reboot cycle.
>> ...
>> Any ideas as to what is causing this or how to diagnose this?
>
> I see this behavior (only) when I reboot a ceph-node with a monitor and OSDs. I guess somehow this relates. (OSD-messages getting lost due to the „failing“ mon)

Sorry for being silent for a few days, other things kept me busy.

Indeed, this is an interesting thought. We do have MONs running on three of our storage nodes. I need to verify whether the one where I saw the problem is one of them, but with 5 total, there is more than a 50% chance ;)

Can anyone tell me which log levels on the MONs and/or OSDs I might want to change to track whether the shutdown notifications are actually received by the monitors or where they get lost?

Regards,
Daniel
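P.S.: In the meantime my plan is to simply raise the monitor and messenger debug levels temporarily around a test reboot, roughly like this (the levels are my own guess, not a recommendation from the docs; they can be reverted the same way afterwards):

# on the fly, no restart needed
ceph tell mon.* injectargs '--debug-mon 10 --debug-ms 1'
ceph tell osd.* injectargs '--debug-osd 10 --debug-ms 1'

With the messenger logging turned up, the MON logs should at least show whether the shutdown notifications from the OSDs on the rebooting node ever arrive.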
[ceph-users] Node reboot -- OSDs not logging off from cluster
Hi!

We are seeing a strange - and problematic - behavior in our 0.94.1 cluster on Ubuntu 14.04.1. We have 5 nodes, 4 OSDs each.

When rebooting one of the nodes (e. g. for a kernel upgrade) the OSDs do not seem to shut down correctly. Clients hang and ceph osd tree shows the OSDs of that node still up. Repeated runs of ceph osd tree show them going down after a while. For instance, here OSD.7 is still up, even though the machine is in the middle of the reboot cycle.

[C|root@control01] ~ ➜ ceph osd tree
# id    weight  type name        up/down reweight
-1      36.2    root default
-2      7.24            host node01
0       1.81                    osd.0   up      1
5       1.81                    osd.5   up      1
10      1.81                    osd.10  up      1
15      1.81                    osd.15  up      1
-3      7.24            host node02
1       1.81                    osd.1   up      1
6       1.81                    osd.6   up      1
11      1.81                    osd.11  up      1
16      1.81                    osd.16  up      1
-4      7.24            host node03
2       1.81                    osd.2   down    1
7       1.81                    osd.7   up      1
12      1.81                    osd.12  down    1
17      1.81                    osd.17  down    1
-5      7.24            host node04
3       1.81                    osd.3   up      1
8       1.81                    osd.8   up      1
13      1.81                    osd.13  up      1
18      1.81                    osd.18  up      1
-6      7.24            host node05
4       1.81                    osd.4   up      1
9       1.81                    osd.9   up      1
14      1.81                    osd.14  up      1
19      1.81                    osd.19  up      1

So it seems the services are either not shut down correctly when the reboot begins, or they do not get enough time to actually let the cluster know they are going away. If I stop the OSDs on that node manually before the reboot, everything works as expected and clients don't notice any interruptions.

[C|root@node03] ~ ➜ service ceph-osd stop id=2
ceph-osd stop/waiting
[C|root@node03] ~ ➜ service ceph-osd stop id=7
ceph-osd stop/waiting
[C|root@node03] ~ ➜ service ceph-osd stop id=12
ceph-osd stop/waiting
[C|root@node03] ~ ➜ service ceph-osd stop id=17
ceph-osd stop/waiting
[C|root@node03] ~ ➜ reboot

The upstart file was not changed from the packaged version. Interestingly, the same Ceph version on a different cluster does _not_ show this behaviour.

Any ideas as to what is causing this or how to diagnose this?

Cheers,
Daniel
Re: [ceph-users] Unexpected period of iowait, no obvious activity?
On 23.06.2015, at 14:13, Gregory Farnum g...@gregs42.com wrote: ... On the other hand, there are lots of administrative tasks that can run and do something like this. The CERN guys had a lot of trouble with some daemon which wanted to scan the OSD's entire store for tracking changes, and was installed by their standard Ubuntu deployment. Thanks! Good hint. I will look into that. Daniel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Very chatty MON logs: Is this normal?
On 2015-06-18 09:53:54 +, Joao Eduardo Luis said:

> Setting 'mon debug = 0/5' should be okay. Unless you see that setting '/5' impacts your performance and/or memory consumption, you should leave that be. '0/5' means 'output only debug 0 or lower to the logs; keep the last 1000 debug level 5 or lower in memory in case of a crash'. Your logs will not be as heavily populated but, if for some reason the daemon crashes, you get quite a bit of debug information to help track down the source of the problem.

Great, will do.

Just for my understanding re/ memory: If this is a ring buffer for the last 1000 events, shouldn't that be a somewhat fixed amount of memory? How would it negatively affect the MON's consumption? Assuming it works that way, once they have been running for a few days or weeks, these buffers would be full of events anyway, just more aged ones if the memory level was lower?

Daniel
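For anyone finding this thread later, the ceph.conf form of that recommendation would presumably look like this (monitor section only), applied on the fly with injectargs so no restart is needed:

[mon]
    debug mon = 0/5

ceph tell mon.* injectargs '--debug-mon 0/5'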
[ceph-users] Unexpected period of iowait, no obvious activity?
Hi! Recently over a few hours our 4 Ceph disk nodes showed unusually high and somewhat constant iowait times. Cluster runs 0.94.1 on Ubuntu 14.04.1. It started on one node, then - with maybe 15 minutes delay each - on the next and the next one. Overall duration of the phenomenon was about 90 minutes on each machine, finishing in the same order they had started. We could not see any obvious cluster activity during that time, applications did not do anything out of the ordinary. Scrubbing and deep scrubbing were turned off long before this happened. We are using CephFS for shared administrator home directories on the system, RBD volumes for OpenStack and the Rados Gateway to manage application data via the Swift interface. Telemetry and logs from inside the VMs did not offer an explanation either. The fact that these readings were limited to OSD hosts, but none of the other (client) nodes in the system, suggests this must be some kind of Ceph behaviour. Any ideas? We would like to understand what the system was doing, but haven't found anything obvious in the logs. Thanks! Daniel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Very chatty MON logs: Is this normal?
On 2015-06-17 18:52:51 +, Somnath Roy said:

> This is presently written from log level 1 onwards :-) So, only log level 0 will not log this.. Try, 'debug_mon = 0/0' in the conf file..

Yeah, once I had sent the mail I realized that the 1 in the log line was the level. Had overlooked that before. However, I'd rather not set the level to 0/0, as that would disable all logging from the MONs.

> Now, I don't have enough knowledge on that part to say whether it is important enough to log at log level 1, sorry :-(

That would indeed be interesting to know. Judging from the sheer amount, at least I have my doubts, because the cluster seems to be running without any issues. So I figure at least it isn't indicative of an immediate issue.

Anyone with a little more definitive knowledge around? Should I create a bug ticket for this?

Cheers,
Daniel
[ceph-users] OSD crashing over and over, taking cluster down
7f7aed260700 20 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 crt=34688'23683 lcod 34688'23684 mlcod 34688'23684 active+clean] snapset_obc obc(ce36d9f9/default.139790885.16459__shadow_.B5eeIJm5n8dpsjn-4q5gXmHr4mIcVS1_5/snapdir//81 rwstate(write n=1 w=0))

We cannot pinpoint an exact trigger, but there _seems_ to be some correlation with larger uploads into the RGW. This is not yet completely validated, but merely a timing-related assumption. Could it be that the RGW code causes OSDs to fail either with the big upload alone or in conjunction with other parallel requests? We are seeing more crashes, though, than large uploads, so this remains a guess at best.

We created the bug ticket http://tracker.ceph.com/issues/11677, but would be extremely thankful for some quicker help. I am in the #ceph IRC channel as dschneller, too.

Thanks!
Daniel

--
Daniel Schneller
Infrastructure Engineer / Developer
CenterDevice GmbH | Merscheider Straße 1 | 42699 Solingen | Deutschland
tel: +49 1754155711
daniel.schnel...@centerdevice.de | www.centerdevice.de
Re: [ceph-users] RBD images -- parent snapshot missing (help!)
On 2015-05-16 04:13:57 +, Tuomas Juntunen said: Hey Pavel Could you share your C program and the process how you were able to fix the images. Thanks Tuomas Pavel, That would indeed be invaluable! Thank you very much in advance! Daniel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] RadosGW User Limit?
Hello!

I am wondering if there is a limit to the number of (Swift) users that should be observed when using RadosGW. For example, if I were to offer storage via S3 or Swift APIs with Ceph and RGW as the backing implementation and people could just sign up through some kind of public website, need I watch the number of users created?

Would a few thousand / ten-thousand / hundred-thousand users cause trouble, or is the system designed (and hopefully tested ;)) to handle this? I would certainly hope so, because otherwise there would be a natural limit to how much data you could store in any cluster, not determined by the cluster size itself.

If there are caveats, what would they be and when would I expect them? If it matters for this: Hammer 0.94.1.

Thanks for any insight!
Daniel
[ceph-users] Deleting RGW Users
Hello!

In our cluster we had a nasty problem recently due to a very large number of buckets for a single RadosGW user. The bucket limit was disabled earlier, and the number of buckets grew to the point where OSDs started to go down due to excessive access times, missed heartbeats etc.

We have since rectified that problem by first raising the relevant timeouts to near ridiculous levels so we could get the system to respond again and by copying all data from that single user to a few hundred new users. Of course, the old gigantic user is still around. Not sure if this is relevant, but we also have quite a few snapshots on the rgw pools.

We are now hesitant to delete the problematic user, because we're not sure how this is implemented. Will deleting the user iterate its buckets and delete those one by one? If so, we would be in trouble, because anything but reading from that user's buckets is a good way to get processes to crash / time out again. If it does it at a lower level, do we need to expect the snapshots to cause trouble? Either now, or when we finally get around to throwing out old ones?

So before we know more about what the implementation does (we're currently on Hammer 0.94.1) we won't touch that user, but we would like to get rid of it and the space it is wasting.

Thanks a lot in advance!
Daniel
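For completeness, these are the two variants we are weighing (the uid is obviously just a placeholder for the problematic user); my understanding of their behaviour is exactly what I am asking to have confirmed:

# removes only the user record; as far as I understand it refuses to act while the user still owns buckets
radosgw-admin user rm --uid=biguser

# removes the user together with all of its buckets and objects
radosgw-admin user rm --uid=biguser --purge-data

It is the --purge-data path we are worried about: if it iterates and lists every bucket the way a client would, we would expect the same timeouts and OSD flapping as before.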
Re: [ceph-users] rados cppool
On 2015-05-14 21:04:06 +, Daniel Schneller said: On 2015-04-23 19:39:33 +, Sage Weil said: On Thu, 23 Apr 2015, Pavel V. Kaygorodov wrote: Hi! I have copied two of my pools recently, because old ones has too many pgs. Both of them contains RBD images, with 1GB and ~30GB of data. Both pools was copied without errors, RBD images are mountable and seems to be fine. CEPH version is 0.94.1 You will likely have problems if you try to delete snapshots that existed on the images (snaps are not copied/preserved by cppool). sage Could you be more specific on what these problems would look like? Are you referring to RBD pools in particular, or is this a general issue with snapshots? Anything that could be done to prevent these issues? Background of the question is that we take daily snapshots of some pools to allow reverting data when users make mistakes (via RGW). So it would be difficult to get rid of all snapshots first. Thanks Daniel Never mind, found more information on this on the list a few posts later. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] ceph -w output
Hi!

I am trying to get behind the values in ceph -w, especially those regarding throughput(?) at the end:

2015-05-15 00:54:33.333500 mon.0 [INF] pgmap v26048646: 17344 pgs: 17344 active+clean; 6296 GB data, 19597 GB used, 155 TB / 174 TB avail; 6023 kB/s rd, 549 kB/s wr, 7564 op/s
2015-05-15 00:54:34.339739 mon.0 [INF] pgmap v26048647: 17344 pgs: 17344 active+clean; 6296 GB data, 19597 GB used, 155 TB / 174 TB avail; 1853 kB/s rd, 1014 kB/s wr, 2015 op/s
2015-05-15 00:54:35.353621 mon.0 [INF] pgmap v26048648: 17344 pgs: 17344 active+clean; 6296 GB data, 19597 GB used, 155 TB / 174 TB avail; 2101 kB/s rd, 1680 kB/s wr, 1950 op/s
2015-05-15 00:54:36.375887 mon.0 [INF] pgmap v26048649: 17344 pgs: 17344 active+clean; 6296 GB data, 19597 GB used, 155 TB / 174 TB avail; 1641 kB/s rd, 1266 kB/s wr, 1710 op/s
2015-05-15 00:54:37.399647 mon.0 [INF] pgmap v26048650: 17344 pgs: 17344 active+clean; 6296 GB data, 19597 GB used, 155 TB / 174 TB avail; 4735 kB/s rd, 777 kB/s wr, 7088 op/s
2015-05-15 00:54:38.453922 mon.0 [INF] pgmap v26048651: 17344 pgs: 17344 active+clean; 6296 GB data, 19597 GB used, 155 TB / 174 TB avail; 5176 kB/s rd, 942 kB/s wr, 7779 op/s
2015-05-15 00:54:39.462838 mon.0 [INF] pgmap v26048652: 17344 pgs: 17344 active+clean; 6296 GB data, 19597 GB used, 155 TB / 174 TB avail; 3407 kB/s rd, 768 kB/s wr, 2131 op/s
2015-05-15 00:54:40.488387 mon.0 [INF] pgmap v26048653: 17344 pgs: 17344 active+clean; 6296 GB data, 19597 GB used, 155 TB / 174 TB avail; 3343 kB/s rd, 518 kB/s wr, 1881 op/s
2015-05-15 00:54:41.512540 mon.0 [INF] pgmap v26048654: 17344 pgs: 17344 active+clean; 6296 GB data, 19597 GB used, 155 TB / 174 TB avail; 1221 kB/s rd, 2385 kB/s wr, 1686 op/s

Am I right to assume the values for kB/s rd and kB/s wr mean that the indicated amount of data has been read/written by clients since the last line, totaled over all OSDs?

As for the op/s I am a little more uncertain. What kind of operations does this count? Assuming it is also reads and writes aggregated, what counts as an operation? For example, when I request data via the Rados Gateway, do I see one op here for the request from RGW's perspective, or do I see multiple, depending on how many low level objects a big RGW upload was striped to? What about non-rgw objects that get striped? Are reads/writes on those counted as one, or one per stripe? Is there anything else counting into this but reads/writes to the object data? What about key/value level accesses?

Is it possible for someone to come up with a theoretical estimate for a maximum value achievable with a given set of hardware? This is a cluster of 4 nodes with 48 OSDs, 4TB each, all spinners. Are these values good, bad, critical? Can I somehow deduce - even if it is just a rather rough estimate - how loaded my cluster is? I am not talking about precision monitoring, but some kind of traffic light system (e.g. up to X% of the theoretical max is fine, up to Y% shows a very busy cluster and anything above Y% means we might be up for trouble)?

Any pointers to documentation or other material would be appreciated if this was discussed in some detail before. The only thing I found was a post on this list from 2013 which did not say more than "ops are reads, writes, anything", without going into detail about the "anything".

Thanks a lot!
Daniel
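P.S.: In case it helps the discussion, this is the kind of back-of-the-envelope estimate I have in mind -- deliberately crude, all numbers below are common rules of thumb rather than measured values, so corrections are very welcome:

48 spinners x ~100 random IOPS each           ≈ 4,800 raw random IOPS in the cluster
3x replication x 2 (journal + data on disk)   ≈ 6 disk writes per client write
4,800 / 6                                     ≈ ~800 sustained random client write op/s

Reads are cheaper (one replica, no journal write), so a mixed workload would land somewhere between those extremes. If something along those lines is roughly valid, the 7,500 op/s peaks above would already mean a fairly busy cluster -- which is exactly the traffic-light style statement I am after, hence the question whether this reasoning holds at all.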
Re: [ceph-users] rados cppool
On 2015-04-23 19:39:33 +, Sage Weil said: On Thu, 23 Apr 2015, Pavel V. Kaygorodov wrote: Hi! I have copied two of my pools recently, because old ones has too many pgs. Both of them contains RBD images, with 1GB and ~30GB of data. Both pools was copied without errors, RBD images are mountable and seems to be fine. CEPH version is 0.94.1 You will likely have problems if you try to delete snapshots that existed on the images (snaps are not copied/preserved by cppool). sage Could you be more specific on what these problems would look like? Are you referring to RBD pools in particular, or is this a general issue with snapshots? Anything that could be done to prevent these issues? Background of the question is that we take daily snapshots of some pools to allow reverting data when users make mistakes (via RGW). So it would be difficult to get rid of all snapshots first. Thanks Daniel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Firefly to Hammer
You should be able to do just that. We recently upgraded from Firefly to Hammer like that. Follow the order described on the website: monitors, OSDs, MDSs.

Notice that the Debian packages do not restart running daemons, but they _do_ start daemons that are not already running. So if, say, you had shut down some OSDs before your upgrade for whatever reason, they would be started as part of the upgrade.

Daniel
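On Ubuntu 14.04 (upstart), a rolling upgrade of a single node then looks roughly like this -- a sketch only, ids and hostnames need to be adjusted to your environment:

# avoid unnecessary rebalancing while OSDs restart
ceph osd set noout

# per node, after the packages have been upgraded:
sudo restart ceph-mon id=$(hostname -s)   # only on monitor nodes
sudo restart ceph-osd id=3                # repeat for each OSD id on the node

# wait for HEALTH_OK / active+clean before moving on to the next node
ceph -s

# once all nodes are done
ceph osd unset noout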
Re: [ceph-users] Understand RadosGW logs
Bump... On 2015-03-03 10:54:13 +, Daniel Schneller said: Hi! After realizing the problem with log rotation (see http://thread.gmane.org/gmane.comp.file-systems.ceph.user/17708) and fixing it, I now for the first time have some meaningful (and recent) logs to look at. While from an application perspective there seem to be no issues, I would like to understand some messages I find with relatively high frequency in the logs: Exhibit 1 - 2015-03-03 11:14:53.685361 7fcf4bfef700 0 ERROR: flush_read_list(): d-client_c-handle_data() returned -1 2015-03-03 11:15:57.476059 7fcf39ff3700 0 ERROR: flush_read_list(): d-client_c-handle_data() returned -1 2015-03-03 11:17:43.570986 7fcf25fcb700 0 ERROR: flush_read_list(): d-client_c-handle_data() returned -1 2015-03-03 11:22:00.881640 7fcf39ff3700 0 ERROR: flush_read_list(): d-client_c-handle_data() returned -1 2015-03-03 11:22:48.147011 7fcf35feb700 0 ERROR: flush_read_list(): d-client_c-handle_data() returned -1 2015-03-03 11:27:40.572723 7fcf50ff9700 0 ERROR: flush_read_list(): d-client_c-handle_data() returned -1 2015-03-03 11:29:40.082954 7fcf36fed700 0 ERROR: flush_read_list(): d-client_c-handle_data() returned -1 2015-03-03 11:30:32.204492 7fcf4dff3700 0 ERROR: flush_read_list(): d-client_c-handle_data() returned -1 I cannot find anything relevant by Googling for that, apart from the actual line of code that produces this line. What does that mean? Is it an indication of data corruption or are there more benign reasons for this line? Exhibit 2 -- Several of these blocks 2015-03-03 07:06:17.805772 7fcf36fed700 1 == starting new request req=0x7fcf5800f3b0 = 2015-03-03 07:06:17.836671 7fcf36fed700 0 RGWObjManifest::operator++(): result: ofs=4718592 stripe_ofs=4718592 part_ofs=0 rule-part_size=0 2015-03-03 07:06:17.836758 7fcf36fed700 0 RGWObjManifest::operator++(): result: ofs=8912896 stripe_ofs=8912896 part_ofs=0 rule-part_size=0 2015-03-03 07:06:17.836918 7fcf36fed700 0 RGWObjManifest::operator++(): result: ofs=13055243 stripe_ofs=13055243 part_ofs=0 rule-part_size=0 2015-03-03 07:06:18.263126 7fcf36fed700 1 == req done req=0x7fcf5800f3b0 http_status=200 == ... 
2015-03-03 09:27:29.855001 7fcf28fd1700 1 == starting new request req=0x7fcf580102a0 =
2015-03-03 09:27:29.866718 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=4718592 stripe_ofs=4718592 part_ofs=0 rule->part_size=0
2015-03-03 09:27:29.866778 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=8912896 stripe_ofs=8912896 part_ofs=0 rule->part_size=0
2015-03-03 09:27:29.866852 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=13107200 stripe_ofs=13107200 part_ofs=0 rule->part_size=0
2015-03-03 09:27:29.866917 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=17301504 stripe_ofs=17301504 part_ofs=0 rule->part_size=0
2015-03-03 09:27:29.875466 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=21495808 stripe_ofs=21495808 part_ofs=0 rule->part_size=0
2015-03-03 09:27:29.884434 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=25690112 stripe_ofs=25690112 part_ofs=0 rule->part_size=0
2015-03-03 09:27:29.906155 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=29884416 stripe_ofs=29884416 part_ofs=0 rule->part_size=0
2015-03-03 09:27:29.914364 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=34078720 stripe_ofs=34078720 part_ofs=0 rule->part_size=0
2015-03-03 09:27:29.940653 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=38273024 stripe_ofs=38273024 part_ofs=0 rule->part_size=0
2015-03-03 09:27:30.272816 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=42467328 stripe_ofs=42467328 part_ofs=0 rule->part_size=0
2015-03-03 09:27:31.125773 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=46661632 stripe_ofs=46661632 part_ofs=0 rule->part_size=0
2015-03-03 09:27:31.192661 7fcf28fd1700 0 ERROR: flush_read_list(): d->client_c->handle_data() returned -1
2015-03-03 09:27:31.194481 7fcf28fd1700 1 == req done req=0x7fcf580102a0 http_status=200 == ...
2015-03-03 09:28:43.008517 7fcf2a7d4700 1 == starting new request req=0x7fcf580102a0 =
2015-03-03 09:28:43.016414 7fcf2a7d4700 0 RGWObjManifest::operator++(): result: ofs=887579 stripe_ofs=887579 part_ofs=0 rule->part_size=0
2015-03-03 09:28:43.022387 7fcf2a7d4700 1 == req done req=0x7fcf580102a0 http_status=200 ==

First, what is the req= line? Is that a thread ID? I am asking because the same id is used over and over in the same file over time. More importantly, what do the RGWObjManifest::operator++():... lines mean? In the middle case above the block even ends with one of the ERROR lines mentioned before, but the HTTP status is still 200, suggesting a successful operation. Thanks in advance for shedding some light; I would like to know whether I need to take some action or at least keep an eye on these via monitoring. Cheers, Daniel
Re: [ceph-users] Shutting down a cluster fully and powering it back up
On 2015-02-28 20:46:15 +, Gregory Farnum said: Sounds good! -Greg On Sat, Feb 28, 2015 at 10:55 AM David da...@visions.se wrote: Hi! We did that a few weeks ago and it mostly worked fine. However, on startup of one of the 4 machines, it got stuck while starting OSDs (at least that's what the console output indicated), while the others started up just fine. After waiting for more than 20 minutes with the other 3 machines already back up we hit ctrl-alt-del via the server console. The signal got caught, the OS restarted and came up without problems the next time. Unfortunately, as this was in the middle of the night after a very long day of moving hardware around in the datacenter we did not manage to save the logs before they were rotated... Daniel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] RadosGW Log Rotation (firefly)
On our Ubuntu 14.04/Firefly 0.80.8 cluster we are seeing a problem with log file rotation for the rados gateway. The /etc/logrotate.d/radosgw script gets called, but it does not work correctly. It spits out this message, coming from the postrotate portion:

/etc/cron.daily/logrotate: reload: Unknown parameter: id
invoke-rc.d: initscript radosgw, action reload failed.

A new log file actually gets created, but due to the failure in the post-rotate script, the daemon continues writing into the now deleted previous file:

[B|root@node01] /etc/init ➜ ps aux | grep radosgw
root 13077 0.9 0.1 13710396 203256 ? Ssl Feb14 212:27 /usr/bin/radosgw -n client.radosgw.node01
[B|root@node01] /etc/init ➜ ls -l /proc/13077/fd/
total 0
lr-x------ 1 root root 64 Mar 2 15:53 0 -> /dev/null
lr-x------ 1 root root 64 Mar 2 15:53 1 -> /dev/null
lr-x------ 1 root root 64 Mar 2 15:53 2 -> /dev/null
l-wx------ 1 root root 64 Mar 2 15:53 3 -> /var/log/radosgw/radosgw.log.1 (deleted)
...

Trying manually with service radosgw reload fails with the same message. Running the non-upstart /etc/init.d/radosgw reload works. It will, kind of crudely, just send a SIGHUP to any running radosgw process. To figure out the cause I compared OSDs and RadosGW with regard to Upstart and got this:

[B|root@node01] /etc/init ➜ initctl list | grep osd
ceph-osd-all start/running
ceph-osd-all-starter stop/waiting
ceph-osd (ceph/8) start/running, process 12473
ceph-osd (ceph/9) start/running, process 12503
...
[B|root@node01] /etc/init ➜ initctl reload radosgw cluster=ceph id=radosgw.node01
initctl: Unknown instance: ceph/radosgw.node01
[B|root@node01] /etc/init ➜ initctl list | grep rados
radosgw-instance stop/waiting
radosgw stop/waiting
radosgw-all-starter stop/waiting
radosgw-all start/running

Apart from me not being totally clear about what the difference between radosgw-instance and radosgw is, obviously Upstart has no idea which PID to send the SIGHUP to when I ask it to reload. I can, of course, replace the logrotate config and use the /etc/init.d/radosgw reload approach, but I would like to understand if this is something unique to our system, or if this is a bug in the scripts. FWIW here's an excerpt from /etc/ceph.conf:

[client.radosgw.node01]
host = node01
rgw print continue = false
keyring = /etc/ceph/keyring.radosgw.gateway
rgw socket path = /tmp/radosgw.sock
log file = /var/log/radosgw/radosgw.log
rgw enable ops log = false
rgw gc max objs = 31

Thanks! Daniel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RadosGW Log Rotation (firefly)
On 2015-03-02 18:17:00 +0000, Gregory Farnum said: I'm not very (well, at all, for rgw) familiar with these scripts, but how are you starting up your RGW daemon? There's some way to have Apache handle the process instead of Upstart, but Yehuda says you don't want to do it. -Greg Well, we installed the packages via APT. That places the upstart scripts into /etc/init. Nothing special. That will make Upstart launch them on boot. In the meantime I just placed

/var/log/radosgw/*.log {
    rotate 7
    daily
    compress
    sharedscripts
    postrotate
        start-stop-daemon --stop --signal HUP -x /usr/bin/radosgw --oknodo
    endscript
    missingok
    notifempty
}

into the logrotate script, replacing the more complicated (and not working :)) logic with the core piece from the regular init.d script. Because the daemons were already running and writing to an already deleted log file, logrotate wouldn't see the need to rotate the (visible) ones, because they had not changed. So I needed to manually execute the above start-stop-daemon command on all relevant nodes once to force the gateway to start a new, non-deleted logfile. Daniel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
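To check whether a gateway has actually switched to a fresh log file after the HUP, something along these lines works (assuming a single radosgw process per host):

# the log fd should no longer point at a "(deleted)" file
ls -l /proc/$(pgrep -x radosgw | head -n1)/fd | grep radosgw.log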
Re: [ceph-users] Update 0.80.7 to 0.80.8 -- Restart Order
On 2015-02-03 18:48:45 +0000, Alexandre DERUMIER said: Debian deb package updates do not restart services. (So, I think it should be the same for Ubuntu.) You need to restart daemons in this order:
- monitor
- osd
- mds
- rados gateway
http://ceph.com/docs/master/install/upgrading-ceph/ Just a small update: We just updated from 0.80.7 to our own build of 0.80.8 with the fix for http://tracker.ceph.com/issues/10262 added, because that was the main reason for us to update. Went as planned. Updated packages via apt-get install --only-upgrade, then restarted MONs one by one, then OSDs one by one and finally MDSs one by one. The only slight hiccup was that ceph-fuse did not unmount voluntarily on two machines, claiming the filesystem was in use, even though no one was logged in who could be using it. Apart from that, no interruptions, no problems :) Daniel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] cephfs-fuse: set/getfattr, change pools
Hi! We have a CephFS directory /baremetal mounted as /cephfs via FUSE on our clients. There are no specific settings configured for /baremetal. As a result, trying to get the directory layout via getfattr does not work:

getfattr -n 'ceph.dir.layout' /cephfs
/cephfs: ceph.dir.layout: No such attribute

Using a dummy file I can work around this to get at least the pool name:

➜ touch dummy.txt
➜ getfattr -n 'ceph.file.layout' dummy.txt
# file: dummy.txt
ceph.file.layout=stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=cephfs

(BTW: Why doesn't getfattr -d -m - dummy.txt show any of the Ceph attributes?) Now, say I wanted to put /baremetal into a different pool, how would I go about this? Can I setfattr on the /cephfs mountpoint and assign it a different pool with e. g. different replication settings? Or would I need to mount the CephFS / directory somewhere and modify the settings for /baremetal from there? Can I change this after the fact at all, or do I have to mount both pools at the same time and move data between them manually? Thanks a lot! Daniel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
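In case it helps others reading along, assigning a different pool for newly created files below a directory is supposed to work roughly like this. The pool name and directory are examples, the pool has to be added as a CephFS data pool first, and existing files are not moved:

# make the pool known to CephFS, then set it as the layout pool for the directory
ceph mds add_data_pool cephfs_fast
setfattr -n ceph.dir.layout.pool -v cephfs_fast /cephfs/somedir
getfattr -n ceph.dir.layout /cephfs/somedir    # verify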
Re: [ceph-users] cephfs-fuse: set/getfattr, change pools
On 2015-02-03 18:19:24 +, Gregory Farnum said: Okay, I've looked at the code a bit, and I think that it's not showing you one because there isn't an explicit layout set. You should still be able to set one if you like, though; have you tried that? Actually, no, not yet. We were setting up CephFS on a 2nd cluster today and came across these issues. Turns out when we set up the first one we had used the kernel module and the accompanying tools, so some of our notes were not applicable anymore. We will play with this some more and come back if problems turn up. Thanks! Daniel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] cephfs-fuse: set/getfattr, change pools
We have a CephFS directory /baremetal mounted as /cephfs via FUSE on our clients. There are no specific settings configured for /baremetal. As a result, trying to get the directory layout via getfattr does not work: getfattr -n 'ceph.dir.layout' /cephfs /cephfs: ceph.dir.layout: No such attribute What version are you running? I thought it was zapped a while ago, but in some versions of the code you can't access these xattrs on the root inode (but you can on everything else). 0.80.7 (BTW: Why doesn't getfattr -d -m - dummy.txt show any of the Ceph attributes?) They're virtual xattrs controlling layout: you don't want tools like rsync trying to copy them around. That actually makes perfect sense :) You can change the layout settings whenever you want, but there's no mechanism for CephFS to move the data between different pools; it simply applies the settings when the file is created. Understood. So if we did not move the data ourselves, e. g. by mounting both CephFS paths simultaneously to different mount points and moving the data over using rsync, mv, cp, ... we would gradually end up with the old files stored in the old pool and new files in the new one? So every file knows about its containing pool itself, right? Cheers, Daniel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
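Since a layout change only affects newly created files, existing data would have to be rewritten to end up in the new pool. A crude sketch of what that could look like (the path is only an example):

# the copy is a new file and picks up the new layout; then replace the original
cp -a /cephfs/somedir/somefile /cephfs/somedir/somefile.new
mv /cephfs/somedir/somefile.new /cephfs/somedir/somefile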
Re: [ceph-users] cephfs-fuse: set/getfattr, change pools
Understood. Thanks for the details. Daniel On Tue, Feb 3, 2015 at 1:23 PM -0800, Gregory Farnum g...@gregs42.com wrote: On Tue, Feb 3, 2015 at 1:17 PM, John Spray wrote: On Tue, Feb 3, 2015 at 2:21 PM, Daniel Schneller wrote: Now, say I wanted to put /baremetal into a different pool, how would I go about this? Can I setfattr on the /cephfs mountpoint and assign it a different pool with e. g. different replication settings? This should make it clearer: http://ceph.com/docs/master/cephfs/file-layouts/#inheritance-of-layouts When you change the layout of a directory, the new layout will only apply to newly created files: it will not trigger any data movement. If you explicitly change the layout of a file containing data to point to a different pool, then you will see zeros when you try to read it back (although new data will be written to the new pool). That statement sounds really scary. To reassure people: you can't actually change layout on a file which has already been written to! Trying to do so will return an error code; actually changing the layouts and seeing this result would require manually mucking around with RADOS data underneath the MDS. -Greg___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Update 0.80.7 to 0.80.8 -- Restart Order
Hello! We are planning to upgrade our Ubuntu 14.04.1 based cluster from Ceph Firefly 0.80.7 to 0.80.8. We have 4 nodes, 12x4TB spinners each (plus OS disks). Apart from the 12 OSDs per node, nodes 1-3 have MONs running. The instructions on ceph.com say it is best to first restart the MONs, then the OSDs. We are wondering if updating the packages from the repository will trigger daemon restarts through package scripts. That would not guarantee the recommended restart order. Do these instructions assume that the monitors are separate machines that could be updated first? If so, are there best-practice recommendations on how to update a production cluster without service interruption? Thanks a lot for any advice! Daniel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Update 0.80.7 to 0.80.8 -- Restart Order
On 2015-02-02 16:09:27 +0000, Gregory Farnum said: That said, for a point release it shouldn't matter what order stuff gets restarted in. I wouldn't worry about it. :) That is good to know. One follow-up then: If the packages trigger restarts, they will most probably do so for *all* daemons virtually at once, right? So that means that all OSDs on that host will go down at the same time. That sounds like a not-so-good idea, taking 25% of the cluster down at the same time (provided I go host by host)? Daniel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
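If all OSDs of a host do go down at once, the usual way to keep the cluster from starting to rebalance in the meantime looks something like this (a sketch only; the OSD id is an example):

ceph osd set noout             # suppress marking OSDs out while daemons bounce
sudo restart ceph-osd id=12    # or restart the OSDs one at a time instead of all at once
ceph osd unset noout           # once everything is back up and in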
[ceph-users] Removing Snapshots Killing Cluster Performance
Hi! We take regular (nightly) snapshots of our Rados Gateway pools for backup purposes. This allows us - with some manual pokery - to restore clients' documents should they delete them accidentally. The cluster is a 4 server setup with 12x4TB spinning disks each, totaling about 175TB. We are running Firefly. We have now completed our first month of snapshots and want to remove the oldest ones. Unfortunately doing so practically kills everything else that is using the cluster, because performance drops to almost zero while the OSDs work their disks at 100% (as per iostat). It seems this is the same phenomenon I asked about some time ago where we were deleting whole pools. I could not find any way to throttle the background deletion activity (the command returns almost immediately). Here is a graph of the I/O operations waiting (colored by device) while deleting a few snapshots. Each of the blocks in the graph shows one snapshot being removed. The big one in the middle was a snapshot of the .rgw.buckets pool. It took about 15 minutes during which basically nothing relying on the cluster was working due to immense slowdowns. This included users getting kicked off their SSH sessions due to timeouts. https://public.centerdevice.de/8c95f1c2-a7c3-457f-83b6-834688e0d048 While this is a big issue in itself for us, we would at least like to estimate how long the process will take per snapshot / per pool. I assume the time needed is a function of the number of objects that were modified between two snapshots. We tried to get an idea of at least how many objects were added/removed in total by running `rados df` with a snapshot specified as a parameter, but it seems we still always get the current values:

$ sudo rados -p .rgw df --snap backup-20141109
selected snap 13 'backup-20141109'
pool name    category    KB        objects
.rgw         -           276165    1368545
$ sudo rados -p .rgw df --snap backup-20141124
selected snap 28 'backup-20141124'
pool name    category    KB        objects
.rgw         -           276165    1368546
$ sudo rados -p .rgw df
pool name    category    KB        objects
.rgw         -           276165    1368547

So there are a few questions:
1) Is there any way to control how much such an operation will tax the cluster (we would be happy to have it run longer, if that meant not utilizing all disks fully during that time)?
2) Is there a way to get a decent approximation of how much work deleting a specific snapshot will entail (in terms of objects, time, whatever)?
3) Would SSD journals help here? Or any other hardware configuration change for that matter?
4) Any other recommendations?
We definitely need to remove the data, not because of a lack of space (at least not at the moment), but because when customers delete stuff / cancel accounts, we are obliged to remove their data at least after a reasonable amount of time. Cheers, Daniel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Removing Snapshots Killing Cluster Performance
On 2014-12-01 10:03:35 +0000, Dan Van Der Ster said: Which version of Ceph are you using? This could be related: http://tracker.ceph.com/issues/9487 Firefly. I had seen this ticket earlier (when deleting a whole pool) and hoped the backport of the fix would be available some time soon. I must admit, I did not look this up before posting, because I had forgotten about it. See ReplicatedPG: don't move on to the next snap immediately; basically, the OSD is getting into a tight loop trimming the snapshot objects. The fix above breaks out of that loop more frequently, and then you can use the osd snap trim sleep option to throttle it further. I’m not sure if the fix above will be sufficient if you have many objects to remove per snapshot. Just so I get this right: With the fix alone you are not sure it would be nice enough, so adjusting the snap trim sleep option in addition might be needed? I assume the loop that will be broken up with 9487 does not take the sleep time into account? That commit is only in giant at the moment. The backport to dumpling is in the dumpling branch but not yet in a release, and firefly is still pending. Holding my breath :) Any thoughts on the other items I had in the original post? 2) Is there a way to get a decent approximation of how much work deleting a specific snapshot will entail (in terms of objects, time, whatever)? 3) Would SSD journals help here? Or any other hardware configuration change for that matter? Thanks! Daniel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
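For reference, once a release containing the fix is running, the throttle Dan mentions can presumably be applied at runtime roughly like this (the value is only an example):

ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.05'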
Re: [ceph-users] Removing Snapshots Killing Cluster Performance
Thanks for your input. We will see what we can find out with the logs and how to proceed from there. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Configuring swift user for ceph Rados Gateway - 403 Access Denied
On 2014-11-11 13:12:32 +0000, ವಿನೋದ್ Vinod H I said: Hi, I am having problems accessing the rados gateway using the swift interface. I am using the ceph firefly version and have configured a us region as explained in the docs. There are two zones us-east and us-west. The us-east gateway is running on host ceph-node-1 and the us-west gateway is running on host ceph-node-2. [...] Auth GET failed: http://ceph-node-1/auth 403 Forbidden [...] swift_keys: [ { user: useast:swift, secret_key: FmQYYbzly4RH+PmNlrWA3ynN+eJrayYXzeISGDSw}], We have seen problems when the secret_key has special characters. I am not sure if + is one of them, but the manual states this somewhere. Try setting the key explicitly or re-generating one until you get one without any special chars. Drove me nuts. Daniel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
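Setting the key explicitly instead of having one generated could look like this; the user names and the secret below are only examples:

radosgw-admin key create --uid=useast --subuser=useast:swift \
    --key-type=swift --secret=SomeSecretWithoutSpecialChars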
Re: [ceph-users] Swift + radosgw: How do I find accounts/containers/objects limitation?
To remove the max_bucket limit I used radosgw-admin user modify --uid=username --max-buckets=0 Off the top of my head, I think radosgw-admin user info --uid=username will show you the current values without changing anything. See also this thread I started about this topic a few weeks ago. https://www.mail-archive.com/ceph-users@lists.ceph.com/msg12840.html Daniel___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Crash with rados cppool and snapshots
Ticket created: http://tracker.ceph.com/issues/9941 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Delete pools with low priority?
On 2014-10-30 10:14:44 +, Dan van der Ster said: Hi Daniel, I can't remember if deleting a pool invokes the snap trimmer to do the actual work deleting objects. But if it does, then it is most definitely broken in everything except latest releases (actual dumpling doesn't have the fix yet in a release). Given a release with those fixes (see tracker #9487) then you should enable the snap trim sleep (e.g. 0.05) and set the io priority class to 3 or idle. Cheers, Dan Dan, thank you for the hint. I have looked at the ticket, but I am not too familiar with trac (yet) to understand the current state. The header part says Status: Pending Backport and Backport: dumpling. At the very bottom (as of now ;)) however, I see a revision Revision 496e561d Added by Samuel Just 3 days ago ReplicatedPG: don't move on to the next snap immediately If we have a bunch of trimmed snaps for which we have no objects, we'll spin for a long time. Instead, requeue. Fixes: #9487 Backport: dumpling, firefly, giant Reviewed-by: Sage Weil s...@redhat.com Signed-off-by: Samuel Just sam.j...@inktank.com (cherry picked from commit c17ac03a50da523f250eb6394c89cc7e93cb4659) Does this mean there will be a backport to firefly, too, and that the bug status (the header of the page) hasn't been updated yet? Thanks! Daniel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
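For completeness, the throttles Dan refers to would go into ceph.conf on the OSD hosts, roughly like the following sketch; whether the ioprio options are honoured depends on the exact release, and the values are only examples:

[osd]
    osd snap trim sleep = 0.05
    osd disk thread ioprio class = idle
    osd disk thread ioprio priority = 7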
Re: [ceph-users] Crash with rados cppool and snapshots
Apart from the current "there is a bug" part, is the idea of copying a snapshot into a new pool a viable one for a full backup/restore? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Crash with rados cppool and snapshots
Hi! We are exploring options to regularly preserve (i.e. backup) the contents of the pools backing our rados gateways. For that we create nightly snapshots of all the relevant pools when there is no activity on the system to get consistent states. In order to restore the whole pools back to a specific snapshot state, we tried to use the rados cppool command (see below) to copy a snapshot state into a new pool. Unfortunately this causes a segfault. Are we doing anything wrong? This command:

rados cppool --snap snap-1 deleteme.lp deleteme.lp2 2> segfault.txt

Produces this output:

*** Caught signal (Segmentation fault) ** in thread 7f8f49a927c0
ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
1: rados() [0x43eedf]
2: (()+0x10340) [0x7f8f48738340]
3: (librados::IoCtxImpl::snap_lookup(char const*, unsigned long*)+0x17) [0x7f8f48aff127]
4: (main()+0x1385) [0x411e75]
5: (__libc_start_main()+0xf5) [0x7f8f4795fec5]
6: rados() [0x41c6f7]
2014-10-29 12:03:22.761653 7f8f49a927c0 -1 *** Caught signal (Segmentation fault) ** in thread 7f8f49a927c0
ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
1: rados() [0x43eedf]
2: (()+0x10340) [0x7f8f48738340]
3: (librados::IoCtxImpl::snap_lookup(char const*, unsigned long*)+0x17) [0x7f8f48aff127]
4: (main()+0x1385) [0x411e75]
5: (__libc_start_main()+0xf5) [0x7f8f4795fec5]
6: rados() [0x41c6f7]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Full segfault file and the objdump output for the rados command can be found here:
- https://public.centerdevice.de/53bddb80-423e-4213-ac62-59fe8dbb9bea
- https://public.centerdevice.de/50b81566-41fb-439a-b58b-e1e32d75f32a

We updated to the 0.80.7 release (saw the issue with 0.80.5 before and had hoped that the long list of bugfixes in the release notes would include a fix for this) but are still seeing it. Rados gateways, OSDs, MONs etc. have all been restarted after the update. Package versions as follows:

daniel.schneller@node01 [~] $ ➜ dpkg -l | grep ceph
ii ceph           0.80.7-1trusty
ii ceph-common    0.80.7-1trusty
ii ceph-fs-common 0.80.7-1trusty
ii ceph-fuse      0.80.7-1trusty
ii ceph-mds       0.80.7-1trusty
ii libcephfs1     0.80.7-1trusty
ii python-ceph    0.80.7-1trusty
daniel.schneller@node01 [~] $ ➜ uname -a
Linux node01 3.13.0-27-generic #50-Ubuntu SMP Thu May 15 18:06:16 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Copying without the snapshot works. Should this work at least in theory? Thanks! Daniel -- Daniel Schneller Mobile Development Lead CenterDevice GmbH | Merscheider Straße 1 | 42699 Solingen tel: +49 1754155711| Deutschland daniel.schnel...@centerdevice.com | www.centerdevice.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Delete pools with low priority?
Bump :-) Any ideas on this? They would be much appreciated. Also: Sorry for a possible double post, client had forgotten its email config. On 2014-10-22 21:21:54 +, Daniel Schneller said: We have been running several rounds of benchmarks through the Rados Gateway. Each run creates several hundred thousand objects and similarly many containers. The cluster consists of 4 machines, 12 OSD disks (spinning, 4TB) — 48 OSDs total. After running a set of benchmarks we renamed the pools used by the gateway pools to get a clean baseline. In total we now have several million objects and containers in 3 pools. Redundancy for all pools is set to 3. Today we started deleting the benchmark data. Once the first renamed set of RGW pools was executed, cluster performance started to go down the drain. Using iotop we can see that the disks are all working furiously. As the command to delete the pools came back very quickly, the assumption is that we are now seeing the effects of the actual objects being removed, causing lots and lots of IO activity on the disks, negatively impacting regular operations. We are running OpenStack on top of Ceph, and we see drastic reduction in responsiveness of these machines as well as in CephFS. Fortunately this is still a test setup, so no production systems are affected. Nevertheless I would like to ask a few questions: 1) Is it possible to have the object deletion run in some low-prio mode? 2) If not, is there another way to delete lots and lots of objects without affecting the rest of the cluster so badly? 3) Can we somehow determine the progress of the deletion so far? We would like to estimate if this is going to take hours, days or weeks? 4) Even if not possible for the already running deletion, could be get a progress for the remaining pools we still want to delete? 5) Are there any parameters that we might tune — even if just temporarily - to speed this up? Slide 18 of http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern describes a very similar situation. Thanks, Daniel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Icehouse Ceph -- live migration fails?
samuel samu60@... writes: Hi all, This issue is also affecting us (centos6.5 based icehouse) and, as far as I could read, comes from the fact that the path /var/lib/nova/instances (or whatever configuration path you have in nova.conf) is not shared. Nova does not see this as a shared path and therefore does not allow live migration, although all the required information is stored in ceph and in the qemu local state. Some people have cheated nova into seeing this as a shared path, but I'm not confident about how this will affect stability. Can someone confirm this deduction? What are the possible workarounds for this situation in a full ceph based environment (without shared path)? I got it to work finally. Step 1 was double checking nova.conf on the compute nodes. It was actually missing the flags pointed out earlier in this thread. As for the /var/lib/nova/instances data, this will get transferred to the destination host as part of the migration. For that to work, you need to have the transport between the libvirtd instances set up correctly.

libvirt_live_migration_flag=VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE,VIR_MIGRATE_PERSIST_DEST
live_migration_uri=qemu+ssh://nova@%s/system?keyfile=/var/lib/nova/.ssh/id_rsa

I did not want to open another TCP port on all the nodes, so I went with the SSH based transport as described in the libvirtd documentation. For some reason it would only work once I explicitly added the user account (nova@...) and the location of the key file, even though the locations and names are the defaults. As part of our deployment via Ansible we make sure the nova user has an up to date list of host keys in /var/lib/nova/.ssh/known_hosts. Otherwise you will get errors regarding failing host key verification in /var/log/nova/nova-compute.log if you try to live migrate. Of course, the user needs to be present everywhere, have the same key everywhere and have that key's public part in /var/lib/nova/.ssh/authorized_keys for the login to work without user intervention. Setting this up alone brought me almost to my goal; the only thing I had missed was vncserver_listen = 0.0.0.0 in nova.conf -- this address will be put into the virtual machine's libvirt.xml file as the address the machine uses for its VNC console. While the VM is still on the baremetal node where it was originally created, this works. However, when the VM gets migrated to another host (basically copying over the instance folder from /var/lib/nova/instances) this address cannot be bound on the new baremetal host and the migration fails. The log is pretty clear about that. Once I had changed the vncserver_listen, new machines could be migrated immediately. For existing ones, I have not tried whether editing the libvirt.xml file while they are running is in any way harmful, so I will wait until I can shut them down for a short maintenance window, then edit the file to replace the current listen address with 0.0.0.0 and bring them up again. One more caveat: If you use the Horizon dashboard, there is a bug in the Icehouse release that prevents successful live migration on another level, because it uses the wrong names for the baremetal machines. Instead of the compute service names (e. g. node01, node02 ... in my case), it uses the fully qualified hypervisor names. This will not work. See https://bugs.launchpad.net/horizon/+bug/1335999 for details. I applied the corresponding patch from https://git.openstack.org/cgit/openstack/horizon/patch/?id=89dc7de2e87b8d4e35837ad5122117aa2fb2c520 (excluding the tests, those do not match well enough). Now I can live migrate from Horizon and the command line :) ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
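To see which listen address an existing instance's VNC server currently has before editing anything, a quick look at the libvirt definition helps (the instance name is an example):

virsh dumpxml instance-0000004a | grep "graphics type='vnc'"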
Re: [ceph-users] max_bucket limit -- safe to disable?
=0x7f0378fd8250 obj=.rgw:CNT-UUID state=0x7f02b007ac18 s-prefetch_data=0 31.886430 ID 10 cache get: name=.rgw+CNT-UUID : type miss (requested=22, cached=19) 36.746327 ID 10 cache put: name=.rgw+CNT-UUID 36.746404 ID 10 moving .rgw+CNT-UUID to cache LRU end 36.746426 ID 20 get_obj_state: s-obj_tag was set empty 36.746431 ID 20 Read xattr: user.rgw.idtag 36.746433 ID 20 Read xattr: user.rgw.manifest 36.746452 ID 10 cache get: name=.rgw+CNT-UUID : hit 36.746481 ID 20 rgw_get_bucket_info: bucket instance: CNT-UUID(@{i=.rgw.buckets.index}.rgw.buckets[default.78418684.119116]) 36.746491 ID 20 reading from .rgw:.bucket.meta.CNT-UUID:default.78418684.119116 36.746549 ID 20 get_obj_state: rctx=0x7f0378fd8250 obj=.rgw:.bucket.meta.CNT-UUID:default.78418684.119116 state=0x7f02b00ce638 s-prefetch_data=0 36.746585 ID 10 cache get: name=.rgw+.bucket.meta.CNT-UUID:default.78418684.119116 : type miss (requested=22, cached=19) 36.747938 ID 10 cache put: name=.rgw+.bucket.meta.CNT-UUID:default.78418684.119116 36.747955 ID 10 moving .rgw+.bucket.meta.CNT-UUID:default.78418684.119116 to cache LRU end 36.747963 ID 10 updating xattr: name=user.rgw.acl bl.length()=177 36.747972 ID 20 get_obj_state: s-obj_tag was set empty 36.747975 ID 20 Read xattr: user.rgw.acl 36.747977 ID 20 Read xattr: user.rgw.idtag 36.747978 ID 20 Read xattr: user.rgw.manifest 36.747985 ID 10 cache get: name=.rgw+.bucket.meta.CNT-UUID:default.78418684.119116 : hit 36.748025 ID 15 Read AccessControlPolicyAccessControlPolicy xmlns=http://s3.amazonaws.com/doc/2006-03-01/;OwnerIDdocumentstore/IDDisplayNameDocument Store/DisplayName/OwnerAccessControlListGrantGrantee xmlns:xsi=http://www.w3.org/2001/XMLSchema-instance; xsi:type=CanonicalUserIDdocumentstore/IDDisplayNameDocument Store/DisplayName/GranteePermissionFULL_CONTROL/Permission/Grant/AccessControlList/AccessControlPolicy 36.748037 ID 2 req 983095:4.861888:swift:PUT /swift/v1/CNT-UUID/version:put_obj:init op 36.748043 ID 2 req 983095:4.861895:swift:PUT /swift/v1/CNT-UUID/version:put_obj:verifying op mask 36.748046 ID 20 required_mask= 2 user.op_mask=7 36.748050 ID 2 req 983095:4.861902:swift:PUT /swift/v1/CNT-UUID/version:put_obj:verifying op permissions 36.748054 ID 5 Searching permissions for uid=documentstore mask=50 36.748056 ID 5 Found permission: 15 36.748058 ID 5 Searching permissions for group=1 mask=50 36.748060 ID 5 Permissions for group not found 36.748061 ID 5 Searching permissions for group=2 mask=50 36.748063 ID 5 Permissions for group not found 36.748064 ID 5 Getting permissions id=documentstore owner=documentstore perm=2 36.748066 ID 10 uid=documentstore requested perm (type)=2, policy perm=2, user_perm_mask=2, acl perm=2 36.748069 ID 2 req 983095:4.861921:swift:PUT /swift/v1/CNT-UUID/version:put_obj:verifying op params 36.748072 ID 2 req 983095:4.861924:swift:PUT /swift/v1/CNT-UUID/version:put_obj:executing 36.748200 ID 20 get_obj_state: rctx=0x7f0378fd8250 obj=CNT-UUID:version state=0x7f02b0042618 s-prefetch_data=0 36.802077 ID 10 setting object write_tag=default.78418684.983095 36.818727 ID 2 req 983095:4.932579:swift:PUT /swift/v1/CNT-UUID/version:put_obj:http status=201 == -- Daniel Schneller Mobile Development Lead CenterDevice GmbH | Merscheider Straße 1 | 42699 Solingen tel: +49 1754155711| Deutschland daniel.schnel...@centerdevice.com | www.centerdevice.com On 06 Oct 2014, at 19:26, Yehuda Sadeh yeh...@redhat.com wrote: It'd be interesting to see which rados operation is slowing down the requests. 
Can you provide a log dump of a request (with 'debug rgw = 20', and 'debug ms = 1'). This might give us a better idea as to what's going on. Thanks, Yehuda On Mon, Oct 6, 2014 at 10:05 AM, Daniel Schneller daniel.schnel...@centerdevice.com wrote: Hi again! We have done some tests regarding the limits of storing lots and lots of buckets through Rados Gateway into Ceph. Our test used a single user for which we removed the default max buckets limit. It then continuously created containers - both empty and such with 10 objects of around 100k random data in them. With 3 parallel processes we saw relatively consistent time of about 500-700msper such container. This kept steady until we reached approx. 3 million containers after which the time per insert sharply went up to currently around 1600ms and rising. Due to some hiccups with network equipment the tests were aborted a few times, but then resumed without deleting any of the previous runs created containers, so the actual number might be 2.8 or 3.2 million, but still in that ballpark. We aborted the test here. Judging by the advice given earlier (see quoted mail below) that we might hit a limit on some per-user data structures, we created another user account, removed its max-bucket limit as well and restarted the benchmark with that one
Re: [ceph-users] max_bucket limit -- safe to disable?
Hi! By looking at these logs it seems that there are only 8 pgs on the .rgw pool, if this is correct then you may want to change that considering your workload. Thanks. See out pg_num configuration below. We had already suspected that the 1600 that we had previously (48 OSDs * 100 / triple redundancy) were not ideal, so we increased the .rgw.buckets pool to 2048. The number of objects and their size was in an earlier email, but for completeness I will put them up once again. Any other ideas where to look? == for i in $(rados df | awk '{ print $1 }' | grep '^\.'); do echo $i; echo -n - “; ceph osd pool get $i pg_num; echo -n - “; ceph osd pool get $i pgp_num; done .intent-log - pg_num: 1600 - pgp_num: 1600 .log - pg_num: 1600 - pgp_num: 1600 .rgw - pg_num: 1600 - pgp_num: 1600 .rgw.buckets - pg_num: 2048 - pgp_num: 2048 .rgw.buckets.index - pg_num: 1600 - pgp_num: 1600 .rgw.control - pg_num: 1600 - pgp_num: 1600 .rgw.gc - pg_num: 1600 - pgp_num: 1600 .rgw.root - pg_num: 100 - pgp_num: 100 .usage - pg_num: 1600 - pgp_num: 1600 .users - pg_num: 1600 - pgp_num: 1600 .users.email - pg_num: 1600 - pgp_num: 1600 .users.swift - pg_num: 1600 - pgp_num: 1600 .users.uid - pg_num: 1600 - pgp_num: 1600 === .rgw = KB: 1,966,932 objects:9,094,552 rd: 195,747,645 rd KB: 153,585,472 wr:30,191,844 wr KB:10,751,065 .rgw.buckets = KB: 2,038,313,855 objects: 22,088,103 rd: 5,455,123 rd KB: 408,416,317 wr: 149,377,728 wr KB: 1,882,517,472 .rgw.buckets.index = KB: 0 objects:5,374,376 rd: 267,996,778 rd KB: 262,626,106 wr: 107,142,891 wr KB: 0 .rgw.control = KB: 0 objects:8 rd: 0 rd KB: 0 wr: 0 wr KB: 0 .rgw.gc = KB: 0 objects: 32 rd: 5,554,407 rd KB: 5,713,942 wr: 8,355,934 wr KB: 0 .rgw.root = KB: 1 objects:3 rd: 524 rd KB: 346 wr: 3 wr KB: 3 Daniel On 08 Oct 2014, at 01:03, Yehuda Sadeh yeh...@redhat.com wrote: This operation stalled quite a bit, seems that it was waiting for the osd: 2.547155 7f036ffc7700 1 -- 10.102.4.11:0/1009401 -- 10.102.4.14:6809/7428 -- osd_op(client.78418684.0:27514711 .bucket.meta.CNT-UUID-FINDME:default.78418684.122043 [call version.read,getxattrs,stat] 5.3b7d1197 ack+read e16034) v4 -- ?+0 0x7f026802e2c0 con 0x7f040c055ca0 ... 7.619750 7f041ddf4700 1 -- 10.102.4.11:0/1009401 == osd.32 10.102.4.14:6809/7428 208252 osd_op_reply(27514711 .bucket.meta.CNT-UUID-FINDME:default.78418684.122043 [call,getxattrs,stat] v0'0 uv6371 ondisk = 0) v6 338+0+336 (3685145659 0 4232894755) 0x7f00e430f540 con 0x7f040c055ca0 By looking at these logs it seems that there are only 8 pgs on the .rgw pool, if this is correct then you may want to change that considering your workload. Yehuda On Tue, Oct 7, 2014 at 3:46 PM, Daniel Schneller daniel.schnel...@centerdevice.com wrote: Hi! Sorry, I must have missed the enabling of that debug module. However, the test setup has been the same all the time - I only have the one test-application :) But maybe I phrased it a bit ambiguously when I wrote It then continuously created containers - both empty and such with 10 objects of around 100k random data in them. 100 kilobytes is the size of a single object, of which we create 10 per container. The container gets created first, without any objects, naturally, then 10 objects are added. One of these objects is called “version”, the rest have generated names with a fixed prefix and appended 1-9. The version object is the one I picked for the example logs I sent earlier. I hope this makes the setup clearer. Attached you will find the (now more extensive) logs for the outliers again. 
As you did not say that I garbled the logs, I assume the pre-processing was OK, so I have prepared the new data in a similar fashion, marking the relevant request with CNT-UUID-FINDME. I have not removed any lines in between the beginning of the “interesting” request and its completion to keep all the network traffic log intact. Due to the increased verbosity, I will not post the logs inline, but only attach them gzipped. As before, should the full data set be needed, I can provide an archived version. Thanks for your support! Daniel On 07 Oct 2014, at 22:45, Yehuda Sadeh yeh...@redhat.com wrote: The logs here don't include the messenger (debug ms = 1). It's hard to tell what going on from looking at the outliers. Also, in your previous mail you
Re: [ceph-users] max_bucket limit -- safe to disable?
Hi again! We have done some tests regarding the limits of storing lots and lots of buckets through Rados Gateway into Ceph. Our test used a single user for which we removed the default max buckets limit. It then continuously created containers - both empty and such with 10 objects of around 100k random data in them. With 3 parallel processes we saw relatively consistent time of about 500-700msper such container. This kept steady until we reached approx. 3 million containers after which the time per insert sharply went up to currently around 1600ms and rising. Due to some hiccups with network equipment the tests were aborted a few times, but then resumed without deleting any of the previous runs created containers, so the actual number might be 2.8 or 3.2 million, but still in that ballpark. We aborted the test here. Judging by the advice given earlier (see quoted mail below) that we might hit a limit on some per-user data structures, we created another user account, removed its max-bucket limit as well and restarted the benchmark with that one, _expecting_ the times to be down to the original range of 500-700ms. However, what we are seeing is that the times stay at the 1600ms and higher levels even for that fresh account. Here is the output of `rados df`, reformatted to fit the email. clones, degraded and unfound were 0 in all cases and have been left out for clarity: .rgw = KB: 1,966,932 objects: 9,094,552 rd: 195,747,645 rd KB: 153,585,472 wr:30,191,844 wr KB:10,751,065 .rgw.buckets = KB: 2,038,313,855 objects:22,088,103 rd: 5,455,123 rd KB: 408,416,317 wr: 149,377,728 wr KB: 1,882,517,472 .rgw.buckets.index = KB: 0 objects: 5,374,376 rd: 267,996,778 rd KB: 262,626,106 wr: 107,142,891 wr KB: 0 .rgw.control = KB: 0 objects: 8 rd: 0 rd KB: 0 wr: 0 wr KB: 0 .rgw.gc = KB: 0 objects:32 rd: 5,554,407 rd KB: 5,713,942 wr: 8,355,934 wr KB: 0 .rgw.root = KB: 1 objects: 3 rd: 524 rd KB: 346 wr: 3 wr KB: 3 We would very much like to understand what is going on here in order to decide if Rados Gateway is a viable option to base our production system on (where we expect similar counts as in the benchmark), or if we need to investigate using librados directly which we would like to avoid if possible. Any advice on what configuration parameters to check or which additional information to provide to analyze this would be very much welcome. Cheers, Daniel -- Daniel Schneller Mobile Development Lead CenterDevice GmbH | Merscheider Straße 1 | 42699 Solingen tel: +49 1754155711| Deutschland daniel.schnel...@centerdevice.com mailto:daniel.schnel...@centerdevice.com | www.centerdevice.com http://www.centerdevice.com/ On 10 Sep 2014, at 19:42, Gregory Farnum g...@inktank.com wrote: On Wednesday, September 10, 2014, Daniel Schneller daniel.schnel...@centerdevice.com mailto:daniel.schnel...@centerdevice.com wrote: On 09 Sep 2014, at 21:43, Gregory Farnum g...@inktank.com wrote: Yehuda can talk about this with more expertise than I can, but I think it should be basically fine. By creating so many buckets you're decreasing the effectiveness of RGW's metadata caching, which means the initial lookup in a particular bucket might take longer. Thanks for your thoughts. With “initial lookup in a particular bucket” do you mean accessing any of the objects in a bucket? If we directly access the object (not enumerating the buckets content), would that still be an issue? 
Just trying to understand the inner workings a bit better to make more educated guesses :) When doing an object lookup, the gateway combines the bucket ID with a mangled version of the object name to try and do a read out of RADOS. It first needs to get that bucket ID though -- it will cache an the bucket name-ID mapping, but if you have a ton of buckets there could be enough entries to degrade the cache's effectiveness. (So, you're more likely to pay that extra disk access lookup.) The big concern is that we do maintain a per-user list of all their buckets — which is stored in a single RADOS object — so if you have an extreme number of buckets that RADOS object could get pretty big and become a bottleneck when creating/removing/listing the buckets. You Alright. Listing buckets is no problem, that we don’t do. Can you say what “pretty big
[ceph-users] Icehouse Ceph -- live migration fails?
Hi! We have an Icehouse system running with librbd based Cinder and Glance configurations, storing images and volumes in Ceph. Configuration is (apart from network setup details, of course) by the book / OpenStack setup guide. Works very nicely, including regular migration, but live migration of virtual machines fails. I created a simple machine booting from a volume based off the Ubuntu 14.04.1 cloud image for testing. Using Horizon, I can move this VM from host to host, but when I try to Live Migrate it from one baremetal host to another, I get an error message “Failed to live migrate instance to host ’node02’. The only related log entry I recognize is in the controller’s nova-api.log: 2014-09-25 17:15:47.679 3616 INFO nova.api.openstack.wsgi [req-f3dc3c2e-d366-40c5-a1f1-31db71afd87a f833f8e2d1104e66b9abe9923751dcf2 a908a95a87cc42cd87ff97da4733c414] HTTP exception thrown: Compute service of node02.baremetal.clusterb.centerdevice.local is unavailable at this time. 2014-09-25 17:15:47.680 3616 INFO nova.osapi_compute.wsgi.server [req-f3dc3c2e-d366-40c5-a1f1-31db71afd87a f833f8e2d1104e66b9abe9923751dcf2 a908a95a87cc42cd87ff97da4733c414] 10.102.6.8 POST /v2/a908a95a87cc42cd87ff97da4733c414/servers/0f762f35-64ee-461f-baa4-30f5de4d5ddf/action HTTP/1.1 status: 400 len: 333 time: 0.1479030 I cannot see anything of value on the destination host itself. New machines get scheduled there, so the compute service cannot really be down. In this thread Travis http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-March/019944.html http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-March/019944.html describes a similar situation, however that was on Folsom, so I wonder if it is still applicable. Would be great to get some outside opinion :) Thanks! Daniel -- Daniel Schneller Mobile Development Lead CenterDevice GmbH | Merscheider Straße 1 | 42699 Solingen tel: +49 1754155711| Deutschland daniel.schnel...@centerdevice.com | www.centerdevice.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] max_bucket limit -- safe to disable?
On 09 Sep 2014, at 21:43, Gregory Farnum g...@inktank.com wrote: Yehuda can talk about this with more expertise than I can, but I think it should be basically fine. By creating so many buckets you're decreasing the effectiveness of RGW's metadata caching, which means the initial lookup in a particular bucket might take longer. Thanks for your thoughts. With “initial lookup in a particular bucket” do you mean accessing any of the objects in a bucket? If we directly access the object (not enumerating the buckets content), would that still be an issue? Just trying to understand the inner workings a bit better to make more educated guesses :) The big concern is that we do maintain a per-user list of all their buckets — which is stored in a single RADOS object — so if you have an extreme number of buckets that RADOS object could get pretty big and become a bottleneck when creating/removing/listing the buckets. You Alright. Listing buckets is no problem, that we don’t do. Can you say what “pretty big” would be in terms of MB? How much space does a bucket record consume in there? Based on that I could run a few numbers. should run your own experiments to figure out what the limits are there; perhaps you have an easy way of sharding up documents into different users. Good advice. We can do that per distributor (an org unit in our software) to at least compartmentalize any potential locking issues in this area to that single entity. Still, there would be quite a lot of buckets/objects per distributor, so some more detail on the above items would be great. Thanks a lot! Daniel___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] max_bucket limit -- safe to disable?
Hi list! Under http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-September/033670.html I found a situation not unlike ours, but unfortunately either the list archive fails me or the discussion ended without a conclusion, so I dare to ask again :) We currently have a setup of 4 servers with 12 OSDs each, combined journal and data. No SSDs. We develop a document management application that accepts user uploads of all kinds of documents and processes them in several ways. For any given document, we might create anywhere from 10s to several hundred dependent artifacts. We are now preparing to move from Gluster to a Ceph based backend. The application uses the Apache JClouds Library to talk to the Rados Gateways that are running on all 4 of these machines, load balanced by haproxy. We currently intend to create one container for each document and put all the dependent and derived artifacts as objects into that container. This gives us a nice compartmentalization per document, also making it easy to remove a document and everything that is connected with it. During the first test runs we ran into the default limit of 1000 containers per user. In the thread mentioned above that limit was removed (setting the max_buckets value to 0). We did that and now can upload more than 1000 documents. I just would like to understand a) if this design is recommended, or if there are reasons to go about the whole issue in a different way, potentially giving up the benefit of having all document artifacts under one convenient handle. b) is there any absolute limit for max_buckets that we will run into? Remember we are talking about 10s of millions of containers over time. c) are any performance issues to be expected with this design and can we tune any parameters to alleviate this? Any feedback would be very much appreciated. Regards, Daniel -- Daniel Schneller Mobile Development Lead CenterDevice GmbH | Merscheider Straße 1 | 42699 Solingen tel: +49 1754155711| Deutschland daniel.schnel...@centerdevice.com | www.centerdevice.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com