Re: [ceph-users] Serious performance problems with small file writes
Hello,

On Wed, 20 Aug 2014 15:39:11 +0100 Hugo Mills wrote:

We have a ceph system here, and we're seeing performance regularly descend into unusability for periods of minutes at a time (or longer). This appears to be triggered by writing large numbers of small files.

Specifications:
ceph 0.80.5
6 machines running 3 OSDs each (one 4 TB rotational HD per OSD, 2 threads)
2 machines running primary and standby MDS
3 monitors on the same machines as the OSDs
Infiniband to about 8 CephFS clients (headless, in the machine room)
Gigabit ethernet to a further 16 or so CephFS clients (Linux desktop machines, in the analysis lab)

Please let us know the CPU and memory specs of the OSD nodes as well, and the replication factor -- I presume 3, if you value that data. Also the PG and PGP values for the pool(s) you're using.

The cluster stores home directories of the users and a larger area of scientific data (approx 15 TB) which is being processed and analysed by the users of the cluster. We have a relatively small number of concurrent users (typically 4-6 at most), who use GUI tools to examine their data, and then complex sets of MATLAB scripts to process it, with processing often being distributed across all the machines using Condor. It's not unusual to see the analysis scripts write out large numbers (thousands, possibly tens or hundreds of thousands) of small files, often from many client machines at once in parallel. When this happens, the ceph cluster becomes almost completely unresponsive for tens of seconds (or even for minutes) at a time, until the writes are flushed through the system. Given the nature of modern GUI desktop environments (often reading and writing small state files in the user's home directory), this means that desktop interactivity and responsiveness for all the other users of the cluster suffer.

1-minute load on the servers typically peaks at about 8 during these events (on 4-core machines). Load on the clients also peaks high, because of the number of processes waiting for a response from the FS. The MDS shows little sign of stress -- it seems to be entirely down to the OSDs. ceph -w shows requests blocked for more than 10 seconds, and in bad cases, ceph -s shows up to many hundreds of requests blocked for more than 32s. We've had to turn off scrubbing and deep scrubbing completely -- except between 01.00 and 04.00 every night -- because it triggers the exact same symptoms, even with only 2-3 PGs being scrubbed. If it gets up to 7 PGs being scrubbed, as it did on Monday, it's completely unusable.

Note that I know nothing about CephFS, and while there are probably tunables, the slow requests you're seeing and the hardware listed up there definitely suggest slow OSDs. With a replication factor of 3, your total sustained cluster performance is that of just 6 disks, and 4 TB drives are never any speed wonders -- minus the latency overheads from the network, which should be minimal in your case though. You wrote that your old NFS (cluster?) had twice the spindles, so if that means 36 disks it was quite a bit faster. A cluster I'm just building with 3 nodes, 4 journal SSDs and 8 OSD HDDs per node can do about 7000 write IOPS (4KB), so I would expect yours to be worse off. Having the journals on dedicated partitions instead of files on the rootfs would not only be faster (though probably not significantly so), but would also prevent any potential failures based on FS corruption. The SSD journals will compensate for some spikes of high IOPS, but 25 files is clearly beyond that.
Putting lots of RAM (relatively cheap these days) into the OSD nodes has the big benefit that reads of hot objects will not have to go to disk and thus compete with write IOPS.

Is this problem something that's often seen? If so, what are the best options for mitigation or elimination of the problem? I've found a few references to issue #6278 [1], but that seems to be referencing scrub specifically, not ordinary (if possibly pathological) writes.

You need to match your cluster to your workload. Aside from tuning things (which tends to have limited effects), you can either scale out by adding more servers, or scale up by using faster storage and/or a cache pool.

What are the sorts of things I should be looking at to work out where the bottleneck(s) are? I'm a bit lost about how to drill down into the ceph system for identifying performance issues. Is there a useful guide to tools somewhere?

Reading/scouring this ML can be quite helpful. Watch your OSD nodes (all of them!) with iostat or preferably atop (which will also show you how your CPUs and network are doing) while running the below stuff. To get a baseline, do:

    rados -p pool-in-question bench 60 write -t 64

This will test your throughput most of all and, due to the 4MB block size, spread the load very equally amongst the OSDs. During that test you should see all OSDs more or
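As a concrete sketch of that baseline plus a small-block variant (the pool name "test" is an assumption -- substitute your own; -b sets the object size in bytes, so the 4096 run approximates the small-file pattern described above):

    # throughput baseline: 4MB objects, spread evenly across the OSDs
    rados -p test bench 60 write -t 64
    # small-block variant: 4KB objects, much closer to the small-file workload
    rados -p test bench 60 write -t 64 -b 4096
    # meanwhile, on every OSD node, watch the disks:
    atop          # or: iostat -xm 5

Comparing the op/s of the two runs against the op/s you see during an incident gives a rough ceiling for what the cluster can sustain.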
Re: [ceph-users] RadosGW problems
I have noticed that when I make the request to HTTPS, the response comes back in HTTP form with port 443... Where is this happening, do you have any idea?

On Wed, Aug 20, 2014 at 1:30 PM, Marco Garcês ma...@garces.cc wrote:

swift --insecure -V 1 -A https://gateway.bcitestes.local/auth -U testuser:swift -K MHA4vFaDy5XsJq+F5NuZLcBMCoJcuot44ASDuReY stat

Account HEAD failed: http://gateway.bcitestes.local:443/swift/v1 400 Bad Request
Re: [ceph-users] Serious performance problems with small file writes
Hi Hugo,

On 20 Aug 2014, at 17:54, Hugo Mills h.r.mi...@reading.ac.uk wrote:

What are you using for OSD journals?

On each machine, the three OSD journals live on the same ext4 filesystem on an SSD, which is also the root filesystem of the machine.

Also check the CPU usage for the mons and osds...

The mons are doing pretty much nothing in terms of CPU, as far as I can see. I will double-check during an incident.

Does your hardware provide enough IOPS for what your users need? (e.g. what is the op/s from ceph -w)

Not really an answer to your question, but: before the ceph cluster went in, we were running the system on two 5-year-old NFS servers for a while. We have about half the total number of spindles that we used to, but more modern drives.

NFS exported async or sync? If async, it can't be compared to CephFS. Also, if those NFS servers had RAID cards with a wb-cache, it can't really be compared.

I'll look at how the op/s values change when we have the problem. At the moment (with what I assume to be normal desktop usage from the 3-4 users in the lab), they're flapping wildly somewhere around a median of 350-400, with peaks up to 800. Somewhere around 15-20 MB/s read and write.

Another tunable to look at is the filestore max sync interval — in my experience the colocated journal/OSD setup suffers with the default (5s, IIRC), especially when an OSD is getting a constant stream of writes. When this happens, the disk heads are constantly seeking back and forth between synchronously writing to the journal and flushing the outstanding writes. If we had a dedicated (spinning) disk for the journal, then the synchronous writes (to the journal) could be done sequentially (thus, quickly) and the flushes would also be quick(er). SSD journals can obviously also help with this.

For a short test I would try increasing filestore max sync interval to 30s or maybe even 60s to see if it helps. (I know that at least one of the Inktank experts advises against changing the filestore max sync interval — but in my experience 5s is much too short for the colocated journal setup.) You need to make sure your journals are large enough to store 30/60s of writes, but when you have predominantly small writes, even a few GB of journal ought to be enough.

Cheers, Dan
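A minimal sketch of how that tunable can be applied for such a test (the [osd] section placement is the usual spot; treat the exact value as an assumption to be tuned, and note injectargs changes are lost on restart):

    # ceph.conf on the OSD hosts, for a persistent change
    [osd]
        filestore max sync interval = 30

    # or inject into the running OSDs for a short-lived experiment
    ceph tell osd.* injectargs '--filestore-max-sync-interval 30'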
Re: [ceph-users] MON running 'ceph -w' doesn't see OSD's booting
Hi,

You only have one OSD? I’ve seen similar strange things in test pools having only one OSD — and I kinda explained it by assuming that OSDs need peers (other OSDs sharing the same PG) to behave correctly. Install a second OSD and see how it goes...

Cheers, Dan

On 21 Aug 2014, at 02:59, Bruce McFarland bruce.mcfarl...@taec.toshiba.com wrote:

I have a cluster with 1 monitor and 3 OSD Servers. Each server has multiple OSDs running on it. When I start the OSD using /etc/init.d/ceph start osd.0, I see the expected interaction between the OSD and the monitor, authenticating keys etc, and finally the OSD starts. Watching the cluster with ‘ceph -w’ running on the monitor, I never see the INFO messages I expect: there isn’t a msg from osd.0 for the boot event, nor the expected INFO messages from osdmap and pgmap for the osd and its pages being added to those maps. I only see the last time the monitor was booted, when it wins the monitor election and reports monmap, pgmap, and mdsmap info. The firewalls are disabled with selinux==disabled and iptables turned off. All hosts can ssh w/o passwords into each other and I’ve verified traffic between hosts using tcpdump captures. Any ideas on what I’d need to add to ceph.conf or have overlooked would be greatly appreciated.

Thanks, Bruce

[root@ceph0 ceph]# /etc/init.d/ceph restart osd.0
=== osd.0 ===
=== osd.0 ===
Stopping Ceph osd.0 on ceph0...kill 15676...done
=== osd.0 ===
2014-08-20 17:43:46.456592 7fa51a034700 1 -- :/0 messenger.start
2014-08-20 17:43:46.457363 7fa51a034700 1 -- :/1025971 --> 209.243.160.84:6789/0 -- auth(proto 0 26 bytes epoch 0) v1 -- ?+0 0x7fa51402f9e0 con 0x7fa51402f570
2014-08-20 17:43:46.458229 7fa5189f0700 1 -- 209.243.160.83:0/1025971 learned my addr 209.243.160.83:0/1025971
2014-08-20 17:43:46.459664 7fa5135fe700 1 -- 209.243.160.83:0/1025971 <== mon.0 209.243.160.84:6789/0 1 ==== mon_map v1 ==== 200+0+0 (3445960796 0 0) 0x7fa508000ab0 con 0x7fa51402f570
2014-08-20 17:43:46.459849 7fa5135fe700 1 -- 209.243.160.83:0/1025971 <== mon.0 209.243.160.84:6789/0 2 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 33+0+0 (536914167 0 0) 0x7fa508000f60 con 0x7fa51402f570
2014-08-20 17:43:46.460180 7fa5135fe700 1 -- 209.243.160.83:0/1025971 --> 209.243.160.84:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- ?+0 0x7fa4fc0012d0 con 0x7fa51402f570
2014-08-20 17:43:46.461341 7fa5135fe700 1 -- 209.243.160.83:0/1025971 <== mon.0 209.243.160.84:6789/0 3 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 206+0+0 (409581826 0 0) 0x7fa508000f60 con 0x7fa51402f570
2014-08-20 17:43:46.461514 7fa5135fe700 1 -- 209.243.160.83:0/1025971 --> 209.243.160.84:6789/0 -- auth(proto 2 165 bytes epoch 0) v1 -- ?+0 0x7fa4fc001cf0 con 0x7fa51402f570
2014-08-20 17:43:46.462824 7fa5135fe700 1 -- 209.243.160.83:0/1025971 <== mon.0 209.243.160.84:6789/0 4 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 393+0+0 (2134012784 0 0) 0x7fa5080011d0 con 0x7fa51402f570
2014-08-20 17:43:46.463011 7fa5135fe700 1 -- 209.243.160.83:0/1025971 --> 209.243.160.84:6789/0 -- mon_subscribe({monmap=0+}) v2 -- ?+0 0x7fa51402bbc0 con 0x7fa51402f570
2014-08-20 17:43:46.463073 7fa5135fe700 1 -- 209.243.160.83:0/1025971 --> 209.243.160.84:6789/0 -- auth(proto 2 2 bytes epoch 0) v1 -- ?+0 0x7fa4fc0025d0 con 0x7fa51402f570
2014-08-20 17:43:46.463329 7fa51a034700 1 -- 209.243.160.83:0/1025971 --> 209.243.160.84:6789/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 0x7fa514030490 con 0x7fa51402f570
2014-08-20 17:43:46.463363 7fa51a034700 1 -- 209.243.160.83:0/1025971 --> 209.243.160.84:6789/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 0x7fa5140309b0 con 0x7fa51402f570
2014-08-20 17:43:46.463564 7fa5135fe700 1 -- 209.243.160.83:0/1025971 <== mon.0 209.243.160.84:6789/0 5 ==== mon_map v1 ==== 200+0+0 (3445960796 0 0) 0x7fa508001100 con 0x7fa51402f570
2014-08-20 17:43:46.463639 7fa5135fe700 1 -- 209.243.160.83:0/1025971 <== mon.0 209.243.160.84:6789/0 6 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (540052875 0 0) 0x7fa5080013e0 con 0x7fa51402f570
2014-08-20 17:43:46.463707 7fa5135fe700 1 -- 209.243.160.83:0/1025971 <== mon.0 209.243.160.84:6789/0 7 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 194+0+0 (1040860857 0 0) 0x7fa5080015d0 con 0x7fa51402f570
2014-08-20 17:43:46.468877 7fa51a034700 1 -- 209.243.160.83:0/1025971 --> 209.243.160.84:6789/0 -- mon_command({"prefix": "get_command_descriptions"} v 0) v1 -- ?+0 0x7fa514030e20 con 0x7fa51402f570
2014-08-20 17:43:46.469862 7fa5135fe700 1 -- 209.243.160.83:0/1025971 <== mon.0 209.243.160.84:6789/0 8 ==== osd_map(554..554 src has 1..554) v3 ==== 59499+0+0 (2180258623 0 0) 0x7fa50800f980 con 0x7fa51402f570
2014-08-20 17:43:46.470428 7fa5135fe700 1 -- 209.243.160.83:0/1025971 <== mon.0 209.243.160.84:6789/0 9 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (540052875 0 0) 0x7fa50800fc40 con 0x7fa51402f570
2014-08-20 17:43:46.475021 7fa5135fe700 1 --
Re: [ceph-users] Serious performance problems with small file writes
Just to fill in some of the gaps from yesterday's mail:

On Wed, Aug 20, 2014 at 04:54:28PM +0100, Hugo Mills wrote:

Some questions below I can't answer immediately, but I'll spend tomorrow morning irritating people by triggering these events (I think I have a reproducer -- unpacking a 1.2 GiB tarball with 25 small files in it) and giving you more details.

Yes, the tarball with the 25 small files in it is definitely a reproducer.

[snip]

What about iostat on the OSDs — are your OSD disks busy reading or writing during these incidents?

Not sure. I don't think so, but I'll try to trigger an incident and report back on this one.

Mostly writing. I'm seeing figures of up to about 2-3 MB/s writes, and 200-300 kB/s reads on all three, but it fluctuates a lot (with 5-second intervals). Sample data at the end of the email.

What are you using for OSD journals?

On each machine, the three OSD journals live on the same ext4 filesystem on an SSD, which is also the root filesystem of the machine.

Also check the CPU usage for the mons and osds...

The mons are doing pretty much nothing in terms of CPU, as far as I can see. I will double-check during an incident.

The mons are just ticking over with about 1% CPU usage.

Does your hardware provide enough IOPS for what your users need? (e.g. what is the op/s from ceph -w)

Not really an answer to your question, but: before the ceph cluster went in, we were running the system on two 5-year-old NFS servers for a while. We have about half the total number of spindles that we used to, but more modern drives.

I'll look at how the op/s values change when we have the problem. At the moment (with what I assume to be normal desktop usage from the 3-4 users in the lab), they're flapping wildly somewhere around a median of 350-400, with peaks up to 800. Somewhere around 15-20 MB/s read and write.

With minimal users and one machine running the tar unpacking process, I'm getting somewhere around 100-200 op/s on the ceph cluster, but interactivity on the desktop machine I'm logged in on is horrible -- I'm frequently getting tens of seconds of latency. Compare that to the (relatively) comfortable 350-400 op/s we had yesterday with what is probably a workload of larger files.

If disabling deep scrub helps, then it might be that something else is reading the disks heavily. One thing to check is updatedb — we had to disable it from indexing /var/lib/ceph on our OSDs.

I haven't seen that running at all during the day, but I'll look into it.

No, it's not anything like that -- iotop reports that pretty much the only things doing IO are ceph-osd and the occasional xfsaild.

Hugo.

Best Regards, Dan
-- Dan van der Ster || Data Storage Services || CERN IT Department --

On 20 Aug 2014, at 16:39, Hugo Mills h.r.mi...@reading.ac.uk wrote:

[snip -- original message quoted in full above]
Re: [ceph-users] Serious performance problems with small file writes
On Thu, Aug 21, 2014 at 07:40:45AM +0000, Dan Van Der Ster wrote:

On 20 Aug 2014, at 17:54, Hugo Mills h.r.mi...@reading.ac.uk wrote:

Does your hardware provide enough IOPS for what your users need? (e.g. what is the op/s from ceph -w)

Not really an answer to your question, but: before the ceph cluster went in, we were running the system on two 5-year-old NFS servers for a while. We have about half the total number of spindles that we used to, but more modern drives.

NFS exported async or sync? If async, it can't be compared to CephFS. Also, if those NFS servers had RAID cards with a wb-cache, it can't really be compared.

Hmm. Yes, async. Probably wouldn't have been my choice... (I only started working with this system recently -- about the same time that the ceph cluster was deployed to replace the older machines. I haven't had much of a say in what's implemented here, but I have to try to support it.)

I'm tempted to put the users' home directories back on an NFS server, and keep ceph for the research data. That at least should give us more in the way of interactivity (which is the main thing I'm getting complaints about).

I'll look at how the op/s values change when we have the problem. [...]

Another tunable to look at is the filestore max sync interval — in my experience the colocated journal/OSD setup suffers with the default (5s, IIRC), especially when an OSD is getting a constant stream of writes. [...] SSD journals can obviously also help with this.

Not sure what you mean about colocated journal/OSD. The journals aren't on the same device as the OSDs. However, all three journals on each machine are on the same SSD.

For a short test I would try increasing filestore max sync interval to 30s or maybe even 60s to see if it helps. [...] You need to make sure your journals are large enough to store 30/60s of writes, but when you have predominantly small writes, even a few GB of journal ought to be enough.

I'll have a play with that. Thanks for all the help so far -- it's been useful. I'm learning what the right kind of questions are.

Hugo.

-- Hugo Mills :: IT Services, University of Reading
Specialist Engineer, Research Servers :: x6943 :: R07 Harry Pitt Building
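A quick back-of-the-envelope check on that journal sizing, using the ~15-20 MB/s figure quoted above (an assumption that peak sustained writes stay in that range): at 20 MB/s into a node, a 60 s sync interval accumulates at most 20 MB/s × 60 s = 1200 MB of journal per node, i.e. roughly 400 MB per OSD with three OSDs sharing the SSD, so a few GB of journal per OSD leaves ample headroom.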
Re: [ceph-users] Serious performance problems with small file writes
Hi Hugo,

On 21 Aug 2014, at 14:17, Hugo Mills h.r.mi...@reading.ac.uk wrote:

Not sure what you mean about colocated journal/OSD. The journals aren't on the same device as the OSDs. However, all three journals on each machine are on the same SSD.

*embarrassed* I obviously didn’t drink enough coffee this morning. I read your reply as something like "On each machine, the three OSD journals live on the same ext4 filesystem on an OSD".

Anyway… what kind of SSD do you have? With iostat -xm 1, do you see high % utilisation on that SSD during these incidents? It could be that you’re exceeding even the IOPS capacity of the SSD.

Cheers, Dan
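A sketch of that check while reproducing an incident (sdX stands in for the journal SSD; the device name is an assumption):

    # extended per-device stats in MB at 1-second intervals; watch %util and await
    iostat -xm sdX 1

If %util on the journal SSD sits near 100 during the stalls, the shared journal device is the likely bottleneck.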
[ceph-users] ceph-users@lists.ceph.com
Hi,

I'm trying to start Qemu on top of RBD. In the documentation [1] there is a big warning:

Important: If you set rbd_cache=true, you must set cache=writeback or risk data loss. Without cache=writeback, QEMU will not send flush requests to librbd. If QEMU exits uncleanly in this configuration, filesystems on top of rbd can be corrupted.

But in the last part of that page it is written that the Qemu command line overrides ceph.conf settings, and setting *cache=writethrough* will force *rbd_cache=true* and *rbd_cache_max_dirty=0*. In that configuration rbd will write directly to Ceph and there is no risk of data loss (except for things cached in the VM OS). Am I right, or am I missing something?

1: http://ceph.com/docs/master/rbd/qemu-rbd/

Thanks, PS
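For reference, a sketch of how the two modes look on the Qemu command line (pool "rbd", image "vm1" and user "admin" are placeholder assumptions):

    # writeback: rbd_cache=true with dirty data allowed; the guest must send flushes
    qemu-system-x86_64 -drive format=raw,file=rbd:rbd/vm1:id=admin,cache=writeback ...

    # writethrough: per the docs, forces rbd_cache=true and rbd_cache_max_dirty=0,
    # so a write is only acknowledged once it has reached the cluster
    qemu-system-x86_64 -drive format=raw,file=rbd:rbd/vm1:id=admin,cache=writethrough ...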
Re: [ceph-users] Ceph + Qemu cache=writethrough
Sorry for the missing subject.

On 08/21/2014 03:09 PM, Paweł Sadowski wrote:

[snip -- original message quoted in full above]
[ceph-users] Question on OSD node failure recovery
I understand the concept of Ceph being able to recover from the failure of an OSD (presumably with a single OSD being on a single disk), but I'm wondering what the scenario is if an OSD server node containing multiple disks should fail. Presuming you have a server containing 8-10 disks, your duplicated placement groups could end up on the same system. From diagrams I've seen, they show duplicates going to separate nodes, but is this in fact how it handles it?
Re: [ceph-users] Question on OSD node failure recovery
Ceph uses CRUSH (http://ceph.com/docs/master/rados/operations/crush-map/) to determine object placement. The default generated crush maps are sane, in that they will put replicas in placement groups into separate failure domains. You do not need to worry about this simple failure case, but you should consider the network and disk i/o consequences of re-replicating large amounts of data.

Sean

From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of LaBarre, James (CTR) A6IT [james.laba...@cigna.com]
Sent: Thursday, August 21, 2014 9:17 AM
To: ceph-us...@ceph.com
Subject: [ceph-users] Question on OSD node failure recovery

[snip -- original message quoted in full above]
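For concreteness, the part of a default generated CRUSH rule that enforces this looks like the sketch below ("chooseleaf ... type host" is what picks each replica from a different host; rule and bucket names vary per cluster):

    rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
    }

You can inspect what your own cluster actually uses with: ceph osd crush rule dump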
[ceph-users] Hanging ceph client
Hi,

On a freshly created 4-node cluster I'm struggling to get the 4th node to create correctly. ceph-deploy is unable to create the OSDs on it, and when logging in to the node and attempting to run `ceph -s` manually (after copying the client.admin keyring) with debug parameters, it ends up hanging and looping over mon_command({"prefix": "get_command_descriptions"} v 0). I'm not sure what else to try to find out why this is happening. It seems like it's able to talk to the monitors okay, as it looks like it is authenticating, and the same command runs fine on the first 3 nodes, which are running monitors, but just hangs on the node that isn't.

Thanks in advance for any help!

root@ceph4:~# ceph -s --debug-ms=5 --debug-client=5 --debug-mon=10
2014-08-21 14:45:32.689379 7ff622841700 1 -- :/0 messenger.start
2014-08-21 14:45:32.691284 7ff622841700 1 -- :/1007607 --> 192.168.78.13:6789/0 -- auth(proto 0 30 bytes epoch 0) v1 -- ?+0 0x7ff61c024980 con 0x7ff61c024530
2014-08-21 14:45:32.692075 7ff61a7fc700 1 -- 192.168.78.14:0/1007607 learned my addr 192.168.78.14:0/1007607
2014-08-21 14:45:32.693174 7ff620885700 1 -- 192.168.78.14:0/1007607 <== mon.2 192.168.78.13:6789/0 1 ==== mon_map v1 ==== 485+0+0 (2066881705 0 0) 0x7ff61bd0 con 0x7ff61c024530
2014-08-21 14:45:32.693383 7ff620885700 1 -- 192.168.78.14:0/1007607 <== mon.2 192.168.78.13:6789/0 2 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 33+0+0 (3596119886 0 0) 0x7ff610001080 con 0x7ff61c024530
2014-08-21 14:45:32.693691 7ff620885700 1 -- 192.168.78.14:0/1007607 --> 192.168.78.13:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- ?+0 0x7ff604001680 con 0x7ff61c024530
2014-08-21 14:45:32.694549 7ff620885700 1 -- 192.168.78.14:0/1007607 <== mon.2 192.168.78.13:6789/0 3 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 206+0+0 (1790499909 0 0) 0x7ff610001080 con 0x7ff61c024530
2014-08-21 14:45:32.694750 7ff620885700 1 -- 192.168.78.14:0/1007607 --> 192.168.78.13:6789/0 -- auth(proto 2 165 bytes epoch 0) v1 -- ?+0 0x7ff604003810 con 0x7ff61c024530
2014-08-21 14:45:32.695641 7ff620885700 1 -- 192.168.78.14:0/1007607 <== mon.2 192.168.78.13:6789/0 4 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 393+0+0 (350251809 0 0) 0x7ff618c0 con 0x7ff61c024530
2014-08-21 14:45:32.695780 7ff620885700 1 -- 192.168.78.14:0/1007607 --> 192.168.78.13:6789/0 -- mon_subscribe({monmap=0+}) v2 -- ?+0 0x7ff61c020c20 con 0x7ff61c024530
2014-08-21 14:45:32.696051 7ff622841700 1 -- 192.168.78.14:0/1007607 --> 192.168.78.13:6789/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 0x7ff61c025200 con 0x7ff61c024530
2014-08-21 14:45:32.696079 7ff622841700 1 -- 192.168.78.14:0/1007607 --> 192.168.78.13:6789/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 0x7ff61c0257a0 con 0x7ff61c024530
2014-08-21 14:45:32.696324 7ff620885700 1 -- 192.168.78.14:0/1007607 <== mon.2 192.168.78.13:6789/0 5 ==== mon_map v1 ==== 485+0+0 (2066881705 0 0) 0x7ff6100012f0 con 0x7ff61c024530
2014-08-21 14:45:32.696422 7ff620885700 1 -- 192.168.78.14:0/1007607 <== mon.2 192.168.78.13:6789/0 6 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (1427523647 0 0) 0x7ff610001590 con 0x7ff61c024530
2014-08-21 14:45:32.696834 7ff620885700 1 -- 192.168.78.14:0/1007607 <== mon.2 192.168.78.13:6789/0 7 ==== osd_map(46..46 src has 1..46) v3 ==== 7172+0+0 (2083907578 0 0) 0x7ff618c0 con 0x7ff61c024530
2014-08-21 14:45:32.697095 7ff620885700 1 -- 192.168.78.14:0/1007607 <== mon.2 192.168.78.13:6789/0 8 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (1427523647 0 0) 0x7ff610002fd0 con 0x7ff61c024530
2014-08-21 14:45:32.704621 7ff622841700 1 -- 192.168.78.14:0/1007607 --> 192.168.78.13:6789/0 -- mon_command({"prefix": "get_command_descriptions"} v 0) v1 -- ?+0 0x7ff61c025c10 con 0x7ff61c024530
2014-08-21 14:45:32.900195 7ff620885700 1 -- 192.168.78.14:0/1007607 <== mon.2 192.168.78.13:6789/0 9 ==== osd_map(46..46 src has 1..46) v3 ==== 7172+0+0 (2083907578 0 0) 0x7ff618c0 con 0x7ff61c024530
2014-08-21 14:45:32.900265 7ff620885700 1 -- 192.168.78.14:0/1007607 <== mon.2 192.168.78.13:6789/0 10 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (1427523647 0 0) 0x7ff610002fd0 con 0x7ff61c024530
2014-08-21 14:46:05.691726 7ff61b7fe700 1 -- 192.168.78.14:0/1007607 mark_down 0x7ff61c024530 -- 0x7ff61c0242c0
2014-08-21 14:46:05.691818 7ff61a6fb700 2 -- 192.168.78.14:0/1007607 >> 192.168.78.13:6789/0 pipe(0x7ff61c0242c0 sd=3 :60918 s=4 pgs=174 cs=1 l=1 c=0x7ff61c024530).fault (0) Success
2014-08-21 14:46:05.691913 7ff61b7fe700 1 -- 192.168.78.14:0/1007607 --> 192.168.78.12:6789/0 -- auth(proto 0 30 bytes epoch 1) v1 -- ?+0 0x7ff608001ba0 con 0x7ff608001760
2014-08-21 14:46:05.693707 7ff620885700 1 -- 192.168.78.14:0/1007607 <== mon.1 192.168.78.12:6789/0 1 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 33+0+0 (2330663482 0 0) 0x7ff610001220 con 0x7ff608001760
2014-08-21 14:46:05.693982 7ff620885700 1 -- 192.168.78.14:0/1007607 --> 192.168.78.12:6789/0 -- auth(proto 2 128 bytes epoch 0) v1 -- ?+0 0x7ff604007520
[ceph-users] fail to upload file from RadosGW by Python+S3
I can upload files to RadosGW with s3cmd and with the DragonDisk software, and the script below can list all buckets and all files in each bucket, but uploading from Python S3 does not work.

###
#coding=utf-8
__author__ = 'Administrator'
#!/usr/bin/env python
import fnmatch
import os, sys
import boto
import boto.s3.connection

access_key = 'VC8R6C193WDVKNTDCRKA'
secret_key = 'ASUWdUTx6PwVXEf/oJRRmDnvKEWp509o3rl1Xt+h'
pidfile = "copytoceph.pid"

def check_pid(pid):
    try:
        os.kill(pid, 0)
    except OSError:
        return False
    else:
        return True

if os.path.isfile(pidfile):
    pid = long(open(pidfile, 'r').read())
    if check_pid(pid):
        print "%s already exists, doing natting" % pidfile
        sys.exit()

pid = str(os.getpid())
file(pidfile, 'w').write(pid)

conn = boto.connect_s3(
    aws_access_key_id=access_key,
    aws_secret_access_key=secret_key,
    host='ceph-radosgw.lab.com',
    port=80,
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)
print conn

mybucket = conn.get_bucket('foo')
print mybucket
mylist = mybucket.list()
print mylist

buckets = conn.get_all_buckets()
for bucket in buckets:
    print "{name}\t{created}".format(
        name=bucket.name,
        created=bucket.creation_date,
    )
    for key in bucket.list():
        print "{name}\t{size}\t{modified}".format(
            name=(key.name).encode('utf8'),
            size=key.size,
            modified=key.last_modified,
        )

key = mybucket.new_key('hello.txt')
print key
key.set_contents_from_string('Hello World!')
###

root@ceph-radosgw:~# python rgwupload.py
S3Connection:ceph-radosgw.lab.com
<Bucket: foo>
<boto.s3.bucketlistresultset.BucketListResultSet object at 0x1d6ae10>
backup 2014-08-21T10:23:08.000Z
add volume for vms.png 23890 2014-08-21T10:53:43.000Z
foo 2014-08-20T16:11:19.000Z
file0001.txt 29 2014-08-21T04:22:25.000Z
galley/DSC_0005.JPG 2142126 2014-08-21T04:24:29.000Z
galley/DSC_0006.JPG 2005662 2014-08-21T04:24:29.000Z
galley/DSC_0009.JPG 1922686 2014-08-21T04:24:29.000Z
galley/DSC_0010.JPG 2067713 2014-08-21T04:24:29.000Z
galley/DSC_0011.JPG 2027689 2014-08-21T04:24:30.000Z
galley/DSC_0012.JPG 2853358 2014-08-21T04:24:30.000Z
galley/DSC_0013.JPG 2844746 2014-08-21T04:24:30.000Z
iso 2014-08-21T04:43:16.000Z
pdf 2014-08-21T09:36:15.000Z
<Key: foo,hello.txt>

It hangs at this point. I get the same error when I run this script on the radosgw host.

Traceback (most recent call last):
  File "D:/Workspace/S3-Ceph/test.py", line 65, in <module>
    key.set_contents_from_string('Hello World!')
  File "c:\Python27\lib\site-packages\boto\s3\key.py", line 1419, in set_contents_from_string
    encrypt_key=encrypt_key)
  File "c:\Python27\lib\site-packages\boto\s3\key.py", line 1286, in set_contents_from_file
    chunked_transfer=chunked_transfer, size=size)
  File "c:\Python27\lib\site-packages\boto\s3\key.py", line 746, in send_file
    chunked_transfer=chunked_transfer, size=size)
  File "c:\Python27\lib\site-packages\boto\s3\key.py", line 944, in _send_file_internal
    query_args=query_args
  File "c:\Python27\lib\site-packages\boto\s3\connection.py", line 664, in make_request
    retry_handler=retry_handler
  File "c:\Python27\lib\site-packages\boto\connection.py", line 1053, in make_request
    retry_handler=retry_handler)
  File "c:\Python27\lib\site-packages\boto\connection.py", line 1009, in _mexe
    raise BotoServerError(response.status, response.reason, body)
boto.exception.BotoServerError: BotoServerError: 500 Internal Server Error
None
Re: [ceph-users] active+remapped after remove osd via ceph osd out
Hi,

I have 2 PGs in active+remapped state.

ceph health detail
HEALTH_WARN 2 pgs stuck unclean; recovery 24/348041229 degraded (0.000%)
pg 3.1a07 is stuck unclean for 29239.046024, current state active+remapped, last acting [167,80,145]
pg 3.154a is stuck unclean for 29239.039777, current state active+remapped, last acting [377,224,292]
recovery 24/348041229 degraded (0.000%)

This happened when I called: ceph osd reweight-by-utilization 102

What can be wrong?

ceph -v - ceph version 0.67.10 (9d446bd416c52cd785ccf048ca67737ceafcdd7f)

Tunables (ceph osd crush dump | tail -n 4):
tunables: { "choose_local_tries": 0,
  "choose_local_fallback_tries": 0,
  "choose_total_tries": 60,
  "chooseleaf_descend_once": 1}}

Cluster: 6 racks X 3 hosts X 22 OSDs. (396 osds: 396 up, 396 in)

crushtool -i ../crush2 --min-x 0 --num-rep 3 --max-x 10624 --test --show-bad-mappings is clean.

When 'ceph osd reweight' for all osds is 1.0, everything is OK, but then I have nearfull OSDs. There are no missing OSDs in the crushmap (grep device /tmp/crush.txt | grep -v osd # devices).

ceph osd dump | grep -i pool
pool 0 'data' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 28459 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 3 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 28460 owner 0
pool 2 'rbd' rep size 3 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 28461 owner 0
pool 3 '.rgw.buckets' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8192 pgp_num 8192 last_change 73711 owner 0
pool 4 '.log' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 90517 owner 0
pool 5 '.rgw' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 72467 owner 0
pool 6 '.users.uid' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 28465 owner 0
pool 7 '.users' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 28466 owner 0
pool 8 '.usage' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 28467 owner 18446744073709551615
pool 9 '.intent-log' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 28468 owner 18446744073709551615
pool 10 '.rgw.control' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 33485 owner 18446744073709551615
pool 11 '.rgw.gc' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 33487 owner 18446744073709551615
pool 12 '.rgw.root' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 44540 owner 0
pool 13 '' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 46912 owner 0

ceph pg 3.1a07 query
{ "state": "active+remapped",
  "epoch": 181721,
  "up": [167, 80],
  "acting": [167, 80, 145],
  "info": { "pgid": "3.1a07",
      "last_update": "181719'94809",
      "last_complete": "181719'94809",
      "log_tail": "159997'91808",
      "last_backfill": "MAX",
      "purged_snaps": "[]",
      "history": { "epoch_created": 4,
          "last_epoch_started": 179611,
          "last_epoch_clean": 179611,
          "last_epoch_split": 11522,
          "same_up_since": 179610,
          "same_interval_since": 179610,
          "same_primary_since": 179610,
          "last_scrub": "160655'94695",
          "last_scrub_stamp": "2014-08-19 04:16:20.308318",
          "last_deep_scrub": "158290'91157",
          "last_deep_scrub_stamp": "2014-08-12 05:15:25.557591",
          "last_clean_scrub_stamp": "2014-08-19 04:16:20.308318"},
      "stats": { "version": "181719'94809",
          "reported_seq": "995830",
          "reported_epoch": "181721",
          "state": "active+remapped",
          "last_fresh": "2014-08-21 14:53:14.050284",
          "last_change": "2014-08-21 09:42:07.473356",
          "last_active": "2014-08-21 14:53:14.050284",
          "last_clean": "2014-08-21 07:38:51.366084",
          "last_became_active": "2013-10-25 13:59:36.125019",
          "last_unstale": "2014-08-21 14:53:14.050284",
          "mapping_epoch": 179606,
          "log_start": "159997'91808",
          "ondisk_log_start": "159997'91808",
          "created": 4,
          "last_epoch_clean": 179611,
          "parent": "0.0",
          "parent_split_bits": 0,
          "last_scrub": "160655'94695",
          "last_scrub_stamp": "2014-08-19 04:16:20.308318",
          "last_deep_scrub": "158290'91157",
          "last_deep_scrub_stamp": "2014-08-12 05:15:25.557591",
          "last_clean_scrub_stamp": "2014-08-19 04:16:20.308318",
          "log_size": 3001,
          "ondisk_log_size": 3001,
          "stats_invalid": "0",
          "stat_sum": { "num_bytes": 2880784014,
              "num_objects": 12108,
              "num_object_clones": 0,
              "num_object_copies": 0,
              "num_objects_missing_on_primary": 0,
              "num_objects_degraded": 0,
Re: [ceph-users] MON running 'ceph -w' doesn't see OSD's booting
I have 3 storage servers, each with 30 osds. Each osd has a journal that is a partition on a virtual drive that is a raid0 of 6 ssds. I brought up a 3 osd (1 per storage server) cluster to bring up Ceph and figure out configuration etc.

From: Dan Van Der Ster [mailto:daniel.vanders...@cern.ch]
Sent: Thursday, August 21, 2014 1:17 AM
To: Bruce McFarland
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] MON running 'ceph -w' doesn't see OSD's booting

[snip -- Dan's reply and the original message with its log are quoted in full above]
[ceph-users] Ceph Cinder Capabilities reports wrong free size
I am working with Cinder multi-backends on an Icehouse installation and have added another backend (Quobyte) to a previously running Cinder/Ceph installation. I can now create Quobyte volumes, but no longer any Ceph volumes. The cinder-scheduler log gets an incorrect number for the free size of the volumes pool and disregards the RBD backend as a viable storage system:

2014-08-21 16:42:49.847 1469 DEBUG cinder.openstack.common.scheduler.filters.capabilities_filter [r...] extra_spec requirement 'rbd' does not match 'quobyte' _satisfies_extra_specs /usr/lib/python2.7/dist-packages/cinder/openstack/common/scheduler/filters/capabilities_filter.py:55
2014-08-21 16:42:49.848 1469 DEBUG cinder.openstack.common.scheduler.filters.capabilities_filter [r...] host 'controller@quobyte': free_capacity_gb: 156395.931061 fails resource_type extra_specs requirements host_passes /usr/lib/python2.7/dist-packages/cinder/openstack/common/scheduler/filters/capabilities_filter.py:68
2014-08-21 16:42:49.848 1469 WARNING cinder.scheduler.filters.capacity_filter [r...-] Insufficient free space for volume creation (requested / avail): 20/8.0
2014-08-21 16:42:49.849 1469 ERROR cinder.scheduler.flows.create_volume [r.] Failed to schedule_create_volume: No valid host was found.

Here's our /etc/cinder/cinder.conf:

— cut —
[DEFAULT]
rootwrap_config = /etc/cinder/rootwrap.conf
api_paste_confg = /etc/cinder/api-paste.ini
# iscsi_helper = tgtadm
volume_name_template = volume-%s
# volume_group = cinder-volumes
verbose = True
auth_strategy = keystone
state_path = /var/lib/cinder
lock_path = /var/lock/cinder
volumes_dir = /var/lib/cinder/volumes
rabbit_host=10.2.0.10
use_syslog=False
api_paste_config=/etc/cinder/api-paste.ini
glance_num_retries=0
debug=True
storage_availability_zone=nova
glance_api_ssl_compression=False
glance_api_insecure=False
rabbit_userid=openstack
rabbit_use_ssl=False
log_dir=/var/log/cinder
osapi_volume_listen=0.0.0.0
glance_api_servers=1.2.3.4:9292
rabbit_virtual_host=/
scheduler_driver=cinder.scheduler.filter_scheduler.FilterScheduler
default_availability_zone=nova
rabbit_hosts=10.2.0.10:5672
control_exchange=openstack
rabbit_ha_queues=False
glance_api_version=2
amqp_durable_queues=False
rabbit_password=secret
rabbit_port=5672
rpc_backend=cinder.openstack.common.rpc.impl_kombu
enabled_backends=quobyte,rbd
default_volume_type=rbd

[database]
idle_timeout=3600
connection=mysql://cinder:secret@10.2.0.10/cinder

[quobyte]
quobyte_volume_url=quobyte://hostname.cloud.example.com/openstack-volumes
volume_driver=cinder.volume.drivers.quobyte.QuobyteDriver

[rbd-volumes]
volume_backend_name=rbd-volumes
rbd_pool=volumes
rbd_flatten_volume_from_snapshot=False
rbd_user=cinder
rbd_ceph_conf=/etc/ceph/ceph.conf
rbd_secret_uuid=1234-5678-ABCD-…-DEF
rbd_max_clone_depth=5
volume_driver=cinder.volume.drivers.rbd.RBDDriver
— cut —

Any ideas?

cheers
Jens-Christian

-- SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch http://www.switch.ch/stories
Re: [ceph-users] Hanging ceph client
Yeah, that's fairly bizarre. Have you turned up the monitor logs and seen what they're doing? Have you checked that the nodes otherwise have the same configuration (firewall rules, client key permissions, installed version of Ceph...)?

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Thu, Aug 21, 2014 at 6:50 AM, Damien Churchill dam...@gmail.com wrote:

[snip -- original message quoted in full above]
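A sketch of one way to turn up those logs on a monitor host, via the admin socket (the mon id "a" and socket path are assumptions -- match them to your node names):

    # raise monitor and messenger debug levels on a running mon
    ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok config set debug_mon 10
    ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok config set debug_ms 1
    # then tail /var/log/ceph/ceph-mon.a.log while re-running ceph -s on the 4th node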
Re: [ceph-users] fail to upload file from RadosGW by Python+S3
When I use DragonDisk and unselect the "Expect 100-continue header" option, the file uploads successfully; with that option selected, the upload hangs. Maybe the Python script cannot upload the file due to 100-continue? My radosgw Apache2 does not use 100-continue. If my guess is true, how do I disable this in the Python S3 connection and make the script work for uploads?

2014-08-21 20:57 GMT+07:00 debian Only onlydeb...@gmail.com:

[snip -- script and traceback quoted in full above]
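One way to test that guess from the Python side, assuming boto lets a caller-supplied header take precedence here (untested -- if boto sets the Expect header unconditionally in send_file, this will have no effect):

    # sketch: pass an explicit empty Expect header so the PUT is sent
    # without "Expect: 100-continue" (assumption: boto honours it)
    key = mybucket.new_key('hello.txt')
    key.set_contents_from_string('Hello World!', headers={'Expect': ''})

The headers argument itself is standard boto; whether it can override the 100-continue behaviour is the assumption being tested.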
Re: [ceph-users] fail to upload file from RadosGW by Python+S3
My radosgw has 100-continue disabled:

[global]
fsid = 075f1aae-48de-412e-b024-b0f014dbc8cf
mon_initial_members = ceph01-vm, ceph02-vm, ceph04-vm
mon_host = 192.168.123.251,192.168.123.252,192.168.123.250
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
rgw print continue = false
rgw dns name = ceph-radosgw
osd pool default pg num = 128
osd pool default pgp num = 128
#debug rgw = 20

[client.radosgw.gateway]
host = ceph-radosgw
keyring = /etc/ceph/ceph.client.radosgw.keyring
rgw socket path = /var/run/ceph/ceph.radosgw.gateway.fastcgi.sock
log file = /var/log/ceph/client.radosgw.gateway.log

2014-08-21 22:42 GMT+07:00 debian Only onlydeb...@gmail.com:

[snip -- earlier messages, script, and traceback quoted in full above]
Re: [ceph-users] MON running 'ceph -w' doesn't see OSD's booting
Are the OSD processes still alive? What's the osdmap output of ceph -w (which was not in the output you pasted)?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Thu, Aug 21, 2014 at 7:11 AM, Bruce McFarland bruce.mcfarl...@taec.toshiba.com wrote:
I have 3 storage servers, each with 30 OSDs. Each OSD has a journal that is a partition on a virtual drive that is a RAID0 of 6 SSDs. I brought up a 3-OSD cluster (1 per storage server) to bring up Ceph and figure out configuration etc.

From: Dan Van Der Ster [mailto:daniel.vanders...@cern.ch]
Sent: Thursday, August 21, 2014 1:17 AM
To: Bruce McFarland
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] MON running 'ceph -w' doesn't see OSD's booting

Hi, You only have one OSD? I’ve seen similar strange things in test pools having only one OSD — and I kinda explained it by assuming that OSDs need peers (other OSDs sharing the same PG) to behave correctly. Install a second OSD and see how it goes... Cheers, Dan

On 21 Aug 2014, at 02:59, Bruce McFarland bruce.mcfarl...@taec.toshiba.com wrote:
I have a cluster with 1 monitor and 3 OSD servers. Each server has multiple OSDs running on it. When I start an OSD using /etc/init.d/ceph start osd.0, I see the expected interaction between the OSD and the monitor, authenticating keys etc., and finally the OSD starts. Watching the cluster with 'ceph -w' running on the monitor, I never see the INFO messages I expect: there is no message from osd.0 for the boot event, nor the expected INFO messages from osdmap and pgmap for the OSD and its PGs being added to those maps. I only see the last time the monitor was booted, when it wins the monitor election and reports monmap, pgmap, and mdsmap info. The firewalls are disabled, selinux is disabled, and iptables is turned off. All hosts can ssh without passwords into each other, and I've verified traffic between hosts using tcpdump captures. Any ideas on what I'd need to add to ceph.conf, or anything I have overlooked, would be greatly appreciated.
Thanks, Bruce

[root@ceph0 ceph]# /etc/init.d/ceph restart osd.0
=== osd.0 ===
=== osd.0 ===
Stopping Ceph osd.0 on ceph0...kill 15676...done
=== osd.0 ===
2014-08-20 17:43:46.456592 7fa51a034700 1 -- :/0 messenger.start
2014-08-20 17:43:46.457363 7fa51a034700 1 -- :/1025971 --> 209.243.160.84:6789/0 -- auth(proto 0 26 bytes epoch 0) v1 -- ?+0 0x7fa51402f9e0 con 0x7fa51402f570
2014-08-20 17:43:46.458229 7fa5189f0700 1 -- 209.243.160.83:0/1025971 learned my addr 209.243.160.83:0/1025971
2014-08-20 17:43:46.459664 7fa5135fe700 1 -- 209.243.160.83:0/1025971 <== mon.0 209.243.160.84:6789/0 1 ==== mon_map v1 ==== 200+0+0 (3445960796 0 0) 0x7fa508000ab0 con 0x7fa51402f570
2014-08-20 17:43:46.459849 7fa5135fe700 1 -- 209.243.160.83:0/1025971 <== mon.0 209.243.160.84:6789/0 2 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 33+0+0 (536914167 0 0) 0x7fa508000f60 con 0x7fa51402f570
2014-08-20 17:43:46.460180 7fa5135fe700 1 -- 209.243.160.83:0/1025971 --> 209.243.160.84:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- ?+0 0x7fa4fc0012d0 con 0x7fa51402f570
2014-08-20 17:43:46.461341 7fa5135fe700 1 -- 209.243.160.83:0/1025971 <== mon.0 209.243.160.84:6789/0 3 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 206+0+0 (409581826 0 0) 0x7fa508000f60 con 0x7fa51402f570
2014-08-20 17:43:46.461514 7fa5135fe700 1 -- 209.243.160.83:0/1025971 --> 209.243.160.84:6789/0 -- auth(proto 2 165 bytes epoch 0) v1 -- ?+0 0x7fa4fc001cf0 con 0x7fa51402f570
2014-08-20 17:43:46.462824 7fa5135fe700 1 -- 209.243.160.83:0/1025971 <== mon.0 209.243.160.84:6789/0 4 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 393+0+0 (2134012784 0 0) 0x7fa5080011d0 con 0x7fa51402f570
2014-08-20 17:43:46.463011 7fa5135fe700 1 -- 209.243.160.83:0/1025971 --> 209.243.160.84:6789/0 -- mon_subscribe({monmap=0+}) v2 -- ?+0 0x7fa51402bbc0 con 0x7fa51402f570
2014-08-20 17:43:46.463073 7fa5135fe700 1 -- 209.243.160.83:0/1025971 --> 209.243.160.84:6789/0 -- auth(proto 2 2 bytes epoch 0) v1 -- ?+0 0x7fa4fc0025d0 con 0x7fa51402f570
2014-08-20 17:43:46.463329 7fa51a034700 1 -- 209.243.160.83:0/1025971 --> 209.243.160.84:6789/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 0x7fa514030490 con 0x7fa51402f570
2014-08-20 17:43:46.463363 7fa51a034700 1 -- 209.243.160.83:0/1025971 --> 209.243.160.84:6789/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 0x7fa5140309b0 con 0x7fa51402f570
2014-08-20 17:43:46.463564 7fa5135fe700 1 -- 209.243.160.83:0/1025971 <== mon.0 209.243.160.84:6789/0 5 ==== mon_map v1 ==== 200+0+0 (3445960796 0 0) 0x7fa508001100 con 0x7fa51402f570
2014-08-20 17:43:46.463639 7fa5135fe700 1 -- 209.243.160.83:0/1025971 <== mon.0 209.243.160.84:6789/0 6 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (540052875 0 0) 0x7fa5080013e0 con 0x7fa51402f570
2014-08-20 17:43:46.463707 7fa5135fe700 1 -- 209.243.160.83:0/1025971 <== mon.0 209.243.160.84:6789/0 7
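Greg's question about the osdmap can be checked directly on the monitor. As a hedged sketch (standard Ceph CLI commands; none of this is from the original thread), these show whether any OSD has ever registered with the monitor:

ceph osd stat          # e.g. "osdmap e5: 3 osds: 0 up, 0 in" would mean no OSD ever booted into the map
ceph osd dump | head   # osdmap epoch plus per-OSD state and addresses
ceph osd tree          # whether the OSDs and their hosts are present in the CRUSH hierarchy

If ceph osd tree shows the OSDs but ceph osd stat reports 0 up / 0 in, the daemons are failing to report their boot to the monitor, which would fit the missing osdmap/pgmap INFO messages described above.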
Re: [ceph-users] Ceph Cinder Capabilities reports wrong free size
On Thu, Aug 21, 2014 at 8:29 AM, Jens-Christian Fischer jens-christian.fisc...@switch.ch wrote:
I am working with Cinder multi-backends on an Icehouse installation and have added another backend (Quobyte) to a previously running Cinder/Ceph installation. I can now create Quobyte volumes, but no longer any Ceph volumes. The cinder-scheduler log gets an incorrect number for the free size of the volumes pool and disregards the RBD backend as a viable storage system:

I don't know much about Cinder, but given this output:

2014-08-21 16:42:49.847 1469 DEBUG cinder.openstack.common.scheduler.filters.capabilities_filter [r...] extra_spec requirement 'rbd' does not match 'quobyte' _satisfies_extra_specs /usr/lib/python2.7/dist-packages/cinder/openstack/common/scheduler/filters/capabilities_filter.py:55
2014-08-21 16:42:49.848 1469 DEBUG cinder.openstack.common.scheduler.filters.capabilities_filter [r...] host 'controller@quobyte': free_capacity_gb: 156395.931061 fails resource_type extra_specs requirements host_passes /usr/lib/python2.7/dist-packages/cinder/openstack/common/scheduler/filters/capabilities_filter.py:68
2014-08-21 16:42:49.848 1469 WARNING cinder.scheduler.filters.capacity_filter [r...-] Insufficient free space for volume creation (requested / avail): 20/8.0
2014-08-21 16:42:49.849 1469 ERROR cinder.scheduler.flows.create_volume [r.] Failed to schedule_create_volume: No valid host was found.

I suspect you'll have better luck on the OpenStack mailing list. :) Although for a random quick guess, I think maybe you need to match the rbd and rbd-volumes (from your conf file) strings?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

here's our /etc/cinder/cinder.conf:

— cut —
[DEFAULT]
rootwrap_config = /etc/cinder/rootwrap.conf
api_paste_confg = /etc/cinder/api-paste.ini
# iscsi_helper = tgtadm
volume_name_template = volume-%s
# volume_group = cinder-volumes
verbose = True
auth_strategy = keystone
state_path = /var/lib/cinder
lock_path = /var/lock/cinder
volumes_dir = /var/lib/cinder/volumes
rabbit_host=10.2.0.10
use_syslog=False
api_paste_config=/etc/cinder/api-paste.ini
glance_num_retries=0
debug=True
storage_availability_zone=nova
glance_api_ssl_compression=False
glance_api_insecure=False
rabbit_userid=openstack
rabbit_use_ssl=False
log_dir=/var/log/cinder
osapi_volume_listen=0.0.0.0
glance_api_servers=1.2.3.4:9292
rabbit_virtual_host=/
scheduler_driver=cinder.scheduler.filter_scheduler.FilterScheduler
default_availability_zone=nova
rabbit_hosts=10.2.0.10:5672
control_exchange=openstack
rabbit_ha_queues=False
glance_api_version=2
amqp_durable_queues=False
rabbit_password=secret
rabbit_port=5672
rpc_backend=cinder.openstack.common.rpc.impl_kombu
enabled_backends=quobyte,rbd
default_volume_type=rbd

[database]
idle_timeout=3600
connection=mysql://cinder:secret@10.2.0.10/cinder

[quobyte]
quobyte_volume_url=quobyte://hostname.cloud.example.com/openstack-volumes
volume_driver=cinder.volume.drivers.quobyte.QuobyteDriver

[rbd-volumes]
volume_backend_name=rbd-volumes
rbd_pool=volumes
rbd_flatten_volume_from_snapshot=False
rbd_user=cinder
rbd_ceph_conf=/etc/ceph/ceph.conf
rbd_secret_uuid=1234-5678-ABCD-…-DEF
rbd_max_clone_depth=5
volume_driver=cinder.volume.drivers.rbd.RBDDriver
— cut —

any ideas?

cheers
Jens-Christian

--
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch
http://www.switch.ch/stories
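If Greg's guess is right, the mismatch is that enabled_backends lists a backend called rbd while the file only defines a [rbd-volumes] section, so the scheduler never finds an RBD-capable host. A hedged sketch of the matching stanzas (untested; all names taken from the conf above):

[DEFAULT]
# each name listed here must match a config-group header below
enabled_backends=quobyte,rbd-volumes
default_volume_type=rbd

[rbd-volumes]
volume_backend_name=rbd-volumes
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_pool=volumes
rbd_user=cinder
rbd_ceph_conf=/etc/ceph/ceph.conf

The rbd volume type then has to carry the same backend name, e.g. cinder type-key rbd set volume_backend_name=rbd-volumes, so that the extra_specs check visible in the scheduler log ('rbd' does not match 'quobyte') can succeed against the RBD host.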
Re: [ceph-users] MON running 'ceph -w' doesn't see OSD's booting
Yes, all of the ceph-osd processes are up and running. I performed a ceph-mon restart to see if that might trigger the osdmap update, but there is no INFO message from the osdmap or the pgmap that I expect to see when the OSDs are started. All of the OSDs and their hosts appear in the CRUSH map and in ceph.conf. Since I went through a bunch of issues getting the multiple-OSDs-per-host setup working, I'm assuming that the monitor's tables might be hosed, and I am going to purgedata and reinstall the monitor to see if it builds the proper mappings. I've stopped all of the OSDs and verified that there aren't any active ceph-osd processes. Then I'll follow the procedure for bringing a new monitor online in an existing cluster so that I use the proper fsid.

2014-08-20 17:20:24.648538 7f326ebfd700 0 monclient: hunting for new mon
2014-08-20 17:20:24.648857 7f327455f700 0 -- 209.243.160.84:0/1005462 >> 209.243.160.84:6789/0 pipe(0x7f3264020300 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f3264020570).fault
2014-08-20 17:20:26.077687 mon.0 [INF] mon.ceph-mon01@0 won leader election with quorum 0
2014-08-20 17:20:26.077810 mon.0 [INF] monmap e1: 1 mons at {ceph-mon01=209.243.160.84:6789/0}
2014-08-20 17:20:26.077931 mon.0 [INF] pgmap v555: 192 pgs: 192 creating; 0 bytes data, 0 kB used, 0 kB / 0 kB avail
2014-08-20 17:20:26.078032 mon.0 [INF] mdsmap e1: 0/0/1 up

-Original Message-
From: Gregory Farnum [mailto:g...@inktank.com]
Sent: Thursday, August 21, 2014 8:44 AM
To: Bruce McFarland
Cc: Dan Van Der Ster; ceph-us...@ceph.com
Subject: Re: [ceph-users] MON running 'ceph -w' doesn't see OSD's booting

Are the OSD processes still alive? What's the osdmap output of ceph -w (which was not in the output you pasted)?
-Greg
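Before purging and rebuilding the monitor, it may be worth confirming from the OSD side which map each daemon thinks it has booted into. A hedged sketch via the admin socket (default socket path assumed; adjust if ceph.conf overrides it):

# run on an OSD host
ceph daemon osd.0 status
# or, equivalently, pointing at the socket directly:
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok status

If the reported state never reaches active and newest_map stays at a very low epoch, the OSD is not receiving maps from the monitor at all, which would point at the messenger/network path rather than at the monitor's tables.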
Re: [ceph-users] Problem setting tunables for ceph firefly
There was a good discussion of this a month ago: https://www.mail-archive.com/ceph-users%40lists.ceph.com/msg11483.html

That'll give you some things you can try, and information on how to undo it if it does cause problems. You can disable the warning by adding this to the [mon] section of ceph.conf:

mon warn on legacy crush tunables = false

On Thu, Aug 21, 2014 at 7:17 AM, Gerd Jakobovitsch g...@mandic.net.br wrote:
Dear all, I have a ceph cluster running on 3 nodes, 240 TB of space with 60% usage, used by rbd and radosgw clients. Recently I upgraded from emperor to firefly, and I got the message about legacy tunables described in http://ceph.com/docs/master/rados/operations/crush-map/#tunables. After some data rearrangement to minimize risks, I tried to apply the optimal settings. This resulted in 28% object degradation, much more than I expected, and, worse, I lost communication with the rbd clients, which run on kernels 3.10 or 3.11. Searching for a fix, I got to this proposed solution: https://www.mail-archive.com/ceph-users@lists.ceph.com/msg11199.html. Applying it (before the data was all moved), I got an additional 2% of object degradation, but the rbd clients came back to working order. But then I got a large number of degraded or stale PGs that are not backfilling.

Looking for the definition of chooseleaf_vary_r, I found it in http://ceph.com/docs/master/rados/operations/crush-map/: chooseleaf_vary_r: Whether a recursive chooseleaf attempt will start with a non-zero value of r, based on how many attempts the parent has already made. Legacy default is 0, but with this value CRUSH is sometimes unable to find a mapping. The optimal value (in terms of computational cost and correctness) is 1. However, for legacy clusters that have lots of existing data, changing from 0 to 1 will cause a lot of data to move; a value of 4 or 5 will allow CRUSH to find a valid mapping but will make less data move.

Is there any suggestion for handling this? Do I have to set chooseleaf_vary_r to some other value? Will I lose communication with my rbd clients? Or should I return to legacy tunables? Regards, Gerd Jakobovitsch
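For reference, the intermediate chooseleaf_vary_r values mentioned in the quoted documentation can be applied by editing the crush map directly instead of jumping straight to the optimal profile. A hedged sketch (firefly-era tooling; check crushtool --help for the exact flag on your version):

ceph osd getcrushmap -o crush.bin
crushtool -i crush.bin --set-chooseleaf-vary-r 4 -o crush.new   # 4 or 5 moves less data than 1
crushtool -d crush.new -o crush.new.txt                         # decompile and inspect before injecting
ceph osd setcrushmap -i crush.new

Expect some data movement on injection; the value can then be lowered step by step (4, then 3, down to 1) to spread the rebalancing out over several smaller migrations.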
Re: [ceph-users] Question on OSD node failure recovery
The default rules are sane for small clusters with few failure domains. Anything larger than a single rack should customize its rules, and it's a good idea to figure this out early: changes to your CRUSH rules can result in a large percentage of data moving around, which will make your cluster unusable until the migration completes. It is possible to make changes after the cluster has a lot of data; from what I've been able to figure out, it involves a lot of work to manually migrate data to new pools using the new rules.

On Thu, Aug 21, 2014 at 6:23 AM, Sean Noonan sean.noo...@twosigma.com wrote:
Ceph uses CRUSH (http://ceph.com/docs/master/rados/operations/crush-map/) to determine object placement. The default generated CRUSH maps are sane, in that they will put the replicas of a placement group into separate failure domains. You do not need to worry about this simple failure case, but you should consider the network and disk I/O consequences of re-replicating large amounts of data. Sean

From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of LaBarre, James (CTR) A6IT [james.laba...@cigna.com]
Sent: Thursday, August 21, 2014 9:17 AM
To: ceph-us...@ceph.com
Subject: [ceph-users] Question on OSD node failure recovery

I understand the concept of Ceph being able to recover from the failure of an OSD (presumably with a single OSD being on a single disk), but I'm wondering what the scenario is if an OSD server node containing multiple disks should fail. Presuming you have a server containing 8-10 disks, your duplicated placement groups could end up on the same system. From the diagrams I've seen, duplicates go to separate nodes, but is this in fact how it is handled?
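For what it's worth, this is roughly what the default replicated rule looks like when you decompile a firefly-era crush map with crushtool -d (a sketch; names can differ between deployments). The "step chooseleaf ... type host" line is what sends each replica to a different node:

rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

With the default failure domain of type host, a whole-node failure costs you at most one replica of any given placement group; changing the domain to type rack (or higher) is the usual customization for larger clusters.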