[ceph-users] (no subject)
Thanks for the information. -Sreenath

- Date: Wed, 25 Mar 2015 04:11:11 +0100 From: Francois Lafont flafdiv...@free.fr To: ceph-users ceph-us...@ceph.com Subject: Re: [ceph-users] PG calculator queries Message-ID: 5512274f.1000...@free.fr Content-Type: text/plain; charset=utf-8

Hi,

Sreenath BH wrote: consider the following values for a pool: Size = 3, OSDs = 400, %Data = 100, Target PGs per OSD = 200 (this is the default). The PG calculator generates the number of PGs for this pool as 32768. Questions: 1. The Ceph documentation recommends around 100 PGs/OSD, whereas the calculator takes 200 as the default value. Are there any changes in the recommended value of PGs/OSD?

Not really, I think. Here http://ceph.com/pgcalc/, we can read: Target PGs per OSD - this value should be populated based on the following guidance: - 100 if the cluster OSD count is not expected to increase in the foreseeable future. - 200 if the cluster OSD count is expected to increase (up to double the size) in the foreseeable future. - 300 if the cluster OSD count is expected to increase between 2x and 3x in the foreseeable future. So it seems cautious to me to recommend 100 in the official documentation, because you can increase pg_num but it's impossible to decrease it. So, if I had to recommend just one value, it would be 100.

2. Under notes it says: Total PG Count below table will be the count of Primary PG copies. However, when calculating total PGs per OSD average, you must include all copies. However, the number of 200 PGs/OSD already seems to include the primary as well as replica PGs on an OSD. Is the note a typo or am I missing something?

To my mind, on the site, the Total PG Count doesn't include all copies. So, for me, there is no typo. Here are 2 basic examples from http://ceph.com/pgcalc/ with just *one* pool.

1. Pool-Name: rbd, Size: 2, OSD#: 10, %Data: 100.00%, Target PGs per OSD: 100, Suggested PG count: 512
2. Pool-Name: rbd, Size: 2, OSD#: 10, %Data: 100.00%, Target PGs per OSD: 200, Suggested PG count: 1024

In the first example, I have: 512/10 = 51.2 but (Size x 512)/10 = 102.4. In the second example, I have: 1024/10 = 102.4 but (Size x 1024)/10 = 204.8.

HTH. -- François Lafont ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
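A minimal sketch of the per-OSD arithmetic walked through above, using the values from the first example (they are illustrations, not recommendations); the point is simply that counting all copies multiplies the per-OSD figure by the pool size.

# PGs per OSD, with and without counting replica copies (example values):
$ pg_num=512; size=2; osds=10
$ awk -v p="$pg_num" -v s="$size" -v o="$osds" 'BEGIN {
      printf "%.1f primary PGs per OSD, %.1f counting all copies\n", p/o, p*s/o
  }'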
[ceph-users] All pools have size=3 but MB data and MB used ratio is 1 to 5
Thanks for the answer. Now the meaning of MB data and MB used is clear, and if all the pools have size=3 I expect a 1-to-3 ratio between the two values. I still can't understand why MB used is so big in my setup. All my pools are size=3, but the ratio of MB data to MB used is 1 to 5 instead of 1 to 3. My first guess was that I wrote a wrong crushmap that was making more than 3 copies.. (is it really possible to make such a mistake?) So I changed my crushmap and put in the default one, which just spreads data across hosts, but I see no change, the ratio is still 1 to 5. I thought maybe my 3 monitors had different views of the pgmap, so I tried to restart the monitors, but this also did not help. What useful information may I share here to troubleshoot this issue further? ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e) Thank you Saverio

2015-03-25 14:55 GMT+01:00 Gregory Farnum g...@gregs42.com: On Wed, Mar 25, 2015 at 1:24 AM, Saverio Proto ziopr...@gmail.com wrote: Hello there, I started to push data into my ceph cluster. There is something I cannot understand in the output of ceph -w. When I run ceph -w I get this kind of output: 2015-03-25 09:11:36.785909 mon.0 [INF] pgmap v278788: 26056 pgs: 26056 active+clean; 2379 MB data, 19788 MB used, 33497 GB / 33516 GB avail 2379 MB is actually the data I pushed into the cluster, I can see it also in the ceph df output, and the numbers are consistent. What I don't understand is the 19788 MB used. All my pools have size 3, so I expected something like 2379 * 3. Instead this number is very big. I really need to understand how MB used grows because I need to know how many disks to buy.

MB used is the summation of (the programmatic equivalent to) df across all your nodes, whereas MB data is calculated by the OSDs based on data they've written down. Depending on your configuration, MB used can include things like the OSD journals, or even totally unrelated data if the disks are shared with other applications. MB used including the space used by the OSD journals is my first guess about what you're seeing here, in which case you'll notice that it won't grow any faster than MB data does once the journal is fully allocated. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
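A hedged way to check the journal theory on one node: sum the journal sizes and compare against that node's share of "MB used". This assumes file-based journals under the default OSD data paths; if the journal is a raw partition, query the device with blockdev --getsize64 instead.

# Rough estimate of how much of "MB used" the OSD journals account for on this node:
$ for j in /var/lib/ceph/osd/ceph-*/journal; do du -Lm "$j"; done \
    | awk '{sum += $1} END {print sum " MB consumed by journals on this node"}'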
Re: [ceph-users] Strange osd in PG with new EC-Pool - pgs: 2 active+undersized+degraded
Hi Don, after a lot of trouble due to an unfinished setcrushmap, I was able to remove the new EC pool. I loaded the old crushmap and edited it again. After including a step set_choose_tries 100 in the crushmap, the EC pool creation with ceph osd pool create ec7archiv 1024 1024 erasure 7hostprofile worked without trouble. Due to defective PGs from this test, I removed the cache tier from the old EC pool, which caused the next bit of trouble - but that is another story! Thanks again Udo

Am 25.03.2015 20:37, schrieb Don Doerner: More info please: how did you create your EC pool? It's hard to imagine that you could have specified enough PGs to make it impossible to form PGs out of 84 OSDs (I'm assuming your SSDs are in a separate root) but I have to ask... -don-

-Original Message- From: Udo Lembke [mailto:ulem...@polarzone.de] Sent: 25 March, 2015 08:54 To: Don Doerner; ceph-us...@ceph.com Subject: Re: [ceph-users] Strange osd in PG with new EC-Pool - pgs: 2 active+undersized+degraded Hi Don, thanks for the info! It looks like choose_tries set to 200 does the trick. But the setcrushmap takes a long, long time (alarming, but the clients still have IO)... hope it finishes soon ;-) Udo

Am 25.03.2015 16:00, schrieb Don Doerner: Assuming you've calculated the number of PGs reasonably, see here http://tracker.ceph.com/issues/10350 and here http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon. I'm guessing these will address your issue. That weird number means that no OSD was found/assigned to the PG. -don-

-- The information contained in this transmission may be confidential. Any disclosure, copying, or further distribution of confidential information is not permitted unless such privilege is explicitly granted in writing by Quantum. Quantum reserves the right to have electronic communications, including email and attachments, sent across its networks filtered through anti virus and spam software programs and retain such messages in order to comply with applicable data security and retention requirements. Quantum is not responsible for the proper and complete transmission of the substance of this communication or for any delay in its receipt. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
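For reference, the usual round trip for hand-editing a crushmap as described above is sketched below. The file names are placeholders; the set_choose_tries line goes inside the erasure-coded rule, ahead of its choose/chooseleaf step.

# Sketch of the crushmap edit cycle:
$ ceph osd getcrushmap -o crushmap.bin
$ crushtool -d crushmap.bin -o crushmap.txt
#   edit crushmap.txt and add, inside the EC rule:  step set_choose_tries 100
$ crushtool -c crushmap.txt -o crushmap-new.bin
$ ceph osd setcrushmap -i crushmap-new.bin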
Re: [ceph-users] more human readable log to track request or using mapreduce for data statistics
On 26/03/2015, at 09.05, 池信泽 xmdx...@gmail.com wrote: Hi, ceph: Currently, the command "ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops" may return something like this: { description: osd_op(client.4436.1:11617 rb.0.1153.6b8b4567.0192 [] 2.8eb4757c ondisk+write e92), received_at: 2015-03-25 19:41:47.146145, age: 2.186521, duration: 1.237882, type_data: [ commit sent; apply or cleanup, { client: client.4436, tid: 11617}, [ { time: 2015-03-25 19:41:47.150803, event: event1}, { time: 2015-03-25 19:41:47.150873, event: event2}, { time: 2015-03-25 19:41:47.150895, event: event3}, { time: 2015-03-25 19:41:48.384027, event: event4}]]}

Seems like JSON format. So consider doing your custom conversion with some CLI tool that converts the JSON into the string format you need.

I think this format is not so suitable for grepping logs or for using mapreduce for data statistics. For example, I want to know the average write request latency for each rbd every day. If we could output all the latencies on just one line, it would be very easy to achieve. For example, the output log might be something like this: 2015-03-26 03:30:53.859759 osd=osd.0 pg=2.11 op=(client.4436.1:11617 rb.0.1153.6b8b4567.0192 [] 2.8eb4757c ondisk+write e92) received_at=1427355253 age=2.186521 duration=1.237882 tid=11617 client=client.4436 event1=20ms event2=300ms event3=400ms event4=100ms.

In the above: duration means the time between (reply_to_client_stamp - request_received_stamp); event1 means the time between (event1_stamp - request_received_stamp); ... event4 means the time between (event4_stamp - request_received_stamp). Now, if we output every log line as above, it would be very easy to know the average write request latency for each rbd every day. Or, using grep, it is much easier to find out which stage is the bottleneck. -- Regards, xinze ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com signature.asc Description: Message signed with OpenPGP using GPGMail ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Hammer release data and a Design question
Hi, I'm just starting on a small Ceph implementation and wanted to know the release date for Hammer. Will it coincide with the release of OpenStack?

My config (using 10G and jumbo frames on CentOS 7 / RHEL 7):
3x Mons (VMs): CPU - 2, Memory - 4G, Storage - 20 GB
4x OSDs: CPU - Haswell Xeon, Memory - 8 GB, SATA - 3x 2TB (3 OSDs per node), SSD - 2x 480 GB (journaling and, if possible, tiering)

This is a test environment to see how all the components play together. If all goes well we plan to increase the OSDs to 24 per node and the RAM to 32 GB, with dual-socket Haswell Xeons. The storage will primarily be used to provide Cinder and Swift. Just wanted to know the expert opinion on how to scale: - keep the nodes symmetric, or - just add the new beefy nodes and grow. Thanks in advance ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] How to see the content of an EC Pool after recreate the SSD-Cache tier?
Hi all, due to a very silly approach, I removed the cache tier of a filled EC pool. After recreating the pool and connecting it with the EC pool I don't see any content. How can I see the rbd_data and other objects through the new SSD cache tier? I think that I must recreate the rbd_directory (and fill it with setomapval), but I don't see anything yet!

$ rados ls -p ecarchiv | more rbd_data.2e47de674b0dc51.00390074 rbd_data.2e47de674b0dc51.0020b64f rbd_data.2fbb1952ae8944a.0016184c rbd_data.2cfc7ce74b0dc51.00363527 rbd_data.2cfc7ce74b0dc51.0004c35f rbd_data.2fbb1952ae8944a.0008db43 rbd_data.2cfc7ce74b0dc51.0015895a rbd_data.31229f0238e1f29.000135eb ...

$ rados ls -p ssd-archiv nothing

generation of the cache tier:
$ rados mkpool ssd-archiv
$ ceph osd pool set ssd-archiv crush_ruleset 5
$ ceph osd tier add ecarchiv ssd-archiv
$ ceph osd tier cache-mode ssd-archiv writeback
$ ceph osd pool set ssd-archiv hit_set_type bloom
$ ceph osd pool set ssd-archiv hit_set_count 1
$ ceph osd pool set ssd-archiv hit_set_period 3600
$ ceph osd pool set ssd-archiv target_max_bytes 500

rule ssd { ruleset 5 type replicated min_size 1 max_size 10 step take ssd step choose firstn 0 type osd step emit }

Is there any magic (or which command did I miss?) to see the existing data through the cache tier? regards - and hoping for answers Udo ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
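One command the list above does not show is setting the overlay; without it, client I/O is not redirected through the new cache pool, which on its own could explain why nothing is visible through it. A minimal sketch using the pool names from the message (the other tier settings stay as configured above):

# Tiering steps with the overlay included:
$ ceph osd tier add ecarchiv ssd-archiv
$ ceph osd tier cache-mode ssd-archiv writeback
$ ceph osd tier set-overlay ecarchiv ssd-archiv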
Re: [ceph-users] running Qemu / Hypervisor AND Ceph on the same nodes
On 26-03-15 12:04, Stefan Priebe - Profihost AG wrote: Hi Wido, Am 26.03.2015 um 11:59 schrieb Wido den Hollander: On 26-03-15 11:52, Stefan Priebe - Profihost AG wrote: Hi, in the past I read pretty often that it's not a good idea to run ceph and qemu / the hypervisors on the same nodes. But why is this a bad idea? You save space and can better use the resources you have in the nodes anyway.

Memory pressure during recovery *might* become a problem. If you make sure that you don't allocate more than, let's say, 50% for the guests it could work.

Hmm, sure? I've never seen problems like that. Currently I run each ceph node with 64GB of memory and each hypervisor node with around 512GB to 1TB RAM while having 48 cores.

Yes, it can happen. You have machines with enough memory, but if you overprovision the machines it can happen.

Using cgroups you could also prevent that the OSDs eat up all memory or CPU.

Never seen an OSD doing such crazy things.

Again, it really depends on the available memory and CPU. If you buy big machines for this purpose it probably won't be a problem.

Stefan

So technically it could work, but memory and CPU pressure is something which might give you problems.

Stefan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] running Qemu / Hypervisor AND Ceph on the same nodes
Hi, in the past I read pretty often that it's not a good idea to run ceph and qemu / the hypervisors on the same nodes. But why is this a bad idea? You save space and can better use the resources you have in the nodes anyway. Stefan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] running Qemu / Hypervisor AND Ceph on the same nodes
On 26-03-15 11:52, Stefan Priebe - Profihost AG wrote: Hi, in the past I read pretty often that it's not a good idea to run ceph and qemu / the hypervisors on the same nodes. But why is this a bad idea? You save space and can better use the resources you have in the nodes anyway.

Memory pressure during recovery *might* become a problem. If you make sure that you don't allocate more than, let's say, 50% for the guests it could work. Using cgroups you could also prevent that the OSDs eat up all memory or CPU. So technically it could work, but memory and CPU pressure is something which might give you problems.

Stefan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] running Qemu / Hypervisor AND Ceph on the same nodes
A word of caution: While normally my OSDs use very little CPU, I have occasionally had an issue where the OSDs saturate the CPU (not necessarily during a rebuild). This might be a kernel thing, or a driver thing specific to our hosts, but were this to happen to you, it now impacts your VMs as well potentially. And even during a rebuild, but when things are acting normally, CPU usage goes up by a lot relative to steady-state for periods. On top of this, you would also be sharing other system resources which would be potential abuse vectors -- network for one. I would avoid. On Thu, Mar 26, 2015 at 8:11 AM, Wido den Hollander w...@42on.com wrote: On 26-03-15 12:04, Stefan Priebe - Profihost AG wrote: Hi Wido, Am 26.03.2015 um 11:59 schrieb Wido den Hollander: On 26-03-15 11:52, Stefan Priebe - Profihost AG wrote: Hi, in the past i rwad pretty often that it's not a good idea to run ceph and qemu / the hypervisors on the same nodes. But why is this a bad idea? You save space and can better use the ressources you have in the nodes anyway. Memory pressure during recovery *might* become a problem. If you make sure that you don't allocate more then let's say 50% for the guests it could work. mhm sure? I've never seen problems like that. Currently i ran each ceph node with 64GB of memory and each hypervisor node with around 512GB to 1TB RAM while having 48 cores. Yes, it can happen. You have machines with enough memory, but if you overprovision the machines it can happen. Using cgroups you could also prevent that the OSDs eat up all memory or CPU. Never seen an OSD doing so crazy things. Again, it really depends on the available memory and CPU. If you buy big machines for this purpose it probably won't be a problem. Stefan So technically it could work, but memorey and CPU pressure is something which might give you problems. Stefan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- David Burley NOC Manager, Sr. Systems Programmer/Analyst Slashdot Media e: da...@slashdotmedia.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] running Qemu / Hypervisor AND Ceph on the same nodes
It's kind of a philosophical question. Technically there's nothing that prevents you from putting ceph and the hypervisor on the same boxes. It's a question of whether or not potential cost savings are worth increased risk of failure and contention. You can minimize those things through various means (cgroups, ristricting NUMA nodes, etc). What is more difficult is isolating disk IO contention (say if you want local SSDs for VMs), memory bus and QPI contention, network contention, etc. If the VMs are working really hard you can restrict them to their own socket, and you can even restrict memory usage to the local socket, but what about remote socket network or disk IO? (you will almost certainly want these things on the ceph socket) I wonder as well about increased risk of hardware failure with the increased load, but I don't have any statistics. I'm guessing if you spent enough time at it you could make it work relatively well, but at least personally I question how beneficial it really is after all of that. If you are going for cost savings, I suspect efficient compute and storage node designs will be nearly as good with much less complexity. Mark On 03/26/2015 07:11 AM, Wido den Hollander wrote: On 26-03-15 12:04, Stefan Priebe - Profihost AG wrote: Hi Wido, Am 26.03.2015 um 11:59 schrieb Wido den Hollander: On 26-03-15 11:52, Stefan Priebe - Profihost AG wrote: Hi, in the past i rwad pretty often that it's not a good idea to run ceph and qemu / the hypervisors on the same nodes. But why is this a bad idea? You save space and can better use the ressources you have in the nodes anyway. Memory pressure during recovery *might* become a problem. If you make sure that you don't allocate more then let's say 50% for the guests it could work. mhm sure? I've never seen problems like that. Currently i ran each ceph node with 64GB of memory and each hypervisor node with around 512GB to 1TB RAM while having 48 cores. Yes, it can happen. You have machines with enough memory, but if you overprovision the machines it can happen. Using cgroups you could also prevent that the OSDs eat up all memory or CPU. Never seen an OSD doing so crazy things. Again, it really depends on the available memory and CPU. If you buy big machines for this purpose it probably won't be a problem. Stefan So technically it could work, but memorey and CPU pressure is something which might give you problems. Stefan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
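A hedged sketch of the cgroup/NUMA isolation mentioned in this thread (cgroups v1; the core range, memory cap, and paths are illustrative assumptions, not recommendations): pin the ceph-osd processes to one socket and cap their memory so guests on the other socket are not starved.

# Pin ceph-osd to socket 0 and cap its memory (illustrative values):
$ mkdir -p /sys/fs/cgroup/cpuset/ceph-osd /sys/fs/cgroup/memory/ceph-osd
$ echo 0-11 > /sys/fs/cgroup/cpuset/ceph-osd/cpuset.cpus    # cores of socket 0
$ echo 0 > /sys/fs/cgroup/cpuset/ceph-osd/cpuset.mems       # NUMA node 0
$ echo 32G > /sys/fs/cgroup/memory/ceph-osd/memory.limit_in_bytes
$ for pid in $(pgrep ceph-osd); do echo "$pid" > /sys/fs/cgroup/cpuset/ceph-osd/tasks; echo "$pid" > /sys/fs/cgroup/memory/ceph-osd/tasks; done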
Re: [ceph-users] more human readable log to track request or using mapreduce for data statistics
On 26/03/2015, at 12.14, 池信泽 xmdx...@gmail.com wrote: It is not so convenient to do the conversion ourselves, because there are many kinds of log lines in ceph-osd.log and we only need some of them, including the latency. As it is now, it is hard to grep for the lines we want and decode them.

Still, run the output through a pipe that understands JSON and either prints exactly what you need and/or stores the data in whatever repository you want to accumulate statistics in, e.g.: ceph --admin-daemon ... dump_historic_ops | myjsonreaderNformatter.php | grep, awk, sed, cut, posix-1 filter-cmd. Don't expect the ceph developers to alter the ceph code base to match your exact need when you still want to filter the output through grep anyway, IMHO :)

2015-03-26 16:38 GMT+08:00 Steffen W Sørensen ste...@me.com: On 26/03/2015, at 09.05, 池信泽 xmdx...@gmail.com wrote: Hi, ceph: Currently, the command "ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops" may return something like this: { description: osd_op(client.4436.1:11617 rb.0.1153.6b8b4567.0192 [] 2.8eb4757c ondisk+write e92), received_at: 2015-03-25 19:41:47.146145, age: 2.186521, duration: 1.237882, type_data: [ commit sent; apply or cleanup, { client: client.4436, tid: 11617}, [ { time: 2015-03-25 19:41:47.150803, event: event1}, { time: 2015-03-25 19:41:47.150873, event: event2}, { time: 2015-03-25 19:41:47.150895, event: event3}, { time: 2015-03-25 19:41:48.384027, event: event4}]]}

Seems like JSON format. So consider doing your custom conversion with some CLI tool that converts the JSON into the string format you need.

I think this format is not so suitable for grepping logs or for using mapreduce for data statistics. For example, I want to know the average write request latency for each rbd every day. If we could output all the latencies on just one line, it would be very easy to achieve. For example, the output log might be something like this: 2015-03-26 03:30:53.859759 osd=osd.0 pg=2.11 op=(client.4436.1:11617 rb.0.1153.6b8b4567.0192 [] 2.8eb4757c ondisk+write e92) received_at=1427355253 age=2.186521 duration=1.237882 tid=11617 client=client.4436 event1=20ms event2=300ms event3=400ms event4=100ms.

In the above: duration means the time between (reply_to_client_stamp - request_received_stamp); event1 means the time between (event1_stamp - request_received_stamp); ... event4 means the time between (event4_stamp - request_received_stamp). Now, if we output every log line as above, it would be very easy to know the average write request latency for each rbd every day. Or, using grep, it is much easier to find out which stage is the bottleneck. -- Regards, xinze ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Regards, xinze ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
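A concrete sketch of the "pipe it through a JSON-aware filter" suggestion above, assuming jq is installed; it flattens each historic op onto one greppable line. The field names follow the sample output quoted in this thread and may need adjusting to what your ceph version actually emits.

# Flatten dump_historic_ops into one line per op for grep/statistics:
$ ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops \
    | jq -r '.. | objects | select(has("description")) | "\(.received_at) duration=\(.duration) \(.description)"'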
[ceph-users] Where is the systemd files?
I understand that Giant should have systemd service files, but I don't see them in the CentOS 7 packages. https://github.com/ceph/ceph/tree/giant/systemd [ulhglive-root@mon1 systemd]# rpm -qa | grep --color=always ceph ceph-common-0.93-0.el7.centos.x86_64 python-cephfs-0.93-0.el7.centos.x86_64 libcephfs1-0.93-0.el7.centos.x86_64 ceph-0.93-0.el7.centos.x86_64 ceph-deploy-1.5.22-0.noarch [ulhglive-root@mon1 systemd]# for i in $(rpm -qa | grep ceph); do rpm -ql $i | grep -i --color=always systemd; done [nothing returned] Thanks, Robert LeBlanc ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Calamari Deployment
For that matter, is there a way to build Calamari without going the whole vagrant path at all? Some way of just building it through command-line tools? I would be building it on an Openstack instance, no GUI. Seems silly to have to install an entire virtualbox environment inside something that’s already a VM. From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of JESUS CHAVEZ ARGUELLES Sent: Monday, March 02, 2015 3:00 PM To: ceph-users@lists.ceph.com Subject: [ceph-users] Calamari Deployment Does anybody know how to succesful install Calamari in rhel7 ? I have tried the vagrant thug without sucesss and it seems like a nightmare there is a Kind of Sidur when you do vagrant up where it seems not to find the vm path... Regards Jesus Chavez SYSTEMS ENGINEER-C.SALES jesch...@cisco.commailto:jesch...@cisco.com Phone: +52 55 5267 3146tel:+52%2055%205267%203146 Mobile: +51 1 5538883255tel:+51%201%205538883255 CCIE - 44433 -- CONFIDENTIALITY NOTICE: If you have received this email in error, please immediately notify the sender by e-mail at the address shown. This email transmission may contain confidential information. This information is intended only for the use of the individual(s) or entity to whom it is intended even if addressed incorrectly. Please delete it from your files if you are not the intended recipient. Thank you for your compliance. Copyright (c) 2015 Cigna == ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How to see the content of an EC Pool after recreate the SSD-Cache tier?
I don't know why you're mucking about manually with the rbd directory; the rbd tool and rados handle cache pools correctly as far as I know. -Greg On Thu, Mar 26, 2015 at 8:56 AM, Udo Lembke ulem...@polarzone.de wrote: Hi Greg, ok! It's looks like, that my problem is more setomapval-related... I must o something like rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2 \0x0f\0x00\0x00\0x002cfc7ce74b0dc51 but rados setomapval don't use the hexvalues - instead of this I got rados -p ssd-archiv listomapvals rbd_directory name_vm-409-disk-2 value: (35 bytes) : : 5c 30 78 30 66 5c 30 78 30 30 5c 30 78 30 30 5c : \0x0f\0x00\0x00\ 0010 : 30 78 30 30 32 63 66 63 37 63 65 37 34 62 30 64 : 0x002cfc7ce74b0d 0020 : 63 35 31: c51 hmm, strange. With rados -p ssd-archiv getomapval rbd_directory name_vm-409-disk-2 name_vm-409-disk-2 I got the binary inside the file name_vm-409-disk-2, but reverse do an rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2 name_vm-409-disk-2 fill the variable with name_vm-409-disk-2 and not with the content of the file... Are there other tools for the rbd_directory? regards Udo Am 26.03.2015 15:03, schrieb Gregory Farnum: You shouldn't rely on rados ls when working with cache pools. It doesn't behave properly and is a silly operation to run against a pool of any size even when it does. :) More specifically, rados ls is invoking the pgls operation. Normal read/write ops will go query the backing store for objects if they're not in the cache tier. pgls is different — it just tells you what objects are present in the PG on that OSD right now. So any objects which aren't in cache won't show up when listing on the cache pool. -Greg On Thu, Mar 26, 2015 at 3:43 AM, Udo Lembke ulem...@polarzone.de wrote: Hi all, due an very silly approach, I removed the cache tier of an filled EC pool. After recreate the pool and connect with the EC pool I don't see any content. How can I see the rbd_data and other files through the new ssd cache tier? I think, that I must recreate the rbd_directory (and fill with setomapval), but I don't see anything yet! $ rados ls -p ecarchiv | more rbd_data.2e47de674b0dc51.00390074 rbd_data.2e47de674b0dc51.0020b64f rbd_data.2fbb1952ae8944a.0016184c rbd_data.2cfc7ce74b0dc51.00363527 rbd_data.2cfc7ce74b0dc51.0004c35f rbd_data.2fbb1952ae8944a.0008db43 rbd_data.2cfc7ce74b0dc51.0015895a rbd_data.31229f0238e1f29.000135eb ... $ rados ls -p ssd-archiv nothing generation of the cache tier: $ rados mkpool ssd-archiv $ ceph osd pool set ssd-archiv crush_ruleset 5 $ ceph osd tier add ecarchiv ssd-archiv $ ceph osd tier cache-mode ssd-archiv writeback $ ceph osd pool set ssd-archiv hit_set_type bloom $ ceph osd pool set ssd-archiv hit_set_count 1 $ ceph osd pool set ssd-archiv hit_set_period 3600 $ ceph osd pool set ssd-archiv target_max_bytes 500 rule ssd { ruleset 5 type replicated min_size 1 max_size 10 step take ssd step choose firstn 0 type osd step emit } Are there any magic (or which command I missed?) to see the excisting data throug the cache tier? regards - and hoping for answers Udo ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Calamari Deployment
I used this as a guide for building calamari packages w/o using vagrant. Worked great: http://bryanapperson.com/blog/compiling-calamari-ceph-ubuntu-14-04/ On Thu, Mar 26, 2015 at 10:30 AM, Steffen W Sørensen ste...@me.com wrote: On 26/03/2015, at 17.18, LaBarre, James (CTR) A6IT james.laba...@cigna.com wrote: For that matter, is there a way to build Calamari without going the whole vagrant path at all? Some way of just building it through command-line tools? I would be building it on an Openstack instance, no GUI. Seems silly to have to install an entire virtualbox environment inside something that’s already a VM. Agreed... if U wanted to built in on your server farm/cloud stack env. I just built my packages for Debian Wheezy (with CentOS+RHEL rpms as a bonus) on my desktop Mac/OS-X with use of virtualbox and vagrant ( vagrant is an easy disposable built-env:) *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com ceph-users-boun...@lists.ceph.com] *On Behalf Of *JESUS CHAVEZ ARGUELLES *Sent:* Monday, March 02, 2015 3:00 PM *To:* ceph-users@lists.ceph.com *Subject:* [ceph-users] Calamari Deployment Does anybody know how to succesful install Calamari in rhel7 ? I have tried the vagrant thug without sucesss and it seems like a nightmare there is a Kind of Sidur when you do vagrant up where it seems not to find the vm path... Regards *Jesus Chavez* SYSTEMS ENGINEER-C.SALES jesch...@cisco.com Phone: *+52 55 5267 3146 +52%2055%205267%203146* Mobile: *+51 1 5538883255 +51%201%205538883255* CCIE - 44433 -- CONFIDENTIALITY NOTICE: If you have received this email in error, please immediately notify the sender by e-mail at the address shown. This email transmission may contain confidential information. This information is intended only for the use of the individual(s) or entity to whom it is intended even if addressed incorrectly. Please delete it from your files if you are not the intended recipient. Thank you for your compliance. Copyright (c) 2015 Cigna == ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] running Qemu / Hypervisor AND Ceph on the same nodes
We run many clusters in a similar config with shared Hypervisor/OSD/RGW/RBD in production and in staging but we have been looking into moving our storage to it's own cluster so that we can scale independently. We used AWS and scaled up a ton of virtual users using JMeter clustering to test performance and max loads. We found over all of our test with the same config and upstream network traffic, the latency went from 45ms to 2.2s after a 1,000 users. It stayed that way for the duration of the hour long test. The response time was of course higher than latency (as defined by JMeter) and our payload was a 2MB byte range request of video clips. Our use case is also changing from a standpoint that our object storage is becoming very popular within the company so it has to scale differently but we're not there yet. We plan on a new rollout being separated so we can test it before jumping all in but so far the numbers are there. Both options are valid and work. It really depends on the use cases. My 2 cents, Chris On Thu, Mar 26, 2015 at 11:36 AM, Mark Nelson mnel...@redhat.com wrote: I suspect a config like this where you only have 3 OSDs per node would be more manageable than something denser. IE theoretically a single E5-2697v3 is enough to run 36 OSDs in a 4U super micro chassis for a semi-dense converged solution. You could attempt to restrict the OSDs to one socket and then use a second E5-2697v3 for VMs. Maybe after you've got cgroups setup properly and if you've otherwise balanced things it would work out ok. I question though how much you really benefit by doing this rather than running a 36 drive storage server with lower bin CPUs and a 2nd 1U box for VMs (which you don't need as many of because you can dedicate both sockets to VMs). It probably depends quite a bit on how memory, network, and disk intensive the VMs are, but my take is that it's better to error on the side of simplicity rather than making things overly complicated. Every second you are screwing around trying to make the setup work right eats into any savings you might gain by going with the converged setup. Mark On 03/26/2015 10:12 AM, Quentin Hartman wrote: I run a converged openstack / ceph cluster with 14 1U nodes. Each has 1 SSD (os / journals), 3 1TB spinners (1 OSD each), 16 HT cores, 10Gb NICs for ceph network, and 72GB of RAM. I configure openstack to leave 3GB of RAM unused on each node for OSD / OS overhead. All the VMs are backed by ceph volumes and things generally work very well. I would prefer a dedicated storage layer simply because it seems more right, but I can't say that any of the common concerns of using this kind of setup have come up for me. Aside from shaving off that 3GB of RAM, my deployment isn't any more complex than a split stack deployment would be. After running like this for the better part of a year, I would have a hard time honestly making a real business case for the extra hardware a split stack cluster would require. QH On Thu, Mar 26, 2015 at 6:57 AM, Mark Nelson mnel...@redhat.com mailto:mnel...@redhat.com wrote: It's kind of a philosophical question. Technically there's nothing that prevents you from putting ceph and the hypervisor on the same boxes. It's a question of whether or not potential cost savings are worth increased risk of failure and contention. You can minimize those things through various means (cgroups, ristricting NUMA nodes, etc). 
What is more difficult is isolating disk IO contention (say if you want local SSDs for VMs), memory bus and QPI contention, network contention, etc. If the VMs are working really hard you can restrict them to their own socket, and you can even restrict memory usage to the local socket, but what about remote socket network or disk IO? (you will almost certainly want these things on the ceph socket) I wonder as well about increased risk of hardware failure with the increased load, but I don't have any statistics. I'm guessing if you spent enough time at it you could make it work relatively well, but at least personally I question how beneficial it really is after all of that. If you are going for cost savings, I suspect efficient compute and storage node designs will be nearly as good with much less complexity. Mark On 03/26/2015 07:11 AM, Wido den Hollander wrote: On 26-03-15 12:04, Stefan Priebe - Profihost AG wrote: Hi Wido, Am 26.03.2015 um 11:59 schrieb Wido den Hollander: On 26-03-15 11:52, Stefan Priebe - Profihost AG wrote: Hi, in the past i rwad pretty often that it's not a good idea to run ceph and qemu / the hypervisors on the same nodes. But why is this a bad idea? You save space and can better use the
Re: [ceph-users] how do I destroy cephfs? (interested in cephfs + tiering + erasure coding)
On 03/25/2015 05:44 PM, Gregory Farnum wrote: On Wed, Mar 25, 2015 at 10:36 AM, Jake Grimmett j...@mrc-lmb.cam.ac.uk wrote: Dear All, Please forgive this post if it's naive, I'm trying to familiarise myself with cephfs! I'm using Scientific Linux 6.6. with Ceph 0.87.1 My first steps with cephfs using a replicated pool worked OK. Now trying now to test cephfs via a replicated caching tier on top of an erasure pool. I've created an erasure pool, cannot put it under the existing replicated pool. My thoughts were to delete the existing cephfs, and start again, however I cannot delete the existing cephfs: errors are as follows: [root@ceph1 ~]# ceph fs rm cephfs2 Error EINVAL: all MDS daemons must be inactive before removing filesystem I've tried killing the ceph-mds process, but this does not prevent the above error. I've also tried this, which also errors: [root@ceph1 ~]# ceph mds stop 0 Error EBUSY: must decrease max_mds or else MDS will immediately reactivate Right, so did you run ceph mds set_max_mds 0 and then repeating the stop command? :) This also fail... [root@ceph1 ~]# ceph-deploy mds destroy [ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf [ceph_deploy.cli][INFO ] Invoked (1.5.21): /usr/bin/ceph-deploy mds destroy [ceph_deploy.mds][ERROR ] subcommand destroy not implemented Am I doing the right thing in trying to wipe the original cephfs config before attempting to use an erasure cold tier? Or can I just redefine the cephfs? Yeah, unfortunately you need to recreate it if you want to try and use an EC pool with cache tiering, because CephFS knows what pools it expects data to belong to. Things are unlikely to behave correctly if you try and stick an EC pool under an existing one. :( Sounds like this is all just testing, which is good because the suitability of EC+cache is very dependent on how much hot data you have, etc...good luck! -Greg many thanks, Jake Grimmett ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com Thanks for your help - much appreciated. The set_max_mds 0 command worked, but only after I rebooted the server, and restarted ceph twice. Before this I still got an mds active error, and so was unable to destroy the cephfs. Possibly I was being impatient, and needed to let mds go inactive? there were ~1 million files on the system. [root@ceph1 ~]# ceph mds set_max_mds 0 max_mds = 0 [root@ceph1 ~]# ceph mds stop 0 telling mds.0 10.1.0.86:6811/3249 to deactivate [root@ceph1 ~]# ceph mds stop 0 Error EEXIST: mds.0 not active (up:stopping) [root@ceph1 ~]# ceph fs rm cephfs2 Error EINVAL: all MDS daemons must be inactive before removing filesystem There shouldn't be any other mds servers running.. [root@ceph1 ~]# ceph mds stop 1 Error EEXIST: mds.1 not active (down:dne) At this point I rebooted the server, did a service ceph restart twice. Shutdown ceph, then restarted ceph before this command worked: [root@ceph1 ~]# ceph fs rm cephfs2 --yes-i-really-mean-it Anyhow, I've now been able to create an erasure coded pool, with a replicated tier which cephfs is running on :) *Lots* of testing to go! Again, many thanks Jake ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
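Condensed, the teardown sequence that eventually worked in this thread looks like the sketch below (a single MDS rank is assumed; how you stop the ceph-mds daemon depends on your init system, and exact flags can vary between versions).

# Tear down an existing CephFS before recreating it on new pools:
$ ceph mds set_max_mds 0
$ ceph mds stop 0            # wait until the rank is no longer up:stopping
$ service ceph stop mds      # stop the ceph-mds daemon(s) on each MDS host
$ ceph fs rm cephfs2 --yes-i-really-mean-it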
Re: [ceph-users] running Qemu / Hypervisor AND Ceph on the same nodes
That one big server sounds great, but it also sounds like a single point of failure. It's also not cheap. I've been able to build this cluster for about $1400 per node, including the 10Gb networking gear, which is less than what I see the _empty case_ you describe going for new. Even used, the lowest I've seen (lacking trays at that price) is what I paid for one of my nodes including CPU and RAM, and drive trays. So, it's been a pretty inexpensive venture considering what we get out of it. I have no per-node fault tolerance, but if one of my nodes dies, I just restart the VMs that were on it somewhere else and wait for ceph to heal. I also benefit from higher aggregate network bandwidth because I have more ports on the wire. And better per-U cpu and RAM density (for the money). *shrug* different strokes. As for difficulty of management, any screwing around I've done has had nothing to do with the converged nature of the setup, aside from discovering and changing the one setting I mentioned. So, for me at least, it's been a pretty well unqualified net win. I can imagine all sorts of scenarios where that wouldn't be, but I think it's probably debatable whether or not those constitute a common case. The higher node count does add some complexity, but that's easily overcome with some simple automation. Again though, that has no bearing on the converged setup, it's just a factor of how much CPU and RAM we need for our use case. I guess what I'm trying to say is that I don't think the answer is as cut and dry as you seem to think. QH On Thu, Mar 26, 2015 at 9:36 AM, Mark Nelson mnel...@redhat.com wrote: I suspect a config like this where you only have 3 OSDs per node would be more manageable than something denser. IE theoretically a single E5-2697v3 is enough to run 36 OSDs in a 4U super micro chassis for a semi-dense converged solution. You could attempt to restrict the OSDs to one socket and then use a second E5-2697v3 for VMs. Maybe after you've got cgroups setup properly and if you've otherwise balanced things it would work out ok. I question though how much you really benefit by doing this rather than running a 36 drive storage server with lower bin CPUs and a 2nd 1U box for VMs (which you don't need as many of because you can dedicate both sockets to VMs). It probably depends quite a bit on how memory, network, and disk intensive the VMs are, but my take is that it's better to error on the side of simplicity rather than making things overly complicated. Every second you are screwing around trying to make the setup work right eats into any savings you might gain by going with the converged setup. Mark On 03/26/2015 10:12 AM, Quentin Hartman wrote: I run a converged openstack / ceph cluster with 14 1U nodes. Each has 1 SSD (os / journals), 3 1TB spinners (1 OSD each), 16 HT cores, 10Gb NICs for ceph network, and 72GB of RAM. I configure openstack to leave 3GB of RAM unused on each node for OSD / OS overhead. All the VMs are backed by ceph volumes and things generally work very well. I would prefer a dedicated storage layer simply because it seems more right, but I can't say that any of the common concerns of using this kind of setup have come up for me. Aside from shaving off that 3GB of RAM, my deployment isn't any more complex than a split stack deployment would be. After running like this for the better part of a year, I would have a hard time honestly making a real business case for the extra hardware a split stack cluster would require. 
QH On Thu, Mar 26, 2015 at 6:57 AM, Mark Nelson mnel...@redhat.com mailto:mnel...@redhat.com wrote: It's kind of a philosophical question. Technically there's nothing that prevents you from putting ceph and the hypervisor on the same boxes. It's a question of whether or not potential cost savings are worth increased risk of failure and contention. You can minimize those things through various means (cgroups, ristricting NUMA nodes, etc). What is more difficult is isolating disk IO contention (say if you want local SSDs for VMs), memory bus and QPI contention, network contention, etc. If the VMs are working really hard you can restrict them to their own socket, and you can even restrict memory usage to the local socket, but what about remote socket network or disk IO? (you will almost certainly want these things on the ceph socket) I wonder as well about increased risk of hardware failure with the increased load, but I don't have any statistics. I'm guessing if you spent enough time at it you could make it work relatively well, but at least personally I question how beneficial it really is after all of that. If you are going for cost savings, I suspect efficient compute and storage node designs will be nearly as good with much less complexity. Mark On 03/26/2015 07:11 AM, Wido den Hollander wrote: On
Re: [ceph-users] Calamari Deployment
The first step is incorrect: echo deb http://ppa.launchpad.net/saltstack/salt/ubuntu lsb_release -sc main | sudo tee /etc/apt/sources.list.d/saltstack.list should be echo deb http://ppa.launchpad.net/saltstack/salt/ubuntu $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/saltstack.list Anyway this process fails for me at the ./configure stage for Node: creating ./config.mk Traceback (most recent call last): File tools/gyp_node, line 57, in module run_gyp(gyp_args) File tools/gyp_node, line 18, in run_gyp rc = gyp.main(args) File ./tools/gyp/pylib/gyp/__init__.py, line 526, in main return gyp_main(args) File ./tools/gyp/pylib/gyp/__init__.py, line 502, in gyp_main options.circular_check) File ./tools/gyp/pylib/gyp/__init__.py, line 91, in Load generator = __import__(generator_name, globals(), locals(), generator_name) ImportError: No module named generator.make Lee On Thu, Mar 26, 2015 at 1:14 PM, Quentin Hartman qhart...@direwolfdigital.com wrote: I used this as a guide for building calamari packages w/o using vagrant. Worked great: http://bryanapperson.com/blog/compiling-calamari-ceph-ubuntu-14-04/ On Thu, Mar 26, 2015 at 10:30 AM, Steffen W Sørensen ste...@me.com wrote: On 26/03/2015, at 17.18, LaBarre, James (CTR) A6IT james.laba...@cigna.com wrote: For that matter, is there a way to build Calamari without going the whole vagrant path at all? Some way of just building it through command-line tools? I would be building it on an Openstack instance, no GUI. Seems silly to have to install an entire virtualbox environment inside something that’s already a VM. Agreed... if U wanted to built in on your server farm/cloud stack env. I just built my packages for Debian Wheezy (with CentOS+RHEL rpms as a bonus) on my desktop Mac/OS-X with use of virtualbox and vagrant ( vagrant is an easy disposable built-env:) *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com ceph-users-boun...@lists.ceph.com] *On Behalf Of *JESUS CHAVEZ ARGUELLES *Sent:* Monday, March 02, 2015 3:00 PM *To:* ceph-users@lists.ceph.com *Subject:* [ceph-users] Calamari Deployment Does anybody know how to succesful install Calamari in rhel7 ? I have tried the vagrant thug without sucesss and it seems like a nightmare there is a Kind of Sidur when you do vagrant up where it seems not to find the vm path... Regards *Jesus Chavez* SYSTEMS ENGINEER-C.SALES jesch...@cisco.com Phone: *+52 55 5267 3146 +52%2055%205267%203146* Mobile: *+51 1 5538883255 +51%201%205538883255* CCIE - 44433 -- CONFIDENTIALITY NOTICE: If you have received this email in error, please immediately notify the sender by e-mail at the address shown. This email transmission may contain confidential information. This information is intended only for the use of the individual(s) or entity to whom it is intended even if addressed incorrectly. Please delete it from your files if you are not the intended recipient. Thank you for your compliance. Copyright (c) 2015 Cigna == ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Calamari Deployment
On 26/03/2015, at 17.18, LaBarre, James (CTR) A6IT james.laba...@cigna.com wrote: For that matter, is there a way to build Calamari without going the whole vagrant path at all? Some way of just building it through command-line tools? I would be building it on an Openstack instance, no GUI. Seems silly to have to install an entire virtualbox environment inside something that’s already a VM. Agreed... if U wanted to built in on your server farm/cloud stack env. I just built my packages for Debian Wheezy (with CentOS+RHEL rpms as a bonus) on my desktop Mac/OS-X with use of virtualbox and vagrant ( vagrant is an easy disposable built-env:) From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of JESUS CHAVEZ ARGUELLES Sent: Monday, March 02, 2015 3:00 PM To: ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com Subject: [ceph-users] Calamari Deployment Does anybody know how to succesful install Calamari in rhel7 ? I have tried the vagrant thug without sucesss and it seems like a nightmare there is a Kind of Sidur when you do vagrant up where it seems not to find the vm path... Regards Jesus Chavez SYSTEMS ENGINEER-C.SALES jesch...@cisco.com mailto:jesch...@cisco.com Phone: +52 55 5267 3146 tel:+52%2055%205267%203146 Mobile: +51 1 5538883255 tel:+51%201%205538883255 CCIE - 44433 -- CONFIDENTIALITY NOTICE: If you have received this email in error, please immediately notify the sender by e-mail at the address shown. This email transmission may contain confidential information. This information is intended only for the use of the individual(s) or entity to whom it is intended even if addressed incorrectly. Please delete it from your files if you are not the intended recipient. Thank you for your compliance. Copyright (c) 2015 Cigna == ___ ceph-users mailing list ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How to see the content of an EC Pool after recreate the SSD-Cache tier?
Hi Greg, On 26.03.2015 18:46, Gregory Farnum wrote: I don't know why you're mucking about manually with the rbd directory; the rbd tool and rados handle cache pools correctly as far as I know.

that's because I deleted the cache tier pool, so the objects like rbd_header.2cfc7ce74b0dc51 and rbd_directory are gone. All the vm-disk data is still in the EC pool (rbd_data.2cfc7ce74b0dc51.*). I can't see or recreate the VM disk, because rados setomapval doesn't like binary data and the rbd tool can't (re)create an rbd disk with a given hash (like 2cfc7ce74b0dc51). The only way I see at the moment is to create new rbd disks and copy all blocks with rados get -> file -> rados put. The problem is the time it takes (days to weeks for 3 * 16TB)... Udo

-Greg On Thu, Mar 26, 2015 at 8:56 AM, Udo Lembke ulem...@polarzone.de wrote: Hi Greg, OK! It looks like my problem is more setomapval-related... I must do something like rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2 \0x0f\0x00\0x00\0x002cfc7ce74b0dc51 but rados setomapval doesn't take the hex values - instead I got rados -p ssd-archiv listomapvals rbd_directory name_vm-409-disk-2 value: (35 bytes) : : 5c 30 78 30 66 5c 30 78 30 30 5c 30 78 30 30 5c : \0x0f\0x00\0x00\ 0010 : 30 78 30 30 32 63 66 63 37 63 65 37 34 62 30 64 : 0x002cfc7ce74b0d 0020 : 63 35 31: c51

Hmm, strange. With rados -p ssd-archiv getomapval rbd_directory name_vm-409-disk-2 name_vm-409-disk-2 I get the binary value inside the file name_vm-409-disk-2, but in reverse, rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2 name_vm-409-disk-2 fills the value with the string name_vm-409-disk-2 and not with the content of the file... Are there other tools for the rbd_directory? regards Udo

Am 26.03.2015 15:03, schrieb Gregory Farnum: You shouldn't rely on rados ls when working with cache pools. It doesn't behave properly and is a silly operation to run against a pool of any size even when it does. :) More specifically, rados ls is invoking the pgls operation. Normal read/write ops will go query the backing store for objects if they're not in the cache tier. pgls is different — it just tells you what objects are present in the PG on that OSD right now. So any objects which aren't in cache won't show up when listing on the cache pool. -Greg

On Thu, Mar 26, 2015 at 3:43 AM, Udo Lembke ulem...@polarzone.de wrote: Hi all, due to a very silly approach, I removed the cache tier of a filled EC pool. After recreating the pool and connecting it with the EC pool I don't see any content. How can I see the rbd_data and other objects through the new SSD cache tier? I think that I must recreate the rbd_directory (and fill it with setomapval), but I don't see anything yet! $ rados ls -p ecarchiv | more rbd_data.2e47de674b0dc51.00390074 rbd_data.2e47de674b0dc51.0020b64f rbd_data.2fbb1952ae8944a.0016184c rbd_data.2cfc7ce74b0dc51.00363527 rbd_data.2cfc7ce74b0dc51.0004c35f rbd_data.2fbb1952ae8944a.0008db43 rbd_data.2cfc7ce74b0dc51.0015895a rbd_data.31229f0238e1f29.000135eb ...
$ rados ls -p ssd-archiv nothing

generation of the cache tier: $ rados mkpool ssd-archiv $ ceph osd pool set ssd-archiv crush_ruleset 5 $ ceph osd tier add ecarchiv ssd-archiv $ ceph osd tier cache-mode ssd-archiv writeback $ ceph osd pool set ssd-archiv hit_set_type bloom $ ceph osd pool set ssd-archiv hit_set_count 1 $ ceph osd pool set ssd-archiv hit_set_period 3600 $ ceph osd pool set ssd-archiv target_max_bytes 500

rule ssd { ruleset 5 type replicated min_size 1 max_size 10 step take ssd step choose firstn 0 type osd step emit }

Is there any magic (or which command did I miss?) to see the existing data through the cache tier? regards - and hoping for answers Udo ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
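On the setomapval problem above: some rados builds read the omap value from standard input when the value argument is omitted; whether the 0.87-era binary does is an assumption that needs checking. If it does, the raw bytes (a 4-byte little-endian length prefix, 0x0f = 15, followed by the image id, matching the hexdump quoted above) can be fed in with printf rather than as a shell-escaped string. Hand-editing rbd_directory is risky, so test this on a scratch pool first.

# Hedged sketch - only works if this rados build accepts the omap value on stdin:
$ printf '\x0f\x00\x00\x002cfc7ce74b0dc51' | rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2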
Re: [ceph-users] Migrating objects from one pool to another?
That's a great idea. I know I can setup cinder (the openstack volume manager) as a multi-backend manager and migrate from one backend to the other, each backend linking to different pools of the same ceph cluster. What bugs me though is that I'm pretty sure the image store, glance, wouldn't let me do that. Additionally, since the compute component also has its own ceph pool, I'm pretty sure it won't let me migrate the data through openstack. On 3/26/2015 3:54 PM, Steffen W Sørensen wrote: On 26/03/2015, at 20.38, J-P Methot jpmet...@gtcomm.net wrote: Lately I've been going back to work on one of my first ceph setup and now I see that I have created way too many placement groups for the pools on that setup (about 10 000 too many). I believe this may impact performances negatively, as the performances on this ceph cluster are abysmal. Since it is not possible to reduce the number of PGs in a pool, I was thinking of creating new pools with a smaller number of PGs, moving the data from the old pools to the new pools and then deleting the old pools. I haven't seen any command to copy objects from one pool to another. Would that be possible? I'm using ceph for block storage with openstack, so surely there must be a way to move block devices from a pool to another, right? What I did a one point was going one layer higher in my storage abstraction, and created new Ceph pools and used those for new storage resources/pool in my VM env. (ProxMox) on top of Ceph RBD and then did a live migration of virtual disks there, assume you could do the same in OpenStack. My 0.02$ /Steffen -- == Jean-Philippe Méthot Administrateur système / System administrator GloboTech Communications Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750 jpmet...@gtcomm.net http://www.gtcomm.net ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
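For the block-storage side, a hedged sketch of moving images into a freshly created pool with fewer PGs: rbd cp copies one image at a time (snapshots are not preserved), while rados cppool does a flat object-level copy of a whole pool and has its own limitations. The pool names are placeholders, and the instances using the volumes should be stopped first.

# Copy RBD images to a new pool, one at a time:
$ rbd ls old-pool | while read img; do rbd cp old-pool/"$img" new-pool/"$img"; done
# or, a flat whole-pool object copy:
$ rados cppool old-pool new-pool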
Re: [ceph-users] Cascading Failure of OSDs
Since I have been in ceph-land today, it reminded me that I needed to close the loop on this. I was finally able to isolate this problem down to a faulty NIC on the ceph cluster network. It worked, but it was accumulating a huge number of Rx errors. My best guess is some receive buffer cache failed? Anyway, having a NIC go weird like that is totally consistent with all the weird problems I was seeing, the corrupted PGs, and the inability for the cluster to settle down. As a result we've added NIC error rates to our monitoring suite on the cluster so we'll hopefully see this coming if it ever happens again. QH On Sat, Mar 7, 2015 at 11:36 AM, Quentin Hartman qhart...@direwolfdigital.com wrote: So I'm not sure what has changed, but in the last 30 minutes the errors which were all over the place, have finally settled down to this: http://pastebin.com/VuCKwLDp The only thing I can think of is that I also net the noscrub flag in addition to the nodeep-scrub when I first got here, and that finally took. Anyway, they've been stable there for some time now, and I've been able to get a couple VMs to come up and behave reasonably well. At this point I'm prepared to wipe the entire cluster and start over if I have to to get it truly consistent again, since my efforts to zap pg 3.75b haven't borne fruit. However, if anyone has a less nuclear option they'd like to suggest, I'm all ears. I've tried to export/re-import the pg and do a force_create. The import failed, and the force_create just reverted back to being incomplete after creating for a few minutes. QH On Sat, Mar 7, 2015 at 9:29 AM, Quentin Hartman qhart...@direwolfdigital.com wrote: Now that I have a better understanding of what's happening, I threw together a little one-liner to create a report of the errors that the OSDs are seeing. 
Lots of missing / corrupted pg shards: https://gist.github.com/qhartman/174cc567525060cb462e I've experimented with exporting / importing the broken pgs with ceph_objectstore_tool, and while they seem to export correctly, the tool crashes when trying to import: root@node12:/var/lib/ceph/osd# ceph_objectstore_tool --op import --data-path /var/lib/ceph/osd/ceph-19/ --journal-path /var/lib/ceph/osd/ceph-19/journal --file ~/3.75b.export Importing pgid 3.75b Write 2672075b/rbd_data.2bce2ae8944a.1509/head//3 Write 3473075b/rbd_data.1d6172ae8944a.0001636a/head//3 Write f2e4075b/rbd_data.c816f2ae8944a.0208/head//3 Write f215075b/rbd_data.c4a892ae8944a.0b6b/head//3 Write c086075b/rbd_data.42a742ae8944a.02fb/head//3 Write 6f9d075b/rbd_data.1d6172ae8944a.5ac3/head//3 Write dd9f075b/rbd_data.1d6172ae8944a.0001127d/head//3 Write f9f075b/rbd_data.c4a892ae8944a.f056/head//3 Write 4d71175b/rbd_data.c4a892ae8944a.9e51/head//3 Write bcc3175b/rbd_data.2bce2ae8944a.133f/head//3 Write 1356175b/rbd_data.3f862ae8944a.05d6/head//3 Write d327175b/rbd_data.1d6172ae8944a.0001af85/head//3 Write 7388175b/rbd_data.2bce2ae8944a.1353/head//3 Write 8cda175b/rbd_data.c4a892ae8944a.b585/head//3 Write 6b3c175b/rbd_data.c4a892ae8944a.00018e91/head//3 Write d37f175b/rbd_data.1d6172ae8944a.3a90/head//3 Write 4590275b/rbd_data.2bce2ae8944a.1f67/head//3 Write fe51275b/rbd_data.c4a892ae8944a.e917/head//3 Write 3402275b/rbd_data.3f5c2ae8944a.1252/6//3 osd/SnapMapper.cc: In function 'void SnapMapper::add_oid(const hobject_t, const std::setsnapid_t, MapCacher::Transactionstd::basic_stringchar, ceph::buffer::list*)' thread 7fba67ff3900 time 2015-03-07 16:21:57.921820 osd/SnapMapper.cc: 228: FAILED assert(r == -2) ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xb94fbb] 2: (SnapMapper::add_oid(hobject_t const, std::setsnapid_t, std::lesssnapid_t, std::allocatorsnapid_t const, MapCacher::Transactionstd::string, ceph::buffer::list*)+0x63e) [0x7b719e] 3: (get_attrs(ObjectStore*, coll_t, ghobject_t, ObjectStore::Transaction*, ceph::buffer::list, OSDriver, SnapMapper)+0x67c) [0x661a1c] 4: (get_object(ObjectStore*, coll_t, ceph::buffer::list)+0x3e5) [0x661f85] 5: (do_import(ObjectStore*, OSDSuperblock)+0xd61) [0x665be1] 6: (main()+0x2208) [0x63f178] 7: (__libc_start_main()+0xf5) [0x7fba627b2ec5] 8: ceph_objectstore_tool() [0x659577] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. terminate called after throwing an instance of 'ceph::FailedAssertion' *** Caught signal (Aborted) ** in thread 7fba67ff3900 ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e) 1: ceph_objectstore_tool() [0xab1cea] 2: (()+0x10340) [0x7fba66a95340] 3: (gsignal()+0x39) [0x7fba627c7cc9] 4: (abort()+0x148) [0x7fba627cb0d8] 5:
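Since accumulating NIC Rx errors turned out to be the tell-tale here, a minimal check along these lines can be dropped into a monitoring suite; the interface name and threshold are only examples:
#!/bin/sh
# Warn when rx_errors on the cluster-network NIC grows between runs.
IFACE=eth1
THRESHOLD=100
STATE=/var/tmp/rx_errors.$IFACE
curr=$(cat /sys/class/net/$IFACE/statistics/rx_errors)
prev=$(cat "$STATE" 2>/dev/null || echo 0)
echo "$curr" > "$STATE"
delta=$((curr - prev))
if [ "$delta" -gt "$THRESHOLD" ]; then
    echo "WARNING: $IFACE rx_errors grew by $delta"
    exit 1
fi
echo "OK: $IFACE rx_errors delta $delta"
If the driver supports it, ethtool -S on the interface gives a more detailed per-counter breakdown.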
Re: [ceph-users] how do I destroy cephfs? (interested in cephfs + tiering + erasure coding)
For what it's worth, I don't think being patient was the answer. I was having the same problem a couple of weeks ago, and I waited from before 5pm one day until after 8am the next, and still got the same errors. I ended up adding a new cephfs pool with a newly-created small pool, but was never able to actually remove cephfs altogether. On Thu, Mar 26, 2015 at 12:45 PM, Jake Grimmett j...@mrc-lmb.cam.ac.uk wrote: On 03/25/2015 05:44 PM, Gregory Farnum wrote: On Wed, Mar 25, 2015 at 10:36 AM, Jake Grimmett j...@mrc-lmb.cam.ac.uk wrote: Dear All, Please forgive this post if it's naive, I'm trying to familiarise myself with cephfs! I'm using Scientific Linux 6.6. with Ceph 0.87.1 My first steps with cephfs using a replicated pool worked OK. Now trying now to test cephfs via a replicated caching tier on top of an erasure pool. I've created an erasure pool, cannot put it under the existing replicated pool. My thoughts were to delete the existing cephfs, and start again, however I cannot delete the existing cephfs: errors are as follows: [root@ceph1 ~]# ceph fs rm cephfs2 Error EINVAL: all MDS daemons must be inactive before removing filesystem I've tried killing the ceph-mds process, but this does not prevent the above error. I've also tried this, which also errors: [root@ceph1 ~]# ceph mds stop 0 Error EBUSY: must decrease max_mds or else MDS will immediately reactivate Right, so did you run ceph mds set_max_mds 0 and then repeating the stop command? :) This also fail... [root@ceph1 ~]# ceph-deploy mds destroy [ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf [ceph_deploy.cli][INFO ] Invoked (1.5.21): /usr/bin/ceph-deploy mds destroy [ceph_deploy.mds][ERROR ] subcommand destroy not implemented Am I doing the right thing in trying to wipe the original cephfs config before attempting to use an erasure cold tier? Or can I just redefine the cephfs? Yeah, unfortunately you need to recreate it if you want to try and use an EC pool with cache tiering, because CephFS knows what pools it expects data to belong to. Things are unlikely to behave correctly if you try and stick an EC pool under an existing one. :( Sounds like this is all just testing, which is good because the suitability of EC+cache is very dependent on how much hot data you have, etc...good luck! -Greg many thanks, Jake Grimmett ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com Thanks for your help - much appreciated. The set_max_mds 0 command worked, but only after I rebooted the server, and restarted ceph twice. Before this I still got an mds active error, and so was unable to destroy the cephfs. Possibly I was being impatient, and needed to let mds go inactive? there were ~1 million files on the system. [root@ceph1 ~]# ceph mds set_max_mds 0 max_mds = 0 [root@ceph1 ~]# ceph mds stop 0 telling mds.0 10.1.0.86:6811/3249 to deactivate [root@ceph1 ~]# ceph mds stop 0 Error EEXIST: mds.0 not active (up:stopping) [root@ceph1 ~]# ceph fs rm cephfs2 Error EINVAL: all MDS daemons must be inactive before removing filesystem There shouldn't be any other mds servers running.. [root@ceph1 ~]# ceph mds stop 1 Error EEXIST: mds.1 not active (down:dne) At this point I rebooted the server, did a service ceph restart twice. 
Shutdown ceph, then restarted ceph before this command worked: [root@ceph1 ~]# ceph fs rm cephfs2 --yes-i-really-mean-it Anyhow, I've now been able to create an erasure coded pool, with a replicated tier which cephfs is running on :) *Lots* of testing to go! Again, many thanks Jake ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
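To recap the sequence that finally worked in this thread as something repeatable (0.87-era syntax; the filesystem name cephfs2 is the one used above, and as described it may take daemon restarts before the MDS actually goes inactive):
ceph mds set_max_mds 0
ceph mds stop 0                 # may need to be repeated, or preceded by restarts
ceph mds stat                   # wait until no MDS is reported active
ceph fs rm cephfs2 --yes-i-really-mean-it
# then create the EC pool, its replicated cache tier, and a fresh filesystem on top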
Re: [ceph-users] All client writes block when 2 of 3 OSDs down
Has the OSD actually been detected as down yet? You'll also need to set that min size on your existing pools (ceph osd pool pool set min_size 1 or similar) to change their behavior; the config option only takes effect for newly-created pools. (Thus the default.) On Thu, Mar 26, 2015 at 1:29 PM, Lee Revell rlrev...@gmail.com wrote: I added the osd pool default min size = 1 to test the behavior when 2 of 3 OSDs are down, but the behavior is exactly the same as without it: when the 2nd OSD is killed, all client writes start to block and these pipe.(stuff).fault messages begin: 2015-03-26 16:08:50.775848 7fce177fe700 0 monclient: hunting for new mon 2015-03-26 16:08:53.781133 7fce1c2f9700 0 -- 192.168.122.111:0/1011003 192.168.122.131:6789/0 pipe(0x7fce0c01d260 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fce0c01d4f0).fault 2015-03-26 16:09:00.009092 7fce1c3fa700 0 -- 192.168.122.111:0/1011003 192.168.122.141:6789/0 pipe(0x7fce1802dab0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fce1802dd40).fault 2015-03-26 16:09:12.013147 7fce1c2f9700 0 -- 192.168.122.111:0/1011003 192.168.122.131:6789/0 pipe(0x7fce1802e740 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fce1802e9d0).fault 2015-03-26 16:10:06.013113 7fce1c2f9700 0 -- 192.168.122.111:0/1011003 192.168.122.131:6789/0 pipe(0x7fce1802df80 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fce1801e600).fault 2015-03-26 16:10:36.013166 7fce1c3fa700 0 -- 192.168.122.111:0/1011003 192.168.122.141:6789/0 pipe(0x7fce1802ebc0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fce1802ee50).fault Here is my ceph.conf: [global] fsid = db460aa2-5129-4aaa-8b2e-43eac727124e mon_initial_members = ceph-node-1 mon_host = 192.168.122.121 auth_cluster_required = cephx auth_service_required = cephx auth_client_required = cephx filestore_xattr_use_omap = true osd pool default size = 3 osd pool default min size = 1 public network = 192.168.122.0/24 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
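Spelled out as commands, Greg's two points look roughly like this (the pool name is a placeholder):
ceph osd tree                          # check whether the failed OSD is actually marked down
ceph osd pool set <pool> min_size 1    # repeat for each existing pool
ceph osd pool get <pool> min_size      # verify; the config default only applies to new pools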
Re: [ceph-users] how do I destroy cephfs? (interested in cephfs + tiering + erasure coding)
There have been bugs here in the recent past which have been fixed for hammer, at least...it's possible we didn't backport it for the giant point release. :( But for users going forward that procedure should be good! -Greg On Thu, Mar 26, 2015 at 11:26 AM, Kyle Hutson kylehut...@ksu.edu wrote: For what it's worth, I don't think being patient was the answer. I was having the same problem a couple of weeks ago, and I waited from before 5pm one day until after 8am the next, and still got the same errors. I ended up adding a new cephfs pool with a newly-created small pool, but was never able to actually remove cephfs altogether. On Thu, Mar 26, 2015 at 12:45 PM, Jake Grimmett j...@mrc-lmb.cam.ac.uk wrote: On 03/25/2015 05:44 PM, Gregory Farnum wrote: On Wed, Mar 25, 2015 at 10:36 AM, Jake Grimmett j...@mrc-lmb.cam.ac.uk wrote: Dear All, Please forgive this post if it's naive, I'm trying to familiarise myself with cephfs! I'm using Scientific Linux 6.6. with Ceph 0.87.1 My first steps with cephfs using a replicated pool worked OK. Now trying now to test cephfs via a replicated caching tier on top of an erasure pool. I've created an erasure pool, cannot put it under the existing replicated pool. My thoughts were to delete the existing cephfs, and start again, however I cannot delete the existing cephfs: errors are as follows: [root@ceph1 ~]# ceph fs rm cephfs2 Error EINVAL: all MDS daemons must be inactive before removing filesystem I've tried killing the ceph-mds process, but this does not prevent the above error. I've also tried this, which also errors: [root@ceph1 ~]# ceph mds stop 0 Error EBUSY: must decrease max_mds or else MDS will immediately reactivate Right, so did you run ceph mds set_max_mds 0 and then repeating the stop command? :) This also fail... [root@ceph1 ~]# ceph-deploy mds destroy [ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf [ceph_deploy.cli][INFO ] Invoked (1.5.21): /usr/bin/ceph-deploy mds destroy [ceph_deploy.mds][ERROR ] subcommand destroy not implemented Am I doing the right thing in trying to wipe the original cephfs config before attempting to use an erasure cold tier? Or can I just redefine the cephfs? Yeah, unfortunately you need to recreate it if you want to try and use an EC pool with cache tiering, because CephFS knows what pools it expects data to belong to. Things are unlikely to behave correctly if you try and stick an EC pool under an existing one. :( Sounds like this is all just testing, which is good because the suitability of EC+cache is very dependent on how much hot data you have, etc...good luck! -Greg many thanks, Jake Grimmett ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com Thanks for your help - much appreciated. The set_max_mds 0 command worked, but only after I rebooted the server, and restarted ceph twice. Before this I still got an mds active error, and so was unable to destroy the cephfs. Possibly I was being impatient, and needed to let mds go inactive? there were ~1 million files on the system. [root@ceph1 ~]# ceph mds set_max_mds 0 max_mds = 0 [root@ceph1 ~]# ceph mds stop 0 telling mds.0 10.1.0.86:6811/3249 to deactivate [root@ceph1 ~]# ceph mds stop 0 Error EEXIST: mds.0 not active (up:stopping) [root@ceph1 ~]# ceph fs rm cephfs2 Error EINVAL: all MDS daemons must be inactive before removing filesystem There shouldn't be any other mds servers running.. 
[root@ceph1 ~]# ceph mds stop 1 Error EEXIST: mds.1 not active (down:dne) At this point I rebooted the server, did a service ceph restart twice. Shutdown ceph, then restarted ceph before this command worked: [root@ceph1 ~]# ceph fs rm cephfs2 --yes-i-really-mean-it Anyhow, I've now been able to create an erasure coded pool, with a replicated tier which cephfs is running on :) *Lots* of testing to go! Again, many thanks Jake ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] running Qemu / Hypervisor AND Ceph on the same nodes
Am 26.03.2015 um 16:36 schrieb Mark Nelson: I suspect a config like this where you only have 3 OSDs per node would be more manageable than something denser. IE theoretically a single E5-2697v3 is enough to run 36 OSDs in a 4U super micro chassis for a semi-dense converged solution. You could attempt to restrict the OSDs to one socket and then use a second E5-2697v3 for VMs. Maybe after you've got cgroups setup properly and if you've otherwise balanced things it would work out ok. I question though how much you really benefit by doing this rather than running a 36 drive storage server with lower bin CPUs and a 2nd 1U box for VMs (which you don't need as many of because you can dedicate both sockets to VMs). that's pretty big. I have only around 6-8 ssd drives per node. In case of 36 osds per node i won't mix. It probably depends quite a bit on how memory, network, and disk intensive the VMs are, but my take is that it's better to error on the side of simplicity rather than making things overly complicated. Every second you are screwing around trying to make the setup work right eats into any savings you might gain by going with the converged setup. Mark On 03/26/2015 10:12 AM, Quentin Hartman wrote: I run a converged openstack / ceph cluster with 14 1U nodes. Each has 1 SSD (os / journals), 3 1TB spinners (1 OSD each), 16 HT cores, 10Gb NICs for ceph network, and 72GB of RAM. I configure openstack to leave 3GB of RAM unused on each node for OSD / OS overhead. All the VMs are backed by ceph volumes and things generally work very well. I would prefer a dedicated storage layer simply because it seems more right, but I can't say that any of the common concerns of using this kind of setup have come up for me. Aside from shaving off that 3GB of RAM, my deployment isn't any more complex than a split stack deployment would be. After running like this for the better part of a year, I would have a hard time honestly making a real business case for the extra hardware a split stack cluster would require. QH On Thu, Mar 26, 2015 at 6:57 AM, Mark Nelson mnel...@redhat.com mailto:mnel...@redhat.com wrote: It's kind of a philosophical question. Technically there's nothing that prevents you from putting ceph and the hypervisor on the same boxes. It's a question of whether or not potential cost savings are worth increased risk of failure and contention. You can minimize those things through various means (cgroups, ristricting NUMA nodes, etc). What is more difficult is isolating disk IO contention (say if you want local SSDs for VMs), memory bus and QPI contention, network contention, etc. If the VMs are working really hard you can restrict them to their own socket, and you can even restrict memory usage to the local socket, but what about remote socket network or disk IO? (you will almost certainly want these things on the ceph socket) I wonder as well about increased risk of hardware failure with the increased load, but I don't have any statistics. I'm guessing if you spent enough time at it you could make it work relatively well, but at least personally I question how beneficial it really is after all of that. If you are going for cost savings, I suspect efficient compute and storage node designs will be nearly as good with much less complexity. 
Mark On 03/26/2015 07:11 AM, Wido den Hollander wrote: On 26-03-15 12:04, Stefan Priebe - Profihost AG wrote: Hi Wido, On 26.03.2015 at 11:59, Wido den Hollander wrote: On 26-03-15 11:52, Stefan Priebe - Profihost AG wrote: Hi, in the past I read pretty often that it's not a good idea to run ceph and qemu / the hypervisors on the same nodes. But why is this a bad idea? You save space and can better use the resources you have in the nodes anyway. Memory pressure during recovery *might* become a problem. If you make sure that you don't allocate more than, let's say, 50% for the guests it could work. mhm sure? I've never seen problems like that. Currently I run each ceph node with 64GB of memory and each hypervisor node with around 512GB to 1TB RAM while having 48 cores. Yes, it can happen. You have machines with enough memory, but if you overprovision the machines it can happen. Using cgroups you could also prevent the OSDs from eating up all memory or CPU. Never seen an OSD doing such crazy things. Again, it really depends on the available memory and CPU. If you buy big machines for this purpose it probably won't be a problem. Stefan
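For the cgroups point, a rough sketch of confining all OSD daemons to one socket with a cpuset cgroup might look like this; the CPU and NUMA node numbers are examples, and it assumes the cpuset controller is mounted in the usual place:
mkdir -p /sys/fs/cgroup/cpuset/ceph-osd
echo 0-7 > /sys/fs/cgroup/cpuset/ceph-osd/cpuset.cpus    # cores of socket 0
echo 0 > /sys/fs/cgroup/cpuset/ceph-osd/cpuset.mems      # memory of NUMA node 0
for pid in $(pgrep -x ceph-osd); do
    echo "$pid" > /sys/fs/cgroup/cpuset/ceph-osd/tasks
done
The hypervisor/VM processes would get a matching cgroup on the other socket; memory limits can be added the same way if overprovisioning is a concern.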
[ceph-users] Ceph RBD devices management OpenSVC integration
Hi Team, I’ve just written blog post regarding integration of CEPH RBD devices management in OpenSVC service : http://www.flox-arts.net/article30/ceph-rbd-devices-management-with-opensvc-service http://www.flox-arts.net/article30/ceph-rbd-devices-management-with-opensvc-service Next blog post will be regarding Snapshots clones (integrated too in OpenSVC) Thanks Florent Monthel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Calamari Deployment
Well, we’re a RedHat shop, so I’ll have to see what’s adaptable from there. (Mint on all my home systems, so I’m not totally lost with Ubuntu g) From: Quentin Hartman [mailto:qhart...@direwolfdigital.com] Sent: Thursday, March 26, 2015 1:15 PM To: Steffen W Sørensen Cc: LaBarre, James (CTR) A6IT; ceph-users@lists.ceph.com Subject: Re: [ceph-users] Calamari Deployment I used this as a guide for building calamari packages w/o using vagrant. Worked great: http://bryanapperson.com/blog/compiling-calamari-ceph-ubuntu-14-04/ On Thu, Mar 26, 2015 at 10:30 AM, Steffen W Sørensen ste...@me.com wrote: On 26/03/2015, at 17.18, LaBarre, James (CTR) A6IT james.laba...@cigna.com wrote: For that matter, is there a way to build Calamari without going the whole vagrant path at all? Some way of just building it through command-line tools? I would be building it on an Openstack instance, no GUI. Seems silly to have to install an entire virtualbox environment inside something that’s already a VM. Agreed... if you wanted to build it on your server farm/cloud stack env. I just built my packages for Debian Wheezy (with CentOS+RHEL rpms as a bonus) on my desktop Mac/OS-X with use of virtualbox and vagrant (vagrant is an easy disposable build env :) From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of JESUS CHAVEZ ARGUELLES Sent: Monday, March 02, 2015 3:00 PM To: ceph-users@lists.ceph.com Subject: [ceph-users] Calamari Deployment Does anybody know how to successfully install Calamari in rhel7? I have tried the vagrant thing without success and it seems like a nightmare; there is a kind of issue when you do vagrant up where it seems not to find the vm path... Regards Jesus Chavez SYSTEMS ENGINEER-C.SALES jesch...@cisco.com Phone: +52 55 5267 3146 Mobile: +51 1 5538883255 CCIE - 44433
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?
That's fair enough Greg, I'll keep upgrading when the opportunity arises, and maybe it'll spring back to life someday :-) -Original Message- From: Gregory Farnum [mailto:g...@gregs42.com] Sent: 20 March 2015 23:05 To: Chris Murray Cc: ceph-users Subject: Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help? On Fri, Mar 20, 2015 at 4:03 PM, Chris Murray chrismurra...@gmail.com wrote: Ah, I was wondering myself if compression could be causing an issue, but I'm reconsidering now. My latest experiment should hopefully help troubleshoot. So, I remembered that ZLIB is slower, but is more 'safe for old kernels'. I try that: find /var/lib/ceph/osd/ceph-1/current -xdev \( -type f -o -type d \) -exec btrfs filesystem defragment -v -czlib -- {} + After much, much waiting, all files have been rewritten, but the OSD still gets stuck at the same point. I've now unset the compress attribute on all files and started the defragment process again, but I'm not too hopeful since the files must be readable/writeable if I didn't get some failure during the defrag process. find /var/lib/ceph/osd/ceph-1/current -xdev \( -type f -o -type d \) -exec chattr -c -- {} + find /var/lib/ceph/osd/ceph-1/current -xdev \( -type f -o -type d \) -exec btrfs filesystem defragment -v -- {} + (latter command still running) Any other ideas at all? In the absence of the problem being spelled out to me with an error of some sort, I'm not sure how to troubleshoot further. Not much, sorry. Is it safe to upgrade a problematic cluster, when the time comes, in case this ultimately is a CEPH bug which is fixed in something later than 0.80.9? In general it should be fine since we're careful about backwards compatibility, but without knowing the actual issue I can't promise anything. -Greg - No virus found in this message. Checked by AVG - www.avg.com Version: 2015.0.5751 / Virus Database: 4306/9314 - Release Date: 03/16/15 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Migrating objects from one pool to another?
Hi, Lately I've been going back to work on one of my first ceph setup and now I see that I have created way too many placement groups for the pools on that setup (about 10 000 too many). I believe this may impact performances negatively, as the performances on this ceph cluster are abysmal. Since it is not possible to reduce the number of PGs in a pool, I was thinking of creating new pools with a smaller number of PGs, moving the data from the old pools to the new pools and then deleting the old pools. I haven't seen any command to copy objects from one pool to another. Would that be possible? I'm using ceph for block storage with openstack, so surely there must be a way to move block devices from a pool to another, right? -- == Jean-Philippe Méthot Administrateur système / System administrator GloboTech Communications Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750 jpmet...@gtcomm.net http://www.gtcomm.net ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
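A back-of-the-envelope sketch of sizing and creating a replacement pool, using the usual ~100 PGs per OSD rule of thumb; the pool names and numbers below are examples only:
# target PGs ≈ (number of OSDs * 100) / replica size, rounded to a power of two
# e.g. 30 OSDs at size 3  ->  roughly 1024 PGs
ceph osd pool create volumes-new 1024 1024
ceph osd pool set volumes-new size 3
# ...migrate the data and repoint the clients, then retire the old pool:
ceph osd pool delete volumes-old volumes-old --yes-i-really-really-mean-it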
Re: [ceph-users] Migrating objects from one pool to another?
I thought there was some discussion about this before. Something like creating a new pool and then taking your existing pool as an overlay of the new pool (cache) and then flush the overlay to the new pool. I haven't tried it or know if it is possible. The other option is shut the VM down, create a new snapshot on the new pool, point the VM to that and then flatten the RBD. Robert LeBlanc Sent from a mobile device please excuse any typos. On Mar 26, 2015 5:23 PM, Steffen W Sørensen ste...@me.com wrote: On 26/03/2015, at 23.13, Gregory Farnum g...@gregs42.com wrote: The procedure you've outlined won't copy snapshots, just the head objects. Preserving the proper snapshot metadata and inter-pool relationships on rbd images I think isn't actually possible when trying to change pools. This wasn’t ment for migrating a RBD pool, but pure object/Swift pools… Anyway seems Glance http://docs.openstack.org/developer/glance/architecture.html#basic-architecture supports multiple storages http://docs.openstack.org/developer/glance/configuring.html#configuring-multiple-swift-accounts-stores so assume one could use a glance client to also extract/download images into local file format (raw, qcow2 vmdk…) as well as uploading images to glance. And as glance images ain’t ‘live’ like virtual disk images one could also download glance images from one glance store over local file and upload back into a different glance back end store. Again this is properly better than dealing at a lower abstraction level and having to known its internal storage structures and avoid what you’re pointing put Greg. On Thu, Mar 26, 2015 at 3:05 PM, Steffen W Sørensen ste...@me.com wrote: On 26/03/2015, at 23.01, Gregory Farnum g...@gregs42.com wrote: On Thu, Mar 26, 2015 at 2:53 PM, Steffen W Sørensen ste...@me.com wrote: On 26/03/2015, at 21.07, J-P Methot jpmet...@gtcomm.net wrote: That's a great idea. I know I can setup cinder (the openstack volume manager) as a multi-backend manager and migrate from one backend to the other, each backend linking to different pools of the same ceph cluster. What bugs me though is that I'm pretty sure the image store, glance, wouldn't let me do that. Additionally, since the compute component also has its own ceph pool, I'm pretty sure it won't let me migrate the data through openstack. Hm wouldn’t it be possible to do something similar ala: # list object from src pool rados ls objects loop | filter-obj-id | while read obj; do # export $obj to local disk rados -p pool-wth-too-many-pgs get $obj # import $obj from local disk to new pool rados -p better-sized-pool put $obj done You would also have issues with snapshots if you do this on an RBD pool. That's unfortunately not feasible. What isn’t possible, export-import objects out-and-in of pools or snapshots issues? /Steffen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
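In command form, the cache-overlay idea Robert describes would look something like the sketch below. It is untested for this purpose (as he says), uses invented pool names, and the snapshot caveats discussed elsewhere in this thread would still need checking:
ceph osd tier add volumes-new volumes-old --force-nonempty   # the old, full pool becomes the cache
ceph osd tier cache-mode volumes-old forward                 # pass writes through, don't promote
ceph osd tier set-overlay volumes-new volumes-old
rados -p volumes-old cache-flush-evict-all                   # push everything down to the new pool
ceph osd tier remove-overlay volumes-new
ceph osd tier remove volumes-new volumes-old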
Re: [ceph-users] How to see the content of an EC Pool after recreate the SSD-Cache tier?
You shouldn't rely on rados ls when working with cache pools. It doesn't behave properly and is a silly operation to run against a pool of any size even when it does. :) More specifically, rados ls is invoking the pgls operation. Normal read/write ops will go query the backing store for objects if they're not in the cache tier. pgls is different — it just tells you what objects are present in the PG on that OSD right now. So any objects which aren't in cache won't show up when listing on the cache pool. -Greg On Thu, Mar 26, 2015 at 3:43 AM, Udo Lembke ulem...@polarzone.de wrote: Hi all, due an very silly approach, I removed the cache tier of an filled EC pool. After recreate the pool and connect with the EC pool I don't see any content. How can I see the rbd_data and other files through the new ssd cache tier? I think, that I must recreate the rbd_directory (and fill with setomapval), but I don't see anything yet! $ rados ls -p ecarchiv | more rbd_data.2e47de674b0dc51.00390074 rbd_data.2e47de674b0dc51.0020b64f rbd_data.2fbb1952ae8944a.0016184c rbd_data.2cfc7ce74b0dc51.00363527 rbd_data.2cfc7ce74b0dc51.0004c35f rbd_data.2fbb1952ae8944a.0008db43 rbd_data.2cfc7ce74b0dc51.0015895a rbd_data.31229f0238e1f29.000135eb ... $ rados ls -p ssd-archiv nothing generation of the cache tier: $ rados mkpool ssd-archiv $ ceph osd pool set ssd-archiv crush_ruleset 5 $ ceph osd tier add ecarchiv ssd-archiv $ ceph osd tier cache-mode ssd-archiv writeback $ ceph osd pool set ssd-archiv hit_set_type bloom $ ceph osd pool set ssd-archiv hit_set_count 1 $ ceph osd pool set ssd-archiv hit_set_period 3600 $ ceph osd pool set ssd-archiv target_max_bytes 500 rule ssd { ruleset 5 type replicated min_size 1 max_size 10 step take ssd step choose firstn 0 type osd step emit } Are there any magic (or which command I missed?) to see the excisting data throug the cache tier? regards - and hoping for answers Udo ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
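A quick way to see the difference Greg describes is to issue a normal read op instead of a listing: stat goes through the tier to the backing EC pool, while pgls does not. The object name here is taken from the listing above:
rados -p ssd-archiv stat rbd_data.2e47de674b0dc51.00390074   # asked through the cache tier
rados -p ecarchiv stat rbd_data.2e47de674b0dc51.00390074     # same object, asked of the EC pool directly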
Re: [ceph-users] Migrating objects from one pool to another?
On 26/03/2015, at 22.53, Steffen W Sørensen ste...@me.com wrote: On 26/03/2015, at 21.07, J-P Methot jpmet...@gtcomm.net mailto:jpmet...@gtcomm.net wrote: That's a great idea. I know I can setup cinder (the openstack volume manager) as a multi-backend manager and migrate from one backend to the other, each backend linking to different pools of the same ceph cluster. What bugs me though is that I'm pretty sure the image store, glance, wouldn't let me do that. Additionally, since the compute component also has its own ceph pool, I'm pretty sure it won't let me migrate the data through openstack. Hm wouldn’t it be possible to do something similar ala: # list object from src pool rados ls objects loop | filter-obj-id | while read obj; do # export $obj to local disk rados -p pool-wth-too-many-pgs get $obj # import $obj from local disk to new pool rados -p better-sized-pool put $obj done and of course when done redirect glance to new pool :) Not sure, but this might require you to quenching the object usage from openstack during migration, dunno, maybe ask openstack community if it’s possible to live migration of objects first :/ possible split/partition list of objects into multiple concurrent loops, possible from multiple boxes as seems fit for resources at hand, cpu, memory, network, ceph perf. /Steffen On 3/26/2015 3:54 PM, Steffen W Sørensen wrote: On 26/03/2015, at 20.38, J-P Methot jpmet...@gtcomm.net wrote: Lately I've been going back to work on one of my first ceph setup and now I see that I have created way too many placement groups for the pools on that setup (about 10 000 too many). I believe this may impact performances negatively, as the performances on this ceph cluster are abysmal. Since it is not possible to reduce the number of PGs in a pool, I was thinking of creating new pools with a smaller number of PGs, moving the data from the old pools to the new pools and then deleting the old pools. I haven't seen any command to copy objects from one pool to another. Would that be possible? I'm using ceph for block storage with openstack, so surely there must be a way to move block devices from a pool to another, right? What I did a one point was going one layer higher in my storage abstraction, and created new Ceph pools and used those for new storage resources/pool in my VM env. (ProxMox) on top of Ceph RBD and then did a live migration of virtual disks there, assume you could do the same in OpenStack. My 0.02$ /Steffen -- == Jean-Philippe Méthot Administrateur système / System administrator GloboTech Communications Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750 jpmet...@gtcomm.net http://www.gtcomm.net ___ ceph-users mailing list ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
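Written out as a runnable sketch, Steffen's loop becomes something like the following; the pool names are placeholders, and the caveats raised later in the thread still apply (head objects only, no snapshots, quiesce writers first):
#!/bin/sh
SRC=pool-with-too-many-pgs
DST=better-sized-pool
TMP=$(mktemp)
rados -p "$SRC" ls | while read -r obj; do
    rados -p "$SRC" get "$obj" "$TMP"   # export the object to local disk
    rados -p "$DST" put "$obj" "$TMP"   # import it into the new pool
done
rm -f "$TMP"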
Re: [ceph-users] Migrating objects from one pool to another?
On Thu, Mar 26, 2015 at 2:53 PM, Steffen W Sørensen ste...@me.com wrote: On 26/03/2015, at 21.07, J-P Methot jpmet...@gtcomm.net wrote: That's a great idea. I know I can setup cinder (the openstack volume manager) as a multi-backend manager and migrate from one backend to the other, each backend linking to different pools of the same ceph cluster. What bugs me though is that I'm pretty sure the image store, glance, wouldn't let me do that. Additionally, since the compute component also has its own ceph pool, I'm pretty sure it won't let me migrate the data through openstack. Hm wouldn’t it be possible to do something similar ala: # list object from src pool rados ls objects loop | filter-obj-id | while read obj; do # export $obj to local disk rados -p pool-wth-too-many-pgs get $obj # import $obj from local disk to new pool rados -p better-sized-pool put $obj done You would also have issues with snapshots if you do this on an RBD pool. That's unfortunately not feasible. -Greg possible split/partition list of objects into multiple concurrent loops, possible from multiple boxes as seems fit for resources at hand, cpu, memory, network, ceph perf. /Steffen On 3/26/2015 3:54 PM, Steffen W Sørensen wrote: On 26/03/2015, at 20.38, J-P Methot jpmet...@gtcomm.net wrote: Lately I've been going back to work on one of my first ceph setup and now I see that I have created way too many placement groups for the pools on that setup (about 10 000 too many). I believe this may impact performances negatively, as the performances on this ceph cluster are abysmal. Since it is not possible to reduce the number of PGs in a pool, I was thinking of creating new pools with a smaller number of PGs, moving the data from the old pools to the new pools and then deleting the old pools. I haven't seen any command to copy objects from one pool to another. Would that be possible? I'm using ceph for block storage with openstack, so surely there must be a way to move block devices from a pool to another, right? What I did a one point was going one layer higher in my storage abstraction, and created new Ceph pools and used those for new storage resources/pool in my VM env. (ProxMox) on top of Ceph RBD and then did a live migration of virtual disks there, assume you could do the same in OpenStack. My 0.02$ /Steffen -- == Jean-Philippe Méthot Administrateur système / System administrator GloboTech Communications Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750 jpmet...@gtcomm.net http://www.gtcomm.net ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Migrating objects from one pool to another?
On 26/03/2015, at 23.01, Gregory Farnum g...@gregs42.com wrote: On Thu, Mar 26, 2015 at 2:53 PM, Steffen W Sørensen ste...@me.com mailto:ste...@me.com wrote: On 26/03/2015, at 21.07, J-P Methot jpmet...@gtcomm.net wrote: That's a great idea. I know I can setup cinder (the openstack volume manager) as a multi-backend manager and migrate from one backend to the other, each backend linking to different pools of the same ceph cluster. What bugs me though is that I'm pretty sure the image store, glance, wouldn't let me do that. Additionally, since the compute component also has its own ceph pool, I'm pretty sure it won't let me migrate the data through openstack. Hm wouldn’t it be possible to do something similar ala: # list object from src pool rados ls objects loop | filter-obj-id | while read obj; do # export $obj to local disk rados -p pool-wth-too-many-pgs get $obj # import $obj from local disk to new pool rados -p better-sized-pool put $obj done You would also have issues with snapshots if you do this on an RBD pool. That's unfortunately not feasible. What isn’t possible, export-import objects out-and-in of pools or snapshots issues? /Steffen___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Migrating objects from one pool to another?
The procedure you've outlined won't copy snapshots, just the head objects. Preserving the proper snapshot metadata and inter-pool relationships on rbd images I think isn't actually possible when trying to change pools. On Thu, Mar 26, 2015 at 3:05 PM, Steffen W Sørensen ste...@me.com wrote: On 26/03/2015, at 23.01, Gregory Farnum g...@gregs42.com wrote: On Thu, Mar 26, 2015 at 2:53 PM, Steffen W Sørensen ste...@me.com wrote: On 26/03/2015, at 21.07, J-P Methot jpmet...@gtcomm.net wrote: That's a great idea. I know I can setup cinder (the openstack volume manager) as a multi-backend manager and migrate from one backend to the other, each backend linking to different pools of the same ceph cluster. What bugs me though is that I'm pretty sure the image store, glance, wouldn't let me do that. Additionally, since the compute component also has its own ceph pool, I'm pretty sure it won't let me migrate the data through openstack. Hm wouldn’t it be possible to do something similar ala: # list object from src pool rados ls objects loop | filter-obj-id | while read obj; do # export $obj to local disk rados -p pool-wth-too-many-pgs get $obj # import $obj from local disk to new pool rados -p better-sized-pool put $obj done You would also have issues with snapshots if you do this on an RBD pool. That's unfortunately not feasible. What isn’t possible, export-import objects out-and-in of pools or snapshots issues? /Steffen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Migrating objects from one pool to another?
On 26/03/2015, at 20.38, J-P Methot jpmet...@gtcomm.net wrote: Lately I've been going back to work on one of my first ceph setups and now I see that I have created way too many placement groups for the pools on that setup (about 10 000 too many). I believe this may impact performance negatively, as the performance on this ceph cluster is abysmal. Since it is not possible to reduce the number of PGs in a pool, I was thinking of creating new pools with a smaller number of PGs, moving the data from the old pools to the new pools and then deleting the old pools. I haven't seen any command to copy objects from one pool to another. Would that be possible? I'm using ceph for block storage with openstack, so surely there must be a way to move block devices from a pool to another, right? What I did at one point was going one layer higher in my storage abstraction, and created new Ceph pools and used those for new storage resources/pools in my VM env. (ProxMox) on top of Ceph RBD, and then did a live migration of virtual disks there; I assume you could do the same in OpenStack. My 0.02$ /Steffen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] All client writes block when 2 of 3 OSDs down
I added the osd pool default min size = 1 to test the behavior when 2 of 3 OSDs are down, but the behavior is exactly the same as without it: when the 2nd OSD is killed, all client writes start to block and these pipe.(stuff).fault messages begin: 2015-03-26 16:08:50.775848 7fce177fe700 0 monclient: hunting for new mon 2015-03-26 16:08:53.781133 7fce1c2f9700 0 -- 192.168.122.111:0/1011003 192.168.122.131:6789/0 pipe(0x7fce0c01d260 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fce0c01d4f0).fault 2015-03-26 16:09:00.009092 7fce1c3fa700 0 -- 192.168.122.111:0/1011003 192.168.122.141:6789/0 pipe(0x7fce1802dab0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fce1802dd40).fault 2015-03-26 16:09:12.013147 7fce1c2f9700 0 -- 192.168.122.111:0/1011003 192.168.122.131:6789/0 pipe(0x7fce1802e740 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fce1802e9d0).fault 2015-03-26 16:10:06.013113 7fce1c2f9700 0 -- 192.168.122.111:0/1011003 192.168.122.131:6789/0 pipe(0x7fce1802df80 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fce1801e600).fault 2015-03-26 16:10:36.013166 7fce1c3fa700 0 -- 192.168.122.111:0/1011003 192.168.122.141:6789/0 pipe(0x7fce1802ebc0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fce1802ee50).fault Here is my ceph.conf: [global] fsid = db460aa2-5129-4aaa-8b2e-43eac727124e mon_initial_members = ceph-node-1 mon_host = 192.168.122.121 auth_cluster_required = cephx auth_service_required = cephx auth_client_required = cephx filestore_xattr_use_omap = true osd pool default size = 3 osd pool default min size = 1 public network = 192.168.122.0/24 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] All client writes block when 2 of 3 OSDs down
Ah, thanks, got it. I wasn't thinking about the fact that running mons and OSDs on the same nodes isn't a likely real-world setup. You have to admit that pipe/fault log message is a bit cryptic. Thanks, Lee ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Migrating objects from one pool to another?
On 26/03/2015, at 21.07, J-P Methot jpmet...@gtcomm.net wrote: That's a great idea. I know I can setup cinder (the openstack volume manager) as a multi-backend manager and migrate from one backend to the other, each backend linking to different pools of the same ceph cluster. What bugs me though is that I'm pretty sure the image store, glance, wouldn't let me do that. Additionally, since the compute component also has its own ceph pool, I'm pretty sure it won't let me migrate the data through openstack. Hm wouldn’t it be possible to do something similar ala: # list object from src pool rados ls objects loop | filter-obj-id | while read obj; do # export $obj to local disk rados -p pool-wth-too-many-pgs get $obj # import $obj from local disk to new pool rados -p better-sized-pool put $obj done possible split/partition list of objects into multiple concurrent loops, possible from multiple boxes as seems fit for resources at hand, cpu, memory, network, ceph perf. /Steffen On 3/26/2015 3:54 PM, Steffen W Sørensen wrote: On 26/03/2015, at 20.38, J-P Methot jpmet...@gtcomm.net wrote: Lately I've been going back to work on one of my first ceph setup and now I see that I have created way too many placement groups for the pools on that setup (about 10 000 too many). I believe this may impact performances negatively, as the performances on this ceph cluster are abysmal. Since it is not possible to reduce the number of PGs in a pool, I was thinking of creating new pools with a smaller number of PGs, moving the data from the old pools to the new pools and then deleting the old pools. I haven't seen any command to copy objects from one pool to another. Would that be possible? I'm using ceph for block storage with openstack, so surely there must be a way to move block devices from a pool to another, right? What I did a one point was going one layer higher in my storage abstraction, and created new Ceph pools and used those for new storage resources/pool in my VM env. (ProxMox) on top of Ceph RBD and then did a live migration of virtual disks there, assume you could do the same in OpenStack. My 0.02$ /Steffen -- == Jean-Philippe Méthot Administrateur système / System administrator GloboTech Communications Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750 jpmet...@gtcomm.net http://www.gtcomm.net ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] All client writes block when 2 of 3 OSDs down
On Thu, Mar 26, 2015 at 4:40 PM, Gregory Farnum g...@gregs42.com wrote: Has the OSD actually been detected as down yet? I believe it has, however I can't directly check because ceph health starts to hang when I down the second node. You'll also need to set that min size on your existing pools (ceph osd pool pool set min_size 1 or similar) to change their behavior; the config option only takes effect for newly-created pools. (Thus the default.) I've done this, however the behavior is the same: $ for f in `ceph osd lspools | sed 's/[0-9]//g' | sed 's/,//g'`; do ceph osd pool set $f min_size 1; done set pool 0 min_size to 1 set pool 1 min_size to 1 set pool 2 min_size to 1 set pool 3 min_size to 1 set pool 4 min_size to 1 set pool 5 min_size to 1 set pool 6 min_size to 1 set pool 7 min_size to 1 $ ceph -w cluster db460aa2-5129-4aaa-8b2e-43eac727124e health HEALTH_WARN 1 mons down, quorum 0,1 ceph-node-1,ceph-node-2 monmap e3: 3 mons at {ceph-node-1= 192.168.122.121:6789/0,ceph-node-2=192.168.122.131:6789/0,ceph-node-3=192.168.122.141:6789/0}, election epoch 194, quorum 0,1 ceph-node-1,ceph-node-2 mdsmap e94: 1/1/1 up {0=ceph-node-1=up:active} osdmap e362: 3 osds: 2 up, 2 in pgmap v5913: 840 pgs, 8 pools, 7441 MB data, 994 objects 25329 MB used, 12649 MB / 40059 MB avail 840 active+clean 2015-03-26 17:23:56.009938 mon.0 [INF] pgmap v5913: 840 pgs: 840 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail 2015-03-26 17:25:51.042802 mon.0 [INF] pgmap v5914: 840 pgs: 840 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail; 0 B/s rd, 260 kB/s wr, 13 op/s 2015-03-26 17:25:56.046491 mon.0 [INF] pgmap v5915: 840 pgs: 840 active+clean; 7441 MB data, 25333 MB used, 12645 MB / 40059 MB avail; 0 B/s rd, 943 kB/s wr, 38 op/s 2015-03-26 17:26:01.058167 mon.0 [INF] pgmap v5916: 840 pgs: 840 active+clean; 7441 MB data, 25335 MB used, 12643 MB / 40059 MB avail; 0 B/s rd, 10699 kB/s wr, 621 op/s this is where i kill the second OSD 2015-03-26 17:26:26.778461 7f4ebeffd700 0 monclient: hunting for new mon 2015-03-26 17:26:30.701099 7f4ec45f5700 0 -- 192.168.122.111:0/1007741 192.168.122.141:6789/0 pipe(0x7f4ec0023200 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f4ec0023490).fault 2015-03-26 17:26:42.701154 7f4ec44f4700 0 -- 192.168.122.111:0/1007741 192.168.122.131:6789/0 pipe(0x7f4ec00251b0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f4ec0025440).fault And all writes block until I bring back an OSD. Lee ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] All client writes block when 2 of 3 OSDs down
On Thu, Mar 26, 2015 at 2:30 PM, Lee Revell rlrev...@gmail.com wrote: On Thu, Mar 26, 2015 at 4:40 PM, Gregory Farnum g...@gregs42.com wrote: Has the OSD actually been detected as down yet? I believe it has, however I can't directly check because ceph health starts to hang when I down the second node. Oh. You need to keep a quorum of your monitors running (just the monitor processes, not of everything in the system) or nothing at all is going to work. That's how we prevent split brain issues. You'll also need to set that min size on your existing pools (ceph osd pool pool set min_size 1 or similar) to change their behavior; the config option only takes effect for newly-created pools. (Thus the default.) I've done this, however the behavior is the same: $ for f in `ceph osd lspools | sed 's/[0-9]//g' | sed 's/,//g'`; do ceph osd pool set $f min_size 1; done set pool 0 min_size to 1 set pool 1 min_size to 1 set pool 2 min_size to 1 set pool 3 min_size to 1 set pool 4 min_size to 1 set pool 5 min_size to 1 set pool 6 min_size to 1 set pool 7 min_size to 1 $ ceph -w cluster db460aa2-5129-4aaa-8b2e-43eac727124e health HEALTH_WARN 1 mons down, quorum 0,1 ceph-node-1,ceph-node-2 monmap e3: 3 mons at {ceph-node-1=192.168.122.121:6789/0,ceph-node-2=192.168.122.131:6789/0,ceph-node-3=192.168.122.141:6789/0}, election epoch 194, quorum 0,1 ceph-node-1,ceph-node-2 mdsmap e94: 1/1/1 up {0=ceph-node-1=up:active} osdmap e362: 3 osds: 2 up, 2 in pgmap v5913: 840 pgs, 8 pools, 7441 MB data, 994 objects 25329 MB used, 12649 MB / 40059 MB avail 840 active+clean 2015-03-26 17:23:56.009938 mon.0 [INF] pgmap v5913: 840 pgs: 840 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail 2015-03-26 17:25:51.042802 mon.0 [INF] pgmap v5914: 840 pgs: 840 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail; 0 B/s rd, 260 kB/s wr, 13 op/s 2015-03-26 17:25:56.046491 mon.0 [INF] pgmap v5915: 840 pgs: 840 active+clean; 7441 MB data, 25333 MB used, 12645 MB / 40059 MB avail; 0 B/s rd, 943 kB/s wr, 38 op/s 2015-03-26 17:26:01.058167 mon.0 [INF] pgmap v5916: 840 pgs: 840 active+clean; 7441 MB data, 25335 MB used, 12643 MB / 40059 MB avail; 0 B/s rd, 10699 kB/s wr, 621 op/s this is where i kill the second OSD 2015-03-26 17:26:26.778461 7f4ebeffd700 0 monclient: hunting for new mon 2015-03-26 17:26:30.701099 7f4ec45f5700 0 -- 192.168.122.111:0/1007741 192.168.122.141:6789/0 pipe(0x7f4ec0023200 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f4ec0023490).fault 2015-03-26 17:26:42.701154 7f4ec44f4700 0 -- 192.168.122.111:0/1007741 192.168.122.131:6789/0 pipe(0x7f4ec00251b0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f4ec0025440).fault And all writes block until I bring back an OSD. Lee ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
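Two things make this easier to see when experimenting (the monitor name below is taken from the ceph.conf earlier in the thread): the surviving monitor can still be queried over its local admin socket even while ceph health hangs, and the quorum arithmetic is simply a strict majority.
ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-node-1.asok mon_status
# with 3 monitors a quorum needs 2 of them, so any test that downs two of the
# three hosts also has to leave two mon processes running somewhere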
Re: [ceph-users] How to see the content of an EC Pool after recreate the SSD-Cache tier?
On 03/26/2015 10:46 AM, Gregory Farnum wrote: I don't know why you're mucking about manually with the rbd directory; the rbd tool and rados handle cache pools correctly as far as I know. That's true, but the rados tool should be able to manipulate binary data more easily. It should probably be able to read from a file or stdin for this. Josh On Thu, Mar 26, 2015 at 8:56 AM, Udo Lembke ulem...@polarzone.de wrote: Hi Greg, ok! It's looks like, that my problem is more setomapval-related... I must o something like rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2 \0x0f\0x00\0x00\0x002cfc7ce74b0dc51 but rados setomapval don't use the hexvalues - instead of this I got rados -p ssd-archiv listomapvals rbd_directory name_vm-409-disk-2 value: (35 bytes) : : 5c 30 78 30 66 5c 30 78 30 30 5c 30 78 30 30 5c : \0x0f\0x00\0x00\ 0010 : 30 78 30 30 32 63 66 63 37 63 65 37 34 62 30 64 : 0x002cfc7ce74b0d 0020 : 63 35 31: c51 hmm, strange. With rados -p ssd-archiv getomapval rbd_directory name_vm-409-disk-2 name_vm-409-disk-2 I got the binary inside the file name_vm-409-disk-2, but reverse do an rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2 name_vm-409-disk-2 fill the variable with name_vm-409-disk-2 and not with the content of the file... Are there other tools for the rbd_directory? regards Udo Am 26.03.2015 15:03, schrieb Gregory Farnum: You shouldn't rely on rados ls when working with cache pools. It doesn't behave properly and is a silly operation to run against a pool of any size even when it does. :) More specifically, rados ls is invoking the pgls operation. Normal read/write ops will go query the backing store for objects if they're not in the cache tier. pgls is different — it just tells you what objects are present in the PG on that OSD right now. So any objects which aren't in cache won't show up when listing on the cache pool. -Greg On Thu, Mar 26, 2015 at 3:43 AM, Udo Lembke ulem...@polarzone.de wrote: Hi all, due an very silly approach, I removed the cache tier of an filled EC pool. After recreate the pool and connect with the EC pool I don't see any content. How can I see the rbd_data and other files through the new ssd cache tier? I think, that I must recreate the rbd_directory (and fill with setomapval), but I don't see anything yet! $ rados ls -p ecarchiv | more rbd_data.2e47de674b0dc51.00390074 rbd_data.2e47de674b0dc51.0020b64f rbd_data.2fbb1952ae8944a.0016184c rbd_data.2cfc7ce74b0dc51.00363527 rbd_data.2cfc7ce74b0dc51.0004c35f rbd_data.2fbb1952ae8944a.0008db43 rbd_data.2cfc7ce74b0dc51.0015895a rbd_data.31229f0238e1f29.000135eb ... $ rados ls -p ssd-archiv nothing generation of the cache tier: $ rados mkpool ssd-archiv $ ceph osd pool set ssd-archiv crush_ruleset 5 $ ceph osd tier add ecarchiv ssd-archiv $ ceph osd tier cache-mode ssd-archiv writeback $ ceph osd pool set ssd-archiv hit_set_type bloom $ ceph osd pool set ssd-archiv hit_set_count 1 $ ceph osd pool set ssd-archiv hit_set_period 3600 $ ceph osd pool set ssd-archiv target_max_bytes 500 rule ssd { ruleset 5 type replicated min_size 1 max_size 10 step take ssd step choose firstn 0 type osd step emit } Are there any magic (or which command I missed?) to see the excisting data throug the cache tier? 
regards - and hoping for answers Udo ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] All client writes block when 2 of 3 OSDs down
On 26/03/2015, at 23.36, Somnath Roy somnath@sandisk.com wrote: Got most portion of it, thanks ! But, still not able to get when second node is down why with single monitor in the cluster client is not able to connect ? 1 monitor can form a quorum and should be sufficient for a cluster to run. To have quorum you need more than 50% of monitors, which isn’t possible with one out of two, since 1 (0.5*2 + 1) hence at least 3 monitors. Thanks Regards Somnath -Original Message- From: Gregory Farnum [mailto:g...@gregs42.com] Sent: Thursday, March 26, 2015 3:29 PM To: Somnath Roy Cc: Lee Revell; ceph-users@lists.ceph.com Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down On Thu, Mar 26, 2015 at 3:22 PM, Somnath Roy somnath@sandisk.com wrote: Greg, Couple of dumb question may be. 1. If you see , the clients are connecting fine with two monitors in the cluster. 2 monitors can never form a quorum, but, 1 can, so, why with 1 monitor (which is I guess happening after making 2 nodes down) it is not able to connect ? A quorum is a strict majority of the total membership. 2 monitors can form a quorum just fine if there are either 2 or 3 total membership. (As long as those two agree on every action, it cannot be lost.) We don't *recommend* configuring systems with an even number of monitors, because it increases the number of total possible failures without increasing the number of failures that can be tolerated. (3 monitors requires 2 in quorum, 4 does too. Same for 5 and 6, 7 and 8, etc etc.) 2. Also, my understanding is while IO is going on *no* monitor interaction will be on that path, so, why the client io will be stopped because the monitor quorum is not there ? If the min_size =1 is properly set it should able to serve IO as long as 1 OSD (node) is up, isn't it ? Well, the remaining OSD won't be able to process IO because it's lost its peers, and it can't reach any monitors to do updates or get new maps. (Monitors which are not in quorum will not allow clients to connect.) The clients will eventually stop serving IO if they know they can't reach a monitor, although I don't remember exactly how that's triggered. In this particular case, though, the client probably just tried to do an op against the dead osd, realized it couldn't, and tried to fetch a map from the monitors. When that failed it went into search mode, which is what the logs are showing you. -Greg Thanks Regards Somnath -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Gregory Farnum Sent: Thursday, March 26, 2015 2:40 PM To: Lee Revell Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down On Thu, Mar 26, 2015 at 2:30 PM, Lee Revell rlrev...@gmail.com wrote: On Thu, Mar 26, 2015 at 4:40 PM, Gregory Farnum g...@gregs42.com wrote: Has the OSD actually been detected as down yet? I believe it has, however I can't directly check because ceph health starts to hang when I down the second node. Oh. You need to keep a quorum of your monitors running (just the monitor processes, not of everything in the system) or nothing at all is going to work. That's how we prevent split brain issues. You'll also need to set that min size on your existing pools (ceph osd pool pool set min_size 1 or similar) to change their behavior; the config option only takes effect for newly-created pools. (Thus the default.) 
I've done this, however the behavior is the same: $ for f in `ceph osd lspools | sed 's/[0-9]//g' | sed 's/,//g'`; do ceph osd pool set $f min_size 1; done set pool 0 min_size to 1 set pool 1 min_size to 1 set pool 2 min_size to 1 set pool 3 min_size to 1 set pool 4 min_size to 1 set pool 5 min_size to 1 set pool 6 min_size to 1 set pool 7 min_size to 1 $ ceph -w cluster db460aa2-5129-4aaa-8b2e-43eac727124e health HEALTH_WARN 1 mons down, quorum 0,1 ceph-node-1,ceph-node-2 monmap e3: 3 mons at {ceph-node-1=192.168.122.121:6789/0,ceph-node-2=192.168.122.131:6789/ 0 ,ceph-node-3=192.168.122.141:6789/0}, election epoch 194, quorum 0,1 ceph-node-1,ceph-node-2 mdsmap e94: 1/1/1 up {0=ceph-node-1=up:active} osdmap e362: 3 osds: 2 up, 2 in pgmap v5913: 840 pgs, 8 pools, 7441 MB data, 994 objects 25329 MB used, 12649 MB / 40059 MB avail 840 active+clean 2015-03-26 17:23:56.009938 mon.0 [INF] pgmap v5913: 840 pgs: 840 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail 2015-03-26 17:25:51.042802 mon.0 [INF] pgmap v5914: 840 pgs: 840 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail; active+0 B/s rd, 260 kB/s wr, 13 op/s 2015-03-26 17:25:56.046491 mon.0 [INF] pgmap v5915: 840 pgs: 840 active+clean; 7441 MB data, 25333 MB used, 12645 MB / 40059 MB avail; active+0 B/s rd, 943 kB/s
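Since cluster-wide commands such as ceph health hang once quorum is lost, the surviving monitor can still be asked what it thinks over its local admin socket. A minimal sketch, assuming the default socket path and the monitor names used in this thread (adjust to your hosts):

$ # works even without quorum, because it only queries the local daemon
$ sudo ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-node-1.asok mon_status
$ # with quorum intact, the same information is available cluster-wide
$ ceph quorum_status --format json-pretty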
Re: [ceph-users] All client writes block when 2 of 3 OSDs down
On Thu, Mar 26, 2015 at 3:22 PM, Somnath Roy somnath@sandisk.com wrote: Greg, Couple of dumb question may be. 1. If you see , the clients are connecting fine with two monitors in the cluster. 2 monitors can never form a quorum, but, 1 can, so, why with 1 monitor (which is I guess happening after making 2 nodes down) it is not able to connect ? A quorum is a strict majority of the total membership. 2 monitors can form a quorum just fine if there are either 2 or 3 total membership. (As long as those two agree on every action, it cannot be lost.) We don't *recommend* configuring systems with an even number of monitors, because it increases the number of total possible failures without increasing the number of failures that can be tolerated. (3 monitors requires 2 in quorum, 4 does too. Same for 5 and 6, 7 and 8, etc etc.) 2. Also, my understanding is while IO is going on *no* monitor interaction will be on that path, so, why the client io will be stopped because the monitor quorum is not there ? If the min_size =1 is properly set it should able to serve IO as long as 1 OSD (node) is up, isn't it ? Well, the remaining OSD won't be able to process IO because it's lost its peers, and it can't reach any monitors to do updates or get new maps. (Monitors which are not in quorum will not allow clients to connect.) The clients will eventually stop serving IO if they know they can't reach a monitor, although I don't remember exactly how that's triggered. In this particular case, though, the client probably just tried to do an op against the dead osd, realized it couldn't, and tried to fetch a map from the monitors. When that failed it went into search mode, which is what the logs are showing you. -Greg Thanks Regards Somnath -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Gregory Farnum Sent: Thursday, March 26, 2015 2:40 PM To: Lee Revell Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down On Thu, Mar 26, 2015 at 2:30 PM, Lee Revell rlrev...@gmail.com wrote: On Thu, Mar 26, 2015 at 4:40 PM, Gregory Farnum g...@gregs42.com wrote: Has the OSD actually been detected as down yet? I believe it has, however I can't directly check because ceph health starts to hang when I down the second node. Oh. You need to keep a quorum of your monitors running (just the monitor processes, not of everything in the system) or nothing at all is going to work. That's how we prevent split brain issues. You'll also need to set that min size on your existing pools (ceph osd pool pool set min_size 1 or similar) to change their behavior; the config option only takes effect for newly-created pools. (Thus the default.) 
I've done this, however the behavior is the same: $ for f in `ceph osd lspools | sed 's/[0-9]//g' | sed 's/,//g'`; do ceph osd pool set $f min_size 1; done set pool 0 min_size to 1 set pool 1 min_size to 1 set pool 2 min_size to 1 set pool 3 min_size to 1 set pool 4 min_size to 1 set pool 5 min_size to 1 set pool 6 min_size to 1 set pool 7 min_size to 1 $ ceph -w cluster db460aa2-5129-4aaa-8b2e-43eac727124e health HEALTH_WARN 1 mons down, quorum 0,1 ceph-node-1,ceph-node-2 monmap e3: 3 mons at {ceph-node-1=192.168.122.121:6789/0,ceph-node-2=192.168.122.131:6789/0 ,ceph-node-3=192.168.122.141:6789/0}, election epoch 194, quorum 0,1 ceph-node-1,ceph-node-2 mdsmap e94: 1/1/1 up {0=ceph-node-1=up:active} osdmap e362: 3 osds: 2 up, 2 in pgmap v5913: 840 pgs, 8 pools, 7441 MB data, 994 objects 25329 MB used, 12649 MB / 40059 MB avail 840 active+clean 2015-03-26 17:23:56.009938 mon.0 [INF] pgmap v5913: 840 pgs: 840 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail 2015-03-26 17:25:51.042802 mon.0 [INF] pgmap v5914: 840 pgs: 840 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail; active+0 B/s rd, 260 kB/s wr, 13 op/s 2015-03-26 17:25:56.046491 mon.0 [INF] pgmap v5915: 840 pgs: 840 active+clean; 7441 MB data, 25333 MB used, 12645 MB / 40059 MB avail; active+0 B/s rd, 943 kB/s wr, 38 op/s 2015-03-26 17:26:01.058167 mon.0 [INF] pgmap v5916: 840 pgs: 840 active+clean; 7441 MB data, 25335 MB used, 12643 MB / 40059 MB avail; active+0 B/s rd, 10699 kB/s wr, 621 op/s this is where i kill the second OSD 2015-03-26 17:26:26.778461 7f4ebeffd700 0 monclient: hunting for new mon 2015-03-26 17:26:30.701099 7f4ec45f5700 0 -- 192.168.122.111:0/1007741 192.168.122.141:6789/0 pipe(0x7f4ec0023200 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f4ec0023490).fault 2015-03-26 17:26:42.701154 7f4ec44f4700 0 -- 192.168.122.111:0/1007741 192.168.122.131:6789/0 pipe(0x7f4ec00251b0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f4ec0025440).fault And all writes block until I bring back an OSD. Lee ___ ceph-users mailing list
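As an alternative to the sed pipeline above, the pool names can be taken straight from rados lspools; the pool list and the min_size value are of course specific to your cluster:

$ for pool in $(rados lspools); do ceph osd pool set "$pool" min_size 1; done
$ ceph osd dump | grep '^pool'    # verify that min_size actually changed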
Re: [ceph-users] All client writes block when 2 of 3 OSDs down
On Thu, Mar 26, 2015 at 3:36 PM, Somnath Roy somnath@sandisk.com wrote: Got most portion of it, thanks ! But, still not able to get when second node is down why with single monitor in the cluster client is not able to connect ? 1 monitor can form a quorum and should be sufficient for a cluster to run. The whole point of the monitor cluster is to ensure a globally consistent view of the cluster state that will never be reversed by a different group of up nodes. If one monitor (out of three) could make changes to the maps by itself, then there's nothing to prevent all three monitors from staying up but getting a net split, and then each issuing different versions of the osdmaps to whichever clients or OSDs happen to be connected to them. If you want to get down into the math proofs and things then the Paxos papers do all the proofs. Or you can look at the CAP theorem about the tradeoff between consistency and availability. The monitors are a Paxos cluster and Ceph is a 100% consistent system. -Greg Thanks Regards Somnath -Original Message- From: Gregory Farnum [mailto:g...@gregs42.com] Sent: Thursday, March 26, 2015 3:29 PM To: Somnath Roy Cc: Lee Revell; ceph-users@lists.ceph.com Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down On Thu, Mar 26, 2015 at 3:22 PM, Somnath Roy somnath@sandisk.com wrote: Greg, Couple of dumb question may be. 1. If you see , the clients are connecting fine with two monitors in the cluster. 2 monitors can never form a quorum, but, 1 can, so, why with 1 monitor (which is I guess happening after making 2 nodes down) it is not able to connect ? A quorum is a strict majority of the total membership. 2 monitors can form a quorum just fine if there are either 2 or 3 total membership. (As long as those two agree on every action, it cannot be lost.) We don't *recommend* configuring systems with an even number of monitors, because it increases the number of total possible failures without increasing the number of failures that can be tolerated. (3 monitors requires 2 in quorum, 4 does too. Same for 5 and 6, 7 and 8, etc etc.) 2. Also, my understanding is while IO is going on *no* monitor interaction will be on that path, so, why the client io will be stopped because the monitor quorum is not there ? If the min_size =1 is properly set it should able to serve IO as long as 1 OSD (node) is up, isn't it ? Well, the remaining OSD won't be able to process IO because it's lost its peers, and it can't reach any monitors to do updates or get new maps. (Monitors which are not in quorum will not allow clients to connect.) The clients will eventually stop serving IO if they know they can't reach a monitor, although I don't remember exactly how that's triggered. In this particular case, though, the client probably just tried to do an op against the dead osd, realized it couldn't, and tried to fetch a map from the monitors. When that failed it went into search mode, which is what the logs are showing you. -Greg Thanks Regards Somnath -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Gregory Farnum Sent: Thursday, March 26, 2015 2:40 PM To: Lee Revell Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down On Thu, Mar 26, 2015 at 2:30 PM, Lee Revell rlrev...@gmail.com wrote: On Thu, Mar 26, 2015 at 4:40 PM, Gregory Farnum g...@gregs42.com wrote: Has the OSD actually been detected as down yet? 
I believe it has, however I can't directly check because ceph health starts to hang when I down the second node. Oh. You need to keep a quorum of your monitors running (just the monitor processes, not of everything in the system) or nothing at all is going to work. That's how we prevent split brain issues. You'll also need to set that min size on your existing pools (ceph osd pool pool set min_size 1 or similar) to change their behavior; the config option only takes effect for newly-created pools. (Thus the default.) I've done this, however the behavior is the same: $ for f in `ceph osd lspools | sed 's/[0-9]//g' | sed 's/,//g'`; do ceph osd pool set $f min_size 1; done set pool 0 min_size to 1 set pool 1 min_size to 1 set pool 2 min_size to 1 set pool 3 min_size to 1 set pool 4 min_size to 1 set pool 5 min_size to 1 set pool 6 min_size to 1 set pool 7 min_size to 1 $ ceph -w cluster db460aa2-5129-4aaa-8b2e-43eac727124e health HEALTH_WARN 1 mons down, quorum 0,1 ceph-node-1,ceph-node-2 monmap e3: 3 mons at {ceph-node-1=192.168.122.121:6789/0,ceph-node-2=192.168.122.131:6789/ 0 ,ceph-node-3=192.168.122.141:6789/0}, election epoch 194, quorum 0,1 ceph-node-1,ceph-node-2 mdsmap e94: 1/1/1 up {0=ceph-node-1=up:active} osdmap e362: 3 osds: 2 up, 2 in pgmap v5913: 840 pgs, 8 pools, 7441 MB data, 994 objects 25329 MB
Re: [ceph-users] All client writes block when 2 of 3 OSDs down
Got most portion of it, thanks ! But, still not able to get when second node is down why with single monitor in the cluster client is not able to connect ? 1 monitor can form a quorum and should be sufficient for a cluster to run. Thanks Regards Somnath -Original Message- From: Gregory Farnum [mailto:g...@gregs42.com] Sent: Thursday, March 26, 2015 3:29 PM To: Somnath Roy Cc: Lee Revell; ceph-users@lists.ceph.com Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down On Thu, Mar 26, 2015 at 3:22 PM, Somnath Roy somnath@sandisk.com wrote: Greg, Couple of dumb question may be. 1. If you see , the clients are connecting fine with two monitors in the cluster. 2 monitors can never form a quorum, but, 1 can, so, why with 1 monitor (which is I guess happening after making 2 nodes down) it is not able to connect ? A quorum is a strict majority of the total membership. 2 monitors can form a quorum just fine if there are either 2 or 3 total membership. (As long as those two agree on every action, it cannot be lost.) We don't *recommend* configuring systems with an even number of monitors, because it increases the number of total possible failures without increasing the number of failures that can be tolerated. (3 monitors requires 2 in quorum, 4 does too. Same for 5 and 6, 7 and 8, etc etc.) 2. Also, my understanding is while IO is going on *no* monitor interaction will be on that path, so, why the client io will be stopped because the monitor quorum is not there ? If the min_size =1 is properly set it should able to serve IO as long as 1 OSD (node) is up, isn't it ? Well, the remaining OSD won't be able to process IO because it's lost its peers, and it can't reach any monitors to do updates or get new maps. (Monitors which are not in quorum will not allow clients to connect.) The clients will eventually stop serving IO if they know they can't reach a monitor, although I don't remember exactly how that's triggered. In this particular case, though, the client probably just tried to do an op against the dead osd, realized it couldn't, and tried to fetch a map from the monitors. When that failed it went into search mode, which is what the logs are showing you. -Greg Thanks Regards Somnath -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Gregory Farnum Sent: Thursday, March 26, 2015 2:40 PM To: Lee Revell Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down On Thu, Mar 26, 2015 at 2:30 PM, Lee Revell rlrev...@gmail.com wrote: On Thu, Mar 26, 2015 at 4:40 PM, Gregory Farnum g...@gregs42.com wrote: Has the OSD actually been detected as down yet? I believe it has, however I can't directly check because ceph health starts to hang when I down the second node. Oh. You need to keep a quorum of your monitors running (just the monitor processes, not of everything in the system) or nothing at all is going to work. That's how we prevent split brain issues. You'll also need to set that min size on your existing pools (ceph osd pool pool set min_size 1 or similar) to change their behavior; the config option only takes effect for newly-created pools. (Thus the default.) 
I've done this, however the behavior is the same: $ for f in `ceph osd lspools | sed 's/[0-9]//g' | sed 's/,//g'`; do ceph osd pool set $f min_size 1; done set pool 0 min_size to 1 set pool 1 min_size to 1 set pool 2 min_size to 1 set pool 3 min_size to 1 set pool 4 min_size to 1 set pool 5 min_size to 1 set pool 6 min_size to 1 set pool 7 min_size to 1 $ ceph -w cluster db460aa2-5129-4aaa-8b2e-43eac727124e health HEALTH_WARN 1 mons down, quorum 0,1 ceph-node-1,ceph-node-2 monmap e3: 3 mons at {ceph-node-1=192.168.122.121:6789/0,ceph-node-2=192.168.122.131:6789/ 0 ,ceph-node-3=192.168.122.141:6789/0}, election epoch 194, quorum 0,1 ceph-node-1,ceph-node-2 mdsmap e94: 1/1/1 up {0=ceph-node-1=up:active} osdmap e362: 3 osds: 2 up, 2 in pgmap v5913: 840 pgs, 8 pools, 7441 MB data, 994 objects 25329 MB used, 12649 MB / 40059 MB avail 840 active+clean 2015-03-26 17:23:56.009938 mon.0 [INF] pgmap v5913: 840 pgs: 840 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail 2015-03-26 17:25:51.042802 mon.0 [INF] pgmap v5914: 840 pgs: 840 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail; active+0 B/s rd, 260 kB/s wr, 13 op/s 2015-03-26 17:25:56.046491 mon.0 [INF] pgmap v5915: 840 pgs: 840 active+clean; 7441 MB data, 25333 MB used, 12645 MB / 40059 MB avail; active+0 B/s rd, 943 kB/s wr, 38 op/s 2015-03-26 17:26:01.058167 mon.0 [INF] pgmap v5916: 840 pgs: 840 active+clean; 7441 MB data, 25335 MB used, 12643 MB / 40059 MB avail; active+0 B/s rd, 10699 kB/s wr, 621 op/s this is where i kill the second OSD 2015-03-26 17:26:26.778461
Re: [ceph-users] All client writes block when 2 of 3 OSDs down
Greg, I think you got me wrong. I am not saying each monitor of a group of 3 should be able to change the map. Here is the scenario. 1. Cluster up and running with 3 mons (quorum of 3), all fine. 2. One node (and mon) is down, quorum of 2 , still connecting. 3. 2 nodes (and 2 mons) are down, should be quorum of 1 now and client should still be able to connect. Isn't it ? Cluster with single monitor is able to form a quorum and should be working fine. So, why not in case of point 3 ? If this is the way Paxos works, should we say that in a cluster with say 3 monitors it should be able to tolerate only one mon failure ? Let me know if I am missing a point here. Thanks Regards Somnath -Original Message- From: Gregory Farnum [mailto:g...@gregs42.com] Sent: Thursday, March 26, 2015 3:41 PM To: Somnath Roy Cc: Lee Revell; ceph-users@lists.ceph.com Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down On Thu, Mar 26, 2015 at 3:36 PM, Somnath Roy somnath@sandisk.com wrote: Got most portion of it, thanks ! But, still not able to get when second node is down why with single monitor in the cluster client is not able to connect ? 1 monitor can form a quorum and should be sufficient for a cluster to run. The whole point of the monitor cluster is to ensure a globally consistent view of the cluster state that will never be reversed by a different group of up nodes. If one monitor (out of three) could make changes to the maps by itself, then there's nothing to prevent all three monitors from staying up but getting a net split, and then each issuing different versions of the osdmaps to whichever clients or OSDs happen to be connected to them. If you want to get down into the math proofs and things then the Paxos papers do all the proofs. Or you can look at the CAP theorem about the tradeoff between consistency and availability. The monitors are a Paxos cluster and Ceph is a 100% consistent system. -Greg Thanks Regards Somnath -Original Message- From: Gregory Farnum [mailto:g...@gregs42.com] Sent: Thursday, March 26, 2015 3:29 PM To: Somnath Roy Cc: Lee Revell; ceph-users@lists.ceph.com Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down On Thu, Mar 26, 2015 at 3:22 PM, Somnath Roy somnath@sandisk.com wrote: Greg, Couple of dumb question may be. 1. If you see , the clients are connecting fine with two monitors in the cluster. 2 monitors can never form a quorum, but, 1 can, so, why with 1 monitor (which is I guess happening after making 2 nodes down) it is not able to connect ? A quorum is a strict majority of the total membership. 2 monitors can form a quorum just fine if there are either 2 or 3 total membership. (As long as those two agree on every action, it cannot be lost.) We don't *recommend* configuring systems with an even number of monitors, because it increases the number of total possible failures without increasing the number of failures that can be tolerated. (3 monitors requires 2 in quorum, 4 does too. Same for 5 and 6, 7 and 8, etc etc.) 2. Also, my understanding is while IO is going on *no* monitor interaction will be on that path, so, why the client io will be stopped because the monitor quorum is not there ? If the min_size =1 is properly set it should able to serve IO as long as 1 OSD (node) is up, isn't it ? Well, the remaining OSD won't be able to process IO because it's lost its peers, and it can't reach any monitors to do updates or get new maps. (Monitors which are not in quorum will not allow clients to connect.) 
The clients will eventually stop serving IO if they know they can't reach a monitor, although I don't remember exactly how that's triggered. In this particular case, though, the client probably just tried to do an op against the dead osd, realized it couldn't, and tried to fetch a map from the monitors. When that failed it went into search mode, which is what the logs are showing you. -Greg Thanks Regards Somnath -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Gregory Farnum Sent: Thursday, March 26, 2015 2:40 PM To: Lee Revell Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down On Thu, Mar 26, 2015 at 2:30 PM, Lee Revell rlrev...@gmail.com wrote: On Thu, Mar 26, 2015 at 4:40 PM, Gregory Farnum g...@gregs42.com wrote: Has the OSD actually been detected as down yet? I believe it has, however I can't directly check because ceph health starts to hang when I down the second node. Oh. You need to keep a quorum of your monitors running (just the monitor processes, not of everything in the system) or nothing at all is going to work. That's how we prevent split brain issues. You'll also need to set that min size on your existing pools (ceph osd pool pool set min_size 1 or similar) to change their
Re: [ceph-users] All client writes block when 2 of 3 OSDs down
On Thu, Mar 26, 2015 at 3:54 PM, Somnath Roy somnath@sandisk.com wrote: Greg, I think you got me wrong. I am not saying each monitor of a group of 3 should be able to change the map. Here is the scenario. 1. Cluster up and running with 3 mons (quorum of 3), all fine. 2. One node (and mon) is down, quorum of 2 , still connecting. 3. 2 nodes (and 2 mons) are down, should be quorum of 1 now and client should still be able to connect. Isn't it ? No. The monitors can't tell the difference between dead monitors, and monitors they can't reach over the network. So they say there are three monitors in my map; therefore it requires two to make any change. That's the case regardless of whether all of them are running, or only one. Cluster with single monitor is able to form a quorum and should be working fine. So, why not in case of point 3 ? If this is the way Paxos works, should we say that in a cluster with say 3 monitors it should be able to tolerate only one mon failure ? Yes, that is the case. Let me know if I am missing a point here. Thanks Regards Somnath -Original Message- From: Gregory Farnum [mailto:g...@gregs42.com] Sent: Thursday, March 26, 2015 3:41 PM To: Somnath Roy Cc: Lee Revell; ceph-users@lists.ceph.com Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down On Thu, Mar 26, 2015 at 3:36 PM, Somnath Roy somnath@sandisk.com wrote: Got most portion of it, thanks ! But, still not able to get when second node is down why with single monitor in the cluster client is not able to connect ? 1 monitor can form a quorum and should be sufficient for a cluster to run. The whole point of the monitor cluster is to ensure a globally consistent view of the cluster state that will never be reversed by a different group of up nodes. If one monitor (out of three) could make changes to the maps by itself, then there's nothing to prevent all three monitors from staying up but getting a net split, and then each issuing different versions of the osdmaps to whichever clients or OSDs happen to be connected to them. If you want to get down into the math proofs and things then the Paxos papers do all the proofs. Or you can look at the CAP theorem about the tradeoff between consistency and availability. The monitors are a Paxos cluster and Ceph is a 100% consistent system. -Greg Thanks Regards Somnath -Original Message- From: Gregory Farnum [mailto:g...@gregs42.com] Sent: Thursday, March 26, 2015 3:29 PM To: Somnath Roy Cc: Lee Revell; ceph-users@lists.ceph.com Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down On Thu, Mar 26, 2015 at 3:22 PM, Somnath Roy somnath@sandisk.com wrote: Greg, Couple of dumb question may be. 1. If you see , the clients are connecting fine with two monitors in the cluster. 2 monitors can never form a quorum, but, 1 can, so, why with 1 monitor (which is I guess happening after making 2 nodes down) it is not able to connect ? A quorum is a strict majority of the total membership. 2 monitors can form a quorum just fine if there are either 2 or 3 total membership. (As long as those two agree on every action, it cannot be lost.) We don't *recommend* configuring systems with an even number of monitors, because it increases the number of total possible failures without increasing the number of failures that can be tolerated. (3 monitors requires 2 in quorum, 4 does too. Same for 5 and 6, 7 and 8, etc etc.) 2. 
Also, my understanding is while IO is going on *no* monitor interaction will be on that path, so, why the client io will be stopped because the monitor quorum is not there ? If the min_size =1 is properly set it should able to serve IO as long as 1 OSD (node) is up, isn't it ? Well, the remaining OSD won't be able to process IO because it's lost its peers, and it can't reach any monitors to do updates or get new maps. (Monitors which are not in quorum will not allow clients to connect.) The clients will eventually stop serving IO if they know they can't reach a monitor, although I don't remember exactly how that's triggered. In this particular case, though, the client probably just tried to do an op against the dead osd, realized it couldn't, and tried to fetch a map from the monitors. When that failed it went into search mode, which is what the logs are showing you. -Greg Thanks Regards Somnath -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Gregory Farnum Sent: Thursday, March 26, 2015 2:40 PM To: Lee Revell Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down On Thu, Mar 26, 2015 at 2:30 PM, Lee Revell rlrev...@gmail.com wrote: On Thu, Mar 26, 2015 at 4:40 PM, Gregory Farnum g...@gregs42.com wrote: Has the OSD actually been detected as down yet? I believe it has, however I can't directly
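The majority rule Greg describes can be tabulated with a one-liner; this is only an illustration of the arithmetic (quorum = strict majority of the monmap size), not a Ceph command:

$ for n in 1 2 3 4 5 6 7; do echo "$n mons: quorum needs $(( n/2 + 1 )), tolerates $(( (n-1)/2 )) mon failures"; done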
Re: [ceph-users] Migrating objects from one pool to another?
On 26/03/2015, at 23.13, Gregory Farnum g...@gregs42.com wrote: The procedure you've outlined won't copy snapshots, just the head objects. Preserving the proper snapshot metadata and inter-pool relationships on rbd images I think isn't actually possible when trying to change pools. This wasn’t meant for migrating an RBD pool, but pure object/Swift pools… Anyway, it seems Glance http://docs.openstack.org/developer/glance/architecture.html#basic-architecture supports multiple storage back ends http://docs.openstack.org/developer/glance/configuring.html#configuring-multiple-swift-accounts-stores so I assume one could use a glance client to also extract/download images into a local file format (raw, qcow2, vmdk…) as well as uploading images to glance. And as glance images ain’t ‘live’ like virtual disk images, one could also download glance images from one glance store to a local file and upload them back into a different glance back end store. Again, this is probably better than dealing at a lower abstraction level and having to know its internal storage structures, and it avoids what you’re pointing out, Greg. On Thu, Mar 26, 2015 at 3:05 PM, Steffen W Sørensen ste...@me.com wrote: On 26/03/2015, at 23.01, Gregory Farnum g...@gregs42.com wrote: On Thu, Mar 26, 2015 at 2:53 PM, Steffen W Sørensen ste...@me.com wrote: On 26/03/2015, at 21.07, J-P Methot jpmet...@gtcomm.net wrote: That's a great idea. I know I can setup cinder (the openstack volume manager) as a multi-backend manager and migrate from one backend to the other, each backend linking to different pools of the same ceph cluster. What bugs me though is that I'm pretty sure the image store, glance, wouldn't let me do that. Additionally, since the compute component also has its own ceph pool, I'm pretty sure it won't let me migrate the data through openstack. Hm wouldn’t it be possible to do something similar ala: # list object from src pool rados ls objects loop | filter-obj-id | while read obj; do # export $obj to local disk rados -p pool-with-too-many-pgs get $obj # import $obj from local disk to new pool rados -p better-sized-pool put $obj done You would also have issues with snapshots if you do this on an RBD pool. That's unfortunately not feasible. What isn’t possible, export-import objects out-and-in of pools or snapshots issues? /Steffen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
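For a plain object pool with no snapshots and no RBD metadata, the per-object copy discussed above could be sketched roughly as follows; the pool names are placeholders, and note that rados get/put moves object data only (not xattrs, omap entries, or snapshots):

$ src=pool-with-too-many-pgs; dst=better-sized-pool
$ rados -p "$src" ls | while read -r obj; do
      rados -p "$src" get "$obj" /tmp/obj.$$ && rados -p "$dst" put "$obj" /tmp/obj.$$
  done; rm -f /tmp/obj.$$

If your rados build has a cppool subcommand it does much the same thing in one step, with the same snapshot caveats.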
Re: [ceph-users] All client writes block when 2 of 3 OSDs down
Greg, Couple of dumb question may be. 1. If you see , the clients are connecting fine with two monitors in the cluster. 2 monitors can never form a quorum, but, 1 can, so, why with 1 monitor (which is I guess happening after making 2 nodes down) it is not able to connect ? 2. Also, my understanding is while IO is going on *no* monitor interaction will be on that path, so, why the client io will be stopped because the monitor quorum is not there ? If the min_size =1 is properly set it should able to serve IO as long as 1 OSD (node) is up, isn't it ? Thanks Regards Somnath -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Gregory Farnum Sent: Thursday, March 26, 2015 2:40 PM To: Lee Revell Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down On Thu, Mar 26, 2015 at 2:30 PM, Lee Revell rlrev...@gmail.com wrote: On Thu, Mar 26, 2015 at 4:40 PM, Gregory Farnum g...@gregs42.com wrote: Has the OSD actually been detected as down yet? I believe it has, however I can't directly check because ceph health starts to hang when I down the second node. Oh. You need to keep a quorum of your monitors running (just the monitor processes, not of everything in the system) or nothing at all is going to work. That's how we prevent split brain issues. You'll also need to set that min size on your existing pools (ceph osd pool pool set min_size 1 or similar) to change their behavior; the config option only takes effect for newly-created pools. (Thus the default.) I've done this, however the behavior is the same: $ for f in `ceph osd lspools | sed 's/[0-9]//g' | sed 's/,//g'`; do ceph osd pool set $f min_size 1; done set pool 0 min_size to 1 set pool 1 min_size to 1 set pool 2 min_size to 1 set pool 3 min_size to 1 set pool 4 min_size to 1 set pool 5 min_size to 1 set pool 6 min_size to 1 set pool 7 min_size to 1 $ ceph -w cluster db460aa2-5129-4aaa-8b2e-43eac727124e health HEALTH_WARN 1 mons down, quorum 0,1 ceph-node-1,ceph-node-2 monmap e3: 3 mons at {ceph-node-1=192.168.122.121:6789/0,ceph-node-2=192.168.122.131:6789/0 ,ceph-node-3=192.168.122.141:6789/0}, election epoch 194, quorum 0,1 ceph-node-1,ceph-node-2 mdsmap e94: 1/1/1 up {0=ceph-node-1=up:active} osdmap e362: 3 osds: 2 up, 2 in pgmap v5913: 840 pgs, 8 pools, 7441 MB data, 994 objects 25329 MB used, 12649 MB / 40059 MB avail 840 active+clean 2015-03-26 17:23:56.009938 mon.0 [INF] pgmap v5913: 840 pgs: 840 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail 2015-03-26 17:25:51.042802 mon.0 [INF] pgmap v5914: 840 pgs: 840 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail; active+0 B/s rd, 260 kB/s wr, 13 op/s 2015-03-26 17:25:56.046491 mon.0 [INF] pgmap v5915: 840 pgs: 840 active+clean; 7441 MB data, 25333 MB used, 12645 MB / 40059 MB avail; active+0 B/s rd, 943 kB/s wr, 38 op/s 2015-03-26 17:26:01.058167 mon.0 [INF] pgmap v5916: 840 pgs: 840 active+clean; 7441 MB data, 25335 MB used, 12643 MB / 40059 MB avail; active+0 B/s rd, 10699 kB/s wr, 621 op/s this is where i kill the second OSD 2015-03-26 17:26:26.778461 7f4ebeffd700 0 monclient: hunting for new mon 2015-03-26 17:26:30.701099 7f4ec45f5700 0 -- 192.168.122.111:0/1007741 192.168.122.141:6789/0 pipe(0x7f4ec0023200 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f4ec0023490).fault 2015-03-26 17:26:42.701154 7f4ec44f4700 0 -- 192.168.122.111:0/1007741 192.168.122.131:6789/0 pipe(0x7f4ec00251b0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f4ec0025440).fault And all writes block until I bring back an OSD. 
Lee ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] more human readable log to track request or using mapreduce for data statistics
hi, ceph: Currently, the command "ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops" may return something like this: { description: osd_op(client.4436.1:11617 rb.0.1153.6b8b4567.0192 [] 2.8eb4757c ondisk+write e92), received_at: 2015-03-25 19:41:47.146145, age: 2.186521, duration: 1.237882, type_data: [ commit sent; apply or cleanup, { client: client.4436, tid: 11617}, [ { time: 2015-03-25 19:41:47.150803, event: event1}, { time: 2015-03-25 19:41:47.150873, event: event2}, { time: 2015-03-25 19:41:47.150895, event: event3}, { time: 2015-03-25 19:41:48.384027, event: event4}]]} I think this format is not well suited to grepping logs or to using MapReduce for statistics. For example, I want to know the average write request latency for each rbd every day. If we could output all the latencies on just one line, it would be very easy to achieve. For example, the output log could look something like this: 2015-03-26 03:30:53.859759 osd=osd.0 pg=2.11 op=(client.4436.1:11617 rb.0.1153.6b8b4567.0192 [] 2.8eb4757c ondisk+write e92) received_at=1427355253 age=2.186521 duration=1.237882 tid=11617 client=client.4436 event1=20ms event2=300ms event3=400ms event4=100ms. In the above: duration means the time between (reply_to_client_stamp - request_received_stamp); event1 means the time between (event1_stamp - request_received_stamp); ... event4 means the time between (event4_stamp - request_received_stamp). Now, if we output every log line as above, it would be very easy to compute the average write request latency for each rbd every day, or to use grep to find out which stage is the bottleneck. -- Regards, xinze ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
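Until such a log format exists, the admin-socket JSON can be flattened into one line per op with a small filter. A sketch assuming jq is installed; the recursive descent avoids depending on the exact wrapper key, which differs between releases, so adjust the field names to whatever your version emits:

$ ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops |
    jq -r '.. | objects | select(has("description") and has("duration")) | [.received_at, (.duration|tostring), .description] | @tsv'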
Re: [ceph-users] hadoop namenode not starting due to bindException while deploying hadoop with cephFS
On Wed, Mar 25, 2015 at 8:10 PM, Ridwan Rashid Noel ridwan...@gmail.com wrote: Hi Greg, Thank you for your response. I have understood that I should be starting only the mapred daemons when using cephFS instead of HDFS. I have fixed that and trying to run hadoop wordcount job using this instruction: bin/hadoop jar hadoop*examples*.jar wordcount /tmp/wc-input /tmp/wc-output but I am getting this error 15/03/26 02:54:35 INFO util.NativeCodeLoader: Loaded the native-hadoop library 15/03/26 02:54:35 INFO input.FileInputFormat: Total input paths to process : 1 15/03/26 02:54:35 WARN snappy.LoadSnappy: Snappy native library not loaded 15/03/26 02:54:35 INFO mapred.JobClient: Running job: job_201503260253_0001 15/03/26 02:54:36 INFO mapred.JobClient: map 0% reduce 0% 15/03/26 02:54:36 INFO mapred.JobClient: Task Id : attempt_201503260253_0001_m_21_0, Status : FAILED Error initializing attempt_201503260253_0001_m_21_0: java.io.FileNotFoundException: File file:/tmp/hadoop-ceph/mapred/system/job_201503260253_0001/jobToken does not exist. at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251) at org.apache.hadoop.mapred.TaskTracker.localizeJobTokenFile(TaskTracker.java:4445) at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1272) at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1213) at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2568) at java.lang.Thread.run(Thread.java:745) I'm not an expert at setting up Hadoop, but these errors are coming out of the RawLocalFileSystem, which I think means that worker node is trying to use a local FS instead of Ceph. Did you set up each node to access Ceph? Have you set up and used Hadoop previously? -Greg . I have used the core-site.xml configurations as mentioned in http://ceph.com/docs/master/cephfs/hadoop/ Please tell me how can this problem be solved? Regards, Ridwan Rashid Noel Doctoral Student, Department of Computer Science, University of Texas at San Antonio Contact# 210-773-9966 On Fri, Mar 20, 2015 at 4:04 PM, Gregory Farnum g...@gregs42.com wrote: On Fri, Mar 20, 2015 at 1:05 PM, Ridwan Rashid ridwan...@gmail.com wrote: Gregory Farnum greg@... writes: On Thu, Mar 19, 2015 at 5:57 PM, Ridwan Rashid ridwan064@... wrote: Hi, I have a 5 node ceph(v0.87) cluster and am trying to deploy hadoop with cephFS. I have installed hadoop-1.1.1 in the nodes and changed the conf/core-site.xml file according to the ceph documentation http://ceph.com/docs/master/cephfs/hadoop/ but after changing the file the namenode is not starting (namenode can be formatted) but the other services(datanode, jobtracker, tasktracker) are running in hadoop. The default hadoop works fine but when I change the core-site.xml file as above I get the following bindException as can be seen from the namenode log: 2015-03-19 01:37:31,436 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.net.BindException: Problem binding to node1/10.242.144.225:6789 : Cannot assign requested address I have one monitor for the ceph cluster (node1/10.242.144.225) and I included in the core-site.xml file ceph://10.242.144.225:6789 as the value of fs.default.name. The 6789 port is the default port being used by the monitor node of ceph, so that may be the reason for the bindException but the ceph documentation mentions that it should be included like this in the core-site.xml file. 
It would be really helpful to get some pointers to where I am going wrong in the setup. I'm a bit confused. The NameNode is only used by HDFS, and so shouldn't be running at all if you're using CephFS. Nor do I have any idea why you've changed anything in a way that tells the NameNode to bind to the monitor's IP address; none of the instructions that I see can do that, and they certainly shouldn't be. -Greg Hi Greg, I want to run a hadoop job (e.g. terasort) and want to use cephFS instead of HDFS. In the Using Hadoop with cephFS documentation at http://ceph.com/docs/master/cephfs/hadoop/ if you look into the Hadoop configuration section, the first property fs.default.name has to be set to the ceph URI and in the notes it's mentioned as ceph://[monaddr:port]/. My core-site.xml of the hadoop conf looks like this: <configuration> <property> <name>fs.default.name</name> <value>ceph://10.242.144.225:6789</value> </property> Yeah, that all makes sense. But I don't understand why or how you're starting up a NameNode at all, nor what config values it's drawing from to try and bind to that port. The NameNode is the problem because it shouldn't even be invoked. -Greg
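A quick way to confirm that the CephFS bindings (rather than a local or HDFS filesystem) are actually being used, assuming the jar and core-site.xml from the linked documentation are in place and using the monitor URI from this thread, is to list the CephFS root and then start only the MapReduce daemons:

$ bin/hadoop fs -ls ceph://10.242.144.225:6789/
$ bin/hadoop-daemon.sh start jobtracker     # no namenode/datanode should be needed
$ bin/hadoop-daemon.sh start tasktracker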
Re: [ceph-users] ceph falsely reports clock skew?
On Thu, 26 Mar 2015, Gregory Farnum wrote: On Thu, Mar 26, 2015 at 7:44 AM, Lee Revell rlrev...@gmail.com wrote: I have a virtual test environment of an admin node and 3 mon + osd nodes, built by just following the quick start guide. It seems to work OK but ceph is constantly complaining about clock skew much greater than reality. Clocksource on the virtuals is kvm-clock and they also run ntpd. ceph-admin-node 26 Mar 10:35:29 ntpdate[2647]: adjust time server 91.189.94.4 offset 0.000802 sec ceph-node-1 26 Mar 10:35:35 ntpdate[4250]: adjust time server 91.189.94.4 offset 0.002537 sec ceph-node-2 26 Mar 10:35:42 ntpdate[1708]: adjust time server 91.189.94.4 offset -0.000214 sec ceph-node-3 26 Mar 10:35:49 ntpdate[1964]: adjust time server 91.189.94.4 offset 0.001490 sec ceph@ceph-admin-node:~/my-cluster$ ceph -w cluster db460aa2-5129-4aaa-8b2e-43eac727124e health HEALTH_WARN clock skew detected on mon.ceph-node-2 monmap e3: 3 mons at {ceph-node-1=192.168.122.121:6789/0,ceph-node-2=192.168.122.131:6789/0,ceph-node-3=192.168.122.141:6789/0}, election epoch 140, quorum 0,1,2 ceph-node-1,ceph-node-2,ceph-node-3 mdsmap e54: 1/1/1 up {0=ceph-node-1=up:active} osdmap e182: 3 osds: 3 up, 3 in pgmap v3594: 840 pgs, 8 pools, 7163 MB data, 958 objects 29850 MB used, 27118 MB / 60088 MB avail 840 active+clean What clock skews is it reporting? I don't remember the defaults, but if ntp is consistently adjusting your clocks by a couple of milliseconds then I don't think Ceph is going to be very happy about it. IIRC the mons re-check sync every 5 minutes. Does the warning persist? Does it go away if you restart the mons? sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] running Qemu / Hypervisor AND Ceph on the same nodes
I run a converged openstack / ceph cluster with 14 1U nodes. Each has 1 SSD (os / journals), 3 1TB spinners (1 OSD each), 16 HT cores, 10Gb NICs for ceph network, and 72GB of RAM. I configure openstack to leave 3GB of RAM unused on each node for OSD / OS overhead. All the VMs are backed by ceph volumes and things generally work very well. I would prefer a dedicated storage layer simply because it seems more right, but I can't say that any of the common concerns of using this kind of setup have come up for me. Aside from shaving off that 3GB of RAM, my deployment isn't any more complex than a split stack deployment would be. After running like this for the better part of a year, I would have a hard time honestly making a real business case for the extra hardware a split stack cluster would require. QH On Thu, Mar 26, 2015 at 6:57 AM, Mark Nelson mnel...@redhat.com wrote: It's kind of a philosophical question. Technically there's nothing that prevents you from putting ceph and the hypervisor on the same boxes. It's a question of whether or not potential cost savings are worth increased risk of failure and contention. You can minimize those things through various means (cgroups, ristricting NUMA nodes, etc). What is more difficult is isolating disk IO contention (say if you want local SSDs for VMs), memory bus and QPI contention, network contention, etc. If the VMs are working really hard you can restrict them to their own socket, and you can even restrict memory usage to the local socket, but what about remote socket network or disk IO? (you will almost certainly want these things on the ceph socket) I wonder as well about increased risk of hardware failure with the increased load, but I don't have any statistics. I'm guessing if you spent enough time at it you could make it work relatively well, but at least personally I question how beneficial it really is after all of that. If you are going for cost savings, I suspect efficient compute and storage node designs will be nearly as good with much less complexity. Mark On 03/26/2015 07:11 AM, Wido den Hollander wrote: On 26-03-15 12:04, Stefan Priebe - Profihost AG wrote: Hi Wido, Am 26.03.2015 um 11:59 schrieb Wido den Hollander: On 26-03-15 11:52, Stefan Priebe - Profihost AG wrote: Hi, in the past i rwad pretty often that it's not a good idea to run ceph and qemu / the hypervisors on the same nodes. But why is this a bad idea? You save space and can better use the ressources you have in the nodes anyway. Memory pressure during recovery *might* become a problem. If you make sure that you don't allocate more then let's say 50% for the guests it could work. mhm sure? I've never seen problems like that. Currently i ran each ceph node with 64GB of memory and each hypervisor node with around 512GB to 1TB RAM while having 48 cores. Yes, it can happen. You have machines with enough memory, but if you overprovision the machines it can happen. Using cgroups you could also prevent that the OSDs eat up all memory or CPU. Never seen an OSD doing so crazy things. Again, it really depends on the available memory and CPU. If you buy big machines for this purpose it probably won't be a problem. Stefan So technically it could work, but memorey and CPU pressure is something which might give you problems. 
Stefan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph falsely reports clock skew?
On Thu, Mar 26, 2015 at 7:44 AM, Lee Revell rlrev...@gmail.com wrote: I have a virtual test environment of an admin node and 3 mon + osd nodes, built by just following the quick start guide. It seems to work OK but ceph is constantly complaining about clock skew much greater than reality. Clocksource on the virtuals is kvm-clock and they also run ntpd. ceph-admin-node 26 Mar 10:35:29 ntpdate[2647]: adjust time server 91.189.94.4 offset 0.000802 sec ceph-node-1 26 Mar 10:35:35 ntpdate[4250]: adjust time server 91.189.94.4 offset 0.002537 sec ceph-node-2 26 Mar 10:35:42 ntpdate[1708]: adjust time server 91.189.94.4 offset -0.000214 sec ceph-node-3 26 Mar 10:35:49 ntpdate[1964]: adjust time server 91.189.94.4 offset 0.001490 sec ceph@ceph-admin-node:~/my-cluster$ ceph -w cluster db460aa2-5129-4aaa-8b2e-43eac727124e health HEALTH_WARN clock skew detected on mon.ceph-node-2 monmap e3: 3 mons at {ceph-node-1=192.168.122.121:6789/0,ceph-node-2=192.168.122.131:6789/0,ceph-node-3=192.168.122.141:6789/0}, election epoch 140, quorum 0,1,2 ceph-node-1,ceph-node-2,ceph-node-3 mdsmap e54: 1/1/1 up {0=ceph-node-1=up:active} osdmap e182: 3 osds: 3 up, 3 in pgmap v3594: 840 pgs, 8 pools, 7163 MB data, 958 objects 29850 MB used, 27118 MB / 60088 MB avail 840 active+clean What clock skews is it reporting? I don't remember the defaults, but if ntp is consistently adjusting your clocks by a couple of milliseconds then I don't think Ceph is going to be very happy about it. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
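The threshold the monitors warn against, and the measured skew itself, can be read directly; mon_clock_drift_allowed is believed to default to around 0.05 s on releases of this era (check your build), and the socket path below is just the usual default:

$ ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-node-1.asok config get mon_clock_drift_allowed
$ ceph health detail    # shows the measured skew per monitor while the warning is active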
[ceph-users] ceph falsely reports clock skew?
I have a virtual test environment of an admin node and 3 mon + osd nodes, built by just following the quick start guide. It seems to work OK but ceph is constantly complaining about clock skew much greater than reality. Clocksource on the virtuals is kvm-clock and they also run ntpd. ceph-admin-node 26 Mar 10:35:29 ntpdate[2647]: adjust time server 91.189.94.4 offset 0.000802 sec ceph-node-1 26 Mar 10:35:35 ntpdate[4250]: adjust time server 91.189.94.4 offset 0.002537 sec ceph-node-2 26 Mar 10:35:42 ntpdate[1708]: adjust time server 91.189.94.4 offset -0.000214 sec ceph-node-3 26 Mar 10:35:49 ntpdate[1964]: adjust time server 91.189.94.4 offset 0.001490 sec ceph@ceph-admin-node:~/my-cluster$ ceph -w cluster db460aa2-5129-4aaa-8b2e-43eac727124e health HEALTH_WARN clock skew detected on mon.ceph-node-2 monmap e3: 3 mons at {ceph-node-1= 192.168.122.121:6789/0,ceph-node-2=192.168.122.131:6789/0,ceph-node-3=192.168.122.141:6789/0}, election epoch 140, quorum 0,1,2 ceph-node-1,ceph-node-2,ceph-node-3 mdsmap e54: 1/1/1 up {0=ceph-node-1=up:active} osdmap e182: 3 osds: 3 up, 3 in pgmap v3594: 840 pgs, 8 pools, 7163 MB data, 958 objects 29850 MB used, 27118 MB / 60088 MB avail 840 active+clean Lee ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] All pools have size=3 but MB data and MB used ratio is 1 to 5
On Thu, Mar 26, 2015 at 2:56 AM, Saverio Proto ziopr...@gmail.com wrote: Thanks for the answer. Now the meaning of MB data and MB used is clear, and if all the pools have size=3 I expect a ratio 1 to 3 of the two values. I still can't understand why MB used is so big in my setup. All my pools are size =3 but the ratio MB data and MB used is 1 to 5 instead of 1 to 3. My first guess was that I wrote a wrong crushmap that was making more than 3 copies.. (is it really possible to make such a mistake?) So I changed my crushmap and I put the default one, that just spreads data across hosts, but I see no change, the ratio is still 1 to 5. I thought maybe my 3 monitors have different views of the pgmap, so I tried to restart the monitors but this also did not help. What useful information may I share here to troubleshoot this issue further ? ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e) You just need to go look at one of your OSDs and see what data is stored on it. Did you configure things so that the journals are using a file on the same storage disk? If so, *that* is why the data used is large. I promise that your 5:1 ratio won't persist as you write more than 2GB of data into the cluster. -Greg Thank you Saverio 2015-03-25 14:55 GMT+01:00 Gregory Farnum g...@gregs42.com: On Wed, Mar 25, 2015 at 1:24 AM, Saverio Proto ziopr...@gmail.com wrote: Hello there, I started to push data into my ceph cluster. There is something I cannot understand in the output of ceph -w. When I run ceph -w I get this kinkd of output: 2015-03-25 09:11:36.785909 mon.0 [INF] pgmap v278788: 26056 pgs: 26056 active+clean; 2379 MB data, 19788 MB used, 33497 GB / 33516 GB avail 2379MB is actually the data I pushed into the cluster, I can see it also in the ceph df output, and the numbers are consistent. What I dont understand is 19788MB used. All my pools have size 3, so I expected something like 2379 * 3. Instead this number is very big. I really need to understand how MB used grows because I need to know how many disks to buy. MB used is the summation of (the programmatic equivalent to) df across all your nodes, whereas MB data is calculated by the OSDs based on data they've written down. Depending on your configuration MB used can include thing like the OSD journals, or even totally unrelated data if the disks are shared with other applications. MB used including the space used by the OSD journals is my first guess about what you're seeing here, in which case you'll notice that it won't grow any faster than MB data does once the journal is fully allocated. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] running Qemu / Hypervisor AND Ceph on the same nodes
I suspect a config like this where you only have 3 OSDs per node would be more manageable than something denser. IE theoretically a single E5-2697v3 is enough to run 36 OSDs in a 4U super micro chassis for a semi-dense converged solution. You could attempt to restrict the OSDs to one socket and then use a second E5-2697v3 for VMs. Maybe after you've got cgroups setup properly and if you've otherwise balanced things it would work out ok. I question though how much you really benefit by doing this rather than running a 36 drive storage server with lower bin CPUs and a 2nd 1U box for VMs (which you don't need as many of because you can dedicate both sockets to VMs). It probably depends quite a bit on how memory, network, and disk intensive the VMs are, but my take is that it's better to error on the side of simplicity rather than making things overly complicated. Every second you are screwing around trying to make the setup work right eats into any savings you might gain by going with the converged setup. Mark On 03/26/2015 10:12 AM, Quentin Hartman wrote: I run a converged openstack / ceph cluster with 14 1U nodes. Each has 1 SSD (os / journals), 3 1TB spinners (1 OSD each), 16 HT cores, 10Gb NICs for ceph network, and 72GB of RAM. I configure openstack to leave 3GB of RAM unused on each node for OSD / OS overhead. All the VMs are backed by ceph volumes and things generally work very well. I would prefer a dedicated storage layer simply because it seems more right, but I can't say that any of the common concerns of using this kind of setup have come up for me. Aside from shaving off that 3GB of RAM, my deployment isn't any more complex than a split stack deployment would be. After running like this for the better part of a year, I would have a hard time honestly making a real business case for the extra hardware a split stack cluster would require. QH On Thu, Mar 26, 2015 at 6:57 AM, Mark Nelson mnel...@redhat.com mailto:mnel...@redhat.com wrote: It's kind of a philosophical question. Technically there's nothing that prevents you from putting ceph and the hypervisor on the same boxes. It's a question of whether or not potential cost savings are worth increased risk of failure and contention. You can minimize those things through various means (cgroups, ristricting NUMA nodes, etc). What is more difficult is isolating disk IO contention (say if you want local SSDs for VMs), memory bus and QPI contention, network contention, etc. If the VMs are working really hard you can restrict them to their own socket, and you can even restrict memory usage to the local socket, but what about remote socket network or disk IO? (you will almost certainly want these things on the ceph socket) I wonder as well about increased risk of hardware failure with the increased load, but I don't have any statistics. I'm guessing if you spent enough time at it you could make it work relatively well, but at least personally I question how beneficial it really is after all of that. If you are going for cost savings, I suspect efficient compute and storage node designs will be nearly as good with much less complexity. Mark On 03/26/2015 07:11 AM, Wido den Hollander wrote: On 26-03-15 12:04, Stefan Priebe - Profihost AG wrote: Hi Wido, Am 26.03.2015 um 11:59 schrieb Wido den Hollander: On 26-03-15 11:52, Stefan Priebe - Profihost AG wrote: Hi, in the past i rwad pretty often that it's not a good idea to run ceph and qemu / the hypervisors on the same nodes. But why is this a bad idea? 
You save space and can better use the ressources you have in the nodes anyway. Memory pressure during recovery *might* become a problem. If you make sure that you don't allocate more then let's say 50% for the guests it could work. mhm sure? I've never seen problems like that. Currently i ran each ceph node with 64GB of memory and each hypervisor node with around 512GB to 1TB RAM while having 48 cores. Yes, it can happen. You have machines with enough memory, but if you overprovision the machines it can happen. Using cgroups you could also prevent that the OSDs eat up all memory or CPU. Never seen an OSD doing so crazy things. Again, it really depends on the available memory and CPU. If you buy big machines for this purpose it probably won't be a problem. Stefan So technically it could work, but memorey and CPU pressure is something which might give you problems.
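One way to implement the cgroup/NUMA restriction Mark mentions is with cgroup-tools; the group name, core range, and memory limit below are only illustrative and would need tuning per box:

$ sudo cgcreate -g cpuset,memory:/ceph-osd
$ echo 0-7 | sudo tee /sys/fs/cgroup/cpuset/ceph-osd/cpuset.cpus    # cores on the ceph socket
$ echo 0   | sudo tee /sys/fs/cgroup/cpuset/ceph-osd/cpuset.mems    # its local NUMA node
$ sudo cgclassify -g cpuset,memory:/ceph-osd $(pgrep -d' ' ceph-osd)
$ echo $((2*1024*1024*1024)) | sudo tee /sys/fs/cgroup/memory/ceph-osd/memory.limit_in_bytes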
Re: [ceph-users] How to see the content of an EC Pool after recreate the SSD-Cache tier?
Hi Greg, ok! It looks like my problem is more setomapval-related... I must do something like rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2 \0x0f\0x00\0x00\0x002cfc7ce74b0dc51 but rados setomapval doesn't interpret the hex values - instead of this I got rados -p ssd-archiv listomapvals rbd_directory name_vm-409-disk-2 value: (35 bytes) : : 5c 30 78 30 66 5c 30 78 30 30 5c 30 78 30 30 5c : \0x0f\0x00\0x00\ 0010 : 30 78 30 30 32 63 66 63 37 63 65 37 34 62 30 64 : 0x002cfc7ce74b0d 0020 : 63 35 31: c51 hmm, strange. With rados -p ssd-archiv getomapval rbd_directory name_vm-409-disk-2 name_vm-409-disk-2 I got the binary value inside the file name_vm-409-disk-2, but the reverse, rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2 name_vm-409-disk-2, fills the value with the literal string name_vm-409-disk-2 and not with the content of the file... Are there other tools for the rbd_directory? regards Udo Am 26.03.2015 15:03, schrieb Gregory Farnum: You shouldn't rely on rados ls when working with cache pools. It doesn't behave properly and is a silly operation to run against a pool of any size even when it does. :) More specifically, rados ls is invoking the pgls operation. Normal read/write ops will go query the backing store for objects if they're not in the cache tier. pgls is different — it just tells you what objects are present in the PG on that OSD right now. So any objects which aren't in cache won't show up when listing on the cache pool. -Greg On Thu, Mar 26, 2015 at 3:43 AM, Udo Lembke ulem...@polarzone.de wrote: Hi all, due to a very silly approach, I removed the cache tier of a filled EC pool. After recreating the pool and connecting it with the EC pool I don't see any content. How can I see the rbd_data and other files through the new ssd cache tier? I think that I must recreate the rbd_directory (and fill it with setomapval), but I don't see anything yet! $ rados ls -p ecarchiv | more rbd_data.2e47de674b0dc51.00390074 rbd_data.2e47de674b0dc51.0020b64f rbd_data.2fbb1952ae8944a.0016184c rbd_data.2cfc7ce74b0dc51.00363527 rbd_data.2cfc7ce74b0dc51.0004c35f rbd_data.2fbb1952ae8944a.0008db43 rbd_data.2cfc7ce74b0dc51.0015895a rbd_data.31229f0238e1f29.000135eb ... $ rados ls -p ssd-archiv nothing generation of the cache tier: $ rados mkpool ssd-archiv $ ceph osd pool set ssd-archiv crush_ruleset 5 $ ceph osd tier add ecarchiv ssd-archiv $ ceph osd tier cache-mode ssd-archiv writeback $ ceph osd pool set ssd-archiv hit_set_type bloom $ ceph osd pool set ssd-archiv hit_set_count 1 $ ceph osd pool set ssd-archiv hit_set_period 3600 $ ceph osd pool set ssd-archiv target_max_bytes 500 rule ssd { ruleset 5 type replicated min_size 1 max_size 10 step take ssd step choose firstn 0 type osd step emit } Is there any magic (or which command did I miss?) to see the existing data through the cache tier? regards - and hoping for answers Udo ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
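To double-check that the recreated tier actually reaches the EC pool despite the empty listing, a read through the cache pool is a better test than rados ls; pool names as in the post above, and the object chosen is simply whichever one the base pool lists first:

$ obj=$(rados -p ecarchiv ls | head -1)
$ rados -p ssd-archiv stat "$obj"    # served through the tier, proxied/promoted from ecarchiv
$ rados -p ecarchiv ls | wc -l       # listing still has to be done against the base pool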
Re: [ceph-users] All pools have size=3 but MB data and MB used ratio is 1 to 5
You just need to go look at one of your OSDs and see what data is stored on it. Did you configure things so that the journals are using a file on the same storage disk? If so, *that* is why the data used is large.
I followed your suggestion and this is the result of my troubleshooting. Each OSD controls a disk that is mounted in a folder named /var/lib/ceph/osd/ceph-N, where N is the OSD number. The journal is stored on another disk drive: I have three extra SSD drives per server, which I partitioned with 6 partitions each, and those partitions are journal partitions. I checked that the setup is correct because each /var/lib/ceph/osd/ceph-N/journal points correctly to another drive. With df -h I see the folders where my OSDs are mounted. The space occupation looks well distributed among all OSDs, as expected. The data is always in a folder called /var/lib/ceph/osd/ceph-N/current. I checked with the tool ncdu where the data is stored inside the current folders. In each OSD there is a folder with a lot of data called /var/lib/ceph/osd/ceph-N/current/meta. If I sum the MB for each meta folder, that is more or less the extra space that is consumed, leading to the 1 to 5 ratio. The meta folder contains a lot of binary files, unreadable, but looking at the file names it looks like it is where the versions of the osdmap are stored. It is really a lot of metadata, though. I will now start to push a lot of data into the cluster to see whether the metadata grows a lot or stays constant. Is there a way to clean up old metadata?
thanks Saverio
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
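For anyone who wants to repeat this comparison, a quick sketch of the checks described above, assuming the default filestore layout (PG directories end in _head, OSD-internal metadata lives under current/meta); adjust the OSD numbers for your hosts:

$ du -sh /var/lib/ceph/osd/ceph-*/current/meta      # osdmaps and other OSD-internal metadata
$ du -sh /var/lib/ceph/osd/ceph-*/current/*_head    # PG directories holding the actual object data
$ ls /var/lib/ceph/osd/ceph-0/current/meta | head   # file names should include the osdmap epochs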
Re: [ceph-users] ceph falsely reports clock skew?
I think I solved the problem. The clock skew only happens when restarting a node to simulate hardware failure. The virtual machine comes up with a skewed clock and the ceph services start before ntp has time to adjust it; then there's a delay before ceph rechecks the clock skew.
Lee
On Thu, Mar 26, 2015 at 11:21 AM, Sage Weil s...@newdream.net wrote:
On Thu, 26 Mar 2015, Gregory Farnum wrote:
On Thu, Mar 26, 2015 at 7:44 AM, Lee Revell rlrev...@gmail.com wrote:
I have a virtual test environment of an admin node and 3 mon + osd nodes, built by just following the quick start guide. It seems to work OK, but ceph is constantly complaining about clock skew much greater than reality. The clock source on the virtuals is kvm-clock and they also run ntpd.
ceph-admin-node 26 Mar 10:35:29 ntpdate[2647]: adjust time server 91.189.94.4 offset 0.000802 sec
ceph-node-1 26 Mar 10:35:35 ntpdate[4250]: adjust time server 91.189.94.4 offset 0.002537 sec
ceph-node-2 26 Mar 10:35:42 ntpdate[1708]: adjust time server 91.189.94.4 offset -0.000214 sec
ceph-node-3 26 Mar 10:35:49 ntpdate[1964]: adjust time server 91.189.94.4 offset 0.001490 sec
ceph@ceph-admin-node:~/my-cluster$ ceph -w
    cluster db460aa2-5129-4aaa-8b2e-43eac727124e
     health HEALTH_WARN clock skew detected on mon.ceph-node-2
     monmap e3: 3 mons at {ceph-node-1=192.168.122.121:6789/0,ceph-node-2=192.168.122.131:6789/0,ceph-node-3=192.168.122.141:6789/0}, election epoch 140, quorum 0,1,2 ceph-node-1,ceph-node-2,ceph-node-3
     mdsmap e54: 1/1/1 up {0=ceph-node-1=up:active}
     osdmap e182: 3 osds: 3 up, 3 in
      pgmap v3594: 840 pgs, 8 pools, 7163 MB data, 958 objects
            29850 MB used, 27118 MB / 60088 MB avail
                 840 active+clean
What clock skews is it reporting? I don't remember the defaults, but if ntp is consistently adjusting your clocks by a couple of milliseconds then I don't think Ceph is going to be very happy about it.
IIRC the mons re-check sync every 5 minutes. Does the warning persist? Does it go away if you restart the mons?
sage
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
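If the warning keeps reappearing after reboots, one workaround (not a fix for the underlying boot ordering) is to widen the tolerated drift on the monitors in ceph.conf; the values below are only examples, and IIRC the defaults are roughly 0.05 seconds allowed and a warn backoff of 5:

[mon]
    mon clock drift allowed = 0.2
    mon clock drift warn backoff = 30

The cleaner fix is to make sure ntpd (or a one-shot ntpdate -b) has stepped the clock before the ceph-mon service starts on boot.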