Re: [ceph-users] Upgraded Bobtail to Cuttlefish and unable to mount cephfs
Hi Greg,

Thank you for your concern. It turned out the problem was caused by ceph-mds: while the rest of the Ceph daemons had been upgraded to 0.61.8, ceph-mds was still at 0.56.7. I updated ceph-mds and the cluster stabilised within a few hours.

Kind regards,
Serge

On 08/30/2013 08:22 PM, Gregory Farnum wrote:
> Can you start up your mds with "debug mds = 20" and "debug ms = 20"? The "failed to decode message" line is suspicious, but there's not enough context here for me to be sure, and my pattern-matching isn't reminding me of any serious bugs.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
> On Thu, Aug 29, 2013 at 3:10 AM, Serge Slipchenko serge.slipche...@zoral.com.ua wrote:
>> Hi,
>>
>> I upgraded Ceph from Bobtail to Cuttlefish and everything seemed fine. Then I started to write to cephfs, but at some moment the writes stalled. After that I am not able to mount, either with the kernel driver or with a custom utility. "ceph -s" shows that everything is good:
>>
>>    health HEALTH_OK
>>    monmap e2: 2 mons at {m01=5.9.118.83:6789/0,m02=5.9.122.115:6789/0}, election epoch 1320, quorum 0,1 m01,m02
>>    osdmap e3967: 16 osds: 16 up, 16 in
>>    pgmap v1315932: 256 pgs: 255 active+clean, 1 active+clean+scrubbing; 215 GB data, 448 GB used, 38441 GB / 40971 GB avail; 37585KB/s rd, 1op/s
>>    mdsmap e774: 1/1/1 up {0=m02=up:active}, 1 up:standby
>>
>> But in the mds.a log I see the following messages:
>>
>> 2013-08-29 10:06:34.371166 7f49e68aa700 0 -- 5.9.122.115:6807/1077 >> 91.193.166.194:0/2272475298 pipe(0x8de3780 sd=74 :6807 s=0 pgs=0 cs=0 l=0).accept peer addr is really 91.193.166.194:0/2272475298 (socket is 91.193.166.194:56649/0)
>> 2013-08-29 10:07:38.454659 7f49e68aa700 0 -- 5.9.122.115:6807/1077 >> 91.193.166.194:0/2272475298 pipe(0x8de3780 sd=74 :6807 s=2 pgs=2 cs=1 l=0).fault, server, going to standby
>> 2013-08-29 10:23:06.898089 7f49e60a2700 0 -- 5.9.122.115:6807/1077 >> 91.193.166.194:0/3930317661 pipe(0x7442c000 sd=78 :6807 s=0 pgs=0 cs=0 l=0).accept peer addr is really 91.193.166.194:0/3930317661 (socket is 91.193.166.194:56272/0)
>> 2013-08-29 10:24:07.384136 7f49e60a2700 0 -- 5.9.122.115:6807/1077 >> 91.193.166.194:0/3930317661 pipe(0x7442c000 sd=78 :6807 s=2 pgs=2 cs=1 l=0).fault, server, going to standby
>> 2013-08-29 10:30:21.177807 7f49e5c9e700 0 -- 5.9.122.115:6807/1077 >> 91.193.166.194:0/1838286378 pipe(0x73bd8a00 sd=80 :6807 s=0 pgs=0 cs=0 l=0).accept peer addr is really 91.193.166.194:0/1838286378 (socket is 91.193.166.194:59069/0)
>> 2013-08-29 10:31:21.34 7f49e5c9e700 0 -- 5.9.122.115:6807/1077 >> 91.193.166.194:0/1838286378 pipe(0x73bd8a00 sd=80 :6807 s=2 pgs=2 cs=1 l=0).fault, server, going to standby
>> 2013-08-29 11:17:17.331613 7f040de6b700 0 -- 5.9.122.115:6807/7622 >> 91.193.166.194:0/2689145238 pipe(0x13ea780 sd=34 :6807 s=2 pgs=2 cs=1 l=0).fault with nothing to send, going to standby
>> 2013-08-29 11:22:08.137711 7f0411897700 0 log [INF] : closing stale session client.76201 91.193.166.194:0/2689145238 after 304.270364
>>
>> And mds.b outputs a lot of:
>>
>> 2013-08-29 12:04:58.743938 7fa75604d700 -1 failed to decode message of type 23 v2: buffer::end_of_buffer
>> 2013-08-29 12:04:58.743969 7fa75604d700 0 -- 5.9.122.115:6800/977 >> 144.76.13.103:0/925435369 pipe(0x524e780 sd=39 :6800 s=2 pgs=130763 cs=12829 l=0).fault with nothing to send, going to standby
>> 2013-08-29 12:04:58.744236 7fa755f4c700 0 -- 5.9.122.115:6800/977 >> 144.76.13.102:0/2955281877 pipe(0x524e500 sd=37 :6800 s=0 pgs=0 cs=0 l=0).accept connect_seq 12834 vs existing 12833 state standby
>> 2013-08-29 12:04:58.744607 7fa756754700 0 -- 5.9.122.115:6800/977 >> 144.76.13.105:0/347604456 pipe(0x52c5a00 sd=38 :6800 s=0 pgs=0 cs=0 l=0).accept connect_seq 12538 vs existing 12537 state standby
>> 2013-08-29 12:04:58.744627 7fa755f4c700 -1 failed to decode message of type 23 v2: buffer::end_of_buffer
>> 2013-08-29 12:04:58.744671 7fa755f4c700 0 -- 5.9.122.115:6800/977 >> 144.76.13.102:0/2955281877 pipe(0x524e500 sd=37 :6800 s=2 pgs=292532 cs=12835 l=0).fault with nothing to send, going to standby
>> 2013-08-29 12:04:58.745006 7fa75614e700 0 -- 5.9.122.115:6800/977 >> 144.76.13.103:0/925435369 pipe(0x52c5780 sd=31 :6800 s=0 pgs=0 cs=0 l=0).accept connect_seq 12830 vs existing 12829 state standby
>> 2013-08-29 12:04:58.745102 7fa756754700 -1 failed to decode message of type 23 v2: buffer::end_of_buffer
>> 2013-08-29 12:04:58.745146 7fa756754700 0 -- 5.9.122.115:6800/977 >> 144.76.13.105:0/347604456 pipe(0x52c5a00 sd=38 :6800 s=2 pgs=131368 cs=12539 l=0).fault with nothing to send, going to standby
>>
>> --
>> Kind regards,
>> Serge Slipchenko
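For reference, one way to turn on the logging Greg asks for is via ceph.conf. A minimal sketch (the file path and section name are the usual defaults; a restart of the MDS is needed for the settings to take effect):

--- cut ---
# /etc/ceph/ceph.conf
[mds]
    debug mds = 20   ; verbose MDS-internal logging
    debug ms = 20    ; verbose messenger (network) logging
--- cut ---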
Re: [ceph-users] Location field empty in Glance when instance to image
Thanks a lot Josh, it will be very useful.

Regards

On 31/08/13 02:58, Josh Durgin wrote:
> On 08/30/2013 03:40 AM, Toni F. [ackstorm] wrote:
>> Sorry, wrong list. Anyway, I take this opportunity to ask two questions:
>>
>> Does somebody know how I can download an image or snapshot?
>
> Cinder has no way to export them, but you can use:
>
>     rbd export pool/image@snap /path/to/file
>
>> How are the direct urls built?
>>
>>     rbd://9ed296cb-e9a7-4d36-b728-0ddc5f249ca0/images/7729788f-b80a-4d90-b3c7-6f61f5ebd535/snap
>
> The format is rbd://fsid/pool/image/snapshot. fsid is a unique id for a ceph cluster.
>
>> This is from an image. I need to build this direct url for a snapshot and I don't know how.
>
> In this case it's a cinder snapshot, and you've already found it in the "rbd snap ls" output.
>
> Josh

>> Thanks, regards
>>
>> On 30/08/13 12:27, Toni F. [ackstorm] wrote:
>>> Hi all,
>>>
>>> With a running boot-from-volume instance backed in ceph, I launch the command to create an image from the instance. All seems to work fine, but if I look in the database I notice that the location is empty:
>>>
>>> mysql> select * from images where id=b7674970-5d60-41da-bbb9-2ef10955fbbe \G
>>> *************************** 1. row ***************************
>>>               id: b7674970-5d60-41da-bbb9-2ef10955fbbe
>>>             name: snapshot_athena326
>>>             size: 0
>>>           status: active
>>>        is_public: 1
>>>        *location: NULL*
>>>       created_at: 2013-08-29 14:41:16
>>>       updated_at: 2013-08-29 14:41:16
>>>       deleted_at: NULL
>>>          deleted: 0
>>>      disk_format: raw
>>> container_format: bare
>>>         checksum: 8e79e146ce5d2c71807362730e7b5a3b
>>>            owner: 36d462972b1d49c5850ca864b6f39d05
>>>         min_disk: 0
>>>          min_ram: 0
>>>        protected: 0
>>> 1 row in set (0.00 sec)
>>>
>>> Bug?
>>>
>>> Additional info:
>>>
>>> # glance index
>>> ID                                   Name                    Disk Format Container Format Size
>>> ------------------------------------ ----------------------- ----------- ---------------- ----------
>>> 7729788f-b80a-4d90-b3c7-6f61f5ebd535 Ubuntu 12.04 LTS 32bits raw         bare             2147483648
>>> b0692408-6bcf-40b1-94c6-672154d5d7eb Ubuntu 12.04 LTS 64bits raw         bare             2147483648
>>>
>>> I created an instance from image 7729788f-b80a-4d90-b3c7-6f61f5ebd535:
>>>
>>> # nova list
>>> +--------------------------------------+-----------+--------+----------------------------------------+
>>> | ID                                   | Name      | Status | Networks                               |
>>> +--------------------------------------+-----------+--------+----------------------------------------+
>>> | bffd1b30-5690-4d2f-9347-1f0b7202ee6d | athena326 | ACTIVE | Private_15=10.128.3.195, 88.87.208.155 |
>>> +--------------------------------------+-----------+--------+----------------------------------------+
>>>
>>> # nova image-create bffd1b30-5690-4d2f-9347-1f0b7202ee6d snapshot_athena326
>>>
>>> /// LOGS in cinder-volume
>>> 2013-08-29 16:41:16 INFO cinder.volume.manager [req-8fc22aae-a516-4f62-a836-99f63f86f144 55b70876b2d24eb393da5119cb2b8ee4 36d462972b1d49c5850ca864b6f39d05] snapshot snapshot-7a41d848-6d35-47a6-b3ce-7be1d3643e68: creating
>>> 2013-08-29 16:41:16 DEBUG cinder.volume.manager [req-8fc22aae-a516-4f62-a836-99f63f86f144 55b70876b2d24eb393da5119cb2b8ee4 36d462972b1d49c5850ca864b6f39d05] snapshot snapshot-7a41d848-6d35-47a6-b3ce-7be1d3643e68: creating create_snapshot /usr/lib/python2.7/dist-packages/cinder/volume/manager.py:234
>>> 2013-08-29 16:41:16 DEBUG cinder.utils [req-8fc22aae-a516-4f62-a836-99f63f86f144 55b70876b2d24eb393da5119cb2b8ee4 36d462972b1d49c5850ca864b6f39d05] Running cmd (subprocess): rbd snap create --pool volumes --snap snapshot-7a41d848-6d35-47a6-b3ce-7be1d3643e68 volume-1b1e9684-05fa-4d8b-90a3-5bd2031c28bd execute /usr/lib/python2.7/dist-packages/cinder/utils.py:167
>>> 2013-08-29 16:41:17 DEBUG cinder.utils [req-8fc22aae-a516-4f62-a836-99f63f86f144 55b70876b2d24eb393da5119cb2b8ee4 36d462972b1d49c5850ca864b6f39d05] Running cmd (subprocess): rbd --help execute /usr/lib/python2.7/dist-packages/cinder/utils.py:167
>>> 2013-08-29 16:41:17 DEBUG cinder.utils [req-8fc22aae-a516-4f62-a836-99f63f86f144 55b70876b2d24eb393da5119cb2b8ee4 36d462972b1d49c5850ca864b6f39d05] Running cmd (subprocess): rbd snap protect --pool volumes --snap snapshot-7a41d848-6d35-47a6-b3ce-7be1d3643e68 volume-1b1e9684-05fa-4d8b-90a3-5bd2031c28bd execute /usr/lib/python2.7/dist-packages/cinder/utils.py:167
>>> 2013-08-29 16:41:17 DEBUG cinder.volume.manager [req-8fc22aae-a516-4f62-a836-99f63f86f144 55b70876b2d24eb393da5119cb2b8ee4 36d462972b1d49c5850ca864b6f39d05] snapshot snapshot-7a41d848-6d35-47a6-b3ce-7be1d3643e68: created successfully create_snapshot /usr/lib/python2.7/dist-packages/cinder/volume/manager.py:249
>>> /// LOGS in cinder-volume
>>>
>>> root@nova-volume-lnx001:/home/ackstorm# glance index
>>> ID Name Disk Format Container Format Size -- --
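Following the format Josh describes, a sketch of assembling the direct url for the cinder snapshot seen in the logs above. The pool and object names are taken from those logs; `ceph fsid` assumes a recent CLI (on older releases, read the fsid from ceph.conf instead):

--- cut ---
# the cluster fsid (also readable from /etc/ceph/ceph.conf)
ceph fsid

# confirm the snapshot exists on the backing volume
rbd snap ls volumes/volume-1b1e9684-05fa-4d8b-90a3-5bd2031c28bd

# compose rbd://<fsid>/<pool>/<image>/<snapshot>
echo "rbd://$(ceph fsid)/volumes/volume-1b1e9684-05fa-4d8b-90a3-5bd2031c28bd/snapshot-7a41d848-6d35-47a6-b3ce-7be1d3643e68"
--- cut ---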
Re: [ceph-users] Is it possible to change the pg number after adding new osds?
You can change the pg numbers on the fly with:

    ceph osd pool set {pool_name} pg_num {value}
    ceph osd pool set {pool_name} pgp_num {value}

Reference: http://ceph.com/docs/master/rados/operations/pools/

From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Da Chun Ng
Sent: Monday, 2 September 2013 04:49
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Is it possible to change the pg number after adding new osds?

> According to the doc, the pg numbers should be enlarged for better read/write balance if the osd number is increased. But it seems the pg number cannot be changed on the fly; it's fixed when the pool is created. Am I right?
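A minimal worked example (the pool name "rbd" and the target of 512 are assumptions; raise pg_num first and pgp_num afterwards, since pgp_num is what actually triggers data rebalancing):

--- cut ---
# check the current values
ceph osd pool get rbd pg_num
ceph osd pool get rbd pgp_num

# raise the number of placement groups (creates new, initially empty PGs)
ceph osd pool set rbd pg_num 512

# once the new PGs are created, raise pgp_num to rebalance data onto them
ceph osd pool set rbd pgp_num 512
--- cut ---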
Re: [ceph-users] To put journals to SSD or not?
How do you test the random behaviour of the disks; what's a good setup? If I understand correctly, Ceph writes in 4M blocks. I also expect a 50%/50% r/w ratio for our workloads; what else do I have to take into consideration?

Also, what I do not yet understand: in my performance tests I get pretty nice rados bench results (osd nodes have a 1Gb public and a 1Gb sync interface; the test node has a 10Gb NIC to public):

    rados bench -p test 30 write --no-cleanup
    Bandwidth (MB/sec): 128.801   (here the 1Gb sync network is clearly the bottleneck)

    rados bench -p test 30 seq
    Bandwidth (MB/sec): 303.140   (here it's the 1Gb public interface of the 3 nodes)

But if I test a still-sequential workload against an rbd device with the same pool settings as the test pool above, the results are as follows:

    sudo dd if=/dev/zero of=/mnt/testfile bs=4M count=100 oflag=direct
    419430400 bytes (419 MB) copied, 5.97832 s, 70.2 MB/s

I cannot identify the bottleneck here: no network interface is at its limit, CPUs are at 10%, and iostat shows all disks working with OK numbers. The only difference I see is that ceph -w shows many more ops than with the rados bench. Any idea how I could identify the bottleneck here? Or is it just the single dd thread?

Regards
Andi

-----Original Message-----
From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Martin Rudat
Sent: Monday, 2 September 2013 01:44
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] To put journals to SSD or not?

On 2013-09-02 05:19, Fuchs, Andreas (SwissTXT) wrote:
> Reading through the documentation and talking to several people leads to the conclusion that it's best practice to place the journal of an OSD instance on a separate SSD disk to speed writing up. But is this true? I have 3 new Dell servers for testing available, with 12 x 4 TB SATA and 2 x 100GB SSD disks. I don't have the exact specs at hand, but tests show: the SATAs' sequential write speed is 300MB/s; the SSD, which is in a RAID1 config, does only 270MB/s! It was probably not the most expensive one. When we put the journals on the OSDs, I can expect a sequential write speed of 12 x 150MB/s (one write to journal, one to disk); this is 1800MB/s per node.

The thing is that, unless you've got a magical workload, you're not going to be seeing sequential write speeds from your spinning disks, because, at a minimum, a write to the journal at the beginning of the disk and a write to data at a different portion of the disk is going to perform the same as random i/o: the disk is going to have to seek, on average, half-way across the platter each time it commits a new transaction to disk. This gets worse when you also take into account random reads, which cause yet more disk seeks.

Sequential read on the disks I've got is at about 180M/s (they're cheap slow disks); random read/write on the array seems to be peaking around 10M/s a disk. I'd benchmark your random i/o performance, and use that to choose how much, and how fast, a set of SSDs you will need.

I've actually got a 4-disk external hot-swap SATA cage on order that connects over a USB3 or eSATA link; sequential read/write even with the slow disks I've got will saturate the link, but filled with spinning disks doing random i/o there should be plenty of headroom available. It'll be interesting to see if it's a worthwhile investment, as opposed to having to open a computer up to change disks.

--
Martin Rudat
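On the opening question of how to test random disk behaviour: one common setup is fio. A sketch that roughly matches the workload described above (4M blocks, 50/50 random read/write, direct I/O); /dev/sdX is a placeholder and will be overwritten, so only point it at a scratch device:

--- cut ---
fio --name=randrw --filename=/dev/sdX --direct=1 \
    --rw=randrw --rwmixread=50 --bs=4M \
    --ioengine=libaio --iodepth=16 \
    --runtime=30 --time_based --group_reporting
--- cut ---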
[ceph-users] Best way to reformat OSD drives?
Hi all

We have a Ceph cluster with 64 OSD drives in 10 servers. We originally formatted the OSDs with btrfs, but have had numerous problems (server kernel panics) that we could trace back to btrfs. We are therefore in the process of reformatting our OSDs to XFS. We have a process that works, but I was wondering if there is a simpler/faster way.

Currently we 'ceph osd out' all drives of a server and wait for the data to migrate away, then delete each OSD, recreate it, and start the OSD processes again. This takes at least 1-2 days per server (mostly waiting for the data to migrate back and forth). Here's the script we are using:

--- cut ---
#! /bin/bash
OSD=$1
PART=$2
HOST=$3

echo "changing partition ${PART}1 to XFS for OSD: $OSD on host: $HOST"
read -p "continue or CTRL-C "

service ceph -a stop osd.$OSD
ceph osd crush remove osd.$OSD
ceph auth del osd.$OSD
ceph osd rm $OSD
ceph osd create    # this should give you back the same osd number as the one you just removed

umount ${PART}1
parted $PART rm 1                      # remove the old partition
parted $PART mkpart primary 0% 100%    # and create a new one
mkfs.xfs -f -i size=2048 ${PART}1 -L osd.$OSD
mount -o inode64,noatime ${PART}1 /var/lib/ceph/osd/ceph-$OSD

ceph-osd -i $OSD --mkfs --mkkey --mkjournal
ceph auth add osd.$OSD osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-${OSD}/keyring
ceph osd crush set $OSD 1 root=default host=$HOST
service ceph -a start osd.$OSD
--- cut ---

cheers
Jens-Christian

--
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch | http://www.switch.ch/socialmedia
Re: [ceph-users] Best way to reformat OSD drives?
On 02.09.2013 11:37, Jens-Christian Fischer wrote:
> We have a Ceph cluster with 64 OSD drives in 10 servers. We originally formatted the OSDs with btrfs but have had numerous problems (server kernel panics) that we could trace back to btrfs. We are therefore in the process of reformatting our OSDs to XFS. We have a process that works, but I was wondering if there is a simpler/faster way. Currently we 'ceph osd out' all drives of a server and wait for the data to migrate away, then delete the OSD, recreate it and start the OSD processes again. This takes at least 1-2 days per server (mostly waiting for the data to migrate back and forth).

Why wait for the data to migrate away? Normally you have replicas of the whole osd data, so you can simply stop the osd, reformat the disk and restart it again. It'll join the cluster and automatically get all the data it's missing. Of course the risk of data loss is a bit higher during that time, but normally that should be OK, because it's no different from an ordinary disk failure, which can happen at any time.

I just found a similar question from one year ago: http://www.spinics.net/lists/ceph-devel/msg05915.html
I didn't read the whole thread, but you can probably find some other ideas there.

    service ceph stop osd.$OSD
    mkfs -t xfs /dev/XXX
    ceph-osd -i $OSD --mkfs --mkkey --mkjournal
    service ceph start osd.$OSD

Corin
Re: [ceph-users] Best way to reformat OSD drives?
Hi Jens,

On 2013-09-02 19:37, Jens-Christian Fischer wrote:
> We have a Ceph cluster with 64 OSD drives in 10 servers. [...] Currently we 'ceph osd out' all drives of a server and wait for the data to migrate away, then delete the OSD, recreate it and start the OSD processes again. This takes at least 1-2 days per server (mostly waiting for the data to migrate back and forth).

The first thing I'd try is doing one osd at a time, rather than the entire server; in theory, this should allow (as opposed to definitely make it happen) data to move from one osd to the other, rather than having to push it across the network from other nodes.

Depending on just how much data you have on an individual osd, you could stop two, blow the first away, copy the data from osd 2 to the disk osd 1 was using, change the mount-points, then bring osd 2 back up again; in theory, osd 2 will only need to resync changes that have occurred while it was offline. This, of course, presumes that there's no change in the on-disk layout between btrfs and xfs...

--
Martin Rudat
Re: [ceph-users] Best way to reformat OSD drives?
> Why wait for the data to migrate away? Normally you have replicas of the whole osd data, so you can simply stop the osd, reformat the disk and restart it again. It'll join the cluster and automatically get all the data it's missing. Of course the risk of data loss is a bit higher during that time, but normally that should be OK, because it's no different from an ordinary disk failure, which can happen at any time.

Because I lost 2 objects the last time I did that trick (probably caused by additional user (i.e. me) stupidity in the first place, but I don't really fancy taking chances this time :) )

> I just found a similar question from one year ago: http://www.spinics.net/lists/ceph-devel/msg05915.html
> I didn't read the whole thread, but you can probably find some other ideas there.

I read it, but it is the usual to and fro - no definitive solution...

>     service ceph stop osd.$OSD
>     mkfs -t xfs /dev/XXX
>     ceph-osd -i $OSD --mkfs --mkkey --mkjournal
>     service ceph start osd.$OSD

I'll give that a whirl - I have enough OSDs to try - as soon as the cluster has recovered from the 9 disks I formatted on Saturday.

cheers
jc
Re: [ceph-users] Best way to reformat OSD drives?
Hi Martin

> On 2013-09-02 19:37, Jens-Christian Fischer wrote:
>> We have a Ceph cluster with 64 OSD drives in 10 servers. [...] This takes at least 1-2 days per server (mostly waiting for the data to migrate back and forth).
>
> The first thing I'd try is doing one osd at a time, rather than the entire server; in theory, this should allow data to move from one osd to the other, rather than having to push it across the network from other nodes.

Isn't that dependent on the CRUSH map and some rules?

> Depending on just how much data you have on an individual osd, you could stop two, blow the first away, copy the data from osd 2 to the disk osd 1 was using, change the mount-points, then bring osd 2 back up again; in theory, osd 2 will only need to resync changes that have occurred while it was offline. This, of course, presumes that there's no change in the on-disk layout between btrfs and xfs...

We were actually thinking of doing that, but I wanted to hear the wisdom of the crowd… The thread from a year ago (that I just read) cautioned against that procedure, though.

cheers
jc
[ceph-users] adding SSD only pool to existing ceph cluster
We have a Ceph cluster with 64 OSD (3 TB SATA) disks on 10 servers, and run an OpenStack cluster. We are planning to move the images of the running VM instances from the physical machines to CephFS. Our plan is to add 10 SSDs (one in each server) and create a pool that is backed only by these SSDs, and mount that pool in a specific location in CephFS.

References perused:
http://www.sebastien-han.fr/blog/2012/12/07/ceph-2-speed-storage-with-crush/
http://ceph.com/docs/master/rados/operations/crush-map/#placing-different-pools-on-different-osds

The difference between Sebastien's and the Ceph approach is that Sebastien has mixed SAS/SSD servers, while the Ceph documentation assumes servers that are either all-SAS or all-SSD. We have tried to replicate both approaches by manually editing the CRUSH map like so:

Option 1) Create new virtual SSD-only servers (where we have an h0 physical server, we'd set up an h0-ssd for the SSD) in the CRUSH map, together with a related server/rack/datacenter/root hierarchy:

--- cut ---
host s1-ssd {
    id -15        # do not change unnecessarily
    # weight 0.500
    alg straw
    hash 0        # rjenkins1
    item osd.36 weight 0.500
}
…
rack cla-r71-ssd {
    id -24        # do not change unnecessarily
    # weight 2.500
    alg straw
    hash 0        # rjenkins1
    item s0-ssd weight 0.000
    item s1-ssd weight 0.500
    […]
    item h5-ssd weight 0.000
}
root ssd {
    id -25        # do not change unnecessarily
    # weight 2.500
    alg straw
    hash 0        # rjenkins1
    item cla-r71-ssd weight 2.500
}
rule ssd {
    ruleset 3
    type replicated
    min_size 1
    max_size 10
    step take ssd
    step chooseleaf firstn 0 type host
    step emit
}
--- cut ---

Option 2) Create two pools (SATA and SSD) and list all SSDs manually in them:

--- cut ---
pool ssd {
    id -14        # do not change unnecessarily
    # weight 2.500
    alg straw
    hash 0        # rjenkins1
    item osd.36 weight 0.500
    item osd.65 weight 0.500
    item osd.66 weight 0.500
    item osd.67 weight 0.500
    item osd.68 weight 0.500
    item osd.69 weight 0.500
}
--- cut ---

We extracted the CRUSH map, decompiled, changed, compiled and injected it. Neither attempt really seemed to work (™), as we saw the cluster go into reshuffling mode immediately, probably due to the changed layout (OSD - Host - Rack - Root) in both cases. We reverted to the original CRUSH map and the cluster has been quiet since then.

Now the question: what is the best way to handle our use case? Add 10 SSD drives and create a separate pool with them, without upsetting the current pools (we don't want the regular/existing data to migrate towards the SSD pool), and with no disruption of service?

thanks
Jens-Christian
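For what it's worth, a sketch of building the separate root incrementally from the CLI instead of injecting a whole new map. The bucket names and ruleset number mirror the Option 1 snippet above; the pool name ssd-pool and PG count are placeholders. Since root=default is untouched, pools that keep their old ruleset shouldn't reshuffle:

--- cut ---
# build a separate hierarchy for the SSDs, leaving root=default alone
ceph osd crush add-bucket ssd root
ceph osd crush add-bucket s1-ssd host
ceph osd crush move s1-ssd root=ssd

# place an SSD-backed osd (here osd.36) under its virtual host
ceph osd crush set 36 0.5 root=ssd host=s1-ssd

# create a pool and point it at the SSD rule compiled from the map above
ceph osd pool create ssd-pool 256 256
ceph osd pool set ssd-pool crush_ruleset 3
--- cut ---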
Re: [ceph-users] some newbie questions...
Dimitri Maziuk writes:
>>> 1) I read somewhere that it is recommended to have one OSD per disk in a production environment. Is this also the maximum number of disks per OSD, or could I use multiple disks per OSD? And why?
>>
>> You could use multiple disks for one OSD if you used some striping and abstracted the disk (LVM, MD RAID, etc.), but it wouldn't make sense. One OSD writes into one filesystem, which is usually one disk in a production environment. Using RAID under it wouldn't drastically increase either reliability or performance. I see some sense in RAID 0: a single ceph-osd daemon per node (but still disk-per-OSD otherwise). If you have relatively few [planned] cores per task on a node, you can think about it.
>
> Raid-0: single disk failure kills the entire filesystem, off-lines the osd and triggers a cluster-wide resync. Actual raid: single disk failure does not affect the cluster in any way.

Usually data is distributed per-host, so a whole-array failure only causes a longer cluster resync, but nothing new cluster-wide.

--
WBR, Dzianis Kahanovich AKA Denis Kaganovich, http://mahatma.bspu.unibel.by/
Re: [ceph-users] some newbie questions...
Oliver Daudey writes:
>>>> 1) I read somewhere that it is recommended to have one OSD per disk in a production environment. Is this also the maximum number of disks per OSD, or could I use multiple disks per OSD? And why?
>>>
>>> You could use multiple disks for one OSD if you used some striping and abstracted the disk (LVM, MD RAID, etc.), but it wouldn't make sense. One OSD writes into one filesystem, which is usually one disk in a production environment. Using RAID under it wouldn't drastically increase either reliability or performance. I see some sense in RAID 0: a single ceph-osd daemon per node (but still disk-per-OSD otherwise). If you have relatively few [planned] cores per task on a node, you can think about it.
>>
>> Raid-0: single disk failure kills the entire filesystem, off-lines the osd and triggers a cluster-wide resync. Actual raid: single disk failure does not affect the cluster in any way.
>
> RAID controllers also add a lot of manageability into the mix. The fact that a chassis starts beeping and indicates exactly which disk needs replacing, and manages the automatic rebuild after replacement, makes operations much easier, even for less technical personnel. Also, if you have fast disks and a good RAID controller, it should offload the entire rebuild process from the node's main CPU, without a performance hit on the Ceph cluster or node. As already said, OSDs are expensive on resources, too. Having too many of them on one node and then having an entire node fail can cause a lot of traffic and load on the remaining nodes while things rebalance.

Oh, no! A RAID controller binds you to specific hardware and/or its limitations. Example: I have 3 nodes, 2 with plain SATA, 1 with an LSI MegaRAID SAS. The SAS controller has one benefit, a larger number of disks (I have 6x1Tb OSDs on the SAS node and 3x2Tb OSDs per SATA node), but many troubles: I cannot hot-replace (fixme about MegaRAID?), and I cannot read the RAID-formatted disks on the other 2 nodes... You speak about a GOOD controller; so, yes, good is good. But for Ceph I see only 2 reasons for a special controller: possibly better speed, and a battery-backed cache. All the other jobs (striping, fault tolerance) are Ceph's. Better to buy many of the biggest disks possible and put them into many ordinary SATA machines. And I usually kill the hardware RAID on new machines and run mdadm instead (if it is a single-node Linux server), to avoid painful games with various hardware.
[ceph-users] ceph freezes for 10+ seconds during benchmark
We've installed Ceph on a test cluster: 3x mon, 7x OSD on 2x 10k RPM SAS, CentOS 6.4 (2.6.32-358.14.1.el6.x86_64), ceph 0.67.2 (also tried 0.61.7 with the same results). During rados bench I get very strange behaviour:

    # rados bench -p pbench 100 write
     sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    ...
      51      16      1503      1487   116.603        72  0.306585  0.524611
      52      16      1525      1509   116.053        88  0.171904  0.520352
      53      16      1541      1525    115.07        64  0.121784  0.516466
      54      16      1541      1525   112.939         0         -  0.516466
      55      16      1541      1525   110.885         0         -  0.516466
      56      16      1541      1525   108.905         0         -  0.516466
      57      16      1541      1525   106.994         0         -  0.516466
    ...
    ( http://pastebin.com/vV50YBVK )

    Bandwidth (MB/sec):      81.760
    Stddev Bandwidth:        53.8371
    Max bandwidth (MB/sec):  156
    Min bandwidth (MB/sec):  0
    Average Latency:         0.782271
    Stddev Latency:          2.51829
    Max latency:             26.1715
    Min latency:             0.084654

Basically the benchmark runs at full disk speed and then all I/O stops for 10+ seconds. During that time all I/O and CPU load on all nodes essentially stops, and ceph -w starts to report:

2013-09-02 16:44:57.794115 osd.4 [WRN] 6 slow requests, 1 included below; oldest blocked for > 62.953663 secs
2013-09-02 16:44:57.794125 osd.4 [WRN] slow request 60.363101 seconds old, received at 2013-09-02 16:43:57.430961: osd_op(client.381797.0:2109 benchmark_data_hqblade203.non.3dart.com_18829_object2108 [write 0~4194304] 14.745012c3 e277) v4 currently waiting for subops from [0]
2013-09-02 16:45:01.795211 osd.4 [WRN] 6 slow requests, 1 included below; oldest blocked for > 66.954773 secs
2013-09-02 16:45:01.795221 osd.4 [WRN] slow request 60.661060 seconds old, received at 2013-09-02 16:44:01.134112: osd_op(client.381797.0:2199 benchmark_data_hqblade203.non.3dart.com_18829_object2198 [write 0~4194304] 14.dec41e60 e277) v4 currently waiting for subops from [0]
2013-09-02 16:45:02.795582 osd.4 [WRN] 6 slow requests, 2 included below; oldest blocked for > 67.955102 secs
2013-09-02 16:45:02.795590 osd.4 [WRN] slow request 60.316291 seconds old, received at 2013-09-02 16:44:02.479210: osd_op(client.381797.0:2230 benchmark_data_hqblade203.non.3dart.com_18829_object2229 [write 0~4194304] 14.b3ca5505 e277) v4 currently waiting for subops from [0]
2013-09-02 16:45:02.795595 osd.4 [WRN] slow request 60.014792 seconds old, received at 2013-09-02 16:44:02.780709: osd_op(client.381797.0:2234 benchmark_data_hqblade203.non.3dart.com_18829_object2233 [write 0~4194304] 14.a8c8cfd5 e277) v4 currently waiting for subops from [0]
2013-09-02 16:45:03.723742 osd.0 [WRN] 10 slow requests, 1 included below; oldest blocked for > 69.571037 secs
2013-09-02 16:45:03.723748 osd.0 [WRN] slow request 60.871583 seconds old, received at 2013-09-02 16:44:02.852110: osd_op(client.381797.0:2235 benchmark_data_hqblade203.non.3dart.com_18829_object2234 [write 0~4194304] 14.d44b2ab6 e277) v4 currently waiting for subops from [4]

Any ideas why this is happening and how it can be debugged? It seems that there is something wrong with osd.0, but there doesn't seem to be anything wrong with the machine itself (bonnie++ and dd on the machine do not show any lockups).

--
Mariusz Gronczewski, Administrator

Efigence Sp. z o. o.
ul. Wołoska 9a, 02-583 Warszawa
T: [+48] 22 380 13 13
F: [+48] 22 380 13 14
E: mariusz.gronczew...@efigence.com
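One way to see what the suspect OSD is stuck on at the moment of a stall is the admin socket, sketched below (the .asok path is the packaging default; adjust if yours differs):

--- cut ---
# on the node hosting osd.0: which requests are currently in flight,
# and what state is each one blocked in?
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight

# internal counters (journal, filestore, op queue) for the same daemon
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump
--- cut ---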
Re: [ceph-users] SSD only storage, where to place journal
On 30 August 2013 22:13, Stefan Priebe s.pri...@profihost.ag wrote:
> Yes, that's correct. What I hate at this point is that you lower the SSD speed by writing to the journal, then reading from the journal and writing to the SSD. Sadly there is no option to disable the journal. I think for SSDs this would be best.

This would be best for reliability. In terms of performance, is it a good idea to store one SSD's journal on the other SSD, and the other way around? Both SSDs are in different pools for different purposes.

regards

--
Maciej Gałkiewicz
Shelly Cloud Sp. z o. o., Sysadmin
http://shellycloud.com/, mac...@shellycloud.com
KRS: 440358 REGON: 101504426
Re: [ceph-users] Assert and monitor-crash when attemting to create pool-snapshots while rbd-snapshots are in use or have been used on a pool
On 08/18/2013 07:11 PM, Oliver Daudey wrote:
> Hey all,
>
> Also created on the tracker, under http://tracker.ceph.com/issues/6047

Oliver, list,

We fixed this last week. The fixes can be found on wip-6047. We shall merge this to the mainline, and the patch will be backported to dumpling.

Thanks once again for reporting it!

Cheers!

-Joao

--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com
Re: [ceph-users] Is it possible to change the pg number after adding new osds?
Only pgp_num is listed in the reference. Though pg_num can be changed in the same way, is there any risk in doing that?

From: andreas.fu...@swisstxt.ch
To: dachun...@outlook.com; ceph-users@lists.ceph.com
Subject: RE: [ceph-users] Is it possible to change the pg number after adding new osds?
Date: Mon, 2 Sep 2013 09:02:15 +

> You can change the pg numbers on the fly with:
>
>     ceph osd pool set {pool_name} pg_num {value}
>     ceph osd pool set {pool_name} pgp_num {value}
>
> Reference: http://ceph.com/docs/master/rados/operations/pools/
Re: [ceph-users] Is it possible to change the pg number after adding new osds?
Two days ago I increased it for one pool and am trying to reduce it for others. Reducing doesn't work (for me? - the repair froze, but rolling back - up - is fine); increasing works. My understanding is this: pgp_num is the parameter through which a pg_num change takes effect. Data is actually distributed over pgp_num placement groups, while pg_num is the number of allocated PGs. So, to increase: first change pg_num, then wait, then change pgp_num.

Da Chun Ng writes:
> Only pgp_num is listed in the reference. Though pg_num can be changed in the same way, is there any risk in doing that?
[ceph-users] How to force lost PGs
I created a pool with no replication and an RBD within that pool. I mapped the RBD to a machine, formatted it with a filesystem and dumped data on it. Just to see what kind of trouble I could get into, I stopped the OSD the RBD was using, marked the OSD as out, and reformatted the OSD tree. When I brought the OSD back up, I now have three stale PGs.

Now I'm trying to clear the stale PGs. I've tried removing the OSD from the crush maps, the OSD lists etc., without any luck.

Running:

    ceph pg 3.1 query
    ceph pg 3.1 mark_unfound_lost revert

ceph explains it doesn't have a PG 3.1.

Running "ceph osd repair osd.1" hangs after pg 2.3e.

Running "ceph osd lost 1 --yes-i-really-mean-it" nukes the osd. Rebuilding osd.1 goes fine, but I still have 3 stale PGs.

Any help clearing these stale PGs would be appreciated.

Thanks,
-Gaylord
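Since the pool had no replication, those PGs have no surviving copies; a possible next step is to recreate them as empty PGs. A sketch (force_create_pg permanently discards whatever was in the PG, so only use it on data you have given up on):

--- cut ---
# list exactly which PGs are stuck stale
ceph pg dump_stuck stale

# recreate a stale PG as an empty one (its data is gone for good)
ceph pg force_create_pg 3.1
--- cut ---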
[ceph-users] rgw geo-replication and disaster recovery problem
Hi! I'm interested in the rgw geo-replication and disaster recovery feature. But are those regions and zones distributed among several different Ceph clusters, or only one?

Thank you!
ashely
[ceph-users] about the script 'init-radosgw'
Hi,

I installed ceph 0.56.3 on Fedora 15. There is no rpm release for fc15, so I built it from source:

    # ./autogen.sh
    # ./configure --with-radosgw

and I installed ceph and radosgw successfully. I followed the ceph documentation to configure the ceph radosgw. When I start the radosgw, I found that the init-radosgw script (/usr/local/ceph-0.56.3/src/init-radosgw) can't be used on Fedora: Fedora does not support the start-stop-daemon command.
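Until the init script is ported, one workaround is to start the gateway directly. A sketch; the binary path assumes a source install under /usr/local, and the client name is a placeholder that must match the [client.radosgw.*] section in your ceph.conf:

--- cut ---
# run the gateway without the init script; -n names the cephx client
/usr/local/bin/radosgw -c /etc/ceph/ceph.conf -n client.radosgw.gateway
--- cut ---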
[ceph-users] tons of failed lossy con, dropping message = root cause for bad performance ?
Hello All,

I have a simple test setup with 2 osd servers, each with 3 NICs (1Gb each):
* One for management (ssh and such)
* One for the public network (connected to ceph clients)
* One for the cluster (osd inter-connection)

I keep seeing these messages:

Aug 26 18:43:31 ceph01 ceph-osd: 2013-08-26 18:43:31.040038 7f1afe5b6700 0 -- 192.168.113.115:6801/14629 submit_message osd_op_reply(88713 rb.0.1133.74b0dc51.03cd [write 2297856~4096] ondisk = 0) v4 remote, 192.168.113.1:0/607109564, failed lossy con, dropping message 0xaf83680
Aug 26 18:43:32 ceph01 ceph-osd: 2013-08-26 18:43:32.578875 7f1afe5b6700 0 -- 192.168.113.115:6801/14629 submit_message osd_op_reply(88870 rb.0.1133.74b0dc51.0345 [write 3145728~524288] ondisk = 0) v4 remote, 192.168.113.1:0/607109564, failed lossy con, dropping message 0x5c2e1e0

And also:

Aug 26 18:27:08 ceph01 ceph-osd: 2013-08-26 18:27:08.211604 7f3cf738f700 0 bad crc in data 1545773059 != exp 878537506
Aug 26 18:27:08 ceph01 ceph-osd: 2013-08-26 18:27:08.225121 7f3cf738f700 0 bad crc in data 1929463652 != exp 2083940607

Any idea on the problem?

Matthieu.
[ceph-users] ceph-mon runs on 6800 not 6789.
Hi all.

I have 1 MDS and 3 OSDs. I installed them via ceph-deploy (dumpling 0.67.2). At first, it worked perfectly. But after I rebooted one of the OSD nodes, ceph-mon launched on port 6800, not 6789. This is the result of 'ceph -s':

  cluster c59d13fd-c4c9-4cd0-b2ed-b654428b3171
   health HEALTH_WARN 1 mons down, quorum 0,1,2 ceph-mds,ceph-osd0,ceph-osd1
   monmap e1: 4 mons at {ceph-mds=192.168.13.135:6789/0,ceph-osd0=192.168.13.136:6789/0,ceph-osd1=192.168.13.137:6789/0,ceph-osd2=192.168.13.138:6789/0}, election epoch 206, quorum 0,1,2 ceph-mds,ceph-osd0,ceph-osd1
   osdmap e22: 4 osds: 2 up, 2 in
   pgmap v67: 192 pgs: 192 active+clean; 145 MB data, 2414 MB used, 18043 MB / 20458 MB avail
   mdsmap e4: 1/1/1 up {0=ceph-mds=up:active}

1 mon is down - it's running on 6800. This is the /etc/ceph/ceph.conf that was created automatically by ceph-deploy:

[global]
fsid = c59d13fd-c4c9-4cd0-b2ed-b654428b3171
mon_initial_members = ceph-mds, ceph-osd0, ceph-osd1, ceph-osd2
mon_host = 192.168.13.135,192.168.13.136,192.168.13.137,192.168.13.138
auth_supported = cephx
osd_journal_size = 1024
filestore_xattr_use_omap = true

According to my understanding, ceph-mon's default port is 6789. Why does it run on 6800 instead of 6789? Restarting ceph-mon gives the same result.

Sorry for my poor English. I don't write or speak English fluently.
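Two quick checks that may narrow this down, sketched here (the daemon and interface details are assumptions based on the output above):

--- cut ---
# which address is each monitor registered under in the monmap?
ceph mon dump

# on the rebooted node: what is ceph-mon actually listening on?
netstat -tlnp | grep ceph-mon
--- cut ---

If the monmap still shows 192.168.13.138:6789 but the daemon is bound to 6800, that may indicate the restarted mon is not finding its own identity in the monmap and is binding to an ephemeral port instead.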
Re: [ceph-users] tons of failed lossy con, dropping message = root cause for bad performance ?
Looks like maybe your network is faulty. The crc error means the OSD received a message with a checksum that didn't match. The dropped message indicates that the connection (in this case to a client) has failed (probably because of the bad crc?) and so it's dropping the outgoing message. This is intentional; if the connection is re-established from the other end, that state gets replayed and handled properly.

-Greg

On Monday, September 2, 2013, Matthieu Patou wrote:
> [...]

--
Software Engineer #42 @ http://inktank.com | http://ceph.com
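If it is the network, the NIC statistics usually show it. A quick sketch (eth1 is a placeholder for the public-network interface):

--- cut ---
# per-NIC hardware counters; look for crc, frame and drop errors
ethtool -S eth1 | grep -iE 'err|drop|crc'

# kernel-level rx/tx error totals for the same interface
ip -s link show eth1
--- cut ---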