Re: [ceph-users] Multi-site Implementation
I assume you're talking about Option Two, MULTI-SITE OBJECT STORAGE WITH FEDERATED GATEWAYS, from Inktank's http://info.inktank.com/multisite_options_with_inktank_ceph_enterprise

There are still some options. Each region has a master zone and one (or more) replica zones. You can only write to the master zone, but you can read from the master or the replicas. Regions and zones live inside a Ceph cluster, and each Ceph cluster can have multiple zones. Each zone has its own URL and web servers.

Just like any replication strategy, this can be as simple or as complicated as you want to make it. For example, you could set up a single master zone in site 1 that replicates to sites 2 and 3. Or you could set up 3 master zones that replicate in a ring: Site 1 master -> Site 2 replica, Site 2 master -> Site 3 replica, Site 3 master -> Site 1 replica. It's more complicated, but it lets everybody read/write to their local cluster, as long as you're prepared to deal with 6 different URLs.

Which setup you choose really depends on your requirements, and it changes the answers to the rest of your questions.

*Craig Lewis*
Senior Systems Engineer, Central Desktop

On 4/1/14 10:06, Shang Wu wrote:

Hi all, I have some questions about the Ceph multi-site implementation. I am thinking of using Ceph as the storage solution across three internal sites. I think, with a good internet connection, the multi-site object storage with RADOS (RGW) might be a good fit here. Each site would have a MON node and many OSDs, and the sites would replicate data between each other. With this implementation, I hope it will allow users to READ/WRITE from/to the local office and Ceph will take care of the replication. So my questions are:

1. How does Ceph know how to retrieve data from the nearest location? (Ceph usually calculates where the data is through CRUSH rather than picking the nearest location for the user.) Will the data be distributed evenly throughout the three sites? If not, how can we let users access the _local copy_?
2. Is multi-site object storage with RADOS a good fit for this implementation, i.e. to READ/WRITE data to/from the local site? If not, what is the best way to approach this?
3. Does Ceph use the same ID (object name?) for all its replicas? Can we access (read/write) these replicas directly?
4. In this multi-site scenario, when a user writes data to Ceph, will it find the nearest OSD to put the data? When a user reads data, does it always respond from the primary data set (regardless of location) or from the nearest replica copy?

Thanks,
Shang Wu
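For reference, the federated configuration of that era is driven by radosgw-admin with JSON region/zone definitions. A rough, hedged sketch of the shape of it; the region name, zone name, file names, and gateway instance name below are made-up examples, not taken from this thread:

    radosgw-admin region set --infile us.json --name client.radosgw.us-east-1
    radosgw-admin zone set --rgw-zone=us-east --infile zone-us-east.json --name client.radosgw.us-east-1
    radosgw-admin regionmap update --name client.radosgw.us-east-1
    # replication between a master zone and its replica zone is then driven by the
    # separate radosgw-agent process, pointed at the two zones' endpoints

In the ring layout Craig describes, each site would host the master zone of its own region plus a replica zone of a neighbouring region, with one radosgw-agent instance per replication pair.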
[ceph-users] write speed issue on RBD image
Can someone recommend some testing I can do to further investigate why this slow-disk-write issue in the VM OS is occurring? It seems the issue, details below, is perhaps related to the VM OS running on the RADOS images in Ceph.

Issue:
I have a handful (like 10) of VMs running that, when tested, report slow disk write speeds of 8MB/s-30MB/s. All of the remaining VMs (like 40) report fast disk write speeds averaging 800MB/s-1.0GB/s. There are no VMs reporting any disk write speeds in between these numbers. Restarting the OS on any of the VMs does not resolve the issue.

After these tests, I took one of the VMs (image02host) with slow disk write speed and reinstalled the basic OS, including repartitioning the disk. I used the same RADOS image. After this, I retested this VM (image02host) and all the other VMs with slow disk write speed. The VM (image02host) I reinstalled the OS on no longer has the slow disk write speeds. And, surprisingly, one of the other VMs (another-host) with slow disk write speed started having fast write speeds. All the other VMs with slow disk write speed stayed the same. So I do not necessarily believe the slow disk issue is directly related to any kind of bug or outstanding issue with Ceph/RADOS.

I only have a couple of guesses at this point:
1. Perhaps my OS install (or possibly configuration) somehow is having an issue. I don't see how this is possible, however. All the VMs I have tested were kick-started with the same disk and OS configuration, so they are virtually identical, yet they have either fast or slow disk write speed among them.
2. Perhaps I have some bad sectors or a hard drive error at the hardware level that is causing the issue. Perhaps the RADOS images of this handful (like 10) of VMs are being written across a bad part of a hard drive. This seems more likely to me. However, all drives across all Ceph hosts are reporting good health.

So, now, I have come to the ceph-users list to ask for help. What are some things I can do to test whether there is some bad sector or hardware error on one of the hard drives, or some issue with Ceph writing to part of one of the hard drives? Or are there any other tests I can run to help determine possible issues?

And, secondly, if I wanted to move a RADOS image to new OSD blocks, is there a way to do that without exporting and importing the image? Perhaps, by resplattering the image and testing again to see if the issue is resolved, this can help determine whether the existing slow disk write speed issue is due to how the image is splattered across OSDs - indicating a bad OSD hard drive, or bad parts of an OSD hard drive.

Ceph Configuration:
* Ceph Version 0.72.2
* Three Ceph hosts, CentOS 6.5 OS, using XFS
* All connected via a 10GbE network
* KVM/QEMU virtualization, with Ceph support
* Virtual machines are all RHEL 5.9 32bit
* Our Ceph setup is very basic. One pool for all VM disks; all drives on all Ceph hosts are in that pool.
* Ceph caching is on:
rbd cache = true
rbd cache size = 128
rbd cache max dirty = 64
rbd cache target dirty = 64
rbd cache max dirty age = 10.0

Test:
Here I provide the test results of two VMs that are running on the same Ceph host, using disk images from the same Ceph pool, and were cloned from the same RADOS snapshot. They both have the exact same KVM configuration. However, they report dramatically different write speeds. When I tested them both, they were running on the same Ceph host.
In fact, for the VM reporting slow disk write speed, I even had it run on a different Ceph host to test, and it still gave the same disk write speed results. [root@linux]# rbd -p images info osimage01 rbd image 'osimage01': size 28672 MB in 7168 objects order 22 (4096 kB objects) block_name_prefix: rbd_data.2bfb74b0dc51 format: 2 features: layering [root@linux]# rbd -p images info osimage02 rbd image 'osimage02': size 28672 MB in 7168 objects order 22 (4096 kB objects) block_name_prefix: rbd_data.2c1a2ae8944a format: 2 features: layering None of the images used are cloned. [root@linux]# ssh image01host image01host [65]% dd if=/dev/zero of=disk-test bs=1048576 count=512; dd if=disk-test of=/dev/null bs=1048576; /bin/rm disk-test 512+0 records in 512+0 records out 536870912 bytes (537 MB) copied, 0.760446 seconds, 706 MB/s 512+0 records in 512+0 records out 536870912 bytes (537 MB) copied, 0.214783 seconds, 2.5 GB/s image01host [66]% dd if=/dev/zero of=disk-test bs=1048576 count=512; dd if=disk-test of=/dev/null bs=1048576; /bin/rm disk-test 512+0 records in 512+0 records out 536870912 bytes (537 MB) copied, 0.514886 seconds, 1.0 GB/s 512+0 records in 512+0 records out 536870912 bytes (537 MB) copied, 0.198433 seconds, 2.7 GB/s image01host [67]% dd if=/dev/zero of=disk-test bs=1048576 count=512; dd if=disk-test
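A few checks that might help narrow down the bad-disk theory and take the guest page cache out of the picture. The device names, OSD id, and pool name below are examples, not taken from the post:

    # on each Ceph host, check the raw disks behind the OSDs
    smartctl -a /dev/sdb          # SMART health, reallocated/pending sector counts
    dmesg | grep -i 'sdb'         # any I/O errors the kernel has logged

    # exercise a single OSD's disk+journal path directly
    ceph tell osd.3 bench

    # exercise the whole pool from a client
    rados bench -p images 30 write

    # inside a slow guest, bypass the page cache so dd measures the actual RBD path
    dd if=/dev/zero of=disk-test bs=1M count=512 oflag=direct

The 800MB/s-1.0GB/s numbers from the fast VMs are very likely buffered writes landing in the guest page cache and rbd cache; oflag=direct should make the fast and slow groups directly comparable.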
Re: [ceph-users] write speed issue on RBD image
Correction: When I wrote Here I provide the test results of two VMs that are running on the same Ceph host, using disk images from the same ceph pool, and were cloned from the same RADOS snapshot. I really meant: Here I provide the test results of two VMs that are running on the same Ceph host, using disk images from the same ceph pool, and were NOT cloned from ANY RADOS snapshot. -RG

- Original Message -
From: Russell E. Glaue rgl...@cait.org
To: ceph-users@lists.ceph.com
Sent: Wednesday, April 2, 2014 1:12:46 PM
Subject: [ceph-users] write speed issue on RBD image
Re: [ceph-users] Backup Restore?
The short answer is no. The longer answer is it depends.

The most concise discussion I've seen is Inktank's multi-site options whitepaper: http://info.inktank.com/multisite_options_with_inktank_ceph_enterprise

That white paper only addresses RBD backups (using snapshots) and RadosGW backups (using RadosGW replication). The first option in the whitepaper, a single cluster in multiple locations, isn't a backup. I'm not aware of any backup or offsite capability for raw RADOS pools.

There really aren't any good options for backing up CephFS. You could use rsync on CephFS, but it's not going to work well. rsync to offsite locations begins to have problems around the TB size, give or take an order of magnitude. The exact spot depends on your bandwidth, latency, file count, average file size, average file churn, and disk I/O on both sides. It takes a lot of time and disk I/O to enumerate all the files on the filesystem and compare them to the offsite copy.

CephFS does have some nice features that could make for an efficient backup. If rsync (or any backup client) were aware of the way CephFS handles directory sizes and timestamps, it could prune the directory tree enumeration much more efficiently. That should scale well to much larger filesystems, limited mostly by file churn and churn locality. I don't know of anybody that's working on that. I'm interested in the concept, but I have no plans (personal or professional) to use CephFS.

I'm currently working on adding snapshot capabilities to RadosGW. Combined with replication, that can protect against disasters, PEBKAC, and application error. Replication alone only protects against disasters, not PEBKAC or application errors - just like RAID protects against disk failure, but not file deletion.

Replication + snapshots (for both RadosGW and RBD) don't protect against a determined attacker. Even tape is vulnerable to a determined attacker with a high security level in your organization. The trick with both offline backups and remote snapshots is to set up enough barriers and checks that things get noticed before a determined attacker can finish the job. It's easier to do with offline backups than online backups.

*Craig Lewis*
Senior Systems Engineer, Central Desktop

On 4/2/14 00:08, Robert Sander wrote:

Hi, what are the options to consistently backup and restore data out of a ceph cluster?
- RBDs can be snapshotted.
- Data on RBDs used inside VMs can be backed up using tools from the guest.
- CephFS data can be backed up using rsync or similar tools.
What about object data in other pools?
There are two scenarios where a backup is needed:
- disaster recovery, i.e. the whole cluster goes nuts
- single item restore, because of PEBKAC or application error
Is there any work in progress to cover these?
Regards
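For RBD specifically, incremental offsite copies can be scripted around snapshots with export-diff/import-diff. A minimal sketch, assuming a pool "rbd", an image "vm01", and an identically named image already created on the remote cluster:

    rbd snap create rbd/vm01@backup-1
    rbd export-diff rbd/vm01@backup-1 vm01-backup-1.diff      # everything up to the first snapshot
    # ... next backup window ...
    rbd snap create rbd/vm01@backup-2
    rbd export-diff --from-snap backup-1 rbd/vm01@backup-2 vm01-backup-2.diff   # only the changes
    # on the remote cluster, replay the diffs in order
    rbd import-diff vm01-backup-1.diff rbd/vm01
    rbd import-diff vm01-backup-2.diff rbd/vm01

This gives point-in-time, crash-consistent copies without re-reading the whole image each cycle; it doesn't help with raw RADOS pools or CephFS, which is the gap discussed above.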
[ceph-users] Cancel a scrub?
Is there any way to cancel a scrub on a PG? I have an OSD that's recovering, and there's a single PG left waiting:

2014-04-02 13:15:39.868994 mon.0 [INF] pgmap v5322756: 2592 pgs: 2589 active+clean, 1 active+recovery_wait, 2 active+clean+scrubbing+deep; 15066 GB data, 30527 GB used, 29061 GB / 59588 GB avail; 1/3878 objects degraded (0.000%)

The PG that is in recovery_wait is on the same OSD that is being deep scrubbed. I don't have journals on SSD, so recovery and scrubbing are heavily throttled. I want to cancel the scrub so the recovery can complete. I'll manually restart the deep scrub when it's done.

Normally I'd just wait, but this OSD is flapping. It keeps getting kicked out of the cluster for being unresponsive. I'm hoping that if I cancel the scrub, it will allow the recovery to complete and the OSD will stop flapping.

--
*Craig Lewis*
Senior Systems Engineer, Central Desktop
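There's no direct "abort this scrub" command in this release; the usual workaround is the noscrub flags plus a bounce of the OSD doing the scrubbing. A sketch, with the osd id and init-script invocation as examples for a sysvinit install:

    ceph osd set noscrub
    ceph osd set nodeep-scrub
    # restarting the scrubbing OSD aborts the in-flight deep scrub, and the flags
    # keep it from being rescheduled while recovery catches up
    service ceph restart osd.11
    # once the PG is active+clean again
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub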
Re: [ceph-users] Cancel a scrub?
Thanks! I knew about noscrub, but I didn't realize that the flapping would cancel a scrub in progress. So the scrub doesn't appear to be the reason it wasn't recovering. After a flap, it goes into: 2014-04-02 14:11:09.776810 mon.0 [INF] pgmap v5323181: 2592 pgs: 2591 active+clean, 1 active+recovery_wait; 15066 GB data, 30527 GB used, 29060 GB / 59588 GB avail; 1/3878 objects degraded (0.000%); 0 B/s, 11 keys/s, 2 objects/s recovering It stays in that state until the OSD gets kicked out again. The problem is the flapping OSD is spamming its logs with: 2014-04-02 14:12:01.242425 7f344a97d700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f3447977700' had timed out after 15 None of the other OSDs are saying that. Is there anything I can do to repair the health map on osd.11? In case it helps, here are the osd.11 logs after a daemon restart: 2014-04-02 14:10:58.267556 7f3467ff6780 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-osd, pid 7791 2014-04-02 14:10:58.269782 7f3467ff6780 1 filestore(/var/lib/ceph/osd/ceph-11) mount detected xfs 2014-04-02 14:10:58.269789 7f3467ff6780 1 filestore(/var/lib/ceph/osd/ceph-11) disabling 'filestore replica fadvise' due to known issues with fadvise(DONTNEED) on xfs 2014-04-02 14:10:58.306112 7f3467ff6780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: FIEMAP ioctl is supported and appears to work 2014-04-02 14:10:58.306135 7f3467ff6780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option 2014-04-02 14:10:58.308070 7f3467ff6780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: syncfs(2) syscall fully supported (by glibc and kernel) 2014-04-02 14:10:58.357102 7f3467ff6780 0 filestore(/var/lib/ceph/osd/ceph-11) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled 2014-04-02 14:10:58.360837 7f3467ff6780 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway 2014-04-02 14:10:58.360851 7f3467ff6780 1 journal _open /var/lib/ceph/osd/ceph-11/journal fd 20: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0 2014-04-02 14:10:58.422842 7f3467ff6780 1 journal _open /var/lib/ceph/osd/ceph-11/journal fd 20: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0 2014-04-02 14:10:58.423241 7f3467ff6780 1 journal close /var/lib/ceph/osd/ceph-11/journal 2014-04-02 14:10:58.424433 7f3467ff6780 1 filestore(/var/lib/ceph/osd/ceph-11) mount detected xfs 2014-04-02 14:10:58.442963 7f3467ff6780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: FIEMAP ioctl is supported and appears to work 2014-04-02 14:10:58.442974 7f3467ff6780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option 2014-04-02 14:10:58.445144 7f3467ff6780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: syncfs(2) syscall fully supported (by glibc and kernel) 2014-04-02 14:10:58.451977 7f3467ff6780 0 filestore(/var/lib/ceph/osd/ceph-11) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled 2014-04-02 14:10:58.454481 7f3467ff6780 -1 journal FileJournal::_open: disabling aio for non-block journal. 
Use journal_force_aio to force use of aio anyway 2014-04-02 14:10:58.454495 7f3467ff6780 1 journal _open /var/lib/ceph/osd/ceph-11/journal fd 21: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0 2014-04-02 14:10:58.465211 7f3467ff6780 1 journal _open /var/lib/ceph/osd/ceph-11/journal fd 21: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0 2014-04-02 14:10:58.466825 7f3467ff6780 0 cls cls/hello/cls_hello.cc:271: loading cls_hello 2014-04-02 14:10:58.468745 7f3467ff6780 0 osd.11 11688 crush map has features 1073741824, adjusting msgr requires for clients 2014-04-02 14:10:58.468756 7f3467ff6780 0 osd.11 11688 crush map has features 1073741824, adjusting msgr requires for osds 2014-04-02 14:11:07.822045 7f343de58700 0 -- 10.194.0.7:6800/7791 10.194.0.7:6822/14075 pipe(0x1c96e000 sd=177 :6800 s=0 pgs=0 cs=0 l=0 c=0x1b7e3000).accept connect_seq 0 vs existing 0 state connecting 2014-04-02 14:11:07.822182 7f343f973700 0 -- 10.194.0.7:6800/7791 10.194.0.7:6806/26942 pipe(0x1c96e280 sd=82 :6800 s=0 pgs=0 cs=0 l=0 c=0x1b7e3160).accept connect_seq 0 vs existing 0 state connecting 2014-04-02 14:11:20.333163 7f344a97d700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f3447977700' had timed out after 15 snip repeats 2014-04-02 14:13:35.310407 7f344a97d700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f344a97d700 time 2014-04-02 14:13:35.308718 common/HeartbeatMap.cc: 79: FAILED assert(0 == hit suicide timeout) ceph version 0.72.2
[ceph-users] Cleaning up; data usage, snap-shots, auth users
Hi, I have a small 8TB testing cluster. During testing I used 94G. But I have since removed the pools and images from Ceph, so I shouldn't be using any space, yet the 94G of usage remains. How can I reclaim the old used space?

Also, this:

ceph@ceph-admin:~$ rbd rm 6fa36869-4afe-485a-90a3-93fba1b5d15e
2014-04-03 01:02:23.304323 7f92e2ced760 -1 librbd::ImageCtx: error finding header: (2) No such file or directory
Removing image: 2014-04-03 01:02:23.312212 7f92e2ced760 -1 librbd: error removing img from new-style directory: (2) No such file or directory 0% complete...failed.
rbd: delete error: (2) No such file or directory
ceph@ceph-admin:~$ rbd rm 6fa36869-4afe-485a-90a3-93fba1b5d15e -p cloudstack
2014-04-03 01:02:34.424626 7fd556d00760 -1 librbd: image has snapshots - not removing
Removing image: 0% complete...failed.
rbd: image has snapshots - these must be deleted with 'rbd snap purge' before the image can be removed.
ceph@ceph-admin:~$ rbd snap purge 6fa36869-4afe-485a-90a3-93fba1b5d15e -p cloudstack
Removing all snapshots2014-04-03 01:02:46.863370 7f2949461760 -1 librbd: removing snapshot from header failed: (16) Device or resource busy : 0% complete...failed.
rbd: removing snaps failed: (16) Device or resource busy

Lastly, how can I remove a user from the auth list?

Regards, Jon
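A few commands that may help track this down; apart from the pool and image names taken from the post, the snapshot and user names below are placeholders:

    # see which pools are still holding the space; deleted objects are reclaimed asynchronously
    ceph df
    rados df

    # look at what's left in the pool and on the image
    rbd ls -p cloudstack
    rbd snap ls cloudstack/6fa36869-4afe-485a-90a3-93fba1b5d15e

    # if a snapshot is protected (CloudStack protects snapshots that back clones),
    # it has to be unprotected before snap purge can remove it
    rbd snap unprotect cloudstack/6fa36869-4afe-485a-90a3-93fba1b5d15e@<snapname>

    # removing a user from the auth list
    ceph auth list
    ceph auth del client.someuser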
[ceph-users] heartbeat_map is_healthy had timed out after 15
I'm seeing one OSD spamming its log with

2014-04-02 16:49:21.547339 7f5cc6c5d700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f5cc3456700' had timed out after 15

It starts about 30 seconds after the OSD daemon is started. It continues until

2014-04-02 16:48:57.526925 7f0e5a683700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0e3c857700' had suicide timed out after 150
2014-04-02 16:48:57.528008 7f0e5a683700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f0e5a683700 time 2014-04-02 16:48:57.526948 common/HeartbeatMap.cc: 79: FAILED assert(0 == hit suicide timeout)

I tried bumping up logging, and I don't see anything interesting. I tried strace, and all I can really see is that the OSD spends a lot of time in FUTEX_WAIT.

This OSD has been flapping for several days now. None of the other OSDs are having this issue. I thought it might be similar to Quenten Grasso's post about 'OSD Restarts cause excessively high load average and requests are blocked 32 sec'. At first it looks similar, but Quenten said his OSDs eventually settle down. Mine never does.

Can I increase that 15 second timeout, to see if it just needs additional time? I don't see anything in the ceph docs about this. Otherwise, I'm pretty close to removing the disk, zapping it, and adding it back to the cluster. Any other suggestions?

--
*Craig Lewis*
Senior Systems Engineer, Central Desktop
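The two messages correspond to the op thread timeout (default 15 s) and its suicide timeout (default 150 s), and both can be raised on just the affected OSD. A hedged sketch; the osd id and the values are arbitrary examples:

    # while the daemon is running
    ceph tell osd.11 injectargs '--osd-op-thread-timeout 60 --osd-op-thread-suicide-timeout 600'

    # or persistently, in ceph.conf on that host
    [osd.11]
        osd op thread timeout = 60
        osd op thread suicide timeout = 600

This only buys time; if the thread is stuck behind a dying disk, the suicide timeout will still fire eventually, so checking the drive underneath (smartctl, dmesg) in parallel is worthwhile.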
[ceph-users] Backup Restore?
Hi, what are the options to consistently backup and restore data out of a ceph cluster?

- RBDs can be snapshotted.
- Data on RBDs used inside VMs can be backed up using tools from the guest.
- CephFS data can be backed up using rsync or similar tools.

What about object data in other pools?

There are two scenarios where a backup is needed:
- disaster recovery, i.e. the whole cluster goes nuts
- single item restore, because of PEBKAC or application error

Is there any work in progress to cover these? Regards

--
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin
http://www.heinlein-support.de
Tel: 030 / 405051-43 Fax: 030 / 405051-19
Zwangsangaben lt. §35a GmbHG: HRB 93818 B / Amtsgericht Berlin-Charlottenburg, Geschäftsführer: Peer Heinlein -- Sitz: Berlin
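On the first bullet: an RBD snapshot on its own is only crash-consistent; freezing the filesystem in the guest first gives a clean, mountable copy. A minimal sketch, with the mount point, pool, image, and snapshot names as examples:

    # inside the guest, pause writes to the filesystem on the RBD-backed disk
    fsfreeze --freeze /mnt/data

    # on a Ceph client node, take the snapshot
    rbd snap create rbd/vm01-disk@nightly

    # back inside the guest
    fsfreeze --unfreeze /mnt/data

    # the snapshot can then be exported off-cluster
    rbd export rbd/vm01-disk@nightly /backup/vm01-disk-nightly.img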
Re: [ceph-users] rbd map error - numerical result out of range
Hi again Ilya, No, no snapshots in this case. It's a brand new RBD that I've created. Cheers. Tom.

On 01/04/14 16:08, Ilya Dryomov wrote:
On Tue, Apr 1, 2014 at 6:55 PM, Tom t...@t0mb.net wrote: Thanks for the reply. Ceph is version 0.73-1precise, and the kernel release is 3.11.9-031109-generic. also rbd showmapped shows 16 lines of output.
Are there snapshots involved? Thanks, Ilya
Re: [ceph-users] Backup Restore?
Hi Robert, Thanks for raising this question; backup and restore options have always been interesting to discuss. I too have a related question for Inktank: is there any work going on to support backing up a Ceph cluster with the enterprise *proprietary* backup solutions available today?

Karan Singh
CSC - IT Center for Science, Keilaranta 14, P. O. Box 405, FIN-02101 Espoo, Finland
mobile: +358 503 812758 tel. +358 9 4572001 fax +358 9 4572302 http://www.csc.fi/

On 02 Apr 2014, at 10:08, Robert Sander r.san...@heinlein-support.de wrote:
Hi, what are the options to consistently backup and restore data out of a ceph cluster?
[ceph-users] OpenStack + Ceph Integration
I integrated Ceph + OpenStack following this document: https://ceph.com/docs/master/rbd/rbd-openstack/

I could put an image into Glance on the Ceph cluster, but I cannot create any volumes with Cinder. The error messages are the same as at this URL: http://comments.gmane.org/gmane.comp.file-systems.ceph.user/7641

---
2014-04-02 17:31:57.799 22321 ERROR cinder.volume.drivers.rbd [req-b18d0e8d-c818-4fb4-9dd8-dbdd938f919b None None] error connecting to ceph cluster
2014-04-02 17:31:57.799 22321 TRACE cinder.volume.drivers.rbd Traceback (most recent call last):
2014-04-02 17:31:57.799 22321 TRACE cinder.volume.drivers.rbd File /usr/lib/python2.7/dist-packages/cinder/volume/drivers/rbd.py, line 262, in check_for_setup_error
2014-04-02 17:31:57.799 22321 TRACE cinder.volume.drivers.rbd with RADOSClient(self):
2014-04-02 17:31:57.799 22321 TRACE cinder.volume.drivers.rbd File /usr/lib/python2.7/dist-packages/cinder/volume/drivers/rbd.py, line 234, in __init__
2014-04-02 17:31:57.799 22321 TRACE cinder.volume.drivers.rbd self.cluster, self.ioctx = driver._connect_to_rados(pool)
2014-04-02 17:31:57.799 22321 TRACE cinder.volume.drivers.rbd File /usr/lib/python2.7/dist-packages/cinder/volume/drivers/rbd.py, line 282, in _connect_to_rados
2014-04-02 17:31:57.799 22321 TRACE cinder.volume.drivers.rbd client.connect()
2014-04-02 17:31:57.799 22321 TRACE cinder.volume.drivers.rbd File /usr/lib/python2.7/dist-packages/rados.py, line 408, in connect
2014-04-02 17:31:57.799 22321 TRACE cinder.volume.drivers.rbd raise make_ex(ret, error calling connect)
2014-04-02 17:31:57.799 22321 TRACE cinder.volume.drivers.rbd ObjectNotFound: error calling connect
2014-04-02 17:31:57.799 22321 TRACE cinder.volume.drivers.rbd
2014-04-02 17:31:57.800 22321 ERROR cinder.volume.manager [req-b18d0e8d-c818-4fb4-9dd8-dbdd938f919b None None] Error encountered during initialization of driver: RBDDriver
2014-04-02 17:31:57.801 22321 ERROR cinder.volume.manager [req-b18d0e8d-c818-4fb4-9dd8-dbdd938f919b None None] Bad or unexpected response from the storage volume backend API: error connecting to ceph cluster
2014-04-02 17:31:57.801 22321 TRACE cinder.volume.manager Traceback (most recent call last):
2014-04-02 17:31:57.801 22321 TRACE cinder.volume.manager File /usr/lib/python2.7/dist-packages/cinder/volume/manager.py, line 190, in init_host
2014-04-02 17:31:57.801 22321 TRACE cinder.volume.manager self.driver.check_for_setup_error()
2014-04-02 17:31:57.801 22321 TRACE cinder.volume.manager File /usr/lib/python2.7/dist-packages/cinder/volume/drivers/rbd.py, line 267, in check_for_setup_error
2014-04-02 17:31:57.801 22321 TRACE cinder.volume.manager raise exception.VolumeBackendAPIException(data=msg)
2014-04-02 17:31:57.801 22321 TRACE cinder.volume.manager VolumeBackendAPIException: Bad or unexpected response from the storage volume backend API: error connecting to ceph cluster

so I added these lines to /etc/ceph/ceph.conf

[client.cinder]
key = key_id

but I still could not create any volumes with Cinder. Does anyone have an idea? Thanks from cloudy Tokyo.
--
Tomokazu HIRAI (@jedipunkz)
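ObjectNotFound from client.connect() usually means the cinder-volume process can't find the key it is supposed to authenticate with. A hedged sketch of the pieces the rbd-openstack guide expects, with the user, pool, and uuid values as examples to adapt:

    # /etc/cinder/cinder.conf
    volume_driver = cinder.volume.drivers.rbd.RBDDriver
    rbd_pool = volumes
    rbd_ceph_conf = /etc/ceph/ceph.conf
    rbd_user = cinder
    rbd_secret_uuid = <uuid registered with libvirt for the client.cinder key>

    # the key belongs in a keyring file readable by the cinder service user,
    # not as a bare "key =" line in ceph.conf:
    # /etc/ceph/ceph.client.cinder.keyring
    [client.cinder]
        key = AQ...==

    # quick test that the cinder user can actually reach the cluster:
    ceph --id cinder -s

If rbd_user isn't set, the driver tries to connect with the default client name, which tends to produce exactly this failure when that key isn't available on the cinder host.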
Re: [ceph-users] radosgw multipart-uploaded downloads fail
Hi Yehuda, i tried your patch and it feels fine, except you might need some special handling for those already corrupt uploads, as trying to delete them gets radosgw in an endless loop and high cpu usage: 2014-04-02 11:03:15.045627 7fbf157d2700 0 RGWObjManifest::operator++(): result: ofs=33554432 stripe_ofs=33554432 part_ofs=33554432 rule-part_size=0 2014-04-02 11:03:15.045628 7fbf157d2700 20 RGWObjManifest::operator++(): rule-part_size=0 rules.size()=1 2014-04-02 11:03:15.045629 7fbf157d2700 0 RGWObjManifest::operator++(): result: ofs=33554432 stripe_ofs=33554432 part_ofs=33554432 rule-part_size=0 2014-04-02 11:03:15.045631 7fbf157d2700 20 RGWObjManifest::operator++(): rule-part_size=0 rules.size()=1 2014-04-02 11:03:15.045632 7fbf157d2700 0 RGWObjManifest::operator++(): result: ofs=33554432 stripe_ofs=33554432 part_ofs=33554432 rule-part_size=0 2014-04-02 11:03:15.045634 7fbf157d2700 20 RGWObjManifest::operator++(): rule-part_size=0 rules.size()=1 2014-04-02 11:03:15.045634 7fbf157d2700 0 RGWObjManifest::operator++(): result: ofs=33554432 stripe_ofs=33554432 part_ofs=33554432 rule-part_size=0 2014-04-02 11:03:15.045636 7fbf157d2700 20 RGWObjManifest::operator++(): rule-part_size=0 rules.size()=1 2014-04-02 11:03:15.045637 7fbf157d2700 0 RGWObjManifest::operator++(): result: ofs=33554432 stripe_ofs=33554432 part_ofs=33554432 rule-part_size=0 2014-04-02 11:03:15.045639 7fbf157d2700 20 RGWObjManifest::operator++(): rule-part_size=0 rules.size()=1 2014-04-02 11:03:15.045639 7fbf157d2700 0 RGWObjManifest::operator++(): result: ofs=33554432 stripe_ofs=33554432 part_ofs=33554432 rule-part_size=0 2014-04-02 11:03:15.045641 7fbf157d2700 20 RGWObjManifest::operator++(): rule-part_size=0 rules.size()=1 2014-04-02 11:03:15.045642 7fbf157d2700 0 RGWObjManifest::operator++(): result: ofs=33554432 stripe_ofs=33554432 part_ofs=33554432 rule-part_size=0 2014-04-02 11:03:15.045644 7fbf157d2700 20 RGWObjManifest::operator++(): rule-part_size=0 rules.size()=1 2014-04-02 11:03:15.045644 7fbf157d2700 0 RGWObjManifest::operator++(): result: ofs=33554432 stripe_ofs=33554432 part_ofs=33554432 rule-part_size=0 2014-04-02 11:03:15.045646 7fbf157d2700 20 RGWObjManifest::operator++(): rule-part_size=0 rules.size()=1 2014-04-02 11:03:15.045647 7fbf157d2700 0 RGWObjManifest::operator++(): result: ofs=33554432 stripe_ofs=33554432 part_ofs=33554432 rule-part_size=0 2014-04-02 11:03:15.045649 7fbf157d2700 20 RGWObjManifest::operator++(): rule-part_size=0 rules.size()=1 2014-04-02 11:03:15.045649 7fbf157d2700 0 RGWObjManifest::operator++(): result: ofs=33554432 stripe_ofs=33554432 part_ofs=33554432 rule-part_size=0 2014-04-02 11:03:15.045651 7fbf157d2700 20 RGWObjManifest::operator++(): rule-part_size=0 rules.size()=1 2014-04-02 11:03:15.045652 7fbf157d2700 0 RGWObjManifest::operator++(): result: ofs=33554432 stripe_ofs=33554432 part_ofs=33554432 rule-part_size=0 2014-04-02 11:03:15.045654 7fbf157d2700 20 RGWObjManifest::operator++(): rule-part_size=0 rules.size()=1 2014-04-02 11:03:15.045654 7fbf157d2700 0 RGWObjManifest::operator++(): result: ofs=33554432 stripe_ofs=33554432 part_ofs=33554432 rule-part_size=0 2014-04-02 11:03:15.045656 7fbf157d2700 20 RGWObjManifest::operator++(): rule-part_size=0 rules.size()=1 2014-04-02 11:03:15.045657 7fbf157d2700 0 RGWObjManifest::operator++(): result: ofs=33554432 stripe_ofs=33554432 part_ofs=33554432 rule-part_size=0 2014-04-02 11:03:15.045659 7fbf157d2700 20 RGWObjManifest::operator++(): rule-part_size=0 rules.size()=1 2014-04-02 11:03:15.045660 7fbf157d2700 0 
RGWObjManifest::operator++(): result: ofs=33554432 stripe_ofs=33554432 part_ofs=33554432 rule-part_size=0 2014-04-02 11:03:15.045661 7fbf157d2700 20 RGWObjManifest::operator++(): rule-part_size=0 rules.size()=1 2014-04-02 11:03:15.045662 7fbf157d2700 0 RGWObjManifest::operator++(): result: ofs=33554432 stripe_ofs=33554432 part_ofs=33554432 rule-part_size=0 2014-04-02 11:03:15.045664 7fbf157d2700 20 RGWObjManifest::operator++(): rule-part_size=0 rules.size()=1 2014-04-02 11:03:15.045665 7fbf157d2700 0 RGWObjManifest::operator++(): result: ofs=33554432 stripe_ofs=33554432 part_ofs=33554432 rule-part_size=0 2014-04-02 11:03:15.045667 7fbf157d2700 20 RGWObjManifest::operator++(): rule-part_size=0 rules.size()=1 2014-04-02 11:03:15.045667 7fbf157d2700 0 RGWObjManifest::operator++(): result: ofs=33554432 stripe_ofs=33554432 part_ofs=33554432 rule-part_size=0 2014-04-02 11:03:15.045669 7fbf157d2700 20 RGWObjManifest::operator++(): rule-part_size=0 rules.size()=1 2014-04-02 11:03:15.045670 7fbf157d2700 0 RGWObjManifest::operator++(): result: ofs=33554432 stripe_ofs=33554432 part_ofs=33554432 rule-part_size=0 2014-04-02 11:03:15.045672 7fbf157d2700 20 RGWObjManifest::operator++(): rule-part_size=0 rules.size()=1 Thx Benedikt ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph 0.78 mon and mds crashing (bus error)
- Message from Gregory Farnum g...@inktank.com - Date: Tue, 1 Apr 2014 09:03:17 -0700 From: Gregory Farnum g...@inktank.com Subject: Re: [ceph-users] ceph 0.78 mon and mds crashing (bus error) To: Yan, Zheng uker...@gmail.com Cc: Kenneth Waegeman kenneth.waege...@ugent.be, ceph-users ceph-users@lists.ceph.com On Tue, Apr 1, 2014 at 7:12 AM, Yan, Zheng uker...@gmail.com wrote: On Tue, Apr 1, 2014 at 10:02 PM, Kenneth Waegeman kenneth.waege...@ugent.be wrote: After some more searching, I've found that the source of the problem is with the mds and not the mon.. The mds crashes, generates a core dump that eats the local space, and in turn the monitor (because of leveldb) crashes. The error in the mds log of one host: 2014-04-01 15:46:34.414615 7f870e319700 0 -- 10.141.8.180:6836/13152 10.141.8.180:6789/0 pipe(0x517371180 sd=54 :42439 s=4 pgs=0 cs=0 l=1 c=0x147ac780).connect got RESETSESSION but no longer connecting 2014-04-01 15:46:34.438792 7f871194f700 0 -- 10.141.8.180:6836/13152 10.141.8.180:6789/0 pipe(0x1b099f580 sd=8 :43150 s=4 pgs=0 cs=0 l=1 c=0x1fd44360).connect got RESETSESSION but no longer connecting 2014-04-01 15:46:34.439028 7f870e319700 0 -- 10.141.8.180:6836/13152 10.141.8.182:6789/0 pipe(0x13aa64880 sd=54 :37085 s=4 pgs=0 cs=0 l=1 c=0x1fd43de0).connect got RESETSESSION but no longer connecting 2014-04-01 15:46:34.468257 7f871b7ae700 -1 mds/CDir.cc: In function 'void CDir::_omap_fetched(ceph::bufferlist, std::mapstd::basic_stringchar, std::char_traitschar, std::allocatorchar , ceph::buffer::list, std::lessstd::basic_stringchar, std::char_traitschar, std::allocatorchar , std::allocatorstd::pairconst std::basic_stringchar, std::char_traitschar, std::allocatorchar , ceph::buffer::list , const std::string, int)' thread 7f871b7ae700 time 2014-04-01 15:46:34.448320 mds/CDir.cc: 1474: FAILED assert(r == 0 || r == -2 || r == -61) could you use gdb to check what is value of variable 'r' . If you look at the crash dump log you can see the return value in the osd_op_reply message: -1 2014-04-01 15:46:34.440860 7f871b7ae700 1 -- 10.141.8.180:6836/13152 == osd.3 10.141.8.180:6827/4366 33077 osd_op_reply(4179177 11f2ef1. [omap-get-header 0~0,omap-get-vals 0~16] v0'0 uv0 ack = -108 (Cannot send after transport endpoint shutdown)) v6 229+0+0 (958358678 0 0) 0x2cff7aa80 con 0x37ea3c0 -108, which is ESHUTDOWN, but we also use it (via the 108 constant, I think because ESHUTDOWN varies across platforms) as EBLACKLISTED. So it looks like this is itself actually a symptom of another problem that is causing the MDS to get timed out on the monitor. If a core dump is eating the local space, maybe the MDS is stuck in an infinite allocation loop of some kind? How big are your disks, Kenneth? Do you have any information on how much CPU/memory the MDS was using before this? I monitored the mds process after restart: PID USER PR NI VIRT RES SHR S %CPU %MEMTIME+ COMMAND 19215 root 20 0 6070m 5.7g 5236 S 778.6 18.1 1:27.54 ceph-mds 19215 root 20 0 7926m 7.5g 5236 S 179.2 23.8 2:44.39 ceph-mds 19215 root 20 0 12.4g 12g 5236 S 157.2 38.8 3:43.47 ceph-mds 19215 root 20 0 16.6g 16g 5236 S 144.4 52.0 4:15.01 ceph-mds 19215 root 20 0 19.9g 19g 5236 S 137.2 62.5 4:35.83 ceph-mds 19215 root 20 0 24.5g 24g 5224 S 136.5 77.0 5:04.66 ceph-mds 19215 root 20 0 25.8g 25g 2944 S 33.7 81.2 5:13.74 ceph-mds 19215 root 20 0 26.0g 25g 2916 S 24.6 81.7 5:19.07 ceph-mds 19215 root 20 0 26.1g 25g 2916 S 13.0 82.1 5:22.16 ceph-mds 19215 root 20 0 27.7g 26g 1856 S 100.0 85.8 5:36.46 ceph-mds Then it crashes. 
I changed the core dump location out of the root fs, the core dump is indeed about 26G My disks: Filesystem Size Used Avail Use% Mounted on /dev/sda2 9.9G 2.9G 6.5G 31% / tmpfs16G 0 16G 0% /dev/shm /dev/sda1 248M 53M 183M 23% /boot /dev/sda4 172G 61G 112G 35% /var/lib/ceph/log/sda4 /dev/sdb187G 61G 127G 33% /var/lib/ceph/log/sdb /dev/sdc3.7T 1.7T 2.0T 47% /var/lib/ceph/osd/sdc /dev/sdd3.7T 1.5T 2.2T 41% /var/lib/ceph/osd/sdd /dev/sde3.7T 1.4T 2.4T 37% /var/lib/ceph/osd/sde /dev/sdf3.7T 1.5T 2.3T 39% /var/lib/ceph/osd/sdf /dev/sdg3.7T 2.1T 1.7T 56% /var/lib/ceph/osd/sdg /dev/sdh3.7T 1.7T 2.0T 47% /var/lib/ceph/osd/sdh /dev/sdi3.7T 1.7T 2.0T 47% /var/lib/ceph/osd/sdi /dev/sdj3.7T 1.7T 2.0T 47% /var/lib/ceph/osd/sdj /dev/sdk3.7T 2.1T 1.6T 58% /var/lib/ceph/osd/sdk /dev/sdl3.7T 1.7T 2.0T 46% /var/lib/ceph/osd/sdl /dev/sdm3.7T 1.5T 2.2T 41% /var/lib/ceph/osd/sdm /dev/sdn3.7T 1.4T 2.3T 38% /var/lib/ceph/osd/sdn -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com - End message from Gregory Farnum g...@inktank.com - -- Met
Re: [ceph-users] Setting root directory in fstab with Fuse
It's been a while, but I think you need to use the long form client_mountpoint config option here instead. If you search the list archives it'll probably turn up; this is basically the only reason we ever discuss -r. ;)
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Wed, Apr 2, 2014 at 5:10 AM, Florent B flor...@coppint.com wrote:
Hi all, I am trying to set up a fuse.ceph mount on a Debian 7 (kernel 3.2). I use the Ceph Emperor version. How can I set a root directory in fstab using fuse.ceph?

I do:
id=mail01,conf=/etc/ceph/ceph.conf,r=/fs1-mail1 /fs1-mail1 fuse.ceph noatime 0 0

But I get this error:
ceph-fuse[23794]: starting ceph client fuse: unknown option `--r=/fs1-mail1'
ceph-fuse[23794]: fuse failed to initialize
2014-04-02 14:04:23.132664 7f7e4fd91760 -1 fuse_lowlevel_new failed
ceph-fuse[23785]: mount failed: (33) Numerical argument out of domain

Whereas when I do:
ceph-fuse -d --id mail01 -m mon.mycompany.net -r /fs1-mail1 /fs1-mail1

It works fine. How can I do that? My configuration file only contains monitor addresses. Passing it as an option would be nice. Thank you a lot
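If the fuse.ceph mount helper passes unrecognised key=value pairs through to ceph-fuse as --key=value (which is what the `--r=/fs1-mail1' error suggests it does), the long-form option should work directly in fstab. A sketch, untested:

    id=mail01,conf=/etc/ceph/ceph.conf,client_mountpoint=/fs1-mail1  /fs1-mail1  fuse.ceph  noatime  0 0

client_mountpoint is the config-file name behind -r, so it could also be set under [client.mail01] in ceph.conf instead of in the fstab line.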
Re: [ceph-users] ceph 0.78 mon and mds crashing (bus error)
hi gregory, (i'm a colleague of kenneth) 1) How big and what shape the filesystem is. Do you have some extremely large directory that the MDS keeps trying to load and then dump? anyway to extract this from the mds without having to start it? as it was an rsync operation, i can try to locate possible candidates on the source filesystem, but what would be considered large? 2) Use tcmalloc's heap analyzer to see where all the memory is being allocated. we'll giv ethat a try 3) Look through the logs for when the beacon fails (the first of mds.0.16 is_laggy 600.641332 15 since last acked beacon) and see if there's anything tell-tale going on at the time. anything in particular we should be looking for? the log goes as follows: mds starts around 11:43 ... 2014-04-01 11:44:23.658583 7ffec89c6700 1 mds.0.server reconnect_clients -- 1 sessions 2014-04-01 11:44:41.212488 7ffec89c6700 0 log [DBG] : reconnect by client.4585 10.141.8.199:0/3551 after 17.553854 2014-04-01 11:44:45.692237 7ffec89c6700 1 mds.0.10 reconnect_done 2014-04-01 11:44:45.996384 7ffec89c6700 1 mds.0.10 handle_mds_map i am now mds.0.10 2014-04-01 11:44:45.996388 7ffec89c6700 1 mds.0.10 handle_mds_map state change up:reconnect -- up:rejoin 2014-04-01 11:44:45.996390 7ffec89c6700 1 mds.0.10 rejoin_start 2014-04-01 11:49:53.158471 7ffec89c6700 1 mds.0.10 rejoin_joint_start then lots (4667 lines) of 2014-04-01 11:50:10.237035 7ffebc844700 0 -- 10.141.8.180:6837/55117 10.141.8.180:6789/0 pipe(0x38a7da00 sd=104 :41115 s=4 pgs=0 cs=0 l=1 c=0x6513e8840).connect got RESETSESSION but no longer connecting with one intermediate 2014-04-01 11:51:50.181354 7ffebcf4b700 0 -- 10.141.8.180:6837/55117 10.141.8.180:6789/0 pipe(0x10e282580 sd=103 :0 s=1 pgs=0 cs=0 l=1 c=0xc77d5ee0).fault then sudden change 2014-04-01 11:57:30.440554 7ffebcd49700 0 -- 10.141.8.180:6837/55117 10.141.8.182:6789/0 pipe(0xa1534100 sd=104 :48176 s=4 pgs=0 cs=0 l=1 c=0xd99b11e0).connect got RESETSESSION but no longer connecting 2014-04-01 11:57:30.722607 7ffebec68700 0 -- 10.141.8.180:6837/55117 10.141.8.181:6789/0 pipe(0x1ce98a00 sd=104 :48235 s=1 pgs=0 cs=0 l=1 c=0xc48a3f40).connect got BADAUTHORIZER 2014-04-01 11:57:30.722669 7ffebec68700 0 -- 10.141.8.180:6837/55117 10.141.8.181:6789/0 pipe(0x1ce98a00 sd=104 :48235 s=1 pgs=0 cs=0 l=1 c=0xc48a3f40).connect got BADAUTHORIZER 2014-04-01 11:57:30.722885 7ffebec68700 0 -- 10.141.8.180:6837/55117 10.141.8.181:6789/0 pipe(0x1ce98a00 sd=57 :48237 s=1 pgs=0 cs=0 l=1 c=0xc48a3f40).connect got BADAUTHORIZER 2014-04-01 11:57:30.722945 7ffebec68700 0 -- 10.141.8.180:6837/55117 10.141.8.181:6789/0 pipe(0x1ce98a00 sd=57 :48237 s=1 pgs=0 cs=0 l=1 c=0xc48a3f40).connect got BADAUTHORIZER followed by lots of 2014-04-01 11:57:30.738562 7ffebec68700 0 -- 10.141.8.180:6837/55117 10.141.8.181:6789/0 pipe(0xead9fa80 sd=57 :0 s=1 pgs=0 cs=0 l=1 c=0x10e5d280).fault with sporadic 2014-04-01 11:57:32.431219 7ffebeb67700 0 -- 10.141.8.180:6837/55117 10.141.8.182:6789/0 pipe(0xef85cd80 sd=103 :41218 s=4 pgs=0 cs=0 l=1 c=0x130590dc0).connect got RESETSESSION but no longer connecting until the dmup 2014-04-01 11:59:27.612850 7ffebea66700 0 -- 10.141.8.180:6837/55117 10.141.8.181:6789/0 pipe(0xe3036400 sd=103 :0 s=1 pgs=0 cs=0 l=1 c=0xa7be300).fault 2014-04-01 11:59:27.639009 7ffec89c6700 -1 mds/CDir.cc: In function 'void CDir::_omap_fetched(ceph::bufferlist, std::mapstd::basic_stringchar, std::char_traitschar, std::allocator\ char , ceph::buffer::list, std::lessstd::basic_stringchar, std::char_traitschar, std::allocatorchar , 
std::allocatorstd::pairconst std::basic_stringchar, std::char_trait\ schar, std::allocatorchar , ceph::buffer::list , const std::string, int)' thread 7ffec89c6700 time 2014-04-01 11:59:27.620684 mds/CDir.cc: 1474: FAILED assert(r == 0 || r == -2 || r == -61) ceph version 0.78 (f6c746c314d7b87b8419b6e584c94bfe4511dbd4) 1: (CDir::_omap_fetched(ceph::buffer::list, std::mapstd::string, ceph::buffer::list, std::lessstd::string, std::allocatorstd::pairstd::string const, ceph::buffer::list , st\ d::string const, int)+0x4d71) [0x77c3c1] 2: (Context::complete(int)+0x9) [0x56bb79] 3: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x11a6) [0x806dd6] 4: (MDS::handle_core_message(Message*)+0x9c7) [0x5901d7] 5: (MDS::_dispatch(Message*)+0x2f) [0x59028f] 6: (MDS::ms_dispatch(Message*)+0x1ab) [0x591d4b] 7: (DispatchQueue::entry()+0x582) [0x902072] 8: (DispatchQueue::DispatchThread::entry()+0xd) [0x85ef4d] 9: /lib64/libpthread.so.0() [0x34c36079d1] 10: (clone()+0x6d) [0x34c32e8b6d] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. --- begin dump of recent events --- -1 2014-04-01 11:59:27.137779 7ffec89c6700 5 mds.0.10 initiating monitor reconnect; maybe we're not the slow one - 2014-04-01 11:59:27.137787 7ffec89c6700 10 monclient(hunting): _reopen_session rank -1 name -9998 2014-04-01 11:59:27.137790
Re: [ceph-users] cephx key for CephFS access only
Thanks for the response Greg. Unfortunately, I appear to be missing something. If I use my cephfs key with these perms: client.cephfs key: redacted caps: [mds] allow rwx caps: [mon] allow r caps: [osd] allow rwx pool=data This is what happens when I mount: # ceph-fuse -k /etc/ceph/ceph.client.cephfs.keyring -m ceph0-10g /data ceph-fuse[13533]: starting ceph client ceph-fuse[13533]: ceph mount failed with (1) Operation not permitted ceph-fuse[13531]: mount failed: (1) Operation not permitted But using the admin key works just fine: # ceph-fuse -k /etc/ceph/ceph.client.admin.keyring -m ceph0-10g /data ceph-fuse[13548]: starting ceph client ceph-fuse[13548]: starting fuse The admin key as the following perms: client.admin key: redacted caps: [mds] allow caps: [mon] allow * caps: [osd] allow * Since the mds permissions are functionally equivalent, either I need extra rights on the monitor, or the OSDs. Does a client need to access the metadata pool in order to do a CephFS mount? I'll experiment a bit and report back. On Mon, Mar 31, 2014 at 1:36 PM, Gregory Farnum g...@inktank.com wrote: At present, the only security permission on the MDS is allowed to do stuff, so rwx and * are synonymous. In general * means is an admin, though, so you'll be happier in the future if you use rwx. You may also want a more restrictive set of monitor capabilities as somebody else recently pointed out, but [3] will give you the filesystem access you're looking for. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Fri, Mar 28, 2014 at 9:40 AM, Travis Rhoden trho...@gmail.com wrote: Hi Folks, What would be the right set of capabilities to set for a new client key that has access to CephFS only? I've seen a few different examples: [1] mds 'allow *' mon 'allow r' osd 'allow rwx pool=data' [2] mon 'allow r' osd 'allow rwx pool=data' [3] mds 'allow rwx' mon 'allow r' osd 'allow rwx pool=data' I'm inclined to go with [3]. [1] seems weird for using *, I like seeing rwx. Are these synonymous? [2] seems wrong because it doesn't include anything for MDS. - Travis ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
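For reference, a key with option [3]'s caps can be created (or fetched if it already exists) in one step; a minimal sketch, with the keyring path as an example:

    ceph auth get-or-create client.cephfs \
        mds 'allow rwx' mon 'allow r' osd 'allow rwx pool=data' \
        -o /etc/ceph/ceph.client.cephfs.keyring

An existing key's capabilities can also be changed later with `ceph auth caps client.cephfs ...` without regenerating the key itself.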
Re: [ceph-users] cephx key for CephFS access only
Hrm, I don't remember. Let me know which permutation works and we can dig into it. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Wed, Apr 2, 2014 at 9:00 AM, Travis Rhoden trho...@gmail.com wrote: Thanks for the response Greg. Unfortunately, I appear to be missing something. If I use my cephfs key with these perms: client.cephfs key: redacted caps: [mds] allow rwx caps: [mon] allow r caps: [osd] allow rwx pool=data This is what happens when I mount: # ceph-fuse -k /etc/ceph/ceph.client.cephfs.keyring -m ceph0-10g /data ceph-fuse[13533]: starting ceph client ceph-fuse[13533]: ceph mount failed with (1) Operation not permitted ceph-fuse[13531]: mount failed: (1) Operation not permitted But using the admin key works just fine: # ceph-fuse -k /etc/ceph/ceph.client.admin.keyring -m ceph0-10g /data ceph-fuse[13548]: starting ceph client ceph-fuse[13548]: starting fuse The admin key as the following perms: client.admin key: redacted caps: [mds] allow caps: [mon] allow * caps: [osd] allow * Since the mds permissions are functionally equivalent, either I need extra rights on the monitor, or the OSDs. Does a client need to access the metadata pool in order to do a CephFS mount? I'll experiment a bit and report back. On Mon, Mar 31, 2014 at 1:36 PM, Gregory Farnum g...@inktank.com wrote: At present, the only security permission on the MDS is allowed to do stuff, so rwx and * are synonymous. In general * means is an admin, though, so you'll be happier in the future if you use rwx. You may also want a more restrictive set of monitor capabilities as somebody else recently pointed out, but [3] will give you the filesystem access you're looking for. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Fri, Mar 28, 2014 at 9:40 AM, Travis Rhoden trho...@gmail.com wrote: Hi Folks, What would be the right set of capabilities to set for a new client key that has access to CephFS only? I've seen a few different examples: [1] mds 'allow *' mon 'allow r' osd 'allow rwx pool=data' [2] mon 'allow r' osd 'allow rwx pool=data' [3] mds 'allow rwx' mon 'allow r' osd 'allow rwx pool=data' I'm inclined to go with [3]. [1] seems weird for using *, I like seeing rwx. Are these synonymous? [2] seems wrong because it doesn't include anything for MDS. - Travis ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] cephx key for CephFS access only
Ah, I figured it out. My original key worked, but I needed to use the --id option with ceph-fuse to tell it to use the cephfs user rather than the admin user. Tailing the log on my monitor pointed out that it was logging in with client.admin, but providing the key for client.cephfs. So, final working command is: ceph-fuse -k /etc/ceph/ceph.client.cephfs.keyring --id cephfs -m ceph0-10g /data I will note that neither the -k or --id options are present in man ceph-fuse, ceph-fuse --help, or in the Ceph docs, really. An example using -k is found here: http://ceph.com/docs/master/start/quick-cephfs/#filesystem-in-user-space-fuse, but there is never any mention of needing to change users if you are not using client.admin. In fact, using the search functionality on ceph-fuse returns zero results. If I'm ambitious I'll submit changes for the docs... Thanks for the help! - Travis On Wed, Apr 2, 2014 at 12:00 PM, Travis Rhoden trho...@gmail.com wrote: Thanks for the response Greg. Unfortunately, I appear to be missing something. If I use my cephfs key with these perms: client.cephfs key: redacted caps: [mds] allow rwx caps: [mon] allow r caps: [osd] allow rwx pool=data This is what happens when I mount: # ceph-fuse -k /etc/ceph/ceph.client.cephfs.keyring -m ceph0-10g /data ceph-fuse[13533]: starting ceph client ceph-fuse[13533]: ceph mount failed with (1) Operation not permitted ceph-fuse[13531]: mount failed: (1) Operation not permitted But using the admin key works just fine: # ceph-fuse -k /etc/ceph/ceph.client.admin.keyring -m ceph0-10g /data ceph-fuse[13548]: starting ceph client ceph-fuse[13548]: starting fuse The admin key as the following perms: client.admin key: redacted caps: [mds] allow caps: [mon] allow * caps: [osd] allow * Since the mds permissions are functionally equivalent, either I need extra rights on the monitor, or the OSDs. Does a client need to access the metadata pool in order to do a CephFS mount? I'll experiment a bit and report back. On Mon, Mar 31, 2014 at 1:36 PM, Gregory Farnum g...@inktank.com wrote: At present, the only security permission on the MDS is allowed to do stuff, so rwx and * are synonymous. In general * means is an admin, though, so you'll be happier in the future if you use rwx. You may also want a more restrictive set of monitor capabilities as somebody else recently pointed out, but [3] will give you the filesystem access you're looking for. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Fri, Mar 28, 2014 at 9:40 AM, Travis Rhoden trho...@gmail.com wrote: Hi Folks, What would be the right set of capabilities to set for a new client key that has access to CephFS only? I've seen a few different examples: [1] mds 'allow *' mon 'allow r' osd 'allow rwx pool=data' [2] mon 'allow r' osd 'allow rwx pool=data' [3] mds 'allow rwx' mon 'allow r' osd 'allow rwx pool=data' I'm inclined to go with [3]. [1] seems weird for using *, I like seeing rwx. Are these synonymous? [2] seems wrong because it doesn't include anything for MDS. - Travis ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph 0.78 mon and mds crashing (bus error)
hi, 1) How big and what shape the filesystem is. Do you have some extremely large directory that the MDS keeps trying to load and then dump? anyway to extract this from the mds without having to start it? as it was an rsync operation, i can try to locate possible candidates on the source filesystem, but what would be considered large? total number of files 13M, spread over 800k directories, but it's unclear how far the sync was at time of failing. i've not found a good way to for directories with lots of files and/or subdirs. 2) Use tcmalloc's heap analyzer to see where all the memory is being allocated. we'll giv ethat a try i run ceph-mds with HEAPCHECK=normal (via the init script), but how can we stop mds without killing it? the heapchecker only seems to dump at the end of a run, maybe there's a way to have intermediate dump like valgrind, but the documentation is not very helpful. stijn 3) Look through the logs for when the beacon fails (the first of mds.0.16 is_laggy 600.641332 15 since last acked beacon) and see if there's anything tell-tale going on at the time. anything in particular we should be looking for? the log goes as follows: mds starts around 11:43 ... 2014-04-01 11:44:23.658583 7ffec89c6700 1 mds.0.server reconnect_clients -- 1 sessions 2014-04-01 11:44:41.212488 7ffec89c6700 0 log [DBG] : reconnect by client.4585 10.141.8.199:0/3551 after 17.553854 2014-04-01 11:44:45.692237 7ffec89c6700 1 mds.0.10 reconnect_done 2014-04-01 11:44:45.996384 7ffec89c6700 1 mds.0.10 handle_mds_map i am now mds.0.10 2014-04-01 11:44:45.996388 7ffec89c6700 1 mds.0.10 handle_mds_map state change up:reconnect -- up:rejoin 2014-04-01 11:44:45.996390 7ffec89c6700 1 mds.0.10 rejoin_start 2014-04-01 11:49:53.158471 7ffec89c6700 1 mds.0.10 rejoin_joint_start then lots (4667 lines) of 2014-04-01 11:50:10.237035 7ffebc844700 0 -- 10.141.8.180:6837/55117 10.141.8.180:6789/0 pipe(0x38a7da00 sd=104 :41115 s=4 pgs=0 cs=0 l=1 c=0x6513e8840).connect got RESETSESSION but no longer connecting with one intermediate 2014-04-01 11:51:50.181354 7ffebcf4b700 0 -- 10.141.8.180:6837/55117 10.141.8.180:6789/0 pipe(0x10e282580 sd=103 :0 s=1 pgs=0 cs=0 l=1 c=0xc77d5ee0).fault then sudden change 2014-04-01 11:57:30.440554 7ffebcd49700 0 -- 10.141.8.180:6837/55117 10.141.8.182:6789/0 pipe(0xa1534100 sd=104 :48176 s=4 pgs=0 cs=0 l=1 c=0xd99b11e0).connect got RESETSESSION but no longer connecting 2014-04-01 11:57:30.722607 7ffebec68700 0 -- 10.141.8.180:6837/55117 10.141.8.181:6789/0 pipe(0x1ce98a00 sd=104 :48235 s=1 pgs=0 cs=0 l=1 c=0xc48a3f40).connect got BADAUTHORIZER 2014-04-01 11:57:30.722669 7ffebec68700 0 -- 10.141.8.180:6837/55117 10.141.8.181:6789/0 pipe(0x1ce98a00 sd=104 :48235 s=1 pgs=0 cs=0 l=1 c=0xc48a3f40).connect got BADAUTHORIZER 2014-04-01 11:57:30.722885 7ffebec68700 0 -- 10.141.8.180:6837/55117 10.141.8.181:6789/0 pipe(0x1ce98a00 sd=57 :48237 s=1 pgs=0 cs=0 l=1 c=0xc48a3f40).connect got BADAUTHORIZER 2014-04-01 11:57:30.722945 7ffebec68700 0 -- 10.141.8.180:6837/55117 10.141.8.181:6789/0 pipe(0x1ce98a00 sd=57 :48237 s=1 pgs=0 cs=0 l=1 c=0xc48a3f40).connect got BADAUTHORIZER followed by lots of 2014-04-01 11:57:30.738562 7ffebec68700 0 -- 10.141.8.180:6837/55117 10.141.8.181:6789/0 pipe(0xead9fa80 sd=57 :0 s=1 pgs=0 cs=0 l=1 c=0x10e5d280).fault with sporadic 2014-04-01 11:57:32.431219 7ffebeb67700 0 -- 10.141.8.180:6837/55117 10.141.8.182:6789/0 pipe(0xef85cd80 sd=103 :41218 s=4 pgs=0 cs=0 l=1 c=0x130590dc0).connect got RESETSESSION but no longer connecting until the dmup 2014-04-01 11:59:27.612850 
7ffebea66700 0 -- 10.141.8.180:6837/55117 10.141.8.181:6789/0 pipe(0xe3036400 sd=103 :0 s=1 pgs=0 cs=0 l=1 c=0xa7be300).fault 2014-04-01 11:59:27.639009 7ffec89c6700 -1 mds/CDir.cc: In function 'void CDir::_omap_fetched(ceph::bufferlist, std::mapstd::basic_stringchar, std::char_traitschar, std::allocator\ char , ceph::buffer::list, std::lessstd::basic_stringchar, std::char_traitschar, std::allocatorchar , std::allocatorstd::pairconst std::basic_stringchar, std::char_trait\ schar, std::allocatorchar , ceph::buffer::list , const std::string, int)' thread 7ffec89c6700 time 2014-04-01 11:59:27.620684 mds/CDir.cc: 1474: FAILED assert(r == 0 || r == -2 || r == -61) ceph version 0.78 (f6c746c314d7b87b8419b6e584c94bfe4511dbd4) 1: (CDir::_omap_fetched(ceph::buffer::list, std::mapstd::string, ceph::buffer::list, std::lessstd::string, std::allocatorstd::pairstd::string const, ceph::buffer::list , st\ d::string const, int)+0x4d71) [0x77c3c1] 2: (Context::complete(int)+0x9) [0x56bb79] 3: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x11a6) [0x806dd6] 4: (MDS::handle_core_message(Message*)+0x9c7) [0x5901d7] 5: (MDS::_dispatch(Message*)+0x2f) [0x59028f] 6: (MDS::ms_dispatch(Message*)+0x1ab) [0x591d4b] 7: (DispatchQueue::entry()+0x582) [0x902072] 8: (DispatchQueue::DispatchThread::entry()+0xd) [0x85ef4d] 9: /lib64/libpthread.so.0() [0x34c36079d1] 10:
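On the heap-profiler question above: with tcmalloc, the running daemon can be told to start the profiler and write intermediate dumps through the tell interface, so the MDS never has to be stopped to get a profile. A hedged sketch; the daemon id and paths are examples:

    ceph tell mds.0 heap start_profiler
    # let it grow for a while, then write a profile without stopping the daemon
    ceph tell mds.0 heap dump
    ceph tell mds.0 heap stats
    ceph tell mds.0 heap stop_profiler
    # the .heap files land in the daemon's log directory and can be read with, e.g.
    # google-pprof --text /usr/bin/ceph-mds /var/log/ceph/mds.0.profile.0001.heap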