Re: [ceph-users] Ceph Dumpling/Firefly/Hammer SSD/Memstore performance comparison
Nice work, Mark! I don't see any tuning for sharding in the sample config file (osd_op_num_threads_per_shard, osd_op_num_shards, ...). Since you only use one SSD for the benchmark, I think tuning these should improve the Hammer results?

----- Original Message -----
From: Mark Nelson mnel...@redhat.com
To: ceph-devel ceph-de...@vger.kernel.org
Cc: ceph-users ceph-users@lists.ceph.com
Sent: Tuesday, 17 February 2015 18:37:01
Subject: [ceph-users] Ceph Dumpling/Firefly/Hammer SSD/Memstore performance comparison

Hi All,

I wrote up a short document describing some tests I ran recently to look at how SSD-backed OSD performance has changed across our LTS releases. This is just looking at RADOS performance and not RBD or RGW. It also doesn't offer any real explanations regarding the results. It's just a first high-level step toward understanding some of the behaviors folks on the mailing list have reported over the last couple of releases. I hope you find it useful.

Mark
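(For reference, a minimal sketch of what such tuning might look like in ceph.conf on the OSD nodes. The option names are the ones named above; the values are purely illustrative guesses, not recommendations. The defaults quoted later in this thread are 5 shards and 2 threads per shard.)

    [osd]
    # split the op queue into more shards, e.g. for a fast PCIe SSD
    osd_op_num_shards = 10
    # worker threads servicing each shard
    osd_op_num_threads_per_shard = 2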
Re: [ceph-users] ceph-giant installation error on centos 6.6
Thanks Brad. That solved the problem. I mistakenly assumed all dependencies were in http://ceph.com/rpm-giant/el6/x86_64/.

Regards, Wenxiao

On Tue, Feb 17, 2015 at 10:37 PM, Brad Hubbard bhubb...@redhat.com wrote:
> On 02/18/2015 12:43 PM, Wenxiao He wrote:
>> Hello, I need some help as I am getting package dependency errors when trying to install ceph-giant on CentOS 6.6. See below for repo files and also the yum install output.
>> ---
>> Package python-imaging.x86_64 0:1.1.6-19.el6 will be installed
>> --> Finished Dependency Resolution
>> Error: Package: 1:librbd1-0.87-0.el6.x86_64 (Ceph)
>>        Requires: liblttng-ust.so.0()(64bit)
>> Error: Package: gperftools-libs-2.0-11.el6.3.x86_64 (Ceph)
>>        Requires: libunwind.so.8()(64bit)
>> Error: Package: 1:librados2-0.87-0.el6.x86_64 (Ceph)
>>        Requires: liblttng-ust.so.0()(64bit)
>> Error: Package: 1:ceph-0.87-0.el6.x86_64 (Ceph)
>>        Requires: liblttng-ust.so.0()(64bit)
>
> Looks like you may need to install libunwind and lttng-ust from EPEL 6? They seem to be the packages that supply liblttng-ust.so and libunwind.so, so you could try installing those from EPEL 6 and see how that goes? Note that this should not be taken as the, or even an, authoritative answer :)
>
> Cheers, Brad
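(For anyone hitting the same errors, a sketch of the fix Brad describes, assuming a stock CentOS 6.6 box; epel-release is available via yum, as Travis notes later in this thread:)

    # enable EPEL 6, then pull in the two libraries the Ceph packages need
    yum install -y epel-release
    yum install -y libunwind lttng-ust
    # retry the Ceph install afterwards
    yum install -y ceph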
[ceph-users] Updating monmap
How do I update the Ceph monmap after extracting it and removing an unwanted IP, i.e. how do I get the cleaned monmap back into the cluster?
Re: [ceph-users] ceph-giant installation error on centos 6.6
Note that ceph-deploy would enable EPEL for you automatically on CentOS. When doing a manual installation, the requirement for EPEL is called out here: http://ceph.com/docs/master/install/get-packages/#id8

Though looking at that, we could probably update it to use the now much easier "yum install epel-release". :)

 - Travis

On Wed, Feb 18, 2015 at 12:25 PM, Wenxiao He wenx...@gmail.com wrote:
> Thanks Brad. That solved the problem. I mistakenly assumed all dependencies were in http://ceph.com/rpm-giant/el6/x86_64/. Regards, Wenxiao
> [...]
Re: [ceph-users] Updating monmap
Hi,

use the following command line:

ceph-mon -i {monitor_id} --inject-monmap {updated_monmap_file}

JC

On 18 Feb 2015, at 11:15, SUNDAY A. OLUTAYO olut...@sadeeb.com wrote:
> How do I update the Ceph monmap after extracting it and removing an unwanted IP?
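(For context, a sketch of the usual end-to-end procedure; monitor IDs and paths are placeholders, and the monitor being edited must be stopped before the map is injected:)

    # fetch the current monmap (or extract it from a mon's store if quorum is lost)
    ceph mon getmap -o /tmp/monmap
    # inspect it, then drop the unwanted monitor by its ID
    monmaptool --print /tmp/monmap
    monmaptool --rm mon-to-remove /tmp/monmap
    # stop the monitor, inject the cleaned map, start it again
    /etc/init.d/ceph stop mon.a
    ceph-mon -i a --inject-monmap /tmp/monmap
    /etc/init.d/ceph start mon.a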
[ceph-users] PG stuck degraded, undersized, unclean
We're running ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578), and seeing this:

HEALTH_WARN 1 pgs degraded; 1 pgs stuck degraded; 1 pgs stuck unclean; 1 pgs stuck undersized; 1 pgs undersized
pg 4.2af is stuck unclean for 77192.522960, current state active+undersized+degraded, last acting [50,42]
pg 4.2af is stuck undersized for 980.617479, current state active+undersized+degraded, last acting [50,42]
pg 4.2af is stuck degraded for 980.617902, current state active+undersized+degraded, last acting [50,42]
pg 4.2af is active+undersized+degraded, acting [50,42]

However, "ceph pg query" doesn't really show any issues: https://gist.githubusercontent.com/devicenull/9d911362e4de83c02e40/raw/565fe18163e261c8105e5493a4e90cc3c461ed9d/gistfile1.txt (too long to post here)

I've also tried:

# ceph pg 4.2af mark_unfound_lost revert
pg has no unfound objects

How can I get Ceph to rebuild here? The replica count is 3, but I can't seem to figure out what's going on. Enabling various debug logs doesn't reveal anything obvious to me. I've tried restarting both OSDs, which did nothing.
[ceph-users] 12 March - Ceph Day San Francisco
Hey cephers,

We still have a couple of speaking slots open for Ceph Day San Francisco on 12 March. I'm open to both high-level "what have you been doing with Ceph" type talks as well as more technical "here is what we're writing and/or integrating with Ceph" talks. I know many folks will be at VAULT, but we figured there would still be plenty of folks left on the west coast, so let me know if you'd be interested in speaking. Thanks!

--
Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com || http://community.redhat.com
@scuttlemonkey || @ceph
Re: [ceph-users] PG stuck degraded, undersized, unclean
On Wed, Feb 18, 2015 at 9:09 PM, Brian Rak b...@gameservers.com wrote:
>> What does your crushmap look like (ceph osd getcrushmap -o /tmp/crushmap; crushtool -d /tmp/crushmap)? Does your placement logic prevent Ceph from selecting an OSD for the third replica? Cheers, Florian
>
> I have 5 hosts, and it's configured like this:

That's not the full crushmap, so I'm a bit reduced to guessing...

> root default {
>     id -1   # do not change unnecessarily
>     # weight 204.979
>     alg straw
>     hash 0  # rjenkins1
>     item osd01 weight 12.670
>     item osd02 weight 14.480
>     item osd03 weight 14.480
>     item osd04 weight 79.860
>     item osd05 weight 83.490

Whence the large weight difference? Are osd04 and osd05 really that much bigger in disk space?

> rule replicated_ruleset {
>     ruleset 0
>     type replicated
>     min_size 1
>     max_size 10
>     step take default
>     step chooseleaf firstn 0 type host
>     step emit
> }
>
> This should not be preventing the assignment (AFAIK). Currently the PG is on osd01 and osd05.

Just checking, sure you're not running short on space (close to 90% utilization) on one of your OSD filesystems?

Cheers, Florian
Re: [ceph-users] PG stuck degraded, undersized, unclean
On 2/18/2015 3:24 PM, Florian Haas wrote:
> That's not the full crushmap, so I'm a bit reduced to guessing...

I wasn't sure the rest of it was useful. The full one can be found here: https://gist.githubusercontent.com/devicenull/db9a3fbaa0df2138071b/raw/4158a6205692eb5a2ba73831e7f51ececd8eb1a5/gistfile1.txt

> Whence the large weight difference? Are osd04 and osd05 really that much bigger in disk space?

Yes, osd04 and osd05 have 3-4x the number of disks as osd01-osd03.

> Just checking, sure you're not running short on space (close to 90% utilization) on one of your OSD filesystems?

No, they're all under 10% used. The cluster as a whole only has about 6TB used (out of 196TB).
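(Given the full map, one way to test whether CRUSH can actually map three replicas across these hosts; a sketch using stock crushtool options against the map already decompiled above:)

    # re-test rule 0 with 3 replicas and print any inputs that map to fewer than 3 OSDs
    crushtool -i /tmp/crushmap --test --rule 0 --num-rep 3 --show-bad-mappings
    # or dump every computed mapping for inspection
    crushtool -i /tmp/crushmap --test --rule 0 --num-rep 3 --show-mappings

With hosts this unevenly weighted, it's plausible that CRUSH occasionally exhausts its retries before finding a third distinct host, which would show up as short mappings in the first command.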
Re: [ceph-users] ceph-osd pegging CPU on giant, no snapshots involved this time
On Wed, Feb 18, 2015 at 9:32 PM, Mark Nelson mnel...@redhat.com wrote:
> On 02/18/2015 02:19 PM, Florian Haas wrote:
>> Hey everyone, I must confess I'm still not fully understanding this problem and don't exactly know where to start digging deeper, but perhaps other users have seen this and/or it rings a bell. [...]
>
> Hi Florian,
>
> Does a quick perf top tell you anything useful?

Hi Mark,

Unfortunately, quite the contrary -- but this might actually provide a clue to the underlying issue. The CPU-pegging issue isn't currently present, so perf top data wouldn't be conclusive until the issue is reproduced. But: merely running perf top on this host, which currently only has 2 active OSDs, renders the host unresponsive. Corresponding dmesg snippet:

[Wed Feb 18 20:53:42 2015] hrtimer: interrupt took 2243820 ns
[Wed Feb 18 20:53:49 2015] ------------[ cut here ]------------
[Wed Feb 18 20:53:49 2015] WARNING: at arch/x86/kernel/cpu/perf_event.c:1074 x86_pmu_start+0xc6/0x100()
[Wed Feb 18 20:53:49 2015] Modules linked in: ipmi_si binfmt_misc mpt3sas mptctl mptbase dell_rbu 8021q garp stp mrp llc sg ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables xfs vfat fat iTCO_wdt iTCO_vendor_support dcdbas coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sb_edac edac_core lpc_ich mfd_core mei_me mei ipmi_devintf shpchp wmi ipmi_msghandler acpi_power_meter acpi_cpufreq mperf nfsd auth_rpcgss nfs_acl lockd sunrpc ext4 mbcache jbd2 raid1 sd_mod crc_t10dif crct10dif_common mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ttm bnx2x drm mpt2sas i2c_core raid_class mdio scsi_transport_sas libcrc32c dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipmi_si]
[Wed Feb 18 20:53:49 2015] CPU: 0 PID: 12381 Comm: dsm_sa_datamgrd Not tainted 3.10.0-123.20.1.el7.x86_64 #1
[Wed Feb 18 20:53:49 2015] Hardware name: Dell Inc. PowerEdge R720xd/0020HJ, BIOS 2.2.2 01/16/2014
[Wed Feb 18 20:53:49 2015] Call Trace:
[Wed Feb 18 20:53:49 2015]  <IRQ>  dump_stack+0x19/0x1b
[Wed Feb 18 20:53:49 2015]  warn_slowpath_common+0x61/0x80
[Wed Feb 18 20:53:49 2015]  warn_slowpath_null+0x1a/0x20
[Wed Feb 18 20:53:49 2015]  x86_pmu_start+0xc6/0x100
[Wed Feb 18 20:53:49 2015]  perf_adjust_freq_unthr_context.part.79+0x198/0x1b0
[Wed Feb 18 20:53:49 2015]  perf_event_task_tick+0xb6/0xf0
[Wed Feb 18 20:53:49 2015]  scheduler_tick+0xd5/0x150
[Wed Feb 18 20:53:49 2015]  update_process_times+0x66/0x80
[Wed Feb 18 20:53:49 2015]  tick_sched_handle.isra.16+0x25/0x60
[Wed Feb 18 20:53:49 2015]  tick_sched_timer+0x41/0x60
[Wed Feb 18 20:53:49 2015]  __run_hrtimer+0x77/0x1d0
[Wed Feb 18 20:53:49 2015]  ? tick_sched_handle.isra.16+0x60/0x60
[Wed Feb 18 20:53:49 2015]  hrtimer_interrupt+0xf7/0x240
[Wed Feb 18 20:53:49 2015]  local_apic_timer_interrupt+0x37/0x60
[Wed Feb 18 20:53:49 2015]  smp_apic_timer_interrupt+0x3f/0x60
[Wed Feb 18 20:53:49 2015]  [815f3e9d]
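(If an interactive perf top is enough to wedge the box, a lower-impact sketch using standard perf record sampling instead; pidof -s picks one ceph-osd process:)

    # sample one OSD at 99 Hz for 30 seconds, then inspect the profile offline
    perf record -F 99 -g -p $(pidof -s ceph-osd) -- sleep 30
    perf report --stdio | head -50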
[ceph-users] Privileges for read-only CephFS access?
Dear Ceph Experts,

is it possible to define a Ceph user/key with privileges that allow for read-only CephFS access but do not allow write or other modifications to the Ceph cluster?

I would like to export a sub-tree of our CephFS via HTTPS. Alas, web servers are inviting targets, so in the (hopefully unlikely) event that the server is hacked, I want to protect the Ceph cluster from file modification/deletion and other possible nasty things.

The alternative would be to put an NFS or SSHFS proxy between Ceph and the web server. But I'd like to avoid the additional complication if possible.

Cheers and thanks,

Oliver
Re: [ceph-users] Privileges for read-only CephFS access?
On Wed, Feb 18, 2015 at 10:28 PM, Oliver Schulz osch...@mpp.mpg.de wrote:
> Dear Ceph Experts, is it possible to define a Ceph user/key with privileges that allow for read-only CephFS access but do not allow write or other modifications to the Ceph cluster?

Warning, read this to the end, don't blindly do as I say. :)

All you should need to do is define a CephX identity that has only "r" capabilities on the data pool (assuming you're using a default configuration where your CephFS uses the "data" and "metadata" pools):

sudo ceph auth get-or-create client.readonly mds 'allow' osd 'allow r pool=data' mon 'allow r'

That identity should then be able to mount the filesystem but not write any data (use "ceph-fuse -n client.readonly" or "mount -t ceph -o name=readonly").

That said, just touching files or creating them is only a metadata operation that doesn't change anything in the data pool, so I think that might still be allowed under these circumstances.

However, I've just tried the above with ceph-fuse on firefly, and I was able to mount the filesystem that way and then echo something into a previously existing file. After unmounting, remounting, and trying to cat that file, I/O just hangs. It eventually does complete, but this looks really fishy. So I believe you've uncovered a CephFS bug. :)

Cheers, Florian
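(Spelling out the mount step as a quick test; a sketch assuming a monitor reachable at mon1 and the client.readonly key saved locally, with paths and hostnames as placeholders:)

    # FUSE client, as the restricted identity
    ceph-fuse -n client.readonly -m mon1:6789 /mnt/cephfs-ro
    # or the kernel client; the secretfile holds client.readonly's key
    mount -t ceph mon1:6789:/ /mnt/cephfs-ro -o name=readonly,secretfile=/etc/ceph/readonly.secret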
Re: [ceph-users] PG stuck degraded, undersized, unclean
On 2/18/2015 3:01 PM, Florian Haas wrote:
> On Wed, Feb 18, 2015 at 7:53 PM, Brian Rak b...@gameservers.com wrote:
>> We're running ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578), and seeing this: [...]
>
> What does your crushmap look like (ceph osd getcrushmap -o /tmp/crushmap; crushtool -d /tmp/crushmap)? Does your placement logic prevent Ceph from selecting an OSD for the third replica? Cheers, Florian

I have 5 hosts, and it's configured like this:

root default {
    id -1   # do not change unnecessarily
    # weight 204.979
    alg straw
    hash 0  # rjenkins1
    item osd01 weight 12.670
    item osd02 weight 14.480
    item osd03 weight 14.480
    item osd04 weight 79.860
    item osd05 weight 83.490
}

rule replicated_ruleset {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

This should not be preventing the assignment (AFAIK). Currently the PG is on osd01 and osd05.
Re: [ceph-users] ceph-osd pegging CPU on giant, no snapshots involved this time
On 02/18/2015 02:19 PM, Florian Haas wrote:
> Hey everyone,
>
> I must confess I'm still not fully understanding this problem and don't exactly know where to start digging deeper, but perhaps other users have seen this and/or it rings a bell.
>
> System info: Ceph giant on CentOS 7; approx. 240 OSDs, 6 pools using 2 different rulesets where the problem applies to hosts and PGs using a bog-standard default crushmap.
>
> Symptom: out of the blue, ceph-osd processes on a single OSD node start going to 100% CPU utilization. The problem turns so bad that the machine is effectively becoming CPU bound and can't cope with any client requests anymore. Stopping and restarting all OSDs brings the problem right back, as does rebooting the machine — right after ceph-osd processes start, CPU utilization shoots up again. Stopping and marking out several OSDs on the machine makes the problem go away but obviously causes massive backfilling. All the logs show while CPU utilization is implausibly high are slow requests (which would be expected in a system that can barely do anything).
>
> Now I've seen issues like this before on dumpling and firefly, but besides the fact that they have all been addressed and should now be fixed, they always involved the prior mass removal of RBD snapshots. This system only used a handful of snapshots in testing, and is presently not using any snapshots at all.
>
> I'll be spending some time looking for clues in the log files of the OSDs that were shut down which caused the problem to go away, but if this sounds familiar to anyone willing to offer clues, I'd be more than interested. :) Thanks!

Hi Florian,

Does a quick perf top tell you anything useful?

> Cheers, Florian
[ceph-users] metrics to monitor for performance bottlenecks?
Hey folks,

I have a Ceph cluster supporting about 500 VMs using RBD. I am seeing around 10-12k IOPS cluster-wide and IO wait time creeping up within the VMs. My suspicion is that I am pushing my Ceph cluster to its limit in terms of overall throughput. I am curious if there are metrics that can be passively collected, either in VMs or on Ceph nodes, to reveal that the cluster is at its peak. IO wait time inside of VMs might be a good one, but I am interested in monitoring the Ceph nodes directly as well. Ideally I want to track those metrics, perform some trend analysis, and provision capacity (not space, but throughput) before VM performance is impacted. Any thoughts or experience on this matter? Thanks.

-Simon
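(A sketch of passive collection directly on the Ceph nodes; these are standard commands, though which counters to trend is a judgment call:)

    # cluster-wide client IOPS and throughput
    ceph -s
    # per-OSD commit/apply latency, often the earliest saturation signal
    ceph osd perf
    # full counter dump from one OSD's admin socket (op latencies, queue lengths)
    ceph daemon osd.0 perf dump

Feeding the per-OSD latencies and admin-socket counters into a time-series store would give exactly the trend data asked about here.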
Re: [ceph-users] PG stuck degraded, undersized, unclean
On Wed, Feb 18, 2015 at 7:53 PM, Brian Rak b...@gameservers.com wrote:
> We're running ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578), and seeing this:
>
> HEALTH_WARN 1 pgs degraded; 1 pgs stuck degraded; 1 pgs stuck unclean; 1 pgs stuck undersized; 1 pgs undersized
> pg 4.2af is stuck unclean for 77192.522960, current state active+undersized+degraded, last acting [50,42]
> [...]
>
> How can I get Ceph to rebuild here? The replica count is 3, but I can't seem to figure out what's going on here. Enabling various debug logs doesn't reveal anything obvious to me. I've tried restarting both OSDs, which did nothing.

What does your crushmap look like (ceph osd getcrushmap -o /tmp/crushmap; crushtool -d /tmp/crushmap)? Does your placement logic prevent Ceph from selecting an OSD for the third replica?

Cheers, Florian
Re: [ceph-users] Privileges for read-only CephFS access?
On Wed, Feb 18, 2015 at 1:58 PM, Florian Haas flor...@hastexo.com wrote:
> On Wed, Feb 18, 2015 at 10:28 PM, Oliver Schulz osch...@mpp.mpg.de wrote:
>> Dear Ceph Experts, is it possible to define a Ceph user/key with privileges that allow for read-only CephFS access but do not allow write or other modifications to the Ceph cluster?
>
> Warning, read this to the end, don't blindly do as I say. :) All you should need to do is define a CephX identity that has only "r" capabilities on the data pool (assuming you're using a default configuration where your CephFS uses the "data" and "metadata" pools):
>
> sudo ceph auth get-or-create client.readonly mds 'allow' osd 'allow r pool=data' mon 'allow r'
>
> That identity should then be able to mount the filesystem but not write any data (use "ceph-fuse -n client.readonly" or "mount -t ceph -o name=readonly").
>
> That said, just touching files or creating them is only a metadata operation that doesn't change anything in the data pool, so I think that might still be allowed under these circumstances.

...and deletes, unfortunately. :( I don't think this is presently a thing it's possible to do until we get a much better user auth capabilities system into CephFS.

> However, I've just tried the above with ceph-fuse on firefly, and I was able to mount the filesystem that way and then echo something into a previously existing file. After unmounting, remounting, and trying to cat that file, I/O just hangs. It eventually does complete, but this looks really fishy. So I believe you've uncovered a CephFS bug. :)

This is happening because the CephFS clients don't (can't, really, for all the time we've spent thinking about it) check whether they have read permissions on the underlying pool when buffering writes for a file. I believe if you ran an fsync on the file you'd get an EROFS or similar. Anyway, the client happily buffers up the writes. Depending on how exactly you remount, it might not be able to drop the MDS caps for file access (due to having dirty data it can't get rid of), and those caps have to time out before anybody else can access the file again. So you've found an unpleasant oddity of how the POSIX interfaces map onto this kind of distributed system, but nothing unexpected. :)

-Greg
Re: [ceph-users] FreeBSD on RBD (KVM)
From: Logan Barfield lbarfi...@tqhosting.com
> We've been running some tests to try to determine why our FreeBSD VMs are performing much worse than our Linux VMs backed by RBD, especially on writes. Our current deployment is:
> - 4x KVM hypervisors (QEMU 2.0.0+dfsg-2ubuntu1.6)
> - 2x OSD nodes (8x SSDs each, 10Gbit links to hypervisors, pool has 2x replication across nodes)
> - Hypervisors have rbd_cache enabled
> - All VMs use cache=none currently.

If you don't have "rbd cache writethrough until flush = true", this configuration is unsafe - with cache=none, qemu will not send flushes.

> In testing we were getting ~30MB/s writes and ~100MB/s reads on FreeBSD 10.1. On Linux VMs we're seeing ~150+MB/s for writes and reads (dd if=/dev/zero of=output bs=1M count=1024 oflag=direct).

I'm not very familiar with FreeBSD, but I'd guess it's sending smaller I/Os for some reason. This could be due to trusting the sector size qemu reports (this can be changed, though I don't remember the syntax offhand), lower fs block size, or scheduler or block subsystem configurables. It could also be related to differences in block allocation strategies between whatever FS you're using in the guest and Linux filesystems. What FS are you using in each guest?

You can check the I/O sizes seen by rbd by adding something like this to ceph.conf on a node running qemu:

[client]
debug rbd = 20
log file = /path/writeable/by/qemu.$pid.log

This will show the offset and length of requests in lines containing aio_read and aio_write. If you're using giant, you could instead gather a trace of I/O to rbd via lttng.

> I tested several configurations on both RBD and local SSDs, and the only time FreeBSD performance was comparable to Linux was with the following configuration:
> - Local SSD
> - Qemu cache=writeback
> - GPT journaling enabled
>
> We did see some performance improvement (~50MB/s writes instead of 30MB/s) when using cache=writeback on RBD. I've read several threads regarding cache=none vs cache=writeback. cache=none is apparently safer for live migration, but cache=writeback is recommended by Ceph to prevent data loss. Apparently there was a patch submitted for Qemu a few months ago to make cache=writeback safer for live migrations as well: http://tracker.ceph.com/issues/2467

RBD caching is already safe with live migration without this patch. It just makes sure that it will continue to be safe in case of future QEMU changes.

Josh
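(For reference, a sketch of the cache settings Josh refers to, in ceph.conf on the hypervisor side; these are the standard librbd option names:)

    [client]
    rbd cache = true
    # treat the device as writethrough until the guest sends its first flush,
    # so guests that never flush cannot lose buffered writes
    rbd cache writethrough until flush = true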
Re: [ceph-users] ceph-giant installation error on centos 6.6
Hi,

The impatient me was using the quick guide (http://ceph.com/docs/master/start/quick-start-preflight/), which merely states "On CentOS, you may need to install EPEL". :)

I have a separate question: why does ceph-deploy always show "Error in sys.exitfunc:", even though things look fine?

$ ceph-deploy new ceph1 ceph2
...
[ceph_deploy.new][DEBUG ] Creating a random mon key...
[ceph_deploy.new][DEBUG ] Writing monitor keyring to ceph.mon.keyring...
[ceph_deploy.new][DEBUG ] Writing initial config to ceph.conf...
Error in sys.exitfunc:

$ ceph-deploy install ceph1 ceph2
...
[ceph_deploy.gatherkeys][DEBUG ] Got ceph.bootstrap-mds.keyring key from ceph1.
Error in sys.exitfunc:

$ ceph-deploy osd prepare ceph1:sdb ceph1:sdc ceph1:sdd ceph2:sdb ceph2:sdc ceph2:sdd
...
[ceph_deploy.osd][DEBUG ] Host ceph1 is now ready for osd use.
...
[ceph_deploy.osd][DEBUG ] Host ceph2 is now ready for osd use.
Error in sys.exitfunc:

Regards, Wenxiao

On Wed, Feb 18, 2015 at 10:38 AM, Travis Rhoden trho...@gmail.com wrote:
> Note that ceph-deploy would enable EPEL for you automatically on CentOS. When doing a manual installation, the requirement for EPEL is called out here: http://ceph.com/docs/master/install/get-packages/#id8
> [...]
Re: [ceph-users] Ceph Dumpling/Firefly/Hammer SSD/Memstore performance comparison
Hi Alex,

Thanks! I didn't tweak the sharding settings at all, so they are just at the default values:

OPTION(osd_op_num_threads_per_shard, OPT_INT, 2)
OPTION(osd_op_num_shards, OPT_INT, 5)

I don't have really good insight yet into how tweaking these would affect single-OSD performance. I know the PCIe SSDs do have multiple controllers on board, so perhaps increasing the number of shards would improve things, but I suspect that going too high could start hurting performance as well. Have you done any testing here? It could be an interesting follow-up paper.

Mark

On 02/18/2015 02:34 AM, Alexandre DERUMIER wrote:
> Nice work, Mark! I don't see any tuning for sharding in the sample config file (osd_op_num_threads_per_shard, osd_op_num_shards, ...). Since you only use one SSD for the benchmark, I think tuning these should improve the Hammer results?
> [...]
Re: [ceph-users] Ceph Dumpling/Firefly/Hammer SSD/Memstore performance comparison
Hi Andrei,

On 02/18/2015 09:08 AM, Andrei Mikhailovsky wrote:
> Mark, many thanks for your effort and the Ceph performance tests. This puts things in perspective. Looking at the results, I was a bit concerned that the IOPS performance in none of the releases comes even marginally close to the capabilities of the underlying SSD device. Even the fastest PCIe SSDs have only managed to achieve about 1/6th the IOPS of the raw device.

Perspective is definitely good! Any time you are dealing with latency-sensitive workloads, there are a lot of bottlenecks that can limit your performance. There's a world of difference between streaming data to a raw SSD as fast as possible and writing data out to a distributed storage system that is calculating data placement, invoking the TCP stack, doing CRC checks, journaling writes, and invoking the VM layer to cache data in case it's hot (which in this case it's not).

> I guess there is a great deal more optimisation to be done in the upcoming LTS releases to bring the IOPS rate close to the raw device performance.

There is definitely still room for improvement! It's important to remember though that there is always going to be a trade-off between flexibility, data integrity, and performance. If low latency is your number one need before anything else, you are probably best off eliminating as much software as possible between you and the device (except possibly if you can make clever use of caching). While Ceph itself is sometimes the bottleneck, in many cases we've found that bottlenecks in the software that surrounds Ceph are just as big obstacles (filesystem, VM layer, TCP stack, leveldb, etc). If you need a distributed storage system that can universally maintain native SSD levels of performance, the entire stack has to be highly tuned.

> I have done some testing in the past and noticed that despite the server having a lot of unused resources (about 40-50% server idle and about 60-70% SSD idle), Ceph would not perform well when used with SSDs. I was testing with Firefly + auth and my IOPS rate was around the 3K mark. Something is holding Ceph back from performing well with SSDs (((

Out of curiosity, did you try the same tests directly on the SSD?

> Andrei
Re: [ceph-users] Introducing Learning Ceph : The First ever Book on Ceph
To be exact, the platform used throughout is CentOS 6.4... I am reading my copy right now :)

Best -F

----- Original Message -----
From: SUNDAY A. OLUTAYO olut...@sadeeb.com
To: Andrei Mikhailovsky and...@arhont.com
Cc: ceph-users@lists.ceph.com
Sent: Monday, February 16, 2015 3:28:45 AM
Subject: Re: [ceph-users] Introducing Learning Ceph : The First ever Book on Ceph

I bought a copy some days ago; great job, but it is Red Hat specific.

Thanks, Sunday Olutayo
Re: [ceph-users] Ceph Dumpling/Firefly/Hammer SSD/Memstore performance comparison
https://github.com/ceph/ceph-tools/tree/master/cbt

On Tue, Feb 17, 2015 at 12:16 PM, Stephen Hindle shin...@llnw.com wrote:
> I was wondering what the 'CBT' tool is? Google is useless for that acronym... Thanks! Steve
>
> On Tue, Feb 17, 2015 at 10:37 AM, Mark Nelson mnel...@redhat.com wrote:
>> Hi All, I wrote up a short document describing some tests I ran recently to look at how SSD-backed OSD performance has changed across our LTS releases. [...]
Re: [ceph-users] Ceph Dumpling/Firefly/Hammer SSD/Memstore performance comparison
Mark, many thanks for your effort and the Ceph performance tests. This puts things in perspective.

Looking at the results, I was a bit concerned that the IOPS performance in none of the releases comes even marginally close to the capabilities of the underlying SSD device. Even the fastest PCIe SSDs have only managed to achieve about 1/6th the IOPS of the raw device. I guess there is a great deal more optimisation to be done in the upcoming LTS releases to bring the IOPS rate close to the raw device performance.

I have done some testing in the past and noticed that despite the server having a lot of unused resources (about 40-50% server idle and about 60-70% SSD idle), Ceph would not perform well when used with SSDs. I was testing with Firefly + auth and my IOPS rate was around the 3K mark. Something is holding Ceph back from performing well with SSDs (((

Andrei

----- Original Message -----
From: Mark Nelson mnel...@redhat.com
To: ceph-devel ceph-de...@vger.kernel.org
Cc: ceph-users@lists.ceph.com
Sent: Tuesday, 17 February 2015 5:37:01 PM
Subject: [ceph-users] Ceph Dumpling/Firefly/Hammer SSD/Memstore performance comparison

[...]
Re: [ceph-users] Unexpectedly low number of concurrent backfills
On Wed, Feb 18, 2015 at 6:56 AM, Gregory Farnum g...@gregs42.com wrote:
> On Tue, Feb 17, 2015 at 9:48 PM, Florian Haas flor...@hastexo.com wrote:
>> On Tue, Feb 17, 2015 at 11:19 PM, Gregory Farnum g...@gregs42.com wrote:
>>> On Tue, Feb 17, 2015 at 12:09 PM, Florian Haas flor...@hastexo.com wrote:
>>>> Hello everyone,
>>>>
>>>> I'm seeing some OSD behavior that I consider unexpected; perhaps someone can shed some insight. Ceph giant (0.87.0), "osd max backfills" and "osd recovery max active" both set to 1. Please take a moment to look at the following ceph health detail screen dump:
>>>>
>>>> HEALTH_WARN 14 pgs backfill; 1 pgs backfilling; 15 pgs stuck unclean; recovery 16/65732491 objects degraded (0.000%); 328254/65732491 objects misplaced (0.499%)
>>>> pg 20.3db is stuck unclean for 13547.432043, current state active+remapped+wait_backfill, last acting [45,90,157]
>>>> pg 15.318 is stuck unclean for 13547.380581, current state active+remapped+wait_backfill, last acting [41,17,120]
>>>> pg 15.34a is stuck unclean for 13548.115170, current state active+remapped+wait_backfill, last acting [64,87,80]
>>>> pg 20.6f is stuck unclean for 13548.019218, current state active+remapped+wait_backfill, last acting [13,38,98]
>>>> pg 20.44c is stuck unclean for 13548.075430, current state active+remapped+wait_backfill, last acting [174,127,139]
>>>> pg 20.bc is stuck unclean for 13545.743397, current state active+remapped+wait_backfill, last acting [72,64,104]
>>>> pg 15.1ac is stuck unclean for 13548.181461, current state active+remapped+wait_backfill, last acting [121,145,84]
>>>> pg 15.1af is stuck unclean for 13547.962269, current state active+remapped+backfilling, last acting [150,62,101]
>>>> pg 20.396 is stuck unclean for 13547.835109, current state active+remapped+wait_backfill, last acting [134,49,96]
>>>> pg 15.1ba is stuck unclean for 13548.128752, current state active+remapped+wait_backfill, last acting [122,63,162]
>>>> pg 15.3fd is stuck unclean for 13547.644431, current state active+remapped+wait_backfill, last acting [156,38,131]
>>>> pg 20.41c is stuck unclean for 13548.133470, current state active+remapped+wait_backfill, last acting [78,85,168]
>>>> pg 20.525 is stuck unclean for 13545.272774, current state active+remapped+wait_backfill, last acting [76,57,148]
>>>> pg 15.1ca is stuck unclean for 13547.944928, current state active+remapped+wait_backfill, last acting [157,19,36]
>>>> pg 20.11e is stuck unclean for 13545.368614, current state active+remapped+wait_backfill, last acting [36,134,8]
>>>> pg 20.525 is active+remapped+wait_backfill, acting [76,57,148]
>>>> pg 20.44c is active+remapped+wait_backfill, acting [174,127,139]
>>>> pg 20.41c is active+remapped+wait_backfill, acting [78,85,168]
>>>> pg 15.3fd is active+remapped+wait_backfill, acting [156,38,131]
>>>> pg 20.3db is active+remapped+wait_backfill, acting [45,90,157]
>>>> pg 20.396 is active+remapped+wait_backfill, acting [134,49,96]
>>>> pg 15.34a is active+remapped+wait_backfill, acting [64,87,80]
>>>> pg 15.318 is active+remapped+wait_backfill, acting [41,17,120]
>>>> pg 15.1ca is active+remapped+wait_backfill, acting [157,19,36]
>>>> pg 15.1ba is active+remapped+wait_backfill, acting [122,63,162]
>>>> pg 15.1ac is active+remapped+wait_backfill, acting [121,145,84]
>>>> pg 15.1af is active+remapped+backfilling, acting [150,62,101]
>>>> pg 20.11e is active+remapped+wait_backfill, acting [36,134,8]
>>>> pg 20.bc is active+remapped+wait_backfill, acting [72,64,104]
>>>> pg 20.6f is active+remapped+wait_backfill, acting [13,38,98]
>>>> recovery 16/65732491 objects degraded (0.000%); 328254/65732491 objects misplaced (0.499%)
>>>>
>>>> As you can see, there is barely any overlap between the acting OSDs for those PGs. "osd max backfills" should only limit the number of concurrent backfills out of a single OSD, and so in the situation above I would expect the 15 backfills to happen mostly concurrently. As it is, they are being serialized, which needlessly slows down the process and extends the time needed to complete recovery. I'm pretty sure I'm missing something obvious here, but what is it?
>>> The max backfill values cover both incoming and outgoing backfills. Presumably these are all waiting on a small set of target OSDs which are currently receiving backfills of some other PG.
>> Thanks for the reply, and I am aware of that, but I am not sure how it applies here. What I quoted was the complete list of then-current backfills in the cluster. Those are *all* the PGs affected by backfills. And they're so scattered across OSDs that there is barely any overlap. The only OSDs I even see listed twice are 38 and 64, which would affect PGs 15.3fd/20.6f and 15.34a/20.bc. What is causing the others to wait? Or am I misunderstanding the "acting" value here, and some other OSDs are involved? If so, how would I find out what those are?
> Yes, unless I'm misremembering. Look at the pg dump for those PGs and check out the "up" versus "acting" values. The "acting" ones are what the PG is currently remapped to; they're waiting to backfill onto the proper set of "up" OSDs.
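(For anyone wanting to follow Greg's suggestion or loosen the throttles, a sketch using standard commands; the values are examples, not recommendations:)

    # compare the 'up' vs 'acting' sets Greg mentions
    ceph pg dump pgs_brief
    # raise the per-OSD backfill/recovery limits at runtime
    ceph tell osd.* injectargs '--osd-max-backfills 2'
    ceph tell osd.* injectargs '--osd-recovery-max-active 2'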
Re: [ceph-users] Privileges for read-only CephFS access?
Hi Florian,

On 18.02.2015 22:58, Florian Haas wrote:
>> is it possible to define a Ceph user/key with privileges that allow for read-only CephFS access but do not allow [...]
> All you should need to do is [...] However, I've just tried the above with ceph-fuse on firefly, and [...] So I believe you've uncovered a CephFS bug. :)

Many thanks for the advice and the tests! I guess I'll have to go with a proxy for now, to be safe. But if it's possible (design-wise), read-only CephFS access might be a useful feature to have in the future.

Cheers, Oliver
Re: [ceph-users] wider rados namespace support?
On 02/12/2015 05:59 PM, Blair Bethwaite wrote:
> My particular interest is for a less dynamic environment, so manual key distribution is not a problem. Re. OpenStack, it's probably good enough to have the Cinder host creating them as needed (presumably stored in its DB) and just send the secret keys over the message bus to compute hosts as needed - if your infrastructure network is not trusted then you've got bigger problems to worry about. It's true that a lot of clouds would end up logging the secrets in various places, but then they are only useful on particular hosts.
>
> I guess there is nothing special about the default namespace compared to any other as far as cephx is concerned. It would be nice to have something of a nested auth, so that the client requires explicit permission to read the default namespace (configured out-of-band when setting up compute hosts) and further permission for particular non-default namespaces (managed by the Cinder rbd driver). That way, leaking secrets from Cinder gives less exposure - but I guess that would be a bit of a change from the current namespace functionality.

You can restrict client access to the default namespace like this with the existing Ceph capabilities. For the proposed rbd usage of namespaces, for example, you could allow read-only access to the rbd_id.* objects in the default namespace, and full access to other specific namespaces. Something like:

mon 'allow r' osd 'allow r class-read pool=foo namespace= object_prefix rbd_id, allow rwx pool=foo namespace=bar'

Cinder or other management layers would still want broader access, but these more restricted keys could be the only ones exposed to QEMU.

Josh

> On 13 February 2015 at 05:57, Josh Durgin josh.dur...@inktank.com wrote:
>> On 02/10/2015 07:54 PM, Blair Bethwaite wrote:
>>> Just came across this in the docs: "Currently (i.e., firefly), namespaces are only useful for applications written on top of librados. Ceph clients such as block device, object storage and file system do not currently support this feature." Then found: https://wiki.ceph.com/Planning/Sideboard/rbd%3A_namespace_support Is there any progress or plans to address this (particularly for rbd clients but also cephfs)?
>> No immediate plans for rbd. That blueprint still seems like a reasonable way to implement it to me. The one part I'm less sure about is the OpenStack or other higher-level integration, which would need to start adding secret keys to libvirt dynamically.
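(To experiment with namespaces from the command line, a sketch; note the rados CLI's -N/--namespace flag exists in recent releases but may not be present in every version current at the time of this thread:)

    # write an object into namespace 'bar' of pool 'foo', then list that namespace
    rados -p foo -N bar put myobj /tmp/payload
    rados -p foo -N bar ls
    # the object does not appear when listing the default namespace
    rados -p foo ls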
Re: [ceph-users] Privileges for read-only CephFS access?
Dear Greg,

On 18.02.2015 23:41, Gregory Farnum wrote:
>> is it possible to define a Ceph user/key with privileges that allow for read-only CephFS access but do not allow
> ...and deletes, unfortunately. :( I don't think this is presently a thing it's possible to do until we get a much better user auth capabilities system into CephFS.

Thanks a lot for the in-depth explanation! I guess this is indeed not a trivial thing to do. Well, it's probably best anyhow to isolate the Ceph cluster from potentially vulnerable systems.

Cheers, Oliver
Re: [ceph-users] Privileges for read-only CephFS access?
On Wed, Feb 18, 2015 at 3:30 PM, Florian Haas flor...@hastexo.com wrote:
>>> ...and deletes, unfortunately. :(
> If the file being deleted is empty, yes. If the file has any content, then the removal should hit the data pool before it hits metadata, and should fail there. No?

No, all data deletion is handled by the MDS, for two reasons:
1) You don't want clients to have to block on deletes in time linear with the number of objects.
2) (IMPORTANT) If clients unlink a file which is still opened elsewhere, it can't be deleted until closed. ;)

> [...]
> Oliver's point is valid though; it would be nice if you could somehow make CephFS read-only to some (or all) clients server side, the way an NFS ro export does.

Yeah. Yet another thing that would be good but requires real permission bits on the MDS. It'll happen eventually, but we have other bits that seem a lot more important... fsck, stability, single-tenant usability
Re: [ceph-users] Privileges for read-only CephFS access?
On Wed, Feb 18, 2015 at 11:41 PM, Gregory Farnum g...@gregs42.com wrote:
>> That said, just touching files or creating them is only a metadata operation that doesn't change anything in the data pool, so I think that might still be allowed under these circumstances.
> ...and deletes, unfortunately. :(

If the file being deleted is empty, yes. If the file has any content, then the removal should hit the data pool before it hits metadata, and should fail there. No?

> This is happening because the CephFS clients don't (can't, really, for all the time we've spent thinking about it) check whether they have read permissions on the underlying pool when buffering writes for a file. I believe if you ran an fsync on the file you'd get an EROFS or similar. Anyway, the client happily buffers up the writes. Depending on how exactly you remount, it might not be able to drop the MDS caps for file access (due to having dirty data it can't get rid of), and those caps have to time out before anybody else can access the file again. So you've found an unpleasant oddity of how the POSIX interfaces map onto this kind of distributed system, but nothing unexpected. :)

Oliver's point is valid though; it would be nice if you could somehow make CephFS read-only to some (or all) clients server side, the way an NFS ro export does.

Cheers, Florian
[ceph-users] rbd: I/O Errors in low memory situations
Hi,

yesterday we had the problem that one of our cluster clients remounted an rbd device in read-only mode. We found this[1] stack trace in the logs. We investigated further and found similar traces on all other machines that are using the rbd kernel module. It seems to me that whenever there is a swapping situation on a client, those I/O errors occur. Is there anything we can do, or is this something that needs to be fixed in the code?

Client setup:

# uname -v
#1 SMP Debian 3.16.7-ckt2-1~bpo70+1 (2014-12-08)
# rbd -v
ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
# lsb_release -r
Release: 6.0.10

Cheers, Sebastian

[1]
Feb 17 22:52:25 six kernel: [2401866.069932] kworker/6:1: page allocation failure: order:1, mode:0x204020
Feb 17 22:52:25 six kernel: [2401866.069939] CPU: 6 PID: 4474 Comm: kworker/6:1 Tainted: G W 3.16.0-0.bpo.4-amd64 #1 Debian 3.16.7-ckt2-1~bpo70+1
Feb 17 22:52:25 six kernel: [2401866.069940] Hardware name: Dell Inc. PowerEdge R710/0PV9DG, BIOS 1.3.6 10/30/2009
Feb 17 22:52:25 six kernel: [2401866.069947] Workqueue: rbd1 rbd_request_workfn [rbd]
Feb 17 22:52:25 six kernel: [2401866.069955] Call Trace:
Feb 17 22:52:25 six kernel: [2401866.069963]  ? dump_stack+0x41/0x51
Feb 17 22:52:25 six kernel: [2401866.069969]  ? warn_alloc_failed+0xfd/0x160
Feb 17 22:52:25 six kernel: [2401866.069972]  ? __alloc_pages_nodemask+0x91f/0xbb0
Feb 17 22:52:25 six kernel: [2401866.069977]  ? kmem_getpages+0x60/0x110
Feb 17 22:52:25 six kernel: [2401866.069979]  ? fallback_alloc+0x158/0x220
Feb 17 22:52:25 six kernel: [2401866.069981]  ? kmem_cache_alloc+0x1a4/0x1e0
Feb 17 22:52:25 six kernel: [2401866.069987]  ? ceph_osdc_alloc_request+0x69/0x320 [libceph]
Feb 17 22:52:25 six kernel: [2401866.069989]  ? rbd_osd_req_create.isra.17+0x7b/0x190 [rbd]
Feb 17 22:52:25 six kernel: [2401866.069992]  ? rbd_img_request_fill+0x2b5/0x900 [rbd]
Feb 17 22:52:25 six kernel: [2401866.069995]  ? rbd_request_workfn+0x235/0x350 [rbd]
Feb 17 22:52:25 six kernel: [2401866.070000]  ? process_one_work+0x15c/0x450
Feb 17 22:52:25 six kernel: [2401866.070003]  ? worker_thread+0x112/0x540
Feb 17 22:52:25 six kernel: [2401866.070005]  ? create_and_start_worker+0x60/0x60
Feb 17 22:52:25 six kernel: [2401866.070008]  ? kthread+0xc1/0xe0
Feb 17 22:52:25 six kernel: [2401866.070010]  ? flush_kthread_worker+0xb0/0xb0
Feb 17 22:52:25 six kernel: [2401866.070013]  ? ret_from_fork+0x7c/0xb0
Feb 17 22:52:25 six kernel: [2401866.070015]  ? flush_kthread_worker+0xb0/0xb0
Feb 17 22:52:25 six kernel: [2401866.070017] Mem-Info:
Feb 17 22:52:25 six kernel: [2401866.070018] Node 0 Normal per-cpu:
Feb 17 22:52:25 six kernel: [2401866.070020] CPU0: hi: 186, btch: 31 usd: 46
Feb 17 22:52:25 six kernel: [2401866.070021] CPU1: hi: 186, btch: 31 usd: 168
Feb 17 22:52:25 six kernel: [2401866.070023] CPU2: hi: 186, btch: 31 usd: 154
Feb 17 22:52:25 six kernel: [2401866.070024] CPU3: hi: 186, btch: 31 usd: 177
Feb 17 22:52:25 six kernel: [2401866.070025] CPU4: hi: 186, btch: 31 usd: 128
Feb 17 22:52:25 six kernel: [2401866.070026] CPU5: hi: 186, btch: 31 usd: 152
Feb 17 22:52:25 six kernel: [2401866.070028] CPU6: hi: 186, btch: 31 usd: 76
Feb 17 22:52:25 six kernel: [2401866.070029] CPU7: hi: 186, btch: 31 usd: 109
Feb 17 22:52:25 six kernel: [2401866.070030] CPU8: hi: 186, btch: 31 usd: 183
Feb 17 22:52:25 six kernel: [2401866.070031] CPU9: hi: 186, btch: 31 usd: 180
Feb 17 22:52:25 six kernel: [2401866.070033] CPU 10: hi: 186, btch: 31 usd: 149
Feb 17 22:52:25 six kernel: [2401866.070034] CPU 11: hi: 186, btch: 31 usd: 170
Feb 17 22:52:25 six kernel: [2401866.070035] CPU 12: hi: 186, btch: 31 usd: 182
Feb 17 22:52:25 six kernel: [2401866.070036] CPU 13: hi: 186, btch: 31 usd: 169
Feb 17 22:52:25 six kernel: [2401866.070038] CPU 14: hi: 186, btch: 31 usd: 157
Feb 17 22:52:25 six kernel: [2401866.070039] CPU 15: hi: 186, btch: 31 usd: 176
Feb 17 22:52:25 six kernel: [2401866.070040] Node 1 DMA per-cpu:
Feb 17 22:52:25 six kernel: [2401866.070041] CPU0: hi: 0, btch: 1 usd: 0
Feb 17 22:52:25 six kernel: [2401866.070043] CPU1: hi: 0, btch: 1 usd: 0
Feb 17 22:52:25 six kernel: [2401866.070044] CPU2: hi: 0, btch: 1 usd:
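(Not a fix for the underlying allocation problem in the rbd/libceph path, but a commonly suggested mitigation sketch for order-1 allocation failures under memory pressure: keep a larger reserve of free memory and reduce swap pressure. The values are illustrative and should be tuned to the machine's RAM:)

    # enlarge the kernel's free-memory reserve (value in KiB)
    sysctl -w vm.min_free_kbytes=262144
    # reduce the tendency to swap in the first place
    sysctl -w vm.swappiness=10
    # persist both across reboots
    printf 'vm.min_free_kbytes = 262144\nvm.swappiness = 10\n' >> /etc/sysctl.conf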
[ceph-users] OSD Startup Best Practice: gpt/udev or SysVInit/systemd ?
Hi Cephers,

What is your best practice for starting up OSDs? I am trying to determine the most robust technique on CentOS 7, where I have too much choice: udev/gpt/uuid, /etc/init.d/ceph, or /etc/systemd/system/ceph-osd@X.

1. Use udev/gpt/UUID: no OSD sections in /etc/ceph/mycluster.conf or premounts in /etc/fstab. Let udev + ceph-disk-activate do its magic.

2. Use "/etc/init.d/ceph start osd" or "systemctl start ceph-osd@N":
   a. Do you change the partition UUID so no udev kicks in?
   b. Do you keep [osd.N] sections in /etc/ceph/mycluster.conf?
   c. Do you premount all journals/OSDs in /etc/fstab?

The problem with this approach, though very explicit and robust, is that it is hard to maintain /etc/fstab on the OSD hosts.

- Anthony
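(For option 1, a sketch of the moving parts, assuming the disks were prepared with ceph-disk so the partitions carry the standard Ceph GPT type GUIDs; with that in place, nothing is needed in fstab or in [osd.N] config sections:)

    # prepare stamps the Ceph partition type GUIDs onto the disk
    ceph-disk prepare /dev/sdb
    # on boot or hotplug, the packaged udev rules effectively run:
    ceph-disk activate /dev/sdb1
    # show what ceph-disk recognizes right now
    ceph-disk list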