Re: [ceph-users] Ceph Dumpling/Firefly/Hammer SSD/Memstore performance comparison

2015-02-18 Thread Alexandre DERUMIER
Nice Work Mark !

I don't see any tuning of the sharding options in the sample config file

(osd_op_num_threads_per_shard, osd_op_num_shards, ...)

As you only use 1 SSD for the bench, I think tuning these should improve the
results for hammer?



- Mail original -
De: Mark Nelson mnel...@redhat.com
À: ceph-devel ceph-de...@vger.kernel.org
Cc: ceph-users ceph-users@lists.ceph.com
Envoyé: Mardi 17 Février 2015 18:37:01
Objet: [ceph-users] Ceph Dumpling/Firefly/Hammer SSD/Memstore performance   
comparison

Hi All, 

I wrote up a short document describing some tests I ran recently to look 
at how SSD backed OSD performance has changed across our LTS releases. 
This is just looking at RADOS performance and not RBD or RGW. It also 
doesn't offer any real explanations regarding the results. It's just a 
first high level step toward understanding some of the behaviors folks 
on the mailing list have reported over the last couple of releases. I 
hope you find it useful. 

Mark 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-giant installation error on centos 6.6

2015-02-18 Thread Wenxiao He
Thanks Brad. That solved the problem. I mistakenly assumed all dependencies
are in http://ceph.com/rpm-giant/el6/x86_64/.


Regards,
Wenxiao

On Tue, Feb 17, 2015 at 10:37 PM, Brad Hubbard bhubb...@redhat.com wrote:

 On 02/18/2015 12:43 PM, Wenxiao He wrote:


 Hello,

 I need some help as I am getting package dependency errors when trying to
 install ceph-giant on centos 6.6. See below for repo files and also the yum
 install output.


  --- Package python-imaging.x86_64 0:1.1.6-19.el6 will be installed
 -- Finished Dependency Resolution
 Error: Package: 1:librbd1-0.87-0.el6.x86_64 (Ceph)
 Requires: liblttng-ust.so.0()(64bit)
 Error: Package: gperftools-libs-2.0-11.el6.3.x86_64 (Ceph)
 Requires: libunwind.so.8()(64bit)
 Error: Package: 1:librados2-0.87-0.el6.x86_64 (Ceph)
 Requires: liblttng-ust.so.0()(64bit)
 Error: Package: 1:ceph-0.87-0.el6.x86_64 (Ceph)
 Requires: liblttng-ust.so.0()(64bit)


 Looks like you may need to install libunwind and lttng-ust from EPEL 6?

 They seem to be the packages that supply liblttng-ust.so and libunwind.so,
 so you could try installing those from EPEL 6 and see how that goes?

 Note that this should not be taken as the, or even a, authoritative answer :)

 Cheers,
 Brad

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Updating monmap

2015-02-18 Thread SUNDAY A. OLUTAYO
How do I update the ceph monmap after extracting it and removing an unwanted
IP, so that the monitors use the clean monmap?



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-giant installation error on centos 6.6

2015-02-18 Thread Travis Rhoden
Note that ceph-deploy would enable EPEL for you automatically on
CentOS.  When doing a manual installation, the requirement for EPEL is
called out here:
http://ceph.com/docs/master/install/get-packages/#id8

Though looking at that, we could probably update it to use the now
much easier to use yum install epel-release.  :)
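
For a manual install, the rough sequence on CentOS 6 would be something
like this (package names per Brad's suggestion; treat it as a sketch, not
an authoritative recipe):

# enable EPEL, then let yum resolve the missing libraries
yum install epel-release
yum install lttng-ust libunwind
yum install ceph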

 - Travis

On Wed, Feb 18, 2015 at 12:25 PM, Wenxiao He wenx...@gmail.com wrote:
 Thanks Brad. That solved the problem. I mistakenly assumed all dependencies
 are in http://ceph.com/rpm-giant/el6/x86_64/.


 Regards,
 Wenxiao

 On Tue, Feb 17, 2015 at 10:37 PM, Brad Hubbard bhubb...@redhat.com wrote:

 On 02/18/2015 12:43 PM, Wenxiao He wrote:


 Hello,

 I need some help as I am getting package dependency errors when trying to
 install ceph-giant on centos 6.6. See below for repo files and also the yum
 install output.


 --- Package python-imaging.x86_64 0:1.1.6-19.el6 will be installed
 -- Finished Dependency Resolution
 Error: Package: 1:librbd1-0.87-0.el6.x86_64 (Ceph)
 Requires: liblttng-ust.so.0()(64bit)
 Error: Package: gperftools-libs-2.0-11.el6.3.x86_64 (Ceph)
 Requires: libunwind.so.8()(64bit)
 Error: Package: 1:librados2-0.87-0.el6.x86_64 (Ceph)
 Requires: liblttng-ust.so.0()(64bit)
 Error: Package: 1:ceph-0.87-0.el6.x86_64 (Ceph)
 Requires: liblttng-ust.so.0()(64bit)


 Looks like you may need to install libunwind and lttng-ust from EPEL 6?

 They seem to be the packages that supply liblttng-ust.so and libunwind.so,
 so you could try installing those from EPEL 6 and see how that goes?

 Note that this should not be taken as the, or even a, authoritative answer
 :)

 Cheers,
 Brad



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Updating monmap

2015-02-18 Thread LOPEZ Jean-Charles
Hi,

use the following command line: ceph-mon -i {monitor_id} --inject-monmap 
{updated_monmap_file}
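
As a rough end-to-end sketch (monitor IDs and file paths are placeholders;
the monitor should be stopped while you do this):

# extract the current map, drop the unwanted monitor, inject the clean map
ceph-mon -i {monitor_id} --extract-monmap /tmp/monmap
monmaptool --print /tmp/monmap
monmaptool --rm {unwanted_mon_id} /tmp/monmap
ceph-mon -i {monitor_id} --inject-monmap /tmp/monmap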

JC

 On 18 Feb 2015, at 11:15, SUNDAY A. OLUTAYO olut...@sadeeb.com wrote:
 
 How do I update the ceph monmap after extracting it and removing an unwanted
 IP, so that the monitors use the clean monmap?
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PG stuck degraded, undersized, unclean

2015-02-18 Thread Brian Rak
We're running ceph version 0.87 
(c51c8f9d80fa4e0168aa52685b8de40e42758578), and seeing this:


HEALTH_WARN 1 pgs degraded; 1 pgs stuck degraded; 1 pgs stuck unclean; 1 
pgs stuck undersized; 1 pgs undersized
pg 4.2af is stuck unclean for 77192.522960, current state 
active+undersized+degraded, last acting [50,42]
pg 4.2af is stuck undersized for 980.617479, current state 
active+undersized+degraded, last acting [50,42]
pg 4.2af is stuck degraded for 980.617902, current state 
active+undersized+degraded, last acting [50,42]

pg 4.2af is active+undersized+degraded, acting [50,42]


However, ceph pg query doesn't really show any issues: 
https://gist.githubusercontent.com/devicenull/9d911362e4de83c02e40/raw/565fe18163e261c8105e5493a4e90cc3c461ed9d/gistfile1.txt 
(too long to post here)


I've also tried:

# ceph pg 4.2af mark_unfound_lost revert
pg has no unfound objects

How can I get Ceph to rebuild here?  The replica count is 3, but I can't 
seem to figure out what's going on here.  Enabling various debug logs 
doesn't reveal anything obvious to me.


I've tried restarting both OSDs, which did nothing.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] 12 March - Ceph Day San Francisco

2015-02-18 Thread Patrick McGarry
Hey cephers,

We still have a couple of speaking slots open for Ceph Day San
Francisco on 12 March. I'm open to both high-level "what have you been
doing with Ceph" type talks as well as more technical "here is what
we're writing and/or integrating with Ceph" talks.

I know many folks will be at VAULT, but we figured there would still
be plenty of folks left on the west coast, so let me know if you'd be
interested in speaking. Thanks!


-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG stuck degraded, undersized, unclean

2015-02-18 Thread Florian Haas
On Wed, Feb 18, 2015 at 9:09 PM, Brian Rak b...@gameservers.com wrote:
 What does your crushmap look like (ceph osd getcrushmap -o
 /tmp/crushmap; crushtool -d /tmp/crushmap)? Does your placement logic
 prevent Ceph from selecting an OSD for the third replica?

 Cheers,
 Florian


 I have 5 hosts, and it's configured like this:

That's not the full crushmap, so I'm a bit reduced to guessing...

 root default {
 id -1   # do not change unnecessarily
 # weight 204.979
 alg straw
 hash 0  # rjenkins1
 item osd01 weight 12.670
 item osd02 weight 14.480
 item osd03 weight 14.480
 item osd04 weight 79.860
 item osd05 weight 83.490

Whence the large weight difference? Are osd04 and osd05 really that
much bigger in disk space?

 rule replicated_ruleset {
 ruleset 0
 type replicated
 min_size 1
 max_size 10
 step take default
 step chooseleaf firstn 0 type host
 step emit
 }

 This should not be preventing the assignment (AFAIK).  Currently the PG is
 on osd01 and osd05.

Just checking, sure you're not running short on space (close to 90%
utilization) on one of your OSD filesystems?
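
Quick ways to check, just as a suggestion:

# cluster-wide and per-pool usage
ceph df
# filesystem usage, run on each OSD host
df -h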

Cheers,
Florian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG stuck degraded, undersized, unclean

2015-02-18 Thread Brian Rak


On 2/18/2015 3:24 PM, Florian Haas wrote:

On Wed, Feb 18, 2015 at 9:09 PM, Brian Rak b...@gameservers.com wrote:

What does your crushmap look like (ceph osd getcrushmap -o
/tmp/crushmap; crushtool -d /tmp/crushmap)? Does your placement logic
prevent Ceph from selecting an OSD for the third replica?

Cheers,
Florian


I have 5 hosts, and it's configured like this:

That's not the full crushmap, so I'm a bit reduced to guessing...
I wasn't sure the rest of it was useful.  The full one can be found 
here: 
https://gist.githubusercontent.com/devicenull/db9a3fbaa0df2138071b/raw/4158a6205692eb5a2ba73831e7f51ececd8eb1a5/gistfile1.txt






root default {
 id -1   # do not change unnecessarily
 # weight 204.979
 alg straw
 hash 0  # rjenkins1
 item osd01 weight 12.670
 item osd02 weight 14.480
 item osd03 weight 14.480
 item osd04 weight 79.860
 item osd05 weight 83.490

Whence the large weight difference? Are osd04 and osd05 really that
much bigger in disk space?

Yes, osd04 and osd05 have 3-4x the number of disks as osd01-osd03.


rule replicated_ruleset {
 ruleset 0
 type replicated
 min_size 1
 max_size 10
 step take default
 step chooseleaf firstn 0 type host
 step emit
}

This should not be preventing the assignment (AFAIK).  Currently the PG is
on osd01 and osd05.

Just checking, sure you're not running short on space (close to 90%
utilization) on one of your OSD filesystems?

No, they're all under 10% used.  The cluster as a whole only has about 
6TB used (out of 196 TB).

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-osd pegging CPU on giant, no snapshots involved this time

2015-02-18 Thread Florian Haas
On Wed, Feb 18, 2015 at 9:32 PM, Mark Nelson mnel...@redhat.com wrote:
 On 02/18/2015 02:19 PM, Florian Haas wrote:

 Hey everyone,

 I must confess I'm still not fully understanding this problem and
 don't exactly know where to start digging deeper, but perhaps other
 users have seen this and/or it rings a bell.

 System info: Ceph giant on CentOS 7; approx. 240 OSDs, 6 pools using 2
 different rulesets where the problem applies to hosts and PGs using a
 bog-standard default crushmap.

 Symptom: out of the blue, ceph-osd processes on a single OSD node
 start going to 100% CPU utilization. The problem gets so bad that
 the machine effectively becomes CPU-bound and can't cope with any
 client requests anymore. Stopping and restarting all OSDs brings the
 problem right back, as does rebooting the machine — right after
 ceph-osd processes start, CPU utilization shoots up again. Stopping
 and marking out several OSDs on the machine makes the problem go away,
 but obviously causes massive backfilling. The only thing the logs show
 while CPU utilization is implausibly high is slow requests (which would
 be expected in a system that can barely do anything).

 Now I've seen issues like this before on dumpling and firefly, but
 besides the fact that they have all been addressed and should now be
 fixed, they always involved the prior mass removal of RBD snapshots.
 This system only used a handful of snapshots in testing, and is
 presently not using any snapshots at all.

 I'll be spending some time looking for clues in the log files of the
 OSDs that were shut down which caused the problem to go away, but if
 this sounds familiar to anyone willing to offer clues, I'd be more
 than interested. :) Thanks!


 Hi Florian,

 Does a quick perf top tell you anything useful?

Hi Mark,

Unfortunately, quite the contrary -- but this might actually provide a
clue to the underlying issue.

The CPU pegging issue isn't currently present, so the perf top data
wouldn't be conclusive until the issue is reproduced. But: merely
running perf top on this host, which currently only has 2 active OSDs,
renders the host unresponsive.

Corresponding dmesg snippet:

[Wed Feb 18 20:53:42 2015] hrtimer: interrupt took 2243820 ns
[Wed Feb 18 20:53:49 2015] [ cut here ]
[Wed Feb 18 20:53:49 2015] WARNING: at
arch/x86/kernel/cpu/perf_event.c:1074 x86_pmu_start+0xc6/0x100()
[Wed Feb 18 20:53:49 2015] Modules linked in: ipmi_si binfmt_misc
mpt3sas mptctl mptbase dell_rbu 8021q garp stp mrp llc sg ipt_REJECT
nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack
iptable_filter ip_tables xfs vfat fat iTCO_w
dt iTCO_vendor_support dcdbas coretemp kvm_intel kvm crct10dif_pclmul
crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul
glue_helper ablk_helper cryptd pcspkr sb_edac edac_core lpc_ich
mfd_core mei_me mei ipmi_devintf
shpchp wmi ipmi_msghandler acpi_power_meter acpi_cpufreq mperf nfsd
auth_rpcgss nfs_acl lockd sunrpc ext4 mbcache jbd2 raid1 sd_mod
crc_t10dif crct10dif_common mgag200 syscopyarea sysfillrect sysimgblt
i2c_algo_bit drm_kms_helper ttm bnx2
x drm mpt2sas i2c_core raid_class mdio scsi_transport_sas libcrc32c
dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipmi_si]

[Wed Feb 18 20:53:49 2015] CPU: 0 PID: 12381 Comm: dsm_sa_datamgrd Not
tainted 3.10.0-123.20.1.el7.x86_64 #1
[Wed Feb 18 20:53:49 2015] Hardware name: Dell Inc. PowerEdge
R720xd/0020HJ, BIOS 2.2.2 01/16/2014
[Wed Feb 18 20:53:49 2015]  50de8931
880fef003d40 815e2b0c
[Wed Feb 18 20:53:49 2015] 880fef003d78 8105dee1
880c316a7400 880fef00b9e0
[Wed Feb 18 20:53:49 2015]  880fef016db0
880dbaa896c0 880fef003d88
[Wed Feb 18 20:53:49 2015] Call Trace:
[Wed Feb 18 20:53:49 2015]  IRQ  [815e2b0c] dump_stack+0x19/0x1b
[Wed Feb 18 20:53:49 2015] [8105dee1] warn_slowpath_common+0x61/0x80
[Wed Feb 18 20:53:49 2015] [8105e00a] warn_slowpath_null+0x1a/0x20
[Wed Feb 18 20:53:49 2015] [81023706] x86_pmu_start+0xc6/0x100
[Wed Feb 18 20:53:49 2015] [81136128]
perf_adjust_freq_unthr_context.part.79+0x198/0x1b0
[Wed Feb 18 20:53:49 2015] [811363d6] perf_event_task_tick+0xb6/0xf0
[Wed Feb 18 20:53:49 2015] [810967e5] scheduler_tick+0xd5/0x150
[Wed Feb 18 20:53:49 2015] [8106fe86] update_process_times+0x66/0x80
[Wed Feb 18 20:53:49 2015] [810be055]
tick_sched_handle.isra.16+0x25/0x60
[Wed Feb 18 20:53:49 2015] [810be0d1] tick_sched_timer+0x41/0x60
[Wed Feb 18 20:53:49 2015] [81089a57] __run_hrtimer+0x77/0x1d0
[Wed Feb 18 20:53:49 2015] [810be090] ?
tick_sched_handle.isra.16+0x60/0x60
[Wed Feb 18 20:53:49 2015] [8108a297] hrtimer_interrupt+0xf7/0x240
[Wed Feb 18 20:53:49 2015] [81039717]
local_apic_timer_interrupt+0x37/0x60
[Wed Feb 18 20:53:49 2015] [815f552f]
smp_apic_timer_interrupt+0x3f/0x60
[Wed Feb 18 20:53:49 2015] [815f3e9d] 

[ceph-users] Privileges for read-only CephFS access?

2015-02-18 Thread Oliver Schulz

Dear Ceph Experts,

is it possible to define a Ceph user/key with privileges
that allow for read-only CephFS access but do not allow
write or other modifications to the Ceph cluster?

I would like to export a sub-tree of our CephFS via HTTPS.
Alas, web-servers are inviting targets, so in the (hopefully
unlikely) event that the server is hacked, I want to
protect the Ceph cluster from file modification/deletion
and other possible nasty things.

The alternative would be to put an NFS- or SSHFS-proxy
between Ceph and the web-server. But I'd like to avoid the
additional complication if possible.


Cheers and thanks,

Oliver

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Privileges for read-only CephFS access?

2015-02-18 Thread Florian Haas
On Wed, Feb 18, 2015 at 10:28 PM, Oliver Schulz osch...@mpp.mpg.de wrote:
 Dear Ceph Experts,

 is it possible to define a Ceph user/key with privileges
 that allow for read-only CephFS access but do not allow
 write or other modifications to the Ceph cluster?

Warning, read this to the end, don't blindly do as I say. :)

All you should need to do is define a CephX identity that has only r
capabilities on the data pool (assuming you're using a default
configuration where your CephFS uses the data and metadata pools):

sudo ceph auth get-or-create client.readonly mds 'allow' osd 'allow r
pool=data' mon 'allow r'

That identity should then be able to mount the filesystem but not
write any data (use ceph-fuse -n client.readonly or mount -t ceph
-o name=readonly)
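
A concrete (untested) sketch, with made-up mount points and key locations:

# FUSE client
ceph-fuse -n client.readonly -k /etc/ceph/ceph.client.readonly.keyring /mnt/cephfs
# or the kernel client
mount -t ceph mon1:6789:/ /mnt/cephfs -o name=readonly,secretfile=/etc/ceph/readonly.secret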

That said, just touching files or creating them is only a metadata
operation that doesn't change anything in the data pool, so I think
that might still be allowed under these circumstances.

However, I've just tried the above with ceph-fuse on firefly, and I
was able to mount the filesystem that way and then echo something into
a previously existing file. After unmounting, remounting, and trying
to cat that file, I/O just hangs. It eventually does complete, but
this looks really fishy.

So I believe you've uncovered a CephFS bug. :)

Cheers,
Florian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG stuck degraded, undersized, unclean

2015-02-18 Thread Brian Rak

On 2/18/2015 3:01 PM, Florian Haas wrote:

On Wed, Feb 18, 2015 at 7:53 PM, Brian Rak b...@gameservers.com wrote:

We're running ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578),
and seeing this:

HEALTH_WARN 1 pgs degraded; 1 pgs stuck degraded; 1 pgs stuck unclean; 1 pgs
stuck undersized; 1 pgs undersized
pg 4.2af is stuck unclean for 77192.522960, current state
active+undersized+degraded, last acting [50,42]
pg 4.2af is stuck undersized for 980.617479, current state
active+undersized+degraded, last acting [50,42]
pg 4.2af is stuck degraded for 980.617902, current state
active+undersized+degraded, last acting [50,42]
pg 4.2af is active+undersized+degraded, acting [50,42]


However, ceph pg query doesn't really show any issues:
https://gist.githubusercontent.com/devicenull/9d911362e4de83c02e40/raw/565fe18163e261c8105e5493a4e90cc3c461ed9d/gistfile1.txt
(too long to post here)

I've also tried:

# ceph pg 4.2af mark_unfound_lost revert
pg has no unfound objects

How can I get Ceph to rebuild here?  The replica count is 3, but I can't
seem to figure out what's going on here.  Enabling various debug logs
doesn't reveal anything obvious to me.

I've tried restarting both OSDs, which did nothing.

What does your crushmap look like (ceph osd getcrushmap -o
/tmp/crushmap; crushtool -d /tmp/crushmap)? Does your placement logic
prevent Ceph from selecting an OSD for the third replica?

Cheers,
Florian


I have 5 hosts, and it's configured like this:

root default {
id -1   # do not change unnecessarily
# weight 204.979
alg straw
hash 0  # rjenkins1
item osd01 weight 12.670
item osd02 weight 14.480
item osd03 weight 14.480
item osd04 weight 79.860
item osd05 weight 83.490
}

rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

This should not be preventing the assignment (AFAIK).  Currently the PG 
is on osd01 and osd05.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-osd pegging CPU on giant, no snapshots involved this time

2015-02-18 Thread Mark Nelson

On 02/18/2015 02:19 PM, Florian Haas wrote:

Hey everyone,

I must confess I'm still not fully understanding this problem and
don't exactly know where to start digging deeper, but perhaps other
users have seen this and/or it rings a bell.

System info: Ceph giant on CentOS 7; approx. 240 OSDs, 6 pools using 2
different rulesets where the problem applies to hosts and PGs using a
bog-standard default crushmap.

Symptom: out of the blue, ceph-osd processes on a single OSD node
start going to 100% CPU utilization. The problem gets so bad that
the machine effectively becomes CPU-bound and can't cope with any
client requests anymore. Stopping and restarting all OSDs brings the
problem right back, as does rebooting the machine — right after
ceph-osd processes start, CPU utilization shoots up again. Stopping
and marking out several OSDs on the machine makes the problem go away,
but obviously causes massive backfilling. The only thing the logs show
while CPU utilization is implausibly high is slow requests (which would
be expected in a system that can barely do anything).

Now I've seen issues like this before on dumpling and firefly, but
besides the fact that they have all been addressed and should now be
fixed, they always involved the prior mass removal of RBD snapshots.
This system only used a handful of snapshots in testing, and is
presently not using any snapshots at all.

I'll be spending some time looking for clues in the log files of the
OSDs that were shut down which caused the problem to go away, but if
this sounds familiar to anyone willing to offer clues, I'd be more
than interested. :) Thanks!


Hi Florian,

Does a quick perf top tell you anything useful?
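
(Something along these lines, just as a sketch, to narrow the sampling
down to a busy OSD process:

perf top -p <pid of a busy ceph-osd>
)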



Cheers,
Florian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] metrics to monitor for performance bottlenecks?

2015-02-18 Thread Xu (Simon) Chen
Hey folks,

I have a ceph cluster supporting about 500 VMs using RBD. I am seeing
around 10-12k IOPS cluster-wide and IO wait time creeping up within the
VMs.

My suspicion is that I am pushing my ceph cluster to its limit in terms of
overall throughput. I am curious if there are metrics that can be passively
collected either in VMs or on ceph nodes to reveal the cluster is at its
peak. IO wait time inside of VMs might be a good one, but I am interested
in monitoring the ceph nodes directly as well. Ideally I want to track
those metrics, perform some trending analysis, and provision capacity (not
space, but throughput) before VM performance is impacted.
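
On the Ceph side, the kind of thing I have in mind (just a sketch) is
sampling the OSD latency counters, e.g.:

# per-OSD commit/apply latency
ceph osd perf
# full counter dump via the admin socket, run on an OSD host
ceph daemon osd.0 perf dump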

Any thoughts or experience on this matter?

Thanks.
-Simon
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG stuck degraded, undersized, unclean

2015-02-18 Thread Florian Haas
On Wed, Feb 18, 2015 at 7:53 PM, Brian Rak b...@gameservers.com wrote:
 We're running ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578),
 and seeing this:

 HEALTH_WARN 1 pgs degraded; 1 pgs stuck degraded; 1 pgs stuck unclean; 1 pgs
 stuck undersized; 1 pgs undersized
 pg 4.2af is stuck unclean for 77192.522960, current state
 active+undersized+degraded, last acting [50,42]
 pg 4.2af is stuck undersized for 980.617479, current state
 active+undersized+degraded, last acting [50,42]
 pg 4.2af is stuck degraded for 980.617902, current state
 active+undersized+degraded, last acting [50,42]
 pg 4.2af is active+undersized+degraded, acting [50,42]


 However, ceph pg query doesn't really show any issues:
 https://gist.githubusercontent.com/devicenull/9d911362e4de83c02e40/raw/565fe18163e261c8105e5493a4e90cc3c461ed9d/gistfile1.txt
 (too long to post here)

 I've also tried:

 # ceph pg 4.2af mark_unfound_lost revert
 pg has no unfound objects

 How can I get Ceph to rebuild here?  The replica count is 3, but I can't
 seem to figure out what's going on here.  Enabling various debug logs
 doesn't reveal anything obvious to me.

 I've tried restarting both OSDs, which did nothing.

What does your crushmap look like (ceph osd getcrushmap -o
/tmp/crushmap; crushtool -d /tmp/crushmap)? Does your placement logic
prevent Ceph from selecting an OSD for the third replica?

Cheers,
Florian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Privileges for read-only CephFS access?

2015-02-18 Thread Gregory Farnum
On Wed, Feb 18, 2015 at 1:58 PM, Florian Haas flor...@hastexo.com wrote:
 On Wed, Feb 18, 2015 at 10:28 PM, Oliver Schulz osch...@mpp.mpg.de wrote:
 Dear Ceph Experts,

 is it possible to define a Ceph user/key with privileges
 that allow for read-only CephFS access but do not allow
 write or other modifications to the Ceph cluster?

 Warning, read this to the end, don't blindly do as I say. :)

 All you should need to do is define a CephX identity that has only r
 capabilities on the data pool (assuming you're using a default
 configuration where your CephFS uses the data and metadata pools):

 sudo ceph auth get-or-create client.readonly mds 'allow' osd 'allow r
 pool=data' mon 'allow r'

 That identity should then be able to mount the filesystem but not
 write any data (use ceph-fuse -n client.readonly or mount -t ceph
 -o name=readonly)

 That said, just touching files or creating them is only a metadata
 operation that doesn't change anything in the data pool, so I think
 that might still be allowed under these circumstances.

...and deletes, unfortunately. :( I don't think this is presently a
thing it's possible to do until we get a much better user auth
capabilities system into CephFS.


 However, I've just tried the above with ceph-fuse on firefly, and I
 was able to mount the filesystem that way and then echo something into
 a previously existing file. After unmounting, remounting, and trying
 to cat that file, I/O just hangs. It eventually does complete, but
 this looks really fishy.

This is happening because the CephFS clients don't (can't, really, for
all the time we've spent thinking about it) check whether they have
read permissions on the underlying pool when buffering writes for a
file. I believe if you ran an fsync on the file you'd get an EROFS or
similar.
Anyway, the client happily buffers up the writes. Depending on how
exactly you remount then it might not be able to drop the MDS caps for
file access (due to having dirty data it can't get rid of), and those
caps have to time out before anybody else can access the file again.
So you've found an unpleasant oddity of how the POSIX interfaces map
onto this kind of distributed system, but nothing unexpected. :)
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] FreeBSD on RBD (KVM)

2015-02-18 Thread Josh Durgin
 From: Logan Barfield lbarfi...@tqhosting.com
 We've been running some tests to try to determine why our FreeBSD VMs
 are performing much worse than our Linux VMs backed by RBD, especially
 on writes.
 
 Our current deployment is:
 - 4x KVM Hypervisors (QEMU 2.0.0+dfsg-2ubuntu1.6)
 - 2x OSD nodes (8x SSDs each, 10Gbit links to hypervisors, pool has 2x
 replication across nodes)
 - Hypervisors have rbd_cache enabled
 - All VMs use cache=none currently.

If you don't have rbd cache writethrough until flush = true, this
configuration is unsafe - with cache=none, qemu will not send flushes.
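
A minimal ceph.conf sketch for the client side, to pair with
cache=writeback in qemu:

[client]
rbd cache = true
rbd cache writethrough until flush = true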
 
 In testing we were getting ~30MB/s writes, and ~100MB/s reads on
 FreeBSD 10.1.  On Linux VMs we're seeing ~150+MB/s for writes and
 reads (dd if=/dev/zero of=output bs=1M count=1024 oflag=direct).

I'm not very familiar with FreeBSD, but I'd guess it's sending smaller
I/Os for some reason. This could be due to trusting the sector size
qemu reports (this can be changed, though I don't remember the syntax
offhand), lower fs block size, or scheduler or block subsystem
configurables.  It could also be related to differences in block
allocation strategies by whatever FS you're using in the guest and
Linux filesystems. What FS are you using in each guest?

You can check the I/O sizes seen by rbd by adding something like this
to ceph.conf on a node running qemu:

[client]
debug rbd = 20
log file = /path/writeable/by/qemu.$pid.log

This will show the offset and length of requests in lines containing
aio_read and aio_write. If you're using giant you could instead gather
a trace of I/O to rbd via lttng.
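
For instance, assuming the log path above, something like

grep -E 'aio_(read|write)' /path/writeable/by/qemu.*.log

would pull out just those request lines.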

 I tested several configurations on both RBD and local SSDs, and the
 only time FreeBSD performance was comparable to Linux was with the
 following configuration:
 - Local SSD
 - Qemu cache=writeback
 - GPT journaling enabled
 
 We did see some performance improvement (~50MB/s writes instead of
 30MB/s) when using cache=writeback on RBD.
 
 I've read several threads regarding cache=none vs cache=writeback.
 cache=none is apparently safer for live migration, but cache=writeback
 is recommended by Ceph to prevent data loss.  Apparently there was a
 patch submitted for Qemu a few months ago to make cache=writeback
 safer for live migrations as well: http://tracker.ceph.com/issues/2467

RBD caching is already safe with live migration without this patch. It
just makes sure that it will continue to be safe in case of future
QEMU changes.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-giant installation error on centos 6.6

2015-02-18 Thread Wenxiao He
Hi,

The impatient me was using the quick guide (
http://ceph.com/docs/master/start/quick-start-preflight/) which merely
states "On CentOS, you may need to install EPEL" :)

I have a separate question: why does ceph-deploy always show "Error in
sys.exitfunc:" even though things look fine?

$ ceph-deploy new ceph1 ceph2
...
[ceph_deploy.new][DEBUG ] Creating a random mon key...
[ceph_deploy.new][DEBUG ] Writing monitor keyring to ceph.mon.keyring...
[ceph_deploy.new][DEBUG ] Writing initial config to ceph.conf...
Error in sys.exitfunc:

$ ceph-deploy install ceph1 ceph2
...
[ceph_deploy.gatherkeys][DEBUG ] Got ceph.bootstrap-mds.keyring key from
ceph1.
Error in sys.exitfunc:

$ ceph-deploy osd prepare ceph1:sdb ceph1:sdc ceph1:sdd ceph2:sdb ceph2:sdc
ceph2:sdd
...
[ceph_deploy.osd][DEBUG ] Host ceph1 is now ready for osd use.
...
[ceph_deploy.osd][DEBUG ] Host ceph2 is now ready for osd use.
Error in sys.exitfunc:




Regards,
Wenxiao

On Wed, Feb 18, 2015 at 10:38 AM, Travis Rhoden trho...@gmail.com wrote:

 Note that ceph-deploy would enable EPEL for you automatically on
 CentOS.  When doing a manual installation, the requirement for EPEL is
 called out here:
 http://ceph.com/docs/master/install/get-packages/#id8

 Though looking at that, we could probably update it to use the now
 much easier to use yum install epel-release.  :)

  - Travis

 On Wed, Feb 18, 2015 at 12:25 PM, Wenxiao He wenx...@gmail.com wrote:
  Thanks Brad. That solved the problem. I mistakenly assumed all
 dependencies
  are in http://ceph.com/rpm-giant/el6/x86_64/.
 
 
  Regards,
  Wenxiao
 
  On Tue, Feb 17, 2015 at 10:37 PM, Brad Hubbard bhubb...@redhat.com
 wrote:
 
  On 02/18/2015 12:43 PM, Wenxiao He wrote:
 
 
  Hello,
 
  I need some help as I am getting package dependency errors when trying
 to
  install ceph-giant on centos 6.6. See below for repo files and also
 the yum
  install output.
 
 
  --- Package python-imaging.x86_64 0:1.1.6-19.el6 will be installed
  -- Finished Dependency Resolution
  Error: Package: 1:librbd1-0.87-0.el6.x86_64 (Ceph)
  Requires: liblttng-ust.so.0()(64bit)
  Error: Package: gperftools-libs-2.0-11.el6.3.x86_64 (Ceph)
  Requires: libunwind.so.8()(64bit)
  Error: Package: 1:librados2-0.87-0.el6.x86_64 (Ceph)
  Requires: liblttng-ust.so.0()(64bit)
  Error: Package: 1:ceph-0.87-0.el6.x86_64 (Ceph)
  Requires: liblttng-ust.so.0()(64bit)
 
 
  Looks like you may need to install libunwind and lttng-ust from EPEL 6?
 
  They seem to be the packages that supply liblttng-ust.so and libunwind.so,
  so you could try installing those from EPEL 6 and see how that goes?
 
  Note that this should not be taken as the, or even a, authoritative answer
  :)
 
  Cheers,
  Brad
 
 
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Dumpling/Firefly/Hammer SSD/Memstore performance comparison

2015-02-18 Thread Mark Nelson

Hi Alex,

Thanks!  I didn't tweak the sharding settings at all, so they are just 
at the default values:


OPTION(osd_op_num_threads_per_shard, OPT_INT, 2)
OPTION(osd_op_num_shards, OPT_INT, 5)

I don't have really good insight yet into how tweaking these would 
affect single-osd performance.  I know the PCIe SSDs do have multiple 
controllers on-board so perhaps increasing the number of shards would 
improve things, but I suspect that going too high could maybe start 
hurting performance as well.  Have you done any testing here?  It could 
be an interesting follow-up paper.
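
For anyone who wants to experiment, overriding them is just a ceph.conf
change; the values here are arbitrary examples, not a recommendation:

[osd]
osd op num shards = 10
osd op num threads per shard = 2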


Mark

On 02/18/2015 02:34 AM, Alexandre DERUMIER wrote:

Nice Work Mark !

I don't see any tuning of the sharding options in the sample config file

(osd_op_num_threads_per_shard, osd_op_num_shards, ...)

As you only use 1 SSD for the bench, I think tuning these should improve the
results for hammer?



- Mail original -
De: Mark Nelson mnel...@redhat.com
À: ceph-devel ceph-de...@vger.kernel.org
Cc: ceph-users ceph-users@lists.ceph.com
Envoyé: Mardi 17 Février 2015 18:37:01
Objet: [ceph-users] Ceph Dumpling/Firefly/Hammer SSD/Memstore performance   
comparison

Hi All,

I wrote up a short document describing some tests I ran recently to look
at how SSD backed OSD performance has changed across our LTS releases.
This is just looking at RADOS performance and not RBD or RGW. It also
doesn't offer any real explanations regarding the results. It's just a
first high level step toward understanding some of the behaviors folks
on the mailing list have reported over the last couple of releases. I
hope you find it useful.

Mark

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Dumpling/Firefly/Hammer SSD/Memstore performance comparison

2015-02-18 Thread Mark Nelson

Hi Andrei,

On 02/18/2015 09:08 AM, Andrei Mikhailovsky wrote:


Mark, many thanks for your effort and ceph performance tests. This puts
things in perspective.

Looking at the results, I was a bit concerned that the IOPs performance
in neither release comes even marginally close to the capabilities of
the underlying ssd device. Even the fastest PCI ssds have only managed
to achieve about 1/6th of the IOPs of the raw device.


Perspective is definitely good!  Any time you are dealing with latency 
sensitive workloads, there are a lot of bottlenecks that can limit your 
performance.  There's a world of difference between streaming data to a 
raw SSD as fast as possible and writing data out to a distributed 
storage system that is calculating data placement, invoking the TCP 
stack, doing CRC checks, journaling writes, invoking the VM layer to 
cache data in case it's hot (which in this case it's not).




I guess there is a great deal more optimisation to be done in the
upcoming LTS releases to bring the IOPs rate closer to the raw device
performance.


There is definitely still room for improvement!  It's important to 
remember though that there is always going to be a trade off between 
flexibility, data integrity, and performance.  If low latency is your 
number one need before anything else, you are probably best off 
eliminating as much software as possible between you and the device 
(except possibly if you can make clever use of caching).  While Ceph 
itself is some times the bottleneck, in many cases we've found that 
bottlenecks in the software that surrounds Ceph are just as big 
obstacles (filesystem, VM layer, TCP stack, leveldb, etc).  If you need 
a distributed storage system that can universally maintain native SSD 
levels of performance, the entire stack has to be highly tuned.




I have done some testing in the past and noticed that despite the server
having a lot of unused resources (about 40-50% server idle and about
60-70% ssd idle) Ceph would not perform well when used with ssds. I
was testing with Firefly + auth and my IOPs rate was around the 3K mark.
Something is holding ceph back from performing well with ssds (((


Out of curiosity, did you try the same tests directly on the SSD?



Andrei



*From: *Mark Nelson mnel...@redhat.com
*To: *ceph-devel ceph-de...@vger.kernel.org
*Cc: *ceph-users@lists.ceph.com
*Sent: *Tuesday, 17 February, 2015 5:37:01 PM
*Subject: *[ceph-users] Ceph Dumpling/Firefly/Hammer SSD/Memstore
performancecomparison

Hi All,

I wrote up a short document describing some tests I ran recently to
look
at how SSD backed OSD performance has changed across our LTS releases.
This is just looking at RADOS performance and not RBD or RGW.  It also
doesn't offer any real explanations regarding the results.  It's just a
first high level step toward understanding some of the behaviors folks
on the mailing list have reported over the last couple of releases.  I
hope you find it useful.

Mark

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Introducing Learning Ceph : The First ever Book on Ceph

2015-02-18 Thread federico


To be exact, the platform used throughout is CentOS 6.4... I am reading my copy 
right now :)

Best -F

- Original Message -
From: SUNDAY A. OLUTAYO olut...@sadeeb.com
To: Andrei Mikhailovsky and...@arhont.com
Cc: ceph-users@lists.ceph.com
Sent: Monday, February 16, 2015 3:28:45 AM
Subject: Re: [ceph-users] Introducing Learning Ceph : The First ever Book on 
Ceph

I bought a copy some days ago; great job, but it is Red Hat specific. 

Thanks, 

Sunday Olutayo 
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Dumpling/Firefly/Hammer SSD/Memstore performance comparison

2015-02-18 Thread Tyler Brekke
https://github.com/ceph/ceph-tools/tree/master/cbt

On Tue, Feb 17, 2015 at 12:16 PM, Stephen Hindle shin...@llnw.com wrote:

 I was wondering what the 'CBT' tool is?  Google is useless for that
 acronym...

 Thanks!
 Steve

 On Tue, Feb 17, 2015 at 10:37 AM, Mark Nelson mnel...@redhat.com wrote:
  Hi All,
 
  I wrote up a short document describing some tests I ran recently to look
 at
  how SSD backed OSD performance has changed across our LTS releases. This
 is
  just looking at RADOS performance and not RBD or RGW.  It also doesn't
 offer
  any real explanations regarding the results.  It's just a first high
 level
  step toward understanding some of the behaviors folks on the mailing list
  have reported over the last couple of releases.  I hope you find it
 useful.
 
  Mark
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 


 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Dumpling/Firefly/Hammer SSD/Memstore performance comparison

2015-02-18 Thread Andrei Mikhailovsky
Mark, many thanks for your effort and ceph performance tests. This puts things 
in perspective. 

Looking at the results, I was a bit concerned that the IOPs performance in 
neither release comes even marginally close to the capabilities of the 
underlying ssd device. Even the fastest PCI ssds have only managed to achieve 
about 1/6th of the IOPs of the raw device. 

I guess there is a great deal more optimisation to be done in the upcoming LTS 
releases to bring the IOPs rate closer to the raw device performance. 

I have done some testing in the past and noticed that despite the server having 
a lot of unused resources (about 40-50% server idle and about 60-70% ssd idle) 
Ceph would not perform well when used with ssds. I was testing with Firefly 
+ auth and my IOPs rate was around the 3K mark. Something is holding ceph back 
from performing well with ssds ((( 

Andrei 

- Original Message -

 From: Mark Nelson mnel...@redhat.com
 To: ceph-devel ceph-de...@vger.kernel.org
 Cc: ceph-users@lists.ceph.com
 Sent: Tuesday, 17 February, 2015 5:37:01 PM
 Subject: [ceph-users] Ceph Dumpling/Firefly/Hammer SSD/Memstore
 performance comparison

 Hi All,

 I wrote up a short document describing some tests I ran recently to
 look
 at how SSD backed OSD performance has changed across our LTS
 releases.
 This is just looking at RADOS performance and not RBD or RGW. It also
 doesn't offer any real explanations regarding the results. It's just
 a
 first high level step toward understanding some of the behaviors
 folks
 on the mailing list have reported over the last couple of releases. I
 hope you find it useful.

 Mark

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpectedly low number of concurrent backfills

2015-02-18 Thread Florian Haas
On Wed, Feb 18, 2015 at 6:56 AM, Gregory Farnum g...@gregs42.com wrote:
 On Tue, Feb 17, 2015 at 9:48 PM, Florian Haas flor...@hastexo.com wrote:
 On Tue, Feb 17, 2015 at 11:19 PM, Gregory Farnum g...@gregs42.com wrote:
 On Tue, Feb 17, 2015 at 12:09 PM, Florian Haas flor...@hastexo.com wrote:
 Hello everyone,

 I'm seeing some OSD behavior that I consider unexpected; perhaps
 someone can shed some insight.

 Ceph giant (0.87.0), osd max backfills and osd recovery max active
 both set to 1.

 Please take a moment to look at the following ceph health detail screen 
 dump:

 HEALTH_WARN 14 pgs backfill; 1 pgs backfilling; 15 pgs stuck unclean;
 recovery 16/65732491 objects degraded (0.000%); 328254/65732491
 objects misplaced (0.499%)
 pg 20.3db is stuck unclean for 13547.432043, current state
 active+remapped+wait_backfill, last acting [45,90,157]
 pg 15.318 is stuck unclean for 13547.380581, current state
 active+remapped+wait_backfill, last acting [41,17,120]
 pg 15.34a is stuck unclean for 13548.115170, current state
 active+remapped+wait_backfill, last acting [64,87,80]
 pg 20.6f is stuck unclean for 13548.019218, current state
 active+remapped+wait_backfill, last acting [13,38,98]
 pg 20.44c is stuck unclean for 13548.075430, current state
 active+remapped+wait_backfill, last acting [174,127,139]
 pg 20.bc is stuck unclean for 13545.743397, current state
 active+remapped+wait_backfill, last acting [72,64,104]
 pg 15.1ac is stuck unclean for 13548.181461, current state
 active+remapped+wait_backfill, last acting [121,145,84]
 pg 15.1af is stuck unclean for 13547.962269, current state
 active+remapped+backfilling, last acting [150,62,101]
 pg 20.396 is stuck unclean for 13547.835109, current state
 active+remapped+wait_backfill, last acting [134,49,96]
 pg 15.1ba is stuck unclean for 13548.128752, current state
 active+remapped+wait_backfill, last acting [122,63,162]
 pg 15.3fd is stuck unclean for 13547.644431, current state
 active+remapped+wait_backfill, last acting [156,38,131]
 pg 20.41c is stuck unclean for 13548.133470, current state
 active+remapped+wait_backfill, last acting [78,85,168]
 pg 20.525 is stuck unclean for 13545.272774, current state
 active+remapped+wait_backfill, last acting [76,57,148]
 pg 15.1ca is stuck unclean for 13547.944928, current state
 active+remapped+wait_backfill, last acting [157,19,36]
 pg 20.11e is stuck unclean for 13545.368614, current state
 active+remapped+wait_backfill, last acting [36,134,8]
 pg 20.525 is active+remapped+wait_backfill, acting [76,57,148]
 pg 20.44c is active+remapped+wait_backfill, acting [174,127,139]
 pg 20.41c is active+remapped+wait_backfill, acting [78,85,168]
 pg 15.3fd is active+remapped+wait_backfill, acting [156,38,131]
 pg 20.3db is active+remapped+wait_backfill, acting [45,90,157]
 pg 20.396 is active+remapped+wait_backfill, acting [134,49,96]
 pg 15.34a is active+remapped+wait_backfill, acting [64,87,80]
 pg 15.318 is active+remapped+wait_backfill, acting [41,17,120]
 pg 15.1ca is active+remapped+wait_backfill, acting [157,19,36]
 pg 15.1ba is active+remapped+wait_backfill, acting [122,63,162]
 pg 15.1ac is active+remapped+wait_backfill, acting [121,145,84]
 pg 15.1af is active+remapped+backfilling, acting [150,62,101]
 pg 20.11e is active+remapped+wait_backfill, acting [36,134,8]
 pg 20.bc is active+remapped+wait_backfill, acting [72,64,104]
 pg 20.6f is active+remapped+wait_backfill, acting [13,38,98]
 recovery 16/65732491 objects degraded (0.000%); 328254/65732491
 objects misplaced (0.499%)

 As you can see, there is barely any overlap between the acting OSDs
 for those PGs. osd max backfills should only limit the number of
 concurrent backfills out of a single OSD, and so in the situation
 above I would expect the 15 backfills to happen mostly concurrently.
 As it is they are being serialized, and that seems to needlessly slow
 down the process and extend the time needed to complete recovery.

 I'm pretty sure I'm missing something obvious here, but what is it?

 The max backfill values cover both incoming and outgoing results.
 Presumably these are all waiting on a small set of target OSDs which
 are currently receiving backfills of some other PG.

 Thanks for the reply, and I am aware of that, but I am not sure how it
 applies here.

 What I quoted was the complete list of then-current backfills in the
 cluster. Those are *all* the PGs affected by backfills. And they're so
 scattered across OSDs that there is barely any overlap. The only OSDs
 I even see listed twice are 38 and 64, which would affect PGs
 15.3fd/20.6f 15.34a/20.bc. What is causing the others to wait?

 Or am I misunderstanding the acting value here and some other OSDs
 are involved, and if so, how would I find out what those are?

 Yes, unless I'm misremembering. Look at the pg dump for those PGs and
 check out the up versus acting values. The acting ones are what
 the PG is currently remapped to; they're waiting to backfill onto the
 proper set of up OSDs.
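
 For example (sketch, not verified against this cluster), ceph pg map shows
 both sets for a given PG:

 # ceph pg map 15.1af
 osdmap eNNN pg 15.1af (15.1af) -> up [x,y,z] acting [150,62,101]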

Re: [ceph-users] Privileges for read-only CephFS access?

2015-02-18 Thread Oliver Schulz

Hi Florian,

On 18.02.2015 22:58, Florian Haas wrote:

is it possible to define a Ceph user/key with privileges
that allow for read-only CephFS access but do not allow

All you should need to do is [...]
However, I've just tried the above with ceph-fuse on firefly, and [...]
So I believe you've uncovered a CephFS bug. :)


many thanks for the advice and the tests!

I guess I'll have to go with a proxy for now, to be safe.
But if it's possible (design-wise) read-only CephFS
access might be a useful feature to have in the future.


Cheers,

Oliver
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] wider rados namespace support?

2015-02-18 Thread Josh Durgin

On 02/12/2015 05:59 PM, Blair Bethwaite wrote:

My particular interest is for a less dynamic environment, so manual
key distribution is not a problem. Re. OpenStack, it's probably good
enough to have the Cinder host creating them as needed (presumably
stored in its DB) and just send the secret keys over the message bus
to compute hosts as needed - if your infrastructure network is not
trusted then you've got bigger problems to worry about. It's true that
a lot of clouds would end up logging the secrets in various places,
but then they are only useful on particular hosts.

I guess there is nothing special about the default  namespace
compared to any other as far as cephx is concerned. It would be nice
to have something of a nested auth, so that the client requires
explicit permission to read the default namespace (configured
out-of-band when setting up compute hosts) and further permission for
particular non-default namespaces (managed by the cinder rbd driver),
that way leaking secrets from cinder gives less exposure - but I guess
that would be a bit of a change from the current namespace
functionality.


You can restrict client access to the default namespace like this with
the existing ceph capabilities. For the proposed rbd usage of
namespaces, for example, you could allow read-only access to the
rbd_id.* objects in the default namespace, and full access to other
specific namespaces. Something like:

mon 'allow r' osd 'allow r class-read pool=foo namespace="" 
object_prefix rbd_id, allow rwx pool=foo namespace=bar'


Cinder or other management layers would still want broader access, but
these more restricted keys could be the only ones exposed to QEMU.

Josh


On 13 February 2015 at 05:57, Josh Durgin josh.dur...@inktank.com wrote:

On 02/10/2015 07:54 PM, Blair Bethwaite wrote:


Just came across this in the docs:
Currently (i.e., firefly), namespaces are only useful for
applications written on top of librados. Ceph clients such as block
device, object storage and file system do not currently support this
feature.

Then found:
https://wiki.ceph.com/Planning/Sideboard/rbd%3A_namespace_support

Is there any progress or plans to address this (particularly for rbd
clients but also cephfs)?



No immediate plans for rbd. That blueprint still seems like a
reasonable way to implement it to me.

The one part I'm less sure about is the OpenStack or other higher level
integration, which would need to start adding secret keys to libvirt
dynamically.






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Privileges for read-only CephFS access?

2015-02-18 Thread Oliver Schulz

Dear Greg,

On 18.02.2015 23:41, Gregory Farnum wrote:

is it possible to define a Ceph user/key with privileges
that allow for read-only CephFS access but do not allow

...and deletes, unfortunately. :( I don't think this is presently a
thing it's possible to do until we get a much better user auth
capabilities system into CephFS.


thanks a lot for the in-depth explanation!

I guess this is indeed not a trivial thing to do. Well, it's
probably best anyhow to isolate the Ceph cluster from potentially
vulnerable systems.


Cheers,

Oliver
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Privileges for read-only CephFS access?

2015-02-18 Thread Gregory Farnum
On Wed, Feb 18, 2015 at 3:30 PM, Florian Haas flor...@hastexo.com wrote:
 On Wed, Feb 18, 2015 at 11:41 PM, Gregory Farnum g...@gregs42.com wrote:
 On Wed, Feb 18, 2015 at 1:58 PM, Florian Haas flor...@hastexo.com wrote:
 On Wed, Feb 18, 2015 at 10:28 PM, Oliver Schulz osch...@mpp.mpg.de wrote:
 Dear Ceph Experts,

 is it possible to define a Ceph user/key with privileges
 that allow for read-only CephFS access but do not allow
 write or other modifications to the Ceph cluster?

 Warning, read this to the end, don't blindly do as I say. :)

 All you should need to do is define a CephX identity that has only r
 capabilities on the data pool (assuming you're using a default
 configuration where your CephFS uses the data and metadata pools):

 sudo ceph auth get-or-create client.readonly mds 'allow' osd 'allow r
 pool=data' mon 'allow r'

 That identity should then be able to mount the filesystem but not
 write any data (use ceph-fuse -n client.readonly or mount -t ceph
 -o name=readonly)

 That said, just touching files or creating them is only a metadata
 operation that doesn't change anything in the data pool, so I think
 that might still be allowed under these circumstances.

 ...and deletes, unfortunately. :(

 If the file being deleted is empty, yes. If the file has any content,
 then the removal should hit the data pool before it hits metadata, and
 should fail there. No?

No, all data deletion is handled by the MDS, for two reasons:
1) You don't want clients to have to block on deletes in time linear
with the number of objects
2) (IMPORTANT) if clients unlink a file which is still opened
elsewhere, it can't be deleted until closed. ;)


I don't think this is presently a
 thing it's possible to do until we get a much better user auth
 capabilities system into CephFS.


 However, I've just tried the above with ceph-fuse on firefly, and I
 was able to mount the filesystem that way and then echo something into
 a previously existing file. After unmounting, remounting, and trying
 to cat that file, I/O just hangs. It eventually does complete, but
 this looks really fishy.

 This is happening because the CephFS clients don't (can't, really, for
 all the time we've spent thinking about it) check whether they have
 read permissions on the underlying pool when buffering writes for a
 file. I believe if you ran an fsync on the file you'd get an EROFS or
 similar.
 Anyway, the client happily buffers up the writes. Depending on how
 exactly you remount then it might not be able to drop the MDS caps for
 file access (due to having dirty data it can't get rid of), and those
 caps have to time out before anybody else can access the file again.
 So you've found an unpleasant oddity of how the POSIX interfaces map
 onto this kind of distributed system, but nothing unexpected. :)

 Oliver's point is valid though; it would be nice if you could somehow
 make CephFS read-only to some (or all) clients server side, the way an
 NFS ro export does.

Yeah. Yet another thing that would be good but requires real
permission bits on the MDS. It'll happen eventually, but we have other
bits that seem a lot more important...fsck, stability, single-tenant
usability
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Privileges for read-only CephFS access?

2015-02-18 Thread Florian Haas
On Wed, Feb 18, 2015 at 11:41 PM, Gregory Farnum g...@gregs42.com wrote:
 On Wed, Feb 18, 2015 at 1:58 PM, Florian Haas flor...@hastexo.com wrote:
 On Wed, Feb 18, 2015 at 10:28 PM, Oliver Schulz osch...@mpp.mpg.de wrote:
 Dear Ceph Experts,

 is it possible to define a Ceph user/key with privileges
 that allow for read-only CephFS access but do not allow
 write or other modifications to the Ceph cluster?

 Warning, read this to the end, don't blindly do as I say. :)

 All you should need to do is define a CephX identity that has only r
 capabilities on the data pool (assuming you're using a default
 configuration where your CephFS uses the data and metadata pools):

 sudo ceph auth get-or-create client.readonly mds 'allow' osd 'allow r
 pool=data' mon 'allow r'

 That identity should then be able to mount the filesystem but not
 write any data (use ceph-fuse -n client.readonly or mount -t ceph
 -o name=readonly)

 That said, just touching files or creating them is only a metadata
 operation that doesn't change anything in the data pool, so I think
 that might still be allowed under these circumstances.

 ...and deletes, unfortunately. :(

If the file being deleted is empty, yes. If the file has any content,
then the removal should hit the data pool before it hits metadata, and
should fail there. No?

I don't think this is presently a
 thing it's possible to do until we get a much better user auth
 capabilities system into CephFS.


 However, I've just tried the above with ceph-fuse on firefly, and I
 was able to mount the filesystem that way and then echo something into
 a previously existing file. After unmounting, remounting, and trying
 to cat that file, I/O just hangs. It eventually does complete, but
 this looks really fishy.

 This is happening because the CephFS clients don't (can't, really, for
 all the time we've spent thinking about it) check whether they have
 read permissions on the underlying pool when buffering writes for a
 file. I believe if you ran an fsync on the file you'd get an EROFS or
 similar.
 Anyway, the client happily buffers up the writes. Depending on how
 exactly you remount then it might not be able to drop the MDS caps for
 file access (due to having dirty data it can't get rid of), and those
 caps have to time out before anybody else can access the file again.
 So you've found an unpleasant oddity of how the POSIX interfaces map
 onto this kind of distributed system, but nothing unexpected. :)

Oliver's point is valid though; it would be nice if you could somehow
make CephFS read-only to some (or all) clients server side, the way an
NFS ro export does.

Cheers,
Florian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rbd: I/O Errors in low memory situations

2015-02-18 Thread Sebastian Köhler [Alfahosting GmbH]

Hi,

yesterday we had the problem that one of our cluster clients 
remounted an rbd device in read-only mode. We found this[1] stack trace 
in the logs. We investigated further and found similar traces on all 
other machines that are using the rbd kernel module. It seems to me that 
whenever there is a swapping situation on a client those I/O errors occur.
Is there anything we can do or is this something that needs to be fixed 
in the code?



client setup:
# uname -v
#1 SMP Debian 3.16.7-ckt2-1~bpo70+1 (2014-12-08)

# rbd -v
ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)

# lsb_release -r
Release:6.0.10


Cheers,
Sebastian


[1]
Feb 17 22:52:25 six kernel: [2401866.069932] kworker/6:1: page 
allocation failure: order:1, mode:0x204020
Feb 17 22:52:25 six kernel: [2401866.069939] CPU: 6 PID: 4474 Comm: 
kworker/6:1 Tainted: GW 3.16.0-0.bpo.4-amd64 #1 Debian 
3.16.7-ckt2-1~bpo70+1
Feb 17 22:52:25 six kernel: [2401866.069940] Hardware name: Dell Inc. 
PowerEdge R710/0PV9DG, BIOS 1.3.6 10/30/2009
Feb 17 22:52:25 six kernel: [2401866.069947] Workqueue: rbd1 
rbd_request_workfn [rbd]
Feb 17 22:52:25 six kernel: [2401866.069949]   
0001 81541f8f 00204020
Feb 17 22:52:25 six kernel: [2401866.069951]  811519ed 
0001 88032fff2c00 0002
Feb 17 22:52:25 six kernel: [2401866.069953]  818e24d0 
00010002 88032fff2c08 0092

Feb 17 22:52:25 six kernel: [2401866.069955] Call Trace:
Feb 17 22:52:25 six kernel: [2401866.069963]  [81541f8f] ? 
dump_stack+0x41/0x51
Feb 17 22:52:25 six kernel: [2401866.069969]  [811519ed] ? 
warn_alloc_failed+0xfd/0x160
Feb 17 22:52:25 six kernel: [2401866.069972]  [8115606f] ? 
__alloc_pages_nodemask+0x91f/0xbb0
Feb 17 22:52:25 six kernel: [2401866.069977]  [8119fe00] ? 
kmem_getpages+0x60/0x110
Feb 17 22:52:25 six kernel: [2401866.069979]  [811a1648] ? 
fallback_alloc+0x158/0x220
Feb 17 22:52:25 six kernel: [2401866.069981]  [811a1f44] ? 
kmem_cache_alloc+0x1a4/0x1e0
Feb 17 22:52:25 six kernel: [2401866.069987]  [a0627889] ? 
ceph_osdc_alloc_request+0x69/0x320 [libceph]
Feb 17 22:52:25 six kernel: [2401866.069989]  [a065f53b] ? 
rbd_osd_req_create.isra.17+0x7b/0x190 [rbd]
Feb 17 22:52:25 six kernel: [2401866.069992]  [a0661fd5] ? 
rbd_img_request_fill+0x2b5/0x900 [rbd]
Feb 17 22:52:25 six kernel: [2401866.069995]  [a0663485] ? 
rbd_request_workfn+0x235/0x350 [rbd]
Feb 17 22:52:25 six kernel: [2401866.07]  [810878fc] ? 
process_one_work+0x15c/0x450
Feb 17 22:52:25 six kernel: [2401866.070003]  [81088b52] ? 
worker_thread+0x112/0x540
Feb 17 22:52:25 six kernel: [2401866.070005]  [81088a40] ? 
create_and_start_worker+0x60/0x60
Feb 17 22:52:25 six kernel: [2401866.070008]  [8108f511] ? 
kthread+0xc1/0xe0
Feb 17 22:52:25 six kernel: [2401866.070010]  [8108f450] ? 
flush_kthread_worker+0xb0/0xb0
Feb 17 22:52:25 six kernel: [2401866.070013]  [815483bc] ? 
ret_from_fork+0x7c/0xb0
Feb 17 22:52:25 six kernel: [2401866.070015]  [8108f450] ? 
flush_kthread_worker+0xb0/0xb0

Feb 17 22:52:25 six kernel: [2401866.070017] Mem-Info:
Feb 17 22:52:25 six kernel: [2401866.070018] Node 0 Normal per-cpu:
Feb 17 22:52:25 six kernel: [2401866.070020] CPU0: hi:  186, btch: 
31 usd:  46
Feb 17 22:52:25 six kernel: [2401866.070021] CPU1: hi:  186, btch: 
31 usd: 168
Feb 17 22:52:25 six kernel: [2401866.070023] CPU2: hi:  186, btch: 
31 usd: 154
Feb 17 22:52:25 six kernel: [2401866.070024] CPU3: hi:  186, btch: 
31 usd: 177
Feb 17 22:52:25 six kernel: [2401866.070025] CPU4: hi:  186, btch: 
31 usd: 128
Feb 17 22:52:25 six kernel: [2401866.070026] CPU5: hi:  186, btch: 
31 usd: 152
Feb 17 22:52:25 six kernel: [2401866.070028] CPU6: hi:  186, btch: 
31 usd:  76
Feb 17 22:52:25 six kernel: [2401866.070029] CPU7: hi:  186, btch: 
31 usd: 109
Feb 17 22:52:25 six kernel: [2401866.070030] CPU8: hi:  186, btch: 
31 usd: 183
Feb 17 22:52:25 six kernel: [2401866.070031] CPU9: hi:  186, btch: 
31 usd: 180
Feb 17 22:52:25 six kernel: [2401866.070033] CPU   10: hi:  186, btch: 
31 usd: 149
Feb 17 22:52:25 six kernel: [2401866.070034] CPU   11: hi:  186, btch: 
31 usd: 170
Feb 17 22:52:25 six kernel: [2401866.070035] CPU   12: hi:  186, btch: 
31 usd: 182
Feb 17 22:52:25 six kernel: [2401866.070036] CPU   13: hi:  186, btch: 
31 usd: 169
Feb 17 22:52:25 six kernel: [2401866.070038] CPU   14: hi:  186, btch: 
31 usd: 157
Feb 17 22:52:25 six kernel: [2401866.070039] CPU   15: hi:  186, btch: 
31 usd: 176

Feb 17 22:52:25 six kernel: [2401866.070040] Node 1 DMA per-cpu:
Feb 17 22:52:25 six kernel: [2401866.070041] CPU0: hi:0, btch: 
 1 usd:   0
Feb 17 22:52:25 six kernel: [2401866.070043] CPU1: hi:0, btch: 
 1 usd:   0
Feb 17 22:52:25 six kernel: [2401866.070044] CPU2: hi:0, btch: 
 1 usd:  

[ceph-users] OSD Startup Best Practice: gpt/udev or SysVInit/systemd ?

2015-02-18 Thread Anthony Alba
Hi Cephers,

What is your best practice for starting up OSDs?

I am trying to determine the most robust technique on CentOS 7 where I
have too much choice:

udev/gpt/uuid or /etc/init.d/ceph or /etc/systemd/system/ceph-osd@X

1. Use udev/gpt/UUID: no OSD  sections in  /etc/ceph/mycluster.conf or
premounts in /etc/fstab.
Let udev + ceph-disk-activate do its magic.

2. Use /etc/init.d/ceph start osd or systemctl start ceph-osd@N
a. do you change partition UUID so no udev kicks in?
b. do you keep  [osd.N] sections in /etc/ceph/mycluster.conf
c. premount all journals/OSDs in /etc/fstab?

The problem with this approach, though very explicit and robust, is
that it is hard to maintain
/etc/fstab on the OSD hosts.
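
For reference, the udev/GPT route (option 1) boils down to something like
this (a sketch only; ceph-disk handles the mounts, so nothing needs to go
into fstab):

# show partitions and their ceph roles
ceph-disk list
# activate everything carrying the ceph GPT type codes
ceph-disk activate-all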

- Anthony
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com