Re: [ceph-users] NFS interaction with RBD

2015-05-27 Thread Jens-Christian Fischer
George,

I will let Christian provide you with the details. As far as I know, it was enough 
to just do an ‘ls’ on all of the attached drives.

we are using Qemu 2.0:

$ dpkg -l | grep qemu
ii  ipxe-qemu           1.0.0+git-2013.c3d1e78-2ubuntu1  all    PXE boot firmware - ROM images for qemu
ii  qemu-keymaps        2.0.0+dfsg-2ubuntu1.11           all    QEMU keyboard maps
ii  qemu-system         2.0.0+dfsg-2ubuntu1.11           amd64  QEMU full system emulation binaries
ii  qemu-system-arm     2.0.0+dfsg-2ubuntu1.11           amd64  QEMU full system emulation binaries (arm)
ii  qemu-system-common  2.0.0+dfsg-2ubuntu1.11           amd64  QEMU full system emulation binaries (common files)
ii  qemu-system-mips    2.0.0+dfsg-2ubuntu1.11           amd64  QEMU full system emulation binaries (mips)
ii  qemu-system-misc    2.0.0+dfsg-2ubuntu1.11           amd64  QEMU full system emulation binaries (miscelaneous)
ii  qemu-system-ppc     2.0.0+dfsg-2ubuntu1.11           amd64  QEMU full system emulation binaries (ppc)
ii  qemu-system-sparc   2.0.0+dfsg-2ubuntu1.11           amd64  QEMU full system emulation binaries (sparc)
ii  qemu-system-x86     2.0.0+dfsg-2ubuntu1.11           amd64  QEMU full system emulation binaries (x86)
ii  qemu-utils          2.0.0+dfsg-2ubuntu1.11           amd64  QEMU utilities

cheers
jc

-- 
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch

http://www.switch.ch/stories

On 26.05.2015, at 19:12, Georgios Dimitrakakis gior...@acmac.uoc.gr wrote:

 Jens-Christian,
 
 how did you test that? Did you just try to write to them simultaneously? 
 Are there any other tests one can perform to verify that?
 
 In our installation we have a VM with 30 RBD volumes mounted which are all 
 exported via NFS to other VMs.
 No one has complained so far, but the load/usage is very minimal.
 If this problem really exists, then very soon, once the trial phase is over, we 
 will have millions of complaints :-(
 
 What version of QEMU are you using? We are using the one provided by Ceph in 
 qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64.rpm
 
 Best regards,
 
 George
 
 I think we (i.e. Christian) found the problem:
 
 We created a test VM with 9 mounted RBD volumes (no NFS server). As
 soon as he hit all disks, we started to experience these 120 second
 timeouts. We realized that the QEMU process on the hypervisor is
 opening a TCP connection to every OSD for every mounted volume -
 exceeding the 1024 FD limit.
 
 So no deep scrubbing etc, but simply too many connections…
 
 cheers
 jc
 
 --
 SWITCH
 Jens-Christian Fischer, Peta Solutions
 Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
 phone +41 44 268 15 15, direct +41 44 268 15 71
 jens-christian.fisc...@switch.ch [3]
 http://www.switch.ch
 
 http://www.switch.ch/stories
 
 On 25.05.2015, at 06:02, Christian Balzer  wrote:
 
 Hello,
 
 let's compare your case with John-Paul's.
 
 Different OS and Ceph versions (thus we can assume different NFS versions
 as well).
 The only common thing is that both of you added OSDs and are likely
 suffering from delays stemming from Ceph re-balancing or deep-scrubbing.
 
 Ceph logs will only pipe up when things have been blocked for more than 30
 seconds; NFS might take offense at lower values (or the accumulation of
 several distributed delays).
 
 You added 23 OSDs, tell us more about your cluster, HW, network.
 Were these added to the existing 16 nodes, or are these on new storage nodes
 (so could there be something different with those nodes?), and how busy is
 your network, CPU?
 Running something like collectd to gather all ceph perf data and other
 data from the storage nodes and then feeding it to graphite (or similar)
 can be VERY helpful to identify if something is going wrong and what it is
 in particular.
 Otherwise run atop on your storage nodes to identify if CPU, network, or
 specific HDDs/OSDs are bottlenecks.
 
 Deep scrubbing can be _very_ taxing; do your problems persist if you inject
 an osd_scrub_sleep value of 0.5 into your running cluster (lower that
 until it hurts again) or if you turn off deep scrubs altogether for the
 moment?
 
 Christian
 
 On Sat, 23 May 2015 23:28:32 +0200 Jens-Christian Fischer wrote:
 
 We see something very similar on our Ceph cluster, starting as of
 today.
 
 We use a 16 node, 102 OSD Ceph installation as the basis for an
 Icehouse
 OpenStack cluster (we applied the RBD patches for live migration
 etc)
 
 On this cluster we have a big ownCloud installation (Sync & Share)
 that
 stores its files on three NFS servers, each

Re: [ceph-users] NFS interaction with RBD

2015-05-26 Thread Jens-Christian Fischer
I think we (i.e. Christian) found the problem:

We created a test VM with 9 mounted RBD volumes (no NFS server). As soon as he 
hit all disks, we started to experience these 120 second timeouts. We realized 
that the QEMU process on the hypervisor is opening a TCP connection to every 
OSD for every mounted volume - exceeding the 1024 FD limit.

So no deep scrubbing etc, but simply too many connections…
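
A quick way to check this on a hypervisor is to compare the open file descriptors of the 
qemu process with its limit. A rough sketch (the instance name is just an example, and 
pgrep/lsof need to be installed):

--- cut ---
# PID of the qemu process backing the affected instance (instance name is an example)
PID=$(pgrep -f 'qemu-system-x86_64.*instance-0187' | head -n 1)

# number of open file descriptors vs. the per-process limit
ls /proc/$PID/fd | wc -l
grep 'Max open files' /proc/$PID/limits

# how many of those are TCP sockets (most of them will be connections to OSDs)
lsof -nP -p "$PID" 2>/dev/null | grep -c TCP
--- cut ---

If that confirms it, the limit can presumably be raised for qemu processes (for example 
via the max_files setting in /etc/libvirt/qemu.conf, if your libvirt version supports it).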

cheers
jc

-- 
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch

http://www.switch.ch/stories

On 25.05.2015, at 06:02, Christian Balzer ch...@gol.com wrote:

 
 Hello,
 
 let's compare your case with John-Paul's.
 
 Different OS and Ceph versions (thus we can assume different NFS versions
 as well).
 The only common thing is that both of you added OSDs and are likely
 suffering from delays stemming from Ceph re-balancing or deep-scrubbing.
 
 Ceph logs will only pipe up when things have been blocked for more than 30
 seconds, NFS might take offense to lower values (or the accumulation of
 several distributed delays).
 
 You added 23 OSDs, tell us more about your cluster, HW, network.
 Were these added to the existing 16 nodes, or are these on new storage nodes
 (so could there be something different with those nodes?), and how busy is your
 network, CPU?
 Running something like collectd to gather all ceph perf data and other
 data from the storage nodes and then feeding it to graphite (or similar)
 can be VERY helpful to identify if something is going wrong and what it is
 in particular.
 Otherwise run atop on your storage nodes to identify if CPU, network,
 specific HDDs/OSDs are bottlenecks. 
 
 Deep scrubbing can be _very_ taxing; do your problems persist if you inject
 an osd_scrub_sleep value of 0.5 into your running cluster (lower that
 until it hurts again) or if you turn off deep scrubs altogether for the
 moment?
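
For reference, injecting such a value into a running cluster (and temporarily disabling 
deep scrubs) can be done roughly like this; 0.5 is only a starting point:

--- cut ---
# lower scrub impact on the fly (value is an example, tune as needed)
ceph tell osd.* injectargs '--osd_scrub_sleep 0.5'

# or temporarily switch off deep scrubbing cluster-wide ...
ceph osd set nodeep-scrub

# ... and re-enable it later
ceph osd unset nodeep-scrub
--- cut ---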
 
 Christian
 
 On Sat, 23 May 2015 23:28:32 +0200 Jens-Christian Fischer wrote:
 
 We see something very similar on our Ceph cluster, starting as of today.
 
 We use a 16 node, 102 OSD Ceph installation as the basis for an Icehouse
 OpenStack cluster (we applied the RBD patches for live migration etc)
 
 On this cluster we have a big ownCloud installation (Sync & Share) that
 stores its files on three NFS servers, each mounting 6 2TB RBD volumes
 and exposing them to around 10 web server VMs (we originally started
 with one NFS server with a 100TB volume, but that has become unwieldy).
 All of the servers (hypervisors, ceph storage nodes and VMs) are using
 Ubuntu 14.04
 
 Yesterday evening we added 23 OSDs to the cluster, bringing it up to 125
 OSDs (because we had 4 OSDs that were nearing the 90% full mark). The
 rebalancing process ended this morning (after around 12 hours). The
 cluster has been clean since then:
 
     cluster b1f3f4c8-x
      health HEALTH_OK
      monmap e2: 3 mons at
      {zhdk0009=[:::1009]:6789/0,zhdk0013=[:::1013]:6789/0,zhdk0025=[:::1025]:6789/0},
      election epoch 612, quorum 0,1,2 zhdk0009,zhdk0013,zhdk0025
      osdmap e43476: 125 osds: 125 up, 125 in
      pgmap v18928606: 3336 pgs, 17 pools, 82447 GB data, 22585 kobjects
            266 TB used, 187 TB / 454 TB avail
            3319 active+clean
            17 active+clean+scrubbing+deep
   client io 8186 kB/s rd, 7747 kB/s wr, 2288 op/s
 
 At midnight, we run a script that creates an RBD snapshot of all RBD
 volumes that are attached to the NFS servers (for backup purposes).
 Looking at our monitoring, around that time, one of the NFS servers
 became unresponsive and took down the complete ownCloud installation
 (load on the web servers was > 200 and they had lost some of the NFS
 mounts)
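
The script is essentially a loop over the attached images; a minimal sketch, assuming 
all of them live in a single pool (pool and snapshot names are placeholders):

--- cut ---
#!/bin/bash
# create a dated snapshot of every RBD image in the given pool
POOL=volumes
TAG=backup-$(date +%Y%m%d)
for IMG in $(rbd -p "$POOL" ls); do
    rbd snap create "$POOL/$IMG@$TAG"
done
--- cut ---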
 
 Rebooting the NFS server solved that problem, but the NFS kernel server
 kept crashing all day long after having run between 10 to 90 minutes.
 
 We initially suspected a corrupt rbd volume (as it seemed that we could
 trigger the kernel crash by just running “ls -l” on one of the volumes), but
 subsequent “xfs_repair -n” checks on those RBD volumes showed no
 problems.
 
 We migrated the NFS server off of its hypervisor, suspecting a problem
 with RBD kernel modules, rebooted the hypervisor but the problem
 persisted (both on the new hypervisor, and on the old one when we
 migrated it back)
 
 We changed the /etc/default/nfs-kernel-server to start up 256 servers
 (even though the defaults had been working fine for over a year)
 
 Only one of our 3 NFS servers crashes (see below for syslog information)
 - the other 2 have been fine
 
 May 23 21:44:10 drive-nfs1 kernel: [  165.264648] NFSD:
 Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory May
 23 21:44:19 drive-nfs1 kernel: [  173.880092] NFSD: starting 90-second
 grace period (net 81cdab00) May 23 21:44:23 drive-nfs1
 rpc.mountd[1724]: Version 1.2.8 starting May 23 21:44:28 drive-nfs1
 kernel: [  182.917775] ip_tables: (C) 2000-2006 Netfilter Core

Re: [ceph-users] NFS interaction with RBD

2015-05-23 Thread Jens-Christian Fischer
] 
nfsd_lookup+0x69/0x130 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [  600.781613]  [a03be90a] 
nfsd4_lookup+0x1a/0x20 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [  600.781628]  [a03c055a] 
nfsd4_proc_compound+0x56a/0x7d0 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [  600.781638]  [a03acd3b] 
nfsd_dispatch+0xbb/0x200 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [  600.781662]  [a028762d] 
svc_process_common+0x46d/0x6d0 [sunrpc]
May 23 21:51:26 drive-nfs1 kernel: [  600.781678]  [a0287997] 
svc_process+0x107/0x170 [sunrpc]
May 23 21:51:26 drive-nfs1 kernel: [  600.781687]  [a03ac71f] 
nfsd+0xbf/0x130 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [  600.781696]  [a03ac660] ? 
nfsd_destroy+0x80/0x80 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [  600.781702]  [8108b6b2] 
kthread+0xd2/0xf0
May 23 21:51:26 drive-nfs1 kernel: [  600.781707]  [8108b5e0] ? 
kthread_create_on_node+0x1c0/0x1c0
May 23 21:51:26 drive-nfs1 kernel: [  600.781712]  [81733868] 
ret_from_fork+0x58/0x90
May 23 21:51:26 drive-nfs1 kernel: [  600.781717]  [8108b5e0] ? 
kthread_create_on_node+0x1c0/0x1c0

Before each crash, we see the disk utilization of one or two random mounted RBD 
volumes go to 100% - there is no pattern as to which of the RBD disks starts to 
act up.
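
To catch this while it happens, watching the extended device statistics on the NFS 
server is enough; for example (the interval is arbitrary, and inside the VM the RBD 
volumes show up as ordinary virtio disks such as vdb, vdc, ...):

--- cut ---
# %util per block device, refreshed every 5 seconds
iostat -dxk 5
--- cut ---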

We have scoured the log files of the Ceph cluster for any signs of problems but 
came up empty.

The NFS server has almost no load (compared to regular usage) as most sync 
clients are either turned off (weekend) or have given up connecting to the 
server. 

There haven't been any configuration changes on the NFS servers prior to the 
problems. The only change was the addition of 23 OSDs.

We use ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)

Our team is completely out of ideas. We have removed the 100TB volume from the 
nfs server (we used the downtime to migrate the last data off of it to one of 
the smaller volumes). The NFS server has been running for 30 minutes now (with 
close to no load) but we don’t really expect it to make it until tomorrow.

send help
Jens-Christian

-- 
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch

http://www.switch.ch/stories

On 23.05.2015, at 20:38, John-Paul Robinson (Campus) j...@uab.edu wrote:

 We've had an NFS gateway serving up RBD images successfully for over a 
 year. Ubuntu 12.04 and ceph .73 iirc. 
 
 In the past couple of weeks we have developed a problem where the nfs clients 
 hang while accessing exported rbd containers. 
 
 We see errors on the server about nfsd hanging for 120sec etc. 
 
 The nfs server is still able to successfully interact with the images it is 
 serving. We can export non rbd shares from the local file system and nfs 
 clients can use them just fine. 
 
 There seems to be something weird going on with rbd and nfs kernel modules. 
 
 Our ceph pool is in a warn state due to an osd rebalance that is continuing 
 slowly. But the fact that we continue to have good rbd image access directly 
 on the server makes me think this is not related.  Also the nfs server is 
 only a client of the pool, it doesn't participate in it. 
 
 Has anyone experienced similar issues?  
 
 We do have a lot of images attached to the server, but the issue is there even 
 when we map only a few. 
 
 Thanks for any pointers. 
 
 ~jpr
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Cinder Capabilities reports wrong free size

2014-08-22 Thread Jens-Christian Fischer
Thanks Greg, a good night's sleep and your eyes made the difference. Here's the 
relevant part from /etc/cinder/cinder.conf to make that happen:

[DEFAULT]
...
enabled_backends=quobyte,rbd
default_volume_type=rbd



[quobyte]
volume_backend_name=quobyte
quobyte_volume_url=quobyte://host.example.com/openstack-volumes
volume_driver=cinder.volume.drivers.quobyte.QuobyteDriver

[rbd]
volume_backend_name=rbd
rbd_pool=volumes
rbd_flatten_volume_from_snapshot=False
rbd_user=cinder
rbd_ceph_conf=/etc/ceph/ceph.conf
rbd_secret_uuid=111-222-333-444-555
rbd_max_clone_depth=5
volume_driver=cinder.volume.drivers.rbd.RBDDriver
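
One more piece that belongs with this config, in case someone copies it: the scheduler 
matches volumes to backends via volume types, so each backend name usually gets a 
corresponding type. A hedged sketch of the commands (type names simply mirror the 
backend names above):

--- cut ---
# create volume types and bind them to the backend names from cinder.conf
cinder type-create rbd
cinder type-key rbd set volume_backend_name=rbd

cinder type-create quobyte
cinder type-key quobyte set volume_backend_name=quobyte

# explicitly place a test volume on the rbd backend
cinder create --volume-type rbd --display-name test-rbd 1
--- cut ---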

cheers
jc

-- 
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch

http://www.switch.ch/stories

On 21.08.2014, at 17:55, Gregory Farnum g...@inktank.com wrote:

 On Thu, Aug 21, 2014 at 8:29 AM, Jens-Christian Fischer
 jens-christian.fisc...@switch.ch wrote:
 I am working with Cinder Multi Backends on an Icehouse installation and have 
 added another backend (Quobyte) to a previously running Cinder/Ceph 
 installation.
 
 I can now create QuoByte volumes, but no longer any ceph volumes. The 
 cinder-scheduler log gets an incorrect number for the free size of the 
 volumes pool and disregards the RBD backend as a viable storage system:
 
 I don't know much about Cinder, but given this output:
 
 2014-08-21 16:42:49.847 1469 DEBUG 
 cinder.openstack.common.scheduler.filters.capabilities_filter [r...] 
 extra_spec requirement 'rbd' does not match 'quobyte' _satisfies_extra_specs 
 /usr/lib/python2.7/dist-packages/cinder/openstack/common/scheduler/filters/capabilities_filter.py:55
 2014-08-21 16:42:49.848 1469 DEBUG 
 cinder.openstack.common.scheduler.filters.capabilities_filter [r...] host 
 'controller@quobyte': free_capacity_gb: 156395.931061 fails resource_type 
 extra_specs requirements host_passes 
 /usr/lib/python2.7/dist-packages/cinder/openstack/common/scheduler/filters/capabilities_filter.py:68
 2014-08-21 16:42:49.848 1469 WARNING 
 cinder.scheduler.filters.capacity_filter [r...-] Insufficient free space for 
 volume creation (requested / avail): 20/8.0
 2014-08-21 16:42:49.849 1469 ERROR cinder.scheduler.flows.create_volume [r.] 
 Failed to schedule_create_volume: No valid host was found.
 
 I suspect you'll have better luck on the Openstack mailing list. :)
 
 Although for a random quick guess, I think maybe you need to match the
 rbd and rbd-volumes (from your conf file) strings?
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com
 
 
 
 here’s our /etc/cinder/cinder.conf
 
 — cut —
 [DEFAULT]
 rootwrap_config = /etc/cinder/rootwrap.conf
 api_paste_confg = /etc/cinder/api-paste.ini
 # iscsi_helper = tgtadm
 volume_name_template = volume-%s
 # volume_group = cinder-volumes
 verbose = True
 auth_strategy = keystone
 state_path = /var/lib/cinder
 lock_path = /var/lock/cinder
 volumes_dir = /var/lib/cinder/volumes
 rabbit_host=10.2.0.10
 use_syslog=False
 api_paste_config=/etc/cinder/api-paste.ini
 glance_num_retries=0
 debug=True
 storage_availability_zone=nova
 glance_api_ssl_compression=False
 glance_api_insecure=False
 rabbit_userid=openstack
 rabbit_use_ssl=False
 log_dir=/var/log/cinder
 osapi_volume_listen=0.0.0.0
 glance_api_servers=1.2.3.4:9292
 rabbit_virtual_host=/
 scheduler_driver=cinder.scheduler.filter_scheduler.FilterScheduler
 default_availability_zone=nova
 rabbit_hosts=10.2.0.10:5672
 control_exchange=openstack
 rabbit_ha_queues=False
 glance_api_version=2
 amqp_durable_queues=False
 rabbit_password=secret
 rabbit_port=5672
 rpc_backend=cinder.openstack.common.rpc.impl_kombu
 enabled_backends=quobyte,rbd
 default_volume_type=rbd
 
 [database]
 idle_timeout=3600
 connection=mysql://cinder:secret@10.2.0.10/cinder
 
 [quobyte]
 quobyte_volume_url=quobyte://hostname.cloud.example.com/openstack-volumes
 volume_driver=cinder.volume.drivers.quobyte.QuobyteDriver
 
 [rbd-volumes]
 volume_backend_name=rbd-volumes
 rbd_pool=volumes
 rbd_flatten_volume_from_snapshot=False
 rbd_user=cinder
 rbd_ceph_conf=/etc/ceph/ceph.conf
 rbd_secret_uuid=1234-5678-ABCD-…-DEF
 rbd_max_clone_depth=5
 volume_driver=cinder.volume.drivers.rbd.RBDDriver
 
 — cut ---
 
 any ideas?
 
 cheers
 Jens-Christian
 
 --
 SWITCH
 Jens-Christian Fischer, Peta Solutions
 Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
 phone +41 44 268 15 15, direct +41 44 268 15 71
 jens-christian.fisc...@switch.ch
 http://www.switch.ch
 
 http://www.switch.ch/stories
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph Cinder Capabilities reports wrong free size

2014-08-21 Thread Jens-Christian Fischer
I am working with Cinder Multi Backends on an Icehouse installation and have 
added another backend (Quobyte) to a previously running Cinder/Ceph 
installation.

I can now create QuoByte volumes, but no longer any ceph volumes. The 
cinder-scheduler log gets an incorrect number for the free size of the volumes 
pool and disregards the RBD backend as a viable storage system:

2014-08-21 16:42:49.847 1469 DEBUG 
cinder.openstack.common.scheduler.filters.capabilities_filter [r...] extra_spec 
requirement 'rbd' does not match 'quobyte' _satisfies_extra_specs 
/usr/lib/python2.7/dist-packages/cinder/openstack/common/scheduler/filters/capabilities_filter.py:55
2014-08-21 16:42:49.848 1469 DEBUG 
cinder.openstack.common.scheduler.filters.capabilities_filter [r...] host 
'controller@quobyte': free_capacity_gb: 156395.931061 fails resource_type 
extra_specs requirements host_passes 
/usr/lib/python2.7/dist-packages/cinder/openstack/common/scheduler/filters/capabilities_filter.py:68
2014-08-21 16:42:49.848 1469 WARNING cinder.scheduler.filters.capacity_filter 
[r...-] Insufficient free space for volume creation (requested / avail): 20/8.0
2014-08-21 16:42:49.849 1469 ERROR cinder.scheduler.flows.create_volume [r.] 
Failed to schedule_create_volume: No valid host was found.

here’s our /etc/cinder/cinder.conf

— cut —
[DEFAULT]
rootwrap_config = /etc/cinder/rootwrap.conf
api_paste_confg = /etc/cinder/api-paste.ini
# iscsi_helper = tgtadm
volume_name_template = volume-%s
# volume_group = cinder-volumes
verbose = True
auth_strategy = keystone
state_path = /var/lib/cinder
lock_path = /var/lock/cinder
volumes_dir = /var/lib/cinder/volumes
rabbit_host=10.2.0.10
use_syslog=False
api_paste_config=/etc/cinder/api-paste.ini
glance_num_retries=0
debug=True
storage_availability_zone=nova
glance_api_ssl_compression=False
glance_api_insecure=False
rabbit_userid=openstack
rabbit_use_ssl=False
log_dir=/var/log/cinder
osapi_volume_listen=0.0.0.0
glance_api_servers=1.2.3.4:9292
rabbit_virtual_host=/
scheduler_driver=cinder.scheduler.filter_scheduler.FilterScheduler
default_availability_zone=nova
rabbit_hosts=10.2.0.10:5672
control_exchange=openstack
rabbit_ha_queues=False
glance_api_version=2
amqp_durable_queues=False
rabbit_password=secret
rabbit_port=5672
rpc_backend=cinder.openstack.common.rpc.impl_kombu
enabled_backends=quobyte,rbd
default_volume_type=rbd

[database]
idle_timeout=3600
connection=mysql://cinder:secret@10.2.0.10/cinder

[quobyte]
quobyte_volume_url=quobyte://hostname.cloud.example.com/openstack-volumes
volume_driver=cinder.volume.drivers.quobyte.QuobyteDriver

[rbd-volumes]
volume_backend_name=rbd-volumes
rbd_pool=volumes
rbd_flatten_volume_from_snapshot=False
rbd_user=cinder
rbd_ceph_conf=/etc/ceph/ceph.conf
rbd_secret_uuid=1234-5678-ABCD-…-DEF
rbd_max_clone_depth=5
volume_driver=cinder.volume.drivers.rbd.RBDDriver

— cut ---

any ideas?

cheers
Jens-Christian

-- 
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch

http://www.switch.ch/stories

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD clone for OpenStack Nova ephemeral volumes

2014-05-28 Thread Jens-Christian Fischer
We are currently starting to set up a new Icehouse/Ceph based cluster and will 
help to get this patch in shape as well. 

I am currently collecting the information needed that allow us to patch Nova 
and I have this: 
https://github.com/angdraug/nova/tree/rbd-ephemeral-clone-stable-icehouse on my 
list of patches to apply. Is there new code for the rbd-clone-image-handler 
blueprint, or should I use the one mentioned above?

Also, are there other patches that would need to be applied for the full 
Icehouse/Ceph integration?

cheers
jc

On 01.05.2014, at 01:23, Dmitry Borodaenko dborodae...@mirantis.com wrote:

 I've re-proposed the rbd-clone-image-handler blueprint via nova-specs:
 https://review.openstack.org/91486
 
 In other news, Sebastien has helped me test the most recent
 incarnation of this patch series and it seems to be usable now. With
 an important exception of live migrations of VMs with RBD backed
 ephemeral drives, which will need a bit more work and a separate
 blueprint.
 
 On Mon, Apr 28, 2014 at 7:44 PM, Dmitry Borodaenko
 dborodae...@mirantis.com wrote:
 I have decoupled the Nova rbd-ephemeral-clone branch from the
 multiple-image-location patch, the result can be found at the same
 location on GitHub as before:
 https://github.com/angdraug/nova/tree/rbd-ephemeral-clone
 
 I will keep rebasing this over Nova master, I also plan to update the
 rbd-clone-image-handler blueprint and publish it to nova-specs so that
 the patch series could be proposed for Juno.
 
 Icehouse backport of this branch is here:
 https://github.com/angdraug/nova/tree/rbd-ephemeral-clone-stable-icehouse
 
 I am not going to track every stable/icehouse commit with this branch,
 instead, I will rebase it over stable release tags as they appear.
 Right now it's based on tag:2014.1.
 
 For posterity, I'm leaving the multiple-image-location patch rebased
 over current Nova master here:
 https://github.com/angdraug/nova/tree/multiple-image-location
 
 I don't plan on maintaining multiple-image-location, just leaving it
 out there to save some rebasing effort for whoever decides to pick it
 up.
 
 -DmitryB
 
 On Fri, Mar 21, 2014 at 1:12 PM, Josh Durgin josh.dur...@inktank.com wrote:
 On 03/20/2014 07:03 PM, Dmitry Borodaenko wrote:
 
 On Thu, Mar 20, 2014 at 3:43 PM, Josh Durgin josh.dur...@inktank.com
 wrote:
 
 On 03/20/2014 02:07 PM, Dmitry Borodaenko wrote:
 
 The patch series that implemented clone operation for RBD backed
 ephemeral volumes in Nova did not make it into Icehouse. We have tried
 our best to help it land, but it was ultimately rejected. Furthermore,
 an additional requirement was imposed to make this patch series
 dependent on full support of Glance API v2 across Nova (due to its
 dependency on direct_url that was introduced in v2).
 
 You can find the most recent discussion of this patch series in the
 FFE (feature freeze exception) thread on openstack-dev ML:
 
 http://lists.openstack.org/pipermail/openstack-dev/2014-March/029127.html
 
 As I explained in that thread, I believe this feature is essential for
 using Ceph as a storage backend for Nova, so I'm going to try and keep
 it alive outside of OpenStack mainline until it is allowed to land.
 
 I have created rbd-ephemeral-clone branch in my nova repo fork on
 GitHub:
 https://github.com/angdraug/nova/tree/rbd-ephemeral-clone
 
 I will keep it rebased over nova master, and will create an
 rbd-ephemeral-clone-stable-icehouse to track the same patch series
 over nova stable/icehouse once it's branched. I also plan to make sure
 that this patch series is included in Mirantis OpenStack 5.0 which
 will be based on Icehouse.
 
 If you're interested in this feature, please review and test. Bug
 reports and patches are welcome, as long as their scope is limited to
 this patch series and is not applicable for mainline OpenStack.
 
 
 Thanks for taking this on Dmitry! Having rebased those patches many
 times during icehouse, I can tell you it's often not trivial.
 
 
 Indeed, I get conflicts every day lately, even in the current
 bugfixing stage of the OpenStack release cycle. I have a feeling it
 will not get easier when Icehouse is out and Juno is in full swing.
 
 Do you think the imagehandler-based approach is best for Juno? I'm
 leaning towards the older way [1] for simplicity of review, and to
 avoid using glance's v2 api by default.
 [1] https://review.openstack.org/#/c/46879/
 
 
 Excellent question, I have thought long and hard about this. In
 retrospect, requiring this change to depend on the imagehandler patch
 back in December 2013 has proven to have been a poor decision.
 Unfortunately, now that it's done, porting your original patch from
 Havana to Icehouse is more work than keeping the new patch series up
 to date with Icehouse, at least short term. Especially if we decide to
 keep the rbd_utils refactoring, which I've grown to like.
 
 As far as I understand, your original code made use of the same v2 api
 call even before it was rebased 

[ceph-users] aborted downloads from Radosgw when multiple clients access same object

2013-12-05 Thread Jens-Christian Fischer
-aio_operate r=0 bl.length=0
2013-12-05 17:14:35.636551 7f2bbd3c1700 20 get_obj_aio_completion_cb: io 
completion ofs=17301504 len=4194304
2013-12-05 17:14:35.636645 7f2b597fa700 20 rados-get_obj_iterate_cb 
oid=default.40804.6__shadow__W8D84-M3taNmGJG4UCxDxbmNDJqubhP_9 obj-ofs=34078720 
read_ofs=0 len=4194304
2013-12-05 17:14:35.636764 7f2b597fa700 20 rados-aio_operate r=0 bl.length=0
2013-12-05 17:14:52.909931 7f2bbcbc0700  2 
RGWDataChangesLog::ChangesRenewThread: start
2013-12-05 17:15:06.803464 7f2bc96237c0 20 enqueued request req=0x1d62fe0
2013-12-05 17:15:06.803499 7f2bc96237c0 20 RGWWQ:
2013-12-05 17:15:06.803503 7f2bc96237c0 20 req: 0x1d62fe0
2013-12-05 17:15:06.803511 7f2bc96237c0 10 allocated request req=0x1d64bc0
2013-12-05 17:15:06.803559 7f2b5affd700 20 dequeued request req=0x1d62fe0
2013-12-05 17:15:06.803574 7f2b5affd700 20 RGWWQ: empty
2013-12-05 17:15:06.803692 7f2b5affd700  2 req 7:0.000110::GET 
/2f/e4491dbfa00c328828bbbc2c8d128a/test2.mp4::initializing
2013-12-05 17:15:06.803702 7f2b5affd700 10 host=xx rgw_dns_name=xx




Ceph Versions:
root@server2:/var/log/apache2# radosgw --version
ceph version 0.67.4 (ad85b8bfafea6232d64cb7ba76a8b6e8252fa0c7)


root@server1:/etc# ceph -s
  cluster 6b3bd327-2f97-44f6-a8fc-2af8534a2e67
   health HEALTH_OK
   monmap e1: 3 mons at 
{h0s=[2001:]:6789/0,h1s=[2001:]:6789/0,hxs=[2001:]:6789/0}, election epoch 
1270, quorum 0,1,2 server2, server3, server1
   osdmap e6645: 24 osds: 24 up, 24 in
pgmap v2541337: 7368 pgs: 7368 active+clean; 2602 GB data, 5213 GB used, 
61822 GB / 67035 GB avail; 31013KB/s rd, 151KB/s wr, 34op/s
   mdsmap e1: 0/0/1 up

root@server1:/etc# ceph --version
ceph version 0.67.4 (ad85b8bfafea6232d64cb7ba76a8b6e8252fa0c7)



more log files available upon request….


any ideas?

cheers
Jens-Christian

-- 
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch

http://www.switch.ch/socialmedia

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Number of threads for osd processes

2013-11-27 Thread Jens-Christian Fischer
 The largest group of threads is those from the network messenger — in
 the current implementation it creates two threads per process the
 daemon is communicating with. That's two threads for each OSD it
 shares PGs with, and two threads for each client which is accessing
 any data on that OSD.

If I read your statement right, then 1000 threads still seem excessive, no? 
(With 24 OSDs, there should be at most 2 * 23 threads to the other OSDs, plus some 
threads for the clients...)
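
To see where the threads of a single ceph-osd actually go, grouping them by thread name 
can help; a quick sketch (the PID is an example, and on older releases the names may all 
just read ceph-osd):

--- cut ---
# per-thread name breakdown for one ceph-osd process (PID is an example)
ps -L -p 3068 -o comm= | sort | uniq -c | sort -rn | head
--- cut ---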

/jc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Openstack Havana, boot from volume fails

2013-11-27 Thread Jens-Christian Fischer


 Thanks a lot, Jens. Do I have to have cephx authentication enabled? Did you 
 enable it? Which user from the node that contains cinder-api or glance-api 
 are you using to create volumes and images? The documentation at  
 http://ceph.com/docs/master/rbd/rbd-openstack/ mentions creating new users 
 client.volumes and client.images for cinder and glance respectively. Did you 
 do that?


we have cephx authentication enabled: Here's the /etc/ceph/ceph.conf file that 
our cluster has (we have OSDs on our compute nodes - we shouldn't, but this is 
a test cluster only)

root@h1:~# cat /etc/ceph/ceph.conf
[global]
fsid = 6b3bd327-2f97-44f6-a8fc-
mon_initial_members = hxs, h0s, h1s
mon_host = :yyy:0:6::11c,:yyy:0:6::11e,:yyy:0:6::11d
auth_supported = cephx
osd_journal_size = 1024
filestore_xattr_use_omap = true
ms_bind_ipv6 = true
rgw_print_continue = false

[client]
rbd cache = true


[client.images]
keyring = /etc/ceph/ceph.client.images.keyring

[client.volumes]
keyring = /etc/ceph/ceph.client.volumes.keyring

[client.radosgw.gateway]
host = hxs
keyring = /etc/ceph/keyring.radosgw.gateway
rgw_socket_path = /tmp/radosgw.sock
log_file = /var/log/ceph/radosgw.log


Make sure that /etc/ceph/ceph.conf is readable by other processes - ceph-deploy 
sets it to 0600 or 0400 (which makes nova really really unhappy)

root@h1:~# ls -l /etc/ceph/ceph.conf
-rw-r--r-- 1 root root 592 Nov  8 16:32 /etc/ceph/ceph.conf

We have a volumes and an images user as you can see (with the necessary rights 
on the volumes and images pool, as described in the ceph-openstack 
documentation)
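
For reference, the caps for those users follow the rbd-openstack guide; roughly this 
(pool names as in our setup, adjust to yours):

--- cut ---
# cephx users for cinder (client.volumes) and glance (client.images)
ceph auth get-or-create client.volumes \
  mon 'allow r' \
  osd 'allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rx pool=images'
ceph auth get-or-create client.images \
  mon 'allow r' \
  osd 'allow class-read object_prefix rbd_children, allow rwx pool=images'

# write out the keyrings referenced in ceph.conf above
ceph auth get-or-create client.volumes > /etc/ceph/ceph.client.volumes.keyring
ceph auth get-or-create client.images  > /etc/ceph/ceph.client.images.keyring
--- cut ---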


A really good overview of the current state of Ceph and OpenStack Havana was 
posted by Sebastien Han yesterday: 
http://techs.enovance.com/6424/back-from-the-summit-cephopenstack-integration - 
it cleared up a bunch of things for me.

cheers
jc


  
 Thanks again!
 Narendra
  
 From: Jens-Christian Fischer [mailto:jens-christian.fisc...@switch.ch] 
 Sent: Monday, November 25, 2013 8:19 AM
 To: Trivedi, Narendra
 Cc: ceph-users@lists.ceph.com; Rüdiger Rissmann
 Subject: Re: [ceph-users] Openstack Havana, boot from volume fails
  
 Hi Narendra
  
 rbd for cinder and glance are according to the ceph documentation here: 
 http://ceph.com/docs/master/rbd/rbd-openstack/
  
 rbd for VM images configured like so: https://review.openstack.org/#/c/36042/
  
 config sample (nova.conf):
  
 --- cut ---
  
 volume_driver=nova.volume.driver.RBDDriver
 rbd_pool=volumes
 rbd_user=volumes
 rbd_secret_uuid=--
  
  
 libvirt_images_type=rbd
 # the RADOS pool in which rbd volumes are stored (string value)
 libvirt_images_rbd_pool=volumes
 # path to the ceph configuration file to use (string value)
 libvirt_images_rbd_ceph_conf=/etc/ceph/ceph.conf
  
  
  # don't inject stuff into partitions, RBD-backed partitions don't work that way
 libvirt_inject_partition = -2
  
 --- cut ---
  
 and finally, used the following files from this repository: 
 https://github.com/jdurgin/nova/tree/havana-ephemeral-rbd
  
 image/glance.py
 virt/images.py
 virt/libvirt/driver.py
 virt/libvirt/imagebackend.py
 virt/libvirt/utils.py
  
 good luck :)
  
 cheers
 jc
  
 -- 
 SWITCH
 Jens-Christian Fischer, Peta Solutions
 Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
 phone +41 44 268 15 15, direct +41 44 268 15 71
 jens-christian.fisc...@switch.ch
 http://www.switch.ch
 
 http://www.switch.ch/socialmedia
  
 On 22.11.2013, at 17:41, Trivedi, Narendra narendra.triv...@savvis.com 
 wrote:
 
 
 Hi Jean,
  
 Could you please tell me which link you followed to install RBD etc. for 
 Havana?
  
 Thanks!
 Narendra
  
 From: ceph-users-boun...@lists.ceph.com 
 [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jens-Christian Fischer
 Sent: Thursday, November 21, 2013 8:06 AM
 To: ceph-users@lists.ceph.com
 Cc: Rüdiger Rissmann
 Subject: [ceph-users] Openstack Havana, boot from volume fails
  
 Hi all
  
 I'm playing with the boot from volume options in Havana and have run into 
 problems:
  
 (Openstack Havana, Ceph Dumpling (0.67.4), rbd for glance, cinder and 
 experimental ephemeral disk support)
  
 The following things do work:
 - glance images are in rbd
 - cinder volumes are in rbd
 - creating a VM from an image works
 - creating a VM from a snapshot works
  
  
 However, the booting from volume fails:
  
 Steps to reproduce:
  
 Boot from image
 Create snapshot from running instance
 Create volume from this snapshot
 Start a new instance with boot from volume and the volume just created:
  
 The boot process hangs after around 3 seconds, and the console.log of the 
 instance shows this:
  
 [0.00] Linux version 3.11.0-12-generic (buildd@allspice) (gcc version 
 4.8.1 (Ubuntu/Linaro 4.8.1-10ubuntu7) ) #19-Ubuntu SMP Wed Oct 9 16:20:46 UTC 
 2013 (Ubuntu 3.11.0-12.19-generic 3.11.3)
 [0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-3.11.0-12-generic 
 root=LABEL=cloudimg-rootfs ro console=tty1 console=ttyS0
 ...
 [0.098221] Brought up

Re: [ceph-users] how to Testing cinder and glance with CEPH

2013-11-27 Thread Jens-Christian Fischer
Hi Karan

your cinder.conf looks sensible to me, I have posted mine here:

--- cut ---

[DEFAULT]
rootwrap_config = /etc/cinder/rootwrap.conf
api_paste_confg = /etc/cinder/api-paste.ini
iscsi_helper = tgtadm
volume_name_template = volume-%s
volume_group = cinder-volumes
verbose = True
auth_strategy = keystone
state_path = /var/lib/cinder
lock_path = /var/lock/cinder
volumes_dir = /var/lib/cinder/volumes

volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_pool=volumes
glance_api_version=2

rbd_user=volumes
rbd_secret_uuid=e1915277-e3a5-4547-bc9e-xxx

rpc_backend = cinder.openstack.common.rpc.impl_kombu
rabbit_host = xxx.yyy.cc
rabbit_port = 5672

quota_volumes=20
quota_snapshots=20

debug = False
use_syslog = True
syslog_log_facility = LOG_LOCAL0



[database]
connection = mysql://cinder:x...@xxx.yyy.cc/cinder


[keystone_authtoken]
# keystone public API
auth_protocol = https
auth_host = xxx.yyy.cc
auth_port = 5000
admin_tenant_name = service
admin_user = cinder
admin_password =xxx

--- cut ---

what are the different cinder*.log files telling you?

Is /etc/ceph/ceph.conf readable for other processes? (chmod 644 
/etc/ceph/ceph.conf)
Are the key rings available and readable?
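
A quick sanity check on the cinder node would be something like this (standard paths 
assumed, and the id matches the rbd_user from cinder.conf):

--- cut ---
# config and keyring need to be readable by the cinder service user
ls -l /etc/ceph/ceph.conf /etc/ceph/ceph.client.volumes.keyring

# can the cinder user reach the cluster with that key?
sudo -u cinder ceph --id volumes -s
--- cut ---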

good luck
jc


-- 
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch

http://www.switch.ch/socialmedia

On 27.11.2013, at 08:51, Karan Singh ksi...@csc.fi wrote:

 Hello Sebastien / Community
 
 
 I tried the commands mentioned in below email.
 
 
 [root@rdo ~]#
 [root@rdo ~]# cinder create 1
 +-+--+
 |   Property  |Value |
 +-+--+
 | attachments |  []  |
 |  availability_zone  | nova |
 |   bootable  |false |
 |  created_at |  2013-11-27T07:40:54.161478  |
 | display_description | None |
 | display_name| None |
 |  id | ae8cd686-5f1d-4c05-8c42-cb7622122a3e |
 |   metadata  |  {}  |
 | size|  1   |
 | snapshot_id | None |
 | source_volid| None |
 |status   |   creating   |
 | volume_type | None |
 +-+--+
 [root@rdo ~]#
 [root@rdo ~]# cinder list
 +--------------------------------------+--------+--------------+------+-------------+----------+-------------+
 |                  ID                  | Status | Display Name | Size | Volume Type | Bootable | Attached to |
 +--------------------------------------+--------+--------------+------+-------------+----------+-------------+
 | ae8cd686-5f1d-4c05-8c42-cb7622122a3e | error  |     None     |  1   |     None    |  false   |             |
 +--------------------------------------+--------+--------------+------+-------------+----------+-------------+
 [root@rdo ~]#
 [root@rdo ~]#
 [root@rdo ~]#
 [root@rdo ~]# rbd -p ceph-volumes ls
 rbd: pool ceph-volumes doesn't contain rbd images
 [root@rdo ~]#
 [root@rdo ~]#
 [root@rdo ~]# rados lspools
 data
 metadata
 rbd
 ceph-images
 ceph-volumes
 [root@rdo ~]# rbd -p rbd ls
 [root@rdo ~]# rbd -p data ls
 foo
 foo1
 [root@rdo ~]#
 
 
 
 
 I checked in cinder.log and got the below errors.
 
 
 2013-11-27 09:44:14.830 3273 INFO cinder.volume.manager [-] Updating volume 
 status
 2013-11-27 09:44:14.830 3273 WARNING cinder.volume.manager [-] Unable to 
 update stats, driver is uninitialized
 2013-11-27 09:44:42.407 12007 INFO cinder.volume.manager [-] Updating volume 
 status
 2013-11-27 09:44:42.408 12007 WARNING cinder.volume.manager [-] Unable to 
 update stats, driver is uninitialized
 2013-11-27 09:44:51.799 4943 INFO cinder.volume.manager [-] Updating volume 
 status
 2013-11-27 09:44:51.799 4943 WARNING cinder.volume.manager [-] Unable to 
 update stats, driver is uninitialized
 2013-11-27 09:45:14.834 3273 INFO cinder.volume.manager [-] Updating volume 
 status
 2013-11-27 09:45:14.834 3273 WARNING cinder.volume.manager [-] Unable to 
 update stats, driver is uninitialized
 [root@rdo cinder]#
 
 
 
 
 Output from my cinder.conf file
 
 
 
 # Options defined in cinder.volume.utils
 #
 
 # The default block size used when copying/clearing volumes
 # (string value)
 #volume_dd_blocksize=1M
 
 
 # Total option count: 382
 volume_driver=cinder.volume.drivers.rbd.RBDDriver
 rbd_pool=ceph-volumes
 glance_api_version=2
 rbd_user=volumes
 rbd_secret_uuid=801a42ec-aec1-3ea8-d869-823c2de56b83
 
 rootwrap_config=/etc/cinder/rootwrap.conf

[ceph-users] Number of threads for osd processes

2013-11-26 Thread Jens-Christian Fischer
Hi all

we have a ceph 0.67.4 cluster with 24 OSDs

I have noticed that the two servers that have 9 OSD each, have around 10'000 
threads running - a number that went up significantly 2 weeks ago.

Looking at the threads:


root@h2:/var/log/ceph# ps -efL | grep ceph-osd | awk '{ print $2 }' | uniq -c | 
sort -n
  1 17583
856 3151
874 11946
   1034 3173
   1038 3072
   1040 3175
   1052 3068
   1062 3149
   1068 3060
   1070 3190

root@h2:/var/log/ceph# ps axwu | grep ceph-osd
root  3060  1.0  0.2 2224068 312456 ?  Ssl  Nov01 392:17 
/usr/bin/ceph-osd --cluster=ceph -i 5 -f
root  3068  1.2  0.2 2140988 356208 ?  Ssl  Nov01 441:22 
/usr/bin/ceph-osd --cluster=ceph -i 9 -f
root  3072  1.2  0.2 2049608 370236 ?  Ssl  Nov01 443:18 
/usr/bin/ceph-osd --cluster=ceph -i 4 -f
root  3149  1.2  0.3 2122548 402236 ?  Ssl  Nov01 440:16 
/usr/bin/ceph-osd --cluster=ceph -i 8 -f
root  3151  1.2  0.3 1917856 426224 ?  Ssl  Nov01 453:30 
/usr/bin/ceph-osd --cluster=ceph -i 7 -f
root  3173  0.8  0.2 1978252 264732 ?  Ssl  Nov01 325:01 
/usr/bin/ceph-osd --cluster=ceph -i 11 -f
root  3175  1.1  0.3 2186676 422112 ?  Ssl  Nov01 401:27 
/usr/bin/ceph-osd --cluster=ceph -i 12 -f
root  3190  1.1  0.3 2140480 412844 ?  Ssl  Nov01 421:31 
/usr/bin/ceph-osd --cluster=ceph -i 6 -f
root 11946  1.7  0.3 2060968 445368 ?  Ssl  Nov14 302:36 
/usr/bin/ceph-osd --cluster=ceph -i 10 -f
root 17589  0.0  0.0   9456   952 pts/25   S+   16:13   0:00 grep 
--color=auto ceph-osd

we see each osd process with around 1000 threads. Is this normal and expected?

One theory we have is that this has to do with the number of placement groups 
- I had increased the number of PGs in one of the pools:

root@h2:/var/log/ceph# ceph osd pool get images pg_num
pg_num: 1000
root@h2:/var/log/ceph# ceph osd pool get volumes pg_num
pg_num: 128

That could possibly have been the day the number of threads started to rise.

Feedback appreciated!

thanks
Jens-Christian

-- 
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch

http://www.switch.ch/socialmedia

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Openstack Havana, boot from volume fails

2013-11-25 Thread Jens-Christian Fischer
Hi Narendra

rbd for cinder and glance are according to the ceph documentation here: 
http://ceph.com/docs/master/rbd/rbd-openstack/

rbd for VM images configured like so: https://review.openstack.org/#/c/36042/

config sample (nova.conf):

--- cut ---

volume_driver=nova.volume.driver.RBDDriver
rbd_pool=volumes
rbd_user=volumes
rbd_secret_uuid=--


libvirt_images_type=rbd
# the RADOS pool in which rbd volumes are stored (string value)
libvirt_images_rbd_pool=volumes
# path to the ceph configuration file to use (string value)
libvirt_images_rbd_ceph_conf=/etc/ceph/ceph.conf


# don't inject stuff into partitions, RBD-backed partitions don't work that way
libvirt_inject_partition = -2

--- cut ---

and finally, used the following files from this repository: 
https://github.com/jdurgin/nova/tree/havana-ephemeral-rbd

image/glance.py
virt/images.py
virt/libvirt/driver.py
virt/libvirt/imagebackend.py
virt/libvirt/utils.py
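
A hedged sketch of how these files can be dropped onto a compute node over the installed 
nova package (Ubuntu paths; the checkout location is just an example):

--- cut ---
# copy the patched files over the installed nova package and restart nova-compute
NOVA=/usr/lib/python2.7/dist-packages/nova
SRC=$HOME/nova-havana-ephemeral-rbd/nova   # checkout of the branch above (example path)
for f in image/glance.py virt/images.py virt/libvirt/driver.py \
         virt/libvirt/imagebackend.py virt/libvirt/utils.py; do
    cp "$SRC/$f" "$NOVA/$f"
done
service nova-compute restart
--- cut ---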

good luck :)

cheers
jc

-- 
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch

http://www.switch.ch/socialmedia

On 22.11.2013, at 17:41, Trivedi, Narendra narendra.triv...@savvis.com 
wrote:

 Hi Jean,
  
 Could you please tell me which link you followed to install RBD etc. for 
 Havana?
  
 Thanks!
 Narendra
  
 From: ceph-users-boun...@lists.ceph.com 
 [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jens-Christian Fischer
 Sent: Thursday, November 21, 2013 8:06 AM
 To: ceph-users@lists.ceph.com
 Cc: Rüdiger Rissmann
 Subject: [ceph-users] Openstack Havana, boot from volume fails
  
 Hi all
  
 I'm playing with the boot from volume options in Havana and have run into 
 problems:
  
 (Openstack Havana, Ceph Dumpling (0.67.4), rbd for glance, cinder and 
 experimental ephemeral disk support)
  
 The following things do work:
 - glance images are in rbd
 - cinder volumes are in rbd
 - creating a VM from an image works
 - creating a VM from a snapshot works
  
  
 However, the booting from volume fails:
  
 Steps to reproduce:
  
 Boot from image
 Create snapshot from running instance
 Create volume from this snapshot
 Start a new instance with boot from volume and the volume just created:
  
 The boot process hangs after around 3 seconds, and the console.log of the 
 instance shows this:
  
 [0.00] Linux version 3.11.0-12-generic (buildd@allspice) (gcc version 
 4.8.1 (Ubuntu/Linaro 4.8.1-10ubuntu7) ) #19-Ubuntu SMP Wed Oct 9 16:20:46 UTC 
 2013 (Ubuntu 3.11.0-12.19-generic 3.11.3)
 [0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-3.11.0-12-generic 
 root=LABEL=cloudimg-rootfs ro console=tty1 console=ttyS0
 ...
 [0.098221] Brought up 1 CPUs
 [0.098964] smpboot: Total of 1 processors activated (4588.94 BogoMIPS)
 [0.100408] NMI watchdog: enabled on all CPUs, permanently consumes one 
 hw-PMU counter.
 [0.102667] devtmpfs: initialized
 …
 [0.560202] Linux agpgart interface v0.103
 [0.562276] brd: module loaded
 [0.563599] loop: module loaded
 [0.565315]  vda: vda1
 [0.568386] scsi0 : ata_piix
 [0.569217] scsi1 : ata_piix
 [0.569972] ata1: PATA max MWDMA2 cmd 0x1f0 ctl 0x3f6 bmdma 0xc0a0 irq 14
 [0.571289] ata2: PATA max MWDMA2 cmd 0x170 ctl 0x376 bmdma 0xc0a8 irq 15
 …
 [0.742082] Freeing unused kernel memory: 1040K (8800016fc000 - 
 88000180)
 [0.746153] Freeing unused kernel memory: 836K (880001b2f000 - 
 880001c0)
 Loading, please wait...
 [0.764177] systemd-udevd[95]: starting version 204
 [0.787913] floppy: module verification failed: signature and/or required 
 key missing - tainting kernel
 [0.825174] FDC 0 is a S82078B
 …
 [1.448178] tsc: Refined TSC clocksource calibration: 2294.376 MHz
 error: unexpectedly disconnected from boot status daemon
 Begin: Loading essential drivers ... done.
 Begin: Running /scripts/init-premount ... done.
 Begin: Mounting root file system ... Begin: Running /scripts/local-top ... 
 done.
 Begin: Running /scripts/local-premount ... done.
 [2.384452] EXT4-fs (vda1): mounted filesystem with ordered data mode. 
 Opts: (null)
 Begin: Running /scripts/local-bottom ... done.
 done.
 Begin: Running /scripts/init-bottom ... done.
 [3.021268] init: mountall main process (193) killed by FPE signal
 General error mounting filesystems.
 A maintenance shell will now be started.
 CONTROL-D will terminate this shell and reboot the system.
 root@box-web1:~# 
 The console is stuck, I can't get to the rescue shell
  
 I can rbd map the volume and mount it from a physical host - the filesystem 
 etc all is in good order.
  
 Any ideas?
  
 cheers
 jc
  
 -- 
 SWITCH
 Jens-Christian Fischer, Peta Solutions
 Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
 phone +41 44 268 15 15, direct +41 44 268 15 71
 jens-christian.fisc...@switch.ch
 http://www.switch.ch
 
 http://www.switch.ch/socialmedia

Re: [ceph-users] Openstack Havana, boot from volume fails

2013-11-25 Thread Jens-Christian Fischer
Hi Steffen

the virsh secret is defined on all compute hosts. Booting from a volume works 
(it's the boot from image (create volume) part that doesn't work

cheers
jc
 
-- 
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch

http://www.switch.ch/socialmedia

On 21.11.2013, at 15:46, Steffen Thorhauer thorh...@iti.cs.uni-magdeburg.de 
wrote:

 Hi,
 I think you have to set the libvirt secret for your ceph UUID on your 
 nova-compute node  like 
 
 <secret ephemeral='no' private='no'>
   <uuid>e1915277-e3a5-4547-bc9e-4991c6864dc7</uuid>
   <usage type='ceph'>
     <name>client.volumes secret</name>
   </usage>
 </secret>
 
 in secret.xml 
 
 virsh secret-define secret.xml
 and set the secret
 
 virsh  secret-set-value e1915277-e3a5-4547-bc9e-4991c6864dc7 
 ceph-secret-of.client-volumes
 
 Regards,
   Steffen Thorhauer
 
 On 11/21/2013 03:05 PM, Jens-Christian Fischer wrote:
 Hi all
 
 I'm playing with the boot from volume options in Havana and have run into 
 problems:
 
 (Openstack Havana, Ceph Dumpling (0.67.4), rbd for glance, cinder and 
 experimental ephemeral disk support)
 
 The following things do work:
 - glance images are in rbd
 - cinder volumes are in rbd
 - creating a VM from an image works
 - creating a VM from a snapshot works
 
 
 However, the booting from volume fails:
 
 Steps to reproduce:
 
 Boot from image
 Create snapshot from running instance
 Create volume from this snapshot
 Start a new instance with boot from volume and the volume just created:
 
 The boot process hangs after around 3 seconds, and the console.log of the 
 instance shows this:
 
 [0.00] Linux version 3.11.0-12-generic (buildd@allspice) (gcc 
 version 4.8.1 (Ubuntu/Linaro 4.8.1-10ubuntu7) ) #19-Ubuntu SMP Wed Oct 9 
 16:20:46 UTC 2013 (Ubuntu 3.11.0-12.19-generic 3.11.3)
 [0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-3.11.0-12-generic 
 root=LABEL=cloudimg-rootfs ro console=tty1 console=ttyS0
 ...
 [0.098221] Brought up 1 CPUs
 [0.098964] smpboot: Total of 1 processors activated (4588.94 BogoMIPS)
 [0.100408] NMI watchdog: enabled on all CPUs, permanently consumes one 
 hw-PMU counter.
 [0.102667] devtmpfs: initialized
 …
 [0.560202] Linux agpgart interface v0.103
 [0.562276] brd: module loaded
 [0.563599] loop: module loaded
 [0.565315]  vda: vda1
 [0.568386] scsi0 : ata_piix
 [0.569217] scsi1 : ata_piix
 [0.569972] ata1: PATA max MWDMA2 cmd 0x1f0 ctl 0x3f6 bmdma 0xc0a0 irq 14
 [0.571289] ata2: PATA max MWDMA2 cmd 0x170 ctl 0x376 bmdma 0xc0a8 irq 15
 …
 [0.742082] Freeing unused kernel memory: 1040K (8800016fc000 - 
 88000180)
 [0.746153] Freeing unused kernel memory: 836K (880001b2f000 - 
 880001c0)
 Loading, please wait...
 [0.764177] systemd-udevd[95]: starting version 204
 [0.787913] floppy: module verification failed: signature and/or required 
 key missing - tainting kernel
 [0.825174] FDC 0 is a S82078B
 …
 [1.448178] tsc: Refined TSC clocksource calibration: 2294.376 MHz
 error: unexpectedly disconnected from boot status daemon
 Begin: Loading essential drivers ... done.
 Begin: Running /scripts/init-premount ... done.
 Begin: Mounting root file system ... Begin: Running /scripts/local-top ... 
 done.
 Begin: Running /scripts/local-premount ... done.
 [2.384452] EXT4-fs (vda1): mounted filesystem with ordered data mode. 
 Opts: (null)
 Begin: Running /scripts/local-bottom ... done.
 done.
 Begin: Running /scripts/init-bottom ... done.
 [3.021268] init: mountall main process (193) killed by FPE signal
 General error mounting filesystems.
 A maintenance shell will now be started.
 CONTROL-D will terminate this shell and reboot the system.
 root@box-web1:~# 
 The console is stuck, I can't get to the rescue shell
 
 I can rbd map the volume and mount it from a physical host - the 
 filesystem etc all is in good order.
 
 Any ideas?
 
 cheers
 jc
 
 -- 
 SWITCH
 Jens-Christian Fischer, Peta Solutions
 Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
 phone +41 44 268 15 15, direct +41 44 268 15 71
 jens-christian.fisc...@switch.ch
 http://www.switch.ch
 
 http://www.switch.ch/socialmedia
 
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 -- 
 __
 Steffen Thorhauer
 
 Department of Technical and Business Information Systems (ITI)
 Faculty of Computer Science (FIN)
   Otto von Guericke University Magdeburg
 Universitaetsplatz 2
 39106 Magdeburg, Germany
 
 phone: 0391 67 52996
 fax: 0391 67 12341
 email: s...@iti.cs.uni-magdeburg.de
 url: http://wwwiti.cs.uni-magdeburg.de/~thorhaue
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com

Re: [ceph-users] Openstack Havana, boot from volume fails

2013-11-21 Thread Jens-Christian Fischer
the weird thing is that I have some volumes that were created from a snapshot 
that actually boot (they complain about not being able to connect to the 
metadata server, which I guess is a totally different problem, but in the end 
they come up).

I haven't been able to see the difference between the volumes….

I re-snapshotted the instance whose volume wouldn't boot, and made a volume out 
of it, and this instance booted nicely from the volume.

weirder and weirder…

/jc

-- 
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch

http://www.switch.ch/socialmedia

On 21.11.2013, at 15:05, Jens-Christian Fischer 
jens-christian.fisc...@switch.ch wrote:

 Hi all
 
 I'm playing with the boot from volume options in Havana and have run into 
 problems:
 
 (Openstack Havana, Ceph Dumpling (0.67.4), rbd for glance, cinder and 
 experimental ephemeral disk support)
 
 The following things do work:
 - glance images are in rbd
 - cinder volumes are in rbd
 - creating a VM from an image works
 - creating a VM from a snapshot works
 
 
 However, the booting from volume fails:
 
 Steps to reproduce:
 
 Boot from image
 Create snapshot from running instance
 Create volume from this snapshot
 Start a new instance with boot from volume and the volume just created:
 
 The boot process hangs after around 3 seconds, and the console.log of the 
 instance shows this:
 
 [0.00] Linux version 3.11.0-12-generic (buildd@allspice) (gcc version 
 4.8.1 (Ubuntu/Linaro 4.8.1-10ubuntu7) ) #19-Ubuntu SMP Wed Oct 9 16:20:46 UTC 
 2013 (Ubuntu 3.11.0-12.19-generic 3.11.3)
 [0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-3.11.0-12-generic 
 root=LABEL=cloudimg-rootfs ro console=tty1 console=ttyS0
 ...
 [0.098221] Brought up 1 CPUs
 [0.098964] smpboot: Total of 1 processors activated (4588.94 BogoMIPS)
 [0.100408] NMI watchdog: enabled on all CPUs, permanently consumes one 
 hw-PMU counter.
 [0.102667] devtmpfs: initialized
 …
 [0.560202] Linux agpgart interface v0.103
 [0.562276] brd: module loaded
 [0.563599] loop: module loaded
 [0.565315]  vda: vda1
 [0.568386] scsi0 : ata_piix
 [0.569217] scsi1 : ata_piix
 [0.569972] ata1: PATA max MWDMA2 cmd 0x1f0 ctl 0x3f6 bmdma 0xc0a0 irq 14
 [0.571289] ata2: PATA max MWDMA2 cmd 0x170 ctl 0x376 bmdma 0xc0a8 irq 15
 …
 [0.742082] Freeing unused kernel memory: 1040K (8800016fc000 - 
 88000180)
 [0.746153] Freeing unused kernel memory: 836K (880001b2f000 - 
 880001c0)
 Loading, please wait...
 [0.764177] systemd-udevd[95]: starting version 204
 [0.787913] floppy: module verification failed: signature and/or required 
 key missing - tainting kernel
 [0.825174] FDC 0 is a S82078B
 …
 [1.448178] tsc: Refined TSC clocksource calibration: 2294.376 MHz
 error: unexpectedly disconnected from boot status daemon
 Begin: Loading essential drivers ... done.
 Begin: Running /scripts/init-premount ... done.
 Begin: Mounting root file system ... Begin: Running /scripts/local-top ... 
 done.
 Begin: Running /scripts/local-premount ... done.
 [2.384452] EXT4-fs (vda1): mounted filesystem with ordered data mode. 
 Opts: (null)
 Begin: Running /scripts/local-bottom ... done.
 done.
 Begin: Running /scripts/init-bottom ... done.
 [3.021268] init: mountall main process (193) killed by FPE signal
 General error mounting filesystems.
 A maintenance shell will now be started.
 CONTROL-D will terminate this shell and reboot the system.
 root@box-web1:~# 
 The console is stuck, I can't get to the rescue shell
 
 I can rbd map the volume and mount it from a physical host - the filesystem 
 etc all is in good order.
 
 Any ideas?
 
 cheers
 jc
 
 -- 
 SWITCH
 Jens-Christian Fischer, Peta Solutions
 Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
 phone +41 44 268 15 15, direct +41 44 268 15 71
 jens-christian.fisc...@switch.ch
 http://www.switch.ch
 
 http://www.switch.ch/socialmedia
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OpenStack, Boot from image (create volume) failed with volumes in rbd

2013-11-21 Thread Jens-Christian Fischer
Hi all

I'm playing with the boot from volume options in Havana and have run into 
problems:

(Openstack Havana, Ceph Dumpling (0.67.4), rbd for glance, cinder and 
experimental ephemeral disk support)

The following things do work:
- glance images are in rbd
- cinder volumes are in rbd
- creating a VM from an image works
- creating a VM from a snapshot works


However, the booting from volume options fail in various ways:


* Select Boot from Image (create volume) 

fails booting, with the VM complaining that there was no bootable device (“Boot 
failed: not a bootable disk”).

The libvirt.xml definition is as follows:

<disk type="network" device="disk">
  <driver name="qemu" type="raw" cache="none"/>
  <source protocol="rbd" name="volumes/volume-fa635ee4-5ea5-429d-bfab-6e53aa687245">
    <host name="::0:6::11c" port="6789"/>
    <host name="::0:6::11d" port="6789"/>
    <host name="::0:6::11e" port="6789"/>
  </source>
  <auth username="volumes">
    <secret type="ceph" uuid="e1915277-e3a5-4547-bc9e-4991c6864dc7"/>
  </auth>
  <target bus="virtio" dev="vda"/>
  <serial>fa635ee4-5ea5-429d-bfab-6e53aa687245</serial>
</disk>

The qemu command line is this: 

 qemu-system-x86_64 -machine accel=kvm:tcg -name instance-0187 -S -machine 
pc-i440fx-1.5,accel=kvm,usb=off -cpu 
SandyBridge,+pdpe1gb,+osxsave,+dca,+pcid,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
 -m 2048 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 
f28f9b90-9e75-45a7-ac34-c8dd2c6e3c5b -smbios type=1,manufacturer=OpenStack 
Foundation,product=OpenStack 
Nova,version=2013.2,serial=078965e4-1a79-0010-82d4-089e015a2b41,uuid=f28f9b90-9e75-45a7-ac34-c8dd2c6e3c5b
 -no-user-config -nodefaults -chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/instance-0187.monitor,server,nowait
 -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew 
-no-kvm-pit-reinjection -no-shutdown -device 
piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive 
file=rbd:volumes/volume-fa635ee4-5ea5-429d-bfab-6e53aa687245:id=volumes:key=AQAsO2pSYEjXNBAAYB02+zSa2boqFcl+aZNwLw==:auth_supported=cephx\;none:mon_host=[\:\:0\:6\:\:11c]\:6789\;[\:\:0\:6\:\:11d]\:6789\;[\:\:0\:6\:\:11e]\:6789,if=none,id=drive-virtio-disk0,format=raw,serial=fa635ee4-5ea5-429d-bfab-6e53aa687245,cache=none
 -device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
 -netdev tap,fd=30,id=hostnet0,vhost=on,vhostfd=31 -device 
virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:93:a1:88,bus=pci.0,addr=0x3 
-chardev 
file,id=charserial0,path=/var/lib/nova/instances/f28f9b90-9e75-45a7-ac34-c8dd2c6e3c5b/console.log
 -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 
-device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0 
-vnc 0.0.0.0:4 -k en-us -vga cirrus -device 
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5

The volume is known to cinder:

cinder list --all-tenants | grep fa635ee4
| fa635ee4-5ea5-429d-bfab-6e53aa687245 | in-use |  | 10 | None | true | f28f9b90-9e75-45a7-ac34-c8dd2c6e3c5b |

and rbd

root@hxs:~# rbd --pool volumes ls | grep fa635ee4
volume-fa635ee4-5ea5-429d-bfab-6e53aa687245

the file is a qcow2 file:

root@hxs:~# rbd map --pool volumes volume-fa635ee4-5ea5-429d-bfab-6e53aa687245
root@hxs:~# mount /dev/rbd2p1 /dev/rbd
mount: special device /dev/rbd2p1 does not exist

root@hxs:~#  dd if=/dev/rbd2 of=foo count=100
root@hxs:~# file foo
foo: QEMU QCOW Image (v2), 2147483648 bytes

It is our understanding that we need raw volumes to get the boot process 
working. Why is the volume created as a qcow2 volume?
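
A sketch of one way around this, assuming the usual behaviour that cinder 
simply copies the image bytes into the new volume: store the glance image in 
raw format, so the volume created from it is raw as well (file names and the 
image id below are made up):

glance image-download --file foo.qcow2 <image-id>     # fetch the qcow2 image
qemu-img convert -f qcow2 -O raw foo.qcow2 foo.raw    # convert it to raw
glance image-create --name foo-raw --disk-format raw \
    --container-format bare --file foo.raw            # re-upload as raw
# then create the boot volume from the raw image instead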

cheers
jc

-- 
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch

http://www.switch.ch/socialmedia

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Openstack Havana, boot from volume fails

2013-11-21 Thread Jens-Christian Fischer
Hi all

I'm playing with the boot from volume options in Havana and have run into 
problems:

(Openstack Havana, Ceph Dumpling (0.67.4), rbd for glance, cinder and 
experimental ephemeral disk support)

The following things do work:
- glance images are in rbd
- cinder volumes are in rbd
- creating a VM from an image works
- creating a VM from a snapshot works


However, the booting from volume fails:

Steps to reproduce:

Boot from image
Create snapshot from running instance
Create volume from this snapshot
Start a new instance with boot from volume and the volume just created:

The boot process hangs after around 3 seconds, and the console.log of the 
instance shows this:

[0.00] Linux version 3.11.0-12-generic (buildd@allspice) (gcc version 
4.8.1 (Ubuntu/Linaro 4.8.1-10ubuntu7) ) #19-Ubuntu SMP Wed Oct 9 16:20:46 UTC 
2013 (Ubuntu 3.11.0-12.19-generic 3.11.3)
[0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-3.11.0-12-generic 
root=LABEL=cloudimg-rootfs ro console=tty1 console=ttyS0
...
[0.098221] Brought up 1 CPUs
[0.098964] smpboot: Total of 1 processors activated (4588.94 BogoMIPS)
[0.100408] NMI watchdog: enabled on all CPUs, permanently consumes one 
hw-PMU counter.
[0.102667] devtmpfs: initialized
…
[0.560202] Linux agpgart interface v0.103
[0.562276] brd: module loaded
[0.563599] loop: module loaded
[0.565315]  vda: vda1
[0.568386] scsi0 : ata_piix
[0.569217] scsi1 : ata_piix
[0.569972] ata1: PATA max MWDMA2 cmd 0x1f0 ctl 0x3f6 bmdma 0xc0a0 irq 14
[0.571289] ata2: PATA max MWDMA2 cmd 0x170 ctl 0x376 bmdma 0xc0a8 irq 15
…
[0.742082] Freeing unused kernel memory: 1040K (8800016fc000 - 
88000180)
[0.746153] Freeing unused kernel memory: 836K (880001b2f000 - 
880001c0)
Loading, please wait...
[0.764177] systemd-udevd[95]: starting version 204
[0.787913] floppy: module verification failed: signature and/or required 
key missing - tainting kernel
[0.825174] FDC 0 is a S82078B
…
[1.448178] tsc: Refined TSC clocksource calibration: 2294.376 MHz
error: unexpectedly disconnected from boot status daemon
Begin: Loading essential drivers ... done.
Begin: Running /scripts/init-premount ... done.
Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
Begin: Running /scripts/local-premount ... done.
[2.384452] EXT4-fs (vda1): mounted filesystem with ordered data mode. Opts: 
(null)
Begin: Running /scripts/local-bottom ... done.
done.
Begin: Running /scripts/init-bottom ... done.
[3.021268] init: mountall main process (193) killed by FPE signal
General error mounting filesystems.
A maintenance shell will now be started.
CONTROL-D will terminate this shell and reboot the system.
root@box-web1:~# 
The console is stuck; I can't get to the rescue shell.

I can rbd map the volume and mount it from a physical host - the filesystem 
etc. is all in good order.

Any ideas?

cheers
jc

-- 
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch

http://www.switch.ch/socialmedia

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ephemeral RBD with Havana and Dumpling

2013-11-14 Thread Jens-Christian Fischer
We have migration working partially - it works through Horizon (to a random 
host) and sometimes through the CLI.

We are using the nova fork by Josh Durgin 
https://github.com/jdurgin/nova/commits/havana-ephemeral-rbd - are there more 
patches that need to be integrated?

cheers
jc


-- 
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch

http://www.switch.ch/socialmedia

On 14.11.2013, at 13:18, Haomai Wang haomaiw...@gmail.com wrote:

 Yes, we still need a patch to make live-migration work.
 
 
 On Thu, Nov 14, 2013 at 8:12 PM, Matt Thompson watering...@gmail.com wrote:
 Hi Dinu,
 
 Initial tests for me using Ubuntu's 2013.2-0ubuntu1~cloud0 indicate that live 
 migrations do not work when using libvirt_images_type=rbd (Live migration 
 can not be used without shared storage.)  I'll need to dig through LP to see 
 if this has since been addressed.
 
 On a side note, live migrations now appear to work for me when booting 
 instances from a cinder volume.
 
 -Matt
 
 
 On Tue, Nov 12, 2013 at 11:44 PM, Dinu Vlad dinuvla...@gmail.com wrote:
 Out of curiosity - can you live-migrate instances with this setup?
 
 
 
 On Nov 12, 2013, at 10:38 PM, Dmitry Borodaenko dborodae...@mirantis.com 
 wrote:
 
  And to answer my own question, I was missing a meaningful error
  message: what the ObjectNotFound exception I got from librados didn't
  tell me was that I didn't have the images keyring file in /etc/ceph/
  on my compute node. After 'ceph auth get-or-create client.images 
  /etc/ceph/ceph.client.images.keyring' and reverting images caps back
  to original state, it all works!
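 
 (In other words, the missing piece was a keyring for client.images in /etc/ceph
 on the compute node - roughly like this; the caps shown are the generic ones
 from the Ceph/libvirt docs, not necessarily the ones used here, and the host
 name is made up:)

ceph auth get-or-create client.images \
    mon 'allow r' \
    osd 'allow class-read object_prefix rbd_children, allow rwx pool=images' \
    -o /etc/ceph/ceph.client.images.keyring
scp /etc/ceph/ceph.client.images.keyring compute1:/etc/ceph/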
 
  On Tue, Nov 12, 2013 at 12:19 PM, Dmitry Borodaenko
  dborodae...@mirantis.com wrote:
  I can get ephemeral storage for Nova to work with RBD backend, but I
  don't understand why it only works with the admin cephx user? With a
  different user starting a VM fails, even if I set its caps to 'allow
  *'.
 
  Here's what I have in nova.conf:
  libvirt_images_type=rbd
  libvirt_images_rbd_pool=images
  rbd_secret_uuid=fd9a11cc-6995-10d7-feb4-d338d73a4399
  rbd_user=images
 
  The secret UUID is defined following the same steps as for Cinder and 
  Glance:
  http://ceph.com/docs/master/rbd/libvirt/
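 
 (For reference, those steps boil down to roughly the following - a sketch only,
 with the UUID taken from the nova.conf above:)

cat > images-secret.xml <<EOF
<secret ephemeral='no' private='no'>
  <uuid>fd9a11cc-6995-10d7-feb4-d338d73a4399</uuid>
  <usage type='ceph'>
    <name>client.images secret</name>
  </usage>
</secret>
EOF
virsh secret-define --file images-secret.xml
virsh secret-set-value --secret fd9a11cc-6995-10d7-feb4-d338d73a4399 \
    --base64 "$(ceph auth get-key client.images)"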
 
  BTW rbd_user option doesn't seem to be documented anywhere, is that a
  documentation bug?
 
  And here's what 'ceph auth list' tells me about my cephx users:
 
  client.admin
 key: AQCoSX1SmIo0AxAAnz3NffHCMZxyvpz65vgRDg==
 caps: [mds] allow
 caps: [mon] allow *
 caps: [osd] allow *
  client.images
 key: AQC1hYJS0LQhDhAAn51jxI2XhMaLDSmssKjK+g==
 caps: [mds] allow
 caps: [mon] allow *
 caps: [osd] allow *
  client.volumes
 key: AQALSn1ScKruMhAAeSETeatPLxTOVdMIt10uRg==
 caps: [mon] allow r
 caps: [osd] allow class-read object_prefix rbd_children, allow
  rwx pool=volumes, allow rx pool=images
 
  Setting rbd_user to images or volumes doesn't work.
 
  What am I missing?
 
  Thanks,
 
  --
  Dmitry Borodaenko
 
 
 
  --
  Dmitry Borodaenko
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 
 
 -- 
 Best Regards,
 
 Wheat
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Havana RBD - a few problems

2013-11-08 Thread Jens-Christian Fischer
Hi Josh

 Using libvirt_image_type=rbd to replace ephemeral disks is new with
 Havana, and unfortunately some bug fixes did not make it into the
 release. I've backported the current fixes on top of the stable/havana
 branch here:
 
 https://github.com/jdurgin/nova/tree/havana-ephemeral-rbd

that looks really useful. I have tried to patch our installation, but so far 
haven't been successful: first I tried to replace the whole 
/usr/share/pyshared/nova directory with the one from your repository, then only 
the changed files (Ubuntu Saucy). In both cases nova-compute dies immediately 
after starting. There is probably a really simple way to install your version 
on an Ubuntu server - but I don't know how...

 2) Creating a new instance from an ISO image fails completely - no
 bootable disk found, says the KVM console. Related?
 
 This sounds like a bug in the ephemeral rbd code - could you file
 it in launchpad if you can reproduce with file injection disabled?
 I suspect it's not being attached as a cdrom.

Will try to reproduce as soon as I have the patched version


 You're seeing some issues in the ephemeral rbd code, which is new
 in Havana. None of these affect non-ephemeral rbd, or Grizzly.
 Thanks for reporting them!

thanks for your help

cheers
jc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Havana RBD - a few problems

2013-11-08 Thread Jens-Christian Fischer
 Using libvirt_image_type=rbd to replace ephemeral disks is new with
 Havana, and unfortunately some bug fixes did not make it into the
 release. I've backported the current fixes on top of the stable/havana
 branch here:
 
 https://github.com/jdurgin/nova/tree/havana-ephemeral-rbd
 
 that looks really useful. I have tried to patch our installation, but so far 
 haven't been successful: First I tried to replace the whole 
 /usr/share/pyshared/nova directory with the one from your repository, then 
 only the changed files. (Ubuntu Saucy). In both cases nova-compute dies 
 immediately after starting it. There is probably a really simple way to 
 install your version on an Ubuntu server - but I don't know how…

ok - got it working by cherry-picking the last few commits and then replacing 
only the 5 affected files. 
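
Roughly what that amounted to (a sketch; the thread doesn't list which 5 files 
were affected, imagebackend.py is the obvious one):

git clone -b havana-ephemeral-rbd https://github.com/jdurgin/nova.git
# copy the touched files over the packaged ones, e.g.:
cp nova/nova/virt/libvirt/imagebackend.py \
   /usr/share/pyshared/nova/virt/libvirt/imagebackend.py
service nova-compute restart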

Resize of disk on instance creation works! Yay

 
 2) Creating a new instance from an ISO image fails completely - no
 bootable disk found, says the KVM console. Related?
 
 This sounds like a bug in the ephemeral rbd code - could you file
 it in launchpad if you can reproduce with file injection disabled?
  I suspect it's not being attached as a cdrom.
 
 Will try to reproduce as soon as I have the patched version

still doesn't work - will file a bug

cheers
jc

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Havana RBD - a few problems

2013-11-08 Thread Jens-Christian Fischer
and one more:

boot from image (create a new volume) doesn't work either: it leads to a VM 
that complains about a non-bootable disk (just like the ISO case). This is 
actually an improvement: earlier, nova waited for ages for the image to be 
created (I guess this is the result of the glance/cinder RBD improvements).

cheers
jc

-- 
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch

http://www.switch.ch/socialmedia

On 08.11.2013, at 02:20, Josh Durgin josh.dur...@inktank.com wrote:

 On 11/08/2013 12:15 AM, Jens-Christian Fischer wrote:
 Hi all
 
 we have installed a Havana OpenStack cluster with RBD as the backing
 storage for volumes, images and the ephemeral images. The code as
 delivered in
 https://github.com/openstack/nova/blob/master/nova/virt/libvirt/imagebackend.py#L498
  fails,
 because the RBD.path it not set. I have patched this to read:
 
 Using libvirt_image_type=rbd to replace ephemeral disks is new with
 Havana, and unfortunately some bug fixes did not make it into the
 release. I've backported the current fixes on top of the stable/havana
 branch here:
 
 https://github.com/jdurgin/nova/tree/havana-ephemeral-rbd
 
  @@ -419,10 +419,12 @@ class Rbd(Image):
           if path:
               try:
                   self.rbd_name = path.split('/')[1]
  +                self.path = path
               except IndexError:
                   raise exception.InvalidDevicePath(path=path)
           else:
               self.rbd_name = '%s_%s' % (instance['name'], disk_name)
  +            self.path = 'volumes/%s' % self.rbd_name
           self.snapshot_name = snapshot_name
           if not CONF.libvirt_images_rbd_pool:
               raise RuntimeError(_('You should specify'
 
 
 but am not sure this is correct. I have the following problems:
 
 1) can't inject data into image
 
 2013-11-07 16:59:25.251 24891 INFO nova.virt.libvirt.driver
 [req-f813ef24-de7d-4a05-ad6f-558e27292495
 c66a737acf0545fdb9a0a920df0794d9 2096e25f5e814882b5907bc5db342308]
 [instance: 2fa02e4f-f804-4679-9507-736eeebd9b8d] Injecting key into
  image fc8179d4-14f3-4f21-a76d-72b03b5c1862
 2013-11-07 16:59:25.269 24891 WARNING nova.virt.disk.api
 [req-f813ef24-de7d-4a05-ad6f-558e27292495
 c66a737acf0545fdb9a0a920df0794d9 2096e25f5e814882b5907bc5db342308]
 Ignoring error injecting data into image (Error mounting volumes/
 instance-
 0089_disk with libguestfs (volumes/instance-0089_disk: No such file
 or directory))
 
 possibly the self.path = … is wrong - but what are the correct values?
 
 Like Dinu mentioned, I'd suggest disabling file injection and using
 the metadata service + cloud-init instead. We should probably change
 nova to log an error about this configuration when ephemeral volumes
 are rbd.
 
 2) Creating a new instance from an ISO image fails completely - no
 bootable disk found, says the KVM console. Related?
 
 This sounds like a bug in the ephemeral rbd code - could you file
 it in launchpad if you can reproduce with file injection disabled?
 I suspect it's not being attached as a cdrom.
 
 3) When creating a new instance from an image (non ISO images work), the
 disk is not resized to the size specified in the flavor (but left at the
 size of the original image)
 
 This one is fixed in the backports already.
 
 I would be really grateful, if those people that have Grizzly/Havana
 running with an RBD backend could pipe in here…
 
 You're seeing some issues in the ephemeral rbd code, which is new
 in Havana. None of these affect non-ephemeral rbd, or Grizzly.
 Thanks for reporting them!
 
 Josh
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Havana RBD - a few problems

2013-11-07 Thread Jens-Christian Fischer
Hi all

we have installed a Havana OpenStack cluster with RBD as the backing storage 
for volumes, images and the ephemeral images. The code as delivered in 
https://github.com/openstack/nova/blob/master/nova/virt/libvirt/imagebackend.py#L498
 fails, because the RBD.path it not set. I have patched this to read:

@@ -419,10 +419,12 @@ class Rbd(Image):
         if path:
             try:
                 self.rbd_name = path.split('/')[1]
+                self.path = path
             except IndexError:
                 raise exception.InvalidDevicePath(path=path)
         else:
             self.rbd_name = '%s_%s' % (instance['name'], disk_name)
+            self.path = 'volumes/%s' % self.rbd_name
         self.snapshot_name = snapshot_name
         if not CONF.libvirt_images_rbd_pool:
             raise RuntimeError(_('You should specify'

but am not sure this is correct. I have the following problems:

1) can't inject data into image

2013-11-07 16:59:25.251 24891 INFO nova.virt.libvirt.driver 
[req-f813ef24-de7d-4a05-ad6f-558e27292495 c66a737acf0545fdb9a0a920df0794d9 
2096e25f5e814882b5907bc5db342308] [instance: 
2fa02e4f-f804-4679-9507-736eeebd9b8d] Injecting key into
 image fc8179d4-14f3-4f21-a76d-72b03b5c1862
2013-11-07 16:59:25.269 24891 WARNING nova.virt.disk.api 
[req-f813ef24-de7d-4a05-ad6f-558e27292495 c66a737acf0545fdb9a0a920df0794d9 
2096e25f5e814882b5907bc5db342308] Ignoring error injecting data into image 
(Error mounting volumes/ instance-
0089_disk with libguestfs (volumes/instance-0089_disk: No such file or 
directory))

possibly the self.path = … is wrong - but what are the correct values?
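
(As suggested elsewhere in this thread, one workaround is to disable file 
injection altogether and rely on the metadata service + cloud-init - a 
nova.conf sketch, assuming Havana-era option names:)

# nova.conf on the compute nodes
libvirt_inject_password = false
libvirt_inject_key = false
libvirt_inject_partition = -2    # -2 disables disk injection entirely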


2) Creating a new instance from an ISO image fails completely - no bootable 
disk found, says the KVM console. Related?

3) When creating a new instance from an image (non ISO images work), the disk 
is not resized to the size specified in the flavor (but left at the size of the 
original image)

I would be really grateful, if those people that have Grizzly/Havana running 
with an RBD backend could pipe in here…

thanks
Jens-Christian


-- 
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch

http://www.switch.ch/socialmedia

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] interested questions

2013-10-30 Thread Jens-Christian Fischer
Hi  there

 I am interested in the following questions:
 1.Does the amount of HDD performance cluster?

not quite sure I understand your question, but: Adding more disks and more 
servers in general helps performance, because the requests will be spread out 
among more spindles. We see  1 GByte/Second sequential read from our 64 slow 
SATA drives (in 10 servers)

 2.Is there any experience of implementing KVM virtualization and Ceph on the 
 same server?

Yes - we have/had OpenStack Folsom running on the same hosts that hosted the 
OSDs. This is not a recommended way of implementing things, but it works 
reasonably well for testing purposes. We are planning/building our next cluster 
now (a production cluster) and plan to separate OSD/MON servers from OpenStack 
compute servers.

cheers
Jens-Christian

-- 
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch

http://www.switch.ch/socialmedia___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] one pg stuck with 2 unfound pieces

2013-09-23 Thread Jens-Christian Fischer
Hi Sam

in the meantime, the output of ceph pg 0.cfa query has become quite a bit 
longer (for better or worse) - see:  http://pastebin.com/0Jxmm353

I have restarted osd.23 with the debug log settings and have extracted these 
0.cfa related log lines - I can't interpret them. There might be more, I can 
provide the complete log file if you need it: http://pastebin.com/dYsihsx4
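
(For the record, "the debug log settings" are the ones from Sam's mail quoted 
below, applied roughly like this:)

# in ceph.conf before restarting the daemon:
[osd.23]
    debug osd = 20
    debug ms = 1
    debug filestore = 20

# or injected into the running OSD:
ceph tell osd.23 injectargs '--debug-osd 20 --debug-ms 1 --debug-filestore 20'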

0.cfa has been out for so long that it shows up as being down forever:

HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean; 1 mons 
down, quorum 0,1,2,4 h1,h5,s2,s4
pg 0.cfa is stuck inactive since forever, current state incomplete, last acting 
[23,50,18]
pg 0.cfa is stuck unclean since forever, current state incomplete, last acting 
[23,50,18]
pg 0.cfa is incomplete, acting [23,50,18]

also, we can't revert 0.cfa

root@h0:~# ceph pg 0.cfa mark_unfound_lost revert
pg has no unfound objects

This stuck pg seems to fill up our mons (they need to keep old data, right?) 
which makes starting a new mon a task of seemingly herculean proportions.

Any ideas on how to proceed?

thanks

Jens-Christian




-- 
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch

http://www.switch.ch/socialmedia

On 14.08.2013, at 20:53, Samuel Just sam.j...@inktank.com wrote:

 Try restarting the two osd processes with debug osd = 20, debug ms =
 1, debug filestore = 20.  Restarting the osds may clear the problem,
 but if it recurs, the logs should help explain what's going on.
 -Sam
 
 On Wed, Aug 14, 2013 at 12:17 AM, Jens-Christian Fischer
 jens-christian.fisc...@switch.ch wrote:
 On 13.08.2013, at 21:09, Samuel Just sam.j...@inktank.com wrote:
 
 You can run 'ceph pg 0.cfa mark_unfound_lost revert'. (Revert Lost
 section of http://ceph.com/docs/master/rados/operations/placement-groups/).
 -Sam
 
 
 As I wrote further down the info, ceph wouldn't let me do that:
 
 root@ineri ~$ ceph pg 0.cfa  mark_unfound_lost revert
 pg has 2 objects but we haven't probed all sources, not marking lost
 
 I'm looking for a way that forces the (re) probing of the sources…
 
 cheers
 jc
 
 
 
 
 
 On Tue, Aug 13, 2013 at 6:50 AM, Jens-Christian Fischer
 jens-christian.fisc...@switch.ch wrote:
 We have a cluster with 10 servers, 64 OSDs and 5 Mons on them. The OSDs are
 3TB disk, formatted with btrfs and the servers are either on Ubuntu 12.10 
 or
 13.04.
 
 Recently one of the servers (13.04) stood still (due to problems with btrfs
 - something we have seen a few times). I decided to not try to recover the
 disks, but reformat them with XFS. I removed the OSDs, reformatted, and
 re-created them (they got the same OSD numbers)
 
  I redid this twice (because I wrongly partitioned the disks in the first
 place) and I ended up with 2 unfound pieces in one pg:
 
 root@s2:~# ceph health details
 HEALTH_WARN 1 pgs degraded; 1 pgs recovering; 1 pgs stuck unclean; recovery
 4448/28915270 degraded (0.015%); 2/9854766 unfound (0.000%)
 pg 0.cfa is stuck unclean for 1004252.309704, current state
 active+recovering+degraded+remapped, last acting [23,50]
 pg 0.cfa is active+recovering+degraded+remapped, acting [23,50], 2 unfound
 recovery 4448/28915270 degraded (0.015%); 2/9854766 unfound (0.000%)
 
 
 root@s2:~# ceph pg 0.cfa query
 
 { state: active+recovering+degraded+remapped,
 epoch: 28197,
 up: [
   23,
   50,
   18],
 acting: [
   23,
   50],
 info: { pgid: 0.cfa,
 last_update: 28082'7774,
 last_complete: 23686'7083,
 log_tail: 14360'4061,
 last_backfill: MAX,
 purged_snaps: [],
 history: { epoch_created: 1,
 last_epoch_started: 28197,
 last_epoch_clean: 24810,
 last_epoch_split: 0,
 same_up_since: 28195,
 same_interval_since: 28196,
 same_primary_since: 26036,
 last_scrub: 20585'6801,
 last_scrub_stamp: 2013-07-28 15:40:53.298786,
 last_deep_scrub: 20585'6801,
 last_deep_scrub_stamp: 2013-07-28 15:40:53.298786,
 last_clean_scrub_stamp: 2013-07-28 15:40:53.298786},
 stats: { version: 28082'7774,
 reported: 28197'41950,
 state: active+recovering+degraded+remapped,
 last_fresh: 2013-08-13 14:34:33.057271,
 last_change: 2013-08-13 14:34:33.057271,
 last_active: 2013-08-13 14:34:33.057271,
 last_clean: 2013-08-01 23:50:18.414082,
 last_became_active: 2013-05-29 13:10:51.366237,
 last_unstale: 2013-08-13 14:34:33.057271,
 mapping_epoch: 28195,
 log_start: 14360'4061,
 ondisk_log_start: 14360'4061,
 created: 1,
 last_epoch_clean: 24810,
 parent: 0.0,
 parent_split_bits: 0,
 last_scrub: 20585'6801,
 last_scrub_stamp: 2013-07-28 15:40:53.298786,
 last_deep_scrub: 20585'6801,
 last_deep_scrub_stamp: 2013-07-28 15:40:53.298786

[ceph-users] Sparse files copied to CephFS not sparse

2013-09-16 Thread Jens-Christian Fischer
Hi all

as part of moving our OpenStack VM instance store from dedicated disks on the 
physical hosts to a CephFS backed by an SSD pool, we noticed that the files 
created on CephFS aren't sparse, even though the original files were.

This is on 
root@s2:~# ls -lhs /var/lib/nova/instances/_base
total 63G
750M -rw-r--r-- 1 nova nova 2.0G Jul 10 21:40 
1a11de23fe75a210b4da631366513cb7c22ef311
750M -rw-r--r-- 1 libvirt-qemu kvm   10G Jul 10 21:40 
1a11de23fe75a210b4da631366513cb7c22ef311_10
…

vs

root@s2:~# ls -lhs 
/mnt/instances/instances/_base/1a11de23fe75a210b4da631366513cb7c22ef311*
1.2G -rw-r--r-- 1 nova nova 1.2G Sep  5 16:56 
/mnt/instances/instances/_base/1a11de23fe75a210b4da631366513cb7c22ef311
 10G -rw-r--r-- 1 libvirt-qemu kvm   10G Jul 10 21:40 
/mnt/instances/instances/_base/1a11de23fe75a210b4da631366513cb7c22ef311_10

We have used different ways of copying the files (tar and rsync) and specified 
the sparse options:

# rsync -rtvupogS -h  /var/lib/nova/instances/ /mnt/instances/instances
or
# (cd /var/lib/nova/instances ; tar -Svcf - .)|(cd /mnt/instances/instances ; 
tar Sxpf -)

The OSDs we use for this pool are backed by XFS (which has a problem with 
sparse files, unless one specifies allocation block size options in the mounts) 
http://serverfault.com/questions/406069/why-are-my-xfs-filesystems-suddenly-consuming-more-space-and-full-of-sparse-file,
 
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=055388a3188f56676c21e92962fc366ac8b5cb72.
 We have mounted the XFS partitions for the OSDs with this option, but I assume 
that this shouldn't impact the way CephFS handles sparse files.
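
(For the curious, "this option" is the XFS allocsize mount option - an 
illustrative mount line, not a copy of our fstab; device, path and value are 
examples:)

mount -o inode64,noatime,allocsize=64k /dev/sdX1 /var/lib/ceph/osd/ceph-X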

I seem to remember that copying sparse files worked a couple of months ago 
(ceph-fs kernel 3.5 on btrfs OSDs), but now we use kernel 3.10 and, recently, 
ceph-fuse to mount the CephFS.

Are we doing something wrong, or is this not supported by CephFS?

cheers
jc





-- 
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch

http://www.switch.ch/socialmedia

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sparse files copied to CephFS not sparse

2013-09-16 Thread Jens-Christian Fischer

 For cephfs, the size reported by 'ls -s' is the same as file size. see
 http://ceph.com/docs/next/dev/differences-from-posix/

ah! So if I understand correctly, the files are indeed sparse on CephFS?
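
(A rough way to convince oneself, since ls/du on CephFS won't show it: watch 
the per-pool usage while copying a large sparse file - it should only grow by 
the blocks that were actually written:)

ceph df     # per-pool USED
rados df    # same numbers from the rados tool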

thanks
/jc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sparse files copied to CephFS not sparse

2013-09-16 Thread Jens-Christian Fischer
 
 For cephfs, the size reported by 'ls -s' is the same as file size. see
 http://ceph.com/docs/next/dev/differences-from-posix/
 
 ...but the files are still in fact stored sparsely.  It's just hard to 
 tell.

perfect - thanks!

/jc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Inconsistent view on mounted CephFS

2013-09-13 Thread Jens-Christian Fischer
 
 All servers mount the same filesystem. Needless to say, we are a bit 
 worried…
 
 The bug was introduced in the 3.10 kernel and will be fixed in the 3.12 kernel 
 by commit 590fb51f1c (vfs: call d_op->d_prune() before unhashing dentry). Sage 
 may backport the fix to the 3.11 and 3.10 kernels soon. Please use ceph-fuse 
 at present.

thanks for the answer - will change immediately

cheers
jc

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Inconsistent view on mounted CephFS

2013-09-13 Thread Jens-Christian Fischer
 Just out of curiosity. Why you are using cephfs instead of rbd?

Two reasons:

- we are still on Folsom
- Experience with shared storage as this is something our customers are 
asking for all the time

cheers
jc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] adding SSD only pool to existing ceph cluster

2013-09-04 Thread Jens-Christian Fischer
Hi Greg

 If you saw your existing data migrate that means you changed its
 hierarchy somehow. It sounds like maybe you reorganized your existing
 nodes slightly, and that would certainly do it (although simply adding
 single-node higher levels would not). It's also possible that you
 introduced your SSD devices/hosts in a way that your existing data
 pool rules believed they should make use of them (if, for instance,
 your data pool rule starts out at root and you added your SSDs
 underneath there). What you'll want to do is add a whole new root for
 your SSD nodes, and then make the SSD pool rule (and only that rule)
 start out there.

And that is the problem: The SSDs are in the same physical servers as the SATA 
drives. Adding them to the hosts adds them into the hierarchy. Adding them to 
virtual hosts (a host name that doesn't exist) breaks the startup scripts.

Can I add the SSD OSDs directly to a new root without having them in the host 
hierarchy?

If you have a solution that solves either of these problems, I'm all ears :)
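
(One approach that is sometimes suggested - a sketch, assuming the osd crush 
update on start option is available in this release; weights, IDs and bucket 
names are only examples:)

# ceph.conf: stop the init script from moving OSDs back under their physical
# host on startup
[osd]
    osd crush update on start = false

# then place an SSD OSD explicitly under the separate root / virtual host:
ceph osd crush set osd.36 0.5 root=ssd host=s1-ssd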

cheers
jc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best way to reformat OSD drives?

2013-09-03 Thread Jens-Christian Fischer
 Why wait for the data to migrate away? Normally you have replicas of the 
 whole osd data, so you can simply stop the osd, reformat the disk and restart 
 it again. It'll join the cluster and automatically get all data it's missing. 
 Of course the risk of dataloss is a bit higher during that time, but normally 
 that should be ok, because it's not different from an ordinary disk failure 
 which can happen any time.
 
 I just found a similar question from one year ago: 
 http://www.spinics.net/lists/ceph-devel/msg05915.html I didn't read the whole 
 thread, but probably you can find some other ideas there.
 
 service ceph osd stop $OSD
 mkfs -t xfs /dev/XXX
 ceph-osd -i $OSD --mkfs --mkkey --mkjournal
 service ceph osd start $OSD


this is what I did now:

ceph osd set noout
service ceph stop osd.X
umount /dev/sdX1
mkfs.xfs -f -i size=2048 /dev/sdX1 -L osd.X
vim /etc/fstab # edit line for /dev/sdX1
mount /dev/sdX1
ceph-osd -i X --mkfs --mkkey --mkjournal
ceph auth add osd.X osd 'allow *' mon 'allow rwx' -i 
/var/lib/ceph/osd/ceph-X/keyring
service ceph start osd.X

seems to work so far, the OSD is busy retrieving data - and I didn't have to 
wait for the OSD to become empty.
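
Once the backfill is done and the cluster reports HEALTH_OK again, the flag 
set at the beginning has to be cleared:

ceph -w              # watch the recovery finish
ceph osd unset noout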

cheers
jc___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best way to reformat OSD drives?

2013-09-03 Thread Jens-Christian Fischer

On 03.09.2013, at 16:27, Sage Weil s...@inktank.com wrote:

 ceph osd create # this should give you back the same osd number as the one
 you just removed
 
 OSD=`ceph osd create` # may or may not be the same osd id

good point - so far it has been good to us!

 
 
 umount ${PART}1
 parted $PART rm 1 # remove the old partition
 parted $PART mkpart primary 0% 100%  # create a new one
 
 I don't think the partition removal/add step is needed.

it isn't - I'm still learning the ropes :)


 
 Otherwise it looks fine!

ok - I have tried a simplified version (that doesn't take the OSD out) that 
just simulates a disk failure (i.e. stops the OSD, reformats the drive, 
recreates the OSD structure and starts the process again). This seems to 
work, but is really slow in rebuilding the disk (we see write speeds of 4-20 
MB/s - and it takes ages to refill around 100GB of data).
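
(If the refill speed itself is the bottleneck, the recovery throttles can be 
loosened at the cost of client I/O - a sketch with Dumpling-era option names, 
illustrative values, and a made-up OSD id; repeat per OSD:)

ceph tell osd.12 injectargs '--osd-max-backfills 4 --osd-recovery-max-active 8'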

I don't dare to run this on multiple OSDs at the same time for fear of losing 
data, so the slower/longer process of first marking all OSDs of a server as 
out, waiting for them to empty, and then batch-formatting all OSDs on the 
server and waiting for the cluster to be stable again might be faster in the 
end.

cheers
jc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Best way to reformat OSD drives?

2013-09-02 Thread Jens-Christian Fischer
Hi all

we have a Ceph Cluster with 64 OSD drives in 10 servers. We originally 
formatted the OSDs with btrfs but have had numerous problems (server kernel 
panics) that we could point back to btrfs. We are therefore in the process of 
reformatting our OSDs to XFS. We have a process that works, but I was 
wondering if there is a simpler/faster way.

Currently we 'ceph osd out' all drives of a server and wait for the data to 
migrate away, then delete the OSD, recreate it and start the OSD processes 
again. This takes at least 1-2 days per server (mostly waiting for the data to 
migrate back and forth)

Here's the script we are using:

--- cut ---
#! /bin/bash

OSD=$1
PART=$2
HOST=$3
echo "changing partition ${PART}1 to XFS for OSD: $OSD on host: $HOST"
read -p "continue or CTRL-C"


service ceph -a stop osd.$OSD
ceph osd crush remove osd.$OSD
ceph auth del osd.$OSD
ceph osd rm $OSD
ceph osd create # this should give you back the same osd number as the one you 
just removed

umount ${PART}1
parted $PART rm 1 # remove the old partition
parted $PART mkpart primary 0% 100%  # create a new one
mkfs.xfs -f -i size=2048 ${PART}1 -L osd.$OSD
mount -o inode64,noatime ${PART}1 /var/lib/ceph/osd/ceph-$OSD
ceph-osd -i $OSD --mkfs --mkkey --mkjournal
ceph auth add osd.$OSD osd 'allow *' mon 'allow rwx' -i 
/var/lib/ceph/osd/ceph-${OSD}/keyring
ceph osd crush set $OSD 1 root=default host=$HOST
service ceph -a start osd.$OSD

--- cut ---

cheers
Jens-Christian

-- 
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch

http://www.switch.ch/socialmedia

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best way to reformat OSD drives?

2013-09-02 Thread Jens-Christian Fischer
 
 Why wait for the data to migrate away? Normally you have replicas of the 
 whole osd data, so you can simply stop the osd, reformat the disk and restart 
 it again. It'll join the cluster and automatically get all data it's missing. 
 Of course the risk of dataloss is a bit higher during that time, but normally 
 that should be ok, because it's not different from an ordinary disk failure 
 which can happen any time.

Because I lost 2 objects last time I did that trick (probably caused by 
additional user (i.e. me) stupidity in the first place, but I don't really 
fancy taking chances this time :) )

 
 I just found a similar question from one year ago: 
 http://www.spinics.net/lists/ceph-devel/msg05915.html I didn't read the whole 
 thread, but probably you can find some other ideas there.

I read it, but it is the usual to and fro - no definitive solution...

 
 service ceph osd stop $OSD
 mkfs -t xfs /dev/XXX
 ceph-osd -i $OSD --mkfs --mkkey --mkjournal
 service ceph osd start $OSD

I'll give that a whirl - I have enough OSDs to try - as soon as the cluster has 
recovered from the 9 disks I formatted on Saturday.

cheers
jc

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best way to reformat OSD drives?

2013-09-02 Thread Jens-Christian Fischer
Hi Martin

 On 2013-09-02 19:37, Jens-Christian Fischer wrote:
 we have a Ceph Cluster with 64 OSD drives in 10 servers. We originally 
 formatted the OSDs with btrfs but have had numerous problems (server kernel 
 panics) that we could point back to btrfs. We are therefore in the process 
 of reformatting our OSDs to XFS. We have a process that works, but I was 
 wondering, if there is a simpler / faster way.
 
 Currently we 'ceph osd out' all drives of a server and wait for the data to 
 migrate away, then delete the OSD, recreate it and start the OSD processes 
 again. This takes at least 1-2 days per server (mostly waiting for the data 
 to migrate back and forth)
 
 
 The first thing I'd try is doing one osd at a time, rather than the entire 
 server; in theory, this should allow for (as opposed to definitely make it 
 happen) data to move from one osd to the other, rather than having to push it 
 across the network from other nodes.

Isn't that depending on the CRUSH map and some rules?

 
 depending on just how much data you have on an individual osd, you could stop 
 two, blow the first away, copy the data from osd 2 to the disk osd 1 was 
 using, change the mount-points, then bring osd 2 back up again; in theory, 
 osd 2 will only need to resync changes that have occurred while it was 
 offline. This, of course, presumes that there's no change in the on-disk 
 layout between btrfs and xfs...

We were actually thinking of doing that, but I wanted to hear the wisdom of the 
crowd… The thread from a year ago (that I just read) cautioned against that 
procedure though. 

cheers
jc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] adding SSD only pool to existing ceph cluster

2013-09-02 Thread Jens-Christian Fischer
We have a ceph cluster with 64 OSD (3 TB SATA) disks on 10 servers, and run an 
OpenStack cluster.

We are planning to move the images of the running VM instances from the 
physical machines to CephFS. Our plan is to add 10 SSDs (one in each server) 
and create a pool that is backed only by these SSDs and mount that pool in a 
specific location in CephFS.

References perused:

http://www.sebastien-han.fr/blog/2012/12/07/ceph-2-speed-storage-with-crush/
http://ceph.com/docs/master/rados/operations/crush-map/#placing-different-pools-on-different-osds

The difference between Sebastien's approach and the Ceph documentation is that 
Sebastien has mixed SAS/SSD servers, while the Ceph documentation assumes 
servers with only one type of disk.

We have tried to replicate both approaches by manually editing the CRUSH map 
like so:

Option 1)

Create new virtual SSD-only servers in the CRUSH map (where we have an h0 
physical server, we'd add an h0-ssd for its SSD), together with a parallel 
server/rack/datacenter/root hierarchy:

--- cut ---
host s1-ssd {
id -15  # do not change unnecessarily
# weight 0.500
alg straw
hash 0  # rjenkins1
item osd.36 weight 0.500
}
…

rack cla-r71-ssd {
id -24  # do not change unnecessarily
# weight 2.500
alg straw
hash 0  # rjenkins1
item s0-ssd weight 0.000
item s1-ssd weight 0.500
[…]
item h5-ssd weight 0.000
}
root ssd {
id -25  # do not change unnecessarily
# weight 2.500
alg straw
hash 0  # rjenkins1
item cla-r71-ssd weight 2.500
}

rule ssd {
ruleset 3
type replicated
min_size 1
max_size 10
step take ssd
step chooseleaf firstn 0 type host
step emit
}

--- cut ---

Option 2)
Create two pools (SATA and SSD) and list all SSDs manually in them

--- cut ---
pool ssd {
id -14  # do not change unnecessarily
# weight 2.500
alg straw
hash 0  # rjenkins1
item osd.36 weight 0.500
item osd.65 weight 0.500
item osd.66 weight 0.500
item osd.67 weight 0.500
item osd.68 weight 0.500
item osd.69 weight 0.500
}

--- cut ---


We extracted the CRUSH map, decompiled, changed, compiled and injected it. 
Neither attempt really seemed to work (™), as we saw the cluster go into 
reshuffling mode immediately in both cases (probably due to the changed 
OSD -> Host -> Rack -> Root layout).

We reverted to the original CRUSH map and the cluster has been quiet since then.

Now the question: What is the best way to handle our use case?

Add 10 SSD drives and create a separate pool with them, without upsetting the 
current pools (we don't want the regular/existing data to migrate towards the 
SSD pool) and without disruption of service.
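
(For concreteness, the separate-root variant can also be built up with CLI 
calls instead of hand-editing the decompiled map - a sketch only; names, PG 
counts and the ruleset id are examples, and older releases may not have the 
rule create-simple command:)

ceph osd crush add-bucket ssd root                 # separate root for the SSDs
ceph osd crush add-bucket cla-r71-ssd rack
ceph osd crush move cla-r71-ssd root=ssd
# per-host SSD buckets and the SSD OSDs go underneath, then:
ceph osd crush rule create-simple ssd-rule ssd host
ceph osd pool create ssd-pool 512 512
ceph osd pool set ssd-pool crush_ruleset <ruleset id of ssd-rule>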

thanks
Jens-Christian
 
-- 
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch

http://www.switch.ch/socialmedia

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] one pg stuck with 2 unfound pieces

2013-08-13 Thread Jens-Christian Fischer
 that it is still in querying state on osd 9 and 18). I have restarted 
the OSDs, but I can't force any other status change.

What next? Take the OSDs (9, 18) out again and rebuilding?

thanks for your help
Jens-Christian


-- 
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch

http://www.switch.ch/socialmedia

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] XFS or btrfs for production systems with modern Kernel?

2013-08-02 Thread Jens-Christian Fischer
On 07.06.2013, at 16:57, Stefan Priebe s.pri...@profihost.ag wrote:

 Am 07.06.2013 16:31, schrieb Sage Weil:
 On Fri, 7 Jun 2013, Oliver Schulz wrote:
 
 Btrfs is the longer-term plan, but we haven't done as much testing there
 yet, and in particular, there is a bug in 3.9 that is triggered by a
 power-cycle and the fixes aren't yet backported to 3.9 stable.  Until we
 have done more validation, we still recommend XFS.
 
 The last time we did aging tests on btrfs performance was very good
 (better than xfs) initially but then trailed off as things fragmented.
 This was ~3.2 era.  We haven't repeated that yet for newer kernels.  I
 suspect it is better now, but I don't know how much better...


Just another data point - we have 10 servers with 64 OSDs, all on btrfs. 
Initially we started out with Ubuntu 12.10 servers, but experienced 
btrfs-related kernel panics and have migrated the offending servers to 13.04. 
Yesterday one of these machines locked up with btrfs issues (that weren't 
easily diagnosed).

I have now started migrating our OSDs to XFS … (taking them out, making a new 
filesystem on the drive, putting them back into the cluster again).

cheers
jc


-- 
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch

http://www.switch.ch/socialmedia

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com