The recovery process is only triggered after 20 minutes; these tests are done
before that.
I don’t see any traffic or increased load on any of the remaining nodes.

Here is my ceph.conf:

[global]
fsid = XXXX
mon_initial_members = mon001, mon002, mon003
mon_host = 10.XXX,10.YYYY,10.ZZZZ
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = 10.XXXX/24
cluster_network = 172.XXX/24
max_open_files = 131072
osd_pool_default_pg_num = 128
osd_pool_default_pgp_num = 128
osd_pool_default_size = 2
osd_pool_default_min_size = 1
osd_pool_default_crush_rule = 0
mon_osd_down_out_interval = 1200
mon_osd_min_down_reporters = 4
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcacher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0

[osd]
osd_mkfs_type = xfs
osd_mkfs_options_xfs = -f -i size=2048
osd_mount_options_xfs = noatime,largeio,inode64,swalloc
osd_journal_size = 4096
osd_mon_heartbeat_interval = 30
filestore_merge_threshold = 40
filestore_split_multiple = 8
osd_op_threads = 8
filestore_op_threads = 8
filestore_max_sync_interval = 5
osd_max_scrubs = 1
osd_recovery_max_active = 5
osd_max_backfills = 2
osd_recovery_op_priority = 2
osd_recovery_max_chunk = 1048576
osd_recovery_threads = 1
osd_objectstore = filestore
osd_crush_update_on_start = true

root@srv003:~# ceph osd pool ls detail
pool 11 'libvirt-pool' replicated size 2 min_size 1 crush_ruleset 0 object_hash 
rjenkins pg_num 16000 pgp_num 16000 last_change 14544 flags hashpspool 
stripe_width 0
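As a side note, pg_num 16000 is not a power of two, which Ceph's sizing guidance recommends. A minimal sketch of the usual rule of thumb (total PGs ≈ OSDs × 100 / replica size, rounded up to a power of two), using the counts from this thread (3 racks × 33 nodes × 3 OSDs, size 2); this is my own illustration, not something from the thread:

```python
def suggested_pg_num(num_osds: int, replica_size: int, pgs_per_osd: int = 100) -> int:
    """Target PG count per the common rule of thumb, rounded up to a power of two."""
    target = num_osds * pgs_per_osd // replica_size
    power = 1
    while power < target:
        power *= 2
    return power

osds = 3 * 33 * 3  # racks * nodes per rack * OSDs per node = 297
print(suggested_pg_num(osds, 2))  # -> 16384
```

So the cluster's 16000 is close to the rule-of-thumb value but just below the nearest power of two.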



From: Saverio Proto [mailto:[email protected]]
Sent: Wednesday, 24 June 2015 15:22
To: Romero Junior
Cc: [email protected]
Subject: Re: [ceph-users] Unexpected issues with simulated 'rack' outage

You don't have to wait, but the recovery process will be very heavy and it will
have an impact on performance. The impact could be catastrophic, as you are
experiencing.
After removing one rack, the CRUSH algorithm will run again on the available
resources and map the PGs to the available OSDs. You lost 33% of your OSDs, so
it will be a big change.
This means that you will not only have to re-create copies for the OSDs that
are out of your cluster, but also move around a lot of objects that are now
misplaced.
It would also be nice to see your crushmap, because you are not using the
default. A conceptual bug in the crushmap could leave the cluster in a degraded
state forever. For example, a crushmap that places copies only on different
racks, combined with a pool that wants 3 copies while only 2 racks are
available, is one possible conceptual bug.
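As an illustration of such a rule (my own sketch, not this cluster's actual map), a rack-based replicated rule in a decompiled crushmap looks like the fragment below. With "step chooseleaf firstn 0 type rack", a size-3 pool can never place all three copies while only 2 racks are up, which is exactly the conceptual bug described above:

```
rule rack_replicated {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type rack
    step emit
}
```

The map can be inspected with ceph osd getcrushmap and decompiled with crushtool -d.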

Saverio




2015-06-24 15:11 GMT+02:00 Romero Junior <[email protected]>:
If I have a replica of each object on the other racks why should I have to wait 
for any recovery time? The failure should not impact my virtual machines.

From: Saverio Proto [mailto:[email protected]]
Sent: Wednesday, 24 June 2015 14:54
To: Romero Junior
Cc: [email protected]
Subject: Re: [ceph-users] Unexpected issues with simulated 'rack' outage

Hello Romero,
I am still a beginner with Ceph, but as far as I understand, Ceph is not
designed to lose 33% of the cluster at once and recover rapidly. What I
understand is that you are losing 33% of the cluster by losing 1 rack out of 3.
It will take a very long time to recover before you have HEALTH_OK status.
Can you check with ceph -w how long it takes for Ceph to converge to a healthy
cluster after you switch off the switch in Rack-A?

Saverio

2015-06-24 14:44 GMT+02:00 Romero Junior <[email protected]>:
Hi,

We are setting up a test environment using Ceph as the main storage solution
for our QEMU-KVM virtualization platform, and everything works fine except for
the following:

When I simulate a failure by powering off the switches on one of our three 
racks my virtual machines get into a weird state, the illustration might help 
you to fully understand what is going on: http://i.imgur.com/clBApzK.jpg

The PGs are distributed based on racks; we are not using the default CRUSH rules.

The number of PGs is the following:

root@srv003:~# ceph osd pool ls detail
pool 11 'libvirt-pool' replicated size 2 min_size 1 crush_ruleset 0 object_hash 
rjenkins pg_num 16000 pgp_num 16000 last_change 14544 flags hashpspool 
stripe_width 0

QEMU talks directly to Ceph through librbd; the disk is configured as follows:

    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <auth username='libvirt'>
        <secret type='ceph' uuid='0d32bxxxyyyzzz47073a965'/>
      </auth>
      <source protocol='rbd' name='libvirt-pool/ceph-vm-automated'>
        <host name='10.XX.YY.1' port='6789'/>
        <host name='10.XX.YY.2' port='6789'/>
        <host name='10.XX.YY.2' port='6789'/>
      </source>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk25'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' 
function='0x0'/>
    </disk>


As mentioned, it's not a real read-only state: I can "touch" files and even log
in on the affected virtual machines (by the way, all of them are affected).
However, a simple 'dd' (count=10 bs=1MB conv=fdatasync) hangs forever. If a
3 GB file download starts (via wget/curl), it usually stalls after the first
few hundred megabytes and resumes as soon as I power on the "failed" rack.
Everything goes back to normal as soon as the rack is powered on again.

For reference, each rack contains 33 nodes, and each node contains 3 OSDs
(1.5 TB each).

On the virtual machine, after recovering the rack, I can see the following 
messages on /var/log/kern.log:

[163800.444146] INFO: task jbd2/vda1-8:135 blocked for more than 120 seconds.
[163800.444260]       Not tainted 3.13.0-55-generic #94-Ubuntu
[163800.444295] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
this message.
[163800.444346] jbd2/vda1-8     D ffff88007fd13180     0   135      2 0x00000000
[163800.444354]  ffff880036d3bbd8 0000000000000046 ffff880036a4b000 
ffff880036d3bfd8
[163800.444386]  0000000000013180 0000000000013180 ffff880036a4b000 
ffff88007fd13a18
[163800.444390]  ffff88007ffc69d0 0000000000000002 ffffffff811efa80 
ffff880036d3bc50
[163800.444396] Call Trace:
[163800.444420]  [<ffffffff811efa80>] ? generic_block_bmap+0x50/0x50
[163800.444426]  [<ffffffff817279bd>] io_schedule+0x9d/0x140
[163800.444432]  [<ffffffff811efa8e>] sleep_on_buffer+0xe/0x20
[163800.444437]  [<ffffffff81727e42>] __wait_on_bit+0x62/0x90
[163800.444442]  [<ffffffff811efa80>] ? generic_block_bmap+0x50/0x50
[163800.444447]  [<ffffffff81727ee7>] out_of_line_wait_on_bit+0x77/0x90
[163800.444455]  [<ffffffff810ab300>] ? autoremove_wake_function+0x40/0x40
[163800.444461]  [<ffffffff811f0dba>] __wait_on_buffer+0x2a/0x30
[163800.444470]  [<ffffffff8128be4d>] 
jbd2_journal_commit_transaction+0x185d/0x1ab0
[163800.444477]  [<ffffffff8107562f>] ? try_to_del_timer_sync+0x4f/0x70
[163800.444484]  [<ffffffff8129017d>] kjournald2+0xbd/0x250
[163800.444490]  [<ffffffff810ab2c0>] ? prepare_to_wait_event+0x100/0x100
[163800.444496]  [<ffffffff812900c0>] ? commit_timeout+0x10/0x10
[163800.444502]  [<ffffffff8108b702>] kthread+0xd2/0xf0
[163800.444507]  [<ffffffff8108b630>] ? kthread_create_on_node+0x1c0/0x1c0
[163800.444513]  [<ffffffff81733ca8>] ret_from_fork+0x58/0x90
[163800.444517]  [<ffffffff8108b630>] ? kthread_create_on_node+0x1c0/0x1c0

A few theories for this behavior were mentioned on #Ceph (OFTC):

[14:09] <Be-El> RomeroJnr: i think the problem is the fact that you write to 
parts of the rbd that have not been accessed before
[14:09] <Be-El> RomeroJnr: ceph does thin provisioning; each rbd is striped 
into chunks of 4 mb. each stripe is put into one pgs
[14:10] <Be-El> RomeroJnr: if you access formerly unaccessed parts of the rbd, 
a new stripe is created. and this probably fails if one of the racks is down
[14:10] <Be-El> RomeroJnr: but that's just a theory...maybe some developer can 
comment on this later
[14:21] <Be-El> smerz: creating an object in a pg might be different than 
writing to an object
[14:21] <Be-El> smerz: with one rack down ceph cannot satisfy the pg 
requirements in RomeroJnr's case
[14:22] <smerz> i can only agree with you. that i would expect other behaviour
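A minimal sketch of the striping arithmetic behind Be-El's theory (assuming the default RBD object size of 4 MiB; this example is mine, not from the thread): each byte offset of an RBD image maps to one backing RADOS object, so a write to a previously untouched 4 MiB region must create a new object, and that creation may block if the object's PG cannot be served with a rack down.

```python
OBJECT_SIZE = 4 * 2**20  # default RBD object size: 4 MiB (order 22)

def object_index(offset: int) -> int:
    """Index of the RADOS object backing the given byte offset of an RBD image."""
    return offset // OBJECT_SIZE

# The 10 MB dd from above (count=10 bs=1MB) spans objects 0..2; if any of
# those objects' PGs cannot accept writes, the dd blocks.
print([object_index(mb * 1_000_000) for mb in (0, 5, 10)])  # -> [0, 1, 2]
```

The actual object size of the image can be confirmed with rbd info (the "order" field).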

The question is: is this behavior indeed expected?

Kind regards,
Romero Junior
Hosting Engineer
LeaseWeb Global Services B.V.

T: +31 20 316 0230
M: +31 6 2115 9310
E: [email protected]
W: www.leaseweb.com


Luttenbergweg 8,

1101 EC Amsterdam,

Netherlands


LeaseWeb is the brand name under which the various independent LeaseWeb 
companies operate. Each company is a separate and distinct entity that provides 
services in a particular geographic area. LeaseWeb Global Services B.V. does 
not provide third-party services. Please see 
www.leaseweb.com/en/legal for more information.









_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



