On Fri, 2016-03-25 at 08:00 -0600, Alex Schultz wrote:
> On Fri, Mar 25, 2016 at 7:32 AM, Dmitry Guryanov
> <dguryanov@mirantis.com> wrote:
> > Here is the bug which I'm trying to fix -
> > https://bugs.launchpad.net/fuel/+bug/1538587.
> >
> > In VMs (set up with fuel-virtualbox) a kernel panic occurs every
> > time you delete a node; the stack trace shows an error in the ext4
> > driver [1], the same as in the bug.
> >
> > Here is a patch - https://review.openstack.org/297669 . I've
> > checked it with VirtualBox VMs and it works fine.
> >
> > I also propose not rebooting nodes on kernel panic, so that we
> > catch possible errors, but maybe that's too dangerous before the
> > release.
>
> The panic is in there to prevent controllers from staying active with
> a bad disk. If the file system on a controller goes RO, the node
> stays in the cluster and causes errors with the OpenStack
> deployment. The node erase code tries to disable this prior to
> erasing the disk, so if that's not working we need to fix it, not
> remove it.
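For reference, the safeguard Alex describes is ext4's errors=panic behaviour, and disabling it before an erase looks roughly like this. This is only a sketch against a throwaway image file: the file name is made up, and the exact commands the fuel-astute erase code runs are an assumption, not taken from that code.

```shell
# Build a small ext4 image to stand in for a controller's disk
# (in production this would be a real device, e.g. /dev/dm-0).
dd if=/dev/zero of=/tmp/ctl.img bs=1M count=4 2>/dev/null
mkfs.ext4 -F -q /tmp/ctl.img

# errors=panic: any ext4 error triggers an immediate kernel panic,
# which is what keeps a controller with a bad disk out of the cluster.
tune2fs -e panic /tmp/ctl.img >/dev/null

# Before wiping the disk, switch the behaviour back so I/O errors
# hit during the wipe don't panic the node.
tune2fs -e remount-ro /tmp/ctl.img >/dev/null
```

If this switch doesn't happen (or doesn't stick) before the erase, the panic in the bug report is exactly what you'd expect to see.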
There will be no filesystem errors caused by erasing disks with my
patch. The node will stay fully operable until reboot.

> Thanks,
> -Alex
>
> > [1]
> > [13607.545119] EXT4-fs error (device dm-0) in ext4_reserve_inode_write:4928: IO failure
> > [13608.157968] EXT4-fs error (device dm-0) in ext4_reserve_inode_write:4928: IO failure
> > [13608.780695] EXT4-fs error (device dm-0) in ext4_reserve_inode_write:4928: IO failure
> > [13609.471245] Aborting journal on device dm-0-8.
> > [13609.478549] EXT4-fs error (device dm-0) in ext4_dirty_inode:5047: IO failure
> > [13610.069244] EXT4-fs error (device dm-0) in ext4_dirty_inode:5047: IO failure
> > [13610.698915] Kernel panic - not syncing: EXT4-fs (device dm-0): panic forced after error
> > [13611.060673] CPU: 0 PID: 8676 Comm: systemd-udevd Not tainted 3.13.0-83-generic #127-Ubuntu
> > [13611.236566] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
> > [13611.887198] 00000000fffffffb ffff88003b6e9a08 ffffffff81725992 ffffffff81a77878
> > [13612.527154] ffff88003b6e9a80 ffffffff8171e80b ffffffff00000010 ffff88003b6e9a90
> > [13613.037061] ffff88003b6e9a30 ffff88003b6e9a50 ffff8800367f2ad0 0000000000000040
> > [13613.717119] Call Trace:
> > [13613.927162] [<ffffffff81725992>] dump_stack+0x45/0x56
> > [13614.306858] [<ffffffff8171e80b>] panic+0xc8/0x1e1
> > [13614.767154] [<ffffffff8125e7c6>] ext4_handle_error.part.187+0xa6/0xb0
> > [13615.187201] [<ffffffff8125eddb>] __ext4_std_error+0x7b/0x100
> > [13615.627960] [<ffffffff81244c64>] ext4_reserve_inode_write+0x44/0xa0
> > [13616.007943] [<ffffffff81247f80>] ? ext4_dirty_inode+0x40/0x60
> > [13616.448084] [<ffffffff81244d04>] ext4_mark_inode_dirty+0x44/0x1f0
> > [13616.917611] [<ffffffff8126f7f9>] ? __ext4_journal_start_sb+0x69/0xe0
> > [13617.367730] [<ffffffff81247f80>] ext4_dirty_inode+0x40/0x60
> > [13617.747567] [<ffffffff811e858a>] __mark_inode_dirty+0x10a/0x2d0
> > [13618.088060] [<ffffffff811d94e1>] update_time+0x81/0xd0
> > [13618.467965] [<ffffffff811d96f0>] file_update_time+0x80/0xd0
> > [13618.977649] [<ffffffff811511f0>] __generic_file_aio_write+0x180/0x3d0
> > [13619.467993] [<ffffffff81151498>] generic_file_aio_write+0x58/0xa0
> > [13619.978080] [<ffffffff8123c712>] ext4_file_write+0xa2/0x3f0
> > [13620.467624] [<ffffffff81158066>] ? free_hot_cold_page_list+0x46/0xa0
> > [13621.038045] [<ffffffff8115d400>] ? release_pages+0x80/0x210
> > [13621.408080] [<ffffffff811bdf5a>] do_sync_write+0x5a/0x90
> > [13621.818155] [<ffffffff810e52f6>] do_acct_process+0x4e6/0x5c0
> > [13622.278005] [<ffffffff810e5a91>] acct_process+0x71/0xa0
> > [13622.597617] [<ffffffff8106a3cf>] do_exit+0x80f/0xa50
> > [13622.968015] [<ffffffff811c041e>] ? ____fput+0xe/0x10
> > [13623.337738] [<ffffffff8106a68f>] do_group_exit+0x3f/0xa0
> > [13623.738020] [<ffffffff8106a704>] SyS_exit_group+0x14/0x20
> > [13624.137447] [<ffffffff8173659d>] system_call_fastpath+0x1a/0x1f
> > [13624.518044] Rebooting in 10 seconds..
> >
> > On Tue, Mar 22, 2016 at 1:07 PM, Dmitry Guryanov
> > <dguryanov@mirantis.com> wrote:
> > > Hello,
> > >
> > > Here is the start of the discussion -
> > > http://lists.openstack.org/pipermail/openstack-dev/2015-December/083021.html .
> > > I subscribed to this mailing list later, so I couldn't reply in
> > > that thread.
> > >
> > > Currently we clear a node's disks in two places: first before the
> > > reboot into the bootstrap image [0], and second just before
> > > provisioning, in fuel-agent [1].
> > >
> > > There are two problems which erasing the first megabyte of disk
> > > data is meant to solve: the node should not boot from HDD after
> > > reboot, and the new partitioning scheme should overwrite the
> > > previous one.
> > > The first problem can be solved by zeroing the first 512 bytes of
> > > each disk (not partition) - 446 bytes, to be precise, because the
> > > last 66 bytes hold the partition table and boot signature, see
> > > https://wiki.archlinux.org/index.php/Master_Boot_Record .
> > >
> > > The second problem should be solved only after the reboot into
> > > bootstrap. If we bring a new node into the cluster from somewhere
> > > else and boot it with the bootstrap image, it may have disks with
> > > existing partitions, md devices and LVM volumes. All of these
> > > should be correctly cleared before provisioning, not before
> > > reboot, and fuel-agent does that in [1].
> > >
> > > I propose to remove erasing the first 1M of each partition,
> > > because it can lead to errors in the kernel's FS drivers and a
> > > kernel panic. The existing workaround - rebooting on kernel panic
> > > - is bad because the panic may occur just after clearing the
> > > first partition of the first disk; after the reboot the BIOS will
> > > read the MBR of the second disk and boot from it instead of from
> > > the network. Let's just clear the first 446 bytes of each disk.
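The 446-byte clear proposed above can be sketched with dd. Here /tmp/fakedisk is a stand-in for a real device such as /dev/sda (an assumption for illustration); conv=notrunc is the important part, since it overwrites in place without truncating the target.

```shell
# Simulate a 1 MiB disk with non-zero content.
dd if=/dev/urandom of=/tmp/fakedisk bs=1M count=1 2>/dev/null

# Zero only the MBR boot code area (the first 446 bytes). conv=notrunc
# overwrites in place, leaving the 64-byte partition table, the 2-byte
# boot signature and everything after them untouched.
dd if=/dev/zero of=/tmp/fakedisk bs=446 count=1 conv=notrunc 2>/dev/null
```

Because the partition table survives, an existing partitioning scheme on the disk stays readable; only the BIOS boot code is gone, which is exactly what forces the node to fall through to network boot.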
> > > [0] https://github.com/openstack/fuel-astute/blob/master/mcagents/erase_node.rb#L162-L174
> > > [1] https://github.com/openstack/fuel-agent/blob/master/fuel_agent/manager.py#L194-L221
> > >
> > > --
> > > Dmitry Guryanov
> >
> > --
> > Dmitry Guryanov

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
