On Thu, May 26, 2016 at 8:09 PM, Scott Talbert <scott.talb...@hgst.com> wrote:
> On Wed, 11 May 2016, Filipe Manana wrote:
>
>>> Some time ago I noticed that our device replace implementation is unreliable. Basically, in several situations it ends up not copying extents (both data and metadata) from the old device to the new device (I briefly talked about some of the problems with Josef at LSF).
>>>
>>> Essentially it operates by iterating over all the device extents of the old device using the device tree's commit root and, for each device extent found, copying it directly into the new device (it does not use the regular write paths). While it's doing this, no extents can be allocated from the new device; that only becomes possible once it finishes copying all device extents and replaces the old device with the new device in the list of devices available for allocations.
>>>
>>> There are 4 main flaws with the implementation:
>>>
>>> 1) It processes each device extent only once. Clearly, after it processes (copies) a device extent, it's possible for future data and metadata writes to allocate extents from the corresponding block group, so any new extents are not copied into the new device.
>>>
>>> 2) Before it copies a device extent, it sets the corresponding block group to RO mode (as of commit [1], which sadly does not really fix an issue nor explain how the problem it claims to fix happens) to prevent new allocations from the block group while we are processing it, and once we finish processing it, it sets the block group back to RW mode. This is flawed too: when it sets the block group to RO mode, some concurrent task might have just allocated an extent from the block group and end up writing to it while or after we process the device extent.
>>>
>>> 3) Since it iterates over the device tree in an ascending fashion, it never considers the possibility of new device extents, with an offset lower than the current search offset, being allocated for new block groups (just take a look at the while loop in scrub_enumerate_chunks() to see how flawed it is). So it totally ignores that device holes can be reused while we are doing the replace, and it never looks back to check for any such extents and copy them into the new device. This is much more likely to happen nowadays since empty block groups are automatically removed by the cleaner kthread, while in the past this could only happen through balance operations triggered from userspace.
>>>
>>> 4) When we finish iterating "all" device extents, we call btrfs_dev_replace_finishing(), which flushes all delalloc, commits the current transaction, replaces the old device in all extent mappings with the new device, removes the old device from the list of devices available for allocation and adds the new device to that list. So when flushing delalloc, more extents are allocated from existing block groups and new block groups might be allocated, with all the corresponding bios writing to the old device and not the new one, and after this flushing we surely don't copy all new or changed extents to the new device.
>>>
>>> These are the 4 major issues I see with device replace. Issue number 2 is trivial to solve, but it's pointless to do so without addressing the other issues.
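To make issue 3 above a bit more concrete, here's a toy user-space model of a forward-only scan. It is not btrfs code at all; the slot array, the "cursor == 3" event and the copy function are invented just to show the shape of the problem:

/*
 * Toy user-space model of the forward-only device extent scan
 * (very loosely shaped after the loop in scrub_enumerate_chunks()).
 * None of this is real btrfs code: the slot array, the offsets and
 * the "copy" are invented purely to illustrate the problem.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_SLOTS 8

/* true when the old device has a device extent at that slot */
static bool dev_extent_present[NR_SLOTS] = {
        false, true, false, true, false, false, true, false,
};

static void copy_dev_extent_to_new_device(int slot)
{
        printf("copied device extent at slot %d\n", slot);
}

int main(void)
{
        int cursor;

        /* The scan only ever moves forward through the device address space. */
        for (cursor = 0; cursor < NR_SLOTS; cursor++) {
                if (!dev_extent_present[cursor])
                        continue;
                copy_dev_extent_to_new_device(cursor);

                /*
                 * Meanwhile the cleaner kthread removes an empty block
                 * group ahead of us (slot 6) and a new block group
                 * reuses the hole at slot 2, already behind the cursor.
                 */
                if (cursor == 3) {
                        dev_extent_present[6] = false;
                        dev_extent_present[2] = true;
                }
        }

        /* The extent that appeared behind the cursor was never copied. */
        if (dev_extent_present[2])
                printf("slot 2 has an extent that was never copied\n");
        return 0;
}

Run it and the extent that shows up behind the cursor is never copied, which is exactly the hole-reuse case the cleaner kthread makes so much more likely.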
>>> Solving this isn't easy, not without some trade-offs at least. We could keep in-memory structures to track which new block groups are allocated while we are doing the replace and which new extents are allocated from existing block groups. Besides the needed synchronization between the allocator and the device replace code (which would bring yet more complexity), this can imply using a lot of memory if there's a significant number of writes happening all the time, and we could end up in an endless copy loop. Maybe some sort of write throttling would work out. There's also a bit of a chicken-and-egg problem, as while the replace operation is ongoing we update a progress item in the devices tree (btrfs_run_dev_replace()), which in turn results in allocating new metadata extents.
>>>
>>> Anyway, I'm just hoping someone tells me that I'm completely wrong and misunderstood the current implementation we have. If not, and given how serious this is (affecting both data and metadata, and defeating the main purpose of raid), shouldn't we disable this feature, like we did for snapshot-aware defrag a couple of years ago, until we have these issues solved?
>>
>> On IRC, Stefan Behrens just pointed out to me that for locations of the old device behind the current left cursor, we make all writes go to both the old and new devices (volumes.c:__btrfs_map_block()). This solves problems 1 and 3, which are really the hard problems. The remaining ones are easy to fix and I'll address them soon.
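In case it helps anyone following the thread, the idea behind that dual write can be sketched roughly like this (this is only an illustration, not the actual __btrfs_map_block() code; the replace_state struct, its fields and the offsets are made up):

/*
 * Rough sketch of the idea only; not the real __btrfs_map_block()
 * code.  Writes that land behind the replace cursor on the old
 * device get duplicated to the new device as well.
 */
#include <stdio.h>

typedef unsigned long long u64;

struct replace_state {
        int running;
        u64 cursor_left;   /* everything below this was already copied */
};

static void map_write(const struct replace_state *rs, u64 physical)
{
        if (rs->running && physical < rs->cursor_left)
                printf("write at %llu -> old device AND new device\n", physical);
        else
                printf("write at %llu -> old device only (copied later by the scan)\n", physical);
}

int main(void)
{
        struct replace_state rs = { .running = 1, .cursor_left = 64 * 1024 * 1024 };

        map_write(&rs, 4096);                 /* behind the cursor: duplicated */
        map_write(&rs, 128ULL * 1024 * 1024); /* ahead of the cursor: not yet */
        return 0;
}

With that, anything written behind the left cursor ends up on the new device even though the scan already went past it, which is why problems 1 and 3 stop being an issue.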
>
> Hi Filipe,

Hi Scott,

> Does your recent patch set (from May 20) address all of these issues?

Yes.

> Scott

--
Filipe David Manana,

"Reasonable men adapt themselves to the world.
Unreasonable men adapt the world to themselves.
That's why all progress depends on unreasonable men."