[ https://issues.apache.org/jira/browse/KUDU-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Grant Henke updated KUDU-2628:
------------------------------
Labels: roadmap-candidate (was: )
> Allow hot-swapping of failed data directories
> ---------------------------------------------
>
> Key: KUDU-2628
> URL: https://issues.apache.org/jira/browse/KUDU-2628
> Project: Kudu
> Issue Type: Improvement
> Components: fs
> Reporter: Andrew Wong
> Priority: Major
> Labels: roadmap-candidate
>
> Currently, if a disk fails at runtime, its data directory is marked "failed"
> in memory, and all tablets with data in that directory are marked "failed"
> and re-replicated elsewhere. The server keeps running in that state until
> the Kudu admins want to fix it. At that point they need to bring down the
> server, replace the bad disk, run the `kudu fs update_dirs` tool to adopt the
> new disk and forget the old data directory, and then start the server again.
> This process can be slow, particularly because server startup time grows
> with the amount of data on the server.
> As implemented today, no IO should be going to a failed data directory
> anyway, so the next logical extension of this would be to allow users to
> replace such failed directories while the server is up, removing the old data
> dir in memory, adding a new one with a new UUID, and updating the path
> instance metadata files on disk to reflect the change.
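A minimal sketch of the in-memory half of the replacement described above,
written in Python for brevity rather than Kudu's C++. `DataDir`, `DirSet`, and
`replace_failed` are hypothetical names for illustration, not Kudu APIs. The
sketch assumes the replacement disk is mounted at the same path as the failed
one, and it refuses to touch a directory that is not marked failed.

```python
import uuid
from dataclasses import dataclass
from typing import List


@dataclass
class DataDir:
    """Hypothetical in-memory record for one data directory."""
    path: str
    uuid: str
    failed: bool = False


@dataclass
class DirSet:
    """Hypothetical in-memory view of the server's data directories."""
    dirs: List[DataDir]

    def replace_failed(self, path: str) -> DataDir:
        """Swap the failed directory at `path` for a fresh one with a new UUID."""
        for i, d in enumerate(self.dirs):
            if d.path == path:
                if not d.failed:
                    # Assumption: a healthy directory is never swapped.
                    raise ValueError(f"{path} is not marked failed")
                fresh = DataDir(path=path, uuid=uuid.uuid4().hex)
                # Keep the slot position so the directory order is unchanged.
                self.dirs[i] = fresh
                return fresh
        raise KeyError(path)

    def all_uuids(self) -> List[str]:
        """UUIDs that the on-disk metadata would need to list after the swap."""
        return [d.uuid for d in self.dirs]
```

The old UUID disappears from `all_uuids()` as soon as the swap returns, so any
on-disk metadata that lists the directory set would be stale until it is
rewritten.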
> A few considerations should be taken into account in implementing this:
> # Should we support removals of directories (vs replacements)? My feeling is
> no, since this would break further usages of the same `fs_data_dirs`
> configurations.
> # Writing the path instance metadata files (PIMFs) may be challenging. These
> files would need to be rewritten to adopt a new, empty disk. Writing to
> multiple files across multiple disks may be messy, and doing this while the
> server is online only exacerbates the problem. A reasonable amount of error
> handling should be scoped out.
> # We should be careful in picking when a hot-swap is actually viable. E.g.
> if a hot-swap is requested and the data dir isn't bad, we shouldn't do
> anything. Alternatively, a user may want to request a forced "failing" of a
> data directory in preparation for a swap.
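One possible shape for the multi-file metadata rewrite raised in the second
consideration is to stage every new file first and publish only once all of
them have been written. The sketch below is in Python and is illustrative
only: the file name `instance.json`, the JSON layout, and the function
`rewrite_instance_files` are assumptions made for this example, not Kudu's
actual on-disk format or API. It also assumes each file records its own UUID
plus the UUIDs of the whole directory set.

```python
import json
import os
import tempfile
from typing import Dict, List, Tuple

INSTANCE_FILE = "instance.json"  # hypothetical name, not Kudu's real file


def rewrite_instance_files(dir_uuids: Dict[str, str]) -> None:
    """Rewrite the metadata file in every directory of `dir_uuids`.

    `dir_uuids` maps each usable data directory path to its UUID.
    """
    all_uuids = sorted(dir_uuids.values())
    staged: List[Tuple[str, str]] = []  # (temp path, final path)
    try:
        # Phase 1: write and fsync a temp file next to each final file.
        for path, dir_uuid in dir_uuids.items():
            fd, tmp = tempfile.mkstemp(dir=path, prefix=INSTANCE_FILE + ".tmp")
            staged.append((tmp, os.path.join(path, INSTANCE_FILE)))
            with os.fdopen(fd, "w") as f:
                json.dump({"uuid": dir_uuid, "all_uuids": all_uuids}, f)
                f.flush()
                os.fsync(f.fileno())
    except OSError:
        # Staging failed somewhere: remove the temp files and leave every
        # existing metadata file untouched.
        for tmp, _ in staged:
            try:
                os.unlink(tmp)
            except FileNotFoundError:
                pass
        raise
    # Phase 2: publish. Each rename is atomic on its own, but the set of
    # renames is not; a crash here can leave a mix of old and new files.
    for tmp, final in staged:
        os.replace(tmp, final)
```

Staging limits the damage from the most likely failure, a disk that cannot be
written, to leftover temp files. It does not make the publish phase atomic
across disks, which is the part of the error handling that would still need to
be scoped out.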