[ 
https://issues.apache.org/jira/browse/KUDU-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke updated KUDU-2628:
------------------------------
    Labels: roadmap-candidate  (was: )

> Allow hot-swapping of failed data directories
> ---------------------------------------------
>
>                 Key: KUDU-2628
>                 URL: https://issues.apache.org/jira/browse/KUDU-2628
>             Project: Kudu
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Andrew Wong
>            Priority: Major
>              Labels: roadmap-candidate
>
> Currently, if a disk fails at runtime, its data directory is marked as 
> "failed" in memory, and its tablets are marked "failed" and re-replicated 
> elsewhere. The server then keeps running until the Kudu admins want to fix 
> it. At that point they need to bring down the server, replace the bad disk, 
> run the `kudu fs update_dirs` tool to adopt the new disk and forget the old 
> data directory, and then start the server again. This process can be slow, 
> particularly because server startup can be slow, depending on the amount of 
> data on the server.
> As implemented today, no IO should be going to a failed data directory 
> anyway, so the next logical extension of this would be to allow users to 
> replace such failed directories while the server is up, removing the old data 
> dir in memory, adding a new one with a new UUID, and updating the path 
> instance metadata files on disk to reflect the change.
> A few considerations should be taken into account in implementing this:
>  # Should we support removals of directories (vs replacements)? My feeling is 
> no, since this would break further usages of the same `fs_data_dirs` 
> configurations.
>  # Writing the path instance metadata files (PIMFs) may be challenging. 
> These files would need to be rewritten to adopt a new, empty disk. Writing 
> to multiple files across multiple disks may be messy, and doing this while 
> the server is online only exacerbates the problem. A reasonable amount of 
> error handling should be scoped out.
>  # We should be careful in picking when a hot-swap is actually viable. E.g. 
> if a hot-swap is requested and the data dir isn't bad, we shouldn't do 
> anything. Alternatively, a user may want to request a forced "failing" of a 
> data directory in preparation for a swap.
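
The replacement flow and the considerations above can be modeled roughly as follows. This is an illustrative Python sketch, not Kudu code (Kudu's filesystem layer is C++). `DataDir`, `DirManager`, `mark_failed`, `hot_swap`, `write_instance_file`, and the JSON layout of the `instance` file are all hypothetical names chosen for the example. It assumes each healthy directory holds one metadata file listing the UUIDs of every directory in the set.

```python
import json
import os
import tempfile
import uuid
from dataclasses import dataclass, field


@dataclass
class DataDir:
    """Hypothetical in-memory record for one data directory."""
    path: str
    dir_uuid: str = field(default_factory=lambda: uuid.uuid4().hex)
    failed: bool = False


def write_instance_file(dir_path, own_uuid, all_uuids):
    """Rewrite one (hypothetical) path instance metadata file atomically."""
    fd, tmp_path = tempfile.mkstemp(dir=dir_path, prefix=".instance.tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"uuid": own_uuid, "all_uuids": all_uuids}, f)
        f.flush()
        os.fsync(f.fileno())
    # Rename over the old file so readers never see a partial write.
    os.replace(tmp_path, os.path.join(dir_path, "instance"))


class DirManager:
    def __init__(self, dirs):
        self.dirs = list(dirs)

    def mark_failed(self, path):
        """Forced "failing" of a directory in preparation for a swap."""
        for d in self.dirs:
            if d.path == path:
                d.failed = True
                return
        raise KeyError(path)

    def hot_swap(self, old_path, new_path):
        """Replace a failed directory in place; return False if not viable."""
        idx = next(
            (i for i, d in enumerate(self.dirs) if d.path == old_path), None
        )
        if idx is None:
            raise KeyError(old_path)
        if not self.dirs[idx].failed:
            return False  # The directory isn't bad, so do nothing.
        if os.listdir(new_path):
            raise ValueError("replacement directory must be empty")

        # Replacement (not removal) keeps the number and order of entries.
        new_dir = DataDir(new_path)  # gets a fresh UUID
        candidate = self.dirs[:idx] + [new_dir] + self.dirs[idx + 1:]
        all_uuids = [d.dir_uuid for d in candidate]

        # Each file is replaced atomically, but the set of files is not.
        # A failure partway through leaves the files disagreeing with each
        # other, which is the error handling the issue says must be scoped.
        for d in candidate:
            if not d.failed:
                write_instance_file(d.path, d.dir_uuid, all_uuids)

        # Swap the in-memory view only after the on-disk update succeeded.
        self.dirs = candidate
        return True
```

In this sketch a swap request against a healthy directory is a no-op, and an operator who wants to retire a disk early would call `mark_failed` first. A real implementation would also need to recover from a partially completed metadata update, which the sketch deliberately leaves out.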



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
