Adar Dembo created KUDU-2202:
--------------------------------
Summary: Removing a data directory is unsafe
Key: KUDU-2202
URL: https://issues.apache.org/jira/browse/KUDU-2202
Project: Kudu
Issue Type: Bug
Components: fs
Affects Versions: 1.6.0
Reporter: Adar Dembo
I wrote a [patch|https://gerrit.cloudera.org/c/8352] that modifies the Kudu CLI
to allow for data directory addition and removal. It turns out that
implementing removal safely is quite complicated. Below I've outlined the
various issues and their potential solutions or workarounds:
# The data dir to be removed may be the first data dir, used for tablet and
consensus metadata. Until these can be striped to other data dirs, we'll work
around this by prohibiting the removal of the first data dir outright.
# Tablets may have data blocks on the removed directory. The obvious approach is
to consider those tablets failed. However, the failed tablets' block IDs could
then be reused when new blocks are created, which is dangerous. For example,
deleting a failed tablet whose block IDs were reused means deleting another
tablet's data. Some band-aid solutions here include rewriting the superblocks
on removal to strip out all block IDs that were on the removed directory, or
persisting the maximum block ID on every disk (or in every superblock) to
prevent block ID reuse. More solutions are discussed in the aforementioned
patch.
# Even if the removed data dir is empty of data, existing tablets may still be
configured to stripe to it, either explicitly (their data dir group includes
this data dir) or implicitly (they're from an older version of Kudu and don't
have a data dir group). We could work around this by rewriting these tablets'
superblocks to prune the removed data dir from the tablets' data dir groups.
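
To make issue #2 concrete, here is a toy Python model of the block ID reuse
hazard and of the "persist the maximum block ID" band-aid. It is only a sketch
and not Kudu code: the dict-based layout, the {{next_block_id}} and
{{owner_of}} helpers, and the assumption that the next block ID is derived from
the highest ID still present on disk are all hypothetical simplifications.

```python
# Toy model only. None of these names exist in Kudu; they are illustrative.

def next_block_id(dirs, persisted_max=0):
    """Next ID = 1 + max(highest live block ID, optional persisted floor)."""
    live_max = max((bid for blocks in dirs.values() for bid in blocks), default=0)
    return max(live_max, persisted_max) + 1


def owner_of(dirs, block_id):
    """Return the tablet that owns block_id, or None if no dir holds it."""
    for blocks in dirs.values():
        if block_id in blocks:
            return blocks[block_id]
    return None


# Tablet "A" has block 1 on /data/1 and blocks 2 and 3 on /data/2.
dirs = {"/data/1": {1: "A"}, "/data/2": {2: "A", 3: "A"}}
superblock_a = {1, 2, 3}            # block IDs listed in tablet A's superblock
persisted_max = max(superblock_a)   # what the band-aid would have saved

# Remove /data/2. Tablet A is now failed, but its superblock still lists 2 and 3.
del dirs["/data/2"]

reused_id = next_block_id(dirs)                # 2: collides with A's old block
safe_id = next_block_id(dirs, persisted_max)   # 4: no collision

# Without the persisted floor, a new tablet "B" is handed the reused ID.
dirs["/data/1"][reused_id] = "B"

# Deleting failed tablet A by the IDs in its superblock would now hit B's block.
clobbered = sorted(bid for bid in superblock_a
                   if owner_of(dirs, bid) not in (None, "A"))
print(reused_id, safe_id, clobbered)
```

In this model the other band-aid, stripping block IDs 2 and 3 from tablet A's
superblock at removal time, would also leave {{clobbered}} empty, since A's
deletion would then touch only block 1.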
The patch works around issue #1 via prohibition, but is vulnerable to issues #2
and #3. Given that, and given that data dir addition is of limited value (we
lack intra-node data rebalancing and existing tablets' superblocks aren't
rewritten to take advantage of a new data dir), I'm shelving the patch. Note
that it depends on [this other patch|https://gerrit.cloudera.org/c/8376], which
allows tablets with removed data directories to start up.
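
For reference, the superblock rewrite suggested for issue #3 could look
roughly like the following Python sketch. It is not taken from either patch:
the dict-shaped superblock, the {{data_dir_group}} key, and the
{{prune_dir_group}} helper are hypothetical, and giving older tablets an
explicit group made of the remaining dirs is an assumption about how the
implicit case might be handled.

```python
# Hypothetical sketch; Kudu's real superblocks are protobufs, not dicts.

def prune_dir_group(superblock, removed_dir, remaining_dirs):
    """Return a copy of superblock whose dir group excludes removed_dir."""
    group = superblock.get("data_dir_group")
    if group is None:
        # Older tablet with no explicit group: it implicitly stripes across
        # every dir, so (by assumption) make the group explicit.
        group = remaining_dirs
    new_group = [d for d in group if d != removed_dir]
    if not new_group:
        # Design choice for this sketch: refuse to leave a tablet with nowhere
        # to place new blocks.
        raise ValueError("data dir group would be empty after removal")
    return {**superblock, "data_dir_group": new_group}


remaining = ["/data/1", "/data/3"]
explicit = {"tablet_id": "t1", "data_dir_group": ["/data/1", "/data/2"]}
implicit = {"tablet_id": "t0"}  # created before data dir groups existed

pruned_explicit = prune_dir_group(explicit, "/data/2", remaining)
pruned_implicit = prune_dir_group(implicit, "/data/2", remaining)
print(pruned_explicit["data_dir_group"], pruned_implicit["data_dir_group"])
```

Note that this only addresses where new blocks are placed. It does nothing
about blocks that already lived on the removed dir, which is issue #2.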