Adar Dembo created KUDU-2202:
--------------------------------

             Summary: Removing a data directory is unsafe
                 Key: KUDU-2202
                 URL: https://issues.apache.org/jira/browse/KUDU-2202
             Project: Kudu
          Issue Type: Bug
          Components: fs
    Affects Versions: 1.6.0
            Reporter: Adar Dembo


I wrote a [patch|https://gerrit.cloudera.org/c/8352] that modifies the Kudu CLI 
to allow for data directory addition and removal. It turns out that 
implementing removal safely is quite complicated. Below I've outlined the 
various issues and their potential solutions or workarounds:

# The data dir to be removed may be the first data dir, which holds tablet and 
consensus metadata. Until this metadata can be striped to other data dirs, 
we'll work around this by prohibiting the removal of the first data dir outright.
# Tablets may have data blocks on the removed directory. The simple answer is 
to consider those tablets failed. However, the failed tablets' block IDs could 
then be reused when new blocks are created, which is dangerous. For example, 
deleting a failed tablet whose block IDs were reused means deleting another 
tablet's data. Some band-aid solutions here include rewriting the superblocks 
on removal to strip out all block IDs that were on the removed directory, or 
persisting the maximum block ID on every disk (or in every superblock) to 
prevent block ID reuse. More solutions are discussed in the aforementioned 
patch as well.
# Even if the removed data dir is empty of data, existing tablets may still be 
configured to stripe to it, either explicitly (their data dir group includes 
this data dir) or implicitly (they're from an older version of Kudu and don't 
have a data dir group). We could work around this by rewriting these tablets' 
superblocks to prune the removed data dir from the tablets' data dir groups.
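
To make the block ID reuse hazard in issue #2 concrete, here is a minimal 
Python sketch. It is a toy model, not Kudu's block manager: the 
BlockIdAllocator class, its parameters, and the assumption that IDs are handed 
out sequentially from the highest ID seen at startup are all hypothetical 
simplifications made for illustration.

```python
# Toy model only: names and allocation policy are assumptions, not Kudu code.
class BlockIdAllocator:
    """Hands out monotonically increasing block IDs."""

    def __init__(self, live_block_ids, persisted_max_id=None):
        # Naive startup: derive the next ID only from the blocks still
        # visible on the remaining data dirs.
        observed_max = max(live_block_ids, default=0)
        # Safer startup: also honor a high-water mark that was persisted
        # on every disk (or in every superblock) before the removal.
        self.next_id = max(observed_max, persisted_max_id or 0) + 1

    def allocate(self):
        block_id = self.next_id
        self.next_id += 1
        return block_id


# Blocks 3 and 4 lived on the removed data dir and belong to a tablet
# that is now considered failed; its superblock still lists them.
failed_tablet_blocks = {3, 4}
surviving_blocks = {1, 2}

naive = BlockIdAllocator(surviving_blocks)
reused_id = naive.allocate()  # collides with the failed tablet's block

safe = BlockIdAllocator(surviving_blocks, persisted_max_id=4)
fresh_id = safe.allocate()  # skips past every ID ever handed out
```

In this model the naive allocator hands a failed tablet's block ID to a new 
block, so deleting the failed tablet later would delete the new block too. 
The persisted maximum avoids the collision without touching the superblocks.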

The patch works around issue #1 via prohibition, but is vulnerable to issues #2 
and #3. Given that, and given that data dir addition is of limited value (we 
lack intra-node data rebalancing and existing tablets' superblocks aren't 
rewritten to take advantage of a new data dir), I'm shelving the patch. Note 
that it depends on [this other patch|https://gerrit.cloudera.org/c/8376], which 
allows tablets with removed data directories to start up.
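
For reference, the superblock rewrite proposed for issue #3 could look roughly 
like the Python sketch below. The dict-based superblock layout, the 
"data_dir_group" key, and the prune_removed_dir function are hypothetical 
stand-ins, not Kudu's on-disk format or API. Giving older tablets an explicit 
group over the remaining dirs is also an assumption on my part.

```python
# Illustrative sketch: the superblock layout here is an assumption.
def prune_removed_dir(superblocks, removed_uuid, remaining_uuids):
    """Return rewritten superblocks whose data dir groups omit removed_uuid."""
    rewritten = {}
    for tablet_id, superblock in superblocks.items():
        group = superblock.get("data_dir_group")
        if group is None:
            # Tablet from an older version: it implicitly stripes across
            # all data dirs, so make the group explicit (assumed policy).
            new_group = list(remaining_uuids)
        else:
            # Explicit group: drop only the removed data dir.
            new_group = [uuid for uuid in group if uuid != removed_uuid]
        # Build a new dict so the input superblocks are left untouched.
        rewritten[tablet_id] = {**superblock, "data_dir_group": new_group}
    return rewritten


superblocks = {
    "tablet-1": {"data_dir_group": ["dir-a", "dir-b"]},
    "tablet-2": {"data_dir_group": None},  # no group recorded
}
pruned = prune_removed_dir(superblocks, "dir-b", ["dir-a", "dir-c"])
```

A real implementation would also have to decide what to do with a tablet whose 
group becomes empty after pruning. The sketch leaves that case as an empty list.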




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
