Will Berkeley created KUDU-2405:

             Summary: Sanity-check tablet copies for full disks
                 Key: KUDU-2405
                 URL: https://issues.apache.org/jira/browse/KUDU-2405
             Project: Kudu
          Issue Type: Improvement
    Affects Versions: 1.7.0
            Reporter: Will Berkeley

It'd be nice to do a basic sanity check when starting a tablet copy session. 
Presently, when a tablet is created, it will acquire a data dir group that 
avoids dirs that were full at the time of the tablet's creation. That's good, 
but we should also get some info from the remote about how much WAL data, 
metadata, and data is going to be sent, and check that the copy is possible, 
assuming no change to disk space usage across the data dirs or to the size of 
the source tablet. In other words, make sure the amount to be copied is less 
than the available free space for the WAL and metadata, and that the amount of 
data to be copied is less than the space available across the dir group. If the 
check fails, the new replica should be failed, which will encourage Kudu to 
re-replicate the tablet elsewhere.
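
To make the proposed check concrete, here is a minimal sketch in Python. It is 
not Kudu code: `CopySizes`, `check_copy_fits`, and all parameter names are 
hypothetical, and it assumes the WAL and metadata live on separately measured 
directories (if they share one filesystem, their sizes would need to be summed 
before comparing).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CopySizes:
    """Sizes reported by the copy source (hypothetical structure)."""
    wal_bytes: int
    metadata_bytes: int
    data_bytes: int


def check_copy_fits(sizes: CopySizes,
                    wal_dir_free: int,
                    metadata_dir_free: int,
                    data_dir_group_free: List[int]) -> Tuple[bool, str]:
    """Return (ok, reason) for a pre-copy sanity check.

    Each amount to be copied must be strictly less than the matching free
    space. Data is compared against the total free space across the
    tablet's data dir group.
    """
    if sizes.wal_bytes >= wal_dir_free:
        return False, "not enough free space in WAL dir"
    if sizes.metadata_bytes >= metadata_dir_free:
        return False, "not enough free space in metadata dir"
    if sizes.data_bytes >= sum(data_dir_group_free):
        return False, "not enough free space across data dir group"
    return True, "ok"
```

A caller would fail the new replica when `ok` is false, rather than starting 
the copy.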

Naturally, this isn't perfect, as more space may be used, or more space may be 
freed, over the course of the copy; also, the source tablet replica may gain 
additional WAL data to copy as it accepts writes. But it should help, and in 
particular should help prevent "domino" crashes where one server's WAL dir 
fills, so it crashes, and re-replication then crashes other servers as their 
WAL drives fill (presumably because they are on similar hardware and have 
handled a similar workload).

A harder thing to address will be the corner case where the only option is to 
try to copy to a server with too little space. In this case it'd be better to 
surface the error aggressively in logs, etc., and perhaps back off on attempts, 
rather than endlessly create a tablet, start a copy, and fail a sanity check.
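
One way to back off on attempts is a capped exponential delay between retries. 
The sketch below is only an illustration in Python: `next_retry_delay`, the 
60-second base, and the one-hour cap are assumptions, not existing Kudu 
behavior or flags.

```python
def next_retry_delay(consecutive_failures: int,
                     base_seconds: float = 60.0,
                     cap_seconds: float = 3600.0) -> float:
    """Seconds to wait before the next tablet copy attempt.

    The delay doubles with each consecutive failed sanity check and is
    capped, so a server that stays full is retried less and less often
    instead of in a tight loop.
    """
    if consecutive_failures <= 0:
        return 0.0  # no failures yet: attempt immediately
    return min(cap_seconds, base_seconds * 2 ** (consecutive_failures - 1))
```

The failure count would be reset once a copy to that server passes the check, 
and each failed check would still be logged loudly.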

