Adar Dembo created KUDU-1856:
--------------------------------

             Summary: Kudu can consume far more data than it should on XFS
                 Key: KUDU-1856
                 URL: https://issues.apache.org/jira/browse/KUDU-1856
             Project: Kudu
          Issue Type: Bug
          Components: fs
    Affects Versions: 1.2.0
            Reporter: Adar Dembo
            Priority: Critical
         Attachments: check_fragmentation.py, dump_all_blocks.py, frag_report

I was investigating Kudu's disk space consumption on an internal cluster and 
found a few interesting things. This is a 42-node cluster with three masters, 
running CentOS 6.6. I focused on a particular node with 11 data directories, 
each formatted with XFS. The node was serving a bunch of tombstoned tablets, 
but no actual live tablets. All the tablets belonged to one of two tables. Due 
to the file sizes involved, the following analysis was done on just one of the 
data directories.

There were 7406 "live" blocks. I put live in quotes because these blocks were 
orphaned by definition, as there were no live tablets. The running theory is 
that they were orphaned due to tablet copy operations that failed mid-way. 
KUDU-1853 tracks this issue, at least with respect to non-crash failures. 
Failures due to a crash require a data GC of some kind, tracked in KUDU-829. 
The live blocks were stored in 1025 LBM containers. The vast majority of the 
file space in each container held punched-out dead blocks, as one might expect. 
Taken together, the live blocks accounted for ~85 GB of data.

However, the total disk space usage of these container files was ~123 GB. There 
were three discrepancies here, one tiny, one minor, and one major:
* There was ~17 MB of space lost to external fragmentation. This is because LBM 
containers force live blocks to be aligned to the nearest filesystem block.
* There was ~1.4 GB of dead block data that was backed by live extents 
according to filefrag. That is, these were dead blocks that the tserver either 
failed to punch or (more likely) crashed before it could punch.
* There was ~40 GB of zeroed space hanging off the end of the container files. 
Unlike a typical preallocation, this space _was not_ accounted for in the 
logical file size; it only showed up in filefrag or du. I believe this is due 
to XFS's [speculative preallocation 
feature|http://xfs.org/index.php/XFS_FAQ#Q:_What_is_speculative_preallocation.3F]. 
What is worrying is that this preallocation became permanent; neither 
clearing the kernel's inode cache nor shutting down the tserver (the documented 
workarounds) made it disappear. Only an explicit ftruncate() cleared it up.
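
To illustrate the third discrepancy, here is a minimal Python sketch of how the space past the logical end of a file could be totalled from an extent list. It is not taken from the attached scripts. The function name bytes_past_eof and the (start, length) byte-offset format are assumptions made for the example; filefrag reports extents in filesystem-block units, so its output would need converting first.

```python
def bytes_past_eof(extents, logical_size):
    """Sum the allocated bytes that lie beyond the logical end of file.

    extents: iterable of (start_offset, length) pairs in bytes. This is a
    hypothetical format; filefrag reports filesystem-block units that
    would have to be converted to bytes first.
    logical_size: the file size as reported by stat (st_size).
    """
    total = 0
    for start, length in extents:
        end = start + length
        # Only the part of the extent at or beyond logical_size counts.
        total += max(0, end - max(start, logical_size))
    return total


if __name__ == "__main__":
    # Two extents; the second one runs past a 10000-byte logical size.
    print(bytes_past_eof([(0, 4096), (8192, 8192)], 10000))
```

Comparing total allocated size against the logical size would not isolate this space, because punched holes elsewhere in the container reduce the allocated total. That is why the sketch looks at extent positions instead.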

There are a couple of conclusions to draw here:
# It's good that we've fixed KUDU-1853; that should reduce the number of 
orphaned blocks. However, we should prioritize KUDU-829 too, as a crash during 
a tablet copy can still orphan a ton of blocks, far more than a crash during a 
flush or compaction.
# There's also a need to re-effect hole punches in case we crash after blocks 
have been deleted but before the punches take place. This can be done blindly 
on all dead blocks in an LBM container at startup, perhaps based on some 
"actual disk space used > expected disk space used" threshold. Or we can use 
[the FIEMAP 
ioctl|https://www.kernel.org/doc/Documentation/filesystems/fiemap.txt] to 
figure out exactly where the extents are, and surgically only punch those that 
are needed.
# On XFS, we really need to address this speculative preallocation problem. 
It's not clear exactly what causes this temporary phenomenon to become 
permanent; the [XFS 
FAQ|http://xfs.org/index.php/XFS_FAQ#Q:_Is_speculative_preallocation_permanent.3F]
 is vague on that. One option, though, is to adjust the LBM 
truncate-full-container-file-at-startup logic to ignore the container's logical 
file size; that is, to always truncate the container to the end of the last 
block.
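
Conclusions 2 and 3 can be sketched roughly as follows in Python. This is an illustration under stated assumptions, not Kudu's LBM code: blocks and extents are hypothetical (offset, length) pairs in bytes, the function names are invented for the example, and "last block" is taken to mean the highest end offset among all blocks recorded in the container metadata. The actual hole punch (fallocate with FALLOC_FL_PUNCH_HOLE) is left out; the sketch only selects which dead blocks would need it.

```python
import os


def dead_blocks_needing_punch(dead_blocks, extents):
    """Return the dead blocks that still overlap an allocated extent.

    dead_blocks, extents: lists of (offset, length) pairs in bytes
    (hypothetical format, e.g. derived from LBM metadata and FIEMAP).
    """
    result = []
    for off, length in dead_blocks:
        block_end = off + length
        # Two half-open ranges overlap if each starts before the other ends.
        if any(e_off < block_end and off < e_off + e_len
               for e_off, e_len in extents):
            result.append((off, length))
    return result


def end_of_last_block(blocks):
    """Return the highest end offset among the blocks, or 0 if there are none."""
    return max((off + length for off, length in blocks), default=0)


def truncate_container(path, blocks):
    """Truncate the file at path to the end of its last block.

    The current logical file size is deliberately not consulted, matching
    the "ignore the container's logical file size" idea above. Note that an
    empty block list truncates the file to zero bytes.
    """
    target = end_of_last_block(blocks)
    os.truncate(path, target)
    return target
```

Selecting blocks by extent overlap corresponds to the "surgical" FIEMAP option; the blind option would simply punch every dead block.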

I've attached two scripts that helped me during the analysis. 
dump_all_blocks.py converts the on-disk LBM metadata files into a JSON 
representation. check_fragmentation.py uses the JSON representation and the 
output of filefrag to find fragmentation, unpunched holes, and excess 
preallocation. frag_report is the output of "check_fragmentation.py -v" on the 
JSON representation of one of the data directories.

Let's use this JIRA to track item 3 from the above list (the XFS speculative 
preallocation problem).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
