Adar Dembo created KUDU-1856:
--------------------------------
Summary: Kudu can consume far more data than it should on XFS
Key: KUDU-1856
URL: https://issues.apache.org/jira/browse/KUDU-1856
Project: Kudu
Issue Type: Bug
Components: fs
Affects Versions: 1.2.0
Reporter: Adar Dembo
Priority: Critical
Attachments: check_fragmentation.py, dump_all_blocks.py, frag_report
I was investigating Kudu's disk space consumption on an internal cluster and
found a few interesting things. This is a 42-node cluster with three masters,
running CentOS 6.6. I focused on a particular node with 11 data directories,
each formatted with XFS. The node was serving a bunch of tombstoned tablets,
but no actual live tablets. All the tablets belonged to one of two tables. Due
to the file sizes involved, the following analysis was done on just one of the
data directories.
There were 7406 "live" blocks. I put live in quotes because these blocks were
orphaned by definition, as there were no live tablets. The running theory is
that they were orphaned due to tablet copy operations that failed mid-way.
KUDU-1853 tracks this issue, at least with respect to non-crash failures.
Failures due to a crash require a data GC of some kind, tracked in KUDU-829.
The live blocks were stored in 1025 LBM containers. The vast majority of the
file space in each container held punched-out dead blocks, as one might expect.
Taken together, the live blocks accounted for ~85 GB of data.
However, the total disk space usage of these container files was ~123 GB. There
were three discrepancies here, one tiny, one minor, and one major:
* There was ~17 MB of space lost to external fragmentation. This is because LBM
containers force live blocks to be aligned to the nearest filesystem block.
* There was ~1.4 GB of dead block data that was backed by live extents
according to filefrag. That is, these are dead blocks the tserver either failed
to punch, or (more likely) crashed before it could punch.
* There was ~40 GB of zeroed space hanging off the edge of the container files.
Unlike a typical preallocation, this space _was not_ accounted for in the
logical file size; it only manifests in filefrag or du. I believe this is due
to XFS's [speculative preallocation
feature|http://xfs.org/index.php/XFS_FAQ#Q:_What_is_speculative_preallocation.3F].
What is worrying is that this preallocation became permanent; neither
clearing the kernel's inode cache nor shutting down the tserver (the documented
workarounds) made it disappear. Only an explicit ftruncate() cleared it up.
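The detection and the explicit-truncate fix described above can be sketched in a few lines of Python. The helper names are mine, st_blocks is assumed to be in 512-byte units (as POSIX specifies), and note that ordinary filesystem-block rounding also produces a small positive excess, so only a large value suggests leaked speculative preallocation:

```python
import os

def excess_preallocation(path):
    """Bytes allocated on disk beyond the file's logical size.

    st_blocks is in 512-byte units. A small positive result is just
    filesystem-block rounding; a large one (tens of MB or more per
    container) suggests XFS speculative preallocation that stuck around.
    Sparse regions make the raw difference negative, hence the max().
    """
    st = os.stat(path)
    return max(0, st.st_blocks * 512 - st.st_size)

def drop_excess(path):
    """Truncate the file to its own logical size.

    This is the explicit ftruncate() mentioned above: it does not change
    the logical file size, but it releases any space allocated past EOF.
    """
    st = os.stat(path)
    os.truncate(path, st.st_size)
```

Since the truncate target is the file's current logical size, drop_excess() is safe to run on a healthy container file; it is a no-op there.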
There are a couple of conclusions to draw here:
# It's good that we've fixed KUDU-1853; that should reduce the number of
orphaned blocks. However, we should prioritize KUDU-829 too, as a crash during
a tablet copy can still orphan a ton of blocks, far more than a crash during a
flush or compaction.
# There's also a need to re-issue hole punches in case we crash after blocks
have been deleted but before the punches take place. This can be done blindly
on all dead blocks in an LBM container at startup, perhaps based on some
"actual disk space used > expected disk space used" threshold. Or we can use
[the FIEMAP
ioctl|https://www.kernel.org/doc/Documentation/filesystems/fiemap.txt] to
figure out exactly where the extents are, and surgically punch only those that
are needed.
# On XFS, we really need to address this speculative preallocation problem.
It's not clear exactly what causes this temporary phenomenon to become
permanent; the [XFS
faq|http://xfs.org/index.php/XFS_FAQ#Q:_Is_speculative_preallocation_permanent.3F]
is vague on that. But one option is to adjust the LBM
truncate-full-container-file-at-startup logic to ignore the container's logical
file size; that is, to always truncate the container to the end of the last
block.
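The "actual disk space used > expected disk space used" threshold from item 2 could look roughly like this; all names and the slack parameter are illustrative, and expected_live_bytes would come from summing the live blocks in the LBM metadata:

```python
import os

def needs_repunch(container_path, expected_live_bytes,
                  fs_block_size=4096, slack_blocks=16):
    """Heuristic: if a container occupies noticeably more disk space than
    its live blocks account for, assume some dead blocks were never
    hole-punched and blindly re-punch all dead blocks at startup.

    expected_live_bytes: bytes of live block data per the LBM metadata.
    slack_blocks: tolerance for per-block alignment padding, since the
    LBM aligns live blocks to the filesystem block size.
    """
    actual = os.stat(container_path).st_blocks * 512
    expected = expected_live_bytes + slack_blocks * fs_block_size
    return actual > expected
```

This trades precision for simplicity; the FIEMAP approach would instead map the real extents and punch only the dead ranges that are still backed.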
I've attached two scripts that helped me during the analysis.
dump_all_blocks.py converts the on-disk LBM metadata files into a JSON
representation. check_fragmentation.py uses the JSON representation and the
output of filefrag to find fragmentation, unpunched holes, and excess
preallocation. frag_report is the output of "check_fragmentation.py -v" on the
JSON representation of one of the data directories.
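The attached scripts aren't reproduced here, but the core of consuming filefrag output can be sketched as follows. The extent-line format assumed is that of e2fsprogs' "filefrag -v", which varies a bit between versions, so treat this as illustrative rather than a copy of check_fragmentation.py:

```python
import re

# Matches lines like:
#    0:        0..     127:      8192..    8319:    128: ... flags
EXTENT_RE = re.compile(
    r'^\s*\d+:\s+(\d+)\.\.\s*(\d+):\s+(\d+)\.\.\s*(\d+):\s+(\d+)')

def parse_filefrag(output):
    """Parse 'filefrag -v' output into (logical_start, physical_start,
    length) tuples, all in filesystem blocks.

    Comparing these live extents against the block layout from the LBM
    metadata reveals unpunched holes (dead blocks still backed by
    extents) and allocation past the last live block.
    """
    extents = []
    for line in output.splitlines():
        m = EXTENT_RE.match(line)
        if m:
            logical_start = int(m.group(1))
            physical_start = int(m.group(3))
            length = int(m.group(5))
            extents.append((logical_start, physical_start, length))
    return extents
```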
Let's use this JIRA to track issue #3 from the above list.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)