[
https://issues.apache.org/jira/browse/KUDU-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847953#comment-15847953
]
Adar Dembo commented on KUDU-1856:
----------------------------------
I filed (and linked) KUDU-1857 to track issue #2 from the original comment.
> Kudu can consume far more data than it should on XFS
> ----------------------------------------------------
>
> Key: KUDU-1856
> URL: https://issues.apache.org/jira/browse/KUDU-1856
> Project: Kudu
> Issue Type: Bug
> Components: fs
> Affects Versions: 1.2.0
> Reporter: Adar Dembo
> Assignee: Adar Dembo
> Priority: Critical
> Attachments: check_fragmentation.py, dump_all_blocks.py, frag_report
>
>
> I was investigating Kudu's disk space consumption on an internal cluster and
> found a few interesting things. This is a 42-node cluster with three masters,
> running CentOS 6.6. I focused on a particular node with 11 data directories,
> each formatted with XFS. The node was serving a bunch of tombstoned tablets,
> but no actual live tablets. All the tablets belonged to one of two tables.
> Due to the file sizes involved, the following analysis was done on just one
> of the data directories.
>
> There were 7406 "live" blocks. I put live in quotes because these blocks were
> orphaned by definition, as there were no live tablets. The running theory is
> that they were orphaned due to tablet copy operations that failed mid-way.
> KUDU-1853 tracks this issue, at least with respect to non-crash failures.
> Failures due to a crash require a data GC of some kind, tracked in KUDU-829.
>
> The live blocks were stored in 1025 LBM containers. The vast majority of the
> file space in each container held punched-out dead blocks, as one might
> expect. Taken together, the live blocks accounted for ~85 GB of data.
> However, the total disk space usage of these container files was ~123 GB.
> There were three discrepancies here, one tiny, one minor, and one major:
> * There was ~17 MB of space lost to external fragmentation. This is because
> LBM containers force live blocks to be aligned to the nearest filesystem
> block.
> * There was ~1.4 GB of dead block data that was backed by live extents
> according to filefrag. That is, these are dead blocks the tserver either
> failed to punch, or (more likely) crashed before it could punch.
> * There was ~40 GB of zeroed space hanging off the edge of the container
> files. Unlike a typical preallocation, this space _was not_ accounted for in
> the logical file size; it only manifests in filefrag or du. I believe this is
> due to XFS's [speculative preallocation
> feature|http://xfs.org/index.php/XFS_FAQ#Q:_What_is_speculative_preallocation.3F].
> What is worrying is that this preallocation became permanent;
> neither clearing the kernel's inode cache nor shutting down the tserver (the
> documented workarounds) made it disappear. Only an explicit ftruncate()
> cleared it up.
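The first discrepancy above (space lost to filesystem-block alignment) is easy to model. Here is a toy sketch, not Kudu code; the 4 KiB filesystem block size is an assumption:

```python
# Toy model of the external fragmentation described above: the LBM
# aligns each live block to the next filesystem block boundary, so a
# block can waste up to (fs_block - 1) bytes of padding. This is an
# illustration only; the 4 KiB block size is an assumption.
FS_BLOCK = 4096

def align_up(n, boundary=FS_BLOCK):
    """Round n up to the next multiple of boundary."""
    return (n + boundary - 1) // boundary * boundary

def alignment_waste(block_sizes, boundary=FS_BLOCK):
    """Total bytes lost to padding when every block is block-aligned."""
    return sum(align_up(s, boundary) - s for s in block_sizes)

# Example: live blocks of 10 B, 4096 B, and 5000 B are padded to
# 4096 B, 4096 B, and 8192 B respectively.
print(alignment_waste([10, 4096, 5000]))  # 4086 + 0 + 3192 = 7278
```

With ~17 MB lost across ~85 GB of live data, this padding overhead is indeed tiny, consistent with the report above.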
>
> There are a couple of conclusions to draw here:
> # It's good that we've fixed KUDU-1853; that should reduce the number of
> orphaned blocks. However, we should prioritize KUDU-829 too, as a crash
> during a tablet copy can still orphan a ton of blocks, far more than a crash
> during a flush or compaction.
> # There's also a need to re-effect hole punches in case we crash after blocks
> have been deleted but before the punches take place. This can be done blindly
> on all dead blocks in an LBM container at startup, perhaps based on some
> "actual disk space used > expected disk space used" threshold. Or we can use
> [the FIEMAP
> ioctl|https://www.kernel.org/doc/Documentation/filesystems/fiemap.txt] to
> figure out exactly where the extents are, and surgically only punch those
> that are needed.
> # On XFS, we really need to address this speculative preallocation problem.
> It's not clear exactly what causes this temporary phenomenon to become
> permanent; the [XFS
> FAQ|http://xfs.org/index.php/XFS_FAQ#Q:_Is_speculative_preallocation_permanent.3F]
> is vague on that. But one option is to adjust the LBM
> truncate-full-container-file-at-startup logic to ignore the container's
> logical file size; that is, to always truncate the container to the end of
> the last block.
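The fix proposed in conclusion #3 — truncating each container to the end of its last block regardless of the logical file size — can be sketched as follows. This is a hedged illustration, not Kudu's actual startup logic; the container path and block layout are hypothetical:

```python
import os
import tempfile

def truncate_to_last_block(path, blocks):
    """Truncate `path` so it ends exactly at the end of the last block.

    `blocks` is a list of (offset, length) pairs describing the live
    blocks in a (hypothetical) container file. Truncating past the last
    block would also discard any space hanging off the end of the file,
    such as XFS speculative preallocation.
    """
    end = max((off + length for off, length in blocks), default=0)
    os.truncate(path, end)
    return end

# Usage on a throwaway file: 16 KiB written, but the last block ends
# at 8 KiB, so the truncate reclaims the trailing half of the file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * 16384)
    path = f.name
truncate_to_last_block(path, [(0, 4096), (4096, 4096)])
print(os.path.getsize(path))  # 8192
```

The key property is that the truncate ignores the file's current logical size entirely, so it is safe to run blindly at startup even when the on-disk size disagrees with the metadata.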
>
> I've attached two scripts that helped me during the analysis.
> dump_all_blocks.py converts the on-disk LBM metadata files into a JSON
> representation. check_fragmentation.py uses the JSON representation and the
> output of filefrag to find fragmentation, unpunched holes, and excess
> preallocation. frag_report is the output of "check_fragmentation.py -v" on
> the JSON representation of one of the data directories.
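The core "actual disk usage vs. expected" comparison can be approximated from stat(2) alone. A rough sketch under that assumption — not the attached script's logic, and the 10% slack threshold is illustrative:

```python
import os
import tempfile

def disk_usage(path):
    """Bytes actually allocated on disk for path.

    st_blocks counts 512-byte units, so this can exceed the sum of the
    live block sizes when holes went unpunched or the filesystem
    preallocated extra space past EOF.
    """
    return os.stat(path).st_blocks * 512

def looks_bloated(path, expected_live_bytes, slack=1.10):
    """True if the file uses noticeably more disk space than its live
    blocks account for. The 10% slack threshold is an assumption, not
    the attached script's heuristic."""
    return disk_usage(path) > expected_live_bytes * slack

# Usage on a throwaway file:
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 4096)
    path = f.name
print(disk_usage(path), looks_bloated(path, 10**15))
```

A threshold like this could drive the "re-effect hole punches at startup" idea from conclusion #2, with the FIEMAP ioctl reserved for the surgical variant.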
>
> Let's use this JIRA to track issue #3 from the above list.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)