[ 
https://issues.apache.org/jira/browse/KUDU-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847953#comment-15847953
 ] 

Adar Dembo commented on KUDU-1856:
----------------------------------

I filed (and linked) KUDU-1857 to track issue #2 from the original comment.

> Kudu can consume far more data than it should on XFS
> ----------------------------------------------------
>
>                 Key: KUDU-1856
>                 URL: https://issues.apache.org/jira/browse/KUDU-1856
>             Project: Kudu
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 1.2.0
>            Reporter: Adar Dembo
>            Assignee: Adar Dembo
>            Priority: Critical
>         Attachments: check_fragmentation.py, dump_all_blocks.py, frag_report
>
>
> I was investigating Kudu's disk space consumption on an internal cluster and 
> found a few interesting things. This is a 42-node cluster with three masters, 
> running CentOS 6.6. I focused on a particular node with 11 data directories, 
> each formatted with XFS. The node was serving a bunch of tombstoned tablets, 
> but no actual live tablets. All the tablets belonged to one of two tables. 
> Due to the file sizes involved, the following analysis was done on just one 
> of the data directories.
>
> There were 7406 "live" blocks. I put live in quotes because these blocks were 
> orphaned by definition, as there were no live tablets. The running theory is 
> that they were orphaned due to tablet copy operations that failed mid-way. 
> KUDU-1853 tracks this issue, at least with respect to non-crash failures. 
> Failures due to a crash require a data GC of some kind, tracked in KUDU-829. 
>
> The live blocks were stored in 1025 LBM containers. The vast majority of the 
> file space in each container held punched-out dead blocks, as one might 
> expect. Taken together, the live blocks accounted for ~85 GB of data.
> However, the total disk space usage of these container files was ~123 GB. 
> There were three discrepancies here, one tiny, one minor, and one major:
> * There was ~17 MB of space lost to external fragmentation. This is because 
> LBM containers force live blocks to be aligned to the nearest filesystem 
> block.
> * There was ~1.4 GB of dead block data that was backed by live extents 
> according to filefrag. That is, these are dead blocks the tserver either 
> failed to punch, or (more likely) crashed before it could punch.
> * There was ~40 GB of zeroed space hanging off the edge of the container 
> files. Unlike a typical preallocation, this space _was not_ accounted for in 
> the logical file size; it manifested only in filefrag or du output. I believe 
> this is due to XFS's [speculative preallocation 
> feature|http://xfs.org/index.php/XFS_FAQ#Q:_What_is_speculative_preallocation.3F].
> What is worrying is that this preallocation became permanent; neither 
> clearing the kernel's inode cache nor shutting down the tserver (the 
> documented workarounds) made it disappear. Only an explicit ftruncate() 
> cleared it up.
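For reference, the third discrepancy is visible from userspace by comparing a file's logical size against its allocated blocks, and the ftruncate() workaround amounts to truncating the file to its own logical size. A minimal sketch, assuming Linux stat semantics (st_blocks counts 512-byte units); the helper names are illustrative, not Kudu code:

```python
import os
import tempfile

def allocation_report(path):
    """Return (logical_size, allocated_bytes) for a file.

    st_blocks counts 512-byte units actually backed on disk, so
    allocated_bytes far above logical_size hints at unpunched holes
    or stray (e.g. speculative) preallocation past EOF.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512

def drop_excess(path):
    """Truncate a file to its own logical size.

    This mirrors the explicit ftruncate() workaround described above:
    it forces the filesystem to drop extents beyond the logical EOF.
    """
    with open(path, "r+b") as f:
        f.truncate(os.stat(path).st_size)

# Demo on a throwaway file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 8192)
    name = f.name
size, allocated = allocation_report(name)
drop_excess(name)
os.unlink(name)
```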
> There are a couple of conclusions to draw here:
> # It's good that we've fixed KUDU-1853; that should reduce the number of 
> orphaned blocks. However, we should prioritize KUDU-829 too, as a crash 
> during a tablet copy can still orphan a ton of blocks, far more than a crash 
> during a flush or compaction.
> # There's also a need to re-issue hole punches in case we crash after blocks 
> have been deleted but before the punches take place. This can be done blindly 
> on all dead blocks in an LBM container at startup, perhaps based on some 
> "actual disk space used > expected disk space used" threshold. Or we can use 
> [the FIEMAP 
> ioctl|https://www.kernel.org/doc/Documentation/filesystems/fiemap.txt] to 
> figure out exactly where the extents are, and surgically only punch those 
> that are needed.
> # On XFS, we really need to address this speculative preallocation problem. 
> It's not clear exactly what causes this temporary phenomenon to become 
> permanent; the [XFS 
> faq|http://xfs.org/index.php/XFS_FAQ#Q:_Is_speculative_preallocation_permanent.3F]
>  is vague on that. But, one option is to adjust the LBM 
> truncate-full-container-file-at-startup logic to ignore the container's 
> logical file size; that is, to always truncate the container to the end of 
> the last block.
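The blind re-punch idea in conclusion 2 boils down to issuing fallocate(2) hole punches over the byte ranges of known-dead blocks at startup. A hedged sketch of a single punch, assuming Linux and the flag values from <linux/falloc.h>; this is illustrative, not the LBM's actual code path:

```python
import ctypes
import ctypes.util
import os
import tempfile

# Linux fallocate(2) mode flags, values from <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

def punch_hole(fd, offset, length):
    """Deallocate [offset, offset + length) without changing the file size."""
    if not hasattr(libc, "fallocate"):          # non-Linux libc
        raise OSError(0, "fallocate(2) unavailable")
    libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                               ctypes.c_int64, ctypes.c_int64]
    ret = libc.fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                         offset, length)
    if ret != 0:
        errno = ctypes.get_errno()
        raise OSError(errno, os.strerror(errno))

# Demo: punch the first 4 KB of a scratch file. KEEP_SIZE means the
# logical size stays 8192 even though the range is deallocated.
with tempfile.NamedTemporaryFile() as f:
    f.write(b"x" * 8192)
    f.flush()
    try:
        punch_hole(f.fileno(), 0, 4096)
    except OSError:
        pass  # filesystem without hole-punch support
    size = os.fstat(f.fileno()).st_size
```

Punching an already-punched range is idempotent, which is what makes the blind variant safe; the FIEMAP route would just narrow the set of ranges worth punching.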
> I've attached two scripts that helped me during the analysis. 
> dump_all_blocks.py converts the on-disk LBM metadata files into a JSON 
> representation. check_fragmentation.py uses the JSON representation and the 
> output of filefrag to find fragmentation, unpunched holes, and excess 
> preallocation. frag_report is the output of "check_fragmentation.py -v" on 
> the JSON representation of one of the data directories.
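The extent-matching step a script like check_fragmentation.py performs might look roughly like this; the column layout and sample output assume e2fsprogs' `filefrag -v` format and are illustrative only:

```python
import re

# One extent row of `filefrag -v` output, e.g.:
#    0:        0..       1:       96..        97:      2:             last,eof
EXTENT_RE = re.compile(
    r"^\s*\d+:\s+(\d+)\.\.\s*(\d+):\s+(\d+)\.\.\s*(\d+):\s+(\d+):")

def parse_extents(filefrag_output):
    """Return (logical_start, physical_start, length_in_blocks) tuples."""
    extents = []
    for line in filefrag_output.splitlines():
        m = EXTENT_RE.match(line)
        if m:
            log_start, _, phys_start, _, length = map(int, m.groups())
            extents.append((log_start, phys_start, length))
    return extents

# Illustrative sample in the filefrag -v layout assumed above.
sample = """\
File size of container.data is 8192 (2 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       1:       96..        97:      2:             last,eof
"""
extents = parse_extents(sample)
```

Comparing these live extents against the dead-block ranges from the LBM metadata is what surfaces unpunched holes and excess preallocation.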
> Let's use this JIRA to track issue #3 from the above list.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)