[ https://issues.apache.org/jira/browse/KUDU-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027782#comment-17027782 ]

Andrew Wong commented on KUDU-1625:
-----------------------------------

We've heard requests for this for some time, so I started thinking about it a 
bit. There are many approaches we could consider.
* Add an op to delete empty, ancient rowsets. This op would be relatively
cheap to perform, since it amounts to adding entire rowsets to the orphaned
block list rather than rewriting anything. The benefit of this approach is
highly workload-dependent, though, since most users don't have visibility into
their data at the rowset granularity. Nevertheless, it should help in cases
where entire time ranges are deleted. I put up a WIP patch
[here|https://gerrit.cloudera.org/c/15145/].
* Improve the perf scoring for MajorDeltaCompactions so that they are
performed more aggressively when a large portion of the deltas are ancient
deletes. Since iterating through all the deltas would make for a pretty
expensive scoring system, we might consider a surrogate instead: an
approximation could be (# live rows / # total rows) when the entire rowset is
ancient, where a low ratio indicates that most rows have been deleted (see the
first sketch after this list).
* Attempt to shoehorn deltas into the existing compaction selection scoring
system. I don't think this is a great idea since the math is pretty complex
already, but perhaps we could consider adding extra weight to rowsets that
have a larger number of deltas (e.g. if we're looking to reduce the overlap of
rowsets, maybe we ought to treat a rowset with some deltas as two (or slightly
fewer) overlapping rowsets; see the second sketch after this list).
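
To make the surrogate scoring idea concrete, here's a minimal sketch. The
struct and function names are hypothetical (this is not the actual maintenance
manager API); it just shows how a cheap per-rowset score could fall out of the
live/total row ratio once we know all of a rowset's deltas are ancient.

{code:cpp}
#include <cstdint>

// Hypothetical per-rowset stats; not Kudu's real types.
struct RowSetStats {
  int64_t live_rows;        // rows not deleted
  int64_t total_rows;       // live + deleted rows
  bool all_deltas_ancient;  // every delta is older than the ancient history mark
};

// Returns a score in [0, 1]; higher means more space is reclaimable by a
// major delta compaction. Uses the (# live rows / # total rows) surrogate
// rather than iterating over the delta stores.
double AncientDeleteScore(const RowSetStats& stats) {
  if (!stats.all_deltas_ancient || stats.total_rows == 0) {
    return 0.0;  // nothing safely GC-able, or nothing to score
  }
  double live_ratio =
      static_cast<double>(stats.live_rows) / stats.total_rows;
  return 1.0 - live_ratio;  // low live ratio => high score
}
{code}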
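And a rough illustration of the extra-weight idea from the last bullet. Again,
the names are made up; the point is just that a delta-heavy rowset could count
for somewhere between one and two rowsets' worth of width in the existing
overlap-based selection score.

{code:cpp}
#include <algorithm>
#include <cstdint>

// base_width: the rowset's keyspace width as used by the current compaction
// selection policy. num_delta_rows: rows covered by deltas (hypothetical stat).
double EffectiveWidth(double base_width, int64_t num_delta_rows,
                      int64_t num_rows) {
  if (num_rows == 0) {
    return base_width;
  }
  // Fraction of rows carrying deltas, capped at 1.
  double delta_fraction =
      std::min(1.0, static_cast<double>(num_delta_rows) / num_rows);
  // Scale up to 2x, so a fully delta-covered rowset is scored roughly like
  // two overlapping rowsets.
  return base_width * (1.0 + delta_fraction);
}
{code}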

> Schedule compaction on rowsets with high percentage of deleted data
> -------------------------------------------------------------------
>
>                 Key: KUDU-1625
>                 URL: https://issues.apache.org/jira/browse/KUDU-1625
>             Project: Kudu
>          Issue Type: Improvement
>          Components: tablet
>    Affects Versions: 1.0.0
>            Reporter: Todd Lipcon
>            Priority: Major
>
> Although with KUDU-236 we can now remove rows that were deleted prior to the 
> ancient history mark, we don't actively schedule compactions based on deleted 
> rows. So, if for example we have a fully compacted table and issue a DELETE 
> for every row, the data size actually does not change, because no compactions 
> are triggered.
> We need some way to notice the fact that the ratio of deletes to rows is high 
> and decide to compact those rowsets.


