[
https://issues.apache.org/jira/browse/HBASE-30036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Junegunn Choi reassigned HBASE-30036:
-------------------------------------
Assignee: Junegunn Choi
> Delete marker consolidation on memstore flush
> ---------------------------------------------
>
> Key: HBASE-30036
> URL: https://issues.apache.org/jira/browse/HBASE-30036
> Project: HBase
> Issue Type: Improvement
> Reporter: Junegunn Choi
> Assignee: Junegunn Choi
> Priority: Major
> Attachments: image-2026-03-27-19-39-23-218.png,
> image-2026-03-27-19-39-39-721.png, image-2026-03-27-19-40-48-215.png,
> image-2026-03-27-19-41-01-664.png, image-2026-03-27-19-41-09-974.png
>
>
> h2. Problem
> When the same column or row is repeatedly deleted and recreated, redundant
> {{DeleteColumn}} or {{DeleteFamily}} markers accumulate in the flushed
> HFiles. Read performance degrades linearly with the number of markers until
> the next major compaction.
> h2. Cause
> During flush, {{MinorCompactionScanQueryMatcher}} includes all delete markers
> unconditionally. Consolidation only happens as a side effect when a put is
> interleaved between markers -- the put triggers {{SEEK_NEXT_COL}}, which
> skips the rest. When markers are contiguous (no interleaved puts), or when
> {{DeleteFamily}} markers sort before all puts, no seek is triggered and all
> markers are written.
> h2. Description
> A {{DeleteColumn}} masks all versions of a column at timestamps <= its own.
> An older {{DeleteColumn}} for the same column is strictly redundant. Same for
> {{DeleteFamily}}. Only the latest marker is needed.
> h3. Consolidation that already works (/)
> When {{DeleteColumn}} markers alternate with puts for the same column, the
> flush scanner already consolidates them. The first {{DeleteColumn}} is
> included, then the next put is recognized as column-deleted, which triggers
> {{SEEK_NEXT_COL}} in the {{StoreScanner}}. This seek skips all remaining
> cells for that column -- including older {{DeleteColumn}} markers. Only the
> latest marker survives in the flushed HFile:
> {code:ruby}
> create 't', 'd'
> 10.times do
>   put 't', 'row', 'd:foo', 'bar'
>   sleep 0.001
>   deleteall 't', 'row', 'd:foo'
>   sleep 0.001
> end
> scan 't', RAW => true, VERSIONS => 100
> # ROW COLUMN+CELL
> # row column=d:foo, timestamp=2026-03-25T18:43:56.772, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:43:56.771, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:43:56.769, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:43:56.767, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:43:56.766, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:43:56.764, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:43:56.762, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:43:56.760, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:43:56.758, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:43:56.756, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:43:56.754, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:43:56.753, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:43:56.751, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:43:56.749, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:43:56.747, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:43:56.745, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:43:56.743, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:43:56.741, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:43:56.739, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:43:56.736, value=bar
> # 1 row(s)
> # Took 0.0037 seconds
> flush 't'
> scan 't', RAW => true, VERSIONS => 100
> # ROW COLUMN+CELL
> # row column=d:foo, timestamp=2026-03-25T18:43:56.772, type=DeleteColumn
> # 1 row(s)
> {code}
> 20 cells in the memstore (10 puts + 10 deletes) are reduced to a single cell
> in the HFile.
> h3. Case 1: Contiguous DeleteColumn markers are not consolidated (x)
> However, this consolidation relies on a put being interleaved between delete
> markers. When deletes are issued repeatedly without interleaved puts, the
> resulting {{DeleteColumn}} markers are contiguous. No put exists to trigger
> the seek, so all markers survive the flush:
> {code:ruby}
> create 't', 'd'
> put 't', 'row', 'd:foo', 'bar'
> 10.times do
>   sleep 0.001
>   deleteall 't', 'row', 'd:foo'
> end
> scan 't', RAW => true, VERSIONS => 100
> # ROW COLUMN+CELL
> # row column=d:foo, timestamp=2026-03-25T18:38:33.360, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.358, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.356, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.355, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.353, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.351, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.349, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.348, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.345, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.342, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.254, value=bar
> # 1 row(s)
> # Took 0.0017 seconds
> flush 't'
> scan 't', RAW => true, VERSIONS => 1000
> # ROW COLUMN+CELL
> # row column=d:foo, timestamp=2026-03-25T18:38:33.360, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.358, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.356, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.355, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.353, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.351, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.349, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.348, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.345, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.342, type=DeleteColumn
> # 1 row(s)
> {code}
> Expected: only the most recent {{DeleteColumn}} marker should remain.
> h3. Case 2: Contiguous DeleteFamily markers are not consolidated (x)
> {{DeleteFamily}} markers have an empty qualifier and sort before all
> qualified cells. Even when puts and family deletes are interleaved in time,
> all {{DeleteFamily}} markers arrive as a contiguous block before any puts in
> the scanner, so the seek optimization never kicks in:
> {code:ruby}
> create 't', 'd'
> 10.times do
>   put 't', 'row', 'd:foo', 'bar'
>   sleep 0.001
>   deleteall 't', 'row'
>   sleep 0.001
> end
> scan 't', RAW => true, VERSIONS => 100
> # ROW COLUMN+CELL
> # row column=d:, timestamp=2026-03-25T18:41:54.848, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.843, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.839, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.835, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.831, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.827, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.823, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.817, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.812, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.807, type=DeleteFamily
> # row column=d:foo, timestamp=2026-03-25T18:41:54.846, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:41:54.841, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:41:54.837, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:41:54.833, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:41:54.829, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:41:54.825, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:41:54.820, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:41:54.815, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:41:54.810, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:41:54.804, value=bar
> # 1 row(s)
> # Took 0.0024 seconds
> flush 't'
> scan 't', RAW => true, VERSIONS => 100
> # ROW COLUMN+CELL
> # row column=d:, timestamp=2026-03-25T18:41:54.848, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.843, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.839, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.835, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.831, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.827, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.823, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.817, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.812, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.807, type=DeleteFamily
> # 1 row(s)
> {code}
> Expected: only the most recent {{DeleteFamily}} marker should remain.
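> This contiguous grouping follows directly from the cell sort order. As a
> minimal sketch (the real {{KeyValue}} comparator also orders by row, family,
> and type; the tuples below are illustrative, not the actual cell layout),
> cells within a row and family sort by qualifier bytes ascending, then by
> timestamp descending, so the empty qualifier of a {{DeleteFamily}} marker
> places it before every qualified cell regardless of timestamp:
> {code:ruby}
> # Simplified cell ordering: [qualifier, timestamp]. Within one row and
> # family, cells sort by qualifier (byte order, empty string first), then
> # newest timestamp first. Sketch only, not the actual HBase comparator.
> cells = [
>   ['foo', 104],  # put
>   ['',    105],  # DeleteFamily (empty qualifier)
>   ['foo', 102],  # put
>   ['',    103],  # DeleteFamily
> ]
> sorted = cells.sort_by { |qualifier, ts| [qualifier, -ts] }
> # All DeleteFamily markers come first, as one contiguous block:
> # [["", 105], ["", 103], ["foo", 104], ["foo", 102]]
> {code}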
> h2. Proposed Fix
> Add {{DeleteTracker.isRedundantDelete(ExtendedCell)}} with an implementation
> in {{ScanDeleteTracker}} that checks whether a delete marker is already
> covered by a previously tracked delete of equal or broader scope:
> - {{DeleteFamily}} / {{DeleteFamilyVersion}}
> -- Redundant if a {{DeleteFamily}} with a higher timestamp was already
> tracked.
> - {{DeleteColumn}} / {{Delete}}
> -- Redundant if a {{DeleteColumn}} for the same qualifier, or a
> {{DeleteFamily}}, with a higher timestamp was already tracked.
> {{MinorCompactionScanQueryMatcher.match()}} calls this check before
> {{trackDelete()}} and seeks past the remaining cells covered by the
> previously tracked delete.
> Note: The fix is compatible with {{KEEP_DELETED_CELLS}}. When set to
> {{TRUE}}, {{trackDelete()}} does not populate the delete tracker, so
> {{isRedundantDelete()}} always returns false and all markers are retained.
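> As a minimal sketch of these rules (the actual change is Java code in
> {{ScanDeleteTracker}}; the names and data shapes below are illustrative, not
> the real API), assuming markers arrive newest-first in scan order:
> {code:ruby}
> # Tracked state: timestamp of the last DeleteFamily seen, and per-qualifier
> # timestamps of DeleteColumn markers. Sketch of the redundancy rules only.
> tracker = { family_ts: nil, column_ts: {} }
>
> redundant = lambda do |marker|
>   case marker[:type]
>   when :DeleteFamily, :DeleteFamilyVersion
>     # Covered by a DeleteFamily with a higher timestamp already tracked.
>     t = tracker[:family_ts]
>     !t.nil? && t > marker[:ts]
>   when :DeleteColumn, :Delete
>     # Covered by a newer DeleteColumn for the same qualifier,
>     # or by a newer DeleteFamily.
>     t = tracker[:column_ts][marker[:qualifier]]
>     f = tracker[:family_ts]
>     (!t.nil? && t > marker[:ts]) || (!f.nil? && f > marker[:ts])
>   end
> end
>
> track = lambda do |marker|
>   case marker[:type]
>   when :DeleteFamily
>     tracker[:family_ts] = marker[:ts]
>   when :DeleteColumn
>     tracker[:column_ts][marker[:qualifier]] = marker[:ts]
>   end
> end
>
> # Ten contiguous DeleteColumn markers, newest first (Case 1): only the
> # newest survives.
> markers = (1..10).map { |i| { type: :DeleteColumn, qualifier: 'foo', ts: 100 - i } }
> kept = markers.reject do |m|
>   if redundant.call(m)
>     true
>   else
>     track.call(m)
>     false
>   end
> end
> # kept.size == 1, kept.first[:ts] == 99
> {code}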
> h2. Benefits
> This substantially improves read performance for workloads that frequently
> delete and recreate records. Without this fix, each delete-recreate cycle
> adds a {{DeleteColumn}} or {{DeleteFamily}} marker that persists through
> flush. These markers accumulate across flushes and must be processed on every
> read, causing latency to degrade over time proportional to the number of
> delete cycles.
> With this fix, flush consolidates redundant markers down to a single one,
> keeping HFiles compact and read latency stable regardless of how many
> delete-recreate cycles a record has been through.
> h2. Test
> Used this hbase-shell script to measure the benefit of the proposed fix:
> - https://gist.github.com/junegunn/bc0acf5269b8875330c0947dac7d0280
> Three scenarios:
> - DeleteFamily (naturally contiguous)
> - DeleteColumnContiguous
> - DeleteColumnInterleaved
> -- Already consolidated without the fix
> The tests show that with the fix, the number of cells and the read latency
> significantly decrease on every flush.
> h3. DeleteFamily
> {code}
> benchmark(:DeleteFamily) do
>   T.put(PUT)
>   T.delete(DF)
> end
> {code}
> !image-2026-03-27-19-40-48-215.png|width=800!
> - On every flush, the number of cells and the read latency drop sharply as
> redundant {{DeleteFamily}} markers are consolidated.
> - The flushed store files are significantly smaller.
> h3. DeleteColumnContiguous
> {code}
> benchmark(:DeleteColumnContiguous) do |i|
>   T.put(PUT) if i.zero?
>   T.delete(DC)
>   # Manually flush every 100,000 operations, since it is hard to fill up
>   # the memstore with delete markers alone.
>   flush 't' if (i % 100_000).zero? && i.positive?
> end
> {code}
> !image-2026-03-27-19-41-01-664.png|width=800!
> - Similar to the above. The delete markers are consolidated on every (manual)
> flush, and the read latency drops sharply as a result.
> - The flushed store files are also significantly smaller.
> h3. DeleteColumnInterleaved
> {code}
> benchmark(:DeleteColumnInterleaved) do |i|
>   T.put(PUT)
>   T.delete(DC)
> end
> {code}
> !image-2026-03-27-19-41-09-974.png|width=800!
> - Already consolidated on flush without the fix. No difference as expected.
> h2. Related work
> Even with this fix, redundant delete markers in the MemStore still degrade
> read performance until they are flushed and consolidated. HBASE-29039
> addresses this by changing the normal read path
> ({{NormalUserScanQueryMatcher}}).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)