Junegunn Choi created HBASE-30036:
-------------------------------------
Summary: Delete marker consolidation on memstore flush
Key: HBASE-30036
URL: https://issues.apache.org/jira/browse/HBASE-30036
Project: HBase
Issue Type: Improvement
Reporter: Junegunn Choi
Attachments: image-2026-03-27-19-39-23-218.png,
image-2026-03-27-19-39-39-721.png, image-2026-03-27-19-40-48-215.png,
image-2026-03-27-19-41-01-664.png, image-2026-03-27-19-41-09-974.png
h2. Problem
When the same column or row is repeatedly deleted and recreated, redundant
{{DeleteColumn}} or {{DeleteFamily}} markers accumulate in the flushed HFiles.
Read performance degrades linearly with the number of markers until the next
major compaction.
h2. Cause
During flush, {{MinorCompactionScanQueryMatcher}} includes all delete markers
unconditionally. Consolidation only happens as a side effect when a put is
interleaved between markers -- the put triggers {{SEEK_NEXT_COL}}, which skips
the rest. When markers are contiguous (no interleaved puts), or when
{{DeleteFamily}} markers sort before all puts, no seek is triggered and all
markers are written.
h2. Description
A {{DeleteColumn}} masks all versions of a column at timestamps <= its own, so an
older {{DeleteColumn}} for the same column is strictly redundant. The same holds
for {{DeleteFamily}}. Only the latest marker of a given scope is needed.
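This redundancy can be stated as a one-line predicate. The snippet below is a plain-Ruby model for illustration only, not HBase code:
{code:ruby}
# A DeleteColumn at timestamp T masks every version with timestamp <= T.
masks = ->(marker_ts, cell_ts) { cell_ts <= marker_ts }

newer_marker = 100
older_marker = 50

# Everything the older marker masks is already masked by the newer one,
# so the older marker contributes nothing.
older_is_redundant = (0..200).all? { |ts| !masks.(older_marker, ts) || masks.(newer_marker, ts) }
{code}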
h3. Consolidation that already works (/)
When {{DeleteColumn}} markers alternate with puts for the same column, the
flush scanner already consolidates them. The first {{DeleteColumn}} is
included, then the next put is recognized as column-deleted, which triggers
{{SEEK_NEXT_COL}} in the {{StoreScanner}}. This seek skips all remaining cells
for that column -- including older {{DeleteColumn}} markers. Only the latest
marker survives in the flushed HFile:
{code:ruby}
create 't', 'd'
10.times do
  put 't', 'row', 'd:foo', 'bar'
  sleep 0.001
  deleteall 't', 'row', 'd:foo'
  sleep 0.001
end
scan 't', RAW => true, VERSIONS => 100
# ROW COLUMN+CELL
# row column=d:foo, timestamp=2026-03-25T18:43:56.772, type=DeleteColumn
# row column=d:foo, timestamp=2026-03-25T18:43:56.771, value=bar
# row column=d:foo, timestamp=2026-03-25T18:43:56.769, type=DeleteColumn
# row column=d:foo, timestamp=2026-03-25T18:43:56.767, value=bar
# row column=d:foo, timestamp=2026-03-25T18:43:56.766, type=DeleteColumn
# row column=d:foo, timestamp=2026-03-25T18:43:56.764, value=bar
# row column=d:foo, timestamp=2026-03-25T18:43:56.762, type=DeleteColumn
# row column=d:foo, timestamp=2026-03-25T18:43:56.760, value=bar
# row column=d:foo, timestamp=2026-03-25T18:43:56.758, type=DeleteColumn
# row column=d:foo, timestamp=2026-03-25T18:43:56.756, value=bar
# row column=d:foo, timestamp=2026-03-25T18:43:56.754, type=DeleteColumn
# row column=d:foo, timestamp=2026-03-25T18:43:56.753, value=bar
# row column=d:foo, timestamp=2026-03-25T18:43:56.751, type=DeleteColumn
# row column=d:foo, timestamp=2026-03-25T18:43:56.749, value=bar
# row column=d:foo, timestamp=2026-03-25T18:43:56.747, type=DeleteColumn
# row column=d:foo, timestamp=2026-03-25T18:43:56.745, value=bar
# row column=d:foo, timestamp=2026-03-25T18:43:56.743, type=DeleteColumn
# row column=d:foo, timestamp=2026-03-25T18:43:56.741, value=bar
# row column=d:foo, timestamp=2026-03-25T18:43:56.739, type=DeleteColumn
# row column=d:foo, timestamp=2026-03-25T18:43:56.736, value=bar
# 1 row(s)
# Took 0.0037 seconds
flush 't'
scan 't', RAW => true, VERSIONS => 100
# ROW COLUMN+CELL
# row column=d:foo, timestamp=2026-03-25T18:43:56.772, type=DeleteColumn
# 1 row(s)
{code}
The 20 cells in the memstore (10 puts + 10 deletes) are reduced to a single cell
in the HFile.
h3. Case 1: Contiguous DeleteColumn markers are not consolidated (x)
However, this consolidation relies on a put being interleaved between delete
markers. When deletes are issued repeatedly without interleaved puts, the
resulting {{DeleteColumn}} markers are contiguous. No put exists to trigger the
seek, so all markers survive the flush:
{code:ruby}
create 't', 'd'
put 't', 'row', 'd:foo', 'bar'
10.times do
  sleep 0.001
  deleteall 't', 'row', 'd:foo'
end
scan 't', RAW => true, VERSIONS => 100
# ROW COLUMN+CELL
# row column=d:foo, timestamp=2026-03-25T18:38:33.360, type=DeleteColumn
# row column=d:foo, timestamp=2026-03-25T18:38:33.358, type=DeleteColumn
# row column=d:foo, timestamp=2026-03-25T18:38:33.356, type=DeleteColumn
# row column=d:foo, timestamp=2026-03-25T18:38:33.355, type=DeleteColumn
# row column=d:foo, timestamp=2026-03-25T18:38:33.353, type=DeleteColumn
# row column=d:foo, timestamp=2026-03-25T18:38:33.351, type=DeleteColumn
# row column=d:foo, timestamp=2026-03-25T18:38:33.349, type=DeleteColumn
# row column=d:foo, timestamp=2026-03-25T18:38:33.348, type=DeleteColumn
# row column=d:foo, timestamp=2026-03-25T18:38:33.345, type=DeleteColumn
# row column=d:foo, timestamp=2026-03-25T18:38:33.342, type=DeleteColumn
# row column=d:foo, timestamp=2026-03-25T18:38:33.254, value=bar
# 1 row(s)
# Took 0.0017 seconds
flush 't'
scan 't', RAW => true, VERSIONS => 1000
# ROW COLUMN+CELL
# row column=d:foo, timestamp=2026-03-25T18:38:33.360, type=DeleteColumn
# row column=d:foo, timestamp=2026-03-25T18:38:33.358, type=DeleteColumn
# row column=d:foo, timestamp=2026-03-25T18:38:33.356, type=DeleteColumn
# row column=d:foo, timestamp=2026-03-25T18:38:33.355, type=DeleteColumn
# row column=d:foo, timestamp=2026-03-25T18:38:33.353, type=DeleteColumn
# row column=d:foo, timestamp=2026-03-25T18:38:33.351, type=DeleteColumn
# row column=d:foo, timestamp=2026-03-25T18:38:33.349, type=DeleteColumn
# row column=d:foo, timestamp=2026-03-25T18:38:33.348, type=DeleteColumn
# row column=d:foo, timestamp=2026-03-25T18:38:33.345, type=DeleteColumn
# row column=d:foo, timestamp=2026-03-25T18:38:33.342, type=DeleteColumn
# 1 row(s)
{code}
Expected: only the most recent {{DeleteColumn}} marker should remain.
h3. Case 2: Contiguous DeleteFamily markers are not consolidated (x)
{{DeleteFamily}} markers have an empty qualifier and sort before all qualified
cells. Even when puts and family deletes are interleaved in time, all
{{DeleteFamily}} markers arrive as a contiguous block before any puts in the
scanner, so the seek optimization never kicks in:
{code:ruby}
create 't', 'd'
10.times do
  put 't', 'row', 'd:foo', 'bar'
  sleep 0.001
  deleteall 't', 'row'
  sleep 0.001
end
scan 't', RAW => true, VERSIONS => 100
# ROW COLUMN+CELL
# row column=d:, timestamp=2026-03-25T18:41:54.848, type=DeleteFamily
# row column=d:, timestamp=2026-03-25T18:41:54.843, type=DeleteFamily
# row column=d:, timestamp=2026-03-25T18:41:54.839, type=DeleteFamily
# row column=d:, timestamp=2026-03-25T18:41:54.835, type=DeleteFamily
# row column=d:, timestamp=2026-03-25T18:41:54.831, type=DeleteFamily
# row column=d:, timestamp=2026-03-25T18:41:54.827, type=DeleteFamily
# row column=d:, timestamp=2026-03-25T18:41:54.823, type=DeleteFamily
# row column=d:, timestamp=2026-03-25T18:41:54.817, type=DeleteFamily
# row column=d:, timestamp=2026-03-25T18:41:54.812, type=DeleteFamily
# row column=d:, timestamp=2026-03-25T18:41:54.807, type=DeleteFamily
# row column=d:foo, timestamp=2026-03-25T18:41:54.846, value=bar
# row column=d:foo, timestamp=2026-03-25T18:41:54.841, value=bar
# row column=d:foo, timestamp=2026-03-25T18:41:54.837, value=bar
# row column=d:foo, timestamp=2026-03-25T18:41:54.833, value=bar
# row column=d:foo, timestamp=2026-03-25T18:41:54.829, value=bar
# row column=d:foo, timestamp=2026-03-25T18:41:54.825, value=bar
# row column=d:foo, timestamp=2026-03-25T18:41:54.820, value=bar
# row column=d:foo, timestamp=2026-03-25T18:41:54.815, value=bar
# row column=d:foo, timestamp=2026-03-25T18:41:54.810, value=bar
# row column=d:foo, timestamp=2026-03-25T18:41:54.804, value=bar
# 1 row(s)
# Took 0.0024 seconds
flush 't'
scan 't', RAW => true, VERSIONS => 100
# ROW COLUMN+CELL
# row column=d:, timestamp=2026-03-25T18:41:54.848, type=DeleteFamily
# row column=d:, timestamp=2026-03-25T18:41:54.843, type=DeleteFamily
# row column=d:, timestamp=2026-03-25T18:41:54.839, type=DeleteFamily
# row column=d:, timestamp=2026-03-25T18:41:54.835, type=DeleteFamily
# row column=d:, timestamp=2026-03-25T18:41:54.831, type=DeleteFamily
# row column=d:, timestamp=2026-03-25T18:41:54.827, type=DeleteFamily
# row column=d:, timestamp=2026-03-25T18:41:54.823, type=DeleteFamily
# row column=d:, timestamp=2026-03-25T18:41:54.817, type=DeleteFamily
# row column=d:, timestamp=2026-03-25T18:41:54.812, type=DeleteFamily
# row column=d:, timestamp=2026-03-25T18:41:54.807, type=DeleteFamily
# 1 row(s)
{code}
Expected: only the most recent {{DeleteFamily}} marker should remain.
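The sort order behind this can be modeled in plain Ruby (an illustration of the ordering only, not the actual {{CellComparator}}): within a row and family, cells sort by qualifier ascending, then timestamp descending, and a {{DeleteFamily}} marker has an empty qualifier:
{code:ruby}
# Cells of one row/family as (qualifier, timestamp, type).
cells = [
  ['foo', 4, :Put], ['', 3, :DeleteFamily],
  ['foo', 2, :Put], ['', 1, :DeleteFamily],
]
# Qualifier ascending, then timestamp descending: the empty qualifier of
# DeleteFamily sorts before 'foo' regardless of timestamp.
sorted = cells.sort_by { |q, ts, _| [q, -ts] }
types  = sorted.map { |_, _, type| type }
# => [:DeleteFamily, :DeleteFamily, :Put, :Put]
{code}
Because the {{DeleteFamily}} markers form a contiguous block ahead of every put, no put can trigger the seek that would skip the older markers.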
h2. Proposed Fix
Add {{DeleteTracker.isRedundantDelete(ExtendedCell)}} with an implementation in
{{ScanDeleteTracker}} that checks whether a delete marker is already covered by
a previously tracked delete of equal or broader scope:
- {{DeleteFamily}} / {{DeleteFamilyVersion}}
-- Redundant if a {{DeleteFamily}} with a higher timestamp was already tracked.
- {{DeleteColumn}} / {{Delete}}
-- Redundant if a {{DeleteColumn}} for the same qualifier, or a
{{DeleteFamily}}, with a higher timestamp was already tracked.
{{MinorCompactionScanQueryMatcher.match()}} calls this check before
{{trackDelete()}} and seeks past the remaining cells covered by the previously
tracked delete.
Note: The fix is compatible with {{KEEP_DELETED_CELLS}}. When set to {{TRUE}},
{{trackDelete()}} does not populate the delete tracker, so
{{isRedundantDelete()}} always returns false and all markers are retained.
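The intended check can be sketched in plain Ruby (illustrative only; the actual change targets the Java {{DeleteTracker}} / {{ScanDeleteTracker}} classes, and every name below is this model's own):
{code:ruby}
class DeleteTrackerModel
  def initialize
    @family_ts = nil  # newest DeleteFamily timestamp tracked so far
    @column_ts = {}   # qualifier => newest DeleteColumn timestamp
  end

  # A marker is redundant if a previously tracked delete of equal or
  # broader scope carries a higher timestamp.
  def redundant?(type, qualifier, ts)
    return true if @family_ts && @family_ts > ts
    return false unless %i[DeleteColumn Delete].include?(type)
    column_ts = @column_ts[qualifier]
    !column_ts.nil? && column_ts > ts
  end

  # Only DeleteFamily and DeleteColumn widen coverage, so only they are tracked.
  def track(type, qualifier, ts)
    case type
    when :DeleteFamily then @family_ts = [@family_ts || ts, ts].max
    when :DeleteColumn then @column_ts[qualifier] = [@column_ts[qualifier] || ts, ts].max
    end
  end
end

tracker = DeleteTrackerModel.new
# The flush scanner sees markers newest-first, so the first marker of a
# scope is tracked and every older one in the same scope is redundant.
tracker.track(:DeleteColumn, 'foo', 100)
tracker.redundant?(:DeleteColumn, 'foo', 90)  # => true
{code}
In this model, when nothing has been tracked (as with {{KEEP_DELETED_CELLS}} set to {{TRUE}}), {{redundant?}} trivially returns false for every marker, matching the compatibility note above.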
h2. Benefits
This substantially improves read performance for workloads that frequently
delete and recreate records. Without the fix, each delete-recreate cycle adds
a {{DeleteColumn}} or {{DeleteFamily}} marker that persists through flush.
These markers accumulate across flushes and must be processed on every read,
so latency degrades over time in proportion to the number of delete cycles.
With this fix, flush consolidates redundant markers down to a single one,
keeping HFiles compact and read latency stable regardless of how many
delete-recreate cycles a record has been through.
h2. Test
The following hbase-shell script was used to measure the benefit of the proposed fix:
- https://gist.github.com/junegunn/bc0acf5269b8875330c0947dac7d0280
Three scenarios:
- DeleteFamily (naturally contiguous)
- DeleteColumnContiguous
- DeleteColumnInterleaved
-- Already consolidated without the fix
The results show that, with the fix, both the number of cells and the read
latency drop significantly on every flush.
h3. DeleteFamily
{code}
benchmark(:DeleteFamily) do
  T.put(PUT)
  T.delete(DF)
end
{code}
!image-2026-03-27-19-40-48-215.png|width=800!
- On every flush, the number of cells and the read latency drop sharply as
redundant {{DeleteFamily}} markers are consolidated.
- The flushed store files are significantly smaller.
h3. DeleteColumnContiguous
{code}
benchmark(:DeleteColumnContiguous) do |i|
  T.put(PUT) if i.zero?
  T.delete(DC)
  # Flush manually every 100,000 operations, since it is hard to fill up
  # the memstore with delete markers alone.
  flush 't' if (i % 100_000).zero? && i.positive?
end
{code}
!image-2026-03-27-19-41-01-664.png|width=800!
- Similar to the above. The delete markers are consolidated on every (manual)
flush, and the read latency drops sharply as a result.
- The flushed store files are also significantly smaller.
h3. DeleteColumnInterleaved
{code}
benchmark(:DeleteColumnInterleaved) do |i|
  T.put(PUT)
  T.delete(DC)
end
{code}
!image-2026-03-27-19-41-09-974.png|width=800!
- Already consolidated on flush without the fix, so there is no difference, as
expected.
h2. Related work
Even with this fix, redundant delete markers in the MemStore degrade read
performance until they are flushed and consolidated. HBASE-29039 addresses
this by changing the normal read path ({{NormalUserScanQueryMatcher}}).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)