Junegunn Choi created HBASE-30036:
-------------------------------------

             Summary: Delete marker consolidation on memstore flush
                 Key: HBASE-30036
                 URL: https://issues.apache.org/jira/browse/HBASE-30036
             Project: HBase
          Issue Type: Improvement
            Reporter: Junegunn Choi
         Attachments: image-2026-03-27-19-39-23-218.png, 
image-2026-03-27-19-39-39-721.png, image-2026-03-27-19-40-48-215.png, 
image-2026-03-27-19-41-01-664.png, image-2026-03-27-19-41-09-974.png

h2. Problem

When the same column or row is repeatedly deleted and recreated, redundant 
{{DeleteColumn}} or {{DeleteFamily}} markers accumulate in the flushed HFiles. 
Read performance degrades linearly with the number of markers until the next 
major compaction.

h2. Cause

During flush, {{MinorCompactionScanQueryMatcher}} includes all delete markers 
unconditionally. Consolidation only happens as a side effect when a put is 
interleaved between markers -- the put triggers {{SEEK_NEXT_COL}}, which skips 
the remaining cells of the column. When markers are contiguous (no interleaved 
puts), or when {{DeleteFamily}} markers sort before all puts, no seek is 
triggered and all markers are written out.

h2. Description

A {{DeleteColumn}} marker masks all versions of a column at timestamps less 
than or equal to its own, so an older {{DeleteColumn}} for the same column is 
strictly redundant. The same holds for {{DeleteFamily}}. Only the latest marker 
is needed.
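
The masking rule can be modeled in a few lines of plain Ruby (toy types for illustration only, not HBase classes):

{code:ruby}
# Toy model: a delete marker covers a cell when its scope matches and its
# timestamp is >= the cell's timestamp.
Marker = Struct.new(:type, :qualifier, :ts)
Cell   = Struct.new(:qualifier, :ts)

def covers?(marker, cell)
  case marker.type
  when :delete_family then marker.ts >= cell.ts
  when :delete_column then marker.qualifier == cell.qualifier && marker.ts >= cell.ts
  else false
  end
end

newer = Marker.new(:delete_column, 'foo', 200)
older = Marker.new(:delete_column, 'foo', 100)

# Any cell the older marker covers is also covered by the newer one,
# so keeping the older marker adds nothing.
cell = Cell.new('foo', 50)
covers?(older, cell)  # => true
covers?(newer, cell)  # => true
{code}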

h3. Consolidation that already works (/)

When {{DeleteColumn}} markers alternate with puts for the same column, the 
flush scanner already consolidates them. The first {{DeleteColumn}} is 
included, then the next put is recognized as column-deleted, which triggers 
{{SEEK_NEXT_COL}} in the {{StoreScanner}}. This seek skips all remaining cells 
for that column -- including older {{DeleteColumn}} markers. Only the latest 
marker survives in the flushed HFile:

{code:ruby}
create 't', 'd'

10.times do
  put 't', 'row', 'd:foo', 'bar'
  sleep 0.001
  deleteall 't', 'row', 'd:foo'
  sleep 0.001
end

scan 't', RAW => true, VERSIONS => 100
  # ROW  COLUMN+CELL
  #  row column=d:foo, timestamp=2026-03-25T18:43:56.772, type=DeleteColumn
  #  row column=d:foo, timestamp=2026-03-25T18:43:56.771, value=bar
  #  row column=d:foo, timestamp=2026-03-25T18:43:56.769, type=DeleteColumn
  #  row column=d:foo, timestamp=2026-03-25T18:43:56.767, value=bar
  #  row column=d:foo, timestamp=2026-03-25T18:43:56.766, type=DeleteColumn
  #  row column=d:foo, timestamp=2026-03-25T18:43:56.764, value=bar
  #  row column=d:foo, timestamp=2026-03-25T18:43:56.762, type=DeleteColumn
  #  row column=d:foo, timestamp=2026-03-25T18:43:56.760, value=bar
  #  row column=d:foo, timestamp=2026-03-25T18:43:56.758, type=DeleteColumn
  #  row column=d:foo, timestamp=2026-03-25T18:43:56.756, value=bar
  #  row column=d:foo, timestamp=2026-03-25T18:43:56.754, type=DeleteColumn
  #  row column=d:foo, timestamp=2026-03-25T18:43:56.753, value=bar
  #  row column=d:foo, timestamp=2026-03-25T18:43:56.751, type=DeleteColumn
  #  row column=d:foo, timestamp=2026-03-25T18:43:56.749, value=bar
  #  row column=d:foo, timestamp=2026-03-25T18:43:56.747, type=DeleteColumn
  #  row column=d:foo, timestamp=2026-03-25T18:43:56.745, value=bar
  #  row column=d:foo, timestamp=2026-03-25T18:43:56.743, type=DeleteColumn
  #  row column=d:foo, timestamp=2026-03-25T18:43:56.741, value=bar
  #  row column=d:foo, timestamp=2026-03-25T18:43:56.739, type=DeleteColumn
  #  row column=d:foo, timestamp=2026-03-25T18:43:56.736, value=bar
  # 1 row(s)
  # Took 0.0037 seconds

flush 't'
scan 't', RAW => true, VERSIONS => 100
  # ROW  COLUMN+CELL
  #  row column=d:foo, timestamp=2026-03-25T18:43:56.772, type=DeleteColumn
  # 1 row(s)
{code}

20 cells in the memstore (10 puts + 10 deletes) are reduced to a single cell in 
the HFile.

h3. Case 1: Contiguous DeleteColumn markers are not consolidated (x)

However, this consolidation relies on a put being interleaved between delete 
markers. When deletes are issued repeatedly without interleaved puts, the 
resulting {{DeleteColumn}} markers are contiguous. No put exists to trigger the 
seek, so all markers survive the flush:

{code:ruby}
create 't', 'd'

put 't', 'row', 'd:foo', 'bar'
10.times do
  sleep 0.001
  deleteall 't', 'row', 'd:foo'
end

scan 't', RAW => true, VERSIONS => 100
  # ROW  COLUMN+CELL
  #  row column=d:foo, timestamp=2026-03-25T18:38:33.360, type=DeleteColumn
  #  row column=d:foo, timestamp=2026-03-25T18:38:33.358, type=DeleteColumn
  #  row column=d:foo, timestamp=2026-03-25T18:38:33.356, type=DeleteColumn
  #  row column=d:foo, timestamp=2026-03-25T18:38:33.355, type=DeleteColumn
  #  row column=d:foo, timestamp=2026-03-25T18:38:33.353, type=DeleteColumn
  #  row column=d:foo, timestamp=2026-03-25T18:38:33.351, type=DeleteColumn
  #  row column=d:foo, timestamp=2026-03-25T18:38:33.349, type=DeleteColumn
  #  row column=d:foo, timestamp=2026-03-25T18:38:33.348, type=DeleteColumn
  #  row column=d:foo, timestamp=2026-03-25T18:38:33.345, type=DeleteColumn
  #  row column=d:foo, timestamp=2026-03-25T18:38:33.342, type=DeleteColumn
  #  row column=d:foo, timestamp=2026-03-25T18:38:33.254, value=bar
  # 1 row(s)
  # Took 0.0017 seconds

flush 't'
scan 't', RAW => true, VERSIONS => 1000
  # ROW  COLUMN+CELL
  #  row column=d:foo, timestamp=2026-03-25T18:38:33.360, type=DeleteColumn
  #  row column=d:foo, timestamp=2026-03-25T18:38:33.358, type=DeleteColumn
  #  row column=d:foo, timestamp=2026-03-25T18:38:33.356, type=DeleteColumn
  #  row column=d:foo, timestamp=2026-03-25T18:38:33.355, type=DeleteColumn
  #  row column=d:foo, timestamp=2026-03-25T18:38:33.353, type=DeleteColumn
  #  row column=d:foo, timestamp=2026-03-25T18:38:33.351, type=DeleteColumn
  #  row column=d:foo, timestamp=2026-03-25T18:38:33.349, type=DeleteColumn
  #  row column=d:foo, timestamp=2026-03-25T18:38:33.348, type=DeleteColumn
  #  row column=d:foo, timestamp=2026-03-25T18:38:33.345, type=DeleteColumn
  #  row column=d:foo, timestamp=2026-03-25T18:38:33.342, type=DeleteColumn
  # 1 row(s)
{code}

Expected: only the most recent {{DeleteColumn}} marker should remain.

h3. Case 2: Contiguous DeleteFamily markers are not consolidated (x)

{{DeleteFamily}} markers have an empty qualifier and sort before all qualified 
cells. Even when puts and family deletes are interleaved in time, all 
{{DeleteFamily}} markers arrive as a contiguous block before any puts in the 
scanner, so the seek optimization never kicks in:

{code:ruby}
create 't', 'd'

10.times do
  put 't', 'row', 'd:foo', 'bar'
  sleep 0.001
  deleteall 't', 'row'
  sleep 0.001
end

scan 't', RAW => true, VERSIONS => 100
  # ROW  COLUMN+CELL
  #  row column=d:, timestamp=2026-03-25T18:41:54.848, type=DeleteFamily
  #  row column=d:, timestamp=2026-03-25T18:41:54.843, type=DeleteFamily
  #  row column=d:, timestamp=2026-03-25T18:41:54.839, type=DeleteFamily
  #  row column=d:, timestamp=2026-03-25T18:41:54.835, type=DeleteFamily
  #  row column=d:, timestamp=2026-03-25T18:41:54.831, type=DeleteFamily
  #  row column=d:, timestamp=2026-03-25T18:41:54.827, type=DeleteFamily
  #  row column=d:, timestamp=2026-03-25T18:41:54.823, type=DeleteFamily
  #  row column=d:, timestamp=2026-03-25T18:41:54.817, type=DeleteFamily
  #  row column=d:, timestamp=2026-03-25T18:41:54.812, type=DeleteFamily
  #  row column=d:, timestamp=2026-03-25T18:41:54.807, type=DeleteFamily
  #  row column=d:foo, timestamp=2026-03-25T18:41:54.846, value=bar
  #  row column=d:foo, timestamp=2026-03-25T18:41:54.841, value=bar
  #  row column=d:foo, timestamp=2026-03-25T18:41:54.837, value=bar
  #  row column=d:foo, timestamp=2026-03-25T18:41:54.833, value=bar
  #  row column=d:foo, timestamp=2026-03-25T18:41:54.829, value=bar
  #  row column=d:foo, timestamp=2026-03-25T18:41:54.825, value=bar
  #  row column=d:foo, timestamp=2026-03-25T18:41:54.820, value=bar
  #  row column=d:foo, timestamp=2026-03-25T18:41:54.815, value=bar
  #  row column=d:foo, timestamp=2026-03-25T18:41:54.810, value=bar
  #  row column=d:foo, timestamp=2026-03-25T18:41:54.804, value=bar
  # 1 row(s)
  # Took 0.0024 seconds

flush 't'
scan 't', RAW => true, VERSIONS => 100
  # ROW  COLUMN+CELL
  #  row column=d:, timestamp=2026-03-25T18:41:54.848, type=DeleteFamily
  #  row column=d:, timestamp=2026-03-25T18:41:54.843, type=DeleteFamily
  #  row column=d:, timestamp=2026-03-25T18:41:54.839, type=DeleteFamily
  #  row column=d:, timestamp=2026-03-25T18:41:54.835, type=DeleteFamily
  #  row column=d:, timestamp=2026-03-25T18:41:54.831, type=DeleteFamily
  #  row column=d:, timestamp=2026-03-25T18:41:54.827, type=DeleteFamily
  #  row column=d:, timestamp=2026-03-25T18:41:54.823, type=DeleteFamily
  #  row column=d:, timestamp=2026-03-25T18:41:54.817, type=DeleteFamily
  #  row column=d:, timestamp=2026-03-25T18:41:54.812, type=DeleteFamily
  #  row column=d:, timestamp=2026-03-25T18:41:54.807, type=DeleteFamily
  # 1 row(s)
{code}

Expected: only the most recent {{DeleteFamily}} marker should remain.
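
The ordering behind this case can be sketched with a simplified comparator: within one row and column family, cells sort by qualifier bytes ascending, then by timestamp descending, and {{DeleteFamily}} markers carry an empty qualifier (simplified model, not the actual {{CellComparator}}):

{code:ruby}
# Simplified cell ordering within a single row and column family.
cells = [
  { qualifier: 'foo', ts: 4, type: :put },
  { qualifier: '',    ts: 5, type: :delete_family },
  { qualifier: 'foo', ts: 2, type: :put },
  { qualifier: '',    ts: 3, type: :delete_family },
]

# Qualifier ascending, then timestamp descending -- the empty qualifier
# sorts first, so every DeleteFamily precedes every put regardless of
# the order the mutations were issued in.
sorted = cells.sort_by { |c| [c[:qualifier], -c[:ts]] }
sorted.map { |c| c[:type] }
# => [:delete_family, :delete_family, :put, :put]
{code}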

h2. Proposed Fix

Add {{DeleteTracker.isRedundantDelete(ExtendedCell)}} with an implementation in 
{{ScanDeleteTracker}} that checks whether a delete marker is already covered by 
a previously tracked delete of equal or broader scope:

- {{DeleteFamily}} / {{DeleteFamilyVersion}}
-- Redundant if a {{DeleteFamily}} with a higher timestamp was already tracked.
- {{DeleteColumn}} / {{Delete}}
-- Redundant if a {{DeleteColumn}} for the same qualifier, or a 
{{DeleteFamily}}, with a higher timestamp was already tracked.

{{MinorCompactionScanQueryMatcher.match()}} calls this check before 
{{trackDelete()}} and seeks past the remaining cells covered by the previously 
tracked delete.

Note: The fix is compatible with {{KEEP_DELETED_CELLS}}. When set to {{TRUE}}, 
{{trackDelete()}} does not populate the delete tracker, so 
{{isRedundantDelete()}} always returns false and all markers are retained.
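
A rough model of the intended check in plain Ruby (the real change would live in the Java {{ScanDeleteTracker}}; the class and method names below are illustrative only):

{code:ruby}
# Toy delete tracker mirroring the proposed isRedundantDelete() logic.
class ToyDeleteTracker
  def initialize
    @family_ts = nil   # newest DeleteFamily timestamp tracked so far
    @column_ts = {}    # qualifier => newest DeleteColumn timestamp
  end

  # A marker is redundant if a previously tracked delete of equal or
  # broader scope has a strictly higher timestamp.
  def redundant_delete?(type, qualifier, ts)
    covered_by_family = !@family_ts.nil? && @family_ts > ts
    case type
    when :delete_family, :delete_family_version
      covered_by_family
    when :delete_column, :delete
      covered_by_family ||
        (@column_ts.key?(qualifier) && @column_ts[qualifier] > ts)
    else
      false
    end
  end

  def track_delete(type, qualifier, ts)
    case type
    when :delete_family
      @family_ts = ts if @family_ts.nil? || ts > @family_ts
    when :delete_column
      @column_ts[qualifier] = ts if @column_ts.fetch(qualifier, -1) < ts
    end
  end
end

tracker = ToyDeleteTracker.new
tracker.track_delete(:delete_family, nil, 10)
tracker.redundant_delete?(:delete_family, nil, 8)   # => true  (older family marker)
tracker.redundant_delete?(:delete_column, 'foo', 7) # => true  (covered by family)
tracker.redundant_delete?(:delete_family, nil, 12)  # => false (newer marker)
{code}

In the matcher, a marker that tests redundant would not be included in the flush output; instead the scanner seeks past the cells already covered by the tracked delete.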

h2. Benefits

This substantially improves read performance for workloads that frequently 
delete and recreate records. Without this fix, each delete-recreate cycle adds 
a {{DeleteColumn}} or {{DeleteFamily}} marker that persists through flush. 
These markers accumulate across flushes and must be processed on every read, 
causing latency to degrade over time proportional to the number of delete 
cycles.

With this fix, flush consolidates redundant markers down to a single one, 
keeping HFiles compact and read latency stable regardless of how many 
delete-recreate cycles a record has been through.

h2. Test

I used the following hbase-shell script to measure the benefit of the proposed 
fix.

- https://gist.github.com/junegunn/bc0acf5269b8875330c0947dac7d0280

Three scenarios:
- DeleteFamily (naturally contiguous)
- DeleteColumnContiguous
- DeleteColumnInterleaved
-- Already consolidated without the fix

The tests show that, with the fix, the number of cells and the read latency 
decrease significantly on every flush.

h3. DeleteFamily

{code}
benchmark(:DeleteFamily) do
  T.put(PUT)
  T.delete(DF)
end
{code}

 !image-2026-03-27-19-40-48-215.png|width=800! 

- On every flush, the number of cells and the read latency drop sharply as 
redundant {{DeleteFamily}} markers are consolidated.
- The flushed store files are significantly smaller.

h3. DeleteColumnContiguous

{code}
benchmark(:DeleteColumnContiguous) do |i|
  T.put(PUT) if i.zero?
  T.delete(DC)
  # Flush manually after every 100,000 operations because it's hard to
  # fill up the memstore with delete markers alone.
  flush 't' if (i % 100_000).zero? && i.positive?
end
{code}

 !image-2026-03-27-19-41-01-664.png|width=800! 

- Similar to the above. The delete markers are consolidated on every (manual) 
flush, and the read latency drops sharply as a result.
- The flushed store files are also significantly smaller.

h3. DeleteColumnInterleaved


{code}
benchmark(:DeleteColumnInterleaved) do |i|
  T.put(PUT)
  T.delete(DC)
end
{code}

 !image-2026-03-27-19-41-09-974.png|width=800! 

- Already consolidated on flush without the fix. No difference as expected.

h2. Related work

Even with this fix, redundant delete markers in the MemStore still hurt read 
performance until they are flushed and consolidated. HBASE-29039 addresses this 
by changing the normal read path ({{NormalUserScanQueryMatcher}}).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
