[
https://issues.apache.org/jira/browse/HBASE-30036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Junegunn Choi reassigned HBASE-30036:
-------------------------------------
Assignee: Junegunn Choi
> Delete marker consolidation on memstore flush
> ---------------------------------------------
>
> Key: HBASE-30036
> URL: https://issues.apache.org/jira/browse/HBASE-30036
> Project: HBase
> Issue Type: Improvement
> Reporter: Junegunn Choi
> Assignee: Junegunn Choi
> Priority: Major
> Attachments: image-2026-03-27-19-39-23-218.png,
> image-2026-03-27-19-39-39-721.png, image-2026-03-27-19-40-48-215.png,
> image-2026-03-27-19-41-01-664.png, image-2026-03-27-19-41-09-974.png
>
>
> h2. Problem
> When the same column or row is repeatedly deleted and recreated, redundant
> {{DeleteColumn}} or {{DeleteFamily}} markers accumulate in the flushed
> HFiles. Read performance degrades linearly with the number of markers until
> the next major compaction.
> h2. Cause
> During flush, {{MinorCompactionScanQueryMatcher}} includes all delete markers
> unconditionally. Consolidation only happens as a side effect when a put is
> interleaved between markers -- the put triggers {{SEEK_NEXT_COL}}, which
> skips the rest. When markers are contiguous (no interleaved puts), or when
> {{DeleteFamily}} markers sort before all puts, no seek is triggered and all
> markers are written.
> h2. Description
> A {{DeleteColumn}} masks all versions of a column at timestamps <= its own.
> An older {{DeleteColumn}} for the same column is strictly redundant. Same for
> {{DeleteFamily}}. Only the latest marker is needed.
> h3. Consolidation that already works (/)
> When {{DeleteColumn}} markers alternate with puts for the same column, the
> flush scanner already consolidates them. The first {{DeleteColumn}} is
> included, then the next put is recognized as column-deleted, which triggers
> {{SEEK_NEXT_COL}} in the {{StoreScanner}}. This seek skips all remaining
> cells for that column -- including older {{DeleteColumn}} markers. Only the
> latest marker survives in the flushed HFile:
> {code:ruby}
> create 't', 'd'
> 10.times do
>   put 't', 'row', 'd:foo', 'bar'
>   sleep 0.001
>   deleteall 't', 'row', 'd:foo'
>   sleep 0.001
> end
> scan 't', RAW => true, VERSIONS => 100
> # ROW COLUMN+CELL
> # row column=d:foo, timestamp=2026-03-25T18:43:56.772, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:43:56.771, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:43:56.769, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:43:56.767, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:43:56.766, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:43:56.764, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:43:56.762, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:43:56.760, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:43:56.758, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:43:56.756, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:43:56.754, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:43:56.753, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:43:56.751, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:43:56.749, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:43:56.747, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:43:56.745, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:43:56.743, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:43:56.741, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:43:56.739, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:43:56.736, value=bar
> # 1 row(s)
> # Took 0.0037 seconds
> flush 't'
> scan 't', RAW => true, VERSIONS => 100
> # ROW COLUMN+CELL
> # row column=d:foo, timestamp=2026-03-25T18:43:56.772, type=DeleteColumn
> # 1 row(s)
> {code}
> 20 cells in the memstore (10 puts + 10 deletes) are reduced to a single cell
> in the HFile.
> h3. Case 1: Contiguous DeleteColumn markers are not consolidated (x)
> However, this consolidation relies on a put being interleaved between delete
> markers. When deletes are issued repeatedly without interleaved puts, the
> resulting {{DeleteColumn}} markers are contiguous. No put exists to trigger
> the seek, so all markers survive the flush:
> {code:ruby}
> create 't', 'd'
> put 't', 'row', 'd:foo', 'bar'
> 10.times do
>   sleep 0.001
>   deleteall 't', 'row', 'd:foo'
> end
> scan 't', RAW => true, VERSIONS => 100
> # ROW COLUMN+CELL
> # row column=d:foo, timestamp=2026-03-25T18:38:33.360, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.358, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.356, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.355, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.353, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.351, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.349, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.348, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.345, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.342, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.254, value=bar
> # 1 row(s)
> # Took 0.0017 seconds
> flush 't'
> scan 't', RAW => true, VERSIONS => 1000
> # ROW COLUMN+CELL
> # row column=d:foo, timestamp=2026-03-25T18:38:33.360, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.358, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.356, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.355, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.353, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.351, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.349, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.348, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.345, type=DeleteColumn
> # row column=d:foo, timestamp=2026-03-25T18:38:33.342, type=DeleteColumn
> # 1 row(s)
> {code}
> Expected: only the most recent {{DeleteColumn}} marker should remain.
> h3. Case 2: Contiguous DeleteFamily markers are not consolidated (x)
> {{DeleteFamily}} markers have an empty qualifier and sort before all
> qualified cells. Even when puts and family deletes are interleaved in time,
> all {{DeleteFamily}} markers arrive as a contiguous block before any puts in
> the scanner, so the seek optimization never kicks in:
> {code:ruby}
> create 't', 'd'
> 10.times do
>   put 't', 'row', 'd:foo', 'bar'
>   sleep 0.001
>   deleteall 't', 'row'
>   sleep 0.001
> end
> scan 't', RAW => true, VERSIONS => 100
> # ROW COLUMN+CELL
> # row column=d:, timestamp=2026-03-25T18:41:54.848, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.843, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.839, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.835, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.831, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.827, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.823, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.817, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.812, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.807, type=DeleteFamily
> # row column=d:foo, timestamp=2026-03-25T18:41:54.846, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:41:54.841, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:41:54.837, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:41:54.833, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:41:54.829, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:41:54.825, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:41:54.820, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:41:54.815, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:41:54.810, value=bar
> # row column=d:foo, timestamp=2026-03-25T18:41:54.804, value=bar
> # 1 row(s)
> # Took 0.0024 seconds
> flush 't'
> scan 't', RAW => true, VERSIONS => 100
> # ROW COLUMN+CELL
> # row column=d:, timestamp=2026-03-25T18:41:54.848, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.843, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.839, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.835, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.831, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.827, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.823, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.817, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.812, type=DeleteFamily
> # row column=d:, timestamp=2026-03-25T18:41:54.807, type=DeleteFamily
> # 1 row(s)
> {code}
> Expected: only the most recent {{DeleteFamily}} marker should remain.
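> This contiguous grouping follows directly from the cell sort order. As a
> minimal sketch (the real {{KeyValue}} comparator also orders by row, family,
> and type; the tuples below are illustrative, not the actual cell layout),
> cells within a row and family sort by qualifier bytes ascending, then by
> timestamp descending, so the empty qualifier of a {{DeleteFamily}} marker
> places it before every qualified cell regardless of timestamp:
> {code:ruby}
> # Simplified cell ordering: [qualifier, timestamp]. Within one row and
> # family, cells sort by qualifier (byte order, empty string first), then
> # newest timestamp first. Sketch only, not the actual HBase comparator.
> cells = [
>   ['foo', 104],  # put
>   ['',    105],  # DeleteFamily (empty qualifier)
>   ['foo', 102],  # put
>   ['',    103],  # DeleteFamily
> ]
> sorted = cells.sort_by { |qualifier, ts| [qualifier, -ts] }
> # All DeleteFamily markers come first, as one contiguous block:
> # [["", 105], ["", 103], ["foo", 104], ["foo", 102]]
> {code}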
> h2. Proposed Fix
> Add {{DeleteTracker.isRedundantDelete(ExtendedCell)}} with an implementation
> in {{ScanDeleteTracker}} that checks whether a delete marker is already
> covered by a previously tracked delete of equal or broader scope:
> - {{DeleteFamily}} / {{DeleteFamilyVersion}}
> -- Redundant if a {{DeleteFamily}} with a higher timestamp was already
> tracked.
> - {{DeleteColumn}} / {{Delete}}
> -- Redundant if a {{DeleteColumn}} for the same qualifier, or a
> {{DeleteFamily}}, with a higher timestamp was already tracked.
> {{MinorCompactionScanQueryMatcher.match()}} calls this check before
> {{trackDelete()}} and seeks past the remaining cells covered by the
> previously tracked delete.
> Note: The fix is compatible with {{KEEP_DELETED_CELLS}}. When set to
> {{TRUE}}, {{trackDelete()}} does not populate the delete tracker, so
> {{isRedundantDelete()}} always returns false and all markers are retained.
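> As a minimal sketch of these rules (the actual change is Java code in
> {{ScanDeleteTracker}}; the names and data shapes below are illustrative, not
> the real API), assuming markers arrive newest-first in scan order:
> {code:ruby}
> # Tracked state: timestamp of the last DeleteFamily seen, and per-qualifier
> # timestamps of DeleteColumn markers. Sketch of the redundancy rules only.
> tracker = { family_ts: nil, column_ts: {} }
>
> redundant = lambda do |marker|
>   case marker[:type]
>   when :DeleteFamily, :DeleteFamilyVersion
>     # Covered by a DeleteFamily with a higher timestamp already tracked.
>     t = tracker[:family_ts]
>     !t.nil? && t > marker[:ts]
>   when :DeleteColumn, :Delete
>     # Covered by a newer DeleteColumn for the same qualifier,
>     # or by a newer DeleteFamily.
>     t = tracker[:column_ts][marker[:qualifier]]
>     f = tracker[:family_ts]
>     (!t.nil? && t > marker[:ts]) || (!f.nil? && f > marker[:ts])
>   end
> end
>
> track = lambda do |marker|
>   case marker[:type]
>   when :DeleteFamily
>     tracker[:family_ts] = marker[:ts]
>   when :DeleteColumn
>     tracker[:column_ts][marker[:qualifier]] = marker[:ts]
>   end
> end
>
> # Ten contiguous DeleteColumn markers, newest first (Case 1): only the
> # newest survives.
> markers = (1..10).map { |i| { type: :DeleteColumn, qualifier: 'foo', ts: 100 - i } }
> kept = markers.reject do |m|
>   if redundant.call(m)
>     true
>   else
>     track.call(m)
>     false
>   end
> end
> # kept.size == 1, kept.first[:ts] == 99
> {code}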
> h2. Benefits
> This substantially improves read performance for workloads that frequently
> delete and recreate records. Without this fix, each delete-recreate cycle
> adds a {{DeleteColumn}} or {{DeleteFamily}} marker that persists through
> flush. These markers accumulate across flushes and must be processed on every
> read, causing latency to degrade over time proportional to the number of
> delete cycles.
> With this fix, flush consolidates redundant markers down to a single one,
> keeping HFiles compact and read latency stable regardless of how many
> delete-recreate cycles a record has been through.
> h2. Test
> Used this hbase-shell script to measure the benefit of the proposed fix:
> - https://gist.github.com/junegunn/bc0acf5269b8875330c0947dac7d0280
> Three scenarios:
> - DeleteFamily (naturally contiguous)
> - DeleteColumnContiguous
> - DeleteColumnInterleaved
> -- Already consolidated without the fix
> The tests show that with the fix, the number of cells and the read latency
> significantly decrease on every flush.
> h3. DeleteFamily
> {code}
> benchmark(:DeleteFamily) do
>   T.put(PUT)
>   T.delete(DF)
> end
> {code}
> !image-2026-03-27-19-40-48-215.png|width=800!
> - On every flush, the number of cells and the read latency drop sharply as
> redundant {{DeleteFamily}} markers are consolidated.
> - The flushed store files are significantly smaller.
> h3. DeleteColumnContiguous
> {code}
> benchmark(:DeleteColumnContiguous) do |i|
>   T.put(PUT) if i.zero?
>   T.delete(DC)
>   # Manually flush every 100,000 operations, since it is hard to fill up
>   # the memstore with delete markers alone.
>   flush 't' if (i % 100_000).zero? && i.positive?
> end
> {code}
> !image-2026-03-27-19-41-01-664.png|width=800!
> - Similar to the above. The delete markers are consolidated on every (manual)
> flush, and the read latency drops sharply as a result.
> - The flushed store files are also significantly smaller.
> h3. DeleteColumnInterleaved
> {code}
> benchmark(:DeleteColumnInterleaved) do |i|
>   T.put(PUT)
>   T.delete(DC)
> end
> {code}
> !image-2026-03-27-19-41-09-974.png|width=800!
> - Already consolidated on flush without the fix. No difference as expected.
> h2. Related work
> Even with this fix, redundant delete markers in the MemStore still degrade
> read performance until they are flushed and consolidated. HBASE-29039
> addresses this by changing the normal read path
> ({{NormalUserScanQueryMatcher}}).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)