junegunn commented on PR #8001:
URL: https://github.com/apache/hbase/pull/8001#issuecomment-4181375884

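   _However_, even with qualifier comparison, false positives remain: a run of 
exactly N consecutive redundant DCs for the same qualifier still triggers an 
inefficient seek. For example, with a threshold of 3 (the N shown on each line 
is the running count for that qualifier):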
   
   - DC(q1) N=1 skip
   - DC(q1) N=2 skip
   - DC(q1) N=3 seek (false positive)
   - DC(q2) N=1 skip
   - DC(q2) N=2 skip
   - DC(q2) N=3 seek (false positive)
   - DC(q3) N=1 skip
   - DC(q3) N=2 skip
   - ...
   
   This should be rare in practice. But if overhead is a concern, increasing N 
is the only option.
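
To make the pattern above concrete, here is a small plain-Ruby toy model. It is not the HBase code from this PR: `count_seeks` is a hypothetical helper, and it assumes a seek fires exactly when the run of consecutive DCs for one qualifier reaches the threshold. The real tracker may differ in details.

```ruby
# Toy model only, not HBase code. `count_seeks` is a hypothetical helper.
# Assumption: a seek is issued when the number of consecutive DCs seen for
# the same qualifier reaches `threshold`; every earlier DC is a plain skip.
def count_seeks(qualifiers, threshold)
  seeks = 0
  run = 0
  prev = nil
  qualifiers.each do |q|
    # Extend the run on a repeated qualifier, otherwise start a new run.
    run = q == prev ? run + 1 : 1
    prev = q
    seeks += 1 if run == threshold
  end
  seeks
end

# Same shape as the benchmark below: three DCs per qualifier.
markers = (0...300).map { |i| (i / 3).to_s }

seeks_n3  = count_seeks(markers, 3)   # each qualifier's third DC reaches the threshold
seeks_n10 = count_seeks(markers, 10)  # no run is long enough to reach the threshold
puts "N=3: #{seeks_n3} seeks, N=10: #{seeks_n10} seeks"
```

Under these assumptions, every qualifier costs one wasted seek at a threshold of 3, while a threshold of 10 never fires for runs this short.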
   
   Here is a benchmark for this case, with an additional N=10 build:
   
   ```ruby
   benchmark(:DeleteColumnFalsePositiveEvery3) do |i|
     T.put(PUT) if i.zero?
   
     dc = Delete.new(ROW).addColumns(CF, (i / 3).to_s.to_java_bytes)
     T.delete(dc)
   
     flush 't' if (i % 100_000).zero? && i.positive?
   end
   ```
   
   <img width="1152" height="960" alt="image" 
src="https://github.com/user-attachments/assets/4fb939c6-4865-4da2-88ee-f25ad1312f95"
 />
   
   As expected, qualifier comparison does not help in this case, but a larger 
threshold (N=10) significantly reduces the overhead. Given the rarity of such 
scenarios, the overhead against master is acceptable.
   
   I believe 10 is a good threshold. This optimization targets cases where 
hundreds of thousands or millions of delete markers are swept, so the cost of 
10 extra skips is negligible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.