Emil Kleszcz created HBASE-29875:
------------------------------------

             Summary: NPE in DefaultStoreFileManager.getUnneededFiles aborts 
compaction requests causing unlimited storefile growth when TTL enabled
                 Key: HBASE-29875
                 URL: https://issues.apache.org/jira/browse/HBASE-29875
             Project: HBase
          Issue Type: Bug
          Components: Compaction
    Affects Versions: 2.5.10
            Reporter: Emil Kleszcz


*Summary:*
We are observing compactions permanently stopping for some regions due to a 
NullPointerException
thrown inside {_}DefaultStoreFileManager.getUnneededFiles(){_}. After the first 
failure, _requestCompaction()_
never schedules compactions for the store, and storefiles grow indefinitely.

*Impact:*
- storefile count grows from normal (~100k cluster-wide) to hundreds of 
thousands
- individual regions reach 500–1000+ HFiles
- MemStoreFlusher repeatedly logs:
  "Waited 100xx ms on a compaction to clean up 'too many store files'; 
proceeding with flush"
- compaction queue stays empty because compaction requests fail before being 
queued
- reopening the region immediately restores compaction

*Stack trace:*
{code:java}
<2026-02-08T22:31:41.183+0100> <ERROR> <ipc.RpcServer>: <Unexpected throwable object>

java.lang.NullPointerException
  at 
org.apache.hadoop.hbase.regionserver.DefaultStoreFileManager.lambda$getUnneededFiles$3(DefaultStoreFileManager.java:235)
  at java.util.stream.ReferencePipeline...
  at 
org.apache.hadoop.hbase.regionserver.DefaultStoreFileManager.getUnneededFiles(DefaultStoreFileManager.java:243)
  at 
org.apache.hadoop.hbase.regionserver.HStore.removeUnneededFiles(HStore.java:1566)
  at 
org.apache.hadoop.hbase.regionserver.HStore.requestCompaction(HStore.java:1447)
  at org.apache.hadoop.hbase.regionserver.CompactSplit.requestCompaction(...)
  at org.apache.hadoop.hbase.regionserver.RSRpcServices.compactRegion(...){code}
*Failing line:*
{code:java}
long fileTs = sf.getReader().getMaxTimestamp();{code}
It appears {{sf.getReader()}} can be null here.

*Workload characteristics:*
- TTL enabled (2.5 days)
- many small flushes (~8–10 MB files)
- large storefile counts per region
- expired-file cleanup path ({_}getUnneededFiles{_}) triggered frequently

*Behavior:*
- after this NPE, compactions for the store stop completely
- flush continues, so file count increases without bound
- moving the region (or disable/enable table) immediately fixes the issue
  because readers are rebuilt and compactions resume
- manually triggering a compaction or major compaction hits the same NPE

*Reproducer (observational):*
1) region accumulates many storefiles
2) TTL expired-file cleanup runs
3) NPE in getUnneededFiles()
4) compaction requests abort permanently for that store
5) reopen region -> compaction works again

*Expected behavior:*
Compaction should not fail because a StoreFile reader is null. The code should guard against a null reader, attempt to reinitialize it, or skip the file rather than throwing.
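To illustrate the kind of guard we have in mind, here is a minimal, self-contained sketch. The {{StoreFile}}/{{Reader}} classes below are hypothetical stand-ins for HBase's {{HStoreFile}}/{{StoreFileReader}}, used only to show the filter logic; the real patch would live inside the stream lambda in {{DefaultStoreFileManager.getUnneededFiles}}:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class UnneededFilesGuard {

  // Hypothetical stand-in for HBase's StoreFileReader.
  static class Reader {
    private final long maxTimestamp;
    Reader(long maxTimestamp) { this.maxTimestamp = maxTimestamp; }
    long getMaxTimestamp() { return maxTimestamp; }
  }

  // Hypothetical stand-in for HStoreFile; getReader() may return null
  // when the reader has been closed or not yet opened.
  static class StoreFile {
    private final String name;
    private final Reader reader;
    StoreFile(String name, Reader reader) { this.name = name; this.reader = reader; }
    Reader getReader() { return reader; }
    String getName() { return name; }
  }

  // TTL expired-file filter with the proposed guard: a file whose reader is
  // null is simply kept for a later pass, instead of throwing an NPE that
  // aborts the whole compaction request.
  static List<StoreFile> getUnneededFiles(List<StoreFile> files, long maxTs) {
    return files.stream().filter(sf -> {
      Reader r = sf.getReader();
      if (r == null) {
        // The real patch would LOG.warn here and skip the file.
        return false;
      }
      return r.getMaxTimestamp() < maxTs;
    }).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<StoreFile> files = Arrays.asList(
      new StoreFile("expired", new Reader(10L)),
      new StoreFile("no-reader", null),   // previously caused the NPE
      new StoreFile("fresh", new Reader(99L)));
    List<StoreFile> unneeded = getUnneededFiles(files, 50L);
    System.out.println(unneeded.size() + " " + unneeded.get(0).getName());
    // prints: 1 expired
  }
}
```

Skipping the file only delays its cleanup until the next {{requestCompaction()}} pass, once the reader is available again, which seems strictly better than aborting the request.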

*Related:*
This looks similar in effect to 
[HBASE-29348|https://issues.apache.org/jira/browse/HBASE-29348] (NPE during 
compaction leading to hfiles not being cleaned),
but occurs in a different call site (TTL/unneeded-files path instead of 
date-tiered compaction).

*Workaround currently used:*
- region reopen (move) or continuous balancing to force reopen
- this restores compaction but is only a mitigation

Please advise whether this should be tracked under HBASE-29348 or as a separate issue. I can prepare a patch; the fix should be simple, but I will need some time to test it properly first.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
