[ https://issues.apache.org/jira/browse/HBASE-29875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18068042#comment-18068042 ]

Emil Kleszcz edited comment on HBASE-29875 at 3/24/26 3:11 PM:
---------------------------------------------------------------

I opened a PR against _branch-2.5_ for this issue.

The proposed fix is split into two commits:

1. {_}DefaultStoreFileManager{_}: guard _getUnneededFiles_ against null readers 
and skip those files instead of failing in the expired-file/TTL cleanup path
2. {_}SortedCompactionPolicy{_}: filter null-reader files before compaction 
candidate selection so healthy files can still be compacted
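The idea behind both commits can be sketched as follows. This is a minimal illustration only: the class and method shapes below are hypothetical stand-ins, not the actual HBase code from the PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for HStoreFile and its reader, just to show the guard.
class StoreFileReader {
    private final long maxTimestamp;
    StoreFileReader(long maxTimestamp) { this.maxTimestamp = maxTimestamp; }
    long getMaxTimestamp() { return maxTimestamp; }
}

class StoreFile {
    private final StoreFileReader reader; // may be null in the failure state
    StoreFile(StoreFileReader reader) { this.reader = reader; }
    StoreFileReader getReader() { return reader; }
}

public class NullReaderGuardSketch {
    // Commit 1 (DefaultStoreFileManager-style): in the TTL/expired-file path,
    // skip files whose reader is null instead of dereferencing it and
    // aborting the whole compaction request with an NPE.
    static List<StoreFile> getUnneededFiles(List<StoreFile> files, long ttlCutoffTs) {
        List<StoreFile> unneeded = new ArrayList<>();
        for (StoreFile sf : files) {
            StoreFileReader r = sf.getReader();
            if (r == null) {
                continue; // skip this file; leave it for later cleanup paths
            }
            if (r.getMaxTimestamp() < ttlCutoffTs) {
                unneeded.add(sf);
            }
        }
        return unneeded;
    }

    // Commit 2 (SortedCompactionPolicy-style): drop null-reader files before
    // candidate selection so the remaining healthy files can still be compacted.
    static List<StoreFile> filterCompactionCandidates(List<StoreFile> files) {
        List<StoreFile> candidates = new ArrayList<>();
        for (StoreFile sf : files) {
            if (sf.getReader() != null) {
                candidates.add(sf);
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<StoreFile> files = List.of(
            new StoreFile(new StoreFileReader(100L)),  // expired
            new StoreFile(null),                       // broken (null reader)
            new StoreFile(new StoreFileReader(900L))); // still live
        System.out.println(getUnneededFiles(files, 500L).size());     // 1
        System.out.println(filterCompactionCandidates(files).size()); // 2
    }
}
```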

Validation:
 - tested on our downstream 2.5.11-based build
 - Java 11
 - TTL enabled
 - frequent flushes
 - MOB workload
 - manual and automatic compactions trigger correctly and skip null readers 
when needed

Observed result:
 - The known null-reader failure paths no longer abort compaction
 - Healthy files can still be compacted
 - Storefile growth is reduced again in affected stores

Limitations:
This is a defensive mitigation for the unexpected _reader == null_ state; it 
does not address the root cause of why some _HStoreFile_ instances end up with 
a null reader. Healthy files can now be compacted, while null-reader files are 
skipped. The skipped files will only disappear later if the normal 
cleanup/expiration paths can eventually handle them.


> NPE in DefaultStoreFileManager.getUnneededFiles aborts compaction requests 
> causing unlimited storefile growth when TTL enabled
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-29875
>                 URL: https://issues.apache.org/jira/browse/HBASE-29875
>             Project: HBase
>          Issue Type: Bug
>          Components: Compaction
>    Affects Versions: 2.5.10
>            Reporter: Emil Kleszcz
>            Priority: Critical
>              Labels: pull-request-available
>
> *Summary:*
> We are observing compactions permanently stopping for some regions due to a 
> NullPointerException
> thrown inside {_}DefaultStoreFileManager.getUnneededFiles(){_}. After the 
> first failure, _requestCompaction()_
> never schedules compactions for the store, and storefiles grow indefinitely.
> *Impact:*
>  - storefile count grows from normal (~100k cluster-wide) to hundreds of 
> thousands
>  - individual regions reach 500–1000+ HFiles
>  - MemStoreFlusher repeatedly logs:
>   "Waited 100xx ms on a compaction to clean up 'too many store files'; 
> proceeding with flush"
>  - compaction queue stays empty because compaction requests fail before being 
> queued
>  - reopening the region immediately restores compaction
> *Stack trace:*
> {code:java}
> <2026-02-08T22:31:41.183+0100> <ERROR> <ipc.RpcServer>: <Unexpected throwable 
> object >
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hbase.regionserver.DefaultStoreFileManager.lambda$getUnneededFiles$3(DefaultStoreFileManager.java:235)
>   at java.util.stream.ReferencePipeline...
>   at 
> org.apache.hadoop.hbase.regionserver.DefaultStoreFileManager.getUnneededFiles(DefaultStoreFileManager.java:243)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore.removeUnneededFiles(HStore.java:1566)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore.requestCompaction(HStore.java:1447)
>   at org.apache.hadoop.hbase.regionserver.CompactSplit.requestCompaction(...)
>   at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.compactRegion(...){code}
> Failing line:
> {code:java}
> long fileTs = sf.getReader().getMaxTimestamp();{code}
> It appears 
> {code:java}
> sf.getReader(){code}
> can be null here.
> *Workload characteristics:*
>  - TTL enabled (2.5 days)
>  - many small flushes (~8–10 MB files)
>  - large storefile counts per region
>  - expired-file cleanup path ({_}getUnneededFiles{_}) triggered frequently
> *Behavior:*
>  - after this NPE, compactions for the store stop completely
>  - flush continues, so file count increases without bound
>  - moving the region (or disable/enable table) immediately fixes the issue
>   because readers are rebuilt and compactions resume
>  - triggering compaction or major compaction manually causes the same issue
> *Reproducer (observational):*
> 1) region accumulates many storefiles
> 2) TTL expired-file cleanup runs
> 3) NPE in getUnneededFiles()
> 4) compaction requests abort permanently for that store
> 5) reopen region -> compaction works again
> *Expected behavior:*
> Compaction should not fail due to a null StoreFile reader. The code should 
> guard against a null reader, attempt to initialize it, or skip the file 
> rather than throwing.
> *Related:*
> This looks similar in effect to HBASE-29348 (NPE during compaction leading to 
> hfiles not being cleaned),
> but occurs in a different call site (TTL/unneeded-files path instead of 
> date-tiered compaction).
> *Workaround currently used:*
>  - region reopen (move) or continuous balancing to force reopen
>  - this restores compaction but is only a mitigation
> Please advise whether this should be tracked under HBASE-29348 or as a 
> separate issue. I can prepare a patch that should be simple but I will need 
> some time to test it properly first.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
