[
https://issues.apache.org/jira/browse/HBASE-29875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emil Kleszcz updated HBASE-29875:
---------------------------------
Description:
*Summary:*
We are observing compactions permanently stopping for some regions due to a
NullPointerException
thrown inside {_}DefaultStoreFileManager.getUnneededFiles(){_}. After the first
failure, _requestCompaction()_
never schedules compactions for the store, and storefiles grow indefinitely.
*Impact:*
- storefile count grows from normal (~100k cluster-wide) to hundreds of
thousands
- individual regions reach 500–1000+ HFiles
- MemStoreFlusher repeatedly logs:
"Waited 100xx ms on a compaction to clean up 'too many store files';
proceeding with flush"
- compaction queue stays empty because compaction requests fail before being
queued
- reopening the region immediately restores compaction
*Stack trace:*
{code:java}
<2026-02-08T22:31:41.183+0100> <ERROR> <ipc.RpcServer>: <Unexpected throwable object >
java.lang.NullPointerException
  at org.apache.hadoop.hbase.regionserver.DefaultStoreFileManager.lambda$getUnneededFiles$3(DefaultStoreFileManager.java:235)
  at java.util.stream.ReferencePipeline...
  at org.apache.hadoop.hbase.regionserver.DefaultStoreFileManager.getUnneededFiles(DefaultStoreFileManager.java:243)
  at org.apache.hadoop.hbase.regionserver.HStore.removeUnneededFiles(HStore.java:1566)
  at org.apache.hadoop.hbase.regionserver.HStore.requestCompaction(HStore.java:1447)
  at org.apache.hadoop.hbase.regionserver.CompactSplit.requestCompaction(...)
  at org.apache.hadoop.hbase.regionserver.RSRpcServices.compactRegion(...){code}
Failing line (DefaultStoreFileManager.java:235):
{code:java}
long fileTs = sf.getReader().getMaxTimestamp();{code}
It appears that {{sf.getReader()}} can be null here.
*Workload characteristics:*
- TTL enabled (2.5 days; a config sketch follows this list)
- many small flushes (~8–10 MB files)
- large storefile counts per region
- expired-file cleanup path ({_}getUnneededFiles{_}) triggered frequently
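For context only, a minimal sketch of how a 2.5-day TTL would typically be set on a column family via the Java Admin API; the table name {{t}} and family name {{cf}} are placeholders, not our actual schema:
{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class TtlConfigExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
        Admin admin = conn.getAdmin()) {
      // 2.5 days expressed in seconds, matching the TTL described above.
      int ttlSeconds = (int) (2.5 * 24 * 60 * 60); // 216000
      ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder
        .newBuilder(Bytes.toBytes("cf")) // placeholder family name
        .setTimeToLive(ttlSeconds)
        .build();
      // Apply the TTL to an existing table ("t" is a placeholder table name).
      admin.modifyColumnFamily(TableName.valueOf("t"), cf);
    }
  }
}
{code}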
*Behavior:*
- after this NPE, compactions for the store stop completely
- flush continues, so file count increases without bound
- moving the region (or disable/enable table) immediately fixes the issue
because readers are rebuilt and compactions resume
- triggering compaction or major compaction manually causes the same issue
*Reproducer (observational):*
1) region accumulates many storefiles
2) TTL expired-file cleanup runs
3) NPE in getUnneededFiles()
4) compaction requests abort permanently for that store
5) reopen region -> compaction works again
*Expected behavior:*
Compaction should not fail due to a null StoreFile reader. The code should
guard against a null reader, attempt to initialize it, or skip the file rather
than throwing.
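To illustrate the kind of guard meant above, here is a minimal sketch that simply skips files whose reader is gone. The class and method names below are hypothetical and this is not the actual patch, just the shape a defensive filter could take:
{code:java}
import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.hadoop.hbase.regionserver.HStoreFile;
import org.apache.hadoop.hbase.regionserver.StoreFileReader;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Hypothetical illustration of the null guard; not the actual DefaultStoreFileManager code. */
public final class NullSafeExpiredFileFilter {
  private static final Logger LOG = LoggerFactory.getLogger(NullSafeExpiredFileFilter.class);

  private NullSafeExpiredFileFilter() {
  }

  /** Selects TTL-expired files, skipping any file whose reader is currently null. */
  public static Collection<HStoreFile> selectExpiredFiles(List<HStoreFile> files, long maxTs,
      List<HStoreFile> filesCompacting) {
    // Never select the last file (it carries the maximum seqid), mirroring getUnneededFiles().
    return files.stream().limit(Math.max(0, files.size() - 1)).filter(sf -> {
      StoreFileReader reader = sf.getReader();
      if (reader == null) {
        // Reader can apparently be null here (e.g. closed concurrently); skip the file
        // instead of letting the whole compaction request die with an NPE.
        LOG.warn("Store file {} has a null reader, skipping TTL expiry check", sf);
        return false;
      }
      long fileTs = reader.getMaxTimestamp();
      return fileTs < maxTs && !filesCompacting.contains(sf);
    }).collect(Collectors.toList());
  }
}
{code}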
*Related:*
This looks similar in effect to
[HBASE-29348|https://issues.apache.org/jira/browse/HBASE-29348] (an NPE during
compaction leading to hfiles not being cleaned up), but it occurs at a
different call site (the TTL/unneeded-files path instead of date-tiered
compaction).
*Workaround currently used:*
- region reopen (move; see the Admin API sketch after this list) or continuous
balancing to force a reopen
- this restores compaction but is only a mitigation
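A minimal sketch of the move-based workaround via the Java Admin API, assuming we reopen every region of the affected table; the table name {{t}} is a placeholder, and in practice we only move the regions whose stores are stuck:
{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionInfo;

public class ReopenRegionsWorkaround {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
        Admin admin = conn.getAdmin()) {
      // "t" is a placeholder table name.
      for (RegionInfo region : admin.getRegions(TableName.valueOf("t"))) {
        // move() with no destination lets the master pick a server; the region is
        // closed and reopened, which rebuilds the store file readers.
        admin.move(region.getEncodedNameAsBytes());
      }
    }
  }
}
{code}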
Please advise whether this should be tracked under HBASE-29348 or as a separate
issue. I can prepare a patch (it should be simple), but I will need some time
to test it properly first.
> NPE in DefaultStoreFileManager.getUnneededFiles aborts compaction requests
> causing unlimited storefile growth when TTL enabled
> ------------------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-29875
> URL: https://issues.apache.org/jira/browse/HBASE-29875
> Project: HBase
> Issue Type: Bug
> Components: Compaction
> Affects Versions: 2.5.10
> Reporter: Emil Kleszcz
> Priority: Critical
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)