bolao created HBASE-26302:
-----------------------------
Summary: Due to many procedures reload from master:store, hmaster
takes too much time to initialize
Key: HBASE-26302
URL: https://issues.apache.org/jira/browse/HBASE-26302
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 2.3.5
Reporter: bolao
Attachments: image-2021-09-28-11-33-23-375.png,
image-2021-09-28-11-33-41-612.png
When HBase restarts, we found that HMaster takes a long time to initialize.
We added some logging to the jars and found that it is stuck reloading procedures from
master:store in ProcedureExecutor's init method.
{panel:title=1. The ProcedureExecutor logs contain only the following}
2021-09-24 11:22:13 [master/fx-hd-sc-hbase-backup-0:16000:becomeActiveMaster]
INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor.init(569) -Starting
30 core workers (bigger of cpus/4 or 16) with max (burst) worker count=300
2021-09-24 11:22:13 [master/fx-hd-sc-hbase-backup-0:16000:becomeActiveMaster]
INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor.init(589) -Recovered
RegionProcedureStore lease in 1 msec
and there is no log line for the load step:
[https://github.com/apache/hbase/blob/cbebf85b3cfefc443ac8592908e8a6e95b020611/hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/ProcedureExecutor.java#L602]
{panel}
2. We added a log statement to
org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore#load:
{code:java}
loader.setMaxProcId(maxProcId);
LOG.info("there are {} procedures load from master:store", procs.size());
ProcedureTree tree = ProcedureTree.build(procs);
loader.load(tree.getValidProcs());
loader.handleCorrupted(tree.getCorruptedProcs());
{code}
Grepping the log then shows:
2021-09-24 11:23:16 [master/fx-hd-sc-hbase-backup-0:16000:becomeActiveMaster]
INFO
org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.load(294)
-there are 3357861 procedures load form master:store
3. We added a counter and a log statement to
org.apache.hadoop.hbase.procedure2.ProcedureExecutor#restoreLocks():
{code:java}
private void restoreLocks() {
  Set<Long> restored = new HashSet<>();
  Deque<Procedure<TEnvironment>> stack = new ArrayDeque<>();
  AtomicInteger num = new AtomicInteger();
  procedures.values().forEach(proc -> {
    for (;;) {
      LOG.info("this is num {}", num.incrementAndGet());
      if (restored.contains(proc.getProcId())) {
        restoreLocks(stack, restored);
        return;
      }
      if (!proc.hasParent()) {
        restoreLock(proc, restored);
        restoreLocks(stack, restored);
        return;
      }
      stack.push(proc);
      proc = procedures.get(proc.getParentProcId());
    }
  });
}
{code}
We found that by the time num reached about 160,000 (16W), roughly 20 minutes had passed.
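To put that rate in perspective, here is a minimal sketch of a throttled progress counter together with a linear extrapolation. LoadProgress and projectMinutes are hypothetical names, not HBase APIs, and the projection assumes the rate stays constant. Because the counter in restoreLocks() counts loop passes and every procedure needs at least one pass, the result is only a rough lower bound for all 3357861 procedures.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not part of HBase): throttled progress logging for a long loop. */
public class LoadProgress {
  private final long logEvery;
  private long count;

  public LoadProgress(long logEvery) {
    this.logEvery = logEvery;
  }

  /** Counts one iteration; returns true when a progress line should be logged. */
  public boolean tick() {
    count++;
    return count % logEvery == 0;
  }

  public long count() {
    return count;
  }

  /** Linear extrapolation: minutes needed for `total` items at the observed rate. */
  public static double projectMinutes(long done, double elapsedMinutes, long total) {
    if (done <= 0) {
      throw new IllegalArgumentException("done must be positive");
    }
    return elapsedMinutes * total / done;
  }

  public static void main(String[] args) {
    LoadProgress progress = new LoadProgress(100_000);
    long start = System.nanoTime();
    for (int i = 0; i < 1_000_000; i++) {
      // Log once per 100,000 iterations instead of on every pass.
      if (progress.tick()) {
        long ms = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.printf("processed %d items in %d ms%n", progress.count(), ms);
      }
    }
    // Figures from this report: about 160,000 passes in about 20 minutes,
    // and 3,357,861 procedures loaded from master:store.
    System.out.printf("projected: %.0f minutes%n",
        projectMinutes(160_000, 20.0, 3_357_861));
  }
}
```

With the figures above, the projection comes to roughly 420 minutes, that is about 7 hours, if nothing else changes.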
4. According to the HFile metadata, the earliest time recorded is 28 June.
!image-2021-09-28-11-33-41-612.png!
5. Reviewing the source code, the master:store TTL is left at its default
value (HConstants.FOREVER):
[https://github.com/apache/hbase/blob/fd3fdc08d1cd43eb3432a1a70d31c3aece6ecabe/hbase-server/src/main/java/org/apache/hadoop/hbase/master/region/MasterRegionFactory.java#L82]
and the scan of master:store does not apply any filter either:
[https://github.com/apache/hbase/blob/cbebf85b3cfefc443ac8592908e8a6e95b020611/hbase-server/src/main/java/org/apache/hadoop/hbase/procedure2/store/region/RegionProcedureStore.java#L263]
So we have some questions:
1. Is it reasonable for the master:store TTL to be HConstants.FOREVER?
2. Can we keep master:store small by deleting historical
procedures?
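To make question 2 concrete, here is a minimal sketch of the kind of cleanup we have in mind. It is not HBase code: ProcedureRetention, StoredProc and selectExpired are hypothetical names, and the rule (delete only finished procedures older than a retention window) is an assumption about what would be safe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch (not HBase code): pick finished procedures past a retention window. */
public class ProcedureRetention {

  /** Minimal stand-in for a stored procedure row; field names are illustrative only. */
  public static final class StoredProc {
    final long procId;
    final boolean finished;
    final long lastUpdateMillis;

    public StoredProc(long procId, boolean finished, long lastUpdateMillis) {
      this.procId = procId;
      this.finished = finished;
      this.lastUpdateMillis = lastUpdateMillis;
    }
  }

  /** Returns ids of finished procedures last updated more than retentionMillis ago. */
  public static List<Long> selectExpired(Map<Long, StoredProc> store,
                                         long nowMillis, long retentionMillis) {
    List<Long> expired = new ArrayList<>();
    for (StoredProc p : store.values()) {
      // Unfinished procedures are always kept: they must be resumed after a restart.
      if (p.finished && nowMillis - p.lastUpdateMillis > retentionMillis) {
        expired.add(p.procId);
      }
    }
    return expired;
  }

  public static void main(String[] args) {
    long now = 1_000_000L;
    Map<Long, StoredProc> store = new TreeMap<>();
    store.put(1L, new StoredProc(1L, true, now - 10_000)); // finished, old
    store.put(2L, new StoredProc(2L, false, now - 10_000)); // old but still running
    store.put(3L, new StoredProc(3L, true, now - 100)); // finished, recent
    System.out.println(selectExpired(store, now, 5_000));
  }
}
```

In this sketch only procedure 1 is selected, because it is both finished and older than the window; a running procedure is never selected regardless of its age.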
Looking forward to your reply. Thanks!
--
This message was sent by Atlassian Jira
(v8.3.4#803005)