[
https://issues.apache.org/jira/browse/HBASE-21165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
stack updated HBASE-21165:
--------------------------
Description:
I have a Master that crashed on a large cluster with hundreds of outstanding
Procedure WALs and (probably --TBD) a few million Procedures to load. It is
taking a long time (two hours)...
There were STUCK procedures that were preventing clean-up of the old WALs.
I can tell we are making progress by enabling TRACE on the Procedure Store.
Better would be an emission as we made progress through the files with an
output after every so many procedures loaded.
Then, post-load, there is a long time spent sorting-out the Procedure image...
We are in finish for ages doing stuff like:
{code}
"master/vc0207:22001:becomeActiveMaster" #98 daemon prio=5 os_prio=0
tid=0x0000000000d31800 nid=0x1efc0 runnable [0x00007f0a3c17d000]
java.lang.Thread.State: RUNNABLE
at
org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormatReader$WalProcedureMap.removeFromMap(ProcedureWALFormatReader.java:837)
at
org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormatReader$WalProcedureMap.fetchReady(ProcedureWALFormatReader.java:614)
at
org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormatReader.finish(ProcedureWALFormatReader.java:201)
at
org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormat.load(ProcedureWALFormat.java:94)
at
org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.load(WALProcedureStore.java:426)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.load(ProcedureExecutor.java:382)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.init(ProcedureExecutor.java:663)
at
org.apache.hadoop.hbase.master.HMaster.createProcedureExecutor(HMaster.java:1335)
at
org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:878)
at
org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2119)
at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:567)
at
org.apache.hadoop.hbase.master.HMaster$$Lambda$42/1930759883.run(Unknown Source)
at java.lang.Thread.run(Thread.java:748)
{code}
and....
{code}
"master/vc0207:22001:becomeActiveMaster" #98 daemon prio=5 os_prio=0
tid=0x0000000000d31800 nid=0x1efc0 runnable [0x00007f0a3c17d000]
java.lang.Thread.State: RUNNABLE
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.security.Provider$Service.newInstance(Provider.java:1595)
at sun.security.jca.GetInstance.getInstance(GetInstance.java:236)
at sun.security.jca.GetInstance.getInstance(GetInstance.java:164)
at java.security.Security.getImpl(Security.java:695)
at java.security.MessageDigest.getInstance(MessageDigest.java:167)
at org.apache.hadoop.hbase.util.MD5Hash.getMD5AsHex(MD5Hash.java:59)
at
org.apache.hadoop.hbase.client.RegionInfo.createRegionName(RegionInfo.java:560)
at
org.apache.hadoop.hbase.client.RegionInfo.createRegionName(RegionInfo.java:490)
at
org.apache.hadoop.hbase.client.RegionInfoBuilder$MutableRegionInfo.<init>(RegionInfoBuilder.java:243)
at
org.apache.hadoop.hbase.client.RegionInfoBuilder.build(RegionInfoBuilder.java:120)
at
org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toRegionInfo(ProtobufUtil.java:3132)
at
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.deserializeStateData(ServerCrashProcedure.java:335)
at
org.apache.hadoop.hbase.procedure2.ProcedureUtil.convertToProcedure(ProcedureUtil.java:283)
at
org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormatReader$Entry.convert(ProcedureWALFormatReader.java:359)
at
org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormatReader$EntryIterator.next(ProcedureWALFormatReader.java:410)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.loadProcedures(ProcedureExecutor.java:460)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$200(ProcedureExecutor.java:76)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$1.load(ProcedureExecutor.java:391)
at
org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore$2.load(WALProcedureStore.java:441)
at
org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormatReader.finish(ProcedureWALFormatReader.java:202)
at
org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormat.load(ProcedureWALFormat.java:94)
at
org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.load(WALProcedureStore.java:426)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.load(ProcedureExecutor.java:382)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.init(ProcedureExecutor.java:663)
at
org.apache.hadoop.hbase.master.HMaster.createProcedureExecutor(HMaster.java:1335)
at
org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:878)
at
org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2119)
at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:567)
at
org.apache.hadoop.hbase.master.HMaster$$Lambda$42/1930759883.run(Unknown Source)
{code}
In this case, it took 1.25 hours before we starting showing progress in the
log. Let me add progress output (Separately am looking at how to do speedup).
was:
I have a Master that crashed on a large cluster with hundreds of outstanding
Procedure WALs and ~7M of Procedures to load. It is taking a long time (two
hours)...
There were STUCK procedures that were preventing clean-up of the old WALs.
I can tell we are making progress by enabling TRACE on the Procedure Store.
Better would be an emission as we made progress through the files with an
output after every so many procedures loaded.
Then, post-load, there is a long time spent sorting-out the Procedure image...
We are in finish for ages doing stuff like:
{code}
"master/vc0207:22001:becomeActiveMaster" #98 daemon prio=5 os_prio=0
tid=0x0000000000d31800 nid=0x1efc0 runnable [0x00007f0a3c17d000]
java.lang.Thread.State: RUNNABLE
at
org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormatReader$WalProcedureMap.removeFromMap(ProcedureWALFormatReader.java:837)
at
org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormatReader$WalProcedureMap.fetchReady(ProcedureWALFormatReader.java:614)
at
org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormatReader.finish(ProcedureWALFormatReader.java:201)
at
org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormat.load(ProcedureWALFormat.java:94)
at
org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.load(WALProcedureStore.java:426)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.load(ProcedureExecutor.java:382)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.init(ProcedureExecutor.java:663)
at
org.apache.hadoop.hbase.master.HMaster.createProcedureExecutor(HMaster.java:1335)
at
org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:878)
at
org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2119)
at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:567)
at
org.apache.hadoop.hbase.master.HMaster$$Lambda$42/1930759883.run(Unknown Source)
at java.lang.Thread.run(Thread.java:748)
{code}
and....
{code}
"master/vc0207:22001:becomeActiveMaster" #98 daemon prio=5 os_prio=0
tid=0x0000000000d31800 nid=0x1efc0 runnable [0x00007f0a3c17d000]
java.lang.Thread.State: RUNNABLE
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.security.Provider$Service.newInstance(Provider.java:1595)
at sun.security.jca.GetInstance.getInstance(GetInstance.java:236)
at sun.security.jca.GetInstance.getInstance(GetInstance.java:164)
at java.security.Security.getImpl(Security.java:695)
at java.security.MessageDigest.getInstance(MessageDigest.java:167)
at org.apache.hadoop.hbase.util.MD5Hash.getMD5AsHex(MD5Hash.java:59)
at
org.apache.hadoop.hbase.client.RegionInfo.createRegionName(RegionInfo.java:560)
at
org.apache.hadoop.hbase.client.RegionInfo.createRegionName(RegionInfo.java:490)
at
org.apache.hadoop.hbase.client.RegionInfoBuilder$MutableRegionInfo.<init>(RegionInfoBuilder.java:243)
at
org.apache.hadoop.hbase.client.RegionInfoBuilder.build(RegionInfoBuilder.java:120)
at
org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toRegionInfo(ProtobufUtil.java:3132)
at
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.deserializeStateData(ServerCrashProcedure.java:335)
at
org.apache.hadoop.hbase.procedure2.ProcedureUtil.convertToProcedure(ProcedureUtil.java:283)
at
org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormatReader$Entry.convert(ProcedureWALFormatReader.java:359)
at
org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormatReader$EntryIterator.next(ProcedureWALFormatReader.java:410)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.loadProcedures(ProcedureExecutor.java:460)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$200(ProcedureExecutor.java:76)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$1.load(ProcedureExecutor.java:391)
at
org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore$2.load(WALProcedureStore.java:441)
at
org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormatReader.finish(ProcedureWALFormatReader.java:202)
at
org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormat.load(ProcedureWALFormat.java:94)
at
org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.load(WALProcedureStore.java:426)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.load(ProcedureExecutor.java:382)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.init(ProcedureExecutor.java:663)
at
org.apache.hadoop.hbase.master.HMaster.createProcedureExecutor(HMaster.java:1335)
at
org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:878)
at
org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2119)
at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:567)
at
org.apache.hadoop.hbase.master.HMaster$$Lambda$42/1930759883.run(Unknown Source)
{code}
In this case, it took 1.25 hours before we starting showing progress in the
log. Let me add progress output (Separately am looking at how to do speedup).
> During ProcedureStore load, there is no listing of progress...
> --------------------------------------------------------------
>
> Key: HBASE-21165
> URL: https://issues.apache.org/jira/browse/HBASE-21165
> Project: HBase
> Issue Type: Bug
> Components: amv2, Operability
> Reporter: stack
> Assignee: stack
> Priority: Major
>
> I have a Master that crashed on a large cluster with hundreds of outstanding
> Procedure WALs and (probably --TBD) a few million Procedures to load. It is
> taking a long time (two hours)...
> There were STUCK procedures that were preventing clean-up of the old WALs.
> I can tell we are making progress by enabling TRACE on the Procedure Store.
> Better would be an emission as we made progress through the files with an
> output after every so many procedures loaded.
> Then, post-load, there is a long time spent sorting-out the Procedure
> image... We are in finish for ages doing stuff like:
> {code}
> "master/vc0207:22001:becomeActiveMaster" #98 daemon prio=5 os_prio=0
> tid=0x0000000000d31800 nid=0x1efc0 runnable [0x00007f0a3c17d000]
> java.lang.Thread.State: RUNNABLE
> at
> org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormatReader$WalProcedureMap.removeFromMap(ProcedureWALFormatReader.java:837)
> at
> org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormatReader$WalProcedureMap.fetchReady(ProcedureWALFormatReader.java:614)
> at
> org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormatReader.finish(ProcedureWALFormatReader.java:201)
> at
> org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormat.load(ProcedureWALFormat.java:94)
> at
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.load(WALProcedureStore.java:426)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.load(ProcedureExecutor.java:382)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.init(ProcedureExecutor.java:663)
> at
> org.apache.hadoop.hbase.master.HMaster.createProcedureExecutor(HMaster.java:1335)
> at
> org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:878)
> at
> org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2119)
> at
> org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:567)
> at
> org.apache.hadoop.hbase.master.HMaster$$Lambda$42/1930759883.run(Unknown
> Source)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> and....
> {code}
> "master/vc0207:22001:becomeActiveMaster" #98 daemon prio=5 os_prio=0
> tid=0x0000000000d31800 nid=0x1efc0 runnable [0x00007f0a3c17d000]
> java.lang.Thread.State: RUNNABLE
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at java.security.Provider$Service.newInstance(Provider.java:1595)
> at sun.security.jca.GetInstance.getInstance(GetInstance.java:236)
> at sun.security.jca.GetInstance.getInstance(GetInstance.java:164)
> at java.security.Security.getImpl(Security.java:695)
> at java.security.MessageDigest.getInstance(MessageDigest.java:167)
> at org.apache.hadoop.hbase.util.MD5Hash.getMD5AsHex(MD5Hash.java:59)
> at
> org.apache.hadoop.hbase.client.RegionInfo.createRegionName(RegionInfo.java:560)
> at
> org.apache.hadoop.hbase.client.RegionInfo.createRegionName(RegionInfo.java:490)
> at
> org.apache.hadoop.hbase.client.RegionInfoBuilder$MutableRegionInfo.<init>(RegionInfoBuilder.java:243)
> at
> org.apache.hadoop.hbase.client.RegionInfoBuilder.build(RegionInfoBuilder.java:120)
> at
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toRegionInfo(ProtobufUtil.java:3132)
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.deserializeStateData(ServerCrashProcedure.java:335)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureUtil.convertToProcedure(ProcedureUtil.java:283)
> at
> org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormatReader$Entry.convert(ProcedureWALFormatReader.java:359)
> at
> org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormatReader$EntryIterator.next(ProcedureWALFormatReader.java:410)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.loadProcedures(ProcedureExecutor.java:460)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$200(ProcedureExecutor.java:76)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$1.load(ProcedureExecutor.java:391)
> at
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore$2.load(WALProcedureStore.java:441)
> at
> org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormatReader.finish(ProcedureWALFormatReader.java:202)
> at
> org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormat.load(ProcedureWALFormat.java:94)
> at
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.load(WALProcedureStore.java:426)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.load(ProcedureExecutor.java:382)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.init(ProcedureExecutor.java:663)
> at
> org.apache.hadoop.hbase.master.HMaster.createProcedureExecutor(HMaster.java:1335)
> at
> org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:878)
> at
> org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2119)
> at
> org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:567)
> at
> org.apache.hadoop.hbase.master.HMaster$$Lambda$42/1930759883.run(Unknown
> Source)
> {code}
> In this case, it took 1.25 hours before we starting showing progress in the
> log. Let me add progress output (Separately am looking at how to do speedup).
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)