[ 
https://issues.apache.org/jira/browse/HBASE-17069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859144#comment-15859144
 ] 

Abhishek Singh Chouhan commented on HBASE-17069:
------------------------------------------------

Did a bit of debugging around this. I think the issue is as follows:
- We try to split a region A into B and C. SplitTransactionImpl.execute(..) -> 
SplitTransactionImpl.createDaughters(..) -> MetaTableAccessor.splitRegion(..)
In MetaTableAccessor.splitRegion() we create a multimutate to add entries of B 
and C in meta, the puts here have hregioninfo. 
- We encounter an exception in MetaTableAccessor.multimutate(..) since we just 
slayed the RS hosting meta (this was the cached location)
MultiRowMutationProtos.MultiRowMutationService.BlockingInterface service =
      MultiRowMutationProtos.MultiRowMutationService.newBlockingStub(channel);
    try {
      service.mutateRows(null, mmrBuilder.build());
    } catch (ServiceException ex) {
      ProtobufUtil.toIOException(ex);
    }
- We swallow the exception here and move on and try to online the regions using 
SplitTransactionImpl.openDaughters(..). 
- We successfully open the region a target RS. At the time of opening in 
HRegionServer.postOpenDeployTasks(..) we do 
MetaTableAccessor.updateRegionLocation(..). This put has other entries like 
servername, startcode etc but doesn't have hregioninfo. This request now goes 
through since meta is now up again on another RS. We end up with partial info 
in hbase:meta and the region is online on a RS. The clients however can't 
access the data belonging to this region and we get "java.io.IOException: 
HRegionInfo was null".
The immediate possible solutions to the problem seems to be either throwing the 
exception rather than swallowing it, which since we were doing it after PONR 
would result in failure to rollback the split and eventually RS aborting, or 
having the hregioninfo in the put that we do during postOpen task so that we 
have complete information in meta even after the earlier multi fails. Throwing 
the exception resulting in rs aborting seems radical since the scenario of 
split+meta moving around can be there quite often especially in case of rolling 
restarts on a cluster under load. However i'm not very sure about the other fix 
too since we'll be in a state where we assume we have regioninfo in meta which 
wont be there untill the daughter opens successfully on a RS (not sure of the 
failure scenarios between the two events). 
I'm doing some testing with the adding hregioninfo during postOpen task part. I 
can put up a patch if this sounds ok. [~stack] [~apurtell] [~lhofhansl] 
Thoughts?

> RegionServer writes invalid META entries for split daughters in some 
> circumstances
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-17069
>                 URL: https://issues.apache.org/jira/browse/HBASE-17069
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.2.4
>            Reporter: Andrew Purtell
>            Priority: Critical
>         Attachments: daughter_1_d55ef81c2f8299abbddfce0445067830.log, 
> daughter_2_08629d59564726da2497f70451aafcdb.log, logs.tar.gz, 
> parent-393d2bfd8b1c52ce08540306659624f2.log
>
>
> I have been seeing frequent ITBLL failures testing various versions of 1.2.x. 
> Over the lifetime of 1.2.x the following issues have been fixed:
> - HBASE-15315 (Remove always set super user call as high priority)
> - HBASE-16093 (Fix splits failed before creating daughter regions leave meta 
> inconsistent)
> And this one is pending:
> - HBASE-17044 (Fix merge failed before creating merged region leaves meta 
> inconsistent)
> I can apply all of the above to branch-1.2 and still see this failure: 
> *The life of stillborn region d55ef81c2f8299abbddfce0445067830*
> *Master sees SPLITTING_NEW*
> {noformat}
> 2016-11-08 04:23:21,186 INFO  [AM.ZK.Worker-pool2-t82] master.RegionStates: 
> Transition null to {d55ef81c2f8299abbddfce0445067830 state=SPLITTING_NEW, 
> ts=1478579001186, server=node-3.cluster,16020,1478578389506}
> {noformat}
> *The RegionServer creates it*
> {noformat}
> 2016-11-08 04:23:26,035 INFO  
> [StoreOpener-d55ef81c2f8299abbddfce0445067830-1] hfile.CacheConfig: Created 
> cacheConfig for GomnU: blockCache=LruBlockCache{blockCount=34, 
> currentSize=14996112, freeSize=12823716208, maxSize=12838712320, 
> heapSize=14996112, minSize=12196776960, minFactor=0.95, multiSize=6098388480, 
> multiFactor=0.5, singleSize=3049194240, singleFactor=0.25}, 
> cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, 
> cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, 
> prefetchOnOpen=false
> 2016-11-08 04:23:26,038 INFO  
> [StoreOpener-d55ef81c2f8299abbddfce0445067830-1] hfile.CacheConfig: Created 
> cacheConfig for big: blockCache=LruBlockCache{blockCount=34, 
> currentSize=14996112, freeSize=12823716208, maxSize=12838712320, 
> heapSize=14996112, minSize=12196776960, minFactor=0.95, multiSize=6098388480, 
> multiFactor=0.5, singleSize=3049194240, singleFactor=0.25}, 
> cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, 
> cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, 
> prefetchOnOpen=false
> 2016-11-08 04:23:26,442 INFO  
> [StoreOpener-d55ef81c2f8299abbddfce0445067830-1] hfile.CacheConfig: Created 
> cacheConfig for meta: blockCache=LruBlockCache{blockCount=63, 
> currentSize=17187656, freeSize=12821524664, maxSize=12838712320, 
> heapSize=17187656, minSize=12196776960, minFactor=0.95, multiSize=6098388480, 
> multiFactor=0.5, singleSize=3049194240, singleFactor=0.25}, 
> cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, 
> cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, 
> prefetchOnOpen=false
> 2016-11-08 04:23:26,713 INFO  
> [StoreOpener-d55ef81c2f8299abbddfce0445067830-1] hfile.CacheConfig: Created 
> cacheConfig for nwmrW: blockCache=LruBlockCache{blockCount=96, 
> currentSize=19178440, freeSize=12819533880, maxSize=12838712320, 
> heapSize=19178440, minSize=12196776960, minFactor=0.95, multiSize=6098388480, 
> multiFactor=0.5, singleSize=3049194240, singleFactor=0.25}, 
> cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, 
> cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, 
> prefetchOnOpen=false
> 2016-11-08 04:23:26,715 INFO  
> [StoreOpener-d55ef81c2f8299abbddfce0445067830-1] hfile.CacheConfig: Created 
> cacheConfig for piwbr: blockCache=LruBlockCache{blockCount=96, 
> currentSize=19178440, freeSize=12819533880, maxSize=12838712320, 
> heapSize=19178440, minSize=12196776960, minFactor=0.95, multiSize=6098388480, 
> multiFactor=0.5, singleSize=3049194240, singleFactor=0.25}, 
> cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, 
> cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, 
> prefetchOnOpen=false
> 2016-11-08 04:23:26,717 INFO  
> [StoreOpener-d55ef81c2f8299abbddfce0445067830-1] hfile.CacheConfig: Created 
> cacheConfig for tiny: blockCache=LruBlockCache{blockCount=96, 
> currentSize=19178440, freeSize=12819533880, maxSize=12838712320, 
> heapSize=19178440, minSize=12196776960, minFactor=0.95, multiSize=6098388480, 
> multiFactor=0.5, singleSize=3049194240, singleFactor=0.25}, 
> cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, 
> cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, 
> prefetchOnOpen=false
> {noformat}
> *The RegionServer onlines it*
> {noformat}
> 2016-11-08 04:23:27,015 INFO  
> [node-3.cluster,16020,1478578389506-daughterOpener=d55ef81c2f8299abbddfce0445067830]
>  regionserver.HRegion: Onlined d55ef81c2f8299abbddfce0445067830; next 
> sequenceid=19184
> 2016-11-08 04:23:27,029 INFO  
> [regionserver/node-3.cluster/192.168.124.4:16020-splits-1478579001099] 
> regionserver.HRegionServer: Post open deploy tasks for 
> IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830.
> 2016-11-08 04:23:27,047 INFO  
> [regionserver/node-3.cluster/192.168.124.4:16020-splits-1478579001099] 
> hbase.MetaTableAccessor: Updated row 
> IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830. 
> with server=node-3.cluster,16020,1478578389506
> {noformat}
> *The Master transitions state from SPLITTING_NEW to OPEN*
> {noformat}
> 2016-11-08 04:23:27,058 INFO  [AM.ZK.Worker-pool2-t84] master.RegionStates: 
> Transition {d55ef81c2f8299abbddfce0445067830 state=SPLITTING_NEW, 
> ts=1478579007057, server=node-3.cluster,16020,1478578389506} to 
> {d55ef81c2f8299abbddfce0445067830 state=OPEN, ts=1478579007058, 
> server=node-3.cluster,16020,1478578389506}
> 2016-11-08 04:23:27,059 INFO  [AM.ZK.Worker-pool2-t84] 
> master.AssignmentManager: Handled SPLIT event; 
> parent=IntegrationTestBigLinkedList,,1478577020916.393d2bfd8b1c52ce08540306659624f2.,
>  daughter 
> a=IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830.,
>  daughter 
> b=IntegrationTestBigLinkedList,/\xFB\x14,1478579001155.08629d59564726da2497f70451aafcdb.,
>  on node-3.cluster,16020,1478578389506
> {noformat}
> *RegionServer updates META  - BUT APPARENTLY NOT CORRECTLY*
> {noformat}
> 2016-11-08 04:23:27,165 INFO  
> [regionserver/node-3.cluster/192.168.124.4:16020-splits-1478579001099] 
> regionserver.SplitRequest: Region split, hbase:meta updated, and report to 
> master. 
> Parent=IntegrationTestBigLinkedList,,1478577020916.393d2bfd8b1c52ce08540306659624f2.,
>  new regions: 
> IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830.,
>  
> IntegrationTestBigLinkedList,/\xFB\x14,1478579001155.08629d59564726da2497f70451aafcdb..
>  Split took 6sec
> {noformat}
> *RegionServer delays flush*
> (Is this important?)
> {noformat}
> 2016-11-08 04:24:14,639 WARN  [MemStoreFlusher.0] 
> regionserver.MemStoreFlusher: Region 
> IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830. 
> has too many store files; delaying flush up to 90000ms
> {noformat}
> *Immediate warnings about No serialized HRegionInfo*
> {noformat}
> 2016-11-08 04:24:44,691 WARN  
> [B.defaultRpcServer.handler=26,queue=2,port=16000] hbase.MetaTableAccessor: 
> No serialized HRegionInfo in 
> keyvalues={IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:seqnumDuringOpen/1478579007029/Put/vlen=8/seqid=0,
>  
> IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:server/1478579007029/Put/vlen=20/seqid=0,
>  
> IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:serverstartcode/1478579007029/Put/vlen=8/seqid=0}
> {noformat}
> *Master is not happy either*
> {noformat}
> 2016-11-08 04:24:51,148 WARN  [MASTER_TABLE_OPERATIONS-node-1:16000-0] 
> hbase.MetaTableAccessor: No serialized HRegionInfo in 
> keyvalues={IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:seqnumDuringOpen/1478579007029/Put/vlen=8/seqid=0,
>  
> IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:server/1478579007029/Put/vlen=20/seqid=0,
>  
> IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:serverstartcode/1478579007029/Put/vlen=8/seqid=0}
> {noformat}
> *TestRunner MetaScanner complains about invalid entries in META missing 
> HRegionInfo*
> {noformat}
> (standard input):9086:2016-11-08 05:04:17,230 WARN  
> [B.defaultRpcServer.handler=4,queue=1,port=16000] hbase.MetaTableAccessor: No 
> serialized HRegionInfo in 
> keyvalues={IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:seqnumDuringOpen/1478581041080/Put/vlen=8/seqid=0,
>  
> IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:server/1478581041080/Put/vlen=20/seqid=0,
>  
> IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:serverstartcode/1478581041080/Put/vlen=8/seqid=0}
> {noformat}
> *ITBLL MapReduce tasks fail because part of the keyspace cannot be located:*
> {noformat}
> java.io.IOException: HRegionInfo was null in IntegrationTestBigLinkedList, 
> row=keyvalues={IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:seqnumDuringOpen/1478581041080/Put/vlen=8/seqid=0,
>  
> IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:server/1478581041080/Put/vlen=20/seqid=0,
>  
> IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:serverstartcode/1478581041080/Put/vlen=8/seqid=0}
>         at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1293)
>         at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1185)
>         at 
> org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:410)
>         at 
> org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:359)
>         at 
> org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:238)
>         at 
> org.apache.hadoop.hbase.client.BufferedMutatorImpl.mutate(BufferedMutatorImpl.java:154)
>         at 
> org.apache.hadoop.hbase.client.BufferedMutatorImpl.mutate(BufferedMutatorImpl.java:121)
>         at 
> org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Generator$GeneratorMapper.persist(IntegrationTestBigLinkedList.java:486)
>         at 
> org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Generator$GeneratorMapper.map(IntegrationTestBigLinkedList.java:431)
>         at 
> org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Generator$GeneratorMapper.map(IntegrationTestBigLinkedList.java:375)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1719)
>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
> {noformat}
> {noformat}
> ./application_1478574724776_0002/container_1478574724776_0002_01_000008/syslog:920:java.io.IOException:
>  HRegionInfo was null in IntegrationTestBigLinkedList, 
> row=keyvalues={IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:seqnumDuringOpen/1478580288482/Put/vlen=8/seqid=0,
>  
> IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:server/1478580288482/Put/vlen=20/seqid=0,
>  
> IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:serverstartcode/1478580288482/Put/vlen=8/seqid=0}
> {noformat}
> {noformat}
> ./application_1478574724776_0002/container_1478574724776_0002_01_000010/syslog:920:java.io.IOException:
>  HRegionInfo was null in IntegrationTestBigLinkedList, 
> row=keyvalues={IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:seqnumDuringOpen/1478580288482/Put/vlen=8/seqid=0,
>  
> IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:server/1478580288482/Put/vlen=20/seqid=0,
>  
> IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:serverstartcode/1478580288482/Put/vlen=8/seqid=0}
> {noformat}
> {noformat}
> ./application_1478574724776_0002/container_1478574724776_0002_01_000011/syslog:909:java.io.IOException:
>  HRegionInfo was null in IntegrationTestBigLinkedList, 
> row=keyvalues={IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:seqnumDuringOpen/1478580288482/Put/vlen=8/seqid=0,
>  
> IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:server/1478580288482/Put/vlen=20/seqid=0,
>  
> IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:serverstartcode/1478580288482/Put/vlen=8/seqid=0}
> {noformat}
> {noformat}
> ./application_1478574724776_0002/container_1478574724776_0002_01_000030/syslog:48:java.io.IOException:
>  HRegionInfo was null in IntegrationTestBigLinkedList, 
> row=keyvalues={IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:seqnumDuringOpen/1478581041080/Put/vlen=8/seqid=0,
>  
> IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:server/1478581041080/Put/vlen=20/seqid=0,
>  
> IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:serverstartcode/1478581041080/Put/vlen=8/seqid=0}
> {noformat}
> {noformat}
> ./application_1478574724776_0002/container_1478574724776_0002_01_000048/syslog:48:java.io.IOException:
>  HRegionInfo was null in IntegrationTestBigLinkedList, 
> row=keyvalues={IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:seqnumDuringOpen/1478581041080/Put/vlen=8/seqid=0,
>  
> IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:server/1478581041080/Put/vlen=20/seqid=0,
>  
> IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:serverstartcode/1478581041080/Put/vlen=8/seqid=0}
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to