[ 
https://issues.apache.org/jira/browse/HADOOP-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546087
 ] 

Jim Kellerman commented on HADOOP-2295:
---------------------------------------

We figured out a likely scenario for how this could happen.

 1. master assigns region to a region server
 2. region server puts open region on its 'todo list' 
  - todo list executes in a separate thread because actions on the todo list 
may take longer than a heartbeat interval to complete.
 3. region server is still processing todo list when it is time to send a 
heartbeat to the master
 4. master assigns the same region to the server because it is still unassigned 
and no region server is serving it
 5. region server completes bringing region on-line and tells the master in the 
next heartbeat message
 6. master replies saying "Ive assigned this region to someone else. close it 
and don't tell me you've closed it.
 7. region server queues close request and this stops the region from coming 
online.
 8. master thinks region server should be serving the region so it is removed 
from unassigned regions list.
 9. no region server ends up serving the region even though the master thinks 
it is not an unserved region

To fix this we propose that when the master assigns a region it puts the region 
in a delay queue with a timeout of 2 * lease interval.
If the region server responds in 2 * lease interval saying it is serving the 
region, then the master finishes processing the event of the region coming 
online.
If the region server does not respond in 2 * lease interval that it is serving 
the region, the master assumes that either:
- the region server has died and will never serve the region
- the region server never got the message to serve the region
and consequently will assign the region to the next appropriate region server.

> [hbase] Still multi-assigning regions
> -------------------------------------
>
>                 Key: HADOOP-2295
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2295
>             Project: Hadoop
>          Issue Type: Bug
>          Components: contrib/hbase
>            Reporter: stack
>            Priority: Minor
>
> Start key is img216/1725/22236401lvh6.gif.  We doubly assign.  In between 
> assignments, master rules 'should not have opened region' and sends a 
> shutdown region to remote server -- which it obeys. Meantime, master thinks 
> regions are deployed.
> Here is master log:
> {code}
> ...
> 2007-11-27 04:26:21,818 INFO  hbase.HMaster - region 
> postlog,img216/1725/22236401lvh6.gif,1196164441230 split. New regions are: 
> postlog,img216/1725/22236401lvh6.gif,1196166373235, 
> postlog,img216/3961/aasaxonlivene6.jpg,1196166373241
> 2007-11-27 04:26:23,261 INFO  hbase.HMaster - assigning region 
> postlog,img216/1725/22236401lvh6.gif,1196166373235 to server XX.XX.XX.33:60020
> 2007-11-27 04:26:27,462 DEBUG hbase.HMaster - HMaster.metaScanner scanner: 
> -400573094926830852 regioninfo: {regionname: 
> postlog,img216/1725/22236401lvh6.gif,1196164441230, startKey: 
> <img216/1725/22236401lvh6.gif>, offline: true, split: true, tableDesc: {name: 
> postlog, families: {cookie:={name: cookie, max versions: 1, compression: 
> NONE, in memory: false, max length: 2147483647, bloom filter: none}, 
> ip:={name: ip, max versions: 1, compression: NONE, in memory: false, max 
> length: 2147483647, bloom filter: none}}}}, server: XX.XX.XX.25:60020, 
> startCode: 1196155489036
> 2007-11-27 04:26:27,464 DEBUG hbase.HMaster - HMaster.metaScanner scanner: 
> -400573094926830852 regioninfo: {regionname: 
> postlog,img216/1725/22236401lvh6.gif,1196166373235, startKey: 
> <img216/1725/22236401lvh6.gif>, tableDesc: {name: postlog, families: 
> {cookie:={name: cookie, max versions: 1, compression: NONE, in memory: false, 
> max length: 2147483647, bloom filter: none}, ip:={name: ip, max versions: 1, 
> compression: NONE, in memory: false, max length: 2147483647, bloom filter: 
> none}}}}, server: , startCode: -1
> 2007-11-27 04:26:27,464 DEBUG hbase.HMaster - Checking 
> postlog,img216/1725/22236401lvh6.gif,1196166373235 is assigned
> 2007-11-27 04:26:27,671 DEBUG hbase.HMaster - Checked 
> postlog,img216/1725/22236401lvh6.gif,1196164441230 for references: splitA: 
> true, splitB: true
> 2007-11-27 04:26:53,328 INFO  hbase.HMaster - assigning region 
> postlog,img216/1725/22236401lvh6.gif,1196166373235 to server XX.XX.XX.33:60020
> 2007-11-27 04:27:05,364 INFO  hbase.HMaster - XX.XX.XX.33:60020 serving 
> postlog,img216/1725/22236401lvh6.gif,1196166373235
> 2007-11-27 04:27:05,364 DEBUG hbase.HMaster - region server XX.XX.XX.33:60020 
> should not have opened region 
> postlog,img216/1725/22236401lvh6.gif,1196166373235
> 2007-11-27 04:27:05,367 INFO  hbase.HMaster - regionname: 
> postlog,img216/1725/22236401lvh6.gif,1196166373235, startKey: 
> <img216/1725/22236401lvh6.gif>, tableDesc: {name: postlog, families: 
> {cookie:={name: cookie, max versions: 1, compression: NONE, in memory: false, 
> max length: 2147483647, bloom filter: none}, ip:={name: ip, max versions: 1, 
> compression: NONE, in memory: false, max length: 2147483647, bloom filter: 
> none}}} open on XX.XX.XX.33:60020
> 2007-11-27 04:27:05,368 INFO  hbase.HMaster - updating row 
> postlog,img216/1725/22236401lvh6.gif,1196166373235 in table .META.,,1 with 
> startcode 1196155493271 and server XX.XX.XX.33:60020
> 2007-11-27 04:27:26,380 DEBUG hbase.HMaster - HMaster.metaScanner scanner: 
> -4876467680083821318 regioninfo: {regionname: 
> postlog,img216/1725/22236401lvh6.gif,1196164441230, startKey: 
> <img216/1725/22236401lvh6.gif>, offline: true, split: true, tableDesc: {name: 
> postlog, families: {cookie:={name: cookie, max versions: 1, compression: 
> NONE, in memory: false, max length: 2147483647, bloom filter: none}, 
> ip:={name: ip, max versions: 1, compression: NONE, in memory: false, max 
> length: 2147483647, bloom filter: none}}}}, server: XX.XX.XX.25:60020, 
> startCode: 1196155489036
> 2007-11-27 04:27:26,381 DEBUG hbase.HMaster - HMaster.metaScanner scanner: 
> -4876467680083821318 regioninfo: {regionname: 
> postlog,img216/1725/22236401lvh6.gif,1196166373235, startKey: 
> <img216/1725/22236401lvh6.gif>, tableDesc: {name: postlog, families: 
> {cookie:={name: cookie, max versions: 1, compression: NONE, in memory: false, 
> max length: 2147483647, bloom filter: none}, ip:={name: ip, max versions: 1, 
> compression: NONE, in memory: false, max length: 2147483647, bloom filter: 
> none}}}}, server: XX.XX.XX.33:60020, startCode: 1196155493271
> 2007-11-27 04:27:26,381 DEBUG hbase.HMaster - Checking 
> postlog,img216/1725/22236401lvh6.gif,1196166373235 is assigned
> 2007-11-27 04:27:26,511 DEBUG hbase.HMaster - Checked 
> postlog,img216/1725/22236401lvh6.gif,1196164441230 for references: splitA: 
> true, splitB: true
> 2007-11-27 04:28:26,444 DEBUG hbase.HMaster - HMaster.metaScanner scanner: 
> -7699450087241029374 regioninfo: {regionname: 
> postlog,img216/1725/22236401lvh6.gif,1196164441230, startKey: 
> <img216/1725/22236401lvh6.gif>, offline: true, split: true, tableDesc: {name: 
> postlog, families: {cookie:={name: cookie, max versions: 1, compression: 
> NONE, in memory: false, max length: 2147483647, bloom filter: none}, 
> ip:={name: ip, max versions: 1, compression: NONE, in memory: false, max 
> length: 2147483647, bloom filter: none}}}}, server: XX.XX.XX.25:60020, 
> startCode: 1196155489036
> 2007-11-27 04:28:26,446 DEBUG hbase.HMaster - HMaster.metaScanner scanner: 
> -7699450087241029374 regioninfo: {regionname: 
> postlog,img216/1725/22236401lvh6.gif,1196166373235, startKey: 
> <img216/1725/22236401lvh6.gif>, tableDesc: {name: postlog, families: 
> {cookie:={name: cookie, max versions: 1, compression: NONE, in memory: false, 
> max length: 2147483647, bloom filter: none}, ip:={name: ip, max versions: 1, 
> compression: NONE, in memory: false, max length: 2147483647, bloom filter: 
> none}}}}, server: XX.XX.XX.33:60020, startCode: 1196155493271
> 2007-11-27 04:28:26,446 DEBUG hbase.HMaster - Checking 
> postlog,img216/1725/22236401lvh6.gif,1196166373235 is assigned
> 2007-11-27 04:28:26,590 DEBUG hbase.HMaster - Checked 
> postlog,img216/1725/22236401lvh6.gif,1196164441230 for references: splitA: 
> true, splitB: true
> 2007-11-27 04:29:26,459 DEBUG hbase.HMaster - HMaster.metaScanner scanner: 
> 5846860608957952540 regioninfo: {regionname: 
> postlog,img216/1725/22236401lvh6.gif,1196164441230, startKey: 
> <img216/1401lvh6.gif>, offline: true, split: true, tableDesc: {name: po: 1, 
> compression: NONE, in memory: false, max length: 21474836s: 1, compression: 
> NONE, in memory: false, max length: 21474836er: none}}}}, server: 
> XX.XX.XX.25:60020, startCode: 11961554890ase.HMaster - HMaster.metaScanner 
> scanner: 5846860608957952540 /1725/22236401lvh6.gif,1196166373235, startKey: 
> <img216/1725/22okie:={name: cookie, max versions: 1, compression: NONE, in 
> mem NONE, in memory: false, max length: 2147483647, bloom filter: artCode: 
> 1196155493271
> 2007-11-27 04:29:26,461 DEBUG hbase.HMas235 is assigned
> 2007-11-27 04:29:26,622 DEBUG hbase.HMaster - Cces: splitA: true, splitB: true
> ...
> {code}
> Here is regionserver log on node XX.XX.XX.33:
> {code}
> ...
> 2007-11-27 04:26:29,318 INFO  hbase.HRegionServer - MSG_REGION_OPEN : 
> regionname: postlog,img216/1725/22236401lvh6.gif,1196166373235, startKey: 
> <img216/1725/22236401lvh6.gif>, tableDesc: {name: postlog, families: 
> {cookie:={name: cookie, max versions: 1, compression: NONE, in memory: false, 
> max length: 2147483647, bloom filter: none}, ip:={name: ip, max versions: 1, 
> compression: NONE, in memory: false, max length: 2147483647, bloom filter: 
> none}}}
> 2007-11-27 04:26:29,325 DEBUG hbase.HStore - starting 51396220/cookie (no 
> reconstruction log)
> 2007-11-27 04:26:30,420 DEBUG hbase.HRegionServer - flushing region 
> postlog,img100/941/brandyfn6.gif,1196158363481
> 2007-11-27 04:26:30,420 DEBUG hbase.HRegion - Not flushing cache for region 
> postlog,img100/941/brandyfn6.gif,1196158363481: snapshotMemcaches() 
> determined that there was nothing to do
> 2007-11-27 04:26:31,752 DEBUG hbase.HRegionServer - flushing region 
> postlog,img149/2019/liyanaum6.gif,1196164996321
> 2007-11-27 04:26:31,753 DEBUG hbase.HRegion - Not flushing cache for region 
> postlog,img149/2019/liyanaum6.gif,1196164996321: snapshotMemcaches() 
> determined that there was nothing to do
> 2007-11-27 04:26:31,753 DEBUG hbase.HRegionServer - flushing region 
> postlog,img148/6048/p9290339kb0.jpg,1196162613173
> 2007-11-27 04:26:31,753 DEBUG hbase.HRegion - Not flushing cache for region 
> postlog,img148/6048/p9290339kb0.jpg,1196162613173: snapshotMemcaches() 
> determined that there was nothing to do
> 2007-11-27 04:26:31,754 DEBUG hbase.HRegionServer - flushing region 
> postlog,img132/5512/timeline8bx.png,1196159744555
> 2007-11-27 04:26:31,754 DEBUG hbase.HRegion - Not flushing cache for region 
> postlog,img132/5512/timeline8bx.png,1196159744555: snapshotMemcaches() 
> determined that there was nothing to do
> 2007-11-27 04:26:37,915 DEBUG hbase.HRegionServer - flushing region 
> postlog,img149/3724/whosherne4.jpg,1196164996322
> 2007-11-27 04:26:37,915 DEBUG hbase.HRegion - Not flushing cache for region 
> postlog,img149/3724/whosherne4.jpg,1196164996322: snapshotMemcaches() 
> determined that there was nothing to do
> 2007-11-27 04:26:57,059 DEBUG hbase.HRegionServer - flushing region 
> postlog,img148/8339/361de8578d356flip0hd.png,1196165445439
> 2007-11-27 04:26:57,060 DEBUG hbase.HRegion - Not flushing cache for region 
> postlog,img148/8339/361de8578d356flip0hd.png,1196165445439: 
> snapshotMemcaches() determined that there was nothing to do
> 2007-11-27 04:27:03,604 DEBUG hbase.HStore - maximum sequence id for hstore 
> 51396220/cookie is 30625781
> 2007-11-27 04:27:03,761 DEBUG hbase.HStore - starting 51396220/ip (no 
> reconstruction log)
> 2007-11-27 04:27:03,859 DEBUG hbase.HStore - maximum sequence id for hstore 
> 51396220/ip is 30625781
> 2007-11-27 04:27:03,965 DEBUG hbase.HRegion - Next sequence id for region 
> postlog,img216/1725/22236401lvh6.gif,1196166373235 is 30625782
> 2007-11-27 04:27:03,969 INFO  hbase.HRegion - region 
> postlog,img216/1725/22236401lvh6.gif,1196166373235 available
> 2007-11-27 04:27:03,969 INFO  hbase.HRegionServer - MSG_REGION_OPEN : 
> regionname: postlog,img216/1725/22236401lvh6.gif,1196166373235, startKey: 
> <img216/1725/22236401lvh6.gif>, tableDesc: {name: postlog, families: 
> {cookie:={name: cookie, max versions: 1, compression: NONE, in memory: false, 
> max length: 2147483647, bloom filter: none}, ip:={name: ip, max versions: 1, 
> compression: NONE, in memory: false, max length: 2147483647, bloom filter: 
> none}}}
> 2007-11-27 04:27:04,578 INFO  hbase.HRegionServer - 
> MSG_REGION_CLOSE_WITHOUT_REPORT : regionname: 
> postlog,img216/1725/22236401lvh6.gif,1196166373235, startKey: 
> <img216/1725/22236401lvh6.gif>, tableDesc: {name: postlog, families: 
> {cookie:={name: cookie, max versions: 1, compression: NONE, in memory: false, 
> max length: 2147483647, bloom filter: none}, ip:={name: ip, max versions: 1, 
> compression: NONE, in memory: false, max length: 2147483647, bloom filter: 
> none}}}
> 2007-11-27 04:27:04,578 DEBUG hbase.HRegion - Not flushing cache for region 
> postlog,img216/1725/22236401lvh6.gif,1196166373235: snapshotMemcaches() 
> determined that there was nothing to do
> 2007-11-27 04:27:04,579 DEBUG hbase.HStore - closed 51396220/ip
> 2007-11-27 04:27:04,580 DEBUG hbase.HStore - closed 51396220/cookie
> 2007-11-27 04:27:04,580 INFO  hbase.HRegion - closed 
> postlog,img216/1725/22236401lvh6.gif,1196166373235
> ..
> {code}
> It looks like the assigned region hasn't been cleared from the unassigned 
> regions list when the report comes in.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to