[
https://issues.apache.org/jira/browse/HBASE-549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12586042#action_12586042
]
stack commented on HBASE-549:
-----------------------------
Messages currently have a Type and a 'subject', where the subject is usually
an HRI; splits reported to the master, IIRC, carry the daughter regions.
To fix this issue and others like it that may occur, what is missing is a
Message 'source'. It would also be sweet if Messages could carry an optional
payload. I'm thinking that when an HRS sends a CLOSE, it could bundle the
Exception that prompted the closing. This way, one could read the master log
and get a sense of the whole cluster.
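To illustrate why a 'source' matters for this issue, here is a minimal Java sketch of a master that ignores a CLOSE coming from a server other than the one the region is currently assigned to. Everything in it is hypothetical: ServerId, AssignmentTracker and handleClose are illustration names, not existing HBase classes, and the sketch assumes the master keeps a region-to-server map, which per the issue description it does not do today for unassigned regions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical identity of a region server: its address plus the start code
// that distinguishes one incarnation of the server from the next.
final class ServerId {
    final String address;
    final long startCode;

    ServerId(String address, long startCode) {
        this.address = address;
        this.startCode = startCode;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ServerId)) return false;
        ServerId other = (ServerId) o;
        return startCode == other.startCode && address.equals(other.address);
    }

    @Override
    public int hashCode() {
        return Objects.hash(address, startCode);
    }
}

// Hypothetical master-side bookkeeping: the server each region was last
// assigned to.
final class AssignmentTracker {
    private final Map<String, ServerId> assignedTo = new HashMap<>();

    void assign(String regionName, ServerId server) {
        assignedTo.put(regionName, server);
    }

    ServerId ownerOf(String regionName) {
        return assignedTo.get(regionName);
    }

    // Returns true if the CLOSE was honored, false if it was ignored because
    // the reporting server is not the one currently assigned the region.
    boolean handleClose(String regionName, ServerId source) {
        ServerId current = assignedTo.get(regionName);
        if (current == null || !current.equals(source)) {
            return false;
        }
        assignedTo.remove(regionName);
        return true;
    }
}
```

With this in place, a late CLOSE from the first server would be dropped once the region has been handed to a second server, so the second server's OPEN would not be rejected.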
So, a suggestion, made without having dug in to check whether it is overkill,
would be to change HMsg to be something like the interface below, in
pseudo-code:
{code}
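interface Message {
  // Server that sent the message.
  ServerAddress getSource();
  // Are all of our Messages always about a Region?
  HRI getSubject();
  byte getType();
  // Optional extra detail, e.g. the Exception that prompted a CLOSE.
  Text getOptionalPayload();
}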
{code}
The split message could subclass the above.
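As a rough illustration of the subclassing idea, here is a compilable Java sketch. It is not the HBase API: plain Strings stand in for ServerAddress, HRI and Text, and BasicMessage, SplitMessage, MessageTypes and the type codes are all hypothetical names chosen for the example.

```java
import java.util.Optional;

// Hypothetical type codes; the real HMsg constants are not reproduced here.
final class MessageTypes {
    static final byte SPLIT = 1;
    static final byte CLOSE = 2;

    private MessageTypes() {}
}

// Java rendering of the pseudo-code interface, with String stand-ins.
interface Message {
    String getSource();
    String getSubject();
    byte getType();
    Optional<String> getOptionalPayload();
}

class BasicMessage implements Message {
    private final String source;
    private final String subject;
    private final byte type;
    private final String payload; // null when there is no payload

    BasicMessage(String source, String subject, byte type, String payload) {
        this.source = source;
        this.subject = subject;
        this.type = type;
        this.payload = payload;
    }

    @Override public String getSource() { return source; }
    @Override public String getSubject() { return subject; }
    @Override public byte getType() { return type; }
    @Override public Optional<String> getOptionalPayload() {
        return Optional.ofNullable(payload);
    }
}

// A split report: the subject is the parent region, and the subclass adds
// the two daughter regions.
final class SplitMessage extends BasicMessage {
    private final String daughterA;
    private final String daughterB;

    SplitMessage(String source, String parent, String daughterA, String daughterB) {
        super(source, parent, MessageTypes.SPLIT, null);
        this.daughterA = daughterA;
        this.daughterB = daughterB;
    }

    String getDaughterA() { return daughterA; }
    String getDaughterB() { return daughterB; }
}
```

A CLOSE carrying its cause could then be built as new BasicMessage(server, region, MessageTypes.CLOSE, exceptionText), and the master could log the payload when it is present.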
Do we need an ID too? Should IDs be monotonically increasing so message
processing can be done in order?
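One way the ID idea might look, as a sketch only: each sender stamps its messages from a counter, and the receiver remembers the highest ID seen per source and drops anything at or below it. MessageIdGenerator and InOrderReceiver are hypothetical names, and a String stands in for the source address. Whether dropping, rather than buffering and reordering, is the right policy is an open question this sketch does not settle.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical per-sender ID source: IDs start at 1 and only ever increase.
final class MessageIdGenerator {
    private final AtomicLong last = new AtomicLong(0);

    long nextId() {
        return last.incrementAndGet();
    }
}

// Hypothetical receiver-side check: accept a message only if its ID is
// greater than the last ID accepted from the same source.
final class InOrderReceiver {
    private final Map<String, Long> lastSeen = new HashMap<>();

    boolean accept(String source, long id) {
        Long last = lastSeen.get(source);
        if (last != null && id <= last) {
            return false; // duplicate or out-of-order message
        }
        lastSeen.put(source, id);
        return true;
    }
}
```

IDs are tracked per source here, so a restart of one server (which would reset its counter) would need extra handling, for example by treating a new start code as a new source.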
> Don't CLOSE region if message is not from server that opened it or is opening
> it
> --------------------------------------------------------------------------------
>
> Key: HBASE-549
> URL: https://issues.apache.org/jira/browse/HBASE-549
> Project: Hadoop HBase
> Issue Type: Bug
> Affects Versions: 0.16.0, 0.2.0, 0.1.1, 0.1.0
> Reporter: stack
> Fix For: 0.2.0
>
>
> We assign a region to a server. It takes too long to open (HBASE-505).
> The region gets assigned to another server. Meantime, the original host
> returns a MSG_REPORT_CLOSE (because other regions opening messes it up,
> moving files on disk out from under it). We queue a shutdown, which marks
> the region as needing reassignment. The second server reports in that it
> successfully opened the region. The master tells it that it should not have
> opened it. Churn ensues.
> The fix is to ignore the CLOSE if its reported server/startcode does not
> match that of the server currently trying to open the region. The fix is not
> easy because currently we don't keep a list of server info in unassigned
> regions.
> Here's a master log snippet showing the problem:
> {code}
> ...
> 2008-03-25 19:16:43,711 INFO org.apache.hadoop.hbase.HMaster: assigning
> region enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482 to server
> XX.XX.XX.220:60020
> 2008-03-25 19:16:46,725 DEBUG org.apache.hadoop.hbase.HMaster: Received
> MSG_REPORT_PROCESS_OPEN :
> enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482 from XX.XX.XX.220:60020
> 2008-03-25 19:18:06,411 DEBUG org.apache.hadoop.hbase.HMaster: shutdown
> scanner looking at enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482
> 2008-03-25 19:18:06,811 DEBUG org.apache.hadoop.hbase.HMaster: shutdown
> scanner looking at enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482
> 2008-03-25 19:19:46,841 INFO org.apache.hadoop.hbase.HMaster: assigning
> region enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482 to server
> XX.XX.XX.221:60020
> 2008-03-25 19:19:49,849 DEBUG org.apache.hadoop.hbase.HMaster: Received
> MSG_REPORT_PROCESS_OPEN :
> enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482 from XX.XX.XX.221:60020
> 2008-03-25 19:19:56,883 DEBUG org.apache.hadoop.hbase.HMaster: Received
> MSG_REPORT_CLOSE : enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482 from
> XX.XX.XX.220:60020
> 2008-03-25 19:19:56,883 INFO org.apache.hadoop.hbase.HMaster:
> XX.XX.XX.220:60020 no longer serving regionname:
> enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482, startKey:
> <iLStZ0yTnfVUziYcNVVxWV==>, endKey: <jLB27Q4hKls4tSvp64rJfF==>,
> encodedName: 1857033608, tableDesc: {name: enwiki_080103, families:
> {alternate_title:={name: alternate_title, max versions: 3, compression:
> NONE, in memory: false, max length: 2147483647, bloom filter: none},
> alternate_url:={name: alternate_url, max versions: 3, compression: NONE,
> in memory: false, max length: 2147483647, bloom filter: none},
> anchor:={name: anchor, max versions: 3, compression: NONE, in memory:
> false, max length: 2147483647, bloom filter: none}, misc:={name: misc, max
> versions: 3, compression: NONE, in memory: false, max length: 2147483647,
> bloom filter: none}, page:={name: page, max versions: 3, compression: NONE,
> in memory: false, max length: 2147483647, bloom filter: none},
> redirect:={name: redirect, max versions: 3, compression: NONE, in memory:
> false, max length: 2147483647, bloom filter: none}}}
> 2008-03-25 19:19:56,885 DEBUG org.apache.hadoop.hbase.HMaster: Main
> processing loop: ProcessRegionClose of
> enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482, true, false
> 2008-03-25 19:19:56,885 INFO org.apache.hadoop.hbase.HMaster: region closed:
> enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482
> 2008-03-25 19:19:56,887 INFO org.apache.hadoop.hbase.HMaster: reassign
> region: enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482
> 2008-03-25 19:19:57,288 INFO org.apache.hadoop.hbase.HMaster: assigning
> region enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482 to server
> XX.XX.XX.189:60020
> 2008-03-25 19:20:00,296 DEBUG org.apache.hadoop.hbase.HMaster: Received
> MSG_REPORT_PROCESS_OPEN :
> enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482 from XX.XX.XX.189:60020
> 2008-03-25 19:20:16,885 DEBUG org.apache.hadoop.hbase.HMaster: Received
> MSG_REPORT_OPEN : enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482 from
> XX.XX.XX.221:60020
> 2008-03-25 19:20:16,885 DEBUG org.apache.hadoop.hbase.HMaster: region server
> XX.XX.XX.221:60020 should not have opened region
> enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482
> 2008-03-25 19:23:51,707 DEBUG org.apache.hadoop.hbase.HMaster: shutdown
> scanner looking at enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482
> 2008-03-25 19:23:51,834 DEBUG org.apache.hadoop.hbase.HMaster: shutdown
> scanner looking at enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482
> 2008-03-25 19:23:53,947 INFO org.apache.hadoop.hbase.HMaster: assigning
> region enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482 to server
> XX.XX.XX.97:60020
> ...
> {code}