[
https://issues.apache.org/jira/browse/HBASE-549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12586442#action_12586442
]
Jim Kellerman commented on HBASE-549:
-------------------------------------
So my current understanding of this issue is:
- Taking too long to open should no longer be an issue because HBASE-505 has
been addressed (unless a region server does not report in and its lease
expires?).
- We no longer offline a region if a region server cannot open it.
- Apparently we still see regions being offlined for some unexplained reason?
We discussed the idea that this issue and HBASE-543 are related. After thinking
about it, I agree.
When the master receives a close message, it knows which region server sent it.
If we were to combine this with HBASE-543, when we assign a region we could
record in the "region state" which server it was assigned to and that would
make it easy to determine if we should ignore the close message.
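To make the idea concrete, here is a rough sketch of that bookkeeping. It is
only an illustration under assumptions, not the master's actual code: the
class RegionAssignmentTracker, its nested ServerId, and the methods
recordAssignment and shouldProcessClose are all hypothetical names, and
treating a close for a region with no recorded assignment as "ignore" is an
assumption as well.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical sketch; none of these names come from the HBase code base. */
class RegionAssignmentTracker {

  /** Identity of one region server instance: address plus start code. */
  static final class ServerId {
    final String address;
    final long startCode;

    ServerId(String address, long startCode) {
      this.address = Objects.requireNonNull(address);
      this.startCode = startCode;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof ServerId)) {
        return false;
      }
      ServerId other = (ServerId) o;
      return address.equals(other.address) && startCode == other.startCode;
    }

    @Override
    public int hashCode() {
      return Objects.hash(address, startCode);
    }
  }

  // Region name -> server the region was most recently assigned to.
  private final Map<String, ServerId> assignedTo = new HashMap<>();

  /** Called when the master assigns a region; overwrites any earlier assignee. */
  synchronized void recordAssignment(String regionName, String address, long startCode) {
    assignedTo.put(regionName, new ServerId(address, startCode));
  }

  /**
   * Returns true only if the close report comes from the server currently
   * recorded for the region. Reports from any other server, or for a region
   * with no recorded assignment (an assumption of this sketch), are ignored.
   */
  synchronized boolean shouldProcessClose(String regionName, String address, long startCode) {
    ServerId current = assignedTo.get(regionName);
    return current != null && current.equals(new ServerId(address, startCode));
  }
}
```

With something like this, the late close from the first server in the log
below would be dropped, because by then the region state would name the
second server as the assignee.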
So I am thinking that these two issues could be addressed together. Stack, do
you want to take 543? I haven't done much on it beyond thinking about how to
approach it, so I don't have a lot of time invested here. If we continue to
work on two separate patches, it seems likely we are going to step on each
other's toes (not to mention there are plenty of other issues to address). If
you want 543, take it and link it to this issue.
I do like the idea of HMsg carrying an exception as an optional parameter.
> Don't CLOSE region if message is not from server that opened it or is opening
> it
> --------------------------------------------------------------------------------
>
> Key: HBASE-549
> URL: https://issues.apache.org/jira/browse/HBASE-549
> Project: Hadoop HBase
> Issue Type: Bug
> Affects Versions: 0.16.0, 0.2.0, 0.1.1, 0.1.0
> Reporter: stack
> Fix For: 0.2.0
>
>
> We assign a region to a server. It takes too long to open (HBASE-505).
> Region gets assigned to another server. Meantime original host returns a
> MSG_REPORT_CLOSE (because other regions opening messes it up moving files on
> disk out from under it). We queue a shutdown which marks the region as
> needing reassignment. Second server reports in that it successfully opened
> the region. Master tells it it should not have opened it. Churn ensues.
> Fix is to ignore the CLOSE if its reported server/startcode does not match
> that of the server currently trying to open region. Fix is not easy because
> currently we don't keep list of server info in unassigned regions.
> Here's master log snippet showing problem:
> {code}
> ...
> 2008-03-25 19:16:43,711 INFO org.apache.hadoop.hbase.HMaster: assigning
> region enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482 to server
> XX.XX.XX.220:60020
> 2008-03-25 19:16:46,725 DEBUG org.apache.hadoop.hbase.HMaster: Received
> MSG_REPORT_PROCESS_OPEN :
> enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482 from XX.XX.XX.220:60020
> 2008-03-25 19:18:06,411 DEBUG org.apache.hadoop.hbase.HMaster: shutdown
> scanner looking at enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482
> 2008-03-25 19:18:06,811 DEBUG org.apache.hadoop.hbase.HMaster: shutdown
> scanner looking at enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482
> 2008-03-25 19:19:46,841 INFO org.apache.hadoop.hbase.HMaster: assigning
> region enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482 to server
> XX.XX.XX.221:60020
> 2008-03-25 19:19:49,849 DEBUG org.apache.hadoop.hbase.HMaster: Received
> MSG_REPORT_PROCESS_OPEN :
> enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482 from XX.XX.XX.221:60020
> 2008-03-25 19:19:56,883 DEBUG org.apache.hadoop.hbase.HMaster: Received
> MSG_REPORT_CLOSE : enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482 from
> XX.XX.XX.220:60020
> 2008-03-25 19:19:56,883 INFO org.apache.hadoop.hbase.HMaster:
> XX.XX.XX.220:60020 no longer serving regionname:
> enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482, startKey:
> <iLStZ0yTnfVUziYcNVVxWV==>, endKey: <jLB27Q4hKls4tSvp64rJfF==>,
> encodedName: 1857033608, tableDesc: {name: enwiki_080103, families:
> {alternate_title:={name: alternate_title, max versions: 3, compression:
> NONE, in memory: false, max length: 2147483647, bloom filter: none},
> alternate_url:={name: alternate_url, max versions: 3, compression: NONE,
> in memory: false, max length: 2147483647, bloom filter: none},
> anchor:={name: anchor, max versions: 3, compression: NONE, in memory:
> false, max length: 2147483647, bloom filter: none}, misc:={name: misc,
> max versions: 3, compression: NONE, in memory: false, max length:
> 2147483647, bloom filter: none}, page:={name: page, max versions: 3,
> compression: NONE, in memory: false, max length: 2147483647, bloom filter:
> none}, redirect:={name: redirect, max versions: 3, compression: NONE, in
> memory: false, max length: 2147483647, bloom filter: none}}}
> 2008-03-25 19:19:56,885 DEBUG org.apache.hadoop.hbase.HMaster: Main
> processing loop: ProcessRegionClose of
> enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482, true, false
> 2008-03-25 19:19:56,885 INFO org.apache.hadoop.hbase.HMaster: region closed:
> enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482
> 2008-03-25 19:19:56,887 INFO org.apache.hadoop.hbase.HMaster: reassign
> region: enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482
> 2008-03-25 19:19:57,288 INFO org.apache.hadoop.hbase.HMaster: assigning
> region enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482 to server
> XX.XX.XX.189:60020
> 2008-03-25 19:20:00,296 DEBUG org.apache.hadoop.hbase.HMaster: Received
> MSG_REPORT_PROCESS_OPEN :
> enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482 from XX.XX.XX.189:60020
> 2008-03-25 19:20:16,885 DEBUG org.apache.hadoop.hbase.HMaster: Received
> MSG_REPORT_OPEN : enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482 from
> XX.XX.XX.221:60020
> 2008-03-25 19:20:16,885 DEBUG org.apache.hadoop.hbase.HMaster: region server
> XX.XX.XX.221:60020 should not have opened region
> enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482
> 2008-03-25 19:23:51,707 DEBUG org.apache.hadoop.hbase.HMaster: shutdown
> scanner looking at enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482
> 2008-03-25 19:23:51,834 DEBUG org.apache.hadoop.hbase.HMaster: shutdown
> scanner looking at enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482
> 2008-03-25 19:23:53,947 INFO org.apache.hadoop.hbase.HMaster: assigning
> region enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482 to server
> XX.XX.XX.97:60020
> ...
> {code}