[
https://issues.apache.org/jira/browse/HBASE-549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12613735#action_12613735
]
Izaak Rubin commented on HBASE-549:
-----------------------------------
It's hard to tell exactly what is going on, since (a) the code has changed a
lot since the creation of this issue, and (b) it's very hard/rare to replicate
this problem. However, after looking at this issue for several hours, here's
what I've been able to glean:
I think the basic steps going on are as follows:
* We tell the ServerManager (working with HMaster) to start opening a region
on a server.
* The ServerManager won't mark the HRegionInfo object (region) as being online
until it actually receives a success message from the server. The server sends
in periodic messages saying that it is working on opening the region
(MSG_REPORT_PROCESS_OPEN), but this only extends the lease or time limit on the
server.
* For whatever reason (maybe the server goes out-of-touch for too long?), the
master decides that the server is taking too long, and it asks a new server to
start opening the region. The old server sends a MSG_REPORT_CLOSE to the
master (not totally sure why).
* The ServerManager, in processing the MSG_REPORT_CLOSE from the old server,
sees that the region was never assigned. It therefore assumes that the intent
of the close message is to either offline or delete the region, and a
ProcessRegionClose object is created.
Beyond this I am unsure of what happens. It seems to me that the
ProcessRegionClose object would try to delete the region since (I think) it
hasn't been marked for "offline-ing," but that doesn't seem right. Either way,
it's my impression that the close message *will not* lead to the region being
reassigned. I think this is a part of the code that might have changed since
the issue was created. The problem may still exist, but in a slightly
different form? The logs posted above show problems arising when the second
server finally opens the region successfully - the master complains that this
is wrong, that someone else should be opening the region. Maybe now the
problem would be different - the master would complain because it believes that
nobody should have the region (since the close message offlined or deleted the
region).
I want to be totally clear that my analysis is completely based on the old logs
and on poking through the current code (without even running the code) - I
apologize and expect that it isn't very accurate.
Either way, it's clear that a lot of problems originate from the close message
sent by the original (failed) server. The master needs to be made "smarter"
such that it would ignore a message if it's from someone who isn't currently
processing the relevant region. In essence, the master needs to know - for
each region that is either open, being opened, or being closed - which server
is responsible. Or, if we really want to be minimalist and avoid overhead,
just keep a mapping for all regions that are in the process of being opened.
Maybe RegionManager would be an appropriate place to keep this mappping. As it
is, RegionManager keeps a Set of all regions that are "pending." Perhaps all
we would have to do is turn the Set into a Map, where the key is the region and
the value is the server? I'm sure there are many other modifications that
would work, but this one seems reasonable to me.
> Don't CLOSE region if message is not from server that opened it or is opening
> it
> --------------------------------------------------------------------------------
>
> Key: HBASE-549
> URL: https://issues.apache.org/jira/browse/HBASE-549
> Project: Hadoop HBase
> Issue Type: Bug
> Affects Versions: 0.16.0, 0.1.0, 0.1.1, 0.2.0
> Reporter: stack
> Assignee: stack
> Fix For: 0.2.0
>
>
> We assign a region to a server. It takes too long to open (HBASE-505).
> Region gets assigned to another server. Meantime original host returns a
> MSG_REPORT_CLOSE (because other regions opening messes it up moving files on
> disk out from under it). We queue a shutdown which marks the region as
> needing reassignment. Second server reports in that it successfully opened
> the region. Master tells it it should not have opened it. Churn ensues.
> Fix is to ignore the CLOSE if its reported server/startcode does not match
> that of the server currently trying to open region. Fix is not easy because
> currently we don't keep list of server info in unassigned regions.
> Here's master log snippet showing problem:
> {code}
> ...
> 2008-03-25 19:16:43,711 INFO org.apache.hadoop.hbase.HMaster: assigning
> region enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482 to server
> XX.XX.XX.220:60020
> 2008-03-25 19:16:46,725 DEBUG org.apache.hadoop.hbase.HMaster: Received
> MSG_REPORT_PROCESS_OPEN :
> enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482 from XX.XX.XX.220:60020
> 2008-03-25 19:18:06,411 DEBUG org.apache.hadoop.hbase.HMaster: shutdown
> scanner looking at enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482
> 2008-03-25 19:18:06,811 DEBUG org.apache.hadoop.hbase.HMaster: shutdown
> scanner looking at enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482
> 2008-03-25 19:19:46,841 INFO org.apache.hadoop.hbase.HMaster: assigning
> region enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482 to server
> XX.XX.XX.221:60020
> 2008-03-25 19:19:49,849 DEBUG org.apache.hadoop.hbase.HMaster: Received
> MSG_REPORT_PROCESS_OPEN :
> enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482 from XX.XX.XX.221:60020
> 2008-03-25 19:19:56,883 DEBUG org.apache.hadoop.hbase.HMaster: Received
> MSG_REPORT_CLOSE : enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482 from
> XX.XX.XX.220:60020
> 2008-03-25 19:19:56,883 INFO org.apache.hadoop.hbase.HMaster:
> XX.XX.XX.220:60020 no longer serving regionname:
> enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482, startKey:
> <iLStZ0yTnfVUziYcNVVxWV==>, endKey: <jLB27Q4hKls4tSvp64rJfF==
> >, encodedName: 1857033608, tableDesc: {name: enwiki_080103, families:
> >{alternate_title:={name: alternate_title, max versions: 3, compression:
> >NONE, in memory: false, max length: 2147483647, bloom filter: none},
> >alternate_url:={name: al
> ternate_url, max versions: 3, compression: NONE, in memory: false, max
> length: 2147483647, bloom filter: none}, anchor:={name: anchor, max versions:
> 3, compression: NONE, in memory: false, max length: 2147483647, bloom filter:
> none}, mi
> sc:={name: misc, max versions: 3, compression: NONE, in memory: false, max
> length: 2147483647, bloom filter: none}, page:={name: page, max versions: 3,
> compression: NONE, in memory: false, max length: 2147483647, bloom filter:
> none}, re
> direct:={name: redirect, max versions: 3, compression: NONE, in memory:
> false, max length: 2147483647, bloom filter: none}}}
> 2008-03-25 19:19:56,885 DEBUG org.apache.hadoop.hbase.HMaster: Main
> processing loop: ProcessRegionClose of
> enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482, true, false
> 2008-03-25 19:19:56,885 INFO org.apache.hadoop.hbase.HMaster: region closed:
> enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482
> 2008-03-25 19:19:56,887 INFO org.apache.hadoop.hbase.HMaster: reassign
> region: enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482
> 2008-03-25 19:19:57,288 INFO org.apache.hadoop.hbase.HMaster: assigning
> region enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482 to server
> XX.XX.XX.189:60020
> 2008-03-25 19:20:00,296 DEBUG org.apache.hadoop.hbase.HMaster: Received
> MSG_REPORT_PROCESS_OPEN :
> enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482 from XX.XX.XX.189:60020
> 2008-03-25 19:20:16,885 DEBUG org.apache.hadoop.hbase.HMaster: Received
> MSG_REPORT_OPEN : enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482 from
> XX.XX.XX.221:60020
> 2008-03-25 19:20:16,885 DEBUG org.apache.hadoop.hbase.HMaster: region server
> XX.XX.XX.221:60020 should not have opened region
> enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482
> 2008-03-25 19:23:51,707 DEBUG org.apache.hadoop.hbase.HMaster: shutdown
> scanner looking at enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482
> 2008-03-25 19:23:51,834 DEBUG org.apache.hadoop.hbase.HMaster: shutdown
> scanner looking at enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482
> 2008-03-25 19:23:53,947 INFO org.apache.hadoop.hbase.HMaster: assigning
> region enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482 to server
> XX.XX.XX.97:60020
> ...
> {code}
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.