BaseScanner says "Current assignment of X is not valid" over and over for same
region
-------------------------------------------------------------------------------------
Key: HBASE-2599
URL: https://issues.apache.org/jira/browse/HBASE-2599
Project: Hadoop HBase
Issue Type: Bug
Reporter: stack
>From IRC today
{code}
12:41 < cmorgan> hey guys. I'm having a recent issue with a single node
cluster running 0.20.4. After stopping for a backup I now get region assignment
churn. Seems master keeps thinking that region
assignment is not valid even when it is. Following is a log
snippet:
12:41 < cmorgan> [21/05/10 00:59:42] 3443246 [ HMaster] DEBUG
ter.RegionServerOperationQueue - Processing todo: PendingOpenOperation from
localhost.,7802,1274425405680
12:41 < cmorgan> [21/05/10 00:59:42] 3443246 [ HMaster] INFO
e.master.RegionServerOperation -
net_troove_coin_account_AccountCredentials,,1234913258116 open on 127.0.0.1:7802
12:41 < cmorgan> [21/05/10 00:59:42] 3443246 [ HMaster] INFO
e.master.RegionServerOperation - Updated row
net_troove_coin_account_AccountCredentials,,1234913258116 in region .META.,,1
with
startcode=1274425405680, server=127.0.0.1:7802
12:41 < cmorgan> [21/05/10 00:59:42] 3443246 [ HMaster] DEBUG
ter.RegionServerOperationQueue - Processing todo: PendingOpenOperation from
localhost.,7802,1274425405680
12:41 < cmorgan> [21/05/10 00:59:42] 3443246 [ HMaster] INFO
e.master.RegionServerOperation -
net_troove_application_request_TemporaryRequest,,1234913268355 open on
127.0.0.1:7802
12:41 < cmorgan> [21/05/10 00:59:42] 3443247 [ HMaster] INFO
e.master.RegionServerOperation - Updated row
net_troove_application_request_TemporaryRequest,,1234913268355 in region
.META.,,1 with
startcode=1274425405680, server=127.0.0.1:7802
12:41 < cmorgan> [21/05/10 00:59:42] 3443247 [ger.metaScanner] DEBUG
adoop.hbase.master.BaseScanner - Current assignment of
net_troove_coin_account_AccountEntry,,1271448856984 is not valid;
serverAddress=127.0.0.1:7802, startCode=1274425405680 unknown.
12:41 < cmorgan> [21/05/10 00:59:42] 3443248 [ger.metaScanner] DEBUG
adoop.hbase.master.BaseScanner - Current assignment of
net_troove_coin_account_AccountEntry-Base_EntryDay_DESCENDING,,1273266418876
is not valid; serverAddress=127.0.0.1:7802,
startCode=1274425405680 unknown.
12:41 < cmorgan> [21/05/10 00:59:42] 3443251 [ger.metaScanner] DEBUG
adoop.hbase.master.BaseScanner - Current assignment of
net_troove_coin_bank_BankStatement,,1266433980935 is not valid;
serverAddress=127.0.0.1:7802, startCode=1274425405680 unknown.
12:58 < cmorgan> stack: I'd been running with 0.20.4 for a week or so
starting/stopping every night. Now this happens...
14:11 < cmorgan> stack: some more info: On our mini production server the
regionserver is getting "My address is localhost.:7802" (notice the dot after
localhost). But the master is also sometimes
referring to it as 127.0.0.1. I just used the same data and
config on my laptop, and its binding to my external LAN ip ("My address is
10.0.1.4:7802"). Under this setup hbase comes up
stable (no region assignment churn).
{code}
Looking at this, I think issue is that when we register a server we use a
getServerName on a HServerInfo provided by the regionserver (though we are on
the master side) but BaseScanner uses a getServerName that is made by doing a
dns lookup using the IP that it finds in the server column of .META. My sense
is that is possible for the regionserver hostname and what the master finds
when it does a lookup against dns can disagree, fatally.
This issue seems popular over last few weeks. Was reported at least once more
on a standalone instance and also on krispykola's 15-node ec2 cluster (He went
back to 0.20.3 and then it went away?). It made for what looked like
double-assignment in his case (Our attempt at caching DNS names may be amiss --
I tihnk tht the main diff between 0.20.3 and 0.20.4 in this area).
My thought is to purge DNS from the HServerInfo passed by the RS to Master on
startup and heartbeating and to use IPs only (and even then, the IP that the
master tells the RS to use, its remote address as seen by the master). We
might have to do this fix for 0.20.5 since it seems to happen more in 0.20.4.
I'm looking into this. Opinions welcome.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.