[
https://issues.apache.org/jira/browse/HBASE-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jean-Daniel Cryans updated HBASE-1357:
--------------------------------------
Attachment: hbase-1357-v4.patch
This latest version of the patch clears up hbase.cluster.distributed from
zoo.cfg, as Stack described. The following exception is thrown when
hbase.cluster.distributed is true and the server value in zoo.cfg is localhost:
{quote}
localhost: starting zookeeper, logging to
/home/jdcryans/svn/hbase/trunk/bin/../logs/hbase-jdcryans-zookeeper-jdcryans.mtl.out
localhost: java.io.IOException: The server in zoo.cfg cannot be set to
localhost in a fully-distributed setup because it won't be reachable. See
"Getting Started" for more information.
localhost: at
org.apache.hadoop.hbase.zookeeper.HQuorumPeer.parseConfig(HQuorumPeer.java:141)
localhost: at
org.apache.hadoop.hbase.zookeeper.HQuorumPeer.parseZooKeeperConfig(HQuorumPeer.java:82)
localhost: at
org.apache.hadoop.hbase.zookeeper.HQuorumPeer.main(HQuorumPeer.java:58)
{quote}
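For illustration, here is a minimal sketch of the kind of validation that produces
the error above. It is an assumption about the shape of the check, not the actual
HQuorumPeer.parseConfig code:
{code}
import java.io.IOException;
import java.util.Properties;

public class QuorumConfigCheck {
  /**
   * Reject a zoo.cfg that points a quorum server at localhost while
   * hbase.cluster.distributed is true, since other nodes could never
   * reach such a ZooKeeper server.
   */
  static void validateQuorumServers(Properties zkProps, boolean distributed)
      throws IOException {
    if (!distributed) {
      return; // localhost is fine in standalone/pseudo-distributed mode
    }
    for (String key : zkProps.stringPropertyNames()) {
      if (!key.startsWith("server.")) {
        continue;
      }
      // zoo.cfg server entries look like: server.0=host:2888:3888
      String host = zkProps.getProperty(key).split(":")[0];
      if (host.equals("localhost") || host.equals("127.0.0.1")) {
        throw new IOException("The server in zoo.cfg cannot be set to "
            + "localhost in a fully-distributed setup because it won't be "
            + "reachable. See \"Getting Started\" for more information.");
      }
    }
  }
}
{code}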
I tested master failover on 3 nodes by doing hbase-daemons.sh start master and
then regionserver (which is kinda fun), then killed the first master, then the
second. What I first saw was all regions getting reassigned (which was supposed
to be fixed), but this was because of the "alls well" messages:
{quote}
2009-06-02 11:02:19,941 DEBUG org.apache.hadoop.hbase.master.HMaster: Started
service threads
2009-06-02 11:02:19,942 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 9 on 62000: starting
2009-06-02 11:02:19,956 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scan of 1 row(s) of meta region {server:
192.168.1.88:62020, regionname: -ROOT-,,0, startKey: <>} complete
2009-06-02 11:02:20,144 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner scan of 54 row(s) of meta region {server:
192.168.1.87:62020, regionname: .META.,,1, startKey: <>} complete
2009-06-02 11:02:20,939 DEBUG org.apache.hadoop.hbase.master.ServerManager:
Process all wells: address: 192.168.1.88:62020, startcode: 1243954293421, load:
(requests=4, regions=19, usedHeap=27, maxHeap=963) openingCount: 0,
nobalancingCount: 4
2009-06-02 11:02:20,945 DEBUG
org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Wrote out of safe mode
2009-06-02 11:02:20,945 INFO org.apache.hadoop.hbase.master.RegionManager:
exiting safe mode
2009-06-02 11:02:20,954 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Server is overloaded. Server load: 19 avg: 6.333333333333333, slop: 0.1
2009-06-02 11:02:20,954 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Choosing to reassign 12 regions. mostLoadedRegions has 10 regions in it.
2009-06-02 11:02:20,954 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Going to close region TestTable,0003047546,1242765850701
2009-06-02 11:02:20,954 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Going to close region TestTable,0011562998,1241636821854
2009-06-02 11:02:20,954 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Going to close region TestTable,0006799914,1242765888384
2009-06-02 11:02:20,954 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Going to close region TestTable,0006349363,1242765888384
2009-06-02 11:02:20,954 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Going to close region TestTable,0001399792,1242840206758
2009-06-02 11:02:20,955 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Going to close region TestTable,0000072704,1242764140942
2009-06-02 11:02:20,955 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Going to close region TestTable,0008901015,1242765794486
2009-06-02 11:02:20,955 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Going to close region TestTable,0002902745,1242765697441
2009-06-02 11:02:20,955 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Going to close region TestTable,0000283600,1242764792449
2009-06-02 11:02:20,955 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Going to close region TestTable,0000612479,1242394901854
2009-06-02 11:02:20,955 INFO org.apache.hadoop.hbase.master.RegionManager:
Skipped 0 region(s) that are in transition states
2009-06-02 11:02:21,164 DEBUG org.apache.hadoop.hbase.master.ServerManager:
Process all wells: address: 192.168.1.87:62020, startcode: 1243954293575, load:
(requests=57, regions=19, usedHeap=26, maxHeap=963) openingCount: 0,
nobalancingCount: 4
2009-06-02 11:02:21,165 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Server is overloaded. Server load: 19 avg: 12.666666666666666, slop: 0.1
2009-06-02 11:02:21,165 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Choosing to reassign 6 regions. mostLoadedRegions has 10 regions in it.
{quote}
That happens because the stored load is empty when a new region server is added,
so I added a check to instead use the load reported by the RS during the
failover inspection. So it's fixed.
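The gist of the check is something like the following sketch; the class and
method names here are stand-ins for illustration, not the actual
ServerManager/RegionManager code:
{code}
import java.util.HashMap;
import java.util.Map;

public class LoadCheckSketch {
  /** Hypothetical stand-in for the master's per-server load record. */
  static class ServerLoad {
    final int regions;
    ServerLoad(int regions) { this.regions = regions; }
  }

  private final Map<String, ServerLoad> knownLoads = new HashMap<>();

  /**
   * Load to use when deciding whether a server is overloaded. If the master
   * has no recorded load yet (e.g. the server just checked in after a master
   * failover), fall back to the load the region server itself reported
   * instead of treating it as an empty, zero-region server. That avoids the
   * skewed average that makes every other server look overloaded.
   */
  ServerLoad effectiveLoad(String serverName, ServerLoad reportedByRegionServer) {
    ServerLoad stored = knownLoads.get(serverName);
    if (stored == null || stored.regions == 0) {
      return reportedByRegionServer;
    }
    return stored;
  }
}
{code}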
> If one sets the hbase.master to 0.0.0.0 non local regionservers can't find
> the master
> -------------------------------------------------------------------------------------
>
> Key: HBASE-1357
> URL: https://issues.apache.org/jira/browse/HBASE-1357
> Project: Hadoop HBase
> Issue Type: Bug
> Components: master, regionserver
> Affects Versions: 0.20.0, 0.20.1, 0.21.0
> Environment: All
> Reporter: Alex Newman
> Assignee: Jean-Daniel Cryans
> Fix For: 0.20.0
>
> Attachments: hbase-1357-v1.patch, hbase-1357-v2.patch,
> hbase-1357-v3.patch, hbase-1357-v4.patch
>
>
> (2:11:20 PM) posix4e: so i want to run a back master on each node
> (2:11:29 PM) posix4e: and i have my hbase.master set to 0.0.0.0
> (2:14:59 PM) posix4e: each master only gets the local regionserver connecting
> (2:15:08 PM) posix4e: as it must be using that variable to know what to
> connect to
> (2:15:32 PM) nitay: the RS don't use hbase.master* anymore
> (2:15:36 PM) nitay: ohhh i think i know the problem
> (2:15:44 PM) nitay: so the RS use ZK to get the master address
> (2:15:49 PM) nitay: but the masters are writing 0.0.0.0 to it
> (2:15:58 PM) nitay: b/c they write whatever was in their conf
> (2:16:20 PM) posix4e: yea
> (2:16:42 PM) nitay: can u do a zookeeper dump of that node to verify my
> thinking?
> (2:16:55 PM) posix4e: yea
> (2:17:12 PM) nitay: it should be /hbase/master, unless u've changed the
> defaults
> (2:17:59 PM) nitay: hmm so yeah this is a problem, we solved this in RS
> (allowing 0.0.0.0) by having master actually write RS's address to ZK when it
> gets contacted
> (2:18:21 PM) nitay: so now we need to find a way to find out the _actual_
> address the master has bound to
> (2:19:47 PM) posix4e: is there a way to do that?
> (2:20:16 PM) nitay: i dont know, good question
> (2:20:18 PM) posix4e: or does it require code changes i.e. regionserver
> checking zk
> (2:20:27 PM) nitay: did u verify the master address?
> (2:20:48 PM) posix4e: one sec
> (2:21:03 PM) nitay: its almost like we want ZK to be able to tell us what
> address we're using to talk to it
> (2:21:20 PM) nitay: that assumes u dont have different NICs to talk to ZK vs.
> HBase
> (2:21:59 PM) nitay: posix4e, u can't really use the RS as far as i can tell
> b/c the RS knows nothing about the master until the master address appears in
> ZK
> (2:22:25 PM) posix4e: 0:0:0:0:0:0:0:0:60000
> (2:22:40 PM) nitay: yep that's the magic
> (2:22:45 PM) nitay: k thx for verifying
> (2:22:54 PM) nitay: u want to open up a JIRA?
> (2:22:57 PM) posix4e: but if i could tell hbase.site to just use my
> hostname:port it would work ok
> (2:22:58 PM) posix4e: yea
> (2:23:09 PM) posix4e: can i quote this conversation?
> (2:23:18 PM) nitay: yes please do
> (2:23:45 PM) nitay: also, to fix this here and now for u, u'd essentially
> need to actually set hbase.master* to the ip/host u're using
> (2:23:55 PM) nitay: and change it on each backup master to that guy's host/ip
> (2:24:02 PM) nitay: i know, its a royal PITA
> (2:24:59 PM) posix4e: yea
> (2:25:03 PM) posix4e: no problem
> (2:25:20 PM) nitay: but that should work till we find a better solution
> (2:25:21 PM) posix4e: I am trying to think how a patch would work
> (2:25:25 PM) posix4e: have a masters file?
> (2:25:44 PM) nitay: yeah if u have any ideas please offer them
> (2:25:46 PM) nitay: hmm interesting idea
> (2:26:16 PM) nitay: and then do some local gethostbyname() type thing
> checking against masters file?
> (2:26:26 PM) posix4e: yea
> (2:28:23 PM) nitay: one thing to note is we've talked about eventually
> getting to a place where any RS can be master
> (2:28:30 PM) nitay: but i like your idea
> (2:28:37 PM) nitay: post it on the JIRA
> (2:30:24 PM) nitay: i gotta run, thanks for the info posix4e - very helpful,
> its great to hear from people actually using this stuff
> (2:32:56 PM) posix4e: yep
> I also solved this by manually setting the hbase.master on each host to
> point to the local hostname, which sucks.
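The mechanics the conversation above describes: the master publishes whatever
address is in its configuration to the master znode, so a bind-to-all address
like 0.0.0.0 ends up being what remote region servers try to connect to. Below
is a rough sketch of the kind of substitution a fix needs, with illustrative
method names only (not the actual HBase/ZooKeeper wrapper API):
{code}
import java.net.InetAddress;
import java.net.UnknownHostException;

public class MasterAddressSketch {
  /** Return an address a remote region server could actually connect to. */
  static String publishableMasterAddress(String configuredHost, int port)
      throws UnknownHostException {
    String host = configuredHost;
    if ("0.0.0.0".equals(host) || host.startsWith("0:0:0:0")) {
      // A bind-to-all address is fine for listening but useless to clients;
      // substitute the host's resolvable name before writing it to ZooKeeper.
      host = InetAddress.getLocalHost().getCanonicalHostName();
    }
    return host + ":" + port;
  }
}
{code}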