[
https://issues.apache.org/jira/browse/HBASE-3660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008394#comment-13008394
]
Cosmin Lehene commented on HBASE-3660:
--------------------------------------
LZO not working would indeed be a bigger problem.
However, I mentioned LZO because it made the issue easier to spot, but it is not
necessary to cause the problem.
The question is: is it OK, when a region is unavailable, to have clients
contacting other region servers? I was thinking this could lead to other
problems. The solution I was thinking about was not to remove the old server
address from .META., but to mark that the region is not actually deployed.
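As a rough illustration of what I mean (the info:deployed qualifier and the helper below are made up for this example, not existing HBase schema or API): the master could leave info:server in place and just flag the row, and clients or the master could check that flag before trusting the address.

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class RegionDeployedFlag {
  // Hypothetical qualifier in the catalog family; not part of the current .META. schema.
  private static final byte[] DEPLOYED = Bytes.toBytes("deployed");

  /** Flag the region as not deployed, leaving the old info:server value in place. */
  public static void markUndeployed(Configuration conf, byte[] metaRow) throws IOException {
    HTable meta = new HTable(conf, ".META.");
    try {
      Put p = new Put(metaRow);
      p.add(HConstants.CATALOG_FAMILY, DEPLOYED, Bytes.toBytes(false));
      meta.put(p);
    } finally {
      meta.close();
    }
  }

  /** Clients (or the master) would check this before trusting info:server. */
  public static boolean isDeployed(Configuration conf, byte[] metaRow) throws IOException {
    HTable meta = new HTable(conf, ".META.");
    try {
      Result r = meta.get(new Get(metaRow));
      byte[] v = r.getValue(HConstants.CATALOG_FAMILY, DEPLOYED);
      // A missing flag means "deployed" so existing rows keep working.
      return v == null || Bytes.toBoolean(v);
    } finally {
      meta.close();
    }
  }
}
{code}

That way a region that fails to open (LZO missing, whatever) stays visibly "known but not deployed" instead of looking like it still lives on the old server.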
I'm seeing this on my laptop when I switch networks. I retested a network
switch:
Shut down everything in network A (192.168.2.0).
Start everything (including ZK and HDFS) in network B (10.131.171.0).
When starting HBase I get this:
In HMaster:
2011-03-18 11:40:38,953 INFO
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: hlog file splitting
completed in 7944 ms for
hdfs://localhost:9000/hbase/.logs/192.168.2.102,60020,1300389033686
2011-03-18 11:40:58,998 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting
to server: 192.168.2.102/192.168.2.102:60020
2011-03-18 11:41:20,000 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting
to server: 192.168.2.102/192.168.2.102:60020
2011-03-18 11:41:25,163 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled
exception. Starting shutdown.
java.net.SocketException: Network is unreachable
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
Then it shuts down.
In HRegionServer:
2011-03-18 11:39:24,138 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Attempting connect to
Master server at 192.168.2.102:60000
2011-03-18 11:39:44,172 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting
to server: 192.168.2.102/192.168.2.102:60000
2011-03-18 11:40:05,172 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting
to server: 192.168.2.102/192.168.2.102:60000
2011-03-18 11:40:26,174 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting
to server: 192.168.2.102/192.168.2.102:60000
2011-03-18 11:40:26,175 WARN
org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to connect to
master. Retrying. Error was:
java.net.SocketTimeoutException: 20000 millis timeout while waiting for channel
to be ready for connect. ch :
java.nio.channels.SocketChannel[connection-pending
remote=192.168.2.102/192.168.2.102:60000]
at
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:213)
...
2011-03-18 11:40:29,180 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Attempting connect to
Master server at 10.131.171.219:60000
2011-03-18 11:40:29,297 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at
10.131.171.219:60000
2011-03-18 11:40:29,300 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at
10.131.171.219:60000 that we are up
2011-03-18 11:40:29,329 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Master passed us address to
use. Was=10.131.171.219:60020, Now=10.131.171.219:60020
2011-03-18 11:40:29,331 DEBUG
org.apache.hadoop.hbase.regionserver.HRegionServer: Config from master:
fs.default.name=hdfs://localhost:9000/hbase
...
2011-03-18 11:40:30,784 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server
handler 9 on 60020: starting
2011-03-18 11:40:30,784 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Serving as
10.131.171.219,60020,1300441163636, RPC listening on /10.131.171.219:60020,
sessionid=0x12ec85503600002
2011-03-18 11:40:30,795 INFO org.apache.hadoop.hbase.regionserver.StoreFile:
Allocating LruBlockCache with maximum size 199.2m
2011-03-18 11:41:27,876 DEBUG
org.apache.hadoop.hbase.regionserver.HRegionServer: No master found, will retry
Since HMaster is dead, I start it again:
2011-03-18 12:04:32,863 INFO org.apache.hadoop.hbase.master.ServerManager:
Waiting on regionserver(s) count to settle; currently=1
2011-03-18 12:04:34,364 INFO org.apache.hadoop.hbase.master.ServerManager:
Finished waiting for regionserver count to settle; count=1, sleptFor=4500
2011-03-18 12:04:34,364 INFO org.apache.hadoop.hbase.master.ServerManager:
Exiting wait on regionserver(s) to checkin; count=1, stopped=false, count of
regions out on cluster=0
2011-03-18 12:04:34,368 INFO org.apache.hadoop.hbase.master.MasterFileSystem:
Log folder hdfs://localhost:9000/hbase/.logs/10.131.171.219,60020,1300441163636
belongs to an existing region server
2011-03-18 12:04:54,057 DEBUG org.apache.hadoop.hbase.client.MetaScanner:
Scanning .META. starting at row= for max=2147483647 rows
2011-03-18 12:04:54,063 DEBUG
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
Lookedup root region location,
connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@63e708b2;
hsa=192.168.2.102:60020
2011-03-18 12:04:54,390 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting
to server: 192.168.2.102/192.168.2.102:60020
2011-03-18 12:05:15,391 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting
to server: 192.168.2.102/192.168.2.102:60020
2011-03-18 12:05:36,392 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting
to server: 192.168.2.102/192.168.2.102:60020
2011-03-18 12:05:36,393 DEBUG org.apache.hadoop.hbase.catalog.CatalogTracker:
Timed out connecting to 192.168.2.102:60020
2011-03-18 12:05:36,394 INFO
org.apache.hadoop.hbase.catalog.RootLocationEditor: Unsetting ROOT region
location in ZooKeeper
2011-03-18 12:05:36,409 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
master:60000-0x12ec85503600004 Creating (or updating) unassigned node for
70236052 with OFFLINE state
2011-03-18 12:05:36,424 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
No previous transition plan was found (or we are ignoring an existing plan) for
-ROOT-,,0.70236052 so generated a random one; hri=-ROOT-,,0.70236052, src=,
dest=10.131.171.219,60020,1300441163636; 1 (online=1, exclude=null) available
servers
2011-03-18 12:05:36,425 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
Assigning region -ROOT-,,0.70236052 to 10.131.171.219,60020,1300441163636
2011-03-18 12:05:36,425 DEBUG org.apache.hadoop.hbase.master.ServerManager: New
connection to 10.131.171.219,60020,1300441163636
2011-03-18 12:05:56,395 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting
to server: 192.168.2.102/192.168.2.102:60020
2011-03-18 12:06:08,899 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed
out: -ROOT-,,0.70236052 state=PENDING_OPEN, ts=1300442736425
2011-03-18 12:06:08,901 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Region has been PENDING_OPEN for too long, reassigning region=-ROOT-,,0.70236052
2011-03-18 12:06:08,901 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
Forcing OFFLINE; was=-ROOT-,,0.70236052 state=PENDING_OPEN, ts=1300442736425
2011-03-18 12:06:17,397 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting
to server: 192.168.2.102/192.168.2.102:60020
2011-03-18 12:06:38,399 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting
to server: 192.168.2.102/192.168.2.102:60020
...
2011-03-18 12:06:57,814 DEBUG org.apache.hadoop.hbase.client.MetaScanner:
Scanning .META. starting at row= for max=2147483647 rows
2011-03-18 12:06:57,817 DEBUG
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
Lookedup root region location,
connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@63e708b2;
hsa=10.131.171.219:60020
2011-03-18 12:06:58,051 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled
exception. Starting shutdown.
java.net.SocketException: Network is unreachable
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
HMaster kills itself again. Stopping the regionserver and starting it again
along with HMaster yields the same results.
And so on. At some point, after a few restarts, it will start and work (at least
until you change IPs again).
It's not clear (to me) whether the stale data is only in .META. or whether it
could be in ZK as well.
My point is that this is not an LZO issue.
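For what it's worth, here is a quick way one could dump both places to see which side still holds the old 192.168.2.102 address. This is only a sketch: it assumes the default /hbase/root-region-server znode, a local ZK quorum at localhost:2181, and the info:server column in .META.; the znode payload encoding may differ by version.

{code:java}
import java.util.concurrent.CountDownLatch;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class DumpRegionLocations {
  public static void main(String[] args) throws Exception {
    // 1. Root region location kept in ZooKeeper (default znode path and local quorum assumed).
    final CountDownLatch connected = new CountDownLatch(1);
    ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, new Watcher() {
      public void process(WatchedEvent event) {
        if (event.getState() == Event.KeeperState.SyncConnected) {
          connected.countDown();
        }
      }
    });
    connected.await();
    byte[] rootLocation = zk.getData("/hbase/root-region-server", false, null);
    // Printed as a raw string; the exact encoding of the znode data may differ.
    System.out.println("ZK root-region-server: " + Bytes.toString(rootLocation));
    zk.close();

    // 2. Region server addresses recorded in .META. (info:server column).
    Configuration conf = HBaseConfiguration.create();
    HTable meta = new HTable(conf, ".META.");
    ResultScanner scanner = meta.getScanner(new Scan());
    for (Result r : scanner) {
      byte[] server = r.getValue(HConstants.CATALOG_FAMILY, Bytes.toBytes("server"));
      System.out.println(Bytes.toString(r.getRow()) + " -> "
          + (server == null ? "<unassigned>" : Bytes.toString(server)));
    }
    scanner.close();
    meta.close();
  }
}
{code}

Comparing the address in ZK with the ones listed in .META. right after the network switch should show which side is still pointing at 192.168.2.102.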
> If regions assignment fails, clients will be directed to stale data from
> .META.
> -------------------------------------------------------------------------------
>
> Key: HBASE-3660
> URL: https://issues.apache.org/jira/browse/HBASE-3660
> Project: HBase
> Issue Type: Bug
> Components: master, regionserver
> Affects Versions: 0.90.1
> Reporter: Cosmin Lehene
> Fix For: 0.90.2
>
>
> I've noticed this when the IP on my machine changed (it's even easier to
> detect when LZO doesn't work)
> Master loads .META. successfully and then starts assigning regions.
> However LZO doesn't work so HRegionServer can't open the regions.
> A client attempts to get data from a table so it reads the location from
> .META. but goes to a totally different server (the old value in .META.)
> This could happen without the LZO story too.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira