[
https://issues.apache.org/jira/browse/HBASE-3660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008394#comment-13008394
]
Cosmin Lehene commented on HBASE-3660:
--------------------------------------
LZO not working would indeed be a bigger problem.
However, I mentioned LZO because it made the issue easier to spot, but it is not
necessary to cause the problem.
The question is: is it OK, when a region is unavailable, to have clients
contacting other region servers? I was thinking this could lead to other
problems. The solution I was thinking about was not to remove the old server
address from .META., but to mark that the region is not actually deployed.
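As a rough illustration of what I mean (the info:deployed qualifier and the helper below are made up for this example, not existing HBase schema or API): the master could leave info:server in place and just flag the row, and clients or the master could check that flag before trusting the address.

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class RegionDeployedFlag {
  // Hypothetical qualifier in the catalog family; not part of the current .META. schema.
  private static final byte[] DEPLOYED = Bytes.toBytes("deployed");

  /** Flag the region as not deployed, leaving the old info:server value in place. */
  public static void markUndeployed(Configuration conf, byte[] metaRow) throws IOException {
    HTable meta = new HTable(conf, ".META.");
    try {
      Put p = new Put(metaRow);
      p.add(HConstants.CATALOG_FAMILY, DEPLOYED, Bytes.toBytes(false));
      meta.put(p);
    } finally {
      meta.close();
    }
  }

  /** Clients (or the master) would check this before trusting info:server. */
  public static boolean isDeployed(Configuration conf, byte[] metaRow) throws IOException {
    HTable meta = new HTable(conf, ".META.");
    try {
      Result r = meta.get(new Get(metaRow));
      byte[] v = r.getValue(HConstants.CATALOG_FAMILY, DEPLOYED);
      // A missing flag means "deployed" so existing rows keep working.
      return v == null || Bytes.toBoolean(v);
    } finally {
      meta.close();
    }
  }
}
{code}

That way a region that fails to open (LZO missing, whatever) stays visibly "known but not deployed" instead of looking like it still lives on the old server.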
I'm seeing this on my laptop when I switch networks. I retested a network
switch:
Shut down everything in network A (192.168.2.0).
Start everything (including ZK and HDFS) in network B (10.131.171.0).
When starting HBase I get this:
In HMaster:
2011-03-18 11:40:38,953 INFO
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: hlog file splitting
completed in 7944 ms for
hdfs://localhost:9000/hbase/.logs/192.168.2.102,60020,1300389033686
2011-03-18 11:40:58,998 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting
to server: 192.168.2.102/192.168.2.102:60020
2011-03-18 11:41:20,000 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting
to server: 192.168.2.102/192.168.2.102:60020
2011-03-18 11:41:25,163 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled
exception. Starting shutdown.
java.net.SocketException: Network is unreachable
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
Then it shuts down.
In HRegionServer:
2011-03-18 11:39:24,138 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Attempting connect to
Master server at 192.168.2.102:60000
2011-03-18 11:39:44,172 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting
to server: 192.168.2.102/192.168.2.102:60000
2011-03-18 11:40:05,172 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting
to server: 192.168.2.102/192.168.2.102:60000
2011-03-18 11:40:26,174 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting
to server: 192.168.2.102/192.168.2.102:60000
2011-03-18 11:40:26,175 WARN
org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to connect to
master. Retrying. Error was:
java.net.SocketTimeoutException: 20000 millis timeout while waiting for channel
to be ready for connect. ch :
java.nio.channels.SocketChannel[connection-pending
remote=192.168.2.102/192.168.2.102:60000]
at
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:213)
...
2011-03-18 11:40:29,180 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Attempting connect to
Master server at 10.131.171.219:60000
2011-03-18 11:40:29,297 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at
10.131.171.219:60000
2011-03-18 11:40:29,300 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at
10.131.171.219:60000 that we are up
2011-03-18 11:40:29,329 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Master passed us address to
use. Was=10.131.171.219:60020, Now=10.131.171.219:60020
2011-03-18 11:40:29,331 DEBUG
org.apache.hadoop.hbase.regionserver.HRegionServer: Config from master:
fs.default.name=hdfs://localhost:9000/hbase
...
2011-03-18 11:40:30,784 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server
handler 9 on 60020: starting
2011-03-18 11:40:30,784 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Serving as
10.131.171.219,60020,1300441163636, RPC listening on /10.131.171.219:60020,
sessionid=0x12ec85503600002
2011-03-18 11:40:30,795 INFO org.apache.hadoop.hbase.regionserver.StoreFile:
Allocating LruBlockCache with maximum size 199.2m
2011-03-18 11:41:27,876 DEBUG
org.apache.hadoop.hbase.regionserver.HRegionServer: No master found, will retry
Since HMaster is dead, I start it again:
2011-03-18 12:04:32,863 INFO org.apache.hadoop.hbase.master.ServerManager:
Waiting on regionserver(s) count to settle; currently=1
2011-03-18 12:04:34,364 INFO org.apache.hadoop.hbase.master.ServerManager:
Finished waiting for regionserver count to settle; count=1, sleptFor=4500
2011-03-18 12:04:34,364 INFO org.apache.hadoop.hbase.master.ServerManager:
Exiting wait on regionserver(s) to checkin; count=1, stopped=false, count of
regions out on cluster=0
2011-03-18 12:04:34,368 INFO org.apache.hadoop.hbase.master.MasterFileSystem:
Log folder hdfs://localhost:9000/hbase/.logs/10.131.171.219,60020,1300441163636
belongs to an existing region server
2011-03-18 12:04:54,057 DEBUG org.apache.hadoop.hbase.client.MetaScanner:
Scanning .META. starting at row= for max=2147483647 rows
2011-03-18 12:04:54,063 DEBUG
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
Lookedup root region location,
connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@63e708b2;
hsa=192.168.2.102:60020
2011-03-18 12:04:54,390 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting
to server: 192.168.2.102/192.168.2.102:60020
2011-03-18 12:05:15,391 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting
to server: 192.168.2.102/192.168.2.102:60020
2011-03-18 12:05:36,392 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting
to server: 192.168.2.102/192.168.2.102:60020
2011-03-18 12:05:36,393 DEBUG org.apache.hadoop.hbase.catalog.CatalogTracker:
Timed out connecting to 192.168.2.102:60020
2011-03-18 12:05:36,394 INFO
org.apache.hadoop.hbase.catalog.RootLocationEditor: Unsetting ROOT region
location in ZooKeeper
2011-03-18 12:05:36,409 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
master:60000-0x12ec85503600004 Creating (or updating) unassigned node for
70236052 with OFFLINE state
2011-03-18 12:05:36,424 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
No previous transition plan was found (or we are ignoring an existing plan) for
-ROOT-,,0.70236052 so generated a random one; hri=-ROOT-,,0.70236052, src=,
dest=10.131.171.219,60020,1300441163636; 1 (online=1, exclude=null) available
servers
2011-03-18 12:05:36,425 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
Assigning region -ROOT-,,0.70236052 to 10.131.171.219,60020,1300441163636
2011-03-18 12:05:36,425 DEBUG org.apache.hadoop.hbase.master.ServerManager: New
connection to 10.131.171.219,60020,1300441163636
2011-03-18 12:05:56,395 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting
to server: 192.168.2.102/192.168.2.102:60020
2011-03-18 12:06:08,899 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed
out: -ROOT-,,0.70236052 state=PENDING_OPEN, ts=1300442736425
2011-03-18 12:06:08,901 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Region has been PENDING_OPEN for too long, reassigning region=-ROOT-,,0.70236052
2011-03-18 12:06:08,901 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
Forcing OFFLINE; was=-ROOT-,,0.70236052 state=PENDING_OPEN, ts=1300442736425
2011-03-18 12:06:17,397 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting
to server: 192.168.2.102/192.168.2.102:60020
2011-03-18 12:06:38,399 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting
to server: 192.168.2.102/192.168.2.102:60020
...
2011-03-18 12:06:57,814 DEBUG org.apache.hadoop.hbase.client.MetaScanner:
Scanning .META. starting at row= for max=2147483647 rows
2011-03-18 12:06:57,817 DEBUG
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
Lookedup root region location,
connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@63e708b2;
hsa=10.131.171.219:60020
2011-03-18 12:06:58,051 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled
exception. Starting shutdown.
java.net.SocketException: Network is unreachable
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
HMaster kills itself again. Stopping the regionserver and starting it again
along with HMaster yields the same results.
And so on. At some point, after a few restarts, it will start and work (at least
until you change IPs again).
It's not clear (to me) whether the stale data is only in .META. or whether it
could be in ZK as well.
My point is that this is not an LZO issue.
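For what it's worth, here is a quick way one could dump both places to see which side still holds the old 192.168.2.102 address. This is only a sketch: it assumes the default /hbase/root-region-server znode, a local ZK quorum at localhost:2181, and the info:server column in .META.; the znode payload encoding may differ by version.

{code:java}
import java.util.concurrent.CountDownLatch;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class DumpRegionLocations {
  public static void main(String[] args) throws Exception {
    // 1. Root region location kept in ZooKeeper (default znode path and local quorum assumed).
    final CountDownLatch connected = new CountDownLatch(1);
    ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, new Watcher() {
      public void process(WatchedEvent event) {
        if (event.getState() == Event.KeeperState.SyncConnected) {
          connected.countDown();
        }
      }
    });
    connected.await();
    byte[] rootLocation = zk.getData("/hbase/root-region-server", false, null);
    // Printed as a raw string; the exact encoding of the znode data may differ.
    System.out.println("ZK root-region-server: " + Bytes.toString(rootLocation));
    zk.close();

    // 2. Region server addresses recorded in .META. (info:server column).
    Configuration conf = HBaseConfiguration.create();
    HTable meta = new HTable(conf, ".META.");
    ResultScanner scanner = meta.getScanner(new Scan());
    for (Result r : scanner) {
      byte[] server = r.getValue(HConstants.CATALOG_FAMILY, Bytes.toBytes("server"));
      System.out.println(Bytes.toString(r.getRow()) + " -> "
          + (server == null ? "<unassigned>" : Bytes.toString(server)));
    }
    scanner.close();
    meta.close();
  }
}
{code}

Comparing the address in ZK with the ones listed in .META. right after the network switch should show which side is still pointing at 192.168.2.102.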
> If regions assignment fails, clients will be directed to stale data from
> .META.
> -------------------------------------------------------------------------------
>
> Key: HBASE-3660
> URL: https://issues.apache.org/jira/browse/HBASE-3660
> Project: HBase
> Issue Type: Bug
> Components: master, regionserver
> Affects Versions: 0.90.1
> Reporter: Cosmin Lehene
> Fix For: 0.90.2
>
>
> I've noticed this when the IP on my machine changed (it's even easier to
> detect when LZO doesn't work)
> Master loads .META. successfully and then starts assigning regions.
> However LZO doesn't work so HRegionServer can't open the regions.
> A client attempts to get data from a table so it reads the location from
> .META. but goes to a totally different server (the old value in .META.)
> This could happen without the LZO story too.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira