[
https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262499#comment-13262499
]
ramkrishna.s.vasudevan commented on HBASE-5816:
-----------------------------------------------
@Maryann
If you are planning to work on this pls go ahead. :)
> Balancer and ServerShutdownHandler concurrently reassigning the same region
> ---------------------------------------------------------------------------
>
> Key: HBASE-5816
> URL: https://issues.apache.org/jira/browse/HBASE-5816
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 0.90.6
> Reporter: Maryann Xue
> Assignee: ramkrishna.s.vasudevan
> Priority: Critical
> Attachments: HBASE-5816.patch
>
>
> The first assign thread exits with success after updating the RegionState to
> PENDING_OPEN, while the second assign follows immediately into "assign" and
> fails the RegionState check in setOfflineInZooKeeper(). This causes the
> master to abort.
> In the below case, the two concurrent assigns occurred when AM tried to
> assign a region to a dying/dead RS, and meanwhile the ShutdownServerHandler
> tried to assign this region (from the region plan) spontaneously.
> 2012-04-17 05:44:57,648 INFO org.apache.hadoop.hbase.master.HMaster: balance
> hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
> src=hadoop05.sh.intel.com,60020,1334544902186,
> dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:44:57,648 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of
> region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
> (offlining)
> 2012-04-17 05:44:57,648 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to
> serverName=hadoop05.sh.intel.com,60020,1334544902186, load=(requests=0,
> regions=0, usedHeap=0, maxHeap=0) for region
> TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
> 2012-04-17 05:44:57,666 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned
> node: /hbase/unassigned/fe38fe31caf40b6e607a3e6bbed6404b
> (region=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
> server=hadoop05.sh.intel.com,60020,1334544902186, state=RS_ZK_REGION_CLOSING)
> 2012-04-17 05:52:58,984 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE;
> was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
> state=CLOSED, ts=1334612697672,
> server=hadoop05.sh.intel.com,60020,1334544902186
> 2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> master:60000-0x236b912e9b3000e Creating (or updating) unassigned node for
> fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state
> 2012-04-17 05:52:59,096 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for
> region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.;
> plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
> src=hadoop05.sh.intel.com,60020,1334544902186,
> dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:52:59,096 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Assigning region
> TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to
> xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:54:19,159 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE;
> was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
> state=PENDING_OPEN, ts=1334613179096,
> server=xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:54:59,033 WARN
> org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of
> TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to
> serverName=xmlqa-clv16.sh.intel.com,60020,1334612497253, load=(requests=0,
> regions=0, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0
> java.net.SocketTimeoutException: Call to /10.239.47.87:60020 failed on socket
> timeout exception: java.net.SocketTimeoutException: 120000 millis timeout
> while waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.239.47.89:41302
> remote=/10.239.47.87:60020]
> at
> org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:805)
> at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:778)
> at
> org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:283)
> at $Proxy7.openRegion(Unknown Source)
> at
> org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:573)
> at
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1127)
> at
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:912)
> at
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:892)
> at
> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:92)
> at
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:162)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> Caused by: java.net.SocketTimeoutException: 120000 millis timeout while
> waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.239.47.89:41302
> remote=/10.239.47.87:60020]
> at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
> at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
> at java.io.FilterInputStream.read(FilterInputStream.java:116)
> at
> org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:301)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> at java.io.DataInputStream.readInt(DataInputStream.java:370)
> at
> org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:541)
> at
> org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:479)
> 2012-04-17 05:54:59,035 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan
> was found (or we are ignoring an existing plan) for
> TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. so
> generated a random one;
> hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
> src=, dest=hadoop06.sh.intel.com,60020,1334544901894; 7 (online=7,
> exclude=serverName=xmlqa-clv16.sh.intel.com,60020,1334612497253,
> load=(requests=0, regions=0, usedHeap=0, maxHeap=0)) available servers
> 2012-04-17 05:54:59,035 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> master:60000-0x236b912e9b3000e Creating (or updating) unassigned node for
> fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state
> 2012-04-17 05:54:59,045 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for
> region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.;
> plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
> src=, dest=hadoop06.sh.intel.com,60020,1334544901894
> 2012-04-17 05:54:59,045 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Assigning region
> TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to
> hadoop06.sh.intel.com,60020,1334544901894
> 2012-04-17 05:54:59,046 FATAL org.apache.hadoop.hbase.master.HMaster:
> Unexpected state trying to OFFLINE;
> TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
> state=PENDING_OPEN, ts=1334613299045,
> server=hadoop06.sh.intel.com,60020,1334544901894
> java.lang.IllegalStateException
> at
> org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1167)
> at
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1107)
> at
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:912)
> at
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:892)
> at
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:259)
> at
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:162)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> 2012-04-17 05:54:59,047 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira