[jira] [Updated] (HBASE-10136) the table-lock of TableEventHandler is released too early because reOpenAllRegions() is asynchronous

Matteo Bertozzi (JIRA) Tue, 17 Dec 2013 02:07:08 -0800

     [ 
https://issues.apache.org/jira/browse/HBASE-10136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Matteo Bertozzi updated HBASE-10136:
------------------------------------

    Description: 
Expected behavior:
With the introduction of the table-lock, a user can issue a request for a 
snapshot of a table while that table is undergoing an online schema change and 
expect that snapshot request to complete correctly. Also, the same is true if a 
user issues a online schema change request while a snapshot attempt is ongoing.

Observed behavior:
Snapshot attempts time out when there is an ongoing online schema change 
because the table lock is not acquired by anyone else and the regions are 
closed and opened during the snapshot. 

TableEventHandler trace
{code}
// 1. client.addColumn() call from client...

// 2. The operation is now on the master
2013-12-12 12:09:57,613 DEBUG [MASTER] lock.ZKInterProcessLockBase: Acquired a 
lock for /hbase/table-lock/TestTable/write-master:452010000000001
2013-12-12 12:09:57,640 INFO  [MASTER] handler.TableEventHandler: Handling 
table operation C_M_ADD_FAMILY on table TestTable
2013-12-12 12:09:57,685 INFO  [MASTER] master.MasterFileSystem: AddColumn. 
Table = TestTable HCD = {NAME => 'x-1386850197327', DATA_BLOCK_ENCODING => 
'NONE',$
2013-12-12 12:09:57,693 INFO  [MASTER] handler.TableEventHandler: Bucketing 
regions by region server...
...
2013-12-12 12:09:57,771 INFO  [MASTER] handler.TableEventHandler: Completed 
table operation C_M_ADD_FAMILY on table TestTable
2013-12-12 12:09:57,771 DEBUG [MASTER] master.AssignmentManager: Starting 
unassign of TestTable,,1386849056038.854b280$
2013-12-12 12:09:57,772 DEBUG [MASTER] lock.ZKInterProcessLockBase: Released 
/hbase/table-lock/TestTable/write-master:452010000000001

// 3. The Table*Handler operation is now completed, and the client notified 
with "I'm done!"

// 4. Now the BulkReopen is starting doing the reopen
2013-12-12 12:09:57,772 INFO  [MASTER] master.RegionStates: Transitioned 
{854b280006aec464083778a5cb5f5456 state=OPEN,$
{code}

  was:
Expected behavior:
A user can issue a request for a snapshot of a table while that table is 
undergoing an online schema change and expect that snapshot request to complete 
correctly. Also, the same is true if a user issues a online schema change 
request while a snapshot attempt is ongoing.

Observed behavior:
Snapshot attempts time out when there is an ongoing online schema change 
because the region is closed and opened during the snapshot. 

As a side-note, I would expect that the attempt should fail quickly as opposed 
to timing out. 

Further, what I have seen is that subsequent attempts to snapshot the table 
fail because of some state/cleanup issues. This is also concerning.

Immediate error:
{code}type=FLUSH }' is still in progress!
2013-12-11 15:58:32,883 DEBUG [Thread-385] client.HBaseAdmin(2696): (#11) 
Sleeping: 10000ms while waiting for snapshot completion.
2013-12-11 15:58:42,884 DEBUG [Thread-385] client.HBaseAdmin(2704): Getting 
current status of snapshot from master...
2013-12-11 15:58:42,887 DEBUG [FifoRpcScheduler.handler1-thread-3] 
master.HMaster(2891): Checking to see if snapshot from request:{ ss=snapshot0 
table=changeSchemaDuringSnapshot1386806258640 type=FLUSH } is done
2013-12-11 15:58:42,887 DEBUG [FifoRpcScheduler.handler1-thread-3] 
snapshot.SnapshotManager(374): Snapshoting '{ ss=snapshot0 
table=changeSchemaDuringSnapshot1386806258640 type=FLUSH }' is still in 
progress!
Snapshot failure occurred
org.apache.hadoop.hbase.snapshot.SnapshotCreationException: Snapshot 
'snapshot0' wasn't completed in expectedTime:60000 ms
        at 
org.apache.hadoop.hbase.client.HBaseAdmin.snapshot(HBaseAdmin.java:2713)
        at 
org.apache.hadoop.hbase.client.HBaseAdmin.snapshot(HBaseAdmin.java:2638)
        at 
org.apache.hadoop.hbase.client.HBaseAdmin.snapshot(HBaseAdmin.java:2602)
        at 
org.apache.hadoop.hbase.client.TestAdmin$BackgroundSnapshotThread.run(TestAdmin.java:1974){code}

Likely root cause of error:
{code}Exception in SnapshotSubprocedurePool
java.util.concurrent.ExecutionException: 
org.apache.hadoop.hbase.NotServingRegionException: 
changeSchemaDuringSnapshot1386806258640,77777777,1386806258720.ea776db51749e39c956d771a7d17a0f3.
 is closing
        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
        at java.util.concurrent.FutureTask.get(FutureTask.java:83)
        at 
org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager$SnapshotSubprocedurePool.waitForOutstandingTasks(RegionServerSnapshotManager.java:314)
        at 
org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure.flushSnapshot(FlushSnapshotSubprocedure.java:118)
        at 
org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure.insideBarrier(FlushSnapshotSubprocedure.java:137)
        at 
org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:181)
        at 
org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:1)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.hbase.NotServingRegionException: 
changeSchemaDuringSnapshot1386806258640,77777777,1386806258720.ea776db51749e39c956d771a7d17a0f3.
 is closing
        at 
org.apache.hadoop.hbase.regionserver.HRegion.startRegionOperation(HRegion.java:5327)
        at 
org.apache.hadoop.hbase.regionserver.HRegion.startRegionOperation(HRegion.java:5289)
        at 
org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure$RegionSnapshotTask.call(FlushSnapshotSubprocedure.java:79)
        at 
org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure$RegionSnapshotTask.call(FlushSnapshotSubprocedure.java:1)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        ... 5 more{code}


> the table-lock of TableEventHandler is released too early because 
> reOpenAllRegions() is asynchronous
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-10136
>                 URL: https://issues.apache.org/jira/browse/HBASE-10136
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.96.0, 0.98.1, 0.99.0
>            Reporter: Aleksandr Shulman
>            Assignee: Matteo Bertozzi
>              Labels: online_schema_change
>         Attachments: HBASE-10136-trunk.patch
>
>
> Expected behavior:
> With the introduction of the table-lock, a user can issue a request for a 
> snapshot of a table while that table is undergoing an online schema change 
> and expect that snapshot request to complete correctly. Also, the same is 
> true if a user issues a online schema change request while a snapshot attempt 
> is ongoing.
> Observed behavior:
> Snapshot attempts time out when there is an ongoing online schema change 
> because the table lock is not acquired by anyone else and the regions are 
> closed and opened during the snapshot. 
> TableEventHandler trace
> {code}
> // 1. client.addColumn() call from client...
> // 2. The operation is now on the master
> 2013-12-12 12:09:57,613 DEBUG [MASTER] lock.ZKInterProcessLockBase: Acquired 
> a lock for /hbase/table-lock/TestTable/write-master:452010000000001
> 2013-12-12 12:09:57,640 INFO  [MASTER] handler.TableEventHandler: Handling 
> table operation C_M_ADD_FAMILY on table TestTable
> 2013-12-12 12:09:57,685 INFO  [MASTER] master.MasterFileSystem: AddColumn. 
> Table = TestTable HCD = {NAME => 'x-1386850197327', DATA_BLOCK_ENCODING => 
> 'NONE',$
> 2013-12-12 12:09:57,693 INFO  [MASTER] handler.TableEventHandler: Bucketing 
> regions by region server...
> ...
> 2013-12-12 12:09:57,771 INFO  [MASTER] handler.TableEventHandler: Completed 
> table operation C_M_ADD_FAMILY on table TestTable
> 2013-12-12 12:09:57,771 DEBUG [MASTER] master.AssignmentManager: Starting 
> unassign of TestTable,,1386849056038.854b280$
> 2013-12-12 12:09:57,772 DEBUG [MASTER] lock.ZKInterProcessLockBase: Released 
> /hbase/table-lock/TestTable/write-master:452010000000001
> // 3. The Table*Handler operation is now completed, and the client notified 
> with "I'm done!"
> // 4. Now the BulkReopen is starting doing the reopen
> 2013-12-12 12:09:57,772 INFO  [MASTER] master.RegionStates: Transitioned 
> {854b280006aec464083778a5cb5f5456 state=OPEN,$
> {code}



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

[jira] [Updated] (HBASE-10136) the table-lock of TableEventHandler is released too early because reOpenAllRegions() is asynchronous

Reply via email to