[ https://issues.apache.org/jira/browse/HBASE-10136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849407#comment-13849407 ]

Matteo Bertozzi commented on HBASE-10136:
-----------------------------------------

As I pointed out in my comment above, we can fix these issues case by case, 
but the main problems are still there.
If you implement a new handler, you have to keep a set of rules in mind to 
make everything work.
In this case, for example, the end of the handler is not the end of the 
operation, so the lock is released early.
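
To make that concrete, here is a rough sketch of the failure mode (hypothetical names; the real handler and lock classes differ):
{code}
import java.io.IOException;

// Sketch only: the lock's scope is process(), not the whole operation.
public abstract class HandlerSketch {
  public interface TableLock { void release(); }  // hypothetical

  private final TableLock lock;

  protected HandlerSketch(TableLock lock) {
    this.lock = lock;
  }

  public void process() throws IOException {
    try {
      handleTableOperation();  // e.g. kicks off region close/reopen; the
                               // reopen can still be in flight when this
                               // method returns
    } finally {
      lock.release();          // released at the end of the handler, not
                               // at the end of the operation: a concurrent
                               // snapshot can acquire the lock and then
                               // hit a region that is still closing
    }
  }

  protected abstract void handleTableOperation() throws IOException;
}
{code}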
Also in this case, the master calls handler.process() directly instead of 
submitting the handler to the executor pool, in order to make the client 
operation synchronous.
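
For comparison, the two dispatch paths look roughly like this (simplified sketch, reusing the HandlerSketch above; the master's actual code differs):
{code}
import java.io.IOException;
import java.util.concurrent.ExecutorService;

class DispatchSketch {
  // Asynchronous path: the client RPC returns once the handler is queued.
  static void dispatchAsync(ExecutorService pool, final HandlerSketch handler) {
    pool.submit(new Runnable() {
      public void run() {
        try {
          handler.process();
        } catch (IOException e) { /* logged in real code */ }
      }
    });
  }

  // Synchronous path described above: run inline on the RPC thread, so the
  // client call returns only when process() returns -- which is only truly
  // synchronous if process() covers the whole operation.
  static void dispatchSync(HandlerSketch handler) throws IOException {
    handler.process();
  }
}
{code}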
In the delete table case, the last operation must be the removal of the table 
descriptor; otherwise the client call will not be synchronous.
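
That is because the client infers "done" by polling; roughly like this (hypothetical loop, with Admin/tableName/pollIntervalMs as stand-ins for the real client API):
{code}
import java.io.IOException;

class ClientWaitSketch {
  interface Admin {  // stand-in for the admin API used by the client
    boolean tableExists(String tableName) throws IOException;
  }

  // Hypothetical client-side wait: completion is inferred from the absence
  // of the table descriptor. If the handler removes the descriptor before
  // its actual last step, this loop exits while cleanup is still running.
  static void waitForDelete(Admin admin, String tableName, long pollIntervalMs)
      throws IOException, InterruptedException {
    while (admin.tableExists(tableName)) {
      Thread.sleep(pollIntervalMs);
    }
  }
}
{code}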
...and so on with other implementation details.

I've pointed to the new master design so that this set of "rules" can be 
discussed and become part of the design.
We must be able to know when an operation ends, rather than guessing based on 
the result state of the operation. This is a must on both the server side 
(e.g. releasing the lock) and the client side (e.g. making the operation 
synchronous).
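
Just to illustrate the direction (a sketch, not a concrete proposal): if every operation reported its own completion, both sides could key off that signal instead of inferring it.
{code}
import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;

// Illustrative only: an explicit per-operation completion signal.
class OperationTracker {
  interface MasterOperation {
    long getOperationId();
    void execute() throws IOException;  // must cover the *whole* operation
  }

  interface TableLock { void release(); }  // hypothetical

  private final ConcurrentMap<Long, CountDownLatch> pending =
      new ConcurrentHashMap<Long, CountDownLatch>();

  void run(MasterOperation op, TableLock lock) throws IOException {
    CountDownLatch done = new CountDownLatch(1);
    pending.put(op.getOperationId(), done);
    try {
      op.execute();
    } finally {
      lock.release();    // server side: release only at the real end
      done.countDown();  // client side: wait on this instead of guessing
    }
  }

  // What an "is my operation done?" check could consult.
  boolean isDone(long opId) {
    CountDownLatch done = pending.get(opId);
    return done == null || done.getCount() == 0;
  }
}
{code}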

> Alter table conflicts with concurrent snapshot attempt on that table
> --------------------------------------------------------------------
>
>                 Key: HBASE-10136
>                 URL: https://issues.apache.org/jira/browse/HBASE-10136
>             Project: HBase
>          Issue Type: Bug
>          Components: snapshots
>    Affects Versions: 0.96.0, 0.98.1, 0.99.0
>            Reporter: Aleksandr Shulman
>            Assignee: Matteo Bertozzi
>              Labels: online_schema_change
>
> Expected behavior:
> A user can issue a request for a snapshot of a table while that table is 
> undergoing an online schema change and expect that snapshot request to 
> complete correctly. The same is true if a user issues an online schema 
> change request while a snapshot attempt is ongoing.
> Observed behavior:
> Snapshot attempts time out when there is an ongoing online schema change, 
> because regions are closed and reopened during the snapshot. 
> As a side note, I would expect the attempt to fail quickly rather than 
> time out. 
> Further, I have seen subsequent attempts to snapshot the table fail 
> because of state/cleanup issues. This is also concerning.
> Immediate error:
> {code}type=FLUSH }' is still in progress!
> 2013-12-11 15:58:32,883 DEBUG [Thread-385] client.HBaseAdmin(2696): (#11) Sleeping: 10000ms while waiting for snapshot completion.
> 2013-12-11 15:58:42,884 DEBUG [Thread-385] client.HBaseAdmin(2704): Getting current status of snapshot from master...
> 2013-12-11 15:58:42,887 DEBUG [FifoRpcScheduler.handler1-thread-3] master.HMaster(2891): Checking to see if snapshot from request:{ ss=snapshot0 table=changeSchemaDuringSnapshot1386806258640 type=FLUSH } is done
> 2013-12-11 15:58:42,887 DEBUG [FifoRpcScheduler.handler1-thread-3] snapshot.SnapshotManager(374): Snapshoting '{ ss=snapshot0 table=changeSchemaDuringSnapshot1386806258640 type=FLUSH }' is still in progress!
> Snapshot failure occurred
> org.apache.hadoop.hbase.snapshot.SnapshotCreationException: Snapshot 'snapshot0' wasn't completed in expectedTime:60000 ms
>       at org.apache.hadoop.hbase.client.HBaseAdmin.snapshot(HBaseAdmin.java:2713)
>       at org.apache.hadoop.hbase.client.HBaseAdmin.snapshot(HBaseAdmin.java:2638)
>       at org.apache.hadoop.hbase.client.HBaseAdmin.snapshot(HBaseAdmin.java:2602)
>       at org.apache.hadoop.hbase.client.TestAdmin$BackgroundSnapshotThread.run(TestAdmin.java:1974){code}
> Likely root cause of error:
> {code}Exception in SnapshotSubprocedurePool
> java.util.concurrent.ExecutionException: org.apache.hadoop.hbase.NotServingRegionException: changeSchemaDuringSnapshot1386806258640,77777777,1386806258720.ea776db51749e39c956d771a7d17a0f3. is closing
>       at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
>       at java.util.concurrent.FutureTask.get(FutureTask.java:83)
>       at org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager$SnapshotSubprocedurePool.waitForOutstandingTasks(RegionServerSnapshotManager.java:314)
>       at org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure.flushSnapshot(FlushSnapshotSubprocedure.java:118)
>       at org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure.insideBarrier(FlushSnapshotSubprocedure.java:137)
>       at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:181)
>       at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:1)
>       at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>       at java.lang.Thread.run(Thread.java:662)
> Caused by: org.apache.hadoop.hbase.NotServingRegionException: changeSchemaDuringSnapshot1386806258640,77777777,1386806258720.ea776db51749e39c956d771a7d17a0f3. is closing
>       at org.apache.hadoop.hbase.regionserver.HRegion.startRegionOperation(HRegion.java:5327)
>       at org.apache.hadoop.hbase.regionserver.HRegion.startRegionOperation(HRegion.java:5289)
>       at org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure$RegionSnapshotTask.call(FlushSnapshotSubprocedure.java:79)
>       at org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure$RegionSnapshotTask.call(FlushSnapshotSubprocedure.java:1)
>       at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>       ... 5 more{code}



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
