[ 
https://issues.apache.org/jira/browse/HBASE-29109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prathyusha updated HBASE-29109:
-------------------------------
    Description: 
{{SnapshotManager does *prepare* and *process* as below}}
{quote}{{{}{color:#6a3e3e}handler{color}{color:#000000}.prepare();{color}{}}}{color:#7f0055}this{color}{color:#000000}.{color}{color:#0000c0}executorService{color}{color:#000000}.submit({color}{color:#6a3e3e}handler{color}{color:#000000});
 (process called here){color}
{quote}
For Enabled tables prepare() takes an EXCLUSIVE lock with a timeout 
and right when it starts processing it releases the current EXCLUSIVE lock on 
the table and takes a SHARED one instead
{quote}{color:#000000} {color}{color:#7f0055}try{color}{color:#000000} {{color}

{color:#000000} {color}{color:#7f0055}if{color}{color:#000000} 
(downgradeToSharedTableLock()) {{color}

{color:#000000} {color}{color:#3f7f5f}// release the exclusive lock and hold 
the shared lock instead{color}

{color:#000000} {color}{color:#6a3e3e}tableLockToRelease{color}{color:#000000} 
= 
{color}{color:#0000c0}master{color}{color:#000000}.getLockManager().createMasterLock({color}{color:#0000c0}snapshotTable{color}{color:#000000},{color}

{color:#000000} LockType.{color}{color:#0000c0}SHARED{color}{color:#000000}, 
{color}{color:#7f0055}this{color}{color:#000000}.getClass().getName() + 
{color}{color:#2a00ff}": take snapshot "{color}{color:#000000} + 
{color}{color:#0000c0}snapshot{color}{color:#000000}.getName());{color}

{color:#000000} 
{color}{color:#0000c0}tableLock{color}{color:#000000}.release();{color}

{color:#000000} {color}{color:#7f0055}boolean{color}{color:#000000} 
{color}{color:#6a3e3e}isTableLockAcquired{color}{color:#000000} = 
{color}{color:#6a3e3e}tableLockToRelease{color}{color:#000000}.tryAcquire({color}{color:#7f0055}this{color}{color:#000000}.{color}{color:#0000c0}lockAcquireTimeoutMs{color}{color:#000000});{color}

{color:#000000} {color}{color:#7f0055}if{color}{color:#000000} 
(!{color}{color:#6a3e3e}isTableLockAcquired{color}{color:#000000}) {{color}

{color:#000000} 
{color}{color:#0000c0}LOG{color}{color:#000000}.error({color}{color:#2a00ff}"Could
 not acquire shared lock on table {} in {} ms"{color}{color:#000000}, 
{color}{color:#0000c0}snapshotTable{color}{color:#000000},{color}

{color:#000000} 
{color}{color:#0000c0}lockAcquireTimeoutMs{color}{color:#000000});{color}

{color:#000000} {color}{color:#7f0055}throw{color}{color:#000000} 
{color}{color:#7f0055}new{color}{color:#000000} 
IOException({color}{color:#2a00ff}"Could not acquire shared lock on table 
"{color}{color:#000000} + 
{color}{color:#0000c0}snapshotTable{color}{color:#000000});{color}

{color:#000000} }{color}

{color:#000000} }{color}
{quote}
Since the process stage runs in ExecutorService sometimes there is a delay

and Snapshot hold the EXCLUSIVE too long, blocking the ASSIGN child procedures 
of SCP

Below are the logs we saw for the same in our production

Process stage of Snapshot started
{quote}2025-01-15 02:15:09,608 INFO  
[ER_SNAPSHOT_OPERATIONS-master/hmaster-3:60000-5360] 
snapshot.TakeSnapshotHandler - Running SKIPFLUSH table snapshot 
<TABLE_NAME>_1736902854385_1736906318965_0 C_M_SNAPSHOT_TABLE on table 
<TABLE_NAME>
{quote}
then the EXCLUSIVE lock is released 
{quote}2025-01-15 02:15:09,615 INFO  [PEWorker-20] procedure2.ProcedureExecutor 
- Finished pid=26291867, state=SUCCESS; 
org.apache.hadoop.hbase.master.locking.LockProcedure, tableName=<TABLE_NAME>, 
type=EXCLUSIVE in 9 mins, 55.422 sec
{quote}
ASSIGN of SCP was able to then get the lock 
{quote}2025-01-15 02:15:09,615 INFO  [PEWorker-20] 
procedure.MasterProcedureScheduler - Took xlock for pid=26292238, 
ppid=26291869, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
TransitRegionStateProcedure table=<TABLE_NAME>, 
region=e9d104da353c65725b33f902c8a1449d, ASSIGN
{quote}
Though once lock is acquired ASSIGN was pretty quick, total time of that region 
availability shoot up to 9 mins
{quote}2025-01-15 02:15:10,384 INFO  [PEWorker-24] procedure2.ProcedureExecutor 
- Finished pid=26292238, ppid=26291869, state=SUCCESS; 
TransitRegionStateProcedure table=<TABLE_NAME>, 
region=e9d104da353c65725b33f902c8a1449d, ASSIGN in 9 mins, 1.364 sec
{quote}

  was:
{{SnapshotManager does *prepare* and *process* as below}}
{quote}{{{}{color:#6a3e3e}handler{color}{color:#000000}.prepare();{color}{}}}{color:#7f0055}this{color}{color:#000000}.{color}{color:#0000c0}executorService{color}{color:#000000}.submit({color}{color:#6a3e3e}handler{color}{color:#000000});
 (process called here){color}
{quote}
For Enabled tables prepare() takes an EXCLUSIVE lock with a timeout 
and right when it starts processing it releases the current EXCLUSIVE lock on 
the table and takes a SHARED one instead
{quote}{color:#000000} {color}{color:#7f0055}try{color}{color:#000000} {{color}

{color:#000000} {color}{color:#7f0055}if{color}{color:#000000} 
(downgradeToSharedTableLock()) {{color}

{color:#000000} {color}{color:#3f7f5f}// release the exclusive lock and hold 
the shared lock instead{color}

{color:#000000} {color}{color:#6a3e3e}tableLockToRelease{color}{color:#000000} 
= 
{color}{color:#0000c0}master{color}{color:#000000}.getLockManager().createMasterLock({color}{color:#0000c0}snapshotTable{color}{color:#000000},{color}

{color:#000000} LockType.{color}{color:#0000c0}SHARED{color}{color:#000000}, 
{color}{color:#7f0055}this{color}{color:#000000}.getClass().getName() + 
{color}{color:#2a00ff}": take snapshot "{color}{color:#000000} + 
{color}{color:#0000c0}snapshot{color}{color:#000000}.getName());{color}

{color:#000000} 
{color}{color:#0000c0}tableLock{color}{color:#000000}.release();{color}

{color:#000000} {color}{color:#7f0055}boolean{color}{color:#000000} 
{color}{color:#6a3e3e}isTableLockAcquired{color}{color:#000000} = 
{color}{color:#6a3e3e}tableLockToRelease{color}{color:#000000}.tryAcquire({color}{color:#7f0055}this{color}{color:#000000}.{color}{color:#0000c0}lockAcquireTimeoutMs{color}{color:#000000});{color}

{color:#000000} {color}{color:#7f0055}if{color}{color:#000000} 
(!{color}{color:#6a3e3e}isTableLockAcquired{color}{color:#000000}) {{color}

{color:#000000} 
{color}{color:#0000c0}LOG{color}{color:#000000}.error({color}{color:#2a00ff}"Could
 not acquire shared lock on table {} in {} ms"{color}{color:#000000}, 
{color}{color:#0000c0}snapshotTable{color}{color:#000000},{color}

{color:#000000} 
{color}{color:#0000c0}lockAcquireTimeoutMs{color}{color:#000000});{color}

{color:#000000} {color}{color:#7f0055}throw{color}{color:#000000} 
{color}{color:#7f0055}new{color}{color:#000000} 
IOException({color}{color:#2a00ff}"Could not acquire shared lock on table 
"{color}{color:#000000} + 
{color}{color:#0000c0}snapshotTable{color}{color:#000000});{color}

{color:#000000} }{color}

{color:#000000} }{color}
{quote}
Since the process state runs in ExecutorService sometimes there is a delay

and Snapshot hold the EXCLUSIVE too long, blocking the ASSIGN child procedures 
of SCP

Below are the logs we saw for the same in our production

Process stage of Snapshot started
{quote}2025-01-15 02:15:09,608 INFO  
[ER_SNAPSHOT_OPERATIONS-master/hmaster-3:60000-5360] 
snapshot.TakeSnapshotHandler - Running SKIPFLUSH table snapshot 
<TABLE_NAME>_1736902854385_1736906318965_0 C_M_SNAPSHOT_TABLE on table 
<TABLE_NAME>
{quote}
then the EXCLUSIVE lock is released 
{quote}2025-01-15 02:15:09,615 INFO  [PEWorker-20] procedure2.ProcedureExecutor 
- Finished pid=26291867, state=SUCCESS; 
org.apache.hadoop.hbase.master.locking.LockProcedure, tableName=<TABLE_NAME>, 
type=EXCLUSIVE in 9 mins, 55.422 sec
{quote}
ASSIGN of SCP was able to then get the lock 
{quote}2025-01-15 02:15:09,615 INFO  [PEWorker-20] 
procedure.MasterProcedureScheduler - Took xlock for pid=26292238, 
ppid=26291869, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
TransitRegionStateProcedure table=<TABLE_NAME>, 
region=e9d104da353c65725b33f902c8a1449d, ASSIGN
{quote}
Though once lock is acquired ASSIGN was pretty quick, total time of that region 
availability shoot up to 9 mins
{quote}2025-01-15 02:15:10,384 INFO  [PEWorker-24] procedure2.ProcedureExecutor 
- Finished pid=26292238, ppid=26291869, state=SUCCESS; 
TransitRegionStateProcedure table=<TABLE_NAME>, 
region=e9d104da353c65725b33f902c8a1449d, ASSIGN in 9 mins, 1.364 sec
{quote}


> Process stage of TakeSnapshotHandler runs Async which is holding the 
> EXCLUSIVE Lock of Table for around 9 mins blocking ASSIGN procedures of SCP
> ------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-29109
>                 URL: https://issues.apache.org/jira/browse/HBASE-29109
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 2.5.8
>            Reporter: Prathyusha
>            Assignee: Prathyusha
>            Priority: Major
>
> {{SnapshotManager does *prepare* and *process* as below}}
> {quote}{{{}{color:#6a3e3e}handler{color}{color:#000000}.prepare();{color}{}}}{color:#7f0055}this{color}{color:#000000}.{color}{color:#0000c0}executorService{color}{color:#000000}.submit({color}{color:#6a3e3e}handler{color}{color:#000000});
>  (process called here){color}
> {quote}
> For Enabled tables prepare() takes an EXCLUSIVE lock with a timeout 
> and right when it starts processing it releases the current EXCLUSIVE lock on 
> the table and takes a SHARED one instead
> {quote}{color:#000000} {color}{color:#7f0055}try{color}{color:#000000} 
> {{color}
> {color:#000000} {color}{color:#7f0055}if{color}{color:#000000} 
> (downgradeToSharedTableLock()) {{color}
> {color:#000000} {color}{color:#3f7f5f}// release the exclusive lock and hold 
> the shared lock instead{color}
> {color:#000000} 
> {color}{color:#6a3e3e}tableLockToRelease{color}{color:#000000} = 
> {color}{color:#0000c0}master{color}{color:#000000}.getLockManager().createMasterLock({color}{color:#0000c0}snapshotTable{color}{color:#000000},{color}
> {color:#000000} LockType.{color}{color:#0000c0}SHARED{color}{color:#000000}, 
> {color}{color:#7f0055}this{color}{color:#000000}.getClass().getName() + 
> {color}{color:#2a00ff}": take snapshot "{color}{color:#000000} + 
> {color}{color:#0000c0}snapshot{color}{color:#000000}.getName());{color}
> {color:#000000} 
> {color}{color:#0000c0}tableLock{color}{color:#000000}.release();{color}
> {color:#000000} {color}{color:#7f0055}boolean{color}{color:#000000} 
> {color}{color:#6a3e3e}isTableLockAcquired{color}{color:#000000} = 
> {color}{color:#6a3e3e}tableLockToRelease{color}{color:#000000}.tryAcquire({color}{color:#7f0055}this{color}{color:#000000}.{color}{color:#0000c0}lockAcquireTimeoutMs{color}{color:#000000});{color}
> {color:#000000} {color}{color:#7f0055}if{color}{color:#000000} 
> (!{color}{color:#6a3e3e}isTableLockAcquired{color}{color:#000000}) {{color}
> {color:#000000} 
> {color}{color:#0000c0}LOG{color}{color:#000000}.error({color}{color:#2a00ff}"Could
>  not acquire shared lock on table {} in {} ms"{color}{color:#000000}, 
> {color}{color:#0000c0}snapshotTable{color}{color:#000000},{color}
> {color:#000000} 
> {color}{color:#0000c0}lockAcquireTimeoutMs{color}{color:#000000});{color}
> {color:#000000} {color}{color:#7f0055}throw{color}{color:#000000} 
> {color}{color:#7f0055}new{color}{color:#000000} 
> IOException({color}{color:#2a00ff}"Could not acquire shared lock on table 
> "{color}{color:#000000} + 
> {color}{color:#0000c0}snapshotTable{color}{color:#000000});{color}
> {color:#000000} }{color}
> {color:#000000} }{color}
> {quote}
> Since the process stage runs in ExecutorService sometimes there is a delay
> and Snapshot hold the EXCLUSIVE too long, blocking the ASSIGN child 
> procedures of SCP
> Below are the logs we saw for the same in our production
> Process stage of Snapshot started
> {quote}2025-01-15 02:15:09,608 INFO  
> [ER_SNAPSHOT_OPERATIONS-master/hmaster-3:60000-5360] 
> snapshot.TakeSnapshotHandler - Running SKIPFLUSH table snapshot 
> <TABLE_NAME>_1736902854385_1736906318965_0 C_M_SNAPSHOT_TABLE on table 
> <TABLE_NAME>
> {quote}
> then the EXCLUSIVE lock is released 
> {quote}2025-01-15 02:15:09,615 INFO  [PEWorker-20] 
> procedure2.ProcedureExecutor - Finished pid=26291867, state=SUCCESS; 
> org.apache.hadoop.hbase.master.locking.LockProcedure, tableName=<TABLE_NAME>, 
> type=EXCLUSIVE in 9 mins, 55.422 sec
> {quote}
> ASSIGN of SCP was able to then get the lock 
> {quote}2025-01-15 02:15:09,615 INFO  [PEWorker-20] 
> procedure.MasterProcedureScheduler - Took xlock for pid=26292238, 
> ppid=26291869, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
> TransitRegionStateProcedure table=<TABLE_NAME>, 
> region=e9d104da353c65725b33f902c8a1449d, ASSIGN
> {quote}
> Though once lock is acquired ASSIGN was pretty quick, total time of that 
> region availability shoot up to 9 mins
> {quote}2025-01-15 02:15:10,384 INFO  [PEWorker-24] 
> procedure2.ProcedureExecutor - Finished pid=26292238, ppid=26291869, 
> state=SUCCESS; TransitRegionStateProcedure table=<TABLE_NAME>, 
> region=e9d104da353c65725b33f902c8a1449d, ASSIGN in 9 mins, 1.364 sec
> {quote}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to